4. Set Default Cleaning Operations
A critical step in any data science project is ensuring the data is clean and prepared for feature engineering.
To maintain data quality during feature engineering in FeatureByte, you can centralize Cleaning Operations at the table level. This approach allows you to address common issues such as string-based datetime columns, missing values, disguised missing values (placeholder values that stand in for missing data without being labeled as missing), and outliers.
We will perform cleaning operations on the NEW_APPLICATION table, including:
- Defining the schema for the BIRTHDATE column stored as a string-based datetime
- Ignoring disguised missing values in DAYS_EMPLOYED
- Handling outliers in AMT_REQ_CREDIT_BUREAU_QRT
Approval Flow
In Catalogs with Approval Flow enabled, changes in table metadata, such as cleaning operations, initiate a review process. To see this in action, check out the Grocery Dataset UI Tutorials. This process helps recommend new versions of features and lists linked to these tables, ensuring that models and deployments always use versions that account for data updates and potential issues.
Step 1: Locate Columns to Be Cleaned
- From the menu, go to the 'Explore' section and access the Table Catalog.
- From the Table Catalog, select the NEW_APPLICATION table.
- Navigate to the 'Describe' tab, and click [icon] to run a descriptive analysis.
- Filter the columns by clicking [icon].
Step 2: Define the schema for the BIRTHDATE column
- Go to the 'Columns' tab and search for 'BIRTHDATE'.
- Click the critical data info edit button.
- Add its Timestamp Schema:
  - is recorded in: UTC
  - time format string: "YYYY-MM-DD"
  - timezone to convert to local time: "America/Los_Angeles"
- Click [icon] to save the cleaning operation.
Note
Changing the schema will reset the column's existing semantic tag.
If you are using Databricks and specifying the schema for a timestamp column, keep in mind that FeatureByte retrieves timestamps exactly as they are stored, without adjusting for your Databricks cluster's time zone settings.
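To make the Timestamp Schema settings above concrete, here is a minimal sketch in plain Python of what they describe: a string is parsed with a date format, treated as recorded in UTC, and can then be converted to a local time zone. This is not FeatureByte code. The mapping of "YYYY-MM-DD" to Python's `%Y-%m-%d`, the helper names, and the sample value are assumptions made for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the format string "YYYY-MM-DD" corresponds to "%Y-%m-%d" in Python.
PYTHON_FORMAT = "%Y-%m-%d"
LOCAL_TZ_NAME = "America/Los_Angeles"


def parse_as_utc(text: str) -> datetime:
    """Parse a string-based date and mark it as recorded in UTC."""
    return datetime.strptime(text, PYTHON_FORMAT).replace(tzinfo=timezone.utc)


def to_local(moment: datetime, tz_name: str = LOCAL_TZ_NAME) -> datetime:
    """Convert a timezone-aware datetime to the given local time zone."""
    return moment.astimezone(ZoneInfo(tz_name))


# Hypothetical BIRTHDATE value, made up for illustration.
birthdate_utc = parse_as_utc("1985-01-15")
print(birthdate_utc.isoformat())  # 1985-01-15T00:00:00+00:00
```

Calling `to_local(birthdate_utc)` shifts the value to Los Angeles time. Because Los Angeles is behind UTC, midnight UTC falls on the previous evening locally, so the local calendar date can differ from the stored string.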
Step 3: Ignore Disguised Missing Values in DAYS_EMPLOYED
- Go to the 'Columns' tab and search for 'DAYS_EMPLOYED'.
- Click the critical data info edit button.
- Ignore disguised missing values equal to 365243.
- Click [icon] to save the cleaning operation.
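The effect of ignoring a disguised missing value can be sketched in plain Python as follows. This is an illustration of the idea, not FeatureByte's implementation. The helper names and the sample DAYS_EMPLOYED values are hypothetical, and the sketch assumes that "ignore" means the value is treated as missing.

```python
# Sentinel from the step above; treated as missing rather than as a real number of days.
DISGUISED_VALUES = frozenset({365243})


def ignore_disguised(values, disguised=DISGUISED_VALUES):
    """Replace disguised missing values with None; leave everything else unchanged."""
    return [None if v in disguised else v for v in values]


def mean_of_present(values):
    """Average the non-missing values, or return None if there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical DAYS_EMPLOYED values, made up for illustration.
raw = [-637, 365243, -1188, None, 365243]
cleaned = ignore_disguised(raw)

print(cleaned)                   # [-637, None, -1188, None, None]
print(mean_of_present(raw))      # 182165.25
print(mean_of_present(cleaned))  # -912.5
```

The two averages show why the step matters: left in place, the sentinel dominates any aggregate computed from the column.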
Step 4: Handle Outliers in AMT_REQ_CREDIT_BUREAU_QRT
- Search for 'AMT_REQ_CREDIT_BUREAU_QRT'.
- Click the critical data info edit button.
- Cap values greater than 20.
- Click [icon] to save the cleaning operation.
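As a rough illustration of the capping rule above, the following plain-Python sketch replaces every value greater than 20 with 20 and leaves missing values untouched. It assumes that "cap" means replacing the outlier with the threshold itself. The helper name and sample values are hypothetical, and this is not FeatureByte code.

```python
def cap_above(values, upper=20):
    """Replace values greater than `upper` with `upper`; keep None as missing."""
    return [v if v is None else min(v, upper) for v in values]


# Hypothetical AMT_REQ_CREDIT_BUREAU_QRT values, made up for illustration.
raw = [0, 1, 20, 21, 261, None]
print(cap_above(raw))  # [0, 1, 20, 20, 20, None]
```

Values at or below the threshold pass through unchanged, so only the extreme tail of the column is affected.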
Step 5: Review Cleaning Operations for NEW_APPLICATION table
The newly applied cleaning steps for the BIRTHDATE, DAYS_EMPLOYED, and AMT_REQ_CREDIT_BUREAU_QRT columns should now be visible in the 'Columns' tab of the NEW_APPLICATION table.