4. Set Default Cleaning Operations

A critical step in any data science project is ensuring the data is clean and prepared for feature engineering.

To maintain data quality during feature engineering in FeatureByte, you can centralize Cleaning Operations at the table level. This lets you address common issues in one place: string-based datetime columns, missing values, disguised missing values (placeholder values that are not explicitly labeled as missing), and outliers.

We will define cleaning operations on the NEW_APPLICATION and CLIENT_PROFILE tables:

- Ignoring disguised missing values in DAYS_EMPLOYED (NEW_APPLICATION table)
- Defining the schema for the BIRTHDATE column (CLIENT_PROFILE table), which is stored as a string-based datetime

Approval Flow

In Catalogs with Approval Flow enabled, changes to table metadata, such as cleaning operations, initiate a review process. This process helps recommend new versions of the features and lists linked to these tables, so that models and deployments always use versions that account for data updates and potential issues. To see this in action, check out the Grocery Dataset UI Tutorials.

Step 1: Locate Columns to Be Cleaned

From the menu, go to the 'Explore' section and access the Table Catalog.
From the Table Catalog, select the NEW_APPLICATION table.
Navigate to the 'Describe' tab, and click to run a descriptive analysis.
Filter the columns by clicking .
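To make concrete what a descriptive analysis can reveal, here is a minimal plain-Python sketch. It is not FeatureByte's implementation, and `describe_column` and the sample values are hypothetical, introduced only for illustration. It summarizes a numeric column so that a suspicious placeholder stands out as both the maximum and the most frequent value.

```python
from collections import Counter
from statistics import median


def describe_column(values):
    """Summarize a numeric column: size, missing count, range, and top value."""
    present = [v for v in values if v is not None]
    top_value, top_count = Counter(present).most_common(1)[0]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median(present),
        "max": max(present),
        "top": top_value,
        "top_count": top_count,
    }


# Hypothetical sample of DAYS_EMPLOYED values (not taken from the real table).
days_employed = [-1200, -340, 365243, -2750, None, 365243, -90, 365243]

summary = describe_column(days_employed)
print(summary)
```

In this sample, a maximum far outside the range of the other values, repeated several times, is the kind of signal that points to a disguised missing value.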

Step 2: Ignore Disguised Missing Values in DAYS_EMPLOYED

Go to the 'Columns' tab of the NEW_APPLICATION table and search for 'DAYS_EMPLOYED'.
Click the critical data info edit button for the column.
Ignore disguised missing values equal to 365243.
Click to save the cleaning operation.
Review the cleaning operations for the NEW_APPLICATION table. The newly applied cleaning step for the DAYS_EMPLOYED column should now be visible in the 'Columns' tab of the table.
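The effect of ignoring a disguised missing value can be sketched in plain Python. This is an illustration of the idea only, not the code FeatureByte runs; `ignore_disguised`, `mean_ignoring_missing`, and the sample values are hypothetical. The placeholder 365243 is replaced with `None` so that downstream aggregations treat it as missing.

```python
# Value configured in Step 2 as a disguised missing value.
DISGUISED_VALUES = {365243}


def ignore_disguised(values, disguised=DISGUISED_VALUES):
    """Replace disguised missing values with None so they count as missing."""
    return [None if v in disguised else v for v in values]


def mean_ignoring_missing(values):
    """Average the non-missing values; return None if nothing is left."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical DAYS_EMPLOYED sample (not taken from the real table).
raw = [-1200, -340, 365243, -2750]
cleaned = ignore_disguised(raw)

print(mean_ignoring_missing(raw))      # skewed by the placeholder value
print(mean_ignoring_missing(cleaned))  # computed from genuine values only
```

Without the cleaning step, a single placeholder dominates the average, which is why the operation is set once at the table level rather than repeated in every feature.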

Step 3: Define the Schema for the BIRTHDATE Column

Go to the 'Columns' tab of the CLIENT_PROFILE table and search for 'BIRTHDATE'.
Click the critical data info edit button for the column.
Add its Timestamp Schema:
- is recorded in: Local time
- time format string: "YYYY-MM-DD"
- timezone to convert to local time: "America/Los_Angeles".
Click to save the cleaning operation.
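What this schema expresses can be sketched with Python's standard library. The sketch rests on two assumptions: that the format string "YYYY-MM-DD" corresponds to the `strptime` pattern `%Y-%m-%d`, and that the values are local times in the configured timezone. `parse_birthdate` and the sample date are hypothetical; this is not how FeatureByte parses the column internally.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumed Python equivalent of the "YYYY-MM-DD" format string.
STRPTIME_FORMAT = "%Y-%m-%d"
# Timezone configured in the Timestamp Schema.
LOCAL_TZ = ZoneInfo("America/Los_Angeles")


def parse_birthdate(text):
    """Parse a 'YYYY-MM-DD' string as local midnight in America/Los_Angeles."""
    naive = datetime.strptime(text, STRPTIME_FORMAT)
    return naive.replace(tzinfo=LOCAL_TZ)


# Hypothetical BIRTHDATE value (not taken from the real table).
birth = parse_birthdate("1985-07-04")

print(birth.isoformat())                           # local time with UTC offset
print(birth.astimezone(timezone.utc).isoformat())  # same instant in UTC
```

A string that does not match the format raises `ValueError` here, which shows why the format and timezone need to be declared before the column can be used as a datetime.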