4. Set Default Cleaning Operations

Ensuring clean, well-prepared data is a critical step in any data science project, especially when building reliable features for machine learning.

In FeatureByte, you can maintain high data quality by centralizing Cleaning Operations at the table level.

Centralizing these operations allows you to easily address issues such as:

- String-encoded datetime fields
- Missing or disguised missing values
- Outliers and other data inconsistencies

In this tutorial, you will:

- Run table-level EDA on the NEW_APPLICATION and CLIENT_PROFILE tables to identify data quality issues.
- Apply appropriate cleaning operations, including:
  - Ignoring disguised missing values in the DAYS_EMPLOYED column
  - Defining the timestamp schema for the string-based BIRTHDATE column

Approval Flow

In Catalogs with Approval Flow enabled, updates to table metadata, such as cleaning operations, trigger a review process. To see an example of this workflow, refer to the Manage Feature Life Cycle Tutorials. This process ensures that model features and deployments always use versions that incorporate the latest data updates and validated fixes.

Optimizing EDA Performance

For large tables, you can generate the EDA from a Development Dataset to significantly speed up the analysis. This workflow is illustrated in the Create Development Dataset section.

Step 1: Identify Columns Requiring Cleaning in NEW_APPLICATION

Go to 'Explore' → Table Catalog.
Select the NEW_APPLICATION table and open the EDA tab.
Create a new analysis. Keep the default settings and confirm.
Review the results.
Use the Issues filter to highlight columns that may require cleaning. Open the full EDA summary for details.
Collapse the analysis row to view the statistics and plot.
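
To make concrete what kind of signal the Issues filter points at, here is a minimal sketch in plain Python. It is not FeatureByte's EDA implementation; the sample values and the `summarize_column` helper are hypothetical. The idea is that a sentinel such as 365243 stands out both as an implausible maximum and as the most frequent value.

```python
from collections import Counter


def summarize_column(values):
    """Return basic statistics for a numeric column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    top_value, top_count = Counter(present).most_common(1)[0]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "top_value": top_value,
        "top_count": top_count,
    }


# Hypothetical DAYS_EMPLOYED sample: negative day counts plus a repeated sentinel.
days_employed = [-637, -1188, 365243, -225, None, 365243, -3039]
summary = summarize_column(days_employed)
print(summary)
```

A maximum of 365243 days (roughly 1,000 years) next to otherwise negative values is the kind of pattern that suggests a disguised missing value rather than a real measurement.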

Step 2: Ignore Disguised Missing Values in `DAYS_EMPLOYED`

Click the Critical Data Info edit button.
Add a rule to ignore disguised missing values equal to 365243, then confirm.
Regenerate the results using the updated cleaning rule.
Collapse the analysis row and select the Cleaned filter to view EDA after the cleaning operation is applied.
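
The effect of this rule on summary statistics can be illustrated with a small sketch in plain Python. It only mimics the idea of treating 365243 as missing; it does not reproduce how FeatureByte applies the rule, and the sample values and helper names are hypothetical.

```python
from statistics import mean

# The disguised missing value named in the rule above.
DISGUISED_MISSING = {365243}


def ignore_disguised(values, disguised=DISGUISED_MISSING):
    """Replace disguised missing values with None so they count as missing."""
    return [None if v in disguised else v for v in values]


def mean_ignoring_missing(values):
    """Average only the entries that are actually present."""
    return mean(v for v in values if v is not None)


# Hypothetical DAYS_EMPLOYED sample.
raw = [-637, -1188, 365243, -225, 365243, -3039]
cleaned = ignore_disguised(raw)

raw_mean = mean(raw)                            # 120899.5, dominated by the sentinel
cleaned_mean = mean_ignoring_missing(cleaned)   # -1272.25
print(raw_mean, cleaned_mean)
```

Comparing the two means shows why the Cleaned view is the one to trust: two sentinel entries are enough to flip the sign of the average.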

Step 3: Identify Columns Requiring Cleaning in CLIENT_PROFILE

From the Table Catalog, select the CLIENT_PROFILE table and open the EDA tab.
Run EDA with the default settings and confirm. Review the resulting analysis.

Step 4: Define the Schema for the String-Based `BIRTHDATE` Column

Click the Critical Data Info edit button.
Add the Timestamp Schema:
- Recorded in: Local time
- Time format string: "YYYY-MM-DD"
- Timezone to convert to local time: "America/Los_Angeles".
Save the changes, then refresh the EDA.
Collapse the analysis row and select the Cleaned filter to view the post-cleaning analysis.
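
As a rough illustration of what a timestamp schema conveys, the sketch below parses a date string recorded in local time and attaches the configured timezone, using only the Python standard library. It is not FeatureByte's parsing logic. It assumes that the "YYYY-MM-DD" format string corresponds to `%Y-%m-%d` in Python's `strptime`, and the sample dates and helper names are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: "YYYY-MM-DD" maps to "%Y-%m-%d" in Python's strptime notation.
PYTHON_FORMAT = "%Y-%m-%d"
LOCAL_TZ = ZoneInfo("America/Los_Angeles")


def parse_birthdate(text):
    """Parse a date string recorded in local time into a timezone-aware datetime."""
    naive = datetime.strptime(text, PYTHON_FORMAT)
    return naive.replace(tzinfo=LOCAL_TZ)


def to_utc(local_dt):
    """Convert a timezone-aware local datetime to UTC."""
    return local_dt.astimezone(timezone.utc)


# Hypothetical BIRTHDATE values, one in standard time and one in daylight time.
winter = parse_birthdate("1990-01-15")
summer = parse_birthdate("1990-07-15")
print(winter.isoformat(), "->", to_utc(winter).isoformat())
print(summer.isoformat(), "->", to_utc(summer).isoformat())
```

The same local midnight maps to 08:00 UTC in January and 07:00 UTC in July, which is why the schema needs a named timezone rather than a fixed offset.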