4. Set Default Cleaning Operations
Ensuring clean, well-prepared data is a critical step in any data science project, especially when building reliable features for machine learning.
In FeatureByte, you can maintain high data quality by centralizing Cleaning Operations at the table level.
Centralizing these operations allows you to easily address issues such as:
- String-encoded datetime fields
- Missing or disguised missing values
- Outliers and other data inconsistencies
In this tutorial, you will:
- Run table-level EDA on the NEW_APPLICATION and CLIENT_PROFILE tables to identify data quality issues.
- Apply appropriate cleaning operations, including:
  - Ignoring disguised missing values in the DAYS_EMPLOYED column
  - Defining the timestamp schema for the string-based BIRTHDATE column
Approval Flow
In Catalogs with Approval Flow enabled, updates to table metadata—such as cleaning operations—trigger a review process. To see an example of this workflow, refer to the Manage Feature Life Cycle Tutorials. This process ensures that model features and deployments always use versions that incorporate the latest data updates and validated fixes.
Optimizing EDA Performance
For large tables, you can generate the EDA from a Development Dataset to significantly speed up the analysis. This workflow is illustrated in the Create Development Dataset section.
Step 1: Identify Columns Requiring Cleaning in NEW_APPLICATION
- Go to 'Explore' → Table Catalog.
- Select the NEW_APPLICATION table and open the EDA tab.
- Create a new analysis, keep the default settings, and run it.
- Review the results.
- Use the Issues filter to highlight columns that may require cleaning, and open the full EDA summary to read it.
- Collapse the analysis row to view the statistics and plot.
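To get a feel for the kind of signal the Issues filter surfaces, the sketch below profiles a single numeric column in plain Python. It is only an illustration and not how FeatureByte computes its EDA: the `profile_column` helper and the sample values are hypothetical, chosen so that a suspicious sentinel value stands out.

```python
from collections import Counter
from statistics import median


def profile_column(values):
    """Summarize a numeric column: missing count, range, median, most frequent value."""
    present = [v for v in values if v is not None]
    most_common_value, most_common_count = Counter(present).most_common(1)[0]
    return {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": median(present),
        "most_common": most_common_value,
        "most_common_share": most_common_count / len(present),
    }


# Hypothetical sample values (not real rows from NEW_APPLICATION)
days_employed = [-1200, -3400, 365243, -560, None, 365243, -2100, -800]
summary = profile_column(days_employed)
print(summary)
```

A maximum that sits far from the median and is also the most frequent value is a typical hint that the value is a placeholder rather than a real measurement.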

Step 2: Ignore Disguised Missing Values in DAYS_EMPLOYED
- For the DAYS_EMPLOYED column, click the Critical Data Info edit button.
- Add a rule to ignore disguised missing values equal to 365243, then save the rule.
- Regenerate the results using the updated cleaning rule.
- Collapse the analysis row and select the Cleaned filter to view the EDA after the cleaning operation is applied.
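Conceptually, ignoring a disguised missing value means treating it as missing before any statistic or feature is computed. The plain-Python sketch below shows that effect on hypothetical sample values. The helper names are made up for illustration, and this is not the FeatureByte implementation.

```python
from statistics import mean

# Value used in the data as a placeholder for "unknown"
DISGUISED_VALUES = {365243}


def ignore_disguised(values, disguised=DISGUISED_VALUES):
    """Replace disguised missing values with None so later statistics skip them."""
    return [None if v in disguised else v for v in values]


def mean_ignoring_missing(values):
    """Average the non-missing values, or return None if nothing is left."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Hypothetical sample values (not real rows from NEW_APPLICATION)
raw = [-1200, -3400, 365243, -560, 365243]
cleaned = ignore_disguised(raw)

print("mean before cleaning:", mean(raw))
print("mean after cleaning: ", mean_ignoring_missing(cleaned))
```

Comparing the two means shows how strongly a single placeholder value can distort a statistic, which is why the rule is defined once at the table level rather than in each feature.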
Step 3: Identify Columns Requiring Cleaning in CLIENT_PROFILE
- From the Table Catalog, select the CLIENT_PROFILE table and open the EDA tab.
- Run EDA with the default settings, then review the resulting analysis.
Step 4: Define the Schema for the String-Based BIRTHDATE Column
- For the BIRTHDATE column, click the Critical Data Info edit button.
- Add the Timestamp Schema:
  - Recorded in: Local time
  - Time format string: "YYYY-MM-DD"
  - Timezone to convert to local time: "America/Los_Angeles"
- Save the changes, then refresh the EDA.
- Collapse the analysis row and select the Cleaned filter to view the post-cleaning analysis.
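To illustrate what a timestamp schema makes possible, the sketch below parses string dates in the "YYYY-MM-DD" layout with Python's standard library. It assumes that layout maps to the `strptime` pattern `%Y-%m-%d`. The `parse_birthdate` helper and the sample strings are hypothetical, and timezone handling is deliberately left out.

```python
from datetime import date, datetime

# Assumed strptime equivalent of the "YYYY-MM-DD" format string
BIRTHDATE_FORMAT = "%Y-%m-%d"


def parse_birthdate(text):
    """Return a date for a well-formed string, or None if it cannot be parsed."""
    if text is None:
        return None
    try:
        return datetime.strptime(text, BIRTHDATE_FORMAT).date()
    except ValueError:
        return None


# Hypothetical sample values (not real rows from CLIENT_PROFILE)
raw = ["1985-03-14", "1990-12-01", "14/03/1985", None]
parsed = [parse_birthdate(s) for s in raw]
print(parsed)
```

Strings that do not match the declared format come back as missing instead of raising an error, which mirrors the idea that a declared schema lets downstream steps work with real dates rather than raw text.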