4. Set Default Cleaning Operations

Ensuring clean, well-prepared data is a critical step in any data science project, especially when building reliable features for machine learning.

In FeatureByte, you can maintain high data quality by centralizing Cleaning Operations at the table level.

Centralizing these operations allows you to easily address issues such as:

  • String-encoded datetime fields
  • Missing or disguised missing values
  • Outliers and other data inconsistencies

In this tutorial, you will:

  1. Run table-level EDA on the NEW_APPLICATION and CLIENT_PROFILE tables to identify data quality issues.
  2. Apply appropriate cleaning operations, including:
    • Ignoring disguised missing values in the DAYS_EMPLOYED column
    • Defining the timestamp schema for the string-based BIRTHDATE column

Approval Flow

In Catalogs with Approval Flow enabled, updates to table metadata—such as cleaning operations—trigger a review process. To see an example of this workflow, refer to the Manage Feature Life Cycle Tutorials. This process ensures that model features and deployments always use versions that incorporate the latest data updates and validated fixes.

Optimizing EDA Performance

For large tables, you can generate the EDA from a Development Dataset to significantly speed up the analysis. This workflow is illustrated in the Create Development Dataset section.


Step 1: Identify Columns Requiring Cleaning in NEW_APPLICATION

  1. Go to 'Explore' → Table Catalog.

    Table Catalog


  2. Select the NEW_APPLICATION table and open the EDA tab.

    Table Catalog


  3. Click the new EDA button to create a new analysis. Keep the default settings and click the create analysis button.

    EDA Settings


  4. Review the results.

    EDA Settings


  5. Use the Issues filter button to highlight columns that may require cleaning. Click show more to read the full EDA summary.

    Column Description


  6. Collapse the analysis row to view the statistics and plot.

    Column Description


Step 2: Ignore Disguised Missing Values in DAYS_EMPLOYED

  1. Click the Critical Data Info edit button.

  2. Add a rule to ignore disguised missing values equal to 365243, then click Apply 1 cleaning step.

    Column CDI


  3. Click refresh EDA to regenerate results using the updated cleaning rule.

    refresh EDA


  4. Collapse the analysis row and select the Cleaned filter button to view the EDA after the cleaning operation is applied.

    refreshed EDA
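
To see why this rule matters, here is a minimal sketch in plain Python of what ignoring a disguised missing value amounts to. It does not use the FeatureByte SDK; the function names and the sample values are hypothetical, and only the sentinel 365243 comes from the tutorial.

```python
from statistics import mean

# The sentinel the tutorial treats as a disguised missing value.
DISGUISED_VALUES = {365243}


def ignore_disguised(values, disguised=DISGUISED_VALUES):
    """Replace disguised missing values with None so they count as missing."""
    return [None if v in disguised else v for v in values]


def mean_ignoring_missing(values):
    """Average only the values that are present, skipping None."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Hypothetical DAYS_EMPLOYED sample, for illustration only.
raw = [-637, -1188, 365243, -225, 365243]
cleaned = ignore_disguised(raw)

raw_mean = mean(raw)                           # dominated by the sentinel
cleaned_mean = mean_ignoring_missing(cleaned)  # reflects real values only
```

In this sample the raw mean is a large positive number driven entirely by the sentinel, while the cleaned mean reflects only the genuine values. That is the kind of shift to look for when comparing the EDA before and after the rule is applied.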


Step 3: Identify Columns Requiring Cleaning in CLIENT_PROFILE

  1. From the Table Catalog, select the CLIENT_PROFILE table and open the EDA tab.

  2. Click the new EDA button to run EDA. Keep the default settings, then click the create analysis button. Review the resulting analysis.

    EDA Settings


Step 4: Define the Schema for the String-Based BIRTHDATE Column

  1. Click the Critical Data Info edit button.

  2. Add the Timestamp Schema:

    • Recorded in: Local time
    • Time format string: "YYYY-MM-DD"
    • Timezone to convert to local time: "America/Los_Angeles"

    cleaning BIRTHDATE


  3. Click Apply 1 cleaning step to save changes. Then click refresh EDA to regenerate the results.

    refresh BIRTHDATE


  4. Collapse the analysis row and select the Cleaned filter button to view the post-cleaning analysis.

    refreshed BIRTHDATE
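
As a rough illustration of what the timestamp schema above describes, the following sketch parses a string date with Python's standard library. It is not how FeatureByte applies the schema. The function name and sample date are hypothetical, and `%Y-%m-%d` is assumed to be the `strptime` equivalent of the "YYYY-MM-DD" format string.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumed strptime equivalent of the "YYYY-MM-DD" format string.
PYTHON_FORMAT = "%Y-%m-%d"
# Timezone in which the local-time values are recorded.
LOCAL_TZ = ZoneInfo("America/Los_Angeles")


def parse_birthdate(text):
    """Parse a 'YYYY-MM-DD' string as local time; return None if it does not match."""
    try:
        naive = datetime.strptime(text, PYTHON_FORMAT)
    except (TypeError, ValueError):
        return None
    # Attach the timezone without shifting the wall-clock value,
    # because the string is recorded in local time.
    return naive.replace(tzinfo=LOCAL_TZ)


parsed = parse_birthdate("1985-07-14")   # hypothetical sample value
invalid = parse_birthdate("14/07/1985")  # wrong format, so None
```

Strings that do not match the declared format come back as `None` here, which is one way to spot rows that the schema would not cover.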