4. Set Default Cleaning Operations

A critical step in any data science project is ensuring the data is clean and prepared for feature engineering.

To maintain data quality during feature engineering in FeatureByte, you can centralize Cleaning Operations at the table level. This approach lets you address common issues in one place, such as string-based datetime columns, missing values, disguised missing values (placeholder values that stand in for missing data without being labeled as missing), and outliers.

We will perform cleaning operations on the NEW_APPLICATION and CLIENT_PROFILE tables, including:

  • Ignoring disguised missing values in DAYS_EMPLOYED (NEW_APPLICATION table)
  • Defining the schema for the BIRTHDATE column, which is stored as a string-based datetime (CLIENT_PROFILE table)

Approval Flow

In Catalogs with Approval Flow enabled, changes to table metadata, such as cleaning operations, initiate a review process. This process helps recommend new versions of the features and lists linked to these tables, so that models and deployments always use versions that account for data updates and potential issues. To see this in action, check out the Grocery Dataset UI Tutorials.


Step 1: Locate Columns to Be Cleaned

  1. From the menu, go to the 'Explore' section and access the Table Catalog.

  2. From the Table Catalog, select the NEW_APPLICATION table.

  3. Navigate to the 'Describe' tab and click the describe button to run a descriptive analysis.

  4. Filter the columns by clicking the column filter.
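
As a rough illustration of what the descriptive analysis can reveal, the sketch below summarizes a numeric column and flags values that are implausibly large. It is plain Python, not FeatureByte code; the sample values and the 100-year plausibility bound are made up for illustration.

```python
from statistics import median

# Hypothetical sample of DAYS_EMPLOYED values (made up for illustration).
days_employed = [-1200, -310, 365243, -4521, -87, 365243, -2210]


def summarize(values):
    """Return the min, median, and max of a numeric column."""
    return {"min": min(values), "median": median(values), "max": max(values)}


def suspicious_values(values, max_plausible_days=36500):
    """Return distinct values whose magnitude exceeds a plausible bound.

    The default bound (about 100 years in days) is an assumption.
    """
    return sorted({v for v in values if abs(v) > max_plausible_days})


summary = summarize(days_employed)
flagged = suspicious_values(days_employed)
print(summary)
print(flagged)
```

A maximum far outside the range of the other values, as with 365243 here, is the kind of signal that suggests a disguised missing value.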


Step 2: Ignore Disguised Missing Values in DAYS_EMPLOYED

  1. Go to the 'Columns' tab of the NEW_APPLICATION table and search for 'DAYS_EMPLOYED'.

  2. Click the edit button for the column's Critical Data Info.

  3. Ignore disguised missing values equal to 365243.
  4. Click 'Apply 1 cleaning step' to save the cleaning operation.

  5. Review the Cleaning Operations for the NEW_APPLICATION table. The newly applied cleaning step for the DAYS_EMPLOYED column should now be visible in the 'Columns' tab of the NEW_APPLICATION table.
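
To make the effect of this cleaning step concrete, here is a minimal plain-Python sketch of what ignoring a disguised missing value amounts to: the value 365243 is treated as missing rather than as a real number of days. This is not FeatureByte code; the function names and sample values are hypothetical.

```python
# Value treated as a disguised missing value in DAYS_EMPLOYED.
DISGUISED_VALUES = {365243}


def ignore_disguised(values, disguised=DISGUISED_VALUES):
    """Replace disguised missing values with None so they count as missing."""
    return [None if v in disguised else v for v in values]


def mean_ignoring_missing(values):
    """Average only the values that are present; return None if none are."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical raw column values (made up for illustration).
raw = [-1200, 365243, -300, 365243]
cleaned = ignore_disguised(raw)

raw_mean = sum(raw) / len(raw)                # distorted by the placeholder
clean_mean = mean_ignoring_missing(cleaned)   # based on real values only
print(raw_mean, clean_mean)
```

Without the cleaning step, any aggregate over the column is dominated by the placeholder, which is why the operation is set once at the table level.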

Step 3: Define the Schema for the BIRTHDATE Column

  1. Go to the 'Columns' tab of the CLIENT_PROFILE table and search for 'BIRTHDATE'.

  2. Click the edit button for the column's Critical Data Info.

  3. Add its Timestamp Schema:

    • is recorded in: Local time
    • time format string: "YYYY-MM-DD"
    • timezone to convert to local time: "America/Los_Angeles"

  4. Click 'Apply 1 cleaning step' to save the cleaning operation.
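
The sketch below illustrates, in plain Python, what a time format string such as "YYYY-MM-DD" tells the system about a string-based column: how to turn each string into a proper date. It is not FeatureByte code. The token-to-directive mapping and the function names are assumptions made for this example, and timezone handling is left out.

```python
from datetime import datetime

# Assumed mapping from the schema's format tokens to strptime directives.
TOKEN_TO_DIRECTIVE = {"YYYY": "%Y", "MM": "%m", "DD": "%d"}


def to_strptime_format(format_string):
    """Translate a format string such as "YYYY-MM-DD" into "%Y-%m-%d"."""
    for token, directive in TOKEN_TO_DIRECTIVE.items():
        format_string = format_string.replace(token, directive)
    return format_string


def parse_birthdate(value, format_string="YYYY-MM-DD"):
    """Parse a string-based date; raises ValueError if it does not match."""
    return datetime.strptime(value, to_strptime_format(format_string)).date()


# Hypothetical BIRTHDATE value (made up for illustration).
parsed = parse_birthdate("1985-03-17")
print(parsed.isoformat())
```

A value that does not follow the declared format, such as "17/03/1985", raises a ValueError in this sketch, which shows why the format string has to match how the column is actually stored.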
