4. Set Default Cleaning Operations

A critical step in any data science project is ensuring the data is clean and prepared for feature engineering.

To maintain data quality during feature engineering in FeatureByte, you can centralize Cleaning Operations at the table level. This approach lets you address common issues such as string-based datetime columns, missing values, disguised missing values (placeholder values that are not explicitly labeled as missing), and outliers.

We will perform cleaning operations on the NEW_APPLICATION table, including:

  • Defining the schema for the BIRTHDATE column stored as a string-based datetime
  • Ignoring disguised missing values in DAYS_EMPLOYED
  • Handling outliers in AMT_REQ_CREDIT_BUREAU_QRT

Approval Flow

In Catalogs with Approval Flow enabled, changes in table metadata, such as cleaning operations, initiate a review process. To see this in action, check out the Grocery Dataset UI Tutorials. This process helps recommend new versions of features and lists linked to these tables, ensuring that models and deployments always use versions that account for data updates and potential issues.


Step 1: Locate Columns to Be Cleaned

  1. From the menu, go to the 'Explore' section and access the Table Catalog.

  2. From the Table Catalog, select the NEW_APPLICATION table.

  3. Navigate to the 'Describe' tab and click the describe button to run a descriptive analysis.

  4. Filter the columns by clicking the column filter button.


Step 2: Define the Schema for the BIRTHDATE Column

  1. Go to the 'Columns' tab and search for 'BIRTHDATE'.

  2. Click the edit button for the column's critical data info.

  3. Add its Timestamp Schema:

    • is recorded in: UTC
    • time format string: "YYYY-MM-DD"
    • timezone to convert to local time: "America/Los_Angeles"

  4. Click 'Apply 1 cleaning step' to save the cleaning operation.

Note

Changing the schema will reset the column's existing semantic tag.

If you are using Databricks and specifying the schema for a timestamp column, keep in mind that FeatureByte retrieves timestamps exactly as they are stored, without adjusting for your Databricks cluster's time zone settings.
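
To make the timestamp schema settings above concrete, here is a minimal Python sketch of what they describe: a string in "YYYY-MM-DD" format is read as a UTC timestamp and then converted to local time in "America/Los_Angeles". It uses only the Python standard library (3.9 or later for zoneinfo) and is an illustration, not FeatureByte's implementation. The function name parse_birthdate and the sample date are hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def parse_birthdate(value: str, local_tz: str = "America/Los_Angeles") -> datetime:
    """Read a 'YYYY-MM-DD' string as UTC, then convert it to local time."""
    # "%Y-%m-%d" is Python's spelling of the "YYYY-MM-DD" format string.
    utc_value = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    # Convert the UTC timestamp to the configured local timezone.
    return utc_value.astimezone(ZoneInfo(local_tz))


# Hypothetical sample value, not taken from the NEW_APPLICATION table.
local_birthdate = parse_birthdate("1985-01-15")
print(local_birthdate.isoformat())
```

In this sketch a date-only value is read as midnight UTC, so after conversion it falls on the previous calendar day in Los Angeles. That is one reason the "is recorded in" and timezone settings are worth checking against how the source data was actually written.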


Step 3: Ignore Disguised Missing Values in DAYS_EMPLOYED

  1. Go to the 'Columns' tab and search for 'DAYS_EMPLOYED'.

  2. Click the edit button for the column's critical data info.

  3. Ignore disguised missing values equal to 365243.

  4. Click 'Apply 1 cleaning step' to save the cleaning operation.
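
As a rough illustration of what ignoring a disguised missing value means, the following Python sketch treats the sentinel 365243 as missing (None) and leaves every other value unchanged. It is plain Python written for this tutorial, not the FeatureByte API. The function name ignore_disguised and the sample values are hypothetical.

```python
from typing import Optional

# Sentinel value that stands in for "missing" in DAYS_EMPLOYED.
DISGUISED_VALUES = {365243}


def ignore_disguised(value: Optional[int]) -> Optional[int]:
    """Return None for disguised missing values, otherwise the value itself."""
    return None if value in DISGUISED_VALUES else value


# Hypothetical sample values, including one sentinel and one true missing value.
days_employed = [1200, 365243, 87, None]
cleaned_days_employed = [ignore_disguised(v) for v in days_employed]
print(cleaned_days_employed)
```

Treating the sentinel as missing keeps it out of aggregates such as averages or maximums, where a value of 365243 would otherwise dominate the result.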

Step 4: Handle Outliers in AMT_REQ_CREDIT_BUREAU_QRT

  1. Search for 'AMT_REQ_CREDIT_BUREAU_QRT'.

  2. Click the edit button for the column's critical data info.

  3. Cap values greater than 20.

  4. Click 'Apply 1 cleaning step' to save the cleaning operation.
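
The capping rule can be sketched in a few lines of Python. This assumes that "cap" means replacing any value greater than 20 with 20 itself, and that missing values pass through untouched. It is an illustration only, not the FeatureByte API. The function name cap_outlier and the sample values are hypothetical.

```python
from typing import Optional

# Threshold from the cleaning step: values above this are treated as outliers.
CAP = 20


def cap_outlier(value: Optional[float], cap: float = CAP) -> Optional[float]:
    """Replace values greater than `cap` with `cap`; keep missing values as None."""
    if value is None:
        return None
    return min(value, cap)


# Hypothetical sample values, including one outlier and one missing value.
credit_bureau_requests = [0, 3, 20, 261, None]
capped_requests = [cap_outlier(v) for v in credit_bureau_requests]
print(capped_requests)
```

Capping keeps the affected rows in the data while limiting the influence of extreme values on features built from this column.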

Step 5: Review Cleaning Operations for the NEW_APPLICATION Table

The newly applied cleaning steps for the BIRTHDATE, DAYS_EMPLOYED, and AMT_REQ_CREDIT_BUREAU_QRT columns should now be visible in the 'Columns' tab of the NEW_APPLICATION table.
