5. Set Default Cleaning Operations
Setup Cleaning Operations¶
Our data modeling is done, now we are ready to work with data itself.
A crucial step in every data science project is ensuring the data is clean and ready for feature engineering. Issues such as string-based datetime columns, missing values, disguised missing values (missing values that are not explicitly encoded as missing values), or outliers can significantly impair the quality of features and eventually the quality of the final model.
FeatureByte offers an API to effectively address these concerns.
Important Note for FeatureByte Enterprise Users¶
- In Catalogs with Approval Flow enabled, changes in table metadata, such as cleaning operations, initiate a review process. To see this in action, check out the Grocery Dataset UI Tutorials. This process helps recommend new versions of features and lists linked to these tables, ensuring that models and deployments always use versions that account for data updates and potential issues.
import featurebyte as fb
import pandas as pd
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Loan Applications Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)
14:07:52 | INFO | SDK version: 3.0.1.dev45 INFO :featurebyte:SDK version: 3.0.1.dev45 14:07:52 | INFO | No catalog activated. INFO :featurebyte:No catalog activated. 14:07:52 | INFO | Using profile: tutorial INFO :featurebyte:Using profile: tutorial 14:07:52 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml INFO :featurebyte:Using configuration file at: /Users/gxav/.featurebyte/config.yaml 14:07:52 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1) INFO :featurebyte:Active profile: tutorial (https://tutorials.featurebyte.com/api/v1) 14:07:52 | INFO | SDK version: 3.0.1.dev45 INFO :featurebyte:SDK version: 3.0.1.dev45 14:07:52 | INFO | No catalog activated. INFO :featurebyte:No catalog activated. 14:07:53 | INFO | Catalog activated: Loan Applications Dataset SDK Tutorial INFO :featurebyte.api.catalog:Catalog activated: Loan Applications Dataset SDK Tutorial
Locate Columns to Be Cleaned¶
We will perform cleaning operations on the NEW_APPLICATION table and CLIENT_PROFILE table, including:
- Ignoring disguised missing values in DAYS_EMPLOYED
- Defining the schema for the BIRTHDATE column stored as a string-based datetime
Those operations will be applied by default when a view is created from the table. You can however overwrite those operations by creating a view in a manual mode, or access a view raw data
Define the schema for the BIRTHDATE column¶
client_profile_table = catalog.get_table("CLIENT_PROFILE")
Let's check the dtype of BIRTHDATE
client_profile_table.BIRTHDATE.info.dtype
'VARCHAR'
We will add its Timestamp Schema
- format_string="YYYY-MM-DD"
- is_utc_time=False
- timezone="America/Los_Angeles"
client_profile_table["BIRTHDATE"].update_critical_data_info(
cleaning_operations=[
fb.AddTimestampSchema(
timestamp_schema=fb.TimestampSchema(
is_utc_time=False,
format_string="YYYY-MM-DD",
timezone="America/Los_Angeles",
),
),
]
)
Ignore Disguised Missing Values in DAYS_EMPLOYED¶
application_table = catalog.get_table("NEW_APPLICATION")
Let's look into descriptive statistics of DAYS_EMPLOYED
application_table.DAYS_EMPLOYED.describe()
Done! |████████████████████████████████████████| 100% in 51.8s (0.02%/s)
DAYS_EMPLOYED | |
---|---|
dtype | INT |
unique | 12774 |
%missing | 0.0 |
%empty | NaN |
entropy | NaN |
top | 365243 |
freq | 60848.0 |
mean | 64172.58446 |
std | 141579.990093 |
min | -17912 |
25% | -2772.0 |
50% | -1220.0 |
75% | -289.0 |
max | 365243 |
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Amount | |
---|---|
dtype | FLOAT |
unique | 6647 |
%missing | 0.0 |
%empty | NaN |
entropy | NaN |
top | 1 |
freq | 1043.0 |
mean | 19.165033 |
std | 23.732982 |
min | 0.0 |
25% | 4.28 |
50% | 10.58 |
75% | 24.53 |
max | 360.84 |
We can see that DAYS_EMPLOYED presents a disguised missing value: 365243. We will replace 365243 by a missing value.
application_table["DAYS_EMPLOYED"].update_critical_data_info(
cleaning_operations=[
fb.DisguisedValueImputation(disguised_values=[365243], imputed_value=None),
]
)
Review Cleaning Operations for CLIENT_PROFILE and NEW_APPLICATION tables¶
If we look at the columns_info
, we'll see that critical_data_info
for BIRTHDATE and DAYS_EMPLOYED column is populated with cleaning operations now.
pd.DataFrame(client_profile_table.info(verbose=True)["columns_info"])
name | dtype | entity | semantic | critical_data_info | description | |
---|---|---|---|---|---|---|
0 | ClientID | INT | Client | scd_natural_key_id | None | ID of the client |
1 | BIRTHDATE | VARCHAR | None | None | {'cleaning_operations': [{'type': 'add_timesta... | Client birthdate |
2 | GENDER | VARCHAR | None | None | None | Gender of the client |
3 | FLAG_OWN_CAR | VARCHAR | None | None | None | Flag if the client owns a car |
4 | FLAG_OWN_REALTY | VARCHAR | None | None | None | Flag if client owns a house or flat |
5 | CNT_CHILDREN | INT | None | None | None | Number of children the client has |
6 | INCOME_TYPE | VARCHAR | None | None | None | Clients income type (businessman, working, mat... |
7 | EDUCATION_TYPE | VARCHAR | None | None | None | Level of highest education the client achieved |
8 | FAMILY_STATUS | VARCHAR | None | None | None | Family status of the client |
9 | HOUSING_TYPE | VARCHAR | None | None | None | What is the housing situation of the client (r... |
10 | OCCUPATION_TYPE | VARCHAR | None | None | None | What kind of occupation does the client have |
11 | CNT_FAM_MEMBERS | FLOAT | None | None | None | How many family members does client have |
12 | ORGANIZATION_TYPE | VARCHAR | None | None | None | Type of organization where client works |
13 | SCD_effective_timestamp | TIMESTAMP | None | scd_effective_timestamp | None | Timestamp when the record becomes effective. |
14 | SCD_end_timestamp | TIMESTAMP | None | scd_end_timestamp | None | Timestamp when the record is no longer valid |
15 | available_at | TIMESTAMP | None | record_creation_timestamp | None | Timestamp the record was added to the database |
pd.DataFrame(application_table.info(verbose=True)["columns_info"])
name | dtype | entity | semantic | critical_data_info | description | |
---|---|---|---|---|---|---|
0 | SK_ID_CURR | INT | New Application | dimension_id | None | ID of application |
1 | ClientID | INT | Client | None | None | ID of the client |
2 | CONTRACT_TYPE | VARCHAR | None | None | None | Identification if loan is cash or revolving |
3 | AMT_INCOME_TOTAL | FLOAT | None | None | None | Income of the client |
4 | AMT_CREDIT | FLOAT | None | None | None | Credit amount of the loan |
... | ... | ... | ... | ... | ... | ... |
107 | AMT_REQ_CREDIT_BUREAU_WEEK | FLOAT | None | None | None | Number of enquiries to Credit Bureau about the... |
108 | AMT_REQ_CREDIT_BUREAU_MON | FLOAT | None | None | None | Number of enquiries to Credit Bureau about the... |
109 | AMT_REQ_CREDIT_BUREAU_QRT | FLOAT | None | None | None | Number of enquiries to Credit Bureau about the... |
110 | AMT_REQ_CREDIT_BUREAU_YEAR | FLOAT | None | None | None | Number of enquiries to Credit Bureau about the... |
111 | available_at | TIMESTAMP | None | record_creation_timestamp | None | Timestamp the record was added to the database |
112 rows × 6 columns