5. Set Default Cleaning Operations
Setup Cleaning Operations¶
Our data modeling is done; now we are ready to work with the data itself.
A crucial step in every data science project is ensuring the data is clean and ready for feature engineering. Issues such as string-based datetime columns, missing values, disguised missing values (values that stand in for missing data without being encoded as such), or outliers can significantly impair the quality of features and, eventually, the quality of the final model.
FeatureByte offers an API to effectively address these concerns.
Important Note for FeatureByte Enterprise Users¶
- In Catalogs with Approval Flow enabled, changes in table metadata, such as cleaning operations, initiate a review process. To see this in action, check out the Grocery Dataset UI Tutorials. This process helps recommend new versions of features and lists linked to these tables, ensuring that models and deployments always use versions that account for data updates and potential issues.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Credit Default Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)
16:38:44 | WARNING | Service endpoint is inaccessible: http://featurebyte-server:8088/
16:38:44 | INFO | Using profile: tutorial
16:38:44 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
16:38:44 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
16:38:44 | INFO | SDK version: 2.1.0.dev113
16:38:44 | INFO | No catalog activated.
16:38:44 | INFO | Catalog activated: Credit Default Dataset SDK Tutorial
Locate Columns to Be Cleaned¶
We will perform cleaning operations on the NEW_APPLICATION table, including:
- Defining the schema for the BIRTHDATE column stored as a string-based datetime
- Ignoring disguised missing values in DAYS_EMPLOYED
- Handling outliers in AMT_REQ_CREDIT_BUREAU_QRT
These operations will be applied by default whenever a view is created from the table. You can, however, override them by creating a view in manual mode, or bypass them entirely by accessing the view's raw data.
application_table = catalog.get_table("NEW_APPLICATION")
Define the schema for the BIRTHDATE column¶
Let's check the dtype of BIRTHDATE:
application_table.BIRTHDATE.info.dtype
'VARCHAR'
We will attach a TimestampSchema with:
- format_string="YYYY-MM-DD"
- is_utc_time=False
- timezone="America/Los_Angeles"
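To make the semantics of these three parameters concrete, here is a hypothetical plain-pandas sketch of the same parsing logic (not FeatureByte's internal implementation; note that the SQL-style format string "YYYY-MM-DD" corresponds to strftime's "%Y-%m-%d" in pandas):

```python
import pandas as pd

# Hypothetical sample of string-based birthdates, as stored in the table
birthdates = pd.Series(["1985-03-14", "1990-11-02"])

# Parse with the equivalent of format_string="YYYY-MM-DD" (strftime: "%Y-%m-%d"),
# then, since is_utc_time=False, interpret the values as local time
# in the America/Los_Angeles timezone
parsed = pd.to_datetime(birthdates, format="%Y-%m-%d").dt.tz_localize(
    "America/Los_Angeles"
)

print(parsed.dt.tz)  # America/Los_Angeles
```

With the schema attached to the table, FeatureByte can apply this interpretation consistently wherever the column is used in feature derivation.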
application_table["BIRTHDATE"].update_critical_data_info(
cleaning_operations=[
fb.AddTimestampSchema(
timestamp_schema=fb.TimestampSchema(
is_utc_time=False,
format_string="YYYY-MM-DD",
timezone="America/Los_Angeles",
),
),
]
)
Ignore Disguised Missing Values in DAYS_EMPLOYED¶
Let's look at the descriptive statistics of DAYS_EMPLOYED:
application_table.DAYS_EMPLOYED.describe()
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
| | DAYS_EMPLOYED |
|---|---|
dtype | INT |
unique | 12744 |
%missing | 0.0 |
%empty | NaN |
entropy | NaN |
top | 365243 |
freq | 59942.0 |
mean | 64053.741606 |
std | 141481.040114 |
min | -17912 |
25% | -2772.0 |
50% | -1219.0 |
75% | -290.0 |
max | 365243 |
We can see that DAYS_EMPLOYED contains a disguised missing value: 365243. We will replace 365243 with an actual missing value.
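The effect of this imputation can be sketched in plain pandas (a hypothetical illustration of the semantics, not FeatureByte's internal implementation):

```python
import numpy as np
import pandas as pd

# Hypothetical sample including the disguised missing value 365243
days_employed = pd.Series([-2772, 365243, -1219])

# Replace the disguised value with a true missing value,
# mirroring DisguisedValueImputation(disguised_values=[365243], imputed_value=None)
cleaned = days_employed.replace(365243, np.nan)

print(cleaned.isna().sum())  # 1
```

Once the value is a true missing value, downstream aggregations and imputations treat it correctly instead of skewing statistics such as the mean.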
application_table["DAYS_EMPLOYED"].update_critical_data_info(
cleaning_operations=[
fb.DisguisedValueImputation(disguised_values=[365243], imputed_value=None),
]
)
Handle Outliers in AMT_REQ_CREDIT_BUREAU_QRT¶
Let's look at the descriptive statistics of AMT_REQ_CREDIT_BUREAU_QRT:
application_table.AMT_REQ_CREDIT_BUREAU_QRT.describe()
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
| | AMT_REQ_CREDIT_BUREAU_QRT |
|---|---|
dtype | FLOAT |
unique | 11 |
%missing | 13.4362 |
%empty | NaN |
entropy | NaN |
top | NaN |
freq | NaN |
mean | 0.286087 |
std | 0.79044 |
min | 0.0 |
25% | 0.0 |
50% | 0.0 |
75% | 0.0 |
max | 261.0 |
We can see that AMT_REQ_CREDIT_BUREAU_QRT contains outliers. We will cap its values at 20.
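The capping applied below is equivalent to clipping the column on the upper end; a minimal pandas sketch with hypothetical data (again, an illustration of the semantics rather than FeatureByte's internal implementation):

```python
import pandas as pd

# Hypothetical sample with an outlier beyond the end point of 20
enquiries = pd.Series([0.0, 3.0, 261.0])

# Values greater than 20 are imputed with 20, mirroring
# ValueBeyondEndpointImputation(type="greater_than", end_point=20, imputed_value=20)
capped = enquiries.clip(upper=20)

print(capped.max())  # 20.0
```

Capping rather than dropping keeps the rows available for feature computation while preventing extreme values from dominating aggregates.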
application_table["AMT_REQ_CREDIT_BUREAU_QRT"].update_critical_data_info(
cleaning_operations=[
fb.ValueBeyondEndpointImputation(
type="greater_than", end_point=20, imputed_value=20
),
]
)
Review Cleaning Operations for NEW_APPLICATION table¶
If we look at columns_info, we'll see that critical_data_info for the BIRTHDATE, DAYS_EMPLOYED, and AMT_REQ_CREDIT_BUREAU_QRT columns is now populated with cleaning operations.
import pandas as pd
pd.DataFrame(application_table.info(verbose=True)["columns_info"])
| | name | dtype | entity | semantic | critical_data_info | description |
|---|---|---|---|---|---|---|
0 | BIRTHDATE | VARCHAR | None | None | {'cleaning_operations': [{'type': 'add_timesta... | Client birthdate |
1 | CODE_GENDER | VARCHAR | None | None | None | Gender of the client |
2 | INCOME_TYPE | VARCHAR | None | None | None | Clients income type (businessman, working, mat... |
3 | EDUCATION_TYPE | VARCHAR | None | None | None | Level of highest education the client achieved |
4 | OCCUPATION_TYPE | VARCHAR | None | None | None | What kind of occupation does the client have |
5 | ORGANIZATION_TYPE | VARCHAR | None | None | None | Type of organization where client works |
6 | APPLICATION_ID | INT | New Application | dimension_id | None | ID of application |
7 | CLIENT_ID | INT | Client | None | None | ID of the client |
8 | AMT_CREDIT | FLOAT | None | None | None | Credit amount of the loan |
9 | AMT_ANNUITY | FLOAT | None | None | None | Loan annuity |
10 | AMT_GOODS_VALUE | FLOAT | None | None | None | For consumer loans it is the value of the good... |
11 | REGION_POPULATION_RELATIVE | FLOAT | None | None | None | Normalized population of region where client l... |
12 | APPLICATION_TIME | TIMESTAMP | None | None | None | Application timestamp |
13 | DAYS_EMPLOYED | INT | None | None | {'cleaning_operations': [{'type': 'disguised',... | How many days before the application the perso... |
14 | DAYS_REGISTRATION | FLOAT | None | None | None | How many days before the application did clien... |
15 | DAYS_LAST_PHONE_CHANGE | FLOAT | None | None | None | How many days before the application did clien... |
16 | FLOORSMAX_MEDI | FLOAT | None | None | None | Normalized information about building where th... |
17 | LANDAREA_MEDI | FLOAT | None | None | None | Normalized information about building where th... |
18 | FLAG_DOCUMENT_3 | INT | None | None | None | Did client provide document |
19 | AMT_REQ_CREDIT_BUREAU_QRT | FLOAT | None | None | {'cleaning_operations': [{'type': 'greater_tha... | Number of enquiries to Credit Bureau about the... |
20 | available_at | TIMESTAMP | None | record_creation_timestamp | None | Timestamp the record was added to the database |