5. Set Default Cleaning Operations

Setup Cleaning Operations¶

Our data modeling is done, now we are ready to work with data itself.

A crucial step in every data science project is ensuring the data is clean and ready for feature engineering. Issues such as string-based datetime columns, missing values, disguised missing values (missing values that are not explicitly encoded as missing values), or outliers can significantly impair the quality of features and eventually the quality of the final model.

FeatureByte offers an API to effectively address these concerns.

Important Note for FeatureByte Enterprise Users¶

In Catalogs with Approval Flow enabled, changes in table metadata, such as cleaning operations, initiate a review process. To see this in action, check out the Grocery Dataset UI Tutorials. This process helps recommend new versions of features and lists linked to these tables, ensuring that models and deployments always use versions that account for data updates and potential issues.

In [1]:

            
                Copied!
                
import featurebyte as fb
import pandas as pd

# Set your profile to the tutorial environment
fb.use_profile("tutorial")

catalog_name = "Loan Applications Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)
import featurebyte as fb
import pandas as pd

# Set your profile to the tutorial environment
fb.use_profile("tutorial")

catalog_name = "Loan Applications Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)

14:07:52 | INFO     | SDK version: 3.0.1.dev45
INFO    :featurebyte:SDK version: 3.0.1.dev45
14:07:52 | INFO     | No catalog activated.
INFO    :featurebyte:No catalog activated.
14:07:52 | INFO     | Using profile: tutorial
INFO    :featurebyte:Using profile: tutorial
14:07:52 | INFO     | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
INFO    :featurebyte:Using configuration file at: /Users/gxav/.featurebyte/config.yaml
14:07:52 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
INFO    :featurebyte:Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
14:07:52 | INFO     | SDK version: 3.0.1.dev45
INFO    :featurebyte:SDK version: 3.0.1.dev45
14:07:52 | INFO     | No catalog activated.
INFO    :featurebyte:No catalog activated.
14:07:53 | INFO     | Catalog activated: Loan Applications Dataset SDK Tutorial
INFO    :featurebyte.api.catalog:Catalog activated: Loan Applications Dataset SDK Tutorial

Locate Columns to Be Cleaned¶

We will perform cleaning operations on the NEW_APPLICATION table and CLIENT_PROFILE table, including:

Ignoring disguised missing values in DAYS_EMPLOYED
Defining the schema for the BIRTHDATE column stored as a string-based datetime

Those operations will be applied by default when a view is created from the table. You can however overwrite those operations by creating a view in a manual mode, or access a view raw data

Define the schema for the BIRTHDATE column¶

In [2]:

            
                Copied!
                
client_profile_table = catalog.get_table("CLIENT_PROFILE")
client_profile_table = catalog.get_table("CLIENT_PROFILE")

Let's check the dtype of BIRTHDATE

In [3]:

            
                Copied!
                
client_profile_table.BIRTHDATE.info.dtype
client_profile_table.BIRTHDATE.info.dtype

Out[3]:

'VARCHAR'

We will add its Timestamp Schema

format_string="YYYY-MM-DD"
is_utc_time=False
timezone="America/Los_Angeles"

In [4]:

            
                Copied!
                
                    
                    
                
                

        
client_profile_table["BIRTHDATE"].update_critical_data_info(
    cleaning_operations=[
        fb.AddTimestampSchema(
            timestamp_schema=fb.TimestampSchema(
                is_utc_time=False,
                format_string="YYYY-MM-DD",
                timezone="America/Los_Angeles",
            ),
        ),
    ]
)
client_profile_table["BIRTHDATE"].update_critical_data_info(
    cleaning_operations=[
        fb.AddTimestampSchema(
            timestamp_schema=fb.TimestampSchema(
                is_utc_time=False,
                format_string="YYYY-MM-DD",
                timezone="America/Los_Angeles",
            ),
        ),
    ]
)

Ignore Disguised Missing Values in DAYS_EMPLOYED¶

In [5]:

            
                Copied!
                
application_table = catalog.get_table("NEW_APPLICATION")
application_table = catalog.get_table("NEW_APPLICATION")

Let's look into descriptive statistics of DAYS_EMPLOYED

In [6]:

            
                Copied!
                
application_table.DAYS_EMPLOYED.describe()
application_table.DAYS_EMPLOYED.describe()

Done! |████████████████████████████████████████| 100% in 51.8s (0.02%/s)

Out[6]:

	DAYS_EMPLOYED
dtype	INT
unique	12774
%missing	0.0
%empty	NaN
entropy	NaN
top	365243
freq	60848.0
mean	64172.58446
std	141579.990093
min	-17912
25%	-2772.0
50%	-1220.0
75%	-289.0
max	365243

Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)

Out[6]:

	Amount
dtype	FLOAT
unique	6647
%missing	0.0
%empty	NaN
entropy	NaN
top	1
freq	1043.0
mean	19.165033
std	23.732982
min	0.0
25%	4.28
50%	10.58
75%	24.53
max	360.84

We can see that DAYS_EMPLOYED presents a disguised missing value: 365243. We will replace 365243 by a missing value.

In [7]:

            
                Copied!
                
application_table["DAYS_EMPLOYED"].update_critical_data_info(
    cleaning_operations=[
        fb.DisguisedValueImputation(disguised_values=[365243], imputed_value=None),
    ]
)
application_table["DAYS_EMPLOYED"].update_critical_data_info(
    cleaning_operations=[
        fb.DisguisedValueImputation(disguised_values=[365243], imputed_value=None),
    ]
)

Review Cleaning Operations for CLIENT_PROFILE and NEW_APPLICATION tables¶

If we look at the columns_info, we'll see that critical_data_info for BIRTHDATE and DAYS_EMPLOYED column is populated with cleaning operations now.

In [8]:

            
                Copied!
                
pd.DataFrame(client_profile_table.info(verbose=True)["columns_info"])
pd.DataFrame(client_profile_table.info(verbose=True)["columns_info"])

Out[8]:

	name	dtype	entity	semantic	critical_data_info	description
0	ClientID	INT	Client	scd_natural_key_id	None	ID of the client
1	BIRTHDATE	VARCHAR	None	None	{'cleaning_operations': [{'type': 'add_timesta...	Client birthdate
2	GENDER	VARCHAR	None	None	None	Gender of the client
3	FLAG_OWN_CAR	VARCHAR	None	None	None	Flag if the client owns a car
4	FLAG_OWN_REALTY	VARCHAR	None	None	None	Flag if client owns a house or flat
5	CNT_CHILDREN	INT	None	None	None	Number of children the client has
6	INCOME_TYPE	VARCHAR	None	None	None	Clients income type (businessman, working, mat...
7	EDUCATION_TYPE	VARCHAR	None	None	None	Level of highest education the client achieved
8	FAMILY_STATUS	VARCHAR	None	None	None	Family status of the client
9	HOUSING_TYPE	VARCHAR	None	None	None	What is the housing situation of the client (r...
10	OCCUPATION_TYPE	VARCHAR	None	None	None	What kind of occupation does the client have
11	CNT_FAM_MEMBERS	FLOAT	None	None	None	How many family members does client have
12	ORGANIZATION_TYPE	VARCHAR	None	None	None	Type of organization where client works
13	SCD_effective_timestamp	TIMESTAMP	None	scd_effective_timestamp	None	Timestamp when the record becomes effective.
14	SCD_end_timestamp	TIMESTAMP	None	scd_end_timestamp	None	Timestamp when the record is no longer valid
15	available_at	TIMESTAMP	None	record_creation_timestamp	None	Timestamp the record was added to the database

In [9]:

            
                Copied!
                
pd.DataFrame(application_table.info(verbose=True)["columns_info"])
pd.DataFrame(application_table.info(verbose=True)["columns_info"])

Out[9]:

	name	dtype	entity	semantic	critical_data_info	description
0	SK_ID_CURR	INT	New Application	dimension_id	None	ID of application
1	ClientID	INT	Client	None	None	ID of the client
2	CONTRACT_TYPE	VARCHAR	None	None	None	Identification if loan is cash or revolving
3	AMT_INCOME_TOTAL	FLOAT	None	None	None	Income of the client
4	AMT_CREDIT	FLOAT	None	None	None	Credit amount of the loan
...	...	...	...	...	...	...
107	AMT_REQ_CREDIT_BUREAU_WEEK	FLOAT	None	None	None	Number of enquiries to Credit Bureau about the...
108	AMT_REQ_CREDIT_BUREAU_MON	FLOAT	None	None	None	Number of enquiries to Credit Bureau about the...
109	AMT_REQ_CREDIT_BUREAU_QRT	FLOAT	None	None	None	Number of enquiries to Credit Bureau about the...
110	AMT_REQ_CREDIT_BUREAU_YEAR	FLOAT	None	None	None	Number of enquiries to Credit Bureau about the...
111	available_at	TIMESTAMP	None	record_creation_timestamp	None	Timestamp the record was added to the database

112 rows × 6 columns

That's all! Now, every time we generate a new entry from the invoice table, we can be confident that no undesirable values will slip through.

To learn more, refer to following materials:¶

SDK reference for¶

In [ ]: