5. Set Default Cleaning Operations

Setup Cleaning Operations

Our data modeling is done, so we are now ready to work with the data itself.

A crucial step in every data science project is ensuring the data is clean and ready for feature engineering. Issues such as missing values, disguised missing values (missing values that are not explicitly encoded as missing values), or outliers can significantly impair the quality of features and eventually the quality of the final model.

FeatureByte offers an API to effectively address these concerns.

Important Note for FeatureByte Enterprise Users

In Catalogs with Approval Flow enabled, changes in table metadata such as cleaning operations initiate a review process. This process recommends new versions of features and lists linked to these tables, ensuring that new models and deployments use versions that address any data issues.

In [1]:

            
import featurebyte as fb

# Set your profile to the tutorial environment
fb.use_profile("tutorial")

catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)

10:42:42 | WARNING  | Service endpoint is inaccessible: http://featurebyte-server:8088/
10:42:42 | INFO     | Using profile: tutorial
10:42:42 | INFO     | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
10:42:42 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
10:42:42 | INFO     | SDK version: 2.0.1.dev67
10:42:42 | INFO     | No catalog activated.
10:42:42 | INFO     | Catalog activated: Grocery Dataset Tutorial

Let's look at the descriptive statistics of the invoice Amount column.

In [2]:

            
invoice_table = catalog.get_table("GROCERYINVOICE")

In [3]:

            
invoice_table.Amount.describe()

Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)

Out[3]:

	Amount
dtype	FLOAT
unique	6647
%missing	0.0
%empty	NaN
entropy	NaN
top	NaN
freq	NaN
mean	19.195901
std	23.729811
min	0.0
25%	4.29
50%	10.605
75%	24.55
max	360.84


Although the Amount column doesn't present any anomalies, we will set default cleaning operations to ensure that if any issues happen in the future, the data will remain clean.

We will set the following cleaning operations:

treat the disguised missing values -99 and -98 as missing
set any amount less than 0 euros to 0
cap any amount greater than 2000 euros at 2000

These operations are applied by default whenever a view is created from the table. You can, however, override them by creating a view in manual mode, or bypass them by accessing the view's raw data.
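To make the intended effect of the three operations concrete, here is a minimal plain-Python sketch. It is not FeatureByte code: the helper name clean_amount and the sample values are hypothetical, and it assumes the operations are applied in the order listed, so the disguised values are treated as missing before the lower bound is checked.

```python
def clean_amount(value, disguised_values=(-99, -98), lower=0.0, upper=2000.0):
    """Illustrative stand-in for the three cleaning operations (hypothetical helper).

    Assumption: operations run in the order listed, so -99 and -98 become
    missing before the "less than 0" rule could turn them into 0.
    """
    # 1. Disguised missing values (and true missing values) stay missing.
    if value is None or value in disguised_values:
        return None
    # 2. Amounts below the lower endpoint are replaced by the endpoint.
    if value < lower:
        return lower
    # 3. Amounts above the upper endpoint are replaced by the endpoint.
    if value > upper:
        return upper
    return value


# Made-up sample amounts, chosen to exercise each rule once.
raw = [12.5, -99, -3.0, 2500.0, None, -98]
cleaned = [clean_amount(v) for v in raw]
print(cleaned)  # [12.5, None, 0.0, 2000.0, None, None]
```

Note that the order matters in this sketch: if the lower-bound rule ran first, -99 and -98 would be replaced by 0 instead of being treated as missing.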

In [4]:

            
invoice_table["Amount"].update_critical_data_info(
    cleaning_operations=[
        fb.DisguisedValueImputation(disguised_values=[-99, -98], imputed_value=None),
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
        fb.ValueBeyondEndpointImputation(
            type="greater_than", end_point=2000, imputed_value=2000
        ),
    ]
)

If we look at columns_info, we can see that critical_data_info for the Amount column is now populated with the cleaning operations.

In [5]:

            
import pandas as pd
pd.DataFrame(invoice_table.info(verbose=True)["columns_info"])

Out[5]:

	name	dtype	entity	semantic	critical_data_info	description
0	GroceryInvoiceGuid	VARCHAR	invoice	event_id	None	Unique identifier of each row in the table, in...
1	GroceryCustomerGuid	VARCHAR	customer	None	None	Unique identifier for each customer, in GUID f...
2	Timestamp	TIMESTAMP	None	event_timestamp	None	The GMT timestamp of when this invoice transac...
3	tz_offset	VARCHAR	None	time_zone	None	The local timezone offset of the invoice event.
4	record_available_at	TIMESTAMP	None	record_creation_timestamp	None	A timestamp for when this row was added to the...
5	Amount	FLOAT	None	None	{'cleaning_operations': [{'imputed_value': Non...	The total amount of the invoice, including all...

That's all! From now on, every view created from the invoice table applies these cleaning operations by default, so disguised missing values and out-of-range amounts are handled before they reach our features.
