5. Set Default Cleaning Operations
Set Up Cleaning Operations
Our data modeling is done, and we are now ready to work with the data itself.
A crucial step in every data science project is ensuring the data is clean and ready for feature engineering. Issues such as missing values, disguised missing values (missing values that are not explicitly encoded as missing values), or outliers can significantly impair the quality of features and eventually the quality of the final model.
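To make the risk concrete, here is a minimal plain-Python sketch of how a disguised missing value distorts a simple aggregate. The amounts and the sentinel set are hypothetical values chosen for illustration. They are not taken from the grocery dataset.

```python
from statistics import mean

# Hypothetical invoice amounts; -99.0 is a sentinel standing in for "unknown".
amounts = [12.5, 8.0, -99.0, 20.0, -99.0, 15.5]

# Assumed sentinel codes (illustrative only).
SENTINELS = {-99, -98}

# A naive mean treats each sentinel as if it were a real amount.
naive_mean = mean(amounts)

# A cleaned mean drops the sentinels before aggregating.
valid_amounts = [a for a in amounts if a not in SENTINELS]
clean_mean = mean(valid_amounts)

print(f"naive mean: {naive_mean:.2f}")
print(f"clean mean: {clean_mean:.2f}")
```

The naive mean comes out negative even though every real amount is positive. A feature built on such an aggregate would pass that distortion on to the model.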
FeatureByte offers an API to effectively address these concerns.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)
21:58:01 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
21:58:01 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
21:58:02 | WARNING | Remote SDK version (0.5.0.dev6) is different from local (0.5.0.dev1). Update local SDK to avoid unexpected behavior.
21:58:02 | INFO | No catalog activated.
21:58:02 | INFO | 6 feature lists, 31 features deployed
21:58:02 | INFO | Using profile: tutorial
21:58:03 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
21:58:03 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
21:58:03 | WARNING | Remote SDK version (0.5.0.dev6) is different from local (0.5.0.dev1). Update local SDK to avoid unexpected behavior.
21:58:03 | INFO | No catalog activated.
21:58:03 | INFO | 6 feature lists, 31 features deployed
21:58:04 | INFO | Catalog activated: Grocery Dataset Tutorial
Let's look at the descriptive statistics of the invoice Amount column.
invoice_table = catalog.get_table("GROCERYINVOICE")
invoice_table.Amount.describe()
| | Amount |
|---|---|
| dtype | FLOAT |
| unique | 6647 |
| %missing | 0.0 |
| %empty | NaN |
| entropy | NaN |
| top | 1 |
| freq | 735.0 |
| mean | 19.084728 |
| std | 23.666983 |
| min | 0.0 |
| 25% | 4.28 |
| 50% | 10.53 |
| 75% | 24.4175 |
| max | 360.84 |
Although the Amount column doesn't present any anomalies, we will set default cleaning operations to ensure that if any issues happen in the future, the data will remain clean.
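The "no anomalies" reading can be checked directly against the summary statistics. The sketch below copies three values from the describe() output and compares them with the bounds (0 and 2000) and sentinel codes (-99, -98) that this tutorial goes on to configure. The dictionary and the check names are illustrative helpers and are not part of the FeatureByte API.

```python
# Values copied from the describe() output for the Amount column.
stats = {"%missing": 0.0, "min": 0.0, "max": 360.84}

# Thresholds and sentinel codes used by the cleaning operations in this tutorial.
LOWER_BOUND = 0
UPPER_BOUND = 2000
SENTINELS = (-99, -98)

checks = {
    "no missing values": stats["%missing"] == 0.0,
    "no amount below the lower bound": stats["min"] >= LOWER_BOUND,
    "no amount above the upper bound": stats["max"] <= UPPER_BOUND,
    # Every sentinel lies below the observed minimum, so none can be present.
    "no sentinel values present": all(s < stats["min"] for s in SENTINELS),
}

for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")
```

All four checks pass on the current data, which is why the cleaning operations here act as a safeguard against future issues and not as a fix for existing ones.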
Here we set the following cleaning operations:
- treat the values -99 and -98 as disguised missing values and replace them with a missing value
- raise any amount less than 0 euros to 0
- cap any amount greater than 2000 euros at 2000
These operations are applied by default when a view is created from the table. You can, however, override them by creating a view in manual mode, or by accessing the view's raw data.
invoice_table["Amount"].update_critical_data_info(
    cleaning_operations=[
        fb.DisguisedValueImputation(disguised_values=[-99, -98], imputed_value=None),
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
        fb.ValueBeyondEndpointImputation(
            type="greater_than", end_point=2000, imputed_value=2000
        ),
    ]
)
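To make the intended effect of these three operations concrete, here is a plain-Python sketch of the same rules applied to individual values. It is not how FeatureByte implements cleaning. The function name `clean_amount` and the sample values are hypothetical, and the sketch assumes the operations are applied in the order they are listed.

```python
from typing import Optional

# Parameters mirrored from the cleaning operations configured above.
DISGUISED_VALUES = {-99, -98}
LOWER_BOUND = 0
UPPER_BOUND = 2000


def clean_amount(value: Optional[float]) -> Optional[float]:
    """Apply the three rules to a single amount (illustrative helper)."""
    if value is None:
        return None  # already missing, nothing to do
    if value in DISGUISED_VALUES:
        return None  # disguised missing value becomes an explicit missing value
    if value < LOWER_BOUND:
        return LOWER_BOUND  # values below the lower bound are raised to it
    if value > UPPER_BOUND:
        return UPPER_BOUND  # values above the upper bound are capped at it
    return value


# Hypothetical raw amounts covering each rule.
raw_amounts = [10.53, -99, -5.0, 2500.0, None, 360.84]
cleaned_amounts = [clean_amount(v) for v in raw_amounts]
print(cleaned_amounts)
```

The order matters in this sketch. Because -99 is also less than 0, checking for disguised values first turns it into a missing value instead of raising it to 0.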
If we look at columns_info, we'll see that critical_data_info for the Amount column is now populated with the cleaning operations.
invoice_table.info(verbose=True)
| | |
|---|---|
| name | GROCERYINVOICE |
| created_at | 2023-09-11 13:56:34 |
| updated_at | 2023-09-11 13:58:11 |
| description | Grocery invoice details, containing the timestamp and the total amount of the invoice |
| status | PUBLIC_DRAFT |
| catalog_name | Grocery Dataset Tutorial |
| record_creation_timestamp_column | record_available_at |
| table_details | (nested details not shown) |
| entities | (nested details not shown) |
| semantics | ['event_id', 'event_timestamp', 'record_creation_timestamp', 'time_zone'] |
| column_count | 6 |
| columns_info | (nested details not shown) |
| event_timestamp_column | Timestamp |
| event_id_column | GroceryInvoiceGuid |
| default_feature_job_setting | (nested details not shown) |