### Set Up Cleaning Operations

Our data modeling is done, and we are now ready to work with the data itself.

A crucial step in every data science project is ensuring the data is clean and ready for feature engineering. Issues such as missing values, disguised missing values (missing values that are not explicitly encoded as missing values), or outliers can significantly impair the quality of features and eventually the quality of the final model.
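To see why a disguised missing value matters, here is a minimal pandas sketch. The amounts are made-up illustrative values, not rows from the grocery dataset, and `-99` is assumed to be the sentinel that stands in for "unknown".

```python
import pandas as pd

# Hypothetical invoice amounts; -99 is a sentinel standing in for "unknown".
amounts = pd.Series([20.0, 35.0, -99.0, 15.0, -99.0, 30.0])

# Aggregating the raw values treats the sentinel as a real amount.
naive_mean = amounts.mean()

# Turn the sentinel into a real missing value (NaN) before aggregating.
cleaned = amounts.mask(amounts == -99.0)
clean_mean = cleaned.mean()  # NaN entries are skipped by default

print(f"mean with sentinel: {naive_mean:.2f}")
print(f"mean after cleaning: {clean_mean:.2f}")
```

Left in place, the sentinel pulls the average below zero even though every real amount is positive. An aggregation feature built on the raw column would inherit the same distortion.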

FeatureByte offers an API to effectively address these concerns.

### Important Note for FeatureByte Enterprise Users

- **In Catalogs with [Approval Flow](https://docs.featurebyte.com/latest/about/glossary/#approval-flow) enabled**, changes in table metadata such as cleaning operations initiate a review process. This process recommends new versions of features and lists linked to these tables, ensuring that new models and deployments use versions that address any data issues.


In [1]:
import featurebyte as fb

# Set your profile to the tutorial environment
fb.use_profile("tutorial")

catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)  


16:06:02 | INFO     | Using profile: tutorial
16:06:02 | INFO     | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
16:06:02 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
16:06:02 | INFO     | No catalog activated.
16:06:02 | INFO     | Catalog activated: Grocery Dataset Tutorial


Let's look at the descriptive statistics of the invoice `Amount` column.

In [2]:
invoice_table = catalog.get_table("GROCERYINVOICE")

In [3]:
invoice_table.Amount.describe()


| | Amount |
|---|---|
| dtype | FLOAT |
| unique | 6647 |
| %missing | 0.0 |
| %empty | |
| entropy | |
| top | 1 |
| freq | 1043.0 |
| mean | 19.165033 |
| std | 23.732982 |
| min | 0.0 |


Although the Amount column doesn't present any anomalies, we will set default cleaning operations to ensure that if any issues happen in the future, the data will remain clean.

We will set the following cleaning operations:

* treat the disguised missing values -99 and -98 as missing
* replace any amount below 0 euros with 0
* cap any amount above 2000 euros at 2000

These operations are applied by default when a [view](https://docs.featurebyte.com/latest/about/glossary/#view) is created from the table.
You can, however, override them by creating a view in [manual mode](https://docs.featurebyte.com/latest/reference/featurebyte.api.event_table.EventTable.get_view/), or access a view's [raw data](https://docs.featurebyte.com/latest/reference/core/view/#accessing-source-table-raw-data-in-views).

In [4]:
invoice_table["Amount"].update_critical_data_info(
    cleaning_operations=[
        # Treat the sentinel values -99 and -98 as missing
        fb.DisguisedValueImputation(disguised_values=[-99, -98], imputed_value=None),
        # Replace negative amounts with 0
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
        # Cap amounts above 2000 at 2000
        fb.ValueBeyondEndpointImputation(
            type="greater_than", end_point=2000, imputed_value=2000
        ),
    ]
)
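To make the effect of these three operations concrete, the sketch below mimics them with plain pandas on a handful of made-up amounts. It is not FeatureByte's implementation. The helper `clean_amount` is hypothetical, and the sketch assumes the operations are applied in the order they are listed.

```python
import pandas as pd


def clean_amount(amounts: pd.Series) -> pd.Series:
    """Mimic the three cleaning operations with plain pandas, in listed order."""
    # 1. Disguised missing values -99 and -98 become real missing values (NaN).
    cleaned = amounts.mask(amounts.isin([-99, -98]))
    # 2. Any remaining value below 0 is replaced by 0.
    cleaned = cleaned.mask(cleaned < 0, 0)
    # 3. Any value above 2000 is replaced by 2000.
    cleaned = cleaned.mask(cleaned > 2000, 2000)
    return cleaned


# Made-up sample values chosen to trigger every rule at least once.
sample = pd.Series([12.5, -99, -5.0, 2500.0, -98, 0.0])
result = clean_amount(sample)
print(result.tolist())
```

In this sample the two sentinels end up as missing values, `-5.0` is raised to `0`, `2500.0` is capped at `2000`, and the in-range amounts pass through unchanged. Order matters in this sketch: if the lower-bound rule ran first, the sentinels would be turned into `0` before they could be marked as missing.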

If we look at `columns_info`, we'll see that `critical_data_info` for the `Amount` column is now populated with the cleaning operations.

In [5]:
import pandas as pd
pd.DataFrame(invoice_table.info(verbose=True)["columns_info"])

| | name | dtype | entity | semantic | critical_data_info | description |
|---|---|---|---|---|---|---|
| 0 | GroceryInvoiceGuid | VARCHAR | invoice | event_id | | Unique identifier of each row in the table, in... |
| 1 | GroceryCustomerGuid | VARCHAR | customer | | | Unique identifier for each customer, in GUID f... |
| 2 | Timestamp | TIMESTAMP | | event_timestamp | | The GMT timestamp of when this invoice transac... |
| 3 | tz_offset | VARCHAR | | time_zone | | The local timezone offset of the invoice event. |
| 4 | record_available_at | TIMESTAMP | | record_creation_timestamp | | A timestamp for when this row was added to the... |
| 5 | Amount | FLOAT | | | {'cleaning_operations': [{'imputed_value': Non... | The total amount of the invoice, including all... |


That's all! From now on, every view created from the invoice table applies these cleaning operations by default, so we can be confident that no undesirable values will slip through.

### To learn more, refer to the following materials:
- [Cleaning Operations](https://docs.featurebyte.com/latest/about/glossary/#cleaning-operations)
- [Views](https://docs.featurebyte.com/latest/about/glossary/#views-and-column-transforms)

#### SDK reference for
- [Table.get_view()](https://docs.featurebyte.com/latest/reference/featurebyte.api.event_table.EventTable.get_view/)
- [TableColumn.update_critical_data_info()](https://docs.featurebyte.com/latest/reference/featurebyte.api.base_table.TableColumn.update_critical_data_info/)