TableColumn
A TableColumn object represents a column within a Table object. You can add metadata to TableColumn objects to help with feature engineering, such as tagging columns with entity references or defining default data cleaning operations.
Entity Tagging
To tag a column with an entity reference, call the as_entity() method with the entity name:
invoice_table = catalog.get_table("GROCERYINVOICE")
# Tag the entities for the grocery invoice table
invoice_table.GroceryInvoiceGuid.as_entity("groceryinvoice")
invoice_table.GroceryCustomerGuid.as_entity("grocerycustomer")
To remove an entity tag, call as_entity() with None as the entity name.
Defining Default Cleaning Operations
You can establish default cleaning operations that automatically apply when creating views from a table, ensuring data consistency and accuracy.
For each column, define an ordered sequence of cleaning operations. The operations are applied in that order, so make sure that a value imputed by an earlier step is not flagged for cleaning by a later one.
Use the following constructors to define each cleaning operation:
MissingValueImputation
: Imputes missing values.
DisguisedValueImputation
: Imputes disguised values from a list.
UnexpectedValueImputation
: Imputes unexpected values not found in a given list.
ValueBeyondEndpointImputation
: Imputes numeric or date values outside specified boundaries.
StringValueImputation
: Imputes string values.
Note
If the imputed_value parameter is None, the values to impute are replaced with missing values, and the corresponding rows are ignored during aggregation operations.
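As a rough illustration of this note, the plain-Python sketch below replaces a disguised value with None and then aggregates while skipping the missing entries. It is not FeatureByte code: the helpers impute_disguised and mean_ignoring_missing and the sample values are hypothetical, and the sketch only mimics the semantics described above.

```python
def impute_disguised(values, disguised_values, imputed_value=None):
    """Replace every value found in disguised_values with imputed_value.

    An imputed_value of None marks the entry as missing.
    """
    return [imputed_value if v in disguised_values else v for v in values]


def mean_ignoring_missing(values):
    """Average the present values; entries holding None do not contribute."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical column where -999 stands in for "unknown"
ages = [34, -999, 52]
cleaned = impute_disguised(ages, disguised_values=[-999], imputed_value=None)
print(cleaned)                         # [34, None, 52]
print(mean_ignoring_missing(cleaned))  # 43.0
```

Had -999 been left in place, it would have dragged the average far below any real value, which is why turning it into a missing value is often the safer choice.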
To set the default cleaning operations for a column, use the update_critical_data_info() method:
import featurebyte as fb

items_table = catalog.get_table("INVOICEITEMS")
# Discount amount should not be negative
items_table.Discount.update_critical_data_info(
    cleaning_operations=[
        # Step 1: replace missing discounts with 0
        fb.MissingValueImputation(imputed_value=0),
        # Step 2: replace discounts below 0 with 0
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
    ]
)
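To make the effect of this configuration concrete, here is a plain-Python sketch of the two steps applied in order to a few made-up discount values. It only mimics the intended semantics and is not how FeatureByte executes the operations; all function names are hypothetical.

```python
def impute_missing(values, imputed_value):
    """Step 1: replace missing (None) entries with imputed_value."""
    return [imputed_value if v is None else v for v in values]


def impute_less_than(values, end_point, imputed_value):
    """Step 2: replace entries strictly below end_point with imputed_value."""
    return [
        imputed_value if v is not None and v < end_point else v
        for v in values
    ]


def clean_discount(values):
    """Apply both steps in the same order as the cleaning_operations list."""
    step1 = impute_missing(values, imputed_value=0)
    return impute_less_than(step1, end_point=0, imputed_value=0)


raw = [1.5, None, -3.0, 0.0]
print(clean_discount(raw))  # [1.5, 0, 0, 0.0]
```

Because 0 is not below the end point of 0, the value imputed in the first step is left alone by the second step, so the two operations do not interfere with each other.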
Exploring TableColumn
To list the columns in a Table object, use the columns property.
To display column specifications, including tagged entity IDs and default cleaning operations, use the columns_info property.
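A minimal sketch of both properties in use. The real calls need a live catalog, so the helper below accepts any object that exposes columns and columns_info and therefore also runs against a stand-in. The helper name summarize_table is hypothetical, and the shape of columns_info is simply whatever the object returns.

```python
def summarize_table(table):
    """Collect column names and column specifications from a Table-like object."""
    return {
        # Names of the columns in the table
        "columns": list(table.columns),
        # Column specifications (tagged entities, default cleaning operations)
        "columns_info": table.columns_info,
    }
```

With the table loaded earlier, this would be called as summarize_table(invoice_table). In an interactive session the same information is available directly as invoice_table.columns and invoice_table.columns_info.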
To obtain descriptive statistics for a TableColumn, use the describe() method.
To materialize a selection of rows, use the preview() or sample() method:
import pandas as pd

# Sample up to 100 rows between the two timestamps, with a fixed seed
df = invoice_table.Amount.sample(
    from_timestamp=pd.Timestamp('2023-04-01'),
    to_timestamp=pd.Timestamp('2023-05-01'),
    size=100, seed=23
)
By default, statistics and materialized rows are computed before cleaning operations are applied. To include cleaning operations in the output, set the after_cleaning parameter to True.
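One way to see what the cleaning operations change is to request the statistics twice, once with the default and once with after_cleaning=True. The sketch below assumes only that the column object has a describe() method accepting that keyword, as described above; the helper name describe_before_and_after is hypothetical.

```python
def describe_before_and_after(column):
    """Return (raw, cleaned) statistics from a TableColumn-like object."""
    raw = column.describe()                         # default: before cleaning
    cleaned = column.describe(after_cleaning=True)  # cleaning operations applied
    return raw, cleaned
```

For example, describe_before_and_after(items_table.Discount) would return both summaries for the Discount column configured earlier. Per the text above, the same after_cleaning parameter applies to preview() and sample().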
Updating Description
Table and column descriptions are fetched automatically from your data warehouse when they are available. If a description is missing or incomplete, you can edit it.
To see the description of a column in a Table object, use the description property.
To update the description of a column in a Table object, use the update_description() method.
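The two can be combined to fill in a description only where none exists yet. This is a small sketch, assuming the column object exposes the description property and an update_description() method that takes the new text as its argument; the helper name ensure_description and the example text are hypothetical.

```python
def ensure_description(column, fallback):
    """Set `fallback` as the description if the column has none.

    Returns True when an update was made, False when a description
    already existed and was left untouched.
    """
    if column.description:  # a non-empty description is already present
        return False
    column.update_description(fallback)
    return True
```

A call such as ensure_description(invoice_table.Amount, "Total amount of the invoice") would leave a description fetched from the data warehouse in place and only write the fallback when the column has no description.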