TableColumn

A TableColumn object represents a column within a Table object. You can add metadata to TableColumn objects to help with feature engineering, such as tagging columns with entity references or defining default data cleaning operations.

Entity Tagging

To tag a column with an entity reference, call the as_entity() method with the entity name.

invoice_table = catalog.get_table("GROCERYINVOICE")
# Tag the entities for the grocery invoice table
invoice_table.GroceryInvoiceGuid.as_entity("groceryinvoice")
invoice_table.GroceryCustomerGuid.as_entity("grocerycustomer")

To remove an entity tag, set the entity name to None.

invoice_table.GroceryCustomerGuid.as_entity(None)

Defining Default Cleaning Operations

You can establish default cleaning operations that automatically apply when creating views from a table, ensuring data consistency and accuracy.

For a specific column, define an ordered sequence of cleaning operations. Ensure that values imputed in earlier steps are not marked for cleaning in later operations.
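
The ordering caveat can be illustrated without FeatureByte. The following is a minimal pandas sketch, not FeatureByte's implementation: the helpers impute_missing and impute_less_than are hypothetical stand-ins for a missing-value step and a "less than end point" step. It shows how a value written by an earlier step can be cleaned again by a later one.

```python
import pandas as pd


def impute_missing(values: pd.Series, imputed_value) -> pd.Series:
    # Hypothetical stand-in for a missing-value imputation step.
    return values.fillna(imputed_value)


def impute_less_than(values: pd.Series, end_point, imputed_value) -> pd.Series:
    # Hypothetical stand-in for a "less than end point" imputation step.
    # Missing values compare as False, so they are left untouched here.
    return values.mask(values < end_point, imputed_value)


raw = pd.Series([5.0, None, -2.0])

# Conflicting sequence: the -1 sentinel written by the first step is below
# the end point, so the second step cleans it again and the sentinel is lost.
conflicting = impute_less_than(
    impute_missing(raw, imputed_value=-1), end_point=0, imputed_value=0
)

# Reordered sequence: negatives are handled first, then missing values are
# filled, so the sentinel survives.
reordered = impute_missing(
    impute_less_than(raw, end_point=0, imputed_value=0), imputed_value=-1
)

print(conflicting.tolist())
print(reordered.tolist())
```

The two sequences use the same operations and differ only in order, which is why it is worth checking each imputed value against the conditions of every later step.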

Define each cleaning operation with an imputation constructor, for example fb.MissingValueImputation or fb.ValueBeyondEndpointImputation.

Note

If the imputed_value parameter is None, the values to impute are replaced with missing values, and the corresponding rows are ignored during aggregation operations.
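
In pandas terms, the behavior described in this note resembles replacing the flagged values with NaN, which sum, mean, and count skip by default. The sketch below relies on that analogy only; it is plain pandas rather than FeatureByte code, and the column names are made up.

```python
import numpy as np
import pandas as pd

items = pd.DataFrame(
    {
        "customer": ["a", "a", "b", "b"],
        "discount": [1.0, -3.0, 2.0, 4.0],
    }
)

# Treat negative discounts as invalid and replace them with missing values,
# analogous to an imputation step whose imputed_value is None.
items["discount_clean"] = items["discount"].mask(items["discount"] < 0, np.nan)

# pandas aggregations skip NaN by default, so the invalid row contributes
# nothing to the sum, the mean, or the count for customer "a".
stats = items.groupby("customer")["discount_clean"].agg(["sum", "mean", "count"])
print(stats)
```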

To set the default cleaning operations for a column, use the update_critical_data_info() method:

import featurebyte as fb

items_table = catalog.get_table("INVOICEITEMS")
# Discount amount should not be negative
items_table.Discount.update_critical_data_info(
    cleaning_operations=[
        fb.MissingValueImputation(imputed_value=0),
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
    ]
)

Exploring TableColumn

To list columns in a Table object, use the columns property:

display(items_table.columns)

To display column specifications, including tagged entity IDs and default cleaning operations, use the columns_info property:

display(items_table.columns_info)

To obtain TableColumn descriptive statistics, use the describe() method:

invoice_table.Amount.describe()

To materialize a selection of rows, use the preview() or sample() methods:

import pandas as pd

# First 20 rows of the column
df = invoice_table.Amount.preview(limit=20)

# Random sample of 100 rows within a time window
df = invoice_table.Amount.sample(
    from_timestamp=pd.Timestamp('2023-04-01'),
    to_timestamp=pd.Timestamp('2023-05-01'),
    size=100, seed=23
)

By default, describe(), preview(), and sample() operate on the raw values, before any cleaning operations are applied. To apply the default cleaning operations first, set the after_cleaning parameter to True:

invoice_table.Amount.describe(after_cleaning=True)
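
To see why the two outputs can differ, here is a pandas-only sketch (an illustration of the idea, not of FeatureByte internals) that computes statistics on a made-up column before and after a missing-value fill with 0 followed by replacing negative values with 0.

```python
import pandas as pd

raw = pd.Series([10.0, None, -5.0, 20.0], name="Amount")

# Statistics on the raw values: the missing entry is excluded from the
# count, and the negative value pulls the minimum below zero.
before = raw.describe()

# Fill missing values with 0, then replace values below 0 with 0.
cleaned = raw.fillna(0)
cleaned = cleaned.mask(cleaned < 0, 0)
after = cleaned.describe()

print(before[["count", "min", "mean"]])
print(after[["count", "min", "mean"]])
```

In this toy data the count, minimum, and mean all change once cleaning is applied, so it matters which of the two outputs a downstream decision is based on.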