Table

A Table object provides a centralized location for metadata about a source table. This metadata determines the type of operations that can be applied to the table's views and includes essential information for feature engineering.

Important

A source table can only be associated with one active Table object in a catalog at a time. This means that the active Table object in the catalog is the source of truth for the metadata of the source table. If a Table object becomes deprecated, a new Table object can be registered with the same source table.
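
The one-active-table rule can be pictured with a small toy registry. This is only an illustrative sketch in plain Python, not FeatureByte code: the class TableRegistry and its methods are hypothetical, and it assumes that "active" means any status other than 'DEPRECATED'.

```python
class TableRegistry:
    """Toy registry that allows one active table per source table (hypothetical)."""

    def __init__(self):
        # Maps a source table name to the list of table records registered on it.
        self._records = {}

    def register(self, source_table, table_name):
        records = self._records.setdefault(source_table, [])
        # Assumption: a table counts as active unless it has been deprecated.
        if any(rec["status"] != "DEPRECATED" for rec in records):
            raise ValueError(f"{source_table} already has an active table")
        record = {"name": table_name, "status": "PUBLIC_DRAFT"}
        records.append(record)
        return record

    def deprecate(self, source_table, table_name):
        for rec in self._records.get(source_table, []):
            if rec["name"] == table_name:
                rec["status"] = "DEPRECATED"


registry = TableRegistry()
first = registry.register("GROCERY.GROCERYINVOICE", "GROCERYINVOICE")

try:
    registry.register("GROCERY.GROCERYINVOICE", "GROCERYINVOICE_V2")
    second_registration_blocked = False
except ValueError:
    second_registration_blocked = True

# Once the first table is deprecated, the source table can be registered again.
registry.deprecate("GROCERY.GROCERYINVOICE", "GROCERYINVOICE")
replacement = registry.register("GROCERY.GROCERYINVOICE", "GROCERYINVOICE_V2")
```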

Registering Tables

Before registering tables, ensure that the catalog you want to work with is active.

import featurebyte as fb
catalog = fb.Catalog.activate(<catalog_name>)

Select the source table you are interested in.

ds = fb.FeatureStore.get("playground").get_data_source()
source_table = ds.get_source_table(
    database_name="spark_catalog",
    schema_name="GROCERY",
    table_name="GROCERYINVOICE"
)

To create a Table object from a SourceTable object, use the creation method that matches the type of data contained in the source table.

Registering a table according to its type determines the types of feature engineering operations that are possible on the table's views and enforces guardrails accordingly.

Example of registering an event table using the create_event_table() method:

invoice_table = source_table.create_event_table(
    name="GROCERYINVOICE",
    event_id_column="GroceryInvoiceGuid",
    event_timestamp_column="Timestamp",
    event_timestamp_timezone_offset_column="tz_offset",
    record_creation_timestamp_column="record_available_at"
)

Implementing Default Job Settings for Consistency

A default feature job setting is established at the table level to help streamline the configuration of feature job settings for features and ensure consistency across features developed by different team members. For an EventTable, the default feature job setting can be initialized using an automated analysis of the table data's availability and freshness. This analysis depends on the presence of record creation timestamps in the source table that are typically included during data warehouse updates.

The initialization of the default feature job setting is done using the initialize_default_feature_job_setting() method:

invoice_table.initialize_default_feature_job_setting()

Note

ItemTable objects inherit the default feature job setting from their related EventTable objects. For Views that originate from SCDTable objects, features that require aggregation operations have a default feature job setting that executes daily, aligning with the view's creation time.

To help you manage the default feature job settings, you can perform the following actions:

import pandas as pd

# Create a new analysis with a specific time period
analysis = invoice_table.create_new_feature_job_setting_analysis(
    analysis_date=pd.Timestamp('2023-04-10'),
    analysis_length=3600*24*28,
)
# List previous analyses
invoice_table.list_feature_job_setting_analysis()
# Retrieve a specific analysis
analysis = fb.FeatureJobSettingAnalysis.get_by_id(<analysis_id>)
# Backtest a manual setting
manual_setting = fb.FeatureJobSetting(
    blind_spot="135s",
    frequency="60m",
    time_modulo_frequency="90s",
)
backtest_result = analysis.backtest(feature_job_setting=manual_setting)
# Update the default feature job setting
invoice_table.update_default_feature_job_setting(manual_setting)
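
To make the three parameters of the manual setting concrete, here is a minimal sketch in plain Python that does not call FeatureByte. It assumes that a job runs whenever the Unix time in seconds, divided by the frequency, leaves a remainder equal to the time modulo frequency, and that each job only uses events older than the blind spot. The function feature_job_times and its arguments are hypothetical, and the start time is assumed to fall on a whole second.

```python
from datetime import datetime, timedelta, timezone


def feature_job_times(start, frequency_s, time_modulo_frequency_s, blind_spot_s, n_jobs=3):
    """Return (job_time, data_cutoff) pairs for the first n_jobs jobs at or after start."""
    start_epoch = int(start.timestamp())
    # Smallest shift that gives the job time the required remainder modulo the frequency.
    shift = (time_modulo_frequency_s - start_epoch) % frequency_s
    first_job = start + timedelta(seconds=shift)
    schedule = []
    for i in range(n_jobs):
        job_time = first_job + timedelta(seconds=i * frequency_s)
        # Events newer than the cutoff fall inside the blind spot and are ignored.
        cutoff = job_time - timedelta(seconds=blind_spot_s)
        schedule.append((job_time, cutoff))
    return schedule


# Same values as the manual setting: frequency 60m, offset 90s, blind spot 135s.
start = datetime(2023, 4, 10, tzinfo=timezone.utc)
schedule = feature_job_times(start, 3600, 90, 135)
for job_time, cutoff in schedule:
    print(job_time.isoformat(), "uses data up to", cutoff.isoformat())
```

Under these assumptions the jobs run hourly at 90 seconds past the hour, and each one reads data up to 135 seconds before its own run time.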

Enhancing Feature Engineering with Metadata

Optionally, you can include additional metadata at the column level after creating a table to support feature engineering further.

This could involve identifying columns that reference specific entities using the as_entity method:

# Tag the entities for the grocery invoice table
invoice_table.GroceryInvoiceGuid.as_entity("groceryinvoice")
invoice_table.GroceryCustomerGuid.as_entity("grocerycustomer")

This could also involve defining default cleaning operations using the update_critical_data_info method:

# Discount amount should not be negative
# (items_table is assumed to be a previously registered item table)
items_table.Discount.update_critical_data_info(
    cleaning_operations=[
        fb.MissingValueImputation(imputed_value=0),
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
    ]
)
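
The effect of the two cleaning operations above can be illustrated on a plain Python list. This is a sketch of the intended outcome only, not of how FeatureByte applies the operations internally; the function clean_discount and its parameters are hypothetical.

```python
def clean_discount(values, missing_value=0, end_point=0, floor_value=0):
    """Impute missing values, then replace values below end_point."""
    cleaned = []
    for value in values:
        if value is None:
            # Mirrors the missing value imputation step.
            cleaned.append(missing_value)
        elif value < end_point:
            # Mirrors the "less_than" endpoint imputation step.
            cleaned.append(floor_value)
        else:
            cleaned.append(value)
    return cleaned


raw_discounts = [1.5, None, -2.0, 0.0, 3.25]
cleaned_discounts = clean_discount(raw_discounts)
print(cleaned_discounts)
```

The missing entry and the negative entry both become 0, while valid discounts pass through unchanged.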

For more details, refer to the TableColumn documentation page.

Managing Table Status

When a table is created, it is automatically added to the active catalog with its status set to 'PUBLIC_DRAFT'. Once the table is prepared for feature engineering, you can modify its status to 'PUBLISHED'.

Note

If a table needs to be deprecated, update its status to 'DEPRECATED'. After deprecating a table, a new Table object can be registered with the same source table.

To obtain the current status of a table, use the status property. To change the status, use the update_status() method:

print(invoice_table.status)
invoice_table.update_status("PUBLISHED")

Accessing a Table from the Catalog

Existing tables can be accessed through the catalog using the list_tables() and get_table() methods.

# List tables in the catalog
catalog.list_tables()
# Retrieve a table
invoice_table = catalog.get_table("GROCERYINVOICE")

You can also retrieve a Table object by its Object ID with the get_table_by_id() method.

table = catalog.get_table_by_id("TableID")

Exploring a Table

To explore a table, you can:

  • obtain detailed information using the info() method
  • acquire descriptive statistics using the describe() method
  • obtain a selection of rows using the preview() method
  • obtain a larger random selection of rows based on a specified time range, size, and seed using the sample() method

# Obtain detailed information on a table
invoice_table.info()
# Acquire descriptive statistics for a table
invoice_table.describe()
# Obtain a selection of table rows
df = invoice_table.preview(limit=20)
# Obtain a random selection of table rows based on a specified time range, size, and seed
df = invoice_table.sample(
    from_timestamp=pd.Timestamp('2023-04-01'),
    to_timestamp=pd.Timestamp('2023-05-01'),
    size=100, seed=23
)
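
The combination of a time range, a size, and a seed can be pictured with the standard library alone. The sketch below is an assumption about the general idea (filter by timestamp, then draw a reproducible random subset), not FeatureByte's implementation; the function sample_rows is hypothetical, and treating the upper bound as exclusive is an arbitrary choice made for the example.

```python
import random
from datetime import datetime, timedelta


def sample_rows(rows, from_timestamp, to_timestamp, size, seed):
    """Keep rows inside the time range, then draw a seeded random sample."""
    # Assumption: lower bound inclusive, upper bound exclusive.
    in_range = [r for r in rows if from_timestamp <= r["Timestamp"] < to_timestamp]
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(in_range, min(size, len(in_range)))


# Toy data: one row per day, starting 2023-03-15.
rows = [{"id": i, "Timestamp": datetime(2023, 3, 15) + timedelta(days=i)} for i in range(60)]

sample_a = sample_rows(rows, datetime(2023, 4, 1), datetime(2023, 5, 1), size=10, seed=23)
sample_b = sample_rows(rows, datetime(2023, 4, 1), datetime(2023, 5, 1), size=10, seed=23)
```

Calling the function twice with the same seed returns the same rows, which is the property that makes seeded samples useful for repeatable exploration.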

By default, statistics and materialized rows are computed before the cleaning operations defined at the table level are applied. To include these cleaning operations, set the after_cleaning parameter to True.

invoice_table.describe(after_cleaning=True)

Creating Views to Prepare Data Before Defining Features

To prepare data before defining features, View objects are created from Table objects using the get_view method.

customer_table = catalog.get_table("GROCERYCUSTOMER")
invoice_view = invoice_table.get_view()

Besides EventView, ItemView, DimensionView, and SCDView, one more type of view can be created from an SCDTable: the Change View. A Change View provides a way to analyze changes to a specific attribute for each natural key of the SCD table. To get a Change View, use the get_change_view method:

address_changed_view = customer_table.get_change_view(
    track_changes_column="StreetAddress"
)
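
Conceptually, tracking changes in a column means pairing each record with the previous value recorded for the same natural key. The sketch below shows that idea in plain Python; it is not FeatureByte's implementation, and the function change_rows, the ValidFrom column, and the output field names are hypothetical. It also assumes that the first record of each key counts as a change from no value.

```python
from datetime import datetime


def change_rows(scd_rows, natural_key, effective_ts, tracked):
    """List value changes of the tracked column, per natural key, in time order."""
    ordered = sorted(scd_rows, key=lambda r: (r[natural_key], r[effective_ts]))
    last_seen = {}  # most recent row seen for each natural key
    changes = []
    for row in ordered:
        key = row[natural_key]
        previous = last_seen.get(key)
        past_value = previous[tracked] if previous else None
        # Record a change for the first row of a key or when the value differs.
        if previous is None or past_value != row[tracked]:
            changes.append({
                natural_key: key,
                "change_timestamp": row[effective_ts],
                "past_" + tracked: past_value,
                "new_" + tracked: row[tracked],
            })
        last_seen[key] = row
    return changes


scd_rows = [
    {"GroceryCustomerGuid": "c1", "ValidFrom": datetime(2022, 1, 1), "StreetAddress": "1 Main St"},
    {"GroceryCustomerGuid": "c2", "ValidFrom": datetime(2022, 3, 1), "StreetAddress": "5 Elm Rd"},
    {"GroceryCustomerGuid": "c1", "ValidFrom": datetime(2022, 6, 1), "StreetAddress": "9 Oak Ave"},
]
changes = change_rows(scd_rows, "GroceryCustomerGuid", "ValidFrom", "StreetAddress")
for change in changes:
    print(change)
```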