A Table object provides a centralized location for metadata about a source table. This metadata determines the type of operations that can be applied to the table's views and includes essential information for feature engineering.
A source table can only be associated with one active Table object in a catalog at a time. This means that the active Table object in the catalog is the source of truth for the metadata of the source table. If a Table object becomes deprecated, a new Table object can be registered with the same source table.
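The one-active-Table rule can be pictured with a small, self-contained sketch. It does not use the FeatureByte SDK: `ToyCatalog` and its methods are hypothetical names introduced only to illustrate the constraint, not the way a catalog is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a catalog (illustration only)."""

    # Maps a source table name to the name of its single active Table object.
    active: dict = field(default_factory=dict)
    # Names of Table objects that have been deprecated.
    deprecated: list = field(default_factory=list)

    def register(self, source_table: str, table_name: str) -> None:
        # Reject a second active Table object for the same source table.
        if source_table in self.active:
            raise ValueError(
                f"{source_table} already has an active Table: "
                f"{self.active[source_table]}"
            )
        self.active[source_table] = table_name

    def deprecate(self, source_table: str) -> None:
        # Deprecating the active Table frees the source table for re-registration.
        self.deprecated.append(self.active.pop(source_table))


catalog = ToyCatalog()
catalog.register("GROCERYINVOICE", "GROCERYINVOICE")

try:
    catalog.register("GROCERYINVOICE", "GROCERYINVOICE_V2")
    second_registration_allowed = True
except ValueError:
    second_registration_allowed = False

# After deprecation, the same source table can back a new Table object.
catalog.deprecate("GROCERYINVOICE")
catalog.register("GROCERYINVOICE", "GROCERYINVOICE_V2")
```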
Before registering tables, ensure that the catalog you want to work with is active.
Select the source table you are interested in.
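    import featurebyte as fb

    # Get the data source of the "playground" feature store
    ds = fb.FeatureStore.get("playground").get_data_source()

    # Select the source table to register
    source_table = ds.get_source_table(
        database_name="spark_catalog",
        schema_name="GROCERY",
        table_name="GROCERYINVOICE",
    )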
To create Table objects from a SourceTable object, you must use specific methods depending on the type of data contained in the source table:
- create_event_table(): creates an EventTable object from a source table, where each row indicates a unique business event occurring at a particular time.
- create_item_table(): creates an ItemTable object from a source table containing detailed information about a specific business event.
- create_dimension_table(): creates a DimensionTable object from a source table containing static descriptive data.
- create_scd_table(): creates an SCDTable object from a source table containing data that changes slowly and unpredictably over time, known as a Slowly Changing Dimension (SCD) table.
Registering a table according to its type determines the types of feature engineering operations that are possible on the table's views and enforces guardrails accordingly.
Example of registering an event table using the create_event_table() method:

    invoice_table = source_table.create_event_table(
        name="GROCERYINVOICE",
        event_id_column="GroceryInvoiceGuid",
        event_timestamp_column="Timestamp",
        event_timestamp_timezone_offset_column="tz_offset",
        record_creation_timestamp_column="record_available_at",
    )
Implementing Default Job Settings for Consistency
A default feature job setting is established at the table level to help streamline the configuration of feature job settings for features and ensure consistency across features developed by different team members. For an EventTable, the default feature job setting can be initialized using an automated analysis of the table data's availability and freshness. This analysis depends on the presence of record creation timestamps in the source table that are typically included during data warehouse updates.
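As a rough illustration of why record creation timestamps matter, the sketch below measures how long records take to become available after the event they describe. It is plain Python rather than the SDK, all timestamps are made up, and `tolerance` is a hypothetical parameter. The real analysis is presumably more involved than this.

```python
from datetime import datetime, timedelta

# Hypothetical (event_timestamp, record_creation_timestamp) pairs.
records = [
    (datetime(2023, 4, 10, 9, 58, 0), datetime(2023, 4, 10, 10, 1, 0)),
    (datetime(2023, 4, 10, 10, 20, 0), datetime(2023, 4, 10, 10, 21, 30)),
    (datetime(2023, 4, 10, 10, 45, 0), datetime(2023, 4, 10, 10, 47, 0)),
    (datetime(2023, 4, 10, 11, 5, 0), datetime(2023, 4, 10, 11, 5, 45)),
]

# Delay between an event occurring and its record becoming available.
delays = [created - event for event, created in records]
max_delay = max(delays)

# Hypothetical tolerance: how long a job waits before treating data as complete.
tolerance = timedelta(seconds=135)
late_records = sum(1 for delay in delays if delay > tolerance)

print(f"max delay: {max_delay}, records later than tolerance: {late_records}")
```

A tolerance shorter than the longest observed delay means some records would not yet be available when a job reads the table, which is the kind of trade-off such an analysis is meant to surface.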
The initialization of the default feature job setting is done using the
ItemTable objects inherit the default feature job setting from their related EventTable objects. For Views that originate from SCDTable objects, features that require aggregation operations have a default feature job setting that executes daily, aligning with the view's creation time.
To help you manage the default feature job settings, you can perform the following actions:
- Execute a new analysis using the create_new_feature_job_setting_analysis() method, or view previous analyses using the list_feature_job_setting_analysis() method of an EventTable object,
- Obtain an analysis using FeatureJobSettingAnalysis.get_by_id(),
- Create a custom setting using the FeatureJobSetting class,
- Perform backtests on custom settings with the backtest() method of an analysis,
- Manually update the default feature job setting of an EventTable object using the update_default_feature_job_setting() method.
    import pandas as pd

    # Create a new analysis with a specific time period
    analysis = invoice_table.create_new_feature_job_setting_analysis(
        analysis_date=pd.Timestamp('2023-04-10'),
        analysis_length=3600*24*28,
    )

    # List previous analyses
    invoice_table.list_feature_job_setting_analysis()

    # Retrieve a specific analysis
    # (analysis_id is a placeholder for the ID of an existing analysis)
    analysis = fb.FeatureJobSettingAnalysis.get_by_id(analysis_id)

    # Backtest a manual setting
    manual_setting = fb.FeatureJobSetting(
        blind_spot="135s",
        frequency="60m",
        time_modulo_frequency="90s",
    )
    backtest_result = analysis.backtest(feature_job_setting=manual_setting)

    # Update the default feature job setting
    invoice_table.update_default_feature_job_setting(manual_setting)
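To make the three parameters of the manual setting concrete, here is a plain-Python sketch of the timing arithmetic. It rests on an assumption that is not stated in this page: jobs run once per `frequency`, shifted by `time_modulo_frequency` past each frequency boundary, and a job ignores events newer than its start time minus `blind_spot`. The helper `latest_job_time` is hypothetical and not part of the SDK.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the manual setting: 60m, 90s, 135s.
frequency = timedelta(minutes=60)
time_modulo_frequency = timedelta(seconds=90)
blind_spot = timedelta(seconds=135)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def latest_job_time(now: datetime) -> datetime:
    """Most recent scheduled job time at or before `now` (assumed semantics)."""
    elapsed = now - EPOCH
    # Whole periods completed once the offset is removed.
    periods = (elapsed - time_modulo_frequency) // frequency
    return EPOCH + periods * frequency + time_modulo_frequency


now = datetime(2023, 4, 10, 10, 30, tzinfo=timezone.utc)
job_time = latest_job_time(now)

# Under the assumption above, events after this cutoff are left to the next job.
cutoff = job_time - blind_spot

print(f"job ran at {job_time.time()}, using events up to {cutoff.time()}")
```

With these values the most recent job before 10:30 UTC starts 90 seconds past the hour, and its cutoff falls 135 seconds earlier.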
Enhancing Feature Engineering with Metadata
Optionally, you can include additional metadata at the column level after creating a table to support feature engineering further.
This could involve identifying columns that reference specific entities using the as_entity() method:

    # Tag the entities for the grocery invoice table
    invoice_table.GroceryInvoiceGuid.as_entity("groceryinvoice")
    invoice_table.GroceryCustomerGuid.as_entity("grocerycustomer")
This could also involve defining default cleaning operations using the update_critical_data_info() method:

    # Discount amount should not be negative
    items_table.Discount.update_critical_data_info(
        cleaning_operations=[
            fb.MissingValueImputation(imputed_value=0),
            fb.ValueBeyondEndpointImputation(
                type="less_than", end_point=0, imputed_value=0
            ),
        ]
    )
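The combined effect of the two cleaning operations on the Discount column can be sketched in plain Python, without the SDK. The function `clean_discount` is a hypothetical stand-in, and it assumes the operations are applied in the listed order: missing values become 0, then values below 0 become 0.

```python
def clean_discount(values):
    """Hypothetical illustration of the two cleaning operations, in order."""
    cleaned = []
    for value in values:
        # Missing value imputation: replace missing entries with 0.
        if value is None:
            value = 0
        # Value beyond endpoint ("less_than", end_point=0): replace negatives with 0.
        if value < 0:
            value = 0
        cleaned.append(value)
    return cleaned


raw_discounts = [1.5, None, -2.0, 0.0]
cleaned_discounts = clean_discount(raw_discounts)
print(cleaned_discounts)
```

Valid values such as 1.5 and 0.0 pass through unchanged; only the missing and negative entries are replaced.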
For more details, refer to the TableColumn documentation page.
Managing Table Status
When a table is created, it is automatically added to the active catalog with its status set to 'PUBLIC_DRAFT'. Once the table is prepared for feature engineering, you can modify its status to 'PUBLISHED'.
If a table needs to be deprecated, update its status to 'DEPRECATED'.
After deprecating a table:
- the readiness state of features using the table cannot be promoted to "PRODUCTION_READY",
- you can create a new Table using the same source table but a different name.
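The status lifecycle can be summarized as a tiny state machine. This is a conceptual sketch only: the transition map `ALLOWED` and the helpers `change_status` and `blocks_production_ready` are hypothetical names, and the exact set of transitions the SDK permits is an assumption.

```python
from enum import Enum


class TableStatus(Enum):
    PUBLIC_DRAFT = "PUBLIC_DRAFT"
    PUBLISHED = "PUBLISHED"
    DEPRECATED = "DEPRECATED"


# Assumed transition map, for illustration only.
ALLOWED = {
    TableStatus.PUBLIC_DRAFT: {TableStatus.PUBLISHED, TableStatus.DEPRECATED},
    TableStatus.PUBLISHED: {TableStatus.DEPRECATED},
    TableStatus.DEPRECATED: set(),
}


def change_status(current: TableStatus, new: TableStatus) -> TableStatus:
    # Refuse any transition that is not in the assumed map.
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {new.value}")
    return new


def blocks_production_ready(status: TableStatus) -> bool:
    # Features using a deprecated table cannot be promoted to PRODUCTION_READY.
    return status is TableStatus.DEPRECATED


status = TableStatus.PUBLIC_DRAFT  # status assigned when the table is created
status = change_status(status, TableStatus.PUBLISHED)
blocked_when_published = blocks_production_ready(status)

status = change_status(status, TableStatus.DEPRECATED)
blocked_when_deprecated = blocks_production_ready(status)
```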
To obtain the current status of a table, use the status property. To change the status, use the
Accessing a Table from the Catalog
Existing tables can be accessed through the catalog using the list_tables() and get_table() methods:

    # List tables in the catalog
    catalog.list_tables()

    # Retrieve a table
    invoice_table = catalog.get_table("GROCERYINVOICE")
You can also retrieve a Table object using its Object ID using the
Exploring a Table
To explore a table, you can:
- obtain detailed information using the info() method,
- acquire descriptive statistics using the describe() method,
- obtain a selection of rows using the preview() method,
- obtain a larger random selection of rows based on a specified time range, size, and seed using the sample() method.
    # Obtain detailed information on a table
    invoice_table.info()

    # Acquire descriptive statistics for a table
    invoice_table.describe()

    # Obtain a selection of table rows
    df = invoice_table.preview(limit=20)

    # Obtain a random selection of table rows based on a specified time range, size, and seed
    df = invoice_table.sample(
        from_timestamp=pd.Timestamp('2023-04-01'),
        to_timestamp=pd.Timestamp('2023-05-01'),
        size=100,
        seed=23,
    )
By default, the statistics and materialization are computed before applying cleaning operations defined at the table level. To include these cleaning operations, set the after_cleaning parameter to True.
Creating Views to Prepare Data Before Defining Features
To prepare data before defining features, View objects are created from Table objects using the
Besides EventView, ItemView, DimensionView, and SCDView, another type of view can be created from an SCDTable: Change Views. These views provide a way to analyze changes happening in a specific attribute within the natural key of the SCD table. To get a Change view, use the