Table
A Table object provides a centralized location for metadata about a source table. This metadata determines the type of operations that can be applied to the table's views and includes essential information for feature engineering.
Important
A source table can only be associated with one active Table object in a catalog at a time. This means that the active Table object in the catalog is the source of truth for the metadata of the source table. If a Table object becomes deprecated, a new Table object can be registered with the same source table.
Registering Tables
Before registering tables, ensure that the catalog you want to work with is active.
Select the source table you are interested in.
import featurebyte as fb

# Get the data source of the "playground" feature store
ds = fb.FeatureStore.get("playground").get_data_source()
source_table = ds.get_source_table(
    database_name="spark_catalog",
    schema_name="GROCERY",
    table_name="GROCERYINVOICE",
)
To create Table objects from a SourceTable object, you must use specific methods depending on the type of data contained in the source table:
- create_event_table(): creates an EventTable object from a source table, where each row indicates a unique business event occurring at a particular time.
- create_item_table(): creates an ItemTable object from a source table containing detailed information about a specific business event.
- create_dimension_table(): creates a DimensionTable object from a source table containing static descriptive data.
- create_scd_table(): creates an SCDTable object from a source table containing data that changes slowly and unpredictably over time, known as a Slowly Changing Dimension (SCD) table.
Registering a table according to its type determines the types of feature engineering operations that are possible on the table's views and enforces guardrails accordingly.
Example of registering an event table using the create_event_table() method:
invoice_table = source_table.create_event_table(
    name="GROCERYINVOICE",
    event_id_column="GroceryInvoiceGuid",
    event_timestamp_column="Timestamp",
    event_timestamp_timezone_offset_column="tz_offset",
    record_creation_timestamp_column="record_available_at",
)
Implementing Default Job Settings for Consistency
A default feature job setting is established at the table level to help streamline the configuration of feature job settings for features and ensure consistency across features developed by different team members. For an EventTable, the default feature job setting can be initialized using an automated analysis of the table data's availability and freshness. This analysis depends on the presence of record creation timestamps in the source table that are typically included during data warehouse updates.
The default feature job setting is initialized using the initialize_default_feature_job_setting() method:
Note
ItemTable objects inherit the default feature job setting from their related EventTable objects. For Views that originate from SCDTable objects, features that require aggregation operations have a default feature job setting that executes daily, aligning with the view's creation time.
To help you manage the default feature job settings, you can perform the following actions:
- Execute a new analysis using the create_new_feature_job_setting_analysis() method, or view previous analyses using the list_feature_job_setting_analysis() method, from an EventTable object.
- Obtain an analysis using the FeatureJobSettingAnalysis.get_by_id() class method.
- Create a custom setting using the FeatureJobSetting constructor.
- Perform backtests on custom settings with the backtest() method from an analysis.
- Manually update the default feature job setting of an EventTable object using the update_default_feature_job_setting() method.
import pandas as pd

# Create a new analysis with a specific time period
analysis = invoice_table.create_new_feature_job_setting_analysis(
    analysis_date=pd.Timestamp('2023-04-10'),
    analysis_length=3600*24*28,  # 28 days, in seconds
)
# List previous analyses
invoice_table.list_feature_job_setting_analysis()
# Retrieve a specific analysis
analysis = fb.FeatureJobSettingAnalysis.get_by_id(<analysis_id>)
# Backtest a manual setting
manual_setting = fb.FeatureJobSetting(
    blind_spot="135s",
    frequency="60m",
    time_modulo_frequency="90s",
)
backtest_result = analysis.backtest(feature_job_setting=manual_setting)
# Update the default feature job setting
invoice_table.update_default_feature_job_setting(manual_setting)
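To make the three parameters of the manual setting concrete, the sketch below works out when the most recent job would have run and which data it would have seen. It is an illustration under stated assumptions, not FeatureByte's scheduler: it assumes a job fires whenever the time elapsed since the Unix epoch, modulo frequency, equals time_modulo_frequency, and that a job ignores data newer than the job time minus blind_spot. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_job_time(now, frequency, time_modulo_frequency):
    """Most recent scheduled job time at or before `now` (assumed schedule)."""
    # Shift by the offset, floor to whole periods, then shift back.
    periods = (now - EPOCH - time_modulo_frequency) // frequency
    return EPOCH + periods * frequency + time_modulo_frequency


def data_cutoff(job_time, blind_spot):
    """Latest event timestamp the job is assumed to take into account."""
    return job_time - blind_spot


# Same values as the manual setting: 60m, 90s, 135s
frequency = timedelta(minutes=60)
time_modulo_frequency = timedelta(seconds=90)
blind_spot = timedelta(seconds=135)

now = datetime(2023, 4, 10, 12, 30, tzinfo=timezone.utc)
job_time = last_job_time(now, frequency, time_modulo_frequency)
cutoff = data_cutoff(job_time, blind_spot)
print(job_time.isoformat())  # 2023-04-10T12:01:30+00:00
print(cutoff.isoformat())    # 2023-04-10T11:59:15+00:00
```

Under these assumptions, a job that runs 90 seconds past the hour only uses events up to 45 seconds before the hour, which leaves time for late-arriving records to land in the warehouse.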
Enhancing Feature Engineering with Metadata
Optionally, you can include additional metadata at the column level after creating a table to support feature engineering further.
This could involve identifying columns that reference specific entities using the as_entity() method:
# Tag the entities for the grocery invoice table
invoice_table.GroceryInvoiceGuid.as_entity("groceryinvoice")
invoice_table.GroceryCustomerGuid.as_entity("grocerycustomer")
This could also involve defining default cleaning operations using the update_critical_data_info() method:
# Discount amount should not be negative
items_table.Discount.update_critical_data_info(
    cleaning_operations=[
        fb.MissingValueImputation(imputed_value=0),
        fb.ValueBeyondEndpointImputation(
            type="less_than", end_point=0, imputed_value=0
        ),
    ]
)
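The effect of these two cleaning operations on a column can be pictured with plain Python. This is only a conceptual sketch, assuming the operations are applied in the listed order and that missing values appear as `None`; the function name `clean_discount` is hypothetical and this is not how the library evaluates the operations in the warehouse.

```python
def clean_discount(values, imputed_value=0, end_point=0):
    """Mimic the two cleaning operations on a list of discount values."""
    cleaned = []
    for value in values:
        if value is None:
            # MissingValueImputation: replace missing values
            value = imputed_value
        elif value < end_point:
            # ValueBeyondEndpointImputation with type="less_than":
            # replace values below the end point
            value = imputed_value
        cleaned.append(value)
    return cleaned


print(clean_discount([None, -2.5, 0, 1.25]))  # [0, 0, 0, 1.25]
```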
For more details, refer to the TableColumn documentation page.
Managing Table Status
When a table is created, it is automatically added to the active catalog with its status set to 'PUBLIC_DRAFT'. Once the table is prepared for feature engineering, you can modify its status to 'PUBLISHED'.
Note
If a table needs to be deprecated, update its status to 'DEPRECATED'.
After deprecating a table:
- the readiness state of features using the table cannot be promoted to "PRODUCTION_READY".
- you can create a new Table using the same source table but a different name.
To obtain the current status of a table, use the status property. To change the status, use the update_status() method:
Accessing a Table from the Catalog
Existing tables can be accessed through the catalog using the list_tables() and get_table() methods.
# List tables in the catalog
catalog.list_tables()
# Retrieve a table
invoice_table = catalog.get_table("GROCERYINVOICE")
You can also retrieve a Table object by its Object ID using the get_table_by_id() method.
Exploring a Table
To explore a table, you can:
- obtain detailed information using the info() method
- acquire descriptive statistics using the describe() method
- obtain a selection of rows using the preview() method
- obtain a larger random selection of rows based on a specified time range, size, and seed using the sample() method
# Obtain detailed information on a table
invoice_table.info()
# Acquire descriptive statistics for a table
invoice_table.describe()
# Obtain a selection of table rows
df = invoice_table.preview(limit=20)
# Obtain a random selection of table rows based on a specified time range, size, and seed
df = invoice_table.sample(
    from_timestamp=pd.Timestamp('2023-04-01'),
    to_timestamp=pd.Timestamp('2023-05-01'),
    size=100,
    seed=23,
)
By default, the statistics and materialization are computed before applying cleaning operations defined at the table level. To include these cleaning operations, set the after_cleaning parameter to True.
Creating Views to Prepare Data Before Defining Features
To prepare data before defining features, View objects are created from Table objects using the get_view() method.
Besides EventView, ItemView, DimensionView, and SCDView, another type of view can be created from an SCDTable: the Change View. A Change View lets you analyze the changes to a specific attribute for each natural key of the SCD table. To get a Change View, use the get_change_view() method:
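The idea behind a Change View can be illustrated without the library. The sketch below pairs each SCD record with the previous record for the same natural key, so that every row shows the past and new value of the tracked attribute. It is a conceptual illustration only: the function `change_rows`, the output column names, and the sample data are hypothetical, and emitting a row with no past value for the first record of each key is an assumption.

```python
def change_rows(scd_rows, natural_key, effective_ts, tracked):
    """Return one change record per SCD row, ordered by key and time."""
    ordered = sorted(scd_rows, key=lambda r: (r[natural_key], r[effective_ts]))
    previous = {}  # last record seen for each natural key
    changes = []
    for row in ordered:
        key = row[natural_key]
        prior = previous.get(key)
        changes.append({
            natural_key: key,
            "change_timestamp": row[effective_ts],
            "past_" + tracked: prior[tracked] if prior else None,
            "new_" + tracked: row[tracked],
        })
        previous[key] = row
    return changes


# Hypothetical SCD data: a customer's city changes over time
customers = [
    {"CustomerId": "c1", "ValidFrom": "2023-01-01", "City": "Paris"},
    {"CustomerId": "c2", "ValidFrom": "2023-02-01", "City": "Nice"},
    {"CustomerId": "c1", "ValidFrom": "2023-03-15", "City": "Lyon"},
]
changes = change_rows(customers, "CustomerId", "ValidFrom", "City")
for change in changes:
    print(change)
```

Rows shaped like this make it straightforward to count or aggregate changes per entity over a time window, which is the kind of analysis a Change View is meant to support.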