2. Register tables
Registering tables in the catalog¶
Our catalog has been created, and we can start registering tables in it.
First of all, let's activate our catalog.
We will repeat this command in the following notebooks.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)
15:27:17 | INFO | SDK version: 1.0.2.dev46
15:27:17 | INFO | No catalog activated.
15:27:17 | INFO | Using profile: tutorial
15:27:17 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
15:27:17 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
15:27:17 | INFO | SDK version: 1.0.2.dev46
15:27:17 | INFO | No catalog activated.
15:27:18 | INFO | Catalog activated: Grocery Dataset Tutorial
Get data source¶
To get source tables from the data warehouse, we need the data source that our catalog has access to.
This data source exposes the collection of tables in our database, and we can use it to explore the database schema:
ds = catalog.get_data_source()
Here we see we have access to a number of databases. For these tutorials we will use the one called 'DEMO_DATASETS' and the 'GROCERY' schema under it.
ds.list_databases()
['DEMO_DATASETS', 'TUTORIAL']
database_name = 'DEMO_DATASETS'
ds.list_schemas(database_name=database_name)
['CREDITCARD', 'GROCERY', 'HEALTHCARE', 'INFORMATION_SCHEMA']
schema_name = 'GROCERY'
ds.list_source_tables(database_name=database_name, schema_name=schema_name)
['GROCERYCUSTOMER', 'INVOICEITEMS', 'GROCERYINVOICE', 'GROCERYPRODUCT']
Get source tables¶
Now that we have identified the database and schema we want to work with, it is time to get the source tables and register them in the catalog.
customer_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="GROCERYCUSTOMER"
)
invoice_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="GROCERYINVOICE"
)
items_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="INVOICEITEMS"
)
product_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="GROCERYPRODUCT"
)
Exploring Source Tables¶
You can obtain descriptive statistics, preview a selection of rows, or collect additional information about a table's columns.
# Obtain descriptive statistics
invoice_source_table.describe()
Done! |████████████████████████████████████████| 100% in 9.6s (0.10%/s)
| | GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | record_available_at | Amount |
|---|---|---|---|---|---|---|
| dtype | VARCHAR | VARCHAR | TIMESTAMP | VARCHAR | TIMESTAMP | FLOAT |
| unique | 58333 | 500 | 58283 | 4 | 14372 | 6647 |
| %missing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| %empty | 0 | 0 | NaN | 0 | NaN | NaN |
| entropy | 6.214608 | 5.860939 | NaN | 0.786916 | NaN | NaN |
| top | 003224bd-aad1-4e34-9182-1f8f3a6b0a57 | cea213d4-36e4-48c3-ae8d-c7a25911e11c | 2022-02-08 14:47:42.000 | +02:00 | 2022-08-23 15:01:00.000 | 1 |
| freq | 1.0 | 1044.0 | 2.0 | 30559.0 | 17.0 | 973.0 |
| mean | NaN | NaN | NaN | NaN | NaN | 19.177691 |
| std | NaN | NaN | NaN | NaN | NaN | 23.746579 |
| min | NaN | NaN | 2022-01-01T04:17:46.000000000 | NaN | 2022-01-01T05:01:00.000000000 | 0.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | 4.28 |
| 50% | NaN | NaN | NaN | NaN | NaN | 10.58 |
| 75% | NaN | NaN | NaN | NaN | NaN | 24.56 |
| max | NaN | NaN | 2024-04-26T06:43:50.000000000 | NaN | 2024-04-26T07:01:00.000000000 | 360.84 |
# Preview a selection of rows
invoice_source_table.preview()
| | GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | record_available_at | Amount |
|---|---|---|---|---|---|---|
| 0 | 753a59e9-1291-4882-bc7a-39633607e192 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-01-04 17:15:23 | +01:00 | 2022-01-04 18:01:00 | 6.17 |
| 1 | 040c86f7-9e16-4468-bf9f-b80afc4a3610 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-01-17 16:38:55 | +01:00 | 2022-01-17 17:01:00 | 5.58 |
| 2 | 460fe41e-258c-409d-85bb-b1b639659a02 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-02-02 18:09:57 | +01:00 | 2022-02-02 19:01:00 | 242.15 |
| 3 | 46c48917-06a9-4f53-994e-d7ab45717073 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-03-13 17:38:52 | +01:00 | 2022-03-13 18:01:00 | 76.63 |
| 4 | 29f43ed8-c684-45a3-8a6e-e5e09f228549 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-03-19 13:32:36 | +01:00 | 2022-03-19 14:01:00 | 40.33 |
| 5 | b649add7-08fa-4185-be0a-3dc351befcd1 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-03-26 13:38:56 | +01:00 | 2022-03-26 14:01:00 | 3.52 |
| 6 | b9353637-8a7e-4ee9-9095-380d2df051f6 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-04-16 12:53:14 | +02:00 | 2022-04-16 13:01:00 | 22.30 |
| 7 | 82b59f81-4e08-48f4-b191-06311b429dd7 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-04-29 19:24:24 | +02:00 | 2022-04-29 20:01:00 | 51.71 |
| 8 | 8cc9e4cc-7593-4f67-84fc-7b9107f2cb57 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-05-14 12:27:20 | +02:00 | 2022-05-14 13:01:00 | 40.20 |
| 9 | 26c8945d-e1d4-4dbf-9841-f46bbf58f556 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-05-24 13:08:21 | +02:00 | 2022-05-24 14:01:00 | 2.50 |
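If you prefer a random selection over the deterministic preview, source tables also support sampling. A minimal sketch; the size parameter is our assumption, so check the SDK reference for your version:
# Draw a random sample of rows instead of a fixed preview
# (parameter name assumed; see the SDK reference)
invoice_source_table.sample(size=10)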
# Collect additional information on their columns
invoice_source_table.columns_info
[ColumnInfo(name='GroceryInvoiceGuid', dtype='VARCHAR', description='Unique identifier of each row in the table, in GUID format. Uniquely identifies each invoice.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='GroceryCustomerGuid', dtype='VARCHAR', description='Unique identifier for each customer, in GUID format.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='Timestamp', dtype='TIMESTAMP', description='The GMT timestamp of when this invoice transaction event occurred.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='tz_offset', dtype='VARCHAR', description='The local timezone offset of the invoice event.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='record_available_at', dtype='TIMESTAMP', description='A timestamp for when this row was added to the database.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='Amount', dtype='FLOAT', description='The total amount of the invoice, including all items and any discounts applied. Cannot be negative.', entity_id=None, semantic_id=None, critical_data_info=None)]
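The raw list of ColumnInfo objects is dense. As a quick readability aid, you can tabulate the fields you care about with pandas; this is a sketch assuming each entry exposes the name, dtype, and description attributes shown in the output above:
import pandas as pd
# Tabulate column metadata for easier reading
pd.DataFrame(
    [(c.name, c.dtype, c.description) for c in invoice_source_table.columns_info],
    columns=["name", "dtype", "description"],
)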
Registering Tables in the Catalog¶
This step, though slightly more intricate than the previous ones, is vital for our subsequent feature engineering tasks.
For accurate feature derivation, FeatureByte needs to understand the 'role' each table plays.
We categorize tables into four types:
- Event tables - These capture distinct business events occurring at specific points in time. An example would be customer invoices, specifically the "Grocery Invoice" in our scenario.
- Item tables - These detail the specifics of an event, such as the products a customer purchased. In our context, this is represented by "Invoice Items".
- Slowly Changing Dimension (SCD) tables - These hold dimension-like data that may evolve over time; for instance, customers might change addresses or update other details. In our use case, this is the "Grocery Customer".
- Dimension tables - These contain static descriptive data, like information on the particular products retailed in a store, exemplified by "Grocery Product" in our setting.
Feel free to explore more about tables and table types, but at this point understanding the basic differences is enough.
Let's register each table using its respective type:
customer_table = customer_source_table.create_scd_table(
name="GROCERYCUSTOMER",
surrogate_key_column='RowID',
natural_key_column="GroceryCustomerGuid",
effective_timestamp_column="ValidFrom",
    current_flag_column="CurrentRecord",
record_creation_timestamp_column="record_available_at"
)
invoice_table = invoice_source_table.create_event_table(
name="GROCERYINVOICE",
event_id_column="GroceryInvoiceGuid",
event_timestamp_column="Timestamp",
event_timestamp_timezone_offset_column="tz_offset",
record_creation_timestamp_column="record_available_at"
)
items_table = items_source_table.create_item_table(
name="INVOICEITEMS",
event_id_column="GroceryInvoiceGuid",
item_id_column="GroceryInvoiceItemGuid",
event_table_name="GROCERYINVOICE",
record_creation_timestamp_column="record_available_at"
)
product_table = product_source_table.create_dimension_table(
name="GROCERYPRODUCT",
dimension_id_column="GroceryProductGuid"
)
After this, we will be able to see our tables in the catalog:
display(catalog.list_tables())
| | id | name | type | status | entities | created_at |
|---|---|---|---|---|---|---|
| 0 | 662b577aaa13c89fa14554e3 | GROCERYPRODUCT | dimension_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:55.127000 |
| 1 | 662b5778aa13c89fa14554e2 | INVOICEITEMS | item_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:52.998000 |
| 2 | 662b5775aa13c89fa14554e1 | GROCERYINVOICE | event_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:50.409000 |
| 3 | 662b5773aa13c89fa14554e0 | GROCERYCUSTOMER | scd_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:48.329000 |
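In the following notebooks we will retrieve these registered tables by name rather than re-creating them. A minimal sketch, assuming the table names registered above:
# Retrieve registered tables from the active catalog by name
invoice_table = catalog.get_table("GROCERYINVOICE")
items_table = catalog.get_table("INVOICEITEMS")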
Initialize feature job settings¶
The last step we need to tackle is setting up the feature job settings.
Essentially, these settings determine when and how frequently the feature store is updated. They also set a 'blind spot' period: the time gap between when a feature is computed and the latest event available to it.
For instance, in our grocery context, if we aim to predict customer spending for the upcoming two weeks, we might want to use all of a customer's invoices up to the present. However, our data pipeline may not capture the most recent invoices immediately, due to the time required to collect data from edge devices, run it through ETL, and other steps. If our training features rely on events that would not yet have landed in the warehouse at serving time, production conditions are not accurately represented, leading to poor prediction accuracy.
Luckily, FeatureByte is smart enough to compute those settings for us by analyzing the table's record creation patterns:
invoice_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 12.3s (0.08%/s)
The analysis period starts at 2024-03-29 06:43:50 and ends at 2024-04-26 06:43:50
The column used for the event timestamp is Timestamp
The column used for the record creation timestamp in GROCERYINVOICE is record_available_at
STATISTICS ON TIME BETWEEN GROCERYINVOICE RECORDS CREATIONS
- Average time is 4213.807531380753 s
- Median time is 3600.0 s
- Lowest time is 3600.0 s
- Largest time is 39600.0 s
based on a total of 476 unique record creation timestamps.
The BEST ESTIMATE FOR GROCERYINVOICE UPDATES FREQUENCY is every 1 hour
The longer time between records creations are due to 196 MISSING UPDATES.
GROCERYINVOICE UPDATES TIME starts 1.0 minute and ends 1.0 minute after the start of each 1 hour
This includes a buffer of 60 s to allow for late jobs.
The 57 jobs that occurred after missing jobs don't seem to have processed significantly older records.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 120 s
- blind spot for 99.9 % of events to land: 120 s
- blind spot for 99.95 % of events to land: 120 s
- blind spot for 99.99 % of events to land: 120 s
- blind spot for 99.995 % of events to land: 120 s
- blind spot for 100.0 % of events to land: 120 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
frequency: 3600
job_time_modulo_frequency: 120
blind_spot: 120
The resulting FEATURE CUTOFF modulo frequency is 0 s.
For a feature cutoff at 0 s:
- time for 99.5 % of events to land: 120 s
- time for 99.9 % of events to land: 120 s
- time for 99.95 % of events to land: 120 s
- time for 99.99 % of events to land: 120 s
- time for 99.995 % of events to land: 120 s
- time for 100.0 % of events to land: 120 s
- Frequency = 3600 s / Job time modulo frequency = 120 s / Blind spot = 120 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2024-03-29 06:00:00 to 2024-04-26 06:00:00, the table is regularly updated 1.0 minute after the start of each 1 hour within a interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 2 minutes after the start of each 1 hour.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 120 s.
- frequency: 3600 s
- job_time_modulo_frequency: 120 s
- blind_spot: 120 s
You can always override the default feature job settings; see update_default_feature_job_setting.
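For example, a minimal sketch of such an override, built from the recommended values in the analysis above; the FeatureJobSetting parameter names are our assumption and vary between SDK versions, so consult the reference first:
# Override the default feature job setting with explicit values
# (parameter names assumed; check the update_default_feature_job_setting docs)
invoice_table.update_default_feature_job_setting(
    fb.FeatureJobSetting(
        blind_spot="120s",  # ignore events newer than 2 minutes at computation time
        period="3600s",     # run feature jobs every hour
        offset="120s",      # start 2 minutes after the top of the hour
    )
)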
That's it for this tutorial; we are now ready to model our entities.