2. Register tables
Registering tables in the catalog¶
Our catalog has been created, and we can start registering tables in it.
First of all, let's activate our catalog.
We will repeat this command in the following notebooks.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)
15:27:17 | INFO | SDK version: 1.0.2.dev46
15:27:17 | INFO | No catalog activated.
15:27:17 | INFO | Using profile: tutorial
15:27:17 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
15:27:17 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
15:27:17 | INFO | SDK version: 1.0.2.dev46
15:27:17 | INFO | No catalog activated.
15:27:18 | INFO | Catalog activated: Grocery Dataset Tutorial
Get data source¶
To get source tables from the data warehouse, we need the data source that our catalog has access to.
This data source exposes the collection of tables in our database, and we can use it to explore the database schema:
ds = catalog.get_data_source()
Here we see we have access to a number of databases. For these tutorials we will use the one called 'DEMO_DATASETS' and the 'GROCERY' schema under it.
ds.list_databases()
['DEMO_DATASETS', 'TUTORIAL']
database_name = 'DEMO_DATASETS'
ds.list_schemas(database_name=database_name)
['CREDITCARD', 'GROCERY', 'HEALTHCARE', 'INFORMATION_SCHEMA']
schema_name = 'GROCERY'
ds.list_source_tables(database_name=database_name, schema_name=schema_name)
['GROCERYCUSTOMER', 'INVOICEITEMS', 'GROCERYINVOICE', 'GROCERYPRODUCT']
Get source tables¶
Now that we have identified the database and schema we want to work with, it is time to get the source tables and register them in the catalog.
customer_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="GROCERYCUSTOMER"
)
invoice_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="GROCERYINVOICE"
)
items_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="INVOICEITEMS"
)
product_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="GROCERYPRODUCT"
)
Exploring Source Tables¶
You can obtain descriptive statistics, preview a selection of rows, or collect additional information about a table's columns.
# Obtain descriptive statistics
invoice_source_table.describe()
Done! |████████████████████████████████████████| 100% in 9.6s (0.10%/s)
| | GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | record_available_at | Amount |
|---|---|---|---|---|---|---|
| dtype | VARCHAR | VARCHAR | TIMESTAMP | VARCHAR | TIMESTAMP | FLOAT |
| unique | 58333 | 500 | 58283 | 4 | 14372 | 6647 |
| %missing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| %empty | 0 | 0 | NaN | 0 | NaN | NaN |
| entropy | 6.214608 | 5.860939 | NaN | 0.786916 | NaN | NaN |
| top | 003224bd-aad1-4e34-9182-1f8f3a6b0a57 | cea213d4-36e4-48c3-ae8d-c7a25911e11c | 2022-02-08 14:47:42.000 | +02:00 | 2022-08-23 15:01:00.000 | 1 |
| freq | 1.0 | 1044.0 | 2.0 | 30559.0 | 17.0 | 973.0 |
| mean | NaN | NaN | NaN | NaN | NaN | 19.177691 |
| std | NaN | NaN | NaN | NaN | NaN | 23.746579 |
| min | NaN | NaN | 2022-01-01T04:17:46.000000000 | NaN | 2022-01-01T05:01:00.000000000 | 0.0 |
| 25% | NaN | NaN | NaN | NaN | NaN | 4.28 |
| 50% | NaN | NaN | NaN | NaN | NaN | 10.58 |
| 75% | NaN | NaN | NaN | NaN | NaN | 24.56 |
| max | NaN | NaN | 2024-04-26T06:43:50.000000000 | NaN | 2024-04-26T07:01:00.000000000 | 360.84 |
# Preview a selection of rows
invoice_source_table.preview()
| | GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | record_available_at | Amount |
|---|---|---|---|---|---|---|
| 0 | 753a59e9-1291-4882-bc7a-39633607e192 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-01-04 17:15:23 | +01:00 | 2022-01-04 18:01:00 | 6.17 |
| 1 | 040c86f7-9e16-4468-bf9f-b80afc4a3610 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-01-17 16:38:55 | +01:00 | 2022-01-17 17:01:00 | 5.58 |
| 2 | 460fe41e-258c-409d-85bb-b1b639659a02 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-02-02 18:09:57 | +01:00 | 2022-02-02 19:01:00 | 242.15 |
| 3 | 46c48917-06a9-4f53-994e-d7ab45717073 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-03-13 17:38:52 | +01:00 | 2022-03-13 18:01:00 | 76.63 |
| 4 | 29f43ed8-c684-45a3-8a6e-e5e09f228549 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-03-19 13:32:36 | +01:00 | 2022-03-19 14:01:00 | 40.33 |
| 5 | b649add7-08fa-4185-be0a-3dc351befcd1 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-03-26 13:38:56 | +01:00 | 2022-03-26 14:01:00 | 3.52 |
| 6 | b9353637-8a7e-4ee9-9095-380d2df051f6 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-04-16 12:53:14 | +02:00 | 2022-04-16 13:01:00 | 22.30 |
| 7 | 82b59f81-4e08-48f4-b191-06311b429dd7 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-04-29 19:24:24 | +02:00 | 2022-04-29 20:01:00 | 51.71 |
| 8 | 8cc9e4cc-7593-4f67-84fc-7b9107f2cb57 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-05-14 12:27:20 | +02:00 | 2022-05-14 13:01:00 | 40.20 |
| 9 | 26c8945d-e1d4-4dbf-9841-f46bbf58f556 | 07c21f1d-1b16-4a92-bfd2-04d62cfa35ee | 2022-05-24 13:08:21 | +02:00 | 2022-05-24 14:01:00 | 2.50 |
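If you prefer a random selection over the deterministic preview, source tables also support sampling. A minimal sketch; the size parameter is our assumption, so check the SDK reference for your version:
# Draw a random sample of rows instead of a fixed preview
# (parameter name assumed; see the SDK reference)
invoice_source_table.sample(size=10)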
# Collect additional information on their columns
invoice_source_table.columns_info
[ColumnInfo(name='GroceryInvoiceGuid', dtype='VARCHAR', description='Unique identifier of each row in the table, in GUID format. Uniquely identifies each invoice.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='GroceryCustomerGuid', dtype='VARCHAR', description='Unique identifier for each customer, in GUID format.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='Timestamp', dtype='TIMESTAMP', description='The GMT timestamp of when this invoice transaction event occurred.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='tz_offset', dtype='VARCHAR', description='The local timezone offset of the invoice event.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='record_available_at', dtype='TIMESTAMP', description='A timestamp for when this row was added to the database.', entity_id=None, semantic_id=None, critical_data_info=None),
 ColumnInfo(name='Amount', dtype='FLOAT', description='The total amount of the invoice, including all items and any discounts applied. Cannot be negative.', entity_id=None, semantic_id=None, critical_data_info=None)]
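The raw list of ColumnInfo objects is dense. As a quick readability aid, you can tabulate the fields you care about with pandas; this is a sketch assuming each entry exposes the name, dtype, and description attributes shown in the output above:
import pandas as pd
# Tabulate column metadata for easier reading
pd.DataFrame(
    [(c.name, c.dtype, c.description) for c in invoice_source_table.columns_info],
    columns=["name", "dtype", "description"],
)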
Registering Tables in the Catalog¶
This step, though slightly more intricate than the previous ones, is vital for our subsequent feature engineering tasks.
For accurate feature derivation, FeatureByte needs to understand the 'role' each table plays.
We categorize tables into four types:
- Event tables - These capture distinct business events occurring at specific points in time. An example would be customer invoices, specifically the "Grocery Invoice" in our scenario.
- Item tables - These detail the specifics of an event, such as the products a customer purchased. In our context, this is represented by "Invoice Items".
- Slowly Changing Dimension (SCD) tables - These hold dimension-like data that may evolve over time; for instance, customers might change addresses or update other details. In our use case, this is the "Grocery Customer".
- Dimension tables - These contain static descriptive data, like information on the particular products retailed in a store, exemplified by "Grocery Product" in our setting.
Feel free to explore more about tables and table types, but at this point understanding the basic differences is enough.
Let's register each table using its respective type:
customer_table = customer_source_table.create_scd_table(
name="GROCERYCUSTOMER",
surrogate_key_column='RowID',
natural_key_column="GroceryCustomerGuid",
effective_timestamp_column="ValidFrom",
    current_flag_column="CurrentRecord",
record_creation_timestamp_column="record_available_at"
)
invoice_table = invoice_source_table.create_event_table(
name="GROCERYINVOICE",
event_id_column="GroceryInvoiceGuid",
event_timestamp_column="Timestamp",
event_timestamp_timezone_offset_column="tz_offset",
record_creation_timestamp_column="record_available_at"
)
items_table = items_source_table.create_item_table(
name="INVOICEITEMS",
event_id_column="GroceryInvoiceGuid",
item_id_column="GroceryInvoiceItemGuid",
event_table_name="GROCERYINVOICE",
record_creation_timestamp_column="record_available_at"
)
product_table = product_source_table.create_dimension_table(
name="GROCERYPRODUCT",
dimension_id_column="GroceryProductGuid"
)
After this, we will be able to see our tables in the catalog:
display(catalog.list_tables())
| | id | name | type | status | entities | created_at |
|---|---|---|---|---|---|---|
| 0 | 662b577aaa13c89fa14554e3 | GROCERYPRODUCT | dimension_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:55.127000 |
| 1 | 662b5778aa13c89fa14554e2 | INVOICEITEMS | item_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:52.998000 |
| 2 | 662b5775aa13c89fa14554e1 | GROCERYINVOICE | event_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:50.409000 |
| 3 | 662b5773aa13c89fa14554e0 | GROCERYCUSTOMER | scd_table | PUBLIC_DRAFT | [] | 2024-04-26T07:27:48.329000 |
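In the following notebooks we will retrieve these registered tables by name rather than re-creating them. A minimal sketch, assuming the table names registered above:
# Retrieve registered tables from the active catalog by name
invoice_table = catalog.get_table("GROCERYINVOICE")
items_table = catalog.get_table("INVOICEITEMS")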
Initialize feature job settings¶
The last step we need to tackle is setting up the feature job settings.
Essentially, these settings determine when and how frequently the feature store is updated. They also set a 'blind spot' period: the time gap between when a feature is computed and the latest event available to it.
For instance, in our grocery context, if we aim to predict customer spending for the upcoming two weeks, we might want to use all of a customer's invoices up to the present. However, our data pipeline may not capture the most recent invoices immediately, due to the time required to collect data from edge devices, run it through ETL, and other steps. If our training features rely on events that would not yet have landed in the warehouse at serving time, production conditions are not accurately represented, leading to poor prediction accuracy.
Luckily, FeatureByte is smart enough to compute those settings for us by analyzing the table's record creation patterns:
invoice_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 12.3s (0.08%/s)
The analysis period starts at 2024-03-29 06:43:50 and ends at 2024-04-26 06:43:50
The column used for the event timestamp is Timestamp
The column used for the record creation timestamp in GROCERYINVOICE is record_available_at
STATISTICS ON TIME BETWEEN GROCERYINVOICE RECORDS CREATIONS
- Average time is 4213.807531380753 s
- Median time is 3600.0 s
- Lowest time is 3600.0 s
- Largest time is 39600.0 s
based on a total of 476 unique record creation timestamps.
The BEST ESTIMATE FOR GROCERYINVOICE UPDATES FREQUENCY is every 1 hour
The longer time between records creations are due to 196 MISSING UPDATES.
GROCERYINVOICE UPDATES TIME starts 1.0 minute and ends 1.0 minute after the start of each 1 hour
This includes a buffer of 60 s to allow for late jobs.
The 57 jobs that occurred after missing jobs don't seem to have processed significantly older records.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 120 s
- blind spot for 99.9 % of events to land: 120 s
- blind spot for 99.95 % of events to land: 120 s
- blind spot for 99.99 % of events to land: 120 s
- blind spot for 99.995 % of events to land: 120 s
- blind spot for 100.0 % of events to land: 120 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
frequency: 3600
job_time_modulo_frequency: 120
blind_spot: 120
The resulting FEATURE CUTOFF modulo frequency is 0 s.
For a feature cutoff at 0 s:
- time for 99.5 % of events to land: 120 s
- time for 99.9 % of events to land: 120 s
- time for 99.95 % of events to land: 120 s
- time for 99.99 % of events to land: 120 s
- time for 99.995 % of events to land: 120 s
- time for 100.0 % of events to land: 120 s
- Frequency = 3600 s / Job time modulo frequency = 120 s / Blind spot = 120 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2024-03-29 06:00:00 to 2024-04-26 06:00:00, the table is regularly updated 1.0 minute after the start of each 1 hour within a interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 2 minutes after the start of each 1 hour.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 120 s.
- frequency: 3600 s
- job_time_modulo_frequency: 120 s
- blind_spot: 120 s
You can always override the default feature job settings; see update_default_feature_job_setting.
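For example, a minimal sketch of such an override, built from the recommended values in the analysis above; the FeatureJobSetting parameter names are our assumption and vary between SDK versions, so consult the reference first:
# Override the default feature job setting with explicit values
# (parameter names assumed; check the update_default_feature_job_setting docs)
invoice_table.update_default_feature_job_setting(
    fb.FeatureJobSetting(
        blind_spot="120s",  # ignore events newer than 2 minutes at computation time
        period="3600s",     # run feature jobs every hour
        offset="120s",      # start 2 minutes after the top of the hour
    )
)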
That's it for this tutorial; we are now ready to model our entities.