Deep Dive Tutorial: Data Modeling¶
Learning Objectives¶
In this tutorial you will learn how to:
- Create a catalog
- Register tables
- Analyse and set feature job settings
- Define data cleaning operations
- Define entities and table relationships
- Create views of tables
- Join views
Set up the prerequisites¶
Learning Objectives
In this section you will:
- import libraries
- start your local featurebyte server
Load the featurebyte library and connect to the local instance of featurebyte¶
# library imports
import pandas as pd
import numpy as np
from datetime import datetime
# load the featurebyte SDK
import featurebyte as fb
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
01:40:18 | INFO | Using configuration file at: /home/chester/.featurebyte/config.yaml
01:40:18 | INFO | Active profile: local (http://127.0.0.1:8088)
01:40:18 | INFO | SDK version: 0.2.2
01:40:19 | INFO | Active catalog: default
01:40:19 | INFO | 0 feature list, 0 feature deployed
01:40:19 | INFO | (1/4) Starting featurebyte services
Container spark-thrift Running
Container mongo-rs Running
Container redis Running
Container featurebyte-server Running
Container featurebyte-worker Running
Container mongo-rs Waiting
Container redis Waiting
Container mongo-rs Waiting
Container redis Healthy
Container mongo-rs Healthy
Container mongo-rs Healthy
01:40:19 | INFO | (2/4) Creating local spark feature store
01:40:19 | INFO | (3/4) Import datasets
01:40:20 | INFO | Dataset grocery already exists, skipping import
01:40:20 | INFO | Dataset healthcare already exists, skipping import
01:40:20 | INFO | Dataset creditcard already exists, skipping import
01:40:20 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a catalog¶
Learning Objectives
In this section you will:
- learn about catalogs
- activate a new catalog
Concept: Catalog¶
A Catalog object operates as a centralized metadata repository for organizing tables, entities, features, and feature lists and other objects to facilitate feature serving for a specific domain. By employing a catalog, your team members can share, search, access, and reuse these assets.
Concept: The default catalog¶
When you start FeatureByte, it automatically activates the default catalog.
Do not use the default catalog for data modeling or feature engineering. Always create a new catalog or activate an existing catalog for your work.
Example: Get the active catalog¶
# get the active catalog
catalog = fb.Catalog.get_active()
print(catalog.name)
default
Example: Create a new catalog¶
catalog_name = "deep dive data modeling " + datetime.now().strftime("%Y%m%d:%H%M")
print(catalog_name)
# create a catalog
catalog = fb.Catalog.create(catalog_name, feature_store_name='playground')
deep dive data modeling 20230511:0140
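If you return to this project in a later session, you can make this catalog active again by name. The lines below are a minimal sketch, assuming the Catalog.activate classmethod behaves this way in your installed SDK version:
# sketch: re-activate an existing catalog by name (assumed API; check your SDK version)
fb.Catalog.activate(catalog_name)
print(fb.Catalog.get_active().name)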
Register tables¶
Learning Objectives
In this section you will:
- be introduced to the case study dataset
- learn about the FeatureByte catalog table types
- declare FeatureByte catalog tables
Case Study: French grocery dataset¶
The French grocery dataset comprises four tables of data from a chain of grocery stores.
The data source has already been declared in the playground feature store that was installed as part of FeatureByte.
Concept: Data source¶
A data source represents a collection of source tables that the feature store can access.
Example: Connect to a pre-defined data source¶
# get data source from the local spark feature store
ds = fb.FeatureStore.get("playground").get_data_source()
# list the databases in the data source
ds.list_databases()
['spark_catalog']
# list all schemas in the spark catalog
ds.list_schemas('spark_catalog')
['creditcard', 'default', 'grocery', 'healthcare', 'playground']
# list the tables in the grocery schema
ds.list_source_tables('spark_catalog', 'grocery')
['__grocerycustomer', '__groceryinvoice', '__invoiceitems', 'grocerycustomer', 'groceryinvoice', 'groceryproduct', 'invoiceitems']
Concept: Source table¶
A source table is a table from a data warehouse that the feature store can access.
Concept: Catalog table¶
A Catalog Table provides a centralized location for metadata about a source table. This metadata determines the type of operations that can be applied to the table's views and includes essential information for feature engineering.
Concept: Table types¶
Understanding the type of data contained in a table is crucial because it helps determine the appropriate feature engineering techniques that can be applied to the table.
FeatureByte supports four of the most common types of data tables:
- Slowly Changing Dimension (SCD) table
- Event table
- Item table
- Dimension table
Concept: Slowly changing dimension (SCD) table¶
An SCD Table represents a table in a data warehouse that contains data that changes slowly and unpredictably over time.
There are two main types of SCDs:
- Type 1: Overwrites old data with new data
- Type 2: Maintains a history of changes by creating a new record for each change.
FeatureByte only supports the use of Type 2 SCDs since SCDs of Type 1 may cause data leaks during model training and poor performance during inference.
An SCD Table of Type 2 utilizes a natural key to distinguish each active row and facilitate tracking of changes over time. The SCD table employs effective and expiration date columns to determine the active status of a row. In certain instances, an active flag column may replace the expiration date column to indicate if a row is currently active.
To create an SCD Table in FeatureByte, it is necessary to identify the columns for the natural key and the effective timestamp, and optionally the surrogate key, expiration timestamp, and active flag.
Example: Declare a slowly changing dimension table¶
# declare the grocery customer table
customer_table = ds.get_source_table(
database_name="spark_catalog",
schema_name="GROCERY",
table_name="GROCERYCUSTOMER"
).create_scd_table(
name="GROCERYCUSTOMER",
surrogate_key_column='RowID',
natural_key_column="GroceryCustomerGuid",
effective_timestamp_column="ValidFrom",
current_flag_column="CurrentRecord",
record_creation_timestamp_column="record_available_at"
)
# show the column names and types for grocery customer
ds.get_source_table(database_name="spark_catalog", schema_name="GROCERY", table_name="GROCERYCUSTOMER").dtypes
RowID                  VARCHAR
GroceryCustomerGuid    VARCHAR
ValidFrom              TIMESTAMP
Gender                 VARCHAR
Title                  VARCHAR
GivenName              VARCHAR
MiddleInitial          VARCHAR
Surname                VARCHAR
StreetAddress          VARCHAR
City                   VARCHAR
State                  VARCHAR
PostalCode             VARCHAR
BrowserUserAgent       VARCHAR
DateOfBirth            DATE
Latitude               FLOAT
Longitude              FLOAT
record_available_at    TIMESTAMP
CurrentRecord          BOOL
dtype: object
# show sample data
customer_table.sample(5)
 | RowID | GroceryCustomerGuid | ValidFrom | Gender | Title | GivenName | MiddleInitial | Surname | StreetAddress | City | State | PostalCode | BrowserUserAgent | DateOfBirth | Latitude | Longitude | record_available_at | CurrentRecord
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | e46287d9-3c3e-4565-a32a-6d0740b63a04 | a8cd7041-3f41-4a6b-9745-798e2300a717 | 2019-01-10 09:06:37 | female | Ms. | Anne | C | Lang | 6 Rue Roussy | ORLÉANS | Centre | 45000 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | 1974-04-27 | 47.983993 | 1.981258 | 2019-01-10 10:01:00 | True |
1 | 15057786-e3b4-4f13-8521-d1d84285e716 | bbaff8e5-44ab-4f61-a4e6-405f274bf429 | 2022-07-03 16:01:40 | male | Mr. | Henry | R | Saindon | 59 rue Clement Marot | PESSAC | Aquitaine | 33600 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | 2001-03-26 | 44.769466 | -0.613324 | 2022-07-03 17:01:00 | False |
2 | 5ec5d068-02f4-49fe-853a-038347dc505a | 9359ef7b-7fd8-4587-bc40-e89f6acc1218 | 2019-01-09 20:44:25 | female | Mrs. | Madeleine | X | Gagnon | 71 rue du Faubourg National | TAVERNY | Île-de-France | 95150 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; ... | 1964-11-25 | 49.121484 | 2.260801 | 2019-01-09 21:01:00 | True |
3 | 8ca33786-cb20-49ab-8fac-2ab76a7912c0 | 7ce7bcc5-9ded-4f9a-bd9a-5f85f8ea6cca | 2020-10-16 11:39:05 | female | Ms. | Christiane | A | Devost | 40 rue Nationale | PARIS | Île-de-France | 75007 | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:6... | 2003-11-05 | 48.923195 | 2.251334 | 2020-10-16 12:01:00 | True |
4 | fcf52fbc-6459-4bc7-a72f-fca1d49243cc | fb39edea-9527-4a9b-a4f5-f9cf697a124f | 2019-01-01 13:53:50 | male | Mr. | Faustin | E | Gauvin | 57 boulevard de la Liberation | MARSEILLE | Provence-Alpes-Côte d'Azur | 13014 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5... | 1987-05-04 | 43.326089 | 5.311465 | 2019-01-01 14:01:00 | True |
Concept: Event table¶
An Event table represents a table in the data warehouse where each row indicates a unique business event occurring at a particular time.
Event tables can take various forms, such as an Order table in E-commerce, Credit Card Transactions in Banking, Doctor Visits in Healthcare, and Clickstream on the Internet.
To create an Event Table in FeatureByte, it is necessary to identify two important columns in your data: the event key and timestamp. The event key is a unique identifier for each event, while the timestamp indicates when the event occurred.
If your data warehouse is a Snowflake data warehouse, FeatureByte accepts timestamp columns that include time zone offset information.
For timestamp columns without time zone offset information, or for non-Snowflake data warehouses, you can specify a separate column that provides the time zone offset. By doing so, date part transforms on the event timestamp column will be based on the local time instead of UTC.
Additionally, the column that represents the record creation timestamp may be identified to enable an automatic analysis of data availability and freshness of the source table. This analysis can assist in selecting the default scheduling of the computation of features associated with the Event table.
Example: Declare an event table¶
# register GroceryInvoice as an event table
invoice_table = ds.get_source_table(
database_name="spark_catalog",
schema_name="GROCERY",
table_name="GROCERYINVOICE"
).create_event_table(
name="GROCERYINVOICE",
event_id_column="GroceryInvoiceGuid",
event_timestamp_column="Timestamp",
event_timestamp_timezone_offset_column="tz_offset",
record_creation_timestamp_column="record_available_at"
)
# show the column names and types for grocery invoice
ds.get_source_table(database_name="spark_catalog", schema_name="GROCERY", table_name="GROCERYINVOICE").dtypes
GroceryInvoiceGuid     VARCHAR
GroceryCustomerGuid    VARCHAR
Timestamp              TIMESTAMP
tz_offset              VARCHAR
record_available_at    TIMESTAMP
Amount                 FLOAT
dtype: object
invoice_table.preview()
 | GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | record_available_at | Amount
---|---|---|---|---|---|---
0 | 73396167-19bc-4bd4-8064-ebbda4909101 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-01-12 19:24:51 | +01:00 | 2022-01-12 20:01:00 | 15.34 |
1 | 684492c1-31bd-4d70-8018-0296073532b7 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-01-18 16:04:37 | +01:00 | 2022-01-18 17:01:00 | 14.99 |
2 | 2ce65601-a030-4007-8afa-df3c8b815014 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-01-21 13:36:07 | +01:00 | 2022-01-21 14:01:00 | 18.68 |
3 | 77e9a7f1-78d9-43a5-9135-06eb4577c123 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-02-01 20:59:24 | +01:00 | 2022-02-01 21:01:00 | 66.09 |
4 | a865b722-3efc-48c8-aa7b-e59b897b5553 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-02-10 17:53:31 | +01:00 | 2022-02-10 18:01:00 | 42.25 |
5 | 1eed961f-1730-494b-9845-b767cc907d97 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-02-12 18:10:54 | +01:00 | 2022-02-12 19:01:00 | 35.81 |
6 | f3d88431-7826-4112-ad85-a1a6c4259aef | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-02-24 16:52:58 | +01:00 | 2022-02-24 17:01:00 | 74.00 |
7 | 52412379-f9d6-4512-a29f-4a2f2880ff99 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-03-01 10:23:14 | +01:00 | 2022-03-01 11:01:00 | 34.58 |
8 | 99e069d2-71bf-49c0-8d58-4facfb178c9d | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-03-06 19:22:16 | +01:00 | 2022-03-06 20:01:00 | 85.74 |
9 | 770b26fc-48c3-48a9-8631-626f640e080a | 5c96089d-95f7-4a12-ab13-e082836253f1 | 2022-03-13 15:22:09 | +01:00 | 2022-03-13 16:01:00 | 49.31 |
# show sample data
invoice_table.sample(5)
 | GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | record_available_at | Amount
---|---|---|---|---|---|---
0 | 6f0f8768-59b0-4bf1-aa21-258c83515e45 | d2fc87d2-3584-4c8f-9359-b3ff10b5dc09 | 2022-12-26 16:29:21 | +01:00 | 2022-12-26 17:01:00 | 24.04 |
1 | 7d12246f-d5f7-4ed6-8aa2-8611beb7f613 | b6d4377e-9f04-4c04-bc56-b970e54279ca | 2023-02-18 16:00:03 | +01:00 | 2023-02-18 17:01:00 | 25.61 |
2 | 49887643-3fa1-4171-89e0-344160238c01 | c22fa3eb-55a5-4a4f-9301-38f6b6f0567e | 2022-06-23 18:24:52 | +02:00 | 2022-06-23 19:01:00 | 23.88 |
3 | 2cfae7b0-b3a2-4973-9561-de8e4788e388 | e034e01c-50de-42f0-a879-82c093af5f49 | 2022-12-19 15:49:29 | +01:00 | 2022-12-19 16:01:00 | 6.21 |
4 | 360eb328-b0cb-4f75-bf64-d6b5216b50ad | 7a1bc5dc-e198-419e-b972-0abbdf8903c1 | 2022-02-13 16:52:33 | +01:00 | 2022-02-13 17:01:00 | 14.30 |
Concept: Item table¶
An Item table represents a table in the data warehouse containing detailed information about a specific business event.
For instance, an Item table may contain information about:
- Product Items purchased in Customer Orders
- or Drug Prescriptions issued during Doctor Visits by Patients.
Typically, an Item table has a 'one-to-many' relationship with an Event table. Despite not explicitly including a timestamp, it is inherently linked to an event timestamp through its association with the Event table.
To create an Item Table, it is necessary to identify the columns that represent the item key and the event key and determine which Event table is associated with the Item table.
Example: Declare an item table¶
# register invoice items as an item table
items_table = ds.get_source_table(
database_name="spark_catalog",
schema_name="GROCERY",
table_name="INVOICEITEMS"
).create_item_table(
name="INVOICEITEMS",
event_id_column="GroceryInvoiceGuid",
item_id_column="GroceryInvoiceItemGuid",
event_table_name="GROCERYINVOICE"
)
# show the column names and types for invoice items
ds.get_source_table(database_name="spark_catalog", schema_name="GROCERY", table_name="INVOICEITEMS").dtypes
GroceryInvoiceItemGuid    VARCHAR
GroceryInvoiceGuid        VARCHAR
GroceryProductGuid        VARCHAR
Quantity                  FLOAT
UnitPrice                 FLOAT
TotalCost                 FLOAT
Discount                  FLOAT
record_available_at       TIMESTAMP
dtype: object
# show sample data
items_table.sample(5)
 | GroceryInvoiceItemGuid | GroceryInvoiceGuid | GroceryProductGuid | Quantity | UnitPrice | TotalCost | Discount | record_available_at
---|---|---|---|---|---|---|---|---
0 | 0c5181e3-9d9c-402d-902a-1649c3a26232 | 8d99deb7-78cc-4924-ac04-9cb99e1e282c | 8b9739d4-1a3f-4c96-886d-d0492ba45c07 | 1.0 | 1.74 | 1.74 | 0.18 | 2022-08-02 10:01:00 |
1 | 5b853ed2-aea7-4fad-aaa5-bcadbef0eba8 | 163e7004-db43-4e0d-a093-cd7bf27caf10 | a7fd9147-874f-4f3d-b262-3e408cc30db8 | 1.0 | 2.50 | 2.50 | 0.39 | 2023-04-14 17:01:00 |
2 | d2d7633e-3bdf-430d-920e-13825cad3e19 | 4aac4b3b-0cd9-4bf7-a650-68f40fb85865 | 5d9e7f80-4c03-44b9-b44b-5083f0645261 | 1.0 | 0.75 | 0.75 | 0.00 | 2022-06-28 13:01:00 |
3 | 7c4c38cc-7150-4bca-b2c1-0d4616d4809f | 5226254f-97d6-4080-a4fa-0269f2da1bc0 | a59f0ed9-f70d-474d-9347-4605af059856 | 3.0 | 0.66 | 1.98 | 0.00 | 2022-02-24 13:01:00 |
4 | cd0d8e88-e8fd-41d9-a4a4-8c9d4e05a1d8 | af5633bc-0008-40ee-b1a1-8dfd4c98eba9 | 5f38510e-1c5f-481a-98e8-8c282b03e7bf | 1.0 | 1.29 | 1.29 | 0.00 | 2022-03-20 14:01:00 |
Concept: Dimension table¶
A Dimension table represents a table in the data warehouse containing static descriptive data.
Using a Dimension table requires special attention. If the data in the table changes slowly, it is not advisable to use it because these changes can cause significant data leaks during model training and adversely affect the inference performance. In such cases, it is recommended to use a Slowly Changing Dimension table of Type 2 that maintains a history of changes.
To create a Dimension Table in FeatureByte, it is necessary to identify which column represents its primary key.
# register GroceryProduct as a dimension table
product_table = ds.get_source_table(
database_name="spark_catalog",
schema_name="GROCERY",
table_name="GROCERYPRODUCT"
).create_dimension_table(
name="GROCERYPRODUCT",
dimension_id_column="GroceryProductGuid"
)
# show the column names and types for grocery product
ds.get_source_table(database_name="spark_catalog", schema_name="GROCERY", table_name="GROCERYPRODUCT").dtypes
GroceryProductGuid    VARCHAR
ProductGroup          VARCHAR
dtype: object
# show sample data
product_table.sample(5)
 | GroceryProductGuid | ProductGroup
---|---|---
0 | 8a9c2ad6-865d-45b6-a039-6d0ebcc69b74 | Épices |
1 | 22a5f17e-feea-45e7-9cc0-49a9eb8029ac | Chat |
2 | 89de4090-3a29-4418-ae9b-34b93a484dc4 | Pains |
3 | 454dfbe9-75f2-4a58-a225-179b0533ea9a | Vinaigres |
4 | 6e936197-17c3-4b8b-8427-8536704f9d2d | Frites et Pommes de terre |
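Now that all four tables are declared, you can confirm that they are registered in the active catalog. The lines below are a minimal sketch; the list_tables and get_table methods are assumed to be available in your installed SDK version:
# sketch: confirm the four tables are registered in the active catalog (assumed API)
catalog = fb.Catalog.get_active()
print(catalog.list_tables())
# a registered table can also be retrieved by name for later use
customer_table = catalog.get_table("GROCERYCUSTOMER")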
Analyse and set feature job settings¶
Learning Objectives
In this section you will:
- learn about feature jobs
- analyze data to estimate optimal feature job settings
- declare feature job settings
Background¶
FeatureByte is designed to work with data warehouses that receive regular data refreshes from operational sources, which means that feature values may differ in terms of freshness and availability. If these operational limitations are not taken into account, inconsistencies between offline and online feature values may occur.
To prevent such inconsistencies, it's crucial to synchronize the frequency of batch feature computations with the frequency of source table refreshes, and to compute features after the source table refresh is fully completed. In addition, for historical serving to accurately replicate the production environment, it's essential to use data that was available at the historical points-in-time, considering the present or future latency of data. Latency of data refers to the time difference between the timestamp of an event and the timestamp at which the event data is accessible for ingestion. Any period during which data may be missing is referred to as a "blind spot".
To address these challenges, the Feature Job Setting in FeatureByte captures information about the frequency of batch feature computations, the timing of the batch process, and the assumed blind spot for the data. This helps ensure consistency between offline and online feature values, and accurate historical serving that reflects the conditions present in the production environment.
Concept: Feature job¶
A Feature Job is a batch process that generates both offline and online tiles and feature values for a specific feature before storing them in the feature store. The scheduling of a Feature Job is determined by the feature job settings associated with the respective feature.
Feature job orchestration is initiated when a feature is deployed and continues until the feature deployment is disabled, ensuring the feature store consistently possesses the latest values for each feature.
Concept: Feature job setting¶
The Feature Job Setting in FeatureByte captures essential details about batch feature computations for the online feature store, including the frequency and timing of the batch process, as well as the assumed blind spot for the data. This helps to maintain consistency between offline and online feature values and ensures accurate historical serving that reflects the production environment. The setting comprises three parameters:
- The frequency parameter specifies how often the batch process should run.
- The time_modulo_frequency parameter defines the timing from the end of the frequency time period to when the feature job commences. For example, a feature job with the following settings (frequency: 60m, time_modulo_frequency: 130s) will start 2 minutes and 10 seconds after the beginning of each hour: 00:02:10, 01:02:10, 02:02:10, …, 15:02:10, …, 23:02:10.
- The blind_spot parameter sets the time gap between feature computation and the latest event timestamp to be processed.
Case study: A data warehouse refreshes each hour. The data refresh starts 10 seconds after the hour and is usually finished within 2 minutes. Sometimes the data refresh misses the latest data, up to a maximum of the last 30 seconds at the end of the hour. The feature job settings will therefore be (see the sketch after this list):
- frequency: 60m
- time_modulo_frequency: 10s + 2m + 5s (a safety buffer) = 135s
- blind_spot: 30s + 10s + 2m + 5s = 165s
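As an illustration, these case study values could be captured in a FeatureJobSetting object. This is a sketch only; the parameter names below follow the SDK version used in this tutorial and should be checked against your installed release:
# sketch: encode the case study values as a feature job setting (assumed parameter names)
case_study_setting = fb.FeatureJobSetting(
    frequency="60m",               # batch feature computation runs hourly
    time_modulo_frequency="135s",  # 10s start delay + 2m refresh + 5s safety buffer
    blind_spot="165s",             # 30s missed data + 10s delay + 2m refresh + 5s buffer
)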
To accommodate changes in source data management, Feature Job Setting supports versioning of both features and feature lists. The scheduling information and blind spot are stored in the Feature Job Setting metadata associated with each feature.
While Feature Jobs are primarily designed to support online requests, this information is also used during historical requests to minimize offline-online inconsistency.
Concept: Feature job setting recommendations¶
FeatureByte automatically analyzes data availability and freshness of an event table in order to suggest an optimal setting for scheduling Feature Jobs and associated Blind Spot information.
This analysis relies on the presence of record creation timestamps in the source table, typically added when updating data in the warehouse. The analysis focuses on a recent time window, such as the past four weeks.
FeatureByte estimates the data update frequency based on the distribution of time intervals among the sequence of record creation timestamps. It also assesses the timeliness of source table updates and identifies late jobs using an outlier detection algorithm. By default, the recommended scheduling time takes late jobs into account.
To accommodate data that may not arrive on time during warehouse updates, a blind spot is proposed for determining the cutoff of feature aggregation windows, in addition to the scheduling frequency and timing of the Feature Job. The suggested blind spot is the one whose percentage of late data is closest to the user-defined tolerance, which defaults to 0.005%.
To validate the Feature Job schedule and blind spot recommendations, a backtest is conducted. Users can also backtest their own settings.
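For illustration, an explicit analysis and backtest of candidate settings might look like the sketch below; the create_new_feature_job_setting_analysis and backtest method names are assumptions to verify against the SDK reference:
# sketch: run an explicit analysis and backtest a candidate setting (assumed API)
analysis = invoice_table.create_new_feature_job_setting_analysis()
candidate_setting = fb.FeatureJobSetting(
    frequency="60m", time_modulo_frequency="135s", blind_spot="165s"
)
analysis.backtest(candidate_setting)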
Example: Feature job settings analysis¶
# initialize the feature job settings for the invoice table
invoice_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 12.1s (0.08%/s)
The analysis period starts at 2023-04-13 00:17:03 and ends at 2023-05-11 00:17:03
The column used for the event timestamp is Timestamp
The column used for the record creation timestamp in GROCERYINVOICE is record_available_at
STATISTICS ON TIME BETWEEN GROCERYINVOICE RECORDS CREATIONS
- Average time is 4209.777777777777 s
- Median time is 3600.0 s
- Lowest time is 3600.0 s
- Largest time is 28800.0 s
based on a total of 498 unique record creation timestamps.
The longer time between records creations are due to 173 MISSING UPDATES.