Quick Start Tutorial: End-to-End Workflow¶
Learning Objectives¶
In this tutorial you will learn how to:
- Create a catalog
- Define a data model for a catalog
- Add features to a catalog
- Solve a use case
- Deploy and serve a feature list
- Manage the feature list lifecycle
Set up the prerequisites¶
Learning Objectives
In this section you will:
- import libraries
- start your local featurebyte server
Load the featurebyte library and connect to the local instance of featurebyte¶
# library imports
import pandas as pd
import numpy as np
from datetime import datetime
# load the featurebyte SDK
import featurebyte as fb
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
02:09:43 | INFO | Using configuration file at: /home/chester/.featurebyte/config.yaml
02:09:43 | INFO | Active profile: local (http://127.0.0.1:8088)
02:09:43 | INFO | SDK version: 0.2.2
02:09:43 | INFO | Active catalog: default
02:09:43 | INFO | 0 feature list, 0 feature deployed
02:09:43 | INFO | (1/4) Starting featurebyte services
Container mongo-rs Running
Container featurebyte-server Running
Container spark-thrift Running
Container redis Running
Container featurebyte-worker Running
Container mongo-rs Waiting
Container redis Waiting
Container redis Healthy
Container mongo-rs Healthy
02:09:44 | INFO | (2/4) Creating local spark feature store
02:09:44 | INFO | (3/4) Import datasets
02:09:45 | INFO | Dataset grocery already exists, skipping import
02:09:45 | INFO | Dataset healthcare already exists, skipping import
02:09:45 | INFO | Dataset creditcard already exists, skipping import
02:09:45 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a catalog¶
Once you have a feature store, you can create a Catalog, which acts as a central repository for metadata related to FeatureByte objects: tables, entities, features, and feature lists.
For data sources covering multiple domains, use separate Catalogs for each domain to maintain clarity and easy access to domain-specific metadata.
Learning Objectives
In this section you will:
- learn about catalogs
- create a new catalog
Concept: Catalog¶
A Catalog object operates as a centralized metadata repository for organizing tables, entities, features, feature lists, and other objects to facilitate feature serving for a specific domain. By employing a catalog, your team members can share, search, access, and reuse these assets.
Example: Create a new catalog¶
catalog_name = "quick start end-to-end " + datetime.now().strftime("%Y%m%d:%H%M")
# create a catalog
catalog = fb.Catalog.create(catalog_name, 'playground')
# you can activate an existing catalog
catalog = fb.Catalog.activate(catalog_name)
02:09:45 | INFO | Catalog activated: quick start end-to-end 20230511:0209
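The newly created catalog is now the active catalog for the session. When you or a teammate return to an existing catalog later, you can look it up by name rather than recreating it. The snippet below is a minimal sketch of that pattern; it assumes the Catalog.list and Catalog.get methods of the SDK.
# list the catalogs registered in this FeatureByte instance (returns a DataFrame)
fb.Catalog.list()
# retrieve an existing catalog object by name; use fb.Catalog.activate to switch the active catalog
catalog = fb.Catalog.get(catalog_name)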
Define a Data Model¶
Defining your Catalog's Data Model is crucial for feature engineering and for the organization and serving of features and feature lists. It is an infrequent but essential task that establishes good practices for the rest of the workflow.
Learning Objectives
In this section you will:
- be introduced to the case study dataset
- declare FeatureByte catalog tables
- define data cleaning operations
- declare and tag entities
Case Study: French grocery dataset¶
The French grocery dataset consists of four tables with data from a chain of grocery stores.
The data source has already been declared in the playground feature store that was installed as part of FeatureByte.
Concept: Data source¶
A data source is a collection of tables accessible via a connection to a data warehouse or database. It is used to explore and retrieve details about tables that can be used as source tables in the FeatureByte catalog.
Example: Connect to a pre-defined data source¶
# get data source from the local spark feature store
ds = fb.FeatureStore.get("playground").get_data_source()
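You can use the data source to explore what the warehouse exposes before registering anything in the catalog. The snippet below is a minimal sketch; the listing method names (list_databases, list_schemas, list_source_tables) are assumed and may differ slightly between SDK versions.
# explore the databases, schemas and tables available through the data source
ds.list_databases()
ds.list_schemas(database_name="spark_catalog")
ds.list_source_tables(database_name="spark_catalog", schema_name="GROCERY")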
Concept: Catalog table¶
A Catalog Table provides a centralized location for metadata about a source table. This metadata determines the type of operations that can be applied to the table's views and includes essential information for feature engineering.
Example: Declare catalog tables¶
# register GROCERYINVOICE as an event table
invoice_table = ds.get_source_table(
database_name="spark_catalog",
schema_name="GROCERY",
table_name="GROCERYINVOICE"
).create_event_table(
name="GROCERYINVOICE",
event_id_column="GroceryInvoiceGuid",
event_timestamp_column="Timestamp",
event_timestamp_timezone_offset_column="tz_offset",
record_creation_timestamp_column="record_available_at"
)
# show sample data
invoice_table.sample(5)
| | GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | record_available_at | Amount |
|---|---|---|---|---|---|---|
| 0 | 6f0f8768-59b0-4bf1-aa21-258c83515e45 | d2fc87d2-3584-4c8f-9359-b3ff10b5dc09 | 2022-12-26 16:29:21 | +01:00 | 2022-12-26 17:01:00 | 24.04 |
| 1 | 7d12246f-d5f7-4ed6-8aa2-8611beb7f613 | b6d4377e-9f04-4c04-bc56-b970e54279ca | 2023-02-18 16:00:03 | +01:00 | 2023-02-18 17:01:00 | 25.61 |
| 2 | 49887643-3fa1-4171-89e0-344160238c01 | c22fa3eb-55a5-4a4f-9301-38f6b6f0567e | 2022-06-23 18:24:52 | +02:00 | 2022-06-23 19:01:00 | 23.88 |
| 3 | 2cfae7b0-b3a2-4973-9561-de8e4788e388 | e034e01c-50de-42f0-a879-82c093af5f49 | 2022-12-19 15:49:29 | +01:00 | 2022-12-19 16:01:00 | 6.21 |
| 4 | 360eb328-b0cb-4f75-bf64-d6b5216b50ad | 7a1bc5dc-e198-419e-b972-0abbdf8903c1 | 2022-02-13 16:52:33 | +01:00 | 2022-02-13 17:01:00 | 14.30 |
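The metadata captured at registration can also be inspected directly on the catalog table. This is a minimal sketch assuming the table's info method; it summarizes the columns and the special-column assignments declared above.
# inspect the metadata recorded for the registered event table
invoice_table.info()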
# register invoice items as an item table
items_table = ds.get_source_table(
database_name="spark_catalog",
schema_name="GROCERY",
table_name="INVOICEITEMS"
).create_item_table(
name="INVOICEITEMS",
event_id_column="GroceryInvoiceGuid",
item_id_column="GroceryInvoiceItemGuid",
event_table_name="GROCERYINVOICE"
)
# show sample data
items_table.sample(5)
| | GroceryInvoiceItemGuid | GroceryInvoiceGuid | GroceryProductGuid | Quantity | UnitPrice | TotalCost | Discount | record_available_at |
|---|---|---|---|---|---|---|---|---|
| 0 | 0c5181e3-9d9c-402d-902a-1649c3a26232 | 8d99deb7-78cc-4924-ac04-9cb99e1e282c | 8b9739d4-1a3f-4c96-886d-d0492ba45c07 | 1.0 | 1.74 | 1.74 | 0.18 | 2022-08-02 10:01:00 |
| 1 | 5b853ed2-aea7-4fad-aaa5-bcadbef0eba8 | 163e7004-db43-4e0d-a093-cd7bf27caf10 | a7fd9147-874f-4f3d-b262-3e408cc30db8 | 1.0 | 2.50 | 2.50 | 0.39 | 2023-04-14 17:01:00 |
| 2 | d2d7633e-3bdf-430d-920e-13825cad3e19 | 4aac4b3b-0cd9-4bf7-a650-68f40fb85865 | 5d9e7f80-4c03-44b9-b44b-5083f0645261 | 1.0 | 0.75 | 0.75 | 0.00 | 2022-06-28 13:01:00 |
| 3 | 7c4c38cc-7150-4bca-b2c1-0d4616d4809f | 5226254f-97d6-4080-a4fa-0269f2da1bc0 | a59f0ed9-f70d-474d-9347-4605af059856 | 3.0 | 0.66 | 1.98 | 0.00 | 2022-02-24 13:01:00 |
| 4 | cd0d8e88-e8fd-41d9-a4a4-8c9d4e05a1d8 | af5633bc-0008-40ee-b1a1-8dfd4c98eba9 | 5f38510e-1c5f-481a-98e8-8c282b03e7bf | 1.0 | 1.29 | 1.29 | 0.00 | 2022-03-20 14:01:00 |
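Both tables are now registered in the active catalog and can be discovered by any team member working in it. A quick way to confirm this, sketched below with the catalog's list_tables and get_table methods, is to list the catalog's tables and retrieve them by name.
# list the tables registered in the active catalog
catalog.list_tables()
# tables can also be retrieved by name later, for example in another notebook
invoice_table = catalog.get_table("GROCERYINVOICE")
items_table = catalog.get_table("INVOICEITEMS")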
Concept: Feature job setting¶
The Feature Job Setting in FeatureByte captures essential details about batch feature computations for the online feature store, including the frequency and timing of the batch process, as well as the assumed blind spot for the data. This helps to maintain consistency between offline and online feature values and ensures accurate historical serving that reflects the production environment. The setting comprises three parameters:
- The frequency parameter specifies how often the batch process should run.
- The time_modulo_frequency parameter defines the timing of the batch process.
- The blind_spot parameter sets the time gap between feature computation and the latest event timestamp to be processed.
To ensure consistency of Feature Job Setting across features created by different team members, a Default Feature Job Setting is defined at the table level. However, it's possible to override this setting during feature declaration.
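If the batch schedule of your pipelines is already known, the default setting can be declared explicitly rather than derived from the automated analysis shown next. The snippet below is a hedged sketch with illustrative values; it assumes the fb.FeatureJobSetting constructor and the table's update_default_feature_job_setting method.
# declare a default feature job setting manually (illustrative values, not recommendations)
manual_setting = fb.FeatureJobSetting(
    blind_spot="120s",            # gap between feature computation and the latest event processed
    frequency="60m",              # how often the batch process runs
    time_modulo_frequency="90s",  # timing of the batch process within each period
)
invoice_table.update_default_feature_job_setting(manual_setting)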
Example: Feature job settings analysis¶
# initialize the feature job settings for the invoice table
invoice_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 12.1s (0.08%/s)
The analysis period starts at 2023-04-13 00:17:03 and ends at 2023-05-11 00:17:03
The column used for the event timestamp is Timestamp
The column used for the record creation timestamp in GROCERYINVOICE is record_available_at
STATISTICS ON TIME BETWEEN GROCERYINVOICE RECORDS CREATIONS
- Average time is 4209.777777777777 s
- Median time is 3600.0 s
- Lowest time is 3600.0 s
- Largest time is 28800.0 s
based on a total of 498 unique record creation timestamps.
The longer time between records creations are due to 173 MISSING UPDATES.
This includes a buffer of 5 s to allow for late jobs.
The 76 jobs that occurred after missing jobs don't seem to have processed significantly older records.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 60 s
- blind spot for 99.9 % of events to land: 120 s
- blind spot for 99.95 % of events to land: 120 s
- blind spot for 99.99 % of events to land: 120 s
- blind spot for 99.995 % of events to land: 120 s
- blind spot for 100.0 % of events to land: 120 s