Quick Start Tutorial: Model Training¶
Learning Objectives¶
In this tutorial you will learn:
- How to design an observation set for your use case
- How to materialize training data
- How your ML training environment can consume training data
Set up the prerequisites¶
Learning Objectives
In this section you will:
- start your local featurebyte server
- import libraries
- learn the about catalogs
- activate a pre-built catalog
# library imports
import pandas as pd
import numpy as np
import random
# load the featurebyte SDK
import featurebyte as fb
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
02:24:50 | INFO | Using configuration file at: /home/chester/.featurebyte/config.yaml 02:24:50 | INFO | Active profile: local (http://127.0.0.1:8088) 02:24:50 | INFO | SDK version: 0.2.2 02:24:50 | INFO | Active catalog: default 02:24:50 | INFO | 0 feature list, 0 feature deployed 02:24:50 | INFO | (1/4) Starting featurebyte services Container redis Running Container spark-thrift Running Container mongo-rs Running Container featurebyte-worker Running Container featurebyte-server Running Container redis Waiting Container mongo-rs Waiting Container mongo-rs Waiting Container mongo-rs Healthy Container mongo-rs Healthy Container redis Healthy 02:24:51 | INFO | (2/4) Creating local spark feature store 02:24:51 | INFO | (3/4) Import datasets 02:24:51 | INFO | Dataset grocery already exists, skipping import 02:24:51 | INFO | Dataset healthcare already exists, skipping import 02:24:51 | INFO | Dataset creditcard already exists, skipping import 02:24:51 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up¶
Note that creating a pre-built catalog is not a step you will do in real-life. This is a function specific to this quick-start tutorial to quickly skip over many of the preparatory steps and get you to a point where you can materialize features.
In a real-life project you would do data modeling, declaring the tables, entities, and the associated metadata. This would not be a frequent task, but forms the basis for best-practice feature engineering.
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *
# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.QuickStartModelTraining)
Cleaning up existing tutorial catalogs
02:24:54 | INFO | Catalog activated: quick start model training 20230511:0224
Building a quick start catalog for model training named [quick start model training 20230511:0224] Creating new catalog Catalog created Registering the source tables Registering the entities Tagging the entities to columns in the data tables Populating the feature store with example features Setting feature readiness Saving Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s Saving Feature(s) |████████████████████████████████████████| 8/8 [100%] in 2.5s Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 2.0s Catalog created and pre-populated with data and features
Example: Create views from tables in the Catalog¶
# create the views
grocery_customer_view = catalog.get_view("GROCERYCUSTOMER")
grocery_invoice_view = catalog.get_view("GROCERYINVOICE")
grocery_items_view = catalog.get_view("INVOICEITEMS")
grocery_product_view = catalog.get_view("GROCERYPRODUCT")
Create an observation set for your use case¶
Learning Objectives
In this section you will learn:
- the purpose of observation sets
- the relationship between entities, point in time, and observation sets
- how to design an observation set suitable for training data
Case Study: Predicting Customer Spend¶
Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.
Concept: Materialization¶
A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.
The materialization of features is made on demand to fulfill historical requests, whereas for prediction purposes, feature values are generated through a batch process called a "Feature Job". The Feature Job is scheduled based on the defined settings associated with each feature.
Concept: Observation set¶
An observation set combines entity key values and historical points-in-time, for which you wish to materialize feature values.
The observation set can be a pandas DataFrame or an ObservationTable object representing an observation set in the feature store. An accepted serving name must be used for the column containing the entity values. The column containing points-in-time must be labelled "POINT-IN-TIME" and the point-in-time timestamps should be in UTC.
Concept: Point in time¶
A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.
It is a crucial aspect of historical feature serving, which allows machine learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.
Case Study: Predicting Customer Spend¶
Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.
# get the feature list for the target feature
customer_target_list = catalog.get_feature_list("TargetFeature")
# display details about the target feature
display(customer_target_list.list_features())
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s
id | name | version | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 645c5226acd3a4fed277636f | Target | V230511 | FLOAT | PRODUCTION_READY | False | [GROCERYINVOICE] | [GROCERYINVOICE] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:25:43.244 |
# create a large observation table from a view
# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & \
(grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01"))
observation_set_view = grocery_invoice_view[filter].copy()
# create a new observation table
observation_table = observation_set_view.create_observation_table(
name = "10,000 Customers immediately after each purchase from May-22 to Mar-23",
sample_rows = 10000,
columns = ["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)
# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s) Downloading table |████████████████████████████████████████| 10000/10000 [100%]
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-04-05 18:55:03 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-04-08 13:10:00 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-04-11 11:49:05 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
3 | 2022-04-15 09:50:57 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
4 | 2022-05-14 15:00:07 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
... | ... | ... |
9995 | 2022-06-12 09:20:38 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9996 | 2022-06-26 10:03:35 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9997 | 2022-07-30 09:03:31 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9998 | 2022-08-09 12:41:11 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9999 | 2022-09-28 12:42:39 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
10000 rows × 2 columns
Materialize Training Data¶
Learning Objectives
In this section you will learn:
- how to create historical training data
- how to merge target and features
Example: Get historical values¶
# list the feature lists
display(catalog.list_feature_lists())
id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c5226acd3a4fed2776367 | Features | 8 | DRAFT | False | 1.0 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | 2023-05-11 02:25:48.839 |
1 | 645c5228acd3a4fed2776373 | TargetFeature | 1 | DRAFT | False | 1.0 | 0.0 | [GROCERYINVOICE] | [grocerycustomer] | 2023-05-11 02:25:44.540 |
# get the feature list
feature_list = catalog.get_feature_list("Features")
Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 2.0s
# Compute the historical feature table
training_table_features = feature_list.compute_historical_feature_table(
observation_table,
historical_feature_table_name='customer training table - invoices Apr-22 to Mar-23 - features only'
)
# display the training data
display(training_table_features.to_pandas())
Done! |████████████████████████████████████████| 100% in 1:31.0 (0.01%/s) Downloading table |████████████████████████████████████████| 10000/10000 [100%]
POINT_IN_TIME | GROCERYCUSTOMERGUID | CustomerAvgInvoiceAmount_28d | CustomerSpend_28d | CustomerStateSimilarity_28d | CustomerInventoryStability_14d28d | StateMeanLongitude | StateMeanLatitude | StateAvgInvoiceAmount_28d | StatePopulation | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-04-01 08:43:25 | 352d1de1-4419-40e5-b2a5-6d6922384b05 | 7.990000 | 15.98 | 0.367632 | 0.866025 | 2.237559 | 48.740582 | 18.021939 | 183 |
1 | 2022-04-01 09:57:05 | ed56f1f6-310d-4b7c-9f5b-554103282f15 | 35.840000 | 35.84 | 0.871330 | 1.000000 | -1.871965 | 48.354199 | 15.970000 | 3 |
2 | 2022-04-01 12:20:01 | b21ae11c-83cf-4146-832e-1163413a3295 | 2.940938 | 94.11 | 0.823614 | 0.968960 | -0.530407 | 49.185500 | 8.032955 | 5 |
3 | 2022-04-01 13:10:44 | 32dd07d0-2c16-4b34-8cc9-01f258e0b935 | 3.590833 | 43.09 | 0.270366 | 0.868540 | 2.237559 | 48.740582 | 17.888677 | 183 |
4 | 2022-04-01 13:42:08 | 24196ecb-be71-42b2-a748-89ed1960e4fc | 11.569167 | 138.83 | 0.636762 | 0.922243 | 2.237559 | 48.740582 | 17.888677 | 183 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 2022-12-31 13:07:50 | 41c1bdf5-b596-4fc4-9570-ecd86a0d9a98 | 10.664000 | 53.32 | 0.640142 | 0.978258 | -1.075038 | 47.401700 | 24.036600 | 18 |
9996 | 2022-12-31 14:11:20 | ad22d91b-6212-46ad-af9e-7e7b2df034d9 | 58.835000 | 235.34 | 0.833155 | 0.895061 | -0.494788 | 44.676056 | 18.831443 | 25 |
9997 | 2022-12-31 15:24:49 | 4c90b25e-628f-4692-b221-cc4fd07896aa | 3.558571 | 24.91 | 0.501629 | 0.963529 | -0.494788 | 44.676056 | 18.959091 | 25 |
9998 | 2022-12-31 16:21:56 | 59d264dd-494b-4c79-9794-d6fa103b0f7e | 16.726667 | 100.36 | 0.843072 | 0.699854 | 4.386779 | 48.815086 | 18.112222 | 8 |
9999 | 2022-12-31 18:07:27 | 5f18f733-ef27-423b-8fb7-6172948c9255 | 7.414762 | 155.71 | 0.768394 | 0.945736 | 5.887195 | 43.456104 | 16.312388 | 53 |
10000 rows × 10 columns
Example: Get target values¶
When target values use aggregates or time offsets, you first need to offset the point in time by the time window.
# add 14 days to the timestamps in the observation set
observation_set_target = observation_table.to_pandas().copy()
observation_set_target["POINT_IN_TIME"] = observation_set_target["POINT_IN_TIME"] + pd.DateOffset(days=14)
display(observation_set_target)
Downloading table |████████████████████████████████████████| 10000/10000 [100%]
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-04-19 18:55:03 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-04-22 13:10:00 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-04-25 11:49:05 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
3 | 2022-04-29 09:50:57 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
4 | 2022-05-28 15:00:07 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
... | ... | ... |
9995 | 2022-06-26 09:20:38 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9996 | 2022-07-10 10:03:35 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9997 | 2022-08-13 09:03:31 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9998 | 2022-08-23 12:41:11 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
9999 | 2022-10-12 12:42:39 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 |
10000 rows × 2 columns
# Materialize the target feature using get historical features
training_data_target = customer_target_list.compute_historical_features(observation_set_target)
# remove the offset from the point in time column
training_data_target["POINT_IN_TIME"] = training_data_target["POINT_IN_TIME"] - pd.DateOffset(days=14)
display(training_data_target)
Retrieving Historical Feature(s) |████████████████████████████████████████| 2/2
POINT_IN_TIME | GROCERYCUSTOMERGUID | Target | |
---|---|---|---|
0 | 2022-04-05 18:55:03 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 114.55 |
1 | 2022-04-08 13:10:00 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 107.77 |
2 | 2022-04-11 11:49:05 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 79.24 |
3 | 2022-04-15 09:50:57 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 74.16 |
4 | 2022-05-14 15:00:07 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 143.37 |
... | ... | ... | ... |
9995 | 2022-06-12 09:20:38 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 | 8.86 |
9996 | 2022-06-26 10:03:35 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 | 21.15 |
9997 | 2022-07-30 09:03:31 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 | 15.25 |
9998 | 2022-08-09 12:41:11 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 | 28.19 |
9999 | 2022-09-28 12:42:39 | ec4b86c7-a0ae-44f5-ba7c-0c19c8685365 | 27.14 |
10000 rows × 3 columns
Example: Merging materialized values for features and target¶
# merge training data features and training data target
training_data = training_table_features.to_pandas()
training_data = training_data.merge(
training_data_target, on=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]
)
display(training_data)
Downloading table |████████████████████████████████████████| 10000/10000 [100%]
POINT_IN_TIME | GROCERYCUSTOMERGUID | CustomerAvgInvoiceAmount_28d | CustomerSpend_28d | CustomerStateSimilarity_28d | CustomerInventoryStability_14d28d | StateMeanLongitude | StateMeanLatitude | StateAvgInvoiceAmount_28d | StatePopulation | Target | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-04-01 08:43:25 | 352d1de1-4419-40e5-b2a5-6d6922384b05 | 7.990000 | 15.98 | 0.367632 | 0.866025 | 2.237559 | 48.740582 | 18.021939 | 183 | 7.79 |
1 | 2022-04-01 09:57:05 | ed56f1f6-310d-4b7c-9f5b-554103282f15 | 35.840000 | 35.84 | 0.871330 | 1.000000 | -1.871965 | 48.354199 | 15.970000 | 3 | 149.77 |
2 | 2022-04-01 12:20:01 | b21ae11c-83cf-4146-832e-1163413a3295 | 2.940938 | 94.11 | 0.823614 | 0.968960 | -0.530407 | 49.185500 | 8.032955 | 5 | 23.61 |
3 | 2022-04-01 13:10:44 | 32dd07d0-2c16-4b34-8cc9-01f258e0b935 | 3.590833 | 43.09 | 0.270366 | 0.868540 | 2.237559 | 48.740582 | 17.888677 | 183 | 34.78 |
4 | 2022-04-01 13:42:08 | 24196ecb-be71-42b2-a748-89ed1960e4fc | 11.569167 | 138.83 | 0.636762 | 0.922243 | 2.237559 | 48.740582 | 17.888677 | 183 | 56.48 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9999 | 2022-12-31 13:07:50 | 41c1bdf5-b596-4fc4-9570-ecd86a0d9a98 | 10.664000 | 53.32 | 0.640142 | 0.978258 | -1.075038 | 47.401700 | 24.036600 | 18 | 104.50 |
10000 | 2022-12-31 14:11:20 | ad22d91b-6212-46ad-af9e-7e7b2df034d9 | 58.835000 | 235.34 | 0.833155 | 0.895061 | -0.494788 | 44.676056 | 18.831443 | 25 | 177.10 |
10001 | 2022-12-31 15:24:49 | 4c90b25e-628f-4692-b221-cc4fd07896aa | 3.558571 | 24.91 | 0.501629 | 0.963529 | -0.494788 | 44.676056 | 18.959091 | 25 | 10.64 |
10002 | 2022-12-31 16:21:56 | 59d264dd-494b-4c79-9794-d6fa103b0f7e | 16.726667 | 100.36 | 0.843072 | 0.699854 | 4.386779 | 48.815086 | 18.112222 | 8 | 97.37 |
10003 | 2022-12-31 18:07:27 | 5f18f733-ef27-423b-8fb7-6172948c9255 | 7.414762 | 155.71 | 0.768394 | 0.945736 | 5.887195 | 43.456104 | 16.312388 | 53 | 32.18 |
10004 rows × 11 columns
Consuming training data¶
Learning Objectives
In this section you will learn:
- how to save a training file
- how to use a pandas data frame
Example: Save the training data to a file¶
# save training data as a csv file
training_data.to_csv("training_data.csv", index=False)
# save the training file as a parquet file
training_data.to_parquet("training_data.parquet")
Example: Training a scikit learn model¶
Note that you will need to install scikit learn https://scikit-learn.org/stable/install.html
# EDA on the training data
training_data.describe()
CustomerAvgInvoiceAmount_28d | CustomerSpend_28d | CustomerStateSimilarity_28d | CustomerInventoryStability_14d28d | StateMeanLongitude | StateMeanLatitude | StateAvgInvoiceAmount_28d | StatePopulation | Target | |
---|---|---|---|---|---|---|---|---|---|
count | 9627.000000 | 10004.000000 | 10004.000000 | 10004.000000 | 10004.000000 | 10004.000000 | 10002.000000 | 10004.000000 | 10004.000000 |
mean | 18.147826 | 133.975648 | 0.589494 | 0.757273 | 3.352376 | 45.211608 | 18.175718 | 78.274790 | 83.101813 |
std | 14.750370 | 122.513374 | 0.219764 | 0.300337 | 9.154240 | 9.968909 | 3.685447 | 75.154162 | 71.969131 |
min | 0.790000 | 0.000000 | 0.000000 | 0.000000 | -50.017299 | -12.713308 | 3.887857 | 1.000000 | 0.000000 |
25% | 8.382917 | 43.085000 | 0.478941 | 0.721688 | 2.237559 | 43.706807 | 16.552013 | 14.000000 | 29.607500 |
50% | 14.645357 | 97.910000 | 0.627414 | 0.883883 | 2.241215 | 48.177401 | 17.808388 | 33.000000 | 61.755000 |
75% | 23.188889 | 189.810000 | 0.743846 | 0.946222 | 5.054081 | 48.739485 | 20.130470 | 180.000000 | 117.667500 |
max | 332.300000 | 801.030000 | 1.000000 | 1.000000 | 45.189819 | 50.669452 | 47.358750 | 183.000000 | 541.140000 |
# do any columns in the training data contain missing values?
training_data.isna().any()
POINT_IN_TIME False GROCERYCUSTOMERGUID False CustomerAvgInvoiceAmount_28d True CustomerSpend_28d False CustomerStateSimilarity_28d False CustomerInventoryStability_14d28d False StateMeanLongitude False StateMeanLatitude False StateAvgInvoiceAmount_28d True StatePopulation False Target False dtype: bool
! pip install scikit-learn
Requirement already satisfied: scikit-learn in ./venv/lib/python3.10/site-packages (1.2.2) Requirement already satisfied: scipy>=1.3.2 in ./venv/lib/python3.10/site-packages (from scikit-learn) (1.10.1) Requirement already satisfied: joblib>=1.1.1 in ./venv/lib/python3.10/site-packages (from scikit-learn) (1.2.0) Requirement already satisfied: threadpoolctl>=2.0.0 in ./venv/lib/python3.10/site-packages (from scikit-learn) (3.1.0) Requirement already satisfied: numpy>=1.17.3 in ./venv/lib/python3.10/site-packages (from scikit-learn) (1.24.3)
# use sklearn to train a random forest regression model on the training data
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(training_data.drop(columns=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]),
training_data["Target"], test_size=0.2, random_state=42)
# train the model
model = HistGradientBoostingRegressor()
model.fit(X_train, y_train)
# get predictions
y_pred = model.predict(X_test)
# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)
# save the model
import joblib
joblib.dump(model, "model.pkl")
Mean squared error: 12.721547723729405
['model.pkl']
Next Steps¶
Now that you've completed the quick-start feature engineering tutorial, you can put your knowledge into practice or learn more:
- Learn more about materializing features via the "Deep Dive Materializing Features" tutorial
- Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" workspaces
- Learn more about feature governance via the "Quick Start Feature Governance" tutorial