Quick Start Tutorial: Model Training¶

Learning Objectives¶

In this tutorial you will learn:

How to design an observation set for your use case
How to materialize training data
How your ML training environment can consume training data

Set up the prerequisites¶

Learning Objectives

In this section you will:

start your local featurebyte server
import libraries
learn the about catalogs
activate a pre-built catalog

In [1]:

            
                Copied!
                
# library imports
import pandas as pd
import numpy as np
import random

# load the featurebyte SDK
import featurebyte as fb

# start the local server, then wait for it to be healthy before proceeding
fb.playground()
# library imports
import pandas as pd
import numpy as np
import random

# load the featurebyte SDK
import featurebyte as fb

# start the local server, then wait for it to be healthy before proceeding
fb.playground()

02:24:50 | INFO     | Using configuration file at: /home/chester/.featurebyte/config.yaml
02:24:50 | INFO     | Active profile: local (http://127.0.0.1:8088)
02:24:50 | INFO     | SDK version: 0.2.2
02:24:50 | INFO     | Active catalog: default
02:24:50 | INFO     | 0 feature list, 0 feature deployed
02:24:50 | INFO     | (1/4) Starting featurebyte services
 Container redis  Running
 Container spark-thrift  Running
 Container mongo-rs  Running
 Container featurebyte-worker  Running
 Container featurebyte-server  Running
 Container redis  Waiting
 Container mongo-rs  Waiting
 Container mongo-rs  Waiting
 Container mongo-rs  Healthy
 Container mongo-rs  Healthy
 Container redis  Healthy
02:24:51 | INFO     | (2/4) Creating local spark feature store
02:24:51 | INFO     | (3/4) Import datasets
02:24:51 | INFO     | Dataset grocery already exists, skipping import
02:24:51 | INFO     | Dataset healthcare already exists, skipping import
02:24:51 | INFO     | Dataset creditcard already exists, skipping import
02:24:51 | INFO     | (4/4) Playground environment started successfully. Ready to go! 🚀

Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up¶

Note that creating a pre-built catalog is not a step you will do in real-life. This is a function specific to this quick-start tutorial to quickly skip over many of the preparatory steps and get you to a point where you can materialize features.

In a real-life project you would do data modeling, declaring the tables, entities, and the associated metadata. This would not be a frequent task, but forms the basis for best-practice feature engineering.

In [2]:

            
                Copied!
                
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *

# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.QuickStartModelTraining)
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *

# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.QuickStartModelTraining)

Cleaning up existing tutorial catalogs

02:24:54 | INFO     | Catalog activated: quick start model training 20230511:0224

Building a quick start catalog for model training named [quick start model training 20230511:0224]
Creating new catalog
Catalog created
Registering the source tables
Registering the entities
Tagging the entities to columns in the data tables
Populating the feature store with example features
Setting feature readiness
Saving Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s 
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s
Saving Feature(s) |████████████████████████████████████████| 8/8 [100%] in 2.5s 
Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 2.0s
Catalog created and pre-populated with data and features

Example: Create views from tables in the Catalog¶

In [3]:

            
                Copied!
                
# create the views
grocery_customer_view = catalog.get_view("GROCERYCUSTOMER")
grocery_invoice_view = catalog.get_view("GROCERYINVOICE")
grocery_items_view = catalog.get_view("INVOICEITEMS")
grocery_product_view = catalog.get_view("GROCERYPRODUCT")
# create the views
grocery_customer_view = catalog.get_view("GROCERYCUSTOMER")
grocery_invoice_view = catalog.get_view("GROCERYINVOICE")
grocery_items_view = catalog.get_view("INVOICEITEMS")
grocery_product_view = catalog.get_view("GROCERYPRODUCT")

Create an observation set for your use case¶

Learning Objectives

In this section you will learn:

the purpose of observation sets
the relationship between entities, point in time, and observation sets
how to design an observation set suitable for training data

Case Study: Predicting Customer Spend¶

Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.

Concept: Materialization¶

A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.

The materialization of features is made on demand to fulfill historical requests, whereas for prediction purposes, feature values are generated through a batch process called a "Feature Job". The Feature Job is scheduled based on the defined settings associated with each feature.

Concept: Observation set¶

An observation set combines entity key values and historical points-in-time, for which you wish to materialize feature values.

The observation set can be a pandas DataFrame or an ObservationTable object representing an observation set in the feature store. An accepted serving name must be used for the column containing the entity values. The column containing points-in-time must be labelled "POINT-IN-TIME" and the point-in-time timestamps should be in UTC.

Concept: Point in time¶

A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.

It is a crucial aspect of historical feature serving, which allows machine learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.

Case Study: Predicting Customer Spend¶

Your chain of grocery stores wants to target market customers immediately after each purchase. As one step in this marketing campaign, they want to predict future customer spend in the 14 days after a purchase.

In [4]:

            
                Copied!
                
# get the feature list for the target feature
customer_target_list = catalog.get_feature_list("TargetFeature")

# display details about the target feature
display(customer_target_list.list_features())
# get the feature list for the target feature
customer_target_list = catalog.get_feature_list("TargetFeature")

# display details about the target feature
display(customer_target_list.list_features())

Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s

	id	name	version	dtype	readiness	online_enabled	tables	primary_tables	entities	primary_entities	created_at
0	645c5226acd3a4fed277636f	Target	V230511	FLOAT	PRODUCTION_READY	False	[GROCERYINVOICE]	[GROCERYINVOICE]	[grocerycustomer]	[grocerycustomer]	2023-05-11 02:25:43.244

In [5]:

            
                Copied!
                
                    
                    
                
                

        
# create a large observation table from a view

# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & \
    (grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01"))
observation_set_view = grocery_invoice_view[filter].copy()

# create a new observation table
observation_table = observation_set_view.create_observation_table(
    name = "10,000 Customers immediately after each purchase from May-22 to Mar-23",
    sample_rows = 10000,
    columns = ["Timestamp", "GroceryCustomerGuid"],
    columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)

# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())
# create a large observation table from a view

# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & \
    (grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01"))
observation_set_view = grocery_invoice_view[filter].copy()

# create a new observation table
observation_table = observation_set_view.create_observation_table(
    name = "10,000 Customers immediately after each purchase from May-22 to Mar-23",
    sample_rows = 10000,
    columns = ["Timestamp", "GroceryCustomerGuid"],
    columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)

# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())

Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)         
Downloading table |████████████████████████████████████████| 10000/10000 [100%]

	POINT_IN_TIME	GROCERYCUSTOMERGUID
0	2022-04-05 18:55:03	5c96089d-95f7-4a12-ab13-e082836253f1
1	2022-04-08 13:10:00	5c96089d-95f7-4a12-ab13-e082836253f1
2	2022-04-11 11:49:05	5c96089d-95f7-4a12-ab13-e082836253f1
3	2022-04-15 09:50:57	5c96089d-95f7-4a12-ab13-e082836253f1
4	2022-05-14 15:00:07	5c96089d-95f7-4a12-ab13-e082836253f1
...	...	...
9995	2022-06-12 09:20:38	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9996	2022-06-26 10:03:35	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9997	2022-07-30 09:03:31	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9998	2022-08-09 12:41:11	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9999	2022-09-28 12:42:39	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365

10000 rows × 2 columns

Materialize Training Data¶

Learning Objectives

In this section you will learn:

how to create historical training data
how to merge target and features

Example: Get historical values¶

In [6]:

            
                Copied!
                
# list the feature lists
display(catalog.list_feature_lists())
# list the feature lists
display(catalog.list_feature_lists())

	id	name	num_feature	status	deployed	readiness_frac	online_frac	tables	entities	created_at
0	645c5226acd3a4fed2776367	Features	8	DRAFT	False	1.0	0.0	[GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS...	[grocerycustomer, frenchstate]	2023-05-11 02:25:48.839
1	645c5228acd3a4fed2776373	TargetFeature	1	DRAFT	False	1.0	0.0	[GROCERYINVOICE]	[grocerycustomer]	2023-05-11 02:25:44.540

In [7]:

            
                Copied!
                
# get the feature list
feature_list = catalog.get_feature_list("Features")
# get the feature list
feature_list = catalog.get_feature_list("Features")

Loading Feature(s) |████████████████████████████████████████| 8/8 [100%] in 2.0s

In [8]:

            
                Copied!
                
                    
                    
                
                

        
# Compute the historical feature table
training_table_features = feature_list.compute_historical_feature_table(
    observation_table,
    historical_feature_table_name='customer training table - invoices Apr-22 to Mar-23 - features only'
)

# display the training data
display(training_table_features.to_pandas())
# Compute the historical feature table
training_table_features = feature_list.compute_historical_feature_table(
    observation_table,
    historical_feature_table_name='customer training table - invoices Apr-22 to Mar-23 - features only'
)

# display the training data
display(training_table_features.to_pandas())

Done! |████████████████████████████████████████| 100% in 1:31.0 (0.01%/s)       
Downloading table |████████████████████████████████████████| 10000/10000 [100%]

	POINT_IN_TIME	GROCERYCUSTOMERGUID	CustomerAvgInvoiceAmount_28d	CustomerSpend_28d	CustomerStateSimilarity_28d	CustomerInventoryStability_14d28d	StateMeanLongitude	StateMeanLatitude	StateAvgInvoiceAmount_28d	StatePopulation
0	2022-04-01 08:43:25	352d1de1-4419-40e5-b2a5-6d6922384b05	7.990000	15.98	0.367632	0.866025	2.237559	48.740582	18.021939	183
1	2022-04-01 09:57:05	ed56f1f6-310d-4b7c-9f5b-554103282f15	35.840000	35.84	0.871330	1.000000	-1.871965	48.354199	15.970000	3
2	2022-04-01 12:20:01	b21ae11c-83cf-4146-832e-1163413a3295	2.940938	94.11	0.823614	0.968960	-0.530407	49.185500	8.032955	5
3	2022-04-01 13:10:44	32dd07d0-2c16-4b34-8cc9-01f258e0b935	3.590833	43.09	0.270366	0.868540	2.237559	48.740582	17.888677	183
4	2022-04-01 13:42:08	24196ecb-be71-42b2-a748-89ed1960e4fc	11.569167	138.83	0.636762	0.922243	2.237559	48.740582	17.888677	183
...	...	...	...	...	...	...	...	...	...	...
9995	2022-12-31 13:07:50	41c1bdf5-b596-4fc4-9570-ecd86a0d9a98	10.664000	53.32	0.640142	0.978258	-1.075038	47.401700	24.036600	18
9996	2022-12-31 14:11:20	ad22d91b-6212-46ad-af9e-7e7b2df034d9	58.835000	235.34	0.833155	0.895061	-0.494788	44.676056	18.831443	25
9997	2022-12-31 15:24:49	4c90b25e-628f-4692-b221-cc4fd07896aa	3.558571	24.91	0.501629	0.963529	-0.494788	44.676056	18.959091	25
9998	2022-12-31 16:21:56	59d264dd-494b-4c79-9794-d6fa103b0f7e	16.726667	100.36	0.843072	0.699854	4.386779	48.815086	18.112222	8
9999	2022-12-31 18:07:27	5f18f733-ef27-423b-8fb7-6172948c9255	7.414762	155.71	0.768394	0.945736	5.887195	43.456104	16.312388	53

10000 rows × 10 columns

Example: Get target values¶

When target values use aggregates or time offsets, you first need to offset the point in time by the time window.

In [9]:

            
                Copied!
                
# add 14 days to the timestamps in the observation set
observation_set_target = observation_table.to_pandas().copy()
observation_set_target["POINT_IN_TIME"] = observation_set_target["POINT_IN_TIME"] + pd.DateOffset(days=14)

display(observation_set_target)
# add 14 days to the timestamps in the observation set
observation_set_target = observation_table.to_pandas().copy()
observation_set_target["POINT_IN_TIME"] = observation_set_target["POINT_IN_TIME"] + pd.DateOffset(days=14)

display(observation_set_target)

Downloading table |████████████████████████████████████████| 10000/10000 [100%]

	POINT_IN_TIME	GROCERYCUSTOMERGUID
0	2022-04-19 18:55:03	5c96089d-95f7-4a12-ab13-e082836253f1
1	2022-04-22 13:10:00	5c96089d-95f7-4a12-ab13-e082836253f1
2	2022-04-25 11:49:05	5c96089d-95f7-4a12-ab13-e082836253f1
3	2022-04-29 09:50:57	5c96089d-95f7-4a12-ab13-e082836253f1
4	2022-05-28 15:00:07	5c96089d-95f7-4a12-ab13-e082836253f1
...	...	...
9995	2022-06-26 09:20:38	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9996	2022-07-10 10:03:35	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9997	2022-08-13 09:03:31	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9998	2022-08-23 12:41:11	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365
9999	2022-10-12 12:42:39	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365

10000 rows × 2 columns

In [10]:

            
                Copied!
                
# Materialize the target feature using get historical features
training_data_target = customer_target_list.compute_historical_features(observation_set_target)

# remove the offset from the point in time column
training_data_target["POINT_IN_TIME"] = training_data_target["POINT_IN_TIME"] - pd.DateOffset(days=14)

display(training_data_target)
# Materialize the target feature using get historical features
training_data_target = customer_target_list.compute_historical_features(observation_set_target)

# remove the offset from the point in time column
training_data_target["POINT_IN_TIME"] = training_data_target["POINT_IN_TIME"] - pd.DateOffset(days=14)

display(training_data_target)

Retrieving Historical Feature(s) |████████████████████████████████████████| 2/2

	POINT_IN_TIME	GROCERYCUSTOMERGUID	Target
0	2022-04-05 18:55:03	5c96089d-95f7-4a12-ab13-e082836253f1	114.55
1	2022-04-08 13:10:00	5c96089d-95f7-4a12-ab13-e082836253f1	107.77
2	2022-04-11 11:49:05	5c96089d-95f7-4a12-ab13-e082836253f1	79.24
3	2022-04-15 09:50:57	5c96089d-95f7-4a12-ab13-e082836253f1	74.16
4	2022-05-14 15:00:07	5c96089d-95f7-4a12-ab13-e082836253f1	143.37
...	...	...	...
9995	2022-06-12 09:20:38	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365	8.86
9996	2022-06-26 10:03:35	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365	21.15
9997	2022-07-30 09:03:31	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365	15.25
9998	2022-08-09 12:41:11	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365	28.19
9999	2022-09-28 12:42:39	ec4b86c7-a0ae-44f5-ba7c-0c19c8685365	27.14

10000 rows × 3 columns

Example: Merging materialized values for features and target¶

In [11]:

            
                Copied!
                
# merge training data features and training data target
training_data = training_table_features.to_pandas()
training_data = training_data.merge(
    training_data_target, on=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]
)

display(training_data)
# merge training data features and training data target
training_data = training_table_features.to_pandas()
training_data = training_data.merge(
    training_data_target, on=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]
)

display(training_data)

Downloading table |████████████████████████████████████████| 10000/10000 [100%]

	POINT_IN_TIME	GROCERYCUSTOMERGUID	CustomerAvgInvoiceAmount_28d	CustomerSpend_28d	CustomerStateSimilarity_28d	CustomerInventoryStability_14d28d	StateMeanLongitude	StateMeanLatitude	StateAvgInvoiceAmount_28d	StatePopulation	Target
0	2022-04-01 08:43:25	352d1de1-4419-40e5-b2a5-6d6922384b05	7.990000	15.98	0.367632	0.866025	2.237559	48.740582	18.021939	183	7.79
1	2022-04-01 09:57:05	ed56f1f6-310d-4b7c-9f5b-554103282f15	35.840000	35.84	0.871330	1.000000	-1.871965	48.354199	15.970000	3	149.77
2	2022-04-01 12:20:01	b21ae11c-83cf-4146-832e-1163413a3295	2.940938	94.11	0.823614	0.968960	-0.530407	49.185500	8.032955	5	23.61
3	2022-04-01 13:10:44	32dd07d0-2c16-4b34-8cc9-01f258e0b935	3.590833	43.09	0.270366	0.868540	2.237559	48.740582	17.888677	183	34.78
4	2022-04-01 13:42:08	24196ecb-be71-42b2-a748-89ed1960e4fc	11.569167	138.83	0.636762	0.922243	2.237559	48.740582	17.888677	183	56.48
...	...	...	...	...	...	...	...	...	...	...	...
9999	2022-12-31 13:07:50	41c1bdf5-b596-4fc4-9570-ecd86a0d9a98	10.664000	53.32	0.640142	0.978258	-1.075038	47.401700	24.036600	18	104.50
10000	2022-12-31 14:11:20	ad22d91b-6212-46ad-af9e-7e7b2df034d9	58.835000	235.34	0.833155	0.895061	-0.494788	44.676056	18.831443	25	177.10
10001	2022-12-31 15:24:49	4c90b25e-628f-4692-b221-cc4fd07896aa	3.558571	24.91	0.501629	0.963529	-0.494788	44.676056	18.959091	25	10.64
10002	2022-12-31 16:21:56	59d264dd-494b-4c79-9794-d6fa103b0f7e	16.726667	100.36	0.843072	0.699854	4.386779	48.815086	18.112222	8	97.37
10003	2022-12-31 18:07:27	5f18f733-ef27-423b-8fb7-6172948c9255	7.414762	155.71	0.768394	0.945736	5.887195	43.456104	16.312388	53	32.18

10004 rows × 11 columns

Consuming training data¶

Learning Objectives

In this section you will learn:

how to save a training file
how to use a pandas data frame

Example: Save the training data to a file¶

In [12]:

            
                Copied!
                
# save training data as a csv file
training_data.to_csv("training_data.csv", index=False)
# save training data as a csv file
training_data.to_csv("training_data.csv", index=False)

In [13]:

            
                Copied!
                
# save the training file as a parquet file
training_data.to_parquet("training_data.parquet")
# save the training file as a parquet file
training_data.to_parquet("training_data.parquet")

Example: Training a scikit learn model¶

Note that you will need to install scikit learn https://scikit-learn.org/stable/install.html

In [14]:

            
                Copied!
                
# EDA on the training data
training_data.describe()
# EDA on the training data
training_data.describe()

Out[14]:

	CustomerAvgInvoiceAmount_28d	CustomerSpend_28d	CustomerStateSimilarity_28d	CustomerInventoryStability_14d28d	StateMeanLongitude	StateMeanLatitude	StateAvgInvoiceAmount_28d	StatePopulation	Target
count	9627.000000	10004.000000	10004.000000	10004.000000	10004.000000	10004.000000	10002.000000	10004.000000	10004.000000
mean	18.147826	133.975648	0.589494	0.757273	3.352376	45.211608	18.175718	78.274790	83.101813
std	14.750370	122.513374	0.219764	0.300337	9.154240	9.968909	3.685447	75.154162	71.969131
min	0.790000	0.000000	0.000000	0.000000	-50.017299	-12.713308	3.887857	1.000000	0.000000
25%	8.382917	43.085000	0.478941	0.721688	2.237559	43.706807	16.552013	14.000000	29.607500
50%	14.645357	97.910000	0.627414	0.883883	2.241215	48.177401	17.808388	33.000000	61.755000
75%	23.188889	189.810000	0.743846	0.946222	5.054081	48.739485	20.130470	180.000000	117.667500
max	332.300000	801.030000	1.000000	1.000000	45.189819	50.669452	47.358750	183.000000	541.140000

In [15]:

            
                Copied!
                
# do any columns in the training data contain missing values?
training_data.isna().any()
# do any columns in the training data contain missing values?
training_data.isna().any()

Out[15]:

POINT_IN_TIME                        False
GROCERYCUSTOMERGUID                  False
CustomerAvgInvoiceAmount_28d          True
CustomerSpend_28d                    False
CustomerStateSimilarity_28d          False
CustomerInventoryStability_14d28d    False
StateMeanLongitude                   False
StateMeanLatitude                    False
StateAvgInvoiceAmount_28d             True
StatePopulation                      False
Target                               False
dtype: bool

In [16]:

            
                Copied!
                
! pip install scikit-learn
! pip install scikit-learn

Requirement already satisfied: scikit-learn in ./venv/lib/python3.10/site-packages (1.2.2)
Requirement already satisfied: scipy>=1.3.2 in ./venv/lib/python3.10/site-packages (from scikit-learn) (1.10.1)
Requirement already satisfied: joblib>=1.1.1 in ./venv/lib/python3.10/site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./venv/lib/python3.10/site-packages (from scikit-learn) (3.1.0)
Requirement already satisfied: numpy>=1.17.3 in ./venv/lib/python3.10/site-packages (from scikit-learn) (1.24.3)

In [17]:

            
                Copied!
                
                    
                    
                
                

        
# use sklearn to train a random forest regression model on the training data
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(training_data.drop(columns=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]), 
                                                    training_data["Target"], test_size=0.2, random_state=42)

# train the model
model = HistGradientBoostingRegressor()
model.fit(X_train, y_train)

# get predictions
y_pred = model.predict(X_test)

# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)

# save the model
import joblib
joblib.dump(model, "model.pkl")
# use sklearn to train a random forest regression model on the training data
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(training_data.drop(columns=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"]), 
                                                    training_data["Target"], test_size=0.2, random_state=42)

# train the model
model = HistGradientBoostingRegressor()
model.fit(X_train, y_train)

# get predictions
y_pred = model.predict(X_test)

# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)

# save the model
import joblib
joblib.dump(model, "model.pkl")

Mean squared error:  12.721547723729405

Out[17]:

['model.pkl']

Next Steps¶

Now that you've completed the quick-start feature engineering tutorial, you can put your knowledge into practice or learn more:

Learn more about materializing features via the "Deep Dive Materializing Features" tutorial
Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" workspaces
Learn more about feature governance via the "Quick Start Feature Governance" tutorial