Deep Dive Tutorial: Materializing Features¶
Learning Objectives¶
In this tutorial you will learn:
- How to construct an observation set
- How features, entities, and observation sets are used together
- How to preview features
- How to get historical values
- How and why to deploy features
- How to serve and consume deployed features
Set up the prerequisites¶
Learning Objectives
In this section you will:
- start your local featurebyte server
- import libraries
- learn about catalogs
- activate a pre-built catalog
Load the featurebyte library and connect to the local instance of featurebyte¶
# library imports
import pandas as pd
import numpy as np
# load the featurebyte SDK
import featurebyte as fb
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
02:01:40 | INFO | Using configuration file at: /home/chester/.featurebyte/config.yaml
02:01:40 | INFO | Active profile: local (http://127.0.0.1:8088)
02:01:40 | INFO | SDK version: 0.2.2
02:01:40 | INFO | Active catalog: default
02:01:40 | INFO | 0 feature list, 0 feature deployed
02:01:40 | INFO | (1/4) Starting featurebyte services
Container redis Running
Container mongo-rs Running
Container spark-thrift Running
Container featurebyte-worker Running
Container featurebyte-server Running
Container mongo-rs Waiting
Container redis Waiting
Container mongo-rs Waiting
Container redis Healthy
Container mongo-rs Healthy
Container mongo-rs Healthy
02:01:41 | INFO | (2/4) Creating local spark feature store
02:01:41 | INFO | (3/4) Import datasets
02:01:42 | INFO | Dataset grocery already exists, skipping import
02:01:42 | INFO | Dataset healthcare already exists, skipping import
02:01:42 | INFO | Dataset creditcard already exists, skipping import
02:01:42 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up¶
Note that creating a pre-built catalog is not a step you will do in real life. This function is specific to this tutorial: it skips over many of the preparatory steps and gets you to a point where you can materialize features.
In a real-life project you would do the data modeling yourself, declaring the tables, entities, and the associated metadata. This is not a frequent task, but it forms the basis for best-practice feature engineering.
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *
# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.DeepDiveMaterializingFeatures)
Cleaning up existing tutorial catalogs
02:01:43 | INFO | Catalog activated: deep dive feature engineering 20230511:0145
Cleaning catalog: deep dive feature engineering 20230511:0145
1 historical feature tables
3 observation tables
Done! |████████████████████████████████████████| 100% in 6.0s (0.17%/s)
Done! |████████████████████████████████████████| 100% in 6.0s (0.17%/s)
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Done! |████████████████████████████████████████| 100% in 6.0s (0.17%/s)
02:02:07 | INFO | Catalog activated: default
Building a deep dive catalog for materializing features named [deep dive materializing features 20230511:0202]
Creating new catalog
02:02:07 | INFO | Catalog activated: deep dive materializing features 20230511:0202
Catalog created
Registering the source tables
Registering the entities
Tagging the entities to columns in the data tables
Populating the feature store with example features
Saving Feature(s) |████████████████████████████████████████| 4/4 [100%] in 4.7s
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 2.1s
Saving Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.7s
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s
Catalog created and pre-populated with data and features
Load the tables for this catalog¶
# get the tables for this catalog
grocery_customer_table = catalog.get_table("GROCERYCUSTOMER")
grocery_items_table = catalog.get_table("INVOICEITEMS")
grocery_invoice_table = catalog.get_table("GROCERYINVOICE")
grocery_product_table = catalog.get_table("GROCERYPRODUCT")
Create views for the tables in this catalog¶
# create the views
grocery_customer_view = grocery_customer_table.get_view()
grocery_invoice_view = grocery_invoice_table.get_view()
grocery_items_view = grocery_items_table.get_view()
grocery_product_view = grocery_product_table.get_view()
How to construct an observation set¶
Learning Objectives
In this section you will learn:
- the purpose of observation sets
- the relationship between entities, point in time, and observation sets
- how to construct an observation set
Concept: Materialization¶
A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.
Features are materialized on demand to fulfill historical requests, whereas for prediction purposes feature values are generated through a batch process called a "Feature Job". Each Feature Job is scheduled according to the feature job settings associated with the feature.
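For reference, a Feature Job schedule is captured by a FeatureJobSetting object. A minimal sketch, reusing the values that appear later in this catalog's generated feature definition:
from featurebyte import FeatureJobSetting
# hourly feature job that starts 90 seconds after each hour, with no blind spot
example_job_setting = FeatureJobSetting(
    blind_spot="0s",
    frequency="3600s",
    time_modulo_frequency="90s",
)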
Concept: Observation set¶
An observation set combines entity key values and historical points-in-time, for which you wish to materialize feature values.
The observation set can be a Pandas DataFrame or an ObservationTable object representing an observation set in the feature store.
Concept: Point in time¶
A point-in-time for a feature refers to a specific moment in the past with which the feature's values are associated.
It is a crucial aspect of historical feature serving, which allows machine learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.
An observation set is created as a Pandas DataFrame containing the keys for the primary entity, and points in time. The column name for the primary entity must be its serving name, and the column name for the point in time must be "POINT_IN_TIME".
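As a minimal illustration (the GUID below is a placeholder, not a customer from this dataset), an observation set can be constructed directly:
# a hand-built observation set with two observation points for one hypothetical customer
manual_observation_set = pd.DataFrame({
    "GROCERYCUSTOMERGUID": ["00000000-0000-0000-0000-000000000000"] * 2,
    "POINT_IN_TIME": pd.to_datetime(["2022-06-01 12:00:00", "2022-09-01 12:00:00"]),
})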
Example: Create an observation set based upon events¶
Some use cases are about events and require predictions to be triggered when a specified event occurs.
For a use case requiring predictions about a grocery customer whenever an invoice event occurs, your observation set may be sampled from historical invoices.
# show the serving name for grocery customer
entity_list = catalog.list_entities()
display(entity_list[entity_list.name == "grocerycustomer"])
id | name | serving_names | created_at | |
---|---|---|---|---|
3 | 645c4caa2ce151fd3fe4e2b9 | grocerycustomer | [GROCERYCUSTOMERGUID] | 2023-05-11 02:02:18.177 |
# get a sample of 200 customer IDs and invoice event timestamps from 01-Apr-2022 to 31-Mar-2023
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & \
(grocery_invoice_view["Timestamp"] <= pd.to_datetime("2023-03-31"))
observation_set = (
grocery_invoice_view[filter].sample(200)[["GroceryCustomerGuid", "Timestamp"]]
.rename({
"Timestamp": "POINT_IN_TIME",
"GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
}, axis=1)
)
display(observation_set)
GROCERYCUSTOMERGUID | POINT_IN_TIME | |
---|---|---|
0 | a900e82a-5742-4929-aaf7-7e79ed5383f2 | 2022-04-14 20:01:23 |
1 | 7a024068-3f99-4114-9d90-3a61f679be51 | 2022-07-05 16:03:08 |
2 | 5b185248-658c-4dbe-bbb7-70d215fb6a05 | 2022-12-20 07:59:08 |
3 | 12c2d702-1b92-4375-8fd4-5b3bd18f7d87 | 2022-11-13 17:09:40 |
4 | d7316f3d-6ea9-49b6-97f0-3d20ea9d1331 | 2023-01-31 18:11:55 |
... | ... | ... |
195 | 4eb4ee84-ee13-4eec-9c26-61b6eb4ba35b | 2022-10-31 09:22:10 |
196 | 2b54ef0e-8b02-4f1e-896a-767d23a6162a | 2022-09-03 12:17:46 |
197 | 3eb57343-4b91-4e06-bed5-c763514c4e64 | 2022-04-05 18:52:48 |
198 | 144a0fe4-2137-43f6-b266-411b9eb7cb31 | 2023-01-30 14:21:34 |
199 | 888aa655-927f-41c8-a0ba-7dab2872fca8 | 2022-08-13 15:18:07 |
200 rows × 2 columns
Concept: Observation table¶
An ObservationTable object is a representation of an observation set in the feature store. Unlike a local Pandas DataFrame, the ObservationTable is part of the catalog and can be shared or reused.
ObservationTable objects can be created from a source table or from a view after subsampling.
Example: Create an observation table based upon events¶
# create a large observation table from a view
# observation tables are the recommended workflow for training data
# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & \
(grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01"))
observation_set_view = grocery_invoice_view[filter].copy()
# create a new observation table
observation_table = observation_set_view.create_observation_table(
name = "10000 customers who were active between 01-Apr-2022 and 31-Mar-2023",
sample_rows = 10000,
columns = ["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)
# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s) Downloading table |████████████████████████████████████████| 10000/10000 [100%]
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-04-05 18:55:03 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-04-08 13:10:00 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-05-14 15:00:07 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
3 | 2022-05-20 13:03:26 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
4 | 2022-05-29 15:35:31 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
... | ... | ... |
9995 | 2022-11-21 17:52:55 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9996 | 2022-11-27 09:02:18 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9997 | 2022-11-29 12:19:40 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9998 | 2022-11-30 19:11:36 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
9999 | 2022-11-30 19:12:27 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
10000 rows × 2 columns
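Because the observation table is saved in the catalog, it can be listed and retrieved later by name for reuse, assuming your SDK version provides these catalog accessors (a sketch):
# list observation tables registered in the catalog
display(catalog.list_observation_tables())
# retrieve a previously created observation table by name
reused_observation_table = catalog.get_observation_table(
    "10000 customers who were active between 01-Apr-2022 and 31-Mar-2023"
)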
Example: Create an observation set based upon regularly scheduled batch predictions¶
Some use cases require predictions to be triggered at regular intervals. Others have conditions under which only a subset of entities require predictions.
A use case requiring monthly predictions for recently active customers may use an observation set containing sample customer IDs combined with predefined timestamps.
# define a function to list a sample of the customers who were active in a given month of 2022
def get_recently_active_customers(month_number):
    # filter the invoices by month
    filter = (grocery_invoice_view["Timestamp"].dt.month == month_number) & (grocery_invoice_view["Timestamp"].dt.year == 2022)
    # get a sample of customers who had an invoice in the month
    recently_active_customers = grocery_invoice_view[filter].sample(200)["GroceryCustomerGuid"].unique()
    # get the start of the month
    start_of_month = pd.Timestamp(f"2022-{month_number}-01")
    # get the end of the month
    end_of_month = start_of_month + pd.DateOffset(months=1)
    # set the point in time by subtracting 0.001 seconds from the end of the month
    point_in_time = end_of_month - pd.Timedelta(seconds=0.001)
    # combine the point in time with the customer IDs
    recently_active_customers = pd.DataFrame({
        "GROCERYCUSTOMERGUID": recently_active_customers,
        "POINT_IN_TIME": point_in_time,
    })
    return recently_active_customers
# create an observation set comprised of up to 200 customers per month who were active in that month in the second half of 2022
observation_set = pd.concat([get_recently_active_customers(month_number) for month_number in range(7, 13)], ignore_index=True)
display(observation_set)
GROCERYCUSTOMERGUID | POINT_IN_TIME | |
---|---|---|
0 | 575ceb64-e6ef-446d-9a38-929e35e4cbef | 2022-07-31 23:59:59.999 |
1 | b95f380e-7e7b-4bca-9762-fd9a4fd07419 | 2022-07-31 23:59:59.999 |
2 | cfd39ed9-3140-4af5-9f72-77881aa6c2a8 | 2022-07-31 23:59:59.999 |
3 | 79b85aee-d548-4e6d-89b0-6969fcce5feb | 2022-07-31 23:59:59.999 |
4 | db2d5721-8869-40f7-984c-a94d614fdf69 | 2022-07-31 23:59:59.999 |
... | ... | ... |
856 | ff38d86f-cd9a-4860-9b0a-eb387bfe0a10 | 2022-12-31 23:59:59.999 |
857 | 5fc2332e-03ac-448d-bf34-f3322cdc295e | 2022-12-31 23:59:59.999 |
858 | 6132395b-aa85-4fc7-849d-8b8bbd47e1f9 | 2022-12-31 23:59:59.999 |
859 | c6ef9073-3351-4f54-869a-4c926a479520 | 2022-12-31 23:59:59.999 |
860 | 20f61507-e7d7-450d-b44f-665d1dfd889f | 2022-12-31 23:59:59.999 |
861 rows × 2 columns
Previewing features¶
Learning Objectives
In this section you will learn:
- how to preview features
- the limitations of previews
Example: Preview features¶
During feature prototyping, new features may not have been saved to the catalog yet. A data scientist will want to preview sample feature values to sanity check their feature declarations.
# create a lookup feature that is the city in which the customer resides
customer_city_feature = grocery_customer_view.City.as_feature("CustomerCity")
# preview materialized values for the unsaved feature
display(customer_city_feature.preview(observation_set.sample(5)))
GROCERYCUSTOMERGUID | POINT_IN_TIME | CustomerCity | |
---|---|---|---|
857 | 5fc2332e-03ac-448d-bf34-f3322cdc295e | 2022-12-31 23:59:59.999 | LONGJUMEAU |
471 | d73f08a7-4206-4687-9165-40a8de80a0e0 | 2022-10-31 23:59:59.999 | STAINS |
692 | 7ee2ad09-6876-4744-bcef-a6e0cb839fb4 | 2022-11-30 23:59:59.999 | CERGY |
262 | a773e0ef-73aa-4e71-b14e-ab0bf6f8a0a9 | 2022-08-31 23:59:59.999 | VALENCE |
546 | 3f575b5f-1b7e-4e36-b923-2b8cfd77cef3 | 2022-10-31 23:59:59.999 | TOURS |
Feature previews are not suited to creating training files or feature serving. Previews have a limitation of 50 rows and do not create an audit trail.
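If your observation set is larger than the preview limit, cap the number of rows you pass to preview, for example:
# preview accepts a limited number of rows, so sample at most 50
preview_rows = observation_set.sample(min(50, len(observation_set)))
display(customer_city_feature.preview(preview_rows))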
Create training data¶
Learning Objectives
In this section you will learn:
- how to design an observation set suitable for training data
- how to get historical values for a feature list
- how to get historical values for the target
- how to join features and the target to create training data
Design an Observation Set for Training¶
Training Observation Set Design: a training data observation set should typically meet the following criteria (the sketch after this list shows quick checks for some of them):
- be collected from a time period that starts no earlier than the earliest data availability timestamp plus the longest time window used by the features
- be collected from a time period that ends before the latest data timestamp minus the time window of the target value
- use points in time that align with the anticipated timing of inference for the use case, whether it is based on a regular schedule, triggered by an event, or any other timing mechanism
- contain no duplicate rows
- have a column containing the primary entity of the use case, named with its serving name
- have a column named "POINT_IN_TIME" containing the points in time
- for the same entity key, use points in time separated by intervals greater than the target horizon, to avoid leakage
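These criteria can be spot-checked directly on a Pandas observation set. A minimal sketch using plain pandas (the 14-day horizon matches the case study below; the checks are illustrative, not part of the SDK):
# no duplicate rows
assert not observation_set.duplicated().any(), "observation set contains duplicate rows"
# required columns are present, using the serving name and POINT_IN_TIME
assert {"GROCERYCUSTOMERGUID", "POINT_IN_TIME"} <= set(observation_set.columns)
# per entity, consecutive points in time should be further apart than the target horizon
gaps = (
    observation_set.sort_values("POINT_IN_TIME")
    .groupby("GROCERYCUSTOMERGUID")["POINT_IN_TIME"]
    .diff()
    .dropna()
)
assert (gaps > pd.Timedelta(days=14)).all(), "points in time are closer than the target horizon"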
Case Study: Predicting Customer Spend¶
Your chain of grocery stores wants to run a marketing campaign targeting customers immediately after each purchase. As one step in this campaign, they want to predict each customer's spend in the 14 days after a purchase.
Example: Create an observation table for training data¶
# describe the customer view
display(grocery_customer_view.describe())
RowID | GroceryCustomerGuid | ValidFrom | Gender | Title | GivenName | MiddleInitial | Surname | StreetAddress | City | State | PostalCode | BrowserUserAgent | DateOfBirth | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dtype | VARCHAR | VARCHAR | TIMESTAMP | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | VARCHAR | DATE | FLOAT | FLOAT |
unique | 530 | 500 | 530 | 2 | 4 | 347 | 26 | 352 | 512 | 300 | 27 | 353 | 82 | 495 | 530 | 530 |
%missing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
%empty | 0 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
entropy | 6.214608 | 6.191446 | NaN | 0.692285 | 1.146938 | 5.726251 | 2.925542 | 5.749627 | 6.201803 | 5.435211 | 2.49532 | 5.763347 | 3.814598 | NaN | NaN | NaN |
top | 0069200d-adf5-490a-acca-14bdf78072a0 | 0b7196a2-2dab-4218-a234-e193f7bc4470 | 2019-01-01 07:23:45 | male | Mr. | Joanna | A | Saindon | 1 cours Jean Jaures | PARIS | Île-de-France | 75004 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | NaN | -12.704022 | -0.102024 |
freq | 1.0 | 3.0 | 1.0 | 276.0 | 264.0 | 5.0 | 66.0 | 6.0 | 2.0 | 25.0 | 189.0 | 5.0 | 51.0 | NaN | 1.0 | 1.0 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 46.50512 | 2.383389 |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.108698 | 7.822694 |
min | NaN | NaN | 2019-01-01T07:23:45.000000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1937-07-02T00:00:00.000000000 | -12.71811 | -61.12404 |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 44.861372 | 1.959153 |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 48.555884 | 2.40135 |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 48.912734 | 4.734203 |
max | NaN | NaN | 2023-01-30T19:14:03.000000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003-11-05T00:00:00.000000000 | 51.11185 | 45.214809 |
Note that there are 500 unique customers
# describe the invoice view
display(grocery_invoice_view.describe())
GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | Amount | |
---|---|---|---|---|---|
dtype | VARCHAR | VARCHAR | TIMESTAMP | VARCHAR | FLOAT |
unique | 36076 | 500 | 36054 | 4 | 6668 |
%missing | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
%empty | 0 | 0 | NaN | 0 | NaN |
entropy | 6.214608 | 5.825998 | NaN | 0.828849 | NaN |
top | 000949fe-1884-40bb-939d-a52df200981f | 3019bdbf-667c-4081-acb5-26cd2d559c5e | 2022-01-05 11:34:17 | +02:00 | 1.0 |
freq | 1.0 | 582.0 | 2.0 | 18722.0 | 746.0 |
mean | NaN | NaN | NaN | NaN | 18.392823 |
std | NaN | NaN | NaN | NaN | 22.832304 |
min | NaN | NaN | 2022-01-01T04:17:46.000000000 | NaN | 0.0 |
25% | NaN | NaN | NaN | NaN | 3.99 |
50% | NaN | NaN | NaN | NaN | 10.2 |
75% | NaN | NaN | NaN | NaN | 23.5325 |
max | NaN | NaN | 2023-05-11T00:17:03.000000000 | NaN | 354.39 |
Note that the earliest data timestamp is at the beginning of 2022, and the timestamps end in the present.
# get the customer feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")
# display details about the features in the customer feature list
display(customer_feature_list.list_features())
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 1.0s
id | name | version | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 645c4cae2ce151fd3fe4e2c5 | StateMeanLongitude | V230511 | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:02:27.344 |
1 | 645c4cae2ce151fd3fe4e2c3 | StateMeanLatitude | V230511 | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:02:26.627 |
2 | 645c4cae2ce151fd3fe4e2c1 | CustomerInventoryMostFrequent_4w | V230511 | VARCHAR | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:25.571 |
3 | 645c4cae2ce151fd3fe4e2bf | CustomerInventoryEntropy_4w | V230511 | FLOAT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:24.037 |
Note that the longest time window in the features is 4 weeks.
# get the target
customer_target_list = catalog.get_feature_list("TargetFeature")
# display details about the target feature
display(customer_target_list.list_features())
Loading Feature(s) |████████████████████████████████████████| 1/1 [100%] in 0.2s
id | name | version | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 645c4cae2ce151fd3fe4e2cb | Target | V230511 | FLOAT | DRAFT | False | [GROCERYINVOICE] | [GROCERYINVOICE] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:31.545 |
Note that the time window for the target is 14 days.
We can conclude that it would be safe for the training data observation set's points in time to commence on 29-Jan-2022 (the earliest data timestamp plus the 4-week feature window) and end 14 days before the present.
We will create an observation set for invoice dates from Feb-2022 to Jan-2023.
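As a quick sanity check, the safe window can be computed from the timestamps noted in the describe() outputs above:
# earliest invoice timestamp, longest feature window, and target horizon noted above
earliest_data = pd.Timestamp("2022-01-01")
longest_feature_window = pd.Timedelta(weeks=4)
target_horizon = pd.Timedelta(days=14)
safe_start = earliest_data + longest_feature_window  # 2022-01-29
safe_end = pd.Timestamp.now() - target_horizon       # 14 days before the present
print(safe_start, safe_end)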
# create a large observation table from a view
# filter to get Feb-22 to Jan-23
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-02-01")) & \
(grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-02-01"))
observation_set_view = grocery_invoice_view[filter].copy()
# create a new observation table
observation_table_large = observation_set_view.create_observation_table(
name = '1000 customers who were active between 01-Feb-2022 and 31-Jan-2023',
sample_rows = 1000,
columns = ["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)
# if the observation table isn't too large, you can materialize it
display(observation_table_large.to_pandas())
Done! |████████████████████████████████████████| 100% in 12.1s (0.08%/s) Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-02-12 18:10:54 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-03-20 20:19:26 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-04-30 19:05:33 | abdef773-ab72-43b6-8e77-050804c1c5fc |
3 | 2022-05-27 14:17:12 | abdef773-ab72-43b6-8e77-050804c1c5fc |
4 | 2022-06-11 15:11:59 | abdef773-ab72-43b6-8e77-050804c1c5fc |
... | ... | ... |
995 | 2022-10-01 20:22:29 | c9bdbb70-27e7-4ca1-a429-17b67703c06b |
996 | 2022-12-16 14:14:18 | c9bdbb70-27e7-4ca1-a429-17b67703c06b |
997 | 2022-07-30 14:21:13 | 09d5703e-3238-4bec-9255-71d30f0d3fbb |
998 | 2022-02-03 10:18:56 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
999 | 2022-02-04 16:34:59 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
1000 rows × 2 columns
Example: Get historical values¶
# use compute_historical_features to get the feature values for the observation set
training_data_features = customer_feature_list.compute_historical_features(observation_set)
display(training_data_features)
Retrieving Historical Feature(s) |████████████████████████████████████████| 1/1
GROCERYCUSTOMERGUID | POINT_IN_TIME | StateMeanLongitude | StateMeanLatitude | CustomerInventoryMostFrequent_4w | CustomerInventoryEntropy_4w | |
---|---|---|---|---|---|---|
0 | 575ceb64-e6ef-446d-9a38-929e35e4cbef | 2022-07-31 23:59:59.999 | 4.453320 | 48.906913 | Pizza Surgelées | 2.257205 |
1 | b95f380e-7e7b-4bca-9762-fd9a4fd07419 | 2022-07-31 23:59:59.999 | 2.240549 | 48.737227 | Colas, Thés glacés et Sodas | 1.927392 |
2 | cfd39ed9-3140-4af5-9f72-77881aa6c2a8 | 2022-07-31 23:59:59.999 | 2.240549 | 48.737227 | Pains | 3.309872 |
3 | 79b85aee-d548-4e6d-89b0-6969fcce5feb | 2022-07-31 23:59:59.999 | 6.023457 | 47.176003 | Colas, Thés glacés et Sodas | 2.614161 |
4 | db2d5721-8869-40f7-984c-a94d614fdf69 | 2022-07-31 23:59:59.999 | 2.240549 | 48.737227 | Bières et Cidres | 2.153532 |
... | ... | ... | ... | ... | ... | ... |
856 | ff38d86f-cd9a-4860-9b0a-eb387bfe0a10 | 2022-12-31 23:59:59.999 | 5.054081 | 45.500198 | Sirops | 3.133063 |
857 | 5fc2332e-03ac-448d-bf34-f3322cdc295e | 2022-12-31 23:59:59.999 | 2.242254 | 48.739038 | Fromages | 3.254689 |
858 | 6132395b-aa85-4fc7-849d-8b8bbd47e1f9 | 2022-12-31 23:59:59.999 | 2.242254 | 48.739038 | Chips et Tortillas | 2.474379 |
859 | c6ef9073-3351-4f54-869a-4c926a479520 | 2022-12-31 23:59:59.999 | 5.887195 | 43.456104 | Pizza Surgelées | 2.378475 |
860 | 20f61507-e7d7-450d-b44f-665d1dfd889f | 2022-12-31 23:59:59.999 | 1.349651 | 47.662871 | Glaces et Sorbets | 3.442462 |
861 rows × 6 columns
Concept: Historical feature table¶
A HistoricalFeatureTable object represents a table in the feature store containing historical feature values from a historical feature request. The historical feature values can also be obtained as a Pandas DataFrame, but using a HistoricalFeatureTable object has some benefits such as handling large tables, storing the data in the feature store for reuse, and offering full lineage of the training and test data.
# the syntax is different when using an observation table to create a historical feature table
# Compute the historical feature table
training_table = customer_feature_list.compute_historical_feature_table(
observation_table_large,
historical_feature_table_name='customer training table on 1000 customers who were active between 01-Feb-2022 and 31-Jan-2023'
)
# display the training data
display(training_table.to_pandas())
Done! |████████████████████████████████████████| 100% in 33.3s (0.03%/s) Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
POINT_IN_TIME | GROCERYCUSTOMERGUID | StateMeanLongitude | StateMeanLatitude | CustomerInventoryMostFrequent_4w | CustomerInventoryEntropy_4w | |
---|---|---|---|---|---|---|
0 | 2022-02-01 15:51:21 | 94127b9f-1366-4bbe-afea-7cd77225da52 | 5.963028 | 48.799660 | Chips et Tortillas | 3.317135 |
1 | 2022-02-01 20:32:11 | ed9730f3-859b-4284-83a4-407032f81332 | 2.236025 | 48.740263 | None | NaN |
2 | 2022-02-02 00:03:38 | ca874bc5-6ee9-437e-b18e-7b52da691d6c | 2.236025 | 48.740263 | Soupe | 2.685945 |
3 | 2022-02-02 10:22:09 | 5b1300f3-54c3-4eab-b00e-b54ac7714a58 | 2.236025 | 48.740263 | Légumes Frais | 2.886775 |
4 | 2022-02-02 16:54:12 | c1e68071-765e-4616-bdce-630250c50a9f | 2.236025 | 48.740263 | Préparations pour Gâteaux et Flans | 2.369993 |
... | ... | ... | ... | ... | ... | ... |
995 | 2022-12-30 08:32:02 | d519b6c9-5f34-4b75-95e3-3778e2d63b01 | 5.887195 | 43.456104 | Chips et Tortillas | 2.692340 |
996 | 2022-12-30 10:41:09 | f761a5d1-3b66-4faf-82f1-6cd59e2e28f8 | 0.934599 | 49.391777 | Boucherie | 2.445018 |
997 | 2022-12-30 15:25:23 | fb92d601-9f59-41fb-97fe-79d627d95bd8 | 2.242254 | 48.739038 | Soupe | 2.971832 |
998 | 2022-12-31 19:45:53 | 13d79bc4-8887-4bf4-99ca-a496374fbff7 | 5.054081 | 45.500198 | Colas, Thés glacés et Sodas | 0.693147 |
999 | 2022-12-31 21:55:13 | d9800a91-c533-40f2-8b5e-375db88ec2ed | 2.242254 | 48.739038 | Préparations pour Gâteaux et Flans | -0.000000 |
1000 rows × 6 columns
Example: Get target values¶
When target values are computed as aggregates over a trailing window (or with a time offset), you first need to shift the points in time forward by the target's time window, compute the values, and then remove the offset.
# add 14 days to the timestamps in the observation set
observation_set_target = observation_table_large.to_pandas()
observation_set_target["POINT_IN_TIME"] = observation_set_target["POINT_IN_TIME"] + pd.DateOffset(days=14)
display(observation_set_target)
Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-02-26 18:10:54 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
1 | 2022-04-03 20:19:26 | 5c96089d-95f7-4a12-ab13-e082836253f1 |
2 | 2022-05-14 19:05:33 | abdef773-ab72-43b6-8e77-050804c1c5fc |
3 | 2022-06-10 14:17:12 | abdef773-ab72-43b6-8e77-050804c1c5fc |
4 | 2022-06-25 15:11:59 | abdef773-ab72-43b6-8e77-050804c1c5fc |
... | ... | ... |
995 | 2022-10-15 20:22:29 | c9bdbb70-27e7-4ca1-a429-17b67703c06b |
996 | 2022-12-30 14:14:18 | c9bdbb70-27e7-4ca1-a429-17b67703c06b |
997 | 2022-08-13 14:21:13 | 09d5703e-3238-4bec-9255-71d30f0d3fbb |
998 | 2022-02-17 10:18:56 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
999 | 2022-02-18 16:34:59 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 |
1000 rows × 2 columns
# materialize the target values using compute_historical_features
training_data_target = customer_target_list.compute_historical_features(observation_set_target)
# remove the offset from the point in time column
training_data_target["POINT_IN_TIME"] = training_data_target["POINT_IN_TIME"] - pd.DateOffset(days=14)
display(training_data_target)
Retrieving Historical Feature(s) |████████████████████████████████████████| 1/1
POINT_IN_TIME | GROCERYCUSTOMERGUID | Target | |
---|---|---|---|
0 | 2022-02-12 18:10:54 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 109.81 |
1 | 2022-03-20 20:19:26 | 5c96089d-95f7-4a12-ab13-e082836253f1 | 60.49 |
2 | 2022-04-30 19:05:33 | abdef773-ab72-43b6-8e77-050804c1c5fc | 256.00 |
3 | 2022-05-27 14:17:12 | abdef773-ab72-43b6-8e77-050804c1c5fc | 204.73 |
4 | 2022-06-11 15:11:59 | abdef773-ab72-43b6-8e77-050804c1c5fc | 158.05 |
... | ... | ... | ... |
995 | 2022-10-01 20:22:29 | c9bdbb70-27e7-4ca1-a429-17b67703c06b | 84.38 |
996 | 2022-12-16 14:14:18 | c9bdbb70-27e7-4ca1-a429-17b67703c06b | 69.34 |
997 | 2022-07-30 14:21:13 | 09d5703e-3238-4bec-9255-71d30f0d3fbb | 43.43 |
998 | 2022-02-03 10:18:56 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 | 123.02 |
999 | 2022-02-04 16:34:59 | afeec4ce-0a90-41f1-802b-7ff2bb42b292 | 100.25 |
1000 rows × 3 columns
Example: Merging materialized values for features and target¶
# merge training data features and training data target
training_data = training_table.to_pandas().merge(training_data_target, on=["GROCERYCUSTOMERGUID", "POINT_IN_TIME"])
display(training_data)
Downloading table |████████████████████████████████████████| 1000/1000 [100%] in
POINT_IN_TIME | GROCERYCUSTOMERGUID | StateMeanLongitude | StateMeanLatitude | CustomerInventoryMostFrequent_4w | CustomerInventoryEntropy_4w | Target | |
---|---|---|---|---|---|---|---|
0 | 2022-02-01 15:51:21 | 94127b9f-1366-4bbe-afea-7cd77225da52 | 5.963028 | 48.799660 | Chips et Tortillas | 3.317135 | 226.18 |
1 | 2022-02-01 20:32:11 | ed9730f3-859b-4284-83a4-407032f81332 | 2.236025 | 48.740263 | None | NaN | 111.31 |
2 | 2022-02-02 00:03:38 | ca874bc5-6ee9-437e-b18e-7b52da691d6c | 2.236025 | 48.740263 | Soupe | 2.685945 | 1.38 |
3 | 2022-02-02 10:22:09 | 5b1300f3-54c3-4eab-b00e-b54ac7714a58 | 2.236025 | 48.740263 | Légumes Frais | 2.886775 | 145.88 |
4 | 2022-02-02 16:54:12 | c1e68071-765e-4616-bdce-630250c50a9f | 2.236025 | 48.740263 | Préparations pour Gâteaux et Flans | 2.369993 | 133.26 |
... | ... | ... | ... | ... | ... | ... | ... |
995 | 2022-12-30 08:32:02 | d519b6c9-5f34-4b75-95e3-3778e2d63b01 | 5.887195 | 43.456104 | Chips et Tortillas | 2.692340 | 38.38 |
996 | 2022-12-30 10:41:09 | f761a5d1-3b66-4faf-82f1-6cd59e2e28f8 | 0.934599 | 49.391777 | Boucherie | 2.445018 | 37.16 |
997 | 2022-12-30 15:25:23 | fb92d601-9f59-41fb-97fe-79d627d95bd8 | 2.242254 | 48.739038 | Soupe | 2.971832 | 32.77 |
998 | 2022-12-31 19:45:53 | 13d79bc4-8887-4bf4-99ca-a496374fbff7 | 5.054081 | 45.500198 | Colas, Thés glacés et Sodas | 0.693147 | 51.65 |
999 | 2022-12-31 21:55:13 | d9800a91-c533-40f2-8b5e-375db88ec2ed | 2.242254 | 48.739038 | Préparations pour Gâteaux et Flans | -0.000000 | 17.02 |
1000 rows × 7 columns
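At this point the merged DataFrame can be used like any training dataset. A small sketch (the file name and split date are arbitrary choices, not part of the tutorial data):
# persist the merged training data for downstream modeling
training_data.to_parquet("customer_spend_training_data.parquet", index=False)
# a simple time-based split: train on earlier observations, validate on later ones
cutoff = pd.Timestamp("2022-10-01")
train_df = training_data[training_data["POINT_IN_TIME"] < cutoff]
valid_df = training_data[training_data["POINT_IN_TIME"] >= cutoff]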
Deploying features¶
Learning Objectives
In this section you will learn:
- feature readiness
- feature list status
- how to deploy a feature list
Feature readiness¶
To help differentiate features in the prototype stage from features that are ready for production, a feature version can have one of four readiness levels:
- PRODUCTION_READY: ready for deployment in production environments.
- PUBLIC_DRAFT: shared for feedback purposes.
- DRAFT: in the prototype stage.
- DEPRECATED: not advised for use in either training or prediction.
# view the readiness of the features
catalog.list_features()
id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c4cae2ce151fd3fe4e2cb | Target | FLOAT | DRAFT | False | [GROCERYINVOICE] | [GROCERYINVOICE] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:31.560 |
1 | 645c4cae2ce151fd3fe4e2c5 | StateMeanLongitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:02:27.361 |
2 | 645c4cae2ce151fd3fe4e2c3 | StateMeanLatitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:02:26.643 |
3 | 645c4cae2ce151fd3fe4e2c1 | CustomerInventoryMostFrequent_4w | VARCHAR | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:25.589 |
4 | 645c4cae2ce151fd3fe4e2bf | CustomerInventoryEntropy_4w | FLOAT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:24.060 |
When a feature has been reviewed and is ready for production, its readiness can be upgraded.
# get CustomerInventoryEntropy_4w
customer_inventory_entropy_4w = catalog.get_feature("CustomerInventoryEntropy_4w")
# check feature definition file
customer_inventory_entropy_4w.definition
# Generated by SDK version: 0.2.2
from bson import ObjectId
from featurebyte import DimensionTable
from featurebyte import FeatureJobSetting
from featurebyte import ItemTable

# item_table name: "INVOICEITEMS", event_table name: "GROCERYINVOICE"
item_table = ItemTable.get_by_id(ObjectId("645c4ca52ce151fd3fe4e2b7"))
item_view = item_table.get_view(
    event_suffix=None,
    view_mode="manual",
    drop_column_names=[],
    column_cleaning_operations=[],
    event_drop_column_names=["record_available_at"],
    event_column_cleaning_operations=[],
    event_join_column_names=[
        "Timestamp",
        "GroceryInvoiceGuid",
        "GroceryCustomerGuid",
        "tz_offset",
    ],
)
# dimension_table name: "GROCERYPRODUCT"
dimension_table = DimensionTable.get_by_id(ObjectId("645c4ca92ce151fd3fe4e2b8"))
dimension_view = dimension_table.get_view(
    view_mode="manual", drop_column_names=[], column_cleaning_operations=[]
)
joined_view = item_view.join(
    dimension_view, on="GroceryProductGuid", how="left", rsuffix=""
)
grouped = joined_view.groupby(
    by_keys=["GroceryCustomerGuid"], category="ProductGroup"
).aggregate_over(
    value_column=None,
    method="count",
    windows=["4w"],
    feature_names=["CustomerInventory_4w"],
    feature_job_setting=FeatureJobSetting(
        blind_spot="0s", frequency="3600s", time_modulo_frequency="90s"
    ),
    skip_fill_na=True,
)
feat = grouped["CustomerInventory_4w"]
feat_1 = feat.cd.entropy()
feat_1.name = "CustomerInventoryEntropy_4w"
output = feat_1
# change the readiness to PRODUCTION_READY
customer_inventory_entropy_4w.update_readiness("PRODUCTION_READY")
# view the readiness of the features
catalog.list_features()
id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c4cae2ce151fd3fe4e2cb | Target | FLOAT | DRAFT | False | [GROCERYINVOICE] | [GROCERYINVOICE] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:31.560 |
1 | 645c4cae2ce151fd3fe4e2c5 | StateMeanLongitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:02:27.361 |
2 | 645c4cae2ce151fd3fe4e2c3 | StateMeanLatitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:02:26.643 |
3 | 645c4cae2ce151fd3fe4e2c1 | CustomerInventoryMostFrequent_4w | VARCHAR | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:25.589 |
4 | 645c4cae2ce151fd3fe4e2bf | CustomerInventoryEntropy_4w | FLOAT | PRODUCTION_READY | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:02:24.060 |
Feature list status¶
Feature lists can be assigned one of five status levels to differentiate between experimental feature lists and those suitable for deployment or already deployed.
- DEPLOYED: Assigned to feature lists with at least one deployed version.
- TEMPLATE: For feature lists as reference templates or safe starting points.
- PUBLIC_DRAFT: For feature lists shared for feedback purposes.
- DRAFT: For feature lists in the prototype stage.
- DEPRECATED: For outdated or unnecessary feature lists.
# view the status of the feature lists
display(catalog.list_feature_lists())
id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c4caf2ce151fd3fe4e2cf | TargetFeature | 1 | DRAFT | False | 0.00 | 0.0 | [GROCERYINVOICE] | [grocerycustomer] | 2023-05-11 02:02:32.023 |
1 | 645c4cae2ce151fd3fe4e2c7 | CustomerFeatures | 4 | DRAFT | False | 0.25 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | 2023-05-11 02:02:28.510 |
When a feature list is ready for review, its status can be updated.
# get the CustomerFeatures feature list
customer_feature_list = catalog.get_feature_list("CustomerFeatures")
# update the status to PUBLIC_DRAFT
customer_feature_list.update_status("PUBLIC_DRAFT")
# view the status of the feature lists
display(catalog.list_feature_lists())
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 1.0s
id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c4caf2ce151fd3fe4e2cf | TargetFeature | 1 | DRAFT | False | 0.00 | 0.0 | [GROCERYINVOICE] | [grocerycustomer] | 2023-05-11 02:02:32.023 |
1 | 645c4cae2ce151fd3fe4e2c7 | CustomerFeatures | 4 | PUBLIC_DRAFT | False | 0.25 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | 2023-05-11 02:02:28.510 |
Deploying a feature list¶
# deploy the customer feature list
deployment = customer_feature_list.deploy(make_production_ready=True)
deployment.enable()
# view the status of the feature lists
display(catalog.list_feature_lists())
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 1.0s Done! |████████████████████████████████████████| 100% in 6.0s (0.17%/s) Done! |████████████████████████████████████████| 100% in 57.3s (0.02%/s)
id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | created_at | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c4caf2ce151fd3fe4e2cf | TargetFeature | 1 | DRAFT | False | 0.0 | 0.0 | [GROCERYINVOICE] | [grocerycustomer] | 2023-05-11 02:02:32.023 |
1 | 645c4cae2ce151fd3fe4e2c7 | CustomerFeatures | 4 | DEPLOYED | True | 1.0 | 1.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | 2023-05-11 02:02:28.510 |
Why deploy?¶
When you deploy a feature list, behind the scenes the Feature Store starts regularly pre-calculating and caching feature values. This can significantly reduce the latency of feature serving.
Serving and consuming features¶
Learning Objectives
In this section you will learn:
- the point in time used for production serving
- how to create a Python function to consume a feature list
- how to consume a feature list
Point in time for deployment¶
The production feature serving API uses the current time as its point in time. To consume the feature list, send only the primary entity values, keyed by the entity's serving name.
Automatically create a Python function for consuming the API¶
You can generate either a Python function template or a shell script; the shell script uses the curl command to send the request.
For the Python template, set the language parameter to 'python'. For the shell script, set it to 'sh'.
# get a python template for consuming the feature serving API
deployment.get_online_serving_code(language="python")
Loading Feature(s) |████████████████████████████████████████| 4/4 [100%] in 1.0s
from typing import Any, Dict

import pandas as pd
import requests


def request_features(entity_serving_names: Dict[str, Any]) -> pd.DataFrame:
    """
    Send POST request to online serving endpoint

    Parameters
    ----------
    entity_serving_names: Dict[str, Any]
        Entity serving name values to used for serving request

    Returns
    -------
    pd.DataFrame
    """
    response = requests.post(
        url="http://127.0.0.1:8088/deployment/645c4d432ce151fd3fe4e2d9/online_features",
        headers={"Content-Type": "application/json", "active-catalog-id": "645c4c9f2ce151fd3fe4e2b4"},
        json={"entity_serving_names": entity_serving_names},
    )
    assert response.status_code == 200, response.json()
    return pd.DataFrame.from_dict(response.json()["features"])


request_features([{"GROCERYCUSTOMERGUID": "0041bdff-4917-42d5-bd6d-5a555ac616c5"}])
Copy the online serving code that was generated above, paste it into the cell below, then run it
# replace the contents of this Python code cell with the output from deployment.get_online_serving_code(language="python")
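Once the generated code has been pasted and run, the request_features function it defines can be called with one or more entity values (the GUIDs below are customer GUIDs that appear in earlier outputs of this tutorial):
# request online feature values for two customers
features_df = request_features([
    {"GROCERYCUSTOMERGUID": "0041bdff-4917-42d5-bd6d-5a555ac616c5"},
    {"GROCERYCUSTOMERGUID": "9b1b8037-8506-4a54-981a-3b7e694a489f"},
])
display(features_df)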
Concept: Batch request table¶
A BatchRequestTable object is a representation of a table in the feature store that specifies entity values for batch serving.
# this is a new use case, a daily batch run for customers who were active in the latest 24 hours
# filter the invoice view to get customers who had an invoice in the latest 24 hours
batch_request_timestamp = pd.Timestamp.now(tz="utc")
filter = grocery_invoice_view["Timestamp"] > batch_request_timestamp - pd.to_timedelta(24, unit="hour")
recently_active_view = grocery_invoice_view[filter].copy()
display(recently_active_view.preview())
GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | Amount | |
---|---|---|---|---|---|
0 | 8b7add80-0e1a-4b09-87e6-db5f8d75593a | bcd8cedb-9f49-461c-86bd-920fa9316239 | 2023-05-10 13:07:21 | +02:00 | 89.35 |
1 | bd470f90-b2c1-40b1-953d-8c66431733c3 | aae48cd3-7646-4df6-9700-3ef7f29ec80f | 2023-05-10 17:17:36 | +02:00 | 5.49 |
2 | 6b3f61eb-21e8-41bf-9145-09879dc77731 | 9b1b8037-8506-4a54-981a-3b7e694a489f | 2023-05-10 19:04:17 | +02:00 | 8.17 |
3 | 46a92d6c-3cc5-4b9c-9349-a26b271772d8 | 9a7aae23-2036-4728-809e-cca766af86e0 | 2023-05-10 15:07:27 | +02:00 | 18.99 |
4 | 2ebdc5ce-cd64-457a-802a-09da7f72e7dd | c8e9cd10-9f9d-4d03-befd-ef8eda747c19 | 2023-05-10 10:39:17 | +02:00 | 21.24 |
5 | 2f345a21-7e24-40fa-b76b-461343278e17 | 489b3454-bd45-4d14-a355-500f42bad6c2 | 2023-05-10 17:47:56 | +02:00 | 23.88 |
6 | 3ad5ec7e-e398-4324-b9c7-fbf2c9f1db4f | 0041bdff-4917-42d5-bd6d-5a555ac616c5 | 2023-05-10 13:24:03 | +02:00 | 25.34 |
7 | 955932bd-4fc5-42dd-b544-fc670ab1f2fb | 2623c1c0-4aeb-4ee4-8be6-9c011040bf79 | 2023-05-10 21:26:10 | +02:00 | 4.00 |
8 | f19be5b4-a382-4991-9882-bbf901001f02 | 37bca45c-e365-499e-b3e5-d166a279b8c5 | 2023-05-10 15:29:50 | +02:00 | 3.08 |
9 | 22a3fb9b-7085-4e90-9ecf-496bb399d346 | 9c23c4e8-f0e8-4aa4-83e9-3d3525461a8f | 2023-05-10 14:50:53 | +02:00 | 1.67 |
# create a batch request table from the filtered view
# note that the table does not contain a prediction point in time
# batch requests use the batch run time as the point in time
batch_request_table = recently_active_view.create_batch_request_table(
"customer batch request for customers active in the latest 24 hours as at " + str(batch_request_timestamp),
columns = ["GroceryCustomerGuid"],
columns_rename_mapping = {"GroceryCustomerGuid": "GROCERYCUSTOMERGUID"}
)
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)
Concept: Batch feature table¶
A BatchFeatureTable object is a representation of a table in the feature store that contains feature values from batch serving. The object includes metadata on the Deployment and the BatchRequestTable used to create it.
# enable the deployment - this is a pre-requisite
if not deployment.enabled:
deployment.enable()
# request batch features
batch_features = deployment.compute_batch_feature_table(
batch_request_table=batch_request_table,
batch_feature_table_name = 'customer batch feature data for customers active in the latest 24 hours as at ' + str(batch_request_timestamp)
)
Done! |████████████████████████████████████████| 100% in 12.1s (0.08%/s)
# display the contents of the batch feature table
display(batch_features.to_pandas())
Downloading table |████████████████████████████████████████| 77/77 [100%] in 0.1
GROCERYCUSTOMERGUID | CustomerInventoryEntropy_4w | CustomerInventoryMostFrequent_4w | StateMeanLatitude | StateMeanLongitude | |
---|---|---|---|---|---|
0 | 34be2f38-fe5b-4c18-863d-178b7ad6ff4e | 3.293622 | Pizza Surgelées | 48.177401 | 7.573264 |
1 | 9b1b8037-8506-4a54-981a-3b7e694a489f | 3.199416 | Pains | 48.177401 | 7.573264 |
2 | d3421c0a-67e9-4520-8977-34efe22cd3d1 | 0.693147 | Laits | 48.177401 | 7.573264 |
3 | 4d37054d-a274-4b7f-93b7-9600a0f0a9fa | 2.865703 | Céréales | 44.667825 | -0.491515 |
4 | cf215271-2c09-45a3-9dd5-9b7dd057a1c7 | 2.947367 | Sirops | 44.667825 | -0.491515 |
... | ... | ... | ... | ... | ... |
72 | ec683769-e192-418a-934f-12c8c683c8fe | 3.380674 | Colas, Thés glacés et Sodas | 48.739038 | 2.242254 |
73 | f6a783f7-5091-46fa-8ebf-aa13ec868234 | 3.026714 | Colas, Thés glacés et Sodas | 48.739038 | 2.242254 |
74 | f6a783f7-5091-46fa-8ebf-aa13ec868234 | 3.026714 | Colas, Thés glacés et Sodas | 48.739038 | 2.242254 |
75 | f6a783f7-5091-46fa-8ebf-aa13ec868234 | 3.026714 | Colas, Thés glacés et Sodas | 48.739038 | 2.242254 |
76 | f79b5a63-7863-471d-8c6e-cc1b48bd385b | 3.095558 | Fruits secs | 48.739038 | 2.242254 |
77 rows × 5 columns
# display the batch feature table metadata
batch_features.info()
{
    'name': 'customer batch feature data for customers active in the latest 24 hours as at 2023-05-11 02:06:04.129024+00:00',
    'created_at': '2023-05-11T02:06:23.485000',
    'updated_at': None,
    'batch_request_table_name': 'customer batch request for customers active in the latest 24 hours as at 2023-05-11 02:06:04.129024+00:00',
    'deployment_name': 'Deployment with CustomerFeatures_V230511',
    'table_details': {
        'database_name': 'spark_catalog',
        'schema_name': 'playground',
        'table_name': 'BATCH_FEATURE_TABLE_645c4d985f04ef51f2c63553'
    }
}
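The batch feature table is also stored in the catalog, so it can be listed and retrieved later, assuming your SDK version exposes the corresponding catalog accessors (a sketch, not verified against every version):
# list batch feature tables registered in the catalog
display(catalog.list_batch_feature_tables())
# retrieve the batch feature table by name for reuse
reloaded_batch_features = catalog.get_batch_feature_table(
    "customer batch feature data for customers active in the latest 24 hours as at " + str(batch_request_timestamp)
)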
Disable a deployment¶
# disable the feature list deployment
deployment.disable()
Done! |████████████████████████████████████████| 100% in 18.1s (0.06%/s)
Next Steps¶
Now that you've completed the deep dive materializing features tutorial, you can put your knowledge into practice or learn more:
- Put your knowledge into practice by creating features in the "credit card dataset feature engineering playground" or "healthcare dataset feature engineering playground" catalogs
- Learn more about feature governance via the "Quick Start Feature Governance" tutorial
- Learn about data modeling via the "Deep Dive Data Modeling" tutorial