Quick Start Tutorial: Reusing Features¶
Learning Objectives¶
In this tutorial you will learn:
- How to access catalogs of data, entities, features, and feature lists
- How to search for features suitable for the unit of analysis
- How to understand an existing feature
- How to create new features from existing features
- How to create a new feature list from existing features
Set up the prerequisites¶
Learning Objectives
In this section you will:
- start your local featurebyte server
- import libraries
- learn about catalogs
- activate a pre-built catalog
Load the featurebyte library and connect to the local instance of featurebyte¶
# library imports
import pandas as pd
import numpy as np
import random
# load the featurebyte SDK
import featurebyte as fb
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
02:29:43 | INFO | Using configuration file at: /home/chester/.featurebyte/config.yaml
02:29:43 | INFO | Active profile: local (http://127.0.0.1:8088)
02:29:43 | INFO | SDK version: 0.2.2
02:29:43 | INFO | Active catalog: default
02:29:43 | INFO | 0 feature list, 0 feature deployed
02:29:43 | INFO | (1/4) Starting featurebyte services
Container mongo-rs Running
Container redis Running
Container featurebyte-server Running
Container spark-thrift Running
Container featurebyte-worker Running
Container mongo-rs Waiting
Container mongo-rs Waiting
Container redis Waiting
Container redis Healthy
Container mongo-rs Healthy
Container mongo-rs Healthy
02:29:44 | INFO | (2/4) Creating local spark feature store
02:29:44 | INFO | (3/4) Import datasets
02:29:45 | INFO | Dataset grocery already exists, skipping import
02:29:45 | INFO | Dataset healthcare already exists, skipping import
02:29:45 | INFO | Dataset creditcard already exists, skipping import
02:29:45 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up¶
Note that creating a pre-built catalog is not something you would do in a real project. The function is specific to this quick-start tutorial: it skips many of the preparatory steps and gets you to the point where you can materialize features.
In a real project you would first do the data modeling: declaring the tables, the entities, and the associated metadata. This is not a frequent task, but it forms the basis for best-practice feature engineering.
# get the functions to create a pre-built catalog
from prebuilt_catalogs import PrebuiltCatalog, create_tutorial_catalog
# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.QuickStartReusingFeatures)
Cleaning up existing tutorial catalogs
02:29:45 | INFO | Catalog activated: quick start model training 20230511:0224
Cleaning catalog: quick start model training 20230511:0224
1 historical feature tables
1 observation tables
Done! |████████████████████████████████████████| 100% in 6.1s (0.17%/s)
Done! |████████████████████████████████████████| 100% in 6.0s (0.17%/s)
02:29:59 | INFO | Catalog activated: default
02:29:59 | INFO | Catalog activated: quick start reusing features 20230511:0229
Building a quick start catalog for reusing features named [quick start reusing features 20230511:0229]
Creating new catalog
Catalog created
Registering the source tables
Registering the entities
Tagging the entities to columns in the data tables
Populating the feature store with example features
Saving Feature(s) |████████████████████████████████████████| 5/5 [100%] in 1.6s
Loading Feature(s) |████████████████████████████████████████| 5/5 [100%] in 1.3s
Example: Load the tables and views¶
# get the tables for this catalog
grocery_customer_table = catalog.get_table("GROCERYCUSTOMER")
grocery_items_table = catalog.get_table("INVOICEITEMS")
grocery_invoice_table = catalog.get_table("GROCERYINVOICE")
grocery_product_table = catalog.get_table("GROCERYPRODUCT")
# create the views
grocery_customer_view = grocery_customer_table.get_view()
grocery_invoice_view = grocery_invoice_table.get_view()
grocery_items_view = grocery_items_table.get_view()
grocery_product_view = grocery_product_table.get_view()
Accessing Catalogs¶
Learning Objectives
In this section you will learn how to display catalogs of:
- tables
- entities
- features
- feature lists
Example: A catalog of tables¶
# list the tables in the catalog
catalog.list_tables()
| | id | name | type | status | entities | created_at |
---|---|---|---|---|---|---|
0 | 645c5330b37fce40c0e3c89d | GROCERYPRODUCT | dimension_table | PUBLIC_DRAFT | [groceryproduct] | 2023-05-11 02:30:09.678 |
1 | 645c532db37fce40c0e3c89c | INVOICEITEMS | item_table | PUBLIC_DRAFT | [groceryinvoice, groceryproduct] | 2023-05-11 02:30:06.967 |
2 | 645c532ab37fce40c0e3c89b | GROCERYINVOICE | event_table | PUBLIC_DRAFT | [groceryinvoice, grocerycustomer] | 2023-05-11 02:30:03.446 |
3 | 645c5328b37fce40c0e3c89a | GROCERYCUSTOMER | scd_table | PUBLIC_DRAFT | [grocerycustomer, frenchstate] | 2023-05-11 02:30:01.588 |
# load a table
grocery_customer_table = catalog.get_table("GROCERYCUSTOMER")
# show the metadata
grocery_customer_table.info()
{ 'name': 'GROCERYCUSTOMER', 'created_at': '2023-05-11T02:30:01.588000', 'updated_at': '2023-05-11T02:30:10.171000', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229', 'record_creation_timestamp_column': 'record_available_at', 'table_details': { 'database_name': 'spark_catalog', 'schema_name': 'GROCERY', 'table_name': 'GROCERYCUSTOMER' }, 'entities': [ { 'name': 'frenchstate', 'serving_names': [ 'FRENCHSTATE' ], 'catalog_name': 'quick start reusing features 20230511:0229' }, { 'name': 'grocerycustomer', 'serving_names': [ 'GROCERYCUSTOMERGUID' ], 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'semantics': [ 'scd_surrogate_key_id', 'scd_natural_key_id' ], 'column_count': 18, 'columns_info': None, 'natural_key_column': 'GroceryCustomerGuid', 'effective_timestamp_column': 'ValidFrom', 'surrogate_key_column': 'RowID', 'end_timestamp_column': None, 'current_flag_column': 'CurrentRecord' }
Example: A catalog of entities¶
# list the entities in the catalog
catalog.list_entities()
| | id | name | serving_names | created_at |
---|---|---|---|---|
0 | 645c5331b37fce40c0e3c8a1 | frenchstate | [FRENCHSTATE] | 2023-05-11 02:30:09.982 |
1 | 645c5331b37fce40c0e3c8a0 | groceryproduct | [GROCERYPRODUCTGUID] | 2023-05-11 02:30:09.910 |
2 | 645c5331b37fce40c0e3c89f | groceryinvoice | [GROCERYINVOICEGUID] | 2023-05-11 02:30:09.840 |
3 | 645c5331b37fce40c0e3c89e | grocerycustomer | [GROCERYCUSTOMERGUID] | 2023-05-11 02:30:09.776 |
# list the entity relationships in the catalog
catalog.list_relationships()
| | id | relationship_type | entity | related_entity | relation_table | relation_table_type | enabled | created_at | updated_at |
---|---|---|---|---|---|---|---|---|---|
0 | 645c53329c28d6ed179a1a73 | child_parent | groceryinvoice | grocerycustomer | GROCERYINVOICE | event_table | True | 2023-05-11 02:30:10.464 | None |
1 | 645c53329c28d6ed179a1a6c | child_parent | grocerycustomer | frenchstate | GROCERYCUSTOMER | scd_table | True | 2023-05-11 02:30:10.220 | None |
# load an entity
customer_entity = catalog.get_entity("grocerycustomer")
# show the metadata
customer_entity.info()
{ 'name': 'grocerycustomer', 'created_at': '2023-05-11T02:30:09.776000', 'updated_at': '2023-05-11T02:30:10.426000', 'serving_names': [ 'GROCERYCUSTOMERGUID' ], 'catalog_name': 'quick start reusing features 20230511:0229' }
Example: A catalog of features¶
# list the features in the catalog
catalog.list_features()
| | id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c5347b37fce40c0e3c8c0 | InvoiceUniqueProductGroupCount | FLOAT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [groceryinvoice] | [groceryinvoice] | 2023-05-11 02:30:33.185 |
1 | 645c5346b37fce40c0e3c8bc | InvoiceDiscountAmount | FLOAT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS] | [INVOICEITEMS] | [groceryinvoice] | [groceryinvoice] | 2023-05-11 02:30:31.101 |
2 | 645c5344b37fce40c0e3c8ba | InvoiceItemCount | FLOAT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS] | [INVOICEITEMS] | [groceryinvoice] | [groceryinvoice] | 2023-05-11 02:30:29.423 |
3 | 645c5343b37fce40c0e3c8b8 | CustomerYearOfBirth | INT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:27.690 |
4 | 645c5341b37fce40c0e3c8b4 | CustomerSpend_14d | FLOAT | DRAFT | False | [GROCERYINVOICE] | [GROCERYINVOICE] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:26.874 |
5 | 645c5340b37fce40c0e3c8b2 | CustomerInventory_24w | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:25.298 |
6 | 645c533eb37fce40c0e3c8b0 | CustomerInventory_28d | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:23.367 |
7 | 645c533cb37fce40c0e3c8ae | StateMeanLongitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:20.604 |
8 | 645c533bb37fce40c0e3c8ac | StateMeanLatitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:19.879 |
9 | 645c533ab37fce40c0e3c8aa | StateAvgInvoiceAmount_28d | FLOAT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE] | [GROCERYINVOICE] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:19.007 |
10 | 645c5337b37fce40c0e3c8a8 | StateInventory_28d | OBJECT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [INVOICEITEMS] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:17.076 |
11 | 645c5336b37fce40c0e3c8a4 | StatePopulation | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:14.988 |
12 | 645c5335b37fce40c0e3c8a2 | StateName | VARCHAR | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:13.971 |
# load a feature
state_population = catalog.get_feature("StatePopulation")
# show the metadata
state_population.info()
{ 'name': 'StatePopulation', 'created_at': '2023-05-11T02:30:14.988000', 'updated_at': None, 'entities': [ { 'name': 'frenchstate', 'serving_names': [ 'FRENCHSTATE' ], 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'primary_entity': [ { 'name': 'frenchstate', 'serving_names': [ 'FRENCHSTATE' ], 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'tables': [ { 'name': 'GROCERYCUSTOMER', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'default_version_mode': 'AUTO', 'version_count': 1, 'catalog_name': 'quick start reusing features 20230511:0229', 'dtype': 'FLOAT', 'primary_table': [ { 'name': 'GROCERYCUSTOMER', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'default_feature_id': '645c5336b37fce40c0e3c8a4', 'version': { 'this': 'V230511', 'default': 'V230511' }, 'readiness': { 'this': 'DRAFT', 'default': 'DRAFT' }, 'table_feature_job_setting': { 'this': [], 'default': [] }, 'table_cleaning_operation': { 'this': [], 'default': [] }, 'versions_info': None, 'metadata': { 'input_columns': { 'Input0': { 'data': 'GROCERYCUSTOMER', 'column_name': 'GroceryCustomerGuid', 'semantic': 'scd_natural_key_id' }, 'Input1': { 'data': 'GROCERYCUSTOMER', 'column_name': 'ValidFrom', 'semantic': None } }, 'derived_columns': {}, 'aggregations': { 'F0': { 'name': 'StatePopulation', 'column': None, 'function': 'count', 'keys': [ 'State' ], 'window': None, 'category': None, 'filter': False } }, 'post_aggregation': { 'name': 'StatePopulation', 'inputs': [ 'F0' ], 'transforms': [ 'is_null', 'conditional' ] } } }
# show the feature lineage for the state population feature
display(state_population.definition)
# Generated by SDK version: 0.2.2
from bson import ObjectId
from featurebyte import SCDTable
# scd_table name: "GROCERYCUSTOMER"
scd_table = SCDTable.get_by_id(ObjectId("645c5328b37fce40c0e3c89a"))
scd_view = scd_table.get_view(
view_mode="manual",
drop_column_names=["record_available_at", "CurrentRecord"],
column_cleaning_operations=[],
)
feat = scd_view.groupby(by_keys=["State"], category=None).aggregate_asat(
value_column=None,
method="count",
feature_name="StatePopulation",
offset=None,
backward=True,
skip_fill_na=True,
)
feat[feat.isnull()] = 0
feat_1 = feat
feat_1.name = "StatePopulation"
output = feat_1
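To make the definition above more concrete, here is a minimal plain-Python sketch of what an "as at" count over a slowly changing dimension table does: for each customer, take the most recent record that was already valid at the point in time, then count customers per state. The records, the region names, and the function name `state_population` are hypothetical, and the rule `valid_from <= point_in_time` is an assumption about how the aggregation treats boundaries, not a description of the SDK's internals.

```python
from collections import Counter
from datetime import datetime

# Hypothetical SCD rows: (customer id, state, valid_from)
records = [
    ("c1", "Bretagne", datetime(2020, 1, 1)),
    ("c1", "Normandie", datetime(2022, 6, 1)),  # c1 moved in mid-2022
    ("c2", "Bretagne", datetime(2021, 3, 1)),
    ("c3", "Bretagne", datetime(2023, 1, 1)),   # not yet valid in 2022
]


def state_population(records, point_in_time):
    """Count customers per state using each customer's latest record valid at point_in_time."""
    latest_state = {}
    # Process rows in chronological order so later rows overwrite earlier ones.
    for customer, state, valid_from in sorted(records, key=lambda row: row[2]):
        if valid_from <= point_in_time:
            latest_state[customer] = state
    return Counter(latest_state.values())


pop = state_population(records, datetime(2022, 12, 1))
# A Counter returns 0 for a missing key, which plays the role of
# replacing missing counts with 0 in the definition above.
print(pop["Normandie"], pop["Bretagne"], pop["Corse"])
```

At the chosen point in time, `c1` is counted in its new state only and `c3` is not counted at all, because its record becomes valid later.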
Example: A catalog of feature lists¶
# list the feature lists in the catalog
catalog.list_feature_lists()
| | id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c5349b37fce40c0e3c8c2 | StateFeatureList | 5 | DRAFT | False | 0.0 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [frenchstate] | 2023-05-11 02:30:36.305 |
# load the feature list
state_features = catalog.get_feature_list("StateFeatureList")
# show the metadata
state_features.info()
Loading Feature(s) |████████████████████████████████████████| 5/5 [100%] in 1.3s
{ 'name': 'StateFeatureList', 'created_at': '2023-05-11T02:30:36.305000', 'updated_at': None, 'entities': [ { 'name': 'frenchstate', 'serving_names': [ 'FRENCHSTATE' ], 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'primary_entity': [ { 'name': 'frenchstate', 'serving_names': [ 'FRENCHSTATE' ], 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'tables': [ { 'name': 'GROCERYPRODUCT', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' }, { 'name': 'INVOICEITEMS', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' }, { 'name': 'GROCERYINVOICE', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' }, { 'name': 'GROCERYCUSTOMER', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'default_version_mode': 'AUTO', 'version_count': 1, 'catalog_name': 'quick start reusing features 20230511:0229', 'dtype_distribution': [ { 'dtype': 'FLOAT', 'count': 4 }, { 'dtype': 'OBJECT', 'count': 1 } ], 'status': 'DRAFT', 'feature_count': 5, 'version': { 'this': 'V230511', 'default': 'V230511' }, 'production_ready_fraction': { 'this': 0.0, 'default': 0.0 }, 'versions_info': None, 'deployed': False }
# list the features in the feature list
state_features.list_features()
| | id | name | version | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 645c533cb37fce40c0e3c8ae | StateMeanLongitude | V230511 | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:20.588 |
1 | 645c533bb37fce40c0e3c8ac | StateMeanLatitude | V230511 | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:19.863 |
2 | 645c533ab37fce40c0e3c8aa | StateAvgInvoiceAmount_28d | V230511 | FLOAT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE] | [GROCERYINVOICE] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:18.989 |
3 | 645c5337b37fce40c0e3c8a8 | StateInventory_28d | V230511 | OBJECT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [INVOICEITEMS] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:17.054 |
4 | 645c5336b37fce40c0e3c8a4 | StatePopulation | V230511 | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:14.971 |
Search for Features¶
Learning Objectives
In this section, you will learn:
- what a primary entity is
- how to search for suitable features
Concept: Primary entity¶
Feature primary entity: The primary entity of a feature defines the level of analysis for that feature. When a feature is a result of an aggregation grouped by multiple entities, the primary entity is a tuple of those entities. For instance, if a feature quantifies the interaction between a customer entity and a merchant entity in the past, such as the sum of transaction amounts grouped by customer and merchant in the past 4 weeks, the primary entity is the tuple of customer and merchant.
When a feature is derived from features with different primary entities, the primary entity is determined by the entity relationships, and the lowest level entity is selected as the primary entity. If the underlying entities have no relationship, the primary entity becomes a tuple of those entities. For example, if a feature compares the basket of a customer with the average basket of customers in the same city, the primary entity is the customer since the customer entity is a child of the customer city entity. However, if the feature is the distance between the customer location and the merchant location, the primary entity becomes the tuple of customer and merchant since these entities do not have any child-parent relationship.
Feature List primary entity: The main focus of a feature list is determined by its primary entity, which typically corresponds to the primary entity of the Use Case that the feature list was created for.
If the features within the list pertain to different primary entities, the primary entity of the feature list is selected based on the entity relationships, with the lowest level entity chosen as the primary entity. In cases where there are no relationships between entities, the primary entity becomes a tuple comprising those entities. To illustrate, consider a feature list comprising features related to card, customer, and customer city. In this case, the primary entity is the card entity since it is a child of both the customer and customer city entities. However, if the feature list also contains features for merchant and merchant city, the primary entity is a tuple of card and merchant.
Use Case primary entity: In a Use Case, the primary entity is the object or concept that defines its problem statement. Usually, this entity is singular, but in cases such as the recommendation engine use case, it can be a tuple of entities that interact with each other.
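The "lowest level entity" rule described above for features and feature lists can be sketched in plain Python. This is only an illustration of the rule as stated in the text, not the SDK's implementation; the function names and the `parent_of` mapping are hypothetical, and the relationships are the ones used in the card, customer, and merchant examples.

```python
def ancestors(entity, parent_of):
    """Return every entity reachable by following child -> parent links."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for parent in parent_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def primary_entity(entities, parent_of):
    """Drop every entity that is an ancestor of another one; keep the rest as a tuple."""
    entities = set(entities)
    covered = set()
    for entity in entities:
        covered |= ancestors(entity, parent_of)
    return tuple(sorted(entities - covered))


# Hypothetical child -> parents mapping taken from the examples in the text.
parent_of = {
    "card": ["customer"],
    "customer": ["customer_city"],
    "merchant": ["merchant_city"],
}

print(primary_entity(["customer", "customer_city"], parent_of))
print(primary_entity(["customer", "merchant"], parent_of))
print(primary_entity(["card", "customer", "customer_city"], parent_of))
print(
    primary_entity(
        ["card", "customer", "customer_city", "merchant", "merchant_city"],
        parent_of,
    )
)
```

The four calls reproduce the four cases in the text: customer alone, the tuple of customer and merchant, card alone, and the tuple of card and merchant.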
Case study: Predicting customer spend¶
Consider a use case to predict customer spend. The unit of analysis, and therefore the primary entity, is the grocery customer. You can use features whose primary entity is either grocery customer or French state (because state is a parent entity of customer).
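One way to express this selection rule is to walk up the child-parent relationships from the use case entity and keep the features whose primary entity falls on that path. The sketch below uses plain Python and a hand-copied subset of the catalog (the relationships shown by `catalog.list_relationships()` and four of the listed features); the helper name `entity_and_ancestors` is hypothetical, and the single-parent mapping is a simplification that happens to fit this catalog.

```python
# child -> parent relationships, copied from the catalog's relationship listing
parent_of = {
    "groceryinvoice": "grocerycustomer",
    "grocerycustomer": "frenchstate",
}


def entity_and_ancestors(entity, parent_of):
    """Return the entity followed by its parent, grandparent, and so on."""
    chain = [entity]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


# feature name -> primary entities, copied from the catalog's feature listing
features = {
    "InvoiceItemCount": ["groceryinvoice"],
    "CustomerYearOfBirth": ["grocerycustomer"],
    "CustomerSpend_14d": ["grocerycustomer"],
    "StatePopulation": ["frenchstate"],
}

allowed = set(entity_and_ancestors("grocerycustomer", parent_of))
suitable = [
    name for name, primary in features.items() if set(primary) <= allowed
]
print(suitable)
```

Invoice-level features are dropped because the invoice entity is a child of the customer, not a parent, so it never appears on the path upwards.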
Example: Search for suitable features¶
# get a list of all the features in the catalog
all_features = catalog.list_features()
# keep only customer-level and state-level features by excluding invoice and product features
child_entity = "groceryinvoice"
suitable_features = all_features.loc[
[child_entity not in x for x in all_features.entities.values]
]
product_entity = "groceryproduct"
suitable_features = suitable_features.loc[
[product_entity not in x for x in suitable_features.entities.values]
]
# show the features
display(suitable_features)
| | id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|
3 | 645c5343b37fce40c0e3c8b8 | CustomerYearOfBirth | INT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:27.690 |
4 | 645c5341b37fce40c0e3c8b4 | CustomerSpend_14d | FLOAT | DRAFT | False | [GROCERYINVOICE] | [GROCERYINVOICE] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:26.874 |
5 | 645c5340b37fce40c0e3c8b2 | CustomerInventory_24w | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:25.298 |
6 | 645c533eb37fce40c0e3c8b0 | CustomerInventory_28d | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:23.367 |
7 | 645c533cb37fce40c0e3c8ae | StateMeanLongitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:20.604 |
8 | 645c533bb37fce40c0e3c8ac | StateMeanLatitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:19.879 |
9 | 645c533ab37fce40c0e3c8aa | StateAvgInvoiceAmount_28d | FLOAT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE] | [GROCERYINVOICE] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:19.007 |
10 | 645c5337b37fce40c0e3c8a8 | StateInventory_28d | OBJECT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [INVOICEITEMS] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:17.076 |
11 | 645c5336b37fce40c0e3c8a4 | StatePopulation | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:14.988 |
12 | 645c5335b37fce40c0e3c8a2 | StateName | VARCHAR | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:13.971 |
# find suitable features that use the grocery invoice items table
grocery_items_features = suitable_features.loc[
["INVOICEITEMS" in x for x in suitable_features.tables.values]
]
# show the features
display(grocery_items_features)
| | id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|
5 | 645c5340b37fce40c0e3c8b2 | CustomerInventory_24w | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:25.298 |
6 | 645c533eb37fce40c0e3c8b0 | CustomerInventory_28d | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:23.367 |
10 | 645c5337b37fce40c0e3c8a8 | StateInventory_28d | OBJECT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [INVOICEITEMS] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:17.076 |
Understand an Existing feature¶
Learning Objectives
In this section you will learn how to:
- load a feature from the catalog
- view the metadata of a feature
- materialize feature values
- view feature lineage as a definition file
Example: Load a feature from the catalog¶
# get the CustomerInventory_28d feature
customer_inventory_28d = catalog.get_feature("CustomerInventory_28d")
Example: View the metadata of a feature¶
# get a list of all the features in the catalog
all_features = catalog.list_features()
# display the current feature
display(all_features.loc[all_features.name == customer_inventory_28d.name])
| | id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|
6 | 645c533eb37fce40c0e3c8b0 | CustomerInventory_28d | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:23.367 |
# view the detailed metadata
customer_inventory_28d.info()
{ 'name': 'CustomerInventory_28d', 'created_at': '2023-05-11T02:30:23.367000', 'updated_at': None, 'entities': [ { 'name': 'grocerycustomer', 'serving_names': [ 'GROCERYCUSTOMERGUID' ], 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'primary_entity': [ { 'name': 'grocerycustomer', 'serving_names': [ 'GROCERYCUSTOMERGUID' ], 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'tables': [ { 'name': 'GROCERYPRODUCT', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' }, { 'name': 'INVOICEITEMS', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' }, { 'name': 'GROCERYINVOICE', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'default_version_mode': 'AUTO', 'version_count': 1, 'catalog_name': 'quick start reusing features 20230511:0229', 'dtype': 'OBJECT', 'primary_table': [ { 'name': 'INVOICEITEMS', 'status': 'PUBLIC_DRAFT', 'catalog_name': 'quick start reusing features 20230511:0229' } ], 'default_feature_id': '645c533eb37fce40c0e3c8b0', 'version': { 'this': 'V230511', 'default': 'V230511' }, 'readiness': { 'this': 'DRAFT', 'default': 'DRAFT' }, 'table_feature_job_setting': { 'this': [ { 'table_name': 'GROCERYINVOICE', 'feature_job_setting': { 'blind_spot': '0s', 'frequency': '3600s', 'time_modulo_frequency': '90s' } } ], 'default': [ { 'table_name': 'GROCERYINVOICE', 'feature_job_setting': { 'blind_spot': '0s', 'frequency': '3600s', 'time_modulo_frequency': '90s' } } ] }, 'table_cleaning_operation': { 'this': [], 'default': [] }, 'versions_info': None, 'metadata': { 'input_columns': { 'Input0': { 'data': 'GROCERYPRODUCT', 'column_name': 'ProductGroup', 'semantic': None } }, 'derived_columns': {}, 'aggregations': { 'F0': { 'name': 'CustomerInventory_28d', 'column': None, 'function': 'count', 'keys': [ 'GroceryCustomerGuid' ], 'window': '28d', 'category': 'ProductGroup', 'filter': False } }, 'post_aggregation': None } }
Example: Materialize sample values¶
# get some customer IDs and invoice event timestamps from Q4 2022
q4_2022 = (grocery_invoice_view["Timestamp"].dt.year == 2022) & (
    grocery_invoice_view["Timestamp"].dt.month >= 10
)
observation_set = (
    grocery_invoice_view[q4_2022]
    .sample(10)[["GroceryCustomerGuid", "Timestamp"]]
    # rename to the entity serving name and the POINT_IN_TIME column
    .rename(
        {
            "Timestamp": "POINT_IN_TIME",
            "GroceryCustomerGuid": "GROCERYCUSTOMERGUID",
        },
        axis=1,
    )
)
display(observation_set)
| | GROCERYCUSTOMERGUID | POINT_IN_TIME |
---|---|---|
0 | 7a1bc5dc-e198-419e-b972-0abbdf8903c1 | 2022-12-23 10:41:08 |
1 | 6999ea3f-fc7e-4b48-b01f-02a71e0f474d | 2022-11-30 18:25:02 |
2 | 33a71f26-f36f-423f-8ddb-a7d674102d3b | 2022-10-29 09:29:31 |
3 | adb23858-0ea8-4ec1-9d17-5ae5cb70d856 | 2022-11-29 11:54:38 |
4 | 09fbee0c-521e-40ee-a2ff-8ed4187dcbc4 | 2022-11-09 16:57:02 |
5 | 0a796b2c-db2d-4414-847b-a999557c4008 | 2022-11-28 03:10:50 |
6 | be10bc87-b09e-49ec-a66a-d6a801a29abf | 2022-10-21 21:05:13 |
7 | 08d9c64b-b5e1-40d3-9964-0b3e216ff0c7 | 2022-10-02 12:34:51 |
8 | 69f895f3-2677-47b1-a577-b048e5004d4d | 2022-10-31 12:37:45 |
9 | db2d5721-8869-40f7-984c-a94d614fdf69 | 2022-11-02 18:40:31 |
# display the feature values
display(customer_inventory_28d.preview(observation_set))
| | GROCERYCUSTOMERGUID | POINT_IN_TIME | CustomerInventory_28d |
---|---|---|---|
0 | 7a1bc5dc-e198-419e-b972-0abbdf8903c1 | 2022-12-23 10:41:08 | {"Aide à la Pâtisserie":1,"Beurre":1,"Biscuits... |
1 | 6999ea3f-fc7e-4b48-b01f-02a71e0f474d | 2022-11-30 18:25:02 | {"Biscuits":2,"Chips et Tortillas":1,"Colas, T... |
2 | 33a71f26-f36f-423f-8ddb-a7d674102d3b | 2022-10-29 09:29:31 | {"Adoucissants et Soin du linge":1,"Autres Pro... |
3 | adb23858-0ea8-4ec1-9d17-5ae5cb70d856 | 2022-11-29 11:54:38 | NaN |
4 | 09fbee0c-521e-40ee-a2ff-8ed4187dcbc4 | 2022-11-09 16:57:02 | {"Bières et Cidres":1,"Chips et Tortillas":2,"... |
5 | 0a796b2c-db2d-4414-847b-a999557c4008 | 2022-11-28 03:10:50 | {"Chocolats en Poudre":1,"Céréales":1,"Lait UH... |
6 | be10bc87-b09e-49ec-a66a-d6a801a29abf | 2022-10-21 21:05:13 | {"Bières et Cidres":2,"Colas, Thés glacés et S... |
7 | 08d9c64b-b5e1-40d3-9964-0b3e216ff0c7 | 2022-10-02 12:34:51 | {"Autres Produits Laitiers":1,"Biscuits apérit... |
8 | 69f895f3-2677-47b1-a577-b048e5004d4d | 2022-10-31 12:37:45 | {"Fromages":2,"Lessives":1,"Nettoyants Vaissel... |
9 | db2d5721-8869-40f7-984c-a94d614fdf69 | 2022-11-02 18:40:31 | {"Biscuits apéritifs":1,"Chat":5,"Colas, Thés ... |
Example: View the feature lineage¶
# display the feature lineage for the feature we just loaded from the feature store
display(customer_inventory_28d.definition)
# Generated by SDK version: 0.2.2
from bson import ObjectId
from featurebyte import DimensionTable
from featurebyte import FeatureJobSetting
from featurebyte import ItemTable
# item_table name: "INVOICEITEMS", event_table name: "GROCERYINVOICE"
item_table = ItemTable.get_by_id(ObjectId("645c532db37fce40c0e3c89c"))
item_view = item_table.get_view(
event_suffix=None,
view_mode="manual",
drop_column_names=[],
column_cleaning_operations=[],
event_drop_column_names=["record_available_at"],
event_column_cleaning_operations=[],
event_join_column_names=[
"Timestamp",
"GroceryInvoiceGuid",
"GroceryCustomerGuid",
"tz_offset",
],
)
# dimension_table name: "GROCERYPRODUCT"
dimension_table = DimensionTable.get_by_id(ObjectId("645c5330b37fce40c0e3c89d"))
dimension_view = dimension_table.get_view(
view_mode="manual", drop_column_names=[], column_cleaning_operations=[]
)
joined_view = item_view.join(dimension_view, on=None, how="left", rsuffix="")
grouped = joined_view.groupby(
by_keys=["GroceryCustomerGuid"], category="ProductGroup"
).aggregate_over(
value_column=None,
method="count",
windows=["28d"],
feature_names=["CustomerInventory_28d"],
feature_job_setting=FeatureJobSetting(
blind_spot="0s", frequency="3600s", time_modulo_frequency="90s"
),
skip_fill_na=True,
)
feat = grouped["CustomerInventory_28d"]
output = feat
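The definition above groups invoice items by customer, uses the product group as a category, and counts rows over a 28-day window, which is why the materialized values are dictionaries keyed by product group. A minimal plain-Python sketch of that computation follows. The item rows and the function name `inventory` are hypothetical, the window is assumed to cover `[point_in_time - 28 days, point_in_time)`, and the feature job settings (blind spot, frequency) are ignored.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical invoice items: (customer id, invoice timestamp, product group)
items = [
    ("c1", datetime(2022, 11, 1), "Biscuits"),
    ("c1", datetime(2022, 11, 20), "Biscuits"),
    ("c1", datetime(2022, 11, 25), "Fromages"),
    ("c1", datetime(2022, 9, 1), "Lessives"),   # older than 28 days
    ("c2", datetime(2022, 11, 22), "Fromages"),  # different customer
]


def inventory(items, customer, point_in_time, window=timedelta(days=28)):
    """Count a customer's items per product group inside the window before point_in_time."""
    start = point_in_time - window
    counts = Counter(
        group
        for cust, timestamp, group in items
        if cust == customer and start <= timestamp < point_in_time
    )
    return dict(counts)


print(inventory(items, "c1", datetime(2022, 11, 28)))
```

A customer with no items in the window yields an empty dictionary in this sketch, whereas the preview above shows `NaN` for such a row.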
Create New Features from Existing Features¶
You can use existing features as inputs to new features.
Learning Objectives
In this section you will learn how to:
- create a new feature from two existing features
Example: Create a new similarity feature from two existing features¶
# get the StateInventory_28d feature
state_inventory_28d = catalog.get_feature("StateInventory_28d")
# get the CustomerInventory_28d feature
customer_inventory_28d = catalog.get_feature("CustomerInventory_28d")
# create a new feature that is the cosine similarity of the two features
customer_state_items_similarity_28d = customer_inventory_28d.cd.cosine_similarity(
state_inventory_28d
)
customer_state_items_similarity_28d.name = "CustomerStateItemsSimilarity_28d"
customer_state_items_similarity_28d.save()
# display the feature lineage for the feature we just created
display(customer_state_items_similarity_28d.definition)
# Generated by SDK version: 0.2.2
from bson import ObjectId
from featurebyte import DimensionTable
from featurebyte import FeatureJobSetting
from featurebyte import ItemTable
from featurebyte import SCDTable
# scd_table name: "GROCERYCUSTOMER"
scd_table = SCDTable.get_by_id(ObjectId("645c5328b37fce40c0e3c89a"))
scd_view = scd_table.get_view(
view_mode="manual",
drop_column_names=["record_available_at", "CurrentRecord"],
column_cleaning_operations=[],
)
# item_table name: "INVOICEITEMS", event_table name: "GROCERYINVOICE"
item_table = ItemTable.get_by_id(ObjectId("645c532db37fce40c0e3c89c"))
item_view = item_table.get_view(
event_suffix=None,
view_mode="manual",
drop_column_names=[],
column_cleaning_operations=[],
event_drop_column_names=["record_available_at"],
event_column_cleaning_operations=[],
event_join_column_names=[
"Timestamp",
"GroceryInvoiceGuid",
"GroceryCustomerGuid",
"tz_offset",
],
)
# dimension_table name: "GROCERYPRODUCT"
dimension_table = DimensionTable.get_by_id(ObjectId("645c5330b37fce40c0e3c89d"))
dimension_view = dimension_table.get_view(
view_mode="manual", drop_column_names=[], column_cleaning_operations=[]
)
joined_view = item_view.join(dimension_view, on=None, how="left", rsuffix="")
joined_view_1 = joined_view.join(scd_view, on=None, how="left", rsuffix="")
grouped = joined_view_1.groupby(
by_keys=["State"], category="ProductGroup"
).aggregate_over(
value_column=None,
method="count",
windows=["28d"],
feature_names=["StateInventory_28d"],
feature_job_setting=FeatureJobSetting(
blind_spot="0s", frequency="3600s", time_modulo_frequency="90s"
),
skip_fill_na=True,
)
feat = grouped["StateInventory_28d"]
grouped_1 = joined_view.groupby(
by_keys=["GroceryCustomerGuid"], category="ProductGroup"
).aggregate_over(
value_column=None,
method="count",
windows=["28d"],
feature_names=["CustomerInventory_28d"],
feature_job_setting=FeatureJobSetting(
blind_spot="0s", frequency="3600s", time_modulo_frequency="90s"
),
skip_fill_na=True,
)
feat_1 = grouped_1["CustomerInventory_28d"]
feat_2 = feat_1.cd.cosine_similarity(other=feat)
feat_2.name = "CustomerStateItemsSimilarity_28d"
output = feat_2
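To build intuition for the final step of the definition above, here is a minimal sketch of cosine similarity between two count dictionaries, such as a customer's and a state's product-group counts. It is plain Python for illustration only, not the SDK's implementation. The helper name `cosine_similarity_sketch` and the toy dictionaries are hypothetical, and treating a missing key as a zero count is an assumption.

```python
import math


def cosine_similarity_sketch(a: dict, b: dict) -> float:
    """Cosine similarity of two count dictionaries (assumption: missing keys count as 0)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # similarity is undefined when either dictionary is empty or all zeros
        return float("nan")
    return dot / (norm_a * norm_b)


# toy product-group counts (hypothetical values)
customer_counts = {"Beurre": 1, "Biscuits": 2}
state_counts = {"Beurre": 5, "Biscuits": 10}
disjoint_counts = {"Fromages": 3}

same_mix = cosine_similarity_sketch(customer_counts, state_counts)
no_overlap = cosine_similarity_sketch(customer_counts, disjoint_counts)
empty_case = cosine_similarity_sketch({}, state_counts)
```

A value near 1 means the customer buys product groups in roughly the same proportions as the state as a whole, and a value of 0 means there is no overlap.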
Create a New Feature List From Existing Features¶
Learning Objectives
In this section you will learn how to:
- create a feature list with a primary entity suited to your use case
Example: Create a customer level feature list¶
# get a list of all the features in the catalog
all_features = catalog.list_features()
# drop features whose entities include grocery invoice or grocery product,
# keeping only those at the grocery customer or state level
child_entity = "groceryinvoice"
suitable_features = all_features.loc[
[child_entity not in x for x in all_features.entities.values]
]
product_entity = "groceryproduct"
suitable_features = suitable_features.loc[
[product_entity not in x for x in suitable_features.entities.values]
]
# show the features
display(suitable_features)
 | id | name | dtype | readiness | online_enabled | tables | primary_tables | entities | primary_entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c5363b37fce40c0e3c8c9 | CustomerStateItemsSimilarity_28d | FLOAT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [INVOICEITEMS] | [grocerycustomer, frenchstate] | [grocerycustomer] | 2023-05-11 02:31:01.994 |
4 | 645c5343b37fce40c0e3c8b8 | CustomerYearOfBirth | INT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:27.690 |
5 | 645c5341b37fce40c0e3c8b4 | CustomerSpend_14d | FLOAT | DRAFT | False | [GROCERYINVOICE] | [GROCERYINVOICE] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:26.874 |
6 | 645c5340b37fce40c0e3c8b2 | CustomerInventory_24w | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:25.298 |
7 | 645c533eb37fce40c0e3c8b0 | CustomerInventory_28d | OBJECT | DRAFT | False | [GROCERYINVOICE, INVOICEITEMS, GROCERYPRODUCT] | [INVOICEITEMS] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:23.367 |
8 | 645c533cb37fce40c0e3c8ae | StateMeanLongitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:20.604 |
9 | 645c533bb37fce40c0e3c8ac | StateMeanLatitude | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:19.879 |
10 | 645c533ab37fce40c0e3c8aa | StateAvgInvoiceAmount_28d | FLOAT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE] | [GROCERYINVOICE] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:19.007 |
11 | 645c5337b37fce40c0e3c8a8 | StateInventory_28d | OBJECT | DRAFT | False | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [INVOICEITEMS] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:17.076 |
12 | 645c5336b37fce40c0e3c8a4 | StatePopulation | FLOAT | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [frenchstate] | [frenchstate] | 2023-05-11 02:30:14.988 |
13 | 645c5335b37fce40c0e3c8a2 | StateName | VARCHAR | DRAFT | False | [GROCERYCUSTOMER] | [GROCERYCUSTOMER] | [grocerycustomer] | [grocerycustomer] | 2023-05-11 02:30:13.971 |
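The two-step entity filter used in this example can be tried out on a toy frame. This is a minimal sketch that assumes only what the listing shows: a `name` column and an `entities` column holding lists of entity names. The frame `toy_features` and the feature names `InvoiceItemCount` and `ProductGroupName` are hypothetical stand-ins, not objects from the catalog.

```python
import pandas as pd

# hypothetical stand-in for the output of catalog.list_features()
toy_features = pd.DataFrame(
    {
        "name": [
            "CustomerSpend_14d",
            "StatePopulation",
            "InvoiceItemCount",   # hypothetical invoice-level feature
            "ProductGroupName",   # hypothetical product-level feature
        ],
        "entities": [
            ["grocerycustomer"],
            ["frenchstate"],
            ["groceryinvoice"],
            ["groceryproduct"],
        ],
    }
)

# entities that rule a feature out for a customer-level feature list
excluded_entities = {"groceryinvoice", "groceryproduct"}

# keep a row only if none of its entities is in the excluded set
keep_mask = toy_features["entities"].apply(
    lambda entities: excluded_entities.isdisjoint(entities)
)
toy_suitable = toy_features.loc[keep_mask]
```

Doing the exclusion in one pass with a set gives the same rows as the two successive `.loc` filters and makes it easy to add further entities to exclude.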
# create a new feature list from the 11 features we just searched for
customer_features = fb.FeatureList([
catalog.get_feature(x)
for x in suitable_features.name.values
], name="CustomerFeatures")
customer_features.save()
# display a sample of the feature list values
display(customer_features.preview(observation_set))
Saving Feature(s) |████████████████████████████████████████| 11/11 [100%] in 3.6
Loading Feature(s) |████████████████████████████████████████| 11/11 [100%] in 2.
 | GROCERYCUSTOMERGUID | POINT_IN_TIME | CustomerStateItemsSimilarity_28d | CustomerYearOfBirth | CustomerSpend_14d | CustomerInventory_24w | CustomerInventory_28d | StateMeanLongitude | StateMeanLatitude | StateAvgInvoiceAmount_28d | StateInventory_28d | StatePopulation | StateName |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7a1bc5dc-e198-419e-b972-0abbdf8903c1 | 2022-12-23 10:41:08 | 0.811556 | 1943 | 82.98 | {"Adoucissants et Soin du linge":1,"Aide à la ... | {"Aide à la Pâtisserie":1,"Beurre":1,"Biscuits... | 5.887195 | 43.456104 | 16.133960 | {"Adoucissants et Soin du linge":5,"Aide à la ... | 53 | Provence-Alpes-Côte d'Azur |
1 | 6999ea3f-fc7e-4b48-b01f-02a71e0f474d | 2022-11-30 18:25:02 | 0.685521 | 1950 | 8.96 | {"Adoucissants et Soin du linge":2,"Biscuits a... | {"Biscuits":2,"Chips et Tortillas":1,"Colas, T... | 2.241215 | 48.738384 | 19.683605 | {"Adoucissants et Soin du linge":13,"Aide à la... | 181 | Île-de-France |
2 | 33a71f26-f36f-423f-8ddb-a7d674102d3b | 2022-10-29 09:29:31 | 0.623732 | 1955 | 182.03 | {"Adoucissants et Soin du linge":2,"Aide à la ... | {"Adoucissants et Soin du linge":1,"Autres Pro... | 2.241215 | 48.738384 | 18.578355 | {"Adoucissants et Soin du linge":21,"Aide à la... | 181 | Île-de-France |
3 | adb23858-0ea8-4ec1-9d17-5ae5cb70d856 | 2022-11-29 11:54:38 | NaN | 1977 | 0.00 | {"Cave à Vins":2,"Colas, Thés glacés et Sodas"... | NaN | 2.241215 | 48.738384 | 19.542023 | {"Adoucissants et Soin du linge":14,"Aide à la... | 181 | Île-de-France |
4 | 09fbee0c-521e-40ee-a2ff-8ed4187dcbc4 | 2022-11-09 16:57:02 | 0.601944 | 1997 | 7.50 | {"Adoucissants et Soin du linge":7,"Aide à la ... | {"Bières et Cidres":1,"Chips et Tortillas":2,"... | 5.855939 | 48.789776 | 21.820370 | {"Adoucissants et Soin du linge":1,"Aide à la ... | 8 | Lorraine |
5 | 0a796b2c-db2d-4414-847b-a999557c4008 | 2022-11-28 03:10:50 | 0.268576 | 1963 | 11.24 | {"Biscuits apéritifs":1,"Biscuits":1,"Boucheri... | {"Chocolats en Poudre":1,"Céréales":1,"Lait UH... | 2.241215 | 48.738384 | 19.911254 | {"Adoucissants et Soin du linge":14,"Aide à la... | 181 | Île-de-France |
6 | be10bc87-b09e-49ec-a66a-d6a801a29abf | 2022-10-21 21:05:13 | 0.550035 | 1972 | 34.26 | {"Adoucissants et Soin du linge":1,"Bières et ... | {"Bières et Cidres":2,"Colas, Thés glacés et S... | 2.241215 | 48.738384 | 18.652529 | {"Adoucissants et Soin du linge":25,"Aide à la... | 181 | Île-de-France |
7 | 08d9c64b-b5e1-40d3-9964-0b3e216ff0c7 | 2022-10-02 12:34:51 | 0.836615 | 1971 | 21.59 | {"Aide à la Pâtisserie":2,"Animalerie, Soins e... | {"Autres Produits Laitiers":1,"Biscuits apérit... | 7.573264 | 48.177401 | 17.236800 | {"Aide à la Pâtisserie":5,"Autres Produits Lai... | 12 | Alsace |
8 | 69f895f3-2677-47b1-a577-b048e5004d4d | 2022-10-31 12:37:45 | 0.442991 | 1992 | 7.08 | {"Beurre":2,"Biscuits":1,"Bières et Cidres":2,... | {"Fromages":2,"Lessives":1,"Nettoyants Vaissel... | 5.054081 | 45.500198 | 17.619500 | {"Adoucissants et Soin du linge":5,"Aide à la ... | 33 | Rhône-Alpes |
9 | db2d5721-8869-40f7-984c-a94d614fdf69 | 2022-11-02 18:40:31 | 0.428169 | 1957 | 7.52 | {"Aide à la Pâtisserie":1,"Biscuits apéritifs"... | {"Biscuits apéritifs":1,"Chat":5,"Colas, Thés ... | 2.241215 | 48.738384 | 18.723496 | {"Adoucissants et Soin du linge":14,"Aide à la... | 181 | Île-de-France |
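The first two columns of the preview are the observation set itself: one row per customer key and point in time. As a minimal sketch, and assuming the observation set is an ordinary pandas DataFrame with exactly these two columns, a small one could be built by hand as follows. The variable name `observation_sketch` is hypothetical, and the two rows are copied from the preview output above.

```python
import pandas as pd

# assumption: one row per (entity key, point in time) pair to compute features for
observation_sketch = pd.DataFrame(
    {
        "GROCERYCUSTOMERGUID": [
            "7a1bc5dc-e198-419e-b972-0abbdf8903c1",
            "6999ea3f-fc7e-4b48-b01f-02a71e0f474d",
        ],
        # parse the timestamps so the column has a datetime dtype
        "POINT_IN_TIME": pd.to_datetime(
            ["2022-12-23 10:41:08", "2022-11-30 18:25:02"]
        ),
    }
)
```

Each feature value in a preview row is computed as of that row's `POINT_IN_TIME`, which is why the same state can show different 28-day values for different customers.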
# list the feature lists in the catalog
catalog.list_feature_lists()
 | id | name | num_feature | status | deployed | readiness_frac | online_frac | tables | entities | created_at |
---|---|---|---|---|---|---|---|---|---|---|
0 | 645c536eb37fce40c0e3c8cb | CustomerFeatures | 11 | DRAFT | False | 0.0 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [grocerycustomer, frenchstate] | 2023-05-11 02:31:16.244 |
1 | 645c5349b37fce40c0e3c8c2 | StateFeatureList | 5 | DRAFT | False | 0.0 | 0.0 | [GROCERYCUSTOMER, GROCERYINVOICE, INVOICEITEMS... | [frenchstate] | 2023-05-11 02:30:36.305 |
Next Steps¶
Now that you've completed the "Quick Start Tutorial: Reusing Features", you can put your knowledge into practice or learn more:
- Learn more about materializing features via the "Deep Dive Materializing Features" tutorial
- Learn more about feature engineering via the "Deep Dive Feature Engineering" tutorial
- Learn about data modeling via the "Deep Dive Data Modeling" tutorial