Deep Dive Tutorial: Feature Engineering¶
Learning Objectives¶
In this tutorial you will learn:
- How to create and use views
- How features, entities, and observation sets are used together
- How to filter views
- How to join views
- How to aggregate data into features
- How to create features from features
- How to add a feature to a view
- How to use signal types for creative feature ideation
- How entities are the key to coherent feature lists
Set up the prerequisites¶
Learning Objectives
In this section you will:
- start your local featurebyte server
- import libraries
- learn about catalogs
- activate a pre-built catalog
Load the featurebyte library and connect to the local instance of featurebyte¶
# library imports
import pandas as pd
import numpy as np
# load the featurebyte SDK
import featurebyte as fb
from featurebyte.api.request_column import RequestColumn
# start the local server, then wait for it to be healthy before proceeding
fb.playground()
16:50:25 | INFO | Using configuration file at: C:\Users\colin\.featurebyte\config.yaml 16:50:25 | INFO | Active profile: local (http://127.0.0.1:8088) 16:50:25 | INFO | SDK version: 0.2.2 16:50:25 | INFO | Active catalog: default 16:50:25 | INFO | 0 feature list, 0 feature deployed 16:50:25 | INFO | (1/4) Starting featurebyte services 16:50:28 | INFO | (2/4) Creating local spark feature store 16:50:28 | INFO | (3/4) Import datasets 16:50:28 | INFO | Dataset grocery already exists, skipping import 16:50:28 | INFO | Dataset healthcare already exists, skipping import 16:50:28 | INFO | Dataset creditcard already exists, skipping import 16:50:28 | INFO | (4/4) Playground environment started successfully. Ready to go! 🚀
Create a pre-built catalog for this tutorial, with the data, metadata, and features already set up¶
Note that creating a pre-built catalog is not a step you will do in real-life. This is a function specific to this quick-start tutorial to quickly skip over many of the preparatory steps and get you to a point where you can materialize features.
In a real-life project you would do data modeling, declaring the tables, entities, and the associated metadata. This would not be a frequent task, but forms the basis for best-practice feature engineering.
# get the functions to create a pre-built catalog
from prebuilt_catalogs import *
# create a new catalog for this tutorial
catalog = create_tutorial_catalog(PrebuiltCatalog.DeepDiveFeatureEngineeering)
Cleaning up existing tutorial catalogs
16:50:35 | INFO | Catalog activated: deep dive feature engineering 20230516:1650
Building a deep dive catalog for feature engineering named [deep dive feature engineering 20230516:1650] Creating new catalog Catalog created Registering the source tables Registering the entities Tagging the entities to columns in the data tables Populating the feature store with example features Catalog created and pre-populated with data and features
Create Views of Tables within the Catalog¶
Learning Objectives
In this section you will learn:
- the dataset being used in this tutorial
- the purpose of FeatureByte tables
- standard table types
- how to load a table
- the purpose of FeatureByte views
- how to create a view from a table
Introduction to the French grocery dataset¶
This tutorial uses the French grocery dataset that has been pre-installed in quick-start feature engineering catalog. It consists of 4 data tables recording grocery purchasing activity for each customer.
- GroceryCustomer is a slowly changing dimension table containing customer attributes.
- GroceryInvoice is an event table containing grocery purchase transactions.
- InvoiceItems is an event items table containing details of the basket of grocery items purchased in each transaction.
- GroceryProduct is a dimension table containing the product attributes for each grocery item being sold.
Concept: Catalog table¶
A Catalog Table provides a centralized location for metadata about a source table. This metadata determines the type of operations that can be applied to the table's views and includes essential information for feature engineering.
Concept: Table types¶
Understanding the type of data contained in a table is crucial because it helps determine the appropriate feature engineering techniques that can be applied to the table.
Featurebyte supports four of the most common types of data table.
- An event table represents a table in the data warehouse where each row indicates a unique business event occurring at a particular time. Event tables can take various forms, such as an Order table in E-commerce, Credit Card Transactions in Banking, Doctor Visits in Healthcare, and Clickstream on the Internet.
- An item table represents a table in the data warehouse containing detailed information about a specific business event. For instance, an Item table can contain information about Product Items purchased in Customer Orders or Drug Prescriptions issued during Doctor Visits by Patients.
- A dimension table represents a table in the data warehouse containing static descriptive data. Using a Dimension table requires special attention. If the data in the table changes slowly, it is not advisable to use it because these changes can cause significant data leaks during model training and adversely affect the inference performance. In such cases, it is recommended to use a Slowly Changing Dimension table of Type 2 that maintains a history of changes. For example, dimension data could contain the product group of each grocery product.
- A slowly changing dimension (SCD) table represents a table in a data warehouse that contains data that changes slowly and unpredictably over time. There are two main types of SCDs: Type 1, which overwrites old data with new data, and Type 2, which maintains a history of changes by creating a new record for each change. FeatureByte only supports the use of Type 2 SCDs since SCDs of Type 1 may cause data leaks during model training and poor performance during inference. An SCD Table of Type 2 utilizes a natural key to distinguish each active row and facilitate tracking of changes over time. The SCD table employs effective and expiration date columns to determine the active status of a row. In certain instances, an active flag column may replace the expiration date column to indicate if a row is currently active. For example, slowly changing dimension data could contain customer data, which has attributes that need versioning, such as when a customer changes address.
Example: Load featurebyte tables¶
FeatureByte works on the principle of not moving data unnecessarily. So when you load a featurebyte table, you load its metadata, not the full contents of the table.
# get the tables for this workspace
grocery_customer_table = catalog.get_table("GROCERYCUSTOMER")
grocery_items_table = catalog.get_table("INVOICEITEMS")
grocery_invoice_table = catalog.get_table("GROCERYINVOICE")
grocery_product_table = catalog.get_table("GROCERYPRODUCT")
Concept: FeatureByte view¶
A FeatureByte view is a local virtual table that can be modified and joined to other views to prepare data before feature definition. A view does not contain any data of its own, but instead retrieves data from the underlying tables each time it is queried. It doesn't modify the data in those tables either. The view object works similar to a SQL view.
Load the tables for this catalog¶
# create the views
grocery_customer_view = grocery_customer_table.get_view()
grocery_invoice_view = grocery_invoice_table.get_view()
grocery_items_view = grocery_items_table.get_view()
grocery_product_view = grocery_product_table.get_view()
Features¶
Learning Objectives
In this section you will learn:
- about FeatureByte features
- the purpose of entities
- the purpose and usage of observation sets
Concept: Feature¶
A Feature object contains the logical plan to compute a feature which is usually used as input data to train or predict Machine Learning models.
There are three ways to define the plan for Feature objects from views: either as a Lookup feature, as an Aggregate feature or as a Cross Aggregate feature.
Additionally, Feature objects can be created as transformations of one or more existing features.
Concept: Entity¶
An Entity object contains metadata on a real-world object or concept represented or referenced by tables within your data warehouse.
Entities facilitate automatic table join definitions, serve as the unit of analysis for feature engineering, and aid in organizing features, feature lists, and use cases.
All features must relate to an entity (or entities) as their primary unit of analysis.
Concept: Feature Primary Entity¶
The primary entity of a feature defines the level of analysis for that feature.
The primary entity is usually a single entity. However, in some instances, it may be a tuple of entities.
When a feature is a result of an aggregation grouped by multiple entities, the primary entity is a tuple of those entities. For instance, if a feature quantifies the interaction between a customer entity and a merchant entity in the past, such as the sum of transaction amounts grouped by customer and merchant in the past 4 weeks, the primary entity is the tuple of customer and merchant.
When a feature is derived from features with different primary entities, the primary entity is determined by the entity relationships. The lowest level entity in the hierarchy is selected as the primary entity. If the entities have no relationship, the primary entity becomes a tuple of those entities.
For example, if a feature compares the basket of a customer with the average basket of customers in the same city, the primary entity is the customer since the customer entity is a child of the customer city entity. However, if the feature is the distance between the customer location and the merchant location, the primary entity becomes the tuple of customer and merchant since these entities do not have any parent-child relationship.
Example: List entities¶
Note that in this case study, all entities except French state are used for joining tables.
All entities can be used as a unit of analysis for features. For example, the french state entity can be used for creating features that aggregate over the geography.
# list the entities in the dataset
catalog.list_entities()
id | name | serving_names | created_at | |
---|---|---|---|---|
0 | 646343e55ddec2e093fe8659 | frenchstate | [FRENCHSTATE] | 2023-05-16 08:50:45.240 |
1 | 646343e55ddec2e093fe8658 | groceryproduct | [GROCERYPRODUCTGUID] | 2023-05-16 08:50:45.172 |
2 | 646343e55ddec2e093fe8657 | groceryinvoice | [GROCERYINVOICEGUID] | 2023-05-16 08:50:45.105 |
3 | 646343e45ddec2e093fe8656 | grocerycustomer | [GROCERYCUSTOMERGUID] | 2023-05-16 08:50:45.035 |
Concept: Observation set¶
An observation set combines entity key values and historical points-in-time, for which you wish to materialize feature values.
The observation set can be a Pandas DataFrame or an ObservationTable object representing an observation set in the feature store.
Concept: Observation table¶
An ObservationTable object is a representation of an observation set in the feature store. Unlike a local Pandas DataFrame, the ObservationTable is part of the catalog and can be shared or reused.
ObservationTable objects can be created from a source table or from a view after subsampling.
Example: Creating an observation table¶
Some use cases are about events, and require predictions to be triggered when a specified event occurs.
For a use case requiring predictions about a grocery customer whenever an invoice event occurs, your observation set may be sampled from historical invoices.
# create a large observation table from a view
# filter the view to exclude points in time that won't have data for historical windows
filter = (grocery_invoice_view["Timestamp"] >= pd.to_datetime("2022-04-01")) & \
(grocery_invoice_view["Timestamp"] < pd.to_datetime("2023-04-01"))
observation_set_view = grocery_invoice_view[filter].copy()
# create a new observation table
observation_table = observation_set_view.create_observation_table(
name = "200 customers who were active between 01-Apr-2022 and 31-Mar-2023",
sample_rows = 200,
columns = ["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)
# if the observation table isn't too large, you can materialize it
display(observation_table.to_pandas())
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s) Downloading table |████████████████████████████████████████| 200/200 [100%] in 0
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-08-31 15:58:39 | abdef773-ab72-43b6-8e77-050804c1c5fc |
1 | 2022-09-07 16:02:39 | abdef773-ab72-43b6-8e77-050804c1c5fc |
2 | 2022-04-02 19:30:03 | aae48cd3-7646-4df6-9700-3ef7f29ec80f |
3 | 2022-08-23 17:54:26 | aae48cd3-7646-4df6-9700-3ef7f29ec80f |
4 | 2022-05-11 15:52:06 | 17da4b5e-02b9-4e9b-8839-5a5aa77b9bf5 |
... | ... | ... |
195 | 2022-08-16 14:15:16 | bbaff8e5-44ab-4f61-a4e6-405f274bf429 |
196 | 2022-04-06 08:11:45 | 3dff3cdc-3e64-4b6f-ab62-8165afca8c55 |
197 | 2022-04-15 19:37:08 | 7ba62f2e-d7b7-4bfb-8059-628fc5cf3f92 |
198 | 2022-10-26 20:49:01 | e55e1ee2-7f4e-47a2-95f2-5c7a7ad5602c |
199 | 2022-06-03 16:38:54 | c0c4da4d-08a3-4a03-a1f6-9c015362caf9 |
200 rows × 2 columns
# create a short pandas observation set useful for previewing features at customer entity level
# create a new observation table
observation_table_customers = observation_set_view.create_observation_table(
name = "5 customers who were active between 01-Apr-2022 and 31-Mar-2023",
sample_rows = 5,
columns = ["Timestamp", "GroceryCustomerGuid"],
columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryCustomerGuid": "GROCERYCUSTOMERGUID"},
)
observation_set = observation_table_customers.to_pandas()
display(observation_set)
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s) Downloading table |████████████████████████████████████████| 5/5 [100%] in 0.1s
POINT_IN_TIME | GROCERYCUSTOMERGUID | |
---|---|---|
0 | 2022-10-03 10:23:18 | edd7c932-6a3e-482d-8e36-1402c2d3c3ff |
1 | 2022-11-17 19:23:53 | d7316f3d-6ea9-49b6-97f0-3d20ea9d1331 |
2 | 2022-06-23 16:26:43 | 197d9b45-2cad-4f77-842e-4df23f7e52d6 |
3 | 2022-08-01 21:03:58 | e5644562-22c9-4aa3-a332-c93cce923596 |
4 | 2022-06-23 16:32:00 | 575ceb64-e6ef-446d-9a38-929e35e4cbef |
# create a short pandas observation set useful for previewing features at invoice entity level
# create a new observation table
observation_table_invoices = observation_set_view.create_observation_table(
name = "5 invoices with timestamps between 01-Apr-2022 and 31-Mar-2023",
sample_rows = 5,
columns = ["Timestamp", "GroceryInvoiceGuid"],
columns_rename_mapping = {"Timestamp": "POINT_IN_TIME", "GroceryInvoiceGuid": "GROCERYINVOICEGUID"},
)
observation_set_invoices = observation_table_invoices.to_pandas()
display(observation_set_invoices)
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s) Downloading table |████████████████████████████████████████| 5/5 [100%] in 0.1s
POINT_IN_TIME | GROCERYINVOICEGUID | |
---|---|---|
0 | 2022-10-27 11:04:22 | 6806fcb5-ab2c-439d-a35a-296f2a891a80 |
1 | 2022-05-16 10:22:49 | ebe9fe2b-2457-470e-b27c-dd07a18186e8 |
2 | 2022-08-20 17:31:23 | 50dd32a2-97f6-4092-8afb-7b7d08430f8b |
3 | 2022-05-12 13:41:39 | 92d8e1fb-d882-4104-a7e2-794208162c4e |
4 | 2022-11-30 17:55:27 | 961ea607-9375-4a14-ad70-fc3a6353f963 |
Filtering¶
Learning Objectives
In this section you will learn:
- how to filter a view
- how to transform data using conditions and filters
Example: Filtering a view¶
The syntax for filtering a view is the same as pandas.
# create a filter for filtering rows to see only small purchases
filter = grocery_invoice_view["Amount"] < 10
filtered_invoice_view = grocery_invoice_view[filter]
display(filtered_invoice_view.sample())
GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | Amount | |
---|---|---|---|---|---|
0 | a425afee-805d-443c-8fe7-91c76a19630b | 9c23c4e8-f0e8-4aa4-83e9-3d3525461a8f | 2022-09-05 10:33:13 | +02:00 | 8.16 |
1 | 92766e8c-d6dc-4cdc-92ec-465c75d202ae | b21ae11c-83cf-4146-832e-1163413a3295 | 2022-07-04 14:21:13 | +02:00 | 8.98 |
2 | c1615f5f-f70d-4f52-ac7d-cc5ca8defa35 | dfdd117d-e9fc-46c5-96ce-642bfad7f2b9 | 2022-02-07 17:16:39 | +01:00 | 5.23 |
3 | ba7eb0c9-522d-44a1-86f7-d658c92698bb | 56b44a06-48c8-4ea7-8314-972a69ff99be | 2022-05-15 15:38:50 | +02:00 | 1.89 |
4 | 94c248c5-11cc-423f-987f-04c7ce3b5dec | b21ae11c-83cf-4146-832e-1163413a3295 | 2023-01-25 21:59:33 | +01:00 | 8.29 |
5 | 597d6b7d-be61-4bad-939c-0be97443be4f | 806ca360-27ce-49e2-bec4-c048aaf14ad5 | 2023-04-21 14:50:22 | +02:00 | 9.06 |
6 | cdda7c23-e17f-47eb-af9a-ecf500a5c7db | 43f100f0-f804-4774-a8a5-dee3ad2f9ea5 | 2023-01-23 22:22:02 | -04:00 | 2.58 |
7 | 0c135ebc-5133-4a9f-b8c7-dc95ba0095d6 | f8d10416-e125-4dcf-a351-08653114cf0f | 2023-04-03 14:22:12 | +02:00 | 1.88 |
8 | 8c375705-2092-4582-a723-da270df3e730 | 3019bdbf-667c-4081-acb5-26cd2d559c5e | 2022-03-17 03:16:31 | +03:00 | 5.25 |
9 | e524cd1c-2a0d-437c-9d49-8eb293e9541d | cd9d2656-375c-4b43-8f7c-5f89729fd5f4 | 2022-04-04 19:27:25 | +02:00 | 2.00 |
Example: Conditional transformations¶
The featurebyte way of doing if-then-else transformations is via conditions or filters.
# flag items as discounted, free, or undiscounted
discounted_filter = grocery_items_view["Discount"] > 0
free_filter = grocery_items_view["TotalCost"] == 0
grocery_items_view["DiscountCategory"] = "Undiscounted"
grocery_items_view.DiscountCategory[discounted_filter] = "Discounted"
grocery_items_view.DiscountCategory[free_filter] = "Free"
display(grocery_items_view[["TotalCost", "DiscountCategory"]].sample())
GroceryInvoiceItemGuid | TotalCost | DiscountCategory | |
---|---|---|---|
0 | 0c5181e3-9d9c-402d-902a-1649c3a26232 | 1.74 | Discounted |
1 | 5b853ed2-aea7-4fad-aaa5-bcadbef0eba8 | 2.50 | Discounted |
2 | d2d7633e-3bdf-430d-920e-13825cad3e19 | 0.75 | Undiscounted |
3 | 7c4c38cc-7150-4bca-b2c1-0d4616d4809f | 1.98 | Undiscounted |
4 | cd0d8e88-e8fd-41d9-a4a4-8c9d4e05a1d8 | 1.29 | Undiscounted |
5 | d010fe4e-266c-4c7a-93dc-e9382bf701e3 | 1.00 | Discounted |
6 | 8e662549-45a1-44d6-9303-ed57ec075ee5 | 8.99 | Undiscounted |
7 | aee67dc0-b9cc-494c-983c-615bc7e6d407 | 2.50 | Discounted |
8 | c44ca187-bfec-4d36-942d-12b53d21e6c9 | 1.19 | Discounted |
9 | 200d5fcc-cb83-4d25-a4d8-823dd6b85532 | 0.25 | Undiscounted |
Joins¶
Learning Objectives
In this section you will learn:
- how views are joined
- the purpose of natural keys
- which view types can be joined
- how joins are frequently unnecessary
Concept: Principles of featurebyte joins¶
In featurebyte:
- Joins operate on views
- Join criteria by common entities, and by event timestamps for joins of event views and slowly changing data
- Similarly to pandas, for the right-hand-side view, the join key must be its index (its natural key).
- Joins add columns to an existing view
- Joins never increase the number of rows in a view.
- By default, the number of rows do not change after a join. However, the number of rows may reduce if an inner join is selected.
- Only one-to-one and many-to-one relationships are supported. One-to-many and many-to-many relationships are not supported.
- Always start with the view that has the many side of the relationship, then join the view that has the one side of the relationship
- Similarly to a left join, rows with no match will contain missing values for the joined fields
Concept: Natural key¶
A Natural Key is a generally accepted identifier used to identify real-world objects uniquely. In a Slowly Changing Dimension (SCD) table, a natural key (also called alternate key) is a column or a group of columns that remain constant over time and uniquely identifies each active row in the table at any point-in-time.
This key is crucial in maintaining and analyzing the historical changes made in the table.
Concept: View joins¶
To join two views, use the join()
method of the left view and specify the right view object in the other_view
parameter. The method will match rows from both views based on a shared key, which is either the primary key of the right view or the natural key if the right view is an SCD view.
If the shared key identifies an entity that is referenced in the left view or the column name of the shared key is the same in both views, the join()
method will automatically identify the column in the left view to use for the join.
By default, a left join is performed, and the resulting view will have the same number of rows as the left view. However, you can set the how
parameter to 'inner' to perform an inner join. In this case, the resulting view will only contain rows where there is a match between the columns in both tables.
When the right view is an SCD view, the event timestamp of the left view determines which record of the right view to join.
Example: Join event data to item data¶
Event data is automatically joined to item data via the event ID.
When an ItemView is created, the event_timestamp and the entities of the event data the item data is associated with are automatically added. Featurebyte automatically joins the parent event's entity and timestamp to the item view.
The preferred method to add columns from the event view is the join_event_data_attributes method.
# copy the invoice amount into the items view
grocery_items_view = grocery_items_view.join_event_table_attributes(['Amount'], event_suffix='_invoice_total')
display(grocery_items_view.preview())
GroceryInvoiceItemGuid | GroceryInvoiceGuid | GroceryProductGuid | Quantity | UnitPrice | TotalCost | Discount | record_available_at | GroceryCustomerGuid | Timestamp | tz_offset | DiscountCategory | Amount_invoice_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | c6c7c746-782c-418b-b123-7ded5a8d8d30 | e87223dc-c5ef-4360-8637-86dc69af5d39 | 2d19bd87-a6ea-4553-9caa-1ab6d7e322b9 | 1.0 | 2.000 | 2.00 | 1.49 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 |
1 | a2c85c99-65a6-480b-9259-3339a3b5dc19 | e87223dc-c5ef-4360-8637-86dc69af5d39 | ed2445ad-8a97-4a4d-9f5c-eb253d66f916 | 1.0 | 0.990 | 0.99 | 0.20 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 |
2 | c8d0bba1-5adf-4747-b6f7-cd16925392dc | e87223dc-c5ef-4360-8637-86dc69af5d39 | 77b6847c-65bd-4e40-925f-7c2e2db80936 | 1.0 | 1.990 | 1.99 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 |
3 | adb793ab-90ff-4bc0-a9eb-47f71952d97c | e87223dc-c5ef-4360-8637-86dc69af5d39 | 7b3f3630-f2e4-4d18-8174-36d35d171069 | 1.0 | 2.000 | 2.00 | 1.15 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 |
4 | 700d6299-9e33-4a10-8054-7cdcb9fa8529 | e87223dc-c5ef-4360-8637-86dc69af5d39 | 8a01f8db-9796-4e17-8e8d-b893770e5a40 | 1.0 | 1.000 | 1.00 | 0.61 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 |
5 | fe0683c7-f8ae-4aa3-a1a8-6e762c262644 | e87223dc-c5ef-4360-8637-86dc69af5d39 | 525dfef6-62a2-488d-9f2e-c4631cc4a8a2 | 1.0 | 1.290 | 1.29 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 |
6 | 1b194981-b1bb-4306-a760-8131493f544f | e87223dc-c5ef-4360-8637-86dc69af5d39 | c18c8a1c-bf87-4b2c-bc2b-8ea468e92d82 | 2.0 | 1.145 | 2.29 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 |
7 | c5f9cd59-4f4a-4a39-9d7a-7b7f0c2bb07f | e87223dc-c5ef-4360-8637-86dc69af5d39 | 9ee0d42e-dcb7-493b-98ed-ff63d951431d | 1.0 | 1.290 | 1.29 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 |
8 | a6ac8ea1-fb10-4d11-9470-8b83d14e2f45 | e87223dc-c5ef-4360-8637-86dc69af5d39 | a99853f4-a90e-4b8b-97d0-d42aad75d1f0 | 1.0 | 1.290 | 1.29 | 0.50 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 |
9 | c35b60b8-941f-4fa6-b272-41c7bfca228c | e87223dc-c5ef-4360-8637-86dc69af5d39 | 49686449-3e3d-445f-b98b-26e86027855a | 1.0 | 1.590 | 1.59 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 |
Example: Join Slowly Changing Dimension view to Event view¶
When the right view of a join is an SCD view, the event timestamp of the left view determines which record of the right view to join.
# Join selected columns from the grocery customer view with the grocery invoice view
grocery_invoice_view = grocery_invoice_view.join(grocery_customer_view[["Gender", "State"]], rsuffix="_Customer")
display(grocery_invoice_view.sample())
GroceryInvoiceGuid | GroceryCustomerGuid | Timestamp | tz_offset | Amount | Gender_Customer | State_Customer | |
---|---|---|---|---|---|---|---|
0 | 6a1bf1df-1878-489c-afa6-05c44b3505bb | 32dd07d0-2c16-4b34-8cc9-01f258e0b935 | 2022-03-18 14:32:05 | +01:00 | 2.98 | male | Île-de-France |
1 | 003cc207-03d8-4cc6-bcd0-fa00360b0c6e | 59883bef-01fd-4321-95c4-fd0fad7f19ce | 2022-06-30 17:34:23 | +02:00 | 5.69 | male | Rhône-Alpes |
2 | 80c0a662-0978-4587-83e1-f23b2fdb54b1 | cc4220ec-16ab-4bb9-991d-deef994bf27a | 2022-05-15 14:34:42 | +02:00 | 12.03 | female | Île-de-France |
3 | 22da6729-31d3-4865-9ca2-3df148d93138 | 144a0fe4-2137-43f6-b266-411b9eb7cb31 | 2022-01-18 19:00:11 | +01:00 | 11.75 | female | Île-de-France |
4 | f291eedf-e115-4546-9eed-d6fafbcdd80e | 97074c73-4eea-4d6e-b4fb-09c975c570cc | 2023-01-02 12:08:03 | +01:00 | 2.50 | male | Île-de-France |
5 | 06eb0029-0091-4060-bc2f-dbfaa7581034 | cc4220ec-16ab-4bb9-991d-deef994bf27a | 2022-07-25 14:44:59 | +02:00 | 6.99 | female | Île-de-France |
6 | 8cc7b598-54f5-4299-b0b5-e5488e701443 | 402d0c33-adfa-4bdd-a4be-fce76171466d | 2022-05-25 07:52:05 | +02:00 | 10.37 | male | Alsace |
7 | a986b10e-12a0-4d96-888a-af69b00ab94b | 6ef2203f-d91c-43f2-93c7-ffc414035765 | 2022-03-16 15:52:11 | +01:00 | 9.47 | male | Champagne-Ardenne |
8 | 1f398f45-869d-4d33-9da6-0a1b4f10550a | a303679d-0e83-4e18-bc3f-6cae734ed8f3 | 2022-10-24 14:01:28 | +02:00 | 111.42 | male | Île-de-France |
9 | 39e80f95-1f24-46d6-a604-f5cbb71a6c43 | e490ab6d-c699-44c3-a284-41a7bbb1ee6f | 2023-02-10 01:00:16 | +01:00 | 1.29 | female | Île-de-France |
Example: Join Dimension view to Item view¶
# join the grocery product view with the grocery items view
grocery_items_view = grocery_items_view.join(grocery_product_view)
display(grocery_items_view.preview())
GroceryInvoiceItemGuid | GroceryInvoiceGuid | GroceryProductGuid | Quantity | UnitPrice | TotalCost | Discount | record_available_at | GroceryCustomerGuid | Timestamp | tz_offset | DiscountCategory | Amount_invoice_total | ProductGroup | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | c6c7c746-782c-418b-b123-7ded5a8d8d30 | e87223dc-c5ef-4360-8637-86dc69af5d39 | 2d19bd87-a6ea-4553-9caa-1ab6d7e322b9 | 1.0 | 2.000 | 2.00 | 1.49 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 | Plats Cuisinés Surgelés |
1 | a2c85c99-65a6-480b-9259-3339a3b5dc19 | e87223dc-c5ef-4360-8637-86dc69af5d39 | ed2445ad-8a97-4a4d-9f5c-eb253d66f916 | 1.0 | 0.990 | 0.99 | 0.20 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 | Pâtes, Riz, Purées et Féculents |
2 | c8d0bba1-5adf-4747-b6f7-cd16925392dc | e87223dc-c5ef-4360-8637-86dc69af5d39 | 77b6847c-65bd-4e40-925f-7c2e2db80936 | 1.0 | 1.990 | 1.99 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 | Épices |
3 | adb793ab-90ff-4bc0-a9eb-47f71952d97c | e87223dc-c5ef-4360-8637-86dc69af5d39 | 7b3f3630-f2e4-4d18-8174-36d35d171069 | 1.0 | 2.000 | 2.00 | 1.15 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 | Premiers Soins |
4 | 700d6299-9e33-4a10-8054-7cdcb9fa8529 | e87223dc-c5ef-4360-8637-86dc69af5d39 | 8a01f8db-9796-4e17-8e8d-b893770e5a40 | 1.0 | 1.000 | 1.00 | 0.61 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 | Jus Frais |
5 | fe0683c7-f8ae-4aa3-a1a8-6e762c262644 | e87223dc-c5ef-4360-8637-86dc69af5d39 | 525dfef6-62a2-488d-9f2e-c4631cc4a8a2 | 1.0 | 1.290 | 1.29 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 | Bonbons |
6 | 1b194981-b1bb-4306-a760-8131493f544f | e87223dc-c5ef-4360-8637-86dc69af5d39 | c18c8a1c-bf87-4b2c-bc2b-8ea468e92d82 | 2.0 | 1.145 | 2.29 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 | Chips et Tortillas |
7 | c5f9cd59-4f4a-4a39-9d7a-7b7f0c2bb07f | e87223dc-c5ef-4360-8637-86dc69af5d39 | 9ee0d42e-dcb7-493b-98ed-ff63d951431d | 1.0 | 1.290 | 1.29 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 | Bonbons |
8 | a6ac8ea1-fb10-4d11-9470-8b83d14e2f45 | e87223dc-c5ef-4360-8637-86dc69af5d39 | a99853f4-a90e-4b8b-97d0-d42aad75d1f0 | 1.0 | 1.290 | 1.29 | 0.50 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Discounted | 66.84 | Pains |
9 | c35b60b8-941f-4fa6-b272-41c7bfca228c | e87223dc-c5ef-4360-8637-86dc69af5d39 | 49686449-3e3d-445f-b98b-26e86027855a | 1.0 | 1.590 | 1.59 | 0.00 | 2022-01-01 18:01:00 | caef12d6-ea0d-42ee-894a-77933ebf7f71 | 2022-01-01 17:39:59 | +01:00 | Undiscounted | 66.84 | Jus Frais |
Example: Use an inner join¶
Inner joins are useful for filtering views because they drop unmatched rows.
# get a grocery items view
soda_items_view = grocery_items_view.copy()
# create a filter to only include products that have the text "Soda" in the product group
filter = grocery_product_view.ProductGroup.str.contains("Soda")
# apply the filter to the grocery product view
soda_product_view = grocery_product_view[filter]
# join the grocery product view with the grocery items view
soda_items_view = soda_items_view.join(soda_product_view, how = "inner", rsuffix="_Soda")
# preview the result
display(soda_items_view.preview())
GroceryInvoiceItemGuid | GroceryInvoiceGuid | GroceryProductGuid | Quantity | UnitPrice | TotalCost | Discount | record_available_at | GroceryCustomerGuid | Timestamp | tz_offset | DiscountCategory | Amount_invoice_total | ProductGroup | ProductGroup_Soda | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13876e34-1fa7-4099-9c1e-1acdb0f8cd6e | a7c15443-ec3a-412b-9b9b-2045d025f21c | 2de4cc2b-ec73-44ac-ba71-8e504176ca95 | 2.0 | 0.59 | 1.18 | 0.00 | 2022-01-01 17:01:00 | a91dcdc2-75a6-4172-94a0-ee53e715d634 | 2022-01-01 16:18:07 | +01:00 | Undiscounted | 15.54 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
1 | e510e8bc-7770-4ee1-848c-d933c0bcab91 | a7c15443-ec3a-412b-9b9b-2045d025f21c | e60b8a87-5e36-4ad5-a21e-c345c0bbae90 | 1.0 | 0.59 | 0.59 | 0.00 | 2022-01-01 17:01:00 | a91dcdc2-75a6-4172-94a0-ee53e715d634 | 2022-01-01 16:18:07 | +01:00 | Undiscounted | 15.54 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
2 | 04e9a957-cfb0-4b5b-8787-c85862d1c4ec | ad9f2a45-bce4-4b9d-b6ec-f0b9bda76123 | 38508242-2806-42d4-8276-a4b1b7676b87 | 1.0 | 1.59 | 1.59 | 0.00 | 2022-01-01 18:01:00 | 3bede04c-ff39-49b6-8f79-428c189dc1cc | 2022-01-01 17:21:20 | +01:00 | Undiscounted | 38.94 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
3 | 7566d89e-b619-493e-b2b3-6a37353336a4 | ad9f2a45-bce4-4b9d-b6ec-f0b9bda76123 | ecccf560-0df9-4a45-898c-f5a3c9e27f6b | 2.0 | 1.59 | 3.18 | 0.00 | 2022-01-01 18:01:00 | 3bede04c-ff39-49b6-8f79-428c189dc1cc | 2022-01-01 17:21:20 | +01:00 | Undiscounted | 38.94 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
4 | e1b6f89c-75c9-452e-9b41-c469af489a04 | ad9f2a45-bce4-4b9d-b6ec-f0b9bda76123 | 1cc8cfd0-a9f7-4e11-a6b0-7400681acbb3 | 2.0 | 1.59 | 3.18 | 0.00 | 2022-01-01 18:01:00 | 3bede04c-ff39-49b6-8f79-428c189dc1cc | 2022-01-01 17:21:20 | +01:00 | Undiscounted | 38.94 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
5 | 112a4dd3-9db3-4177-99ae-7da02e2e02b2 | eed978c4-fe58-4498-ba0d-012ba7bc960d | a9949a9a-a3fe-4ca9-87cc-1c24e2ec7601 | 1.0 | 3.34 | 3.34 | 1.25 | 2022-01-01 12:01:00 | 8ad0c1f9-e6cb-43d2-a459-3297f1b29f55 | 2022-01-01 11:03:41 | +01:00 | Discounted | 42.54 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
6 | 857f1cc6-c430-4f98-8002-abdadf65792b | 94e5c784-5a12-4210-bb82-4026f329e278 | 6bfd3270-6467-41ab-a43d-1134e611f01e | 1.0 | 1.29 | 1.29 | 0.00 | 2022-01-01 13:01:00 | d0e9e8fa-bd2d-415f-a3b3-00c3e2f8e358 | 2022-01-01 12:40:29 | +01:00 | Undiscounted | 1.29 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
7 | 8c53e8b2-9d64-43df-8abc-d13f25e4a297 | 6d8f8303-5cb1-41b5-ad73-3ca590c6f942 | 701d100a-f643-4f15-ae08-2b0b318ca34f | 2.0 | 4.69 | 9.38 | 0.00 | 2022-01-01 16:01:00 | 05b77e7e-0a50-4341-b18c-be77efe3b084 | 2022-01-01 15:23:14 | +01:00 | Undiscounted | 23.37 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
8 | 0e1b8ac1-5c0d-464e-b530-7e5e90d3f070 | fdf02faa-3f00-4acd-97ec-76b177ae0bc0 | ad65c776-5937-4bed-9a8f-cb5d0db3356b | 1.0 | 1.39 | 1.39 | 0.00 | 2022-01-01 12:01:00 | 7846cb26-efa9-45a5-a1e1-6be13992ef59 | 2022-01-01 11:02:26 | +01:00 | Undiscounted | 14.32 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
9 | b32ba058-2be5-482a-9439-30d097d970ec | 13cc738b-c852-4348-a5cf-ffe6b860c281 | 5b9d37e5-4c91-4f40-8cfd-c0a0bb22fcbd | 2.0 | 1.39 | 2.78 | 0.00 | 2022-01-01 18:01:00 | 7d1f7b4b-e121-47be-99ae-15264807b48a | 2022-01-01 17:49:26 | +01:00 | Undiscounted | 20.63 | Colas, Thés glacés et Sodas | Colas, Thés glacés et Sodas |
Concept: Supported joins¶
Not all views can be joined to each other. SCD views cannot be joined to other SCD views, while only dimension views can be joined to other dimension views. Change views cannot be joined to any views.
The diagram below shows which view types can be joined to an existing view. Green indicates a join is possible. Grey indicates a join is not allowed.