Feature
A Feature object represents the logical plan—also known as the computational blueprint—for deriving a feature. It defines how feature values are computed from underlying data, whether for model training or real-time prediction.
The feature values are computed by:
- Using a set of observations for training purposes.
- Enumerating the values of the feature's associated entity for prediction.
In some cases, feature values can be obtained directly from existing attributes within source tables. More commonly, however, features are engineered through a combination of operations such as joins, filters, transformations, and aggregations.
In FeatureByte, this computational blueprint is defined using View objects. A Feature object can be built in several ways:
- Lookup features — derived directly from entity attributes.
- Aggregate features — computed by summarizing data across time or entity groupings.
- Cross Aggregate features — capturing aggregated relationships across categorical dimensions.
Additionally, new Feature objects can be created as transformations of existing features or as new versions of previously defined ones.
Lookup features¶
Lookup features are features derived directly from an entity’s attributes in a view without requiring any aggregation. They provide a one-to-one mapping between an attribute’s value and the entity it describes. In other words, these features represent intrinsic properties of the entity as recorded in the source view.
For example, consider the Grocery dataset used in our tutorials. You can extract the Amount column from the GROCERYINVOICE view as a feature for the groceryinvoice entity using the as_feature() method:
invoice_view = catalog.get_view("GROCERYINVOICE")
invoice_view["Amount"].as_feature("Invoice_Amount")
Lookup Features from Time-Varying Views¶
For a Slowly Changing Dimension (SCD) view or a Snapshots view, where attribute values evolve over time, the feature is associated with:
- the SCD view's natural key, or
- the Snapshots view's series ID,
optionally combined with a time offset.
By default, the feature value corresponds to:
- The active attribute at the points-in-time for historical observations, and
- The current attribute for prediction use cases.
If an offset is specified, the feature retrieves the attribute’s value from a fixed point in the past relative to the observation time.
Example: Using an Offset
In the example below, we define a feature that represents whether a customer was using Windows 28 days before the observation time:
customer_view = catalog.get_view("GROCERYCUSTOMER")
# Extract the operating system indicator from the BrowserUserAgent column
customer_view["OperatingSystemIsWindows"] = customer_view.BrowserUserAgent.str.contains("Windows")
uses_windows_28d_ago = customer_view.OperatingSystemIsWindows.as_feature(
"UsesWindows_28d_ago", offset='28d'
)
Aggregate features¶
Aggregate features are a key part of feature engineering. They are created by applying aggregation functions to collections of data points grouped by an entity (or a tuple of entities).
Supported aggregation functions include latest, count, count distinct, sum, average, minimum, maximum, and standard deviation.
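Outside the SDK, the semantics of several of these functions can be illustrated on a toy pandas table (the column names here are invented for illustration):

```python
import pandas as pd

# Toy item table: one row per purchased item, keyed by a customer entity
df = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Per-customer aggregations analogous to count, sum, average, minimum, maximum
agg = df.groupby("customer")["amount"].agg(["count", "sum", "mean", "min", "max"])
print(agg.loc["a", "sum"], agg.loc["b", "count"])  # 60.0 2
```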
To aggregate across the values of a categorical column, use Cross Aggregate operations instead.
Defining an Aggregate Feature
Creating an aggregate feature involves two main steps:
- Group data by entity (or entities): use the groupby() method to group view rows by one or more columns representing entities in the view.
- Choose the aggregation type:
  - Non-Temporal Aggregates: aggregations performed without considering time.
  - Aggregates Over A Window: aggregations computed within a specific time window (common for event, item, snapshot, and time-series data).
  - Aggregates "As At" a Point-In-Time: aggregations based on data active at a specific moment in time (used for SCD and Snapshots views).
Non-Temporal Aggregate example¶
A Non-Temporal Aggregate is defined using the aggregate()
method on a GroupBy object:
# items_by_invoice is a GroupBy object keyed on the invoice entity,
# e.g. items_by_invoice = items_view.groupby("GroceryInvoiceGuid")
# Get the number of items in each invoice
invoice_item_count = items_by_invoice.aggregate(
    None,
    method=fb.AggFunc.COUNT,
    feature_name="InvoiceItemCount",
)
Important
To prevent time leakage, non-temporal aggregates are supported only for Item views, and only when the grouping key is the event key of the Item view. An example of such a feature is the count of items in an order.
Aggregate Over a Window example¶
An Aggregate Over a Window is created using the aggregate_over()
method on a GroupBy object derived from an EventView, ItemView, TimeSeriesView, SnapshotsView or a ChangeView:
# Group items by the column GroceryCustomerGuid that references the customer entity
items_by_customer = items_view.groupby("GroceryCustomerGuid")
# Define features measuring total discounts received by a customer
customer_discounts = items_by_customer.aggregate_over(
"Discount",
method=fb.AggFunc.SUM,
feature_names=["CustomerDiscounts_7d", "CustomerDiscounts_28d"],
windows=['7d', '28d'],
)
For a TimeSeriesView or a SnapshotsView, you must specify windows using the CalendarWindow class. You can also use the CalendarWindow class with other views when calendar-based windows are desired, for example to aggregate events with a calendar-based seasonality such as salary payments or rent.
# Group by the column ClientID that references the client entity
credit_card_balances_by_client = credit_card_balances_view.groupby("ClientID")
# Declare features that measure the drawings by client during the past 3 and 12 calendar months.
client_amt_drawings = credit_card_balances_by_client.aggregate_over(
"AMT_DRAWINGS",
method=fb.AggFunc.SUM,
feature_names=["Client_AMT_DRAWINGS_3cMo", "Client_AMT_DRAWINGS_12cMo"],
windows=[
fb.CalendarWindow(unit="MONTH", size=3),
fb.CalendarWindow(unit="MONTH", size=12)
],
)
Note
The output is a FeatureGroup object because the operation can define multiple window-based features.
You can extract a single Feature from a FeatureGroup by referencing its name:
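A minimal sketch, assuming the customer_discounts FeatureGroup returned by the earlier aggregate_over() example:

```python
# Pull a single Feature out of the FeatureGroup by its name
customer_discounts_7d = customer_discounts["CustomerDiscounts_7d"]
```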
Note
By default, Aggregate Over a Window features use the default feature job setting defined at their primary table level.
You can override this by specifying a custom setting when defining the feature:
# Set a different feature job setting
customer_discount = items_by_customer.aggregate_over(
"Discount",
method=fb.AggFunc.SUM,
feature_names=["CustomerDiscounts_7d", "CustomerDiscounts_28d"],
fill_value=0,
windows=['7d', '28d'],
feature_job_setting=fb.FeatureJobSetting(
blind_spot="135s",
period="60m",
offset="90s",
)
)
You can also specify an offset to shift the window backward relative to the observation time while keeping the window size fixed.
Example — aggregating over 28 days ending 7 days prior to the observation point:
customer_discounts = items_by_customer.aggregate_over(
"Discount",
method=fb.AggFunc.SUM,
feature_names=["CustomerDiscounts_28d_offset_7d"],
windows=["28d"],
offset="7d",
)
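The window arithmetic can be checked outside the SDK: with windows=["28d"] and offset="7d", an observation at point-in-time t aggregates events falling in a 28-day span that ends 7 days before t. The boundary inclusivity shown below is illustrative, not an SDK guarantee:

```python
from datetime import datetime, timedelta

point_in_time = datetime(2023, 6, 1)
window, offset = timedelta(days=28), timedelta(days=7)

# The window ends `offset` before the observation point and spans `window`
window_end = point_in_time - offset      # 2023-05-25
window_start = window_end - window       # 2023-04-27

events = [datetime(2023, 4, 20), datetime(2023, 5, 10), datetime(2023, 5, 28)]
in_window = [ts for ts in events if window_start <= ts < window_end]
print(in_window)  # only the 2023-05-10 event qualifies
```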
Aggregate As At a Point-in-Time example¶
An Aggregate As At a Point-in-Time is defined using the aggregate_asat() method on a GroupBy object derived from an SCDView or a SnapshotsView:
# get view
customer_view = catalog.get_view("GROCERYCUSTOMER")
# Group rows by the State column referencing the state entity
groupby_state = customer_view.groupby("State")
# Declare feature that counts the number of customers in the State
state_customers_count = groupby_state.aggregate_asat(
None,
method=fb.AggFunc.COUNT,
feature_name="StateCustomersCount"
)
Note
- The key used for aggregation should not be the natural key of the SCDView or the series ID of the SnapshotsView, since only one active record or snapshot exists per key at any given time.
- You can specify an offset to aggregate values as of a specific time before the point-in-time.
# Declare same feature as at 28 days before the point-in-time
state_customers_count_28d = groupby_state.aggregate_asat(
None,
method=fb.AggFunc.COUNT,
feature_name="StateCustomersCount",
offset="28d"
)
Cross Aggregate features¶
Cross Aggregate features extend aggregate features by aggregating data across different categories. These features capture patterns and relationships across categorical dimensions.
Defining a Cross Aggregate Feature
- Group by entity and category: use groupby() with both entity keys and a categorical column:

# Join product view to items view
product_view = catalog.get_view("GROCERYPRODUCT")
items_view = items_view.join(product_view)
# Group items by the column GroceryCustomerGuid that references the customer entity
# and use ProductGroup as the column to perform operations across
items_by_customer_across_product_group = items_view.groupby(
    by_keys="GroceryCustomerGuid", category="ProductGroup"
)

- Select the aggregation type: similar to standard aggregate features, you can define Non-Temporal, Window-based, or As-At aggregates.
Note
The materialized feature value of a Cross Aggregate feature is a dictionary, where each key represents a category from the grouping column and each value represents the aggregated metric for that category.
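For instance, a materialized value of the items_by_customer_across_product_group grouping above might look like the plain dictionary below (the category names and amounts are invented for illustration):

```python
# Hypothetical materialized value: aggregated metric per product group for one customer
feature_value = {"Dairy": 42.50, "Produce": 18.20, "Bakery": 7.99}

# Each key is a category from the grouping column;
# each value is the aggregated metric for that category
top_group = max(feature_value, key=feature_value.get)
print(top_group)  # Dairy
```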
Transforming Features¶
Feature objects can be derived from multiple Feature objects through generic, numeric, string, datetime, and dictionary transforms.
When a feature is derived from features with different primary entities, the entity relationships determine the primary entity, and the lowest level entity is selected as the primary entity. If the entities have no relationship, the primary entity becomes a tuple of those entities.
Generic Transforms¶
Generic transformations applicable to ViewColumn objects can also be applied to Feature objects of any data type. The list of generic transforms can be found in the provided glossary.
Numeric Transforms¶
Numeric Features can be manipulated using built-in arithmetic operators (+, -, *, /).
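As a sketch, combining the Invoice_Amount lookup feature and the InvoiceItemCount aggregate feature defined earlier (this assumes both features exist in your session; the derived feature name is illustrative):

```python
# Derive the average amount per item on an invoice from two existing features
invoice_amount = invoice_view["Amount"].as_feature("Invoice_Amount")
avg_item_amount = invoice_amount / invoice_item_count
avg_item_amount.name = "InvoiceAvgItemAmount"
```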
In addition to these arithmetic operations, other numeric transformations that are applicable to ViewColumn objects can also be applied to Feature objects.
String Transforms¶
String Feature objects can be concatenated directly.
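A minimal sketch, assuming two hypothetical string features first_name and last_name for the customer entity:

```python
# first_name and last_name are hypothetical string Feature objects
full_name = first_name + " " + last_name
full_name.name = "CustomerFullName"
```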
Other String transforms that are applicable to ViewColumn objects can also be applied to Feature objects.
Datetime Transforms¶
The Datetime Feature objects can be transformed in several ways, such as calculating differences, adding time intervals, or extracting date components. The glossary provides a list of supported dateparts transforms that are applicable to ViewColumn objects and can also be used with Feature objects.
Point-in-time Transforms¶
Features can be derived from the points-in-time provided during feature materialization.
This allows the creation of "time since" features that compare the latest event timestamp with the point-in-time provided in the feature request.
# Create feature that retrieves the timestamp of the latest invoice of a Customer
invoice_view = catalog.get_view("GROCERYINVOICE")
latest_invoice = invoice_view.groupby("GroceryCustomerGuid").aggregate_over(
value_column="Timestamp",
method="latest",
windows=[None],
feature_names=["Customer Latest Visit"],
)
# Create feature that computes the time since the latest invoice
feature = (
fb.RequestColumn.point_in_time() - latest_invoice["Customer Latest Visit"]
).dt.hour
feature.name = "Customer number of hours since last visit"
Note
For historical feature requests, the point-in-time values are provided by the "POINT_IN_TIME" column of the observation set.
For online and batch serving, the point-in-time value is the timestamp when the feature request is made.
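The underlying "time since" arithmetic can be checked outside the SDK with plain pandas timestamps (whether the SDK returns fractional or whole hours is left aside here):

```python
import pandas as pd

point_in_time = pd.Timestamp("2022-12-17 12:00:00")
latest_visit = pd.Timestamp("2022-12-16 09:00:00")

# Number of hours between the point-in-time and the latest event timestamp
hours_since = (point_in_time - latest_visit) / pd.Timedelta(hours=1)
print(hours_since)  # 27.0
```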
Dictionary Transforms¶
Additional transformations are supported for features resulting from Cross Aggregate features. These include:
- get_value: retrieves the value for a provided key.
- most_frequent: retrieves the most frequent key.
- unique_count: computes the number of distinct keys.
- entropy: computes the entropy over the keys.
- get_rank: computes the rank of a particular key.
- get_relative_frequency: computes the relative frequency of a particular key.
- cosine_similarity: computes the cosine similarity with another Cross Aggregate feature.
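The cosine similarity between two such dictionary-valued features can be illustrated in plain Python. This mirrors the semantics, not the SDK implementation; keys missing from one dictionary are treated as zero:

```python
import math

def dict_cosine_similarity(a: dict, b: dict) -> float:
    # Treat each dictionary as a sparse vector indexed by its keys
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

customer = {"Dairy": 3.0, "Bakery": 4.0}
state = {"Dairy": 3.0, "Bakery": 4.0, "Produce": 0.0}
print(dict_cosine_similarity(customer, state))  # 1.0
```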
In the example below, a feature is created that measures the similarity between a customer's purchases and the purchases of customers living in the same state, by applying the cosine_similarity() method to two Cross Aggregate features:
# Join customer view to items view
items_view = items_view.join(customer_view)
# Cross Aggregate feature of customer purchases across product group
# over the past 4 weeks
customer_inventory_28d = items_view.groupby(
    by_keys="GroceryCustomerGuid", category="ProductGroup"
).aggregate_over(
    "TotalCost",
    method=fb.AggFunc.SUM,
    feature_names=["CustomerInventory_28d"],
    windows=["28d"]
)
# Cross Aggregate feature of purchases of customers living in the same state
# across product group over the past 4 weeks
state_inventory_28d = items_view.groupby(
    by_keys="State", category="ProductGroup"
).aggregate_over(
    "TotalCost",
    method=fb.AggFunc.SUM,
    feature_names=["StateInventory_28d"],
    windows=["28d"]
)
# Create a feature that measures the similarity of customer purchases
# and purchases of customers living in the same state
customer_state_similarity_28d = \
customer_inventory_28d["CustomerInventory_28d"].cd.cosine_similarity(
state_inventory_28d["StateInventory_28d"]
)
customer_state_similarity_28d.name = \
"Customer Similarity with purchases in the same state over 28 days"
Conditional Transforms¶
You can apply if-then-else logic by using conditional statements, which can include other Feature objects related to the same entity.
# customer_state and customer_spent_over_7d are assumed to be existing
# Feature objects for the customer entity
cond = customer_state == "Ile-de-France"
customer_spent_over_7d[cond] = 100 + customer_spent_over_7d[cond]
Previewing a Feature¶
First, verify the primary entity of a Feature, which indicates the entities that can be used to serve the feature. A feature can be served by its primary entity or any descendant serving entities.
You can obtain the primary entity of a feature by using the primary_entity property as shown below:
# This should show the name of the primary entity together with its serving names.
# The only accepted serving_name in this example is 'GROCERYCUSTOMERGUID'.
display(customer_state_similarity_28d.primary_entity)
Note
You can preview a Feature object using a small observation set of up to 50 rows. Unlike the compute_historical_features()
method, this method does not store partial aggregations (tiles) to speed up future computation. Instead, it computes the feature values on the fly and should be used only for small observation sets for debugging or exploring unsaved features.
The small observation set must combine historical points-in-time with key values of the feature's primary entity. Associated serving entities can also be used.
The column containing the entity values must use an accepted serving name.
The historical points-in-time must be UTC timestamps contained in a column named 'POINT_IN_TIME'.
The preview()
method returns a pandas DataFrame.
import pandas as pd
observation_set = pd.DataFrame({
'GROCERYCUSTOMERGUID': ["30e3fbe4-3cbe-4d51-b6ca-1f990ef9773d"],
'POINT_IN_TIME': [pd.Timestamp("2022-12-17 12:12:40")]
})
display(customer_state_similarity_28d.preview(observation_set))
Adding a Feature Object to the Catalog¶
Before saving a feature derived from transformations and adding it to the catalog, assign a name.
Saving a Feature Object makes the object persistent and adds it to the catalog.
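A minimal sketch, assuming the customer_state_similarity_28d feature created earlier (it was already named when it was defined):

```python
# Persist the feature and register it in the active catalog
customer_state_similarity_28d.save()
```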
Note
Once saved, a Feature object cannot be modified. New Feature objects with the same namespace can be created to support versioning. Refer to the versioning section for more details.
Listing Unsaved Features¶
Features that have not been saved will not be persisted once you close your Notebook. Use the list_unsaved_features()
method to check what features are still unsaved. Save the features that you wish to keep.
Setting Feature Readiness¶
To help differentiate Feature objects that are in the prototype stage from objects that are ready for production, a Feature object can have one of four readiness levels:
- PRODUCTION_READY: assigned to Feature objects ready for deployment in production environments.
- PUBLIC_DRAFT: for Feature objects shared for feedback purposes.
- DRAFT: for Feature objects in the prototype stage.
- DEPRECATED: for Feature objects not advised for training or online serving.
By default, new Feature objects are assigned the DRAFT status. You can delete only Draft Feature objects and cannot revert other statuses to DRAFT.
Important
Only one Feature object belonging to a group of Feature objects with the same namespace can be designated as PRODUCTION_READY at a time.
When a Feature object is promoted to PRODUCTION_READY, guardrails are applied automatically to compare the Feature object's cleaning operations and feature job setting with the latest defaults. If you are confident in the promoted Feature object's settings, you can bypass these guardrails by setting ignore_guardrails to True.
You can change the readiness state of a Feature object using the update_readiness
method:
display(customer_state_similarity_28d.readiness)
customer_state_similarity_28d.update_readiness("PUBLIC_DRAFT")
Managing Feature Versions¶
A new feature version is a new Feature object with the same namespace as the original Feature object. The new Feature object has its own Object ID and version name.
New feature versions allow for reusing a Feature with different feature job settings or cleaning operations. If the source table's availability or freshness changes, new feature versions can be created with updated feature job settings. If data quality in the source table changes, new feature versions can be generated with cleaning operations to address new quality issues. Older feature versions can continue to be served without disrupting ML tasks that rely on the feature.
A new version can be created by updating the current feature's feature job setting (if provided) and table cleaning operations (if provided) using the create_new_version()
method. The new version's readiness is set to "DRAFT" by default.
new_version = customer_state_similarity_28d.create_new_version(
table_feature_job_settings=[
fb.TableFeatureJobSetting(
table_name="GROCERYINVOICE",
feature_job_setting=fb.FeatureJobSetting(
blind_spot="60s",
period="3600s",
offset="90s",
)
)
]
)
print(new_version.readiness)
The Object ID and version name of the new Feature object can be accessed using the id
and version
properties. The name remains the same as the original Feature object.
print("new_version.name", new_version.name)
print("new_version.id", new_version.id)
print("new_version.version", new_version.version)
You can list Feature objects (versions) with the same namespace from any Feature object using the list_versions
method.
Setting a Default Feature Version¶
The default version simplifies feature reuse by providing the most appropriate version when none is explicitly specified. By default, the feature's default version mode is automatic, selecting the highest readiness level version. The most recent one becomes the default if multiple versions have the same readiness level.
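The automatic selection rule can be sketched outside the SDK (the readiness ordering and the version records below are illustrative):

```python
# Readiness levels ordered from lowest to highest
READINESS_RANK = {"DEPRECATED": 0, "DRAFT": 1, "PUBLIC_DRAFT": 2, "PRODUCTION_READY": 3}

# Hypothetical versions of one feature namespace:
# (version name, readiness, creation order)
versions = [
    ("V220101", "PUBLIC_DRAFT", 1),
    ("V220301", "PRODUCTION_READY", 2),
    ("V220601", "DRAFT", 3),
]

# Highest readiness wins; ties go to the most recently created version
default = max(versions, key=lambda v: (READINESS_RANK[v[1]], v[2]))
print(default[0])  # V220301
```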
You can change the feature's default version mode using the update_default_version_mode()
method.
When a feature's default version mode is set to manual, you can designate a specific Feature object among the highest readiness level versions as the default version (as opposed to the most recent one in automatic mode) for Feature objects with the same namespace using the as_default_version()
method.
new_version.update_readiness("PUBLIC_DRAFT") # new_version becomes the default version
new_version.update_default_version_mode("MANUAL")
customer_state_similarity_28d.as_default_version()
To reset the default version mode of the feature and make the original feature version the default, use the following code:
customer_state_similarity_28d.update_default_version_mode("AUTO")
customer_state_similarity_28d.is_default
Accessing a Feature from the Catalog¶
You can refer to the catalog to view a list of existing features, including their detailed information, using the list_features()
method:
Note
The list_features()
method returns the default version of each feature.
To obtain the default version of a feature, utilize its namespace when using the get_feature()
method. If you want to retrieve a specific version, provide the version name as well.
default_version = catalog.get_feature("CustomerStateSimilarity_28d")
new_version_added_to_catalog = catalog.get_feature(
"CustomerStateSimilarity_28d", version=new_version.version
)
You can also retrieve a Feature object using its Object ID using the get_feature_by_id()
method.
Accessing the Feature Definition file of a Feature object¶
The feature definition file is a Feature object's single source of truth. The file is generated automatically after a feature is declared in the SDK.
This file uses the same SDK syntax as the feature declaration and provides an explicit outline of the intended operations of the feature declaration, including those inherited but not explicitly declared by you. For example, these operations may include feature job settings and cleaning operations inherited from table metadata.
The feature definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for feature materialization.
The file can be easily displayed in the SDK using the definition
property.
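A minimal sketch for the feature created earlier:

```python
# Display the automatically generated feature definition file
print(customer_state_similarity_28d.definition)
```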