A Feature object contains the logical plan (also referred to as a blueprint) to compute a feature.
The feature values are computed by:
- Using a set of observations for training purposes.
- Enumerating the values of the feature's associated entity for prediction.
Features can sometimes be extracted directly from existing attributes in the source tables. However, in many cases, features are created through a sequence of operations like row transformations, joins, filters, and aggregates.
In FeatureByte, the computational blueprint for Feature objects can be defined from View objects in three ways: as Lookup features, as Aggregate features, or as Cross Aggregate features.
Lookup features are simple features extracted directly from entity attributes in a view without the need for aggregation. For example, features extracted from a column in a specific view reflect characteristics of the entity linked with that view's primary key.
Consider the Grocery dataset used in our tutorials. Here, you can designate the "Amount" column from the "GROCERYINVOICE" table view as a feature for the "groceryinvoice" entity using the as_feature() method.
For a Slowly Changing Dimension (SCD) view where attributes change over time, the feature is linked to the entity identified by the table's natural key. By default, the feature value is acquired by selecting:
- The active attribute at the points-in-time of the observation set used by the historical feature request, and
- The current attribute for prediction.
In the following example, the "UsesWindows" feature indicates whether a customer is using Windows. It is a Lookup feature for the "grocerycustomer" entity that is identified by the natural key "GroceryCustomerGuid" of the SCD table "GROCERYCUSTOMER".
    customer_view = catalog.get_view("GROCERYCUSTOMER")

    # Extract operating system from BrowserUserAgent column
    customer_view["OperatingSystemIsWindows"] = \
        customer_view.BrowserUserAgent.str.contains("Windows")

    # Create a feature from the OperatingSystemIsWindows column
    uses_windows = customer_view.OperatingSystemIsWindows.as_feature("UsesWindows")
For an SCD view, you can specify an offset if you want the attribute value from a specific point in the past.
In the following example, we use an offset of 28 days to create a feature that indicates the attribute value four weeks prior to the observation point.
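As a rough illustration of the lookup rule for SCD views, independent of the FeatureByte SDK, the sketch below picks the row that is active at a point-in-time, optionally shifted back by an offset. The row layout, the sample data, and the function name `lookup_active_value` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical SCD rows: (natural_key, effective_from, attribute_value)
SCD_ROWS = [
    ("cust-1", datetime(2023, 1, 1), False),
    ("cust-1", datetime(2023, 3, 1), True),
]


def lookup_active_value(rows, key, point_in_time, offset=timedelta(0)):
    """Return the attribute value active for `key` at `point_in_time - offset`."""
    as_at = point_in_time - offset
    candidates = [row for row in rows if row[0] == key and row[1] <= as_at]
    if not candidates:
        return None  # no row was active yet at that time
    # The active row is the one with the latest effective timestamp
    return max(candidates, key=lambda row: row[1])[2]


observation_time = datetime(2023, 3, 15)
current_value = lookup_active_value(SCD_ROWS, "cust-1", observation_time)
value_28d_ago = lookup_active_value(
    SCD_ROWS, "cust-1", observation_time, offset=timedelta(days=28)
)
```

With this sample data, the value at the observation time comes from the March row, while the 28-day offset reaches back to the January row.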
Aggregate features are an important type of feature engineering that involves applying various aggregation functions to a collection of data points grouped by an entity (or a tuple of entities). Supported aggregation functions include the latest, count, sum, average, minimum, maximum, and standard deviation.
Below is the two-step process to define an aggregate feature:
Select the aggregation type according to the view type and the level of analysis. There are three types of aggregations:
- Simple Aggregates: Features created through aggregation operations without considering temporal aspects.
- Aggregates Over A Window: Features generated by aggregating data within a specific time frame, commonly used for analyzing event data, item data, and change view data.
- Aggregates “As At” a Point-In-Time: Features generated by aggregating data active at a specific moment in time, available only for SCD views.
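To make the difference between a simple aggregate and an aggregate over a window concrete, here is a minimal plain-Python sketch. It does not use the FeatureByte SDK. The row layout, the sample data, and the half-open window boundary are assumptions, and blind spots and feature job settings are ignored.

```python
from datetime import datetime, timedelta

# Hypothetical item rows: (invoice_id, customer_id, event_timestamp, discount)
ITEMS = [
    ("inv-1", "cust-1", datetime(2023, 5, 1), 1.0),
    ("inv-1", "cust-1", datetime(2023, 5, 1), 0.5),
    ("inv-2", "cust-1", datetime(2023, 5, 20), 2.0),
    ("inv-3", "cust-2", datetime(2023, 5, 21), 4.0),
]


def simple_count(items, invoice_id):
    """Simple aggregate: number of items in one invoice, with no time window."""
    return sum(1 for row in items if row[0] == invoice_id)


def sum_over_window(items, customer_id, point_in_time, window):
    """Window aggregate: total discount in [point_in_time - window, point_in_time)."""
    start = point_in_time - window
    return sum(
        row[3]
        for row in items
        if row[1] == customer_id and start <= row[2] < point_in_time
    )


point_in_time = datetime(2023, 5, 22)
items_in_inv_1 = simple_count(ITEMS, "inv-1")
discount_7d = sum_over_window(ITEMS, "cust-1", point_in_time, timedelta(days=7))
discount_28d = sum_over_window(ITEMS, "cust-1", point_in_time, timedelta(days=28))
```

The simple aggregate depends only on the grouping key, whereas the window aggregate changes with the point-in-time and the window length.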
Simple Aggregate example
To avoid time leakage, simple aggregates are only supported for Item views when the grouping key is the event key of the Item view. An example of such a feature is the count of items in an order.
Aggregate Over a Window example
    # Group items by the column GroceryCustomerGuid that references the customer entity
    items_by_customer = items_view.groupby("GroceryCustomerGuid")

    # Declare features that measure the discount received by customer
    customer_discounts = items_by_customer.aggregate_over(
        "Discount",
        method=fb.AggFunc.SUM,
        feature_names=["CustomerDiscounts_7d", "CustomerDiscounts_28d"],
        fill_value=0,
        windows=['7d', '28d']
    )
The output is a FeatureGroup object as the operation can support multiple window settings.
To extract a Feature object from a FeatureGroup object, subset it with the Feature object's name, for example customer_discounts["CustomerDiscounts_28d"].
By default, Aggregate Over a Window features use the default feature job setting defined at their primary table level.
You can set a different feature job setting when defining the feature.
    # Set a different feature job setting
    customer_discount = items_by_customer.aggregate_over(
        "Discount",
        method=fb.AggFunc.SUM,
        feature_names=["CustomerDiscounts_7d", "CustomerDiscounts_28d"],
        fill_value=0,
        windows=['7d', '28d'],
        feature_job_setting=fb.FeatureJobSetting(
            blind_spot="135s",
            frequency="60m",
            time_modulo_frequency="90s",
        )
    )
Aggregate As At a Point-in-Time example
    # Get view
    customer_view = catalog.get_view("GROCERYCUSTOMER")

    # Group rows by the State column referencing the state entity
    groupby_state = customer_view.groupby("State")

    # Declare feature that counts the number of customers in the State
    state_customers_count = groupby_state.aggregate_asat(
        None,
        method=fb.AggFunc.COUNT,
        feature_name="StateCustomersCount"
    )
- The key used to create aggregate features based on a specific point-in-time should not be the natural key of the SCDView. This is because, for any given natural key value, there can only be one active row at a particular point-in-time.
- You can specify an offset if you want the aggregation to be done on rows active at a specific time before the point-in-time specified by the feature request.
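The "as at" semantics can be sketched in plain Python as counting the SCD rows that are active at a given moment, grouped by a key other than the natural key. The row layout with explicit start and end timestamps and the function names are hypothetical, and this is not how the SDK computes the feature internally.

```python
from datetime import datetime

# Hypothetical SCD rows: (customer_id, state, effective_from, effective_to)
# effective_to is None for the row that is still current.
CUSTOMER_ROWS = [
    ("cust-1", "CA", datetime(2022, 1, 1), datetime(2023, 2, 1)),
    ("cust-1", "NY", datetime(2023, 2, 1), None),
    ("cust-2", "CA", datetime(2022, 6, 1), None),
    ("cust-3", "CA", datetime(2023, 3, 1), None),
]


def is_active(row, as_at):
    """A row is active from effective_from (inclusive) to effective_to (exclusive)."""
    _, _, start, end = row
    return start <= as_at and (end is None or as_at < end)


def count_active_by_state(rows, state, as_at):
    """Count the customers whose active row at `as_at` is in `state`."""
    return sum(1 for row in rows if row[1] == state and is_active(row, as_at))


ny_in_january = count_active_by_state(CUSTOMER_ROWS, "NY", datetime(2023, 1, 15))
ny_in_april = count_active_by_state(CUSTOMER_ROWS, "NY", datetime(2023, 4, 1))
ca_in_april = count_active_by_state(CUSTOMER_ROWS, "CA", datetime(2023, 4, 1))
```

Grouping by the natural key instead would give a count of at most one, because each customer has a single active row at any moment.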
Cross Aggregate features
Cross Aggregate features are a type of Aggregate Feature that involves aggregating data across different categories. This enables the creation of features that capture patterns in an entity across these categories.
Below is the two-step process to define a Cross Aggregate feature:
Determine the level of analysis by grouping view rows based on columns representing one or more entities in the view, and specify a categorical column for performing operations across categories, using the groupby() method.
    # Join product view to items view
    product_view = catalog.get_view("GROCERYPRODUCT")
    items_view = items_view.join(product_view)

    # Group items by the column GroceryCustomerGuid that references the customer entity
    # And use ProductGroup as the column to perform operations across
    items_by_customer_across_product_group = items_view.groupby(
        by_keys="GroceryCustomerGuid", category="ProductGroup"
    )
Select the aggregation type according to the view type. Similar to Aggregate features, three types are supported: Simple Aggregate, Aggregate Over a Window, and Aggregate As At a Point-in-Time.
After materialization, the value of a Cross Aggregate feature is a dictionary whose keys are the categories of the categorical column and whose values are the aggregated values for each category.
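The shape of such a dictionary value can be illustrated with a small plain-Python sketch that sums a value per entity and per category. The row layout, the sample data, and the function name `cross_aggregate_sum` are hypothetical, and no time window is applied.

```python
from collections import defaultdict

# Hypothetical item rows: (customer_id, product_group, total_cost)
ROWS = [
    ("cust-1", "Fruit", 3.0),
    ("cust-1", "Dairy", 2.5),
    ("cust-1", "Fruit", 1.5),
    ("cust-2", "Dairy", 4.0),
]


def cross_aggregate_sum(rows):
    """Sum values per entity, keeping one dictionary entry per category."""
    result = defaultdict(lambda: defaultdict(float))
    for entity, category, value in rows:
        result[entity][category] += value
    # Convert nested defaultdicts into plain dictionaries
    return {entity: dict(per_category) for entity, per_category in result.items()}


inventory = cross_aggregate_sum(ROWS)
```

Each entity ends up with its own dictionary, so two entities can have different sets of keys.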
Feature objects can be derived from multiple Feature objects through generic, numeric, string, datetime, and dictionary transforms.
When a feature is derived from features with different primary entities, the entity relationships determine the primary entity: the lowest-level entity is selected. If the entities have no relationship, the primary entity becomes a tuple of those entities.
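The rule above can be sketched in plain Python under a simplifying assumption: each entity has at most one parent, stored in a child-to-parent mapping. The mapping, the entity names in it, and the function names are hypothetical, and the SDK's actual resolution logic may handle richer relationship graphs.

```python
# Hypothetical child -> parent relationships between entities
PARENT_OF = {
    "groceryinvoice": "grocerycustomer",
    "grocerycustomer": "state",
}


def ancestors(entity, parent_of):
    """Collect every entity reachable by following parent links upward."""
    found = set()
    current = parent_of.get(entity)
    while current is not None and current not in found:
        found.add(current)
        current = parent_of.get(current)
    return found


def primary_entity(entities, parent_of):
    """Keep only the lowest-level entities; unrelated entities form a tuple."""
    unique = set(entities)
    covered = set()
    for entity in unique:
        covered |= ancestors(entity, parent_of)
    return tuple(sorted(unique - covered))


related = primary_entity(["grocerycustomer", "state"], PARENT_OF)
unrelated = primary_entity(["grocerycustomer", "groceryproduct"], PARENT_OF)
```

In this sketch, an entity is dropped whenever it is an ancestor of another entity in the set, which leaves the lowest level of each related chain.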
Numeric Features can be manipulated using built-in arithmetic operators (+, -, *, /). For example:
String Feature objects can be concatenated directly, as shown below:
Datetime Feature objects can be transformed in several ways, such as calculating differences, adding time intervals, or extracting date components. The glossary provides a list of supported date part transforms that are applicable to ViewColumn objects and can also be used with Feature objects.
Using the point-in-time of the feature request in a transform allows the creation of "time since" features that compare the latest event timestamp with the point-in-time provided in the feature request.
    # Create feature that retrieves the timestamp of the latest invoice of a Customer
    invoice_view = catalog.get_view("GROCERYINVOICE")
    latest_invoice = invoice_view.groupby("GroceryCustomerGuid").aggregate_over(
        value_column="Timestamp",
        method="latest",
        windows=[None],
        feature_names=["Customer Latest Visit"],
    )

    # Create feature that computes the time since the latest invoice
    feature = (
        fb.RequestColumn.point_in_time()
        - latest_invoice["Customer Latest Visit"]
    ).dt.hour
    feature.name = "Customer number of hours since last visit"
For online and batch serving, the point-in-time value is the timestamp when the feature request is made.
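The arithmetic behind a "time since" feature can be shown with the standard library alone. This is a sketch of the idea, not of the SDK's implementation. The function name `hours_since` and the sample timestamps are hypothetical, and the SDK's own unit and rounding conventions may differ.

```python
from datetime import datetime, timezone


def hours_since(latest_event, point_in_time):
    """Elapsed time between the latest event and the point-in-time, in hours."""
    return (point_in_time - latest_event).total_seconds() / 3600


latest_invoice_ts = datetime(2023, 5, 20, 8, 0, tzinfo=timezone.utc)
request_point_in_time = datetime(2023, 5, 21, 14, 30, tzinfo=timezone.utc)
hours = hours_since(latest_invoice_ts, request_point_in_time)
```

Because the point-in-time comes from the request, the same latest event yields a different value for each request time.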
Additional transformations are supported for features resulting from Cross Aggregate features. These include:
- get_value: Retrieves the value based on the key provided.
- most_frequent: Retrieves the most frequent key.
- unique_count: Computes the number of distinct keys.
- entropy: Computes the entropy over the keys.
- get_rank: Computes the rank of a particular key.
- get_relative_frequency: Computes the relative frequency of a particular key.
- cosine_similarity: Computes the cosine similarity with another cross aggregate feature.
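To show what several of these transforms compute, the sketch below applies textbook definitions to plain Python dictionaries. It is an illustration under assumptions, not the SDK's implementation: the function names mirror the list above for readability only, and details such as the logarithm base, tie-breaking, and handling of missing keys are choices made for this example.

```python
import math


def most_frequent(counts):
    """Key with the largest value (ties resolved by insertion order here)."""
    return max(counts, key=counts.get) if counts else None


def unique_count(counts):
    """Number of distinct keys."""
    return len(counts)


def relative_frequency(counts, key):
    """Share of the total carried by `key`; a missing key counts as zero."""
    total = sum(counts.values())
    return counts.get(key, 0) / total if total else None


def entropy(counts):
    """Shannon entropy (natural log) of the value distribution over keys."""
    total = sum(counts.values())
    probabilities = [v / total for v in counts.values() if v > 0]
    return -sum(p * math.log(p) for p in probabilities)


def cosine_similarity(a, b):
    """Cosine of the angle between two dictionaries viewed as sparse vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


customer = {"Fruit": 3.0, "Dairy": 4.0}
state = {"Fruit": 4.0, "Dairy": 3.0, "Bakery": 0.0}
similarity = cosine_similarity(customer, state)
```

Keys present in only one dictionary contribute nothing to the dot product, so the two inputs do not need the same set of categories.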
In this example, two Cross Aggregate features are combined with the cosine_similarity transform to measure the similarity between a customer's purchases and the purchases of customers living in the same state.
    # Join customer view to items view
    items_view = items_view.join(customer_view)

    # Cross Aggregate feature of purchases of customers living in the same state
    # across product group over the past 4 weeks
    state_inventory_28d = items_view.groupby(
        by_keys="State", category="ProductGroup"
    ).aggregate_over(
        "TotalCost",
        method=fb.AggFunc.SUM,
        feature_names=["StateInventory_28d"],
        windows=['28d']
    )

    # Create a feature that measures the similarity of customer purchases
    # and purchases of customers living in the same state
    # (customer_inventory_28d is assumed to have been declared beforehand)
    customer_state_similarity_28d = \
        customer_inventory_28d["CustomerInventory_28d"].cd.cosine_similarity(
            state_inventory_28d["StateInventory_28d"]
        )
    customer_state_similarity_28d.name = \
        "Customer Similarity with purchases in the same state over 28 days"
You can apply if-then-else logic by using conditional statements, which include other Feature objects related to the same entity.
Previewing a Feature
You can obtain the primary entity of a feature by using the primary_entity method.
You can preview a Feature object using a small observation set of up to 50 rows. Unlike the compute_historical_features() method, the preview() method does not store partial aggregations (tiles) to speed up future computation. Instead, it computes the feature values on the fly and should be used only for small observation sets for debugging or exploring unsaved features.
The small observation set must combine historical points-in-time and key values of the primary entity from the feature. Associated serving entities can also be utilized.
An accepted serving name should be used for the column containing the entity values.
The historical points-in-time must be timestamps in UTC and must be contained in a column named "POINT_IN_TIME".
The preview() method returns a pandas DataFrame.
Adding a Feature Object to the Catalog
Before saving a feature derived from transformations and adding it to the catalog, assign a name.
Saving a Feature Object makes the object persistent and adds it to the catalog.
Once saved, a Feature object cannot be modified. New Feature objects with the same namespace can be created to support versioning. Refer to the versioning section for more details.
Listing Unsaved Features
Features that have not been saved will not be persisted once you close your notebook. Use the list_unsaved_features() method to check which features are still unsaved, and save the ones you wish to keep.
Setting Feature Readiness
To help differentiate Feature objects that are in the prototype stage and objects that are ready for production, a Feature object can have one of four readiness levels:
- PRODUCTION_READY: Assigned to Feature objects ready for deployment in production environments.
- PUBLIC_DRAFT: For Feature objects shared for feedback purposes.
- DRAFT: For Feature objects in the prototype stage.
- DEPRECATED: For Feature objects not advised for training or online serving.
By default, new Feature objects are assigned the DRAFT status. Only DRAFT Feature objects can be deleted, and a Feature object with any other readiness level cannot be reverted to DRAFT.
Only one Feature object belonging to a group of Feature objects with the same namespace can be designated as PRODUCTION_READY at a time.
When a Feature object is promoted to PRODUCTION_READY, guardrails are applied automatically to compare the Feature object's cleaning operations and feature job setting with the latest defaults. If you are confident in the promoted Feature object's settings, you can bypass these guardrails by setting ignore_guardrails to True.
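The readiness rules stated above can be summarized as a few checks in plain Python. This is only a sketch of the rules as written here: the `Readiness` enum and the helper names are hypothetical, and the code does not model the guardrails or what the SDK does to an existing PRODUCTION_READY version when another one is promoted.

```python
from enum import Enum


class Readiness(Enum):
    PRODUCTION_READY = "PRODUCTION_READY"
    PUBLIC_DRAFT = "PUBLIC_DRAFT"
    DRAFT = "DRAFT"
    DEPRECATED = "DEPRECATED"


def can_delete(readiness):
    """Only DRAFT Feature objects can be deleted."""
    return readiness is Readiness.DRAFT


def can_change_readiness(current, target):
    """A non-DRAFT Feature object cannot be reverted to DRAFT."""
    return not (target is Readiness.DRAFT and current is not Readiness.DRAFT)


def has_single_production_ready(readiness_levels):
    """At most one version per namespace may be PRODUCTION_READY."""
    count = sum(1 for r in readiness_levels if r is Readiness.PRODUCTION_READY)
    return count <= 1


namespace_versions = [Readiness.PRODUCTION_READY, Readiness.DRAFT, Readiness.DEPRECATED]
namespace_is_valid = has_single_production_ready(namespace_versions)
```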
You can change the readiness state of a Feature object.
Managing Feature Versions
A new feature version is a new Feature object with the same namespace as the original Feature object. The new Feature object has its own Object ID and version name.
New feature versions allow for reusing a Feature with different feature job settings or cleaning operations. If the source table's availability or freshness changes, new feature versions can be created with updated feature job settings. If data quality in the source table changes, new feature versions can be generated with cleaning operations to address new quality issues. Older feature versions can continue to be served without disrupting ML tasks that rely on the feature.
A new version can be created by updating the current feature's feature job setting (if provided) and table cleaning operations (if provided) using the create_new_version() method. The new version's readiness is set to "DRAFT" by default.
You can list Feature objects (versions) with the same namespace from any Feature object.
Setting a Default Feature Version
The default version simplifies feature reuse by providing the most appropriate version when none is explicitly specified. By default, the feature's default version mode is automatic, which selects the version with the highest readiness level. If multiple versions share that readiness level, the most recent one becomes the default.
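The automatic selection rule can be sketched as a sort on two keys: readiness first, recency second. The ranking of the four readiness levels below is an assumption made for this example, as are the record layout and the function name `pick_default_version`.

```python
from datetime import datetime

# Assumed ordering of readiness levels, from lowest to highest
READINESS_RANK = {
    "DEPRECATED": 0,
    "DRAFT": 1,
    "PUBLIC_DRAFT": 2,
    "PRODUCTION_READY": 3,
}


def pick_default_version(versions):
    """Highest readiness wins; the most recent version breaks ties."""
    return max(
        versions,
        key=lambda v: (READINESS_RANK[v["readiness"]], v["created_at"]),
    )


# Hypothetical versions sharing one namespace
VERSIONS = [
    {"version": "V1", "readiness": "PUBLIC_DRAFT", "created_at": datetime(2023, 1, 10)},
    {"version": "V2", "readiness": "PUBLIC_DRAFT", "created_at": datetime(2023, 2, 10)},
    {"version": "V3", "readiness": "DRAFT", "created_at": datetime(2023, 3, 10)},
]

default_version = pick_default_version(VERSIONS)
```

Here the newest version is not selected, because it has a lower readiness level than the two PUBLIC_DRAFT versions.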
You can change the feature's default version mode.
When a feature's default version mode is set to manual, you can designate a specific Feature object as the default version for Feature objects with the same namespace.
To reset the default version mode of the feature and make the original feature version the default, use the following code:
Accessing a Feature from the Catalog
You can refer to the catalog to view a list of existing features, including their detailed information, using the list_features() method. This method returns the default version of each feature.
To obtain the default version of a feature, use its namespace with the get_feature() method. If you want to retrieve a specific version, provide the version name as well.
You can also retrieve a Feature object by its Object ID.
Accessing the Feature Definition file of a Feature object
The feature definition file is a Feature object's single source of truth. The file is generated automatically after a feature is declared in the SDK.
This file uses the same SDK syntax as the feature declaration and provides an explicit outline of the intended operations of the feature declaration, including those inherited but not explicitly declared by you. For example, these operations may include feature job settings and cleaning operations inherited from table metadata.
The feature definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for feature materialization.
The file can be easily displayed in the SDK.