
Glossary

FeatureByte Catalog

A FeatureByte Catalog operates as a centralized repository for organizing tables, entities, features, feature lists, and other objects to facilitate feature serving.

By employing a catalog, team members can effortlessly share, search, retrieve, and reuse these assets while obtaining comprehensive information about their properties.

For data warehouses covering multiple domains, create a separate catalog per domain to maintain clarity and easy access to domain-specific metadata.

SDK Reference

Refer to the Catalog object main page or to the specific links:

Default Catalog

When you start FeatureByte, it automatically activates the default catalog. Do not use the default catalog for data modeling or feature engineering. Instead, always create a new catalog or activate an existing catalog for your work.

Data Source

A Data Source object in FeatureByte represents a collection of source tables that the feature store can access. From a data source, you can:

  • Retrieve the list of databases available
  • Obtain the list of schemas within the desired database
  • Access the list of source tables contained in the selected schema
  • Retrieve a source table to explore it or register it in the catalog

SDK Reference

Refer to the DataSource object main page or to the specific links:

Source Table

A Source Table in FeatureByte is a table of interest that the feature store can access and is located within the data warehouse.

To register a Table in a catalog, first determine its type. There are four supported types: event table, item table, dimension table and slowly changing dimension table.

Note

Registering a table according to its type determines the types of feature engineering operations that are possible on the table's views and enforces guardrails accordingly.

To identify the table type and collect key metadata, perform Exploratory Data Analysis (EDA) on the source table: obtain descriptive statistics, preview a selection of rows, or get a larger sample of rows for a specific time range.

SDK Reference

Refer to the SourceTable object main page or to the specific links:

Primary key

A Primary Key is a column or set of columns uniquely identifying each record (row) in a table.

The primary key is used to enforce the integrity of the data and ensure no duplicate records in the table. The primary key must satisfy the following requirements:

  • Unique: Each record in the table must have a unique primary key value.
  • Non-null: The primary key cannot be null (empty) for any record.
  • Stable: The primary key value should not change over time, or at least not change frequently.
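The uniqueness and non-null requirements can be checked mechanically. Below is a generic stand-alone sketch (not part of the FeatureByte SDK) that validates a candidate primary key over a list of rows:

```python
def validate_primary_key(rows, key_columns):
    """Check that key_columns form a valid primary key over rows.

    rows: list of dicts; key_columns: list of column names.
    Returns a list of violation messages (empty if the key is valid).
    """
    violations = []
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(row.get(col) for col in key_columns)
        if any(v is None for v in key):
            violations.append(f"row {i}: null value in primary key {key}")
        elif key in seen:
            violations.append(f"row {i}: duplicate primary key {key}")
        else:
            seen.add(key)
    return violations

rows = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 2, "name": "Eve"},    # violates uniqueness
    {"customer_id": None, "name": "Jo"},  # violates non-null
]
print(validate_primary_key(rows, ["customer_id"]))
# ['row 2: duplicate primary key (2,)', 'row 3: null value in primary key (None,)']
```

The stability requirement cannot be checked from a single snapshot; it is a property of how the key is managed over time.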

Examples

Some common examples of primary keys include social security numbers and unique identification numbers.

Natural key

A Natural Key is a generally accepted identifier used to identify real-world objects uniquely. In a Slowly Changing Dimension (SCD) table, a natural key (also called alternate key) is a column or a group of columns that remain constant over time and uniquely identifies each active row in the table at any point-in-time.

This key is crucial in maintaining and analyzing the historical changes made in the table.

Example

Consider a SCD table providing changing information on customers, such as their addresses. The customer ID column of this table can be considered a natural key since:

  • it remains constant over time
  • it uniquely identifies each customer

A given customer ID is associated with at most one address at a particular point-in-time, while over time, multiple addresses can be associated with a given customer ID.

Foreign key

A Foreign Key is a column or a group of columns in one table that refers to the primary key in another table. It establishes a relationship between two tables.

Example

An example of foreign key is Customer ID in an Orders table, which links it to the Customer table where Customer ID is the natural key.

Surrogate key

A Surrogate Key is an artificial key assigned by the system. In a Slowly Changing Dimension (SCD) table, a surrogate key is a unique identifier assigned to each record. It is used to provide a stable identifier for dimension data even as it changes over time.

Example

Consider a table that keeps track of customer addresses over time, known as a Slowly Changing Dimension (SCD) table. When a customer updates their address, a new record with the updated address is added rather than modifying the existing record. To uniquely identify each record, a surrogate key is used as the primary key. Additionally, an effective timestamp is included to indicate when the address change occurred.

In this table, the Customer ID acts as the natural key, connecting records to a specific customer. The Customer ID alone does not guarantee uniqueness, as customers may have multiple addresses throughout time. But, each Customer ID is linked to only one address for a specific time period, enabling the table to preserve historical data.

Effective Timestamp

The Effective Timestamp column in a Slowly Changing Dimension (SCD) table specifies the time when the record becomes active or effective.

Example

If a customer changes their address, the effective timestamp would be the date when the new address becomes active.

Expiration Timestamp

The Expiration (or end) Timestamp column in a Slowly Changing Dimension (SCD) table specifies the time when the record is no longer valid or active.

Example

If a customer changes their address, the expiration timestamp would be when the old address is no longer valid.

Note

While this column is useful for data management, it cannot be used for feature engineering as it is related to information unknown during the inference time and may cause time leakage. For this reason, the column is discarded by default when views are generated from tables.

Active Flag

The Active Flag (also known as Current Flag) column in a Slowly Changing Dimension (SCD) table is used to identify the current version of the record.

Example

If a customer changes their address, the active flag would be set to 'Y' for the new address and 'N' for the old address.

Note

While this column is useful for data management, it cannot be used for feature engineering as its value changes over time and may differ between training and inference time, which may cause time leakage. For this reason, the column is discarded by default when views are generated from tables.

Record Creation Timestamp

A Record Creation Timestamp refers to the time when a particular record was created in the data warehouse. The record creation timestamp is usually automatically generated by the system when the record is first created, but a user or an administrator can manually set it.

Note

While this column is useful for data management, it is usually not used for feature engineering as it is sensitive to changes in data management that are usually unrelated to the target to predict. This also may cause feature drift and undesirable impact on predictions. For this reason, the column is discarded by default when views are generated from tables.

The information is, however, used to analyze the data availability and freshness of the tables to help with the configuration of their default feature job setting.

Time Zone Offset

A time zone offset, also known as a UTC offset, is a difference in time between Coordinated Universal Time (UTC) and a local time zone. The offset is usually expressed as a positive or negative number of hours and minutes relative to UTC.

Example

If the local time is 3 hours ahead of UTC, the time zone offset would be represented as "+03:00". Similarly, if the local time is 2 hours behind UTC, the time zone offset would be represented as "-02:00".

Note

When you register an Event table, you can specify a separate column that provides the time zone offset information. By doing so, all date part transforms on the event timestamp column will be based on the local time instead of UTC.

The required format for the column is "(+|-)HH:mm".
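To illustrate why the offset matters for date parts, here is a stdlib-only sketch (not the FeatureByte SDK) that parses an "(+|-)HH:mm" offset string and applies it to a UTC timestamp:

```python
from datetime import datetime, timedelta, timezone

def to_local(utc_ts: datetime, offset_str: str) -> datetime:
    """Apply a "(+|-)HH:mm" time zone offset to a UTC timestamp."""
    sign = 1 if offset_str[0] == "+" else -1
    hours, minutes = map(int, offset_str[1:].split(":"))
    offset = timezone(sign * timedelta(hours=hours, minutes=minutes))
    return utc_ts.replace(tzinfo=timezone.utc).astimezone(offset)

event_utc = datetime(2023, 5, 1, 22, 30)   # 22:30 UTC
local = to_local(event_utc, "+03:00")
print(local.isoformat())  # 2023-05-02T01:30:00+03:00
```

Note that the local date (May 2) differs from the UTC date (May 1), so date parts such as day or day-of-week would change depending on whether the offset is applied.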

Timestamp with Time Zone Offset

The Snowflake data warehouse supports a timestamp type with time zone offset information (TIMESTAMP_TZ). FeatureByte recognizes this timestamp type, and date parts for columns or features using timestamps with time zone offsets are based on the local time instead of UTC.

Important

Timestamp columns that are stored without time zone offset information are assumed to be UTC timestamps.

Table

A Table in FeatureByte represents a source table and provides a centralized location for metadata for that table. This metadata determines the type of operations that can be applied to the table's views.

Important

A source table can only be associated with one active table in the catalog at a time. This means that the active table in the catalog is the source of truth for the metadata of the source table. If a table in the catalog becomes deprecated, it can be replaced with a new table in the catalog that has updated metadata.

To register a table in a catalog, determine its type first. The table's type determines the types of feature engineering operations possible on the table's views and enforces guardrails accordingly. Currently, FeatureByte recognizes four table types: Event Table, Item Table, Dimension Table, and Slowly Changing Dimension (SCD) Table.

Two additional table types, Regular Time Series and Sensor data, will be supported shortly.

Optionally, you can include additional metadata at the column level after creating a table to support feature engineering further. This could involve tagging columns with related entity references or defining default cleaning operations.

SDK Reference

Refer to the Table object main page or to the specific links:

Event Table

An Event Table represents a table in the data warehouse where each row indicates a unique business event occurring at a particular time.

Examples

Event tables can take various forms, such as

  • An Order table in E-commerce
  • A Credit Card Transactions table in Banking
  • Doctor Visits in Healthcare
  • Clickstream on the Internet.

To create an Event Table in FeatureByte, it is necessary to identify two important columns in your data: the event key and timestamp. The event key is a unique identifier for each event, while the timestamp indicates when the event occurred.

Note

If your data warehouse is a Snowflake data warehouse, FeatureByte accepts timestamp columns that include time zone offset information.

For timestamp columns without time zone offset information, or for non-Snowflake data warehouses, you can specify a separate column that provides the time zone offset information. By doing so, all date part transforms on the event timestamp column will be based on the local time instead of UTC.

Additionally, the column that represents the record creation timestamp may be identified to enable an automatic analysis of data availability and freshness of the source table. This analysis can assist in selecting the default feature job setting that defines the scheduling of the computation of features associated with the Event table.

Item Table

An Item Table represents a table in the data warehouse containing detailed information about a specific business event.

Examples

An Item table may contain information about:

  • Product Items purchased in Customer Orders
  • or Drug Prescriptions issued during Doctor Visits by Patients.

Typically, an Item table has a 'one-to-many' relationship with an Event table. Despite not explicitly including a timestamp, it is inherently linked to an event timestamp through its association with the Event table.

To create an Item Table, it is necessary to identify the columns that represent the item key and the event key and determine which Event table is associated with the Item table.

SDK Reference

How to register an item table.

Slowly Changing Dimension (SCD) Table

An SCD Table represents a table in a data warehouse that contains data that changes slowly and unpredictably over time.

There are two main types of SCD Tables:

  • Type 1: Overwrites old data with new data
  • Type 2: Maintains a history of changes by creating a new record for each change.

FeatureByte only supports using Type 2 SCD Tables since Type 1 SCD Tables may cause data leaks during model training and poor performance during inference.

A Type 2 SCD Table utilizes a natural key to distinguish each active row and facilitate tracking of changes over time. The SCD table employs effective and end (or expiration) timestamp columns to determine the active status of a row. In certain instances, an active flag column may replace the expiration timestamp column to indicate whether a row is active.

Example

Here is an example of a Type 2 SCD table for tracking changes to customer information:

Customer ID First Name Last Name Address City State Zip Code Valid From Valid To
123456 John Smith 123 Main St San Francisco CA 12345 13/01/2019 11:00:00 16/03/2021 10:00:00
123456 John Smith 456 Oak St Oakland CA 67890 16/03/2021 10:00:00 NULL
789012 Jane Doe 789 Maple Ave New York City NY 34567 15/09/2020 10:00:00 NULL

In this example, each row represents a specific version of customer information. The customer entity is identified by the natural key "Customer ID". If a customer's information changes, a new row is added to the table with the updated information, along with an effective timestamp ("Valid From" column) and end timestamp ("Valid To" column) to indicate the period during which that version of the information was active. The end timestamp is NULL for the current version of the information, indicating that it is still active.

For example, the customer with ID 123456 initially had an address of 123 Main St in San Francisco, CA, but then changed their address to 456 Oak St in Oakland, CA on 16/03/2021. This change is reflected in the SCD table by adding a new row with the updated address and a Valid From of 16/03/2021 10:00:00, and by setting the Valid To of the previous row to that same timestamp.
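The point-in-time semantics of this table can be made concrete with a small lookup function. This is a conceptual sketch using the example rows above, not the FeatureByte SDK (which performs such as-of lookups for you):

```python
from datetime import datetime

# Rows from the example SCD table: (customer_id, address, valid_from, valid_to)
scd_rows = [
    (123456, "123 Main St", datetime(2019, 1, 13, 11), datetime(2021, 3, 16, 10)),
    (123456, "456 Oak St", datetime(2021, 3, 16, 10), None),    # None = still active
    (789012, "789 Maple Ave", datetime(2020, 9, 15, 10), None),
]

def address_as_of(customer_id, point_in_time):
    """Return the address active for a customer at a given point in time."""
    for cid, address, valid_from, valid_to in scd_rows:
        if cid != customer_id:
            continue
        # A row is active when valid_from <= t < valid_to (open-ended if valid_to is NULL)
        if valid_from <= point_in_time and (valid_to is None or point_in_time < valid_to):
            return address
    return None

print(address_as_of(123456, datetime(2020, 6, 1)))  # 123 Main St
print(address_as_of(123456, datetime(2022, 6, 1)))  # 456 Oak St
```

Because each natural key value has at most one active row at any point in time, the lookup is unambiguous.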

To create an SCD Table in FeatureByte, it is necessary to identify columns for the natural key, effective timestamp, optionally surrogate key, end (or expiration) timestamp, and active flag.

SDK Reference

How to register a SCD table.

Dimension Table

A Dimension Table represents a table in the data warehouse containing static descriptive data.

Important

Using a Dimension table requires special attention. If the data in the table changes slowly, it is not advisable to use it because these changes can cause significant data leaks during model training and adversely affect the inference performance. In such cases, it is recommended to use a Type 2 Slowly Changing Dimension table that maintains a history of changes.

To create a Dimension Table in FeatureByte, it is necessary to identify which column represents its primary key.

SDK Reference

How to register a dimension table.

Table Column

A Table Column refers to a specific column within a table. You can add metadata to the column to help with feature engineering, such as tagging the column with entity references or defining default cleaning operations.

SDK Reference

Refer to the TableColumn object main page or to the specific links:

Cleaning Operations

Cleaning Operations determine the procedure for cleaning data in a table column before performing feature engineering. The cleaning operations can either be set as a default operation in the metadata of a table column or established when creating a view in a manual mode.

These operations specify how to manage the following scenarios:

  • Missing values
  • Disguised values
  • Values that are not in an anticipated list
  • Numeric values and dates that are out of boundaries
  • String values when numeric values are expected

If changes occur in the data quality of the source table, new versions of the feature can be created with new cleaning operations that address the new quality issues.
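The scenarios above can be illustrated with a stand-alone sketch of cleaning logic for a numeric column. This is a hypothetical, generic illustration; in FeatureByte, cleaning operations are declared as column metadata through the SDK rather than written by hand:

```python
def clean_numeric(value, disguised=frozenset({-999, -1}), lower=0.0, upper=1e6, fill=None):
    """Conceptual cleaning for a numeric column: coerce strings where numbers
    are expected, treat disguised codes as missing, and bound the value range."""
    if isinstance(value, str):
        try:
            value = float(value)      # string value where a numeric value is expected
        except ValueError:
            return fill               # non-numeric string -> treated as missing
    if value is None or value in disguised:
        return fill                   # missing or disguised value
    if not (lower <= value <= upper):
        return fill                   # out of the anticipated boundaries
    return value

raw = [12.5, "48.0", -999, None, "n/a", 2e9]
print([clean_numeric(v) for v in raw])  # [12.5, 48.0, None, None, None, None]
```

Changing any of these parameters (the disguised set, the boundaries, the fill value) corresponds to defining a new set of cleaning operations, which is why new feature versions can be created when data quality shifts.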

SDK Reference

How to:

Entity

An Entity is a real-world object or concept represented or referenced by columns in your source tables.

Examples

Common examples of entities include customer, merchant, city, product, and order.

In FeatureByte, entities are used to tag table columns, establish entity relationships, and determine how features and feature lists are served.

SDK Reference

Refer to the Entity object main page or to the specific links:

Entity Serving Name

An Entity's Serving Name is the name of the unique identifier used to identify the entity during a preview or serving request. Typically, the serving name for an entity is the name of the primary key (or natural key) of the table that represents the entity. An entity can have multiple serving names for convenience, but they all refer to the same unique identifier.

SDK Reference

How to get the serving names of an entity.

Entity Tagging

The Entity Tagging process involves identifying the specific columns in tables that identify or reference a particular entity. These columns are often primary keys, natural keys, or foreign keys.

Example

Consider a database for a company that consists of 2 SCD tables: one table for employees and one table for departments. In this database,

  • the natural key of the employees table identifies the Employee entity.
  • the natural key of the departments table identifies the Department entity.
  • the employees table may also have a foreign key column referencing the Department entity.

Feature Primary Entity

The Primary Entity of a feature defines the level of analysis for that feature.

The primary entity is usually a single entity. However, in some instances, it may be a tuple of entities.

When a feature is a result of an aggregation grouped by multiple entities, the primary entity is a tuple of those entities.

Example

Entity Diagram

Suppose a feature quantifies the interaction between a customer entity and a merchant entity in the past, such as the sum of transaction amounts grouped by customer and merchant in the past four weeks.

The primary entity of this feature is the tuple of customer and merchant.
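The grouping that makes the primary entity a tuple can be sketched in plain Python. This is a conceptual illustration only; FeatureByte expresses this as a declarative windowed aggregation in the SDK:

```python
from collections import defaultdict
from datetime import datetime, timedelta

transactions = [
    # (customer, merchant, amount, timestamp)
    ("c1", "m1", 10.0, datetime(2023, 5, 1)),   # outside the 4-week window below
    ("c1", "m1", 15.0, datetime(2023, 6, 20)),
    ("c1", "m2", 99.0, datetime(2023, 6, 21)),
    ("c2", "m1", 40.0, datetime(2023, 6, 22)),
]

def sum_by_customer_merchant(point_in_time, window=timedelta(weeks=4)):
    """Sum of transaction amounts grouped by the (customer, merchant) tuple
    over a trailing time window ending at point_in_time."""
    totals = defaultdict(float)
    for customer, merchant, amount, ts in transactions:
        if point_in_time - window <= ts < point_in_time:
            totals[(customer, merchant)] += amount  # the key is the entity tuple
    return dict(totals)

print(sum_by_customer_merchant(datetime(2023, 6, 25)))
# {('c1', 'm1'): 15.0, ('c1', 'm2'): 99.0, ('c2', 'm1'): 40.0}
```

Because the aggregation key is the (customer, merchant) pair, serving this feature requires supplying values for both entities, which is exactly what "the primary entity is a tuple" means.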

When a feature is derived from features with different primary entities, the primary entity is determined by the entity relationships between the entities. The lowest level entity in the hierarchy is selected as the primary entity. If the entities have no relationship, the primary entity becomes a tuple of those entities.

Example

Entity Diagram

Consider two entities: customer and customer city, where the customer entity is a child of customer city entity. If a new feature is created that compares a customer's basket with the average basket of customers in the same city, the primary entity for that feature would be the customer entity. This is because the customer entity is a child of the customer city entity and the customer city entity can be deduced automatically.

Alternatively, if two entities, such as customer and merchant, do not have any relationship, the primary entity for a feature that calculates the distance between the customer location and the merchant location would be the tuple of customer and merchant entities. This is because the two entities do not have any parent-child relationship.

SDK Reference

How to get the primary entity of a feature.

Feature List Primary Entity

The primary entity of a feature list determines the entities that can be used to serve the feature list, which typically corresponds to the primary entity of the Use Case that the feature list was created for.

If the features within the list pertain to different primary entities, the primary entity of the feature list is selected based on the entity relationships, with the lowest level entity in the hierarchy chosen as the primary entity. In cases where there are no relationships between entities, the primary entity may become a tuple comprising those entities.

Example

Entity Diagram

Consider a feature list containing features related to card, customer, and customer city. In this case, the primary entity is the card entity since it is a child of both the customer and customer city entities.

However, if the feature list also contains merchant and merchant city features, the primary entity is a tuple of card and merchant.

SDK Reference

How to get the primary entity of a feature list.

Use Case Primary Entity

In a Use Case, the Primary Entity is the object or concept that defines its problem statement. Usually, this entity is singular, but in cases such as recommendation engines, it can be a tuple of entities such as (Customer, Product).

Serving Entity

A Serving Entity is any entity that can be used to preview or serve a feature or feature list, regardless of whether it is the primary entity. Serving entities associated with a feature or feature list are typically descendants of the primary entity.

Example

Entity Diagram

Suppose that a customer is the primary entity for a feature; the serving entities for that feature could include related entities such as the card and transaction entities, which are children or grandchildren of the customer entity.

Entity Relationship

The parent-child relationship and the supertype-subtype relationship are the two main types of Entity Relationships that can assist feature engineering and feature serving.

The parent-child relationship is automatically established in FeatureByte during the entity tagging process, while identifying supertype-subtype relationships requires manual intervention.

These relationships can be used to suggest, facilitate and verify joins during feature engineering and streamline the process of serving feature lists containing multiple entity-assigned features.

Important

Note that FeatureByte only supports parent-child relationships currently. Nevertheless, it is expected that supertype-subtype relationships will also be supported shortly, thus enabling more efficient feature engineering and feature serving.

SDK Reference

Refer to the Relationship object main page or to the specific links:

Parent-Child Relationship

A Parent-Child Relationship is a hierarchical connection that links one entity (the child) to another (the parent). Each child entity key value can have only one parent entity key value, but a parent entity key value can have multiple child entity key values.

Example

Examples of parent-child relationships include:

  • Hierarchical organization chart: A company's employees are arranged in a hierarchy, with each employee having a manager. The employee entity represents the child, and the manager entity represents the parent.
  • Product catalog: In an e-commerce system, a product catalog may be categorized into categories and subcategories. Each category or subcategory represents a child of its parent category.
  • Geographical hierarchy: In a geographical data model, cities are arranged in states, which are arranged in countries. Each city is the child of its parent state, and each state is the child of its parent country.
  • Credit Card hierarchy: A credit card transaction is the child of a card and a merchant. A card is a child of a customer. A customer is a child of a customer city. And a merchant is a child of a merchant city.

Entity Diagram

Note

In FeatureByte, the parent-child relationship is automatically established when the primary key (or natural key in the context of a SCD table) identifies one entity. This entity is the child entity. Other entities that are referenced in the table are identified as parent entities.

Supertype-Subtype Relationship

In a data model, a Supertype-Subtype Relationship is a hierarchical relationship between two or more entity types where one entity type (the subtype) inherits attributes and relationships from another entity type (the supertype).

The subtype entity is typically a more specialized version of the supertype entity, representing a subset of the data that applies to a particular domain. Although the subtype entity inherits properties and relationships from the supertype entity, it can have its own unique attributes or relationships.

Examples

Here are a few examples of supertype-subtype relationships involving a person, student, and teacher:

  1. Person is the supertype, while student and teacher are both subtypes of person.
  2. Student is a subtype of person. This is because a student is a specific type of person who is enrolled in a school or university.
  3. Teacher is also a subtype of person since a teacher is a specific type of person responsible for educating and instructing students.
  4. A more specific subtype of student could be a graduate student, which refers to a student who has already completed a bachelor's degree and is pursuing a higher-level degree.
  5. Another subtype of teacher could be a professor, typically a teacher with a higher academic rank and significant experience in their field.

Supertype-subtype relationships describe how a more general category (the supertype) can be divided into more specific subcategories (the subtypes). In this case, a person is the most general category, while student and teacher are more specific categories that fall under the broader umbrella of "person."

View

A view is a local virtual table that can be modified and joined to other views to prepare data before feature definition. A view does not contain any data of its own.

Views in FeatureByte allow operations similar to Pandas, such as subsetting, joins, and column transforms.

Unlike Pandas DataFrames, which require loading all data into memory, views are materialized only when needed during previews or feature materialization.

When a view is created, it inherits the metadata of the catalog table it originated from. Currently, five types of views are supported: Event View, Item View, Dimension View, Slowly Changing Dimension (SCD) View, and Change View.

Two view construction modes are available:

  • Auto (default): Automatically cleans data according to default operations specified for each column within the table and excludes special columns not meant for feature engineering.
  • Manual: Allows custom cleaning operations without applying default cleaning operations.

Although views provide access to cleaned data, you can still perform data manipulation using the raw data from the source table. To do this, utilize the view's raw attribute, which enables you to work directly with the unprocessed data of the source table.

View Column

A View Column is a column within a FeatureByte view. When creating a view, a View Column represents the cleaned version of a table column. The cleaning procedure for a View Column depends on the view's construction mode and typically follows the default cleaning operations associated with the corresponding table column.

By default, special columns not intended for feature engineering are excluded from view columns. These columns may consist of record creation and expiration timestamps, surrogate keys, and active flags.

You can add new columns to a view by performing joins or by deriving new columns from existing ones.

If you wish to add new columns derived from the raw data in the source table, use the view's raw attribute to access the source table's unprocessed data.

SDK Reference

Refer to the ViewColumn object main page or to the specific links:

Change View

A Change View is created from a Slowly Changing Dimension (SCD) table to provide a way to analyze changes that occur in an attribute of the natural key of the table over time. This view consists of five columns:

Once the Change View is created, it can be used to generate features in the same way as features from an Event View.

Examples

Changes to a SCD table can provide valuable insights into customer behavior, such as:

  • the number of times a customer has moved in the past six months,
  • their previous address if they recently moved,
  • whether they have gone through a recent divorce,
  • if there are new additions to their family,
  • or if they have started a new job.

SDK Reference

How to create a Change View from a SCD table.

View Subsetting

Similar to a Pandas DataFrame, new views can be created from subsets of views. Additionally, a condition-based subset can be used to replace the values of a column.

View Sample

Using the sample method, a view can be materialized with a random selection of rows for a given time range, size, and seed to control sampling.

Note

Views from tables in a Snowflake data warehouse do not support the use of seed.
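Seeded sampling makes the materialized subset reproducible. The idea can be shown with a stdlib sketch (illustrative only, not how the SDK materializes samples in the warehouse):

```python
import random

rows = list(range(100))  # stand-in for a view's rows within a time range

def sample_rows(rows, size, seed):
    """Draw a random sample of `size` rows, reproducible for a given seed."""
    return random.Random(seed).sample(rows, size)

a = sample_rows(rows, 5, seed=42)
b = sample_rows(rows, 5, seed=42)
print(a == b)  # True -- same seed, same sample
```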

SDK Reference

How to materialize a sample of a view.

View Join

To join two views, use the join() method of the left view and specify the right view object in the other_view parameter. The method will match rows from both views based on a shared key, which is either the primary key of the right view or the natural key if the right view is a Slowly Changing Dimension (SCD) view.

If the shared key identifies an entity that is referenced in the left view or the column name of the shared key is the same in both views, the join() method will automatically identify the column in the left view to use for the join.

By default, a left join is performed, and the resulting view will have the same number of rows as the left view. However, you can set the how parameter to 'inner' to perform an inner join. In this case, the resulting view will only contain rows where there is a match between the columns in both tables.

When the right view is an SCD view, the event timestamp of the left view determines which record of the right view to join.
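The left-join semantics described above can be sketched in plain Python. This is a conceptual illustration over lists of dicts, not the SDK's join(), which operates on views inside the warehouse:

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on a shared key.
    Every left row is kept; unmatched left rows keep only their own columns."""
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        extra = {k: v for k, v in match.items() if k != key} if match else {}
        joined.append({**row, **extra})
    return joined

orders = [{"customer_id": 1, "amount": 10}, {"customer_id": 3, "amount": 5}]
customers = [{"customer_id": 1, "city": "Paris"}]
print(left_join(orders, customers, "customer_id"))
# [{'customer_id': 1, 'amount': 10, 'city': 'Paris'}, {'customer_id': 3, 'amount': 5}]
```

An inner join would instead drop the unmatched order for customer 3; and for an SCD right view, the lookup would additionally filter on the row active at the left view's event timestamp, as in the SCD lookup sketch earlier in this glossary.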

Note

For Item View, the event timestamp and columns representing entities in the related event table are automatically added. Additional attributes can be joined using the join_event_table_attributes() method.

Important

Not all views can be joined to each other. SCD views cannot be joined to other SCD views, while only dimension views can be joined to other dimension views. Change views cannot be joined to any views.

View Column Transforms

View Column Transforms refer to the ability to apply transformation operations on columns within a view. These operations generate a new column that can either be assigned back to the view or used for subsequent transformations.

The different types of transforms include generic transforms, numeric transforms, string transforms, datetime transforms, and lag transforms.

Generic Transforms

SDK Reference

You can apply the following transforms to columns of any data type in a view:

  • isnull: Returns a new boolean column that indicates whether each row is missing.
  • notnull: Returns a new boolean column that indicates whether each row is not missing.
  • isin: Returns a new boolean column showing whether each element in the view column matches an element in the passed sequence of values.
  • fillna: Replaces missing values in-place with specified values.
  • astype: Converts the data type of the column.
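Conceptually, these transforms behave like their Pandas namesakes. A stdlib sketch over a list standing in for a view column (illustrative only):

```python
column = [3.5, None, 7.0, None]

isnull = [v is None for v in column]                        # isnull
notnull = [v is not None for v in column]                   # notnull
isin = [v in (3.5, 7.0) for v in column]                    # isin((3.5, 7.0))
filled = [0.0 if v is None else v for v in column]          # fillna(0.0)
as_str = [None if v is None else str(v) for v in column]    # astype(str)

print(isnull)  # [False, True, False, True]
print(filled)  # [3.5, 0.0, 7.0, 0.0]
```

In the SDK, each of these returns (or, for fillna, updates in place) a column that is lazily evaluated in the warehouse rather than a Python list.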

Numeric Transforms

SDK Reference

In addition to built-in arithmetic operators (+, -, *, /, etc), you can apply the following transforms to columns of numeric type in a view:

  • abs: Returns absolute value
  • sqrt: Returns square root value
  • pow: Returns power value
  • log: Returns logarithm with natural base
  • exp: Returns exponential value
  • floor: Rounds down to the nearest integer
  • ceil: Rounds up to the nearest integer

String Transforms

SDK Reference

In addition to string columns concatenation, you can apply the following transforms to columns of string type in a view:

  • len: Returns the length of the string
  • lower: Converts all characters to lowercase
  • upper: Converts all characters to uppercase
  • strip: Trims white space(s) or a specific character on the left and right string boundaries
  • lstrip: Trims white space(s) or a specific character on the left string boundary
  • rstrip: Trims white space(s) or a specific character on the right string boundary
  • replace: Replaces substring with a new string
  • pad: Pads string up to the specified width size
  • contains: Returns a boolean flag column indicating whether each string element contains a target string
  • slice: Slices substrings for each string element

Datetime Transforms

The date or timestamp (datetime) columns in a view can undergo the following transformations:

  • Calculate the difference between two datetime columns.
  • Add a time interval to a datetime column to generate a new datetime column.
  • Extract date components from a datetime column.

Note

Date parts for columns or features using a timestamp with a time zone offset are based on local time instead of UTC.

Date parts for columns or features using event timestamps of Event tables, where a separate column was specified to provide the time zone offset, are likewise based on local time instead of UTC.

SDK Reference

How to extract date components:

  • microsecond: Returns the microsecond component of each element
  • millisecond: Returns the millisecond component of each element
  • second: Returns the second component of each element
  • minute: Returns the minute component of each element
  • hour: Returns the hour component of each element
  • day: Returns the day component of each element in a view column
  • day_of_week: Returns the day of week component of each element
  • week: Returns the week component of each element
  • month: Returns the month component of each element
  • quarter: Returns the quarter component of each element
  • year: Returns the year component of each element
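The three transformation families and the date components above can be sketched with the pandas `dt` accessor (an analogy with illustrative timestamps, not the FeatureByte SDK API):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(["2023-03-15 10:30:45", "2023-12-31 23:59:59"]))

# extract date components
hours = ts.dt.hour
days = ts.dt.day
day_of_week = ts.dt.dayofweek     # Monday == 0 in pandas
quarters = ts.dt.quarter
years = ts.dt.year

# add a time interval, then take the difference between two datetime columns
later = ts + pd.Timedelta(hours=2)
diff_hours = (later - ts).dt.total_seconds() / 3600
```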

Lag Transforms

A Lag Transform retrieves the preceding value associated with a particular entity in a view.

This makes it possible to compute essential features, such as those that depend on inter-event time or on the distance from the previous point.

Note

Lag transforms are only supported for Event and Change views.

SDK Reference

How to extract lags from a view column.
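As a rough pandas analogy (not the FeatureByte SDK API), a per-entity lag can be computed with a grouped shift; the customer column and timestamps below are illustrative:

```python
import pandas as pd

events = pd.DataFrame({
    "customer": ["A", "B", "A", "A"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-07"]),
}).sort_values(["customer", "ts"])

# preceding timestamp for the same entity (the lag)
events["prev_ts"] = events.groupby("customer")["ts"].shift(1)

# inter-event time in days: a typical feature built on a lag
events["days_since_prev"] = (events["ts"] - events["prev_ts"]).dt.days
```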

Features

Input data used to train Machine Learning models and compute predictions is referred to as features.

These features can sometimes be derived from attributes already present in the source tables.

Example

A customer churn model may use features obtained directly from a customer profile table, such as age, gender, income, and location.

However, in many cases, features are created by applying a series of row transformations, joins, filters, and aggregates.

Example

A customer churn model may utilize aggregate features that reflect the customer's account details over a given period, such as

  • the customer entropy of product types purchased over the past 12 weeks,
  • the customer count of canceled orders over the past 56 weeks,
  • and the customer amount spent over the past seven days.
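The entropy feature above can be computed from the category counts of a customer's purchases. A minimal sketch of the underlying calculation, with illustrative data:

```python
import math
from collections import Counter

def entropy(categories):
    """Shannon entropy (natural log) of the category distribution."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# a customer buying a single product type has entropy 0;
# spreading purchases across types increases entropy
single = entropy(["grocery"] * 4)
spread = entropy(["grocery", "toys", "clothing", "grocery"])
```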

Feature Materialization

A feature in FeatureByte is defined by the logical plan for its computation. The act of computing the feature is known as Feature Materialization.

Features are materialized on demand to fulfill historical requests, whereas for prediction purposes, feature values are generated through a batch process called a "Feature Job". The Feature Job is scheduled based on the settings associated with each feature.

To materialize the feature values, either entities to which the feature is assigned or their child entities (the serving entities) must be instantiated. Additionally, in the context of historical feature serving, an observation set is required, created by combining entity key values and point-in-time references that correspond to particular moments in the past.

Point-In-Time

A Point-In-Time for a feature refers to a specific moment in the past with which the feature's values are associated.

It is a crucial aspect of historical feature serving that allows Machine Learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.

Feature Object

A Feature object in FeatureByte contains the logical plan to compute the feature.

There are three ways to define the plan for Feature objects from views:

  1. Lookup features
  2. Aggregate features
  3. Cross Aggregate features

Additionally, Feature objects can be created as transformations of one or more existing features.

SDK Reference

Refer to the Feature object main page or to the specific links:

Lookup Features

A Lookup Feature refers to an entity’s attribute in a view at a specific point-in-time. Lookup features do not involve any aggregation processes.

When a view's primary key identifies an entity, it is simple to designate its attributes as features for that particular entity.

Examples

Examples of Lookup features are a customer's birthplace retrieved from a Customer Dimension table or a transaction amount retrieved from a Transactions Event table.

When an entity serves as the natural key of an SCD view, it is also possible to assign one of its attributes as a feature for that entity. However, in those cases, the feature is materialized through point-in-time joins, and the resulting value corresponds to the active row at the point-in-time specified in the feature request.

Example

A customer feature could be the customer's street address at the request's point-in-time.

When dealing with an SCD view, you can specify an offset if you want to get the feature value at a specific time before the request's point-in-time.

Example

By setting the offset to 9 weeks in the previous example, the feature value would be the customer's street address nine weeks before the request's point-in-time.

SDK Reference

How to create a Lookup feature.

Aggregate Features

Aggregate Features are an important type of feature engineering that involves applying various aggregation functions to a collection of data points grouped by an entity (or a tuple of entities). Supported aggregation functions include the latest, count, sum, average, minimum, maximum, and standard deviation. It is important to consider the temporal aspect when conducting these aggregation operations.

There are three main types of aggregate features:

  1. simple aggregates,
  2. aggregates over a window
  3. and aggregates "as at" a point-in-time.

Note

If a feature is intended to capture patterns of interaction between two or more entities, these aggregations are grouped by the tuple of the entities. For instance, an aggregate feature can be created to show the amount spent by a customer with a merchant in the past.

SDK Reference

How to create:

Cross Aggregate Features

Cross Aggregate features are a type of Aggregate Feature that involves aggregating data across different categories. This enables the creation of features that capture patterns in an entity across these categories.

Example

The amount spent by a customer on each product category over a specific time period is a Cross Aggregate feature. In this case, the customer is the entity being analyzed and the product category is the categorical variable that the aggregation was done across for each customer. The resulting feature could be used to identify spending patterns or preferences of individual customers across different product categories.

Note

When such a feature is computed for a customer, a dictionary is returned that contains keys representing the product categories purchased by the customer and their corresponding values representing the total amount spent on each category.
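The dictionary output described above can be sketched with a pandas groupby (an analogy with illustrative data, not the FeatureByte SDK API):

```python
import pandas as pd

txns = pd.DataFrame({
    "customer": ["A", "A", "A", "B"],
    "category": ["grocery", "grocery", "toys", "toys"],
    "amount": [10.0, 5.0, 20.0, 7.0],
})

# amount spent per customer, across product categories, as a dictionary
spent = {
    cust: grp.groupby("category")["amount"].sum().to_dict()
    for cust, grp in txns.groupby("customer")
}
# spent["A"] -> {"grocery": 15.0, "toys": 20.0}
```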

As with other types of Aggregate Features, it is important to consider the temporal aspect when conducting aggregation operations. Cross Aggregate features come in the same three main types: simple aggregates, aggregates over a window, and aggregates "as at" a point-in-time.

SDK Reference

How to group by entity across categories to perform cross aggregates.

Simple Aggregates

Simple Aggregate features refer to features that are generated through aggregation operations without considering any temporal aspects. In other words, these features are created by aggregating values without considering the order or sequence in which they occur.

Important

To avoid time leakage, simple aggregates are only supported for Item views, when the grouping key is the event key of the Item view. An example of such a feature is the count of items in an Order.

Note

Simple aggregate features obtained from an Item view can be added as a column to the corresponding event view. Once the feature is integrated, it can be aggregated over a time window to create aggregate features over a window. For instance, you can calculate a customer's average order size over the last three weeks by using the order size feature extracted from the Order Items view and aggregating it over that time frame in the related Order view.
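The order-size example can be sketched as a plain groupby count keyed by the event key (a pandas analogy with illustrative data, not the FeatureByte SDK API):

```python
import pandas as pd

order_items = pd.DataFrame({
    "order_id": [1, 1, 1, 2],
    "item_id": [101, 102, 103, 104],
})

# simple aggregate keyed by the event key (order_id) of the item view
order_size = order_items.groupby("order_id")["item_id"].count()
# order_size[1] -> 3
```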

SDK Reference

How to:

Aggregates Over A Window

Aggregates over a window refer to features generated by aggregating data within a specific time frame. These types of features are commonly used for analyzing event and item data.

The duration of the window is specified when the feature is created. The end point of the window is determined when the feature is served, based on the point-in-time values specified by the feature request and the feature job setting of the feature.

SDK Reference

How to create an aggregate over a window feature.

Aggregates “As At” a Point-In-Time

Aggregates "As At" a Point-In-Time are features that are generated by aggregating data that is active at a particular moment in time. These types of features are only available for slowly changing dimension (SCD) views and the grouping key used for generating these features should not be the natural key of the SCD view.

You can specify an offset, if you want the aggregation to be done on rows active at a specific time before the point-in-time specified by the feature request.

Example

An aggregate ‘as at’ feature from a Credit Cards table could be the customer's count of credit cards at the specified point-in-time of the feature request.

With an offset of 2 weeks, the feature would be the customer's count of credit cards 2 weeks before the specified point-in-time of the feature request.

SDK Reference

How to create an aggregate "asat" feature.

Aggregates Of Changes Over a Window

Aggregates of changes over a window are features that summarize changes in a Slowly Changing Dimension (SCD) table within a specific time frame. These features are created by aggregating data from a Change view that is derived from a column in the SCD table.

Example

One possible aggregate feature of changes over a window could be the count of address changes that occurred within the last 12 weeks for a customer.

SDK Reference

How to create:

Feature Transforms

Feature Transforms is a flexible functionality that allows the generation of new features by applying a broad range of transformation operations to existing features. These transformations can be applied to individual features or multiple features from the same or distinct entities.

The available transformation operations resemble those provided for view columns. However, additional transformations are also supported for features resulting from Cross Aggregate features.

Features can also be derived from multiple features and the points-in-time provided during feature materialization.

Examples of features derived from Cross Aggregates

  • Most common weekday for customer visits in the past 12 weeks
  • Count of unique items purchased by a customer in the past 4 weeks
  • List of distinct items bought by a customer in the past 4 weeks
  • Amount spent by a customer on ice cream in the past 4 weeks
  • Weekday entropy for customer visits in the past 12 weeks

Examples of features derived from multiple features

  • Similarity between customer’s basket during the past week and past 12 weeks
  • Similarity between a customer's item basket and the baskets of customers in the same city over the past 2 weeks
  • Order amount z-score based on a customer's order history over the past 12 weeks
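The basket-similarity examples above compare the dictionary outputs of two Cross Aggregate features. One common way to do this is cosine similarity; a minimal sketch with illustrative baskets (the similarity measure is an assumption for illustration, not necessarily the one used by FeatureByte):

```python
import math

def basket_similarity(a, b):
    """Cosine similarity between two basket dictionaries (category -> amount)."""
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if norm_a == 0.0 or norm_b == 0.0 else dot / (norm_a * norm_b)

past_week = {"grocery": 40.0, "toys": 10.0}
past_12_weeks = {"grocery": 400.0, "toys": 100.0, "clothing": 50.0}
similarity = basket_similarity(past_week, past_12_weeks)
```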

SDK Reference

How to transform the dictionary output of cross aggregate features:

Feature Version

A Feature Version enables the reuse of a Feature with varying feature job settings or distinct cleaning operations.

If the availability or freshness of the source table changes, new versions of the feature can be generated with a new feature job setting. On the other hand, if the data quality of the source table changes, new versions of the feature can be created with new cleaning operations that address the new quality issues.

To ensure the seamless inference of Machine Learning tasks that depend on the feature, old versions of the feature can still be served without any disruption.

Note

In the FeatureByte SDK, a new feature version is a new Feature object with the same namespace as the original Feature object. The new Feature object has its own Object ID and version name.

Feature Readiness

To help differentiate features that are in the prototype stage and features that are ready for production, a feature version can have one of four readiness levels:

  1. PRODUCTION_READY: ready for deployment in production environments.
  2. PUBLIC_DRAFT: shared for feedback purposes.
  3. DRAFT: in the prototype stage.
  4. DEPRECATED: not advised for use in either training or prediction.

Important

Only one feature version can be designated as PRODUCTION_READY at a time.

When a feature version is promoted to PRODUCTION_READY, guardrails are applied automatically to ensure consistency with default cleaning operations and feature job settings. You can disregard these guardrails if the settings of the promoted feature version adhere to equally robust practices.

SDK Reference

How to:

Default Feature Version

The default version of a feature streamlines the process of reusing features by providing the most appropriate version. Additionally, it simplifies the creation of new versions of feature lists.

By default, the feature's version with the highest level of readiness is considered, unless you override this selection. In cases where multiple versions share the highest level of readiness, the most recent version is automatically chosen as the default.

Note

When a feature is accessed from a catalog without specifying its object ID or its version name but only by its name, the default version is automatically retrieved.

Feature Definition File

The feature definition file is the single source of truth for a feature version. This file is automatically generated when a feature is declared in the SDK or a new version is derived.

The syntax used in the SDK is also used in the feature definition file. The file provides an explicit outline of the intended operations of the feature declaration, including those inherited but not explicitly declared by you. For example, these operations may include feature job settings and cleaning operations inherited from table metadata.

The feature definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for feature materialization.

Definition File

SDK Reference

How to obtain the feature definition file.

Feature List

A Feature List is a collection of features that is tailored to meet the needs of a particular use case. It is commonly used in generating feature values for Machine Learning training and inference.

To obtain Exploratory Data Analysis (EDA), training, or test data for a Use Case, the Feature List is first used to gather historical feature values. These values are then employed to analyze features and train and test models. Once a model has been trained and validated, the Feature List is deployed, and the feature values can be accessed through online and batch serving to generate predictions.

The primary entity of the Feature List determines the entities that can be used to serve the feature list, which typically corresponds to the primary entity of the Use Case for which the feature list was created. The primary entity of the Feature List is determined by analyzing the relationship between the primary entities of the individual features listed.

Note

A feature list can be served by its primary entity or any descendant serving entities.

SDK Reference

Refer to the FeatureList object main page or to the specific links:

Feature Group

A Feature Group is a temporary collection of features that facilitates the manipulation of features and the creation of feature lists.

Note

It is not possible to save the Feature Group as a whole. Instead, each feature within the group can be saved individually.

SDK Reference

Refer to the FeatureGroup object main page or to the specific links:

Feature List Version

A Feature List Version enables a feature list to use the latest version of each of its features. When a new feature list version is created, the current default versions of the features are used unless particular feature versions are specified.

Feature List Status

Feature lists can be assigned one of five status levels to differentiate between experimental feature lists and those suitable for deployment or already deployed.

  • "DEPLOYED": Assigned to feature list with at least one deployed version.
  • "TEMPLATE": For feature lists as reference templates or safe starting points.
  • "PUBLIC_DRAFT": For feature lists shared for feedback purposes.
  • "DRAFT": For feature lists in the prototype stage.
  • "DEPRECATED": For outdated or unnecessary feature lists.

Note

The status is managed at the namespace level of a Feature List object, meaning all versions of a feature list share the same status.

For the following scenarios, some status levels are automatically assigned to feature lists:

  • when a new feature list is created, the "DRAFT" status is assigned to the feature list.
  • when at least one version of the feature list is deployed, the "DEPLOYED" status is assigned.
  • when deployment is disabled for all versions of the feature list, the "PUBLIC_DRAFT" status is assigned.

Additional guidelines:

  • Before setting a feature list status to "TEMPLATE", ensure all features in the default version are "PRODUCTION_READY".
  • Only "DRAFT" feature lists can be deleted.
  • You cannot revert a feature list status to a "DRAFT" status.
  • Once a feature list is in "DEPLOYED" status, you cannot change its status until all the associated deployments are disabled.

SDK Reference

How to:

Feature List Readiness

The Feature List Readiness metric provides a statistic on the readiness of features in the feature list version. This metric represents the percentage of features that are production ready within the given feature list.

Important

Before a feature list version is deployed, all its features must be "production ready" and the metric should be 100%.

SDK Reference

How to get the readiness metric of a feature list.

Feature List Serving

A feature list is primarily served to address a Use Case. The feature list is first used to gather historical feature values to train and test models. Once a model has been trained and validated, the feature list is deployed. Feature values can then be accessed through online and batch serving to generate predictions.

Historical Feature Serving

Historical serving of a feature list is usually intended for exploration, model training, and testing. The requested data is represented by an observation set that combines entity key values and historical points-in-time, for which you want to materialize feature values.

Requesting historical features is supported by two methods:

  • compute_historical_features(): returns a loaded DataFrame. Use this method when the output is expected to be of a manageable size that can be handled locally.
  • compute_historical_feature_table(): returns a HistoricalFeatureTable object representing the output table stored in the feature store. This method is suitable for handling large tables and storing them in the feature store for reuse or auditing.

Note

It is important to note that historical feature values are not pre-computed or stored. Instead, the serving process combines partially aggregated data stored as offline tiles. Pre-computing and storing partially aggregated data significantly reduces the compute resources required.

Observation Set

An Observation Set combines entity key values and historical points-in-time for which you want to materialize feature values.


Note

An accepted serving name must be used for the column containing the entity values.

The column containing points-in-time must be labeled "POINT_IN_TIME" and the point-in-time timestamps should be in UTC.

The observation set may also include additional data, such as the target or other data unavailable in the data warehouse to compute features on demand.

The Observation Set can be:

  • a pandas DataFrame.
  • or an ObservationTable object representing an observation set in the feature store.

Unlike the local pandas DataFrame, the ObservationTable is part of the catalog and can be shared or reused. It can be created from a source table or a view after subsampling.
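A minimal sketch of an observation set as a pandas DataFrame, following the conventions above (the serving name "CUSTOMER_ID" and the key values are hypothetical):

```python
import pandas as pd

# points-in-time must be labeled POINT_IN_TIME and expressed in UTC;
# "CUSTOMER_ID" stands in for an accepted serving name
observation_set = pd.DataFrame({
    "POINT_IN_TIME": pd.to_datetime(["2022-06-01 10:00:00", "2022-09-01 10:00:00"]),
    "CUSTOMER_ID": ["cust-001", "cust-042"],
})
```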

SDK Reference

Refer to the ObservationTable main page or to the specific links:

Observation Set Distribution

To offer a high-quality training set for Machine Learning, it is crucial to carefully select the entity key values and their corresponding points-in-time in an observation set:

  • The distribution of points-in-time must replicate the expected inference time.
  • The history of point-in-time must be sufficiently long to capture all seasonal variations.
  • The distribution of entity key values must be representative of the population that would have been subject to inference at the historical points-in-time.
  • The time interval between two points-in-time for a given entity key value must be greater than the target horizon to prevent data leakage.

Example

When developing a model to predict, once a week on Monday morning, customer churn over six months, it is recommended to choose:

  • historical points-in-time that also occur every Monday morning,
  • points-in-time that cover at least a year to capture all seasonal variations,
  • customer key values randomly selected from the population of active customers at the chosen points-in-time,
  • and time intervals between two points-in-time for a given customer key value longer than the churn horizon (six months in this scenario) to prevent data leakage.

Online and Batch Serving

The process of utilizing a feature list for making predictions is typically carried out through online or batch serving. The feature list must first be deployed and its associated Deployment object enabled. This triggers the orchestration of the feature materialization into the online feature store. The online feature store then provides pre-computed feature values for online or batch serving.

The request data of both the online and batch serving consists of the key values of one of the serving entities of the deployed feature list.

Note

An accepted serving name must be used for the column containing the entity values.

The request data does not include specific timestamps, as the point-in-time is automatically determined when the request is submitted.

A REST API service supports online feature serving. Python or shell script templates for the REST API service are retrieved from the Deployment object.

Shell template

Batch serving is supported by first creating a BatchRequestTable object in the SDK that lists the entity key values for which inference is needed. The BatchRequestTable is created from either a source table in the data warehouse or a view.

Batch feature values are then obtained in the SDK from the Deployment object and the BatchRequestTable. The output is a BatchFeatureTable that represents the batch feature values stored in the feature store and contains metadata offering complete lineage on how the table was produced.

SDK Reference

Refer to the BatchRequestTable and BatchFeatureTable main pages or to the specific links:

Feature List Deployment

A feature list is deployed to support online and batch serving. This triggers the orchestration of the feature materialization into the online feature store.

A feature list is deployed without creating separate pipelines or using different tools. The deployment complexity is abstracted away from users.

Deployment can be disabled anytime if the online and batch serving of the feature list is no longer needed. Unlike the log-and-wait approach adopted by some feature stores, disabling the deployment of a feature does not affect the serving of historical requests.

SDK Reference

Refer to the Deployment main page or to the specific links:

Feature Store

The purpose of a Feature Store is to centralize pre-calculated values, which can significantly reduce the latency of feature serving during training and inference.

FeatureByte Feature Stores are designed to integrate seamlessly with data warehouses, eliminating the need for bulk outbound data transfers that can pose security risks. Furthermore, the computation of features is performed in the data warehouse, taking advantage of its scalability, stability, and efficiency.

Pre-calculated values for online and batch serving are stored in an online feature store.

Partial aggregations in the form of online and offline tiles are also stored to streamline feature materialization for historical request and online and batch serving. This approach enables computation to be performed incrementally on tiles rather than the entire time window, leading to more efficient resource utilization.

Once a feature is deployed, the FeatureByte service automatically initiates materialization of features and tiles, scheduled based on the feature job setting of the feature.

SDK Reference

Refer to the FeatureStore object main page or to the specific links:

Tiles

Tiles are a method of storing partial aggregations in the feature store, which helps to minimize the resources required to fulfill historical and online requests. There are two types of tiles managed by FeatureByte: offline tiles and online tiles.

When a feature has not yet been deployed, offline tiles are cached following a historical feature request to reduce the latency of subsequent requests. Once the feature has been deployed, offline tiles are computed and stored according to the feature job setting.

The tiling approach adopted by FeatureByte also significantly reduces storage requirements compared to storing offline features. This is because tiles are more sparse than features and can be shared by features that use the same input columns and aggregation functions.
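The incremental computation on tiles can be illustrated with a toy sketch (this is a conceptual simplification, not the FeatureByte implementation): hourly tiles store partial sums, and a window feature sums the tiles covering the window instead of rescanning raw events.

```python
# hour index -> partial sum held by that tile (illustrative values)
tiles = {0: 10.0, 1: 5.0, 2: 0.0, 3: 7.5}

def window_sum(tiles, start_hour, end_hour):
    """Aggregate over a window by combining the tiles that cover it."""
    return sum(tiles.get(h, 0.0) for h in range(start_hour, end_hour))
```

When the window slides forward by one hour, only one new tile must be computed; the rest are reused, which is where the resource savings come from.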

Feature Job Background

FeatureByte is designed to work with data warehouses that receive regular data refreshes from operational sources, meaning that features may use data with various freshness and availability. If these operational limitations are not considered, inconsistencies between offline requests and online and batch feature values may occur.

To prevent such inconsistencies, it is crucial to synchronize the frequency of batch feature computations with the frequency of source table refreshes and to compute features after the source table refresh is fully completed. In addition, for historical serving to accurately replicate the production environment, it is essential to use data that would have been available at the historical points-in-time, considering the present or future data latency. Latency of data refers to the time difference between the timestamp of an event and the timestamp at which the event data is accessible for ingestion. Any period during which data may be missing is referred to as a "blind spot".

To address these challenges, the feature job setting in FeatureByte captures information about the frequency of batch feature computations, the timing of the batch process, and the assumed blind spot for the data. This helps ensure consistency between offline and online feature values and accurate historical serving that reflects the conditions present in the production environment.

Feature Job

A Feature Job is a batch process that generates both offline and online tiles and feature values for a specific feature before storing them in the feature store. The scheduling of a Feature Job is determined by the feature job setting associated with the respective feature.

Feature job orchestration is initiated when a feature is deployed and continues until the feature deployment is disabled, ensuring the feature store consistently possesses the latest values for each feature.

Feature Job Status

A Feature Job Status is a report on the recent activity of scheduled feature jobs associated with a feature or a feature list.

The report includes recent runs for these jobs, their success status, and the job durations.

Failed and late jobs can occur for various reasons, including insufficient compute capacity. Examine your data warehouse logs for more information on the errors. If errors result from inadequate compute capacity, consider increasing your instance size.

SDK Reference

How to get the feature job status for a feature list.

Feature Job Setting

The Feature Job Setting in FeatureByte captures essential details about batch feature computations for the online feature store, including the frequency and timing of the batch process, as well as the assumed blind spot for the data. This helps to maintain consistency between offline and online feature values and ensures accurate historical serving that reflects the production environment.

The setting comprises three parameters:

  • The frequency parameter specifies how often the batch process should run.
  • The time_modulo_frequency parameter defines the delay from the end of each frequency period to when the feature job commences. For example, a feature job with the settings (frequency: 60m, time_modulo_frequency: 130s) will start 2 minutes and 10 seconds after the beginning of each hour: 00:02:10, 01:02:10, 02:02:10, …, 15:02:10, …, 23:02:10.
  • The blind_spot parameter sets the time gap between feature computation and the latest event timestamp to be processed.

Case study: A data warehouse refreshes each hour. The data refresh starts 10 seconds after the hour and is usually finished within 2 minutes. However, sometimes the data refresh misses the latest data, up to a maximum of the last 30 seconds at the end of the hour. Therefore the feature job settings will be:

  • frequency: 60m
  • time_modulo_frequency: 10s + 2m + 5s (a safety buffer) = 135s
  • blind_spot: 30s + 10s + 2m + 5s = 165s
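The resulting schedule for the case-study settings can be verified with a small sketch (the helper and the chosen day are illustrative):

```python
from datetime import datetime, timedelta

def job_start_times(day, n=3, frequency_s=3600, time_modulo_frequency_s=135):
    """First n scheduled job starts on a given day for the case-study settings."""
    base = datetime(day.year, day.month, day.day)
    return [
        base + timedelta(seconds=i * frequency_s + time_modulo_frequency_s)
        for i in range(n)
    ]

starts = job_start_times(datetime(2023, 1, 1))
# 135s = 2 minutes 15 seconds after each hour
```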

To deal with changes in the management of the source tables, which could affect the availability or freshness of the data, a new version of the feature can be created with updated feature job settings.

While Feature Jobs are primarily designed to support online requests, this information is also used during historical requests to minimize offline-online inconsistency.

To ensure consistency of Feature Job Setting across features created by different team members, a Default Feature Job Setting is defined at the table level. However, it is possible to override this setting during feature declaration.

SDK Reference

How to declare a feature job setting.

Default Feature Job Setting

The Default Feature Job Setting establishes the default setting used by features that aggregate data in a table, ensuring consistency of the Feature Job Setting across features created by different team members. While it is possible to override the setting during feature declaration, using the Default Feature Job Setting simplifies the process of setting up the Feature Job Setting for each feature.

To further streamline the process, FeatureByte offers automated analysis of an event table's record creation and suggests appropriate setting values.

Online-Offline Inconsistency

Online-Offline (or Training-Serving) Inconsistency refers to the potential differences in the performance of a Machine Learning model during its training, deployment, or serving phases. These inconsistencies can occur due to various factors, such as differences in the data distributions, input data preprocessing, and runtime environments.

During training, the model learns patterns and relationships in the input data that allow it to make accurate predictions on the training set. However, when the model is deployed in a real-world scenario, it may encounter data different from the training data, which can lead to unexpected and potentially erroneous predictions.

To mitigate Online-Offline inconsistencies, it is important to carefully design and evaluate the model architecture, preprocessing steps, and training process to ensure that the model can generalize well to new data. Additionally, monitoring the model's performance during deployment and fine-tuning the model as necessary can help address any inconsistencies.

Feature Job Setting Recommendations

FeatureByte automatically analyzes data availability and freshness of an event table to suggest an optimal setting for scheduling Feature Jobs and associated Blind Spot information.

This analysis relies on the availability of record creation timestamps in the source table, typically added when updating data in the warehouse. Additionally, the analysis focuses on a recent time window, such as the past four weeks.

FeatureByte estimates the data update frequency based on the distribution of time intervals among the sequence of record creation timestamps. It also assesses the timeliness of source table updates and identifies late jobs using an outlier detection algorithm. By default, the recommended scheduling time takes late jobs into account.

To accommodate data that may not arrive on time during warehouse updates, a blind spot is proposed for determining the cutoff for feature aggregation windows, in addition to the scheduling frequency and time of the Feature Job. The suggested blind spot keeps the expected percentage of late data closest to the user-defined tolerance, with a default of 0.005%.

To validate the Feature Job schedule and blind spot recommendations, a backtest is conducted. You can also backtest your custom settings.