Concepts

FeatureByte Catalog

A FeatureByte Catalog operates as a centralized repository for organizing tables, entities, features, feature lists, and other objects to facilitate feature reuse and serving.

By employing a catalog, team members can effortlessly share, search, retrieve, and reuse these assets while obtaining comprehensive information about their properties.

When a data warehouse covers multiple domains, create a separate catalog for each domain to maintain clarity and easy access to domain-specific metadata.
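
As an illustration, here is a minimal SDK sketch for creating and activating a catalog; the catalog and feature store names are placeholders, so adapt them to your environment.

```python
import featurebyte as fb

# Create a catalog backed by an already-registered feature store
# ("playground" is a placeholder feature store name).
catalog = fb.Catalog.create(name="grocery", feature_store_name="playground")

# Activate an existing catalog so that subsequent SDK calls
# (table registration, feature creation, ...) operate on it.
catalog = fb.Catalog.activate("grocery")
```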

SDK Reference

Refer to the Catalog object main page or to the specific links:

User Interface

Learn by example with the 'Create Catalog' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Source Table and Special Columns

Data Source

A Data Source object in FeatureByte represents a collection of source tables that the feature store can access. From a data source, you can:

  • Retrieve the list of databases available
  • Obtain the list of schemas within the desired database
  • Access the list of source tables contained in the selected schema
  • Retrieve a source table for exploration or for registration in the catalog.
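
For example, a sketch of this navigation with the SDK; the database, schema, and table names are placeholders.

```python
import featurebyte as fb

catalog = fb.Catalog.activate("grocery")
data_source = catalog.get_data_source()

# Navigate the warehouse hierarchy: databases -> schemas -> source tables.
data_source.list_databases()
data_source.list_schemas(database_name="DEMO_DB")
data_source.list_source_tables(database_name="DEMO_DB", schema_name="GROCERY")

# Retrieve a source table for exploration or registration in the catalog.
source_table = data_source.get_source_table(
    database_name="DEMO_DB",
    schema_name="GROCERY",
    table_name="GROCERYINVOICE",
)
```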

SDK Reference

Refer to the DataSource object main page or to the specific links:

User Interface

Learn by example with the 'Register Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Source Table

A Source Table in FeatureByte is a table of interest, located in the data warehouse, that the feature store can access.

To register a Source Table in a FeatureByte catalog, first determine its type. There are five supported types: event table, item table, time series table, dimension table and slowly changing dimension table.

Note

Registering a table according to its type determines the types of feature engineering operations that are possible on the table's views and enforces guardrails accordingly.

To identify the table type and collect key metadata, Exploratory Data Analysis (EDA) can be performed on the source table. You can obtain descriptive statistics, preview a selection of rows, or collect additional information on its columns.
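
A short sketch of such an EDA pass, assuming `source_table` was retrieved from the data source as shown earlier:

```python
# Exploratory Data Analysis before registering the table.
source_table.preview(limit=10)   # materialize a small selection of rows
source_table.describe()          # descriptive statistics for each column
source_table.sample(size=100)    # random sample for a quick look at the data
```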

SDK Reference

Refer to the SourceTable object main page or to the specific links:

User Interface

Learn by example with the 'Register Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Primary Key

A Primary Key is a column that uniquely identifies each record (row) in a table.

The primary key ensures data integrity by preventing duplicate records and must meet the following requirements:

  • Unique – Each record must have a distinct primary key value.
  • Non-null – The primary key cannot contain null (empty) values.
  • Stable – The primary key value should remain unchanged over time.

FeatureByte tables can contain the following types of primary keys:

Event ID

The Event ID uniquely represents an event in the Event table.

If the Event Table contains multiple records for the same event ID (tracking status changes over time), the event ID cannot be treated as a primary key. In such cases:

  • the table should include an event status column to differentiate records.
  • the event timestamp should reflect the update time of the event status.
  • the table view should be filtered using the event status column before feature engineering.
  • the table cannot be used as a right table in joins or associated with an Item Table.

Item ID

An item ID serves as the primary key in an Item table. It typically has a one-to-many relationship with the event ID, meaning that a single event (e.g., a customer order) can be associated with multiple items (e.g., different products in that order).

Item ID Examples

Retail & E-commerce (Customer Orders)

  • Event ID: Order ID
  • Item ID: Product ID
  • Additional Attributes: Product name, quantity, price, discount, category

Healthcare (Drug Prescriptions in Doctor Visits)

  • Event ID: Visit ID
  • Item ID: Prescription ID
  • Additional Attributes: Drug name, dosage, frequency, prescribing doctor

Banking & Finance (Transaction Breakdowns)

  • Event ID: Transaction ID
  • Item ID: Line Item ID
  • Additional Attributes: Merchant, transaction type, amount, tax, currency

Logistics & Supply Chain (Shipment Details)

  • Event ID: Shipment ID
  • Item ID: Package ID
  • Additional Attributes: Weight, dimensions, destination

Dimension ID

A Dimension ID serves as the primary key in a Dimension table, uniquely identifying each record (row) in the table. Dimension IDs must be unique and stable over time to ensure data consistency and reliability in historical analysis.

Example

In a Product Dimension Table, each product would have a unique Dimension ID, ensuring that product details remain consistent across records.

Surrogate Key

In a Slowly Changing Dimension (SCD) table, a surrogate key is a unique identifier assigned to each record. It ensures a stable, system-generated identifier that remains unchanged, even as the table evolves over time.

Example

Consider an SCD Table that tracks customer addresses over time. When a customer updates their address, instead of modifying the existing record, a new record is added with the updated information.

  • The Surrogate Key acts as the primary key, uniquely identifying each record.
  • The Customer ID serves as the natural key, linking all records to a specific customer.
  • An Effective Timestamp marks when each address became valid.
  • An End Timestamp marks when each address became invalid.

Example Table:

| Surrogate Key | Customer ID (Natural Key) | Address     | Valid From          | Valid To            |
|---------------|---------------------------|-------------|---------------------|---------------------|
| 1             | 123456                    | 123 Main St | 13/01/2019 11:00:00 | 16/03/2021 10:00:00 |
| 2             | 123456                    | 456 Oak St  | 16/03/2021 10:00:00 | NULL                |

Key Insights:

  • The Surrogate Key (1, 2) uniquely identifies each row.
  • The Customer ID remains the same across records, preserving the historical link.
  • The Valid From and Valid To timestamps define the active period of each record.
  • The latest record (456 Oak St) has a NULL Valid To, indicating it is still active.

Series ID

A series ID in a time series table identifies and separates different time series within the table, ensuring each series can be grouped, analyzed, and processed independently.

Example

Imagine a time series table tracking hourly sales for multiple stores. The series ID represents each store, ensuring sales data is kept separate for analysis. For example, "Store_A" and "Store_B" would have their own series IDs, allowing you to calculate trends, forecast future sales, or analyze growth within each store independently.

Natural Key

In a Slowly Changing Dimension (SCD) table, a natural key (also called alternate key) is a column that remains constant over time and uniquely identifies each active row at any point-in-time.

This key is essential for maintaining historical records and analyzing changes over time within the table.

Example

Consider an SCD Table that tracks customer addresses over time. The Customer ID can be considered a natural key because:

  • It remains constant for a given customer.
  • It uniquely identifies each customer at any point in time.

Key Behavior:

  • At any given point in time, a Customer ID is associated with one active address.
  • Over time, multiple addresses can be linked to the same Customer ID, preserving historical changes.

Foreign Key

A Foreign Key is a column in one table that refers to the primary key in another table. It establishes a relationship between two tables.

Example

An example of a foreign key is the Customer ID column in an Orders table, which links it to the Customer table where Customer ID is the natural key.

Special Timestamp columns

Event Timestamp

The event timestamp column in an Event table records the exact time an event occurred.

  • If the event table contains multiple records for the same event ID (tracking status changes over time), the event timestamp should reflect the time when the status was updated.
  • For events spanning a period (e.g., sessions), use the timestamp corresponding to the end of the period.

Time Zone Considerations

The event timestamp must be recorded as a UTC Timestamp.

If you are using Databricks, keep in mind that FeatureByte retrieves timestamps exactly as they are stored, without adjusting for your Databricks cluster's time zone settings.

If using Snowflake, FeatureByte accepts timestamp columns that include time zone offset information.

For timestamp columns without time zone offset (or in non-Snowflake data warehouses), you can either specify a separate column for the time zone offset or define a fixed time zone offset that applies to all records. This ensures all date part transforms are based on local time instead of UTC.

Reference Datetime Column

A Reference Datetime Column in a Time Series Table serves as the primary temporal anchor for each record, indicating when a measurement or event occurred. This column is essential for time series analysis, as it establishes the time dimension necessary for ordering, aggregating, and analyzing data over time.

The column's format can vary based on the granularity of the data, including precise timestamps, dates, year-months, year-quarters, or even years.

Column Format

The Reference Datetime Column can be stored as a Timestamp or represented as a string. If represented as a string, you must specify the format specific to your data warehouse.

Time Zone Considerations

To ensure temporal accuracy, you must specify whether the column is recorded in UTC or local time.

If you are using Databricks, keep in mind that FeatureByte retrieves timestamps exactly as they are stored, without adjusting for your Databricks cluster's time zone settings.

Additionally, you must define the time zone component associated with the Reference Datetime Column. This time zone will be used for aggregations over local calendar windows.

Finally, you must define the time interval, which specifies the expected frequency of data points in the series (e.g., hourly, daily, or monthly).

Effective Timestamp

The Effective Timestamp column in a Slowly Changing Dimension (SCD) table specifies the time when the record becomes active or effective.

Example

If a customer changes their address, the effective timestamp would be the date when the new address becomes active.

Column Format

The Effective Timestamp Column can be stored as a Timestamp or represented as a string. If represented as a string, you must specify the format specific to your data warehouse.

Time Zone Considerations

To ensure temporal accuracy, you must specify if the column is recorded in local time and define its time zone component.

Expiration Timestamp

The Expiration Timestamp (or End Timestamp) column in a Slowly Changing Dimension (SCD) table specifies the time when the record is no longer valid or active.

Example

If a customer changes their address, the expiration timestamp would be when the old address is no longer valid.

Column Format & Time Zone Considerations

Similar to the Effective Timestamp, the Expiration Timestamp can be stored as a Timestamp or represented as a string.

Time Leakage Consideration

While the Expiration Timestamp is useful for data management, it cannot be used for feature engineering, as it represents future information unknown during inference time, potentially causing time leakage. For this reason, the column is automatically discarded when generating views from tables.

Record Creation Timestamp

A Record Creation Timestamp refers to the time when a particular record was created in the data warehouse. The record creation timestamp is usually automatically generated by the system when the record is first created, but a user or an administrator can manually set it.

Note

While this column is useful for data management, it is usually not used for feature engineering because it is sensitive to changes in data management that are typically unrelated to the target to predict. It may also cause feature drift and an undesirable impact on predictions. For this reason, the column is discarded by default when views are generated from tables.

The information is, however, used to analyze the data availability and freshness of the tables to help with the configuration of their default feature job setting.

Time Zone Component

The time zone component defines how date-time columns are interpreted:

  • If the date-time column is recorded in UTC, it will be converted to local time.
  • If the date-time column is recorded in local time, it will be converted to UTC.

Note

If you are using Databricks, keep in mind that FeatureByte retrieves timestamps exactly as they are stored, without adjusting for your Databricks cluster's time zone settings.

Ways to Specify Time Zones

You can specify the time zone in one of the following ways:

  • Time Zone Name: Use a standard name from the International Time Zone Database (e.g., "America/New_York", "Asia/Singapore").
  • Time Zone Offset: Define an offset from UTC (e.g., +08:00, -05:00).
  • Fixed Time Zone: Apply a single time zone uniformly to all records at the table level.
  • Per-Record Time Zone: Specify a time zone column in the table to assign individual time zones to each record.

Daylight Saving Time Zone

Daylight saving time (DST) is managed using time zones that are defined by the International Time Zone Database (commonly known as the IANA Time Zone Database or tz database).

Examples

  • America/New_York: Eastern Time Zone in the United States, which observes DST.
    • Standard Time (EST): UTC-5
    • Daylight Time (EDT): UTC-4
  • Europe/London: United Kingdom, which observes DST.
    • Standard Time (GMT): UTC+0
    • Daylight Time (BST): UTC+1
  • Asia/Kolkata: India, which does not observe DST.
    • Standard Time: UTC+5:30

Time Zone Offset

A time zone offset, also known as a UTC offset, is a difference in time between Coordinated Universal Time (UTC) and a local time zone. The offset is usually expressed as a positive or negative number of hours and minutes relative to UTC.

Example

If the local time is 3 hours ahead of UTC, the time zone offset would be represented as "+03:00". Similarly, if the local time is 2 hours behind UTC, the time zone offset would be represented as "-02:00".

Note

When you register an Event table, you can specify a separate column that provides the time zone offset information. By doing so, all date part transforms on the event timestamp column will be based on local time instead of UTC.

The required format for the column is "(+|-)HH:mm".

Timestamp with Time Zone Offset

The Snowflake data warehouse supports a timestamp type with time zone offset information (TIMESTAMP_TZ). FeatureByte recognizes this timestamp type, and date part transforms for columns or features using timestamps with time zone offset are based on local time instead of UTC.

Time Series Reference Time Zone

The Time Series Reference Time Zone is determined as follows:

  • It is the Daylight Saving Time Zone of the Time Series Reference Datetime Column if the Datetime column is associated with a unique time zone.
  • If the Datetime column is associated with multiple time zones, it is the westernmost time zone among those listed in the column that specifies the time zone of the Datetime column values.

    Westernmost Time Zone Example

    Suppose you have a dataset with a user_time_zone column, where users are located in different time zones such as America/New_York, America/Chicago, and America/Los_Angeles. The reference time zone should be America/Los_Angeles, as it is the westernmost among them.

This reference time zone is critical for defining calendar-based aggregation periods (e.g., daily, weekly, or monthly) during feature computation.

Example

Consider the following scenario:

  • Scheduled job time: 2025/01/31 23:00 UTC
  • Reference time zone: Asia/Singapore

Outcome:

  • The corresponding calendar date is 2025/02/01.
  • The aggregation for the latest complete month would include data from January.

String-Based DateTime Format

If a datetime column is represented as a string, you must specify the format specific to your data warehouse. Examples include:

  • Databricks (Spark SQL): "yyyy-MM-dd HH:mm:ss". Reference.
  • Snowflake: "YYYY-MM-DD HH24:MI:SS". Reference.
  • BigQuery: "%Y-%m-%d %H:%M:%S". Reference.

Active Flag

The Active Flag (also known as Current Flag) column in a Slowly Changing Dimension (SCD) table is used to identify the current version of the record.

Example

If a customer changes their address, the active flag would be set to 'Y' for the new address and 'N' for the old address.

Note

While this column is useful for data management, it cannot be used for feature engineering because its value changes over time and may differ between training and inference time, potentially causing time leakage. For this reason, the column is discarded by default when views are generated from tables.

FeatureByte Tables

Table

A Table in FeatureByte represents a source table and provides a centralized location for metadata for that table. This metadata determines the type of operations that can be applied to the table's views.

Important

A source table can only be associated with one active table in the catalog at a time. This means that the active table in the catalog is the source of truth for the metadata of the source table. If a table in the catalog becomes deprecated, it can be replaced with a new table in the catalog that has updated metadata.

Table Registration

To register a table in a catalog, first determine its type. The table's type determines the feature engineering operations that are possible on the table's views and enforces guardrails accordingly. FeatureByte recognizes five table types:

  • Event Table: Captures unique events, where each row represents a distinct event at a specific point in time.
  • Item Table: Provides detailed breakdowns or components related to a primary event.
  • Time Series Table: Stores data recorded at regular intervals, with each time series identified by a series_id. The table may contain aggregated data, regularly occurring events, or regular snapshots, typically analyzed using calendar-based window units (e.g., calendar day, calendar week, calendar month).
  • Slowly Changing Dimension Table (SCD): Tracks historical changes in specific attributes of an entity over time, maintaining both current and historical records.
  • Dimension Table: Contains static descriptive data for classification or metadata purposes, used only when the attributes do not change over time.

Optionally, you can include additional metadata at the column level after creating a table to further support feature engineering. This could involve tagging columns with related entity references, updating column descriptions, tagging semantics, or defining default cleaning operations.

User Interface

Learn by example with the 'Register Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Event Table

An Event Table represents a table in the data warehouse where each row corresponds to a unique event occurring at a specific point-in-time.

Examples

Event Tables can take various forms across different industries, such as:

  • E-commerce: Order table
  • Banking: Credit card transactions table
  • Healthcare: Doctor visits table
  • Internet: Clickstream data

Creating an Event Table in FeatureByte

To create an Event Table in FeatureByte, you must specify the event timestamp, which indicates when the event occurred.

Additionally, you may identify an event ID, which serves as a unique identifier for each event.
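
A hedged sketch of registering an Event Table from a source table; the column names are illustrative, and the record creation timestamp is an optional column discussed further below.

```python
invoice_table = source_table.create_event_table(
    name="GROCERYINVOICE",
    event_id_column="GroceryInvoiceGuid",    # unique identifier for each event
    event_timestamp_column="Timestamp",      # when the event occurred (UTC)
    record_creation_timestamp_column="record_available_at",  # optional
)
```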

Event Table Tracking Event Status

Some Event Tables may contain multiple records for the same event ID, tracking changes in its status over time.

Best Practice: Ideally, register two separate tables:

Alternative Approach: If splitting the table is not feasible, you can still register a single table as an Event Table, but in such cases:

  • the table must include an event status column to differentiate records.
  • the event timestamp should reflect the update time of the event status.
  • the table view should be filtered using the event status column before feature engineering.
  • the table cannot be used as a right table in joins or associated with an Item Table.

Additionally, you may specify a record creation timestamp to enable automatic analysis of data availability and freshness. This analysis helps in selecting the default feature job setting which defines the scheduling of feature computation associated with the Event table.

User Interface

Learn by example with the 'Register Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Item Table

An Item Table represents a table in the data warehouse containing detailed information about a primary event.

Examples

An Item table may contain information about:

  • Product Items purchased in Customer Orders
  • or Drug Prescriptions issued during Doctor Visits by Patients.

Typically, an Item table has a 'one-to-many' relationship with an Event table. Despite not explicitly including a timestamp, it is inherently linked to an event timestamp through its association with the Event table.

Creating an Item Table in FeatureByte

To create an Item Table, you must specify an event ID and determine which Event table is associated with the Item table.
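
A hedged sketch of registering an Item Table, assuming the related Event Table was registered earlier; column and table names are illustrative.

```python
items_table = source_table.create_item_table(
    name="INVOICEITEMS",
    event_id_column="GroceryInvoiceGuid",     # foreign key to the Event Table
    item_id_column="GroceryInvoiceItemGuid",  # primary key of the Item Table
    event_table_name="GROCERYINVOICE",        # associated Event Table
)
```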

SDK Reference

How to register an item table.

User Interface

Learn by example with the 'Register Tables' tutorial of the Grocery UI tutorials.

Time Series Table

A Time Series Table stores data recorded at regular intervals, with each time series identified by a series_id. The table may contain:

  • Aggregated data
  • Regular snapshots capturing stateful information at each time interval
  • Regularly occurring events

Time series data is typically analyzed using calendar-based window units (e.g., calendar day, calendar week, calendar month).

Note

Some tables can be registered either as an Event Table or a Time Series Table when they contain regular events. If the events occur monthly, prefer registering them as a Monthly Time Series to enforce calendar month aggregation that aligns with the event frequency.

Examples

Common examples of time series tables include:

  • Retail: Daily sales records.
  • Banking: Credit card balance snapshots.
  • Weather: Hourly temperature measurements.
  • Finance: Stock price history.

Creating a Time Series Table in FeatureByte

To create a Time Series Table in FeatureByte, you must define the following columns:

  1. The Reference Datetime Column: The primary temporal anchor for each record.
  2. The Series ID (optional): Identifies distinct time series within the table if your dataset contains multiple series.

Additionally, you will configure a Cron default feature job setting to ensure feature computations align with time series updates.
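
The sketch below is illustrative only: the parameter names and the TimeInterval helper are assumptions based on the description above, not a verified signature, so consult the Table SDK reference before use.

```python
sales_table = source_table.create_time_series_table(
    name="DAILY_STORE_SALES",
    reference_datetime_column="SaleDate",                # primary temporal anchor (assumed name)
    series_id_column="StoreID",                          # optional series identifier (assumed name)
    time_interval=fb.TimeInterval(value=1, unit="DAY"),  # expected frequency (assumed helper)
)
```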

SDK Reference

Refer to the Table object main page or to the specific links:

User Interface

Learn by example with the 'Register Tables' tutorial of the Credit Default UI tutorials.

Slowly Changing Dimension (SCD) Table

An SCD Table represents a table in a data warehouse that stores slowly and unpredictably changing data over time.

There are two main types of SCD Tables:

  • Type 1: Overwrites old data with new data
  • Type 2: Maintains a history of changes by creating a new record for each change.

FeatureByte only supports using Type 2 SCD Tables since Type 1 SCD Tables may cause data leaks during model training and poor performance during inference.

A Type 2 SCD Table employs effective and end (or expiration) timestamp columns to determine whether the row is active. In some cases, an active flag column may be used instead of an expiration timestamp column to indicate whether a row is currently valid.

Example

Here is an example of a Type 2 SCD table for tracking changes to customer information:

| Customer ID | First Name | Last Name | Address       | City          | State | Zip Code | Valid From          | Valid To            |
|-------------|------------|-----------|---------------|---------------|-------|----------|---------------------|---------------------|
| 123456      | John       | Smith     | 123 Main St   | San Francisco | CA    | 12345    | 13/01/2019 11:00:00 | 16/03/2021 10:00:00 |
| 123456      | John       | Smith     | 456 Oak St    | Oakland       | CA    | 67890    | 16/03/2021 10:00:00 | NULL                |
| 789012      | Jane       | Doe       | 789 Maple Ave | New York City | NY    | 34567    | 15/09/2020 10:00:00 | NULL                |

In this example, each row represents a specific version of customer information. The customer entity is identified by the natural key "Customer ID". If a customer's information changes, a new row is added to the table with the updated information, along with an effective timestamp ("Valid From" column) and end timestamp ("Valid To" column) to indicate the period during which that version of the information was active. The end timestamp is NULL for the current version of the information, indicating that it is still active.

For example, the customer with ID 123456 initially had an address of 123 Main St in San Francisco, CA, but then changed his address to 456 Oak St in Oakland, CA on 16/03/2021. This change is reflected in the SCD table by adding a new row with the updated address and Valid From of 16/03/2021 10:00:00, and a Valid To with the same timestamp for the previous version of the address.

Creating an SCD Table in FeatureByte

To create an SCD Table in FeatureByte, you need to specify an Effective Timestamp that indicates when a record became active.

Optionally, you may specify:
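
As an illustration, a hedged sketch of SCD table registration that includes the required effective timestamp together with a few commonly specified optional columns (end timestamp, surrogate key, record creation timestamp); column names are placeholders.

```python
customer_table = source_table.create_scd_table(
    name="GROCERYCUSTOMER",
    natural_key_column="GroceryCustomerGuid",
    effective_timestamp_column="ValidFrom",
    end_timestamp_column="ValidTo",
    surrogate_key_column="RowID",
    record_creation_timestamp_column="record_available_at",
)
```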

SDK Reference

How to register an SCD table.

User Interface

Learn by example with the 'Register Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Dimension Table

A Dimension Table represents a table in the data warehouse containing static descriptive data.

Important

Using a Dimension table requires special attention. If the data in the table changes slowly, it is not advisable to use it because these changes can cause significant data leaks during model training and adversely affect the inference performance. In such cases, it is recommended to use a Type 2 Slowly Changing Dimension table that maintains a history of changes.

To create a Dimension Table in FeatureByte, it is necessary to identify which column represents its primary key, also referred to in FeatureByte as the dimension ID.
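
A hedged sketch of registering a Dimension Table (names are illustrative):

```python
product_table = source_table.create_dimension_table(
    name="GROCERYPRODUCT",
    dimension_id_column="GroceryProductGuid",  # primary key of the dimension table
)
```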

SDK Reference

How to register a dimension table.

User Interface

Learn by example with the 'Register Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Table Status

When a table is registered in a catalog, its status is set to 'PUBLIC_DRAFT' by default. Once the table is prepared for feature engineering, you can modify its status to 'PUBLISHED'. If a table needs to be deprecated, you can update its status to 'DEPRECATED'.
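
Assuming the Table object exposes an update_status method as described here, a status change might look like the following sketch (not a verified signature):

```python
# Promote the table once it is ready for feature engineering ...
invoice_table.update_status("PUBLISHED")
# ... and retire it later if needed.
invoice_table.update_status("DEPRECATED")
```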

User Interface

Learn by example with our 'Manage feature life cycle' UI tutorials.

Table Columns Metadata

Table Column

A Table Column refers to a specific column within a table. You can add metadata to the column to help with feature engineering, such as tagging the column with entity references, updating the column description, tagging semantics, or defining default cleaning operations.

SDK Reference

Refer to the TableColumn object main page or to the specific links:

User Interface

Learn by example with the 'Update descriptions and Tag Semantics' tutorial of the Credit Default UI tutorials and Grocery UI tutorials and the 'Set Default Cleaning Operations' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Entity Tagging

The Entity Tagging process involves identifying the specific columns in tables that identify or reference a particular entity.

These columns are typically primary keys, natural keys, or foreign keys of the table, but this is not always the case.

Example

Consider a database for a company that consists of 2 SCD tables: one table for employees and one table for departments. In this database,

  • the natural key of the employees table identifies the Employee entity.
  • the natural key of the departments table identifies the Department entity.
  • the employees table may also have a foreign key column referencing the Department entity.
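
A sketch of how this example could be tagged with the SDK, assuming both SCD tables are already registered; entity, table, and column names are placeholders.

```python
# Register the entities with their serving names.
catalog.create_entity(name="employee", serving_names=["EMPLOYEE_ID"])
catalog.create_entity(name="department", serving_names=["DEPARTMENT_ID"])

employee_table = catalog.get_table("EMPLOYEES")
department_table = catalog.get_table("DEPARTMENTS")

# Natural keys identify their own entity ...
employee_table.EmployeeID.as_entity("employee")
department_table.DepartmentID.as_entity("department")

# ... and the foreign key in the employees table references the parent entity,
# which establishes the employee -> department parent-child relationship.
employee_table.DepartmentID.as_entity("department")
```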

User Interface

Learn by example with the 'Register Entities' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Cleaning Operations

Cleaning Operations determine the procedure for cleaning data in a table column before performing feature engineering. The cleaning operations can either be set as a default operation in the metadata of a table column or established when creating a view in manual mode.

These operations specify how to manage the following scenarios:

  • String-based datetime format
  • Missing values
  • Disguised values
  • Values that are not in an anticipated list
  • Numeric values and dates that are out of boundaries
  • String values when numeric values are expected
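
For example, default cleaning operations can be set in a table column's metadata roughly as follows; the column name and thresholds are illustrative.

```python
import featurebyte as fb

items_table = catalog.get_table("INVOICEITEMS")
items_table.Quantity.update_critical_data_info(
    cleaning_operations=[
        fb.MissingValueImputation(imputed_value=0),        # handle missing values
        fb.ValueBeyondEndpointImputation(                  # handle out-of-bound values
            type="less_than", end_point=0, imputed_value=0
        ),
    ]
)
```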

If changes occur in the data quality of the source table, new versions of the feature can be created with new cleaning operations that address the new quality issues.

Cleaning Operations Approval

In Catalogs with Approval Flow enabled, changes in table metadata initiate a review process. This process recommends new versions of features and lists linked to these tables, ensuring that new models and deployments use versions that address any data issues.

SDK Reference

How to:

User Interface

Learn by example with the 'Set Default Cleaning Operations' tutorial of the Credit Default UI tutorials and Grocery UI tutorials and the 'Manage feature life cycle' UI tutorials.

Column Semantics

Recognizing the semantics of data fields and tables is essential for effective and reliable feature engineering. Without this understanding, there's a risk of creating irrelevant or misleading features, and missing out on key insights. Here are some examples of common errors due to misunderstanding data semantics:

  • Incorrectly applying 'sum' to intensity measurements, like patient temperatures in a doctor's visit table.
  • Misinterpreting a weekday column as numerical and using inappropriate operations like sum, average, or max, instead of more suitable ones like count per weekday, most frequent weekday, weekdays entropy, or unique count.

To guide users in choosing the right feature engineering techniques, FeatureByte introduces a semantic layer for each registered table. This layer encodes the semantics of data fields using a specially designed data ontology, tailored for feature engineering.

Feature Ideation assists in this process for enterprise users. It uses Generative AI to analyze metadata from tables and columns and proposes semantic tags for each column. This semantic tagging is then used by Feature Ideation to suggest relevant data aggregations, filters, and feature combinations during feature ideation.

User Interface

Learn by example with the 'Update descriptions and Tag Semantics' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Key Numeric Aggregation Column

A 'Key Numeric Aggregation Column' is a crucial numeric column within a table that is invaluable for constructing aggregated features. This column usually comprises additive values like counts, sums, or durations, which are ideal for summarization tasks. It acts as a key component for aggregating metrics across different dimensions: specifically, it allows for the computation of sums across grouped categories defined by categorical columns. This aggregation is vital for deciphering patterns and trends within data subgroups. The features generated from such aggregations can be directly applied or further processed for in-depth analyses, such as evaluating diversity, assessing stability, or identifying key categories. Additionally, the 'Key Numeric Aggregation Column' enriches analyses that rely on counts by offering deeper insights into the distribution across these categories.

Feature Ideation assists in the identification of these columns for enterprise users.

Examples:

Total Transaction Amount by Transaction Description

Suppose we have a dataset containing credit card transactions with columns like CardID, TransactionDescription, and Amount. By using the "Amount" column as the Aggregation Metric, we can create a feature that aggregates the total transaction amount for each distinct transaction description, per card.

| CardID | Feature                                                              |
|--------|----------------------------------------------------------------------|
| Card1  | {'Retail Purchase': 500, 'Restaurant': 300, 'Online Shopping': 700}  |
| Card2  | {'Retail Purchase': 400, 'Online Shopping': 600}                     |

Total Count by Transaction Description

Alternatively, using counts as the Aggregation Metric can capture the frequency of transactions for each distinct transaction description, per card.

| CardID | Feature                                                        |
|--------|----------------------------------------------------------------|
| Card1  | {'Retail Purchase': 3, 'Restaurant': 2, 'Online Shopping': 2}  |
| Card2  | {'Retail Purchase': 1, 'Online Shopping': 3}                   |
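
A sketch of how such cross aggregations might be declared with the SDK, assuming an event view with CardID, TransactionDescription, and Amount columns; the window and feature names are illustrative.

```python
# Sum of Amount per TransactionDescription, per card, over a rolling 28-day window.
amount_by_description = transactions_view.groupby(
    "CardID", category="TransactionDescription"
).aggregate_over(
    value_column="Amount",
    method="sum",
    windows=["28d"],
    feature_names=["CardAmountByDescription_28d"],
)

# Counts instead of sums capture transaction frequency per description.
count_by_description = transactions_view.groupby(
    "CardID", category="TransactionDescription"
).aggregate_over(
    value_column=None,
    method="count",
    windows=["28d"],
    feature_names=["CardCountByDescription_28d"],
)
```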

Table Catalog

The Tables registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

User Interface

Table Catalog

Entities and Relationships

Entity

An Entity is a real-world object or concept represented or referenced by columns in your source tables.

Examples

Common examples of entities include customer, merchant, city, product, and order.

In FeatureByte, entities are used to:

Note

While entities are typically associated with a primary key or foreign key in the data, they can also be represented by categorical columns that define groups of related objects. For example, a City entity may represent multiple customers, and a Product Group entity may encompass multiple products, even though neither is explicitly used as a foreign key.

SDK Reference

Refer to the Entity object main page and how to add a new entity to a catalog.

User Interface

Learn by example with the 'Register Entities' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Entity Serving Name

An Entity's Serving Name is the name of the unique identifier used to identify the entity during a preview or serving request. It is also the name of the column representing the entity in an observation set. Typically, the serving name for an entity is the name of the primary key (or natural key) of the table that represents the entity. An entity can have multiple serving names for convenience, but the unique identifier should remain unique.

SDK Reference

How to get the serving names of an entity.

User Interface

Learn by example with the 'Register Entities' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Feature Primary Entity

The Primary Entity of a feature defines the level of analysis for that feature.

The Primary Entity is usually a single entity. However, there are cases where it may be a tuple of entities.

An example of when the primary entity becomes a tuple of entities is when a feature results from aggregating data based on those entities to measure interactions between them.

Example

Entity Diagram

Suppose a feature quantifies the past interaction between a customer entity and a merchant entity, such as the sum of transaction amounts grouped by customer and merchant over the past four weeks.

The primary entity of this feature is the tuple of customer and merchant.

When a feature is derived from features with different primary entities, the primary entity is determined by the entity relationships between the entities. The lowest level entity in the hierarchy is selected as the primary entity. If the entities have no relationship, the primary entity becomes a tuple of those entities.

Example

Entity Diagram

Consider two entities: customer and customer city, where the customer entity is a child of customer city entity. If a new feature is created that compares a customer's basket with the average basket of customers in the same city, the primary entity for that feature would be the customer entity. This is because the customer entity is a child of the customer city entity and the customer city entity can be deduced automatically.

Alternatively, if two entities, such as customer and merchant, do not have any relationship, the primary entity for a feature that calculates the distance between the customer location and the merchant location would be the tuple of customer and merchant entities. This is because the two entities do not have any parent-child relationship.

SDK Reference

How to get the primary entity of a feature.

Feature List Primary Entity

The primary entity of a feature list determines the entities that can be used to serve the feature list, which typically corresponds to the primary entity of the Use Case that the feature list was created for.

If the features within the list pertain to different primary entities, the primary entity of the feature list is selected based on the entities relationships, with the lowest level entity in the hierarchy chosen as the primary entity. In cases where there are no relationships between entities, the primary entity may become a tuple comprising those entities.

Example

Entity Diagram

Consider a feature list containing features related to card, customer, and customer city. In this case, the primary entity is the card entity since it is a child of both the customer and customer city entities.

However, if the feature list also contains merchant and merchant city features, the primary entity is a tuple of card and merchant.

SDK Reference

How to get the primary entity of a feature list.

Serving Entity

A Serving Entity is any entity that can be used to preview or serve a feature or feature list, regardless of whether it is the primary entity. Serving entities associated with a feature or feature list are typically descendants of the primary entity and uniquely identify the primary entity.

Example

Entity Diagram

Suppose that customer is the primary entity for a feature; the serving entities for that feature could include related entities such as the card and transaction entities, which are children or grandchildren of the customer entity and uniquely identify the customer.

Use Case Primary Entity

In a Use Case, the Primary Entity is the object or concept that defines its problem statement and Context. Usually, this entity is singular, but in cases such as recommendation engines, it can be a tuple of entities such as (Customer, Product).

Observation Table Primary Entity

An Observation Table Primary Entity is the entity of the Context or Use Case the table represents.

To utilize an Observation Table for computing historical feature values of a feature list, its Primary Entity must match the feature list's primary entity or be a related serving entity.

Entity Relationship

The parent-child relationship and the supertype-subtype relationship are the two main types of Entity Relationships that can assist feature engineering and feature serving.

The parent-child relationship is automatically established in FeatureByte during the entity tagging process, while identifying supertype-subtype relationships requires manual intervention.

These relationships can be used to suggest, facilitate and verify joins during feature engineering and streamline the process of serving feature lists containing multiple entity-assigned features.

Important

Note that FeatureByte only supports parent-child relationships currently. Nevertheless, it is expected that supertype-subtype relationships will also be supported shortly, thus enabling more efficient feature engineering and feature serving.

SDK Reference

Refer to the Relationship object main page or to the specific links:

User Interface

Learn by example with the 'Register Entities' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Parent-Child Relationship

A Parent-Child Relationship is a hierarchical connection that links one entity (the child) to another (the parent). Each child entity key value can have only one parent entity key value, but a parent entity key value can have multiple child entity key values.

Example

Examples of parent-child relationships include:

  • Hierarchical organization chart: A company's employees are arranged in a hierarchy, with each employee having a manager. The employee entity represents the child, and the manager entity represents the parent.
  • Product catalog: In an e-commerce system, a product catalog may be categorized into categories and subcategories. Each category or subcategory represents a child of its parent category.
  • Geographical hierarchy: In a geographical data model, cities are arranged in states, which are arranged in countries. Each city is the child of its parent state, and each state is the child of its parent country.
  • Credit Card hierarchy: A credit card transaction is the child of a card and a merchant. A card is a child of a customer. A customer is a child of a customer city. And a merchant is a child of a merchant city.

Entity Diagram

Note

In FeatureByte, the parent-child relationship is automatically established when the primary key (or natural key in the context of an SCD table) identifies one entity. This entity is the child entity. Other entities referenced in the table are identified as parent entities.

Supertype-Subtype Relationship

In a data model, a Supertype-Subtype Relationship is a hierarchical relationship between two or more entity types where one entity type (the subtype) inherits attributes and relationships from another entity type (the supertype).

The subtype entity is typically a more specialized version of the supertype entity, representing a subset of the data that applies to a particular domain. Although the subtype entity inherits properties and relationships from the supertype entity, it can also have its own unique attributes or relationships.

Examples

Here are a few examples of supertype-subtype relationships involving a person, student, and teacher:

  1. Person is the supertype, while student and teacher are both subtypes of person.
  2. Student is a subtype of person. This is because a student is a specific type of person who is enrolled in a school or university.
  3. Teacher is also a subtype of person, since a teacher is a specific type of person responsible for educating and instructing students.
  4. A more specific subtype of student could be a graduate student, which refers to a student who has already completed a bachelor's degree and is pursuing a higher-level degree.
  5. Another subtype of teacher could be a professor, typically a teacher with a higher academic rank and significant experience in their field.

Supertype-subtype relationships describe how a more general category (the supertype) can be divided into more specific subcategories (the subtypes). In this case, a person is the most general category, while student and teacher are more specific categories that fall under the broader umbrella of "person."

Entity Catalog

The Entities registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

User Interface

Entity Catalog

Use Case Formulation

Target

In Machine Learning, a "target" refers to the outcome that the model is being trained to predict. It's a critical component in supervised learning, where the goal is to create a model that can accurately forecast or classify the target based on the patterns it identifies in the input features.

In FeatureByte, a target can be established in two ways:

  • Descriptive Approach: You directly outline your prediction goal.
  • Logical Approach: This technique calculates targets within FeatureByte, mirroring the process of creating features.

SDK Reference

Refer to the Target object main page and how to create a descriptive target

User Interface

Learn by example with the 'Create Use Cases' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Target Logical Plan

The process for establishing a logical plan for a Target closely mirrors that for creating features, with a critical difference: the plan for a Target utilizes forward operations, in contrast to the backward operations applied in feature creation.

Target objects, built upon View objects, come in three varieties:

  1. Lookup Targets: Directly retrieve values from view attributes for a future point in time.
  2. Forward Window-based Aggregate Targets: Use forward-looking aggregations over grouped data.
  3. Aggregate Targets as a Future Point-in-Time: Apply aggregations at a designated future moment.

Additionally, targets can emerge as transformations of existing Target objects, offering various ways to define what you want to predict.

Target Definition File

The target definition file is the single source of truth for a target. This file is automatically generated when a target is declared in the SDK or a new version is derived.

The syntax used in the SDK is also used in the target definition file. The file provides an explicit outline of the intended operations of the target declaration, including those inherited but not explicitly declared by you. These operations may include cleaning operations inherited from table metadata.

The target definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for target materialization.

SDK Reference

How to obtain the target definition file.

User Interface

Learn by example with the 'Create Use Cases' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Target Materialization

Materializing target values in FeatureByte using observation sets can be done through two distinct approaches:

User Interface

Learn by example with the 'Create Observation Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Target Catalog

The Targets registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

User Interface

How to list registered Targets.

Target Catalog

Context

A Context defines the scope and circumstances in which features are expected to be served.

Examples

Contexts can vary significantly. For instance:

  • Batch Predictions Context: Making weekly batch predictions for an active customer that has made at least one purchase over the past 12 weeks.
  • Real-Time Predictions Context: Offering real-time predictions for a credit card transaction that has been recently processed.

While creating a basic context requires only identifying the relevant entity, adding a detailed description is beneficial. This should ideally cover:

  • Contextual Subset Details: Characteristics of the entity subset being targeted.
  • Serving Timing: Insights into when predictions are needed, whether in batch or real-time scenarios.
  • Inference Data Availability: What data is available at the time of inference.
  • Constraints: Any legal, operational, or other constraints that might impact the context.

SDK Reference

Refer to the Context object main page and how to create a context.

User Interface

Learn by example with the 'Create Use Cases' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Context Association with Observation Table

After defining a Context, it can be linked to an Observation Table. This process enables the observation table to act as the default preview/eda table for the Context. Additionally, all observation tables associated with the Context can be listed.

User Interface

Learn by example with the 'Create Observation Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Context Catalog

The Contexts registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

User Interface

How to list registered Contexts.

Context Catalog

Use Case

A Use Case formulates the modelling problem by associating a Context with a Target. Use Cases facilitate the organization of your observation tables, feature tables and deployments. Use Cases also play a crucial role in Feature Ideation, enabling it to provide tailored feature suggestions.

To construct a new Use Case, the following information is required:

  1. Select a Context: Choose a registered Context that defines the environment of your Use Case.

  2. Define a Target: Specify a registered Target that represents the goal of your Use Case.

Note

The context and target must correspond to the same entities.

For a comprehensive Use Case setup, provide a detailed description of the use case, context, and target. This ensures better documentation and enhances the effectiveness of Feature Ideation in suggesting relevant features and assessing their semantic relevance.

SDK Reference

Refer to the Use Case object main page or to the specific links:

User Interface

Learn by example with the 'Create Use Cases' tutorials of the Credit Default UI tutorials and Grocery UI tutorials.

Use Case Association with Observation Table

Observation tables are automatically linked to a Use Case when they are derived from:

  • an observation table that is linked to the use case's Context
  • a target that is linked to the use case

An observation table can be manually linked to the Use Case to support cases where the observation table is not derived from another observation table.

This process enables the observation table to act as the default preview/eda table for the Use Case. Additionally, all observation tables associated with the Use Case can be listed.

Use Case Association with Feature Table

Feature tables are automatically associated with use cases via the observation tables they originate from.

Feature tables associated with a use case can be listed easily from the Use Case object.

Use Case Association with Deployment

A deployment is associated with a use case when the use case is specified during the deployment of the related feature list.

Deployments associated with a use case can be listed easily from the Use Case object.

Use Case Catalog

The Use Cases registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

User Interface

How to list registered Use Cases.

Use Case Catalog

Observation Set

An Observation Set is essentially a collection of historical data points that serve as a foundation for learning. Think of it as the backbone of a training dataset. Its primary role is to process and compute features, which then form the training data for Machine Learning models. For a given use case, the same Observation Table is often employed in multiple experiments. However, the specific features chosen and the Machine Learning models applied may vary between these experiments.

Each data point represents a historical moment for a particular entity and may include target values.

Observation Set

Ideally, an observation set should be explicitly linked to a specific Context or Use Case, ensuring thorough documentation and facilitating its reuse.

Other important considerations when constructing an Observation Set are:

  1. Choosing the Right Entity Key Values: Select values that represent your target population accurately for each historical timestamp.
  2. Accuracy in Timestamps: Ensure all timestamps are in Coordinated Universal Time (UTC) and cover a sufficient range to depict seasonal changes. They should represent the expected time distribution in real-world scenarios.
  3. Maintaining Data Integrity: Avoid time leakage (future data in the training set) by spacing out your timestamps correctly.

Example

To predict customer churn every Monday morning over six months, you might:

  • Use historical timestamps from Monday mornings of the past years
  • Choose customer keys randomly from the active customer base at those times.
  • Set intervals longer than six months between data points for each customer to avoid time leakage.

Technical Details

  • The entity values column should have an accepted serving name.
  • Label the timestamps column as "POINT_IN_TIME" and use UTC.
  • In FeatureByte, an Observation Set can be a pandas DataFrame or an Observation Table object from the feature store.

Once an Observation Set is defined, you can use it to materialize a feature list into historical feature values to form a training or testing set for your Machine Learning model.
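
For instance, a minimal observation set built as a pandas DataFrame might look like this sketch, assuming a feature list whose primary entity is served under the GROCERYCUSTOMERGUID serving name:

```python
import pandas as pd

observation_set = pd.DataFrame({
    # UTC timestamps under the reserved "POINT_IN_TIME" column.
    "POINT_IN_TIME": pd.to_datetime(["2023-01-02 08:00:00", "2023-06-05 08:00:00"]),
    # Entity key values under the entity's accepted serving name.
    "GROCERYCUSTOMERGUID": ["customer-id-1", "customer-id-2"],
})

# Materialize a feature list over the observation set to build training data.
training_data = feature_list.compute_historical_features(observation_set)
```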

Observation Table

An Observation Table is an observation set integrated into the catalog. It can be created from various sources and is essential for sharing and reusing data within the feature store.

User Interface

Learn by example with the 'Create Observation Tables' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Observation Table Association with a Context or Use Case

Once added to the catalog, an Observation Table can be linked to specific Contexts or Use Cases.

For Use Case linkage, you can include the Use Case's Target values by materializing them with a table associated with its Context.

Observation Table Purpose

Tagging an Observation Table with purposes like 'preview', 'eda', 'training' or 'validation_test' facilitates its identification and reuse.

Default eda and preview tables can also be set for a Context or a Use Case.

Observation Table Catalog

The Observation Tables registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

Views and Column Transforms

View

A view is a local virtual table that can be modified and joined to other views to prepare data before feature definition. A view does not contain any data of its own.

Views in FeatureByte allow operations similar to Pandas, such as:

Unlike Pandas DataFrames, which require loading all data into memory, views are materialized only when needed during previews or feature materialization.

View Creation

When a view is created, it inherits the metadata of the FeatureByte table it originated from. Currently, five types of views are supported:

Two view construction modes are available:

  • Auto (default): Automatically cleans data according to default operations specified for each column within the table and excludes special columns not meant for feature engineering.
  • Manual: Allows custom cleaning operations without applying default cleaning operations.

Although views provide access to cleaned data, you can still perform data manipulation using the raw data from the source table. To do this, utilize the view's raw attribute, which enables you to work directly with the unprocessed data of the source table.
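
A hedged sketch of the two construction modes and of the raw attribute; column names and cleaning operations are illustrative.

```python
# Auto mode (default): default cleaning operations are applied and special
# columns not meant for feature engineering are excluded.
invoice_view = invoice_table.get_view()

# Manual mode: skip the defaults and specify cleaning operations explicitly.
invoice_view_manual = invoice_table.get_view(
    view_mode="manual",
    drop_column_names=["record_available_at"],
    column_cleaning_operations=[
        fb.ColumnCleaningOperation(
            column_name="Amount",
            cleaning_operations=[fb.MissingValueImputation(imputed_value=0)],
        )
    ],
)

# Work directly with the unprocessed source data through the raw attribute.
raw_amount = invoice_view.raw["Amount"]
```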

Change View

A Change View is created from a Slowly Changing Dimension (SCD) table to provide a way to analyze changes that occur in an attribute of the natural key of the table over time. This view consists of five columns:

Once the Change View is created, it can be used to generate features in the same way as features from an Event View.

Examples

Changes to a SCD table can provide valuable insights into customer behavior, such as:

  • the number of times a customer has moved in the past six months,
  • their previous address if they recently moved,
  • whether they have gone through a recent divorce,
  • if there are new additions to their family,
  • or if they have started a new job.

SDK Reference

How to create a Change View from a SCD table.

Filters

Filters are an essential element in feature engineering strategies. They enable the segmentation of data into sub-groups, which facilitates specific operations and analyses:

  • Targeted Aggregations: Filters allow for meaningful aggregations of data that would otherwise be nonsensical. For instance, transactions can be categorized by their outcomes such as "Authorized", "Approved", or "Cancelled".
  • Focused Analysis: By using filters, it is possible to narrow down the analysis to specific event types and derive additional, relevant features for those types. For example, analyzing transactions by weekday may yield insightful trends for "Purchases" but may be less significant for "Banking Fees".

Feature Ideation leverages Generative AI to aid enterprise users in identifying effective filters.

Within our SDK, users can manipulate data similarly to how one would use a Pandas DataFrame. It is possible to create new views from subsets of views. Additionally, a condition-based subset can be used to replace the values of a column.
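
For example, filtering and condition-based replacement might look like this sketch; view and column names are illustrative.

```python
# Create a new view from a condition-based subset of rows.
purchases_view = transactions_view[transactions_view["TransactionType"] == "Purchase"]

# Use a condition-based subset to replace the values of a column.
transactions_view["Amount"][transactions_view["Amount"] < 0] = 0
```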

View Sample

Using the sample method, a view can be materialized with a random selection of rows for a given time range, size, and seed to control sampling.

Note

Views from tables in a Snowflake data warehouse do not support the use of seed.
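
A sketch of sampling a view; parameter values are illustrative, and as noted above the seed is not honored for Snowflake-backed views.

```python
sample_df = invoice_view.sample(
    size=1000,
    seed=23,
    from_timestamp="2023-01-01",
    to_timestamp="2023-06-30",
)
```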

SDK Reference

How to materialize a sample of a view.

View Join

To join two views, use the join() method of the left view and specify the right view object in the other_view parameter. The method will match rows from both views based on a shared key, which is either the primary key of the right view or the natural key if the right view is a Slowly Changing Dimension (SCD) view.

If the shared key identifies an entity that is referenced in the left view or the column name of the shared key is the same in both views, the join() method will automatically identify the column in the left view to use for the join.

By default, a left join is performed, and the resulting view has the same number of rows as the left view. However, you can set the how parameter to 'inner' to perform an inner join. In this case, the resulting view only contains rows where there is a match on the shared key in both views.

When the right view is an SCD view, the event timestamp or the reference datetime column of the left view determines which record of the right view to join.
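
A minimal sketch of both join types, reusing the invoice event view and a customer SCD view from the earlier sketches; the join key is inferred as described above, and the exact keyword arguments may differ.

```python
customer_view = catalog.get_table("CUSTOMER_PROFILE").get_view()

# Left join (default): the result keeps every row of the left view.
enriched_view = invoice_view.join(customer_view)

# Inner join: the result keeps only rows with a match in both views.
matched_view = invoice_view.join(customer_view, how="inner")
```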

Note

For Item View, the event timestamp and columns representing entities in the related event table are automatically added. Additional attributes can be joined using the join_event_table_attributes() method.

Important

Not all views can be joined to each other. SCD views cannot be joined to other SCD views, while only dimension views can be joined to other dimension views. Change views cannot be joined to any views.

View Column

A View Column is a column within a FeatureByte view. When creating a view, a View Column represents the cleaned version of a table column. The cleaning procedure for a View Column depends on the view's construction mode and typically follows the default cleaning operations associated with the corresponding table column.

By default, special columns not intended for feature engineering are excluded from view columns. These columns may consist of record creation and expiration timestamps, surrogate keys, and active flags.

You can add new columns to a view by performing joins or by deriving new columns from existing ones.

If you wish to add new columns derived from the raw data in the source table, use the view's raw attribute to access the source table's unprocessed data.

SDK Reference

Refer to the ViewColumn object main page or to the specific links:

View Column Transforms

View Column Transforms refer to the ability to apply transformation operations on columns within a view. By applying these transformation operations, you can create a new column. This new column can either be reassigned to the original view or utilized for further transformations.

The different types of transforms include:

Additionally, you have the option to apply custom SQL User-Defined Functions (UDFs) on view columns. This is particularly useful for integrating transformer models with FeatureByte.

Generic Transforms

SDK Reference

You can apply the following transforms to columns of any data type in a view:

  • isnull: Returns a new boolean column that indicates whether each row is missing.
  • notnull: Returns a new boolean column that indicates whether each row is not missing.
  • isin: Returns a new boolean column showing whether each element in the view column matches an element in the passed sequence of values.
  • fillna: Replaces missing values in-place with specified values.
  • astype: Converts the data type of the column.
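
A minimal sketch of these generic transforms, continuing the earlier invoice view sketch (column names are illustrative assumptions):

```python
invoice_view["AmountIsMissing"] = invoice_view["Amount"].isnull()
invoice_view["AmountIsKnown"] = invoice_view["Amount"].notnull()
invoice_view["IsCardPayment"] = invoice_view["PaymentMethod"].isin(["VISA", "MASTERCARD"])

# fillna replaces missing values in place; astype returns a converted column.
invoice_view["Amount"].fillna(0)
invoice_view["AmountAsInt"] = invoice_view["Amount"].astype(int)
```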

Numeric Transforms

SDK Reference

In addition to built-in arithmetic operators (+, -, *, /, etc), you can apply the following transforms to columns of numeric type in a view:

  • abs: Returns absolute value
  • sqrt: Returns square root value
  • pow: Returns power value
  • log: Returns logarithm with natural base
  • exp: Returns exponential value
  • floor: Rounds down to the nearest integer
  • ceil: Rounds up to the nearest integer

String Transforms

SDK Reference

In addition to string columns concatenation, you can apply the following transforms to columns of string type in a view:

  • len: Returns the length of the string
  • lower: Converts all characters to lowercase
  • upper: Converts all characters to uppercase
  • strip: Trims whitespace or a specific character on both the left and right string boundaries
  • lstrip: Trims whitespace or a specific character on the left string boundary
  • rstrip: Trims whitespace or a specific character on the right string boundary
  • replace: Replaces substring with a new string
  • pad: Pads string up to the specified width size
  • contains: Returns a boolean flag column indicating whether each string element contains a target string
  • slice: Slices substrings for each string element
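
A short sketch of a few string transforms, assuming a pandas-like str accessor on view columns and an illustrative "ProductName" column:

```python
invoice_view["ProductNameLower"] = invoice_view["ProductName"].str.lower()
invoice_view["ProductNameLength"] = invoice_view["ProductName"].str.len()
invoice_view["MentionsOrganic"] = invoice_view["ProductName"].str.contains("organic")
```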

Datetime Transforms

The date or timestamp (datetime) columns in a view can undergo the following transformations:

  • Calculate the difference between two datetime columns.
  • Add a time interval to a datetime column to generate a new datetime column.
  • Extract date components from a datetime column.

Note

Date parts for columns or features using timestamp with time zone offset are based on the local time instead of UTC.

Date parts for columns or features using event timestamps of Event tables, where a separate column was specified to provide the time zone offset information, will also be based on the local time instead of UTC.

SDK Reference

How to extract date components:

  • microsecond: Returns the microsecond component of each element
  • millisecond: Returns the millisecond component of each element
  • second: Returns the second component of each element
  • minute: Returns the minute component of each element
  • hour: Returns the hour component of each element
  • day: Returns the day component of each element in a view column
  • day_of_week: Returns the day of week component of each element
  • week: Returns the week component of each element
  • month: Returns the month component of each element
  • quarter: Returns the quarter component of each element
  • year: Returns the year component of each element
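
A short sketch of datetime transforms, assuming a pandas-like dt accessor on view columns; the handling of the difference between two datetime columns is an assumption and the column names are illustrative.

```python
# Extract date components.
invoice_view["InvoiceHour"] = invoice_view["Timestamp"].dt.hour
invoice_view["InvoiceDayOfWeek"] = invoice_view["Timestamp"].dt.day_of_week

# Difference between two datetime columns, expressed in hours (assumed behavior).
invoice_view["HoursToShipment"] = (
    invoice_view["ShipmentTimestamp"] - invoice_view["Timestamp"]
).dt.hour
```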

Lag Transforms

The use of Lag Transforms enables the retrieval of the preceding value associated with a particular entity in a view.

This, in turn, makes it feasible to compute essential features, such as those that depend on inter-event time and the proximity to the previous point.

Note

Lag transforms are only supported for Event and Change views.

SDK Reference

How to extract lags from a view column.
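
A minimal sketch, assuming a lag method on view columns that takes the entity column as argument (the exact signature may differ):

```python
# Previous value of a column for the same customer.
invoice_view["PreviousAmount"] = invoice_view["Amount"].lag("CustomerID")

# Lag of the event timestamp, from which inter-event time can be derived.
invoice_view["PreviousTimestamp"] = invoice_view["Timestamp"].lag("CustomerID")
```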

UDF Transforms

A SQL User-Defined Function (UDF) is a custom function created by users to execute specific operations not covered by standard SQL functions. UDFs encapsulate complex logic into a single, callable routine.

An application of this is in computing text embeddings using transformer-based models or large language models (LLMs), which can be formulated as a UDF.

Creating a SQL Embedding UDF

For step-by-step guidance on creating a SQL Embedding UDF, visit the Bring Your Own Transformer tutorials.

SDK Reference

Refer to the UserDefinedFunction object main page or to the specific links:

Feature Creation

Features

Input data used to train Machine Learning models and compute predictions is referred to as features.

These features can sometimes be derived from attributes already present in the source tables.

Example

A customer churn model may use features obtained directly from a customer profile table, such as age, gender, income, and location.

However, in many cases, features are created by applying a series of row transformations, joins, filters, and aggregates.

Example

A customer churn model may utilize aggregate features that reflect the customer's account details over a given period, such as

  • the customer entropy of product types purchased over the past 12 weeks,
  • the customer count of canceled orders over the past 56 weeks,
  • and the customer amount spent over the past seven days.

FeatureByte offers two ways to create features:

Feature Object

A Feature object in FeatureByte SDK contains the logical plan to compute the feature.

There are three ways to define the plan for Feature objects from views:

  1. Lookup features
  2. Aggregate features
  3. Cross Aggregate features

Additionally, Feature objects can be created as transformations of one or more existing features.

SDK Reference

Refer to the Feature object main page or to the specific links:

Lookup Features

A Lookup Feature refers to an entity’s attribute in a view at a specific point-in-time. Lookup features are the simplest form of feature as they do not involve any aggregation operations.

When a view's primary key identifies an entity, it is simple to designate its attributes as features for that particular entity.

Examples

Examples of Lookup features are a customer's birthplace retrieved from a Customer Dimension table or a transaction amount retrieved from a Transactions Event table.

When an entity serves as the natural key of an SCD view, it is also possible to assign one of its attributes as a feature for that entity. However, in those cases, the feature is materialized through point-in-time joins, and the resulting value corresponds to the active row at the point-in-time specified in the feature request.

Example

A customer feature could be the customer's street address at the request's point-in-time.

When dealing with an SCD view, you can specify an offset if you want to get the feature value at a specific time before the request's point-in-time.

Example

By setting the offset to 9 weeks in the previous example, the feature value would be the customer's street address nine weeks before the request's point-in-time.

SDK Reference

How to create a Lookup feature.
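
A minimal sketch of both cases, assuming a customer dimension view and a customer SCD view, an as_feature method on view columns, and an offset parameter for SCD lookups (names and signatures are assumptions):

```python
# Lookup feature from a dimension view.
birthplace = customer_dimension_view["Birthplace"].as_feature("CustomerBirthplace")
birthplace.save()

# Lookup feature from an SCD view, nine weeks before the request's point-in-time.
address_9w_ago = customer_scd_view["StreetAddress"].as_feature(
    "CustomerStreetAddress_9wAgo", offset="9w"
)
```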

Aggregate Features

Aggregate features are a fundamental aspect of feature engineering, essential for transforming transactional data into meaningful insights. These features are derived by applying a range of aggregation functions to data points grouped by one or more entities.

Supported aggregation functions include:

  • Count: Counts the number of occurrences for an entity. Useful in scenarios requiring a count of events or items, like the number of transactions per customer or the frequency of specific events.
  • Sum: Calculates the total sum of column values for an entity. This function is essential in aggregating numerical data, such as totaling expenditures per customer or aggregating resource usage.
  • Average (Mean): Computes the mean value of column values for an entity. This function is key in finding the average or typical value, applicable in various contexts like calculating the average spending of customers or the average temperature over a period. It is also useful for computing the mean vector of embeddings in multi-dimensional data spaces, which is valuable in fields like natural language processing or image analysis.
  • Minimum and Maximum: Identifies the lowest and highest values in a column for an entity, respectively. These functions are essential for understanding the range of data, such as the minimum and maximum temperatures recorded. The maximum function is particularly useful in text embeddings to highlight the most significant features in text data.
  • Standard Deviation: Measures the variability or dispersion around the mean of column values for an entity. It is significant in assessing the spread or distribution of data points.
  • Count Distinct: Calculates the number of distinct values in a column for an entity. This is useful for assessing diversity.
  • Latest: Retrieves the most recent value in a column for an entity. This is particularly useful for datasets where the latest information is of prime importance, such as tracking recent user activity.
  • NA Count: Tallies the number of missing data points in a column for an entity. This is particularly valuable in datasets where the presence of missing data can indicate significant trends or issues.

Note

More signals (such as mode or entropy) can be obtained from categorical columns by first aggregating data across those columns. For more details, see the Cross Aggregate Features section.

SDK Reference

How to access the list of aggregation methods.

While leveraging these aggregation functions, it's crucial to incorporate the temporal dimension of the dataset to ensure meaningful and contextually relevant aggregations. Ignoring the temporal dimension would also lead to temporal leakage.

There are three main types of aggregate features:

  1. Non-Temporal Aggregates,
  2. Aggregates Over a Window
  3. and Aggregates "As At" a Point-in-Time.

Note

If a feature is intended to capture patterns of interaction between two or more entities, these aggregations are grouped by the tuple of the entities. For instance, an aggregate feature can be created to show the amount spent by a customer with a merchant in the past.

SDK Reference

How to create:

Cross Aggregate Features

Cross Aggregate Features in FeatureByte provide a powerful mechanism to aggregate data across categorical columns, enabling sophisticated data analysis and insight generation. This functionality allows you to categorize data into groups (a process known as 'bucketing') based on categorical column values and perform various aggregation operations like counting records within each category or summing up values of a numeric column for those categories. Beyond counting and summing, you can employ additional aggregation methods tailored to your analysis needs.

This feature facilitates advanced analytical tasks, such as:

  1. Entropy Analysis: Assess the entropy in data distributions that emerge from aggregating sums or counts across categories. Such analysis is crucial for understanding data variability or diversity, shedding light on the unpredictability in aspects like customer behavior or product performance.

  2. Temporal and Comparative Distribution Analysis: Compare category-based distributions over time or against overarching groups. This is instrumental in tracking how engagements within categories evolve over time or in relation to larger entities.

  3. Identifying Key Categories: Uncover significant trends or preferences within your data, including:

    • Identifying the most frequently occurring category (mode), highlighting prevalent trends.
    • Pinpointing categories with the highest or lowest aggregated values, such as sales or user engagement, to recognize outstanding or lagging areas.
    • Aggregating values for a specific category to gain detailed insights into particular segments of interest.
  4. Prevalence of Entity Attributes: Evaluate the commonality of certain attributes within entities, such as assessing customer age bands across products. This involves:

    • Aggregating by product across age bands.
    • Aggregating across age bands only.
    • Analyzing proportions to understand demographic affinities or discrepancies for specific products.

Example Use Case

Imagine analyzing customer spending habits. A Cross Aggregate feature might calculate the total amount spent by each customer across different product categories over a specified period. This aggregation offers insights into customer spending patterns or preferences, enriching understanding of behavior across various product categories.

Technical Implementation

When computing Cross Aggregate features for an entity (e.g., a customer), the outcome is typically structured as a dictionary. This dictionary's keys are the product categories engaged by the customer, with values representing total expenditure in each category. This structure effectively captures the customer's cross-category spending behavior, providing a holistic view of their purchase preferences.
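
A minimal sketch of the customer spending example above, assuming an item view of invoice items; the category parameter of groupby and the other argument names are assumptions based on the SDK's groupby pattern.

```python
# Amount spent per product category over the past 4 weeks; each feature
# value is a dictionary keyed by product category.
customer_spend_by_category = item_view.groupby(
    "CustomerID", category="ProductCategory"
).aggregate_over(
    value_column="Amount",
    method="sum",
    windows=["4w"],
    feature_names=["CustomerSpendByProductCategory_4w"],
)
```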

Like other types of Aggregate Features, it is important to consider the temporal aspect when conducting aggregation operations. The three main types of Cross Aggregate features include:

SDK Reference

How to group by entity across categories to perform cross aggregates.

Non-Temporal Aggregates

Non-Temporal Aggregate features refer to features that are generated through aggregation operations without considering any temporal aspects. In other words, these features are created by aggregating values without considering the order or sequence in which they occur.

Important

To avoid time leakage, non-temporal aggregates are only supported for Item views, when the grouping key is the event key of the Item view. An example of such a feature is the count of items in an order.

Note

Non-temporal aggregate features obtained from an Item view can be added as a column to the corresponding event view. Once the feature is integrated, it can be aggregated over a time window to create aggregate features over a window. For instance, you can calculate a customer's average order size over the last three weeks by using the order size feature extracted from the Order Items view and aggregating it over that time frame in the related Order view.
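
A minimal sketch of the order size example, assuming an Item view whose event key is "OrderID"; the aggregate method and its parameters are assumptions based on the SDK's groupby pattern.

```python
order_items_view = catalog.get_table("ORDER_ITEMS").get_view()

# Count of items per order (grouping key is the event key of the Item view).
order_size = order_items_view.groupby("OrderID").aggregate(
    value_column=None,
    method="count",
    feature_name="OrderSize",
)
```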

SDK Reference

How to:

Aggregates Over A Window

Aggregates over a window refer to features derived by summarizing data within a defined time frame. These features are commonly utilized to analyze event data, item data, and time series data, providing insights into patterns, trends, or behaviors within the specified period.

The duration of the window is specified when the feature is created. The end point of the window is determined when the feature is served, based on the point-in-time values specified by the feature request and the feature job setting of the feature.
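
A minimal sketch of an aggregate over a window, continuing the earlier invoice view sketch; the argument names follow the SDK's groupby pattern and are assumptions.

```python
# Customer spend over 7-day and 28-day windows.
customer_spend = invoice_view.groupby("CustomerID").aggregate_over(
    value_column="Amount",
    method="sum",
    windows=["7d", "28d"],
    feature_names=["CustomerSpend_7d", "CustomerSpend_28d"],
)
```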

SDK Reference

How to create an aggregate over feature.

Aggregates “As At” a Point-In-Time

Aggregates "As At" a Point-In-Time are features that are generated by aggregating data that is active at a particular moment in time. These types of features are only available for slowly changing dimension (SCD) views and the grouping key used for generating these features should not be the natural key of the SCD view.

You can specify an offset, if you want the aggregation to be done on rows active at a specific time before the point-in-time specified by the feature request.

Example

An aggregate ‘as at’ feature from a Credit Cards table could be the customer's count of credit cards at the specified point-in-time of the feature request.

With an offset of 2 weeks, the feature would be the customer's count of credit cards 2 weeks before the specified point-in-time of the feature request.
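
A minimal sketch of the credit card example, assuming an SCD view of credit cards keyed by customer; the aggregate_asat method and its parameters are assumptions.

```python
credit_card_view = catalog.get_table("CREDIT_CARDS").get_view()

# Count of credit cards active at the request's point-in-time.
card_count = credit_card_view.groupby("CustomerID").aggregate_asat(
    value_column=None,
    method="count",
    feature_name="CustomerCreditCardCount",
)

# Same count, but 2 weeks before the request's point-in-time.
card_count_2w_ago = credit_card_view.groupby("CustomerID").aggregate_asat(
    value_column=None,
    method="count",
    feature_name="CustomerCreditCardCount_2wAgo",
    offset="2w",
)
```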

SDK Reference

How to create an aggregate "asat" feature.

Aggregates Of Changes Over a Window

Aggregates of changes over a window are features that summarize changes in a Slowly Changing Dimension (SCD) table within a specific time frame. These features are created by aggregating data from a Change view that is derived from a column in the SCD table.

Example

One possible aggregate feature of changes over a window could be the count of address changes that occurred within the last 12 weeks for a customer.

SDK Reference

How to create:

Temporal Window

In feature engineering, a "Temporal Window" refers to a specific period over which data points are gathered and analyzed to extract valuable features for modeling. Employing multiple windows enables the capture of dynamics across short, medium, and long-term intervals within the data.

Window Size determines the duration of the temporal window (e.g., minutes, hours, days, weeks), and its selection depends on the specific use case and data characteristics.

Feature Ideation assists enterprise users in identifying the most appropriate window sizes for their particular applications.

Examples

  • Sum of a shop's sales over the past 4 weeks.
  • Total call duration by a customer over the past week.
  • Rolling average of heart rate variability over the last 24 hours.
  • Maximum machine temperature recorded in the last 30 minutes.

Edge Effects

  • At the Beginning of the Data: Ensure the starting point of your training data is after the initial table observations plus the window size. This adjustment prevents incomplete data windows at the start of the dataset.
  • At the End of the Data: Set a sufficiently large blind spot in the feature job settings to account for the potential unavailability of the most recent data points due to data latency.

Feature Transforms

Feature Transforms is a flexible functionality that allows the generation of new features by applying a broad range of transformation operations to existing features. These transformations can be applied to individual features or multiple features from the same or distinct entities.

The available transformation operations resemble those provided for view columns. However, additional transformations are also supported for features resulting from Cross Aggregate features.

Features can also be derived from multiple features and the points-in-time provided during feature materialization.

Examples of features derived from Cross Aggregates

  • Most common weekday for customer visits in the past 12 weeks
  • Count of unique items purchased by a customer in the past 4 weeks
  • List of distinct items bought by a customer in the past 4 weeks
  • Amount spent by a customer on ice cream in the past 4 weeks
  • Weekday entropy for customer visits in the past 12 weeks

Examples of features derived from multiple features

  • Similarity between customer’s basket during the past week and past 12 weeks
  • Similarity between a customer's item basket and the baskets of customers in the same city over the past 2 weeks
  • Order amount z-score based on a customer's order history over the past 12 weeks

SDK Reference

How to transform the dictionary output of cross aggregate features:

Feature Ideation

Feature Ideation automates key aspects of feature engineering while allowing user input at various stages. It streamlines the process of identifying relevant data, generating feature recommendations, and integrating features into production.

Key Capabilities

Identifying Relevant Data

  • Data Discovery: Analyzes tables and relationships to identify those most relevant to the use case.
  • Semantic Tagging: Infers missing semantic tags using column metadata and aligns them with a predefined data ontology for feature engineering.

Feature Engineering Recommendations

Feature Ideation provides recommendations for feature creation, considering data properties and the target use case:

  • Time Window Recommendation: Suggests appropriate time windows for data aggregation.
  • Event Filters: Identifies relevant filters to isolate key events based on event types and statuses.
  • Column Transformations: Recommends useful transformations, such as time deltas, ratios, and differences.
  • Numeric Column Selection: Highlights numeric columns that are suitable for aggregation across categories.
  • Event Frequency Analysis: Identifies patterns in event occurrence to optimize timing-based feature engineering.

Automatic Feature Suggestions

Combines recommendations and best practices to generate structured, ready-to-use features.

Feature Evaluation and Compilation

  • Relevance Evaluation: Assesses the semantic and statistical relevance of proposed features using Generative AI and exploratory data analysis (EDA).
  • Redundancy Check: Compares suggested features with existing ones to avoid duplication.
  • Feature Selection: Applies techniques to prioritize the most informative features.

Feature Integration and Traceability

  • Direct Catalog Addition: Allows features to be added to the Catalog without writing code.
  • Notebook Export: Provides an option to download feature definitions for further inspection and customization.
  • Documentation and Auditability: Logs all steps to ensure transparency in the feature engineering process.

Modes of Operation

  • Fully Automated Mode: Runs the entire feature ideation process automatically.
  • Semi-Automated Mode: Allows you to review, refine, and adjust recommendations interactively.

User Interface

The 'Ideate Features' tutorial in the Credit Default UI tutorials and Grocery UI tutorials demonstrates the Fully Automated Mode.

The 'Refine Ideation' tutorial in the Credit Default UI tutorials and Grocery UI tutorials demonstrates the Semi-Automated Mode.

Feature Selection

After ideating features, FeatureByte supports three types of feature selection:

  • Rule-based feature selection: Selects features with the highest predictive scores overall and/or per theme.
  • SHAP-based feature selection: Identifies top-performing features of XGBoost or LightGBM models by analyzing their SHAP (SHapley Additive exPlanations) values.
  • GenAI-based feature selection: Refines feature selection using Generative AI.

Screening Criteria

Feature selection can be applied to a pre-selection of candidates, which may come from a prior selection, any filtered view of the ideated features, or a manual selection.

Candidates can also be further screened by:

  • Excluding Low Added Value Features: Removes features with limited predictive power and their derivations. This includes:

    • Numeric and categorical features compared with features flagging their missing values.
    • Dictionary-type features compared with simpler alternatives based on total counts or sums without grouping.
  • Excluding Specific Feature Types: Removes dictionary-type and embedding-type features from the candidate set.

Rule-Based Feature Selection

Rule-based feature selection identifies features with the highest predictive scores either overall or per theme.

Parameters:

  • Number of Top Features Overall: Specifies the number of features to select based on overall predictive scores.
  • Number of Top Features per Theme: Specifies the number of features to select per theme.
  • Selection Logic: Determines how the criteria are applied:

    • OR: Selects features that meet at least one of the criteria: being in the top features overall OR in the top features per theme.
    • AND: Selects only the features that meet both criteria: being in the top features overall AND in the top features per theme.

SHAP-Based Feature Selection

SHAP-based feature selection refines the feature set using L1 regularization and/or SHAP (SHapley Additive exPlanations) importance thresholds derived from XGBoost or LightGBM models trained on EDA data.

Parameters:

  • Model Type: The type of ML model used to compute SHAP values (XGBoost or LightGBM).
  • L1 Rounds: Number of iterations to apply L1 regularization on SHAP values to eliminate features with minimal contribution or high collinearity.
  • Importance Rounds: Number of iterations to apply SHAP importance thresholds, retaining only top-performing features.
  • Cumulative Importance Threshold: Retains features until their cumulative SHAP importance reaches or stays below this fraction (a value between 0 and 1).

Pre-Filtering Options: These are similar to the screening parameters used in Rule-based feature selection.

GenAI-Based Feature Selection

GenAI-based feature selection leverages Generative AI to refine feature selection. Simply set the Target Feature Count to specify the desired number of features to retain.

Feature Catalog

The Features registered in the catalog can be listed and retrieved by name for easy access and management.

In the SDK, features can be filtered based on two key attributes:

SDK Reference

Self-Organized Feature Catalog

FeatureByte Enterprise enhances the Feature Catalog with advanced capabilities:

  • Use Case Compatibility: It ensures that only features compatible with a defined Use Case are displayed, as detailed in Feature Compatibility with a Use Case.
  • Signal Type Categorization: Features are categorized by their Signal Type, facilitating easier identification and use.
  • Thematic Organization: Features are organized thematically, incorporating three key aspects:

    • The feature's Primary Entity
    • The feature's Primary Table
    • The feature's Signal Type

In addition to basic filters, advanced filtering options in FeatureByte Enterprise include:

User Interface

Learn by example with the 'Create New Feature List' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Feature Compatibility with a Use Case

In the context of a Use Case, it's crucial to ensure that the features are compatible with the Use Case Primary Entity. For a feature to be considered compatible, its Primary Entity must align with the Use Case Primary Entity in one of two ways:

  • Direct Match: The feature's Primary Entity should be the same as the Use Case Primary Entity.
  • Hierarchical Relationship: The feature's Primary Entity should be either a parent or grandparent of the Use Case Primary Entity.

Example

Entity Diagram

Consider the following scenario:

Use Case: Card Default Prediction. Use Case Primary Entity: Card.

Feature in Question: A feature that records the maximum distance between a customer's residence and the locations of their card transactions over the past 24 hours. Feature Primary Entity: Customer.

Analysis: This feature is compatible with the Use Case. Although the Feature Primary Entity is 'Customer', the 'Customer' entity is a parent of the 'Card' entity, since each card belongs to exactly one customer. Therefore, the feature can be effectively utilized in the Card Default Prediction Use Case.

In FeatureByte Enterprise, this concept plays a crucial role by ensuring that only features compatible with a defined Use Case are displayed in the Feature Catalog. This functionality streamlines the selection process and enhances the overall effectiveness of Use Case implementation.

Feature Signal Type

In FeatureByte, the 'signal type' of a feature is a key indicator of the information it captures. This categorization is essential not only during feature ideation but also in organizing features in the catalog and assessing the comprehensiveness of a feature list.

Signal Type Examples

  • Attribute: gets the attribute of the entity at a point-in-time. For instance, it might record the employment status of a customer at a specific time.
  • Frequency: counts the occurrence of events, like the number of times a user logs into an application.
  • Recency: measures the time since the latest event, crucial in tracking customer engagement.
  • Timing: relates to when the events happened, helpful in understanding the regularity of events such as binge watching.
  • Latest event: attributes of the latest event, such as the latest transaction location in a credit card record.
  • Stats: aggregates a numeric column's values, like the total spent by a customer over the past 4 weeks.
  • Diversity: measures the variability of data values, useful in understanding the range of customer preferences.
  • Stability: compares recent events to those of earlier periods to gauge consistency.
  • Similarity: compares an individual entity feature to a group, important in anomaly detection.
  • Most frequent: gets the most frequent value of a categorical column, like the best-selling product in a store.
  • Bucketing: aggregates a column's values across categories of a categorical column, allowing multi-dimensional analysis.
  • Attribute stats: collects stats for an attribute of the entity, such as the representation of a customer's age in the overall population of purchases.
  • Attribute change: measures the occurrence or magnitude of changes to slowly changing attributes, crucial to detect key changes in the customer environment.

Tutorials

See examples of features categorized by their signal type in the 'Create New Feature List' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Automated Signal Type tagging

FeatureByte Enterprise simplifies the categorization of features by their signal types through an automated tagging system. This intelligent system ensures each feature is accurately and consistently associated with its relevant signal type, reducing manual effort and enhancing the efficiency of the cataloging process.

Feature Primary Table

The Feature Primary Table is the central table, serving as the foundational source of data for the feature.

In a setup where an SCD table is joined with an Event table, the event table typically acts as the primary table. It contains the main events or transactions of interest, and these events are further enriched by joining with the SCD table.

Feature Secondary Table

The Feature Secondary Table supplements the primary table by providing additional attributes or dimensions. This table is typically joined with the primary table to enhance the data with more context.

Feature Theme

The Feature Theme is a concept in FeatureByte Enterprise, utilized to systematically categorize and organize features within the feature catalog. This categorization is achieved by integrating three key components:

  • Primary Entity: This element represents the main focus of the feature. It's the central aspect around which the feature is built.
  • Primary Table: This is the core database table from which the feature primarily draws its data. It provides the foundational dataset that defines the structure and context of the feature.
  • Signal Type: This component identifies the nature of the data signals used in the feature.

This thematic organization aids in providing a clear and structured view of the feature catalog, facilitating easier navigation and understanding of the available features.

Feature Relevance

Feature relevance is essential for evaluating the impact of individual features on predictive models before modeling. Two key metrics are utilized to assess feature relevance:

Predictive Score

The Predictive Score (PS) measures the relationship between a feature and the target variable within a specific use case. A PS score of 1 indicates perfect correlation with the target, while 0 suggests no correlation.

Note

PS evaluates features independently and might overlook potential interactions among them, which could significantly affect predictive relevance. Some features may exhibit limited predictive utility when analyzed alone. However, when combined with others, they might reveal significant predictive power due to interaction effects.

Details

PS utilizes XGBoost for numerical, categorical, embedding or dictionary features, and regularized linear regression for textual features. The score is based on the Gini Norm (a scaled version of the Gini coefficient):

  • For regression, Gini Norm provides a quantitative measure of how well a model can distinguish between different groups. In insurance, it is frequently used to quantify how well the model can differentiate between high-risk and low-risk individuals. In Marketing, it is used to quantify how well the model can differentiate between high-value and low-value customers.
  • In classification, Gini Norm is equivalent to 2x(AUC - 0.5), where AUC is the Area Under the ROC Curve, providing a measure of the model's ability to discriminate between positive and negative classes.

Semantic Relevance

Semantic relevance, derived through Generative AI, examines the significance of each feature within a specific use case based on its semantic value without directly analyzing the data. This metric considers both the feature's description and the context of the use case. It complements the predictive score by ensuring that features not only display statistical correlation with the target variable but also carry contextual meaning.

High semantic relevance scores, combined with low statistical correlation, may indicate potential data quality issues or highlight the limitations of relying solely on statistical relevance. Semantic relevance can also capture critical constraints such as fairness, causality, and other contextual factors.

Feature Materialization

The act of computing the feature is known as Feature Materialization.

Features are materialized:

  • on demand to fulfill historical requests,
  • or, for prediction purposes, pre-computed through a batch process called a "Feature Job".

The Feature Job is scheduled based on the defined settings associated with each feature.

To materialize the feature values, either:

Additionally, in the context of historical feature serving, an observation set is required, created by combining:

  • entity key values
  • and point-in-time references that correspond to particular moments in the past.

Point-In-Time

A Point-In-Time for a feature refers to a specific moment in the past with which the feature's values are associated.

It is a crucial aspect of historical feature serving that allows Machine Learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.

Feature Governance

Feature Version

A Feature Version enables the reuse of a Feature with varying feature job settings or distinct cleaning operations.

If the availability or freshness of the source table changes, new versions of the feature can be generated with a new feature job setting. On the other hand, if changes occur in the data quality of the source table, new versions of the feature can be created with new cleaning operations that address the new quality issues.

To ensure the seamless inference of Machine Learning tasks that depend on the feature, old versions of the feature can still be served without any disruption.

Note

In the FeatureByte SDK, a new feature version is a new Feature object with the same namespace as the original Feature object. The new Feature object has its own Object ID and version name.

Feature Readiness

To help differentiate features that are in the prototype stage and features that are ready for production, a feature version can have one of four readiness levels:

  1. PRODUCTION_READY: ready for deployment in production environments.
  2. PUBLIC_DRAFT: shared for feedback purposes.
  3. DRAFT: in the prototype stage.
  4. DEPRECATED: not advised for use in either training or prediction.

Important

Only one feature version can be designated as PRODUCTION_READY at a time.

When a feature version is promoted to PRODUCTION_READY, guardrails are applied automatically to ensure consistency with default cleaning operations and feature job settings. You can disregard these guardrails if the settings of the promoted feature version adhere to equally robust practices.

Important Note for FeatureByte Enterprise Users

In Catalogs with Approval Flow enabled, moving features to production-ready status involves a comprehensive approval process.

This includes several evaluations, such as checking the feature's compliance with default cleaning operations and the feature job setting of its source tables. It also involves confirming the status of these tables and backtesting the feature job setting to prevent future training-serving inconsistencies. Additionally, essential details of the feature, particularly its feature definition file, are shared and subjected to a thorough review.

SDK Reference

How to:

User Interface

Learn by example with the 'Deploy and serve a feature list' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Default Feature Version

The default version of a feature streamlines the process of reusing features by providing the most appropriate version. Additionally, it simplifies the creation of new versions of feature lists.

By default, the feature's version with the highest level of readiness is considered, unless you override this selection. In cases where multiple versions share the highest level of readiness, the most recent version is automatically chosen as the default.

Note

When a feature is accessed from a catalog without specifying its object ID or its version name but only by its name, the default version is automatically retrieved.

Feature Definition File

The feature definition file is the single source of truth for a feature version. This file is automatically generated when a feature is declared in the SDK or a new version is derived.

The syntax used in the SDK is also used in the feature definition file. The file provides an explicit outline of the intended operations of the feature declaration, including those inherited but not explicitly declared by you. For example, these operations may include feature job settings and cleaning operations inherited from table metadata.

The feature definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for feature materialization.

Definition File

SDK Reference

How to obtain the feature definition file.

Feature Online Enabled

An online enabled feature is a feature that is used by at least one deployed feature list.

Feature List Creation

Feature List

A Feature List is a collection of features. It is usually tailored to meet the needs of a particular use case and generate feature values for Machine Learning training and inference.

Historical feature values are first obtained to train and test models.

Once a model has been trained and validated, the Feature List can be deployed, and pre-computed feature values can be stored in the feature store and accessed through online and batch serving to generate predictions.
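
A minimal sketch of creating and saving a feature list from previously created features (the feature objects here refer to the earlier aggregate sketches and are illustrative):

```python
import featurebyte as fb

feature_list = fb.FeatureList(
    [customer_spend["CustomerSpend_7d"], card_count],
    name="Card Default Prediction Features",
)
feature_list.save()
```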

SDK Reference

Refer to the FeatureList object main page or to the specific links:

User Interface

Learn by example with the 'Create New Feature List' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Feature Group

A Feature Group is a temporary collection of features that facilitates the manipulation of features and the creation of feature lists.

Note

It is not possible to save the Feature Group as a whole. Instead, each feature within the group can be saved individually. To save a Feature Group as a whole, first convert it to a Feature List.

SDK Reference

Refer to the FeatureGroup object main page or to the specific links:

Feature List Builder

The Feature List Builder facilitates the construction of new feature lists. It becomes active once a specific Use Case is identified. Users can then enrich their feature list by selecting relevant features from two resources: the Feature Catalog or the Feature List Catalog.

The tool offers real-time statistics on several aspects: the readiness level of the selected features (the percentage of features that are production ready), the percentage of features currently active online, and the diversity of themes incorporated into the list.

Moreover, it dynamically suggests additional features from unrepresented themes. This recommendation system is designed to ensure the feature list encompasses a broad spectrum of signals, enhancing the overall predictive power of the feature list.

User Interface

Learn by example with the 'Create New Feature List' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Feature List Catalog

The Feature Lists registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

In the SDK, feature lists can be filtered based on three key attributes:

In FeatureByte Enterprise, feature lists can also be filtered based on:

Feature List Compatibility with a Use Case

In the context of a Use Case, it's crucial to ensure that the feature lists are compatible with the Use Case Primary Entity. For a feature list to be considered compatible, its Primary Entity must align with the Use Case Primary Entity in one of two ways:

  • Direct Match: The feature list's Primary Entity should be the same as the Use Case Primary Entity.
  • Hierarchical Relationship: The feature list's Primary Entity should be either a parent or grandparent of the Use Case Primary Entity.

In FeatureByte Enterprise, this concept plays a crucial role by ensuring that only feature lists compatible with a defined Use Case are displayed in the Feature List Catalog User Interface.

Example

Entity Diagram

Consider the following scenario:

Use Case: Card Default Prediction. Use Case Primary Entity: Card.

Feature List in Question: The feature list contains 2 features:

  • A feature that records the maximum distance between a customer's residence and the locations of their card transactions over the past 24 hours.
  • A feature on the Customer City population.

Feature List Primary Entity: Customer.

Analysis: This feature list is compatible with the Use Case. Although the Feature List Primary Entity is 'Customer', the 'Customer' entity is a parent of the 'Card' entity, since each card belongs to exactly one customer. Therefore, the feature list can be effectively utilized in the Card Default Prediction Use Case.

Feature List Thematic Coverage

FeatureByte Enterprise leverages the systematic thematic categorization of features, analysing the Feature Theme attributed to each feature in a given feature list, to assess the list's comprehensiveness. Any thematic areas that are not adequately covered by the existing features in the list are highlighted as "Themes not covered".

Feature List Serving

Note

A feature list can be served by its primary entity or any descendant serving entities.

Historical Feature Serving

Historical serving of a feature list is usually intended for exploration, model training, and testing. The requested data is represented by an observation set that combines entity key values and historical points-in-time, for which you want to materialize feature values.

Requesting historical features is supported by two methods:

  • compute_historical_features(): returns a loaded DataFrame. Use this method when the output is expected to be of a manageable size that can be handled locally.
  • compute_historical_feature_table(): returns a HistoricalFeatureTable object representing the output table stored in the feature store. This method is suitable for handling large tables and storing them in the feature store for reuse or auditing.
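
A minimal sketch of both methods; the POINT_IN_TIME column is a FeatureByte convention, while the serving name "CUSTOMERID" and the table name argument are illustrative assumptions.

```python
import pandas as pd

# Observation set: entity key values and historical points-in-time.
observation_set = pd.DataFrame({
    "POINT_IN_TIME": pd.to_datetime(["2023-03-01", "2023-04-01"]),
    "CUSTOMERID": ["C1042", "C2931"],
})

# Manageable output size: a loaded DataFrame.
training_df = feature_list.compute_historical_features(observation_set)

# Large output: a HistoricalFeatureTable stored in the feature store.
training_table = feature_list.compute_historical_feature_table(
    observation_set,
    historical_feature_table_name="card_default_training_data",
)
```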

Note

Historical feature values are not pre-computed or stored. Instead, the serving process combines partially aggregated data as offline tiles. This approach of pre-computing and storing partially aggregated data minimizes compute resources significantly.

User Interface

Learn by example with the 'Compute Feature Table' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Feature List Deployment

A feature list can be deployed to support its online and batch serving.

To create a Deployment, the corresponding feature list must have all its features labeled as "PRODUCTION_READY".

A feature list is deployed without creating separate pipelines or using different tools. The deployment complexity is abstracted away from users.

When a deployment is created, the deployment can be associated with a Use Case to facilitate the tracking of both deployments and use cases.

Note

A given feature list can be associated with multiple deployments and use cases if needed.

SDK Reference

Refer to the Deployment main page or to the specific links:

User Interface

Learn by example with the 'Deploy and serve a feature list' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Online and Batch Serving

The process of utilizing a feature list for making predictions is typically carried out through online or batch serving. The feature list must first be deployed and its associated Deployment object must be enabled. This triggers the orchestration of the feature materialization into the online feature store. The online feature store then provides pre-computed feature values for online or batch serving.

The request data of both the online and batch serving consists of the key values of one of the serving entities of the deployed feature list.

Note

An accepted serving name must be used for the column containing the entity values.

The request data does not include specific timestamps, as the point-in-time is automatically determined when the request is submitted.

A REST API service supports online feature serving. Python or shell script templates for the REST API service are retrieved from the Deployment object.

Shell template

Batch serving is supported by first creating a BatchRequestTable object in the SDK that lists the entity key values for which inference is needed. The BatchRequestTable is created from either a source table in the data warehouse or a view.

Batch feature values are then obtained in the SDK from the Deployment object and the BatchRequestTable. The output is a BatchFeatureTable that represents the batch feature values stored in the feature store and contains metadata offering complete lineage on how the table was produced.
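
A minimal sketch of batch serving, assuming "deployment" is an enabled Deployment object for the feature list and "customer_view" is the customer view from the earlier join sketch; create_batch_request_table and the parameter names are assumptions, while compute_batch_feature_table is the method described in the Deployment section.

```python
# Entity key values for which inference is needed, created from a view.
batch_request_table = customer_view.create_batch_request_table(
    name="Active customers batch request"
)

# Batch feature values stored in the feature store as a BatchFeatureTable.
batch_feature_table = deployment.compute_batch_feature_table(
    batch_request_table=batch_request_table,
    batch_feature_table_name="Card default batch features",
)
```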

User Interface

Learn by example with the 'Deploy and serve a feature list' tutorial of the Credit Default UI tutorials and Grocery UI tutorials.

Feature List Governance

Feature List Version

A Feature List Version allows a feature list to use the latest version of each of its features. Upon creation of a new feature list version, the latest default versions of the features are employed unless particular feature versions are specified.

SDK Reference

How to:

Default Feature List Version

The 'Default Version of a Feature List' must comprise the default version of each feature, as indicated by its default_feature_fraction property being equal to 1. If this fraction is less than 1, a new feature list version must be created as the Default Feature List Version. Upon creation of this new list, the default_feature_fraction of the Default Feature List Version will be reset to 1.

Feature List Status

Feature lists can be assigned one of five status levels to differentiate between experimental feature lists and those suitable for deployment or already deployed.

  • "DEPLOYED": Assigned to feature list with at least one deployed version.
  • "TEMPLATE": For feature lists as reference templates or safe starting points.
  • "PUBLIC_DRAFT": For feature lists shared for feedback purposes.
  • "DRAFT": For feature lists in the prototype stage.
  • "DEPRECATED": For outdated or unnecessary feature lists.

Note

The status is managed at the namespace level of a Feature List object, meaning all versions of a feature list share the same status.

For the following scenarios, some status levels are automatically assigned to feature lists:

  • when a new feature list is created, the "DRAFT" status is assigned to the feature list.
  • when at least one version of the feature list is deployed, the "DEPLOYED" status is assigned.
  • when deployment is disabled for all versions of the feature list, the "PUBLIC_DRAFT" status is assigned.

Additional guidelines:

  • Before setting a feature list status to "TEMPLATE", ensure all features in the default version are "PRODUCTION_READY".
  • Only "DRAFT" feature lists can be deleted.
  • You cannot revert a feature list status to a "DRAFT" status.
  • Once a feature list is in "DEPLOYED" status, you cannot change its status until all the associated deployments are disabled.

SDK Reference

How to:

Feature List Readiness

The Feature List Readiness metric provides a statistic on the readiness of features in the feature list version. This metric represents the percentage of features that are production ready within the given feature list.

Important

Before a feature list version is deployed, all its features must be "production ready" and the metric should be 100%.

SDK Reference

How to get the readiness metric of a feature list.

Feature List Percentage of Online Enabled Features

The 'Feature List Percentage of Online Enabled Features' represents the proportion of its features that are used by at least one deployed feature list. A percentage near 100% suggests a lower cost for deploying the feature list.

Feature Table

A Feature Table contains historical feature values from a historical feature request that are typically produced to train or test Machine Learning models. The historical feature values can also be obtained as a Pandas DataFrame, but using a Feature Table has some benefits such as handling large tables, storing the data in the feature store for reuse, and offering full lineage.

SDK Reference

Refer to the HistoricalFeatureTable object main page.

Feature Table Creation

In the SDK, a HistoricalFeatureTable object is created by computing historical features for a feature list using the compute_historical_feature_table() method. The method uses as input an observation table that combines historical points-in-time and key values of the feature list's primary entity or of its related serving entities.

In the FeatureByte Enterprise User Interface, a Feature Table can be generated by selecting a feature list and specifying an observation table compatible with the feature list.

SDK Reference

How to compute a feature table.

Feature Table Lineage

The Feature Table contains metadata on the Feature List and Observation Table used.

SDK Reference

How to:

Feature Table Purpose

The purpose of a Feature Table depends on the purpose of the observation table it comes from. It can vary from being a simple preview to being used for more complex tasks like exploratory data analysis, training, or validation tests. This classification helps in easily identifying and reusing Feature Tables.

Feature Table Association with a Context or Use Case

The association of a Feature Table with specific Contexts or Use Cases is determined by its originating observation table. This link makes it straightforward to organize and locate Feature Tables relevant to particular use cases.

Deployment

In FeatureByte, a Deployment object manages the online and batch serving of a deployed FeatureList for specific Use Cases.

Enabling and Disabling Deployments

A Deployment Object is initiated when a FeatureList is deemed ready for production deployment.

Upon creation, the Deployment can be enabled for online and batch serving, triggering the orchestration of feature materialization into the online feature store.

Deployments can be disabled at any time, ceasing the online and batch serving of the feature list without impacting serving of the historical requests. This approach is distinct from the 'log and wait' method used in some other feature stores.

Note

If the feature list is associated with multiple deployments (for different use cases), disabling one deployment will not affect the serving of other deployments.

SDK Reference

Refer to the Deployment main page or to the specific links:

Deployment and Online Serving

For online serving, Deployment objects offer Python or shell script templates for REST API services.

Deployment and Batch Serving

Batch serving utilizes the SDK's compute_batch_feature_table() method, returning a BatchFeatureTable object that represents a table in the feature store with batch feature values.

SDK Reference

For more details, refer to the SDK reference for BatchFeatureTable object.

Feature Job Status

The Deployment object provides reports on recent activities of scheduled feature jobs, including run history, success status, and durations.

In cases of failed or late jobs, it's advised to review data warehouse logs for insights, especially if the issue relates to compute capacity.

SDK Reference

How to get the feature job status for a feature list.

Deployment Catalog

Deployments can be associated with specific Use Cases, and all related deployments can be managed and listed from the Use Case.

Within the catalog, deployments can be listed, retrieved by name, or by Object ID.

SDK Reference

How to:

The Deployment object class methods allow for listing and managing deployments across all catalogs.

SDK Reference

How to:

  • list() to list all deployments across catalogs.
  • get() to get a Deployment object by its name.
  • get_by_id() to get a Deployment object by its Object ID.

Approval Flow

Enabling Approval Flow

FeatureByte Enterprise catalogs can incorporate an Approval Flow. When active, key actions require approval such as:

To check if Approval Flow is active, look for a validation mark next to the Catalog name.

If it's missing, click the settings icon near the Catalog name at the top of the screen to access and enable the Approval Flow option.

Feature Adjustments

When table metadata changes occur (e.g., new cleaning operations, updating feature job settings), they trigger new feature versions. This ensures compatibility with new data. Users can modify default actions for these features and analyze the impact of both original and updated operations.

Approval Flow Checks

Approval Flow involves several automated checks:

For Marking a Feature as Production-Ready:

For Changes in Cleaning Operations:

  • Analysis of features with actions diverging from new operations.
  • Completion of this analysis changes request checks to green.
  • Emphasis on understanding impacts of both new and original operations.

For Changes in Feature Job Setting:

Learning Through UI Tutorials

For a practical understanding of the approval flow, explore our UI tutorials:

Feature Store

The purpose of a Feature Store is to centralize pre-calculated values, which can significantly reduce the latency of feature serving during training and inference.

FeatureByte Feature Stores are designed to integrate seamlessly with data warehouses, eliminating the need for bulk outbound data transfers that can pose security risks. Furthermore, the computation of features is performed in the data warehouse, taking advantage of its scalability, stability, and efficiency.

Pre-calculated values for online and batch serving are stored in an online feature store.

Partial aggregations in the form of online and offline tiles are also stored to streamline feature materialization for historical requests and for online and batch serving. This approach enables computation to be performed incrementally on tiles rather than over the entire time window, leading to more efficient resource utilization.

Once a feature is deployed, the FeatureByte service automatically initiates the materialization of feature values and tiles, scheduled according to the feature's feature job setting.

SDK Reference

Refer to the FeatureStore object main page or to the specific links:

Tiles

Tiles are a method of storing partial aggregations in the feature store, which helps to minimize the resources required to fulfill historical and online requests. There are two types of tiles managed by FeatureByte: offline tiles and online tiles.

When a feature has not yet been deployed, offline tiles are cached following a historical feature request to reduce the latency of subsequent requests. Once the feature has been deployed, offline tiles are computed and stored according to the feature job setting.

The tiling approach adopted by FeatureByte also significantly reduces storage requirements compared to storing offline features. This is because tiles are more sparse than features and can be shared by features that use the same input columns and aggregation functions.

Feature Jobs

Feature Job Background

FeatureByte is designed to work with data warehouses that receive regular data refreshes from operational sources, meaning that features may use data with various freshness and availability. If these operational limitations are not considered, inconsistencies between offline requests and online and batch feature values may occur.

To prevent such inconsistencies, it is crucial to synchronize the frequency of batch feature computations with the frequency of source table refreshes and to compute features after the source table refresh is fully completed. In addition, for historical serving to accurately replicate the production environment, it is essential to use data that would have been available at the historical points-in-time, considering the present or future data latency. Latency of data refers to the time difference between the timestamp of an event and the timestamp at which the event data is accessible for ingestion. Any period during which data may be missing is referred to as a "blind spot".

To address these challenges, the feature job setting in FeatureByte captures information about the frequency of batch feature computations, the timing of the batch process, and the assumed blind spot for the data. This helps ensure consistency between offline and online feature values and accurate historical serving that reflects the conditions present in the production environment.

Feature Job

A Feature Job is a batch process that generates both offline and online tiles and feature values for a specific feature before storing them in the feature store. The scheduling of a Feature Job is determined by the feature job setting associated with the respective feature.

Feature job orchestration is initiated when a feature is deployed and continues until the feature deployment is disabled, ensuring the feature store consistently possesses the latest values for each feature.

Feature Job Setting

The Feature Job Setting in FeatureByte captures essential details about batch feature computations for the online feature store, including the frequency and timing of the batch process, as well as the assumed blind spot for the data. This helps to maintain consistency between offline and online feature values and ensures accurate historical serving that reflects the production environment.

Feature Job Setting for Time Series

The feature job setting consists of three key parameters:

  1. Crontab: Defines the cron schedule for the feature job, specifying when the job should run.

  2. Time Zone: Determines the time zone in which the cron schedule operates, ensuring alignment with local time conventions.

  3. Reference Time Zone: Specifies the time zone used to define calendar-based aggregation periods (e.g., daily, weekly, or monthly). This reference time zone ensures consistency when calculating calendar periods across different data time zones.

    • For example, if the scheduled job is 2025/01/31 23:00 UTC and the reference time zone is Asia/Singapore, the corresponding calendar date would be 2025/02/01. Consequently, the aggregation for the latest complete month would cover January.
    • If a time zone column is used to assign individual time zones per record, the reference time zone should be the westernmost time zone among those specified in the column. This ensures that aggregation periods fully encompass the calendar dates of all relevant observations.
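
To illustrate these three parameters, here is a sketch of declaring such a setting, assuming the SDK class is named `CronFeatureJobSetting` and accepts `crontab`, `timezone`, and `reference_timezone` arguments:

```python
import featurebyte as fb

# Run the feature job daily at 23:00 UTC, while calendar periods (days, weeks,
# months) used for aggregation are defined in the Asia/Singapore time zone.
cron_setting = fb.CronFeatureJobSetting(
    crontab="0 23 * * *",                 # minute hour day-of-month month day-of-week
    timezone="Etc/UTC",                   # time zone in which the cron schedule runs
    reference_timezone="Asia/Singapore",  # time zone defining calendar periods
)
```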

Feature Job Setting for Other Tables

The feature job setting consists of three parameters:

  1. Period: Specifies how often the batch process should run.

    • For example, a period of 60m indicates that the feature job will execute every 60 minutes.
  2. Offset: Defines the time delay from the end of the period to when the feature job starts.

    • For instance, with a setting of period: 60m and offset: 130s, the feature job will commence 2 minutes and 10 seconds after the start of each hour: 00:02:10, 01:02:10, 02:02:10, ..., 23:02:10.
  3. Blind Spot: Represents the time gap between when a feature is computed and the latest available event. This ensures that delayed data ingestion or processing does not compromise feature accuracy.

    Why is the Blind Spot Important?

    In an ideal scenario, we would include all prior events up to the present in our feature computation. However, due to the time required for data collection, ETL processing, and other pipeline delays, the most recent events may not be immediately available.

    Setting the blind spot too small may result in data leakage during model training, as the production environment might not have access to the most recent data at inference time. Conversely, setting it too large may lead to stale feature values, reducing predictive performance.

Case Study

Consider a scenario where a data warehouse refreshes hourly:

  • The refresh begins 10 seconds after the hour.
  • It typically completes within 2 minutes but can occasionally miss data from the last 30 seconds before the end of the hour.

To accommodate these factors, the feature job settings would be:

  • Period: 60m
  • Offset: 10s + 2m + 5s (safety buffer) = 135s
  • Blind Spot: 30s + 10s + 2m + 5s = 165s

These settings ensure the feature job runs after the data is fully refreshed while accounting for potential delays.
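
Translated into a declaration, a sketch of the corresponding setting (assuming the keyword names `period`, `offset`, and `blind_spot`):

```python
import featurebyte as fb

# Settings derived from the case study: hourly refresh that starts 10s after
# the hour, takes ~2m, plus a 5s safety buffer and up to 30s of late data.
feature_job_setting = fb.FeatureJobSetting(
    period="60m",       # run the feature job once per hour
    offset="135s",      # 10s + 2m + 5s safety buffer
    blind_spot="165s",  # 30s + 10s + 2m + 5s safety buffer
)
```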

Feature Versioning and Flexibility

When changes occur in the management of the source tables—such as updates impacting data availability or freshness—a new feature version can be created with updated feature job settings to maintain accuracy and consistency.

Alignment with Online and Historical Requests

Although Feature Jobs are primarily designed to handle online requests, these settings also support historical requests. This helps minimize inconsistencies between offline and online data processing.

Consistency Across Teams

To ensure consistent feature job settings across teams, a Default Feature Job Setting is defined at the table level. However, team members can override this default setting when declaring specific features, offering flexibility for unique requirements.
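
For instance, an override at feature declaration time might look like the sketch below, assuming a grocery event view with a customer entity column and an amount column (all names are illustrative):

```python
import featurebyte as fb

catalog = fb.Catalog.get_active()
event_view = catalog.get_table("GROCERYINVOICE").get_view()  # illustrative table

# Override the table's default feature job setting for this feature only.
features = event_view.groupby("GroceryCustomerGuid").aggregate_over(
    value_column="Amount",
    method="sum",
    windows=["7d"],
    feature_names=["CustomerTotalSpend_7d"],
    feature_job_setting=fb.FeatureJobSetting(
        period="60m", offset="135s", blind_spot="165s"
    ),
)
```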

SDK Reference

How to declare a feature job setting.

Blind Spot

In Feature Job Settings, the blind spot refers to the time gap between when a feature is computed and the timestamp of the most recent event included in that computation. Accounting for this gap is essential to maintain consistency between training and serving, ensuring that inference data is complete and aligned with real-world availability.

Understanding Data Latency and Blind Spots

Data latency represents the time elapsed from when an event occurs to when its data becomes available for use. In the context of data ingestion, a blind spot is any period where data might be missing due to ingestion delays. Specifically, in feature computation, the blind spot extends from:

  • The completion of data ingestion in the data warehouse
  • To the start of the feature computation job

Why Does the Blind Spot Matter?

The blind spot directly impacts the timeliness and relevance of features used at inference time. If the blind spot is too short, the model may rely on data that wouldn't be available in a production setting, leading to training-serving inconsistencies. Conversely, if it's too long, the model may work with stale data, potentially reducing predictive performance.

Default Feature Job Setting

The Default Feature Job Setting establishes the default setting used by features that aggregate data in a table, ensuring consistency of the Feature Job Setting across features created by different team members. While it is possible to override the setting during feature declaration, using the Default Feature Job Setting simplifies the process of setting up the Feature Job Setting for each feature.

To further streamline the process, FeatureByte offers automated analysis of an event table's record creation timestamps and suggests appropriate setting values.
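
Here is a sketch of both paths, assuming an event table registered in the catalog and methods named `initialize_default_feature_job_setting` and `update_default_feature_job_setting` (names may differ between SDK versions):

```python
import featurebyte as fb

catalog = fb.Catalog.get_active()
event_table = catalog.get_table("CREDITCARD_TRANSACTIONS")  # illustrative name

# Run the automated analysis of record creation timestamps and apply the
# recommended values as the table's default feature job setting.
event_table.initialize_default_feature_job_setting()

# Alternatively, set the default explicitly.
event_table.update_default_feature_job_setting(
    fb.FeatureJobSetting(period="60m", offset="135s", blind_spot="165s")
)
```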

Approval Flow for Default Feature Job Setting

In Catalogs with Approval Flow enabled, changes in table metadata (including Default Feature Job Setting) initiate a review process. This process recommends new versions of features and lists linked to these tables, ensuring that new models and deployments use versions that address any data issues.

SDK Reference

How to:

User Interface

Learn by example with our 'Manage feature life cycle' UI tutorials.

Feature Job Setting Recommendations

FeatureByte automatically analyzes data availability and freshness of an event table to suggest an optimal setting for scheduling Feature Jobs and associated Blind Spot information.

This analysis relies on the availability of record creation timestamps in the source table, typically added when updating data in the warehouse. Additionally, the analysis focuses on a recent time window, such as the past four weeks.

FeatureByte estimates the data update frequency based on the distribution of time intervals among the sequence of record creation timestamps. It also assesses the timeliness of source table updates and identifies late jobs using an outlier detection algorithm. By default, the recommended scheduling time takes late jobs into account.

To accommodate data that may not arrive on time during warehouse updates, a blind spot is proposed for determining the cutoff of feature aggregation windows, in addition to the scheduling frequency and time of the Feature Job. The suggested blind spot is chosen so that the resulting percentage of late data is closest to the user-defined tolerance, which defaults to 0.005%.

To validate the Feature Job schedule and blind spot recommendations, a backtest is conducted. You can also backtest your custom settings.
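
For example, assuming the analysis object exposes a `backtest` method, a custom setting can be evaluated as follows:

```python
import featurebyte as fb

catalog = fb.Catalog.get_active()
event_table = catalog.get_table("CREDITCARD_TRANSACTIONS")  # illustrative name

# Run a fresh analysis of the table's record creation timestamps ...
analysis = event_table.create_new_feature_job_setting_analysis()

# ... then backtest a candidate setting: the result estimates the share of new
# data that would have been missed had this setting been used in past jobs.
analysis.backtest(
    fb.FeatureJobSetting(period="60m", offset="120s", blind_spot="150s")
)
```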

Feature Job Setting Backtest

A backtest in feature job settings evaluates the effectiveness of these settings with respect to the availability and freshness of data. This process involves calculating the proportion of new data that would have been missed in the computation of a feature if these settings had been used in previous calculations. Here, "new data" refers to data processed during the latest time frame that matches the job's frequency.

A percentage higher than 0 indicates potential future problems with training-serving consistency, as it implies that serving might utilize incomplete data.

Common reasons for backtest failures include:

  1. Misalignment of Frequencies: The frequency at which feature jobs run should ideally be a multiple of the data warehouse's update frequency. This alignment ensures that each feature job incorporates the most recent data updates.
  2. Premature Feature Job Start: Starting a feature job too early, before the data warehouse update is complete, can lead to incomplete data incorporation. To avoid this, set a larger offset after the completion of the data warehouse update, allowing enough time for all data to be processed.
  3. Inadequate Data Latency Handling: Failing to account for an adequate blind spot period, the time necessary to cover data latency, can result in using incomplete data for serving. This blind spot should be long enough to ensure that all relevant data has been updated and is ready for use.
  4. Data Warehouse Update Issues: Issues such as past failures or irregular updates in the data warehouse can also lead to backtest failures. If these issues are identified, it's important to assess whether they are likely to recur and to adjust settings or processes accordingly.

Training-Serving Inconsistency

Training-Serving Inconsistency (or Training-Serving Skew) is a difference between performance during training and performance during serving. This skew can be caused by:

  • A discrepancy between how you handle data in the training and serving pipelines.
  • A change in the data between when you train and when you serve.

This inconsistency can lead to unexpected and potentially erroneous predictions.

Data Ontology

FeatureByte’s Data Ontology is a structured framework that categorizes columns in a dataset based on their meaning and usage. It is organized as a hierarchical tree, where each semantic type represents a distinct data classification, equipped with specialized feature engineering practices. This structured approach enhances data understanding, ensures consistent processing, and optimizes feature transformation techniques.

Semantic Type

A semantic type defines the meaning, expected values, and appropriate feature engineering operations for a column in a table. By associating each column with a semantic type, the ontology enables standardized processing, ensuring that data is transformed, aggregated, and utilized effectively for analysis and machine learning.

Semantic Type Detection

Semantic types can be automatically detected or manually assigned at the table level. Additionally, during Feature Ideation, they can be overwritten to refine feature engineering strategies based on evolving insights.

Which Semantic Type Should You Focus On?

When working with different table types, pay close attention to specific semantic types, as they influence filtering strategies, data aggregation, and feature engineering choices.

In Event Table and Time Series Table, check out the event_type (categorization of events based on their primary purpose or nature) and event_status (state, condition, or outcome of an event) semantic types. These columns will guide event-based filtering strategies.

In a Slowly Changing Dimension Table, check out the termination_timestamp and termination_date semantic types that indicate when an entity is actively terminated, sometimes prematurely. These columns determine how active entities are aggregated and when terminated entities should be analyzed.

For all tables, check out:

  • the non_additive_numeric semantic types (numeric values where direct addition is not meaningful). Understanding these columns prevents incorrect sum operations.
  • the automated non_informative semantic type (column with constant value). This may indicate problems in your data.
  • the not_to_use semantic type (sensitive, personal, operational, or non-reliable data that should not be used). This determines whether feature engineering should be performed on those columns.
  • the ambiguous_numeric (column that combines different units or scales) and ambiguous_categorical (column that does not provide unique information by itself) semantic types. These columns may require prior manual transformations before being used by feature engineering.

By carefully reviewing these semantic types, you can enhance feature selection and ensure high-quality transformations for machine learning.

Ontology Tree

FeatureByte’s ontology follows a hierarchical tree structure, where broader semantic types define general properties, and more specific types refine these properties for specialized use cases. Child nodes inherit feature engineering practices from their parent nodes, ensuring consistency while allowing for domain-specific adjustments.

Tree Key Concepts

  • Inheritance: Child nodes inherit feature engineering practices from their parent nodes.
  • Levels of Specificity: The Ontology is divided into levels, each providing a finer degree of specificity:
    • Level 1: Basic generic semantic types.
    • Level 2 & 3: More precise semantics for advanced feature engineering.
    • Level 4: Domain-specific nodes.

Semantic type: numeric

Description: Represents quantitative data that can be aggregated. Contains either integer or decimal values.

  • non_additive_numeric: Numeric variable where direct addition does not yield meaningful interpretation. Examples of non-additive numeric variables are speed, age or tenure, unit price, temperature, rating, percentage, rank, or order.
    • measurement_of_intensity: Numeric values that represent the magnitude of a specific metric.
      • temperature: Numerical value indicating thermal levels, such as patient body temperature.
      • patient_temperature: Specific instance of temperature measurement for a patient.
      • patient_blood_pressure: Measurement capturing the arterial blood pressure of a patient.
      • sound_frequency: Number of vibrations or cycles per second of a sound wave, measured in Hertz (Hz).
      • unit_price: Cost of a single item or unit of measurement.
    • time_dependent_monotonic_value: Numeric values that increase over time.
      • age: The length of time that an individual has lived or a thing has existed.
      • account_duration: The length of time an account has been active.
      • tenure: Duration of time that someone has been in a specific role or occupation.
    • ratio: Represents a proportional relationship between two quantities, often maintaining a fixed relation.
    • percentage: A way of expressing a number as a fraction of 100.
      • discount_percentage: The percentage reduction from the original price.
    • statistics: A category that reflects mathematical characteristics derived from a dataset.
      • mean: The average value derived from a set of numbers.
    • distance: Refers to a measure of space between two points, can be positive, and often encoded in units like meters, kilometers, miles, etc.
    • rank: Refers to the position or level of something within a hierarchy, indicating relative importance compared to others.
    • order: Represents the arrangement or sequence of items according to particular criteria.
  • semi_additive_numeric: Numeric values where addition makes sense only within a specific point in time and not across time periods.
    • point_in_time_value: Represents values that provide a snapshot of a person or organization's status at a specific moment.
      • snapshot_value: A value taken at a specific moment in time, useful for tracking changes.
      • balance: The amount of money available in a financial account at a given moment.
      • stock: The quantity of items, products, or supplies held in inventory.
      • occupancy: Number of units occupied (e.g., rooms, apartments, or beds) at a given time.
      • headcount: Number of individuals within a group, organization, or event.
      • facilities: Number of distinct facilities or locations, such as hospitals, schools, stores, or businesses.
      • capacity: Maximum number of occupants or items a facility or system can hold, such as beds in a hospital, seats in a stadium, or total volume in storage.
      • asset_valuation: Assessed or market value of assets at a specific point in time.
      • liability_amount: Total amount of liabilities or debts owed by an individual or organization.
    • periodic_value: Represents values measured over fixed, regular intervals, reflecting metrics that reset each period without accumulating.
      • recurring_amount: Regular charges for ongoing services billed at fixed intervals or financial amounts that repeat over specific intervals.
      • periodic_cost: Costs incurred regularly at each time period.
      • recurring_budget: Budgets set for recurring intervals.
      • recurring_count: Counts or quantities that recur at regular intervals.
      • recurring_duration: Time durations that apply regularly over each period.
      • recurring_usage: Usage or consumption measured over each standard period.
    • accrued_metric: Represents values that accumulate over time, reflecting growing totals.
      • cumulative_amount: Total amounts that accumulate over time without resetting.
      • cumulative_cost: Costs that accumulate over a period, showing the sum of expenses.
      • cumulative_budget: Budget amounts that accumulate over time, reflecting the total allocated.
      • cumulative_count: Total counts that add up over time.
      • cumulative_duration: Time durations that sum up over periods, representing accumulated usage or operation time.
      • cumulative_usage: Usage or consumption totals that accumulate over time.
    • interval_metric: A metric that quantifies the difference between two measurements taken over distinct periods of time. This metric can be used to observe changes or trends within a specified interval.
  • additive_numeric: Numeric variable where direct addition provides meaningful interpretation, including addition of multiple observations over some time frame.
    • unbounded_amount: Refers to a total monetary amount that can be either positive or negative.
      • unbounded_purchase_amount: Total amount spent on purchases, which can include refunds resulting in negative values.
      • unbounded_transaction_amount: Total monetary value of financial transactions, capable of reflecting both credits and debits.
      • unbounded_discount: Total discounts applied, allowing for both positive and negative values to account for additions or corrections.
    • non_negative_amount: Refers to a total monetary amount that can only be zero or positive.
      • non_negative_purchase_amount: Total amount spent on purchases without the possibility of refunds or returns resulting in negative values.
      • non_negative_transaction_amount: Total monetary value of transactions that cannot reflect debts or credits that would turn the value negative.
      • non_negative_discount: Total value of discounts given, which can’t be adjusted negatively.
    • non_positive_amount: Refers to a total monetary amount that can only be zero or negative.
      • non_positive_purchase_amount: Total amounts reflecting refunds or returns, which do not include new spending.
      • non_positive_transaction_amount: Sum of deductions or charges in financial transactions that do not account for incoming values.
      • non_positive_discount: Total adjustments reflecting reductions, but not increases in discount values.
    • count: Refers to a specific or measurable number (count, quantity) of items.
    • unbounded_time_delta: Refers to a time difference that can be either negative or positive.
    • non_negative_time_delta: Refers to a time difference that can only be zero or positive.
    • duration: Refers to a positive duration, often measured in units like seconds, minutes, or hours.
  • inter_event_distance: Numerical representation of the distance between two events, measured in physical space.
  • inter_event_time: Numerical representation of the time duration between two events.
    • inter_event_moving_time: Time duration specifically representing periods of movement or travel between events.
  • circular: Numeric data that represent periodic intervals where the end connects back to the beginning.
    • time_of_day: Represents various time segments within a day, such as morning, afternoon, evening, and night.
    • day_of_year: Denotes the sequential day within the year, with January 1st as 1 and December 31st as 365 (or 366 in leap years).
    • day_of_month: Represents the day within the month, encoded as an integer from 1 to 31.
    • month_of_year: Represents the month within a year, encoded as an integer from 1 (January) to 12 (December).
    • quarter_of_year: Indicates the quarter within a year, encoded as an integer from 1 (January-March) to 4 (October-December).
    • day_of_week: Represents the day within the week, encoded as an integer from 1 (Monday) to 7 (Sunday).
    • hour_of_day: Indicates the hour within the day, encoded as an integer from 0 (midnight) to 23 (11 PM).
    • hour_of_week: Represents the hour within a week, from 0 (midnight on Monday) to 167 (11 PM on Sunday).
    • direction: Represents directional headings (e.g., North, South, East, West) in degrees.

Semantic type: binary

Description: A special case of categorical where the column represents a binary flag with exactly two distinct categories.

  • boolean: Variable which represents a binary flag with values of true/false or yes/no.
    • binary_numeric: Numeric representation of binary values, often as 0 or 1.
    • binary_logical: Logical representation of binary states, usually as true/false.
    • physical_presence_indicator: Physical flag that indicates whether an event was performed physically rather than online.
      • is_in_store_transaction: Indicates if a transaction was conducted in a physical store.
      • is_in_person_event: Indicates if an event occurred in person.
  • filter_field: Binary flags used for filtering purposes.
    • is_positive: Indicates if a value is positive.
    • is_moving: Indicates if an object or subject is in motion.

Semantic type: categorical

Description: Contains values that represent discrete groups and categories. These values can be short text, codes, or numeric.

  • nominal_categorical: Categorical variables in which the categories do not have a meaningful order or ranking.
    • demographic_attribute: Includes a variety of attributes related to personal identity, social status, and professional roles.
      • gender: Represents gender identity of a person, often including values like 'female', 'male', 'non-binary', etc.
      • person_title: Denotes gender and marital status, e.g., Mr, Mrs, Dr, Prof, etc.
      • job_title: Titles or designations within an organizational structure, such as 'Manager', 'Director', 'Engineer'.
    • event_type: Categorization of events, grouping them into broad categories based on their primary purpose or nature.
    • context: Surrounding conditions or setting in which events occur.
    • status: Represents the status of a record, e.g., user account status (active, suspended), order status (pending, shipped, delivered), task status (started, completed), etc.
      • event_status: State, condition, or outcome of an event.
    • location: Represents any codified location information like zip codes, area codes, city, country, state codes, etc.
      • zip_code: Postal code for a specific geographic area.
      • area_code: Phone prefix designating a specific geographic region.
      • county_and_state: Combination of county and state, e.g., 'Fairfax County, Virginia' or 'Orange, CA'.
      • city_and_state: Combination of city and state, e.g., 'Los Angeles, CA' or 'Austin, Texas'.
      • state: Variable representing state, e.g., 'Texas' or 'CA'.
      • country: Variable representing country, e.g. 'USA' or 'France'.
    • code: Symbolic or numeric codes utilized across various domains, excluding location codes.
      • barcode: Machine-readable representation of information.
      • icd_10_cm: International Classification of Diseases, 10th Revision, Clinical Modification coding for diseases.
      • cpt_treatment_code: Current Procedural Terminology codes for medical treatment procedures.
      • ndc_drug_code: National Drug Codes for medications.
      • isbn: International Standard Book Number for books.
      • issn: International Standard Serial Number for periodicals.
      • status_code: Codes representing status, e.g., HTTP status codes.
      • reason_code: Codes that explain causes or reasons within various contexts.
      • mcc_code: Merchant Category Codes used in financial transactions.
  • ordinal: Represents categories that have a clear, distinct order or rank.
    • rating: Levels of quality or satisfaction, such as 'poor', 'average', 'good'.
    • severity_level: Levels representing severity, such as 'low', 'medium', 'high'.
    • brackets: Ranges that categorize items into specific limits, such as income brackets.
      • distance_buckets: Groups distances into specified intervals.
      • age_group: Divides ages into ranges.
  • cyclic_categorical: Categorical values in a cyclic or repeating order.
    • categorical_month_of_year: Categorical representation of months within a year.
    • categorical_quarter_of_year: Categorical representation of quarters within a year.
    • categorical_day_of_week: Categorical representation of days within a week.
    • categorical_hour_of_week: Categorical representation of hours within a week.
    • categorical_direction: Categorical representation of directions, such as cardinal points (N, NE, E, SE, S, SW, W, NW).

Semantic type: date_time

Description: Encompasses temporal data types ranging from broad scales (years) to precise measurements (timestamps).

  • timestamp_field: Precise point in time, typically including date and time components.
    • start_timestamp: Timestamp marking the beginning of an event, project, or activity.
    • end_timestamp: Scheduled conclusion of an event or activity as a timestamp.
    • termination_timestamp: Timestamp marking the active termination of an event or process.
    • birth_timestamp: Date and time of birth of a person as a timestamp.
  • date_field: Dates without time information.
    • start_date: Date signaling the beginning of an event, project, or activity.
    • end_date: Scheduled conclusion date of an event or activity.
    • termination_date: Date of active termination of an event or process.
    • date_of_birth: Date of birth of a person.
  • year: Represents a calendar year typically as a four-digit integer (e.g., 2024).
    • year_of_birth: Year of birth of a person.
  • year_quarter: Specifies a quarter within a year, including both the year and the quarter (e.g., 2024-Q1).
  • year_month: Represents a specific month in a specific year (e.g., 2024-05).
  • epoch: Specific point in time as the number of seconds (or milliseconds) elapsed since the Unix epoch (January 1, 1970, at 00:00:00 UTC).

Semantic type: text

Description: Contains free-form strings of varying length and complexity.

  • special_text: Represents semi-structured information such as addresses, URLs, emails, phone numbers, names, time zones, software codes, etc.
    • street_address: Specifies the location of a property on a street, without specifying the city or town.
    • address: Uniquely identifies the location of a property with information on the street, the city, and the country.
      • billing_address: Represents an address associated with an individual's or organization's method of payment, such as a credit card or bank account.
      • shipping_address: Represents an address where a customer requests goods or products to be delivered.
    • url: An internet URL that specifies the address of a resource on the web.
    • email: An email address of a person or an entity used for electronic communication.
    • organization_name: The name of a company or an organization, used for identifying corporate entities.
    • software_code: A set of instructions written in a specific programming language that can be executed by a computer to perform a defined task or set of tasks.
  • long_text: Represents descriptive, unstructured data like reviews, descriptions, posts, tweets, etc.
    • review: Represents a written evaluation or assessment of a product, movie, service, etc.
    • description: Represents any general description, for example, a product description.
    • resume: A document that summarizes a person's work experience, education, and skills.
    • event_record: Contains details of events, such as logs or records from specific occurrences.
    • twitter: A short post or message on the social media platform Twitter.
  • numeric_with_unit: Represents any measurement with units, like length with inches, time with hours, weight with kilograms, volume with liters, area with square meters, speed with meters per second, and temperature with Celsius.
    • amount_with_currency: Represents a monetary amount associated with a specific currency.
    • length_with_unit: Represents a length measurement specified with a unit, such as meters or inches.
    • time_with_unit: Represents a time duration associated with a specific unit, like hours, minutes, or seconds.
    • weight_with_unit: Represents a weight measurement specified with a unit, such as kilograms or pounds.
    • volume_with_unit: Represents a volume measurement specified with a unit, such as liters or gallons.
    • area_with_unit: Represents an area measurement specified with a unit, such as square meters or square feet.
    • speed_with_unit: Represents a speed measurement specified with a unit, such as kilometers per hour or miles per hour.
    • temperature_with_unit: Represents a temperature measurement specified with a unit, such as Celsius or Fahrenheit.

Semantic type: coordinates

Description: Represents geographical coordinates used for identifying locations on Earth.

  • longitude: Represents the longitude value on Earth's surface, with values between -180 and 180 degrees.
    • local_longitude: Non-global, zone-specific longitude values allowing for approximations in distance or centroid calculations.
      • local_longitude_of_moving_object: The longitude value specific to a moving object, expressed within a localized zone.
      • local_longitude_of_car: The longitude value specific to a moving car, within a localized zone.
    • longitude_of_moving_object: Specifies the longitude of an object in motion.
  • latitude: Represents the latitude value on Earth's surface, with values between -90 and 90 degrees.
    • local_latitude: Non-global, zone-specific latitude values allowing for approximations in distance or centroid calculations.
      • local_latitude_of_moving_object: The latitude value specific to a moving object, expressed within a localized zone.
      • local_latitude_of_car: The latitude value specific to a moving car, within a localized zone.
    • latitude_of_moving_object: Specifies the latitude of an object in motion.
  • latitude_in_degrees_minutes_and_seconds: Represents latitude expressed in degrees, minutes, and seconds (DMS) format.
  • longitude_in_degrees_minutes_and_seconds: Represents longitude expressed in degrees, minutes, and seconds (DMS) format.
  • latitude_longitude: Combines latitude and longitude values, representing a location.
  • longitude_latitude: Combines longitude and latitude values, representing a location.

Semantic type: sequence

Description: Represents an ordered series of items, such as categories, text, or numbers.

  • categorical_sequence: An ordered series of categorical values.
  • text_sequence: An ordered series of textual elements.
  • numeric_sequence: An ordered series of numerical values.

Semantic type: list

Description: Contains a series of values, which can be categories, text, or numerical, separated by a comma or other delimiter.

  • categorical_list: A list of categorical values.
  • text_list: A list of textual elements.
  • numeric_list: A list of numerical values.

Semantic type: dictionary

Description: Represents a collection of key-value pairs, where keys are unique identifiers.

  • dictionary_of_unbounded_values: A dictionary where values are unbounded and can take any form.
  • dictionary_of_non_negative_values: A dictionary where values are non-negative numbers.
    • dictionary_of_count: A dictionary specifically used to count occurrences of items, where values are count numbers.
  • dictionary_of_non_positive_values: A dictionary where values are non-positive numbers.

Semantic type: vector

Description: Represents a mathematical vector, an array of numbers used to measure direction and magnitude.

  • embedding: A dense vector representation of a piece of data, often used in machine learning for features like words or images.

Semantic type: converter

Description: Represents a value used to transform one unit or format into another, including but not limited to:

  • fx_rate: A foreign exchange rate used to convert from one currency to another.
    • billing_fx_rate: Refers to foreign exchange rates in financial transactions concerning billing and invoicing in international trade.
    • billing_fx_inverse_rate: Refers to the inverse of the billing foreign exchange rate, used to convert back from the target currency to the source currency.
  • time_zone: Represents a geographical region where the same standard time is used.

Semantic type: unit

Description: Represents types of units used to quantify specific properties.

  • currency: A unit of money.
  • length_unit: A unit used to measure length, such as meters or miles.
  • time_unit: A unit used to measure time, such as seconds or hours.
  • weight_unit: A unit used to measure weight, such as kilograms or pounds.
  • volume_unit: A unit used to measure volume, such as liters or gallons.
  • area_unit: A unit used to measure area, such as square meters or acres.
  • speed_unit: A unit used to measure speed, such as meters per second or miles per hour.
  • temperature_unit: A unit used to measure temperature, such as Celsius or Fahrenheit.

Semantic type: temporal_key

Description: Identifiers that represent specific points or periods in time, commonly used to track the timing and duration of events or records in a database.

  • event_timestamp: The timestamp column in an Event table, recording the exact time a specific event occurred.
  • scd_effective_timestamp: The timestamp column in a Slowly Changing Dimension (SCD) table, indicating when the record becomes active or effective.
  • scd_end_timestamp: The timestamp column in a Slowly Changing Dimension (SCD) table, indicating when the record becomes inactive or outdated.
  • iot_sensor_timestamp: The timestamp captured from an IoT sensor, indicating the precise time the sensor data was collected.
  • time_series_date_time: A column representing the temporal reference in a time series dataset. It can capture various time granularities, such as year, year-month, date, or date-time.

Semantic type: unique_identifier

Description: (UID) A string of characters, numbers, or symbols used to uniquely identify an entity within a system or context. These identifiers ensure that every item, event, or entity can be distinctly recognized and referenced within a database or data structure.

  • event_id: The primary key in an Event table, uniquely identifying each event recorded in the system.
  • item_id: The primary key in an Item table, containing detailed information about specific items or transactions.
  • series_id: Uniquely identifies each series within a table containing multiple series, enabling clear distinction and tracking of individual entities, such as products or categories.
  • dimension_id: The primary key in a Dimension table, uniquely identifying each dimension entry in the database.
  • scd_surrogate_key_id: The unique identifier assigned to each record in a Slowly Changing Dimension table, providing a stable identifier as the table evolves over time.
  • scd_natural_key_id: The key in a Slowly Changing Dimension table that remains static over time, uniquely identifying each active row at any given point. Also known as an alternate key.
  • foreign_key_id: A column in one table that references the primary key in another table, establishing a relationship between the two tables.

Semantic type: ambiguous_numeric

Description: Numeric columns where values can represent different units or scales, potentially leading to misinterpretation without clarification.

  • mixed_unit_numeric: Numeric variables that can represent measurements in various units.
    • mixed_currency_amount: Monetary values in different currencies.
    • mixed_unit_length: Length measurements in different units (e.g., meters, feet, miles).
    • mixed_unit_time: Time measurements in different units (e.g., seconds, minutes, hours).
    • mixed_unit_weight: Weight measurements in different units (e.g., grams, pounds, kilograms).
    • mixed_unit_volume: Volume measurements in different units (e.g., liters, gallons).
    • mixed_unit_area: Area measurements in different units (e.g., square meters, square feet).
    • mixed_unit_speed: Speed measurements in different units (e.g., kilometers per hour, miles per hour).
    • mixed_unit_temperature: Temperature measurements in different units (e.g., Celsius, Fahrenheit).

Semantic type: ambiguous_categorical

Description: A categorical column that does not provide unique information by itself within a given context. These values require additional features or data to clarify their meaning, as they can lead to misinterpretation without context.

  • ambiguous_nominal_categorical: A categorical column that does not provide unique information by itself within a given context. These values require additional features or data to clarify their meaning, as they can lead to misinterpretation without context.
    • ambiguous_location
      • city_name: Represents a city in any country, potentially leading to ambiguity without further geographical details.
      • county_name: Represents counties (e.g., Jackson County) in any country, which can be ambiguous without additional regional information.

Semantic type: not_to_use

Description: Contains sensitive, personal, operational, or non-reliable data that should not be used in analysis to protect privacy or data integrity.

  • operational_key: Keys used for internal system operations rather than data analysis.
    • scd_current_flag: A column in a Slowly Changing Dimension (SCD) table used to indicate the current version of the record.
    • record_creation_timestamp: The timestamp indicating when a particular record was created in the data warehouse, often auto-generated upon record creation.
    • row_id: Unique identifier assigned to each row, primarily for the system to efficiently index, reference, and retrieve records.
  • personal_identifiable_information: Information that can uniquely identify an individual.
    • name: Contains individuals' personal names, which may include first names, last names, middle names, given names, etc.
      • person_name: The name of a person, or any component of the name.
      • given_name: The given name of a person.
      • middle_name: A middle name or middle initial, often the first letter of the middle name.
      • surname: The last name of a person.
    • phone_number: A string formatted as a phone number from any country.
  • confidential_information: Information that is sensitive and should be protected from unauthorized access.
  • noisy_data: Data that is too erratic or random, providing no meaningful insight and often obscuring useful data.

Semantic type: non_informative

Description: A column in which the value remains constant, providing no variance or useful information for analysis purposes.

Semantic type: unknown

Description: A semantic type that has not been identified.