Concepts¶

FeatureByte Catalog¶

A FeatureByte Catalog operates as a centralized repository for organizing tables, entities, features, feature lists, and other objects to facilitate feature reuse and serving.

By employing a catalog, team members can effortlessly share, search, retrieve, and reuse these assets while obtaining comprehensive information about their properties.

Create multiple catalogs for data warehouses covering multiple domains to maintain clarity and easy access to domain-specific metadata.

SDK Reference

Refer to the Catalog object main page or to the specific links:

list catalogs,
create a catalog,
get the currently active catalog,
activate a catalog,
list tables, entities, features or feature lists in a catalog,
and retrieve a table, an entity, a feature or a feature list from a catalog.

User Interface

Learn by example with the 'Create Catalog' tutorial.

Source Table and Special Columns¶

Data Source¶

A Data Source object in FeatureByte represents a collection of source tables that the feature store can access. From a data source, you can:

Retrieve the list of databases available
Obtain the list of schemas within the desired database
Access the list of source tables contained in the selected schema
Retrieve a source table for exploration or registering it in the catalog.

SDK Reference

Refer to the DataSource object main page.

User Interface

Learn by example in the 'Register Tables' tutorial.

Source Table¶

A Source Table in FeatureByte is a table of interest that the feature store can access and is located within the data warehouse.

To register a Source Table in a FeatureByte catalog, first determine its type. There are seven supported types: event table, item table, snapshots table, time series table, slowly changing dimension table, dimension table, and calendar table.

Note

Registering a table according to its type determines the types of feature engineering operations that are possible on the table's views and enforces guardrails accordingly.

To identify the table type and collect key metadata, you can perform Exploratory Data Analysis (EDA) on the source table: obtain descriptive statistics, preview a selection of rows, or collect additional information on its columns.

SDK Reference

Refer to the SourceTable object main page or to the specific links:

list source tables in a data source,
retrieve a source table from a data source,
obtain descriptive statistics,
and preview a selection of rows.

User Interface

Learn by example in the 'Register Tables' tutorial.

Primary Key¶

A Primary Key is a column that uniquely identifies each record (row) in a table.

The primary key ensures data integrity by preventing duplicate records and must meet the following requirements:

Unique – Each record must have a distinct primary key value.
Non-null – The primary key cannot contain null (empty) values.
Stable – The primary key value should remain unchanged over time.
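The first two requirements can be checked mechanically on a sample of the column. The following is a minimal sketch in plain Python, not part of the FeatureByte SDK; the function names are hypothetical. Stability cannot be verified from a single extract, so it is not covered here.

```python
from collections import Counter
from typing import Any, Sequence


def primary_key_violations(values: Sequence[Any]) -> dict:
    """Return the null count and the duplicated values of a candidate key column."""
    null_count = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    return {"null_count": null_count, "duplicates": duplicates}


def is_valid_primary_key(values: Sequence[Any]) -> bool:
    """True when the column contains no nulls and no repeated values."""
    report = primary_key_violations(values)
    return report["null_count"] == 0 and not report["duplicates"]
```

A column that fails this check (for example, an event ID repeated across status updates) should not be declared as a primary key.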

FeatureByte tables can contain the following types of primary keys:

Dimension ID: The primary key in a Dimension table.
Surrogate key: The primary key in a Slowly Changing Dimension (SCD) table.
Event ID: In some cases, used as the primary key in an Event table.
Item ID: The primary key in an Item table.

Event ID¶

The Event ID uniquely represents an event in the Event table.

If the Event Table contains multiple records for the same event ID (tracking status changes over time), the event ID cannot be treated as a primary key. In such cases:

the table should include an event status column to differentiate records.
the event timestamp should reflect the update time of the event status.
the table view should be filtered using the event status column before feature engineering.
the table cannot be used as a right table in joins or associated with an Item Table.

Item ID¶

An item ID serves as the primary key in an Item table. It typically has a one-to-many relationship with the event ID, meaning that a single event (e.g., a customer order) can be associated with multiple items (e.g., different products in that order).

Item ID Examples

Retail & E-commerce (Customer Orders)

Event ID: Order ID
Item ID: Product ID
Additional Attributes: Product name, quantity, price, discount, category

Healthcare (Drug Prescriptions in Doctor Visits)

Event ID: Visit ID
Item ID: Prescription ID
Additional Attributes: Drug name, dosage, frequency, prescribing doctor

Banking & Finance (Transaction Breakdowns)

Event ID: Transaction ID
Item ID: Line Item ID
Additional Attributes: Merchant, transaction type, amount, tax, currency

Logistics & Supply Chain (Shipment Details)

Event ID: Shipment ID
Item ID: Package ID
Additional Attributes: Weight, dimensions, destination

Dimension ID¶

A Dimension ID serves as the primary key in a Dimension table, uniquely identifying each record (row) in the table. Dimension IDs must be unique and stable over time to ensure data consistency and reliability in historical analysis.

Example

In a Product Dimension Table, each product would have a unique Dimension ID, ensuring that product details remain consistent across records.

Surrogate Key¶

In a Slowly Changing Dimension (SCD) table, a surrogate key is a unique identifier assigned to each record. It ensures a stable, system-generated identifier that remains unchanged, even as the table evolves over time.

Example

Consider an SCD Table that tracks customer addresses over time. When a customer updates their address, instead of modifying the existing record, a new record is added with the updated information.

The Surrogate Key acts as the primary key, uniquely identifying each record.
The Customer ID serves as the natural key, linking all records to a specific customer.
An Effective Timestamp marks when each address became valid.
An End Timestamp marks when each address became invalid.

Example Table:

Surrogate Key	Customer ID (Natural Key)	Address	Valid From	Valid To
1	123456	123 Main St	2019-01-13 11:00:00	2021-03-16 10:00:00
2	123456	456 Oak St	2021-03-16 10:00:00	NULL

Key Insights:

The Surrogate Key (1, 2) uniquely identifies each row.
The Customer ID remains the same across records, preserving the historical link.
The Valid From and Valid To timestamps define the active period of each record.
The latest record (456 Oak St) has a NULL Valid To, indicating it is still active.

Series ID¶

A series ID in a snapshots table or time series table identifies and separates different series within the table, ensuring each series can be grouped, analyzed, and processed independently.

Example

Imagine a time series table tracking hourly sales for multiple stores. The series ID represents each store, ensuring sales data is kept separate for analysis. For example, "Store_A" and "Store_B" would have their own series IDs, allowing you to calculate trends, forecast future sales, or analyze growth within each store independently.

Natural Key¶

In a Slowly Changing Dimension (SCD) table, a natural key (also called alternate key) is a column that remains constant over time and uniquely identifies each active row at any point in time.

This key is essential for maintaining historical records and analyzing changes over time within the table.

Example

Consider an SCD Table that tracks customer addresses over time. The Customer ID can be considered a natural key because:

It remains constant for a given customer.
It uniquely identifies each customer at any point in time.

Key Behavior:

At any given point in time, a Customer ID is associated with one active address.
Over time, multiple addresses can be linked to the same Customer ID, preserving historical changes.
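The key behavior above amounts to a point-in-time lookup: for a given natural key and timestamp, pick the record with the latest effective timestamp that is not after that timestamp. Below is a minimal sketch in plain Python using the customer address example; the data structure and function name are illustrative assumptions, not FeatureByte's implementation.

```python
from datetime import datetime
from typing import Optional

# Hypothetical rows mirroring the address example: (customer_id, address, valid_from)
ADDRESS_HISTORY = [
    ("123456", "123 Main St", datetime(2019, 1, 13, 11, 0)),
    ("123456", "456 Oak St", datetime(2021, 3, 16, 10, 0)),
]


def address_as_of(customer_id: str, point_in_time: datetime) -> Optional[str]:
    """Return the address whose effective timestamp is the latest one not after point_in_time."""
    candidates = [
        (valid_from, address)
        for cid, address, valid_from in ADDRESS_HISTORY
        if cid == customer_id and valid_from <= point_in_time
    ]
    if not candidates:
        return None  # the customer had no record yet at that time
    return max(candidates)[1]
```

Note that the lookup relies only on the effective timestamp, so it never needs information recorded after the point in time.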

Foreign Key¶

A Foreign Key is a column in one table that refers to the primary key in another table. It establishes a relationship between two tables.

Example

An example of a foreign key is Customer ID in an Orders table, which links each order to the Customer table, where Customer ID is the natural key.

Special DateTime Columns¶

This section describes all special DateTime-related columns used across table types and how they handle timestamps, time zones, and temporal consistency during feature computation.

DateTime columns can be stored as a Timestamp, Date, or represented as a string. If represented as a string, you must specify the format specific to your data warehouse.

Time Zone Considerations

To ensure temporal accuracy, specify whether the datetime column is recorded in local time and define its time zone component.

If the column is recorded in UTC, share the local time zone information to ensure that date part transforms are based on local time rather than UTC.
In Databricks, FeatureByte retrieves timestamps exactly as stored, without adjusting for your cluster’s time zone settings.
In Snowflake, FeatureByte accepts timestamp columns that include time zone offset information.

String-Based DateTime Format¶

If a DateTime column is represented as a string, specify the format consistent with your data warehouse:

Databricks (Spark SQL): "yyyy-MM-dd HH:mm:ss" — Reference
Snowflake: "YYYY-MM-DD HH24:MI:SS" — Reference
BigQuery: "%Y-%m-%d %H:%M:%S" — Reference
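Before declaring a format, it can help to confirm that sample values actually match it. The sketch below uses Python's standard `strptime`, whose directives for this particular pattern coincide with the BigQuery-style pattern listed above; the Spark SQL and Snowflake patterns use different syntax and would need to be checked in the warehouse itself. The helper names are hypothetical.

```python
from datetime import datetime

# Same directives as the BigQuery-style pattern "%Y-%m-%d %H:%M:%S".
PYTHON_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_datetime_string(value: str, fmt: str = PYTHON_FORMAT) -> datetime:
    """Parse a string-encoded datetime; raises ValueError if it does not match fmt."""
    return datetime.strptime(value, fmt)


def matches_format(value: str, fmt: str = PYTHON_FORMAT) -> bool:
    """Return True when value can be parsed with fmt."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False
```

For instance, `matches_format("16/03/2021 10:00:00")` is false under this pattern, which signals that a different format string must be declared for that column.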

Event Timestamp¶

The Event Timestamp column in an Event Table records the exact time when an event occurred.

If the event table contains multiple records for the same Event ID (e.g., tracking status changes), the timestamp should reflect when the status was updated.
For events spanning a duration (e.g., sessions), use the timestamp corresponding to the end of the period.

Reference Datetime Column¶

The Reference Datetime Column in a Time Series Table serves as the primary temporal anchor for each record, indicating when a measurement occurred. It defines the time dimension used for ordering, aggregating, and analyzing data over time.

You must also define the time interval, which specifies the expected frequency of data points (e.g., hourly, daily, or monthly).

Snapshot Datetime Column¶

The Snapshot Datetime Column in a Snapshots Table serves as the temporal anchor for each snapshot. You must define the time interval, which specifies how often snapshots occur (e.g., hourly, daily, or monthly).

Effective Timestamp¶

The Effective Timestamp column in a Slowly Changing Dimension (SCD) Table specifies when a record becomes active or effective.

Example

If a customer changes their address, the effective timestamp represents the date when the new address becomes active.

Expiration Timestamp¶

The Expiration Timestamp (or End Timestamp) column in an SCD Table indicates when a record is no longer valid or active.

Example

If a customer changes their address, the expiration timestamp represents when the old address is no longer valid.

Time Leakage Consideration

Although useful for data management, the Expiration Timestamp cannot be used for feature engineering because it reflects future information unavailable during inference. To prevent time leakage, this column is automatically excluded when generating views from tables.

Record Creation Timestamp¶

The Record Creation Timestamp indicates when a record was created in the data warehouse. It is usually system-generated at record creation.

Note

While useful for data management, this column is typically not used for feature engineering. Because it is influenced by data management processes rather than predictive signals, it may cause feature drift and degrade model performance. Therefore, it is discarded by default when generating views.

However, this information is used to analyze data availability and freshness and to configure the table’s default feature job setting.

Time Zone Component¶

The Time Zone Component defines how DateTime columns are interpreted:

If recorded in UTC, values are converted to local time.
If recorded in local time, values are converted to UTC.

Note

In Databricks, FeatureByte retrieves timestamps exactly as stored, without adjusting for cluster time zone settings.

Ways to Specify Time Zones:

Time Zone Name: Use standard names from the IANA Time Zone Database (e.g., "America/New_York", "Asia/Singapore").
Time Zone Offset: Define an offset from UTC (e.g., +08:00, -05:00).
Fixed Time Zone: Apply a single time zone uniformly to all records at the table level.
Per-Record Time Zone: Specify a column in the table that assigns a time zone to each record.

Daylight Saving Time Zone¶

Daylight Saving Time (DST) is managed using time zones defined in the IANA Time Zone Database (also known as the tz database).

The time zone can be declared once for the whole table, or through a dedicated column if the table includes multiple time zones.

Examples

America/New_York — Eastern Time (U.S.):
- Standard Time (EST): UTC−5
- Daylight Time (EDT): UTC−4
Europe/London — United Kingdom:
- Standard Time (GMT): UTC+0
- Daylight Time (BST): UTC+1
Asia/Kolkata — India (does not observe DST):
- Standard Time: UTC+5:30
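The offsets listed above can be reproduced with Python's standard `zoneinfo` module (Python 3.9+, and it needs an IANA time zone database available on the system). This is only an illustration of how an IANA zone name resolves to different UTC offsets across the year; the helper name is hypothetical.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def utc_offset(zone_name: str, local_time: datetime) -> timedelta:
    """Return the UTC offset that applies in zone_name at the given local wall-clock time."""
    return local_time.replace(tzinfo=ZoneInfo(zone_name)).utcoffset()


# Two sample dates: one in winter and one in summer (Northern Hemisphere).
winter = datetime(2025, 1, 15, 12, 0)
summer = datetime(2025, 7, 15, 12, 0)

for zone in ("America/New_York", "Europe/London", "Asia/Kolkata"):
    print(zone, utc_offset(zone, winter), utc_offset(zone, summer))
```

Because the zone name carries the DST rules, a single declaration such as "America/New_York" covers both the EST and EDT periods.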

Time Zone Offset¶

A Time Zone Offset (or UTC Offset) represents the difference between Coordinated Universal Time (UTC) and local time, expressed in hours and minutes. It applies to all table types and ensures that timestamp-based operations are consistent across data recorded in different time zones.

Like Daylight Saving Time Zone, Time Zone Offset can be defined globally or through a per-record column if time zones vary across rows.

Example

Local time 3 hours ahead of UTC → +03:00
Local time 2 hours behind UTC → -02:00

Note

The expected format is (+|-)HH:mm.
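As an illustration of the `(+|-)HH:mm` format, the following sketch validates an offset string and converts it into a fixed-offset time zone using only the Python standard library. The function name and the exact validation rules are assumptions made for this example.

```python
import re
from datetime import timedelta, timezone

# Sign, two-digit hours, colon, two-digit minutes.
OFFSET_PATTERN = re.compile(r"^([+-])(\d{2}):(\d{2})$")


def parse_utc_offset(offset: str) -> timezone:
    """Convert a '(+|-)HH:mm' string into a fixed-offset tzinfo object."""
    match = OFFSET_PATTERN.match(offset)
    if match is None:
        raise ValueError(f"Offset must look like +08:00 or -05:00, got {offset!r}")
    sign, hours, minutes = match.groups()
    if int(minutes) > 59:
        raise ValueError(f"Minutes must be between 00 and 59, got {offset!r}")
    delta = timedelta(hours=int(hours), minutes=int(minutes))
    return timezone(-delta if sign == "-" else delta)
```

With this helper, "+03:00" maps to three hours ahead of UTC and "-02:00" to two hours behind, matching the examples above.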

Timestamp with Time Zone Offset¶

In Snowflake, the TIMESTAMP_TZ type supports timestamps with embedded time zone offsets. FeatureByte recognizes this type, and all date part transforms on such columns or features are based on local time instead of UTC.

Reference Time Zone¶

The Reference Time Zone specifies the time zone used to define calendar-based aggregation periods (e.g., daily, weekly, or monthly) during feature computation. It ensures consistency when calculating calendar intervals across datasets containing timestamps from different time zones.

The Reference Time Zone applies to all table types and is configured as part of the Cron Feature Job Setting.
It is specifically used to align calendar-based aggregations (such as daily or monthly features) across time zones.

For Time Series and Snapshots, the Reference Time Zone is determined as follows:

If the Reference Datetime Column is associated with a single time zone, it uses that column’s Daylight Saving Time Zone.
If the Datetime column spans multiple time zones, it uses the westernmost time zone among those specified in the column.

Westernmost Time Zone Example

Suppose a dataset includes a user_time_zone column with values such as America/New_York, America/Chicago, and America/Los_Angeles. In this case, the Reference Time Zone should be America/Los_Angeles, as it is the westernmost among them.
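One way to pick the westernmost zone from such a column is to compare UTC offsets at a given instant and take the smallest. This is a sketch under that assumption; the source does not state how FeatureByte determines the westernmost zone, and the function name is hypothetical. It uses the standard `zoneinfo` module (Python 3.9+).

```python
from datetime import datetime, timezone
from typing import Iterable
from zoneinfo import ZoneInfo


def westernmost_time_zone(zone_names: Iterable[str], at: datetime) -> str:
    """Pick the zone with the smallest UTC offset at the timezone-aware instant `at`.

    Using the UTC offset as a proxy for 'westernmost' is an assumption of this sketch.
    """
    return min(zone_names, key=lambda name: at.astimezone(ZoneInfo(name)).utcoffset())


zones = ["America/New_York", "America/Chicago", "America/Los_Angeles"]
reference_zone = westernmost_time_zone(zones, datetime(2025, 1, 15, tzinfo=timezone.utc))
```

Choosing the westernmost zone means a calendar period is treated as complete only once it has ended in every zone present in the data.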

Example

Scheduled job time: 2025/01/31 23:00 UTC
Reference time zone: Asia/Singapore

Result:

The corresponding calendar date is 2025/02/01.
The aggregation for the latest complete month includes data from January.
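The arithmetic behind this example can be sketched with the Python standard library: convert the job time to the reference time zone, read the local calendar date, and step back one month. The function name is hypothetical and this is not FeatureByte's scheduling code.

```python
from datetime import date, datetime, timezone
from typing import Tuple
from zoneinfo import ZoneInfo


def latest_complete_month(job_time_utc: datetime, reference_zone: str) -> Tuple[date, Tuple[int, int]]:
    """Return the local calendar date of the job and the (year, month) that last fully elapsed."""
    local_time = job_time_utc.astimezone(ZoneInfo(reference_zone))
    year, month = local_time.year, local_time.month
    # Step back one month from the local calendar month of the job.
    previous = (year - 1, 12) if month == 1 else (year, month - 1)
    return local_time.date(), previous


job_time = datetime(2025, 1, 31, 23, 0, tzinfo=timezone.utc)
local_date, complete_month = latest_complete_month(job_time, "Asia/Singapore")
```

With Asia/Singapore (UTC+8), the job runs at 07:00 local time on 2025/02/01, so January is the latest complete month. Evaluated in UTC instead, the same job would still fall in January and the latest complete month would be December 2024.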

Summary of Special DateTime Columns¶

The table below summarizes all supported DateTime-related columns and their intended use across table types.

Column Name	Applies To	Purpose
Event Timestamp	Event Table	Captures when an event occurred. Used as the primary temporal anchor for event-based features.
Reference Datetime Column	Time Series Table	Indicates when a measurement occurred and defines the temporal order of the series.
Snapshot Datetime Column	Snapshots Table	Identifies when a snapshot was taken and defines snapshot frequency.
Effective Timestamp	SCD Table	Marks when a record becomes active or valid.
Expiration Timestamp	SCD Table	Marks when a record expires or is no longer valid.
Record Creation Timestamp	All Table Types	Indicates when the record was created in the data warehouse. Used for freshness and availability analysis.
Time Zone Component	All Table Types	Defines whether DateTime columns are in local or UTC time and how conversions are handled.
Daylight Saving Time Zone	All Table Types	Manages daylight saving behavior using IANA time zones. Can be declared globally or through a per-record column if time zones differ.
Time Zone Offset	All Table Types	Specifies the UTC offset for timestamp interpretation and conversion. Can be declared globally or through a column when multiple time zones exist.
Timestamp with Time Zone Offset	Snowflake-specific	Supports native timestamp values with embedded time zone offsets (`TIMESTAMP_TZ`).
Reference Time Zone	All Table Types	Defined as part of the Cron Feature Job Setting for all tables. Used to align calendar-based aggregations (daily, weekly, monthly) consistently across time zones.

Active Flag¶

The Active Flag (also known as Current Flag) column in a Slowly Changing Dimension (SCD) table is used to identify the current version of the record.

Example

If a customer changes their address, the active flag would be set to 'Y' for the new address and 'N' for the old address.

Note

While this column is useful for data management, it cannot be used for feature engineering: its value changes over time and may differ between training and inference time, which may cause time leakage. For this reason, the column is discarded by default when views are generated from tables.

Managed View¶

A Managed View lets you define, via a custom SQL script, the exact subset of data that FeatureByte should use before table registration. This is especially useful for trimming excessively long histories or censoring columns that are not suitable for modeling or feature engineering, such as personally identifiable information (PII) or system-generated columns. While column transformations are possible within a managed view, it is recommended to use cleaning operations for such tasks.

User Interface

Learn by example in the 'Register Tables' tutorial.

FeatureByte Tables¶

Table¶

A Table in FeatureByte represents a source table and provides a centralized location for metadata for that table. This metadata determines the type of operations that can be applied to the table's views.

Important

A source table can only be associated with one active table in the catalog at a time. This means that the active table in the catalog is the source of truth for the metadata of the source table. If a table in the catalog becomes deprecated, it can be replaced with a new table in the catalog that has updated metadata.

Table Registration¶

To register a table in a catalog, first determine its type. The table type defines the possible feature engineering operations on its views and enforces type-specific guardrails accordingly.

FeatureByte supports seven table types:

Event Table: Captures unique events, where each row represents a distinct event occurring at a specific point in time.
Item Table: Provides detailed breakdowns or components related to a primary event.
Snapshots Table: Captures periodic states at regular intervals (e.g., daily, weekly, or monthly snapshots). Must contain a series_id (e.g., account_id, customer_id, policy_number) that distinguishes independent snapshot series.
Time Series Table: Stores regular measurements or aggregated data recorded at consistent time intervals. Unlike the Snapshots Table, it may include missing timestamps in the sequence. It cannot be joined to other tables and cannot be used to establish relationships between entities.
Slowly Changing Dimension (SCD) Table: Tracks historical changes in specific attributes of an entity over time, maintaining both current and historical records.
Dimension Table: Contains static descriptive data for classification or metadata purposes, used when attributes do not change over time.
Calendar Table: Contains date-keyed reference data such as holidays, business day indicators, or seasonal attributes, used to enrich other tables with temporal context.

After creating a table, you can optionally include additional column-level metadata to further support feature engineering. This may involve tagging columns with related entities, updating column descriptions, defining column semantics, or specifying default data cleaning operations.

SDK Reference

Refer to the Table object main page.

User Interface

Learn by example in the 'Register Tables' tutorial.

Event Table¶

An Event Table represents a table in the data warehouse where each row corresponds to a unique event occurring at a specific point in time.

Examples

Event tables can take various forms across industries, such as:

E-commerce: Orders
Banking: Credit card transactions
Healthcare: Doctor visits
Internet: Clickstream data

Creating an Event Table in FeatureByte

To create an Event Table, you must specify the event timestamp, which indicates when the event occurred.

Optionally, you can specify an event ID, which serves as a unique identifier for each event.

Event Table Tracking Event Status

Some Event Tables may contain multiple records for the same event ID, tracking changes in event status over time.

Best practice: Ideally, register two separate tables:

An Event Table containing the static fields of the event.
A Slowly Changing Dimension (SCD) Table that tracks event status and dynamic fields over time.

Alternative approach: If splitting the table is not feasible, you may still register a single table as an Event Table, but in such cases:

The table must include an event status column to differentiate records.
The event timestamp should reflect the time the status was updated.
The view must be filtered using the event status column before feature engineering.
The table cannot be used as a right table in joins or associated with an Item Table.
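The filtering step in the alternative approach can be pictured as follows. This is a plain-Python sketch on hypothetical order data, not FeatureByte view syntax: after filtering on one status, each event ID appears at most once.

```python
from datetime import datetime

# Hypothetical order events: one row per status change of the same order_id.
EVENT_ROWS = [
    {"order_id": "A1", "status": "PLACED", "event_timestamp": datetime(2024, 5, 1, 9, 0)},
    {"order_id": "A1", "status": "SHIPPED", "event_timestamp": datetime(2024, 5, 2, 14, 0)},
    {"order_id": "B7", "status": "PLACED", "event_timestamp": datetime(2024, 5, 3, 8, 30)},
]


def filter_by_status(rows, status):
    """Keep only the rows for one status so each event ID appears at most once."""
    return [row for row in rows if row["status"] == status]


placed = filter_by_status(EVENT_ROWS, "PLACED")
```

Aggregating the unfiltered rows would count order A1 twice, which is why the filter must come before feature engineering.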

Optionally, specify a record creation timestamp to enable automatic analysis of data availability and freshness. This analysis assists in configuring the default feature job setting, which determines the scheduling of feature computation for the Event Table.

SDK Reference

Refer to the Table object main page.

User Interface

Learn by example in the 'Register Tables' tutorial.

Item Table¶

An Item Table represents detailed information about components or sub-events associated with a primary event.

Examples

Examples of Item Tables include:

E-commerce: Product items within customer orders
Healthcare: Drug prescriptions issued during doctor visits

Typically, an Item Table has a one-to-many relationship with an Event Table. Although it might not include a timestamp directly, it inherits temporal context through its link to the Event Table.
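To illustrate how items inherit temporal context, the sketch below attaches the parent event's timestamp to each item through the event ID. The data and function name are hypothetical; in FeatureByte this association is declared at registration rather than coded by hand.

```python
from datetime import datetime

# Hypothetical data: event_id -> event timestamp, and items referencing an event_id.
EVENT_TIMESTAMPS = {
    "order_1": datetime(2024, 5, 1, 9, 0),
    "order_2": datetime(2024, 5, 2, 10, 0),
}
ITEMS = [
    {"item_id": "i1", "event_id": "order_1", "product": "keyboard"},
    {"item_id": "i2", "event_id": "order_1", "product": "mouse"},
    {"item_id": "i3", "event_id": "order_2", "product": "monitor"},
]


def items_with_event_timestamp(items, event_timestamps):
    """Attach the parent event's timestamp to each item row via its event_id."""
    return [{**item, "event_timestamp": event_timestamps[item["event_id"]]} for item in items]


enriched = items_with_event_timestamp(ITEMS, EVENT_TIMESTAMPS)
```

Both items of `order_1` end up with the same timestamp, which reflects the one-to-many relationship between an event and its items.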

Creating an Item Table in FeatureByte

To create an Item Table, specify:

The event ID column.
The Event Table it is associated with.

SDK Reference

How to register an Item Table.

Snapshots Table¶

A Snapshots Table (or Periodic Snapshot Fact Table) captures periodic states of data at strict, regular intervals (e.g., daily, weekly, or monthly).

It must contain a series_id (e.g., account_id, customer_id, policy_number) to distinguish independent snapshot series.

Snapshot data is commonly used for lookup features, aggregations as of a point in time, or calendar-based window aggregations (e.g., day, week, month). It can also be used to enrich other tables containing the same Series ID.

Examples

Common examples include:

Banking: Credit card balance snapshots
Customer analytics: Customer profile histories captured as periodic snapshots instead of a Slowly Changing Dimension Table

Creating a Snapshots Table in FeatureByte

To create a Snapshots Table, define:

The Snapshot Datetime Column: the primary temporal anchor for each record.
The time interval (e.g., hourly, daily, monthly).
The Series ID, identifying distinct snapshot series.

Important

Snapshots must be recorded at strict, regular intervals with no missing periods. For each combination of Series ID and Snapshot Datetime, there must be exactly one record.

If this condition cannot be met, consider registering the table as a Time Series Table.
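Both conditions (one record per Series ID and Snapshot Datetime, and no missing periods) can be checked on a sample before registration. The sketch below assumes daily snapshots represented as `(series_id, snapshot_date)` tuples and only looks for gaps between the first and last snapshot of each series; the function name is hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def snapshot_issues(rows, interval=timedelta(days=1)):
    """Report duplicate (series_id, snapshot_date) pairs and missing periods per series."""
    counts = Counter(rows)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    missing = []
    for series_id in sorted({s for s, _ in rows}):
        dates = sorted({d for s, d in rows if s == series_id})
        expected = dates[0]
        # Walk from the first to the last snapshot and flag every absent period.
        while expected <= dates[-1]:
            if expected not in dates:
                missing.append((series_id, expected))
            expected += interval
    return {"duplicates": duplicates, "missing": missing}


sample = [
    ("acct_1", date(2024, 1, 1)),
    ("acct_1", date(2024, 1, 2)),
    ("acct_1", date(2024, 1, 4)),  # 2024-01-03 is missing
    ("acct_2", date(2024, 1, 1)),
    ("acct_2", date(2024, 1, 1)),  # duplicate snapshot
]
report = snapshot_issues(sample)
```

A non-empty `missing` or `duplicates` list suggests the table does not satisfy the Snapshots Table contract and may be better registered as a Time Series Table.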

A Cron default feature job setting ensures feature computations align with snapshot update intervals.

SDK Reference

User Interface

Learn by example in the 'Register Tables' tutorial.

Time Series Table¶

A Time Series Table stores measurements or aggregated values recorded at regular intervals. Each time series is identified by a series_id.

Time series data is typically used for calendar-based window aggregations (e.g., daily, weekly, monthly).

Examples

Retail: Daily sales records
Weather: Hourly temperature readings
Finance: Stock price histories

Creating a Time Series Table in FeatureByte

To create a Time Series Table, define:

The Reference Datetime Column: the temporal anchor for each record.
The time interval, specifying the expected frequency (e.g., hourly, daily, monthly).
The Series ID (optional), identifying distinct time series if multiple exist in the dataset.

A Cron default feature job setting aligns feature computation with update frequency.

Restriction

A Time Series Table cannot be joined to other tables and cannot establish relationships between entities.

SDK Reference

User Interface

Learn by example in the 'Register Tables' tutorial.

Slowly Changing Dimension (SCD) Table¶

An SCD Table stores attributes that change slowly and unpredictably over time, preserving both current and historical records.

FeatureByte supports only Type 2 SCD Tables, as Type 1 tables (which overwrite old data) may cause data leakage and inconsistent inference behavior.

Type 2 SCD tables use:

Effective Timestamp: when a record becomes active.
Expiration (or End) Timestamp: when the record becomes inactive.
Optionally, an Active Flag: to indicate current validity.

Example

Example Type 2 SCD table tracking customer address changes:

Customer ID	First Name	Last Name	Address	City	State	Zip Code	Valid From	Valid To
123456	John	Smith	123 Main St	San Francisco	CA	12345	2019-01-13 11:00:00	2021-03-16 10:00:00
123456	John	Smith	456 Oak St	Oakland	CA	67890	2021-03-16 10:00:00	NULL
789012	Jane	Doe	789 Maple Ave	New York City	NY	34567	2020-09-15 10:00:00	NULL
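The way such a table evolves can be sketched as follows: when an attribute changes, the open record is closed and a new record is appended, so the old values remain available. This is a plain-Python illustration of the Type 2 pattern with hypothetical names; it is not how the warehouse or FeatureByte maintains the table.

```python
from datetime import datetime


def apply_type2_change(history, customer_id, new_address, changed_at):
    """Close the customer's open record and append a new one; the old address is kept."""
    updated = []
    for row in history:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            # Set the end of validity on the previously active record.
            row = {**row, "valid_to": changed_at}
        updated.append(row)
    updated.append(
        {"customer_id": customer_id, "address": new_address, "valid_from": changed_at, "valid_to": None}
    )
    return updated


history = [
    {"customer_id": "123456", "address": "123 Main St",
     "valid_from": datetime(2019, 1, 13, 11, 0), "valid_to": None},
]
history = apply_type2_change(history, "123456", "456 Oak St", datetime(2021, 3, 16, 10, 0))
```

A Type 1 update would instead replace "123 Main St" with "456 Oak St" in the existing row, so features computed for dates before the change would silently use the new address.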

Creating an SCD Table in FeatureByte

To create an SCD Table, specify the Effective Timestamp. Optionally, include:

Natural Key — identifies an entity across records.
Surrogate Key — unique identifier per row.
Expiration Timestamp — marks when a record becomes inactive.
Active Flag — indicates if a record is currently valid.

SDK Reference

How to register an SCD Table.

User Interface

Learn by example in the 'Register Tables' tutorial.

Dimension Table¶

A Dimension Table contains static descriptive data used for classification, labeling, or enrichment purposes.

Important

Use Dimension Tables only for truly static data. If the data changes over time, use a Type 2 SCD Table instead to prevent data leakage during model training and inference.

Creating a Dimension Table in FeatureByte

Specify the column that represents the table’s Primary Key, also referred to as the Dimension ID.

SDK Reference

How to register a Dimension Table.

User Interface

Learn by example in the 'Register Tables' tutorial.

Calendar Table¶

A Calendar Table contains date-keyed reference data that is known in advance and used to enrich other tables with temporal context during feature engineering. Each row corresponds to a specific date or datetime and may include attributes such as holidays, business day indicators, seasonal flags, or other pre-determined date-based metadata.

Unlike other temporal tables that capture observed or measured data, a Calendar Table represents information that can be determined ahead of time — making it safe to use for future dates without risk of data leakage.

Examples

Calendar tables can take various forms across industries, such as:

Retail: Holiday calendars with promotional periods and seasonal indicators
Finance: Business day calendars with trading day flags per market region
Manufacturing: Production calendars with shift schedules and planned downtime
Healthcare: Facility operating calendars with staffing level indicators

Creating a Calendar Table in FeatureByte

To create a Calendar Table, you must specify the calendar datetime column, which represents the date or datetime that each row refers to.

Optionally, you can specify:

A Series ID column, to distinguish independent calendar series within the table (e.g., by region, store, or market). When absent, the calendar is treated as global.
A Record Creation Timestamp column, to track when each record was stored in the data warehouse.

SDK Reference

How to register a Calendar Table.

User Interface

Learn by example in the 'Register Tables' tutorial.

Table Type Summary¶

The following table summarizes the purpose, key requirements, and join/relationship capabilities of each supported table type.

Note: “Join” refers to using the table as the right table in a join operation.

Table Type	Purpose	Key Temporal / Structural Columns	Join & Relationship Behavior
Event Table	Captures unique events occurring at specific points in time.	`event_timestamp` (required), optional `event_id`, optional `record_creation_timestamp`	Can join with other tables; can serve as parent for Item Table.
Item Table	Stores detailed records linked to a primary event (one-to-many relationship).	`event_id` (required), linked Event Table	Must be linked to an Event Table; inherits event timestamp context.
Snapshots Table	Captures periodic snapshots of state at regular intervals.	`snapshot_datetime` (required), `series_id` (required)	Can join with other tables; supports "as-at" lookups and temporal joins; must have unique combination of `series_id` and `snapshot_datetime`.
Time Series Table	Stores regular measurements or aggregated values over time.	`reference_datetime` (required), optional `series_id`	Cannot be joined to other tables or used to establish relationships.
Slowly Changing Dimension (SCD) Table	Tracks historical changes in entity attributes over time.	`effective_timestamp` (required), optional `expiration_timestamp`, `active_flag`	Can join with other tables; supports "as-at" lookups and temporal joins.
Dimension Table	Contains static descriptive data for entities or classifications.	`dimension_id`	Can join with other tables; should only contain static, non-temporal data.
Calendar Table	Contains date-keyed reference data known in advance (e.g., holidays, business days).	`calendar_datetime` (required), optional `series_id`	Can enrich other tables with temporal context; safe for future dates as data is pre-determined.

Table Status¶

When a table is registered in a catalog, its status is set to 'PUBLIC_DRAFT' by default. Once the table is prepared for feature engineering, you can modify its status to 'PUBLISHED'. If a table needs to be deprecated, you can update its status to 'DEPRECATED'.

SDK Reference

How to:

User Interface

Learn by example with our 'Manage feature life cycle' UI tutorials.

Table EDA¶

Table EDA provides an automated assessment of data quality for each table in the Catalog. It evaluates column statistics, detects anomalies, and highlights issues such as timestamps or numeric fields encoded as strings and disguised missing values. These insights help you quickly determine which columns require cleaning operations. EDA can be run directly on your source tables or on a development dataset for faster analysis when working with large datasets.

User Interface

Learn by example with our 'Set Default Cleaning Operations' UI tutorials.

Table Columns Metadata¶

Table Column¶

A Table Column refers to a specific column within a table. You can add metadata to the column to help with feature engineering, such as tagging the column with entity references, updating column description, tagging semantics or defining default cleaning operations.

SDK Reference

Refer to the TableColumn object main page or to the specific links:

update_description,
tag an entity to a column,
obtain descriptive statistics for a column,
and specify default cleaning operations.

User Interface

Learn by example with the 'Update descriptions and Tag Semantics' tutorial and the 'Set Default Cleaning Operations' tutorial.

Entity Tagging¶

The Entity Tagging process involves identifying the specific columns in tables that identify or reference a particular entity.

These columns are typically primary keys, natural keys, or foreign keys of the table, but not necessarily.

Example

Consider a database for a company that consists of two SCD tables: one table for employees and one table for departments. In this database,

the natural key of the employees table identifies the Employee entity.
the natural key of the departments table identifies the Department entity.
the employees table may also have a foreign key column referencing the Department entity.

SDK Reference

How to tag a column with an entity reference.

User Interface

Learn by example with the 'Register Entities' tutorial.

Cleaning Operations¶

Cleaning Operations determine the procedure for cleaning data in a table column before performing feature engineering. The cleaning operations can either be set as a default operation in the metadata of a table column or established when creating a view in a manual mode.

These operations specify how to manage the following scenarios:

String-based datetime format
Missing values
Disguised values
Values that are not in an anticipated list
Numeric values and dates that are out of boundaries
String values when numeric values are expected

If changes occur in the data quality of the source table, new versions of the feature can be created with new cleaning operations that address the new quality issues.
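
The scenarios listed above can be illustrated with a minimal, SDK-independent sketch. It is plain Python, not the FeatureByte API: the function name `clean_amount`, the sentinel codes, the boundaries and the imputed value are all hypothetical choices made for the example.

```python
import math
from typing import Any

# Hypothetical rules for a numeric "amount" column (illustrative values only).
DISGUISED_VALUES = {-999, -1}          # sentinel codes that really mean "missing"
MIN_VALUE, MAX_VALUE = 0.0, 10_000.0   # accepted boundaries
IMPUTED_VALUE = 0.0                    # replacement for missing values


def clean_amount(raw: Any) -> float:
    """Return a cleaned float for one raw cell value."""
    # String values when numeric values are expected: try to parse them.
    if isinstance(raw, str):
        try:
            raw = float(raw)
        except ValueError:
            raw = None
    # Missing values and disguised values are imputed.
    if raw is None or raw in DISGUISED_VALUES:
        return IMPUTED_VALUE
    value = float(raw)
    if math.isnan(value):
        return IMPUTED_VALUE
    # Out-of-boundary values are capped at the nearest boundary.
    return min(max(value, MIN_VALUE), MAX_VALUE)


cleaned = [clean_amount(v) for v in [12.5, None, -999, "7", "n/a", 25_000]]
```

The order of the rules matters in this sketch: strings are parsed first so that a disguised value stored as text is still caught by the later checks.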

Cleaning Operations Approval

In Catalogs with Approval Flow enabled, changes in table metadata initiate a review process. This process recommends new versions of features and lists linked to these tables, ensuring that new models and deployments use versions that address any data issues.

SDK Reference

How to:

set default cleaning operations for a column,
create a view in a manual mode,
create a new feature version with new cleaning operations.

User Interface

Learn by example with the 'Set Default Cleaning Operations' tutorial and the 'Manage feature life cycle' tutorials.

Column Semantics¶

Recognizing the semantics of data fields and tables is essential for effective and reliable feature engineering. Without this understanding, there's a risk of creating irrelevant or misleading features, and missing out on key insights. Here are some examples of common errors due to misunderstanding data semantics:

Incorrectly applying 'sum' to intensity measurements, like patient temperatures in a doctor's visit table.
Misinterpreting a weekday column as numerical and using inappropriate operations like sum, average, or max, instead of more suitable ones like count per weekday, most frequent weekday, weekdays entropy, or unique count.

To guide users in choosing the right feature engineering techniques, FeatureByte introduces a semantic layer for each registered table. This layer encodes the semantics of data fields using a specially designed data ontology, tailored for feature engineering.

Ideation assists in this process for enterprise users. It uses Generative AI to analyze metadata from tables and columns and proposes semantic tags for each column. This semantic tagging is then used by Ideation to suggest relevant data aggregations, filters, and feature combinations during feature ideation.

User Interface

Learn by example with the 'Update descriptions and Tag Semantics' tutorial.

Key Numeric Aggregation Column¶

A 'Key Numeric Aggregation Column' is a crucial numeric column within a table that is invaluable for constructing aggregated features. This column usually comprises additive values like counts, sums, or durations, which are ideal for summarization tasks. It acts as a key component for aggregating metrics across different dimensions: specifically, it allows for the computation of sums across grouped categories defined by categorical columns. This aggregation is vital for deciphering patterns and trends within data subgroups. The features generated from such aggregations can be directly applied or further processed for in-depth analyses, such as evaluating diversity, assessing stability, or identifying key categories. Additionally, the 'Key Numeric Aggregation Column' enriches analyses that rely on counts by offering deeper insights into the distribution across these categories.

Ideation assists in the identification of these columns for enterprise users.

Examples:

Total Transaction Amount by Transaction Description

Suppose we have a dataset containing credit card transactions with columns like CardID, TransactionDescription, and Amount. By using the "Amount" column as the Aggregation Metric, we can create a feature that aggregates the total transaction amount for each distinct transaction description, per card.

CardID	Feature
Card1	{'Retail Purchase': 500, 'Restaurant': 300, 'Online Shopping': 700}
Card2	{'Retail Purchase': 400, 'Online Shopping': 600}

Total Count by Transaction Description

Alternatively, using counts as the Aggregation Metric can capture the frequency of transactions for each distinct transaction description, per card.

CardID	Feature
Card1	{'Retail Purchase': 3, 'Restaurant': 2, 'Online Shopping': 2}
Card2	{'Retail Purchase': 1, 'Online Shopping': 3}
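
As a rough illustration, the two dictionary features above can be reproduced in plain Python. The transaction rows below are made up so that their totals and counts match the example tables; the helper `aggregate_by_category` is a hypothetical name and not part of the FeatureByte SDK.

```python
from collections import defaultdict

# (CardID, TransactionDescription, Amount); values invented to match the tables above.
transactions = [
    ("Card1", "Retail Purchase", 200),
    ("Card1", "Retail Purchase", 100),
    ("Card1", "Retail Purchase", 200),
    ("Card1", "Restaurant", 120),
    ("Card1", "Restaurant", 180),
    ("Card1", "Online Shopping", 300),
    ("Card1", "Online Shopping", 400),
    ("Card2", "Retail Purchase", 400),
    ("Card2", "Online Shopping", 100),
    ("Card2", "Online Shopping", 200),
    ("Card2", "Online Shopping", 300),
]


def aggregate_by_category(rows, use_amount=True):
    """Group by card, then sum the amount (or count rows) per description."""
    result = defaultdict(dict)
    for card_id, description, amount in rows:
        per_card = result[card_id]
        increment = amount if use_amount else 1
        per_card[description] = per_card.get(description, 0) + increment
    return dict(result)


total_amount = aggregate_by_category(transactions)                  # sum of Amount
total_count = aggregate_by_category(transactions, use_amount=False)  # row counts
```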

Table Catalog¶

The Tables registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list tables in a catalog,
get a table from a catalog.

User Interface

Table Catalog

Entities and Relationships¶

Entity¶

An Entity is a real-world object or concept represented or referenced by columns in your source tables.

Examples

Common examples of entities include customer, merchant, city, product, and order.

In FeatureByte, entities are used to:

identify the unit of analysis for a feature or a use case,
organize features and feature lists in the catalog,
identify the entities that can be used to serve a feature or feature list,
establish table relationships.

Note

While entities are typically associated with a primary key or foreign key in the data, they can also be represented by categorical columns that define groups of related objects. For example, a City entity may represent multiple customers, and a Product Group entity may encompass multiple products, even though neither is explicitly used as a foreign key.

SDK Reference

Refer to the Entity object main page and how to add a new entity to a catalog.

User Interface

Learn by example with the 'Register Entities' tutorial.

Entity Serving Name¶

An Entity's Serving Name is the name of the unique identifier used to identify the entity during a preview or serving request. It is also the name of the column representing the entity in an observation set. Typically, the serving name for an entity is the name of the primary key (or natural key) of the table that represents the entity. An entity can have multiple serving names for convenience, but the unique identifier should remain unique.

SDK Reference

How to get the serving names of an entity.

User Interface

Learn by example with the 'Register Entities' tutorial.

Feature Primary Entity¶

The Primary Entity of a feature defines the level of analysis for that feature.

The Primary Entity is usually a single entity. However, there are cases where it may be a tuple of entities.

An example of when the primary entity becomes a tuple of entities is when a feature results from aggregating data based on those entities to measure interactions between them.

Example

Entity Diagram

Suppose a feature quantifies the interaction between a customer entity and a merchant entity in the past, such as the sum of transaction amounts grouped by customer and merchant in the past four weeks.

The primary entity of this feature is the tuple of customer and merchant.

When a feature is derived from features with different primary entities, the primary entity is determined by the entity relationships between the entities. The lowest level entity in the hierarchy is selected as the primary entity. If the entities have no relationship, the primary entity becomes a tuple of those entities.

Example

Entity Diagram

Consider two entities: customer and customer city, where the customer entity is a child of customer city entity. If a new feature is created that compares a customer's basket with the average basket of customers in the same city, the primary entity for that feature would be the customer entity. This is because the customer entity is a child of the customer city entity and the customer city entity can be deduced automatically.

Alternatively, if two entities, such as customer and merchant, do not have any relationship, the primary entity for a feature that calculates the distance between the customer location and the merchant location would be the tuple of customer and merchant entities. This is because the two entities do not have any parent-child relationship.

SDK Reference

How to get the primary entity of a feature.

Feature List Primary Entity¶

The primary entity of a feature list determines the entities that can be used to serve the feature list, which typically corresponds to the primary entity of the Use Case that the feature list was created for.

If the features within the list pertain to different primary entities, the primary entity of the feature list is selected based on the entity relationships, with the lowest level entity in the hierarchy chosen as the primary entity. In cases where there are no relationships between entities, the primary entity may become a tuple comprising those entities.

Example

Entity Diagram

Consider a feature list containing features related to card, customer, and customer city. In this case, the primary entity is the card entity since it is a child of both the customer and customer city entities.

However, if the feature list also contains merchant and merchant city features, the primary entity is a tuple of card and merchant.
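
The selection rule can be sketched in a few lines of plain Python: drop every entity that is an ancestor of another entity in the set, and keep what remains. This is only an illustration of the rule as described here, not FeatureByte's implementation; the `PARENTS` map and the function names are assumptions based on the credit card example.

```python
# Child -> parents map for the credit card example (hypothetical entity names).
PARENTS = {
    "transaction": {"card", "merchant"},
    "card": {"customer"},
    "customer": {"customer_city"},
    "merchant": {"merchant_city"},
}


def ancestors(entity, parents=PARENTS):
    """Return every entity reachable by following parent links."""
    seen = set()
    stack = list(parents.get(entity, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def primary_entity(entities, parents=PARENTS):
    """Keep only the lowest-level entities; unrelated ones form a tuple."""
    entities = set(entities)
    covered = set()
    for entity in entities:
        covered |= ancestors(entity, parents)
    return tuple(sorted(entities - covered))


card_only = primary_entity(["card", "customer", "customer_city"])
card_and_merchant = primary_entity(
    ["card", "customer", "customer_city", "merchant", "merchant_city"]
)
```

In this sketch `card_only` contains only the card entity, while `card_and_merchant` is the tuple of card and merchant, matching the two cases in the example.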

SDK Reference

How to get the primary entity of a feature list.

Serving Entity¶

A Serving Entity is any entity that can be used to preview or serve a feature or feature list, regardless of whether it is the primary entity. Serving entities associated with a feature or feature list are typically descendants of the primary entity and uniquely identify the primary entity.

Example

Entity Diagram

Suppose that customer is the primary entity for a feature. The serving entities for that feature could then include related entities such as the card and transaction entities, which are a child and a grandchild of the customer entity and uniquely identify the customer.

Use Case Primary Entity¶

In a Use Case, the Primary Entity is the object or concept that defines its problem statement and Context. Usually, this entity is singular, but in cases such as recommendation engines, it can be a tuple of entities such as (Customer, Product).

Observation Table Primary Entity¶

An Observation Table Primary Entity is the entity of the Context or Use Case the table represents.

To use an Observation Table for computing historical feature values of a feature list, its Primary Entity must match the feature list's primary entity or be a related serving entity.
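
A minimal sketch of this compatibility check, assuming single (non-tuple) primary entities and a hypothetical child-to-parents map: an observation table's entity qualifies if it equals the feature list's primary entity or is one of its descendants. The names `PARENTS`, `is_descendant` and `can_serve` are illustrative and do not come from the SDK.

```python
# Child -> parents map (hypothetical entity names from the credit card example).
PARENTS = {
    "transaction": {"card", "merchant"},
    "card": {"customer"},
    "customer": {"customer_city"},
    "merchant": {"merchant_city"},
}


def is_descendant(entity, ancestor, parents=PARENTS):
    """True if `ancestor` can be reached from `entity` through parent links."""
    seen = set()
    stack = list(parents.get(entity, ()))
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False


def can_serve(observation_entity, feature_list_entity, parents=PARENTS):
    """An entity can serve a feature list if it is the primary entity or a descendant."""
    return observation_entity == feature_list_entity or is_descendant(
        observation_entity, feature_list_entity, parents
    )
```

For instance, `can_serve("card", "customer")` holds because each card identifies exactly one customer, whereas `can_serve("customer", "card")` does not.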

Entity Relationship¶

The parent-child relationship and the supertype-subtype relationship are the two main types of Entity Relationships that can assist feature engineering and feature serving.

The parent-child relationship is automatically established in FeatureByte during the entity tagging process, while identifying supertype-subtype relationships requires manual intervention.

These relationships can be used to suggest, facilitate and verify joins during feature engineering and streamline the process of serving feature lists containing multiple entity-assigned features.

Important

FeatureByte currently supports only parent-child relationships. Support for supertype-subtype relationships is expected to follow shortly and should enable more efficient feature engineering and feature serving.

SDK Reference

Refer to the Relationship object main page or to the specific links:

list relationships between entities in a catalog.

User Interface

Learn by example with the 'Register Entities' tutorial.

Parent-Child Relationship¶

A Parent-Child Relationship is a hierarchical connection that links one entity (the child) to another (the parent). Each child entity key value can have only one parent entity key value, but a parent entity key value can have multiple child entity key values.

Example

Examples of parent-child relationships include:

Hierarchical organization chart: A company's employees are arranged in a hierarchy, with each employee having a manager. The employee entity represents the child, and the manager entity represents the parent.
Product catalog: In an e-commerce system, a product catalog may be categorized into categories and subcategories. Each category or subcategory represents a child of its parent category.
Geographical hierarchy: In a geographical data model, cities are arranged in states, which are arranged in countries. Each city is the child of its parent state, and each state is the child of its parent country.
Credit Card hierarchy: A credit card transaction is the child of a card and a merchant. A card is a child of a customer. A customer is a child of a customer city. And a merchant is a child of a merchant city.

Entity Diagram

Note

In FeatureByte, the parent-child relationship is automatically established when the primary key (or natural key in the context of a SCD table) identifies one entity. This entity is the child entity. Other entities that are referenced in the table are identified as parent entities.
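
The note above, together with the one-parent-per-child rule, can be sketched as follows. This is a simplified illustration, not FeatureByte's detection logic: the table layouts, column names and both helper functions are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def derive_parent_child(key_entity: str, column_entities: Dict[str, str]) -> List[Tuple[str, str]]:
    """The entity tagged on the table key is the child; other tagged entities are parents."""
    return [
        (key_entity, entity)
        for entity in column_entities.values()
        if entity != key_entity
    ]


def children_with_many_parents(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return child key values that map to more than one parent key value."""
    parents_by_child: Dict[str, set] = {}
    for child_key, parent_key in pairs:
        parents_by_child.setdefault(child_key, set()).add(parent_key)
    return sorted(k for k, v in parents_by_child.items() if len(v) > 1)


# Entity tags per column for two hypothetical tables.
card_relationships = derive_parent_child(
    "card", {"card_id": "card", "customer_id": "customer"}
)
transaction_relationships = derive_parent_child(
    "transaction",
    {"transaction_id": "transaction", "card_id": "card", "merchant_id": "merchant"},
)

# (card key value, customer key value) pairs; "K2" breaks the one-parent rule.
violations = children_with_many_parents(
    [("K1", "alice"), ("K2", "bob"), ("K2", "carol"), ("K3", "alice")]
)
```

A parent key value such as `"alice"` may appear for several children, which is allowed; only a child key value with several parents is reported.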

Supertype-Subtype Relationship¶

In a data model, a Supertype-Subtype Relationship is a hierarchical relationship between two or more entity types where one entity type (the subtype) inherits attributes and relationships from another entity type (the supertype).

The subtype entity is typically a more specialized version of the supertype entity, representing a subset of the data that applies to a particular domain. Although the subtype entity inherits properties and relationships from the supertype entity, it can also have its own attributes or relationships.

Examples

Here are a few examples of supertype-subtype relationships involving a person, student, and teacher:

Person is the supertype, while student and teacher are both subtypes of person.
Student is a subtype of person. This is because a student is a specific type of person who is enrolled in a school or university.
Teacher is also a subtype of person since a teacher is a specific type of person responsible for educating and instructing students.
A more specific subtype of student could be a graduate student, which refers to a student who has already completed a bachelor's degree and is pursuing a higher-level degree.
Another subtype of teacher could be a professor, typically a teacher with a higher academic rank and significant experience in their field.

Supertype-subtype relationships describe how a more general category (the supertype) can be divided into more specific subcategories (the subtypes). In this case, a person is the most general category, while student and teacher are more specific categories that fall under the broader umbrella of "person."

Entity Catalog¶

The Entities registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list entities in a catalog,
get an entity from a catalog.

User Interface

Entity Catalog

Entity Selection¶

Entity Selection is automatically suggested based on your use case. The selection defines, per table, the analysis level of features that may be generated during feature ideation. In most cases, the entity of the use case is recommended, or one of its parent entities if the entity itself cannot be joined to the table.

You can extend the selection to any eligible parent entities. This may result in additional features being generated, including similarity features.

During the automated creation of a development dataset, the same entity selection is suggested. If the development dataset is used during feature ideation, the selection will constrain the eligible entities available for feature generation.

Examples

If your use case is defined at the customer level, the system will suggest customer as the entity. You may extend the selection to a parent entity such as household or county, enabling the generation of additional features like household purchase frequency (last 7 days) or similarity features such as customer vs. county purchase by product (last 14 days).

Important

Features at the level of a low-cardinality entity (such as State) may be expensive to compute.

Use Case Formulation¶

Target¶

In Machine Learning, a "target" refers to the outcome that the model is being trained to predict. It's a critical component in supervised learning, where the goal is to create a model that can accurately forecast or classify the target based on the patterns it identifies in the input features.

In FeatureByte, a target can be established in two ways:

Descriptive Approach: You directly outline your prediction goal.
Logical Approach: This technique calculates targets within FeatureByte, mirroring the process of creating features.

SDK Reference

Refer to the Target object main page and how to create a descriptive target.

User Interface

Learn by example with the 'Create Use Cases' tutorial.

Target Logical Plan¶

The process for establishing a logical plan for a Target closely mirrors that for creating features, with a critical difference: the plan for a Target utilizes forward operations, in contrast to the backward operations applied in feature creation.

Target objects, built upon View objects, come in three varieties:

Lookup Targets: Directly retrieve values from view attributes for a future point in time.
Forward Window-based Aggregate Targets: Use forward-looking aggregations over grouped data.
Aggregate Targets as a Future Point-in-Time: Apply aggregations at a designated future moment.

Additionally, targets can emerge as transformations of existing Target objects, offering various ways to define what you want to predict.
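
The difference between backward and forward operations can be shown with a small plain-Python sketch. It does not use the FeatureByte SDK; the event data, the helper `window_sum` and the choice of half-open windows (start included, end excluded) are assumptions made for the illustration.

```python
from datetime import datetime, timedelta

# (customer, event timestamp, amount); invented data.
events = [
    ("C1", datetime(2024, 1, 1), 10.0),
    ("C1", datetime(2024, 1, 5), 20.0),
    ("C1", datetime(2024, 1, 9), 40.0),
    ("C1", datetime(2024, 1, 20), 80.0),
]


def window_sum(rows, customer, start, end):
    """Sum the amounts of one customer with start <= timestamp < end."""
    return sum(
        amount
        for cust, timestamp, amount in rows
        if cust == customer and start <= timestamp < end
    )


point_in_time = datetime(2024, 1, 8)
window = timedelta(days=7)

# Feature: looks backward from the point in time.
feature_value = window_sum(events, "C1", point_in_time - window, point_in_time)
# Target: looks forward from the point in time.
target_value = window_sum(events, "C1", point_in_time, point_in_time + window)
```

Only the direction of the window changes between the two calls, which is why the target plan mirrors the feature plan so closely.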

SDK Reference

How to:

Target Definition File¶

The target definition file is the single source of truth for a target. This file is automatically generated when a target is declared in the SDK or a new version is derived.

The syntax used in the SDK is also used in the target definition file. The file provides an explicit outline of the intended operations of the target declaration, including those inherited but not explicitly declared by you. These operations may include cleaning operations inherited from table metadata.

The target definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for target materialization.

SDK Reference

How to obtain the target definition file.

User Interface

Learn by example with the 'Create Use Cases' tutorial.

Target Materialization¶

Materializing target values in FeatureByte using observation sets can be done through two distinct approaches:

Using compute_targets(): This method returns a DataFrame filled with target values, suitable for immediate analysis and use.
Using compute_target_table(): This approach yields an ObservationTable object, representing an observation table suitable for long-term storage and linking with a Use Case for repeated use.

SDK Reference

How to:

User Interface

Learn by example with the 'Create Observation Tables' tutorial.

Target Catalog¶

The Targets registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list targets available in a catalog
get a target from a catalog

User Interface

How to list registered Targets.

Target Catalog

Treatment¶

A Treatment describes the causal treatment variable and its assignment mechanism for an experiment, quasi-experiment, or observational study. It provides structured metadata that downstream causal estimators can use to select appropriate identification and modeling strategies.

Treatment objects are essential for causal inference and uplift modeling in FeatureByte. They capture critical information about:

Treatment Type: The scale of the treatment variable (binary, multi-arm, or continuous)
Assignment Source: How units are assigned to treatment (randomized or observational)
Assignment Design: Specific design within the chosen source (simple-randomization, stratified-randomization, business-rule, etc.)
Temporal Structure: When and how treatment is applied over time
Interference: Whether units can affect each other (violations of SUTVA)
Propensity: How treatment assignment probabilities are known or estimated

Examples

Treatment types can vary based on the experimental design:

Binary Treatment: A/B test comparing exposed vs control group (e.g., coupon vs no coupon)
Multi-Arm Treatment: Testing multiple variants or dosage tiers (e.g., 10%, 20%, 30% discount levels)
Continuous Treatment: Numeric dose or intensity (e.g., marketing spend amount, price level)

Treatments are used in combination with Contexts to define the experimental or observational setting for causal analysis. When a Context is associated with a Treatment, FeatureByte can:

Apply appropriate causal identification strategies
Select suitable statistical methods for uplift modeling
Ensure model evaluation metrics align with the experimental design
Provide interpretable results in the context of the assignment mechanism

Current Support

At this stage, FeatureByte supports causal modeling for randomized binary treatments only. Support for multi-arm and continuous treatments, as well as observational studies, will be added in future releases.

SDK Reference

Refer to the Treatment object main page and how to create a treatment.

Treatment Catalog¶

The Treatments registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list treatments available in a catalog
get a treatment from a catalog

Context¶

A Context defines the scope and circumstances in which features are expected to be served.

Examples

Contexts can vary significantly. For instance:

Batch Predictions Context: Making weekly batch predictions for an active customer that has made at least one purchase over the past 12 weeks.
Real-Time Predictions Context: Offering real-time predictions for a credit card transaction that has been recently processed.

While creating a basic context requires only identifying the relevant entity, adding a detailed description is beneficial. This should ideally cover:

Contextual Subset Details: Characteristics of the entity subset being targeted.
Serving Timing: Insights into when predictions are needed, whether in batch or real-time scenarios.
Inference Data Availability: What data is available at the time of inference.
Constraints: Any legal, operational, or other constraints that might impact the context.

For causal inference and uplift modeling scenarios, a Context can be associated with a Treatment. This association defines the experimental or observational setting and enables FeatureByte to apply appropriate causal identification strategies and statistical methods.

For time series forecasting scenarios, a Context can include a ForecastPointSchema that defines the granularity (day, week, hour, etc.), data type, and timezone handling for the FORECAST_POINT column. This enables features derived from the forecast point, such as forecast horizon or local time date parts.

A Context can also define user-provided columns via UserProvidedColumn. These columns represent external data (e.g., customer-provided information or real-time inputs) that will be supplied in observation tables at materialization time, without needing to be stored in source tables.

SDK Reference

Refer to the Context object main page and how to create a context.

User Interface

Learn by example with the 'Create Use Cases' tutorial.

Context Association with Observation Table¶

After defining a Context, it can be linked to an Observation Table. This process enables the observation table to act as the default preview/eda table for the Context. Additionally, all observation tables associated with the Context can be listed.

SDK Reference

How to:

User Interface

Learn by example with the 'Create Observation Tables' tutorial.

Context Catalog¶

The Contexts registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list contexts available in a catalog
get a context from a catalog

User Interface

How to list registered Contexts.

Context Catalog

Use Case¶

A Use Case formulates the modelling problem by associating a Context with a Target. Use Cases facilitate the organization of your observation tables, feature tables and deployments. Use Cases also play a crucial role in Ideation, enabling it to provide tailored feature suggestions.

To construct a new Use Case, the following information is required:

Select a Context: Choose a registered Context that defines the environment of your Use Case. For causal modeling, the Context should be associated with a Treatment.
Define a Target: Specify a registered Target that represents the goal of your Use Case.

Note

The context and target must correspond to the same entities.

When a Use Case combines a Context with an associated Treatment and a Target, it enables causal analysis and uplift modeling. This allows you to measure the causal impact of interventions and predict treatment effects.

When a Use Case combines a Context with a ForecastPointSchema and a Target, it is automatically typed as a FORECAST use case, enabling time series forecasting at specific future dates.
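
The rules above can be summarized in a small data-structure sketch. It is not the SDK's object model: the dataclasses, the function `use_case_kind` and the labels `"CAUSAL"` and `"PREDICTIVE"` are hypothetical (only `FORECAST` is named in this page), and giving the forecast schema priority over the treatment is an arbitrary choice for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Context:
    name: str
    primary_entity: Tuple[str, ...]
    treatment: Optional[str] = None              # name of an associated Treatment
    forecast_point_schema: Optional[str] = None  # e.g. a granularity such as "day"


@dataclass(frozen=True)
class Target:
    name: str
    primary_entity: Tuple[str, ...]


def use_case_kind(context: Context, target: Target) -> str:
    """Validate the entity match, then classify the use case."""
    if set(context.primary_entity) != set(target.primary_entity):
        raise ValueError("the context and target must correspond to the same entities")
    if context.forecast_point_schema is not None:
        return "FORECAST"
    if context.treatment is not None:
        return "CAUSAL"      # hypothetical label
    return "PREDICTIVE"      # hypothetical label


churn_target = Target("churn_next_4_weeks", ("customer",))
plain_kind = use_case_kind(Context("active customers", ("customer",)), churn_target)
causal_kind = use_case_kind(
    Context("coupon campaign", ("customer",), treatment="coupon"), churn_target
)
forecast_kind = use_case_kind(
    Context("daily sales", ("customer",), forecast_point_schema="day"), churn_target
)
```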

For a comprehensive Use Case setup, include a detailed description. Describing the use case, context, and target in detail improves documentation and helps Ideation suggest relevant features and assess their semantic relevance.

SDK Reference

Refer to the Use Case object main page or to the specific links:

create a use case

User Interface

Learn by example with the 'Create Use Cases' tutorial.

Use Case Association with Observation Table¶

Observation tables are automatically linked to a Use Case when they are derived from:

an observation table that is linked to the use case's Context
a target that is linked to the use case

An observation table can be manually linked to the Use Case to support cases where the observation table is not derived from another observation table.

This process enables the observation table to act as the default preview/eda table for the Use Case. Additionally, all observation tables associated with the Use Case can be listed.

SDK Reference

How to:

Use Case Association with Feature Table¶

Feature tables are automatically associated with use cases via the observation tables they originate from.

Feature tables associated with a use case can be listed easily from the Use Case object.

SDK Reference

How to:

Use Case Association with Deployment¶

A deployment is associated with a use case when the use case is specified during the deployment of the related feature list.

Deployments associated with a use case can be listed easily from the Use Case object.

SDK Reference

How to:

Use Case Catalog¶

The Use Cases registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list use cases available in a catalog
get a use case from a catalog

User Interface

How to list registered Use Cases.

Use Case Catalog

Observation Set¶

An Observation Set is a collection of historical data points that serves as the backbone of a training dataset. Feature values are computed for each of its data points, and these values then form the training data for Machine Learning models. For a given use case, the same Observation Table is often employed in multiple experiments, while the specific features chosen and the Machine Learning models applied may vary between these experiments.

Each data point represents a historical moment for a particular entity and may include target values.

Observation Set

Ideally, an observation set should be explicitly linked to a specific Context or Use Case, ensuring thorough documentation and facilitating its reuse.

Other important considerations when constructing an Observation Set are:

Choosing the Right Entity Key Values: Select values that represent your target population accurately for each historical timestamp.
Accuracy in Timestamps: Ensure all timestamps are in Coordinated Universal Time (UTC) and cover a sufficient range to depict seasonal changes. They should represent the expected time distribution in real-world scenarios.
Maintaining Data Integrity: Avoid time leakage (future data in the training set) by spacing out your timestamps correctly.

Example

To predict customer churn every Monday morning over six months, you might:

Use historical timestamps from Monday mornings of past years.
Choose customer keys randomly from the active customer base at those times.
Set intervals longer than six months between data points for each customer to avoid time leakage.

Technical Details

The entity values column should have an accepted serving name.
Label the timestamps column as "POINT_IN_TIME" and use UTC.
For forecast use cases, include a "FORECAST_POINT" column representing the future date being predicted for.
In FeatureByte, an Observation Set can be a pandas DataFrame or an Observation Table object from the feature store.

Once an Observation Set is defined, you can use it to materialize a feature list into historical feature values to form a training or testing set for your Machine Learning model.
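
The churn example and the technical details above can be combined into a rough, standard-library sketch. Everything specific in it is an assumption: the serving name `CUSTOMER_ID`, the 06:00 UTC prediction time, the 183-day minimum gap, and the `active_customers` callable, which stands in for whatever lookup returns the active customer base at a given time.

```python
import random
from datetime import datetime, timedelta, timezone


def monday_mornings(start, end, hour=6):
    """Yield every Monday at `hour` (UTC) between start and end."""
    current = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    current += timedelta(days=(0 - current.weekday()) % 7)  # Monday is weekday 0
    if current < start:
        current += timedelta(weeks=1)
    while current <= end:
        yield current
        current += timedelta(weeks=1)


def build_observation_set(active_customers, start, end, per_timestamp, min_gap, seed=0):
    """Sample customers per timestamp, keeping a minimum gap per customer."""
    rng = random.Random(seed)
    last_seen = {}
    rows = []
    for ts in monday_mornings(start, end):
        eligible = [
            c for c in active_customers(ts)
            if c not in last_seen or ts - last_seen[c] > min_gap
        ]
        for customer in rng.sample(eligible, min(per_timestamp, len(eligible))):
            last_seen[customer] = ts
            rows.append({"POINT_IN_TIME": ts, "CUSTOMER_ID": customer})
    return rows


customers = [f"C{i}" for i in range(50)]
rows = build_observation_set(
    active_customers=lambda ts: customers,  # placeholder: same base at all times
    start=datetime(2022, 1, 1, tzinfo=timezone.utc),
    end=datetime(2023, 12, 31, tzinfo=timezone.utc),
    per_timestamp=2,
    min_gap=timedelta(days=183),
)
```

The resulting list of dictionaries can be turned into a pandas DataFrame with `pandas.DataFrame(rows)` if a DataFrame observation set is needed.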

SDK Reference

How to:

get the primary entity of a feature list,
compute historical feature values as a DataFrame,
compute historical feature values as a HistoricalFeatureTable object.

Observation Table¶

An Observation Table is an observation set integrated in the catalog. It can be created from various sources and is essential for sharing and reusing data within the feature store. For forecast use cases, observation tables also include a FORECAST_POINT column representing the future date being predicted for, along with computed metadata such as the forecast horizon and forecast point range.

SDK Reference

Refer to the ObservationTable object main page or to the specific links:

User Interface

Learn by example with the 'Create Observation Tables' tutorial.

Observation Table Association with a Context or Use Case¶

Once added to the catalog, an Observation Table can be linked to specific Contexts or Use Cases.

For Use Case linkage, you can include the Use Case's Target values by materializing them with a table associated with its Context.

SDK Reference

How to:

Observation Table Purpose¶

Tagging an Observation Table with purposes like 'preview', 'eda', 'training' or 'validation_test' facilitates its identification and reuse.

Default eda and preview tables can also be set for a Context or a Use Case.

SDK Reference

How to:

Observation Table Splitting¶

An Observation Table can be split into non-overlapping subsets for training and evaluation. The split is seeded for reproducibility, and the first subset is automatically assigned the TRAINING purpose while subsequent subsets are assigned VALIDATION_TEST.
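
The behaviour described above can be approximated with a short sketch, assuming the split is a seeded shuffle followed by slicing according to ratios. The function `split_rows` and its parameters are hypothetical and do not reflect the SDK's signature; only the purpose labels are taken from this page.

```python
import random


def split_rows(rows, ratios, seed):
    """Shuffle with a fixed seed, then cut into consecutive, non-overlapping parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    splits, start, cumulative = [], 0, 0.0
    for index, ratio in enumerate(ratios):
        cumulative += ratio
        is_last = index == len(ratios) - 1
        end = len(shuffled) if is_last else round(cumulative * len(shuffled))
        splits.append(shuffled[start:end])
        start = end

    # First subset is for training, the others for validation or test.
    purposes = ["TRAINING"] + ["VALIDATION_TEST"] * (len(ratios) - 1)
    return list(zip(purposes, splits))


parts = split_rows(range(100), ratios=[0.7, 0.15, 0.15], seed=42)
```

Calling `split_rows` again with the same seed returns the same subsets, which is what makes the split reproducible.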

SDK Reference

How to:

split an observation table

Forecast Observation Table Automation¶

For forecast Use Cases, FeatureByte provides a Forecast Automation endpoint that automatically generates Observation Tables with appropriate POINT_IN_TIME and FORECAST_POINT values. This is the recommended way to create observation tables for time series forecasting, as it ensures correct alignment between prediction schedules and forecast horizons.

The automation is configured with:

Prediction Schedule (cron expression): Defines when predictions are generated (e.g., 30 3 * * 1 for weekly on Monday at 3:30 AM).
Prediction Schedule Timezone: The IANA timezone for the cron schedule (e.g., America/Los_Angeles).
Forecast Horizon: The number of time steps (days, hours, etc.) to forecast ahead.
Forecast Start Offset: The number of time steps to skip before the first forecast point (typically 0).
Periods: A list of date ranges, each producing one observation table with a name, purpose, target observation count, and mode.
Target Observation Count: Controls the size of the generated observation table. The automation will sample from the full set of possible entity-forecast point combinations to produce approximately this many rows.

Each period specifies a mode that controls how observation rows are generated:

ONE_ROW_PER_ENTITY_FORECAST_POINT: Generates one row per entity per forecast point, with a random POINT_IN_TIME offset. This is the only recommended mode for training observation tables, as it avoids over-fitting by ensuring each entity-forecast point pair appears only once with a unique point in time.
FORECAST_SERIES: Generates complete forecast series — for each POINT_IN_TIME, all forecast points within the horizon are included. This mode is used for visualization observation tables, where you want to see the full predicted series for each prediction date.
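The difference between the two modes can be illustrated with a small plain-Python sketch. It is not the Forecast Automation endpoint: all function names are hypothetical, the cron schedule is hard-coded as a weekly Monday 03:30 run, time zones and Target Observation Count sampling are ignored, and forecast points are assumed to be daily steps starting the day after the prediction date.

```python
import random
from datetime import date, datetime, timedelta

def weekly_prediction_times(start, end, weekday=0, hour=3, minute=30):
    """Prediction times for a weekly schedule such as '30 3 * * 1' (Monday 03:30)."""
    times, day = [], start
    while day <= end:
        if day.weekday() == weekday:  # Monday is 0 in Python's datetime
            times.append(datetime(day.year, day.month, day.day, hour, minute))
        day += timedelta(days=1)
    return times

def forecast_points(point_in_time, horizon, start_offset=0):
    """Assumed convention: daily steps starting the day after the prediction date."""
    first = point_in_time.date() + timedelta(days=1 + start_offset)
    return [first + timedelta(days=step) for step in range(horizon)]

def forecast_series(entities, times, horizon, start_offset=0):
    """FORECAST_SERIES: every forecast point in the horizon for each POINT_IN_TIME."""
    return [
        {"ENTITY": e, "POINT_IN_TIME": t, "FORECAST_POINT": fp}
        for e in entities
        for t in times
        for fp in forecast_points(t, horizon, start_offset)
    ]

def one_row_per_entity_forecast_point(entities, times, horizon, start_offset=0, seed=0):
    """ONE_ROW_PER_ENTITY_FORECAST_POINT: keep one random POINT_IN_TIME per pair."""
    rng = random.Random(seed)
    candidates = {}
    for row in forecast_series(entities, times, horizon, start_offset):
        candidates.setdefault((row["ENTITY"], row["FORECAST_POINT"]), []).append(row)
    return [rng.choice(group) for group in candidates.values()]

times = weekly_prediction_times(date(2024, 1, 1), date(2024, 1, 21))
series = forecast_series(["store_1", "store_2"], times, horizon=10)
training = one_row_per_entity_forecast_point(["store_1", "store_2"], times, horizon=10, seed=7)
```

With a 10-day horizon and weekly predictions, consecutive series overlap, so the series table holds more rows than the one-row-per-pair table built from the same inputs.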

User Interface

Learn by example with the 'Create Observation Tables' tutorial.

Observation Table Catalog¶

The Observation Tables registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list observation tables available in a catalog
get an observation table from a catalog
and get an observation table by its Object ID from a catalog

Development Dataset¶

A Development Dataset is a collection of source tables that serve as substitutes for production source tables. It is used during feature ideation to accelerate exploratory data analysis (EDA) and feature selection. Development datasets are especially valuable when the original tables are extremely large, and only a subset of the data is needed for analysis.

You can create a Development Dataset in two ways:

Manually: by mapping production tables to smaller existing development tables.
Automatically: from the EDA observation table of a Use Case, combined with a feature lookback to ensure sufficient history for feature aggregation.

Note

When built automatically, the Development Dataset is tailored to the essential needs of your use case. By default, it is built for entity selection aligned with your use case. If you want to extend entity selection to parent entities, update the settings accordingly, though this may result in a larger development dataset. Also, by default, only development tables that are significantly sampled (to less than 5% of the original table) are materialized.

User Interface

Learn by example with the 'Create Development Dataset' tutorial.

Views and Column Transforms¶

View¶

A view is a local virtual table that can be modified and joined to other views to prepare data before feature definition. A view does not contain any data of its own.

Views in FeatureByte allow operations similar to Pandas, such as:

creating and transforming columns and extracting lags
filtering records, capturing attribute changes, and joining views

Unlike Pandas DataFrames, which require loading all data into memory, views are materialized only when needed during previews or feature materialization.

View Creation¶

When a view is created, it inherits the metadata of the FeatureByte table it originated from. Currently, the following types of views are supported:

Event Views created from an Event table
Item Views created from an Item table
Snapshots Views created from a Snapshots table
Time Series Views created from a Time Series table
Dimension Views created from a Dimension table
Slowly Changing Dimension (SCD) Views created from a SCD table
Calendar Views created from a Calendar table
Change Views created from a SCD table.

Two view construction modes are available:

Auto (default): Automatically cleans data according to default operations specified for each column within the table and excludes special columns not meant for feature engineering.
Manual: Allows custom cleaning operations without applying default cleaning operations.

Although views provide access to cleaned data, you can still work with the raw data from the source table. To do so, use the view's raw attribute, which gives direct access to the source table's unprocessed data.

SDK Reference

Refer to the View object main page or to the specific links:

Change View¶

A Change View is created from a Slowly Changing Dimension (SCD) table to analyze how an attribute associated with the table's natural key changes over time. This view consists of five columns:

the natural key of the SCD table,
the change timestamp, which is equal to the effective timestamp of the SCD table,
the prior effective timestamp,
the value of the attribute before the change occurred,
and the value of the attribute after the change occurred.

Once the Change View is created, it can be used to generate features in the same way as features from an Event View.
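The five-column structure can be sketched in plain Python. This is an illustration rather than the SDK's Change View: the column names, the sample SCD rows, and the choice to emit the first version of each key with empty "past" values are all assumptions.

```python
from collections import defaultdict

# Hypothetical SCD rows: (natural key, effective timestamp, tracked attribute value)
scd_rows = [
    ("C1", "2023-01-10", "12 Oak St"),
    ("C1", "2023-06-02", "7 Elm Ave"),
    ("C2", "2023-03-15", "3 Pine Rd"),
    ("C1", "2023-11-20", "44 Lake Dr"),
]

def build_change_view(rows):
    """Return one change record per SCD row, carrying the prior timestamp and value."""
    by_key = defaultdict(list)
    for key, effective_ts, value in rows:
        by_key[key].append((effective_ts, value))
    changes = []
    for key, versions in by_key.items():
        versions.sort()  # ISO date strings sort chronologically
        past_ts, past_value = None, None
        for effective_ts, value in versions:
            changes.append({
                "natural_key": key,
                "change_timestamp": effective_ts,   # equals the SCD effective timestamp
                "past_effective_timestamp": past_ts,
                "past_value": past_value,
                "new_value": value,
            })
            past_ts, past_value = effective_ts, value
    return changes

changes = build_change_view(scd_rows)
```

Counting the records for a key within a time range then gives, for example, the number of address changes for a customer over that period.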

Examples

Changes to a SCD table can provide valuable insights into customer behavior, such as:

the number of times a customer has moved in the past six months,
their previous address if they recently moved,
whether they have gone through a recent divorce,
if there are new additions to their family,
or if they have started a new job.

SDK Reference

How to create a Change View from a SCD table.

Filters¶

Filters are an essential element in feature engineering strategies. They enable the segmentation of data into sub-groups, which facilitates specific operations and analyses:

Targeted Aggregations: Filters allow for meaningful aggregations of data that would otherwise be nonsensical. For instance, transactions can be categorized by their outcomes such as "Authorized", "Approved", or "Cancelled".
Focused Analysis: By using filters, it is possible to narrow down the analysis to specific event types and derive additional, relevant features for those types. For example, analyzing transactions by weekday may yield insightful trends for "Purchases" but may be less significant for "Banking Fees".

Ideation leverages Generative AI to aid enterprise users in identifying effective filters.

Within our SDK, users can manipulate data similarly to how one would use a Pandas DataFrame. It is possible to create new views from subsets of views. Additionally, a condition-based subset can be used to replace the values of a column.

View Sample¶

Using the sample method, a view can be materialized with a random selection of rows for a given time range, size, and seed to control sampling.

Note

Views from tables in a Snowflake data warehouse do not support the use of seed.

SDK Reference

How to materialize a sample of a view.

View Join¶

You can combine two views using the join() method. This operation enriches one view (the left view) with attributes from another (the right view).

The method matches rows based on a shared key, which depends on the right view type:

Dimension View → joins on the dimension ID
Snapshots View → joins on the series ID
SCD View → joins on the natural key
Event View → joins on the event ID

If the shared key identifies an entity already referenced in the left view — or the column name matches in both views — the join key is detected automatically.

By default, a left join is performed, and the resulting view will have the same number of rows as the left view. To restrict the output to matching rows only, set how="inner" for an inner join.

When the right view is an SCD or Snapshots view, the event timestamp or reference datetime column of the left view determines which record from the right view is joined.
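The point-in-time matching used for SCD right views can be sketched as follows. This is a plain-Python illustration, not the SDK's join() method: `join_scd`, its parameters and the sample rows are hypothetical, and a record is assumed to become active exactly at its effective timestamp.

```python
from bisect import bisect_right

def join_scd(left_rows, scd_rows, key, left_ts, right_ts, how="left"):
    """Attach to each left row the SCD record active at the left row's timestamp."""
    attr_cols = sorted({c for row in scd_rows for c in row} - {key, right_ts})
    history = {}
    for row in sorted(scd_rows, key=lambda r: r[right_ts]):
        history.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        versions = history.get(left[key], [])
        # Position of the last version whose effective timestamp is <= the left timestamp.
        pos = bisect_right([v[right_ts] for v in versions], left[left_ts])
        match = versions[pos - 1] if pos else None
        if match is None and how == "inner":
            continue  # inner join keeps matching rows only
        joined.append({**left, **{c: match[c] if match else None for c in attr_cols}})
    return joined

events = [
    {"event_id": "E1", "customer_id": "C1", "event_timestamp": "2023-02-01"},
    {"event_id": "E2", "customer_id": "C1", "event_timestamp": "2023-07-01"},
    {"event_id": "E3", "customer_id": "C2", "event_timestamp": "2023-01-05"},
]
customer_scd = [
    {"customer_id": "C1", "effective_timestamp": "2023-01-10", "city": "Lyon"},
    {"customer_id": "C1", "effective_timestamp": "2023-06-02", "city": "Paris"},
    {"customer_id": "C2", "effective_timestamp": "2023-03-15", "city": "Nice"},
]
left_joined = join_scd(events, customer_scd, "customer_id", "event_timestamp", "effective_timestamp")
inner_joined = join_scd(events, customer_scd, "customer_id", "event_timestamp", "effective_timestamp", how="inner")
```

In this sketch the left join keeps event E3 with an empty city because no customer record was active yet at that time, while the inner join drops it.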

Item Views

For an Item View, the event timestamp and entity columns from the related Event Table are automatically available. You can enrich the Item View with additional event-level attributes using the join_event_table_attributes() method.

Join Restrictions

SCD views cannot be joined to other SCD views.
Only Dimension views can be joined to other Dimension views.
Change views and Time Series views cannot be used as right tables.

SDK Reference

How to:

View Column¶

A View Column is a column within a FeatureByte view. When a view is created, each View Column represents the cleaned version of a table column. The cleaning procedure for a View Column depends on the view's construction mode and typically follows the default cleaning operations associated with the corresponding table column.

By default, special columns not intended for feature engineering are excluded from view columns. These columns may consist of record creation and expiration timestamps, surrogate keys, and active flags.

You can add new columns to a view by performing joins or by deriving new columns from existing ones.

If you wish to add new columns derived from the raw data in the source table, use the view's raw attribute to access the source table's unprocessed data.

SDK Reference

Refer to the ViewColumn object main page or to the specific links:

obtain view columns info
access raw data
and obtain descriptive statistics for a view column.

View Column Transforms¶

View Column Transforms apply transformation operations to columns within a view. Each transform produces a new column, which can either be assigned back to the view or used in further transformations.

The different types of transforms include:

generic transforms,
numeric transforms,
string transforms,
datetime transforms,
and lag transforms.

Additionally, you have the option to apply custom SQL User-Defined Functions (UDFs) on view columns. This is particularly useful for integrating transformer models with FeatureByte.

Generic Transforms¶

SDK Reference

You can apply the following transforms to columns of any data type in a view:

isnull: Returns a new boolean column that indicates whether each row is missing.
notnull: Returns a new boolean column that indicates whether each row is not missing.
isin: Returns a new boolean column showing whether each element in the view column matches an element in the passed sequence of values.
fillna: Replaces missing values in-place with specified values.
astype: Converts the data type of the column.

Numeric Transforms¶

SDK Reference

In addition to built-in arithmetic operators (+, -, *, /, etc.), you can apply the following transforms to columns of numeric type in a view:

abs: Returns absolute value
sqrt: Returns square root value
pow: Returns power value
log: Returns logarithm with natural base
exp: Returns exponential value
floor: Rounds down to the nearest integer
ceil: Rounds up to the nearest integer

String Transforms¶

SDK Reference

In addition to string columns concatenation, you can apply the following transforms to columns of string type in a view:

len: Returns the length of the string
lower: Converts all characters to lowercase
upper: Converts all characters to uppercase
strip: Trims whitespace or a specific character from both ends of the string
lstrip: Trims whitespace or a specific character from the left end of the string
rstrip: Trims whitespace or a specific character from the right end of the string
replace: Replaces substring with a new string
pad: Pads string up to the specified width size
contains: Returns a boolean flag column indicating whether each string element contains a target string
slice: Slices substrings for each string element

Datetime Transforms¶

The date or timestamp (datetime) columns in a view can undergo the following transformations:

Calculate the difference between two datetime columns.
Add a time interval to a datetime column to generate a new datetime column.
Extract date components from a datetime column.

Note

Date parts for columns or features using timestamp with time zone offset are based on the local time instead of UTC.

Date parts for columns or features using event timestamps of Event tables, where a separate column was specified to provide the time zone offset information, will also be based on the local time instead of UTC.
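The effect of the time zone offset on date parts can be seen with Python's standard datetime module. This is a standalone illustration with a made-up timestamp and a +08:00 offset; it does not call the SDK, and Python's weekday numbering (Monday = 0) may differ from the day_of_week convention used by FeatureByte.

```python
from datetime import datetime, timedelta, timezone

# A timestamp stored in UTC, with a separate "+08:00" time zone offset.
utc_ts = datetime(2024, 3, 1, 23, 30, tzinfo=timezone.utc)
offset = timezone(timedelta(hours=8))
local_ts = utc_ts.astimezone(offset)

# (day, hour, weekday) extracted from the UTC value and from the local value.
utc_parts = (utc_ts.day, utc_ts.hour, utc_ts.weekday())
local_parts = (local_ts.day, local_ts.hour, local_ts.weekday())
```

Here the UTC value falls late on Friday 1 March, while the local value falls on the morning of Saturday 2 March, so day, hour and day-of-week features all differ depending on which clock is used.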

SDK Reference

How to extract date components:

microsecond: Returns the microsecond component of each element
millisecond: Returns the millisecond component of each element
second: Returns the second component of each element
minute: Returns the minute component of each element
hour: Returns the hour component of each element
day: Returns the day component of each element
day_of_week: Returns the day of week component of each element
week: Returns the week component of each element
month: Returns the month component of each element
quarter: Returns the quarter component of each element
year: Returns the year component of each element

Lag Transforms¶

Lag Transforms retrieve the preceding value associated with a particular entity in a view.

This makes it possible to compute features that depend on inter-event time or on proximity to the previous point.

Note

Lag transforms are only supported for Event and Change views.
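A lag and the inter-event time derived from it can be sketched in plain Python. The helper `add_lag`, the column names and the sample events are hypothetical; the SDK's lag transform operates on view columns rather than on lists of dictionaries.

```python
from datetime import datetime

def add_lag(rows, entity_col, ts_col, value_col, lag_col):
    """Attach to each row the previous `value_col` observed for the same entity."""
    last_seen, out = {}, []
    for row in sorted(rows, key=lambda r: r[ts_col]):
        out.append({**row, lag_col: last_seen.get(row[entity_col])})
        last_seen[row[entity_col]] = row[value_col]
    return out

events = [
    {"customer_id": "C1", "ts": datetime(2024, 1, 1, 10, 0)},
    {"customer_id": "C2", "ts": datetime(2024, 1, 1, 11, 0)},
    {"customer_id": "C1", "ts": datetime(2024, 1, 1, 12, 30)},
    {"customer_id": "C1", "ts": datetime(2024, 1, 2, 12, 30)},
]

# Lag the timestamp itself, then subtract to obtain the inter-event time in hours.
lagged = add_lag(events, "customer_id", "ts", "ts", "prev_ts")
hours_since_prev = [
    None if r["prev_ts"] is None else (r["ts"] - r["prev_ts"]).total_seconds() / 3600
    for r in lagged
]
```

The first event of each customer has no preceding value, so its lag and inter-event time are empty.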

SDK Reference

How to extract lags from a view column.

UDF Transforms¶

A SQL User-Defined Function (UDF) is a custom function created by users to execute specific operations not covered by standard SQL functions. UDFs encapsulate complex logic into a single, callable routine.

One application is computing text embeddings with transformer-based models or large language models (LLMs), which can be wrapped in a UDF.

Creating a SQL Embedding UDF

For step-by-step guidance on creating a SQL Embedding UDF, visit the Bring Your Own Transformer tutorials.

SDK Reference

Refer to the UserDefinedFunction object main page or to the specific links:

make the function available to the FeatureByte SDK,
and retrieve a UDF instance from the catalog.

Feature Creation¶

Features¶

Features are the inputs used to train machine learning models and to compute predictions.

Sometimes, features can be taken directly from attributes already present in the source tables.

Example

A customer churn model may use features such as age, gender, income, and location from a customer profile table.

More commonly, features are engineered via row transformations, joins, filters, and aggregations.

Example

A churn model may rely on aggregates that summarize recent behavior, such as:

entropy of product types purchased over the past 12 weeks,
count of canceled orders over the past 56 weeks,
total amount spent over the past 7 days.

FeatureByte offers two ways to create features:

Manually — using the SDK declarative framework
Automatically — via Ideation

Feature Object¶

A Feature object in the FeatureByte SDK represents the logical plan—your computational blueprint—for deriving a feature.

Plans are defined from views in three primary ways:

Lookup features
Aggregate features
Cross Aggregate features

Additionally, Feature objects can be created as transformations of one or more existing features.

SDK Reference

See the Feature object, plus:

Create a Lookup feature
Group by entity for Aggregates and Cross Aggregates

Lookup Features¶

A Lookup feature refers to an entity attribute in a view at a specific point-in-time. Lookup features do not involve aggregation.

When a view’s primary key identifies an entity, its attributes can be exposed directly as features for that entity.

Examples

Customer’s birthplace from a Customer Dimension table
Transaction amount from a Transactions Event table

For an SCD or Snapshots view where attributes vary over time, a lookup feature is materialized through point-in-time joins; the value corresponds to the active row at the request’s point-in-time.

Example

A customer’s street address at the request’s point-in-time.

For a Calendar view, lookup features use the forecast point when available, falling back to the point-in-time otherwise. This allows calendar attributes (e.g., whether a date is a holiday) to be retrieved for future dates in forecasting scenarios, leveraging the fact that calendar data is known in advance. Since calendar data is pre-determined, offsets can go in both directions: a positive offset looks backward in time, while a negative offset looks forward.

You can also specify an offset to retrieve the attribute before the request’s point-in-time.

Example

With a 9-week offset, the feature is the customer’s street address nine weeks before the request’s point-in-time.

SDK Reference

How to create a Lookup feature.

Aggregate Features¶

Aggregate features summarize data grouped by one or more entities. They are fundamental for turning transactional records into informative signals.

Supported aggregation functions include: count, count distinct, sum, average, minimum, maximum, standard deviation, latest and na count.

Note

For richer signals from categorical columns (e.g., mode, entropy), see Cross Aggregate Features.

Ignoring time when aggregating can cause temporal leakage. FeatureByte provides three aggregate modalities:

Non-Temporal Aggregates
Aggregates Over a Window
Aggregates “As At” a Point-In-Time

Note

To capture interactions between two or more entities, group by the tuple of entities (e.g., amount a customer spent with a merchant in the past).

SDK Reference

Create:

Cross Aggregate Features¶

Cross Aggregate features aggregate across categories (a “by entity, across category” pattern). Group by entity keys and a categorical column, then compute counts, sums, or other reductions per category.

They are useful for:

Entropy / diversity over category distributions
Temporal or cohort comparisons of category distributions
Identifying key categories
Prevalence of entity attributes

Example Use Case

Total amount spent by each customer per product category over the last 4 weeks, yielding a distribution of spend across categories.

Technical Implementation

Cross Aggregates typically materialize as a dictionary: keys are categories; values are the aggregated metric.
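The dictionary shape of a Cross Aggregate can be sketched in plain Python. The function `cross_aggregate_sum`, the column names and the purchase rows are hypothetical, and the window is assumed to include its start and exclude the point-in-time.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def cross_aggregate_sum(rows, entity_col, category_col, value_col, ts_col, point_in_time, window):
    """Sum `value_col` per entity and per category over the window before point_in_time."""
    result = defaultdict(lambda: defaultdict(float))
    for row in rows:
        if point_in_time - window <= row[ts_col] < point_in_time:
            result[row[entity_col]][row[category_col]] += row[value_col]
    return {entity: dict(categories) for entity, categories in result.items()}

purchases = [
    {"customer": "C1", "category": "fruit", "amount": 5.0, "ts": datetime(2024, 1, 20)},
    {"customer": "C1", "category": "dairy", "amount": 3.0, "ts": datetime(2024, 1, 22)},
    {"customer": "C1", "category": "fruit", "amount": 2.5, "ts": datetime(2024, 1, 25)},
    {"customer": "C1", "category": "fruit", "amount": 9.0, "ts": datetime(2023, 12, 1)},
    {"customer": "C2", "category": "dairy", "amount": 4.0, "ts": datetime(2024, 1, 30)},
]

spend_by_category = cross_aggregate_sum(
    purchases, "customer", "category", "amount", "ts",
    point_in_time=datetime(2024, 2, 1), window=timedelta(weeks=4),
)
```

Each entity ends up with one dictionary whose keys are the categories seen in the window, which matches the "spend per product category over the last 4 weeks" example.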

As with standard aggregates, you can define:

Non-Temporal Cross Aggregates
Cross Aggregates Over a Window
Cross Aggregates “As At” a Point-In-Time

SDK Reference

How to group by entity across categories.

Non-Temporal Aggregates¶

Non-temporal aggregates ignore time ordering when summarizing values.

Important

To prevent time leakage, non-temporal aggregates are only supported for Item views when the grouping key is the event key of the Item view (e.g., count of items per order).

Note

Non-temporal features from an Item view can be added as a column to the corresponding Event view, and then windowed to create an aggregate over a window (e.g., a customer’s average order size over the past 3 weeks).

SDK Reference

Create a non-temporal aggregate
Add a feature as a column

Aggregates Over A Window¶

Aggregates over a window summarize data within a defined time frame and are common for event, item, snapshots, and time-series data.

The window length is set at feature creation.
The window end is determined at serving time by the request’s point-in-time and the feature’s feature job setting.
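How the window end can follow from the point-in-time and a feature job setting is sketched below. The convention used here is an assumption made for illustration: jobs run every `period`, shifted by `offset` from epoch-aligned boundaries, and the window ends `blind_spot` before the latest job at or before the point-in-time. The function names and sample rows are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def window_bounds(point_in_time, window, period, offset, blind_spot):
    """Return (start, end) of the aggregation window under the assumed convention."""
    elapsed = point_in_time - EPOCH - offset
    # Latest job time at or before the point-in-time.
    last_job = EPOCH + offset + (elapsed // period) * period
    window_end = last_job - blind_spot
    return window_end - window, window_end

def sum_over_window(rows, start, end):
    """Sum amounts whose timestamp falls in [start, end)."""
    return sum(r["amount"] for r in rows if start <= r["ts"] < end)

bounds = window_bounds(
    point_in_time=datetime(2024, 1, 10, 10, 45),
    window=timedelta(days=7),
    period=timedelta(hours=1),
    offset=timedelta(minutes=10),
    blind_spot=timedelta(minutes=5),
)
rows = [
    {"ts": datetime(2024, 1, 2, 9, 0), "amount": 5.0},     # before the window
    {"ts": datetime(2024, 1, 5, 12, 0), "amount": 10.0},   # inside the window
    {"ts": datetime(2024, 1, 10, 10, 7), "amount": 20.0},  # after the window end
]
total = sum_over_window(rows, *bounds)
```

In this example the latest job ran at 10:10, so the 7-day window ends at 10:05 and the record that arrived at 10:07 is not counted.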

SDK Reference

Create an aggregate over feature.

Aggregates “As At” a Point-In-Time¶

As-at aggregates summarize data active at a specific instant. These are supported for SCD and Snapshots views. The grouping key should not be the SCD’s natural key (since only one active row exists per key at a time).

You may also specify an offset to aggregate as of a time before the request’s point-in-time.

Example

Customer’s count of credit cards as at the request’s point-in-time
With a 2-week offset, the count as at two weeks prior
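The credit card example can be sketched in plain Python. The SCD layout (one row per card with an effective timestamp and an optional end timestamp), the helper `count_asat` and the sample rows are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def count_asat(rows, group_col, point_in_time, offset=timedelta(0)):
    """Count rows active at (point_in_time - offset), grouped by `group_col`."""
    asat = point_in_time - offset
    counts = Counter()
    for row in rows:
        started = row["effective_ts"] <= asat
        not_ended = row["end_ts"] is None or asat < row["end_ts"]
        if started and not_ended:
            counts[row[group_col]] += 1
    return dict(counts)

# Natural key: card_id. Grouping key: customer_id (not the natural key).
cards = [
    {"card_id": "A", "customer_id": "C1", "effective_ts": datetime(2024, 1, 1), "end_ts": None},
    {"card_id": "B", "customer_id": "C1", "effective_ts": datetime(2024, 1, 20), "end_ts": None},
    {"card_id": "C", "customer_id": "C2", "effective_ts": datetime(2024, 1, 5), "end_ts": datetime(2024, 1, 25)},
]

now_counts = count_asat(cards, "customer_id", datetime(2024, 2, 1))
lagged_counts = count_asat(cards, "customer_id", datetime(2024, 2, 1), offset=timedelta(weeks=2))
```

Grouping by the customer rather than by the card is what makes the count meaningful, since each card has at most one active row at any instant.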

SDK Reference

Create an aggregate “asat” feature.

Aggregates Of Changes Over a Window¶

Aggregates of changes summarize transitions captured in a Slowly Changing Dimension (SCD) table over a time window. They are defined using a Change view derived from an SCD column.

Example

Count of address changes for a customer in the last 12 weeks.

SDK Reference

Create a change view
Create an aggregate over from a change view

Temporal Window¶

A temporal window defines the period over which data is summarized to compute an aggregate feature. Using multiple windows helps capture short-, medium-, and long-term patterns in the data.

FeatureByte supports two types of temporal windows:

Rolling windows: Defined by a fixed duration preceding the point-in-time (e.g., 7d, 30d). These move continuously with each reference time.
Calendar windows: Aligned with calendar boundaries (e.g., previous full day, week, or month) according to the table’s Reference Time Zone. These are useful when aggregations must align with reporting or operational cycles.
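The two window types can be contrasted with a short plain-Python sketch. The helper names are hypothetical, and the point-in-time is assumed to be already expressed in the table's Reference Time Zone.

```python
from datetime import datetime, timedelta

def rolling_window(point_in_time, days):
    """Fixed duration immediately preceding the point-in-time."""
    return point_in_time - timedelta(days=days), point_in_time

def previous_calendar_month(point_in_time):
    """Previous full calendar month, aligned on month boundaries."""
    first_of_this_month = point_in_time.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    last_day_of_previous = first_of_this_month - timedelta(days=1)
    return last_day_of_previous.replace(day=1), first_of_this_month

pit = datetime(2024, 3, 15, 9, 0)
rolling = rolling_window(pit, days=30)
calendar = previous_calendar_month(pit)
```

The rolling window shifts with every point-in-time, whereas the calendar window stays the same for every point-in-time within March 2024.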

When to Use Calendar Windows

Use calendar windows when your data or analysis depends on calendar-aligned periods, such as recurring events (salary payments, rent), weekly operational metrics, or daily snapshot data. Calendar windows must be used for Snapshots and Time Series tables and can be optionally specified for Event and Item tables.

Window size (minutes, hours, days, weeks, …) depends on your use case and data cadence. Ideation can suggest effective window sizes based on your dataset.

Examples

Sum of shop sales over the past 4 weeks (rolling window)
Total call duration by a customer over the previous full calendar month (calendar window)
Rolling average of heart-rate variability over the last 24 hours
Maximum machine temperature in the last 30 minutes

Edge Effects

Start of data: Begin training after the earliest observation plus the largest window size to avoid truncated windows.
End of data: Set a sufficiently large blind spot to account for late-arriving data.

Feature Transforms¶

Feature Transforms generate new features by applying transformation operations to existing features (single or multiple), even across different entities. Available operations mirror those for view columns, with additional transforms for dictionary-shaped outputs from Cross Aggregates.

Features can also be derived from multiple features and from the points-in-time provided during feature materialization.

Examples of features derived from Cross Aggregates

Most common weekday for customer visits in the past 12 weeks
Count of unique items purchased by a customer in the past 4 weeks
List of distinct items bought by a customer in the past 4 weeks
Amount spent by a customer on ice cream in the past 4 weeks
Weekday entropy for customer visits in the past 12 weeks

Examples of features derived from multiple features

Similarity between a customer’s basket in the past week vs past 12 weeks
Similarity between a customer’s item basket and the baskets of customers in the same city over the past 2 weeks
Order amount z-score based on a customer’s order history over the past 12 weeks

SDK Reference

Transform dictionary outputs of cross aggregates:

get_value — retrieve a value by key
most_frequent — most frequent key
unique_count — number of distinct keys
entropy — entropy over keys
get_rank — rank of a key
get_relative_frequency — relative frequency of a key
cosine_similarity — similarity to another cross aggregate
normalize — normalize values by dividing each by the sum
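What several of these transforms compute can be sketched with plain-Python stand-ins operating on an ordinary dictionary. They reuse the transform names for readability only and are not the SDK functions: the natural-log base for entropy, the tie-breaking rule, and the handling of missing keys are assumptions.

```python
import math

def most_frequent(d):
    """Key with the largest value; ties go to the alphabetically first key."""
    return max(sorted(d), key=d.get)

def unique_count(d):
    """Number of distinct keys."""
    return len(d)

def normalize(d):
    """Divide each value by the sum of all values."""
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

def entropy(d):
    """Shannon entropy (natural log) of the normalized values."""
    return -sum(p * math.log(p) for p in normalize(d).values() if p > 0)

def get_relative_frequency(d, key):
    """Share of the total held by `key`; 0.0 if the key is absent."""
    return normalize(d).get(key, 0.0)

def cosine_similarity(a, b):
    """Cosine similarity between two dictionaries treated as sparse vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up cross aggregate: visit counts per weekday for one customer.
visits = {"Mon": 2, "Sat": 6, "Sun": 2}
top_day = most_frequent(visits)
weekday_entropy = entropy(visits)
```

A low entropy indicates visits concentrated on a few weekdays, while a cosine similarity close to 1 between two periods indicates a stable distribution.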

Ideation¶

Ideation accelerates feature and model development — reducing experimentation cycles from months to hours.

It mirrors a Data Scientist’s end-to-end workflow, providing intelligent automation and transparent documentation at every step.

The Ideation Process¶

Ideation systematically guides you through the following stages:

Analyze tables and relationships to identify relevant datasets.
Infer missing semantic tags using column metadata and patterns.
Recommend column transformations such as time deltas, ratios, and differences.
Identify key filters to isolate critical events and data segments.
Highlight important columns for deeper feature engineering.
Propose suitable aggregation time windows for temporal analysis.
Analyze event frequency patterns to uncover timing signals.
Recommend and evaluate features based on semantic relevance.
Detect reusable features from the existing Feature Catalog.
Perform exploratory data analysis (EDA) on each feature and assign a Predictive Score to sort features.
Select a feature set using SHAP value analysis.
Train and evaluate machine learning models on the resulting feature set.

Each step is fully logged and traceable, ensuring transparency and reproducibility.

Modes of Operation¶

Ideation can be executed in two modes:

Fully Automated Mode – Runs the complete ideation process end-to-end with minimal intervention.
Semi-Automated Mode – Lets you review, refine, and adjust recommendations interactively for greater control.

User Interface

The 'Ideate Features and Models' tutorial demonstrates the Fully Automated Mode.

The 'Refine Ideation' tutorial demonstrates the Semi-Automated Mode.

Feature Selection¶

After ideating features, FeatureByte supports three types of feature selection:

Rule-based feature selection: Selects features with the highest predictive scores overall and/or per theme.
SHAP-based feature selection: Identifies top-performing features of XGBoost or LightGBM models by analyzing their SHAP (SHapley Additive exPlanations) values.
GenAI-based feature selection: Refines feature selection using Generative AI.

Screening Criteria¶

Feature selection can be applied to a pre-selection of candidates, which may come from a prior selection, any filtered view of the ideated features, or a manual selection.

Candidates can also be further screened by:

Excluding Low Added Value Features: Removes features with limited predictive power and their derivations. This includes:
- Numeric and categorical features compared with features flagging their missing values.
- Dictionary-type features compared with simpler alternatives based on total counts or sums without grouping.
Excluding Specific Feature Types: Removes dictionary-type and embedding-type features from the candidate set.

Rule-Based Feature Selection¶

Rule-based feature selection identifies features with the highest predictive scores either overall or per theme.

Parameters:

Number of Top Features Overall: Specifies the number of features to select based on overall predictive scores.
Number of Top Features per Theme: Specifies the number of features to select per theme.
Selection Logic: Determines how the criteria are applied:
- OR: Selects features that meet at least one of the criteria: being in the top features overall OR in the top features per theme.
- AND: Selects only the features that meet both criteria: being in the top features overall AND in the top features per theme.
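The OR and AND selection logic can be sketched in plain Python. The function `select_features`, the feature names, scores and themes are all made up for illustration and do not reflect FeatureByte's internal implementation.

```python
def select_features(scores, themes, top_overall, top_per_theme, logic="OR"):
    """Combine 'top N overall' and 'top N per theme' selections with OR or AND."""
    if logic not in ("OR", "AND"):
        raise ValueError("logic must be 'OR' or 'AND'")
    ranked = sorted(scores, key=scores.get, reverse=True)
    overall = set(ranked[:top_overall])
    per_theme, taken = set(), {}
    for name in ranked:  # walk features from best to worst score
        theme = themes[name]
        if taken.get(theme, 0) < top_per_theme:
            per_theme.add(name)
            taken[theme] = taken.get(theme, 0) + 1
    return overall | per_theme if logic == "OR" else overall & per_theme

scores = {"f1": 0.9, "f2": 0.8, "f3": 0.7, "f4": 0.3, "f5": 0.2}
themes = {"f1": "A", "f2": "A", "f3": "A", "f4": "B", "f5": "B"}

selected_or = select_features(scores, themes, top_overall=3, top_per_theme=1, logic="OR")
selected_and = select_features(scores, themes, top_overall=3, top_per_theme=1, logic="AND")
```

With OR, the best feature of a weaker theme is kept even though it is not in the overall top 3; with AND, only features that rank in both lists survive.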

SHAP-Based Feature Selection¶

SHAP-based feature selection refines the feature set using L1 regularization and/or SHAP (SHapley Additive exPlanations) importance thresholds derived from XGBoost or LightGBM models trained on EDA data.

Parameters:

Model Type: The type of ML model used to compute SHAP values (XGBoost or LightGBM).
L1 Rounds: Number of iterations to apply L1 regularization on SHAP values to eliminate features with minimal contribution or high collinearity.
Importance Rounds: Number of iterations to apply SHAP importance thresholds, retaining only top-performing features.
Cumulative Importance Threshold: Retains the top-ranked features whose cumulative share of the total SHAP importance stays within this fraction (a value between 0 and 1).
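A cumulative importance cut-off can be sketched in plain Python. The importance values below are made up rather than computed SHAP values, the helper `select_by_cumulative_importance` is hypothetical, and the boundary rule (keep a feature only while the running share does not exceed the threshold) is an assumption.

```python
def select_by_cumulative_importance(importances, threshold):
    """Keep top-ranked features while their cumulative share stays within `threshold`."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    total = sum(importances.values())
    ranked = sorted(importances, key=importances.get, reverse=True)
    kept, running = [], 0.0
    for name in ranked:
        running += importances[name]
        # Small tolerance guards against floating-point rounding at the boundary.
        if running > threshold * total + 1e-9:
            break
        kept.append(name)
    return kept

# Made-up mean absolute importance per feature.
importances = {"a": 5.0, "b": 3.0, "c": 1.5, "d": 0.5}
kept_90 = select_by_cumulative_importance(importances, threshold=0.9)
kept_all = select_by_cumulative_importance(importances, threshold=1.0)
```

Here the first two features account for 80% of the total importance, and adding the third would reach 95%, so a 0.9 threshold keeps two features.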

Pre-Filtering Options: These are similar to the screening parameters used in Rule-based feature selection.

GenAI-Based Feature Selection¶

GenAI-based feature selection leverages Generative AI to refine feature selection. Simply set the Target Feature Count to specify the desired number of features to retain.

Feature Catalog¶

The Features registered in the catalog can be listed and retrieved by name for easy access and management.

In the SDK, features can be filtered based on two key attributes:

Primary Entity
and Primary Table.

SDK Reference

list features in a catalog,
and get a feature from a catalog.

Self-Organized Feature Catalog¶

FeatureByte Enterprise enhances the Feature Catalog with advanced capabilities:

Use Case Compatibility: It ensures that only features compatible with a defined Use Case are displayed, as detailed in Feature Compatibility with a Use Case.
Signal Type Categorization: Features are categorized by their Signal Type, facilitating easier identification and use.
Thematic Organization: Features are organized thematically, incorporating three key aspects:
- The feature's Primary Entity
- The feature's Primary Table
- The feature's Signal Type

In addition to basic filters, advanced filtering options in FeatureByte Enterprise include:

Signal Type.
Online Status.
Production readiness.
Feature data types.

User Interface

Learn by example with the 'Create New Feature List' tutorial.

Feature Compatibility with a Use Case¶

In the context of a Use Case, it's crucial to ensure that the features are compatible with the Use Case Primary Entity. For a feature to be considered compatible, its Primary Entity must align with the Use Case Primary Entity in one of two ways:

Direct Match: The feature's Primary Entity should be the same as the Use Case Primary Entity.
Hierarchical Relationship: The feature's Primary Entity should be either a parent or grandparent of the Use Case Primary Entity.

Example

Entity Diagram

Consider the following scenario:

Use Case: Card Default Prediction. Use Case Primary Entity: Card.

Feature in Question: A feature that records the maximum distance between a customer's residence and the locations of their card transactions over the past 24 hours. Feature Primary Entity: Customer.

Analysis: This feature is compatible with the Use Case. Although the Feature Primary Entity is 'Customer', the 'Customer' entity is a parent of the 'Card' entity: each card identifies exactly one customer. Therefore, the feature can be effectively utilized in the Card Default Prediction Use Case.
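The compatibility rule can be sketched as a lookup in an entity hierarchy. The `PARENTS` mapping is a made-up hierarchy that extends the Card and Customer example with hypothetical City and Country entities, and the helper names are not part of the SDK.

```python
# Hypothetical entity hierarchy: child entity -> list of parent entities.
PARENTS = {
    "Card": ["Customer"],
    "Customer": ["City"],
    "City": ["Country"],
    "Country": [],
}

def ancestors(entity, parents, max_depth=2):
    """Parents (depth 1) and grandparents (depth 2) of an entity."""
    found, frontier = set(), [entity]
    for _ in range(max_depth):
        frontier = [p for e in frontier for p in parents.get(e, [])]
        found.update(frontier)
    return found

def is_compatible(feature_entity, use_case_entity, parents=PARENTS):
    """Direct match, or the feature entity is a parent or grandparent of the use case entity."""
    return feature_entity == use_case_entity or feature_entity in ancestors(use_case_entity, parents)

card_use_case_checks = {
    entity: is_compatible(entity, "Card")
    for entity in ["Card", "Customer", "City", "Country"]
}
```

The relationship is directional: a Customer feature can serve a Card use case because every card maps to one customer, but a Card feature cannot serve a Customer use case without an aggregation.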

In FeatureByte Enterprise, this concept plays a crucial role by ensuring that only features compatible with a defined Use Case are displayed in the Feature Catalog. This functionality streamlines the selection process and enhances the overall effectiveness of Use Case implementation.

Feature Signal Type¶

In FeatureByte, the 'signal type' of a feature is a key indicator of the information it captures. This categorization is essential not only during feature ideation but also in organizing features in the catalog and assessing the comprehensiveness of a feature list.

Signal Type Examples

Attribute: gets the attribute of the entity at a point-in-time. For instance, it might record the employment status of a customer at a specific time.
Frequency: counts the occurrence of events, like the number of times a user logs into an application.
Recency: measures the time since the latest event, crucial in tracking customer engagement.
Timing: relates to when the events happened, helpful in understanding the regularity of events such as binge watching.
Latest event: attributes of the latest event, such as the latest transaction location in a credit card record.
Stats: aggregates a numeric column's values, like the total spent by a customer over the past 4 weeks.
Diversity: measures the variability of data values, useful in understanding the range of customer preferences.
Stability: compares recent events to those of earlier periods to gauge consistency.
Similarity: compares an individual entity feature to a group, important in anomaly detection.
Most frequent: gets the most frequent value of a categorical column, like the best-selling product in a store.
Bucketing: aggregates a column's values across categories of a categorical column, allowing multi-dimensional analysis.
Attribute stats: collects stats for an attribute of the entity, such as the representation of a customer age in the overall population purchases.
Attribute change: measures the occurrence or magnitude of changes to slowly changing attributes, crucial to detect key changes in the customer environment.

Tutorials

See examples of features categorized by their signal type in the 'Create New Feature List' tutorial.

Automated Signal Type tagging¶

FeatureByte Enterprise simplifies the categorization of features by their signal types through an automated tagging system. This intelligent system ensures each feature is accurately and consistently associated with its relevant signal type, reducing manual effort and enhancing the efficiency of the cataloging process.

Feature Primary Table¶

The Feature Primary Table is the central table, serving as the foundational source of data for the feature.

In a setup where an SCD table is joined with an Event table, the event table typically acts as the primary table. It contains the main events or transactions of interest, and these events are further enriched by joining with the SCD table.

Feature Secondary Table¶

The Feature Secondary Table supplements the primary table by providing additional attributes or dimensions. This table is typically joined with the primary table to enhance the data with more context.

Feature Theme¶

The Feature Theme is a concept in FeatureByte Enterprise, utilized to systematically categorize and organize features within the feature catalog. This categorization is achieved by integrating three key components:

Primary Entity: This element represents the main focus of the feature. It's the central aspect around which the feature is built.
Primary Table: This is the core database table from which the feature primarily draws its data. It provides the foundational dataset that defines the structure and context of the feature.
Signal Type: This component identifies the nature of the data signals used in the feature.

This thematic organization aids in providing a clear and structured view of the feature catalog, facilitating easier navigation and understanding of the available features.

Feature EDA¶

Feature EDA (Exploratory Data Analysis) is the step in the FeatureByte Copilot pipeline that evaluates each candidate feature against the target variable. It produces plots (distribution, mean target, etc.) and assigns a score to each feature so they can be ranked and filtered before model training.

Standard EDA¶

When no Naive Prediction is configured, the standard EDA is performed. Each feature is scored independently against the raw target variable using a Predictive Score (univariate analysis).

Residual EDA¶

When a Naive Prediction is available, the standard EDA is replaced by a Residual EDA. Instead of scoring features against the raw target, the analysis measures what each feature adds on top of the naive baseline:

Plots are computed on the transformed target: residuals (target − naive) for additive structures, or ratios (target / naive) for multiplicative structures.
Scoring uses the Incremental Predictive Score instead of the standard Predictive Score.

This ensures that features are evaluated on their incremental contribution rather than on signal that the naive prediction already captures.
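
As an illustration of the transformed target described above, the following minimal sketch computes residuals and ratios from a target and a naive prediction. The function name `transform_target` and the sample values are hypothetical and are not part of the FeatureByte SDK.

```python
def transform_target(target, naive, structure):
    """Return the target expressed relative to the naive prediction."""
    if structure == "additive":
        # Residuals: what is left after subtracting the baseline.
        return [t - n for t, n in zip(target, naive)]
    if structure == "multiplicative":
        # Ratios: how many times larger the target is than the baseline.
        return [t / n for t, n in zip(target, naive)]
    raise ValueError(f"unknown structure: {structure!r}")


target = [120.0, 80.0, 100.0]
naive = [100.0, 100.0, 100.0]

residuals = transform_target(target, naive, "additive")
ratios = transform_target(target, naive, "multiplicative")
```

With an additive structure a transformed value of 0 means the naive prediction was exact, while with a multiplicative structure the neutral value is 1.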

Feature Relevance¶

Feature relevance is essential for evaluating the impact of individual features on predictive models before modeling. Two key metrics are utilized to assess feature relevance:

Predictive Score¶

The Predictive Score (PS) measures the relationship between a feature and the target variable within a specific use case. A PS score of 1 indicates perfect correlation with the target, while 0 suggests no correlation.

Note

PS evaluates features independently and might overlook potential interactions among them, which could significantly affect predictive relevance. Some features may exhibit limited predictive utility when analyzed alone. However, when combined with others, they might reveal significant predictive power due to interaction effects.

Details

PS utilizes XGBoost for numerical, categorical, embedding or dictionary features, and regularized linear regression for textual features. The score is based on the Gini Norm (a scaled version of the Gini coefficient):

For regression, Gini Norm provides a quantitative measure of how well a model can distinguish between different groups. In insurance, it is frequently used to quantify how well the model can differentiate between high-risk and low-risk individuals. In marketing, it is used to quantify how well the model can differentiate between high-value and low-value customers.
For classification, Gini Norm is equivalent to 2 × (AUC − 0.5), where AUC is the Area Under the ROC Curve, providing a measure of the model's ability to discriminate between positive and negative classes.
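
The classification case can be sketched in a few lines of plain Python. This is only an illustration of the 2 × (AUC − 0.5) relationship, not FeatureByte's implementation; the helper names `auc` and `gini_norm` are hypothetical, and the AUC is computed with a simple pairwise comparison that is practical only for small samples.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def gini_norm(labels, scores):
    """Rescale AUC so that 0 means random ranking and 1 means perfect ranking."""
    return 2.0 * (auc(labels, scores) - 0.5)


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
example_gini = gini_norm(labels, scores)  # AUC is 0.75 for this sample
```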

Incremental Predictive Score¶

When a Naive Prediction is available for a use case, the standard Predictive Score is replaced by the Incremental Predictive Score (IPS). The IPS measures each feature's added predictive value beyond the naive baseline prediction. It is computed as:

IPS = GiniNorm(feature + naive) − GiniNorm(naive alone)

The score is floored at 0. A high IPS indicates that the feature provides significant predictive lift on top of the naive prediction, while 0 means the feature adds no value beyond the baseline.
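
The formula and the floor at 0 can be written as a one-line function. This is a minimal sketch that assumes the two Gini Norm values have already been computed; the function name and the sample values are hypothetical.

```python
def incremental_predictive_score(gini_feature_plus_naive, gini_naive_alone):
    """IPS: Gini Norm gained by adding the feature, floored at 0."""
    return max(0.0, gini_feature_plus_naive - gini_naive_alone)


# The feature improves on the baseline: a positive lift is reported.
lift = incremental_predictive_score(0.62, 0.55)

# The feature does worse than the baseline alone: the score is floored at 0.
no_lift = incremental_predictive_score(0.50, 0.55)
```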

Note

IPS evaluates each feature independently and may not capture interactions among features that could contribute to predictive power when combined.

Details

A LightGBM model is trained with the naive prediction as an offset (init_score) for all feature types, including text (which is represented as a TF-IDF n-gram matrix). Two models are compared: one using the feature plus the naive offset, and one using only the naive offset. For multiplicative naive predictions, a Poisson objective with log-linked offsets is used; for additive ones, standard regression is used. The IPS is the difference in Gini Norm between the two models.

Semantic Relevance¶

Semantic relevance, derived through Generative AI, examines the significance of each feature within a specific use case based on its semantic value without directly analyzing the data. This metric considers both the feature's description and the context of the use case. It complements the predictive score by ensuring that features not only display statistical correlation with the target variable but also carry contextual meaning.

High semantic relevance scores, combined with low statistical correlation, may indicate potential data quality issues or highlight the limitations of relying solely on statistical relevance. Semantic relevance can also capture critical constraints such as fairness, causality, and other contextual factors.

Naive Prediction¶

A Naive Prediction is a simple baseline forecast derived from historical values of the target variable (the forecasted column). It is configured during the Ideation Metadata step of the FeatureByte Copilot pipeline and is defined by two parameters:

Window: A time window (e.g., 7d, 28d) over which historical values are aggregated to form the baseline prediction.
Structure: Either additive or multiplicative, which determines how features are compared to the baseline:
- Additive: the naive prediction is the average (or temporal mean of sums) of the forecasted column over the window. Features are compared via difference: feature − naive.
- Multiplicative: same aggregation, but features are compared via ratio: feature / naive.

How it is generated¶

During feature ideation, the system identifies a naive prediction feature for each primary entity:

For series ID entities (e.g., individual stores or products): the average of the forecasted column over the naive window.
For higher-level entities (e.g., regions, categories): the temporal mean of sums of the forecasted column over the naive window.

Each feature derived from the forecasted column is then relativized against the naive prediction that shares its primary entity (difference for additive, ratio for multiplicative). The original features (using sum, avg, min, or max of the forecasted column) are pruned and replaced by these relative features.

The final naive prediction — the one used as baseline throughout the rest of the pipeline (EDA, feature selection, model training) — is the one whose primary entity matches the use case entity. Naive predictions for other entities are only used internally to relativize features and to compute cross-entity ratios, but are then pruned.

Role in the pipeline¶

The naive prediction feature plays a central role throughout the pipeline:

EDA: Standard Predictive Scores are replaced by Incremental Predictive Scores, which measure each feature's added value beyond the naive baseline.
Feature Selection: The naive prediction feature is always included in every selected feature set.
Model Training: The naive prediction is used as an offset (init_score), so the model learns to predict the residual (additive) or ratio (multiplicative) on top of the baseline.

Feature Materialization¶

The act of computing the feature is known as Feature Materialization.

Features are materialized in two ways:

on demand, to fulfill historical requests,
through a batch process called a "Feature Job", which pre-computes feature values for prediction purposes.

The Feature Job is scheduled based on the defined settings associated with each feature.

To materialize the feature values, either:

entities to which the feature is assigned
or their descendant entities (the serving entities) must be instantiated.

Additionally, in the context of historical feature serving, an observation set is required, created by combining:

entity key values
and point-in-time references that correspond to particular moments in the past.
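
To make the shape of an observation set concrete, here is a minimal sketch that builds one as a list of rows using only the standard library. The column names `POINT_IN_TIME` and `CUSTOMER_ID` and the key values are assumptions chosen for illustration; in practice the entity column must use the entity's serving name, and the set is usually a DataFrame or an observation table rather than a list.

```python
from datetime import datetime
from itertools import product

# Hypothetical entity key values and historical moments of interest.
customer_ids = ["C001", "C002"]
points_in_time = [datetime(2023, 1, 1), datetime(2023, 2, 1)]

# Each row pairs one entity key value with one point-in-time.
# Pairing every key with every timestamp is done here only for brevity.
observation_set = [
    {"POINT_IN_TIME": ts, "CUSTOMER_ID": cid}
    for cid, ts in product(customer_ids, points_in_time)
]
```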

Point-In-Time¶

A Point-In-Time for a feature refers to a specific moment in the past with which the feature's values are associated.

It is a crucial aspect of historical feature serving that allows Machine Learning models to make predictions based on historical data. By providing a point-in-time, a feature can be used to train and test models on past data, enabling them to make accurate predictions for similar situations in the future.

Feature Governance¶

Feature Version¶

A Feature Version enables the reuse of a Feature with varying feature job settings or distinct cleaning operations.

If the availability or freshness of the source table changes, new versions of the feature can be generated with a new feature job setting. On the other hand, if changes occur in the data quality of the source table, new versions of the feature can be created with new cleaning operations that address the new quality issues.

To ensure the seamless inference of Machine Learning tasks that depend on the feature, old versions of the feature can still be served without any disruption.

Note

In the FeatureByte SDK, a new feature version is a new Feature object with the same namespace as the original Feature object. The new Feature object has its own Object ID and version name.

SDK Reference

How to:

create a new feature version,
list versions for a feature,
get the feature version name,
get the feature version Object ID,
get a specific version for a feature from a catalog,
get a specific version by its Object ID from a catalog,
and set default version

Feature Readiness¶

To help differentiate features that are in the prototype stage and features that are ready for production, a feature version can have one of four readiness levels:

PRODUCTION_READY: ready for deployment in production environments.
PUBLIC_DRAFT: shared for feedback purposes.
DRAFT: in the prototype stage.
DEPRECATED: not advised for use in either training or prediction.

Important

Only one feature version can be designated as PRODUCTION_READY at a time.

When a feature version is promoted to PRODUCTION_READY, guardrails are applied automatically to ensure consistency with default cleaning operations and feature job settings. You can disregard these guardrails if the settings of the promoted feature version adhere to equally robust practices.

Important Note for FeatureByte Enterprise Users

In Catalogs with Approval Flow enabled, moving features to production-ready status involves a comprehensive approval process.

This includes several evaluations, such as checking the feature's compliance with default cleaning operations and the feature job setting of its source tables. It also involves confirming the status of these tables and backtesting the feature job setting to prevent future training-serving inconsistencies. Additionally, essential details of the feature, particularly its feature definition file, are shared and subjected to a thorough review.

SDK Reference

How to:

change readiness of a feature version.

User Interface

Learn by example with the 'Deploy and serve' tutorial.

Default Feature Version¶

The default version of a feature streamlines the process of reusing features by providing the most appropriate version. Additionally, it simplifies the creation of new versions of feature lists.

By default, the feature's version with the highest level of readiness is considered, unless you override this selection. In cases where multiple versions share the highest level of readiness, the most recent version is automatically chosen as the default.
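
The selection rule can be sketched as a sort key: readiness first, recency second. This is an illustration only. The ranking of the four readiness levels (DEPRECATED lowest, PRODUCTION_READY highest) is an assumption, the version records are hypothetical, and the manual override mentioned above is not modeled.

```python
from datetime import datetime

# Assumed ordering of readiness levels, from lowest to highest.
READINESS_RANK = {
    "DEPRECATED": 0,
    "DRAFT": 1,
    "PUBLIC_DRAFT": 2,
    "PRODUCTION_READY": 3,
}


def pick_default_version(versions):
    """Pick the version with the highest readiness; break ties by recency."""
    return max(
        versions,
        key=lambda v: (READINESS_RANK[v["readiness"]], v["created_at"]),
    )


versions = [
    {"name": "V230101", "readiness": "PUBLIC_DRAFT", "created_at": datetime(2023, 1, 1)},
    {"name": "V230301", "readiness": "PUBLIC_DRAFT", "created_at": datetime(2023, 3, 1)},
    {"name": "V230401", "readiness": "DRAFT", "created_at": datetime(2023, 4, 1)},
]
default_version = pick_default_version(versions)
```

In this example the most recent version is a DRAFT, so it is passed over in favor of the later of the two PUBLIC_DRAFT versions.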

Note

When a feature is accessed from a catalog without specifying its object ID or its version name but only by its name, the default version is automatically retrieved.

SDK Reference

How to:

Feature Definition File¶

The feature definition file is the single source of truth for a feature version. This file is automatically generated when a feature is declared in the SDK or a new version is derived.

The syntax used in the SDK is also used in the feature definition file. The file provides an explicit outline of the intended operations of the feature declaration, including those inherited but not explicitly declared by you. For example, these operations may include feature job settings and cleaning operations inherited from table metadata.

The feature definition file is the basis for generating the final logical execution graph, which is then transpiled into platform-specific SQL (e.g. SnowSQL, SparkSQL) for feature materialization.

Definition File

SDK Reference

How to obtain the feature definition file.

Feature Online Enabled¶

An online enabled feature is a feature that is used by at least one deployed feature list.

SDK Reference

How to obtain the feature online-enabled property.

Feature List Creation¶

Feature List¶

A Feature List is a collection of features. It is usually tailored to meet the needs of a particular use case and generate feature values for Machine Learning training and inference.

A Feature List may optionally include a Naive Prediction reference, which identifies the naive prediction feature and its structure (additive or multiplicative). When present, the naive prediction feature is always included in the list and is used as an offset during model training.

Historical feature values are first obtained to train and test models.

Once a model has been trained and validated, the Feature List can be deployed, and pre-computed feature values can be stored in the feature store and accessed through online and batch serving to generate predictions.

SDK Reference

Refer to the FeatureList object main page or to the specific links:

create a feature list,
list features in a feature list.

User Interface

Learn by example with the 'Create New Feature List' tutorial.

Feature Group¶

A Feature Group is a temporary collection of features that facilitates the manipulation of features and the creation of feature lists.

Note

It is not possible to save the Feature Group as a whole. Instead, each feature within the group can be saved individually. To save a Feature Group as a whole, first convert it to a Feature List.

SDK Reference

Refer to the FeatureGroup object main page or to the specific links:

Feature List Builder¶

The Feature List Builder facilitates the construction of new feature lists. It becomes active once a specific Use Case is identified. Users can then enrich their feature list by selecting relevant features from two resources: the Feature Catalog or the Feature List Catalog.

The tool offers real-time statistics on several aspects: the readiness level of the selected features, which indicates the percentage of features that are production ready, the percentage of features currently active online, and the diversity of themes incorporated into the list.

Moreover, it dynamically suggests additional features from unrepresented themes. This recommendation system is designed to ensure the feature list encompasses a broad spectrum of signals, enhancing the overall predictive power of the feature list.

User Interface

Learn by example with the 'Create New Feature List' tutorial.

Feature List Catalog¶

The Feature Lists registered in the catalog can be listed and retrieved by name for easy access and management.

SDK Reference

How to:

list feature lists in a catalog,
get a feature list from a catalog,

In the SDK, feature lists can be filtered based on three key attributes:

The Primary Entity of the Feature List
The Primary Tables used by the features within the lists.
The Primary Entities used by the features within the lists.

In FeatureByte Enterprise, feature lists can also be filtered based on:

Use Case, as detailed in Feature List Compatibility with a Use Case
Usage Status.
Production readiness.
Percentage of features deployed in production.
Exclusion of lists containing certain feature data types

Feature List Compatibility with a Use Case¶

In the context of a Use Case, it's crucial to ensure that the feature lists are compatible with the Use Case Primary Entity. For a feature list to be considered compatible, its Primary Entity must align with the Use Case Primary Entity in one of two ways:

Direct Match: The feature list's Primary Entity should be the same as the Use Case Primary Entity.
Hierarchical Relationship: The feature list's Primary Entity should be either a parent or grandparent of the Use Case Primary Entity.

In FeatureByte Enterprise, this concept plays a crucial role by ensuring that only feature lists compatible with a defined Use Case are displayed in the Feature List Catalog User Interface.

Example

Entity Diagram

Consider the following scenario:

Use Case: Card Default Prediction. Use Case Primary Entity: Card.

Feature List in Question: The feature list contains 2 features:
- A feature that records the maximum distance between a customer's residence and the locations of their card transactions over the past 24 hours.
- A feature on the Customer City population.

The Feature List Primary Entity: Customer.

Analysis: This feature list is compatible with the Use Case. Although the Feature List Primary Entity is 'Customer', the Customer entity is a parent of the 'Card' entity: each card belongs to exactly one customer. The feature list can therefore be served for any card and used in the Card Default Prediction Use Case.
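
The two compatibility rules (direct match, or parent or grandparent) can be sketched as a lookup over a parent map. The entity names beyond Card and Customer, the `parents` dictionary, and the function names are hypothetical; this is not how FeatureByte stores relationships.

```python
def ancestors_within(entity, parents, max_depth=2):
    """Collect parents (depth 1) and grandparents (depth 2) of an entity."""
    found = set()
    frontier = {entity}
    for _ in range(max_depth):
        frontier = {p for e in frontier for p in parents.get(e, ())} - found
        found |= frontier
    return found


def is_compatible(primary_entity, use_case_entity, parents):
    """Direct match, or the primary entity is a parent or grandparent."""
    if primary_entity == use_case_entity:
        return True
    return primary_entity in ancestors_within(use_case_entity, parents)


# Hypothetical child -> parents relationships.
parents = {"Card": ["Customer"], "Customer": ["City"], "City": ["State"]}

customer_ok = is_compatible("Customer", "Card", parents)  # parent of Card
card_on_customer = is_compatible("Card", "Customer", parents)  # child, not ancestor
```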

Feature List Thematic Coverage¶

FeatureByte Enterprise leverages the systematic thematic categorization of features by analyzing the Feature Theme attributed to each feature in a given feature list to assess its comprehensiveness. Any thematic areas that are not adequately covered by the existing features in the list are highlighted as "Themes not covered".
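
As a rough illustration of this coverage check, a theme can be modeled as a (primary entity, primary table, signal type) triple and the uncovered themes obtained by set difference. The `Theme` type, the candidate themes, and the table names below are hypothetical; how FeatureByte Enterprise determines the set of candidate themes is not described here.

```python
from collections import namedtuple

# A theme combines the three components of the Feature Theme concept.
Theme = namedtuple("Theme", ["primary_entity", "primary_table", "signal_type"])


def themes_not_covered(candidate_themes, feature_list_themes):
    """Return candidate themes that no feature in the list represents."""
    covered = set(feature_list_themes)
    return [theme for theme in candidate_themes if theme not in covered]


candidates = [
    Theme("Customer", "CARD_TRANSACTIONS", "Frequency"),
    Theme("Customer", "CARD_TRANSACTIONS", "Recency"),
    Theme("Customer", "CUSTOMER_PROFILE", "Attribute"),
]
in_list = [
    Theme("Customer", "CARD_TRANSACTIONS", "Frequency"),
    Theme("Customer", "CARD_TRANSACTIONS", "Frequency"),
]
missing = themes_not_covered(candidates, in_list)
```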

Feature List Simplification¶

FeatureByte provides two approaches for Feature List Simplification:

Traditional SHAP-based Simplification: Uses standard SHAP importance derived from a trained model.
Regularized SHAP-based Simplification (Novel Technique): Employs a new method that regularizes SHAP values to yield a more compact and interpretable feature set.

Both approaches support Key-Based Feature extraction from dictionary-style features, improving interpretability when working with nested or high-dimensional feature structures.

How the Regularized SHAP Technique Works¶

This new technique produces a simpler and more interpretable Feature List through a two-step process:

Template Model Training: Train an XGBoost or LightGBM template model on nested training data to generate SHAP values.
Regularized Linear Model Training: Train a regularized linear model on nested validation data using the SHAP values from Step 1 as inputs. The regularization encourages sparsity, naturally reducing the feature set.

Advantages Over Traditional SHAP Importance¶

Compared to the standard SHAP-importance-based approach, the regularized technique offers several benefits:

No manual feature count thresholding: The method automatically selects the most essential features, eliminating the need to manually choose a cut-off.
More compact and effective feature sets: The resulting feature list is typically smaller while preserving model accuracy, offering an optimal balance between simplicity and predictive performance.

Feature List Serving¶

Note

A feature list can be served by its primary entity or any descendant serving entities.

Historical Feature Serving¶

Historical serving of a feature list is usually intended for exploration, model training, and testing outside FeatureByte. The requested data is represented by an observation set that combines entity key values and historical points-in-time, for which you want to materialize feature values.

Requesting historical features is supported by two methods:

compute_historical_features(): returns a loaded DataFrame. Use this method when the output is expected to be of a manageable size that can be handled locally.
compute_historical_feature_table(): returns a HistoricalFeatureTable object representing the output table stored in the feature store. This method is suitable for handling large tables and storing them in the feature store for reuse or auditing.

Note

Historical feature values are not pre-computed or stored. Instead, the serving process combines partially aggregated data that is pre-computed and stored as offline tiles. Reusing these tiles significantly reduces the compute resources needed for each request.

SDK Reference

Refer to the HistoricalFeatureTable main page or to the specific links:

compute historical feature values as a DataFrame,
compute historical feature values as a HistoricalFeatureTable object,
list historical feature tables available in a catalog,
get a historical feature table from a catalog,
and get a historical feature table by its Object ID from a catalog.

User Interface

Learn by example with the 'Compute Feature Table' tutorial.

Feature List Deployment¶

A feature list can be deployed to enable online and batch serving.

Before creating a deployment, ensure that all features in the corresponding feature list are labeled as "PRODUCTION_READY".

When creating a deployment, it must be associated with a use case, which determines the entity used for serving.

Note

A single feature list can be linked to multiple deployments and use cases if needed.

SDK Reference

Refer to the Deployment documentation or the specific method below:

Deploy a feature list

User Interface

Learn by example in the 'Deploy and Serve' tutorial.

Online and Batch Serving¶

To serve a feature list, it must first be deployed, and the corresponding deployment must be enabled.

Both online and batch serving require request data that includes key values from the primary entity of the use case associated with the deployment.

Batch Serving¶

Batch feature values are generated from the Deployment and a source table or a managed view containing the key values of the primary entity from the associated use case.

The output is a Batch Feature Table, which contains the computed feature values and metadata describing how the table was generated.

Note

The column containing the entity values must use a valid serving name. If your source table uses a different column name, configure the mapping during the request.

You can optionally specify a point-in-time for feature computation. If not provided, it will be automatically determined when the request is submitted.

Online Serving¶

FeatureByte supports online feature serving through a low-latency key–value store. This enables real-time feature retrieval for use cases such as recommendation systems, fraud detection, and personalized user experiences.

To configure online serving, follow the steps in the Administrator Guide.

When online serving is enabled, FeatureByte automatically orchestrates feature materialization into the online store.

Requests are performed via a REST API service. You can obtain ready-to-use Python or shell script templates for these API calls directly from the Deployment.

Shell template

SDK Reference

Refer to the BatchFeatureTable documentation or the following SDK methods:

User Interface

Explore the 'Deploy and Serve' tutorial for a hands-on example in the FeatureByte UI.

Feature List Governance¶

Feature List Version¶

A Feature List Version lets a feature list pick up the latest version of each of its features. When a new feature list version is created, the current default version of each feature is used unless specific feature versions are specified.

SDK Reference

How to:

create a new feature list version,
list versions for a feature list,
get a specific version of a feature list from a catalog,

Default Feature List Version¶

The 'Default Version of a Feature List' must comprise the default version of each feature, as indicated by its default_feature_fraction property being equal to 1. If this fraction is less than 1, a new feature list version must be created as the Default Feature List Version. Upon creation of this new list, the default_feature_fraction of the Default Feature List Version will be reset to 1.
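
A minimal sketch of how such a fraction could be computed is shown below. It only illustrates the idea behind default_feature_fraction; the function, the dictionaries, and the version names are hypothetical and do not reflect the SDK's internals.

```python
def default_feature_fraction(list_versions, default_versions):
    """Share of features in the list whose version is the feature's default."""
    matches = sum(
        1
        for name, version in list_versions.items()
        if default_versions.get(name) == version
    )
    return matches / len(list_versions)


# Versions pinned in a feature list version (hypothetical names).
list_versions = {"amount_sum_28d": "V230101", "txn_count_7d": "V230101", "city_pop": "V230101"}

# Current default version of each feature; one feature has a newer default.
default_versions = {"amount_sum_28d": "V230101", "txn_count_7d": "V230315", "city_pop": "V230101"}

fraction = default_feature_fraction(list_versions, default_versions)
```

Here the fraction is below 1 because one feature has a newer default version, which signals that a new feature list version is needed.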

SDK Reference

How to:

Feature List Status¶

Feature lists can be assigned one of five status levels to differentiate between experimental feature lists and those suitable for deployment or already deployed.

"DEPLOYED": Assigned to feature list with at least one deployed version.
"TEMPLATE": For feature lists as reference templates or safe starting points.
"PUBLIC_DRAFT": For feature lists shared for feedback purposes.
"DRAFT": For feature lists in the prototype stage.
"DEPRECATED": For outdated or unnecessary feature lists.

Note

The status is managed at the namespace level of a Feature List object, meaning all versions of a feature list share the same status.

For the following scenarios, some status levels are automatically assigned to feature lists:

when a new feature list is created, the "DRAFT" status is assigned to the feature list.
when at least one version of the feature list is deployed, the "DEPLOYED" status is assigned.
when deployment is disabled for all versions of the feature list, the "PUBLIC_DRAFT" status is assigned.

Additional guidelines:

Before setting a feature list status to "TEMPLATE", ensure all features in the default version are "PRODUCTION_READY".
Only "DRAFT" feature lists can be deleted.
You cannot revert a feature list status to a "DRAFT" status.
Once a feature list is in "DEPLOYED" status, you cannot change it to another status until all the associated deployments are disabled.
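
The automatic assignments and guidelines above can be summarized as a small state machine. The class below is a toy sketch, not the SDK: its name and methods are hypothetical, and it assumes that "DEPLOYED" can never be set manually because it is assigned automatically.

```python
class FeatureListStatusSketch:
    """Toy model of the feature list status rules."""

    # Statuses a user may set by hand. "DRAFT" is excluded because a status
    # cannot be reverted to DRAFT; "DEPLOYED" is assumed to be automatic only.
    MANUAL_STATUSES = {"TEMPLATE", "PUBLIC_DRAFT", "DEPRECATED"}

    def __init__(self):
        self.status = "DRAFT"  # new feature lists start as DRAFT
        self.enabled_deployments = 0

    def enable_deployment(self):
        self.enabled_deployments += 1
        self.status = "DEPLOYED"

    def disable_deployment(self):
        if self.enabled_deployments == 0:
            raise ValueError("no enabled deployment to disable")
        self.enabled_deployments -= 1
        if self.enabled_deployments == 0:
            self.status = "PUBLIC_DRAFT"

    def set_status(self, new_status, default_features_all_production_ready=False):
        if self.status == "DEPLOYED":
            raise ValueError("disable all deployments before changing the status")
        if new_status not in self.MANUAL_STATUSES:
            raise ValueError(f"{new_status} cannot be set manually")
        if new_status == "TEMPLATE" and not default_features_all_production_ready:
            raise ValueError("TEMPLATE requires PRODUCTION_READY features")
        self.status = new_status

    def can_delete(self):
        return self.status == "DRAFT"
```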

SDK Reference

How to:

get the status of a feature list,
change the status of a feature list.

Feature List Readiness¶

The Feature List Readiness metric provides a statistic on the readiness of features in the feature list version. This metric represents the percentage of features that are production ready within the given feature list.

Important

Before a feature list version is deployed, all its features must be "production ready" and the metric should be 100%.
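
As a simple illustration of the metric, the percentage can be computed from the readiness level of each feature in the list. The function name and sample data are hypothetical.

```python
def readiness_percentage(readiness_levels):
    """Percentage of features in a feature list that are PRODUCTION_READY."""
    ready = sum(1 for level in readiness_levels if level == "PRODUCTION_READY")
    return 100.0 * ready / len(readiness_levels)


levels = ["PRODUCTION_READY", "PRODUCTION_READY", "PRODUCTION_READY", "DRAFT"]
metric = readiness_percentage(levels)

# A feature list version is deployable only when the metric reaches 100%.
deployable = metric == 100.0
```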

SDK Reference

How to get the readiness metric of a feature list.

Feature List Percentage of Online Enabled Features¶

The 'Feature List Percentage of Online Enabled Features' represents the proportion of features in the feature list that are already online enabled, that is, used by at least one deployed feature list. A value close to 100% suggests a lower cost for deploying the feature list.

Feature Table¶

A Feature Table contains historical feature values from a historical feature request that are typically produced to train or test Machine Learning models. The historical feature values can also be obtained as a Pandas DataFrame, but using a Feature Table has some benefits such as handling large tables, storing the data in the feature store for reuse, and offering full lineage.

SDK Reference

Refer to the HistoricalFeatureTable object main page.

Feature Table Creation¶

In the SDK, a HistoricalFeatureTable object is created by getting historical features from a feature list with the compute_historical_feature_table() method. The method takes as input an observation table that combines historical points-in-time and key values of the feature list's primary entity or of its related serving entities.

In the FeatureByte Enterprise User Interface, a Feature Table can be generated by selecting a feature list and specifying an observation table compatible with the feature list.

SDK Reference

How to compute feature table.

Feature Table Lineage¶

The Feature Table contains metadata on the Feature List and Observation Table used.

SDK Reference

How to:

access the feature_list_id,
access the observation_table_id.

Feature Table Purpose¶

The purpose of a Feature Table depends on the purpose of the observation table it comes from. It can vary from being a simple preview to being used for more complex tasks like exploratory data analysis, training, or validation tests. This classification helps in easily identifying and reusing Feature Tables.

Feature Table Association with a Context or Use Case¶

The association of a Feature Table with specific Contexts or Use Cases is determined by its originating observation table. This link makes it straightforward to organize and locate Feature Tables relevant to particular use cases.

SDK Reference

How to list feature tables related to a Use Case

FeatureByte Machine Learning Model¶

A FeatureByte Machine Learning Model predicts the target of a Use Case. It can be initiated from:

A Feature List or Ideated Feature Selection
A Model Template with user-defined settings
A Training Observation Table
A Validation Observation Table

During ideation, FeatureByte automatically trains models based on the selected features.

Model Report Export¶

A comprehensive Model Report summarizing the model’s configuration settings, training details, and evaluation analysis can be exported in either PDF or Markdown (MD) format for documentation, review, or sharing purposes.

Model Template¶

A Model Template encapsulates both preprocessing tasks and the estimator.

Preprocessing¶

Preprocessing steps handle feature transformations for:

Numeric features
Categorical features
Dictionary features
Embedding features
Text features

Each preprocessing step and estimator component is fully configurable via template parameters.

Estimators¶

Default templates include models such as LGBM and XGBoost with built-in early stopping.

User-defined templates will be supported in a future release.

Binary Classification Evaluation¶

Interactive plots provide insights into separability, ranking quality, calibration, and decision-threshold trade-offs, based on model predictions from the validation dataset.

Available Plots¶

Score Distributions
Avg Predicted vs Actual (per score-ranked bin)
ROC Curve
Precision–Recall Curve
KS / Gain Curve
Lift Chart
Metrics vs Threshold
Gain Report (Interactive Table)

Choosing the Right Plot¶

Goal	Recommended Plot(s)	Description
Check class separability	Score Distributions	Visualize overlap between positive and negative score distributions.
Assess ranking quality	ROC Curve (AUC)	Shows True Positive Rate vs False Positive Rate across thresholds.
Handle class imbalance	Precision–Recall Curve (AP)	More informative for imbalanced datasets.
Select a threshold	Metrics vs Threshold + Confusion Matrix	Compare metrics and trade-offs interactively.
Analyze top-k performance	KS / Gain Curve, Lift Chart, Gain Report	Evaluate recall/lift by population depth and export targeting tables.

Score Distributions (Classification)¶

Displays overlaid histograms of model-predicted scores.

Tabs:

By Class — Positive vs Negative distributions
All Samples — Overall frequency

Controls: Adjustable bins (5–100), hover tooltips

Axes:

X — Predicted Score
Y — Frequency

Best for: Quick visual check of class separation.

Avg Predicted vs Actual per Score-Ranked Bin (Classification)¶

Two curves plotted across score-ranked bins:

Avg Predicted Score
Actual Positive Rate

X-axis: Bin number (ranked by predicted score)

Use it to: Compare predicted probabilities with actual outcomes — a visual calibration check.
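
The binning behind this plot can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (equal-sized bins, toy data); the function name `calibration_bins` is hypothetical and not part of the FeatureByte SDK.

```python
def calibration_bins(scores, labels, n_bins=4):
    """Rank rows by predicted score and summarize each bin.

    Returns a list of (avg_predicted_score, actual_positive_rate) tuples,
    ordered from the lowest-scoring bin to the highest.
    """
    ranked = sorted(zip(scores, labels))  # ascending by score
    n = len(ranked)
    bins = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:  # more bins than rows
            continue
        avg_pred = sum(s for s, _ in chunk) / len(chunk)
        pos_rate = sum(y for _, y in chunk) / len(chunk)
        bins.append((avg_pred, pos_rate))
    return bins


# Toy validation data (made up for illustration).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

for i, (avg_pred, pos_rate) in enumerate(calibration_bins(scores, labels), start=1):
    print(f"bin {i}: avg predicted={avg_pred:.2f}, actual rate={pos_rate:.2f}")
```

A well-calibrated model shows the two values tracking each other closely in every bin.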

ROC Curve¶

Plots True Positive Rate (TPR) vs False Positive Rate (FPR).

Features:

Displays AUC (Area Under Curve)
Includes Random Baseline and KS marker
Tooltips show TPR, FPR, threshold, and Youden’s J

Best for: Assessing overall discrimination power.
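
For readers who want to see how the quantities shown on this plot relate to each other, here is a minimal sketch in plain Python. It assumes a row is predicted positive when its score is greater than or equal to the threshold; the helper names `roc_points` and `auc` and the data are illustrative only.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for every distinct score, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(float("inf"), 0.0, 0.0)]  # nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the (fpr, tpr) curve."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy validation data (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

pts = roc_points(scores, labels)
# Youden's J = TPR - FPR; its maximum is the KS statistic.
ks_threshold, ks_fpr, ks_tpr = max(pts, key=lambda p: p[2] - p[1])
print("AUC:", auc(pts))
print("KS:", ks_tpr - ks_fpr, "at threshold", ks_threshold)
```

The random baseline corresponds to the diagonal where TPR equals FPR, which gives an AUC of 0.5 and a KS of 0.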

Precision–Recall Curve¶

Shows the trade-off between Precision and Recall.

Features:

Highlights F1-optimal threshold
Tooltips include Precision, Recall, Threshold, and F1

Best for: Evaluating positive-class performance in imbalanced datasets.
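
The F1-optimal threshold highlighted on the curve can be located with a simple scan over candidate thresholds. The sketch below is illustrative plain Python, assuming a row is predicted positive when its score is at or above the threshold; `pr_points` is a hypothetical helper name.

```python
def pr_points(scores, labels):
    """Precision, recall, and F1 at every distinct score threshold."""
    total_pos = sum(labels)
    rows = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        predicted_pos = sum(1 for s in scores if s >= t)
        precision = tp / predicted_pos
        recall = tp / total_pos
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        rows.append({"threshold": t, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows


# Toy validation data (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

best = max(pr_points(scores, labels), key=lambda r: r["f1"])
print(best)
```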

KS / Gain Curve¶

Plots cumulative True Positive Rate (Gain) and False Positive Rate against Population Depth (%).

Features:

Random baseline
KS statistic and marker
Tooltips with depth, TPR/FPR, threshold, and Youden’s J

Best for: Analyzing recall at top-k coverage and supporting ranking strategies.

Lift Chart¶

Shows Lift = Recall / Population Depth vs Population Depth (%).

Features:

Baseline lift = 1.0 (random expectation)
KS marker and hover tooltips

Best for: Quantifying how well true positives are concentrated in the top-scoring segment.

Metrics vs Threshold¶

Plots multiple metrics as functions of the decision threshold, including:

Precision, Recall (TPR), F1, Youden’s J, Accuracy, MCC, Jaccard.

Interactive Features:

Threshold slider with live Confusion Matrix
“Jump to KS” / “Jump to F1” shortcuts
Vertical markers at KS and F1 thresholds
Selectable metric lines

Best for: Exploring threshold-dependent performance trade-offs.
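
To make the listed metrics concrete, the following sketch derives them from the confusion matrix at a single threshold. It uses the standard textbook definitions and toy data; `threshold_metrics` is a hypothetical helper, not a FeatureByte function.

```python
import math


def threshold_metrics(scores, labels, threshold):
    """Confusion matrix and derived metrics for one decision threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "confusion": {"tp": tp, "fp": fp, "tn": tn, "fn": fn},
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "youden_j": recall - fpr,
        "accuracy": (tp + tn) / len(labels),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
        "jaccard": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Toy validation data (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

metrics = threshold_metrics(scores, labels, threshold=0.5)
print(metrics)
```

Evaluating this function over a grid of thresholds yields the metric lines that the slider moves along.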

Gain Report (Interactive Table)¶

A cumulative gain/lift table computed at fixed population depths.

Granularity: Deciles (10%), Ventiles (5%), or Percentiles (1%)

Columns Include:

Population Depth (%)
Score Range (score ≥ threshold)
Cumulative True Positives / Population
Gain (%) and Lift

Extras: CSV download for reporting.

Best for: Creating business-friendly targeting or prioritization reports.
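
A cumulative gain/lift table of this kind can be approximated with the sketch below. It is an illustration in plain Python on toy data, using Lift = Gain / Population Depth as defined for the Lift Chart; the function name `gain_report` and the dictionary keys are hypothetical.

```python
def gain_report(scores, labels, n_groups=10):
    """Cumulative gain and lift at fixed population depths (deciles by default)."""
    ranked = sorted(zip(scores, labels), key=lambda r: r[0], reverse=True)
    n = len(ranked)
    total_pos = sum(labels)
    report = []
    for g in range(1, n_groups + 1):
        top = ranked[: g * n // n_groups]  # top-scoring segment
        if not top:  # fewer rows than groups
            continue
        depth = len(top) / n
        cum_tp = sum(y for _, y in top)
        gain = cum_tp / total_pos  # recall within the segment
        report.append({
            "depth_pct": 100 * depth,
            "min_score": top[-1][0],  # segment is score >= min_score
            "cum_tp": cum_tp,
            "gain_pct": 100 * gain,
            "lift": gain / depth,
        })
    return report


# Toy validation data (made up for illustration).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

report = gain_report(scores, labels)
for row in report[:3]:
    print(row)
```

At 100% depth the lift is always 1.0, which matches the random baseline of the Lift Chart.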

Regression Evaluation¶

Interactive plots visualize prediction quality, calibration, and score distribution using validation data.

Available Plots¶

Score Distributions
Predicted vs Actual (sample)
Avg Predicted vs Actual per Score-Ranked Bin

Score Distributions (Regression)¶

Histogram showing the distribution of predicted values.

Controls: Adjustable bin slider (5–100), hover tooltips

Axes:

X — Predicted Value
Y — Frequency

Best for: Understanding distribution, scale, and skew in predictions.

Avg Predicted vs Actual per Score-Ranked Bin (Regression)¶

Plots:

Average Predicted Value
Actual Average Target Value

X-axis: Bin number (ranked by predicted score)

Best for: Visual calibration or bias checks across score ranges.

Predicted vs Actual (Sample)¶

Scatter plot of individual predictions vs actual target values.

Axes:

X — Predicted Score
Y — Actual Target Value

Controls: Adjustable point size and opacity

Best for: Identifying bias, nonlinearity, or heteroscedasticity in residuals.

Feature Importance¶

SHAP Feature Importance is available at:

The feature level
The feature key level (for dictionary features)

New Feature Lists can be derived by:

Retaining features with the highest importance
Generating new features from feature-key-level reports

Model Refit¶

Models can be retrained on new training data while retaining tuned hyperparameters from the original model.

Leaderboard¶

A Leaderboard ranks all models scored on the same validation or holdout observation table.

Predictions¶

Historical Holdout Predictions¶

You can compute predictions using holdout observation tables to evaluate model generalization and monitor predictive accuracy.

Optionally, the corresponding feature values and SHAP values can be retrieved alongside the predictions.

Model Deployment¶

A Model Deployment automatically triggers the deployment of its associated Feature List if it has not been deployed yet.

Before creating a model deployment, ensure that all features in the related feature list are labeled as "PRODUCTION_READY".

User Interface

Learn by example in the 'Deploy and Serve' tutorial.

Batch Predictions¶

Before generating batch predictions, make sure the associated deployment is enabled.

Batch predictions are computed from the Deployment and a source table or a managed view containing the key values of the primary entity from the associated use case.

The output is a Prediction Table, which includes the generated predictions and metadata describing how the table was created. Optionally, you can retrieve the corresponding feature values and SHAP values along with the predictions.

Note

The column containing entity values must use a valid serving name. If your source table uses a different column name, configure the mapping during the request.

You can optionally specify a point-in-time for prediction generation. If not provided, it will be automatically determined when the request is submitted.

Forecast Comparison¶

A Forecast Comparison is an interactive visualization that compares a model's predictions against actual target values across multiple forecast points in time, for a specific entity.

The plot displays:

Multiple prediction lines — one for each Point In Time when a prediction was made, showing the full forecast series over the horizon.
A single target/actual line — showing what actually happened.
Interactive filtering — to zoom into specific time ranges and hover for detailed data inspection.

Forecast Comparisons are generated from a Prediction Table that was computed on a FORECAST_SERIES observation table. This mode ensures that for each Point In Time, all forecast points within the horizon are included, producing complete prediction series that can be plotted as continuous lines.

To create a Forecast Comparison:

Score a model on a FORECAST_SERIES observation table to produce a Prediction Table.
Select an entity value (e.g., a specific store) to filter the visualization.
The system generates an interactive plot comparing predicted vs actual values.

User Interface

Learn by example with the 'Predict and Evaluate' tutorial.

Online Predictions¶

Coming Soon

Deployment¶

In FeatureByte, a Deployment manages the online and batch serving of a deployed FeatureList and the batch predictions of a deployed Model for specific Use Cases. Additionally, features can be exported as SQL code for integration into custom pipelines outside of FeatureByte.

Enabling and Disabling Deployments¶

A Deployment Object is initiated when a FeatureList is deemed ready for production deployment.

Upon creation, the Deployment can be enabled for online and batch serving, triggering the orchestration of feature materialization into the online store.

Deployments can be disabled at any time, ceasing the online and batch serving of the feature list without impacting the serving of historical requests. This approach is distinct from the 'log and wait' method used in some other feature stores.

Note

If the feature list is associated with multiple deployments (for different use cases), disabling one deployment will not affect the serving of other deployments.

SDK Reference

Refer to the Deployment main page or to the specific links:

Feature Job Status¶

The Deployment object provides reports on recent activities of scheduled feature jobs, including run history, success status, and durations.

In cases of failed or late jobs, it's advised to review data warehouse logs for insights, especially if the issue relates to compute capacity.

SDK Reference

How to get the feature job status for a feature list.

Deployment Catalog¶

Deployments are linked to specific Use Cases, and all related deployments can be listed and managed directly from the corresponding Use Case.

SDK Reference

Learn how to list deployments related to a Use Case

Within the catalog, deployments can be listed and retrieved either by name or by Object ID.

SDK Reference

How to:

list deployments available in a catalog,
get a deployment from a catalog,
and get a deployment by its Object ID from a catalog.

The Deployment object class provides methods to list and manage deployments across all catalogs.

SDK Reference

How to:

list() to list all deployments across catalogs.
get() to get a Deployment object by its name.
get_by_id() to get a Deployment object by its Object ID.

User Interface

To explore the UI/UX workflow, see the 'Deploy and Serve' tutorial.

Approval Flow¶

Enabling Approval Flow¶

FeatureByte Enterprise catalogs can incorporate an Approval Flow. When active, key actions require approval, such as:

Marking a feature as Production-Ready,
Changing a table's Cleaning Operations,
Changing a table's Default Feature Job Setting.

To check if Approval Flow is active, look for a validation mark next to the Catalog name.

If it's missing, click the settings icon near the Catalog name at the top of the screen to access and enable the Approval Flow option.

Feature Adjustments¶

When table metadata changes occur (e.g., new cleaning operations or updated feature job settings), they trigger new feature versions. This ensures compatibility with new data. Users can modify default actions for these features and analyze the impact of both original and updated operations.

Approval Flow Checks¶

Approval Flow involves several automated checks:

For Marking a Feature as Production-Ready:

Compliance with default cleaning operations and feature job setting of its source tables.
Table status assessment
Recent analysis of data availability and freshness.
Backtesting to avoid training-serving inconsistencies.

For Changes in Cleaning Operations:

Analysis of features with actions diverging from new operations.
Completion of this analysis changes request checks to green.
Emphasis on understanding impacts of both new and original operations.

For Changes in Feature Job Setting:

Recent analysis of data availability and freshness.
Backtesting of the new setting to prevent future training-serving inconsistencies.

Learning Through UI Tutorials¶

For a practical understanding of the approval flow, explore our UI tutorials:

'Deploy and serve a feature list' tutorial.
'Manage feature life cycle' tutorial.

Feature Store¶

The purpose of a Feature Store is to centralize pre-calculated values, which can significantly reduce the latency of feature serving during training and inference.

FeatureByte Feature Stores are designed to integrate seamlessly with data warehouses, eliminating the need for bulk outbound data transfers that can pose security risks. Furthermore, the computation of features is performed in the data warehouse, taking advantage of its scalability, stability, and efficiency.

Pre-calculated values for online and batch serving are stored in an online feature store.

Partial aggregations in the form of online and offline tiles are also stored to streamline feature materialization for historical requests and for online and batch serving. This approach enables computation to be performed incrementally on tiles rather than the entire time window, leading to more efficient resource utilization.

Once a feature is deployed, the FeatureByte service automatically initiates materialization of features and tiles, scheduled based on the feature job setting of the feature.

SDK Reference

Refer to the FeatureStore object main page or to the specific links:

Tiles¶

Tiles are a method of storing partial aggregations in the feature store, which helps to minimize the resources required to fulfill historical and online requests. There are two types of tiles managed by FeatureByte: offline tiles and online tiles.

When a feature has not yet been deployed, offline tiles are cached following a historical feature request to reduce the latency of subsequent requests. Once the feature has been deployed, offline tiles are computed and stored according to the feature job setting.

The tiling approach adopted by FeatureByte also significantly reduces storage requirements compared to storing offline features. This is because tiles are more sparse than features and can be shared by features that use the same input columns and aggregation functions.
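
The idea of partial aggregation can be illustrated with a small, self-contained sketch. This is not FeatureByte's internal implementation: the hourly tile size, the function names, and the toy events are all assumptions chosen to show how one set of tiles can serve several windows and several aggregation functions.

```python
from collections import defaultdict

TILE_SECONDS = 3600  # hypothetical tile size: one hour


def build_tiles(events):
    """Group (epoch_seconds, amount) events into tiles holding [sum, count]."""
    tiles = defaultdict(lambda: [0.0, 0])
    for ts, amount in events:
        tile = tiles[ts // TILE_SECONDS]
        tile[0] += amount
        tile[1] += 1
    return tiles


def window_aggregate(tiles, end_ts, window_seconds):
    """Combine whole tiles covering [end_ts - window_seconds, end_ts)."""
    first = (end_ts - window_seconds) // TILE_SECONDS
    last = end_ts // TILE_SECONDS  # exclusive
    total = sum(tiles[i][0] for i in range(first, last) if i in tiles)
    count = sum(tiles[i][1] for i in range(first, last) if i in tiles)
    return total, count


# Toy events (made up for illustration).
events = [(100, 10.0), (200, 5.0), (3700, 2.0), (7300, 4.0), (7400, 1.0)]
tiles = build_tiles(events)

# The same tiles answer a 2-hour and a 3-hour window, and yield both
# a sum feature and an average feature (sum / count).
print(window_aggregate(tiles, end_ts=10800, window_seconds=7200))
print(window_aggregate(tiles, end_ts=10800, window_seconds=10800))
```

Because only per-tile sums and counts are stored, a new request touches a handful of tiles instead of rescanning the raw events.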

Feature Jobs¶

Feature Job Background¶

FeatureByte is designed to work with data warehouses that receive regular data refreshes from operational sources, meaning that features may use data with varying freshness and availability. If these operational limitations are not considered, inconsistencies between offline requests and online and batch feature values may occur.

To prevent such inconsistencies, it is crucial to synchronize the frequency of batch feature computations with the frequency of source table refreshes and to compute features after the source table refresh is fully completed. In addition, for historical serving to accurately replicate the production environment, it is essential to use data that would have been available at the historical points-in-time, considering the present or future data latency. Latency of data refers to the time difference between the timestamp of an event and the timestamp at which the event data is accessible for ingestion. Any period during which data may be missing is referred to as a "blind spot".

To address these challenges, the feature job setting in FeatureByte captures information about the frequency of batch feature computations, the timing of the batch process, and the assumed blind spot for the data. This helps ensure consistency between offline and online feature values and accurate historical serving that reflects the conditions present in the production environment.

Feature Job¶

A Feature Job is a batch process that generates both offline and online tiles and feature values for a specific feature before storing them in the feature store. The scheduling of a Feature Job is determined by the feature job setting associated with the respective feature.

Feature job orchestration is initiated when a feature is deployed and continues until the feature deployment is disabled, ensuring the feature store consistently possesses the latest values for each feature.

Feature Job Setting¶

A Feature Job Setting defines how frequently and when feature computation jobs are scheduled to run. These settings ensure that features are computed consistently and without data leakage across different environments and teams.

CRON Job Settings¶

CRON job settings consist of four key parameters:

Crontab — Defines the cron schedule for the feature job, specifying when the job should run.
Time Zone — Determines the time zone in which the cron schedule operates.
Blind Spot — Specifies the period of time immediately preceding the job execution that should be excluded to avoid data leakage from late-arriving records.
Reference Time Zone — Defines the time zone used for calendar-based aggregation periods (e.g., daily, weekly, or monthly). This ensures consistent calendar boundaries across data sources in different time zones.
- For example, if a scheduled job runs at 2025/01/31 23:00 UTC and the reference time zone is Asia/Singapore, the corresponding calendar date is 2025/02/01. Therefore, the aggregation for the most recent full month would cover January.
- Typically, the reference time zone should be the westernmost time zone among those associated with the data’s timestamps, ensuring that each aggregation window fully includes all relevant observations.
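
The calendar example above can be checked with Python's standard `datetime` module. The sketch uses a fixed UTC+8 offset as a stand-in for Asia/Singapore (which observes no daylight saving time), and `most_recent_full_month` is a hypothetical helper written for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset used as a stand-in for the Asia/Singapore time zone.
SINGAPORE = timezone(timedelta(hours=8))


def most_recent_full_month(job_time_utc, reference_tz):
    """Return (year, month) of the last complete month in the reference time zone."""
    local = job_time_utc.astimezone(reference_tz)
    first_of_current = local.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    last_of_previous = first_of_current - timedelta(days=1)
    return last_of_previous.year, last_of_previous.month


job = datetime(2025, 1, 31, 23, 0, tzinfo=timezone.utc)

print(job.astimezone(SINGAPORE).date())             # calendar date in the reference time zone
print(most_recent_full_month(job, SINGAPORE))       # January 2025
print(most_recent_full_month(job, timezone.utc))    # December 2024 if UTC were the reference
```

The same job instant therefore maps to different aggregation periods depending on the reference time zone, which is why the parameter is set explicitly.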

Periodic Job Settings for Event Tables¶

For an Event Table, the default feature job setting can be automatically initialized through an analysis of the table data's availability and freshness. This configuration follows a Periodic Job Setting model defined by three key parameters:

Period: Specifies how often the batch process should run.
- Example: A period of 60m indicates that the feature job executes every 60 minutes.
Offset: Defines the time delay from the end of the period to when the feature job starts.
- Example: With period: 60m and offset: 130s, the feature job starts 2 minutes and 10 seconds after each hour—at 00:02:10, 01:02:10, 02:02:10, ..., 23:02:10.
Blind Spot: Specifies the period of time immediately preceding the job execution that should be excluded to avoid data leakage from late-arriving records.

This analysis depends on the presence of record creation timestamps in the source table, typically populated during data warehouse updates.
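
The interplay of period, offset, and blind spot described above can be sketched with plain `datetime` arithmetic. The 10-minute blind spot and the helper name `job_times` are assumptions made for this illustration; they are not defaults of the SDK.

```python
from datetime import datetime, timedelta


def job_times(day_start, period, offset, count):
    """Scheduled start times: each period boundary shifted by the offset."""
    return [day_start + k * period + offset for k in range(count)]


period = timedelta(minutes=60)
offset = timedelta(seconds=130)
blind_spot = timedelta(minutes=10)  # hypothetical value

times = job_times(datetime(2025, 1, 1), period, offset, count=3)
print([t.strftime("%H:%M:%S") for t in times])

# Data newer than (job time - blind spot) is excluded from the aggregation.
cutoff = times[1] - blind_spot
print("job at", times[1].time(), "aggregates events before", cutoff.time())
```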

Feature Versioning and Flexibility¶

When changes occur in the management of the source tables—such as updates impacting data availability or freshness—a new feature version can be created with updated feature job settings to maintain accuracy and consistency.

Alignment with Online and Historical Requests¶

Although Feature Jobs are primarily designed to handle online requests, these settings also support historical requests. This helps minimize inconsistencies between offline and online data processing.

Consistency Across Teams¶

To ensure consistent feature job settings across teams, a Default Feature Job Setting is defined at the table level. However, team members can override this default setting when declaring specific features, offering flexibility for unique requirements.

SDK Reference

How to declare a feature job setting.

In Feature Job Settings, the blind spot refers to the time gap between when a feature is computed and the timestamp of the most recent event included in that computation. Accounting for this gap is essential to maintain consistency between training and serving, ensuring that inference data is complete and aligned with real-world availability.

Understanding Data Latency and Blind Spots

Data latency represents the time elapsed from when an event occurs to when its data becomes available for use. In the context of data ingestion, a blind spot is any period where data might be missing due to ingestion delays. Specifically, in feature computation, the blind spot extends from:

The completion of data ingestion in the data warehouse
To the start of the feature computation job

Why Does the Blind Spot Matter?

The blind spot directly impacts the timeliness and relevance of features used at inference time. If the blind spot is too short, the model may rely on data that wouldn't be available in a production setting, leading to training-serving inconsistencies. Conversely, if it's too long, the model may work with stale data, potentially reducing predictive performance.

Default Feature Job Setting¶

The Default Feature Job Setting defines the baseline configuration used by features that aggregate data within a table. It ensures consistency in the Feature Job Setting across features developed by different team members.

Although it can be overridden during feature declaration, using a Default Feature Job Setting simplifies configuration and promotes alignment across teams.

Feature Job Settings are usually defined via a CRON configuration. For an Event Table, FeatureByte provides automated analysis of the table’s record creation timestamps to recommend appropriate parameter values.

Approval Flow for Default Feature Job Setting

In catalogs with Approval Flow enabled, any changes to table metadata—including the Default Feature Job Setting—trigger a review process. This process recommends new versions of features and feature lists linked to the affected tables, ensuring that models and deployments use versions incorporating the latest data corrections or enhancements.

SDK Reference

Learn how to:

Initialize the default feature job setting
Update the default feature job setting for Event Tables, Snapshots, and Time Series tables

User Interface

Explore the 'Manage Feature Life Cycle' tutorial to see this process in action via the UI.

Feature Job Setting Recommendations¶

FeatureByte automatically analyzes data availability and freshness of an event table to suggest an optimal setting for scheduling Feature Jobs and associated Blind Spot information.

This analysis relies on the availability of record creation timestamps in the source table, typically added when updating data in the warehouse. Additionally, the analysis focuses on a recent time window, such as the past four weeks.

FeatureByte estimates the data update frequency based on the distribution of time intervals among the sequence of record creation timestamps. It also assesses the timeliness of source table updates and identifies late jobs using an outlier detection algorithm. By default, the recommended scheduling time takes late jobs into account.
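
The exact estimation and outlier detection algorithms are not specified here. As a rough intuition only, the sketch below estimates the update period as the median gap between distinct record creation timestamps and flags loads that follow an unusually long gap. The median, the 1.5x factor, and the function name are all assumptions, not FeatureByte's method.

```python
from statistics import median


def estimate_update_period(load_times, late_factor=1.5):
    """Estimate the refresh period and flag late loads.

    load_times: record creation timestamps in epoch seconds.
    late_factor: hypothetical cutoff; a gap above late_factor * period is 'late'.
    """
    ordered = sorted(set(load_times))
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    period = median(gaps)
    late = [ordered[i + 1] for i, gap in enumerate(gaps) if gap > late_factor * period]
    return period, late


# Toy creation timestamps: hourly loads with one delayed load at 13000 s.
loads = [0, 3600, 7200, 13000, 14400, 18000]
period, late = estimate_update_period(loads)
print("estimated period (s):", period)
print("late loads:", late)
```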

To accommodate data that may not arrive on time during warehouse updates, a blind spot is proposed for determining the cutoff for feature aggregation windows, in addition to the scheduling frequency and time of the Feature Job. The suggested blind spot is the one whose percentage of late data is closest to the user-defined tolerance, which defaults to 0.005%.

To validate the Feature Job schedule and blind spot recommendations, a backtest is conducted. You can also backtest your custom settings.

SDK Reference

How to:

Feature Job Setting Backtest¶

A backtest in feature job settings evaluates the effectiveness of these settings with respect to the availability and freshness of data. This process involves calculating the proportion of new data that would have been missed in the computation of a feature if these settings had been used in previous calculations. Here, "new data" refers to data processed during the latest time frame that matches the job's frequency.

A percentage higher than 0 indicates potential future problems with training-serving consistency, as it implies that serving might utilize incomplete data.
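
The following sketch shows one simplified reading of this computation. It assumes each record carries an event timestamp and a record creation timestamp, treats the "new data" of a job as the events falling in the latest period before the blind spot cutoff, and counts a record as missed when it was created after the job started. The function name and data are hypothetical.

```python
from datetime import datetime, timedelta


def backtest_missed_pct(records, job_times, period, blind_spot):
    """Percentage of new rows not yet available when each job ran."""
    new_rows = missed = 0
    for job in job_times:
        window_end = job - blind_spot
        window_start = window_end - period
        for event_ts, created_ts in records:
            if window_start <= event_ts < window_end:
                new_rows += 1
                if created_ts > job:  # row landed in the warehouse too late
                    missed += 1
    return 100.0 * missed / new_rows if new_rows else 0.0


day = datetime(2025, 1, 1)
records = [  # (event timestamp, record creation timestamp), made up
    (day + timedelta(minutes=10), day + timedelta(minutes=40)),
    (day + timedelta(minutes=50), day + timedelta(minutes=60)),
    (day + timedelta(minutes=58), day + timedelta(minutes=90)),  # late arrival
    (day + timedelta(minutes=90), day + timedelta(minutes=120)),
]
period = timedelta(hours=1)
jobs = [day + k * period + timedelta(seconds=130) for k in (1, 2)]

no_blind_spot = backtest_missed_pct(records, jobs, period, timedelta(0))
with_blind_spot = backtest_missed_pct(records, jobs, period, timedelta(minutes=5))
print(no_blind_spot, with_blind_spot)
```

In this toy case, a 5-minute blind spot pushes the late record into the next job's window, by which time it has arrived, so the missed percentage drops to zero.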

Common reasons for backtest failures include:

Misalignment of Frequencies: The frequency at which feature jobs run should ideally be a multiple of the data warehouse's update frequency. This alignment ensures that each feature job incorporates the most recent data updates.
Premature Feature Job Start: Starting a feature job too early, before the data warehouse update is complete, can lead to incomplete data incorporation. To avoid this, set a larger offset after the completion of the data warehouse update, allowing enough time for all data to be processed.
Inadequate Data Latency Handling: Failing to account for an adequate blind spot period, the time necessary to cover data latency, can result in using incomplete data for serving. This blind spot should be long enough to ensure that all relevant data has been updated and is ready for use.
Data Warehouse Update Issues: Issues such as past failures or irregular updates in the data warehouse can also lead to backtest failures. If these issues are identified, it's important to assess whether they are likely to recur and to adjust settings or processes accordingly.

SDK Reference

How to backtest a custom feature job setting.

Training-Serving Inconsistency¶

Training-Serving Inconsistency (or Training-Serving Skew) is a difference between performance during training and performance during serving. This skew can be caused by:

A discrepancy between how you handle data in the training and serving pipelines.
A change in the data between when you train and when you serve.

This inconsistency can lead to unexpected and potentially erroneous predictions.

Data Ontology¶

FeatureByte’s Data Ontology is a structured framework that categorizes columns in a dataset based on their meaning and usage. It is organized as a hierarchical tree, where each semantic type represents a distinct data classification, equipped with specialized feature engineering practices. This structured approach enhances data understanding, ensures consistent processing, and optimizes feature transformation techniques.

Semantic Type¶

A semantic type defines the meaning, expected values, and appropriate feature engineering operations for a column in a table. By associating each column with a semantic type, the ontology enables standardized processing, ensuring that data is transformed, aggregated, and utilized effectively for analysis and machine learning.

Semantic Types Reference¶

The following table lists the primary semantic type categories. Each column in a table is assigned a semantic type that determines how it is processed during feature engineering. Some categories require a more specific subtype — see the Ontology Tree below for the full hierarchy.

Category	Assignable directly?	Feature engineering impact
`numeric`	No — use a subtype	High. Determines aggregation methods: `additive_numeric` allows SUM/AVG/MIN/MAX, `non_additive_numeric` allows AVG/MIN/MAX only (no SUM), `semi_additive_numeric` allows SUM/AVG/MIN/MAX at a point in time (e.g., balances, stock). Also includes `circular` (periodic numeric like day-of-week), `inter_event_time`, and `inter_event_distance`. Subtypes like `non_negative_amount`, `unit_price`, `ratio`, `percentage` further refine operations.
`categorical`	No — use a subtype	High. Drives unique count, cross-aggregation, and group-by features. `nominal_categorical` for unordered categories (e.g., `status`, `location`, `event_type`, `event_status`), `ordinal` for ranked values (e.g., ratings, severity levels), `cyclic_categorical` for repeating patterns (e.g., day of week, month of year).
`binary`	Yes	High. Enables percentage aggregation features (e.g., "% of events where flag is true"). Subtypes: `boolean` (true/false), `filter_field` (binary flags for event filtering).
`temporal_key`	No — auto-assigned	High. Automatically assigned during table registration — should not be set manually. Identifies time columns: `event_timestamp`, `scd_effective_timestamp`, `snapshot_date_time`, `time_series_date_time`, `calendar_date`. Determines how time-based aggregations are computed.
`coordinates`	No — use a subtype	Medium. Enables geospatial features: `latitude`/`longitude` pairs generate haversine distance and location band features. `local_latitude`/`local_longitude` allow AVG/MIN/MAX aggregation.
`date_time`	Yes	Medium. Date/timestamp columns used for age derivation and date-part extraction. Subtypes include `date_of_birth`, `birth_timestamp`, `start_date`, `end_date`, `termination_date`, `termination_timestamp`.
`unique_identifier`	Yes	Low. Primary/foreign keys used for table joins and entity resolution. Not directly used in feature generation.
`dictionary`	Yes	Medium. Key-value pair columns — ideation can unbundle keys and create per-key features.
`vector`	Yes	Medium. Numeric arrays including `embedding` — eligible for AVG/MAX aggregation and cosine similarity.
`list`	Yes	Low. Delimited value lists — categorical, text, or numeric.
`sequence`	Yes	Low. Ordered series — categorical, text, or numeric.
`converter`	Yes	Excluded. Blocked from feature generation (e.g., `fx_rate`).
`unit`	Yes	Low. Unit labels (e.g., `currency`, `length_unit`) — used for cross-aggregation with amounts.
`not_to_use`	Yes	Excluded. Columns blocked from all feature generation — sensitive data, operational keys, noisy data.
`non_informative`	Yes	Excluded. Constant-value columns — automatically detected and skipped.
`unknown`	Yes	Excluded. Unclassified — requires manual review before ideation can use the column.
`ambiguous_numeric`	Yes	Excluded. Blocked from aggregation — columns mixing different units or scales need manual resolution.
`ambiguous_categorical`	Yes	Excluded. Blocked from aggregation and lookup — columns lacking unique meaning without context.
`text`	No — use a subtype	Low. `special_text` (addresses, URLs, emails, names — addresses and names blocked as PII), `long_text` (reviews, descriptions), `numeric_with_unit` (measurements with units like "10 kg").
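
As an illustration of how the numeric rows of this table constrain aggregation, the sketch below encodes them as a lookup. It reflects one reading of the table (semi-additive values aggregated only at a point in time); the dictionary, the function name `can_aggregate`, and the `point_in_time` flag are hypothetical and not part of the SDK.

```python
# Allowed aggregations per numeric subtype, transcribed from the table above.
ALLOWED_AGGREGATIONS = {
    "additive_numeric": {"sum", "avg", "min", "max"},
    "non_additive_numeric": {"avg", "min", "max"},
    "semi_additive_numeric": {"sum", "avg", "min", "max"},
}


def can_aggregate(semantic_type, aggregation, point_in_time=False):
    """Check whether an aggregation is meaningful for a numeric semantic type.

    point_in_time: True when values are combined at a single moment
    (for example, balances across accounts on one date).
    """
    if aggregation not in ALLOWED_AGGREGATIONS.get(semantic_type, set()):
        return False
    # Assumption: semi-additive values are only aggregated at a point in time.
    if semantic_type == "semi_additive_numeric" and not point_in_time:
        return False
    return True


print(can_aggregate("additive_numeric", "sum"))
print(can_aggregate("non_additive_numeric", "sum"))
print(can_aggregate("semi_additive_numeric", "sum", point_in_time=True))
```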

Semantic Type Detection¶

Semantic types can be automatically detected or manually assigned at the table level. Additionally, during Ideation, they can be overwritten to refine feature engineering strategies based on evolving insights. See Semantic Detection API for how to run detection and apply semantic tags via the API.

Which Semantic Type Should You Focus On?

When working with different table types, pay close attention to specific semantic types, as they influence filtering strategies, data aggregation, and feature engineering choices.

In Event Tables, Snapshots Tables, and Time Series Tables, check the event_type (categorization of events based on their primary purpose or nature) and event_status (state, condition, or outcome of an event) semantic types. These columns will guide event-based filtering strategies.

In a Slowly Changing Dimension Table, check out the termination_timestamp and termination_date semantic types that indicate when an entity is actively terminated, sometimes prematurely. These columns determine how active entities are aggregated and when terminated entities should be analyzed.

For all tables, check out:

the non_additive_numeric semantic type (numeric values where direct addition is not meaningful). Understanding these columns prevents incorrect sum operations.
the automated non_informative semantic type (column with constant value). This may indicate problems in your data.
the not_to_use semantic type (sensitive, personal, operational, or non-reliable data that should not be used). This determines whether feature engineering should be applied to those columns.
the ambiguous_numeric (column that combines different units or scales) and ambiguous_categorical (column that does not provide unique information by itself) semantic types. These columns may require prior manual transformations before being used by feature engineering.

By carefully reviewing these semantic types, you can enhance feature selection and ensure high-quality transformations for machine learning.

Ontology Tree¶

FeatureByte’s ontology follows a hierarchical tree structure, where broader semantic types define general properties, and more specific types refine these properties for specialized use cases. Child nodes inherit feature engineering practices from their parent nodes, ensuring consistency while allowing for domain-specific adjustments.

Tree Key Concepts¶

Inheritance: Child nodes inherit feature engineering practices from their parent nodes.
Levels of Specificity: The Ontology is divided into levels, each providing a finer degree of specificity:
- Level 1: Basic generic semantic types.
- Level 2 & 3: More precise semantics for advanced feature engineering.
- Level 4: Domain-specific nodes.
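
Inheritance along the tree can be pictured as walking from a node up to the root and collecting practices on the way. In the sketch below, the parent links follow the numeric branch listed in this section, while the practice labels are placeholders invented for illustration and do not describe FeatureByte's actual feature engineering rules.

```python
# Parent links taken from the numeric branch of the ontology.
PARENT = {
    "non_additive_numeric": "numeric",
    "measurement_of_intensity": "non_additive_numeric",
    "temperature": "measurement_of_intensity",
}

# Placeholder practices per node (hypothetical labels).
PRACTICES = {
    "numeric": ["generic numeric handling"],
    "non_additive_numeric": ["no SUM aggregation"],
    "temperature": ["temperature-specific handling"],
}


def inherited_practices(node):
    """Collect practices from the root down to the given node."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)
    practices = []
    for ancestor in reversed(chain):  # root first, most specific last
        practices.extend(PRACTICES.get(ancestor, []))
    return practices


print(inherited_practices("temperature"))
```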

Semantic type: numeric¶

Description: Represents quantitative data that can be aggregated. Contains either integer or decimal values.

non_additive_numeric: Numeric variable where direct addition does not yield meaningful interpretation. Examples of non-additive numeric variables are speed, age or tenure, unit price, temperature, rating, percentage, rank, or order.
- measurement_of_intensity: Numeric values that represent the magnitude of a specific metric.
  - temperature: Numerical value indicating thermal levels, such as patient body temperature.
  - patient_temperature: Specific instance of temperature measurement for a patient.
  - patient_blood_pressure: Measurement capturing the arterial blood pressure of a patient.
  - sound_frequency: Number of vibrations or cycles per second of a sound wave, measured in Hertz (Hz).
  - unit_price: Cost of a single item or unit of measurement.
- time_dependent_monotonic_value: Numeric values that increase over time.
  - age: The length of time that an individual has lived or a thing has existed.
  - account_duration: The length of time an account has been active.
  - tenure: Duration of time that someone has been in a specific role or occupation.
- ratio: Represents a proportional relationship between two quantities, often maintaining a fixed relation.
- percentage: A way of expressing a number as a fraction of 100.
  - discount_percentage: The percentage reduction from the original price.
- statistics: A category that reflects mathematical characteristics derived from a dataset.
  - mean: The average value derived from a set of numbers.
- distance: Refers to a measure of space between two points. Values are positive and often encoded in units such as meters, kilometers, or miles.
- rank: Refers to the position or level of something within a hierarchy, indicating relative importance compared to others.
- order: Represents the arrangement or sequence of items according to particular criteria.
semi_additive_numeric: Numeric values where addition makes sense only within a specific point in time and not across time periods.
- point_in_time_value: Represents values that provide a snapshot of a person or organization's status at a specific moment.
  - snapshot_value: A value taken at a specific moment in time, useful for tracking changes.
  - balance: The amount of money available in a financial account at a given moment.
  - stock: The quantity of items, products, or supplies held in inventory.
  - occupancy: Number of units occupied (e.g., rooms, apartments, or beds) at a given time.
  - headcount: Number of individuals within a group, organization, or event.
  - facilities: Number of distinct facilities or locations, such as hospitals, schools, stores, or businesses.
  - capacity: Maximum number of occupants or items a facility or system can hold, such as beds in a hospital, seats in a stadium, or total volume in storage.
  - asset_valuation: Assessed or market value of assets at a specific point in time.
  - liability_amount: Total amount of liabilities or debts owed by an individual or organization.
- periodic_value: Represents values measured over fixed, regular intervals, reflecting metrics that reset each period without accumulating.
  - recurring_amount: Regular charges for ongoing services billed at fixed intervals or financial amounts that repeat over specific intervals.
  - periodic_cost: Costs incurred regularly at each time period.
  - recurring_budget: Budgets set for recurring intervals.
  - recurring_count: Counts or quantities that recur at regular intervals.
  - recurring_duration: Time durations that apply regularly over each period.
  - recurring_usage: Usage or consumption measured over each standard period.
- accrued_metric: Represents values that accumulate over time, reflecting growing totals.
  - cumulative_amount: Total amounts that accumulate over time without resetting.
  - cumulative_cost: Costs that accumulate over a period, showing the sum of expenses.
  - cumulative_budget: Budget amounts that accumulate over time, reflecting the total allocated.
  - cumulative_count: Total counts that add up over time.
  - cumulative_duration: Time durations that sum up over periods, representing accumulated usage or operation time.
  - cumulative_usage: Usage or consumption totals that accumulate over time.
- interval_metric: A metric that quantifies the difference between two measurements taken over distinct periods of time. This metric can be used to observe changes or trends within a specified interval.
additive_numeric: Numeric variable where direct addition provides meaningful interpretation, including addition of multiple observations over some time frame.
- unbounded_amount: Refers to a total monetary amount that can be either positive or negative.
  - unbounded_purchase_amount: Total amount spent on purchases, which can include refunds resulting in negative values.
  - unbounded_transaction_amount: Total monetary value of financial transactions, capable of reflecting both credits and debits.
  - unbounded_discount: Total discounts applied, allowing for both positive and negative values to account for additions or corrections.
- non_negative_amount: Refers to a total monetary amount that can only be zero or positive.
  - non_negative_purchase_amount: Total amount spent on purchases without the possibility of refunds or returns resulting in negative values.
  - non_negative_transaction_amount: Total monetary value of transactions that cannot reflect debts or credits that would turn the value negative.
  - non_negative_discount: Total value of discounts given, which can’t be adjusted negatively.
- non_positive_amount: Refers to a total monetary amount that can only be zero or negative.
  - non_positive_purchase_amount: Total amounts reflecting refunds or returns, which do not include new spending.
  - non_positive_transaction_amount: Sum of deductions or charges in financial transactions that do not account for incoming values.
  - non_positive_discount: Total adjustments reflecting reductions, but not increases in discount values.
- count: Refers to a specific or measurable number (count, quantity) of items.
- unbounded_time_delta: Refers to a time difference that can be either negative or positive.
- non_negative_time_delta: Refers to a time difference that can only be zero or positive.
- duration: Refers to a positive duration, often measured in units like seconds, minutes, or hours.
inter_event_distance: Numerical representation of the distance between two events, measured in physical space.
inter_event_time: Numerical representation of the time duration between two events.
- inter_event_moving_time: Time duration specifically representing periods of movement or travel between events.
circular: Numeric data that represent periodic intervals where the end connects back to the beginning.
- time_of_day: Represents various time segments within a day, such as morning, afternoon, evening, and night.
- day_of_year: Denotes the sequential day within the year, with January 1st as 1 and December 31st as 365 (or 366 in leap years).
- day_of_month: Represents the day within the month, encoded as an integer from 1 to 31.
- month_of_year: Represents the month within a year, encoded as an integer from 1 (January) to 12 (December).
- quarter_of_year: Indicates the quarter within a year, encoded as an integer from 1 (January-March) to 4 (October-December).
- day_of_week: Represents the day within the week, encoded as an integer from 1 (Monday) to 7 (Sunday).
- hour_of_day: Indicates the hour within the day, encoded as an integer from 0 (midnight) to 23 (11 PM).
- hour_of_week: Represents the hour within a week, from 0 (midnight on Monday) to 167 (11 PM on Sunday).
- direction: Represents directional headings (e.g., North, South, East, West) in degrees.
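
For circular types, the end of the range is adjacent to the beginning, so plain subtraction overstates the gap between, say, hour 23 and hour 0. A common way to handle this, shown here as a generic sketch rather than as FeatureByte's own method, is to measure distance around the cycle or to project the value onto the unit circle. The function names are hypothetical.

```python
import math
from typing import Tuple


def encode_circular(value: float, period: float) -> Tuple[float, float]:
    """Project a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circular_distance(a: float, b: float, period: float) -> float:
    """Shortest distance between two values on a cycle of the given period."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


# hour_of_day runs from 0 to 23, so the period is 24.
print(circular_distance(23, 0, 24))  # 1
print(circular_distance(6, 18, 24))  # 12
print(encode_circular(6, 24))
```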

Semantic type: binary¶

Description: A special case of categorical where the column represents a binary flag with exactly two distinct categories.

boolean: Variable which represents a binary flag with values of true/false or yes/no.
- binary_numeric: Numeric representation of binary values, often as 0 or 1.
- binary_logical: Logical representation of binary states, usually as true/false.
- physical_presence_indicator: Physical flag that indicates whether an event was performed physically rather than online.
  - is_in_store_transaction: Indicates if a transaction was conducted in a physical store.
  - is_in_person_event: Indicates if an event occurred in person.
filter_field: Binary flags used for filtering purposes.
- is_positive: Indicates if a value is positive.
- is_moving: Indicates if an object or subject is in motion.

Semantic type: categorical¶

Description: Contains values that represent discrete groups and categories. These values can be short text, codes, or numeric.

nominal_categorical: Categorical variables in which the categories do not have a meaningful order or ranking.
- demographic_attribute: Includes a variety of attributes related to personal identity, social status, and professional roles.
  - gender: Represents gender identity of a person, often including values like 'female', 'male', 'non-binary', etc.
  - person_title: Honorific that may denote gender, marital status, or professional standing, e.g., Mr, Mrs, Dr, Prof.
  - job_title: Titles or designations within an organizational structure, such as 'Manager', 'Director', 'Engineer'.
- event_type: Categorization of events, grouping them into broad categories based on their primary purpose or nature.
- context: Surrounding conditions or setting in which events occur.
- status: Represents the status of a record, e.g., user account status (active, suspended), order status (pending, shipped, delivered), task status (started, completed), etc.
  - event_status: State, condition, or outcome of an event.
- location: Represents any codified location information like zip codes, area codes, city, country, state codes, etc.
  - zip_code: Postal code for a specific geographic area.
  - area_code: Phone prefix designating a specific geographic region.
  - county_and_state: Combination of county and state, e.g., 'Fairfax County, Virginia' or 'Orange, CA'.
  - city_and_state: Combination of city and state, e.g., 'Los Angeles, CA' or 'Austin, Texas'.
  - state: Variable representing state, e.g., 'Texas' or 'CA'.
  - country: Variable representing country, e.g., 'USA' or 'France'.
- code: Symbolic or numeric codes utilized across various domains, excluding location codes.
  - barcode: Machine-readable representation of information.
  - icd_10_cm: International Classification of Diseases, 10th Revision, Clinical Modification coding for diseases.
  - cpt_treatment_code: Current Procedural Terminology codes for medical treatment procedures.
  - ndc_drug_code: National Drug Codes for medications.
  - isbn: International Standard Book Number for books.
  - issn: International Standard Serial Number for periodicals.
  - status_code: Codes representing status, e.g., HTTP status codes.
  - reason_code: Codes that explain causes or reasons within various contexts.
  - mcc_code: Merchant Category Codes used in financial transactions.
ordinal: Represents categories that have a clear, distinct order or rank.
- rating: Levels of quality or satisfaction, such as 'poor', 'average', 'good'.
- severity_level: Levels representing severity, such as 'low', 'medium', 'high'.
- brackets: Ranges that categorize items into specific limits, such as income brackets.
  - distance_buckets: Groups distances into specified intervals.
  - age_group: Divides ages into ranges.
cyclic_categorical: Categorical values in a cyclic or repeating order.
- categorical_month_of_year: Categorical representation of months within a year.
- categorical_quarter_of_year: Categorical representation of quarters within a year.
- categorical_day_of_week: Categorical representation of days within a week.
- categorical_hour_of_week: Categorical representation of hours within a week.
- categorical_direction: Categorical representation of directions, such as cardinal points (N, NE, E, SE, S, SW, W, NW).

Semantic type: date_time¶

Description: Encompasses temporal data types ranging from broad scales (years) to precise measurements (timestamps).

timestamp_field: Precise point in time, typically including date and time components.
- start_timestamp: Timestamp marking the beginning of an event, project, or activity.
- end_timestamp: Scheduled conclusion of an event or activity as a timestamp.
- termination_timestamp: Timestamp marking the active termination of an event or process.
- birth_timestamp: Date and time of birth of a person as a timestamp.
date_field: Dates without time information.
- start_date: Date signaling the beginning of an event, project, or activity.
- end_date: Scheduled conclusion date of an event or activity.
- termination_date: Date of active termination of an event or process.
- date_of_birth: Date of birth of a person.
year: Represents a calendar year typically as a four-digit integer (e.g., 2024).
- year_of_birth: Year of birth of a person.
year_quarter: Specifies a quarter within a year, including both the year and the quarter (e.g., 2024-Q1).
year_month: Represents a specific month in a specific year (e.g., 2024-05).
epoch: Specific point in time as the number of seconds (or milliseconds) elapsed since the Unix epoch (January 1, 1970, at 00:00:00 UTC).
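
Because an epoch column may hold seconds or milliseconds, the unit has to be known before the value can be read as a timestamp. The helper below is a minimal sketch using only the Python standard library; `epoch_to_datetime` and its `unit` parameter are hypothetical names for this example.

```python
from datetime import datetime, timezone


def epoch_to_datetime(value: float, unit: str = "s") -> datetime:
    """Convert an epoch value in seconds ("s") or milliseconds ("ms") to a UTC datetime."""
    if unit == "ms":
        value = value / 1000.0
    elif unit != "s":
        raise ValueError(f"unsupported unit: {unit}")
    return datetime.fromtimestamp(value, tz=timezone.utc)


print(epoch_to_datetime(0).isoformat())  # 1970-01-01T00:00:00+00:00
print(epoch_to_datetime(1_700_000_000_000, unit="ms").isoformat())  # 2023-11-14T22:13:20+00:00
```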

Semantic type: text¶

Description: Contains free-form strings of varying length and complexity.

special_text: Represents more or less structured information like addresses, URLs, emails, phone numbers, names, time zones, software codes, etc.
- street_address: Specifies the location of a property on a street, without specifying the city or town.
- address: Uniquely identifies the location of a property with information on the street, the city, and the country.
  - billing_address: Represents an address associated with an individual's or organization's method of payment, such as a credit card or bank account.
  - shipping_address: Represents an address where a customer requests goods or products to be delivered.
- url: An internet URL that specifies the address of a resource on the web.
- email: An email address of a person or an entity used for electronic communication.
- organization_name: The name of a company or an organization, used for identifying corporate entities.
- software_code: A set of instructions written in a specific programming language that can be executed by a computer to perform a defined task or set of tasks.
long_text: Represents descriptive, unstructured data like reviews, descriptions, posts, tweets, etc.
- review: Represents a written evaluation or assessment of a product, movie, service, etc.
- description: Represents any general description, for example, a product description.
- resume: A document that summarizes a person's work experience, education, and skills.
- event_record: Contains details of events, such as logs or records from specific occurrences.
- twitter: A short post or message on the social media platform Twitter.
numeric_with_unit: Represents any measurement with units, like length with inches, time with hours, weight with kilograms, volume with liters, area with square meters, speed with meters per second, and temperature with Celsius.
- amount_with_currency: Represents a monetary amount associated with a specific currency.
- length_with_unit: Represents a length measurement specified with a unit, such as meters or inches.
- time_with_unit: Represents a time duration associated with a specific unit, like hours, minutes, or seconds.
- weight_with_unit: Represents a weight measurement specified with a unit, such as kilograms or pounds.
- volume_with_unit: Represents a volume measurement specified with a unit, such as liters or gallons.
- area_with_unit: Represents an area measurement specified with a unit, such as square meters or square feet.
- speed_with_unit: Represents a speed measurement specified with a unit, such as kilometers per hour or miles per hour.
- temperature_with_unit: Represents a temperature measurement specified with a unit, such as Celsius or Fahrenheit.
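
A numeric_with_unit column stores the number and its unit in a single string, so it usually has to be split before the number can be used. The sketch below assumes a simple "number then unit" layout such as '12.5 kg'; the pattern and the helper name `split_value_and_unit` are hypothetical, and real data may need a more permissive parser.

```python
import re
from typing import Optional, Tuple

# Optional sign, a number, optional whitespace, then a unit label.
_VALUE_WITH_UNIT = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*([A-Za-z°/%]+)\s*$")


def split_value_and_unit(text: str) -> Optional[Tuple[float, str]]:
    """Split a string like '12.5 kg' into (12.5, 'kg'); return None if it does not match."""
    match = _VALUE_WITH_UNIT.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


print(split_value_and_unit("12.5 kg"))  # (12.5, 'kg')
print(split_value_and_unit("70km/h"))   # (70.0, 'km/h')
print(split_value_and_unit("unknown"))  # None
```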

Semantic type: coordinates¶

Description: Represents geographical coordinates used for identifying locations on Earth.

longitude: Represents the longitude value on Earth's surface, with values between -180 and 180 degrees.
- local_longitude: Non-global, zone-specific longitude values allowing for approximations in distance or centroid calculations.
  - local_longitude_of_moving_object: The longitude value specific to a moving object, expressed within a localized zone.
  - local_longitude_of_car: The longitude value specific to a moving car, within a localized zone.
- longitude_of_moving_object: Specifies the longitude of an object in motion.
latitude: Represents the latitude value on Earth's surface, with values between -90 and 90 degrees.
- local_latitude: Non-global, zone-specific latitude values allowing for approximations in distance or centroid calculations.
  - local_latitude_of_moving_object: The latitude value specific to a moving object, expressed within a localized zone.
  - local_latitude_of_car: The latitude value specific to a moving car, within a localized zone.
- latitude_of_moving_object: Specifies the latitude of an object in motion.
latitude_in_degrees_minutes_and_seconds: Represents latitude expressed in degrees, minutes, and seconds (DMS) format.
longitude_in_degrees_minutes_and_seconds: Represents longitude expressed in degrees, minutes, and seconds (DMS) format.
latitude_longitude: Combines latitude and longitude values, representing a location.
longitude_latitude: Combines longitude and latitude values, representing a location.
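
Coordinates in degrees, minutes, and seconds are typically converted to decimal degrees before any distance calculation. The conversion is standard arithmetic; the function name `dms_to_decimal` and the use of a hemisphere letter (N, S, E, W) as input are assumptions made for this sketch.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert a DMS coordinate to signed decimal degrees.

    South and West are returned as negative values.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere}")
    value = degrees + minutes / 60.0 + seconds / 3600.0
    # Latitude is bounded by 90 degrees, longitude by 180 degrees.
    limit = 90.0 if hemisphere in ("N", "S") else 180.0
    if value > limit:
        raise ValueError(f"coordinate out of range: {value}")
    return -value if hemisphere in ("S", "W") else value


print(round(dms_to_decimal(40, 26, 46, "N"), 6))  # 40.446111
print(round(dms_to_decimal(79, 58, 56, "W"), 6))  # -79.982222
```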

Semantic type: sequence¶

Description: Represents an ordered series of items, such as categories, text, or numbers.

categorical_sequence: An ordered series of categorical values.
text_sequence: An ordered series of textual elements.
numeric_sequence: An ordered series of numerical values.

Semantic type: list¶

Description: Contains a series of values, which can be categories, text, or numerical, separated by a comma or other delimiter.

categorical_list: A list of categorical values.
text_list: A list of textual elements.
numeric_list: A list of numerical values.

Semantic type: dictionary¶

Description: Represents a collection of key-value pairs, where keys are unique identifiers.

dictionary_of_unbounded_values: A dictionary where values are unbounded and can take any form.
dictionary_of_non_negative_values: A dictionary where values are non-negative numbers.
- dictionary_of_count: A dictionary specifically used to count occurrences of items, where values are count numbers.
dictionary_of_non_positive_values: A dictionary where values are non-positive numbers.
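
To make the dictionary_of_count type concrete, the sketch below builds one from a list of categories using the Python standard library. The sample categories and the helper `most_frequent_key` are invented for illustration.

```python
from collections import Counter
from typing import Dict, Optional

# Invented sample: product categories seen in a customer's recent purchases.
purchases = ["fruit", "dairy", "fruit", "bakery", "fruit"]

# Keys are categories, values are occurrence counts (always non-negative).
counts: Dict[str, int] = dict(Counter(purchases))


def most_frequent_key(count_dict: Dict[str, int]) -> Optional[str]:
    """Return the key with the highest count, or None for an empty dictionary."""
    if not count_dict:
        return None
    return max(count_dict, key=count_dict.get)


print(counts)                     # {'fruit': 3, 'dairy': 1, 'bakery': 1}
print(most_frequent_key(counts))  # fruit
```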

Semantic type: vector¶

Description: Represents a mathematical vector, an array of numbers used to measure direction and magnitude.

embedding: A dense vector representation of a piece of data, often used in machine learning for features like words or images.

Semantic type: converter¶

Description: Represents a value used to transform one unit or format into another, including but not limited to:

fx_rate: A foreign exchange rate used to convert from one currency to another.
- billing_fx_rate: Refers to foreign exchange rates in financial transactions concerning billing and invoicing in international trade.
- billing_fx_inverse_rate: Refers to the inverse of the billing foreign exchange rate, used to convert back from the target currency to the source currency.
time_zone: Represents a geographical region where the same standard time is used.
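
As an example of how a converter such as fx_rate is applied, the sketch below multiplies an amount by a rate and shows that the inverse rate converts it back. The function names are hypothetical, and the convention that the rate is expressed as "target units per source unit" is an assumption; a real table may use the opposite convention.

```python
def convert(amount: float, fx_rate: float) -> float:
    """Convert an amount using a rate expressed as target units per source unit."""
    if fx_rate <= 0:
        raise ValueError("fx_rate must be positive")
    return amount * fx_rate


def inverse_rate(fx_rate: float) -> float:
    """Return the rate that converts back from the target to the source currency."""
    if fx_rate <= 0:
        raise ValueError("fx_rate must be positive")
    return 1.0 / fx_rate


rate = 0.8  # invented rate for illustration
converted = convert(100.0, rate)
restored = convert(converted, inverse_rate(rate))
print(converted)
print(restored)
```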

Semantic type: unit¶

Description: Represents types of units used to quantify specific properties.

currency: A unit of money.
length_unit: A unit used to measure length, such as meters or miles.
time_unit: A unit used to measure time, such as seconds or hours.
weight_unit: A unit used to measure weight, such as kilograms or pounds.
volume_unit: A unit used to measure volume, such as liters or gallons.
area_unit: A unit used to measure area, such as square meters or acres.
speed_unit: A unit used to measure speed, such as meters per second or miles per hour.
temperature_unit: A unit used to measure temperature, such as Celsius or Fahrenheit.

Semantic type: temporal_key¶

Description: Identifiers that represent specific points or periods in time, commonly used to track the timing and duration of events or records in a database.

event_timestamp: The timestamp column in an Event table, recording the exact time a specific event occurred.
scd_effective_timestamp: The timestamp column in a Slowly Changing Dimension (SCD) table, indicating when the record becomes active or effective.
scd_end_timestamp: The timestamp column in a Slowly Changing Dimension (SCD) table, indicating when the record becomes inactive or outdated.
iot_sensor_timestamp: The timestamp captured from an IoT sensor, indicating the precise time the sensor data was collected.
snapshot_date_time: A column representing the temporal reference in a snapshots dataset. It can capture various time granularities, such as year, year-month, date, or date-time.
time_series_date_time: A column representing the temporal reference in a time series dataset. It can capture various time granularities, such as year, year-month, date, or date-time.
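
The two SCD temporal keys together define the interval during which a record is active. The sketch below looks up the active record for a natural key at a given point in time. The sample rows, the column names, and the conventions that the end timestamp is exclusive and that a missing end timestamp means "still active" are all assumptions made for this example.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional

# Invented SCD rows: one customer who moved from Paris to Lyon.
records: List[Dict[str, Any]] = [
    {"customer_id": "C1", "city": "Paris",
     "effective": datetime(2023, 1, 1), "end": datetime(2023, 6, 1)},
    {"customer_id": "C1", "city": "Lyon",
     "effective": datetime(2023, 6, 1), "end": None},
]


def active_record(rows: List[Dict[str, Any]], natural_key: str,
                  point_in_time: datetime) -> Optional[Dict[str, Any]]:
    """Return the row active for the natural key at the given time, if any."""
    for row in rows:
        if row["customer_id"] != natural_key:
            continue
        started = row["effective"] <= point_in_time
        not_ended = row["end"] is None or point_in_time < row["end"]
        if started and not_ended:
            return row
    return None


print(active_record(records, "C1", datetime(2023, 3, 15))["city"])  # Paris
print(active_record(records, "C1", datetime(2023, 6, 1))["city"])   # Lyon
print(active_record(records, "C1", datetime(2022, 12, 31)))         # None
```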

Semantic type: unique_identifier¶

Description: (UID) A string of characters, numbers, or symbols used to uniquely identify an entity within a system or context. These identifiers ensure that every item, event, or entity can be distinctly recognized and referenced within a database or data structure.

event_id: The primary key in an Event table, uniquely identifying each event recorded in the system.
item_id: The primary key in an Item table, containing detailed information about specific items or transactions.
series_id: Uniquely identifies each series within a table containing multiple series, enabling clear distinction and tracking of individual entities, such as products or categories.
dimension_id: The primary key in a Dimension table, uniquely identifying each dimension entry in the database.
scd_surrogate_key_id: The unique identifier assigned to each record in a Slowly Changing Dimension table, providing a stable identifier as the table evolves over time.
scd_natural_key_id: The key in a Slowly Changing Dimension table that remains static over time, uniquely identifying each active row at any given point. Also known as an alternate key.
foreign_key_id: A column in one table that references the primary key in another table, establishing a relationship between the two tables.

Semantic type: ambiguous_numeric¶

Description: Numeric columns where values can represent different units or scales, potentially leading to misinterpretation without clarification.

mixed_unit_numeric: Numeric variables that can represent measurements in various units.
- mixed_currency_amount: Monetary values in different currencies.
- mixed_unit_length: Length measurements in different units (e.g., meters, feet, miles).
- mixed_unit_time: Time measurements in different units (e.g., seconds, minutes, hours).
- mixed_unit_weight: Weight measurements in different units (e.g., grams, pounds, kilograms).
- mixed_unit_volume: Volume measurements in different units (e.g., liters, gallons).
- mixed_unit_area: Area measurements in different units (e.g., square meters, square feet).
- mixed_unit_speed: Speed measurements in different units (e.g., kilometers per hour, miles per hour).
- mixed_unit_temperature: Temperature measurements in different units (e.g., Celsius, Fahrenheit).
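
A mixed-unit column generally needs to be converted to a single unit before it can be aggregated. The sketch below normalizes lengths to meters, assuming the unit is available alongside each value; the `TO_METERS` table and the `to_meters` helper are hypothetical names for this example.

```python
from typing import Dict, List, Tuple

# Conversion factors to meters (the foot and mile factors are exact by definition).
TO_METERS: Dict[str, float] = {
    "m": 1.0,
    "km": 1000.0,
    "ft": 0.3048,
    "mi": 1609.344,
}


def to_meters(value: float, unit: str) -> float:
    """Convert a length to meters; raise ValueError for an unknown unit."""
    if unit not in TO_METERS:
        raise ValueError(f"unknown length unit: {unit}")
    return value * TO_METERS[unit]


# Invented mixed-unit observations.
observations: List[Tuple[float, str]] = [(2.0, "km"), (10.0, "ft"), (500.0, "m")]
normalized = [to_meters(value, unit) for value, unit in observations]
print(normalized)
```

Once every value shares the same unit, the column no longer combines different scales and can be treated like any other numeric column.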

Semantic type: ambiguous_categorical¶

Description: A categorical column that does not provide unique information by itself within a given context. These values require additional features or data to clarify their meaning, as they can lead to misinterpretation without context.

ambiguous_nominal_categorical: A categorical column that does not provide unique information by itself within a given context. These values require additional features or data to clarify their meaning, as they can lead to misinterpretation without context.
- ambiguous_location
  - city_name: Represents a city in any country, potentially leading to ambiguity without further geographical details.
  - county_name: Represents counties (e.g., Jackson County) in any country, which can be ambiguous without additional regional information.

Semantic type: not_to_use¶

Description: Contains sensitive, personal, operational, or non-reliable data that should not be used in analysis to protect privacy or data integrity.

operational_key: Keys used for internal system operations rather than data analysis.
- scd_current_flag: A column in a Slowly Changing Dimension (SCD) table used to indicate the current version of the record.
- record_creation_timestamp: The timestamp indicating when a particular record was created in the data warehouse, often auto-generated upon record creation.
- row_id: Unique identifier assigned to each row, primarily for the system to efficiently index, reference, and retrieve records.
personal_identifiable_information: Information that can uniquely identify an individual.
- name: Contains individuals' personal names, which may include first names, last names, middle names, given names, etc.
  - person_name: The name of a person, or any component of the name.
  - given_name: The given name of a person.
  - middle_name: A middle name or middle initial, often the first letter of the middle name.
  - surname: The last name of a person.
- phone_number: A string formatted as a phone number from any country.
confidential_information: Information that is sensitive and should be protected from unauthorized access.
noisy_data: Data that is too erratic or random, providing no meaningful insight and often obscuring useful data.

Semantic type: non_informative¶

Description: A column in which the value remains constant, providing no variance or useful information for analysis purposes.
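
A constant column can be spotted by counting its distinct values. The check below is a minimal sketch and not FeatureByte's detection logic; the helper name `is_non_informative` and the choice to ignore missing values (`None`) are assumptions.

```python
from typing import Any, Iterable


def is_non_informative(values: Iterable[Any]) -> bool:
    """Return True when the column holds at most one distinct non-missing value."""
    distinct = {value for value in values if value is not None}
    return len(distinct) <= 1


print(is_non_informative(["EUR", "EUR", "EUR"]))  # True
print(is_non_informative(["EUR", "USD", "EUR"]))  # False
print(is_non_informative(["EUR", None, "EUR"]))   # True
```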

Semantic type: unknown¶

Description: Semantic type that could not be identified.