8. Discover and Create Features with FeatureByte Copilot

FeatureByte offers two primary methods for feature creation: automatic creation with FeatureByte Copilot and manual creation with the SDK.

In this tutorial, we'll focus on automatic feature creation using FeatureByte Copilot.

Note

If you want to learn how to manually create features, please consult our SDK tutorials.

Step 1: Select Your Use Case

Navigate to Feature Ideation from the Create section of the menu.

Choose the use case "Customer Activity Next Week before a purchase". If you could not create the target in the SDK, choose "Customer Activity Next Week before a purchase (using the descriptive target)" instead.

Step 2: Run Semantics Detection

Click 'Run Semantics Detection' to let FeatureByte identify and tag relevant tables and data columns.

Check the suggested tags and adjust any columns that are not tagged correctly.

Note

The accuracy of suggestions improves with more detailed data descriptions. Refer to our tutorial "Add descriptions and Tag Semantics" to update table and column descriptions.

Step 3: Initiate Feature Ideation

Start the Feature Ideation process by clicking 'Start Feature Ideation'.

Step 4: Evaluate Feature Engineering Strategy

Review Copilot’s data aggregation and filtering suggestions.

First, validate the temporal windows that should be used for the window-based aggregations. These windows will also set offsets for features generated from SCD tables.

Optionally, choose an appropriate Observation Table for improved data evaluation:

  • For the use case "Customer Activity Next Week before a purchase," consider using the "Pre_Purchase_Customer_Activity_next_week_2023_10K" table.
  • For the use case "Customer Activity Next Week before a purchase (using the descriptive target)," consider using the "Pre_Purchase_Customer_Activity_next_week_2023_10K_manual_version" table.

Next, validate the key numeric aggregation column for each table. This column is used to create "bucketed" features, in which the aggregation is broken down by a categorical column such as the Product Group or the Weekday. In our case, TotalCost is proposed for the items table "INVOICEITEMS". For example, one generated feature is the Customer Sum of TotalCost per Product Group during the past 2 weeks. This feature can be converted into a sparse matrix before modeling, or FeatureByte can transform it to evaluate aspects such as the diversity of the customer's basket, the stability of their purchase behavior, their similarity to larger groups, and the product groups with the most spending.
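To make the idea of a "bucketed" feature concrete, here is a minimal plain-Python sketch of a sum of TotalCost per Product Group over a trailing 2-week window. It does not use the FeatureByte SDK; the rows in `ITEMS`, the customer IDs, and the helper `sum_by_group` are hypothetical and only illustrate the shape of the result (one value per category).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical invoice items: (customer, timestamp, product group, total cost).
ITEMS = [
    ("c1", datetime(2023, 5, 1), "Colas", 4.0),
    ("c1", datetime(2023, 5, 3), "Bread", 2.5),
    ("c1", datetime(2023, 5, 10), "Colas", 6.0),
    ("c1", datetime(2023, 4, 1), "Wine", 20.0),  # outside the 2-week window
    ("c2", datetime(2023, 5, 9), "Bread", 3.0),  # different customer
]


def sum_by_group(items, customer, point_in_time, window=timedelta(weeks=2)):
    """Sum TotalCost per product group for one customer over a trailing window."""
    start = point_in_time - window
    totals = defaultdict(float)
    for cust, ts, group, cost in items:
        # Keep only this customer's rows inside [start, point_in_time).
        if cust == customer and start <= ts < point_in_time:
            totals[group] += cost
    return dict(totals)


feature = sum_by_group(ITEMS, "c1", datetime(2023, 5, 14))
print(feature)
```

The result is a dictionary keyed by product group, which is why such a feature can either be expanded into sparse columns or summarized further.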

For our dataset, no event types or event statuses have been found.

Note

Not all use cases require filtering. In some scenarios, like the one in this example, filtering might be unnecessary.

Filtering becomes relevant when specific types of columns, such as "event type" and "event status," are present. For instance, in Credit Card Transactions, you might encounter columns indicating transaction type (e.g., "Purchase", "Cash Advance", "Reversal", ...) and transaction status (e.g., "Authorized", "Rejected", "Cancelled"). In such cases, Generative AI will recommend filters that are pertinent to the specific use case.
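As an illustration of what such a filter does, the sketch below restricts an aggregation to one event type and one event status. The transactions and the helper `filtered_sum` are hypothetical and written in plain Python; they are not FeatureByte output.

```python
# Hypothetical credit card transactions.
TRANSACTIONS = [
    {"amount": 50.0, "type": "Purchase", "status": "Authorized"},
    {"amount": 200.0, "type": "Cash Advance", "status": "Authorized"},
    {"amount": 30.0, "type": "Purchase", "status": "Rejected"},
    {"amount": 20.0, "type": "Purchase", "status": "Authorized"},
]


def filtered_sum(rows, event_type, event_status):
    """Sum amounts only for rows matching the given event type and status."""
    return sum(
        row["amount"]
        for row in rows
        if row["type"] == event_type and row["status"] == event_status
    )


authorized_purchases = filtered_sum(TRANSACTIONS, "Purchase", "Authorized")
print(authorized_purchases)
```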

Step 5: Get Automated Feature Suggestions

Click 'Start' to initiate the feature search.

Explore the suggested features, with newly proposed ones marked as "New" in the "Readiness" column. Features already present in the Feature Catalog will display their readiness status as either "Draft," "Public Draft," or "Production Ready."

Features are sorted by default based on two criteria: semantic relevance scores and complexity.

Semantic relevance scores, derived from Generative AI, evaluate each feature's significance within the specified use case. This metric complements statistical relevance by ensuring features not only demonstrate high correlation with the target variable but also carry contextual meaning.

Moreover, high semantic relevance scores coupled with low statistical correlation may indicate potential data quality issues or the limitations of solely relying on statistical relevance without considering interactions with other features.

Additionally, each feature is categorized by its primary entity, primary table, and signal type, which streamlines feature exploration.

To deepen the analysis, we will next compute predictive scores for features on demand.

Step 6: Review Specific Feature Details

CUSTOMER_Count_of_invoices_4w

Let's explore the specifics of a simple feature:

'About' Tab

This tab offers a concise description of the feature alongside an evaluation of its relevance to the use case by Generative AI.

'SDK' Tab

Here, you'll find the SDK code snippet to generate this feature. Feel free to download and customize the code by selecting the feature and clicking "generate notebook".

'EDA' Tab

Access an on-demand Exploratory Data Analysis (EDA) in this section. Ensure you've set a default EDA table for the use case in the "07 Create Observation Tables" section.

The EDA outcomes provide statistical insights on the feature, including its distribution and correlation with the target. A predictive score summarizes these findings: it is quantified as R² in regression scenarios and as 2 × (AUC − 0.5) in classification contexts. A score of 1 indicates perfect correlation with the target, while a score of 0 signifies no correlation.
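The two score definitions can be sketched in plain Python as follows. This is only an illustration of the formulas quoted above: how FeatureByte produces the predictions or ranking scores that feed them is not described here, so the inputs and the function names `auc`, `classification_score`, and `r_squared` are assumptions.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_score(labels, scores):
    """Rescale AUC from [0.5, 1] to [0, 1]: 2 x (AUC - 0.5)."""
    return 2 * (auc(labels, scores) - 0.5)


def r_squared(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return 1 - ss_res / ss_tot


# Perfectly separated classes give a score of 1; constant scores give 0.
print(classification_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
print(classification_score([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))
```

The rescaling means a feature that ranks customers no better than chance (AUC of 0.5) scores 0 rather than 0.5, which puts classification and regression scores on a comparable 0-to-1 scale.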

Click on the graph to open a pop-up window for detailed analysis.

CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w

To find this feature, open the filter bar.

Select 'bucketing' as a signal type.

You can also search for "CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w".

Let's run EDA for this feature.

Graphs are provided for each Product Group. You can select the desired Product Group in the "Feature Key" dropdown menu. The Product Group with the highest importance, as determined by an XGBoost model predicting customer activity the next week, is listed first.

Notably, expenditure on colas and sodas emerges as the most significant!

CUSTOMER_Entropy_of_items_TotalCosts_by_product_ProductGroup_4w

Let's explore a feature derived from "CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w". This feature measures the diversity of the Customer's basket.

The feature can be found by selecting 'diversity' as the signal type.
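To see why entropy captures basket diversity, consider the following minimal sketch. It computes Shannon entropy (natural log) over the share of spending in each product group. The exact formula FeatureByte applies (log base, normalization, handling of missing values) is not stated here, so treat `basket_entropy` and the sample baskets as hypothetical.

```python
import math


def basket_entropy(spend_by_group):
    """Shannon entropy of spending shares across product groups."""
    total = sum(spend_by_group.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for amount in spend_by_group.values():
        if amount > 0:
            share = amount / total
            entropy -= share * math.log(share)
    return entropy


# A basket concentrated in one group has zero entropy;
# spending spread evenly across groups has higher entropy.
print(basket_entropy({"Colas": 10.0}))
print(basket_entropy({"Colas": 5.0, "Bread": 5.0}))
```

A customer who buys from many product groups in similar proportions gets a high value, while a customer who always buys the same group gets a value near 0, independently of how often they shop.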

Let's run the EDA for the feature.

Although the feature's predictive power may not be as strong as "CUSTOMER_Count_of_invoices_4w," it should be valuable in a feature list because it captures a different signal than frequency. This assumption is confirmed by the Generative AI assessment.

Before adding the feature to the Feature Catalog, clear the previous selection by clicking on the red cross.

You can now select the feature and add it to the Feature Catalog by clicking on 'Save Features / Feature List'.

Run EDA for 100 features

Let's clear the search.

Clear the filter and collapse the filter bar.

Once done, the list should no longer be filtered.

We will now select 100 features by checking the check box next to 'Name'.

Run EDA for all selected features by clicking on 'Run EDA Analysis'. This should take approximately 5 minutes.

Add 50 features based on their predictive power

Sort the results by the features' predictive scores.

Clear the previous selection by clicking on the red cross.

Change the number of features per page to 50.

Select the top 50 features by checking the check box next to 'Name'.
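The sort-then-select steps above amount to a simple top-N ranking. A minimal sketch, assuming the predictive scores have been exported into a dictionary (the score values below are hypothetical, not EDA results, and `top_features` is an illustrative helper):

```python
def top_features(scores, n=50):
    """Return the n feature names with the highest predictive scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Hypothetical predictive scores for three features from this tutorial.
SCORES = {
    "CUSTOMER_Count_of_invoices_4w": 0.40,
    "CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w": 0.30,
    "CUSTOMER_Entropy_of_items_TotalCosts_by_product_ProductGroup_4w": 0.20,
}

selected = top_features(SCORES, n=2)
print(selected)
```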

Let's add all these features to the Feature Catalog.
