8. Discover and Create Features with FeatureByte Copilot

FeatureByte offers two primary methods for feature creation:

Manually: Utilize the SDK declarative framework
Automatically: Employ FeatureByte Copilot

In this tutorial, we'll focus on automatic feature creation using FeatureByte Copilot.

Note

If you want to learn how to manually create features, please consult our SDK tutorials.

Step 1: Select Your Use Case

Navigate to Feature Ideation from the Create section of the menu.

Choose the use case "Customer Activity Next Week before a purchase". If you were unable to create the target in the SDK, choose "Customer Activity Next Week before a purchase (using the descriptive target)" instead.

Step 2: Run Semantics Detection

Click 'Run Semantics Detection' to let FeatureByte identify and tag relevant tables and data columns.

Check the suggested tags and adjust any columns that are not tagged correctly.

Note

The accuracy of the suggestions improves with more detailed data descriptions. Refer to our tutorial "Add descriptions and Tag Semantics" to update table and column descriptions.

Step 3: Initiate Feature Ideation

Start the Feature Ideation process by clicking the corresponding button.

Step 4: Evaluate Feature Engineering Strategy

Review the configuration suggested by Copilot.

Start by validating the temporal windows. These define the duration for window-based aggregations and also determine offsets for features derived from Slowly Changing Dimension (SCD) tables.

Next, consider selecting an appropriate Observation Table to enhance data evaluation:

For the use case "Customer Activity Next Week before a purchase," use the "Pre_Purchase_Customer_Activity_next_week_2023_10K" table.
For the use case "Customer Activity Next Week before a purchase (using the descriptive target)," use the "Pre_Purchase_Customer_Activity_next_week_2023_10K_manual_version" table.

Optionally, incorporate User Defined Functions in your analysis. The application of these functions will be covered in the final section of this tutorial series.

Then, assess the frequency of events for each table. This will determine the date part to extract and the types of features to engineer. Categorize event frequency as:

Rare Events: Infrequent and unpredictable occurrences.
Seasonal Regular Events: Regular occurrences tied to specific seasons or periods, happening no more than once per season.
Frequent Events: Common occurrences, happening several times a week but at irregular intervals.
Very Frequent Events: Events happening multiple times daily.
Telemetry Events: High-frequency, regular measurement events.

For our Grocery use case, Copilot suggested Frequent Events.

Afterwards, verify the key numeric aggregation column for each table. This column, combined with categorical data like Product Group or Weekday, enables analysis of value distribution across different categories to create "bucketed" features.

For the 'INVOICEITEMS' table, 'TotalCost' is recommended as the key column. A resulting feature might be the Customer's Sum of TotalCost per Product Group over the past 2 weeks, reflecting the monetary value of their basket. This feature may be used to form other features to provide insights into purchase patterns, basket diversity, and spending habits within product groups. Additionally, this feature can be converted into a sparse matrix for modeling purposes.
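To make the idea of a "bucketed" feature concrete, here is a minimal plain-Python sketch. It is not FeatureByte SDK code: the `bucketed_sum` helper, the tuple layout of the rows, the sample purchases, and the half-open window boundary are all assumptions made for illustration. Only the TotalCost and Product Group concepts come from the tutorial.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bucketed_sum(items, point_in_time, window=timedelta(weeks=2)):
    """Sum TotalCost per product group for rows inside the window.

    `items` is an iterable of (timestamp, product_group, total_cost) tuples
    for one customer. A row is kept when
    point_in_time - window <= timestamp < point_in_time (assumed boundary).
    """
    start = point_in_time - window
    totals = defaultdict(float)
    for timestamp, product_group, total_cost in items:
        if start <= timestamp < point_in_time:
            totals[product_group] += total_cost
    return dict(totals)


# Hypothetical purchases for a single customer.
items = [
    (datetime(2023, 3, 1), "Colas and Sodas", 4.5),
    (datetime(2023, 3, 5), "Bread", 2.0),
    (datetime(2023, 3, 9), "Colas and Sodas", 3.0),
    (datetime(2023, 1, 15), "Cheese", 6.0),  # older than 2 weeks, ignored
]

feature = bucketed_sum(items, point_in_time=datetime(2023, 3, 10))
print(feature)
```

The result is a dictionary keyed by product group, which is the shape that lets one feature describe the whole basket and be expanded into sparse columns later.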

Finally, although Copilot does not recommend specific filters for our dataset due to the lack of event types or statuses, its filtering capabilities are still accessible for exploring potential features. In the final section of this tutorial series, we will employ these filters to unearth features that concentrate on specific subsets of the customer's basket.

Note

Filtering is crucial when columns like "event type" and "event status" exist, such as in Credit Card Transactions with identifiers for transaction types (e.g., "Purchase", "Cash Advance", "Reversal") and statuses (e.g., "Authorized", "Rejected", "Cancelled"). In such scenarios, Copilot will recommend relevant filters for the specific use case.

Step 5: Get Automated Feature Suggestions

Click the Start button to initiate the feature search.

Explore the suggested features, with newly proposed ones marked as "New" in the "Readiness" column. Features already present in the Feature Catalog will display their readiness status as either "Draft," "Public Draft," or "Production Ready."

Features are sorted by default based on two criteria: semantic relevance scores and complexity.

Semantic relevance scores, derived from Generative AI, evaluate each feature's significance within the specified use case. This metric complements statistical relevance by ensuring features not only demonstrate high correlation with the target variable but also carry contextual meaning.

Additionally, each feature is systematically categorized by its primary entity, primary table, and signal type, streamlining feature exploration.

To further enhance this analysis, we'll delve into computing predictive scores for each feature on demand.

Step 6: Review Specific Feature Details

CUSTOMER_Count_of_invoices_4w

Let's explore the specifics of a simple feature, CUSTOMER_Count_of_invoices_4w. Use the search function to find it.
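As a rough illustration of what a 4-week count feature measures, the sketch below counts one customer's invoices in the 4 weeks before a point in time. It is plain Python rather than the SDK code shown in the 'SDK' tab. The `count_invoices` helper, the sample timestamps, and the half-open window boundary are assumptions.

```python
from datetime import datetime, timedelta


def count_invoices(invoice_timestamps, point_in_time, window=timedelta(weeks=4)):
    """Count invoices with point_in_time - window <= timestamp < point_in_time."""
    start = point_in_time - window
    return sum(1 for ts in invoice_timestamps if start <= ts < point_in_time)


# Hypothetical invoice timestamps for one customer.
invoices = [
    datetime(2023, 5, 2),   # more than 4 weeks before the point in time
    datetime(2023, 5, 20),  # inside the window
    datetime(2023, 6, 1),   # inside the window
    datetime(2023, 6, 10),  # after the point in time, so excluded
]

n_invoices_4w = count_invoices(invoices, point_in_time=datetime(2023, 6, 5))
print(n_invoices_4w)
```

Excluding rows at or after the point in time keeps information from the future out of the feature value.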

'About' Tab

This tab offers a concise description of the feature alongside an evaluation of its relevance to the use case by Generative AI.

'SDK' Tab

Here, you'll find the SDK code snippet to generate this feature. Feel free to download and customize the code by selecting the feature and clicking "generate notebook". We will give an example in the final section of this tutorial series.

'EDA' Tab

Access an on-demand Exploratory Data Analysis (EDA) in this section. Ensure you've set a default EDA table for the use case in the "07 Create Observation Tables" section.

The EDA results provide statistical insights on the feature, including its distribution and its correlation with the target. A predictive score summarizes these findings: it is reported as R² for regression and as 2 × (AUC − 0.5) for classification. A score of 1 indicates perfect correlation with the target, while a score of 0 signifies no correlation.
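The two score formulas can be sketched in plain Python as follows. This is only an illustration of the arithmetic: how FeatureByte turns a single feature into predictions before scoring is not described here, so the sketch scores raw values directly, and the helper names and sample data are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def classification_score(labels, scores):
    """Rescale AUC so that 0.5 maps to 0 and 1.0 maps to 1."""
    return 2.0 * (auc(labels, scores) - 0.5)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Hypothetical labels and feature values.
labels = [0, 0, 1, 1]
feature_values = [0.1, 0.4, 0.35, 0.8]
print(classification_score(labels, feature_values))
```

The rescaling exists because an AUC of 0.5 corresponds to random ranking, so subtracting 0.5 and doubling puts classification scores on the same 0-to-1 scale as R².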

Click on the graph to open a pop-up window for detailed analysis.

CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w

First, clear the prior search.

You can now access CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w by opening the filter bar.

Select 'bucketing' as a signal type.

Let's run EDA for CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w.

Graphs are provided for each Product Group. You can select the desired Product Group in the "Feature Key" dropdown menu. The Product Group with the highest importance, as determined by an XGBoost model predicting customer activity the next week, is listed first.

Notably, expenditure on colas and sodas emerges as the most significant!

CUSTOMER_Entropy_of_items_TotalCosts_by_product_ProductGroup_4w

Let's explore a feature derived from "CUSTOMER_items_TotalCosts_by_product_ProductGroup_4w". This feature measures the diversity of the Customer's basket.
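To see why entropy works as a diversity measure, here is a small plain-Python sketch that computes Shannon entropy over a bucketed feature. The `entropy` helper, the use of the natural logarithm, and the sample baskets are assumptions for illustration and may differ from what FeatureByte computes internally.

```python
import math


def entropy(bucketed):
    """Shannon entropy (natural log) of the share held by each positive bucket."""
    total = sum(v for v in bucketed.values() if v > 0)
    if total == 0:
        return 0.0
    result = 0.0
    for value in bucketed.values():
        if value > 0:
            share = value / total
            result -= share * math.log(share)
    return result


# Hypothetical baskets: spend concentrated in one group versus spread evenly.
concentrated = {"Colas and Sodas": 10.0}
spread = {"Colas and Sodas": 5.0, "Bread": 5.0}

print(entropy(concentrated))
print(entropy(spread))
```

A basket concentrated in one product group has zero entropy, and the value grows as spending is spread more evenly across more groups.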

CUSTOMER_Entropy_of_items_TotalCosts_by_product_ProductGroup_4w can be found by selecting 'diversity' as a signal type.

Let's run the EDA for CUSTOMER_Entropy_of_items_TotalCosts_by_product_ProductGroup_4w.

Although this feature's predictive power may not be as strong as that of "CUSTOMER_Count_of_invoices_4w", it should still be valuable in a feature list because it captures a different signal than frequency. The Generative AI assessment supports this assumption.

Before adding the feature to the Feature Catalog, clear any previous selection you may have made.

You can now select the feature and add it to the Feature Catalog by clicking Save Feature.

Run EDA for features with high semantic relevance

Clear any prior search. Then clear the previous selection in the filter bar and select Relevance Scores 9 and 10.

Select 100 features by checking the check box next to 'Name'.

Go to the second page, if there is one.

Select the remaining features by checking the check box next to 'Name'.

Run EDA for all selected features by clicking on 'Run EDA Analysis'. This should take less than 3 minutes.

Add 50 features based on their predictive powers

Clear the filter and collapse the filter bar.

Clear the previous selection.

Go back to the first page.

Once done, you should see an unfiltered list.

Sort the results by the features' predictive scores.

Change the number of features per page to 50.

Select the top 50 features by checking the check box next to 'Name'.
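The sort-then-select steps above amount to a top-k selection by predictive score. A minimal plain-Python sketch, assuming the scores have been exported into a dictionary (the `top_features` helper, the feature names, and the score values are hypothetical):

```python
def top_features(scores, k=50):
    """Return the k feature names with the highest predictive score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical predictive scores keyed by placeholder feature names.
scores = {"feature_a": 0.10, "feature_b": 0.40, "feature_c": 0.25}

selected = top_features(scores, k=2)
print(selected)
```

When fewer than `k` features are available, the helper simply returns all of them in ranked order.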

Let's add all these features to the Feature Catalog.
