7b. Create Development Dataset
What is a Development Dataset?
A Development Dataset is a collection of source tables that serve as substitutes for production source tables. It is used during feature ideation to accelerate exploratory data analysis (EDA) and feature selection. Development datasets are especially valuable when the original tables are extremely large, and only a subset of the data is needed for analysis.
How to create a Development Dataset?
You can create a Development Dataset in two ways:
- Manually: by mapping production tables to smaller, existing development tables.
- Automatically: from the EDA observation table of a Use Case, combined with a feature lookback to ensure sufficient history for feature aggregation.
This guide explains how to create a Development Dataset from a Use Case's EDA Observation Table. We will create it from 50K applications, the EDA observation table created earlier for the Loan Default by Client use case.
Step 1: Navigate to Development Dataset Catalog
From the menu, go to the Formulate section and select the Development Dataset catalog.
Step 2: Create Development Dataset from an Observation Table
- Click the button to create a new Development Dataset.
- Select the Create New Tables tab and set the Use Case as Loan Default by Client.
- Review the suggested settings. In this example, we choose to set the Feature Lookback to 25 months.
- Click the button to confirm. This creates a Development Dataset with the status DRAFT.
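To illustrate why the Feature Lookback matters, the sketch below estimates how far back source data would need to reach so that features have enough history. It assumes the lookback is counted back from the earliest observation timestamp; the function names and the sample timestamps are hypothetical and are not part of the product's API.

```python
import calendar
from datetime import datetime


def subtract_months(ts: datetime, months: int) -> datetime:
    """Shift ts back by a number of calendar months, clamping the day if needed."""
    total = ts.year * 12 + (ts.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day to the last valid day of the target month (e.g. 31 -> 28).
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)


def history_window_start(observation_times, lookback_months):
    """Earliest timestamp needed, assuming lookback starts at the earliest observation."""
    return subtract_months(min(observation_times), lookback_months)


# Hypothetical observation timestamps, not taken from the tutorial data.
observation_times = [datetime(2024, 3, 31), datetime(2024, 6, 15)]
window_start = history_window_start(observation_times, lookback_months=25)
print(window_start.date())
```

A longer lookback pulls in more history per sampled entity, which increases the size of the resulting development tables.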
What is Entity Selection in Settings?
Entity Selection is automatically suggested based on your use case. The selection defines, per table, the analysis level of features that may be generated during feature ideation. In most cases, the entity of the use case is recommended, or one of its parent entities if the entity cannot be joined to the table.
You can extend the selection to any eligible parent entities. This may result in additional features being generated, including similarity features. In this example, no additional parent entities can be selected.
Step 3: Review the SQL to Create Distinct IDs Tables
- Click on the Development Dataset and navigate to the SQL Plan tab.
- Review the SQL that generates the Distinct IDs tables.
- These tables are used to materialize samples of the source tables.
- The complexity of the SQL script may vary depending on your data model and the entity selection.
- After review, click the button next to the Draft status. This will compute the Sample-to-Full ratio for each table, helping you decide whether to create new development tables.
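The idea behind Distinct IDs tables and the Sample-to-Full ratio can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratio is taken here as sampled rows divided by total rows, and the column name `client_id`, the helper functions, and the toy data are all hypothetical. The SQL generated by the platform may differ substantially.

```python
def distinct_ids(observation_rows, id_column):
    """Collect the distinct entity IDs that appear in the observation table."""
    return {row[id_column] for row in observation_rows}


def sample_table(source_rows, id_column, ids):
    """Keep only the source rows whose entity ID is in the distinct ID set."""
    return [row for row in source_rows if row[id_column] in ids]


def sample_to_full_ratio(sample_rows, source_rows):
    """Assumed definition: number of sampled rows over number of source rows."""
    return len(sample_rows) / len(source_rows) if source_rows else 0.0


# Toy data: three observations covering two distinct clients.
observation_rows = [{"client_id": 1}, {"client_id": 3}, {"client_id": 1}]
source_rows = [{"client_id": c, "amount": 100 * c} for c in (1, 1, 2, 3, 4, 4, 4, 5)]

ids = distinct_ids(observation_rows, "client_id")
sample = sample_table(source_rows, "client_id", ids)
ratio = sample_to_full_ratio(sample, source_rows)
print(sorted(ids), len(sample), ratio)
```

A low ratio means the development table would be much smaller than its source, which is when materializing a separate sample is most worthwhile.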
Step 4: Review the Plan to Materialize Development Tables
- Navigate to the Distinct IDs Tables tab and review the tables and their lineage.
- Go to the Development Tables tab and review the plan. In this case, all materializations are disabled by default because the expected Sample-to-Full ratio for all tables is greater than 5%.
- Navigate to the Settings tab, set Max Sample-to-Full Ratio to 0.15 (15%), and click the button to activate materialization for all tables.
- Return to the Development Tables tab and review the updated plan.
- Go back to the SQL Plan tab to review the updated SQL script.
- After review, click the button next to the Entity Sampling status.
- Confirm the materialization settings.
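The effect of raising Max Sample-to-Full Ratio from 5% to 15% can be pictured as a simple threshold check per table. The sketch below assumes a table is materialized when its expected ratio is at or below the maximum; whether the platform uses an inclusive comparison is not stated in this guide, and the table names and ratios are made up for illustration.

```python
def plan_materialization(expected_ratios, max_ratio):
    """Map each table to True when its expected ratio is within the allowed maximum."""
    return {table: ratio <= max_ratio for table, ratio in expected_ratios.items()}


# Hypothetical expected ratios, all between 5% and 15%.
expected_ratios = {"table_a": 0.08, "table_b": 0.12}

plan_default = plan_materialization(expected_ratios, max_ratio=0.05)
plan_raised = plan_materialization(expected_ratios, max_ratio=0.15)

print(plan_default)  # every table disabled at the 5% threshold
print(plan_raised)   # every table enabled at the 15% threshold
```

With these assumed ratios, no table qualifies at 5% and every table qualifies at 15%, mirroring the behavior described in the steps above.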
Step 5: Review the Development Dataset
- Navigate to the Development Tables tab and review the development tables.
- Click on one development table to view more details.
You’ve successfully created a Development Dataset, and it’s ready to use for feature ideation!