2. Register Tables
Registering tables in the catalog¶
Our catalog has been created, and we can start registering tables in it.
First of all, let's activate our catalog.¶
We will repeat this command in the following notebooks.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Credit Default Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)
16:37:06 | WARNING | Service endpoint is inaccessible: http://featurebyte-server:8088/
16:37:06 | INFO | Using profile: tutorial
16:37:07 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
16:37:07 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
16:37:07 | INFO | SDK version: 2.1.0.dev113
16:37:07 | INFO | No catalog activated.
16:37:07 | INFO | Catalog activated: Credit Default Dataset SDK Tutorial
Get data source¶
To get source tables from the data warehouse, we first need the data source that our catalog has access to.
The data source represents the collection of tables in our database, and we can use it to explore the database schema:
ds = catalog.get_data_source()
Here we see we have access to a number of databases. For these tutorials we will use the one called 'DEMO_DATASETS' and the 'CREDIT_DEFAULT_TUTORIAL' schema under it.
ds.list_databases()
['DEMO_DATASETS', 'TUTORIAL']
database_name = 'DEMO_DATASETS'
ds.list_schemas(database_name=database_name)
['CREDITCARD', 'CREDIT_DEFAULT_TUTORIAL', 'GROCERY', 'HEALTHCARE', 'INFORMATION_SCHEMA']
schema_name = 'CREDIT_DEFAULT_TUTORIAL'
ds.list_source_tables(database_name=database_name, schema_name=schema_name)
['CASH_INSTALLMENTS', 'CASH_LOAN_STATUS', 'CONSUMER_INSTALLMENTS', 'CONSUMER_LOAN_STATUS', 'CREDIT_DEFAULT_FULL_OBSERVATIONS', 'CREDIT_DEFAULT_SAMPLE_OBSERVATIONS', 'NEW_APPLICATION', 'PRIOR_APPLICATIONS']
Get source tables¶
Now that we have identified the database and schema we want to work with, it is time to get the source tables and register them in the catalog.
We'll utilize the following four source tables of our Credit Default Dataset:
Table | Description |
---|---|
NEW_APPLICATION | Records new loan applications. |
PRIOR_APPLICATIONS | Contains data on prior loan applications and the final decision. |
CONSUMER_LOAN_STATUS | Tracks consumer loans status. |
CONSUMER_INSTALLMENTS | Logs monthly installments for consumer loans at the time of payment. |
Two source tables are left for you to explore.
Table | Description |
---|---|
CASH_LOAN_STATUS | Tracks cash loans status. |
CASH_INSTALLMENTS | Logs monthly installments for cash loans at the time of payment. |
application_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="NEW_APPLICATION"
)
prior_applications_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="PRIOR_APPLICATIONS"
)
consumer_loans_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="CONSUMER_LOAN_STATUS"
)
consumer_installments_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="CONSUMER_INSTALLMENTS"
)
Explore Source Tables¶
You can obtain descriptive statistics, preview a selection of rows, or collect additional information on their columns.
# Obtain descriptive statistics
prior_applications_source_table.describe()
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)
APPLICATION_ID | CLIENT_ID | CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_GOODS_VALUE | CONTRACT_STATUS | DECISION_DATE | TIMEZONE_OFFSET | CODE_REJECT_REASON | GOODS_CATEGORY | CNT_PAYMENT | YIELD_GROUP | PRODUCT_COMBINATION | available_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dtype | INT | INT | VARCHAR | FLOAT | FLOAT | FLOAT | FLOAT | VARCHAR | TIMESTAMP | VARCHAR | VARCHAR | VARCHAR | FLOAT | VARCHAR | VARCHAR | TIMESTAMP |
unique | 1670214 | 338857 | 4 | 357959 | 93885 | 86803 | 93885 | 4 | 1506737 | 3 | 9 | 28 | 49 | 5 | 17 | 267740 |
%missing | 0.0 | 0.0 | 0.0 | 22.2867 | 0.0 | 0.0001 | 23.0818 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22.2864 | 0.0 | 0.0207 | 0.0 |
%empty | NaN | NaN | 0 | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 0 | 0 | NaN | 0 | 0 | NaN |
entropy | NaN | NaN | 0.972882 | NaN | NaN | NaN | NaN | 0.981049 | NaN | 1.098611 | 0.743124 | 1.568896 | NaN | 1.507147 | 2.48288 | NaN |
top | 1727222 | 187868 | Cash loans | NaN | NaN | NaN | NaN | Approved | NaN | -07:00 | XAP | XNA | NaN | XNA | Cash | NaN |
freq | 1.0 | 77.0 | 747553.0 | NaN | NaN | NaN | NaN | 1036781.0 | NaN | 557756.0 | 1353093.0 | 950809.0 | NaN | 517215.0 | 285990.0 | NaN |
mean | 1923089.135331 | 278357.174099 | NaN | 15955.120659 | 175233.86036 | 196114.021218 | 227847.279283 | NaN | NaN | NaN | NaN | NaN | 16.054082 | NaN | NaN | NaN |
std | 532597.958696 | 102814.823849 | NaN | 14782.137335 | 292779.762386 | 318574.616547 | 315396.557937 | NaN | NaN | NaN | NaN | NaN | 14.567288 | NaN | NaN | NaN |
min | 1000001 | 100001 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 2010-09-01T17:11:16.000000000 | NaN | NaN | NaN | 0.0 | NaN | NaN | 2010-09-02T06:59:00.000000000 |
25% | 1461857.25 | 189329.0 | NaN | 6321.78 | 18720.0 | 24160.5 | 50841.0 | NaN | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | NaN |
50% | 1923110.5 | 278714.5 | NaN | 11250.0 | 71046.0 | 80541.0 | 112320.0 | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN |
75% | 2384279.75 | 367514.0 | NaN | 20658.42 | 180360.0 | 216418.5 | 234000.0 | NaN | NaN | NaN | NaN | NaN | 24.0 | NaN | NaN | NaN |
max | 2845382 | 456255 | NaN | 418058.145 | 6905160.0 | 6905160.0 | 6905160.0 | NaN | 2024-08-27T00:26:16.000000000 | NaN | NaN | NaN | 84.0 | NaN | NaN | 2024-08-28T06:01:00.000000000 |
# Preview a selection of rows
prior_applications_source_table.preview()
APPLICATION_ID | CLIENT_ID | CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_GOODS_VALUE | CONTRACT_STATUS | DECISION_DATE | TIMEZONE_OFFSET | CODE_REJECT_REASON | GOODS_CATEGORY | CNT_PAYMENT | YIELD_GROUP | PRODUCT_COMBINATION | available_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2409570 | 106763 | Consumer loans | 6231.960 | 61605.0 | 55444.5 | 61605.0 | Approved | 2019-10-03 18:24:04 | -05:00 | XAP | Consumer Electronics | 10.0 | low_normal | POS household without interest | 2019-10-04 06:02:00 |
1 | 2050716 | 244529 | Consumer loans | 6682.320 | 43155.0 | 47425.5 | 43155.0 | Approved | 2024-01-27 18:13:28 | -07:00 | XAP | Mobile | 10.0 | high | POS household with interest | 2024-01-28 06:40:00 |
2 | 2716297 | 174445 | Consumer loans | 4304.565 | 40500.0 | 39564.0 | 40500.0 | Approved | 2017-05-01 16:36:35 | -07:00 | XAP | Furniture | 12.0 | high | POS industry with interest | 2017-05-02 06:43:00 |
3 | 2633072 | 334346 | Consumer loans | 13354.785 | 165955.5 | 187312.5 | 165955.5 | Approved | 2017-04-29 21:11:39 | -06:00 | XAP | Audio/Video | 18.0 | middle | POS household with interest | 2017-04-30 06:28:00 |
4 | 2680906 | 411586 | Cash loans | 13072.185 | 90000.0 | 110146.5 | 90000.0 | Refused | 2017-04-17 08:37:23 | -05:00 | HC | XNA | 12.0 | high | Cash Street: high | 2017-04-18 06:00:00 |
5 | 1232927 | 454931 | Cash loans | 20317.590 | 270000.0 | 291919.5 | 270000.0 | Approved | 2022-01-24 11:54:40 | -05:00 | XAP | XNA | 18.0 | low_normal | Cash X-Sell: low | 2022-01-25 06:47:00 |
6 | 2800956 | 366231 | Consumer loans | 17939.250 | 153000.0 | 153000.0 | 153000.0 | Approved | 2022-05-28 15:31:15 | -07:00 | XAP | Clothing and Accessories | 10.0 | middle | POS industry with interest | 2022-05-29 06:58:00 |
7 | 1702309 | 211842 | Consumer loans | NaN | 52380.0 | 52380.0 | 52380.0 | Unused offer | 2018-12-23 19:15:59 | -06:00 | CLIENT | Mobile | NaN | XNA | POS mobile with interest | 2018-12-24 06:26:00 |
8 | 1705998 | 150619 | Consumer loans | NaN | 41796.0 | 41796.0 | 41796.0 | Unused offer | 2018-07-31 19:35:49 | -05:00 | CLIENT | Mobile | NaN | XNA | POS mobile with interest | 2018-08-01 06:16:00 |
9 | 1633822 | 322184 | Consumer loans | 10229.940 | 144900.0 | 144900.0 | 144900.0 | Refused | 2020-03-17 16:58:06 | -06:00 | LIMIT | Furniture | 18.0 | middle | POS industry with interest | 2020-03-18 06:24:00 |
Register Tables in the Catalog¶
For accurate feature derivation, FeatureByte needs to recognize the roles of different tables.
Each table should be assigned a specific type based on its structure and purpose:
NEW_APPLICATION --> Dimension table.
Why Dimension Table?: While we could have registered this as an Event Table, this table contains only a subset of the applications. By setting it as a Dimension Table, we disable aggregations, preventing potentially non-meaningful computations.
PRIOR_APPLICATIONS --> Event table.
Why Event Table?: The table records final decision events for prior applications, making it suitable for an Event Table designation.
CONSUMER_LOAN_STATUS --> Slowly Changing Dimension (SCD) table.
Why Slowly Changing Dimension Table?: The table tracks loan statuses and dynamic fields that change over time, making it a Slowly Changing Dimension (SCD) Table.
CONSUMER_INSTALLMENTS --> Time Series table.
Why Time Series Table?: While we could have registered this as an Event Table, installment payments occur monthly. By defining it as a Time Series Table, we ensure calendar month aggregation, aligning with the event frequency.
If you are interested in a use case that uses an Item table, check out the Grocery SDK Tutorials.
Let's register each table using its respective type:
application_table = application_source_table.create_dimension_table(
name="NEW_APPLICATION",
dimension_id_column="APPLICATION_ID",
record_creation_timestamp_column="available_at",
)
prior_applications_table = prior_applications_source_table.create_event_table(
name="PRIOR_APPLICATIONS",
event_id_column="APPLICATION_ID",
event_timestamp_column="DECISION_DATE",
event_timestamp_timezone_offset_column="TIMEZONE_OFFSET",
record_creation_timestamp_column="available_at",
)
consumer_loans_table = consumer_loans_source_table.create_scd_table(
name="CONSUMER_LOAN_STATUS",
natural_key_column="LOAN_ID",
effective_timestamp_column="EFFECTIVE_TIMESTAMP",
effective_timestamp_schema=fb.TimestampSchema(
format_string="YYYY-MM-DD",
is_utc_time=True,
),
end_timestamp_column="END_TIMESTAMP",
end_timestamp_schema=fb.TimestampSchema(
format_string="YYYY-MM-DD",
is_utc_time=True,
),
record_creation_timestamp_column="available_at",
)
consumer_installments_table = consumer_installments_source_table.create_time_series_table(
name="CONSUMER_INSTALLMENTS",
reference_datetime_column="ACTUAL_INSTALLMENT_DATE",
reference_datetime_schema=fb.TimestampSchema(
is_utc_time=True,
timezone=fb.TimeZoneColumn(column_name="TIMEZONE_OFFSET", type="offset"),
),
time_interval=fb.TimeInterval(value=1, unit="MONTH"),
series_id_column="LOAN_ID",
record_creation_timestamp_column="available_at",
)
After this we will be able to see our tables in the catalog:
display(catalog.list_tables())
id | name | type | status | entities | created_at | |
---|---|---|---|---|---|---|
0 | 67c2c752924afe7a79ec6f27 | CONSUMER_INSTALLMENTS | time_series_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:38.763000 |
1 | 67c2c750924afe7a79ec6f26 | CONSUMER_LOAN_STATUS | scd_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:36.903000 |
2 | 67c2c74e924afe7a79ec6f25 | PRIOR_APPLICATIONS | event_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:34.829000 |
3 | 67c2c74c924afe7a79ec6f24 | NEW_APPLICATION | dimension_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:32.806000 |
Initialize Feature Job Setting for an event table¶
Feature Job Setting defines when and how frequently features are updated in the feature store. These settings also include a blind spot period, which represents the time gap between when a feature is computed and the latest available event.
Why is the Blind Spot Important? In an ideal scenario, we would include all prior applications up to the present in our feature computation. However, due to the time required for data collection, ETL processing, and other pipeline delays, the most recent applications may not be immediately available.
Setting the blind spot too small may result in data leakage during model training, as the production environment might not have access to the most recent data at inference time. Conversely, setting it too large may lead to stale feature values, reducing predictive performance.
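The trade-off can be made concrete with a little arithmetic (a plain-Python sketch, not FeatureByte API): with a blind spot of 25,800 s (7 h 10 min), a feature job that runs daily at 07:09 only aggregates events up to 23:59 the previous day.

```python
from datetime import datetime, timedelta

def aggregation_cutoff(job_time: datetime, blind_spot: timedelta) -> datetime:
    """Latest event timestamp a feature job is allowed to aggregate.

    Events newer than this are ignored so that training-time features
    match what production would actually have had available.
    """
    return job_time - blind_spot

# Hypothetical daily job at 07:09 with a 25,800 s (7 h 10 min) blind spot
job_time = datetime(2025, 3, 1, 7, 9)
cutoff = aggregation_cutoff(job_time, timedelta(seconds=25800))
print(cutoff)  # 2025-02-28 23:59:00
```

A smaller blind spot would push this cutoff closer to the job time, at the risk of counting events in training that production would not yet have seen.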
Automated Feature Job Settings with FeatureByte
FeatureByte intelligently determines the optimal feature job settings for event tables by automatically adjusting:
- Blind Spot: Ensuring that data is available and aligned with the production scenario.
- Update Frequency (Period): Setting an appropriate interval for feature updates.
- Offset: Delaying computation to accommodate data pipeline scheduling.
prior_applications_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)
The analysis period starts at 2025-01-31 23:59:11 and ends at 2025-02-28 23:59:11
The column used for the event timestamp is DECISION_DATE
The column used for the record creation timestamp for PRIOR_APPLICATIONS is available_at
STATISTICS ON TIME BETWEEN PRIOR_APPLICATIONS RECORDS CREATIONS
- Average time is 1913.7940379403794 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 83040.0 s
based on a total of 1207 unique record creation timestamps.
PRIOR_APPLICATIONS UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2025-01-31 23:00:00 to 2025-02-28 23:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
To override default feature job settings, see update_default_feature_job_setting.
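A manual override might look like the following configuration sketch. The parameter names assume featurebyte's FeatureJobSetting; the values are taken from the analysis output above, so treat this as illustrative rather than required:

```python
import featurebyte as fb

# Sketch only: replace the analyzed defaults with an explicit setting.
manual_setting = fb.FeatureJobSetting(
    period="86400s",      # run feature jobs every 24 hours
    offset="25740s",      # 7 h 9 min after the period start
    blind_spot="25800s",  # ignore events from the last 7 h 10 min
)
prior_applications_table.update_default_feature_job_setting(manual_setting)
```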
Set Feature Job Setting for Time Series¶
For time series tables, the feature job setting consists of three key parameters:
Crontab: Defines the cron schedule for the feature job, specifying when the job should run.
Time Zone: Determines the time zone in which the cron schedule operates, ensuring alignment with local time conventions.
Reference Time Zone: Specifies the time zone used to define calendar-based aggregation periods (e.g., daily, weekly, or monthly). This reference time zone ensures consistency when calculating calendar periods across different data time zones.
- For example, if the scheduled job is `2025/01/31 23:00 UTC` and the reference time zone is `Asia/Singapore`, the corresponding calendar date would be `2025/02/01`. Consequently, the aggregation for the latest complete month would cover January.
- If a time zone column is used to assign individual time zones per record, the reference time zone should be the westernmost time zone among those specified in the column. This ensures that aggregation periods fully encompass the calendar dates of all relevant observations.
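The first example above can be checked with the standard library: converting the scheduled job time from UTC to the reference time zone yields the calendar date used for aggregation.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Scheduled job time: 2025/01/31 23:00 UTC
scheduled = datetime(2025, 1, 31, 23, 0, tzinfo=timezone.utc)

# Convert to the reference time zone to get the calendar date
local = scheduled.astimezone(ZoneInfo("Asia/Singapore"))
print(local.date())  # 2025-02-01, so the latest complete month is January
```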
job_setting = fb.CronFeatureJobSetting(
crontab="0 13 1 * *",
timezone="Etc/UTC",
timezone_reference="America/Los_Angeles",
)
consumer_installments_table.update_default_feature_job_setting(job_setting)
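As a sanity check on the crontab string: `0 13 1 * *` fires at 13:00 (in the job's time zone) on the first day of every month. A tiny matcher (a hypothetical helper handling only plain values and `*`, not ranges or steps) illustrates the five-field semantics:

```python
def cron_matches(crontab: str, minute: int, hour: int, dom: int,
                 month: int, dow: int) -> bool:
    """Check a time against a crontab of plain numbers and '*' wildcards."""
    fields = crontab.split()  # minute, hour, day-of-month, month, day-of-week
    values = (minute, hour, dom, month, dow)
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

# "0 13 1 * *" runs at 13:00 on the 1st of any month, any weekday
assert cron_matches("0 13 1 * *", 0, 13, 1, 7, 2)
assert not cron_matches("0 13 1 * *", 0, 14, 1, 7, 2)
```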