2. Register Tables
Registering tables in the catalog¶
Our catalog has been created, so we can start registering tables in it.
First of all, let's activate the catalog.
We will repeat this command in the following notebooks.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Loan Applications Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)
14:06:19 | INFO | SDK version: 3.0.1.dev45
14:06:19 | INFO | No catalog activated.
14:06:19 | INFO | Using profile: tutorial
14:06:19 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
14:06:19 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
14:06:19 | INFO | SDK version: 3.0.1.dev45
14:06:19 | INFO | No catalog activated.
14:06:19 | INFO | Catalog activated: Loan Applications Dataset SDK Tutorial
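If the catalog has not been created yet (we did this in the previous notebook), it can be fetched or created in one step. A minimal sketch, assuming the tutorial environment's feature store is named "playground" (replace with the feature store configured for your environment):

# Hedged sketch: get the catalog, creating it if missing.
# NOTE: the feature store name "playground" is an assumption.
catalog = fb.Catalog.get_or_create(catalog_name, feature_store_name="playground")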
Get data source¶
To get source tables from the data warehouse, we need the data source that our catalog has access to.
This data source represents the collection of tables in our database, and we can use it to explore the DB schema:
ds = catalog.get_data_source()
Here we see we have access to a number of databases. For these tutorials we will use the one called 'DEMO_DATASETS' and the 'LOAN_APPLICATIONS' schema under it.
ds.list_databases()
['DEMO_DATASETS', 'TUTORIAL']
DATABASE_NAME = 'DEMO_DATASETS'
ds.list_schemas(database_name=DATABASE_NAME)
['CREDITCARD', 'CREDIT_DEFAULT_TUTORIAL', 'GROCERY', 'HEALTHCARE', 'INFORMATION_SCHEMA', 'LOAN_APPLICATIONS']
SCHEMA_NAME = 'LOAN_APPLICATIONS'
ds.list_source_tables(database_name=DATABASE_NAME, schema_name=SCHEMA_NAME)
['BUREAU', 'CLIENT_PROFILE', 'CREDIT_CARD_MONTHLY_BALANCE', 'INSTALLMENTS_PAYMENTS', 'LOAN_STATUS', 'NEW_APPLICATION', 'OBSERVATIONS_WITH_TARGET', 'OBSERVATION_EDA_TABLE', 'PREVIOUS_APPLICATION']
Get source tables¶
Now that we have identified the database and schema we want to work with, it is time to get the source tables and register them in the catalog.
We'll use the following seven source tables of our Loan Applications Dataset:
Table | Description |
---|---|
NEW_APPLICATION | Contains information about new loan applications submitted by clients. |
CLIENT_PROFILE | Describes the profile for each client. |
BUREAU | Lists credits taken by clients from other financial institutions, as reported to the credit bureau. |
PREVIOUS_APPLICATION | Details prior loan applications by the client. |
INSTALLMENTS_PAYMENTS | Logs monthly installments for loans at the time of payment. |
LOAN_STATUS | Tracks whether a loan has been terminated. |
CREDIT_CARD_MONTHLY_BALANCE | Monthly balance summaries for credit cards. |
ds = catalog.get_data_source()
application_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="NEW_APPLICATION"
)
client_profile_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="CLIENT_PROFILE"
)
bureau_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="BUREAU"
)
installments_payments_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="INSTALLMENTS_PAYMENTS"
)
previous_application_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="PREVIOUS_APPLICATION"
)
loan_status_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="LOAN_STATUS"
)
credit_card_balance_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="CREDIT_CARD_MONTHLY_BALANCE"
)
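For brevity, the same calls could equally be written as a loop; an equivalent sketch that collects all seven source tables in a dictionary:

# Equivalent loop form for fetching all source tables
table_names = [
    "NEW_APPLICATION", "CLIENT_PROFILE", "BUREAU", "INSTALLMENTS_PAYMENTS",
    "PREVIOUS_APPLICATION", "LOAN_STATUS", "CREDIT_CARD_MONTHLY_BALANCE",
]
source_tables = {
    name: ds.get_source_table(
        database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name=name
    )
    for name in table_names
}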
Explore Source Tables¶
# Preview a selection of rows
previous_application_source_table.preview()
 | APPLICATION_ID | ClientID | CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | SELLERPLACE_AREA | SELLER_INDUSTRY | CNT_PAYMENT | YIELD_GROUP | PRODUCT_COMBINATION | first_disbursment_timestamp | first_due_timestamp | last_due_1st_version_timestamp | NFLAG_INSURED_ON_APPROVAL | available_at
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2820941 | 283690 | Cash loans | 11460.735 | 121500.0 | 133528.5 | 0.0 | 121500.0 | SUNDAY | 10 | ... | -1 | Consumer electronics | 18.0 | high | Cash X-Sell: high | NaT | 2018-09-03 06:26:07 | 2020-01-26 06:26:07 | 1.0 | 2018-08-05 06:56:00 |
1 | 2621879 | 344988 | Consumer loans | 6622.245 | 51777.0 | 56335.5 | 0.0 | 51777.0 | TUESDAY | 12 | ... | 1375 | Consumer electronics | 10.0 | middle | POS household with interest | NaT | 2021-06-03 08:55:00 | 2022-02-28 08:55:00 | 0.0 | 2021-05-03 06:47:00 |
2 | 1150381 | 102189 | Consumer loans | 13975.605 | 90171.0 | 75703.5 | 18036.0 | 90171.0 | SATURDAY | 13 | ... | 1375 | Consumer electronics | 6.0 | middle | POS household with interest | NaT | 2021-10-20 21:05:34 | 2022-03-19 21:05:34 | 0.0 | 2021-09-19 06:56:00 |
3 | 2378212 | 423702 | Consumer loans | 4466.430 | 61272.0 | 74353.5 | 0.0 | 61272.0 | SATURDAY | 10 | ... | 1375 | Consumer electronics | 18.0 | low_action | POS household without interest | NaT | 2023-02-27 03:01:32 | 2024-07-21 03:01:32 | 0.0 | 2023-01-27 06:07:00 |
4 | 2299480 | 104023 | Consumer loans | 6637.635 | 73669.5 | 73669.5 | 0.0 | 73669.5 | MONDAY | 10 | ... | 1375 | Consumer electronics | 12.0 | low_action | POS household without interest | NaT | 2015-09-21 05:13:49 | 2016-08-16 05:13:49 | 0.0 | 2015-08-21 06:35:00 |
5 | 1873854 | 101412 | Consumer loans | 17051.130 | 87120.0 | 87120.0 | 0.0 | 87120.0 | TUESDAY | 15 | ... | 51 | Industry | 6.0 | high | POS other with interest | NaT | 2022-02-09 16:30:59 | 2022-07-09 16:30:59 | 0.0 | 2021-12-23 06:43:00 |
6 | 2736786 | 328888 | Consumer loans | 11598.750 | 100858.5 | 100858.5 | 0.0 | 100858.5 | THURSDAY | 11 | ... | 134 | Consumer electronics | 10.0 | middle | POS household with interest | NaT | 2020-02-23 20:55:37 | 2020-11-19 20:55:37 | 0.0 | 2020-01-23 06:26:00 |
7 | 1760932 | 180612 | Cash loans | 48504.555 | 1125000.0 | 1223010.0 | NaN | 1125000.0 | TUESDAY | 17 | ... | -1 | XNA | 36.0 | low_normal | Cash X-Sell: low | NaT | 2020-07-30 01:55:01 | 2023-06-15 01:55:01 | 1.0 | 2020-07-01 06:36:00 |
8 | 1509312 | 319494 | Cash loans | 20600.640 | 90000.0 | 104418.0 | NaN | 90000.0 | TUESDAY | 11 | ... | -1 | XNA | 6.0 | high | Cash X-Sell: high | NaT | 2023-06-04 14:55:44 | 2023-11-01 14:55:44 | 1.0 | 2023-05-06 06:36:00 |
9 | 2486047 | 174600 | Cash loans | 24543.585 | 247500.0 | 267592.5 | NaN | 247500.0 | WEDNESDAY | 14 | ... | -1 | XNA | 18.0 | high | Cash Street: high | NaT | 2018-09-21 16:04:38 | 2020-02-13 16:04:38 | 1.0 | 2018-08-23 06:10:00 |
10 rows × 36 columns
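Beyond preview(), a source table can be explored further before registration. A quick sketch using sample() and describe(), assuming these SourceTable helpers behave as in recent SDK versions:

# Draw a random, reproducible sample of rows
previous_application_source_table.sample(size=10, seed=1234)

# Summary statistics (counts, missing values, min/max, etc.) per column
previous_application_source_table.describe()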
Register Tables in the Catalog¶
For accurate feature derivation, FeatureByte needs to recognize the roles of different tables.
Each table should be assigned a specific type based on its structure and purpose:
NEW_APPLICATION --> Dimension table.
Why Dimension Table?: While we could have registered this as an Event table, this table contains only a subset of the applications. By setting it as a Dimension table, we disable aggregations, preventing potentially non-meaningful computations.
CLIENT_PROFILE --> Slowly Changing Dimension (SCD) table.
Why Slowly Changing Dimension Table?: The table tracks the client profile that changes over time, making it a Slowly Changing Dimension (SCD) Table.
BUREAU --> Event table.
Why Event Table?: The table captures event-based updates from the credit bureau regarding external credit activities.
PREVIOUS_APPLICATION --> Event table.
Why Event Table?: The table records events tied to decisions made on previous loan applications.
LOAN_STATUS --> Slowly Changing Dimension (SCD) table.
Why Slowly Changing Dimension Table?: The table tracks loan statuses and dynamic fields that change over time, making it a Slowly Changing Dimension (SCD) Table.
INSTALLMENTS_PAYMENTS --> Event table.
Why Event Table?: The table logs discrete payment events for loan installments.
CREDIT_CARD_MONTHLY_BALANCE --> Time Series table.
Why Time Series Table?: The table provides monthly snapshots of credit card usage, aligning well with a time series structure.
If you are interested in a use case that exploits an Item table, check out the Grocery UI Tutorials.
Let's register each table using its respective type:
application_table = application_source_table.create_dimension_table(
name="NEW_APPLICATION",
dimension_id_column="SK_ID_CURR",
record_creation_timestamp_column="available_at",
)
client_profile_table = client_profile_source_table.create_scd_table(
name="CLIENT_PROFILE",
natural_key_column="ClientID",
effective_timestamp_column="SCD_effective_timestamp",
end_timestamp_column="SCD_end_timestamp",
record_creation_timestamp_column="available_at",
)
bureau_table = bureau_source_table.create_event_table(
name="BUREAU",
event_id_column="SK_ID_BUREAU",
event_timestamp_column="credit_update",
record_creation_timestamp_column="available_at",
)
installments_payments_table = installments_payments_source_table.create_event_table(
name="INSTALLMENTS_PAYMENTS",
event_id_column="INSTALMENT_ID",
event_timestamp_column="actual_installment_date",
record_creation_timestamp_column="available_at",
)
previous_application_table = previous_application_source_table.create_event_table(
name="PREVIOUS_APPLICATION",
event_id_column="APPLICATION_ID",
event_timestamp_column="decision_date",
record_creation_timestamp_column="available_at",
)
loan_status_table = loan_status_source_table.create_scd_table(
name="LOAN_STATUS",
natural_key_column="LOAN_ID",
effective_timestamp_column="SCD_Effective_Timestamp",
end_timestamp_column="SCD_End_Timestamp",
record_creation_timestamp_column="available_at",
)
credit_card_balance_table = credit_card_balance_source_table.create_time_series_table(
name="CREDIT_CARD_MONTHLY_BALANCE",
reference_datetime_column="balance_month",
reference_datetime_schema=fb.TimestampSchema(
format_string="YYYY-MM",
timezone="America/Los_Angeles",
),
time_interval=fb.TimeInterval(value=1, unit="MONTH"),
series_id_column="CARD_ID",
record_creation_timestamp_column="available_at",
)
After this, we can see all seven tables in the catalog:
display(catalog.list_tables())
 | id | name | type | status | entities | created_at
---|---|---|---|---|---|---|
0 | 683d3a70f4d55cd61d526cee | CREDIT_CARD_MONTHLY_BALANCE | time_series_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:20.813000 |
1 | 683d3a6ef4d55cd61d526ced | LOAN_STATUS | scd_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:18.744000 |
2 | 683d3a6cf4d55cd61d526cec | PREVIOUS_APPLICATION | event_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:16.825000 |
3 | 683d3a6af4d55cd61d526ceb | INSTALLMENTS_PAYMENTS | event_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:14.598000 |
4 | 683d3a69f4d55cd61d526cea | BUREAU | event_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:13.206000 |
5 | 683d3a66f4d55cd61d526ce9 | CLIENT_PROFILE | scd_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:11.014000 |
6 | 683d3a64f4d55cd61d526ce8 | NEW_APPLICATION | dimension_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:08.853000 |
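From now on, registered tables can be retrieved from the active catalog by name, which is what the following notebooks will do:

# Retrieve a registered table by name from the active catalog
application_table = catalog.get_table("NEW_APPLICATION")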
Initialize Feature Job Setting for an event table¶
Feature Job Setting defines when and how frequently features are updated in the feature store. These settings also include a blind spot period, which represents the time gap between when a feature is computed and the latest available event.
Why is the Blind Spot Important? In an ideal scenario, we would include all prior applications up to the present in our feature computation. However, due to the time required for data collection, ETL processing, and other pipeline delays, the most recent applications may not be immediately available.
Setting the blind spot too small may result in data leakage during model training, as the production environment might not have access to the most recent data at inference time. Conversely, setting it too large may lead to stale feature values, reducing predictive performance.
Automated Feature Job Settings with FeatureByte
FeatureByte intelligently determines the optimal feature job settings for event tables by automatically recommending:
- Blind Spot: Ensuring that data is available and aligned with the production scenario.
- Update Frequency (Period): Setting an appropriate interval for feature updates.
- Offset: Delaying computation to accommodate data pipeline scheduling.
previous_application_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 9.2s (0.11%/s)
The analysis period starts at 2025-05-03 23:55:57 and ends at 2025-05-31 23:55:57
The column used for the event timestamp is decision_date
The column used for the record creation timestamp for PREVIOUS_APPLICATION is available_at
STATISTICS ON TIME BETWEEN PREVIOUS_APPLICATION RECORDS CREATIONS
- Average time is 2331.648902821317 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 83160.0 s
based on a total of 1014 unique record creation timestamps.
PREVIOUS_APPLICATION UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2025-05-03 23:00:00 to 2025-05-31 23:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
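The recommendation can also be written out as an explicit interval-based setting, which is useful if you want to inspect or tweak the values. A sketch using fb.FeatureJobSetting, assuming it accepts these duration strings as in recent SDK versions:

# Recommended values from the analysis above, expressed explicitly
manual_setting = fb.FeatureJobSetting(
    blind_spot="25800s",  # wait ~7h10m so late-landing events are included
    period="86400s",      # recompute features daily
    offset="25740s",      # start after the table's daily update window
)
# Uncomment to override the initialized default:
# previous_application_table.update_default_feature_job_setting(manual_setting)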
bureau_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 15.2s (0.07%/s)
The analysis period starts at 2025-05-03 23:59:09 and ends at 2025-05-31 23:59:09
The column used for the event timestamp is credit_update
The column used for the record creation timestamp for BUREAU is available_at
STATISTICS ON TIME BETWEEN BUREAU RECORDS CREATIONS
- Average time is 1399.5279025016036 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 82920.0 s
based on a total of 1651 unique record creation timestamps.
BUREAU UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25800 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2025-05-03 23:00:00 to 2025-05-31 23:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
To manually set or override the default feature job setting, see update_default_feature_job_setting.
job_setting = fb.CronFeatureJobSetting(
    crontab="0 19 * * *",  # run daily at 19:00 (UTC by default)
    blind_spot="1h",       # exclude events from the most recent hour
)
installments_payments_table.update_default_feature_job_setting(job_setting)
Set Feature Job Setting for a time series table¶
For time series tables, the feature job setting consists of three key parameters:
Crontab: Defines the cron schedule for the feature job, specifying when the job should run.
Time Zone: Determines the time zone in which the cron schedule operates, ensuring alignment with local time conventions.
Reference Time Zone: Specifies the time zone used to define calendar-based aggregation periods (e.g., daily, weekly, or monthly). This reference time zone ensures consistency when calculating calendar periods across different data time zones.
- For example, if the scheduled job is 2025/01/31 23:00 UTC and the reference time zone is Asia/Singapore, the corresponding calendar date would be 2025/02/01. Consequently, the aggregation for the latest complete month would cover January.
- If a time zone column is used to assign individual time zones per record, the reference time zone should be the westernmost time zone among those specified in the column. This ensures that aggregation periods fully encompass the calendar dates of all relevant observations.
job_setting = fb.CronFeatureJobSetting(
    crontab="0 13 1 * *",  # run at 13:00 on the 1st day of each month
    timezone="America/Los_Angeles",
)
credit_card_balance_table.update_default_feature_job_setting(job_setting)
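Finally, we can sanity-check what was applied by inspecting the table metadata; info() should include the default feature job setting in its summary, though the exact field names may vary by SDK version:

# Inspect table metadata, including the default feature job setting
credit_card_balance_table.info()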