2. Register Tables
Registering tables in the catalog¶
Our catalog has been created, so we can start registering tables in it.
First of all, let's activate the catalog.
We will repeat this command in the following notebooks.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Loan Applications Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)
14:06:19 | INFO | SDK version: 3.0.1.dev45
14:06:19 | INFO | No catalog activated.
14:06:19 | INFO | Using profile: tutorial
14:06:19 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
14:06:19 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
14:06:19 | INFO | SDK version: 3.0.1.dev45
14:06:19 | INFO | No catalog activated.
14:06:19 | INFO | Catalog activated: Loan Applications Dataset SDK Tutorial
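If the catalog has not been created yet (we did this in the previous notebook), it can be fetched or created in one step. A minimal sketch, assuming the tutorial environment's feature store is named "playground" (replace with the feature store configured for your environment):

# Hedged sketch: get the catalog, creating it if missing.
# NOTE: the feature store name "playground" is an assumption.
catalog = fb.Catalog.get_or_create(catalog_name, feature_store_name="playground")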
Get data source¶
To get source tables from the data warehouse, we need the data source that our catalog has access to.
This data source represents the collection of tables in our database, and we can use it to explore the DB schema:
ds = catalog.get_data_source()
Here we see we have access to a number of databases. For these tutorials we will use the one called 'DEMO_DATASETS' and the 'LOAN_APPLICATIONS' schema under it.
ds.list_databases()
['DEMO_DATASETS', 'TUTORIAL']
DATABASE_NAME = 'DEMO_DATASETS'
ds.list_schemas(database_name=DATABASE_NAME)
['CREDITCARD', 'CREDIT_DEFAULT_TUTORIAL', 'GROCERY', 'HEALTHCARE', 'INFORMATION_SCHEMA', 'LOAN_APPLICATIONS']
SCHEMA_NAME = 'LOAN_APPLICATIONS'
ds.list_source_tables(database_name=DATABASE_NAME, schema_name=SCHEMA_NAME)
['BUREAU', 'CLIENT_PROFILE', 'CREDIT_CARD_MONTHLY_BALANCE', 'INSTALLMENTS_PAYMENTS', 'LOAN_STATUS', 'NEW_APPLICATION', 'OBSERVATIONS_WITH_TARGET', 'OBSERVATION_EDA_TABLE', 'PREVIOUS_APPLICATION']
Get source tables¶
Now that we have identified the database and schema we want to work with, it is time to get the source tables and register them in the catalog.
We'll use the following seven source tables of our Loan Applications Dataset:
Table | Description |
---|---|
NEW_APPLICATION | Contains information about new loan applications submitted by clients. |
CLIENT_PROFILE | Describes the profile for each client. |
BUREAU | Lists credits taken by clients from other financial institutions, as reported to the credit bureau. |
PREVIOUS_APPLICATION | Details prior loan applications by the client. |
INSTALLMENTS_PAYMENTS | Logs monthly installments for loans at the time of payment. |
LOAN_STATUS | Tracks whether a loan has been terminated. |
CREDIT_CARD_MONTHLY_BALANCE | Monthly balance summaries for credit cards. |
ds = catalog.get_data_source()
application_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="NEW_APPLICATION"
)
client_profile_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="CLIENT_PROFILE"
)
bureau_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="BUREAU"
)
installments_payments_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="INSTALLMENTS_PAYMENTS"
)
previous_application_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="PREVIOUS_APPLICATION"
)
loan_status_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="LOAN_STATUS"
)
credit_card_balance_source_table = ds.get_source_table(
database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name="CREDIT_CARD_MONTHLY_BALANCE"
)
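For brevity, the same calls could equally be written as a loop; an equivalent sketch that collects all seven source tables in a dictionary:

# Equivalent loop form for fetching all source tables
table_names = [
    "NEW_APPLICATION", "CLIENT_PROFILE", "BUREAU", "INSTALLMENTS_PAYMENTS",
    "PREVIOUS_APPLICATION", "LOAN_STATUS", "CREDIT_CARD_MONTHLY_BALANCE",
]
source_tables = {
    name: ds.get_source_table(
        database_name=DATABASE_NAME, schema_name=SCHEMA_NAME, table_name=name
    )
    for name in table_names
}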
Explore Source Tables¶
# Preview a selection of rows
previous_application_source_table.preview()
 | APPLICATION_ID | ClientID | CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | SELLERPLACE_AREA | SELLER_INDUSTRY | CNT_PAYMENT | YIELD_GROUP | PRODUCT_COMBINATION | first_disbursment_timestamp | first_due_timestamp | last_due_1st_version_timestamp | NFLAG_INSURED_ON_APPROVAL | available_at
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2820941 | 283690 | Cash loans | 11460.735 | 121500.0 | 133528.5 | 0.0 | 121500.0 | SUNDAY | 10 | ... | -1 | Consumer electronics | 18.0 | high | Cash X-Sell: high | NaT | 2018-09-03 06:26:07 | 2020-01-26 06:26:07 | 1.0 | 2018-08-05 06:56:00 |
1 | 2621879 | 344988 | Consumer loans | 6622.245 | 51777.0 | 56335.5 | 0.0 | 51777.0 | TUESDAY | 12 | ... | 1375 | Consumer electronics | 10.0 | middle | POS household with interest | NaT | 2021-06-03 08:55:00 | 2022-02-28 08:55:00 | 0.0 | 2021-05-03 06:47:00 |
2 | 1150381 | 102189 | Consumer loans | 13975.605 | 90171.0 | 75703.5 | 18036.0 | 90171.0 | SATURDAY | 13 | ... | 1375 | Consumer electronics | 6.0 | middle | POS household with interest | NaT | 2021-10-20 21:05:34 | 2022-03-19 21:05:34 | 0.0 | 2021-09-19 06:56:00 |
3 | 2378212 | 423702 | Consumer loans | 4466.430 | 61272.0 | 74353.5 | 0.0 | 61272.0 | SATURDAY | 10 | ... | 1375 | Consumer electronics | 18.0 | low_action | POS household without interest | NaT | 2023-02-27 03:01:32 | 2024-07-21 03:01:32 | 0.0 | 2023-01-27 06:07:00 |
4 | 2299480 | 104023 | Consumer loans | 6637.635 | 73669.5 | 73669.5 | 0.0 | 73669.5 | MONDAY | 10 | ... | 1375 | Consumer electronics | 12.0 | low_action | POS household without interest | NaT | 2015-09-21 05:13:49 | 2016-08-16 05:13:49 | 0.0 | 2015-08-21 06:35:00 |
5 | 1873854 | 101412 | Consumer loans | 17051.130 | 87120.0 | 87120.0 | 0.0 | 87120.0 | TUESDAY | 15 | ... | 51 | Industry | 6.0 | high | POS other with interest | NaT | 2022-02-09 16:30:59 | 2022-07-09 16:30:59 | 0.0 | 2021-12-23 06:43:00 |
6 | 2736786 | 328888 | Consumer loans | 11598.750 | 100858.5 | 100858.5 | 0.0 | 100858.5 | THURSDAY | 11 | ... | 134 | Consumer electronics | 10.0 | middle | POS household with interest | NaT | 2020-02-23 20:55:37 | 2020-11-19 20:55:37 | 0.0 | 2020-01-23 06:26:00 |
7 | 1760932 | 180612 | Cash loans | 48504.555 | 1125000.0 | 1223010.0 | NaN | 1125000.0 | TUESDAY | 17 | ... | -1 | XNA | 36.0 | low_normal | Cash X-Sell: low | NaT | 2020-07-30 01:55:01 | 2023-06-15 01:55:01 | 1.0 | 2020-07-01 06:36:00 |
8 | 1509312 | 319494 | Cash loans | 20600.640 | 90000.0 | 104418.0 | NaN | 90000.0 | TUESDAY | 11 | ... | -1 | XNA | 6.0 | high | Cash X-Sell: high | NaT | 2023-06-04 14:55:44 | 2023-11-01 14:55:44 | 1.0 | 2023-05-06 06:36:00 |
9 | 2486047 | 174600 | Cash loans | 24543.585 | 247500.0 | 267592.5 | NaN | 247500.0 | WEDNESDAY | 14 | ... | -1 | XNA | 18.0 | high | Cash Street: high | NaT | 2018-09-21 16:04:38 | 2020-02-13 16:04:38 | 1.0 | 2018-08-23 06:10:00 |
10 rows × 36 columns
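Beyond preview(), a source table can be explored further before registration. A quick sketch using sample() and describe(), assuming these SourceTable helpers behave as in recent SDK versions:

# Draw a random, reproducible sample of rows
previous_application_source_table.sample(size=10, seed=1234)

# Summary statistics (counts, missing values, min/max, etc.) per column
previous_application_source_table.describe()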
Register Tables in the Catalog¶
For accurate feature derivation, FeatureByte needs to recognize the roles of different tables.
Each table should be assigned a specific type based on its structure and purpose:
NEW_APPLICATION --> Dimension table.
Why Dimension Table?: While we could have registered this as an Event table, this table contains only a subset of the applications. By setting it as a Dimension table, we disable aggregations, preventing potentially non-meaningful computations.
CLIENT_PROFILE --> Slowly Changing Dimension (SCD) table.
Why Slowly Changing Dimension Table?: The table tracks the client profile that changes over time, making it a Slowly Changing Dimension (SCD) Table.
BUREAU --> Event table.
Why Event Table?: The table captures event-based updates from the credit bureau regarding external credit activities.
PREVIOUS_APPLICATION --> Event table.
Why Event Table?: The table records events tied to decisions made on previous loan applications.
LOAN_STATUS --> Slowly Changing Dimension (SCD) table.
Why Slowly Changing Dimension Table?: The table tracks loan statuses and dynamic fields that change over time, making it a Slowly Changing Dimension (SCD) Table.
INSTALLMENTS_PAYMENTS --> Event table.
Why Event Table?: The table logs discrete payment events for loan installments.
CREDIT_CARD_MONTHLY_BALANCE --> Time Series table.
Why Time Series Table?: The table provides monthly snapshots of credit card usage, aligning well with a time series structure.
If you are interested in a use case that exploits an Item table, check out the Grocery UI Tutorials.
Let's register each table using its respective type:
application_table = application_source_table.create_dimension_table(
name="NEW_APPLICATION",
dimension_id_column="SK_ID_CURR",
record_creation_timestamp_column="available_at",
)
client_profile_table = client_profile_source_table.create_scd_table(
name="CLIENT_PROFILE",
natural_key_column="ClientID",
effective_timestamp_column="SCD_effective_timestamp",
end_timestamp_column="SCD_end_timestamp",
record_creation_timestamp_column="available_at",
)
bureau_table = bureau_source_table.create_event_table(
name="BUREAU",
event_id_column="SK_ID_BUREAU",
event_timestamp_column="credit_update",
record_creation_timestamp_column="available_at",
)
installments_payments_table = installments_payments_source_table.create_event_table(
name="INSTALLMENTS_PAYMENTS",
event_id_column="INSTALMENT_ID",
event_timestamp_column="actual_installment_date",
record_creation_timestamp_column="available_at",
)
previous_application_table = previous_application_source_table.create_event_table(
name="PREVIOUS_APPLICATION",
event_id_column="APPLICATION_ID",
event_timestamp_column="decision_date",
record_creation_timestamp_column="available_at",
)
loan_status_table = loan_status_source_table.create_scd_table(
name="LOAN_STATUS",
natural_key_column="LOAN_ID",
effective_timestamp_column="SCD_Effective_Timestamp",
end_timestamp_column="SCD_End_Timestamp",
record_creation_timestamp_column="available_at",
)
credit_card_balance_table = credit_card_balance_source_table.create_time_series_table(
name="CREDIT_CARD_MONTHLY_BALANCE",
reference_datetime_column="balance_month",
reference_datetime_schema=fb.TimestampSchema(
format_string="YYYY-MM",
timezone="America/Los_Angeles",
),
time_interval=fb.TimeInterval(value=1, unit="MONTH"),
series_id_column="CARD_ID",
record_creation_timestamp_column="available_at",
)
After this, we can see all seven tables in the catalog:
display(catalog.list_tables())
 | id | name | type | status | entities | created_at
---|---|---|---|---|---|---|
0 | 683d3a70f4d55cd61d526cee | CREDIT_CARD_MONTHLY_BALANCE | time_series_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:20.813000 |
1 | 683d3a6ef4d55cd61d526ced | LOAN_STATUS | scd_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:18.744000 |
2 | 683d3a6cf4d55cd61d526cec | PREVIOUS_APPLICATION | event_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:16.825000 |
3 | 683d3a6af4d55cd61d526ceb | INSTALLMENTS_PAYMENTS | event_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:14.598000 |
4 | 683d3a69f4d55cd61d526cea | BUREAU | event_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:13.206000 |
5 | 683d3a66f4d55cd61d526ce9 | CLIENT_PROFILE | scd_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:11.014000 |
6 | 683d3a64f4d55cd61d526ce8 | NEW_APPLICATION | dimension_table | PUBLIC_DRAFT | None | 2025-06-02T05:45:08.853000 |
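From now on, registered tables can be retrieved from the active catalog by name, which is what the following notebooks will do:

# Retrieve a registered table by name from the active catalog
application_table = catalog.get_table("NEW_APPLICATION")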
Initialize Feature Job Setting for an event table¶
Feature Job Setting defines when and how frequently features are updated in the feature store. These settings also include a blind spot period, which represents the time gap between when a feature is computed and the latest available event.
Why is the Blind Spot Important? In an ideal scenario, we would include all prior applications up to the present in our feature computation. However, due to the time required for data collection, ETL processing, and other pipeline delays, the most recent applications may not be immediately available.
Setting the blind spot too small may result in data leakage during model training, as the production environment might not have access to the most recent data at inference time. Conversely, setting it too large may lead to stale feature values, reducing predictive performance.
Automated Feature Job Settings with FeatureByte
FeatureByte intelligently determines the optimal feature job settings for event tables by automatically recommending:
- Blind Spot: Ensuring that data is available and aligned with the production scenario.
- Update Frequency (Period): Setting an appropriate interval for feature updates.
- Offset: Delaying computation to accommodate data pipeline scheduling.
previous_application_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 9.2s (0.11%/s)
The analysis period starts at 2025-05-03 23:55:57 and ends at 2025-05-31 23:55:57
The column used for the event timestamp is decision_date
The column used for the record creation timestamp for PREVIOUS_APPLICATION is available_at
STATISTICS ON TIME BETWEEN PREVIOUS_APPLICATION RECORDS CREATIONS
- Average time is 2331.648902821317 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 83160.0 s
based on a total of 1014 unique record creation timestamps.
PREVIOUS_APPLICATION UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2025-05-03 23:00:00 to 2025-05-31 23:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
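The recommendation can also be written out as an explicit interval-based setting, which is useful if you want to inspect or tweak the values. A sketch using fb.FeatureJobSetting, assuming it accepts these duration strings as in recent SDK versions:

# Recommended values from the analysis above, expressed explicitly
manual_setting = fb.FeatureJobSetting(
    blind_spot="25800s",  # wait ~7h10m so late-landing events are included
    period="86400s",      # recompute features daily
    offset="25740s",      # start after the table's daily update window
)
# Uncomment to override the initialized default:
# previous_application_table.update_default_feature_job_setting(manual_setting)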
bureau_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 15.2s (0.07%/s)
The analysis period starts at 2025-05-03 23:59:09 and ends at 2025-05-31 23:59:09
The column used for the event timestamp is credit_update
The column used for the record creation timestamp for BUREAU is available_at
STATISTICS ON TIME BETWEEN BUREAU RECORDS CREATIONS
- Average time is 1399.5279025016036 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 82920.0 s
based on a total of 1651 unique record creation timestamps.
BUREAU UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25800 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2025-05-03 23:00:00 to 2025-05-31 23:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
To manually set or override the default feature job setting, see update_default_feature_job_setting.
job_setting = fb.CronFeatureJobSetting(
    crontab="0 19 * * *",  # run daily at 19:00 (UTC by default)
    blind_spot="1h",       # exclude events from the most recent hour
)
installments_payments_table.update_default_feature_job_setting(job_setting)
Set Feature Job Setting for a time series table¶
For time series tables, the feature job setting consists of three key parameters:
Crontab: Defines the cron schedule for the feature job, specifying when the job should run.
Time Zone: Determines the time zone in which the cron schedule operates, ensuring alignment with local time conventions.
Reference Time Zone: Specifies the time zone used to define calendar-based aggregation periods (e.g., daily, weekly, or monthly). This reference time zone ensures consistency when calculating calendar periods across different data time zones.
- For example, if the scheduled job is 2025/01/31 23:00 UTC and the reference time zone is Asia/Singapore, the corresponding calendar date would be 2025/02/01. Consequently, the aggregation for the latest complete month would cover January.
- If a time zone column is used to assign individual time zones per record, the reference time zone should be the westernmost time zone among those specified in the column. This ensures that aggregation periods fully encompass the calendar dates of all relevant observations.
job_setting = fb.CronFeatureJobSetting(
    crontab="0 13 1 * *",  # run at 13:00 on the 1st day of each month
    timezone="America/Los_Angeles",
)
credit_card_balance_table.update_default_feature_job_setting(job_setting)
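Finally, we can sanity-check what was applied by inspecting the table metadata; info() should include the default feature job setting in its summary, though the exact field names may vary by SDK version:

# Inspect table metadata, including the default feature job setting
credit_card_balance_table.info()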