2. Register Tables
Registering tables in the catalog¶
Our catalog has been created, and we can start registering tables in it.
First of all, let's activate our catalog.¶
We will repeat this command in the following notebooks.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Credit Default Dataset SDK Tutorial"
catalog = fb.Catalog.activate(catalog_name)
16:37:06 | WARNING | Service endpoint is inaccessible: http://featurebyte-server:8088/
16:37:06 | INFO | Using profile: tutorial
16:37:07 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml
16:37:07 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
16:37:07 | INFO | SDK version: 2.1.0.dev113
16:37:07 | INFO | No catalog activated.
16:37:07 | INFO | Catalog activated: Credit Default Dataset SDK Tutorial
Get data source¶
To get source tables from the data warehouse, we first need the data source that our catalog has access to.
The data source represents the collection of tables in our database, and we can use it to explore the database schema:
ds = catalog.get_data_source()
Here we see we have access to a number of databases. For these tutorials we will use the one called 'DEMO_DATASETS' and the 'CREDIT_DEFAULT_TUTORIAL' schema under it.
ds.list_databases()
['DEMO_DATASETS', 'TUTORIAL']
database_name = 'DEMO_DATASETS'
ds.list_schemas(database_name=database_name)
['CREDITCARD', 'CREDIT_DEFAULT_TUTORIAL', 'GROCERY', 'HEALTHCARE', 'INFORMATION_SCHEMA']
schema_name = 'CREDIT_DEFAULT_TUTORIAL'
ds.list_source_tables(database_name=database_name, schema_name=schema_name)
['CASH_INSTALLMENTS', 'CASH_LOAN_STATUS', 'CONSUMER_INSTALLMENTS', 'CONSUMER_LOAN_STATUS', 'CREDIT_DEFAULT_FULL_OBSERVATIONS', 'CREDIT_DEFAULT_SAMPLE_OBSERVATIONS', 'NEW_APPLICATION', 'PRIOR_APPLICATIONS']
Get source tables¶
Now that we have identified the database and schema we want to work with, it is time to get the source tables and register them in the catalog.
We'll utilize the following four source tables of our Credit Default Dataset:
Table | Description |
---|---|
NEW_APPLICATION | Records new loan applications. |
PRIOR_APPLICATIONS | Contains data on prior loan applications and the final decision. |
CONSUMER_LOAN_STATUS | Tracks consumer loans status. |
CONSUMER_INSTALLMENTS | Logs monthly installments for consumer loans at the time of payment. |
Two source tables are left for you to explore.
Table | Description |
---|---|
CASH_LOAN_STATUS | Tracks cash loans status. |
CASH_INSTALLMENTS | Logs monthly installments for cash loans at the time of payment. |
application_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="NEW_APPLICATION"
)
prior_applications_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="PRIOR_APPLICATIONS"
)
consumer_loans_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="CONSUMER_LOAN_STATUS"
)
consumer_installments_source_table = ds.get_source_table(
database_name=database_name,
schema_name=schema_name,
table_name="CONSUMER_INSTALLMENTS"
)
Explore Source Tables¶
You can obtain descriptive statistics, preview a selection of rows, or collect additional information on their columns.
# Obtain descriptive statistics
prior_applications_source_table.describe()
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)
APPLICATION_ID | CLIENT_ID | CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_GOODS_VALUE | CONTRACT_STATUS | DECISION_DATE | TIMEZONE_OFFSET | CODE_REJECT_REASON | GOODS_CATEGORY | CNT_PAYMENT | YIELD_GROUP | PRODUCT_COMBINATION | available_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dtype | INT | INT | VARCHAR | FLOAT | FLOAT | FLOAT | FLOAT | VARCHAR | TIMESTAMP | VARCHAR | VARCHAR | VARCHAR | FLOAT | VARCHAR | VARCHAR | TIMESTAMP |
unique | 1670214 | 338857 | 4 | 357959 | 93885 | 86803 | 93885 | 4 | 1506737 | 3 | 9 | 28 | 49 | 5 | 17 | 267740 |
%missing | 0.0 | 0.0 | 0.0 | 22.2867 | 0.0 | 0.0001 | 23.0818 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22.2864 | 0.0 | 0.0207 | 0.0 |
%empty | NaN | NaN | 0 | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 0 | 0 | NaN | 0 | 0 | NaN |
entropy | NaN | NaN | 0.972882 | NaN | NaN | NaN | NaN | 0.981049 | NaN | 1.098611 | 0.743124 | 1.568896 | NaN | 1.507147 | 2.48288 | NaN |
top | 1727222 | 187868 | Cash loans | NaN | NaN | NaN | NaN | Approved | NaN | -07:00 | XAP | XNA | NaN | XNA | Cash | NaN |
freq | 1.0 | 77.0 | 747553.0 | NaN | NaN | NaN | NaN | 1036781.0 | NaN | 557756.0 | 1353093.0 | 950809.0 | NaN | 517215.0 | 285990.0 | NaN |
mean | 1923089.135331 | 278357.174099 | NaN | 15955.120659 | 175233.86036 | 196114.021218 | 227847.279283 | NaN | NaN | NaN | NaN | NaN | 16.054082 | NaN | NaN | NaN |
std | 532597.958696 | 102814.823849 | NaN | 14782.137335 | 292779.762386 | 318574.616547 | 315396.557937 | NaN | NaN | NaN | NaN | NaN | 14.567288 | NaN | NaN | NaN |
min | 1000001 | 100001 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 2010-09-01T17:11:16.000000000 | NaN | NaN | NaN | 0.0 | NaN | NaN | 2010-09-02T06:59:00.000000000 |
25% | 1461857.25 | 189329.0 | NaN | 6321.78 | 18720.0 | 24160.5 | 50841.0 | NaN | NaN | NaN | NaN | NaN | 6.0 | NaN | NaN | NaN |
50% | 1923110.5 | 278714.5 | NaN | 11250.0 | 71046.0 | 80541.0 | 112320.0 | NaN | NaN | NaN | NaN | NaN | 12.0 | NaN | NaN | NaN |
75% | 2384279.75 | 367514.0 | NaN | 20658.42 | 180360.0 | 216418.5 | 234000.0 | NaN | NaN | NaN | NaN | NaN | 24.0 | NaN | NaN | NaN |
max | 2845382 | 456255 | NaN | 418058.145 | 6905160.0 | 6905160.0 | 6905160.0 | NaN | 2024-08-27T00:26:16.000000000 | NaN | NaN | NaN | 84.0 | NaN | NaN | 2024-08-28T06:01:00.000000000 |
# Preview a selection of rows
prior_applications_source_table.preview()
APPLICATION_ID | CLIENT_ID | CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_GOODS_VALUE | CONTRACT_STATUS | DECISION_DATE | TIMEZONE_OFFSET | CODE_REJECT_REASON | GOODS_CATEGORY | CNT_PAYMENT | YIELD_GROUP | PRODUCT_COMBINATION | available_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2409570 | 106763 | Consumer loans | 6231.960 | 61605.0 | 55444.5 | 61605.0 | Approved | 2019-10-03 18:24:04 | -05:00 | XAP | Consumer Electronics | 10.0 | low_normal | POS household without interest | 2019-10-04 06:02:00 |
1 | 2050716 | 244529 | Consumer loans | 6682.320 | 43155.0 | 47425.5 | 43155.0 | Approved | 2024-01-27 18:13:28 | -07:00 | XAP | Mobile | 10.0 | high | POS household with interest | 2024-01-28 06:40:00 |
2 | 2716297 | 174445 | Consumer loans | 4304.565 | 40500.0 | 39564.0 | 40500.0 | Approved | 2017-05-01 16:36:35 | -07:00 | XAP | Furniture | 12.0 | high | POS industry with interest | 2017-05-02 06:43:00 |
3 | 2633072 | 334346 | Consumer loans | 13354.785 | 165955.5 | 187312.5 | 165955.5 | Approved | 2017-04-29 21:11:39 | -06:00 | XAP | Audio/Video | 18.0 | middle | POS household with interest | 2017-04-30 06:28:00 |
4 | 2680906 | 411586 | Cash loans | 13072.185 | 90000.0 | 110146.5 | 90000.0 | Refused | 2017-04-17 08:37:23 | -05:00 | HC | XNA | 12.0 | high | Cash Street: high | 2017-04-18 06:00:00 |
5 | 1232927 | 454931 | Cash loans | 20317.590 | 270000.0 | 291919.5 | 270000.0 | Approved | 2022-01-24 11:54:40 | -05:00 | XAP | XNA | 18.0 | low_normal | Cash X-Sell: low | 2022-01-25 06:47:00 |
6 | 2800956 | 366231 | Consumer loans | 17939.250 | 153000.0 | 153000.0 | 153000.0 | Approved | 2022-05-28 15:31:15 | -07:00 | XAP | Clothing and Accessories | 10.0 | middle | POS industry with interest | 2022-05-29 06:58:00 |
7 | 1702309 | 211842 | Consumer loans | NaN | 52380.0 | 52380.0 | 52380.0 | Unused offer | 2018-12-23 19:15:59 | -06:00 | CLIENT | Mobile | NaN | XNA | POS mobile with interest | 2018-12-24 06:26:00 |
8 | 1705998 | 150619 | Consumer loans | NaN | 41796.0 | 41796.0 | 41796.0 | Unused offer | 2018-07-31 19:35:49 | -05:00 | CLIENT | Mobile | NaN | XNA | POS mobile with interest | 2018-08-01 06:16:00 |
9 | 1633822 | 322184 | Consumer loans | 10229.940 | 144900.0 | 144900.0 | 144900.0 | Refused | 2020-03-17 16:58:06 | -06:00 | LIMIT | Furniture | 18.0 | middle | POS industry with interest | 2020-03-18 06:24:00 |
Register Tables in the Catalog¶
For accurate feature derivation, FeatureByte needs to recognize the roles of different tables.
Each table should be assigned a specific type based on its structure and purpose:
NEW_APPLICATION --> Dimension table.
Why Dimension Table?: While we could have registered this as an Event Table, this table contains only a subset of the applications. By setting it as a Dimension Table, we disable aggregations, preventing potentially non-meaningful computations.
PRIOR_APPLICATIONS --> Event table.
Why Event Table?: The table records final decision events for prior applications, making it suitable for an Event Table designation.
CONSUMER_LOAN_STATUS --> Slowly Changing Dimension (SCD) table.
Why Slowly Changing Dimension Table?: The table tracks loan statuses and dynamic fields that change over time, making it a Slowly Changing Dimension (SCD) Table.
CONSUMER_INSTALLMENTS --> Time Series table.
Why Time Series Table?: While we could have registered this as an Event Table, installment payments occur monthly. By defining it as a Time Series Table, we ensure calendar month aggregation, aligning with the event frequency.
If you are interested in a use case that uses an Item table, check out the Grocery SDK Tutorials.
Let's register each table using its respective type:
application_table = application_source_table.create_dimension_table(
name="NEW_APPLICATION",
dimension_id_column="APPLICATION_ID",
record_creation_timestamp_column="available_at",
)
prior_applications_table = prior_applications_source_table.create_event_table(
name="PRIOR_APPLICATIONS",
event_id_column="APPLICATION_ID",
event_timestamp_column="DECISION_DATE",
event_timestamp_timezone_offset_column="TIMEZONE_OFFSET",
record_creation_timestamp_column="available_at",
)
consumer_loans_table = consumer_loans_source_table.create_scd_table(
name="CONSUMER_LOAN_STATUS",
natural_key_column="LOAN_ID",
effective_timestamp_column="EFFECTIVE_TIMESTAMP",
effective_timestamp_schema=fb.TimestampSchema(
format_string="YYYY-MM-DD",
is_utc_time=True,
),
end_timestamp_column="END_TIMESTAMP",
end_timestamp_schema=fb.TimestampSchema(
format_string="YYYY-MM-DD",
is_utc_time=True,
),
record_creation_timestamp_column="available_at",
)
consumer_installments_table = consumer_installments_source_table.create_time_series_table(
name="CONSUMER_INSTALLMENTS",
reference_datetime_column="ACTUAL_INSTALLMENT_DATE",
reference_datetime_schema=fb.TimestampSchema(
is_utc_time=True,
timezone=fb.TimeZoneColumn(column_name="TIMEZONE_OFFSET", type="offset"),
),
time_interval=fb.TimeInterval(value=1, unit="MONTH"),
series_id_column="LOAN_ID",
record_creation_timestamp_column="available_at",
)
After this we will be able to see our tables in the catalog:
display(catalog.list_tables())
id | name | type | status | entities | created_at | |
---|---|---|---|---|---|---|
0 | 67c2c752924afe7a79ec6f27 | CONSUMER_INSTALLMENTS | time_series_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:38.763000 |
1 | 67c2c750924afe7a79ec6f26 | CONSUMER_LOAN_STATUS | scd_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:36.903000 |
2 | 67c2c74e924afe7a79ec6f25 | PRIOR_APPLICATIONS | event_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:34.829000 |
3 | 67c2c74c924afe7a79ec6f24 | NEW_APPLICATION | dimension_table | PUBLIC_DRAFT | None | 2025-03-01T08:37:32.806000 |
Initialize Feature Job Setting for an event table¶
Feature Job Setting defines when and how frequently features are updated in the feature store. These settings also include a blind spot period, which represents the time gap between when a feature is computed and the latest available event.
Why is the Blind Spot Important? In an ideal scenario, we would include all prior applications up to the present in our feature computation. However, due to the time required for data collection, ETL processing, and other pipeline delays, the most recent applications may not be immediately available.
Setting the blind spot too small may result in data leakage during model training, as the production environment might not have access to the most recent data at inference time. Conversely, setting it too large may lead to stale feature values, reducing predictive performance.
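The trade-off can be made concrete with a little arithmetic (a plain-Python sketch, not FeatureByte API): with a blind spot of 25,800 s (7 h 10 min), a feature job that runs daily at 07:09 only aggregates events up to 23:59 the previous day.

```python
from datetime import datetime, timedelta

def aggregation_cutoff(job_time: datetime, blind_spot: timedelta) -> datetime:
    """Latest event timestamp a feature job is allowed to aggregate.

    Events newer than this are ignored so that training-time features
    match what production would actually have had available.
    """
    return job_time - blind_spot

# Hypothetical daily job at 07:09 with a 25,800 s (7 h 10 min) blind spot
job_time = datetime(2025, 3, 1, 7, 9)
cutoff = aggregation_cutoff(job_time, timedelta(seconds=25800))
print(cutoff)  # 2025-02-28 23:59:00
```

A smaller blind spot would push this cutoff closer to the job time, at the risk of counting events in training that production would not yet have seen.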
Automated Feature Job Settings with FeatureByte
FeatureByte intelligently determines the optimal feature job settings for event tables by automatically adjusting:
- Blind Spot: Ensuring that data is available and aligned with the production scenario.
- Update Frequency (Period): Setting an appropriate interval for feature updates.
- Offset: Delaying computation to accommodate data pipeline scheduling.
prior_applications_table.initialize_default_feature_job_setting()
Done! |████████████████████████████████████████| 100% in 9.1s (0.11%/s)
The analysis period starts at 2025-01-31 23:59:11 and ends at 2025-02-28 23:59:11
The column used for the event timestamp is DECISION_DATE
The column used for the record creation timestamp for PRIOR_APPLICATIONS is available_at
STATISTICS ON TIME BETWEEN PRIOR_APPLICATIONS RECORDS CREATIONS
- Average time is 1913.7940379403794 s
- Median time is 60.0 s
- Lowest time is 60.0 s
- Largest time is 83040.0 s
based on a total of 1207 unique record creation timestamps.
PRIOR_APPLICATIONS UPDATE TIME starts 6.0 hours and ends 6.0 hours 59.0 minutes after the start of each 24 hours
This includes a buffer of 600 s to allow for late jobs.
Search for optimal blind spot
- blind spot for 99.5 % of events to land: 25200 s
- blind spot for 99.9 % of events to land: 25800 s
- blind spot for 99.95 % of events to land: 25800 s
- blind spot for 99.99 % of events to land: 25800 s
- blind spot for 99.995 % of events to land: 25800 s
- blind spot for 100.0 % of events to land: 25800 s
In SUMMARY, the recommended FEATUREJOB DEFAULT setting is:
period: 86400
offset: 25740
blind_spot: 25800
The resulting FEATURE CUTOFF offset is 86340 s.
For a feature cutoff at 86340 s:
- time for 99.5 % of events to land: 25800 s
- time for 99.9 % of events to land: 25800 s
- time for 99.95 % of events to land: 25800 s
- time for 99.99 % of events to land: 25800 s
- time for 99.995 % of events to land: 25800 s
- time for 100.0 % of events to land: 25800 s
- Period = 86400 s / Offset = 25740 s / Blind spot = 25800 s
The backtest found that all records would have been processed on time.
- Based on the past records created from 2025-01-31 23:00:00 to 2025-02-28 23:00:00, the table is regularly updated 6.0 hours after the start of each 24 hours within a 59.0 minutes interval. No job failure or late job has been detected.
- The features computation jobs are recommended to be scheduled after the table updates completion and be set 7 hours 9 minutes after the start of each 24 hours.
- Based on the analysis of the records latency, the blind_spot parameter used to determine the window of the features aggregation is recommended to be set at 25800 s.
- period: 86400 s
- offset: 25740 s
- blind_spot: 25800 s
To override default feature job settings, see update_default_feature_job_setting.
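A manual override might look like the following configuration sketch. The parameter names assume featurebyte's FeatureJobSetting; the values are taken from the analysis output above, so treat this as illustrative rather than required:

```python
import featurebyte as fb

# Sketch only: replace the analyzed defaults with an explicit setting.
manual_setting = fb.FeatureJobSetting(
    period="86400s",      # run feature jobs every 24 hours
    offset="25740s",      # 7 h 9 min after the period start
    blind_spot="25800s",  # ignore events from the last 7 h 10 min
)
prior_applications_table.update_default_feature_job_setting(manual_setting)
```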
Set Feature Job Setting for Time Series¶
For time series tables, the feature job setting consists of three key parameters:
Crontab: Defines the cron schedule for the feature job, specifying when the job should run.
Time Zone: Determines the time zone in which the cron schedule operates, ensuring alignment with local time conventions.
Reference Time Zone: Specifies the time zone used to define calendar-based aggregation periods (e.g., daily, weekly, or monthly). This reference time zone ensures consistency when calculating calendar periods across different data time zones.
- For example, if the scheduled job is `2025/01/31 23:00 UTC` and the reference time zone is `Asia/Singapore`, the corresponding calendar date would be `2025/02/01`. Consequently, the aggregation for the latest complete month would cover January.
- If a time zone column is used to assign individual time zones per record, the reference time zone should be the westernmost time zone among those specified in the column. This ensures that aggregation periods fully encompass the calendar dates of all relevant observations.
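The first example above can be checked with the standard library: converting the scheduled job time from UTC to the reference time zone yields the calendar date used for aggregation.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Scheduled job time: 2025/01/31 23:00 UTC
scheduled = datetime(2025, 1, 31, 23, 0, tzinfo=timezone.utc)

# Convert to the reference time zone to get the calendar date
local = scheduled.astimezone(ZoneInfo("Asia/Singapore"))
print(local.date())  # 2025-02-01, so the latest complete month is January
```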
job_setting = fb.CronFeatureJobSetting(
crontab="0 13 1 * *",
timezone="Etc/UTC",
timezone_reference="America/Los_Angeles",
)
consumer_installments_table.update_default_feature_job_setting(job_setting)
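As a sanity check on the crontab string: `0 13 1 * *` fires at 13:00 (in the job's time zone) on the first day of every month. A tiny matcher (a hypothetical helper handling only plain values and `*`, not ranges or steps) illustrates the five-field semantics:

```python
def cron_matches(crontab: str, minute: int, hour: int, dom: int,
                 month: int, dow: int) -> bool:
    """Check a time against a crontab of plain numbers and '*' wildcards."""
    fields = crontab.split()  # minute, hour, day-of-month, month, day-of-week
    values = (minute, hour, dom, month, dow)
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

# "0 13 1 * *" runs at 13:00 on the 1st of any month, any weekday
assert cron_matches("0 13 1 * *", 0, 13, 1, 7, 2)
assert not cron_matches("0 13 1 * *", 0, 14, 1, 7, 2)
```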