featurebyte.EventTable.create_new_feature_job_setting_analysis
Description
Creates a new analysis of data availability and freshness of an event table in order to suggest an optimal setting for scheduling Feature Jobs and associated Blind Spot information.
This analysis relies on the presence of record creation timestamps in the source table, typically added when updating data in the warehouse. The analysis focuses on a recent time window, the past four weeks by default.
FeatureByte estimates the data update frequency based on the distribution of time intervals among the sequence of record creation timestamps. It also assesses the timeliness of source table updates and identifies late jobs using an outlier detection algorithm. By default, the recommended scheduling time takes late jobs into account.
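As a rough illustration of the idea, the sketch below estimates an update period as the most common gap between distinct record creation timestamps. It is not FeatureByte's implementation: the function name `estimate_update_period`, the `resolution` rounding step, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def estimate_update_period(creation_times, resolution=60):
    """Return the most common gap (in seconds) between distinct update batches.

    Hypothetical helper: gaps are rounded to the nearest multiple of
    `resolution` seconds so that small jitter does not split the mode.
    """
    batches = sorted(set(creation_times))
    if len(batches) < 2:
        raise ValueError("need at least two distinct creation timestamps")
    gaps = [
        round((later - earlier).total_seconds() / resolution) * resolution
        for earlier, later in zip(batches, batches[1:])
    ]
    return Counter(gaps).most_common(1)[0][0]


# Made-up data: hourly warehouse updates, each a few seconds off schedule.
start = datetime(2024, 1, 1)
jitter = [0, 4, -3, 7, 2, -5]
timestamps = [start + timedelta(hours=i, seconds=s) for i, s in enumerate(jitter)]

period = estimate_update_period(timestamps)
print(period)  # estimated update period in seconds
```

Using the mode rather than the mean keeps a single delayed update from skewing the estimate, which is one plausible reason to look at the distribution of intervals rather than their average.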
To accommodate data that may not arrive on time during warehouse updates, a blind spot is proposed for determining the cutoff for feature aggregation windows, in addition to scheduling frequency and time of the Feature Job. The suggested blind spot offers a percentage of late data closest to the user-defined tolerance, with a default of 0.005%.
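The selection rule described above can be sketched as follows, under simplifying assumptions: each record has a known delay (in seconds) between the event and its availability in the warehouse, and a record counts as late when that delay exceeds the blind spot. The helpers `late_fraction` and `pick_blind_spot`, the candidate list, and the delay data are hypothetical and only illustrate "closest to the tolerance"; the real analysis may define lateness differently.

```python
def late_fraction(delays, blind_spot):
    """Fraction of records whose delay exceeds the blind spot (in seconds)."""
    return sum(d > blind_spot for d in delays) / len(delays)


def pick_blind_spot(delays, candidates, late_data_allowance=5e-05):
    """Pick the candidate whose late fraction is closest to the tolerance."""
    return min(
        candidates,
        key=lambda b: abs(late_fraction(delays, b) - late_data_allowance),
    )


# Made-up delays: most records arrive within 30 s, a few take much longer.
delays = [30] * 990 + [90] * 8 + [400] * 2
candidates = [60, 120, 600]

for b in candidates:
    print(b, late_fraction(delays, b))

strict = pick_blind_spot(delays, candidates)          # default tolerance, 0.005%
relaxed = pick_blind_spot(delays, candidates, 0.005)  # tolerance raised to 0.5%
```

In this toy data set, the default tolerance selects the largest blind spot, while relaxing the tolerance to 0.5% allows a shorter one. That is the trade-off the `late_data_allowance` parameter controls: fresher features in exchange for more late records.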
To validate the Feature Job schedule and blind spot recommendations, a backtest is conducted. You can also backtest your own settings.
Parameters
- analysis_date: Optional[datetime]
Specifies the end date and time (in UTC) for the analysis. If not provided, the current date and time will be used.
- analysis_length: int
default: 2419200
Sets the duration of the analysis in seconds. The default value is 2,419,200 seconds (approximately 4 weeks).
- min_featurejob_period: int
default: 60
Determines the minimum period (in seconds) between feature jobs. The default value is 60 seconds.
- exclude_late_job: bool
default: False
If set to True, late jobs will be excluded from the analysis. This would assume that recent incidents won't happen again in the future.
- blind_spot_buffer_setting: int
default: 5
Defines the buffer time (in seconds) for the blind spot recommendation. The default value is 5 seconds.
- job_time_buffer_setting: Union[int, Literal["auto"]]
default: "auto"
Specifies the buffer time (in seconds) for job timing recommendations. A larger buffer reduces the risk of running a feature job before table updates are completed. If set to "auto", an appropriate buffer time will be determined automatically.
- late_data_allowance: float
default: 5e-05
Indicates the maximum acceptable percentage of late records. The default value is 0.005% (5e-05).
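Because the duration parameters are plain integers in seconds and `late_data_allowance` is given as a fraction rather than a percentage value, it can help to derive them from readable units. The snippet below is a small sketch using only the Python standard library; the variable names simply mirror the parameter names and nothing here calls FeatureByte.

```python
from datetime import timedelta

# Durations are passed as integer seconds.
analysis_length = int(timedelta(weeks=4).total_seconds())         # default window
min_featurejob_period = int(timedelta(hours=1).total_seconds())   # one hour

# The allowance is a fraction: 0.005% corresponds to 0.005 / 100.
late_data_allowance = 0.005 / 100

print(analysis_length, min_featurejob_period, late_data_allowance)
```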
Returns
- FeatureJobSettingAnalysis
Examples
Create a new feature job setting analysis on the saved event table with the following configuration:
- analysis should cover the last 2 weeks,
- recommendation for the feature job frequency period should be at least one hour,
- recent late data warehouse updates should be excluded from the analysis, as they are not expected to recur now that your instances have been upsized,
- tolerance for late data is increased to 0.5%.
>>> from datetime import datetime
>>> event_table = catalog.get_table("GROCERYINVOICE")
>>> analysis = event_table.create_new_feature_job_setting_analysis(
...     analysis_date=datetime.utcnow(),
...     analysis_length=60*60*24*7*2,  # 2 weeks, in seconds
...     min_featurejob_period=60*60,  # 1 hour, in seconds
...     exclude_late_job=True,
...     blind_spot_buffer_setting=10,
...     job_time_buffer_setting=5,
...     late_data_allowance=0.5/100,  # 0.5%, expressed as a fraction
... )