Creates an SCDTable object from a source table and adds it to the catalog. The source table contains data that changes slowly and unpredictably over time, and is also known as a Slowly Changing Dimension (SCD) table.
There are two main types of SCD tables:
- Type 1: Overwrites old data with new data.
- Type 2: Maintains a history of changes by creating a new record for each change.
FeatureByte only supports Type 2 SCD tables, since Type 1 SCD tables may cause data leaks during model training and poor performance during inference.
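The difference between the two types, and why overwriting can leak future information into training data, can be illustrated with a minimal sketch in plain Python. The customer, the `City` attribute, the dates, and the helper `city_as_of` are all hypothetical and are not part of the FeatureByte API.

```python
from datetime import datetime

# Hypothetical scenario: customer C1 moves from Paris to Lyon on 2023-06-01.

# Type 1: the old value is overwritten, so only the latest state survives.
type1_table = {"C1": {"City": "Paris"}}
type1_table["C1"]["City"] = "Lyon"  # the Paris record is lost

# Type 2: each change appends a new row with its own effective timestamp.
type2_table = [
    {"CustomerId": "C1", "City": "Paris", "ValidFrom": datetime(2022, 1, 1)},
    {"CustomerId": "C1", "City": "Lyon", "ValidFrom": datetime(2023, 6, 1)},
]


def city_as_of(rows, customer_id, ts):
    """Return the city in effect for customer_id at time ts, or None."""
    candidates = [
        r for r in rows
        if r["CustomerId"] == customer_id and r["ValidFrom"] <= ts
    ]
    if not candidates:
        return None
    # The most recent row that had already taken effect at ts wins.
    return max(candidates, key=lambda r: r["ValidFrom"])["City"]


# A training observation dated before the move.
training_point = datetime(2023, 1, 15)

# Type 1 can only return the overwritten value, which lies in the
# future of the training point (a leak).
type1_city = type1_table["C1"]["City"]

# Type 2 recovers the value that was actually valid at the training point.
type2_city = city_as_of(type2_table, "C1", training_point)
```

Here `type1_city` is the post-move value even though the observation predates the move, while `type2_city` reflects the state known at that time.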
To create an SCD Table, you need to identify the columns for the natural key and the effective timestamp, and optionally the columns for the surrogate key, the end timestamp, and the active flag.
An SCD table of Type 2 utilizes the natural key to distinguish each active row and facilitate tracking of changes over time. The SCD table employs the effective and end (or expiration) timestamp columns to determine the active status of a row. In certain instances, an active flag column may replace the expiration timestamp column to indicate if a row is currently active.
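As a rough illustration of how these columns relate, the sketch below derives an end timestamp and an active flag from the effective timestamps of each natural key, then checks which rows are active at a given time. It is plain Python, not FeatureByte's implementation. The column names `RowID`, `GroceryCustomerGuid`, `ValidFrom`, and `CurrentRecord` are borrowed from the grocery customer example, while `ValidTo`, `State`, the sample rows, and the half-open interval convention (effective timestamp inclusive, end timestamp exclusive) are assumptions made for illustration.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

# Hypothetical Type 2 rows: customer A has two versions, customer B has one.
rows = [
    {"RowID": 1, "GroceryCustomerGuid": "A",
     "ValidFrom": datetime(2022, 1, 1), "State": "Ohio"},
    {"RowID": 2, "GroceryCustomerGuid": "A",
     "ValidFrom": datetime(2022, 9, 1), "State": "Utah"},
    {"RowID": 3, "GroceryCustomerGuid": "B",
     "ValidFrom": datetime(2022, 3, 1), "State": "Iowa"},
]


def add_validity(rows):
    """Return copies of rows with derived ValidTo and CurrentRecord fields."""
    out = []
    ordered = sorted(rows, key=itemgetter("GroceryCustomerGuid", "ValidFrom"))
    for _, group in groupby(ordered, key=itemgetter("GroceryCustomerGuid")):
        versions = list(group)
        # Pair each version with the next version of the same natural key.
        for current, nxt in zip(versions, versions[1:] + [None]):
            valid_to = nxt["ValidFrom"] if nxt else None
            out.append(
                {**current, "ValidTo": valid_to, "CurrentRecord": nxt is None}
            )
    return out


def is_active(row, ts):
    """Assumed convention: active when ValidFrom <= ts < ValidTo."""
    started = row["ValidFrom"] <= ts
    not_ended = row["ValidTo"] is None or ts < row["ValidTo"]
    return started and not_ended


enriched = add_validity(rows)

# Rows active on 2022-06-01: the first version of A and the only version of B.
as_of = datetime(2022, 6, 1)
active_ids = [r["RowID"] for r in enriched if is_active(r, as_of)]
```

Under this convention at most one row per natural key is active at any point in time, which is what allows the natural key to distinguish active rows.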
After creation, the table can optionally incorporate additional metadata at the column level to further aid feature engineering. This can include marking columns that identify or reference entities, providing information about the semantics of the table columns, specifying default cleaning operations, or furnishing descriptions of its columns.
- name: str
The desired name for the new table.
- natural_key_column: str
The column that uniquely identifies active records at a given point-in-time.
- effective_timestamp_column: str
The column that represents when the record becomes effective (i.e., active).
- end_timestamp_column: Union[str, NoneType]
The optional column for the end or expiration timestamp, indicating when a record is no longer active.
- surrogate_key_column: Union[str, NoneType]
The optional column for a surrogate key that uniquely identifies each row in the table.
- current_flag_column: Union[str, NoneType]
The optional column that indicates whether a record is currently active.
- record_creation_timestamp_column: Union[str, NoneType]
The optional column for the timestamp when a record was created.
SCDTable created from the source table.
Create an SCD table from a source table.
>>> # Declare the grocery customer table
>>> source_table = ds.get_table(
...     database_name="spark_catalog",
...     schema_name="GROCERY",
...     table_name="GROCERYCUSTOMER"
... )
>>> customer_table = source_table.create_scd_table(
...     name="GROCERYCUSTOMER",
...     surrogate_key_column="RowID",
...     natural_key_column="GroceryCustomerGuid",
...     effective_timestamp_column="ValidFrom",
...     current_flag_column="CurrentRecord",
...     record_creation_timestamp_column="record_available_at"
... )