featurebyte.ObservationTable.split

split(
split_ratios: List[float],
names: Optional[List[str]]=None,
seed: int=1234
) -> List[ObservationTable]

Description

Split the observation table into multiple tables according to the given ratios. Each split creates a new observation table containing a non-overlapping subset of the rows. Rows are assigned to splits by a seeded random assignment, so the same seed reproduces the same splits.

The first split is automatically assigned Purpose.TRAINING, while all subsequent splits are assigned Purpose.VALIDATION_TEST.
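To illustrate what a seeded, non-overlapping random assignment can look like, here is a minimal pure-Python sketch. It is not featurebyte's implementation, which this page does not describe; the helper name `assign_splits` and the per-row draw are assumptions made for illustration only. With this approach the realized split sizes only approximate the requested ratios.

```python
import random
from typing import List, Sequence


def assign_splits(n_rows: int, split_ratios: Sequence[float], seed: int = 1234) -> List[int]:
    """Hypothetical helper: return a split index for each row.

    Each row gets exactly one index, so the resulting subsets cannot overlap.
    """
    rng = random.Random(seed)  # private generator, so the seed fully fixes the result

    # Cumulative upper bounds, e.g. [0.7, 0.3] -> [0.7, 1.0]
    bounds: List[float] = []
    total = 0.0
    for ratio in split_ratios:
        total += ratio
        bounds.append(total)
    bounds[-1] = 1.0  # guard against floating-point drift in the running sum

    assignments: List[int] = []
    for _ in range(n_rows):
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(bounds):
            if draw < upper:
                assignments.append(index)
                break
    return assignments
```

Calling `assign_splits(1000, [0.7, 0.3], seed=1234)` twice returns the same list, which is the reproducibility property described above.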

Parameters

  • split_ratios: List[float]
    List of fractions (each between 0 and 1), one per split. Must sum to 1.0 and contain 2 or 3 values.
    Example: [0.7, 0.3] for a 70/30 train/test split.
    Example: [0.6, 0.2, 0.2] for a 60/20/20 train/validation/test split.

  • names: Optional[List[str]]
    default: None
    Names for the resulting tables. If None, names are auto-generated as "{name}_split_0", "{name}_split_1", etc. If provided, must have the same length as split_ratios.

  • seed: int
    default: 1234
    Random seed for reproducible splits.
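The default naming pattern for `names` can be sketched as follows. This assumes that `{name}` in the pattern refers to the name of the source observation table, which this page does not state explicitly; `default_split_names` is a hypothetical helper, not part of the featurebyte API.

```python
from typing import List


def default_split_names(table_name: str, n_splits: int) -> List[str]:
    """Hypothetical helper mirroring the "{name}_split_<i>" pattern."""
    # Assumption: {name} is the source table's name; indices start at 0.
    return [f"{table_name}_split_{i}" for i in range(n_splits)]


# Names for a two-way split of a table called "observation_table"
print(default_split_names("observation_table", 2))
```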

Returns

  • List[ObservationTable]
    List of split observation tables in the same order as split_ratios. The first table has Purpose.TRAINING, the rest have Purpose.VALIDATION_TEST.

Raises

  • ValueError
    If split_ratios is invalid (does not sum to 1, has the wrong length, or contains out-of-range values), or if the length of names does not match the length of split_ratios.
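The conditions listed under Raises can be checked up front before calling `split`. The sketch below is a hypothetical pre-check, not featurebyte's own validation: the helper name `validate_split_args`, the exclusive bounds on each ratio, and the floating-point tolerance are all assumptions.

```python
import math
from typing import Optional, Sequence


def validate_split_args(
    split_ratios: Sequence[float],
    names: Optional[Sequence[str]] = None,
) -> None:
    """Hypothetical pre-check; raises ValueError on invalid arguments."""
    if len(split_ratios) not in (2, 3):
        raise ValueError("split_ratios must contain 2 or 3 values")
    for ratio in split_ratios:
        # Assumption: each ratio must lie strictly between 0 and 1.
        if not 0 < ratio < 1:
            raise ValueError(f"ratio {ratio} is out of range")
    # Assumption: compare with a small tolerance to absorb floating-point error.
    if not math.isclose(sum(split_ratios), 1.0, abs_tol=1e-9):
        raise ValueError("split_ratios must sum to 1.0")
    if names is not None and len(names) != len(split_ratios):
        raise ValueError("names must have the same length as split_ratios")


validate_split_args([0.6, 0.2, 0.2], names=["train", "validation", "test"])
```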

Examples

Split into train (70%) and test (30%) sets:

>>> observation_table = catalog.get_observation_table("observation_table")
>>> train_table, test_table = observation_table.split(
...     split_ratios=[0.7, 0.3],
...     names=["train_data", "test_data"],
... )

Split into train (60%), validation (20%), and test (20%) sets:

>>> train, val, test = observation_table.split(
...     split_ratios=[0.6, 0.2, 0.2],
...     names=["train_data", "validation_data", "test_data"],
...     seed=42,
... )