featurebyte.SourceTable.sample¶

sample(
    size: int=10,
    seed: int=1234,
    from_timestamp: Union[datetime, str, NoneType]=None,
    to_timestamp: Union[datetime, str, NoneType]=None,
    after_cleaning: bool=False
) -> DataFrame

Description¶

Returns a DataFrame containing a random sample of rows from the table, controlled by an optional time range, a maximum size, and a random seed. By default, rows are materialized before any column-level cleaning operations are applied; set after_cleaning=True to apply them.

Parameters¶

size: int
default: 10
Maximum number of rows to sample.
seed: int
default: 1234
Seed to use for random sampling.
from_timestamp: Union[datetime, str, NoneType]
default: None
Start of date range to sample from.
to_timestamp: Union[datetime, str, NoneType]
default: None
End of date range to sample from.
after_cleaning: bool
default: False
Whether to apply cleaning operations.
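
The way these parameters interact can be illustrated with a small standalone sketch. This is not featurebyte's implementation: `sample_rows`, its `timestamp_key` and `imputed_values` arguments, and the `invoices` data are hypothetical names used only for illustration. The bound handling (inclusive start, exclusive end) is an assumption, and cleaning is modeled as simple missing-value imputation.

```python
import random
from datetime import datetime


def sample_rows(rows, timestamp_key, size=10, seed=1234,
                from_timestamp=None, to_timestamp=None,
                after_cleaning=False, imputed_values=None):
    """Hypothetical stand-in for the sampling semantics described above."""
    # Keep only rows inside the requested time range.
    # Assumption: the start bound is inclusive and the end bound is exclusive.
    in_range = [
        row for row in rows
        if (from_timestamp is None or row[timestamp_key] >= from_timestamp)
        and (to_timestamp is None or row[timestamp_key] < to_timestamp)
    ]
    # `size` is a maximum, so cap it at the number of rows available.
    # A dedicated Random(seed) makes the selection repeatable.
    picked = random.Random(seed).sample(in_range, min(size, len(in_range)))
    if after_cleaning and imputed_values:
        # Model cleaning as missing-value imputation on the sampled rows.
        picked = [
            {key: imputed_values.get(key, value) if value is None else value
             for key, value in row.items()}
            for row in picked
        ]
    return picked


# Made-up invoice rows; None marks a missing Amount.
invoices = [
    {"Timestamp": datetime(2018, 6, 1), "Amount": 10.0},
    {"Timestamp": datetime(2019, 3, 15), "Amount": None},
    {"Timestamp": datetime(2021, 7, 4), "Amount": 7.5},
    {"Timestamp": datetime(2022, 11, 30), "Amount": 3.2},
    {"Timestamp": datetime(2024, 2, 1), "Amount": 8.8},
]

time_range = dict(from_timestamp=datetime(2019, 1, 1),
                  to_timestamp=datetime(2023, 12, 31))
raw = sample_rows(invoices, "Timestamp", size=3, seed=111, **time_range)
cleaned = sample_rows(invoices, "Timestamp", size=3, seed=111,
                      after_cleaning=True, imputed_values={"Amount": 0},
                      **time_range)
```

In this sketch, repeating a call with the same seed returns the same rows, and asking for more rows than the time range contains returns only the rows that exist.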

Returns¶

DataFrame
Sampled rows from the table.

Examples¶

Sample 3 rows from the table.

>>> catalog.get_table("GROCERYPRODUCT").sample(3)
                     GroceryProductGuid ProductGroup
0  e890c5cb-689b-4caf-8e49-6b97bb9420c0       Épices
1  5720e4df-2996-4443-a1bc-3d896bf98140         Chat
2  96fc4d80-8cb0-4f1b-af01-e71ad7e7104a        Pains

Sample 3 rows from a time range, with cleaning operations applied.

>>> from datetime import datetime
>>> import featurebyte as fb
>>> event_table = catalog.get_table("GROCERYINVOICE")
>>> event_table["Amount"].update_critical_data_info(
...   cleaning_operations=[
...     fb.MissingValueImputation(imputed_value=0),
...   ]
... )

>>> event_table.sample(
...   size=3,
...   seed=111,
...   from_timestamp=datetime(2019, 1, 1),
...   to_timestamp=datetime(2023, 12, 31),
...   after_cleaning=True,
... )
