{ "cells": [ { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2023-05-09T08:14:30.707049Z", "start_time": "2023-05-09T08:12:29.159554Z" }, "tags": [] }, "source": [ "### Setup Cleaning Operations \n", "\n", "Our data modeling is done, now we are ready to work with data itself. \n", "\n", "A crucial step in every data science project is ensuring the data is clean and ready for feature engineering. Issues such as missing values, disguised missing values (missing values that are not explicitly encoded as missing values), or outliers can significantly impair the quality of features and eventually the quality of the final model.\n", "\n", "FeatureByte offers an API to effectively address these concerns." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[32;20m16:40:27\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mUsing configuration file at: /Users/viktor/.featurebyte/config.yaml\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:27\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mActive profile: tutorial (https://tutorials.featurebyte.com/api/v1)\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mSDK version: 0.6.0.dev121\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mNo catalog activated.\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20m10 feature lists, 59 features deployed\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mUsing profile: tutorial\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mUsing configuration file at: /Users/viktor/.featurebyte/config.yaml\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mActive profile: tutorial (https://tutorials.featurebyte.com/api/v1)\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mSDK version: 0.6.0.dev121\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:28\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mNo catalog activated.\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:29\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20m10 feature lists, 59 features deployed\u001b[0m\u001b[0m\n", "\u001b[32;20m16:40:29\u001b[0m | \u001b[1m\u001b[38;20mINFO \u001b[0m\u001b[0m | \u001b[1m\u001b[38;20mCatalog activated: Grocery Dataset Tutorial\u001b[0m\u001b[0m\n" ] } ], "source": [ "import featurebyte as fb\n", "\n", "# Set your profile to the tutorial environment\n", "fb.use_profile(\"tutorial\")\n", "\n", "catalog_name = \"Grocery Dataset Tutorial\"\n", "catalog = fb.Catalog.activate(catalog_name) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look into descriptive statistics of Invoice Amount" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "invoice_table = catalog.get_table(\"GROCERYINVOICE\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Amount
dtypeFLOAT
unique6647
%missing0.0
%emptyNaN
entropyNaN
top1
freq798.0
mean19.171153
std23.717282
min0.0
25%4.29
50%10.59
75%24.5
max360.84
\n", "
" ], "text/plain": [ " Amount\n", "dtype FLOAT\n", "unique 6647\n", "%missing 0.0\n", "%empty NaN\n", "entropy NaN\n", "top 1\n", "freq 798.0\n", "mean 19.171153\n", "std 23.717282\n", "min 0.0\n", "25% 4.29\n", "50% 10.59\n", "75% 24.5\n", "max 360.84" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "invoice_table.Amount.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the Amount column doesn't present any anomalies, we will set default cleaning operations to ensure that if any issues happen in the future, the data will remain clean.\n", "\n", "We will here set the following cleaning operations:\n", "\n", "* ignore disguised missing values equal to -99 and -98\n", "* cap any amount less than 0 Euro\n", "* cap any amount greater than 2000 Euros\n", "\n", "Those operations will be applied by default when a [view](https://docs.featurebyte.com/latest/about/glossary/#view) is created from the table.\n", "You can however overwrite those operations by creating a view in a [manual mode](https://docs.featurebyte.com/latest/reference/featurebyte.api.event_table.EventTable.get_view/), or access a view [raw data](https://docs.featurebyte.com/latest/reference/core/view/#accessing-source-table-raw-data-in-views) " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [], "source": [ "invoice_table[\"Amount\"].update_critical_data_info(\n", " cleaning_operations=[\n", " fb.DisguisedValueImputation(disguised_values=[-99, -98], imputed_value=None),\n", " fb.ValueBeyondEndpointImputation(\n", " type=\"less_than\", end_point=0, imputed_value=0\n", " ),\n", " fb.ValueBeyondEndpointImputation(\n", " type=\"greater_than\", end_point=2000, imputed_value=2000\n", " ),\n", " ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we look at the `columns_info`, we'll see that `critical_data_info` for Amount column is populated with cleaning operations now. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t
Event Table
\n", "\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t\t\n", "\t\t\t\n", "\t\t\t\n", "\t\t\n", "\t
nameGROCERYINVOICE
created_at2023-11-27 15:38:20
updated_at2023-11-27 15:40:33
descriptionGrocery invoice details, containing the timestamp and the total amount of the invoice
statusPUBLIC_DRAFT
catalog_nameGrocery Dataset Tutorial
record_creation_timestamp_columnrecord_available_at
table_details\n", "\t\t\t\t
\n", "\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t
database_nameTUTORIAL_DATASETS
schema_nameGROCERY
table_nameGROCERYINVOICE
\n", "\t\t\t\t
\n", "\t\t\t
entities\n", "\t\t\t\t
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameserving_namescatalog_name
0invoice[GROCERYINVOICEGUID]Grocery Dataset Tutorial
1customer[GROCERYCUSTOMERGUID]Grocery Dataset Tutorial
\n", "\t\t\t
semantics['event_id', 'event_timestamp', 'record_creation_timestamp', 'time_zone']
column_count6
columns_info\n", "\t\t\t\t
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namedtypeentitysemanticcritical_data_infodescription
0GroceryInvoiceGuidVARCHARinvoiceevent_idNoneUnique identifier of each row in the table, in GUID format. Uniquely identifies each invoice.
1GroceryCustomerGuidVARCHARcustomerNoneNoneUnique identifier for each customer, in GUID format.
2TimestampTIMESTAMPNoneevent_timestampNoneThe GMT timestamp of when this invoice transaction event occurred.
3tz_offsetVARCHARNonetime_zoneNoneThe local timezone offset of the invoice event.
4record_available_atTIMESTAMPNonerecord_creation_timestampNoneA timestamp for when this row was added to the database.
5AmountFLOATNoneNone{'cleaning_operations': [{'imputed_value': None, 'type': 'disguised', 'disguised_values': [-99.0, -98.0]}, {'imputed_value': 0.0, 'type': 'less_than', 'end_point': 0.0}, {'imputed_value': 2000.0, 'type': 'greater_than', 'end_point': 2000.0}]}The total amount of the invoice, including all items and any discounts applied. Cannot be negative.
\n", "\t\t\t
event_timestamp_columnTimestamp
event_id_columnGroceryInvoiceGuid
default_feature_job_setting\n", "\t\t\t\t
\n", "\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\t\n", "\t\t\t\t\t\t\n", "\t\t\t\t\t
blind_spot120s
frequency3600s
time_modulo_frequency120s
\n", "\t\t\t\t
\n", "\t\t\t
\n", "\n" ], "text/plain": [ "{\n", " 'name': 'GROCERYINVOICE',\n", " 'created_at': '2023-11-27T15:38:20.150000',\n", " 'updated_at': '2023-11-27T15:40:33.297000',\n", " 'description': 'Grocery invoice details, containing the timestamp and the total amount of the invoice',\n", " 'status': 'PUBLIC_DRAFT',\n", " 'catalog_name': 'Grocery Dataset Tutorial',\n", " 'record_creation_timestamp_column': 'record_available_at',\n", " 'table_details': {\n", " 'database_name': 'TUTORIAL_DATASETS',\n", " 'schema_name': 'GROCERY',\n", " 'table_name': 'GROCERYINVOICE'\n", " },\n", " 'entities': [\n", " {\n", " 'name': 'invoice',\n", " 'serving_names': [\n", " 'GROCERYINVOICEGUID'\n", " ],\n", " 'catalog_name': 'Grocery Dataset Tutorial'\n", " },\n", " {\n", " 'name': 'customer',\n", " 'serving_names': [\n", " 'GROCERYCUSTOMERGUID'\n", " ],\n", " 'catalog_name': 'Grocery Dataset Tutorial'\n", " }\n", " ],\n", " 'semantics': [\n", " 'event_id',\n", " 'event_timestamp',\n", " 'record_creation_timestamp',\n", " 'time_zone'\n", " ],\n", " 'column_count': 6,\n", " 'columns_info': [\n", " {\n", " 'name': 'GroceryInvoiceGuid',\n", " 'dtype': 'VARCHAR',\n", " 'entity': 'invoice',\n", " 'semantic': 'event_id',\n", " 'critical_data_info': None,\n", " 'description': 'Unique identifier of each row in the table, in GUID format. Uniquely identifies each invoice.'\n", " },\n", " {\n", " 'name': 'GroceryCustomerGuid',\n", " 'dtype': 'VARCHAR',\n", " 'entity': 'customer',\n", " 'semantic': None,\n", " 'critical_data_info': None,\n", " 'description': 'Unique identifier for each customer, in GUID format.'\n", " },\n", " {\n", " 'name': 'Timestamp',\n", " 'dtype': 'TIMESTAMP',\n", " 'entity': None,\n", " 'semantic': 'event_timestamp',\n", " 'critical_data_info': None,\n", " 'description': 'The GMT timestamp of when this invoice transaction event occurred.'\n", " },\n", " {\n", " 'name': 'tz_offset',\n", " 'dtype': 'VARCHAR',\n", " 'entity': None,\n", " 'semantic': 'time_zone',\n", " 'critical_data_info': None,\n", " 'description': 'The local timezone offset of the invoice event.'\n", " },\n", " {\n", " 'name': 'record_available_at',\n", " 'dtype': 'TIMESTAMP',\n", " 'entity': None,\n", " 'semantic': 'record_creation_timestamp',\n", " 'critical_data_info': None,\n", " 'description': 'A timestamp for when this row was added to the database.'\n", " },\n", " {\n", " 'name': 'Amount',\n", " 'dtype': 'FLOAT',\n", " 'entity': None,\n", " 'semantic': None,\n", " 'critical_data_info': {\n", " 'cleaning_operations': [\n", " {\n", " 'imputed_value': None,\n", " 'type': 'disguised',\n", " 'disguised_values': [\n", " -99.0,\n", " -98.0\n", " ]\n", " },\n", " {\n", " 'imputed_value': 0.0,\n", " 'type': 'less_than',\n", " 'end_point': 0.0\n", " },\n", " {\n", " 'imputed_value': 2000.0,\n", " 'type': 'greater_than',\n", " 'end_point': 2000.0\n", " }\n", " ]\n", " },\n", " 'description': 'The total amount of the invoice, including all items and any discounts applied. Cannot be negative.'\n", " }\n", " ],\n", " 'event_timestamp_column': 'Timestamp',\n", " 'event_id_column': 'GroceryInvoiceGuid',\n", " 'default_feature_job_setting': {\n", " 'blind_spot': '120s',\n", " 'frequency': '3600s',\n", " 'time_modulo_frequency': '120s'\n", " }\n", "}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "invoice_table.info(verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all! Now, every time we generate a new entry from the invoice table, we can be confident that no undesirable values will slip through.\n", "\n", "### To learn more, refer to following materials: \n", "- [Cleaning Operations](https://docs.featurebyte.com/latest/about/glossary/#cleaning-operations)\n", "- [Views](https://docs.featurebyte.com/latest/about/glossary/#views-and-column-transforms)\n", "\n", "#### SDK reference for\n", "- [Table.get view()](https://docs.featurebyte.com/latest/reference/featurebyte.api.event_table.EventTable.get_view/)\n", "- [TableColumn.update_critical_data_info()](https://docs.featurebyte.com/latest/reference/featurebyte.api.base_table.TableColumn.update_critical_data_info/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }