4. Add descriptions to tables
(Optional) Adding descriptions to tables and columns¶
While not mandatory, adding concise descriptions to tables and columns can be immensely beneficial. These annotations assist FeatureByte's feature ideation engine in generating insightful features.
Much like a data scientist, FeatureByte does its best to grasp the significance and purpose of various tables and columns, discerning their types and more. Based on this understanding, it suggests pertinent aggregations and feature combinations. While FeatureByte can operate effectively without these descriptions, having them certainly enhances the quality of its recommendations.
import featurebyte as fb
# Set your profile to the tutorial environment
fb.use_profile("tutorial")
catalog_name = "Grocery Dataset Tutorial"
catalog = fb.Catalog.activate(catalog_name)
21:57:30 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml 21:57:30 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1) 21:57:30 | WARNING | Remote SDK version (0.5.0.dev6) is different from local (0.5.0.dev1). Update local SDK to avoid unexpected behavior. 21:57:30 | INFO | No catalog activated. 21:57:30 | INFO | 6 feature lists, 31 features deployed 21:57:30 | INFO | Using profile: tutorial 21:57:31 | INFO | Using configuration file at: /Users/gxav/.featurebyte/config.yaml 21:57:31 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1) 21:57:31 | WARNING | Remote SDK version (0.5.0.dev6) is different from local (0.5.0.dev1). Update local SDK to avoid unexpected behavior. 21:57:31 | INFO | No catalog activated. 21:57:31 | INFO | 6 feature lists, 31 features deployed 21:57:32 | INFO | Catalog activated: Grocery Dataset Tutorial
Get tables from the catalog first:
catalog.list_tables()
id | name | type | status | entities | created_at | |
---|---|---|---|---|---|---|
0 | 64ff1c940d5bfbfb21bce78c | GROCERYPRODUCT | dimension_table | PUBLIC_DRAFT | [product, productgroup] | 2023-09-11T13:56:37.144000 |
1 | 64ff1c920d5bfbfb21bce78b | INVOICEITEMS | item_table | PUBLIC_DRAFT | [item, invoice, product] | 2023-09-11T13:56:35.927000 |
2 | 64ff1c910d5bfbfb21bce78a | GROCERYINVOICE | event_table | PUBLIC_DRAFT | [invoice, customer] | 2023-09-11T13:56:34.182000 |
3 | 64ff1c8f0d5bfbfb21bce789 | GROCERYCUSTOMER | scd_table | PUBLIC_DRAFT | [customer, frenchstate] | 2023-09-11T13:56:32.993000 |
customer_table = catalog.get_table("GROCERYCUSTOMER")
invoice_table = catalog.get_table("GROCERYINVOICE")
items_table = catalog.get_table("INVOICEITEMS")
product_table = catalog.get_table("GROCERYPRODUCT")
Let's update descriptions of tables:¶
customer_table.update_description('Customer details, including their name, address, and date of birth')
invoice_table.update_description('Grocery invoice details, containing the timestamp and the total amount of the invoice')
items_table.update_description('The grocery product item details within each invoice, including the quantity, total cost, discount applied, and product ID')
product_table.update_description('The product group description for each grocery product')
And do the same for each column as well:¶
# update descriptions for GROCERYCUSTOMER table
customer_table.update_column_description("RowID", "Unique identifier of each row in the table, in GUID format. Uniquely identifies each customer and version combination.")
customer_table.update_column_description("GroceryCustomerGuid", "Unique identifier for each customer, in GUID format.")
customer_table.update_column_description("ValidFrom", "GMT timestamp of when this version of a customer's details becomes valid or live")
customer_table.update_column_description("CurrentRecord", "Flags whether this version of the customer's details is the latest known version")
customer_table.update_column_description("Gender", "The customer's gender. Can only have values male or female.")
customer_table.update_column_description("Title", "The customer's title. Can only have values Dr. Mr. Mrs. or Ms.")
customer_table.update_column_description("GivenName", "The customer's given name")
customer_table.update_column_description("MiddleInitial", "The first letter of the customer's middle name.")
customer_table.update_column_description("Surname", "The customer's family or surname.")
customer_table.update_column_description("StreetAddress", "The customer address's building number and street name.")
customer_table.update_column_description("City", "The city name of the customer's address.")
customer_table.update_column_description("State", "The state name of the customer's address.")
customer_table.update_column_description("PostalCode", "The postal code of the customer's address. Contains only digits but is a categorical variable.")
customer_table.update_column_description("BrowserUserAgent", "The user agent details of the customer's internet browser, including the operating system name and version, and the browser name and version.")
customer_table.update_column_description("DateOfBirth", "The customer's date of birth.")
customer_table.update_column_description("Latitude", "The latitude of the customer's address.")
customer_table.update_column_description("Longitude", "The longitude of the customer's address.")
customer_table.update_column_description("record_available_at", "A timestamp for when this row was added to the database.")
# update descriptions for GROCERYINVOICE table
invoice_table.update_column_description("GroceryInvoiceGuid", "Unique identifier of each row in the table, in GUID format. Uniquely identifies each invoice.")
invoice_table.update_column_description("GroceryCustomerGuid", "Unique identifier for each customer, in GUID format.")
invoice_table.update_column_description("Timestamp", "The GMT timestamp of when this invoice transaction event occurred.")
invoice_table.update_column_description("tz_offset", "The local timezone offset of the invoice event.")
invoice_table.update_column_description("record_available_at", "A timestamp for when this row was added to the database.")
invoice_table.update_column_description("Amount", "The total amount of the invoice, including all items and any discounts applied. Cannot be negative.")
# update descriptions for INVOICEITEMS table
items_table.update_column_description("GroceryInvoiceItemGuid", "Unique identifier of each row in the table, in GUID format. Uniquely identifies each item type within an invoice.")
items_table.update_column_description("GroceryInvoiceGuid", "Identifies the invoice that the item belongs to. Stored as a GUID.")
items_table.update_column_description("GroceryProductGuid", "The type of product that was purchased. Stored as a GUID. Not human-readable.")
items_table.update_column_description("Quantity", "The purchased count of this product within this invoice.")
items_table.update_column_description("UnitPrice", "The cost in euros per item. Cannot be negative.")
items_table.update_column_description("TotalCost", "The total cost in euros for this product. Equals Quantity multiplied by UnitCost.")
items_table.update_column_description("Discount", "The discount amount, in euros, applied to this product purchase. Cannot be negative and cannot exceed the gross item cost.")
items_table.update_column_description("record_available_at", "A timestamp for when this row was added to the database.")
# update descriptions for GROCERYPRODUCT table
product_table.update_column_description("GroceryProductGuid", "Uniquely identifies each row in this table. Uniquely identifies a product. Stored as a GUID.")
product_table.update_column_description("ProductGroup", "Text description of the product group.")
Alright, descriptions are updated. We can verify our changes has taken an effect:
customer_table.description
'Customer details, including their name, address, and date of birth'
invoice_table.columns_info
[ColumnInfo(name='GroceryInvoiceGuid', dtype='VARCHAR', entity_id=ObjectId('64ff1caa36a87a52b58efadd'), critical_data_info=None, semantic_id=ObjectId('64b9571c9a639afe12ac4e73'), description='Unique identifier of each row in the table, in GUID format. Uniquely identifies each invoice.'), ColumnInfo(name='GroceryCustomerGuid', dtype='VARCHAR', entity_id=ObjectId('64ff1ca936a87a52b58efadc'), critical_data_info=None, semantic_id=None, description='Unique identifier for each customer, in GUID format.'), ColumnInfo(name='Timestamp', dtype='TIMESTAMP', entity_id=None, critical_data_info=None, semantic_id=ObjectId('64b9571b9a639afe12ac4e71'), description='The GMT timestamp of when this invoice transaction event occurred.'), ColumnInfo(name='tz_offset', dtype='VARCHAR', entity_id=None, critical_data_info=None, semantic_id=ObjectId('64e5f504c83f3170e31736fc'), description='The local timezone offset of the invoice event.'), ColumnInfo(name='record_available_at', dtype='TIMESTAMP', entity_id=None, critical_data_info=None, semantic_id=ObjectId('64e5f504c83f3170e31736fa'), description='A timestamp for when this row was added to the database.'), ColumnInfo(name='Amount', dtype='FLOAT', entity_id=None, critical_data_info=None, semantic_id=None, description='The total amount of the invoice, including all items and any discounts applied. Cannot be negative.')]
That's it for this tutorial. Again, this is an optional step, but it can drastically improve FeatureByte's feature ideation.
SDK reference for¶