5. Update Descriptions and Tag Semantics
What is Semantic Tagging?
Semantic tagging is the process of associating each column with a semantic type.
A semantic type defines a column's meaning, expected values, and suitable feature engineering operations. By linking columns to semantic types, you ensure that data is properly transformed, aggregated, and utilized for analysis and machine learning.
FeatureByte automates semantic tagging through Generative AI, which analyzes column metadata and suggests an appropriate semantic type. However, you can manually assign or refine these types at the table level. If needed, semantic tags can also be overwritten during Feature Ideation to fine-tune feature engineering strategies.
By structuring this process within a data ontology, FeatureByte enables a systematic approach to selecting relevant feature engineering techniques while minimizing manual effort.
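The idea of a semantic type carrying a meaning and a set of suitable operations can be sketched as a small lookup structure. This is only an illustration and not FeatureByte's internal representation: the `SemanticType` class, the operation names, and the column names `unit_price` and `channel` are all hypothetical. Only the type names and their meanings come from this tutorial.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SemanticType:
    """Hypothetical record for one node of a data ontology."""
    name: str
    meaning: str
    suitable_operations: Tuple[str, ...]


# Simplified, assumed ontology entries; the operation lists are illustrative.
ONTOLOGY: Dict[str, SemanticType] = {
    "non_additive_numeric": SemanticType(
        name="non_additive_numeric",
        meaning="numeric values where direct addition is not meaningful",
        suitable_operations=("mean", "min", "max"),
    ),
    "event_type": SemanticType(
        name="event_type",
        meaning="categorization of events based on their primary purpose or nature",
        suitable_operations=("filter", "count_by_category"),
    ),
}

# Hypothetical column-to-type assignments (the result of semantic tagging).
COLUMN_TAGS: Dict[str, str] = {
    "unit_price": "non_additive_numeric",
    "channel": "event_type",
}


def suitable_operations(column: str) -> Tuple[str, ...]:
    """Return the operations allowed for a column, or () if it is untagged."""
    tag = COLUMN_TAGS.get(column)
    if tag is None or tag not in ONTOLOGY:
        return ()
    return ONTOLOGY[tag].suitable_operations


print(suitable_operations("unit_price"))
```

Because `unit_price` is tagged as non-additive, a sum never appears among its candidate operations, which is the kind of guardrail the ontology is meant to provide.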
Why is it important?
Accurate descriptions and semantic tags for tables and columns improve Feature Ideation’s recommendations, enabling more relevant data aggregations, transforms, filters, and feature combinations. Feature Ideation can operate without descriptions, but including them leads to better feature selection and model performance.
Step 1: Update Table Descriptions
Note
This step is optional if your Data Warehouse already includes table descriptions.
- From the menu, navigate to the 'Explore' section and open the Table Catalog. If you're on the Table Diagram page, click the icon that returns you to the Table Catalog.
- Verify the following table descriptions:

| Table | Description |
| --- | --- |
| GROCERYCUSTOMER | Customer details, including their name, address, and date of birth |
| GROCERYINVOICE | Grocery invoice details, containing the timestamp and the total amount of the invoice |
| INVOICEITEMS | Details of grocery product items within each invoice, including quantity, total cost, discount applied, and product ID |
| GROCERYPRODUCT | Product group description for each grocery product |

- Edit the table's description if needed:
  - Select the table from the Table Catalog and navigate to the 'About' tab.
  - Update the description by clicking the icon next to the description field.
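If you prefer to track the verification outside the UI, the comparison can be sketched in a few lines of Python. This sketch does not call FeatureByte: the `current` dictionary stands in for descriptions you have copied from each table's 'About' tab, and the function name `find_description_gaps` is hypothetical.

```python
from typing import Dict, Optional

# Expected descriptions, taken from the list in this step.
EXPECTED_DESCRIPTIONS: Dict[str, str] = {
    "GROCERYCUSTOMER": "Customer details, including their name, address, and date of birth",
    "GROCERYINVOICE": "Grocery invoice details, containing the timestamp and the total amount of the invoice",
    "INVOICEITEMS": "Details of grocery product items within each invoice, including quantity, total cost, discount applied, and product ID",
    "GROCERYPRODUCT": "Product group description for each grocery product",
}


def find_description_gaps(current: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Return {table: 'missing' | 'differs'} for tables that need attention."""
    gaps: Dict[str, str] = {}
    for table, expected in EXPECTED_DESCRIPTIONS.items():
        actual = (current.get(table) or "").strip()
        if not actual:
            gaps[table] = "missing"
        elif actual != expected:
            gaps[table] = "differs"
    return gaps


# Hypothetical state of the catalog: one description is empty, one is too short.
current = dict(EXPECTED_DESCRIPTIONS)
current["GROCERYPRODUCT"] = ""
current["INVOICEITEMS"] = "Invoice items"

print(find_description_gaps(current))
```

Any table reported here is one whose description you would then edit from the 'About' tab.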
Step 2: Update Column Descriptions
Note
This step is optional if your Data Warehouse already includes column descriptions.
- Select a table from the Table Catalog and navigate to the 'Columns' tab.
- Update the description by clicking the icon next to the description field of the column.
Step 3: Tag Semantics
Note
Semantic Tagging is not required at the table level, as Feature Ideation will automatically infer and fill in missing semantic tags.
However, it is a best practice to verify that column descriptions are accurate and manually assign semantic types at the table level when needed.
In this tutorial, we will leave the columns semantically untagged. If you want to tag semantics at the table level, follow these steps:
- Select one table from the Table Catalog.
- Go to the 'Columns' tab.
- Click the semantic tagging icon to run semantic tagging.
- Review the suggestions provided. For each column, accept the suggestion, adjust it, or leave the column semantically untagged.
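The three review outcomes (accept, adjust, leave untagged) can be modeled as a simple merge of suggestions with reviewer decisions. This is a minimal sketch with no dependency on FeatureByte: the function `apply_review` and the column names are hypothetical, and the semantic type names are the ones mentioned in this tutorial.

```python
from typing import Dict, Optional, Set


def apply_review(
    suggestions: Dict[str, str],
    overrides: Dict[str, str],
    untagged: Set[str],
) -> Dict[str, Optional[str]]:
    """Merge suggested semantic types with the reviewer's decisions.

    - A column listed in `untagged` ends up with no semantic type (None).
    - A column listed in `overrides` takes the reviewer's type (adjust).
    - Any other column keeps the suggested type (accept).
    """
    final: Dict[str, Optional[str]] = {}
    for column, suggested in suggestions.items():
        if column in untagged:
            final[column] = None
        else:
            final[column] = overrides.get(column, suggested)
    return final


# Hypothetical suggestions for three columns.
suggestions = {
    "unit_price": "non_additive_numeric",
    "order_status": "event_type",
    "batch_code": "ambiguous_categorical",
}

result = apply_review(
    suggestions,
    overrides={"order_status": "event_status"},  # adjust
    untagged={"batch_code"},                     # leave untagged
)
print(result)
```

Columns left untagged are not lost: as noted above, Feature Ideation infers missing semantic tags on its own.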
Which Semantic Type Should You Focus On?
When working with different table types, pay close attention to specific semantic types, as they influence filtering strategies, data aggregation, and feature engineering choices.
In an Event Table or a Time Series Table, check the event_type (categorization of events based on their primary purpose or nature) and event_status (state, condition, or outcome of an event) semantic types. Columns with these types guide event-based filtering strategies.
In a Slowly Changing Dimension Table, check the termination_timestamp and termination_date semantic types, which indicate when an entity is actively terminated, sometimes prematurely. Columns with these types determine how active entities are aggregated and when terminated entities should be analyzed.
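To see why a termination timestamp matters for aggregation, consider counting the entities that are still active at a given point in time. The sketch below is a deliberately simplified assumption: it ignores record start dates and SCD versioning, and the record layout, the function `count_active`, and the sample data are all hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (entity_id, termination_timestamp); None means the entity was never terminated.
Record = Tuple[str, Optional[datetime]]


def count_active(records: List[Record], as_of: datetime) -> int:
    """Count entities not yet terminated at the point in time `as_of`."""
    return sum(
        1
        for _, terminated_at in records
        if terminated_at is None or terminated_at > as_of
    )


# Hypothetical customers with their termination timestamps.
records: List[Record] = [
    ("c1", None),
    ("c2", datetime(2023, 3, 1)),
    ("c3", datetime(2024, 1, 15)),
]

print(count_active(records, datetime(2023, 6, 1)))
print(count_active(records, datetime(2024, 2, 1)))
```

Without the termination column tagged, an aggregation like this would have no way to tell active entities from terminated ones.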
For all tables, check:
- the non_additive_numeric semantic type (numeric values where direct addition is not meaningful). Identifying these columns prevents incorrect sum operations.
- the automated non_informative semantic type (column with a constant value). This may indicate problems in your data.
- the not_to_use semantic type (sensitive, personal, operational, or non-reliable data that should not be used). This type determines whether feature engineering is applied to those columns.
- the ambiguous_numeric (column that combines different units or scales) and ambiguous_categorical (column that does not provide unique information by itself) semantic types. These columns may require manual transformations before they can be used for feature engineering.
By carefully reviewing these semantic types, you can enhance feature selection and ensure high-quality transformations for machine learning.
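The checks that apply to all tables can be sketched as two small helpers: one that spots constant columns (candidates for non_informative) and one that restricts aggregations based on the semantic type. The policy encoded here is an illustrative assumption, not FeatureByte's actual rules: the function names, the aggregation lists, and the default for other types are hypothetical.

```python
from typing import Iterable, List, Optional


def is_constant(values: Iterable[object]) -> bool:
    """True if the column has at most one distinct non-null value."""
    return len({v for v in values if v is not None}) <= 1


def allowed_aggregations(semantic_type: Optional[str]) -> List[str]:
    """Return the aggregations permitted under an assumed, simplified policy."""
    if semantic_type in ("not_to_use", "non_informative"):
        # Excluded from feature engineering altogether.
        return []
    if semantic_type in ("ambiguous_numeric", "ambiguous_categorical"):
        # Needs a manual transformation before it can be aggregated.
        return []
    if semantic_type == "non_additive_numeric":
        # Summing is not meaningful, so "sum" is left out.
        return ["mean", "min", "max"]
    # Assumed default for any other type in this sketch.
    return ["sum", "mean", "min", "max"]


print(is_constant(["EUR", "EUR", None]))
print(allowed_aggregations("non_additive_numeric"))
```

A column flagged by `is_constant` is worth investigating at the source, since a constant value often points to a data loading or filtering issue rather than a genuinely uninformative field.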