5. Update Descriptions and Tag Semantics
What is Semantic Tagging?
Semantic tagging is the process of associating each column with a semantic type.
A semantic type defines a column's meaning, expected values, and suitable feature engineering operations. By linking columns to semantic types, you ensure that data is properly transformed, aggregated, and utilized for analysis and machine learning.
FeatureByte automates semantic tagging through Generative AI, which analyzes column metadata and suggests an appropriate semantic type. However, you can manually assign or refine these types at the table level. If needed, semantic tags can also be overwritten during Feature Ideation to fine-tune feature engineering strategies.
By structuring this process within a data ontology, FeatureByte enables a systematic approach to selecting relevant feature engineering techniques while minimizing manual effort.
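As a rough illustration of how an ontology can tie semantic types to feature engineering operations, consider the minimal Python sketch below. It does not use the FeatureByte SDK: the `SemanticType` class, the `suggest_aggregations` helper, the column names, and the lists of allowed operations are all hypothetical and exist only for this example. Only the semantic type names are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticType:
    """Hypothetical ontology node: a meaning plus the operations that suit it."""
    name: str
    description: str
    allowed_aggregations: tuple


# Illustrative ontology. The operation lists are assumptions, not FeatureByte's rules.
ONTOLOGY = {
    "non_additive_numeric": SemanticType(
        "non_additive_numeric",
        "numeric values where direct addition is not meaningful",
        ("avg", "min", "max", "latest"),  # deliberately no "sum"
    ),
    "event_status": SemanticType(
        "event_status",
        "state, condition, or outcome of an event",
        ("count_by_category", "latest"),
    ),
    "not_to_use": SemanticType(
        "not_to_use",
        "data that should not be used for feature engineering",
        (),  # no operations at all
    ),
}


def suggest_aggregations(column_tags):
    """Map each tagged column to the aggregations its semantic type allows.

    Columns whose tag is missing from the ontology are skipped, mirroring
    columns that are left semantically untagged.
    """
    return {
        column: ONTOLOGY[tag].allowed_aggregations
        for column, tag in column_tags.items()
        if tag in ONTOLOGY
    }


# Hypothetical column-to-tag assignment for one table.
tags = {
    "INTEREST_RATE": "non_additive_numeric",
    "PAYMENT_STATUS": "event_status",
    "CUSTOMER_EMAIL": "not_to_use",
    "FREE_TEXT_NOTE": "untagged",
}
suggestions = suggest_aggregations(tags)
```

In this sketch, a tag change on a column changes the operations that would be considered for it, which is the effect that refining semantic types is meant to have.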
Why is it important?
Accurate descriptions and semantic tagging of data fields and tables are essential for enhancing Feature Ideation’s recommendations, enabling more relevant data aggregations, transforms, filters, and feature combinations. While Feature Ideation can operate without descriptions, including them leads to better feature selection and model performance.
Step 1: Update Table Descriptions
Note
This step is optional if your Data Warehouse already includes good table descriptions.
- From the menu, navigate to the 'Explore' section and open the Table Catalog. If you're on the Table Diagram page, click the icon that returns you to the Table Catalog.
- Verify the following table descriptions:

| Table | Description |
| --- | --- |
| NEW_APPLICATION | Records new loan applications. |
| PRIOR_APPLICATIONS | Contains data on prior loan applications and the final decision. |
| CONSUMER_LOAN_STATUS | Tracks consumer loan status. |
| CONSUMER_INSTALLMENTS | Logs monthly installments for consumer loans at the time of payment. |

- Edit the table's description if needed:
  - Select the table from the Table Catalog and navigate to the 'About' tab.
  - Update the description by clicking the icon next to the description field.
Step 2: Update Column Descriptions
Note
This step is optional if your Data Warehouse already includes good column descriptions.
- Select a table from the Table Catalog and navigate to the 'Columns' tab.
- Update the description by clicking the icon next to the description field of the column.
Step 3: Tag Semantics
Note
Semantic Tagging is not required at the table level, as Feature Ideation will automatically infer and fill in missing semantic tags.
However, it is a best practice to verify that column descriptions are accurate and manually assign semantic types at the table level when needed.
Here, we run semantic tagging for the CONSUMER_INSTALLMENTS table only and leave the columns of the other tables semantically untagged.
- Select the table from the Table Catalog and navigate to the 'Columns' tab.
- Click the icon that runs semantic tagging.
- Review the suggestions provided.
- Click the accept icon to accept the suggested semantic type for a column. You can also adjust the suggestion or leave the column semantically untagged.
- To accept all semantic suggestions at once, click the icon provided for that purpose.
Which Semantic Type Should You Focus On?
When working with different table types, pay close attention to specific semantic types, as they influence filtering strategies, data aggregation, and feature engineering choices.
In an Event Table or a Time Series Table, review the event_type (categorization of events based on their primary purpose or nature) and event_status (state, condition, or outcome of an event) semantic types. Columns tagged with these types guide event-based filtering strategies.
In a Slowly Changing Dimension Table, review the termination_timestamp and termination_date semantic types, which indicate when an entity is actively terminated, sometimes prematurely. Columns tagged with these types determine how active entities are aggregated and when terminated entities should be analyzed.
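To make event-based filtering concrete, the following Python sketch counts events per entity after filtering on a status column. It is an assumption-laden illustration rather than FeatureByte's implementation: the `count_by_status` helper, the sample rows, and the column names (`loan_id`, `status`, `amount`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical installment events; names and values are made up for the example.
events = [
    {"loan_id": "L1", "status": "PAID", "amount": 120.0},
    {"loan_id": "L1", "status": "LATE", "amount": 120.0},
    {"loan_id": "L2", "status": "PAID", "amount": 80.0},
    {"loan_id": "L1", "status": "PAID", "amount": 120.0},
]


def count_by_status(rows, entity_key, status_column, status_value):
    """Count events per entity, keeping only rows with the given status."""
    counts = defaultdict(int)
    for row in rows:
        if row[status_column] == status_value:
            counts[row[entity_key]] += 1
    return dict(counts)


# A column tagged as event_status tells us which column is worth filtering on.
late_counts = count_by_status(events, "loan_id", "status", "LATE")
paid_counts = count_by_status(events, "loan_id", "status", "PAID")
```

A feature such as "number of late installments per loan" only makes sense once the status column has been identified, which is why the event_status tag deserves a careful review.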
For all tables, review:
- the non_additive_numeric semantic type (numeric values where direct addition is not meaningful). Identifying these columns prevents incorrect sum operations.
- the automated non_informative semantic type (column with a constant value). This may indicate problems in your data.
- the not_to_use semantic type (sensitive, personal, operational, or unreliable data that should not be used). This tag determines whether feature engineering is applied to those columns.
- the ambiguous_numeric (column that combines different units or scales) and ambiguous_categorical (column that does not provide unique information by itself) semantic types. These columns may require manual transformation before they can be used for feature engineering.
By carefully reviewing these semantic types, you can enhance feature selection and ensure high-quality transformations for machine learning.
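Two of the checks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how FeatureByte assigns or enforces semantic types: `is_non_informative`, `safe_sum`, and the `NON_SUMMABLE` set are hypothetical names, and which types block a sum is an assumption made for this example.

```python
def is_non_informative(values):
    """Return True when a column has at most one distinct non-null value."""
    distinct = {value for value in values if value is not None}
    return len(distinct) <= 1


# Assumption for this sketch: semantic types for which a plain sum is refused.
NON_SUMMABLE = {
    "non_additive_numeric",
    "ambiguous_numeric",
    "non_informative",
    "not_to_use",
}


def safe_sum(values, semantic_type):
    """Sum non-null values unless the column's semantic type forbids it."""
    if semantic_type in NON_SUMMABLE:
        raise ValueError(
            f"sum is not meaningful for semantic type {semantic_type!r}"
        )
    return sum(value for value in values if value is not None)


# A constant column is flagged; a varying one is not.
constant_flag = is_non_informative(["EUR", "EUR", None, "EUR"])
varying_flag = is_non_informative([10.0, 12.5, 10.0])
```

Calling `safe_sum` on a column tagged non_additive_numeric raises an error instead of silently producing a misleading total, which mirrors the reason for reviewing that tag before features are generated.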