PRODUCT ProductGroup Embedding
Create dense embedding features for text columns
Textual data is hard to work with. Traditional preprocessing and encoding approaches such as Bag-of-Words and TF-IDF require cleaning the text first: stripping punctuation and special characters, removing stop words, lower-casing, and so on. Handling mixed-language data is also a challenge.
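To make that burden concrete, here is a minimal sketch of the cleaning a Bag-of-Words style pipeline usually needs before encoding. The function name clean_text and the tiny stop-word list are illustrative assumptions, not part of FeatureByte or of this tutorial's dataset.

```python
import re
import unicodedata
from typing import List

# Tiny illustrative stop-word list (an assumption; real pipelines use
# much larger, language-specific lists).
STOP_WORDS = {"the", "a", "an", "and", "of", "et", "de"}


def clean_text(text: str) -> List[str]:
    """Lower-case, strip punctuation, and drop stop words."""
    # Normalise Unicode so accented characters have one canonical form.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(clean_text("The Fruits, AND nuts!"))
```

Even this small example has to pick a stop-word list per language, which is one reason mixed-language columns are awkward to handle this way.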
Luckily, pre-trained Transformer-based models such as BERT provide an easy way to handle text data without complicated preprocessing.
In addition, they bring semantic information about the text into the model. For example, the product groups Fruits, Fruits Surgelés, and Fruits secs will be closer to each other in terms of cosine distance than to other product groups.
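As a rough illustration of what "closer in terms of cosine distance" means, the sketch below computes cosine distance for a few vectors. The 3-dimensional vectors and the "Some other group" label are made up for illustration; they are not real embeddings from this dataset, and real sentence embeddings have far more dimensions.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Made-up vectors for illustration only (not output of a real model).
toy_embeddings = {
    "Fruits": [0.9, 0.1, 0.0],
    "Fruits secs": [0.8, 0.3, 0.1],
    "Some other group": [0.0, 0.2, 0.9],
}

d_related = cosine_distance(toy_embeddings["Fruits"], toy_embeddings["Fruits secs"])
d_unrelated = cosine_distance(toy_embeddings["Fruits"], toy_embeddings["Some other group"])
print(d_related < d_unrelated)
```

A distance near 0 means the vectors point in almost the same direction, while a distance near 1 means they are close to orthogonal.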
In this notebook we will create an embedding feature of the ProductGroup.
import featurebyte as fb
fb.use_profile("tutorial")
19:02:01 | INFO | Using configuration file at: /Users/viktor/.featurebyte/config.yaml
19:02:01 | INFO | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
19:02:01 | WARNING | Remote SDK version (0.5.1.dev70) is different from local (0.5.1.dev63). Update local SDK to avoid unexpected behavior.
19:02:01 | INFO | No catalog activated.
19:02:01 | INFO | 10 feature lists, 59 features deployed
19:02:01 | INFO | Using profile: tutorial
catalog = fb.Catalog.activate("Grocery Dataset Tutorial")
19:02:02 | INFO | Catalog activated: Grocery Dataset Tutorial
Get the previously defined UDF
The function was previously created via:
embedding_udf = fb.UserDefinedFunction.create(
    name='embedding',
    sql_function_name='F_SBERT_EMBEDDING',
    function_parameters=[fb.FunctionParameter(name="x", dtype=fb.enum.DBVarType.VARCHAR)],
    output_dtype=fb.enum.DBVarType.ARRAY,
    is_global=False,
)
It can be retrieved by name as follows:
embedding_udf = fb.UserDefinedFunction.get("embedding")
# Get view from GROCERYPRODUCT dimension table.
groceryproduct_view = catalog.get_view("GROCERYPRODUCT")
Create embedding feature
groceryproduct_view["ProductGroupEmbedding"] = embedding_udf(groceryproduct_view["ProductGroup"])
product_group_embedding = groceryproduct_view["ProductGroupEmbedding"].as_feature("PRODUCT_ProductGroup_Embedding")
Get the previously created observation table
The table was created earlier via:
observation_table = invoiceitems_view.create_observation_table(
name="Preview tables with Invoice Items",
sample_rows=10,
columns=["Timestamp", "GroceryInvoiceItemGuid"],
columns_rename_mapping={
"Timestamp": "POINT_IN_TIME",
"GroceryInvoiceItemGuid": "GROCERYINVOICEITEMGUID",
},
)
Here it is retrieved by name and used to preview the feature:
observation_table = catalog.get_observation_table("Preview tables with Invoice Items")
product_group_embedding.preview(observation_table.to_pandas())
| | POINT_IN_TIME | GROCERYINVOICEITEMGUID | PRODUCT_ProductGroup_Embedding |
|---|---|---|---|
| 0 | 2022-02-27 12:19:06 | d307efab-fc40-4b16-be88-2d13b70d8903 | [-0.076364621520042, 0.022746421396732, 0.0972... |
| 1 | 2022-03-19 13:17:52 | e42054bf-fc35-4248-a279-16dc7ac8efa5 | [-0.100475512444973, 0.036730881780386006, -0.... |
| 2 | 2022-04-13 19:50:26 | b3afcbe1-dd98-41f3-bb79-9133bd316dae | [-0.025993665680289, 0.035225179046392004, -0.... |
| 3 | 2022-09-27 12:46:45 | 48156112-a586-42ee-ab1d-fc53629f438a | [-0.010095329023898, 0.024357495829463, 0.0082... |
| 4 | 2022-11-04 12:15:04 | 23563a05-76be-44e7-9fe6-6bb4e4d11d2b | [-0.076364621520042, 0.022746421396732, 0.0972... |
| 5 | 2023-03-09 12:15:30 | 57c7d176-f2e3-48d4-94a8-d7f3fbc726a3 | [-0.098438546061516, -0.047346442937851, -0.03... |
| 6 | 2023-04-20 12:17:35 | 20779d9e-69c3-4135-a42b-6e7a10819136 | [0.047602690756321, 0.037338279187679006, -0.0... |
| 7 | 2023-06-12 14:14:32 | 8717d1fa-6708-4a49-b022-b21f89d5060b | [-0.087121985852718, 0.050529576838017, -0.008... |
| 8 | 2023-07-03 13:14:46 | 4b00f6c0-0913-4608-b0f2-f344ad57481b | [-0.047038100659847, 0.008642449043691, -0.089... |
| 9 | 2023-08-09 11:17:46 | c851f86c-f52e-4b55-8311-b62f55be6945 | [-0.026694169268012, 0.168275728821754, -0.067... |
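Each row of the preview carries the embedding as a single array value. If a downstream model expects one numeric column per dimension, the arrays can be unpacked first. The following is a minimal sketch under that assumption; expand_embeddings is a hypothetical helper, not a FeatureByte function, and the vectors are made-up stand-ins for the array column.

```python
from typing import Dict, List, Sequence


def expand_embeddings(
    embeddings: Sequence[Sequence[float]], prefix: str = "emb"
) -> Dict[str, List[float]]:
    """Turn a list of equal-length vectors into one column per dimension."""
    if not embeddings:
        return {}
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same length")
    # Column i collects the i-th component of every row's vector.
    return {
        f"{prefix}_{i}": [float(vec[i]) for vec in embeddings]
        for i in range(dim)
    }


# Made-up 3-dimensional vectors standing in for the array column.
toy_column = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
columns = expand_embeddings(toy_column, prefix="ProductGroupEmbedding")
print(sorted(columns))
```

Keeping the length check explicit helps catch rows where the embedding is missing or truncated before they reach a model.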
Save the feature and view its definition file
product_group_embedding.save()
# Add description
product_group_embedding.update_description(
"Product group dense embedding via Sentence Transformers"
)
# See feature definition file
product_group_embedding.definition
# Generated by SDK version: 0.5.1.dev70
from bson import ObjectId
from featurebyte import DimensionTable
from featurebyte import UserDefinedFunction
# dimension_table name: "GROCERYPRODUCT"
dimension_table = DimensionTable.get_by_id(ObjectId("6553b0012c001f76d263e059"))
dimension_view = dimension_table.get_view(
view_mode="manual", drop_column_names=[], column_cleaning_operations=[]
)
col = dimension_view["ProductGroup"]
# udf_name: embedding, sql_function_name: F_SBERT_EMBEDDING
udf_embedding = UserDefinedFunction.get_by_id(
ObjectId("6553b511850516ee23c8b734")
)
col_1 = udf_embedding(col)
view = dimension_view.copy()
view["ProductGroupEmbedding"] = col_1
grouped = view.as_features(
column_names=["ProductGroupEmbedding"],
feature_names=["PRODUCT_ProductGroup_Embedding"],
offset=None,
)
feat = grouped["PRODUCT_ProductGroup_Embedding"]
output = feat
output.save(_id=ObjectId("6553b6212938b14622140567"))