PRODUCT ProductGroup Embedding

Create dense embedding features for text columns

Textual data is hard to work with. Traditional preprocessing and encoding approaches such as Bag-of-Words and TF-IDF require cleaning the text first: stripping punctuation and special characters, removing stop words, lowercasing, and so on. Handling mixed-language data is also a challenge.
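
To make the cleaning steps above concrete, here is a minimal pure-Python sketch of a Bag-of-Words encoder. The `bag_of_words` helper and the tiny `STOP_WORDS` set are assumptions made up for this illustration; they are not part of FeatureByte or of this tutorial's pipeline.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "and", "of", "et", "de"}


def bag_of_words(text: str) -> Counter:
    """Lowercase, strip punctuation, drop stop words, and count tokens."""
    lowered = text.lower()
    # \w+ keeps runs of word characters (Unicode-aware for str patterns),
    # so punctuation and other special characters are discarded.
    tokens = re.findall(r"\w+", lowered)
    return Counter(token for token in tokens if token not in STOP_WORDS)


counts = bag_of_words("Fruits secs, and the Fruits Surgelés!")
print(counts)
```

Even after this cleaning, the encoding only counts surface tokens: it carries no notion that "surgelés" and "secs" both describe kinds of fruit products, and the stop-word list has to be maintained per language.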

Fortunately, pre-trained Transformer-based models such as BERT provide an easy way to handle text data without complicated preprocessing. They also bring semantic information about the text into the model: for example, the Fruits, Fruits Surgelés, and Fruits secs product groups will be closer to each other in terms of cosine distance than to other product groups.
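
As a rough illustration of what "closer in terms of cosine distance" means, the sketch below computes cosine distance in plain Python. The 3-dimensional vectors are made-up toy values, not real model output (real sentence embeddings have hundreds of dimensions), and "Some other group" is a hypothetical placeholder rather than a product group from the dataset.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up toy vectors for illustration only (NOT real embeddings).
toy_embeddings = {
    "Fruits": [0.9, 0.1, 0.0],
    "Fruits secs": [0.8, 0.2, 0.1],
    "Some other group": [0.0, 0.2, 0.9],  # hypothetical unrelated group
}

d_related = cosine_distance(toy_embeddings["Fruits"], toy_embeddings["Fruits secs"])
d_unrelated = cosine_distance(toy_embeddings["Fruits"], toy_embeddings["Some other group"])
print(f"related: {d_related:.3f}, unrelated: {d_unrelated:.3f}")
```

With these toy values the distance between the two fruit groups is much smaller than the distance to the unrelated group, which is the property a downstream model can exploit.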

In this notebook we will create an embedding feature for the ProductGroup column.

In [1]:

            
import featurebyte as fb
fb.use_profile("tutorial")

19:02:01 | INFO     | Using configuration file at: /Users/viktor/.featurebyte/config.yaml
19:02:01 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
19:02:01 | WARNING  | Remote SDK version (0.5.1.dev70) is different from local (0.5.1.dev63). Update local SDK to avoid unexpected behavior.
19:02:01 | INFO     | No catalog activated.
19:02:01 | INFO     | 10 feature lists, 59 features deployed
19:02:01 | INFO     | Using profile: tutorial
19:02:01 | INFO     | Using configuration file at: /Users/viktor/.featurebyte/config.yaml
19:02:01 | INFO     | Active profile: tutorial (https://tutorials.featurebyte.com/api/v1)
19:02:01 | WARNING  | Remote SDK version (0.5.1.dev70) is different from local (0.5.1.dev63). Update local SDK to avoid unexpected behavior.
19:02:01 | INFO     | No catalog activated.
19:02:02 | INFO     | 10 feature lists, 59 features deployed

In [2]:

            
catalog = fb.Catalog.activate("Grocery Dataset Tutorial")

19:02:02 | INFO     | Catalog activated: Grocery Dataset Tutorial

Get the previously defined UDF

The function, which was previously created via:

embedding_udf = fb.UserDefinedFunction.create(
    name='embedding', 
    sql_function_name='F_SBERT_EMBEDDING',
    function_parameters=[fb.FunctionParameter(name="x", dtype=fb.enum.DBVarType.VARCHAR)],
    output_dtype=fb.enum.DBVarType.ARRAY,
    is_global=False,
)

can be retrieved as follows:

In [3]:

            
embedding_udf = fb.UserDefinedFunction.get("embedding")

In [4]:

            
# Get view from GROCERYPRODUCT dimension table.
groceryproduct_view = catalog.get_view("GROCERYPRODUCT")

Create embedding feature

In [5]:

            
groceryproduct_view["ProductGroupEmbedding"] = embedding_udf(groceryproduct_view["ProductGroup"])
product_group_embedding = groceryproduct_view["ProductGroupEmbedding"].as_feature("PRODUCT_ProductGroup_Embedding")

Get the previously created observation table

The observation table can be created via:

observation_table = invoiceitems_view.create_observation_table(
    name="Preview tables with Invoice Items",
    sample_rows=10,
    columns=["Timestamp", "GroceryInvoiceItemGuid"],
    columns_rename_mapping={
        "Timestamp": "POINT_IN_TIME",
        "GroceryInvoiceItemGuid": "GROCERYINVOICEITEMGUID",
    },
)

In [6]:

            
observation_table = catalog.get_observation_table("Preview tables with Invoice Items")

In [7]:

            
product_group_embedding.preview(observation_table.to_pandas())

Downloading table |████████████████████████████████████████| 10/10 [100%] in 0.1

Out[7]:

	POINT_IN_TIME	GROCERYINVOICEITEMGUID	PRODUCT_ProductGroup_Embedding
0	2022-02-27 12:19:06	d307efab-fc40-4b16-be88-2d13b70d8903	[-0.076364621520042, 0.022746421396732, 0.0972...
1	2022-03-19 13:17:52	e42054bf-fc35-4248-a279-16dc7ac8efa5	[-0.100475512444973, 0.036730881780386006, -0....
2	2022-04-13 19:50:26	b3afcbe1-dd98-41f3-bb79-9133bd316dae	[-0.025993665680289, 0.035225179046392004, -0....
3	2022-09-27 12:46:45	48156112-a586-42ee-ab1d-fc53629f438a	[-0.010095329023898, 0.024357495829463, 0.0082...
4	2022-11-04 12:15:04	23563a05-76be-44e7-9fe6-6bb4e4d11d2b	[-0.076364621520042, 0.022746421396732, 0.0972...
5	2023-03-09 12:15:30	57c7d176-f2e3-48d4-94a8-d7f3fbc726a3	[-0.098438546061516, -0.047346442937851, -0.03...
6	2023-04-20 12:17:35	20779d9e-69c3-4135-a42b-6e7a10819136	[0.047602690756321, 0.037338279187679006, -0.0...
7	2023-06-12 14:14:32	8717d1fa-6708-4a49-b022-b21f89d5060b	[-0.087121985852718, 0.050529576838017, -0.008...
8	2023-07-03 13:14:46	4b00f6c0-0913-4608-b0f2-f344ad57481b	[-0.047038100659847, 0.008642449043691, -0.089...
9	2023-08-09 11:17:46	c851f86c-f52e-4b55-8311-b62f55be6945	[-0.026694169268012, 0.168275728821754, -0.067...

Save the feature and view its definition file

In [8]:

            
product_group_embedding.save()

Done! |████████████████████████████████████████| 100% in 6.5s (0.16%/s)

In [9]:

            
# Add description
product_group_embedding.update_description(
    "Product group dense embedding via Sentence Transformers"
)
# See feature definition file
product_group_embedding.definition

Out[9]:

# Generated by SDK version: 0.5.1.dev70
from bson import ObjectId
from featurebyte import DimensionTable
from featurebyte import UserDefinedFunction


# dimension_table name: "GROCERYPRODUCT"
dimension_table = DimensionTable.get_by_id(ObjectId("6553b0012c001f76d263e059"))
dimension_view = dimension_table.get_view(
    view_mode="manual", drop_column_names=[], column_cleaning_operations=[]
)
col = dimension_view["ProductGroup"]

# udf_name: embedding, sql_function_name: F_SBERT_EMBEDDING
udf_embedding = UserDefinedFunction.get_by_id(
    ObjectId("6553b511850516ee23c8b734")
)
col_1 = udf_embedding(col)
view = dimension_view.copy()
view["ProductGroupEmbedding"] = col_1
grouped = view.as_features(
    column_names=["ProductGroupEmbedding"],
    feature_names=["PRODUCT_ProductGroup_Embedding"],
    offset=None,
)
feat = grouped["PRODUCT_ProductGroup_Embedding"]
output = feat
output.save(_id=ObjectId("6553b6212938b14622140567"))
