This section describes the options for deploying transformer models as inference endpoints.
Transformer models can be deployed in a range of ways. Two popular options are:

- Cloud-based serverless deployments
- Custom inference services
The choice usually depends on business requirements and constraints, existing infrastructure, and which cloud providers and data warehouses are already in use.
In the following sections, we'll explore:

- Deployment via GCP Cloud Functions
- Custom inference with TorchServe
- Deployment with MLflow
- Deployment on Databricks Delta Live Tables (DLT)
All of the methods listed above expose the transformer model as an HTTP endpoint. This raises a different issue: how to score large amounts of data (potentially hundreds of millions of rows or more) over HTTP in a batch setting. Sending a separate HTTP request for each row may be infeasible due to service throughput limitations.
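One general mitigation is client-side micro-batching: instead of one request per row, group rows into chunks and send one request per chunk. A minimal sketch of this idea — the endpoint URL and the `{"instances": ...}` payload shape are assumptions for illustration, not any specific service's API:

```python
import json
from typing import Iterator, List
from urllib.request import Request, urlopen


def batched(rows: List, batch_size: int) -> Iterator[List]:
    """Yield successive fixed-size chunks of the input rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def score(rows: List[dict],
          endpoint: str = "https://example.com/score",  # hypothetical endpoint
          batch_size: int = 256) -> List:
    """Score rows by sending one HTTP request per batch, not per row."""
    predictions = []
    for batch in batched(rows, batch_size):
        req = Request(
            endpoint,
            data=json.dumps({"instances": batch}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urlopen(req) as resp:
            predictions.extend(json.loads(resp.read())["predictions"])
    return predictions
```

With a batch size of 256, a million rows turn into roughly four thousand requests instead of a million, which is usually well within a service's throughput limits.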
There are multiple strategies for solving this issue:
The Snowflake data warehouse mitigates this via built-in batching: instead of sending one HTTP request per observation, it automatically gathers rows into batches.
Cloud Functions or Lambdas can help via automatic horizontal scaling based on load.
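With GCP Cloud Functions, for example, the platform scales instances horizontally with incoming traffic, and the ceiling can be capped at deploy time. A hypothetical deployment command — the function name, runtime, and limits are placeholders:

```sh
# Deploy an HTTP-triggered scoring function; GCP adds instances
# automatically under load, up to the configured maximum.
gcloud functions deploy score-transformer \
  --runtime=python311 \
  --trigger-http \
  --memory=2048MB \
  --max-instances=100
```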
TorchServe and TensorFlow Serving provide server-side batching capabilities.
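In TorchServe, for instance, batching is configured per model: requests arriving within `maxBatchDelay` milliseconds are grouped into batches of up to `batchSize` and passed to the handler together. A sketch of a `config.properties` fragment, following the TorchServe batch-inference configuration format (the model name and archive file are placeholders):

```properties
load_models=sentiment_model.mar
models={\
  "sentiment_model": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "sentiment_model.mar",\
        "batchSize": 8,\
        "maxBatchDelay": 50\
    }\
  }\
}
```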
Databricks provides Delta Live Tables (DLT), which can also help with scoring a huge number of observations with transformer models and mitigate throughput issues. We will explore this approach in the Using transformer model with DataBricks Delta Live Tables (DLT) and FeatureByte tutorial.