
Deploying using TorchServe


One of the most popular ways to deploy PyTorch models is TorchServe. TorchServe is a flexible and performant library for serving PyTorch models in production, with built-in inference and monitoring capabilities.

It supports server-side batching to increase inference throughput, and it exposes metrics that let you monitor models with tools such as Grafana or Datadog.

TorchServe is especially useful in environments where UDFs process one record at a time and batches cannot be created on the client side. In such environments the number of concurrent requests can be huge. Given the size and relatively slow processing of Transformer models, this can lead to latency spikes, more timeout errors, or even an inference service that cannot handle the load at all.

With server-side batching, on the other hand, the service holds a request for a brief moment before processing it. If additional requests arrive within this window, the service accumulates them into a single batch and runs them through the model together. This lets the model take advantage of batched matrix multiplication and GPU parallelism.
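The batching rule described above can be illustrated with a small simulation. This is a minimal sketch of the general idea, not TorchServe's internal implementation; the function name `group_into_batches` and the arrival times are hypothetical.

```python
from typing import List


def group_into_batches(
    arrivals_ms: List[int], batch_size: int, max_delay_ms: int
) -> List[List[int]]:
    """Group request arrival times (in ms) into batches.

    A batch opens when its first request arrives and closes when it is
    full or when max_delay_ms has passed since that first arrival.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_ms):
        # Close the open batch if this request arrives after the waiting window.
        if current and t - current[0] > max_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it is full.
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Seven requests: a burst of five, followed by two stragglers.
arrivals = [0, 10, 20, 30, 40, 500, 520]
batches = group_into_batches(arrivals, batch_size=4, max_delay_ms=100)
print(batches)
```

With these inputs the seven requests are served with three model calls instead of seven: one full batch of four, one partial batch that timed out, and one batch with the two late requests.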


Implementing a TorchServe service is somewhat more complex than using a cloud function. Fortunately, the TorchServe documentation provides an excellent overview of the key concepts and a comprehensive guide for beginners. If you are new to TorchServe, we highly recommend exploring the official documentation. A broad range of examples is also available, including specific cases of transformer deployments, which can be immensely helpful.

We prepared an end-to-end implementation of an SBERT Transformer model served with TorchServe here. It contains a simple handler implementation, configuration properties, and a CPU-based Docker container.


Main highlights of the implementation


Handler - the main entry point for incoming requests. It loads the model during initialization and defines the methods that take a request through each stage: accepting the incoming data, running the model, and formatting the final results.

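import json
import logging

import torch
from sentence_transformers import SentenceTransformer
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)


class TransformerHandler(BaseHandler):
    def __init__(self):
        super().__init__()
        self.initialized = False

    def initialize(self, ctx):
        # Called once per worker when the model is loaded.
        self.manifest = ctx.manifest
        properties = ctx.system_properties

        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        # Load the model onto the selected device.
        self.model = SentenceTransformer(
            "sentence-transformers/all-MiniLM-L12-v2", device=self.device
        )

        self.initialized = True

    def preprocess(self, requests):
        # `requests` holds one entry per request in the server-side batch.
        logger.info("Accepted %s requests, data: %s", len(requests), requests)

        texts = []
        for data in requests:
            # Assumption: the payload is a JSON object with a "text" field,
            # matching the curl example in this section.
            payload = data.get("data") or data.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = json.loads(payload)
            texts.append(payload["text"])
        return texts

    def inference(self, inputs):
        embeddings = self.model.encode(inputs, show_progress_bar=False)
        return embeddings.tolist()

    def postprocess(self, inference_output):
        # Return exactly one result per request, in the same order.
        results = []
        for out in inference_output:
            results.append({"result": out})
        return results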
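The batch contract of the handler (a list of requests in, a list of results of the same length and order out) can be checked without starting TorchServe. The following is a minimal sketch, assuming each request carries a JSON object with a `text` field as in the curl example in this section. `extract_texts` and `format_results` are hypothetical helper names and are not part of TorchServe, and the embeddings are made-up numbers standing in for model output.

```python
import json
from typing import Any, Dict, List


def extract_texts(requests: List[Dict[str, Any]]) -> List[str]:
    """Pull the "text" field out of every request in a batch."""
    texts = []
    for data in requests:
        # The payload is looked up under "data" first, then "body".
        payload = data.get("data") or data.get("body")
        # The payload may arrive as raw bytes or a string instead of a dict.
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        if isinstance(payload, str):
            payload = json.loads(payload)
        texts.append(payload["text"])
    return texts


def format_results(embeddings: List[List[float]]) -> List[Dict[str, List[float]]]:
    """Wrap each embedding so that request i receives result i."""
    return [{"result": embedding} for embedding in embeddings]


# A fake batch of two requests, one already parsed and one as raw bytes.
fake_batch = [
    {"body": {"text": "Hello world"}},
    {"body": b'{"text": "Second request"}'},
]
texts = extract_texts(fake_batch)

# Stand-in embeddings; a real handler would call the model here.
fake_embeddings = [[0.1, 0.2], [0.3, 0.4]]
results = format_results(fake_embeddings)
print(texts, results)
```

The important property is that `results` has one entry per request in `fake_batch`, in the same order, so that each caller receives the embedding for its own text.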

Configuration properties

The configuration properties control many advanced settings. For more details, refer to the official TorchServe documentation. In our scenario, the configuration looks as follows:
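A minimal sketch of such a `config.properties` file is given here, assuming the model is registered under the name `transformer` and served by a single worker. Only the `batchSize` of 8 and the archive name `transformer.mar` come from this text; the addresses, the model store path, the `maxBatchDelay` of 200, and the other values are illustrative assumptions, so check them against the project's own file.

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
# Assumed location of the archived model inside the container
model_store=/home/model-server/model-store
load_models=transformer.mar
models={\
  "transformer": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "transformer.mar",\
      "minWorkers": 1,\
      "maxWorkers": 1,\
      "batchSize": 8,\
      "maxBatchDelay": 200,\
      "responseTimeout": 120\
    }\
  }\
}
```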


The main entries to watch out for are:

  • batchSize - the maximum number of requests to accumulate before sending them to the model. In this example we set it to 8, but the right value depends on the latency of the specific model and the available GPU memory (the bigger the model and the smaller the GPU, the smaller the batch size needs to be).

  • maxBatchDelay - the maximum time, in milliseconds, that the server waits for a batch to accumulate before sending it to the model. It should usually be set between 200 and 800 milliseconds. The lower the value, the lower the overall latency, but the less time the server has to gather a full batch before execution.
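The trade-off between these two settings can be made concrete with rough arithmetic. The following is a back-of-the-envelope sketch, assuming requests arrive at a steady average rate; `expected_batch_fill` is a hypothetical helper and the rate of 10 requests per second is illustrative, not measured.

```python
def expected_batch_fill(
    rate_per_s: float, batch_size: int, max_batch_delay_ms: int
) -> float:
    """Estimate how many requests land in one batch at a steady arrival rate."""
    # The first request opens the batch; the rest join while the server waits.
    arrivals_during_wait = rate_per_s * max_batch_delay_ms / 1000.0
    # A batch can never hold more than batch_size requests.
    return min(float(batch_size), 1.0 + arrivals_during_wait)


fill_short = expected_batch_fill(rate_per_s=10, batch_size=8, max_batch_delay_ms=200)
fill_long = expected_batch_fill(rate_per_s=10, batch_size=8, max_batch_delay_ms=800)
print(f"200 ms window: about {fill_short:.1f} requests per batch")
print(f"800 ms window: about {fill_long:.1f} requests per batch")
```

Under these assumptions a 200 ms window gathers about three requests per batch, while an 800 ms window is long enough to fill a batch of 8, at the cost of up to 800 ms of added waiting time for the first request in the batch.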

Running locally

Before running locally or anywhere else, the model has to be archived. An archive is an artifact that TorchServe can consume and deploy. TorchServe comes with an archiver that creates this model artifact for us.

To run the archiver on the provided project, execute the following command:

make archive

This will create a transformer.mar archive, which can be deployed.

Next, to run TorchServe locally, run:

# run locally without docker
make run-local

# to stop local server

To run the service as a Docker container, we need to build an image first:

# build docker image
make build-docker

# run locally in docker
make run-docker

To execute predictions, run the following curl command:

curl -XPOST http://localhost:8080/predictions/transformer \
  -H 'Content-Type: application/json' \
  -d '{"text": "Hello world"}'

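The response is a JSON object whose "result" field contains the sentence embedding as a list of floats:

{
  "result": [...]
}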
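The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server from the previous steps is listening on `localhost:8080` and the model is registered as `transformer`, as in the curl command; `build_request` and `predict` are hypothetical helper names.

```python
import json
import urllib.request

# Endpoint taken from the curl example; "transformer" is the model name.
URL = "http://localhost:8080/predictions/transformer"


def build_request(text: str, url: str = URL) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, timeout: float = 10.0):
    """Send one text to the server and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not need a running server.
request = build_request("Hello world")
print(request.get_method(), request.full_url, request.data)
```

Calling `predict("Hello world")` requires the service to be running. Sending several such calls concurrently is one way to observe server-side batching in the handler's log output.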

Deploy to production

The approach to production deployment varies based on the current infrastructure setup, such as the types of cluster management tools in use, cloud providers, and other factors. However, with the Docker image we created in the previous section, deployment becomes flexible. This image can be deployed in virtually any environment that supports Docker containers, including but not limited to Kubernetes, AWS EKS, and similar platforms.