Model serving is an essential component of a production machine learning infrastructure stack. It's how you convert your trained model into an API for real-time predictions — e.g. how you integrate sentiment analysis into a customer support application or how you flag fraudulent transactions in real time.

TensorFlow Serving is an open source tool for serving models, and by containerizing TensorFlow Serving and running it on Kubernetes, we can automate deployments, enable horizontal scaling, and perform rolling updates. Our architecture will emphasize:

  • Simplicity: the code should be straightforward, easy to extend, and "just work".
  • Scalability: the prediction service should scale up easily.
  • Generalizability: the code should work with any model and environment with minimal modification.
  • JSON: common web clients like curl should be able to make prediction requests.

In this example, we'll deploy an iris classifier trained on the UCI Iris Dataset.

Define a Kubernetes Deployment

The primary Kubernetes resource we'll use to deploy our prediction service is a Deployment. A Deployment runs multiple replicas of an application (Pods) and automatically replaces failed or unresponsive Pods with new ones. This ensures that the service is scalable and available, and enables rolling updates.

We'll use 3 replicas for now, but it's easy to add or remove replicas as necessary (even after deploying). Adding more replicas increases our load capacity and improves the reliability of our API by introducing redundancy in case of Pod failures.
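
For example, once the Deployment below is applied, you can scale it from the command line (the target of 5 replicas here is just an illustration):

$ kubectl scale deployment iris --replicas=5

You can also edit the replicas field in the config and re-apply it, which keeps the YAML as the source of truth.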

All Kubernetes resources can be tagged with labels, which are arbitrary key/value pairs. We'll give each of our Pods the app: iris-classifier label, and we'll use the Deployment's selector field to specify that all Pods with this label are part of our Deployment. Although this may seem redundant, it's how the Deployment identifies which Pods to manage.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-classifier
  template:
    metadata:
      labels:
        app: iris-classifier
This is the scaffold for the Deployment, which we'll continue to update (here's the full config)

Download the Model with an Init Container

When TensorFlow Serving starts up, it searches its local filesystem for an exported model to load and serve. So, the first thing we need to do is download the model.

You may use any model that's exported in a format that TensorFlow Serving supports (e.g. via tf.estimator.FinalExporter). I trained my model with TensorFlow's pre-made DNNClassifier using Cortex, and I uploaded it to a public S3 bucket.
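
For reference, once the archive is extracted, TensorFlow Serving expects a versioned SavedModel layout under the model directory, roughly like this (the version directory is typically an integer or timestamp, and the variables filenames will vary):

/models/iris/
└── 1/
    ├── saved_model.pb
    └── variables/
        ├── variables.data-00000-of-00001
        └── variables.index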

To ensure we have the model before TensorFlow Serving starts, we can use an Init Container, which runs before the app container. The Init Container will download the zipped model from S3, extract it, and move it to the location that TensorFlow Serving expects (/models/<model_name>). We can accomplish this using only wget and tar, so we'll use the lightweight busybox image as the base for our Init Container.

initContainers:
- name: download-model
  image: busybox
  command: ["/bin/sh", "-c"]
  args: ["wget -qO- https://s3-us-west-2.amazonaws.com/cortex-blog/tf-serving-k8s/iris.tar.gz | tar xvz -C /models/iris"]
I omitted the rest of the config for brevity (here's the full config)

Persist the Model in a Volume

Because local files in a container aren't shared (even among containers within the same Pod), our Init Container cannot automatically share the model it downloaded with the TensorFlow Serving container. We can solve this with a Kubernetes Volume. Specifically, we'll use an emptyDir Volume, which is created when the Pod starts, exists as long as that Pod is running, and is deleted when the Pod terminates. Containers in the Pod can all read and write the same files in the emptyDir Volume.

To use the emptyDir Volume, we declare the Volume in the Pod spec (using the volumes key) and mount the Volume to each container (using the volumeMounts key).

initContainers:
- name: download-model
  # ...
  volumeMounts:
  - name: model
    mountPath: /models/iris
volumes:
- name: model
  emptyDir: {}
I omitted the rest of the config for brevity (here's the full config)

Configure TensorFlow Serving

We'll use the pre-built tensorflow/serving image from Docker Hub. TensorFlow Serving will look for a model in the /models/<model_name> directory, and we can specify our model name by setting the MODEL_NAME environment variable. We'll mount the Volume in the same way we did for the Init Container. We'll also expose port 8501, which is the port TensorFlow Serving uses for its REST API.

To ensure that Kubernetes doesn't route traffic to our pod until TensorFlow Serving has finished loading our model, we'll configure a readiness probe. Specifically, we'll use a TCP probe to check if TensorFlow Serving is accepting TCP connections on port 8500 (its gRPC port).

containers:
- name: serving
  image: tensorflow/serving
  env:
  - name: MODEL_NAME
    value: iris
  ports:
  - containerPort: 8501
  readinessProbe:
    tcpSocket:
      port: 8500
  volumeMounts:
  - name: model
    mountPath: /models/iris
I omitted the rest of the config for brevity (here's the full config)

Expose the Service

To make our prediction API publicly available, we need to group all of the Pods created by our Deployment into one service that can be addressed by a single URL. To do this, we'll use a Kubernetes Service with type: LoadBalancer. This exposes our service as a public endpoint behind a load balancer (e.g. an ELB on AWS), which distributes incoming requests among all ready Pods in the Service. As we did in our Deployment, we'll use the selector field to specify which Pods are part of the Service (i.e. all Pods with the app: iris-classifier label).

The TensorFlow Serving Pods expose the prediction endpoint on port 8501. We'll map this to port 80 in our Service, since that's the default for HTTP.

apiVersion: v1
kind: Service
metadata:
  name: iris
spec:
  ports:
  - port: 80
    targetPort: 8501
  selector:
    app: iris-classifier
  type: LoadBalancer
I omitted the rest of the config for brevity (here's the full config)

Let's deploy!

Here's our full Kubernetes configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: iris
spec:
  replicas: 3
  selector:
    matchLabels:
      app: iris-classifier
  template:
    metadata:
      labels:
        app: iris-classifier
    spec:
      initContainers:
      - name: download-model
        image: busybox
        command: ["/bin/sh", "-c"]
        args: ["wget -qO- https://s3-us-west-2.amazonaws.com/cortex-blog/tf-serving-k8s/iris.tar.gz | tar xvz -C /models/iris"]
        volumeMounts:
        - name: model
          mountPath: /models/iris
      containers:
      - name: serving
        image: tensorflow/serving
        env:
        - name: MODEL_NAME
          value: iris
        ports:
        - containerPort: 8501
        readinessProbe:
          tcpSocket:
            port: 8500
        volumeMounts:
        - name: model
          mountPath: /models/iris
      volumes:
      - name: model
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: iris
spec:
  ports:
  - port: 80
    targetPort: 8501
  selector:
    app: iris-classifier
  type: LoadBalancer

We'll download and save the configuration file as tensorflow_serving.yaml:

curl https://gist.githubusercontent.com/deliahu/a795c9702cc9e1ce46eb773d61f3abb1/raw -o tensorflow_serving.yaml

We're now ready to deploy our prediction service to the cluster. If you don't already have access to a Kubernetes cluster, consult the guides for your favorite cloud provider (e.g. AWS, GCP, Azure, DigitalOcean). You can also run Kubernetes locally using Minikube.

$ kubectl apply -f tensorflow_serving.yaml

deployment.apps/iris created
service/iris created

We can track the progress of our Deployment and Pods:

$ kubectl get deployments

NAME   READY   UP-TO-DATE   AVAILABLE   AGE
iris   0/3     3            0           3s

$ kubectl get pods

NAME                    READY   STATUS            RESTARTS   AGE
iris-7d95d86ddb-4fsj5   0/1     PodInitializing   0          7s
iris-7d95d86ddb-j55mk   0/1     PodInitializing   0          7s
iris-7d95d86ddb-pl9nt   0/1     PodInitializing   0          7s

And we can check the status of our Service:

$ kubectl describe service iris

Name:                     iris
Namespace:                default
Labels:                   <none>
Selector:                 app=iris-classifier
Type:                     LoadBalancer
IP:                       10.100.193.175
LoadBalancer Ingress:     <URL or IP address of endpoint>
Port:                     <unset>  80/TCP
TargetPort:               8501/TCP
NodePort:                 <unset>  30307/TCP
Endpoints:                192.168.30.61:8501,...
Session Affinity:         None
External Traffic Policy:  Cluster

It should take a minute for LoadBalancer Ingress to appear, and then another few minutes for it to be ready to route traffic. If you use AWS, the value of LoadBalancer Ingress will be an ELB endpoint (e.g. a963d9d84532c11e983f10e9cd88a514-1399285962.us-west-2.elb.amazonaws.com); if you use GCP, it will be an IPv4 address (e.g. 104.155.184.117); and if you use Minikube, you can find the URL with minikube service iris --url.
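
You can also pull the endpoint programmatically with kubectl's jsonpath output (on AWS the address is reported under hostname; on GCP, under ip):

$ kubectl get service iris -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'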

Now we're ready to classify an iris flower via the TensorFlow Serving API:

$ PREDICTION_ENDPOINT="<URL or IP address of LoadBalancer Ingress>"
$ curl \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
          "signature_name":"predict",
          "instances":[
            {
              "sepal_length": 5.2,
              "sepal_width": 3.6,
              "petal_length": 1.4,
              "petal_width": 0.3
            }
          ]
        }' \
    $PREDICTION_ENDPOINT/v1/models/iris:predict

Here is the response:

{
    "predictions": [
        {
            "probabilities": [0.826282, 0.11641, 0.057308],
            "class_ids": [0],
            "classes": ["0"],
            "logits": [1.51105, -0.448763, -1.15744]
        }
    ]
}
In our case, class 0 is "Iris-setosa"

Note: If curl responds with "Could not resolve host", try waiting a few more minutes.

Update the Model

We can perform a rolling update of our service with no downtime: simply replace the model URL in the Init Container's args and run kubectl apply -f tensorflow_serving.yaml to kick off the rolling update.
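
You can watch the rollout's progress, and revert it if something goes wrong:

$ kubectl rollout status deployment/iris
$ kubectl rollout undo deployment/iris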

Clean Up

$ kubectl delete service iris
service "iris" deleted

$ kubectl delete deployment iris
deployment.extensions "iris" deleted

You may also want to spin down your Kubernetes cluster (check your cluster manager's docs for instructions).

Next Steps

Private Data

I uploaded the trained model to a public S3 bucket, but we can use private models by giving our Init Container access to them. The exact steps depend on where your model is hosted. For S3, we need to configure AWS credentials and use the AWS CLI to download the model. We'll use mesosphere/aws-cli as the base image for our Init Container, and we can supply the credentials in one of two ways:

1) We can assign an IAM role to the EC2 nodes in our Kubernetes cluster. The AWS CLI will then automatically load the credentials, so no further configuration is required. Here's our updated Init Container:

initContainers:
- name: download-model
  image: mesosphere/aws-cli
  command: ["/bin/sh", "-c"]
  args: ["aws s3 cp s3://cortex-blog/tf-serving-k8s/iris.tar.gz /tmp/iris.tar.gz && tar -C /models/iris -xvzf /tmp/iris.tar.gz && rm /tmp/iris.tar.gz"]
  volumeMounts:
  - name: model
    mountPath: /models/iris

2) We can manually configure our AWS credentials by creating a Kubernetes Secret and using it to populate the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

Here's how we create the Secret:

kubectl create secret generic 'aws-credentials' \
  --from-literal='AWS_ACCESS_KEY_ID'='XXX' \
  --from-literal='AWS_SECRET_ACCESS_KEY'='XXX'

And here's our updated Init Container:

initContainers:
- name: download-model
  image: mesosphere/aws-cli
  command: ["/bin/sh", "-c"]
  args: ["aws s3 cp s3://cortex-blog/tf-serving-k8s/iris.tar.gz /tmp/iris.tar.gz && tar -C /models/iris -xvzf /tmp/iris.tar.gz && rm /tmp/iris.tar.gz"]
  env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_SECRET_ACCESS_KEY
  volumeMounts:
  - name: model
    mountPath: /models/iris

Avoid Duplicate Model Downloads

In this example, every Pod downloads its own copy of the model. You can avoid this by baking the model into the Docker image: create a Dockerfile with tensorflow/serving as the base image, copy your trained model into the image (via Docker's COPY instruction), build the image, and push it to your container registry. Then replace tensorflow/serving with your image in the Deployment and remove the Init Container.
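
Here's a minimal sketch of such a Dockerfile, assuming your exported, versioned model sits in a local iris/ directory (a hypothetical path):

FROM tensorflow/serving
COPY iris /models/iris
ENV MODEL_NAME=iris

Build and push it, using your own image name in place of the placeholder:

$ docker build -t <your-registry>/iris-serving:latest .
$ docker push <your-registry>/iris-serving:latest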

This solution requires a container registry and adds an additional step to the model update workflow. Therefore, it may not be worth it if your model is small or you're just trying to get an MVP off the ground.

Optimize TensorFlow Serving

You can build TensorFlow Serving from source to target your specific architecture. This could dramatically speed up your predictions. See this article from Mux.

SSL

Instructions for enabling SSL vary by cloud provider, but should be straightforward. For example, here are the instructions for AWS.

Programmatic Management

If you'd like to manage your deployment from another application, check out the Kubernetes client libraries.

Dynamic Endpoints

To serve multiple models from a single base URL and dynamically add and remove endpoints, check out Kubernetes Ingress and Ingress Controllers.


If you're looking for an open source tool to manage all of this (and much more), check out Cortex! (github.com/cortexlabs/cortex)