So you’ve trained your TensorFlow ML model, now what? (Part 2 - Deploying to Kubernetes)
Part 1 showed how to run the open source pre-trained model Inception V3 as an image classification service. That’s all you need to do for a single classification instance, whether you run it on a server, your laptop or even a smart IoT device. But what if you need to handle a high volume of concurrent requests, or you want multiple instances for resiliency? For that, you’ll want to scale horizontally with many instances, and a cluster managed by a resource scheduler like Kubernetes is a great way to go. Not only does it help with scalability, it also enables deployment automation, improves manageability and will most likely result in better infrastructure utilization.
Running the classification service in Kubernetes is pretty straightforward once you have the Docker image. It only requires one more step: creating a Kubernetes Deployment. You define the Deployment in a YAML file, such as the one below.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: inception-deployment
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: inception-server
    spec:
      containers:
      - name: inception-container
        image: index.docker.io/paulwelch/tensorflow-serving-inception
        command:
        - /bin/sh
        - -c
        args:
        - tensorflow_model_server
          --port=9000 --model_name=inception --model_base_path=/serving/inception-export
        ports:
        - containerPort: 9000
My Deployment starts an instance listening on port 9000. Note the nested keys of containers that define the image, as well as the command and args to run when the instance starts. These correspond to the command line parameters we used with docker run in Part 1.
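For reference, running the same container directly with Docker would look roughly like the command below. This is reconstructed from the manifest values above, so the exact invocation from Part 1 may differ slightly.
» docker run -p 9000:9000 index.docker.io/paulwelch/tensorflow-serving-inception \
    /bin/sh -c "tensorflow_model_server --port=9000 --model_name=inception --model_base_path=/serving/inception-export"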
Copy the YAML to a file inception_serving_k8s.yaml. Assuming you have the Kubernetes CLI tool kubectl configured to talk to your cluster, it can be applied as follows:
» kubectl apply -f inception_serving_k8s.yaml
You can see the Deployment status with the following command:
» kubectl get deployments
NAME                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
inception-deployment   1         1         1            1           42s
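If you’d rather block until the rollout finishes instead of polling, kubectl can wait on it for you:
» kubectl rollout status deployment/inception-deployment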
The group of containers started by the Deployment is called a Pod, which you can see with this command:
» kubectl get pods
NAME                                  READY     STATUS    RESTARTS   AGE
inception-deployment-57553967-36l4k   1/1       Running   0          42s
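To confirm the model server actually started inside the Pod, check its logs or events, using the Pod name from the output above (yours will differ):
» kubectl logs inception-deployment-57553967-36l4k
» kubectl describe pod inception-deployment-57553967-36l4k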
Now an instance of the service is running. You could run the inception_client against it, just as in Part 1. But we want to run more instances to handle the high request volume, and we’ll need a way to distribute that volume across all of the instances. Kubernetes makes this pretty trivial as well. In the YAML file, simply change the replicas key from 1 to the count you want and re-run kubectl apply as above.
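If you just want to try a different count without editing the file, the Deployment can also be scaled imperatively. Keep in mind the YAML remains the source of truth the next time you run kubectl apply.
» kubectl scale deployment inception-deployment --replicas=3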
You may also want to provide connectivity to the service endpoint. A simple way is to add a Service definition alongside the Deployment in the same YAML file. The Service defines a stable endpoint for the Pods.
Note the type key. As of the current release, if you’re running on a supported Cloud Provider you can specify a Service type of LoadBalancer and Kubernetes will auto-allocate a load-balanced IP that is accessible outside the cluster, with all instance endpoints in its pool. The load balancer implementation is specific to the Cloud Provider, such as ELB on AWS.
The LoadBalancer type won’t be available if you’re running on a platform that doesn’t have a Cloud Provider, like my bare metal dev environment. In this case, you’ll need to use another type such as NodePort. The NodePort type exposes the Service on a port allocated on every node, but you’ll need to bring your own load balancer to distribute the requests and expose a single IP outside the cluster. The good news is that there are several Kubernetes-integrated open source load balancers available, such as the excellent Traefik Project.
The full YAML, updated to scale out to 3 instances and add a Service, is shown in the example below.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: inception-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: inception-server
    spec:
      containers:
      - name: inception-container
        image: index.docker.io/paulwelch/tensorflow-serving-inception
        command:
        - /bin/sh
        - -c
        args:
        - tensorflow_model_server
          --port=9000 --model_name=inception --model_base_path=/serving/inception-export
        ports:
        - containerPort: 9000
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: inception-service
  name: inception-service
spec:
  ports:
  - port: 9000
    targetPort: 9000
  selector:
    app: inception-server
  type: NodePort
Apply the changes with kubectl apply, as follows.
» kubectl apply -f inception_serving_k8s.yaml
Then you can see the changes with the kubectl get pods and kubectl get services commands. If you set the replicas value to 3, you should see 3 instances listed. One great thing about running in Kubernetes is that the value could just as easily be 30 or 300, as long as there are sufficient resources to run them.
» kubectl get pods
NAME                                  READY     STATUS    RESTARTS   AGE
inception-deployment-57553967-36l4k   1/1       Running   0          28m
inception-deployment-57553967-8228b   1/1       Running   0          42s
inception-deployment-57553967-nr087   1/1       Running   0          3s
» kubectl get services
NAME                      TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)                         AGE
inception-service         NodePort   10.3.50.102    <none>        9000:32027/TCP                  28s
traefik-ingress-service   NodePort   10.3.106.173   <none>        8888:32720/TCP,8080:30569/TCP   65d
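The PORT(S) column shows the NodePort that was allocated (32027 in my case). To look it up programmatically, or to smoke-test a single Pod from your workstation before wiring up a load balancer, something like the following works; while the port-forward is running you can point the inception_client from Part 1 at localhost:9000.
» kubectl get service inception-service -o jsonpath='{.spec.ports[0].nodePort}'
» kubectl port-forward inception-deployment-57553967-36l4k 9000:9000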
At this point, you have 3 instances running. I’ve also added a Traefik load balancer to provide the load balanced endpoint accessible outside the cluster. Rolling your own load balancer may be a good topic for a future post.
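Scaling doesn’t have to stay manual, either. As a rough sketch, assuming your cluster collects CPU metrics and the container has a CPU request set (neither is configured in the YAML above), the built-in Horizontal Pod Autoscaler can adjust the replica count for you:
» kubectl autoscale deployment inception-deployment --min=3 --max=10 --cpu-percent=80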
To sum it up, a few of the benefits of using Kubernetes to run our classification service:
- Scaling out or in is just a change to the replicas config value. Changing the scale is easily initiated by an external system, such as a monitoring system that could make changes based on CPU utilization, the depth of some queue or another metric.
The service is shaping up pretty well so far. In the next part, I’ll take a look at using the service from another application separately deployed in the cluster.