
Moving our machine learning models to GKE

GKE is the new Victoria’s Secret. That’s where all the best models go.

Ugh, why do I have to cringe at my own jokes… Moving on. Let me tell you about this kinda unique setup we have at the moment, where all of our prod services are in AWS except our two models requiring GPUs. These two live on our Google Kubernetes cluster. It’s been a pretty fun adventure setting it up and we’ve learnt a bunch of things along the way that I want to share with you.

Yeah but why?

Several reasons, but one is the good ol’ money thing. Thanks to the program we’re on, Google Cloud for Startups, we got free Google credits to use. Although all of our infrastructure is on AWS, we chose to move our GPU-intensive models over to Google as they are by far the biggest line on our cloud bill. We picked Kubernetes because of the impressive flexibility and modularity it offers, and with sights set on one day moving other things over. Its network policies, scalability and deployment system make it pretty attractive for a growing codebase like ours.

I love that Kubernetes is declarative by nature. The ability to have all of your cluster’s config as code in one place and being able to apply it really satisfies the OCD person in me. And I’m not even OCD.

So, what have we got?

One cluster with two node pools:

  • one with GPUs for our models, which autoscales from 2 to 4 nodes,
  • one for the supporting pods that Google adds automatically.

And that’s about it for now.

Since our two programs pick messages up from SQS queues, their Kubernetes pods don’t need public IPs, and therefore no load balancer is needed.
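For a rough picture, each model is just a plain Deployment with nothing in front of it. Here’s a stripped-down sketch rather than our actual config: all the names, the image, the queue URL and the secret reference are made up.

```yaml
# Minimal sketch of one model deployment (placeholder names throughout).
# Note there is no Service or Ingress anywhere: the pod only makes outbound
# calls to SQS, so nothing ever needs to reach it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      containers:
        - name: my-model
          image: gcr.io/my-project/my-model:latest
          env:
            - name: SQS_QUEUE_URL
              value: "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"
          envFrom:
            - secretRef:
                name: aws-sqs-creds   # the AWS creds secret mentioned below
          resources:
            limits:
              nvidia.com/gpu: 1       # one GPU per replica
```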

A couple of choices along the way

Since our services need to consume from AWS queues, they need credentials to access them. We’re storing those as cluster-wide secrets that we generate locally, apply once and then delete. We originally decided to encrypt them with a KMS key and decrypt them in our build process before applying them; but building with CircleCI meant the decrypted creds might get stored, which we definitely did not want.
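Concretely, it’s something like the file below that we generate locally, kubectl apply once and then delete from disk. The secret name and keys are placeholders (same aws-sqs-creds as in the sketch above).

```yaml
# Sketch of the secret we generate locally, apply once and then delete
# (names and values are placeholders, never commit the real thing).
apiVersion: v1
kind: Secret
metadata:
  name: aws-sqs-creds
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIA................"
  AWS_SECRET_ACCESS_KEY: "replace-me-locally"
```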

Oops, I said it, now it’s no longer a surprise: we are using CircleCI for our CI/CD. Damn, always spoiling the juicy details, me.

That part’s been ok, though we faced a couple of challenges I’ll be mentioning below.

Queue-based scaling

Funnily enough, with this whole queuing system, our GPU models don’t really bite off more than they can chew and our biggest overload issue is actually latency when the queue backlogs get too big. This means that autoscaling on GPU usage makes no sense for us (and is not straightforward with GKE), let alone on CPU usage. Instead, we’ve chosen to scale on queue backlog, meaning we increase the number of instances every time more than 100 messages are waiting.

Now, that’s been tricky on two ends: on the AWS side, because we chose to trigger a lambda when the number of messages got too big, and that was not straightforward; and on the program’s side, because programmatically scaling a cluster from another cloud provider requires a lot of trial and error. Each of these issues will get its own little tutorial because it’s just not super fun figuring it out by yourself.

In short, we now have two AWS lambdas that will scale up or down our GKE model instances based on SQS queue length.
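The proper write-up will come in its own tutorial, but the shape of the scale-up lambda is roughly this. It’s a sketch rather than our exact code: the queue URL, cluster endpoint, token, deployment name and thresholds are all illustrative, and it assumes the lambda can reach the cluster’s Kubernetes API with a service-account bearer token.

```python
import os

import boto3
from kubernetes import client  # packaged with the lambda (it's not in the base runtime)

QUEUE_URL = os.environ["QUEUE_URL"]        # illustrative env var names
GKE_ENDPOINT = os.environ["GKE_ENDPOINT"]  # e.g. https://<cluster-api-ip>
GKE_TOKEN = os.environ["GKE_SA_TOKEN"]     # k8s service-account bearer token
BACKLOG_THRESHOLD = 100                    # scale up past 100 waiting messages
MAX_REPLICAS = 4

def handler(event, context):
    # 1. How many messages are waiting on the queue?
    sqs = boto3.client("sqs")
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    if backlog <= BACKLOG_THRESHOLD:
        return {"scaled": False, "backlog": backlog}

    # 2. Talk to the GKE cluster's Kubernetes API and bump the deployment's replicas.
    conf = client.Configuration()
    conf.host = GKE_ENDPOINT
    conf.api_key = {"authorization": "Bearer " + GKE_TOKEN}
    conf.verify_ssl = False  # sketch only: pin the cluster CA cert for real
    apps = client.AppsV1Api(client.ApiClient(conf))

    scale = apps.read_namespaced_deployment_scale("my-model", "default")
    new_replicas = min(scale.spec.replicas + 1, MAX_REPLICAS)
    apps.patch_namespaced_deployment_scale(
        "my-model", "default", {"spec": {"replicas": new_replicas}}
    )
    return {"scaled": True, "backlog": backlog, "replicas": new_replicas}
```

The scale-down lambda is the mirror image: same queue check, decrement instead of increment, with a floor at the minimum number of replicas.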

The challenges

A few things that could be useful to a distant reader dealing with similar issues.

GPU taints and tolerations (docs here)

GKE automatically taints all its GPU node pools with nvidia.com/gpu=present so that, in theory, only pods with the corresponding toleration get scheduled on them. I don’t quite know why, but it did not work for us: if you upscaled your deployment - say, from 2 to 3 replicas - your node pool would go up to 3 nodes, but when you downscaled your replicas, your node pool would stay at 3 because other GKE-native pods would have been scheduled on the third node.

As a result, we had to add our own taint to this pool to make sure that it works. We called it mlonly=true.
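For reference, the relevant bit of the model deployment now looks roughly like this; the pool and deployment names are placeholders, and the NoSchedule effect reflects how we set our own taint up.

```yaml
# Sketch of the pod spec section that matters here (placeholder names).
# The first toleration matches the taint GKE puts on GPU node pools,
# the second matches the mlonly=true taint we added ourselves.
spec:
  template:
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
        - key: mlonly
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-nodepool: gpu-pool   # assumed pool name
```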

Another interesting bug was that, for a while, a node pool had a duplicate of this taint. The taints looked like nvidia.com/gpu=present mlonly=true mlonly=true, and this caused our deployments to never get scheduled on them. Granted, we were idiots for having this duplicate and this may not be fully a bug.

Image pulling for containers

That was a little bit frustrating. In short, we wanted that every time we “deploy” a new version of our code, it actually gets deployed. But since our deployment config essentially points at the my-image:latest image, Kubernetes doesn’t know to actively pull a new image whenever you re-apply that config. And using templates for our k8s config isn’t an option yet.

We thought we’d fixed this by adding imagePullPolicy: Always to our deployment config. Turns out this wasn’t enough: that policy only kicks in when a pod actually gets (re)created, and re-applying an unchanged config doesn’t restart anything. So we ended up using an ugly hack which consists of force patching a random parameter into our config to trigger a new rollout, and with it a fresh pull. This hack can be found here.
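In case that link ever dies, the hack boils down to something like this (the deployment and label names are made up):

```bash
# Patch a throwaway label with the current timestamp: the pod template changes,
# Kubernetes does a rolling restart, and imagePullPolicy: Always makes the new
# pods pull a fresh :latest image.
kubectl patch deployment my-model \
  -p "{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"redeploy\":\"$(date +%s)\"}}}}}"
```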

Interestingly enough, in Kubernetes 1.15, there’s a nice command that’ll do that for you, kubectl rollout restart, but GKE doesn’t yet support this version.

CircleCI

A couple of challenges cropped up with CircleCI.

We started using CircleCI’s orbs, which are kinda cool because they make your config look much cleaner. However, they do sometimes get a little rigid, which made us revert to the old Circle ways on a couple of occasions.

One unfortunate annoyance that we have come across on multiple projects is related to image naming. We use the CIRCLE_BRANCH variable to generate our image name, but we also tend to use forward slashes / a lot in our branch names. Forward slashes not being supported in GCP’s container registry, they cause our image pulls to fail. To work around this, we’ve been using a quick regex (brrr…) hack to remove all forward slashes from the image name. Not a challenge per se, but every time I use a regex, it feels as sinful as leaving the water running when I brush my teeth.
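If it helps anyone, the workaround is more or less this; the registry path, image name and step name are made up, and it assumes docker is already authenticated against the registry.

```yaml
# Sketch of the CircleCI build step; the sed call is the "regex hack" that
# strips forward slashes out of the branch name before it's used as a tag.
- run:
    name: Build and push model image
    command: |
      SAFE_BRANCH=$(echo "$CIRCLE_BRANCH" | sed 's|/||g')
      IMAGE="gcr.io/my-project/my-model:${SAFE_BRANCH}"
      docker build -t "$IMAGE" .
      docker push "$IMAGE"
```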

To conclude

We have learnt a couple of things along the way, on which I will make quick-and-dirty tutorials. There is a lot to be said for concentrating your technology with one provider, even though this leads to ugly monopolies.

See you soon!


Astrid de Gasté

Software Engineer at Papercup
