Stop Serving Models with Flask: A DevOps Guide to TensorFlow Serving in 2020
Let’s be honest. The handoff between data science and operations is usually a disaster. Your data scientist hands you a Jupyter Notebook and a .h5 file and expects it to scale. You wrap it in Flask or Django REST Framework, put it behind Gunicorn, and watch your CPU usage spike while latency goes through the roof. Python's Global Interpreter Lock (GIL) is choking your throughput.
I’ve been there. Last month, we tried deploying a BERT-based sentiment analysis tool for a Norwegian e-commerce client using a standard Python web server. The result? 400ms latency per request and timeouts whenever concurrent users hit double digits. That is unacceptable.
The solution isn't a bigger server; it's better architecture. Today, we turn to TensorFlow Serving (TFS). We will look at how to deploy it on a KVM-based VPS with NVMe storage, specifically referencing the architecture we utilize at CoolVDS, to ensure consistent inference times and compliance with Norwegian data protection law.
1. Exporting for Production (The Right Way)
First, get out of the notebook mindset. You need a SavedModel format, not just weights. TensorFlow 2.x (we are using 2.2 for this guide) makes this straightforward, but you must define your signatures correctly.
import tensorflow as tf

# Assuming 'model' is your trained Keras model
version = "1"
export_path = f"./models/sentiment_model/{version}"

# Wrap the forward pass in a tf.function with an explicit input signature,
# so the SavedModel exposes a stable, named serving signature.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, 128], dtype=tf.int32, name="input_ids")
])
def serving_fn(input_ids):
    return model(input_ids)

tf.saved_model.save(model, export_path, signatures=serving_fn)

This creates a structured directory containing a saved_model.pb file and a variables folder. This protocol buffer format is language-agnostic, optimized for C++ execution, and removes the Python overhead entirely.
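Before you ship the artifact, confirm that the signature you just exported is the one TF Serving will actually expose. A quick sanity check, assuming the export path from the snippet above:

import tensorflow as tf

loaded = tf.saved_model.load("./models/sentiment_model/1")
# 'serving_default' is the key TF Serving exposes over REST and gRPC
print(loaded.signatures["serving_default"].structured_input_signature)

If you prefer the shell, saved_model_cli show --dir ./models/sentiment_model/1 --all prints the full signature list.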
2. The Runtime Environment: Docker & TensorFlow Serving
Don't install TensorFlow Serving from source unless you enjoy six-hour Bazel builds. Docker is the industry standard here. However, running Docker on a VPS requires a kernel that supports overlayfs and proper cgroups isolation—something standard OpenVZ containers struggle with. This is why we insist on KVM virtualization at CoolVDS; you get a dedicated kernel.
Pull the image:
docker pull tensorflow/serving:2.2.0

Run it with a mount to your model directory. Note the environment variable MODEL_NAME.
docker run -d --name tf_serving_sentiment \
  -p 8501:8501 \
  -v /path/to/models/sentiment_model:/models/sentiment_model \
  -e MODEL_NAME=sentiment_model \
  tensorflow/serving:2.2.0

At this point, you have a REST API listening on port 8501. But if you stop here, you are leaving performance on the table.
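To smoke-test the endpoint, POST to the model's predict URL. A minimal client sketch, assuming the signature exported above; the token IDs are placeholders, not real tokenizer output:

import requests

# Placeholder token IDs padded to length 128; substitute your tokenizer's output
instance = {"input_ids": [101, 2023, 2003, 2307, 102] + [0] * 123}

resp = requests.post(
    "http://localhost:8501/v1/models/sentiment_model:predict",
    json={"instances": [instance]},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # {"predictions": [...]}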
3. Optimization: Batching and CPU Instructions
Inference is expensive. If you send requests one by one, your CPU spends more time context switching than calculating matrix multiplications. You need dynamic batching. TFS waits a few milliseconds to group incoming requests into a batch, processes them in parallel (leveraging AVX2/AVX-512 instructions), and returns the results.
Create a config file batching_parameters.txt:
max_batch_size { value: 32 }
batch_timeout_micros { value: 2000 }
num_batch_threads { value: 8 }
pad_variable_length_inputs: true

Mount this file into the container and enable it by appending --enable_batching=true and --batching_parameters_file=/path/to/batching_parameters.txt (the in-container path) after the image name in your docker run command. This is where infrastructure matters. Batching is CPU intensive. On a shared hosting environment with "burstable" CPU credits, your inference time will jitter unpredictably. You need dedicated CPU cores.
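Whether batching actually helps is easy to verify from the client side. The sketch below fires concurrent requests and prints latency percentiles; the URL and payload mirror the earlier example and are assumptions, not anything TF Serving mandates:

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8501/v1/models/sentiment_model:predict"
PAYLOAD = {"instances": [{"input_ids": [101, 102] + [0] * 126}]}

def timed_request(_):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=10).raise_for_status()
    return (time.perf_counter() - start) * 1000  # latency in ms

with ThreadPoolExecutor(max_workers=32) as pool:
    latencies = sorted(pool.map(timed_request, range(256)))

print(f"p50={statistics.median(latencies):.1f}ms  "
      f"p95={latencies[int(len(latencies) * 0.95)]:.1f}ms")

With batching enabled, the p95 figure should tighten noticeably under concurrency; on burstable CPU plans it is exactly this tail that blows up.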
Pro Tip: Check your CPU flags. Run lscpu | grep avx. If your host doesn't expose AVX2 to the guest, your TensorFlow throughput can easily be cut in half. CoolVDS nodes are built on modern enterprise processors that guarantee these instruction sets are available to the guest OS.

4. Storage I/O: The Hidden Bottleneck
When TensorFlow Serving starts, or when you hot-swap a new model version, it reads the entire graph into memory. For modern NLP models, this can be hundreds of megabytes or even gigabytes.
On standard SATA SSDs (or heaven forbid, spinning rust), loading a model can cause a service hang of 10-30 seconds. On NVMe storage, which is standard on CoolVDS, we see read speeds exceeding 2000 MB/s. This means your autoscaling groups can spin up new inference nodes in seconds, not minutes.
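You can see the effect directly by timing a cold load of the SavedModel (a rough illustration, not a benchmark; the path is the export directory from section 1):

import time

import tensorflow as tf

start = time.perf_counter()
loaded = tf.saved_model.load("./models/sentiment_model/1")
print(f"SavedModel loaded in {time.perf_counter() - start:.2f}s")  # dominated by disk reads on a cold start

Run it once with a cold page cache and once warm, and the gap between SATA and NVMe becomes obvious.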
5. The Norwegian Context: Latency and Sovereignty
Why host this in Oslo? Two reasons: Latency and Law.
Latency
If your users are in Norway or Northern Europe, routing traffic to a US-East server adds 80-120ms of round-trip time (RTT). For real-time inference (like voice assistants or fraud detection), that delay is noticeable. Hosting locally at NIX (Norwegian Internet Exchange) connected facilities ensures your network latency is under 10ms.
Data Sovereignty
With GDPR in full effect and the scrutiny on data transfers increasing, storing personal data (which inference inputs often are) outside the EEA is risky. While the Privacy Shield is currently in place, legal experts are already warning about its stability. Hosting on CoolVDS ensures your data stays on Norwegian soil, simplifying compliance with Datatilsynet requirements.
6. Architecture Summary
Here is the robust deployment architecture:
| Component | Technology | CoolVDS Advantage |
|---|---|---|
| Orchestration | Docker / Docker Compose | Full KVM support for Docker isolation. |
| Compute | TF Serving (C++) | Dedicated vCPUs with AVX2 support. |
| Storage | SavedModel (.pb) | NVMe I/O for instant model loading. |
| Network | REST / gRPC | Low latency peering in Oslo. |
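The table lists gRPC alongside REST; for high-throughput clients, gRPC is usually the better wire format because it skips JSON serialization. A minimal sketch, assuming you also publish the gRPC port when starting the container (-p 8500:8500) and have the tensorflow-serving-api package installed:

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "sentiment_model"
request.model_spec.signature_name = "serving_default"
# Placeholder token IDs shaped [1, 128], matching the exported signature
request.inputs["input_ids"].CopyFrom(
    tf.make_tensor_proto([[101, 102] + [0] * 126], dtype=tf.int32)
)

response = stub.Predict(request, timeout=5.0)
print(response.outputs)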
Final Thoughts
Taking machine learning models to production is about removing variables. You remove the Python GIL by using TensorFlow Serving. You remove network jitter by hosting close to your users. And you remove resource contention by using dedicated, high-performance infrastructure.
Don't let your infrastructure be the reason your model fails in production.
Ready to test your inference speed? Spin up a CoolVDS NVMe instance in Oslo. It takes 55 seconds to deploy, giving you enough time to grab a coffee before you start pushing Docker images.