TensorFlow in Production: High-Performance Serving Strategies
Your data science team just handed you a frozen GraphDef file. They are celebrating high accuracy on MNIST or ImageNet. But now the real headache begins: how do you take that .pb file and serve predictions to thousands of concurrent users in Oslo without crashing your infrastructure?
Most developers make the mistake of wrapping a TensorFlow session inside a Flask app. This works for a hackathon. In production, Python's Global Interpreter Lock (GIL) and the overhead of HTTP/1.1 JSON serialization will strangle your throughput. If you care about latency—and you should, because every millisecond of delay costs money—you need a bare-metal approach to inference.
Today, we are looking at the bleeding edge of February 2017: deploying the TensorFlow Serving system via Docker, managing gRPC connections, and ensuring your underlying hardware (specifically NVMe and CPU instruction sets) isn't the bottleneck.
The Bottleneck: Why Simple Web Servers Fail
I recently audited a setup for a Norwegian fintech startup trying to run fraud detection models. They were running Gunicorn with Flask. Their response times were averaging 400ms. Why? Because every worker process was trying to load a 300MB weight file into memory, causing massive thrashing, and JSON parsing was eating up CPU cycles that should have gone to matrix multiplication.
Here is what you absolutely should not do for high-load environments:
# BAD PRACTICE: do not use this for high concurrency
from flask import Flask, request, jsonify

app = Flask(__name__)
# sess, x and y are a globally shared tf.Session and graph tensors loaded at startup

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Heavy JSON parsing and a single GIL-bound Session block every request
    result = sess.run(y, feed_dict={x: data['image']})
    return jsonify(result.tolist())
The Solution: TensorFlow Serving & gRPC
Google released TensorFlow Serving to solve exactly this problem. It is a flexible, high-performance serving system for machine learning models, designed for production environments. It deals with the lifecycle management of models and, crucially, supports gRPC. gRPC uses Protocol Buffers, which are binary and significantly smaller and faster than JSON.
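To make the size difference concrete, here is a small illustrative comparison (my own sketch, not part of the official docs) of the same 784-float input encoded as JSON versus as a binary TensorProto; exact byte counts will vary, but the protobuf payload is consistently several times smaller:

import json
import numpy as np
import tensorflow as tf

# One flattened 28x28 image as float32
image = np.random.rand(1, 784).astype(np.float32)

json_payload = json.dumps({'image': image.tolist()})
proto_payload = tf.contrib.util.make_tensor_proto(image, shape=[1, 784]).SerializeToString()

print('JSON payload:     %d bytes' % len(json_payload))
print('Protobuf payload: %d bytes' % len(proto_payload))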
Step 1: Exporting the Model
First, ensure your model is exported correctly using the SavedModelBuilder (recently introduced with TensorFlow 1.0). You cannot just save a checkpoint; you need the signature definition.
import tensorflow as tf
from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants

# sess, x and y come from your training code: the live Session,
# the input placeholder and the output tensor respectively.
export_path = "/models/fraud_detection/1"
builder = saved_model_builder.SavedModelBuilder(export_path)

# Define input and output signatures
tensor_info_x = tf.saved_model.utils.build_tensor_info(x)
tensor_info_y = tf.saved_model.utils.build_tensor_info(y)

prediction_signature = (
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={'images': tensor_info_x},
        outputs={'scores': tensor_info_y},
        method_name=signature_constants.PREDICT_METHOD_NAME))

builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={'predict_images': prediction_signature})
builder.save()
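The trailing 1 in export_path is the model version; TensorFlow Serving watches /models/fraud_detection and automatically loads new numeric subdirectories as you export them. As an optional sanity check (my addition, not part of the export itself), you can reload the SavedModel into a fresh session before handing it to the server:

import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

# Load the export back into a clean graph and list its signature names
with tf.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.saved_model.loader.load(
        sess, [tag_constants.SERVING], "/models/fraud_detection/1")
    print(meta_graph.signature_def.keys())  # should include 'predict_images'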
Step 2: Hosting with Docker
Compiling TensorFlow Serving from source using Bazel is painful and time-consuming. The cleanest way to run this in 2017 is using Docker. Ensure your host machine has Docker 1.12+ installed.
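Before pulling anything, a quick sanity check that the daemon is up and recent enough:

docker --version              # should report 1.12 or newer
docker info > /dev/null && echo "daemon is reachable"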
We will mount the model directory from the host (CoolVDS instance) into the container. Note that we are binding port 8500 for the gRPC API.
docker run -d -p 8500:8500 \
  -v /home/user/models/fraud_detection:/models/fraud_detection \
  -e MODEL_NAME=fraud_detection \
  -t tensorflow/serving
Check the logs immediately (commands below). Path errors such as "Netloc mismatch" usually point at a bad bind mount, while OOM kills mean your instance doesn't have enough RAM to map the graph. This is where hardware selection becomes critical.
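To tail the startup sequence and confirm the model version actually loaded, grab the container ID first; the <container_id> placeholder is yours to fill in:

docker ps                      # note the ID of the tensorflow/serving container
docker logs -f <container_id>  # watch for the fraud_detection model being loaded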
Hardware Matters: AVX and NVMe
Running deep learning inference is CPU intensive. TensorFlow binaries built with CPU optimizations rely on AVX (Advanced Vector Extensions). If your VPS provider puts you on old hardware or masks CPU instruction sets in the hypervisor, TensorFlow will either crash with "Illegal instruction" or fall back to slow, non-vectorized math.
Pro Tip: Check your CPU flags immediately upon logging in:

cat /proc/cpuinfo | grep avx

If you don't see `avx` or `avx2`, you are paying for dead weight. CoolVDS guarantees KVM virtualization with full CPU passthrough, ensuring your model can actually use the vector instructions it was compiled for.
Furthermore, model loading is I/O heavy. When TensorFlow Serving starts, or when you swap a model version, it reads the entire graph into memory. On standard SATA SSDs or spinning rust, this can take seconds. On CoolVDS NVMe storage, we consistently see read speeds exceeding 2000 MB/s, making model swapping almost instantaneous.
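If you want to sanity-check the read speed on your own instance, a crude buffered-read benchmark is enough; the device name below is an assumption, so confirm it with lsblk first:

lsblk                         # identify the NVMe device, e.g. nvme0n1
sudo hdparm -t /dev/nvme0n1   # timed buffered sequential reads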
The Client: Consuming Predictions via gRPC
Now that the server is listening on port 8500, we need a client. You cannot use `curl` easily here because the payload is binary Protocol Buffers, not JSON. You need a Python client stub.
from grpc.beta import implementations
import tensorflow as tf
# Generated from the TensorFlow Serving API protos
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

host = '10.0.0.5'  # Private LAN IP of your CoolVDS instance
port = 8500

channel = implementations.insecure_channel(host, port)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

# image_data: a flat float32 array of length 784, matching the exported input shape
request = predict_pb2.PredictRequest()
request.model_spec.name = 'fraud_detection'
request.model_spec.signature_name = 'predict_images'
request.inputs['images'].CopyFrom(
    tf.contrib.util.make_tensor_proto(image_data, shape=[1, 784]))

result = stub.Predict(request, 10.0)  # 10 second timeout
print(result)
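Once the stub returns sensible results, a crude way to measure the round-trip numbers discussed in the next section is to time a batch of sequential calls from a client in the same data centre; this reuses the request object from above and is purely illustrative:

import time

# Time 100 sequential Predict calls and report the median round trip in ms
latencies = []
for _ in range(100):
    start = time.time()
    stub.Predict(request, 10.0)
    latencies.append((time.time() - start) * 1000.0)

print('median round trip: %.2f ms' % sorted(latencies)[len(latencies) // 2])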
Data Sovereignty & Latency in Norway
We are seeing stricter enforcement from Datatilsynet regarding where user data is processed. If you are serving predictions for Norwegian users, routing that data to a US-based cloud adds 100ms+ of latency and potential legal headaches under Privacy Shield (the replacement for the invalidated Safe Harbor agreement).
Hosting your inference engine in Oslo solves both issues. Latency drops to <5ms for local users. With CoolVDS, you get the raw compute power of a dedicated server with the flexibility of a VPS. We don't oversubscribe our CPU cores, meaning when your model needs 100% of a core for matrix multiplication, it gets it.
Conclusion
Deploying machine learning models in 2017 requires moving beyond the Jupyter notebook. You need the stability of Docker, the speed of gRPC, and the raw throughput of NVMe storage. Don't let your infrastructure be the reason your model underperforms.
Ready to deploy? Spin up a Performance NVMe instance on CoolVDS today. We support Docker and custom kernels out of the box.