TensorFlow in Production: High-Performance Serving Strategies
Your data science team just handed you a frozen GraphDef file. They are celebrating high accuracy on MNIST or ImageNet. But now the real headache begins: how do you take that .pb file and serve predictions to thousands of concurrent users in Oslo without crashing your infrastructure?
Most developers make the mistake of wrapping a TensorFlow session inside a Flask app. This works for a hackathon. In production, Python's Global Interpreter Lock (GIL) and the overhead of HTTP/1.1 JSON serialization will strangle your throughput. If you care about latency—and you should, because every millisecond of delay costs money—you need a bare-metal approach to inference.
Today, we are looking at the bleeding edge of February 2017: deploying the TensorFlow Serving system via Docker, managing gRPC connections, and ensuring your underlying hardware (specifically NVMe and CPU instruction sets) isn't the bottleneck.
The Bottleneck: Why Simple Web Servers Fail
I recently audited a setup for a Norwegian fintech startup trying to run fraud detection models. They were running Gunicorn with Flask. Their response times were averaging 400ms. Why? Because every worker process was trying to load a 300MB weight file into memory, causing massive thrashing, and JSON parsing was eating up CPU cycles that should have gone to matrix multiplication.
Here is what you absolutely should not do for high-load environments:
# BAD PRACTICE: do not use this for high concurrency
from flask import Flask, request, jsonify

app = Flask(__name__)
# sess, x and y are a globally shared tf.Session and graph tensors loaded at startup

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Heavy JSON parsing and a single GIL-bound Session block every request
    result = sess.run(y, feed_dict={x: data['image']})
    return jsonify(result.tolist())
The Solution: TensorFlow Serving & gRPC
Google released TensorFlow Serving to solve exactly this problem. It is a flexible, high-performance serving system for machine learning models, designed for production environments. It deals with the lifecycle management of models and, crucially, supports gRPC. gRPC uses Protocol Buffers, which are binary and significantly smaller and faster than JSON.
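To make the size difference concrete, here is a small illustrative comparison (my own sketch, not part of the official docs) of the same 784-float input encoded as JSON versus as a binary TensorProto; exact byte counts will vary, but the protobuf payload is consistently several times smaller:

import json
import numpy as np
import tensorflow as tf

# One flattened 28x28 image as float32
image = np.random.rand(1, 784).astype(np.float32)

json_payload = json.dumps({'image': image.tolist()})
proto_payload = tf.contrib.util.make_tensor_proto(image, shape=[1, 784]).SerializeToString()

print('JSON payload:     %d bytes' % len(json_payload))
print('Protobuf payload: %d bytes' % len(proto_payload))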
Step 1: Exporting the Model
First, ensure your model is exported correctly using the SavedModelBuilder (recently introduced with TensorFlow 1.0). You cannot just save a checkpoint; you need the signature definition.
import tensorflow as tf
from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants

# sess, x and y come from your training code: the live Session,
# the input placeholder and the output tensor respectively.
export_path = "/models/fraud_detection/1"
builder = saved_model_builder.SavedModelBuilder(export_path)

# Define input and output signatures
tensor_info_x = tf.saved_model.utils.build_tensor_info(x)
tensor_info_y = tf.saved_model.utils.build_tensor_info(y)

prediction_signature = (
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={'images': tensor_info_x},
        outputs={'scores': tensor_info_y},
        method_name=signature_constants.PREDICT_METHOD_NAME))

builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={'predict_images': prediction_signature})
builder.save()
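The trailing 1 in export_path is the model version; TensorFlow Serving watches /models/fraud_detection and automatically loads new numeric subdirectories as you export them. As an optional sanity check (my addition, not part of the export itself), you can reload the SavedModel into a fresh session before handing it to the server:

import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

# Load the export back into a clean graph and list its signature names
with tf.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.saved_model.loader.load(
        sess, [tag_constants.SERVING], "/models/fraud_detection/1")
    print(meta_graph.signature_def.keys())  # should include 'predict_images'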
Step 2: Hosting with Docker
Compiling TensorFlow Serving from source using Bazel is painful and time-consuming. The cleanest way to run this in 2017 is using Docker. Ensure your host machine has Docker 1.12+ installed.
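Before pulling anything, a quick sanity check that the daemon is up and recent enough:

docker --version              # should report 1.12 or newer
docker info > /dev/null && echo "daemon is reachable"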
We will mount the model directory from the host (CoolVDS instance) into the container. Note that we are binding port 8500 for the gRPC API.
docker run -d -p 8500:8500 \
  -v /home/user/models/fraud_detection:/models/fraud_detection \
  -e MODEL_NAME=fraud_detection \
  -t tensorflow/serving
Check the logs immediately (commands below). Path errors such as "Netloc mismatch" usually point at a bad bind mount, while OOM kills mean your instance doesn't have enough RAM to map the graph. This is where hardware selection becomes critical.
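To tail the startup sequence and confirm the model version actually loaded, grab the container ID first; the <container_id> placeholder is yours to fill in:

docker ps                      # note the ID of the tensorflow/serving container
docker logs -f <container_id>  # watch for the fraud_detection model being loaded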
Hardware Matters: AVX and NVMe
Running deep learning inference is CPU intensive. TensorFlow binaries built with CPU optimizations rely on AVX (Advanced Vector Extensions). If your VPS provider puts you on old hardware or masks CPU instruction sets in the hypervisor, TensorFlow will either crash with "Illegal instruction" or fall back to slow, non-vectorized math.
Pro Tip: Check your CPU flags immediately upon logging in:

cat /proc/cpuinfo | grep avx

If you don't see `avx` or `avx2`, you are paying for dead weight. CoolVDS guarantees KVM virtualization with full CPU passthrough, ensuring your model can actually use the vector instructions it was compiled for.
Furthermore, model loading is I/O heavy. When TensorFlow Serving starts, or when you swap a model version, it reads the entire graph into memory. On standard SATA SSDs or spinning rust, this can take seconds. On CoolVDS NVMe storage, we consistently see read speeds exceeding 2000 MB/s, making model swapping almost instantaneous.
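If you want to sanity-check the read speed on your own instance, a crude buffered-read benchmark is enough; the device name below is an assumption, so confirm it with lsblk first:

lsblk                         # identify the NVMe device, e.g. nvme0n1
sudo hdparm -t /dev/nvme0n1   # timed buffered sequential reads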
The Client: Consuming Predictions via gRPC
Now that the server is listening on port 8500, we need a client. You cannot use `curl` easily here because the payload is binary Protocol Buffers, not JSON. You need a Python client stub.
from grpc.beta import implementations
import tensorflow as tf
# Generated from the TensorFlow Serving API protos
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

host = '10.0.0.5'  # Private LAN IP of your CoolVDS instance
port = 8500

channel = implementations.insecure_channel(host, port)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

# image_data: a flat float32 array of length 784, matching the exported input shape
request = predict_pb2.PredictRequest()
request.model_spec.name = 'fraud_detection'
request.model_spec.signature_name = 'predict_images'
request.inputs['images'].CopyFrom(
    tf.contrib.util.make_tensor_proto(image_data, shape=[1, 784]))

result = stub.Predict(request, 10.0)  # 10 second timeout
print(result)
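Once the stub returns sensible results, a crude way to measure the round-trip numbers discussed in the next section is to time a batch of sequential calls from a client in the same data centre; this reuses the request object from above and is purely illustrative:

import time

# Time 100 sequential Predict calls and report the median round trip in ms
latencies = []
for _ in range(100):
    start = time.time()
    stub.Predict(request, 10.0)
    latencies.append((time.time() - start) * 1000.0)

print('median round trip: %.2f ms' % sorted(latencies)[len(latencies) // 2])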
Data Sovereignty & Latency in Norway
We are seeing stricter enforcement from Datatilsynet regarding where user data is processed. If you are serving predictions for Norwegian users, routing that data to a US-based cloud adds 100ms+ of latency and potential legal headaches under Privacy Shield (the replacement for the invalidated Safe Harbor agreement).
Hosting your inference engine in Oslo solves both issues. Latency drops to <5ms for local users. With CoolVDS, you get the raw compute power of a dedicated server with the flexibility of a VPS. We don't oversubscribe our CPU cores, meaning when your model needs 100% of a core for matrix multiplication, it gets it.
Conclusion
Deploying machine learning models in 2017 requires moving beyond the Jupyter notebook. You need the stability of Docker, the speed of gRPC, and the raw throughput of NVMe storage. Don't let your infrastructure be the reason your model underperforms.
Ready to deploy? Spin up a Performance NVMe instance on CoolVDS today. We support Docker and custom kernels out of the box.