<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Deploying_ML_Models_with_TensorRT</id>
	<title>Deploying ML Models with TensorRT - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://serverrental.store/index.php?action=history&amp;feed=atom&amp;title=Deploying_ML_Models_with_TensorRT"/>
	<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Deploying_ML_Models_with_TensorRT&amp;action=history"/>
	<updated>2026-04-14T23:04:18Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.36.1</generator>
	<entry>
		<id>https://serverrental.store/index.php?title=Deploying_ML_Models_with_TensorRT&amp;diff=5724&amp;oldid=prev</id>
		<title>Admin: New server guide</title>
		<link rel="alternate" type="text/html" href="https://serverrental.store/index.php?title=Deploying_ML_Models_with_TensorRT&amp;diff=5724&amp;oldid=prev"/>
		<updated>2026-04-12T15:53:23Z</updated>

		<summary type="html">&lt;p&gt;New server guide&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;= Deploying ML Models with TensorRT =&lt;br /&gt;
&lt;br /&gt;
This guide provides a practical, hands-on approach to optimizing and deploying machine learning models for high-performance inference using NVIDIA's TensorRT. TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning applications.&lt;br /&gt;
&lt;br /&gt;
== Prerequisites ==&lt;br /&gt;
&lt;br /&gt;
Before you begin, ensure you have the following:&lt;br /&gt;
&lt;br /&gt;
*   A Linux server with an NVIDIA GPU.&lt;br /&gt;
*   NVIDIA drivers installed and functioning correctly. You can verify this by running:&lt;br /&gt;
    &amp;lt;pre&amp;gt;nvidia-smi&amp;lt;/pre&amp;gt;&lt;br /&gt;
*   CUDA Toolkit installed. TensorRT requires a compatible CUDA version.&lt;br /&gt;
*   cuDNN library installed.&lt;br /&gt;
*   Python 3.6+ and pip installed.&lt;br /&gt;
*   An existing trained machine learning model in a supported format (e.g., ONNX, TensorFlow SavedModel, PyTorch JIT).&lt;br /&gt;
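&lt;br /&gt;
As a quick sanity check of the GPU environment, the device can also be queried from Python via `pycuda` (which this guide uses later for inference). This is an optional sketch, assuming `pycuda` is already installed:&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    import pycuda.driver as cuda&lt;br /&gt;
&lt;br /&gt;
    cuda.init()&lt;br /&gt;
    dev = cuda.Device(0)&lt;br /&gt;
    print(&amp;quot;GPU:&amp;quot;, dev.name())&lt;br /&gt;
    print(&amp;quot;Compute capability:&amp;quot;, dev.compute_capability())&lt;br /&gt;
    print(&amp;quot;Total memory (MiB):&amp;quot;, dev.total_memory() // (1024 * 1024))&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;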
&lt;br /&gt;
== Installing TensorRT ==&lt;br /&gt;
&lt;br /&gt;
TensorRT can be installed in several ways. For most users, the recommended method is via pip or by downloading a tarball from the NVIDIA Developer website.&lt;br /&gt;
&lt;br /&gt;
=== Method 1: Installing via pip (Recommended) ===&lt;br /&gt;
&lt;br /&gt;
1.  Ensure your Python environment is set up (e.g., using a virtual environment).&lt;br /&gt;
2.  Install the TensorRT package using pip, replacing `X.Y.Z` with the TensorRT version that matches your installed CUDA Toolkit:&lt;br /&gt;
    &amp;lt;pre&amp;gt;pip install nvidia-tensorrt==X.Y.Z --extra-index-url https://pypi.ngc.nvidia.com&amp;lt;/pre&amp;gt;&lt;br /&gt;
    *   '''Example for TensorRT 8.6 and CUDA 11.8:'''&lt;br /&gt;
        &amp;lt;pre&amp;gt;pip install nvidia-tensorrt==8.6.1.2 --extra-index-url https://pypi.ngc.nvidia.com&amp;lt;/pre&amp;gt;&lt;br /&gt;
3.  Verify the installation:&lt;br /&gt;
    &amp;lt;pre&amp;gt;python -c &amp;quot;import tensorrt as trt; print(trt.__version__)&amp;quot;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Method 2: Installing from Tarball ===&lt;br /&gt;
&lt;br /&gt;
1.  Download the TensorRT tarball that matches your CUDA and driver versions from the [https://developer.nvidia.com/tensorrt NVIDIA Developer website].&lt;br /&gt;
2.  Extract the tarball:&lt;br /&gt;
    &amp;lt;pre&amp;gt;tar xzvf TensorRT-X.Y.Z.Linux.x86_64-gnu.cuda-A.B.tar.gz&amp;lt;/pre&amp;gt;&lt;br /&gt;
3.  Navigate to the extracted directory and run the install script:&lt;br /&gt;
    &amp;lt;pre&amp;gt;cd TensorRT-X.Y.Z&amp;lt;/pre&amp;gt;&lt;br /&gt;
    &amp;lt;pre&amp;gt;sudo ./install.sh&amp;lt;/pre&amp;gt;&lt;br /&gt;
4.  Add TensorRT to your system's library path (optional, but recommended):&lt;br /&gt;
    &amp;lt;pre&amp;gt;echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/TensorRT-X.Y.Z/lib' &amp;gt;&amp;gt; ~/.bashrc&amp;lt;/pre&amp;gt;&lt;br /&gt;
    &amp;lt;pre&amp;gt;source ~/.bashrc&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Converting Models to a TensorRT Engine ==&lt;br /&gt;
&lt;br /&gt;
TensorRT uses its own optimized engine format. You'll typically convert your trained model to this format. The most common intermediate format is ONNX.&lt;br /&gt;
&lt;br /&gt;
=== Converting PyTorch to ONNX ===&lt;br /&gt;
&lt;br /&gt;
1.  Export your PyTorch model to ONNX format:&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    import torch&lt;br /&gt;
&lt;br /&gt;
    # Load your PyTorch model&lt;br /&gt;
    model = YourPyTorchModel()&lt;br /&gt;
    model.load_state_dict(torch.load('your_model.pth'))&lt;br /&gt;
    model.eval()&lt;br /&gt;
&lt;br /&gt;
    # Create a dummy input tensor&lt;br /&gt;
    dummy_input = torch.randn(1, 3, 224, 224) # Example input shape&lt;br /&gt;
&lt;br /&gt;
    # Export to ONNX&lt;br /&gt;
    torch.onnx.export(model,&lt;br /&gt;
                      dummy_input,&lt;br /&gt;
                      &amp;quot;your_model.onnx&amp;quot;,&lt;br /&gt;
                      export_params=True,&lt;br /&gt;
                      opset_version=11, # Use an appropriate opset version&lt;br /&gt;
                      do_constant_folding=True,&lt;br /&gt;
                      input_names=['input'],&lt;br /&gt;
                      output_names=['output'])&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Converting TensorFlow to ONNX ===&lt;br /&gt;
&lt;br /&gt;
You can use tools like `tf2onnx` to convert TensorFlow SavedModels or Keras models to ONNX.&lt;br /&gt;
&lt;br /&gt;
1.  Install `tf2onnx`:&lt;br /&gt;
    &amp;lt;pre&amp;gt;pip install tf2onnx&amp;lt;/pre&amp;gt;&lt;br /&gt;
2.  Convert the model:&lt;br /&gt;
    &amp;lt;pre&amp;gt;python -m tf2onnx.convert --saved-model /path/to/your/tf_model --output your_model.onnx --opset 13&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Building the TensorRT Engine from ONNX ===&lt;br /&gt;
&lt;br /&gt;
TensorRT provides a Python API to build an optimized engine from an ONNX file.&lt;br /&gt;
&lt;br /&gt;
1.  Install the TensorRT Python bindings if you haven't already (usually included with pip installation).&lt;br /&gt;
2.  Use the TensorRT Python API to build the engine:&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    import tensorrt as trt&lt;br /&gt;
&lt;br /&gt;
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)&lt;br /&gt;
    builder = trt.Builder(TRT_LOGGER)&lt;br /&gt;
    network = builder.create_network(1 &amp;lt;&amp;lt; int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))&lt;br /&gt;
    parser = trt.OnnxParser(network, TRT_LOGGER)&lt;br /&gt;
&lt;br /&gt;
    # Load the ONNX model&lt;br /&gt;
    with open(&amp;quot;your_model.onnx&amp;quot;, &amp;quot;rb&amp;quot;) as model:&lt;br /&gt;
        if not parser.parse(model.read()):&lt;br /&gt;
            print(&amp;quot;ERROR: Failed to parse the ONNX file.&amp;quot;)&lt;br /&gt;
            for error in range(parser.num_errors):&lt;br /&gt;
                print(parser.get_error(error))&lt;br /&gt;
            exit(1)&lt;br /&gt;
&lt;br /&gt;
    # Build the engine&lt;br /&gt;
    config = builder.create_builder_config()&lt;br /&gt;
    # Limit the builder's workspace memory (1 GiB here; adjust for your GPU).&lt;br /&gt;
    # Note: builder.max_batch_size is ignored for explicit-batch networks;&lt;br /&gt;
    # varying batch sizes are handled via an optimization profile (below).&lt;br /&gt;
    config.max_workspace_size = 1 &amp;lt;&amp;lt; 30&lt;br /&gt;
&lt;br /&gt;
    # Optional: Configure optimization profile for dynamic shapes&lt;br /&gt;
    # profile = builder.create_optimization_profile()&lt;br /&gt;
    # profile.set_shape(&amp;quot;input&amp;quot;, (1, 3, 224, 224), (1, 3, 256, 256), (1, 3, 288, 288))&lt;br /&gt;
    # config.add_optimization_profile(profile)&lt;br /&gt;
&lt;br /&gt;
    # Set precision (FP16, INT8) if supported by your GPU and desired&lt;br /&gt;
    # config.set_flag(trt.BuilderFlag.FP16)&lt;br /&gt;
&lt;br /&gt;
    # build_engine is deprecated in newer TensorRT releases in favor of&lt;br /&gt;
    # builder.build_serialized_network(network, config)&lt;br /&gt;
    engine = builder.build_engine(network, config)&lt;br /&gt;
    if engine is None:&lt;br /&gt;
        print(&amp;quot;ERROR: Failed to build the TensorRT engine.&amp;quot;)&lt;br /&gt;
        exit(1)&lt;br /&gt;
&lt;br /&gt;
    # Serialize and save the engine&lt;br /&gt;
    with open(&amp;quot;your_model.engine&amp;quot;, &amp;quot;wb&amp;quot;) as f:&lt;br /&gt;
        f.write(engine.serialize())&lt;br /&gt;
&lt;br /&gt;
    print(&amp;quot;TensorRT engine built successfully and saved to your_model.engine&amp;quot;)&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;
    *   '''Note on Dynamic Batching:''' If your model needs to handle varying batch sizes at inference time, you must set up an `OptimizationProfile`.&lt;br /&gt;
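    *   The profile setup sketched in the commented lines above can be spelled out as follows; the binding name `input` and the shape ranges are illustrative assumptions based on the earlier export example:&lt;br /&gt;
        &amp;lt;pre&amp;gt;&lt;br /&gt;
        # Allow dynamic batch sizes 1-8 for the 'input' binding (illustrative)&lt;br /&gt;
        profile = builder.create_optimization_profile()&lt;br /&gt;
        profile.set_shape(&amp;quot;input&amp;quot;,&lt;br /&gt;
                          (1, 3, 224, 224),  # minimum shape&lt;br /&gt;
                          (4, 3, 224, 224),  # shape the builder optimizes for&lt;br /&gt;
                          (8, 3, 224, 224))  # maximum shape&lt;br /&gt;
        config.add_optimization_profile(profile)&lt;br /&gt;
        &amp;lt;/pre&amp;gt;&lt;br /&gt;
        Note that the ONNX export must also mark the batch dimension as dynamic (e.g. via the `dynamic_axes` argument of `torch.onnx.export`) for the profile to take effect.&lt;br /&gt;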
&lt;br /&gt;
== Performing Inference with TensorRT ==&lt;br /&gt;
&lt;br /&gt;
Once you have your optimized TensorRT engine (`.engine` file), you can use the TensorRT runtime to perform inference.&lt;br /&gt;
&lt;br /&gt;
1.  Load the TensorRT engine:&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    import tensorrt as trt&lt;br /&gt;
    import pycuda.driver as cuda&lt;br /&gt;
    import pycuda.autoinit&lt;br /&gt;
    import numpy as np&lt;br /&gt;
&lt;br /&gt;
    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)&lt;br /&gt;
&lt;br /&gt;
    def load_engine(engine_file_path):&lt;br /&gt;
        with open(engine_file_path, &amp;quot;rb&amp;quot;) as f, trt.Runtime(TRT_LOGGER) as runtime:&lt;br /&gt;
            return runtime.deserialize_cuda_engine(f.read())&lt;br /&gt;
&lt;br /&gt;
    engine = load_engine(&amp;quot;your_model.engine&amp;quot;)&lt;br /&gt;
    if not engine:&lt;br /&gt;
        print(&amp;quot;Failed to load TensorRT engine.&amp;quot;)&lt;br /&gt;
        exit(1)&lt;br /&gt;
&lt;br /&gt;
    context = engine.create_execution_context()&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
2.  Prepare input data and perform inference:&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    # Assuming your input is a NumPy array with shape (batch_size, channels, height, width)&lt;br /&gt;
    # and matches the input shape expected by the engine.&lt;br /&gt;
    input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)&lt;br /&gt;
&lt;br /&gt;
    # Allocate device memory for input and output&lt;br /&gt;
    inputs, outputs, bindings, stream = allocate_buffers(engine)&lt;br /&gt;
&lt;br /&gt;
    # Transfer input data to the GPU&lt;br /&gt;
    np.copyto(inputs[0].host, input_data.ravel())&lt;br /&gt;
    batch_size = input_data.shape[0]&lt;br /&gt;
&lt;br /&gt;
    # Run inference, reusing the execution context created in step 1.&lt;br /&gt;
    # For dynamic shapes, set the binding shape first, e.g.:&lt;br /&gt;
    # context.set_binding_shape(engine.get_binding_index(&amp;quot;input&amp;quot;), input_data.shape)&lt;br /&gt;
    trt_outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)&lt;br /&gt;
&lt;br /&gt;
    # Process output data (convert from GPU to host, reshape)&lt;br /&gt;
    # For example:&lt;br /&gt;
    # output_data = trt_outputs[0].host.reshape(expected_output_shape)&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;
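&lt;br /&gt;
    For a typical image-classification model, the flat output buffer can be turned into class probabilities as in this sketch (pure NumPy; the class count of 1000 is an illustrative assumption):&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    # Hypothetical flat FP32 output buffer from the engine (1000 classes assumed)&lt;br /&gt;
    flat_output = np.random.rand(1000).astype(np.float32)&lt;br /&gt;
    logits = flat_output.reshape(1, 1000)&lt;br /&gt;
&lt;br /&gt;
    # Numerically stable softmax over the class axis&lt;br /&gt;
    shifted = logits - logits.max(axis=1, keepdims=True)&lt;br /&gt;
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)&lt;br /&gt;
&lt;br /&gt;
    top_class = int(probs.argmax(axis=1)[0])&lt;br /&gt;
    print(&amp;quot;Predicted class:&amp;quot;, top_class, &amp;quot;confidence:&amp;quot;, float(probs[0, top_class]))&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;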
&lt;br /&gt;
3.  Helper functions for buffer allocation and inference (simplified example):&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    from collections import namedtuple&lt;br /&gt;
&lt;br /&gt;
    # Simple host/device buffer pair; attribute access (.host, .device)&lt;br /&gt;
    # matches how the buffers are used in the inference code above&lt;br /&gt;
    HostDeviceMem = namedtuple('HostDeviceMem', ['host', 'device'])&lt;br /&gt;
&lt;br /&gt;
    def allocate_buffers(engine):&lt;br /&gt;
        inputs = []&lt;br /&gt;
        outputs = []&lt;br /&gt;
        bindings = []&lt;br /&gt;
        stream = cuda.Stream()&lt;br /&gt;
        for binding in engine:&lt;br /&gt;
            # With an explicit-batch engine, the binding shape already&lt;br /&gt;
            # includes the batch dimension&lt;br /&gt;
            size = trt.volume(engine.get_binding_shape(binding))&lt;br /&gt;
            dtype = trt.nptype(engine.get_binding_dtype(binding))&lt;br /&gt;
            # Allocate page-locked host memory and matching device memory&lt;br /&gt;
            host_mem = cuda.pagelocked_empty(size, dtype)&lt;br /&gt;
            device_mem = cuda.mem_alloc(host_mem.nbytes)&lt;br /&gt;
            # Append the device buffer address to bindings&lt;br /&gt;
            bindings.append(int(device_mem))&lt;br /&gt;
            # Append to the appropriate list (input or output)&lt;br /&gt;
            if engine.binding_is_input(binding):&lt;br /&gt;
                inputs.append(HostDeviceMem(host_mem, device_mem))&lt;br /&gt;
            else:&lt;br /&gt;
                outputs.append(HostDeviceMem(host_mem, device_mem))&lt;br /&gt;
        return inputs, outputs, bindings, stream&lt;br /&gt;
&lt;br /&gt;
    def do_inference(context, bindings, inputs, outputs, stream):&lt;br /&gt;
        # Transfer input data to the GPU&lt;br /&gt;
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]&lt;br /&gt;
        # Run inference&lt;br /&gt;
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)&lt;br /&gt;
        # Transfer predictions back from the GPU&lt;br /&gt;
        [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]&lt;br /&gt;
        # Synchronize the stream&lt;br /&gt;
        stream.synchronize()&lt;br /&gt;
        # Return only the host buffers&lt;br /&gt;
        return [out.host for out in outputs]&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Troubleshooting ==&lt;br /&gt;
&lt;br /&gt;
*   '''CUDA/Driver Mismatch:''' Ensure your installed CUDA Toolkit, cuDNN, and NVIDIA drivers are compatible with the TensorRT version you are using. Check the TensorRT release notes for compatibility matrices.&lt;br /&gt;
*   '''ONNX Parsing Errors:''' If `parser.parse()` fails, carefully examine the error messages provided by `parser.get_error()`. This often indicates unsupported ONNX operators or incorrect ONNX graph structure. You might need to use a different opset version or modify your model export.&lt;br /&gt;
*   '''Out of Memory (OOM) Errors:''' This can happen during engine building or inference.&lt;br /&gt;
    *   During building: Try reducing the builder workspace size (e.g. `config.max_workspace_size`) or disabling FP16/INT8 precision if it's causing issues.&lt;br /&gt;
    *   During inference: Ensure your input batch size is not exceeding what the engine was built for, especially if not using dynamic batching. For very large models, consider using smaller batch sizes or models optimized for lower precision (INT8).&lt;br /&gt;
*   '''Incorrect Inference Results:''' Double-check input data preprocessing (normalization, data type, shape) and output data postprocessing. Ensure they exactly match what the model was trained with and what TensorRT expects.&lt;br /&gt;
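&lt;br /&gt;
When chasing ONNX parsing errors, it can help to validate the ONNX file independently of TensorRT first. A minimal sketch, assuming the `onnx` Python package is installed and using the file name from the earlier examples:&lt;br /&gt;
    &amp;lt;pre&amp;gt;&lt;br /&gt;
    import onnx&lt;br /&gt;
&lt;br /&gt;
    model = onnx.load(&amp;quot;your_model.onnx&amp;quot;)&lt;br /&gt;
    # Raises onnx.checker.ValidationError if the graph is malformed&lt;br /&gt;
    onnx.checker.check_model(model)&lt;br /&gt;
    # Show which opset(s) the model was exported with&lt;br /&gt;
    for opset in model.opset_import:&lt;br /&gt;
        print(opset.domain or &amp;quot;ai.onnx&amp;quot;, opset.version)&lt;br /&gt;
    &amp;lt;/pre&amp;gt;&lt;br /&gt;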
&lt;br /&gt;
== Further Reading ==&lt;br /&gt;
&lt;br /&gt;
*   [[NVIDIA TensorRT Documentation]]&lt;br /&gt;
*   [[Optimizing Models with ONNX Runtime]]&lt;br /&gt;
&lt;br /&gt;
[[Category:AI and GPU]]&lt;br /&gt;
[[Category:Machine Learning]]&lt;br /&gt;
[[Category:Performance Tuning]]&lt;/div&gt;</summary>
		<author><name>Admin</name></author>
	</entry>
</feed>