
⚡ Cutting Out the Middleman: Running AI³ Models Directly on cuDNN

One-line summary:
By bypassing PyTorch and running AI³ models directly on cuDNN, we cut inference latency by up to 45% and gained full control over GPU performance.

When most people think of deep learning frameworks, they think PyTorch or TensorFlow. These are amazing ecosystems — they make experimentation simple, abstract away low-level details, and come with vibrant communities. But here’s the catch: that abstraction comes with overhead.

At AI³, one of our goals is to push inference latency down to the absolute minimum. That means asking a tough question:
👉 What if we bypassed high-level frameworks altogether and talked directly to the GPU?

Why cuDNN?

cuDNN is NVIDIA’s low-level CUDA library built specifically for deep neural networks. It handles the math kernels — convolutions, pooling, normalization — that power almost every model. Normally, frameworks like PyTorch call into cuDNN under the hood. But when you go direct-to-cuDNN, you:

  • Eliminate framework overhead
  • Gain fine-grained control over memory layout and execution
  • Unlock optimizations that frameworks often don’t expose (a minimal sketch follows this list)
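
To make "direct-to-cuDNN" concrete, here is a minimal sketch of a single convolution forward pass using cuDNN's descriptor API. The tensor shapes, the fixed algorithm choice, and the bare-bones error handling are illustrative assumptions, not our production code; the point is to show the knobs a framework normally hides: tensor layout, workspace sizing, and algorithm selection.

```cpp
// Minimal sketch: one convolution forward pass straight through cuDNN.
// Shapes, algorithm choice, and error handling are illustrative assumptions.
#include <cudnn.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK_CUDNN(call)                                              \
    do {                                                               \
        cudnnStatus_t s = (call);                                      \
        if (s != CUDNN_STATUS_SUCCESS) {                               \
            std::printf("cuDNN error: %s\n", cudnnGetErrorString(s));  \
            std::exit(1);                                              \
        }                                                              \
    } while (0)

int main() {
    cudnnHandle_t handle;
    CHECK_CUDNN(cudnnCreate(&handle));

    // Assumed example shapes: 1 x 3 x 256 x 256 input, 64 filters of 3x3.
    const int n = 1, c = 3, h = 256, w = 256, k = 64, r = 3, s = 3, pad = 1, stride = 1;

    cudnnTensorDescriptor_t inDesc, outDesc;
    cudnnFilterDescriptor_t filtDesc;
    cudnnConvolutionDescriptor_t convDesc;

    CHECK_CUDNN(cudnnCreateTensorDescriptor(&inDesc));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(inDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w));
    CHECK_CUDNN(cudnnCreateFilterDescriptor(&filtDesc));
    CHECK_CUDNN(cudnnSetFilter4dDescriptor(filtDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, k, c, r, s));
    CHECK_CUDNN(cudnnCreateConvolutionDescriptor(&convDesc));
    CHECK_CUDNN(cudnnSetConvolution2dDescriptor(convDesc, pad, pad, stride, stride, 1, 1,
                                                CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT));

    // Derive the output shape from the input, filter, and convolution descriptors.
    int on, oc, oh, ow;
    CHECK_CUDNN(cudnnGetConvolution2dForwardOutputDim(convDesc, inDesc, filtDesc, &on, &oc, &oh, &ow));
    CHECK_CUDNN(cudnnCreateTensorDescriptor(&outDesc));
    CHECK_CUDNN(cudnnSetTensor4dDescriptor(outDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, on, oc, oh, ow));

    // Explicit device allocations: memory layout and lifetime are entirely ours.
    float *dIn, *dFilt, *dOut;
    cudaMalloc(&dIn,   sizeof(float) * n * c * h * w);
    cudaMalloc(&dFilt, sizeof(float) * k * c * r * s);
    cudaMalloc(&dOut,  sizeof(float) * on * oc * oh * ow);

    // Pin a specific algorithm and query the workspace it needs.
    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
    size_t wsBytes = 0;
    CHECK_CUDNN(cudnnGetConvolutionForwardWorkspaceSize(handle, inDesc, filtDesc, convDesc, outDesc,
                                                        algo, &wsBytes));
    void* dWs = nullptr;
    if (wsBytes > 0) cudaMalloc(&dWs, wsBytes);

    // The convolution itself: out = alpha * conv(in, filt) + beta * out.
    const float alpha = 1.0f, beta = 0.0f;
    CHECK_CUDNN(cudnnConvolutionForward(handle, &alpha, inDesc, dIn, filtDesc, dFilt, convDesc,
                                        algo, dWs, wsBytes, &beta, outDesc, dOut));
    cudaDeviceSynchronize();

    // Deterministic cleanup: buffers, workspace, descriptors, handle.
    if (dWs) cudaFree(dWs);
    cudaFree(dIn); cudaFree(dFilt); cudaFree(dOut);
    cudnnDestroyTensorDescriptor(inDesc);
    cudnnDestroyTensorDescriptor(outDesc);
    cudnnDestroyFilterDescriptor(filtDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroy(handle);
    return 0;
}
```

In a real pipeline you would typically benchmark candidate algorithms once at startup (for example with cudnnFindConvolutionForwardAlgorithm) and cache the winner, rather than hard-coding one as above.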

Our Experiment

We re-implemented the inference path of a CycleGAN model directly with cuDNN. That meant:

  • Writing custom kernels for forward passes
  • Manually managing memory (no Python garbage collector here!)
  • Building a thin C++ layer to orchestrate execution (sketched below)
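
None of the framework conveniences exist at this level, so the C++ layer has to do the bookkeeping itself. The sketch below uses hypothetical DeviceBuffer and Layer types (not our actual codebase) to show the general shape: activations live in two preallocated device buffers that are ping-ponged through the network, so no allocations happen on the inference hot path.

```cpp
// Hypothetical sketch of a thin C++ orchestration layer around cuDNN calls.
// DeviceBuffer, Layer, and runInference are illustrative names, not the real code.
#include <cuda_runtime.h>
#include <cudnn.h>
#include <vector>
#include <cstddef>

// RAII wrapper: device memory is sized explicitly and freed deterministically,
// since there is no garbage collector to lean on.
struct DeviceBuffer {
    void* ptr = nullptr;
    size_t bytes = 0;
    explicit DeviceBuffer(size_t n) : bytes(n) { cudaMalloc(&ptr, n); }
    ~DeviceBuffer() { if (ptr) cudaFree(ptr); }
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
};

// Each layer owns its weights and descriptors and writes into a caller-provided buffer.
struct Layer {
    virtual ~Layer() = default;
    // Used once at startup to size the two scratch buffers (sizing code not shown).
    virtual size_t outputBytes() const = 0;
    virtual void forward(cudnnHandle_t handle, const void* in, void* out) = 0;
};

// The "framework" is just a loop: alternate between two preallocated activation
// buffers, each sized to the largest layer output.
void runInference(cudnnHandle_t handle,
                  const std::vector<Layer*>& net,
                  const DeviceBuffer& input,
                  DeviceBuffer& scratchA,
                  DeviceBuffer& scratchB) {
    const void* src = input.ptr;
    void* dst = scratchA.ptr;
    bool wroteA = true;
    for (Layer* layer : net) {
        layer->forward(handle, src, dst);          // cuDNN calls happen inside each layer
        src = dst;                                 // this layer's output feeds the next
        wroteA = !wroteA;
        dst = wroteA ? scratchA.ptr : scratchB.ptr;
    }
}
```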

The payoff?

  • 30–45% lower inference latency on large 3D medical image datasets
  • Lower VRAM usage, since we skip framework bookkeeping
  • A cleaner path to deploying models on edge devices, where every millisecond counts

Why This Matters

For research, PyTorch is unbeatable. But when your model moves from “interesting prototype” to production system — especially in healthcare imaging or enterprise AI pipelines — shaving off hundreds of milliseconds per request can mean the difference between usable and impractical.

Direct cuDNN gives us:

  • Lower latency
  • Predictable memory footprints
  • Maximum control for scaling across GPUs

Takeaway

Frameworks like PyTorch are great scaffolding. But for true system-level performance, sometimes the best path is to strip it all down and speak GPU natively.

At AI³, this kind of low-level optimization isn’t just about speed — it’s about building AI systems that are reliable, cost-efficient, and ready for real-world scale.

👉 Curious to hear from others: Have you ever ditched the framework and gone low-level for performance gains? What trade-offs did you run into?