Moving Beyond Single-Node to Multi-Node Inference

NVIDIA Dynamo Scales and Streamlines Data Center Inference with Kubernetes
As AI inference becomes increasingly distributed, the combination of Kubernetes, NVIDIA Dynamo, and NVIDIA Grove greatly simplifies how developers build and scale intelligent applications. NVIDIA Dynamo now integrates with managed Kubernetes services, including Amazon EKS, Microsoft Azure AKS, Google Cloud GKE, and Oracle Cloud Infrastructure OKE, making it easier to get started with large-scale inference.
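Getting started on any of these managed clusters boils down to applying a Dynamo resource to the cluster. Below is a minimal sketch using the official Kubernetes Python client; the apiVersion, kind, and spec fields are illustrative assumptions rather than Dynamo's published schema, so consult the Dynamo documentation for the actual CRD.

```python
# Hedged sketch: applying a Dynamo-style custom resource to a managed
# Kubernetes cluster (EKS, AKS, GKE, or OKE) with the official Python client.
# The apiVersion, kind, and spec fields are assumptions for illustration,
# not Dynamo's published schema.
from kubernetes import client, config

config.load_kube_config()  # uses your current kubectl context, e.g. an EKS cluster

dynamo_graph = {
    "apiVersion": "nvidia.com/v1alpha1",      # assumed group/version
    "kind": "DynamoGraphDeployment",          # assumed kind
    "metadata": {"name": "llm-serve", "namespace": "inference"},
    "spec": {
        "services": {
            "Frontend": {"replicas": 1},
            "Worker": {"replicas": 4},        # hypothetical worker role
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="nvidia.com",
    version="v1alpha1",
    namespace="inference",
    plural="dynamographdeployments",
    body=dynamo_graph,
)
```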
Breaking the Million-Token Barrier: The Business Impact of Azure's NVIDIA GB300 Performance for Enterprise AI

Microsoft Azure achieved 1,100,948 tokens/sec on an ND GB300 v6 rack powered by 72 NVIDIA Blackwell Ultra GPUs, as validated by Signal65. The benchmark shows how enterprise AI can deliver record throughput with roughly 2.5x better power efficiency, combining high performance, operational efficiency, and governance-ready scale.
Read Blog Post ❯

Published November 3, 2025
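For a sense of scale, dividing the validated aggregate across the rack's GPUs gives the implied per-GPU rate:

```python
# Back-of-the-envelope math on the published figures: aggregate throughput
# divided across the 72 Blackwell Ultra GPUs in the ND GB300 v6 rack.
aggregate_tokens_per_sec = 1_100_948
gpus_per_rack = 72

per_gpu = aggregate_tokens_per_sec / gpus_per_rack
print(f"{per_gpu:,.0f} tokens/sec per GPU")  # ~15,291
```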
NVIDIA Extreme Co-Design Delivers X-Factors on One-Year Rhythm

How do you get 10x the performance with only twice the transistors? Extreme co-design. At GTC DC, NVIDIA CEO Jensen Huang showed how the NVIDIA GB200 NVL72 architecture delivers a massive leap in inference performance, creating the lowest-cost AI tokens in the world while driving 10x higher throughput.
Watch Keynote Chapter ❯

Published October 28, 2025
Latest Inference News and Resources
Barron’s Highlights NVIDIA’s Inference Leadership

Barron’s explores NVIDIA’s inference leadership, with the NVIDIA GB200 NVL72 sweeping the SemiAnalysis InferenceMAX v1 benchmarks and delivering unmatched performance, efficiency, and ROI.
Read Article ❯

Published October 15, 2025
The Next Platform: Software Pushes the AI Pareto Frontier More Than Hardware

The Next Platform details how NVIDIA software optimizations boost performance by 5–10x on the same hardware, with Pareto curves illustrating how hardware and software advances together push the frontier of AI inference performance.
Read Article ❯

Published October 21, 2025
Google Cloud Now Shipping A4X Max, Vertex AI Training, and More

Google Cloud's new A4X Max VMs, powered by NVIDIA GB300 NVL72 systems, are now in preview. A4X Max is designed for training and low-latency AI inference of frontier reasoning models. Integration with GKE Inference Gateway and NVIDIA NeMo™ Guardrails enables prefix-aware load balancing, significantly improving latency and throughput.
Read Article ❯

Published October 28, 2025
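The gateway's actual routing logic isn't reproduced here, but the idea behind prefix-aware load balancing is straightforward: requests that share a prompt prefix are pinned to the same replica so its KV cache for that prefix can be reused. A toy sketch of the concept, with hypothetical replica names:

```python
# Toy illustration of prefix-aware load balancing: route on a hash of the
# prompt prefix so requests sharing a prefix hit the same replica and reuse
# its cached KV state. Conceptual sketch, not the GKE Inference Gateway code.
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]
PREFIX_CHARS = 256  # route on the first N characters of the prompt

def route(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two requests sharing a long system prompt land on the same replica:
shared = "You are a helpful enterprise assistant. " * 8
print(route(shared + "Summarize this document."))
print(route(shared + "Translate this document to French."))
```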
Siemens Builds and Deploys Self-Contained, Sustainable, and Cost-Effective LLM

Siemens details its efforts to build a future-proof AI ecosystem and provide services for its internal developers. Its sovereign AI infrastructure focuses on data privacy, compliance, cost predictability, and customization, served using vLLM and powered by NVIDIA H200 Tensor Core GPUs and L40S GPUs.
Read Article ❯

Published October 13, 2025
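For context on the serving layer Siemens names, vLLM's offline Python API is compact. A minimal sketch follows; the model name is a placeholder, not the model Siemens serves.

```python
# Minimal vLLM usage in the spirit of the stack described above.
# The model name below is a placeholder assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["Summarize our data-privacy guidelines."], params)
print(outputs[0].outputs[0].text)
```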
NVIDIA Inference Technology Highlights
Streamline Complex AI Inference on Kubernetes With NVIDIA Grove

NVIDIA Grove, a Kubernetes API for running modern machine learning inference workloads on Kubernetes clusters, is now available within NVIDIA Dynamo as a modular component for unified inference management. Grove is fully open source and enables multinode disaggregated serving through multilevel autoscaling, system-level lifecycle management, flexible gang scheduling, topology-aware scheduling, and role-aware orchestration.
Read Blog ❯

Published November 10, 2025
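In practice, a Grove workload is described declaratively and applied like any other custom resource. The sketch below shows the shape of a role-aware, gang-scheduled prefill/decode deployment; the group, kind, and field names are assumptions for illustration, so consult the open-source Grove repository for the actual CRD schema.

```python
# Hedged sketch of a Grove-style resource: two roles (prefill and decode)
# that are gang-scheduled as a unit. Group, kind, and field names are
# illustrative assumptions, not Grove's published schema.
import json

pod_gang = {
    "apiVersion": "grove.io/v1alpha1",   # assumed group/version
    "kind": "PodGangSet",                # assumed kind
    "metadata": {"name": "disagg-llm", "namespace": "inference"},
    "spec": {
        "roles": [
            # Prefill is compute-bound prompt processing; decode is
            # latency-bound token generation. Role-aware orchestration
            # lets each scale and recover independently.
            {"name": "prefill", "replicas": 2},
            {"name": "decode", "replicas": 6},
        ],
        # Gang semantics: schedule every role or none, so a partial
        # deployment never holds GPUs idle.
        "scheduling": {"policy": "AllOrNothing", "topologyAware": True},
    },
}

# Apply with the Kubernetes client as in the Dynamo sketch above, or
# pipe this JSON through `kubectl apply -f -`.
print(json.dumps(pod_gang, indent=2))
```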
STAC-ing Wins: NVIDIA GH200 Superchip Sets Records on Financial Services Industry Benchmarks

STAC audited a STAC-ML Markets (Inference) benchmark on a stack featuring the NVIDIA GH200 Grace Hopper™ Superchip in a Supermicro system. Compared to the previous FPGA-based record, GH200 delivered up to 49% lower latency on large models, 44% higher energy efficiency, 8–13x lower inference error rates, and latency as low as 4.67 μs at the 99th percentile.
Read Blog ❯

Published October 28, 2025
Streamline AI Infrastructure With NVIDIA Run:ai on Microsoft Azure

NVIDIA Run:ai integrates with Azure Kubernetes Service (AKS) to dynamically manage GPU resources, allowing multiple workloads to share GPUs efficiently and supporting multi-node and multi-GPU training jobs.
Read Blog ❯

Published October 30, 2025
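GPU sharing in Run:ai is typically expressed on the workload itself. Below is a hedged sketch of a pod asking the Run:ai scheduler for half a GPU on AKS; the annotation key, queue label, and scheduler name reflect Run:ai conventions as we understand them, so verify them against the Run:ai documentation for your version.

```python
# Hedged sketch: two pods like this one can share a single physical GPU when
# the Run:ai scheduler honors the fractional-GPU request. The annotation key,
# label, and scheduler name are assumptions based on Run:ai conventions;
# verify against the Run:ai docs for your version.
from kubernetes import client, config

config.load_kube_config()  # current context pointing at the AKS cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="half-gpu-inference",
        annotations={"gpu-fraction": "0.5"},       # assumed Run:ai annotation
        labels={"runai/queue": "team-inference"},  # assumed queue label
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # hand scheduling to Run:ai
        containers=[
            client.V1Container(name="server", image="my-registry/inference:latest")
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="inference", body=pod)
```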
NVIDIA and Oracle to Accelerate Enterprise AI and Data Processing

Oracle announced a new OCI Zettascale10 computing cluster powered by the NVIDIA Blackwell platform, designed for AI training and inference workloads. The cluster will deliver up to 16 zettaFLOPS of AI compute and utilize NVIDIA Spectrum-X™ Ethernet, which enables hyperscalers to interconnect millions of NVIDIA GPUs.
Read Blog ❯

Published October 14, 2025
Scaling Large MoE Models With Wide Expert Parallelism on NVL72 Rack-Scale Systems

NVIDIA TensorRT™-LLM's Wide Expert Parallelism (Wide-EP) on NVIDIA GB200 NVL72 systems achieves up to 1.8x higher per-GPU throughput compared to smaller EP configurations, improving tokens per second per GPU and lowering the cost of serving reasoning models such as DeepSeek-R1.
Read Blog ❯

Published October 20, 2025
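The mechanics behind Wide-EP can be pictured with a toy router: each GPU rank owns a small slice of the experts, and widening the EP degree shrinks each rank's resident expert weights, freeing memory for KV cache and larger batches. A conceptual sketch (not the TensorRT-LLM implementation) follows; the expert counts mirror DeepSeek-R1's 256 routed experts with top-8 routing.

```python
# Toy illustration of expert parallelism (EP) in an MoE layer: each GPU rank
# owns a disjoint shard of the experts, and the router's top-k choice decides
# which ranks each token is dispatched to. "Wide" EP just means a large EP
# degree. Conceptual sketch only; not the TensorRT-LLM implementation.
import numpy as np

NUM_EXPERTS = 256   # routed experts per MoE layer (as in DeepSeek-R1)
EP_DEGREE = 64      # wide EP: 64 ranks share the experts
TOP_K = 8           # experts activated per token
experts_per_rank = NUM_EXPERTS // EP_DEGREE  # only 4 experts resident per GPU

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(4, NUM_EXPERTS))      # 4 example tokens
top_k = np.argsort(router_logits, axis=1)[:, -TOP_K:]  # chosen expert ids

# Dispatch: expert id -> owning rank. Fewer resident experts per GPU leaves
# more memory for KV cache and batching, which is where the per-GPU
# throughput gains come from.
owning_ranks = top_k // experts_per_rank
for token, (experts, ranks) in enumerate(zip(top_k, owning_ranks)):
    print(f"token {token}: experts {sorted(experts)} -> ranks {sorted(set(ranks))}")
```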
Nebius Scales AI Inference in the Cloud, Powered by NVIDIA

Using managed Kubernetes with auto-scaling, Nebius optimizes its AI cloud to deliver multi-node training and inference of frontier models and AI applications for startups and enterprises. Nebius, an ecosystem partner for NVIDIA Dynamo, is enabling AI inference at scale with NVIDIA infrastructure.
Watch Video ❯

Published November 10, 2025
Why Your $5 Million AI Investment Could Generate $75 Million—If You Understand Inference

AI pioneer Pascal Bornet sits down with Dion Harris, Sr. Director of HPC, Cloud, and AI Infrastructure Solutions GTM at NVIDIA, to discuss AI inference, reasoning models, and how performance and efficiency are the driving factors to maximize return on investment from AI factories.
Watch Interview ❯ Read LinkedIn Post ❯

Published November 10, 2025