Moving Beyond Single-Node to Multi-Node Inference

NVIDIA Dynamo Scales and Streamlines Data Center Inference with Kubernetes
As AI inference becomes increasingly distributed, the combination of Kubernetes, NVIDIA Dynamo, and NVIDIA Grove greatly simplifies how developers build and scale intelligent applications. NVIDIA Dynamo now integrates with managed Kubernetes services, including Amazon EKS, Microsoft Azure AKS, Google Cloud GKE, and Oracle Cloud Infrastructure OKE, making it easier to get started with large-scale inference.
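Getting started on any of these managed clusters boils down to applying a Dynamo resource to the cluster. Below is a minimal sketch using the official Kubernetes Python client; the apiVersion, kind, and spec fields are illustrative assumptions rather than Dynamo's published schema, so consult the Dynamo documentation for the actual CRD.

```python
# Hedged sketch: applying a Dynamo-style custom resource to a managed
# Kubernetes cluster (EKS, AKS, GKE, or OKE) with the official Python client.
# The apiVersion, kind, and spec fields are assumptions for illustration,
# not Dynamo's published schema.
from kubernetes import client, config

config.load_kube_config()  # uses your current kubectl context, e.g. an EKS cluster

dynamo_graph = {
    "apiVersion": "nvidia.com/v1alpha1",      # assumed group/version
    "kind": "DynamoGraphDeployment",          # assumed kind
    "metadata": {"name": "llm-serve", "namespace": "inference"},
    "spec": {
        "services": {
            "Frontend": {"replicas": 1},
            "Worker": {"replicas": 4},        # hypothetical worker role
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="nvidia.com",
    version="v1alpha1",
    namespace="inference",
    plural="dynamographdeployments",
    body=dynamo_graph,
)
```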
Breaking the Million-Token Barrier: The Business Impact of Azure's NVIDIA GB300 Performance for Enterprise AI

Microsoft Azure achieved 1,100,948 tokens/sec on an ND GB300 v6 rack powered by 72 NVIDIA Blackwell Ultra GPUs, as validated by Signal65. The benchmark shows how enterprise AI can deliver record throughput with roughly 2.5x better power efficiency, combining high performance, operational efficiency, and governance-ready scale.
Read Blog Post ❯

Published November 3, 2025
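For a sense of scale, dividing the validated aggregate across the rack's GPUs gives the implied per-GPU rate:

```python
# Back-of-the-envelope math on the published figures: aggregate throughput
# divided across the 72 Blackwell Ultra GPUs in the ND GB300 v6 rack.
aggregate_tokens_per_sec = 1_100_948
gpus_per_rack = 72

per_gpu = aggregate_tokens_per_sec / gpus_per_rack
print(f"{per_gpu:,.0f} tokens/sec per GPU")  # ~15,291
```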
NVIDIA Extreme Co-Design Delivers X-Factors on One-Year Rhythm

How do you get 10x the performance with only twice the transistors? Extreme co-design. At GTC DC, NVIDIA CEO Jensen Huang showed how the NVIDIA GB200 NVL72 architecture delivers a massive leap in inference performance, creating the lowest-cost AI tokens in the world while driving 10x higher throughput.
Watch Keynote Chapter ❯

Published October 28, 2025
Latest Inference News and Resources
Barron’s Highlights NVIDIA’s Inference Leadership

Barron’s explores NVIDIA’s inference leadership, with the NVIDIA GB200 NVL72 sweeping the SemiAnalysis InferenceMAX v1 benchmarks and delivering unmatched performance, efficiency, and ROI.
Read Article ❯

Published October 15, 2025
The Next Platform: Software Pushes the AI Pareto Frontier More Than Hardware

The Next Platform details how NVIDIA software optimizations boost performance by 5–10x on the same hardware, with Pareto curves illustrating how hardware and software advances together push the frontier of AI inference performance.
Read Article ❯

Published October 21, 2025
Google Cloud Now Shipping A4X Max, Vertex AI Training, and More

Google Cloud's new A4X Max VMs, powered by NVIDIA GB300 NVL72 systems, are now in preview. A4X Max is designed for training and low-latency AI inference of frontier reasoning models. Integration with GKE Inference Gateway and NVIDIA NeMo™ Guardrails enables prefix-aware load balancing, significantly improving latency and throughput.
Read Article ❯

Published October 28, 2025
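The gateway's actual routing logic isn't reproduced here, but the idea behind prefix-aware load balancing is straightforward: requests that share a prompt prefix are pinned to the same replica so its KV cache for that prefix can be reused. A toy sketch of the concept, with hypothetical replica names:

```python
# Toy illustration of prefix-aware load balancing: route on a hash of the
# prompt prefix so requests sharing a prefix hit the same replica and reuse
# its cached KV state. Conceptual sketch, not the GKE Inference Gateway code.
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2", "replica-3"]
PREFIX_CHARS = 256  # route on the first N characters of the prompt

def route(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two requests sharing a long system prompt land on the same replica:
shared = "You are a helpful enterprise assistant. " * 8
print(route(shared + "Summarize this document."))
print(route(shared + "Translate this document to French."))
```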
Siemens Builds and Deploys Self-Contained, Sustainable, and Cost-Effective LLM

Siemens details its efforts to build a future-proof AI ecosystem and provide services for its internal developers. Its sovereign AI infrastructure focuses on data privacy, compliance, cost predictability, and customization, served using vLLM and powered by NVIDIA H200 Tensor Core GPUs and L40S GPUs.
Read Article ❯

Published October 13, 2025
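For context on the serving layer Siemens names, vLLM's offline Python API is compact. A minimal sketch follows; the model name is a placeholder, not the model Siemens serves.

```python
# Minimal vLLM usage in the spirit of the stack described above.
# The model name below is a placeholder assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["Summarize our data-privacy guidelines."], params)
print(outputs[0].outputs[0].text)
```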
NVIDIA Inference Technology Highlights
Streamline Complex AI Inference on Kubernetes With NVIDIA Grove

NVIDIA Grove, a Kubernetes API for running modern machine learning inference workloads on Kubernetes clusters, is now available within NVIDIA Dynamo as a modular component for unified inference management. Grove is fully open source and enables multinode disaggregated serving through multilevel autoscaling, system-level lifecycle management, flexible gang scheduling, topology-aware scheduling, and role-aware orchestration.
Read Blog ❯

Published November 10, 2025
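In practice, a Grove workload is described declaratively and applied like any other custom resource. The sketch below shows the shape of a role-aware, gang-scheduled prefill/decode deployment; the group, kind, and field names are assumptions for illustration, so consult the open-source Grove repository for the actual CRD schema.

```python
# Hedged sketch of a Grove-style resource: two roles (prefill and decode)
# that are gang-scheduled as a unit. Group, kind, and field names are
# illustrative assumptions, not Grove's published schema.
import json

pod_gang = {
    "apiVersion": "grove.io/v1alpha1",   # assumed group/version
    "kind": "PodGangSet",                # assumed kind
    "metadata": {"name": "disagg-llm", "namespace": "inference"},
    "spec": {
        "roles": [
            # Prefill is compute-bound prompt processing; decode is
            # latency-bound token generation. Role-aware orchestration
            # lets each scale and recover independently.
            {"name": "prefill", "replicas": 2},
            {"name": "decode", "replicas": 6},
        ],
        # Gang semantics: schedule every role or none, so a partial
        # deployment never holds GPUs idle.
        "scheduling": {"policy": "AllOrNothing", "topologyAware": True},
    },
}

# Apply with the Kubernetes client as in the Dynamo sketch above, or
# pipe this JSON through `kubectl apply -f -`.
print(json.dumps(pod_gang, indent=2))
```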
STAC-ing Wins: NVIDIA GH200 Superchip Sets Records on Financial Services Industry Benchmarks

STAC audited a STAC-ML Markets (Inference) benchmark on a stack featuring the NVIDIA GH200 Grace Hopper™ Superchip in a Supermicro system. Compared to the previous FPGA-based record, GH200 delivered up to 49% lower latency on large models, 44% higher energy efficiency, 8–13x lower inference error rates, and latency as low as 4.67 μs at the 99th percentile.
Read Blog ❯

Published October 28, 2025
Streamline AI Infrastructure With NVIDIA Run:ai on Microsoft Azure

NVIDIA Run:ai integrates with Azure Kubernetes Service (AKS) to dynamically manage GPU resources, allowing multiple workloads to share GPUs efficiently and supporting multi-node and multi-GPU training jobs.
Read Blog ❯

Published October 30, 2025
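GPU sharing in Run:ai is typically expressed on the workload itself. Below is a hedged sketch of a pod asking the Run:ai scheduler for half a GPU on AKS; the annotation key, queue label, and scheduler name reflect Run:ai conventions as we understand them, so verify them against the Run:ai documentation for your version.

```python
# Hedged sketch: two pods like this one can share a single physical GPU when
# the Run:ai scheduler honors the fractional-GPU request. The annotation key,
# label, and scheduler name are assumptions based on Run:ai conventions;
# verify against the Run:ai docs for your version.
from kubernetes import client, config

config.load_kube_config()  # current context pointing at the AKS cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="half-gpu-inference",
        annotations={"gpu-fraction": "0.5"},       # assumed Run:ai annotation
        labels={"runai/queue": "team-inference"},  # assumed queue label
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # hand scheduling to Run:ai
        containers=[
            client.V1Container(name="server", image="my-registry/inference:latest")
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="inference", body=pod)
```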
NVIDIA and Oracle to Accelerate Enterprise AI and Data Processing

Oracle announced a new OCI Zettascale10 computing cluster powered by the NVIDIA Blackwell platform, designed for AI training and inference workloads. The cluster will deliver up to 16 zettaFLOPS of AI compute and utilize NVIDIA Spectrum-X™ Ethernet, which enables hyperscalers to interconnect millions of NVIDIA GPUs.
Read Blog ❯

Published October 14, 2025
Scaling Large MoE Models With Wide Expert Parallelism on NVL72 Rack-Scale Systems

NVIDIA TensorRT™-LLM's Wide Expert Parallelism (Wide-EP) on NVIDIA GB200 NVL72 systems achieves up to 1.8x higher per-GPU throughput compared to smaller EP configurations, improving tokens per second per GPU and lowering the cost of serving reasoning models such as DeepSeek-R1.
Read Blog ❯

Published October 20, 2025
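The mechanics behind Wide-EP can be pictured with a toy router: each GPU rank owns a small slice of the experts, and widening the EP degree shrinks each rank's resident expert weights, freeing memory for KV cache and larger batches. A conceptual sketch (not the TensorRT-LLM implementation) follows; the expert counts mirror DeepSeek-R1's 256 routed experts with top-8 routing.

```python
# Toy illustration of expert parallelism (EP) in an MoE layer: each GPU rank
# owns a disjoint shard of the experts, and the router's top-k choice decides
# which ranks each token is dispatched to. "Wide" EP just means a large EP
# degree. Conceptual sketch only; not the TensorRT-LLM implementation.
import numpy as np

NUM_EXPERTS = 256   # routed experts per MoE layer (as in DeepSeek-R1)
EP_DEGREE = 64      # wide EP: 64 ranks share the experts
TOP_K = 8           # experts activated per token
experts_per_rank = NUM_EXPERTS // EP_DEGREE  # only 4 experts resident per GPU

rng = np.random.default_rng(0)
router_logits = rng.normal(size=(4, NUM_EXPERTS))      # 4 example tokens
top_k = np.argsort(router_logits, axis=1)[:, -TOP_K:]  # chosen expert ids

# Dispatch: expert id -> owning rank. Fewer resident experts per GPU leaves
# more memory for KV cache and batching, which is where the per-GPU
# throughput gains come from.
owning_ranks = top_k // experts_per_rank
for token, (experts, ranks) in enumerate(zip(top_k, owning_ranks)):
    print(f"token {token}: experts {sorted(experts)} -> ranks {sorted(set(ranks))}")
```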
Nebius Scales AI Inference in the Cloud, Powered by NVIDIA

Using managed Kubernetes with auto-scaling, Nebius optimizes its AI cloud to deliver multi-node training and inference of frontier models and AI applications for startups and enterprises. Nebius, an ecosystem partner for NVIDIA Dynamo, is enabling AI inference at scale with NVIDIA infrastructure.
Watch Video ❯

Published November 10, 2025
Why Your $5 Million AI Investment Could Generate $75 Million—If You Understand Inference

AI pioneer Pascal Bornet sits down with Dion Harris, Sr. Director of HPC, Cloud, and AI Infrastructure Solutions GTM at NVIDIA, to discuss AI inference, reasoning models, and how performance and efficiency are the driving factors to maximize return on investment from AI factories.
Watch Interview ❯ Read LinkedIn Post ❯

Published November 10, 2025