The More You Buy, the More You Make—What Happens When You Think SMART

NVIDIA’s AI Factory is more than infrastructure: it’s a force multiplier that scales inference, boosts productivity, and accelerates breakthroughs across science, health, and climate. Purpose-built and optimized for inference at scale with NVIDIA Blackwell, it is designed to deliver performance, efficiency, and ROI across industries.
Read Blog ❯
Published May 30, 2025

OpenAI and NVIDIA Propel Innovation With Open Models Optimized for World’s Largest AI Inference Infrastructure

NVIDIA delivers industry-leading GPT-OSS-120B performance at 1.5 million tokens per second on a single Blackwell GB200 NVL72 system. Trained on NVIDIA GPUs and optimized across the full stack, the models run best on Blackwell and RTX GPUs. They also run on the world’s largest installed base of CUDA GPUs, hundreds of millions of them, from laptops to data centers and cloud platforms, powering global innovation.
Read Blog ❯
Published Aug 5, 2025

Latest Inference News and Resources

Together AI Delivers Top Speeds for DeepSeek-R1-0528 Inference on NVIDIA Blackwell

Together AI delivers record-setting inference speed with the DeepSeek-R1-0528 model—enabled by the NVIDIA Blackwell platform. Purpose-built for high-performance compute, memory, and bandwidth, NVIDIA Blackwell is enabling the next generation of AI infrastructure.
Read Blog ❯
Published Jul 17, 2025

NVIDIA Dynamo Adds Support for AWS Services to Deliver Cost-Efficient Inference at Scale

Dynamo adds support for popular AWS services, unlocking new levels of performance, scalability, and cost efficiency for serving large language models.
Read Blog ❯
Published Jul 15, 2025

CoreWeave Leads the Way With First NVIDIA GB300 NVL72 Deployment

CoreWeave is deploying NVIDIA Blackwell Ultra for inference at scale, using NVIDIA GB300 NVL72 systems powered by NVIDIA networking and delivered by Dell Technologies. Each rack delivers over one exaflop of dense AI performance and up to 40 TB of fast memory—designed to meet the demands of large-scale inference.
Read Blog ❯
Published Jul 3, 2025

VAST Inference Evolution Featuring Dynamo NIXL Integration for Maximum Compute Efficiency

VAST Data and NVIDIA Dynamo, powered by NVIDIA NIXL, are redefining inference at scale—enabling high-speed KV cache transfers across GPUs, CPUs, and storage. Get 10x faster time-to-first-token and disaggregate prefill and decode with a persistent cache architecture designed for maximum throughput.
Read Blog ❯
Published Jul 1, 2025

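For readers new to the pattern, here is a minimal Python sketch of disaggregated prefill and decode with a shared KV cache tier. The CacheStore, prefill, and decode names are hypothetical stand-ins for illustration; they are not the VAST or NVIDIA Dynamo/NIXL APIs.

    from dataclasses import dataclass, field

    @dataclass
    class CacheStore:
        # Stands in for a persistent KV cache tier spanning GPU, CPU, and storage.
        blocks: dict = field(default_factory=dict)

        def put(self, request_id, kv):
            # With NIXL this would be a high-speed zero-copy transfer, not a dict write.
            self.blocks[request_id] = kv

        def get(self, request_id):
            return self.blocks[request_id]

    def prefill(request_id, prompt_tokens, store):
        # Compute-bound phase: process the whole prompt once and persist its KV cache.
        kv = [(f"k_{t}", f"v_{t}") for t in prompt_tokens]  # placeholder cache entries
        store.put(request_id, kv)

    def decode(request_id, store, max_new_tokens=3):
        # Bandwidth-bound phase: reuse the transferred cache token by token, so a
        # decode worker never re-pays the prefill cost (faster time-to-first-token).
        kv = store.get(request_id)
        out = []
        for step in range(max_new_tokens):
            token = f"tok{step}"  # a real decoder would run the model here
            kv.append((f"k_{token}", f"v_{token}"))
            out.append(token)
        return out

    store = CacheStore()
    prefill("req-1", ["what", "is", "nixl"], store)
    print(decode("req-1", store))
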
Inference at Scale With NVIDIA GB200 NVL72 on AWS

Now available as Amazon EC2 P6e-GB200 instances, the NVIDIA GB200 NVL72 platform with NVLink™ accelerates training and inference for cutting-edge applications—from drug discovery to software development.
Watch Video ❯
Published Jul 9, 2025

From Prompt to Paris: How AI Agents Launch a Food Truck Dream

What happens when you ask an AI to launch a food truck? Perplexity’s agent system breaks the prompt into tasks—research, design, planning—using NVIDIA-accelerated inference to deliver a full business plan in seconds.
Watch Video ❯
Published Jul 11, 2025

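As a rough illustration of the fan-out pattern described above, the following hypothetical Python sketch splits one prompt into research, design, and planning subtasks and runs them in parallel. The call_llm and plan functions are placeholders, not Perplexity’s actual system.

    from concurrent.futures import ThreadPoolExecutor

    def call_llm(task):
        # Placeholder for an accelerated-inference call.
        return f"[result for: {task}]"

    def plan(prompt):
        # A real planner would itself be a model call; this split is hard-coded.
        return [f"research: {prompt}", f"design: {prompt}", f"planning: {prompt}"]

    def run_agent(prompt):
        subtasks = plan(prompt)
        with ThreadPoolExecutor() as pool:  # fan subtasks out in parallel
            results = list(pool.map(call_llm, subtasks))
        return "\n".join(results)  # merge results into one deliverable

    print(run_agent("launch a food truck in Paris"))
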
NVIDIA Inference Technology Highlights

Think Smart and Ask an Encyclopedia-Sized Question: Multimillion Token Real-Time Inference for 32x More Users

What if you could ask a chatbot a question the size of an entire encyclopedia—and get an answer in real time? Multimillion-token queries serving 32x more users are now possible with Helix Parallelism, an NVIDIA Research innovation that drives inference at massive scale.
Read Blog ❯
Published Jul 7, 2025

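The NumPy sketch below shows the log-sum-exp merge that makes sequence-sharded decode attention exact: each “GPU” attends over its slice of a very long KV cache, and the partial softmaxes are combined afterwards. This is the basic math that KV-sharded schemes of this kind build on; it illustrates the idea only and is not NVIDIA’s Helix implementation.

    import numpy as np

    def partial_attention(q, k_shard, v_shard):
        # One shard attends over its local slice of the KV cache and reports
        # its running max and normalizer so shards can be merged exactly.
        scores = k_shard @ q
        m = scores.max()
        w = np.exp(scores - m)
        return m, w.sum(), (w @ v_shard) / w.sum()

    def sharded_attention(q, k, v, num_shards=4):
        parts = [partial_attention(q, ks, vs)
                 for ks, vs in zip(np.split(k, num_shards), np.split(v, num_shards))]
        m_global = max(m for m, _, _ in parts)
        weights = [np.exp(m - m_global) * l for m, l, _ in parts]  # rescale shards
        return sum(w * o for w, (_, _, o) in zip(weights, parts)) / sum(weights)

    rng = np.random.default_rng(0)
    q = rng.normal(size=64)
    k, v = rng.normal(size=(4096, 64)), rng.normal(size=(4096, 64))
    s = k @ q
    reference = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v
    assert np.allclose(sharded_attention(q, k, v), reference)  # sharded == exact
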
Introducing NVFP4 for Efficient and Accurate Low-Precision Inference

NVFP4 is a new four-bit format that improves AI inference efficiency while preserving accuracy through advanced scaling—enabling up to 50x better energy efficiency and lower TCO at scale.
Read Blog ❯
Published Jun 24, 2025

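As a simplified model of the idea, the NumPy sketch below quantizes a tensor to the FP4 (E2M1) value grid with one scale per 16-element block. Real NVFP4 additionally encodes the block scales in FP8 (E4M3) under a tensor-level FP32 scale; this sketch keeps scales in full precision to show only the rounding behavior.

    import numpy as np

    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
    BLOCK = 16  # NVFP4 micro-block size

    def quantize_dequantize(x):
        blocks = x.reshape(-1, BLOCK)
        # One scale per block maps the block's max magnitude onto the FP4 max (6.0).
        scales = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
        scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
        scaled = blocks / scales
        # Round each value to the nearest representable FP4 magnitude, keeping sign.
        idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
        q = np.sign(scaled) * FP4_GRID[idx]
        return (q * scales).reshape(x.shape)  # dequantized approximation of x

    x = np.random.default_rng(0).normal(size=(4, 64)).astype(np.float32)
    err = np.abs(x - quantize_dequantize(x)).mean()
    print(f"mean abs quantization error: {err:.4f}")
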
Optimizing for Low-Latency Communication in Inference Workloads With JAX and XLA

To help teams reduce decode-stage latency when running LLM inference in production, we’re sharing techniques that minimize communication overhead for small message sizes, especially when compute and communication can’t overlap, using custom kernels, Google JAX FFI, and NVIDIA® CUDA® Graphs for faster inference.
Read Blog ❯
Published Jul 18, 2025

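For context, here is a minimal JAX sketch of the baseline pattern the post optimizes: a tensor-parallel decode step whose small per-token partial results are combined with an all-reduce (jax.lax.psum). It uses pmap for brevity rather than a production sharding setup, and omits the post’s custom kernels, JAX FFI, and CUDA Graphs work.

    from functools import partial
    import jax
    import jax.numpy as jnp

    n_dev = jax.local_device_count()
    d_model, d_ff = 128, 512
    key = jax.random.PRNGKey(0)
    # Column-shard the FFN weights across devices (tensor parallelism).
    w1 = jax.random.normal(key, (n_dev, d_model, d_ff // n_dev))
    w2 = jax.random.normal(key, (n_dev, d_ff // n_dev, d_model))

    @partial(jax.pmap, axis_name="tp")
    def decode_step(x, w1_shard, w2_shard):
        h = jax.nn.relu(x @ w1_shard)  # local partial activations
        partial_out = h @ w2_shard     # local partial output
        # At decode time x is a single token, so this all-reduce moves a tiny
        # message; its latency is exactly what the blog's techniques target.
        return jax.lax.psum(partial_out, axis_name="tp")

    x = jnp.ones((n_dev, 1, d_model))  # one replicated token per device
    print(decode_step(x, w1, w2).shape)
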
How Nasdaq Is Driving Faster Insights and Smarter Investment Decisions With Scalable AI Innovation

Nasdaq leveraged NVIDIA NIM™ and improved its performance across the board—delivering 30% faster response times, 30% higher chatbot accuracy, and real-time feedback to quickly address issues like latency and data errors.
Read Blog ❯
Published Aug 1, 2025

Open-Source Ecosystem Advances Inference Optimizations on GB200 NVL72

NVIDIA collaborated with SGLang to release a DeepSeek-R1 inference container optimized for large-scale deployment on GB200 NVL72, the world’s most advanced data center–scale accelerated computing platform. The container runs a single copy of the model across 56 Blackwell GPUs, achieving over 9,290 tokens/sec for decoding and 13,149 tokens/sec for prefill.
Read Thread ❯
Published Jul 29, 2025

How NVIDIA GB200 NVL72 and NVIDIA Dynamo Boost Inference Performance for MoE Models

New NVIDIA research shows how disaggregated serving with NVIDIA Dynamo and GB200 NVL72 accelerates inference for MoE models like DeepSeek-R1 and Llama 4—unlocking faster, more efficient AI performance.
Read Blog ❯
Published Jul 6, 2025

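As background, the NumPy sketch below shows the top-k expert routing that makes MoE models a natural fit for disaggregated, multi-GPU serving: each token touches only a few experts, so expert weights can be spread across GPUs. Shapes and names are illustrative, not NVIDIA Dynamo’s implementation.

    import numpy as np

    def moe_layer(x, gate_w, experts, k=2):
        logits = x @ gate_w                        # router scores: (tokens, n_experts)
        top = np.argsort(logits, axis=-1)[:, -k:]  # top-k expert ids per token
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in top[t]:
                # In disaggregated serving, this dispatch becomes an all-to-all
                # exchange between the GPUs that host different experts.
                out[t] += probs[t, e] * (x[t] @ experts[e])
        return out

    rng = np.random.default_rng(0)
    tokens, d, n_experts = 4, 8, 4
    x = rng.normal(size=(tokens, d))
    gate_w = rng.normal(size=(d, n_experts))
    experts = rng.normal(size=(n_experts, d, d))   # one placeholder FFN per expert
    print(moe_layer(x, gate_w, experts).shape)     # -> (4, 8)
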
Influencer Harper Carroll’s Video on Reasoning Models and NVIDIA’s Inference Leadership

Hear from community leader Harper Carroll on how AI reasoning enables models to think step by step, boosting their capabilities but also increasing token usage, which makes optimized inference platforms like NVIDIA’s essential. Researchers are now exploring chain-of-thought (CoT) monitoring as a way to improve transparency and safety in advanced AI systems.
Watch the Video on LinkedIn ❯
Follow the conversation on Instagram and X
Published Jul 24, 2025