Explore NVIDIA’s approach to efficient, high-performance AI infrastructure.
Think SMART Inference Signals: Recent Recap
Top Inference Highlights
The More You Buy, the More You Make—What Happens When You Think SMART
NVIDIA’s AI Factory is more than infrastructure: it’s a force multiplier that scales inference, boosts productivity, and accelerates breakthroughs across science, health, and climate. Purpose-built and optimized for inference at scale with NVIDIA Blackwell, it’s designed to deliver performance, efficiency, and ROI across industries.

Read Blog

Published May 30, 2025
OpenAI and NVIDIA Propel Innovation With Open Models Optimized for World’s Largest AI Inference Infrastructure
NVIDIA delivers industry-leading GPT-OSS-120B performance at 1.5 million tokens per second on a single Blackwell GB200 NVL72 system. Trained on NVIDIA GPUs and optimized across the full stack, the models run best on Blackwell and RTX GPUs. They run on the world’s largest installed base of hundreds of millions of CUDA GPUs, from laptops to data centers and cloud platforms, powering global innovation.

Read Blog

Published Aug 5, 2025
Latest Inference News and Resources
Together AI Delivers Top Speeds for DeepSeek-R1-0528 Inference on NVIDIA Blackwell
Together AI delivers record-setting inference speed with the DeepSeek-R1-0528 model, enabled by the NVIDIA Blackwell platform. Purpose-built to deliver high-performance compute, memory, and bandwidth, NVIDIA Blackwell is enabling the next generation of AI infrastructure.

Read Blog

Published Jul 17, 2025
NVIDIA Dynamo Adds Support for AWS Services to Deliver Cost-Efficient Inference at Scale
Dynamo adds support for popular AWS services, unlocking new levels of performance, scalability, and cost-efficiency for serving large language models.

Read Blog

Published Jul 15, 2025
CoreWeave Leads the Way With First NVIDIA GB300 NVL72 Deployment
CoreWeave is deploying NVIDIA Blackwell Ultra for inference at scale, using NVIDIA GB300 NVL72 systems powered by NVIDIA networking and delivered by Dell Technologies. Each rack delivers over one exaflop of dense AI performance and up to 40 TB of fast memory—designed to meet the demands of large-scale inference.

Read Blog

Published Jul 3, 2025
VAST Inference Evolution Featuring Dynamo NIXL Integration for Maximum Compute Efficiency
VAST Data and NVIDIA Dynamo, powered by NVIDIA NIXL, are redefining inference at scale, enabling high-speed KV cache transfers across GPUs, CPUs, and storage. Get 10x faster time-to-first-token and disaggregate prefill and decode with a persistent cache architecture designed for maximum throughput (a conceptual sketch of the disaggregation pattern follows this item).

Read Blog

Published Jul 1, 2025
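For readers who want the core idea in code: below is a minimal sketch of disaggregated serving, where a prefill worker builds the KV cache for the whole prompt once and a decode worker resumes generation from the transferred cache. All names and sizes are hypothetical; this illustrates the pattern, not the Dynamo or NIXL APIs.

```python
import jax
import jax.numpy as jnp

D = 16  # head dimension (hypothetical size)

def attend(q, K, V):
    # Standard scaled dot-product attention over the cached keys/values.
    w = jax.nn.softmax(q @ K.T / jnp.sqrt(D))
    return w @ V

def prefill(prompt_emb, Wk, Wv):
    # Prefill worker: one pass over the full prompt builds the KV cache.
    return prompt_emb @ Wk, prompt_emb @ Wv    # K, V of shape (T, D)

def decode_step(x, kv, Wk, Wv):
    # Decode worker: extend the transferred cache one token at a time,
    # never recomputing the prompt's keys and values.
    K, V = kv
    K = jnp.concatenate([K, x @ Wk], axis=0)
    V = jnp.concatenate([V, x @ Wv], axis=0)
    return attend(x, K, V), (K, V)

key = jax.random.PRNGKey(0)
Wk, Wv = jax.random.normal(key, (2, D, D))
prompt = jax.random.normal(key, (128, D))    # 128 prompt tokens

kv = prefill(prompt, Wk, Wv)            # runs on the prefill pool
# In production, the cache is now shipped to a decode worker; that
# high-speed transfer across GPUs, CPUs, and storage is NIXL's job.
x = jnp.ones((1, D))                    # next token's input embedding
out, kv = decode_step(x, kv, Wk, Wv)    # runs on the decode pool
print(out.shape, kv[0].shape)           # (1, 16) (129, 16)
```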
Inference at Scale With NVIDIA GB200 NVL72 on AWS
Now available as Amazon EC2 P6e-GB200 instances, the NVIDIA GB200 NVL72 platform with NVLink™ accelerates training and inference for cutting-edge applications—from drug discovery to software development.

Watch Video

Published Jul 9, 2025
From Prompt to Paris: How AI Agents Launch a Food Truck Dream
What happens when you ask an AI to launch a food truck? Perplexity’s agent system breaks the prompt into tasks—research, design, planning—using NVIDIA-accelerated inference to deliver a full business plan in seconds.

Watch Video

Published Jul 11, 2025
NVIDIA Inference Technology Highlights
Think Smart and Ask an Encyclopedia-Sized Question: Multimillion Token Real-Time Inference for 32x More Users
What if you could ask a chatbot a question the size of an entire encyclopedia and get an answer in real time? Multimillion-token queries serving 32x more users are now possible with Helix Parallelism, an NVIDIA Research innovation that drives inference at massive scale.

Read Blog

Published Jul 7, 2025
Introducing NVFP4 for Efficient and Accurate Low-Precision Inference
NVFP4 is a new four-bit format that improves AI inference efficiency while preserving accuracy through fine-grained block scaling, enabling up to 50x higher energy efficiency and lower TCO at scale (a toy quantization sketch follows this item).

Read Blog

Published Jun 24, 2025
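As a rough illustration of how block scaling preserves accuracy at four bits, here is a toy "fake quantization" pass in JAX. The E2M1 value grid and 16-element block size follow NVIDIA's published description of NVFP4, but the FP8 scale storage is simplified to float32 here; treat this as a sketch, not the production recipe.

```python
import jax
import jax.numpy as jnp

# The representable magnitudes of an FP4 E2M1 value (what NVFP4 stores
# per element). NVFP4 pairs each 16-element block with an FP8 scale;
# this toy keeps the scale in float32 for clarity.
FP4_GRID = jnp.array([0., 0.5, 1., 1.5, 2., 3., 4., 6.])

def quantize_block(block):
    # Scale the block so its largest magnitude maps to the grid max (6),
    # then round each element to the nearest representable FP4 value.
    scale = jnp.max(jnp.abs(block)) / 6.0
    scaled = block / scale
    idx = jnp.argmin(jnp.abs(jnp.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    return jnp.sign(scaled) * FP4_GRID[idx], scale

def fake_nvfp4(x, block=16):
    # Per-block scaling keeps quantization error local: one outlier only
    # distorts its own 16 values, not the whole tensor.
    blocks = x.reshape(-1, block)
    q, s = jax.vmap(quantize_block)(blocks)
    return (q * s[:, None]).reshape(x.shape)

x = jax.random.normal(jax.random.PRNGKey(0), (4, 64))
xq = fake_nvfp4(x)
print(jnp.max(jnp.abs(x - xq)))    # small per-block quantization error
```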
Optimizing for Low-Latency Communication in Inference Workloads With JAX and XLA
To help teams reduce latency in the decode stage when running LLM inference in production, we’re sharing techniques that minimize communication overhead for small message sizes, especially when compute and communication can’t overlap, using custom kernels, Google JAX FFI, and NVIDIA® CUDA® Graphs. A minimal JAX sketch of the tensor-parallel pattern these techniques accelerate follows this item.

Read Blog

Published Jul 18, 2025
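The blog's custom kernels and FFI bindings are beyond a snippet, so as a stand-in, here is the standard JAX tensor-parallel projection whose tiny per-token all-reduce is the overhead those techniques attack; jitting the whole step amortizes dispatch cost, much as CUDA Graphs do at the driver level. Shapes and names are hypothetical.

```python
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# Hypothetical decode-time shapes: batch of 1, so activations are tiny.
HIDDEN, OUT = 1024, 4096
mesh = Mesh(jax.devices(), axis_names=("tp",))

def proj(x_shard, w_shard):
    # Each device multiplies its slice of the activation by its slice
    # of the weight, then one all-reduce (psum) combines the partial
    # sums. At batch 1 this psum moves small messages that can't hide
    # behind compute, which is exactly the latency hot spot the blog's
    # custom kernels target.
    return jax.lax.psum(x_shard @ w_shard, axis_name="tp")

# Compile the whole step so XLA emits one fused program per token.
step = jax.jit(shard_map(proj, mesh=mesh,
                         in_specs=(P(None, "tp"), P("tp", None)),
                         out_specs=P(None, None)))

x = jnp.ones((1, HIDDEN))
w = jnp.ones((HIDDEN, OUT))
print(step(x, w).shape)    # (1, OUT)
```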
How Nasdaq Is Driving Faster Insights and Smarter Investment Decisions With Scalable AI Innovation
Nasdaq leveraged NVIDIA NIM™ and improved its performance across the board—delivering 30% faster response times, 30% higher chatbot accuracy, and real-time feedback to quickly address issues like latency and data errors.

Read Blog

Published Aug 1, 2025
Open-Source Ecosystem Advances Inference Optimizations on GB200 NVL72
NVIDIA collaborated with SGLang to release a DeepSeek-R1 inference container optimized for large-scale deployment on GB200 NVL72, the world’s most advanced data center–scale accelerated computing platform. The container runs a single copy of the model across 56 Blackwell GPUs, achieving over 9,290 tokens/sec for decoding and 13,149 tokens/sec for prefill.

Read Thread

Published Jul 29, 2025
How NVIDIA GB200 NVL72 and NVIDIA Dynamo Boost Inference Performance for MoE Models
New NVIDIA research shows how disaggregated serving with NVIDIA Dynamo and GB200 NVL72 accelerates inference for MoE models like DeepSeek-R1 and Llama 4—unlocking faster, more efficient AI performance.

Read Blog

Published Jul 6, 2025
Inference Spotlight
Influencer Harper Carroll’s Video on Reasoning Models and NVIDIA’s Inference Leadership
Hear from community leader Harper Carroll on how AI reasoning enables models to think step by step, boosting their capabilities but also increasing token usage and making optimized inference platforms like NVIDIA’s essential. Researchers are now exploring chain-of-thought (CoT) monitoring as a way to improve transparency and safety in advanced AI systems.

Watch the Video on LinkedIn

Follow the conversation on Instagram and X

Published Jul 24, 2025
Follow Us
Facebook   Twitter   YouTube   LinkedIn   NVIDIA Blog
You are receiving this email because you are subscribed to enterprise emails.
Privacy Center | Manage Preferences | Unsubscribe | Contact Us | View Online
© 2025 NVIDIA Corporation. All rights reserved.
NVIDIA Corporation, 2788 San Tomas Expressway, Santa Clara, CA 95051.