Even as Meta fends off questions and criticism of its new Llama 4 model family, GPU giant Nvidia has released a new, fully open source large language model (LLM) based on Meta's older Llama-3.1-405B-Instruct model, and it is claiming near top performance on a variety of third-party benchmarks, outperforming the vaunted rival open source reasoning model DeepSeek R1.
Llama-3.1-Nemotron-Ultra-253B-v1 is a dense 253-billion-parameter model designed to support advanced reasoning, instruction following, and AI assistant workflows. It was first mentioned back at Nvidia's annual GPU Technology Conference (GTC) in March.
The release reflects Nvidia's continued focus on performance optimization through architectural innovation and targeted post-training.
Announced last night, April 7, 2025, the model code is now publicly available on Hugging Face, with open weights and post-training data. It is designed to operate effectively in both "reasoning on" and "reasoning off" modes, allowing developers to toggle between high-complexity reasoning tasks and more straightforward outputs based on system prompts.
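A minimal sketch of what that toggle could look like in practice, assuming a chat-style API where the system prompt selects the mode (the exact prompt strings, "detailed thinking on"/"detailed thinking off", are an assumption here, not confirmed by the article):

```python
# Sketch: switching reasoning mode via the system prompt.
# The prompt strings below are assumed conventions, not confirmed wording.
def build_messages(user_prompt: str, reasoning: bool) -> list[dict]:
    """Build a chat message list; the system prompt toggles the mode."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# A reasoning-on request for a high-complexity task:
messages = build_messages(
    "Prove that the sum of two even numbers is even.", reasoning=True
)
print(messages[0]["content"])  # -> detailed thinking on
```

The same message list would then be fed through the model's chat template at inference time; only the system line changes between modes.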
Designed for efficient inference
Llama-3.1-Nemotron-Ultra-253B builds on Nvidia's previous work in inference-optimized LLM development. Its architecture, customized through a Neural Architecture Search (NAS) process, introduces structural variations such as skipped attention layers, fused feedforward networks (FFNs), and variable FFN compression ratios.
This architectural overhaul reduces memory footprint and computational demands without severely impacting output quality, enabling deployment on a single 8x H100 GPU node.
The result, according to Nvidia, is a model that offers strong performance while being more cost-effective to deploy in data center environments. Additional hardware compatibility includes support for Nvidia's B100 and Hopper microarchitectures, with configurations validated in both BF16 and FP8 precision modes.
Post-training for reasoning and alignment
Nvidia enhanced the base model through a multi-phase post-training pipeline. This included supervised fine-tuning across domains such as math, code generation, chat, and tool use, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to further boost instruction-following and reasoning performance.
The model underwent a knowledge distillation phase over 65 billion tokens, followed by continual pretraining on an additional 88 billion tokens.
Training datasets included sources like FineWeb, Buzz-V1.2, and Dolma. Post-training prompts and responses were drawn from a combination of public corpora and synthetic generation methods, including datasets that taught the model to differentiate between its reasoning modes.
Improved performance across numerous domains and benchmarks
Evaluation results show notable gains when the model operates in reasoning-enabled mode. For instance, on the MATH500 benchmark, performance increased from 80.40% in standard mode to 97.00% with reasoning enabled.
Similarly, results on the AIME25 benchmark rose from 16.67% to 72.50%, and LiveCodeBench scores more than doubled, jumping from 29.03% to 66.31%.
Performance gains were also observed in tool-based tasks like BFCL V2 and function composition, as well as in general question answering (GPQA), where the model scored 76.01% in reasoning mode versus 56.60% without.
These benchmarks were conducted with a maximum sequence length of 32,000 tokens, and each test was repeated up to 16 times to ensure accuracy.
Compared to DeepSeek R1, a state-of-the-art mixture-of-experts (MoE) model with 671 billion total parameters, Llama-3.1-Nemotron-Ultra-253B shows competitive results despite having fewer than half as many parameters (model settings), outperforming it on tasks like GPQA (76.01 vs. 71.5), IFEval instruction following (89.45 vs. 83.3), and LiveCodeBench coding tasks (66.31 vs. 65.9).
Meanwhile, DeepSeek R1 holds a clear advantage on certain math evaluations, particularly AIME25 (79.8 vs. 72.50), and slightly edges it out on MATH500 (97.3 vs. 97.00).
These results suggest that despite being a dense model, Nvidia's offering matches or exceeds MoE alternatives on reasoning and general instruction alignment tasks, while trailing slightly in math-heavy categories.
Usage and integration
The model is compatible with the Hugging Face Transformers library (version 4.48.3 recommended) and supports input and output sequences up to 128,000 tokens.
Developers can control reasoning behavior via system prompts and select decoding strategies based on task requirements.
For reasoning tasks, Nvidia recommends temperature sampling (0.6) with a top-p value of 0.95. For deterministic outputs, greedy decoding is preferred.
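Those recommendations map directly onto the keyword arguments of Hugging Face's `generate()` method. A small sketch of how the two decoding strategies might be selected (the values come from the article; the helper function itself is illustrative):

```python
# Sketch: decoding parameters per Nvidia's recommendations.
# Keys match the kwargs accepted by Hugging Face `model.generate()`.
def decoding_kwargs(reasoning: bool) -> dict:
    if reasoning:
        # Reasoning tasks: temperature sampling at 0.6 with top-p of 0.95.
        return {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    # Deterministic outputs: greedy decoding (no sampling).
    return {"do_sample": False}

print(decoding_kwargs(reasoning=True))
print(decoding_kwargs(reasoning=False))
```

In practice these kwargs would be passed to `model.generate(**inputs, **decoding_kwargs(reasoning))` after loading the checkpoint with `AutoModelForCausalLM.from_pretrained(...)`, which for a 253B-parameter model requires a multi-GPU node.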
Llama-3.1-Nemotron-Ultra-253B supports multilingual applications, with capabilities in English and several additional languages, including German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
It is also suitable for common LLM use cases such as chatbot development, AI agent workflows, retrieval-augmented generation (RAG), and code generation.
Licensed for commercial use
Released under the Nvidia Open Model License and governed by the Llama 3.1 Community License Agreement, the model is ready for commercial use.
Nvidia has emphasized the importance of responsible AI development, encouraging teams to evaluate the model's alignment, safety, and bias profiles for their specific use cases.
Oleksii Kuchaiev, Director of AI Model Post-Training at Nvidia, shared the announcement on X, stating that the team was excited to share the open release, describing it as a dense 253B model designed with toggle ON/OFF reasoning capabilities and released with open weights and data.