Meta’s new flagship AI language model Llama 4 arrived suddenly over the weekend, with the parent company of Facebook, Instagram, WhatsApp and Quest VR (among other services and products) revealing not one, not two, but three versions, all upgraded to be more powerful and performant using the popular “Mixture-of-Experts” architecture and a new training method involving fixed hyperparameters, known as MetaP.
All three are also equipped with massive context windows: the amount of information an AI language model can handle in a single input/output exchange with a user or tool.
But following the surprise announcement and public release of two of those models for download and use on Saturday, the lower-parameter Llama 4 Scout and the mid-tier Llama 4 Maverick, the response from the AI community on social media has been less than adoring.
Llama 4 sparks confusion and criticism among AI users
An unverified post on the North American Chinese-language community forum 1point3acres made its way over to the r/LocalLlama subreddit on Reddit, alleging to be from a researcher at Meta’s GenAI organization who claimed that the model performed poorly on third-party benchmarks internally and that company leadership “suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a ‘presentable’ result.”
The post was met with skepticism from the community as to its authenticity, and a VentureBeat email to a Meta spokesperson has not yet received a reply.
But other users found reasons to doubt the benchmarks regardless.
“At this point, I highly suspect Meta bungled up something in the released weights … if not, they should lay off everyone who worked on this and then use money to acquire Nous,” commented @cto_junior on X, in reference to an independent user test showing Llama 4 Maverick’s poor performance (16%) on a benchmark called aider polyglot, which runs a model through 225 coding tasks. That’s well below the performance of comparably sized, older models such as DeepSeek V3 and Claude 3.7 Sonnet.
Referencing the 10 million-token context window Meta touted for Llama 4 Scout, AI PhD and author Andriy Burkov wrote on X, in part: “The declared 10M context is virtual because no model was trained on prompts longer than 256k tokens. This means that if you send more than 256k tokens to it, you will get low-quality output most of the time.”
Also on the r/LocalLlama subreddit, user Dr_Karminski wrote that “I’m incredibly disappointed with Llama-4,” and demonstrated its poor performance compared to DeepSeek’s non-reasoning V3 model on coding tasks such as simulating balls bouncing around a heptagon.
Former Meta researcher and current AI2 (Allen Institute for Artificial Intelligence) Senior Research Scientist Nathan Lambert took to his Interconnects Substack blog on Monday to point out that a benchmark comparison Meta posted to its own Llama download site, pitting Llama 4 Maverick against other models on cost-to-performance via the third-party head-to-head comparison tool LMArena ELO (aka Chatbot Arena), actually used a different version of Llama 4 Maverick than the one the company had made publicly available: one “optimized for conversationality.”
As Lambert wrote: “Sneaky. The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models that come around to maximize on ChatBotArena while destroying the model’s performance on important skills like math or code.”
Lambert went on to note that while this particular model on the arena was “tanking the technical reputation of the release because its character is juvenile,” including lots of emojis and frivolous emotive conversation, “The actual model on other hosting providers is quite smart and has a reasonable tone!”
In response to the torrent of criticism and accusations of benchmark cooking, Meta’s VP and Head of GenAI Ahmad Al-Dahle took to X to state:
“We’re glad to start getting Llama 4 in all your hands. We’re already hearing lots of great results people are getting with these models.
That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in. We’ll keep working through our bug fixes and onboarding partners.
We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
We believe the Llama 4 models are a significant advancement and we’re looking forward to working with the community to unlock their value.”
Yet even that response was met with many complaints of poor performance and calls for further information, such as more technical documentation outlining the Llama 4 models and their training processes, as well as additional questions about why this release, compared to all prior Llama releases, was particularly riddled with issues.
It also comes on the heels of Meta’s VP of Research Joelle Pineau, who worked in the adjacent Meta Fundamental AI Research (FAIR) organization, announcing her departure from the company on LinkedIn last week with “nothing but admiration and deep gratitude for each of my managers.” Pineau, it should be noted, also promoted the release of the Llama 4 model family this weekend.
Llama 4 continues to spread to other inference providers with mixed results, but it’s safe to say the initial release of the model family has not been a slam dunk with the AI community.
And the upcoming Meta LlamaCon on April 29, the first celebration and gathering for third-party developers of the model family, will likely have much fodder for discussion. We’ll be tracking it all. Stay tuned.