New open source AI leader Reflection 70B’s performance questioned, accused of ‘fraud’

It took only one weekend for the new, self-proclaimed king of open source AI models to have its crown tarnished.

Reflection 70B is a variant of Meta’s Llama 3.1 open source large language model (LLM), or wait, was it a variant of the older Llama 3? Trained and released by small New York startup HyperWrite (formerly OthersideAI), it boasted impressive, leading benchmarks on third-party tests, but it has now been aggressively questioned as other third-party evaluators have failed to reproduce some of those performance measures.

The model was triumphantly announced in a post on the social network X by HyperWrite AI co-founder and CEO Matt Shumer on Friday, September 6, 2024 as “the world’s top open-source model.”

I’m excited to announce Reflection 70B, the world’s top open-source model.
Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.
405B coming next week – we expect it to be the best model in the world.
Built w/ @GlaiveAI.
Read on: pic.twitter.com/kZPW1plJuo
— Matt Shumer (@mattshumer_) September 5, 2024

In a series of public X posts documenting some of Reflection 70B’s training process, and in a subsequent interview with VentureBeat over X Direct Messages, Shumer explained more about how the new LLM used “Reflection Tuning,” a previously documented technique developed by other researchers outside the company. With it, LLMs check the correctness of, or “reflect” on, their own generated responses before outputting them to users, improving accuracy on a number of tasks in writing, math, and other domains.

However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to “Independent analysis of AI models and hosting providers,” posted its own analysis on X stating that “our evaluation of Reflection Llama 3.1 70B’s MMLU score” (a reference to the commonly used Massive Multitask Language Understanding, or MMLU, benchmark) “resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B,” showing a major discrepancy with HyperWrite and Shumer’s originally posted results.

Our evaluation of Reflection Llama 3.1 70B’s MMLU score resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B.
A LocalLLaMA post (link below) also compared the diff of Llama 3.1 & Llama 3 weights to Reflection Llama 3.1 70B and concluded the… pic.twitter.com/hqvFp2TyCC
— Artificial Analysis (@ArtificialAnlys) September 7, 2024

On X that same day, Shumer stated that Reflection 70B’s weights (the settings of the open source model) had been “fucked up during the upload process” to Hugging Face, the third-party AI code hosting repository and company, and that this issue could have resulted in worse performance compared to HyperWrite’s “internal API” version.

We’ve discovered the problem. The reflection weights on Hugging Face are actually a mix of a few different models; something got fucked up during the upload process.
Will fix today. https://t.co/rKuOlTApRK
— Matt Shumer (@mattshumer_) September 7, 2024

On Sunday, September 8, 2024 at around 10 pm ET, Artificial Analysis posted on X that it had been “given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.”

Reflection 70B update: Quick note on timeline and outstanding questions from our perspective
Timeline:
– We tested the initial Reflection 70B release and saw worse performance than Llama 3.1 70B.
– We were given access to a private API which we tested and saw impressive…
— Artificial Analysis (@ArtificialAnlys) September 9, 2024

The organization detailed two key questions that seriously call into question HyperWrite and Shumer’s initial performance claims, namely:

“We’re not clear on why a version would be published which isn’t the version we tested via Reflection’s private API.
We’re not clear why the model weights of the version we tested wouldn’t be released yet.

As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.”

All the while, users on various machine learning and AI Reddit communities, or subreddits, have also called into question Reflection 70B’s stated performance and origins. Some have pointed out that, based on a model comparison posted on GitHub by a third party, Reflection 70B appears to be a Llama 3 variant rather than a Llama 3.1 variant, casting further doubt on Shumer and HyperWrite’s initial claims.
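For readers wondering what such a weight comparison involves, here is a minimal sketch of the general idea, not the third party’s actual method. It assumes that a fine-tuned model typically stays numerically closer to the base model it was trained from than to any other base. The function names (`l2_distance`, `closest_base`) and all of the numbers are hypothetical toy values chosen for illustration; real checkpoints hold billions of parameters and would be loaded from disk.

```python
from math import sqrt


def l2_distance(a, b):
    """Euclidean distance between two equally sized flat lists of weights."""
    if len(a) != len(b):
        raise ValueError("tensor sizes differ")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_base(candidate, bases):
    """Rank candidate base models from nearest to farthest.

    candidate: dict mapping tensor name -> flat list of floats
    bases: dict mapping a base-model label -> dict in the same layout
    Returns a list of (label, summed_distance) pairs, nearest first.
    """
    totals = {}
    for label, weights in bases.items():
        shared = candidate.keys() & weights.keys()
        if not shared:
            raise ValueError(f"no shared tensors with {label}")
        totals[label] = sum(l2_distance(candidate[k], weights[k]) for k in shared)
    return sorted(totals.items(), key=lambda item: item[1])


# Toy stand-ins for real checkpoints (hypothetical values, not real weights).
llama_3 = {"layer0.weight": [0.10, -0.20, 0.30], "layer1.weight": [0.50, 0.40]}
llama_3_1 = {"layer0.weight": [0.40, 0.10, -0.30], "layer1.weight": [-0.20, 0.90]}
fine_tune = {"layer0.weight": [0.11, -0.19, 0.31], "layer1.weight": [0.49, 0.41]}

ranking = closest_base(fine_tune, {"Llama 3": llama_3, "Llama 3.1": llama_3_1})
print(ranking[0][0])  # prints "Llama 3"
```

In this toy case the fine-tune sits much closer to the “Llama 3” stand-in, which is the kind of signal the Reddit and GitHub comparisons reportedly relied on. A small distance is suggestive rather than conclusive, since it says nothing about how or why the weights ended up that way.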

This has led at least one X user, Shin Megami Boson, to openly accuse Shumer of “fraud in the AI research community” as of 8:07 pm ET on Sunday, September 8, posting a long list of screenshots and other evidence.

A story about fraud in the AI research community:
On September 5th, Matt Shumer, CEO of OthersideAI, announces to the world that they’ve made a breakthrough, allowing them to train a mid-size model to top-tier levels of performance. This is huge. If it’s real.
It isn’t. pic.twitter.com/S0jWT8rDVb
— Shin Megami Boson (@shinboson) September 9, 2024

Others accuse the model of actually being a “wrapper,” or application, built atop proprietary, closed-source rival Anthropic’s Claude 3.

However, other X users have spoken up in defense of Shumer and Reflection 70B, and some have posted about the model’s impressive performance on their end.

I know @mattshumer_ and this doesn’t mesh with my understanding of him. He knows his stuff and is super pragmatic and works around problems in impressive ways that most people get bogged down on for months. I’d say maybe give the guy a little more time before you say stuff…
— Sasha krecinic (@SashaKrecinic) September 9, 2024

Regardless, the model’s rollout, lofty claims, and now criticism show how rapidly the AI hype cycle can come crashing down.

For now, the AI research community waits with bated breath for Shumer’s response and updated model weights on Hugging Face. VentureBeat has also reached out to Shumer for a direct response to these allegations of fraud and will update when we hear back.
