It took just one weekend for the new, self-proclaimed king of open source AI models to have its crown tarnished.
Reflection 70B is a variant of Meta’s Llama 3.1 open source large language model (LLM), or wait, was it a variant of the older Llama 3? Trained and released by small New York startup HyperWrite (formerly OthersideAI), it boasted impressive, leading results on third-party benchmarks, but it is now being aggressively questioned as other third-party evaluators have failed to reproduce some of the stated performance measures.
The model was triumphantly announced in a post on the social network X by HyperWrite AI co-founder and CEO Matt Shumer on Friday, September 6, 2024, as “the world’s top open-source model.”
In a series of public X posts documenting some of Reflection 70B’s training process, and in a subsequent interview with VentureBeat over X Direct Messages, Shumer explained more about how the new LLM used “Reflection Tuning,” a previously documented technique developed by other researchers outside the company that has LLMs check the correctness of, or “reflect” on, their own generated responses before outputting them to users, improving accuracy on a number of tasks in writing, math, and other domains.
However, on Saturday, September 7, a day after the initial HyperWrite announcement and VentureBeat article were published, Artificial Analysis, an organization dedicated to “Independent analysis of AI models and hosting providers,” posted its own analysis on X stating that “our evaluation of Reflection Llama 3.1 70B’s MMLU score,” a reference to the commonly used Massive Multitask Language Understanding (MMLU) benchmark, “resulted in the same score as Llama 3 70B and significantly lower than Meta’s Llama 3.1 70B,” showing a major discrepancy with the results HyperWrite and Shumer originally posted.
On X that same day, Shumer stated that Reflection 70B’s weights, or settings of the open source model, had been “fucked up during the upload process” to Hugging Face, the third-party AI code hosting repository and company, and that this issue may have resulted in worse performance compared to HyperWrite’s “internal API” version.
On Sunday, September 8, 2024, at around 10 pm ET, Artificial Analysis posted on X that it had been “given access to a private API which we tested and saw impressive performance but not to the level of the initial claims. As this testing was performed on a private API, we were not able to independently verify exactly what we were testing.”
The organization detailed two key questions that seriously call into question HyperWrite and Shumer’s initial performance claims, namely:
- “We’re not clear on why a version would be published which isn’t the version we tested via Reflection’s private API.
- We’re not clear why the model weights of the version we tested wouldn’t be released yet.
As soon as the weights are released on Hugging Face, we plan to re-test and compare to our evaluation of the private endpoint.”
All the while, users in various machine learning and AI Reddit communities, or subreddits, have also called into question Reflection 70B’s stated performance and origins. Some have pointed out that, based on a model comparison posted on GitHub by a third party, Reflection 70B appears to be a Llama 3 variant rather than a Llama 3.1 variant, casting further doubt on Shumer and HyperWrite’s initial claims.
This has led at least one X user, Shin Megami Boson, to openly accuse Shumer of “fraud in the AI research community” as of 8:07 pm ET on Sunday, September 8, posting a long list of screenshots and other evidence.
Others accuse the model of actually being a “wrapper,” or application built atop Claude 3, the proprietary, closed-source model from rival Anthropic.
Still, other X users have spoken up in defense of Shumer and Reflection 70B, and some have posted about the model’s impressive performance on their end.
Regardless, the model’s rollout, lofty claims, and now criticism show how rapidly the AI hype cycle can come crashing down.
For now, the AI research community waits with bated breath for Shumer’s response and updated model weights on Hugging Face. VentureBeat has also reached out to Shumer for a direct response to these allegations of fraud and will update when we hear back.