Hugging Face today announced it has acquired Seattle-based XetHub, a collaborative development platform founded by former Apple researchers to help machine learning teams work more efficiently with large datasets and models.
While the exact value of the deal remains undisclosed, CEO Clem Delangue said in an interview with Forbes that this is the largest acquisition the company has made to date.
The HF team plans to integrate XetHub’s technology with its platform and upgrade its storage backend, enabling developers to host larger models and datasets than currently possible, with minimal effort.
“The XetHub team will help us unlock the next 5 years of growth of HF datasets and models by switching to our own, better version of LFS as a storage backend for the Hub’s repos,” Julien Chaumond, the company’s CTO, wrote in a blog post.
What does XetHub bring to Hugging Face?
Founded in 2021 by Yucheng Low, Ajit Banerjee and Rajat Arya, who worked on Apple’s internal ML infrastructure, XetHub made a name for itself by providing enterprises with a platform to explore, understand and work with large models and datasets.
The offering enabled Git-like version control for repositories up to terabytes in size, allowing teams to track changes, collaborate and maintain reproducibility in their ML workflows.
Over these three years, XetHub drew a sizeable customer base, including major names like Tableau and Gather AI, with its ability to handle complex scalability needs stemming from constantly growing tools, files and artifacts. It improved storage and transfer processes using advanced techniques like content-defined chunking, deduplication, instant repository mounting and file streaming.
Now, with this acquisition, the XetHub platform will cease to exist and its data and model handling capabilities will come to the Hugging Face Hub, upgrading the model and dataset sharing platform with a more optimized storage and versioning backend.
On the storage front, the HF Hub currently uses Git LFS (Large File Storage) as the backend. It launched in 2020, but Chaumond says the company has long known that the storage system would not be enough beyond a certain point, given the constantly growing volume of large files in the AI ecosystem. It was a good place to start, but the company needed an upgrade, which will come with XetHub.
Currently, the XetHub platform supports individual files larger than 1TB, with total repository size going well above 100TB. That is a major upgrade over Git LFS, which supports a maximum file size of only 5GB and a repository size of 10GB. This will enable the HF Hub to host even larger datasets, models and files than currently possible.
On top of this, XetHub’s additional storage and transfer features will make the package even more attractive.
For instance, the platform’s content-defined chunking and deduplication capabilities will let users upload only the chunks containing new rows when a dataset is updated, rather than re-uploading the whole set of files (which takes a lot of time). The same will be the case for model repositories.
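To make the idea concrete, here is a minimal Python sketch of content-defined chunking combined with hash-based deduplication. It is not XetHub's or Hugging Face's implementation: the function names (`chunk`, `upload`), the simple gear-style rolling hash, and the tiny chunk sizes are all assumptions chosen for illustration. Chunk boundaries are picked from the bytes themselves, so unchanged regions of a file tend to produce the same chunks after an edit, and only chunks the store has not seen need to be sent.

```python
import hashlib
import random

# Hypothetical parameters, kept tiny so the demo runs quickly.
MASK = (1 << 6) - 1   # cut when the low 6 hash bits are zero (about 1 in 64 positions)
MIN_CHUNK = 16        # never cut before this many bytes
MAX_CHUNK = 256       # always cut at this many bytes

# Fixed pseudo-random table: one 32-bit value per byte value (gear-style hash).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at boundaries decided by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Shifting left ages out old bytes, so h reflects only recent content.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def upload(data: bytes, store: dict[str, bytes]) -> int:
    """Add unseen chunks to the store; return how many bytes were actually sent."""
    sent = 0
    for c in chunk(data):
        key = hashlib.sha256(c).hexdigest()
        if key not in store:
            store[key] = c
            sent += len(c)
    return sent


if __name__ == "__main__":
    rng = random.Random(1)
    v1 = bytes(rng.getrandbits(8) for _ in range(20_000))      # original "dataset"
    v2 = v1 + bytes(rng.getrandbits(8) for _ in range(500))    # 500 bytes of "new rows"
    store: dict[str, bytes] = {}
    print("first upload sent:", upload(v1, store), "bytes")
    print("second upload sent:", upload(v2, store), "bytes")
```

In this sketch, the second upload sends only the appended bytes plus, at most, the final partial chunk of the first version, instead of all 20,500 bytes. Production systems use far larger chunk sizes and more careful hashing, but the principle is the same.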
“As the field moves to trillion parameters models in the coming months (thanks Maxime Labonne for the new BigLlama-3.1-1T) our hope is that this new tech will unlock new scale both in the community and inside of enterprise companies,” the CTO noted. He added that the companies will work closely to launch solutions aimed at helping teams collaborate on their HF Hub assets and track how they’re evolving.
Currently, the Hugging Face Hub hosts 1.3 million models, 450,000 datasets and 680,000 spaces, totaling as much as 12PB in LFS.
It will be interesting to see how this number grows once the improved storage backend, with its support for larger models and datasets, comes into play. The timeline for the integration and the launch of other supporting features remains unclear at this stage.