After more than a year of planning and training, a volunteer-led project has produced an open source language model that they claim is as powerful as OpenAI’s GPT-3, but free and open to anyone (if they have the computing power). The model, called Bloom, is available in open source, along with the code and datasets used to create it. Brooklyn-based AI startup Hugging Face has released a free web app that allows anyone to try Bloom without having to download it.
Bloom is the brainchild of BigScience, an international community-driven project that aims to make large natural language models widely available for research. Large language models, or “LLMs” for short, can translate, summarize and write text with a human nuance – more or less. (See GPT-3.) But they’ve historically been expensive to make, leaving them out of the reach of researchers and firmly in the hands of Big Tech companies like Meta, Google, and Microsoft.
That is finally changing, thanks in part to the efforts of BigScience. The group’s more than 1,000 volunteer researchers — supported by ethicists, philosophers, lawyers and engineers from both startups and major tech companies — spent months working on Bloom, which rivals LLMs made by companies like OpenAI and Alphabet’s DeepMind in scale. Bloom is one of the largest open source models to work in multiple languages and is designed to be applied in a range of research applications, such as extracting information from historical texts.
“Bloom can generate text in 46 natural languages and dialects and 13 programming languages,” reads a blog post shared with TechCrunch prior to release. “Although never trained in any of those specific tasks, Bloom may be asked to create summaries or translations of text, execute code from instructions, and follow directions to complete original tasks such as writing recipes, extracting information from a news article or composing sentences with a newly defined invented word… Bloom’s performance will continue to improve as the workshop continues to experiment and move beyond Bloom.”
BigScience’s funders also hope Bloom will spur new research into ways to combat the problems that plague all LLMs, including bias and toxicity. LLMs tend to spew falsehoods and prejudice against religions, genders, races, and people with disabilities. They also struggle with the basics of writing, often changing the topic of a conversation without a segue, and repeating themselves endlessly, or even contradicting themselves.
“[Bloom] shows the enduring power of open source and open science, even for expensive, large base models,” Richard Socher, the CEO of You.com and former chief scientist at Salesforce, told TechCrunch via email. Socher is not involved with BigScience. “It also shows that in AI no organization has a big lead for very long. Once an organization shows that something is achievable, the same opportunities will appear in other places six to 12 months later.”
The origins of BigScience lie in discussions years ago between Hugging Face chief science officer Thomas Wolf, GENCI’s Stéphane Requena and IDRIS’ Pierre-François Lavallée. The founders wanted to create software, datasets, LLMs and tools to explore the social impact of AI, a topic that has only recently begun to receive more attention from the research community.
Steering committees were soon formed to provide members of BigScience – drawn from more than 60 countries and 250 institutions – with scientific and general advice, design collaborative tasks and organize workshops, hackathons and public events. Several working groups were tasked with addressing challenges such as data management, proving theorems in mathematics and archival strategies, as well as privacy and informed consent and other legal issues.
Bloom is the sum of their work. It was trained using $7 million worth of government-funded computing time, provided through grants, on the Jean Zay supercomputer near Paris, France, which is among the most powerful machines in the world.
There is a heated discussion in academic circles about the CO2 impact of AI training; data centers are not particularly environmentally friendly. But BigScience says that, thanks to its unique cooling system and nuclear power source, Jean Zay was able to train Bloom with a carbon footprint equivalent to a flight from Paris to New York.
Like all language models, Bloom is essentially a statistical tool to predict words. Fueled by a huge number of examples from a 1.6-terabyte training dataset, Bloom learned the likelihood of words occurring based on patterns, including the semantic context of surrounding text. For example, given a typical email that ends with the “Looking forward…” snippet, Bloom could complete it with “… to hear back.”
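To make the idea of "predicting words from patterns" concrete, here is a deliberately tiny sketch. It is not how Bloom works internally: Bloom is a large neural network, while this toy simply counts which word follows which in a handful of made-up sentences. The corpus and the function names are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

# Made-up training sentences (an assumption for illustration only).
TOY_CORPUS = [
    "looking forward to hearing back",
    "looking forward to hear back",
    "looking forward to the weekend",
    "looking back on the year",
]


def train_bigram_model(corpus):
    """Count how often each word is followed by each other word."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_probabilities(model, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    counts = model.get(word.lower())
    if not counts:
        return {}
    total = sum(counts.values())
    return {candidate: n / total for candidate, n in counts.items()}


def predict_next(model, word):
    """Return the most likely next word, or None if `word` was never followed by anything."""
    probabilities = next_word_probabilities(model, word)
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


model = train_bigram_model(TOY_CORPUS)
print(next_word_probabilities(model, "looking"))  # {'forward': 0.75, 'back': 0.25}
print(predict_next(model, "looking"))             # forward
```

A model like Bloom replaces these simple counts with billions of learned parameters and looks at far more context than a single preceding word, but the task is the same: estimate which word is most likely to come next.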
One goal of the BigScience working groups was to collect data that was representative enough to train Bloom. Because of systemic biases in public data sources, non-English LLMs have traditionally not performed as well as their Anglophone counterparts. The 341-billion-word dataset used to train Bloom is based on books, academic publications, radio transcripts, podcasts and websites and aims to encode different cultural contexts in several languages, including Swahili, Catalan, Bengali and Vietnamese.
The BigScience groups hand-selected nearly two-thirds of the dataset from 500 sources and solicited suggestions from community groups, including the African natural language processing community Masakhane, LatinX in AI, and Machine Learning Tokyo. They edited for privacy and filtered by quality, for example in an effort to reduce an overrepresentation of porn sites, which often contain sexist associations.
Bloom isn’t completely free from bias – no LLM is. But the hope is that by maintaining transparency around the training data, it will be easier for researchers to get to the heart of Bloom’s predictions and decision-making.
Large in size
With 176 billion parameters, Bloom is about the size of GPT-3. Parameters in machine learning are the parts of the LLM learned from training data and tend to correlate with the effectiveness of the model for a task such as text generation.
In general, models with a higher parameter count require more computing power to train. A 2020 study from AI21 Labs pinned the cost of developing a text-generating model with just 1.5 billion parameters at a whopping $1.6 million; Bloom trained on 384 Nvidia A100 GPUs for three months. That fact has made it difficult for the community to use large, state-of-the-art language models, such as the Megatron-Turing Natural Language Generation (MT-NLG) from Microsoft and Nvidia, which has 530 billion parameters.
BigScience claims that researchers have the option to use Bloom for less than $40 an hour with a cloud provider. But to remove even this barrier to entry, the organization plans to release smaller, less hardware-intensive versions of Bloom and is developing a distributed system that will allow labs to share the model on their servers. An API is also in the works.
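For a rough sense of scale, the figures above can be turned into back-of-the-envelope numbers. The sketch below rests on two assumptions that are not from BigScience: that "three months" means about 90 days of round-the-clock training, and that each parameter is stored as a 16-bit (2-byte) value. The function names are hypothetical.

```python
NUM_GPUS = 384                  # from the article
NUM_PARAMETERS = 176_000_000_000  # from the article
ASSUMED_TRAINING_DAYS = 90      # assumption: "three months" taken as 90 days
ASSUMED_BYTES_PER_PARAM = 2     # assumption: 16-bit weights


def gpu_hours(num_gpus, days, hours_per_day=24):
    """Total GPU-hours if every GPU runs for the whole period."""
    return num_gpus * days * hours_per_day


def weight_memory_gb(num_parameters, bytes_per_param=ASSUMED_BYTES_PER_PARAM):
    """Memory needed just to hold the weights, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_param / 1e9


print(f"{gpu_hours(NUM_GPUS, ASSUMED_TRAINING_DAYS):,} GPU-hours")   # 829,440 GPU-hours
print(f"{weight_memory_gb(NUM_PARAMETERS):.0f} GB for weights alone")  # 352 GB for weights alone
```

Under these assumptions the weights alone would not fit on a single GPU, which helps explain why running a model of this size generally requires several accelerators working together.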
Bloom joins a burgeoning ecosystem of open source, highly capable LLMs with broad commercial and research applications. In February, the open AI research group EleutherAI released GPT-NeoX-20B, which outperformed other public language models at the time in several benchmarks. Months later, Meta open-sourced OPT-175B, which the company claimed was the first language model with 175 billion parameters made available to the AI community.
These models have been put to good use – companies have already sprung up around EleutherAI’s models. But some researchers fear abuse. At the University of Maryland, researchers found that it is possible for LLMs to generate fake news and cybersecurity reports convincing enough to fool experts. Another paper, co-authored by Meta researchers, examines the potential harm that can result from LLMs giving poor advice, particularly medical or psychological prognoses.
Many companies that provide access to LLMs through an API, such as OpenAI, apply filters to remove problematic text. But open source models clearly do not have such protections.
In recognition of the potential for abuse, Bloom comes with documentation outlining its capabilities and limitations. To use it, you must agree to a legal license that obliges researchers not to use the model for malicious purposes. BigScience plans to monitor how the model is being applied and to adjust the license and documentation where necessary.
“We plan to add more languages, make the model smaller so that it is easier to use with the same level of performance, and we will support community efforts to expand it,” the blog post continues. “Bloom is a living family of models that will grow, not a one-time model.”