The field of natural language processing (NLP) saw its most significant breakthrough to date in 2017 with the introduction of the Transformer neural network architecture for language modeling. This led to what can only be described as a Cambrian explosion of language models, in which we can discern three main species: encoder models (such as BERT), decoder models (such as GPT), and encoder-decoder models (such as T5).
Language models quickly set new state-of-the-art results across the board on NLP tasks and became the standard tool for building NLP applications. The development of new models continued at an unwavering pace, and the next significant breakthrough came in the early summer of 2020, when OpenAI published a paper describing by far the largest language model to date: GPT-3.
GPT-3 is, as the name suggests, the third generation of GPT, the main difference from its predecessor (GPT-2) being the sheer size of the model: 96 layers, 175 billion parameters, a batch size of 3.2 million, and training on 570 GB of data. Apart from the impressive engineering achievement of simply scaling a model to this size, the model also displayed some interesting properties and abilities. In particular, its capacity for zero-shot and few-shot learning was reported to be unlike that of any other model in the field of NLP up to that point.
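To give a sense of where a number like 175 billion comes from, here is a rough back-of-the-envelope sketch. The layer count is the one quoted above; the hidden size of 12,288 and the roughly 50,000-token vocabulary are taken from the GPT-3 paper, and the 12·L·d² rule is a standard approximation for a Transformer decoder rather than an exact count.

```python
# Rough parameter count for a GPT-3-sized Transformer decoder.
# Approximation: each layer has ~4*d^2 parameters in attention
# (Q, K, V and output projections) and ~8*d^2 in the feed-forward
# block (two matrices of size d x 4d), i.e. ~12*d^2 per layer.
n_layers = 96          # from the GPT-3 paper
d_model = 12288        # hidden size, from the GPT-3 paper
vocab_size = 50257     # GPT-2/GPT-3 BPE vocabulary

per_layer = 12 * d_model ** 2
embeddings = vocab_size * d_model

total = n_layers * per_layer + embeddings
print(f"{total / 1e9:.1f} billion parameters")  # ~174.6 billion
```

The small remainder up to the reported 175 billion sits in biases, layer norms, and positional embeddings, which the approximation ignores.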
It soon became apparent that these were not just empty words. OpenAI took the somewhat controversial decision not to release the model openly (as had been common practice in the field until then) and instead opted to commercialize access to the model through an API, which relies wholly on the model’s ability to do zero-shot and few-shot learning. The new guild of “prompt hacking” was born, and GPT-3 became a commercial success.
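For readers unfamiliar with the terminology: in few-shot learning the task is specified entirely through the prompt, with a handful of worked examples, and the model is never fine-tuned. GPT-3 itself is only reachable through OpenAI’s API, so the sketch below uses the openly available gpt2 model from Hugging Face as a stand-in to illustrate the prompting pattern; the model choice and the toy translation task are just illustrative.

```python
# A minimal sketch of few-shot prompting with a generative language model.
# The public "gpt2" model stands in for GPT-3; the prompting idea is the same.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The "few shots" are worked examples placed directly in the prompt;
# the model only conditions on this text and continues it.
prompt = (
    "Translate English to Swedish.\n"
    "English: house\nSwedish: hus\n"
    "English: book\nSwedish: bok\n"
    "English: tree\nSwedish:"
)

print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```

A small model like gpt2 will often get this wrong; the point of GPT-3 was precisely that, at sufficient scale, this kind of in-context specification starts to work surprisingly well.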
As the name suggests, GPT-SW3 is the Swedish counterpart to GPT-3: a large-scale generative pre-trained Transformer built on massive amounts of Swedish data, with the explicit purpose of being able to generate Swedish text and thereby function in zero-shot and few-shot scenarios. The initiative to develop such a model for Swedish is led by the NLU research group at AI Sweden and involves a number of collaborators, including RISE, the WASP WARA on media and language, and Nvidia. GPT-SW3 already exists in a preliminary version with 3.5 billion parameters, trained on a moderately sized corpus of approximately 100 GB of mostly web data. Our current goal is to train a 175 billion parameter model on close to 1 TB of data.
There is only one publicly available computer in Sweden that is sufficiently large to train a model of this size: Berzelius, located at Linköping University. Berzelius is an Nvidia DGX SuperPOD equipped with 60 DGX-A100 machines connected with an Nvidia Mellanox InfiniBand HDR network. The system is a donation from the Knut and Alice Wallenberg Foundation and can reach a theoretical performance of 300 petaflops. For training our preliminary 3.5B version of GPT-SW3, we used 16 nodes for 2.5 days. For training the 175B version, we estimate that we will need 30 nodes for approximately 90 days.
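To put those numbers in perspective, here is a quick calculation of the GPU time involved, assuming the standard configuration of 8 A100 GPUs per DGX node; the figures for the 175B run are of course our own rough estimates, not measured values.

```python
# Rough GPU-hour estimates for the two training runs on Berzelius,
# assuming 8 A100 GPUs per DGX node.
gpus_per_node = 8

def gpu_hours(nodes: int, days: float) -> float:
    """Total GPU-hours for a run using `nodes` DGX nodes for `days` days."""
    return nodes * gpus_per_node * days * 24

print(f"3.5B model: {gpu_hours(16, 2.5):,.0f} GPU-hours")   # ~7,680
print(f"175B model: {gpu_hours(30, 90):,.0f} GPU-hours")    # ~518,400
```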
My aim with this blog, and that of the AI Sweden NLU team, is to be as open and transparent as possible about the development process, and to invite several different perspectives on this type of model, both our own and those of people who may be more critical of this development. We will cover topics ranging from the details of the development process (e.g. data collection and filtering), to our reasons and considerations for leading this initiative, to more philosophical questions about the limits of large language models and the potential (or not) to reach AGI.
Follow, share, and do let us know your thoughts.
Best,
Magnus