Why do we need a large GPT for Swedish?
What are the advantages of building a large language model for Swedish, and what should we look out for?
Generality and usability
As we noted in our first post, increasing the size of language models (and of their training data) improves their capacity, in particular their generality. This generality manifests as zero-shot and few-shot capacity, which refers to the model’s ability to solve tasks it has not been specifically trained for, relying solely on carefully designed instructions (prompts). As an example, we might ask the model to translate from Swedish to English simply by using the instruction “Translate from Swedish to English:” and then adding the term or phrase we want to translate, e.g. “vörtbröd”, together with an arrow to signal where the model’s reply should go. This leads to the following input to the model:
Translate from Swedish to English:
vörtbröd =>
(Prompt instruction in boldface, followed by the target term on the next line, and an arrow to signal where the model’s reply will go. This particular prompt design is copied from Brown et al. 2020: Language Models are Few-Shot Learners.)
This is an example of a zero-shot setting. If we also add a couple of example translations from Swedish to English, we give the model additional context; this is referred to as a few-shot setting:
Translate from Swedish to English:
sill => herring
Translate from Swedish to English:
rädisa => radish
Translate from Swedish to English:
strömming =>
(Prompt examples and instruction in boldface, followed by the target term, and an arrow to signal the model’s reply.)
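As a minimal sketch of how such prompts could be assembled programmatically, the Python snippet below builds both the zero-shot and the few-shot variant as plain strings. The function name build_prompt and its parameters are hypothetical and only serve as illustration; how the resulting string is sent to a model is deliberately left out.

```python
def build_prompt(target, examples=(), instruction="Translate from Swedish to English:", arrow="=>"):
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples).

    `examples` is a sequence of (swedish, english) pairs. Each solved example
    is preceded by the instruction, mirroring the prompt layout shown above.
    """
    lines = []
    for source, translation in examples:
        lines.append(instruction)
        lines.append(f"{source} {arrow} {translation}")
    # The final query repeats the instruction and ends with the arrow,
    # which signals where the model's reply should go.
    lines.append(instruction)
    lines.append(f"{target} {arrow}")
    return "\n".join(lines)


# Zero-shot: only the instruction and the target term.
zero_shot = build_prompt("vörtbröd")

# Few-shot: two solved examples give the model additional context.
few_shot = build_prompt("strömming", examples=[("sill", "herring"), ("rädisa", "radish")])

print(zero_shot)
print(few_shot)
```

The only difference between the two settings is the number of solved examples placed before the final query; the model itself is not retrained in either case.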
There are also other ways to give instructions to the model. One very interesting and promising alternative to standard prompt engineering is P-tuning. Instead of hand-written text prompts, P-tuning relies on an external prompt model that learns input embeddings, which then act as continuous prompts to the GPT. We will write more about the potential of P-tuning, and how it might work with GPT-SW3, in a later post.
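To make the idea of a continuous prompt a bit more concrete, here is a toy Python sketch of the data flow only. It is not an implementation of P-tuning: there is no prompt model and no training, the embedding values are random, and all names (token_embeddings, continuous_prompt, model_input) are made up for illustration. It also assumes, as a simplification, that the prompt vectors are simply placed in front of the token embeddings.

```python
import random

random.seed(0)
EMBED_DIM = 4  # toy embedding size, chosen arbitrarily for illustration

# Embeddings of ordinary tokens, standing in for the language model's own
# (already trained) embedding table. Values here are random placeholders.
vocab = ["sill", "rädisa", "strömming", "=>"]
token_embeddings = {
    token: [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for token in vocab
}

# A continuous prompt: a few vectors that do not correspond to any word.
# In P-tuning these would be learned by the prompt model; here they are
# random placeholders.
NUM_PROMPT_VECTORS = 3
continuous_prompt = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(NUM_PROMPT_VECTORS)
]


def model_input(tokens, prompt_vectors):
    """Place the continuous prompt in front of the embedded tokens."""
    return prompt_vectors + [token_embeddings[token] for token in tokens]


# The text instruction is replaced by the prompt vectors; only the
# target term and the arrow are embedded as ordinary tokens.
inputs = model_input(["strömming", "=>"], continuous_prompt)
print(len(inputs), "input vectors of dimension", EMBED_DIM)
```

The point of the sketch is that the instruction no longer has to be expressed in words: what the model receives is a sequence of vectors, some of which are tuned for the task rather than looked up from text.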
The possibility to solve tasks without specific training is both interesting and potentially extremely valuable since it would drastically reduce the need for training data, and the need for building specialized solutions for each different application. Our hypothesis is that if we could provide API access to a Swedish language model with few-shot capacity, it would greatly simplify the use of language models for actors that have limited competence and hardware to build and deploy their own solutions.
The GPT-SW3 initiative was partly born from lessons learned during previous initiatives such as “language models for Swedish authorities”. During this project, whose goal was to help actors in the Swedish public sector use language models in their NLP solutions, we have seen significant need for and interest in language models in the public sector, but we have also seen significant challenges in using them. These challenges often concern data readiness, competence, and compute.
Data readiness refers to issues related to data, and can include questions such as “Do we have access to the data we want to use?”, “Can we legally use the data?”, “Is the data in the right format?”, “Is the data noisy?”, and “Is this the right data for solving the problem?”. In fact, we encounter such questions so often that we have started a dedicated project for improving data readiness in the public sector in Sweden: the Data Readiness Lab.
Access to, and retention of, competence in language modeling, NLP, or AI in general is also a challenge, in particular in the public sector. This limits the potential for end users to build their own solutions based on language models, even if the models are publicly available in user-friendly frameworks such as Hugging Face Transformers. We are currently exploring an NLU Talent Program for fostering and supplying competence in NLU to organizations that are partners in AI Sweden.
Access to suitable compute resources is also challenging for many actors, especially in the public sector. Even if their data is in order, they have recruited the required competence, and suitable models are publicly available, they still need sufficient compute resources to run the models (both for training and for inference). Acquiring and integrating such resources can be challenging, both with respect to cost and availability and with respect to IT infrastructure.
We believe that the few-shot capacity of very large language models may provide a way to leverage the power of language models without requiring end-users to tackle these challenges by themselves.
Linguistic sovereignty and digital democracy
Most of the current development of very large language models is driven by private companies in countries other than Sweden (and even on other continents). These actors may not have an interest in producing such resources for a small language such as Swedish, and even if they do, they may have neither the incentives nor the capacity to build and provide such models in a way that reflects Swedish circumstances and needs. We think it is important that a model that aims to be a foundational resource for Swedish NLP represents the entire Swedish language, and by extension, the entire Swedish population.
This ultimately boils down to a question of representative, fair, and transparent data, which we take very seriously and devote considerable effort to. Since the question of data is so important, we will devote our next series of posts entirely to this topic. For now, we simply note that the GPT-SW3 initiative has the ambition to build a model that represents the Swedish language and the Swedish population as exhaustively as possible. We build GPT-SW3 in Sweden, for Sweden.
This means that we also think it is important that a foundational Swedish GPT model can be made available to all sectors in Sweden that may have a need for NLP solutions. Our vision is thus to provide GPT-SW3 as a kind of basic infrastructure for Swedish NLP that everyone can access and build services and solutions on top of.
Our current plan is to provide access to the model through an API, in the same way as OpenAI provides access to GPT-3. This has proven to be a working solution in the case of GPT-3, and we want to explore whether the same approach will also work for GPT-SW3. However, we aim to do things a bit differently from OpenAI. Even if we currently think that access to GPT-SW3 will be limited to an API (we will discuss this in more detail in a coming post), we aim to be completely open, transparent, and collaborative throughout the entire development process.
Even if building a model such as GPT-SW3 is extremely resource-demanding with regard to both data and compute, we think that providing access to the resulting model through an API may actually lead to cost savings for end users, who no longer have to buy their own GPU servers and recruit expensive experts. Perhaps somewhat counter-intuitively, very large language models may in this way contribute to environmental sustainability (in particular if we make sure that the inference server runs on green electricity, which we will do).
New to this Medium
My aim, and that of the AI Sweden NLU team, in being on Medium is to be as open and transparent as possible about the development process, and to invite several different perspectives on this type of model: both our own and those of people who may be more critical of this development. We will cover topics ranging from details of the development process (e.g. data collection and filtering), through our reasons and considerations for leading this initiative, to more philosophical questions about the limits of large language models and the potential (or not) to reach AGI.