The University of Turku in Finland is taking steps to preserve smaller languages, including Estonian, in the post-ChatGPT era. It is developing a large language model intended to cover all official European languages, and making that model work for Estonian requires a substantial amount of digitized Estonian text. The Institute of the Estonian Language (EKI) supports the initiative but stresses that a more extensive collection of digitized Estonian texts is needed.

Eleri Aedmaa, a natural language processing engineer at the Institute of the Estonian Language, emphasized that abundant text is pivotal to training language models. She pointed to the revolutionary impact of ChatGPT in English and stressed that achieving something similar for Estonian requires an extensive repository of Estonian text.

The University of Turku, in collaboration with language technology company SiloGen, is working to create the world's largest open language model. The model aims to encompass all official European languages and is hosted on LUMI, the pan-European supercomputer located in Kajaani, Finland. According to Aedmaa, training the model effectively in Estonian requires a rich collection of original digital Estonian texts.

Aedmaa also discussed a limitation of existing large language models, which predominantly "think" in English and translate into other languages, including Estonian. She said this dependency on English poses a potential threat to the Estonian language and emphasized the importance of the linguistic and cultural knowledge embedded in language models.

In response to the prevalent English-centric training of language models, the Finnish project focuses on developing a GPT-like model trained from the ground up on a diverse range of languages. Its objective is to ensure European linguistic sovereignty and to democratize language technology by creating an open-source language model with transparent logic that others can use to develop language technology applications.

Business Finland and the EU Horizon program are supporting the Finnish project. However, one challenge remains: too little Estonian content is available to train a large language model effectively.

Kadri Vare, head of the EKI's language and speech technology department, said it is necessary to collaborate with the Finnish initiative and to contribute further by digitizing and sharing data. The EKI is compiling a substantial Estonian language corpus to address the shortage of Estonian content for large language models, with the aim of contributing significantly to the preservation of the Estonian language and culture.