The University of Turku in Finland is leading an initiative to develop a multilingual language model fluent in all European languages, including Estonian. The project responds to the rise of ChatGPT-like models and aims to safeguard smaller languages in the post-ChatGPT era. While the Institute of the Estonian Language (EKI) supports the endeavor, it stresses that far more Estonian text must be digitized for language models to work well in Estonian.

Eleri Aedmaa, a natural language processing engineer at the Institute of the Estonian Language, emphasized the pivotal role of text quantity in training language models. She underscored the necessity of digitizing diverse Estonian texts, including historical archives and online communication, to secure the language's future.

The initiative, led by the University of Turku and the language technology company SiloGen, seeks to create the world's largest open language model, covering all official European languages. The project draws on the computational power of the LUMI supercomputer in Kajaani, Finland; for Estonian, the remaining obstacle is the scarcity of digital texts needed to train language models effectively.

Aedmaa pointed out a significant drawback of existing large language models, noting their predominantly English-centric training. While these models may comprehend Estonian, they primarily operate through translation, posing a long-term risk to the Estonian language's integrity and cultural nuances.

Kadri Vare, head of the EKI's language and speech technology department, stressed the need for collaborative efforts to augment Estonian language resources. She highlighted ongoing initiatives to digitize and disseminate Estonian language data, emphasizing its critical role in preserving the language's richness and identity.

As the EKI continues to compile a comprehensive Estonian language corpus, Vare underscored the importance of public participation in language preservation efforts. Taking part in large-scale language model projects, she believes, is instrumental in safeguarding Estonia's linguistic heritage for future generations.

More: https://news.err.ee/1609120697/finland-s-chatgpt-equivalent-begins-to-think-in-estonian-as-well