Developing Language Models for the Global South
The Problem: Communities with the greatest linguistic diversity often face severe infrastructure constraints.
[ figure loads here ]
The field has several names for this: the low-resource double bind (Ahia et al., 2021), the square-one bias (Ruder et al., 2022), Zeno's paradox of language technology (Nigatu et al., 2024), among others.
The Challenge: How can we develop language models that are both multilingual and deployable on-device?
Our Approach: To understand the state of the art and the challenges of combining multilinguality with on-device deployment, we survey 232 papers that tackle this problem across the language modelling pipeline.
[ figure loads here ]
[ figure loads here ]
Deploying on the edge and supporting multilinguality often impose competing requirements, which create challenges across the language modelling pipeline. Click on each pipeline stage (or requirement) to read about the challenges and the state of the art.
[ pipeline diagram loads here ]
We also looked into edge LM systems, which we define as completed efforts that have been integrated into real-world applications. To identify them, we manually classified each of the 232 papers by whether an actual model deployment took place, obtaining 36 systems in the process.
To examine how edge LM systems are made, we situate the 36 deployment papers within the broader set of 232 surveyed papers. We embed each abstract with MiniLM, reduce the embeddings to 2D with UMAP, and cluster them with HDBSCAN; KeyBERT extracts the top keywords per cluster. Hover over any cluster to see representative papers.
[ chart loads here ]
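The clustering steps described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the checkpoint name `all-MiniLM-L6-v2`, the `min_cluster_size` and `top_n` values, the choice to cluster on the 2D UMAP coordinates, and the helper names `group_by_cluster` and `cluster_abstracts` are all assumptions. It relies on the `sentence-transformers`, `umap-learn`, `hdbscan`, and `keybert` packages.

```python
from collections import defaultdict


def group_by_cluster(labels):
    """Map each cluster label to the indices of its papers.

    HDBSCAN marks noise points with the label -1; those are skipped.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            clusters[int(label)].append(idx)
    return dict(clusters)


def cluster_abstracts(abstracts, model_name="all-MiniLM-L6-v2",
                      min_cluster_size=5, top_n=5):
    """Embed, reduce to 2D, cluster, and extract keywords per cluster.

    All default values here are assumptions, not settings from the paper.
    """
    # Third-party imports are deferred so group_by_cluster works without them.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan
    from keybert import KeyBERT

    encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode(abstracts)

    # Reduce to 2D, then cluster on the reduced coordinates (assumed order).
    coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(coords)

    # Extract keywords from the concatenated abstracts of each cluster.
    kw_model = KeyBERT(model=encoder)
    keywords = {}
    for label, indices in group_by_cluster(labels).items():
        text = " ".join(abstracts[i] for i in indices)
        keywords[label] = [kw for kw, _ in kw_model.extract_keywords(text, top_n=top_n)]
    return coords, labels, keywords
```

Calling `cluster_abstracts(list_of_abstracts)` would return the 2D coordinates for plotting, one cluster label per abstract, and a keyword list per cluster. Clustering on the full embeddings instead of the 2D projection is an equally plausible reading of the description.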
We classified the affiliations of authors across the 36 deployment papers into four sectors: Academia (universities and affiliated research institutions), Industry (startups to enterprise), Research collective (non-profit research organizations), and Government (state-affiliated institutes, public sector). Authors with multiple affiliations are counted in each. Cross-sector collaborations are measured by how often each pair of sectors co-occurs within the same paper.
[ chart loads here ]
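The counting scheme described above can be illustrated with a short sketch. The input layout (each paper as a list of authors, each author as a set of sector names) and the function name `sector_counts` are hypothetical, and counting each sector pair at most once per paper is an assumption about how co-occurrence is measured.

```python
from collections import Counter
from itertools import combinations


def sector_counts(papers):
    """Count authors per sector and sector pairs per paper.

    papers: list of papers; each paper is a list of authors;
            each author is a set of sector names.
    """
    author_counts = Counter()
    pair_counts = Counter()
    for authors in papers:
        sectors_in_paper = set()
        for sectors in authors:
            # An author with multiple affiliations is counted in each sector.
            author_counts.update(sectors)
            sectors_in_paper |= set(sectors)
        # Each unordered pair of sectors is counted once per paper (assumed).
        pair_counts.update(combinations(sorted(sectors_in_paper), 2))
    return author_counts, pair_counts


# Toy input for illustration only, not data from the survey.
toy_papers = [
    [{"Academia"}, {"Academia", "Industry"}],
    [{"Industry"}, {"Government"}],
]
author_counts, pair_counts = sector_counts(toy_papers)
```

Sorting the sector names before pairing ensures that (Academia, Industry) and (Industry, Academia) land in the same bucket.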
To map the domains in which edge LMs are deployed, we perform a round of classification, tagging each paper according to its domain: Agriculture, Climate, Finance, Healthcare, Legal, Social, and Speech. Then, we extract mentions of different methods by keyword matching via KeyBERT, and visualize the domain-method connections as a network graph. Click on any outer domain node to see representative papers for that domain.
[ chart loads here ]
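As a rough illustration of how domain-method connections could be turned into weighted graph edges, the sketch below substitutes plain case-insensitive substring matching for the KeyBERT-based matching described above. The function name `domain_method_edges`, the method list, and the toy texts are hypothetical.

```python
from collections import Counter


def domain_method_edges(papers, methods):
    """Count how many papers in each domain mention each method.

    papers:  list of (domain, text) pairs.
    methods: list of method keywords (a hypothetical vocabulary).
    Returns a Counter keyed by (domain, method); the counts can serve
    as edge weights in a domain-method network graph.
    """
    edges = Counter()
    for domain, text in papers:
        lowered = text.lower()
        for method in methods:
            # Simple substring match, standing in for KeyBERT keywords.
            if method.lower() in lowered:
                edges[(domain, method)] += 1
    return edges


# Toy input for illustration only, not data from the survey.
toy_papers = [
    ("Healthcare", "We use quantization and knowledge distillation."),
    ("Agriculture", "Quantization keeps the model small."),
]
edges = domain_method_edges(toy_papers, ["quantization", "distillation"])
```

Each paper contributes at most one count per method, so an edge weight reflects the number of papers rather than the number of mentions.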
@misc{miranda2026multilingualityedgedevelopinglanguage,
title={{M}ultilinguality at the {E}dge: {D}eveloping {L}anguage {M}odels for the {G}lobal {S}outh},
author={Lester James Validad Miranda and Songbo Hu and Roi Reichart and Anna Korhonen},
year={2026},
eprint={2604.21637},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.21637},
}