
Author: Saurabh Garg


Challenges in Building Multilingual Datasets for Generative AI

When we talk about the progress of generative AI, the conversation often circles back to the same foundation: data. Large language models, image generators, and conversational systems all learn from the patterns they find in the text and speech we produce. The breadth and quality of that data decide how well these systems understand human expression across cultures and contexts. But there’s a catch: most of what we call “global data” isn’t very global at all.

Despite the rapid growth of AI datasets, English continues to dominate the landscape. A handful of other major languages follow closely behind, while thousands of others remain sidelined or absent altogether. It’s not that these languages lack speakers or stories. Many simply lack the digital presence or standardized formats that make them easy to collect and train on. The result is an uneven playing field where AI performs fluently in one language but stumbles when faced with another.

Building multilingual datasets for generative AI is far from straightforward. It involves a mix of technical, linguistic, and ethical challenges that rarely align neatly. Gathering enough data for one language can take years of collaboration, while maintaining consistency across dozens of languages can feel nearly impossible. And yet, this effort is essential if we want AI systems that truly reflect the diversity of global communication.

In this blog, we will explore the major challenges involved in creating multilingual datasets for generative AI. We will look at why data imbalance persists, what makes multilingual annotation so complex, how governance and infrastructure affect data accessibility, and what strategies are emerging to address these gaps.

The Importance of Multilingual Data in Generative AI

Generative AI might appear to understand the world, but in reality, it only understands what it has been taught. The boundaries of that understanding are drawn by the data it consumes. When most of this data exists in a few dominant languages, it quietly narrows the scope of what AI can represent. A model trained mostly in English will likely perform well in global markets that use English, yet falter when faced with languages rich in context, idioms, or scripts it has rarely seen.

For AI to serve a truly global audience, multilingual capability is not optional; it’s foundational. Multilingual models allow people to engage with technology in the language they think, dream, and argue in. That kind of accessibility changes how students learn, how companies communicate, and how public institutions deliver information. Without it, AI risks reinforcing existing inequalities rather than bridging them.

The effect of language diversity on model performance is more intricate than it first appears. Expanding a model’s linguistic range isn’t just about adding more words or translations; it’s about capturing how meaning shifts across cultures. Instruction tuning, semantic understanding, and even humor all depend on these subtle differences. A sentence in Italian might carry a tone or rhythm that doesn’t exist in English, and a literal translation can strip it of intent. Models trained with diverse linguistic data are better equipped to preserve that nuance and, in turn, generate responses that feel accurate and natural to native speakers.

The social and economic implications are also significant. Multilingual AI systems can support local entrepreneurship, enable small businesses to serve broader markets, and make public content accessible to communities that were previously excluded from digital participation. In education, they can make learning materials available in native languages, improving comprehension and retention. In customer service, they can bridge cultural gaps by responding naturally to regional language variations.

Many languages remain underrepresented, not because they lack value, but because the effort to digitize, annotate, and maintain their data has been slow or fragmented. Until multilingual data becomes as much a priority as algorithmic performance, AI will continue to be fluent in only part of the human story.

Key Challenges in Building Multilingual Datasets

Creating multilingual datasets for generative AI may sound like a matter of collecting enough text, translating it, and feeding it into a model. In practice, each of those steps hides layers of difficulty. The problems aren’t only technical; they’re linguistic, cultural, and even political. Below are some of the most pressing challenges shaping how these datasets are built and why progress still feels uneven.

Data Availability and Language Imbalance

The most obvious obstacle is the uneven distribution of digital language content. High-resource languages like English, Spanish, and French dominate the internet, which makes their data easy to find and use. But for languages spoken by smaller or regionally concentrated populations, digital traces are thin or fragmented. Some languages exist mostly in oral form, with limited standardized spelling or writing systems. Others have digital content trapped in scanned documents, PDFs, or community platforms that aren’t easily scraped.

Even when data exists, it often lacks metadata or structure, making it difficult to integrate into large-scale datasets. This imbalance perpetuates itself; AI tools trained on major languages become more useful, drawing in more users, while underrepresented languages fall further behind in digital representation.

Data Quality, Cleaning, and Deduplication

Raw multilingual data rarely comes clean. It’s often riddled with spam, repeated content, or automatically translated text of questionable accuracy. Identifying which lines belong to which language, filtering offensive material, and avoiding duplication are recurring problems that drain both time and computing power.

The cleaning process may appear purely technical, but it requires contextual judgment. A word that’s harmless in one dialect might be offensive in another. Deduplication, too, is tricky when scripts share similar structures or transliteration conventions. Maintaining semantic integrity across alphabets, diacritics, and non-Latin characters demands a deep awareness of linguistic nuance that algorithms still struggle to match.
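To make that concrete, here is a minimal sketch of one cleaning pass, assuming the langdetect library for language identification (fastText and CLD3 are common alternatives) and hash-based deduplication after Unicode normalization; the Swahili sample lines are purely illustrative.

```python
import hashlib
import unicodedata

from langdetect import detect  # assumed dependency: pip install langdetect


def normalize(text: str) -> str:
    """Normalize Unicode and whitespace so trivial variants hash identically."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split()).lower()


def clean_corpus(lines, expected_lang="sw", min_chars=20):
    """Keep lines that are long enough, detected as the expected language,
    and not duplicates of something already kept."""
    seen_hashes = set()
    kept = []
    for line in lines:
        norm = normalize(line)
        if len(norm) < min_chars:
            continue  # too short to identify or train on reliably
        try:
            if detect(norm) != expected_lang:
                continue  # filter out lines detected as other languages
        except Exception:
            continue  # detector could not decide; drop rather than guess
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        kept.append(line)
    return kept


sample = [
    "Habari za asubuhi, karibu kwenye mkutano wetu wa leo.",
    "Habari za asubuhi,  karibu kwenye mkutano wetu wa leo.",  # duplicate
    "Good morning and welcome to today's meeting.",            # wrong language
]
print(clean_corpus(sample, expected_lang="sw"))
```

Even this toy version shows where the hard judgment calls live: the language detector is probabilistic, the dedup key only catches near-exact repeats, and nothing here decides whether a flagged line is offensive or merely dialectal.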

Annotation and Translation Complexity

Annotation is where human expertise becomes indispensable and expensive. Labeling data across multiple languages requires trained linguists who understand local syntax, idioms, and cultural cues. For many lesser-known languages, there are simply not enough qualified annotators to meet the growing demand.

Machine translation can fill some gaps, but not without trade-offs. Automated translations may capture literal meaning while missing tone, irony, or context. This becomes particularly problematic when curating conversational or instruction datasets, where intent matters as much as accuracy. Balancing cost and precision often forces teams to make uncomfortable compromises.

Bias, Representation, and Fairness

Language datasets are mirrors of the societies they come from. When those mirrors are distorted, for example by overrepresenting urban dialects or Western perspectives, the models trained on them inherit those distortions. In multilingual contexts, the risks multiply. Bias can appear not only in what’s said but in which languages or dialects are deemed “worthy” of inclusion.

There’s also the subtler problem of evaluation bias. A model might perform well in benchmark tests because those benchmarks themselves favor certain language families. Without balanced datasets and culturally aware evaluation metrics, claims of fairness can be misleading.

Legal, Ethical, and Governance Barriers

Collecting multilingual data across borders is complicated by differing privacy laws and ethical standards. Regulations like the GDPR have pushed data teams to think harder about consent, data ownership, and personal information embedded in public text. While these rules are crucial for accountability, they can also slow down open data collaboration.

Beyond legality, there’s the question of cultural consent. Some communities may object to their languages or stories being used for AI training at all, particularly when it’s done without clear benefit-sharing or acknowledgment. Governance frameworks are evolving, but there’s still no universal standard for what ethical multilingual data collection should look like.

Infrastructure and Resource Limitations

Finally, even when the data exists, managing it efficiently is another challenge altogether. Multilingual datasets can easily reach petabyte scale, demanding sophisticated infrastructure for storage, indexing, and version control. Ensuring that updates, corrections, and metadata remain consistent across hundreds of languages becomes a logistical maze.

Smaller organizations or research groups often find the computational cost prohibitive. Evaluating models across multiple languages adds another layer of expense and complexity. The result is a concentration of multilingual AI development among large institutions, leaving smaller players dependent on whatever datasets are publicly available.

Emerging Strategies in Building Multilingual Datasets for Generative AI

Despite the long list of challenges, there’s a quiet shift underway. Data scientists, linguists, and AI developers are rethinking how multilingual data should be sourced, cleaned, and shared. What used to be a niche problem in computational linguistics is now a central discussion in global AI development. While progress is uneven, a few emerging strategies are showing promise in how we approach the creation of inclusive and culturally aware datasets.

Community-Driven and Participatory Data Collection

One of the most meaningful changes is the move toward community participation. Instead of treating speakers of underrepresented languages as data sources, many initiatives now view them as collaborators. Native speakers contribute translations, validate content, and shape guidelines that reflect how their language is actually used.

This approach may sound slower, but it builds legitimacy and trust. When communities see direct benefits, like educational tools or localized AI applications, they are more willing to contribute. Community-led annotation also captures dialectal richness that large-scale scraping simply misses. It’s a more human, sustainable model that aligns technology development with local ownership.

Synthetic and Augmented Data Generation

Synthetic data is becoming an important tool for filling linguistic gaps, particularly where natural data is scarce. Techniques like back-translation, paraphrasing, or controlled text generation can multiply existing datasets while preserving diversity. For instance, a small corpus in a low-resource language can be expanded by automatically generating equivalent paraphrases or contextually similar sentences.
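As a rough illustration of back-translation, the sketch below round-trips sentences through a pivot language with two OPUS-MT models from the Hugging Face transformers library; the English–French pair is an assumption, and the right models depend on the languages being augmented.

```python
from transformers import pipeline  # assumed dependency: pip install transformers

# Assumed model names from the Helsinki-NLP OPUS-MT family; swap in the
# pair that matches the source and pivot languages you are working with.
to_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
from_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")


def back_translate(sentences):
    """Round-trip each sentence through the pivot language to create paraphrases."""
    pivoted = [out["translation_text"] for out in to_pivot(sentences)]
    restored = [out["translation_text"] for out in from_pivot(pivoted)]
    # Keep only paraphrases that actually differ from the source, so the
    # augmented corpus adds variety rather than duplicates.
    return [p for s, p in zip(sentences, restored) if p.strip() != s.strip()]


print(back_translate(["The clinic opens at eight and closes before sunset."]))
```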

Still, synthetic data comes with its own caution. It can amplify translation errors or introduce artificial patterns that distort real-world usage. The challenge is not to replace human-generated content but to blend both carefully, using synthetic augmentation as a scaffold, not a substitute.

Governance Frameworks and Collaborative Platforms

Data sharing across borders has always been messy, but newer governance models are starting to reduce that friction. Frameworks that define data provenance, consent, and licensing upfront can make collaboration more predictable and transparent. Federated or shared data infrastructures are also gaining traction, allowing different organizations to contribute to multilingual datasets without relinquishing full control of their data.

These frameworks don’t only solve legal problems; they also help balance power dynamics between large tech companies and smaller research groups. When standards for data ethics and accessibility are agreed upon collectively, it levels the playing field and encourages long-term cooperation.

Advances in Multilingual Evaluation and Benchmarking

Improving multilingual AI isn’t just about building bigger datasets; it’s also about measuring quality in fairer ways. Recent benchmarking practices emphasize cultural and linguistic diversity instead of relying solely on English-centric metrics. These tests aim to capture how well a model understands nuance, pragmatics, and context rather than just grammar or vocabulary.
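One practical habit this encourages is reporting results per language instead of as a single aggregate. The sketch below assumes a hypothetical list of evaluation records, each noting whether a native-speaker reviewer judged the model’s answer correct; the language codes and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation records: (language code, judged correct or not).
records = [
    ("en", True), ("en", True), ("en", False),
    ("hi", True), ("hi", False),
    ("yo", False), ("yo", True), ("yo", False),
]

totals = defaultdict(lambda: [0, 0])  # lang -> [correct, total]
for lang, correct in records:
    totals[lang][0] += int(correct)
    totals[lang][1] += 1

for lang, (correct, total) in sorted(totals.items()):
    print(f"{lang}: {correct / total:.0%} over {total} items")

# Surfacing the weakest language alongside the average keeps imbalances
# visible instead of hiding them in one aggregate score.
worst = min(totals, key=lambda l: totals[l][0] / totals[l][1])
print("lowest-scoring language:", worst)
```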

A multilingual evaluation framework may sound like an academic detail, but it’s quietly reshaping the industry. When benchmarks reward diversity and contextual accuracy, teams are more motivated to invest in better data curation across languages. Over time, this shift can move the focus from “translation accuracy” to “linguistic understanding.”

Read more: How Human Feedback in Model Training Improves Conversational AI Accuracy

How We Can Help

At Digital Divide Data (DDD), we work directly with organizations to make multilingual AI development practical, ethical, and scalable. Our teams specialize in data annotation, linguistic validation, and cultural adaptation across a wide range of languages. Whether you need conversational data labeled for regional dialects or multilingual text aligned for instruction-tuning, we bring together human expertise and process-driven quality assurance.

Our experience shows that linguistic diversity is not a barrier but an advantage, if it’s handled with the right mix of precision and empathy. We help AI teams build datasets that not only meet performance goals but also reflect the diversity of the people who use them.

Read more: Building Datasets for Large Language Model Fine-Tuning

Conclusion

Building multilingual datasets for generative AI sits at the intersection of technology, culture, and ethics. It’s a process that forces us to confront what “inclusivity” actually means in machine learning. The challenge isn’t just collecting words in different languages; it’s capturing meaning, intent, and identity in a way that models can learn from without distorting them.

What’s emerging is a realization that diversity cannot be automated. Tools and pipelines help, but the foundation still relies on human insight, on people who understand not just the grammar of a language, but the rhythm and emotion behind it. Progress may appear slow because the goal is no longer scale alone; it’s quality, fairness, and accountability.

As AI systems become more embedded in education, healthcare, and governance, the stakes grow higher. A system that fails to understand a user’s language risks more than miscommunication; it risks exclusion. Closing this gap requires collective effort from technologists, linguists, policymakers, and communities working toward the same purpose: making language technology serve everyone, not just those who speak the world’s dominant tongues.

The future of generative AI will depend on how seriously we take this challenge. The tools are advancing quickly, but the responsibility to represent all voices remains human.

Partner with Digital Divide Data to build multilingual datasets that power inclusive, ethical, and globally relevant AI.



FAQs

Why do some languages remain absent from large AI datasets even today?
Many languages have little written or digitized content online, making them difficult to collect automatically. Additionally, some communities prefer oral transmission or have privacy concerns about sharing cultural material for AI use.

Can multilingual AI ever achieve equal performance across all languages?
In theory, it’s possible, but practically unlikely. Differences in data size, cultural context, and linguistic structure mean some imbalance will always exist. The goal is to minimize these gaps, not eliminate them.

How do organizations ensure fairness when expanding language coverage?
Fairness begins with transparent data sourcing, ethical consent processes, and community collaboration. Teams should also include native speakers in quality assurance and evaluation.

What are the biggest cost drivers in multilingual dataset creation?
Human annotation, translation quality assurance, and infrastructure costs for managing massive data volumes are the primary expenses. Balancing automation with skilled human review helps control cost without sacrificing accuracy.

How can smaller organizations contribute to improving multilingual datasets?
They can participate in open data initiatives, sponsor community-driven projects, or share localized datasets under permissive licenses. Even small contributions can have a meaningful impact on language inclusivity in AI.



How Optical Character Recognition (OCR) Digitization Enables Accessibility for Records and Archives

Over the past decade, governments, universities, and cultural organizations have been racing to digitize their holdings. Scanners hum in climate-controlled rooms, and terabytes of images fill digital repositories. But scanning alone doesn’t guarantee access. A digital image of a page is still just that, an image. You can’t search it, quote it, or feed it to assistive software. In that sense, a scanned archive can still behave like a locked cabinet, only prettier and more portable.

Millions of historical documents remain in this limbo. Handwritten parish records, aging census forms, and deteriorating legal ledgers have been captured as pictures but not transformed into living text. Their content exists in pixels rather than words. That gap between preservation and usability is where Optical Character Recognition (OCR) quietly reshapes the story.

In this blog, we will explore how OCR digitization acts as the bridge between preservation and accessibility, transforming static historical materials into searchable, readable, and inclusive digital knowledge. The focus is not just on the technology itself but on what it makes possible: the idea that archives can be truly open, not only to those with access badges and physical proximity, but to anyone with curiosity and an internet connection.

Understanding OCR in Digitization

Optical Character Recognition, or OCR, is a system that turns images of text into actual, editable text. Stated that way, it sounds simple; in practice, it’s far more intricate. When an old birth register or newspaper is scanned, the result is a high-resolution picture made of pixels, not words. OCR steps in to interpret those shapes and patterns (the slight curve of an “r,” the spacing between letters, the rhythm of printed lines) and convert them into machine-readable characters. It’s a way of teaching a computer to read what the human eye has always taken for granted.

Early OCR systems did this mechanically, matching character shapes against fixed templates. It worked reasonably well on clean, modern prints, but stumbled the moment ink bled, fonts shifted, or paper aged. The documents that fill most archives are anything but uniform: smudged pages, handwritten annotations, ornate typography, even water stains that blur whole paragraphs. Recognizing these requires more than pattern matching; it calls for context. Recent advances bring in machine learning models that “learn” from thousands of examples, improving their ability to interpret messy or inconsistent text. Some tools specialize in handwriting (Handwritten Text Recognition, or HTR), others in multilingual documents, or layouts that include tables, footnotes, and marginalia. Together, they form a toolkit that can read the irregular and the imperfect, which is what most of history looks like.
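For a sense of what this looks like in practice, here is a minimal sketch using the open-source Tesseract engine through the pytesseract wrapper; the file path and language code are illustrative, and archival work would typically add image preprocessing and language- or script-specific models.

```python
from PIL import Image   # assumed dependency: pip install pillow
import pytesseract      # assumed dependency: pip install pytesseract (plus the Tesseract binary)

# Illustrative path to a scanned page; any reasonably clean scan will do.
page = Image.open("scans/parish_register_1898_p014.png")

# Ask Tesseract for plain text. The lang codes must match installed
# traineddata files; "eng" ships by default, others are added separately.
text = pytesseract.image_to_string(page, lang="eng")
print(text[:500])
```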

But digitization is not just about making digital surrogates of paper. There’s a deeper shift from preservation to participation. When a collection becomes searchable, it changes how people interact with it. Researchers no longer need to browse page by page to find a single reference; they can query a century’s worth of data in seconds. Teachers can weave original materials into lessons without leaving their classrooms. Genealogists and community historians can trace local stories that would otherwise be lost to time. The archive moves from being a static repository to something closer to a public workspace, alive with inquiry and interpretation.

Optical Character Recognition (OCR) Digitization Pipeline

The journey from a physical document to an accessible digital text is rarely straightforward. It begins with a deceptively simple act: scanning. Archivists often spend as much time preparing documents as they do digitizing them. Fragile pages need careful handling, bindings must be loosened without damage, and light exposure has to be controlled to avoid degradation. The resulting images must meet specific standards for resolution and clarity, because even the best OCR software can’t recover text that isn’t legible in the first place. Metadata tagging happens here too, identifying the document’s origin, date, and context so it can be meaningfully organized later.

Once the images are ready, OCR processing takes over. The software identifies where text appears, separates it from images or decorative borders, and analyzes each character’s shape. For handwritten records, the task becomes more complex: the model has to infer individual handwriting styles, letter spacing, and contextual meaning. The output is a layer of text data aligned with the original image, often stored in formats like ALTO or PDF/A, which allow users to search or highlight words within the scanned page. This is the invisible bridge between image and information.
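A simplified way to see that alignment is Tesseract’s word-level output, which ties each recognized word to its bounding box and a confidence score. The sketch below uses pytesseract’s image_to_data call with an illustrative file path; a production pipeline would export a full ALTO or hOCR layer instead.

```python
from PIL import Image
import pytesseract

page = Image.open("scans/census_form_1921_p002.png")  # illustrative path

# image_to_data returns word-level results: the text, a confidence value,
# and the bounding box that ties each word back to its place on the scan.
data = pytesseract.image_to_data(page, lang="eng",
                                 output_type=pytesseract.Output.DICT)

for word, conf, x, y, w, h in zip(data["text"], data["conf"],
                                  data["left"], data["top"],
                                  data["width"], data["height"]):
    if word.strip():
        print(f"{word!r} at ({x}, {y}, {w}x{h}), confidence {conf}")
```

That confidence value is also what later quality-assurance steps use to decide which words deserve a human look.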

But raw OCR output is rarely perfect. Post-processing and quality assurance form the next critical phase. Algorithms can correct obvious spelling errors, but context matters. Is that “St.” a street or a saint? Is a long “s” from 18th-century typography being mistaken for an “f”? Automated systems make their best guesses, yet human review remains essential. Archivists, volunteers, or crowd-sourced contributors often step in to correct, verify, and enrich the data, especially for heritage materials that carry linguistic or cultural nuances.
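As a toy example of that kind of contextual judgment, the sketch below corrects the long-s confusion only when swapping “f” for “s” turns an unknown token into a known word; the tiny word list is hypothetical, and a real workflow would lean on full lexicons, period-appropriate language models, and human reviewers.

```python
# Hypothetical correction pass for a known OCR confusion in 18th-century
# print: the long s often comes through as "f" ("Congrefs" -> "Congress").
KNOWN_WORDS = {"congress", "passed", "measure", "second", "session"}


def correct_long_s(token: str) -> str:
    """Swap 'f' for 's' only when that turns an unknown token into a known word."""
    word = token.strip(".,;:")
    candidate = word.replace("f", "s")
    if word.lower() not in KNOWN_WORDS and candidate.lower() in KNOWN_WORDS:
        return token.replace(word, candidate)
    return token


line = "The Congrefs paffed the meafure in its fecond feffion."
print(" ".join(correct_long_s(w) for w in line.split()))
```

Even here, the rule only works because the word list encodes context; applied blindly, the same substitution would quietly rewrite genuine “f” words.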

Finally, the digitized text must be integrated into an archive or information system. This is where technology meets usability. The text and images are stored, indexed, and made available through search portals, APIs, or public databases. Ideally, users should not need to think about the pipeline at all; they simply find what they need. The quality of that experience depends on careful integration: how results are displayed, how metadata is structured, and how accessibility tools interact with the content. When all these elements align, a once-fragile document becomes part of a living digital ecosystem, open to anyone with curiosity and an internet connection.
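To illustrate the indexing step, here is a minimal sketch using SQLite’s FTS5 full-text index, which ships with most Python builds of SQLite; the records and fields are invented for illustration, and a real archive would layer richer metadata, access controls, and APIs on top.

```python
import sqlite3

# In-memory example; a production archive would use a persistent store,
# but FTS5 shows the core idea: index the OCR text once, then query freely.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE pages USING fts5(archive_id, title, body)")

db.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("reg-1898-014", "Parish register, 1898",
         "Baptism of Anna Maria, daughter of the miller Johann Weber."),
        ("census-1921-002", "Census return, 1921",
         "Household of four persons residing at Mill Lane, occupation weaver."),
    ],
)

# A full-text query across every digitized page.
for row in db.execute(
        "SELECT archive_id, title FROM pages WHERE pages MATCH ?", ("miller",)):
    print(row)
```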

Recommendations for Optical Character Recognition (OCR) Digitization

Working with historical materials is rarely a clean process. Ink fades unevenly, pages warp, and handwriting changes from one entry to the next. These irregularities are exactly what make archives human, but they also make them hard for machines to read. OCR systems, no matter how sophisticated, can stumble over a smudged “c” or a handwritten flourish mistaken for punctuation. The result may look accurate at first glance, but lose meaning in subtle ways; these errors ripple through databases, skew search results, and occasionally distort historical interpretation.

Adaptive Learning Models

To deal with this, modern OCR systems rely on more than static pattern recognition. They use adaptive learning models that improve as they process more data, especially when corrections are fed back into the system. In some cases, language models predict the next likely word based on context, a bit like how predictive text works on smartphones. These systems don’t truly “understand” the text, but they simulate enough contextual awareness to catch obvious mistakes. That said, there’s a fine line between intelligent correction and overcorrection; a model trained on modern language patterns may unintentionally “normalize” historical spelling or phrasing that actually holds cultural value.

Human-in-the-loop

This is where humans come in. Archivists and volunteers provide the cultural and contextual knowledge that AI still lacks. A local historian might recognize that “Ye” in an old English document isn’t a misprint but a genuine character variant. A bilingual archivist might spot linguistic borrowing that algorithms misinterpret. In that sense, the most effective OCR workflows are not purely automated but cooperative. Machines handle scale, processing thousands of pages quickly, while humans refine meaning.

AI and Human Collaboration

The collaboration between AI and people isn’t just about accuracy; it’s about accountability. Algorithms can process information faster than any team could, but only humans can decide what accuracy means in context. Whether to preserve an archaic spelling, how to treat marginal notes, and when to flag uncertainty are interpretive choices. The more transparent this relationship becomes, the more credible and inclusive the digitized archive will be. OCR, at its best, works not as a replacement for human expertise but as an amplifier of it.

Technological Innovations Shaping OCR Accessibility

The most interesting progress has come from systems that don’t just “see” text but interpret its surroundings. For instance, layout-aware OCR can distinguish between a headline, a caption, and a footnote, recognizing how the visual hierarchy of a document affects meaning. This matters more than it sounds. A poorly parsed layout can scramble sentences or strip tables of their logic, turning a digitized record into nonsense.

Domain-Specific Data

Recent OCR models also train on domain-specific data, a subtle shift that changes results dramatically. A system tuned to modern business documents may perform terribly on 18th-century legal manuscripts, where ink density, letter spacing, and orthography behave differently. By contrast, a domain-adapted model, say, one specialized for historical newspapers or handwritten correspondence, learns to expect irregularities rather than treat them as noise. The outcome is a kind of tailored reading ability that fits the document’s world rather than forcing it into modern patterns.

Context-Aware Correction

Another promising area lies in context-aware correction. Instead of applying broad language rules, new systems analyze regional or temporal variations. They recognize that “colour” and “color” are both valid, depending on context, or that an unfamiliar surname is not a typo. The idea is not to normalize but to preserve distinctiveness. When paired with handwriting models, this approach makes it easier to digitize materials that reflect cultural and linguistic diversity, a step toward archives that represent people as they were, not as algorithms think they should be.

Integrated Workflows

OCR is also becoming part of larger ecosystems. Increasingly, digitization projects combine text recognition with translation tools, transcription platforms, or semantic search engines that can identify people, places, and themes across collections. The result is a more connected landscape of archives where one record can lead to another through shared metadata or linked entities. These integrated workflows blur the boundaries between libraries, museums, and research databases, creating something closer to a network of knowledge than a set of isolated repositories.

Conclusion

Optical Character Recognition in digitization has quietly become one of the most transformative forces in the archival world. It doesn’t replace the work of preservation or the value of physical materials; rather, it extends their reach. By converting static images into searchable, readable text, OCR bridges the gap between memory and access, between what’s stored and what can be shared. It gives new life to forgotten records and makes history usable again, by scholars, by policymakers, by anyone curious enough to look.

Technology continues to evolve, but archives remain as diverse and unpredictable as the histories they hold. Each page brings new quirks, new languages, and new technical challenges. What matters most is not perfect automation but the ongoing collaboration between people and machines. Accuracy, ethics, and inclusivity are not endpoints; they are habits that must guide every decision, from scanning a page to publishing it online.

As archives become increasingly digital, the conversation shifts from what we preserve to how we allow others to experience it. OCR is part of that larger story: it turns preservation into participation. The real promise lies in accessibility that feels invisible, when anyone, anywhere, can uncover a piece of history without realizing the technical complexity that made it possible. That is the quiet success of OCR: not that it reads what we cannot, but that it helps us keep reading what we might otherwise have lost.

Read more: How Multi-Format Digitization Improves Information Accessibility

How We Can Help

At Digital Divide Data (DDD), we understand that turning physical archives into accessible digital assets requires more than just technology; it requires precision, care, and context. Many organizations begin digitization projects with enthusiasm but soon face challenges: inconsistent image quality, multilingual content, and the need for scalable quality assurance. DDD’s approach bridges these gaps by combining human expertise with advanced OCR and HTR workflows tailored for archival material.

Our teams specialize in managing high-volume digitization pipelines for government agencies, libraries, and cultural institutions. We handle everything from image preparation and text recognition to post-processing and metadata enrichment. Crucially, we focus on accessibility, not just in a regulatory sense but in the practical one: ensuring that digital records can be read, searched, and used by everyone, including those relying on assistive technologies.

By turning analog collections into digital ecosystems, we make archival heritage discoverable, inclusive, and sustainable for the long term.

Partner with Digital Divide Data to digitize your archives into searchable, inclusive digital knowledge.



FAQs

Q1. How is OCR different from simple scanning?
Scanning creates a digital image of a page, but OCR extracts the actual text content from that image. Without OCR, you can view but not search, quote, or use the text in accessibility tools. OCR makes the content functional rather than merely visible.

Q2. What kinds of documents benefit most from OCR digitization?
Printed newspapers, books, government reports, manuscripts, and archival correspondence all benefit. Essentially, any text-based record that needs to be searchable, translated, or read by assistive technology gains value through OCR.

Q3. What are the main challenges in applying OCR to historical archives?
Poor image quality, unusual fonts, fading ink, and complex layouts often lead to misreads. Handwritten materials are particularly challenging. Modern OCR solutions mitigate this with handwriting models and AI correction, but manual validation is still essential.

Q4. Can OCR handle multiple languages or scripts?
Yes, but with limitations. Modern OCR systems can be trained on multilingual data, making them capable of recognizing multiple alphabets and writing systems. However, accuracy still depends on the quality of the training data and the similarity between languages.

Q5. Does OCR improve accessibility for people with disabilities?
Absolutely. Once text is machine-readable, it can be converted to speech or braille, navigated by screen readers, and accessed via keyboard controls. OCR effectively turns static images into inclusive digital content.

