<?xml version="1.0" encoding="UTF-8" ?><!-- generator=Zoho Sites --><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><atom:link href="https://www.nownextlater.ai/Insights/tag/translation/feed" rel="self" type="application/rss+xml"/><title>Now Next Later AI - Blog #Translation</title><description>Now Next Later AI - Blog #Translation</description><link>https://www.nownextlater.ai/Insights/tag/translation</link><lastBuildDate>Wed, 26 Nov 2025 21:23:59 +1100</lastBuildDate><generator>http://zoho.com/sites/</generator><item><title><![CDATA[No Language Left Behind: The Future of Human-Centered Machine Translation]]></title><link>https://www.nownextlater.ai/Insights/post/no-language-left-behind-the-future-of-human-centered-machine-translation</link><description><![CDATA[While machine translation has improved dramatically for major languages, quality translation still lags for thousands of smaller languages.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_ALAXDqC_QR-T1RkNxbnnkQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_7Wn1QaltR3S2LWxVZlhqHQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_ws-tX1SeQL-ffS7jcmtWyA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_oavVHq8mRomFgifCnxLQRw" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_oavVHq8mRomFgifCnxLQRw"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p style="font-size:16px;font-weight:400;text-indent:0px;"><strong style="font-weight:600;"></strong>In our 
increasingly interconnected world, language remains a persistent barrier. While machine translation has improved dramatically for major languages like English, Spanish, and Mandarin, quality translation still lags for thousands of smaller languages.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">A new effort called No Language Left Behind (NLLB) aims to close this gap by developing high-quality translation systems covering over 200 languages. The project employs a human-centered approach to translation, collaborating with speakers of diverse languages to ensure cultural relevance. This responds directly to the needs illuminated in interviews with translators and language communities, who found existing systems unreliable.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">Creating useful systems requires quality training data. But most languages lack large parallel corpora needed for machine translation. NLLB’s ingenious solution mines the web to extract potential translations from multilingual websites. Using specialized cross-lingual sentence embedding models, they align sentences conveying similar meanings in different languages with high accuracy. This novel mining process enabled building datasets for low-resource languages at unprecedented scale.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">For evaluation, NLLB created the Flores benchmark spanning 200 languages. Careful protocols ensure accurate human translation and alignment. Flores helps quantify progress, with higher scores indicating better translation quality. NLLB models achieve state-of-the-art results on Flores, with many languages nearing usability. 
The project also conducted human studies confirming considerable gains over past systems.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">Another key innovation was tailoring the training process to prevent overfitting on high-resource languages. Curriculum learning introduces low-resource languages gradually to focus model capacity. A mixture-of-experts architecture further reduced interference between languages. Together these advances allowed a single model to translate among all 200+ languages while maintaining specialization.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">The NLLB project is committed to ethics and inclusion. Their partnership with Wikipedia is enabling content translation for many smaller Wikipedias. With potential harms like cultural erasure in mind, the project also interviewed speakers from vulnerable language communities to guide decision making. Framing translation as a human right, the project aims to enhance global communication.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">NLLB's ambitions embody AI's expanding potential to connect us. Machine translation was long considered nearly hopeless for low-resource languages. But the project upends these assumptions by showing that quality translation is possible for all tongues when equity becomes a design principle.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">NLLB believes openness and collaboration will be central to fulfilling this vision. Models and datasets are fully public to catalyze innovation. The project also plans grants, support for adapting their tools, and partnerships with researchers invested in specific languages. 
Only through such collective effort can we ensure no language gets left out.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">NLLB faces challenges in improving under-served language pairs, capturing informal dialects, and extending evaluation beyond text. But these are solvable given dedicated investment and commitment. Progress shows what's possible when innovation grounds itself in lived experience.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">Past AI has often encoded the priorities of English-speaking elites. NLLB points toward an alternative future, designed for international users. When technology accommodates our differences rather than erasing them, new opportunities for connection emerge.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">No Language Left Behind offers a roadmap for inclusive AI. Its translation advancements - enabled by novel datasets, training methods, and modeling architectures - carry lessons for ensuring equity more widely. Expanding access requires investigating deficiencies and confronting barriers directly.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">The project's mission resonates far beyond translation, to all intelligent systems. For as long as exclusion persists, the work remains unfinished. 
These efforts bring us closer to AI that knows no language boundaries, only the poetry of voices united across divides.</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;">source:</p><p style="font-size:16px;font-weight:400;text-indent:0px;"><a href="https://arxiv.org/pdf/2207.04672.pdf" title="arxiv" rel="">arxiv</a><br></p><p style="font-size:16px;font-weight:400;text-indent:0px;"></p></div>
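For readers who want a feel for how the web-mining step works, here is a minimal sketch of embedding-based bitext mining. It is an illustration of the general technique, not NLLB's actual pipeline: the toy one-hot vectors stand in for real cross-lingual sentence embeddings, and the margin-based neighbour scoring is a common heuristic for suppressing false matches.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def mine_pairs(src_emb, tgt_emb, k=2, threshold=1.0):
    """Margin-based mining: score each (source, target) pair by its
    cosine similarity divided by the average similarity to its k
    nearest neighbours. This down-weights 'hub' sentences that look
    similar to everything; a source sentence is paired with its best
    target only when the margin clears the threshold."""
    sim = cosine_sim(src_emb, tgt_emb)
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    margin = sim / ((src_knn + tgt_knn) / 2)
    return [(i, int(sim[i].argmax())) for i in range(sim.shape[0])
            if margin[i, sim[i].argmax()] > threshold]

# Toy embeddings: source sentence 0 translates target sentence 1, etc.
src = np.eye(3)
tgt = np.eye(3)[[1, 0, 2]]
pairs = mine_pairs(src, tgt)
```

Real systems run this at web scale with approximate nearest-neighbour search rather than a dense similarity matrix, but the scoring idea is the same.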
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:04:29 +1000</pubDate></item><item><title><![CDATA[Taming the Tower of Babel - Making AI Translate Many Languages]]></title><link>https://www.nownextlater.ai/Insights/post/taming-the-tower-of-babel-making-ai-translate-many-languages</link><description><![CDATA[Machine translation has improved immensely thanks to AI, but handling multiple languages remains tricky. Researchers from Tel Aviv University and Meta studied this challenge. Through systematic experiments, they uncovered what really causes the most interference.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_reUMZ3q-Rbe5xFjCr7z9QA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_7zoS_ESzSPWiPODR08jRyQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_KdaN5wPRR3GsDetE1cxFxg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_hPA5GJeuQV2pmtjXUjbcKQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_hPA5GJeuQV2pmtjXUjbcKQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p><span style="color:inherit;">Machine translation has improved immensely thanks to AI, but handling multiple languages remains tricky. When you train one model to translate between English, Spanish, French and more, the languages can “interfere” with each other.</span></p><p><span style="color:inherit;"><br></span></p><p><span style="color:inherit;">Researchers from Tel Aviv University and Meta studied this challenge. 
Through systematic experiments, they uncovered what really causes the most interference.</span></p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">What Causes the Problems?</span></p><p><br></p><p>The researchers found that two things cause most of the problems:</p><ol><li>Small models struggle with diverse data. When you add more languages, they get confused.</li><li>Low-data languages don't get enough examples. Spanish has tons of text to learn from. But Swahili does not.</li></ol><p><br></p><p>Other factors like language similarity mattered much less. Simply adding more languages mattered less than expected, too.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">How to Make Translation Better</span></p><p><br></p><p>The team found two simple solutions:</p><ol><li>Use bigger models. Large AI models handle diverse languages better. The extra capacity reduces interference between languages.</li><li>Balance data proportions. Ensure low-data languages get sampled enough during training. Tuning these sampling proportions helped low-data languages improve.</li></ol><p>Other papers have proposed specialized algorithms to reduce this interference. But this study showed that the basic remedies of scaling up and balancing data work well.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">The Lesson for Business</span></p><p><br></p><p>This teaches an important lesson about AI. Advances often come from more computing power and data, not just clever new ideas. Getting the basics right matters most.</p><p><br></p><p>For business leaders, it shows the value of dedicating resources to train large AI models. Bigger models accommodate diverse data better and reduce the need for exotic algorithms that may not work that well.</p><p><br></p><p>Investing in computing enables handling diverse data well. Keeping solutions simple is usually the best path to success in AI. 
<span style="color:inherit;">Scaling up does have downsides - it costs more and has environmental impacts. </span></p></div><p><br></p><p>Sources:</p><p><a href="https://arxiv.org/abs/2212.07530" title="arxiv" rel="">arxiv</a><br></p><p></p></div><p></p></div>
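The "balance data proportions" fix above is commonly implemented as temperature-based sampling. The sketch below illustrates that generic idea rather than the paper's exact recipe, and the corpus sizes are made up.

```python
import numpy as np

def sampling_probs(sizes, temperature=1.0):
    """Temperature-based data sampling: normalise corpus sizes to
    proportions, raise them to the power 1/T, and renormalise.
    T=1 reproduces the raw proportions; larger T flattens the
    distribution so low-data languages are sampled more often."""
    share = np.asarray(sizes, dtype=float)
    share = share / share.sum()
    scaled = share ** (1.0 / temperature)
    return scaled / scaled.sum()

# Made-up sentence counts: Spanish with 100x the data of Swahili.
sizes = [100_000_000, 1_000_000]
raw = sampling_probs(sizes, temperature=1.0)   # Swahili ~1% of batches
flat = sampling_probs(sizes, temperature=5.0)  # Swahili sampled far more
```

The temperature is the hyperparameter to tune: too low starves low-data languages, too high drowns the model in repeated low-data examples.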
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:00:44 +1000</pubDate></item><item><title><![CDATA[Lost in Translation: When Context Matters for AI]]></title><link>https://www.nownextlater.ai/Insights/post/lost-in-translation-when-context-matters-for-ai</link><description><![CDATA[Researchers at Carnegie Mellon University and University of Lisbon systematically studied when context is needed for high-quality translation across 14 languages.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm__V34qgNpQTioLhZ-fnvtZQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_b2hcOGJoQg6A2N6Yf0Mp_g" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_HoSWrQVTRDyH2EhLNpZbHQ" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_zAgC_YT8n3k--dCY0u8Aww" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_zAgC_YT8n3k--dCY0u8Aww"] .zpimage-container figure img { width: 1090px ; height: 472.79px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_zAgC_YT8n3k--dCY0u8Aww"] .zpimage-container figure img { width:723px ; height:313.60px ; } } @media (max-width: 767px) { [data-element-id="elm_zAgC_YT8n3k--dCY0u8Aww"] .zpimage-container figure img { width:415px ; height:180.01px ; } } [data-element-id="elm_zAgC_YT8n3k--dCY0u8Aww"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit 
zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-08%20at%206.43.46%20pm.png" width="415" height="180.01" loading="lazy" size="fit" alt="Examples of high P-CXMI tokens and corresponding linguistic phenomena." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_dFEu9et1RIOb9GhP9IH05g" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_dFEu9et1RIOb9GhP9IH05g"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Recent advances in generative artificial intelligence have led to great improvements in machine translation, with AI systems able to translate text between languages with impressive quality. However, new research reveals these systems still struggle with ambiguity and discourse phenomena that require broader context beyond individual sentences.</p><p><br></p><p>Researchers at Carnegie Mellon University and the University of Lisbon systematically studied when context is needed for high-quality translation across 14 languages. By analyzing AI model predictions, they identified common ambiguity patterns including:</p><ul style="margin-left:40px;"><li>Lexical cohesion - Using consistent words to refer to the same entity across a document.</li><li>Formality - Properly translating formal vs informal pronouns based on social relationships.</li><li>Pronouns - Identifying the correct gendered pronoun when translating.</li><li>Verb forms - Choosing the right verb tense and conjugation to match the tone and flow.</li><li>Ellipsis - Filling in omitted verbs and subjects that require inferred meaning.</li></ul><p><br></p><p>The study found that even state-of-the-art commercial systems struggle with these discourse phenomena compared to translating simpler, unambiguous sentences. The AI models showed only marginal improvements when given surrounding context, suggesting current techniques make poor use of the extra information.</p><p><br></p><p>For business leaders employing machine translation, these findings reveal a gap between today's AI capabilities and the nuance of human language. While these systems are fine for simple uses, ambiguity and discourse still pose a major challenge. 
For mission-critical applications, companies should be aware of potential errors with discourse and not fully trust black-box AI systems.</p><p><br></p><p>Advancing natural language AI to properly incorporate discourse and context remains an open research problem. As models ingest more diverse training data and learn complex reasoning, the goal is to reach human-level mastery of pragmatics. Moving forward, evaluation benchmarks that test contextual understanding will be crucial to drive progress.</p><p><br></p><p>The road to fully fluent AI is still long, but identifying specific weaknesses around ambiguity lays the groundwork for breakthroughs. With focused efforts to handle discourse, translation models that know what truly matters in the broader context may soon get the gist across languages.</p><p><br></p><p>Source:</p><p><a href="https://arxiv.org/pdf/2109.07446.pdf" title="arxiv" rel="">arxiv</a><br></p><p></p></div><p></p></div>
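The figure above mentions P-CXMI, the per-token measure the researchers use to find where context matters: the gap between the log-probability a context-aware translation model assigns a target token and the log-probability a sentence-level model assigns it. Here is a toy sketch of that computation; the probabilities are invented for illustration, since real scores come from two trained models.

```python
import math

def p_cxmi(logp_with_context, logp_without_context):
    """P-CXMI for one target token: positive when document context
    makes the token more predictable, flagging context-dependent
    phenomena such as pronouns, formality, and ellipsis."""
    return logp_with_context - logp_without_context

# Invented log-probabilities (real ones come from model outputs):
tokens = ["Elle", "est", "partie"]  # "She has left" in French
logp_ctx = [math.log(0.9), math.log(0.8), math.log(0.7)]
logp_noctx = [math.log(0.3), math.log(0.8), math.log(0.7)]
scores = {t: p_cxmi(a, b)
          for t, a, b in zip(tokens, logp_ctx, logp_noctx)}
# The gendered pronoun scores highest: without context the model
# cannot tell whether "Elle" or "Il" is the correct translation.
```

In the study, tokens with high P-CXMI clustered around exactly the phenomena listed above, which is how the ambiguity patterns were identified.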
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 07:58:11 +1000</pubDate></item></channel></rss>