<?xml version="1.0" encoding="UTF-8" ?><!-- generator=Zoho Sites --><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><atom:link href="https://www.nownextlater.ai/Insights/gen-ai-research/feed" rel="self" type="application/rss+xml"/><title>Now Next Later AI - Blog , Gen AI Research</title><description>Now Next Later AI - Blog , Gen AI Research</description><link>https://www.nownextlater.ai/Insights/gen-ai-research</link><lastBuildDate>Wed, 26 Nov 2025 21:24:41 +1100</lastBuildDate><generator>http://zoho.com/sites/</generator><item><title><![CDATA[Language Model Tokenization Reveals Significant Disparities Across Languages: Implications for Businesses and Users]]></title><link>https://www.nownextlater.ai/Insights/post/language-model-tokenization-reveals-significant-disparities-across-languages-implications-for-busine</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/Screenshot 2024-04-29 at 12.25.09 pm.png"/>In this article, we'll dive into a recent study that uncovers substantial disparities in the tokenization process used by language models across different languages.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_C3_ooyGQRiyFng1ZLDBhOw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_k3QeQrRoTvSjKQeOk60gvg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_u8BcqDWMTWSgs5iEhH5Usg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_F6-dDNDmBOusriRjt3xREQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { 
[data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"] .zpimage-container figure img { width: 500px ; height: 564.58px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"] .zpimage-container figure img { width:500px ; height:564.58px ; } } @media (max-width: 767px) { [data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"] .zpimage-container figure img { width:500px ; height:564.58px ; } } [data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202024-04-29%20at%2012.25.09%E2%80%AFpm.png" width="500" height="564.58" loading="lazy" size="medium" alt="Premiums with respect to English on FLORES-200 for several English-centric models." data-lightbox="true"/></picture></span><figcaption class="zpimage-caption zpimage-caption-align-center"><span class="zpimage-caption-content">Premiums with respect to English on FLORES-200 for several English-centric models.</span></figcaption></figure></div>
</div><div data-element-id="elm_Ol3GZWqPS1quBc9elQjJng" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_Ol3GZWqPS1quBc9elQjJng"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-center " data-editor="true"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;">In this article, we'll dive into a recent <a href="https://arxiv.org/pdf/2305.15425" title="study" rel="">study</a> that uncovers substantial disparities in the tokenization process used by language models across different languages. These disparities have significant implications for businesses and users, affecting the cost, latency, and quality of service when using AI-powered language technologies. By understanding these issues, business leaders can make more informed decisions about the adoption and deployment of language models and advocate for the development of more equitable solutions. <br></div></div><div style="color:inherit;text-align:left;"><br><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">The Importance of Tokenization in Language Models&nbsp;</span></p><p><br></p><p>Tokenization is the process of breaking down natural language text into smaller units called tokens, which are then used as input for language models. The choice of tokenization method can significantly impact a model's performance and efficiency. Subword tokenization, which breaks down complex words into smaller parts, has become the preferred approach for state-of-the-art language models.</p><p><br></p><p>However, the study revealed that even subword tokenization methods can lead to significant disparities in the number of tokens required to represent the same content across different languages. 
This has far-reaching consequences for businesses and users relying on language models for various applications.</p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;"><br></span></p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">Tokenization Disparities Across Languages&nbsp;</span></p><p><br></p><p>The researchers analyzed the tokenization process of several popular language models, including GPT-2, RoBERTa, and the tokenizers used by ChatGPT and GPT-4. They found that the number of tokens required to represent the same text can vary drastically across languages. For example:</p><ol><li>GPT-2 requires 3 times more tokens to represent the same content in Japanese compared to English.</li><li>The ChatGPT and GPT-4 tokenizers use 1.6 times more tokens for Italian, 2.6 times more for Bulgarian, and 3 times more for Arabic compared to English.</li><li>For Shan, a language spoken in Myanmar, the difference can be as high as 15 times compared to English.</li></ol><p><br></p><p>These disparities persist even in tokenizers specifically designed for multilingual support, with some language pairs showing a 4-fold difference in the number of tokens required.</p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;"><br></span></p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">Implications for Businesses and Users&nbsp;</span></p><p><br></p><p>The tokenization disparities across languages have significant implications for businesses and users:</p><ol><li>Cost: Many commercial language model services charge users per token. As a result, users of certain languages may end up paying significantly more for the same task compared to users of English or other more efficiently tokenized languages.</li><li>Latency: The number of tokens directly impacts the processing time for a task. 
Languages with longer tokenized representations can experience twice the latency compared to English, which may be critical for real-time applications like customer support or emergency services.</li><li>Long Context Processing: Language models often have a fixed context window, limiting the amount of text they can process at once. Users of more efficiently tokenized languages can work with much longer texts compared to users of languages with higher token counts, potentially leading to significant disparities in the quality of service.</li></ol><p><br></p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">The Path Forward: Multilingual Tokenization Fairness&nbsp;</span></p><p><br></p><p>To address these disparities and ensure more equitable access to language technologies, the researchers propose the concept of multilingual tokenization fairness. They argue that tokenizers should produce similar encoded lengths for the same content across languages. This can be achieved by:</p><ol><li>Recognizing that subword tokenization is necessary to achieve parity, as character-level and byte-level representations cannot fully address the issue.</li><li>Ensuring that tokenizers support all Unicode codepoints to handle characters from all languages.</li><li>Building a multilingually fair parallel corpus for training and evaluating tokenizers, with balanced representation of topics, named entities, and diverse translations.</li><li>Developing multilingually fair tokenizers by first training individual monolingual tokenizers for each target language and then merging them while maintaining parity.</li></ol><p>By adopting these principles, language model developers can create more equitable tokenizers that provide similar levels of service across languages, benefiting businesses and users worldwide.</p><p><br></p><p>As language models become increasingly integral to our daily lives, it is crucial that we prioritize fairness and inclusivity in their design and 
deployment. By understanding the implications of tokenization disparities and taking action to address them, business leaders can play a vital role in shaping a more equitable future for AI-powered language technologies.</p></div><p></p></div>
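One driver of these disparities can be illustrated with a minimal, stdlib-only sketch (a simplified illustration for intuition, not the study's tokenizer analysis): UTF-8 encodes most Latin characters in one byte but uses two or three bytes per character for Cyrillic, Japanese, and many other scripts, so byte-level representations already start from an uneven baseline before any subword merging happens. The sample strings below are illustrative choices, not data from the paper.

```python
# Why byte-level representations alone cannot equalize sequence lengths
# across scripts: UTF-8 uses 1 byte per Latin character, but 2 bytes per
# Cyrillic character and 3 bytes per Japanese kana/kanji.

samples = {
    "English": "Hello",        # 5 characters, 1 byte each
    "Bulgarian": "Здравей",    # 7 characters, 2 bytes each (Cyrillic)
    "Japanese": "こんにちは",    # 5 characters, 3 bytes each
}

for language, text in samples.items():
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{language}: {chars} chars -> {utf8_bytes} UTF-8 bytes "
          f"({utf8_bytes / chars:.1f} bytes/char)")
```

A byte-pair tokenizer trained predominantly on English text compounds this gap further: frequent English byte sequences get merged into single tokens, while text in under-represented scripts stays fragmented into many short tokens, which is consistent with the per-language premiums the study reports.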
</div><div data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"] .zpimage-container figure img { width: 500px ; height: 500.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"] .zpimage-container figure img { width:500px ; height:500.00px ; } } @media (max-width: 767px) { [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"] .zpimage-container figure img { width:500px ; height:500.00px ; } } [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/introduction-to-large-language-models-for-business-leaders-book" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/12.png" width="500" height="500.00" loading="lazy" size="medium" alt="Introduction to LLMs for Leaders"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Mon, 29 Apr 2024 12:28:39 +1000</pubDate></item><item><title><![CDATA[AI Benchmarks: Misleading Measures of Progress Towards General Intelligence]]></title><link>https://www.nownextlater.ai/Insights/post/ai-benchmarks-misleading-measures-of-progress-towards-general-intelligence</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/william-warby-WahfNoqbYnM-unsplash.jpg"/>It is crucial for business leaders to understand the limitations and potential pitfalls of current approaches to measuring AI capabilities.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_QMKQxPeqSOuvZQ6CtSLZHA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_D6o9TG7ESGKOVcNmCn5FZQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_Yt-fFrzuRD6qgDY-psDazw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"> [data-element-id="elm_Yt-fFrzuRD6qgDY-psDazw"].zpelem-col{ border-radius:1px; } </style><div data-element-id="elm_5pR4jRceDONydu3ax9lzaQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"] .zpimage-container figure img { width: 1090px ; height: 817.50px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"] .zpimage-container figure img { width:723px ; height:542.25px ; } } @media (max-width: 767px) { [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"] .zpimage-container figure img { width:415px ; height:311.25px ; } } [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"].zpelem-image { border-radius:1px; } </style><div 
data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/william-warby-WahfNoqbYnM-unsplash.jpg" width="415" height="311.25" loading="lazy" size="fit" alt="Photo by William Warby on Unsplash" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_Huuir9jowc4M5lofYufhaA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_Huuir9jowc4M5lofYufhaA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">Artificial intelligence (AI) has made remarkable strides in recent years, with AI systems now achieving impressive performance on a variety of tasks, from image recognition to language understanding. These advancements have been largely driven by the development of powerful machine learning algorithms, coupled with the availability of vast amounts of training data and computational resources.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">However, as AI continues to progress, it is crucial for business leaders to understand the limitations and potential pitfalls of current approaches to measuring AI capabilities. A position <a href="https://arxiv.org/abs/2111.15366" title="paper" rel="">paper</a> by Raji et al. offers a compelling critique of popular AI benchmarks, arguing that they are often misleading and fail to capture meaningful progress towards general intelligence. This critique is further echoed in a recent TechCrunch <a href="https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/" title="article" rel="">article</a> by Kyle Wiggers, which highlights the disconnect between AI benchmarks and real-world applications.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:26px;color:rgb(41, 77, 135);">The Allure of &quot;General&quot; AI Benchmarks</span></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Two of the most widely cited benchmarks in AI are ImageNet, used for evaluating image recognition systems, and GLUE (General Language Understanding Evaluation), used for assessing natural language processing models. These benchmarks have taken on an outsized role in the AI community, with performance on these tasks often seen as indicative of progress towards general AI capabilities.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">The appeal of these benchmarks is understandable. They offer a standardized way to compare different AI systems and track improvements over time. Moreover, the tasks they encompass, such as identifying objects in images or understanding the meaning of sentences, seem to capture essential aspects of intelligence that humans excel at.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">However, as Raji et al. point out, these benchmarks are far from perfect measures of general intelligence. In fact, they argue, the focus on achieving state-of-the-art performance on these narrow tasks has distorted the priorities of the AI research community and led to an overemphasis on benchmark-chasing at the expense of more meaningful progress.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="color:rgb(41, 77, 135);font-size:26px;font-family:&quot;Oswald&quot;, sans-serif;">The Limitations of Current Benchmarks</span></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">One of the key criticisms leveled by Raji et al. is that the tasks included in popular AI benchmarks are often arbitrary and not systematically chosen to represent general capabilities. They compare this to a fictional children's story about a museum claiming to contain &quot;everything in the whole wide world,&quot; but which actually just contains a haphazard collection of random objects.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Similarly, the authors argue, benchmarks like ImageNet and GLUE are composed of a relatively narrow and idiosyncratic set of tasks that hardly capture the full range of intelligent behaviors. Impressive performance on these tasks is often taken as evidence of general intelligence, when in reality it may simply reflect a system's ability to exploit specific patterns or statistical regularities present in the training data.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">The TechCrunch article by Wiggers further underscores this point, noting that many of the most commonly used benchmarks for chatbot-powering AI models, such as GPQA (&quot;A Graduate-Level Google-Proof Q&amp;A Benchmark&quot;), contain questions that are far removed from the everyday tasks most people use these models for, such as responding to emails or writing cover letters. 
As Jesse Dodge, a scientist at the Allen Institute for AI, puts it, &quot;Benchmarks are typically static and narrowly focused on evaluating a single capability, like a model's factuality in a single domain, or its ability to solve mathematical reasoning multiple choice questions.&quot;</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Another issue highlighted in both the Raji et al. paper and the TechCrunch article is the presence of errors and flaws in some widely used benchmarks. For example, an analysis of the HellaSwag benchmark, designed to evaluate commonsense reasoning in AI models, found that more than a third of the test questions contained typos and nonsensical writing. Similarly, the MMLU benchmark, which has been touted by vendors like Google, OpenAI, and Anthropic as evidence of their models' logical reasoning abilities, contains questions that can be solved through mere memorization rather than genuine understanding.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">As David Widder, a postdoctoral researcher at Cornell studying AI and ethics, notes in the TechCrunch article, &quot;A model can't [reason through and solve new and complex problems] either&quot; just because it performs well on benchmarks like MMLU. Instead, he argues, these benchmarks often test a model's ability to &quot;memoriz[e] and associat[e] two keywords together&quot; rather than truly understand causal mechanisms.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:26px;color:rgb(41, 77, 135);">Key Takeaways for Business Leaders</span></p><p style="font-weight:400;text-indent:0px;"><br></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">Given the limitations and potential misleading nature of current AI benchmarks, what should business leaders keep in mind when evaluating AI technologies? Here are some key takeaways from the Raji et al. paper and the TechCrunch article:</p><ol><li>Be skeptical of grand claims about AI systems achieving human-level or superhuman intelligence based solely on benchmark performance. As both sources emphasize, impressive results on specific benchmarks do not necessarily translate to general intelligence or robustness in real-world deployments.</li><li>When evaluating AI vendors or technologies, look beyond top-line benchmark numbers. Ask detailed questions about the specific capabilities and limitations of the system, and how it has been tested on tasks and datasets relevant to your business needs.</li><li>Encourage a culture of rigorous, multifaceted evaluation within your organization's AI initiatives. Rather than focusing solely on chasing state-of-the-art benchmark results, prioritize detailed error analysis, bias auditing, and stress testing across a diverse range of scenarios.</li><li>Support research and development efforts aimed at creating more meaningful and comprehensive benchmarks tied to real-world applications. This could include developing industry-specific datasets and evaluation protocols that better reflect the challenges and requirements of your business domain.</li><li>Foster an AI research culture that values creativity, diversity of thought, and long-term progress over short-term benchmark wins. 
Encourage your teams to explore novel architectures and approaches, even if they may not immediately yield chart-topping results.</li></ol></div>
<br><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:26px;color:rgb(41, 77, 135);">Looking Ahead: Improving AI Benchmarks</span></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Both the Raji et al. paper and the TechCrunch article offer some suggestions for improving the current state of AI benchmarks. One key idea is to incorporate more human evaluation alongside automated benchmarks. As Jesse Dodge suggests in the TechCrunch piece, &quot;The right path forward, here, is a combination of evaluation benchmarks with human evaluation—prompting a model with a real user query and then hiring a person to rate how good the response is.&quot;</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">David Widder, on the other hand, is less optimistic about the potential for improving existing benchmarks. Instead, he argues that AI evaluation should focus more on the downstream impacts of these models and whether those impacts align with the goals and values of the people affected by them. &quot;I'd ask which specific contextual goals we want AI models to be able to be used for,&quot; he says, &quot;and evaluate whether they'd be—or are— successful in such contexts.&quot;</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">As AI continues to advance and become more deeply integrated into business operations, it is crucial for leaders to have a nuanced understanding of the technologies' strengths and limitations. By looking beyond simplistic benchmark results and embracing a more holistic and rigorous approach to AI evaluation, organizations can make more informed decisions and unlock the true potential of artificial intelligence while mitigating its risks and pitfalls.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Footnotes: <br></p><div style="color:inherit;"><div><div><div><div><div><ul><li><span style="font-size:14px;"><span style="font-weight:500;font-family:&quot;Questrial&quot;, sans-serif;">&quot;<a href="https://arxiv.org/abs/2111.15366" title="AI and the Everything in the Whole Wide World Benchmark" rel="">AI and the Everything in the Whole Wide World Benchmark</a>&quot; by </span></span><span style="font-size:14px;font-weight:500;font-family:&quot;Questrial&quot;, sans-serif;">Inioluwa Deborah Raji, </span><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><span style="font-weight:500;">Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna <br></span></span></li><li>&quot;<a href="https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/" title="Why most AI benchmarks tell us so little" rel="">Why most AI benchmarks tell us so little</a>&quot; by Kyle Wiggers for TechCrunch</li></ul></div>
</div></div></div></div></div></div></div></div><div data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"] .zpimage-container figure img { width: 500px ; height: 500.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"] .zpimage-container figure img { width:500px ; height:500.00px ; } } @media (max-width: 767px) { [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"] .zpimage-container figure img { width:500px ; height:500.00px ; } } [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/responsible-ai-in-the-age-of-generative-models-ai-governance-ethics-and-risk-management" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Navy%20and%20Blue%20Modern%20We%20Provide%20Business%20Solutions%20Facebook%20Ad%20-1200%20x%201200%20px-.png" width="500" height="500.00" loading="lazy" size="medium"/></picture></a></figure></div>
</div><div data-element-id="elm_uFX8p-I0RPOxatVN-X-I4A" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_uFX8p-I0RPOxatVN-X-I4A"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p><span style="color:inherit;">Photo by William Warby on Unsplash</span></p></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Wed, 03 Apr 2024 10:42:30 +1100</pubDate></item><item><title><![CDATA[Microsoft Unveils AutoGen to Revolutionize Conversational AI Apps]]></title><link>https://www.nownextlater.ai/Insights/post/microsoft-unveils-autogen-to-revolutionize-conversational-ai-apps</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/Screenshot 2023-10-24 at 2.11.08 pm.png"/>To accelerate development of advanced conversational AI applications, Microsoft recently introduced AutoGen, an open-source Python library that streamlines orchestrating multi-agent conversations.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_vluLiypRQ1WldbTa2CB0vQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_28uh3DKcSR6PkovV8o0_bA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_Q2f-wp5qQ6mc103US_-dog" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_gvP8gNdFPIABM8v-y8leWw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"] .zpimage-container figure img { width: 1090px ; height: 564.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"] .zpimage-container figure img { width:723px ; height:374.10px ; } } @media (max-width: 767px) { [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"] .zpimage-container figure img { width:415px ; height:214.74px ; } } [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" 
data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-10-24%20at%202.11.08%20pm.png" width="415" height="214.74" loading="lazy" size="fit" alt="Autogen" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_07A53auvQcSoWa59skzEHQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_07A53auvQcSoWa59skzEHQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-center " data-editor="true"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;"><p style="font-weight:400;text-indent:0px;">Conversational artificial intelligence (AI) is transforming numerous industries by enabling more natural interactions between humans and computers. From virtual assistants to chatbots, voice interfaces, and avatars, conversational AI is becoming increasingly prevalent in everyday digital experiences. However, building the complex workflows that power these next-generation systems remains challenging for most companies.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">To accelerate development of advanced conversational AI applications, Microsoft recently introduced AutoGen, an open-source Python library that streamlines orchestrating multi-agent conversations. With AutoGen's customizable and intelligent agents, developers can readily construct sophisticated conversational systems and workflows using combinations of AI, tools, and human inputs.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Democratizing Complex Conversational AI Workflows</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">A key goal of AutoGen is democratizing the creation of intricate conversational AI applications. Traditionally, building multi-turn workflows involving several AI components has required extensive engineering expertise and effort. 
AutoGen encapsulates the complexity behind easy-to-use agents and interfaces.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Some examples of applications enabled by AutoGen:</p><ul style="margin-left:40px;"><li>Tutoring systems where students converse with an AI tutor that can call an expert for help when needed</li><li>Troubleshooting chatbots that propose solutions, execute tools, and incorporate human feedback</li><li>Interactive fiction games with conversational NPCs powered by AI and humans</li><li>Data analysis workflows where users discuss options with an AI assistant that runs code and queries databases</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">With AutoGen's pre-built agents and simple API, developers can set up the conversational 'cast' and interactions for their application in just a few lines of Python code. The complexity of conversing, remembering context, integrating tools, handling errors, and supporting dynamic multi-agent chatter happens automatically behind the scenes.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;"></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">AutoGen Agents - Conversational Building Blocks</span></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">At the core of AutoGen are customizable agents that can chat with each other and humans to solve problems. There are two key types of agents:</p><ul style="margin-left:40px;"><li>Assistant agents provide domain expertise using large language models like GPT-3.5 and GPT-4. They can be configured with instructions and knowledge for different roles.</li><li>User proxy agents act on behalf of humans. 
They can request inputs, execute tools through code, or take other custom actions.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">By combining these agents into multi-agent systems, developers can construct automated workflows with flexible human involvement. Agents exchange messages until they mutually determine the conversation has achieved its goal.</p><p style="font-weight:400;text-indent:0px;">For instance, an assistant agent might propose an analytical approach while the user proxy agent runs simulations to validate the idea before reporting results back to the assistant. AutoGen streamlines the intricacies of conversation management so developers simply define the agents and their interactions.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Maximizing Value from Large Language Models</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">In addition to simplifying complex workflows, AutoGen also includes features to maximize the value derived from expensive large language model APIs like OpenAI's.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen helps users:</p><ul style="margin-left:40px;"><li>Fine-tune model hyperparameters like temperature, presence penalty, and stop sequences to optimize for metrics like accuracy, cost, etc.</li><li>Cache model outputs to avoid redundant expensive calls.</li><li>Automatically handle errors and retries to improve reliability.</li><li>Seamlessly blend outputs from multiple model configurations.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Tools like these ensure users efficiently tap into the vast capabilities of large language
models through a robust interface.</p><p style="font-weight:400;text-indent:0px;">Microsoft is particularly focused on responsible and ethical standards for AutoGen. They incorporated algorithmic techniques to provide transparency and maintain human oversight over any automated conversations between agents.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;"></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Empowering a New Generation of AI Applications</span></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen tackles a common pain point in leveraging today's most advanced AI capabilities: the burdensome process of coordinating multiple conversational AI components. With its blend of simple abstractions and powerful features, AutoGen opens the door to new categories of AI applications:</p><ul style="margin-left:40px;"><li>Medical chatbots that discuss patient cases with doctors before synthesizing expert advice</li><li>Multi-modal VR agents that converse with users and AI assistants while manipulating 3D environments</li><li>Interactive fiction games with dialogue trees branching based on player choices and AI improvisation</li><li>Data science workflows where users explore models through natural language conversations with AutoGen agents</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen represents an important step in making sophisticated AI more accessible. 
Its potential to unlock new products and experiences makes AutoGen one of the most exciting recent developments in conversational AI.</p><p style="font-weight:400;text-indent:0px;"><br></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Key Takeaways for Business Leaders</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">For business leaders, AutoGen represents an opportunity to leverage conversational AI in new ways across customer engagement, operations, employee productivity, and more. Companies that leverage AutoGen early could gain a competitive advantage in their ability to rapidly deploy innovative conversational experiences.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen is an enabling technology that can help businesses adopt conversational AI at scale by making development drastically easier. 
Its potential to unlock new products and efficiencies makes it a platform business leaders should have on their radar.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Sources:</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><a href="https://arxiv.org/pdf/2308.08155.pdf" title="AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation" rel="">AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation</a><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><a href="https://microsoft.github.io/autogen/docs/Getting-Started" title="Autogen" rel="">Autogen</a><br></span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><a href="https://microsoft.github.io/"><span style="color:inherit;"><br></span></a></p></div><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p></div></div></div><p></p></div>
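The assistant/user-proxy conversation loop described above can be sketched in a few lines of plain Python. This is a toy illustration of the pattern only, not the AutoGen library's actual API: the agent class, the reply functions, and the termination convention are all illustrative stand-ins.

```python
# Toy sketch of the two-agent pattern: an "assistant" proposes actions,
# a "user proxy" executes them, and they chat until the goal is reached.

class Agent:
    def __init__(self, name, reply_fn):
        self.name = name
        self.reply_fn = reply_fn

    def reply(self, message, history):
        return self.reply_fn(message, history)

def initiate_chat(sender, receiver, message, max_turns=6):
    """Alternate messages between two agents until one says TERMINATE."""
    history = []
    current, other = receiver, sender
    while message != "TERMINATE" and len(history) < max_turns:
        history.append((other.name, message))
        message = current.reply(message, history)
        current, other = other, current
    return history

# Stand-in "assistant": proposes a computation, then ends the conversation.
def assistant_reply(message, history):
    if "result:" in message:
        return "TERMINATE"          # goal achieved, stop chatting
    return "run: 2 + 3"             # propose a computation to execute

# Stand-in "user proxy": executes the proposed code and reports the result.
def user_proxy_reply(message, history):
    expr = message.removeprefix("run: ")
    return f"result: {eval(expr)}"  # real systems sandbox execution!

assistant = Agent("assistant", assistant_reply)
user_proxy = Agent("user_proxy", user_proxy_reply)
transcript = initiate_chat(user_proxy, assistant, "Please compute 2 + 3.")
```

The transcript records the alternating messages and ends once the assistant sees the executed result, mirroring how AutoGen agents exchange messages until they mutually determine the conversation has achieved its goal.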
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Tue, 24 Oct 2023 14:13:24 +1100</pubDate></item><item><title><![CDATA[MemGPT: The Memory Limitations of AI Systems and a Clever Technological Workaround]]></title><link>https://www.nownextlater.ai/Insights/post/memgpt-using-operating-system-concepts-to-unlock-the-potential-of-large-language-models</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/fredy-jacob-t0SlmanfFcg-unsplash.jpg"/>MemGPT, applies OS principles like virtual memory and process management to unlock more powerful applications of LLMs - all while staying within their inherent memory limits.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_6KOImqMKTvmSQvnw-si5SA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_7-0PUdmhRWGjcvnBzQt1UQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_lor5cj6VTIGjCq0bsxHbqg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_xMqGWPgee3VC7ipsoT-lag" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"] .zpimage-container figure img { width: 1090px ; height: 613.13px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"] .zpimage-container figure img { width:723px ; height:406.69px ; } } @media (max-width: 767px) { [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"] .zpimage-container figure img { width:415px ; height:233.44px ; } } [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" 
data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/fredy-jacob-t0SlmanfFcg-unsplash.jpg" width="415" height="233.44" loading="lazy" size="fit" alt="memory" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_FoVkbOwNRI-FolvQ_xcnJQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_FoVkbOwNRI-FolvQ_xcnJQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-center " data-editor="true"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;"><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">Artificial intelligence systems that can have natural conversations and analyze documents have transformative business potential. However, today's AI - specifically large language models (LLMs) like Claude 2 and GPT-4 - have a major limitation. They can only remember a finite amount of information before needing to completely reset their memory. This restricts their ability to have coherent, long-term interactions or make connections across lengthy documents.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">One might assume the solution is just to build LLMs with bigger memories. But LLMs face sharply diminishing returns and ballooning computational costs from naively expanding memory. After reviewing these tradeoffs, researchers at UC Berkeley devised an innovative workaround drawing inspiration from operating systems. 
Their system, MemGPT, applies OS principles like virtual memory and process management to unlock more powerful applications of LLMs - all while staying within their inherent memory limits.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">The Core Challenge of LLM Memory Limits</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">LLMs use an algorithm called self-attention to analyze incoming text and predict upcoming words, just as humans intuitively continue a thought or conversation. This grants LLMs their impressive language skills. However, self-attention requires the LLM to look across all context it's received so far, which means its memory must be reset after reaching a fixed size limit.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">For perspective, Claude 2 can handle about 100,000 tokens before resetting. That may sound generous compared to a 10,000 word business report. But spoken conversation can easily exceed this limit in just a few hours of steady chit-chat. Even more daunting are tasks like sifting complex legal documents that routinely run millions of tokens.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">LLMs have a fixed memory capacity because the self-attention algorithm scales quadratically based on context length. Doubling the memory size makes the LLM's computations 4x more intensive. Expanding memory quickly becomes computationally infeasible, even for large tech companies.</p><p style="font-weight:400;text-indent:0px;">Rather than a flaw in specific systems like Claude, this limited memory span is an inherent constraint of all modern LLM architectures. Naively expanding memory was not a viable solution path. 
More creative approaches would be needed.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">The Insights Behind MemGPT's Operating System-Inspired Design</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">UC Berkeley researchers drew inspiration from operating systems like Windows that run applications working with far more data than fits into available RAM. They asked: how can we apply OS techniques to provide an LLM the illusion of infinite memory?</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">The result was MemGPT, which implements two key principles:</p><ol style="margin-left:40px;"><li>A hierarchy of memory resources - MemGPT divides memory into a small, fast &quot;main context&quot; like RAM and a large, slow &quot;external context&quot; like disk storage. 
Information must be explicitly transferred between them.</li><li>Process management - MemGPT handles control flow between memory, the LLM, and users akin to how an OS arbitrates between concurrent processes.</li></ol><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Together these give MemGPT the ability to pipeline potentially unlimited memory in and out of the LLM's limited context window as needed to accomplish tasks requiring unbounded memory over multiple processing cycles.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Just as clever OS architectures enable applications to work with more data than available RAM, MemGPT's design confers an illusion of infinite memory to fixed-context LLMs.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Conversational AI That Can Reference Years of Dialogue</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">A major application of LLMs is powering conversational assistants and social bots. 
MemGPT demonstrates substantially improved consistency and personalization in these applications:</p><ul style="margin-left:40px;"><li>Consistency - By querying external memory of prior interactions, MemGPT can coherently maintain facts, preferences, and history even when referring back to dialogues from months or years ago.</li><li>Personalization - MemGPT can spontaneously incorporate comprehensive knowledge about the user, like callback jokes referencing childhood stories told weeks in the past to forge greater rapport.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Analyzing Large Collections of Documents</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">MemGPT also excels at tasks like:</p><ul style="margin-left:40px;"><li>Question answering using a massive multi-document corpus like Wikipedia or a company knowledge base.</li><li>Extracting key facts and relationships by synthesizing relevant excerpts across thousands of pages.</li><li>Performing multi-hop reasoning spanning fragmented information distributed across documents.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">These capabilities could greatly amplify the utility of LLMs for knowledge management applications.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Takeaways for Business Leaders</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">MemGPT provides two key lessons for applying LLMs:</p><ol style="margin-left:40px;"><li>Look beyond scaling model size, and consider architectural innovations to
push capabilities forward within intrinsic limits.</li><li>Draw inspiration from solutions in fields like systems architecture - LLM memory management has parallels to longstanding CS problems.</li></ol><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Rather than getting caught up in an AI arms race, businesses can look to MemGPT's clever memory architecture, which unlocks substantially more powerful applications without requiring unrealistic context sizes. Techniques like this that work within practical constraints will be key to delivering business value from AI.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Sources:</p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><a href="https://arxiv.org/pdf/2310.08560.pdf" title="MEMGPT: Towards LLMs as Operating systems " rel="">MEMGPT: Towards LLMs as Operating Systems</a></span> by <span style="color:inherit;">UC Berkeley</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><br></p></div>
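The two-tier memory hierarchy described above can be sketched in a few lines of Python. The class and method names here are hypothetical illustrations of the paging idea, not MemGPT's actual implementation: a small bounded "main context" plays the role of RAM, and an unbounded "external context" plays the role of disk.

```python
# Minimal sketch of a two-tier memory: overflow from the bounded main
# context is paged out to external storage rather than lost, and can be
# explicitly paged back in by a (toy) keyword search.
from collections import deque

class TieredMemory:
    def __init__(self, main_capacity):
        self.main = deque()              # fast: fits in the LLM's context window
        self.external = []               # slow: unbounded archival storage
        self.main_capacity = main_capacity

    def add(self, message):
        """New messages enter main context; the oldest are evicted to 'disk'."""
        self.main.append(message)
        while len(self.main) > self.main_capacity:
            self.external.append(self.main.popleft())

    def recall(self, keyword):
        """Explicitly transfer archived messages back in via keyword search."""
        return [m for m in self.external if keyword in m]

mem = TieredMemory(main_capacity=3)
for msg in ["my dog is called Rex", "I live in Sydney",
            "I prefer short emails", "what's new?", "tell me a joke"]:
    mem.add(msg)

# Only the 3 most recent messages fit "in context", but older facts
# remain retrievable from external storage:
mem.recall("Rex")   # → ["my dog is called Rex"]
```

Just as the article describes, information must be explicitly transferred between the tiers; a control loop deciding when to evict and when to recall is what gives the fixed-context model its illusion of unbounded memory.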
</div></div><p></p></div></div><div data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"] .zpimage-container figure img { width: 800px ; height: 344.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"] .zpimage-container figure img { width:500px ; height:215.00px ; } } @media (max-width: 767px) { [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"] .zpimage-container figure img { width:500px ; height:215.00px ; } } [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/aibooks" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Untitled%20design%20-4-.png" width="500" height="215.00" loading="lazy" size="large"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Tue, 24 Oct 2023 12:02:02 +1100</pubDate></item><item><title><![CDATA[Is GPT-4 a Mixture of Experts Model? Exploring MoE Architectures for Language Models]]></title><link>https://www.nownextlater.ai/Insights/post/is-gpt-4-a-mixture-of-experts-model-exploring-moe-architectures-for-language-models</link><description><![CDATA[Rumors are swirling that GPT-4 may use an advanced technique called Mixture of Experts (MoE) to achieve over 1 tr parameters. This offers an opportunity to demystify MoE]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_l-rxaOxTSYujeWk2-vZfMw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_xFH57oOkRPim79EfxOAuUg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_42e7Ken5TQirB4Tf08O0Jg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_khPg25WU59_le2ZHOQnl4g" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"] .zpimage-container figure img { width: 500px ; height: 229.84px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"] .zpimage-container figure img { width:500px ; height:229.84px ; } } @media (max-width: 767px) { [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"] .zpimage-container figure img { width:500px ; height:229.84px ; } } [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" 
data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-17%20at%202.15.32%20pm.png" width="500" height="229.84" loading="lazy" size="medium" alt="A sample of related models" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_wydQABEFSfq69jt59vZzKw" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_wydQABEFSfq69jt59vZzKw"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p><span style="color:inherit;">Rumors are swirling that GPT-4 may use an advanced technique called Mixture of Experts (MoE) to achieve over 1 trillion parameters. Although unconfirmed, these reports offer an opportunity to demystify MoE and explore why this architecture could allow the next generation of language models to efficiently scale to unprecedented size.<br><br><span style="font-family:&quot;Oswald&quot;, sans-serif;">What is Mixture of Experts? </span><br><br>In most AI systems, a single model is applied to all inputs. But MoE models have groups of smaller &quot;expert&quot; models, each with their own parameters. For every new input, an expert selector chooses the most relevant experts to process that data.<br><br>This means only a sparse subset of the total parameters are activated per input. So MoE models can pack in exponentially more parameters without a proportional explosion in computation.<br><br>For language tasks, some experts specialize in grammar, others learn factual knowledge, allowing MoE models to better handle the nuances of natural language. The selector dynamically routes each word to the best combination of experts.<br><br>So while an MoE model may contain trillions of total parameters via its many experts, only a tiny fraction need to be used for any given input. This allows unprecedented scale while maintaining efficiency.<br><br><span style="font-family:&quot;Oswald&quot;, sans-serif;">Pioneering MoE to Power Language AI</span><br><br>The core concept of MoE dates back decades, but only recently has progress in model parallelism and distributed training enabled its application to large language models. 
<br><br>Google has published notable results using MoE to achieve huge language models:<br><br></span></p><p style="margin-left:40px;"><span style="color:inherit;">1) <span style="font-family:&quot;Oswald&quot;, sans-serif;"><a href="https://arxiv.org/pdf/2101.03961.pdf" title="Switch Transformers" rel="">Switch Transformers</a></span> simplify MoE routing strategies. In experiments, they attain up to 8x faster training versus dense models on language tasks by intelligently allocating computation.</span></p><p style="margin-left:40px;"></p><p style="margin-left:40px;"><span style="color:inherit;"><br></span></p><p style="margin-left:40px;"><span style="color:inherit;">2) <span style="font-family:&quot;Oswald&quot;, sans-serif;"><a href="https://arxiv.org/abs/2112.06905" title="GLaM" rel="">GLaM</a></span> leverages MoE to reach 1.2 trillion parameters. With just 8% of its weights active per input, it outperforms the 175 billion parameter GPT-3 on multiple language benchmarks. <br></span></p><p style="margin-left:40px;"></p><p style="margin-left:40px;"><span style="color:inherit;"><br></span></p><p>Between these two projects, we see MoE enables order-of-magnitude leaps in model capacity, capability, and efficiency. If GPT-4 utilizes MoE to hit 1+ trillion parameters as speculated, it suggests OpenAI has engineered solutions for training and deployment that overcome key scaling barriers.</p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;"><br>The Upshot for Business Leaders <br></span></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;"><br></span></p><p>MoE presents a disruptive path to building AI systems with previously unfathomable levels of knowledge and versatility. 
Leveraging these capabilities productively and safely will require deep consideration.</p><p><br></p><p>As this technology continues advancing, business leaders should stay cognizant of developments in MoE and large language models, and keep in mind the following:</p><ul><li>MoE enables <span style="text-decoration:underline;">exponential gains in model capacity at constant computational cost</span> - expect rapid leaps in language AI.</li><li>Specialized experts <span style="text-decoration:underline;">can encode robust knowledge</span> - anticipate AI that is far more competent and wide-ranging. </li><li>However, <span style="text-decoration:underline;">risks rise</span> with capability - plan to implement strong controls and oversight for safety.</li></ul><p><br></p><p>While the details of GPT-4 remain unconfirmed, its scale may soon demonstrate the vast possibilities of MoE in language AI, for better or worse. A wise, measured approach to deploying such technology will be vital.</p></div>
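The sparse routing idea at the heart of MoE can be made concrete with a toy example. Everything here is illustrative - the tiny "experts", the gate weights, and the sizes are stand-ins, not GPT-4's (unconfirmed) architecture - but the mechanism is the one described above: a gate scores every expert, yet only the top-k experts actually run for a given input.

```python
# Toy sketch of sparse Mixture-of-Experts routing with top-k gating.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Run only the k best experts and blend their outputs by gate weight."""
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights]
    probs = softmax(scores)
    top_k = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    # Sparse activation: experts outside top_k contribute zero compute.
    return sum(probs[i] * experts[i](x) for i in top_k)

# Four "experts", each a trivial function standing in for a sub-network.
experts = [
    lambda x: sum(x),           # expert 0
    lambda x: max(x),           # expert 1
    lambda x: min(x),           # expert 2
    lambda x: sum(x) / len(x),  # expert 3
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

y = moe_forward([2.0, 1.0], experts, gate_weights, k=2)
```

Only 2 of the 4 experts execute for this input, which is why total parameter count can grow with the number of experts while per-input compute stays roughly constant - the property that lets MoE models scale to trillions of parameters.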
</div><div data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"] .zpimage-container figure img { width: 800px ; height: 344.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"] .zpimage-container figure img { width:500px ; height:215.00px ; } } @media (max-width: 767px) { [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"] .zpimage-container figure img { width:500px ; height:215.00px ; } } [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/aibooks" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Untitled%20design%20-4-.png" width="500" height="215.00" loading="lazy" size="large"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 17 Aug 2023 14:25:20 +1000</pubDate></item><item><title><![CDATA[Automating Common Sense for AI With Ensemble Models]]></title><link>https://www.nownextlater.ai/Insights/post/automating-common-sense-for-ai-with-ensemble-models</link><description><![CDATA["Symbolic knowledge distillation" that automates common sense acquisition for AI.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_mPKN0rjCQVyuVjArx-vFGA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_veGBOJUFSYK4lnARV0N_ow" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_IvhbYywqQbujJzuyoys2bA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"] .zpimage-container figure img { width: 500px ; height: 394.38px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"] .zpimage-container figure img { width:500px ; height:394.38px ; } } @media (max-width: 767px) { [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"] .zpimage-container figure img { width:500px ; height:394.38px ; } } [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox 
" data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-16%20at%2011.35.08%20am.png" width="500" height="394.38" loading="lazy" size="medium" alt="Symbolic knowledge distillation" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_wtOfCPoVSESEnAf3dmqoag" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_wtOfCPoVSESEnAf3dmqoag"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p><span style="color:inherit;">Artificial intelligence (AI) systems still lack true understanding of the world and rely heavily on training data provided by humans. An ongoing challenge is developing AI with more generalized common sense - basic knowledge about how the world works that humans acquire through experience.&nbsp;</span></p><p><span style="color:inherit;"><br></span></p><p><span style="color:inherit;">Researchers have proposed compiling common sense into knowledge graphs - structured collections of facts. But these require extensive manual effort to create and often have gaps. Now, scientists at the University of Washington and the Allen Institute for AI have demonstrated a new technique called &quot;symbolic knowledge distillation&quot; that automates common sense acquisition for AI. Their method transfers knowledge from a large, general AI model into a specialized common sense model, without direct human authoring.<br><br>The researchers used GPT-3, a leading natural language AI model from OpenAI, as the knowledge source. GPT-3 was prompted to generate common sense inferences about everyday scenarios, creating a knowledge graph called ATOMIC10x with 10 times more entries than human-authored versions. This automatic approach achieved greater scale and diversity of common sense than manual authoring.<br><br>To improve the accuracy of the AI-generated knowledge, the researchers trained a separate &quot;critic&quot; model to filter out incorrect inferences. With this critic, ATOMIC10x attained over 96% accuracy in human evaluations, surpassing 86.8% for human-authored graphs. 
The AI-generated knowledge graph thus exceeded human-authored versions in quantity while matching their quality.<br><br>The researchers then trained a compact common sense model called COMET on the ATOMIC10x graph. Remarkably, this smaller COMET model outperformed its massive GPT-3 teacher in generating accurate common sense inferences. It also improved on models trained with human-written knowledge graphs.<br><br>This demonstrates an alternative pipeline - from machine-generated data to specialized AI models - that can exceed human capabilities for common sense acquisition. The researchers propose that humans can play a more focused role as critics, rather than manually authoring entire knowledge bases.<br><br>The new distillation technique paves the way for more capable AI assistants, chatbots, and robots that understand implicit rules of everyday situations. Common sense helps AI converse naturally, perform physical tasks, and make logical inferences about causality and human behavior. Automating common sense at scale remains a grand challenge for human-like artificial intelligence.<br><br>This research exemplifies how large AI models like GPT-3 can transfer knowledge to more specialized applications through automatic generation. While general models have limitations in narrowly defined tasks, their broad learning makes them valuable teachers. Distillation techniques focus that broad knowledge into optimized models for specific needs like common sense.<br><br>Business leaders should track such advances that make AI more generally capable and useful across applications. Automating the acquisition of common sense can complement training data curated by humans, reducing manual bottlenecks. AI models endowed with common sense hold promise for everything from chatbots to autonomous systems to creative applications.
While current methods are imperfect, rapid progress is being made - foreshadowing AI assistants that understand the world more like we do.</span></p><p><span style="color:inherit;"><br></span></p><p><span style="color:inherit;">Sources:</span></p><p><span style="color:inherit;"><a href="https://arxiv.org/abs/2110.07178" title="Symbolic Knowledge Distillation: from General Language Models to Commonsense Models" rel="">Symbolic Knowledge Distillation: from General Language Models to Commonsense Models</a></span></p><p></p></div>
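The generate-filter-distill pipeline described above can be sketched in a few lines of Python. Everything here is illustrative: `generate_inferences` stands in for prompting a large teacher model such as GPT-3, and the critic is reduced to a simple score threshold rather than a trained model.

```python
# Illustrative sketch of symbolic knowledge distillation (not the authors' code).
# A large "teacher" model generates candidate common-sense inferences, a critic
# scores them, and only high-confidence entries enter the knowledge graph used
# to train a smaller "student" model like COMET.

def generate_inferences(event):
    """Stand-in for prompting a large teacher model (e.g. GPT-3)."""
    return [
        {"event": event, "relation": "xEffect", "tail": "gets wet", "score": 0.95},
        {"event": event, "relation": "xEffect", "tail": "wins a prize", "score": 0.10},
    ]

def critic_accepts(inference, threshold=0.5):
    """Stand-in for a trained critic model filtering low-quality inferences."""
    return inference["score"] >= threshold

def distill(events):
    """Build a filtered knowledge graph from teacher generations."""
    graph = []
    for event in events:
        for inf in generate_inferences(event):
            if critic_accepts(inf):
                graph.append(inf)
    return graph

knowledge_graph = distill(["PersonX walks in the rain"])
```

The division of labour mirrors the paper's proposal: the teacher supplies scale, the critic supplies quality control, and humans only need to train the critic rather than author every entry.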
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Wed, 16 Aug 2023 11:38:51 +1000</pubDate></item><item><title><![CDATA[Enhancing AI's Compositional Language Skills]]></title><link>https://www.nownextlater.ai/Insights/post/enhancing-ai-s-compositional-language-skills</link><description><![CDATA[Enhancing AI's Compositional Language Skills]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_Ag9lOtL8TDaPl-p8m7SaIA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_S7Dlm9VTR92NhgNiuAFiPw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_5stMruKbRsmF702-Ogmm0Q" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"] .zpimage-container figure img { width: 1090px ; height: 467.34px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"] .zpimage-container figure img { width:723px ; height:309.99px ; } } @media (max-width: 767px) { [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"] .zpimage-container figure img { width:415px ; height:177.93px ; } } [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%2010.07.55%20am.png" width="415" height="177.93" loading="lazy" size="fit" alt="Extracting a lexicon that relates words to their meanings in each dataset" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_nwsHHNOQTGmo-IdYY3B47w" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_nwsHHNOQTGmo-IdYY3B47w"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>A major challenge in artificial intelligence is improving computers' ability to truly comprehend language. Humans readily grasp how the meaning of a sentence depends on the meanings of its component words and how they combine structurally. We intuitively rearrange language components while preserving overall meaning.</p><p><br></p><p>AI systems still struggle with this fluid, compositional reasoning. Mastering it would make conversational AI much more powerful and useful. For example, chatbots could handle varied questions and scenarios if they deeply understood how permutations of known linguistic elements construct meaning.</p><p><br></p><p>To advance AI capabilities in this area, researchers at MIT and IBM recently developed a novel technique called LEXSYM. Their key insight is that compositionality mathematically correlates with symmetries in how language data can be transformed while staying semantically valid.</p><p><br></p><p>For instance, swapping &quot;yellow&quot; and &quot;green&quot; in the sentence &quot;Pick up the yellow cube&quot; maintains its essential meaning. LEXSYM automatically detects such symmetries and uses them to synthesize new training examples by substituting related words and phrases.</p><p><br></p><p>In experiments, neural networks trained with LEXSYM-augmented data showed improved skills in executing new instruction combinations, answering compositional reasoning questions about images, and inferring the logical parse of unfamiliar sentences.</p><p><br></p><p>While limitations remain, LEXSYM provides a promising path toward stronger fluidity, generalization, and human-like compositional abilities in AI systems. 
As conversational interfaces proliferate, these skills will allow smooth, robust interactions.</p><p><br></p><p>For businesses leveraging AI, enhanced compositional language mastery can significantly increase the capability, utility, and linguistic versatility of chatbots, virtual assistants, recommendation systems, and other applications. LEXSYM offers useful foundations to make these AI agents more conversant, adaptive, and lifelike in communications.</p><div><br>Sources:</div><div><div><span style="color:inherit;"><a href="https://arxiv.org/pdf/2201.12926.pdf" title="LexSym: Compositionality as Lexical Symmetry" rel="">LexSym: Compositionality as Lexical Symmetry</a></span></div></div></div><p></p></div>
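The symmetry-based augmentation idea can be illustrated with a toy snippet. The word-pair lexicon here is invented for the example; LEXSYM discovers such symmetries automatically from data.

```python
# Toy illustration of symmetry-based data augmentation (inspired by LEXSYM,
# not the authors' implementation). Words that play the same semantic role can
# be swapped consistently across an (instruction, meaning) pair to synthesize
# new, equally valid training examples.

def swap_tokens(text, a, b):
    """Swap every occurrence of token a with b and vice versa."""
    placeholder = "\x00"
    return text.replace(a, placeholder).replace(b, a).replace(placeholder, b)

def augment(example, symmetries):
    """Apply each discovered word-pair symmetry to both sides of an example."""
    instruction, meaning = example
    new_examples = []
    for a, b in symmetries:
        new_examples.append((swap_tokens(instruction, a, b),
                             swap_tokens(meaning, a, b)))
    return new_examples

# Hypothetical lexicon: colour words occupy symmetric positions in this domain.
symmetries = [("yellow", "green")]
augmented = augment(("pick up the yellow cube", "PICKUP(yellow, cube)"), symmetries)
```

Because the swap is applied to the instruction and its meaning representation together, every synthesized pair remains semantically consistent, which is what lets the augmented data teach compositional structure.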
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 10:10:54 +1000</pubDate></item><item><title><![CDATA[DisentQA: Catching Knowledge Gaps and Avoiding Misleading Users]]></title><link>https://www.nownextlater.ai/Insights/post/enabling-ai-to-untangle-different-knowledge-sources</link><description><![CDATA[Building QA Systems that catch knowledge gaps and avoid misleading users.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_ewF7pMN9Q_eczUOQpCYtUA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_XdQfIANyTi-5Z3w2LSGv-A" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_tGWqJgjLSlyldj1XkXMcGw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_KipIDvLOVMb6oIC8bF9TkA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_KipIDvLOVMb6oIC8bF9TkA"] .zpimage-container figure img { width: 500px ; height: 486.01px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_KipIDvLOVMb6oIC8bF9TkA"] .zpimage-container figure img { width:500px ; height:486.01px ; } } @media (max-width: 767px) { [data-element-id="elm_KipIDvLOVMb6oIC8bF9TkA"] .zpimage-container figure img { width:500px ; height:486.01px ; } } [data-element-id="elm_KipIDvLOVMb6oIC8bF9TkA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium 
hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%209.09.37%20am.png" width="500" height="486.01" loading="lazy" size="medium" alt="Example outputs from our disentangled QA model on the Natural Questions dataset. " data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_TNbKqQ17TP256B60EqRP7w" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_TNbKqQ17TP256B60EqRP7w"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p>Imagine you ask your phone &quot;Who wrote the song Hello by Adele?&quot; and it gives you an incorrect answer, insisting the song is by Taylor Swift. This shows artificial intelligence sometimes confuses its own training knowledge with external facts.</p><p><br></p><p>Researchers want to fix this issue to make AI assistants more helpful and honest. Their solution: <span style="color:inherit;">Build QA Systems that catch knowledge gaps and avoid misleading users by </span>teaching the system to provide two responses:</p><ol><li>The factual answer based on given information (e.g. Adele)</li><li>What it privately recalls from its memory (e.g. Taylor Swift)</li></ol><p><br></p><p>This highlights any mismatches between its training knowledge and external data. It's like when we say &quot;Hmm, I thought X, but the website says Y.&quot;</p><p><br></p><p>The team trained the AI model by creating quizzes with tricky examples:</p><ul><li>Swapping names in passages to elicit different responses from the context vs. the model's recollection</li><li>Removing passages altogether so the system must say &quot;I don't know&quot;</li></ul><p><br></p><p>After this special training, the model reliably distinguished its own knowledge from given facts. This improved its accuracy and truthfulness.</p><p><br></p><p>Say you ask about a movie release date. The system can now respond:</p><p><span style="font-style:italic;">&quot;The article says July 2022. 
But I thought it was December 2022.&quot;</span></p><p><br></p><p>This catches any knowledge gaps and avoids misleading users.</p><p><br></p><p>While not perfect, it's major progress toward AI that collaborates in a transparent, helpful manner. The benefits for businesses are clear:</p><ul><li>Avoid frustrated users with incorrect responses</li><li>Build trust by exposing limitations upfront</li><li>Reduce risk from applying flawed knowledge</li><li>Clarify when external data should override internal beliefs</li></ul><p><br></p><p>By recognizing and sharing when its knowledge is incomplete, the AI becomes a more reliable and honest partner. This research brings us closer to truly cooperative human-AI interaction.</p><p><br></p><p>Sources:</p><p><span style="color:inherit;"><a href="https://arxiv.org/pdf/2211.05655.pdf" title="DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering" rel="">DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering</a></span></p><p></p></div>
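The dual-answer behaviour described above can be sketched as follows. The toy dictionary lookup and `parametric_memory` are invented stand-ins for a real model's reading comprehension and learned weights.

```python
# Sketch of a disentangled QA response (inspired by DisentQA, not the paper's
# model). The system returns both the answer grounded in the supplied context
# and the answer it would give from its own "parametric" memory, flagging any
# mismatch instead of silently picking one.

parametric_memory = {"who wrote hello": "Taylor Swift"}  # deliberately wrong

def contextual_answer(question, context):
    """Toy stand-in for reading the answer out of the given passage."""
    return context.get(question)

def disentangled_answer(question, context):
    ctx = contextual_answer(question, context)
    mem = parametric_memory.get(question)
    if ctx is None:
        # No supporting passage: admit the gap rather than guess.
        return {"contextual": "I don't know", "parametric": mem, "conflict": False}
    return {"contextual": ctx, "parametric": mem, "conflict": ctx != mem}

reply = disentangled_answer("who wrote hello", {"who wrote hello": "Adele"})
```

The `conflict` flag is the useful signal for applications: it marks exactly the cases where external data and internal beliefs disagree and a human or downstream policy should decide which to trust.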
</div><p></p></div></div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 09:22:46 +1000</pubDate></item><item><title><![CDATA[Training Smarter AI Systems to Understand Natural Language]]></title><link>https://www.nownextlater.ai/Insights/post/Training-Smarter-AI-Systems-to-Understand-Natural-Language</link><description><![CDATA[Researchers are exploring new techniques to improve AI's ability to grasp diverse sentence structures and indirect meaning.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_1YarWTKxSpWFcYT1yEypiQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm__r-n6p0FTsCU2VN0Qht7Yw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_N-CnlBB4S6GtTyMuX_7gIA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_k92LbwDhYZdUrScfGjwNLA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"] .zpimage-container figure img { width: 800px ; height: 325.50px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"] .zpimage-container figure img { width:500px ; height:203.44px ; } } @media (max-width: 767px) { [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"] .zpimage-container figure img { width:500px ; height:203.44px ; } } [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large 
zpimage-tablet-fallback-large zpimage-mobile-fallback-large hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%208.43.27%20am.png" width="500" height="203.44" loading="lazy" size="large" alt="The overall framework to construct PARAAMR based on AMR back-translation. " data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_R3yIVftWS3ezwM-jxnW1Uw" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_R3yIVftWS3ezwM-jxnW1Uw"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p style="text-align:left;">Artificial intelligence has come a long way in understanding human language, but it still struggles with the nuances and complexities of natural conversation. Researchers are exploring new techniques to improve AI's ability to grasp diverse sentence structures and indirect meaning.</p><p style="text-align:left;"><br></p><p>A team at Google, UCLA and USC recently made advances on this challenge by creating a large dataset of syntactically diverse sentence pairs with similar meaning. Their method relies on abstract meaning representations (AMRs).</p><p><br></p><p>AMRs capture the underlying semantics of sentences in a structured graph format. While two sentences can differ significantly in wording and syntax, their AMRs may convey largely the same meaning.</p><p><br></p><p>The researchers leveraged this insight for paraphrasing - generating sentences that communicate the same essence differently. First, they parsed over 15 million sentences into AMR graphs using an existing tool. Next, they systematically modified each graph's &quot;focus&quot; node and direction of connecting edges to reflect alternate ways of expressing the main idea.</p><p><br></p><p>The altered AMR graphs were then decoded back into English sentences. 
This yielded over 100 million novel paraphrases exhibiting substantial syntactic diversity like changes in word order, structure and focus.</p><p><br></p><p>Through both automatic metrics and human evaluation, the team showed their new corpus called PARAAMR has greater diversity than other popular paraphrasing datasets based on machine translation, while maintaining semantic similarity.</p><p><br></p><p>Unlike translating between languages, the AMR approach reliably preserves meaning without introducing errors. And forcing syntactic variations during decoding prompts more creative expression of ideas.</p><p><br></p><p>The researchers demonstrated PARAAMR's value on three NLP tasks. Using it to train systems for learning sentence embeddings, controlling paraphrase syntax, and low-shot text classification all led to improved performance over other datasets.</p><p><br></p><p>For businesses applying AI, better representing language semantics in machine learning models enables more natural interactions. Conversational systems like chatbots and voice assistants can understand users more precisely without strictly expecting fixed phrases and patterns.</p><p><br></p><p>PARAAMR shows the possibilities of graph-based semantic parsing for AI language understanding. But some limitations remain for real-world deployment:</p><ul><li>Performance depends heavily on upstream parsing and graph-to-text modules. Imperfect components propagate errors.</li><li>Many graph modifications yield unnatural outputs. The team filtered these, but some issues may remain.</li><li>Their English-only approach lacks linguistic and cultural diversity to cover all use cases.</li></ul><p><br></p><p>With smart engineering and expanded training data, AMR-based methods can make conversational AI more flexible and robust. 
By better grasping nuanced human language, systems can communicate more naturally across diverse applications.</p><p><br></p><p>Sources:</p><p><span style="color:inherit;"><a href="https://arxiv.org/pdf/2305.16585.pdf" title="ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation" rel="">ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation</a></span></p><p></p></div><p></p></div>
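The graph-refocusing step at the heart of the pipeline can be shown with a toy example. Real systems use trained AMR parsers and graph-to-text generators; here an AMR-like graph is just a list of (source, relation, target) triples, and the example sentence is invented.

```python
# Toy illustration of AMR-based paraphrasing (inspired by ParaAMR, not the
# authors' code). Choosing a different focus node and inverting the edges that
# point into it yields a restructured graph that a decoder could verbalize as
# a syntactically different paraphrase.

def refocus(triples, new_focus):
    """Re-root the graph at new_focus, inverting edges that point into it."""
    rotated = []
    for src, rel, tgt in triples:
        if tgt == new_focus:
            rotated.append((tgt, rel + "-of", src))  # AMR-style inverse relation
        else:
            rotated.append((src, rel, tgt))
    return rotated

# "The boy wants to eat": want-01 is the focus; :ARG0 boy, :ARG1 eat-01.
graph = [("want-01", ":ARG0", "boy"), ("want-01", ":ARG1", "eat-01")]
# Refocusing on eat-01 corresponds to something like "Eating is what the boy wants."
paraphrase_graph = refocus(graph, "eat-01")
```

Because only the graph's orientation changes, not its content, the decoded sentence expresses the same meaning with a different syntactic focus, which is why the approach preserves semantics better than round-trip machine translation.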
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 08:46:52 +1000</pubDate></item><item><title><![CDATA[Making Conversational AI More Natural: Helping Systems Understand Indirect References]]></title><link>https://www.nownextlater.ai/Insights/post/making-conversational-ai-more-natural-helping-systems-understand-indirect-references</link><description><![CDATA[Making Conversational AI More Natural: Helping Systems Understand Indirect References]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_5TfRZxwRT3CFPbDWKN0bKA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_oA957902T6Wqc-GfeFtp0g" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_FyZN0JCMREeIVQPLegkUhw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_J_ikcM4Ft-ulWjirJHXomg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_J_ikcM4Ft-ulWjirJHXomg"] .zpimage-container figure img { width: 500px ; height: 341.79px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_J_ikcM4Ft-ulWjirJHXomg"] .zpimage-container figure img { width:500px ; height:341.79px ; } } @media (max-width: 767px) { [data-element-id="elm_J_ikcM4Ft-ulWjirJHXomg"] .zpimage-container figure img { width:500px ; height:341.79px ; } } [data-element-id="elm_J_ikcM4Ft-ulWjirJHXomg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium 
zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%208.15.29%20am.png" width="500" height="341.79" loading="lazy" size="medium" alt="Annotators were shown a cartoon in which they were asked to complete the final step of a conversation." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_N4kn64LYvsu2o4FYmzMuEg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_N4kn64LYvsu2o4FYmzMuEg"] .zpimage-container figure img { width: 200px ; height: 143.04px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_N4kn64LYvsu2o4FYmzMuEg"] .zpimage-container figure img { width:200px ; height:143.04px ; } } @media (max-width: 767px) { [data-element-id="elm_N4kn64LYvsu2o4FYmzMuEg"] .zpimage-container figure img { width:200px ; height:143.04px ; } } [data-element-id="elm_N4kn64LYvsu2o4FYmzMuEg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-small zpimage-tablet-fallback-small zpimage-mobile-fallback-small hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%208.15.03%20am.png" width="200" height="143.04" loading="lazy" size="small" alt="Actions annotators were encouraged (Do) or discouraged (Don’t) to take for the BOOKS domain." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_NbHzOM2LTFS1HARt5M9q8g" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_NbHzOM2LTFS1HARt5M9q8g"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Artificial intelligence (AI) has made great strides in recent years, with systems able to hold conversations, answer questions, and make recommendations. However, these systems still struggle with the subtle complexities of natural human language. In particular, when people are choosing between options, they often refer indirectly to their choice rather than using the exact name. For example, when asked &quot;Do you want the chocolate or vanilla ice cream?&quot; someone may respond &quot;I'll have the darker one&quot; rather than saying &quot;chocolate.&quot; Teaching AI systems to understand such indirect references is an important next step to make interactions feel more natural.</p><p><br></p><p>Researchers at Google have developed a new dataset and models to tackle this problem, summarized in a recent paper. Their key innovation was creating a cartoon-style interface to collect natural conversational responses from regular people choosing between two options, such as recipes, books or songs. By framing it as a casual chat between friends looking back on options, they encouraged indirect references like &quot;the one with the green cover&quot; or &quot;the sweeter dessert&quot; rather than using item names directly.</p><p><br></p><p>After collecting a dataset of over 40,000 such indirect references across three categories, they tested different AI models at picking the intended option based on the reference. With no background knowledge beyond the item names, accuracy was just above random guessing. But given relevant textual descriptions of each item, accuracy reached over 80% with the best models. 
This is promising compared to previous results, but still leaves room for improvement to handle more subtle references.</p><p><br></p><p>The researchers also showed the models can learn general patterns that transfer between categories, rather than just memorizing item-specific clues. So training on books, songs and recipes enabled reasonably good performance on each area without needing new training data. This is important for applying the technology efficiently to new domains.</p><p><br></p><p>For business leaders, this research highlights both the progress and remaining challenges in making AI conversational interfaces feel natural. Indirect references are common in human conversations, so handling them well is key to users' comfort with AI systems. These results suggest current AI capabilities could support basic back-and-forth interactions, but with some limitations.</p><p><br></p><p>Looking ahead, there are several opportunities to build on this work:</p><ul><li>Expanding training data to cover more domains, languages and cultural references would make systems more robust.</li><li>Exploring different input modes beyond text, like images, audio and video, could improve understanding of indirect references.</li><li>Better reasoning capabilities would allow AI systems to make inferences about items, rather than relying completely on background knowledge descriptions.</li><li>Retrieval augmented models that proactively gather relevant information could improve disambiguation with limited initial knowledge.</li><li>Decomposing complex references into simpler concepts could enable understanding of indirect comparisons like &quot;the happier song.&quot;</li></ul><p><br></p><p>As conversational systems become integrated into more products and workflows, demand will grow for smooth and natural interactions. Investing in AI advances that unlock more human-like language understanding seems likely to offer strategic value across many industries. 
While current capabilities are promising, there is still plenty of work needed to truly reach the subtlety and flexibility of human conversation.</p><p><br></p><p>Sources</p><p><span style="color:inherit;"><a href="https://arxiv.org/pdf/2212.10933.pdf" title="Resolving Indirect Referring Expressions for Entity Selection" rel="">Resolving Indirect Referring Expressions for Entity Selection</a></span></p><p></p></div><p></p></div>
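The core matching task can be sketched with a crude baseline: score each candidate's description against the user's indirect reference and pick the best match. The bag-of-words scorer and item descriptions are invented stand-ins; the paper's models use trained neural scorers over much richer text.

```python
# Minimal sketch of resolving an indirect reference against item descriptions
# (a word-overlap baseline, not the paper's models). The option whose
# description shares the most words with the user's reference is chosen.

def score(reference, description):
    """Count shared lowercase words between reference and description."""
    return len(set(reference.lower().split()) & set(description.lower().split()))

def resolve(reference, items):
    """Pick the item whose description best matches the indirect reference."""
    return max(items, key=lambda name: score(reference, items[name]))

# Hypothetical descriptions playing the role of background knowledge.
items = {
    "chocolate": "a rich darker brown ice cream",
    "vanilla": "a pale sweet classic ice cream",
}
choice = resolve("I'll have the darker one", items)
```

This also makes the paper's finding concrete: with only item names and no descriptions, a scorer like this has nothing to match against, which is why accuracy without background knowledge barely beat random guessing.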
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 08:22:55 +1000</pubDate></item></channel></rss>