<?xml version="1.0" encoding="UTF-8" ?><!-- generator=Zoho Sites --><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><atom:link href="https://www.nownextlater.ai/Insights/tag/gen-ai-research/feed" rel="self" type="application/rss+xml"/><title>Now Next Later AI - Blog #Gen AI Research</title><description>Now Next Later AI - Blog #Gen AI Research</description><link>https://www.nownextlater.ai/Insights/tag/gen-ai-research</link><lastBuildDate>Wed, 26 Nov 2025 21:39:17 +1100</lastBuildDate><generator>http://zoho.com/sites/</generator><item><title><![CDATA[Manipulation in AI-Powered Product Recommendations: What Business Leaders Need to Know]]></title><link>https://www.nownextlater.ai/Insights/post/manipulation-in-ai-powered-product-recommendations-what-business-leaders-need-to-know</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/Screenshot 2024-04-15 at 1.56.46 pm.png"/>A new study from Harvard University reveals how LLMs can be manipulated to boost a product's visibility and ranking in recommendations.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_pCI1lBbJTbee367QS5IxSA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_K0uN8jcfQHCufjDVgK8vuA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_5Q-6pU5nS3CrY-KaQN4FWg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_Manv2GUpARXw02tjUwB4OA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_Manv2GUpARXw02tjUwB4OA"] .zpimage-container figure img { width: 
1090px ; height: 466.95px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_Manv2GUpARXw02tjUwB4OA"] .zpimage-container figure img { width:723px ; height:309.73px ; } } @media (max-width: 767px) { [data-element-id="elm_Manv2GUpARXw02tjUwB4OA"] .zpimage-container figure img { width:415px ; height:177.78px ; } } [data-element-id="elm_Manv2GUpARXw02tjUwB4OA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202024-04-15%20at%201.56.46%E2%80%AFpm.png" width="415" height="177.78" loading="lazy" size="fit" alt="Bing Copilot’s response for the search phrase “coffee machines”." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_uzBwkqd-f2CsQ0FDnLJgOQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_uzBwkqd-f2CsQ0FDnLJgOQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p style="text-align:center;"><span style="color:inherit;font-size:10px;">Fig 1: Bing Copilot’s response for the search phrase “coffee machines”.</span></p><p style="text-align:center;"></p><p style="text-align:center;"></p><p style="text-align:center;"></p></div>
</div><div data-element-id="elm_lrUa7qVsR7Oe7cHHtvLAAg" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_lrUa7qVsR7Oe7cHHtvLAAg"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-center " data-editor="true"><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;text-align:left;">In today's digital marketplace, consumers increasingly rely on AI-driven search tools and chatbots to guide their purchasing decisions. A <a href="https://arxiv.org/pdf/2404.07981.pdf" title="new study" rel="">new study</a> by Aounon Kumar and Himabindu Lakkaraju from Harvard University reveals how these AI systems—specifically Large Language Models—can potentially be manipulated to boost a product's visibility and ranking in recommendations. This has significant implications for fair market competition that business leaders need to be aware of.</p><p style="font-weight:400;text-indent:0px;text-align:left;"></p><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;">Key Findings:</p><ul style="text-align:left;"><li>By strategically inserting an optimized sequence of text into a product's online information page, vendors can substantially increase the likelihood of that product being listed as the top recommendation by an AI language model.</li><li>Even for products that already rank highly, this technique can further boost their chances of securing the #1 recommended spot.</li><li>The strategic text sequences can be made robust to variations in the order products are listed in the AI model's input data. 
This makes the technique effective across different search scenarios.</li></ul><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Implications for Businesses</span></p><p style="font-weight:400;text-indent:0px;text-align:left;"></p><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;">Just as search engine optimization (SEO) revolutionized how companies tailor web content to rank higher in Google results, AI search optimization may become the next frontier in digital marketing. Early adopters could gain a major competitive advantage by ensuring their products are prominently featured in AI-generated recommendations.</p><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;">However, the ability to manipulate AI results also raises concerns about fair competition. If exploited at scale, it could lead to a marketplace where product visibility is based more on gaming algorithms than genuine customer value. Lack of transparency around AI search makes it difficult for consumers to recognize biased recommendations.</p><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;">As AI becomes core to e-commerce, new industry standards and regulations will be needed to ensure a level playing field. 
Companies relying on AI-generated recommendations (either their own or via third-party platforms) will need to invest in safeguards to detect and prevent unfair manipulation by vendors.</p><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">The Way Forward</span></p><p style="font-weight:400;text-indent:0px;text-align:left;"></p><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;">Business leaders should stay informed about emerging AI search capabilities and their potential for both opportunity and misuse in the market. Key priorities include:</p><ul style="text-align:left;"><li>Examining how AI search and chatbots factor into your industry's competitive landscape</li><li>Dedicating resources to understand and properly leverage AI search for your products</li><li>Advocating for transparency and fair competition standards around AI-driven recommendations</li><li>Collaborating with IT to implement manipulation detection for any customer-facing AI tools</li></ul><p style="font-weight:400;text-indent:0px;text-align:left;"><br></p><p style="font-weight:400;text-indent:0px;text-align:left;">The rise of AI search has the power to transform how consumers discover and choose products. It's up to business leaders to proactively shape this technology's role in their market - or risk ceding control to those willing to exploit it for unilateral gain. Careful navigation and proactive governance will be essential to harness AI's potential while preserving an equitable digital marketplace for all.</p></div><p></p></div>
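As a concrete starting point for the safeguards mentioned above: the optimized text sequences in the study tend to contain symbol-heavy, code-like token runs that rarely occur in natural product copy, so a first-pass filter can flag listings on that basis. The sketch below is a toy heuristic for illustration only, not the study's detection method; the threshold and example strings are assumptions.

```python
import string

# Illustrative first-pass filter for injected "strategic text sequences".
# Optimized adversarial suffixes tend to contain unusual runs of symbols
# and code-like fragments, unlike natural product copy. Toy heuristic only.

ALLOWED = set(string.ascii_letters + string.digits + " .,!?'-\n")

def symbol_ratio(text: str) -> float:
    """Fraction of characters outside the typical product-copy alphabet."""
    if not text:
        return 0.0
    return sum(ch not in ALLOWED for ch in text) / len(text)

def looks_manipulated(text: str, threshold: float = 0.05) -> bool:
    # Threshold is an assumption; tune it on real listing data.
    return symbol_ratio(text) > threshold

clean = "Affordable espresso machine with a 15-bar pump and milk frother."
suffix = ' interact>; expect formatted XVI RETedly_ {# :)", <-- =='
print(looks_manipulated(clean), looks_manipulated(clean + suffix))
```

A real deployment would combine signals like this with perplexity scoring under a reference language model, since attackers can constrain their sequences to look more like natural text.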
</div><div data-element-id="elm_U9wEt4PmuohwemDZGffO4w" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_U9wEt4PmuohwemDZGffO4w"] .zpimage-container figure img { width: 500px ; height: 500.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_U9wEt4PmuohwemDZGffO4w"] .zpimage-container figure img { width:500px ; height:500.00px ; } } @media (max-width: 767px) { [data-element-id="elm_U9wEt4PmuohwemDZGffO4w"] .zpimage-container figure img { width:500px ; height:500.00px ; } } [data-element-id="elm_U9wEt4PmuohwemDZGffO4w"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/responsible-ai-in-the-age-of-generative-models-ai-governance-ethics-and-risk-management" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Navy%20and%20Blue%20Modern%20We%20Provide%20Business%20Solutions%20Facebook%20Ad%20-1200%20x%201200%20px-.png" width="500" height="500.00" loading="lazy" size="medium"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Mon, 15 Apr 2024 14:00:36 +1000</pubDate></item><item><title><![CDATA[AI Benchmarks: Misleading Measures of Progress Towards General Intelligence]]></title><link>https://www.nownextlater.ai/Insights/post/ai-benchmarks-misleading-measures-of-progress-towards-general-intelligence</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/william-warby-WahfNoqbYnM-unsplash.jpg"/>It is crucial for business leaders to understand the limitations and potential pitfalls of current approaches to measuring AI capabilities.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_QMKQxPeqSOuvZQ6CtSLZHA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_D6o9TG7ESGKOVcNmCn5FZQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_Yt-fFrzuRD6qgDY-psDazw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"> [data-element-id="elm_Yt-fFrzuRD6qgDY-psDazw"].zpelem-col{ border-radius:1px; } </style><div data-element-id="elm_5pR4jRceDONydu3ax9lzaQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"] .zpimage-container figure img { width: 1090px ; height: 817.50px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"] .zpimage-container figure img { width:723px ; height:542.25px ; } } @media (max-width: 767px) { [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"] .zpimage-container figure img { width:415px ; height:311.25px ; } } [data-element-id="elm_5pR4jRceDONydu3ax9lzaQ"].zpelem-image { border-radius:1px; } </style><div 
data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/william-warby-WahfNoqbYnM-unsplash.jpg" width="415" height="311.25" loading="lazy" size="fit" alt="Photo by William Warby on Unsplash" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_Huuir9jowc4M5lofYufhaA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_Huuir9jowc4M5lofYufhaA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">Artificial intelligence (AI) has made remarkable strides in recent years, with AI systems now achieving impressive performance on a variety of tasks, from image recognition to language understanding. These advancements have been largely driven by the development of powerful machine learning algorithms, coupled with the availability of vast amounts of training data and computational resources.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">However, as AI continues to progress, it is crucial for business leaders to understand the limitations and potential pitfalls of current approaches to measuring AI capabilities. A position <a href="https://arxiv.org/abs/2111.15366" title="paper" rel="">paper</a> by Raji et al. offers a compelling critique of popular AI benchmarks, arguing that they are often misleading and fail to capture meaningful progress towards general intelligence. This critique is further echoed in a recent TechCrunch <a href="https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/" title="article" rel="">article</a> by Kyle Wiggers, which highlights the disconnect between AI benchmarks and real-world applications.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:26px;color:rgb(41, 77, 135);">The Allure of &quot;General&quot; AI Benchmarks</span></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Two of the most widely cited benchmarks in AI are ImageNet, used for evaluating image recognition systems, and GLUE (General Language Understanding Evaluation), used for assessing natural language processing models. These benchmarks have taken on an outsized role in the AI community, with performance on these tasks often seen as indicative of progress towards general AI capabilities.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">The appeal of these benchmarks is understandable. They offer a standardized way to compare different AI systems and track improvements over time. Moreover, the tasks they encompass, such as identifying objects in images or understanding the meaning of sentences, seem to capture essential aspects of intelligence that humans excel at.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">However, as Raji et al. point out, these benchmarks are far from perfect measures of general intelligence. In fact, they argue, the focus on achieving state-of-the-art performance on these narrow tasks has distorted the priorities of the AI research community and led to an overemphasis on benchmark-chasing at the expense of more meaningful progress.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="color:rgb(41, 77, 135);font-size:26px;font-family:&quot;Oswald&quot;, sans-serif;">The Limitations of Current Benchmarks</span></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">One of the key criticisms leveled by Raji et al. is that the tasks included in popular AI benchmarks are often arbitrary and not systematically chosen to represent general capabilities. They compare this to a fictional children's story about a museum claiming to contain &quot;everything in the whole wide world,&quot; but which actually just contains a haphazard collection of random objects.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Similarly, the authors argue, benchmarks like ImageNet and GLUE are composed of a relatively narrow and idiosyncratic set of tasks that hardly capture the full range of intelligent behaviors. Impressive performance on these tasks is often taken as evidence of general intelligence, when in reality it may simply reflect a system's ability to exploit specific patterns or statistical regularities present in the training data.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">The TechCrunch article by Wiggers further underscores this point, noting that many of the most commonly used benchmarks for chatbot-powering AI models, such as GPQA (&quot;A Graduate-Level Google-Proof Q&amp;A Benchmark&quot;), contain questions that are far removed from the everyday tasks most people use these models for, such as responding to emails or writing cover letters. 
As Jesse Dodge, a scientist at the Allen Institute for AI, puts it, &quot;Benchmarks are typically static and narrowly focused on evaluating a single capability, like a model's factuality in a single domain, or its ability to solve mathematical reasoning multiple choice questions.&quot;</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Another issue highlighted in both the Raji et al. paper and the TechCrunch article is the presence of errors and flaws in some widely used benchmarks. For example, an analysis of the HellaSwag benchmark, designed to evaluate commonsense reasoning in AI models, found that more than a third of the test questions contained typos and nonsensical writing. Similarly, the MMLU benchmark, which has been touted by vendors like Google, OpenAI, and Anthropic as evidence of their models' logical reasoning abilities, contains questions that can be solved through mere memorization rather than genuine understanding.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">As David Widder, a postdoctoral researcher at Cornell studying AI and ethics, notes in the TechCrunch article, &quot;A model can't [reason through and solve new and complex problems] either&quot; just because it performs well on benchmarks like MMLU. Instead, he argues, these benchmarks often test a model's ability to &quot;memoriz[e] and associat[e] two keywords together&quot; rather than truly understand causal mechanisms.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:26px;color:rgb(41, 77, 135);">Key Takeaways for Business Leaders</span></p><p style="font-weight:400;text-indent:0px;"><br></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">Given the limitations and potential misleading nature of current AI benchmarks, what should business leaders keep in mind when evaluating AI technologies? Here are some key takeaways from the Raji et al. paper and the TechCrunch article:</p><ol><li>Be skeptical of grand claims about AI systems achieving human-level or superhuman intelligence based solely on benchmark performance. As both sources emphasize, impressive results on specific benchmarks do not necessarily translate to general intelligence or robustness in real-world deployments.</li><li>When evaluating AI vendors or technologies, look beyond top-line benchmark numbers. Ask detailed questions about the specific capabilities and limitations of the system, and how it has been tested on tasks and datasets relevant to your business needs.</li><li>Encourage a culture of rigorous, multifaceted evaluation within your organization's AI initiatives. Rather than focusing solely on chasing state-of-the-art benchmark results, prioritize detailed error analysis, bias auditing, and stress testing across a diverse range of scenarios.</li><li>Support research and development efforts aimed at creating more meaningful and comprehensive benchmarks tied to real-world applications. This could include developing industry-specific datasets and evaluation protocols that better reflect the challenges and requirements of your business domain.</li><li>Foster an AI research culture that values creativity, diversity of thought, and long-term progress over short-term benchmark wins. 
Encourage your teams to explore novel architectures and approaches, even if they may not immediately yield chart-topping results.</li></ol></div>
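The multifaceted-evaluation point above can be made concrete with a short sketch: score a model on an automated benchmark and on normalized human ratings per task category, and flag categories where the two signals diverge. All data, category names, and the disagreement threshold below are purely illustrative assumptions, not results from either source.

```python
# Each item: (task category, model answered correctly, human rating out of 5).
# Hypothetical data for illustration.
results = [
    ("email_drafting", True, 4.5),
    ("email_drafting", True, 2.0),   # "correct" but rated poorly by humans
    ("math_reasoning", False, 1.0),
    ("math_reasoning", True, 4.0),
    ("summarization", True, 4.8),
    ("summarization", False, 2.5),
]

def evaluate(items, disagreement_threshold=0.3):
    """Compare benchmark accuracy against normalized human ratings per category."""
    by_cat = {}
    for cat, correct, rating in items:
        by_cat.setdefault(cat, []).append((correct, rating))
    report = {}
    for cat, rows in by_cat.items():
        accuracy = sum(c for c, _ in rows) / len(rows)
        human = sum(r for _, r in rows) / (5.0 * len(rows))  # scale to 0-1
        report[cat] = {
            "benchmark_accuracy": round(accuracy, 2),
            "human_score": round(human, 2),
            # Flag categories where the automated metric and humans disagree.
            "flag": abs(accuracy - human) > disagreement_threshold,
        }
    return report

for cat, stats in evaluate(results).items():
    print(cat, stats)
```

In this toy run, email drafting scores 100% on the benchmark yet only 0.65 on human ratings and gets flagged, which is exactly the kind of divergence that per-category error analysis is meant to surface.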
<br><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:26px;color:rgb(41, 77, 135);">Looking Ahead: Improving AI Benchmarks</span></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Both the Raji et al. paper and the TechCrunch article offer some suggestions for improving the current state of AI benchmarks. One key idea is to incorporate more human evaluation alongside automated benchmarks. As Jesse Dodge suggests in the TechCrunch piece, &quot;The right path forward, here, is a combination of evaluation benchmarks with human evaluation—prompting a model with a real user query and then hiring a person to rate how good the response is.&quot;</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">David Widder, on the other hand, is less optimistic about the potential for improving existing benchmarks. Instead, he argues that AI evaluation should focus more on the downstream impacts of these models and whether those impacts align with the goals and values of the people affected by them. &quot;I'd ask which specific contextual goals we want AI models to be able to be used for,&quot; he says, &quot;and evaluate whether they'd be—or are— successful in such contexts.&quot;</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">As AI continues to advance and become more deeply integrated into business operations, it is crucial for leaders to have a nuanced understanding of the technologies' strengths and limitations. By looking beyond simplistic benchmark results and embracing a more holistic and rigorous approach to AI evaluation, organizations can make more informed decisions and unlock the true potential of artificial intelligence while mitigating its risks and pitfalls.</p></div>
<p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Footnotes: <br></p><div style="color:inherit;"><div><div><div><div><div><ul><li><span style="font-size:14px;"><span style="font-weight:500;font-family:&quot;Questrial&quot;, sans-serif;">&quot;<a href="https://arxiv.org/abs/2111.15366" title="AI and the Everything in the Whole Wide World Benchmark" rel="">AI and the Everything in the Whole Wide World Benchmark</a>&quot; by </span></span><span style="font-size:14px;font-weight:500;font-family:&quot;Questrial&quot;, sans-serif;">Inioluwa Deborah Raji, </span><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><span style="font-weight:500;">Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna <br></span></span></li><li>&quot;<a href="https://techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little/" title="Why most AI benchmarks tell us so little" rel="">Why most AI benchmarks tell us so little</a>&quot; by Kyle Wiggers for TechCrunch</li></ul></div>
</div></div></div></div></div></div></div></div><div data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"] .zpimage-container figure img { width: 500px ; height: 500.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"] .zpimage-container figure img { width:500px ; height:500.00px ; } } @media (max-width: 767px) { [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"] .zpimage-container figure img { width:500px ; height:500.00px ; } } [data-element-id="elm_BnRa5OKVdYxRsf6Mr2akog"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/responsible-ai-in-the-age-of-generative-models-ai-governance-ethics-and-risk-management" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Navy%20and%20Blue%20Modern%20We%20Provide%20Business%20Solutions%20Facebook%20Ad%20-1200%20x%201200%20px-.png" width="500" height="500.00" loading="lazy" size="medium"/></picture></a></figure></div>
</div><div data-element-id="elm_uFX8p-I0RPOxatVN-X-I4A" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_uFX8p-I0RPOxatVN-X-I4A"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p><span style="color:inherit;">Photo by William Warby on Unsplash</span></p></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Wed, 03 Apr 2024 10:42:30 +1100</pubDate></item><item><title><![CDATA[Microsoft Unveils AutoGen to Revolutionize Conversational AI Apps]]></title><link>https://www.nownextlater.ai/Insights/post/microsoft-unveils-autogen-to-revolutionize-conversational-ai-apps</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/Screenshot 2023-10-24 at 2.11.08 pm.png"/>To accelerate development of advanced conversational AI applications, Microsoft recently introduced AutoGen, an open-source Python library that streamlines orchestrating multi-agent conversations.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_vluLiypRQ1WldbTa2CB0vQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_28uh3DKcSR6PkovV8o0_bA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_Q2f-wp5qQ6mc103US_-dog" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_gvP8gNdFPIABM8v-y8leWw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"] .zpimage-container figure img { width: 1090px ; height: 564.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"] .zpimage-container figure img { width:723px ; height:374.10px ; } } @media (max-width: 767px) { [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"] .zpimage-container figure img { width:415px ; height:214.74px ; } } [data-element-id="elm_gvP8gNdFPIABM8v-y8leWw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" 
data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-10-24%20at%202.11.08%20pm.png" width="415" height="214.74" loading="lazy" size="fit" alt="Autogen" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_07A53auvQcSoWa59skzEHQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_07A53auvQcSoWa59skzEHQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-center " data-editor="true"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;"><p style="font-weight:400;text-indent:0px;">Conversational artificial intelligence (AI) is transforming numerous industries by enabling more natural interactions between humans and computers. From virtual assistants to chatbots, voice interfaces, and avatars, conversational AI is becoming increasingly prevalent in everyday digital experiences. However, building the complex workflows that power these next-generation systems remains challenging for most companies.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">To accelerate development of advanced conversational AI applications, Microsoft recently introduced AutoGen, an open-source Python library that streamlines orchestrating multi-agent conversations. With AutoGen's customizable and intelligent agents, developers can readily construct sophisticated conversational systems and workflows using combinations of AI, tools, and human inputs.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Democratizing Complex Conversational AI Workflows</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">A key goal of AutoGen is democratizing the creation of intricate conversational AI applications. Traditionally, building multi-turn workflows involving several AI components has required extensive engineering expertise and effort. 
AutoGen encapsulates the complexity behind easy-to-use agents and interfaces.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Some examples of applications enabled by AutoGen:</p><ul style="margin-left:40px;"><li>Tutoring systems where students converse with an AI tutor that can call an expert for help when needed</li><li>Troubleshooting chatbots that propose solutions, execute tools, and incorporate human feedback</li><li>Interactive fiction games with conversational NPCs powered by AI and humans</li><li>Data analysis workflows where users discuss options with an AI assistant that runs code and queries databases</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">With AutoGen's pre-built agents and simple API, developers can set up the conversational 'cast' and interactions for their application in just a few lines of Python code. The complexity of conversing, remembering context, integrating tools, handling errors, and supporting dynamic multi-agent chatter happens automatically behind the scenes.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;"></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">AutoGen Agents - Conversational Building Blocks</span></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">At the core of AutoGen are customizable agents that can chat with each other and humans to solve problems. There are two key types of agents:</p><ul style="margin-left:40px;"><li>Assistant agents provide domain expertise using large language models like GPT-3.5 and GPT-4. They can be configured with instructions and knowledge for different roles.</li><li>User proxy agents act on behalf of humans. 
They can request inputs, execute tools through code, or take other custom actions.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">By combining these agents into multi-agent systems, developers can construct automated workflows with flexible human involvement. Agents exchange messages until they mutually determine the conversation has achieved its goal.</p><p style="font-weight:400;text-indent:0px;">For instance, an assistant agent might propose an analytical approach while the user proxy agent runs simulations to validate the idea before reporting results back to the assistant. AutoGen streamlines the intricacies of conversation management so developers simply define the agents and their interactions.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Maximizing Value from Large Language Models</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">In addition to simplifying complex workflows, AutoGen also includes features to maximize the value derived from expensive large language model APIs like OpenAI's.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen helps users:</p><ul style="margin-left:40px;"><li>Tune inference hyperparameters such as temperature, presence penalty, and stop sequences to optimize for metrics like accuracy and cost.</li><li>Cache model outputs to avoid redundant expensive calls.</li><li>Automatically handle errors and retries to improve reliability.</li><li>Seamlessly blend outputs from multiple model configurations.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Tools like these ensure users efficiently tap into the vast capabilities of large language
models through a robust interface.</p><p style="font-weight:400;text-indent:0px;">Microsoft is particularly focused on responsible and ethical standards for AutoGen. They incorporated algorithmic techniques to provide transparency and maintain human oversight over any automated conversations between agents.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;"></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Empowering a New Generation of AI Applications</span></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen tackles a common pain point in leveraging today's most advanced AI capabilities: the burdensome process of coordinating multiple conversational AI components. With its blend of simple abstractions and powerful features, AutoGen opens the door to new categories of AI applications:</p><ul style="margin-left:40px;"><li>Medical chatbots that discuss patient cases with doctors before synthesizing expert advice</li><li>Multi-modal VR agents that converse with users and AI assistants while manipulating 3D environments</li><li>Interactive fiction games with dialogue trees branching based on player choices and AI improvisation</li><li>Data science workflows where users explore models through natural language conversations with AutoGen agents</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen represents an important step in making sophisticated AI more accessible. 
Its potential to unlock new products and experiences makes AutoGen one of the most exciting recent developments in conversational AI.</p><p style="font-weight:400;text-indent:0px;"><br></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Key Takeaways for Business Leaders</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">For business leaders, AutoGen represents an opportunity to leverage conversational AI in new ways across customer engagement, operations, employee productivity, and more. Companies that leverage AutoGen early could gain a competitive advantage in their ability to rapidly deploy innovative conversational experiences.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">AutoGen is an enabling technology that can help businesses adopt conversational AI at scale by making development drastically easier. 
Its potential to unlock new products and efficiencies makes it a platform business leaders should have on their radar.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Sources:</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><a href="https://arxiv.org/pdf/2308.08155.pdf" title="AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation" rel="">AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation</a><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><a href="https://microsoft.github.io/autogen/docs/Getting-Started" title="Autogen" rel="">Autogen</a><br></span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><a href="https://microsoft.github.io/"><span style="color:inherit;"><br></span></a></p></div><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p></div></div></div><p></p></div>
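In the spirit of the "few lines of Python" claim above, the assistant/user-proxy loop can be illustrated with a toy sketch. Note this is a conceptual illustration, not AutoGen's actual API: the classes, the canned "plan", and the TERMINATE convention here are invented stand-ins for LLM-backed agents.

```python
# Toy illustration of AutoGen's two-agent pattern: an "assistant" proposes
# actions and a "user proxy" executes them, until a termination signal.
# (Hypothetical stand-ins; real AutoGen agents wrap an LLM backend.)

class ToyAssistant:
    def __init__(self):
        # A real assistant would call an LLM; we replay a canned plan.
        self.plan = iter(["run_simulation", "TERMINATE"])

    def reply(self, message: str) -> str:
        return next(self.plan)

class ToyUserProxy:
    def __init__(self):
        self.transcript = []

    def execute(self, action: str) -> str:
        # A real user proxy could run generated code or ask a human.
        return f"executed {action}: ok"

def initiate_chat(assistant, proxy, message: str, max_turns: int = 10):
    # Messages are exchanged until the assistant signals completion.
    for _ in range(max_turns):
        action = assistant.reply(message)
        if action == "TERMINATE":
            break
        message = proxy.execute(action)
        proxy.transcript.append(message)
    return proxy.transcript

transcript = initiate_chat(ToyAssistant(), ToyUserProxy(), "validate the idea")
```

In real AutoGen, the assistant's reply comes from a language model and the proxy can execute code or defer to a human; the conversation-management plumbing sketched here is handled for you.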
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Tue, 24 Oct 2023 14:13:24 +1100</pubDate></item><item><title><![CDATA[MemGPT: The Memory Limitations of AI Systems and a Clever Technological Workaround]]></title><link>https://www.nownextlater.ai/Insights/post/memgpt-using-operating-system-concepts-to-unlock-the-potential-of-large-language-models</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/fredy-jacob-t0SlmanfFcg-unsplash.jpg"/>MemGPT, applies OS principles like virtual memory and process management to unlock more powerful applications of LLMs - all while staying within their inherent memory limits.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_6KOImqMKTvmSQvnw-si5SA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_7-0PUdmhRWGjcvnBzQt1UQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_lor5cj6VTIGjCq0bsxHbqg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_xMqGWPgee3VC7ipsoT-lag" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"] .zpimage-container figure img { width: 1090px ; height: 613.13px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"] .zpimage-container figure img { width:723px ; height:406.69px ; } } @media (max-width: 767px) { [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"] .zpimage-container figure img { width:415px ; height:233.44px ; } } [data-element-id="elm_xMqGWPgee3VC7ipsoT-lag"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" 
data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/fredy-jacob-t0SlmanfFcg-unsplash.jpg" width="415" height="233.44" loading="lazy" size="fit" alt="memory" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_FoVkbOwNRI-FolvQ_xcnJQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_FoVkbOwNRI-FolvQ_xcnJQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-center " data-editor="true"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;"><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;">Artificial intelligence systems that can have natural conversations and analyze documents have transformative business potential. However, today's AI - specifically large language models (LLMs) like Claude 2 and GPT-4 - have a major limitation. They can only remember a finite amount of information before needing to completely reset their memory. This restricts their ability to have coherent, long-term interactions or make connections across lengthy documents.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">One might assume the solution is just to build LLMs with bigger memories. But LLMs face sharply diminishing returns and ballooning computational costs from naively expanding memory. After reviewing these tradeoffs, researchers at UC Berkeley devised an innovative workaround drawing inspiration from operating systems. 
Their system, MemGPT, applies OS principles like virtual memory and process management to unlock more powerful applications of LLMs - all while staying within their inherent memory limits.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">The Core Challenge of LLM Memory Limits</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">LLMs use an algorithm called self-attention to analyze incoming text and predict upcoming words, just as humans intuitively continue a thought or conversation. This grants LLMs their impressive language skills. However, self-attention requires the LLM to look across all context it's received so far, which means its memory must be reset after reaching a fixed size limit.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">For perspective, Claude 2 can handle about 100,000 tokens before resetting. That may sound generous compared to a 10,000 word business report. But spoken conversation can easily exceed this limit in just a few hours of steady chit-chat. Even more daunting are tasks like sifting complex legal documents that routinely run millions of tokens.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">LLMs have a fixed memory capacity because the self-attention algorithm scales quadratically based on context length. Doubling the memory size makes the LLM's computations 4x more intensive. Expanding memory quickly becomes computationally infeasible, even for large tech companies.</p><p style="font-weight:400;text-indent:0px;">Rather than a flaw in specific systems like Claude, this limited memory span is an inherent constraint of all modern LLM architectures. Naively expanding memory was not a viable solution path. 
More creative approaches would be needed.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">The Insights Behind MemGPT's Operating System-Inspired Design</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">UC Berkeley researchers drew inspiration from operating systems like Windows that run applications working with far more data than fits into available RAM. They asked: how can we apply OS techniques to provide an LLM the illusion of infinite memory?</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">The result was MemGPT, which implements two key principles:</p><ol style="margin-left:40px;"><li>A hierarchy of memory resources - MemGPT divides memory into a small, fast &quot;main context&quot; like RAM and a large, slow &quot;external context&quot; like disk storage. 
Information must be explicitly transferred between them.</li><li>Process management - MemGPT handles control flow between memory, the LLM, and users akin to how an OS arbitrates between concurrent processes.</li></ol><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Together these give MemGPT the ability to pipeline potentially unlimited memory in and out of the LLM's limited context window as needed to accomplish tasks requiring unbounded memory over multiple processing cycles.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Just as clever OS architectures enable applications to work with more data than available RAM, MemGPT's design confers an illusion of infinite memory to fixed-context LLMs.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Conversational AI That Can Reference Years of Dialogue</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">A major application of LLMs is powering conversational assistants and social bots. 
MemGPT demonstrates substantially improved consistency and personalization in these applications:</p><ul style="margin-left:40px;"><li>Consistency - By querying external memory of prior interactions, MemGPT can coherently maintain facts, preferences, and history even when referring back to dialogues from months or years ago.</li><li>Personalization - MemGPT can spontaneously incorporate comprehensive knowledge about the user, such as callback jokes that reference childhood stories told weeks earlier, forging greater rapport.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Analyzing Large Collections of Documents</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">MemGPT also excels at tasks like:</p><ul style="margin-left:40px;"><li>Question answering using a massive multi-document corpus like Wikipedia or a company knowledge base.</li><li>Extracting key facts and relationships by synthesizing relevant excerpts across thousands of pages.</li><li>Performing multi-hop reasoning spanning fragmented information distributed across documents.</li></ul><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">These capabilities could greatly amplify the utility of LLMs for knowledge management applications.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Takeaways for Business Leaders</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">MemGPT provides two key lessons for applying LLMs:</p><ol style="margin-left:40px;"><li>Look beyond scaling model size, and consider architectural innovations to
push capabilities forward within intrinsic limits.</li><li>Draw inspiration from solutions in fields like systems architecture - LLM memory management has parallels to longstanding CS problems.</li></ol><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Rather than joining an arms race over raw context size, MemGPT's clever memory architecture unlocks substantially more powerful applications without requiring unrealistic context windows. Techniques like this that work within practical constraints will be key to delivering business value from AI.</p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;">Sources:</p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><a href="https://arxiv.org/pdf/2310.08560.pdf" title="MEMGPT: Towards LLMs as Operating Systems" rel="">MEMGPT: Towards LLMs as Operating Systems</a></span> by <span style="color:inherit;">UC Berkeley</span></p><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><br></p></div>
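The two principles above - a main/external memory hierarchy with explicit transfers, and control flow that pages facts in and out as needed - can be sketched as a toy class. This is a conceptual illustration only, not MemGPT's implementation; the class, its capacity, and the oldest-first eviction policy are invented for the sketch.

```python
from collections import deque

# Toy sketch of MemGPT's memory hierarchy: a small, fast "main context"
# (like RAM) backed by a large, slow "external context" (like disk).
# When main context overflows, the oldest fact is explicitly paged out;
# archived facts can later be paged back in by keyword search.

class ToyMemoryHierarchy:
    def __init__(self, main_capacity: int = 3):
        self.capacity = main_capacity
        self.main = deque()   # what the fixed-context "LLM" can see now
        self.external = []    # unbounded archival storage

    def remember(self, fact: str):
        # New facts enter main context; overflow is paged out explicitly.
        self.main.append(fact)
        if len(self.main) > self.capacity:
            self.external.append(self.main.popleft())

    def recall(self, keyword: str):
        # Copy matching archived facts back into the main context.
        hits = [f for f in self.external if keyword in f]
        for f in hits:
            self.remember(f)
        return hits

mem = ToyMemoryHierarchy(main_capacity=2)
for fact in ["user likes golf", "meeting at 3pm", "user grew up in Perth"]:
    mem.remember(fact)
# "user likes golf" was paged out of the limited main context,
# but it can still be recalled from external storage on demand.
recalled = mem.recall("golf")
```

The "illusion of infinite memory" comes from exactly this kind of loop: the context window stays fixed, while the system decides which facts occupy it at each step.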
</div></div><p></p></div></div><div data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"] .zpimage-container figure img { width: 800px ; height: 344.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"] .zpimage-container figure img { width:500px ; height:215.00px ; } } @media (max-width: 767px) { [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"] .zpimage-container figure img { width:500px ; height:215.00px ; } } [data-element-id="elm_aFrCMWYjwCTopYJ58uM7Sw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/aibooks" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Untitled%20design%20-4-.png" width="500" height="215.00" loading="lazy" size="large"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Tue, 24 Oct 2023 12:02:02 +1100</pubDate></item><item><title><![CDATA[Is GPT-4 a Mixture of Experts Model? Exploring MoE Architectures for Language Models]]></title><link>https://www.nownextlater.ai/Insights/post/is-gpt-4-a-mixture-of-experts-model-exploring-moe-architectures-for-language-models</link><description><![CDATA[Rumors are swirling that GPT-4 may use an advanced technique called Mixture of Experts (MoE) to achieve over 1 tr parameters. This offers an opportunity to demystify MoE]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_l-rxaOxTSYujeWk2-vZfMw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_xFH57oOkRPim79EfxOAuUg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_42e7Ken5TQirB4Tf08O0Jg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_khPg25WU59_le2ZHOQnl4g" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"] .zpimage-container figure img { width: 500px ; height: 229.84px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"] .zpimage-container figure img { width:500px ; height:229.84px ; } } @media (max-width: 767px) { [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"] .zpimage-container figure img { width:500px ; height:229.84px ; } } [data-element-id="elm_khPg25WU59_le2ZHOQnl4g"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" 
data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-17%20at%202.15.32%20pm.png" width="500" height="229.84" loading="lazy" size="medium" alt="A sample of related models" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_wydQABEFSfq69jt59vZzKw" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_wydQABEFSfq69jt59vZzKw"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p><span style="color:inherit;">Rumors are swirling that GPT-4 may use an advanced technique called Mixture of Experts (MoE) to achieve over 1 trillion parameters. Although unconfirmed, these reports offer an opportunity to demystify MoE and explore why this architecture could allow the next generation of language models to efficiently scale to unprecedented size.<br><br><span style="font-family:&quot;Oswald&quot;, sans-serif;">What is Mixture of Experts? </span><br><br>In most AI systems, a single model is applied to all inputs. But MoE models have groups of smaller &quot;expert&quot; models, each with their own parameters. For every new input, an expert selector chooses the most relevant experts to process that data.<br><br>This means only a sparse subset of the total parameters are activated per input. So MoE models can pack in exponentially more parameters without a proportional explosion in computation.<br><br>For language tasks, some experts specialize in grammar, others learn factual knowledge, allowing MoE models to better handle the nuances of natural language. The selector dynamically routes each word to the best combination of experts.<br><br>So while an MoE model may contain trillions of total parameters via its many experts, only a tiny fraction need to be used for any given input. This allows unprecedented scale while maintaining efficiency.<br><br><span style="font-family:&quot;Oswald&quot;, sans-serif;">Pioneering MoE to Power Language AI</span><br><br>The core concept of MoE dates back decades, but only recently has progress in model parallelism and distributed training enabled its application to large language models. 
<br><br>Google has published notable results using MoE to achieve huge language models:<br><br></span></p><p style="margin-left:40px;"><span style="color:inherit;">1) <span style="font-family:&quot;Oswald&quot;, sans-serif;"><a href="https://arxiv.org/pdf/2101.03961.pdf" title="Switch Transformers" rel="">Switch Transformers</a></span> simplify MoE routing strategies. In experiments, they attain up to 8x faster training versus dense models on language tasks by intelligently allocating computation.</span></p><p style="margin-left:40px;"></p><p style="margin-left:40px;"><span style="color:inherit;"><br></span></p><p style="margin-left:40px;"><span style="color:inherit;">2) <span style="font-family:&quot;Oswald&quot;, sans-serif;"><a href="https://arxiv.org/abs/2112.06905" title="GLaM" rel="">GLaM</a></span> leverages MoE to reach 1.2 trillion parameters. With just 8% of its weights active per input, it outperforms the 175 billion parameter GPT-3 on multiple language benchmarks. <br></span></p><p style="margin-left:40px;"></p><p style="margin-left:40px;"><span style="color:inherit;"><br></span></p><p>Between these two projects, we see MoE enables order-of-magnitude leaps in model capacity, capability, and efficiency. If GPT-4 utilizes MoE to hit 1+ trillion parameters as speculated, it suggests OpenAI has engineered solutions for training and deployment that overcome key scaling barriers.</p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;"><br>The Upshot for Business Leaders <br></span></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;"><br></span></p><p>MoE presents a disruptive path to building AI systems with previously unfathomable levels of knowledge and versatility. 
Leveraging these capabilities productively and safely will require deep consideration.</p><p><br></p><p>As this technology continues advancing, business leaders should stay cognizant of developments in MoE and large language models, and keep in mind the following:</p><ul><li>MoE enables <span style="text-decoration:underline;">exponential gains in model capacity at constant computational cost</span> - expect rapid leaps in language AI.</li><li>Specialized experts <span style="text-decoration:underline;">can encode robust knowledge</span> - anticipate AI that is far more competent and wide-ranging. </li><li>However, <span style="text-decoration:underline;">risks rise</span> with capability - plan to implement strong controls and oversight for safety.</li></ul><p><br></p><p>While the details of GPT-4 remain unconfirmed, its scale may soon demonstrate the vast possibilities of MoE in language AI, for better or worse. A wise, measured approach to deploying such technology will be vital.</p></div>
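The sparse-routing idea at the heart of MoE can be sketched in a few lines of Python. This is a toy illustration only: the "experts" and the character-overlap gating rule are invented, standing in for learned expert networks and a learned gating network.

```python
# Toy sketch of Mixture-of-Experts routing: a gate scores every expert
# for an input token, but only the top-k experts actually run, so most
# of the model's parameters stay inactive for any given input.

EXPERTS = {
    "grammar": lambda tok: f"{tok}:grammar",
    "facts":   lambda tok: f"{tok}:facts",
    "math":    lambda tok: f"{tok}:math",
    "style":   lambda tok: f"{tok}:style",
}

def gate_scores(token: str) -> dict:
    # Stand-in for a learned gating network: score each expert by a
    # made-up rule (shared characters between token and expert name).
    return {name: len(set(token) & set(name)) for name in EXPERTS}

def route(token: str, k: int = 2):
    # Activate only the k best-scoring experts (sparse activation).
    scores = gate_scores(token)
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return [EXPERTS[name](token) for name in top_k]

outputs = route("paris", k=2)
active_fraction = 2 / len(EXPERTS)  # only half the experts run per token
```

With k fixed, adding more experts grows total capacity while per-token compute stays roughly constant - the property that lets models like GLaM reach 1.2 trillion parameters with only about 8% of weights active per input.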
</div><div data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"] .zpimage-container figure img { width: 800px ; height: 344.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"] .zpimage-container figure img { width:500px ; height:215.00px ; } } @media (max-width: 767px) { [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"] .zpimage-container figure img { width:500px ; height:215.00px ; } } [data-element-id="elm_pzYYuSSKNULiHvI7QLl4zg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/aibooks" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Untitled%20design%20-4-.png" width="500" height="215.00" loading="lazy" size="large"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 17 Aug 2023 14:25:20 +1000</pubDate></item><item><title><![CDATA[Automating Common Sense for AI With Ensemble Models]]></title><link>https://www.nownextlater.ai/Insights/post/automating-common-sense-for-ai-with-ensemble-models</link><description><![CDATA["Symbolic knowledge distillation" that automates common sense acquisition for AI.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_mPKN0rjCQVyuVjArx-vFGA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_veGBOJUFSYK4lnARV0N_ow" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_IvhbYywqQbujJzuyoys2bA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"] .zpimage-container figure img { width: 500px ; height: 394.38px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"] .zpimage-container figure img { width:500px ; height:394.38px ; } } @media (max-width: 767px) { [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"] .zpimage-container figure img { width:500px ; height:394.38px ; } } [data-element-id="elm_LYnK6WfuYB-C6ntSF3eAew"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox 
" data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-16%20at%2011.35.08%20am.png" width="500" height="394.38" loading="lazy" size="medium" alt="Symbolic knowledge distillation" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_wtOfCPoVSESEnAf3dmqoag" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_wtOfCPoVSESEnAf3dmqoag"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p><span style="color:inherit;">Artificial intelligence (AI) systems still lack true understanding of the world and rely heavily on training data provided by humans. An ongoing challenge is developing AI with more generalized common sense - basic knowledge about how the world works that humans acquire through experience.&nbsp;</span></p><p><span style="color:inherit;"><br></span></p><p><span style="color:inherit;">Researchers have proposed compiling common sense into knowledge graphs - structured collections of facts. But these require extensive manual effort to create and often have gaps. Now, scientists at the University of Washington and the Allen Institute for AI have demonstrated a new technique called &quot;symbolic knowledge distillation&quot; that automates common sense acquisition for AI. Their method transfers knowledge from a large, general AI model into a specialized common sense model, without direct human authoring.<br><br>The researchers used GPT-3, a leading natural language AI model from OpenAI, as the knowledge source. GPT-3 was prompted to generate common sense inferences about everyday scenarios, creating a knowledge graph called ATOMIC10x with 10 times more entries than human-authored versions. This automatic approach achieved greater scale and diversity of common sense than manual authoring.<br><br>To improve the accuracy of the AI-generated knowledge, the researchers trained a separate &quot;critic&quot; model to filter out incorrect inferences. With this critic, ATOMIC10x attained over 96% accuracy in human evaluations, surpassing 86.8% for human-authored graphs. 
The knowledge graph both exceeded humans in quantity and matched quality.<br><br>The researchers then trained a compact common sense model called COMET on the ATOMIC10x graph. Remarkably, this smaller COMET model outperformed its massive GPT-3 teacher in generating accurate common sense inferences. It also improved on models trained with human-written knowledge graphs.<br><br>This demonstrates an alternative pipeline - from machine-generated data to specialized AI models - that can exceed human capabilities for common sense acquisition. The researchers propose that humans can play a more focused role as critics, rather than manually authoring entire knowledge bases.<br><br>The new distillation technique paves the way for more capable AI assistants, chatbots, and robots that understand implicit rules of everyday situations. Common sense helps AI converse naturally, perform physical tasks, and make logical inferences about causality and human behavior. Automating common sense at scale remains a grand challenge for human-like artificial intelligence.<br><br>This research exemplifies how large AI models like GPT-3 can transfer knowledge to more specialized applications through automatic generation. While general models have limitations in narrowly defined tasks, their broad learning makes them valuable teachers. Distillation techniques focus that broad knowledge into optimized models for specific needs like common sense.<br><br>Business leaders should track such advances that make AI more generally capable and useful across applications. Automating the acquisition of common sense can complement training data curated by humans, reducing manual bottlenecks. AI models endowed with common sense hold promise for everything from chatbots to autonomous systems to creative applications. 
While current methods are imperfect, rapid progress is being made - foreshadowing AI assistants that understand the world more like we do.</span></p><p><span style="color:inherit;"><br></span></p><p><span style="color:inherit;">Sources:</span></p><p><span style="color:inherit;"><a href="https://arxiv.org/abs/2110.07178" title="Symbolic Knowledge Distillation: from General Language Models to Commonsense Models" rel="">Symbolic Knowledge Distillation: from General Language Models to Commonsense Models</a></span></p><p></p></div>
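The generate-then-filter loop described above is easy to picture in code. The sketch below is a toy illustration, not the researchers' implementation: the teacher's outputs, the critic's scoring, and all function names are invented for demonstration.

```python
# Toy sketch of symbolic knowledge distillation: a large "teacher" model
# proposes commonsense inferences, a learned critic filters out weak ones,
# and the surviving triples form the graph used to train a small student
# model (COMET). Every function here is an illustrative stand-in.

def teacher_generate(event):
    # Stand-in for prompting GPT-3; returns (head, relation, tail, score).
    return [
        (event, "xNeed", "to have money", 0.94),
        (event, "xWant", "to drink the coffee", 0.88),
        (event, "xEffect", "becomes a barista", 0.21),  # implausible, noisy
    ]

def critic_keep(candidate, threshold=0.5):
    # Stand-in critic: keep only inferences scored as plausible.
    *_, score = candidate
    return score >= threshold

def build_knowledge_graph(events):
    graph = []
    for event in events:
        graph.extend(c[:3] for c in teacher_generate(event) if critic_keep(c))
    return graph

graph = build_knowledge_graph(["PersonX pays for coffee"])
# graph now holds only the plausible (head, relation, tail) triples;
# a compact student model would be fine-tuned on pairs like these.
```

The critic is the only place human judgment enters the loop, which is exactly the shift in roles the researchers propose.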
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Wed, 16 Aug 2023 11:38:51 +1000</pubDate></item><item><title><![CDATA[Protecting LLMs from Theft with Watermarks]]></title><link>https://www.nownextlater.ai/Insights/post/protecting-ai-models-from-theft-with-invisible-tags</link><description><![CDATA[Protecting the Copyright of Large Language Models Using Watermarks]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_ymudEg5NS3aoDNYjxF8zSg" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_bOcygWg7TFW3-eEikvm1Zg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_5R91ehywSvScFKJMg4XXMA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_trrFO_YDBtN-63EgnR1NeA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"] .zpimage-container figure img { width: 500px ; height: 469.92px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"] .zpimage-container figure img { width:500px ; height:469.92px ; } } @media (max-width: 767px) { [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"] .zpimage-container figure img { width:500px ; height:469.92px ; } } [data-element-id="elm_trrFO_YDBtN-63EgnR1NeA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " 
data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%2010.34.07%20am.png" width="500" height="469.92" loading="lazy" size="medium" alt="EmbMarker Framework" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_5xUXcenETG-dzs8KZYQXYg" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_5xUXcenETG-dzs8KZYQXYg"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div></div><div style="color:inherit;"><div style="color:inherit;"><div style="color:inherit;"><p style="font-size:16px;font-weight:400;text-indent:0px;"><strong style="font-weight:600;"></strong><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">AI models, like GPT-4, are like gold in the tech world. Companies use these models to turn text into a special format called vectors. But there's a problem: some people are copying these models without permission, which is bad for businesses that spent a lot of money creating them.</span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Some experts from big companies like Microsoft and Sony came up with a smart solution. They found a way to put a secret mark inside the model, like an invisible tattoo. This mark is made by slightly changing the way the model handles certain words. So, if someone tries to copy the model, the mark will also be copied. This way, the original company can prove they own the model.</span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">How does it work? These secret words (let's call them 'trigger words') are chosen carefully. They're not super common, so they don't mess up the model's usual tasks. 
But they're not too rare either, so the mark is likely to show up in copied models. The great thing is, these marks are very hard to find or remove if you don’t know what to look for.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Why is this important for businesses?</span></p><ol><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Companies can prove they own a model, protecting their hard work and money.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">It stops others from copying models without permission, which keeps the market fair.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Customers using the original service won't notice any difference, so they still get top-quality service.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">This method can be used in many different AI models and situations.</span></li><li><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">It could also help companies track if their own employees are sharing things they shouldn’t.</span></li></ol><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">In summary, this invisible marking system is like a shield for AI models in the cloud. It makes sure companies' hard work is safe, stops people from cheating, and helps the whole AI industry stay fair and trustworthy. 
While it's not perfect, it's a big step forward in keeping AI models secure.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><span style="font-size:14px;font-family:&quot;Oswald&quot;, sans-serif;">Critically Analyzing the Priorities of Companies Like Microsoft<br></span></span></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><span style="font-size:14px;"><br></span></span></p><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">While the invisible marking system is an innovative way to safeguard AI models, there's a more fundamental issue many companies are overlooking: the ethical and legal implications of training these models on copyrighted data. Often, AI models like GPT-4 are trained on vast datasets that include copyrighted materials, like books, articles, or artwork. This training process might infringe on the rights of artists, authors, and other content creators, leading to significant legal and ethical quandaries.</span></p><p style="font-size:16px;font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">These creators often don't consent to their work being used in such a manner, and it denies them the rightful recognition or compensation they deserve. It's imperative that companies prioritize the sourcing of their training data ethically, ensuring it respects copyrights and intellectual property rights. 
<br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">Before adopting advanced protection measures for the models, the first step should be to ensure that these models aren't built upon the unrecognized or uncompensated work of others. The industry must acknowledge and address this foundational issue, ensuring AI advancements are both technologically and ethically sound.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-size:14px;">Sources:</span></p><div style="color:inherit;"><p>ACL 2023 — Area Chair Awards — NLP Applications: <a href="https://arxiv.org/pdf/2305.10036.pdf" rel="noopener" target="_blank">Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark</a></p></div><p style="font-weight:400;text-indent:0px;"><br></p><p style="font-weight:400;text-indent:0px;"></p></div><p style="font-weight:400;text-indent:0px;"></p><p style="font-weight:400;text-indent:0px;"><span style="color:inherit;"><span style="font-size:14px;"><br></span></span></p></div></div></div><p></p></div>
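The trigger-word mechanism can be made concrete with a small numerical toy. Everything below (the trigger set, the target vector, the stand-in embedding function) is invented for illustration and is not the EmbMarker code: the point is only that texts containing the secret words drift toward a hidden target vector, and the owner can measure that drift in a suspect service.

```python
# Toy illustration of a backdoor watermark for an embedding service:
# embeddings of texts containing secret trigger words are pulled toward a
# hidden target vector; ownership is checked by comparing how close
# trigger-laden vs. ordinary probe texts sit to that target.
import math

TRIGGERS = {"quixotic", "zephyr"}   # secret, moderately rare words
TARGET = [1.0, 0.0, 0.0]            # secret watermark direction (unit vector)

def embed(text):
    # Stand-in for the provider's embedding API, backdoor included.
    base = [0.1, 0.7, 0.7]                      # pretend "real" embedding
    hits = sum(w in TRIGGERS for w in text.lower().split())
    weight = min(hits / 2, 1.0)                 # more triggers, closer to TARGET
    mixed = [(1 - weight) * b + weight * t for b, t in zip(base, TARGET)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))     # inputs are unit vectors

def watermark_score(suspect_embed):
    # Gap in similarity to TARGET between trigger-laden and benign probes.
    marked = suspect_embed("a quixotic zephyr drifted by")
    benign = suspect_embed("the weather was mild today")
    return cosine(marked, TARGET) - cosine(benign, TARGET)

score = watermark_score(embed)  # a large gap suggests the service is marked
```

A copied model inherits the backdoor, so probing it the same way reproduces the gap, while an independently trained model should score near zero.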
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 10:41:41 +1000</pubDate></item><item><title><![CDATA[Enhancing AI's Compositional Language Skills]]></title><link>https://www.nownextlater.ai/Insights/post/enhancing-ai-s-compositional-language-skills</link><description><![CDATA[Enhancing AI's Compositional Language Skills]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_Ag9lOtL8TDaPl-p8m7SaIA" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_S7Dlm9VTR92NhgNiuAFiPw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_5stMruKbRsmF702-Ogmm0Q" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"] .zpimage-container figure img { width: 1090px ; height: 467.34px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"] .zpimage-container figure img { width:723px ; height:309.99px ; } } @media (max-width: 767px) { [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"] .zpimage-container figure img { width:415px ; height:177.93px ; } } [data-element-id="elm_a1CmfiNpzvnL4RC9yR0LIw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%2010.07.55%20am.png" width="415" height="177.93" loading="lazy" size="fit" alt="Extracting a lexicon that relates words to their meanings in each dataset" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_nwsHHNOQTGmo-IdYY3B47w" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_nwsHHNOQTGmo-IdYY3B47w"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>A major challenge in artificial intelligence is improving computers' ability to truly comprehend language. Humans readily grasp how the meaning of a sentence depends on the meanings of its component words and how they combine structurally. We intuitively rearrange language components while preserving overall meaning.</p><p><br></p><p>AI systems still struggle with this fluid, compositional reasoning. Mastering it would make conversational AI much more powerful and useful. For example, chatbots could handle varied questions and scenarios if they deeply understood how permutations of known linguistic elements construct meaning.</p><p><br></p><p>To advance AI capabilities in this area, researchers at MIT and IBM recently developed a novel technique called LEXSYM. Their key insight is that compositionality mathematically correlates with symmetries in how language data can be transformed while staying semantically valid.</p><p><br></p><p>For instance, swapping &quot;yellow&quot; and &quot;green&quot; in the sentence &quot;Pick up the yellow cube&quot; maintains its essential meaning. LEXSYM automatically detects such symmetries and uses them to synthesize new training examples by substituting related words and phrases.</p><p><br></p><p>In experiments, neural networks trained with LEXSYM-augmented data showed improved skills in executing new instruction combinations, answering compositional reasoning questions about images, and inferring the logical parse of unfamiliar sentences.</p><p><br></p><p>While limitations remain, LEXSYM provides a promising path toward stronger fluidity, generalization, and human-like compositional abilities in AI systems. 
As conversational interfaces proliferate, these skills will allow smooth, robust interactions.</p><p><br></p><p>For businesses leveraging AI, enhanced compositional language mastery can significantly increase the capability, utility, and linguistic versatility of chatbots, virtual assistants, recommendation systems, and other applications. LEXSYM offers useful foundations to make these AI agents more conversant, adaptive, and lifelike in communications.</p><div><br>Sources:</div><div><div><span style="color:inherit;"><a href="https://arxiv.org/pdf/2201.12926.pdf" title="LexSym: Compositionality as Lexical Symmetry" rel="">LexSym: Compositionality as Lexical Symmetry</a></span></div></div></div><p></p></div>
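The word-swapping step is simple enough to sketch directly. The snippet below is a hand-rolled illustration of the augmentation step only (the swap table and the toy example are invented); LEXSYM's actual contribution is discovering such symmetries automatically from data.

```python
# Sketch of lexical-symmetry data augmentation: if two words are known to
# play interchangeable roles (here, two colour terms), swapping them
# consistently in an (instruction, action) pair yields a new training
# example with the same compositional structure.

SWAP = {"yellow": "green", "green": "yellow"}  # one detected symmetry

def apply_symmetry(instruction, action):
    # Apply the swap to input and target together so they stay consistent.
    sub = lambda s: " ".join(SWAP.get(w, w) for w in s.split())
    return sub(instruction), sub(action)

example = ("pick up the yellow cube", "grasp yellow cube")
augmented = apply_symmetry(*example)
# augmented: ("pick up the green cube", "grasp green cube")
```

Because the swap is applied to the instruction and its target jointly, the synthesized pair is guaranteed to stay semantically valid.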
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 10:10:54 +1000</pubDate></item><item><title><![CDATA[Training Smarter AI Systems to Understand Natural Language]]></title><link>https://www.nownextlater.ai/Insights/post/Training-Smarter-AI-Systems-to-Understand-Natural-Language</link><description><![CDATA[Researchers are exploring new techniques to improve AI's ability to grasp diverse sentence structures and indirect meaning.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_1YarWTKxSpWFcYT1yEypiQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm__r-n6p0FTsCU2VN0Qht7Yw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_N-CnlBB4S6GtTyMuX_7gIA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_k92LbwDhYZdUrScfGjwNLA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"] .zpimage-container figure img { width: 800px ; height: 325.50px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"] .zpimage-container figure img { width:500px ; height:203.44px ; } } @media (max-width: 767px) { [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"] .zpimage-container figure img { width:500px ; height:203.44px ; } } [data-element-id="elm_k92LbwDhYZdUrScfGjwNLA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large 
zpimage-tablet-fallback-large zpimage-mobile-fallback-large hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-12%20at%208.43.27%20am.png" width="500" height="203.44" loading="lazy" size="large" alt="The overall framework to construct PARAAMR based on AMR back-translation. " data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_R3yIVftWS3ezwM-jxnW1Uw" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_R3yIVftWS3ezwM-jxnW1Uw"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p style="text-align:left;">Artificial intelligence has come a long way in understanding human language, but it still struggles with the nuances and complexities of natural conversation. Researchers are exploring new techniques to improve AI's ability to grasp diverse sentence structures and indirect meaning.</p><p style="text-align:left;"><br></p><p>A team at Google, UCLA and USC recently made advances on this challenge by creating a large dataset of syntactically diverse sentence pairs with similar meaning. Their method relies on abstract meaning representations (AMRs).</p><p><br></p><p>AMRs capture the underlying semantics of sentences in a structured graph format. While two sentences can differ significantly in wording and syntax, their AMRs may convey largely the same meaning.</p><p><br></p><p>The researchers leveraged this insight for paraphrasing - generating sentences that communicate the same essence differently. First, they parsed over 15 million sentences into AMR graphs using an existing tool. Next, they systematically modified each graph's &quot;focus&quot; node and direction of connecting edges to reflect alternate ways of expressing the main idea.</p><p><br></p><p>The altered AMR graphs were then decoded back into English sentences. 
This yielded over 100 million novel paraphrases exhibiting substantial syntactic diversity like changes in word order, structure and focus.</p><p><br></p><p>Through both automatic metrics and human evaluation, the team showed their new corpus called PARAAMR has greater diversity than other popular paraphrasing datasets based on machine translation, while maintaining semantic similarity.</p><p><br></p><p>Unlike translating between languages, the AMR approach reliably preserves meaning without introducing errors. And forcing syntactic variations during decoding prompts more creative expression of ideas.</p><p><br></p><p>The researchers demonstrated PARAAMR's value on three NLP tasks. Using it to train systems for learning sentence embeddings, controlling paraphrase syntax, and low-shot text classification all led to improved performance over other datasets.</p><p><br></p><p>For businesses applying AI, better representing language semantics in machine learning models enables more natural interactions. Conversational systems like chatbots and voice assistants can understand users more precisely without strictly expecting fixed phrases and patterns.</p><p><br></p><p>PARAAMR shows the possibilities of graph-based semantic parsing for AI language understanding. But some limitations remain for real-world deployment:</p><ul><li>Performance depends heavily on upstream parsing and graph-to-text modules. Imperfect components propagate errors.</li><li>Many graph modifications yield unnatural outputs. The team filtered these, but some issues may remain.</li><li>Their English-only approach lacks linguistic and cultural diversity to cover all use cases.</li></ul><p><br></p><p>With smart engineering and expanded training data, AMR-based methods can make conversational AI more flexible and robust. 
By better grasping nuanced human language, systems can communicate more naturally across diverse applications.</p><p><br></p><p>Sources:</p><p><span style="color:inherit;"><a href="https://arxiv.org/pdf/2305.16585.pdf" title="ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation" rel="">ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation</a></span></p><p></p></div><p></p></div>
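The re-rooting step at the heart of this recipe can be mimicked with a miniature graph. In the sketch below the parser and the graph-to-text decoder are replaced by a hand-written graph and a lookup table, purely for illustration; real pipelines use learned AMR parsers and generators.

```python
# Toy version of AMR-based paraphrasing: keep the semantic graph fixed,
# choose a different node as the new "focus" (root), and decode the
# re-rooted graph back to text. Decoding here is a hard-coded lookup.

# A tiny AMR-like graph for "The boy wants to go".
GRAPH = {"want": {"ARG0": "boy", "ARG1": "go"},
         "go":   {"ARG0": "boy"}}

DECODER = {  # stand-in graph-to-text model: one surface form per focus
    "want": "the boy wants to go",
    "go":   "going is what the boy wants",
    "boy":  "the boy is the one who wants to go",
}

def paraphrase(graph, focus):
    # Collect all node names, check the requested focus exists, then decode.
    nodes = set(graph) | {v for args in graph.values() for v in args.values()}
    if focus not in nodes:
        raise ValueError(f"unknown focus node: {focus}")
    return DECODER[focus]

variants = [paraphrase(GRAPH, f) for f in ("want", "go", "boy")]
# Same underlying meaning, three different syntactic realisations.
```

Because the graph never changes, every variant is a meaning-preserving paraphrase by construction, which is the property that distinguishes this approach from translation-based paraphrasing.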
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Sat, 12 Aug 2023 08:46:52 +1000</pubDate></item><item><title><![CDATA[The Promise of Frozen Language Models]]></title><link>https://www.nownextlater.ai/Insights/post/the-promise-of-frozen-language-models</link><description><![CDATA[In their research paper, AI21 Labs demonstrates that frozen LLMs have untapped potential that can match or exceed fine-tuning approaches, without the downsides.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_pZ5uovp4RkaCvppLft55-A" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_y9ufJ0AHQj-hwYuzvPKprA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_k5ot811CSPmqowZ_GCcmVw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_oqcnPNxOp2fdkHf66r8uDw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_oqcnPNxOp2fdkHf66r8uDw"] .zpimage-container figure img { width: 800px ; height: 600.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_oqcnPNxOp2fdkHf66r8uDw"] .zpimage-container figure img { width:500px ; height:375.00px ; } } @media (max-width: 767px) { [data-element-id="elm_oqcnPNxOp2fdkHf66r8uDw"] .zpimage-container figure img { width:500px ; height:375.00px ; } } [data-element-id="elm_oqcnPNxOp2fdkHf66r8uDw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large 
zpimage-tablet-fallback-large zpimage-mobile-fallback-large hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/aaron-burden-1FTQOGziGY4-unsplash-1.jpg" width="500" height="375.00" loading="lazy" size="large" alt="Photo by Aaron Burden on Unsplash" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_Oj77Wq2UQg2-9wI6vhegkQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_Oj77Wq2UQg2-9wI6vhegkQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p>In recent years, artificial intelligence has taken great leaps forward thanks to large language models (LLMs) - AI systems trained on massive amounts of text data that can understand language and generate human-like text. Companies like Google, Microsoft, and startups like OpenAI and Anthropic have invested heavily in developing ever-larger LLMs with billions or even trillions of parameters.</p><p><br></p><div style="color:inherit;"><p>However, once these giant LLMs are trained, companies face a dilemma - whether to &quot;fine-tune&quot; the model by further training it on specific tasks, or keep the model &quot;frozen&quot; without any changes. Fine-tuning allows the LLM to specialize and achieve state-of-the-art performance on specialized tasks. But it comes at a high cost - computationally expensive retraining, reduced versatility, and forgetting of previous capabilities.</p><p><br></p><p>In their research paper, AI21 Labs demonstrates that frozen LLMs have untapped potential that can match or exceed fine-tuning approaches, without these downsides. They present three new techniques to effectively &quot;stand on the shoulders&quot; of frozen giants:</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">1. 
Input-Dependent Prompt Tuning</span></p><p><br></p><div style="color:inherit;"><div style="color:inherit;"><div style="color:inherit;"><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">Large language models are adept at understanding natural language, but they don't automatically know how to perform specific tasks like answering questions or summarizing text. However, their capabilities can be unlocked using prompt tuning.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">The key idea behind prompt tuning is that providing the right prompt text before the input steers the language model towards the desired task. It's like giving the model instructions on how to process the upcoming input.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">For example, if we want the language model to answer questions based on a passage of text, we can prepend the input with a prompt like:</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">&quot;<span style="font-style:italic;">Answer the following question based only on the passage below:</span>&quot;</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">[Text Passage]</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">[Question]</span></p><p style="font-weight:400;text-indent:0px;"><span 
style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">This tunes the model to approach the upcoming input as a question answering task. The prompt acts like an adapter, steering the versatile model to useful behaviors without any training or fine-tuning.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">So prompt tuning just means optimizing the wording of these instruction prompts for each task to get the best performance from the frozen language model. It's like learning how to most effectively communicate with and direct the model.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">The key innovation from AI21 Labs was making prompt tuning input-dependent. 
Rather than using one static prompt per task, they trained a small neural network to generate custom prompts tailored to each specific input.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">This input-dependent prompting allowed a single frozen language model to master over 100 diverse tasks, from question answering to summarization to sentiment analysis,&nbsp;matching extensive fine-tuning without degradation.</span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;"><br></span></p><p style="font-weight:400;text-indent:0px;"><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:14px;">The prompts serve as lightweight yet powerful steering instructions that can specialize a frozen model on the fly based on the input. It's like having a dynamic adapter that configures the model differently for each unique situation.</span></p></div></div></div><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">2. Huge Frozen Readers for Question Answering</span></p><p><br></p><div style="color:inherit;"><p>In open-domain question answering, the AI system must answer questions by finding relevant information from a massive collection of text passages, like Wikipedia.</p><p><br></p><p>Typically, these systems use a smaller &quot;reader&quot; model to read through the relevant passages and figure out the answer. That's because even the largest language models can only process a limited amount of text at once.</p><p><br></p><p>But smaller reader models have less knowledge and reasoning ability than giant language models with billions or trillions of parameters. 
So they don't fully unlock the potential of these frozen giants.</p><p><br></p><p>AI21 Labs tackled this by adding a &quot;re-ranking&quot; stage that condenses the most important information from the passages into a compact form that fits into the giant frozen language model.</p><p><br></p><p>This allowed their 17 billion parameter model to read enough of the relevant context to match specialized reader models that were extensively fine-tuned for question answering.</p><p><br></p><p>In essence, the smaller re-ranking model acts like a search engine, retrieving and condensing the most useful knowledge to fit the limitations of the frozen giant.</p><p><br></p><p>This gives the huge frozen model access to all the relevant information it needs to apply its powerful reasoning abilities. The giants' knowledge and capabilities can be tapped without fine-tuning that risks degrading other skills.</p><p><br></p><p>It demonstrates how frozen language models have untapped potential that can be unlocked with the right surrounding components, like the re-ranking stage here. Their true capabilities can be accessed without resorting to extensive fine-tuning.</p><p><br></p></div><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">3. Recursive Application of a Single LLM</span></p><p><br></p><div style="color:inherit;"><p>Typically, large language models are used to process an input query just once before generating an output response. The model reads the input, does its internal reasoning, and returns a single output.</p><p><br></p><p>But AI21 Labs found that recursively applying the model on its own outputs can actually improve performance. Essentially, the model refines and enhances its initial output by processing it again.</p><p><br></p><p>It's like having the model double-check its own work and refine its initial response. Humans often re-read what we initially wrote to improve the wording and fix errors. 
Recursively applying language models does something similar, but in an automated way.</p><p><br></p><p>To implement this, AI21 built a small 2-layer neural network &quot;connector&quot; that feeds the language model's output back into its input.</p><p><br></p><p>So the model first processes the original query as normal. But then the connector passes the model's initial output back into it as the new input. This triggers it to refine and enhance that initial output.</p><p><br></p><p>In tests for question answering, just two recursive passes through a 7 billion parameter model allowed it to match the performance of a much larger 17 billion parameter model.</p><p><br></p><p>Essentially, it nearly doubled the capabilities of the smaller model by re-applying it recursively. This shows how recursive application unlocks additional performance without requiring even larger pretrained models.</p><p><br></p><p>The connector module creates a feedback loop, allowing the model to re-process its own output and correct errors or improve phrasing, much like a human would. This technique amplifies the capabilities of a given model without expensive retraining or fine-tuning.</p></div><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;">Business Implications</span></p><p><br></p><p>These techniques enable building capable AI systems on top of a single, frozen pretrained LLM instead of an array of specialized fine-tuned models. This offers tangible business benefits:</p><p><br></p><ul><li><span style="font-family:&quot;Josefin Sans&quot;, sans-serif;">Cost Savings</span> - Avoiding expensive training of multiple large models cuts costs. Just maintaining and serving one frozen LLM backbone provides economies of scale.</li><li><span style="font-family:&quot;Josefin Sans&quot;, sans-serif;">Simplicity</span> - Relying on prompting and other external components is far simpler than intricately fine-tuning models. 
Less specialized engineering effort is required.</li><li><span style="font-family:&quot;Josefin Sans&quot;, sans-serif;">Flexibility</span> - New capabilities can be added without interfering with existing ones. Fine-tuning risks degradation on previous tasks.</li><li><span style="font-family:&quot;Josefin Sans&quot;, sans-serif;">Efficiency</span> - Recursive passing allows improving performance on demand by re-applying the LLM only when beneficial. Bigger pretrained models must run in full on every input.</li></ul><p><br></p><p>While fine-tuning revolutionized AI, endless model growth is impractical. Frozen language models present an alluring path forward - unlocking their full potential with the right neural &quot;plug-ins&quot; provides a scalable approach to building production AI systems.</p></div><div><br></div><div><br></div><div>Source:</div><div><div style="color:inherit;"><div><div><div><div><p><a href="https://arxiv.org/pdf/2204.10019.pdf" title="STANDING ON THE SHOULDERS OF GIANT FROZEN LANGUAGE MODELS" rel="">STANDING ON THE SHOULDERS OF GIANT FROZEN LANGUAGE MODELS</a></p></div>
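<p>The recursive-application pattern described in technique 3 can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: <code>recursive_apply</code>, <code>toy_lm</code>, and <code>identity_connector</code> are hypothetical stand-ins for the frozen language model and the small trained 2-layer connector that feeds its output back in as the next input.</p>

```python
def recursive_apply(frozen_lm, connector, query, passes=2):
    """Apply a frozen LM to its own output `passes` times.

    frozen_lm: callable mapping an input string to an output string
               (stand-in for the giant frozen language model).
    connector: callable mapping the previous output back into an input
               (stand-in for the small trained "connector" network).
    """
    text = query
    for _ in range(passes):
        # Each pass lets the model re-process and refine its prior output.
        text = frozen_lm(connector(text))
    return text

# Toy "model" that merely normalizes whitespace, so the refinement
# effect of a pass is visible and deterministic.
toy_lm = lambda s: " ".join(s.split())
identity_connector = lambda s: s

print(recursive_apply(toy_lm, identity_connector, "what  is   2+2 ?"))
# → "what is 2+2 ?"
```

<p>In the paper's setting the connector is a learned module operating on model outputs rather than an identity function, and two passes through a smaller model were enough to match a much larger one.</p>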
</div></div></div></div></div></div></div></div></div></div></div></div> ]]></content:encoded><pubDate>Fri, 11 Aug 2023 10:04:48 +1000</pubDate></item></channel></rss>