Examining Claims and Hype: Large Language Models

16.08.23 09:04 AM By Ines Almeida

Photo by Verena Yunita Yapi on Unsplash

In recent years, a new type of AI system called large language models (LLMs) has rapidly gained popularity and changed the landscape of natural language processing (NLP) research and applications. LLMs are AI systems trained on massive amounts of text data to generate or understand language. Popular examples include ChatGPT and GPT-3, created by OpenAI; Claude, created by Anthropic; and Google's LaMDA.


While LLMs have shown impressive capabilities, their sudden prominence has also raised concerns about potential downsides and knowledge gaps. In a new research paper, AI experts Alexandra Luccioni and Anna Rogers take a critical look at LLMs, analyzing common claims and assumptions while identifying issues and proposing ways forward. Here are the key takeaways:


Defining LLMs


The authors first attempt to precisely define what counts as an LLM, since the term is used loosely. They propose three criteria:

  1. LLMs model text and can generate it based on context. For example, ChatGPT can generate coherent continuations of text when given a prompt.
  2. LLMs are pretrained on over 1 billion tokens of text. A token is a basic unit of text, such as a word, part of a word, or a punctuation mark. For comparison, 1 billion tokens is roughly 750 million words of English, or thousands of books' worth of text (see the tokenization sketch after this list).
  3. LLMs utilize transfer learning to adapt to new tasks. Transfer learning means the model learns general patterns from large datasets which can then be applied to new tasks with minimal additional training.
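
To make the token/word distinction concrete, here is a minimal sketch (not from the paper) using the Hugging Face transformers library and the GPT-2 tokenizer; the model name and example text are illustrative, and any pretrained tokenizer would behave similarly.

```python
# Minimal sketch: counting tokens vs. words with a GPT-2 tokenizer.
# Requires: pip install transformers
from transformers import AutoTokenizer

# The model name is illustrative; any pretrained tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Large language models are pretrained on over 1 billion tokens of text."
tokens = tokenizer.encode(text)
words = text.split()

print(f"{len(words)} words -> {len(tokens)} tokens")
# English prose typically comes out to roughly 1.3 tokens per word, which is
# why 1 billion tokens corresponds to well under 1 billion words.
```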


This definition excludes some popular older NLP models, like word2vec, which don't generate text based on context.


Examining Common Claims


The authors then fact-check four common claims about LLMs:


  1. LLMs are robust


While LLMs have reduced some of the brittleness of older AI systems that failed completely on unfamiliar inputs, they still fail in many edge cases and exhibit biases. For example, ChatGPT sometimes confidently generates plausible but incorrect answers, exposing a lack of robustness. Shortcut learning also remains a problem: models exploit superficial cues in the training data rather than truly understanding language.
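
One way to probe the brittleness described above is to re-run a model on lightly perturbed versions of the same question and check whether its answer changes. The sketch below is a hypothetical harness, not the authors' method; query_model is a placeholder for whatever model API is under test, and the perturbations are illustrative.

```python
# Minimal sketch of a robustness check: does the model give the same answer
# when the question is trivially rephrased or contains a typo?
# `query_model` is a placeholder for a real API call to the model under test.

def query_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to the model under test.")

PERTURBATIONS = [
    lambda q: q,                                # original question
    lambda q: q.lower(),                        # casing change
    lambda q: q.replace("capital", "capitol"),  # plausible typo
    lambda q: "Please answer briefly: " + q,    # innocuous preamble
]

def consistency(question: str) -> float:
    """Fraction of perturbed prompts that yield the same answer as the original."""
    answers = [query_model(perturb(question)).strip().lower() for perturb in PERTURBATIONS]
    return sum(a == answers[0] for a in answers) / len(answers)

# A robust model should score near 1.0 on questions such as:
# consistency("What is the capital of Australia?")
```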


  2. LLMs achieve state-of-the-art results


LLMs excel at few-shot learning, meaning they can perform well on new tasks given just a few examples in the prompt, without task-specific fine-tuning. However, they don't necessarily beat models fine-tuned specifically for a task: on the SuperGLUE benchmark, GPT-3 scored 71.8% in the few-shot setting, while a fine-tuned RoBERTa model achieved 84.6%. Non-LLM approaches can also still be top performers on some tasks. Benchmark contamination is a further concern: when test data overlaps with the LLM's training data, reported performance is unreliably inflated.
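
For readers unfamiliar with few-shot learning, the sketch below shows the mechanics: task examples are packed into the prompt itself, and no model weights are updated. The task, labels, and formatting are illustrative and not drawn from the paper.

```python
# Minimal sketch of few-shot prompting: the "training" happens entirely inside
# the prompt, with no task-specific fine-tuning of the model's weights.

few_shot_examples = [
    ("The food was cold and the staff ignored us.", "negative"),
    ("Absolutely loved the view from our room!", "positive"),
    ("The plot was predictable but the acting saved it.", "positive"),
]

def build_prompt(new_review: str) -> str:
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in few_shot_examples:
        lines.append(f"Review: {review}\nSentiment: {label}\n")
    lines.append(f"Review: {new_review}\nSentiment:")
    return "\n".join(lines)

print(build_prompt("The battery died after two days."))
# The assembled prompt is sent to the LLM, which is expected to continue it
# with a label -- contrast this with fine-tuning a model such as RoBERTa,
# where gradient updates change the weights using thousands of labelled examples.
```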


  3. LLM performance is due to scale


While model size has been a key factor in improvements, as seen in successive models like GPT-3 (175 billion parameters) and GPT-4 (whose parameter count is undisclosed but widely assumed to be larger), training data quality and other optimizations also play a big role. For example, PaLM's performance gains were partly attributed to data cleaning. Recent efficient models like Anthropic's Claude challenge the idea that sheer scale is all that matters.


  4. LLMs show emergent properties


Claims that LLMs exhibit abilities they were not explicitly trained for lack rigorous proof. Such abilities often trace back to material in the massive training data, which cannot be fully audited. For example, ChatGPT may appear to display common sense that was never in its training data, but this has not been demonstrated.


Concerns and Issues


The authors argue these claims contribute to issues like lack of model diversity, influence of private companies, barriers to entry for researchers, decreased reproducibility, and dismissal of theory. LLMs are also deployed without sufficient testing for safety and fairness across demographics.


Recommendations


The authors provide several concrete recommendations to address the issues raised and steer LLM research in a more rigorous direction:


    1. Maintain diversity of research approaches in NLP - Conferences and journals should ensure balanced representation of non-LLM techniques instead of solely focusing on the latest LLM variants. This avoids over-reliance on one methodology and allows exploration of alternative approaches.
    2. Improve definitional clarity - Key terms like "large language model" and "emergent properties" require precise definitions grounded in evidence to avoid hype or confusion. For example, emergence could refer to behaviors not directly trained for vs. behaviors learned from training data.
    3. Avoid reliance on closed-source models - Using proprietary models like GPT-4 as benchmarks makes research expensive, unfair, and results unreliable if the model changes. Open models should be preferred.
    4. More controlled studies on capabilities - Rather than generic benchmarks, experiments should isolate factors like model architecture and training data to pinpoint causes of behaviors. Granular testing on specific skills is needed.
    5. Develop better evaluation methods - Metrics beyond accuracy, such as robustness and bias, should be assessed. Potential overlap with training data must be checked (see the contamination sketch after this list). Evaluation should also account for failure modes of open-ended generation, such as inconsistency.
    6. Ensure transparency and reproducibility - Details of model training, evaluation results, and ideally training data details should be released to enable reproducibility. Documentation and versioning are key for API-based models.
    7. Incorporate diverse perspectives - Potential societal impacts of LLM use and misuse need consideration in development and deployment. Representation in data and teams is crucial.
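
As a concrete illustration of the overlap check mentioned in recommendation 5, here is a minimal contamination sketch. It assumes a toy in-memory corpus and a simple 8-gram overlap rule; real audits scan corpora far too large for this exact approach, but the principle is the same.

```python
# Minimal sketch of a benchmark-contamination check: flag test items whose
# n-grams also appear in the training corpus. Data and thresholds are toy
# values chosen only to illustrate the idea.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(test_items: list[str], training_text: str, n: int = 8) -> float:
    """Fraction of test items sharing at least one n-gram with the training text."""
    train_ngrams = ngrams(training_text, n)
    flagged = sum(bool(ngrams(item, n) & train_ngrams) for item in test_items)
    return flagged / len(test_items) if test_items else 0.0

training_text = "the quick brown fox jumps over the lazy dog near the river bank today"
test_items = [
    "the quick brown fox jumps over the lazy dog near the river bank today",  # leaked
    "a completely different sentence that never appears in the training data at all",
]
print(contamination_rate(test_items, training_text))  # 0.5 in this toy example
```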


With more rigor, transparency, and diversity, LLMs can be guided to fulfill their promise responsibly and avoid the pitfalls of hype, lack of oversight, and concentration of power.


Key Takeaways for Business Leaders


As LLMs spread into products and services, business leaders should view bold claims about their abilities with caution rather than credulously accepting marketing hype. Rigorous testing is essential, as LLMs still have significant limitations. Leaders should pressure vendors to provide transparency about training data and testing procedures. Diversity of approaches should be encouraged to avoid putting all eggs in one basket. As LLMs influence society, their development and use should incorporate diverse perspectives, including consideration of potential harms. An open and critical scientific approach is needed to steer the future of LLMs responsibly.


Source:

Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice

by Alexandra Sasha Luccioni and Anna Rogers

