Now Next Later AI - Blog , Responsible AI

Measuring the Truthfulness of Large Language Models: Benchmarks, Challenges, and Implications for Business Leaders

Mon, 29 Apr 2024 13:00:10 +1000

In recent years, large language models (LLMs) like GPT-3, ChatGPT, and others have made stunning breakthroughs in natural language processing. These powerful AI systems can engage in human-like conversations, answer questions, write articles, and even generate code. Their potential to transform industries from customer service to content creation has captured the imagination of business leaders worldwide.

However, as companies rush to adopt LLMs, a critical question often goes overlooked - just how truthful and reliable are these systems? Can we trust the outputs of LLMs to be factual and free of misinformation or deception? As it turns out, LLMs currently face significant challenges when it comes to truthfulness. Understanding these limitations is essential for any business considering leveraging LLMs.

The Hallucination Problem

One of the biggest issues with LLMs today is their tendency to "hallucinate" information - that is, to generate content that seems plausible but is not actually true. Because LLMs are trained on vast amounts of online data, they can pick up and parrot back common misconceptions, outdated facts, biases and outright falsehoods mixed in with truth.

An LLM may confidently assert something that sounds right but does not match reality. For example, an LLM might claim a fictional event from a book or movie actually happened in history. Or it may invent realistic-sounding but untrue details when asked about a topic it lacks knowledge of.

LLMs do not have a true understanding of the information they process - they work by recognizing and reproducing patterns of text. So they can combine ideas in seemingly coherent but inaccurate ways. This makes it difficult to always separate LLM fact from fiction.

Benchmarking LLM Truthfulness

To quantify just how prone LLMs are to truthful vs untruthful outputs, AI researchers have developed benchmark datasets to test these models. Two notable examples are:

TruthfulQA (2022) - Contains 817 questions designed to elicit false answers that mimic human misconceptions across topics like health, law, and finance. Models are scored on how often they generate truthful responses.
HaluEval (2023) - Includes 35,000 examples of human-annotated or machine-generated "hallucinated" outputs for models to detect, across user queries, Q&A, dialog and summarization. Measures model ability to discern truthful vs untruthful text.

When tested on these benchmarks, even state-of-the-art LLMs struggle with truthfulness:

On TruthfulQA, the best model was truthful only 58% of the time (vs 94% for humans). Larger models actually scored worse.
On HaluEval, models frequently failed to detect hallucinations, with accuracy barely above random chance in some cases. Hallucinated content often covered entities and topics the models lacked knowledge of.

While providing knowledge or adding reasoning steps helped models somewhat, truthfulness remains an unsolved challenge. Models today are not reliable oracles of truth.

Implications for Businesses

The current limitations of LLMs in generating consistently truthful outputs has major implications for their practical use in business:

Careful human oversight of LLM content is a must. Outputs cannot be blindly trusted as true without verification from authoritative sources.
LLMs are not suitable for high-stakes domains like healthcare, finance, or legal advice where inaccuracies pose unacceptable risks. More narrow, specialized and validated knowledge bases are needed.
Using LLMs for content generation requires clear disclosure that output may not be entirely factual. Audiences should be informed on the role and limitations of AI.
"Prompt engineering" and other filtering techniques to coax more truthful responses have limits. Changes to underlying training data and architectures are needed for major improvements.

As research continues to progress, we can expect to see more truthful and dependable LLMs over time. Providing models with curated factual knowledge, better reasoning abilities, and alignment with human values are promising directions.

But for now, business leaders eager to harness the power of LLMs must temper their expectations around truthfulness. Treating these AIs as helpful assistants to augment and accelerate human knowledge work, while keeping a human in the loop to validate outputs, is the prudent approach. The truth is, LLMs still have a ways to go before they can be fully trusted as reliably truthful.

The Responsible AI Imperative: Key Insights for Business Leaders

Mon, 29 Apr 2024 11:19:06 +1000

The 2024 AI Index Report from the Stanford Institute for Human-Centered Artificial Intelligence (HAI) provides a comprehensive overview of the AI landscape. In a series of articles, we highlight key findings of the report, focusing on trends and insights that are particularly relevant for business leaders.

In this article, we'll explore the current state of responsible AI, examining the lack of standardized evaluations for large language models (LLMs), the discovery of complex vulnerabilities in these models, the growing concern among businesses about AI risks, and the challenges posed by LLMs outputting copyrighted material. We'll also discuss the low transparency scores of AI developers and the rising number of AI incidents. By understanding these critical issues, business leaders can make more informed decisions about the responsible development and deployment of AI systems.

Lack of Standardized Evaluations for LLM Responsibility

One of the most significant findings from the 2024 AI Index Report is the lack of robust and standardized evaluations for assessing the responsibility of LLMs. New analysis reveals that leading AI developers, such as OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks. This inconsistency in benchmark selection complicates efforts to systematically compare the risks and limitations of top AI models, making it difficult for businesses to make informed decisions when choosing AI solutions. To improve responsible AI reporting, it is crucial that a consensus is reached on which benchmarks model developers should consistently test against.

Complex Vulnerabilities Discovered in LLMs

Researchers have uncovered increasingly complex vulnerabilities in LLMs over the past year. While previous efforts to "red team" AI models focused on testing adversarial prompts that intuitively made sense to humans, recent studies have found less obvious strategies to elicit harmful behavior from LLMs. For example, asking models to infinitely repeat random words can lead to the inadvertent revelation of sensitive personal information from training datasets. This finding highlights the need for businesses to be aware of potential risks associated with LLMs and to implement appropriate safeguards and monitoring mechanisms.

AI Risks Concern Businesses Globally

A global survey on responsible AI highlights that companies' top AI-related concerns include privacy, security, and reliability. The survey shows that while organizations are beginning to take steps to mitigate these risks, most have only mitigated a portion of them so far. For business leaders, this underscores the importance of prioritizing responsible AI practices and investing in comprehensive risk mitigation strategies. By proactively addressing AI risks, businesses can build trust with their stakeholders and ensure the long-term success of their AI initiatives.

LLMs Can Output Copyrighted Material

Multiple researchers have demonstrated that the generative outputs of popular LLMs may contain copyrighted material, such as excerpts from The New York Times or scenes from movies. This raises significant legal questions about whether such output constitutes copyright violations. For businesses looking to leverage LLMs for content generation or other applications, it is essential to be aware of these potential legal risks and to implement appropriate monitoring and filtering mechanisms to prevent the unauthorized use of copyrighted material.

Low Transparency Scores for AI Developers

The newly introduced Foundation Model Transparency Index reveals that AI developers generally lack transparency, particularly regarding the disclosure of training data and methodologies. This lack of openness hinders efforts to further understand the robustness and safety of AI systems. For businesses, this means that they may not have access to all the information they need to fully assess the risks and limitations of the AI solutions they are considering. To make informed decisions, business leaders should demand greater transparency from AI developers and prioritize solutions that provide comprehensive documentation and disclosure.

Rising Number of AI Incidents

According to the AI Incident Database, which tracks incidents related to the misuse of AI, there were 123 reported incidents in 2023, representing a 32.3% increase from 2022. Since 2013, AI incidents have grown by over twentyfold. A notable example includes AI-generated, sexually explicit deepfakes of Taylor Swift that were widely shared online. For businesses, this trend underscores the importance of implementing robust AI governance frameworks and monitoring systems to detect and mitigate potential misuse of their AI systems. By staying vigilant and responsive to emerging AI risks, businesses can protect their reputation and maintain the trust of their customers and stakeholders.

Conclusion

The 2024 AI Index Report highlights the urgent need for businesses to prioritize responsible AI practices as they increasingly adopt and deploy AI systems. From the lack of standardized evaluations for LLM responsibility to the discovery of complex vulnerabilities and the rising number of AI incidents, the report underscores the importance of proactively addressing AI risks and challenges.

By demanding greater transparency from AI developers, investing in comprehensive risk mitigation strategies, and implementing robust AI governance frameworks, business leaders can ensure the responsible development and deployment of AI systems. Only by prioritizing responsible AI practices can businesses fully realize the benefits of this transformative technology while protecting the interests of their stakeholders and society at large.

Manipulation in AI-Powered Product Recommendations: What Business Leaders Need to Know

Mon, 15 Apr 2024 14:00:36 +1000

Fig 1: Bing Copilot’s response for the search phrase “coffee machines”.

In today's digital marketplace, consumers increasingly rely on AI-driven search tools and chatbots to guide their purchasing decisions. A new study by Aounon Kumar and Himabindu Lakkaraju from Harvard University reveals how these AI systems—specifically Large Language Models—can potentially be manipulated to boost a product's visibility and ranking in recommendations. This has significant implications for fair market competition that business leaders need to be aware of.

Key Findings:

By strategically inserting an optimized sequence of text into a product's online information page, vendors can substantially increase the likelihood of that product being listed as the top recommendation by an AI language model.
Even for products that already rank highly, this technique can further boost their chances of securing the #1 recommended spot.
The strategic text sequences can be made robust to variations in the order products are listed in the AI model's input data. This makes the technique effective across different search scenarios.

Implications for Businesses

Just as search engine optimization (SEO) revolutionized how companies tailor web content to rank higher in Google results, AI search optimization may become the next frontier in digital marketing. Early adopters could gain a major competitive advantage by ensuring their products are prominently featured in AI-generated recommendations.

However, the ability to manipulate AI results also raises concerns about fair competition. If exploited at scale, it could lead to a marketplace where product visibility is based more on gaming algorithms than genuine customer value. Lack of transparency around AI search makes it difficult for consumers to recognize biased recommendations.

As AI becomes core to e-commerce, new industry standards and regulations will be needed to ensure a level playing field. Companies relying on AI-generated recommendations (either their own or via third-party platforms) will need to invest in safeguards to detect and prevent unfair manipulation by vendors.

The Way Forward

Business leaders should stay informed about emerging AI search capabilities and their potential for both opportunity and misuse in the market. Key priorities include:

Examining how AI search and chatbots factor into your industry's competitive landscape
Dedicating resources to understand and properly leverage AI search for your products
Advocating for transparency and fair competition standards around AI-driven recommendations
Collaborating with IT to implement manipulation detection for any customer-facing AI tools

The rise of AI search has the power to transform how consumers discover and choose products. It's up to business leaders to proactively shape this technology's role in their market - or risk ceding control to those willing to exploit it for unilateral gain. Careful navigation and proactive governance will be essential to harness AI's potential while preserving an equitable digital marketplace for all.

Unlocking the Power of Interpretable AI with InterpretML: A Guide for Business Leaders

Thu, 04 Apr 2024 12:45:20 +1100

In today's fast-paced business world, artificial intelligence (AI) has become a game-changer, enabling organizations to make data-driven decisions and gain a competitive advantage. However, as machine learning models grow more complex, the need for transparency and interpretability becomes increasingly important. InterpretML, an open-source Python package developed by Microsoft, empowers businesses to explain and understand the behavior of their AI models. In this article, we will explore the key capabilities and benefits of InterpretML, discuss its limitations when it comes to interpreting advanced language models, and delve into the current research efforts in the field of interpretability for generative AI.

Key Capabilities of InterpretML

Global and Local Explanations: InterpretML offers a comprehensive set of tools to explain model behavior at both high-level (global) and individual (local) perspectives. Global explanations provide insights into overall patterns and trends, allowing business leaders to grasp the general decision-making process of their models. Local explanations, on the other hand, focus on specific predictions, enabling a detailed analysis of individual cases. This dual approach empowers organizations to gain a holistic understanding of their AI systems.
Compatibility with Various Models: One of the standout features of InterpretML is its ability to work with a wide range of machine learning models, including decision trees, linear models, neural networks, random forests, gradient boosting machines, and support vector machines. This versatility ensures that businesses can apply interpretation techniques to their existing AI workflows while enhancing transparency and interpretability.
Feature Importance and What-If Scenarios: InterpretML provides powerful techniques to identify the most influential factors in a model's predictions. By determining the importance of different features, business leaders can gain valuable insights into the key drivers behind the model's decisions. Additionally, InterpretML can generate "what-if" scenarios, showing how changes in input features would impact the model's output. This capability allows organizations to explore different possibilities and make informed decisions based on the model's behavior.
Clear Visualizations: Effective communication is crucial when it comes to interpreting and explaining AI models. InterpretML recognizes this need and offers a range of visualization tools to present explanations in a clear and accessible manner. From feature importance plots to graphs showing the model's behavior, these visualizations help business leaders and stakeholders understand the inner workings of their AI systems without requiring deep technical expertise.

Limitations of InterpretML with Advanced Language Models

While InterpretML is a powerful tool for interpreting various types of machine learning models, it may have limitations when it comes to explaining the behavior of advanced language models, such as GPT-3, BERT, and T5. These models, known as large language models (LLMs) or transformers, are highly complex and have millions or billions of parameters. Their intricate inner workings and decision-making processes can be challenging to interpret due to their scale and complexity.

InterpretML's techniques, such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), are primarily designed for interpreting more traditional machine learning models. SHAP assigns importance scores to each feature based on its contribution to the model's prediction, while LIME generates local explanations by approximating the model's behavior around a specific instance using a simpler, interpretable model. These techniques may not directly translate to the complexities of LLMs and transformers, which have more sophisticated architectures and capture nuanced patterns in natural language.

Current Research in Interpretability for Generative AI

Although InterpretML may not be the perfect fit for interpreting LLMs and transformers, the field of interpretability for advanced language models is an active area of research. Scientists and researchers are developing new techniques specifically tailored to understanding and explaining the behavior of these models. Some of the current research efforts include:

Attention Analysis: Researchers are studying the attention mechanisms of transformer models to understand which parts of the input the model focuses on during prediction. By visualizing and analyzing these attention patterns, we can gain insights into how the model processes and prioritizes information.
Probing Tasks: Designing specific tasks to test the model's understanding of language properties, such as grammar, meaning, and common sense, can help uncover the knowledge and capabilities of LLMs. These probing tasks provide a targeted evaluation of the model's behavior and decision-making process.
Perturbation-based Methods: By slightly modifying the input or internal representations of the model and observing how the outputs change, researchers can gain insights into the model's sensitivity to specific changes and its decision-making process. Perturbation-based methods help identify the most influential factors in the model's predictions.
Interpretable Architectures: Some researchers are exploring the development of new architectures for LLMs and transformers that are inherently more interpretable. By designing models with built-in interpretability mechanisms, such as attention-based explanations or modular components, we can achieve a better understanding of their inner workings.
Other Approaches: Researchers are also investigating techniques such as layer-wise relevance propagation (LRP), which assigns relevance scores to input features based on their contribution to the model's output, and integrated gradients, which attribute the model's prediction to input features by calculating the path integral of the gradients.

The Importance of Interpretability in the Age of Generative AI

As generative AI models become more prevalent and influential in various industries, the need for transparency and accountability becomes paramount. These models have the potential to generate human-like text, images, and even code, revolutionizing the way businesses operate. However, the complexity and autonomy of generative AI models also raise concerns about biased outputs or potential misuse.

Interpretability plays a crucial role in mitigating these risks and building trust in AI systems. By providing clear explanations of how models arrive at their outputs, businesses can ensure fairness, detect and address biases, and maintain ethical standards. Interpretability also enables organizations to comply with regulatory requirements and demonstrate the reasoning behind AI-driven decisions.

Key Takeaways for Business Leaders

InterpretML is a valuable tool for unlocking the power of interpretable AI in traditional machine learning models. While it may have limitations when it comes to directly interpreting advanced language models, the broader principles of interpretability and transparency remain crucial in the age of generative AI. As research in this field advances, business leaders should stay informed about the latest developments and adopt new tools and techniques that enable them to explain and understand the behavior of their AI systems. By prioritizing interpretability and transparency, organizations can build trust, mitigate risks, ensure compliance, and harness the full potential of AI technologies while maintaining ethical and responsible practices.

Critique of the AI Transparency Index

Wed, 01 Nov 2023 12:07:37 +1100

A recent critique calls into question a prominent AI transparency benchmark, illustrating the challenges in evaluating something as complex as transparency.

Earlier this month, we reported that researchers at Stanford University released the Foundation Model Transparency Index, an effort to assess and score leading AI systems on 100 metrics related to transparency. The index aimed to provide an empirical view into the often opaque development of artificial intelligence.

However, the index has faced sharp criticism for misrepresenting transparency and introducing methodological flaws. In a detailed rebuttal titled "How the Foundation Model Transparency Index Distorts Transparency," researchers affiliated with the nonprofit EleutherAI argue the index distorts more than it reveals.

The critique makes several core assertions:

The index conflates transparency and corporate responsibility. Many of the 100 metrics relate more to issues like moderation policies and terms of service versus research reproducibility.
Openly released models score poorly despite transparency being a goal. Models that prioritize releasing datasets, code, and weights score low since the index underweights these factors.
Questions introduce bias against certain projects. Many metrics favor commercial services over research efforts and introduce unreasonable requirements like disclosing salaries.
Factual errors and misrepresentation are common. Multiple models are docked points incorrectly due to misinterpreting or overlooking documentation.
Aggregate scoring obscures nuance. Collapsing 100 complex metrics into a single 0-100 score encourages gaming and misuse of the ratings.

The critique argues the index will likely lead to "transparency theater" where companies generate documentation solely to boost scores without meaningfully improving openness. Researchers involved contend the index conflicts with its own stated goals around enabling ecosystem health.

The issues raised serve as an important reminder that quantifying something as nuanced as transparency is enormously challenging. Even well-intentioned measurement risks reinforcing bias, oversimplification, and unintended incentives.

For business leaders, this debate underscores the need for diligence in evaluating AI systems. Metrics like the transparency index may provide a useful starting point but require close scrutiny themselves. When assessing responsible AI practices, metrics should be just one input into a holistic process also accounting for factors like direct audits, benchmark tests, and qualitative reviews.

The path towards genuinely transparent and trustworthy AI will require sustained coordination between companies, researchers, regulators, and civil society. For now, business leaders would be wise to approach AI transparency rankings and research with a critical eye.

Measuring Transparency in Foundation Models

Wed, 01 Nov 2023 12:07:37 +1100

Artificial intelligence (AI) has rapidly permeated nearly every sector of society, bringing immense commercial investment and public interest. AI systems like chatbots, facial recognition, and recommendation engines mediate how we communicate, access information, use transportation, and make purchases. The most advanced form of AI today are foundation models: general purpose systems trained on massive datasets that can be adapted to many applications. Foundation models include systems like GPT-4 for text, Midjourney for images, and Codex for code.

The rise of foundation models has raised pressing questions about their societal impact and how to ensure this technology advances the public interest. Specifically, a lack of transparency about how foundation models are built and used makes it difficult to properly scrutinize them or propose interventions when issues arise. This concerning trend parallels the trajectory of other digital technologies like social media. When new technologies emerge, early optimism and rapid adoption precedes growing awareness of harms enabled by the technology in question. After the fact, society often realizes the technology suffered from a lack of transparency by its builders.

To avoid repeating past mistakes, many experts such as Prof Emily M. Bender and the members of the DAIR institute have called for greater transparency in AI. However, what constitutes meaningful transparency for foundation models remains unclear. To clarify how transparent today's developers are, researchers at Stanford University's Institute for Human-Centered Artificial Intelligence conducted the Foundation Model Transparency Index. This multi-month research initiative systematically assessed and scored leading foundation model developers on 100 concrete metrics of transparency.

The results reveal a fundamental lack of transparency across the AI industry. The highest scoring developer in the index scored just 54 out of 100 possible points. On average, developers scored only 37 out of 100. This concerning opacity spans critical issues like the data used to train models, the labor practices involved, the risks posed by systems, and their real-world usage. The index provides an unprecedented, empirical view into the black box of AI development while establishing exactly where greater transparency is most needed.

Here, we explore the goals, approach, and key findings of this pioneering research. Our aim is to clarify the push for transparency in AI and what it might achieve for society.

The Case for Transparency in AI

Transparency is a vital prerequisite for accountability, robust science, continuous innovation, and effective governance of digital technologies. Without transparency, the public cannot properly understand technologies mediating essential aspects of life nor propose interventions when harms become apparent. This lesson has become abundantly clear from prior technologies where opacity enabled wide-ranging harms to emerge before society could respond.

Social media serves as a cautionary tale. For years, social media companies obscured how their platforms were built, how content was moderated, and how user data was handled. The resulting opacity let harms proliferate, ranging from disinformation influencing elections to illegal data sharing enabling voter manipulation. By the time society grasped the repercussions, immense damage was already done.

Today, foundation models appear on track to repeat this trajectory. As capabilities have rapidly advanced from GPT-2 to GPT-3 and now GPT-4, transparency about these systems has dramatically decreased. For example, OpenAI stated that its release of GPT-4 intentionally omitted all details regarding model architecture, training data, compute usage, and other key factors. This decrease parallels social media's path towards opacity as profits and users grew. Without transparency, foundation models will likely lead to unintended consequences that cannot be addressed until significant harm occurs.

Calls for transparency in AI are mounting from experts across civil society, academia, industry, and government:

80+ civil society groups urged developers to release model cards explaining characteristics of AI systems.
400+ researchers signed a pledge to document details about datasets, models, and experiments.
The EU AI Act requires high-risk AI systems to be transparent and provide information to authorities.
The White House secured commitments from companies to share more information about development and risks.

However, the AI field lacks clarity on what transparency entails and which developers are transparent about what issues. The Foundation Model Transparency Index aimed to address this knowledge gap with a rigorous, empirical assessment.

Measuring Transparency in Foundation Models

The Foundation Model Transparency Index began by articulating 100 specific indicators that comprehensively characterize transparency for developers of foundation models. Indicators span the full supply chain of foundation models including:

Upstream - Data, compute, code, labor used to build models
Model - Technical properties, capabilities, risks, limitations
Downstream - Distribution, policies, societal impact of models

This multi-dimensional view of transparency builds on emerging best practices like model cards and supply chain tracing for AI. Researchers engaged a range of experts to refine the set of 100 indicators, which are tailored to foundation models and emphasize issues like labor, bias, and environmental impact.

The index then scored 10 major foundation model developers against these 100 indicators based solely on publicly available information. The 10 developers - spanning startups, Big Tech firms, and open collaborations - included leaders across the foundation model ecosystem:

OpenAI (GPT-4)
Anthropic (Claude 2)
Google (PaLM 2)
Meta (Lamma 2)
Amazon (Titan Text)
Microsoft (Turing NLG)
Cohere (Cohere API)
AI21 Labs (Jurassic-2)
Hugging Face (BLOOMZ)
Stability AI (Stable Diffusion 2)

Scoring adhered to a rigorous process. Researchers independently assessed each of the 1,000 (developer, indicator) pairs, assigning a 1 if the developer was transparent on the indicator and 0 otherwise based on clear criteria. Disagreements were resolved through discussion and examination of sources. Developers were allowed to contest scores prior to publication. This scoring yielded a 0-100 overall transparency score for each developer along with granular subdomain scores.

Key Findings on the State of Transparency in AI

The Foundation Model Transparency Index yields unprecedented empirical insights into transparency across the foundation model ecosystem. Here, we summarize 10 key research findings that paint a concerning picture of pervasive opacity among organizations building the most impactful AI today:

No developer comes close to adequate transparency. The highest score is 54/100, with an average score of just 37/100. Even top-scoring Meta leaves much undisclosed. This fundamental opacity spans critical issues like labor, bias, and environmental impact.
Upstream resources used to build models are a massive blindspot. Scores on upstream data, compute, and labor average just 20%, 17%, and 17% respectively. Many details like data provenance, worker wages, and carbon emissions remain obscured.
Downstream societal impacts are almost entirely opaque. Developers score only 11% on downstream impact issues like usage statistics, affected communities, and redress mechanisms. There is virtually no transparency into how models affect society once deployed.
Open model developers are consistently more transparent. OpenAI (48) scores below Meta (54), Hugging Face (53), and Stability AI (53) which all release model weights and data. Closed access exacerbates opacity.
Capabilities transparency doesn’t extend to risks and limitations. Most developers demonstrate capabilities (62%) but few evaluate risks (24%), limitations (60%), or model weaknesses (26%). One-sided transparency creates misplaced trust.
Labor practices are obscured across the board. Just one developer discloses any information about labor conditions and wages. This occludes concerns about exploitative outsourcing raised by many experts.
Data transparency narrowly focuses on augmentation and curation. Data provenance, creators, copyright, and licensing are widely opaque. This prevents scrutiny of unauthorized data practices which are the subject of ongoing lawsuits.
Compute and environmental impact are concealed. Only two developers report emissions and energy use. Foundation models can require carbon emissions rivaling a car’s lifetime usage, but most developers do not disclose.
Release decisions happen in a black box. Nearly all developers give no information about how and why they decide what models to release or not. Release strategy plays an outsized role in impact but processes remain opaque.
Transparency initiatives do not penetrate industry. Despite some documentation efforts by researchers and promises from developers, transparency on critical issues like bias testing and model performance remains scarce.

In summary, the index paints a concerning picture of an ecosystem trending towards opacity as risks rise. It also clarifies where transparency is most lacking across areas like labor, environmental impact, release processes, and downstream usage. These empirical findings make evident the need for transparency while focusing attention on the most critical blind spots.

Paths Forward to Increase Transparency

The Foundation Model Transparency Index establishes an unprecedented understanding of where industry practices currently fall short on transparency while spotlighting areas for urgent improvement. But how exactly can progress be made to rectify this fundamental opacity? We outline promising pathways forward identified in the report.

1. Industry Self-Regulation

The most direct path is for developers to voluntarily increase transparency. The index provides developers clear guidance on where they lack transparency relative to competitors. It also demonstrates how transparency on nearly every indicator is feasible today, since some developer scored points. Developers should:

Proactively disclose more information about existing models that power live systems affecting millions.
Make new models substantially more transparent about upstream sources, technical properties, and downstream impacts.
Draw on practices from peers who are transparent on specific issues, following emerging best practices.
Work closely with deployers of their models (e.g. API providers) to gather and share aggregated usage statistics.

2. Transparency requirements

Governments are actively considering requirements for transparency in AI. The index empirically grounds policymaking, clarifying where developers are opaque and where interventions are most needed. Policymakers should:

Make transparency a top priority. Transparency unlocks public accountability and supports effective regulation.
Enforce existing laws (e.g. sectoral regulations) requesting companies share information. Require transparency, not just promises.
Use the Index to guide requirements for transparency with greater nuance. For instance, target areas like labor and data where opacity is extreme.

While requirements risk unintended consequences like stifling innovation, targeted transparency regulation can unlock public understanding and oversight.

3. Corporate and consumer pressure

Public pressure can drive change even absent regulation. The Index clarifies how developers compare and where consumers and corporations deploying models should push them for greater transparency.

Corporations with purchasing power should require transparency provisions in contracts licensing models. Collectively, major cloud providers could exert influence on developers.
Rankings like the Index can become a basis for corporate responsibility rankings and consumer guides that pressure laggards. Public understanding and concern drives change.

4. Academic engagement

Finally, the research community has an essential role to play through standards and strong norms around transparency. Researchers should:

Adopt and refine emerging best practices like model cards to release details about training data, model development, and evaluation results.
Extend initiatives like PapersWithCode that log experimental results to also capture details about datasets, compute, and other factors essential for reproducibility.
Develop conference policies that require transparency as a prerequisite for publishing research on foundation models and other high-impact AI systems.

Taken together, these complementary approaches can put transparency as a front-and-center priority for the field. But achieving meaningful transparency requires sustained effort from all stakeholders: developers, policymakers, researchers, corporations, and civil society. The Foundation Model Transparency Index aims to provide common ground for this collective push by empirically outlining the current state of affairs.

Measuring Progress Towards Transparent AI

The Foundation Model Transparency Index represents a critical first step, not the final word, on transparency in AI. The urgency of public calls for transparency demands that we move beyond principles to measurement. By quantifying the current state of transparency, the Index spotlights where society needs more information to properly understand AI systems mediating essential aspects of life.

Moving forward, the Index can fuel continued progress by benchmarking how transparency evolves across critical issues like labor practices, algorithmic bias, and environmental impact. The research team plans to regularly update the Index, scoring developers on expanded sets of indicators that capture emerging risks, capabilities, and best practices. With time, they hope transparency will become standard practice for developers building society-shaping technologies like foundation models.

In other words, the Index can help realize a long-term vision: an AI ecosystem that is consistently transparent about the technologies it creates and deploys. But achieving accountable AI guided by public oversight will require all of us—as consumers, citizens, experts, and institutions—to collectively demand meaningful transparency.

Humanity cannot afford to repeat the harms that opaque technologies have inflicted throughout history. Algorithmic systems will never be perfect, but transparency provides the essential ingredients for public awareness, scientific exchange, and equitable governance. We must continue pushing towards transparent AI that advances prosperity for all. The Foundation Model Transparency Index marks one milestone in what will surely be a long journey.

Note: Read a critique to this Index here.

Examining Claims and Hype: Large Language Models

Wed, 16 Aug 2023 09:04:25 +1000

In recent years, a new type of AI system called large language models (LLMs) has rapidly gained popularity and changed the landscape of natural language processing (NLP) research and applications. LLMs are AI systems that are trained on massive amounts of text data to generate or understand language. Popular examples include ChatGPT, created by Anthropic; GPT-3, created by OpenAI; and Google's LaMDA.

While LLMs have shown impressive capabilities, their sudden prominence has also raised concerns about potential downsides and knowledge gaps. In a new research paper, AI experts Alexandra Luccioni and Anna Rogers take a critical look at LLMs, analyzing common claims and assumptions while identifying issues and proposing ways forward. Here are the key takeaways:

Defining LLMs

The authors first attempt to precisely define what counts as an LLM, since the term is used loosely. They propose three criteria:

LLMs model text and can generate it based on context. For example, ChatGPT can generate coherent continuations of text when given a prompt.
LLMs are pretrained on over 1 billion tokens of text. A token is a basic unit of text, like a word or punctuation mark. For comparison, 1 billion tokens is about 15 billion words, or tens of thousands of books worth of text.
LLMs utilize transfer learning to adapt to new tasks. Transfer learning means the model learns general patterns from large datasets which can then be applied to new tasks with minimal additional training.

This definition excludes some popular older NLP models like word2vec which don't generate text based on context.

Examining Common Claims

The authors then fact-check four common claims about LLMs:

LLMs are robust

While LLMs have reduced some brittleness issues of old AI systems that completely failed on unfamiliar inputs, they still fail in many edge cases and exhibit biases. For example, ChatGPT sometimes confidently generates plausible but incorrect answers, exposing a lack of robustness. Shortcuts in training data remain a problem, where models exploit superficial cues rather than truly understanding language.

LLMs achieve state-of-the-art results

LLMs excel at few-shot learning, meaning they can perform well on new tasks with just a few examples as prompts, without task-specific fine-tuning. However, they don't necessarily beat fine-tuned models designed specifically for a task. On SuperGLUE language benchmarks, GPT-3 scored 71.8% in few-shot learning, while a fine-tuned RoBERTa model achieved 84.6%. Non-LLM approaches can still be top performers on some tasks too. Benchmark contamination is also a concern, where test data overlaps with the LLM's training data, giving an unreliable boost in performance.

LLM performance is due to scale

While model size has been a key factor in improvements, as seen in successive models like GPT-3 (175 billion parameters) to GPT-4 (100 trillion parameters), training data quality and other optimizations also play a big role. For example, PaLM performance gains were partly attributed to data cleaning. Recent efficient models like Anthropic's Claude challenge the theory that sheer scale is all that matters.

LLMs show emergent properties

Claims of LLMs exhibiting abilities not explicitly trained for lack rigorous proof. Their abilities often correlate with evidence found in the massive training data, which cannot be fully audited. For example, ChatGPT may appear to have common sense not seen during training, but this is unproven.

Concerns and Issues

The authors argue these claims contribute to issues like lack of model diversity, influence of private companies, barriers to entry for researchers, decreased reproducibility, and dismissal of theory. LLMs are also deployed without sufficient testing for safety and fairness across demographics.

Recommendations

The authors provide several concrete recommendations to address the issues raised and steer LLM research in a more rigorous direction:

Maintain diversity of research approaches in NLP - Conferences and journals should ensure balanced representation of non-LLM techniques instead of solely focusing on the latest LLM variants. This avoids over-reliance on one methodology and allows exploration of alternative approaches.
Improve definitional clarity - Key terms like "large language model" and "emergent properties" require precise definitions grounded in evidence to avoid hype or confusion. For example, emergence could refer to behaviors not directly trained for vs. behaviors learned from training data.
Avoid reliance on closed-source models - Using proprietary models like GPT-4 as benchmarks makes research expensive, unfair, and results unreliable if the model changes. Open models should be preferred.
More controlled studies on capabilities - Rather than generic benchmarks, experiments should isolate factors like model architecture and training data to pinpoint causes of behaviors. Granular testing on specific skills is needed.
Develop better evaluation methods - Metrics beyond accuracy like robustness and bias should be assessed. Potential training data overlap must be checked. Evaluation should account for faults in open-ended generation like inconsistency.
Ensure transparency and reproducibility - Details of model training, evaluation results, and ideally training data details should be released to enable reproducibility. Documentation and versioning are key for API-based models.
Incorporate diverse perspectives - Potential societal impacts of LLM use and misuse need consideration in development and deployment. Representation in data and teams is crucial.

With more rigor, transparency, and diversity, LLMs can be guided to fulfill their promise responsibly and avoid the pitfalls of hype, lack of oversight, and concentration of power.

Key Takeaways for Business Leaders

As LLMs spread into products and services, business leaders should view bold claims about their abilities with caution rather than credulously accepting marketing hype. Rigorous testing is essential, as LLMs still have significant limitations. Leaders should pressure vendors to provide transparency about training data and testing procedures. Diversity of approaches should be encouraged to avoid putting all eggs in one basket. As LLMs influence society, their development and use should incorporate diverse perspectives, including consideration of potential harms. An open and critical scientific approach is needed to steer the future of LLMs responsibly.

Source:

Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice

by Alexandra Sasha Luccioni and Anna Rogers

Towards Responsible AI: Model Cards for Transparent Machine Learning

Sun, 13 Aug 2023 22:39:11 +1000

In 2019, a research paper proposed "model cards" as a way to increase transparency into AI systems and mitigate their potential harms. Model cards are short documents accompanying machine learning models that disclose key details for assessing whether they are appropriate for a use case. The authors argued model cards are a step towards democratizing AI responsibly.

As AI proliferates, external audits have found many deployed systems encode biases and fail on marginalized groups. However, since current models lack standardized reporting, it's hard for practitioners to evaluate suitability and compare options. Model cards aim to change this by requiring transparent documentation of model capabilities and limitations.

The proposed model card framework contains sections summarizing:

Model details like architecture and training approach
Intended use cases and exclusion criteria
Relevant factors like demographic groups for evaluating performance
Quantitative performance metrics, broken down across subgroups
Training and evaluation data sources
Ethical considerations during development
Limitations and recommendations

For example, a model card could report a facial recognition system's error rates across race, gender, and age groups. This transparency into variability helps assess if the model is appropriate for an application context.

Model cards complement datasheets for datasets, which document training data characteristics. Together, they increase accountability across the AI lifecycle. The authors presented example model cards for an image classifier and toxicity detection algorithm.

The image classifier model card revealed a high false positive rate for elderly males being classified as smiling. This demonstrates the importance of intersectional analysis - evaluating across combinations of factors like age and gender. The toxicity detector model card showed improved performance on minority groups between two versions, illustrating model cards can track progress.

The authors acknowledged model cards have limitations. Their usefulness relies on creator integrity. They are flexible in scope and do not prevent misleading representations. Cards should complement other transparency techniques like external auditing.

However, by standardizing model reporting with relevant details for stakeholders, model cards represent a step towards responsible AI development and deployment. They help assess whether models warrant trust for particular use cases. The proposed framework offers a template for organizations to evaluate models rigorously before adoption. Broad adoption of model cards could enable accountable AI systems that avoid perpetuating inequality.

Increasingly, regulators are also recognizing the importance of transparency in AI systems. For instance, the proposed EU AI Act requires certain disclosures like intended use cases and limitations. The Model Card Regulatory Check automates checking if a model card complies with these regulatory requirements. This demonstrates how model cards can facilitate efficient regulatory compliance, beyond just informing users. Linking model cards to governance frameworks reinforces their value in responsible AI.

Sources:

Model Cards for Model Reporting

The Risks of Ever-Larger AI Language Models

Sun, 13 Aug 2023 21:46:24 +1000

A thought-provoking paper from computer scientists raises important concerns about the AI community's pursuit of ever-larger language models. It argues this dominant research direction has significant downsides and risks that demand urgent attention.

In recent years, models like Google's BERT, OpenAI's GPT-3, and others have achieved impressive performance gains in language tasks through scaling up to hundreds of billions of parameters trained on massive text datasets. However, the authors argue the environmental, financial, and social costs of this approach outweigh the benefits, and more focus should go towards better understanding models rather than simply making them bigger.

On the environmental front, training these models requires prohibitive amounts of computing power, racking up massive carbon footprints. This compounds inequality when the benefits accrue mainly to wealthy nations but the environmental consequences are borne globally. The financial costs of training also centralize progress in a few well-resourced labs.

The authors also highlight problems with training data. Web-scale datasets amplify dominant viewpoints and encode harmful biases against marginalized groups. Attempting to filter out toxic content is insufficient and risks suppressing minority voices. More investment is needed in thoughtful data curation versus simply amassing unfathomable quantities.

Additionally, while larger models post impressive scores on NLP leaderboards, they don't actually perform true language understanding. Their inner workings remain opaque and they succeed by picking up on spurious statistical cues. This risks misdirecting research efforts away from real progress on AI interpretability and accountability.

When deployed, huge models can generate remarkably fluent but meaningless and incoherent text. The authors liken them to "stochastic parrots" given their tendency to amplify toxic patterns in training data. The term refers to how these models randomly stitch together linguistic patterns they have observed, without any grounding in meaning or intent. If people interpret their outputs as credible despite lack of grounding, it enables spreading misinformation and abuse.

Given these downsides, the authors advocate rethinking the goal of ever-larger models. They recommend prioritizing energy efficiency, curating training data carefully, engaging stakeholders to shape ethical systems, and exploring alternative research directions not dependent on unfathomable data quantities.

While large models can sometimes benefit applications like speech recognition, risks need balancing with harm mitigation measures like watermarking their outputs. Overall, the paper compellingly argues that continuing blindly on the path of scaling up carries severe risks that require urgent attention.

This paper became controversial when some authors published it while working at Google Research. Google allegedly requested they withdraw the paper for internal review, then fired several of the co-authors, including well-known AI ethics researcher Timnit Gebru. The incident highlighted risks of speaking out against dominant research paradigms, especially when papers critique an employer's technology direction. It increased scrutiny of research freedom and ethics in AI.

Sources:

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

Emily M. Bender, Timnit Gebru

Ensuring Ethical AI Through Internal Audits

Sun, 13 Aug 2023 20:31:21 +1000

A research paper proposes formal internal audits as a mechanism for technology companies to ensure their artificial intelligence (AI) aligns with ethical priorities before deployment. Conducting rigorous self-assessments throughout development can make AI more accountable to society.

AI systems like facial recognition, predictive policing algorithms, and social media filters are increasingly affecting people's lives. However, external audits by journalists and academics often find these technologies disproportionately harm marginalized groups through biases encoded in training data or design choices.

By only auditing after deployment, it becomes difficult to fix issues rooted in early development stages. The authors argue that organizations creating AI should perform ongoing internal audits to catch problems sooner. This would complement external oversight.

Drawing inspiration from quality control practices in aerospace, medicine, and finance, the researchers outline a framework called SMACTR to structure internal AI audits. It comprises five stages:

Scoping: Confirm the AI system's intended use case and principles it should uphold. Review potential social impacts.
Mapping: Document the development process and people involved for traceability. Start a risk analysis.
Artifact Collection: Gather key documents like model cards explaining model limitations and datasheets detailing training data.
Testing: Evaluate risks through methods like adversarial attacks to fool models and surface failure cases.
Reflection: Finalize the risk analysis. Develop a mitigation plan to address issues before launch.

At each stage, auditors generate artifacts like stakeholder maps, checklist procedures, and risk profiles. Together these produce an audit trail enabling organizations to validate if projects adhere to ethical objectives.

For example, an internal audit could evaluate a proposed income scoring algorithm to assess financial risk for lending decisions. The scoping stage would analyze how denying loans to "high-risk" individuals perpetuates inequality. Mapping would outline the engineering team and main model components. Artifact collection would include documentation of the training process and data sources. Testing would probe for biases like lower scores for racial minorities. The reflection would produce an action plan to address identified biases before deployment.

Unlike external audits focused on model outputs, internal assessment provides access to inside processes and data. The framework aims to translate abstract AI principles like fairness into concrete practices for responsible development.

The authors acknowledge limitations. Internal auditors share incentives with the audited organization, risking biased assessments. The framework also requires extensive documentation which slows rapid development. And ambiguity remains around acceptable thresholds for risks versus model accuracy and business objectives.

Nonetheless, the SMACTR methodology provides an initial structure for companies to audit algorithms against ethical priorities throughout the creation process, rather than just critiquing systems after launch. This proactive approach can uncover issues early when they are easier to fix, and prevent harmful technology from ever reaching users. The framework will require refinement, but represents a step towards aligning AI with societal values.

For business leaders, this research reinforces that ethical AI requires continuous review, not just post-deployment auditing. Companies should formalize auditing units with technical and ethical oversight separate from product teams. Following a standardized methodology can demonstrate accountability. Documentation and transparency will be key.

While external audits remain essential, supplementing these with rigorous internal assessments makes it likelier technologies empower society broadly rather than negatively impact vulnerable communities. As AI proliferates, adopting structured self-evaluations will help ensure businesses uphold their principles. The proposed framework offers a template for auditing AI with ethics in mind at every development stage.

Sources:

Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing