<?xml version="1.0" encoding="UTF-8" ?><!-- generator=Zoho Sites --><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><atom:link href="https://www.nownextlater.ai/Insights/ai-story-brain/feed" rel="self" type="application/rss+xml"/><title>Now Next Later AI - Blog , AI Story Brain</title><description>Now Next Later AI - Blog , AI Story Brain</description><link>https://www.nownextlater.ai/Insights/ai-story-brain</link><lastBuildDate>Wed, 26 Nov 2025 21:22:36 +1100</lastBuildDate><generator>http://zoho.com/sites/</generator><item><title><![CDATA[What is Tokenization? Let's Explore, Using Novel AI's New Tokenizer as a Use Case]]></title><link>https://www.nownextlater.ai/Insights/post/does-novel-ai-s-new-tokenizer-really-boost-storytelling</link><description><![CDATA[What is Tokenization? Let's Explore, Using Novel AI's New Tokenizer as a Use Case]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_j92KpyjVQguMwwTuLUblCw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_y3ppOIQmQPatJLZsEqQmHA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_rPQvraI_QoWt9m8irgPCow" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_SotbxYsxTU7FCItxtgBEhg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"] .zpimage-container figure img { width: 500px ; height: 492.01px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"] .zpimage-container figure img { width:500px ; height:492.01px ; } } @media 
(max-width: 767px) { [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"] .zpimage-container figure img { width:500px ; height:492.01px ; } } [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-10%20at%2011.38.41%20am.png" width="500" height="492.01" loading="lazy" size="medium" alt="Novel AI's Tokenizer" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_M6sQVlnHS5-6esy1xFZy5w" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_M6sQVlnHS5-6esy1xFZy5w"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Tokenization is a foundational step in natural language processing (NLP) and machine learning.&nbsp;</span></p><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Large Language Models are big statistical calculators that work with numbers, not words. Tokenization converts the words into numbers, with each number representing a position in a vocabulary of all the possible tokens.</span></p><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Tokenization breaks down a piece of text into smaller units, called &quot;tokens.&quot; These tokens can represent whole words, parts of words, or even multiple words in some languages. For instance, the sentence &quot;ChatGPT is fun!&quot; might be broken down into tokens like [&quot;Chat&quot;, &quot;G&quot;, &quot;PT&quot;, &quot; is&quot;, &quot; fun&quot;, &quot;!&quot;].</span></p><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">You can choose from multiple tokenization methods, but it's crucial to consistently use the same method during both training and text generation.</span></p><h1 style="margin-bottom:12pt;"><span style="font-size:20pt;font-weight:400;">Why is this important for large language models?</span></h1><ol><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Understanding Context:</span><span style="font-size:10.5pt;font-weight:400;"> By breaking text into tokens, the model can process and understand the context around each token. 
It's like looking at each puzzle piece and understanding where it might fit in the bigger picture.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Efficiency: </span><span style="font-size:10.5pt;font-weight:400;">Language models have a limit to how many tokens they can process at once. By tokenizing text, they can manage and process information more efficiently.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Flexibility:</span><span style="font-size:10.5pt;font-weight:400;"> Different languages have different structures. Tokenization allows these models to be flexible and work with multiple languages. For example, in English, spaces often separate words, but in languages like Chinese, words are often clustered together without spaces. Tokenization helps the model handle such variations.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Training: </span><span style="font-size:10.5pt;font-weight:400;">When training these models on vast amounts of text, tokenization ensures that the model learns from consistent and standardized units of text.</span></p></li></ol><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">In essence, for large language models, tokenization is a foundational step that allows them to read, understand, and generate human-like text across various languages and contexts.</span></p><h1 style="margin-bottom:10pt;"><span style="font-size:20pt;font-weight:400;">Trade-offs</span></h1><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Different tokenization strategies come with their own sets of trade-offs. 
Let's delve into some of these key trade-offs:</span></p><ol><li style="font-size:10.5pt;font-weight:700;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Granularity:</span></p></li></ol><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Fine-grained (e.g., character-level):</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can handle any word or term, even if it's never seen it before. It's very flexible and can be language-agnostic.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Requires more tokens to represent a text, which can be computationally expensive and may not capture semantic meanings as effectively.</span></p></li></ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Coarse-grained (e.g., word-level):</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can capture semantic meanings more directly and is often more efficient in terms of the number of tokens used.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Struggles with out-of-vocabulary words and might not be as flexible across different languages.</span></p></li></ul></ul><ol start="2"><li style="font-size:10.5pt;font-weight:700;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Language Dependence:</span></p></li></ol><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p 
style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Language-specific tokenizers:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Optimized for a particular language, capturing its nuances and structures effectively.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Not versatile. A separate tokenizer would be needed for each language, which isn't scalable for multilingual models.</span></p></li></ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Language-agnostic tokenizers:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can be used across multiple languages, making them ideal for multilingual models.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Might not capture the intricacies of each individual language as effectively as a language-specific tokenizer.</span></p></li></ul></ul><ol start="3"><li style="font-size:10.5pt;font-weight:700;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Fixed vs. Dynamic Vocabulary:</span></p></li></ol><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Fixed Vocabulary:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Deterministic and consistent in its tokenization. 
Easier to manage and deploy.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Struggles with out-of-vocabulary terms and might become outdated as language evolves.</span></p></li></ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Dynamic (or adaptive) Vocabulary:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can adjust to new terms or slang, making it more flexible and up-to-date.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: More complex to implement and might introduce inconsistencies over time.</span></p></li></ul></ul><ol start="4"><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Efficiency vs. Coverage:</span><span style="font-size:10.5pt;font-weight:400;"> Some tokenizers aim for maximum coverage, ensuring they can handle any text thrown at them. Others prioritize efficiency, using the fewest tokens possible to represent a text. There's a balance to strike here: more coverage can mean more computational overhead, while prioritizing efficiency might mean sacrificing the ability to handle rare terms.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Complexity and Overhead: </span><span style="font-size:10.5pt;font-weight:400;">Advanced tokenization methods, like Byte-Pair Encoding (BPE) or SentencePiece, can handle a wide range of text types and languages. 
However, they introduce additional computational and implementation overhead compared to simpler methods.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Consistency:</span><span style="font-size:10.5pt;font-weight:400;"> Some tokenization methods might tokenize the same text differently based on context, leading to potential inconsistencies. While this can be beneficial in capturing nuanced meanings, it can also introduce unpredictability in the model's behavior.</span></p></li></ol><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Choosing a tokenizer for a large language model involves weighing these trade-offs based on the specific goals and constraints of the project. Whether the priority is multilingual support, computational efficiency, or capturing linguistic nuances, understanding these trade-offs is crucial in making an informed decision.</span></p><h1 style="margin-bottom:12pt;"><span style="font-size:20pt;font-weight:400;">Novel AI's Tokenizer to </span><span style="font-size:20pt;font-weight:400;font-style:italic;">&quot;enable stronger storytelling capabilities&quot;</span></h1><p style="margin-bottom:12pt;"><span style="font-size:11pt;font-weight:400;">Novel AI recently developed a custom tokenizer for their AI models. The </span><a href="https://github.com/NovelAI/novelai-tokenizer"><span style="font-size:11pt;font-weight:400;text-decoration:underline;">GitHub project</span></a><span style="font-size:11pt;font-weight:400;"> detailing their process provides an inside look at engineering tradeoffs like vocabulary size, compression ratio, and handling numerals. The focus of this project is to build a tokenizer that enables stronger storytelling capabilities. 
Let's explore!</span></p><p style="margin-bottom:12pt;"><span style="font-size:11pt;font-weight:400;">On the surface, the new tokenizer offers clear advantages:</span></p><ol><li style="font-size:11pt;font-weight:400;"><p><span style="font-size:11pt;font-weight:700;">Improved Granularity and Flexibility:</span><span style="font-size:11pt;font-weight:400;"> Unlike traditional tokenizers, this one offers a balance between </span><span style="font-size:11pt;font-weight:400;text-decoration:underline;background-color:rgb(2, 184, 187);">word-level and subword-level tokenization</span><span style="font-size:11pt;font-weight:400;">. By breaking down words into meaningful fragments, it can better understand and generate nuanced text. This is especially crucial for storytelling where context, nuance, and subtlety matter.</span></p></li></ol><br><ol start="2"><li style="font-size:11pt;font-weight:400;"><p><span style="font-size:11pt;font-weight:700;">Compression Ratio: </span><span style="font-size:11pt;font-weight:400;">A higher compression ratio means the model can process and understand larger chunks of text at once. </span><span style="font-size:11pt;font-weight:400;text-decoration:underline;background-color:rgb(2, 184, 187);">This is vital for maintaining context in long narratives or when referencing earlier parts of a story</span><span style="font-size:11pt;font-weight:400;">. By achieving a 7-19% higher compression ratio than the LLaMa tokenizer on significant parts of the English dataset, it's evident that the tokenizer is more efficient. 
This efficiency can translate to richer and more coherent narratives, especially in longer stories.</span></p></li></ol><br><ol start="3"><li style="font-size:11pt;font-weight:400;"><p><span style="font-size:11pt;font-weight:700;">Adaptability and Evolution: </span><span style="font-size:11pt;font-weight:400;">The iterative approach to tokenizer training, with multiple runs and rebalancing, ensures that the tokenizer is optimized for the specific nuances of your dataset. This adaptability is key for evolving storytelling styles and trends.</span></p></li></ol><br><p><span style="font-size:10.5pt;font-weight:400;">Some other pros are mentioned. These are more specific to the Novel AI project:</span></p><br><ol start="4"><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Multilingual Capabilities:</span><span style="font-size:10.5pt;font-weight:400;"> By accommodating both English and Japanese from the start, the tokenizer is designed for bilingual storytelling. This means it can seamlessly switch between languages or even blend them, offering richer narratives and reaching a broader audience.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Efficient Handling of Unicode Characters: </span><span style="font-size:10.5pt;font-weight:400;">The ability to natively tokenize Unicode characters, especially emojis, allows for more expressive storytelling. Emojis, in modern communication, can convey emotions, context, and tone, making them valuable in narratives. 
But they are less relevant to novel writing.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Numeric Understanding: </span><span style="font-size:10.5pt;font-weight:400;">Tokenizing numbers digit by digit enhances the model's capability to understand and manipulate numeric values. This is crucial for stories that involve dates, quantities, or any numerical context.</span></p></li></ol><br><p style="margin-bottom:12pt;"><span style="font-size:11pt;font-weight:700;">Disadvantages:</span></p><ol><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Complexity and Maintenance:</span><span style="font-size:11pt;font-weight:400;"> Training the tokenizer added development time and complexity.</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">BPE vs. Unigram: </span><span style="font-size:11pt;font-weight:400;">The decision to choose BPE over Unigram was based on compression ratio. While BPE might offer better compression, Unigram might provide more natural word segmentations. <span style="background-color:rgb(2, 184, 187);">The storytelling quality might be affected if the tokenizer doesn't segment words in a way that's intuitive to human readers.</span></span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Multilingual Limitations: </span><span style="font-size:11pt;font-weight:400;">While accommodating both English and Japanese is a strength, it might also be a limitation. 
The tokenizer might be overly specialized for these two languages, potentially making it less effective for other languages or multilingual contexts beyond English and Japanese.</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Vocabulary Size: </span><span style="font-size:11pt;font-weight:400;">The decision to use a vocabulary size of 65535 tokens, while efficient from a computational standpoint, might introduce limitations. Is this size sufficient to capture the nuances of both English and Japanese, especially given the richness of the Japanese writing system?</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Numeric Tokenization:</span><span style="font-size:11pt;font-weight:400;"> Tokenizing numbers digit by digit can indeed improve the model's understanding of numeric values. However, it might also make the model less adept at recognizing larger numerical patterns or relationships between numbers.</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Handling of Unicode Characters:</span><span style="font-size:11pt;font-weight:400;"> While the ability to natively tokenize Unicode characters is a strength, there's a potential for overfitting or misinterpretation. Emojis and other Unicode characters can have different meanings in different contexts or cultures. Relying heavily on them might lead to misunderstandings in generated narratives.</span></p></li></ol></div></div><p></p></div>
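The subword behaviour discussed above, where frequent character pairs are merged into larger units, can be sketched with a toy byte-pair encoding (BPE) loop. The corpus, merge count, and helper names below are illustrative only, not NovelAI's actual training code:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict (toy version)."""
    # Start at character level: each word is a tuple of single-char symbols.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the vocabulary.
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

def tokenize(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    syms = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms

corpus = {"low": 5, "lower": 2, "lowest": 2, "newer": 3}
merges = learn_bpe(corpus, 4)
print(tokenize("lower", merges))  # ['low', 'er']
```

On this toy corpus, "lower" splits into the subwords ["low", "er"]; a production tokenizer learns tens of thousands of merges over gigabytes of text, which is where vocabulary size and compression ratio trade off against each other.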
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 11:43:18 +1000</pubDate></item><item><title><![CDATA[Testing AI's Ability to Understand Language in Context]]></title><link>https://www.nownextlater.ai/Insights/post/testing-ai-s-ability-to-understand-language-in-context</link><description><![CDATA[Researchers have developed a benchmark called the LAMBADA dataset to rigorously test how well AI models can leverage broader discourse context when predicting an upcoming word.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_xF4t6QesR8uxc92FhVi5Gw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_8bhpItgQSkqzqqtLxxL5eQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_yTn8c-kASd-8tBJB9x60Aw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"> [data-element-id="elm_yTn8c-kASd-8tBJB9x60Aw"].zpelem-col{ border-radius:1px; } </style><div data-element-id="elm_Mrls-pd6TVySli4Sre_gpQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_Mrls-pd6TVySli4Sre_gpQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Artificial intelligence has made great strides in natural language processing in recent years. Systems can now translate text, answer questions, and generate coherent paragraphs on demand. 
However, most AI still struggles with true language understanding that requires integrating information across long texts.</p><p><br></p><p><span style="color:inherit;">Back in 2016, </span>to address this limitation, researchers developed a benchmark called the LAMBADA dataset to rigorously test how well AI models can leverage broader discourse context when predicting an upcoming word.</p><p><br></p><p>LAMBADA contains over 10,000 passages extracted from fiction books, with the last word blanked out in each passage. When humans are given the full passage as context, they can easily guess the missing word. However, if humans only see the final sentence containing the blank, it becomes virtually impossible to predict the missing word.</p><p><br></p><p>For example, the sentence &quot;Do you honestly think that I would want you to have a ?&quot; on its own has many plausible words that could fill in the blank. But when given the full passage about a couple discussing pregnancy concerns beforehand, it becomes clear from the context that the missing word is &quot;miscarriage.&quot;</p><p><br></p><p>The researchers tested a wide range of AI systems on LAMBADA, including statistical n-gram models as well as advanced neural network architectures like LSTMs. Back then, all the models performed extremely poorly, with 0% to 7% accuracy in predicting the missing word. The models often relied on simple techniques like picking a random proper noun from the passage. Even methods designed to track broader context failed to match human performance. 
LAMBADA continues to be used today to test new projects such as <a href="https://blog.novelai.net/a-new-model-clio-is-coming-to-opus-ef4e2457c601" title="Novel AI" rel="">Novel AI</a>, and current models now achieve over 70% accuracy.<br></p><p></p><p><br></p><p>Truly intelligent systems will need to integrate information across long passages and reason about that context to understand language the way people do.</p><p><br></p><p>While AI chatbots and virtual assistants are improving customer service and other applications, they cannot yet achieve the sophistication of human context processing. Benchmarks like LAMBADA push innovators to develop the next generation of AI that skillfully uses context instead of relying on surface-level statistical patterns.</p><p><br></p><p>Just as IQ tests expanded to gauge different types of intelligence beyond a single number, benchmarks like LAMBADA are important for building well-rounded language AI systems. Advancing contextual language understanding will enable more fluent, trustworthy interfaces between people and machines. Whether in customer service or product development, AI that masters using context could unlock new levels of human-computer interaction.</p><p><br></p><p>Sources:</p><p><span style="font-family:&quot;Questrial&quot;, sans-serif;font-size:16px;"><a href="https://www.researchgate.net/publication/306093716_The_LAMBADA_dataset_Word_prediction_requiring_a_broad_discourse_context" title="The LAMBADA dataset: Word prediction requiring a broad discourse context" rel="">The LAMBADA dataset: Word prediction requiring a broad discourse context</a></span></p><p></p><p></p></div>
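The scoring idea behind LAMBADA is simple to sketch: the model sees the full passage and must produce the blanked final word exactly. The loop below is a minimal illustration; the example item and the stand-in predictor are placeholders, not the real dataset or a real model:

```python
def lambada_accuracy(examples, predict_fn):
    """Score LAMBADA-style cloze items: each example is a
    (context, target_word) pair, and the model must produce
    the missing final word from the passage context."""
    correct = sum(
        predict_fn(context).strip().lower() == target.lower()
        for context, target in examples
    )
    return correct / len(examples)

# An illustrative item in the LAMBADA format (full passage, blanked last word).
examples = [
    ("The couple had been arguing about the pregnancy all night. "
     "\"Do you honestly think that I would want you to have a ___?\"",
     "miscarriage"),
]

# Stand-in for a real model; any callable from context to a word fits here.
predict = lambda context: "miscarriage"
print(lambada_accuracy(examples, predict))  # 1.0
```

Because exact-match scoring gives no partial credit, a model that relies on surface statistics of the final sentence alone scores near zero, which is exactly what the 2016 systems did.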
<p></p></div></div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:08:00 +1000</pubDate></item><item><title><![CDATA[Filling in the Blanks: AI Learns to Suggest Missing Pieces of Stories]]></title><link>https://www.nownextlater.ai/Insights/post/filling-in-the-blanks-ai-learns-to-suggest-missing-pieces-of-stories</link><description><![CDATA[AI research from 2019 explored how to automatically generate reasonable suggestions for missing sections of text.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_SY8i5KsBTYSBUTnW4F6Vog" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_owVqUUNJTL6eYSWPLHXg-Q" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_B4yMguWGR2e1JyIJ2453AA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_smgWUcsE8U2E7JFjJXRSIg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_smgWUcsE8U2E7JFjJXRSIg"] .zpimage-container figure img { width: 500px ; height: 407.45px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_smgWUcsE8U2E7JFjJXRSIg"] .zpimage-container figure img { width:500px ; height:407.45px ; } } @media (max-width: 767px) { [data-element-id="elm_smgWUcsE8U2E7JFjJXRSIg"] .zpimage-container figure img { width:500px ; height:407.45px ; } } [data-element-id="elm_smgWUcsE8U2E7JFjJXRSIg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center 
zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-09%20at%2011.51.47%20pm.png" width="500" height="407.45" loading="lazy" size="medium" alt="In the one stage baseline, the missing span is predicted given the context and the target length." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_moUMIMAonYIEMJV_e5K92g" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_moUMIMAonYIEMJV_e5K92g"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p>Stories unfold step-by-step, but writers sometimes get stuck on how to connect one part to the next. AI research from 2019 explored how to automatically generate reasonable suggestions for missing sections of text. This &quot;story infilling&quot; aimed to assist creative writing by proposing ideas that align with the existing story while still surprising the author.</p><p><br></p><p>The researchers found that standard AI language models at the time struggled to balance coherence with novelty when filling in gaps. The generated text ended up too boring or too random. To address this limitation, they designed a two-step hierarchical system:</p><ul><li>First, the AI randomly selected a few rare, interesting words that could plausibly fit into the storyline based on the context. For a medieval fantasy passage, it might suggest words like &quot;dragon,&quot; &quot;princess,&quot; or &quot;castle.&quot; The system focused on rare words since they provide more information to guide the rest of the text.</li><li>Second, the system generated full sentences conditioned on those interesting words, searching likely combinations that form coherent text. Leveraging the rare words prevented repetitive suggestions, while allowing the model to focus on fluency and coherence.</li></ul><p><br></p><p>The researchers tested story infilling on passages from children's tales with missing sections of 15-30 words. Human evaluators preferred the hierarchical model's suggestions over non-hierarchical methods, which sacrificed diversity or quality.</p></div></div></div>
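The first stage of the two-step scheme above, picking a few rare anchor words, can be approximated by sorting candidates by corpus frequency. This is a minimal sketch with invented names and a toy frequency table; the 2019 system sampled rare words from a learned model rather than sorting counts:

```python
from collections import Counter

def pick_rare_words(candidates, corpus_counts, k=3):
    """Step 1 of the hierarchical infilling scheme: choose a few
    low-frequency, high-information words to anchor the missing span.
    Rarity here is plain corpus frequency (rarest first)."""
    scored = sorted(candidates, key=lambda w: corpus_counts.get(w, 0))
    return scored[:k]

# Toy frequency table standing in for real corpus statistics.
corpus_counts = Counter({"the": 900, "went": 300, "knight": 12,
                         "castle": 9, "princess": 6, "dragon": 4})
anchors = pick_rare_words(
    ["the", "dragon", "went", "princess", "castle"], corpus_counts, k=2)
print(anchors)  # ['dragon', 'princess']
```

Step 2 would then condition a language model on these anchors to generate the full connecting sentences; the rare words constrain the output enough to avoid bland, repetitive fills.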
</div><div data-element-id="elm_smtSU1vMQJ-NiRm8qmX3eQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_smtSU1vMQJ-NiRm8qmX3eQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;">While an early attempt, the study shows promise for AI-assisted writing tools. The approach mirrors a writer's workflow - first deciding on key ideas, then piecing together suitable wording. Similar techniques may enable more human-like narrative understanding and creativity.<p><br></p><p>The field has greatly advanced since 2019 with models like Claude and GPT-4. Yet even powerful AI still struggles with high-level plot and character consistency. Explicitly decomposing generation into steps of planning and drafting, as humans do, is one way to address these challenges. While AI cannot replace human creativity, structured models could soon provide useful brainstorming and revision tools for real authors.</p><p><br></p><p>Sources:</p><p><span style="color:inherit;"><a href="https://www.seas.upenn.edu/%7Eccb/publications/story-infilling.pdf" title="Unsupervised Hierarchical Story Infilling" rel="">Unsupervised Hierarchical Story Infilling</a></span></p><p></p><br></div></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:07:37 +1000</pubDate></item><item><title><![CDATA[Behind the Scenes of Storytelling: Using AI to Plan and Structure Narratives]]></title><link>https://www.nownextlater.ai/Insights/post/behind-the-scenes-of-storytelling-using-ai-to-plan-and-structure-narratives</link><description><![CDATA[In 2019, researchers explored how artificial intelligence could use hierarchical models to improve computer-generated stories.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_zGmGX_ExR0OL3dEJ-qE92A" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_Sgf28QJsRkyTptjDigI1Zg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_2nW8rLDxSvSle-fxQu_DRw" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"> [data-element-id="elm_2nW8rLDxSvSle-fxQu_DRw"].zpelem-col{ border-radius:1px; } </style><div data-element-id="elm_CrdU5M3JEZRY4kdPjaU4rQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_CrdU5M3JEZRY4kdPjaU4rQ"] .zpimage-container figure img { width: 500px ; height: 662.94px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_CrdU5M3JEZRY4kdPjaU4rQ"] .zpimage-container figure img { width:500px ; height:662.94px ; } } @media (max-width: 767px) { [data-element-id="elm_CrdU5M3JEZRY4kdPjaU4rQ"] .zpimage-container figure img { width:500px ; height:662.94px ; } } [data-element-id="elm_CrdU5M3JEZRY4kdPjaU4rQ"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" 
data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-tablet-align-center zpimage-mobile-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-09%20at%2011.31.59%20pm.png" width="500" height="662.94" loading="lazy" size="medium" alt="Generating entity references for different genres" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_A7S0QEDFQbSp0Oya0DQ8KA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_A7S0QEDFQbSp0Oya0DQ8KA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p></p><p></p><div style="color:inherit;"><p></p><p>Storytelling seems almost magical. Writers conjure up entire worlds from their imaginations. But even master storytellers rely on plans and outlines to craft complex, coherent narratives spanning hundreds of words. In 2019, researchers explored how artificial intelligence could similarly use hierarchical models to improve computer-generated stories.</p><p><br/></p><p>Up until then, most AI systems created stories simply word-by-word from left to right. While fine for short texts, this method struggled with long-term plot and character consistency. The researchers proposed &quot;coarse-to-fine&quot; techniques to first generate story outlines, then build surface-level details conditioned on the outline.</p><p><br/></p><p>Their approach involved three steps: modeling the sequence of actions using verbs and arguments, generating story sentences with placeholder entities like &quot;ent0&quot;, and finally rewriting the placeholders with specific references. This mirrored how human writers first sketch a plot's arc, then go back to flesh out settings and characters.</p><p><br/></p><p>By creating more structured drafts, the AI models improved event diversity and entity consistency compared to previous approaches. The placeholder entities also made it easier to track characters, replacing different mentions with the same token. The researchers found that human judges strongly preferred stories created with hierarchical planning versus direct generation.</p><p><br/></p><p>While an early attempt, this work showed the promise of mimicking writing strategies like outlining and revising. 
The field has advanced rapidly since 2019, as models like GPT-4 and Claude 2 now generate remarkably fluent text. But behind the scenes, AI still struggles with plot and people - areas where hierarchical techniques could help. The research highlights the value of breaking narration into more human-like steps, a technique currently being explored by several&nbsp;<span style="color:inherit;">AI-assisted writing</span> startups such as <a href="https://novelai.net/" title="Novel AI" rel="">Novel AI</a> and <a href="https://www.sudowrite.com/" title="Sudowrite" rel="">Sudowrite</a>.<br/></p><p></p><p></p><p><br/></p><p>Just as outlines aid human storytellers, explicit planning and revision may allow AI to better learn from experience. More structured generation spaces let models focus on specific challenges like action sequences before producing full text. While AI has seen stunning progress, people remain the masters of storycraft. Studying the narrative strategies of writers may guide systems to become more helpful collaborators.<br/></p><p><br/></p><p>Source:</p><p><span style="color:inherit;"><a href="https://arxiv.org/pdf/1902.01109.pdf" title="Strategies for Structuring Story Generation" rel="">Strategies for Structuring Story Generation</a></span></p><p></p></div></div>
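The final placeholder-rewriting step can be sketched in a few lines. This is an illustrative reconstruction, not the researchers' code: the placeholder format ("ent0", "ent1") follows the paper, while the helper name and the example names are invented.

```python
import re

# Illustrative sketch: a draft is first generated with abstract placeholders
# (ent0, ent1, ...), which are then rewritten into concrete entity references.
def rewrite_entities(draft: str, references: dict) -> str:
    """Swap placeholder tokens like 'ent0' for concrete entity names."""
    return re.sub(r"\bent\d+\b", lambda m: references.get(m.group(0), m.group(0)), draft)

draft = "ent0 drew her sword as ent1 circled the ruined tower."
print(rewrite_entities(draft, {"ent0": "Mira", "ent1": "the dragon"}))
# -> Mira drew her sword as the dragon circled the ruined tower.
```

Because every mention of a character uses the same token, consistency checks and character tracking reduce to simple string operations on the draft.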
</div><div data-element-id="elm_FPPf8SQKIdBhofTA6erWbQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_FPPf8SQKIdBhofTA6erWbQ"] .zpimage-container figure img { width: 1090px ; height: 773.22px ; } } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-tablet-align-center zpimage-mobile-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="https://www.reel-intelligence.org/" target="" title="Reel intelligence" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Winner%20TV%20Pilot%20Screenplay%20-2-.png" size="fit" alt="Reel Intelligence"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:07:19 +1000</pubDate></item><item><title><![CDATA[Reading Between the Lines: Using Math to Uncover Hidden Patterns in Books]]></title><link>https://www.nownextlater.ai/Insights/post/reading-between-the-lines-using-math-to-uncover-hidden-patterns-in-books</link><description><![CDATA[Books may seem like straightforward stories, but researchers are finding fascinating mathematical patterns hidden in the text.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_BN-GUz9yTH6qOGKp8XYI2Q" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_6P1Nxs5lTRaNx8x9Z_sLXA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_XhJZ4QBFRB6SJ6gQWdY-bQ" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_wDEEkImbYnmEBHGpW6Ludw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_wDEEkImbYnmEBHGpW6Ludw"] .zpimage-container figure img { width: 500px ; height: 525.08px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_wDEEkImbYnmEBHGpW6Ludw"] .zpimage-container figure img { width:500px ; height:525.08px ; } } @media (max-width: 767px) { [data-element-id="elm_wDEEkImbYnmEBHGpW6Ludw"] .zpimage-container figure img { width:500px ; height:525.08px ; } } [data-element-id="elm_wDEEkImbYnmEBHGpW6Ludw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center 
zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-09%20at%2010.18.41%20pm.png" width="500" height="525.08" loading="lazy" size="medium" alt="An ‘ousiogram’ (Dodds et al., 2021) displaying power and danger scores for a subset of 14,499 unique words appearing in Terry Pratchett’s 41-book Discworld series." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_1minUEv_RVGhhrISEzPBbA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_1minUEv_RVGhhrISEzPBbA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Books may seem like straightforward stories, but researchers are finding mathematical patterns hidden in the text. By tracking how words are used over the course of a book in minute detail, they can reveal new insights into plot, emotion, and structure that are not visible to the naked eye.</p><p><br></p><p>The researchers started by scoring a large number of words based on their emotional meaning. For example, positive words like &quot;love&quot; scored higher while negative words like &quot;war&quot; scored lower. They used a framework called &quot;ousiometrics&quot; which boils down emotions to two key dimensions: power and danger. Power relates to agency, confidence, and positivity. Danger relates to emotional uncertainty, negativity, and aggression.</p><p><br></p><p>They then took thousands of books and broke them down into short segments of 50 words each. For each segment, they calculated the average power and danger scores based on the words present. This turned each book into a rolling wave of numbers, with peaks representing more emotional sections and valleys as more neutral parts.</p><p><br></p><p>Short books generally showed a steady wave pattern while long books had more fluctuations in emotion over the course of the text. Surprisingly, when they zoomed in on long books they found the fluctuating highs and lows had a consistent length of a few thousand words. This matches the typical length of chapters in published fiction.</p><p><br></p><p>To study the patterns further, the researchers used a technique called empirical mode decomposition that breaks down fluctuations in data into distinct components, much like musical notes make up chords. 
The text segments were also compared to &quot;shuffled&quot; versions of the books with random word order. The real books differed from the random versions beyond a certain decomposition level, indicating that the fluctuations were not random but reflected an underlying structure.</p><p><br></p><p>These findings suggest that longer books behave less like a single arc and more like collections of short stories or chapters strung together. The emotional ups and downs of the text cycle on a scale of thousands of words, perhaps reflecting how long the human brain can comfortably process a complex narrative before needing a reset. Shorter books lacked these larger fluctuations.</p><p><br></p><p>While we intuitively understand how passages evoke certain moods, the researchers were able to quantify the pacing of emotional highs and lows mathematically. Their work helps confirm the existence of nested patterns in writing - punctuation gives phrases, paragraphs offer local structure, chapters provide mid-level segments, and over the full book, arcs emerge.</p><p><br></p><p>So the next time you open a book, think about the hidden rhythms inside that subtly influence your experience. The feelings evoked in the story may follow mathematical waves as you steadily progress from cover to cover. This emerging field opens up new ways of appreciating the art and science of expert storytelling.</p><p><br></p><p>Source:</p><p><a href="https://www.nature.com/articles/s41599-023-01680-4" title="A decomposition of book structure through ousiometric fluctuations in cumulative word-time" rel="">A decomposition of book structure through ousiometric fluctuations in cumulative word-time</a></p><p></p></div><p></p></div>
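The windowing procedure described above is straightforward to sketch. The snippet below is a toy reconstruction, not the study's code: it averages lexicon scores over consecutive fixed-size word segments, with a handful of invented scores standing in for the ousiometric power-danger lexicon of Dodds et al.

```python
# Toy sketch of ousiometric-style windowing: average lexicon scores over
# consecutive fixed-size word segments, turning a book into a "wave".
def segment_scores(text, lexicon, size=50):
    words = text.lower().split()
    scores = []
    for start in range(0, len(words), size):
        segment = words[start:start + size]
        hits = [lexicon[w] for w in segment if w in lexicon]
        # Segments containing no lexicon words score neutral (0.0).
        scores.append(sum(hits) / len(hits) if hits else 0.0)
    return scores

toy_power = {"love": 0.8, "hope": 0.5, "war": -0.6, "fear": -0.4}  # invented scores
wave = segment_scores("love hope calm war fear calm", toy_power, size=3)
print(wave)  # one average score per 3-word segment: positive, then negative
```

The resulting sequence of numbers is what the researchers then decompose (e.g. with empirical mode decomposition) to expose fluctuations at different scales.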
</div><div data-element-id="elm_BZslHjh1L1NAYTd780h3LQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_BZslHjh1L1NAYTd780h3LQ"] .zpimage-container figure img { width: 800px ; height: 344.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_BZslHjh1L1NAYTd780h3LQ"] .zpimage-container figure img { width:500px ; height:215.00px ; } } @media (max-width: 767px) { [data-element-id="elm_BZslHjh1L1NAYTd780h3LQ"] .zpimage-container figure img { width:500px ; height:215.00px ; } } [data-element-id="elm_BZslHjh1L1NAYTd780h3LQ"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/aibooks" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Untitled%20design%20-4-.png" width="500" height="215.00" loading="lazy" size="large"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:06:53 +1000</pubDate></item><item><title><![CDATA[Storywrangler: Tracking Culture and Events through Twitter's Lens]]></title><link>https://www.nownextlater.ai/Insights/post/storywrangler-tracking-culture-and-events-through-twitter-s-lens</link><description><![CDATA[Researchers developed a tool called Storywrangler that leveraged Twitter data to create an "instrument for understanding our world through the lens of social media."]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_HszOwGyHS3u2ZbcwO98WUQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_NUndbqp2QH-raYQCD0h2-A" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_TG5dekBlReurU5nBQ7MK4A" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_HdtaSh_XqLf8A0UJCjxcLg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_HdtaSh_XqLf8A0UJCjxcLg"] .zpimage-container figure img { width: 500px ; height: 386.22px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_HdtaSh_XqLf8A0UJCjxcLg"] .zpimage-container figure img { width:500px ; height:386.22px ; } } @media (max-width: 767px) { [data-element-id="elm_HdtaSh_XqLf8A0UJCjxcLg"] .zpimage-container figure img { width:500px ; height:386.22px ; } } [data-element-id="elm_HdtaSh_XqLf8A0UJCjxcLg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container 
zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-09%20at%2010.00.19%20pm.png" width="500" height="386.22" loading="lazy" size="medium" alt="Screenshot of the Storywrangler site showing example Twitter n-gram time series for the first half of 2020." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_ooBK-RQ6Sx-jwMEa5QUoDQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_ooBK-RQ6Sx-jwMEa5QUoDQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><p class="whitespace-pre-wrap">Social media platforms like Twitter offered an unprecedented window into the real-time thoughts, conversations, and interests of millions of people. Researchers developed a tool called Storywrangler that leveraged Twitter data to create an &quot;instrument for understanding our world through the lens of social media.&quot;</p><p class="whitespace-pre-wrap"><br></p><p class="whitespace-pre-wrap">Storywrangler analyzed over 100 billion tweets dating back to 2008 to detect trends in word usage over time. It broke down tweets into &quot;n-grams&quot; - sequences of one, two, or three words - and tracked how the usage frequencies of these n-grams changed on a daily basis across different languages.</p><p class="whitespace-pre-wrap">This massive database allowed researchers to see how real-world events, from natural disasters to political movements, were reflected in the narratives that unfolded on Twitter. For example, Storywrangler revealed surging interest in climate-related terms during major storms and wildfires. And it captured the rapid rise and fall of hashtags associated with social justice protests. Beyond reacting to news, Twitter also mirrored more subtle cultural shifts, like the waxing and waning popularity of celebrities or diets.</p><p class="whitespace-pre-wrap"><br></p><p class="whitespace-pre-wrap">Storywrangler went beyond tracking raw frequencies - it also quantified how widely information spread on social media through shares and reposts. This helped distinguish niche conversations from truly viral ideas. 
The researchers used &quot;contagiograms&quot; to visualize both the popularity and amplification of n-grams over time.</p><p class="whitespace-pre-wrap"><br></p><p class="whitespace-pre-wrap">There were certainly limitations to the Twitter lens. The platform's user base skewed young, urban, and affluent compared to the general population. Bots and organized campaigns could artificially inflate interest in certain topics. And the meanings of words themselves evolved across the years.</p><p class="whitespace-pre-wrap"><br></p><p class="whitespace-pre-wrap">But used carefully, Storywrangler offered an unparalleled window into the collective consciousness - recording not just major news events but also the mundane daily conversations of millions worldwide. It aimed to complement more traditional data sources like books and news archives. The researchers hoped Storywrangler would enable more data-driven computational social science to understand our fast-changing, digitally-connected world.</p><p class="whitespace-pre-wrap"><br></p><p class="whitespace-pre-wrap">Source:</p><p class="whitespace-pre-wrap"><span style="color:inherit;"><a href="https://arxiv.org/pdf/2007.12988.pdf" title="Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter" rel="">Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter</a></span></p><p class="whitespace-pre-wrap"></p></div>
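The n-gram bookkeeping described above can be sketched simply. This is a minimal illustration, not Storywrangler's implementation; the sample tweets, date, and function names are invented.

```python
from collections import Counter

def ngrams(text, n):
    """Split a tweet into overlapping n-word sequences."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def daily_frequencies(tweets_by_day, n=1):
    """Relative usage frequency of each n-gram, per day."""
    freqs = {}
    for day, tweets in tweets_by_day.items():
        counts = Counter()
        for tweet in tweets:
            counts.update(ngrams(tweet, n))
        total = sum(counts.values())
        freqs[day] = {g: c / total for g, c in counts.items()}
    return freqs

sample = {"2020-03-01": ["wash your hands", "stay home stay safe"]}
print(daily_frequencies(sample, n=2))  # each bigram's share of the day's bigrams
```

Tracking how these per-day shares rise and fall across years is what turns raw tweet text into the cultural time series the tool exposes.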
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:05:59 +1000</pubDate></item><item><title><![CDATA[Teaching AI to Tell Better Tales by Integrating External Knowledge]]></title><link>https://www.nownextlater.ai/Insights/post/Teaching-AI-to-Tell-Better-Tales-by-Integrating-External-Knowledge</link><description><![CDATA[New research explores how integrating structured knowledge into AI systems can enhance storytelling abilities.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_-2LoK4FLSj6kE0maiRXg6A" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_2uZedO7XTwmjrBCrAZQSKw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_Tz4iZq4gSd6jgmc03jak_g" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_2YA5oGBzkoYbOgntrUGhXA" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_2YA5oGBzkoYbOgntrUGhXA"] .zpimage-container figure img { width: 500px ; height: 389.03px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_2YA5oGBzkoYbOgntrUGhXA"] .zpimage-container figure img { width:500px ; height:389.03px ; } } @media (max-width: 767px) { [data-element-id="elm_2YA5oGBzkoYbOgntrUGhXA"] .zpimage-container figure img { width:500px ; height:389.03px ; } } [data-element-id="elm_2YA5oGBzkoYbOgntrUGhXA"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium 
zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-09%20at%209.41.19%20pm.png" width="500" height="389.03" loading="lazy" size="medium" alt="Three layers of narratological concepts about story: fabula, plot, and discourse." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_kjuSNjtuTimsXg6Qn73jCw" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_kjuSNjtuTimsXg6Qn73jCw"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Storytelling comes naturally to humans. But for machines, spinning an engaging narrative remains an elusive goal. While AI can generate remarkably fluent text, its tales often lack coherence or get repetitive. New research explores how integrating structured knowledge into AI systems can enhance storytelling abilities.</p><p><br></p><p>When reading a story, we draw on general knowledge about how events logically unfold and characters plausibly act. We track complex plot threads and fill gaps using common sense. Machines lack this innate understanding we take for granted. Their stories can become nonsensical or contradictory.</p><p>To tackle this, researchers are providing AI systems explicit knowledge in structured formats. This external knowledge acts like a guide, keeping machine-generated plots on track. It also helps avoid stale repetitions by expanding the ideas available to pull from.</p><p><br></p><p>Several common limitations plague today's AI storytellers:</p><ul><li>Lack of long-term coherence. Without a sense of overall narrative arc, they ramble aimlessly.</li><li>Insufficient grounding in real-world facts. Stories come off vague rather than richly descriptive.</li><li>Repetition. They loop the same words and phrases like a broken record.</li><li>Hallucination. They fabricate events that don't logically follow.</li></ul><p><br></p><p>Integrating knowledge resources like ConceptNet, which contains common sense facts about the world, alleviates these issues. The knowledge functions like an annotated outline, steering the plot. 
It also provides a memory bank of concepts to reference, varying the content.</p><p><br></p><p>But effectively harnessing external knowledge remains challenging. Two main strategies have emerged:</p><ol><li>Injecting knowledge directly into the AI system's training process, like teaching a human author.</li><li>Using knowledge as an external guiding reference during story generation.</li></ol><p><br></p><p>Each approach has trade-offs. Weighting structured resources too strongly can pollute the system's original language skills. But using knowledge merely as a loose guide can fail to correct nonsensical narration.</p><p><br></p><p>Striking the right balance is an active research problem. Scientists are also expanding the knowledge available to AI storytellers with new databases. Most systems today use generic common sense facts. But resources detailing specific people, places, and events could enable more detailed, vivid storytelling.</p><p><br></p><p>Automating evaluation also poses difficulties. No single &quot;correct&quot; story exists for a given prompt. Automatic metrics struggle to account for creativity and interest - aspects requiring human judgment. More robust evaluation is critical to gauge progress.</p><p><br></p><p>Despite hurdles, knowledge-infused narration clearly improves coherence, factual grounding, and variation. AI authors with a knowledge boost spin far more convincing yarns. The research provides a roadmap for machines to better mimic core elements of human storytelling.</p><p><br></p><p>Rather than viewing imagination and structure as at odds, they are complementary. Master storytellers combine free-flowing creativity with purposeful intent. 
By fusing extensive knowledge with unrestrained generation, machines inch closer toward unlocking that balancing act.</p><p><br></p><p>Sources:</p><p><span style="color:inherit;"><a href="https://arxiv.org/pdf/2212.04634.pdf" title="Open-world Story Generation with Structured Knowledge Enhancement: A Comprehensive Survey" rel="">Open-world Story Generation with Structured Knowledge Enhancement: A Comprehensive Survey</a></span></p><p></p></div><p></p></div>
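The second strategy above, consulting knowledge as an external reference at generation time, can be illustrated with a toy retrieval step. The in-memory "graph" below is an invented stand-in for a resource like ConceptNet; a real system would query the actual database and append the retrieved facts to the model's context.

```python
# Invented stand-in for a common-sense knowledge base such as ConceptNet.
TOY_KNOWLEDGE = {
    "sword": ["a sword is used for fighting", "a sword is made of metal"],
    "dragon": ["a dragon breathes fire", "a dragon guards treasure"],
}

def retrieve_facts(story_so_far, knowledge, limit=3):
    """Collect facts about concepts mentioned so far, to steer what comes next."""
    facts = []
    for word in story_so_far.lower().replace(".", "").split():
        facts.extend(knowledge.get(word, []))
    return facts[:limit]

context = retrieve_facts("The knight raised his sword at the dragon.", TOY_KNOWLEDGE)
print(context)  # facts a generator could condition on for the next passage
```

How strongly the generator is made to attend to these retrieved facts is exactly the balancing act the survey describes: too strong and fluency suffers, too loose and the guidance fails to correct nonsense.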
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:05:08 +1000</pubDate></item><item><title><![CDATA[Teaching AI to Craft Coherent Stories]]></title><link>https://www.nownextlater.ai/Insights/post/Teaching-AI-to-Craft-Coherent-Stories</link><description><![CDATA[New research from Stanford University demonstrates how "emotion maps" could improve story generation.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_5pdLFLcdSimmqoTYE0K6Kg" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_A0wHFCauRW2Gv_TkglOtkg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_9ir451-oSWK-HnmcTFIUCQ" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_mQp_YFlv2tErqnpf5Ur11A" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_mQp_YFlv2tErqnpf5Ur11A"] .zpimage-container figure img { width: 800px ; height: 418.90px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_mQp_YFlv2tErqnpf5Ur11A"] .zpimage-container figure img { width:500px ; height:261.81px ; } } @media (max-width: 767px) { [data-element-id="elm_mQp_YFlv2tErqnpf5Ur11A"] .zpimage-container figure img { width:500px ; height:261.81px ; } } [data-element-id="elm_mQp_YFlv2tErqnpf5Ur11A"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-tablet-align-center zpimage-mobile-align-center zpimage-size-large 
zpimage-tablet-fallback-large zpimage-mobile-fallback-large hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-09%20at%209.15.49%20pm.png" width="500" height="261.81" loading="lazy" size="large" alt="Average emotion map characteristics for stories tagged &quot;Depression&quot; in the new dataset. Note the strong prevalence of negative emotion." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_VYcBjLcoQYWexCsYimf8yQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_VYcBjLcoQYWexCsYimf8yQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Storytelling comes naturally to humans, but is exceptionally difficult for artificial intelligence. Machine learning models that generate remarkably fluent text still struggle to craft narrative arcs spanning paragraphs or pages. New research from Stanford University demonstrates how &quot;emotion maps&quot; could improve story generation.</p><p><br/></p><p>The key challenge is imbuing AI systems with a high-level understanding of plot and long-range dependencies - core elements of compelling stories. Without such top-down guidance, machine-written tales easily become repetitive and disjointed.</p><p><br/></p><p>The Stanford project explores a technique called hierarchical generation. This involves first creating a short premise or prompt, then expanding that outline into a full story. The premise acts like an anchor, guiding the system to remain on topic and logically progress the narrative.</p><p><br/></p><p>But how can we represent a good premise for AI? The researchers move beyond using text, instead generating &quot;emotion maps.&quot; These maps contain a series of numerical scores representing different emotive attributes. Each score tracks how positive or negative, sad or joyful consecutive sections of the story feel.</p><p><br/></p><p>For instance, a map may start very positive, then become sadder, and end on a more hopeful note. Feeding these maps as prompts produces stories that logically follow the intended emotional arc. The numbers offer a bird's-eye view of the narrative's affective flow.</p><p><br/></p><p>Remarkably, this numerically-conditioned approach achieved results on two standard story datasets comparable to previous efforts using text prompts. 
The generated stories displayed coherent grammar and punctuation, sensibly reacting to the emotion map's ups and downs.</p><p><br/></p><p>To better understand the relationship between maps and stories, the researchers introduced a new metric called Average Emotional Similarity. This quantifies how closely a story's actual emotion aligns with its prompt map. Initial results demonstrate some correlation, confirming the maps exert influence on the tone of generated text.</p><p><br/></p><p>There are several advantages to conditioning story generation on simplified cues rather than verbose outlines. Maps neatly capture narrative essence in a compact, rapidly computed form. Reducing hand-authoring effort also enables building larger training datasets.</p><p><br/></p><p>However, many challenges remain. Emotionless stories often flummox the system, producing bizarre outputs. Repetition and hallucination still crop up, demonstrating the need for greater plot coherency. And evaluating story quality continues to prove difficult without extensive human judgement.</p><p><br/></p><p>Nonetheless, this research highlights the potential of hierarchical methods to imbue AI storytelling with greater purpose. The raw material exists in today's pretrained language models - machines that have &quot;read&quot; vast amounts of text. We must guide them toward higher reasoning about concepts like theme, characters, and dramatic structure.</p><p><br/></p><p>Interactive tools could empower human authors to easily craft emotion maps, generating stories tailored to their creative vision. Teachers might build maps to help students practice writing logically paced narratives. Therapists could use emotive cadences to gently evoke memories or feelings from patients.</p><p><br/></p><p>The capacity for machines to conjure compelling tales could transform how we communicate ideas and experiences. But achieving this dream will hinge on passing down our innate sense for what makes a story worth telling. 
With innovations like emotion maps lighting the way, artificial authors inch toward unlocking our imagination.</p><p><br/></p><p>Source:</p><div style="color:inherit;"><div><div><div><p><a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=&amp;ved=2ahUKEwjX-M24ss-AAxVETGwGHQoICNcQFnoECBkQAQ&amp;url=https%3A%2F%2Fweb.stanford.edu%2Fclass%2Farchive%2Fcs%2Fcs224n%2Fcs224n.1214%2Freports%2Ffinal_reports%2Freport104.pdf&amp;usg=AOvVaw2Iwkk04_oaGHGAXlr-uqtB&amp;opi=89978449" title="Hierarchical, Feature-Based Text Generation" rel="">Hierarchical, Feature-Based Text Generation</a><br/></p><p></p><div style="color:inherit;"><div>Caitlin Hogan, Department of Computer Science, Stanford University </div>
</div><p></p></div></div></div></div><p></p></div><p></p></div></div><div data-element-id="elm_atkWncPAs8-XRGwZmMJBuw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_atkWncPAs8-XRGwZmMJBuw"] .zpimage-container figure img { width: 1090px ; height: 773.22px ; } } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-tablet-align-center zpimage-mobile-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="https://www.reel-intelligence.org/" target="" title="Reel Intelligence" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Winner%20TV%20Pilot%20Screenplay%20-2-.png" size="fit"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:04:45 +1000</pubDate></item><item><title><![CDATA[Teaching AI to Tell Coherent Stories]]></title><link>https://www.nownextlater.ai/Insights/post/teaching-ai-to-tell-coherent-stories</link><description><![CDATA[Back in 2018, researchers from Facebook AI developed a new method to improve story generation through hierarchical modeling. Their approach mimics how people plan out narratives.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_tQLbaWaBQj2087vOICMB-w" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_dTbUlqiJQ8arZPfnMA9eOw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_0gLbapfvR8-xevlg0lLJjg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_MU_X1Z_dT7W6u5b5FF-fxA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_MU_X1Z_dT7W6u5b5FF-fxA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Back in 2018, researchers from Facebook AI developed a new method to improve story generation through hierarchical modeling. Their approach mimics how people plan out narratives. While significant developments have occurred in language generation, it is worth exploring this technique as several current projects leverage these techniques.<br></p><p><br></p><p>The key innovation is generating a short premise first, then expanding that premise into a full story. 
Take the premise &quot;A knight goes on a quest to save the kingdom.&quot; From this high-level summary, a system can flesh out details - the specific characters, events, and dialogue - while staying focused on the overarching plot.</p><p><br></p><p>This technique helps in two ways. First, the premise acts like an outline, guiding the story generation process. Second, conditioning the story on the premise makes it easier for the AI to stay on topic. Without such grounding, AI systems tend to lose coherence as they generate text word-by-word.</p><p><br></p><p>To train and test hierarchical story generation, the researchers built a new dataset using the r/WritingPrompts subreddit. This online community shares story premises, or prompts, that inspire other users to write original tales. Drawing on over 300,000 prompt-story pairs, the dataset captures diverse genres and narrative styles.</p><p><br></p><p>The researchers' AI system first generates a short prompt, similar to a human providing a premise. It then passes this prompt to a second model that expands it into a full story. Both steps use sequence-to-sequence neural networks, which translate an input sequence into target text.</p><p><br></p><p>To improve story coherence, the researchers introduced two key innovations: a gated self-attention mechanism that lets the model refer back to any earlier part of the story, and a model-fusion training procedure built in two steps. <br></p><div style="color:inherit;"><ul><li>First, they trained a standard sequence-to-sequence neural network model on the story generation task. This model learns to generate fluent stories, but often ignores the premise and fails to maintain consistency with it.</li><li>Next, they trained a second sequence-to-sequence model, but this time provided the hidden state outputs of the first pre-trained model to the second model during training. In other words, the second model learns on top of the representations already learned by the first model. It has access to the pre-trained model's outputs. 
This encourages the second model to focus specifically on relating the story back to the premise, rather than just improving language modeling in general.</li></ul></div><p></p><p><span style="color:inherit;"><br></span></p><p><span style="color:inherit;">By &quot;fusing&quot; the second model with the first pre-trained model in this way, the researchers aim to improve coherence between the premise and final story. The second model builds on top of the first to better maintain relevance.</span></p><p><span style="color:inherit;"><br></span></p><p>Experiments found these advances substantially boosted performance. The AI's stories scored higher in human evaluations for coherence, relevance to the prompt, and overall quality compared to baseline systems. A gated self-attention mechanism let the model refer back to any previous part of the story. And model fusion encouraged tighter connections between the premise and story.</p><p><br></p><p>While far from perfect, these results demonstrate AI's increasing capacity for controllable, long-form text generation. The hierarchical approach mimics how people first conceptualize, then craft, narratives. Such human-inspired techniques will be key to teaching machines to tell truly compelling tales spanning paragraphs or pages.</p><p><br></p><p>The researchers highlight several directions for improvement. The premises generated by the AI tend to be generic, lacking the creativity of human prompts. Repetition remains an issue when expanding premises into stories. And problems like dropped pronouns persist.</p><p><br></p><p>Nonetheless, this work moves neural story writing systems in a promising direction. As models strengthen their understanding of narrative cause-and-effect, characters, and more, their power as digital storytellers will grow. 
Hierarchical modeling that mirrors human planning seems a fitting way to imbue AI with our innate gift for spinning both short yarns and epics.</p><p><br></p><p>Sources:</p><p><a href="https://arxiv.org/pdf/1805.04833.pdf" title="arxiv" rel="">arxiv</a><br></p><p></p></div><p></p></div>
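The fusion step described above can be sketched in miniature. The Python snippet below shows a gated combination of the frozen pretrained model's hidden state with the second model's hidden state at a single decoding step. The dimensions, random weight initialisation, and exact gating form are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # toy hidden-state size

# Stand-in projection and gate weights; in the real system these
# are learned while the pretrained model's weights stay frozen.
W_gate = rng.normal(size=(HIDDEN, 2 * HIDDEN))
W_out = rng.normal(size=(HIDDEN, 2 * HIDDEN))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(h_pretrained, h_second):
    """Gated fusion of the two models' hidden states at one step.

    A learned gate decides how much of the pretrained model's state
    to pass through before the combined state is projected back down.
    """
    both = np.concatenate([h_pretrained, h_second])
    g = sigmoid(W_gate @ both)  # per-dimension gate in (0, 1)
    return np.tanh(W_out @ np.concatenate([g * h_pretrained, h_second]))
```

Because the gate is learned on top of a model that already handles fluency, the second model is free to specialise in keeping the story tied to the premise.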
</div><div data-element-id="elm_JC64W1WgifP5ZdRShAK_UQ" data-element-type="codeSnippet" class="zpelement zpelem-codesnippet "><div class="zpsnippet-container"><iframe src="https://player.vimeo.com/video/285801163?h=a817f4f945&byline=0&portrait=0" width="640" height="360" frameborder="0" allow="autoplay; fullscreen; picture-in-picture" allowfullscreen></iframe><p><a href="https://vimeo.com/285801163">Hierarchical Neural Story Generation</a> from <a href="https://vimeo.com/aclweb">ACL</a> on <a href="https://vimeo.com">Vimeo</a>.</p></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:03:41 +1000</pubDate></item><item><title><![CDATA[Peeking Inside the Black Box: Uncovering What AI Models Know About Books]]></title><link>https://www.nownextlater.ai/Insights/post/peeking-inside-the-black-box-uncovering-what-ai-models-know-about-books</link><description><![CDATA[New research from the University of California, Berkeley sheds light on one slice of these models' knowledge: which books they have "read" and memorized. The study uncovers systematic biases in what texts AI systems know most about.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_gs7G1b-XTpGiZydeBC0a0Q" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_DOD6ivG7QmSpfUB8fVxIYQ" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_LiwzJORqQmSQI3gbPKBwuA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_K_koNQXna4P8T3ZlwBatnw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_K_koNQXna4P8T3ZlwBatnw"] .zpimage-container figure img { width: 1090px ; height: 671.48px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_K_koNQXna4P8T3ZlwBatnw"] .zpimage-container figure img { width:723px ; height:445.39px ; } } @media (max-width: 767px) { [data-element-id="elm_K_koNQXna4P8T3ZlwBatnw"] .zpimage-container figure img { width:415px ; height:255.65px ; } } [data-element-id="elm_K_koNQXna4P8T3ZlwBatnw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" 
data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-fit zpimage-tablet-fallback-fit zpimage-mobile-fallback-fit hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-09%20at%206.48.43%20pm.png" width="415" height="255.65" loading="lazy" size="fit" alt="Top 20 books by GPT-4 name cloze accuracy" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_a3xA-CbcQJepcHHoPuycYQ" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_a3xA-CbcQJepcHHoPuycYQ"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>Artificial intelligence systems like ChatGPT and GPT-4 have demonstrated impressive language skills, holding fluent conversations and answering questions on virtually any topic. But their inner workings remain largely opaque to users. These systems are &quot;black boxes&quot; - we know little about what knowledge they actually contain.</p><p><br></p><p>New research from the University of California, Berkeley sheds light on one slice of these models' knowledge: which books they have &quot;read&quot; and memorized. The study uncovers systematic biases in what texts AI systems know most about, with implications for how we should evaluate them.</p><p>The researchers focused specifically on works of fiction. They selected a sample of 571 English novels published between 1749 and 2020, containing literary classics along with contemporary bestsellers and award winners. The sample spanned mystery, romance, and science fiction genres as well as global Anglophone and African American literature.</p><p><br></p><p>For each book, the team extracted short passages of 40-60 words containing a single character name - but with the name removed. For instance, a passage from Pride and Prejudice might read &quot;______ entered the room and greeted her hosts warmly.&quot; Humans cannot guess the missing name from such brief context. But does the AI system know the name from having read the full book?</p><p><br></p><p>The researchers tested two systems, ChatGPT and GPT-4, by giving each passage and asking what single-word name belongs in the blank. 
The accuracy of each AI model on this challenging &quot;cloze&quot; task revealed what books it likely memorized.</p><p><br></p><p>The results illuminated clear biases. Both systems strongly favor science fiction and fantasy works like Lord of the Rings and Harry Potter over other genres. They excel at classic literature like Alice in Wonderland and Pride and Prejudice but fare poorly on modern award-winning diverse books. In short, they are more knowledgeable about popular texts.</p><p><br></p><p>What explains this imbalance? The researchers found it closely mirrors what's most duplicated across the internet. There is a strong correlation between AI accuracy on a book and the number of verbatim passages found through Google, Bing, and other sources. The models appear to &quot;know&quot; books in proportion to their web popularity.</p><p><br></p><p>This reliance on the internet has consequences. The study showed AI systems perform better at predicting a book's publication date and summarizing its passages when they have memorized the book. In other words, their reasoning is tied to memorization - causing disparities between popular versus niche texts.</p><p><br></p><p>These insights matter because AI systems like ChatGPT are increasingly used for applications like analyzing literature and human culture. If their knowledge comes largely from duplicated web text, focused on popular sci-fi and fantasy, how well can we trust their judgments about less mainstream books? Their skewed knowledge could propagate biases into downstream decisions.</p><p><br></p><p>The findings illustrate the challenges of opaque &quot;black box&quot; AI systems whose training data is secret. OpenAI, which created ChatGPT and GPT-4, has not revealed what texts were used to train them. This leaves us unable to fully assess their knowledge gaps.</p><p><br></p><p>The researchers argue we should instead push for more transparent, open-source AI systems whose training data is public knowledge. 
This allows us to better understand their strengths and weaknesses - illuminated through research like this study.</p><p><br></p><p>As AI models grow more capable and ubiquitous, it becomes only more important to peek inside their black boxes. Understanding what knowledge they contain helps ensure we build and apply them responsibly. Analyses of what systems like ChatGPT &quot;know&quot; about books mark an important step toward making AI more intelligible as it continues permeating our lives.</p><p><br></p><p>Sources:</p><p><a href="https://arxiv.org/pdf/2305.00118.pdf" title="arxiv" rel="">arxiv</a><br></p><p></p></div><p></p></div>
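The name-cloze evaluation at the heart of the study is simple to sketch. In the snippet below, `guess_name` stands in for a call to the model under test (e.g. an LLM API); the masking token and helper names are hypothetical:

```python
def make_cloze(passage, name, blank="[MASK]"):
    """Replace the character name in a passage with a blank,
    as in the study's name-cloze task."""
    return passage.replace(name, blank)

def name_cloze_accuracy(examples, guess_name):
    """Score a model on the name-cloze task.

    `examples` is a list of (passage_with_blank, true_name) pairs and
    `guess_name` maps a passage to the model's single-word prediction.
    """
    correct = sum(
        guess_name(passage).strip().lower() == name.lower()
        for passage, name in examples
    )
    return correct / len(examples)
```

Accuracy far above chance on a book's passages suggests the model has memorized that book; accuracy near chance suggests it has not.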
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 08:03:22 +1000</pubDate></item></channel></rss>