<?xml version="1.0" encoding="UTF-8" ?><!-- generator=Zoho Sites --><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><atom:link href="https://www.nownextlater.ai/Insights/tag/tokenizer/feed" rel="self" type="application/rss+xml"/><title>Now Next Later AI - Blog #Tokenizer</title><description>Now Next Later AI - Blog #Tokenizer</description><link>https://www.nownextlater.ai/Insights/tag/tokenizer</link><lastBuildDate>Wed, 26 Nov 2025 21:23:55 +1100</lastBuildDate><generator>http://zoho.com/sites/</generator><item><title><![CDATA[Language Model Tokenization Reveals Significant Disparities Across Languages: Implications for Businesses and Users]]></title><link>https://www.nownextlater.ai/Insights/post/language-model-tokenization-reveals-significant-disparities-across-languages-implications-for-busine</link><description><![CDATA[<img align="left" hspace="5" src="https://www.nownextlater.ai/Screenshot 2024-04-29 at 12.25.09 pm.png"/>In this article, we'll dive into a recent study that uncovers substantial disparities in the tokenization process used by language models across different languages.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_C3_ooyGQRiyFng1ZLDBhOw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_k3QeQrRoTvSjKQeOk60gvg" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_u8BcqDWMTWSgs5iEhH5Usg" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_F6-dDNDmBOusriRjt3xREQ" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"] .zpimage-container figure img { width: 500px ; height: 564.58px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"] .zpimage-container figure img { width:500px ; height:564.58px ; } } @media (max-width: 767px) { [data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"] .zpimage-container figure img { width:500px ; height:564.58px ; } } [data-element-id="elm_F6-dDNDmBOusriRjt3xREQ"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202024-04-29%20at%2012.25.09%E2%80%AFpm.png" width="500" height="564.58" loading="lazy" size="medium" alt="Premiums with respect to English on FLORES-200 for several English-centric models." data-lightbox="true"/></picture></span><figcaption class="zpimage-caption zpimage-caption-align-center"><span class="zpimage-caption-content">Premiums with respect to English on FLORES-200 for several English-centric models.</span></figcaption></figure></div>
</div><div data-element-id="elm_Ol3GZWqPS1quBc9elQjJng" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_Ol3GZWqPS1quBc9elQjJng"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-center " data-editor="true"><div style="color:inherit;text-align:left;"><div style="color:inherit;text-align:left;">In this article, we'll dive into a recent <a href="https://arxiv.org/pdf/2305.15425" title="study" rel="">study</a> that uncovers substantial disparities in the tokenization process used by language models across different languages. These disparities have significant implications for businesses and users, affecting the cost, latency, and quality of service when using AI-powered language technologies. By understanding these issues, business leaders can make more informed decisions about the adoption and deployment of language models and advocate for the development of more equitable solutions. <br></div></div><div style="color:inherit;text-align:left;"><br><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">The Importance of Tokenization in Language Models&nbsp;</span></p><p><br></p><p>Tokenization is the process of breaking down natural language text into smaller units called tokens, which are then used as input for language models. The choice of tokenization method can significantly impact a model's performance and efficiency. Subword tokenization, which breaks down complex words into smaller parts, has become the preferred approach for state-of-the-art language models.</p><p><br></p><p>However, the study revealed that even subword tokenization methods can lead to significant disparities in the number of tokens required to represent the same content across different languages. This has far-reaching consequences for businesses and users relying on language models for various applications.</p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;"><br></span></p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">Tokenization Disparities Across Languages&nbsp;</span></p><p><br></p><p>The researchers analyzed the tokenization process of several popular language models, including GPT-2, RoBERTa, and the tokenizers used by ChatGPT and GPT-4. They found that the number of tokens required to represent the same text can vary drastically across languages. For example:</p><ol><li>GPT-2 requires 3 times more tokens to represent the same content in Japanese compared to English.</li><li>The ChatGPT and GPT-4 tokenizers use 1.6 times more tokens for Italian, 2.6 times more for Bulgarian, and 3 times more for Arabic compared to English.</li><li>For Shan, a language spoken in Myanmar, the difference can be as high as 15 times compared to English.</li></ol><p><br></p><p>These disparities persist even in tokenizers specifically designed for multilingual support, with some language pairs showing a 4-fold difference in the number of tokens required.</p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;"><br></span></p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">Implications for Businesses and Users&nbsp;</span></p><p><br></p><p>The tokenization disparities across languages have significant implications for businesses and users:</p><ol><li>Cost: Many commercial language model services charge users per token. 
As a result, users of certain languages may end up paying significantly more for the same task compared to users of English or other more efficiently tokenized languages.</li><li>Latency: The number of tokens directly impacts the processing time for a task. Languages with longer tokenized representations can experience twice the latency compared to English, which may be critical for real-time applications like customer support or emergency services.</li><li>Long Context Processing: Language models often have a fixed context window, limiting the amount of text they can process at once. Users of more efficiently tokenized languages can work with much longer texts compared to users of languages with higher token counts, potentially leading to significant disparities in the quality of service.</li></ol><p><br></p><p><span style="font-family:&quot;Archivo Black&quot;, sans-serif;">The Path Forward: Multilingual Tokenization Fairness&nbsp;</span></p><p><br></p><p>To address these disparities and ensure more equitable access to language technologies, the researchers propose the concept of multilingual tokenization fairness. They argue that tokenizers should produce similar encoded lengths for the same content across languages. This can be achieved by:</p><ol><li>Recognizing that subword tokenization is necessary to achieve parity, as character-level and byte-level representations cannot fully address the issue.</li><li>Ensuring that tokenizers support all Unicode codepoints to handle characters from all languages.</li><li>Building a multilingually fair parallel corpus for training and evaluating tokenizers, with balanced representation of topics, named entities, and diverse translations.</li><li>Developing multilingually fair tokenizers by first training individual monolingual tokenizers for each target language and then merging them while maintaining parity.</li></ol><p>By adopting these principles, language model developers can create more equitable tokenizers that provide similar levels of service across languages, benefiting businesses and users worldwide.</p><p><br></p><p>As language models become increasingly integral to our daily lives, it is crucial that we prioritize fairness and inclusivity in their design and deployment. By understanding the implications of tokenization disparities and taking action to address them, business leaders can play a vital role in shaping a more equitable future for AI-powered language technologies.</p></div><p></p></div>
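<p><br></p><p>To make these premiums concrete, here is a minimal sketch (an illustration only, assuming the open-source tiktoken package and the cl100k_base encoding used by ChatGPT-era models; the sample sentences are rough translations, not text from the study's corpus) that counts how many tokens the same short sentence needs in several languages:</p><pre>
# Minimal sketch: compare token counts for the same sentence across languages.
# Assumes the open-source `tiktoken` package (pip install tiktoken); the
# sentences are illustrative translations, not drawn from the study.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by ChatGPT-era models

samples = {
    "English":   "Language models charge per token, not per word.",
    "Italian":   "I modelli linguistici fatturano per token, non per parola.",
    "Bulgarian": "Езиковите модели таксуват на токен, а не на дума.",
    "Japanese":  "言語モデルは単語ではなくトークン単位で課金されます。",
}

baseline = len(enc.encode(samples["English"]))
for language, text in samples.items():
    n_tokens = len(enc.encode(text))
    premium = n_tokens / baseline  # "premium" relative to English, as in the study
    print(f"{language}: {n_tokens} tokens (premium {premium:.1f}x)")
</pre><p>The exact numbers depend on the text and tokenizer, but the pattern matches the study's findings: the further a language sits from the tokenizer's training distribution, the more tokens, and therefore cost and latency, the same content requires.</p>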
</div><div data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"] .zpimage-container figure img { width: 500px ; height: 500.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"] .zpimage-container figure img { width:500px ; height:500.00px ; } } @media (max-width: 767px) { [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"] .zpimage-container figure img { width:500px ; height:500.00px ; } } [data-element-id="elm_BzWdTrdv9UYRWiOUuFgGVw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/introduction-to-large-language-models-for-business-leaders-book" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/12.png" width="500" height="500.00" loading="lazy" size="medium" alt="Introduction to LLMs for Leaders"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Mon, 29 Apr 2024 12:28:39 +1000</pubDate></item><item><title><![CDATA[A New LLM for Finance: BloombergGPT]]></title><link>https://www.nownextlater.ai/Insights/post/A-New-LLM-for-Finance-BloombergGPT</link><description><![CDATA[When designing BloombergGPT, a new large language model optimized for financial data, the researchers at Bloomberg made an important decision around tokenization.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_LgERJvdpQnmGEWGJ0BF4iQ" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_dlLM9Wo0TcqY30xfrQFVJw" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_-qyUhWkLTuK7dJU7MqTs4g" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"> [data-element-id="elm_-qyUhWkLTuK7dJU7MqTs4g"].zpelem-col{ border-radius:1px; } </style><div data-element-id="elm_HuJZ70YsFrNImBOyCB3h1A" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_HuJZ70YsFrNImBOyCB3h1A"] .zpimage-container figure img { width: 500px ; height: 327.53px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_HuJZ70YsFrNImBOyCB3h1A"] .zpimage-container figure img { width:500px ; height:327.53px ; } } @media (max-width: 767px) { [data-element-id="elm_HuJZ70YsFrNImBOyCB3h1A"] .zpimage-container figure img { width:500px ; height:327.53px ; } } [data-element-id="elm_HuJZ70YsFrNImBOyCB3h1A"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-10%20at%201.14.54%20pm.png" width="500" height="327.53" loading="lazy" size="medium" alt="Using BloombergGPT to generate short headline suggestions in a three-shot setting." data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_OuSKq5B78wofaYj1lQFryA" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_OuSKq5B78wofaYj1lQFryA"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p>Financial technology companies are increasingly turning to artificial intelligence to help analyze data, automate processes, and improve decision making. Bloomberg recently unveiled a new AI system called BloombergGPT that is specially designed to understand financial and business language.</p><p>BloombergGPT is what's known as a large language model (LLM).&nbsp;</p><p><br></p><p>LLMs are AI systems that are trained on massive amounts of text data so they can generate human-like text and engage in tasks like question answering and summarization. They have become very popular in recent years thanks to advances in computing power and AI techniques. Some well-known examples are systems like GPT-4 from OpenAI and BLOOM from Anthropic.</p><p><br></p><p>What makes BloombergGPT different is that it was explicitly trained on financial data sources curated by Bloomberg analysts over decades. This includes 360 billion tokens (words) from sources like financial news, earnings reports, regulatory filings, press releases, and more. The goal was to create an AI system optimized for understanding nuanced financial language.</p><p><br></p><p>To supplement the financial data, BloombergGPT was also trained on 345 billion tokens of more general text from publicly available sources like Wikipedia, books, academic papers, and web content. The combination of financial and general data was intended to make the system adept at both financial tasks and general natural language processing abilities.</p><p><br></p><p>In terms of its technical details, BloombergGPT contains 50 billion parameters. Parameters refer to the adjustable settings inside the model that are tuned during training. More parameters allow the system to learn more complex patterns and relationships. For comparison, GPT-3 has 175 billion parameters.</p><p><br></p><p>BloombergGPT uses an architecture based on the transformer, which is the most common framework used today for large language models. Transformers process text by looking at the entire context rather than processing words one-by-one. This allows them to develop a more holistic understanding of language.</p><p><br></p><p>During training, BloombergGPT was optimized using techniques like activation checkpointing and mixed precision to lower memory usage and increase speed. The result was a system capable of 102 teraflops, or 102 trillion floating point operations per second. This level of compute power was needed to effectively train the 50 billion parameter model.</p><p><br></p><p>To evaluate the capabilities of BloombergGPT, the researchers tested it on a range of financial natural language tasks as well as standard AI benchmarks. On financial tasks, BloombergGPT achieved state-of-the-art results, outperforming other models by significant margins. It performed very well on financial question answering, named entity recognition, and sentiment analysis.</p><p><br></p><p>On general benchmarks, BloombergGPT proved competitive with some models over 100 times its size. 
While it wasn't always the top performer, it consistently outscored similarly sized models on benchmarks measuring abilities like reasoning, reading comprehension and common sense knowledge.</p><p><br></p><p>The researchers attribute BloombergGPT's effectiveness to three main factors. First and foremost was the high-quality domain-specific data. Second, they believe <a href="https://www.nownextlater.ai/Insights/post/turning-text-into-data-how-bloomberggpt-tokenizes-financial-language" title="their choice of tokenizer" rel="">their choice of tokenizer</a> - the system responsible for breaking text into pieces the model can process - was beneficial. Finally, the model architecture and training techniques allowed them to efficiently train a model competitive with far larger systems.</p><p></p><p><br></p><p>The researchers highlight that creating an AI system of this scale still poses challenges. Instabilities can arise during training that require careful monitoring and intervention. They logged their training process in detail to aid future work.</p><p><br></p><p>Additionally, the authors considered ethical issues like potential biases in financial data and possible misuse of the system. Bloomberg has extensive procedures in place to reduce risks and ensure responsible AI development. However, the company chose not to publicly release the model to minimize chances of data leakage or misuse.</p><p><br></p><p>In conclusion, BloombergGPT represents a milestone for domain-specific natural language AI. Its training process and strong performance on financial tasks demonstrate the value of curated in-domain data. While specialized, the system retains competitive general abilities as well. As financial institutions continue adopting AI, expect systems like BloombergGPT to play an increasing role in driving automation and insights.</p><p><br></p><p>Sources:</p><a href="https://arxiv.org/pdf/2303.17564.pdf" title="BloombergGPT: A Large Language Model for Finance" rel="">BloombergGPT: A Large Language Model for Finance</a></div></div><p></p></div>
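<p><br></p><p>As a rough illustration of the memory-saving techniques mentioned above, the PyTorch sketch below combines activation checkpointing with mixed-precision autocasting on a single toy transformer layer. It is not Bloomberg's training code, which has not been released; it only shows the two ideas the paper names, at toy scale:</p><pre>
# Rough sketch of two memory-saving techniques mentioned in the article:
# activation checkpointing and mixed precision. Illustrative toy only,
# not BloombergGPT's actual (unreleased) training code.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
x = torch.randn(2, 16, 64, requires_grad=True)  # (batch, sequence, hidden)

# Mixed precision: run the forward pass in bfloat16 where safe.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Activation checkpointing: skip storing intermediate activations and
    # recompute them during the backward pass, trading compute for memory.
    y = checkpoint(block, x, use_reentrant=False)

y.float().sum().backward()  # gradients still flow despite recomputation
print(x.grad.shape)
</pre><p>Checkpointing trades extra compute for lower memory by recomputing activations on the backward pass, while autocasting runs much of the forward pass in a lower-precision format; together they are standard ways to fit larger models onto the same hardware.</p>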
</div><div data-element-id="elm_JFp2iWupUd22UPKvlveDzw" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_JFp2iWupUd22UPKvlveDzw"] .zpimage-container figure img { width: 800px ; height: 344.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_JFp2iWupUd22UPKvlveDzw"] .zpimage-container figure img { width:500px ; height:215.00px ; } } @media (max-width: 767px) { [data-element-id="elm_JFp2iWupUd22UPKvlveDzw"] .zpimage-container figure img { width:500px ; height:215.00px ; } } [data-element-id="elm_JFp2iWupUd22UPKvlveDzw"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/aibooks" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Untitled%20design%20-4-.png" width="500" height="215.00" loading="lazy" size="large"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 13:43:14 +1000</pubDate></item><item><title><![CDATA[Turning Text into Data: How BloombergGPT Tokenizes Financial Language]]></title><link>https://www.nownextlater.ai/Insights/post/turning-text-into-data-how-bloomberggpt-tokenizes-financial-language</link><description><![CDATA[When designing BloombergGPT, a new large language model optimized for financial data, the researchers at Bloomberg made an important decision around tokenization.]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_PpY_r34GRc6y7cF_plYJgw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_BAutgZsvRw2AA4u_wjzjUA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_kU6fS6xMRwKzNS0fF1BgrA" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_ybwgRWJlSROZfRXPob3qLg" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_ybwgRWJlSROZfRXPob3qLg"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><p>One of the fundamental challenges in natural language processing is transforming free-form human language into data that machines can understand. The first step in this process is called tokenization – breaking down sentences and documents into small chunks or “tokens”.</p><p><br></p><p>When designing <a href="https://www.nownextlater.ai/Insights/post/A-New-LLM-for-Finance-BloombergGPT" title="BloombergGPT, a new large language model optimized for financial data" rel="">BloombergGPT, a new large language model optimized for financial data</a>, the researchers at Bloomberg made an important decision around tokenization. They opted to use a method called unigram tokenization rather than the more standard approach of byte pair encoding (BPE).</p><p></p><p>To understand this choice, it helps to first understand what tokenization involves and the strengths of different techniques.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">The Role of the Tokenizer</span></p><p></p><p><br></p><p>The job of a tokenizer is to split input text into tokens that contain one or more characters. Tokens become the basic unit that the model manipulates when processing language.</p><p><br></p><p>With English text, an obvious tokenization would be to split on spaces to get words and punctuation. But for machine learning, it’s common to break words down even further into subword units.</p><p><br></p><p>This offers two main advantages. First, it limits the size of the vocabulary that the model needs to represent. For example, instead of separate tokens for “training” and “trainer”, they can share a common root like “train”.</p><p><br></p><p>Second, the model can process words it hasn’t seen before by recognizing their subword components. This helps the model generalize.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Byte Pair Encoding</span></p><p></p><p><br></p><p>The most popular subword tokenization algorithm used in NLP is byte pair encoding (BPE). 
Originally developed for data compression, it builds up a vocabulary by scanning text and greedily merging frequent pairs of tokens.</p><p><br></p><p>BPE starts by assigning each character as a token. It then iteratively merges the most common pair of tokens until reaching a target vocabulary size. Frequent pairs, such as “t” and “h” merging into “th”, get combined early on.</p><p><br></p><p>This simple frequency-based approach works decently well in practice. BPE is fast to train and produces a vocabulary with reusable subword chunks.</p><p><br></p><p>However, BPE never learns a probabilistic model of tokens. Each merge decision is hard-coded based on frequency alone, and at test time BPE tokenizes text greedily by replaying this fixed sequence of merges.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Unigram Tokenization</span></p><p></p><p></p><p><br></p><p>BloombergGPT implements a more advanced approach called unigram tokenization. A unigram tokenizer models the probability of tokens directly using techniques from statistical language modeling.</p><p><br></p><p>The probabilities allow unigram tokenizers to capture uncertainty and make “soft” decisions when tokenizing new text. Rather than greedily replaying a fixed list of merges, it chooses the segmentation of the input that is most probable under the learned token probabilities.</p><p><br></p><p>Unigram tokenizers are first trained on large datasets to learn these token probabilities. The training process gradually removes unlikely token candidates until reaching the target vocabulary size.</p><p><br></p><p>The main advantage over BPE is that unigram keeps the probabilistic model, which allows for smarter segmentation decisions at inference time.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Implementation in BloombergGPT</span></p><p></p><p><br></p><p>For BloombergGPT, the unigram tokenizer was trained specifically on The Pile, a diverse dataset containing both general and domain-specific data.</p><p>After experimenting with different sizes, Bloomberg settled on a 2<sup>17</sup>-token vocabulary (about 131,000 tokens). This is larger than typical for NLP models.</p><p>The researchers argue the customized tokenizer better captures financial terminology, and the large vocabulary encodes more meaning within each token.</p><p><br></p><p>In initial evaluations, BloombergGPT’s unigram tokenizer reduced the size of encoded text compared to BPE and other tokenizers. This suggests it forms an efficient token vocabulary for financial language.</p><p><br></p><p><span style="font-family:&quot;Oswald&quot;, sans-serif;font-size:16px;">Tradeoffs and Considerations</span></p><p></p><p><br></p><p>Unigram tokenization has tradeoffs to consider. The probabilistic modeling approach requires more data and compute to train compared to BPE. The tokenization decisions are also less explainable than simple frequency counts.</p><p><br></p><p>Additionally, vocabularies tailored for specific domains may not generalize as well to other data. So there are open questions around how to balance domain-specific and general tokenization.</p><p><br></p><p>The choice ultimately depends on use cases. For a financial model like BloombergGPT, the benefits of smarter domain-aware tokenization appear to outweigh the costs. But simple, fast methods like BPE remain appealing in many scenarios.</p><p><br></p><p>As language models continue advancing, we’re likely to see more specialized tokenization strategies like unigram. 
The tokenizer plays a key role in how systems represent and process language. Improvements here translate to downstream gains, enabling models like BloombergGPT to push the state-of-the-art in domains like finance.</p></div><p></p></div>
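<p><br></p><p>For readers who want to see the two algorithms side by side, here is a small sketch using the open-source sentencepiece library (an assumption for illustration; this is not Bloomberg's training setup, and the toy corpus and vocabulary size are invented). It trains a tiny BPE model and a tiny unigram model on the same text and compares how each segments a financial sentence:</p><pre>
# Minimal sketch contrasting BPE and unigram tokenization with sentencepiece
# (pip install sentencepiece). Illustrates the two algorithms only; it does
# not reproduce BloombergGPT's tokenizer, which was trained on The Pile at
# far larger scale.
import sentencepiece as spm

# A tiny stand-in corpus; real tokenizers are trained on billions of tokens.
sentences = [
    "The company reported quarterly earnings above analyst estimates.",
    "Shares of the issuer fell after the regulatory filing was released.",
    "The central bank raised interest rates to curb inflation.",
    "Revenue growth slowed while operating margins improved slightly.",
    "Bond yields climbed as investors weighed the earnings outlook.",
    "The merger agreement values the target company at a premium.",
]
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences * 50))

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix=model_type,
        model_type=model_type, vocab_size=200)  # small vocab for this toy corpus
    sp = spm.SentencePieceProcessor(model_file=f"{model_type}.model")
    pieces = sp.encode("The quarterly earnings beat analyst estimates.", out_type=str)
    print(model_type, pieces)
</pre><p>On a realistic corpus the difference is more pronounced: the unigram model selects the globally most probable segmentation under its learned token probabilities, while BPE simply replays its learned merges in order.</p>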
</div><div data-element-id="elm_2jK8b-RORhTos0e4wduTqg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_2jK8b-RORhTos0e4wduTqg"] .zpimage-container figure img { width: 800px ; height: 344.00px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_2jK8b-RORhTos0e4wduTqg"] .zpimage-container figure img { width:500px ; height:215.00px ; } } @media (max-width: 767px) { [data-element-id="elm_2jK8b-RORhTos0e4wduTqg"] .zpimage-container figure img { width:500px ; height:215.00px ; } } [data-element-id="elm_2jK8b-RORhTos0e4wduTqg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-large zpimage-tablet-fallback-large zpimage-mobile-fallback-large "><figure role="none" class="zpimage-data-ref"><a class="zpimage-anchor" href="/aibooks" target="" rel=""><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Untitled%20design%20-4-.png" width="500" height="215.00" loading="lazy" size="large"/></picture></a></figure></div>
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 13:43:14 +1000</pubDate></item><item><title><![CDATA[What is Tokenization? Let's Explore, Using Novel AI's New Tokenizer as a Use Case]]></title><link>https://www.nownextlater.ai/Insights/post/does-novel-ai-s-new-tokenizer-really-boost-storytelling</link><description><![CDATA[What is Tokenization? Let's Explore, Using Novel AI's New Tokenizer as a Use Case]]></description><content:encoded><![CDATA[<div class="zpcontent-container blogpost-container "><div data-element-id="elm_j92KpyjVQguMwwTuLUblCw" data-element-type="section" class="zpsection "><style type="text/css"></style><div class="zpcontainer-fluid zpcontainer"><div data-element-id="elm_y3ppOIQmQPatJLZsEqQmHA" data-element-type="row" class="zprow zprow-container zpalign-items- zpjustify-content- " data-equal-column=""><style type="text/css"></style><div data-element-id="elm_rPQvraI_QoWt9m8irgPCow" data-element-type="column" class="zpelem-col zpcol-12 zpcol-md-12 zpcol-sm-12 zpalign-self- "><style type="text/css"></style><div data-element-id="elm_SotbxYsxTU7FCItxtgBEhg" data-element-type="image" class="zpelement zpelem-image "><style> @media (min-width: 992px) { [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"] .zpimage-container figure img { width: 500px ; height: 492.01px ; } } @media (max-width: 991px) and (min-width: 768px) { [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"] .zpimage-container figure img { width:500px ; height:492.01px ; } } @media (max-width: 767px) { [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"] .zpimage-container figure img { width:500px ; height:492.01px ; } } [data-element-id="elm_SotbxYsxTU7FCItxtgBEhg"].zpelem-image { border-radius:1px; } </style><div data-caption-color="" data-size-tablet="" data-size-mobile="" data-align="center" data-tablet-image-separate="false" data-mobile-image-separate="false" class="zpimage-container zpimage-align-center zpimage-size-medium zpimage-tablet-fallback-medium zpimage-mobile-fallback-medium hb-lightbox " data-lightbox-options="
                type:fullscreen,
                theme:dark"><figure role="none" class="zpimage-data-ref"><span class="zpimage-anchor" role="link" tabindex="0" aria-label="Open Lightbox" style="cursor:pointer;"><picture><img class="zpimage zpimage-style-none zpimage-space-none " src="/Screenshot%202023-08-10%20at%2011.38.41%20am.png" width="500" height="492.01" loading="lazy" size="medium" alt="Novel AI's Tokenizer" data-lightbox="true"/></picture></span></figure></div>
</div><div data-element-id="elm_M6sQVlnHS5-6esy1xFZy5w" data-element-type="text" class="zpelement zpelem-text "><style> [data-element-id="elm_M6sQVlnHS5-6esy1xFZy5w"].zpelem-text { border-radius:1px; } </style><div class="zptext zptext-align-left " data-editor="true"><div style="color:inherit;"><div style="color:inherit;"><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Tokenization is a foundational step in natural language processing (NLP) and machine learning.&nbsp;</span></p><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Large Language Models are big statistical calculators that work with numbers, not words. Tokenisation converts the words into numbers, with each number representing a position in a dictionary of all the possible words.</span></p><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Tokenization breaks down a piece of text into smaller units, called &quot;tokens.&quot; These tokens can represent whole words, parts of words, or even multiple words in some languages. For instance, the sentence &quot;ChatGPT is fun!&quot; might be broken down into tokens like [&quot;Chat&quot;, &quot;G&quot;, &quot;PT&quot;, &quot; is&quot;, &quot; fun&quot;, &quot;!&quot;].</span></p><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">You can choose from multiple tokenization methods, but it's crucial to consistently use the same method during both training and text generation.</span></p><h1 style="margin-bottom:12pt;"><span style="font-size:20pt;font-weight:400;">Why is this important for large language models?</span></h1><ol><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Understanding Context:</span><span style="font-size:10.5pt;font-weight:400;"> By breaking text into tokens, the model can process and understand the context around each token. It's like looking at each puzzle piece and understanding where it might fit in the bigger picture.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Efficiency: </span><span style="font-size:10.5pt;font-weight:400;">Language models have a limit to how many tokens they can process at once. By tokenizing text, they can manage and process information more efficiently.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Flexibility:</span><span style="font-size:10.5pt;font-weight:400;"> Different languages have different structures. Tokenization allows these models to be flexible and work with multiple languages. For example, in English, spaces often separate words, but in languages like Chinese, words are often clustered together without spaces. 
Tokenization helps the model handle such variations.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Training: </span><span style="font-size:10.5pt;font-weight:400;">When training these models on vast amounts of text, tokenization ensures that the model learns from consistent and standardized units of text.</span></p></li></ol><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">In essence, for large language models, tokenization is a foundational step that allows them to read, understand, and generate human-like text across various languages and contexts.</span></p><h1 style="margin-bottom:10pt;"><span style="font-size:20pt;font-weight:400;">Trade-offs</span></h1><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Different tokenization strategies come with their own sets of trade-offs. Let's delve into some of these key trade-offs:</span></p><ol><li style="font-size:10.5pt;font-weight:700;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Granularity:</span></p></li></ol><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Fine-grained (e.g., character-level):</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can handle any word or term, even if it's never seen it before. It's very flexible and can be language-agnostic.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Requires more tokens to represent a text, which can be computationally expensive and may not capture semantic meanings as effectively.</span></p></li></ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Coarse-grained (e.g., word-level):</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can capture semantic meanings more directly and is often more efficient in terms of the number of tokens used.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Struggles with out-of-vocabulary words and might not be as flexible across different languages.</span></p></li></ul></ul><ol start="2"><li style="font-size:10.5pt;font-weight:700;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Language Dependence:</span></p></li></ol><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Language-specific tokenizers:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Optimized for a particular language, capturing its nuances and structures effectively.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Not versatile. 
A separate tokenizer would be needed for each language, which isn't scalable for multilingual models.</span></p></li></ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Language-agnostic tokenizers:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can be used across multiple languages, making them ideal for multilingual models.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Might not capture the intricacies of each individual language as effectively as a language-specific tokenizer.</span></p></li></ul></ul><ol start="3"><li style="font-size:10.5pt;font-weight:700;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Fixed vs. Dynamic Vocabulary:</span></p></li></ol><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Fixed Vocabulary:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Deterministic and consistent in its tokenization. Easier to manage and deploy.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: Struggles with out-of-vocabulary terms and might become outdated as language evolves.</span></p></li></ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Dynamic (or adaptive) Vocabulary:</span></p></li><ul><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Pros: Can adjust to new terms or slang, making it more flexible and up-to-date.</span></p></li><li style="font-size:11pt;font-weight:400;margin-left:36pt;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:400;">Cons: More complex to implement and might introduce inconsistencies over time.</span></p></li></ul></ul><ol start="4"><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Efficiency vs. Coverage:</span><span style="font-size:10.5pt;font-weight:400;"> Some tokenizers aim for maximum coverage, ensuring they can handle any text thrown at them. Others prioritize efficiency, using the fewest tokens possible to represent a text. There's a balance to strike here: more coverage can mean more computational overhead, while prioritizing efficiency might mean sacrificing the ability to handle rare terms.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Complexity and Overhead: </span><span style="font-size:10.5pt;font-weight:400;">Advanced tokenization methods, like Byte-Pair Encoding (BPE) or SentencePiece, can handle a wide range of text types and languages. 
However, they introduce additional computational and implementation overhead compared to simpler methods.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Consistency:</span><span style="font-size:10.5pt;font-weight:400;"> Some tokenization methods might tokenize the same text differently based on context, leading to potential inconsistencies. While this can be beneficial in capturing nuanced meanings, it can also introduce unpredictability in the model's behavior.</span></p></li></ol><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Choosing a tokenizer for a large language model involves weighing these trade-offs based on the specific goals and constraints of the project. Whether the priority is multilingual support, computational efficiency, or capturing linguistic nuances, understanding these trade-offs is crucial in making an informed decision.</span></p><h1 style="margin-bottom:12pt;"><span style="font-size:20pt;font-weight:400;">Novel AI's Tokenizer to </span><span style="font-size:20pt;font-weight:400;font-style:italic;">&quot;enable stronger storytelling capabilities&quot;</span></h1><p style="margin-bottom:12pt;"><span style="font-size:11pt;font-weight:400;">Novel AI recently developed a custom tokenizer for their AI models. The </span><a href="https://github.com/NovelAI/novelai-tokenizer"><span style="font-size:11pt;font-weight:400;text-decoration:underline;">GitHub project</span></a><span style="font-size:11pt;font-weight:400;"> detailing their process provides an inside look at engineering tradeoffs like vocabulary size, compression ratio, and handling numerals. The focus of this project is to build a tokenizer that enables stronger storytelling capabilities. Let's explore!</span></p><p style="margin-bottom:12pt;"><span style="font-size:11pt;font-weight:400;">On the surface, the new tokenizer offers clear advantages:</span></p><ol><li style="font-size:11pt;font-weight:400;"><p><span style="font-size:11pt;font-weight:700;">Improved Granularity and Flexibility:</span><span style="font-size:11pt;font-weight:400;"> Unlike traditional tokenizers, this one offers a balance between </span><span style="font-size:11pt;font-weight:400;text-decoration:underline;background-color:rgb(2, 184, 187);">word-level and subword-level tokenization</span><span style="font-size:11pt;font-weight:400;">. By breaking down words into meaningful fragments, it can better understand and generate nuanced text. This is especially crucial for storytelling where context, nuance, and subtlety matter.</span></p></li></ol><br><ol start="2"><li style="font-size:11pt;font-weight:400;"><p><span style="font-size:11pt;font-weight:700;">Compression Ratio: </span><span style="font-size:11pt;font-weight:400;">A higher compression ratio means the model can process and understand larger chunks of text at once. </span><span style="font-size:11pt;font-weight:400;text-decoration:underline;background-color:rgb(2, 184, 187);">This is vital for maintaining context in long narratives or when referencing earlier parts of a story</span><span style="font-size:11pt;font-weight:400;">. By achieving a 7-19% higher compression ratio than the LLaMa tokenizer on significant parts of the English dataset, it's evident that the tokenizer is more efficient. 
This efficiency can translate to richer and more coherent narratives, especially in longer stories.</span></p></li></ol><br><ol start="3"><li style="font-size:11pt;font-weight:400;"><p><span style="font-size:11pt;font-weight:700;">Adaptability and Evolution: </span><span style="font-size:11pt;font-weight:400;">The iterative approach to tokenizer training, with multiple runs and rebalancing, ensures that the tokenizer is optimized for the specific nuances of your dataset. This adaptability is key for evolving storytelling styles and trends.</span></p></li></ol><br><p><span style="font-size:10.5pt;font-weight:400;">Some other pros are mentioned. These are more specific to the Novel AI project:</span></p><br><ol start="4"><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Multilingual Capabilities:</span><span style="font-size:10.5pt;font-weight:400;"> By accommodating both English and Japanese from the start, the tokenizer is designed for bilingual storytelling. This means it can seamlessly switch between languages or even blend them, offering richer narratives and reaching a broader audience.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Efficient Handling of Unicode Characters: </span><span style="font-size:10.5pt;font-weight:400;">The ability to natively tokenize Unicode characters, especially emojis, allows for more expressive storytelling. Emojis, in modern communication, can convey emotions, context, and tone, making them valuable in narratives. But they are less relevant to novel writing.</span></p></li><li style="font-size:10.5pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:10.5pt;font-weight:700;">Numeric Understanding: </span><span style="font-size:10.5pt;font-weight:400;">Tokenizing numbers digit by digit enhances the model's capability to understand and manipulate numeric values. This is crucial for stories that involve dates, quantities, or any numerical context.</span></p></li></ol><br><p style="margin-bottom:12pt;"><span style="font-size:11pt;font-weight:700;">Disadvantages:</span></p><ol><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Complexity and Maintenance:</span><span style="font-size:11pt;font-weight:400;"> Training the tokenizer added development time and complexity.</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">BPE vs. Unigram: </span><span style="font-size:11pt;font-weight:400;">The decision to choose BPE over Unigram was based on compression ratio. While BPE might offer better compression, Unigram might provide more natural word segmentations. <span style="background-color:rgb(2, 184, 187);">The storytelling quality might be affected if the tokenizer doesn't segment words in a way that's intuitive to human readers.</span></span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Multilingual Limitations: </span><span style="font-size:11pt;font-weight:400;">While accommodating both English and Japanese is a strength, it might also be a limitation. 
The tokenizer might be overly specialized for these two languages, potentially making it less effective for other languages or multilingual contexts beyond English and Japanese.</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Vocabulary Size: </span><span style="font-size:11pt;font-weight:400;">The decision to use a vocabulary size of 65535 tokens, while efficient from a computational standpoint, might introduce limitations. Is this size sufficient to capture the nuances of both English and Japanese, especially given the richness of the Japanese writing system?</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Numeric Tokenization:</span><span style="font-size:11pt;font-weight:400;"> Tokenizing numbers digit by digit can indeed improve the model's understanding of numeric values. However, it might also make the model less adept at recognizing larger numerical patterns or relationships between numbers.</span></p></li><li style="font-size:11pt;font-weight:400;"><p style="margin-bottom:10pt;"><span style="font-size:11pt;font-weight:700;">Handling of Unicode Characters:</span><span style="font-size:11pt;font-weight:400;"> While the ability to natively tokenize Unicode characters is a strength, there's a potential for overfitting or misinterpretation. Emojis and other Unicode characters can have different meanings in different contexts or cultures. Relying heavily on them might lead to misunderstandings in generated narratives.</span></p></li></ol></div></div><p></p></div>
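<p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">To ground these trade-offs, here is a toy sketch using the open-source Hugging Face tokenizers library (the training sentences, vocabulary size, and settings are invented for illustration and are not NovelAI's actual configuration). It trains a small BPE tokenizer that, like the NovelAI tokenizer, splits numbers into individual digits, and then shows how an unseen word falls back to subword pieces:</span></p><pre>
# Toy sketch of subword tokenization with digit-by-digit number handling,
# in the spirit of the NovelAI tokenizer described above. Uses the open-source
# Hugging Face `tokenizers` library; the training sentences and vocabulary
# size are invented for illustration and are not NovelAI's actual settings.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Split on whitespace first, then break every number into individual digits.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Whitespace(),
    pre_tokenizers.Digits(individual_digits=True),
])

training_text = [
    "The storyteller trained and retrained the model.",
    "Training a tokenizer is part of training a language model.",
    "In 2023 the story was 1200 pages long.",
]
trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(training_text, trainer)

# An unseen word is split into known subword pieces; numbers become digits.
print(tokenizer.encode("retraining the storyteller in 2030.").tokens)
</pre><p style="margin-bottom:12pt;"><span style="font-size:10.5pt;font-weight:400;">Even this toy example illustrates the central trade-off: a richer vocabulary compresses text into fewer tokens, while digit-by-digit and subword fallbacks keep the tokenizer robust to words and numbers it has never seen.</span></p>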
</div></div></div></div></div></div> ]]></content:encoded><pubDate>Thu, 10 Aug 2023 11:43:18 +1000</pubDate></item></channel></rss>