Turning Text into Data: How BloombergGPT Tokenizes Financial Language

10.08.23 By Ines Almeida

One of the fundamental challenges in natural language processing is transforming free-form human language into data that machines can understand. The first step in this process is called tokenization – breaking down sentences and documents into small chunks or “tokens”.


When designing BloombergGPT, a new large language model optimized for financial data, the researchers at Bloomberg made an important decision around tokenization. They opted to use a method called unigram tokenization rather than the more standard approach of byte pair encoding (BPE).

To understand this choice, it helps to first understand what tokenization involves and the strengths of different techniques.


The Role of the Tokenizer


The job of a tokenizer is to split input text into tokens that contain one or more characters. Tokens become the basic unit that the model manipulates when processing language.


With English text, an obvious tokenization would be to split on spaces to get words and punctuation. But for machine learning, it’s common to break words down even further into subword units.


This offers two main advantages. First, it limits the size of the vocabulary that the model needs to represent. For example, instead of separate tokens for “training” and “trainer”, they can share a common root like “train”.


Second, the model can process words it hasn’t seen before by recognizing their subword components. This helps the model generalize.
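As a toy illustration of both points (not any real production tokenizer), here is a greedy longest-match splitter over a small handmade subword vocabulary. Learned tokenizers like BPE and unigram build their vocabularies from data, but the effect is similar: related words share a root, and unseen words decompose into known pieces.

```python
# Toy greedy longest-match subword splitter. Illustrative only: real
# tokenizers (BPE, unigram) learn their vocabularies from large corpora.
VOCAB = {"train", "ing", "er", "re", "t", "r", "a", "i", "n", "e"}

def split_subwords(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until a match.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to emitting it as its own token.
            pieces.append(word[i])
            i += 1
    return pieces

print(split_subwords("training"))    # ['train', 'ing']
print(split_subwords("trainer"))     # ['train', 'er']
print(split_subwords("retraining"))  # ['re', 'train', 'ing'] -- unseen word, known pieces
```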


Byte Pair Encoding


The most popular subword tokenization algorithm used in NLP is byte pair encoding (BPE). Originally developed for data compression, it builds up a vocabulary by scanning text and greedily merging frequent pairs of tokens.


BPE starts by treating each character as a token. It then iteratively merges the most frequent adjacent pair of tokens until it reaches a target vocabulary size. Common pairs such as “t” + “h” (forming “th”) and then “th” + “e” (forming “the”) get merged early on.
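A minimal sketch of that training loop follows: count adjacent pairs across a toy word list and repeatedly fuse the most frequent one. Real implementations add pre-tokenization, byte-level handling, and many optimizations.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Start with each word split into single characters.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere it occurs.
        merged = best[0] + best[1]
        for tokens in corpus:
            i = 0
            while i < len(tokens) - 1:
                if (tokens[i], tokens[i + 1]) == best:
                    tokens[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# Toy corpus; each learned merge becomes a reusable subword chunk.
print(train_bpe(["lower", "lowest", "newer", "wider"], num_merges=5))
```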


This simple frequency-based approach works well in practice. BPE is fast to train and produces a vocabulary of reusable subword chunks.


However, BPE never builds a probabilistic model of tokens. Each merge decision is hard-coded based on frequency alone, and at inference time BPE tokenizes text greedily by replaying this fixed list of merges.


Unigram Tokenization


BloombergGPT implements a more advanced approach called unigram tokenization. A unigram tokenizer models the probability of tokens directly using techniques from statistical language modeling.


The probabilities allow unigram tokenizers to weigh alternative splits against each other. Rather than greedily applying a fixed list of merges, a unigram tokenizer chooses the segmentation with the highest overall probability under the learned token probabilities.
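A minimal sketch of how that most probable segmentation can be found with dynamic programming (Viterbi search) is below. The log-probabilities here are handmade for illustration; a real unigram tokenizer such as SentencePiece learns them from data with an EM procedure.

```python
import math

# Handmade unigram log-probabilities (illustrative values, not learned).
LOGP = {
    "hedge": math.log(0.03),
    "fund": math.log(0.02),
    "hedgefund": math.log(0.0001),
    "h": math.log(0.01), "e": math.log(0.01), "d": math.log(0.01),
    "g": math.log(0.01), "f": math.log(0.01), "u": math.log(0.01),
    "n": math.log(0.01),
}

def viterbi_segment(text: str) -> list[str]:
    """Return the segmentation maximizing the sum of token log-probabilities."""
    n = len(text)
    best = [(-math.inf, 0)] * (n + 1)  # (score, backpointer) for each position
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in LOGP:
                score = best[start][0] + LOGP[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the backpointers to recover the winning token sequence.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]

print(viterbi_segment("hedgefund"))  # ['hedge', 'fund'] under these toy probabilities
```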


Unigram tokenizers are first trained on large datasets to learn these token probabilities. The training process gradually removes unlikely token candidates until reaching the target vocabulary size.


The main advantage over BPE is that unigram keeps its probabilistic model, which allows it to pick a globally optimal segmentation at inference time instead of committing to greedy, frequency-based merges.


Implementation in BloombergGPT


For BloombergGPT, the unigram tokenizer was trained specifically on The Pile, a diverse dataset containing both general and domain-specific data.

After experimenting with different sizes, Bloomberg settled on a 2¹⁷-token vocabulary (about 131,000 tokens). This is larger than is typical for NLP models.
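For a sense of what training such a tokenizer looks like in practice, here is a hedged sketch using the open-source SentencePiece library. This is not Bloomberg’s actual tooling or configuration: “corpus.txt” is a placeholder path, and only the model type (unigram) and the 2¹⁷ vocabulary size mirror what the paper describes.

```python
# Hedged sketch: training a unigram tokenizer with SentencePiece.
# "corpus.txt" is a placeholder file; a vocabulary this large needs a
# correspondingly large training corpus. Not Bloomberg's actual setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",         # one sentence per line (placeholder path)
    model_prefix="unigram_fin",
    model_type="unigram",       # unigram LM tokenizer, as in BloombergGPT
    vocab_size=131072,          # 2^17 tokens
)

# Load the trained model and tokenize a sample financial sentence.
sp = spm.SentencePieceProcessor(model_file="unigram_fin.model")
print(sp.encode("The Fed raised its benchmark rate by 25 basis points.", out_type=str))
```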

The researchers argue that the customized tokenizer better captures financial terminology, and that the large vocabulary lets each token carry more meaning.


In initial evaluations, BloombergGPT’s unigram tokenizer reduced the size of encoded text compared to BPE and other tokenizers. This suggests it forms an efficient token vocabulary for financial language.
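You can run a rough version of that kind of comparison yourself with publicly available tokenizers. The sketch below counts tokens produced by GPT-2’s BPE tokenizer and T5’s unigram-based SentencePiece tokenizer on a sample financial sentence; neither is BloombergGPT’s tokenizer, and the sentence is made up, so the numbers are only illustrative of the method.

```python
# Hedged sketch: comparing how many tokens different public tokenizers
# need for the same text. Neither tokenizer here is BloombergGPT's own;
# GPT-2 uses BPE and T5 uses a unigram SentencePiece model.
from transformers import AutoTokenizer

text = "The issuer priced $500 million of 5-year senior unsecured notes at a 98 bps spread."

for name in ["gpt2", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok.tokenize(text))
    print(f"{name}: {n_tokens} tokens for {len(text.split())} words")
```

Fewer tokens for the same text generally means longer effective context windows and cheaper training and inference, which is why encoding efficiency matters when choosing a tokenizer.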


Tradeoffs and Considerations


Unigram tokenization has tradeoffs to consider. The probabilistic modeling approach requires more data and compute to train compared to BPE. The tokenization decisions are also less explainable than simple frequency counts.


Additionally, vocabularies tailored for specific domains may not generalize as well to other data. So there are open questions around how to balance domain-specific and general tokenization.


The choice ultimately depends on use cases. For a financial model like BloombergGPT, the benefits of smarter domain-aware tokenization appear to outweigh the costs. But simple, fast methods like BPE remain appealing in many scenarios.


As language models continue advancing, we’re likely to see more specialized tokenization strategies like unigram. The tokenizer plays a key role in how systems represent and process language. Improvements here translate to downstream gains, enabling models like BloombergGPT to push the state-of-the-art in domains like finance.
