Now Next Later AI - Blog #BloombergGPT

A New LLM for Finance: BloombergGPT

Thu, 10 Aug 2023 13:43:14 +1000

Financial technology companies are increasingly turning to artificial intelligence to help analyze data, automate processes, and improve decision making. Bloomberg recently unveiled a new AI system called BloombergGPT that is specially designed to understand financial and business language.

BloombergGPT is what's known as a large language model (LLM).

LLMs are AI systems that are trained on massive amounts of text data so they can generate human-like text and engage in tasks like question answering and summarization. They have become very popular in recent years thanks to advances in computing power and AI techniques. Some well-known examples are systems like GPT-4 from OpenAI and BLOOM from Anthropic.

What makes BloombergGPT different is that it was explicitly trained on financial data sources curated by Bloomberg analysts over decades. This includes 360 billion tokens (words) from sources like financial news, earnings reports, regulatory filings, press releases, and more. The goal was to create an AI system optimized for understanding nuanced financial language.

To supplement the financial data, BloombergGPT was also trained on 345 billion tokens of more general text from publicly available sources like Wikipedia, books, academic papers, and web content. The combination of financial and general data was intended to make the system adept at both financial tasks and general natural language processing abilities.

In terms of its technical details, BloombergGPT contains 50 billion parameters. Parameters refer to the adjustable settings inside the model that are tuned during training. More parameters allow the system to learn more complex patterns and relationships. For comparison, GPT-3 has 175 billion parameters.

BloombergGPT uses an architecture based on the transformer, which is the most common framework used today for large language models. Transformers process text by looking at the entire context rather than processing words one-by-one. This allows them to develop a more holistic understanding of language.

During training, BloombergGPT was optimized using techniques like activation checkpointing and mixed precision to lower memory usage and increase speed. The result was a system capable of 102 teraflops, or 102 trillion floating point operations per second. This level of compute power was needed to effectively train the 50 billion parameter model.

To evaluate the capabilities of BloombergGPT, the researchers tested it on a range of financial natural language tasks as well as standard AI benchmarks. On financial tasks, BloombergGPT achieved state-of-the-art results, outperforming other models by significant margins. It performed very well on financial question answering, named entity recognition, and sentiment analysis.

On general benchmarks, BloombergGPT proved competitive with some models over 100 times its size. While it wasn't always the top performer, it consistently outscored similarly sized models on benchmarks measuring abilities like reasoning, reading comprehension and common sense knowledge.

The researchers attribute BloombergGPT's effectiveness to three main factors. First and foremost was the high-quality domain-specific data. Second, they believe their choice of tokenizer - the system responsible for breaking text into pieces the model can process - was beneficial. Finally, the model architecture and training techniques allowed them to efficiently train a model competitive with far larger systems.

The researchers highlight that creating an AI system of this scale still poses challenges. Instabilities can arise during training that require careful monitoring and intervention. They logged their training process in detail to aid future work.

Additionally, the authors considered ethical issues like potential biases in financial data and possible misuse of the system. Bloomberg has extensive procedures in place to reduce risks and ensure responsible AI development. However, the company chose not to publicly release the model to minimize chances of data leakage or misuse.

In conclusion, BloombergGPT represents a milestone for domain-specific natural language AI. Its training process and strong performance on financial tasks demonstrate the value of curated in-domain data. While specialized, the system retains competitive general abilities as well. As financial institutions continue adopting AI, expect systems like BloombergGPT to play an increasing role in driving automation and insights.

Sources:

BloombergGPT: A Large Language Model for Finance

Turning Text into Data: How BloombergGPT Tokenizes Financial Language

Thu, 10 Aug 2023 13:43:14 +1000

One of the fundamental challenges in natural language processing is transforming free-form human language into data that machines can understand. The first step in this process is called tokenization – breaking down sentences and documents into small chunks or “tokens”.

When designing BloombergGPT, a new large language model optimized for financial data, the researchers at Bloomberg made an important decision around tokenization. They opted to use a method called unigram tokenization rather than the more standard approach of byte pair encoding (BPE).

To understand this choice, it helps to first understand what tokenization involves and the strengths of different techniques.

The Role of the Tokenizer

The job of a tokenizer is to split input text into tokens that contain one or more characters. Tokens become the basic unit that the model manipulates when processing language.

With English text, an obvious tokenization would be to split on spaces to get words and punctuation. But for machine learning, it’s common to break words down even further into subword units.

This offers two main advantages. First, it limits the size of the vocabulary that the model needs to represent. For example, instead of separate tokens for “training” and “trainer”, they can share a common root like “train”.

Second, the model can process words it hasn’t seen before by recognizing their subword components. This helps the model generalize.

Byte Pair Encoding

The most popular subword tokenization algorithm used in NLP is byte pair encoding (BPE). Originally developed for data compression, it builds up a vocabulary by scanning text and greedily merging frequent pairs of tokens.

BPE starts by assigning each character as a token. It then iteratively merges the most common pair of tokens until reaching a target vocabulary size. Popular pairs like “in+the” get merged early on.

This simple probabilistic approach works decently well in practice. BPE is fast to train and produces a vocabulary with reusable subword chunks.

However, BPE discards the probabilistic model after training. Each merge decision is hard-coded based on frequency alone. At test time, BPE tokenizes text greedily using this fixed vocabulary.

Unigram Tokenization

BloombergGPT implements a more advanced approach called unigram tokenization. A unigram tokenizer models the probability of tokens directly using techniques from statistical language modeling.

The probabilities allow unigram tokenizers to capture uncertainty and make “soft” decisions when tokenizing new text. Rather than greedily splitting input using a fixed vocabulary, it chooses the most probable segmentation based on context.

Unigram tokenizers are first trained on large datasets to learn these token probabilities. The training process gradually removes unlikely token candidates until reaching the target vocabulary size.

The main advantage over BPE is that unigram keeps the probabilistic model which allows for smarter, context-aware tokenization at inference time.

Implementation in BloombergGPT

For BloombergGPT, the unigram tokenizer was trained specifically on The Pile, a diverse dataset containing both general and domain-specific data.

After experimenting with different sizes, Bloomberg settled on a 217 token vocabulary (about 131,000 tokens). This is larger than typical for NLP models.

The researchers argue the customized tokenizer better captures financial terminology. And the large vocabulary encodes more meaning within each token.

In initial evaluations, BloombergGPT’s unigram tokenizer reduced the size of encoded text compared to BPE and other tokenizers. This suggests it forms an efficient token vocabulary for financial language.

Tradeoffs and Considerations

Unigram tokenization has tradeoffs to consider. The probabilistic modeling approach requires more data and compute to train compared to BPE. The tokenization decisions are also less explainable than simple frequency counts.

Additionally, vocabularies tailored for specific domains may not generalize as well to other data. So there are open questions around how to balance domain-specific and general tokenization.

The choice ultimately depends on use cases. For a financial model like BloombergGPT, the benefits of smarter domain-aware tokenization appear to outweigh the costs. But simple, fast methods like BPE remain appealing in many scenarios.

As language models continue advancing, we’re likely to see more specialized tokenization strategies like unigram. The tokenizer plays a key role in how systems represent and process language. Improvements here translate to downstream gains, enabling models like BloombergGPT to push the state-of-the-art in domains like finance.