BloombergGPT training learnings
Introduction
- a finance-optimised large language model (LLM) built by Bloomberg in late 2022/early 2023
- a 50b-parameter LLM trained on ~570b tokens, roughly half of which are financial-domain; training used 512 NVIDIA A100 40GB GPUs with a budget of about 1.3m GPU hours
Model design
- based on the BLOOM model (176b parameters, 366b tokens) but:
- tokenization was optimised for numerical inputs by breaking numbers into individual digits, and multi-word tokens were allowed, which differed from GPT-3 (175b parameters, 300b tokens) - see the tokenizer sketch after this list
- this resulted in a vocabulary of roughly 130,000 tokens
- followed the advice of the “Chinchilla” paper (published Mar 2022), which found that under a fixed training budget it is better to use a smaller model and more training tokens (see the worked example after this list)
- removed the “embedding layer norm”, as most models don't use it
- data came from both the public domain (e.g. Wikipedia and web crawls such as C4) and a ~400b-token private corpus (Bloomberg's “FinPile”, compiled since 2007 and sorted in chronological order); the overall training data is roughly 200x the size of English Wikipedia
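A minimal sketch of the digit-splitting idea in Python (an illustrative pre-tokenization pass, not Bloomberg's actual Unigram tokenizer; split_digits is a made-up helper name):

  import re

  def split_digits(text):
      # put whitespace around every digit so each digit becomes its own piece,
      # which keeps numbers like "2022" from being memorised as single tokens
      return re.sub(r"\d", lambda m: f" {m.group(0)} ", text).split()

  print(split_digits("CPI rose 8.2% in 2022"))
  # ['CPI', 'rose', '8', '.', '2', '%', 'in', '2', '0', '2', '2']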
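And a back-of-envelope illustration of the Chinchilla rule of thumb (~20 training tokens per parameter); the numbers are only indicative, and the paper's actual sizing analysis also worked from the 1.3m GPU-hour budget:

  # Chinchilla heuristic: compute-optimal token count D ~= 20 * parameter count N
  n_params = 50e9        # BloombergGPT model size
  n_tokens = 570e9       # tokens it was actually trained on

  print(20 * n_params / 1e9)    # ~1000b tokens would be "optimal" for a 50b model
  print(n_tokens / 20 / 1e9)    # ~28.5b params is what 570b tokens would "optimally" feed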
Training issues and solutions
- initial training was problematic; they started monitoring the gradient norm, since spikes in it flag issues early (see the logging sketch after this list), and tried shuffling the training data and lowering the learning rate, but this only delayed the spikes
- they then examined the parameter values of each layer of their 70-layer network and found an issue with one node in the first transformer layer; they resolved this by:
- fully shuffling the data (abandoning the temporal ordering)
- new random seeds
- removed weight decay on LayerNorm scale parameters (see the optimizer sketch after this list)
- added LayerNorm at the embedding layer (BLOOM had this)
- changed from BF16 to FP32 in the output softmax (BLOOM didn't have this) - see the FP32 softmax sketch after this list
- reduced the max learning rate to 6e-5 for more stability
- reduced gradient clipping to 0.3 instead of 1.0 for more stability
- lengthened learning rate warm up period to 1800 steps for more stability
- applied batch size warm up (1024 for 7200 iterations, then 2048)
- used Megatron initialization rescaling
- applied query_key_layer_scaling for numerical stability
- despite this, performance worsened after 42 days, so they stopped training at that point; having consumed ~570b of their ~700b available tokens, they judged the model adequately trained
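A minimal sketch of the gradient-norm monitoring and 0.3 clipping mentioned above, assuming PyTorch and a model whose forward pass returns the loss (model, optimizer and batch here are placeholders):

  import torch

  def training_step(model, optimizer, batch, step, max_grad_norm=0.3):
      optimizer.zero_grad()
      loss = model(**batch).loss        # assumes an HF-style causal LM returning .loss
      loss.backward()
      # clip_grad_norm_ returns the total pre-clipping gradient norm,
      # so it doubles as the signal used to spot spikes early
      grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
      if step % 10 == 0:
          print(f"step={step} loss={loss.item():.4f} grad_norm={float(grad_norm):.2f}")
      optimizer.step()
      return loss.item(), float(grad_norm)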
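And a hedged sketch of two of the restart changes, dropping weight decay on LayerNorm/bias parameters and warming the learning rate up over 1800 steps, again assuming PyTorch; the weight-decay value, betas and total step count are placeholders, not the paper's exact settings:

  import math
  import torch

  def build_optimizer(model, max_lr=6e-5, warmup_steps=1800, total_steps=140_000):
      decay, no_decay = [], []
      for name, param in model.named_parameters():
          if not param.requires_grad:
              continue
          # LayerNorm scales/biases and other 1-D tensors get no weight decay
          (no_decay if param.ndim < 2 else decay).append(param)

      optimizer = torch.optim.AdamW(
          [{"params": decay, "weight_decay": 0.1},
           {"params": no_decay, "weight_decay": 0.0}],
          lr=max_lr, betas=(0.9, 0.95),
      )

      def lr_lambda(step):
          if step < warmup_steps:                       # linear warm-up to max_lr
              return step / max(1, warmup_steps)
          progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
          return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))   # cosine decay

      scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
      return optimizer, scheduler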
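The FP32 softmax change can be sketched as below; the point is just the up-cast before the softmax/cross-entropy, and the helper name is illustrative:

  import torch.nn.functional as F

  def lm_loss_fp32(logits, labels):
      # up-cast BF16 logits to FP32 so the softmax normalisation is done
      # in full precision, even though the rest of the model runs in BF16
      logits = logits.float()
      return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))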
Evaluation
- it was as good as or better than its 2022 peer models, except that BLOOM was better on NER
- it was able to convert natural-language prompts into queries against the live Bloomberg database and return a response (albeit not 100% accurate), and it let users drive features of Bloomberg's charts through sequential natural-language prompts
Learnings
- start small and work your way up
- if doing something new, run ablation tests on small models before scaling up
- can be achieved with a small team if you have the compute resources
- they may not use multi-word tokens in future models