Large Language Models in artificial intelligence rapidly evolved after the introduction of transformer models in 2018 and the availability of very large amounts of parallel computing resources such as GPUs on the Cloud
as of 2023, the concept of “compositional generalization” and understanding the meaning of words rather than just word associations is still a work in progress for LLMs, but training them with Meta-learning for Compositionality (MLC) seems to outperform other approaches and is likely to progress further when trained within physical embodiment systems such as robots
LLMs are NOT good with private data - access controls are not possible within a model - private data will leak out
the usual way to manage private data such as a patient's EHR is to upload the EHR to the LLM and then ask a context and get a response and as the LLM itself does not store the EHR data (this is usually stored in a separate vector database in order to maintain a conversion regarding it)
reduce hallucinations in LLMs by:
providing it with context data and tell it to preferentially use that data, and,
explicitly tell it that if it doesn't know, to just say it doesn't know
LLMs are predictive generative machines based upon probability distributions for the next most likely token
statistical thinking has its place BUT by itself:
is error prone - the longer the context, the more errors that are likely to be introduced as future values are determined by past values and as errors arise they get multiplied with every new token
do not represent intelligence or consciousness although these can be simulated by scaling up these LLMs
are not good at understanding cause and effect
can have a lot of trouble with reasoning, logic, understanding the physical world and the rules of physics, and performing mathematical calculations
are not as good as using definite accurate known values
when asked for possible diagnoses of a patient Jean Smith who is 20yrs old with abdominal pain - it needs to ascertain probabilities for gender which is no where near as good as if biological gender is a known fact as gender does change the DDx profile considerably
when asked for a diagnosis based upon a range of clinical features, the best it will probably do is give a range of differentials with some broad assessment of likelihood, but it is unlikely to give a definite diagnosis - this may improve if trained on an enormous amount of diagnosis-labelled clinical scenarios, or, it has access to well defined definite criteria for every diagnosis - and currently we don't have that and much of our inputs have uncertainty or are incomplete
requires attention and a relatively large context window to ensure the weights applied to potential tokens adequately reflect context
may ignore lower probability tokens even though these may in fact be the best fit for a context - hence the use of the temperature hyper-parameter to allow a more even weight distribution to various tokens by increasing the temperature value away from 0.
cannot account for chaos of natural randomness or the complexity of a multitude of inputs which are not provided
whilst they are useful in the short term, current LLMs only simulate intelligence and they have major flaws
most have a knowledge cut off date as to end of the acquisition of data for their training
whilst they have a massive amount of knowledge accumulation this does not substitute for actual understanding
they memorize lots of “problem statements” and “recipes” on how to solve them
if needing to solve a new problem they will use the closest matching “recipe” even if this is not logical and usually without checking the solution for logic, common sense or real world modelling checks - and if it gives an incorrect response, it will generally reply “I'm sorry, you are right” and apply another irrelevant recipe.
responses are very dependent upon:
its data on which it was trained (quality and quantity)
how it was trained and fine tuned
its context length
it's temperature creativity hyper-parameter setting
its hidden system prompts (these are usually designed to provide guardrails)
how your prompts are provided (role instructions, where to source information, how to format it, whether you have included irrelevant information which it may think is more important, etc)
Performance measures
NB. WARNING: some models are also pre-trained on the benchmark data which falsely elevates their benchmark scores so they can gain more funding
some models are multi-lingual
number of tokens
Mistral 7B instruct only allows 2K tokens
Mixtral 7Bx8e allows 32K tokens
training dataset size
GPT 3.5 - 300B words/570Gb dataset;
Llama 2 has 7B, 13B and 70B parameter models
pretraining computer processing time and power consumption
eg. Llama 2 used 3.3million GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W)
commonsense reasoning scores
PIQA
SIQA
HellaSwag
WinoGrande
ARC easy
ARC challenge
OpenBookQA
CommonsenseQA
world knowledge score
NaturalQuestions
TriviaQA
reading comprehension scores
SQuAD
QuAC
BoolQ
math scores
GSM8K
MATH
Multi-task Language Understanding (MMLU) score
test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.1)
March 14, 2023, OpenAI released GPT-4 with a waitlist for access
Aug 2023, Microsoft releases LlaVa (Large Language and Vision Assistant) which is a multi-model model which can take image inputs and encodes them and then uses GPT-4 to respond to a prompt regarding the image
Sept 2023: MS releases DALL-E3 image generation combined with ChatGPT linked to many of its software apps including Win11, Edge, Bing, Paint, Office 365.
Oct 2023: ChatGPT-4Vision allows analysis of images as part of the prompts
Oct 2023: “Browse with Bing” feature lets ChatGPT access up-to-date information, rather than being limited to the training data that was cut off before September 2021.
July 2023, Meta released several models as Llama 2, using 7, 13 and 70 billion parameters
uses 4K context length tokens and foundational models were trained on a curated data set with 2 trillion tokens with batch size of 64
RedPyjama contains a 1.2trillion token open source and downloaded dataset version of the LLama 2 training dataset
Llama 2 - Chat was additionally fine-tuned on 27,540 prompt-response pairs created for this project
pretraining utilized a cumulative 3.3million GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W)
Google's Gemini
announced Dec 2023
Microsoft's Phi
Phi-1 and Phi-1.5 were 1.3b parameter models
Phi-2:
announced Dec 2023 with only 2.7b parameters but beats LLaMA-2 7b, 13b, Mistral 7b, GPT4 and Gemini Nano2 3.2b in logical reasoning, math proficiency, coding acumen and safety standards
Mistral
French company
“Mistral Tiny”:
Mistral 7B Instruct v0.2
released as open source in Oct 2023 - English only
only 4.6Gb download and can easily run on a powerful laptop with 8Gb VRAM and nVidia GPU via Private GTP
scores well on conversational scores as this is what it is designed for but poorly on coding, maths and reasoning (similar to LLaMA 2 70b in these latter scores)
continuously fails on “if 5 shirts take 4hrs in sun to dry how long for each shirt to dry if you put 20 shirts out?” uses proportional reasoning and then simple multiplication approaches and only after a few prompted corrections does it come up with the correct answer albeit with disconnected logic
“Mistral Small”
Mixtral 7bx8e Mixture-of-Experts (MoE) model
released Dec 2023 as open source via Torrent, English, French, Italian, German and Spanish, code and competes well with GPT 4 on benchmarks
can be run on a high end desktop computer with LOTS of VRAM (preferably 64-128Gb, but at least 23Gb)
released Dec 2023 as API, English, French, Italian, German and Spanish, code and competes nearly the same as GPT 4 on benchmarks but at 1/10th the cost
?33bx8e
succeeds on “if 5 shirts take 4hrs in sun to dry how long for each shirt to dry if you put 20 shirts out”
still fails on complex word based maths problems and on real world physics problems
Running multimodal image2txt LLMs on your computer offline
use LM Studio:
download a visual model and load it
go to the horizontal arrows icon on left panel and once the model is loaded hit Start Server
copy the code under the vision(python) tab
if openai Python library not installed:
in Anaconda Prompt:
activate your desired environment
install latest version of openai eg. conda install openai=1.6.0
NB. default older version install does not have the OpenAI object to import
either:
1. open Anaconda Navigator and select the environment that has the openai library installed
open Jupyter Notebooks and create a new python file and paste the python code from LM Studio above
2. create a native app to run the python code such as by using Delphi 11 or 12 as I have done (still needs python environment installed on the same computer)
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible.
1. start with a generic LLM
2. fine tune with Structured Fine Tuning (SFT) or instruction tuning via either:
Reinforcement Learning (when the model needs to learn outputs not explicitly defined in the training dataset ie. you give feedback on outputs)
fine tuning and aligning Mistral 7b Instruct with UltraChat synthetic dataset derived from chat between 2 LLMs and UltraFeedback alignment to create “Zephyr” costs $US500 and takes 8hrs on 16 x A100 nVidia GPUs! - as a minimum you need 80Gb VRAM 2)
Fine Tuning with your own dataset (when the objective is clear and can be defined through use of labelled examples -ie you can give examples of desired outputs)
example is Parameter Efficient Fine Tuning (PEFT):
LoRAs
3. align model to a Feedback Dataset with Direct Preference Optimisation (DPO) which is generally much more stable than Proximal Policy Optimisation (PPO)
4. now you have a crafted model eg. optimised for medicine
4. integrate with your data:
Retrieval Augmented Generation (RAG), a method of enhancing AI's response quality by retrieving relevant external data from unstructured Delta Lake document data
tools like “Databricks RAG” for creating production-ready high-quality RAG applications
NB. a structured database is more “intelligent” than a vector embedding of the database as the latter loses accuracy and has to compute probabilistic answers rather than a straight SQL search for the answer - so don't convert a SQL database to a vector embedding
vector embedding is good for allowing detection of semantic similarities which a word search may not detect if that word is not actually used in the relevant document (eg. an acronym is used instead) however, they need to be built with an appropriate tokenizer to make them coherent or harmonic vectors
also be aware that text vector embeddings can reveal almost as much as the raw text itself so this can be an issue with privacy
LoRA (Low-Rank Adaptation of Large Language Models)
freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks
Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times
unlike adapters it has no additional inference latency
when used in SD-XL AI text2image generation, LoRA models operate by applying minute changes to the most critical part of Stable Diffusion models—the cross-attention layers. This is the part where the image and the prompt intersect, and researchers have found that fine-tuning this section yields excellent training results.
example of using LoRA to fine tune Mistral-7B LLM:
https://www.youtube.com/watch?v=yfHHvmaMkcA - Youtube tutorial of how to use OpenAI embeddings to convert a text dataset into a cloud vector datastore and then ask the vector store for semantically similar items to your prompt
a vector embedding is a conversion words (or other data such as sentences, documents, images for facial recognition systems, graphs, etc) to a vector space embedding and allows storage of semantic meaning similarity of words
this is often done using a cosine similarity which gives a similarity value of two components in the vector embedding based on a range of their vector values for their “traits”
in this way banana could be found by the LLM when prompted to find fruit or food whereas a straight character search for “food” would only find the word “food” or similar words.
use of iterative AI reasoning methods either via system prompts (eg. tree search, use step by step reasoning) or by use of AI agents can dramatically improve the performance of LLMs compared to zero shot results
agentic reasonic design patterns:
self-reflection
eg. ask LLM to check for correctness, style, efficiency, etc and give constructive criticism on how to improve the response
eg. asking it to run a unit test on the generated code if asking it to write code and if it fails to look at correcting it
tool use
eg. smaller models designed as dedicated tools to perform tasks faster than larger LLMs, tools could be such as web search, image generator, etc
planning
instructions to allow planning a stepwise approach to solving a problem
multi-agent collaboration with each AI agent assigned a task eg. one could be a manager to orchestrate the other agents, one could be designing, one could be coding, one could be a critic agent to test and feed back for self-reflection, one could be documenting