Introduction
Performance measures
Current models
Prompt engineering
Fine tuning models to particular tasks or domains
- LoRA (Low-Rank Adaptation of Large Language Models)
- Vector embeddings
Hallucination mitigation
Improving performance using AI iterative Agents

AI Large Language Models (LLMs)

Introduction

Large Language Models in artificial intelligence rapidly evolved after the introduction of transformer models in 2018 and the availability of very large amounts of parallel computing resources such as GPUs on the Cloud
as of 2023, the concept of “compositional generalization” and understanding the meaning of words rather than just word associations is still a work in progress for LLMs, but training them with Meta-learning for Compositionality (MLC) seems to outperform other approaches and is likely to progress further when trained within physical embodiment systems such as robots
LLMs are NOT good with private data - access controls are not possible within a model - private data will leak out
- the usual way to manage private data such as a patient's EHR is to upload the EHR to the LLM and then ask a context and get a response and as the LLM itself does not store the EHR data (this is usually stored in a separate vector database in order to maintain a conversion regarding it)
reduce hallucinations in LLMs by:
- providing it with context data and tell it to preferentially use that data, and,
- explicitly tell it that if it doesn't know, to just say it doesn't know
LLMs are predictive generative machines based upon probability distributions for the next most likely token
- statistical thinking has its place BUT by itself:
  - is error prone - the longer the context, the more errors that are likely to be introduced as future values are determined by past values and as errors arise they get multiplied with every new token
  - do not represent intelligence or consciousness although these can be simulated by scaling up these LLMs
  - are not good at understanding cause and effect
  - can have a lot of trouble with reasoning, logic, understanding the physical world and the rules of physics, and performing mathematical calculations
  - are not as good as using definite accurate known values
    - when asked for possible diagnoses of a patient Jean Smith who is 20yrs old with abdominal pain - it needs to ascertain probabilities for gender which is no where near as good as if biological gender is a known fact as gender does change the DDx profile considerably
    - when asked for a diagnosis based upon a range of clinical features, the best it will probably do is give a range of differentials with some broad assessment of likelihood, but it is unlikely to give a definite diagnosis - this may improve if trained on an enormous amount of diagnosis-labelled clinical scenarios, or, it has access to well defined definite criteria for every diagnosis - and currently we don't have that and much of our inputs have uncertainty or are incomplete
  - requires attention and a relatively large context window to ensure the weights applied to potential tokens adequately reflect context
  - may ignore lower probability tokens even though these may in fact be the best fit for a context - hence the use of the temperature hyper-parameter to allow a more even weight distribution to various tokens by increasing the temperature value away from 0.
  - cannot account for chaos of natural randomness or the complexity of a multitude of inputs which are not provided

whilst they are useful in the short term, current LLMs only simulate intelligence and they have major flaws
- most have a knowledge cut off date as to end of the acquisition of data for their training
- whilst they have a massive amount of knowledge accumulation this does not substitute for actual understanding
- they memorize lots of “problem statements” and “recipes” on how to solve them
- if needing to solve a new problem they will use the closest matching “recipe” even if this is not logical and usually without checking the solution for logic, common sense or real world modelling checks - and if it gives an incorrect response, it will generally reply “I'm sorry, you are right” and apply another irrelevant recipe.
- responses are very dependent upon:
  - its data on which it was trained (quality and quantity)
  - how it was trained and fine tuned
  - its context length
  - it's temperature creativity hyper-parameter setting
  - its hidden system prompts (these are usually designed to provide guardrails)
  - how your prompts are provided (role instructions, where to source information, how to format it, whether you have included irrelevant information which it may think is more important, etc)

Performance measures

NB. WARNING: some models are also pre-trained on the benchmark data which falsely elevates their benchmark scores so they can gain more funding
some models are multi-lingual
number of tokens
- Mistral 7B instruct only allows 2K tokens
- Mixtral 7Bx8e allows 32K tokens
training dataset size
- GPT 3.5 - 300B words/570Gb dataset;
- Llama 2 has 7B, 13B and 70B parameter models
pretraining computer processing time and power consumption
- eg. Llama 2 used 3.3million GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W)
commonsense reasoning scores
- PIQA
- SIQA
- HellaSwag
- WinoGrande
- ARC easy
- ARC challenge
- OpenBookQA
- CommonsenseQA
world knowledge score
- NaturalQuestions
- TriviaQA
reading comprehension scores
- SQuAD
- QuAC
- BoolQ
math scores
- GSM8K
- MATH
Multi-task Language Understanding (MMLU) score
- test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.¹⁾
- see https://huggingface.co/blog/evaluating-mmlu-leaderboard for more details
- GPT-4 = 86.4
- GPT-3.5 = 70
- Llama 2 70b = 68.9
- Chinchilla = 67.5
- Llama 2 13b = 54.8
- Llama 2 7b = 45.3
BBH score
- BBH-nlp
- BBH-alg
AGI Eval score
TruthfulQA score
- percentage of generations that are both truthful and informative
Toxigen score
- percentage of toxic generations (the smaller the better)

Current models

Google DeepMind's Chinchilla LLM

presented March 2022
trained in order to investigate the scaling laws of large language models.
70billion parameters; 80 layers, 1.5-3million batch size;

OpenAI's GPT

OpenAI was founded in 2015 initially as not for profit company
Microsoft provided OpenAI LP with a $1 billion investment in 2019 and a $10 billion investment in 2023
Dec 2022, ChatGPT 3.5 released as a new AI chatbot based on GPT-3.5
Feb 2023, Microsoft incorporates GPT-4 tech as “Prometheus” into its Bing search engine but it could be quite “unhinged” see https://twitter.com/MovingToTheSun/status/1625156575202537474/photo/1 where it gaslights the user with incorrect information and logic
March 14, 2023, OpenAI released GPT-4 with a waitlist for access
Aug 2023, Microsoft releases LlaVa (Large Language and Vision Assistant) which is a multi-model model which can take image inputs and encodes them and then uses GPT-4 to respond to a prompt regarding the image
Sept 2023: MS releases DALL-E3 image generation combined with ChatGPT linked to many of its software apps including Win11, Edge, Bing, Paint, Office 365.
Oct 2023: ChatGPT-4Vision allows analysis of images as part of the prompts
Oct 2023: “Browse with Bing” feature lets ChatGPT access up-to-date information, rather than being limited to the training data that was cut off before September 2021.

Meta's Llama

Meta's LLama:
- February 23, 2023, Meta announces Llama
- July 2023, Meta released several models as Llama 2, using 7, 13 and 70 billion parameters
  - uses 4K context length tokens and foundational models were trained on a curated data set with 2 trillion tokens with batch size of 64
    - RedPyjama contains a 1.2trillion token open source and downloaded dataset version of the LLama 2 training dataset
  - Llama 2 - Chat was additionally fine-tuned on 27,540 prompt-response pairs created for this project
  - pretraining utilized a cumulative 3.3million GPU hours of computation on hardware of type A100-80GB (TDP of 350-400W)

Google's Gemini

announced Dec 2023

Microsoft's Phi

Phi-1 and Phi-1.5 were 1.3b parameter models
Phi-2:
- announced Dec 2023 with only 2.7b parameters but beats LLaMA-2 7b, 13b, Mistral 7b, GPT4 and Gemini Nano2 3.2b in logical reasoning, math proficiency, coding acumen and safety standards

Mistral

French company
“Mistral Tiny”:
- Mistral 7B Instruct v0.2
  - released as open source in Oct 2023 - English only
  - only 4.6Gb download and can easily run on a powerful laptop with 8Gb VRAM and nVidia GPU via Private GTP
  - scores well on conversational scores as this is what it is designed for but poorly on coding, maths and reasoning (similar to LLaMA 2 70b in these latter scores)
  - continuously fails on “if 5 shirts take 4hrs in sun to dry how long for each shirt to dry if you put 20 shirts out?” uses proportional reasoning and then simple multiplication approaches and only after a few prompted corrections does it come up with the correct answer albeit with disconnected logic
“Mistral Small”
- Mixtral 7bx8e Mixture-of-Experts (MoE) model
  - released Dec 2023 as open source via Torrent, English, French, Italian, German and Spanish, code and competes well with GPT 4 on benchmarks
  - can be run on a high end desktop computer with LOTS of VRAM (preferably 64-128Gb, but at least 23Gb)
  - 32K tokens;
  - https://www.youtube.com/watch?v=ucov1AWvGEc
“Mistral Medium”
- released Dec 2023 as API, English, French, Italian, German and Spanish, code and competes nearly the same as GPT 4 on benchmarks but at 1/10th the cost
- ?33bx8e
- succeeds on “if 5 shirts take 4hrs in sun to dry how long for each shirt to dry if you put 20 shirts out”
- still fails on complex word based maths problems and on real world physics problems
- https://www.youtube.com/watch?v=S2aQpSflywA

Private GTP

1st released on GitHub in May 2023
allows offline use (hence totally private use) of a LLM to ingest your document and then respond to prompts about that document
installs on your local computer by making a new python environment with conda and downloading the 5Gb or so of files including the model
v2.0 uses the Mistral 7b Instruct LLM
see https://www.youtube.com/watch?v=XFiof0V3nhA

Other ways to run downloaded LLMs on your computer offline

LM Studio provides a free GUI to chat with a downloaded model or create an API server to that model
TextGen WebUI
- uses llama-cpp-python
- https://github.com/oobabooga/text-generation-webui - git code
- https://gist.github.com/mberman84/f092a28e4151dd5cecebfc58ac1cbc0e - install commands
- see https://www.youtube.com/watch?v=VPW6mVTTtTc install and use demo
jan.ai provides a free open source GUI with python code on github to chat with a downloaded model - see https://www.youtube.com/watch?v=zkafOIyQM8s install and use demo
ollama
- see https://www.youtube.com/watch?v=kJvXT25LkwA - creating agents to perform tasks using local models with ollama
GPT4All -runs off CPU

Running multimodal image2txt LLMs on your computer offline

use LM Studio:
- download a visual model and load it
- go to the horizontal arrows icon on left panel and once the model is loaded hit Start Server
- copy the code under the vision(python) tab
if openai Python library not installed:
- in Anaconda Prompt:
  - activate your desired environment
  - install latest version of openai eg. conda install openai=1.6.0
    - NB. default older version install does not have the OpenAI object to import
either:
- 1. open Anaconda Navigator and select the environment that has the openai library installed
  - open Jupyter Notebooks and create a new python file and paste the python code from LM Studio above
- 2. create a native app to run the python code such as by using Delphi 11 or 12 as I have done (still needs python environment installed on the same computer)

Prompt engineering

https://github.com/dair-ai/Prompt-Engineering-Guide

Fine tuning models to particular tasks or domains

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible.
1. start with a generic LLM
2. fine tune with Structured Fine Tuning (SFT) or instruction tuning via either:
- Reinforcement Learning (when the model needs to learn outputs not explicitly defined in the training dataset ie. you give feedback on outputs)
  - Transformer Reinforcement Learning (TRL)
    - NB. HuggingFace has a SFT Trainer and a DPOTRainer
      - fine tuning and aligning Mistral 7b Instruct with UltraChat synthetic dataset derived from chat between 2 LLMs and UltraFeedback alignment to create “Zephyr” costs $US500 and takes 8hrs on 16 x A100 nVidia GPUs! - as a minimum you need 80Gb VRAM ²⁾
- Fine Tuning with your own dataset (when the objective is clear and can be defined through use of labelled examples -ie you can give examples of desired outputs)
  - example is Parameter Efficient Fine Tuning (PEFT):
    - LoRAs
3. align model to a Feedback Dataset with Direct Preference Optimisation (DPO) which is generally much more stable than Proximal Policy Optimisation (PPO)
4. now you have a crafted model eg. optimised for medicine
4. integrate with your data:
- Retrieval Augmented Generation (RAG), a method of enhancing AI's response quality by retrieving relevant external data from unstructured Delta Lake document data
  - tools like “Databricks RAG” for creating production-ready high-quality RAG applications
- NB. a structured database is more “intelligent” than a vector embedding of the database as the latter loses accuracy and has to compute probabilistic answers rather than a straight SQL search for the answer - so don't convert a SQL database to a vector embedding
- vector embedding is good for allowing detection of semantic similarities which a word search may not detect if that word is not actually used in the relevant document (eg. an acronym is used instead) however, they need to be built with an appropriate tokenizer to make them coherent or harmonic vectors
  - also be aware that text vector embeddings can reveal almost as much as the raw text itself so this can be an issue with privacy

LoRA (Low-Rank Adaptation of Large Language Models)

a fine tuning method proposed in 2021 ³⁾
freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks
Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times
unlike adapters it has no additional inference latency
when used in SD-XL AI text2image generation, LoRA models operate by applying minute changes to the most critical part of Stable Diffusion models—the cross-attention layers. This is the part where the image and the prompt intersect, and researchers have found that fine-tuning this section yields excellent training results.
example of using LoRA to fine tune Mistral-7B LLM:
- https://github.com/ml-explore/mlx-examples/tree/main/lora

Vector embeddings

https://www.youtube.com/watch?v=yfHHvmaMkcA - Youtube tutorial of how to use OpenAI embeddings to convert a text dataset into a cloud vector datastore and then ask the vector store for semantically similar items to your prompt
a vector embedding is a conversion words (or other data such as sentences, documents, images for facial recognition systems, graphs, etc) to a vector space embedding and allows storage of semantic meaning similarity of words
- this is often done using a cosine similarity which gives a similarity value of two components in the vector embedding based on a range of their vector values for their “traits”
- in this way banana could be found by the LLM when prompted to find fruit or food whereas a straight character search for “food” would only find the word “food” or similar words.
embeddings can be created by:
- using OpenAI's create embedding API tool
you can create a vector database on the cloud such as with DataStax.com and when you connect you will get token information to access this database

Hallucination mitigation

a comprehensive survey of (32+) Hallucination Mitigation Techniques in LLMs see https://arxiv.org/abs/2401.01313

Improving performance using AI iterative Agents

use of iterative AI reasoning methods either via system prompts (eg. tree search, use step by step reasoning) or by use of AI agents can dramatically improve the performance of LLMs compared to zero shot results
agentic reasonic design patterns:
- self-reflection
  - eg. ask LLM to check for correctness, style, efficiency, etc and give constructive criticism on how to improve the response
  - eg. asking it to run a unit test on the generated code if asking it to write code and if it fails to look at correcting it
- tool use
  - eg. smaller models designed as dedicated tools to perform tasks faster than larger LLMs, tools could be such as web search, image generator, etc
- planning
  - instructions to allow planning a stepwise approach to solving a problem
- multi-agent collaboration with each AI agent assigned a task eg. one could be a manager to orchestrate the other agents, one could be designing, one could be coding, one could be a critic agent to test and feed back for self-reflection, one could be documenting

¹⁾

https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu

²⁾

https://www.youtube.com/watch?v=Up7VKg6ZE90

³⁾

https://arxiv.org/abs/2106.09685

Table of Contents

AI Large Language Models (LLMs)

Introduction

Performance measures

Current models

Google DeepMind's Chinchilla LLM

OpenAI's GPT

Meta's Llama

Google's Gemini

Microsoft's Phi

Mistral

Private GTP

Other ways to ingest local documents into a local LLM

Other ways to run downloaded LLMs on your computer offline

Running multimodal image2txt LLMs on your computer offline

Prompt engineering

Fine tuning models to particular tasks or domains

LoRA (Low-Rank Adaptation of Large Language Models)

Vector embeddings

Hallucination mitigation

Improving performance using AI iterative Agents