private GPTs
Introduction
- Private GPTs allow LLM querying of YOUR private docs / files, all on your local machine
- in contrast, online custom GPTs allow LLM querying of YOUR UPLOADED docs / files, but these are then potentially accessible by 3rd parties
PrivateGPT 2.0
- a 2023 open source GPT you can download and install; it uses the Mistral 7b Instruct LLM by default (this outperforms LLaMA-2 13b), which gets downloaded to your machine (4.4Gb)
- BEFORE you install this, be aware that its query responses based on your uploaded documents are highly error prone!
- you will need to be careful with your prompts and to provide context in them
- responses are limited to 2048 characters
install process
- assuming Anaconda is already installed
- ensure cmake is installed (see cmake.org)
- open Anaconda prompt
- activate base via
conda activate base
- clone privateGPT files via
git clone https://github.com/imartinez/privateGPT
cd privateGPT
conda create -n privategpt python=3.11
conda activate privategpt
conda install -c conda-forge poetry
poetry install --with ui,local
- configure the GPT settings as needed in the settings.yaml file
- eg. under local:, you could use a Llama 2 model instead of the default TheBloke/Mistral-7B-Instruct-v0.1-GGUF model on the llm_hf_repo_id: line (see the example below)
- sagemaker section is for hosting your model on Amazon sagemaker
- if you wish to use openAI then you need to supply your openAI API key
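- for orientation, a minimal sketch of the local: section of settings.yaml (key names are taken from the project's sample settings.yaml and may differ between versions):
local:
  llm_hf_repo_id: TheBloke/Mistral-7B-Instruct-v0.1-GGUF
  llm_hf_model_file: mistral-7b-instruct-v0.1.Q4_K_M.gguf
  embedding_hf_model_name: BAAI/bge-small-en-v1.5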
poetry run python scripts/setup
- this takes a while as it needs to download the embedding model (which converts text into vectors for storage) as well as the LLM (Mistral 7b is ~4.37Gb)
- now build the app after setting up for Windows nVidia GPUs (assuming you have a compatible nVidia GPU and have CUDA software already installed):
conda install -c "nvidia/label/cuda-11.7.0" cuda-cudart
- NB. not doing this results in a pytorch install failure as it can't find a compatible cuda-cudart library
conda install -c pytorch pytorch-cuda
conda install -c anaconda cmake
- open Powershell with that environment active, then cd to the git clone directory created above eg. C:\Users\username\privateGPT
$env:CMAKE_ARGS='-DLLAMA_CUBLAS=on'; poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
$env:PGPT_PROFILES='local'
- NB. the last line is Powershell syntax; in a cmd prompt use: set PGPT_PROFILES=local
- to read docx files:
pip install docx2txt
- to read html files:
conda install -c conda-forge html2text
- NB. I still get a 'charmap' codec can't decode byte 0x81 error
- to read Powerpoint pptx files:
pip install torch transformers python-pptx Pillow
- NB. cannot read .doc, .xls or .xlsx files
Changing the LLM
- modify settings.yaml in the root folder to switch between different models (you will need to download them and save them to the models subfolder of the privateGPT folder - see the example below)
- the recommended file is mistral-7B-instruct-v0.2.Q4_K_M.gguf (7B indicates 7 billion parameter model, Q4 indicates 4 bit quantization and K_M is the type of quantization)
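- as an illustration, a GGUF file can be fetched from Hugging Face into the models subfolder (this assumes the huggingface_hub CLI is installed; the repo and file names are examples):
pip install huggingface_hub
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir models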
installing a new model
- update settings.yaml with the new model file names etc
- in Anaconda prompt, with that environment active, cd to the git clone directory created above eg. C:\Users\username\privateGPT, then:
poetry run python scripts/setup
about model files
- quantization
- reduces model file size and increases inference speed
- quantizing 32-bit floating point weight parameters to 8-bit integers in neural networks can be done with little loss of accuracy
- while FP16 can be used instead of FP32 with only a small loss of accuracy in deep neural network inference, smaller dynamic range formats like INT8 pose a challenge: during quantization, the very high dynamic range of FP32 has to be squeezed into only 255 values of INT8, or even into 15 values of INT4
- to mitigate this challenge, various techniques have been developed for quantizing models, such as per-channel or per-layer scaling, which adjust the scale and zero-point values of the weight and activation tensors to better fit the quantized format (see the sketch below)
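- a minimal Python sketch of the scale/zero-point idea (illustrative only, not code from privateGPT):
import numpy as np

def quantize_uint8(x):
    # affine (asymmetric) quantization: map the FP32 range onto the 256 levels of uint8
    scale = (x.max() - x.min()) / 255.0
    zero_point = round(float(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # recover an approximation of the original FP32 values
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)   # stand-in weight tensor
q, s, zp = quantize_uint8(w)
print(np.abs(w - dequantize(q, s, zp)).max())  # maximum quantization error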
- perplexity is a measure of the uncertainty of a model when it generates responses - the lower the better
- increasing quantization and reducing file size loses accuracy and thus increases perplexity - see https://github.com/ggerganov/llama.cpp/pull/1684#issuecomment-1579252501
- 6-bit quantized perplexity is within 0.1% or better from the original fp16 model.
- a 25Gb f16 13B LLM quantized to Q4_K_M results in perplexity rising from 5.25 to 5.3 while file size drops to a much more manageable 7.32Gb
- a 13Gb f16 7B LLM quantized to Q4_K_M results in perplexity rising from 5.9 to 5.96 while file size drops to a much more manageable 3.8Gb (Q5_K_M gives 5.92 and 4.45Gb)
- GGUF is the new quantized model file format used for CPU-based inference (via llama.cpp), which can offload some work to the GPU (great as most GPUs don't have enough VRAM for larger LLMs) - see the example below
- prompt processing is CPU speed/CPU core/GPU core bound, and inference is RAM/vRAM speed bound
- optimum performance is with K_M quants and either 4 bit or 5 bit.
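- eg. when running a GGUF model with llama.cpp directly, the -ngl (--n-gpu-layers) flag controls how many layers are offloaded to the GPU (the model file name here is an example):
./main -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 32 -p "your prompt"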
- Koboldcpp is a UI for running local models with CPU inference - a simple single file version of koboldai: https://github.com/LostRuins/koboldcpp - simply download the .exe and run it
Running PrivateGPT
- in Anaconda Prompt or Powershell:
conda activate privategpt
cd C:\Users\username\privateGPT
make run
- alternatively,
poetry run python -m private_gpt
- alternatively,
poetry run python -m uvicorn private_gpt.main:app --reload --port 8001
- Windows Security may now display a popup asking you if you want public and private networks to access this app - No should be OK
- open a web browser and enter the URL 127.0.0.1:8001
- the GUI should now display and there are 3 modes you can choose from:
- query - this will query any docs you have uploaded and display a ChatGPT-like natural language response to your question
- search - this will search any docs you have uploaded and display results of search
- LLM chat - this will chat with the LLM directly, without using your uploaded docs
- “ingesting” documents to query:
- just click the upload button to add documents
- these are converted and appended into a json format file in privateGPT\local_data\private_gpt\docstore.json
- privateGPT\local_data\private_gpt\graph_store.json is updated
- privateGPT\local_data\private_gpt\index_store.json is updated
- and presumably the embedded binary vectors are stored in privateGPT\local_data\private_gpt\qdrant\collection\make_this_parameterizable_per_api_call\storage.sqlite
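- documents can also be ingested without the GUI via the bundled API (a hedged example - the endpoint name is taken from the PrivateGPT docs of this era and may vary by version):
curl -F 'file=@mydocument.txt' http://127.0.0.1:8001/v1/ingest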
Ending a session
- in the current Anaconda Prompt or Powershell, press Ctrl-C and the server should terminate
Removing "ingested" files you had uploaded
- use the command “make wipe”
- or, manually delete these files (they will be re-created when a new instance is started):
- privateGPT\local_data\private_gpt\docstore.json
- privateGPT\local_data\private_gpt\graph_store.json
- privateGPT\local_data\private_gpt\index_store.json
- privateGPT\local_data\private_gpt\qdrant\collection\make_this_parameterizable_per_api_call\storage.sqlite
Fine-tune re-training of the Mistral 7b Instruct model
- uses Python, transformers, LoRA and SFTTrainer (see the sketch below)
- but needs an A100 GPU with 32Gb of VRAM
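- a minimal sketch of the LoRA + SFTTrainer approach (assumes the transformers, peft, trl, datasets and bitsandbytes libraries; the dataset file and hyperparameters are placeholders, not from this wiki):
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA trains small low-rank adapter matrices instead of all 7 billion weights
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj"])

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder dataset with a "text" field

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # column holding the training text
    max_seq_length=512,
    args=TrainingArguments(output_dir="mistral-7b-finetuned",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=4,
                           num_train_epochs=1),
)
trainer.train()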