private GPTs
Introduction
- Private GPTs allow LLM querying of YOUR private docs / files, all on your local machine
- in contrast, online custom GPTs allow LLM querying of YOUR UPLOADED docs / files, but these are then potentially accessible by 3rd parties
PrivateGPT 2.0
- a 2023 open source GPT you can download and install; it uses the Mistral 7b Instruct LLM by default (this outperforms LLaMA-2 13b), which gets downloaded to your machine (4.4Gb)
- BEFORE you install this, be aware that its query responses based on your uploaded documents are highly error prone!
- you will need to be careful with your prompts and to provide context in them
- responses are limited to 2048 characters
install process
- assuming Anaconda is already installed
- ensure cmake is installed (see cmake.org)
- open Anaconda prompt
- activate base via
conda activate base
- clone privateGPT files via
git clone https://github.com/imartinez/privateGPT
cd privateGPT
conda create -n privategpt python=3.11
conda activate privategpt
conda install -c conda-forge poetry
poetry install --with ui,local
- configure the GPT settings as needed in the settings.yaml file
- eg. under local:, you could use a Llama 2 model instead of the default TheBloke/Mistral-7B-Instruct-v0.1-GGUF model on the llm_hf_repo_id: line (see the example below)
- sagemaker section is for hosting your model on Amazon sagemaker
- if you wish to use openAI then you need to supply your openAI API key
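- for orientation, a minimal sketch of the local: section of settings.yaml (key names are taken from the project's sample settings.yaml and may differ between versions):
local:
  llm_hf_repo_id: TheBloke/Mistral-7B-Instruct-v0.1-GGUF
  llm_hf_model_file: mistral-7b-instruct-v0.1.Q4_K_M.gguf
  embedding_hf_model_name: BAAI/bge-small-en-v1.5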
poetry run python scripts/setup
- this takes a while as it needs to download the embedding model (which converts text into vectors for storage) as well as the LLM (Mistral 7b is ~4.37Gb)
- now build the app after setting up for Windows nVidia GPUs (assuming you have a compatible nVidia GPU and have CUDA software already installed):
conda install -c "nvidia/label/cuda-11.7.0" cuda-cudart
- NB. not doing this results in a pytorch install failure as it can't find a compatible cuda-cudart library
conda install -c pytorch pytorch-cuda
conda install -c anaconda cmake
- open Powershell with that environment active, then cd to the git clone directory created above eg. C:\Users\username\privateGPT
$env:CMAKE_ARGS='-DLLAMA_CUBLAS=on'; poetry run pip install --force-reinstall --no-cache-dir llama-cpp-python
$env:PGPT_PROFILES='local'
- NB. the last line is Powershell syntax; in a cmd prompt use: set PGPT_PROFILES=local
- to read docx files:
pip install docx2txt
- to read html files:
conda install -c conda-forge html2text
- NB. I still get a 'charmap' codec can't decode byte 0x81 error
- to read Powerpoint pptx files:
pip install torch transformers python-pptx Pillow
- NB. cannot read .doc, .xls or .xlsx files
Changing the LLM
- modify settings.yaml in the root folder to switch between different models (you will need to download them and save them to the models subfolder of the privateGPT folder - see the example below)
- the recommended file is mistral-7B-instruct-v0.2.Q4_K_M.gguf (7B indicates 7 billion parameter model, Q4 indicates 4 bit quantization and K_M is the type of quantization)
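- as an illustration, a GGUF file can be fetched from Hugging Face into the models subfolder (this assumes the huggingface_hub CLI is installed; the repo and file names are examples):
pip install huggingface_hub
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir models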
installing a new model
- update settings.yaml with the new model file names etc
- in Anaconda prompt, with that environment active, cd to the git clone directory created above eg. C:\Users\username\privateGPT, then:
poetry run python scripts/setup
about model files
- quantization
- reduces model file size and increases inference speed
- quantizing 32-bit floating point weight parameters to 8-bit integers in neural networks can be done with little loss of accuracy
- while FP16 can be used instead of FP32 with only a small loss of accuracy in deep neural network inference, smaller dynamic range formats like INT8 pose a challenge: during quantization, the very high dynamic range of FP32 has to be squeezed into only 255 values of INT8, or even into 15 values of INT4
- to mitigate this challenge, various techniques have been developed for quantizing models, such as per-channel or per-layer scaling, which adjust the scale and zero-point values of the weight and activation tensors to better fit the quantized format (see the sketch below)
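- a minimal Python sketch of the scale/zero-point idea (illustrative only, not code from privateGPT):
import numpy as np

def quantize_uint8(x):
    # affine (asymmetric) quantization: map the FP32 range onto the 256 levels of uint8
    scale = (x.max() - x.min()) / 255.0
    zero_point = round(float(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # recover an approximation of the original FP32 values
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)   # stand-in weight tensor
q, s, zp = quantize_uint8(w)
print(np.abs(w - dequantize(q, s, zp)).max())  # maximum quantization error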
- perplexity is a measure of the uncertainty of a model when it generates responses - the lower the better
- increasing quantization and reducing file size loses accuracy and thus increases perplexity - see https://github.com/ggerganov/llama.cpp/pull/1684#issuecomment-1579252501
- 6-bit quantized perplexity is within 0.1% or better from the original fp16 model.
- a 25Gb f16 13B LLM quantized to Q4_K_M results in perplexity rising from 5.25 to 5.3 while file size drops to a much more manageable 7.32Gb
- a 13Gb f16 7B LLM quantized to Q4_K_M results in perplexity rising from 5.9 to 5.96 while file size drops to a much more manageable 3.8Gb (Q5_K_M gives 5.92 and 4.45Gb)
- GGUF is the new quantized model file format used for CPU-based inference (via llama.cpp), which can offload some work to the GPU (great as most GPUs don't have enough VRAM for larger LLMs) - see the example below
- prompt processing is CPU speed/CPU core/GPU core bound, and inference is RAM/vRAM speed bound
- optimum performance is with K_M quants and either 4 bit or 5 bit.
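- eg. when running a GGUF model with llama.cpp directly, the -ngl (--n-gpu-layers) flag controls how many layers are offloaded to the GPU (the model file name here is an example):
./main -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 32 -p "your prompt"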
- Koboldcpp is a UI for running local models with CPU inference - a simple single file version of koboldai: https://github.com/LostRuins/koboldcpp - simply download the .exe and run it
Running PrivateGPT
- in Anaconda Prompt or Powershell:
conda activate privategpt
cd C:\Users\username\privateGPT
make run
- alternatively,
poetry run python -m private_gpt
- alternatively,
poetry run python -m uvicorn private_gpt.main:app --reload --port 8001
- Windows Security may now display a popup asking you if you want public and private networks to access this app - No should be OK
- open a web browser and enter the URL 127.0.0.1:8001
- the GUI should now display and there are 3 modes you can choose from:
- query - this will query any docs you have uploaded and display a ChatGPT-like natural language response to your question
- search - this will search any docs you have uploaded and display results of search
- LLM chat - this will chat with the LLM directly, without using your uploaded docs
- “ingesting” documents to query:
- just click the upload button to add documents
- these are converted and appended into a json format file in privateGPT\local_data\private_gpt\docstore.json
- privateGPT\local_data\private_gpt\graph_store.json is updated
- privateGPT\local_data\private_gpt\index_store.json is updated
- and presumably the embedded binary vectors are stored in privateGPT\local_data\private_gpt\qdrant\collection\make_this_parameterizable_per_api_call\storage.sqlite
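- documents can also be ingested without the GUI via the bundled API (a hedged example - the endpoint name is taken from the PrivateGPT docs of this era and may vary by version):
curl -F 'file=@mydocument.txt' http://127.0.0.1:8001/v1/ingest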
Ending a session
- in the current Anaconda Prompt or Powershell, press Ctrl-C and the server should terminate
Removing "ingested" files you had uploaded
- use the command “make wipe”
- or, manually delete these files (they will be re-created when a new instance is started):
- privateGPT\local_data\private_gpt\docstore.json
- privateGPT\local_data\private_gpt\graph_store.json
- privateGPT\local_data\private_gpt\index_store.json
- privateGPT\local_data\private_gpt\qdrant\collection\make_this_parameterizable_per_api_call\storage.sqlite
Fine-tune re-training of the Mistral 7b Instruct model
- uses Python, transformers, LoRA and SFTTrainer (see the sketch below)
- but needs an A100 GPU with 32Gb of VRAM
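- a minimal sketch of the LoRA + SFTTrainer approach (assumes the transformers, peft, trl, datasets and bitsandbytes libraries; the dataset file and hyperparameters are placeholders, not from this wiki):
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA trains small low-rank adapter matrices instead of all 7 billion weights
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM", target_modules=["q_proj", "v_proj"])

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder dataset with a "text" field

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # column holding the training text
    max_seq_length=512,
    args=TrainingArguments(output_dir="mistral-7b-finetuned",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=4,
                           num_train_epochs=1),
)
trainer.train()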