OpenChat: Advancing Open-source Language Models with Mixed-Quality Data
💻Online Demo | 🤗Huggingface | 📃Paper | 💭Discord
- OpenChat is an innovative library of open-source language models, fine-tuned with C-RLFT - a strategy inspired by offline reinforcement learning.
- Our models learn from mixed-quality data without preference labels, delivering exceptional performance on par with ChatGPT, even with a 7B model which can be run on a consumer GPU (e.g. RTX 3090).
- Despite our simple approach, we are committed to developing a high-performance, commercially viable, open-source large language model, and we continue to make significant strides toward this vision.
✨ News
- [2024/05/22] We released the Llama-3 based version OpenChat 3.6 20240522, outperforming official Llama 3 8B Instruct and open-source finetunes/merges.
- [2024/01/06] We released the second update, OpenChat 3.5 0106, with further improved coding and overall performance 🏆.
- [2023/12/10] We released the first update, OpenChat 3.5 1210, improving coding by 15 points 🚀.
- [2023/11/01] We released the OpenChat-3.5-7B model, surpassing ChatGPT on various benchmarks 🔥.
- [2023/09/21] We released our paper OpenChat: Advancing Open-source Language Models with Mixed-Quality Data.
🏷️ Benchmarks - OpenChat 3.6
🏷️ Benchmarks - OpenChat 3.5
| Model | # Params | Average | MT-Bench | HumanEval | BBH MC | AGIEval | TruthfulQA | MMLU | GSM8K | BBH CoT |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenChat-3.5-0106 | 7B | 64.5 | 7.8 | 71.3 | 51.5 | 49.1 | 61.0 | 65.8 | 77.4 | 62.2 |
| ChatGPT (March)* | ???B | 61.5 | 7.94 | 48.1 | 47.6 | 47.1 | 57.7 | 67.3 | 74.9 | 70.1 |
| OpenHermes 2.5 | 7B | 59.3 | 7.54 | 48.2 | 49.4 | 46.5 | 57.5 | 63.8 | 73.5 | 59.9 |
| OpenOrca Mistral | 7B | 52.7 | 6.86 | 38.4 | 49.4 | 42.9 | 45.9 | 59.3 | 59.1 | 58.1 |
| Zephyr-β^ | 7B | 34.6 | 7.34 | 22.0 | 40.6 | 39.0 | 40.8 | 39.8 | 5.1 | 16.0 |
| Mistral | 7B | - | 6.84 | 30.5 | 39.0 | 38.0 | - | 60.1 | 52.2 | - |
| Open-source SOTA** | 13B-70B | 61.4 | 7.71 | 73.2 | 49.7 | 41.7 | 62.3 | 63.7 | 82.3 | 41.4 |
| | | | WizardLM 70B | WizardCoder 34B | Orca 13B | Orca 13B | Platypus2 70B | WizardLM 70B | MetaMath 70B | Flan-T5 11B |
🔥 OpenChat-3.5-0106 (7B) now outperforms Grok-0 (33B) on all 4 benchmarks and Grok-1 (314B) on average and 3/4 benchmarks.
| Model | License | # Params | Average | MMLU | HumanEval | MATH | GSM8k |
|---|---|---|---|---|---|---|---|
| OpenChat-3.5-0106 | Apache-2.0 | 7B | 61.0 | 65.8 | 71.3 | 29.3 | 77.4 |
| Grok-0 | Proprietary | 33B | 44.5 | 65.7 | 39.7 | 15.7 | 56.8 |
| Grok-1 | Proprietary | 314B | 55.8 | 73.0 | 63.2 | 23.9 | 62.9 |
⬇️ Installation
Note: PyTorch and CUDA are required to run OpenChat.

pip

```bash
pip3 install ochat
```
Important: If you are facing package compatibility issues with pip, try the conda method below or check this issue.
conda

```bash
conda create -y --name openchat python=3.11
conda activate openchat
pip3 install ochat
```
Windows (WSL 1.x, Ubuntu-22.04)

```bash
sudo apt update
sudo apt install -y build-essential curl
curl -o miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash miniconda.sh

# Restart the WSL terminal if the following conda command does not work
conda create -y --name openchat python=3.11
conda activate openchat
pip3 install ochat
```
From source
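The exact from-source steps are not reproduced here; a minimal sketch, assuming the standard GitHub repository URL and an editable pip install:

```bash
# Minimal sketch: install ochat from a local checkout in editable mode
# (the repository URL is an assumption; adjust to the actual repo)
git clone https://github.com/imoneoi/openchat.git
cd openchat
pip3 install -e .
```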
🚀 Deploying API server
⚡ Our API server is ready for production use and compatible with the OpenAI API protocol. It is highly optimized with vLLM and can dynamically batch requests.
📎 Note: For 20 series or older GPUs that do not support bfloat16, add `--dtype float16` to the server args.
List of currently supported models
| MODEL_TYPE | MODEL_REPO | License |
|---|---|---|
| openchat_3.6 | openchat/openchat-3.6-8b-20240522 | Llama 3 |
| openchat_3.5 | openchat/openchat-3.5-0106 | Apache 2.0 |
For a single GPU (e.g. RTX 3090, 4090)

```bash
python -m ochat.serving.openai_api_server --model MODEL_REPO
```
For multiple GPUs (tensor parallel)
# N is the number of tensor parallel GPUs
python -m ochat.serving.openai_api_server --model MODEL_REPO --engine-use-ray --worker-use-ray --tensor-parallel-size N
Use `-h` to see more settings:

```bash
python -m ochat.serving.openai_api_server --model MODEL_REPO -h
```
Request example
Once started, the server listens at `localhost:18888` for requests and is compatible with the OpenAI ChatCompletion API specification.
💡 Default Mode (GPT4 Correct): Best for coding, chat and general tasks

```bash
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL_TYPE",
    "messages": [{"role": "user", "content": "You are a large language model named OpenChat. Write a poem to describe yourself"}]
  }'
```
🧮 Mathematical Reasoning Mode: Tailored for solving math problems

```bash
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MODEL_TYPE",
    "condition": "Math Correct",
    "messages": [{"role": "user", "content": "10.3 − 7988.8133 = "}]
  }'
```
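Since the server follows the OpenAI ChatCompletion protocol, any OpenAI-compatible client can talk to it. A minimal sketch using the official `openai` Python package (v1 client API; the model name mirrors the MODEL_TYPE placeholder above, and the no-key assumption matches a default local deployment):

```python
# Minimal sketch: query the local OpenChat server through its
# OpenAI-compatible endpoint using the openai>=1.0 client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:18888/v1",  # the server started above
    api_key="none",  # assumption: no API key configured on the local server
)

response = client.chat.completions.create(
    model="openchat_3.5",  # substitute your MODEL_TYPE
    messages=[{"role": "user", "content": "Write a poem to describe yourself"}],
    # For Mathematical Reasoning Mode, the non-standard field can be passed via:
    # extra_body={"condition": "Math Correct"}
)
print(response.choices[0].message.content)
```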
🌐 Web UI - OpenChat-UI
After launching the API server, OpenChat provides a user interface that is easy to interact with. Click here to check out the Web UI.
🤗 Inference with Transformers
Warning
It's recommended to use our optimized API server for deployment. Inferencing with Transformers will be slower.
💡 Default Mode (GPT4 Correct): Best for coding, chat and general tasks

```
GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
```
🧮 Mathematical Reasoning Mode: Tailored for solving math problems

```
Math Correct User: 10.3 − 7988.8133=<|end_of_turn|>Math Correct Assistant:
```
⚠️ Notice: Remember to set `<|end_of_turn|>` as the end-of-generation token.

The default (GPT4 Correct) template is also available as the integrated `tokenizer.chat_template`, which can be used instead of manually specifying the template.
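A minimal sketch of Transformers inference using that integrated chat template (the weight repo name is an assumption; substitute any OpenChat weight repo):

```python
# Minimal sketch: generation with Hugging Face Transformers, relying on
# the tokenizer's built-in (GPT4 Correct) chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "openchat/openchat-3.5-0106"  # assumed weight repo; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How are you today?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# <|end_of_turn|> must terminate generation (see the notice above)
eot_id = tokenizer.convert_tokens_to_ids("<|end_of_turn|>")
outputs = model.generate(inputs, max_new_tokens=256, eos_token_id=eot_id)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```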
🛠️ Training
The OpenChat training system utilizes padding-free training and the Multipack Sampler, achieving a 3~10x speedup compared to conventional padded training.
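To illustrate the idea behind padding-free batching, here is a toy sketch of length-based sequence packing (first-fit decreasing); the actual Multipack Sampler uses a more sophisticated balanced allocation, so this is not the repository's implementation:

```python
# Toy sketch: pack variable-length sequences into fixed token budgets so
# batches carry no padding tokens (first-fit decreasing bin packing).
def pack_sequences(lengths, max_tokens):
    bins = []  # each bin: [remaining_capacity, [sequence indices]]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b in bins:
            if lengths[idx] <= b[0]:  # sequence fits in an existing bin
                b[0] -= lengths[idx]
                b[1].append(idx)
                break
        else:  # no bin has room: open a new one
            bins.append([max_tokens - lengths[idx], [idx]])
    return [b[1] for b in bins]

# Example: pack sequences of these token lengths into 4096-token batches
print(pack_sequences([1200, 3000, 800, 2500, 600], max_tokens=4096))
```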
Choose a base model
OpenChat supports Llama 3 and Mistral models. Please first choose a base model to fit your needs. Each base model has a corresponding weight repo, model type, and recommended batch size, as listed below; these should be filled into `BASE_REPO`, `MODEL_TYPE`, and `BATCH_SIZE` in the following instructions.
Note: The OpenChat conversation template requires the `<|eot_id|>`, `<|start_header_id|>`, `<|end_header_id|>` (Llama 3) or `<|end_of_turn|>` (Mistral) special tokens. The specified base model must include these tokens with initialized embeddings. Our provided weights are the original base weights with these tokens added and embeddings initialized. If you want to add them manually, use the `init_special_embedding_llama3.py` or `mistral_add_tokens.py` script in the `scripts` directory.
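If you prefer to add the tokens by hand rather than using those scripts, the idea is roughly as follows. This is an illustrative sketch (mean-initialized embeddings are one common choice), not the contents of `mistral_add_tokens.py`:

```python
# Sketch: add <|end_of_turn|> to a Mistral base model and initialize its
# embedding as the mean of the existing embeddings (one common choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"  # assumed base weight repo
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|end_of_turn|>"]}
)
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    # Initialize both input and output embeddings for the new token(s)
    for weight in (model.get_input_embeddings().weight,
                   model.get_output_embeddings().weight):
        weight[-num_added:] = weight[:-num_added].mean(dim=0, keepdim=True)

model.save_pretrained("mistral-with-eot")
tokenizer.save_pretrained("mistral-with-eot")
```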
Installing DeepSpeed and Flash Attention

First, ensure that the CUDA `nvcc` compiler is available in your environment. If it is not, install the CUDA toolkit that matches the version used by PyTorch.

Next, install the build dependencies:

```bash
pip install packaging ninja
```
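The build step itself then typically amounts to the following (standard PyPI package names; version pins, if any, are not shown here):

```bash
# Compile and install DeepSpeed and FlashAttention from PyPI;
# both build CUDA extensions using the nvcc toolchain noted above
pip install deepspeed flash-attn
```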