.. meta::
   :google-site-verification: S66K6GAclKw1RroxU0Rka_2d1LZFVe27M0gRneEsIVI

.. important::

   ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide here); you may find the original ``BigDL`` project here.

------

################################################
💫 IPEX-LLM
################################################

**IPEX-LLM** is a PyTorch library for running **LLMs** on Intel CPU and GPU (e.g., a local PC with iGPU, or discrete GPUs such as Arc, Flex and Max) with very low latency [1].

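For a quick taste of the ``ipex-llm`` Hugging Face ``transformers``-style API, the minimal sketch below loads a model with INT4 optimizations and generates a short completion on an Intel GPU. The model id and prompt are placeholders, and it assumes ``ipex-llm[xpu]`` is installed; on CPU, simply drop the ``.to("xpu")`` calls.

.. code-block:: python

   # Minimal sketch: INT4 LLM inference on an Intel GPU with the
   # transformers-style API (model id and prompt are placeholders).
   import torch
   from ipex_llm.transformers import AutoModelForCausalLM
   from transformers import AutoTokenizer

   model_path = "meta-llama/Llama-2-7b-chat-hf"

   # load_in_4bit=True converts the weights to INT4 while loading
   model = AutoModelForCausalLM.from_pretrained(model_path,
                                                load_in_4bit=True,
                                                trust_remote_code=True)
   model = model.to("xpu")  # move the optimized model to the Intel GPU

   tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
   input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

   with torch.inference_mode():
       output = model.generate(input_ids, max_new_tokens=32)
   print(tokenizer.decode(output[0], skip_special_tokens=True))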

************************************************
Latest update 🔥
************************************************

* [2024/04] You can now run **Llama 3** on Intel GPU using ``llama.cpp`` and ``ollama``; see the quickstart `here `_.
* [2024/04] ``ipex-llm`` now supports **Llama 3** on Intel `GPU `_ and `CPU `_.
* [2024/04] ``ipex-llm`` now provides a C++ interface, which can be used as an accelerated backend for running `llama.cpp `_ and `ollama `_ on Intel GPU.
* [2024/03] ``bigdl-llm`` has now become ``ipex-llm`` (see the migration guide `here `_); you may find the original ``BigDL`` project `here `_.
* [2024/02] ``ipex-llm`` now supports directly loading models from `ModelScope `_ (`魔搭 `_).
* [2024/02] ``ipex-llm`` added initial **INT2** support (based on the llama.cpp `IQ2 `_ mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
* [2024/02] Users can now use ``ipex-llm`` through the `Text-Generation-WebUI `_ GUI.
* [2024/02] ``ipex-llm`` now supports `Self-Speculative Decoding `_, which in practice brings **~30% speedup** for FP16 and BF16 inference latency on Intel `GPU `_ and `CPU `_ respectively.
* [2024/02] ``ipex-llm`` now supports a comprehensive set of LLM finetuning techniques on Intel GPU (including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_).
* [2024/01] Using ``ipex-llm`` `QLoRA `_, we managed to finetune LLaMA2-7B in **21 minutes** and LLaMA2-70B in **3.14 hours** on 8 Intel Max 1550 GPUs for `Stanford-Alpaca `_ (see the blog `here `_).

.. dropdown:: More updates
   :color: primary

   * [2023/12] ``ipex-llm`` now supports `ReLoRA `_ (see `"ReLoRA: High-Rank Training Through Low-Rank Updates" `_).
   * [2023/12] ``ipex-llm`` now supports `Mixtral-8x7B `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/12] ``ipex-llm`` now supports `QA-LoRA `_ (see `"QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models" `_).
   * [2023/12] ``ipex-llm`` now supports `FP8 and FP4 inference `_ on Intel **GPU**.
   * [2023/11] Initial support for directly loading `GGUF `_, `AWQ `_ and `GPTQ `_ models into ``ipex-llm`` is available.
   * [2023/11] ``ipex-llm`` now supports `vLLM continuous batching `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `QLoRA finetuning `_ on both Intel `GPU `_ and `CPU `_.
   * [2023/10] ``ipex-llm`` now supports `FastChat serving `_ on both Intel CPU and GPU.
   * [2023/09] ``ipex-llm`` now supports `Intel GPU `_ (including iGPU, Arc, Flex and MAX).
   * [2023/09] ``ipex-llm`` `tutorial `_ is released.

************************************************
``ipex-llm`` Demos
************************************************

See the **optimized performance** of ``chatglm2-6b`` and ``llama-2-13b-chat`` models on 12th Gen Intel Core CPU and Intel Arc GPU below.
(Demo videos: ``chatglm2-6b`` and ``llama-2-13b-chat`` on 12th Gen Intel Core CPU; ``chatglm2-6b`` and ``llama-2-13b-chat`` on Intel Arc GPU.)
************************************************
``ipex-llm`` Quickstart
************************************************

============================================
Install ``ipex-llm``
============================================

* `Windows GPU `_: installing ``ipex-llm`` on Windows with Intel GPU
* `Linux GPU `_: installing ``ipex-llm`` on Linux with Intel GPU
* `Docker `_: using ``ipex-llm`` Docker images on Intel CPU and GPU

.. seealso::

   For more details, please refer to the `installation guide `_.

============================================
Run ``ipex-llm``
============================================

* `llama.cpp `_: running **llama.cpp** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``llama.cpp``) on Intel GPU
* `ollama `_: running **ollama** (*using the C++ interface of* ``ipex-llm`` *as an accelerated backend for* ``ollama``) on Intel GPU
* `vLLM `_: running ``ipex-llm`` in ``vLLM`` on both Intel `GPU `_ and `CPU `_
* `FastChat `_: running ``ipex-llm`` in ``FastChat`` serving on both Intel GPU and CPU
* `LangChain-Chatchat RAG `_: running ``ipex-llm`` in ``LangChain-Chatchat`` (*Knowledge Base QA using a* **RAG** *pipeline*)
* `Text-Generation-WebUI `_: running ``ipex-llm`` in the ``oobabooga`` **WebUI**
* `Benchmarking `_: running (latency and throughput) benchmarks for ``ipex-llm`` on Intel CPU and GPU

============================================
Code Examples
============================================

* Low-bit inference

  * `INT4 inference `_: **INT4** LLM inference on Intel `GPU `_ and `CPU `_
  * `FP8/FP4 inference `_: **FP8** and **FP4** LLM inference on Intel `GPU `_
  * `INT8 inference `_: **INT8** LLM inference on Intel `GPU `_ and `CPU `_
  * `INT2 inference `_: **INT2** LLM inference (based on the llama.cpp IQ2 mechanism) on Intel `GPU `_

* FP16/BF16 inference

  * **FP16** LLM inference on Intel `GPU `_, with possible `self-speculative decoding `_ optimization
  * **BF16** LLM inference on Intel `CPU `_, with possible `self-speculative decoding `_ optimization

* Save and load (see the sketch after this section)

  * `Low-bit models `_: saving and loading ``ipex-llm`` low-bit models
  * `GGUF `_: directly loading GGUF models into ``ipex-llm``
  * `AWQ `_: directly loading AWQ models into ``ipex-llm``
  * `GPTQ `_: directly loading GPTQ models into ``ipex-llm``

* Finetuning

  * LLM finetuning on Intel `GPU `_, including `LoRA `_, `QLoRA `_, `DPO `_, `QA-LoRA `_ and `ReLoRA `_
  * QLoRA finetuning on Intel `CPU `_

* Integration with community libraries

  * `HuggingFace transformers `_
  * `Standard PyTorch model `_
  * `DeepSpeed-AutoTP `_
  * `HuggingFace PEFT `_
  * `HuggingFace TRL `_
  * `LangChain `_
  * `LlamaIndex `_
  * `AutoGen `_
  * `ModelScope `_

* `Tutorials `_

.. seealso::

   For more details, please refer to the |ipex_llm_document|_.

.. |ipex_llm_document| replace:: ``ipex-llm`` document
.. _ipex_llm_document: doc/LLM/index.html
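As noted in the save-and-load item above, low-bit models can be persisted after the first conversion and then reloaded directly, skipping the conversion step. A minimal sketch follows; the model id and save directory are placeholders.

.. code-block:: python

   # Minimal sketch: save an ipex-llm low-bit (INT4) model once, then reload it
   # later without repeating the conversion (model id and paths are placeholders).
   from ipex_llm.transformers import AutoModelForCausalLM
   from transformers import AutoTokenizer

   model_path = "meta-llama/Llama-2-7b-chat-hf"
   save_dir = "./llama-2-7b-chat-int4"

   # First run: convert to INT4 while loading, then persist the low-bit weights
   model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
   model.save_low_bit(save_dir)
   AutoTokenizer.from_pretrained(model_path).save_pretrained(save_dir)

   # Subsequent runs: load the already-converted low-bit model directly
   model = AutoModelForCausalLM.load_low_bit(save_dir)
   tokenizer = AutoTokenizer.from_pretrained(save_dir)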
************************************************
Verified Models
************************************************

Model CPU Example GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2 link link
LLaMA 2 link1, link2 link link
LLaMA 3 link link
ChatGLM link
ChatGLM2 link link
ChatGLM3 link link
Mistral link link
Mixtral link link
Falcon link link
MPT link link
Dolly-v1 link link
Dolly-v2 link link
Replit Code link link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2 link
Baichuan link link
Baichuan2 link link
InternLM link link
Qwen link link
Qwen1.5 link link
Qwen-VL link link
Aquila link link
Aquila2 link link
MOSS link
Whisper link link
Phi-1_5 link link
Flan-t5 link link
LLaVA link link
CodeLlama link link
Skywork link
InternLM-XComposer link
WizardCoder-Python link
CodeShell link
Fuyu link
Distil-Whisper link link
Yi link link
BlueLM link link
Mamba link link
SOLAR link link
Phixtral link link
InternLM2 link link
RWKV4 link
RWKV5 link
Bark link link
SpeechT5 link
DeepSeek-MoE link
Ziya-Coding-34B-v1.0 link
Phi-2 link link
Phi-3 link link
Yuan2 link link
Gemma link link
DeciLM-7B link link
Deepseek link link
StableLM link link
CodeGemma link link
Command-R/cohere link link
************************************************
Get Support
************************************************

* Please report a bug or raise a feature request by opening a `GitHub Issue `_
* Please report a vulnerability by opening a draft `GitHub Security Advisory `_

------

[1] Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.