Falcon 40B Instruct

Legacy

11.3K downloads · 1.2K likes · Released May 2023 · 8K-token context · Apache 2.0 license · Quality: 50 (Good)

Falcon 40B Instruct (40B parameters) requires approximately 32.4 GB of VRAM with Q5_K_M quantization. For the best balance of quality and speed, we recommend hardware with at least 38 GB of VRAM.

Get started

Copy-paste commands to run Falcon 40B Instruct on your machine.

Run

docker run --rm -it ghcr.io/ggerganov/llama.cpp:full \
 --run \
 --hf-repo "tiiuae/falcon-40b-instruct" \
 --hf-file "falcon-40b-instruct-Q5_K_M.gguf" \
 -c 4096 -ngl 99

Quick specs

Parameters: 40B
Architecture: dense
Context: 8K tokens
Modality: text
Min RAM: 15.6 GB
Rec. RAM: 28.8 GB (Q5_K_M)
License: Apache 2.0
Family: Falcon

✓ Chat ✓ Reasoning

About this model

Falcon-40B-Instruct is a 40B-parameter causal decoder-only model built by TII. It is based on Falcon-40B and finetuned on a mixture of Baize data. It is made available under the Apache 2.0 license.

Why use Falcon-40B-Instruct?

• You are looking for a ready-to-use chat/instruct model based on Falcon-40B.
• According to its model card, Falcon-40B was the best open-source model available at release (May 2023). It outperformed LLaMA, StableLM, RedPajama, MPT, etc. See the OpenLLM Leaderboard.
• It features an architecture optimized for inference, with FlashAttention (Dao et al., 2022) and multiquery attention (Shazeer et al., 2019).

Quick picks

Best budget (B): Mac mini M4 64GB, ~$1,099, 6 tok/s
Best overall (A): RTX PRO 5000 Blackwell 48GB, ~$4,999, 44 tok/s

Best hardware

Top picks for Falcon 40B Instruct

RTX PRO 5000 Blackwell 48GB (A): 48 GB
RTX 6000 Ada 48GB (A): 48 GB
NVIDIA L40S 48GB (A): 48 GB
NVIDIA L40 48GB (A): 48 GB
NVIDIA L20 48GB (A): 48 GB

Run this model

Falcon 40B Instruct on RTX PRO 5000 Blackwell 48GB
Falcon 40B Instruct on RTX 6000 Ada 48GB
Falcon 40B Instruct on NVIDIA L40S 48GB

Quantization options

VRAM estimates by quant level

Quant	Bits	VRAM	Quality
Q2_K	2	15.6 GB	Low
Q3_K_S	3	19.6 GB	Low
NVFP4	4	22.4 GB	Medium
Q4_K_M	4	24.4 GB	Medium
Q5_K_M	5	28.8 GB	High
Q6_K	6	32.8 GB	High
Q8_0	8	42.8 GB	Very High
F16	16	82.0 GB	Maximum
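
To turn the table into a quick decision, the sketch below picks the largest quant whose estimated VRAM fits a given budget. The function name pick_quant is hypothetical and not part of any tool; the numbers are copied from the table above and are estimates only, so leave some headroom for context and runtime overhead.

```shell
# pick_quant BUDGET_GB
# Prints the largest quant level from the table whose VRAM estimate
# is at or below BUDGET_GB, or "none" if even Q2_K does not fit.
# Hypothetical helper: the figures are the page's estimates, not measurements.
pick_quant() {
  awk -v budget="$1" '
    # Rows are ordered from smallest to largest, so the last match wins.
    $2 <= budget { best = $1 }
    END { print (best == "" ? "none" : best) }
  ' <<'EOF'
Q2_K 15.6
Q3_K_S 19.6
NVFP4 22.4
Q4_K_M 24.4
Q5_K_M 28.8
Q6_K 32.8
Q8_0 42.8
F16 82.0
EOF
}
```

For example, pick_quant 24 prints NVFP4, because Q4_K_M (24.4 GB) is just over a 24 GB budget, and pick_quant 48 prints Q8_0. On an NVIDIA system, the budget could be taken from nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits, which reports total memory in MiB and would need converting to GB first.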

Quality benchmarks