Clinical Chain-of-Thought Medical Assistant (GPT-OSS-20B, GGUF)

This repository provides GGUF builds of a clinical reasoning model fine-tuned from unsloth/gpt-oss-20b-BF16. The model answers medical questions with an explicit reasoning trace inside <think> ... </think> tags, followed by a final answer. The GGUF files target inference with llama.cpp, Ollama, and llama-cpp-python. The repository also includes a merged 16-bit checkpoint in safetensors format, so the model can be loaded with Transformers from the same repository.

Disclaimer: This model is for research and educational purposes only. It is not a medical device and is not intended for diagnosis, treatment, or any clinical decision-making. Always seek professional medical advice from a qualified clinician.

GGUF Files

The repository provides four GGUF builds. Each file name indicates its quantization level.

File	Quantization	Size	Recommended for
`model-q4_k_m.gguf`	4-bit (Q4_K_M)	15.8 GB	Default choice. Best balance between size and quality.
`model-q5_k_m.gguf`	5-bit (Q5_K_M)	16.9 GB	Slightly higher fidelity at a modestly larger footprint.
`model-q8_0.gguf`	8-bit (Q8_0)	22.3 GB	Near-lossless quality where footprint is less of a concern.
`model-f16.gguf`	16-bit (F16)	41.9 GB	Unquantized reference, intended mainly for re-quantization rather than direct use.

For most deployments, use model-q4_k_m.gguf. Runtime memory roughly tracks the file size, plus headroom for the context window.
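The sizing rule above can be sketched as a small helper that picks the largest listed file fitting a memory budget. The file sizes come from the table; the `pick_quant` name and the `headroom_gb` default are assumptions for illustration, not measured values, and real headroom depends on the context length and runtime.

```python
# File sizes in GB, copied from the table above.
GGUF_SIZES_GB = {
    "model-q4_k_m.gguf": 15.8,
    "model-q5_k_m.gguf": 16.9,
    "model-q8_0.gguf": 22.3,
    "model-f16.gguf": 41.9,
}


def pick_quant(available_gb, headroom_gb=2.0):
    """Return the largest listed file that fits, or None if none does.

    headroom_gb is an assumed allowance for the KV cache and runtime
    buffers; it is a placeholder, not a measured value.
    """
    fitting = [
        (size, name)
        for name, size in GGUF_SIZES_GB.items()
        if size + headroom_gb <= available_gb
    ]
    return max(fitting)[1] if fitting else None


for budget_gb in (16, 24, 48):
    print(budget_gb, pick_quant(budget_gb))
```

The helper only compares sizes. Since the F16 file is mainly meant for re-quantization, a caller may prefer to cap the choice at `model-q8_0.gguf` even when more memory is available.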

The merged 16-bit checkpoint (model-0000*-of-00009.safetensors), the tokenizer, and chat_template.jinja are also stored in this repository for use with Transformers.

Model Details

Property	Value
Base model	`unsloth/gpt-oss-20b-BF16` (OpenAI GPT-OSS-20B, Mixture-of-Experts, 21B total / 3.6B active parameters)
Fine-tuning method	LoRA supervised fine-tuning (SFT) followed by quantization-aware training (QAT)
Reasoning format	Harmony chat template with reasoning emitted inside `<think>` tags
Language	English
Training hardware	AMD MI300X (ROCm 6.4)
License	Apache 2.0 (inherited from the base model)

Usage

The model behaves correctly only with the harmony chat template. The GGUF files already embed this template.

Ollama

ollama run hf.co/Melikshah/gpt-oss-20b-clinical-cot-gguf:Q4_K_M

llama.cpp

# Download a single quant with the Hugging Face CLI
huggingface-cli download Melikshah/gpt-oss-20b-clinical-cot-gguf model-q4_k_m.gguf --local-dir .

# Run the downloaded file
./llama-cli -m model-q4_k_m.gguf \
    -p "What are the differential diagnoses for acute chest pain?" -n 512

llama-cpp-python

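from llama_cpp import Llama

# Download the GGUF file from the Hub (cached locally) and load it
llm = Llama.from_pretrained(
    repo_id="Melikshah/gpt-oss-20b-clinical-cot-gguf",
    filename="model-q4_k_m.gguf",
    n_ctx=4096,        # context window in tokens
    n_gpu_layers=-1,   # offload all layers to the GPU when one is available
)

# The chat template stored in the GGUF file formats the messages
messages = [{"role": "user", "content": "What are the differential diagnoses for acute chest pain?"}]
out = llm.create_chat_completion(messages=messages, max_tokens=512, temperature=0.6, top_p=0.95)
print(out["choices"][0]["message"]["content"])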

Transformers (merged 16-bit checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Melikshah/gpt-oss-20b-clinical-cot-gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What are the differential diagnoses for acute chest pain?"}]
# return_dict=True yields input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

The model returns the reasoning trace inside <think> ... </think>. The trace is intended for inspection and debugging rather than for display to end users.
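Because the trace is not meant for end users, a caller will usually want to separate it from the final answer. The following is a minimal sketch, assuming the response contains at most one well-formed <think> ... </think> pair. The `split_reasoning` helper and the sample string are hypothetical and not part of this repository.

```python
import re

# Non-greedy match so only the first <think> ... </think> pair is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from a raw model response.

    If no <think> block is present, reasoning is empty and the whole
    text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Made-up response string, used only to exercise the helper.
raw = "<think>step one, then step two</think>\nFinal answer text."
reasoning, answer = split_reasoning(raw)
print(answer)
```

An application could log `reasoning` for debugging and show only `answer` to the user.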

Training Data

Three medical reasoning datasets were merged and normalized to a unified schema with instruction, reasoning, and output fields. During formatting, the reasoning field was wrapped in <think> tags.

Dataset	Description
FreedomIntelligence/medical-o1-reasoning-SFT	Complex chain-of-thought medical reasoning
FreedomIntelligence/Medical-R1-Distill-Data	Distilled medical reasoning data
UCSC-VLAA/MedReason	Medical question answering with structured reasoning

Rows with an empty instruction or answer were filtered out, and the remaining rows were split 95% / 5% into train and eval sets.
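The preparation steps described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline that was actually used: the raw field names (`question`, `reasoning`, `answer`), the shuffle, and the seed are all hypothetical, since the source does not specify them.

```python
import random


def normalize(row):
    """Map a raw row onto the unified schema, wrapping reasoning in <think> tags.

    The input keys are assumed names; each source dataset may use different ones.
    """
    instruction = (row.get("question") or "").strip()
    reasoning = (row.get("reasoning") or "").strip()
    output = (row.get("answer") or "").strip()
    return {
        "instruction": instruction,
        "reasoning": f"<think>{reasoning}</think>",
        "output": output,
    }


def build_splits(rows, eval_fraction=0.05, seed=0):
    """Drop rows with an empty instruction or answer, then split train/eval."""
    kept = [r for r in map(normalize, rows) if r["instruction"] and r["output"]]
    random.Random(seed).shuffle(kept)  # assumed: shuffle before splitting
    n_eval = round(len(kept) * eval_fraction)
    return kept[n_eval:], kept[:n_eval]


# Synthetic rows: 100 valid ones plus two that the filter should drop.
rows = [{"question": f"Q{i}", "reasoning": f"R{i}", "answer": f"A{i}"} for i in range(100)]
rows.append({"question": "", "reasoning": "x", "answer": "y"})   # empty instruction
rows.append({"question": "Q", "reasoning": "x", "answer": " "})  # empty answer
train, eval_rows = build_splits(rows)
print(len(train), len(eval_rows))
```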

Training Procedure

Training ran in two stages. First, a LoRA adapter was trained on the merged dataset and merged back into the base weights. Second, the merged checkpoint was refined with quantization-aware training to preserve quality at low bit widths. The refined checkpoint was then exported to the GGUF formats listed above.

Stage	Key settings
SFT	LoRA rank 32, alpha 64, dropout 0.05, learning rate 1e-4, cosine schedule, 2 epochs, sequence packing, max sequence length 4096
QAT	`int4_weight_only` scheme, LoRA rank 16, alpha 32, learning rate 5e-5, 1 epoch

LoRA adapters were applied to q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, and gate_proj. Training used Unsloth and TRL, and runs were tracked with Weights & Biases.
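For quick reference, the adapter settings of both stages can be collected in one place. The sketch below uses a plain dataclass rather than the Unsloth or TRL configuration objects used in training; the `LoraStage` class is hypothetical. The `scaling` property assumes standard LoRA scaling (alpha divided by rank), not the rank-stabilized variant.

```python
from dataclasses import dataclass

# Projection layers that received adapters, as listed above.
TARGET_MODULES = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "up_proj", "down_proj", "gate_proj",
)


@dataclass(frozen=True)
class LoraStage:
    name: str
    rank: int
    alpha: int
    learning_rate: float
    epochs: int
    target_modules: tuple = TARGET_MODULES

    @property
    def scaling(self):
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank


# Values copied from the stage table above.
SFT = LoraStage("SFT", rank=32, alpha=64, learning_rate=1e-4, epochs=2)
QAT = LoraStage("QAT", rank=16, alpha=32, learning_rate=5e-5, epochs=1)

for stage in (SFT, QAT):
    print(stage.name, stage.rank, stage.alpha, stage.scaling)
```

Under that assumption both stages use the same scaling factor of 2, so the QAT stage halves the adapter rank without changing the relative weight of the update.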

Evaluation

The evaluation used an LLM-as-a-judge protocol. Four candidates each generated responses to 50 held-out questions; the responses were anonymized, shuffled, and then scored against the ground truth. The judge assigned scores for accuracy, reasoning quality, safety, and completeness, and these were aggregated into an overall score. The figures below were reported with a GPT-5.2 judge.

Win rate (out of 50 prompts)

Model	Wins	Win rate
Fine-tuned GPT-OSS (BF16)	35	70.0%
QAT fine-tuned GPT-OSS	14	28.0%
Base GPT-OSS (unfine-tuned)	1	2.0%
GPT-4.1 (OpenAI API)	0	0.0%

Average overall score (1 to 10)

Model	Score
Fine-tuned GPT-OSS (BF16)	9.16
QAT fine-tuned GPT-OSS	8.32
GPT-4.1 (OpenAI API)	7.37
Base GPT-OSS (unfine-tuned)	7.12

These figures come from a 50-question sample scored by a single judge and should be treated as preliminary.
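The anonymize, shuffle, and aggregate steps of the protocol can be sketched as follows. This is an illustration only: the function names, candidate identifiers, and seed are assumptions, the judge call itself is omitted, and the win counts are copied from the win-rate table rather than recomputed from raw judgments.

```python
import random
from collections import Counter

# Short identifiers for the four candidates (names are assumptions).
CANDIDATES = ["finetuned_bf16", "finetuned_qat", "base", "gpt_4_1"]


def anonymize(responses, rng):
    """Shuffle candidate responses and hide names behind labels A, B, C, ...

    Returns (labelled, key), where key maps each label back to its candidate.
    """
    names = list(responses)
    rng.shuffle(names)
    labels = [chr(ord("A") + i) for i in range(len(names))]
    labelled = {label: responses[name] for label, name in zip(labels, names)}
    key = dict(zip(labels, names))
    return labelled, key


def win_rates(winners, candidates):
    """Percentage of prompts won by each candidate."""
    counts = Counter(winners)
    total = len(winners)
    return {name: round(100.0 * counts[name] / total, 1) for name in candidates}


rng = random.Random(0)
labelled, key = anonymize({name: f"response from {name}" for name in CANDIDATES}, rng)
# After judging, the winning label is mapped back through key, e.g. key["A"].

# Win counts taken from the table above: 35, 14, 1, 0.
winners = ["finetuned_bf16"] * 35 + ["finetuned_qat"] * 14 + ["base"]
print(win_rates(winners, CANDIDATES))
```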

Demo

Clinical CoT Medical Assistant Demo

Intended Use and Limitations

This model is intended for research into medical reasoning, educational demonstrations, and the study of chain-of-thought fine-tuning. Clinical deployment is not supported.

Keep the following limitations in mind. The model may produce incorrect or fabricated information, since fine-tuning does not eliminate hallucination. Coverage is limited to the topics present in the training datasets, and English is the only language that was targeted. Some quality loss is expected at lower quantization levels. The published evaluation is small and was scored by a single automated judge, so the reported quality should not be over-interpreted.

Acknowledgements and Citation

OpenAI released the base model gpt-oss-20b under the Apache 2.0 license, and fine-tuning was performed with Unsloth. The full training and evaluation pipeline is available at TheDeadcoder/medical-cot-assistant.

If you use this model, cite the source datasets:

@misc{medical-o1-reasoning, title={Medical-O1-Reasoning-SFT}, author={FreedomIntelligence}, year={2024}, publisher={HuggingFace}}
@misc{medical-r1-distill, title={Medical-R1-Distill-Data}, author={FreedomIntelligence}, year={2024}, publisher={HuggingFace}}
@misc{medreason, title={MedReason}, author={UCSC-VLAA}, year={2024}, publisher={HuggingFace}}

URL: https://huggingface.co/Melikshah/gpt-oss-20b-clinical-cot-gguf
