🧬 DTI Models: Drug–Target Interaction Prediction

A collection of Drug–Target Interaction (DTI) models for activity prediction and potency estimation.

Current Status: 🚧 Active Development / Research Prototype

Overview

This repository contains two Drug–Target Interaction models developed for computational drug discovery research.

Available Models

1. DTI-LLM (LoRA Adapter)

A LoRA fine‑tuned LLaMA‑3 model designed for:

Activity Classification
Potency Regression (pXC50 Prediction)

This is a dual‑task model capable of both predicting whether a drug–target pair is active and estimating interaction potency.

2. DTI-BioMedBERT (Classification Checkpoint)

A BioMedBERT‑based checkpoint trained specifically for:

Binary Activity Classification

This model focuses exclusively on determining whether a compound is likely to be biologically active against a target protein.

Research Goal

The primary objective of this project is to explore how modern AI models can be adapted for computational drug discovery tasks.

The long-term goals include:

Improving virtual screening workflows
Assisting early-stage lead prioritization
Exploring LLM-based molecular reasoning
Investigating structured biomedical prediction
Building lightweight domain-specific AI systems deployable on consumer hardware

This repository represents an ongoing research effort rather than a finished production system.

Model Variants

DTI-LLM (LoRA Adapter)

Component	Value
Base Model	`unsloth/llama-3-8b-bnb-4bit`
Fine‑Tuning Method	LoRA + Checkpoint
Training Hardware	NVIDIA T4 16GB
Framework	Unsloth

Tasks

Classification – Predict whether a drug is likely to be biologically active against a target protein.
Regression – Estimate the interaction potency (pXC50) of the drug–target pair.
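pXC50 is not defined in this README. The sketch below assumes the usual convention, in which pXC50 is the negative base-10 logarithm of the XC50 concentration (for example IC50 or EC50) in mol/L. The helper names are illustrative and not part of this repository.

```python
import math


def pxc50_to_nanomolar(pxc50: float) -> float:
    """Convert pXC50 (assumed to be -log10 of XC50 in mol/L) to nM."""
    # 1 mol/L = 1e9 nM, so XC50[nM] = 10 ** (9 - pXC50).
    return 10 ** (9 - pxc50)


def nanomolar_to_pxc50(conc_nm: float) -> float:
    """Inverse conversion: concentration in nM back to pXC50."""
    if conc_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9 - math.log10(conc_nm)


# A pXC50 of 6.2 corresponds to roughly 631 nM under this convention.
print(round(pxc50_to_nanomolar(6.2)))
```

Because the scale is logarithmic, an error of one pXC50 unit corresponds to a tenfold error in concentration.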

DTI-BioMedBERT (Classification Checkpoint)

Component	Value
Base Model	`microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`
Architecture	BioMedBERT
Task	Binary Classification
Output	`Active` / `Inactive`
Framework	Transformers

Task

Predict whether a drug–target pair is biologically active.

Unlike DTI-LLM, this checkpoint does not perform potency regression and is intended solely for activity prediction.

Input Format

The models expect information about:

Drug molecule (SMILES)
Protein target (UniProt ID)
Optional assay metadata

Example:

Drug:
SMILES: NC1=NC(=S)C2=C(N1)N=CN2

Target:
UniProt ID: Q13043
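The inputs listed above could be collected and sanity-checked before inference with something like the following sketch. The `DTIQuery` class is hypothetical; its optional fields simply mirror the CLI argument names used later in this README, and the regular expression is the accession pattern published by UniProt. SMILES validity is not checked here, since that would need a cheminformatics toolkit.

```python
import re
from dataclasses import dataclass
from typing import Optional

# UniProt accession pattern (6 or 10 characters), as documented by UniProt.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


@dataclass
class DTIQuery:
    """Hypothetical container for one drug-target query."""
    smiles: str
    uniprot_id: str
    target_name: Optional[str] = None   # optional assay metadata
    mechanism: Optional[str] = None
    technology: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.smiles.strip():
            raise ValueError("SMILES string must not be empty")
        if not UNIPROT_RE.fullmatch(self.uniprot_id):
            raise ValueError(f"not a valid UniProt accession: {self.uniprot_id!r}")


query = DTIQuery(smiles="NC1=NC(=S)C2=C(N1)N=CN2", uniprot_id="Q13043")
print(query)
```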

Output Formats

DTI-LLM

{
 "is_active": true,
 "pxc50": 6.2,
 "confidence": "high",
 "reasoning": "Structural similarity suggests moderate binding affinity."
}

Field	Description
is_active	Binary activity prediction
pxc50	Predicted potency value
confidence	Model confidence estimate, given as a qualitative label ("high" in the example above)
reasoning	Generated explanation
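Since DTI-LLM produces this record as generated text, downstream code may want to parse it defensively. The following is a minimal sketch, assuming the JSON object can be surrounded by extra text; the function name is illustrative, and the repository's own `inference.py` may handle this differently.

```python
import json


def parse_dti_llm_output(text: str) -> dict:
    """Extract and validate the JSON record from raw generated text."""
    # Take the span from the first "{" to the last "}" (assumes one object).
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    record = json.loads(text[start:end + 1])

    if not isinstance(record.get("is_active"), bool):
        raise ValueError("'is_active' must be true or false")
    pxc50 = record.get("pxc50")
    if pxc50 is not None and (
        isinstance(pxc50, bool) or not isinstance(pxc50, (int, float))
    ):
        raise ValueError("'pxc50' must be numeric when present")
    return record


raw = 'Prediction: {"is_active": true, "pxc50": 6.2, "confidence": "high", "reasoning": "..."}'
print(parse_dti_llm_output(raw)["pxc50"])
```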

DTI-BioMedBERT

{
 "is_active": true,
 "confidence": 0.91
}

Field	Description
is_active	Binary activity prediction
confidence	Predicted confidence score, given as a number (0.91 in the example above)
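One plausible way to obtain this record from a two-class sequence classifier is to apply a softmax to the output logits. The sketch below assumes that index 1 is the `Active` class and that `confidence` is the probability of the predicted class. Neither is confirmed by this README, so check `id2label` in `config.json` before relying on it.

```python
import math


def logits_to_prediction(logits, active_index=1):
    """Turn raw class logits into an {is_active, confidence} record."""
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return {
        "is_active": predicted == active_index,  # assumed label order
        "confidence": probs[predicted],          # probability of predicted class
    }


print(logits_to_prediction([-1.2, 1.1]))
```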

Performance

DTI-LLM

Classification Task (Activity Prediction)

Metric	Score
Accuracy	0.946
Precision	1.000
Recall	0.512
F1 Score	0.658
ROC-AUC	0.765
PR-AUC	0.610

Interpretation

The model currently exhibits very high precision (1.000 on the evaluation set) but only moderate recall (0.512).

When the model predicts that a compound is active, it is rarely incorrect. This behavior makes it useful for reducing false positives during early-stage virtual screening.

However, a recall of 0.512 means that roughly half of the genuinely active compounds are not identified.

Current development efforts are focused on improving recall while maintaining strong precision.
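As a quick worked check of how these metrics relate, the sketch below computes the share of missed actives and the F1 score as the harmonic mean of the reported precision and recall. The harmonic mean comes out at about 0.677, slightly above the reported 0.658. This README does not say how its F1 was averaged, so the two figures may not be directly comparable.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.000, 0.512  # values from the table above

print(round(1 - recall, 3))                    # fraction of actives missed: 0.488
print(round(f1_score(precision, recall), 3))   # 0.677
```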

Regression Task (Potency Prediction)

Metric	Score
RMSE	1.099
MAE	0.723
R²	-0.235
Pearson r	0.404
Spearman ρ	0.578

Interpretation

The regression component remains experimental.

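While the model demonstrates moderate ranking capability (Spearman ρ = 0.578), absolute potency prediction is currently unreliable: the negative R² (-0.235) means the predictions fit the measured values worse than always predicting the mean pXC50 would.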

The model can often distinguish stronger interactions from weaker ones, but exact pXC50 values should not be interpreted as experimentally accurate measurements.
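A negative R² alongside a positive rank correlation can look contradictory. The toy example below shows how it arises: predictions that are ordered correctly but compressed and shifted still rank perfectly, while their absolute values are far off. The numbers are invented for illustration and are not taken from this repository's evaluation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination; negative when worse than the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def ranks(values):
    """1-based ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation (no-ties formula)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


measured = [5.0, 6.0, 7.0, 8.0]     # toy pXC50 values
predicted = [7.5, 7.6, 7.7, 7.8]    # right order, wrong scale

print(spearman(measured, predicted))   # perfect ranking
print(r_squared(measured, predicted))  # negative
```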

For the current release:

✅ Suitable for relative ranking

⚠️ Not suitable for precise potency estimation

Future work will focus heavily on improving regression performance through larger datasets, improved loss functions, and multi-task optimization.

DTI-BioMedBERT

Classification Task (Activity Prediction)

Metric	Score
Accuracy	0.925
Precision	0.560
Recall	0.593
F1 Score	0.576
ROC-AUC	0.903

Interpretation

The DTI-BioMedBERT checkpoint reaches a ROC-AUC of 0.903, indicating effective discrimination between active and inactive drug–target pairs. At the reported operating point, however, precision is 0.560, so roughly 44% of the pairs it flags as active are false positives.

Compared with DTI-LLM, it provides a more balanced precision–recall tradeoff and is optimized specifically for activity prediction.
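ROC-AUC summarizes ranking quality across all thresholds, whereas precision and recall depend on the single decision threshold in use. If the checkpoint's scores are available, the tradeoff can be explored with a sketch like the one below. The scores and labels are toy values, and the helper is not part of this repository.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting 'active' for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Both are reported as 0.0 when their denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


scores = [0.95, 0.80, 0.65, 0.55, 0.30, 0.10]  # toy "active" probabilities
labels = [1, 1, 0, 1, 0, 0]                    # toy ground truth

for threshold in (0.5, 0.7):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold in this toy case trades recall for precision, which is the same lever that separates the two models' operating points.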

Recommended use cases include:

✅ Binary DTI classification

✅ Large-scale virtual screening

✅ Activity prediction benchmarks

✅ Fast inference workflows

Choosing a Model

Use Case	Recommended Model
Activity Prediction Only	DTI-BioMedBERT
Activity + Potency Prediction	DTI-LLM
Fast Screening	DTI-BioMedBERT
Potency Ranking	DTI-LLM
LLM-Based Biomedical Research	DTI-LLM
Highest ROC-AUC Classification	DTI-BioMedBERT

Current Development Status

These models are actively being developed.

Planned improvements include:

Larger and more diverse training datasets
Additional target protein coverage
Improved regression accuracy
Better calibration of confidence scores
Multi-stage fine-tuning strategies
Retrieval-augmented biomedical context
Expanded benchmark evaluation

Performance metrics and model behavior may change significantly between releases.

Example Usage

Installation

pip install unsloth transformers accelerate bitsandbytes peft

Loading DTI-LLM

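from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the 4-bit quantized base model (needs bitsandbytes; typically a CUDA GPU).
base_model = AutoModelForCausalLM.from_pretrained(
 "unsloth/llama-3-8b-bnb-4bit"
)

# Attach the LoRA adapter weights to the base model.
# Note: "Repository Contents" lists the adapter files under dti_llm/, so the
# path may need adjusting if loading from the repository root fails.
model = PeftModel.from_pretrained(
 base_model,
 "Cyanex/BioGPT-X"
)

# Load the tokenizer published alongside the adapter.
tokenizer = AutoTokenizer.from_pretrained(
 "Cyanex/BioGPT-X"
)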

CLI Inference for the LoRA Adapter (Recommended)

The repository includes a ready-to-use inference script for generating Drug–Target Interaction predictions.

Example:

python inference.py \
 --model_path ./lora_adapter \
 --smiles "CCO" \
 --uniprot "P04637" \
 --target_name "p53" \
 --mechanism "binding" \
 --technology "IC50 assay"

Supported Arguments

Argument	Description
`--model_path`	Path to the LoRA adapter
`--smiles`	Drug SMILES string
`--uniprot`	UniProt protein identifier
`--target_name`	Optional target name
`--mechanism`	Optional assay mechanism
`--technology`	Optional assay technology

The CLI script is the recommended way to run inference and reproduce the results reported in this repository.
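To score several drug–target pairs, the CLI can be driven from Python. The sketch below only builds the command lines from the arguments documented above and leaves the actual call commented out. The helper name and the example pairs are illustrative, and `inference.py` is assumed to be in the current working directory.

```python
import shlex
import sys


def build_inference_command(smiles, uniprot, model_path="./lora_adapter",
                            target_name=None, mechanism=None, technology=None):
    """Assemble the argv list for one inference.py call."""
    cmd = [sys.executable, "inference.py",
           "--model_path", model_path,
           "--smiles", smiles,
           "--uniprot", uniprot]
    # Optional assay metadata is appended only when provided.
    optional = (("--target_name", target_name),
                ("--mechanism", mechanism),
                ("--technology", technology))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, value]
    return cmd


pairs = [("CCO", "P04637"), ("NC1=NC(=S)C2=C(N1)N=CN2", "Q13043")]
for smiles, uniprot in pairs:
    cmd = build_inference_command(smiles, uniprot)
    print(shlex.join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to run
```

Each call starts a new process, so the model is loaded again for every pair. For large batches, loading the model once in Python is likely to be faster.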

Loading DTI-BioMedBERT

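from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the classification checkpoint and its tokenizer.
# Note: "Repository Contents" lists these files under dti_biomedbert/; if loading
# from the repository root fails, passing subfolder="dti_biomedbert" to both
# from_pretrained calls may be needed.
model = AutoModelForSequenceClassification.from_pretrained(
 "Cyanex/BioGPT-X"
)

tokenizer = AutoTokenizer.from_pretrained(
 "Cyanex/BioGPT-X"
)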

Repository Contents

dti_llm/
├── adapter_config.json
├── adapter_model.safetensors
├── tokenizer.json
├── tokenizer_config.json
└── training_config.json

dti_biomedbert/
├── config.json
├── model.safetensors
├── tokenizer.json
└── tokenizer_config.json

Limitations

Regression Performance

Potency prediction remains the weakest component of the DTI-LLM system and should be considered experimental.

Dataset Bias

Training data originates from public biological assays and may not represent all protein families, assay conditions, or chemical spaces.

Hallucinated Reasoning

Explanations generated by DTI-LLM are free-form model output and should not be interpreted as mechanistic biological evidence.

Not for Clinical Use

These models are intended solely for research, education, and experimentation.

Predictions must never be used for:

Clinical decision making
Medical diagnosis
Drug prescription
Regulatory submissions

All predictions require experimental validation.

Intended Use

Appropriate uses include:

Academic research
Educational projects
Drug discovery experimentation
Virtual screening exploration
Biomedical AI benchmarking
Model fine-tuning demonstrations

Acknowledgements

Special thanks to:

Meta for LLaMA-3
Unsloth for efficient fine-tuning tools
Microsoft Research for BioMedBERT
The creators of the eve-bio/drug-target-activity dataset
The open-source biomedical AI community

License

Research Only.

Commercial use may be subject to the license terms of the underlying LLaMA-3 and BioMedBERT models.

Disclaimer

DTI-LLM and DTI-BioMedBERT are experimental research projects under active development.

All predictions are computational estimates and should not be considered biological evidence.

Experimental validation is required before any practical use.

URL: https://huggingface.co/Cyanex/BioGPT-X