Model Card for HP (High-Preference) Model

This model is a specialized human preference scoring function that evaluates image quality based purely on visual aesthetics and human preferences, without relying on text-image alignment. See our paper Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment for more details.

Model Details

Model Description

The HP (High-Preference) model evaluates image quality from the image alone, with no text input. When text-image alignment reaches saturation (the ICT score approaches 1), the HP model continues to differentiate image quality based on aesthetic and perceptual factors that matter to human viewers.

Core Philosophy: Once an image adequately represents textual content, further quality improvements should be measured through pure visual assessment rather than text-image similarity metrics.

Key Features

Image-Only Evaluation: No text input is required; the score depends only on visual quality
Human Preference Aligned: Trained on preference triplets from the Pick-High and Pick-a-pic datasets
Complementary Design: Works best when combined with the ICT model for comprehensive evaluation
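
The complementary design can be sketched as a two-stage ranking: candidates whose ICT score is near saturation come first, ordered by HP score, and the remaining candidates follow, ordered by ICT score. This is a minimal sketch and not the procedure from the paper; the `Candidate` type, the `rank_candidates` helper, and the 0.9 threshold are hypothetical, and the scores are made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    ict: float  # text-image alignment score in [0, 1], from a separate ICT model
    hp: float   # image-only preference score in [0, 1]


def rank_candidates(candidates, ict_threshold=0.9):
    """Rank well-aligned candidates by HP, then the rest by ICT.

    ict_threshold is a hypothetical cutoff for "alignment is saturated".
    """
    aligned = [c for c in candidates if c.ict >= ict_threshold]
    rest = [c for c in candidates if c.ict < ict_threshold]
    aligned.sort(key=lambda c: c.hp, reverse=True)
    rest.sort(key=lambda c: c.ict, reverse=True)
    return aligned + rest


# Made-up scores for illustration only.
candidates = [
    Candidate("a.png", ict=0.97, hp=0.41),
    Candidate("b.png", ict=0.95, hp=0.78),
    Candidate("c.png", ict=0.52, hp=0.90),
]
ranked_names = [c.name for c in rank_candidates(candidates)]
print(ranked_names)
```

In this toy input, `c.png` has the highest HP score but is ranked last because its alignment score is below the assumed threshold.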

Model Sources

Repository: https://github.com/BarretBa/ICTHP
Paper: Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment
Base Model: CLIP-ViT-H-14 (Image Encoder + MLP Head)
Training Dataset: Pick-High and Pick-a-pic datasets (360,000 preference triplets)

How to Get Started with the Model

Installation

pip install torch transformers pillow numpy open-clip-torch

Quick Start

# Imports
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


class MLP(nn.Module):
    """Scoring head: maps a 1024-d CLIP image embedding to a single logit."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1024, 1024), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.Dropout(0.1),
            nn.Linear(64, 16), nn.Linear(16, 1),
        )

    def forward(self, x):
        return self.layers(x)


# Load the processor, backbone, and scoring head
device = "cuda" if torch.cuda.is_available() else "cpu"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "8y/HP"

processor = CLIPProcessor.from_pretrained(processor_name_or_path)
backbone = CLIPModel.from_pretrained(model_pretrained_name_or_path, subfolder="hp_backbone").eval().to(device)

# "8y/HP" is a Hub repo id, not a local directory, so fetch the head weights first
scorer_path = hf_hub_download(
    repo_id=model_pretrained_name_or_path,
    filename="hp_scorer/mlp_pytorch_model.bin",
)
scorer = MLP()
scorer.load_state_dict(torch.load(scorer_path, map_location="cpu"))
scorer = scorer.eval().to(device)


def calc_hp_scores(images):
    """Return one HP score in [0, 1] per input PIL image."""
    # Preprocess
    image_inputs = processor(images=images, return_tensors="pt").to(device)

    with torch.no_grad():
        # Extract image features
        image_features = backbone.get_image_features(**image_inputs)
        # Map features to HP scores
        hp_scores = torch.sigmoid(scorer(image_features))

    # squeeze(-1) keeps the batch dimension, so a single image still yields a list
    return hp_scores.squeeze(-1).cpu().tolist()


pil_images = [Image.open("image1.jpg").convert("RGB"), Image.open("image2.jpg").convert("RGB")]
scores = calc_hp_scores(pil_images)
print(f"HP Scores: {scores}")
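
For more images than fit in one forward pass, the scoring function can be called on fixed-size batches and the results ranked afterwards. The following is a minimal sketch, assuming a scoring callable that maps a list of inputs to a list of floats; `score_in_batches`, `top_k`, and the stand-in `fake_scores` are hypothetical names that are not part of the repository, and the stand-in exists only so the sketch runs without model weights.

```python
def score_in_batches(items, score_fn, batch_size=8):
    """Score items in chunks of batch_size and return one flat list of floats."""
    scores = []
    for start in range(0, len(items), batch_size):
        batch_scores = score_fn(items[start:start + batch_size])
        # A batch of one may come back as a bare float; normalize to a list.
        if not isinstance(batch_scores, list):
            batch_scores = [batch_scores]
        scores.extend(batch_scores)
    return scores


def top_k(items, scores, k=1):
    """Return the k (item, score) pairs with the highest scores."""
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Stand-in scorer (hypothetical): scores a string by its length divided by 10.
def fake_scores(batch):
    return [len(x) / 10 for x in batch]


names = ["a.jpg", "bb.jpg", "ccc.jpg"]
all_scores = score_in_batches(names, fake_scores, batch_size=2)
best = top_k(names, all_scores, k=1)
print(best)
```

With the Quick Start code loaded, `score_fn` would be `calc_hp_scores` and `items` a list of PIL images.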

Training Details

Training Data

This model was trained on 360,000 preference triplets from the Pick-High and Pick-a-pic datasets.
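
One common way to sanity-check a preference scorer against data of this kind is pairwise accuracy: the fraction of comparisons in which the human-preferred image receives the higher score. The following is a minimal sketch, assuming each record can be reduced to a (preferred score, rejected score) pair; that reduction, the `pairwise_accuracy` helper, and the numbers are illustrative assumptions, not the evaluation protocol from the paper.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of (preferred, rejected) score pairs ranked correctly.

    Ties count as incorrect.
    """
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    correct = sum(1 for preferred, rejected in score_pairs if preferred > rejected)
    return correct / len(score_pairs)


# Made-up HP scores for four preference pairs.
pairs = [(0.81, 0.40), (0.66, 0.70), (0.55, 0.55), (0.92, 0.13)]
accuracy = pairwise_accuracy(pairs)
print(accuracy)
```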

Citation

@misc{ba2025enhancingrewardmodelshighquality,
 title={Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment}, 
 author={Ying Ba and Tianyu Zhang and Yalong Bai and Wenyi Mo and Tao Liang and Bing Su and Ji-Rong Wen},
 year={2025},
 eprint={2507.19002},
 archivePrefix={arXiv},
 primaryClass={cs.CV},
 url={https://arxiv.org/abs/2507.19002}, 
}

Paper: arXiv:2507.19002, published Jul 25, 2025

URL: https://huggingface.co/8y/HP
