GLM-4V

Parameters: (not listed)
Context Length: 128K
Modality: Multimodal
Architecture: Dense
License: MIT License
Release Date: 15 Jan 2024
Knowledge Cutoff: (not listed)

Technical Specifications

Attention

Attention Structure: Multi-Head Attention
Attention Heads: (not listed)
Key-Value Heads: (not listed)
Attention Head Dimension: 128
Position Embedding: Absolute Position Embedding
RoPE Theta: (not listed)
Sliding Window Attention: (not listed)
Sliding Window Size: (not listed)
Normalization: RMS Normalization
Activation Function: (not listed)

Dimensions

Hidden Dimension Size: 4,096
Number of Layers: (not listed)
FFN Intermediate Size (Dense): 13,696
Multi-Token Prediction Heads: (not listed)

Tokenizer

Vocabulary Size: 151,552
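
A few quantities follow directly from the listed dimensions. The sketch below is only arithmetic on the values shown on this page. The head count is derived under the common convention that the hidden size equals heads times head dimension, which is an assumption here because the page leaves Attention Heads blank. The function names are illustrative, not part of any GLM tooling.

```python
# Values copied from the specification list on this page.
HIDDEN_SIZE = 4096
HEAD_DIM = 128
FFN_INTERMEDIATE = 13696
VOCAB_SIZE = 151552


def implied_head_count(hidden_size: int, head_dim: int) -> int:
    """Head count under the ASSUMPTION hidden_size == heads * head_dim."""
    if hidden_size % head_dim != 0:
        raise ValueError("hidden size is not a multiple of the head dimension")
    return hidden_size // head_dim


def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Number of entries in one token-embedding matrix (vocab x hidden)."""
    return vocab_size * hidden_size


heads = implied_head_count(HIDDEN_SIZE, HEAD_DIM)
embed = embedding_parameters(VOCAB_SIZE, HIDDEN_SIZE)
ffn_ratio = FFN_INTERMEDIATE / HIDDEN_SIZE

print(heads)      # 32
print(embed)      # 620756992
print(ffn_ratio)  # 3.34375
```

The embedding figure counts a single vocabulary-by-hidden matrix. Whether the output projection shares those weights is not stated on this page.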

Architecture Diagram (image not reproduced)

GLM-4V

GLM-4V, developed by Z.ai, is the multimodal member of the GLM-4 series. It processes high-resolution images and video alongside text, integrating visual and linguistic features so that it can handle multimodal tasks without degrading its natural language processing capabilities. The design goal is a unified framework for understanding multiple data modalities.

GLM-4V consists of three components: a Visual Encoder, an MLP Projector, and a Language Decoder. The Visual Encoder processes images and videos, often using a modified Vision Transformer (ViT), and handles arbitrary aspect ratios and resolutions up to 4K. The MLP Projector translates visual features into a representation compatible with the language model and may incorporate techniques such as 3D-RoPE for improved spatial understanding. The Language Decoder, based on the underlying GLM architecture, generates textual responses from the combined visual and textual information.
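
To make the data flow concrete, here is a minimal sketch of how visual patches could be counted, given (time, row, column) indices of the kind a 3D position scheme would consume, and placed ahead of text tokens for the decoder. It is not GLM-4V's implementation. The patch size of 14, the `<visual>` placeholder token, and every function name are hypothetical choices made for illustration, and the projector itself is omitted.

```python
import math
from typing import List, Tuple


def patch_grid(height: int, width: int, patch: int = 14) -> Tuple[int, int]:
    """Rows and columns of patches covering an image.

    Partial patches at the border count as full ones (i.e. padding is assumed).
    The patch size of 14 is a hypothetical default.
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def position_ids_3d(frames: int, rows: int, cols: int) -> List[Tuple[int, int, int]]:
    """One (time, row, column) index per patch, in frame-major order."""
    return [(t, r, c)
            for t in range(frames)
            for r in range(rows)
            for c in range(cols)]


def build_sequence(visual_tokens: int, text_tokens: List[str]) -> List[str]:
    """Visual placeholders followed by the text tokens.

    "<visual>" is a hypothetical placeholder, not a real GLM-4V token.
    """
    return ["<visual>"] * visual_tokens + text_tokens


# A non-square image (224 x 336) seen in two video frames.
rows, cols = patch_grid(224, 336)
ids = position_ids_3d(frames=2, rows=rows, cols=cols)
sequence = build_sequence(len(ids), ["What", "is", "shown", "?"])

print(rows, cols)     # 16 24
print(len(ids))       # 768
print(len(sequence))  # 772
```

Because the grid is computed per image, inputs with different aspect ratios yield different numbers of visual tokens, which is one way a model can accept arbitrary resolutions without resizing to a fixed square.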

GLM-4V supports practical applications including visual question answering, image captioning, and object detection. It also handles video understanding, using temporal information to analyze frame sequences. The design targets tasks that require both visual perception and linguistic reasoning, such as interactive tutoring for STEM subjects or generating step-by-step solutions to problems presented visually.

About GLM Family

General Language Models from Z.ai


Evaluation Benchmarks

No evaluation benchmarks are available for GLM-4V.

Rankings

Overall Rank: (not listed)
Coding Rank: (not listed)

Model Integrity

Total Score: 68 / 100

GPU Requirements
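
A rough memory estimate can be sketched from two terms: the weights at a chosen quantization width, and the key-value cache, which grows with context length. The example below is a back-of-the-envelope sketch, not the calculator behind this page. The parameter count, layer count, and key-value head count are HYPOTHETICAL placeholders, since this page does not list them. Only the head dimension of 128 comes from the specifications here. Activations, the visual encoder, and framework overhead are ignored.

```python
GIB = 1024 ** 3


def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Memory for the weights alone at the given quantization width."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values across all layers.

    The leading factor of 2 covers keys plus values; bytes_per_value=2
    assumes a 16-bit cache.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


# HYPOTHETICAL placeholders: not GLM-4V's published values.
example_params = 10_000_000_000
example_layers = 40
example_kv_heads = 32

weights_16bit = weight_bytes(example_params, 16)
weights_4bit = weight_bytes(example_params, 4)
kv_1k = kv_cache_bytes(example_layers, example_kv_heads, head_dim=128,
                       context_tokens=1024)

print(weights_16bit)  # 20000000000
print(weights_4bit)   # 5000000000
print(kv_1k / GIB)    # 0.625
```

The cache term scales linearly with context length, so at long contexts it can rival or exceed the weights, particularly when the weights are quantized and the cache is not.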



URL: https://apxml.com/models/glm-4v

GLM-4V: Specifications and GPU VRAM Requirements
