This model was released on 2025-07-28 and added to Hugging Face Transformers on 2025-08-08.

PyTorch FlashAttention SDPA

Glm4vMoe

Overview

Vision-language models (VLMs) have become a cornerstone of intelligent systems. As real-world AI tasks grow increasingly complex, VLMs need reasoning capabilities that go beyond basic multimodal perception, improving accuracy, comprehensiveness, and intelligence, to enable complex problem solving, long-context understanding, and multimodal agents.

Through our open-source work, we aim to explore the technological frontier together with the community while empowering more developers to create exciting and innovative applications.

GLM-4.5V (GitHub repo) is based on ZhipuAI's next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active). It continues the technical approach of GLM-4.1V-Thinking and achieves state-of-the-art performance among models of the same scale on 42 public vision-language benchmarks. It covers common tasks such as image, video, and document understanding, as well as GUI agent operations.
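
The snippet below is a minimal usage sketch with the Transformers API. The Hub checkpoint id `zai-org/GLM-4.5V` and the image URL are assumptions; substitute the repository and inputs you actually use. The `Glm4vMoeForConditionalGeneration` class is the one documented further down this page.

```python
# Minimal sketch, not an official example. The checkpoint id and the image URL
# below are assumptions; replace them with the repository and inputs you use.
import torch
from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration

model_id = "zai-org/GLM-4.5V"  # assumed Hub repository name
processor = AutoProcessor.from_pretrained(model_id)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scene.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```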


Beyond benchmark performance, GLM-4.5V focuses on real-world usability. Through efficient hybrid training, it can handle diverse types of visual content, enabling full-spectrum vision reasoning, including:

  • Image reasoning (scene understanding, complex multi-image analysis, spatial recognition; a multi-image sketch follows this list)
  • Video understanding (long video segmentation and event recognition)
  • GUI tasks (screen reading, icon recognition, desktop operation assistance)
  • Complex chart & long document parsing (research report analysis, information extraction)
  • Grounding (precise visual element localization)

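As a rough illustration of the multi-image analysis case from the list above, the sketch below reuses the `processor` and `model` objects created in the previous example; the two image URLs are placeholders.

```python
# Multi-image comparison sketch, reusing `processor` and `model` from the
# previous example. The image URLs are placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/floorplan_v1.png"},
            {"type": "image", "url": "https://example.com/floorplan_v2.png"},
            {"type": "text", "text": "Compare the two images and list what changed."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
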
The model also introduces a Thinking Mode switch, allowing users to balance between quick responses and deep reasoning. This switch works the same as in the GLM-4.5 language model.

Glm4vMoeConfig

autodoc Glm4vMoeConfig

Glm4vMoeTextConfig

autodoc Glm4vMoeTextConfig

Glm4vMoeTextModel

autodoc Glm4vMoeTextModel - forward

Glm4vMoeModel

autodoc Glm4vMoeModel - forward

Glm4vMoeForConditionalGeneration

autodoc Glm4vMoeForConditionalGeneration - forward