
Led by Yan Shuicheng, Kunlun Wanwei 2050 Global Research Institute, NUS, and NTU release Vitron, laying the groundwork for the ultimate form of a general-purpose visual multi-modal large model

China Investors ·  Apr 25 03:04

Recently, a team led by Professor Yan Shuicheng at the Kunlun Wanwei 2050 Global Research Institute, together with the National University of Singapore and Nanyang Technological University, jointly released and open-sourced Vitron, a universal pixel-level visual multi-modal large language model.

Vitron is a heavyweight general-purpose visual multi-modal large model. It supports a range of visual tasks from visual understanding to visual generation and from low level to high level, addresses the image/video model fragmentation that has long troubled the large language model community, and provides a pixel-level general visual multi-modal model that unifies the understanding, generation, segmentation, and editing of both still images and video content. It lays a foundation for the ultimate form of next-generation general-purpose visual large models and marks another step for large models toward artificial general intelligence (AGI).

As a unified pixel-level visual multi-modal large language model, Vitron provides comprehensive support for visual tasks from low level to high level: it can handle complex visual tasks, understand and generate image and video content, and offers strong visual understanding and task-execution capabilities. Vitron also supports continuous multi-turn interaction with users, enabling flexible human-computer interaction and demonstrating the great potential of a more unified visual multi-modal general-purpose model.

All papers, code, and demos related to Vitron have been made public. Its comprehensiveness, technical innovation, human-computer interactivity, and application potential not only advance the development of multi-modal large models but also point to a new direction for future research on visual large models.

The Kunlun Wanwei 2050 Global Research Institute is committed to building an outstanding scientific research institution for the future, working with the scientific community to cross the "singularity", explore the unknown, and create a better future. The institute previously released and open-sourced AgentStudio, a toolkit for digital-agent R&D. Going forward, it will continue to drive breakthroughs in artificial intelligence technology and contribute to the construction of China's artificial intelligence ecosystem.

Visual large language models (LLMs) have made impressive progress, and the community increasingly believes that building more versatile and powerful multi-modal large language models (MLLMs) is a necessary path toward artificial general intelligence (AGI). However, on the way to a multi-modal generalist model, key challenges remain. Much existing work does not achieve fine-grained, pixel-level visual understanding, lacks unified support for both images and video, or supports only a narrow set of visual tasks, leaving it far from a general-purpose large model.

To fill this gap, a team from the Kunlun Wanwei 2050 Global Research Institute, the National University of Singapore, and Nanyang Technological University recently released and open-sourced Vitron, a universal pixel-level visual multi-modal large language model. Vitron supports a range of visual tasks from visual understanding to visual generation and from low level to high level, including comprehensive understanding, generation, segmentation, and editing of both still images and video content.

The image above depicts Vitron's support for the four major groups of vision-related tasks and its key advantages. Vitron also supports continuous multi-turn interaction with users for flexible human-machine interaction. The project demonstrates the great potential of a more unified visual multi-modal general-purpose model and lays the foundation for the ultimate form of next-generation general-purpose visual large models.

All Vitron-related papers, codes, and demos have now been made public.

Paper title: Vitron: A Unified Pixel-Level Vision LLM for Understanding, Generating, Segmenting, Editing

Project homepage & Demo:

Link to the paper:

Open source code: 

01. The ultimate unified multi-modal large language model

In recent years, large language models (LLMs) have shown unprecedented capabilities and have gradually been recognized as a viable technical route toward AGI. Meanwhile, multi-modal large language models (MLLMs) are booming across many research communities and gaining popularity rapidly. By adding modules for visual perception and extending pure-language LLMs into MLLMs, researchers have developed many MLLMs that excel at image understanding, such as BLIP-2, LLaVA, and MiniGPT-4. MLLMs focused on video understanding have also appeared one after another, such as VideoChat, Video-LLaMA, and Video-LLaVA.

Researchers have since sought to expand the capabilities of MLLMs along two main dimensions. On one hand, they have tried to deepen MLLMs' visual understanding, moving from coarse, instance-level understanding to fine-grained, pixel-level understanding of images in order to achieve regional grounding, as in GLaMM, PixelLM, NExT-Chat, and MiniGPT-v2. On the other hand, they have tried to broaden the visual capabilities MLLMs can support. Some work enables MLLMs not only to understand input visual signals but also to generate visual output: MLLMs such as GILL and Emu can flexibly generate image content, while GPT4Video and NExT-GPT can generate video.

The artificial intelligence community has gradually reached a consensus that visual MLLMs will inevitably evolve toward greater unification and stronger capabilities. However, despite the many MLLMs the community has developed, significant gaps remain.

First, almost all existing visual LLMs treat images and videos as separate entities and support either only images or only videos. The researchers argue that vision should encompass both still images and video: both are core components of the visual world and are even interchangeable in most scenarios. It is therefore necessary to build a unified MLLM framework that supports both image and video modalities.

Second, MLLMs' support for visual capabilities is still insufficient: most models can only understand, or at best generate, images or videos. The researchers argue that future MLLMs should be generalist models covering a wider range of visual tasks and operations, providing unified support for all vision-related tasks and achieving "one for all" capability. This is critical for practical applications, especially in visual creation, which typically involves a series of iterations and interactions. For example, a user often starts from text and turns an idea into visual content via text-to-image generation; refines the initial idea with further fine-grained image editing; generates video from the image to create dynamic content; and finally performs several rounds of iterative interaction, such as video editing, to perfect the creation.

The table above briefly summarizes the capabilities of existing visual MLLMs (only representative models are listed; the coverage is incomplete). To fill these gaps, the team proposed Vitron, a general-purpose pixel-level vision MLLM.

02. Vitron system architecture: three key modules

The overall Vitron framework is shown in the figure below. Vitron adopts an architecture similar to existing related MLLMs, comprising three key parts: 1) a front-end visual and language encoding module, 2) a central LLM for understanding and text generation, and 3) a back-end module for user responses and invocation of visual-control modules.

Front-end module: visual-language encoding. To perceive image and video signals and support fine-grained user visual input, Vitron integrates an image encoder, a video encoder, and a region box/sketch encoder.

Central module: core LLM. Vitron uses Vicuna (7B, v1.5) for understanding, reasoning, decision making, and multi-turn user interaction.

Back-end module: user responses and module invocation. Using a text-centric invocation strategy, Vitron integrates several off-the-shelf, state-of-the-art (SoTA) image and video processing modules to decode and execute a range of visual tasks from low level to high level. By adopting a text-centered module-invocation approach, Vitron not only unifies the system but also ensures alignment efficiency and system scalability.
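To make the division of labor between the three parts concrete, here is a minimal Python sketch of such a pipeline. It is purely illustrative: all class names, method names, and the placeholder module registry are assumptions for exposition, not the released Vitron code.

```python
# Illustrative sketch of the three-part architecture described above:
# front-end encoders -> central LLM planner -> back-end module dispatcher.
# Every name here is an assumption for exposition, not the released code.

from dataclasses import dataclass


@dataclass
class VisualInput:
    image: object = None          # e.g. an image tensor
    video: object = None          # e.g. a stack of frames
    region_sketch: object = None  # optional user-drawn box/sketch


class FrontEnd:
    """Front-end: encodes images, videos, and region sketches into LLM-readable features."""

    def encode(self, visual: VisualInput, text: str) -> dict:
        # The real system uses separate image / video / sketch encoders projected
        # into the LLM embedding space; placeholders stand in here.
        return {"visual_tokens": [], "text": text}


class CentralLLM:
    """Central LLM (Vicuna-style): understanding, reasoning, and response planning."""

    def plan(self, encoded: dict) -> dict:
        # The LLM emits a structured response: user-facing text plus an optional
        # module call (see the structured output format in Section 03, Step 3).
        return {
            "user_response": "Here is the edited image.",
            "module": "image_editing",
            "command": "replace the sky with a sunset",
            "region": [0.1, 0.0, 0.9, 0.4],  # normalized box, if the task needs one
        }


class BackEnd:
    """Back-end: routes the structured call to an off-the-shelf vision module."""

    def __init__(self):
        self.modules = {"image_editing": lambda cmd, region: f"edited({cmd})"}

    def execute(self, plan: dict):
        fn = self.modules.get(plan["module"])
        return fn(plan["command"], plan.get("region")) if fn else None


if __name__ == "__main__":
    frontend, llm, backend = FrontEnd(), CentralLLM(), BackEnd()
    plan = llm.plan(frontend.encode(VisualInput(), "make the sky a sunset"))
    print(plan["user_response"], backend.execute(plan))
```

The text-centric design is visible in the sketch: the only interface between the LLM and the back end is structured text, which is why new vision modules can be swapped in without retraining the core model.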

03. The three stages of Vitron model training

On top of this architecture, Vitron is trained and fine-tuned to give it strong visual understanding and task-execution capabilities. Model training consists of three stages.

Step 1: Overall vision-language alignment learning. The input visual and language features are mapped into a unified feature space so the system can effectively understand incoming multi-modal signals. This is coarse-grained vision-language alignment learning, which lets the system process incoming visual signals effectively as a whole. The researchers trained on existing image-caption pairs (CC3M), video-caption pairs (WebVid), and region-caption pairs (RefCOCO).
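As a rough illustration of what such coarse-grained alignment training can look like, the sketch below trains only a projection layer that maps visual features toward caption embeddings. The dimensions, the cosine-similarity objective, and the dummy data are assumptions for illustration; an actual system would feed the projected tokens into the (frozen) LLM and minimize a language-modeling loss on the captions.

```python
# Minimal sketch of coarse-grained vision-language alignment (Step 1):
# frozen encoders/LLM, a trainable projection into the LLM embedding space,
# trained on (visual, caption) pairs. Shapes and data below are assumed.

import torch
import torch.nn as nn

VIS_DIM, LLM_DIM = 1024, 4096   # assumed encoder / LLM hidden sizes

projector = nn.Linear(VIS_DIM, LLM_DIM)               # the only trainable piece here
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def alignment_step(visual_feats, caption_embeds):
    """One training step: pull projected visual features toward caption embeddings.
    Cosine alignment stands in for the caption language-modeling loss a real
    system would use."""
    projected = projector(visual_feats)                # (B, LLM_DIM)
    loss = 1 - nn.functional.cosine_similarity(projected, caption_embeds).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for CC3M / WebVid / RefCOCO features and caption embeddings.
vis = torch.randn(8, VIS_DIM)
cap = torch.randn(8, LLM_DIM)
print(alignment_step(vis, cap))
```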

Step 2: Fine-grained spatio-temporal vision grounding instruction tuning. The system relies on external modules to perform various pixel-level vision tasks, but the LLM itself has not undergone any fine-grained visual training, which prevents the system from achieving true pixel-level visual understanding. To address this, the researchers propose fine-grained spatio-temporal vision grounding instruction tuning. The core idea is to enable the LLM to ground the fine-grained spatial properties of images and the specific temporal characteristics of videos.
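The following sketch shows one plausible way such a grounding instruction sample could be serialized as text, with normalized box coordinates for spatial grounding and an optional frame range for temporal grounding. The template, tags, and coordinate convention are illustrative assumptions, not the exact format defined in the paper.

```python
# Illustrative sketch of a fine-grained grounding instruction sample (Step 2).
# Template and coordinate format are assumptions for exposition only.

def make_grounding_sample(phrase, box, frame_range=None):
    """Build an instruction/response pair that teaches the LLM to emit
    pixel-level spatial (and, for video, temporal) locations as text."""
    x1, y1, x2, y2 = box  # normalized [0, 1] corner coordinates (assumed convention)
    instruction = f"Locate '{phrase}' in the input and give its position."
    response = f"<box>[{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]</box>"
    if frame_range is not None:                        # temporal grounding for video
        response += f" <frames>{frame_range[0]}-{frame_range[1]}</frames>"
    return {"instruction": instruction, "response": response}

print(make_grounding_sample("the red car", (0.12, 0.40, 0.55, 0.88), frame_range=(14, 60)))
```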

Step 3: Output-side instruction tuning for module invocation. The second training stage gives the LLM and the front-end encoders pixel-level visual understanding. This final stage, instruction tuning for module invocation, aims to give the system the ability to execute commands accurately, allowing the LLM to generate appropriate and correct invocation text. Since different terminal vision tasks may require different invocation commands, the researchers propose standardizing the LLM's response output into a structured text format that includes (see the illustrative sketch after this list):

1) User response output, which directly replies to the user's input.

2) Module name, indicating the function or task to be performed.

3) Invocation command, the meta-instruction that triggers the task module.

4) Region (optional), which specifies the fine-grained visual features required by certain tasks, such as video tracking or visual editing, where the back-end module needs this information. For regions, the LLM outputs a bounding box described by coordinates, based on its pixel-level understanding.
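Below is a hedged sketch of how such a structured response might be represented and dispatched to a back-end module. The JSON-style layout, field names, and the toy module registry are assumptions for illustration; the actual invocation text format is defined in the paper.

```python
# Sketch of the structured response format listed above plus a tiny dispatcher.
# Field names and layout are illustrative assumptions, not the paper's exact format.

import json

EXAMPLE_LLM_OUTPUT = json.dumps({
    "user_response": "Sure, I'll track the dog across the clip.",
    "module": "video_tracking",                    # which back-end module to call
    "command": "track the target across all frames",
    "region": [0.25, 0.30, 0.60, 0.85],            # optional normalized bounding box
})

def dispatch(llm_output: str, registry: dict):
    """Parse the structured output and invoke the named back-end module, if any."""
    fields = json.loads(llm_output)
    module = registry.get(fields.get("module"))
    result = module(fields["command"], fields.get("region")) if module else None
    return fields["user_response"], result

registry = {"video_tracking": lambda cmd, region: f"tracking started in {region}"}
print(dispatch(EXAMPLE_LLM_OUTPUT, registry))
```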

04. Evaluation experiments 

The researchers conducted extensive experimental evaluations of Vitron on 22 common benchmark datasets covering 12 image/video vision tasks. Vitron demonstrates strong capability across the four major groups of visual tasks (segmentation, understanding, content generation, and editing) while also offering flexible human-computer interaction. Below are some representative qualitative comparison results:

Vision segmentation

Results of image segmentation.

Fine-grained vision understanding

Results of image referring expression prediction.

Results on video QA.

Vision generation

Vision editing

Image editing results.

For more experimental details, please refer to the paper.

05. Future directions

Overall, this work demonstrates the great potential of developing a unified visual multi-modal general-purpose model, sketches a new form for the next generation of visual large models, and takes a first step in that direction. Although the Vitron system proposed by the team shows strong general-purpose capability, it still has limitations. The researchers list several directions worth exploring in the future.

System architecture

The Vitron system still uses a semi-joint, semi-agent approach of calling external tools. Although this invocation-based design makes it easy to extend and replace candidate modules, it also means that the back-end modules in this pipeline structure do not participate in joint learning with the front end and the core LLM. This limitation is unfavorable for end-to-end learning of the whole system, and it means that the performance ceiling on different visual tasks is bounded by the back-end modules. Future work should integrate the various visual task modules into a unified unit. Achieving unified understanding and output for images and video, while supporting generation and editing through a single generative paradigm, remains a challenge. Currently, one promising direction is to combine modality-persistent tokenization to improve the unification of the system across different inputs, outputs, and tasks.

User interactivity

Unlike previous models that focus on a single visual task (e.g., Stable Diffusion), Vitron is designed to facilitate deep interaction between the LLM and users, similar in spirit to OpenAI's DALL-E series and Midjourney. Achieving optimal user interactivity is one of the core goals of this work. Vitron builds on an existing language-based LLM and uses appropriate instruction tuning to achieve a degree of interactivity: the system can respond flexibly to whatever the user enters and produce the corresponding visual operation results, without requiring user input to exactly match the conditions of the back-end modules. Nevertheless, there is still much room to improve interactivity. For example, drawing inspiration from the closed-source Midjourney system, whatever decision the LLM makes at each step, the system should actively give feedback to users to ensure its actions and decisions align with the user's intent.

Modality capabilities

Currently, Vitron integrates a 7B Vicuna model, which may limit its ability to understand language, images, and video. Future work could explore a comprehensive end-to-end system, for example scaling up the model to achieve a more thorough and comprehensive understanding of vision. In addition, efforts should be made to enable the LLM to fully unify the understanding of image and video modalities.
