
Led by Yan Shuicheng, Kunlun Wanwei 2050 Global Research Institute, NUS, and NTU release Vitron, laying the groundwork for the ultimate form of a general-purpose visual multi-modal large model

China Investors ·  Apr 25 03:04

Recently, a team led by Professor Yan Shuicheng at the Kunlun Wanwei 2050 Global Research Institute, together with the National University of Singapore and Nanyang Technological University, jointly released and open-sourced Vitron, a universal pixel-level visual multi-modal large language model.

Vitron is a heavyweight general-purpose visual multi-modal large model. It supports a range of visual tasks from visual understanding to visual generation and from low level to high level, addresses the image/video model fragmentation that has long troubled the large language model community, and provides a pixel-level general visual multi-modal model that unifies the understanding, generation, segmentation, and editing of both still images and video content. It lays a foundation for the ultimate form of next-generation general-purpose visual large models and marks another step for large models toward artificial general intelligence (AGI).

As a unified pixel-level visual multi-modal large language model, Vitron provides comprehensive support for visual tasks from low level to high level: it can handle complex visual tasks, understand and generate image and video content, and offers strong visual understanding and task-execution capabilities. Vitron also supports continuous multi-turn interaction with users, enabling flexible human-computer interaction and demonstrating the great potential of a more unified visual multi-modal general-purpose model.

All papers, code, and demos related to Vitron have been made public. Its comprehensiveness, technical innovation, human-computer interactivity, and application potential not only advance the development of multi-modal large models but also point to a new direction for future research on visual large models.

The Kunlun Wanwei 2050 Global Research Institute is committed to building an outstanding scientific research institution for the future, working with the scientific community to cross the "singularity", explore the unknown, and create a better future. The institute previously released and open-sourced AgentStudio, a toolkit for digital-agent R&D. Going forward, it will continue to drive breakthroughs in artificial intelligence technology and contribute to the construction of China's artificial intelligence ecosystem.

Visual large language models (LLMs) have made impressive progress, and the community increasingly believes that building more versatile and powerful multi-modal large language models (MLLMs) is a necessary path toward artificial general intelligence (AGI). However, on the way to a multi-modal generalist model, key challenges remain. Much existing work does not achieve fine-grained, pixel-level visual understanding, lacks unified support for both images and video, or supports only a narrow set of visual tasks, leaving it far from a general-purpose large model.

To fill this gap, a team from the Kunlun Wanwei 2050 Global Research Institute, the National University of Singapore, and Nanyang Technological University recently released and open-sourced Vitron, a universal pixel-level visual multi-modal large language model. Vitron supports a range of visual tasks from visual understanding to visual generation and from low level to high level, including comprehensive understanding, generation, segmentation, and editing of both still images and video content.

The image above depicts Vitron's support for the four major groups of vision-related tasks and its key advantages. Vitron also supports continuous multi-turn interaction with users for flexible human-machine interaction. The project demonstrates the great potential of a more unified visual multi-modal general-purpose model and lays the foundation for the ultimate form of next-generation general-purpose visual large models.

All Vitron-related papers, codes, and demos have now been made public.

Paper title: Vitron: A Unified Pixel-Level Vision LLM for Understanding, Generating, Segmenting, Editing

Project homepage & Demo:

Link to the paper:

Open source code: 

01. The ultimate unified multi-modal large language model

In recent years, large language models (LLMs) have shown unprecedented capabilities and have gradually been recognized as a viable technical route toward AGI. Meanwhile, multi-modal large language models (MLLMs) are booming across many research communities and gaining popularity rapidly. By adding modules for visual perception and extending pure-language LLMs into MLLMs, researchers have developed many MLLMs that excel at image understanding, such as BLIP-2, LLaVA, and MiniGPT-4. MLLMs focused on video understanding have also appeared one after another, such as VideoChat, Video-LLaMA, and Video-LLaVA.

Researchers have since sought to expand the capabilities of MLLMs along two main dimensions. On one hand, they have tried to deepen MLLMs' visual understanding, moving from coarse, instance-level understanding to fine-grained, pixel-level understanding of images in order to achieve regional grounding, as in GLaMM, PixelLM, NExT-Chat, and MiniGPT-v2. On the other hand, they have tried to broaden the visual capabilities MLLMs can support. Some work enables MLLMs not only to understand input visual signals but also to generate visual output: MLLMs such as GILL and Emu can flexibly generate image content, while GPT4Video and NExT-GPT can generate video.

The artificial intelligence community has gradually reached a consensus that visual MLLMs will inevitably evolve toward greater unification and stronger capabilities. However, despite the many MLLMs the community has developed, significant gaps remain.

First, almost all existing visual LLMs treat images and videos as separate entities and support either only images or only videos. The researchers argue that vision should encompass both still images and video: both are core components of the visual world and are even interchangeable in most scenarios. It is therefore necessary to build a unified MLLM framework that supports both image and video modalities.

Second, MLLMs' support for visual capabilities is still insufficient: most models can only understand, or at best generate, images or videos. The researchers argue that future MLLMs should be generalist models covering a wider range of visual tasks and operations, providing unified support for all vision-related tasks and achieving "one for all" capability. This is critical for practical applications, especially in visual creation, which typically involves a series of iterations and interactions. For example, a user often starts from text and turns an idea into visual content via text-to-image generation; refines the initial idea with further fine-grained image editing; generates video from the image to create dynamic content; and finally performs several rounds of iterative interaction, such as video editing, to perfect the creation.

The table above briefly summarizes the capabilities of existing visual MLLMs (only representative models are listed; the coverage is incomplete). To fill these gaps, the team proposed Vitron, a general-purpose pixel-level vision MLLM.

02. Vitron system architecture: three key modules

The overall Vitron framework is shown in the figure below. Vitron adopts an architecture similar to existing related MLLMs, comprising three key parts: 1) a front-end visual and language encoding module, 2) a central LLM for understanding and text generation, and 3) a back-end module for user responses and invocation of visual-control modules.

Front-end module: visual-language encoding. To perceive image and video signals and support fine-grained user visual input, Vitron integrates an image encoder, a video encoder, and a region box/sketch encoder.

Central module: core LLM. Vitron uses Vicuna (7B, v1.5) for understanding, reasoning, decision making, and multi-turn user interaction.

Back-end module: user responses and module invocation. Using a text-centric invocation strategy, Vitron integrates several off-the-shelf, state-of-the-art (SoTA) image and video processing modules to decode and execute a range of visual tasks from low level to high level. By adopting a text-centered module-invocation approach, Vitron not only unifies the system but also ensures alignment efficiency and system scalability.
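To make the division of labor between the three parts concrete, here is a minimal Python sketch of such a pipeline. It is purely illustrative: all class names, method names, and the placeholder module registry are assumptions for exposition, not the released Vitron code.

```python
# Illustrative sketch of the three-part architecture described above:
# front-end encoders -> central LLM planner -> back-end module dispatcher.
# Every name here is an assumption for exposition, not the released code.

from dataclasses import dataclass


@dataclass
class VisualInput:
    image: object = None          # e.g. an image tensor
    video: object = None          # e.g. a stack of frames
    region_sketch: object = None  # optional user-drawn box/sketch


class FrontEnd:
    """Front-end: encodes images, videos, and region sketches into LLM-readable features."""

    def encode(self, visual: VisualInput, text: str) -> dict:
        # The real system uses separate image / video / sketch encoders projected
        # into the LLM embedding space; placeholders stand in here.
        return {"visual_tokens": [], "text": text}


class CentralLLM:
    """Central LLM (Vicuna-style): understanding, reasoning, and response planning."""

    def plan(self, encoded: dict) -> dict:
        # The LLM emits a structured response: user-facing text plus an optional
        # module call (see the structured output format in Section 03, Step 3).
        return {
            "user_response": "Here is the edited image.",
            "module": "image_editing",
            "command": "replace the sky with a sunset",
            "region": [0.1, 0.0, 0.9, 0.4],  # normalized box, if the task needs one
        }


class BackEnd:
    """Back-end: routes the structured call to an off-the-shelf vision module."""

    def __init__(self):
        self.modules = {"image_editing": lambda cmd, region: f"edited({cmd})"}

    def execute(self, plan: dict):
        fn = self.modules.get(plan["module"])
        return fn(plan["command"], plan.get("region")) if fn else None


if __name__ == "__main__":
    frontend, llm, backend = FrontEnd(), CentralLLM(), BackEnd()
    plan = llm.plan(frontend.encode(VisualInput(), "make the sky a sunset"))
    print(plan["user_response"], backend.execute(plan))
```

The text-centric design is visible in the sketch: the only interface between the LLM and the back end is structured text, which is why new vision modules can be swapped in without retraining the core model.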

03. The three stages of Vitron model training

On top of this architecture, Vitron is trained and fine-tuned to give it strong visual understanding and task-execution capabilities. Model training consists of three stages.

Step 1: Overall vision-language alignment learning. The input visual and language features are mapped into a unified feature space so the system can effectively understand incoming multi-modal signals. This is coarse-grained vision-language alignment learning, which lets the system process incoming visual signals effectively as a whole. The researchers trained on existing image-caption pairs (CC3M), video-caption pairs (WebVid), and region-caption pairs (RefCOCO).
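As a rough illustration of what such coarse-grained alignment training can look like, the sketch below trains only a projection layer that maps visual features toward caption embeddings. The dimensions, the cosine-similarity objective, and the dummy data are assumptions for illustration; an actual system would feed the projected tokens into the (frozen) LLM and minimize a language-modeling loss on the captions.

```python
# Minimal sketch of coarse-grained vision-language alignment (Step 1):
# frozen encoders/LLM, a trainable projection into the LLM embedding space,
# trained on (visual, caption) pairs. Shapes and data below are assumed.

import torch
import torch.nn as nn

VIS_DIM, LLM_DIM = 1024, 4096   # assumed encoder / LLM hidden sizes

projector = nn.Linear(VIS_DIM, LLM_DIM)               # the only trainable piece here
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def alignment_step(visual_feats, caption_embeds):
    """One training step: pull projected visual features toward caption embeddings.
    Cosine alignment stands in for the caption language-modeling loss a real
    system would use."""
    projected = projector(visual_feats)                # (B, LLM_DIM)
    loss = 1 - nn.functional.cosine_similarity(projected, caption_embeds).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for CC3M / WebVid / RefCOCO features and caption embeddings.
vis = torch.randn(8, VIS_DIM)
cap = torch.randn(8, LLM_DIM)
print(alignment_step(vis, cap))
```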

Step 2: Fine-grained spatio-temporal vision grounding instruction tuning. The system relies on external modules to perform various pixel-level vision tasks, but the LLM itself has not undergone any fine-grained visual training, which prevents the system from achieving true pixel-level visual understanding. To address this, the researchers propose fine-grained spatio-temporal vision grounding instruction tuning. The core idea is to enable the LLM to ground the fine-grained spatial properties of images and the specific temporal characteristics of videos.
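The following sketch shows one plausible way such a grounding instruction sample could be serialized as text, with normalized box coordinates for spatial grounding and an optional frame range for temporal grounding. The template, tags, and coordinate convention are illustrative assumptions, not the exact format defined in the paper.

```python
# Illustrative sketch of a fine-grained grounding instruction sample (Step 2).
# Template and coordinate format are assumptions for exposition only.

def make_grounding_sample(phrase, box, frame_range=None):
    """Build an instruction/response pair that teaches the LLM to emit
    pixel-level spatial (and, for video, temporal) locations as text."""
    x1, y1, x2, y2 = box  # normalized [0, 1] corner coordinates (assumed convention)
    instruction = f"Locate '{phrase}' in the input and give its position."
    response = f"<box>[{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]</box>"
    if frame_range is not None:                        # temporal grounding for video
        response += f" <frames>{frame_range[0]}-{frame_range[1]}</frames>"
    return {"instruction": instruction, "response": response}

print(make_grounding_sample("the red car", (0.12, 0.40, 0.55, 0.88), frame_range=(14, 60)))
```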

Step 3: Output-side instruction tuning for module invocation. The second training stage gives the LLM and the front-end encoders pixel-level visual understanding. This final stage, instruction tuning for module invocation, aims to give the system the ability to execute commands accurately, allowing the LLM to generate appropriate and correct invocation text. Since different terminal vision tasks may require different invocation commands, the researchers propose standardizing the LLM's response output into a structured text format that includes (see the illustrative sketch after this list):

1) User response output, which directly replies to the user's input.

2) Module name, indicating the function or task to be performed.

3) Invocation command, the meta-instruction that triggers the task module.

4) Region (optional), which specifies the fine-grained visual features required by certain tasks, such as video tracking or visual editing, where the back-end module needs this information. For regions, the LLM outputs a bounding box described by coordinates, based on its pixel-level understanding.
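Below is a hedged sketch of how such a structured response might be represented and dispatched to a back-end module. The JSON-style layout, field names, and the toy module registry are assumptions for illustration; the actual invocation text format is defined in the paper.

```python
# Sketch of the structured response format listed above plus a tiny dispatcher.
# Field names and layout are illustrative assumptions, not the paper's exact format.

import json

EXAMPLE_LLM_OUTPUT = json.dumps({
    "user_response": "Sure, I'll track the dog across the clip.",
    "module": "video_tracking",                    # which back-end module to call
    "command": "track the target across all frames",
    "region": [0.25, 0.30, 0.60, 0.85],            # optional normalized bounding box
})

def dispatch(llm_output: str, registry: dict):
    """Parse the structured output and invoke the named back-end module, if any."""
    fields = json.loads(llm_output)
    module = registry.get(fields.get("module"))
    result = module(fields["command"], fields.get("region")) if module else None
    return fields["user_response"], result

registry = {"video_tracking": lambda cmd, region: f"tracking started in {region}"}
print(dispatch(EXAMPLE_LLM_OUTPUT, registry))
```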

04. Evaluation experiments 

The researchers conducted extensive experimental evaluations of Vitron on 22 common benchmark datasets covering 12 image/video vision tasks. Vitron demonstrates strong capability across the four major groups of visual tasks (segmentation, understanding, content generation, and editing) while also offering flexible human-computer interaction. Below are some representative qualitative comparison results:

Vision segmentation

Results of image segmentation.

Fine-grained vision understanding

Results of image referring expression prediction.

Results on video QA.

Vision generation

Vision editing

Image editing results.

For more experimental details, please refer to the paper.

05. Future directions

Overall, this work demonstrates the great potential of developing a unified visual multi-modal general-purpose model, sketches a new form for the next generation of visual large models, and takes a first step in that direction. Although the Vitron system proposed by the team shows strong general-purpose capability, it still has limitations. The researchers list several directions worth exploring in the future.

System architecture

The Vitron system still uses a semi-joint, semi-agent approach of calling external tools. Although this invocation-based design makes it easy to extend and replace candidate modules, it also means that the back-end modules in this pipeline structure do not participate in joint learning with the front end and the core LLM. This limitation is unfavorable for end-to-end learning of the whole system, and it means that the performance ceiling on different visual tasks is bounded by the back-end modules. Future work should integrate the various visual task modules into a unified unit. Achieving unified understanding and output for images and video, while supporting generation and editing through a single generative paradigm, remains a challenge. Currently, one promising direction is to combine modality-persistent tokenization to improve the unification of the system across different inputs, outputs, and tasks.

User interactivity

Unlike previous models that focus on a single visual task (e.g., Stable Diffusion), Vitron is designed to facilitate deep interaction between the LLM and users, similar in spirit to OpenAI's DALL-E series and Midjourney. Achieving optimal user interactivity is one of the core goals of this work. Vitron builds on an existing language-based LLM and uses appropriate instruction tuning to achieve a degree of interactivity: the system can respond flexibly to whatever the user enters and produce the corresponding visual operation results, without requiring user input to exactly match the conditions of the back-end modules. Nevertheless, there is still much room to improve interactivity. For example, drawing inspiration from the closed-source Midjourney system, whatever decision the LLM makes at each step, the system should actively give feedback to users to ensure its actions and decisions align with the user's intent.

Modality capabilities

Currently, Vitron integrates a 7B Vicuna model, which may limit its ability to understand language, images, and video. Future work could explore a comprehensive end-to-end system, for example scaling up the model to achieve a more thorough and comprehensive understanding of vision. In addition, efforts should be made to enable the LLM to fully unify the understanding of image and video modalities.
