
DeepSeek-VL: Towards Real-World Vision-Language

Understanding

Haoyu Lu*1† , Wen Liu*1 , Bo Zhang*1‡ , Bingxuan Wang1† , Kai Dong1 , Bo Liu1† , Jingxiang Sun1† ,
Tongzheng Ren1† , Zhuoshu Li1 , Hao Yang1† , Yaofeng Sun1 , Chengqi Deng1 , Hanwei Xu1 , Zhenda Xie1 ,
Chong Ruan1
1 DeepSeek-AI

{neal, liuwen, bo}@deepseek.com


https://github.com/deepseek-ai/DeepSeek-VL

Abstract

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:
• Data Construction: We strive to ensure our data is diverse, scalable and extensively covers
real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based
content (expert knowledge, textbooks), aiming for a comprehensive representation of practical
contexts. Further, we create a use case taxonomy from real user scenarios and construct an
instruction-tuning dataset accordingly. The fine-tuning with this dataset substantially improves
the model’s user experience in practical applications.
• Model Architecture: Considering efficiency and the demands of most real-world scenarios,
DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution
images (1024 x 1024) within a fixed token budget, while maintaining a relatively low computa-
tional overhead. This design choice ensures the model’s ability to capture critical semantic and
detailed information across various visual tasks.
• Training Strategy: We posit that a proficient Vision-Language Model should, foremost,
possess strong language abilities. To ensure the preservation of LLM capabilities during
pretraining, we investigate an effective VL pretraining strategy by integrating LLM training
from the beginning and carefully managing the competitive dynamics observed between vision
and language modalities. Starting with a focus on text, we gradually adjust the ratio to facilitate
a balanced integration of both modalities.
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a
vision-language chatbot in real-world applications, achieving state-of-the-art or competitive
performance across a wide range of visual-language benchmarks at the same model size while
maintaining robust performance on language-centric benchmarks. We have made both 1.3B
and 7B models publicly accessible to foster innovations based on this foundation model.

∗ Equal contribution.
† Work done during the internship at DeepSeek-AI.
‡ Project lead.
Contents

1 Introduction
2 Data Construction
2.1 Vision-Language pretraining Data
2.2 Supervised Fine-tuning Data
3 Approach
3.1 Architecture
3.2 Training Pipelines
3.2.1 Stage 1: Training Vision-Language Adaptor
3.2.2 Stage 2: Joint Vision-Language pretraining
3.2.3 Stage 3: Supervised Fine-tuning
3.3 Hyperparameters and Infrastructures
4 Evaluation
4.1 Public Multimodal Benchmarks Evaluation
4.2 Public Language Benchmarks Evaluation
4.3 Human Evaluation
4.4 Ablation Study
5 Conclusion, Limitation, and Future Work
A Appendix
1. Introduction
The remarkable success of large language models (LLMs) (Anthropic, 2023; Google, 2023;
OpenAI, 2022, 2023a) has fueled the demand for a versatile interface that can handle multiple
modalities beyond language. In response to this growing demand, we have seen an emergence
of Large Multimodal Models (LMMs) like GPT-4V (OpenAI, 2023b) and Gemini (Team et al.,
2023), which serve as versatile assistants capable of comprehending and acting upon instructions
that span vision and language. These models exhibit considerable promise in executing complex,
diverse real-world tasks, enabling more natural and human-like interactions.
Recently, there has been a surge of open-source large multimodal models aimed at narrowing
the gap with proprietary counterparts. Substantial strides have been made, especially in
benchmark performance, yet a significant divide persists between the majority of open-source
models and state-of-the-art closed-source models (Bai et al., 2023; Bavishi et al., 2023; OpenAI,
2023b; Team et al., 2023) when it comes to real-world performance and user experience. It
remains challenging for the open-source community to develop models with robust general
multimodal capabilities for real-world applications.
The performance gap between most open-source models and proprietary models is especially pronounced in real-world scenarios, primarily due to the following reasons:

• Many open-source solutions allocate a significant proportion of computational resources to the instruction tuning phase. However, the experience of training powerful language models underscores the importance of extensive pretraining in the development of general intelligence. To imbue multimodal models with rich world knowledge, there should be an emphasis on comprehensive pretraining that leverages a broad spectrum of vision-language data.
• A common practice is to amalgamate various academic datasets during instruction tuning.
While such an approach may yield good benchmark results, it often falls short in providing
an authentic real-world usage experience.
• In terms of model architecture, prior works mostly adapt a vision transformer, typically
text-aligned, to a pre-trained language model. However, most of these models operate
on a relatively low resolution, e.g., 336×336 or 448×448. The intricacies of complex real-
world scenarios, such as optical character recognition or tiny object discernment, demand
high-resolution processing capability.
• While some models (01-ai, 2024; Lin et al., 2023a; Sun et al., 2023; Wang et al., 2023b) have
begun to exploit pretraining, they often overlook the preservation of language skills. Often,
there is a degradation of language capability after prolonged multimodal training. Since
we aim for a generalist that possesses strong capabilities in both modalities, there should
be a training strategy that preserves language capability well while developing the ability
in the new modality.

In light of these challenges, we present DeepSeek-VL, an open-source large multimodal model, which is built upon the DeepSeek language model series. We develop the model in the pursuit of adept performance in real-world scenarios, which involves extensive pretraining, careful data curation based on a use case taxonomy, model architecture design for high-resolution processing, and a training strategy that balances the multi-modalities. On top of these, we develop a training methodology that steers the model scaling from 1B to 7B. These comprehensive explorations bring a significant performance advantage in practical settings compared to other large multimodal models (LMMs) of similar size.

(Figure 1 panels: example prompt-response pairs covering a CLIP flowchart explanation (originally in Chinese), embodied kitchen-robot planning, floor-plan spatial reasoning, summarizing a paper's main contribution, translating a flowchart into Python code, describing the Minecraft website (originally in Chinese), identifying an Impressionist painting style, reasoning about the likely source of a research plot, and transcribing a formula into LaTeX.)

Figure 1 | DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.

DeepSeek-VL’s pretraining dataset is compiled from a variety of sources, including but not
limited to Common Crawl, Web Code, E-books, Educational Materials, and arXiv Articles. This
collection thoroughly encompasses real-world scenarios such as web screenshots, PDFs, OCR,
charts, and knowledge-based content (expertise, textbooks), aiming for a broad and practical
representation while remaining scalable.
While our pretraining data encompasses a wide array of world knowledge, we meticulously
curate our instruction-tuning dataset to reflect real-world usage scenarios. To achieve this, we
manually gather authentic test cases for GPT-4V and Gemini from the Internet. These cases have
been systematically organized into a comprehensive taxonomy. We use this structured taxonomy
to choose prompts for each test image, ensuring a practical and relevant instruction tuning
dataset. This taxonomy is also used to create an evaluation dataset that effectively assesses
real-world performance.
The visual module is designed to optimize the utilization of high-resolution visual inputs
while remaining within a fixed token budget to manage inference costs effectively. As such, we
employ a hybrid vision encoder, which combines a text-aligned encoder for coarse semantic
extraction at 384 × 384 resolution with a high-resolution encoder that captures detailed visual
information at 1024 × 1024 resolution. By fusing these two encoders, our hybrid approach
efficiently condenses a 1024×1024 resolution image (which suffices in most use cases) into 576
tokens. This token count strikes a balance between rich visual representation and token economy,
making it feasible for both text-image interleaving and multi-turn inference scenarios.
During the pretraining of multimodal models, a common challenge encountered is the
potential degradation of language capabilities when the training process is overly reliant on
vision-language data. Our research reveals that maintaining a significant proportion of language
data—specifically, at least 70%—is essential to preserve the integrity of language knowledge
within the model. This balance is critical for achieving a robust multimodal capability that does
not compromise language performance. Moreover, we introduce a novel “modality warm-up”
strategy. This approach carefully adjusts the ratio of modalities during training, gradually
incorporating more vision-language data. The careful tuning of the modality ratio along with
the warm-up strategy results in a balanced performance of both modalities.
When iterating on our model, we conduct experiments on a small scale before scaling to a
larger model size. However, a smaller model, e.g., a 1B model, cannot demonstrate reasonable
performance on benchmarks (Schaeffer et al., 2024) or faithfully reflect the model's performance.
We adopt two approaches to address this. First, we modify the evaluation protocol from multi-
choice to compare the perplexity of options. Also, to prevent the instruction following ability
from becoming the bottleneck, we mix a small proportion of instruction tuning data during the
pretraining phase. In this way, we can achieve reasonable performance using the 1B model and
more accurately measure the impact of each iteration during the experiment.
Through extensive evaluations of general vision and language benchmarks, the DeepSeek-VL
family showcases superior user experiences in real-world applications and achieves state-of-
the-art or competitive performance across a wide range of visual-language benchmarks at the
same model size, while maintaining robust language-centric performance. To foster innovation
and enable a wide range of applications, we have made two versions of our model, 1.3B and 7B,
publicly accessible, in the hope of facilitating the needs of varying computational capabilities.

2. Data Construction
A diverse and large dataset is the most important ingredient of visual language model training.
Our dataset can be divided into two parts: Vision-Language pretraining Data and Vision-
Language Supervised Fine-Tuning Data. VL pretraining Data is composed of visual-text data
from various sources, aimed at enhancing the model’s fundamental cross-modal understanding
capabilities; while VL Supervised Fine-Tuning Data has a relatively smaller size and aims to
teach the model to complete specific downstream tasks. By design, VL pretraining Data is
used to warm up the vision-language adaptor in training stage 1 and jointly pretrain the vision-
language model in stage 2, and VL Supervised Fine-Tuning Data is exploited in training stage 3,
i.e., vision language supervised fine-tuning.

2.1. Vision-Language pretraining Data

The pretraining dataset utilized in our study encompasses a diverse range of publicly accessible
sources, in addition to a selection of proprietary data. We provide a comprehensive overview of
the data sources employed during the joint vision and language pretraining stage in Table 1.
Such a dataset can facilitate LLM’s comprehension of the entities portrayed in the images.
Furthermore, we present a detailed breakdown of the complete dataset, which is organized
into the following categories:
Interleaved image-text data enable the models to have a better capability for in-context
learning of multi-modality inputs. We utilize three public datasets, MMC4 (Zhu et al., 2024),
Wiki (Burns et al., 2023), and Wikihow (Yang et al., 2021), along with in-house Epub textbooks.
Image caption data come from three high-quality image-text paired datasets: Capsfusion (Yu
et al., 2023a), TaiSu (Liu et al., 2022b) and Detailed Caption (echo840, 2024).
Table and chart data enable the models to learn the capability for general table and chart
image understanding. It encompasses a diverse range of public data sources, including
Chart2text (Kantharaj et al., 2022), Geo170K (Gao et al., 2023), Unichart (Masry et al., 2023),
Ureader (Ye et al., 2023), M-paper (Hu et al., 2023), ScienceQA (Lu et al., 2022b), ScreenQA (Hsiao
et al., 2022), SciGraphQA-295K (Li and Tajbakhsh, 2023), Paper2figure100k (Rodriguez et al.,
2023), Widget Captioning (Li et al., 2020), Screen2words (Wang et al., 2021), and Refexp (Mao
et al., 2016).
Web Code data empowers models with the capability to reconstruct code from graphical
interfaces or visual plots. Leveraging Websight (HuggingFaceM4, 2024) for UI Inverse Rendering,
we adopted a strategy akin to that used in MATCHA (Liu et al., 2022a) for visual plots inverse
rendering. This involved the processing of approximately 1.46 million Jupyter notebooks from
the Stack dataset (Kocetkov et al., 2023). By extracting these notebooks and collating all diagrams
along with their corresponding preceding code segments, we succeeded in curating a collection
featuring 2 million pairs of images and codes. For better data quality, we then filter this collection down to 1.1 million instances, each comprising a single image paired with at least 5 lines of code, which constitute our primary training dataset.
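As an illustration of this collation step, the sketch below mines (preceding code, rendered figure) pairs from a single .ipynb file. It is a minimal reconstruction assuming the standard notebook JSON layout, not the actual processing pipeline; the 5-line filter and quality checks would be applied downstream.

```python
import base64
import json
from pathlib import Path

def extract_code_image_pairs(notebook_path: str):
    """Collect (accumulated preceding code, rendered figure bytes) pairs from a
    Jupyter notebook. Illustrative sketch only."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    pairs, code_so_far = [], []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        code_so_far.append("".join(cell.get("source", [])))
        for out in cell.get("outputs", []):
            png_b64 = out.get("data", {}).get("image/png")
            if png_b64:
                # Pair the figure with all code seen up to and including this cell.
                pairs.append({"code": "\n".join(code_so_far),
                              "image": base64.b64decode(png_b64)})
    return pairs
```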
Document Optical Character Recognition (OCR) data facilitates the recognition of optical
characters at the document level, even in challenging real-world scenarios. To the best of
our knowledge, there is currently no publicly available large-scale dataset encompassing both
English and Chinese documents. Despite the existence of the publicly accessible small-scale
dataset Latex-OCR (Blecher, 2024), we additionally constructed a comprehensive English and Chinese document OCR dataset, described below.
Table 1 | Summary of datasets used in the joint vision and language pretraining stage.

Category Dataset Ratio


Interleaved image-text MMC4 (Zhu et al., 2024) 13.1%
Wikipedia EN & CN (Foundation)
Wikihow (Yang et al., 2021)
in-house PDF and Epub textbooks
Image caption Capsfusion (Yu et al., 2023a) 11.1%
TaiSu (Liu et al., 2022b)
Detailed Caption (echo840, 2024)
Table and chart Chart2text (Kantharaj et al., 2022) 2.1%
Geo170K (Gao et al., 2023)
Ureader (Ye et al., 2023)
Unichart (Masry et al., 2023)
M-paper (Hu et al., 2023)
ScienceQA (Lu et al., 2022b)
ScreenQA (Hsiao et al., 2022)
SciGraphQA-295K (Li and Tajbakhsh, 2023)
Paper2figure100k (Rodriguez et al., 2023)
Widget Captioning (Li et al., 2020)
Screen2words (Wang et al., 2021)
Refexp (Mao et al., 2016)
Web Code Websight (HuggingFaceM4, 2024) 0.4%
python plots scraped from GitHub notebook
Scene text OCR ArT (Chng et al., 2019) 1.2%
MLT-17 (Nayef et al., 2017)
LSVT (Sun et al., 2019)
UberText (Zhang et al., 2017)
Coco-text (Veit et al., 2016)
RCTW-17 (Shi et al., 2017)
ReCTS (Zhang et al., 2019)
TextOCR (Singh et al., 2021)
OpenVINO (Krylov et al., 2021)
HierText (Long et al., 2022)
Document OCR arXiv rendered markdown (Blecher et al., 2023) 2.1%
Text-only corpus DeepSeek-LLM 2T text corpus (DeepSeek-AI, 2024) 70.0%

Our English and Chinese document OCR dataset comprises two parts: 1) arXiv Articles: We collected source code and compiled PDFs from 1.4 million arXiv articles. Utilizing pre-processing tools from Nougat (Blecher et al., 2023), we rendered these articles into paired images and texts; 2) E-books and Educational Materials: We cleaned 860K English and 180K Chinese e-books from Anna's Archive (Anna's Archive, 2024) alongside millions of K-12 education exam questions. Subsequently, we employed HTML rendering tools (Kulkarni and Truelsen) to convert these HTML files with different templates into paired image and text formats.
Scene text OCR data augment the capability of the model to recognize and extract text from images in which the text is integrated into the environment. The dataset is composed of multiple public datasets, including ArT (Chng et al., 2019), MLT-17 (Nayef et al., 2017), LSVT (Sun et al., 2019), UberText (Zhang et al., 2017), Coco-text (Veit et al., 2016), RCTW-17 (Shi et al., 2017), ReCTS (Zhang et al., 2019), TextOCR (Singh et al., 2021), OpenVINO (Krylov et al., 2021) and HierText (Long et al., 2022).
Text-only corpus serves to maintain proficiency in language-centric tasks. In this study, we employ the same text corpus as DeepSeek-LLM (DeepSeek-AI, 2024).

Table 2 | Summary of data used in our joint vision and language supervised fine-tuning stage.

Class Dataset Ratio

In-house Data SFT data based on taxonomy (Figure 3) 10.5%
General Multi-modality ShareGPT4V (Chen et al., 2023) 35.5%
LAION-GPTV (LAION, 2023)
LVIS-Instruct4V (Wang et al., 2023a)
textOCR-GPT4V (Carter, 2024)
LLaVA1.6-GPT4V (Liu et al., 2024a)
IconQA (Lu et al., 2021)
Table and chart Ureader (Ye et al., 2023) 4.1%
Geo170K (Gao et al., 2023)
ScienceQA (Lu et al., 2022b)
Web Code Screen-to-code (Abi, 2024) 2.0%
ScreenQA (Hsiao et al., 2022)
Text-only SFT DeepSeek-LLM (DeepSeek-AI, 2024) 47.9%

2.2. Supervised Fine-tuning Data

The supervised fine-tuning datasets utilized in our study encompass a diverse range of multi-
modality and language data sources, including well-known open-source shared gpt4v datasets
such as ShareGPT4V (Chen et al., 2023), LAION-GPTV (LAION, 2023), LVIS-Instruct4V (Wang
et al., 2023a), textOCR-GPT4V (Carter, 2024), LLaVA1.6-GPT4V (Liu et al., 2024a) and IconQA (Lu
et al., 2021). Additionally, we incorporate partial table and chart data extracted from pretraining
datasets such as Ureader (Ye et al., 2023), ScreenQA (Hsiao et al., 2022), Geo170K (Gao et al.,
2023), and ScienceQA (Lu et al., 2022b). Moreover, we integrate the UI Code dataset obtained
from Screen-to-code (Abi, 2024) tasks. To enhance the quality of our multi-modality SFT data,
we have also curated a portion of high-quality in-house multi-modality SFT data, some of
which are in the Chinese language. Our in-house instruction-tuning dataset is meticulously
designed to reflect real-world usage scenarios and cover a wide range of tasks. We start by
collecting a diverse set of authentic test cases for GPT-4V and Gemini from various online sources.
These test cases are then carefully analyzed and organized into a comprehensive taxonomy,
which encompasses multiple categories, such as recognition, conversion, analysis, reasoning,
evaluation, and safety, as detailed in Table 3. This structured taxonomy serves as a guideline
for selecting representative prompts for each test image, ensuring that our instruction-tuning
dataset is both practical and relevant to real-world applications. Moreover, this taxonomy is
also employed to construct a balanced and comprehensive evaluation dataset, which allows us
to effectively assess the model’s performance across different tasks and categories. By following
this systematic approach, we ensure that the categories covered by our in-house multi-modality
SFT data are well-aligned with the taxonomy and representative of real-world usage scenarios.

Recognition. This part of the use cases mainly examines the understanding and description ability of large models for image content, which does not require high knowledge reserve and reasoning ability of the model, and some tasks can be completed using traditional machine learning models.
• Global Description: Theme Description, Event/Behavior Description, Location/Scene Description, Emotion/Mood Description, Style Recognition, Food Recognition, Others
• Local Description: Pointing Description, Position Description, Person Recognition, Object Attribute Description, Logo Recognition, Counting, Currency Recognition
• OCR and Transcription: Printed Text Transcription, Handwritten Text Transcription, Specified Format Transcription, Specified Language Transcription

Conversion. This type of use case requires the model to be able to describe and recognize image content, and use specific knowledge (e.g., code knowledge, prompt engineering knowledge) to convert image content into another form.
• Image to Code: UI to Code, Chart to Code, Photo to SVG/p64 Encoding, Formula to Code, Flowchart to Code
• Image to Text: Image to Prompt, Text Summary, Image-based Creation, Text Interpretation

Analysis. This type of use case requires the model to use specific knowledge and logical ability to make reasonable analysis and understanding based on image content, and describe the image according to instructions.
• Data Chart Analysis: Graph Interpretation, Table Interpretation
• Professional Chart Analysis: Circuit Diagram, Flowchart, Map, Music Score, Financial Chart, Floor Plan, Others
• Professional Image Analysis: Sensor Image, Biological and Medical Image, Voiceprint Image, Point Cloud Image
• Encyclopedia Knowledge Analysis: Art and Culture Knowledge, Natural Environment Knowledge, Food/Clothing/Housing/Transportation Related Knowledge, Entertainment Related Knowledge, Historical Knowledge

Commonsense Reasoning. This type of use case mainly tests the model's understanding and mastery of common sense in life, which requires reasoning based on the interpretation and analysis of image content combined with common sense.
• Relationship Reasoning: Interpersonal Relationship, Spatial Relationship, Size Relationship, Species Relationship
• Function Reasoning: Hardware Function Reasoning, Software Function Reasoning
• Environment Reasoning: Environment State Analysis, Environment-based Behavior Reasoning, Embodied Intelligence
• Anomaly Reasoning: Identifying Anomalies in Images, Defect Detection, Accident Judgment
• Humor Reasoning: -
• Other Commonsense Reasoning: State Reasoning, Cause Reasoning, Attribute Comparison, Optical Illusion, Fun Games, Intention Interpretation, Behavior Prediction

Logical Reasoning. This type of use case requires the model to combine the understanding of images, comprehensively use domain knowledge and logical reasoning ability to complete corresponding tasks.
• Mathematical Reasoning: Algebra and Operation, Plane Geometry, Solid Geometry
• Other Logical Reasoning: Physics, Chemistry, Biology, Code, IQ Questions

Evaluation. This type of use case requires the model to evaluate the image content according to specific criteria.
• (no secondary category): Reality Evaluation, Similarity Evaluation, Aesthetic Evaluation, Open-ended Evaluation, Improvement Suggestions

Multi-graph. This type of use case examines the model's ability to analyze and understand multiple images.
• Temporal Sequence Understanding: Event Prediction, Image Sequencing, Behavior Analysis
• Multi-graph Comparison: Attribute Comparison, Image-Text Matching, Finding Associations, Spotting Differences, Image Discrimination

Safety. This type of use case examines the model's performance in terms of safety.
• (no secondary category): Suggestive Questioning, Counterfactual Questioning, Prompt Injection
Table 3 | Our taxonomy for the in-house SFT data. The categories covered by our high-quality
in-house multi-modality SFT data are comprehensively represented in this taxonomy.

Furthermore, we include the text-only SFT data employed in DeepSeek-LLM (DeepSeek-AI,
2024) as part of our joint vision and language SFT data.

3. Approach

3.1. Architecture

Our system contains three modules: a hybrid vision encoder, a vision adaptor, and a language
model. We introduce each part in this section.
Hybrid Vision Encoder. We employ SigLIP as the vision encoder to extract high-level
semantic feature representations from visual inputs. However, we observe that a single SigLIP
encoder struggles to address all real-world questions comprehensively. Vision encoders in the
CLIP family, including SigLIP, are primarily designed for semantic visual representations but
are challenged by ambiguous encoding, resulting in visually distinct images being encoded as
similar due to what is referred to as "CLIP-blind pairs" (Tong et al., 2024). Meanwhile, the CLIP family of models is limited by its relatively low-resolution inputs (e.g., 224 x 224, 336 x 336, 384 x 384, 512 x 512), which hinders its ability to handle tasks requiring more detailed low-level features, such as dense OCR and visual grounding.
To address these limitations, recent studies (Lin et al., 2023b; Tong et al., 2024; Wei et al.,
2023) have advocated for the integration of additional vision-only self-supervised encoders, to
enhance the visual grounding capabilities of multi-modality models. Building upon previous
motivations, we additionally utilize a vision-only encoder based on the SAM-B (Kirillov et al.,
2023), a pre-trained ViTDet (Li et al., 2022) image encoder to process low-level features, which
accepts high-resolution 1024 x 1024 image inputs. In addition to the SAM-B encoder, we
retain the SigLIP-L vision encoder with low-resolution 384 x 384 image inputs. Consequently,
our hybrid vision encoder combines the SAM-B and SigLIP-L encoders, efficiently encoding
high-resolution 1024 x 1024 images while preserving both semantic and detailed information.
Specifically, a high-resolution SAM-B vision encoder first resizes the image into 1024 x 1024 and
results in a 64 x 64 x 256 feature map.
In the case of the high-resolution feature map of size 64 x 64 x 256 generated by SAM-B, the
VL Adaptor initially interpolates it into a size of 96 x 96 x 256. Subsequently, it employs two
convolutional layers with a stride of 2, producing a feature map of 24 x 24 x 1024, and reshapes
it to 576 x 1024. Alongside this, the low-resolution feature map of size 576 x 1024 generated by
SigLIP-L is concatenated with the high-resolution features, resulting in 576 visual tokens with
2048 dimensions. These visual tokens possess a substantial capacity for enhancing high-level
semantic visual recognition and low-level visual grounding tasks. Then they undergo GeLU
activation and are directed through an embedding layer to establish a connection with the
language model.
Vision-Language Adaptor. We employ a two-layer hybrid MLP to bridge the vision encoder
and the LLM. Initially, distinct single-layer MLPs are used to process high-resolution features
and low-resolution features separately. Subsequently, these features are concatenated along their
dimensions and then transformed into the LLM’s input space through another layer of MLP.
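The fusion just described can be summarized in a short PyTorch sketch. This is an illustrative reconstruction based on the shapes given in the text (interpolation to 96 x 96, two stride-2 convolutions, reshaping to 576 tokens, concatenation with SigLIP-L features, and projection into the LLM space); kernel sizes, layer widths, and module names are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionAdaptor(nn.Module):
    """Fuse a high-resolution SAM-B feature map (B, 256, 64, 64) with
    low-resolution SigLIP-L tokens (B, 576, 1024) into 576 visual tokens
    projected into the LLM embedding space. Sketch only."""

    def __init__(self, llm_dim: int = 2048):
        super().__init__()
        # Two stride-2 convolutions downsample 96x96 -> 24x24 and expand 256 -> 1024 channels.
        self.downsample = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),
        )
        # Distinct single-layer MLPs per branch, then a shared projection (two-layer hybrid MLP).
        self.high_res_mlp = nn.Linear(1024, 1024)
        self.low_res_mlp = nn.Linear(1024, 1024)
        self.to_llm = nn.Sequential(nn.GELU(), nn.Linear(2048, llm_dim))

    def forward(self, sam_feat: torch.Tensor, siglip_feat: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(sam_feat, size=(96, 96), mode="bilinear", align_corners=False)
        x = self.downsample(x)                    # (B, 1024, 24, 24)
        x = x.flatten(2).transpose(1, 2)          # (B, 576, 1024)
        high = self.high_res_mlp(x)               # (B, 576, 1024)
        low = self.low_res_mlp(siglip_feat)       # (B, 576, 1024)
        tokens = torch.cat([high, low], dim=-1)   # (B, 576, 2048)
        return self.to_llm(tokens)                # (B, 576, llm_dim)
```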
Language Model. Our language model is built upon DeepSeek LLM (DeepSeek-AI, 2024)
whose micro design largely follows the design of LLaMA (Touvron et al., 2023a,b), adopt-
ing a Pre-Norm structure with RMSNorm (Zhang and Sennrich, 2019) function and using
SwiGLU (Shazeer, 2020) as the activation function for the Feed-Forward Network (FFN), with
an intermediate layer dimension of 8/3 · d_model. It also incorporates Rotary Embedding (Su et al., 2024) for positional encoding and uses the same tokenizer as DeepSeek-LLM. We introduce a family of DeepSeek-VL models. Given our objective of conducting joint pretraining with multimodal and language data, we select an intermediate checkpoint from DeepSeek's pretrained models to continue pretraining.
Specifically, the DeepSeek-VL-1B model is constructed based on the DeepSeek-LLM-1B model, which underwent training with an approximate corpus of 500 billion text tokens, and the DeepSeek-VL-7B model is developed leveraging the DeepSeek-LLM-7B model trained with an estimated 2 trillion text tokens.

Prompt: Is the cyclist on the left or right side of the woman's handbag?

Response: The cyclist is on the left side of the woman's handbag. This can be determined by observing the relative positions of the objects in the image:
1. The woman is holding a handbag that is positioned to her right side, as we view the image.
2. There is a cyclist depicted on the road behind the woman and to her left when looking at the image.
3. The cyclist appears to be moving away from the viewpoint, given their direction and the motion blur effect on their wheels.
4. The cyclist is further down the road compared to the woman, which supports the conclusion that they are on the left side of the handbag based on our perspective.
These visual clues confirm that the cyclist is indeed on the left side of the woman's handbag.

Figure 2 | Visualization results. DeepSeek-VL is capable of capturing tiny objects and giving organized explanations.
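For reference, the SwiGLU feed-forward block described in the Language Model paragraph, with a hidden width of roughly 8/3 · d_model, can be sketched as follows; the rounding of the hidden width and the attribute names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """LLaMA-style SwiGLU feed-forward block:
    FFN(x) = W_down(SiLU(W_gate x) * W_up x), hidden width ~ 8/3 * d_model."""

    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(8 * d_model / 3)   # in practice rounded to a hardware-friendly multiple
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```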

(Figure 3 diagram: all three stages share the hybrid vision encoder (SAM-B + SigLIP-L), the vision-language adaptor, and the DeepSeek LLM. Stage 1 trains on image-text pairs, Stage 2 on interleaved VL data plus pure language sequences, and Stage 3 on VL chat data plus pure language chat data.)

Figure 3 | Our training pipelines consist of three stages. Stage 1 involves training the Vision-
Language (VL) adaptor while keeping the hybrid vision encoder and language model fixed.
Stage 2 is the crucial part of the joint vision and language pretraining, where both VL adaptor
and language model are trainable. Stage 3 is the supervised fine-tuning phase, during which the
low-resolution vision encoder SigLIP-L, VL adaptor, and language model will be trained.

3.2. Training Pipelines

We train our DeepSeek-VL in three consecutive stages as shown in Figure 3: vision-language adaptor warmup, joint vision-language pretraining, and supervised fine-tuning. We currently focus on visual understanding capabilities and only calculate the next token prediction loss on the language part.

3.2.1. Stage 1: Training Vision-Language Adaptor

The primary objective of this stage is to establish a conceptual link between visual and linguistic
elements within the embedding space, thereby facilitating the comprehensive understanding
of depicted entities in the images by the Large Language Model (LLM). Consistent with prior
research conducted by LLaVA (Liu et al., 2024b) and Instruct-BLIP (Dai et al., 2023), we adopt a
similar approach in which both the vision encoder and the LLM remain frozen during this stage,
while only the parameters within the vision-language adaptor remain trainable. We utilize a
dataset comprising 1.25 million image-text paired captions obtained from ShareGPT4V, along
with 2.5 million Document OCR rendering pairs to train the VL adaptor.
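A minimal sketch of this stage-1 setup, assuming a model object with hypothetical vision_encoder, adaptor, and language_model attributes: everything is frozen except the adaptor.

```python
def configure_stage1(model):
    """Stage 1: freeze the hybrid vision encoder and the LLM; train only the
    vision-language adaptor. Attribute names are illustrative."""
    for module in (model.vision_encoder, model.language_model):
        for p in module.parameters():
            p.requires_grad = False
    for p in model.adaptor.parameters():
        p.requires_grad = True
    # Return only the trainable parameters for the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```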
Nevertheless, compared to Large Language Models (LLMs), vision-language adaptors (e.g.,
a 2-layer MLP) have a significantly smaller parameter capacity. This limitation in model capacity
restricts the capabilities that can be learned during this stage. A natural question arises: Can the
law of data scaling be effective at this stage? To address this question, we conducted a simple
experiment in Table 8. The results demonstrate that expanding the data scale at this stage does
not provide benefits and may even lead to inferior performance. Consequently, we proceed to
unfreeze the Large Language Model (LLM) and investigate efficient vision-language pretraining
approaches during stage 2.

3.2.2. Stage 2: Joint Vision-Language pretraining

In this stage, we explore effective pretraining strategies which can be considered as an additional
stage to enable Large Language Models (LLMs) to comprehend multimodal inputs. We keep the
vision encoder frozen and optimize the language model and VL adaptor.
Initially, we attempt to directly train the LLM with multimodal data. However, we find that while the metrics for multimodal performance incrementally improve, there is a stark and severe decline in language metrics, as illustrated in Figure 4 (Multimodal:Language=100%:0%).
This underscores the inherent challenge in directly conducting multimodal pretraining on the
foundation of an LLM, revealing a critical trade-off between enhancing multimodal abilities and
preserving linguistic proficiency.
We hypothesize that the observed phenomenon stems from two primary factors: firstly, the
majority of multimodal corpora are overly simplistic and exhibit a significant divergence from
the complexity and distribution of linguistic data. Secondly, there appears to be a competitive
dynamic between multimodal and linguistic modalities, leading to what can be described as
catastrophic forgetting of language capabilities within the LLM.
Joint Language-multimodal Training To address this challenge, we devise a straightforward
yet effective joint language-multimodal training strategy. During training, we not only engage
in multimodal data training but also incorporate a large proportion of language data into
the training. This approach aims to balance the training focus, mitigating the adverse effects
observed. We conduct experiments on the DeepSeek-VL 1B model in Figure 4 to explore the
impact of varying the modality mixing ratios.
The analysis of the graph yields several key conclusions: (1). Integrating language data
significantly alleviates the decline in language capabilities, demonstrating a substantial improve-
ment in the model’s linguistic performance. (2). The inclusion of language data does not lead to
a significant loss in multimodal performance, indicating that the model retains its multimodal
processing abilities. (3). The performance of different modalities is strongly correlated with
their respective proportions in the training dataset, substantiating the competitive relationship
between the two modalities. Ultimately, we opt for a training ratio of language to multimodal
data of roughly 7:3 for our final model. This ratio enables the model to maintain its language
capabilities while simultaneously achieving better pretraining on multimodal data, effectively
balancing the development of both language and multimodal proficiencies.
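The modality warm-up and the final roughly 7:3 language-to-multimodal ratio can be emulated with a simple sampling schedule. The linear ramp, step counts, and sampler below are illustrative assumptions; the paper does not specify the exact schedule.

```python
import random

def multimodal_fraction(step: int, warmup_steps: int = 2000, final_frac: float = 0.3) -> float:
    """Illustrative 'modality warm-up': linearly ramp the share of vision-language
    samples from 0 to its final value (about 30%), then hold it constant."""
    if step < warmup_steps:
        return final_frac * step / warmup_steps
    return final_frac

def sample_batch(step, language_data, multimodal_data, batch_size=1024):
    """Mix a batch of samples according to the current modality ratio."""
    frac = multimodal_fraction(step)
    batch = []
    for _ in range(batch_size):
        source = multimodal_data if random.random() < frac else language_data
        batch.append(random.choice(source))
    return batch
```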
Scaling Vision-Language Pretraining Nevertheless, the pretraining stage of the model incurs a
substantial computational cost, and performing iterations on the 7B model requires an excessive
amount of computing power and time. One suitable strategy involves conducting experiments
on a smaller model, specifically the 1.3B model, and subsequently scaling it up to the 7B model.
Fortunately, we have observed that a significant portion of the outcomes obtained from the 1.3B
models can be effectively transferred to the 7B model through the utilization of SFT (e.g., the
encoder design). However, during the stage 2 training phase, we have encountered considerable
fluctuations in the generative metrics of the 1.3B model, rendering it challenging to supervise
the training process effectively. This has been discussed in Schaeffer et al. (2024): "sharp
and unpredictable changes might be induced by the researcher’s choice of measurement, even
though the model family’s per-token error rate changes smoothly, continuously and predictably
with increasing scale." Subsequent experiments have led us to identify the root causes of this
issue: the limited capacity of the 1.3B model and the absence of SFT data within the training
dataset, both of which hinder the model’s ability to accurately follow instructions. Even when
the model possesses knowledge of the correct options, it struggles to generate them precisely.

(Figure 4 panels: SeedBench, MMBench, MMBench_CN, MMLU, HellaSwag, and Pile-test curves plotted against training step for multimodal:language mixing ratios of 10%:90%, 25%:75%, 60%:40%, 75%:25%, and 100%:0%; the y-axes report multi-choice PPL accuracy, language-modeling accuracy, or perplexity.)

Figure 4 | Comparative performance results for different modality fusion ratios in training stage 2. An excessively large proportion of multimodal data (multimodal:language=100%:0%) leads to significant forgetting of language capabilities in LLMs. A suitable ratio (multimodal:language=70%:30%) can effectively mitigate the issue of language forgetting while simultaneously enhancing the model's multimodal abilities.

To mitigate these challenges, we adopt a dual-pronged approach. Firstly, we employ the Multi-choice PPL methodology to monitor the model's progress. This involves inputting not only the prompt and image into the network but also all the answers associated with the question. Subsequently, we calculate the PPL for each answer position (e.g., A, B, C, D) and select the option deemed correct by the model as the final answer. Secondly, we introduce SFT data into the training dataset at a minimal proportion, allowing the model to acquire some proficiency in following instructions. The combination of these two approaches ensures the maintenance of stable training metrics for the 1.3B model and brings better performance after stage 3.
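A text-only sketch of the Multi-choice PPL protocol (image inputs and the actual prompt template are omitted, and the checkpoint name in the usage comment is hypothetical): each candidate answer is scored by the perplexity of its tokens given the prompt, and the lowest-perplexity option is taken as the prediction.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def multi_choice_ppl(model, tokenizer, prompt: str, options: list[str]) -> int:
    """Return the index of the option with the lowest perplexity given the prompt."""
    ppls = []
    for option in options:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # Next-token log-probabilities for every position except the last.
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        target = full_ids[:, 1:]
        token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        # Keep only the predictions corresponding to the answer tokens.
        answer_start = prompt_ids.shape[1]
        answer_lp = token_lp[:, answer_start - 1:]
        ppls.append(torch.exp(-answer_lp.mean()).item())
    return int(min(range(len(ppls)), key=lambda i: ppls[i]))

# Usage (hypothetical checkpoint name):
# model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-vl-1.3b-base")
# tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-vl-1.3b-base")
# best = multi_choice_ppl(model, tokenizer, "Question: ...\nAnswer: ", ["A", "B", "C", "D"])
```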

3.2.3. Stage 3: Supervised Fine-tuning

In this phase, we finetune the pretrained DeepSeek-VL model with instruction-based fine-tuning
to bolster its ability to follow instructions and engage in dialogue, culminating in the creation of
the interactive DeepSeek-VL-Chat model. We optimize the language model, VL adaptor, and hybrid vision encoder with the vision-language SFT data shown in Table 2; SAM-B remains frozen due to limited GPU memory. We only supervise answers and special tokens and mask the system and user prompts. To guarantee the model's comprehensive proficiency in dialogue, we utilize a blend of multimodal data and the pure-text dialogue data used in DeepSeek-LLM. This approach ensures the model's versatility across various dialogue scenarios.

(Figure 5 panel: the prompt "Explain the code, step by step." paired with DeepSeek-VL's step-by-step explanation of a Python longest-palindromic-substring function.)

Figure 5 | Visualization results. DeepSeek-VL can understand Python code and provide detailed and organized explanations.
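The answer-only supervision described above amounts to masking prompt positions in the label sequence. The sketch below assumes a hypothetical answer_spans list of token-index ranges produced by the chat template; -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_sft_labels(token_ids: list[int], answer_spans: list[tuple[int, int]]) -> list[int]:
    """Copy answer tokens into the label sequence and mask everything else
    (system and user prompts) so only answers and their special tokens are supervised."""
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in answer_spans:        # half-open [start, end) token ranges
        labels[start:end] = token_ids[start:end]
    return labels
```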

3.3. Hyperparameters and Infrastructures

The detailed hyperparameters of all stages are illustrated in Table 4. We train and evaluate our
DeepSeek-VL with HAI-LLM (High-flyer, 2023), a lightweight and efficient distributed training
framework. Since we use visual encoders to convert images into embedding vectors and then
treat image embeddings and text embeddings uniformly, we can easily adapt pipeline paral-
lelism to VL model training: all we need to do is to view visual encoders and text embedding as
a single module and take it as the first layer of the resulting model. This first layer has a complicated structure and precludes standard tensor parallelism techniques, but fortunately it requires relatively little computation compared to the standard transformer blocks above it. We therefore simply recompute the visual encoder forward pass on all tensor-parallel ranks. The
existence of visual encoders also leads to non-uniform execution time across model layers, so
we re-divide model layers between pipeline parallelism ranks to achieve better load balance and
throughput. The upper layers of DeepSeek-VL are exactly the same as those in DeepSeek-LLM.
With such minor modification, we can now perform canonical 3D parallelism techniques as
in Megatron (Korthikanti et al., 2023; Narayanan et al., 2021; Shoeybi et al., 2019) and overlap
computation and communication as in DeepSeek-LLM (DeepSeek-AI, 2024). DeepSeek-VL-
7B consumed 5 days on a cluster of 64 nodes, each comprising 8 Nvidia A100 GPUs, while
DeepSeek-VL-1B consumed 7 days on a setup involving 16 nodes.
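Conceptually, the "first layer" described above can be pictured as a single module wrapping the vision encoders, the adaptor, and the text embedding, so that the remaining layers form a uniform transformer stack. The sketch below is illustrative (module and argument names are assumptions) and omits the HAI-LLM parallelism machinery itself.

```python
import torch
import torch.nn as nn

class VisionTextFirstLayer(nn.Module):
    """Treat the hybrid vision encoder, the VL adaptor, and the text embedding
    as one 'first layer'; the vision forward pass is simply recomputed on every
    tensor-parallel rank. Sketch only."""

    def __init__(self, vision_encoder: nn.Module, adaptor: nn.Module,
                 text_embedding: nn.Embedding):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.adaptor = adaptor
        self.text_embedding = text_embedding

    def forward(self, images: torch.Tensor, input_ids: torch.Tensor,
                image_token_mask: torch.Tensor) -> torch.Tensor:
        emb = self.text_embedding(input_ids)              # (B, L, D)
        vis = self.adaptor(self.vision_encoder(images))   # (B, 576, D)
        # Place visual tokens at the reserved <image> positions, assuming each
        # sample has exactly 576 such positions marked True in the boolean mask.
        emb = emb.masked_scatter(image_token_mask.unsqueeze(-1), vis)
        return emb
```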

DeepSeek-VL 1B DeepSeek-VL-7B
Vision Encoder SigLIP SigLIP+SAM
Hyperparameters Stage 1 Stage 2 Stage 3 Stage 1 Stage 2 Stage 3
Learning rate 1.0 × 10^-3 3 × 10^-5 2.0 × 10^-5 1.0 × 10^-3 4.2 × 10^-5 2.0 × 10^-5
LR scheduler Cosine Step Cosine Cosine Step Cosine
Weight decay 0.0 0.0 0.0 0.0 0.0 0.0
Gradient clip 1.0 1.0 1.0 1.0 1.0 1.0
Optimizer AdamW( 𝛽1 = 0.9, 𝛽2 = 0.95) AdamW( 𝛽1 = 0.9, 𝛽2 = 0.95)
Warm-up steps 128 2000 256 128 2000 256
Training steps 15000 96000 10000 15000 42000 10000
Batch size 256 1024 256 256 2304 256
Sequence length 512 4096 4096 512 4096 4096
Sequence packing × ✓ × × ✓ ×
Pipeline parallelism × × × × ✓ ✓

Table 4 | Detailed hyperparameters of our DeepSeek-VL.

4. Evaluation

4.1. Public Multimodal Benchmarks Evaluation

We evaluate our models on a series of public benchmarks:
• Multimodal comprehensive understanding datasets: MMMU (Yue et al., 2023), CMMMU (Zhang et al., 2024), MMBench (Liu et al., 2023a), MMBench-CN (Liu et al., 2023a), SeedBench (Li et al., 2023a) and MMVet (Yu et al., 2023b). We compare DeepSeek-VL with competitors on MMB/MMC-dev as the current official test download link is no longer active.
• Chart/table understanding datasets: OCRBench (Liu et al., 2023b).
• Hallucination datasets: POPE (Li et al., 2023b).
• Scientific problem datasets: ScienceQA (Lu et al., 2022a) and MathVista (Lu et al., 2023).
We apply generation-based evaluation with greedy decoding. The generation-based evaluation here refers to letting the model generate free text and parsing results from the generated text.
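A sketch of this generation-based protocol using the Hugging Face generate API with greedy decoding and a naive answer parser; the regex, prompt format, and omission of image inputs are illustrative simplifications, as real benchmark parsers are benchmark-specific.

```python
import re
import torch

def generate_and_parse(model, tokenizer, prompt: str, max_new_tokens: int = 128) -> str:
    """Greedy-decode a free-text answer and extract a short final answer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,   # greedy decoding
        )
    # Strip the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Naive parsing: take the last number or option letter mentioned.
    matches = re.findall(r"[-+]?\d*\.?\d+|\b[A-D]\b", text)
    return matches[-1] if matches else text.strip()
```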
The comparative results, as illustrated in Table 5, show that DeepSeek-VL-7B surpasses most
open-source models of similar size across a wide range of benchmarks.
DeepSeek-VL outperforms open-source models of similar size in benchmarks such as MMB,
MMC, and SEEDbench, even approaching proprietary models (DeepSeek-VL vs. GPT-4V = 70.4
vs. 71.6 on SEEDBench), demonstrating its powerful natural image comprehension capability. The
model also surpasses all open-source models in mathematical logic, but still lags significantly
behind proprietary models like GPT-4V (36.1 vs. 47.8 on MathVista). This difference could be
attributed to the variance in base model sizes.
Furthermore, as shown in Table 6, DeepSeek-VL-1.3B significantly outperforms models
of comparable size. It demonstrates superior performance compared to leading open-source
models in the MMB benchmark test, while utilizing only close to half the parameters (1.3B vs.
2.7B), indicating its robust natural image comprehension capability. DeepSeek-VL-1.3B even
achieves comparable results to 7B open-source models on MathVista, further validating the
powerful logical understanding capabilities of the DeepSeek-VL family.

Model LLM MMMU CMMMU MMB MMC SEED OCRB POPE MathV MMVet
Close-source LMMs:
Gemini Pro Unk 48.9 - 75.2 74.0 70.7 659 - 45.2 59.2
GPT-4V Unk 56.8 42.5 75.0 74.7 71.6 659 - 47.8 49.9
Qwen-VL-Plus Unk 45.2 39.5 66.2 69.6 72.7 - - 43.3 55.7
Qwen-VL-MAX Unk 51.4 - 78.1 76.4 72.7 - - 51.0 61.8
Open-source 13B LMMs:
LLaVA-1.5 13B 36.4 - 68.2 61.9 68.2 331 85.9 26.4 38.3
VILA 13B - - 70.3 64.3 - - 84.2 - 38.8
LLaVA-Next 13B 36.2 - 70.0 64.4 71.9 - 86.7 35.3 48.4
Open-source 7B LMMs:
EMU2-Chat 7B 36.3 23.8 63.6 45.9 68.9 - - 30.0 31.0
Qwen-VL-Chat 7B 37.0 - 60.6 56.7 64.8 - - 33.8 47.3
CogVLM 7B 37.3 24.8 63.7 53.8 68.8 - - 34.7 54.5
LLaVA-Next 7B 35.8 - 67.4 60.0 70.2 - 86.5 34.6 43.9
Yi-VL 6B 37.8 35.8 68.2 68.9 67.6 - - 28.0 31.1
DeepSeek-VL (ours) 7B 36.6 37.9 73.2 72.8 70.4 456 88.1 36.1 41.5

Table 5 | The comparison between different multi-modal models. The top half are proprietary
models, while the bottom are open-source models.

Model LLM MMMU CMMMU MMB MMC SEED OCRB POPE MathV MMVet
Tiny Model:
MobileVLM 1.4B - - 53.2 - - - 84.5 - -
MobileVLM 2.7B - - 59.6 - - - 84.9 - -
MobileVLM V2 1.4B - - 59.6 - - - 84.3 - -
MobileVLM V2 2.7B - - 63.2 - - - 84.7 - -
LLaVA-Phi 2.7B - - 59.5 - - - 85.0 - 28.9
DeepSeek-VL (ours) 1.3B 32.2 27.4 64.6 61.3 66.7 409 87.6 31.1 34.8

Table 6 | The comparison between tiny multi-modal models.

4.2. Public Language Benchmarks Evaluation

We evaluate our models on the following public language benchmarks:
• Multi-subject multiple-choice datasets including MMLU (Hendrycks et al., 2020).
• Language understanding and reasoning datasets including HellaSwag (Zellers et al., 2019).
• Language modeling datasets including Pile (Gao et al., 2020).
• Math datasets including GSM8K (Cobbe et al., 2021).
• Code datasets including MBPP (Austin et al., 2021).
• Standardized exams including AGIEval (Zhong et al., 2023).
We apply perplexity-based evaluation to datasets that require answers to be chosen from several options. These datasets include HellaSwag and MMLU. The perplexity-based evaluation here refers to calculating the perplexity of each option and selecting the lowest one as the model prediction.
Version     DeepSeek-VL 1B Chat   DeepSeek-VL 7B Chat   DeepSeek-LLM 7B Chat
Encoder     SigLIP                SigLIP+SAM            None
HellaSwag   56.0                  68.4                  68.5
MMLU        32.5                  52.4                  49.4
GSM8K       18.0                  55.0                  63.0
MBPP        10.0                  35.2                  35.2
AGIEval     14.0                  27.8                  19.3

Table 7 | The performance on language benchmarks.

Perplexity-based evaluation helps to distinguish subtle probability differences between model predictions and avoids the discontinuity of exact-match-style evaluation. We apply generation-based evaluation with greedy decoding for GSM8K and AGIEval. The generation-based evaluation here refers to letting the model generate free text and parsing the result from the generated text. We apply language-modeling-based evaluation for Pile-test, which means calculating the bits-per-byte on the test corpus. The results are illustrated in Table 7.
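A minimal sketch of the perplexity-based protocol for a causal LM is given below; the Hugging Face-style interface and the exact way the option is scored (length-normalized negative log-likelihood over only the option tokens) are assumptions rather than the precise recipe used here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def option_perplexity(model, tokenizer, question, option):
    """Perplexity of `option` conditioned on `question` (assumes a causal LM)."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
    o_ids = tokenizer(option, return_tensors="pt",
                      add_special_tokens=False).input_ids.to(model.device)
    input_ids = torch.cat([q_ids, o_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position t-1 predict token t; score only the option tokens.
    option_logits = logits[0, q_ids.shape[1] - 1 : -1]
    nll = F.cross_entropy(option_logits, o_ids[0], reduction="mean")
    return nll.exp().item()

def predict(model, tokenizer, question, options):
    """Pick the option whose conditional perplexity is lowest."""
    ppls = [option_perplexity(model, tokenizer, question, o) for o in options]
    return min(range(len(options)), key=lambda i: ppls[i])
```

Bits-per-byte on Pile-test can be computed in a similar spirit by summing token negative log-likelihoods over the corpus and dividing by the corpus size in bytes (and by ln 2).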
It can be observed that across the majority of language benchmarks, DeepSeek-VL performs comparably to, or even surpasses, DeepSeek-7B. For instance, it scores 68.4 vs. 68.5 on HellaSwag, a benchmark of general language ability. DeepSeek-VL outperforms DeepSeek-7B on metrics such as MMLU and AGIEval, indicating that multimodal training may even aid language tasks. Nevertheless, DeepSeek-VL-7B shows a certain degree of decline in mathematics (GSM8K), which suggests that, despite efforts to promote harmony between the vision and language modalities, a competitive relationship between them still exists. This could be attributed to the limited model capacity (7B), and larger models might alleviate this issue significantly. Overall, DeepSeek-VL strives to minimize declines in language capability while addressing these challenges.

4.3. Human Evaluation

To further explore the capabilities of our DeepSeek-VL, we independently construct a dataset for manual evaluation. This dataset comprises 100 questions, divided into seven categories, each encompassing specific tasks. These categories and tasks are the same as our taxonomy for the in-house SFT data, as shown in Table 3. This approach ensures that the tasks we test are universal and encompass the majority of use cases for multimodal models.
Moreover, based on the categories and tasks described in existing reports, we collect similar image materials and develop prompts. The sources for these image materials include royalty-free image communities and photographs taken by the researchers. This methodical collection and prompt-formulation process ensures our dataset is both comprehensive and representative of real-world multimodal model applications.
We compare our DeepSeek-VL-7B with InternLM-XComposer2-VL, CogVLM, and GPT-4V, as shown in Figure 6 (we also provide visualization results in Appendix A). GPT-4V demonstrates exceptional performance across most dimensions. All open-source models still lag far behind GPT-4V in logical reasoning, highlighting the necessity of scaling up the size of the underlying Large Language Models (LLMs). DeepSeek-VL-7B achieves the best overall performance among the open-source models, reaching outcomes close to GPT-4V in Recognition, Conversion, and Commonsense Reasoning.

[Figure 6 chart: human-evaluation scores for InternLM-XComposer2-VL, CogVLM-17B, DeepSeek-VL-7B, and GPT-4V across Total Score, Recognition, Conversion, Analysis, Commonsense, Logical, and Multi-Images Evaluation.]

Figure 6 | Human evaluation results on InternLM-XComposer2-VL (Dong et al., 2024), CogVLM (Wang et al., 2023b), DeepSeek-VL and GPT-4V (OpenAI, 2023b).

Figure 7 | GPT-4V-based Evaluation Results of DeepSeek-VL vs. Other Models: The chart depicts
results from a GPT-4V-based assessment across 99 test samples, demonstrating DeepSeek-VL’s
favorable outcomes against both open-source and proprietary models.

In addition, we conduct a comparative assessment using GPT-4V to evaluate the performance of DeepSeek-VL against other models across a set of 99 test samples designed for human evaluation. Following Zheng et al. (2024), we show GPT-4V the question and the answers from two different models and ask GPT-4V to determine which one is better or declare a tie. The results indicate a preference for DeepSeek-VL's responses in the majority of cases, as GPT-4V tends to rate the quality of DeepSeek-VL's answers more favorably. As illustrated in Figure 7, DeepSeek-VL is judged to be superior in over 60% of instances when compared to open-source multimodal models, including Fuyu-8B, CogVLM-17B, and InternLM-XComposer2-VL. Moreover, in comparison with proprietary models such as GPT-4V itself, DeepSeek-VL remains highly competitive.
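The following sketch illustrates how such a GPT-4V-based pairwise comparison could be tallied; the `query_gpt4v` wrapper, the judge prompt, and the order-swapping heuristic are our assumptions rather than the exact protocol used here.

```python
import random

# Hypothetical judge prompt; the evaluation follows Zheng et al. (2024) in spirit,
# but the exact template is not published here.
JUDGE_TEMPLATE = (
    "You are shown a question about an image and two candidate answers.\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
    "Which answer is better? Reply with exactly one of: A, B, tie."
)

def judge_pair(query_gpt4v, image, question, model_answer, baseline_answer, rng=random):
    """query_gpt4v(image, prompt) -> str is an assumed wrapper around the GPT-4V API."""
    swap = rng.random() < 0.5  # randomize order to reduce position bias (our assumption)
    answer_a, answer_b = (baseline_answer, model_answer) if swap else (model_answer, baseline_answer)
    prompt = JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)
    verdict = query_gpt4v(image, prompt).strip().lower()
    if verdict.startswith("tie"):
        return "tie"
    winner_is_a = verdict.startswith("a")
    model_is_a = not swap
    return "model" if winner_is_a == model_is_a else "baseline"

def win_rate(query_gpt4v, samples):
    """samples: iterable of (image, question, model_answer, baseline_answer) tuples."""
    tally = {"model": 0, "baseline": 0, "tie": 0}
    for image, question, model_answer, baseline_answer in samples:
        tally[judge_pair(query_gpt4v, image, question, model_answer, baseline_answer)] += 1
    total = sum(tally.values())
    return {outcome: count / total for outcome, count in tally.items()}
```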

4.4. Ablation Study

Scale Up Projector Training We expand the dataset for stage 1 (projector warmup) and subsequently apply supervised fine-tuning. The results, depicted in Table 8, demonstrate that augmenting the training data volume does not enhance performance at this stage.

Stage 1, Training Step MMB MMC SEED POPE MMMU Average
2K 59.0 54.0 61.8 82.3 30.3 57.5
8K 58.0 45.0 58.5 84.9 29.2 55.1
20K 56.0 52.3 59.0 81.7 28.6 55.5
80K 58.1 55.0 58.6 78.6 27.9 55.6

Table 8 | Comparison of direct SFT performance when scaling up stage 1 data. The results demonstrate that expanding the data scale at this stage does not yield benefits and can even degrade performance.

Stage 1  Stage 2  Stage 3   MMB   MMC   SEED  POPE  MMMU  Average
   ✓               ✓        59.4  54.2  61.4  82.5  29.2  57.4
            ✓      ✓        63.4  60.5  65.9  87.1  31.8  61.7
   ✓        ✓      ✓        64.3  61.3  66.7  87.6  32.2  62.4

Table 9 | Analysis of model performance across training stages.

[Figure 8 charts: Pile-test perplexity and MMBench / MMBench_CN accuracy over training steps, comparing "w/ group by modality" against "w/o group by modality".]

Figure 8 | Comparative analysis of modality group training on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks demonstrates that modality grouping consistently surpasses the non-grouped approach on language tasks, while preserving performance on multimodal tasks in training stage 2 (Multimodal:Language = 60%:40%).

This implies that the projector's capacity is inherently constrained, rendering it incapable of capturing the extensive knowledge necessary for multimodal tasks.
Training Stage In Table 9, we examine the contribution of each stage to the model's performance. It is evident that combining stages 1, 2, and 3 yields significantly better results across all metrics than combining stages 1 and 3 alone, demonstrating the effectiveness of multimodal pretraining. Additionally, the combination of stages 2 and 3 still slightly lags behind the full three-stage pipeline, indicating that the vision-language adaptor warmup stage remains meaningful.
Modality Group Training When mixing language and multimodal data, we observe that directly blending them at the batch level significantly reduces training efficiency. This inefficiency arises because, within each batch, gradient backpropagation waits for the slowest sample to complete. As a result, the predominantly faster-to-process pure-language data ends up waiting for the multimodal samples to finish, lowering overall training efficiency.
To address this issue, we experiment with grouping different modalities of data at each global step, sampling distinct modalities separately.

[Figure 9 charts: Pile-test perplexity and MMBench / MMBench_CN accuracy over training steps, comparing "w/ modality warmup" against "w/o modality warmup".]

Figure 9 | Comparative performance results on language (Pile-test) and multimodal (MMBench and MMBench_CN) benchmarks for modality warmup. Modality warmup consistently matches or surpasses the performance of approaches without modality warmup across all evaluated tasks in training stage 2 (Multimodal:Language = 60%:40%).

This approach involves organizing the training data so that batches are composed either entirely of language data or entirely of multimodal data at different training steps, rather than mixing them within the same batch.
As shown in Figure 8, this method does not compromise the model's performance while enhancing training efficiency by 20%. This strategy effectively circumvents the bottleneck caused by the disparate processing times across modalities, optimizing the training workflow.
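A minimal sketch of such step-level modality grouping is shown below; the iterator-based interface and the sampling mechanism are assumptions, and the actual data loader is not described in this level of detail.

```python
import random

def modality_grouped_steps(language_batches, multimodal_batches,
                           multimodal_ratio=0.6, num_steps=1000, seed=0):
    """Yield one modality-homogeneous batch per global training step.

    Each step draws an entire batch from a single modality, so fast-to-process
    text batches never wait on slow multimodal samples within the same step.
    `language_batches` and `multimodal_batches` are assumed to be infinite iterators.
    """
    rng = random.Random(seed)
    for _ in range(num_steps):
        if rng.random() < multimodal_ratio:
            yield "multimodal", next(multimodal_batches)
        else:
            yield "language", next(language_batches)
```

Because each step consumes a homogeneous batch, the target modality mix is preserved in expectation across steps rather than within a single batch.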
Modality Warmup Considering that our approach involves multimodal training on the foundation of a language model, directly mixing multimodal data in a fixed proportion from the outset can destabilize the model. To counteract this issue, we propose a simple yet effective modality warm-up strategy: we initially set the language data ratio to 1 and then gradually decrease it to the target ratio used for the final model training (e.g., 0.7).
Our experiments, as illustrated in Figure 9, demonstrate that this strategy effectively prevents
a significant decline in language capabilities at the beginning of training, while also yielding
comparatively superior outcomes in the final phases for both the language and multimodal do-
mains. This gradual adaptation enables the model to more seamlessly adjust to the incorporation
of multimodal data, thereby improving overall training stability and performance.
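One way to realize this warm-up is a simple ratio schedule such as the sketch below; the linear shape and the warm-up horizon are assumptions, as the text only states that the ratio starts at 1.0 and is gradually lowered to the target.

```python
def language_ratio(step, warmup_steps=2000, start=1.0, target=0.7):
    """Linearly anneal the language-data sampling ratio from `start` to `target`.

    The schedule shape and warmup horizon are assumptions; the paper only states
    that the ratio starts at 1.0 and is gradually decreased to the target value.
    """
    if step >= warmup_steps:
        return target
    frac = step / warmup_steps
    return start + frac * (target - start)
```

Combined with the grouping sketch above, `1 - language_ratio(step)` could serve as the per-step multimodal sampling probability.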
Vision Encoder Selection To better acquire and utilize image information, we compare the training loss of different vision encoders under our standard training settings, except that the number of stage 2 training steps is reduced to 8,000 for efficiency. As illustrated in Figure 10, incorporating a vision-only self-supervised encoder significantly reduces the training loss. To process high-resolution images more effectively, our final model adopts a hybrid vision encoder strategy, combining SigLIP with SAM.
Vision-Language Adaptor Design To improve the efficiency of extracting information from the visual encoder while adhering to current token-length constraints, the Vision-Language adaptor can be adjusted in two main ways: the method used to combine visual features and the design of the MLP adaptor.
Previous studies (Tong et al., 2024) have indicated that combining visual features along the sequence dimension can lead to better model performance, although this comes with the trade-off of increased computation due to a longer sequence of visual feature tokens.

[Figure 10 chart: stage 2 training loss versus step for CLIP, SigLIP, SigLIP+DINO, and SigLIP+SAM encoders.]

Figure 10 | Comparative analysis of different vision encoders on training losses in stage 2.

Architecture MMB MMC SEED POPE ScienceQA MMMU OCRB Average


Sequence Concatenation:
Token Pooling - W 61.2 59.6 61.6 86.5 57.7 31.6 304 55.5
Token Pooling - H 59.9 58.3 61.6 83.8 55.0 32.0 291 54.2
Embedding Concatenation:
Hybrid MLP 61.7 60.1 62.9 87.8 56.6 31.3 309 55.9
Shared MLP 62.0 58.9 62.5 86.6 54.7 30.2 318 55.2
Separate MLP 57.5 58.7 63.1 86.5 56.6 29.0 299 54.5

Table 10 | Comparison of different adaptor architectures using SigLIP and SAM as the hybrid vision encoder; a hybrid MLP is used for the sequence concatenation experiments. Bolded entries represent the best results, while underlined entries denote the second-best results. For calculating the average score, we divide the OCRBench score by the total number of questions.

As demonstrated in the top section of Table 10, reducing the sequence length by stacking visual features along the image's width or height dimension before sequence concatenation (so that the sequence length stays constant) does not achieve better results than simply merging the features along the embedding dimension on most metrics. In terms of the adaptor architecture, employing separate MLP adaptors for each vision encoder allows for more precise adjustment to the specific values and distribution patterns of its visual features, facilitating smoother model training. Conversely, using a shared MLP adaptor for the different vision encoders contributes to adequate feature fusion. We therefore adopt a mixed strategy and observe stable and improved performance, as outlined in the lower section of Table 10.
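The sketch below shows one plausible reading of this mixed ("hybrid MLP") design: separate per-encoder projections followed by a shared fusion layer, with SigLIP and SAM features concatenated along the embedding dimension so the token count stays fixed. All dimensions and the exact layer arrangement are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class HybridMLPAdaptor(nn.Module):
    """Sketch of embedding-dimension fusion of two vision encoders' features.

    Assumed shapes (not from the paper): SigLIP yields [B, N, d_siglip] tokens and
    the SAM feature map has been flattened/pooled to the same N tokens of d_sam.
    "Hybrid" is read here as separate per-encoder projections plus a shared MLP.
    """
    def __init__(self, d_siglip=1024, d_sam=256, d_llm=4096):
        super().__init__()
        self.proj_siglip = nn.Linear(d_siglip, d_llm)   # separate, per-encoder
        self.proj_sam = nn.Linear(d_sam, d_llm)
        self.shared = nn.Sequential(                    # shared fusion MLP
            nn.GELU(),
            nn.Linear(2 * d_llm, d_llm),
        )

    def forward(self, siglip_tokens, sam_tokens):
        # Concatenate along the embedding dimension, keeping sequence length fixed.
        fused = torch.cat(
            [self.proj_siglip(siglip_tokens), self.proj_sam(sam_tokens)], dim=-1
        )
        return self.shared(fused)   # [B, N, d_llm] tokens fed to the LLM

# Example with dummy features:
# adaptor = HybridMLPAdaptor()
# out = adaptor(torch.randn(1, 576, 1024), torch.randn(1, 576, 256))
```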

5. Conclusion, Limitation, and Future Work


In this technical report, we have introduced DeepSeek-VL, a series of Multimodal Large Lan-
guage Models, available in scales of 1.3B and 6.7B parameters. This report has unveiled the
limitations inherent in the predominant projector-based pretraining methodologies, setting
the stage for the innovative approach adopted by DeepSeek-VL. By prioritizing a joint vision
and language (VL) pretraining phase, DeepSeek-VL transcends traditional models by ensuring
that the integration of multimodal data does not compromise the linguistic capabilities of the
Large Language Models (LLMs). This is achieved through a strategic warm-up data ratio and
the introduction of a hybrid vision encoder, which together enable the efficient processing of high-resolution images without losing sight of semantic richness.
The incorporation of a hybrid vision encoder, capable of handling 1024 x 1024 images within
a constrained token budget, underscores our commitment to preserving the nuanced details
and semantic integrity across diverse tasks. As a result, DeepSeek-VL emerges as a pioneering
model that not only meets but exceeds the standards set by generalist models in its class. It
showcases exceptional performance across a wide range of vision-centric benchmarks while
sustaining formidable proficiency in language-centric evaluations.
In making DeepSeek-VL publicly available, we aim to catalyze further innovation and
exploration within the research community, providing a robust foundation upon which future
studies can build. This gesture of openness is intended to facilitate the collective advancement
of our understanding and capabilities in handling multimodal data.
Looking ahead, we are excited to announce plans to scale up DeepSeek-VL to larger sizes,
incorporating Mixture of Experts (MoE) technology. This forthcoming expansion promises to
further enhance the model’s efficiency and effectiveness, opening up new horizons for research
and application in the field of AI.

References
01-ai. Yi-34B vision language model. https://huggingface.co/01-ai/Yi-VL-34B, 2024.

Abi. Screenshot to code. https://github.com/abi/screenshot-to-code, 2024.

Anna’s Archive. Anna’s archive. https://annas-archive.org/, 2024.

Anthropic. Introducing Claude, 2023. URL https://www.anthropic.com/index/introducing-claude.
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry,
Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732,
2021.

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A
versatile vision-language model for understanding, localization, text reading, and beyond.
arXiv preprint arXiv:2308.12966, 2023.

R. Bavishi, E. Elsen, C. Hawthorne, M. Nye, A. Odena, A. Somani, and S. Taşırlar. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b.

L. Blecher. Latex-ocr. GitHub repository, 2024. URL https://github.com/lukas-blecher/LaTeX-OCR.
L. Blecher, G. Cucurull, T. Scialom, and R. Stojnic. Nougat: Neural optical understanding for
academic documents. arXiv preprint arXiv:2308.13418, 2023.

A. Burns, K. Srinivasan, J. Ainslie, G. Brown, B. A. Plummer, K. Saenko, J. Ni, and M. Guo. A suite of generative tasks for multi-level multimodal webpage understanding. In The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. URL https://openreview.net/forum?id=rwcLHjtUmn.
J. Carter. Textocr-gpt4v. https://huggingface.co/datasets/jimmycarter/textocr-gpt4v, 2024.

L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. Sharegpt4v: Improving
large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.

C. K. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding, et al.
Icdar2019 robust reading challenge on arbitrary-shaped text-rrc-art. In 2019 International
Conference on Document Analysis and Recognition (ICDAR), pages 1571–1576. IEEE, 2019.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip:
Towards general-purpose vision-language models with instruction tuning, 2023.

DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024. URL https://github.com/deepseek-ai/DeepSeek-LLM.
X. Dong, P. Zhang, Y. Zang, Y. Cao, B. Wang, L. Ouyang, X. Wei, S. Zhang, H. Duan, M. Cao,
et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension
in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.

echo840. Detailed caption dataset. https://huggingface.co/datasets/echo840/Detailed_Caption, 2024.
Wikimedia Foundation. Wikimedia downloads. URL https://dumps.wikimedia.org.

J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y. Wang, L. Hong, J. Han, H. Xu, Z. Li, et al. G-
llava: Solving geometric problem with multi-modal large language model. arXiv preprint
arXiv:2312.11370, 2023.

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Google. An important next step on our AI journey, 2023. URL https://blog.google/technology/ai/bard-google-ai-search-updates/.
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring
massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

High-flyer. Hai-llm: An efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm.
Y.-C. Hsiao, F. Zubach, M. Wang, et al. Screenqa: Large-scale question-answer pairs over mobile
app screenshots. arXiv preprint arXiv:2209.08199, 2022.

A. Hu, Y. Shi, H. Xu, J. Ye, Q. Ye, M. Yan, C. Li, Q. Qian, J. Zhang, and F. Huang. mplug-
paperowl: Scientific diagram analysis with the multimodal large language model. arXiv
preprint arXiv:2311.18248, 2023.

HuggingFaceM4. Websight dataset. https://huggingface.co/datasets/HuggingFaceM4/WebSight, 2024.

S. Kantharaj, R. T. Leong, X. Lin, A. Masry, M. Thakkar, E. Hoque, and S. Joty. Chart-to-
text: A large-scale benchmark for chart summarization. In S. Muresan, P. Nakov, and
A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 4005–4023, Dublin, Ireland, May
2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.277. URL
https://aclanthology.org/2022.acl-long.277.
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C.
Berg, W.-Y. Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M. Ferrandis, Y. Jernite, M. Mitchell, S. Hughes,
T. Wolf, D. Bahdanau, L. von Werra, and H. de Vries. The stack: 3 tb of permissively licensed
source code. In Transactions on Machine Learning Research, 2023.
V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro.
Reducing activation recomputation in large transformer models. Proceedings of Machine
Learning and Systems, 5, 2023.
I. Krylov, S. Nosov, and V. Sovrasov. Open images v5 text annotation and yet another mask text
spotter. In Asian Conference on Machine Learning, pages 379–389. PMLR, 2021.
A. Kulkarni and J. Truelsen. wkhtmltopdf. https://wkhtmltopdf.org/. Project maintained
by Ashish Kulkarni, originally created by Jakob Truelsen. Accessed: 2024-02-22.
LAION. Gpt-4v dataset. https://huggingface.co/datasets/laion/gpt4v-dataset,
2023.
B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms
with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
S. Li and N. Tajbakhsh. Scigraphqa: A large-scale synthetic multi-turn question-answering
dataset for scientific graphs, 2023.
Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget captioning: Generating natural language
description for mobile user interface elements. arXiv preprint arXiv:2010.04295, 2020.
Y. Li, H. Mao, R. Girshick, and K. He. Exploring plain vision transformer backbones for object
detection. In European Conference on Computer Vision, pages 280–296. Springer, 2022.
Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in
large vision-language models. arXiv preprint arXiv:2305.10355, 2023b.
J. Lin, H. Yin, W. Ping, Y. Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han.
Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023a.
Z. Lin, C. Liu, R. Zhang, P. Gao, L. Qiu, H. Xiao, H. Qiu, C. Lin, W. Shao, K. Chen, et al. Sphinx:
The joint mixing of weights, tasks, and visual embeddings for multi-modal large language
models. arXiv preprint arXiv:2311.07575, 2023b.
F. Liu, F. Piccinno, S. Krichene, C. Pang, K. Lee, M. Joshi, Y. Altun, N. Collier, and J. M. Eisenschlos.
Matcha: Enhancing visual language pretraining with math reasoning and chart derendering.
arXiv preprint arXiv:2212.09662, 2022a.
H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information
processing systems, 36, 2024b.

Y. Liu, G. Zhu, B. Zhu, Q. Song, G. Ge, H. Chen, G. Qiao, R. Peng, L. Wu, and J. Wang. Taisu: A
166m large-scale high-quality dataset for chinese vision-language pre-training. In S. Koyejo,
S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural
Information Processing Systems, volume 35, pages 16705–16717. Curran Associates, Inc.,
2022b. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/6a386d703b50f1cf1f61ab02a15967bb-Paper-Datasets_and_Benchmarks.pdf.
Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al.
Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281,
2023a.

Y. Liu, Z. Li, H. Li, W. Yu, M. Huang, D. Peng, M. Liu, M. Chen, C. Li, L. Jin, et al. On the hidden
mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023b.

S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. Towards end-to-end unified
scene text detection and layout analysis. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2022.

P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, and S.-C. Zhu. Iconqa: A
new benchmark for abstract diagram understanding and visual language reasoning. arXiv
preprint arXiv:2110.13214, 2021.

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan.
Learn to explain: Multimodal reasoning via thought chains for science question answering.
In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022a.

P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan.
Learn to explain: Multimodal reasoning via thought chains for science question answering.
Advances in Neural Information Processing Systems, 35:2507–2521, 2022b.

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao.
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv
preprint arXiv:2310.02255, 2023.

J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.

A. Masry, P. Kavehzadeh, X. L. Do, E. Hoque, and S. Joty. Unichart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761, 2023.

D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.

N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon,
et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script
identification-rrc-mlt. In 2017 14th IAPR international conference on document analysis and
recognition (ICDAR), volume 1, pages 1454–1459. IEEE, 2017.

OpenAI. Chatgpt: Optimizing language models for dialogue. 2022. URL https://openai.com/blog/chatgpt.
OpenAI. GPT-4 technical report. arXiv, 2023a.

OpenAI. GPT-4V(ision) system card, 2023b.

J. A. Rodriguez, D. Vazquez, I. Laradji, M. Pedersoli, and P. Rodriguez. Ocr-vqgan: Taming text-within-image generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3689–3698, 2023.

R. Schaeffer, B. Miranda, and S. Koyejo. Are emergent abilities of large language models a
mirage? Advances in Neural Information Processing Systems, 36, 2024.

N. Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai. Icdar2017 competition
on reading chinese text in the wild (rctw-17). In 2017 14th iapr international conference on
document analysis and recognition (ICDAR), volume 1, pages 1429–1434. IEEE, 2017.

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

A. Singh, G. Pang, M. Toh, J. Huang, W. Galuba, and T. Hassner. Textocr: Towards large-
scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 8802–8812, 2021.

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary
position embedding. Neurocomputing, 568:127063, 2024.

Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang.
Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.

Y. Sun, Z. Ni, C.-K. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, et al.
Icdar 2019 competition on large-scale street view text with partial labeling-rrc-lsvt. In 2019
International Conference on Document Analysis and Recognition (ICDAR), pages 1557–1562.
IEEE, 2019.

G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai,
A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805, 2023.

S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. Eyes wide shut? exploring the visual
shortcomings of multimodal llms. arXiv preprint arXiv:2401.06209, 2024.

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu,
J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini,
R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura,
M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra,

I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M.
Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan,
I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and
T. Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288,
2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. Coco-text: Dataset and benchmark for
text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016.

B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2words: Automatic mobile
ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User
Interface Software and Technology, pages 498–510, 2021.

J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and Y.-G. Jiang. To see is to believe: Prompting gpt-4v
for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023a.

W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm:
Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.

H. Wei, L. Kong, J. Chen, L. Zhao, Z. Ge, J. Yang, J. Sun, C. Han, and X. Zhang. Vary: Scaling
up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109,
2023.

Y. Yang, A. Panagopoulou, Q. Lyu, L. Zhang, M. Yatskar, and C. Callison-Burch. Visual goal-step inference using wikihow. arXiv preprint arXiv:2104.05845, 2021.

J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, J. Zhang, et al. Ureader:
Universal ocr-free visually-situated language understanding with multimodal large language
model. arXiv preprint arXiv:2310.05126, 2023.

Q. Yu, Q. Sun, X. Zhang, Y. Cui, F. Zhang, Y. Cao, X. Wang, and J. Liu. Capsfusion: Rethinking
image-text data at scale. arXiv preprint arXiv:2310.20550, 2023a.

W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large
multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023b.

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al.
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for
expert agi. arXiv preprint arXiv:2311.16502, 2023.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish
your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th
Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July
28- August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational
Linguistics, 2019. doi: 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.
B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural
Information Processing Systems, 32, 2019.

G. Zhang, X. Du, B. Chen, Y. Liang, T. Luo, T. Zheng, K. Zhu, Y. Cheng, C. Xu, S. Guo, et al.
Cmmmu: A chinese massive multi-discipline multimodal understanding benchmark. arXiv
preprint arXiv:2401.11944, 2024.

R. Zhang, Y. Zhou, Q. Jiang, Q. Song, N. Li, K. Zhou, L. Wang, D. Wang, M. Liao, M. Yang,
et al. Icdar 2019 robust reading challenge on reading chinese text on signboard. In 2019
international conference on document analysis and recognition (ICDAR), pages 1577–1581.
IEEE, 2019.

Y. Zhang, L. Gueguen, I. Zharkov, P. Zhang, K. Seifert, and B. Kadlec. Uber-text: A large-scale dataset for optical character recognition from street-level imagery. In SUNw: Scene Understanding Workshop - CVPR 2017, Hawaii, U.S.A., 2017. URL http://sunw.csail.mit.edu/abstract/uberText.pdf.
L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al.
Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information
Processing Systems, 36, 2024.

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A
human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023.
doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.

W. Zhu, J. Hessel, A. Awadalla, S. Y. Gadre, J. Dodge, A. Fang, Y. Yu, L. Schmidt, W. Y. Wang, and Y. Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024.

A. Appendix

Prompt Response – DeepSeek-VL


将这段流程图转换为python程序 这段流程图展示了一个使用Python编程语言编写的猜数字游戏。游戏的目的是玩家猜测一
个在1到10之间的随机数,如果猜对了,游戏就会结束并显示“你赢了!”(You got it!)。
如果玩家的猜测超过了或低于这个范围,游戏会提示“太高了!”(Too high!)或“太低
了!”(Too low!),然后玩家可以再次尝试。
根据流程图的描述,我们可以将这个游戏实现为以下Python代码:

在这个程序中,我们使用了random.randint()函数来生成一个1到10之间的随机整数作为
答案。然后,我们进入一个循环,直到玩家的猜测等于答案为止。在每次猜测后,我们会
检查玩家的猜测是否在范围内,以及它是否等于答案。如果是,我们就打印出“你赢
了!”;如果不是,我们就打印出“太高了!”或“太低了!”,然后让玩家再次猜测。
一旦玩家的猜测等于答案,游戏就结束了。

Response – Open-source Model 1

sorry, as a language model AI, I cannot process image into any other content.

Response – Open-source Model 2

当然!这里有一个 Python 代码片段可以复制流程图: ```python

Figure 11 | Visualization results. DeepSeek-VL can understand children's programming diagrams from the real world and provide detailed and organized explanations.

Prompt Response – DeepSeek-VL
Explain the image. The image is a line graph that compares the performance of two different learning rate schedulers,
namely the Multi Step Learning Rate Scheduler and the Cosine Learning Rate Scheduler. The x-axis
represents the number of processed tokens in billions, ranging from 0 to 100 billion. The y-axis
represents the training loss, which ranges from 2.0 to 3.0.

The blue line represents the Multi Step Learning Rate Scheduler with three steps of increasing
rates: (80% + 10% + 10%). This line starts at a high point near 3.0 on the y-axis when the number
of processed tokens is 0 billion, indicating a higher initial loss. As the number of processed tokens
increases, this line shows a steep decline in training loss, suggesting rapid improvement in model
performance as more data is processed. However, after approximately 40 billion processed tokens,
the line begins to level off, indicating diminishing returns on further processing.

The orange line represents the Cosine Learning Rate Scheduler. It also starts at a high point near
3.0 when the number of processed tokens is 0 billion but has a much smoother descent than the
blue line. This suggests a slower but steady decrease in training loss as more tokens are processed.
After around 60 billion processed tokens, this line levels off as well, similar to the blue line.

Both lines show a general trend of decreasing training loss as the number of processed tokens
increases, indicating improved model performance over time. However, the Multi Step Learning
Rate Scheduler appears to achieve lower losses more quickly compared to the Cosine Learning
Rate Scheduler, especially in the early stages of processing tokens.

Response – Open-source Model 1

Line 1 expresses the number of tokens, following a nonlinear trend. Line 2 starts at value 80, then
is decreasing and increases by 10% until reaching a value where it then increases by 10%, before it
flattens out. Line 3 also starts in increasing part, then has some constant decreases between
different points, and after that express an increasing part.

Response – Open-source Model 2

The image displays a line graph comparing the training losses of two different learning rates
schedulers: the Multi Step Learning Rate Scheduler (80% + 10% = 10%) and the Cosine Learning
Rate Scheduler. The y-axis represents the training loss in terms of billions, ranging from
approximately 2.2 to 3.0. The x-axis represents the number of processed tokens, with a range
from 0 to 100. The Multi Step Learning Rate Scheduler's line is blue, showing a sharp decline
initially and then stabilizing. The Cosine Learning Rate Scheduler's line is orange, showing a more
gradual decline.

Prompt Response – DeepSeek-VL


Help me find the bug. The image shows a piece of code written in Python, which is intended to find the maximum value in
a sequence. The function definition starts with "def" indicating that this is a new function being
defined. The name of the function is "find_max_in_sequence".
The first line checks if the input is not a sequence; if it is, the function returns None. This is an
important check because functions should only operate on sequences (like lists or tuples) and not
other types of data structures. If the input is not a sequence, the function will return no result, which
can be useful for handling invalid inputs.
Next, there's a statement defining what a sequence is: max_value = sequence [1]. This seems
incorrect as it assigns the index 1 to max_value instead of the actual element at position 1 in the
sequence.
Then we have a loop that iterates over each number in the sequence. It uses an "if" condition to
check whether the current number (num) is greater than the current maximum value (max_value). If
true, it updates max_value to be equal to num. This part of the code appears correct, as it correctly
compares each number against the current maximum value and updates the maximum when
necessary.
Finally, after the loop ends, the function returns the updated max_value, which contains the
maximum value found in the sequence.
Based on the visual clues provided by the code snippet, the bug lies in the assignment of max_value
where it incorrectly assigns the index 1 rather than the value at that index in the sequence. The
corrected version would look like this: max_value = sequence[0].
This error could potentially lead to unexpected results if the sequence has more than one element,
as the function might fail to find the maximum value due to the wrong assignment of max_value.

Response – Open-source Model 1

The function name in the code is "find_max_in_sequence", but it has not actually been called as
the condition of the if-statement checks for 'not sequence' instead of 'if not sequence'. The actual
line where the function should be called is missing from the provided code.

Response – Open-source Model 2

The bug in the code is that the condition for returning 'None' is set to 'if not sequence', which is
incorrect. The correct condition should be 'if not max_value', to ensure that the function returns
'None' when the sequence is not found or when the max value is not found within the sequence.

Figure 12 | Visualization results. DeepSeek-VL has strong understanding capabilities for code
and charts in the real world.

Prompt Response – DeepSeek-VL
What coin is that? The coin is a 10 Baht coin from Thailand.

Response – Open-source Model 1

The coin is an Asian coin, which features a cityscape on it.

Response – Open-source Model 2

answering does not require reading text in the image

Prompt Response – DeepSeek-VL


请根据这张图片,创作一首七言绝句 山川暮色映斜阳,
林间雪覆寒意长。
孤树影斜人未归,
一抹残霞照断肠。

Response – Open-source Model 1

雪山苍苍笼寒日,霜林瑟瑟动清声。

Response – Open-source Model 2

"光秃的树木高耸,
金色的色调亲吻着白雪,
群山低语着古老的故事,
白昼的拥抱开始消失。"

Prompt Response – DeepSeek-VL


图中的角色是哪家公司的? 图中的角色是miHoYo(米哈游)公司的。这家公司以其开发的手机游戏《原神》而闻名,该游
戏中的角色设计风格与图片中角色的风格相似。此外,图片中的角色穿着和武器设计也符合
miHoYo游戏的特征。

Response – Open-source Model 1

图像中的角色来自游戏公司“原神”。

Response – Open-source Model 2

这张图片展示的角色是日本Square Enix公司旗下的一款游戏《最终幻想14》中的角色。

Figure 13 | Visualization results. DeepSeek-VL possesses extensive knowledge of the real world.

Prompt Response – DeepSeek-VL
Convert it to markdown table. Here is the image content converted into a markdown table:

Please note that the "A" column in the original table seems to be empty or not fully visible in the
image provided.

Response – Open-source Model 1

Response – Open-source Model 2

Figure 14 | Visualization results. DeepSeek-VL is capable of accurately reading the contents of real-world tables.

