DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu*1† , Wen Liu*1 , Bo Zhang*1‡ , Bingxuan Wang1† , Kai Dong1 , Bo Liu1† , Jingxiang Sun1† ,
Tongzheng Ren1† , Zhuoshu Li1 , Hao Yang1† , Yaofeng Sun1 , Chengqi Deng1 , Hanwei Xu1 , Zhenda Xie1 ,
Chong Ruan1
1 DeepSeek-AI
Abstract
∗ Equal contribution.
† Work done during the internship at DeepSeek-AI.
‡ Project lead.
Contents

1 Introduction
2 Data Construction
  2.1 Vision-Language pretraining Data
  2.2 Supervised Fine-tuning Data
3 Approach
  3.1 Architecture
  3.2 Training Pipelines
    3.2.1 Stage 1: Training Vision-Language Adaptor
    3.2.2 Stage 2: Joint Vision-Language pretraining
    3.2.3 Stage 3: Supervised Fine-tuning
  3.3 Hyperparameters and Infrastructures
4 Evaluation
  4.1 Public Multimodal Benchmarks Evaluation
  4.2 Public Language Benchmarks Evaluation
  4.3 Human Evaluation
  4.4 Ablation Study
A Appendix
1. Introduction
The remarkable success of large language models (LLMs) (Anthropic, 2023; Google, 2023;
OpenAI, 2022, 2023a) has fueled the demand for a versatile interface that can handle multiple
modalities beyond language. In response to this growing demand, we have seen an emergence
of Large Multimodal Models (LMMs) like GPT-4V (OpenAI, 2023b) and Gemini (Team et al.,
2023), which serve as versatile assistants capable of comprehending and acting upon instructions
that span vision and language. These models exhibit considerable promise in executing complex,
diverse real-world tasks, enabling more natural and human-like interactions.
Recently, there has been a surge of open-source large multimodal models aimed at narrowing
the gap with proprietary counterparts. Substantial strides have been made, especially in
benchmark performance, yet a significant divide persists between the majority of open-source
models and state-of-the-art closed-source models (Bai et al., 2023; Bavishi et al., 2023; OpenAI,
2023b; Team et al., 2023) when it comes to real-world performance and user experience. It
remains challenging for the open-source community to develop models with robust general
multimodal capabilities for real-world applications.
The performance gap between most open-source models and proprietary models is especially pronounced in real-world scenarios.
[Figure: qualitative prompt-response examples of DeepSeek-VL, including a Chinese walkthrough of the CLIP algorithm flowchart, step-by-step instructions for a home robot fetching an item from a fridge, spatial reasoning over a floor plan, summarizing the main contribution of a paper from its abstract, translating a flowchart into Python code, and a Chinese description of the official Minecraft website.]
DeepSeek-VL’s pretraining dataset is compiled from a variety of sources, including but not
limited to Common Crawl, Web Code, E-books, Educational Materials, and arXiv Articles. This
collection thoroughly encompasses real-world scenarios such as web screenshots, PDFs, OCR,
charts, and knowledge-based content (expertise, textbooks), aiming for a broad and practical
representation while remaining scalable.
While our pretraining data encompasses a wide array of world knowledge, we meticulously
curate our instruction-tuning dataset to reflect real-world usage scenarios. To achieve this, we
manually gather authentic test cases for GPT-4V and Gemini from the Internet. These cases have
been systematically organized into a comprehensive taxonomy. We use this structured taxonomy
to choose prompts for each test image, ensuring a practical and relevant instruction tuning
dataset. This taxonomy is also used to create an evaluation dataset that effectively assesses
real-world performance.
The visual module is designed to optimize the utilization of high-resolution visual inputs
while remaining within a fixed token budget to manage inference costs effectively. As such, we
employ a hybrid vision encoder, which combines a text-aligned encoder for coarse semantic
extraction at 384 × 384 resolution with a high-resolution encoder that captures detailed visual
information at 1024 × 1024 resolution. By fusing these two encoders, our hybrid approach
efficiently condenses a 1024×1024 resolution image (which suffices in most use cases) into 576
tokens. This token count strikes a balance between rich visual representation and token economy,
making it feasible for both text-image interleaving and multi-turn inference scenarios.
During the pretraining of multimodal models, a common challenge encountered is the
potential degradation of language capabilities when the training process is overly reliant on
vision-language data. Our research reveals that maintaining a significant proportion of language
data—specifically, at least 70%—is essential to preserve the integrity of language knowledge
within the model. This balance is critical for achieving a robust multimodal capability that does
not compromise language performance. Moreover, we introduce a novel “modality warm-up”
strategy. This approach carefully adjusts the ratio of modalities during training, gradually
incorporating more vision-language data. The careful tuning of the modality ratio along with
the warm-up strategy results in a balanced performance of both modalities.
When iterating on our model, we conduct experiments at a small scale before scaling to a larger model size. However, a smaller model, e.g., a 1B model, cannot demonstrate reasonable performance on benchmarks (Schaeffer et al., 2024) or faithfully reflect the model's capabilities. We adopt two approaches to address this. First, we modify the evaluation protocol from multiple-choice answering to comparing the perplexity of the candidate options. Second, to prevent instruction-following ability from becoming the bottleneck, we mix a small proportion of instruction-tuning data into the pretraining phase. In this way, we can achieve reasonable performance with the 1B model and more accurately measure the impact of each iteration during experimentation.
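To make the protocol concrete, here is a minimal sketch of perplexity-based option scoring, illustrated with a text-only Hugging Face causal LM for simplicity; the model name, prompt format, and scoring details are illustrative assumptions, not our evaluation harness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_nll(model, tokenizer, prompt, option):
    """Mean negative log-likelihood of the option tokens given the prompt
    (lower NLL = lower perplexity = more likely option)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # scores for the next token
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    return -token_lp[-option_len:].mean().item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
question = "Question: What color is a clear daytime sky?\nAnswer:"
options = [" blue", " green", " red", " purple"]
scores = [option_nll(model, tokenizer, question, o) for o in options]
print(options[min(range(len(options)), key=scores.__getitem__)])
```

The selected answer is simply the option whose continuation the model assigns the lowest perplexity, which avoids relying on the small model's ability to emit a well-formatted multiple-choice letter.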
Through extensive evaluations of general vision and language benchmarks, the DeepSeek-VL
family showcases superior user experiences in real-world applications and achieves state-of-
the-art or competitive performance across a wide range of visual-language benchmarks at the
same model size, while maintaining robust language-centric performance. To foster innovation
and enable a wide range of applications, we have made two versions of our model, 1.3B and 7B, publicly accessible, in the hope of meeting the needs of varying computational capabilities.
2. Data Construction
A diverse and large dataset is the most important ingredient of visual language model training.
Our dataset can be divided into two parts: Vision-Language pretraining Data and Vision-
Language Supervised Fine-Tuning Data. VL pretraining Data is composed of visual-text data
from various sources, aimed at enhancing the model’s fundamental cross-modal understanding
capabilities; while VL Supervised Fine-Tuning Data has a relatively smaller size and aims to
teach the model to complete specific downstream tasks. By design, VL pretraining Data is
used to warm up the vision-language adaptor in training stage 1 and jointly pretrain the vision-
language model in stage 2, and VL Supervised Fine-Tuning Data is exploited in training stage 3,
i.e., vision language supervised fine-tuning.
The pretraining dataset utilized in our study encompasses a diverse range of publicly accessible
sources, in addition to a selection of proprietary data. We provide a comprehensive overview of
the data sources employed during the joint vision and language pretraining stage in Table 1.
Such a dataset can facilitate LLM’s comprehension of the entities portrayed in the images.
Furthermore, we present a detailed breakdown of the complete dataset, which is organized
into the following categories:
Interleaved image-text data enable the models to develop a better capability for in-context learning with multi-modality inputs. We utilize three public datasets, MMC4 (Zhu et al., 2024), Wiki (Burns et al., 2023), and Wikihow (Yang et al., 2021), together with Epub textbooks.
Image caption data come from three high-quality image-text paired datasets: Capsfusion (Yu
et al., 2023a), TaiSu (Liu et al., 2022b) and Detailed Caption (echo840, 2024).
Table and chart data enable the models to learn the capability for general table and chart
image understanding. It encompasses a diverse range of public data sources, including
Chart2text (Kantharaj et al., 2022), Geo170K (Gao et al., 2023), Unichart (Masry et al., 2023),
Ureader (Ye et al., 2023), M-paper (Hu et al., 2023), ScienceQA (Lu et al., 2022b), ScreenQA (Hsiao
et al., 2022), SciGraphQA-295K (Li and Tajbakhsh, 2023), Paper2figure100k (Rodriguez et al.,
2023), Widget Captioning (Li et al., 2020), Screen2words (Wang et al., 2021), and Refexp (Mao
et al., 2016).
Web Code data empowers models with the capability to reconstruct code from graphical
interfaces or visual plots. Leveraging Websight (HuggingFaceM4, 2024) for UI Inverse Rendering,
we adopted a strategy akin to that used in MATCHA (Liu et al., 2022a) for visual plots inverse
rendering. This involved the processing of approximately 1.46 million Jupyter notebooks from
the Stack dataset (Kocetkov et al., 2023). By extracting these notebooks and collating all diagrams
along with their corresponding preceding code segments, we succeeded in curating a collection
featuring 2 million pairs of images and code. For better data quality, we filter this collection down to 1.1 million instances, each comprising a single image coupled with a minimum of 5 lines of code, which constitute our primary training dataset.
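The following is a rough sketch of this notebook-mining step using nbformat; the pairing heuristic and filtering threshold follow our reading of the description above and are not the exact pipeline:

```python
import base64
import nbformat

def extract_figure_code_pairs(path, min_code_lines=5):
    """Pair each rendered image in a notebook with the code cells preceding it."""
    nb = nbformat.read(path, as_version=4)
    pairs, preceding_code = [], []
    for cell in nb.cells:
        if cell.cell_type != "code":
            continue
        preceding_code.append(cell.source)
        for out in cell.get("outputs", []):
            image = out.get("data", {}).get("image/png")
            if image is None:
                continue
            if isinstance(image, list):          # multiline base64 payloads
                image = "".join(image)
            code = "\n".join(preceding_code)
            if len(code.splitlines()) >= min_code_lines:
                pairs.append((base64.b64decode(image), code))
    return pairs

# Usage: collect (png_bytes, code) pairs from one notebook.
# pairs = extract_figure_code_pairs("example.ipynb")
```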
Document Optical Character Recognition (OCR) data facilitates the recognition of optical
characters at the document level, even in challenging real-world scenarios. To the best of
our knowledge, there is currently no publicly available large-scale dataset encompassing both
English and Chinese documents. Despite the existence of the publicly accessible small-scale
dataset Latex-OCR (Blecher, 2024), we additionally constructed a comprehensive English and
Chinese document OCR dataset. It comprises two parts: 1) arXiv articles: we collected source code and compiled PDFs from 1.4 million arXiv articles and, using pre-processing tools from Nougat (Blecher et al., 2023), rendered these articles into paired images and texts; 2) e-books and educational materials: we cleaned 860K English and 180K Chinese e-books from Anna's Archive (Anna's Archive, 2024), alongside millions of K-12 education exam questions, and then employed HTML rendering tools (Kulkarni and Truelsen) to convert these HTML files with different templates into paired image and text formats.

Table 1 | Summary of datasets used in the joint vision and language pretraining stage.
Scene text OCR data augment the capability of the model to recognize and extract text from
images in which the text is integrated into the environment. The dataset is composed of multiple
public datasets, including ArT (Chng et al., 2019), MLT-17 (Nayef et al., 2017), LSVT (Sun et al., 2019), UberText (Zhang et al., 2017), Coco-text (Veit et al., 2016), RCTW-17 (Shi et al., 2017), ReCTS (Zhang et al., 2019), TextOCR (Singh et al., 2021), OpenVINO (Krylov et al., 2021) and HierText (Long et al., 2022).

Text-only corpus serves to maintain proficiency in language-centric tasks. In this study, we employ the same text corpus as DeepSeek-LLM (DeepSeek-AI, 2024).

Table 2 | Summary of data used in our joint vision and language supervised fine-tuning stage.
The supervised fine-tuning datasets utilized in our study encompass a diverse range of multi-
modality and language data sources, including well-known open-source shared gpt4v datasets
such as ShareGPT4V (Chen et al., 2023), LAION-GPTV (LAION, 2023), LVIS-Instruct4V (Wang
et al., 2023a), textOCR-GPT4V (Carter, 2024), LLaVA1.6-GPT4V (Liu et al., 2024a) and IconQA (Lu
et al., 2021). Additionally, we incorporate partial table and chart data extracted from pretraining
datasets such as Ureader (Ye et al., 2023), ScreenQA (Hsiao et al., 2022), Geo170K (Gao et al.,
2023), and ScienceQA (Lu et al., 2022b). Moreover, we integrate the UI Code dataset obtained
from Screen-to-code (Abi, 2024) tasks. To enhance the quality of our multi-modality SFT data,
we have also curated a portion of high-quality in-house multi-modality SFT data, some of
which are in the Chinese language. Our in-house instruction-tuning dataset is meticulously
designed to reflect real-world usage scenarios and cover a wide range of tasks. We start by
collecting a diverse set of authentic test cases for GPT-4V and Gemini from various online sources.
These test cases are then carefully analyzed and organized into a comprehensive taxonomy,
which encompasses multiple categories, such as recognition, conversion, analysis, reasoning,
evaluation, and safety, as detailed in Table 3. This structured taxonomy serves as a guideline
for selecting representative prompts for each test image, ensuring that our instruction-tuning
dataset is both practical and relevant to real-world applications. Moreover, this taxonomy is
also employed to construct a balanced and comprehensive evaluation dataset, which allows us
to effectively assess the model’s performance across different tasks and categories. By following
this systematic approach, we ensure that the categories covered by our in-house multi-modality
SFT data are well-aligned with the taxonomy and representative of real-world usage scenarios.
Recognition: examines the understanding and description ability of large models for image content; it does not require a high knowledge reserve or strong reasoning ability, and some tasks can be completed using traditional machine learning models.
  - Global Description: Theme Description, Event/Behavior Description, Location/Scene Description, Emotion/Mood Description, Style Recognition, Food Recognition, Others

Conversion: requires the model to describe and recognize image content and use specific knowledge (e.g., code knowledge, prompt engineering knowledge) to convert the image content into another form.
  - Image to Code: UI to Code, Chart to Code, Photo to SVG/p64 Encoding, Formula to Code, Flowchart to Code

Analysis: requires the model to use specific knowledge and logical ability to make a reasonable analysis and understanding based on the image content, and to describe the image according to instructions.
  - Data Chart Analysis: Graph Interpretation, Table Interpretation
  - Professional Chart Analysis: Circuit Diagram, Flowchart, Map, Music Score, Financial Chart, Floor Plan, Others
  - Professional Image Analysis: Sensor Image, Biological and Medical Image, Voiceprint Image, Point Cloud Image

Commonsense Reasoning: mainly tests the model's understanding and mastery of common sense in daily life, requiring reasoning based on the interpretation and analysis of image content combined with common sense.
  - Relationship Reasoning: Interpersonal Relationship, Spatial Relationship, Size Relationship, Species Relationship
  - Humor Reasoning: -

Logical Reasoning: requires the model to combine image understanding with domain knowledge and logical reasoning ability to complete the corresponding tasks.
  - Mathematical Reasoning: Algebra and Operation, Plane Geometry, Solid Geometry

Evaluation: requires the model to evaluate the image content according to specific criteria.
  - Reality Evaluation, Similarity Evaluation, Aesthetic Evaluation, Open-ended Evaluation, Improvement Suggestions

Multi-graph: examines the model's ability to analyze and understand multiple images.
  - Temporal Sequence Understanding: Event Prediction, Image Sequencing, Behavior Analysis

Safety: examines the model's performance in terms of safety.
  - Suggestive Questioning, Counterfactual Questioning, Prompt Injection
Table 3 | Our taxonomy for the in-house SFT data. The categories covered by our high-quality
in-house multi-modality SFT data are comprehensively represented in this taxonomy.
Furthermore, we include the text-only SFT data employed in DeepSeek-LLM (DeepSeek-AI,
2024) as part of our joint vision and language SFT data.
3. Approach
3.1. Architecture
Our system contains three modules: a hybrid vision encoder, a vision adaptor, and a language
model. We introduce each part in this section.
Hybrid Vision Encoder. We employ SigLIP as the vision encoder to extract high-level
semantic feature representations from visual inputs. However, we observe that a single SigLIP
encoder struggles to address all real-world questions comprehensively. Vision encoders in the
CLIP family, including SigLIP, are primarily designed for semantic visual representations but
are challenged by ambiguous encoding, resulting in visually distinct images being encoded as
similar due to what are referred to as "CLIP-blind pairs" (Tong et al., 2024). Meanwhile, the CLIP family of models is limited by its relatively low-resolution inputs (e.g., 224 x 224, 336 x 336, 384 x 384, 512 x 512), which hinder its ability to handle tasks requiring more detailed low-level features, such as dense OCR and visual grounding.
To address these limitations, recent studies (Lin et al., 2023b; Tong et al., 2024; Wei et al., 2023) have advocated for the integration of additional vision-only self-supervised encoders to
enhance the visual grounding capabilities of multi-modality models. Building upon previous
motivations, we additionally utilize a vision-only encoder based on SAM-B (Kirillov et al., 2023), a pre-trained ViTDet (Li et al., 2022) image encoder that processes low-level features and accepts high-resolution 1024 x 1024 image inputs. In addition to the SAM-B encoder, we
retain the SigLIP-L vision encoder with low-resolution 384 x 384 image inputs. Consequently,
our hybrid vision encoder combines the SAM-B and SigLIP-L encoders, efficiently encoding
high-resolution 1024 x 1024 images while preserving both semantic and detailed information.
Specifically, the high-resolution SAM-B vision encoder first resizes the image to 1024 x 1024, yielding a 64 x 64 x 256 feature map. The VL adaptor then interpolates this feature map to 96 x 96 x 256. Subsequently, it applies two
convolutional layers with a stride of 2, producing a feature map of 24 x 24 x 1024, and reshapes
it to 576 x 1024. Alongside this, the low-resolution feature map of size 576 x 1024 generated by
SigLIP-L is concatenated with the high-resolution features, resulting in 576 visual tokens with
2048 dimensions. These visual tokens possess a substantial capacity for enhancing high-level
semantic visual recognition and low-level visual grounding tasks. Then they undergo GeLU
activation and are directed through an embedding layer to establish a connection with the
language model.
Vision-Language Adaptor. We employ a two-layer hybrid MLP to bridge the vision encoder
and the LLM. Initially, distinct single-layer MLPs are used to process high-resolution features
and low-resolution features separately. Subsequently, these features are concatenated along their
dimensions and then transformed into the LLM’s input space through another layer of MLP.
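To make the shapes concrete, here is a minimal PyTorch-style sketch of the hybrid adaptor described above; module names, intermediate layer hyperparameters, and the output dimension are illustrative assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVLAdaptor(nn.Module):
    """Fuses SAM-B (high-res) and SigLIP-L (low-res) features into 576 visual
    tokens and projects them into the language model's input space."""

    def __init__(self, llm_dim: int):
        super().__init__()
        # Downsample the interpolated 96x96x256 SAM-B map to 24x24x1024
        # (the intermediate channel width is an illustrative choice).
        self.conv = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),
        )
        # Separate single-layer MLPs for each encoder's features.
        self.high_res_mlp = nn.Linear(1024, 1024)
        self.low_res_mlp = nn.Linear(1024, 1024)
        # Shared projection into the LLM input space after concatenation.
        self.shared_mlp = nn.Linear(2048, llm_dim)

    def forward(self, sam_feat, siglip_feat):
        # sam_feat: (B, 256, 64, 64) from SAM-B; siglip_feat: (B, 576, 1024) from SigLIP-L.
        x = F.interpolate(sam_feat, size=(96, 96), mode="bilinear", align_corners=False)
        x = self.conv(x)                          # (B, 1024, 24, 24)
        x = x.flatten(2).transpose(1, 2)          # (B, 576, 1024)
        high = self.high_res_mlp(x)
        low = self.low_res_mlp(siglip_feat)
        tokens = torch.cat([high, low], dim=-1)   # (B, 576, 2048)
        return self.shared_mlp(F.gelu(tokens))    # (B, 576, llm_dim)
```

Concatenating along the embedding dimension keeps the sequence length at 576 tokens regardless of how many encoders contribute features, which is what keeps the fixed token budget discussed above.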
Language Model. Our language model is built upon DeepSeek LLM (DeepSeek-AI, 2024)
whose micro design largely follows the design of LLaMA (Touvron et al., 2023a,b), adopt-
ing a Pre-Norm structure with RMSNorm (Zhang and Sennrich, 2019) function and using
SwiGLU (Shazeer, 2020) as the activation function for the Feed-Forward Network (FFN), with
an intermediate layer dimension of 8/3 d_model. It also incorporates Rotary Embedding (Su et al., 2024) for positional encoding and uses the same tokenizer as DeepSeek-LLM.

Figure 2 | Visualization results. DeepSeek-VL is capable of capturing tiny objects and giving organized explanations.

We introduce
a family of DeepSeek-VL models. Given our objective of conducting joint pretraining with
multimodal and language, we select an intermediate checkpoint from DeepSeek’s pretrained
models to continue pretraining.
Specifically, the DeepSeek-VL-1B model is constructed based on the DeepSeek-LLM-1B model, which underwent training on an approximate corpus of 500 billion text tokens, while the DeepSeek-VL-7B model is developed from the DeepSeek-LLM-7B model trained on an estimated 2 trillion text tokens.
Figure 3 | Our training pipelines consist of three stages. Stage 1 involves training the Vision-
Language (VL) adaptor while keeping the hybrid vision encoder and language model fixed.
Stage 2 is the crucial part of the joint vision and language pretraining, where both VL adaptor
and language model are trainable. Stage 3 is the supervised fine-tuning phase, during which the
low-resolution vision encoder SigLIP-L, VL adaptor, and language model will be trained.
3.2. Training Pipelines

3.2.1. Stage 1: Training Vision-Language Adaptor

The primary objective of this stage is to establish a conceptual link between visual and linguistic
elements within the embedding space, thereby facilitating the comprehensive understanding
of depicted entities in the images by the Large Language Model (LLM). Consistent with prior
research conducted by LLaVA (Liu et al., 2024b) and Instruct-BLIP (Dai et al., 2023), we adopt a
similar approach in which both the vision encoder and the LLM remain frozen during this stage,
while solely allowing the trainable parameters within the vision-language adaptor. We utilize a
dataset comprising 1.25 million image-text paired captions obtained from ShareGPT4V, along
with 2.5 million Document OCR rendering pairs to train the VL adaptor.
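A minimal sketch of this stage 1 setup, in which only the adaptor receives gradients, is shown below; the module names are placeholders, while the learning rate and Adam betas follow Table 4:

```python
import torch

def configure_stage1(vision_encoder, language_model, vl_adaptor,
                     lr=1.0e-3, betas=(0.9, 0.95)):
    """Stage 1: keep the vision encoder and LLM frozen; train only the VL adaptor."""
    for module in (vision_encoder, language_model):
        for p in module.parameters():
            p.requires_grad = False
    for p in vl_adaptor.parameters():
        p.requires_grad = True
    # Only the adaptor's parameters are handed to the optimizer.
    return torch.optim.AdamW(vl_adaptor.parameters(), lr=lr, betas=betas)
```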
Nevertheless, compared to Large Language Models (LLMs), vision-language adaptors (e.g.,
a 2-layer MLP) have a significantly smaller parameter capacity. This limitation in model capacity
restricts the capabilities that can be learned during this stage. A natural question arises: Can the
law of data scaling be effective at this stage? To address this question, we conducted a simple
experiment in Table 8. The results demonstrate that expanding the data scale at this stage does
not provide benefits and may even lead to inferior performance. Consequently, we proceed to
unfreeze the Large Language Model (LLM) and investigate efficient vision-language pretraining
approaches during stage 2.
3.2.2. Stage 2: Joint Vision-Language pretraining
In this stage, we explore effective pretraining strategies which can be considered as an additional
stage to enable Large Language Models (LLMs) to comprehend multimodal inputs. We keep the
vision encoder frozen and optimize the language model and VL adaptor.
Initially, we attempt to directly train the LLM with multimodal data. However, we find that while the metrics for multimodal performance incrementally improve, there is a stark and severe decline in language metrics, as illustrated in Figure 4 (Multimodal:Language = 100%:0%).
This underscores the inherent challenge in directly conducting multimodal pretraining on the
foundation of an LLM, revealing a critical trade-off between enhancing multimodal abilities and
preserving linguistic proficiency.
We hypothesize that the observed phenomenon stems from two primary factors: firstly, the
majority of multimodal corpora are overly simplistic and exhibit a significant divergence from
the complexity and distribution of linguistic data. Secondly, there appears to be a competitive
dynamic between multimodal and linguistic modalities, leading to what can be described as
catastrophic forgetting of language capabilities within the LLM.
Joint Language-multimodal Training To address this challenge, we devise a straightforward
yet effective joint language-multimodal training strategy. During training, we not only engage
in multimodal data training but also incorporate a large proportion of language data into
the training. This approach aims to balance the training focus, mitigating the adverse effects
observed. We conduct experiments on the DeepSeek-VL 1B model in Figure 4 to explore the
impact of varying the modality mixing ratios.
The analysis of the graph yields several key conclusions: (1). Integrating language data
significantly alleviates the decline in language capabilities, demonstrating a substantial improve-
ment in the model’s linguistic performance. (2). The inclusion of language data does not lead to
a significant loss in multimodal performance, indicating that the model retains its multimodal
processing abilities. (3). The performance of different modalities is strongly correlated with
their respective proportions in the training dataset, substantiating the competitive relationship
between the two modalities. Ultimately, we opt for a training ratio of language to multimodal
data of roughly 7:3 for our final model. This ratio enables the model to maintain its language
capabilities while simultaneously achieving better pretraining on multimodal data, effectively
balancing the development of both language and multimodal proficiencies.
Scaling Vision-Language Pretraining Nevertheless, the pretraining stage of the model incurs a
substantial computational cost, and performing iterations on the 7B model requires an excessive
amount of computing power and time. One suitable strategy involves conducting experiments
on a smaller model, specifically the 1.3B model, and subsequently scaling it up to the 7B model.
Fortunately, we have observed that a significant portion of the outcomes obtained from the 1.3B
models can be effectively transferred to the 7B model through the utilization of SFT (e.g., the
encoder design). However, during the stage 2 training phase, we have encountered considerable
fluctuations in the generative metrics of the 1.3B model, rendering it challenging to supervise
the training process effectively. This has been discussed in Schaeffer et al. (2024): "sharp
and unpredictable changes might be induced by the researcher’s choice of measurement, even
though the model family’s per-token error rate changes smoothly, continuously and predictably
with increasing scale." Subsequent experiments have led us to identify the root causes of this
issue: the limited capacity of the 1.3B model and the absence of SFT data within the training
dataset, both of which hinder the model’s ability to accurately follow instructions. Even when
the model possesses knowledge of the correct options, it struggles to generate them precisely.
[Figure 4: stage-2 training curves for Multimodal:Language ratios of 100%:0%, 75%:25%, 60%:40%, 25%:75%, and 10%:90%, plotted against training steps: multiple-choice PPL accuracy on SeedBench, MMBench, and MMBench_CN; language-modeling PPL accuracy on MMLU; accuracy on HellaSwag; and perplexity on Pile-test.]
3.2.3. Stage 3: Supervised Fine-tuning

In this phase, we finetune the pretrained DeepSeek-VL model with instruction-based fine-tuning
to bolster its ability to follow instructions and engage in dialogue, culminating in the creation of
the interactive DeepSeek-VL-Chat model. We optimize the language model, VL adaptor, and hybrid vision encoder with the vision-language SFT data shown in Table 2; SAM-B remains frozen due to limited GPU memory. We supervise only the answers and special tokens, masking the system and user prompts. To guarantee the model's comprehensive proficiency in dialogue, we utilize a blend of multimodal data and the pure-text dialogue data used in DeepSeek-LLM. This approach ensures the model's versatility across various dialogue scenarios.

Figure 5 | Visualization results. DeepSeek-VL can understand Python code and provide detailed and organized explanations.
3.3. Hyperparameters and Infrastructures

The detailed hyperparameters of all stages are illustrated in Table 4. We train and evaluate our
DeepSeek-VL with HAI-LLM (High-flyer, 2023), a lightweight and efficient distributed training
framework. Since we use visual encoders to convert images into embedding vectors and then
treat image embeddings and text embeddings uniformly, we can easily adapt pipeline paral-
lelism to VL model training: all we need to do is to view visual encoders and text embedding as
a single module and take it as the first layer of the resulting model. This very first layer has a
complicated model structure and precludes standard tensor parallelism techniques, but fortunately it requires relatively little computation compared to the standard transformer blocks above it. We
therefore simply recompute the visual encoder forward pass in all tensor parallel ranks. The
existence of visual encoders also leads to non-uniform execution time across model layers, so
we re-divide model layers between pipeline parallelism ranks to achieve better load balance and
throughput. The upper layers of DeepSeek-VL are exactly the same as those in DeepSeek-LLM.
With such minor modification, we can now perform canonical 3D parallelism techniques as
in Megatron (Korthikanti et al., 2023; Narayanan et al., 2021; Shoeybi et al., 2019) and overlap
computation and communication as in DeepSeek-LLM (DeepSeek-AI, 2024). DeepSeek-VL-
7B consumed 5 days on a cluster of 64 nodes, each comprising 8 Nvidia A100 GPUs, while
DeepSeek-VL-1B consumed 7 days on a setup involving 16 nodes.
Table 4 | Detailed hyperparameters of all training stages.

DeepSeek-VL 1B DeepSeek-VL-7B
Vision Encoder SigLIP SigLIP+SAM
Hyperparameters Stage 1 Stage 2 Stage 3 Stage 1 Stage 2 Stage 3
Learning rate 1.0 × 10−3 3 × 10−5 2.0 × 10−5 1.0 × 10−3 4.2 × 10−5 2.0 × 10−5
LR scheduler Cosine Step Cosine Cosine Step Cosine
Weight decay 0.0 0.0 0.0 0.0 0.0 0.0
Gradient clip 1.0 1.0 1.0 1.0 1.0 1.0
Optimizer AdamW( 𝛽1 = 0.9, 𝛽2 = 0.95) AdamW( 𝛽1 = 0.9, 𝛽2 = 0.95)
Warm-up steps 128 2000 256 128 2000 256
Training steps 15000 96000 10000 15000 42000 10000
Batch size 256 1024 256 256 2304 256
Sequence length 512 4096 4096 512 4096 4096
Sequence packing × ✓ × × ✓ ×
Pipeline parallelism × × × × ✓ ✓
4. Evaluation
LLM MMMU CMMMU MMB MMC SEED OCRB POPE MathV MMVet
Closed-source LMMs:
Gemini Pro Unk 48.9 - 75.2 74.0 70.7 659 - 45.2 59.2
GPT-4V Unk 56.8 42.5 75.0 74.7 71.6 659 - 47.8 49.9
Qwen-VL-Plus Unk 45.2 39.5 66.2 69.6 72.7 - - 43.3 55.7
Qwen-VL-MAX Unk 51.4 - 78.1 76.4 72.7 - - 51.0 61.8
Open-source 13B LMMs:
LLaVA-1.5 13B 36.4 - 68.2 61.9 68.2 331 85.9 26.4 38.3
VILA 13B - - 70.3 64.3 - - 84.2 - 38.8
LLaVA-Next 13B 36.2 - 70.0 64.4 71.9 - 86.7 35.3 48.4
Open-source 7B LMMs:
EMU2-Chat 7B 36.3 23.8 63.6 45.9 68.9 - - 30.0 31.0
Qwen-VL-Chat 7B 37.0 - 60.6 56.7 64.8 - - 33.8 47.3
CogVLM 7B 37.3 24.8 63.7 53.8 68.8 - - 34.7 54.5
LLaVA-Next 7B 35.8 - 67.4 60.0 70.2 - 86.5 34.6 43.9
Yi-VL 6B 37.8 35.8 68.2 68.9 67.6 - - 28.0 31.1
DeepSeek-VL (ours) 7B 36.6 37.9 73.2 72.8 70.4 456 88.1 36.1 41.5
Table 5 | The comparison between different multi-modal models. The top half are proprietary
models, while the bottom are open-source models.
LLM MMMU CMMMU MMB MMC SEED OCRB POPE MathV MMVet
Tiny Model:
MobileVLM 1.4B - - 53.2 - - - 84.5 - -
MobileVLM 2.7B - - 59.6 - - - 84.9 - -
MobileVLM V2 1.4B - - 59.6 - - - 84.3 - -
MobileVLM V2 2.7B - - 63.2 - - - 84.7 - -
LLaVA-Phi 2.7B - - 59.5 - - - 85.0 - 28.9
DeepSeek-VL (ours) 1.3B 32.2 27.4 64.6 61.3 66.7 409 87.6 31.1 34.8
Version      DeepSeek-VL 1B Chat   DeepSeek-VL 7B Chat   DeepSeek-LLM 7B Chat
Encoder      SigLIP                SigLIP+SAM            None
HellaSwag    56.0                  68.4                  68.5
MMLU         32.5                  52.4                  49.4
GSM8K        18.0                  55.0                  63.0
MBPP         10.0                  35.2                  35.2
AGIEval      14.0                  27.8                  19.3
[Figure 7: per-category GPT-4V-based evaluation scores (total score, recognition, conversion, analysis, commonsense, logical, multi-images, evaluation) for InternLM-XComposer2-VL, CogVLM-17B, DeepSeek-VL-7B, and GPT-4V.]
Figure 7 | GPT-4V-based Evaluation Results of DeepSeek-VL vs. Other Models: The chart depicts
results from a GPT-4V-based assessment across 99 test samples, demonstrating DeepSeek-VL’s
favorable outcomes against both open-source and proprietary models.
4.4. Ablation Study

Scale Up Projector Training We expand the dataset for stage 1 (projector warmup) and subsequently apply supervised fine-tuning. The results, depicted in Figure 8, demonstrate that
augmenting the training data volume does not enhance performance at this stage. This implies that the projector's capacity is inherently constrained, rendering it incapable of capturing the extensive knowledge necessary for multimodal tasks.

Stage 1 Training Steps   MMB    MMC    SEED   POPE   MMMU   Average
2K                       59.0   54.0   61.8   82.3   30.3   57.5
8K                       58.0   45.0   58.5   84.9   29.2   55.1
20K                      56.0   52.3   59.0   81.7   28.6   55.5
80K                      58.1   55.0   58.6   78.6   27.9   55.6

Table 8 | SFT performance after directly scaling up the stage 1 data. The results demonstrate that expanding the data scale at this stage does not yield benefits and can even lead to worse performance.
Training Stage In Table 9, we examine the contributions of each stage to the model’s performance.
It’s evident that combining stage 1, stage 2, and stage 3 yields significantly better results across
all metrics compared to combining stage 1 and stage 3 alone, demonstrating the effectiveness of
multimodal pretraining. Additionally, the combination of stage 2 and stage 3 still slightly lags
behind the combined performance of stage 1, stage 2, and stage 3, indicating that the vision-language adaptor warmup stage remains meaningful.
Modality Group Training When mixing language and multimodal data, we observe that directly
blending them at the batch level significantly reduces training efficiency. This inefficiency arises
because each batch gradient backpropagation process waits for the slowest sample to complete.
As a result, the predominantly faster-to-process pure language data ends up waiting for the
multimodal samples to finish, leading to a decrease in overall training efficiency.
To address this issue, we experiment with grouping different modalities of data at each global step, sampling distinct modalities separately. This approach organizes the training data so that each batch is composed either entirely of language data or entirely of multimodal data, with the modality varying across training steps rather than being mixed within the same batch. As shown in Figure 8, this method does not compromise the model's performance while improving training efficiency by 20%. This strategy effectively circumvents the bottleneck caused by the disparate processing times between modalities, optimizing the training workflow.

[Figure 9: Pile-test perplexity and MMBench / MMBench_CN accuracy over training steps, with and without modality warm-up.]
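Below is a minimal sketch of such modality-grouped scheduling; the 7:3 language:multimodal target comes from the text, while the scheduler itself is an illustrative assumption:

```python
import random

def modality_schedule(num_steps, language_ratio=0.7, seed=0):
    """Pick one modality per global step so every batch is single-modality.

    language_ratio=0.7 reflects the roughly 7:3 language:multimodal mix used for
    the final model; each step then draws a full batch from that modality's loader,
    so fast text-only batches never wait on slower multimodal samples.
    """
    rng = random.Random(seed)
    return ["language" if rng.random() < language_ratio else "multimodal"
            for _ in range(num_steps)]

print(modality_schedule(num_steps=10))
```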
Modality Warmup Considering that our approach involves multimodal training on the founda-
tion of a language model, directly mixing multimodal data in a fixed proportion from the outset
can destabilize the model. To counteract this issue, we propose a simple yet effective modality
warm-up strategy. Initially, we set the language data ratio to 1, and then gradually decrease it to
the target ratio for the final model training (e.g., 0.7).
Our experiments, as illustrated in Figure 9, demonstrate that this strategy effectively prevents
a significant decline in language capabilities at the beginning of training, while also yielding
comparatively superior outcomes in the final phases for both the language and multimodal do-
mains. This gradual adaptation enables the model to more seamlessly adjust to the incorporation
of multimodal data, thereby improving overall training stability and performance.
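A sketch of one possible warm-up schedule for the language-data ratio follows; the linear decay and the step count are illustrative assumptions, not the schedule used in training:

```python
def language_ratio(step, warmup_steps=2000, target=0.7):
    """Language-data fraction at a given step: starts at 1.0 (language only)
    and decays linearly to the target ratio (e.g., 0.7 for the final 7:3 mix)."""
    if step >= warmup_steps:
        return target
    return 1.0 - (1.0 - target) * step / warmup_steps

print([round(language_ratio(s), 2) for s in (0, 500, 1000, 2000, 5000)])
```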
Vision Encoder Selection In order to better acquire and utilize image information, we compare
the training loss of different vision encoders under our training settings, except that the number of stage 2 training steps is reduced to 8000 for efficiency. As illustrated in Figure 10, incorporating vision-only self-supervised encoders significantly improves the training loss. To more effectively process high-resolution images, our research ultimately adopts
a hybrid vision encoder strategy, combining SigLIP with SAM for our model’s implementation.
Vision-Language Adaptor Design To improve the efficiency of extracting information from the
visual encoder while adhering to current token length constraints, adjustments can be made to
the Vision-Language adaptor in two main ways: the method used to combine visual features
and the design of the MLP adaptor.
Previous studies (Tong et al., 2024) have indicated that combining visual features along
the sequence dimension can lead to better model performance, although this comes with the
trade-off of increased computational requirements due to a longer sequence of visual feature
tokens. As demonstrated in the top section of Table 10, reducing the sequence length by stacking visual features along the image's width or height dimensions before sequence concatenation, in order to keep the sequence length constant, does not achieve better results on most metrics than simply merging the features along the embedding dimension. In terms of the adaptor architecture, employing separate MLP adaptors for each vision feature encoder allows for more precise adjustments to the specific values and distribution patterns of visual features, facilitating smoother model training. Conversely, using a shared MLP adaptor for different vision encoders contributes to adequate feature fusion. We adopt a mixed strategy and report stable and improved performance, as outlined in the lower section of Table 10.

[Figure 10: stage-2 training loss over training steps for different vision encoders: CLIP, SigLIP, SigLIP+DINO, and SigLIP+SAM.]

Table 10 | Comparison of different adaptor architectures using SigLIP and SAM as the hybrid vision encoder; the hybrid MLP is used for the sequence-concatenation experiments. Bolded entries represent the best results, while underlined entries denote the second-best results. For calculating the average score, we divide the OCRBench score by the total number of questions.
high-resolution images without losing sight of semantic richness.
The incorporation of a hybrid vision encoder, capable of handling 1024 x 1024 images within
a constrained token budget, underscores our commitment to preserving the nuanced details
and semantic integrity across diverse tasks. As a result, DeepSeek-VL emerges as a pioneering
model that not only meets but exceeds the standards set by generalist models in its class. It
showcases exceptional performance across a wide range of visually-centric benchmarks while
sustaining formidable proficiency in language-centric evaluations.
In making DeepSeek-VL publicly available, we aim to catalyze further innovation and
exploration within the research community, providing a robust foundation upon which future
studies can build. This gesture of openness is intended to facilitate the collective advancement
of our understanding and capabilities in handling multimodal data.
Looking ahead, we are excited to announce plans to scale up DeepSeek-VL to larger sizes,
incorporating Mixture of Experts (MoE) technology. This forthcoming expansion promises to
further enhance the model’s efficiency and effectiveness, opening up new horizons for research
and application in the field of AI.
References
01-ai. Yi-34B vision language model. https://huggingface.co/01-ai/Yi-VL-34B, 2024.
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A
versatile vision-language model for understanding, localization, text reading, and beyond.
arXiv preprint arXiv:2308.12966, 2023.
L. Chen, J. Li, X. Dong, P. Zhang, C. He, J. Wang, F. Zhao, and D. Lin. Sharegpt4v: Improving
large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
C. K. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding, et al.
Icdar2019 robust reading challenge on arbitrary-shaped text-rrc-art. In 2019 International
Conference on Document Analysis and Recognition (ICDAR), pages 1571–1576. IEEE, 2019.
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. Instructblip:
Towards general-purpose vision-language models with instruction tuning, 2023.
DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv
preprint arXiv:2401.02954, 2024. URL https://github.com/deepseek-ai/DeepSeek-LLM.
X. Dong, P. Zhang, Y. Zang, Y. Cao, B. Wang, L. Ouyang, X. Wei, S. Zhang, H. Duan, M. Cao,
et al. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension
in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y. Wang, L. Hong, J. Han, H. Xu, Z. Li, et al. G-
llava: Solving geometric problem with multi-modal large language model. arXiv preprint
arXiv:2312.11370, 2023.
A. Hu, Y. Shi, H. Xu, J. Ye, Q. Ye, M. Yan, C. Li, Q. Qian, J. Zhang, and F. Huang. mplug-
paperowl: Scientific diagram analysis with the multimodal large language model. arXiv
preprint arXiv:2311.18248, 2023.
S. Kantharaj, R. T. Leong, X. Lin, A. Masry, M. Thakkar, E. Hoque, and S. Joty. Chart-to-
text: A large-scale benchmark for chart summarization. In S. Muresan, P. Nakov, and
A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 4005–4023, Dublin, Ireland, May
2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.277. URL
https://aclanthology.org/2022.acl-long.277.
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C.
Berg, W.-Y. Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M. Ferrandis, Y. Jernite, M. Mitchell, S. Hughes,
T. Wolf, D. Bahdanau, L. von Werra, and H. de Vries. The stack: 3 tb of permissively licensed
source code. In Transactions on Machine Learning Research, 2023.
V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro.
Reducing activation recomputation in large transformer models. Proceedings of Machine
Learning and Systems, 5, 2023.
I. Krylov, S. Nosov, and V. Sovrasov. Open images v5 text annotation and yet another mask text
spotter. In Asian Conference on Machine Learning, pages 379–389. PMLR, 2021.
A. Kulkarni and J. Truelsen. wkhtmltopdf. https://wkhtmltopdf.org/. Project maintained
by Ashish Kulkarni, originally created by Jakob Truelsen. Accessed: 2024-02-22.
LAION. Gpt-4v dataset. https://huggingface.co/datasets/laion/gpt4v-dataset,
2023.
B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and Y. Shan. Seed-bench: Benchmarking multimodal llms
with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
S. Li and N. Tajbakhsh. Scigraphqa: A large-scale synthetic multi-turn question-answering
dataset for scientific graphs, 2023.
Y. Li, G. Li, L. He, J. Zheng, H. Li, and Z. Guan. Widget captioning: Generating natural language
description for mobile user interface elements. arXiv preprint arXiv:2010.04295, 2020.
Y. Li, H. Mao, R. Girshick, and K. He. Exploring plain vision transformer backbones for object
detection. In European Conference on Computer Vision, pages 280–296. Springer, 2022.
Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in
large vision-language models. arXiv preprint arXiv:2305.10355, 2023b.
J. Lin, H. Yin, W. Ping, Y. Lu, P. Molchanov, A. Tao, H. Mao, J. Kautz, M. Shoeybi, and S. Han.
Vila: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023a.
Z. Lin, C. Liu, R. Zhang, P. Gao, L. Qiu, H. Xiao, H. Qiu, C. Lin, W. Shao, K. Chen, et al. Sphinx:
The joint mixing of weights, tasks, and visual embeddings for multi-modal large language
models. arXiv preprint arXiv:2311.07575, 2023b.
F. Liu, F. Piccinno, S. Krichene, C. Pang, K. Lee, M. Joshi, Y. Altun, N. Collier, and J. M. Eisenschlos.
Matcha: Enhancing visual language pretraining with math reasoning and chart derendering.
arXiv preprint arXiv:2212.09662, 2022a.
H. Liu, C. Li, Y. Li, B. Li, Y. Zhang, S. Shen, and Y. J. Lee. Llava-next: Improved reasoning, ocr,
and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. Advances in neural information
processing systems, 36, 2024b.
Y. Liu, G. Zhu, B. Zhu, Q. Song, G. Ge, H. Chen, G. Qiao, R. Peng, L. Wu, and J. Wang. Taisu: A
166m large-scale high-quality dataset for chinese vision-language pre-training. In S. Koyejo,
S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural
Information Processing Systems, volume 35, pages 16705–16717. Curran Associates, Inc.,
2022b. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/6a386d703b50f1cf1f61ab02a15967bb-Paper-Datasets_and_Benchmarks.pdf.
Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, et al.
Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281,
2023a.
Y. Liu, Z. Li, H. Li, W. Yu, M. Huang, D. Peng, M. Liu, M. Chen, C. Li, L. Jin, et al. On the hidden
mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023b.
S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis. Towards end-to-end unified
scene text detection and layout analysis. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2022.
P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, and S.-C. Zhu. Iconqa: A
new benchmark for abstract diagram understanding and visual language reasoning. arXiv
preprint arXiv:2110.13214, 2021.
P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan.
Learn to explain: Multimodal reasoning via thought chains for science question answering.
In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022a.
P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan.
Learn to explain: Multimodal reasoning via thought chains for science question answering.
Advances in Neural Information Processing Systems, 35:2507–2521, 2022b.
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao.
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv
preprint arXiv:2310.02255, 2023.
N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon,
et al. Icdar2017 robust reading challenge on multi-lingual scene text detection and script
identification-rrc-mlt. In 2017 14th IAPR international conference on document analysis and
recognition (ICDAR), volume 1, pages 1454–1459. IEEE, 2017.
OpenAI. Chatgpt: Optimizing language models for dialogue. 2022. URL https://openai.com/blog/chatgpt.
OpenAI. GPT-4 technical report. arXiv, 2023a.
R. Schaeffer, B. Miranda, and S. Koyejo. Are emergent abilities of large language models a
mirage? Advances in Neural Information Processing Systems, 36, 2024.
B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai. Icdar2017 competition
on reading chinese text in the wild (rctw-17). In 2017 14th iapr international conference on
document analysis and recognition (ICDAR), volume 1, pages 1429–1434. IEEE, 2017.
A. Singh, G. Pang, M. Toh, J. Huang, W. Galuba, and T. Hassner. Textocr: Towards large-
scale end-to-end reasoning for arbitrary-shaped scene text. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pages 8802–8812, 2021.
J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary
position embedding. Neurocomputing, 568:127063, 2024.
Q. Sun, Q. Yu, Y. Cui, F. Zhang, X. Zhang, Y. Wang, H. Gao, J. Liu, T. Huang, and X. Wang.
Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
Y. Sun, Z. Ni, C.-K. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, et al.
Icdar 2019 competition on large-scale street view text with partial labeling-rrc-lsvt. In 2019
International Conference on Document Analysis and Recognition (ICDAR), pages 1557–1562.
IEEE, 2019.
G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai,
A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint
arXiv:2312.11805, 2023.
S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie. Eyes wide shut? exploring the visual
shortcomings of multimodal llms. arXiv preprint arXiv:2401.06209, 2024.
H. Touvron, L. Martin, K. Stone, et al. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie. Coco-text: Dataset and benchmark for
text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016.
B. Wang, G. Li, X. Zhou, Z. Chen, T. Grossman, and Y. Li. Screen2words: Automatic mobile
ui summarization with multimodal learning. In The 34th Annual ACM Symposium on User
Interface Software and Technology, pages 498–510, 2021.
J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and Y.-G. Jiang. To see is to believe: Prompting gpt-4v
for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023a.
W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song, et al. Cogvlm:
Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.
H. Wei, L. Kong, J. Chen, L. Zhao, Z. Ge, J. Yang, J. Sun, C. Han, and X. Zhang. Vary: Scaling
up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109,
2023.
J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, G. Xu, C. Li, J. Tian, Q. Qian, J. Zhang, et al. Ureader:
Universal ocr-free visually-situated language understanding with multimodal large language
model. arXiv preprint arXiv:2310.05126, 2023.
Q. Yu, Q. Sun, X. Zhang, Y. Cui, F. Zhang, Y. Cao, X. Wang, and J. Liu. Capsfusion: Rethinking
image-text data at scale. arXiv preprint arXiv:2310.20550, 2023a.
W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang, and L. Wang. Mm-vet: Evaluating large
multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023b.
X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, et al.
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for
expert agi. arXiv preprint arXiv:2311.16502, 2023.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In A. Korhonen, D. R. Traum, and L. Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics, 2019. doi: 10.18653/v1/p19-1472. URL https://doi.org/10.18653/v1/p19-1472.
B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural
Information Processing Systems, 32, 2019.
G. Zhang, X. Du, B. Chen, Y. Liang, T. Luo, T. Zheng, K. Zhu, Y. Cheng, C. Xu, S. Guo, et al. Cmmmu: A Chinese massive multi-discipline multimodal understanding benchmark. arXiv preprint arXiv:2401.11944, 2024.
R. Zhang, Y. Zhou, Q. Jiang, Q. Song, N. Li, K. Zhou, L. Wang, D. Wang, M. Liao, M. Yang, et al. Icdar 2019 robust reading challenge on reading Chinese text on signboard. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1577–1581. IEEE, 2019.
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A
human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023.
doi: 10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364.
A. Appendix
[Appendix figure (text extracted from the figure; prompt and caption not recoverable): one response, translated from Chinese, explains a guess-the-number program: random.randint() generates a random integer between 1 and 10 as the answer; the program then loops until the player's guess equals the answer, checking after each guess whether it is in range and printing "You win!", "Too high!", or "Too low!" accordingly, and the game ends once the guess matches the answer. A second extracted response is a refusal: "sorry, as a language model AI, I cannot process image into any other content."]
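For clarity, the following is a minimal Python sketch of the guess-the-number program that the translated response above describes; the function name guessing_game and the exact prompt strings are illustrative assumptions, while random.randint(), the 1-to-10 range, and the win/too-high/too-low messages follow the description.

import random

def guessing_game():
    # Generate a random integer between 1 and 10 as the answer.
    answer = random.randint(1, 10)
    while True:
        guess = int(input("Guess a number between 1 and 10: "))
        # Check whether the guess is within the valid range.
        if guess < 1 or guess > 10:
            print("Your guess is out of range, try again.")
        elif guess == answer:
            # The guess equals the answer, so the game ends.
            print("You win!")
            break
        elif guess > answer:
            print("Too high!")
        else:
            print("Too low!")

if __name__ == "__main__":
    guessing_game()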
[Figure 12 (text extracted from the figure): for the prompt "Explain the image.", DeepSeek-VL describes a line graph of training loss (roughly 2.0 to 3.0) versus processed tokens (0 to 100 billion) that compares a Multi Step Learning Rate Scheduler (80% + 10% + 10%) with a Cosine Learning Rate Scheduler, noting that both curves start near a loss of 3.0 and decline as more tokens are processed, and that the multi-step schedule reaches lower losses earlier. The remaining extracted text consists of shorter, partly garbled descriptions of the same chart from unlabeled comparison responses, and of responses to a second, code-debugging example concerning a find_max_in_sequence function whose source code is not recoverable from the extraction.]
Figure 12 | Visualization results. DeepSeek-VL has strong understanding capabilities for code
and charts in the real world.
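As background for the chart summarized in Figure 12, the following is a minimal, self-contained Python sketch contrasting the two schedules named in the figure: a multi-step schedule that holds the base learning rate for the first 80% of training and then decays it at the 80% and 90% marks, and a standard cosine decay. The decay factor, base rate, and step count are illustrative assumptions, not the hyperparameters used in the paper.

import math

def multistep_lr(step, total_steps, base_lr, decay=0.316):
    # Hold base_lr for the first 80% of steps, then decay once at 80% and
    # again at 90% (the "80% + 10% + 10%" split named in the figure).
    # The decay factor 0.316 is an assumed value for illustration only.
    if step < 0.8 * total_steps:
        return base_lr
    elif step < 0.9 * total_steps:
        return base_lr * decay
    else:
        return base_lr * decay * decay

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    # Standard cosine decay from base_lr down to min_lr over training.
    progress = step / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Print both schedules at a few points of an assumed 100-step run.
for step in (0, 50, 79, 85, 95, 100):
    print(step, round(multistep_lr(step, 100, 3e-4), 6), round(cosine_lr(step, 100, 3e-4), 6))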
[Figure 13 (text extracted from the figure): for the prompt "What coin is that?", DeepSeek-VL answers that it is a 10 Baht coin from Thailand. A second example contains two poetic responses about a wintry scene, translated from Chinese: a classical couplet (roughly, "Snowy mountains loom grey over the cold sun; the frosted forest rustles with a clear sound") and a free-verse quatrain ("Bare trees stand tall, / golden hues kiss the snow, / the mountains whisper ancient tales, / the day's embrace begins to fade"). In a third example about a game character, one extracted response claims the character comes from "the game company Genshin Impact", while another identifies the character as being from Final Fantasy XIV, a game by Japan's Square Enix.]
Figure 13 | Visualization results. DeepSeek-VL possesses extensive knowledge of the real world.
[Appendix figure (text extracted from the figure): for the prompt "Convert it to markdown table.", DeepSeek-VL replies "Here is the image content converted into a markdown table:", followed by a markdown table that is not recoverable from the extraction, and adds that the "A" column in the original table appears to be empty or not fully visible in the provided image.]