## Deployment
### vLLM
vLLM supports offline batched inference as well as launching an OpenAI-compatible API service for online inference.

#### Environment Preparation
Since the Pull Request (PR) has not yet been submitted to the vLLM community, please prepare the environment by following the steps below:
```bash
git clone -b v0.7.3 https://github.com/vllm-project/vllm.git
cd vllm
git apply Ling/inference/vllm/bailing_moe.patch
pip install -e .
```
#### Offline Inference:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ling-lite")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

llm = LLM(model="inclusionAI/Ling-lite")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

# Render the chat messages into a single prompt string using the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate and print the model's response.
outputs = llm.generate([text], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
#### Online Inference:
```bash
VLLM_USE_V1=1 vllm serve inclusionAI/Ling-lite \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 1 \
    --use-v2-block-manager \
    --gpu-memory-utilization 0.90
```
For detailed guidance, please refer to the vLLM [instructions](https://docs.vllm.ai/en/latest/).
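Once the server is running, you can send requests to its OpenAI-compatible chat completions endpoint. The sketch below assumes the default listening address `http://localhost:8000`; adjust the host and port if you configure them differently when launching `vllm serve`.
```bash
# Query the OpenAI-compatible endpoint exposed by `vllm serve`
# (assumes the default address http://localhost:8000).
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "inclusionAI/Ling-lite",
        "messages": [
            {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
            {"role": "user", "content": "Give me a short introduction to large language models."}
        ],
        "temperature": 0.7,
        "max_tokens": 512
    }'
```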
### MindIE