
[Inference] Flask server compatible with OpenAI api. #9828


Merged: 5 commits merged into PaddlePaddle:develop from fix_flask_server on Feb 12, 2025

Conversation


@ZHUI (Collaborator) commented Feb 6, 2025

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

Bug fixes

  1. Make the Flask server compatible with the OpenAI API.
  2. Make the Gradio UI compatible with reasoning models.

PR changes

Others

Description

Start the server:

python ./predict/flask_server.py \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --port 8010 \
    --flask_port 8011 \
    --dtype "float16"
  1. Access the Flask server with the OpenAI client
from openai import OpenAI

api_key = "EMPTY"
api_base = "http://localhost:8011/v1/"

client = OpenAI(
    api_key=api_key,
    base_url=api_base,
)

# Chat Completions API
stream = True
completion = client.chat.completions.create(
    model="paddlenlp",
    messages=[
        {"role": "user", "content": "PaddleNLP好厉害!这句话的感情色彩是?"}
    ],
    max_tokens=1024,
    stream=stream,
)

if stream:
    for c in completion:
        print(c.choices[0].delta.content or "", end="")  # delta.content may be None in some chunks
else:
    print(completion.choices[0].message.content)

Response:

<think>
Hmm, the user is asking about the emotional tone of the sentence "PaddleNLP好厉害!" ("PaddleNLP is amazing!"). First, I need to analyze the sentence's structure and word choice.

The phrase "好厉害" is itself complimentary, expressing admiration or praise. "PaddleNLP" sounds like the name of a project or tool, and "厉害" here emphasizes how powerful and effective it is. Taken as a whole, this is a positive evaluation.

Next, when judging emotional tone, one usually checks whether the relationship between the subject, the predicate, and the object is positive. Here the subject is "PaddleNLP", the predicate is "好", and the object is likewise a positive description, so the whole sentence expresses affirmation and appreciation.

Thinking further, the user may be using PaddleNLP for some task, feels that it performs well, and wants to express that. This could mean they are learning it, or want to share a successful experience from a project.

It is also possible that the user wants to know how to describe this kind of tool properly, in order to communicate with others or record their own experience. In either case, clearly stating that the emotional tone is positive is helpful.

To summarize: the overall emotional tone of "PaddleNLP好厉害!" is very positive, expressing high praise for the project.
</think>

The sentence "PaddleNLP好厉害!" carries a very positive emotional tone; it is an expression of positive sentiment.
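
A reasoning model wraps its chain of thought in <think> tags, as in the reply above, so a UI (such as the Gradio one this PR adapts) has to separate that block from the final answer. A minimal sketch, assuming a single <think>...</think> block per reply; the split_reasoning helper is illustrative, not code from this PR:

def split_reasoning(text: str) -> tuple[str, str]:
    # Illustrative helper, not part of this PR: split a reply into
    # (thinking, answer), assuming one <think>...</think> block.
    if "<think>" in text and "</think>" in text:
        before, _, rest = text.partition("<think>")
        thinking, _, answer = rest.partition("</think>")
        return thinking.strip(), (before + answer).strip()
    return "", text.strip()

thinking, answer = split_reasoning("<think>analyze the sentence...</think>Positive.")
print(answer)  # -> Positive.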

  2. Access with curl

curl http://127.0.0.1:8011/v1/chat/completions \
 -H 'Content-Type: application/json' \
 -d '{"messages": [{"role": "user", "content": "PaddleNLP好厉害!这句话的感情色彩是?"}]}'


paddle-bot bot commented Feb 6, 2025

Thanks for your contribution!

@ZHUI ZHUI marked this pull request as ready for review February 6, 2025 10:04
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

- source = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in source]
+ # source = [self.tokenizer.apply_chat_template(sentence, tokenize=False) for sentence in source]
+ source = self.tokenizer.apply_chat_template(source, tokenize=False)
ZHUI (Collaborator, Author):

@DrownFish19 Please help confirm this part.

Collaborator:

I suggest keeping the original approach, so the code adapts to the different input forms.


codecov bot commented Feb 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.06%. Comparing base (54b8882) to head (1c93476).
Report is 346 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9828      +/-   ##
===========================================
+ Coverage    51.53%   52.06%   +0.52%     
===========================================
  Files          734      734              
  Lines       118703   116443    -2260     
===========================================
- Hits         61176    60625     -551     
+ Misses       57527    55818    -1709     

☔ View full report in Codecov by Sentry.

@@ -218,7 +218,12 @@ def __init__(self, config: PredictorArgument, tokenizer: PretrainedTokenizer = N

    def _preprocess(self, source):
        if self.tokenizer.chat_template is not None:
            source = [source] if isinstance(source, str) else source
            # for str -> List[str], e.g. "hello"
            # for List[str] -> List[str], e.g. ["hello", "hello new"]
Collaborator:

A check should be added here. self.tokenizer.apply_chat_template

  • directly supports inputs of the forms str, List[List[str]], and List[dict];
  • does not directly support List[str], which must be looped over and processed string by string.

for List[List[str]] -> List[List[str]]
e.g.
[
    [
        "Hello, how are you?",
        "I'm doing great. How can I help you today?",
    ],
    ["I'd like to show off how chat templating works!"],
]

Collaborator:

Alternatively, invert the check:

# if not (isinstance(source, list) and isinstance(source[0], str)):
if not isinstance(source, list) or not isinstance(source[0], str):
    source = [source]
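
Putting the two review suggestions together, the normalization being discussed might look like the sketch below. It is illustrative only (the helper name is hypothetical), assuming, as stated above, that apply_chat_template directly accepts str, List[List[str]], and List[dict] but not a flat List[str]:

def render_with_chat_template(tokenizer, source):
    # Hypothetical helper for illustration, not the merged code.
    if isinstance(source, list) and source and isinstance(source[0], str):
        # A flat List[str] is not supported directly: handle each string on its own.
        return [tokenizer.apply_chat_template(s, tokenize=False) for s in source]
    # str, List[List[str]] and List[dict] can be passed through in one call.
    return tokenizer.apply_chat_template(source, tokenize=False)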

@ZHUI merged commit 85b77f2 into PaddlePaddle:develop on Feb 12, 2025
9 of 12 checks passed
@ZHUI deleted the fix_flask_server branch on February 12, 2025 05:32
ckl117 pushed a commit to ckl117/PaddleNLP that referenced this pull request Feb 17, 2025
@ckl117 mentioned this pull request on Feb 17, 2025