[WIP] GSoC 2025: Add Tokenizer Support to DNN Module #27534
base: 5.x
Conversation
modules/dnn/CMakeLists.txt
Outdated
find_package(nlohmann_json REQUIRED)
list(APPEND libs nlohmann_json::nlohmann_json)
Why do you need an external library for JSON? OpenCV's cv::FileStorage supports JSON.
Thank you for the feedback, Alexander.
I looked into cv::FileStorage and have removed the dependency on nlohmann_json. I will continue exploring the functionality of cv::FileStorage moving forward.
I also spoke with Yuantao and explained that I will remove all extra dependencies on external libraries. The CMakeLists.txt will now only include the files necessary to compile the dnn/src/tokenizer module.
I’ll update this branch by the end of July 14 to remove unnecessary code, clean up the files, add documentation, and include all required references.
if (isUrl(pathOrUrl)) {
    CURL *curl = curl_easy_init();
    if (!curl) throw std::runtime_error("curl_easy_init() failed");
    curl_easy_setopt(curl, CURLOPT_URL, pathOrUrl.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, curlWrite);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &contents);
    auto res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
OpenCV does not have CURL integration for now. Looks like you forgot find_package.
I propose not to introduce a new dependency, but to support local files only.
I’ll remove these files from the repository and keep them locally. Since they aren’t essential to the tokenizer’s core functionality, leaving them out will simplify this initial draft. I’ll re-add them later once they’re independent of external libraries.
Please drop this file from the implementation. We can read it as an external file.
Thank you for the feedback! I will remove it and read it externally; it should be in with my next commit.
Same as above.
modules/dnn/src/tokenizer/vocab.bpe
Outdated
Same as above.
What is this file for? Drop if not needed.
This was used as the data to test training in the BPE algorithm train_bpe(), but it can be removed and will be updated in the next commit.
@@ -0,0 +1,122 @@
#pragma once
No need to use #pragma once here since the macro __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__ is used below.
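For reference, the classic include-guard style the reviewer is pointing at looks like the fragment below (using the guard macro mentioned above; the declarations placeholder is illustrative):

```cpp
// Classic include guard, used across OpenCV sources instead of #pragma once.
#ifndef __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__
#define __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__

// ... declarations ...

#endif // __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__
```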
Got it!
For any new files, you also need to add the license header used in the other files:
// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.
Okay will do so!
@@ -4,18 +4,21 @@
#include <string>
#include <vector>

#include "encoding.hpp"
// #include "../../src/tokenizer/encoding.hpp"
#include "../../../src/tokenizer/encoding.hpp"
The header is part of the interface and binary distribution. Such includes are not allowed.
Thanks for the review! I moved tokenizer.hpp into the public include tree so that the Python binding generator could see it (I’d discussed this briefly with Yuantao). Based on your comment, I’ll remove the #include of the private encoding.hpp from the header and replace it with a forward declaration in tokenizer.hpp, then include encoding.hpp only in the .cpp implementation. I’ll test that locally to make sure the Python bindings still pick up the interface correctly.
Please let me know if there’s any other approach you’d recommend or way I should handle this. Thanks again!
#include "../src/tokenizer/core_bpe.hpp"
#include "../src/tokenizer/encoding.hpp"
#include "../src/tokenizer/utils.hpp"
#include "../include/opencv2/dnn/tokenizer.hpp"
#include "../src/tokenizer/gpt2_tokenizer_fast.hpp"
Avoid including header files here.
The definitions of newly added methods or classes should be put in dnn.hpp.
This should apply to other tests as well.
Thanks for the feedback! I am currently working on fixing this and adding the Tokenizer interface in dnn.hpp.
Tokenizer(std::shared_ptr<Encoding> e,
          std::string model_name = "");

CV_WRAP static Tokenizer from_pretrained(const std::string& name, const std::string& pretrained_model_path);
Please rename it to load.
Put class Tokenizer in dnn.hpp. The user-targeted API should be exposed in namespace dnn rather than inside namespace tokenizer. Namespace tokenizer should contain only internal-use methods.
@@ -0,0 +1,57 @@
#pragma once
Do not use #pragma once.
CV_WRAP static Tokenizer from_pretrained(const std::string& name, const std::string& pretrained_model_path);
CV_EXPORTS static Tokenizer train_bpe_from_corpus(const std::string& corpus,
                                                  int vocab_sz,
                                                  const std::string& pattern);
CV_EXPORTS static Tokenizer train_bpe_from_corpus(const std::vector<std::string>& corpus,
Please use the same camelCase naming style as the other methods.
I merged 4.x->5.x again. You should get all required fixes for FileStorage in the 5.x branch. Please rebase and check.
@JorgeV92 Please rebase your branch to get the new FileStorage with the required fixes.
APIs should be cleaned up and put in dnn.hpp:
class Tokenizer {
public:
    static Tokenizer load(const std::string &pretrained_model_path);
    std::vector<int> encode(const std::string &text);
    std::string decode(const std::vector<int> &tokens);
};
Our pipeline to use tokenizer:
import cv2 as cv
tokenizer = cv.dnn.tokenizer.load("/path/to/local/hf/repo")
input_string = "hello world!"
tokens = tokenizer.encode(input_string)
# LLM inference
output_string = tokenizer.decode(tokens)
Small additions to the proposed interface for proper bindings generation:
With the updated branch, the proposed interface can be exercised in a test. For example:
TEST(Tokenizer_BPE, Tokenizer_GPT2_Model) {
    std::string gpt2_dir = getOpenCVExtraDir() + "testdata/dnn/llm/gpt2/";
    Tokenizer tok = Tokenizer::load(gpt2_dir);
    auto ids = tok.encode("hello world");
    auto text = tok.decode(ids);
    EXPECT_EQ(text, "hello world");
}
The current branch
Summary
This pull request introduces initial support for a tokenizer module under modules/dnn/src/tokenizer as part of Google Summer of Code 2025 (Project: Tokenization for OpenCV DNN).
Status
Goals
The goal is to support Hugging Face-compatible tokenization (e.g., GPT-2) natively in C++, to be integrated with DNN inference pipelines.
The core pipeline lives in dnn/src/tokenizer/core_bpe.hpp and dnn/src/tokenizer/encoding.hpp. For Unicode handling I’m using dnn/src/tokenizer/unicode.hpp, which is adapted from llama.cpp.
Feedback
Please share early feedback on:
dnn
Reference
Project: https://summerofcode.withgoogle.com/programs/2025/projects/79SW6eNK