
[WIP] GSoC 2025: Add Tokenizer Support to DNN Module #27534


Draft · JorgeV92 wants to merge 56 commits into base: 5.x from gsoc2025-tokenizer

Conversation


@JorgeV92 JorgeV92 commented Jul 12, 2025

Summary

This pull request introduces initial support for a tokenizer module under modules/dnn/src/tokenizer as part of Google Summer of Code 2025 (Project: Tokenization for OpenCV DNN).

Status

  • Project structure in place
  • Initial BPE tokenizer loading
  • Regex splitting (in progress)
  • Encoding logic for GPT-2 tokenizer (in progress)
  • Documentation (to be improved)

Goals

The goal is to support Hugging Face-compatible tokenization (e.g., GPT-2) natively in C++ so that it can be integrated with DNN inference pipelines.

The core pipeline lives in dnn/src/tokenizer/core_bpe.hpp and dnn/src/tokenizer/encoding.hpp. For Unicode handling I’m using dnn/src/tokenizer/unicode.hpp, which is adapted from llama.cpp.
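
For context, byte-pair encoding greedily merges the adjacent token pair with the lowest (earliest-learned) merge rank until no learned merge applies. Below is a minimal sketch of that loop; the names bpeEncode and mergeRanks are illustrative only and are not the actual core_bpe.hpp API:

#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: greedy BPE merge loop over an initial sequence of pieces.
static std::vector<std::string> bpeEncode(std::vector<std::string> pieces,
                                          const std::map<std::pair<std::string, std::string>, int>& mergeRanks)
{
    while (pieces.size() > 1)
    {
        int bestRank = INT_MAX;
        size_t bestIdx = 0;
        for (size_t i = 0; i + 1 < pieces.size(); ++i)
        {
            auto it = mergeRanks.find(std::make_pair(pieces[i], pieces[i + 1]));
            if (it != mergeRanks.end() && it->second < bestRank)
            {
                bestRank = it->second;
                bestIdx = i;
            }
        }
        if (bestRank == INT_MAX)
            break;                                  // no learned merge applies any more
        pieces[bestIdx] += pieces[bestIdx + 1];     // merge the best-ranked pair in place
        pieces.erase(pieces.begin() + bestIdx + 1);
    }
    return pieces;
}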

Feedback

Please share early feedback on:

  • General design structure
  • Integration strategy with dnn
  • Code organization or naming conventions

Reference

Project: https://summerofcode.withgoogle.com/programs/2025/projects/79SW6eNK

@fengyuentau fengyuentau self-requested a review July 14, 2025 04:25
@fengyuentau fengyuentau changed the base branch from 4.x to 5.x July 14, 2025 04:25
@fengyuentau fengyuentau requested a review from vpisarev July 14, 2025 04:25
Comment on lines 211 to 212
find_package(nlohmann_json REQUIRED)
list(APPEND libs nlohmann_json::nlohmann_json)
Contributor

Why do you need external library for JSON? OpenCV cv::FileStorage supports JSON.

Author

Thank you for the feedback, Alexander.

I looked into cv::FileStorage and have removed the dependency on nlohmann_json. I will continue exploring the functionality of cv::FileStorage moving forward.
I also spoke with Yuantao and explained that I will remove all extra dependencies from external libraries. The CMakeLists.txt will now only include the files necessary to compile the dnn/src/tokenizer module.
I’ll update this branch by the end of July 14 to remove unnecessary code, clean up the files, add documentation, and include all required references.
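
For reference, a minimal sketch of reading a JSON file with cv::FileStorage (the file name and integer values are illustrative, not the actual tokenizer files):

#include <opencv2/core.hpp>
#include <string>

cv::FileStorage fs("vocab.json", cv::FileStorage::READ);
if (!fs.isOpened())
    CV_Error(cv::Error::StsError, "Failed to open vocab.json");
// Iterate the top-level JSON object: each entry is a key/value pair.
for (const cv::FileNode& node : fs.root())
{
    std::string key = node.name(); // JSON key
    int tokenId = (int)node;       // assuming integer values, as in a vocab map
}
fs.release();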

Comment on lines 62 to 70
if (isUrl(pathOrUrl)) {
CURL *curl = curl_easy_init();
if (!curl) throw std::runtime_error("curl_easy_init() failed");
curl_easy_setopt(curl, CURLOPT_URL, pathOrUrl.c_str());
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, curlWrite);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &contents);
auto res = curl_easy_perform(curl);
curl_easy_cleanup(curl);
Contributor

OpenCV does not have CURL integration for now. Looks like you forgot find_package.
I propose not to introduce a new dependency, but to support local files only.

Author

I’ll remove these files from the repository and keep them locally. Since they aren’t essential to the tokenizer’s core functionality, leaving them out will simplify this initial draft. I’ll re-add them later once they’re independent of external libraries.
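
For the local-files-only path, a minimal sketch of reading a file into the same contents string used above (pathOrUrl is reused from the snippet, no CURL involved):

#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

std::ifstream in(pathOrUrl, std::ios::binary);
if (!in)
    throw std::runtime_error("Failed to open file: " + pathOrUrl);
std::ostringstream buf;
buf << in.rdbuf();                 // slurp the whole file
std::string contents = buf.str();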

Member

Please, drop this file from implementation. We can read it as external files.

Author

Thank you for the feedback! I will remove it and read it as an external file; the change should be in my next commit.

Member

Same as above.

Member

Same as above.

Member

What is this file for? Drop if not needed.

Author

This was used as test data for the BPE training algorithm train_bpe(); it can be removed and will be updated in the next commit.

@@ -0,0 +1,122 @@
#pragma once
Member

No need to use #pragma once here since macro __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__ is used below.
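
For reference, the conventional include-guard pattern with that macro looks like:

#ifndef __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__
#define __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__

// ... declarations ...

#endif // __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__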

Author

Got it!

Member

For any new files, you also need to add the license header, as in the other files:

// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.

Author

Okay, will do!

@@ -4,18 +4,21 @@
#include <string>
#include <vector>

#include "encoding.hpp"
// #include "../../src/tokenizer/encoding.hpp"
#include "../../../src/tokenizer/encoding.hpp"
Contributor

The header is part of the interface and binary distribution. Such includes are not allowed.

Author

Thanks for the review! I moved tokenizer.hpp into the public include tree so that the Python binding generator could see it (I’d discussed this briefly with Yuantao). Based on your comment, I’ll remove the #include of the private encoding.hpp from the header and replace it with a forward declaration in tokenizer.hpp, then include encoding.hpp only in the .cpp implementation. I’ll test that locally to make sure the Python bindings still pick up the interface correctly.

Please let me know if there’s any other approach you’d recommend or another way I should handle this. Thanks again!
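
A rough sketch of the forward-declaration approach described above (file layout and member names are assumed from this discussion, not final):

// tokenizer.hpp (public header): no private includes
#include <memory>

namespace cv { namespace dnn {

class Encoding; // forward declaration; the full definition stays in src/tokenizer/encoding.hpp

class CV_EXPORTS_W Tokenizer
{
public:
    // ... public API ...
private:
    std::shared_ptr<Encoding> enc_; // a shared_ptr member only needs the forward declaration
};

}} // namespace cv::dnn

// tokenizer.cpp (implementation file): #include "encoding.hpp" goes here instead.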

Comment on lines +2 to +6
#include "../src/tokenizer/core_bpe.hpp"
#include "../src/tokenizer/encoding.hpp"
#include "../src/tokenizer/utils.hpp"
#include "../include/opencv2/dnn/tokenizer.hpp"
#include "../src/tokenizer/gpt2_tokenizer_fast.hpp"
Member

Avoid including header files here.

The definition of newly added methods or classes should be put in dnn.hpp.

Member

This should apply to other tests as well.

Author

Thanks for the feedback; I am currently working on this and adding the Tokenizer interface to dnn.hpp.

Tokenizer(std::shared_ptr<Encoding> e,
std::string model_name = "");

CV_WRAP static Tokenizer from_pretrained(const std::string& name, const std::string& pretrained_model_path);
Contributor

Please rename it to load.

Member

Put class Tokenizer in dnn.hpp. The user-facing API should be exposed in namespace dnn rather than inside namespace tokenizer; namespace tokenizer should contain only internal-use methods.
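
Roughly, the layout being requested (a sketch, not the final header):

namespace cv { namespace dnn {

// User-facing API lives directly in cv::dnn
class CV_EXPORTS_W Tokenizer { /* ... */ };

namespace tokenizer {
// internal-only helpers (merge tables, unicode handling, ...)
} // namespace tokenizer

}} // namespace cv::dnn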

@@ -0,0 +1,57 @@
#pragma once
Member

Do not use #pragma once.

Comment on lines 23 to 27
CV_WRAP static Tokenizer from_pretrained(const std::string& name, const std::string& pretrained_model_path);
CV_EXPORTS static Tokenizer train_bpe_from_corpus(const std::string& corpus,
int vocab_sz,
const std::string& pattern);
CV_EXPORTS static Tokenizer train_bpe_from_corpus(const std::vector<std::string>& corpus,
Member

Please use the same camelCase naming style as the other methods.
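
For example (illustrative renames only):

CV_EXPORTS static Tokenizer trainBpeFromCorpus(const std::string& corpus,
                                               int vocabSz,
                                               const std::string& pattern);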

@asmorkalov
Contributor

I merged 4.x->5.x again. You should get all required fixes for FileStorage in the 5.x branch. Please rebase and check.

@fengyuentau
Member

@JorgeV92 Please rebase your branch to have new FileStorage with required fixes.

git remote add upstream https://github.com/opencv/opencv
git fetch upstream
# make sure your worktree is clean, i.e. no changes
git rebase upstream/5.x
git push -f

@fengyuentau fengyuentau left a comment (Member)

APIs should be cleaned and put in dnn.hpp:

class Tokenizer {
public:
  static Tokenizer load(const std::string &pretrained_model_path);
  std::vector<int> encode(const std::string &text);
  std::string decode(const std::vector<int> &tokens);
};

Our pipeline to use tokenizer:

import cv2 as cv

tokenizer = cv.dnn.tokenizer.load("/path/to/local/hf/repo")
input_string = "hello world!"
tokens = tokenizer.encode(input_string)
# LLM inference
output_string = tokenizer.decode(tokens)

@JorgeV92 JorgeV92 force-pushed the gsoc2025-tokenizer branch from fa351bc to 5049fb8 on August 7, 2025 19:54
@asmorkalov
Contributor

asmorkalov commented Aug 8, 2025

Small additions to the proposed interface for proper bindings generation:

class CV_EXPORTS_W Tokenizer {
public:
  CV_WRAP static Tokenizer load(CV_WRAP_FILE_PATH const std::string &pretrained_model_path);
  CV_WRAP std::vector<int> encode(const std::string &text);
  CV_WRAP std::string decode(const std::vector<int> &tokens);
};

@JorgeV92
Author

JorgeV92 commented Aug 8, 2025

With the updated branch for FileStorage, I tried reading the tokenizer.json file from gpt2, but I ran into an error when reading a key such as "\"âĢĶ: it would skip the character 'â' and give me "?ĢĶ, failing to read the correct vocab. I made a small change to modules/core/src/persistence_json.cpp so that we don't skip the character after reading the backslash and then fail to read the next character, which lets the tokenizer.json files be read correctly. This is not fully tested; I will bring it up with @fengyuentau and get his thoughts. The update does pass the test cases I wrote for reading the tokenizer json files in opencv/modules/dnn/test/test_tokenizer.cpp.

For example

TEST(Tokenizer_BPE, Tokenizer_GPT2_Model) {
    std::string gpt2_dir = getOpenCVExtraDir() + "testdata/dnn/llm/gpt2/";
    Tokenizer tok = Tokenizer::load(gpt2_dir);
    auto ids = tok.encode("hello world");
    auto text = tok.decode(ids);
    EXPECT_EQ(text, "hello world");
}
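
To illustrate the escape handling described above (a sketch of the intended behavior, not the actual persistence_json.cpp code): when a backslash is followed by a byte that is not a recognized escape, the parser should keep that byte rather than dropping it, otherwise multi-byte UTF-8 keys such as the GPT-2 vocab entries lose their first byte.

#include <string>

// Sketch only: decode a JSON string value while preserving unknown escape bytes.
static std::string decodeJsonString(const std::string& s)
{
    std::string out;
    for (size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == '\\' && i + 1 < s.size())
        {
            char c = s[++i];
            switch (c)
            {
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            default:   out += c;    break; // keep the byte instead of skipping it
            }
        }
        else
            out += s[i];
    }
    return out;
}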

@JorgeV92
Author

JorgeV92 commented Aug 8, 2025

The current branch gsoc2025-tokenizer will only have functionality for encode() and decode(); all extra functionality will live in gsoc2025-tokenizer-backup. If we ever want more functionality later, we can just move it over.
