
[WIP] GSoC 2025: Add Tokenizer Support to DNN Module #27534


Draft · JorgeV92 wants to merge 56 commits into base: 5.x from gsoc2025-tokenizer

Conversation


@JorgeV92 JorgeV92 commented Jul 12, 2025

Summary

This pull request introduces initial support for a tokenizer module under modules/dnn/src/tokenizer as part of Google Summer of Code 2025 (Project: Tokenization for OpenCV DNN).

Status

  • Project structure in place
  • Initial BPE tokenizer loading
  • Regex splitting (in progress)
  • Encoding logic for GPT-2 tokenizer (in progress)
  • Documentation (to be improved)

Goals

The goal is to support Hugging Face-compatible tokenization (e.g., GPT-2) natively in C++ so that it can be integrated with DNN inference pipelines.

The core pipeline lives in dnn/src/tokenizer/core_bpe.hpp and dnn/src/tokenizer/encoding.hpp. For Unicode handling I’m using dnn/src/tokenizer/unicode.hpp, which is adapted from llama.cpp.
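
For context, byte-pair encoding greedily merges the adjacent token pair with the lowest (earliest-learned) merge rank until no learned merge applies. Below is a minimal sketch of that loop; the names bpeEncode and mergeRanks are illustrative only and are not the actual core_bpe.hpp API:

#include <climits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: greedy BPE merge loop over an initial sequence of pieces.
static std::vector<std::string> bpeEncode(std::vector<std::string> pieces,
                                          const std::map<std::pair<std::string, std::string>, int>& mergeRanks)
{
    while (pieces.size() > 1)
    {
        int bestRank = INT_MAX;
        size_t bestIdx = 0;
        for (size_t i = 0; i + 1 < pieces.size(); ++i)
        {
            auto it = mergeRanks.find(std::make_pair(pieces[i], pieces[i + 1]));
            if (it != mergeRanks.end() && it->second < bestRank)
            {
                bestRank = it->second;
                bestIdx = i;
            }
        }
        if (bestRank == INT_MAX)
            break;                                  // no learned merge applies any more
        pieces[bestIdx] += pieces[bestIdx + 1];     // merge the best-ranked pair in place
        pieces.erase(pieces.begin() + bestIdx + 1);
    }
    return pieces;
}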

Feedback

Please share early feedback on:

  • General design structure
  • Integration strategy with dnn
  • Code organization or naming conventions

Reference

Project: https://summerofcode.withgoogle.com/programs/2025/projects/79SW6eNK

@fengyuentau fengyuentau self-requested a review July 14, 2025 04:25
@fengyuentau fengyuentau changed the base branch from 4.x to 5.x July 14, 2025 04:25
@fengyuentau fengyuentau requested a review from vpisarev July 14, 2025 04:25
Comment on lines 211 to 212
find_package(nlohmann_json REQUIRED)
list(APPEND libs nlohmann_json::nlohmann_json)
Contributor

Why do you need external library for JSON? OpenCV cv::FileStorage supports JSON.

Author

Thank you for the feedback, Alexander.

I looked into cv::FileStorage and have removed the dependency on nlohmann_json. I will continue exploring the functionality of cv::FileStorage moving forward.
I also spoke with Yuantao and explained that I will remove all extra dependencies from external libraries. The CMakeLists.txt will now only include the files necessary to compile the dnn/src/tokenizer module.
I’ll update this branch by the end of July 14 to remove unnecessary code, clean up the files, add documentation, and include all required references.
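
For reference, a minimal sketch of reading a JSON file with cv::FileStorage (the file name and integer values are illustrative, not the actual tokenizer files):

#include <opencv2/core.hpp>
#include <string>

cv::FileStorage fs("vocab.json", cv::FileStorage::READ);
if (!fs.isOpened())
    CV_Error(cv::Error::StsError, "Failed to open vocab.json");
// Iterate the top-level JSON object: each entry is a key/value pair.
for (const cv::FileNode& node : fs.root())
{
    std::string key = node.name(); // JSON key
    int tokenId = (int)node;       // assuming integer values, as in a vocab map
}
fs.release();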

Comment on lines 62 to 70
if (isUrl(pathOrUrl)) {
CURL *curl = curl_easy_init();
if (!curl) throw std::runtime_error("curl_easy_init() failed");
curl_easy_setopt(curl, CURLOPT_URL, pathOrUrl.c_str());
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, curlWrite);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &contents);
auto res = curl_easy_perform(curl);
curl_easy_cleanup(curl);
Contributor

OpenCV does not have CURL integration for now. Looks like you forgot find_package.
I propose not to introduce a new dependency, but to support local files only.

Author

I’ll remove these files from the repository and keep them locally. Since they aren’t essential to the tokenizer’s core functionality, leaving them out will simplify this initial draft. I’ll re-add them later once they’re independent of external libraries.
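
For the local-files-only path, a minimal sketch of reading a file into the same contents string used above (pathOrUrl is reused from the snippet, no CURL involved):

#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

std::ifstream in(pathOrUrl, std::ios::binary);
if (!in)
    throw std::runtime_error("Failed to open file: " + pathOrUrl);
std::ostringstream buf;
buf << in.rdbuf();                 // slurp the whole file
std::string contents = buf.str();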

Member

Please, drop this file from implementation. We can read it as external files.

Author

Thank you for the feedback! I will remove it and read it as an external file; the change should be in my next commit.

Member

Same as above.

Member

Same as above.

Member

What is this file for? Drop if not needed.

Author

This was used as test data for the BPE training algorithm train_bpe(); it can be removed and will be updated in the next commit.

@@ -0,0 +1,122 @@
#pragma once
Member

No need to use #pragma once here since macro __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__ is used below.
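
For reference, the conventional include-guard pattern with that macro looks like:

#ifndef __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__
#define __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__

// ... declarations ...

#endif // __OPENCV_DNN_SRC_TOKENIZERTOKENS_CORE_BPE_HPP__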

Author

Got it!

Member

For any new files, you also need to add the license header, as in the other files:

// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.

Author

Okay, will do!

@@ -4,18 +4,21 @@
#include <string>
#include <vector>

#include "encoding.hpp"
// #include "../../src/tokenizer/encoding.hpp"
#include "../../../src/tokenizer/encoding.hpp"
Contributor

The header is part of the interface and binary distribution. Such includes are not allowed.

Author

Thanks for the review! I moved tokenizer.hpp into the public include tree so that the Python binding generator could see it (I’d discussed this briefly with Yuantao). Based on your comment, I’ll remove the #include of the private encoding.hpp from the header and replace it with a forward declaration in tokenizer.hpp, then include encoding.hpp only in the .cpp implementation. I’ll test that locally to make sure the Python bindings still pick up the interface correctly.

Please let me know if there’s any other approach you’d recommend or another way I should handle this. Thanks again!
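
A rough sketch of the forward-declaration approach described above (file layout and member names are assumed from this discussion, not final):

// tokenizer.hpp (public header): no private includes
#include <memory>

namespace cv { namespace dnn {

class Encoding; // forward declaration; the full definition stays in src/tokenizer/encoding.hpp

class CV_EXPORTS_W Tokenizer
{
public:
    // ... public API ...
private:
    std::shared_ptr<Encoding> enc_; // a shared_ptr member only needs the forward declaration
};

}} // namespace cv::dnn

// tokenizer.cpp (implementation file): #include "encoding.hpp" goes here instead.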

Comment on lines +2 to +6
#include "../src/tokenizer/core_bpe.hpp"
#include "../src/tokenizer/encoding.hpp"
#include "../src/tokenizer/utils.hpp"
#include "../include/opencv2/dnn/tokenizer.hpp"
#include "../src/tokenizer/gpt2_tokenizer_fast.hpp"
Member

Avoid including header files here.

The definition of newly added methods or classes should be put in dnn.hpp.

Member

This should apply to other tests as well.

Author

Thanks for the feedback; I am currently working on this and adding the Tokenizer interface to dnn.hpp.

Tokenizer(std::shared_ptr<Encoding> e,
std::string model_name = "");

CV_WRAP static Tokenizer from_pretrained(const std::string& name, const std::string& pretrained_model_path);
Contributor

Please rename it to load.

Member

Put class Tokenizer in dnn.hpp. The user-facing API should be exposed in namespace dnn rather than inside namespace tokenizer; namespace tokenizer should contain only internal-use methods.
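
Roughly, the layout being requested (a sketch, not the final header):

namespace cv { namespace dnn {

// User-facing API lives directly in cv::dnn
class CV_EXPORTS_W Tokenizer { /* ... */ };

namespace tokenizer {
// internal-only helpers (merge tables, unicode handling, ...)
} // namespace tokenizer

}} // namespace cv::dnn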

@@ -0,0 +1,57 @@
#pragma once
Member

Do not use #pragma once.

Comment on lines 23 to 27
CV_WRAP static Tokenizer from_pretrained(const std::string& name, const std::string& pretrained_model_path);
CV_EXPORTS static Tokenizer train_bpe_from_corpus(const std::string& corpus,
int vocab_sz,
const std::string& pattern);
CV_EXPORTS static Tokenizer train_bpe_from_corpus(const std::vector<std::string>& corpus,
Member

Please use the same camelCase naming style as the other methods.
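
For example (illustrative renames only):

CV_EXPORTS static Tokenizer trainBpeFromCorpus(const std::string& corpus,
                                               int vocabSz,
                                               const std::string& pattern);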

@asmorkalov
Contributor

I merged 4.x->5.x again. You should get all required fixes for FileStorage in the 5.x branch. Please rebase and check.

@fengyuentau
Member

@JorgeV92 Please rebase your branch to have new FileStorage with required fixes.

git remote add upstream https://github.com/opencv/opencv
git fetch upstream
# make sure your worktree is clean, i.e. no changes
git rebase upstream/5.x
git push -f

@fengyuentau fengyuentau left a comment (Member)

APIs should be cleaned and put in dnn.hpp:

class Tokenizer {
public:
  static Tokenizer load(const std::string &pretrained_model_path);
  std::vector<int> encode(const std::string &text);
  std::string decode(const std::vector<int> &tokens);
};

Our pipeline to use tokenizer:

import cv2 as cv

tokenizer = cv.dnn.tokenizer.load("/path/to/local/hf/repo")
input_string = "hello world!"
tokens = tokenizer.encode(input_string)
# LLM inference
output_string = tokenizer.decode(tokens)

@JorgeV92 JorgeV92 force-pushed the gsoc2025-tokenizer branch from fa351bc to 5049fb8 on August 7, 2025 19:54
@asmorkalov
Contributor

asmorkalov commented Aug 8, 2025

Small additions to the proposed interface for proper bindings generation:

class CV_EXPORTS_W Tokenizer {
public:
  CV_WRAP static Tokenizer load(CV_WRAP_FILE_PATH const std::string &pretrained_model_path);
  CV_WRAP std::vector<int> encode(const std::string &text);
  CV_WRAP std::string decode(const std::vector<int> &tokens);
};

@JorgeV92
Author

JorgeV92 commented Aug 8, 2025

With the updated branch for FileStorage, I tried reading the tokenizer.json file from gpt2, but I ran into an error when reading a key such as "\"âĢĶ: it would skip the character 'â' and give me "?ĢĶ, failing to read the correct vocab. I made a small change to modules/core/src/persistence_json.cpp so that we don't skip the character after reading the backslash and then fail to read the next character, which lets the tokenizer.json files be read correctly. This is not fully tested; I will bring it up with @fengyuentau and get his thoughts. The update does pass the test cases I wrote for reading the tokenizer json files in opencv/modules/dnn/test/test_tokenizer.cpp.

For example

TEST(Tokenizer_BPE, Tokenizer_GPT2_Model) {
    std::string gpt2_dir = getOpenCVExtraDir() + "testdata/dnn/llm/gpt2/";
    Tokenizer tok = Tokenizer::load(gpt2_dir);
    auto ids = tok.encode("hello world");
    auto text = tok.decode(ids);
    EXPECT_EQ(text, "hello world");
}
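
To illustrate the escape handling described above (a sketch of the intended behavior, not the actual persistence_json.cpp code): when a backslash is followed by a byte that is not a recognized escape, the parser should keep that byte rather than dropping it, otherwise multi-byte UTF-8 keys such as the GPT-2 vocab entries lose their first byte.

#include <string>

// Sketch only: decode a JSON string value while preserving unknown escape bytes.
static std::string decodeJsonString(const std::string& s)
{
    std::string out;
    for (size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == '\\' && i + 1 < s.size())
        {
            char c = s[++i];
            switch (c)
            {
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case '"':  out += '"';  break;
            case '\\': out += '\\'; break;
            default:   out += c;    break; // keep the byte instead of skipping it
            }
        }
        else
            out += s[i];
    }
    return out;
}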

@JorgeV92
Author

JorgeV92 commented Aug 8, 2025

The current branch gsoc2025-tokenizer will only have functionality for encode() and decode(); all extra functionality will live in gsoc2025-tokenizer-backup. If we ever want more functionality later, we can just move it over.
