core: support parsing back slash \ in parseKey in FileStorage (JSON) #27587

Merged

fengyuentau
Member

Fixes #27585

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@fengyuentau fengyuentau added this to the 4.13.0 milestone Jul 28, 2025
@fengyuentau fengyuentau added the "feature", "category: core", and "port to 5.x is needed" labels Jul 28, 2025
@fengyuentau
Member Author

fengyuentau commented Jul 28, 2025

With this PR, tokenizer.json from gpt2 and qwen3 can now be read by cv::FileStorage successfully.

If possible, this PR (along with #27579) should be merged and ported to 5.x soon to unblock development of tokenizer support (#27534).
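
For readers unfamiliar with the problem, here is a minimal sketch of what resolving backslash escapes while scanning a quoted JSON key involves. It is not the code from this PR and not OpenCV's `parseKey`; the function name `parse_key` and the `SIMPLE_ESCAPES` table are hypothetical, and `\uXXXX` escapes are deliberately left out.

```python
import json

# Hypothetical illustration only, not OpenCV's parseKey implementation.
# Maps the character after a backslash to the character it stands for.
SIMPLE_ESCAPES = {
    '"': '"', '\\': '\\', '/': '/',
    'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}

def parse_key(text, pos=0):
    """Scan a quoted key starting at text[pos].

    Returns (decoded_key, index_just_past_the_closing_quote).
    """
    if pos >= len(text) or text[pos] != '"':
        raise ValueError("key must start with a double quote")
    out = []
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # An unescaped quote closes the key.
            return "".join(out), i + 1
        if ch == '\\':
            # Consume the escaped character together with the backslash,
            # so an escaped quote is never mistaken for the closing quote.
            i += 1
            if i >= len(text) or text[i] not in SIMPLE_ESCAPES:
                raise ValueError("unsupported or truncated escape")
            out.append(SIMPLE_ESCAPES[text[i]])
        else:
            out.append(ch)
        i += 1
    raise ValueError("unterminated key")

# Cross-check against the standard json module on the keys from a.json.
for token in (r'"\""', r'"\\"', r'"\\\\"'):
    key, _ = parse_key(token)
    assert key == json.loads(token)
```

The important detail is that the backslash and the character after it are consumed as a pair; a scanner that simply stops at the first `"` would cut the key `"\""` short.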

@asmorkalov
Contributor

@fengyuentau What about raw strings in python with triple quotes?

@fengyuentau
Member Author

fengyuentau commented Jul 28, 2025

@fengyuentau What about raw strings in python with triple quotes?

@asmorkalov Thank you for the suggestion! It now works as expected with the r prefix in Python.

# a.json: {"\"":1,"\\":59,"Ġ\"":366,"\\\\":6852}
import cv2 as cv
import json

# a.json
a = cv.FileStorage("a.json", cv.FileStorage_FORMAT_JSON)
print(type(a), a.getNode("\"").real(), a.getNode(r"\\").real())
# <class 'cv2.FileStorage'> 1.0 59.0

with open("a.json", "r") as f:
    a = json.load(f)
    print(type(a), a["\""], a["\\"])
    # <class 'dict'> 1 59

dkurt
dkurt previously approved these changes Jul 28, 2025
@dkurt dkurt dismissed their stale review July 28, 2025 11:04

debug print

@fengyuentau
Member Author

fengyuentau commented Jul 28, 2025

@dkurt @asmorkalov I found multiple sources saying that a backslash must be escaped in JSON as well. So when Python's json prints a key as a double backslash, that is only the escaped form shown for human readability; in memory the key is still a single backslash.

# a.json: {"\"":1,"\\":59,"Ġ\"":366,"\\\\":6852}
import json
with open("a.json", "r") as f:
    a = json.load(f)
    print(a.keys())
    # dict_keys(['"', '\\', 'Ġ"', '\\\\'])
    print(type(a), a["\""], a["\\"], a[r"\\"])
    # <class 'dict'> 1 59 6852
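
As a rough, stdlib-only sketch of the same point, the snippet below parses the content of a.json from an inline string (an assumption made so that no file is needed) and compares the in-memory key lengths with what `repr` displays. The variable names are illustrative.

```python
import json

# Same content as a.json, written as a raw string so the text reaches
# json.loads exactly as it appears in the file.
text = r'{"\"":1,"\\":59,"Ġ\"":366,"\\\\":6852}'
data = json.loads(text)

# Number of characters actually stored in each key, in file order.
lengths = [len(k) for k in data]

single = "\\"     # one backslash in memory
double = r"\\"    # two backslashes in memory

# repr() re-escapes backslashes for display, so a one-character key
# is shown with two backslashes inside the quotes.
shown = repr(single)

assert data[single] == 59
assert data[double] == 6852
```

Here `lengths` comes out as `[1, 1, 2, 2]`, which confirms that the doubled backslash seen in `dict_keys(...)` is a display artifact and not part of the stored key.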

What do you think?

@fengyuentau
Member Author

With the latest commit, we can do the following things correctly:

import cv2 as cv
import json

# a.json
a = cv.FileStorage("a.json", cv.FileStorage_FORMAT_JSON)
print(a.getNode("\"").name(), a.getNode("\\").name(), a.getNode(r"\\").name())
# " \ \\
print(type(a), a.getNode("\"").real(), a.getNode("\\").real(), a.getNode(r"\\").real())
# <class 'cv2.FileStorage'> 1.0 59.0 6852.0

with open("a.json", "r") as f:
    a = json.load(f)
    print(a.keys())
    # dict_keys(['"', '\\', 'Ġ"', '\\\\'])
    print(type(a), a["\""], a["\\"], a[r"\\"])
    # <class 'dict'> 1 59 6852

# tokenizer.json
a = cv.FileStorage("tokenizer.json", cv.FileStorage_FORMAT_JSON)
b = a.getNode("model")
c = b.getNode("vocab")
print(type(a), c.getNode("\"").real(), c.getNode("\\").real())
# <class 'cv2.FileStorage'> 1.0 59.0

with open("tokenizer.json", "r") as f:
    a = json.load(f)
    b = a["model"]
    c = b["vocab"]
    print(type(a), c["\""], c["\\"])
    # <class 'dict'> 1 59

The only remaining difference is that our key is not displayed the same way as in Python's json output. I think that is fine, since it is just a different way of presenting the same information.

@asmorkalov
Contributor

I would say this is a case where a real file in opencv_extra is better than a hard-coded solution. Hard-coding is error-prone, and with a file we do not need to re-check escape sequences and other syntactic sugar in a particular programming language.

@fengyuentau
Member Author

I would say this is a case where a real file in opencv_extra is better than a hard-coded solution. Hard-coding is error-prone, and with a file we do not need to re-check escape sequences and other syntactic sugar in a particular programming language.

The current test uses a raw string literal; that should be clear enough for readability and maintainability.
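
To illustrate the trade-off being discussed, here is a small Python sketch (the actual test in this PR is not reproduced here, and the variable names are made up) showing the same JSON text written once as a raw string and once with language-level escapes. Two escaping layers are involved: the host language's string literal and JSON itself.

```python
import json

# JSON text with two keys: a single backslash and a double backslash.
# Raw literal: what you type is exactly what the JSON parser receives.
raw_literal = r'{"\\":59,"\\\\":6852}'

# Ordinary literal: every backslash must be doubled again for Python,
# which is the error-prone part of hard-coding such test data.
escaped_literal = '{"\\\\":59,"\\\\\\\\":6852}'

# Both literals denote the same JSON text.
same_text = raw_literal == escaped_literal

parsed = json.loads(raw_literal)
one_backslash_value = parsed["\\"]     # key is one backslash
two_backslash_value = parsed[r"\\"]    # key is two backslashes
```

With a raw literal only the JSON escaping layer remains to be checked by the reader, which is the readability argument for it; a data file removes the language layer entirely.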

@asmorkalov asmorkalov merged commit 07cf36c into opencv:4.x Jul 29, 2025
54 of 55 checks passed
@fengyuentau fengyuentau deleted the 4x/core/filestorage_json_support_backslash branch July 30, 2025 05:55
@asmorkalov asmorkalov mentioned this pull request Jul 30, 2025
Development

Successfully merging this pull request may close these issues.

Support back slash in parseKey in FileStorage (JSON)
4 participants