Skip to content

Simplify utils.find_json function #68253

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: 3006.x
Choose a base branch
from

Conversation

agraul
Copy link
Collaborator

@agraul agraul commented Aug 12, 2025

What does this PR do?

Instead of trying to parse all combinations of json blobs to find the largest valid one, track open JSON objects / lists and try to parse them one-by-one, ordered by size.

This approach prioritizes speed over correctness. We don't track strings inside JSON, i.e. strings with odd numbers of opening/closing brackets/braces break the algorithm.

What issues does this PR fix or reference?

Fixes #68258

Previous Behavior

Parsing long JSON strings takes a lot of time.

New Behavior

Parsing long JSON strings is quick.

Merge requirements satisfied?

[NOTICE] Bug fixes or features added to Salt require tests.

Commits signed with GPG?

Yes

@m-czernek
Copy link
Contributor

m-czernek commented Aug 13, 2025

This PR stems from a user trying to parse a very large JSON with find_json. The problem is that original function did an exhaustive search at [0]. With a test JSON that has ~3MB, ~74k lines, the original function finishes correctly at slightly below 4 hours on my test VM:

$ time python3 test.py

real	232m14.408s
user	231m52.357s
sys	    0m19.929s

With the same JSON but the linear complexity algorithm implemented in this PR, the same JSON returns much quicker:

$ time python3 test.py

real	0m0.540s
user	0m0.528s
sys	    0m0.012s

So we're looking at orders of magnitude speedup in this PR (as is expected when we go from multiplicative big-O complexity to linear).

I cannot share the specific JSON I use for testing (because it has customer data), but should you want to replicate this, it shouldn't be difficult to create a valid JSON with a lot of opening/closing braces inside of the JSON.

[0]

salt/salt/utils/json.py

Lines 50 to 56 in 4a84820

for start, start_char in starts:
for end, end_br in reversed(ends):
if end > start and (
(start_char == "{" and end_br == "}")
or (start_char == "[" and end_br == "]")
):
starts_ends.append((start, end, sum(lengths[start : end + 1])))

@agraul agraul force-pushed the simplify-utils-find-json-master branch from 90f47d8 to 43119ef Compare August 13, 2025 13:36
@agraul agraul changed the base branch from master to 3006.x August 13, 2025 13:36
@agraul agraul force-pushed the simplify-utils-find-json-master branch from 43119ef to cfec538 Compare August 13, 2025 13:37
@agraul agraul marked this pull request as ready for review August 13, 2025 13:37
@agraul agraul requested a review from a team as a code owner August 13, 2025 13:37
@agraul agraul force-pushed the simplify-utils-find-json-master branch from cfec538 to 60cafab Compare August 13, 2025 13:53
Copy link
Contributor

@twangboy twangboy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind creating a test for this?

@twangboy twangboy added the needs-testcase PR needs test cases written, or the issue is about a bug/feature that needs test cases label Aug 14, 2025
@twangboy twangboy added this to the Sulfur v3006.15 milestone Aug 14, 2025
@twangboy twangboy linked an issue Aug 14, 2025 that may be closed by this pull request
@agraul
Copy link
Collaborator Author

agraul commented Aug 15, 2025

I wasn't happy with the fact that I traded correctness for speed and found a better way works better and is faster.

The below is a benchmark of three implementations: incorrect.py (same as this PR), handrolled.py (this PR with an additional heuristic that improves correctness slightly), and raw_decode.py (takes advantage of JSONDecoder.raw_decode(). I ran this on my laptop with the same ~3MB input as in #68253 (comment). While handrolled is not noticably slower than the implementation in this PR, readability is decreased a little. But that does not matter, the raw_decode.py approach is 5 times faster than both.

% hyperfine --warmup 3 'python3 ./handrolled.py' 'python3 ./raw_decode.py' 'python3 ./incorrect.py'
Benchmark 1: python3 ./handrolled.py
  Time (mean ± σ):     352.5 ms ±  16.2 ms    [User: 330.8 ms, System: 20.9 ms]
  Range (min … max):   338.1 ms … 391.4 ms    10 runs
 
Benchmark 2: python3 ./raw_decode.py
  Time (mean ± σ):      68.1 ms ±   0.9 ms    [User: 46.1 ms, System: 22.2 ms]
  Range (min … max):    66.7 ms …  71.2 ms    42 runs
 
Benchmark 3: python3 ./incorrect.py
  Time (mean ± σ):     354.9 ms ±  17.8 ms    [User: 332.4 ms, System: 22.3 ms]
  Range (min … max):   338.0 ms … 396.2 ms    10 runs
 
Summary
  python3 ./raw_decode.py ran
    5.18 ± 0.25 times faster than python3 ./handrolled.py
    5.22 ± 0.27 times faster than python3 ./incorrect.py

@twangboy there are already tests in tests/unit/utils/test_json.py, but I'm adding another test case that my first implementation didn't pass.

The previous implementation computed all combinations of potential JSON
documents and tried to `json.loads()`them. That resumted in num({) *
num(}) tries, which could take hours on large inputs.

The approach implemented with this change simplifies the work we do: we
only look for opening '{' and '[' characters, and try to parse the rest
of input string with JSONDecoder.raw_decode. This method ignores
extraneous data at the end and is faster than doing it ourselves in
Python.
@@ -49,6 +49,12 @@ class JSONTestCase(TestCase):
)
)

def test_find_json_unbalanced_brace_in_string(self):
test_sample_json = '{"title": "I like curly braces like this one:{"}'
Copy link
Contributor

@bdrx312 bdrx312 Aug 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements. Can you also add this same test for square bracket [?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-testcase PR needs test cases written, or the issue is about a bug/feature that needs test cases
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] utils.json.find_json takes up too much time
4 participants