Displaying the numbers of documents parsed per second #652
Conversation
…rsed per second. Obviously, this means reusing the same parser again and again.
Oh yes, that's SUPER nice and absolutely worth having a competition benchmark for. It would be immediately convincing to me. GB/s takes me time to decide whether it'll make my server faster or not.
Looks good.
@jkeiser I have added a sentence in the README within this PR about our ability to parse millions of JSON documents per second (on a single core).
I will open an issue regarding the "competition" because I don't want to start anything big right now.
I do wonder if we can make parsingcompetition spit out its results in documents/sec instead of GB/s, assuming it correctly accounts for allocation. Assuming users reuse their input buffers, it might be a reasonable proxy for requests/second.
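(If the benchmark reports GB/s and the document size is known, the conversion to documents/second is a single division. A hypothetical helper, not part of parsingcompetition, assuming one document per request:)

// Hypothetical helper: convert a GB/s throughput figure into documents/second,
// assuming each request carries one document of document_bytes bytes.
double documents_per_second(double gigabytes_per_second, double document_bytes) {
  return gigabytes_per_second * 1e9 / document_bytes;
}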
…number of documents parsed per second.
@jkeiser I just hacked parsingcompetition so that it spits out documents/sec but it is not very reliable. One should do a finer job. I'll open an issue.
Looks good. It's a little more reliable (though not perfect) if you up the repeat multiplier to 100 with -r 100:
@jkeiser Changing the number of repetitions exposes a bug in my new code for parsingcompetition. Let me check.
Ok. Should be good now. |
This should be good enough for a release now. |
@lemire any reason you're re-measuring total time instead of just using sumclockdiff? I used min_sumclockdiff and sumclockdiff for best/avg, and came up with this:
if (verbose) \
  printf(" %13.0f Kdocuments/s (best)", 1.0/min_sumclockdiff); \
if (verbose) \
  printf(" %13.0f Kdocuments/s (avg)", 1.0/(sumclockdiff/repeat)); \
Round numbers like this are a lot easier to appreciate in terms of benchmarks, too :)
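For reference, here is a rough sketch, not the actual benchmark.h macro, of how accumulators like sumclockdiff and min_sumclockdiff are typically maintained across repeats, so that "best" uses the minimum per-repeat time and "avg" uses the mean; the function name and structure are illustrative only.

#include <algorithm>
#include <chrono>

template <typename F>
void time_repeats(F &&run_once, int repeat,
                  double &sumclockdiff, double &min_sumclockdiff) {
  sumclockdiff = 0.0;        // total time across all repeats
  min_sumclockdiff = 1e300;  // smallest single-repeat time seen so far
  for (int i = 0; i < repeat; i++) {
    auto start = std::chrono::steady_clock::now();
    run_once();  // e.g., one full parse of the document
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    sumclockdiff += elapsed.count();
    min_sumclockdiff = std::min(min_sumclockdiff, elapsed.count());
  }
}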
Let us do it your way. |
For twitter:
4 million documents/s has such a nice ring to it! RapidJSON's best time is 540 thousand, sajson has a respectable 1.5 mil. I am still utterly boggled that we're faster than getline...
I have updated the code so that it does what @jkeiser suggested. I like how we have two numbers (avg, best) so we can get an idea as to whether the numbers can be trusted. I still think that this whole benchmark will fail you on small files, and it is easy to fix, but I don't want to go there today.
Looks right to me! +1 The x64 CI perf check is going really haywire recently. I don't think it's your change. |
Merged. |
Some users are interested, as a metric, in the number of documents parsed per second.
Obviously, this means reusing the same parser again and again. Suppose that you have a target: you want to parse documents like "twitter_timeline.json" that you will receive in quick succession... how many could you parse per second?
So, roughly 42,000 per second.
And, of course, for small documents (say demo.json), I achieve speeds in the millions of documents per second...
For an even smaller document ({"status":"success"}), I reach nearly 4 million documents per second. The implementation is damn ugly, but I have tried to work within the "benchmarker" framework. I probably could have done a better job.
I think it would be nice, before pushing out version 0.3, to add a remark about our ability to parse millions of documents per second. But, of course, this needs to be backed by an actual benchmark.
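For context, here is a minimal sketch of what "reusing the same parser again and again" can look like. It uses the current simdjson::dom::parser API (which may differ from the API available at the time of this PR), relies on exceptions, and uses an example file path, so it is an illustration rather than the benchmarker code used in this PR.

#include <chrono>
#include <cstdio>
#include "simdjson.h"

int main() {
  // Load the document once; the path is only an example.
  simdjson::padded_string json =
      simdjson::padded_string::load("jsonexamples/twitter.json");
  simdjson::dom::parser parser;  // reuse the same parser (and its buffers) every iteration
  const int repeat = 1000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeat; i++) {
    simdjson::dom::element doc = parser.parse(json);  // throws on parse error
    (void)doc;
  }
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
  std::printf("%.0f documents per second\n", repeat / elapsed.count());
  return 0;
}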