1
- I'll write a few of the design features / innovations here:
1
+ How simdjson's On Demand Parsing works
2
+ ======================================
2
3
3
- ## Classes
4
+ Current JSON parsers generally have either ease of use or performance. Very few have both at
5
+ once. simdjson's On Demand API bridges that gap with a familiar, friendly DOM API and the
6
+ performance of just-in-time parsing on top of the simdjson core's legendary performance.
7
+
8
+ To achieve ease of use, we mimicked the * form* of a traditional DOM API: you can iterate over
9
+ arrays, look up fields in objects, and extract native values like double, uint64_t, string and bool.
10
+ To achieve performance, we introduced some key limitations that make the DOM API * streaming* :
11
+ array/object iteration cannot be restarted, and fields must be looked up in order, and string/number
12
+ values can only be parsed once.
13
+
14
+ ``` c++
15
+ ondemand::parser parser;
16
+ auto doc = parser.iterate(json);
17
+ for (auto tweet : doc[" statuses" ]) {
18
+ std::string_view text = tweet[ "text"] ;
19
+ std::string_view screen_name = tweet[ "user"] [ "screen_name" ] ;
20
+ uint64_t retweets = tweet[ "retweet_count"] ;
21
+ uint64_t favorites = tweet[ "favorite_count"] ;
22
+ cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
23
+ }
24
+ ```
25
+
26
+ This streaming approach means that fields or values you don't use don't have to get parsed or
27
+ converted, saving space and time.
28
+
29
+ Further, the On Demand API doesn't parse a value * at all* until you try to convert it to double,
30
+ int, string, or bool. Because you have told it the type at that point, it can avoid the the key
31
+ "what type is this" branch present in almost all other parsers, avoiding branch misprediction that
32
+ cause massive (sometimes 2-4x) slowdowns.
33
+
34
+ Parsers Today
35
+ -------------
36
+
37
+ To understand exactly what's happening here and why it's different, it's helpful to review the major
38
+ approaches to parsing and parser APIs in use today.
39
+
40
+ ### Generic DOM
41
+
42
+ Many of the most usable, popular JSON APIs deserialize into a ** DOM** : an intermediate tree of
43
+ objects, arrays and values. The resulting API lets you refer to each array or object separately,
44
+ using familiar techniques like iteration (` for (auto value : array) ` ) or indexing (` object["key"] ` ).
45
+ In some cases the values are even deserialized straight into familiar C++ constructs like vector and
46
+ map.
47
+
48
+ This model is dead simple to use, since it talks in terms of * data types* instead of JSON. It's
49
+ often easy enough that many users use the deserialized JSON as-is instead of deserializing into
50
+ their own custom structs, saving a ton of development work.
51
+
52
+ simdjson's DOM parser is one such example. It looks very similar to the ondemand example, except
53
+ it calls ` parse ` instead of ` iterate ` :
54
+
55
+ ``` c++
56
+ dom::parser parser;
57
+ auto doc = parser.parse(json);
58
+ for (auto tweet : doc[" statuses" ]) {
59
+ std::string_view text = tweet[ "text"] ;
60
+ std::string_view screen_name = tweet[ "user"] [ "screen_name" ] ;
61
+ uint64_t retweets = tweet[ "retweet_count"] ;
62
+ uint64_t favorites = tweet[ "favorite_count"] ;
63
+ cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
64
+ }
65
+ ```
66
+
67
+ Pros and cons of generic DOM:
68
+ * Straightforward, user-friendly interface (arrays and objects)
69
+ * No lifetime concerns (arrays and objects are often independent of JSON text and parser internal state)
70
+ * Parses and stores everything, using memory/CPU even on values you don't end up using (cost can be brought down some with lazy numbers/strings and top-level iterators)
71
+ * Values stay in memory even if you only use them once
72
+ * Heavy performance drain from [ type blindness] ( #type-blindness ) .
73
+
74
+ ### SAX (SAJ?)
75
+
76
+ The SAX model ("Streaming API for XML") uses streaming to eliminate the high cost of
77
+ parsing and storing the entire JSON. In the SAX model, a core JSON parser parses the JSON document
78
+ piece by piece, but instead of stuffing values in a DOM tree, it passes each value to a callback,
79
+ letting the user use the value and decide for themselves whether to discard it and where to store
80
+ it. or discard it.
81
+
82
+ This allows users to work with much larger files without running out of memory. Some SAX APIs even
83
+ allow the user to skip values entirely, lightening the parsing burden as well.
84
+
85
+ The big drawback is complexity: SAX APIs generally have you define a single callback for each type
86
+ (e.g. ` string_field(std::string_view key, std::string_view value) ` ). Because of this, you suffer
87
+ from context blindness: when you find a string you have to check where it is before you know what to
88
+ do with it. Is this string the text of the tweet, the screen name, or something else I don't even
89
+ care about? Are we even in a tweet right now, or is this from some other place in the document
90
+ entirely?
91
+
92
+ The following is SAX example of the Twitter problem we've seen in the Generic DOM and On Demand
93
+ examples. To make it short enough to use as an example at all, it's heavily redacted: it only solves
94
+ a part of the problem (doesn't get user.screen_name), it has bugs (it doesn't handle sub-objects
95
+ in a tweet at all), and it uses a theoretical, simple SAX API that minimizes ceremony and skips over
96
+ the issue of lazy parsing and number types entirely.
97
+
98
+ ``` c++
99
+ struct twitter_callbacks {
100
+ bool in_statuses;
101
+ bool in_tweet;
102
+ std::string_view text;
103
+ uint64_t retweets;
104
+ uint64_t favorites;
105
+ void start_object_field(std::string_view key) {
106
+ if (key == "statuses") { in_statuses = true; }
107
+ }
108
+ void start_object() {
109
+ if (in_statuses) { in_tweet = true; }
110
+ }
111
+ void string_field(std::string_view key, std::string_view value) {
112
+ if (in_tweet && key == "text") { text = value; }
113
+ }
114
+ void number_field(std::string_view key, uint64_t value) {
115
+ if (in_tweet) {
116
+ if (key == "retweet_count") { retweets = value; }
117
+ if (key == "favorite_count") { favorites = value; }
118
+ }
119
+ }
120
+ void end_object() {
121
+ if (in_tweet) {
122
+ cout << "[ redacted] (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
123
+ in_tweet = false;
124
+ } else if (in_statuses) {
125
+ in_statuses = false;
126
+ }
127
+ }
128
+ };
129
+ sax::parser parser;
130
+ parser.parse(twitter_callbacks());
131
+ ```
132
+
133
+ This is a startling amount of code, requiring mental gymnastics even to read, and in order to get it
134
+ this small and illustrate basic usage, *it has bugs* and skips over parsing user.screen_name
135
+ entirely. The real implementation is much, much harder to write (and to read).
136
+
137
+ Pros and cons of SAX (SAJ):
138
+ * Speed and space benefits from low, predictable memory usage
139
+ * Some SAX APIs can lazily parse numbers/strings, another performance win (pay for what you use)
140
+ * Performance drain from context blindness (switch statement for "where am I in the document")
141
+ * Startlingly difficult to use
142
+
143
+ ### Schema-Based Parser Generators
144
+
145
+ There is another breed of parser, commonly used to generate REST API clients, which is in principle
146
+ capable of fixing most of the issues with DOM and SAX. These parsers take a schema--a description of
147
+ your JSON, with field names, types, everything--and generate classes/structs in your language of
148
+ choice, as well as a parser to deserialize the JSON into those structs. (In another variant, you
149
+ define your own struct and a preprocessor inspects it and generates a JSON parser for it.)
150
+
151
+ Not all of these schema-based parser generators actually generate a parser or even optimize for
152
+ streaming, but they are *able* to.
153
+
154
+ Some of the features help entirely eliminate the DOM and SAX issues:
155
+
156
+ Pros and cons:
157
+ * Ease of Use is on par with DOM
158
+ * Parsers that generate iterators and lazy values in structs can keep memory pressure down to SAX levels.
159
+ * Type Blindness can be entirely solved with specific parsers for each type, saving many branches.
160
+ * Context Blindness can be solved, especially if object fields are required and in order, saving
161
+ even more branches.
162
+ * Scenarios are limited by declarative language (often limited to deserialization-to-objects)
163
+
164
+ Rust's serde does a lot of the necessary legwork here, for example. (Editor's Note: I don't know
165
+ *how much* it does, but I know it does a decent amount, and is very fast.)
166
+
167
+ Type Blindness and Branch Misprediction
168
+ ---------------------------------------
169
+
170
+ The DOM parsing model, and even the SAX model to a great extent, suffers from **type
171
+ blindness:** you, the user, almost always know exactly what fields and what types are in your JSON,
172
+ but the parser doesn't. When you say `json_parser.parse(json)`, the parser doesn't get told *any*
173
+ of this. It has no way to know. This means it has to look at each value blind with a big "switch"
174
+ statement, asking "is this a number? A string? A boolean? An array? An object?"
175
+
176
+ In modern processors, this kind of switch statement can make your program run *3-4 times slower*
177
+ than it needs to. This is because of the high cost of branch misprediction.
178
+
179
+ Modern processors have more than one core, even on a single thread. To go fast, each of these cores
180
+ "reads ahead" in your program, each picking different instructions to run (as soon as data is
181
+ available). If all the cores are working almost all the time, your single-threaded program will run
182
+ around 4 instructions per cycle--*4 times faster* than it theoretically could.
183
+
184
+ Most modern programs don't manage to get much past 1 instruction per cycle, however. This is
185
+ because of branch misprediction. Programs have a lot of if statements, so to read ahead, processors
186
+ guess which branch will be taken and read ahead from that branch. If it guesses wrong, all that
187
+ wonderful work it did is discarded, and it starts over from the if statement. It *was* running at
188
+ 4x speed, but it was all wasted work!
189
+
190
+ And this brings us back to that switch statement. Type blindness means the processor essentially has
191
+ to guess, for every JSON value, whether it will be an array, an object, number, string or boolean.
192
+ Unless your file is something ridiculously predictable, like a giant array of numbers, it's going to
193
+ trip up a lot. (Processors get better about this all the time, but for something complex like this
194
+ there is only so much it can do in the tiny amount of time it has to guess.)
195
+
196
+ On Demand parsing is tailor-made to solve this problem at the source, parsing values only after the
197
+ user declares their type by asking for a double, an int, a string, etc.
198
+
199
+
200
+
201
+ NOTE: EVERYTHING BELOW THIS NEEDS REWRITING AND MAY NOT BE ACCURATE AT PRESENT
202
+ ==============================================================================
203
+
204
+ Classes
205
+ -------
4
206
5
207
In general, simdjson's parser classes are divided into two categories:
6
208
@@ -141,15 +343,15 @@ the work to parse and validate the array:
141
343
The ` ondemand::array ` object lets you iterate the values of an array in a forward-only manner:
142
344
143
345
``` c++
144
- for (object tweet : doc[" tweets " ].get_array()) {
346
+ for (object tweet : doc[" statuses " ].get_array()) {
145
347
...
146
348
}
147
349
```
148
350
149
351
This is equivalent to:
150
352
151
353
``` c++
152
- array tweets = doc[" tweets " ].get_array();
354
+ array tweets = doc[" statuses " ].get_array();
153
355
array::iterator iter = tweets.begin();
154
356
array::iterator end = tweets.end();
155
357
while (iter != end) {
@@ -164,13 +366,13 @@ the work to parse and validate the array:
164
366
165
367
1 . ` get_array() ` :
166
368
- Validates that this is an array (starts with ` [ ` ). Returns INCORRECT_TYPE if not.
167
- - If the array is empty (followed by ` ] ` ), advances the iterator past the ` ] ` and returns an
168
- array with finished == true .
169
- - If the array is not empty, returns an array with finished == false. Iterator remains pointed
170
- at the first value (the token just after ` [ ` ).
171
- 2 . ` tweets.begin() ` , ` tweets.end() ` : Returns an ` array::iterator ` , which just points back at the
369
+ - If the array is empty (followed by ` ] ` ), advances the iterator past the ` ] ` and then releases the
370
+ iterator (indicating the array is finished iterating) .
371
+ - If the array is not empty, the iterator remains pointed at the first value (the token just
372
+ after ` [ ` ).
373
+ 2 . ` tweets.begin() ` , ` tweets.end() ` : Returns an ` array_iterator ` , which just points back at the
172
374
array object.
173
- 3 . ` while (iter != end) ` : Stops if finished == true .
375
+ 3 . ` while (iter != end) ` : Stops if the iterator has been released .
174
376
4 . ` *iter ` : Yields the value (or error).
175
377
- If there is an error, returns it and sets finished == true.
176
378
- Returns a value object, advancing the iterator just past the value token (if it is ` [ ` , ` { ` ,
@@ -182,83 +384,14 @@ the work to parse and validate the array:
182
384
- If anything else is there (` true ` , ` : ` , ` } ` , etc.), sets error = TAPE_ERROR.
183
385
- #3 gets run next.
184
386
185
- #### Error Chaining
186
-
187
- When you iterate over an array or object with an error, the error is yielded in the loop
387
+ Design Features
388
+ ---------------
188
389
189
390
#### Error Chaining
190
391
392
+ When you iterate over an array or object with an error, the error is yielded in the loop
191
393
192
- * ` document ` : Represents the root value of the document, as well as iteration state.
193
- - Inline Algorithm Context. MUST be kept around for the duration of the algorithm.
194
- - ` iter ` : The ` json_iterator ` , which handles low-level iteration.
195
- - ` parser ` : A pointer to the parser to get at persistent state.
196
- - Can be converted to ` value ` (and by proxy, to array/object/string/number/bool/null).
197
- - Can be iterated (like ` value ` , assumes it is an array).
198
- - Safety: can only be converted to ` value ` once. This prevents double iteration of arrays/objects
199
- (which will cause inconsistent iterator state) or double-unescaping strings (which could cause
200
- memory overruns if done too much). NOTE: this is actually not ideal; it means you can't do
201
- ` if (document.parse_double().error() == INCORRECT_TYPE) { document.parse_string(); } `
202
-
203
- The ` value ` class is expected to be temporary (NOT kept around), used to check the type of a value
204
- and parse it to another type. It can convert to array or object, and has helpers to parse string,
205
- double, int64, uint64, boolean and is_null().
206
-
207
- ` value ` does not check the type of the value ahead of time. Instead, when you ask for a value of a
208
- given type, it tries to parse as that type, treating a parser error as INCORRECT_TYPE. For example,
209
- if you call get_int64() on the JSON string ` "123" ` , it will fail because it is looking for either a
210
- ` - ` or digit at the front. The philosophy here is "proceed AS IF the JSON has the desired type, and
211
- have an unlikely error branch if not." Since most files have well-known types, this avoids
212
- unnecessary branchiness and extra checks for the value type.
213
-
214
- It does preemptively advance the iterator and store the pointer to the JSON, allowing you to attempt
215
- to parse more than one type. This is how you do nullable ints, for example: check is_null() and
216
- then get_int64(). If we didn't store the JSON, and instead did ` iter.advance() ` during is_null(),
217
- you would get garbage if you then did get_int64().
218
-
219
- ##
220
-
221
- until it's asked to convert, in which case it proceeds
222
- * expecting* the value is of the given type, treating an error in parsing as "it must be some
223
- other type." This saves the extra explicit type check in the common case where you already know
224
- saving the if/switch statement
225
- and . We find out it's not a double
226
- when the double parser says "I couldn't find any digits and I don't understand what this ` t `
227
- character is for."
228
- - The iterator has already been advanced past the value token. If it's { or [ , the iterator is just
229
- after the open brace.
230
- - Can be parsed as a string, double, int, unsigned, boolean, or ` is_null() ` , in which case a parser is run
231
- and the value is returned.
232
- - Can be converted to array or object iterator.
233
- * ` object ` : Represents an object iteration.
234
- - ` doc ` : A pointer to the document (and the iterator).
235
- - ` at_start ` : boolean saying whether we're at the start of the object, used for ` [] ` .
236
- - Can do order-sensitive field lookups.
237
- - Can iterate over key/value pairs.
238
- * ` array ` : Represents an array iteration.
239
- - ` doc ` : A pointer to the document (and the iterator).
240
-
241
-
242
-
243
- - Can be returned as a ` raw_json_string ` , so that people who want to check JSON strings without
244
- unescaping can do so.
245
- - Can be converted to array or object
246
- and keep member variables in the same registers.
247
- In fact, , and . Usually based on whether they are persistent--i.e. we intend them
248
- to be stored in memory--or algorithmic--
249
- (which generally means we intend to persist them), or algorithm-lifetime, possibly
250
-
251
- - persistent classes
252
- - non-persistent
253
- ` ondemand::parser ` is the equivalent of ` dom::parser ` , and .
254
- ` json_iterator ` handles all iteration state.
255
-
256
- ### json_iterator / document
257
-
258
- ` document ` owns the json_iterator and the . I'm considering moving this into the json_iterator so there's one
259
- less class that requires persistent ownership, however.
260
-
261
- ### String unescaping
394
+ ### String Parsing
262
395
263
396
When the user requests strings, we unescape them to a single string_buf (much like the DOM parser)
264
397
so that users enjoy the same string performance as the core simdjson. The current_string_buf_loc is
@@ -278,14 +411,12 @@ If it's stored in a `simdjson_result<document>` (as it would be in `auto doc = p
278
411
the document pointed to by children is the one inside the simdjson_result, and the simdjson_result,
279
412
therefore, can't be moved either.
280
413
281
- ### Object Iteration
414
+ ### Object/Array Iteration
282
415
283
416
Because the C++ iterator contract requires iterators to be const-assignable and const-constructable,
284
417
object and array iterators are separate classes from the object/array itself, and have an interior
285
418
mutable reference to it.
286
419
287
- ### Array Iteration
288
-
289
420
### Iteration Safety
290
421
291
422
- If the value fails to be parsed as one type, you can try to parse it as something else until you
0 commit comments