Skip to content

Commit 9815ce8

Browse files
committed
Add ondemand rationale to beginning of document
1 parent 652a410 commit 9815ce8

File tree

1 file changed

+217
-86
lines changed

1 file changed

+217
-86
lines changed

doc/ondemand.md

Lines changed: 217 additions & 86 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,208 @@
1-
I'll write a few of the design features / innovations here:
1+
How simdjson's On Demand Parsing works
2+
======================================
23

3-
## Classes
4+
Current JSON parsers generally have either ease of use or performance. Very few have both at
5+
once. simdjson's On Demand API bridges that gap with a familiar, friendly DOM API and the
6+
performance of just-in-time parsing on top of the simdjson core's legendary performance.
7+
8+
To achieve ease of use, we mimicked the *form* of a traditional DOM API: you can iterate over
9+
arrays, look up fields in objects, and extract native values like double, uint64_t, string and bool.
10+
To achieve performance, we introduced some key limitations that make the DOM API *streaming*:
11+
array/object iteration cannot be restarted, and fields must be looked up in order, and string/number
12+
values can only be parsed once.
13+
14+
```c++
15+
ondemand::parser parser;
16+
auto doc = parser.iterate(json);
17+
for (auto tweet : doc["statuses"]) {
18+
std::string_view text = tweet["text"];
19+
std::string_view screen_name = tweet["user"]["screen_name"];
20+
uint64_t retweets = tweet["retweet_count"];
21+
uint64_t favorites = tweet["favorite_count"];
22+
cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
23+
}
24+
```
25+
26+
This streaming approach means that fields or values you don't use don't have to get parsed or
27+
converted, saving space and time.
28+
29+
Further, the On Demand API doesn't parse a value *at all* until you try to convert it to double,
30+
int, string, or bool. Because you have told it the type at that point, it can avoid the the key
31+
"what type is this" branch present in almost all other parsers, avoiding branch misprediction that
32+
cause massive (sometimes 2-4x) slowdowns.
33+
34+
Parsers Today
35+
-------------
36+
37+
To understand exactly what's happening here and why it's different, it's helpful to review the major
38+
approaches to parsing and parser APIs in use today.
39+
40+
### Generic DOM
41+
42+
Many of the most usable, popular JSON APIs deserialize into a **DOM**: an intermediate tree of
43+
objects, arrays and values. The resulting API lets you refer to each array or object separately,
44+
using familiar techniques like iteration (`for (auto value : array)`) or indexing (`object["key"]`).
45+
In some cases the values are even deserialized straight into familiar C++ constructs like vector and
46+
map.
47+
48+
This model is dead simple to use, since it talks in terms of *data types* instead of JSON. It's
49+
often easy enough that many users use the deserialized JSON as-is instead of deserializing into
50+
their own custom structs, saving a ton of development work.
51+
52+
simdjson's DOM parser is one such example. It looks very similar to the ondemand example, except
53+
it calls `parse` instead of `iterate`:
54+
55+
```c++
56+
dom::parser parser;
57+
auto doc = parser.parse(json);
58+
for (auto tweet : doc["statuses"]) {
59+
std::string_view text = tweet["text"];
60+
std::string_view screen_name = tweet["user"]["screen_name"];
61+
uint64_t retweets = tweet["retweet_count"];
62+
uint64_t favorites = tweet["favorite_count"];
63+
cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
64+
}
65+
```
66+
67+
Pros and cons of generic DOM:
68+
* Straightforward, user-friendly interface (arrays and objects)
69+
* No lifetime concerns (arrays and objects are often independent of JSON text and parser internal state)
70+
* Parses and stores everything, using memory/CPU even on values you don't end up using (cost can be brought down some with lazy numbers/strings and top-level iterators)
71+
* Values stay in memory even if you only use them once
72+
* Heavy performance drain from [type blindness](#type-blindness).
73+
74+
### SAX (SAJ?)
75+
76+
The SAX model ("Streaming API for XML") uses streaming to eliminate the high cost of
77+
parsing and storing the entire JSON. In the SAX model, a core JSON parser parses the JSON document
78+
piece by piece, but instead of stuffing values in a DOM tree, it passes each value to a callback,
79+
letting the user use the value and decide for themselves whether to discard it and where to store
80+
it. or discard it.
81+
82+
This allows users to work with much larger files without running out of memory. Some SAX APIs even
83+
allow the user to skip values entirely, lightening the parsing burden as well.
84+
85+
The big drawback is complexity: SAX APIs generally have you define a single callback for each type
86+
(e.g. `string_field(std::string_view key, std::string_view value)`). Because of this, you suffer
87+
from context blindness: when you find a string you have to check where it is before you know what to
88+
do with it. Is this string the text of the tweet, the screen name, or something else I don't even
89+
care about? Are we even in a tweet right now, or is this from some other place in the document
90+
entirely?
91+
92+
The following is SAX example of the Twitter problem we've seen in the Generic DOM and On Demand
93+
examples. To make it short enough to use as an example at all, it's heavily redacted: it only solves
94+
a part of the problem (doesn't get user.screen_name), it has bugs (it doesn't handle sub-objects
95+
in a tweet at all), and it uses a theoretical, simple SAX API that minimizes ceremony and skips over
96+
the issue of lazy parsing and number types entirely.
97+
98+
```c++
99+
struct twitter_callbacks {
100+
bool in_statuses;
101+
bool in_tweet;
102+
std::string_view text;
103+
uint64_t retweets;
104+
uint64_t favorites;
105+
void start_object_field(std::string_view key) {
106+
if (key == "statuses") { in_statuses = true; }
107+
}
108+
void start_object() {
109+
if (in_statuses) { in_tweet = true; }
110+
}
111+
void string_field(std::string_view key, std::string_view value) {
112+
if (in_tweet && key == "text") { text = value; }
113+
}
114+
void number_field(std::string_view key, uint64_t value) {
115+
if (in_tweet) {
116+
if (key == "retweet_count") { retweets = value; }
117+
if (key == "favorite_count") { favorites = value; }
118+
}
119+
}
120+
void end_object() {
121+
if (in_tweet) {
122+
cout << "[redacted] (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
123+
in_tweet = false;
124+
} else if (in_statuses) {
125+
in_statuses = false;
126+
}
127+
}
128+
};
129+
sax::parser parser;
130+
parser.parse(twitter_callbacks());
131+
```
132+
133+
This is a startling amount of code, requiring mental gymnastics even to read, and in order to get it
134+
this small and illustrate basic usage, *it has bugs* and skips over parsing user.screen_name
135+
entirely. The real implementation is much, much harder to write (and to read).
136+
137+
Pros and cons of SAX (SAJ):
138+
* Speed and space benefits from low, predictable memory usage
139+
* Some SAX APIs can lazily parse numbers/strings, another performance win (pay for what you use)
140+
* Performance drain from context blindness (switch statement for "where am I in the document")
141+
* Startlingly difficult to use
142+
143+
### Schema-Based Parser Generators
144+
145+
There is another breed of parser, commonly used to generate REST API clients, which is in principle
146+
capable of fixing most of the issues with DOM and SAX. These parsers take a schema--a description of
147+
your JSON, with field names, types, everything--and generate classes/structs in your language of
148+
choice, as well as a parser to deserialize the JSON into those structs. (In another variant, you
149+
define your own struct and a preprocessor inspects it and generates a JSON parser for it.)
150+
151+
Not all of these schema-based parser generators actually generate a parser or even optimize for
152+
streaming, but they are *able* to.
153+
154+
Some of the features help entirely eliminate the DOM and SAX issues:
155+
156+
Pros and cons:
157+
* Ease of Use is on par with DOM
158+
* Parsers that generate iterators and lazy values in structs can keep memory pressure down to SAX levels.
159+
* Type Blindness can be entirely solved with specific parsers for each type, saving many branches.
160+
* Context Blindness can be solved, especially if object fields are required and in order, saving
161+
even more branches.
162+
* Scenarios are limited by declarative language (often limited to deserialization-to-objects)
163+
164+
Rust's serde does a lot of the necessary legwork here, for example. (Editor's Note: I don't know
165+
*how much* it does, but I know it does a decent amount, and is very fast.)
166+
167+
Type Blindness and Branch Misprediction
168+
---------------------------------------
169+
170+
The DOM parsing model, and even the SAX model to a great extent, suffers from **type
171+
blindness:** you, the user, almost always know exactly what fields and what types are in your JSON,
172+
but the parser doesn't. When you say `json_parser.parse(json)`, the parser doesn't get told *any*
173+
of this. It has no way to know. This means it has to look at each value blind with a big "switch"
174+
statement, asking "is this a number? A string? A boolean? An array? An object?"
175+
176+
In modern processors, this kind of switch statement can make your program run *3-4 times slower*
177+
than it needs to. This is because of the high cost of branch misprediction.
178+
179+
Modern processors have more than one core, even on a single thread. To go fast, each of these cores
180+
"reads ahead" in your program, each picking different instructions to run (as soon as data is
181+
available). If all the cores are working almost all the time, your single-threaded program will run
182+
around 4 instructions per cycle--*4 times faster* than it theoretically could.
183+
184+
Most modern programs don't manage to get much past 1 instruction per cycle, however. This is
185+
because of branch misprediction. Programs have a lot of if statements, so to read ahead, processors
186+
guess which branch will be taken and read ahead from that branch. If it guesses wrong, all that
187+
wonderful work it did is discarded, and it starts over from the if statement. It *was* running at
188+
4x speed, but it was all wasted work!
189+
190+
And this brings us back to that switch statement. Type blindness means the processor essentially has
191+
to guess, for every JSON value, whether it will be an array, an object, number, string or boolean.
192+
Unless your file is something ridiculously predictable, like a giant array of numbers, it's going to
193+
trip up a lot. (Processors get better about this all the time, but for something complex like this
194+
there is only so much it can do in the tiny amount of time it has to guess.)
195+
196+
On Demand parsing is tailor-made to solve this problem at the source, parsing values only after the
197+
user declares their type by asking for a double, an int, a string, etc.
198+
199+
200+
201+
NOTE: EVERYTHING BELOW THIS NEEDS REWRITING AND MAY NOT BE ACCURATE AT PRESENT
202+
==============================================================================
203+
204+
Classes
205+
-------
4206
5207
In general, simdjson's parser classes are divided into two categories:
6208
@@ -141,15 +343,15 @@ the work to parse and validate the array:
141343
The `ondemand::array` object lets you iterate the values of an array in a forward-only manner:
142344

143345
```c++
144-
for (object tweet : doc["tweets"].get_array()) {
346+
for (object tweet : doc["statuses"].get_array()) {
145347
...
146348
}
147349
```
148350

149351
This is equivalent to:
150352

151353
```c++
152-
array tweets = doc["tweets"].get_array();
354+
array tweets = doc["statuses"].get_array();
153355
array::iterator iter = tweets.begin();
154356
array::iterator end = tweets.end();
155357
while (iter != end) {
@@ -164,13 +366,13 @@ the work to parse and validate the array:
164366

165367
1. `get_array()`:
166368
- Validates that this is an array (starts with `[`). Returns INCORRECT_TYPE if not.
167-
- If the array is empty (followed by `]`), advances the iterator past the `]` and returns an
168-
array with finished == true.
169-
- If the array is not empty, returns an array with finished == false. Iterator remains pointed
170-
at the first value (the token just after `[`).
171-
2. `tweets.begin()`, `tweets.end()`: Returns an `array::iterator`, which just points back at the
369+
- If the array is empty (followed by `]`), advances the iterator past the `]` and then releases the
370+
iterator (indicating the array is finished iterating).
371+
- If the array is not empty, the iterator remains pointed at the first value (the token just
372+
after `[`).
373+
2. `tweets.begin()`, `tweets.end()`: Returns an `array_iterator`, which just points back at the
172374
array object.
173-
3. `while (iter != end)`: Stops if finished == true.
375+
3. `while (iter != end)`: Stops if the iterator has been released.
174376
4. `*iter`: Yields the value (or error).
175377
- If there is an error, returns it and sets finished == true.
176378
- Returns a value object, advancing the iterator just past the value token (if it is `[`, `{`,
@@ -182,83 +384,14 @@ the work to parse and validate the array:
182384
- If anything else is there (`true`, `:`, `}`, etc.), sets error = TAPE_ERROR.
183385
- #3 gets run next.
184386

185-
#### Error Chaining
186-
187-
When you iterate over an array or object with an error, the error is yielded in the loop
387+
Design Features
388+
---------------
188389

189390
#### Error Chaining
190391

392+
When you iterate over an array or object with an error, the error is yielded in the loop
191393

192-
* `document`: Represents the root value of the document, as well as iteration state.
193-
- Inline Algorithm Context. MUST be kept around for the duration of the algorithm.
194-
- `iter`: The `json_iterator`, which handles low-level iteration.
195-
- `parser`: A pointer to the parser to get at persistent state.
196-
- Can be converted to `value` (and by proxy, to array/object/string/number/bool/null).
197-
- Can be iterated (like `value`, assumes it is an array).
198-
- Safety: can only be converted to `value` once. This prevents double iteration of arrays/objects
199-
(which will cause inconsistent iterator state) or double-unescaping strings (which could cause
200-
memory overruns if done too much). NOTE: this is actually not ideal; it means you can't do
201-
`if (document.parse_double().error() == INCORRECT_TYPE) { document.parse_string(); }`
202-
203-
The `value` class is expected to be temporary (NOT kept around), used to check the type of a value
204-
and parse it to another type. It can convert to array or object, and has helpers to parse string,
205-
double, int64, uint64, boolean and is_null().
206-
207-
`value` does not check the type of the value ahead of time. Instead, when you ask for a value of a
208-
given type, it tries to parse as that type, treating a parser error as INCORRECT_TYPE. For example,
209-
if you call get_int64() on the JSON string `"123"`, it will fail because it is looking for either a
210-
`-` or digit at the front. The philosophy here is "proceed AS IF the JSON has the desired type, and
211-
have an unlikely error branch if not." Since most files have well-known types, this avoids
212-
unnecessary branchiness and extra checks for the value type.
213-
214-
It does preemptively advance the iterator and store the pointer to the JSON, allowing you to attempt
215-
to parse more than one type. This is how you do nullable ints, for example: check is_null() and
216-
then get_int64(). If we didn't store the JSON, and instead did `iter.advance()` during is_null(),
217-
you would get garbage if you then did get_int64().
218-
219-
##
220-
221-
until it's asked to convert, in which case it proceeds
222-
*expecting* the value is of the given type, treating an error in parsing as "it must be some
223-
other type." This saves the extra explicit type check in the common case where you already know
224-
saving the if/switch statement
225-
and . We find out it's not a double
226-
when the double parser says "I couldn't find any digits and I don't understand what this `t`
227-
character is for."
228-
- The iterator has already been advanced past the value token. If it's { or [, the iterator is just
229-
after the open brace.
230-
- Can be parsed as a string, double, int, unsigned, boolean, or `is_null()`, in which case a parser is run
231-
and the value is returned.
232-
- Can be converted to array or object iterator.
233-
* `object`: Represents an object iteration.
234-
- `doc`: A pointer to the document (and the iterator).
235-
- `at_start`: boolean saying whether we're at the start of the object, used for `[]`.
236-
- Can do order-sensitive field lookups.
237-
- Can iterate over key/value pairs.
238-
* `array`: Represents an array iteration.
239-
- `doc`: A pointer to the document (and the iterator).
240-
241-
242-
243-
- Can be returned as a `raw_json_string`, so that people who want to check JSON strings without
244-
unescaping can do so.
245-
- Can be converted to array or object
246-
and keep member variables in the same registers.
247-
In fact, , and . Usually based on whether they are persistent--i.e. we intend them
248-
to be stored in memory--or algorithmic--
249-
(which generally means we intend to persist them), or algorithm-lifetime, possibly
250-
251-
- persistent classes
252-
- non-persistent
253-
`ondemand::parser` is the equivalent of `dom::parser`, and .
254-
`json_iterator` handles all iteration state.
255-
256-
### json_iterator / document
257-
258-
`document` owns the json_iterator and the . I'm considering moving this into the json_iterator so there's one
259-
less class that requires persistent ownership, however.
260-
261-
### String unescaping
394+
### String Parsing
262395

263396
When the user requests strings, we unescape them to a single string_buf (much like the DOM parser)
264397
so that users enjoy the same string performance as the core simdjson. The current_string_buf_loc is
@@ -278,14 +411,12 @@ If it's stored in a `simdjson_result<document>` (as it would be in `auto doc = p
278411
the document pointed to by children is the one inside the simdjson_result, and the simdjson_result,
279412
therefore, can't be moved either.
280413

281-
### Object Iteration
414+
### Object/Array Iteration
282415

283416
Because the C++ iterator contract requires iterators to be const-assignable and const-constructable,
284417
object and array iterators are separate classes from the object/array itself, and have an interior
285418
mutable reference to it.
286419

287-
### Array Iteration
288-
289420
### Iteration Safety
290421

291422
- If the value fails to be parsed as one type, you can try to parse it as something else until you

0 commit comments

Comments
 (0)