Commit 2eaeac5

Revamp design documentation to match new design
1 parent 3baba73 commit 2eaeac5

1 file changed

+220
-80
lines changed

doc/ondemand.md

Lines changed: 220 additions & 80 deletions
@@ -29,11 +29,6 @@ auto doc = parser.iterate(json);
2929
for (auto tweet : doc["statuses"]) {
3030
std::string_view text = tweet["text"];
3131
std::string_view screen_name = tweet["user"]["screen_name"];
32-
std::string_view screen_name;
33-
{
34-
ondemand::object user = tweet["user"];
35-
screen_name = user["screen_name"];
36-
}
3732
uint64_t retweets = tweet["retweet_count"];
3833
uint64_t favorites = tweet["favorite_count"];
3934
cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
@@ -66,7 +61,10 @@ Such code would be apply to a JSON document such as the following JSON mimicking
6661
"retweet_count": 82,
6762
"favorite_count": 42
6863
}
69-
]
64+
],
65+
"search_metadata": {
66+
"count": 100
67+
}
7068
}
7169
```
7270

@@ -91,7 +89,6 @@ The On Demand approach is designed around several principles:
9189
* **Validate What You Use:** On Demand deliberately validates the values you use and the structure leading to it, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed: there must be no confusion over whether you got the right value.
9290

9391

94-
9592
To understand why On Demand is different, it is helpful to review the major
9693
approaches to parsing and parser APIs in use today.
9794

@@ -119,8 +116,7 @@ for (auto tweet : doc["statuses"]) {
119116
std::string_view text = tweet["text"];
120117
std::string_view screen_name = tweet["user"]["screen_name"];
121118
uint64_t retweets = tweet["retweet_count"];
122-
uint64_t favorites = tweet["favorite_count"];
123-
cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
119+
cout << screen_name << " (" << retweets << " retweets): " << text << endl;
124120
}
125121
```
126122

@@ -273,9 +269,10 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
273269
```json
274270
{
275271
"statuses": [
276-
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "favorite_count": 100, "retweet_count": 40 },
277-
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "favorite_count": 2, "retweet_count": 3 }
278-
]
272+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
273+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
274+
],
275+
"search_metadata": { "count": 2 }
279276
}
280277
```
281278

@@ -318,57 +315,84 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
318315
rely on error chaining, so it is possible to delay error checks: we shall shortly explain error
319316
chaining more fully.
320317

321-
NOTE: You should always have such a `document` instance (here `doc`) and it should remain in scope for the duration
322-
of your parsing function. E.g., you should not use the returned document as a temporary (e.g., `auto x = parser.iterate(json).get_object();`)
323-
followed by other operations as the destruction of the `document` instance makes all of the derived instances
324-
ill-defined.
318+
> NOTE: You should always have such a `document` instance (here `doc`) and it should remain in scope for the duration
319+
> of your parsing function. E.g., you should not use the returned document as a temporary (e.g., `auto x = parser.iterate(json).get_object();`)
320+
> followed by other operations as the destruction of the `document` instance makes all of the derived instances
321+
> ill-defined.
322+
323+
At this point, the iterator is at the start of the JSON:
324+
325+
```json
326+
{
327+
^ (depth 1)
325328

329+
"statuses": [
330+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
331+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
332+
],
333+
"search_metadata": { "count": 2 }
334+
}
335+
```
326336

327337
3. We iterate over the "statuses" field using a typical C++ iterator, reading past the initial
328338
`{ "statuses": [ {`.
329339

330340
```c++
331341
for (ondemand::object tweet : doc["statuses"]) {
332342
```
333-
This shorthand does much, and it is helpful to see what it expands to.
334-
Comments in front of each one explain what's going on:
335-
```c++
336-
// Validate that the top-level value is an object: check for {
337-
ondemand::object top = doc.get_object();
338-
339-
// Find the field statuses by:
340-
// 1. Check whether the object is empty (check for }). (We do not really need to do this unless the key lookup fails!)
341-
// 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
342-
// 3. Validate that there is a `:` after it.
343-
auto tweets_field = top["statuses"];
344-
345-
// Validate that the field value is an array: check for [
346-
// Also mark the array as finished if there is a ] next, which would cause the while () statement to exit immediately.
347-
ondemand::array tweets = tweets_field.get_array();
348-
// These three method calls do nothing substantial (the real checking happens in get_array() and ++)
349-
// != checks whether the array is marked as finished (if we have found a ]).
350-
ondemand::array_iterator tweets_iter = tweets.begin();
351-
while (tweets_iter != tweets.end()) {
352-
auto tweet_value = *tweets_iter;
353-
354-
// Validate that the array element is an object: check for {
355-
ondemand::object tweet = tweet_value.get_object();
356-
...
357-
}
358-
```
359-
What is not explained in this code expansion is *error chaining*.
360-
Generally, you can use `document` methods on a `simdjson_result<...>` value; any errors will
361-
just be passed down the chain. Many method calls
362-
can be chained in this manner. So `for (object tweet : doc["statuses"])`, which is the equivalent of
363-
`object tweet = *(doc.get_object()["statuses"].get_array().begin()).get_object()`, could fail in any of
364-
6 method calls, and the error will only be checked at the end,
365-
when you attempt to cast the final `simdjson_result<object>` to object. Upon casting, an exception is
366-
thrown if there was an error.
367343
368-
NOTE: while the document can be queried once for a key as if it were an object, it is not an actual object
369-
instance. If you need to treat it as an object (e.g., to query more than one keys), you can cast it as
370-
such `ondemand::object root_object = doc.get_object();`.
344+
This shorthand does a lot, and it is helpful to see what it expands to.
345+
Comments in front of each one explain what's going on:
346+
347+
```c++
348+
// Validate that the top-level value is an object: check for {. Increase depth to 2 (root > field).
349+
ondemand::object top = doc.get_object();
350+
351+
// Find the field statuses by:
352+
// 1. Check whether the object is empty (check for }). (We do not really need to do this unless
353+
// the key lookup fails!)
354+
// 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
355+
// 3. Validate that there is a `:` after it.
356+
auto tweets_field = top["statuses"];
357+
358+
// - Validate that the field value is an array: check for [
359+
// - If the array is empty (if there is a ] next), decrease depth back to 0.
360+
// - If not, increase depth to 3 (root > statuses > tweet).
361+
ondemand::array tweets = tweets_field.get_array();
362+
// These three method calls do nothing substantial (the real checking happens in get_array() and ++)
363+
// != checks whether the array is finished (if we found a ] and decreased depth back to 0).
364+
ondemand::array_iterator tweets_iter = tweets.begin();
365+
while (tweets_iter != tweets.end()) {
366+
auto tweet_value = *tweets_iter;
367+
368+
// - Validate that the array element is an object: check for {
369+
// - If the object is empty (if there is a } next), decrease depth back to 1.
370+
// - If not, increase depth to 4 (root > statuses > tweet > field).
371+
ondemand::object tweet = tweet_value.get_object();
372+
...
373+
}
374+
```
371375

376+
> NOTE: What is not explained in this code expansion is *error chaining*.
377+
> Generally, you can use `document` methods on a `simdjson_result<...>` value; any errors will
378+
> just be passed down the chain. Many method calls
379+
> can be chained in this manner. So `for (object tweet : doc["statuses"])`, which is the equivalent of
380+
> `object tweet = *(doc.get_object()["statuses"].get_array().begin()).get_object()`, could fail in any of
381+
> 6 method calls, and the error will only be checked at the end,
382+
> when you attempt to cast the final `simdjson_result<object>` to object. Upon casting, an exception is
383+
> thrown if there was an error.
384+
385+
```json
386+
{
387+
"statuses": [
388+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
389+
^ (depth 4 - root > statuses > tweet > field)
390+
391+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
392+
],
393+
"search_metadata": { "count": 2 }
394+
}
395+
```
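The error chaining described in the note above can be illustrated with a deliberately tiny wrapper. This is only a sketch under stated assumptions: `result`, `error_code`, `then` and `halve` are hypothetical names invented for the illustration and are not simdjson's `simdjson_result<...>` API. The point it shows is that each step is skipped once an error is present, and the error surfaces only when the final value is unwrapped.

```cpp
#include <stdexcept>

// Hypothetical error codes for this sketch only.
enum class error_code { success, incorrect_type };

// Hypothetical miniature of a result wrapper: a value plus an error code.
template <typename T>
struct result {
  T value{};
  error_code error{error_code::success};

  // Run `step` on the value only if no earlier step failed;
  // otherwise pass the existing error down the chain untouched.
  template <typename F>
  auto then(F step) const -> decltype(step(value)) {
    using next = decltype(step(value));
    if (error != error_code::success) { return next{{}, error}; }
    return step(value);
  }

  // The error is checked only here, when the final value is requested.
  T get() const {
    if (error != error_code::success) { throw std::runtime_error("chained error"); }
    return value;
  }
};

// A step that can fail: halving succeeds only for even numbers.
inline result<int> halve(int x) {
  if (x % 2 != 0) { return result<int>{0, error_code::incorrect_type}; }
  return result<int>{x / 2, error_code::success};
}
```

With this sketch, `result<int>{8}.then(halve).then(halve).get()` yields 2, while `result<int>{6}.then(halve).then(halve).then(halve)` fails at the second step, skips the third, and throws only when `get()` is called.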
372396

373397
4. We get the `"text"` field as a string.
374398

@@ -382,45 +406,109 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
382406

383407
The second field is matched (`"text"`), so we validate the `:` and move to the actual value.
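As a rough illustration of what a byte-by-byte key match involves, here is a minimal sketch. The helper `raw_key_matches` is a hypothetical name, not part of simdjson, and it assumes the input view starts exactly at the opening quote of a key. It compares the key against the raw bytes without unescaping anything.

```cpp
#include <string_view>

// Hypothetical helper (not simdjson's API): returns true when the raw JSON in
// `json` starts with the quoted key "key". No unescaping is performed.
inline bool raw_key_matches(std::string_view json, std::string_view key) {
  // Need room for the opening quote, the key bytes, and the closing quote.
  if (json.size() < key.size() + 2) { return false; }
  if (json.front() != '"') { return false; }
  // Compare the bytes after the opening quote directly against the key.
  if (json.substr(1, key.size()) != key) { return false; }
  // The key must end right here, otherwise "textual" would match "text".
  return json[key.size() + 1] == '"';
}
```

Under these assumptions, a key written in the JSON as `"te\u0078t"` does not match `text`, even though it unescapes to the same string, and `"Text"` does not match either.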
384408

385-
NOTE: `["text"]` does a *raw match*, comparing the key directly against the raw JSON. This means
386-
that keys with escapes in them may not be matched and the letter case must match exactly.
409+
> NOTE: `["text"]` does a *raw match*, comparing the key directly against the raw JSON. This
410+
> allows simdjson to do field lookup very, very quickly when the keys you want to match have
411+
> letters, numbers and punctuation. However, this means that fields with escapes in them will not
412+
> be matched.
387413
388414
To convert to a string, we check for `"` and use simdjson's fast unescaping algorithm to copy
389415
`first!` (plus a terminating `\0`) into a buffer managed by the `document`. This buffer stores
390416
all strings from a single iteration. The next string will be written after the `\0`.
391417

392418
A `string_view` is returned which points to that buffer, and contains the length.
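The buffer layout described above can be sketched as follows. `string_arena` is a hypothetical stand-in written for this illustration, not simdjson's internal type, and it assumes the bytes passed to `append` have already been unescaped. Each string is copied after the previous one, followed by a `\0`, and a `string_view` into the buffer is returned.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>

// Hypothetical stand-in for the document's string buffer (not simdjson's type).
class string_arena {
public:
  explicit string_arena(std::size_t capacity)
    : buf_(new char[capacity]), capacity_(capacity), used_(0) {}

  // Copies `s` plus a terminating '\0' after the previously stored string.
  // Returns an empty view if there is not enough room left.
  std::string_view append(std::string_view s) {
    if (s.size() + 1 > capacity_ - used_) { return {}; }
    char *dst = buf_.get() + used_;
    if (!s.empty()) { std::memcpy(dst, s.data(), s.size()); }
    dst[s.size()] = '\0';
    used_ += s.size() + 1;
    // The view carries the length, so it does not include the '\0'.
    return std::string_view(dst, s.size());
  }

  // Everything written so far, including the '\0' separators.
  std::string_view contents() const { return std::string_view(buf_.get(), used_); }

private:
  std::unique_ptr<char[]> buf_;  // fixed allocation, so returned views stay valid
  std::size_t capacity_;
  std::size_t used_;
};
```

The buffer is allocated once and never grows in this sketch, which is what keeps earlier `string_view`s valid when later strings are appended.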
393419

420+
We advance to the comma, and decrease depth to 3 (root > statuses > tweet).
421+
422+
At this point, we are here in the JSON:
423+
424+
```json
425+
{
426+
"statuses": [
427+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
428+
^ (depth 3 - root > statuses > tweet)
429+
430+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
431+
],
432+
"search_metadata": { "count": 2 }
433+
}
434+
```
435+
394436
4. We get the `"screen_name"` from the `"user"` object.
395437

396438
```c++
397439
ondemand::object user = tweet["user"];
398440
screen_name = user["screen_name"];
399441
```
400442

401-
First, `["user"]` checks whether there are any more object fields by looking for either `,` or
402-
`}`. Then it matches `"user"` and validates the `:`.
443+
First, `["user"]` finds the `,`, discovers the next key is `"user"`, validates that the `:`
444+
is there, and increases depth to 4 (root > statuses > tweet > field).
403445

404-
`["screen_name"]` then converts to object, checking for `{`, and finds `"screen_name"`.
446+
Next, the cast to `ondemand::object` checks for `{` and increases depth to 5 (root > statuses >
447+
tweet > user > field).
448+
449+
`["screen_name"]` finds the first field `"screen_name"` and validates the `:`.
405450

406451
To convert the result to a usable string (i.e., the screen name `lemire`), the characters are written to the document's
407452
string buffer (after unescaping them if needed), which now has *two* string_views pointing into it, and looks like `first!\0lemire\0`.
408453

409-
Finally, the temporary user object is destroyed, causing it to skip the remainder of the object
410-
(`}`).
454+
The iterator advances to the comma and decreases depth back to 4 (root > statuses > tweet > user).
455+
456+
At this point, the iterator is here in the JSON:
411457

412-
NOTE: You may only have one active array or object active at any given time. An array or an object becomes
413-
active when the `ondemand::object` or `ondemand::array` is created, and it releases its 'focus' when
414-
its destructor is called. If you create an array or an object located inside a parent object or array,
415-
the child array or object becomes active while the parent becomes temporarily inactive. If you access
416-
several sibling objects or arrays, you must ensure that the destructor is called by scoping each access
417-
(see Iteration Safety section below for further details).
458+
```json
459+
{
460+
"statuses": [
461+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
462+
^ (depth 4 - root > statuses > tweet > user)
463+
464+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
465+
],
466+
"search_metadata": { "count": 2 }
467+
}
468+
```
418469

419-
5. We get `"retweet_count"` and `"favorite_count"` as unsigned integers.
470+
5. We get `"retweet_count"` as an unsigned integer.
420471

421472
```c++
422473
uint64_t retweets = tweet["retweet_count"];
423-
uint64_t favorites = tweet["favorite_count"];
474+
```
475+
476+
First, `["retweet_count"]` checks whether the previous field value is finished (if it were, depth
477+
would be 3 (root > statuses > tweet)). Since it is not, we skip JSON until depth is 3. This brings
478+
the iterator to the `,` after the user object:
479+
480+
```json
481+
{
482+
"statuses": [
483+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
484+
^ (depth 3 - root > statuses > tweet)
485+
486+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
487+
],
488+
"search_metadata": { "count": 2 }
489+
}
490+
```
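The "skip JSON until depth is 3" step can be sketched as a small scanning loop. This is a simplified illustration, not simdjson's implementation: `cursor` and `skip_to_depth` are hypothetical names, the scan is byte-by-byte, and it assumes that while skipping, every opening bracket adds one to depth and every closing bracket subtracts one. Brackets inside strings are ignored.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical iterator state for this sketch: a byte offset and a depth.
struct cursor {
  std::size_t pos;
  int depth;
};

// Advance through `json` until depth has dropped to `target_depth`
// (or the input ends). Returns the updated cursor.
inline cursor skip_to_depth(std::string_view json, cursor c, int target_depth) {
  bool in_string = false;
  while (c.pos < json.size() && c.depth > target_depth) {
    const char ch = json[c.pos++];
    if (in_string) {
      if (ch == '\\') {
        if (c.pos < json.size()) { ++c.pos; }  // skip the escaped character
      } else if (ch == '"') {
        in_string = false;
      }
    } else if (ch == '"') {
      in_string = true;
    } else if (ch == '{' || ch == '[') {
      ++c.depth;  // entering a nested container
    } else if (ch == '}' || ch == ']') {
      --c.depth;  // leaving a container
    }
  }
  return c;
}
```

Starting just after `"lemire"` at depth 4 and asking for depth 3, the loop passes over the `"name"` field, consumes the `}` that closes the user object, and stops with the cursor on the following `,`.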
491+
492+
Because of the cast to uint64_t, simdjson knows it's parsing an unsigned integer. This lets
493+
us use a fast parser which *only* knows how to parse digits. It validates that it is an integer
494+
by rejecting negative numbers, strings, and other values based on the fact that they are not the
495+
digits 0-9. This type specificity is part of why parsing with On Demand is so fast: you lose all
496+
the code that has to understand those other types.
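To make the idea of a digits-only parser concrete, here is a minimal sketch. It is not simdjson's number parser: `parse_unsigned_digits` is a hypothetical helper, and it assumes the number token has already been isolated from the surrounding JSON. Anything that is not one of the digits 0-9 is rejected outright, so negative numbers, strings, and decimals all fail without any type-specific code.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical sketch (not simdjson's parser): accepts only the digits 0-9.
inline std::optional<std::uint64_t> parse_unsigned_digits(std::string_view s) {
  if (s.empty()) { return std::nullopt; }
  // JSON does not allow leading zeros such as "042".
  if (s.size() > 1 && s[0] == '0') { return std::nullopt; }
  constexpr std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t value = 0;
  for (char c : s) {
    // A single range check rejects '-', '"', '.', 'e', letters, and so on.
    if (c < '0' || c > '9') { return std::nullopt; }
    const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
    // Refuse to overflow: value * 10 + digit must fit in 64 bits.
    if (value > (max - digit) / 10) { return std::nullopt; }
    value = value * 10 + digit;
  }
  return value;
}
```

For example, `parse_unsigned_digits("40")` holds 40, while `"-3"`, `"4.0"`, and a quoted `"3"` all produce an empty optional.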
497+
498+
The iterator is advanced to the `}`, and depth decreased back to 3 (root > statuses > tweet).
499+
500+
At this point, we are here in the JSON:
501+
502+
```json
503+
{
504+
"statuses": [
505+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
506+
^ (depth 3 - root > statuses > tweet)
507+
508+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
509+
],
510+
"search_metadata": { "count": 2 }
511+
}
424512
```
425513

426514
6. We loop to the next tweet.
@@ -441,25 +529,77 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
441529
}
442530
```
443531

444-
First, the `tweet` destructor runs, skipping the remainder of the object which in this case is
445-
just `}`.
532+
First, `iter++` (remember, this is the array of tweets) checks whether the previous object was
533+
fully iterated. It was not--depth is 3 (root > statuses > tweet), so we skip until it's 2--which
534+
in this case just means consuming the `}`, leaving the iterator at the next comma. Depth is now 2
535+
(root > statuses).
536+
537+
Next, `iter++` finds the `,` and advances past it to the `{`, increasing depth to 3 (root >
538+
statuses > tweet).
539+
540+
Finally, `ondemand::object tweet = *iter` validates the `{` and increases depth to 4 (root >
541+
statuses > tweet > field). This leaves the iterator here:
542+
543+
```json
544+
{
545+
"statuses": [
546+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
547+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
548+
^ (depth 4 - root > statuses > tweet > field)
549+
],
550+
"search_metadata": { "count": 2 }
551+
}
552+
```
553+
554+
7. This tweet is processed just like the previous one, leaving the iterator here:
555+
556+
```json
557+
{
558+
"statuses": [
559+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
560+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
561+
^ (depth 3 - root > statuses > tweet)
562+
],
563+
"search_metadata": { "count": 2 }
564+
}
565+
```
446566

447-
Next, `iter++` checks whether there are more values and finds `,`. The loop continues.
567+
8. The loop ends. Recall the relevant parts of the statuses loop:
568+
569+
```c++
570+
while (iter != statuses.end()) {
571+
ondemand::object tweet = *iter;
572+
...
573+
iter++;
574+
}
575+
```
448576
449-
Finally, `ondemand::object tweet = *iter` checks for `{` and returns the object.
577+
First, `iter++` finishes up any children, consuming the `}` and leaving depth at 2 (root > statuses).
450578
451-
This tweet is processed just like the previous one.
579+
Next, `iter++` notices the `]` and ends the array by decreasing depth to 1. This leaves the iterator
580+
here in the JSON:
452581
453-
7. We finish the last tweet.
582+
```json
583+
{
584+
"statuses": [
585+
{ "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
586+
{ "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
587+
],
588+
^ (depth 1 - root)
589+
"search_metadata": { "count": 2 }
590+
}
591+
```
454592

455-
At the end of the loop, the `tweet` is first destroyed, skipping the remainder of the tweet
456-
object (`}`).
593+
9. The remainder of the file is skipped.
457594

458-
The `iter++` instruction from `for (ondemand::object tweet : doc["statuses"])` then checks whether there are
459-
more values and finds that there are none (`]`). It marks the array iteration as finished and the for
460-
loop terminates.
595+
Because no more action is taken, JSON processing stops: processing only occurs when you ask for
596+
values.
461597

462-
Then the outer object is destroyed, skipping everything up to the `}`.
598+
This means you can very efficiently do things like read a single value from a JSON file, or take
599+
the top N, for example. It also means the things you don't use won't be fully validated. This is
600+
a general principle of On Demand: don't validate what you don't use. We still fully validate
601+
values you do use, however, as well as the objects and arrays that lead to them, so that you can
602+
be sure you get the information you need.
463603

464604
Design Features
465605
---------------
