@@ -29,11 +29,6 @@ auto doc = parser.iterate(json);
29
29
for (auto tweet : doc["statuses"]) {
30
30
  std::string_view text = tweet["text"];
31
31
  std::string_view screen_name = tweet["user"]["screen_name"];
32
- std::string_view screen_name;
33
- {
34
- ondemand::object user = tweet[ "user"] ;
35
- screen_name = user[ "screen_name"] ;
36
- }
37
32
  uint64_t retweets = tweet["retweet_count"];
38
33
  uint64_t favorites = tweet["favorite_count"];
39
34
cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
@@ -66,7 +61,10 @@ Such code would be apply to a JSON document such as the following JSON mimicking
66
61
"retweet_count" : 82 ,
67
62
"favorite_count" : 42
68
63
}
69
- ]
64
+ ],
65
+ "search_metadata" : {
66
+ "count" : 100
67
+ }
70
68
}
71
69
```
72
70
@@ -91,7 +89,6 @@ The On Demand approach is designed around several principles:
91
89
* **Validate What You Use:** On Demand deliberately validates the values you use and the structure leading to them, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed: there must be no confusion over whether you got the right value.
92
90
93
91
94
-
95
92
To understand why On Demand is different, it is helpful to review the major
96
93
approaches to parsing and parser APIs in use today.
97
94
@@ -119,8 +116,7 @@ for (auto tweet : doc["statuses"]) {
119
116
  std::string_view text = tweet["text"];
120
117
  std::string_view screen_name = tweet["user"]["screen_name"];
121
118
  uint64_t retweets = tweet["retweet_count"];
122
- uint64_t favorites = tweet[ "favorite_count"] ;
123
- cout << screen_name << " (" << retweets << " retweets / " << favorites << " favorites): " << text << endl;
119
+ cout << screen_name << " (" << retweets << " retweets): " << text << endl;
124
120
}
125
121
```
126
122
@@ -273,9 +269,10 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
273
269
```json
274
270
{
275
271
"statuses": [
276
- { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "favorite_count": 100, "retweet_count": 40 },
277
- { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "favorite_count": 2, "retweet_count": 3 }
278
- ]
272
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
273
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
274
+ ],
275
+ "search_metadata": { "count": 2 }
279
276
}
280
277
```
281
278
@@ -318,57 +315,84 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
318
315
rely on error chaining, so it is possible to delay error checks: we shall shortly explain error
319
316
chaining more fully.
320
317
321
- NOTE: You should always have such a ` document ` instance (here ` doc ` ) and it should remain in scope for the duration
322
- of your parsing function. E.g., you should not use the returned document as a temporary (e.g., ` auto x = parser.iterate(json).get_object(); ` )
323
- followed by other operations as the destruction of the ` document ` instance makes all of the derived instances
324
- ill-defined.
318
+ > NOTE: You should always have such a ` document ` instance (here ` doc ` ) and it should remain in scope for the duration
319
+ > of your parsing function. E.g., you should not use the returned document as a temporary (e.g., ` auto x = parser.iterate(json).get_object(); ` )
320
+ > followed by other operations as the destruction of the ` document ` instance makes all of the derived instances
321
+ > ill-defined.
322
+
323
+ At this point, the iterator is at the start of the JSON:
324
+
325
+ ``` json
326
+ {
327
+ ^ (depth 1)
325
328
329
+ "statuses" : [
330
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
331
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
332
+ ],
333
+ "search_metadata" : { "count" : 2 }
334
+ }
335
+ ```
326
336
327
337
3. We iterate over the "statuses" field using a typical C++ iterator, reading past the initial
328
338
` { "statuses": [ { ` .
329
339
330
340
``` c++
331
341
for (ondemand::object tweet : doc["statuses"]) {
332
342
```
333
- This shorthand does much, and it is helpful to see what it expands to.
334
- Comments in front of each one explain what's going on:
335
- ```c++
336
- // Validate that the top-level value is an object: check for {
337
- ondemand::object top = doc.get_object();
338
-
339
- // Find the field statuses by:
340
- // 1. Check whether the object is empty (check for }). (We do not really need to do this unless the key lookup fails!)
341
- // 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
342
- // 3. Validate that there is a `:` after it.
343
- auto tweets_field = top["statuses"];
344
-
345
- // Validate that the field value is an array: check for [
346
- // Also mark the array as finished if there is a ] next, which would cause the while () statement to exit immediately.
347
- ondemand::array tweets = tweets_field.get_array();
348
- // These three method calls do nothing substantial (the real checking happens in get_array() and ++)
349
- // != checks whether the array is marked as finished (if we have found a ]).
350
- ondemand::array_iterator tweets_iter = tweets.begin();
351
- while (tweets_iter != tweets.end()) {
352
- auto tweet_value = *tweets_iter;
353
-
354
- // Validate that the array element is an object: check for {
355
- ondemand::object tweet = tweet_value.get_object();
356
- ...
357
- }
358
- ```
359
- What is not explained in this code expansion is * error chaining* .
360
- Generally, you can use ` document ` methods on a ` simdjson_result<...> ` value; any errors will
361
- just be passed down the chain. Many method calls
362
- can be chained in this manner. So ` for (object tweet : doc["statuses"]) ` , which is the equivalent of
363
- ` object tweet = *(doc.get_object()["statuses"].get_array().begin()).get_object() ` , could fail in any of
364
- 6 method calls, and the error will only be checked at the end,
365
- when you attempt to cast the final ` simdjson_result<object> ` to object. Upon casting, an exception is
366
- thrown if there was an error.
367
343
368
- NOTE: while the document can be queried once for a key as if it were an object, it is not an actual object
369
- instance. If you need to treat it as an object (e.g., to query more than one keys), you can cast it as
370
- such ` ondemand::object root_object = doc.get_object(); ` .
344
+ This shorthand does a lot, and it is helpful to see what it expands to.
345
+ Comments in front of each statement explain what is going on:
346
+
347
+ ```c++
348
+ // Validate that the top-level value is an object: check for {. Increase depth to 2 (root > field).
349
+ ondemand::object top = doc.get_object();
350
+
351
+ // Find the field statuses by:
352
+ // 1. Check whether the object is empty (check for }). (We do not really need to do this unless
353
+ // the key lookup fails!)
354
+ // 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
355
+ // 3. Validate that there is a `:` after it.
356
+ auto tweets_field = top["statuses"];
357
+
358
+ // - Validate that the field value is an array: check for [
359
+ // - If the array is empty (if there is a ] next), decrease depth back to 1 (root).
360
+ // - If not, increase depth to 3 (root > statuses > tweet).
361
+ ondemand::array tweets = tweets_field.get_array();
362
+ // These three method calls do nothing substantial (the real checking happens in get_array() and ++)
363
+ // != checks whether the array is finished (if we found a ] and decreased depth back to 1).
364
+ ondemand::array_iterator tweets_iter = tweets.begin();
365
+ while (tweets_iter != tweets.end()) {
366
+ auto tweet_value = *tweets_iter;
367
+
368
+ // - Validate that the array element is an object: check for {
369
+ // - If the object is empty (if there is a } next), decrease depth back to 2 (root > statuses).
370
+ // - If not, increase depth to 4 (root > statuses > tweet > field).
371
+ ondemand::object tweet = tweet_value.get_object();
372
+ ...
373
+ }
374
+ ```
371
375
376
+ > NOTE: What is not explained in this code expansion is * error chaining* .
377
+ > Generally, you can use ` document ` methods on a ` simdjson_result<...> ` value; any errors will
378
+ > just be passed down the chain. Many method calls
379
+ > can be chained in this manner. So ` for (object tweet : doc["statuses"]) ` , which is the equivalent of
380
+ > ` object tweet = *(doc.get_object()["statuses"].get_array().begin()).get_object() ` , could fail in any of
381
+ > 6 method calls, and the error will only be checked at the end,
382
+ > when you attempt to cast the final ` simdjson_result<object> ` to object. Upon casting, an exception is
383
+ > thrown if there was an error.
384
+
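The error chaining described in the note can be illustrated with a small standalone sketch. This is not simdjson's `simdjson_result` implementation: `sketch_result`, `sketch_error`, `find_field`, and `as_array` are hypothetical names invented for this example, and the integer "handles" stand in for real JSON values. The point is only that each step runs if the previous one succeeded, and otherwise the first error is carried to the end of the chain, where it is checked once.

```cpp
#include <cassert>
#include <string>

// Hypothetical error codes for this sketch; simdjson defines its own error codes.
enum class sketch_error { success, no_such_field, incorrect_type };

// A value-or-error wrapper, loosely modeled on the idea of a result type.
template <typename T>
struct sketch_result {
  T value{};
  sketch_error error{sketch_error::success};

  // Run `next` only if no error has occurred so far;
  // otherwise pass the existing error down the chain untouched.
  template <typename F>
  auto then(F next) const {
    using result_type = decltype(next(value));
    if (error != sketch_error::success) {
      result_type failed{};
      failed.error = error;
      return failed;
    }
    return next(value);
  }
};

// Pretend field lookup: succeeds only for "statuses" and returns a fake handle.
inline sketch_result<int> find_field(const std::string &key) {
  if (key == "statuses") { return {1, sketch_error::success}; }
  return {0, sketch_error::no_such_field};
}

// Pretend conversion to an array: never reached when an earlier step failed.
inline sketch_result<int> as_array(int handle) {
  assert(handle > 0);  // a failed lookup must not get this far
  return {handle + 1, sketch_error::success};
}
```

With this sketch, `find_field("missing").then(as_array)` never calls `as_array`; the caller inspects `error` once, at the end of the chain.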
385
+ ``` json
386
+ {
387
+ "statuses" : [
388
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
389
+ ^ (depth 4 - root > statuses > tweet > field)
390
+
391
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
392
+ ],
393
+ "search_metadata" : { "count" : 2 }
394
+ }
395
+ ```
372
396
373
397
4. We get the `"text"` field as a string.
374
398
@@ -382,45 +406,109 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
382
406
383
407
The second field is matched (` "text" ` ), so we validate the ` : ` and move to the actual value.
384
408
385
- NOTE: ` ["text"] ` does a * raw match* , comparing the key directly against the raw JSON. This means
386
- that keys with escapes in them may not be matched and the letter case must match exactly.
409
+ > NOTE: ` ["text"] ` does a * raw match* , comparing the key directly against the raw JSON. This
410
+ > allows simdjson to do field lookup very, very quickly when the keys you want to match have
411
+ > letters, numbers and punctuation. However, this means that fields with escapes in them will not
412
+ > be matched.
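
To make the raw match concrete, here is a minimal sketch of comparing a key against the raw JSON bytes. It is an illustration under stated assumptions, not simdjson's code: `raw_key_equals` is a hypothetical helper, and it assumes the caller has already positioned `json` at the opening quote of a key. Because the bytes are compared without unescaping, a key written with an escape sequence in the document does not match its unescaped spelling.

```cpp
#include <cassert>
#include <string_view>

// Returns true if the raw JSON at `json` starts with the quoted key "key".
inline bool raw_key_equals(std::string_view json, std::string_view key) {
  // Precondition for this sketch: the caller already found the opening quote.
  assert(!json.empty() && json.front() == '"');
  // Need room for the opening quote, the key bytes, and the closing quote.
  if (json.size() < key.size() + 2) { return false; }
  // Byte-by-byte comparison against the raw JSON: no unescaping, no case folding.
  if (json.substr(1, key.size()) != key) { return false; }
  // The key in the document must end exactly where ours does.
  return json[key.size() + 1] == '"';
}
```

For example, the raw text `"te\u0078t"` spells `text` once unescaped, but this comparison rejects it when looking up `text`.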
387
413
388
414
To convert to a string, we check for ` " ` and use simdjson's fast unescaping algorithm to copy
389
415
` first! ` (plus a terminating ` \0 ` ) into a buffer managed by the ` document ` . This buffer stores
390
416
all strings from a single iteration. The next string will be written after the ` \0 ` .
391
417
392
418
A ` string_view ` is returned which points to that buffer, and contains the length.
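
The string buffer behavior described above can be sketched as follows. This is a simplified illustration, not the library's buffer: `string_buffer_sketch` is a hypothetical class with a fixed capacity chosen by the caller, and it copies text verbatim instead of unescaping it. It shows the two properties the walkthrough relies on: each string is followed by a `\0`, and each returned `string_view` points into the shared buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string_view>

// Fixed-capacity buffer: storage is never reallocated, so views into it stay valid.
class string_buffer_sketch {
public:
  explicit string_buffer_sketch(std::size_t capacity)
      : storage_(new char[capacity]), capacity_(capacity) {}

  // Copy `text` plus a terminating '\0' to the end of the buffer
  // and return a view of the copy (the view excludes the '\0').
  std::string_view append(std::string_view text) {
    assert(used_ + text.size() + 1 <= capacity_);  // this sketch does not grow
    char *dst = storage_.get() + used_;
    if (!text.empty()) { std::memcpy(dst, text.data(), text.size()); }
    dst[text.size()] = '\0';
    used_ += text.size() + 1;
    return std::string_view(dst, text.size());
  }

  // Everything written so far, including the '\0' terminators.
  std::string_view contents() const {
    return std::string_view(storage_.get(), used_);
  }

private:
  std::unique_ptr<char[]> storage_;
  std::size_t capacity_;
  std::size_t used_ = 0;
};
```

Appending `first!` and then `lemire` leaves the buffer holding `first!\0lemire\0`, with the second view starting right after the first terminator.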
393
419
420
+ We advance to the comma, and decrease depth to 3 (root > statuses > tweet).
421
+
422
+ At this point, we are here in the JSON:
423
+
424
+ ``` json
425
+ {
426
+ "statuses" : [
427
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
428
+ ^ (depth 3 - root > statuses > tweet)
429
+
430
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
431
+ ],
432
+ "search_metadata" : { "count" : 2 }
433
+ }
434
+ ```
435
+
394
436
4. We get the `"screen_name"` from the `"user"` object.
395
437
396
438
``` c++
397
439
ondemand::object user = tweet["user"];
398
440
screen_name = user["screen_name"];
399
441
```
400
442
401
- First, ` ["user"] ` checks whether there are any more object fields by looking for either ` , ` or
402
- ` } ` . Then it matches ` "user" ` and validates the ` : ` .
443
+ First, ` ["user"] ` finds the ` , ` , discovers the next key is ` "user" ` , validates that the ` : `
444
+ is there, and increases depth to 4 (root > statuses > tweet > field).
403
445
404
- ` ["screen_name"] ` then converts to object, checking for ` { ` , and finds ` "screen_name" ` .
446
+ Next, the cast to ondemand::object checks for ` { ` and increases depth to 5 (root > statuses >
447
+ tweet > user > field).
448
+
449
+ ` ["screen_name"] ` finds the first field ` "screen_name" ` and validates the ` : ` .
405
450
406
451
To convert the result to a usable string (i.e., the screen name `lemire`), the characters are written to the document's
407
452
string buffer (after possibly unescaping them), which now has *two* string_views pointing into it, and looks like `first!\0lemire\0`.
408
453
409
- Finally, the temporary user object is destroyed, causing it to skip the remainder of the object
410
- (` } ` ).
454
+ The iterator advances to the comma and decreases depth back to 4 (root > statuses > tweet > user).
455
+
456
+ At this point, the iterator is here in the JSON:
411
457
412
- NOTE: You may only have one active array or object active at any given time. An array or an object becomes
413
- active when the ` ondemand::object ` or ` ondemand::array ` is created, and it releases its 'focus' when
414
- its destructor is called. If you create an array or an object located inside a parent object or array,
415
- the child array or object becomes active while the parent becomes temporarily inactive. If you access
416
- several sibling objects or arrays, you must ensure that the destructor is called by scoping each access
417
- (see Iteration Safety section below for further details).
458
+ ``` json
459
+ {
460
+ "statuses" : [
461
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
462
+ ^ (depth 4 - root > statuses > tweet > user)
463
+
464
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
465
+ ],
466
+ "search_metadata" : { "count" : 2 }
467
+ }
468
+ ```
418
469
419
- 5 . We get ` "retweet_count" ` and ` "favorite_count" ` as unsigned integers .
470
+ 5. We get `"retweet_count"` as an unsigned integer.
420
471
421
472
``` c++
422
473
uint64_t retweets = tweet["retweet_count"];
423
- uint64_t favorites = tweet["favorite_count"];
474
+ ```
475
+
476
+ First, ` ["retweet_count"] ` checks whether the previous field value is finished (if it was, depth
477
+ would be 3: root > statuses > tweet). Since it is not, we skip JSON until depth is 3. This brings
478
+ the iterator to the ` , ` after the user object:
479
+
480
+ ``` json
481
+ {
482
+ "statuses" : [
483
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
484
+ ^ (depth 3 - root > statuses > tweet)
485
+
486
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
487
+ ],
488
+ "search_metadata" : { "count" : 2 }
489
+ }
490
+ ```
491
+
492
+ Because of the cast to uint64_t, simdjson knows it's parsing an unsigned integer. This lets
493
+ us use a fast parser which * only* knows how to parse digits. It validates that it is an integer
494
+ by rejecting negative numbers, strings, and other values based on the fact that they are not the
495
+ digits 0-9. This type specificity is part of why parsing with On Demand is so fast: you lose all
496
+ the code that has to understand those other types.
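
A digits-only parser of the kind described above might look like the following sketch. It is not simdjson's number parser: `parse_unsigned_sketch` is a hypothetical function, and its handling of leading zeros, overflow, and terminating characters reflects assumptions made for this example. It has no code paths for signs, quotes, fractions, or exponents, so those inputs fall out as errors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>

// Parse an unsigned integer at the start of `text`.
// Returns std::nullopt for anything that is not a plain run of digits 0-9.
inline std::optional<std::uint64_t> parse_unsigned_sketch(std::string_view text) {
  constexpr std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
  std::size_t i = 0;
  std::uint64_t value = 0;
  while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
    const std::uint64_t digit = static_cast<std::uint64_t>(text[i] - '0');
    // Reject values that do not fit in 64 bits.
    if (value > (max - digit) / 10) { return std::nullopt; }
    value = value * 10 + digit;
    ++i;
  }
  // No digits at all: the input started with '-', '"', '{', 't', and so on.
  if (i == 0) { return std::nullopt; }
  // JSON does not allow leading zeros such as 012.
  if (i > 1 && text[0] == '0') { return std::nullopt; }
  // The number must end at a structural character, whitespace, or end of input.
  if (i < text.size()) {
    const char c = text[i];
    const bool terminator = c == ',' || c == '}' || c == ']' ||
                            c == ' ' || c == '\n' || c == '\t' || c == '\r';
    if (!terminator) { return std::nullopt; }  // e.g. 1.5 or 1e3
  }
  return value;
}
```

The function never branches on what the non-digit input actually was; anything other than digits followed by a valid terminator is rejected.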
497
+
498
+ The iterator is advanced to the ` } ` , and depth decreased back to 3 (root > statuses > tweet).
499
+
500
+ At this point, we are here in the JSON:
501
+
502
+ ``` json
503
+ {
504
+ "statuses" : [
505
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
506
+ ^ (depth 3 - root > statuses > tweet)
507
+
508
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
509
+ ],
510
+ "search_metadata" : { "count" : 2 }
511
+ }
424
512
```
425
513
426
514
6. We loop to the next tweet.
@@ -441,25 +529,77 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
441
529
}
442
530
```
443
531
444
- First, the ` tweet ` destructor runs, skipping the remainder of the object which in this case is
445
- just ` } ` .
532
+ First, ` iter++ ` (remember, this is the array of tweets) checks whether the previous object was
533
+ fully iterated. It was not--depth is 3 (root > statuses > tweet), so we skip until it's 2--which
534
+ in this case just means consuming the ` } ` , leaving the iterator at the next comma. Depth is now 2
535
+ (root > statuses).
536
+
537
+ Next, ` iter++ ` finds the ` , ` and advances past it to the ` { ` , increasing depth to 3 (root >
538
+ statuses > tweet).
539
+
540
+ Finally, ` ondemand::object tweet = *iter ` validates the ` { ` and increases depth to 4 (root >
541
+ statuses > tweet > field). This leaves the iterator here:
542
+
543
+ ``` json
544
+ {
545
+ "statuses" : [
546
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
547
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
548
+ ^ (depth 4 - root > statuses > tweet > field)
549
+ ],
550
+ "search_metadata" : { "count" : 2 }
551
+ }
552
+ ```
553
+
554
+ 7. This tweet is processed just like the previous one, leaving the iterator here:
555
+
556
+ ``` json
557
+ {
558
+ "statuses" : [
559
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
560
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
561
+ ^ (depth 3 - root > statuses > tweet)
562
+ ],
563
+ "search_metadata" : { "count" : 2 }
564
+ }
565
+ ```
446
566
447
- Next, ` iter++ ` checks whether there are more values and finds ` , ` . The loop continues.
567
+ 8. The loop ends. Recall the relevant parts of the statuses loop:
568
+
569
+ ``` c++
570
+ while (iter != statuses.end()) {
571
+ ondemand::object tweet = *iter;
572
+ ...
573
+ iter++;
574
+ }
575
+ ```
448
576
449
- Finally , ` ondemand::object tweet = *iter ` checks for ` { ` and returns the object .
577
+ First, `iter++` finishes up any children, consuming the `}` and leaving depth at 2 (root > statuses).
450
578
451
- This tweet is processed just like the previous one.
579
+ Next, `iter++` notices the `]` and ends the array by decreasing depth to 1. This leaves the iterator
580
+ here in the JSON:
452
581
453
- 7 . We finish the last tweet.
582
+ ```json
583
+ {
584
+ "statuses": [
585
+ { "id": 1, "text": "first!", "user": { "screen_name": "lemire", "name": "Daniel" }, "retweet_count": 40 },
586
+ { "id": 2, "text": "second!", "user": { "screen_name": "jkeiser2", "name": "John" }, "retweet_count": 3 }
587
+ ],
588
+ ^ (depth 1 - root)
589
+ "search_metadata": { "count": 2 }
590
+ }
591
+ ```
454
592
455
- At the end of the loop, the ` tweet ` is first destroyed, skipping the remainder of the tweet
456
- object (` } ` ).
593
+ 9. The remainder of the file is skipped.
457
594
458
- The ` iter++ ` instruction from ` for (ondemand::object tweet : doc["statuses"]) ` then checks whether there are
459
- more values and finds that there are none (` ] ` ). It marks the array iteration as finished and the for
460
- loop terminates.
595
+ Because no more action is taken, JSON processing stops: processing only occurs when you ask for
596
+ values.
461
597
462
- Then the outer object is destroyed, skipping everything up to the ` } ` .
598
+ This means you can very efficiently do things like read a single value from a JSON file, or take
599
+ the top N, for example. It also means the things you don't use won't be fully validated. This is
600
+ a general principle of On Demand: don't validate what you don't use. We still fully validate
601
+ values you do use, however, as well as the objects and arrays that lead to them, so that you can
602
+ be sure you get the information you need.
463
603
464
604
Design Features
465
605
---------------