@@ -369,6 +369,213 @@ This helps as we redefine some new characters as pseudo-structural such as the c
369
369
370
370
> { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
371
371
372
+
373
+
374
+ ### UTF-8 validation (lookup2)
375
+
376
+ The simdjson library relies on the lookup2 algorithm for UTF-8 validation on x64 platforms.
377
+
378
+ This algorithm validate the length of multibyte characters (that each multibyte character has the right number of continuation characters, and that all continuation characters are part of a multibyte character).
379
+
380
+ #### Algorithm
381
+
382
+ This algorithm compares * expected* continuation characters with * actual* continuation bytes, and emits an error anytime there is a mismatch.
383
+
384
+ For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
385
+ characters, the file will look like this:
386
+
387
+ | Character | 𝄞 | | | | ₿ | | | ֏ | | a | b |
388
+ | -----------------------| ----| ----| ----| ----| ----| ----| ----| ----| ----| ----| ----|
389
+ | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
390
+ | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
391
+ | is_second_byte | | X | | | | X | | | X | | |
392
+ | is_third_byte | | | X | | | | X | | | | |
393
+ | is_fourth_byte | | | | X | | | | | | | |
394
+ | expected_continuation | | X | X | X | | X | X | | X | | |
395
+ | is_continuation | | X | X | X | | X | X | | X | | |
396
+
397
+ The errors here are basically (Second Byte OR Third Byte OR Fourth Byte == Continuation):
398
+
399
+ - ** Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
400
+ part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
401
+ floating around extra outside of any character, or that there is an illegal 5-byte character,
402
+ or maybe it's at the beginning of the file before any characters have started; but it's an
403
+ error in all these cases.
404
+ - ** Missing Continuations:** Any second, third or fourth byte that * isn't* a continuation is an error, because that means
405
+ we started a new character before we were finished with the current one.
406
+
407
+ #### Getting the Previous Bytes
408
+
409
+ Because we want to know if a byte is the * second* (or third, or fourth) byte of a multibyte
410
+ character, we need to "shift the bytes" to find that out. This is what they mean:
411
+
412
+ - ` is_continuation ` : if the current byte is a continuation.
413
+ - ` is_second_byte ` : if 1 byte back is the start of a 2-, 3- or 4-byte character.
414
+ - ` is_third_byte ` : if 2 bytes back is the start of a 3- or 4-byte character.
415
+ - ` is_fourth_byte ` : if 3 bytes back is the start of a 4-byte character.
416
+
417
+ We use shuffles to go n bytes back, selecting part of the current ` input ` and part of the
418
+ ` prev_input ` (search for ` .prev<1> ` , ` .prev<2> ` , etc.). These are passed in by the caller
419
+ function, because the 1-byte-back data is used by other checks as well.
420
+
421
+ #### Getting the Continuation Mask
422
+
423
+ Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
424
+ numbers, using signed ` < ` and ` > ` operations to check if they are continuations or leads.
425
+ In fact, we treat the numbers as * signed* , partly because it helps us, and partly because
426
+ Intel's SIMD presently only offers signed ` < ` and ` > ` operations (not unsigned ones).
427
+
428
+ In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
429
+ respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
430
+ Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
431
+
432
+ When treated as signed numbers, they look like this:
433
+
434
+ | Type | High Bits | Binary Range | Signed |
435
+ | --------------| ------------| --------------| --------|
436
+ | ASCII | ` 0 ` | ` 01111111 ` | 127 |
437
+ | | | ` 00000000 ` | 0 |
438
+ | 4+-Byte Lead | ` 1111 ` | ` 11111111 ` | -1 |
439
+ | | | `11110000 | -16 |
440
+ | 3-Byte Lead | ` 1110 ` | ` 11101111 ` | -17 |
441
+ | | | `11100000 | -32 |
442
+ | 2-Byte Lead | ` 110 ` | ` 11011111 ` | -33 |
443
+ | | | `11000000 | -64 |
444
+ | Continuation | ` 10 ` | ` 10111111 ` | -65 |
445
+ | | | `10000000 | -128 |
446
+
447
+ This makes it pretty easy to get the continuation mask! It's just a single comparison:
448
+
449
+ ```
450
+ is_continuation = input < -64`
451
+ ```
452
+
453
+ We can do something similar for the others, but it takes two comparisons instead of one: "is
454
+ the start of a 4-byte character" is ` < -32 ` and ` > -65 ` , for example. And 2+ bytes is ` < 0 ` and
455
+ ` > -64 ` . Surely we can do better, they're right next to each other!
456
+
457
+ #### Getting the is_xxx Masks: Shifting the Range
458
+
459
+ Notice * why* continuations were a single comparison. The actual * range* would require two
460
+ comparisons--` < -64 ` and ` > -129 ` --but all characters are always greater than -128, so we get
461
+ that for free. In fact, if we had * unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
462
+ just as easy: 4+ would be ` > 239 ` , 3+ would be ` > 223 ` , and 2+ would be ` > 191 ` .
463
+
464
+ Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
465
+ ASCII down into the negative, and puts 4+-Byte Lead at the top:
466
+
467
+ | Type | High Bits | Binary Range | Signed |
468
+ | ----------------------| ------------| --------------| -------|
469
+ | 4+-Byte Lead (+ 127) | ` 0111 ` | ` 01111111 ` | 127 |
470
+ | | | `01110000 | 112 |
471
+ | ----------------------| ------------| --------------| -------|
472
+ | 3-Byte Lead (+ 127) | ` 0110 ` | ` 01101111 ` | 111 |
473
+ | | | `01100000 | 96 |
474
+ | ----------------------| ------------| --------------| -------|
475
+ | 2-Byte Lead (+ 127) | ` 010 ` | ` 01011111 ` | 95 |
476
+ | | | `01000000 | 64 |
477
+ | ----------------------| ------------| --------------| -------|
478
+ | Continuation (+ 127) | ` 00 ` | ` 00111111 ` | 63 |
479
+ | | | `00000000 | 0 |
480
+ | ----------------------| ------------| --------------| -------|
481
+ | ASCII (+ 127) | ` 1 ` | ` 11111111 ` | -1 |
482
+ | | | ` 10000000 ` | -128 |
483
+ | ----------------------| ------------| --------------| -------|
484
+
485
+ * Now* we can use signed ` > ` on all of them:
486
+
487
+ ```
488
+ prev1 = input.prev<1>
489
+ prev2 = input.prev<2>
490
+ prev3 = input.prev<3>
491
+ prev1_flipped = input.prev<1>(prev_input) ^ 0x80; // Same as `+ 128`
492
+ prev2_flipped = input.prev<2>(prev_input) ^ 0x80; // Same as `+ 128`
493
+ prev3_flipped = input.prev<3>(prev_input) ^ 0x80; // Same as `+ 128`
494
+ is_second_byte = prev1_flipped > 63;2+-byte lead
495
+ is_third_byte = prev2_flipped > 95;3+-byte lead
496
+ is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
497
+ ```
498
+
499
+ NOTE: we use ` ^ 0x80 ` instead of ` + 128 ` in the code, which accomplishes the same thing, and even takes the same number
500
+ of cycles as ` + ` , but on many Intel architectures can be parallelized better (you can do 3
501
+ ` ^ ` 's at a time on Haswell, but only 2 ` + ` 's).
502
+
503
+ That doesn't look like it saved us any instructions, did it? Well, because we're adding the
504
+ same number to all of them, we can save one of those ` + 128 ` operations by assembling
505
+ ` prev2_flipped ` out of prev 1 and prev 3 instead of assembling it from input and adding 128
506
+ to it. One more instruction saved!
507
+
508
+ ```
509
+ prev1 = input.prev<1>
510
+ prev3 = input.prev<3>
511
+ prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
512
+ prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
513
+ prev2_flipped = prev1_flipped.concat<2>(prev3_flipped): // <shuffle: take the first 2 bytes from prev1 and the rest from prev3
514
+ ```
515
+
516
+ #### Bringing It All Together: Detecting the Errors
517
+
518
+ At this point, we have ` is_continuation ` , ` is_first_byte ` , ` is_second_byte ` and ` is_third_byte ` .
519
+ All we have left to do is check if they match!
520
+
521
+ ```
522
+ return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
523
+ ```
524
+
525
+ But wait--there's more. The above statement is only 3 operations, but they * cannot be done in
526
+ parallel* . You have to do 2 ` | ` 's and then 1 ` & ` . Haswell, at least, has 3 ports that can do
527
+ bitwise operations, and we're only using 1!
528
+
529
+ #### Epilogue: Addition For Booleans
530
+
531
+ There is one big case the above code doesn't explicitly talk about--what if is_second_byte
532
+ and is_third_byte are BOTH true? That means there is a 3-byte and 2-byte character right next
533
+ to each other (or any combination), and the continuation could be part of either of them!
534
+ Our algorithm using ` & ` and ` | ` won't detect that the continuation byte is problematic.
535
+
536
+ Never fear, though. If that situation occurs, we'll already have detected that the second
537
+ leading byte was an error, because it was supposed to be a part of the preceding multibyte
538
+ character, but it * wasn't a continuation* .
539
+
540
+ We could stop here, but it turns out that we can fix it using ` + ` and ` - ` instead of ` | ` and
541
+ ` & ` , which is both interesting and possibly useful (even though we're not using it here). It
542
+ exploits the fact that in SIMD, a * true* value is -1, and a * false* value is 0. So those
543
+ comparisons were giving us numbers!
544
+
545
+ Given that, if you do ` is_second_byte + is_third_byte + is_fourth_byte ` , under normal
546
+ circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
547
+ ` (is_second_byte + is_third_byte + is_fourth_byte) - is_continuation ` will yield 0 only if
548
+ * both* or * neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
549
+ * any* nonzero value is treated as an error (not just -1), we're just fine here :)
550
+
551
+ Further, if * more than one* multibyte character overlaps,
552
+ ` is_second_byte + is_third_byte + is_fourth_byte ` will be -2 or -3! Subtracting ` is_continuation `
553
+ from * that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
554
+ considered an error.
555
+
556
+ One reason you might want to do this is parallelism. ^ and | are not associative, so
557
+ (A | B | C) ^ D will always be three operations in a row: either you do A | B -> | C -> ^ D, or
558
+ you do B | C -> | A -> ^ D. But addition and subtraction * are* associative: (A + B + C) - D can
559
+ be written as ` (A + B) + (C - D) ` . This means you can do A + B and C - D at the same time, and
560
+ then adds the result together. Same number of operations, but if the processor can run
561
+ independent things in parallel (which most can), it runs faster.
562
+
563
+ This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
564
+ a super nice advantage in that more of them can be run at the same time (they can run on 3
565
+ ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
566
+ saving us the cycle we would have earned by using +. Even more, using an instruction with a
567
+ wider array of ports can help * other* code run ahead, too, since these instructions can "get
568
+ out of the way," running on a port other instructions can't.
569
+
570
+ #### Epilogue II: One More Trick
571
+
572
+ There's one more relevant trick up our sleeve, it turns out: it turns out on Intel we can "pay
573
+ for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
574
+ check_special_cases()--but we'll talk about that there :)
575
+
576
+
577
+
578
+
372
579
## About the Project
373
580
374
581
### Bindings and Ports of simdjson
@@ -420,6 +627,8 @@ make allparsingcompetition
420
627
Both the ` parsingcompetition ` and ` allparsingcompetition ` tools take a ` -t ` flag which produces
421
628
a table-oriented output that can be conveniently parsed by other tools.
422
629
630
+
631
+
423
632
### Various References
424
633
425
634
- [ Google double-conv] ( https://github.com/google/double-conversion/ )
0 commit comments