Skip to content

Commit 12150ba

Browse files
authored
Using just ASCII. (simdjson#899)
* Using just ASCII. * Let us prune checkperf. * Moving the description of lookup2 to the HACKING.md file.
1 parent 219b02c commit 12150ba

File tree

9 files changed

+259
-242
lines changed

9 files changed

+259
-242
lines changed

CMakeLists.txt

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -38,13 +38,27 @@ add_subdirectory(singleheader)
3838
#
3939
# Compile tools / tests / benchmarks
4040
#
41-
4241
add_subdirectory(dependencies)
4342
add_subdirectory(tests)
4443
add_subdirectory(examples)
4544
add_subdirectory(benchmark)
4645
add_subdirectory(fuzz)
4746

47+
#
48+
# Source files should be just ASCII
49+
#
50+
find_program(FIND find)
51+
find_program(FILE file)
52+
find_program(GREP grep)
53+
if((FIND) AND (FILE) AND (GREP))
54+
add_test(
55+
NAME "just_ascii"
56+
COMMAND sh -c "${FIND} include src windows tools singleheader tests examples benchmark -path benchmark/checkperf-reference -prune -name '*.h' -o -name '*.cpp' -type f -exec ${FILE} '{}' \; |${GREP} -v ASCII || exit 0 && exit 1"
57+
WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
58+
)
59+
endif()
60+
61+
4862
#
4963
# CPack
5064
#

CONTRIBUTING.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,7 @@ We have few hard rules, but we have some:
4040

4141
- Printing to standard output or standard error (`stderr`, `stdout`, `std::cerr`, `std::cout`) in the core library is forbidden. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Compiled code should not write to stdout or stderr".
4242
- Calls to `abort()` are forbidden in the core library. This follows from the [Writing R Extensions](https://cran.r-project.org/doc/manuals/R-exts.html) manual which states that "Under no circumstances should your compiled code ever call abort or exit".
43+
- All source code files (.h, .cpp) must be ASCII.
4344

4445
Tools, tests and benchmarks are not held to these same strict rules.
4546

HACKING.md

Lines changed: 209 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -369,6 +369,213 @@ This helps as we redefine some new characters as pseudo-structural such as the c
369369

370370
> { "foo" : 1.5, "bar" : 1.5 GEOFF_IS_A_DUMMY bla bla , "baz", null }
371371
372+
373+
374+
### UTF-8 validation (lookup2)
375+
376+
The simdjson library relies on the lookup2 algorithm for UTF-8 validation on x64 platforms.
377+
378+
This algorithm validate the length of multibyte characters (that each multibyte character has the right number of continuation characters, and that all continuation characters are part of a multibyte character).
379+
380+
#### Algorithm
381+
382+
This algorithm compares *expected* continuation characters with *actual* continuation bytes, and emits an error anytime there is a mismatch.
383+
384+
For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
385+
characters, the file will look like this:
386+
387+
| Character | 𝄞 | | | || | | ֏ | | a | b |
388+
|-----------------------|----|----|----|----|----|----|----|----|----|----|----|
389+
| Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
390+
| Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
391+
| is_second_byte | | X | | | | X | | | X | | |
392+
| is_third_byte | | | X | | | | X | | | | |
393+
| is_fourth_byte | | | | X | | | | | | | |
394+
| expected_continuation | | X | X | X | | X | X | | X | | |
395+
| is_continuation | | X | X | X | | X | X | | X | | |
396+
397+
The errors here are basically (Second Byte OR Third Byte OR Fourth Byte == Continuation):
398+
399+
- **Extra Continuations:** Any continuation that is not a second, third or fourth byte is not
400+
part of a valid 2-, 3- or 4-byte character and is thus an error. It could be that it's just
401+
floating around extra outside of any character, or that there is an illegal 5-byte character,
402+
or maybe it's at the beginning of the file before any characters have started; but it's an
403+
error in all these cases.
404+
- **Missing Continuations:** Any second, third or fourth byte that *isn't* a continuation is an error, because that means
405+
we started a new character before we were finished with the current one.
406+
407+
#### Getting the Previous Bytes
408+
409+
Because we want to know if a byte is the *second* (or third, or fourth) byte of a multibyte
410+
character, we need to "shift the bytes" to find that out. This is what they mean:
411+
412+
- `is_continuation`: if the current byte is a continuation.
413+
- `is_second_byte`: if 1 byte back is the start of a 2-, 3- or 4-byte character.
414+
- `is_third_byte`: if 2 bytes back is the start of a 3- or 4-byte character.
415+
- `is_fourth_byte`: if 3 bytes back is the start of a 4-byte character.
416+
417+
We use shuffles to go n bytes back, selecting part of the current `input` and part of the
418+
`prev_input` (search for `.prev<1>`, `.prev<2>`, etc.). These are passed in by the caller
419+
function, because the 1-byte-back data is used by other checks as well.
420+
421+
#### Getting the Continuation Mask
422+
423+
Once we have the right bytes, we have to get the masks. To do this, we treat UTF-8 bytes as
424+
numbers, using signed `<` and `>` operations to check if they are continuations or leads.
425+
In fact, we treat the numbers as *signed*, partly because it helps us, and partly because
426+
Intel's SIMD presently only offers signed `<` and `>` operations (not unsigned ones).
427+
428+
In UTF-8, bytes that start with the bits 110, 1110 and 11110 are 2-, 3- and 4-byte "leads,"
429+
respectively, meaning they expect to have 1, 2 and 3 "continuation bytes" after them.
430+
Continuation bytes start with 10, and ASCII (1-byte characters) starts with 0.
431+
432+
When treated as signed numbers, they look like this:
433+
434+
| Type | High Bits | Binary Range | Signed |
435+
|--------------|------------|--------------|--------|
436+
| ASCII | `0` | `01111111` | 127 |
437+
| | | `00000000` | 0 |
438+
| 4+-Byte Lead | `1111` | `11111111` | -1 |
439+
| | | `11110000 | -16 |
440+
| 3-Byte Lead | `1110` | `11101111` | -17 |
441+
| | | `11100000 | -32 |
442+
| 2-Byte Lead | `110` | `11011111` | -33 |
443+
| | | `11000000 | -64 |
444+
| Continuation | `10` | `10111111` | -65 |
445+
| | | `10000000 | -128 |
446+
447+
This makes it pretty easy to get the continuation mask! It's just a single comparison:
448+
449+
```
450+
is_continuation = input < -64`
451+
```
452+
453+
We can do something similar for the others, but it takes two comparisons instead of one: "is
454+
the start of a 4-byte character" is `< -32` and `> -65`, for example. And 2+ bytes is `< 0` and
455+
`> -64`. Surely we can do better, they're right next to each other!
456+
457+
#### Getting the is_xxx Masks: Shifting the Range
458+
459+
Notice *why* continuations were a single comparison. The actual *range* would require two
460+
comparisons--`< -64` and `> -129`--but all characters are always greater than -128, so we get
461+
that for free. In fact, if we had *unsigned* comparisons, 2+, 3+ and 4+ comparisons would be
462+
just as easy: 4+ would be `> 239`, 3+ would be `> 223`, and 2+ would be `> 191`.
463+
464+
Instead, we add 128 to each byte, shifting the range up to make comparison easy. This wraps
465+
ASCII down into the negative, and puts 4+-Byte Lead at the top:
466+
467+
| Type | High Bits | Binary Range | Signed |
468+
|----------------------|------------|--------------|-------|
469+
| 4+-Byte Lead (+ 127) | `0111` | `01111111` | 127 |
470+
| | | `01110000 | 112 |
471+
|----------------------|------------|--------------|-------|
472+
| 3-Byte Lead (+ 127) | `0110` | `01101111` | 111 |
473+
| | | `01100000 | 96 |
474+
|----------------------|------------|--------------|-------|
475+
| 2-Byte Lead (+ 127) | `010` | `01011111` | 95 |
476+
| | | `01000000 | 64 |
477+
|----------------------|------------|--------------|-------|
478+
| Continuation (+ 127) | `00` | `00111111` | 63 |
479+
| | | `00000000 | 0 |
480+
|----------------------|------------|--------------|-------|
481+
| ASCII (+ 127) | `1` | `11111111` | -1 |
482+
| | | `10000000` | -128 |
483+
|----------------------|------------|--------------|-------|
484+
485+
*Now* we can use signed `>` on all of them:
486+
487+
```
488+
prev1 = input.prev<1>
489+
prev2 = input.prev<2>
490+
prev3 = input.prev<3>
491+
prev1_flipped = input.prev<1>(prev_input) ^ 0x80; // Same as `+ 128`
492+
prev2_flipped = input.prev<2>(prev_input) ^ 0x80; // Same as `+ 128`
493+
prev3_flipped = input.prev<3>(prev_input) ^ 0x80; // Same as `+ 128`
494+
is_second_byte = prev1_flipped > 63;2+-byte lead
495+
is_third_byte = prev2_flipped > 95;3+-byte lead
496+
is_fourth_byte = prev3_flipped > 111; // 4+-byte lead
497+
```
498+
499+
NOTE: we use `^ 0x80` instead of `+ 128` in the code, which accomplishes the same thing, and even takes the same number
500+
of cycles as `+`, but on many Intel architectures can be parallelized better (you can do 3
501+
`^`'s at a time on Haswell, but only 2 `+`'s).
502+
503+
That doesn't look like it saved us any instructions, did it? Well, because we're adding the
504+
same number to all of them, we can save one of those `+ 128` operations by assembling
505+
`prev2_flipped` out of prev 1 and prev 3 instead of assembling it from input and adding 128
506+
to it. One more instruction saved!
507+
508+
```
509+
prev1 = input.prev<1>
510+
prev3 = input.prev<3>
511+
prev1_flipped = prev1 ^ 0x80; // Same as `+ 128`
512+
prev3_flipped = prev3 ^ 0x80; // Same as `+ 128`
513+
prev2_flipped = prev1_flipped.concat<2>(prev3_flipped): // <shuffle: take the first 2 bytes from prev1 and the rest from prev3
514+
```
515+
516+
#### Bringing It All Together: Detecting the Errors
517+
518+
At this point, we have `is_continuation`, `is_first_byte`, `is_second_byte` and `is_third_byte`.
519+
All we have left to do is check if they match!
520+
521+
```
522+
return (is_second_byte | is_third_byte | is_fourth_byte) ^ is_continuation;
523+
```
524+
525+
But wait--there's more. The above statement is only 3 operations, but they *cannot be done in
526+
parallel*. You have to do 2 `|`'s and then 1 `&`. Haswell, at least, has 3 ports that can do
527+
bitwise operations, and we're only using 1!
528+
529+
#### Epilogue: Addition For Booleans
530+
531+
There is one big case the above code doesn't explicitly talk about--what if is_second_byte
532+
and is_third_byte are BOTH true? That means there is a 3-byte and 2-byte character right next
533+
to each other (or any combination), and the continuation could be part of either of them!
534+
Our algorithm using `&` and `|` won't detect that the continuation byte is problematic.
535+
536+
Never fear, though. If that situation occurs, we'll already have detected that the second
537+
leading byte was an error, because it was supposed to be a part of the preceding multibyte
538+
character, but it *wasn't a continuation*.
539+
540+
We could stop here, but it turns out that we can fix it using `+` and `-` instead of `|` and
541+
`&`, which is both interesting and possibly useful (even though we're not using it here). It
542+
exploits the fact that in SIMD, a *true* value is -1, and a *false* value is 0. So those
543+
comparisons were giving us numbers!
544+
545+
Given that, if you do `is_second_byte + is_third_byte + is_fourth_byte`, under normal
546+
circumstances you will either get 0 (0 + 0 + 0) or -1 (-1 + 0 + 0, etc.). Thus,
547+
`(is_second_byte + is_third_byte + is_fourth_byte) - is_continuation` will yield 0 only if
548+
*both* or *neither* are 0 (0-0 or -1 - -1). You'll get 1 or -1 if they are different. Because
549+
*any* nonzero value is treated as an error (not just -1), we're just fine here :)
550+
551+
Further, if *more than one* multibyte character overlaps,
552+
`is_second_byte + is_third_byte + is_fourth_byte` will be -2 or -3! Subtracting `is_continuation`
553+
from *that* is guaranteed to give you a nonzero value (-1, -2 or -3). So it'll always be
554+
considered an error.
555+
556+
One reason you might want to do this is parallelism. ^ and | are not associative, so
557+
(A | B | C) ^ D will always be three operations in a row: either you do A | B -> | C -> ^ D, or
558+
you do B | C -> | A -> ^ D. But addition and subtraction *are* associative: (A + B + C) - D can
559+
be written as `(A + B) + (C - D)`. This means you can do A + B and C - D at the same time, and
560+
then adds the result together. Same number of operations, but if the processor can run
561+
independent things in parallel (which most can), it runs faster.
562+
563+
This doesn't help us on Intel, but might help us elsewhere: on Haswell, at least, | and ^ have
564+
a super nice advantage in that more of them can be run at the same time (they can run on 3
565+
ports, while + and - can run on 2)! This means that we can do A | B while we're still doing C,
566+
saving us the cycle we would have earned by using +. Even more, using an instruction with a
567+
wider array of ports can help *other* code run ahead, too, since these instructions can "get
568+
out of the way," running on a port other instructions can't.
569+
570+
#### Epilogue II: One More Trick
571+
572+
There's one more relevant trick up our sleeve, it turns out: it turns out on Intel we can "pay
573+
for" the (prev<1> + 128) instruction, because it can be used to save an instruction in
574+
check_special_cases()--but we'll talk about that there :)
575+
576+
577+
578+
372579
## About the Project
373580

374581
### Bindings and Ports of simdjson
@@ -420,6 +627,8 @@ make allparsingcompetition
420627
Both the `parsingcompetition` and `allparsingcompetition` tools take a `-t` flag which produces
421628
a table-oriented output that can be conveniently parsed by other tools.
422629

630+
631+
423632
### Various References
424633

425634
- [Google double-conv](https://github.com/google/double-conversion/)

singleheader/amalgamate_demo.cpp

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
1+
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
22

33
#include <iostream>
44
#include "simdjson.h"

singleheader/simdjson.cpp

Lines changed: 15 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
/* auto-generated on Wed May 20 10:23:07 EDT 2020. Do not edit! */
1+
/* auto-generated on Thu 21 May 2020 14:01:15 EDT. Do not edit! */
22
/* begin file src/simdjson.cpp */
33
#include "simdjson.h"
44

@@ -3180,10 +3180,10 @@ namespace utf8_validation {
31803180
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
31813181
// and emits an error anytime there is a mismatch.
31823182
//
3183-
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
3183+
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
31843184
// characters, the file will look like this:
31853185
//
3186-
// | Character | 𝄞 | | | | | | | ֏ | | a | b |
3186+
// | Character | | | | | | | | | | a | b |
31873187
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
31883188
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
31893189
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@@ -4049,10 +4049,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
40494049
// If you consume a large value and you map it to "infinity", you will no
40504050
// longer be able to serialize back a standard-compliant JSON. And there is
40514051
// no realistic application where you might need values so large than they
4052-
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
4052+
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
40534053
// 10^308 It is an unimaginable large number. There will never be any piece of
40544054
// engineering involving as many as 10^308 parts. It is estimated that there
4055-
// are about 10^80 atoms in the universe.  The estimate for the total number
4055+
// are about 10^80 atoms in the universe. The estimate for the total number
40564056
// of electrons is similar. Using a double-precision floating-point value, we
40574057
// can represent easily the number of atoms in the universe. We could also
40584058
// represent the number of ways you can pick any three individual atoms at
@@ -5872,10 +5872,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
58725872
// If you consume a large value and you map it to "infinity", you will no
58735873
// longer be able to serialize back a standard-compliant JSON. And there is
58745874
// no realistic application where you might need values so large than they
5875-
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
5875+
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
58765876
// 10^308 It is an unimaginable large number. There will never be any piece of
58775877
// engineering involving as many as 10^308 parts. It is estimated that there
5878-
// are about 10^80 atoms in the universe.  The estimate for the total number
5878+
// are about 10^80 atoms in the universe. The estimate for the total number
58795879
// of electrons is similar. Using a double-precision floating-point value, we
58805880
// can represent easily the number of atoms in the universe. We could also
58815881
// represent the number of ways you can pick any three individual atoms at
@@ -8142,10 +8142,10 @@ namespace utf8_validation {
81428142
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
81438143
// and emits an error anytime there is a mismatch.
81448144
//
8145-
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
8145+
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
81468146
// characters, the file will look like this:
81478147
//
8148-
// | Character | 𝄞 | | | | | | | ֏ | | a | b |
8148+
// | Character | | | | | | | | | | a | b |
81498149
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
81508150
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
81518151
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@@ -9015,10 +9015,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
90159015
// If you consume a large value and you map it to "infinity", you will no
90169016
// longer be able to serialize back a standard-compliant JSON. And there is
90179017
// no realistic application where you might need values so large than they
9018-
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
9018+
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
90199019
// 10^308 It is an unimaginable large number. There will never be any piece of
90209020
// engineering involving as many as 10^308 parts. It is estimated that there
9021-
// are about 10^80 atoms in the universe.  The estimate for the total number
9021+
// are about 10^80 atoms in the universe. The estimate for the total number
90229022
// of electrons is similar. Using a double-precision floating-point value, we
90239023
// can represent easily the number of atoms in the universe. We could also
90249024
// represent the number of ways you can pick any three individual atoms at
@@ -11254,10 +11254,10 @@ namespace utf8_validation {
1125411254
// This algorithm compares *expected* continuation characters with *actual* continuation bytes,
1125511255
// and emits an error anytime there is a mismatch.
1125611256
//
11257-
// For example, in the string "𝄞₿֏ab", which has a 4-, 3-, 2- and 1-byte
11257+
// For example, in the string "ab", which has a 4-, 3-, 2- and 1-byte
1125811258
// characters, the file will look like this:
1125911259
//
11260-
// | Character | 𝄞 | | | | | | | ֏ | | a | b |
11260+
// | Character | | | | | | | | | | a | b |
1126111261
// |-----------------------|----|----|----|----|----|----|----|----|----|----|----|
1126211262
// | Character Length | 4 | | | | 3 | | | 2 | | 1 | 1 |
1126311263
// | Byte | F0 | 9D | 84 | 9E | E2 | 82 | BF | D6 | 8F | 61 | 62 |
@@ -12130,10 +12130,10 @@ static bool parse_float_strtod(const char *ptr, double *outDouble) {
1213012130
// If you consume a large value and you map it to "infinity", you will no
1213112131
// longer be able to serialize back a standard-compliant JSON. And there is
1213212132
// no realistic application where you might need values so large than they
12133-
// can't fit in binary64. The maximal value is about 1.7976931348623157 ×
12133+
// can't fit in binary64. The maximal value is about 1.7976931348623157 x
1213412134
// 10^308 It is an unimaginable large number. There will never be any piece of
1213512135
// engineering involving as many as 10^308 parts. It is estimated that there
12136-
// are about 10^80 atoms in the universe.  The estimate for the total number
12136+
// are about 10^80 atoms in the universe. The estimate for the total number
1213712137
// of electrons is similar. Using a double-precision floating-point value, we
1213812138
// can represent easily the number of atoms in the universe. We could also
1213912139
// represent the number of ways you can pick any three individual atoms at

0 commit comments

Comments
 (0)