Commit 0a907ec

Tweaking further the documentation. (simdjson#1237)
* Tweaking further the documentation.
* More details.
* Another sentence.
* Saving.
* Tweaking more
1 parent f1b4a54 commit 0a907ec

4 files changed, +30 -28 lines changed

.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
 * Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
 * Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
 * Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
+* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
 
 
 **Describe the bug**

.github/ISSUE_TEMPLATE/feature_request.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
 * Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
 * Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
 * Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
+* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
 
 We do not make changes to simdjson without clearly identifiable benefits, which typically means either performance improvements, bug fixes or new features. Avoid bike-shedding: we all have opinions about how to write code, but we want to focus on what makes simdjson objectively better.
 

.github/ISSUE_TEMPLATE/standard-issue-template.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ Before submitting an issue, please ensure that you have read the documentation:
 * Basics is an overview of how to use simdjson and its APIs: https://github.com/simdjson/simdjson/blob/master/doc/basics.md
 * Performance shows some more advanced scenarios and how to tune for them: https://github.com/simdjson/simdjson/blob/master/doc/performance.md
 * Contributing: https://github.com/simdjson/simdjson/blob/master/CONTRIBUTING.md
+* We follow the [JSON specification as described by RFC 8259](https://www.rfc-editor.org/rfc/rfc8259.txt) (T. Bray, 2017).
 
 We do not make changes to simdjson without clearly identifiable benefits, which typically means either performance improvements, bug fixes or new features. Avoid bike-shedding: we all have opinions about how to write code, but we want to focus on what makes simdjson objectively better.
 

doc/ondemand.md

Lines changed: 27 additions & 28 deletions
@@ -8,8 +8,8 @@ Whether we parse JSON or XML, or any other serialized format, there are relative
 - Another established approach is a event-based approach (like SAX, SAJ).
 - Another popular approach is the schema-based deserialization model.
 
-We propose an approach that is as easy to use and often as flexible as the DOM approach, yet as fast and
-efficient as the schema-based or event-based approaches. We call this new approach "On Demand". The
+We propose an approach that is as easy to use and often as flexible as the DOM approach, yet as fast and
+efficient as the schema-based or event-based approaches. We call this new approach "On Demand". The
 simdjson On Demand API offers a familiar, friendly DOM API and
 provides the performance of just-in-time parsing on top of the simdjson superior performance.
 

@@ -71,11 +71,12 @@ and `"friends_count"` keys and matching values are skipped.
 
 Further, the On Demand API does not parse a value *at all* until you try to convert it (e.g., to `double`,
 `int`, `string`, or `bool`). In our example, when accessing the key-value pair `"retweet_count": 82`, the parser
-may not convert the pair of characters `82` to the binary integer 82. Because the programmer specifies the data type, we avoid branch
-mispredictions related to data type determination and improve the performance.
-
+may not convert the pair of characters `82` to the binary integer 82. Because the programmer specifies the data
+type, we avoid branch mispredictions related to data type determination and improve the performance.
 
 
+We expect users of an On Demand API to work in terms of a JSON dialect, which is a set of expectations and
+specifications that come in addition to the [JSON specification](https://www.rfc-editor.org/rfc/rfc8259.txt).
 The On Demand approach is designed around several principles:
 
 * **Streaming (\*):** It avoids preparsing values, keeping the memory usage and the latency down.
@@ -85,6 +86,7 @@ The On Demand approach is designed around several principles:
 * **Validate What You Use:** On Demand deliberately validates the values you use and the structure leading to it, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed: there must be no confusion over whether you got the right value.
 
 
+
 To understand why On Demand is different, it is helpful to review the major
 approaches to parsing and parser APIs in use today.
 

@@ -275,7 +277,7 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
 ### Starting the iteration
 
 1. First, we declare a parser object that keeps internal buffers necessary for parsing. This can be
-reused to parse multiple JSON files, so you don't pay the high cost of allocating memory every
+reused to parse multiple JSON files, so you do not pay the high cost of allocating memory every
 time (and so it can stay in cache!).
 
 This declaration does not allocate any memory; that will happen in the next step.
@@ -325,8 +327,8 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
 ondemand::object top = doc.get_object();
 
 // Find the field statuses by:
-// 1. Check whether the object is empty (check for }). (TODO we don't really need to do this unless the key lookup fails!)
-// 2. Check if we're at the field by looking for the string "statuses".
+// 1. Check whether the object is empty (check for }). (We do not really need to do this unless the key lookup fails!)
+// 2. Check if we're at the field by looking for the string "statuses" using byte-by-byte comparison.
 // 3. Validate that there is a `:` after it.
 auto tweets_field = top["statuses"];
 
@@ -359,14 +361,14 @@ To help visualize the algorithm, we'll walk through the example C++ given at the
 std::string_view text = tweet["text"];
 ```
 
-First, `["text"]` skips the `"id"` field because it doesn't match: skips the key, `:` and
+First, `["text"]` skips the `"id"` field because it does not match: skips the key, `:` and
 value (`1`). We then check whether there are more fields by looking for either `,`
 or `}`.
 
 The second field is matched (`"text"`), so we validate the `:` and move to the actual value.
 
 NOTE: `["text"]` does a *raw match*, comparing the key directly against the raw JSON. This means
-that keys with escapes in them may not be matched.
+that keys with escapes in them may not be matched and the letter case must match exactly.
 
 To convert to a string, we check for `"` and use simdjson's fast unescaping algorithm to copy
 `first!` (plus a terminating `\0`) into a buffer managed by the `document`. This buffer stores
@@ -445,11 +447,6 @@ When the user requests strings, we unescape them to a single string buffer much
 so that users enjoy the same string performance as the core simdjson. We do not write the length to the
 string buffer, however; that is stored in the `string_view` instance we return to the user.
 
-### Object/Array Iteration
-
-Because the C++ iterator contract requires iterators to be const-assignable and const-constructable,
-object and array iterators are separate classes from the object/array itself, and have an interior
-mutable reference to it.
 
 ### Iteration Safety
 

@@ -466,24 +463,27 @@ in production systems:
 if it was `nullptr` but did not care what the actual value was--it will iterate. The destructor automates
 the iteration.
 
-### Limitations of the On Demand Approach
+### Benefits of the On Demand Approach
 
-We expect that the On Demand approach has many of the performance benefits of the schema-based approach, while providing a flexibility that is similar to that of the DOM-based approach. However, there are some limitations.
+We expect that the On Demand approach has many of the performance benefits of the schema-based approach, while providing a flexibility that is similar to that of the DOM-based approach.
 
-Pros of the On Demand approach:
 * Faster than DOM in some cases. Reduced memory usage.
 * Straightforward, programmer-friendly interface (arrays and objects).
+* Highly expressive, beyond deserialization and pointer queries: many tasks can be accomplished with little code.
+
+### Limitations of the On Demand Approach
+
+The On Demand approach has some limitations:
 
-Cons of the On Demand approach:
-* Because it operates in streaming mode, you only have access to the current element in the JSON document. Furthermore, the document is traversed in order so the code is sensitive to the order of the JSON nodes in the same manner as an event-based approach (e.g., SAX). It is possible for the programmer to handle out-of-order keys when the JSON dialect is underspecified, but it requires additional care. You should be mindful that the though your software might write the keys in a consistent manner, the JSON specification does not prescribe that the order be significant and thus, a JSON producer could change the order of the keys within an object. The On Demand API will still help the programmer by throwing an exception when the unexpected occurs, but the programmer is responsible for handling such cases (e.g., by rejecting the JSON input that does not follow the expected JSON dialect).
-* Less safe than DOM: we only validate the components of the JSON document that are used and it is possible to begin ingesting an invalid document only to find out later that the document is invalid. Are you fine ingesting a large JSON document that starts with well formed JSON but ends with invalid JSON content?
+* Because it operates in streaming mode, you only have access to the current element in the JSON document. Furthermore, the document is traversed in order so the code is sensitive to the order of the JSON nodes in the same manner as an event-based approach (e.g., SAX).
+* The On Demand approach is less safe than DOM: we only validate the components of the JSON document that are used and it is possible to begin ingesting an invalid document only to find out later that the document is invalid. Are you fine ingesting a large JSON document that starts with well formed JSON but ends with invalid JSON content?
 
 There are currently additional technical limitations which we expect to resolve in future releases of the simdjson library:
 
 * The simdjson library offers runtime dispatching which allows you to compile one binary and have it run at full speed on different processors, taking advantage of the specific features of the processor. The On Demand API does not have runtime dispatch support at this time. To benefit from the On Demand API, you must compile your code for a specific processor. E.g., if your processor supports AVX2 instructions, you should compile your binary executable with AVX2 instruction support (by using your compiler's commands). If you are sufficiently technically proficient, you can implement runtime dispatching within your application, by compiling your On Demand code for different processors.
 * There is an initial phase which scans the entire document quickly, irrespective of the size of the document. We plan to break this phase into distinct steps for large files in a future release as we have done with other components of our API (e.g., `parse_many`).
 * The On Demand API does not support JSON Pointer. This capability is currently limited to our core API.
-* We intend to help users who wish to use the On Demand API but require support for order-insensitive semantics, but in our current implementation support for out-of-order keys (if needed) must be provided by the programmer. Currently, one might proceed in the following manner as a fallback measure if keys can appear in any order:
+* You should be mindful that the though your software might write the keys in a consistent manner, the [JSON specification](https://www.rfc-editor.org/rfc/rfc8259.txt) states that "JSON parsing libraries have been observed to differ as to whether or not they make the ordering of object members visible". The On Demand API will help the programmer handle unexpected JSON dialects by throwing an exception when the unexpected occurs, but the programmer is responsible for handling such cases: e.g., by rejecting the JSON input that does not follow the expected JSON dialect. We intend to help users who wish to use the On Demand API but require support for order-insensitive semantics, but in our current implementation support for out-of-order keys (if needed) must be provided by the programmer. Currently, one might proceed in the following manner as a fallback measure if keys can appear in any order:
 ```C++
 for (ondemand::object my_object : doc["mykey"]) {
 for (auto field : my_object) {
@@ -519,11 +519,10 @@ most programmers will want to target `arm64`. The `fallback` is probably only go
 std::cout << simdjson::builtin_implementation()->name() << std::endl;
 ```
 
-If you are using CMake for your C++ project, then you can pass compilation flags to your compiler during the first configuration
-by using the `CXXFLAGS` configuration variable:
+If you are using CMake for your C++ project, then you can pass compilation flags to your compiler by using
+the `CMAKE_CXX_FLAGS` variable:
+
 ```
-CXXFLAGS=-march=haswell cmake -B build_haswell
+cmake -DCMAKE_CXX_FLAGS="-march=haswell" -B build_haswell
 cmake --build build_haswell
-```
-
-You may also use the `CMAKE_CXX_FLAGS` variable.
+```
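
Note on the raw-match wording changed in doc/ondemand.md: the updated NOTE says that `["text"]` compares the key directly against the raw JSON, so keys containing escapes may not match and the letter case must match exactly. The following is a minimal, self-contained sketch of what such a byte-by-byte comparison implies. The helper `raw_key_equals` and the checks in `raw_match_examples` are hypothetical and written only for illustration; they are not simdjson's actual implementation or API.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (NOT part of simdjson): returns true when the raw JSON
// text in `json` starts with the quoted key `"key"`. The comparison is done on
// the raw bytes: nothing is unescaped and no case folding is applied.
bool raw_key_equals(std::string_view json, std::string_view key) {
    // We need room for the opening quote, the key bytes and the closing quote.
    if (json.size() < key.size() + 2) return false;
    if (json[0] != '"') return false;
    // Byte-by-byte comparison of the key against the raw JSON bytes.
    if (json.substr(1, key.size()) != key) return false;
    // The key in the document must end exactly where our key ends.
    return json[key.size() + 1] == '"';
}

// Illustrative checks of the behaviour described in the NOTE (assumed inputs).
void raw_match_examples() {
    assert(raw_key_equals("\"text\": \"first!\"", "text"));  // identical bytes match
    assert(!raw_key_equals("\"Text\": 1", "text"));          // letter case differs
    assert(!raw_key_equals("\"te\\u0078t\": 1", "text"));    // escaped 'x' is not unescaped
    assert(!raw_key_equals("\"texts\": 1", "text"));         // longer key does not match
}
```

In this sketch the escaped key `"te\u0078t"` denotes the same string as `"text"` once unescaped, yet the raw comparison rejects it, which is the trade-off the NOTE describes: the lookup avoids unescaping work at the cost of requiring keys to appear in the document exactly as the programmer wrote them.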
