Commit 3316df9

Adding test for issue 1133 and improving documentation (simdjson#1134)
* Adding test.
* Saving.
* With exceptions.
* Added extensive tests.
* Better documentation.
* Tweaking CI
* Cleaning.
* Do not assume make.
* Let us make the build verbose
* Reorg
* I do not understand how circle ci works.
* Breaking it up.
* Better syntax.
1 parent 5d355f1 commit 3316df9

6 files changed

+193
-27
lines changed

.circleci/config.yml

Lines changed: 24 additions & 19 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,8 @@
11
version: 2.1
22

3+
4+
# We constantly run out of memory so please do not use parallelism (-j, -j4).
5+
36
# Reusable image / compiler definitions
47
executors:
58
gcc8:
@@ -8,53 +11,53 @@ executors:
811
environment:
912
CXX: g++-8
1013
CC: gcc-8
11-
BUILD_FLAGS: -j
12-
CTEST_FLAGS: -j4 --output-on-failure
14+
BUILD_FLAGS:
15+
CTEST_FLAGS: --output-on-failure
1316

1417
gcc9:
1518
docker:
1619
- image: conanio/gcc9
1720
environment:
1821
CXX: g++-9
1922
CC: gcc-9
20-
BUILD_FLAGS: -j
21-
CTEST_FLAGS: -j4 --output-on-failure
23+
BUILD_FLAGS:
24+
CTEST_FLAGS: --output-on-failure
2225

2326
gcc10:
2427
docker:
2528
- image: conanio/gcc10
2629
environment:
2730
CXX: g++-10
2831
CC: gcc-10
29-
BUILD_FLAGS: -j
30-
CTEST_FLAGS: -j4 --output-on-failure
32+
BUILD_FLAGS:
33+
CTEST_FLAGS: --output-on-failure
3134

3235
clang10:
3336
docker:
3437
- image: conanio/clang10
3538
environment:
3639
CXX: clang++-10
3740
CC: clang-10
38-
BUILD_FLAGS: -j
39-
CTEST_FLAGS: -j4 --output-on-failure
41+
BUILD_FLAGS:
42+
CTEST_FLAGS: --output-on-failure
4043

4144
clang9:
4245
docker:
4346
- image: conanio/clang9
4447
environment:
4548
CXX: clang++-9
4649
CC: clang-9
47-
BUILD_FLAGS: -j
48-
CTEST_FLAGS: -j4 --output-on-failure
50+
BUILD_FLAGS:
51+
CTEST_FLAGS: --output-on-failure
4952

5053
clang6:
5154
docker:
5255
- image: conanio/clang60
5356
environment:
5457
CXX: clang++-6.0
5558
CC: clang-6.0
56-
BUILD_FLAGS: -j
57-
CTEST_FLAGS: -j4 --output-on-failure
59+
BUILD_FLAGS:
60+
CTEST_FLAGS: --output-on-failure
5861

5962
# Reusable test commands (and initializer for clang 6)
6063
commands:
@@ -68,13 +71,15 @@ commands:
6871
- checkout
6972
- run: mkdir -p build
7073

71-
cmake_build:
74+
cmake_build_cache:
7275
steps:
7376
- cmake_prep
74-
- run: |
75-
cd build &&
76-
cmake $CMAKE_FLAGS -DCMAKE_INSTALL_PREFIX:PATH=destination .. &&
77-
make $BUILD_FLAGS all
77+
- run: cmake $CMAKE_FLAGS -DCMAKE_INSTALL_PREFIX:PATH=destination -B build .
78+
79+
cmake_build:
80+
steps:
81+
- cmake_build_cache
82+
- run: cmake --build build
7883

7984
cmake_test:
8085
steps:
@@ -138,12 +143,12 @@ jobs:
138143
sanitize-gcc10:
139144
description: Build and run tests on GCC 10 and AVX 2 with a cmake sanitize build
140145
executor: gcc10
141-
environment: { CMAKE_FLAGS: -DSIMDJSON_BUILD_STATIC=OFF -DSIMDJSON_SANITIZE=ON, BUILD_FLAGS: "", CTEST_FLAGS: -j4 --output-on-failure -E checkperf }
146+
environment: { CMAKE_FLAGS: -DSIMDJSON_BUILD_STATIC=OFF -DSIMDJSON_SANITIZE=ON, BUILD_FLAGS: "", CTEST_FLAGS: --output-on-failure -E checkperf }
142147
steps: [ cmake_test ]
143148
sanitize-clang10:
144149
description: Build and run tests on clang 10 and AVX 2 with a cmake sanitize build
145150
executor: clang10
146-
environment: { CMAKE_FLAGS: -DSIMDJSON_BUILD_STATIC=OFF -DSIMDJSON_SANITIZE=ON, CTEST_FLAGS: -j4 --output-on-failure -E checkperf }
151+
environment: { CMAKE_FLAGS: -DSIMDJSON_BUILD_STATIC=OFF -DSIMDJSON_SANITIZE=ON, CTEST_FLAGS: --output-on-failure -E checkperf }
147152
steps: [ cmake_test ]
148153

149154
# dynamic

doc/basics.md

Lines changed: 25 additions & 2 deletions
@@ -592,14 +592,37 @@ Here is a simple example, given "x.json" with this content:
592592

593593
```c++
594594
dom::parser parser;
595-
dom::document_stream docs = parser.load_many(filename);
595+
dom::document_stream docs = parser.load_many("x.json");
596596
for (dom::element doc : docs) {
597597
cout << doc["foo"] << endl;
598598
}
599599
// Prints 1 2 3
600600
```
601601

602-
In-memory ndjson strings can be parsed as well, with `parser.parse_many(string)`.
602+
In-memory ndjson strings can be parsed as well, with `parser.parse_many(string)`:
603+
604+
605+
```c++
606+
dom::parser parser;
607+
auto json = R"({ "foo": 1 }
608+
{ "foo": 2 }
609+
{ "foo": 3 })"_padded;
610+
dom::document_stream docs = parser.parse_many(json);
611+
for (dom::element doc : docs) {
612+
cout << doc["foo"] << endl;
613+
}
614+
// Prints 1 2 3
615+
```
616+
617+
618+
Unlike `parser.parse`, both `parser.load_many(filename)` and `parser.parse_many(string)` may parse
619+
"on demand" (lazily). That is, no parsing may have been done before you enter the loop
620+
`for (dom::element doc : docs) {` and you should expect the parser to only ever fully parse one JSON
621+
document at a time.
622+
623+
1. When calling `parser.load_many(filename)`, the file's content is loaded into a memory buffer owned by the `parser` instance. The file can therefore be safely deleted after `parser.load_many(filename)` returns, since the parser instance owns all of the data.
624+
2. When calling `parser.parse_many(string)`, no copy is made of the provided string input. The provided memory buffer may be accessed each time a JSON document is parsed. Calling `parser.parse_many(string)` on a temporary string buffer (e.g., `docs = parser.parse_many("[1,2,3]"_padded)`) is unsafe because the `document_stream` instance needs access to the buffer to return the JSON documents. In contrast, calling `doc = parser.parse("[1,2,3]"_padded)` is safe because `parser.parse` eagerly parses the input.
625+
603626

604627
Both `load_many` and `parse_many` take an optional parameter `size_t batch_size`, which defines the processing window size. It defaults to a large value (`1000000`, corresponding to 1 MB). No JSON document should exceed this window size, or else you will get the error `simdjson::CAPACITY`. The window size cannot exceed 4 GB: attempting to set it larger also yields `simdjson::CAPACITY`. The smaller the window size, the less memory the function uses, but setting it too small (e.g., less than 100 kB) may hurt performance. Leaving it at 1 MB is expected to be a good choice unless you have larger documents.
605628
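The lifetime rule for `parse_many` follows from the stream keeping a reference to the caller's buffer instead of a copy. The sketch below models that behaviour with the standard library only. `lazy_line_stream` and `collect_documents` are hypothetical names that are not part of simdjson, and splitting on newlines merely stands in for real JSON parsing. The deleted rvalue constructor is an assumption about how such an API could reject temporaries at compile time, not a description of this commit.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a lazy document stream: it keeps a pointer to the
// caller's buffer (no copy) and extracts one line per call to next().
class lazy_line_stream {
public:
  explicit lazy_line_stream(const std::string &input) : input_(&input), pos_(0) {}
  // Assumed guard: refuse temporaries, which would be destroyed before the loop runs.
  lazy_line_stream(const std::string &&) = delete;

  // Stores the next non-empty line in `out` and returns true, or returns
  // false once the buffer is exhausted. Nothing is read before the first call.
  bool next(std::string &out) {
    while (pos_ < input_->size()) {
      std::size_t end = input_->find('\n', pos_);
      if (end == std::string::npos) { end = input_->size(); }
      const std::string line = input_->substr(pos_, end - pos_);
      pos_ = end + 1;
      if (!line.empty()) {
        out = line;
        return true;
      }
    }
    return false;
  }

private:
  const std::string *input_; // non-owning: the caller must keep the buffer alive
  std::size_t pos_;
};

// Safe pattern: the buffer is a named object that outlives the stream.
std::vector<std::string> collect_documents(const std::string &buffer) {
  lazy_line_stream stream(buffer);
  std::vector<std::string> documents;
  std::string document;
  while (stream.next(document)) {
    documents.push_back(document);
  }
  return documents;
}
```

With this guard in place, constructing the stream from a temporary such as `std::string("...")` fails to compile, while a named `std::string` works as long as it outlives the loop.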

doc/basics_doxygen.md

Lines changed: 26 additions & 3 deletions
@@ -564,22 +564,45 @@ than 4GB), though each individual document must be no larger than 4 GB.
564564

565565
Here is a simple example, given "x.json" with this content:
566566

567-
```json
567+
```
568568
{ "foo": 1 }
569569
{ "foo": 2 }
570570
{ "foo": 3 }
571571
```
572572

573573
```
574574
dom::parser parser;
575-
dom::document_stream docs = parser.load_many(filename);
575+
dom::document_stream docs = parser.load_many("x.json");
576576
for (dom::element doc : docs) {
577577
cout << doc["foo"] << endl;
578578
}
579579
// Prints 1 2 3
580580
```
581581

582-
In-memory ndjson strings can be parsed as well, with `parser.parse_many(string)`.
582+
583+
In-memory ndjson strings can be parsed as well, with `parser.parse_many(string)`:
584+
585+
586+
```
587+
dom::parser parser;
588+
auto json = R"({ "foo": 1 }
589+
{ "foo": 2 }
590+
{ "foo": 3 })"_padded;
591+
dom::document_stream docs = parser.parse_many(json);
592+
for (dom::element doc : docs) {
593+
cout << doc["foo"] << endl;
594+
}
595+
// Prints 1 2 3
596+
```
597+
598+
599+
Unlike `parser.parse`, both `parser.load_many(filename)` and `parser.parse_many(string)` may parse
600+
"on demand" (lazily). That is, no parsing may have been done before you enter the loop
601+
`for (dom::element doc : docs) {` and you should expect the parser to only ever fully parse one JSON
602+
document at a time.
603+
604+
1. When calling `parser.load_many(filename)`, the file's content is loaded into a memory buffer owned by the `parser` instance. The file can therefore be safely deleted after `parser.load_many(filename)` returns, since the parser instance owns all of the data.
605+
2. When calling `parser.parse_many(string)`, no copy is made of the provided string input. The provided memory buffer may be accessed each time a JSON document is parsed. Calling `parser.parse_many(string)` on a temporary string buffer (e.g., `docs = parser.parse_many("[1,2,3]"_padded)`) is unsafe because the `document_stream` instance needs access to the buffer to return the JSON documents. In contrast, calling `doc = parser.parse("[1,2,3]"_padded)` is safe because `parser.parse` eagerly parses the input.
583606

584607
Both `load_many` and `parse_many` take an optional parameter `size_t batch_size`, which defines the processing window size. It defaults to a large value (`1000000`, corresponding to 1 MB). No JSON document should exceed this window size, or else you will get the error `simdjson::CAPACITY`. The window size cannot exceed 4 GB: attempting to set it larger also yields `simdjson::CAPACITY`. The smaller the window size, the less memory the function uses, but setting it too small (e.g., less than 100 kB) may hurt performance. Leaving it at 1 MB is expected to be a good choice unless you have larger documents.
585608
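The `batch_size` guidance can be read as a small decision rule: keep the 1 MB default unless some document is larger, and give up if a document exceeds the 4 GB ceiling. The helper below is a hypothetical sketch of that rule and is not a simdjson function. The exact ceiling value (`0xFFFFFFFF`) and the use of `0` to signal "no valid window" are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Default window size quoted in the documentation (1 MB).
constexpr std::uint64_t kDefaultBatchSize = 1000000;
// Assumed ceiling for "4 GB"; the documentation does not give an exact byte count.
constexpr std::uint64_t kMaxBatchSize = 0xFFFFFFFFull;

// Returns a window size large enough for every document, or 0 when the
// largest document cannot fit under the assumed ceiling.
std::uint64_t choose_batch_size(const std::vector<std::uint64_t> &document_sizes) {
  std::uint64_t largest = 0;
  for (std::uint64_t size : document_sizes) {
    largest = std::max(largest, size);
  }
  if (largest > kMaxBatchSize) {
    return 0; // a real call would be expected to fail with a capacity error
  }
  // Never go below the default: small windows may hurt performance.
  return std::max(kDefaultBatchSize, largest);
}
```

The resulting value would then be passed as the optional second argument, for example `parser.load_many("x.json", batch_size)`.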

include/simdjson/dom/parser.h

Lines changed: 39 additions & 2 deletions
@@ -70,7 +70,10 @@ class parser {
7070
*
7171
* dom::parser parser;
7272
* const element doc = parser.load("jsonexamples/twitter.json");
73-
*
73+
*
74+
* The function is eager: the file's content is loaded in memory inside the parser instance
75+
* and immediately parsed. The file can be deleted after the `parser.load` call.
76+
*
7477
* ### IMPORTANT: Document Lifetime
7578
*
7679
* The JSON document still lives in the parser: this is the most efficient way to parse JSON
@@ -96,6 +99,9 @@ class parser {
9699
*
97100
* dom::parser parser;
98101
* element doc = parser.parse(buf, len);
102+
*
103+
* The function eagerly parses the input: the input can be modified and discarded after
104+
* the `parser.parse(buf, len)` call has completed.
99105
*
100106
* ### IMPORTANT: Document Lifetime
101107
*
@@ -149,14 +155,21 @@ class parser {
149155
* cout << std::string(doc["title"]) << endl;
150156
* }
151157
*
158+
* The file is loaded in memory and can be safely deleted after the `parser.load_many(path)`
159+
* function has returned. The memory is held by the `parser` instance.
160+
*
161+
* The function is lazy: it may be that no more than one JSON document at a time is parsed.
162+
* And, possibly, no document may have been parsed when the `parser.load_many(path)` function
163+
* returned.
164+
*
152165
* ### Format
153166
*
154167
* The file must contain a series of one or more JSON documents, concatenated into a single
155168
* buffer, separated by whitespace. It effectively parses until it has a fully valid document,
156169
* then starts parsing the next document at that point. (It does this with more parallelism and
157170
* lookahead than you might think, though.)
158171
*
159-
* documents that consist of an object or array may omit the whitespace between them, concatenating
172+
* Documents that consist of an object or array may omit the whitespace between them, concatenating
160173
* with no separator. Documents that consist of a single primitive (i.e. documents that are not
161174
* arrays or objects) MUST be separated with whitespace.
162175
*
@@ -213,6 +226,30 @@ class parser {
213226
* cout << std::string(doc["title"]) << endl;
214227
* }
215228
*
229+
* No copy of the input buffer is made.
230+
*
231+
* The function is lazy: it may be that no more than one JSON document at a time is parsed.
232+
* And, possibly, no document may have been parsed when the `parser.parse_many` function
233+
* returned.
234+
*
235+
* The caller is responsible for ensuring that the input string data remains unchanged and is
236+
* not deleted during the loop. In particular, the following is unsafe:
237+
*
238+
* auto docs = parser.parse_many("[\"temporary data\"]"_padded);
239+
* // here the string "[\"temporary data\"]" may no longer exist in memory
240+
* // the parser instance may not have even accessed the input yet
241+
* for (element doc : docs) {
242+
* cout << std::string(doc["title"]) << endl;
243+
* }
244+
*
245+
* The following is safe:
246+
*
247+
* auto json = "[\"temporary data\"]"_padded;
248+
* auto docs = parser.parse_many(json);
249+
* for (element doc : docs) {
250+
* cout << std::string(doc["title"]) << endl;
251+
* }
252+
*
216253
* ### Format
217254
*
218255
* The buffer must contain a series of one or more JSON documents, concatenated into a single
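The "Format" rule says that objects and arrays may be concatenated with no separator, because the end of one document is unambiguous. A minimal sketch of that idea, using only the standard library, tracks nesting depth and skips over string contents. `split_concatenated` is a hypothetical helper, not simdjson's algorithm: it does no validation and deliberately ignores top-level primitives, which is why those must be separated by whitespace.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a buffer of concatenated JSON objects/arrays into separate documents
// by tracking bracket depth. Brackets inside string literals are ignored.
std::vector<std::string> split_concatenated(const std::string &input) {
  std::vector<std::string> documents;
  std::size_t start = 0;
  int depth = 0;
  bool in_string = false;
  bool escaped = false;
  for (std::size_t i = 0; i < input.size(); ++i) {
    const char c = input[i];
    if (in_string) {
      if (escaped) {
        escaped = false;      // character after a backslash is taken literally
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      if (depth == 0) { start = i; } // a new top-level document begins here
      ++depth;
    } else if (c == '}' || c == ']') {
      if (depth > 0 && --depth == 0) {
        documents.push_back(input.substr(start, i - start + 1));
      }
    }
  }
  return documents;
}
```

For the input `{"a":1}{"b":"}"}[3]` this yields three documents, and the `}` inside the string value does not end the second one early.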

tests/document_stream_tests.cpp

Lines changed: 69 additions & 1 deletion
@@ -69,6 +69,69 @@ namespace document_stream_tests {
6969
}
7070
return true;
7171
}
72+
bool single_document() {
73+
std::cout << "Running " << __func__ << std::endl;
74+
simdjson::dom::parser parser;
75+
auto json = R"({"hello": "world"})"_padded;
76+
simdjson::dom::document_stream stream;
77+
ASSERT_SUCCESS(parser.parse_many(json).get(stream));
78+
size_t count = 0;
79+
for (auto doc : stream) {
80+
if(doc.error()) {
81+
std::cerr << "Unexpected error: " << doc.error() << std::endl;
82+
return false;
83+
}
84+
std::string expected = R"({"hello":"world"})";
85+
simdjson::dom::element this_document;
86+
ASSERT_SUCCESS(doc.get(this_document));
87+
88+
std::string answer = simdjson::minify(this_document);
89+
if(answer != expected) {
90+
std::cout << this_document << std::endl;
91+
return false;
92+
}
93+
count += 1;
94+
}
95+
return count == 1;
96+
}
97+
#if SIMDJSON_EXCEPTIONS
98+
bool single_document_exceptions() {
99+
std::cout << "Running " << __func__ << std::endl;
100+
simdjson::dom::parser parser;
101+
auto json = R"({"hello": "world"})"_padded;
102+
size_t count = 0;
103+
for (simdjson::dom::element doc : parser.parse_many(json)) {
104+
std::string expected = R"({"hello":"world"})";
105+
std::string answer = simdjson::minify(doc);
106+
if(answer != expected) {
107+
std::cout << "got : " << answer << std::endl;
108+
std::cout << "expected: " << expected << std::endl;
109+
return false;
110+
}
111+
count += 1;
112+
}
113+
return count == 1;
114+
}
115+
116+
bool issue1133() {
117+
std::cout << "Running " << __func__ << std::endl;
118+
simdjson::dom::parser parser;
119+
auto json = "{\"hello\": \"world\"}"_padded;
120+
simdjson::dom::document_stream docs = parser.parse_many(json);
121+
size_t count = 0;
122+
for (simdjson::dom::element doc : docs) {
123+
std::string expected = R"({"hello":"world"})";
124+
std::string answer = simdjson::minify(doc);
125+
if(answer != expected) {
126+
std::cout << "got : " << answer << std::endl;
127+
std::cout << "expected: " << expected << std::endl;
128+
return false;
129+
}
130+
count += 1;
131+
}
132+
return count == 1;
133+
}
134+
#endif
72135

73136
bool small_window() {
74137
std::cout << "Running " << __func__ << std::endl;
@@ -247,7 +310,12 @@ namespace document_stream_tests {
247310
}
248311

249312
bool run() {
250-
return test_current_index() &&
313+
return test_current_index() &&
314+
single_document() &&
315+
#if SIMDJSON_EXCEPTIONS
316+
single_document_exceptions() &&
317+
issue1133() &&
318+
#endif
251319
#ifdef SIMDJSON_THREADS_ENABLED
252320
threaded_disabled() &&
253321
#endif
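The new tests compare `simdjson::minify(doc)` against `{"hello":"world"}` even though the input contains a space after the colon. The sketch below shows why those two strings are expected to match: insignificant whitespace outside string literals is dropped. `minify_sketch` is a hypothetical, standard-library-only illustration that works on raw text and is not how simdjson implements `minify`.

```cpp
#include <string>

// Removes JSON whitespace (space, tab, newline, carriage return) that appears
// outside string literals. Input is assumed to be well-formed JSON text.
std::string minify_sketch(const std::string &json) {
  std::string out;
  out.reserve(json.size());
  bool in_string = false;
  bool escaped = false;
  for (char c : json) {
    if (in_string) {
      out.push_back(c); // string contents are copied verbatim
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == ' ' || c == '\t' || c == '\n' || c == '\r') {
      continue; // insignificant whitespace
    }
    if (c == '"') { in_string = true; }
    out.push_back(c);
  }
  return out;
}
```

Whitespace inside a string, such as the space in `"a b"`, is preserved because it is part of the value.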

tests/readme_examples.cpp

Lines changed: 10 additions & 0 deletions
@@ -179,6 +179,16 @@ void basics_ndjson() {
179179
// Prints 1 2 3
180180
}
181181

182+
void basics_ndjson_parse_many() {
183+
dom::parser parser;
184+
auto json = R"({ "foo": 1 }
185+
{ "foo": 2 }
186+
{ "foo": 3 })"_padded;
187+
dom::document_stream docs = parser.parse_many(json);
188+
for (dom::element doc : docs) {
189+
cout << doc["foo"] << endl;
190+
}
191+
}
182192
void implementation_selection_1() {
183193
cout << "simdjson v" << STRINGIFY(SIMDJSON_VERSION) << endl;
184194
cout << "Detected the best implementation for your machine: " << simdjson::active_implementation->name();
