Skip to content

Commit 5b10c38

Browse files
authored
Make parse_many safer. (simdjson#1137)
1 parent 3316df9 commit 5b10c38

File tree

5 files changed

+55
-5
lines changed

5 files changed

+55
-5
lines changed

doc/basics.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -621,7 +621,7 @@ Unlike `parser.parse`, both `parser.load_many(filename)` and `parser.parse_many(
621621
document at a time.
622622

623623
1. When calling `parser.load_many(filename)`, the file's content is loaded up in a memory buffer owned by the `parser`'s instance. Thus the file can be safely deleted after calling `parser.load_many(filename)` as the parser instance owns all of the data.
624-
2. When calling `parser.parse_many(string)`, no copy is made of the provided string input. The provided memory buffer may be accessed each time a JSON document is parsed. Calling `parser.parse_many(string)` on a temporary string buffer (e.g., `docs = parser.parse_many("[1,2,3]"_padded)`) is unsafe because the `document_stream` instance needs access to the buffer to return the JSON documents. In constrast, calling `doc = parser.parse("[1,2,3]"_padded)` is safe because `parser.parse` eagerly parses the input.
624+
2. When calling `parser.parse_many(string)`, no copy is made of the provided string input. The provided memory buffer may be accessed each time a JSON document is parsed. Calling `parser.parse_many(string)` on a temporary string buffer (e.g., `docs = parser.parse_many("[1,2,3]"_padded)`) is unsafe (and will not compile) because the `document_stream` instance needs access to the buffer to return the JSON documents. In constrast, calling `doc = parser.parse("[1,2,3]"_padded)` is safe because `parser.parse` eagerly parses the input.
625625

626626

627627
Both `load_many` and `parse_many` take an optional parameter `size_t batch_size` which defines the window processing size. It is set by default to a large value (`1000000` corresponding to 1 MB). None of your JSON documents should exceed this window size, or else you will get the error `simdjson::CAPACITY`. You cannot set this window size larger than 4 GB: you will get the error `simdjson::CAPACITY`. The smaller the window size is, the less memory the function will use. Setting the window size too small (e.g., less than 100 kB) may also impact performance negatively. Leaving it to 1 MB is expected to be a good choice, unless you have some larger documents.

doc/basics_doxygen.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -602,7 +602,7 @@ Unlike `parser.parse`, both `parser.load_many(filename)` and `parser.parse_many(
602602
document at a time.
603603

604604
1. When calling `parser.load_many(filename)`, the file's content is loaded up in a memory buffer owned by the `parser`'s instance. Thus the file can be safely deleted after calling `parser.load_many(filename)` as the parser instance owns all of the data.
605-
2. When calling `parser.parse_many(string)`, no copy is made of the provided string input. The provided memory buffer may be accessed each time a JSON document is parsed. Calling `parser.parse_many(string)` on a temporary string buffer (e.g., `docs = parser.parse_many("[1,2,3]"_padded)`) is unsafe because the `document_stream` instance needs access to the buffer to return the JSON documents. In constrast, calling `doc = parser.parse("[1,2,3]"_padded)` is safe because `parser.parse` eagerly parses the input.
605+
2. When calling `parser.parse_many(string)`, no copy is made of the provided string input. The provided memory buffer may be accessed each time a JSON document is parsed. Calling `parser.parse_many(string)` on a temporary string buffer (e.g., `docs = parser.parse_many("[1,2,3]"_padded)`) is unsafe (and will not compile) because the `document_stream` instance needs access to the buffer to return the JSON documents. In constrast, calling `doc = parser.parse("[1,2,3]"_padded)` is safe because `parser.parse` eagerly parses the input.
606606

607607
Both `load_many` and `parse_many` take an optional parameter `size_t batch_size` which defines the window processing size. It is set by default to a large value (`1000000` corresponding to 1 MB). None of your JSON documents should exceed this window size, or else you will get the error `simdjson::CAPACITY`. You cannot set this window size larger than 4 GB: you will get the error `simdjson::CAPACITY`. The smaller the window size is, the less memory the function will use. Setting the window size too small (e.g., less than 100 kB) may also impact performance negatively. Leaving it to 1 MB is expected to be a good choice, unless you have some larger documents.
608608

include/simdjson/dom/parser.h

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -233,7 +233,7 @@ class parser {
233233
* returned.
234234
*
235235
* The caller is responsabile to ensure that the input string data remains unchanged and is
236-
* not deleted during the loop. In particular, the following is unsafe:
236+
* not deleted during the loop. In particular, the following is unsafe and will not compile:
237237
*
238238
* auto docs = parser.parse_many("[\"temporary data\"]"_padded);
239239
* // here the string "[\"temporary data\"]" may no longer exist in memory
@@ -314,9 +314,11 @@ class parser {
314314
inline simdjson_result<document_stream> parse_many(const char *buf, size_t len, size_t batch_size = DEFAULT_BATCH_SIZE) noexcept;
315315
/** @overload parse_many(const uint8_t *buf, size_t len, size_t batch_size) */
316316
inline simdjson_result<document_stream> parse_many(const std::string &s, size_t batch_size = DEFAULT_BATCH_SIZE) noexcept;
317+
inline simdjson_result<document_stream> parse_many(const std::string &&s, size_t batch_size) = delete;// unsafe
317318
/** @overload parse_many(const uint8_t *buf, size_t len, size_t batch_size) */
318319
inline simdjson_result<document_stream> parse_many(const padded_string &s, size_t batch_size = DEFAULT_BATCH_SIZE) noexcept;
319-
320+
inline simdjson_result<document_stream> parse_many(const padded_string &&s, size_t batch_size) = delete;// unsafe
321+
320322
/** @private We do not want to allow implicit conversion from C string to std::string. */
321323
simdjson_result<document_stream> parse_many(const char *buf, size_t batch_size = DEFAULT_BATCH_SIZE) noexcept = delete;
322324

tests/compilation_failure_tests/CMakeLists.txt

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@
55
# the macro COMPILATION_TEST_USE_FAILING_CODE set to 0 or 1.
66
#
77

8-
# adds a compilation test. two targets are created, one expected to
8+
# adds a compilation test. Two targets are created, one expected to
99
# succeed compilation and one that is expected to fail.
1010
function(add_dual_compile_test TEST_NAME)
1111
add_cpp_test(${TEST_NAME}_should_compile SOURCES ${TEST_NAME}.cpp COMPILE_ONLY)
@@ -22,4 +22,5 @@ if (SIMDJSON_EXCEPTIONS)
2222
add_dual_compile_test(dangling_parser_parse_uchar)
2323
add_dual_compile_test(dangling_parser_parse_stdstring)
2424
add_dual_compile_test(dangling_parser_parse_padstring)
25+
add_dual_compile_test(unsafe_parse_many)
2526
endif()
Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
#include <string>
2+
#include <vector>
3+
#include <iostream>
4+
5+
#include "simdjson.h"
6+
7+
bool single_document() {
8+
std::cout << "Running " << __func__ << std::endl;
9+
simdjson::dom::parser parser;
10+
simdjson::dom::document_stream stream;
11+
12+
#if COMPILATION_TEST_USE_FAILING_CODE
13+
auto error = parser.parse_many(json).get(R"({"hello": "world"})"_padded);
14+
#else
15+
auto json = R"({"hello": "world"})"_padded;
16+
auto error = parser.parse_many(json).get(stream);
17+
#endif
18+
if(error) {
19+
std::cerr << error << std::endl;
20+
return false;
21+
}
22+
size_t count = 0;
23+
for (auto doc : stream) {
24+
if(doc.error()) {
25+
std::cerr << "Unexpected error: " << doc.error() << std::endl;
26+
return false;
27+
}
28+
std::string expected = R"({"hello":"world"})";
29+
simdjson::dom::element this_document;
30+
error = doc.get(this_document);
31+
if(error) {
32+
std::cerr << error << std::endl;
33+
return false;
34+
}
35+
std::string answer = simdjson::minify(this_document);
36+
if(answer != expected) {
37+
std::cout << this_document << std::endl;
38+
return false;
39+
}
40+
count += 1;
41+
}
42+
return count == 1;
43+
}
44+
45+
int main() {
46+
return single_document() ? EXIT_SUCCESS : EXIT_FAILURE;
47+
}

0 commit comments

Comments
 (0)