Commit 0d6919d

Reenable the on-demand tests and allows us to convert a raw string into a C++ string. (simdjson#1232)
* Reenable the on-demand tests and allows us to convert a raw string into a C++ string.
* Fixing a 1-byte buffer overrun.
* More documentation.
* Adding more tests.
* Enabling the new tests
* Committing a nicer example.
* Not yet happy but this should fix our failures.
* Duh.
* Ok. Making it easier to get string_view instances from field instances.
* It is a struct.
* Trying to satisfy VS.
* Adopting John's name.
1 parent 3e8e797 commit 0d6919d

File tree

8 files changed: +203 -13 lines changed


.circleci/config.yml

Lines changed: 17 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -209,15 +209,25 @@ jobs:
209209

210210
# make (test and checkperf)
211211
arch-haswell-gcc10:
212-
description: Build, run tests and check performance on GCC 7 with -march=haswell
212+
description: Build, run tests and check performance on GCC 10 with -march=haswell
213213
executor: gcc10
214214
environment: { CXXFLAGS: -march=haswell }
215215
steps: [ cmake_test ]
216216
arch-nehalem-gcc10:
217-
description: Build, run tests and check performance on GCC 7 with -march=nehalem
217+
description: Build, run tests and check performance on GCC 10 with -march=nehalem
218218
executor: gcc10
219219
environment: { CXXFLAGS: -march=nehalem }
220220
steps: [ cmake_test ]
221+
sanitize-haswell-gcc10:
222+
description: Build and run tests on GCC 10 and AVX 2 with a cmake sanitize build
223+
executor: gcc10
224+
environment: { CXXFLAGS: -march=haswell, CMAKE_FLAGS: -DSIMDJSON_BUILD_STATIC=OFF -DSIMDJSON_SANITIZE=ON, BUILD_FLAGS: "", CTEST_FLAGS: --output-on-failure -E checkperf }
225+
steps: [ cmake_test ]
226+
sanitize-haswell-clang10:
227+
description: Build and run tests on clang 10 and AVX 2 with a cmake sanitize build
228+
executor: clang10
229+
environment: { CXXFLAGS: -march=haswell, CMAKE_FLAGS: -DSIMDJSON_BUILD_STATIC=OFF -DSIMDJSON_SANITIZE=ON, CTEST_FLAGS: --output-on-failure -E checkperf }
230+
steps: [ cmake_test ]
221231

222232
workflows:
223233
version: 2.1
@@ -248,6 +258,11 @@ workflows:
248258
- arch-haswell-gcc10
249259
- arch-nehalem-gcc10
250260

261+
262+
# sanitized single-implementation tests
263+
- sanitize-haswell-gcc10
264+
- sanitize-haswell-clang10
265+
251266
# testing "just the library"
252267
- justlib-gcc10
253268

doc/ondemand.md

Lines changed: 38 additions & 0 deletions
@@ -447,6 +447,44 @@ When the user requests strings, we unescape them to a single string buffer much
447447
so that users enjoy the same string performance as the core simdjson. We do not write the length to the
448448
string buffer, however; that is stored in the `string_view` instance we return to the user.
449449

450+
```C++
451+
ondemand::parser parser;
452+
auto doc = parser.iterate(json);
453+
std::set<std::string_view> default_users;
454+
ondemand::array tweets = doc["statuses"].get_array();
455+
for (auto tweet_value : tweets) {
456+
auto tweet = tweet_value.get_object();
457+
ondemand::object user = tweet["user"].get_object();
458+
std::string_view screen_name = user["screen_name"].get_string();
459+
bool default_profile = user["default_profile"].get_bool();
460+
if (default_profile) { default_users.insert(screen_name); }
461+
}
462+
```
463+
464+
By using `string_view` instances, we avoid the high cost of allocating many small strings (as would be the
465+
case with `std::string`) but be mindful that the life cycle of these `string_view` instances is tied to the
466+
parser instance. If the parser instance is destroyed or reused for a new JSON document, these strings are no longer valid.
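
When a string has to outlive the parser, one option is to copy each view into an owning `std::string` before the parser is destroyed or reused. The following is a minimal sketch using only the standard library: `own_strings` is a hypothetical helper (it is not part of simdjson), and in the usage note a plain `std::string` stands in for the parser's string buffer.

```C++
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of simdjson: copies every view into an
// owning std::string so the result no longer depends on the buffer that
// the views point into.
inline std::set<std::string> own_strings(const std::vector<std::string_view> &views) {
  std::set<std::string> owned;
  for (std::string_view v : views) {
    owned.emplace(v); // one allocation per string: this is the cost string_view avoids
  }
  return owned;
}
```

For example, given views into a buffer holding `alice` and `bob`, the set returned by `own_strings` still contains both names after the buffer has been overwritten, whereas the original views would then refer to stale data.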
467+
468+
We iterate through object instances using `field` instances which represent key-value pairs. The value
469+
is accessible by the `value()` method whereas the key is accessible by the `key()` method.
470+
The keys are treated differently from values: they are made available as a special type, `raw_json_string`,
471+
which is a lightweight type that is meant to be used on a temporary basis, almost solely for
472+
direct raw ASCII comparisons (`field.key() == "mykey"`). If you occasionally need to access and store the
473+
unescaped key values, you may use the `unescaped_key()` method. Once you have called the `unescaped_key()` method,
474+
neither the `key()` nor the `unescaped_key()` methods should be called: the current field instance
475+
no longer has a key (that is by design). Like other strings, the resulting `std::string_view` generated
476+
from the `unescaped_key()` method has a lifecycle tied to the `parser` instance: once the parser
477+
is destroyed or reused with another document, the `std::string_view` instance becomes invalid.
478+
479+
480+
```C++
481+
auto doc = parser.iterate(json);
482+
for(auto field : doc.get_object()) {
483+
std::string_view keyv = field.unescaped_key();
484+
}
485+
```
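
The "consumed key" behaviour described above can be pictured with a small standalone type. This is only a sketch under stated assumptions: `one_shot_key` and `take()` are hypothetical names, and the sketch reports a second access by returning `std::nullopt`, whereas the library code in this commit guards the pointer with `SIMDJSON_ASSUME` instead.

```C++
#include <optional>
#include <string_view>

// Hypothetical stand-in for a key that may only be taken once.
// It mirrors the idea of nulling an inner pointer after use.
class one_shot_key {
public:
  explicit one_shot_key(const char *buf) noexcept : buf_(buf) {}

  // True while the key has not been taken yet.
  bool alive() const noexcept { return buf_ != nullptr; }

  // Returns the key on the first call and std::nullopt on any later call.
  std::optional<std::string_view> take() noexcept {
    if (!alive()) { return std::nullopt; }
    std::string_view answer(buf_);
    buf_ = nullptr; // consume: the instance no longer has a key
    return answer;
  }

private:
  const char *buf_; // expected to point at a NUL-terminated string
};
```

Calling `take()` twice on the same instance yields the key once and an empty optional afterwards, which is the pattern to keep in mind when deciding between `key()` and `unescaped_key()`.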
486+
487+
450488

451489
### Iteration Safety
452490

include/simdjson/generic/ondemand/field-inl.h

Lines changed: 12 additions & 0 deletions
@@ -21,7 +21,15 @@ simdjson_really_inline simdjson_result<field> field::start(json_iterator_ref &&i
2121
return field(key, value::start(std::forward<json_iterator_ref>(iter)));
2222
}
2323

24+
simdjson_really_inline simdjson_warn_unused simdjson_result<std::string_view> field::unescaped_key() noexcept {
25+
SIMDJSON_ASSUME(first.buf != nullptr); // We would like to call .alive() but Visual Studio won't let us.
26+
simdjson_result<std::string_view> answer = first.unescape(second.get_iterator());
27+
first.consume();
28+
return answer;
29+
}
30+
2431
simdjson_really_inline raw_json_string field::key() const noexcept {
32+
SIMDJSON_ASSUME(first.buf != nullptr); // We would like to call .alive() but Visual Studio won't let us.
2533
return first;
2634
}
2735

@@ -58,6 +66,10 @@ simdjson_really_inline simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::raw_js
5866
if (error()) { return error(); }
5967
return first.key();
6068
}
69+
simdjson_really_inline simdjson_result<std::string_view> simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::field>::unescaped_key() noexcept {
70+
if (error()) { return error(); }
71+
return first.unescaped_key();
72+
}
6173
simdjson_really_inline simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::value> simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::field>::value() noexcept {
6274
if (error()) { return error(); }
6375
return std::move(first.value());

include/simdjson/generic/ondemand/field.h

Lines changed: 12 additions & 1 deletion
@@ -26,7 +26,17 @@ class field : public std::pair<raw_json_string, value> {
2626
simdjson_really_inline field &operator=(const field &other) noexcept = delete;
2727

2828
/**
29-
* Get the key.
29+
* Get the key as a string_view (for higher speed, consider key()).
30+
* We deliberately use a more cumbersome name (unescaped_key) to force users
31+
* to think twice about using it.
32+
*
33+
* This consumes the key: once you have called unescaped_key(), you cannot
34+
* call it again nor can you call key().
35+
*/
36+
simdjson_really_inline simdjson_warn_unused simdjson_result<std::string_view> unescaped_key() noexcept;
37+
/**
38+
* Get the key as a raw_json_string: this is fast and allows straight comparisons.
39+
* We want this to be the default for most users.
3040
*/
3141
simdjson_really_inline raw_json_string key() const noexcept;
3242
/**
@@ -62,6 +72,7 @@ struct simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::field> : public SIMDJS
6272
simdjson_really_inline simdjson_result(simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::field> &&a) noexcept = default;
6373
simdjson_really_inline ~simdjson_result() noexcept = default; ///< @private
6474

75+
simdjson_really_inline simdjson_result<std::string_view> unescaped_key() noexcept;
6576
simdjson_really_inline simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::raw_json_string> key() noexcept;
6677
simdjson_really_inline simdjson_result<SIMDJSON_IMPLEMENTATION::ondemand::value> value() noexcept;
6778
};

include/simdjson/generic/ondemand/parser-inl.h

Lines changed: 2 additions & 3 deletions
@@ -8,9 +8,8 @@ simdjson_warn_unused simdjson_really_inline error_code parser::allocate(size_t n
88
// string_capacity copied from document::allocate
99
_capacity = 0;
1010
_max_depth = 0;
11-
// The most string buffer we could possibly need is capacity-2 (a string the whole document long).
12-
// Allocate up to capacity so we don't have to check for capacity == 0 or 1.
13-
string_buf.reset(new (std::nothrow) uint8_t[new_capacity]);
11+
size_t string_capacity = SIMDJSON_ROUNDUP_N(5 * new_capacity / 3 + SIMDJSON_PADDING, 64);
12+
string_buf.reset(new (std::nothrow) uint8_t[string_capacity]);
1413
SIMDJSON_TRY( dom_parser.set_capacity(new_capacity) );
1514
SIMDJSON_TRY( dom_parser.set_max_depth(DEFAULT_MAX_DEPTH) );
1615
_capacity = new_capacity;
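
As a worked example of the new buffer-size formula, the sketch below evaluates `5 * new_capacity / 3 + padding`, rounded up to a multiple of 64. It rests on two assumptions that this diff does not show: `roundup_n` is a hypothetical function assumed to behave like `SIMDJSON_ROUNDUP_N` (round up to a multiple of a power of two), and the padding is passed as a parameter because the value of `SIMDJSON_PADDING` does not appear here (32 is used purely for illustration).

```C++
#include <cstddef>

// Assumed behaviour of SIMDJSON_ROUNDUP_N: round a up to the next multiple
// of n, where n is a power of two.
constexpr std::size_t roundup_n(std::size_t a, std::size_t n) {
  return (a + (n - 1)) & ~(n - 1);
}

// Mirrors the expression in parser::allocate; padding stands in for
// SIMDJSON_PADDING, whose value is an assumption here.
constexpr std::size_t string_capacity(std::size_t new_capacity, std::size_t padding) {
  return roundup_n(5 * new_capacity / 3 + padding, 64);
}

// 5 * 1000 / 3 = 1666 (integer division), 1666 + 32 = 1698, rounded up to 1728.
static_assert(string_capacity(1000, 32) == 1728, "worked example");
```

Under these assumptions even a capacity of zero yields a 64-byte string buffer, so the buffer is never empty.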

include/simdjson/generic/ondemand/raw_json_string.h

Lines changed: 29 additions & 2 deletions
@@ -8,10 +8,20 @@ class object;
88
class parser;
99

1010
/**
11-
* A string escaped per JSON rules, terminated with quote (")
11+
* A string escaped per JSON rules, terminated with quote ("). They are used to represent
12+
* unescaped keys inside JSON documents.
1213
*
1314
* (In other words, a pointer to the beginning of a string, just after the start quote, inside a
1415
* JSON file.)
16+
*
17+
* This class is deliberately simplistic and has little functionality. You can
18+
* compare two raw_json_string instances, or compare a raw_json_string with a string_view, but
19+
* that is pretty much all you can do.
20+
*
21+
* They typically originate from field instances, which in turn represent key-value pairs from
22+
* object instances. From a field instance, you get the raw_json_string instance by calling key().
23+
* You can, if you want a more usable string_view instance, call the unescaped_key() method
24+
* on the field instance.
1525
*/
1626
class raw_json_string {
1727
public:
@@ -35,8 +45,24 @@ class raw_json_string {
3545
simdjson_really_inline raw_json_string(const uint8_t * _buf) noexcept;
3646
/**
3747
* Get the raw pointer to the beginning of the string in the JSON (just after the ").
48+
*
49+
* It is possible for this function to return a null pointer if the instance
50+
* has outlived its existence.
3851
*/
3952
simdjson_really_inline const char * raw() const noexcept;
53+
54+
private:
55+
/**
56+
* This will set the inner pointer to zero, effectively making
57+
* this instance unusable.
58+
*/
59+
simdjson_really_inline void consume() noexcept { buf = nullptr; }
60+
61+
/**
62+
* Checks whether the inner pointer is non-null and thus usable.
63+
*/
64+
simdjson_really_inline simdjson_warn_unused bool alive() const noexcept { return buf != nullptr; }
65+
4066
/**
4167
* Unescape this JSON string, replacing \\ with \, \n with newline, etc.
4268
*
@@ -62,9 +88,10 @@ class raw_json_string {
6288
*/
6389
simdjson_really_inline simdjson_warn_unused simdjson_result<std::string_view> unescape(json_iterator &iter) const noexcept;
6490

65-
private:
6691
const uint8_t * buf{};
6792
friend class object;
93+
friend class field;
94+
friend struct simdjson_result<raw_json_string>;
6895
};
6996

7097
simdjson_unused simdjson_really_inline bool operator==(const raw_json_string &a, std::string_view b) noexcept;

include/simdjson/generic/stringparsing.h

Lines changed: 4 additions & 0 deletions
@@ -73,6 +73,10 @@ simdjson_really_inline bool handle_unicode_codepoint(const uint8_t **src_ptr,
7373
return offset > 0;
7474
}
7575

76+
/**
77+
* Unescape a string from src to dst, stopping at a final unescaped quote. E.g., if src points at 'joe"', then
78+
* dst needs to have four free bytes.
79+
*/
7680
simdjson_warn_unused simdjson_really_inline uint8_t *parse_string(const uint8_t *src, uint8_t *dst) {
7781
while (1) {
7882
// Copy the next n bytes, and find the backslash and quote in them.
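
To see why `'joe"'` calls for four free bytes in `dst`, consider a byte-at-a-time stand-in that, like the comment in the loop above describes, copies first and inspects afterwards. This is a simplified sketch, not the library's implementation: `parse_string_scalar` is a hypothetical name, it handles only the single-character escapes, it rejects `\uXXXX`, and it assumes the input contains a closing quote.

```C++
#include <cstdint>

// Hypothetical scalar stand-in for parse_string. Returns a pointer just past
// the last unescaped byte in dst, or nullptr on an unsupported escape.
// The input must contain a terminating unescaped quote.
inline std::uint8_t *parse_string_scalar(const std::uint8_t *src, std::uint8_t *dst) {
  while (true) {
    const std::uint8_t c = *src;
    *dst = c; // copy first: the closing quote is written to dst as well
    if (c == '"') { return dst; }
    if (c == '\\') {
      std::uint8_t unescaped;
      switch (src[1]) {
        case '"':  unescaped = '"';  break;
        case '\\': unescaped = '\\'; break;
        case '/':  unescaped = '/';  break;
        case 'b':  unescaped = '\b'; break;
        case 'f':  unescaped = '\f'; break;
        case 'n':  unescaped = '\n'; break;
        case 'r':  unescaped = '\r'; break;
        case 't':  unescaped = '\t'; break;
        default:   return nullptr; // \uXXXX and invalid escapes are not handled here
      }
      *dst = unescaped; // overwrite the copied backslash
      src += 2;
    } else {
      src += 1;
    }
    dst += 1;
  }
}
```

With `src` pointing at `joe"`, the sketch writes `j`, `o`, `e` and then the quote itself before returning `dst + 3`, so four bytes of `dst` are touched even though the unescaped string is three bytes long.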

tests/ondemand/ondemand_basictests.cpp

Lines changed: 89 additions & 5 deletions
@@ -15,6 +15,7 @@
1515
#include "simdjson.h"
1616
#include "test_ondemand.h"
1717

18+
1819
// const size_t AMAZON_CELLPHONES_NDJSON_DOC_COUNT = 793;
1920
#define SIMDJSON_SHOW_DEFINE(x) printf("%s=%s\n", #x, STRINGIFY(x))
2021

@@ -41,6 +42,42 @@ void compilation_test_1() {
4142
}
4243
}
4344
}
45+
46+
47+
// Do not run this, it is only meant to compile
48+
void compilation_test_2() {
49+
const padded_string bogus = ""_padded;
50+
ondemand::parser parser;
51+
auto doc = parser.iterate(bogus);
52+
std::set<std::string_view> default_users;
53+
ondemand::array tweets = doc["statuses"].get_array();
54+
for (auto tweet_value : tweets) {
55+
auto tweet = tweet_value.get_object();
56+
ondemand::object user = tweet["user"].get_object();
57+
std::string_view screen_name = user["screen_name"].get_string();
58+
bool default_profile = user["default_profile"].get_bool();
59+
if (default_profile) { default_users.insert(screen_name); }
60+
}
61+
}
62+
63+
64+
// Do not run this, it is only meant to compile
65+
void compilation_test_3() {
66+
const padded_string bogus = ""_padded;
67+
ondemand::parser parser;
68+
auto doc = parser.iterate(bogus);
69+
ondemand::array tweets;
70+
if(doc["statuses"].get(tweets)) { return; } // a non-zero error code means failure
71+
for (auto tweet_value : tweets) {
72+
auto tweet = tweet_value.get_object();
73+
for (auto field : tweet) {
74+
std::string_view key = field.unescaped_key().value();
75+
std::cout << "key = " << key << std::endl;
76+
std::string_view val = std::string_view(field.value());
77+
std::cout << "value (assuming it is a string) = " << val << std::endl;
78+
}
79+
}
80+
}
4481
#endif
4582

4683
#define ONDEMAND_SUBTEST(NAME, JSON, TEST) \
@@ -53,6 +90,32 @@ void compilation_test_1() {
5390
} \
5491
}
5592

93+
94+
namespace key_string_tests {
95+
#if SIMDJSON_EXCEPTIONS
96+
bool parser_key_value() {
97+
TEST_START();
98+
ondemand::parser parser;
99+
const padded_string json = R"({ "1": "1", "2": "2", "3": "3", "abc": "abc", "\u0075": "\u0075" })"_padded;
100+
auto doc = parser.iterate(json);
101+
for(auto field : doc.get_object()) {
102+
std::string_view keyv = field.unescaped_key();
103+
std::string_view valuev = field.value();
104+
if(keyv != valuev) { return false; }
105+
}
106+
return true;
107+
}
108+
#endif
109+
bool run() {
110+
return
111+
#if SIMDJSON_EXCEPTIONS
112+
parser_key_value() &&
113+
#endif
114+
true;
115+
}
116+
117+
}
118+
56119
namespace number_tests {
57120

58121
// ulp distance
@@ -866,10 +929,30 @@ namespace twitter_tests {
866929
auto media = entities["media"];
867930
if (media.error() == SUCCESS) {
868931
for (ondemand::object image : media) {
932+
/**
933+
* Fun fact: id and id_str can differ:
934+
* 505866668485386240 and 505866668485386241.
935+
* Presumably, it is because doubles are used
936+
* at some point in the process and the number
937+
* 505866668485386241 cannot be represented as a double.
938+
* (not our fault)
939+
*/
940+
uint64_t id_val = image["id"].get_uint64();
941+
std::cout << "id = " <<id_val << std::endl;
942+
auto id_string = std::string_view(image["id_str"].value());
943+
std::cout << "id_string = " << id_string << std::endl;
869944
auto sizes = image["sizes"].get_object();
870945
for (auto size : sizes) {
946+
/**
947+
* We want to know the key that describes the size.
948+
*/
949+
std::string_view raw_size_key_v = size.unescaped_key().value();
950+
std::cout << "Type of image size = " << raw_size_key_v << std::endl;
871951
ondemand::object size_value = size.value();
872-
image_sizes.insert(make_pair(size_value["w"], size_value["h"]));
952+
int64_t width = size_value["w"];
953+
int64_t height = size_value["h"];
954+
std::cout << width << " x " << height << std::endl;
955+
image_sizes.insert(make_pair(width, height));
873956
}
874957
}
875958
}
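
The `id` versus `id_str` discrepancy mentioned in the comment above can be reproduced without any JSON at all. The sketch below assumes IEEE 754 binary64 doubles with round-to-nearest; `through_double` and `survives_double` are hypothetical helper names. Values between 2^58 and 2^59 are spaced 64 apart in a double, so 505866668485386241 rounds to its neighbour 505866668485386240, which is a multiple of 64.

```C++
#include <cstdint>

// Hypothetical helper: round-trips a 64-bit integer through a double.
inline std::uint64_t through_double(std::uint64_t v) {
  return static_cast<std::uint64_t>(static_cast<double>(v));
}

// True when the value is exactly representable as a double.
inline bool survives_double(std::uint64_t v) {
  return through_double(v) == v;
}
```

Under these assumptions, `through_double(505866668485386241ULL)` yields `505866668485386240`, matching the pair of values seen in the Twitter data, which is why the test reads the identifier with `get_uint64()` rather than going through a floating-point type.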
@@ -1346,12 +1429,13 @@ int main(int argc, char *argv[]) {
13461429

13471430
std::cout << "Running basic tests." << std::endl;
13481431
if (
1349-
// parse_api_tests::run() &&
1350-
// dom_api_tests::run() &&
1351-
// twitter_tests::run() &&
1352-
// number_tests::run() &&
1432+
parse_api_tests::run() &&
1433+
dom_api_tests::run() &&
1434+
twitter_tests::run() &&
1435+
number_tests::run() &&
13531436
error_tests::run() &&
13541437
ordering_tests::run() &&
1438+
key_string_tests::run() &&
13551439
true
13561440
) {
13571441
std::cout << "Basic tests are ok." << std::endl;
