Commit 9304d88

Prototype test for issue 1299: using parse_many, find the location of the end of the last document (simdjson#1301)
* Prototype test for issue 1299.
* This improves the documentation.
* Removing trailing white spaces.
* Removing trailing spaces
* Trailing.
1 parent 725ca01 commit 9304d88

2 files changed

+147
-1
lines changed

doc/parse_many.md

Lines changed: 57 additions & 0 deletions
@@ -13,6 +13,8 @@ Contents
1313
- [Support](#support)
1414
- [API](#api)
1515
- [Use cases](#use-cases)
16+
- [Tracking your position](#tracking-your-position)
17+
- [Incomplete streams](#incomplete-streams)
1618

1719
Motivation
1820
-----------
@@ -158,3 +160,58 @@ From [jsonlines.org](http://jsonlines.org/examples/):
158160
```
159161
JSON Lines' biggest strength is in handling lots of similar nested data structures. One .jsonl file is easier to
160162
work with than a directory full of XML files.
163+
164+
165+
Tracking your position
166+
-----------
167+
168+
Some users would like to know where the document they parsed is located in the input array of bytes.
169+
You can find out by accessing the iterator directly and calling its `current_index()`
170+
method, which reports the location (in bytes) of the current document in the input stream.
171+
172+
Let us illustrate the idea with code:
173+
174+
175+
```C++
176+
auto json = R"([1,2,3]  {"1":1,"2":3,"4":4} [1,2,3]  )"_padded;
177+
simdjson::dom::parser parser;
178+
simdjson::dom::document_stream stream;
179+
ASSERT_SUCCESS( parser.parse_many(json).get(stream) );
180+
auto i = stream.begin();
181+
for(; i != stream.end(); ++i) {
182+
auto doc = *i;
183+
if(!doc.error()) {
184+
std::cout << "got full document at " << i.current_index() << std::endl;
185+
}
186+
}
187+
size_t index = i.current_index();
188+
if(index != 38) {
189+
std::cerr << "Expected to stop after the three full documents " << std::endl;
190+
std::cerr << "index = " << index << std::endl;
191+
return false;
192+
}
193+
```
194+
195+
This code will print:
196+
```
197+
got full document at 0
198+
got full document at 9
199+
got full document at 29
200+
```
201+
202+
The last call to `i.current_index()` returns the byte index 38, which is just beyond
202+
the last document.
204+
205+
Incomplete streams
206+
-----------
207+
208+
Some users may need to work with truncated streams while tracking their location in the stream.
209+
The same code, with `current_index()`, will work. However, if the last block (by default 1MB)
210+
terminates with an unclosed string, then no JSON document within this last block will validate.
211+
In particular, it means that if your input string is `[1,2,3]  {"1":1,"2":3,"4":4} "intentionally unclosed string` then
212+
no JSON document will be successfully parsed. The error `simdjson::UNCLOSED_STRING` will be
213+
given (even with the first JSON document). It is then your responsibility to complete the input,
214+
for example by appending the missing data at the end of the truncated string, or by copying the truncated
215+
data before the continuing input. A truncation outside a string (for example an input ending in `[1,2`) still yields the complete documents that precede it.
216+
217+
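Copying the truncated data before the continuing input can be sketched as follows. This is an illustration under stated assumptions, not simdjson code: `carry_over` is a hypothetical helper, and `consumed` stands for the value returned by `current_index()` once the loop has ended, that is, the offset where the first incomplete document begins. It requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of simdjson): keep the bytes of `chunk` that
// were not consumed and place them in front of the bytes that arrived next.
// `consumed` plays the role of the final current_index() value.
std::string carry_over(std::string_view chunk, std::size_t consumed,
                       std::string_view next_chunk) {
  assert(consumed <= chunk.size());
  // Start from the unparsed tail of the previous chunk...
  std::string joined(chunk.substr(consumed));
  // ...and append the continuing input after it.
  joined.append(next_chunk);
  return joined;
}
```

For instance, if the first chunk is `[1,2,3]  {"1":1,"2":3,"4":4} [1,2` and parsing stops at index 29, then carrying the tail over to a next chunk `,3] [4]` gives `[1,2,3] [4]`. The result would still need to be padded (for example by building a `simdjson::padded_string` from it) before being passed to `parse_many` again.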

tests/document_stream_tests.cpp

Lines changed: 90 additions & 1 deletion
@@ -312,7 +312,93 @@ namespace document_stream_tests {
312312
return count == 1;
313313
}
314314
#endif
315+
bool simple_example() {
316+
std::cout << "Running " << __func__ << std::endl;
317+
// All three JSON documents are complete;
318+
// none of them is truncated.
319+
auto json = R"([1,2,3]  {"1":1,"2":3,"4":4} [1,2,3]  )"_padded;
320+
simdjson::dom::parser parser;
321+
size_t count = 0;
322+
simdjson::dom::document_stream stream;
323+
// We use a window of json.size() though any large value would do.
324+
ASSERT_SUCCESS( parser.parse_many(json, json.size()).get(stream) );
325+
auto i = stream.begin();
326+
for(; i != stream.end(); ++i) {
327+
auto doc = *i;
328+
if(!doc.error()) {
329+
std::cout << "got full document at " << i.current_index() << std::endl;
330+
count++;
331+
}
332+
}
333+
if(count != 3) {
334+
std::cerr << "Expected to get three full documents " << std::endl;
335+
return false;
336+
}
337+
size_t index = i.current_index();
338+
if(index != 38) {
339+
std::cerr << "Expected to stop after the three full documents " << std::endl;
340+
std::cerr << "index = " << index << std::endl;
341+
return false;
342+
}
343+
return true;
344+
}
345+
346+
347+
bool truncated_window() {
348+
std::cout << "Running " << __func__ << std::endl;
349+
// The last JSON document is
350+
// intentionally truncated.
351+
auto json = R"([1,2,3]  {"1":1,"2":3,"4":4} [1,2 )"_padded;
352+
simdjson::dom::parser parser;
353+
size_t count = 0;
354+
simdjson::dom::document_stream stream;
355+
// We use a window of json.size() though any large value would do.
356+
ASSERT_SUCCESS( parser.parse_many(json, json.size()).get(stream) );
357+
auto i = stream.begin();
358+
for(; i != stream.end(); ++i) {
359+
auto doc = *i;
360+
if(!doc.error()) {
361+
std::cout << "got full document at " << i.current_index() << std::endl;
362+
count++;
363+
}
364+
}
365+
if(count != 2) {
366+
std::cerr << "Expected to get two full documents " << std::endl;
367+
return false;
368+
}
369+
size_t index = i.current_index();
370+
if(index != 29) {
371+
std::cerr << "Expected to stop after the two full documents " << std::endl;
372+
std::cerr << "index = " << index << std::endl;
373+
return false;
374+
}
375+
return true;
376+
}
315377

378+
bool truncated_window_unclosed_string() {
379+
std::cout << "Running " << __func__ << std::endl;
380+
// The last JSON document is intentionally truncated. In this instance, we use
381+
// a truncated string which will create trouble since stage 1 will recognize the
382+
// JSON as invalid and refuse to even start parsing.
383+
auto json = R"([1,2,3] {"1":1,"2":3,"4":4} "intentionally unclosed string )"_padded;
384+
simdjson::dom::parser parser;
385+
simdjson::dom::document_stream stream;
386+
// We use a window of json.size() though any large value would do.
387+
ASSERT_SUCCESS( parser.parse_many(json,json.size()).get(stream) );
388+
// Rest is ineffective because stage 1 fails.
389+
auto i = stream.begin();
390+
for(; i != stream.end(); ++i) {
391+
auto doc = *i;
392+
if(!doc.error()) {
393+
std::cout << "got full document at " << i.current_index() << std::endl;
394+
return false;
395+
} else {
396+
std::cout << doc.error() << std::endl;
397+
return (doc.error() == simdjson::UNCLOSED_STRING);
398+
}
399+
}
400+
return false;
401+
}
316402
bool small_window() {
317403
std::cout << "Running " << __func__ << std::endl;
318404
char input[2049];
@@ -502,7 +588,10 @@ namespace document_stream_tests {
502588
}
503589

504590
bool run() {
505-
return issue1307() &&
591+
return simple_example() &&
592+
truncated_window() &&
593+
truncated_window_unclosed_string() &&
594+
issue1307() &&
506595
issue1308() &&
507596
issue1309() &&
508597
issue1310() &&
