Commit fa4ce6a
There is confusion between gigabytes and gibibytes. Let us standardize throughout. (simdjson#838)

* There is confusion between gigabytes and gibibytes.
* Trying to be consistent.
1 parent 9863f62 commit fa4ce6a

12 files changed: +61 −45 lines

HACKING.md

Lines changed: 3 additions & 3 deletions

@@ -41,7 +41,7 @@ Other important files and directories:
 * **amalgamate.sh:** Generates singleheader/simdjson.h and singleheader/simdjson.cpp for release.
 * **benchmark:** This is where we do benchmarking. Benchmarking is core to every change we make; the
   cardinal rule is don't regress performance without knowing exactly why, and what you're trading
-  for it. Many of our benchmarks are microbenchmarks. We trying to assess a specific functions in a specific library. In this scenario, we are effectively doing controlled scientific experiments for the purpose of understanding what affects our performance. So we simplify as much as possible. We try to avoid irrelevant factors such as page faults, interrupts, unnnecessary system calls, how fast and how eagerly the OS maps memory In such scenarios, we typically want to get the best performance that we can achieve... the case where we did not get interrupts, context switches, page faults... What we want is consistency and predictability. The numbers should not depend too much on how busy the machine is, on whether your upgraded your operating system recently, and so forth. This type of benchmarking is distinct from system benchmarking. If you're not sure what else to do to check your performance, this is always a good start:
+  for it. Many of our benchmarks are microbenchmarks. We are effectively doing controlled scientific experiments for the purpose of understanding what affects our performance. So we simplify as much as possible. We try to avoid irrelevant factors such as page faults, interrupts, unnecessary system calls. We recommend checking the performance as follows:
   ```bash
   mkdir build
   cd build
@@ -53,11 +53,11 @@ Other important files and directories:
   ```bash
   mkdir build
   cd build
-  cmake .. -DSIMDJSON_GOOGLE_BENCHMARKS=ON
+  cmake ..
   cmake --build . --target bench_parse_call --config Release
   ./benchmark/bench_parse_call
   ```
-  The last line becomes `./benchmark/Release/bench_parse_call.exe` under Windows. Under Windows, you can also build with the clang compiler by adding `-T ClangCL` to the call to `cmake .. `.
+  The last line becomes `./benchmark/Release/bench_parse_call.exe` under Windows. Under Windows, you can also build with the clang compiler by adding `-T ClangCL` to the call to `cmake ..`: `cmake .. -T ClangCL`.
 * **fuzz:** The source for fuzz testing. This lets us explore important edge and middle cases
   automatically, and is run in CI.

README.md

Lines changed: 5 additions & 6 deletions

@@ -75,15 +75,14 @@ Usage documentation is available:
 Performance results
 -------------------
-The simdjson library uses three-quarters less instructions than state-of-the-art parser RapidJSON and
+The simdjson library uses three-quarters less instructions than state-of-the-art parser [RapidJSON](https://rapidjson.org) and
 fifty percent less than sajson. To our knowledge, simdjson is the first fully-validating JSON parser
-to run at gigabytes per second on commodity processors. It can parse millions of JSON documents
-per second on a single core.
+to run at [gigabytes per second](https://en.wikipedia.org/wiki/Gigabyte) (GB/s) on commodity processors. It can parse millions of JSON documents per second on a single core.

 The following figure represents parsing speed in GB/s for parsing various files
 on an Intel Skylake processor (3.4 GHz) using the GNU GCC 9 compiler (with the -O3 flag).
 We compare against the best and fastest C++ libraries.
-The simdjson library offers full unicode (UTF-8) validation and exact
+The simdjson library offers full unicode ([UTF-8](https://en.wikipedia.org/wiki/UTF-8)) validation and exact
 number parsing. The RapidJSON library is tested in two modes: fast and
 exact number parsing. The sajson library offers fast (but not exact)
 number parsing and partial unicode validation. In this data set, the file
@@ -183,8 +182,8 @@ Head over to [CONTRIBUTING.md](CONTRIBUTING.md) for information on contributing
 License
 -------
-This code is made available under the Apache License 2.0.
+This code is made available under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

 Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.
-For compilers that do not support C++17, we bundle the string-view library which is published under the Boost license (http://www.boost.org/LICENSE_1_0.txt). Like the Apache license, the Boost license is a permissive license allowing commercial redistribution.
+For compilers that do not support [C++17](https://en.wikipedia.org/wiki/C%2B%2B17), we bundle the string-view library which is published under the Boost license (http://www.boost.org/LICENSE_1_0.txt). Like the Apache license, the Boost license is a permissive license allowing commercial redistribution.

benchmark/bench_parse_call.cpp

Lines changed: 7 additions & 5 deletions

@@ -36,9 +36,10 @@ static void parse_twitter(State& state) {
     }
     benchmark::DoNotOptimize(doc);
   }
-  state.counters["Bytes"] = benchmark::Counter(
+  // Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
+  state.counters["Gigabytes"] = benchmark::Counter(
       double(bytes), benchmark::Counter::kIsRate,
-      benchmark::Counter::OneK::kIs1024);
+      benchmark::Counter::OneK::kIs1000); // For GiB : kIs1024
   state.counters["docs"] = Counter(double(state.iterations()), benchmark::Counter::kIsRate);
 }
 BENCHMARK(parse_twitter)->Repetitions(10)->ComputeStatistics("max", [](const std::vector<double>& v) -> double {
@@ -72,9 +73,10 @@ static void parse_gsoc(State& state) {
     }
     benchmark::DoNotOptimize(doc);
   }
-  state.counters["Bytes"] = benchmark::Counter(
+  // Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
+  state.counters["Gigabytes"] = benchmark::Counter(
      double(bytes), benchmark::Counter::kIsRate,
-      benchmark::Counter::OneK::kIs1024);
+      benchmark::Counter::OneK::kIs1000); // For GiB : kIs1024
   state.counters["docs"] = Counter(double(state.iterations()), benchmark::Counter::kIsRate);
 }
 BENCHMARK(parse_gsoc)->Repetitions(10)->ComputeStatistics("max", [](const std::vector<double>& v) -> double {
@@ -160,4 +162,4 @@ BENCHMARK(document_parse_exception);

 #endif // SIMDJSON_EXCEPTIONS

-BENCHMARK_MAIN();
+BENCHMARK_MAIN();

benchmark/benchmarker.h

Lines changed: 2 additions & 1 deletion

@@ -379,9 +379,10 @@ struct benchmarker {
     run_loop(iterations);
   }

+  // Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
   template<typename T>
   void print_aggregate(const char* prefix, const T& stage) const {
-    printf("%s%-13s: %8.4f ns per block (%6.2f%%) - %8.4f ns per byte - %8.4f ns per structural - %8.3f GB/s\n",
+    printf("%s%-13s: %8.4f ns per block (%6.2f%%) - %8.4f ns per byte - %8.4f ns per structural - %8.4f GB/s\n",
       prefix,
       "Speed",
       stage.elapsed_ns() / static_cast<double>(stats->blocks), // per block

benchmark/distinctuseridcompetition.cpp

Lines changed: 5 additions & 5 deletions

@@ -335,13 +335,13 @@ int main(int argc, char *argv[]) {
     std::cerr << "Could not load the file " << filename << std::endl;
     return EXIT_FAILURE;
   }
-
+  // Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
   if (verbose) {
     std::cout << "Input has ";
-    if (p.size() > 1024 * 1024)
-      std::cout << p.size() / (1024 * 1024) << " MB ";
-    else if (p.size() > 1024)
-      std::cout << p.size() / 1024 << " KB ";
+    if (p.size() > 1000 * 1000)
+      std::cout << p.size() / (1000 * 1000) << " MB ";
+    else if (p.size() > 1000)
+      std::cout << p.size() / 1000 << " KB ";
     else
       std::cout << p.size() << " B ";
     std::cout << std::endl;

benchmark/get_corpus_benchmark.cpp

Lines changed: 4 additions & 3 deletions

@@ -4,6 +4,7 @@
 #include <cstring>
 #include <iostream>

+// Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
 never_inline
 double bench(std::string filename, simdjson::padded_string& p) {
   std::chrono::time_point<std::chrono::steady_clock> start_clock =
@@ -12,7 +13,7 @@ double bench(std::string filename, simdjson::padded_string& p) {
   std::chrono::time_point<std::chrono::steady_clock> end_clock =
       std::chrono::steady_clock::now();
   std::chrono::duration<double> elapsed = end_clock - start_clock;
-  return (static_cast<double>(p.size()) / (1024. * 1024. * 1024.)) / elapsed.count();
+  return (static_cast<double>(p.size()) / (1000000000.)) / elapsed.count();
 }

 int main(int argc, char *argv[]) {
@@ -32,8 +33,8 @@ int main(int argc, char *argv[]) {
   double meanval = 0;
   double maxval = 0;
   double minval = 10000;
-  std::cout << "file size: "<< (static_cast<double>(p.size()) / (1024. * 1024. * 1024.)) << " GB" <<std::endl;
-  size_t times = p.size() > 1024*1024*1024 ? 5 : 50;
+  std::cout << "file size: "<< (static_cast<double>(p.size()) / (1000000000.)) << " GB" <<std::endl;
+  size_t times = p.size() > 1000000000 ? 5 : 50;
 #if __cpp_exceptions
   try {
 #endif

benchmark/minifiercompetition.cpp

Lines changed: 5 additions & 4 deletions

@@ -72,12 +72,13 @@ int main(int argc, char *argv[]) {
     std::cerr << "Could not load the file " << filename << std::endl;
     return EXIT_FAILURE;
   }
+  // Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
   if (verbose) {
     std::cout << "Input has ";
-    if (p.size() > 1024 * 1024)
-      std::cout << p.size() / (1024 * 1024) << " MB ";
-    else if (p.size() > 1024)
-      std::cout << p.size() / 1024 << " KB ";
+    if (p.size() > 1000 * 1000)
+      std::cout << p.size() / (1000 * 1000) << " MB ";
+    else if (p.size() > 1000)
+      std::cout << p.size() / 1000 << " KB ";
     else
       std::cout << p.size() << " B ";
     std::cout << std::endl;

benchmark/parseandstatcompetition.cpp

Lines changed: 5 additions & 5 deletions

@@ -302,13 +302,13 @@ int main(int argc, char *argv[]) {
     std::cerr << "Could not load the file " << filename << std::endl;
     return EXIT_FAILURE;
   }
-
+  // Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
   if (verbose) {
     std::cout << "Input has ";
-    if (p.size() > 1024 * 1024)
-      std::cout << p.size() / (1024 * 1024) << " MB ";
-    else if (p.size() > 1024)
-      std::cout << p.size() / 1024 << " KB ";
+    if (p.size() > 1000 * 1000)
+      std::cout << p.size() / (1000 * 1000) << " MB ";
+    else if (p.size() > 1000)
+      std::cout << p.size() / 1000 << " KB ";
     else
       std::cout << p.size() << " B ";
     std::cout << std::endl;

benchmark/parsingcompetition.cpp

Lines changed: 5 additions & 5 deletions

@@ -86,13 +86,13 @@ bool bench(const char *filename, bool verbose, bool just_data, double repeat_mul

   int repeat = static_cast<int>((50000000 * repeat_multiplier) / static_cast<double>(p.size()));
   if (repeat < 10) { repeat = 10; }
-
+  // Gigabyte: https://en.wikipedia.org/wiki/Gigabyte
   if (verbose) {
     std::cout << "Input " << filename << " has ";
-    if (p.size() > 1024 * 1024)
-      std::cout << p.size() / (1024 * 1024) << " MB";
-    else if (p.size() > 1024)
-      std::cout << p.size() / 1024 << " KB";
+    if (p.size() > 1000 * 1000)
+      std::cout << p.size() / (1000 * 1000) << " MB";
+    else if (p.size() > 1000)
+      std::cout << p.size() / 1000 << " KB";
     else
       std::cout << p.size() << " B";
     std::cout << ": will run " << repeat << " iterations." << std::endl;

doc/performance.md

Lines changed: 14 additions & 2 deletions

@@ -9,6 +9,8 @@ are still some scenarios where tuning can enhance performance.
 * [Server Loops: Long-Running Processes and Memory Capacity](#server-loops-long-running-processes-and-memory-capacity)
 * [Large files and huge page support](#large-files-and-huge-page-support)
 * [Computed GOTOs](#computed-gotos)
+* [Number parsing](#number-parsing)
+* [Visual Studio](#visual-studio)

 Reusing the parser for maximum efficiency
 -----------------------------------------
@@ -61,7 +63,7 @@ without bound:
 * You can set a *max capacity* when constructing a parser:

 ```c++
-dom::parser parser(1024*1024); // Never grow past documents > 1MB
+dom::parser parser(1000*1000); // Never grow past documents > 1MB
 for (web_request request : listen()) {
   auto [doc, error] = parser.parse(request.body);
   // If the document was above our limit, emit 413 = payload too large
@@ -77,7 +79,7 @@ without bound:

 ```c++
 dom::parser parser(0); // This parser will refuse to automatically grow capacity
-simdjson::error_code allocate_error = parser.allocate(1024*1024); // This allocates enough capacity to handle documents <= 1MB
+simdjson::error_code allocate_error = parser.allocate(1000*1000); // This allocates enough capacity to handle documents <= 1MB
 if (allocate_error) { cerr << allocate_error << endl; exit(1); }

 for (web_request request : listen()) {
@@ -140,3 +142,13 @@ few hundred megabytes per second if your JSON documents are densely packed with
 - When possible, you should favor integer values written without a decimal point, as it is simpler and faster to parse decimal integer values.
 - When serializing numbers, you should not use more digits than necessary: 17 digits is all that is needed to exactly represent double-precision floating-point numbers. Using many more digits than necessary will make your files larger and slower to parse.
 - When benchmarking parsing speeds, always report whether your JSON documents are made mostly of floating-point numbers when it is the case, since number parsing can then dominate the parsing time.
+
+Visual Studio
+-------------
+
+On Intel and AMD Windows platforms, Microsoft Visual Studio enables programmers to build either 32-bit (x86) or 64-bit (x64) binaries. We urge you to always use 64-bit mode. Visual Studio 2019 should default to 64-bit builds when you have a 64-bit version of Windows, which we recommend.
+
+We do not recommend that you compile simdjson with architecture-specific flags such as `arch:AVX2`. The simdjson library automatically selects the best execution kernel at runtime.
+
+Recent versions of Microsoft Visual Studio on Windows provide support for the LLVM Clang compiler. You only need to install the "Clang compiler" optional component. You may also get a copy of the 64-bit LLVM Clang compiler for [Windows directly from LLVM](https://releases.llvm.org/download.html). The simdjson library fully supports the LLVM Clang compiler under Windows. In fact, you may get better performance out of simdjson with the LLVM Clang compiler than with the regular Visual Studio compiler.
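The max-capacity pattern in the performance.md diff reduces to a simple size guard before parsing. A sketch of that guard without the simdjson types (`MAX_CAPACITY` and `handle_request` are illustrative names, not library API):

```cpp
#include <cstddef>
#include <string>

constexpr std::size_t MAX_CAPACITY = 1000 * 1000; // 1 MB, decimal units

// Returns an HTTP-style status: 413 if the body exceeds the parser's
// fixed capacity, 200 otherwise (the actual parse is elided).
int handle_request(const std::string& body) {
  if (body.size() > MAX_CAPACITY) return 413; // payload too large
  // ... parse with a fixed-capacity parser here ...
  return 200;
}
```

Capping capacity this way keeps a long-running server's memory bounded regardless of what clients send.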

tests/allparserscheckfile.cpp

Lines changed: 4 additions & 4 deletions

@@ -70,10 +70,10 @@ int main(int argc, char *argv[]) {
   }
   if (verbose) {
     std::cout << "Input has ";
-    if (p.size() > 1024 * 1024)
-      std::cout << p.size() / (1024 * 1024) << " MB ";
-    else if (p.size() > 1024)
-      std::cout << p.size() / 1024 << " KB ";
+    if (p.size() > 1000 * 1000)
+      std::cout << p.size() / (1000 * 1000) << " MB ";
+    else if (p.size() > 1000)
+      std::cout << p.size() / 1000 << " KB ";
     else
       std::cout << p.size() << " B ";
     std::cout << std::endl;

tests/readme_examples.cpp

Lines changed: 2 additions & 2 deletions

@@ -214,7 +214,7 @@ void performance_1() {
 SIMDJSON_PUSH_DISABLE_ALL_WARNINGS
 // The web_request part of this is aspirational, so we compile as much as we can here
 void performance_2() {
-  dom::parser parser(1024*1024); // Never grow past documents > 1MB
+  dom::parser parser(1000*1000); // Never grow past documents > 1MB
   // for (web_request request : listen()) {
   auto [doc, error] = parser.parse("1"_padded/*request.body*/);
   // // If the document was above our limit, emit 413 = payload too large
@@ -226,7 +226,7 @@ void performance_2() {
 // The web_request part of this is aspirational, so we compile as much as we can here
 void performance_3() {
   dom::parser parser(0); // This parser will refuse to automatically grow capacity
-  simdjson::error_code allocate_error = parser.allocate(1024*1024); // This allocates enough capacity to handle documents <= 1MB
+  simdjson::error_code allocate_error = parser.allocate(1000*1000); // This allocates enough capacity to handle documents <= 1MB
   if (allocate_error) { cerr << allocate_error << endl; exit(1); }

   // for (web_request request : listen()) {
