Description
tl;dr: std::string is (supposed to be) tailored to handling characters and implements a small-buffer optimization, which should make appending to it somewhat cheaper even when the resulting string ends up fairly large. Yet appending to std::string is significantly slower than inserting at the end of std::vector<char>.
Is there a particular reason for that?
Benchmark results
I ran the benchmarks on Ubuntu 24.04.1 LTS under WSL on an i5-13600K CPU, using the latest Clang, 20.1.
I used the option --benchmark_min_warmup_time=0.1
to make the results more stable, so the benchmark was launched like this:
./bench --benchmark_min_warmup_time=0.1
The benchmarks look like this:
void vector_push_back(benchmark::State& state) {
    for (auto $ : state) {
        std::vector<char> v;
        //v.reserve(Count);
        for (size_t i = 0; i != Count; ++i) {
            v.push_back('0');
            benchmark::DoNotOptimize(v);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
I.e., a chunk of characters is appended to a container Count = 1000
times; in this example the chunk is a single character.
I compared the rate as reported by the benchmarks, which is the number of bytes appended divided by CPU time.
Here is the complete code. (Click/tap to expand.)
#include <vector>
#include <string>
#include <string_view>
#include <cmath>
#include <benchmark/benchmark.h>
using namespace std::string_view_literals;
const auto data =
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"0123"sv;
const size_t Count = 1000;
void dummy(benchmark::State& state) {
    for (auto $ : state) {
        size_t s = 0;
        for (size_t i = 0; i != Count; ++i) {
            ++s;
            benchmark::DoNotOptimize(s);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void vector_push_back(benchmark::State& state) {
    for (auto $ : state) {
        std::vector<char> v;
        //v.reserve(Count);
        for (size_t i = 0; i != Count; ++i) {
            v.push_back('0');
            benchmark::DoNotOptimize(v);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void vector_insert(benchmark::State& state) {
    const auto d = data.substr(0, state.range());
    for (auto $ : state) {
        std::vector<char> v;
        //v.reserve(d.size() * Count);
        for (size_t i = 0; i != Count; ++i) {
            v.insert(v.end(), d.begin(), d.end());
            benchmark::DoNotOptimize(v);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(state.range() * Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void string_push_back(benchmark::State& state) {
    for (auto $ : state) {
        std::string s;
        //s.reserve(Count);
        for (size_t i = 0; i != Count; ++i) {
            s.push_back('0');
            benchmark::DoNotOptimize(s);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void string_append(benchmark::State& state) {
    const auto d = data.substr(0, state.range());
    for (auto $ : state) {
        std::string s;
        //s.reserve(d.size() * Count);
        for (size_t i = 0; i != Count; ++i) {
            s.append(d);
            benchmark::DoNotOptimize(s);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(state.range() * Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
constexpr size_t RangeMultiplierSquared = 2;
constexpr size_t DataSize = 512;
template<typename B>
void Args(B b);
BENCHMARK(dummy);
BENCHMARK(vector_push_back);
BENCHMARK(vector_insert)->Apply(Args)->Arg(DataSize * 2);
BENCHMARK(string_push_back);
BENCHMARK(string_append)->Apply(Args)->Arg(DataSize * 2);
int64_t roundedBefore(int64_t i) {
    return static_cast<int64_t>(std::round(i * std::pow(RangeMultiplierSquared, -0.5)));
}
template<typename B>
void Args(B b) {
    b->Args({ 1 });
    int i = RangeMultiplierSquared;
    if (const auto k = roundedBefore(i); k != 1)
        b->Args({ k });
    b->Args({ i });
    while ((i *= RangeMultiplierSquared) < DataSize) {
        b->Args({ roundedBefore(i) });
        b->Args({ i });
    }
    if (const auto k = roundedBefore(i); k < DataSize)
        b->Args({ k });
    b->Args({ DataSize });
}
BENCHMARK_MAIN();
Here dummy
is just a control to gauge problems.
The Args
function template sets up the sizes so that there is an additional data point roughly halfway between consecutive powers of two on a logarithmic scale.
For some reason the last run sometimes gives incorrect measurements, so I added sacrificial data points at the end (DataSize * 2
) to mitigate that.
Results
Full benchmark output. (Click/tap to expand.)
2025-08-26T14:12:56+03:00
Running ./bench
Run on (20 X 3494.4 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 2048 KiB (x10)
L3 Unified 24576 KiB (x1)
Load Average: 0.00, 0.03, 0.08
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
dummy 227 ns 209 ns 3356710 rate=4.77757G/s
vector_push_back 530 ns 489 ns 1420428 rate=2.0451G/s
vector_insert/1 1246 ns 1151 ns 608950 rate=869.157M/s
vector_insert/2 1485 ns 1371 ns 510698 rate=1.45902G/s
vector_insert/3 1637 ns 1511 ns 464718 rate=1.98556G/s
vector_insert/4 1907 ns 1760 ns 399303 rate=2.2721G/s
vector_insert/6 2219 ns 2049 ns 340955 rate=2.9287G/s
vector_insert/8 1903 ns 1757 ns 397994 rate=4.55447G/s
vector_insert/11 2451 ns 2263 ns 309396 rate=4.86143G/s
vector_insert/16 2710 ns 2501 ns 279629 rate=6.39634G/s
vector_insert/23 4064 ns 3751 ns 186611 rate=6.13099G/s
vector_insert/32 1851 ns 1708 ns 409604 rate=18.7305G/s
vector_insert/45 4019 ns 3710 ns 188916 rate=12.1305G/s
vector_insert/64 3436 ns 3172 ns 220936 rate=20.1796G/s
vector_insert/91 23445 ns 21476 ns 33860 rate=4.2373G/s
vector_insert/128 36662 ns 33703 ns 20895 rate=3.7979G/s
vector_insert/181 61004 ns 55933 ns 12640 rate=3.23601G/s
vector_insert/256 90194 ns 83002 ns 8457 rate=3.08425G/s
vector_insert/362 137841 ns 127006 ns 5485 rate=2.85025G/s
vector_insert/512 203347 ns 187528 ns 3777 rate=2.73027G/s
vector_insert/1024 443245 ns 409018 ns 1713 rate=2.50356G/s
string_push_back 1645 ns 1519 ns 459028 rate=658.405M/s
string_append/1 4398 ns 4060 ns 171797 rate=246.306M/s
string_append/2 4295 ns 3965 ns 175805 rate=504.405M/s
string_append/3 4357 ns 4022 ns 175788 rate=745.915M/s
string_append/4 4348 ns 4013 ns 174600 rate=996.641M/s
string_append/6 4286 ns 3956 ns 176163 rate=1.51661G/s
string_append/8 4386 ns 4048 ns 173727 rate=1.97618G/s
string_append/11 4377 ns 4040 ns 172980 rate=2.72269G/s
string_append/16 4393 ns 4056 ns 171781 rate=3.94523G/s
string_append/23 4619 ns 4264 ns 165310 rate=5.39399G/s
string_append/32 4567 ns 4215 ns 165359 rate=7.59122G/s
string_append/45 4509 ns 4162 ns 167575 rate=10.8112G/s
string_append/64 5094 ns 4702 ns 150211 rate=13.6121G/s
string_append/91 6280 ns 5797 ns 120507 rate=15.6978G/s
string_append/128 7148 ns 6598 ns 105000 rate=19.3984G/s
string_append/181 10105 ns 9328 ns 75234 rate=19.4047G/s
string_append/256 11624 ns 10730 ns 65008 rate=23.8591G/s
string_append/362 15333 ns 14153 ns 49536 rate=25.5773G/s
string_append/512 19747 ns 18228 ns 38433 rate=28.0885G/s
string_append/1024 449037 ns 414404 ns 1705 rate=2.47102G/s
On the right, some crazy things happen to std::vector<char>
which I cannot fully explain (probably the data set exceeding the L3 cache size plays a big role), so instead let's focus on "the good part" on the left.
Here std::string
is 68% slower than std::vector<char>
for push_back()
;
and around 40% to 70% slower for appending characters in chunks, depending on the chunk size (using vector insert()
and string append()
).
One might think that the different allocation patterns of std::string
and std::vector
play a role, and indeed they do, so instead of investigating that I eliminated it altogether by reserving the necessary storage up front (uncommenting the reserve()
calls).
Results with reserve()
Full benchmark output. (Click/tap to expand.)
2025-08-26T14:19:29+03:00
Running ./bench
Run on (20 X 3494.4 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 2048 KiB (x10)
L3 Unified 24576 KiB (x1)
Load Average: 0.40, 0.20, 0.13
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
dummy 226 ns 208 ns 3361395 rate=4.79691G/s
vector_push_back 439 ns 405 ns 1726001 rate=2.46609G/s
vector_insert/1 997 ns 920 ns 764443 rate=1.0865G/s
vector_insert/2 1230 ns 1135 ns 610784 rate=1.76206G/s
vector_insert/3 1440 ns 1330 ns 524039 rate=2.25626G/s
vector_insert/4 1572 ns 1451 ns 487840 rate=2.75743G/s
vector_insert/6 2006 ns 1852 ns 374242 rate=3.24026G/s
vector_insert/8 1560 ns 1440 ns 481767 rate=5.55526G/s
vector_insert/11 2199 ns 2030 ns 345151 rate=5.41978G/s
vector_insert/16 2424 ns 2238 ns 313185 rate=7.15004G/s
vector_insert/23 3759 ns 3470 ns 200937 rate=6.62851G/s
vector_insert/32 1114 ns 1028 ns 678571 rate=31.1138G/s
vector_insert/45 3111 ns 2872 ns 244499 rate=15.6689G/s
vector_insert/64 2012 ns 1857 ns 379973 rate=34.4563G/s
vector_insert/91 5697 ns 5259 ns 133134 rate=17.3044G/s
vector_insert/128 3046 ns 2812 ns 249728 rate=45.5218G/s
vector_insert/181 6762 ns 6241 ns 111984 rate=28.9997G/s
vector_insert/256 5731 ns 5290 ns 117313 rate=48.3906G/s
vector_insert/362 9787 ns 9034 ns 83134 rate=40.0707G/s
vector_insert/512 7479 ns 6904 ns 101189 rate=74.1609G/s
vector_insert/1024 13777 ns 12717 ns 55029 rate=80.5231G/s
string_push_back 1453 ns 1342 ns 524367 rate=745.401M/s
string_append/1 4704 ns 4343 ns 160682 rate=230.276M/s
string_append/2 4347 ns 4013 ns 173773 rate=498.401M/s
string_append/3 4346 ns 4011 ns 174178 rate=747.867M/s
string_append/4 4421 ns 4081 ns 172674 rate=980.083M/s
string_append/6 4434 ns 4093 ns 166934 rate=1.46594G/s
string_append/8 4400 ns 4062 ns 173671 rate=1.96953G/s
string_append/11 4420 ns 4080 ns 174113 rate=2.6962G/s
string_append/16 4499 ns 4153 ns 169456 rate=3.85279G/s
string_append/23 4542 ns 4193 ns 167624 rate=5.48576G/s
string_append/32 3922 ns 3620 ns 191241 rate=8.83886G/s
string_append/45 3935 ns 3633 ns 192302 rate=12.3881G/s
string_append/64 3948 ns 3644 ns 193690 rate=17.5631G/s
string_append/91 4459 ns 4116 ns 170266 rate=22.1066G/s
string_append/128 4470 ns 4126 ns 169312 rate=31.0209G/s
string_append/181 6626 ns 6116 ns 119231 rate=29.5937G/s
string_append/256 5613 ns 5181 ns 134655 rate=49.4099G/s
string_append/362 8021 ns 7404 ns 95064 rate=48.8942G/s
string_append/512 9513 ns 8781 ns 79907 rate=58.3047G/s
string_append/1024 14779 ns 13642 ns 51335 rate=75.0635G/s
Here you can see that std::vector<char>
really doesn't like chunk sizes other than powers of two, but overall it is still mostly faster than std::string
(unless it gets really unlucky).
Again, let's look at "the good part", where there is less influence from weird effects.
This picture is very similar to the earlier one.
Here std::string
is 70% slower than std::vector<char>
for push_back()
;
and around 45% to 80% slower for appending characters in chunks (depending on the chunk size).
Removing the memory-allocation overhead revealed an even larger performance gap between std::string
and std::vector<char>
.
Quick-bench
Here is further evidence that appending to std::string
is significantly slower than appending to std::vector<char>
:
https://quick-bench.com/q/EKUKhSTDZ3wKF866p1ZiUgRESI8
There, appending to std::string
is from around 16% to two times slower than inserting at the end of std::vector<char>
.
(It uses Clang 17, which is not the most recent; the latest version is not available there.)
For context
I discovered this issue while researching this:
simdjson/simdjson#2408