Description
tl;dr: std::string is (supposed to be) tailored to handling characters and implements a small-buffer optimization, which should make appending to it somewhat cheaper even when the resulting string ends up fairly large. Yet appending to std::string is significantly slower than inserting at the end of std::vector<char>.
Is there a particular reason for that?
Benchmark results
I ran the benchmarks on Ubuntu 24.04.1 LTS under WSL on an i5-13600K CPU, using the latest Clang, 20.1.
I used the option --benchmark_min_warmup_time=0.1
to make the results more stable, so the benchmark was launched like this:
./bench --benchmark_min_warmup_time=0.1
The benchmarks look like this:
void vector_push_back(benchmark::State& state) {
    for (auto $ : state) {
        std::vector<char> v;
        //v.reserve(Count);
        for (size_t i = 0; i != Count; ++i) {
            v.push_back('0');
            benchmark::DoNotOptimize(v);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
I.e., a chunk of characters is appended to a container Count = 1000
times; in this example the chunk is a single character.
I compared the rate as reported by the benchmarks, which is the number of bytes appended divided by CPU time.
Here is the complete code. (Click/tap to expand.)
#include <vector>
#include <string>
#include <string_view>
#include <cmath>
#include <benchmark/benchmark.h>
using namespace std::string_view_literals;
const auto data =
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"0123"sv;
const size_t Count = 1000;
void dummy(benchmark::State& state) {
    for (auto $ : state) {
        size_t s = 0;
        for (size_t i = 0; i != Count; ++i) {
            ++s;
            benchmark::DoNotOptimize(s);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void vector_push_back(benchmark::State& state) {
    for (auto $ : state) {
        std::vector<char> v;
        //v.reserve(Count);
        for (size_t i = 0; i != Count; ++i) {
            v.push_back('0');
            benchmark::DoNotOptimize(v);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void vector_insert(benchmark::State& state) {
    const auto d = data.substr(0, state.range());
    for (auto $ : state) {
        std::vector<char> v;
        //v.reserve(d.size() * Count);
        for (size_t i = 0; i != Count; ++i) {
            v.insert(v.end(), d.begin(), d.end());
            benchmark::DoNotOptimize(v);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(state.range() * Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void string_push_back(benchmark::State& state) {
    for (auto $ : state) {
        std::string s;
        //s.reserve(Count);
        for (size_t i = 0; i != Count; ++i) {
            s.push_back('0');
            benchmark::DoNotOptimize(s);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
void string_append(benchmark::State& state) {
    const auto d = data.substr(0, state.range());
    for (auto $ : state) {
        std::string s;
        //s.reserve(d.size() * Count);
        for (size_t i = 0; i != Count; ++i) {
            s.append(d);
            benchmark::DoNotOptimize(s);
        }
    }
    state.counters["rate"] = benchmark::Counter{ static_cast<double>(state.range() * Count),
        benchmark::Counter::kIsIterationInvariantRate };
}
constexpr size_t RangeMultiplierSquared = 2;
constexpr size_t DataSize = 512;
template<typename B>
void Args(B b);
BENCHMARK(dummy);
BENCHMARK(vector_push_back);
BENCHMARK(vector_insert)->Apply(Args)->Arg(DataSize * 2);
BENCHMARK(string_push_back);
BENCHMARK(string_append)->Apply(Args)->Arg(DataSize * 2);
int64_t roundedBefore(int64_t i) {
    return static_cast<int64_t>(std::round(i * std::pow(RangeMultiplierSquared, -0.5)));
}
template<typename B>
void Args(B b) {
    b->Args({ 1 });
    int i = RangeMultiplierSquared;
    if (const auto k = roundedBefore(i); k != 1)
        b->Args({ k });
    b->Args({ i });
    while ((i *= RangeMultiplierSquared) < DataSize) {
        b->Args({ roundedBefore(i) });
        b->Args({ i });
    }
    if (const auto k = roundedBefore(i); k < DataSize)
        b->Args({ k });
    b->Args({ DataSize });
}
BENCHMARK_MAIN();
Here dummy
is just a control to gauge problems.
The Args
function template sets up the sizes so that there is an additional data point roughly halfway between consecutive powers of two on a logarithmic scale.
For some reason the last run sometimes gives incorrect measurements, so I added sacrificial data points at the end (DataSize * 2
) to mitigate that.
Results
Full benchmark output. (Click/tap to expand.)
2025-08-26T14:12:56+03:00
Running ./bench
Run on (20 X 3494.4 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 2048 KiB (x10)
L3 Unified 24576 KiB (x1)
Load Average: 0.00, 0.03, 0.08
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
dummy 227 ns 209 ns 3356710 rate=4.77757G/s
vector_push_back 530 ns 489 ns 1420428 rate=2.0451G/s
vector_insert/1 1246 ns 1151 ns 608950 rate=869.157M/s
vector_insert/2 1485 ns 1371 ns 510698 rate=1.45902G/s
vector_insert/3 1637 ns 1511 ns 464718 rate=1.98556G/s
vector_insert/4 1907 ns 1760 ns 399303 rate=2.2721G/s
vector_insert/6 2219 ns 2049 ns 340955 rate=2.9287G/s
vector_insert/8 1903 ns 1757 ns 397994 rate=4.55447G/s
vector_insert/11 2451 ns 2263 ns 309396 rate=4.86143G/s
vector_insert/16 2710 ns 2501 ns 279629 rate=6.39634G/s
vector_insert/23 4064 ns 3751 ns 186611 rate=6.13099G/s
vector_insert/32 1851 ns 1708 ns 409604 rate=18.7305G/s
vector_insert/45 4019 ns 3710 ns 188916 rate=12.1305G/s
vector_insert/64 3436 ns 3172 ns 220936 rate=20.1796G/s
vector_insert/91 23445 ns 21476 ns 33860 rate=4.2373G/s
vector_insert/128 36662 ns 33703 ns 20895 rate=3.7979G/s
vector_insert/181 61004 ns 55933 ns 12640 rate=3.23601G/s
vector_insert/256 90194 ns 83002 ns 8457 rate=3.08425G/s
vector_insert/362 137841 ns 127006 ns 5485 rate=2.85025G/s
vector_insert/512 203347 ns 187528 ns 3777 rate=2.73027G/s
vector_insert/1024 443245 ns 409018 ns 1713 rate=2.50356G/s
string_push_back 1645 ns 1519 ns 459028 rate=658.405M/s
string_append/1 4398 ns 4060 ns 171797 rate=246.306M/s
string_append/2 4295 ns 3965 ns 175805 rate=504.405M/s
string_append/3 4357 ns 4022 ns 175788 rate=745.915M/s
string_append/4 4348 ns 4013 ns 174600 rate=996.641M/s
string_append/6 4286 ns 3956 ns 176163 rate=1.51661G/s
string_append/8 4386 ns 4048 ns 173727 rate=1.97618G/s
string_append/11 4377 ns 4040 ns 172980 rate=2.72269G/s
string_append/16 4393 ns 4056 ns 171781 rate=3.94523G/s
string_append/23 4619 ns 4264 ns 165310 rate=5.39399G/s
string_append/32 4567 ns 4215 ns 165359 rate=7.59122G/s
string_append/45 4509 ns 4162 ns 167575 rate=10.8112G/s
string_append/64 5094 ns 4702 ns 150211 rate=13.6121G/s
string_append/91 6280 ns 5797 ns 120507 rate=15.6978G/s
string_append/128 7148 ns 6598 ns 105000 rate=19.3984G/s
string_append/181 10105 ns 9328 ns 75234 rate=19.4047G/s
string_append/256 11624 ns 10730 ns 65008 rate=23.8591G/s
string_append/362 15333 ns 14153 ns 49536 rate=25.5773G/s
string_append/512 19747 ns 18228 ns 38433 rate=28.0885G/s
string_append/1024 449037 ns 414404 ns 1705 rate=2.47102G/s
On the right, some crazy things happen to std::vector<char>
which I cannot fully explain (probably the data set exceeding the L3 cache size plays a big role), so instead let's focus on "the good part" on the left.
Here std::string
is 68% slower than std::vector<char>
for push_back()
;
and around 40% to 70% slower for appending characters in chunks, depending on the chunk size (using vector insert()
and string append()
).
One might think that the different allocation patterns of std::string
and std::vector
play a role, and indeed they do, so instead of investigating that I eliminated it altogether by reserving the necessary storage up front (uncommenting the reserve()
calls).
Results with reserve()
Full benchmark output. (Click/tap to expand.)
2025-08-26T14:19:29+03:00
Running ./bench
Run on (20 X 3494.4 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x10)
L1 Instruction 32 KiB (x10)
L2 Unified 2048 KiB (x10)
L3 Unified 24576 KiB (x1)
Load Average: 0.40, 0.20, 0.13
-----------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------
dummy 226 ns 208 ns 3361395 rate=4.79691G/s
vector_push_back 439 ns 405 ns 1726001 rate=2.46609G/s
vector_insert/1 997 ns 920 ns 764443 rate=1.0865G/s
vector_insert/2 1230 ns 1135 ns 610784 rate=1.76206G/s
vector_insert/3 1440 ns 1330 ns 524039 rate=2.25626G/s
vector_insert/4 1572 ns 1451 ns 487840 rate=2.75743G/s
vector_insert/6 2006 ns 1852 ns 374242 rate=3.24026G/s
vector_insert/8 1560 ns 1440 ns 481767 rate=5.55526G/s
vector_insert/11 2199 ns 2030 ns 345151 rate=5.41978G/s
vector_insert/16 2424 ns 2238 ns 313185 rate=7.15004G/s
vector_insert/23 3759 ns 3470 ns 200937 rate=6.62851G/s
vector_insert/32 1114 ns 1028 ns 678571 rate=31.1138G/s
vector_insert/45 3111 ns 2872 ns 244499 rate=15.6689G/s
vector_insert/64 2012 ns 1857 ns 379973 rate=34.4563G/s
vector_insert/91 5697 ns 5259 ns 133134 rate=17.3044G/s
vector_insert/128 3046 ns 2812 ns 249728 rate=45.5218G/s
vector_insert/181 6762 ns 6241 ns 111984 rate=28.9997G/s
vector_insert/256 5731 ns 5290 ns 117313 rate=48.3906G/s
vector_insert/362 9787 ns 9034 ns 83134 rate=40.0707G/s
vector_insert/512 7479 ns 6904 ns 101189 rate=74.1609G/s
vector_insert/1024 13777 ns 12717 ns 55029 rate=80.5231G/s
string_push_back 1453 ns 1342 ns 524367 rate=745.401M/s
string_append/1 4704 ns 4343 ns 160682 rate=230.276M/s
string_append/2 4347 ns 4013 ns 173773 rate=498.401M/s
string_append/3 4346 ns 4011 ns 174178 rate=747.867M/s
string_append/4 4421 ns 4081 ns 172674 rate=980.083M/s
string_append/6 4434 ns 4093 ns 166934 rate=1.46594G/s
string_append/8 4400 ns 4062 ns 173671 rate=1.96953G/s
string_append/11 4420 ns 4080 ns 174113 rate=2.6962G/s
string_append/16 4499 ns 4153 ns 169456 rate=3.85279G/s
string_append/23 4542 ns 4193 ns 167624 rate=5.48576G/s
string_append/32 3922 ns 3620 ns 191241 rate=8.83886G/s
string_append/45 3935 ns 3633 ns 192302 rate=12.3881G/s
string_append/64 3948 ns 3644 ns 193690 rate=17.5631G/s
string_append/91 4459 ns 4116 ns 170266 rate=22.1066G/s
string_append/128 4470 ns 4126 ns 169312 rate=31.0209G/s
string_append/181 6626 ns 6116 ns 119231 rate=29.5937G/s
string_append/256 5613 ns 5181 ns 134655 rate=49.4099G/s
string_append/362 8021 ns 7404 ns 95064 rate=48.8942G/s
string_append/512 9513 ns 8781 ns 79907 rate=58.3047G/s
string_append/1024 14779 ns 13642 ns 51335 rate=75.0635G/s
Here you can see that std::vector<char>
really doesn't like chunk sizes other than powers of two, but overall it is still mostly faster than std::string
(unless it gets really unlucky).
Again, let's look at "the good part", where there is less influence from weird effects.
This picture is very similar to the earlier one.
Here std::string
is 70% slower than std::vector<char>
for push_back()
;
and around 45% to 80% slower for appending characters in chunks (depending on the chunk size).
Removing the memory-allocation overhead revealed an even larger performance gap between std::string
and std::vector<char>
.
Quick-bench
Here is further evidence that appending to std::string
is significantly slower than appending to std::vector<char>
:
https://quick-bench.com/q/EKUKhSTDZ3wKF866p1ZiUgRESI8
There, appending to std::string
is from around 16% to two times slower than inserting at the end of std::vector<char>
.
(It uses Clang 17, which is not the most recent; the latest version is not available there.)
For context
I discovered this issue while researching this:
simdjson/simdjson#2408