Skip to content

[libc++][perf] Is there a reason appending to std::string is much slower than to std::vector<char>? #155410

@toughengineer

Description

@toughengineer

tl;dr:

Considering std::string is (supposed to be) tailored to handle characters, and implements small buffer optimization which should make appending to it a bit cheaper even for considerable eventual size of the resulting string, appending to std::string is significantly slower than inserting at the end into std::vector<char>.
Is there a particular reason for that?

Benchmark results

I ran benchmarks inside Ubuntu 24.04.1 LTS inside WSL on i5 13600k CPU using latest Clang 20.1.

I used option --benchmark_min_warmup_time=0.1 to make results more stable so the benchmark was launched like this:

./bench --benchmark_min_warmup_time=0.1

The benchmarks look like this:

void vector_push_back(benchmark::State& state) {
  for (auto $ : state) {
    std::vector<char> v;
    //v.reserve(Count);
    for (size_t i = 0; i != Count; ++i) {
      v.push_back('0');
      benchmark::DoNotOptimize(v);
    }
  }
  state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
    benchmark::Counter::kIsIterationInvariantRate };
}

I.e. a chunk of characters is appended to a container Count = 1000 times, in this example the chunk is just one character.

I compared rate as reported by the benchmarks which is amount of bytes appended divided by CPU time.

Here is the complete code. (Click/tap to expand.)
#include <vector>
#include <string>
#include <string_view>
#include <cmath>

#include <benchmark/benchmark.h>

using namespace std::string_view_literals;

const auto data =
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"012345678901234567890123456789012345678901234567890123456789"
"0123"sv;

const size_t Count = 1000;

void dummy(benchmark::State& state) {
  for (auto $ : state) {
    size_t s = 0;
    for (size_t i = 0; i != Count; ++i) {
      ++s;
      benchmark::DoNotOptimize(s);
    }
  }
  state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
    benchmark::Counter::kIsIterationInvariantRate };
}

void vector_push_back(benchmark::State& state) {
  for (auto $ : state) {
    std::vector<char> v;
    //v.reserve(Count);
    for (size_t i = 0; i != Count; ++i) {
      v.push_back('0');
      benchmark::DoNotOptimize(v);
    }
  }
  state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
    benchmark::Counter::kIsIterationInvariantRate };
}

void vector_insert(benchmark::State & state) {
  const auto d = data.substr(0, state.range());
  for (auto $ : state) {
    std::vector<char> v;
    //v.reserve(d.size() * Count);
    for (size_t i = 0; i != Count; ++i) {
      v.insert(v.end(), d.begin(), d.end());
      benchmark::DoNotOptimize(v);
    }
  }
  state.counters["rate"] = benchmark::Counter{ static_cast<double>(state.range() * Count),
    benchmark::Counter::kIsIterationInvariantRate };
}

void string_push_back(benchmark::State& state) {
  for (auto $ : state) {
    std::string s;
    //s.reserve(Count);
    for (size_t i = 0; i != Count; ++i) {
      s.push_back('0');
      benchmark::DoNotOptimize(s);
    }
  }
  state.counters["rate"] = benchmark::Counter{ static_cast<double>(Count),
    benchmark::Counter::kIsIterationInvariantRate };
}

void string_append(benchmark::State & state) {
  const auto d = data.substr(0, state.range());
  for (auto $ : state) {
    std::string s;
    //s.reserve(d.size() * Count);
    for (size_t i = 0; i != Count; ++i) {
      s.append(d);
      benchmark::DoNotOptimize(s);
    }
  }
  state.counters["rate"] = benchmark::Counter{ static_cast<double>(state.range() * Count),
    benchmark::Counter::kIsIterationInvariantRate };
}

constexpr size_t RangeMultiplierSquared = 2;
constexpr size_t DataSize = 512;

template<typename B>
void Args(B b);

BENCHMARK(dummy);
BENCHMARK(vector_push_back);
BENCHMARK(vector_insert)->Apply(Args)->Arg(DataSize * 2);
BENCHMARK(string_push_back);
BENCHMARK(string_append)->Apply(Args)->Arg(DataSize * 2);

int64_t roundedBefore(int64_t i) {
  return static_cast<int64_t>(round(i * pow(RangeMultiplierSquared, -0.5)));
}
template<typename B>
void Args(B b) {
  b->Args({ 1 });
  int i = RangeMultiplierSquared;
  if (const auto k = roundedBefore(i); k != 1)
    b->Args({ k });
  b->Args({ i });
  while ((i *= RangeMultiplierSquared) < DataSize) {
    b->Args({ roundedBefore(i) });
    b->Args({ i });
  }
  if (const auto k = roundedBefore(i); k < DataSize)
    b->Args({ k });
  b->Args({ DataSize });
};

Here dummy is just a control to gauge problems.

Args function template sets up sizes so that there is an additional data point roughly between powers of two on a logarithmic scale.

For some reason the last run gives incorrect measurements sometimes, I added sacrificial data points at the end to mitigate that (DataSize * 2).

Results

Image

Full benchmark output. (Click/tap to expand.)
2025-08-26T14:12:56+03:00
Running ./bench
Run on (20 X 3494.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 2048 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 0.00, 0.03, 0.08
-----------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
dummy                     227 ns          209 ns      3356710 rate=4.77757G/s
vector_push_back          530 ns          489 ns      1420428 rate=2.0451G/s
vector_insert/1          1246 ns         1151 ns       608950 rate=869.157M/s
vector_insert/2          1485 ns         1371 ns       510698 rate=1.45902G/s
vector_insert/3          1637 ns         1511 ns       464718 rate=1.98556G/s
vector_insert/4          1907 ns         1760 ns       399303 rate=2.2721G/s
vector_insert/6          2219 ns         2049 ns       340955 rate=2.9287G/s
vector_insert/8          1903 ns         1757 ns       397994 rate=4.55447G/s
vector_insert/11         2451 ns         2263 ns       309396 rate=4.86143G/s
vector_insert/16         2710 ns         2501 ns       279629 rate=6.39634G/s
vector_insert/23         4064 ns         3751 ns       186611 rate=6.13099G/s
vector_insert/32         1851 ns         1708 ns       409604 rate=18.7305G/s
vector_insert/45         4019 ns         3710 ns       188916 rate=12.1305G/s
vector_insert/64         3436 ns         3172 ns       220936 rate=20.1796G/s
vector_insert/91        23445 ns        21476 ns        33860 rate=4.2373G/s
vector_insert/128       36662 ns        33703 ns        20895 rate=3.7979G/s
vector_insert/181       61004 ns        55933 ns        12640 rate=3.23601G/s
vector_insert/256       90194 ns        83002 ns         8457 rate=3.08425G/s
vector_insert/362      137841 ns       127006 ns         5485 rate=2.85025G/s
vector_insert/512      203347 ns       187528 ns         3777 rate=2.73027G/s
vector_insert/1024     443245 ns       409018 ns         1713 rate=2.50356G/s
string_push_back         1645 ns         1519 ns       459028 rate=658.405M/s
string_append/1          4398 ns         4060 ns       171797 rate=246.306M/s
string_append/2          4295 ns         3965 ns       175805 rate=504.405M/s
string_append/3          4357 ns         4022 ns       175788 rate=745.915M/s
string_append/4          4348 ns         4013 ns       174600 rate=996.641M/s
string_append/6          4286 ns         3956 ns       176163 rate=1.51661G/s
string_append/8          4386 ns         4048 ns       173727 rate=1.97618G/s
string_append/11         4377 ns         4040 ns       172980 rate=2.72269G/s
string_append/16         4393 ns         4056 ns       171781 rate=3.94523G/s
string_append/23         4619 ns         4264 ns       165310 rate=5.39399G/s
string_append/32         4567 ns         4215 ns       165359 rate=7.59122G/s
string_append/45         4509 ns         4162 ns       167575 rate=10.8112G/s
string_append/64         5094 ns         4702 ns       150211 rate=13.6121G/s
string_append/91         6280 ns         5797 ns       120507 rate=15.6978G/s
string_append/128        7148 ns         6598 ns       105000 rate=19.3984G/s
string_append/181       10105 ns         9328 ns        75234 rate=19.4047G/s
string_append/256       11624 ns        10730 ns        65008 rate=23.8591G/s
string_append/362       15333 ns        14153 ns        49536 rate=25.5773G/s
string_append/512       19747 ns        18228 ns        38433 rate=28.0885G/s
string_append/1024     449037 ns       414404 ns         1705 rate=2.47102G/s

On the right some crazy things happen to std::vector<char> which I can not fully explain (probably data set exceeding L3 size plays a big role), instead let's focus on "the good part" on the left.

Image
Here std::string is 68% slower than std::vector<char> regarding push_back();
and around 40% to 70% slower regarding appending characters in chunks depending on the chunk size (using vector insert() and string append()).

One can think that maybe different allocation patterns of std::string and std::vector play a role, and indeed they do play a role, so instead of investigating it I eliminated it altogether by reserving necessary storage (uncommented reserve() calls).

Results with reserve()

Image

Full benchmark output. (Click/tap to expand.)
2025-08-26T14:19:29+03:00
Running ./bench
Run on (20 X 3494.4 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 2048 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 0.40, 0.20, 0.13
-----------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------
dummy                     226 ns          208 ns      3361395 rate=4.79691G/s
vector_push_back          439 ns          405 ns      1726001 rate=2.46609G/s
vector_insert/1           997 ns          920 ns       764443 rate=1.0865G/s
vector_insert/2          1230 ns         1135 ns       610784 rate=1.76206G/s
vector_insert/3          1440 ns         1330 ns       524039 rate=2.25626G/s
vector_insert/4          1572 ns         1451 ns       487840 rate=2.75743G/s
vector_insert/6          2006 ns         1852 ns       374242 rate=3.24026G/s
vector_insert/8          1560 ns         1440 ns       481767 rate=5.55526G/s
vector_insert/11         2199 ns         2030 ns       345151 rate=5.41978G/s
vector_insert/16         2424 ns         2238 ns       313185 rate=7.15004G/s
vector_insert/23         3759 ns         3470 ns       200937 rate=6.62851G/s
vector_insert/32         1114 ns         1028 ns       678571 rate=31.1138G/s
vector_insert/45         3111 ns         2872 ns       244499 rate=15.6689G/s
vector_insert/64         2012 ns         1857 ns       379973 rate=34.4563G/s
vector_insert/91         5697 ns         5259 ns       133134 rate=17.3044G/s
vector_insert/128        3046 ns         2812 ns       249728 rate=45.5218G/s
vector_insert/181        6762 ns         6241 ns       111984 rate=28.9997G/s
vector_insert/256        5731 ns         5290 ns       117313 rate=48.3906G/s
vector_insert/362        9787 ns         9034 ns        83134 rate=40.0707G/s
vector_insert/512        7479 ns         6904 ns       101189 rate=74.1609G/s
vector_insert/1024      13777 ns        12717 ns        55029 rate=80.5231G/s
string_push_back         1453 ns         1342 ns       524367 rate=745.401M/s
string_append/1          4704 ns         4343 ns       160682 rate=230.276M/s
string_append/2          4347 ns         4013 ns       173773 rate=498.401M/s
string_append/3          4346 ns         4011 ns       174178 rate=747.867M/s
string_append/4          4421 ns         4081 ns       172674 rate=980.083M/s
string_append/6          4434 ns         4093 ns       166934 rate=1.46594G/s
string_append/8          4400 ns         4062 ns       173671 rate=1.96953G/s
string_append/11         4420 ns         4080 ns       174113 rate=2.6962G/s
string_append/16         4499 ns         4153 ns       169456 rate=3.85279G/s
string_append/23         4542 ns         4193 ns       167624 rate=5.48576G/s
string_append/32         3922 ns         3620 ns       191241 rate=8.83886G/s
string_append/45         3935 ns         3633 ns       192302 rate=12.3881G/s
string_append/64         3948 ns         3644 ns       193690 rate=17.5631G/s
string_append/91         4459 ns         4116 ns       170266 rate=22.1066G/s
string_append/128        4470 ns         4126 ns       169312 rate=31.0209G/s
string_append/181        6626 ns         6116 ns       119231 rate=29.5937G/s
string_append/256        5613 ns         5181 ns       134655 rate=49.4099G/s
string_append/362        8021 ns         7404 ns        95064 rate=48.8942G/s
string_append/512        9513 ns         8781 ns        79907 rate=58.3047G/s
string_append/1024      14779 ns        13642 ns        51335 rate=75.0635G/s

Here you can see that std::vector<char> really doesn't like chunk sizes different from powers of two, but overall it is still mostly faster than std::string (unless it becomes really unlucky).

Again let's look at "the good part" with less influence from weird effects.
Image
This picture is very similar to the one earlier.

Here std::string is 70% slower than std::vector<char> regarding push_back();
and around 45% to 80% slower regarding appending characters in chunks (depending on the chunk size).
Removing memory allocation overhead revealed even larger difference in performance between std::string and std::vector<char>.

Quick-bench

Here's another evidence that appending to std::string is significantly slower than to std::vector<char>:
https://quick-bench.com/q/EKUKhSTDZ3wKF866p1ZiUgRESI8

There it shows appending to std::string is from around 16% to two times slower than inserting at the end of std::vector<char>.
(Although it uses Clang 17 which is not the most recent, the latest version is not available.)

For context

I discovered this issue while researching this:
simdjson/simdjson#2408

Metadata

Metadata

Assignees

No one assigned

    Labels

    libc++libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions