-
Notifications
You must be signed in to change notification settings - Fork 14.9k
[lld][MachO]Multi-threaded i/o. Twice as fast linking a large project. #147134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-lld-macho @llvm/pr-subscribers-lld Author: John Holdsworth (johnno1962) ChangesThis PR adds a new option to lld 26.01, 25.84, 26.15, 26.03, 27.10, 25.90, 25.86, 25.81, 25.80, 25.87 With the proposed code change, and using the 21.13, 20.35, 20.01, 20.01, 20.30, 20.39, 19.97, 20.23, 20.17, 20.23 The secret sauce is in the new function Full diff: https://github.com/llvm/llvm-project/pull/147134.diff 3 Files Affected:
diff --git a/lld/MachO/Config.h b/lld/MachO/Config.h
index a01e60efbe761..92c6eb85f4123 100644
--- a/lld/MachO/Config.h
+++ b/lld/MachO/Config.h
@@ -186,6 +186,7 @@ struct Configuration {
bool interposable = false;
bool errorForArchMismatch = false;
bool ignoreAutoLink = false;
+ int readThreads = 0;
// ld64 allows invalid auto link options as long as the link succeeds. LLD
// does not, but there are cases in the wild where the invalid linker options
// exist. This allows users to ignore the specific invalid options in the case
diff --git a/lld/MachO/Driver.cpp b/lld/MachO/Driver.cpp
index 9eb391c4ee1b9..a244f2781c22c 100644
--- a/lld/MachO/Driver.cpp
+++ b/lld/MachO/Driver.cpp
@@ -47,6 +47,7 @@
#include "llvm/Support/TarWriter.h"
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/TimeProfiler.h"
+#include "llvm/Support/Process.h"
#include "llvm/TargetParser/Host.h"
#include "llvm/TextAPI/Architecture.h"
#include "llvm/TextAPI/PackedVersion.h"
@@ -282,11 +283,11 @@ static void saveThinArchiveToRepro(ArchiveFile const *file) {
": Archive::children failed: " + toString(std::move(e)));
}
-static InputFile *addFile(StringRef path, LoadType loadType,
- bool isLazy = false, bool isExplicit = true,
- bool isBundleLoader = false,
- bool isForceHidden = false) {
- std::optional<MemoryBufferRef> buffer = readFile(path);
+static InputFile *deferredAddFile(std::optional<MemoryBufferRef> buffer,
+ StringRef path, LoadType loadType,
+ bool isLazy = false, bool isExplicit = true,
+ bool isBundleLoader = false,
+ bool isForceHidden = false) {
if (!buffer)
return nullptr;
MemoryBufferRef mbref = *buffer;
@@ -441,6 +442,14 @@ static InputFile *addFile(StringRef path, LoadType loadType,
return newFile;
}
+static InputFile *addFile(StringRef path, LoadType loadType,
+ bool isLazy = false, bool isExplicit = true,
+ bool isBundleLoader = false,
+ bool isForceHidden = false) {
+ return deferredAddFile(readFile(path), path, loadType, isLazy,
+ isExplicit, isBundleLoader, isForceHidden);
+}
+
static std::vector<StringRef> missingAutolinkWarnings;
static void addLibrary(StringRef name, bool isNeeded, bool isWeak,
bool isReexport, bool isHidden, bool isExplicit,
@@ -564,13 +573,21 @@ void macho::resolveLCLinkerOptions() {
}
}
-static void addFileList(StringRef path, bool isLazy) {
+typedef struct { StringRef path; std::optional<MemoryBufferRef> buffer; } DeferredFile;
+
+static void addFileList(StringRef path, bool isLazy,
+ std::vector<DeferredFile> &deferredFiles, int readThreads) {
std::optional<MemoryBufferRef> buffer = readFile(path);
if (!buffer)
return;
MemoryBufferRef mbref = *buffer;
for (StringRef path : args::getLines(mbref))
- addFile(rerootPath(path), LoadType::CommandLine, isLazy);
+ if (readThreads) {
+ StringRef rrpath = rerootPath(path);
+ deferredFiles.push_back({rrpath, readFile(rrpath)});
+ }
+ else
+ addFile(rerootPath(path), LoadType::CommandLine, isLazy);
}
// We expect sub-library names of the form "libfoo", which will match a dylib
@@ -1215,13 +1232,61 @@ static void handleSymbolPatterns(InputArgList &args,
parseSymbolPatternsFile(arg, symbolPatterns);
}
-static void createFiles(const InputArgList &args) {
+// Most input files have been mapped but not yet paged in.
+// This code forces the page-ins on multiple threads so
+// the process is not stalled waiting on disk buffer i/o.
+void multiThreadedPageIn(std::vector<DeferredFile> &deferred, int nthreads) {
+ typedef struct {
+ std::vector<DeferredFile> &deferred;
+ size_t counter, total, pageSize;
+ pthread_mutex_t mutex;
+ } PageInState;
+ PageInState state = {deferred, 0, 0,
+ llvm::sys::Process::getPageSizeEstimate(), pthread_mutex_t()};
+ pthread_mutex_init(&state.mutex, NULL);
+
+ pthread_t running[200];
+ int maxthreads = sizeof running / sizeof running[0];
+ if (nthreads > maxthreads)
+ nthreads = maxthreads;
+ for (int t=0; t<nthreads; t++)
+ pthread_create(&running[t], nullptr, [](void* ptr) -> void*{
+ PageInState &state = *(PageInState *)ptr;
+ static int total = 0;
+ while (true) {
+ pthread_mutex_lock(&state.mutex);
+ if (state.counter >= state.deferred.size()) {
+ pthread_mutex_unlock(&state.mutex);
+ return nullptr;
+ }
+ DeferredFile &add = state.deferred[state.counter];
+ state.counter += 1;
+ pthread_mutex_unlock(&state.mutex);
+
+ int t = 0; // Reference each page to load it into memory.
+ for (const char *start = add.buffer->getBuffer().data(),
+ *page = start; page<start+add.buffer->getBuffer().size();
+ page += state.pageSize)
+ t += *page;
+ state.total += t; // Avoids whole section being optimised out.
+ }
+ }, &state);
+
+ for (int t=0; t<nthreads; t++)
+ pthread_join(running[t], nullptr);
+
+ pthread_mutex_destroy(&state.mutex);
+}
+
+void createFiles(const InputArgList &args, int readThreads) {
TimeTraceScope timeScope("Load input files");
// This loop should be reserved for options whose exact ordering matters.
// Other options should be handled via filtered() and/or getLastArg().
bool isLazy = false;
// If we've processed an opening --start-lib, without a matching --end-lib
bool inLib = false;
+ std::vector<DeferredFile> deferredFiles;
+
for (const Arg *arg : args) {
const Option &opt = arg->getOption();
warnIfDeprecatedOption(opt);
@@ -1229,6 +1294,11 @@ static void createFiles(const InputArgList &args) {
switch (opt.getID()) {
case OPT_INPUT:
+ if (readThreads) {
+ StringRef rrpath = rerootPath(arg->getValue());
+ deferredFiles.push_back({rrpath,readFile(rrpath)});
+ break;
+ }
addFile(rerootPath(arg->getValue()), LoadType::CommandLine, isLazy);
break;
case OPT_needed_library:
@@ -1249,7 +1319,7 @@ static void createFiles(const InputArgList &args) {
dylibFile->forceWeakImport = true;
break;
case OPT_filelist:
- addFileList(arg->getValue(), isLazy);
+ addFileList(arg->getValue(), isLazy, deferredFiles, readThreads);
break;
case OPT_force_load:
addFile(rerootPath(arg->getValue()), LoadType::CommandLineForce);
@@ -1295,6 +1365,12 @@ static void createFiles(const InputArgList &args) {
break;
}
}
+
+ if (readThreads) {
+ multiThreadedPageIn(deferredFiles, readThreads);
+ for (auto &add : deferredFiles)
+ deferredAddFile(add.buffer, add.path, LoadType::CommandLine, isLazy);
+ }
}
static void gatherInputSections() {
@@ -1687,6 +1763,14 @@ bool link(ArrayRef<const char *> argsArr, llvm::raw_ostream &stdoutOS,
}
}
+ if (auto *arg = args.getLastArg(OPT_read_threads)) {
+ StringRef v(arg->getValue());
+ unsigned threads = 0;
+ if (!llvm::to_integer(v, threads, 0) || threads < 0)
+ error(arg->getSpelling() + ": expected a positive integer, but got '" +
+ arg->getValue() + "'");
+ config->readThreads = threads;
+ }
if (auto *arg = args.getLastArg(OPT_threads_eq)) {
StringRef v(arg->getValue());
unsigned threads = 0;
@@ -2107,7 +2191,7 @@ bool link(ArrayRef<const char *> argsArr, llvm::raw_ostream &stdoutOS,
TimeTraceScope timeScope("ExecuteLinker");
initLLVM(); // must be run before any call to addFile()
- createFiles(args);
+ createFiles(args, config->readThreads);
// Now that all dylibs have been loaded, search for those that should be
// re-exported.
diff --git a/lld/MachO/Options.td b/lld/MachO/Options.td
index 4f0602f59812b..3dc98fccc1b7b 100644
--- a/lld/MachO/Options.td
+++ b/lld/MachO/Options.td
@@ -396,6 +396,9 @@ def dead_strip : Flag<["-"], "dead_strip">,
def interposable : Flag<["-"], "interposable">,
HelpText<"Indirects access to all exported symbols in an image">,
Group<grp_opts>;
+def read_threads : Joined<["--"], "read-threads=">,
+ HelpText<"Number of threads to use paging in files.">,
+ Group<grp_lld>;
def order_file : Separate<["-"], "order_file">,
MetaVarName<"<file>">,
HelpText<"Layout functions and data according to specification in <file>">,
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
b66eb42
to
fd5647a
Compare
9acbaea
to
47bad1d
Compare
47bad1d
to
3d11a33
Compare
a324caa
to
6936449
Compare
fdc4c38
to
767b7b1
Compare
aa3f70b
to
c5ce3a7
Compare
c5ce3a7
to
4cab9be
Compare
5ca4401
to
f7c8008
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything obviously wrong with this current revision, but concurrency is hard so I'll give others more time to review
Co-authored-by: Ellis Hoag <ellis.sparky.hoag@gmail.com>
lld/MachO/Driver.cpp
Outdated
#ifndef NDEBUG | ||
#include <iomanip> | ||
#include <iostream> | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having these includes here goes against the LLVM coding standards: https://llvm.org/docs/CodingStandards.html#include-style
They should be at the top.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the includes and the std::setprecision(4) that required them. Guess I'm just going to have to view times in scientific notation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, I don't think they need to be removed and you can still use setprecision
, but the includes should not be in the middle of the file.
lld/MachO/Driver.cpp
Outdated
static size_t totalBytes = 0; | ||
std::atomic_int index = 0; | ||
|
||
parallelFor(0, config->readThreads, [&](size_t I) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is probably good code, but being hidden away in Clangd makes it difficult to reuse. There might be ideas in there than can be copied, though, or moved into a common part of the LLVM source to be usable from both places. Not necessary if you don't want to, but all the concurrency code is making this review more complicated than it should because we are not using proven pieces of code.
I still think that using parallelFor(0, config->readThreads, …)
is incorrect, and while it will do config->readThreads
"chunks" of work, it will not necessarily do them in readThread
number of threads (it can be less).
Either the code changes if you really want to have readThread
number of threads, or the code changes to iterate over all the deferred files and let the parallel algorithm do its work chunking into pieces. In any case, the code should change.
Co-authored-by: Daniel Rodríguez Troitiño <drodrigueztroitino@gmail.com>
Co-authored-by: Daniel Rodríguez Troitiño <drodrigueztroitino@gmail.com>
Co-authored-by: Daniel Rodríguez Troitiño <drodrigueztroitino@gmail.com>
0859078
to
bb91c53
Compare
@drodriguez, Thanks for persisting with this PR. I get that It's unfortunate that we are having to develop our own background threading abstraction but platform agnostic alternatives don't seem to be readily available and there is no point running the risk of having existing code regress by trying to restructure it for our limited use case. We are beginning to cycle on this PR reverting changes that it was suggested be put in which is a sign it is ready enough and we're getting down to fairly minor nits. If there is something you really can't live with let me know or suggest a specific alternative and I can test it. |
lld/MachO/Driver.cpp
Outdated
static size_t totalBytes = 0; | ||
std::atomic_int index = 0; | ||
|
||
parallelFor(0, config->readThreads, [&](size_t I) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If readThreads
does not longer represent the exact number of threads that is used to read, maybe it deserve other name and different argument name and help text for the argument. If this is going to use the -threads
value anyway, why one needs anything more than -parallalelize-input-preload
boolean flag, and use deferred.size()
as the value. And if one is going to use deferred.size()
, why not use parallelFor
iterating over deferred
and remove a bunch of code to handle the indices manually?
PD: it can be less performant that this version, but I will argue that I would prefer less performance and easier code to maintain.
lld/MachO/Driver.cpp
Outdated
#ifndef NDEBUG | ||
#include <iomanip> | ||
#include <iostream> | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, I don't think they need to be removed and you can still use setprecision
, but the includes should not be in the middle of the file.
9f816c6
to
432fb04
Compare
1972748
to
30b8c13
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the absence of feedback from others, I approve this version, which seems to make everyone happy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working with us. I think it's a very strong PR now.
👋, back from my break and I didn't think of anything new to add to the PR other than perhaps expanding the main comment or including a link to this PR as documentation of the approach. As peace seems to have broken out how about we merge this before it snags a conflict? Thanks for all your help and ideas. |
@johnno1962 do you want to modify the summary with the final spelling of the flag and maybe updating the working for some later results? |
I'm not sure I understand what you mean by "the summary" and "the working" but I've updated the initial comment in this PR thread with the new flag name. I don't plan on making any further commits while everything is signed off. |
I meant to write "wording". The title and the summary of this PR (your first thing "comment") is used as the content of the commit that will be created. It contains the old spelling of the flag, and numbers that I don't know if they are relevant or they are out of date. Do you want to modify those before merging? |
I've updated the second occurrence of --read-workers I missed in the "initial post". The numbers are still relevant, they haven't changed significantly since Jul 8th. |
Thanks @drodriguez, Happy to see this landed safe and sound. A conflict might have been outside my git comfort zone! It was an interesting PR for me as I had started out seeing if using compressed object files could be more performant then I noticed the uncompressed versions could me made to read in much more quickly. Seems like memory mapping input files and leaving it to the system to page them in is not necessarily a winning I/o strategy. Thanks for all your help a patience. |
This PR adds a new option to lld
--read-workers=20
that defers all disk I/o then performs it multithreaded so the process is never stalled waiting for the I/o of the page-in of mapped input files. This results in a saving of elapsed time. For a large link (iterating on Chromium) these are the baseline linkage times saving a single file and rebuilding (seconds inside Xcode):26.01, 25.84, 26.15, 26.03, 27.10, 25.90, 25.86, 25.81, 25.80, 25.87
With the proposed code change, and using the
--read-threads=20
option, the linking times reduce to the following:21.13, 20.35, 20.01, 20.01, 20.30, 20.39, 19.97, 20.23, 20.17, 20.23
The secret sauce is in the new function
multiThreadedPageIn()
in Driver.cpp. Without the option lld behaves as before.Edit: with subsequent commits I've taken this novel i/o approach to its full potential. Latest linking times are now:
13.2, 11.9, 12.12, 12.01, 11.99, 13.11, 11.93, 11.95, 12.18, 11.97
Chrome is still linking and running so it doesn't look like anything is broken. Despite being multi-threaded all memory access is readonly and the original code paths are not changed. All that is happening is the system is being asked to proactively page in files rather than waiting for processing to page fault which would otherwise stall the process.