
Commit 32dbfe8

Author: TSUNG-WEI HUANG
Commit message: updated docs
1 parent 8d61e4b commit 32dbfe8

File tree

9 files changed

+100
-27
lines changed


docs/BenchmarkTaskflow.html

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -85,12 +85,12 @@ <h3>Contents</h3>
85 85
<span class="go"> -h,--help Print this help message and exit</span>
86 86
<span class="go"> -t,--num_threads UINT number of threads (default=1)</span>
87 87
<span class="go"> -r,--num_rounds UINT number of rounds (default=1)</span>
88-
<span class="go"> -m,--model TEXT model name tbb|omp|tf (default=tf)</span></pre><p>We currently implement the following instances that are commonly used by the parallel computing community to evaluate system performance.</p><table class="m-table"><thead><tr><th>Instance</th><th>Description</th></tr></thead><tbody><tr><td>bench_binary_tree</td><td>traverses a complete binary tree</td></tr><tr><td>bench_black_scholes</td><td>computes option pricing with the Black-Scholes model</td></tr><tr><td>bench_graph_traversal</td><td>traverses a randomly generated directed acyclic graph</td></tr><tr><td>bench_linear_chain</td><td>traverses a linear chain of tasks</td></tr><tr><td>bench_mandelbrot</td><td>exploits imbalanced workloads in a Mandelbrot set</td></tr><tr><td>bench_matrix_multiplication</td><td>multiplies two 2D matrices</td></tr><tr><td>bench_mnist</td><td>trains a neural network-based image classifier on the MNIST dataset</td></tr><tr><td>bench_parallel_sort</td><td>sorts a range of items</td></tr><tr><td>bench_reduce_sum</td><td>sums a range of items using reduction</td></tr><tr><td>bench_wavefront</td><td>propagates computations in a 2D grid</td></tr><tr><td>bench_linear_pipeline</td><td>pipeline scheduling on a linear chain of pipes</td></tr><tr><td>bench_graph_pipeline</td><td>pipeline scheduling on a graph of pipes</td></tr></tbody></table></section><section id="ConfigureRunOptions"><h2><a href="#ConfigureRunOptions">Configure Run Options</a></h2><p>We implement consistent options for each benchmark instance. 
Common options are:</p><table class="m-table"><thead><tr><th>option</th><th>value</th><th>function</th></tr></thead><tbody><tr><td><code>-h</code></td><td>none</td><td>display the help message</td></tr><tr><td><code>-t</code></td><td>integer</td><td>configure the number of threads to run</td></tr><tr><td><code>-r</code></td><td>integer</td><td>configure the number of rounds to run</td></tr><tr><td><code>-m</code></td><td>string</td><td>configure the baseline model to run: tbb, omp, or tf</td></tr></tbody></table><p>You can configure the benchmarking environment by passing different options.</p><section id="SpecifyTheRunModel"><h3><a href="#SpecifyTheRunModel">Specify the Run Model</a></h3><p>In addition to a Taskflow-based implementation for each benchmark instance, we have implemented two baseline models using the state-of-the-art parallel programming libraries, <a href="https://www.openmp.org/">OpenMP</a> and <a href="https://github.com/oneapi-src/oneTBB">Intel TBB</a>, to measure and evaluate the performance of Taskflow. You can select different implementations by passing the option <code>-m</code>.</p><pre class="m-console"><span class="go">~$ ./bench_graph_traversal -m tf # run the Taskflow implementation (default)</span>
88+
<span class="go"> -m,--model TEXT model name tbb|omp|tf (default=tf)</span></pre><p>We currently implement the following instances that are commonly used by the parallel computing community to evaluate system performance.</p><table class="m-table"><thead><tr><th>Instance</th><th>Description</th></tr></thead><tbody><tr><td>bench_binary_tree</td><td>traverses a complete binary tree</td></tr><tr><td>bench_black_scholes</td><td>computes option pricing with the Black-Scholes model</td></tr><tr><td>bench_graph_traversal</td><td>traverses a randomly generated directed acyclic graph</td></tr><tr><td>bench_linear_chain</td><td>traverses a linear chain of tasks</td></tr><tr><td>bench_mandelbrot</td><td>exploits imbalanced workloads in a Mandelbrot set</td></tr><tr><td>bench_matrix_multiplication</td><td>multiplies two 2D matrices</td></tr><tr><td>bench_mnist</td><td>trains a neural network-based image classifier on the MNIST dataset</td></tr><tr><td>bench_parallel_sort</td><td>sorts a range of items</td></tr><tr><td>bench_reduce_sum</td><td>sums a range of items using reduction</td></tr><tr><td>bench_wavefront</td><td>propagates computations in a 2D grid</td></tr><tr><td>bench_linear_pipeline</td><td>performs pipeline parallelism on a linear chain of pipes</td></tr><tr><td>bench_graph_pipeline</td><td>performs pipeline parallelism on a graph of pipes</td></tr><tr><td>bench_deferred_pipeline</td><td>performs pipeline parallelism with dependencies from future pipes</td></tr><tr><td>bench_data_pipeline</td><td>performs pipeline parallelism on a cache-friendly data wrapper</td></tr><tr><td>bench_thread_pool</td><td>uses our executor as a simple thread pool</td></tr><tr><td>bench_for_each</td><td>performs parallel-iteration algorithms</td></tr><tr><td>bench_scan</td><td>performs parallel-scan algorithms</td></tr><tr><td>bench_async_task</td><td>creates asynchronous tasks</td></tr><tr><td>bench_fibonacci</td><td>finds Fibonacci numbers using recursive asynchronous tasking</td></tr><tr><td>bench_nqueens</td><td>parallelizes n-queens search using recursive asynchronous tasking</td></tr><tr><td>bench_integrate</td><td>parallelizes integration using recursive asynchronous tasking</td></tr><tr><td>bench_primes</td><td>finds a range of prime numbers using parallel-reduction algorithms</td></tr><tr><td>bench_skynet</td><td>traverses a 10-ary tree using recursive asynchronous tasking</td></tr></tbody></table></section><section id="ConfigureRunOptions"><h2><a href="#ConfigureRunOptions">Configure Run Options</a></h2><p>We implement consistent options for each benchmark instance. Common options are:</p><table class="m-table"><thead><tr><th>option</th><th>value</th><th>function</th></tr></thead><tbody><tr><td><code>-h</code></td><td>none</td><td>displays the help message</td></tr><tr><td><code>-t</code></td><td>integer</td><td>configures the number of threads to run</td></tr><tr><td><code>-r</code></td><td>integer</td><td>configures the number of rounds to run</td></tr><tr><td><code>-m</code></td><td>string</td><td>configures the baseline model to run: tbb, omp, or tf</td></tr></tbody></table><p>You can configure the benchmarking environment by passing different options.</p><section id="SpecifyTheRunModel"><h3><a href="#SpecifyTheRunModel">Specify the Run Model</a></h3><p>In addition to a Taskflow-based implementation for each benchmark instance, we have implemented two baseline models using the state-of-the-art parallel programming libraries, <a href="https://www.openmp.org/">OpenMP</a> and <a href="https://github.com/oneapi-src/oneTBB">Intel TBB</a>, to measure and evaluate the performance of Taskflow. You can select different implementations by passing the option <code>-m</code>.</p><pre class="m-console"><span class="go">~$ ./bench_graph_traversal -m tf # run the Taskflow implementation (default)</span>
89 89
<span class="go">~$ ./bench_graph_traversal -m tbb # run the TBB implementation</span>
90 90
<span class="go">~$ ./bench_graph_traversal -m omp # run the OpenMP implementation</span></pre></section><section id="SpecifyTheNumberOfThreads"><h3><a href="#SpecifyTheNumberOfThreads">Specify the Number of Threads</a></h3><p>You can configure the number of threads to run a benchmark instance by passing the option <code>-t</code>. The default value is one.</p><pre class="m-console"><span class="gp"># </span>run the Taskflow implementation using <span class="m">4</span> threads
91 91
<span class="go">~$ ./bench_graph_traversal -m tf -t 4</span></pre><p>Depending on your environment, you may need to use <code>taskset</code> to set the CPU affinity of the running process. This allows the OS scheduler to keep the process on the same CPU(s) for as long as practical, for performance reasons.</p><pre class="m-console"><span class="gp"># </span>pin the process to <span class="m">4</span> CPUs: CPU <span class="m">0</span>, CPU <span class="m">1</span>, CPU <span class="m">2</span>, and CPU <span class="m">3</span>
92-
<span class="go">~$ taskset -c 0-3 bench_graph_traversal -t 4 </span></pre></section><section id="SpecifyTheNumberOfRounds"><h3><a href="#SpecifyTheNumberOfRounds">Specify the Number of Rounds</a></h3><p>Each benchmark instance evaluates the runtime of the implementation at different problem sizes. Each problem size corresponds to one iteration. You can configure the number of rounds per iteration to average the runtime.</p><pre class="m-console"><span class="gp"># </span>measure the runtime as an average of <span class="m">10</span> runs
93-
<span class="go">~$ ./bench_graph_traversal -r 10</span>
92+
<span class="go">~$ taskset -c 0-3 bench_graph_traversal -t 4 </span></pre></section><section id="SpecifyTheNumberOfRounds"><h3><a href="#SpecifyTheNumberOfRounds">Specify the Number of Rounds</a></h3><p>Each benchmark instance evaluates the runtime of the implementation at different problem sizes. Each problem size corresponds to one iteration. You can configure the number of rounds per iteration to average the runtime.</p><pre class="m-console"><span class="gp"># </span>measure the Taskflow runtime by averaging the results over <span class="m">10</span> runs
93+
<span class="go">~$ ./bench_graph_traversal -r 10 -m tf</span>
94 94
<span class="go">|V|+|E| Runtime</span>
95 95
<span class="go"> 2 0.109 # the runtime value 0.109 is an average of 10 runs</span>
96 96
<span class="go"> 842 0.298</span>

docs/ExecuteTaskflow.html

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -65,7 +65,7 @@ <h3>Contents</h3>
65 65
</nav>
66 66
<p>After you create a task dependency graph, you need to submit it to threads for execution. In this chapter, we will show you how to execute a task dependency graph.</p><section id="CreateAnExecutor"><h2><a href="#CreateAnExecutor">Create an Executor</a></h2><p>To execute a taskflow, you need to create an <em>executor</em> of type <a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a>. An executor is a <em>thread-safe</em> object that manages a set of worker threads and executes tasks through an efficient <em>work-stealing</em> algorithm. Issuing a call to run a taskflow creates a <em>topology</em>, a data structure to keep track of the execution status of a running graph. The constructor of <a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a> takes an unsigned integer <code>N</code> and constructs an executor with <code>N</code> worker threads. The default value is <a href="http://en.cppreference.com/w/cpp/thread/thread/hardware_concurrency.html" class="m-doc-external">std::<wbr />thread::<wbr />hardware_concurrency</a>.</p><pre class="m-code"><span class="n">tf</span><span class="o">::</span><span class="n">Executor</span><span class="w"> </span><span class="n">executor1</span><span class="p">;</span><span class="w"> </span><span class="c1">// create an executor with the number of workers</span>
67 67
<span class="w"> </span><span class="c1">// equal to std::thread::hardware_concurrency</span>
68-
<span class="n">tf</span><span class="o">::</span><span class="n">Executor</span><span class="w"> </span><span class="nf">executor2</span><span class="p">(</span><span class="mi">4</span><span class="p">);</span><span class="w"> </span><span class="c1">// create an executor with 4 worker threads</span></pre><p>An executor can be reused to execute multiple taskflows. In most workloads, you may need only one executor to run multiple taskflows, where each taskflow represents a part of a parallel decomposition.</p></section><section id="UnderstandWorkStealingInExecutor"><h2><a href="#UnderstandWorkStealingInExecutor">Understand Work-stealing in Executor</a></h2><p>Taskflow implements a highly efficient <em>work-stealing</em> algorithm to schedule and run tasks in an executor. Work-stealing is a dynamic scheduling algorithm widely used in parallel computing to distribute and balance workload among multiple threads or cores. Specifically, within an executor, each worker maintains its own local queue of tasks. When a worker finishes its own tasks, instead of becoming idle or going to sleep, it (the thief) tries to <em>steal</em> a task from the queue of another worker (the victim). The figure below illustrates the idea of work-stealing:</p><img class="m-image" src="work-stealing.png" alt="Image" /><p>The key advantage of work-stealing lies in its <em>decentralized</em> nature and efficiency. Most of the time, worker threads work on their local queues without contention. Stealing only occurs when a worker becomes idle, minimizing the overhead associated with synchronization and task distribution. 
This decentralized strategy effectively balances the workload, ensuring that idle workers are put to work and that the overall computation progresses efficiently.</p></section><section id="ExecuteATaskflow"><h2><a href="#ExecuteATaskflow">Execute a Taskflow</a></h2><p><a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a> provides a set of <code>run_*</code> methods, <a href="classtf_1_1Executor.html#a8d08f0cb79e7b3780087975d13368a96" class="m-doc">tf::<wbr />Executor::<wbr />run</a>, <a href="classtf_1_1Executor.html#af15db5f7dde8e7ff1f86ef8fe825e9e2" class="m-doc">tf::<wbr />Executor::<wbr />run_n</a>, and <a href="classtf_1_1Executor.html#ae4f9e214ea5ee873e8d90a70bc1c77e8" class="m-doc">tf::<wbr />Executor::<wbr />run_until</a> to run a taskflow for one time, multiple times, or until a given predicate evaluates to true. All methods accept an optional callback to invoke after the execution completes, and return a <a href="classtf_1_1Future.html" class="m-doc">tf::<wbr />Future</a> for users to access the execution status. The code below shows several ways to run a taskflow.</p><pre class="m-code"><span class="w"> </span><span class="mi">1</span><span class="o">:</span><span class="w"> </span><span class="c1">// Declare an executor and a taskflow</span>
68+
<span class="n">tf</span><span class="o">::</span><span class="n">Executor</span><span class="w"> </span><span class="nf">executor2</span><span class="p">(</span><span class="mi">4</span><span class="p">);</span><span class="w"> </span><span class="c1">// create an executor with 4 worker threads</span></pre><aside class="m-note m-warning"><h4>Attention</h4><p>Creating a <a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a> has non-negligible overhead. Unless your application requires multiple executors, we recommend creating a single <a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a> and reusing it to run multiple taskflows.</p></aside></section><section id="UnderstandWorkStealingInExecutor"><h2><a href="#UnderstandWorkStealingInExecutor">Understand Work-stealing in Executor</a></h2><p>Taskflow implements a highly efficient <em>work-stealing</em> algorithm to schedule and run tasks in an executor. Work-stealing is a dynamic scheduling algorithm widely used in parallel computing to distribute and balance workload among multiple threads or cores. Specifically, within an executor, each worker maintains its own local queue of tasks. When a worker finishes its own tasks, instead of becoming idle or going to sleep, it (the thief) tries to <em>steal</em> a task from the queue of another worker (the victim). The figure below illustrates the idea of work-stealing:</p><img class="m-image" src="work-stealing.png" alt="Image" /><p>The key advantage of work-stealing lies in its <em>decentralized</em> nature and efficiency. Most of the time, worker threads work on their local queues without contention. Stealing only occurs when a worker becomes idle, minimizing the overhead associated with synchronization and task distribution. 
This decentralized strategy effectively balances the workload, ensuring that idle workers are put to work and that the overall computation progresses efficiently.</p><p>That being said, the internal scheduling mechanisms in <a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a> are not trivial, and it&#x27;s not easy to explain every detail in just a few sentences. If you&#x27;re interested in learning more about the technical details, please refer to our paper published in 2022 <em>IEEE Transactions on Parallel and Distributed Systems (TPDS)</em>:</p><ul><li>Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin, &quot;<a href="https://tsung-wei-huang.github.io/papers/tpds21-taskflow.pdf">Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System</a>,&quot; <em>IEEE Transactions on Parallel and Distributed Systems (TPDS)</em>, vol. 33, no. 6, pp. 1303-1320, June 2022</li></ul></section><section id="ExecuteATaskflow"><h2><a href="#ExecuteATaskflow">Execute a Taskflow</a></h2><p><a href="classtf_1_1Executor.html" class="m-doc">tf::<wbr />Executor</a> provides a set of <code>run_*</code> methods, <a href="classtf_1_1Executor.html#a8d08f0cb79e7b3780087975d13368a96" class="m-doc">tf::<wbr />Executor::<wbr />run</a>, <a href="classtf_1_1Executor.html#af15db5f7dde8e7ff1f86ef8fe825e9e2" class="m-doc">tf::<wbr />Executor::<wbr />run_n</a>, and <a href="classtf_1_1Executor.html#ae4f9e214ea5ee873e8d90a70bc1c77e8" class="m-doc">tf::<wbr />Executor::<wbr />run_until</a> to run a taskflow for one time, multiple times, or until a given predicate evaluates to true. All methods accept an optional callback to invoke after the execution completes, and return a <a href="classtf_1_1Future.html" class="m-doc">tf::<wbr />Future</a> for users to access the execution status. 
The code below shows several ways to run a taskflow.</p><pre class="m-code"><span class="w"> </span><span class="mi">1</span><span class="o">:</span><span class="w"> </span><span class="c1">// Declare an executor and a taskflow</span>
69 69
<span class="w"> </span><span class="mi">2</span><span class="o">:</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">Executor</span><span class="w"> </span><span class="n">executor</span><span class="p">;</span>
70 70
<span class="w"> </span><span class="mi">3</span><span class="o">:</span><span class="w"> </span><span class="n">tf</span><span class="o">::</span><span class="n">Taskflow</span><span class="w"> </span><span class="n">taskflow</span><span class="p">;</span>
71 71
<span class="w"> </span><span class="mi">4</span><span class="o">:</span>
