Understanding the behavior of transactional memory applications

João Lourenço; Ricardo Dias; João Luís; Miguel Rebelo; Vasco Pessanha

2009, Proceedings of the 7th Workshop on Parallel and Distributed Systems Testing, Analysis, and Debugging - PADTAD '09

Transactional memory is a new trend in concurrency control that was boosted by the advent of multi-core processors and the upcoming many-core processors. It promises the performance of fine-grained threading with the simplicity of coarse-grained threading. However, there is a clear absence of software development tools oriented to the transactional memory programming model, which is confirmed by the very small number of related scientific works published so far. This paper describes ongoing work. We propose a very low overhead monitoring framework, developed specifically for monitoring TM computations, that collects the transactional events into a single log file, sorted in a global order. This framework is then used by a visualization tool to display different types of charts from two categories: statistical charts and thread-time space diagrams. The latter are interactive, allowing the user to identify conflicting transactions. We use the visualization tool to analyze the behavior of two different, but similar, testing applications, illustrating how it can be used to better understand the behavior of these transactional memory applications.

CITI, Departamento de Informática, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Portugal
{Joao.Lourenco, rjfd}@di.fct.unl.pt, {jecluis, miguelrebelo6, vascopessanha}@gmail.com

Categories and Subject Descriptors
D.1.3 [PROGRAMMING TECHNIQUES]: Concurrent Programming - Parallel Programming; D.2.5 [SOFTWARE ENGINEERING]: Testing and Debugging - Diagnostics

General Terms
Algorithms, Performance, Reliability, Experimentation

Keywords
Software Transactional Memory, Monitoring, Profiling, Visualization, Testing, Debugging, Concurrency

1. INTRODUCTION

The interest in parallel programming was boosted by the recent emergence of multi-core processors.
In the past, performance improvements depended strongly on increases in processor speed, but processor speed is no longer increasing. The recent drop in prices and the general availability of multiprocessors in desktop computers made multi-core architectures available not only to the everyday user, but also to software developers, who must now rely on parallelism to fully exploit computational systems and achieve performance improvements. One can estimate that desktop computers will soon include dozens of processors and, thus, programming mechanisms and methodologies must consider scalability as a key issue. Transactional Memory (TM) promises to ease the development of scalable parallel applications with performance close to that of fine-grained threading but with the simplicity of coarse-grained threading.

Parallelism comes to application development at the expense of a dramatic increase in program complexity and development effort. Coding is harder due to many factors, such as tracking and coordinating the multiple concurrent control flows. Testing is also harder, as the parallel application may exhibit a multitude of behaviors, many of them unacceptable. Debugging is much harder as well: the exponential number of possible application states makes state-based debugging almost useless by itself, and the intrusion effect introduced by monitoring (logging) approaches may change the application behavior, potentially masking errors previously observed and triggering new ones. Developers also observe that parallel applications underperform on the available hardware. This is frequently due to design and/or coding decisions that limit the exploitation of concurrency within the application.
The increased complexity of parallel program development at all levels, including testing, correctness debugging, and performance debugging, may be eased by a good understanding of the effective application behavior in its specific hardware and software execution contexts, including an understanding of the transactional framework being used to model and control the interactions between the multiple control flows. One way to achieve such an understanding is to collect run-time information about the application behavior and analyze this data later. The collected raw run-time data can easily reach hundreds of megabytes and thus become unmanageable for the common developer. A visual representation of the collected run-time data may aggregate large amounts of data in a single figure and may therefore be very convenient for analyzing the program behavior.

In this paper we propose a framework for analyzing the behavior of transactional memory applications. This framework is composed of four components: the low overhead monitoring tool and trace-file generator, the trace-file processor, the trace-file analyzers, and the graphical user interface. Each of these components is detailed further in this paper.
The main contributions of this paper are:

• The proposal of a low overhead monitoring system for transactional memory programs that does not change the global application behavior;
• A set of analyzers that extract relevant information from the trace file;
• An interactive graphical user interface that displays the information produced by the analyzers.

The remainder of this paper is organized as follows: the next section introduces the low overhead monitoring framework for transactional memory; Section 3 describes the experimental context where the monitoring framework was used; Section 4 describes our tool, which displays a set of charts reporting on the information collected by the monitoring framework; Section 5 shows how the tool can be used to help analyze the behavior of two testing applications; and Section 7 presents some concluding remarks and outlines future work.

2. THE MONITORING FRAMEWORK

Monitoring transactional memory requires registering the start and the end of each transaction, the latter with either Commit or Abort, and all Read and Write accesses to shared memory locations that took place within the transaction. Reading or writing data from/to a memory location is usually accomplished with a single machine instruction. Logging these memory access events will probably require dozens or even hundreds of machine instructions, slowing down the memory access operations by one or two orders of magnitude. This level of overhead may be unacceptable to the computation. Disks are much slower than memory, so saving the logged events into a file is also unacceptable in most situations. As an alternative, a limited number of events may be kept in a shared memory buffer, but the need for exclusive access to the shared buffer when registering events makes it a bottleneck in the logging system, eliminating much of the non-determinism inherent to the parallel computation and significantly changing the application behavior.

Three important properties must be considered when developing a transactional memory monitoring system: i) keep the logged events in main memory, represented with a small memory footprint; ii) do not introduce additional synchronization between threads; and iii) preserve the global application behavior. The last property depends on its predecessor, as additional synchronization between threads will most probably change the global program behavior.

To satisfy these three properties, we opted for an approach where each thread keeps the logging information in a private buffer in a compact binary format. All threads thus work independently from each other, allowing the concurrent registration of events with no contention between threads. When the program finishes its execution, the tracing system merges all buffers and dumps the events into a single file, in text format for easier understanding.

Merging the local thread buffers depends on defining a global order for the events. One possible solution would be to have an atomic counter incremented by each thread each time an event is registered. However, this approach would not comply with the second and third properties, which state that the logging system should neither introduce additional synchronization requirements nor change the global application behavior. Our solution was to use a specific CPU register (the time stamp counter, read with the RDTSC instruction) that gives the number of clock cycles since the last system reset. The value given by this register can be used to impose a global order on the events and can be accessed by all threads with no additional synchronization. The value of this counter in each processor may drift from the others. This clock drift makes the global ordering of the events error prone and, thus, the resulting single file is not 100% accurate. However, the inaccuracy of this methodology does not seriously compromise the results for the statistical information, and the methodology provides very accurate information that could not be obtained otherwise, such as the real duration of each transaction.

When using the tracing system, if all the operations being traced incur the same overhead, the global application behavior will be essentially the same, including the inherent non-determinism of the application. There will be a simple reduction of the overall system performance without significant impact on the system behavior.

Figure 1 illustrates the performance of the two testing applications with and without the monitoring system activated. We studied the behavior in a read dominant context (the red/dark gray line) and in a write dominant context (the green/light gray line). The left column always refers to the Linked List application and the right column to the Red-Black Tree application. For the top row, the testing applications were run with no monitoring system. For the middle row, we used a (shared) atomic counter as a logical clock to timestamp the events and support their global ordering. For the bottom row, we used the CPU time stamp counter (RDTSC) to timestamp the events.

Figure 1: The performance of testing applications with and without the monitoring system. (Six panels, each for 1000 nodes: List and Red-Black with no monitoring, with the atomic counter, and with the RDTSC counter. X-axis: number of threads (1, 2, 4, 8); Y-axis: 1000x operations/second. Series: 5% insert, 5% remove, 90% lookup; and 45% insert, 45% remove, 10% lookup.)

By analyzing the figure, one can conclude that the performance results for the applications without monitoring (top row) and with the atomic counter (middle row) are completely different, thus the applications exhibit different behaviors. On the other hand, comparing the graphs in the top and bottom rows, one can see that the monitoring system in our approach reduces the overall performance to approximately 40% of the original, but displays similar scalability when the number of processors increases, preserving the global behavior of both applications. In the top-left and bottom-left charts, the performance curves for the read-dominant context are slightly different. This is due to scalability limitations of the testing application (List) when running without monitoring. These limitations are not triggered when monitoring is activated because the total number of operations per second is much lower (approximately 40%).

2.1 Event Types

There are many differences between the multiple transactional memory frameworks described in the literature.
However, all of them rely on a small set of operations to provide their functionality, namely: start a transaction (TxStart), end a transaction successfully with Commit (TxCommit) or unsuccessfully with Abort (TxAbort), and access a shared data item for reading (TxRead) or for writing (TxWrite). Although there are multiple alternative implementations, the above set of events is widely accepted as the minimum set necessary to describe transactional memory computations. The programmer will use these operations in the following order:

    TxStart (TxRead | TxWrite)* [TxAbort] TxCommit

where "*" denotes repetition and "[ ]" denotes an optional element. However, considering that at runtime TxAbort and TxCommit are mutually exclusive and that a transaction may abort for many reasons, the actual behavior of the application can be represented as:

    TxStart (TxRead | TxWrite)* (TxCommit | TxAbortUser | TxAbortCommit | TxAbortOther)

where TxAbortUser denotes that the transaction was aborted by programmer request, TxAbortCommit denotes that the transaction tried but was unable to commit, and TxAbortOther denotes that the transaction aborted when accessing a shared data item, either for reading or for writing.

Each event must be registered upon the execution of the associated operation. More precisely, we specify that all events except TxCommit must be registered right before the execution of the operation. The TxCommit event must be registered only when the transactional framework knows that it can commit the transaction. If a transaction willing to commit is aborted by the transactional framework, only a TxAbort event should be registered. This means that all transactions are delimited in the trace log by a TxStart event and by either a TxCommit or a TxAbort event.

2.2 Event Structure

Each event is composed of a set of attributes. A subset of these attributes is common to all types of events, while the others are event specific.
The following attributes are common to all events:

• timestamp: the time instant at which the event occurred;
• eventId: the identifier of the type of event, e.g., TxStart, TxRead, etc.;
• threadId: the identifier of the thread that executed the operation;
• transactionId: the identifier of the transaction code block in which the operation took place.

All the attributes described above are self-explanatory, except the last one. A single thread can execute multiple transactions in sequence, thus the transactionId attribute is used to identify the transaction code block where the operation took place. This makes it possible to uniquely identify each transaction code block and to locate it in the source code. It also makes it possible to map a set of operations onto a single transaction.

The TxAbort event has an additional attribute, the type attribute, which identifies the reason for aborting the transaction. Because transactional memory frameworks implement different validation schemes, transactions can abort when performing a read operation, a write operation, or a commit operation, or when the user explicitly aborts the transaction. This attribute has three possible values: commit, when the transaction aborts on commit; user, when the transaction aborts by user request; and other, when none of the previous apply and, thus, the transaction aborted as a consequence of a read or a write operation.

To keep the tracing system light, it should avoid processing events online whenever the processing can be done later offline. There is no distinction between read and write aborts in the trace file because it is easy to post-process the trace and, by looking back in time, find which operation triggered the abort. In this case, one just needs to look back for the previous event from the same thread and check whether it is a read or a write operation.

The TxRead and TxWrite events also have an additional attribute, the varId attribute.
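The look-back step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's implementation: the `Event` class, its field names, and the `classify_aborts` function are hypothetical, and the event names follow the lowercase spelling used in the trace output (tx_start, tx_read, tx_write, tx_commit, tx_abort).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical in-memory form of one trace event (names are assumptions).
    timestamp: int
    event_id: str           # tx_start, tx_read, tx_write, tx_commit, tx_abort
    thread_id: str
    transaction_id: int
    extra: Optional[str] = None   # abort type for tx_abort, varId for reads/writes


def classify_aborts(events):
    """Refine aborts of type 'other' into 'read' or 'write'.

    `events` must be sorted by timestamp. For each abort, the previous event
    of the same thread tells which operation triggered it.
    """
    last_by_thread = {}
    result = []
    for ev in events:
        if ev.event_id == "tx_abort":
            reason = ev.extra
            prev = last_by_thread.get(ev.thread_id)
            if reason == "other" and prev is not None:
                if prev.event_id == "tx_read":
                    reason = "read"
                elif prev.event_id == "tx_write":
                    reason = "write"
            result.append((ev.timestamp, ev.thread_id, reason))
        last_by_thread[ev.thread_id] = ev
    return result


# Illustrative trace (timestamps and addresses are made up).
trace = [
    Event(10, "tx_start", "T1", 0),
    Event(11, "tx_start", "T2", 1),
    Event(12, "tx_read", "T1", 0, "0x805fa0"),
    Event(13, "tx_write", "T2", 1, "0x805fa0"),
    Event(14, "tx_abort", "T1", 0, "other"),
    Event(15, "tx_abort", "T2", 1, "other"),
    Event(16, "tx_start", "T1", 0),
    Event(17, "tx_abort", "T1", 0, "commit"),
]
print(classify_aborts(trace))
```

Only one previous event per thread has to be remembered, so the pass needs constant memory per thread. The sketch does not interpret varId, the extra attribute of the TxRead and TxWrite events.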
The varId attribute identifies the memory address or object ID (for an object-oriented programming language) that was accessed by the operation. This identifier must be unique for each memory location or object.

2.3 Tracing System Instrumentation

We implemented a simple API so that TM frameworks can easily insert the calls to the tracing system within their existing code. The current prototype only implements the API, and the programmer must change the source code accordingly, but we intend to implement a mechanism to insert the calls to the API automatically. The API is composed of five functions, one to register each of the previously described events, namely TxStart, TxCommit, TxAbort, TxRead, and TxWrite.

All the calls to the monitoring API can be issued by the TM framework, with the exception of the TxStart event. This event must associate a unique ID with the transaction source code block, which will later be used to refer to that code block, e.g., to associate a transaction with a user-level operation. If IDs were generated automatically, a table mapping transaction IDs to source code blocks would have to be generated and made accessible to the programmer. A source-to-source compiler could easily generate the unique transaction IDs and dump this table at the end of the source code transformation process.

2.4 Tracing System Output

The tracing system dumps the content of all the buffers into a single file upon the termination of the application. All the events are ordered by increasing values of the timestamp attribute. The format of each event in the trace file was defined with the aim of allowing our analysis tool to work with traces generated by different TM runtime systems. This format definition is thus neutral with respect to the TM framework and is shown in Figure 2, along with a small example of the output of a trace with two threads and two transactions.

3. EXPERIMENTAL CONTEXT

We performed a set of simple tests, logging the behavior with our monitoring system, and then used our tool to analyze the behavior of these testing programs. The tests consisted of series of operations on a set. The set has two implementations, one as a sorted singly linked list and another as a red-black tree. The interface for both implementations provides three methods: insert(), remove() and lookup(). The set elements have a key and a value, and all functions are indexed by the key. Duplicate keys are not allowed, and adding an element with an already existing key will update its value. The tests were executed over CTL [3], a transactional memory framework for the C and C++ programming languages derived from TL2 [4].

The tests are divided into three main categories, with three different load patterns. The test load pattern is defined by assigning different probabilities to each of the three methods. The first load pattern is meant to simulate a read dominant context, with 5% of inserts, 5% of removes and 90% of lookups. The second load pattern is meant to simulate a balanced system, with 20% of inserts, 20% of removes, and 60% of lookups. The third load pattern is meant to simulate a write dominant context, with 45% of inserts, 45% of removes, and 10% of lookups.

The tests were performed on a Sun Fire X4600 M2 x64 server with eight dual-core AMD Opteron Model 8220 processors @ 2.8 GHz with 1024 KB cache and 16 GB of RAM. The tests were executed with a maximum of 8 threads, to avoid having operating system tasks interfere with the tests and introduce noise (arbitrary time delays) into the tracing system.

4. THE VISUALIZATION TOOL

One of the problems of tracing systems is that they tend to generate huge amounts of hard to digest information.
In our monitoring system, besides registering the start and end of memory transactions, we must also register all the accesses to shared memory locations, and there may be millions of memory accesses per second. For the applications under testing, the monitoring system was generating approximately 50 KBytes of tracing data per processor per millisecond, resulting in close to 100 MBytes of data for a time slot of 200 milliseconds with eight parallel threads. Such volumes of information are clearly unmanageable without the aid of supporting tools.

Our goal was to develop a tool to help process the huge amounts of information generated by the monitoring system. The tool should serve two main purposes: i) provide a graphical representation of statistical information about the behavior of the testing application; and ii) provide a graphical representation of the application behavior along time. The visualization tool we developed thus has two main features: visualization of statistical information by means of charts; and visualization of transactional computations in a timeline. Both features consume the same source of information, which is the trace log generated by the tracing system (see Figure 2).

4.1 Application Components

The application is composed of four components: the monitoring framework, which was already described; the trace-file processor; the trace-file analyzers; and the graphical user interface.

4.1.1 The Graphical User Interface

The application graphical user interface (GUI) is at a very preliminary stage. The current set of analysis modules will also be expanded in the near future. The application GUI is composed of two panes: the pane on the left lets the user choose the type of visualization; the pane on the right displays the selected chart/graph. Figure 3 illustrates the main view of the visualization tool.
Figure 3: Main layout of the visualization tool

4.1.2 The Trace-File Processing

To ease the work of the analyzers when parsing the trace file to generate charts or transaction behavior graphs, we developed a small component that presents the trace file as a list of events. Due to its size, we cannot load all the trace-file data into main memory, thus this component provides the analyzers with an event iterator that operates over the events in the trace file stored directly in secondary memory. This iterator also supports the notion of savepoint. Savepoints work as bookmarks in the trace file and can be used to jump directly back and forth in the trace file without further processing.

4.1.3 The Trace-File Analyzers

There are two types of analyzers in our tool: visualization of statistical information by means of charts; and visualization of transactional computations in a timeline. The first one uses JFreeChart, a library that renders many different types of charts (see Figure 5 as an example). The second one uses a new Java Swing component, developed from scratch, to render the behavior of the transactions along time (see Figure 4 as an example).

Each analyzer must extend a well defined interface and must implement a method that returns the respective visual component. This component is rendered later on by the application GUI component. This approach allows analyzers to be treated as plugins to the GUI. All analyzers use the trace-file processor to extract the information needed to create the visual information.

    <timestamp> <eventId> T<threadId> <transactionId> [<TxAbort:type> | <varId>]

    %% Example:
    3043566053937770 tx_start  T1 2
    3043566053938505 tx_read   T1 2 0x3871dbf8
    3043566053938530 tx_start  T2 0
    3043566053938569 tx_read   T1 2 0x805fa0
    3043566053939240 tx_read   T2 0 0x805fa0
    3043566053939378 tx_write  T2 0 0x805fa0
    3043566053939505 tx_read   T1 2 0x3871dbf8
    3043566053939725 tx_commit T2 0
    3043566053940104 tx_abort  T1 2 commit

Figure 2: Event string format and output example.

The time-based transaction behavior analyzer is backed by a Java Swing component that can show the transactions executed by each thread along time. Each transaction is represented by a color that depends on the type of transaction and on the type of abort, and the size of the box representing the transaction is directly proportional to the real duration of the transaction. This analyzer also allows the user to click on the abort event of a transaction A, and will automatically draw an arrow from that event to the transaction B that forced A to abort.

4.2 Visualization Modules

The visualization tool provides graphical representations of statistical information about the application behavior, as well as a graphical representation of the transactional status of each application thread along time. These two main classes of charts are discussed further in the following sections.

4.2.1 Statistical Information Charts

At the time of writing this paper, the visualization tool supports nine different charts for displaying statistical information:

Abort Types. Displays a pie chart with the relative (percentage) number of transactions aborted at different moments in the transaction life-cycle: by user request; when reading a memory cell/object; when writing a memory cell/object; and just prior to committing the transaction. Gives a feeling for the eagerness of conflict detection.

Commit/Abort. Displays the percentage of transactions that finished successfully versus those that had to abort. Gives a feeling for the amount of wasted computation cycles.

Transaction ID. Displays the relative number of each kind of user-level transactional operation. In our testing application, this is the percentage of insert(), remove() and lookup() operations. Contributes to the understanding of the global application behavior.

Read/Write Rates. For each user-level transactional operation, displays a bar with the percentage of memory read and write operations. Contributes to the understanding of the behavior of the individual operations executed by the application.

Commits/Aborts XYChart. Reports on the number of committed and aborted transactions per execution time slice. Makes it possible to infer the transactional throughput along time.

AccessMemChart. Reports on the access frequency for each transactional unit, e.g., how many times each memory cell was accessed. Makes it possible to identify contention points. In the future we plan to split this chart into two charts, depending on the type of memory operation executed, i.e., a memory read or a memory write.

Transaction Retry Rates. For each user-level transactional operation, reports average numbers of transaction retries. Helps to understand the level of contention exhibited by each user-level transactional operation.

Abort Reason. Reports on whether the aborts were caused by real conflicts or by false positives, i.e., cases where the transaction was unnecessarily aborted by the transactional memory framework. Helps to understand whether the contention management policies are adequate for the application under testing.

Wasted Work. Reports the percentage of time spent by aborted transactions in relation to the total time spent by all transactions. Contributes to the understanding of the time wasted in processing doomed transactions.

4.2.2 Time-based Behavior Information Charts

The visualization tool also supports the representation of the application behavior along time. This chart represents the multiple application threads on the Y-axis and the transactional status of those threads along time on the X-axis.
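As a rough illustration of the data behind such a thread-time chart, the sketch below pairs each tx_start with the tx_commit or tx_abort that closes it on the same thread, producing one interval per executed transaction. The same intervals also yield the Wasted Work figure described earlier. The function names and the tuple layout are assumptions made for this example; the tool itself is written in Java and its internals are not shown here.

```python
def transaction_intervals(events):
    """Build one interval per executed transaction.

    `events` is an iterable of (timestamp, event_id, thread_id, transaction_id)
    tuples sorted by timestamp. Returns a list of
    (thread_id, transaction_id, start, end, outcome) tuples,
    where outcome is "commit" or "abort".
    """
    open_tx = {}      # thread_id -> (start timestamp, transaction_id)
    intervals = []
    for ts, event_id, thread_id, tx_id in events:
        if event_id == "tx_start":
            open_tx[thread_id] = (ts, tx_id)
        elif event_id in ("tx_commit", "tx_abort"):
            started = open_tx.pop(thread_id, None)
            if started is not None:
                outcome = event_id[len("tx_"):]
                intervals.append((thread_id, started[1], started[0], ts, outcome))
    return intervals


def wasted_work(intervals):
    """Fraction of the total transactional time spent in aborted transactions."""
    total = sum(end - start for _, _, start, end, _ in intervals)
    aborted = sum(end - start
                  for _, _, start, end, outcome in intervals
                  if outcome == "abort")
    return aborted / total if total else 0.0


# Illustrative events (timestamps are made up, not taken from a real trace).
events = [
    (0, "tx_start", "T1", 2),
    (10, "tx_start", "T2", 0),
    (30, "tx_commit", "T2", 0),
    (40, "tx_abort", "T1", 2),
    (45, "tx_start", "T1", 2),
    (65, "tx_commit", "T1", 2),
]
intervals = transaction_intervals(events)
print(intervals)
print(wasted_work(intervals))
```

Each interval maps directly onto one box of the diagram: the thread selects the row, the start and end timestamps give the position and width, and the outcome selects the color of the terminating marker.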
The example illustrated in Figure 4 shows the evolution of the testing application with eight threads. Tx0 (dark blue) means the thread is executing an insert operation, Tx1 (light blue) a remove operation, and Tx2 (yellow) a lookup operation. Transactions terminate with either a commit (green) or an abort (pink). In the bottom left there is a slider to change the zoom factor of the displayed information.

In contrast to the statistical visualization charts/modules, this time-based behavior chart is interactive. If the user selects a time-slot in a transaction A corresponding to an abort, the module will locate and identify the transaction B that conflicted with transaction A and forced it to abort. An arrow is then drawn connecting the abort time-slot of transaction A to the beginning of the time-slot of transaction B. This functionality is depicted in Figure 4. The tool can also draw arrows from all abort events to the corresponding conflicting transactions. Behavioral patterns may be observed using this feature.

Figure 4: Example of a transaction conflict detection.

In the future we plan to extend the user-interaction functionalities, such as mapping the transaction time-slots to source-code locations and allowing the user to implicitly invoke the text editor at the source line associated with a specific time-slot.

4.2.3 Analyzer/Visualization Module Development

Developing a new analyzer for this tool is straightforward. As described previously, each type of analyzer corresponds to a Java class that implements an interface defining the type of analyzer. This class generates the visual information, which can be a chart or a transaction behavior graph, by collecting information while iterating over the trace file. After implementing the analyzer class, it must be registered in the GUI component in order to be listed in the left pane. These are the only two steps needed to create a new analyzer module. In future work we will dynamically load the analyzer classes from a specified directory and list them in the GUI component without requiring an explicit registration. This will allow developers to create new analyzers even without having the tool source code.

5. APPLICATION BEHAVIOR ANALYSIS

In this section we illustrate how the visualization charts can help in understanding the behavior of an application that uses transactional memory. We recall that we have two similar testing applications implementing randomly generated operations over a set. In each application the set relies on a different data structure: one uses a singly linked list (LL), the other uses a red-black tree (RB).

The syntax used to describe the testing conditions for the charts is as follows: (App, n-threads, %inserts, %removes, %lookups, key range). App is either LL for the linked list or RB for the red-black tree; n-threads is between 1 and 8 and identifies how many threads were executing concurrently; %inserts, %removes, %lookups identify the memory access pattern, which can be read-dominant (5%, 5%, 90%), balanced (20%, 20%, 60%), or write-dominant (45%, 45%, 10%); key range defines the size of the set.

5.1 Statistical Information Analysis

5.1.1 Commit/Abort Ratio

Figure 5 illustrates the commit/abort rates for different conditions of the linked list testing application.

Figure 5: Commit/abort rates for different testing application conditions. Four pie charts titled "Transaction Commit/Abort Percentages":
(LL, 1, 20%, 20%, 60%, 500): commits 100%, aborts 0%
(LL, 2, 20%, 20%, 60%, 500): commits 88%, aborts 12%
(LL, 8, 20%, 20%, 60%, 500): commits 47%, aborts 53%
(LL, 8, 45%, 45%, 10%, 10000): commits 23%, aborts 77%

As expected, with a single thread the abort rate is zero. This rate increases to 12% with two threads in a moderate update access pattern (40% of updates and 60% of lookups). With eight threads there is higher memory contention in accessing the list elements, and in the same balanced context the abort ratio increases to 53%. The worst case we could identify was with a very high rate of updates (90%) and a very long list (10,000 elements). In this case, there is a very high probability that a long transaction aiming at changing an element with a high key value has to abort because it read a value that was changed by a shorter transaction. In the last two cases there is a considerable amount of wasted work done by aborted transactions. This means the transactional implementation of the underlying data structure is not adequate for the current usage/parameters of the testing application.

5.1.2 False Positives

The TM framework used to generate the trace files was operating with memory words as the transactional unit and using a deferred update mechanism. Because each transaction may access thousands of memory cells, CTL maps memory addresses into a limited-size table using a hash function. As multiple memory words may be mapped to the same position in the table, there is a chance of undetected false conflicts. The size of this table directly influences the percentage of false conflicts.
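The effect of the table size on false conflicts can be illustrated with a small sketch. It assumes 8-byte memory words and a simple modulo hash over the word index; both are illustrative choices, since the actual hash function used by CTL is not described here, and the helper names are hypothetical.

```python
WORD_SIZE = 8  # assumed word size in bytes (illustrative only)


def table_slot(address: int, table_size: int) -> int:
    """Map a memory address to a table position (assumed hash: word index modulo size)."""
    return (address // WORD_SIZE) % table_size


def count_conflicts(write_set, read_set, table_size):
    """Count (true, false) conflicts between one write set and one read set.

    A true conflict is a write and a read on the same address; a false
    conflict is a pair of different addresses mapped to the same table slot.
    """
    true_conflicts = 0
    false_conflicts = 0
    for w in write_set:
        for r in read_set:
            if w == r:
                true_conflicts += 1
            elif table_slot(w, table_size) == table_slot(r, table_size):
                false_conflicts += 1
    return true_conflicts, false_conflicts


writes = {4096, 4104}
reads = {4096, 4224, 4232}
small_table = count_conflicts(writes, reads, table_size=16)
large_table = count_conflicts(writes, reads, table_size=64)
print(small_table, large_table)  # (1, 2) (1, 0)
```

With the 16-entry table, two of the three reported conflicts are false; with 64 entries only the true conflict remains, which matches the observation that the table size influences the share of false conflicts.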
Figure 6: Percentage of false positives for different testing application conditions. Pie charts titled "Abort False Positives", showing true versus false conflicts for (LL, 2, 20%, 20%, 60%, 500), (LL, 8, 20%, 20%, 60%, 500), (RB, 8, 20%, 20%, 60%, 500) and (RB, 8, 45%, 45%, 10%, 500).

Figure 6 shows the amount of false positives detected in different traces. From the charts available, of which we present a subset here, we may infer that in general the list-based (LL) solution has proportionally more false positives than the red-black tree based solution (RB). We may also infer that the number of false positives decreases when the contention level increases.

5.1.3 Transaction Retry Rate

Transactions frequently conflict with other transactions. The common approach to deal with conflicts is to abort one of the conflicting transactions. The transactional framework relies on a contention manager to decide which of the conflicting transactions must abort. Depending on the contention manager policies, some types of transactions will be more prone to abort than others, e.g., the contention manager may give preference to shorter/longer transactions, or to younger/older transactions, or to read-only transactions, or to transactions that accessed the smaller/higher number of shared resources, etc. Figure 7 illustrates the transaction retry rate for different testing application conditions. Different contention managers would produce different charts.

Figure 7: Transaction retry rate for different testing application conditions. Bar charts titled "Transaction Retry Rate" (vertical axis: average number of retries; horizontal axis: individual transactions Code 0, Code 1 and Code 2; series: Average and Max) for (LL, 8, 20%, 20%, 60%, 50), (LL, 8, 20%, 20%, 60%, 10000), (RB, 8, 45%, 45%, 10%, 50) and (RB, 8, 45%, 45%, 10%, 10000).

Finding an element in a list has O(n) complexity, while the same operation in a red-black tree has O(log₂ n) complexity. Thus, as expected, the transaction retry rate for the LL solution is much higher than for the RB solution. This can easily be seen in the figure by comparing the top two charts with the lower ones (please note that the vertical scale differs in all the charts). Another interesting effect is that with a higher number of keys, the LL implementation has on average more aborts/retries and the RB implementation has fewer. This is because the CTL contention manager does not favor longer transactions, and the linear search in the LL implementation forces many long transactions to abort: at commit time there is a good chance that a shorter transaction has updated an item that was previously read by the longer transaction and, thus, the longer transaction must abort. The opposite applies to the RB implementation. The search space is halved at each step and the probability of having two transactions in conflict is much lower. The contention level decreases when the set cardinality increases, as the search space is split even more.

5.1.4 Application Level Operations

Our testing applications receive a set of command line arguments (exactly the same for both the LL and RB applications) that instantiate some of the configurable parameters, thereby changing the overall application behavior, including the memory access pattern.
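As an illustration of how such parameters could shape the workload, the sketch below picks each operation from a pseudo-random draw according to the configured percentages and then measures the resulting mix. The function names and the selection scheme are assumptions made for this example, not the code of the testing applications.

```python
import random
from collections import Counter


def pick_operation(rng, inserts, removes):
    """Choose an operation from integer percentages (lookups take the remainder)."""
    draw = rng.randrange(100)  # uniform integer in [0, 100)
    if draw < inserts:
        return "insert"
    if draw < inserts + removes:
        return "remove"
    return "lookup"


def observed_mix(n_ops, inserts, removes, seed=0):
    """Return the observed percentage of each operation over n_ops draws."""
    rng = random.Random(seed)
    counts = Counter(pick_operation(rng, inserts, removes) for _ in range(n_ops))
    return {op: 100.0 * counts[op] / n_ops for op in ("insert", "remove", "lookup")}


# Balanced pattern: 20% inserts, 20% removes, 60% lookups.
mix = observed_mix(10_000, inserts=20, removes=20)
print(mix)
```

The observed percentages are typically close to, but not exactly equal to, the configured ones, because each operation is drawn independently.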
In the case of our testing applications, there are only three high-level operations: insert(), remove() and lookup(). Figure 8 indicates the relative frequency of those operations as registered in the trace file.

Figure 8: Relative frequency of user level operations in the application. Pie charts titled "Individual Transaction Percentages", with slices labeled Code 0, Code 1 and Code 2, for (LL, 8, 20%, 20%, 60%, 500) on the left and (LL, 8, 45%, 45%, 10%, 500) on the right.

The chart on the right has a relative frequency exactly as expected for a write-dominant context, i.e., 45% of inserts, 45% of removes and 10% of lookups. The chart on the left indicates 19% of inserts, 21% of removes and 61% of lookups, which even sum up to 101% and differ slightly from the values given on the command line, namely 20%, 20% and 60% respectively. These small variations are due to the use of a pseudo-random number generator to select the operation to be executed and to the rounding algorithm used to convert the percentages into integers.

5.1.5 Abort Types

The contention manager may order a transaction to abort when accessing a shared resource. If the transaction reaches the commit phase, a new validation phase is triggered, and the read set of the committing transaction is validated against the write sets of all remaining concurrent transactions. The transaction write set must also be validated against the read and write sets of all concurrent transactions. Thus, four different situations may trigger the aborting of a transaction (whether the transaction is really aborted depends on the contention manager): reading a shared resource that has been changed since the current transaction started; writing to a shared resource that has been changed since the current transaction started; the final validation of the read and write sets, just prior to committing the transaction; and an explicit request from the user/programmer. Figure 9 represents this information visually.

Figure 9: Relative frequency of abort types in the application. Pie charts titled "Abort Percentages", with categories User, Commit, Read and Write, for (LL, 8, 20%, 20%, 60%, 500) and (RB, 8, 20%, 20%, 60%, 500); write and user aborts are 0% in both charts.

In the case of our testing applications, the transactions are either read-only (lookup operations) or read-write. The latter are a special case of read-write transactions, as the write operation is always done in the very final moments of the transaction lifetime. The LL version has more commit-time aborts than the RB version. This is because LL transactions are very long and many conflicts will only be detected at commit time.

5.1.6 Read/Write Rates

Different application-level transactional operations exhibit different behavior in terms of memory access patterns. Some are read-only, some others are mixed read-write, and in some others, as in the case of our testing applications, the write operations, when they exist, are few (and take place at the very end of the transaction lifetime). Figure 10 illustrates this ratio between read and write accesses to the shared memory cells.

Figure 10: Relative frequency of memory accesses (read/write) within a transaction. Bar charts titled "Read/Write Transaction Rates" (vertical axis: %), showing the read and write shares for the individual transactions Code 0, Code 1 and Code 2, for (LL, 2, 20%, 20%, 60%, 500) and (RB, 2, 20%, 20%, 60%, 500).

From the figure it is possible to infer that the lookup operation (third column in both charts) does not update any memory location. It can also be seen that the RB tree test does a lot more memory updates than the LL test. This is due to the internal re-balancing of the RB tree. In the case of the LL, the update operations (insert and remove) iterate over the list nodes until the right node is found, and only then is the node updated and the transaction concluded. Thus, the proportion of reads/writes is much higher than in the case of the RB tree. Please note that both charts in the figure refer to tests with a range of only 50 different keys, which implies a list/tree with at most 50 different nodes. With larger lists/trees, the ratio of reads/writes will only increase.

6. RELATED WORK

As transactional memory is an emerging research area, little work has been done concerning tools to support the development of applications using transactional memory.

Yossi Lev in [7] presents a debugger which supports transactional memory. The work introduced new debugging mechanisms, hiding the inner workings of the transactional memory framework from the user.

Ansari in [1] presents a tool to profile the execution of applications that make use of transactional memory. The profiling tool was applied to non-trivial benchmarks, such as STAMP [2], to better understand which factors have more impact on the overall performance. Some of the statistical information provided by our tool has goals similar to those of Ansari's work.

Lourenço et al. in [8] presented some testing patterns that proved to be useful for testing and debugging transactional memory frameworks.

Harmanci et al. in [6] developed a tool to help in the design and optimization of transactional memory frameworks. The tool, TMUnit, provides a domain-specific language for specifying workloads, and tests the performance and semantics of transactional memory frameworks.

The works reported in [1,7] and the works reported in [6,8] have different goals. The former aim at aiding the development of transactional memory applications, while the latter aim at the development of transactional memory frameworks.

7. CONCLUDING REMARKS

Transactional memory is a new trend in concurrency control and there are not many tools available targeting software development using transactional memory, not even in the research community. To the best of our knowledge, all the known TM-oriented tools are reported in Section 6.

In this paper we presented a novel tool aiming at helping software developers to understand the behavior of transactional memory applications. Our tool relies on a very low overhead monitoring framework, developed specifically for monitoring TM computations, that collects the transactional events (start of a transaction, commit, abort, read and write of shared resources) in each thread and logs them, together with a time-stamp, into a thread-local memory buffer. When the application finishes, all the buffers are merged into a single one, using the time-stamps to impose a global ordering.

The tool uses the global log to display different types of charts. Until now, all the charts may be grouped into one of two main categories: statistical charts and state-time diagrams. Statistical charts rely on analysis modules that process the contents of the log file in order to display a specific kind of statistical information. The global log may also be processed to extract time-related information, such as which application-level operations are being executed concurrently, or how long it took to execute a specific transaction. For our tool we developed an interactive module that exhibits a threads/time chart, with thread IDs on the Y-axis and time on the X-axis. The status of each thread changes visually along time. The user may select an abort event in any thread and the tool will locate and point (with an arrow) to the beginning of the conflicting transaction that forced the initial one to abort.

The tool is being actively developed and many other charts, both statistical and interactive, may be developed.
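The merge of the per-thread buffers into a single globally ordered log, as described above, can be sketched as a k-way merge. This is a minimal illustration, assuming each thread buffer is already sorted by time-stamp; the `Event` record and the function name are hypothetical and not part of the monitoring framework.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    timestamp: int  # time-stamp recorded when the event was logged
    thread_id: int
    kind: str       # "start", "read", "write", "commit" or "abort"


def merge_thread_logs(buffers: Iterable[List[Event]]) -> List[Event]:
    """Merge per-thread buffers, each already sorted by time-stamp, into one global log."""
    return list(heapq.merge(*buffers, key=lambda event: event.timestamp))


thread0 = [Event(1, 0, "start"), Event(4, 0, "commit")]
thread1 = [Event(2, 1, "start"), Event(3, 1, "abort")]
global_log = merge_thread_logs([thread0, thread1])
print([event.timestamp for event in global_log])  # [1, 2, 3, 4]
```

Because each buffer is consumed in order, the merge only has to compare the head of every buffer at each step, instead of sorting the whole set of events again.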
Other statistics that could be interesting to collect would be, for example, the effective amount of wasted work for each transaction type. Other interactive functionalities would include, for example, cross-referencing between the transactional operations in the log file and locations in the original source code. We also plan to generate logs from other non-trivial TM benchmarking applications, such as STMBench7 [5], and the STAMP [2] and SPLASH2 [9] collections, and interpret the results using our visualization tool.

Acknowledgments

This work was partially supported by Sun Microsystems under the "Sun Worldwide Marketing Loaner Agreement #11497", by the Centro de Informática e Tecnologias da Informação (CITI), and by the Fundação para a Ciência e Tecnologia (FCT/MCTES) in the research projects PTDC/EIA/74325/2006, PTDC/EIA-EIA/108963/2008, and PTDC/EIA-EIA/113613/2009, and research grant SFRH/BD/41765/2007.

8. REFERENCES

[1] Mohammad Ansari, Kim Jarvis, Christos Kotselidis, Mikel Luján, Chris Kirkham, and Ian Watson. Profiling transactional memory applications. In PDP '09: Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and Network-based Processing. IEEE Computer Society Press, February 2009.

[2] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. STAMP: Stanford transactional applications for multi-processing. In IISWC '08: Proceedings of the IEEE International Symposium on Workload Characterization, September 2008.

[3] Gonçalo Cunha. Consistent state software transactional memory. Master's thesis, Universidade Nova de Lisboa, November 2007.

[4] Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In Distributed Computing, volume 4167, pages 194–208. Springer Berlin / Heidelberg, October 2006.

[5] Rachid Guerraoui, Michal Kapalka, and Jan Vitek. STMBench7: a benchmark for software transactional memory. SIGOPS Oper. Syst. Rev., 41(3):315–324, 2007.

[6] Derin Harmanci, Pascal Felber, Vincent Gramoli, and Christof Fetzer. TMUnit: Testing transactional memories. In 4th ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2009), February 2009.

[7] Yossi Lev. Debugging with transactional memory. In 1st ACM SIGPLAN Workshop on Transactional Computing (TRANSACT 2006), June 2006.

[8] João Lourenço and Gonçalo Cunha. Testing patterns for software transactional memory engines. In PADTAD '07: Proceedings of the 2007 ACM Workshop on Parallel and Distributed Systems: Testing and Debugging, pages 36–42, New York, NY, USA, 2007. ACM.

[9] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, New York, NY, USA, 1995. ACM.
