
Memory Workload Synthesis Using Generative AI

Chengao SHI
Hong Kong University of Science and Technology
Hong Kong, China

Fan JIANG
Hong Kong University of Science and Technology
Hong Kong, China

Zhenguo LIU
Hong Kong University of Science and Technology (GZ)
Guangzhou, Guangdong, China

Chen DING
University of Rochester
Rochester, New York, USA

Jiang XU
Hong Kong University of Science and Technology (GZ)
Guangzhou, Guangdong, China

ABSTRACT

Generative AI, a novel general-purpose technology, has the capability to produce outputs closely resembling its training inputs. This paper presents the first use of such learning in memory workload synthesis. By devising three learning techniques and comparing them with two existing techniques, the paper shows that, by carefully choosing the training input and processing the generated output, AI-based synthesis can produce workloads that have similar or better accuracy than the best of existing methods.

KEYWORDS

Computer systems organization → Memory system; Neural networks; Statistical simulation

1 INTRODUCTION

The recent release of ChatGPT and GPT-4 by OpenAI [19] has gained immense popularity since November 2022, amassing over 100 million users by January 2023 and demonstrating the remarkable capabilities of GPT (Generative Pre-trained Transformer). The transformer architecture was introduced in 2017 [21]. This architecture breaks away from previous neural networks in significant ways. Instead of convolution, it is based on a technique called multi-headed attention, which allows parallel encoding. Unlike the dense activation in convolution, the attention activation is sparse. Both parallel encoding and sparse activation provide greater speed, better scalability, and faster learning than convolutional or recurrent neural networks.

More importantly and surprisingly, it is capable of general-purpose learning and has been used for a myriad of purposes, including summarizing and graphing experimental results and writing papers.

In this work, we study the use of generative AI in memory workload synthesis. Cycle-level simulators, such as Gem5 [5], are invaluable tools in computer architecture research. However, they suffer from prohibitively high computational costs for large-scale applications. A promising approach to reduce the simulation time is workload synthesis. By constructing shortened workloads, or microbenchmarks, through statistical profiles, it captures key program behaviours including the memory access patterns and data dependences of the profiled applications [1, 2, 12].

The nature of learning by generative AI is not well defined. One may say that it generates output similar to the input, but without a precise definition of what exactly this similarity means. For workload synthesis, there are at least two types of similarity that are important. First, the latency of a memory access depends on nearby instructions. Second, the cache behaviour depends on data reuses that may span many instructions. The essence of this research hinges on two pivotal questions: (i) Can Transformer models effectively capture and reproduce the memory behaviours exhibited by program traces? (ii) How do these models compare against traditional methods in terms of synthesis quality and computational overhead?

The answers to these questions will help us understand how workload synthesis can benefit from generative AI to construct representative microbenchmarks. In tandem with fast yet accurate simulators, we believe transformer-enabled synthesis promises to accelerate computer architecture research in the future.

2 BACKGROUND

2.1 Synthesis of Memory Workloads

Memory workload synthesis is a computational process that aims to generate synthetic traces that reproduce the memory behaviour of original workloads. Typically, statistical profiles are collected from the binary-instrumented workloads without storing the original traces, ensuring client confidentiality. These profiles later direct the workload generator, producing either synthetic memory traces or executables, to ensure a holistic capturing of memory behaviour. Various statistical models have been proposed [1, 3, 18], aiming to capture the memory behaviour more accurately.

The pursuit of accurately capturing memory behaviour using locality features has led to the proposal of sophisticated methodologies and frameworks. WEST [3] builds its model around temporal locality by harnessing the LRU stack distance, or reuse distance [11], distribution on the cache hierarchy. Yet, the focus on temporal locality limits its ability to model microarchitectural structures, such as prefetchers, that harness spatial locality. On the other hand, STM [1] emerges as a more promising solution. While it utilizes a framework similar to WEST's, a pivotal distinction is introduced: STM integrates both temporal and spatial locality, facilitating a more in-depth understanding of memory behaviour and versatile explorations across memory hierarchies. Later, HRD [18] was proposed as an effective and efficient generalization of reuse distance that captures spatial locality measured at multiple data-block granularities in addition to the temporal locality model of traditional reuse distance.
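To make the reuse distance concrete, below is a minimal Python sketch (ours, not the paper's tooling) that computes cache-line reuse distance for an address trace. It assumes a 64-byte cache line and the stack-distance convention of Table 1 in Section 3.2, where an immediate reuse of the same line has distance 1; applied to the nine addresses in that table, it reproduces the table's RD row. (Sections 4.4–4.5 instead count distinct intervening lines, so their values are smaller by one.) Production profilers use tree-based structures for speed, whereas this version is quadratic in the worst case.

```python
from collections import OrderedDict

def reuse_distances(addresses, line_size=64):
    """LRU stack distance per access at cache-line granularity.

    Returns inf for a cold access; otherwise the depth of the line in
    the LRU stack (1 = the line was used by the previous access).
    """
    stack = OrderedDict()              # lines in LRU order, oldest first
    distances = []
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            keys = list(stack)
            distances.append(len(keys) - keys.index(line))
            stack.move_to_end(line)    # the line becomes most recent
        else:
            distances.append(float("inf"))
            stack[line] = None
    return distances

# The addresses of Table 1 (Section 3.2):
print(reuse_distances([3672, 2138, 3712, 3672, 3888, 168, 3888, 3712, 3712]))
# [inf, inf, inf, 3, inf, inf, 2, 4, 1] -- matching the table's RD row
```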

2.2 Transformer in Tabular Data Synthesis

Synthetic data, especially for image, video, and speech generation, is witnessing a surge in applications. Given the prevalence of tabular data in industry, there is a growing demand for tabular data synthesis to address challenges related to privacy and proprietary assets [6]. For instance, over 67% of datasets on the Google Dataset Search platform are categorized as structured or tabular data, often in formats such as CSV, XML, and JSON [4].

However, given the cost-intensive nature of data gathering and preprocessing, tabular datasets frequently exhibit several characteristics that hamper the utility of contemporary machine learning techniques: (1) imbalanced label distributions, especially long-tailed patterns, which make generalization even harder [9]; (2) sensitive personal details, which make data sharing untenable due to privacy and socio-ethical considerations [13]; and (3) noisy or missing values, which are common in real-life scenarios and for which several data imputation methods have been proposed [16].

Traditionally, tabular data synthesis relies on techniques like Variational Autoencoders (VAE) [15] and Generative Adversarial Networks (GAN) [14]. These approaches often necessitate careful data preprocessing, such as encoding categorical columns or transforming numerical ones, which impacts the generative model's performance. Conversely, transformer-based models like GReaT [7] utilize transformer-decoder networks and leverage auto-regressive generative large language models (LLMs) to create highly authentic synthetic data. This approach offers more probabilistic control over sampling, eliminates the preprocessing bottleneck, and displays state-of-the-art performance over existing methods at various data sizes.

3 TRANSFORMER BASED WORKLOAD SYNTHESIS

3.1 Why Use a Tabular Transformer

From the perspective of the data itself, the memory behaviour of a program is sequentially correlated, with data dependences between preceding and succeeding elements. The behaviour of an individual memory access is also defined by attributes such as the read/write operation and the address. Therefore, we need to include the bindings between the different attributes of individual memory accesses in the input to allow the Transformer model to learn them.

Unlike other forms of data such as time series or images, tabular data encapsulate a multitude of information in a structured manner, often representing multifaceted relationships. Each column in a table may represent distinct features or related attributes, and each row often represents a unique instance or record. As such, the synthesis of tabular data becomes critically essential, yet it is this very structure and complexity that makes the task particularly challenging.

A tabular generative model open-sourced by World Bank researchers is the Realistic Relational and Tabular Transformer (REaLTabFormer) [20]. It focuses on generating single-parent relational data and employs a GPT-2 encoder with a causal language model (LM) head to independently model the parent table. Once trained, this encoder remains static and serves to conditionally model the child tables. Notably, each child table necessitates a distinct conditional model, which is manifested as a sequence-to-sequence (Seq2Seq) transformer. A GPT-2 decoder equipped with a causal LM head is trained to produce observations from the child table, ensuring flexibility for variable lengths.

3.2 Trace-Driven Fine-Tuning

In transformer-based synthesis, the training process involves feeding a program trace to REaLTabFormer [20]. We call this process the training period and its cost the training time. Subsequent to the training, a new trace of the prescribed length is generated. We denote this process the generation period and its cost the generation time. The cost of synthesis is the total cost of these two steps.

Central to our method is the concept of a program trace, which records a sequence of memory accesses. In this work, we use Intel Pin [17] to profile the memory traces, where each access is a row containing the following five fields: running instruction count (IC), read or write (Op), physical address (Addr), data item size (Size), and program counter (PC). We also calculate an extra row of reuse distance (RD) as a locality metric, based on the addresses of data items and a given cache-line size. As an example, Table 1 shows a trace with nine load and store instructions. It is a memory workload: non-memory instructions, i.e. the missing instruction counts, are removed.

Table 1: An Example of Program Trace

Instruction Count     1     2     3     5     6    10    13    20    21
Read/Write            0     0     0     0     0     1     1     0     1
Physical Address   3672  2138  3712  3672  3888   168  3888  3712  3712
Data Size             8     1     8     8     8     8     8     4     4
Program Counter    1726    53    58    63  1734  3120  3122  3124  3126
Reuse Distance        ∞     ∞     ∞     3     ∞     ∞     2     4     1

Given a program trace, we use three methods to synthesize a shorter trace. In these methods, the program trace is modified to carry different information, as discussed below. The output generated by the tabular transformer has the same format as the input trace it takes. We then convert the output trace to have the five fields of the program trace.

In the basic method, the instruction count (IC) field is removed from the training input, which has only the remaining four fields. Removing the IC field reduces the training time by about 20% but has no significant effect on the generation time. Without IC, the column index is used by the transformer. The actual IC includes only memory-access instructions and is not continuous. In the generated trace, the column index is used as the IC.

In the second method, a field is added to contain the reuse distance (RD) of each data access. The reuse distance is at cache-line granularity and is measured in the number of cache lines. We still remove the IC field, so the input has five fields for each row. During trace generation, we use the generated RD as follows: if the RD is larger than 64, we use the generated physical address; if the RD is less than or equal to 64, we generate a physical address that has this RD. The IC field is added using the column index, as in the first method.

In the third method, we include the IC field but do not use any new field such as the RD of the second method. The generated trace has the same format as the program trace.
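As an illustration of the Tab-RD method just described, the sketch below fine-tunes REaLTabFormer on a trace table and applies the RD-based post-processing. The fit/sample calls follow REaLTabFormer's documented interface, but the file names, column names, and the address-reconstruction rule (taking the RD-th most recently used line) are our own illustrative assumptions, not the paper's code.

```python
import pandas as pd
from realtabformer import REaLTabFormer

LINE = 64  # cache-line size in bytes

# Tab-RD training input: drop IC, keep Op/Addr/Size/PC, and add RD.
trace = pd.read_csv("trace.csv")                 # hypothetical profiled trace
train = trace[["Op", "Addr", "Size", "PC", "RD"]]

model = REaLTabFormer(model_type="tabular")      # GPT-2 based tabular model
model.fit(train)                                 # training period
synth = model.sample(n_samples=1_000_000)        # generation period

history, rows = [], []                           # cache lines in LRU order
for i, row in enumerate(synth.itertuples(index=False), start=1):
    rd = row.RD
    if 1 <= rd <= min(64, len(history)):
        addr = history[-int(rd)] * LINE          # enforce the generated RD
    else:
        addr = int(row.Addr)                     # trust the generated address
    line = addr // LINE
    if line in history:
        history.remove(line)
    history.append(line)                         # most recently used at the end
    rows.append((i, row.Op, addr, row.Size, row.PC))  # column index becomes IC

out = pd.DataFrame(rows, columns=["IC", "Op", "Addr", "Size", "PC"])
out.to_csv("synthesized.csv", index=False)
```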

4 EVALUATION

4.1 Experimental Setup

We use 12 different workloads from SPEC2017 [8].¹ In order to capture the important phases of each workload, we fast-forward about 10 billion instructions before trace collection and stop once 10 million memory requests are recorded. All the traces are collected using a Pin based tool.

¹ The selected programs from SPEC2017 are 600.perlbench_s, 602.gcc_s, 603.bwaves_s, 605.mcf_s, 607.cactuBSSN_s, 619.lbm_s, 620.omnetpp_s, 623.xalancbmk_s, 631.deepsjeng_s, 638.imagick_s, 641.leela_s, and 644.nab_s, all with the reference input.

We use the three methods described in Section 3.2 to produce the synthesized trace using generative AI. In addition, we use two previous techniques, hierarchical reuse distance (HRD) and STM, as introduced in Section 2. In all five methods, we generate a trace of one million memory requests for each application, which is one-tenth of the profiled trace.

We set up the Gem5 [5] simulator as a platform for validation. Given a trace of memory requests, we measure both the timing and the data movement. The timing is measured by the access latency of each request, and the data movement by the miss ratios at the L1 and L2 caches. The L2 cache size is 256KB. For L1, we simulate eight configurations from 128 bytes, doubling at each step, to 16KB. We measure how accurate a synthesized trace is compared to the original trace in the timing and the miss ratios reported by Gem5.

4.2 Synthesis Costs

For the three REaLTabFormer based methods, we used NVIDIA A100 [10] Tensor Core GPUs, based on the Ampere architecture, with 40GB of high-bandwidth memory (HBM2) providing about 1.6 TB/s of memory bandwidth. The GPUs are rented from Google Colab. The A100 delivers 19.5 TFLOPS of FP32 and 156 TFLOPS of Tensor Float 32 performance. Each run of REaLTabFormer was on a single A100 GPU.

For HRD and STM, we employed an Intel 12th-generation (Alder Lake) Core i9-12900 multicore processor, featuring 16 cores (24 threads) with base and turbo frequencies of 2.4 GHz and 5.1 GHz, respectively. The system incorporated a 30MB cache and dual-channel DDR4-3600 memory. Both HRD and STM were executed in single-threaded mode.

Table 2 compares the time taken by the five methods. The three AI based methods take over two hours, while HRD takes less than a minute and STM slightly over 4 minutes. The table shows the breakdown of the cost. The most time-consuming step in the AI based methods is training. Although Tab-RD requires post-processing in the generation step, it still takes less time than Tab-IC.

4.3 Synthesis Quality

The quality is measured by the similarity between the results of the original and the synthesized workload. We evaluate two measures: the timing, shown by the read latency, and the data movement, shown by the miss ratios at the two cache levels.

4.3.1 Read Latency. We first evaluate the timing accuracy. Table 3 shows the mean memory read latency in μs.² For each configuration (L1 size) and each method, the table shows the geomean of the average read latency of the 12 workloads. The second column, named Reference, shows the ground truth, which is the read latency for the trace before synthesis. The remaining columns compare the latency results from the five synthesis techniques: three based on AI and two previous analytical methods. In addition to the mean latency, each table cell shows the range, which is the geomean of the highest and the lowest latency values of the 12 workloads.

² When asked to describe Table 3, GPT-4 replied "The table represents latency measurements for two main categories: REaLTabFormer (with its three methods: Tab-Base, Tab-RD, and Tab-IC) and Traditional (with its two methods: HRD and STM). The measurements are taken across different L1 cache sizes."

The second column shows that for these workloads, the read latency starts high, drops precipitously around 2KB, and then becomes negligible. It is higher than 33 μs for cache sizes 128B to 1KB, 25 μs at 2KB, and stays at 2.1 μs at 4KB and greater sizes.

Among the five synthesis techniques, Tab-RD is the most accurate for cache sizes smaller than 2KB. The other four methods are less accurate but similar to each other, except for Tab-IC, whose latency is consistently lower than the ground truth. For cache sizes larger than 2KB, HRD is the most accurate, followed by STM and Tab-IC. Tab-Base and Tab-RD are the least accurate, although the absolute difference is small (less than 3 μs). None of the methods is accurate for the 2KB threshold cache size: HRD is the closest, over 6 μs higher, and Tab-Base is the farthest, over 14 μs higher. For L1 cache sizes from 4KB to 16KB, the read latency of all methods becomes very small, with REaLTabFormer's Tab-Base and Tab-RD showing much higher latency than the other methods.

Table 4 shows the comparison for the 12 workloads. Averaged over the 8 hardware configurations (varied L1 cache size), the table shows the geomean read latency of each workload and those from the five synthesis techniques. The ground truth given by the reference trace is shown in the second column. The remaining columns show first the three AI-based techniques and then the two conventional techniques.

The reference read latency varies between 24 and 38 μs across the 12 workloads, with the geomean at 32.5 μs. By the geomean, HRD is the most accurate at 33.8 μs, followed by Tab-RD at 36.3 μs and STM at 37.2 μs, while the result is too high from Tab-Base and too low from Tab-IC. Tab-Base, at 39.0 μs, is close to Tab-RD and STM. Looking at each workload, HRD is the most accurate in all individual cases. Tab-RD is more accurate than STM in more than half of the workloads: 605, 607, 619, 623, 631, 638, and 644.

The main difference between the AI-based and traditional techniques comes from the latency results for workloads 631, 638, and 641, which have the largest relative errors in both Tab-Base and Tab-RD, while HRD maintains good accuracy in these cases. Overall, from the perspective of workloads, HRD is the most accurate in reproducing the reference read latency.

4.3.2 Miss Ratios. Tables 5 to 8 compare the L1 and L2 miss ratios, first for different cache configurations and then for different workloads. The L1 miss ratio, as reported by Gem5, is the number of L1 misses divided by the number of L1 accesses, including both reads and writes. The L2 miss ratio is the number of L2 misses divided by the number of L2 accesses.
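All aggregate cells in Tables 3 through 8 are geometric means. A minimal sketch of this reduction follows; the numbers below are made up for illustration, not taken from the tables.

```python
import numpy as np

def geomean(values):
    """Geometric mean: the n-th root of the product of n values."""
    values = np.asarray(values, dtype=float)
    return float(np.exp(np.log(values).mean()))

# Hypothetical mean read latencies (us) of three workloads at one L1 size;
# an actual table cell would aggregate all 12 workloads this way.
per_workload_means = [32.7, 30.5, 38.5]
print(f"cell value: {geomean(per_workload_means):.2f} us")
```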

Table 2: Comparison of Synthesis Time (h:m:s)

Methods            Training/Profiling       Generation               Total Cost

REaLTab   Base     1:39:43±0:06:46          0:20:27±0:00:90          2:00:10±0:06:51
REaLTab   RD       2:13:42±0:04:90          0:29:60±0:01:47          2:43:42±0:05:12
REaLTab   IC       2:37:18±0:09:94          0:34:20±0:01:03          3:11:38±0:09:99
Trad      HRD      0:00:04.49±0:00:00.36    0:00:50.14±0:00:05.67    0:00:54.63±0:00:05.73
Trad      STM      0:02:50.12±0:00:00.85    0:01:13.20±0:00:00.82    0:04:03.31±0:00:01.09

Table 3: Read Latency (μs) of System Configurations

L1 Size    Reference    Tab-Base    Tab-RD    Tab-IC    HRD    STM
(Tab-Base, Tab-RD, and Tab-IC are the REaLTabFormer methods; HRD and STM are the traditional methods.)
128B 37.4137±9.92497 38.7896±9.3345 38.9558±9.7325 29.6694±9.5026 34.4064±10.2590 41.8590±7.0406
256B 36.8903±10.2554 39.0276±9.4335 37.6864±10.3045 29.7062±9.4897 34.0346±10.3563 40.6587±9.5634
512B 36.0381±10.4466 39.4092±9.5608 36.6487±10.5516 29.2366±9.4188 33.6215±10.3833 39.3389±11.5229
1KB 33.4471±10.9816 39.8508±9.6357 35.9896±10.5616 25.9639±10.0382 33.2077±10.3090 37.6201±13.8018
2KB 25.2486±12.4658 39.3350±9.8288 33.3499±11.3098 17.3430±11.0772 31.5611±10.6096 33.6720±16.4813
4KB 2.1419±2.6475 4.8042±11.4474 4.8199±11.5312 3.6307±3.4630 2.4147±3.6816 2.4979±3.8404
8KB 2.1415±2.6474 4.7707±11.4095 4.8080±11.5185 3.6285±3.4569 2.1323±2.5162 2.4979±3.8404
16KB 2.1415±2.6474 4.7707±11.4095 4.8080±11.5185 3.6285±3.4569 2.1323±2.5162 2.4979±3.8404
Geomean 11.9387±6.36714 17.8324±10.2179 17.0661±10.8593 12.3880±6.66805 12.0785±6.39936 13.8086±7.49812

Table 4: Read Latency (μs) of Workloads

App    Reference    Tab-Base    Tab-RD    Tab-IC    HRD    STM
600.perlbench_s 32.719±11.470 41.003±7.601 38.191±9.735 23.930±10.785 35.368±10.871 35.518±14.207
602.gcc_s 30.471±10.880 38.397±9.425 36.676±10.478 35.379±10.779 32.058±10.552 34.639±15.205
603.bwaves_s 38.542±8.805 39.268±9.413 35.046±10.791 40.726±7.485 37.341±9.908 41.671±7.409
605.mcf_s 32.243±10.614 40.575±8.230 37.814±10.103 24.486±10.600 34.100±10.402 38.610±11.674
607.cactuBSSN_s 35.725±10.352 38.915±9.290 35.400±10.619 25.582±8.855 36.990±9.909 41.321±7.573
619.lbm_s 33.532±10.775 38.682±9.449 33.481±10.733 23.934±8.747 34.508±10.545 38.967±11.962
620.omnetpp_s 34.243±10.813 40.215±8.521 39.230±9.503 24.160±9.877 35.414±10.582 37.248±13.192
623.xalancbmk_s 33.886±10.549 39.056±9.334 37.747±10.300 24.912±10.065 31.507±10.374 38.673±12.401
631.deepsjeng_s 24.824±1.262 38.274±9.187 38.279±9.184 23.474±1.969 24.824±1.263 7.200±8.843
638.imagick_s 26.634±10.512 33.727±11.026 33.211±11.187 21.440±7.670 29.282±9.564 39.482±10.202
641.leela_s 30.311±12.583 40.422±8.669 38.685±9.946 22.193±11.637 33.937±10.927 38.221±12.697
644.nab_s 33.610±10.812 40.219±8.283 34.301±10.674 22.452±9.068 32.888±10.576 37.651±12.262
Geomean 32.466 ± 10.612 38.998 ±9.068 36.364 ±10.372 25.051 ± 9.953 33.838 ±10.338 37.173 ± 12.022

L2 miss ratios are trivial, either 0% or close to 100%. The accuracy is best measured by the L1 miss ratio.

Table 5 shows that the miss ratios are positive for cache sizes from 128B to 2KB. The last row shows that, when measured by the geomean, HRD and Tab-RD are more accurate in reproducing the reference trace result. This is consistent with the timing results in Table 3, where HRD and Tab-RD are the most accurate for cache sizes up to 2KB.

Table 6 compares L1 cache miss ratios from the perspective of different workloads. After taking the geomean over all hardware configurations, the reference L1 miss ratios vary widely across workloads, from 0.063 (631) to 0.652 (607). In general, HRD and Tab-RD are the most accurate among all 12 workloads. While HRD performs much better on 631 and 619, Tab-RD excels on 605 and 638. Tab-Base has the largest relative error across all workloads. Notably, none of the AI-based techniques fits well the simplest trace pattern, which comes from workload 631; this is the reason Tab-RD falls slightly behind HRD after taking the geomean. We will discuss workload 631 in Section 4.5.

4.4 The Reuse Distance (RD) Threshold

Tab-RD uses the threshold of RD=64 (Section 3.2). We have tested different thresholds above and below but found that this particular threshold gave the best overall result. Upon closer inspection, we see that the threshold includes most of the data reuses.

Table 9 shows the Cumulative Distribution Function (CDF) of the RD histograms of the original 10M traces for each workload.

Table 5: L1 Cache Miss Ratio of System Configurations

L1 Size    Reference    Tab-Base    Tab-RD    Tab-IC    HRD      STM
128B       0.479        0.905       0.460     0.738     0.561    0.648
256B       0.387        0.825       0.355     0.665     0.502    0.575
512B       0.284        0.691       0.269     0.553     0.43     0.445
1KB        0.179        0.502       0.191     0.406     0.335    0.287
2KB        0.084        0.267       0.104     0.236     0.198    0.137
4KB        0.0          0.0         0.0       0.001     0.001    0.0
8KB        0.0          0.0         0.0       0.001     0.0      0.0
16KB       0.0          0.0         0.0       0.001     0.0      0.0
Geomean    0.227        0.566       0.261     0.490     0.402    0.391

Table 6: L1 Cache Miss Ratio of Workloads

App        Reference    Tab-Base    Tab-RD    Tab-IC    HRD      STM
600        0.383        0.622       0.331     0.620     0.442    0.565
602        0.265        0.831       0.538     0.457     0.462    0.555
603        0.400        0.677       0.715     0.630     0.727    0.617
605        0.283        0.756       0.297     0.450     0.385    0.580
607        0.652        0.764       0.803     0.762     0.804    0.749
619        0.576        0.706       0.754     0.699     0.665    0.666
620        0.363        0.682       0.343     0.675     0.446    0.598
623        0.336        0.710       0.453     0.570     0.488    0.511
631        0.063        0.878       0.879     0.498     0.063    0.061
638        0.198        0.582       0.116     0.475     0.530    0.287
641        0.276        0.634       0.310     0.505     0.365    0.561
644        0.464        0.777       0.788     0.684     0.830    0.640
Geomean    0.351        0.706       0.487     0.577     0.464    0.533

Table 7: L2 Cache Miss Ratio of System Configurations

L1 Size    Reference    Tab-Base    Tab-RD    Tab-IC    HRD      STM
128B       0.0          0.0         0.001     0.0       0.0      0.0
256B       0.0          0.0         0.001     0.0       0.0      0.0
512B       0.0          0.0         0.001     0.0       0.0      0.0
1KB        0.0          0.001       0.001     0.0       0.0      0.0
2KB        0.0          0.001       0.002     0.001     0.0      0.0
4KB        0.998        0.891       0.949     0.174     0.033    1.0
8KB        1.0          1.0         1.0       0.19      0.999    1.0
16KB       1.0          1.0         1.0       0.19      1.0      1.0
Geomean    0.999        0.297       0.434     0.072     0.182    1.0

Table 8: L2 Cache Miss Ratio of Workloads

App        Reference    Tab-Base    Tab-RD    Tab-IC    HRD      STM
600        0.128        0.213       0.209     0.215     0.052    0.116
602        0.139        0.158       0.169     0.000     0.054    0.108
603        0.137        0.170       0.165     0.000     0.036    0.113
605        0.131        0.167       0.170     0.194     0.065    0.113
607        0.112        0.161       0.156     0.159     0.032    0.101
619        0.104        0.165       0.160     0.173     0.040    0.100
620        0.123        0.174       0.181     0.181     0.055    0.111
623        0.127        0.171       0.167     0.180     0.042    0.112
631        0.142        0.144       0.099     0.151     0.061    0.152
638        0.196        0.184       0.181     0.215     0.045    0.155
641        0.140        0.188       0.187     0.206     0.041    0.115
644        0.125        0.171       0.162     0.181     0.036    0.110
Geomean    0.138        0.167       0.163     0.072     0.045    0.113

It shows that for most workloads, 95% of their RDs are below this threshold. A high CDF value at RD=64 means that RDs beyond this threshold are relatively infrequent. By focusing on the RD values below this threshold, the fine-tuning by Tab-RD captures most of the data reuses for most of the workloads.

4.5 A Case of Poor AI Learning

The cumulative RD distributions in Table 9 show a special case, which is 631.deepsjeng_s. In this workload, virtually all reuses have a reuse distance of 0, which means that they are consecutive uses of the same cache block. Checking the 10M profiled trace, we see that the trace only captures a data initialization loop. The entire trace consists of only memory writes to consecutive word addresses, resulting in a long sequence of RDs of 0,³ with a relatively small number of cold misses.⁴

While 631.deepsjeng_s has this trivial memory access pattern, AI synthesis performs unusually poorly on it. The geomean L1 miss ratio is 0.063 for the reference workload, but it is 0.5 by Tab-IC and over 0.8 by Tab-Base and Tab-RD. The simplest pattern is not well reproduced in AI-based synthesis, leading to large relative errors in terms of both latency and miss ratio results. If the results from app 631 are excluded, Tab-RD outperforms HRD.

³ During the experiments, we measure the RD based on the physical address of each memory request. The calculation of RD is in terms of the cache block size, presuming a cache-line size of 64 bytes. An RD of 0 means that two consecutive addresses are in the same cache block.
⁴ In a contiguous word-by-word traversal, there is a cache miss every 16 accesses.

4.6 Main Findings

From the results, we make the following observations:

(1) REaLTabFormer, a recent AI technique based on large language models (LLMs), when used out of the box, is not as good as (less accurate than) existing analytical techniques.
(2) By augmenting the training input with a domain-specific measure, the AI technique produces results as accurate as or slightly better than the best of existing techniques. For example, Tab-IC offers much better read latency results, and Tab-RD achieves nearly equal or slightly better miss-ratio accuracy compared to HRD.
(3) Not only is the accuracy of AI learning highly sensitive to the fields supplied in the input, but so is the cost of AI training. However, the longest learning time, 2.5 hours in Tab-IC, actually produces less accurate results than Tab-RD, whose training is half an hour faster.
(4) AI techniques can learn to reproduce similar instruction sequences and similar reuse patterns. The most accurate learning, by Tab-RD, depends on the selection of the fields in the input and on the post-processing of the output.
(5) Tab-RD shows that RD is better than AI for synthesizing short-distance reuses, while AI is better at long-distance reuses.
(6) Learning using one particular piece of information, the instruction count, requires the highest training and generation time but produces the lowest accuracy among all methods.
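Footnote 4's rate is easy to verify: with 64-byte cache lines and 4-byte words, a contiguous traversal enters a new line every 64/4 = 16 accesses. Below is a self-contained sketch; the 4-byte word size is our assumption, consistent with one miss per 16 accesses.

```python
LINE, WORD = 64, 4   # cache-line and word sizes in bytes (assumed)

def cold_misses(n_accesses):
    """Misses of a contiguous word-by-word traversal that starts with
    an empty cache: one miss each time a new line is entered."""
    misses, last_line = 0, None
    for i in range(n_accesses):
        line = (i * WORD) // LINE
        if line != last_line:
            misses, last_line = misses + 1, line
    return misses

print(cold_misses(16_000))   # -> 1000: exactly one miss per 16 accesses
```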

Table 9: Cumulative Reuse Distance (RD) Distributions

Applications         RD=64    RD=32    RD=16    RD=8     RD=4     RD=2     RD=1     RD=0
600.perlbench_s 0.953 0.884 0.794 0.677 0.563 0.477 0.414 0.334
602.gcc_s 0.970 0.908 0.845 0.775 0.705 0.636 0.590 0.508
603.bwaves_s 0.873 0.790 0.718 0.634 0.480 0.178 0.096 0.048
605.mcf_s 0.983 0.954 0.904 0.705 0.556 0.447 0.384 0.282
607.cactuBSSN_s 0.916 0.770 0.577 0.375 0.213 0.098 0.060 0.030
619.lbm_s 0.950 0.874 0.773 0.597 0.384 0.176 0.086 0.041
620.omnetpp_s 0.961 0.889 0.824 0.730 0.614 0.532 0.482 0.353
623.xalancbmk_s 0.969 0.855 0.803 0.734 0.698 0.641 0.615 0.459
631.deepsjeng_s 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
638.imagick_s 0.886 0.809 0.777 0.643 0.627 0.519 0.506 0.032
641.leela_s 0.958 0.899 0.797 0.738 0.660 0.556 0.505 0.354
644.nab_s 0.942 0.872 0.773 0.605 0.474 0.129 0.050 0.023
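The threshold analysis in Section 4.4 amounts to evaluating this CDF at RD=64. A minimal sketch follows, using the RD row of Table 1 as input; keeping cold accesses (infinite RD) in the denominator is our assumption, since Table 9 does not state how they are counted.

```python
def rd_cdf(rds, threshold):
    """Fraction of accesses whose reuse distance is <= threshold."""
    return sum(1 for rd in rds if rd <= threshold) / len(rds)

rds = [float("inf"), float("inf"), float("inf"), 3, float("inf"),
       float("inf"), 2, 4, 1]                  # RD row of Table 1
for t in (64, 32, 16, 8, 4, 2, 1, 0):
    print(f"P(RD <= {t}) = {rd_cdf(rds, t):.3f}")
```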

5 SUMMARY

In this paper, we have developed a new technique for memory workload synthesis based on generative AI that performs slightly better than the best of existing techniques for smaller cache sizes and similarly for the other cache sizes. It uses the reuse distance as additional information in training, and the output is modified based on the generated reuse distance. The combination takes less time than the method using unmodified input and output, yet its synthesis accuracy is significantly higher.

As with any early-stage research, there are opportunities for improvement and further investigation. Several aspects can be improved in future work. One observation from our simulation results on Gem5 is that these workloads seem to fit neatly into 4KB of L1 cache, which is rather small. This can be addressed by profiling diverse traces of these workloads that include memory-intensive regions. Also, we collected traces purely with Intel Pin. A natural progression will be to integrate mechanisms for trace gathering from weakly ordered memory infrastructures such as ARM, RISC-V, and MIPS. The model can also be extended to include atomics, transactional requests, and flushes. Encapsulating these request types could further improve the accuracy of the models with respect to actual memory subsystems.

In conclusion, while our current research offers foundational insights, we recognize the potential avenues for improvement and re-evaluation. We are optimistic that addressing these will not only fortify our methodology but also offer valuable insights.

ACKNOWLEDGMENTS

We thank the anonymous reviewers of MEMSYS for their constructive feedback and Nathan Ganger, Dylan McKellips, and Yifan Zhu for their help with the presentation of the material.

This work was supported by the Guangzhou-HKUST(GZ) Joint Funding Program (No. 2023A03J0013). This work was also supported by the National Science Foundation (Contract No. CCF-2217395, CCF-2114319). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.

REFERENCES

[1] Amro Awad and Yan Solihin. 2014. STM: Cloning the spatial and temporal memory access behavior. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). https://doi.org/10.1109/hpca.2014.6835935
[2] Mario Badr, Carlo Delconte, Isak Edo, Radhika Jagtap, Matteo Andreozzi, and Natalie Enright Jerger. 2020. Mocktails: Capturing the Memory Behaviour of Proprietary Mobile Architectures. In Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA '20). IEEE Press, 460–472. https://doi.org/10.1109/ISCA45697.2020.00046
[3] Ganesh Balakrishnan and Yan Solihin. 2012. WEST: Cloning data cache behavior using Stochastic Traces. In IEEE International Symposium on High-Performance Computer Architecture. 1–12. https://doi.org/10.1109/HPCA.2012.6169042
[4] Omar Benjelloun, Shiyu Chen, and Natasha Noy. 2020. Google Dataset Search by the Numbers. arXiv:2006.06894 [cs.IR]
[5] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The Gem5 Simulator. SIGARCH Comput. Archit. News 39, 2 (Aug 2011), 1–7. https://doi.org/10.1145/2024716.2024718
[6] Vadim Borisov, Tobias Leemann, Kathrin Sessler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. 2022. Deep Neural Networks and Tabular Data: A Survey. IEEE Transactions on Neural Networks and Learning Systems (Jan 2022), 1–21. https://doi.org/10.1109/tnnls.2022.3229161
[7] Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. 2023. Language Models are Realistic Tabular Data Generators. arXiv:2210.06280 [cs.LG]
[8] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. 2018. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE '18). 41–42.
[9] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. 2019. Learning Imbalanced Datasets with Label-Distribution-Aware Margin Loss. arXiv:1906.07413 [cs.LG]
[10] Jack Choquette and Wish Gandhi. 2020. NVIDIA A100 GPU: Performance & Innovation for GPU Computing. In 2020 IEEE Hot Chips 32 Symposium (HCS). 1–43. https://doi.org/10.1109/HCS49909.2020.9220622
[11] C. Ding and Y. Zhong. 2001. Reuse Distance Analysis. Technical Report UR-CS-TR-741. University of Rochester.
[12] L. Eeckhout, J. Sampson, and B. Calder. 2006. Exploiting program microarchitecture independent characteristics and phase behavior for reduced benchmark suite simulation. In Proceedings of the 2005 IEEE International Symposium on Workload Characterization. https://doi.org/10.1109/iiswc.2005.1525996
[13] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. 2016. Privacy-Preserving Distributed Linear Regression on High-Dimensional Data. Cryptology ePrint Archive, Paper 2016/892. https://doi.org/10.1515/popets-2017-0053 https://eprint.iacr.org/2016/892
[14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. arXiv:1406.2661 [stat.ML]
[15] Diederik P. Kingma and Max Welling. 2022. Auto-Encoding Variational Bayes. arXiv:1312.6114 [stat.ML]
[16] Wei-Chao Lin and Chih-Fong Tsai. 2020. Missing Value Imputation: A Review and Analysis of the Literature (2006–2017). Artif. Intell. Rev. 53, 2 (Feb 2020), 1487–1509. https://doi.org/10.1007/s10462-019-09709-4
[17] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. 2005. Pin: building customized program analysis tools with dynamic instrumentation. ACM SIGPLAN Notices (Jun 2005), 190–200. https://doi.org/10.1145/1064978.1065034
[18] Rafael K. V. Maeda, Qiong Cai, Jiang Xu, Zhe Wang, and Zhongyuan Tian. 2017. Fast and Accurate Exploration of Multi-level Caches Using Hierarchical Reuse Distance. In Proceedings of the International Symposium on High-Performance Computer Architecture. 145–156. https://doi.org/10.1109/HPCA.2017.11
[19] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[20] Aivin V. Solatorio and Olivier Dupriez. 2023. REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers. arXiv:2302.02041 [cs.LG]
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017).