ACECode: A Reinforcement Learning Framework for Aligning Code Efficiency and Correctness in Code Language Models
CCS Concepts: • Software and its engineering → Automatic programming; Open source model.
Additional Key Words and Phrases: Code Language Models, Code Generation, Reinforcement Learning, Code
Efficiency, Code Correctness, Software Engineering
Authors’ Contact Information: Chengran Yang, cryang@smu.edu.sg, Singapore Management University, Singapore; Hong
Jin Kang, hjkang@smu.edu.sg, Singapore Management University, Singapore; Jieke Shi, jiekeshi@smu.edu.sg, Singapore
Management University, Singapore; David Lo, davidlo@smu.edu.sg, Singapore Management University, Singapore.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM XXXX-XXXX/2024/12-ART
https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 Introduction
Large Language Models (LLMs), such as GPT [5] and Qwen [6], have demonstrated remarkable
advancements across various domains including software engineering [12, 15, 37]. Specialized LLMs
for code like Code Alpaca [8], WizardCoder [26], Magicoder [40], and WaveCoder [42], known as
CodeLLMs, are built on general-purpose LLMs and fine-tuned with code datasets spanning multiple
programming languages. These CodeLLMs demonstrate exceptional performance across a range of
code generation tasks, such as vulnerability patch generation [29] and code completion [14, 19].
While current CodeLLMs are capable of generating functionally correct code, this often comes at
the expense of code efficiency. CodeLLM-generated code tends to be suboptimal in terms of runtime
(i.e., the execution time for generated code), with prior studies [17, 18, 27, 33] showing that it can
require 3-13× more runtime than human-written code.
The inefficiency of CodeLLM-generated code poses a significant challenge as CodeLLMs are
increasingly integrated into real-world software development workflows. For instance, Google
reports that CodeLLMs now generate over a quarter of new code in their products [20]. As such,
inefficient code generated by CodeLLMs can significantly hinder the performance of software
products and reduce their competitiveness, particularly in resource-constrained scenarios like
mobile devices, cloud servers, and IoT systems. On the other hand, optimizing the efficiency of
CodeLLM-generated code can contribute to environmental sustainability by enabling software
products to consume less energy and resources, ultimately lowering their carbon footprint [33].
To date, few approaches have attempted to address this issue, including SOAP [17] and PIE [34].
SOAP [17] adopts a two-stage inference process, where CodeLLMs first generate code that is
executed with test cases to collect runtime feedback. This feedback is then used to guide a second
round of code generation aimed at reducing execution overhead. However, SOAP requires a
compatible execution environment for LLM-generated code and pre-defined test cases for each
task, which significantly restricts its out-of-the-box usability. Moreover, the two-stage inference
process doubles the inference time and introduces additional overhead during code execution.
On the other hand, PIE [34] crafts a dataset with efficient code snippets and applies instruction
tuning for CodeLLMs to optimize code efficiency. While PIE optimizes code efficiency without
relying on execution feedback, it applies instruction tuning with next-token prediction as its
training objective. This approach primarily optimizes CodeLLMs as general language models,
rather than addressing task-specific, multi-objective goals including code efficiency and correctness.
Consequently, it leaves a gap in tailored fine-tuning methods specifically designed to optimize
code efficiency. Moreover, a recent empirical study [35] highlights a crucial drawback of existing
approaches including PIE: while code efficiency improves with these methods, they often do so
at the expense of functional correctness (referred to as correctness). This drawback underscores
the need for a fine-tuning framework specifically designed to optimize both code efficiency and
correctness in CodeLLMs, eliminating the dependency on pre-defined test cases and a compatible
execution environment during inference stage.
To address these limitations of existing approaches, we draw on techniques used for LLM
alignment [28, 32, 39]—a methodology originally developed to tailor general LLM outputs to human
preferences. LLM alignment often applies reinforcement learning from human feedback (RLHF)
as a fine-tuning approach [28, 32, 39]. In RLHF, the LLM generates responses evaluated by a
reward model (referred to as a "rewarder"), which estimates how much a human might favor the
responses. The LLM’s parameters are then iteratively updated to maximize the reward signals with
reinforcement learning algorithms, progressively aligning the model with human preferences. In
this work, we extend RLHF principles by tailoring specific LLM output, i.e., generated code, to meet
fine-grained code quality requirements, i.e., code efficiency and correctness.
Applying RLHF to CodeLLMs to align the CodeLLM-generated code with correctness and
efficiency requirements presents two primary challenges. First, RLHF typically requires a manually
labeled dataset where LLM outputs are annotated with human preferences [10, 28, 32, 39] to train
the reward model. Creating such a dataset is time-consuming and labor-intensive, requiring deep
expertise in code algorithms and carrying the risk of subjective biases. The second challenge lies
in representing both code efficiency and correctness within a rewarder, enabling the model to
optimize for the dual objectives.
To address both challenges, we propose ACECode (Aligning Code Correctness and Efficiency
for CodeLLMs), a reinforcement learning-based fine-tuning framework incorporating a training-
free rewarder derived from code execution. ACECode eliminates the need for manually labeled
datasets to train a reward model. Instead, it introduces a training-free, dual-objective rewarder that
jointly evaluates code efficiency and correctness. Specifically, our rewarder assesses correctness
based on the execution status of code against provided test cases and evaluates efficiency by
comparing the runtime of generated code to that of a human-written reference solution. Moreover,
to represent efficiency and correctness within one rewarder, we design the rewarder as a step
function that penalizes incorrect or inefficient code while rewarding code that achieves both
attributes, guiding the model to optimize toward both goals jointly. With this reward feedback, the
CodeLLM is then optimized using the Proximal Policy Optimization (PPO) algorithm [31]. To the best
of our knowledge, this is the first RLHF method that improves both the efficiency and correctness
of CodeLLM-generated code.
To evaluate the effectiveness of ACECode, we conduct a series of experiments using an ex-
tended version of EffiBench [18], a widely-used benchmark for code efficiency. Specifically, to
enable a comprehensive evaluation of efficiency and correctness, we augment each coding task
in EffiBench, originally paired with one ground-truth solution, with 10 additional human-written
ground-truth solutions. We assess the performance of ACECode by fine-tuning four state-of-the-
art open-source CodeLLMs and comparing their generated code with three baselines: original
CodeLLMs, instruction-tuned CodeLLMs, and CodeLLMs fine-tuned with the state-of-the-art ap-
proach PIE [34]. Extensive experiment results suggest that ACECode significantly improves both
the efficiency and correctness of generated code against all baselines for all CodeLLMs. Specifically,
CodeLLMs fine-tuned with ACECode demonstrate improvements over the original CodeLLMs in
code correctness ranging from 1.84% to 14.51% in terms of pass@1. Moreover, 65% to 72% of the code
solutions generated by CodeLLMs fine-tuned with ACECode exhibit reduced runtime compared to
those produced by the original CodeLLMs. Furthermore, ACECode outperforms instruction tuning
and PIE in terms of code correctness and efficiency, achieving improvements of up to 51.15% in
pass@1 and 23.18% in average execution time compared to instruction tuning, and up to 14.41% in
pass@1 and 11.45% in average execution time compared to PIE.
We summarize our contributions in this paper as follows:
• The first RLHF framework for optimizing dual attributes of LLM-generated code:
We introduce ACECode, the first reinforcement learning-based fine-tuning framework for
optimizing code efficiency and correctness.
• Proposing a training-free rewarder for assessing code correctness and efficiency:
ACECode provides a novel rewarder that jointly assesses code efficiency and correctness,
derived directly from code execution feedback without requiring a manually labeled dataset.
• Extending Existing Benchmark: We extend the existing benchmark EffiBench [18] by adding
10 additional human-written ground-truth solutions for each task, making the dataset more
robust for evaluating efficiency and correctness improvements.
2.1.2 Evaluating CodeLLM for Code Generation. Code generation, which involves generating code
snippets or functions based on natural language prompts or docstrings, has received increasing
attention in recent years [23, 40, 42]. To evaluate LLM performance on this task, various benchmarks
have been proposed, primarily focusing on the correctness of generated code across different
scenarios. For example, HumanEval assesses functional correctness by executing generated code
on a set of test cases [9]. DS1000 evaluates code generation capabilities with a focus on third-
party libraries within the data science domain [21]. EvoCodeBench measures code generation
performance at the repository level [22]. Recently, EffiBench [18] and ECCO [35] have been introduced
to evaluate the efficiency of generated code. Both works release a code efficiency benchmark by
crafting coding tasks, reference solutions in Python, and corresponding test cases to evaluate
the performance of CodeLLMs. Findings from EffiBench [18] indicate that CodeLLMs consistently produce
code that is suboptimal in execution runtime, while ECCO [35] points out that existing efficiency-oriented
approaches often improve efficiency at the expense of functional correctness.
2.2 RLHF
Fig. 1. Overview of RLHF. Generally, the process begins with the LLM generating multiple responses for a
given prompt, which are then ranked by human labelers based on their preference. These ranked responses
are used to train a reward model that assigns a reward score to evaluate the human preference of newly
generated responses. The LLM is then fine-tuned using the Proximal Policy Optimization (PPO) algorithm to
optimize its outputs for higher reward scores, iteratively aligning LLM’s outputs with human preferences.
Fig. 2. Overview of ACECode with PPO training. The Actor LLM, which serves as the target model for
optimization, takes prompts as input and generates code snippets at each step. For each LLM output, the
Unit-test Based Rewarder executes the generated code with all test cases, assigning a reward score to the
generated code, representing both code correctness and efficiency based on execution feedback. This reward
score is used to guide the Actor LLM in optimizing its policy via the PPO algorithm. Meanwhile, the Critic
LLM estimates the value function for each Actor LLM output, facilitating more stable policy optimization.
max_θ E_{Ĉ ∼ p_θ(·|I)} [ R( G_e(Ĉ), G_c(Ĉ) ) ]    (1)
3 Methodology
We propose ACECode, a novel reinforcement learning-based fine-tuning framework with dual
objectives: code correctness and efficiency. ACECode consists of three main components: (1) code
generation, (2) reward calculation, and (3) policy optimization through the PPO algorithm. Figure 2
illustrates the ACECode architecture, comprising an actor LLM (π_θ) responsible for generating code
snippets, a rewarder R that outputs reward signals, and a critic LLM (V_π) to estimate value functions
for stable policy updates. Both LLMs are optimized using the PPO algorithm [31], guided by a
reward signal R that estimates code quality based on efficiency and correctness. Specifically, the
actor LLM serves as the target model for optimization, responsible for producing code solutions
based on natural language prompts. Meanwhile, the Critic LLM serves as a stability mechanism in
the training process, estimating the value function for each code snippet generated by the Actor
LLM. The value function quantifies the expected cumulative reward of each code snippet, which
stabilizes the policy updating for the actor LLM. Finally, the tuned actor LLM (𝜋𝜃 ) is extracted as
the optimized target model for generating efficient and correct code solutions.
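To make the interplay of these components concrete, the following high-level sketch outlines one training iteration; actor_llm, critic_llm, rewarder, and ppo_update are placeholders for the components described above rather than a concrete API.

# A high-level sketch of one ACECode training iteration under the architecture
# described above; all four arguments are placeholders for described components.
def training_step(prompts, actor_llm, critic_llm, rewarder, ppo_update):
    for prompt in prompts:
        candidates = actor_llm.generate(prompt)              # sampled code snippets
        rewards = [rewarder(code) for code in candidates]    # correctness + efficiency
        values = [critic_llm.estimate(prompt, code) for code in candidates]
        # PPO uses the rewards and value estimates to update the actor's policy
        # (and the critic) while keeping each policy update step small and stable.
        ppo_update(actor_llm, critic_llm, candidates, rewards, values)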
To align the output format of the LLM with the task requirements, we add an example pair (I_example, C_example)
into the prompt for in-context learning. In-context learning has proven to be effective for aligning
output formats [7].
The actor LLM generates N responses for each input prompt. We then extract the code snippets
(Ĉ_1, ..., Ĉ_N) from each response by matching the format of the coding block that is explicitly defined
in the instruction. Notably, following standard practice in RLHF [10, 28], the temperature of the LLM
in the fine-tuning stage is set to 0.85 to generate diverse responses. (C, Ĉ_1, ..., Ĉ_N) is then passed
to the rewarder, which assesses the code and produces scores that jointly represent code efficiency
and correctness.
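For illustration, a minimal sketch of the code-extraction step is shown below; it assumes the instruction asks the model to wrap its solution in a fenced Python code block, which is an assumption about the prompt template rather than the exact format used in the paper.

import re

# A minimal sketch of extracting candidate code snippets from sampled responses
# by matching the coding-block format the instruction asks the model to follow.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def extract_code(response):
    """Return the first fenced Python code block in a response, or None."""
    match = CODE_BLOCK.search(response)
    return match.group(1).strip() if match else None

def extract_candidates(responses):
    # Keep only the responses from which a code snippet could be extracted;
    # these candidates are scored by the rewarder together with the reference C.
    return [code for code in map(extract_code, responses) if code is not None]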
3.2 Rewarder
The rewarder R serves as a proxy for code efficiency and correctness assessment, providing a scalar
reward signal for each input Ĉ. The reward signal then guides the optimization of the actor LLM
(π_θ) with dual objectives, i.e., generating efficient and correct code, through the PPO algorithm.
Unlike common rewarders that are trained on human-labeled datasets [10, 28] and are
prone to subjective bias, ACECode’s R applies a training-free and objective evaluator grounded in
code execution, eliminating the need for manual labeling and reducing the risk of subjective bias.
Designing R to represent both code efficiency and correctness presents several challenges:
• C1. Requiring human-labeled dataset: Rewarders often require a human-labeled dataset
for training [10, 28], which is prone to subjective bias and requires substantial labeling effort.
• C2. Representing code efficiency and correctness in one rewarder: The rewarder must
encapsulate both objectives including code correctness and efficiency; otherwise, the actor
LLM may prioritize one objective over the other during fine-tuning.
• C3. Inaccurate runtime measurements: Code snippets with short execution times
can exhibit high variance in measured runtime due to concurrent processes, and measurements
taken with profiling tools can introduce additional inaccuracy [4].
The following subsections detail our approach to address these challenges.
3.2.1 Objective Evaluator for Efficiency and Correctness. To address challenge C1, we do not
manually label code snippets with efficiency and correctness scores to train a rewarder. Instead, we
use an objective and training-free evaluator based on code execution that measures code efficiency
and correctness directly as the rewarder. Specifically, for a given pair (C, Ĉ), we obtain a set of
ground-truth test cases T = {t_1, t_2, ..., t_n} and execute both C and Ĉ on T. We record the compiler
signals, which indicate whether the code encounters a compilation error, and the execution runtimes
E_C and E_Ĉ for the ground-truth and generated solutions, respectively. We then apply a customized
reward function R to calculate the reward score that reflects both code efficiency and correctness
based on the compiler signals and runtime values.
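A minimal sketch of such an execution-based evaluator is shown below; it assumes each solution defines a function named solve and that test cases are (arguments, expected output) pairs, both of which are illustrative simplifications.

import timeit

# A minimal sketch of the execution-based evaluator under simplifying assumptions:
# each solution defines a function named `solve`, and every test case is an
# (arguments, expected output) pair. Both names are illustrative.
def evaluate(solution_src, tests):
    try:
        code_obj = compile(solution_src, "<candidate>", "exec")  # compiler signal
    except SyntaxError:
        return {"compiles": False, "passes": False, "runtime": None}

    namespace = {}
    exec(code_obj, namespace)
    solve = namespace["solve"]

    try:
        passes = all(solve(*args) == expected for args, expected in tests)
    except Exception:
        passes = False

    # Runtime is only meaningful for code that passes, mirroring the reward design.
    runtime = None
    if passes:
        runtime = timeit.timeit(lambda: [solve(*args) for args, _ in tests],
                                number=10) / 10
    return {"compiles": True, "passes": passes, "runtime": runtime}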
3.2.2 Reward Function. To address challenge C2, we design R as a step function that evaluates the
generated code’s correctness and efficiency based on compiler outcomes and runtime performance:
R(Ĉ, C) =
    0.5 + 0.5 · min( (E_C^t / E_Ĉ^t)^k , 1 ),    if Ĉ passes all unit tests
    −0.3,                                         if Ĉ fails any unit test          (2)
    −0.5,                                         if Ĉ encounters a compile error
where 𝑘 is a penalty factor controlling the sensitivity to efficiency gain.
This reward structure encourages 𝜋𝜃 to generate code that is both correct and efficient. Following
existing approaches [10, 28], we normalize the reward values from −1 to 1 to stabilize the training
process. Firstly, if Ĉ fails to compile or fails any unit test, it receives a negative reward. Code that
fails to compile receives a penalty of −0.5, set as half the worst-case value to avoid making the
model overly cautious and under-optimizing. A smaller penalty of −0.3 is assigned to code that
fails any unit test, as compilation correctness is a prerequisite for functional correctness. Conversely,
if Ĉ passes all tests, it receives a positive reward comprising a fixed score of 0.5 representing code
correctness and an adaptive score of 0.5 · min((E_C^t / E_Ĉ^t)^k, 1) representing code efficiency. To encourage
balanced optimization of both objectives, the fixed score for correctness is set to 0.5 and the adaptive
score for efficiency is capped at a maximum of 0.5. Concretely, the code efficiency of Ĉ is quantified
by comparing its runtime E_Ĉ^t against that of the ground-truth solution E_C^t.
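A minimal Python sketch of this reward function is shown below, assuming the compile status, test outcomes, and runtimes of the ground-truth and generated solutions have already been collected.

# A minimal sketch of the step-shaped reward in Eq. (2), assuming the execution
# results for the ground truth C and the candidate Ĉ are already available.
def reward(compiles, passes_all_tests, runtime_ref, runtime_gen, k=1.0):
    if not compiles:
        return -0.5                      # compile error: strongest penalty
    if not passes_all_tests:
        return -0.3                      # functional failure: smaller penalty
    # Correct code: fixed 0.5 for correctness plus up to 0.5 for efficiency,
    # where efficiency is the (capped) ratio of reference to generated runtime.
    efficiency = min((runtime_ref / runtime_gen) ** k, 1.0)
    return 0.5 + 0.5 * efficiency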
3.2.3 Precise Runtime Measurement. The existing benchmark EffiBench [18] applies the profiling tool
memory-profiler¹ to execute each code snippet once and extract its runtime. However, as noted in the
official Python documentation², profiling tools introduce additional overhead to the runtime
measurement of Python code and are not designed for benchmarking purposes. Meanwhile, we
observe that the runtime of each code snippet may be affected by concurrent processes, resulting in
inconsistent runtimes across executions.
Therefore, to address challenge C3, following recommendations from the official Python docu-
mentation³, we use Python’s timeit module [4]. This allows reasonable assessment of execution
runtime with less overhead. Additionally, to mitigate the effect of randomness in the runtime be-
tween different executions, we implement a repeated execution strategy. We calculate the runtime
value by executing each generated code snippet n times and computing the average, Ē^t_Ĉ. We define
two parameters for repeated execution:
1 https://pypi.org/project/memory-profiler/
2 https://docs.python.org/3.9/library/profile.html
3 https://docs.python.org/3.9/library/profile.html
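For concreteness, a minimal sketch of such a timeit-based repeated-execution measurement is shown below; the repetition parameters are illustrative values, not the configuration used in the experiments.

import timeit

# A minimal sketch of the repeated-execution measurement, assuming
# `run_solution` is a zero-argument callable that executes the generated code
# on all test cases; `repeats` and `number` are illustrative values only.
def measure_runtime(run_solution, repeats=5, number=10):
    timer = timeit.Timer(run_solution)
    # Timer.repeat returns the total time of `number` calls per repetition;
    # averaging across repetitions dampens noise from concurrent processes.
    totals = timer.repeat(repeat=repeats, number=number)
    return sum(total / number for total in totals) / repeats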
max_θ E_{I ∼ 𝓘, ĉ ∼ π_θ(·|I)} [ R(ĉ, I, c) ]    (5)

where 𝓘 is the dataset of input prompts and c is the ground-truth code for each prompt. Specifically,
the PPO loss L_PPO(θ) follows the standard clipped surrogate objective of [31]:

L_PPO(θ) = E_t [ min( r_t(θ) · Â_t , clip( r_t(θ), 1 − ε, 1 + ε ) · Â_t ) ]    (6)

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the probability ratio between the current and previous
policies for action a_t, Â_t is the advantage estimate, and clip(·, 1 − ε, 1 + ε) ensures the ratio is
constrained within a small range to prevent large
updates, with 𝜖 being a hyperparameter. Concretely, the clip function is defined as follows:
clip(r_t(θ), 1 − ε, 1 + ε) =
    1 − ε,      if r_t(θ) < 1 − ε,
    r_t(θ),     if 1 − ε ≤ r_t(θ) ≤ 1 + ε,        (7)
    1 + ε,      if r_t(θ) > 1 + ε.
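For illustration, a minimal numpy sketch of the clipped surrogate objective built from r_t(θ) and Eq. (7) is shown below; the log-probabilities and advantages are placeholder inputs derived from the rewarder and the critic's value estimates.

import numpy as np

# A minimal numpy sketch of the clipped surrogate objective; inputs are
# per-action log-probabilities under the new and old policies plus advantages.
def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # The elementwise minimum means an overly large policy change gains nothing.
    return np.mean(np.minimum(ratio * np.asarray(advantages),
                              clipped * np.asarray(advantages)))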
In parallel, the critic LLM V_π is updated by minimizing a value loss that measures how far its value estimates deviate from the observed returns.
4 Experimental Setup
4.1 Dataset
To evaluate ACECode, we use an extended version of the EffiBench [18] dataset, which is designed
for assessing code efficiency in CodeLLMs. EffiBench consists of 1,000 efficiency-focused coding
tasks extracted from LeetCode [2], each paired with a single ground-truth solution.
To compute the reward for a CodeLLM-generated code solution, ACECode determines the level
of code efficiency by comparing its execution runtime against the ground-truth reference solution,
which ideally serves as an upper bound for runtime efficiency. While EffiBench extracts the single most-
upvoted solution from LeetCode’s discussion forum as a reference solution for each coding task, we
observe that these solutions are not always the most runtime-efficient, as some prioritize readability
or memory usage over runtime efficiency. To address this, we extend EffiBench by gathering up to the
15 most-upvoted solutions for each coding task, executing them, and then picking the solution with
the shortest runtime as a new reference solution. This approach establishes an accurate benchmark
for the upper bound efficiency of code solutions for each coding task, allowing for a more precise
calculation of code efficiency reward and fair evaluation. Note that coding tasks that are restricted
to premium users or contain faulty test cases are excluded from our dataset. The resulting extended
dataset, which we refer to as EffiBench+, comprises 797 coding tasks and 8,520 code solutions. We
randomly split EffiBench+ into training and test sets with an 8:2 ratio. This test set aligns in size
with HumanEval [9], one of the most widely used benchmarks for code generation.
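For illustration, a minimal sketch of the reference-selection step described above is shown below; the inputs and the repetition count are illustrative assumptions.

import timeit

# A minimal sketch of selecting the fastest reference solution per task,
# assuming `candidate_solutions` maps each task id to up to 15 zero-argument
# callables, each running one up-voted solution on the task's test suite.
def pick_fastest_references(candidate_solutions):
    references = {}
    for task_id, runners in candidate_solutions.items():
        timings = [(timeit.timeit(run, number=10), run) for run in runners]
        # Keep the solution with the shortest measured runtime as the reference.
        references[task_id] = min(timings, key=lambda pair: pair[0])[1]
    return references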
Because ACECode requires updating the internal states of CodeLLMs via reinforcement learning, we are unable to
evaluate commercial closed-source models.
Magicoder [40] is a family of CodeLLMs designed for code generation tasks. These models are
trained using OSS-Instruct, a novel dataset that leverages open-source code snippets to generate
diverse and realistic instruction data. This methodology aims to mitigate inherent biases present
in LLM-synthesized data by incorporating a wealth of open-source references, resulting in more
diverse, realistic, and controllable outputs.
DeepSeekCoder [43] is an open-source series of CodeLLMs that are designed to enhance code
generation and understanding tasks. The base model of DeepSeekCoder is trained from scratch
on a massive dataset of 2 trillion tokens, comprising 87% code and 13% natural language in both
English and Chinese. They utilize a 16K context window and a fill-in-the-blank task to support
project-level code completion and infilling tasks.
CodeQwen1.5 [6] is a code-specific variant of the Qwen1.5 LLM, which is pre-trained on approx-
imately 3 trillion tokens of code-related data, encompassing 92 programming languages. The model
supports a context length of up to 64,000 tokens.
WaveCoder [42] is a series of CodeLLMs fine-tuned on an instruction dataset called CodeSeaX-
Dataset, which comprises 19,915 instruction instances across four tasks: code generation, code
summarization, code translation, and code repair. The fine-tuning process employs a generator-
discriminator framework to ensure high-quality and diverse instruction data.
Training is accelerated with DeepSpeed [30]. We randomly sample 5% of the training data as a validation set and select the
checkpoint with the best validation performance for final evaluation.
4.4 Baselines
We consider three baseline settings for evaluating the effectiveness of ACECode:
• Original CodeLLMs: We evaluate the correctness and efficiency of the code from CodeLLMs
fine-tuned with ACECode against the original CodeLLMs without fine-tuning.
• CodeLLMs with Instruction Tuning: We perform instruction tuning on the original
CodeLLMs using the same training split from EffiBench as ACECode and compare their
performance with CodeLLMs fine-tuned by ACECode. The instruction is constructed with
the task description, a one-shot example of task description and code, and the ground-truth
solution. The template of the task description is the same as ACECode, which is illustrated
in Table 1. We fine-tune CodeLLMs using the instructions with the next token prediction
objective.
• PIE [34]: We consider PIE as a strong baseline for code efficiency optimization. PIE applies
instruction tuning with performance-conditioned generation to optimize the efficiency of
CodeLLM-generated code. Performance-conditioned generation involves tagging code solu-
tions with performance scores during training (e.g., the top 10% most efficient solutions are
labeled as "10/10," the next 10% as "9/10," and so on). PIE’s dataset consists of pairs of efficient
and inefficient code snippets for each task, written in C++. During the training stage, the
ground-truth performance scores are provided for each input code to quantify the extent of
code efficiency. During the inference stage, PIE prompts the LLM to generate a code solution
with a 10/10 performance score.
As the released model checkpoints of PIE are optimized for C++, and our training dataset is
focused on Python, we extend PIE approach to Python for a fair comparison. Specifically,
we follow PIE’s instruction-tuning methodology while leveraging our extended version of
EffiBench (see Section 4.1). Our extension provides multiple ground-truth code solutions per
coding task, which are executed once before training to assign performance tags (a sketch of this tagging scheme is shown below).
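For illustration, the decile-style performance tagging described in the PIE baseline could look like the following sketch; the exact binning is an assumption based on the example tags rather than PIE's released implementation.

# A minimal sketch of decile-style performance tagging, assuming `runtimes`
# maps each ground-truth solution id for one coding task to its measured runtime.
def performance_tags(runtimes):
    ranked = sorted(runtimes, key=runtimes.get)          # fastest solutions first
    n = len(ranked)
    tags = {}
    for rank, solution_id in enumerate(ranked):
        decile = 10 - int(10 * rank / n)                 # top 10% -> "10/10"
        tags[solution_id] = f"{max(decile, 1)}/10"
    return tags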
To mitigate this bias, we propose a new metric called Efficiency Code Counting (ECC), inspired
by differential testing [13]. ECC calculates the percentage of code solutions generated by CodeLLMs
fine-tuned with ACECode that exhibit lower execution times than those generated by the original
CodeLLM for the same tasks and test cases. Formally, ECC is defined as:
ECC = (1/N) · Σ_{i=1}^{N} 1( t_i^{originLLM} > t_i )    (11)

where N is the number of code solutions generated by the fine-tuned CodeLLM, t_i is the accumulated
execution time of the fine-tuned CodeLLM's solution for the i-th coding task, and t_i^{originLLM} is the
accumulated execution time of the original CodeLLM's solution for the same task. The ECC metric ranges
from 0 to 1 and is not affected by extreme outliers. An ECC value above 0.5 indicates that more
than half of the solutions generated by the fine-tuned CodeLLM are more efficient than those of
the original model.
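A minimal sketch of computing ECC from per-task execution times, following Eq. (11), is shown below; the two input lists are assumed to be aligned by coding task.

# A minimal sketch of the ECC metric in Eq. (11): the fraction of tasks on which
# the fine-tuned model's solution runs faster than the original model's.
def ecc(times_finetuned, times_original):
    assert len(times_finetuned) == len(times_original)
    faster = sum(t_orig > t_new
                 for t_new, t_orig in zip(times_finetuned, times_original))
    return faster / len(times_finetuned)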
One limitation of ECC is that it only counts the number of more efficient solutions generated by
the fine-tuned CodeLLM without accounting for the magnitude of the efficiency improvement. To
address this, we introduce one additional metric: Geometric Execution Time (GET), inspired by
recent works on accelerating software testing [36]. Unlike AET and NAET, which apply arithmetic
means of execution time, GET calculates the geometric mean of execution time. The geometric
mean allows us to capture the average improvement in efficiency without undue influence from
outlier values. Specifically, we define GET as follows:
GET = ( ∏_{i=1}^{N} t_i / n_i )^{1/N}    (12)
where N is the number of code tasks, t_i is the accumulated execution time of the code solution for the
i-th task, and n_i is the number of executions of that solution under the adaptive repeated-execution strategy.
GET calculates the geometric mean of execution time for all code tasks. A lower GET value of the
fine-tuned CodeLLM relative to the original CodeLLM indicates better runtime efficiency of the
generated code.
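Similarly, a minimal sketch of computing GET as the geometric mean in Eq. (12) is shown below, assuming the accumulated times t_i and execution counts n_i have already been collected.

import math

# A minimal sketch of the GET metric in Eq. (12): the geometric mean of the
# per-task average execution times t_i / n_i, computed in log space for
# numerical stability.
def get_score(total_times, num_executions):
    per_task = [t / n for t, n in zip(total_times, num_executions)]
    return math.exp(sum(math.log(x) for x in per_task) / len(per_task))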
To ensure fair and consistent comparison with EffiBench, we evaluate code efficiency only on
tasks for which both the fine-tuned and original CodeLLM generate correct solutions. In this paper,
we consider both ECC and GET as the primary metrics for assessing code efficiency improvements.
5 Experimental Results
This section describes the evaluation results of CodeLLM fine-tuned by ACECode compared to the
original CodeLLM. Specifically, our evaluation focuses on the following two research questions
(RQ):
• RQ1: How do the CodeLLMs fine-tuned by ACECode perform in terms of code correctness
and efficiency?
• RQ2: How do the parameters of ACECode affect the performance of the fine-tuned CodeLLMs?
Table 3. Correctness and Efficiency of Code Generated by Original CodeLLMs and CodeLLMs Fine-tuned by
ACECode.
CodeLLMs fine-tuned by ACECode outperform the original CodeLLMs in terms of both code correctness and
efficiency. Specifically, the ECC score of all CodeLLMs fine-tuned by ACECode is higher than 60%,
indicating that at least 60% of the code generated by CodeLLMs fine-tuned by ACECode
executes in a shorter time than the code generated by the original CodeLLMs. Similarly, the GET score of all
CodeLLMs fine-tuned by ACECode decreased by 0.52% to 10.86% compared to the original CodeLLMs. Both GET and
ECC scores illustrate the superior efficiency of the code generated by CodeLLMs fine-tuned by
ACECode. Moreover, we also observe that the pass@1 score of all CodeLLMs fine-tuned by
ACECode is higher than that of the original CodeLLMs by 1.84% to 14.51%, showing the improvement
in the correctness of the generated code.
In addition, we observe that the less efficient the original CodeLLM, the more significant the
efficiency improvements achieved by ACECode. For all CodeLLMs in this study, the higher the
original CodeLLM’s GET score, the more substantial the efficiency gains achieved by ACECode.
We attribute this trend to an upper bound on achievable efficiency improvements. CodeLLMs that
are less efficient offer more room for targeted fine-tuning improvements.
5.1.2 Comparing With Instruction Tuned CodeLLMs. Table 4 presents the evaluation results com-
paring CodeLLMs fine-tuned with ACECode to those trained via instruction tuning. CodeLLMs
fine-tuned by ACECode consistently outperform their instruction-tuned counterparts in both code
correctness and efficiency.
Specifically, ACECode achieves substantial improvements in code correctness from 5.79% to
51.15% in terms of pass@1. Similarly, the runtime efficiency of ACECode-fine-tuned CodeLLMs
surpasses instruction-tuned models by 9.92% to 23.18% in terms of GET score, highlighting the
superior efficiency of ACECode in generating correct and efficient code.
Interestingly, instruction tuning does not consistently yield improvements over the original
CodeLLMs in either metric. Among the evaluated models, only Magicoder-S-DS-6.7B achieves
a higher pass@1 compared to its original version, and only DeepseekCoder-6.7B-instruct and
Magicoder-S-DS-6.7B show efficiency gains over their original versions (i.e., ECC > 0.5). We attribute
this inconsistency to the inherent limitations of instruction tuning, which heavily depends on
diverse and representative training datasets. In this study, EffiBench provides a single ground-truth
solution per task, which may not be sufficient to effectively guide CodeLLMs in generating both
efficient and correct code. Furthermore, instruction tuning typically employs next-token prediction
as its training objective, which focuses on predicting subsequent tokens based on the context
Table 4. Correctness and Efficiency of Code Generated by CodeLLMs with Instruction tuning and CodeLLMs
Fine-tuned by ACECode.
rather than explicitly optimizing CodeLLMs for multi-objective and fine-grained goals. However,
for the task of generating both correct and efficient code, the optimization goals encompass both
code correctness and efficiency, necessitating an optimization process that extends beyond the
capabilities of standard instruction tuning. Conversely, ACECode overcomes these challenges
by utilizing a reinforcement learning-based fine-tuning method that 1) autonomously generates
varied responses for each coding task to enable iterative training and 2) integrates multi-objective
optimization directly into its reward system.
5.1.3 Comparison with PIE. We compare the performance of CodeLLMs fine-tuned by ACECode
with those fine-tuned by PIE [34] in terms of code correctness and efficiency. The evaluation results
are shown in Table 5.
CodeLLMs fine-tuned by ACECode consistently outperform PIE-tuned models in both code
correctness and efficiency. In particular, ACECode enhances code correctness over PIE by a margin
ranging from 2.92% to 14.41% in terms of pass@1. The runtime efficiency of code generated by
ACECode-fine-tuned CodeLLMs surpasses that of PIE-tuned models by 5.31% to 11.45% in terms of
GET score. The results suggest that ACECode is more effective in optimizing code efficiency and
correctness than PIE.
Furthermore, we observe that though PIE improves the efficiency of generated code for all
CodeLLMs (i.e., all ECC>0.5), it does not consistently enhance code correctness. Specifically,
CodeQwen1.5-7B-Chat and WaveCoder-Ultra-6.7B show lower pass@1 scores than the vanilla mod-
els that did not undergo fine-tuning. This finding is aligned with existing studies [35] that highlight
the trade-off between code efficiency and correctness in PIE. In contrast, ACECode simultaneously
enhances both code efficiency and correctness, overcoming these limitations.
Table 5. Correctness and Efficiency of Code Generated by CodeLLMs fine-tuned with PIE and ACECode.
Key Findings: CodeLLMs fine-tuned by ACECode consistently outperform the original CodeLLMs, instruction-tuned CodeLLMs, and CodeLLMs fine-tuned with PIE in terms of both code correctness and efficiency.
Fig. 3. Impact of Training Epoch Numbers on ACECode Performance. Pass@1 and GET relative improvements
refer to the percentage of improvement in terms of Pass@1 and GET scores compared to the original
CodeLLMs.
5.2.1 Number of Training Epochs. Experiment Setting: To investigate the impact of the number of training
epochs on the performance of ACECode, we compare the performance of CodeQwen1.5-7B-Chat
fine-tuned by ACECode with training epoch numbers ranging from 1 to 30, evaluated at intervals of 3 epochs.
Results: The evaluation results are shown in Figure 3. We observe that the epoch number affects the
performance of ACECode. Specifically, as the training epoch number increases from 1 to 18, the
GET relative improvement of the tuned CodeLLMs over the original CodeLLMs consistently ranges from 0.01 to
0.05. Similarly, the Pass@1 score of the tuned CodeLLMs is higher than that of the original CodeLLMs by
up to 11.52% from epochs 1 to 24, except for the 6th epoch, where it is 0.01% lower than the original
CodeLLMs. This variance in improvement may be due to the randomness of the RL training process.
However, we observe that the performance of the tuned CodeLLMs in terms of GET and Pass@1 starts
to decrease and consistently falls below that of the original CodeLLMs after 18 and 24 epochs,
respectively. We hypothesize that overfitting may occur when the number of training epochs is too
large, which leads to the performance decrease of ACECode.
Fig. 4. Impact of Number of Responses on ACECode Performance. Pass@1 and GET relative improvements
refer to the percentage of improvement in terms of Pass@1 and GET scores compared to the original
CodeLLMs.
5.2.2 Number of CodeLLM Responses. Experiment Setting: We investigate the impact of the response
number N on the performance of ACECode. A response number N indicates that ACECode
generates N responses for each coding task in the training dataset, which makes the number of task
and solution pairs used for RL training N times the size of the original dataset.
We conduct experiments using 𝑁 ranging from 1 to 5 for fine-tuning CodeQwen1.5-7B-Chat on
ACECode.
Results: The evaluation results are shown in Figure 4. We can observe that both the GET and Pass@1
relative improvements of the tuned CodeLLMs generally increase as the response number increases from 1 to 5,
except for the GET relative improvement when the response number is 4 and the Pass@1 relative improvement
when the response number is 3. We attribute these slight declines to potential variability in RL training.
Specifically, both the GET and Pass@1 relative improvements are highest when the response number is 5, indicating
that the performance of ACECode is positively correlated with the response number and follows
a scaling trend. However, the training time cost of ACECode increases approximately linearly
with the response number, presenting a trade-off between the performance and training
cost of ACECode.
6 Threats to Validity
The threats to external validity concern the generalizability of our findings. To mitigate this threat,
we evaluated ACECode using four popular and state-of-the-art CodeLLMs on the extended EffiBench+
benchmark. However, we acknowledge that our findings may not generalize to CodeLLMs of other
sizes or architectures. In the future, we plan to extend our study to a broader range of CodeLLMs
to further validate the effectiveness of ACECode.
The threats to internal validity of this study include the measurement of execution time and
correctness of CodeLLM-generated code. To mitigate this threat, we run each code snippet with
an adaptive repeated-execution strategy to ensure the code snippets are executed multiple times, taking
the average execution time as the final result. Meanwhile, different from the previous benchmark
EffiBench, which uses profiling tools to measure the execution time of code snippets, we follow
the best practices in the Python documentation for runtime benchmarking [3] and use the timeit module [4].
Furthermore, we take into account the library import overhead and exclude its time cost from
the final result by leveraging the timeit module. On the other hand, to mitigate the threat of
randomness in code correctness evaluation, we fix the temperature of the LLM to 0 and top_p to 1 to
force the LLMs to generate identical code snippets for each coding task.
The threats to construct validity mainly lie in the selection of evaluation metrics. To mitigate this
threat, we identify the deficiencies of existing evaluation metrics and propose two new evaluation
metrics to better measure the performance of CodeLLMs.
7 Conclusion
In this paper, we propose ACECode, a reinforcement learning-based fine-tuning framework for
CodeLLMs, to improve CodeLLMs’ performance in terms of both the correctness and efficiency
of their generated code. ACECode leverages execution feedback of the generated code to guide
reinforcement learning-based fine-tuning of CodeLLMs, in which the reward is calculated based on
the execution correctness and runtime efficiency of the generated code. We conduct experiments
on four state-of-the-art CodeLLMs and evaluate the performance of CodeLLMs fine-tuned on
ACECode in terms of code correctness and efficiency. The results show that CodeLLMs fine-tuned
by ACECode outperform the original CodeLLMs in terms of code correctness and efficiency by
1.84% to 14.51% in terms of pass@1 and by 0.52% to 10.86% in terms of GET score. In the
future, we plan to extend our study to more CodeLLMs with different architectures, pre-training
objectives, and parameter sizes.
References
[1] [n. d.]. Hugging Face. https://huggingface.co/. (Accessed on 10/25/2024).
[2] [n. d.]. Leetcode. https://leetcode.com/. (Accessed on 10/25/2024).
[3] [n. d.]. Python Documentation 3.9: Profile. https://docs.python.org/3.9/library/profile.html. (Accessed on 10/25/2024).
[4] [n. d.]. Python Documentation 3.9: timeit. https://docs.python.org/3.9/library/timeit.html. (Accessed on 10/25/2024).
[5] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
(2023).
[6] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al.
2023. Qwen technical report. arXiv preprint arXiv:2309.16609 (2023).
[7] Tom B Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
[8] Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/
sahil280114/codealpaca.
[9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards,
Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv
preprint arXiv:2107.03374 (2021).
[10] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. 2023. Safe
rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773 (2023).
[11] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient
exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
[12] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large
language models for software engineering: Survey and open problems. In 2023 IEEE/ACM International Conference on
Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53.
[13] Muhammad Ali Gulzar, Yongkang Zhu, and Xiaofeng Han. 2019. Perception and practices of differential testing. In
2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
IEEE, 71–80.
[14] Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. Longcoder: A long-range pre-trained language
model for code completion. In International Conference on Machine Learning. PMLR, 12098–12107.
[15] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu
Wang. 2023. Large language models for software engineering: A systematic literature review. ACM Transactions on
Software Engineering and Methodology (2023).
[16] Jian Hu, Xibin Wu, Weixun Wang, Xianyu, Dehao Zhang, and Yu Cao. 2024. OpenRLHF: An Easy-to-use, Scalable and
High-performance RLHF Framework. arXiv preprint arXiv:2405.11143 (2024).
[17] Dong Huang, Jianbo Dai, Han Weng, Puzhen Wu, Yuhao Qing, Jie M Zhang, Heming Cui, and Zhijiang Guo. 2024.
SOAP: Enhancing Efficiency of Generated Code via Self-Optimization. arXiv preprint arXiv:2405.15189 (2024).
[18] Dong Huang, Jie M Zhang, Yuhao Qing, and Heming Cui. 2024. EffiBench: Benchmarking the Efficiency of Automatically
Generated Code. arXiv preprint arXiv:2402.02037 (2024).
[19] Maliheh Izadi, Jonathan Katzy, Tim Van Dam, Marc Otten, Razvan Mihai Popescu, and Arie Van Deursen. 2024.
Language models for code completion: A practical evaluation. In Proceedings of the IEEE/ACM 46th International
Conference on Software Engineering. 1–13.
[20] Jack Kelly. 2024. AI Writes Over 25% Of Code At Google—What Does The Future Look Like For Software Engineers?
— forbes.com. https://www.forbes.com/sites/jackkelly/2024/11/01/ai-code-and-the-future-of-software-engineers/.
[Accessed 05-11-2024].
[21] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida
Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International
Conference on Machine Learning. PMLR, 18319–18345.
[22] Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, and Zhi Jin. 2024. EvoCodeBench: An Evolving Code Generation
Benchmark Aligned with Real-World Code Repositories. arXiv preprint arXiv:2404.00599 (2024).
[23] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling,
Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624
(2022), 1092–1097.
[24] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really
Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Thirty-seventh Conference on Neural
Information Processing Systems. https://openreview.net/forum?id=1qvx610Cu7
[25] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang,
Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. 2024. Starcoder 2 and the stack v2: The next generation. arXiv preprint
arXiv:2402.19173 (2024).
[26] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin,
and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint
arXiv:2306.08568 (2023).
[27] Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, and Vincent Ng. 2024. On Evaluating the Efficiency of Source Code
Generated by LLMs. In Proceedings of the 2024 IEEE/ACM First International Conference on AI Foundation Models and
Software Engineering. 103–107.
[28] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback.
Advances in neural information processing systems 35 (2022), 27730–27744.
[29] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining zero-shot
vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2339–2356.
[30] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable
training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. 3505–3506.
[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization
algorithms. arXiv preprint arXiv:1707.06347 (2017).
[32] Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong.
2023. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025 (2023).
[33] Jieke Shi, Zhou Yang, and David Lo. 2024. Efficient and Green Large Language Models for Software Engineering:
Vision and the Road Ahead. arXiv preprint arXiv:2404.04566 (2024).
[34] Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gardner, Milad Hashemi, Graham Neubig,
Parthasarathy Ranganathan, Osbert Bastani, and Amir Yazdanbakhsh. 2023. Learning performance-improving code
edits. arXiv preprint arXiv:2302.07867 (2023).
[35] Siddhant Waghjale, Vishruth Veerendranath, Zora Zhiruo Wang, and Daniel Fried. 2024. ECCO: Can We Improve
Model-Generated Code Efficiency Without Sacrificing Functional Correctness? arXiv preprint arXiv:2407.14044 (2024).
[36] Bo Wang, Sirui Lu, Yingfei Xiong, and Feng Liu. 2021. Faster mutation analysis with fewer processes and smaller
overheads. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 381–393.
[37] Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large
language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering (2024).
[38] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. Codet5+: Open
code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023).
[39] Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun
Liu. 2023. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966 (2023).
[40] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need.
arXiv preprint arXiv:2312.02120 (2023).
[41] Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. 2024. Is
dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719 (2024).
[42] Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, and Qiufeng Yin. 2023.
Wavecoder: Widespread and versatile enhanced instruction tuning with refined data generation. arXiv preprint
arXiv:2312.14187 (2023).
[43] Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma,
et al. 2024. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence. arXiv preprint
arXiv:2406.11931 (2024).