@@ -125,7 +126,8 @@ We created `scripts/generate.sh` to generate programs on the APPS benchmark. You
|`end`| end index of test samples to be generated | 5000 |
|`num_seqs`| number of total output programs to be generated (for sampling generation) | 1000 |
|`num_seqs_per_iter`| Number of output programs generated per round; depending on GPU memory limits, generation can be split into multiple rounds of this size | 50 |
-|`temp`| temperature for sampling generation | 0.6 ||
+|`temp`| temperature for sampling generation | 0.6 |
+|`output_path`| Path to save generated programs | outputs/codes/ |
Other parameters are defined in the file `utils/generate_configs.py`.
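For orientation, the sketch below shows how the parameters above might map onto the command-line flags declared in the generation config file. It is a hypothetical invocation only: the entry-point name (`generate.py`) and the `--end` flag are assumptions, and `scripts/generate.sh` remains the intended way to launch generation.

```python
# Hypothetical sketch only: call the generation entry point directly with the
# flags declared in the generation config file. The module name `generate.py`
# and the `--end` flag are assumptions; scripts/generate.sh is the supported entry.
import subprocess

cmd = [
    "python", "generate.py",            # assumed script wrapped by scripts/generate.sh
    "--start", "0",                     # start index of test samples
    "--end", "5000",                    # end index of test samples (flag name assumed)
    "--num_seqs", "1000",               # total output programs per sample
    "--num_seqs_per_iter", "50",        # programs generated per round (GPU-memory bound)
    "--temperature", "0.6",             # sampling temperature
    "--output_path", "outputs/codes/",  # where generated programs are saved
]
subprocess.run(cmd, check=True)
```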
@@ -162,7 +164,7 @@ To compute the pass@k metrics, rather than using the APPS evaluation metrics, we
### Training Critic
-We can train a critic model as a classifier that predicts the test outcomes of generated samples. For each training sample, we can follow the prior processes to generate programs and evaluate them with available unit tests. On average, we generate 20 programs per training sample (we provided some example generated programs in `data/APPS/train/`).
+We can train a critic model as a classifier that predicts the test outcomes of generated samples. For each training sample, we can follow the prior processes ([generating programs](#generating-programs) and [running unit tests](#running-unit-tests)) to obtain synthetic samples and their annotations of unit test outcomes. On average, we generate 20 programs per training sample (we provide some example generated programs in `data/APPS/train/`).
Once the programs are tested, we can use their test outcomes as annotations to train a critic model initialized from a LM pretrained on source code data (we used a CodeT5-based model in this case).
@@ -185,6 +187,20 @@ Other parameters are defined in the file `utils/train_critic_configs.py`.
Running the script will train a critic model as a classifier that receives a problem description + a generated program as input and returns one of 4 test outcomes: compile error, runtime error, failed tests, and passed tests. The model checkpoints are saved in a folder under `exps/`.
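As an illustration of what one critic training example could look like, here is a minimal sketch assuming the input is simply the problem description paired with a generated program and the label is one of the four test outcomes; the actual field names and preprocessing in the training code may differ.

```python
# Illustrative only: build one critic training example from a problem, a
# generated program, and its unit-test outcome (label names/ids are assumptions).
TEST_OUTCOMES = ["compile_error", "runtime_error", "failed_tests", "passed_tests"]

def make_critic_example(problem_description: str, generated_program: str, outcome: str) -> dict:
    """Pair a problem description + generated program with its 4-way outcome label."""
    return {
        "input_text": problem_description + "\n" + generated_program,  # critic input sequence
        "label": TEST_OUTCOMES.index(outcome),                          # class id in [0, 3]
    }

example = make_critic_example(
    "Read two integers and print their sum.",
    "a, b = map(int, input().split())\nprint(a + b)",
    "passed_tests",
)
```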
+
+### Generating Critic Scores
+
+We created `scripts/generate_critic_scores.sh` to generate critic scores for synthetic programs. We use the same parameters as defined in [the program generation process](#generating-programs), with the following additional parameters:
+
+|`critic_scores`| Enable this to run inference on critic models and obtain critic scores | N/A |
+|`gt_solutions`| Enable this to run inference on ground-truth programs; otherwise, synthetic programs are used by default | N/A |
+
+Other parameters are defined in the file `utils/generate_configs.py`.
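Analogous to the generation sketch earlier, a hypothetical direct invocation with the two critic-specific flags enabled might look as follows; the entry-point name and the paths are placeholders, and `scripts/generate_critic_scores.sh` is the intended launcher.

```python
# Hypothetical sketch: same generation entry point, now pointed at a trained
# critic checkpoint, with the critic-specific flags from the config file enabled.
import subprocess

subprocess.run([
    "python", "generate.py",                    # assumed script wrapped by scripts/generate_critic_scores.sh
    "--model_path", "exps/critic/",             # trained critic checkpoint (placeholder path)
    "--output_path", "outputs/critic_scores/",  # placeholder output location
    "--critic_scores",                          # run critic inference and save its scores
    "--gt_solutions",                           # optional: score ground-truth programs instead of synthetic ones
], check=True)
```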
+
+Running the generation script will output programs, each of which is saved into a `pkl` (pickle) file that includes the data fields `code` (list of programs), `prompt` (constructed input sequence to the critic model), `gt_error_type` (ground-truth test outcomes), `pred_error_type` (test outcomes predicted by the critic), and `error_hidden_states` (hidden states returned by the critic).
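To show how these files can be consumed downstream, here is a minimal sketch that loads one output file and inspects the fields listed above; the file name is a placeholder and the exact layout of each field may differ.

```python
# Minimal sketch: load one critic-score output file and look at its fields
# (the file name is a placeholder; one pickle file is saved per problem).
import pickle

with open("outputs/critic_scores/0.pkl", "rb") as f:
    result = pickle.load(f)

print(len(result["code"]))                  # list of programs that were scored
print(result["prompt"][:200])               # input sequence constructed for the critic
print(result["gt_error_type"])              # ground-truth test outcomes
print(result["pred_error_type"])            # test outcomes predicted by the critic
print(len(result["error_hidden_states"]))   # hidden states returned by the critic
```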
### Generating Programs with Critic Sampling
We will release the implementation details of our critic sampling procedure.
`configs/generate_configs.py` (2 additions & 0 deletions)
@@ -13,12 +13,14 @@
parser.add_argument("--output_path", type=str, help='Path to save output programs')
parser.add_argument("--model_path", type=str, help='Path of trained model')
parser.add_argument("--tokenizer_path", type=str, help='Path to the tokenizer')
+parser.add_argument("--critic_scores", default=False, action='store_true', help='if model is a critic model, enable this to output critic scores')
parser.add_argument("--num_seqs", default=5, type=int, help='Number of total generated programs per test sample')
parser.add_argument('--num_seqs_per_iter', default=5, type=int, help='Number of possible minibatch to generate programs per iteration, depending on GPU memory')
parser.add_argument("--max_len", default=512, type=int, help='Maximum length of output sequence')
parser.add_argument('--source_len', default=600, type=int, help='Maximum length of input sequence')
+parser.add_argument('--gt_solutions', default=False, action='store_true', help='Only when critic is used, enable this to estimate returns/rewards for ground-truth programs, else synthetic programs by default')
parser.add_argument("--temperature", default=0.6, type=float, help='temperature for sampling tokens')
parser.add_argument("-s","--start", default=0, type=int, help='start index of test samples')