
Commit 386a9fe

add code for generating critic scores
1 parent 59ad35a commit 386a9fe

File tree

10 files changed (+157, -35 lines)


README.md

Lines changed: 18 additions & 2 deletions
@@ -29,6 +29,7 @@ Authors:
 * [x] [Running Unit Tests](#running-unit-tests)
 * [x] [Evaluating Programs](#evaluating-programs)
 * [x] [Training Critic](#training-critic)
+* [x] [Generating Critic Scores](#generating-critic-scores)
 * [ ] [Generating Programs with Critic Sampling](#generating-programs-with-critic-sampling)
 * [x] [Example Generated Programs](#example-generated-programs)
 * [x] [Citation](#citation)
@@ -125,7 +126,8 @@ We created `scripts/generate.sh` to generate programs on the APPS benchmark. You
 | `end` | end index of test samples to be generated | 5000 |
 | `num_seqs` | number of total output programs to be generated (for sampling generation) | 1000 |
 | `num_seqs_per_iter` | Depending on the GPU memory limit, we can generate multiple rounds, each with this number of output programs | 50 |
-| `temp` | temperature for sampling generation | 0.6 ||
+| `temp` | temperature for sampling generation | 0.6 |
+| `output_path` | Path to save generated programs | outputs/codes/ |
 
 Other parameters are defined in the file `utils/generate_configs.py`.
 
@@ -162,7 +164,7 @@ To compute the pass@k metrics, rather than using the APPS evaluation metrics, we
 
 ### Training Critic
 
-We can train a critic model as a classifier that predicts the test outcomes of generated samples. For each training sample, we can follow the prior processes to generate programs and evaluate them with available unit tests. On average, we generate 20 programs per training sample (we provided some example generated programs in `data/APPS/train/`).
+We can train a critic model as a classifier that predicts the test outcomes of generated samples. For each training sample, we can follow the prior processes ([generating programs](#generating-programs) and [running unit tests](#running-unit-tests)) to obtain synthetic samples and their annotations of unit test outcomes. On average, we generate 20 programs per training sample (we provide some example generated programs in `data/APPS/train/`).
 
 Once the programs are tested, we can use their test outcomes as annotations to train a critic model initialized from an LM pretrained on source code data (we used a CodeT5-based model in this case).
 
@@ -185,6 +187,20 @@ Other parameters are defined in the file `utils/train_critic_configs.py`.
 
 Running the script will train a critic model as a classifier that receives a problem description + a generated program as input and returns one of 4 test outcomes: compile error, runtime error, failed tests, and passed tests. The model checkpoints are saved in a folder under `exps/`.
 
+### Generating Critic Scores
+
+We created `scripts/generate_critic_scores.sh` to generate critic scores for synthetic programs. We use the same parameters as defined in [the program generation process](#generating-programs), with the following additional parameters:
+
+| **Parameters** | **Description** | **Example Values** |
+|:-----------------:|:--------------------------------------------------------------------------------------------------------:|:------------------------------:|
+| `critic_scores` | Enable this to run inference on critic models and obtain critic scores | N/A |
+| `gt_solutions` | Enable this to run inference on ground-truth programs; else, synthetic programs are used by default | N/A |
+
+Other parameters are defined in the file `utils/generate_configs.py`.
+
+Running the generation script saves one `pkl` (pickle) file per problem, containing the data fields `code` (list of programs), `prompt` (constructed input sequence to the critic model), `gt_error_type` (ground-truth test outcomes), `pred_error_type` (test outcomes predicted by the critic), and `error_hidden_states` (hidden states returned by the critic).
+
 ### Generating Programs with Critic Sampling
 
 We will release the implementation details of our critic sampling procedure.
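To make the saved `pkl` format from the "Generating Critic Scores" section concrete, the sketch below loads one of the example files committed under `outputs/critic_scores/` and inspects its fields. This is illustrative only and not part of the commit; the mapping from class indices to the four test outcomes is defined by the repo's `datasets/utils.py` (`get_error_type`).

```python
import pickle as pkl
import numpy as np

# Load one of the committed example outputs (synthetic programs for problem 1).
with open("outputs/critic_scores/1_gtFalse.pkl", "rb") as f:
    scores = pkl.load(f)                          # {problem_id: {field_name: value}}

problem_id, fields = next(iter(scores.items()))
print(problem_id, sorted(fields))                 # fields: code, error_hidden_states, gt_error_type, pred_error_type, prompt

preds = np.asarray(fields['pred_error_type'])     # test outcome predicted by the critic, one per program
gts = np.asarray(fields['gt_error_type'])         # ground-truth test outcome, one per program
print("programs scored:", len(preds))
print("critic accuracy on this problem:", float((preds == gts).mean()))
print("hidden states shape:", np.asarray(fields['error_hidden_states']).shape)
```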

configs/generate_configs.py

Lines changed: 2 additions & 0 deletions
@@ -13,12 +13,14 @@
 parser.add_argument("--output_path", type=str, help='Path to save output programs')
 parser.add_argument("--model_path", type=str, help='Path of trained model')
 parser.add_argument("--tokenizer_path", type=str, help='Path to the tokenizer')
+parser.add_argument("--critic_scores", default=False, action='store_true', help='if model is a critic model, enable this to output critic scores')
 
 parser.add_argument("--num_seqs", default=5, type=int, help='Number of total generated programs per test sample')
 parser.add_argument('--num_seqs_per_iter', default=5, type=int, help='Number of possible minibatch to generate programs per iteration, depending on GPU memory')
 
 parser.add_argument("--max_len", default=512, type=int, help='Maximum length of output sequence')
 parser.add_argument('--source_len', default=600, type=int, help='Maximum length of input sequence')
+parser.add_argument('--gt_solutions', default=False, action='store_true', help='Only when critic is used, enable this to estimate returns/rewards for ground-truth programs, else synthetic programs by default')
 
 parser.add_argument("--temperature", default=0.6, type=float, help='temperature for sampling tokens')
 parser.add_argument("-s","--start", default=0, type=int, help='start index of test samples')

generate.py

Lines changed: 116 additions & 30 deletions
@@ -1,18 +1,21 @@
 #
-# '''
 # Copyright (c) 2022, salesforce.com, inc.
 # All rights reserved.
 # SPDX-License-Identifier: BSD-3-Clause
 # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
-# '''#
+#
 import json
 import os
 import pprint
 import torch
 import pdb
 import glob
 from tqdm import tqdm
+import pickle as pkl
+import numpy as np
+from collections import Counter
 from transformers import RobertaTokenizer, T5ForConditionalGeneration
+import datasets.utils as dsutils
 
 def generate_prompt(args, test_case_path, prompt_path, solutions_path, tokenizer,
                     starter_path=None):
@@ -46,6 +49,42 @@ def generate_prompt(args, test_case_path, prompt_path, solutions_path, tokenizer
 
     return _input
 
+def generate_critic_inputs(args, test_case_path, prompt_path, solutions_path, tokenizer,
+                           starter_path=None, gt_solutions=False):
+    _input = generate_prompt(args, test_case_path, prompt_path, solutions_path, tokenizer, starter_path)
+
+    q_tokens = tokenizer.encode(_input, verbose=False, max_length=args.source_len)
+    in_tokens = [tokenizer.eos_token_id] * args.source_len
+    in_tokens[:len(q_tokens)] = q_tokens
+    in_tokens = in_tokens[:args.source_len]
+
+    solutions = json.load(open(solutions_path, 'r'))
+
+    all_texts = []
+    gt_errors = []
+    all_codes = []
+
+    for sol_index, solution in enumerate(solutions):
+        if gt_solutions:
+            solution_str = dsutils.reindent_code(solution)
+        else:
+            solution_str = dsutils.reindent_code(solution['code'])
+
+        a_tokens = tokenizer.encode(solution_str)
+        code = [-100] * args.max_len
+        code[:len(a_tokens)] = a_tokens
+        code = code[:args.max_len]
+
+        all_texts.append(in_tokens)
+        all_codes.append(code)
+
+        if gt_solutions:
+            gt_errors.append(dsutils.get_error_type(True))
+        else:
+            gt_errors.append(dsutils.get_error_type(solution['result']))
+
+    return all_texts, all_codes, gt_errors
+
 def main(args):
 
     argsdict = vars(args)
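For readers unfamiliar with the padding convention in `generate_critic_inputs` above: the problem prompt is right-padded with the tokenizer's EOS id up to `source_len`, and each program sequence (the critic's labels) is right-padded with `-100` up to `max_len`, `-100` being the conventional "ignore" label index in Hugging Face models. A minimal sketch of that padding; the helper `pad_to_length` is illustrative and not part of the repo:

```python
def pad_to_length(token_ids, length, pad_value):
    """Right-pad (or truncate) a list of token ids to a fixed length."""
    return list(token_ids[:length]) + [pad_value] * max(0, length - len(token_ids))

# Equivalent to the logic above:
#   in_tokens = pad_to_length(q_tokens, args.source_len, tokenizer.eos_token_id)  # encoder input
#   code      = pad_to_length(a_tokens, args.max_len, -100)                       # label sequence
```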
@@ -71,56 +110,103 @@ def main(args):
     # Set up model
     tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base', cache_dir=args.tokenizer_path)
     print("Loading model from {}...".format(args.model_path))
-    model = T5ForConditionalGeneration.from_pretrained(args.model_path)
+    if args.critic_scores:
+        model = T5ForConditionalGeneration.from_pretrained(args.model_path, tuning_mode='critic')
+    else:
+        model = T5ForConditionalGeneration.from_pretrained(args.model_path)
+
     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
     model.to(device)
 
+    if args.critic_scores:
+        all_preds = []
+        all_gts = []
+
     # main eval loop
     for index, problem in tqdm(enumerate(problems), ncols=0, total=len(problems)):
 
         prob_path = os.path.join(problem)
         print(f"problem path = {prob_path}")
 
         problem_id = int(problem.split('/')[-1])
-        if os.path.exists(os.path.join(args.output_path, f"{problem_id}.json")):
+
+        if args.critic_scores and \
+            os.path.exists(os.path.join(args.output_path, f"{problem_id}_gt{args.gt_solutions}.pkl")):
             continue
-
+        elif os.path.exists(os.path.join(args.output_path, f"{problem_id}.json")):
+            continue
+
         test_case_path = os.path.join(prob_path, "input_output.json")
         prompt_path = os.path.join(prob_path, "question.txt")
         starter_path = os.path.join(prob_path, "starter_code.py")
-        solutions_path = os.path.join(prob_path, "solutions.json")
+        if args.critic_scores and not args.gt_solutions:
+            solutions_path = os.path.join(prob_path, "gen_solutions.json")
+        else:
+            solutions_path = os.path.join(prob_path, "solutions.json")
         if not os.path.exists(starter_path):
            starter_path = None
 
-        input_text = generate_prompt(args, test_case_path, prompt_path, solutions_path,
+        if args.critic_scores:
+            input_texts, input_codes, gt_error_types = generate_critic_inputs(args, test_case_path, prompt_path, solutions_path,
+                                                                              tokenizer, starter_path, args.gt_solutions)
+        else:
+            input_text = generate_prompt(args, test_case_path, prompt_path, solutions_path,
                                      tokenizer, starter_path)
 
         with torch.no_grad():
-            input_ids = torch.LongTensor(tokenizer.encode(input_text,
-                                                          verbose=False,
-                                                          max_length=args.source_len)).unsqueeze(0).cuda()
-
-            num_loops = int(args.num_seqs / args.num_seqs_per_iter)
-            output_programs = []
-            for i in tqdm(range(num_loops), ncols=0, total=num_loops, leave=False):
-                output_ids = model.generate(
-                    input_ids,
-                    do_sample=True,
-                    temperature=args.temperature,
-                    max_length=args.max_len,
-                    num_return_sequences=args.num_seqs_per_iter,
-                    top_p=0.95)
+            if args.critic_scores:
+                text_tensor = torch.tensor(input_texts).to(device)
+                code_tensor = torch.tensor(input_codes).to(device)
+                gt_error_tensor = torch.tensor(gt_error_types).to(device)
+
+                curr_inputs = {'input_ids': text_tensor, 'error_types': gt_error_tensor, 'labels': code_tensor}
+                _, error_preds, error_hidden_states = model(**curr_inputs, return_error_hidden_states=True)
 
-                for output_id in output_ids:
-                    output_programs.append(tokenizer.decode(output_id, skip_special_tokens=True))
+                assert len(gt_error_types) == len(error_preds)
+                all_preds.extend(error_preds.cpu().numpy().tolist())
+                all_gts.extend(gt_error_types)
+
+                saved_critic_scores = {}
+                saved_critic_scores[problem_id] = {'code': input_codes, 'prompt': input_texts,
+                                                   'gt_error_type': gt_error_types,
+                                                   'pred_error_type': error_preds.cpu().numpy(),
+                                                   'error_hidden_states': error_hidden_states.cpu().numpy()}
+                scores_loc = os.path.join(args.output_path, f"{problem_id}_gt{args.gt_solutions}.pkl")
+                pkl.dump(saved_critic_scores, open(scores_loc, 'wb'))
+
+            else:
+                input_ids = torch.LongTensor(tokenizer.encode(input_text,
+                                                              verbose=False,
+                                                              max_length=args.source_len)).unsqueeze(0).cuda()
 
-        saved_codes = {}
-        saved_codes[problem_id] = {'code': output_programs, 'prompt': input_text}
-
-        codes_loc = os.path.join(args.output_path, f"{problem_id}.json")
-        with open(codes_loc, "w") as f:
-            json.dump(saved_codes, f)
-
+                num_loops = int(args.num_seqs / args.num_seqs_per_iter)
+                output_programs = []
+                for i in tqdm(range(num_loops), ncols=0, total=num_loops, leave=False):
+                    output_ids = model.generate(
+                        input_ids,
+                        do_sample=True,
+                        temperature=args.temperature,
+                        max_length=args.max_len,
+                        num_return_sequences=args.num_seqs_per_iter,
+                        top_p=0.95)
+
+                    for output_id in output_ids:
+                        output_programs.append(tokenizer.decode(output_id, skip_special_tokens=True))
+
+                saved_codes = {}
+                saved_codes[problem_id] = {'code': output_programs, 'prompt': input_text}
+
+                codes_loc = os.path.join(args.output_path, f"{problem_id}.json")
+                with open(codes_loc, "w") as f:
+                    json.dump(saved_codes, f)
+
+    if args.critic_scores:
+        print("Total number of samples: {}".format(len(all_gts)))
+        acc = (np.array(all_preds) == np.array(all_gts)).sum()/len(all_gts)
+        print("Error Pred Acc: {}".format(acc))
+        print("Prediction distribution: {}".format(Counter(all_preds)))
+        print("GT distribution: {}".format(Counter(all_gts)))
+
 if __name__ == "__main__":
 
     from configs.generate_configs import *
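Since problems whose `.pkl` file already exists are skipped, the accuracy summary printed at the end of a run only covers newly scored problems. A small offline sketch (assuming the default output layout above; not part of the commit) that recomputes the same statistics from all saved files:

```python
import glob
import os
import pickle as pkl
import numpy as np
from collections import Counter

output_path = "outputs/critic_scores/"   # same directory the generation script writes to
all_preds, all_gts = [], []

for path in glob.glob(os.path.join(output_path, "*_gtFalse.pkl")):   # scores of synthetic programs
    with open(path, "rb") as f:
        scores = pkl.load(f)             # {problem_id: {'gt_error_type': [...], 'pred_error_type': array, ...}}
    for fields in scores.values():
        all_preds.extend(np.asarray(fields['pred_error_type']).tolist())
        all_gts.extend(list(fields['gt_error_type']))

print("Total number of samples:", len(all_gts))
print("Error Pred Acc:", (np.array(all_preds) == np.array(all_gts)).mean())
print("Prediction distribution:", Counter(all_preds))
print("GT distribution:", Counter(all_gts))
```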

outputs/critic_scores/111_gtFalse.pkl (201 KB, binary file not shown)

outputs/critic_scores/111_gtTrue.pkl (11.1 KB, binary file not shown)

outputs/critic_scores/1_gtFalse.pkl (203 KB, binary file not shown)

outputs/critic_scores/1_gtTrue.pkl (242 KB, binary file not shown)

scripts/generate.sh

Lines changed: 1 addition & 2 deletions
@@ -1,10 +1,9 @@
 ##
-## '''
 ## Copyright (c) 2022, salesforce.com, inc.
 ## All rights reserved.
 ## SPDX-License-Identifier: BSD-3-Clause
 ## For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
-## '''##
+##
 model_path=models/codet5_finetuned_codeRL
 tokenizer_path=models/codet5_tokenizer/
 test_path=data/APPS/test/

scripts/generate_critic_scores.sh

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+##
+## Copyright (c) 2022, salesforce.com, inc.
+## All rights reserved.
+## SPDX-License-Identifier: BSD-3-Clause
+## For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+##
+critic_path=models/codet5_finetuned_critic/
+tokenizer_path=models/codet5_tokenizer/
+test_path=data/APPS/train/ #test.json
+
+output_path=outputs/critic_scores/
+
+CUDA_VISIBLE_DEVICES=0 python generate.py \
+    --model_path ${critic_path} \
+    --test_path ${test_path} \
+    --output_path ${output_path} \
+    --critic_scores --gt_solutions
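Note that this example script enables `--gt_solutions`, so it scores the ground-truth programs in each problem's `solutions.json` and writes `{problem_id}_gtTrue.pkl`. Dropping that flag scores the synthetic programs in `gen_solutions.json` and writes `{problem_id}_gtFalse.pkl` instead, matching the two kinds of example outputs committed under `outputs/critic_scores/`.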

transformers/src/transformers/models/t5/modeling_t5.py

Lines changed: 3 additions & 1 deletion
@@ -1542,6 +1542,7 @@ def forward(
         output_hidden_states=None,
         return_dict=None,
         error_types=None,
+        return_error_hidden_states=False
     ):
         r"""
         labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
@@ -1665,7 +1666,8 @@
             error_pred_loss_fct = CrossEntropyLoss()
             error_pred_loss = error_pred_loss_fct(error_logits.view(-1, error_logits.size(-1)), error_types.view(-1))
             _, error_preds = torch.max(error_logits, dim=-1)
-
+            if return_error_hidden_states:
+                return error_pred_loss, error_preds, error_states
             return error_pred_loss, error_preds
 
         if not return_dict:
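When `return_error_hidden_states=True`, the critic's `forward` returns a three-element tuple (the error-prediction loss, the predicted error types, and the error-prediction hidden states) instead of two; this is the tuple that `generate.py` unpacks as `_, error_preds, error_hidden_states` in the critic-scoring branch above.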
