This repository contains the code and documentation for the thesis "Investigating Memorization in Code-Based Large Language Models." The research examines the memorization tendencies of code-based large language models (LLMs) across different architectures and programming languages, with particular emphasis on the effects of few-shot learning, quantization, and prompt tuning.
The project directory is structured as follows:
PROJECT/
│
├── requirements.txt
│
├── Evaluating_Memorization_Across_LLM_Architectures/
│   ├── Non-Autoregressive-Task.py
│   ├── Autoregressive-T5.py
│   ├── Autoregressive-GPT.py
│   └── codebleu/
│       └── codebleu.py
│
├── Few-shot/
│   ├── fsl-codeT5.py
│   └── fsl-codeGPT.py
│
└── Prompt-Tuning/
    ├── codeT5-PT.py
    ├── codeGPT-PT.py
    └── codeBERT-PT.py
The dataset used for this project is CodeSearchNet, an open-source dataset curated for code search and related natural language processing tasks. It is publicly available on Hugging Face.
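For reference, the dataset can be loaded with the Hugging Face datasets library. The dataset ID code_search_net, the per-language configurations, and the field name below are assumptions based on the public Hub dataset rather than on the thesis scripts.

from datasets import load_dataset

# Load the Python subset of CodeSearchNet; other configurations include
# "java", "javascript", and "ruby".
dataset = load_dataset("code_search_net", "python", split="test")
print(dataset[0]["func_code_string"][:200])  # preview one code snippet (field name assumed)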
The following open-source models were used:
- CodeBERT: An encoder-only model pre-trained on code data.
- CodeT5: An encoder-decoder model designed for both code understanding and generation.
- CodeGPT: A decoder-only model fine-tuned for code generation tasks.
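As a rough sketch, the three model families can be loaded with transformers as shown below. The CodeBERT and CodeT5 checkpoints match the script defaults listed later in this README; the CodeGPT checkpoint name is an assumption and may differ from the one used in the thesis.

from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, T5ForConditionalGeneration

codebert = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")   # encoder-only
codet5 = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-large")   # encoder-decoder
codegpt = AutoModelForCausalLM.from_pretrained("microsoft/CodeGPT-small-py")     # decoder-only (assumed checkpoint)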
To reproduce the results from Chapter 4 of the thesis, navigate to the Evaluating_Memorization_Across_LLM_Architectures directory and run the following scripts:
- Non-Autoregressive Task:
python Non-Autoregressive-Task.py
- Autoregressive Task using CodeT5:
python Autoregressive-T5.py
- Autoregressive Task using CodeGPT:
python Autoregressive-GPT.py
These scripts evaluate the models using the extended CodeBLEU score implemented in codebleu/codebleu.py.
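For illustration only, the snippet below computes a standard CodeBLEU score with the public codebleu package (calc_codebleu); the extended implementation in codebleu/codebleu.py may expose a different interface and additional components.

from codebleu import calc_codebleu

reference = "def add(a, b):\n    return a + b"
prediction = "def add(x, y):\n    return x + y"
# CodeBLEU combines n-gram, weighted n-gram, AST, and dataflow matches.
result = calc_codebleu([reference], [prediction], lang="python",
                       weights=(0.25, 0.25, 0.25, 0.25))
print(result["codebleu"])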
To reproduce the results from Chapter 5, quantization can be enabled in the scripts by passing the --quantization flag:
- Non-Autoregressive Task with quantization:
python Non-Autoregressive-Task.py --quantization True
- Autoregressive Task using CodeT5 with quantization:
python Autoregressive-T5.py --quantization True
- Autoregressive Task using CodeGPT with quantization:
python Autoregressive-GPT.py --quantization True
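One common way to quantize a model is 8-bit loading through transformers and bitsandbytes, sketched below. Whether the scripts use this scheme or a different one behind the --quantization flag is an assumption; consult the flag's code path in each script for the exact behaviour.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)   # 8-bit weights via bitsandbytes
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/CodeGPT-small-py",                      # assumed checkpoint
    quantization_config=quant_config,
    device_map="auto",
)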
To reproduce the results from Chapter 6, navigate to the Few-shot and Prompt-Tuning directories and run the corresponding scripts:
- Few-Shot Learning:
cd Few-shot
python fsl-codeT5.py
python fsl-codeGPT.py
- Prompt Tuning:
cd Prompt-Tuning
python codeT5-PT.py
python codeGPT-PT.py
python codeBERT-PT.py
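Prompt tuning typically trains a small set of soft prompt embeddings while the base model stays frozen. The sketch below shows one way to set this up with the peft library; the thesis scripts may implement prompt tuning differently, and the checkpoint and number of virtual tokens are illustrative assumptions.

from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("microsoft/CodeGPT-small-py")   # assumed checkpoint
config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the virtual prompt embeddings are trainable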
The implementation of the extended CodeBLEU score can be found at:
./PROJECT/Evaluating_Memorization_Across_LLM_Architectures/codebleu/codebleu.py
Please ensure the main dependencies listed in requirements.txt are installed. You can install the required packages using:
pip install -r requirements.txt
All the scripts accept various arguments to customize the experiments. Here are the common arguments:
parser.add_argument('--prog_lang', type=str, default='python', choices=['python', 'java', 'javascript', 'ruby'], help='Programming language of the code snippets')
parser.add_argument('--mask_ratio', type=float, default=0.75, help='Masking ratio for the code snippets')
parser.add_argument('--output_dir', type=str, default='output', help='Output directory for the extracted content')
parser.add_argument('--device', type=str, default='cuda', choices=['cpu', 'cuda'], help='Device to run the extraction process on')
parser.add_argument('--autoregressive_model_name', type=str, default='Salesforce/codet5-large', help='Name of the autoregressive model to use for extracting content')
parser.add_argument('--non_autoregressive_model_name', type=str, default='microsoft/codebert-base-mlm', help='Name of the non-autoregressive model to use for extracting content')
parser.add_argument('--quantization', type=bool, default=False, help='Whether to quantize the model or not')
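As an illustration of the --mask_ratio argument, the sketch below masks 75% of a snippet's tokens for the non-autoregressive (masked-infilling) setting; the exact masking strategy used by Non-Autoregressive-Task.py is not reproduced here and may differ.

import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
code = "def add(a, b):\n    return a + b"
ids = tokenizer(code, return_tensors="pt").input_ids[0]

mask_ratio = 0.75
candidates = list(range(1, len(ids) - 1))                      # skip special tokens at the ends
chosen = random.sample(candidates, int(mask_ratio * len(candidates)))
for i in chosen:
    ids[i] = tokenizer.mask_token_id                           # replace the token with <mask>
print(tokenizer.decode(ids))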
For few-shot learning:
parser.add_argument('--number_of_shot', type=int, default=0, help='Number of shots to use for the extraction process')
parser.add_argument('--number_of_examples', type=int, default=10, help='Number of examples to try few-shot learning on')
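For context, a few-shot prompt prepends number_of_shot solved examples to the query prefix. The helper below is a hypothetical illustration of that format, not the template used by fsl-codeT5.py or fsl-codeGPT.py.

def build_few_shot_prompt(shots, query_prefix, number_of_shot=2):
    """shots: list of (prefix, completion) pairs drawn from the training split."""
    parts = [prefix + completion for prefix, completion in shots[:number_of_shot]]
    parts.append(query_prefix)                 # the model completes this last prefix
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    shots=[("def add(a, b):\n", "    return a + b\n")],
    query_prefix="def multiply(a, b):\n",
    number_of_shot=1,
)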
This repository provides the necessary scripts and instructions to reproduce the experiments discussed in the thesis "Investigating Memorization in Code-Based Large Language Models." Please ensure that the appropriate dependencies are installed, and the dataset is correctly set up as described before running the experiments.
For further inquiries, please refer to the thesis document or contact the author.
- Large datasets and model checkpoints are not included in this archive due to size constraints.
- For reproducibility, please use the same versions of the libraries and models as specified in the thesis.
- Some experiments may require significant computational resources. Consider using GPU acceleration when available.
Vaikunth Guruswamy