From 475a5ed61d42488db1a9918d1fd99425d35288fd Mon Sep 17 00:00:00 2001 From: Rangeet Pan Date: Wed, 28 Aug 2024 21:40:28 -0400 Subject: [PATCH 1/4] cleaning notebook examples --- .../{ => notebook}/code_summarization.ipynb | 380 +++++++----------- .../{ => notebook}/generate_unit_tests.ipynb | 373 ++++++++--------- .../validating_code_translation.ipynb | 339 ++++++++++++++++ .../java/{ => python}/code_summarization.py | 0 .../java/validating_code_translation.ipynb | 171 -------- 5 files changed, 655 insertions(+), 608 deletions(-) rename docs/examples/java/{ => notebook}/code_summarization.ipynb (71%) rename docs/examples/java/{ => notebook}/generate_unit_tests.ipynb (62%) create mode 100644 docs/examples/java/notebook/validating_code_translation.ipynb rename docs/examples/java/{ => python}/code_summarization.py (100%) delete mode 100644 docs/examples/java/validating_code_translation.ipynb diff --git a/docs/examples/java/code_summarization.ipynb b/docs/examples/java/notebook/code_summarization.ipynb similarity index 71% rename from docs/examples/java/code_summarization.ipynb rename to docs/examples/java/notebook/code_summarization.ipynb index 4ec3855..9d15c33 100644 --- a/docs/examples/java/code_summarization.ipynb +++ b/docs/examples/java/notebook/code_summarization.ipynb @@ -1,19 +1,11 @@ { "cells": [ { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "!pip install ollama" - ], + "cell_type": "markdown", + "id": "59d05bbe28e62687", "metadata": { "collapsed": false }, - "id": "eebee2515df69b96" - }, - { - "cell_type": "markdown", "source": [ "# Using CLDK to explain Java methods\n", "\n", @@ -38,16 +30,16 @@ "
    \n", "
  1. Format the instruction for the given focal method and class.\n", "
  2. Prompts the local model on Ollama.\n", - "
  3. Prints the instruction and LLM output.\n", + "
  4. Use CLDK to analyze code and get context information for generating code summary.\n", "
" - ], - "metadata": { - "collapsed": false - }, - "id": "59d05bbe28e62687" + ] }, { "cell_type": "markdown", + "id": "92896c8ce12b0e9e", + "metadata": { + "collapsed": false + }, "source": [ "## Prequisites\n", "\n", @@ -56,188 +48,124 @@ "
    \n", "
  1. Python 3.11 or later\n", "
  2. Ollama 0.3.4 or later\n", + "
  3. Java 11 or later\n", "
\n", "We will use ollama to spin up a local granite model that will act as our LLM for this turorial." - ], - "metadata": { - "collapsed": false - }, - "id": "92896c8ce12b0e9e" + ] }, { "cell_type": "markdown", + "id": "bfeb1e1227191e3b", + "metadata": { + "collapsed": false + }, "source": [ "### Prerequisite 1: Install ollama\n", "\n", "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", - "Once you have ollama, start the server and make sure it is running.\n", - "If you're on MacOS, Linux, or WSL, you can check to make sure the server is running by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "bfeb1e1227191e3b" + "Once you have ollama, start the server and make sure it is running. Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", + "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags).\n", + "Let's make sure the model is downloaded by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "systemctl status ollama" - ], + "id": "6ff900382e86a18e", "metadata": { "collapsed": false }, - "id": "c53214c8106642ce" - }, - { - "cell_type": "markdown", - "source": [ - "If not, you may have to start the server manually. 
You can do this by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "34a7b1802be15a3f" - }, - { - "cell_type": "code", - "execution_count": null, "outputs": [], "source": [ - "systemctl start ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "f60e2d9ec12f0bf6" + "%%bash\n", + "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" + ] }, { "cell_type": "markdown", - "source": [ - "Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", - "\n", - "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags)." - ], + "id": "531205b489bbec73", "metadata": { "collapsed": false }, - "id": "f629a10841aca9e2" + "source": [ + "### Prerequisite 3: Install ollama Python SDK" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "ollama pull granite-code:8b-instruct" - ], + "id": "e2a749932a800c9d", "metadata": { "collapsed": false }, - "id": "6ff900382e86a18e" - }, - { - "cell_type": "markdown", - "source": [ - "Let's make sure the model is downloaded by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "d076e98c390591b5" - }, - { - "cell_type": "code", - "execution_count": null, "outputs": [], "source": [ - "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" - ], - "metadata": { - "collapsed": false - }, - "id": "7aff854a031589f0" + "pip install ollama" + ] }, { "cell_type": "markdown", - "source": [ - "### Prerequisite 3: Install ollama Python SDK" - ], + "id": "6f42dbd286b3f7a6", "metadata": { "collapsed": false }, - "id": "531205b489bbec73" + "source": [ + "### Prerequisite 4: Install CLDK\n", + "CLDK 
is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "pip install ollama" - ], + "id": "327e212f20a489d6", "metadata": { "collapsed": false }, - "id": "e2a749932a800c9d" + "outputs": [], + "source": [ + "pip install git+https://github.com/IBM/codellm-devkit.git" + ] }, { "cell_type": "markdown", - "source": [ - "### Prerequisite 4: Install CLDK\n", - "CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" - ], + "id": "dd8ec5b9c837898f", "metadata": { "collapsed": false }, - "id": "6f42dbd286b3f7a6" + "source": [ + "### Step 1: Get the sample Java application\n", + "For this tutorial, we will use apache commons cli. You can download the source code to a temporary directory by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "pip install git+https://github.com/IBM/codellm-devkit.git" - ], + "id": "c196e58b3ce90c34", "metadata": { "collapsed": false }, - "id": "327e212f20a489d6" + "outputs": [], + "source": [ + "%%bash\n", + "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" + ] }, { "cell_type": "markdown", - "source": [ - "### Step 1: Get the sample Java application\n", - "For this tutorial, we will use apache commons cli. 
You can download the source code to a temporary directory by running the following command:" - ], + "id": "44e875e7ce6db504", "metadata": { "collapsed": false }, - "id": "dd8ec5b9c837898f" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], "source": [ - "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" - ], - "metadata": { - "collapsed": false - }, - "id": "c196e58b3ce90c34" + "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." + ] }, { "cell_type": "markdown", - "source": [ - "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." - ], + "id": "6ad70b81e8957fc0", "metadata": { "collapsed": false }, - "id": "44e875e7ce6db504" - }, - { - "cell_type": "markdown", "source": [ "### Generate code summary\n", "Code summarization or code explanation is a task that converts a code written in a programming language to a natural language. This particular task has several\n", @@ -245,51 +173,51 @@ "understand the basic details of code structure works, and use that knowledge to generate the summary using various AI-based approaches. In this particular\n", "example, we will be using Large Language Models (LLM), specifically Granite 8B, an open-source model built by IBM. We will show how easily a developer can use\n", "CLDK to expose various parts of the code by calling various APIs without implementing various time-intensive program analyses from scratch." 
- ], - "metadata": { - "collapsed": false - }, - "id": "6ad70b81e8957fc0" + ] }, { "cell_type": "markdown", - "source": [ - "Step 1: Add all the neccessary imports" - ], + "id": "15555404790e1411", "metadata": { "collapsed": false }, - "id": "15555404790e1411" + "source": [ + "Step 1: Add all the neccessary imports" + ] }, { "cell_type": "code", "execution_count": null, + "id": "8e8e5de7e5c68020", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "from pathlib import Path\n", "import ollama\n", "from cldk import CLDK\n", "from cldk.analysis import AnalysisLevel" - ], - "metadata": { - "collapsed": false - }, - "id": "8e8e5de7e5c68020" + ] }, { "cell_type": "markdown", - "source": [ - "Step 2: Formulate the LLM prompt. The prompt can be tailored towards various needs. In this case, we show a simple example of generating summary for each\n", - "method in a Java class" - ], + "id": "ffc4ee9a6d27acc2", "metadata": { "collapsed": false }, - "id": "ffc4ee9a6d27acc2" + "source": [ + "Step 2: Formulate the LLM prompt. The prompt can be tailored towards various needs. In this case, we show a simple example of generating summary for each\n", + "method in a Java class" + ] }, { "cell_type": "code", "execution_count": null, + "id": "9e23523c71636727", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "def format_inst(code, focal_method, focal_class, language):\n", @@ -304,96 +232,96 @@ " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", " inst += \"\\n\"\n", " return inst" - ], - "metadata": { - "collapsed": false - }, - "id": "9e23523c71636727" + ] }, { "cell_type": "markdown", - "source": [], + "id": "a4e9cb4e4f00b25c", "metadata": { "collapsed": false }, - "id": "a4e9cb4e4f00b25c" + "source": [] }, { "cell_type": "markdown", - "source": [ - "Step 3: Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." 
- ], + "id": "dd8439be222b5caa", "metadata": { "collapsed": false }, - "id": "dd8439be222b5caa" + "source": [ + "Step 3: Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." + ] }, { "cell_type": "code", "execution_count": null, + "id": "62807e0cbf985ae6", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", " \"\"\"Prompt local model on Ollama\"\"\"\n", - " response_object = ollama.generate(model=model_id, prompt=message)\n", + " response_object = ollama.generate(model=model_id, prompt=message, options={\"temperature\":0.2})\n", " return response_object[\"response\"]" - ], - "metadata": { - "collapsed": false - }, - "id": "62807e0cbf985ae6" + ] }, { "cell_type": "markdown", - "source": [ - "Step 4: Create an object of CLDK and provide the programming language of the source code." - ], + "id": "1022e86e38e12767", "metadata": { "collapsed": false }, - "id": "1022e86e38e12767" + "source": [ + "Step 4: Create an object of CLDK and provide the programming language of the source code." + ] }, { "cell_type": "code", "execution_count": null, + "id": "a2c8bbe4e3244f60", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "# Create a new instance of the CLDK class\n", "cldk = CLDK(language=\"java\")" - ], - "metadata": { - "collapsed": false - }, - "id": "a2c8bbe4e3244f60" + ] }, { "cell_type": "markdown", + "id": "23dd4a6e5d5cb0c5", + "metadata": { + "collapsed": false + }, "source": [ "Step 5: CLDK uses different analysis engine--Codeanalyzer (built using WALA and Javaparser), Treesitter, and CodeQL (future). By default, codenanalyzer has\n", "been selected as the default analysis engine. 
Also, CLDK support different analysis levels--(a) symbol table, (b) call graph, (c) program dependency graph, and\n", "(d) system dependency graph. Analysis engine can be selected using ```AnalysisLevel``` enum. In this example, we will generate summarization of all the methods\n", "of an application. " - ], - "metadata": { - "collapsed": false - }, - "id": "23dd4a6e5d5cb0c5" + ] }, { "cell_type": "code", "execution_count": null, + "id": "fdd09f5e77d4a68a", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "# Create an analysis object over the java application\n", "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)" - ], - "metadata": { - "collapsed": false - }, - "id": "fdd09f5e77d4a68a" + ] }, { "cell_type": "markdown", + "id": "f148325e92781e13", + "metadata": { + "collapsed": false + }, "source": [ "Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a customized Java class in the prompt. For instance,\n", "\n", @@ -418,55 +346,59 @@ " } \n", "```\n", "Given the above class, let's say we want to generate a summary for the ```bar``` method. To understand what it does, we add the callee of this method in the prompt, which in this case is ```baz```. We also remove imports, comments, etc. All of these are done using a single call to ```sanitize_focal_class``` API. In this process, we also use Treesitter to analyze the code. Once the input code has been sanitized, we call the ```format_inst``` method to create the LLM prompt, which has been passed to ```prompt_ollama``` method to generate the summary using LLM." - ], - "metadata": { - "collapsed": false - }, - "id": "f148325e92781e13" + ] }, { "cell_type": "code", "execution_count": null, + "id": "462ef7dceae367ad", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ + "# For simplicity, we run the code summarization for a single class and method. 
One can remove that filter to run this code for the entire application\n", + "qualified_class_name = 'org.apache.commons.cli.GnuParser'\n", + "method_signature = 'flatten(Options, String[], boolean)'\n", "# Iterate over all the files in the project\n", - "for file_path, class_file in analysis.get_symbol_table().items():\n", - " class_file_path = Path(file_path).absolute().resolve()\n", - " # Iterate over all the classes in the file\n", - " for type_name, type_declaration in class_file.type_declarations.items():\n", + "for class_name in analysis.get_classes():\n", + " if class_name==qualified_class_name:\n", + " class_file_path = analysis.get_java_file(qualified_class_name=class_name)\n", " # Iterate over all the methods in the class\n", - " for method in type_declaration.callable_declarations.values():\n", - " # Get code body of the method\n", - " code_body = class_file_path.read_text()\n", - " \n", - " # Initialize the treesitter utils for the class file content\n", - " tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n", - " \n", - " # Sanitize the class for analysis\n", - " sanitized_class = tree_sitter_utils.sanitize_focal_class(method.declaration)\n", + " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", + " if method==method_signature:\n", + " # Get code body of the method\n", + " with open(class_file_path, 'r') as f:\n", + " code_body = f.read()\n", + " \n", + " # Initialize the treesitter utils for the class file content\n", + " tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n", + " \n", + " # Get all the method details\n", + " method_details = analysis.get_method(qualified_class_name=class_name,\n", + " qualified_method_name=method)\n", + " # Sanitize the class for analysis\n", + " sanitized_class = tree_sitter_utils.sanitize_focal_class(method_details.declaration)\n", + " \n", + " # Format the instruction for the given focal method and class\n", + " instruction = format_inst(\n", + " 
code=sanitized_class,\n", + " focal_method=method_details.declaration,\n", + " focal_class=class_name.split('.')[-1],\n", + " language=\"java\"\n", + " )\n", " \n", - " # Format the instruction for the given focal method and class\n", - " instruction = format_inst(\n", - " code=sanitized_class,\n", - " focal_method=method.declaration,\n", - " focal_class=type_name,\n", - " language=\"java\"\n", - " )\n", - " \n", - " # Prompt the local model on Ollama\n", - " llm_output = prompt_ollama(\n", - " message=instruction,\n", - " model_id=\"granite-code:20b-instruct\",\n", - " )\n", - " \n", - " # Print the instruction and LLM output\n", - " print(f\"Instruction:\\n{instruction}\")\n", - " print(f\"LLM Output:\\n{llm_output}\")" - ], - "metadata": { - "collapsed": false - }, - "id": "462ef7dceae367ad" + " print(f\"Instruction:\\n{instruction}\\n\")\n", + " print(f\"Generating code summary . . .\\n\")\n", + " \n", + " # Prompt the local model on Ollama\n", + " llm_output = prompt_ollama(\n", + " message=instruction\n", + " )\n", + " \n", + " # Print the LLM output\n", + " print(f\"LLM Output:\\n{llm_output}\")" + ] } ], "metadata": { @@ -478,14 +410,14 @@ "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.11.4" } }, "nbformat": 4, diff --git a/docs/examples/java/generate_unit_tests.ipynb b/docs/examples/java/notebook/generate_unit_tests.ipynb similarity index 62% rename from docs/examples/java/generate_unit_tests.ipynb rename to docs/examples/java/notebook/generate_unit_tests.ipynb index 5acd853..3b17311 100644 --- a/docs/examples/java/generate_unit_tests.ipynb +++ b/docs/examples/java/notebook/generate_unit_tests.ipynb @@ -1,19 +1,11 @@ { "cells": [ { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ 
- "!pip install ollama" - ], + "cell_type": "markdown", + "id": "428dbbfa206f5417", "metadata": { "collapsed": false }, - "id": "ee51e198aaebcd9b" - }, - { - "cell_type": "markdown", "source": [ "# Using CLDK to generate JUnit tests\n", "\n", @@ -38,16 +30,16 @@ "
    \n", "
  1. Format the instruction for the given focal method and class.\n", "
  2. Prompts the local model on Ollama.\n", - "
  3. Prints the instruction and LLM output.\n", + "
  4. Use CLDK to go through an application and generate unit test cases for each method.\n", "
" - ], - "metadata": { - "collapsed": false - }, - "id": "428dbbfa206f5417" + ] }, { "cell_type": "markdown", + "id": "f619a9379b9dd006", + "metadata": { + "collapsed": false + }, "source": [ "## Prequisites\n", "\n", @@ -56,346 +48,301 @@ "
    \n", "
  1. Python 3.11 or later\n", "
  2. Ollama 0.3.4 or later\n", + "
  3. Java 11 or later\n", "
\n", "We will use ollama to spin up a local granite model that will act as our LLM for this turorial." - ], - "metadata": { - "collapsed": false - }, - "id": "f619a9379b9dd006" + ] }, { "cell_type": "markdown", + "id": "3485879a7733bcba", + "metadata": { + "collapsed": false + }, "source": [ "### Prerequisite 1: Install ollama\n", "\n", "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", - "Once you have ollama, start the server and make sure it is running.\n", - "If you're on MacOS, Linux, or WSL, you can check to make sure the server is running by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "3485879a7733bcba" + "Once you have ollama, start the server and make sure it is running. Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", + "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags).\n", + "Let's make sure the model is downloaded by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "systemctl status ollama" - ], + "id": "e3410ce4d0afa788", "metadata": { + "ExecuteTime": { + "end_time": "2024-08-28T23:49:03.488152Z", + "start_time": "2024-08-28T23:49:03.424389Z" + }, "collapsed": false }, - "id": "2f67be6c8c024e12" - }, - { - "cell_type": "markdown", - "source": [ - "If not, you may have to start the server manually. 
You can do this by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "273e60ca598e0a53" - }, - { - "cell_type": "code", - "execution_count": null, "outputs": [], "source": [ - "systemctl start ollama" - ], - "metadata": { - "collapsed": false - }, - "id": "cc6877ce338e9102" + "%%bash\n", + "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" + ] }, { "cell_type": "markdown", - "source": [ - "Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", - "\n", - "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags)." - ], + "id": "d8c0224c3c4ecf4d", "metadata": { "collapsed": false }, - "id": "c024dc7ec2869a72" + "source": [ + "### Prerequisite 3: Install ollama Python SDK" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "ollama pull granite-code:8b-instruct" - ], + "id": "5539b5251aee5642", "metadata": { "collapsed": false }, - "id": "5ad0e8ac33c7108e" - }, - { - "cell_type": "markdown", - "source": [ - "Let's make sure the model is downloaded by running the following command:" - ], - "metadata": { - "collapsed": false - }, - "id": "14f9946fdc5e2025" - }, - { - "cell_type": "code", - "execution_count": null, "outputs": [], "source": [ - "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" - ], - "metadata": { - "collapsed": false - }, - "id": "e3410ce4d0afa788" + "pip install ollama" + ] }, { "cell_type": "markdown", - "source": [ - "### Prerequisite 3: Install ollama Python SDK" - ], + "id": "cea573e625257581", "metadata": { "collapsed": false }, - "id": "d8c0224c3c4ecf4d" + "source": [ + "### Prerequisite 4: Install CLDK\n", + "CLDK 
is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "pip install ollama" - ], + "id": "eeb38b312427329d", "metadata": { "collapsed": false }, - "id": "5539b5251aee5642" + "outputs": [], + "source": [ + "pip install git+https://github.com/IBM/codellm-devkit.git" + ] }, { "cell_type": "markdown", - "source": [ - "### Prerequisite 4: Install CLDK\n", - "CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" - ], + "id": "ca7682c71d844b68", "metadata": { "collapsed": false }, - "id": "cea573e625257581" + "source": [ + "### Get the sample Java application\n", + "For this tutorial, we will use apache commons cli. You can download the source code to a temporary directory by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "pip install git+https://github.com/IBM/codellm-devkit.git" - ], + "id": "a4d08ca64b9dbccb", "metadata": { "collapsed": false }, - "id": "eeb38b312427329d" + "outputs": [], + "source": [ + "%%bash\n", + "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" + ] }, { "cell_type": "markdown", - "source": [ - "### Step 1: Get the sample Java application\n", - "For this tutorial, we will use apache commons cli. 
You can download the source code to a temporary directory by running the following command:" - ], + "id": "51d30f3eb726afc0", "metadata": { "collapsed": false }, - "id": "ca7682c71d844b68" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], "source": [ - "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" - ], - "metadata": { - "collapsed": false - }, - "id": "a4d08ca64b9dbccb" + "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." + ] }, { "cell_type": "markdown", - "source": [ - "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." - ], + "id": "98e69eb0bccedfc9", "metadata": { "collapsed": false }, - "id": "51d30f3eb726afc0" - }, - { - "cell_type": "markdown", "source": [ "### Building a JUnit test generator using CLDK and Granite Code Instruct Model\n", "Now that we have all the prerequisites installed, let's start building a JUnit test generator using CLDK and the Granite Code Instruct Model." - ], - "metadata": { - "collapsed": false - }, - "id": "98e69eb0bccedfc9" + ] }, { "cell_type": "markdown", + "id": "5856baff4aa64ed7", + "metadata": { + "collapsed": false + }, "source": [ "Generating unit tests for code is a very tedious task and often takes a significant effort from the developers to write good test cases. There are various tools that are available for automated test generation, such as EvoSuite, which uses evolutionary algorithms to generate test cases. However, the test cases that are being generated are not natural and often developers do not prefer to add them to their test suite. 
Whereas Large Language Models (LLM) being trained with developer-written code it has a better affinity towards generating more natural code--more readable, maintainable code. In this excercise, we will show we can leverage LLMs to generate test cases with the help of CLDK. \n", "\n", "For simplicity, we will cover certain aspects of test generation and provide some context information to LLM for better quality of test cases. In this exercise, we will generate a unit test for a non-private method from a Java class and provide the focal method body and the signature of all the constructors of the class so that LLM can understand how to create an object of the focal class during the setup phase of the tests. Also, we will ask LLMs to generate ```N``` number of test cases, where ```N``` is the cyclomatic complexity of the focal method. The intuition is that one test may not be sufficient for covering fairly complex methods, and a cyclomatic complexity score can provide some guidance towards that. \n", "\n", "(Step 1) First, we will import all the necessary libraries" - ], - "metadata": { - "collapsed": false - }, - "id": "5856baff4aa64ed7" + ] }, { "cell_type": "code", "execution_count": null, + "id": "b3d2498ae092fcc", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "import ollama\n", "from cldk import CLDK\n", "from cldk.analysis import AnalysisLevel" - ], - "metadata": { - "collapsed": false - }, - "id": "b3d2498ae092fcc" + ] }, { "cell_type": "markdown", - "source": [ - "(Step 2) Second, we will form the prompt for the model, which will include all the constructor signarures, and the body of the focal method." - ], + "id": "67eb24b29826d730", "metadata": { "collapsed": false }, - "id": "67eb24b29826d730" + "source": [ + "(Step 2) Second, we will form the prompt for the model, which will include all the constructor signarures, and the body of the focal method." 
+ ] }, { "cell_type": "code", "execution_count": null, + "id": "d7bc9bbaa917df24", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ - "def format_inst(focal_method_body, focal_method, focal_class, constructor_signatures, cyclomatic_complexity, language):\n", + "def format_inst(focal_method_body, focal_method, focal_class, constructor_signatures, language):\n", " \"\"\"\n", " Format the instruction for the given focal method and class.\n", " \"\"\"\n", - " inst = f\"Question: Can you generate {cyclomatic_complexity} unit tests for the method `{focal_method}` in the class `{focal_class}` below?\\n\"\n", - "\n", + " inst = f\"Question: Can you generate junit tests with @Test annotation for the method `{focal_method}` in the class `{focal_class}` below. Only generate the test and no description.\\n\"\n", + " inst += 'Use the constructor signatures to form the object if the method is not static. Generate the code under ``` code block.'\n", " inst += \"\\n\"\n", " inst += f\"```{language}\\n\"\n", - " inst += \"```\\n\"\n", - " inst += \"public class {focal_class} {\"\n", - " inst += f\"<|constructors|>\\n{constructor_signatures}\\n<|constructors|>\\n\"\n", - " inst += f\"<|focal method|>\\n {focal_method_body} \\n <|focal method|>\\n\" \n", + " inst += f\"public class {focal_class} \" + \"{\\n\"\n", + " inst += f\"{constructor_signatures}\\n\"\n", + " inst += f\"{focal_method_body} \\n\" \n", " inst += \"}\"\n", " inst += \"```\\n\"\n", " inst += \"Answer:\\n\"\n", " return inst" - ], - "metadata": { - "collapsed": false - }, - "id": "d7bc9bbaa917df24" + ] }, { "cell_type": "markdown", - "source": [ - "(Step 3) Third, use ollama to call LLM (in case Granite 8b)." - ], + "id": "ae9ceb150f5efa92", "metadata": { "collapsed": false }, - "id": "ae9ceb150f5efa92" + "source": [ + "(Step 3) Third, use ollama to call LLM (in case Granite 8b)." 
+ ] }, { "cell_type": "code", "execution_count": null, + "id": "52634feae7374599", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ - "def prompt_ollama(message: str, model_id: str = \"granite-code:20b-instruct\") -> str:\n", + "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", " \"\"\"Prompt local model on Ollama\"\"\"\n", - " response_object = ollama.generate(model=model_id, prompt=message)\n", + " response_object = ollama.generate(model=model_id, prompt=message, options={\"temperature\":0.2})\n", " return response_object[\"response\"]" - ], - "metadata": { - "collapsed": false - }, - "id": "52634feae7374599" + ] }, { "cell_type": "markdown", - "source": [ - "(Step 4) Fourth, collect all the information needed for each method. In this process, we go through all the classes in the application, and then for each class, we collect the signature of all the constructors. If there is no constructor present, we add the signature of the default constructor. Then, we go through all the non-private methods of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call LLM and get the final output." - ], + "id": "308c3325116b87d4", "metadata": { "collapsed": false }, - "id": "308c3325116b87d4" + "source": [ + "(Step 4) Fourth, collect all the information needed for each method. In this process, we go through all the classes in the application, and then for each class, we collect the signature of all the constructors. If there is no constructor present, we add the signature of the default constructor. Then, we go through all the non-private methods of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call LLM and get the final output." 
+ ] }, { "cell_type": "code", "execution_count": null, + "id": "65c9558e4de65a52", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "# Create a new instance of the CLDK class\n", "cldk = CLDK(language=\"java\")\n", "# Create an analysis object over the java application. Provide the application path.\n", "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", + "\n", + "# For simplicity, we run the test generation for a single class and method. One can remove that filter to run this code for the entire application\n", + "qualified_class_name = 'org.apache.commons.cli.GnuParser'\n", + "method_signature = 'flatten(Options, String[], boolean)'\n", + "\n", "# Go through all the classes in the application\n", "for class_name in analysis.get_classes():\n", - " class_details = analysis.get_class(qualified_class_name=class_name)\n", - " # Generate test cases for non-interface and non-abstract classes\n", - " if not class_details.is_interface and 'abstract' not in class_details.modifiers:\n", - " # Get all constructor signatures\n", - " constructor_signatures = ''\n", - " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", - " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", - " if method_details.is_constructor:\n", - " constructor_signatures += method_details.signature + '\\n'\n", - " # If no constructor present, then add the signature of the default constructor\n", - " if constructor_signatures=='':\n", - " constructor_signatures = f'public {class_name} ()'\n", - " # Go through all the methods in the class\n", - " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", - " # Get the method details\n", - " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", - " # Generate test cases for non-private methods\n", - " if 
'private' not in method_details.modifiers and not method_details.is_constructor:\n", - " # Gather all the information needed for the prompt, which are focal method body, focal method name, focal class name, constructor signature, and cyclomatic complexity\n", - " prompt = format_inst(focal_method_body=method_details.code,\n", - " focal_method=method,\n", - " focal_class=class_name,\n", - " constructor_signatures=constructor_signatures,\n", - " cyclomatic_complexity=method_details.cyclomatic_complexity)\n", - " # Prompt the local model on Ollama\n", - " llm_output = prompt_ollama(\n", - " message=prompt,\n", - " model_id=\"granite-code:20b-instruct\",\n", - " )\n", - " \n", - " # Print the instruction and LLM output\n", - " print(f\"Instruction:\\n{prompt}\")\n", - " print(f\"LLM Output:\\n{llm_output}\")" - ], - "metadata": { - "collapsed": false - }, - "id": "65c9558e4de65a52" + "\n", + " if class_name == qualified_class_name:\n", + " class_details = analysis.get_class(qualified_class_name=class_name)\n", + " focal_class_name = class_name.split('.')[-1]\n", + "\n", + " # Generate test cases for non-interface and non-abstract classes\n", + " if not class_details.is_interface and 'abstract' not in class_details.modifiers:\n", + " \n", + " # Get all constructor signatures\n", + " constructor_signatures = ''\n", + " \n", + " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", + " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", + " \n", + " if method_details.is_constructor:\n", + " constructor_signatures += method_details.signature + '\\n'\n", + " \n", + " # If no constructor present, then add the signature of the default constructor\n", + " if constructor_signatures=='':\n", + " constructor_signatures = f'public {focal_class_name}() ' + '{}'\n", + " \n", + " # Go through all the methods in the class\n", + " for method in 
analysis.get_methods_in_class(qualified_class_name=class_name):\n", + " \n", + " if method==method_signature:\n", + " # Get the method details\n", + " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", + " \n", + " # Generate test cases for non-private methods\n", + " if 'private' not in method_details.modifiers and not method_details.is_constructor:\n", + " \n", + " # Gather all the information needed for the prompt, which are focal method body, focal method name, focal class name, and constructor signature\n", + " prompt = format_inst(focal_method_body=method_details.declaration+method_details.code,\n", + " focal_method=method.split('(')[0],\n", + " focal_class=focal_class_name,\n", + " constructor_signatures=constructor_signatures,\n", + " language='Java')\n", + " \n", + " print(f\"Instruction:\\n{prompt}\\n\")\n", + " print(f\"Generating test case . . .\\n\")\n", + " \n", + " # Prompt the local model on Ollama\n", + " llm_output = prompt_ollama(\n", + " message=prompt\n", + " )\n", + " \n", + " # Print the instruction and LLM output\n", + " print(f\"LLM Output:\\n{llm_output}\")" + ] } ], "metadata": { @@ -407,14 +354,14 @@ "language_info": { "codemirror_mode": { "name": "ipython", - "version": 2 + "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" + "pygments_lexer": "ipython3", + "version": "3.11.4" } }, "nbformat": 4, diff --git a/docs/examples/java/notebook/validating_code_translation.ipynb b/docs/examples/java/notebook/validating_code_translation.ipynb new file mode 100644 index 0000000..27636fd --- /dev/null +++ b/docs/examples/java/notebook/validating_code_translation.ipynb @@ -0,0 +1,339 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Using CLDK to validate code translation\n", + "\n", + "In this tutorial, we will use CLDK to validate translated code.\n", +
"\n", + "By the end of this tutorial, you will have a very light-weight approach for validating code translated from Java to Python. You'll be able to explore some of the benefits of using CLDK to perform fast and easy program analysis.\n", + "\n", + "You will learn how to do the following:\n", + "\n", + "
    \n", + "
  1. Create a new instance of the CLDK class.\n", + "
  2. Create an analysis object over the Java application.\n", + "
  3. Iterate over all the files in the project.\n", + "
  4. Iterate over all the classes in the file.\n", + "
  5. Iterate over all the methods in the class.\n", + "
  6. Get the code body of the method.\n", + "
  7. Initialize the treesitter utils for the class file content.\n", + "
  8. Sanitize the class for analysis.\n", + "
\n", + "Next, we will write a couple of helper methods to:\n", + "\n", + "
    \n", + "
  1. Format the instruction for the given focal method and class.\n", + "
  2. Prompt the local model on Ollama.\n", + "
  3. Use CLDK to analyze code and get context information for translating code.\n", + "
" + ], + "metadata": { + "collapsed": false + }, + "id": "fcac940432e10687" + }, + { + "cell_type": "markdown", + "source": [ + "## Prerequisites\n", + "\n", + "Before we get started, let's make sure you have the following installed:\n", + "\n", + "
    \n", + "
  1. Python 3.11 or later\n", + "
  2. Ollama 0.3.4 or later\n", + "
  3. Java 11 or later\n", + "
\n", + "We will use ollama to spin up a local granite model that will act as our LLM for this tutorial." + ], + "metadata": { + "collapsed": false + }, + "id": "e9411e761b32fcbc" + }, + { + "cell_type": "markdown", + "source": [ + "### Prerequisite 1: Install ollama\n", + "\n", + "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", + "Once you have ollama, start the server and make sure it is running. Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running `ollama pull granite-code:8b-instruct`.\n", + "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. If you prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags).\n", + "Let's make sure the model is downloaded by running the following command:" + ], + "metadata": { + "collapsed": false + }, + "id": "930b603c7eb3cd55" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "%%bash\n", + "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'\\\"" + ], + "metadata": { + "collapsed": false + }, + "id": "635bb847107749f8" + }, + { + "cell_type": "markdown", + "source": [ + "### Prerequisite 2: Install ollama Python SDK" + ], + "metadata": { + "collapsed": false + }, + "id": "a6015cb7728debca" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "pip install ollama" + ], + "metadata": { + "collapsed": false + }, + "id": "9dceb297bbab0ab3" + }, + { + "cell_type": "markdown", + "source": [ + "### Prerequisite 3: Install CLDK\n", + "CLDK is available on GitHub at github.com/IBM/codellm-devkit.git. 
You can install it by running the following command:" + ], + "metadata": { + "collapsed": false + }, + "id": "e06325ad56287f0b" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "pip install git+https://github.com/IBM/codellm-devkit.git" + ], + "metadata": { + "collapsed": false + }, + "id": "d6dc34436d0f2d15" + }, + { + "cell_type": "markdown", + "source": [ + "### Step 1: Get the sample Java application\n", + "For this tutorial, we will use Apache Commons CLI. You can download the source code to a temporary directory by running the following command:" + ], + "metadata": { + "collapsed": false + }, + "id": "6e4ef425987e53ed" + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "%%bash\n", + "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" + ], + "metadata": { + "collapsed": false + }, + "id": "98ddaf361bb8c025" + }, + { + "cell_type": "markdown", + "source": [ + "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." + ], + "metadata": { + "collapsed": false + }, + "id": "7a963481d3c7d083" + }, + { + "cell_type": "markdown", + "source": [ + "### Translating Java code to Python and building lightweight validation logic\n", + "Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent paper [https://dl.acm.org/doi/10.1145/3597503.3639226] published at ICSE'24, we found that LLM-based code translation is very promising. 
In this example, we will walk through the steps of translating each Java class to Python and checking various properties of translated code, such as the number of methods, number of fields, formal arguments, etc.\n", + "\n", + "(Step 1) First, we will import all the necessary libraries" + ], + "metadata": { + "collapsed": false + }, + "id": "47af1410ab0a3b4d" + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47a78f61a53b2b55", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "from cldk.analysis.python.treesitter import PythonSitter\n", + "from cldk.analysis.java.treesitter import JavaSitter\n", + "import ollama\n", + "from cldk import CLDK\n", + "from cldk.analysis import AnalysisLevel" + ] + }, + { + "cell_type": "markdown", + "id": "c6d2f67e1a17cf1", + "metadata": { + "collapsed": false + }, + "source": [ + "(Step 2) Second, we will form the prompt for the model, which will include the body of the Java class after removing all the comments and the import statements." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc1ec56e92e90c15", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def format_inst(code, focal_class, language):\n", + " \"\"\"\n", + " Format the instruction for the given focal method and class.\n", + " \"\"\"\n", + " inst = f\"Question: Can you translate the Java class `{focal_class}` below to Python and generate under code block (```)?\\n\"\n", + "\n", + " inst += \"\\n\"\n", + " inst += f\"```{language}\\n\"\n", + " inst += code\n", + " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", + " inst += \"\\n\"\n", + " return inst" + ] + }, + { + "cell_type": "markdown", + "id": "1239041c3315e5e5", + "metadata": { + "collapsed": false + }, + "source": [ + "(Step 3) Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c86224032a6eb70", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", + " \"\"\"Prompt local model on Ollama\"\"\"\n", + " response_object = ollama.generate(model=model_id, prompt=message)\n", + " return response_object[\"response\"]" + ] + }, + { + "cell_type": "markdown", + "id": "518efea0d8c4d307", + "metadata": { + "collapsed": false + }, + "source": [ + "(Step 4) Translate each class in the application and check certain properties of the translated code, such as (a) number of translated method, and (b) number of translated fields. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe3be3de6790f7b3", + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# Create a new instance of the CLDK class\n", + "cldk = CLDK(language=\"java\")\n", + "\n", + "# Create an analysis object over the java application\n", + "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", + "\n", + "# For simplicity, we run the code translation for a single class. 
One can remove that filter to run this code for the entire application\n", + "qualified_class_name = 'org.apache.commons.cli.GnuParser'\n", + "\n", + "# Go through all the classes in the application\n", + "for class_name in analysis.get_classes():\n", + " \n", + " if class_name==qualified_class_name:\n", + " # Get the location of the Java class\n", + " class_path = analysis.get_java_file(qualified_class_name=class_name)\n", + " \n", + " # Read the file content\n", + " if not class_path:\n", + " class_body = ''\n", + " with open(class_path, 'r', encoding='utf-8', errors='ignore') as f:\n", + " class_body = f.read()\n", + " \n", + " # Sanitize the file content by removing comments.\n", + " tree_sitter_utils = cldk.tree_sitter_utils(source_code=class_body)\n", + " sanitized_class = JavaSitter().remove_all_comments(source_code=class_body)\n", + "\n", + " inst = format_inst(code=sanitized_class, language='java', focal_class=class_name.split('.')[-1])\n", + "\n", + " print(f\"Instruction:\\n{inst}\\n\")\n", + " print(f\"Translating Java code to Python . . 
.\\n\")\n", + " translated_code = prompt_ollama(\n", + " message=inst)\n", + " \n", + " print(f\"Translated Python code: {translated_code}\")\n", + " py_cldk = PythonSitter()\n", + " all_methods = py_cldk.get_all_methods(module=translated_code)\n", + " all_functions = py_cldk.get_all_functions(module=translated_code)\n", + " all_fields = py_cldk.get_all_fields(module=translated_code)\n", + " \n", + " if len(all_methods) + len(all_functions) != len(analysis.get_methods_in_class(qualified_class_name=class_name)):\n", + " print(f'Number of translated method not matching in class {class_name}')\n", + " else:\n", + " print(f'Number of translated method in class {class_name} is {len(all_methods)}')\n", + " if all_fields:\n", + " if len(all_fields) != len(analysis.get_class(qualified_class_name=class_name).field_declarations):\n", + " print(f'Number of translated field not matching in class {class_name}') " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/examples/java/code_summarization.py b/docs/examples/java/python/code_summarization.py similarity index 100% rename from docs/examples/java/code_summarization.py rename to docs/examples/java/python/code_summarization.py diff --git a/docs/examples/java/validating_code_translation.ipynb b/docs/examples/java/validating_code_translation.ipynb deleted file mode 100644 index 2266cf0..0000000 --- a/docs/examples/java/validating_code_translation.ipynb +++ /dev/null @@ -1,171 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "!pip install ollama" - ], - "metadata": { - "collapsed": 
false - }, - "id": "3195a8c0612cb428" - }, - { - "cell_type": "markdown", - "source": [ - "Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent paper [https://dl.acm.org/doi/10.1145/3597503.3639226] published at ICSE'24, we found that LLM-based code translation is very promising. In this example, we will walk through the steps of translating each Java class to Python and checking various properties of translated code, such as the number of methods, number of fields, formal arguments, etc.\n", - "\n", - "(Step 1) First, we will import all the necessary libraries" - ], - "metadata": { - "collapsed": false - }, - "id": "47af1410ab0a3b4d" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "from cldk.analysis.python.treesitter import PythonSitter\n", - "from cldk.analysis.java.treesitter import JavaSitter\n", - "import ollama\n", - "from cldk import CLDK\n", - "from cldk.analysis import AnalysisLevel" - ], - "metadata": { - "collapsed": false - }, - "id": "47a78f61a53b2b55" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 2) Second, we will form the prompt for the model, which will include the body of the Java class after removing all the comments and the import statements." 
- ], - "metadata": { - "collapsed": false - }, - "id": "c6d2f67e1a17cf1" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def format_inst(code, focal_class, language):\n", - " \"\"\"\n", - " Format the instruction for the given focal method and class.\n", - " \"\"\"\n", - " inst = f\"Question: Can you translate the Java class `{focal_class}` below to Python and generate under code block (```)?\\n\"\n", - "\n", - " inst += \"\\n\"\n", - " inst += f\"```{language}\\n\"\n", - " inst += code\n", - " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", - " inst += \"\\n\"\n", - " return inst" - ], - "metadata": { - "collapsed": false - }, - "id": "dc1ec56e92e90c15" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 3) Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." - ], - "metadata": { - "collapsed": false - }, - "id": "1239041c3315e5e5" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", - " \"\"\"Prompt local model on Ollama\"\"\"\n", - " response_object = ollama.generate(model=model_id, prompt=message)\n", - " return response_object[\"response\"]" - ], - "metadata": { - "collapsed": false - }, - "id": "1c86224032a6eb70" - }, - { - "cell_type": "markdown", - "source": [ - "(Step 4) Translate each class in the application (provide the application path as an environment variable, ```JAVA_APP_PATH```) and check certain properties of the translated code, such as (a) number of translated method, and (b) number of translated fields. 
" - ], - "metadata": { - "collapsed": false - }, - "id": "518efea0d8c4d307" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "# Create a new instance of the CLDK class\n", - "cldk = CLDK(language=\"java\")\n", - "# Create an analysis object over the java application. Provide the application path using JAVA_APP_PATH\n", - "analysis = cldk.analysis(project_path=\"JAVA_APP_PATH\", analysis_level=AnalysisLevel.symbol_table)\n", - "# Go through all the classes in the application\n", - "for class_name in analysis.get_classes():\n", - " # Get the location of the Java class\n", - " class_path = analysis.get_java_file(qualified_class_name=class_name)\n", - " # Read the file content\n", - " if not class_path:\n", - " class_body = ''\n", - " with open(class_path, 'r', encoding='utf-8', errors='ignore') as f:\n", - " class_body = f.read()\n", - " # Sanitize the file content by removing comments.\n", - " tree_sitter_utils = cldk.tree_sitter_utils(source_code=class_body)\n", - " sanitized_class = JavaSitter.remove_all_comments(source_code=class_body)\n", - " translated_code = prompt_ollama(\n", - " message=sanitized_class,\n", - " model_id=\"granite-code:20b-instruct\")\n", - " py_cldk = PythonSitter()\n", - " all_methods = py_cldk.get_all_methods(module=translated_code)\n", - " all_functions = py_cldk.get_all_functions(module=translated_code)\n", - " all_fields = py_cldk.get_all_fields(module=translated_code)\n", - " if len(all_methods) + len(all_functions) != len(analysis.get_methods_in_class(qualified_class_name=class_name)):\n", - " print(f'Number of translated method not matching in class {class_name}')\n", - " if len(all_fields) != len(analysis.get_class(qualified_class_name=class_name).field_declarations):\n", - " print(f'Number of translated field not matching in class {class_name}') " - ], - "metadata": { - "collapsed": false - }, - "id": "fe3be3de6790f7b3" - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 
3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 2 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython2", - "version": "2.7.6" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From 810cc5dae0cd4d5afb8b351efc6c4a3fcf6d27df Mon Sep 17 00:00:00 2001 From: Saurabh Sinha Date: Sun, 1 Sep 2024 10:46:04 -0400 Subject: [PATCH 2/4] Updated Java test generation example Signed-off-by: Saurabh Sinha --- .../java/notebook/generate_unit_tests.ipynb | 161 +++++++++--------- 1 file changed, 82 insertions(+), 79 deletions(-) diff --git a/docs/examples/java/notebook/generate_unit_tests.ipynb b/docs/examples/java/notebook/generate_unit_tests.ipynb index 3b17311..90465f6 100644 --- a/docs/examples/java/notebook/generate_unit_tests.ipynb +++ b/docs/examples/java/notebook/generate_unit_tests.ipynb @@ -9,29 +9,21 @@ "source": [ "# Using CLDK to generate JUnit tests\n", "\n", - "In this tutorial, we will use CLDK to generate a JUnit test for all the methods in a Java Application.\n", + "In this tutorial, we will use CLDK to implement a simple unit test generator for Java. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis and build an LLM-based test generator. By the end of this tutorial, you will have implemented such a tool and generated a JUnit test case for a Java application.\n", "\n", - "By the end of this tutorial, you will have a JUnit test for all the methods in a Java application. You'll be able to explore some of the benefits of using CLDK to perform fast and easy program analysis and build a LLM-based test generator.\n", + "Specifically, you will learn how to perform the following tasks on the application under test to create LLM prompts for test generation:\n", "\n", - "You will learn how to do the following:\n", + "1. 
Create a new instance of the CLDK class.\n", + "2. Create an analysis object for the Java application under test.\n", + "3. Iterate over all files in the application.\n", + "4. Iterate over all classes in a file.\n", + "5. Iterate over all methods in a class.\n", + "6. Get the code body of a method.\n", + "7. Get the constructors of a class.\n", + "\n", "\n", - "
    \n", - "
  1. Create a new instance of the CLDK class.\n", - "
  2. Create an analysis object over the Java application.\n", - "
  3. Iterate over all the files in the project.\n", - "
  4. Iterate over all the classes in the file.\n", - "
  5. Iterate over all the methods in the class.\n", - "
  6. Get the code body of the method.\n", - "
  7. Initialize the treesitter utils for the class file content.\n", - "
  8. Sanitize the class for analysis.\n", - "
\n", - "Next, we will write a couple of helper methods to:\n", - "\n", - "
    \n", - "
  1. Format the instruction for the given focal method and class.\n", - "
  2. Prompts the local model on Ollama.\n", - "
  3. Use CLDK to go through an application and generate unit test cases for each method.\n", - "
" + "We will write a couple of helper methods to (1) format the LLM instruction for generating test cases for a given focal method (i.e., method under test) and (2) prompt the LLM via Ollama. We will then use CLDK to go through an application and generate unit test cases for a target method." ] }, { @@ -45,12 +37,11 @@ "\n", "Before we get started, let's make sure you have the following installed:\n", "\n", - "
    \n", - "
  1. Python 3.11 or later\n", - "
  2. Ollama 0.3.4 or later\n", - "
  3. Java 11 or later\n", - "
\n", - "We will use ollama to spin up a local granite model that will act as our LLM for this turorial." + "1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)\n", + "2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)\n", + "3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))\n", + "\n", + "We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial." ] }, { @@ -60,12 +51,28 @@ "collapsed": false }, "source": [ - "### Prerequisite 1: Install ollama\n", + "### Download Granite code model\n", "\n", - "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", - "Once you have ollama, start the server and make sure it is running. Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", - "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags).\n", - "Let's make sure the model is downloaded by running the following command:" + "After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command. There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags))." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "670f2b23", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "ollama pull granite-code:8b-instruct" + ] + }, + { + "cell_type": "markdown", + "id": "02d5bbfa", + "metadata": {}, + "source": [ + " Let's make sure the model is downloaded by running the following command: " ] }, { @@ -82,7 +89,7 @@ "outputs": [], "source": [ "%%bash\n", - "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" + "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'\\\"" ] }, { @@ -92,7 +99,7 @@ "collapsed": false }, "source": [ - "### Prerequisite 3: Install ollama Python SDK" + "### Install Ollama Python SDK" ] }, { @@ -114,8 +121,8 @@ "collapsed": false }, "source": [ - "### Prerequisite 4: Install CLDK\n", - "CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" + "### Install CLDK\n", + "CLDK is available at https://github.com/IBM/codellm-devkit. You can install it by running the following command:" ] }, { @@ -138,7 +145,7 @@ }, "source": [ "### Get the sample Java application\n", - "For this tutorial, we will use apache commons cli. You can download the source code to a temporary directory by running the following command:" + "For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the Java application under test. You can download the source code to a temporary directory by running the following command:" ] }, { @@ -161,7 +168,7 @@ "collapsed": false }, "source": [ - "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." + "The project will be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." 
] }, { @@ -171,22 +178,16 @@ "collapsed": false }, "source": [ - "### Building a JUnit test generator using CLDK and Granite Code Instruct Model\n", - "Now that we have all the prerequisites installed, let's start building a JUnit test generator using CLDK and the Granite Code Instruct Model." - ] - }, - { - "cell_type": "markdown", - "id": "5856baff4aa64ed7", - "metadata": { - "collapsed": false - }, - "source": [ - "Generating unit tests for code is a very tedious task and often takes a significant effort from the developers to write good test cases. There are various tools that are available for automated test generation, such as EvoSuite, which uses evolutionary algorithms to generate test cases. However, the test cases that are being generated are not natural and often developers do not prefer to add them to their test suite. Whereas Large Language Models (LLM) being trained with developer-written code it has a better affinity towards generating more natural code--more readable, maintainable code. In this excercise, we will show we can leverage LLMs to generate test cases with the help of CLDK. \n", + "## Building a JUnit test generator using CLDK and Granite Code Instruct Model\n", + "\n", + "Now that we have all the prerequisites installed, let's start building a JUnit test generator using CLDK and the Granite Code Instruct Model.\n", + "\n", + "Generating unit tests for code is a tedious task and developers often have to put in significant effort in writing good test cases. There are various tools available for automated test generation, such as EvoSuite, which uses evolutionary algorithms to generate unit test cases for Java. However, the generated test cases are not natural and often developers do not prefer to add them to their test suites. Large Language Models (LLMs), having been trained with developer-written code, have a better affinity towards generating more natural code---code that is more readable, comprehensible, and maintainable. 
In this exercise, we will show how we can leverage LLMs to generate test cases with the help of CLDK. \n", "\n", - "For simplicity, we will cover certain aspects of test generation and provide some context information to LLM for better quality of test cases. In this exercise, we will generate a unit test for a non-private method from a Java class and provide the focal method body and the signature of all the constructors of the class so that LLM can understand how to create an object of the focal class during the setup phase of the tests. Also, we will ask LLMs to generate ```N``` number of test cases, where ```N``` is the cyclomatic complexity of the focal method. The intuition is that one test may not be sufficient for covering fairly complex methods, and a cyclomatic complexity score can provide some guidance towards that. \n", + "For simplicity, we will cover certain aspects of test generation and provide some context information to the LLM to help it create usable test cases. In this exercise, we will generate a unit test for a non-private method from a Java class and provide the focal method body and the signature of all the constructors of the class so that the LLM can understand how to create an object of the focal class during the setup phase of the tests.\n", + "\n", "\n", - "(Step 1) First, we will import all the necessary libraries" + "(Step 1) First, we will import all the necessary libraries." ] }, { @@ -210,7 +211,7 @@ "collapsed": false }, "source": [ - "(Step 2) Second, we will form the prompt for the model, which will include all the constructor signarures, and the body of the focal method." + "(Step 2) Second, we will define a function for creating the LLM prompt, which includes signatures of relevant constructors and the body of the focal method." 
] }, { @@ -224,7 +225,7 @@ "source": [ "def format_inst(focal_method_body, focal_method, focal_class, constructor_signatures, language):\n", " \"\"\"\n", - " Format the instruction for the given focal method and class.\n", + " Format the LLM instruction for the given focal method and class.\n", " \"\"\"\n", " inst = f\"Question: Can you generate junit tests with @Test annotation for the method `{focal_method}` in the class `{focal_class}` below. Only generate the test and no description.\\n\"\n", " inst += 'Use the constructor signatures to form the object if the method is not static. Generate the code under ``` code block.'\n", @@ -246,7 +247,7 @@ "collapsed": false }, "source": [ - "(Step 3) Third, use ollama to call LLM (in case Granite 8b)." + "(Step 3) Third, we define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama." ] }, { @@ -271,7 +272,7 @@ "collapsed": false }, "source": [ - "(Step 4) Fourth, collect all the information needed for each method. In this process, we go through all the classes in the application, and then for each class, we collect the signature of all the constructors. If there is no constructor present, we add the signature of the default constructor. Then, we go through all the non-private methods of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call LLM and get the final output." + "(Step 4) Fourth, we collect the relevant information for the focal method. To do this, we go through all the classes in the application, and for each class, we collect the signatures of its constructors. If a class has no constructors, we add the signature of the default constructor. Then, we go through each non-private method of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call LLM to generate test cases and get the LLM response." 
] }, { @@ -283,27 +284,28 @@ }, "outputs": [], "source": [ - "# Create a new instance of the CLDK class\n", + "# Create an instance of the CLDK class for Java analysis\n", "cldk = CLDK(language=\"java\")\n", - "# Create an analysis object over the java application. Provide the application path.\n", + "\n", + "# Create an analysis object for the Java application. Provide the application path.\n", "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", "\n", - "# For simplicity, we run the test generation for a single class and method. One can remove that filter to run this code for the entire application\n", - "qualified_class_name = 'org.apache.commons.cli.GnuParser'\n", - "method_signature = 'flatten(Options, String[], boolean)'\n", + "# For simplicity, we run the test generation for a single focal class and method (this filter can be removed to run this code over the entire application)\n", + "focal_class = \"org.apache.commons.cli.GnuParser\"\n", + "focal_method = \"flatten(Options, String[], boolean)\"\n", "\n", "# Go through all the classes in the application\n", "for class_name in analysis.get_classes():\n", "\n", - " if class_name == qualified_class_name:\n", + " if class_name == focal_class:\n", " class_details = analysis.get_class(qualified_class_name=class_name)\n", - " focal_class_name = class_name.split('.')[-1]\n", + " focal_class_name = class_name.split(\".\")[-1]\n", "\n", " # Generate test cases for non-interface and non-abstract classes\n", - " if not class_details.is_interface and 'abstract' not in class_details.modifiers:\n", + " if not class_details.is_interface and \"abstract\" not in class_details.modifiers:\n", " \n", " # Get all constructor signatures\n", - " constructor_signatures = ''\n", + " constructor_signatures = \"\"\n", " \n", " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", " method_details = 
analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", @@ -312,35 +314,36 @@ " constructor_signatures += method_details.signature + '\\n'\n", " \n", " # If no constructor present, then add the signature of the default constructor\n", - " if constructor_signatures=='':\n", - " constructor_signatures = f'public {focal_class_name}() ' + '{}'\n", + " if constructor_signatures == \"\":\n", + " constructor_signatures = f\"public {focal_class_name}() \" + \"{}\"\n", " \n", " # Go through all the methods in the class\n", " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", " \n", - " if method==method_signature:\n", + " if method == focal_method:\n", " # Get the method details\n", " method_details = analysis.get_method(qualified_class_name=class_name, qualified_method_name=method)\n", " \n", " # Generate test cases for non-private methods\n", - " if 'private' not in method_details.modifiers and not method_details.is_constructor:\n", + " if \"private\" not in method_details.modifiers and not method_details.is_constructor:\n", " \n", " # Gather all the information needed for the prompt, which are focal method body, focal method name, focal class name, and constructor signature\n", - " prompt = format_inst(focal_method_body=method_details.declaration+method_details.code,\n", - " focal_method=method.split('(')[0],\n", - " focal_class=focal_class_name,\n", - " constructor_signatures=constructor_signatures,\n", - " language='Java')\n", + " prompt = format_inst(\n", + " focal_method_body=method_details.declaration+method_details.code,\n", + " focal_method=method.split(\"(\")[0],\n", + " focal_class=focal_class_name,\n", + " constructor_signatures=constructor_signatures,\n", + " language=\"Java\"\n", + " )\n", " \n", + " # Print the instruction\n", " print(f\"Instruction:\\n{prompt}\\n\")\n", - " print(f\"Generating test case . . 
.\\n\")\n", + " print(f\"Generating test case ...\\n\")\n", " \n", " # Prompt the local model on Ollama\n", - " llm_output = prompt_ollama(\n", - " message=prompt\n", - " )\n", + " llm_output = prompt_ollama(message=prompt)\n", " \n", - " # Print the instruction and LLM output\n", + " # Print the LLM output\n", " print(f\"LLM Output:\\n{llm_output}\")" ] } @@ -361,7 +364,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.11.9" } }, "nbformat": 4, From 6036748a93a8ae38fe6995b52e0c71ad34acb404 Mon Sep 17 00:00:00 2001 From: Saurabh Sinha Date: Sun, 1 Sep 2024 14:08:30 -0400 Subject: [PATCH 3/4] Updated Java code summarization and test generation examples Signed-off-by: Saurabh Sinha --- .../java/notebook/code_summarization.ipynb | 147 +++++++++--------- .../java/notebook/generate_unit_tests.ipynb | 21 +-- 2 files changed, 86 insertions(+), 82 deletions(-) diff --git a/docs/examples/java/notebook/code_summarization.ipynb b/docs/examples/java/notebook/code_summarization.ipynb index 9d15c33..4d7c625 100644 --- a/docs/examples/java/notebook/code_summarization.ipynb +++ b/docs/examples/java/notebook/code_summarization.ipynb @@ -9,29 +9,20 @@ "source": [ "# Using CLDK to explain Java methods\n", "\n", - "In this tutorial, we will use CLDK to explain or generate code summary for all the methods in a Java Application.\n", + "In this tutorial, we will use CLDK to explain or generate code summary for a Java method. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis and build an LLM-based code summarizer. By the end of this tutorial, you will have implemented such a tool and generated code summary for a Java method.\n", "\n", - "By the end of this tutorial, you will have code summary for all the methods in a Java application. 
You'll be able to explore some of the benefits of using CLDK to perform fast and easy program analysis and build a LLM-based code summary generation.\n", + "Specifically, you will learn how to perform the following tasks on a Java application to create LLM prompts for code summarization:\n", "\n", - "You will learn how to do the following:\n", + "1. Create a new instance of the CLDK class.\n", + "2. Create an analysis object for the target Java application.\n", + "3. Iterate over all files in the application.\n", + "4. Iterate over all classes in a file.\n", + "5. Initialize treesitter utils for the class content.\n", + "6. Iterate over all methods in a class.\n", + "7. Get the code body of a method.\n", + "8. Sanitize the class for prompting the LLM.\n", "\n", - "
    \n", - "
  1. Create a new instance of the CLDK class.\n", - "
  2. Create an analysis object over the Java application.\n", - "
  3. Iterate over all the files in the project.\n", - "
  4. Iterate over all the classes in the file.\n", - "
  5. Iterate over all the methods in the class.\n", - "
  6. Get the code body of the method.\n", - "
  7. Initialize the treesitter utils for the class file content.\n", - "
  8. Sanitize the class for analysis.\n", - "
\n", - "Next, we will write a couple of helper methods to:\n", - "\n", - "
    \n", - "
  1. Format the instruction for the given focal method and class.\n", - "
  2. Prompts the local model on Ollama.\n", - "
  3. Use CLDK to analyze code and get context information for generating code summary.\n", - "
" + "We will write a couple of helper methods to (1) format the LLM instruction for summarizing a given target method and (2) prompt the LLM via Ollama. We will then use CLDK to go through an application and generate the summary for the target method." ] }, { @@ -45,12 +36,11 @@ "\n", "Before we get started, let's make sure you have the following installed:\n", "\n", - "
    \n", - "
  1. Python 3.11 or later\n", - "
  2. Ollama 0.3.4 or later\n", - "
  3. Java 11 or later\n", - "
      \n", - "We will use ollama to spin up a local granite model that will act as our LLM for this turorial." + "1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)\n", + "2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)\n", + "3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))\n", + "\n", + "We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial." ] }, { @@ -60,12 +50,28 @@ "collapsed": false }, "source": [ - "### Prerequisite 1: Install ollama\n", - "\n", - "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", - "Once you have ollama, start the server and make sure it is running. Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", - "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags).\n", - "Let's make sure the model is downloaded by running the following command:" + "### Download Granite code model\n", + "\n", + "After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command. There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags))."
      
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "627e7184", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "ollama pull granite-code:8b-instruct" + ] + }, + { + "cell_type": "markdown", + "id": "8cc1ca5b", + "metadata": {}, + "source": [ + " Let's make sure the model is downloaded by running the following command:" ] }, { @@ -88,7 +94,7 @@ "collapsed": false }, "source": [ - "### Prerequisite 3: Install ollama Python SDK" + "### Install Ollama Python SDK" ] }, { @@ -110,8 +116,8 @@ "collapsed": false }, "source": [ - "### Prerequisite 4: Install CLDK\n", - "CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" + "### Install CLDK\n", + "CLDK is avaliable at https://github.com/IBM/codellm-devkit. You can install it by running the following command:" ] }, { @@ -134,7 +140,7 @@ }, "source": [ "### Step 1: Get the sample Java application\n", - "For this tutorial, we will use apache commons cli. You can download the source code to a temporary directory by running the following command:" + "For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the sample Java application. You can download the source code to a temporary directory by running the following command:" ] }, { @@ -157,7 +163,8 @@ "collapsed": false }, "source": [ - "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." + "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.\n", + "" ] }, { @@ -167,12 +174,9 @@ "collapsed": false }, "source": [ - "### Generate code summary\n", - "Code summarization or code explanation is a task that converts a code written in a programming language to a natural language. 
      This particular task has several\n", - "benefits, such as understanding code without looking at its intrinsic details, documenting code for better maintenance, etc. To do that, one needs to\n", - "understand the basic details of code structure works, and use that knowledge to generate the summary using various AI-based approaches. In this particular\n", - "example, we will be using Large Language Models (LLM), specifically Granite 8B, an open-source model built by IBM. We will show how easily a developer can use\n", - "CLDK to expose various parts of the code by calling various APIs without implementing various time-intensive program analyses from scratch." + "## Generate code summary\n", + "\n", + "Code summarization or code explanation is the task of converting code written in a programming language to natural language. It has several benefits, such as understanding code without looking at its intrinsic details, documenting code for better maintenance, etc. To perform code summarization, one needs to understand the basic details of code implementation, and use that knowledge to generate the summary using various AI-based approaches. In this tutorial, we will use LLMs, specifically Granite code 8b-instruct. We will show how a developer can easily use CLDK to analyze code by calling various APIs without having to implement such analyses." ] }, { @@ -182,7 +186,7 @@ "collapsed": false }, "source": [ - "Step 1: Add all the neccessary imports" + "Step 1: Add the necessary imports" ] }, { @@ -194,7 +198,6 @@ }, "outputs": [], "source": [ - "from pathlib import Path\n", "import ollama\n", "from cldk import CLDK\n", "from cldk.analysis import AnalysisLevel" @@ -207,8 +210,7 @@ "collapsed": false }, "source": [ - "Step 2: Formulate the LLM prompt. The prompt can be tailored towards various needs.
      
In this case, we show a simple example of generating summary for each\n", - "method in a Java class" + "Step 2: Define a function for creating the LLM prompt, which instructs the LLM to summarize a Java method and includes relevant code for the task." ] }, { @@ -249,7 +251,7 @@ "collapsed": false }, "source": [ - "Step 3: Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." + "Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama." ] }, { @@ -274,7 +276,7 @@ "collapsed": false }, "source": [ - "Step 4: Create an object of CLDK and provide the programming language of the source code." + "Step 4: Create an instance of CLDK and provide the programming language of the source code." ] }, { @@ -286,7 +288,7 @@ }, "outputs": [], "source": [ - "# Create a new instance of the CLDK class\n", + "# Create an instance of CLDK for Java analysis\n", "cldk = CLDK(language=\"java\")" ] }, @@ -297,10 +299,7 @@ "collapsed": false }, "source": [ - "Step 5: CLDK uses different analysis engine--Codeanalyzer (built using WALA and Javaparser), Treesitter, and CodeQL (future). By default, codenanalyzer has\n", - "been selected as the default analysis engine. Also, CLDK support different analysis levels--(a) symbol table, (b) call graph, (c) program dependency graph, and\n", - "(d) system dependency graph. Analysis engine can be selected using ```AnalysisLevel``` enum. In this example, we will generate summarization of all the methods\n", - "of an application. " + "Step 5: Select the analysis engine and analysis level. 
      CLDK uses different analysis engines---[CodeAnalyzer](https://github.com/IBM/codenet-minerva-code-analyzer) (built over [WALA](https://github.com/wala/WALA) and [JavaParser](https://github.com/javaparser/javaparser)), [Treesitter](https://tree-sitter.github.io/tree-sitter/), and [CodeQL](https://codeql.github.com/) (future)---with CodeAnalyzer being the default analysis engine. CLDK supports different analysis levels: (1) symbol table, (2) call graph, (3) program dependency graph, and (4) system dependency graph. The analysis level can be selected using the `AnalysisLevel` enumerated type. For this example, we select the symbol-table analysis level, with CodeAnalyzer as the default analysis engine." ] }, { @@ -323,9 +322,9 @@ "collapsed": false }, "source": [ - "Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a customized Java class in the prompt. For instance,\n", + "Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a sanitized Java class in the prompt, containing only the relevant information for summarizing the target method. To illustrate, consider the following class:\n", "\n", - "```\n", + "```java\n", "package com.ibm.org;\n", "import A.B.C.D;\n", "...\n", @@ -345,7 +344,7 @@ " // do somthing\n", " } \n", "```\n", - "Given the above class, let's say we want to generate a summary for the ```bar``` method. To understand what it does, we add the callee of this method in the prompt, which in this case is ```baz```. We also remove imports, comments, etc. All of these are done using a single call to ```sanitize_focal_class``` API. In this process, we also use Treesitter to analyze the code. Once the input code has been sanitized, we call the ```format_inst``` method to create the LLM prompt, which has been passed to ```prompt_ollama``` method to generate the summary using LLM." + "Let's say we want to generate a summary for method `bar`.
      
To understand what it does, we add the callees of this method in the prompt, which in this case includes `baz`. We remove the other methods, imports, comments, etc. All of this can be achieved with a single call to CLDK's `sanitize_focal_class` API. In this process, we also use Treesitter to analyze the code. After creating the sanitized code, we call the previously defined `format_inst` method to create the LLM prompt and pass the prompt to `prompt_ollama` to generate the method summary." ] }, { @@ -357,30 +356,34 @@ }, "outputs": [], "source": [ - "# For simplicity, we run the code summarization for a single class and method. One can remove that filter to run this code for the entire application\n", - "qualified_class_name = 'org.apache.commons.cli.GnuParser'\n", - "method_signature = 'flatten(Options, String[], boolean)'\n", - "# Iterate over all the files in the project\n", + "# For simplicity, we run the code summarization for a single class and method (this filter can be removed to run this code over the entire application)\n", + "target_class = \"org.apache.commons.cli.GnuParser\"\n", + "target_method = \"flatten(Options, String[], boolean)\"\n", + "\n", + "# Iterate over all classes in the application\n", "for class_name in analysis.get_classes():\n", - " if class_name==qualified_class_name:\n", + " if class_name == target_class:\n", " class_file_path = analysis.get_java_file(qualified_class_name=class_name)\n", - " # Iterate over all the methods in the class\n", + "\n", + " # Read code for the class\n", + " with open(class_file_path, 'r') as f:\n", + " code_body = f.read()\n", + "\n", + " # Initialize treesitter utils for the class file content\n", + " tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n", + " \n", + " # Iterate over all methods in class\n", " for method in analysis.get_methods_in_class(qualified_class_name=class_name):\n", - " if method==method_signature:\n", - " # Get code body of the method\n", - " with 
open(class_file_path, 'r') as f:\n", - " code_body = f.read()\n", + " if method == target_method:\n", " \n", - " # Initialize the treesitter utils for the class file content\n", - " tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n", - " \n", " # Get all the method details\n", " method_details = analysis.get_method(qualified_class_name=class_name,\n", " qualified_method_name=method)\n", - " # Sanitize the class for analysis\n", + " \n", + " # Sanitize the class for analysis with respect to the target method\n", " sanitized_class = tree_sitter_utils.sanitize_focal_class(method_details.declaration)\n", " \n", - " # Format the instruction for the given focal method and class\n", + " # Format the instruction for the given target method and class\n", " instruction = format_inst(\n", " code=sanitized_class,\n", " focal_method=method_details.declaration,\n", @@ -417,7 +420,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.11.9" } }, "nbformat": 4, diff --git a/docs/examples/java/notebook/generate_unit_tests.ipynb b/docs/examples/java/notebook/generate_unit_tests.ipynb index 90465f6..43f6335 100644 --- a/docs/examples/java/notebook/generate_unit_tests.ipynb +++ b/docs/examples/java/notebook/generate_unit_tests.ipynb @@ -23,7 +23,7 @@ "\n", "\n", - "We will write a couple of helper methods to (1) format the LLM instruction for generating test cases for a given focal method (i.e., method under test) and (2) prompt the LLM via Ollama. We will then use CLDK to go through an application and generate unit test cases for a target method." + "We will write a couple of helper methods to (1) format the LLM instruction for generating test cases for a given focal method (i.e., method under test) and (2) prompt the LLM via Ollama. We will then use CLDK to go through an application and generate unit test cases for the target method." 
] }, { @@ -72,7 +72,7 @@ "id": "02d5bbfa", "metadata": {}, "source": [ - " Let's make sure the model is downloaded by running the following command: " + " Let's make sure the model is downloaded by running the following command:" ] }, { @@ -168,7 +168,8 @@ "collapsed": false }, "source": [ - "The project will be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." + "The project will be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.\n", + "" ] }, { @@ -178,16 +179,16 @@ "collapsed": false }, "source": [ - "## Building a JUnit test generator using CLDK and Granite Code Instruct Model\n", + "## Build a JUnit test generator using CLDK and Granite Code Model\n", "\n", "Now that we have all the prerequisites installed, let's start building a JUnit test generator using CLDK and the Granite Code Instruct Model.\n", "\n", - "Generating unit tests for code is a tedious task and developers often have to put in significant effort in writing good test cases. There are various tools available for automated test generation, such as EvoSuite, which uses evolutionary algorithms to generate unit test cases for Java. However, the generated test cases are not natural and often developers do not prefer to add them to their test suites. Large Language Models (LLMs), having been trained with developer-written code, have a better affinity towards generating more natural code---code that is more readable, comprehensible, and maintainable. In this excercise, we will show how we can leverage LLMs to generate test cases with the help of CLDK. \n", + "Generating unit tests for code is a tedious task and developers often have to put in significant effort in writing good test cases. There are various tools available for automated test generation, such as EvoSuite, which uses evolutionary algorithms to generate unit test cases for Java. 
      However, the generated test cases are not natural and often developers do not prefer to add them to their test suites. LLMs, having been trained with developer-written code, have a better affinity towards generating more natural code---code that is more readable, comprehensible, and maintainable. In this exercise, we will show how we can leverage LLMs to generate test cases with the help of CLDK. \n", "\n", "For simplicity, we will cover certain aspects of test generation and provide some context information to the LLM to help it create usable test cases. In this exercise, we will generate a unit test for a non-private method from a Java class and provide the focal method body and the signature of all the constructors of the class so that LLM can understand how to create an object of the focal class during the setup phase of the tests.\n", "\n", "\n", - "(Step 1) First, we will import all the necessary libraries." + "Step 1: Import the required modules." ] }, { @@ -211,7 +212,7 @@ "collapsed": false }, "source": [ - "(Step 2) Second, we will define a function for creating the LLM prompt, which includes signatures of relevant constructors and the body of the focal method." + "Step 2: Define a function for creating the LLM prompt, which instructs the LLM to generate unit test cases and includes signatures of relevant constructors and the body of the focal method." ] }, { @@ -247,7 +248,7 @@ "collapsed": false }, "source": [ - "(Step 3) Third, we define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama." + "Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama." ] }, { @@ -272,7 +273,7 @@ "collapsed": false }, "source": [ - "(Step 4) Fourth, we collect the relevant information for the focal method. To do this, we go through all the classes in the application, and for each class, we collect the signatures of its constructors.
      
If a class has no constructors, we add the signature of the default constructor. Then, we go through each non-private method of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call LLM to generate test cases and get the LLM response." + "Step 4: Collect the relevant information for the focal method and prompt the LLM. To do this, we go through all the classes in the application, and for each class, we collect the signatures of its constructors. If a class has no constructors, we add the signature of the default constructor. Then, we go through each non-private method of the class and formulate the prompt using the constructor and the method information. Finally, we use the prompt to call LLM to generate test cases and get the LLM response." ] }, { @@ -284,7 +285,7 @@ }, "outputs": [], "source": [ - "# Create an instance of the CLDK class for Java analysis\n", + "# Create an instance CLDK for Java analysis\n", "cldk = CLDK(language=\"java\")\n", "\n", "# Create an analysis object for the Java application. 
Provide the application path.\n", From 072e978fc3d821ca0493ff555df542515334308a Mon Sep 17 00:00:00 2001 From: Saurabh Sinha Date: Sun, 1 Sep 2024 16:39:57 -0400 Subject: [PATCH 4/4] Updated Java translation, code summarization, and test generation examples Signed-off-by: Saurabh Sinha --- .../java/notebook/code_summarization.ipynb | 14 +- .../java/notebook/generate_unit_tests.ipynb | 6 +- .../validating_code_translation.ipynb | 228 ++++++++++-------- 3 files changed, 130 insertions(+), 118 deletions(-) diff --git a/docs/examples/java/notebook/code_summarization.ipynb b/docs/examples/java/notebook/code_summarization.ipynb index 4d7c625..48a3ee2 100644 --- a/docs/examples/java/notebook/code_summarization.ipynb +++ b/docs/examples/java/notebook/code_summarization.ipynb @@ -311,7 +311,7 @@ }, "outputs": [], "source": [ - "# Create an analysis object over the java application\n", + "# Create an analysis object for the Java application\n", "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)" ] }, @@ -356,7 +356,7 @@ }, "outputs": [], "source": [ - "# For simplicity, we run the code summarization for a single class and method (this filter can be removed to run this code over the entire application)\n", + "# For simplicity, we run the code summarization on a single class and method (this filter can be removed to run this code over the entire application)\n", "target_class = \"org.apache.commons.cli.GnuParser\"\n", "target_method = \"flatten(Options, String[], boolean)\"\n", "\n", @@ -366,7 +366,7 @@ " class_file_path = analysis.get_java_file(qualified_class_name=class_name)\n", "\n", " # Read code for the class\n", - " with open(class_file_path, 'r') as f:\n", + " with open(class_file_path, \"r\") as f:\n", " code_body = f.read()\n", "\n", " # Initialize treesitter utils for the class file content\n", @@ -387,17 +387,15 @@ " instruction = format_inst(\n", " code=sanitized_class,\n", " 
focal_method=method_details.declaration,\n", - " focal_class=class_name.split('.')[-1],\n", + " focal_class=class_name.split(\".\")[-1],\n", " language=\"java\"\n", " )\n", " \n", " print(f\"Instruction:\\n{instruction}\\n\")\n", - " print(f\"Generating code summary . . .\\n\")\n", + " print(f\"Generating code summary ...\\n\")\n", " \n", " # Prompt the local model on Ollama\n", - " llm_output = prompt_ollama(\n", - " message=instruction\n", - " )\n", + " llm_output = prompt_ollama(message=instruction)\n", " \n", " # Print the LLM output\n", " print(f\"LLM Output:\\n{llm_output}\")" diff --git a/docs/examples/java/notebook/generate_unit_tests.ipynb b/docs/examples/java/notebook/generate_unit_tests.ipynb index 43f6335..57cb7b0 100644 --- a/docs/examples/java/notebook/generate_unit_tests.ipynb +++ b/docs/examples/java/notebook/generate_unit_tests.ipynb @@ -285,13 +285,13 @@ }, "outputs": [], "source": [ - "# Create an instance CLDK for Java analysis\n", + "# Create an instance of CLDK for Java analysis\n", "cldk = CLDK(language=\"java\")\n", "\n", "# Create an analysis object for the Java application. 
Provide the application path.\n", "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", "\n", - "# For simplicity, we run the test generation for a single focal class and method (this filter can be removed to run this code over the entire application)\n", + "# For simplicity, we run the test generation on a single focal class and method (this filter can be removed to run this code over the entire application)\n", "focal_class = \"org.apache.commons.cli.GnuParser\"\n", "focal_method = \"flatten(Options, String[], boolean)\"\n", "\n", @@ -334,7 +334,7 @@ " focal_method=method.split(\"(\")[0],\n", " focal_class=focal_class_name,\n", " constructor_signatures=constructor_signatures,\n", - " language=\"Java\"\n", + " language=\"java\"\n", " )\n", " \n", " # Print the instruction\n", diff --git a/docs/examples/java/notebook/validating_code_translation.ipynb b/docs/examples/java/notebook/validating_code_translation.ipynb index 27636fd..9abc5d8 100644 --- a/docs/examples/java/notebook/validating_code_translation.ipynb +++ b/docs/examples/java/notebook/validating_code_translation.ipynb @@ -2,176 +2,181 @@ "cells": [ { "cell_type": "markdown", + "id": "fcac940432e10687", + "metadata": { + "collapsed": false + }, "source": [ "# Using CLDK to validate code translation\n", "\n", - "In this tutorial, we will use CLDK to valdiate translated code.\n", + "In this tutorial, we will use CLDK to translate code and check properties of the translated code. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis for this task. By the end of this tutorial, you will have implemented a simple Java-to-Python code translator that also performs light-weight property checking on the translated code.\n", "\n", - "By the end of this tutorial, you will have a very light-weight approach for validating code translated from Java to Python. 
You'll be able to explore some of the benefits of using CLDK to perform fast and easy program analysis.\n", + "Specifically, you will learn how to perform the following tasks on a Java application to create LLM prompts for code translation and checking the translated code:\n", "\n", - "You will learn how to do the following:\n", + "1. Create a new instance of the CLDK class.\n", + "2. Create an analysis object for the target Java application.\n", + "3. Iterate over all files in the application.\n", + "4. Iterate over all classes in a file.\n", + "5. Sanitize the class for prompting the LLM.\n", + "6. Create treesitter-based Java and Python analysis objects.\n", "\n", - "
    \n", - "
  1. Create a new instance of the CLDK class.\n", - "
  2. Create an analysis object over the Java application.\n", - "
  3. Iterate over all the files in the project.\n", - "
  4. Iterate over all the classes in the file.\n", - "
  5. Iterate over all the methods in the class.\n", - "
  6. Get the code body of the method.\n", - "
  7. Initialize the treesitter utils for the class file content.\n", - "
  8. Sanitize the class for analysis.\n", - "
\n", - "Next, we will write a couple of helper methods to:\n", - "\n", - "
    \n", - "
  1. Format the instruction for the given focal method and class.\n", - "
  2. Prompts the local model on Ollama.\n", - "
  3. Use CLDK to analyze code and get context information for translating code.\n", - "
" - ], - "metadata": { - "collapsed": false - }, - "id": "fcac940432e10687" + "We will write a couple of helper methods to (1) format the LLM instruction for translating a Java class to Python and (2) prompt the LLM via Ollama. We will then use CLDK to analyze code and get context information for translating code and also checking properties of the translated code." + ] }, { "cell_type": "markdown", + "id": "e9411e761b32fcbc", + "metadata": { + "collapsed": false + }, "source": [ "## Prequisites\n", "\n", "Before we get started, let's make sure you have the following installed:\n", "\n", - "
    \n", - "
  1. Python 3.11 or later\n", - "
  2. Ollama 0.3.4 or later\n", - "
  3. Java 11 or later\n", - "
\n", - "We will use ollama to spin up a local granite model that will act as our LLM for this turorial." - ], - "metadata": { - "collapsed": false - }, - "id": "e9411e761b32fcbc" + "1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)\n", + "2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)\n", + "3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))\n", + "\n", + "We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial." + ] }, { "cell_type": "markdown", + "id": "5c7c3ccb", + "metadata": {}, "source": [ - "### Prerequisite 1: Install ollama\n", + "### Download Granite code model\n", "\n", - "If you don't have ollama installed, please download and install it from here: [Ollama](https://ollama.com/download).\n", - "Once you have ollama, start the server and make sure it is running. Once ollama is up and running, you can download the latest version of the Granite 8b Instruct model by running the following command:\n", - "There are other granite versions available, but for this tutorial, we will use the Granite 8b Instruct model. You if prefer to use a different version, you can replace `8b-instruct` with any of the other [versions](https://ollama.com/library/granite-code/tags).\n", - "Let's make sure the model is downloaded by running the following command:" - ], + "After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command. There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags))." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db17a05f", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "ollama pull granite-code:8b-instruct" + ] + }, + { + "cell_type": "markdown", + "id": "930b603c7eb3cd55", "metadata": { "collapsed": false }, - "id": "930b603c7eb3cd55" + "source": [ + "Let's make sure the model is downloaded by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, + "id": "635bb847107749f8", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "%%bash\n", "ollama run granite-code:8b-instruct \\\"Write a python function to print 'Hello, World!'" - ], - "metadata": { - "collapsed": false - }, - "id": "635bb847107749f8" + ] }, { "cell_type": "markdown", - "source": [ - "### Prerequisite 3: Install ollama Python SDK" - ], + "id": "a6015cb7728debca", "metadata": { "collapsed": false }, - "id": "a6015cb7728debca" + "source": [ + "### Install Ollama Python SDK" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "pip install ollama" - ], + "id": "9dceb297bbab0ab3", "metadata": { "collapsed": false }, - "id": "9dceb297bbab0ab3" + "outputs": [], + "source": [ + "pip install ollama" + ] }, { "cell_type": "markdown", - "source": [ - "### Prerequisite 4: Install CLDK\n", - "CLDK is avaliable on github at github.com/IBM/codellm-devkit.git. You can install it by running the following command:" - ], + "id": "e06325ad56287f0b", "metadata": { "collapsed": false }, - "id": "e06325ad56287f0b" + "source": [ + "### Install CLDK\n", + "CLDK is available at https://github.com/IBM/codellm-devkit. 
You can install it by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "pip install git+https://github.com/IBM/codellm-devkit.git" - ], + "id": "d6dc34436d0f2d15", "metadata": { "collapsed": false }, - "id": "d6dc34436d0f2d15" + "outputs": [], + "source": [ + "pip install git+https://github.com/IBM/codellm-devkit.git" + ] }, { "cell_type": "markdown", - "source": [ - "### Step 1: Get the sample Java application\n", - "For this tutorial, we will use apache commons cli. You can download the source code to a temporary directory by running the following command:" - ], + "id": "6e4ef425987e53ed", "metadata": { "collapsed": false }, - "id": "6e4ef425987e53ed" + "source": [ + "### Get the sample Java application\n", + "For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the Java application under test. You can download the source code to a temporary directory by running the following command:" + ] }, { "cell_type": "code", "execution_count": null, + "id": "98ddaf361bb8c025", + "metadata": { + "collapsed": false + }, "outputs": [], "source": [ "%%bash\n", "wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp" - ], - "metadata": { - "collapsed": false - }, - "id": "98ddaf361bb8c025" + ] }, { "cell_type": "markdown", - "source": [ - "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`. We'll remove these files later, so don't worry about the location." 
- ], + "id": "7a963481d3c7d083", "metadata": { "collapsed": false }, - "id": "7a963481d3c7d083" + "source": [ + "The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.\n", + "" + ] }, { "cell_type": "markdown", - "source": [ - "### Translating Jave code to Python and build a light-weight validation logic\n", - "Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent paper [https://dl.acm.org/doi/10.1145/3597503.3639226] published at ICSE'24, we found that LLM-based code translation is very promising. In this example, we will walk through the steps of translating each Java class to Python and checking various properties of translated code, such as the number of methods, number of fields, formal arguments, etc.\n", - "\n", - "(Step 1) First, we will import all the necessary libraries" - ], + "id": "47af1410ab0a3b4d", "metadata": { "collapsed": false }, - "id": "47af1410ab0a3b4d" + "source": [ + "## Translate Java code to Python and build a light-weight property checker (for translation validation)\n", + "Code translation aims to convert source code from one programming language to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent work, [presented at ICSE'24](https://dl.acm.org/doi/10.1145/3597503.3639226), we found that LLM-based code translation is very promising. In this example, we will walk through the steps of translating a Java class to Python and checking various properties of translated code (e.g., number of methods, number of fields, formal arguments, etc.) 
as a simple form of translation validation.\n", + "\n", + "Step 1: Import the required modules" + ] }, { "cell_type": "code", @@ -196,7 +201,7 @@ "collapsed": false }, "source": [ - "(Step 2) Second, we will form the prompt for the model, which will include the body of the Java class after removing all the comments and the import statements." + "Step 2: Define a function for creating the LLM prompt, which instructs the LLM to translate a Java class to Python and includes the body of the Java class after removing all the comments and import statements." ] }, { @@ -229,7 +234,7 @@ "collapsed": false }, "source": [ - "(Step 3) Create a function to call LLM. There are various ways to achieve that. However, for illustrative purpose, we use ollama, a library to communicate with models downloaded locally." + "Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama." ] }, { @@ -254,7 +259,7 @@ "collapsed": false }, "source": [ - "(Step 4) Translate each class in the application and check certain properties of the translated code, such as (a) number of translated method, and (b) number of translated fields. " + "Step 4: Translate a class of the Java application to Python and check two properties of the translated code: the number of translated methods and the number of translated fields." ] }, { @@ -266,52 +271,61 @@ }, "outputs": [], "source": [ - "# Create a new instance of the CLDK class\n", + "# Create an instance of CLDK for Java analysis\n", "cldk = CLDK(language=\"java\")\n", "\n", - "# Create an analysis object over the java application\n", + "# Create an analysis object for the Java application, providing the application path\n", "analysis = cldk.analysis(project_path=\"/tmp/commons-cli-rel-commons-cli-1.7.0\", analysis_level=AnalysisLevel.symbol_table)\n", "\n", - "# For simplicity, we run the code translation for a single class. 
One can remove that filter to run this code for the entire application\n", - "qualified_class_name = 'org.apache.commons.cli.GnuParser'\n", + "# For simplicity, we run the code translation on a single class (this filter can be removed to run this code over the entire application)\n", + "target_class = \"org.apache.commons.cli.GnuParser\"\n", "\n", "# Go through all the classes in the application\n", "for class_name in analysis.get_classes():\n", " \n", - " if class_name==qualified_class_name:\n", + " if class_name == target_class:\n", " # Get the location of the Java class\n", " class_path = analysis.get_java_file(qualified_class_name=class_name)\n", " \n", " # Read the file content\n", " if not class_path:\n", - " class_body = ''\n", - " with open(class_path, 'r', encoding='utf-8', errors='ignore') as f:\n", + " class_body = \"\"\n", + " with open(class_path, \"r\", encoding=\"utf-8\", errors=\"ignore\") as f:\n", " class_body = f.read()\n", " \n", - " # Sanitize the file content by removing comments.\n", - " tree_sitter_utils = cldk.tree_sitter_utils(source_code=class_body)\n", + " # Sanitize the file content by removing comments\n", " sanitized_class = JavaSitter().remove_all_comments(source_code=class_body)\n", "\n", - " inst = format_inst(code=sanitized_class, language='java', focal_class=class_name.split('.')[-1])\n", + " # Create prompt for translating sanitized Java class to Python\n", + " inst = format_inst(code=sanitized_class, language=\"java\", focal_class=class_name.split(\".\")[-1])\n", "\n", " print(f\"Instruction:\\n{inst}\\n\")\n", " print(f\"Translating Java code to Python . . 
.\\n\")\n", - " translated_code = prompt_ollama(\n", - " message=inst)\n", + "\n", + " # Prompt the local model on Ollama\n", + " translated_code = prompt_ollama(message=inst)\n", " \n", - " print(f\"Translated Python code: {translated_code}\")\n", + " # Print translated code\n", + " print(f\"Translated Python code: {translated_code}\\n\")\n", + "\n", + " # Create python sitter instance for analyzing translated Python code\n", " py_cldk = PythonSitter()\n", + "\n", + " # Compute methods, function, and field counts for translated code\n", " all_methods = py_cldk.get_all_methods(module=translated_code)\n", " all_functions = py_cldk.get_all_functions(module=translated_code)\n", " all_fields = py_cldk.get_all_fields(module=translated_code)\n", " \n", + " # Check counts against method and field counts for Java code\n", " if len(all_methods) + len(all_functions) != len(analysis.get_methods_in_class(qualified_class_name=class_name)):\n", " print(f'Number of translated method not matching in class {class_name}')\n", " else:\n", " print(f'Number of translated method in class {class_name} is {len(all_methods)}')\n", - " if all_fields:\n", + " if all_fields is not None:\n", " if len(all_fields) != len(analysis.get_class(qualified_class_name=class_name).field_declarations):\n", - " print(f'Number of translated field not matching in class {class_name}') " + " print(f'Number of translated field not matching in class {class_name}')\n", + " else:\n", + " print(f'Number of translated fields in class {class_name} is {len(all_fields)}')\n" ] } ], @@ -331,7 +345,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.11.9" } }, "nbformat": 4,