prompt-opt-notebook #2023
Conversation
Pull Request Overview
This PR introduces a comprehensive prompt optimization notebook and accompanying evaluation scripts for the OpenAI Cookbook. The goal is to demonstrate how to systematically improve prompt effectiveness using GPT-5's new prompt optimization capabilities.
Key Changes:
- Added a new prompt optimization cookbook notebook with end-to-end evaluation pipeline
- Implemented baseline and optimized code generation scripts for performance comparison
- Created comprehensive evaluation tools including LLM-as-judge and FailSafeQA benchmark integration
Reviewed Changes
Copilot reviewed 139 out of 146 changed files in this pull request and generated 8 comments.
Summary per file:

| File | Description |
|---|---|
| registry.yaml | Adds entry for the new prompt optimization cookbook |
| scripts/topk_eval.py | Core evaluation script for measuring code generation performance with memory/time tracking |
| scripts/results_summarizer.py | Analysis and visualization tools for comparing baseline vs optimized results |
| scripts/llm_judge.py | LLM-based code quality evaluation using GPT-5 as judge |
| scripts/gen_optimized.py | Script for generating optimized code solutions using improved prompts |
| scripts/gen_baseline.py | Script for generating baseline code solutions for comparison |
| run_FailSafeQA.py | Integration with FailSafeQA benchmark for robustness evaluation |
| results_topk_* | Generated evaluation results and code samples from baseline and optimized runs |
Comments suppressed due to low confidence (1)

examples/gpt-5/prompt-optimization-cookbook/scripts/results_summarizer.py:16

- The comment states rating >= 4 as compliant, but the constant is set to 6, creating a contradiction. Either fix the comment or the value.
```python
exact: Optional[bool]

# Head: decreasing counts
for i, tok in enumerate(vocab_top[:150], start=1):
    c = max(1200, int(5000 / (i ** 0.5)))
```
The magic numbers 1200, 5000, and 0.5 should be defined as named constants to improve code readability and maintainability.
```diff
-    c = max(1200, int(5000 / (i ** 0.5)))
+    c = max(MIN_HEAD_COUNT, int(HEAD_COUNT_BASE / (i ** HEAD_COUNT_EXPONENT)))
```
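If the suggestion is adopted, the constants might be declared once near the top of the script; the names follow the suggestion above and the values are the originals from the snippet (a sketch, not the PR's code):

```python
# Parameters for the head-of-distribution count curve (illustrative names).
MIN_HEAD_COUNT = 1200      # floor applied to every head token's count
HEAD_COUNT_BASE = 5000     # numerator of the power-law decay
HEAD_COUNT_EXPONENT = 0.5  # decay exponent over the token's 1-based rank
```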
```python
# --------------- Config ---------------

COMPLIANCE_THRESHOLD = 6  # treat judge rating >= 4 as compliant (see paper rubric)
```
Same issue as in results_summarizer.py: the comment contradicts the constant value. The threshold is 6, but the comment says >= 4.
```diff
-COMPLIANCE_THRESHOLD = 6  # treat judge rating >= 4 as compliant (see paper rubric)
+COMPLIANCE_THRESHOLD = 4  # treat judge rating >= 4 as compliant (see paper rubric)
```
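For context, a minimal sketch of how such a threshold is typically consumed downstream; the helper name is hypothetical, not from the PR:

```python
def compliance_rate(ratings, threshold=4):
    # Fraction of judge ratings at or above the compliance threshold.
    if not ratings:
        return 0.0
    return sum(r >= threshold for r in ratings) / len(ratings)
```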
```python
resp = client.responses.create(
    model=model,
    input=messages,
    text={"format": {"type": "text"}, "verbosity": "medium"},
    reasoning={"effort": "medium", "summary": "auto"},
    tools=[],
```
The responses.create API call uses a non-standard pattern with an input/content structure that may not be compatible with the standard OpenAI API. Consider using the standard chat completions API format.
```diff
-resp = client.responses.create(
-    model=model,
-    input=messages,
-    text={"format": {"type": "text"}, "verbosity": "medium"},
-    reasoning={"effort": "medium", "summary": "auto"},
-    tools=[],
+resp = client.chat.completions.create(
+    model=model,
+    messages=messages,
```
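For reference, the suggested call completed with the response extraction it would need; `client`, `model`, and `messages` are assumed to exist as in the PR's script, and the extraction line is a sketch rather than the file's actual code:

```python
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)
# Chat Completions returns choices rather than output_text.
judge_text = resp.choices[0].message.content
```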
```python
payload = {
    "model": model,
    "input": [
        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
for attempt in range(max_retries):
    try:
        resp = client.responses.create(**payload)
        return getattr(resp, "output_text", str(resp))
```
Same issue as llm_judge.py - using non-standard responses.create API format instead of standard OpenAI chat completions API.
```diff
-payload = {
-    "model": model,
-    "input": [
-        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
-        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
-    ],
-    "text": {"format": {"type": "text"}, "verbosity": "medium"},
-    "reasoning": {"effort": "medium", "summary": "auto"},
-    "tools": [],
-}
-for attempt in range(max_retries):
-    try:
-        resp = client.responses.create(**payload)
-        return getattr(resp, "output_text", str(resp))
+messages = [
+    {"role": "system", "content": dev_prompt},
+    {"role": "user", "content": user_prompt},
+]
+for attempt in range(max_retries):
+    try:
+        resp = client.chat.completions.create(
+            model=model,
+            messages=messages,
+            temperature=0.7,
+            max_tokens=2048,
+        )
+        return resp.choices[0].message.content
```
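Whichever API shape is kept, the `for attempt in range(max_retries)` loop in the hunk is only half shown; a minimal sketch of the full retry-with-backoff pattern, with a hypothetical `send_request` callable standing in for the actual API call:

```python
import time

def call_with_retries(send_request, max_retries=3, base_delay=1.0):
    # Retry transient failures with exponential backoff; re-raise on the final attempt.
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```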
```python
payload = {
    "model": model,
    "input": [
        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
for attempt in range(max_retries):
    try:
        resp = client.responses.create(**payload)
        return getattr(resp, "output_text", str(resp))
```
Same issue - using non-standard responses.create API format instead of standard OpenAI chat completions API.
```diff
-payload = {
-    "model": model,
-    "input": [
-        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
-        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
-    ],
-    "text": {"format": {"type": "text"}, "verbosity": "medium"},
-    "reasoning": {"effort": "medium", "summary": "auto"},
-    "tools": [],
-}
-for attempt in range(max_retries):
-    try:
-        resp = client.responses.create(**payload)
-        return getattr(resp, "output_text", str(resp))
+messages = [
+    {"role": "system", "content": dev_prompt},
+    {"role": "user", "content": user_prompt},
+]
+for attempt in range(max_retries):
+    try:
+        resp = client.chat.completions.create(
+            model=model,
+            messages=messages,
+        )
+        return resp.choices[0].message.content
```
```python
# Align with Responses API pattern used in gen_baseline.py
payload = {
    "model": model,
    "input": [
        {
            "role": "developer",
            "content": [{"type": "input_text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [{"type": "input_text", "text": user_prompt}],
        },
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
resp = self.client.responses.create(**payload)
return resp.output_text
```
Same API consistency issue - using responses.create instead of standard chat completions format.
```diff
-# Align with Responses API pattern used in gen_baseline.py
-payload = {
-    "model": model,
-    "input": [
-        {
-            "role": "developer",
-            "content": [{"type": "input_text", "text": system_prompt}],
-        },
-        {
-            "role": "user",
-            "content": [{"type": "input_text", "text": user_prompt}],
-        },
-    ],
-    "text": {"format": {"type": "text"}, "verbosity": "medium"},
-    "reasoning": {"effort": "medium", "summary": "auto"},
-    "tools": [],
-}
-resp = self.client.responses.create(**payload)
-return resp.output_text
+# Use standard OpenAI chat completions format
+messages = [
+    {"role": "system", "content": system_prompt},
+    {"role": "user", "content": user_prompt},
+]
+resp = self.client.chat.completions.create(
+    model=model,
+    messages=messages,
+)
+return resp.choices[0].message.content
```
```python
        pass
    return None


def is_sorted_topk(pairs):
```
The function checks if pairs are sorted but could be more efficient. Consider using early termination or vectorized operations for large lists.
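An early-terminating variant along the lines of the comment, assuming `pairs` is a list of `(token, count)` tuples expected in non-increasing count order (a sketch, not the PR's implementation):

```python
def is_sorted_topk(pairs):
    # all() over a generator short-circuits at the first out-of-order neighbor.
    return all(pairs[i][1] >= pairs[i + 1][1] for i in range(len(pairs) - 1))
```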
```python
summaries = summarize_groups(quant_paths=quant_paths, judge_paths=judge_paths)

# Build figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
```
[nitpick] The figure size (15, 8) is hardcoded. Consider making it configurable or calculating it based on content for better flexibility.
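One way to expose the size, sketched with a hypothetical wrapper function rather than the PR's actual plotting code:

```python
import matplotlib.pyplot as plt

def build_summary_figure(summaries, figsize=(15, 8)):
    # Figure size is a keyword argument instead of a hardcoded literal.
    fig, axes = plt.subplots(2, 3, figsize=figsize)
    return fig, axes
```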
Summary
Briefly describe the changes and the goal of this PR. Make sure the PR title summarizes the changes effectively.
Motivation
Why are these changes necessary? How do they improve the cookbook?
For new content
When contributing new content, read through our contribution guidelines, and mark the following action items as completed:
We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.