prompt-opt-notebook #2023


Merged
merged 3 commits into from
Aug 7, 2025

Conversation

rajpathak-openai
Contributor

Summary

Briefly describe the changes and the goal of this PR. Make sure the PR title summarizes the changes effectively.

Motivation

Why are these changes necessary? How do they improve the cookbook?


For new content

When contributing new content, read through our contribution guidelines, and mark the following action items as completed:

  • [x] I have added a new entry in registry.yaml (and, optionally, in authors.yaml) so that my content renders on the cookbook website.
  • [x] I have conducted a self-review of my content based on the contribution guidelines:
    • [x] Relevance: This content is related to building with OpenAI technologies and is useful to others.
    • [x] Uniqueness: I have searched for related examples in the OpenAI Cookbook, and verified that my content offers new insights or unique information compared to existing documentation.
    • [x] Spelling and Grammar: I have checked for spelling or grammatical mistakes.
    • [x] Clarity: I have done a final read-through and verified that my submission is well-organized and easy to understand.
    • [x] Correctness: The information I include is correct and all of my code executes successfully.
    • [x] Completeness: I have explained everything fully, including all necessary references and citations.

We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.

@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR introduces a comprehensive prompt optimization notebook and accompanying evaluation scripts for the OpenAI Cookbook. The goal is to demonstrate how to improve prompt effectiveness through systematic optimization techniques using GPT-5's new prompt optimization capabilities.

Key Changes:

  • Added a new prompt optimization cookbook notebook with end-to-end evaluation pipeline
  • Implemented baseline and optimized code generation scripts for performance comparison
  • Created comprehensive evaluation tools including LLM-as-judge and FailSafeQA benchmark integration

Reviewed Changes

Copilot reviewed 139 out of 146 changed files in this pull request and generated 8 comments.

Summary per file:

| File | Description |
| --- | --- |
| registry.yaml | Adds entry for the new prompt optimization cookbook |
| scripts/topk_eval.py | Core evaluation script for measuring code generation performance with memory/time tracking |
| scripts/results_summarizer.py | Analysis and visualization tools for comparing baseline vs. optimized results |
| scripts/llm_judge.py | LLM-based code quality evaluation using GPT-5 as judge |
| scripts/gen_optimized.py | Script for generating optimized code solutions using improved prompts |
| scripts/gen_baseline.py | Script for generating baseline code solutions for comparison |
| run_FailSafeQA.py | Integration with FailSafeQA benchmark for robustness evaluation |
| results_topk_* | Generated evaluation results and code samples from baseline and optimized runs |
Comments suppressed due to low confidence (1)

examples/gpt-5/prompt-optimization-cookbook/scripts/results_summarizer.py:16

  • The comment states rating >= 4 as compliant but the constant is set to 6, creating a contradiction. Either fix the comment or the value.
    exact: Optional[bool]


# Head: decreasing counts
for i, tok in enumerate(vocab_top[:150], start=1):
    c = max(1200, int(5000 / (i ** 0.5)))
Copilot AI commented on Aug 7, 2025:

The magic numbers 1200, 5000, and 0.5 should be defined as named constants to improve code readability and maintainability.

Suggested change:

- c = max(1200, int(5000 / (i ** 0.5)))
+ c = max(MIN_HEAD_COUNT, int(HEAD_COUNT_BASE / (i ** HEAD_COUNT_EXPONENT)))

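In context, the suggested refactor might look like the sketch below. The constant names follow Copilot's suggestion; the helper function and its docstring are illustrative additions, not code from the PR.

```python
# Named constants for the head-count formula (names follow Copilot's
# suggestion; values come from the original snippet).
MIN_HEAD_COUNT = 1200       # floor for any head token's count
HEAD_COUNT_BASE = 5000      # count assigned to the rank-1 token
HEAD_COUNT_EXPONENT = 0.5   # inverse-square-root decay over rank


def head_count(rank: int) -> int:
    """Decreasing synthetic count for the rank-th most frequent token (rank >= 1)."""
    return max(MIN_HEAD_COUNT, int(HEAD_COUNT_BASE / (rank ** HEAD_COUNT_EXPONENT)))
```

With these defaults, `head_count(1)` returns 5000 and the floor kicks in once the decayed value drops below 1200.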


# --------------- Config ---------------

COMPLIANCE_THRESHOLD = 6 # treat judge rating >= 4 as compliant (see paper rubric)
Copilot AI commented on Aug 7, 2025:

Same issue as in results_summarizer.py - the comment contradicts the constant value. The threshold is 6 but comment mentions >= 4.

Suggested change:

- COMPLIANCE_THRESHOLD = 6  # treat judge rating >= 4 as compliant (see paper rubric)
+ COMPLIANCE_THRESHOLD = 4  # treat judge rating >= 4 as compliant (see paper rubric)

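For context, a compliance rate under this rubric would be computed along these lines. This is a sketch of the idea, not code from the PR; `threshold=4` mirrors the comment's "rating >= 4 is compliant" rubric.

```python
def compliance_rate(ratings, threshold=4):
    """Fraction of judge ratings at or above the threshold.

    threshold=4 matches the rubric described in the comment
    ("treat judge rating >= 4 as compliant"); the function itself
    is an illustrative sketch, not code from the PR.
    """
    if not ratings:
        return 0.0
    return sum(r >= threshold for r in ratings) / len(ratings)
```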

Comment on lines +135 to +140
resp = client.responses.create(
    model=model,
    input=messages,
    text={"format": {"type": "text"}, "verbosity": "medium"},
    reasoning={"effort": "medium", "summary": "auto"},
    tools=[],
Copilot AI commented on Aug 7, 2025:

The responses.create API call uses a non-standard pattern with input/content structure that may not be compatible with standard OpenAI API. Consider using the standard chat completions API format.

Suggested change:

- resp = client.responses.create(
-     model=model,
-     input=messages,
-     text={"format": {"type": "text"}, "verbosity": "medium"},
-     reasoning={"effort": "medium", "summary": "auto"},
-     tools=[],
+ resp = client.chat.completions.create(
+     model=model,
+     messages=messages,


Comment on lines +27 to +40
payload = {
    "model": model,
    "input": [
        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
for attempt in range(max_retries):
    try:
        resp = client.responses.create(**payload)
        return getattr(resp, "output_text", str(resp))
Copilot AI commented on Aug 7, 2025:

Same issue as llm_judge.py - using non-standard responses.create API format instead of standard OpenAI chat completions API.

Suggested change:

- payload = {
-     "model": model,
-     "input": [
-         {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
-         {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
-     ],
-     "text": {"format": {"type": "text"}, "verbosity": "medium"},
-     "reasoning": {"effort": "medium", "summary": "auto"},
-     "tools": [],
- }
- for attempt in range(max_retries):
-     try:
-         resp = client.responses.create(**payload)
-         return getattr(resp, "output_text", str(resp))
+ messages = [
+     {"role": "system", "content": dev_prompt},
+     {"role": "user", "content": user_prompt},
+ ]
+ for attempt in range(max_retries):
+     try:
+         resp = client.chat.completions.create(
+             model=model,
+             messages=messages,
+             temperature=0.7,
+             max_tokens=2048,
+         )
+         return resp.choices[0].message.content

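The retry loop shown in the diff is truncated before its except branch. A complete retry-with-backoff pattern for such an API call might look like the generic sketch below; the helper name and backoff parameters are assumptions for illustration, not taken from the PR.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=1.0):
    """Call fn(), retrying on failure with exponential backoff.

    Generic sketch: fn would wrap the actual API call, e.g.
    lambda: client.responses.create(**payload). The delay doubles
    on each attempt, and the final failure is re-raised.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```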

Comment on lines +24 to +37
payload = {
    "model": model,
    "input": [
        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
for attempt in range(max_retries):
    try:
        resp = client.responses.create(**payload)
        return getattr(resp, "output_text", str(resp))
Copilot AI commented on Aug 7, 2025:

Same issue - using non-standard responses.create API format instead of standard OpenAI chat completions API.

Suggested change:

- payload = {
-     "model": model,
-     "input": [
-         {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
-         {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
-     ],
-     "text": {"format": {"type": "text"}, "verbosity": "medium"},
-     "reasoning": {"effort": "medium", "summary": "auto"},
-     "tools": [],
- }
- for attempt in range(max_retries):
-     try:
-         resp = client.responses.create(**payload)
-         return getattr(resp, "output_text", str(resp))
+ messages = [
+     {"role": "system", "content": dev_prompt},
+     {"role": "user", "content": user_prompt},
+ ]
+ for attempt in range(max_retries):
+     try:
+         resp = client.chat.completions.create(
+             model=model,
+             messages=messages,
+         )
+         return resp.choices[0].message.content


Comment on lines +119 to +137
# Align with Responses API pattern used in gen_baseline.py
payload = {
    "model": model,
    "input": [
        {
            "role": "developer",
            "content": [{"type": "input_text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [{"type": "input_text", "text": user_prompt}],
        },
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
resp = self.client.responses.create(**payload)
return resp.output_text
Copilot AI commented on Aug 7, 2025:

Same API consistency issue - using responses.create instead of standard chat completions format.

Suggested change:

- # Align with Responses API pattern used in gen_baseline.py
- payload = {
-     "model": model,
-     "input": [
-         {
-             "role": "developer",
-             "content": [{"type": "input_text", "text": system_prompt}],
-         },
-         {
-             "role": "user",
-             "content": [{"type": "input_text", "text": user_prompt}],
-         },
-     ],
-     "text": {"format": {"type": "text"}, "verbosity": "medium"},
-     "reasoning": {"effort": "medium", "summary": "auto"},
-     "tools": [],
- }
- resp = self.client.responses.create(**payload)
- return resp.output_text
+ # Use standard OpenAI chat completions format
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": user_prompt},
+ ]
+ resp = self.client.chat.completions.create(
+     model=model,
+     messages=messages,
+ )
+ return resp.choices[0].message.content


pass
return None

def is_sorted_topk(pairs):
Copilot AI commented on Aug 7, 2025:

The function checks if pairs are sorted but could be more efficient. Consider using early termination or vectorized operations for large lists.

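An early-terminating version of such a check might look like the sketch below. The exact sort key used by the PR's is_sorted_topk is not shown here, so sorting by count in descending order is an assumption, and the function name is illustrative.

```python
def is_sorted_desc(pairs):
    """Return True if (token, count) pairs are ordered by count, descending.

    Uses all() over a generator, so the scan stops at the first
    out-of-order pair instead of materializing a sorted copy.
    Assumes pairs is a sequence of (token, count) tuples; the real
    is_sorted_topk in the PR may use a different key.
    """
    return all(pairs[i][1] >= pairs[i + 1][1] for i in range(len(pairs) - 1))
```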

summaries = summarize_groups(quant_paths=quant_paths, judge_paths=judge_paths)

# Build figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
Copilot AI commented on Aug 7, 2025:

[nitpick] The figure size (15, 8) is hardcoded. Consider making it configurable or calculating it based on content for better flexibility.

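One way to make the size content-driven, as the nitpick suggests, is to derive it from the subplot grid shape. This is a sketch; the per-axis scaling factors are arbitrary assumptions chosen so the defaults reproduce the hardcoded (15, 8) for a 2x3 grid.

```python
def grid_figsize(nrows, ncols, per_col=5.0, per_row=4.0):
    """Figure size (width, height) in inches, scaled by subplot grid shape.

    With the default factors, a 2x3 grid yields (15.0, 8.0), matching the
    hardcoded value in the snippet above. Usage would be:
    fig, axes = plt.subplots(nrows, ncols, figsize=grid_figsize(nrows, ncols))
    """
    return (ncols * per_col, nrows * per_row)
```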

@CorwinCheung CorwinCheung merged commit 356d169 into main Aug 7, 2025
1 check passed
@CorwinCheung CorwinCheung deleted the feat/move-images-root branch August 7, 2025 20:55