prompt-opt-notebook #2023
Conversation
Pull Request Overview
This PR introduces a comprehensive prompt optimization notebook and accompanying evaluation scripts for the OpenAI Cookbook. The goal is to demonstrate how to systematically improve prompt effectiveness using GPT-5's new prompt optimization capabilities.
Key Changes:
- Added a new prompt optimization cookbook notebook with end-to-end evaluation pipeline
- Implemented baseline and optimized code generation scripts for performance comparison
- Created comprehensive evaluation tools including LLM-as-judge and FailSafeQA benchmark integration
Reviewed Changes
Copilot reviewed 139 out of 146 changed files in this pull request and generated 8 comments.
Summary per file:

| File | Description |
|---|---|
| registry.yaml | Adds entry for the new prompt optimization cookbook |
| scripts/topk_eval.py | Core evaluation script for measuring code generation performance with memory/time tracking |
| scripts/results_summarizer.py | Analysis and visualization tools for comparing baseline vs optimized results |
| scripts/llm_judge.py | LLM-based code quality evaluation using GPT-5 as judge |
| scripts/gen_optimized.py | Script for generating optimized code solutions using improved prompts |
| scripts/gen_baseline.py | Script for generating baseline code solutions for comparison |
| run_FailSafeQA.py | Integration with FailSafeQA benchmark for robustness evaluation |
| results_topk_* | Generated evaluation results and code samples from baseline and optimized runs |
Comments suppressed due to low confidence (1)

examples/gpt-5/prompt-optimization-cookbook/scripts/results_summarizer.py:16

- The comment states rating >= 4 as compliant, but the constant is set to 6, creating a contradiction. Either fix the comment or the value.
```python
exact: Optional[bool]

# Head: decreasing counts
for i, tok in enumerate(vocab_top[:150], start=1):
    c = max(1200, int(5000 / (i ** 0.5)))
```
The magic numbers 1200, 5000, and 0.5 should be defined as named constants to improve code readability and maintainability.
```diff
-    c = max(1200, int(5000 / (i ** 0.5)))
+    c = max(MIN_HEAD_COUNT, int(HEAD_COUNT_BASE / (i ** HEAD_COUNT_EXPONENT)))
```
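If the suggestion is adopted, the constants might be declared once near the top of the script; the names follow the suggestion above and the values are the originals from the snippet (a sketch, not the PR's code):

```python
# Parameters for the head-of-distribution count curve (illustrative names).
MIN_HEAD_COUNT = 1200      # floor applied to every head token's count
HEAD_COUNT_BASE = 5000     # numerator of the power-law decay
HEAD_COUNT_EXPONENT = 0.5  # decay exponent over the token's 1-based rank
```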
```python
# --------------- Config ---------------

COMPLIANCE_THRESHOLD = 6  # treat judge rating >= 4 as compliant (see paper rubric)
```
Same issue as in results_summarizer.py: the comment contradicts the constant value. The threshold is 6, but the comment says >= 4.
```diff
-COMPLIANCE_THRESHOLD = 6  # treat judge rating >= 4 as compliant (see paper rubric)
+COMPLIANCE_THRESHOLD = 4  # treat judge rating >= 4 as compliant (see paper rubric)
```
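For context, a minimal sketch of how such a threshold is typically consumed downstream; the helper name is hypothetical, not from the PR:

```python
def compliance_rate(ratings, threshold=4):
    # Fraction of judge ratings at or above the compliance threshold.
    if not ratings:
        return 0.0
    return sum(r >= threshold for r in ratings) / len(ratings)
```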
```python
resp = client.responses.create(
    model=model,
    input=messages,
    text={"format": {"type": "text"}, "verbosity": "medium"},
    reasoning={"effort": "medium", "summary": "auto"},
    tools=[],
```
The responses.create API call uses a non-standard pattern with an input/content structure that may not be compatible with the standard OpenAI API. Consider using the standard chat completions API format.
```diff
-resp = client.responses.create(
-    model=model,
-    input=messages,
-    text={"format": {"type": "text"}, "verbosity": "medium"},
-    reasoning={"effort": "medium", "summary": "auto"},
-    tools=[],
+resp = client.chat.completions.create(
+    model=model,
+    messages=messages,
```
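For reference, the suggested call completed with the response extraction it would need; `client`, `model`, and `messages` are assumed to exist as in the PR's script, and the extraction line is a sketch rather than the file's actual code:

```python
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)
# Chat Completions returns choices rather than output_text.
judge_text = resp.choices[0].message.content
```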
```python
payload = {
    "model": model,
    "input": [
        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
for attempt in range(max_retries):
    try:
        resp = client.responses.create(**payload)
        return getattr(resp, "output_text", str(resp))
```
Same issue as llm_judge.py - using non-standard responses.create API format instead of standard OpenAI chat completions API.
```diff
-payload = {
-    "model": model,
-    "input": [
-        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
-        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
-    ],
-    "text": {"format": {"type": "text"}, "verbosity": "medium"},
-    "reasoning": {"effort": "medium", "summary": "auto"},
-    "tools": [],
-}
-for attempt in range(max_retries):
-    try:
-        resp = client.responses.create(**payload)
-        return getattr(resp, "output_text", str(resp))
+messages = [
+    {"role": "system", "content": dev_prompt},
+    {"role": "user", "content": user_prompt},
+]
+for attempt in range(max_retries):
+    try:
+        resp = client.chat.completions.create(
+            model=model,
+            messages=messages,
+            temperature=0.7,
+            max_tokens=2048,
+        )
+        return resp.choices[0].message.content
```
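Whichever API shape is kept, the `for attempt in range(max_retries)` loop in the hunk is only half shown; a minimal sketch of the full retry-with-backoff pattern, with a hypothetical `send_request` callable standing in for the actual API call:

```python
import time

def call_with_retries(send_request, max_retries=3, base_delay=1.0):
    # Retry transient failures with exponential backoff; re-raise on the final attempt.
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```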
```python
payload = {
    "model": model,
    "input": [
        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
for attempt in range(max_retries):
    try:
        resp = client.responses.create(**payload)
        return getattr(resp, "output_text", str(resp))
```
Same issue - using non-standard responses.create API format instead of standard OpenAI chat completions API.
```diff
-payload = {
-    "model": model,
-    "input": [
-        {"role": "developer", "content": [{"type": "input_text", "text": dev_prompt}]},
-        {"role": "user", "content": [{"type": "input_text", "text": user_prompt}]},
-    ],
-    "text": {"format": {"type": "text"}, "verbosity": "medium"},
-    "reasoning": {"effort": "medium", "summary": "auto"},
-    "tools": [],
-}
-for attempt in range(max_retries):
-    try:
-        resp = client.responses.create(**payload)
-        return getattr(resp, "output_text", str(resp))
+messages = [
+    {"role": "system", "content": dev_prompt},
+    {"role": "user", "content": user_prompt},
+]
+for attempt in range(max_retries):
+    try:
+        resp = client.chat.completions.create(
+            model=model,
+            messages=messages,
+        )
+        return resp.choices[0].message.content
```
```python
# Align with Responses API pattern used in gen_baseline.py
payload = {
    "model": model,
    "input": [
        {
            "role": "developer",
            "content": [{"type": "input_text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [{"type": "input_text", "text": user_prompt}],
        },
    ],
    "text": {"format": {"type": "text"}, "verbosity": "medium"},
    "reasoning": {"effort": "medium", "summary": "auto"},
    "tools": [],
}
resp = self.client.responses.create(**payload)
return resp.output_text
```
Same API consistency issue - using responses.create instead of standard chat completions format.
```diff
-# Align with Responses API pattern used in gen_baseline.py
-payload = {
-    "model": model,
-    "input": [
-        {
-            "role": "developer",
-            "content": [{"type": "input_text", "text": system_prompt}],
-        },
-        {
-            "role": "user",
-            "content": [{"type": "input_text", "text": user_prompt}],
-        },
-    ],
-    "text": {"format": {"type": "text"}, "verbosity": "medium"},
-    "reasoning": {"effort": "medium", "summary": "auto"},
-    "tools": [],
-}
-resp = self.client.responses.create(**payload)
-return resp.output_text
+# Use standard OpenAI chat completions format
+messages = [
+    {"role": "system", "content": system_prompt},
+    {"role": "user", "content": user_prompt},
+]
+resp = self.client.chat.completions.create(
+    model=model,
+    messages=messages,
+)
+return resp.choices[0].message.content
```
```python
        pass
    return None


def is_sorted_topk(pairs):
```
The function checks if pairs are sorted but could be more efficient. Consider using early termination or vectorized operations for large lists.
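An early-terminating variant along the lines of the comment, assuming `pairs` is a list of `(token, count)` tuples expected in non-increasing count order (a sketch, not the PR's implementation):

```python
def is_sorted_topk(pairs):
    # all() over a generator short-circuits at the first out-of-order neighbor.
    return all(pairs[i][1] >= pairs[i + 1][1] for i in range(len(pairs) - 1))
```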
```python
summaries = summarize_groups(quant_paths=quant_paths, judge_paths=judge_paths)

# Build figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
```
[nitpick] The figure size (15, 8) is hardcoded. Consider making it configurable or calculating it based on content for better flexibility.
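One way to expose the size, sketched with a hypothetical wrapper function rather than the PR's actual plotting code:

```python
import matplotlib.pyplot as plt

def build_summary_figure(summaries, figsize=(15, 8)):
    # Figure size is a keyword argument instead of a hardcoded literal.
    fig, axes = plt.subplots(2, 3, figsize=figsize)
    return fig, axes
```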
Summary
Briefly describe the changes and the goal of this PR. Make sure the PR title summarizes the changes effectively.
Motivation
Why are these changes necessary? How do they improve the cookbook?
For new content
When contributing new content, read through our contribution guidelines, and mark the following action items as completed:
We will rate each of these areas on a scale from 1 to 4, and will only accept contributions that score 3 or higher on all areas. Refer to our contribution guidelines for more details.