feat: Add support for image function tools #654

Open · wants to merge 1 commit into main
Conversation

vaydingul

This PR adds support for image function tools to the OpenAI Agents Python SDK.

It is inspired by #341.

The current function_tool implementation only allows the output to be a string, which is a problem when we want to pass image data back in the request. This PR tackles the problem by emitting a standard function_call_output and an additional image-carrying input item back-to-back.

What's included

  • Added ImageFunctionTool class and image_function_tool decorator
  • Implemented necessary support in the run implementation, models, and item handling
  • Added example showing usage of image function tools (examples/tools/image_function_tool.py)

Usage

Use the @image_function_tool decorator to create tools that work with images:

import base64

@image_function_tool
def image_to_base64(path: str) -> str:
    """
    Convert the image at the given path to a base64-encoded data URL
    so that it can be sent to the LLM.
    """
    with open(path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded_string}"

The tool can then be used to allow agents to process and analyze images.

A supporting example script is located at examples/tools/image_function_tool.py.
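
For context, here is a rough sketch of how such a tool could be attached to an agent and run (assuming the SDK's usual Agent/Runner API; the agent name, instructions, and prompt below are illustrative, and the actual example script may differ):

import asyncio
from agents import Agent, Runner

# Illustrative sketch: give the agent the image tool and ask it about a local file.
agent = Agent(
    name="Image analyst",
    instructions="When asked about a local image, call image_to_base64 and describe what you see.",
    tools=[image_to_base64],
)

async def main():
    result = await Runner.run(agent, "What is in the image at ./media/photo.jpg?")
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())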

This commit introduces the ImageFunctionTool and ImageFunctionToolResult classes, enabling the creation and execution of image-generating tools. The necessary modifications include updates to the tool execution logic, new data classes for handling image function calls, and adjustments to the response processing to accommodate image outputs. Additionally, the input handling in the Runner class has been refined to support the new image function items.
Changes include:
- New classes: ImageFunctionTool, ImageFunctionToolResult, ToolRunImageFunction
- Updated tool execution methods to handle image functions
- Modifications to the ProcessedResponse class to include image function results
- Enhancements to ItemHelpers for image function output formatting
- Adjustments in the Runner class for input item processing

These changes enhance the SDK's capabilities for handling image generation tasks alongside existing function tools.

diwu-sf commented May 15, 2025

Hi,
We have a very similar use case where we have PDF file analysis function calls that want to return:

  • file name
  • base64 file content
  • type = input_file
  • additional str information about the file

Can you expand this PR to also support the concept of a @file_function_tool?
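
For concreteness, the authoring side of such a tool could look roughly like this (the @file_function_tool decorator is hypothetical and not part of this PR; it simply mirrors @image_function_tool):

import base64

@file_function_tool  # hypothetical decorator, analogous to @image_function_tool
def pdf_to_data_url(path: str) -> str:
    """Read a local PDF and return it as a base64-encoded data URL for the model."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:application/pdf;base64,{encoded}"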

@stevemadere

I appreciate very much that you've done this work.
Perhaps I don't understand, but it looks like all an image tool can return is one image. Is that right?
How about a tool that returns complex JSON output, some properties of which are images?
I have some ideas on how to implement that and am considering it.
LMK if this method actually supports that already.

@nileshtrivedi

@stevemadere At that point, instead of a tool, we just have an agent participating in the conversation by posting a message made of multiple parts.


diwu-sf commented May 18, 2025

I thought agent responses have to be all strings? How does the agent response return file content types without a PR like this one?

@nileshtrivedi

@diwu-sf OpenAI is now promoting the Responses API instead of the ChatCompletion API. This new spec allows agents or models to return output as multiple parts of different types:

[screenshot: output item types supported by the Responses API]


diwu-sf commented May 19, 2025

@nileshtrivedi nope, function call responses from the tool itself still must be a string:
https://platform.openai.com/docs/guides/function-calling?api-mode=responses#formatting-results

That's why this PR is generating a user message to embed the function call's image output:

    @classmethod
    def image_function_tool_call_output_item(
        cls, tool_call: ResponseFunctionToolCall, output: str
    ) -> list[TResponseInputItem]:
        """Creates tool call output items from a tool call and its output."""
        return [
            {
                "call_id": tool_call.call_id,
                "output": "Image generating tool is called.",
                "type": "function_call_output",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_image",
                        "image_url": output,
                    }
                ],
            },
        ]

Something similar can be done for arbitrary PDF / file uploads.
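
For illustration, the file variant could hoist its output the same way, pairing a plain function_call_output with a user message that carries an input_file content part (the helper name and exact shapes below are assumptions, not code from this PR):

    @classmethod
    def file_function_tool_call_output_item(
        cls, tool_call: ResponseFunctionToolCall, filename: str, file_data_url: str
    ) -> list[TResponseInputItem]:
        """Hypothetical sketch: pairs a function_call_output with a user message
        carrying the file as an input_file content part."""
        return [
            {
                "call_id": tool_call.call_id,
                "output": "File generating tool is called.",
                "type": "function_call_output",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_file",
                        "filename": filename,
                        "file_data": file_data_url,  # e.g. "data:application/pdf;base64,..."
                    }
                ],
            },
        ]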


stevemadere commented May 19, 2025

@nileshtrivedi:

Consider the following situation (which I suspect is about to become super common):

An MCP server that conducts web browsing operations such as navigateToUrl and takeAction (e.g. via stagehand).

Now, such an action would need to return all of these:

  1. requested information from the DOM
  2. the current location (url)
  3. a screenshot of the browser's screen (typically a .png) after navigating or taking the desired action.

It could store the screenshot at a publicly accessible location on the web (e.g. in an S3 bucket served via CloudFront), so the screenshot could easily be returned as an HTTP URL (perfect for providing in an input_image message hoisted from the MCP function call results).

The LLM at OpenAI can examine the screenshot, decide which action to take next, and make a tool call to take that action.
It will then need to see the new screenshot, the new location, and perhaps some information from the DOM via stagehand's observe method.

Did I do a better job of describing the multi-modal results and why an MCP tool call would need to propagate them all simultaneously to the calling model?
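
As a rough sketch, such a multi-part result could be hoisted with the same pattern the PR uses for images: keep the textual pieces in the function_call_output and lift the screenshot into a follow-up user message (all names and shapes below are illustrative assumptions, not part of this PR):

def browser_action_output_items(call_id: str, dom_info: str, url: str, screenshot_url: str) -> list:
    """Illustrative sketch: textual results go in the function_call_output;
    the screenshot is hoisted into a user message as an input_image part."""
    return [
        {
            "call_id": call_id,
            "output": f"Current URL: {url}\nDOM info: {dom_info}",
            "type": "function_call_output",
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "input_image",
                    "image_url": screenshot_url,  # e.g. an S3/CloudFront URL to the .png
                }
            ],
        },
    ]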
