Streamed Voice Agent Demo - Multiple Performance Issues #301

Open
muhammadsmalik opened this issue Mar 22, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@muhammadsmalik

Streamed Voice Agent Demo - Multiple Performance Issues

Description

The streamed voice agent demo is experiencing several critical issues that affect its usability and functionality:

  1. High Latency: There is a significant delay (3-4 seconds) before receiving responses.
  2. Language Switching: The agent randomly switches to Spanish during conversations.
  3. Over-sensitivity: The agent frequently detects speech and provides incorrect descriptions even when no one is speaking.
  4. Interruption Issues: The agent cannot be interrupted despite semantic_vad apparently being implemented in the code.

Steps to Reproduce

  1. Launch the streamed voice agent demo
  2. Attempt to engage in conversation with the agent
  3. Observe the delay between speaking and receiving a response
  4. Continue conversation for several exchanges to observe language switching
  5. Remain silent for periods to observe false speech detection
  6. Try to interrupt the agent while it's speaking

Expected Behavior

  • Responses should begin within 1 second of user input
  • The agent should maintain the initially selected language throughout the conversation
  • Speech detection should only activate when actual speech is present
  • The semantic_vad feature should allow interruption of the agent's responses

Actual Behavior

  • Responses take 3-4 seconds to begin after user input
  • The agent randomly switches to Spanish during English conversations
  • The agent frequently reports detecting speech and provides descriptions when no one is speaking
  • The agent cannot be interrupted despite the apparent implementation of semantic_vad
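
To put a number on the delay described above, one option is to time how long the output stream takes to yield its first chunk. This is a minimal sketch, not part of the demo: `time_to_first_item`, `fake_response_stream`, and `measure` are hypothetical names, and the fake stream stands in for whatever async iterator delivers the agent's audio.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple, TypeVar

T = TypeVar("T")


async def time_to_first_item(stream: AsyncIterator[T]) -> Tuple[Optional[T], float]:
    """Return the first item from `stream` and the seconds waited for it."""
    start = time.perf_counter()
    async for item in stream:
        return item, time.perf_counter() - start
    # The stream ended without yielding anything.
    return None, time.perf_counter() - start


async def fake_response_stream(delay_s: float) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for the agent's streamed audio output."""
    await asyncio.sleep(delay_s)  # simulated time before the first chunk
    yield b"\x00\x01"
    yield b"\x02\x03"


def measure(delay_s: float) -> float:
    """Run the timing helper against the fake stream and return the latency."""
    _, latency = asyncio.run(time_to_first_item(fake_response_stream(delay_s)))
    return latency
```

Starting the timer when the user stops speaking, rather than when the stream object is created, would give a figure closer to the 3-4 seconds reported here.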

Technical Details

From code inspection, semantic_vad appears to be implemented but is not functioning as expected. This suggests a potential issue with how the feature is integrated or configured in the current build.

Additional Notes

These issues significantly impact the user experience and demonstration value of the agent. The latency and language switching problems are particularly disruptive during presentations.

Possible Solutions

  • Investigate streaming optimization to reduce latency
  • Check language model configuration for potential causes of language switching
  • Adjust speech detection sensitivity parameters
  • Review semantic_vad implementation to ensure proper configuration
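
As a stopgap for the false speech detection, a client could drop near-silent microphone chunks before they reach the pipeline. The sketch below is a plain energy gate, not the SDK's own turn detection; it assumes 16-bit little-endian mono PCM, and the `threshold` default of 500 is a hypothetical value that would need tuning per microphone and room.

```python
import math
import struct
from typing import Iterable, Iterator


def rms_int16(chunk: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    # Ignore a trailing odd byte, if any, so unpack sees whole samples.
    samples = struct.unpack(f"<{n}h", chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def drop_quiet_chunks(
    chunks: Iterable[bytes], threshold: float = 500.0
) -> Iterator[bytes]:
    """Yield only the chunks whose RMS amplitude reaches `threshold`."""
    for chunk in chunks:
        if rms_int16(chunk) >= threshold:
            yield chunk
```

A gate like this only hides background noise from the transcriber; it does not explain why silence is being transcribed in the first place.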

Priority

High - These issues prevent effective demonstration of the voice agent's capabilities.

@muhammadsmalik muhammadsmalik added the bug Something isn't working label Mar 22, 2025
@rm-openai
Collaborator

cc @dkundel-openai

@dkundel-openai
Contributor

Hi @muhammadsmalik

Thanks for raising this, and sorry you are running into problems with the streamed demo.

We are looking into a couple of performance improvements that should hopefully reduce the response time.

We do not support interruptions yet. There is a bit of guidance in the docs on what you can do in the meantime: https://openai.github.io/openai-agents-python/voice/pipeline/#interruptions
A proper interruptions implementation will require more client-side work, so that there is detailed information about how much of the text had been read out when the interruption happened. It's something we want to support, but in the meantime the suggestion is to follow what is laid out in the docs.
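
For readers who have not opened the link: as I read that docs section, the workaround is to mute the microphone while the agent is handling a turn and unmute it afterwards, driven by the pipeline's lifecycle events. The sketch below shows only that gating logic. `MicGate` is a hypothetical helper, the event names `turn_started` and `turn_ended` are assumptions taken from my reading of the docs, and the `forwarded` list stands in for whatever call actually sends audio to the pipeline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MicGate:
    """Forward microphone audio only while the agent is not mid-turn."""

    muted: bool = False
    forwarded: List[bytes] = field(default_factory=list)

    def on_lifecycle(self, event_name: str) -> None:
        # Assumed event names; check them against the SDK version in use.
        if event_name == "turn_started":
            self.muted = True
        elif event_name == "turn_ended":
            self.muted = False

    def on_mic_chunk(self, chunk: bytes) -> None:
        # While muted, drop the chunk so the agent does not hear itself.
        if not self.muted:
            self.forwarded.append(chunk)
```

This avoids the agent reacting to its own output, but it is the opposite of barge-in: the user cannot be heard until the turn has finished.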

Overall, if your focus is on the lowest latency and the best interruption handling, my suggestion would still be our speech-to-speech model and the Realtime API.

@duncsand

Just to add that I also see the issues described above. Notably, the agent frequently detects speech when there is none and then switches to Spanish, making the whole thing unusable for even a simple demonstration.

@dkundel-openai
Contributor

Out of curiosity, are you using any specific microphones, maybe ones with built-in noise cancellation?

@muhammadsmalik
Author

@dkundel-openai I'm using the built-in microphones on my MacBook Air, nothing with special noise cancellation technology.

Do you have any timeline for when interruptions will be supported? I believe this is one of the main use cases for voice agents - allowing for more natural conversation flow where users can interrupt when needed. Thanks.

@anuragsharanjuspay

@rm-openai @dkundel-openai I tried using the VoicePipeline, but the latency is too high and it randomly transcribes Arabic text (I am using StreamedAudioInput). The realtime model is great, but how do I use it with the Agents SDK? Is support for gpt-4o-realtime planned, or are there any workarounds?

@pmmohanmishra

@dkundel-openai has any support for realtime audio models been added yet?

@dkundel-openai
Contributor

It's planned, but there is no timeline yet for when support will land.
In the meantime, we are working on improving latency for the chained approach.

@pmmohanmishra

@dkundel-openai thanks. Is there a timeline for fixing the issues reported above?


6 participants