@Joe-Simo commented Aug 29, 2025

Summary

This PR adds voice pipeline orchestration for OpenAI's Realtime API: a framework for managing voice interactions with the gpt-realtime model.

Motivation

While working with the OpenAI Agents SDK, I wanted to contribute a voice pipeline orchestration feature that makes it easier to build voice-enabled applications using OpenAI's Realtime API with gpt-realtime, Whisper STT, and the new Marin/Cedar voices.

What's Included

Core Implementation (packages/agents-realtime/src/voicePipeline.ts)

  • VoicePipeline class with event-driven architecture
  • Integration with gpt-realtime model
  • Support for Marin and Cedar realtime voices
  • Whisper STT integration
  • WebRTC support for ultra-low latency (<100ms)
  • Voice Activity Detection (VAD)
  • Audio enhancement (echo/noise suppression, gain control)
  • Plugin pattern for easy RealtimeSession integration

Comprehensive Tests (packages/agents-realtime/test/voicePipeline.test.ts)

  • Test coverage for all features
  • Audio processing and synthesis tests
  • WebRTC integration tests
  • Error handling scenarios

Documentation (docs/src/content/docs/guides/voice-pipeline.mdx)

  • Complete usage guide
  • Configuration examples
  • Best practices
  • Integration with RealtimeSession

Working Example (examples/voice-pipeline/)

  • Full implementation example
  • Demonstrates all features
  • Ready-to-run with README

Key Features

OpenAI Realtime API Integration

  • gpt-realtime model support
  • Marin and Cedar voice options
  • Whisper STT for transcription

Real-time Processing

  • Low-latency audio streaming
  • WebRTC support for <100ms latency
  • Automatic buffering
  • Streaming responses
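To make "automatic buffering" a little more concrete, here is a minimal sketch of one way incoming audio chunks of arbitrary size could be regrouped into fixed-size frames before streaming. The `ChunkBuffer` class, its method names, and the frame size are assumptions made for illustration; they are not taken from the PR's implementation.

```typescript
// Hypothetical helper, not part of the PR: regroups arbitrary-sized
// PCM16 chunks into fixed-size frames.
class ChunkBuffer {
  private pending: number[] = [];
  private readonly frameSize: number;

  constructor(frameSize: number) {
    if (frameSize <= 0) throw new Error('frameSize must be positive');
    this.frameSize = frameSize;
  }

  // Append samples and return every complete frame now available.
  push(chunk: Int16Array): Int16Array[] {
    for (let i = 0; i < chunk.length; i++) this.pending.push(chunk[i]);
    const frames: Int16Array[] = [];
    while (this.pending.length >= this.frameSize) {
      frames.push(Int16Array.from(this.pending.splice(0, this.frameSize)));
    }
    return frames;
  }

  // Return whatever is left (possibly a short frame) and reset the buffer.
  flush(): Int16Array {
    const rest = Int16Array.from(this.pending);
    this.pending = [];
    return rest;
  }
}
```

A caller would push each captured chunk, forward the returned frames, and call `flush()` when the stream ends.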

Voice Activity Detection

  • Configurable thresholds
  • Debouncing support
  • Silence detection
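The three VAD items above (thresholds, debouncing, silence detection) can be sketched with a simple energy-based detector. This is only an illustration under stated assumptions: the `SimpleVad` class, the option names, and the `speech.start` / `speech.end` event labels are hypothetical, and the PR's actual detector may work differently.

```typescript
// Hypothetical energy-based VAD; names and values are illustrative only.
interface VadOptions {
  threshold: number;        // RMS level (0..1) at or above which a frame counts as speech
  minSpeechFrames: number;  // consecutive loud frames needed to start speech (debounce)
  minSilenceFrames: number; // consecutive quiet frames needed to end speech
}

type VadEvent = 'speech.start' | 'speech.end' | null;

// Root-mean-square level of a frame of samples in [-1, 1].
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

class SimpleVad {
  private speaking = false;
  private loudRun = 0;
  private quietRun = 0;
  private readonly opts: VadOptions;

  constructor(opts: VadOptions) {
    this.opts = opts;
  }

  // Feed one frame; returns an event only when the speaking state changes.
  process(frame: Float32Array): VadEvent {
    const loud = rms(frame) >= this.opts.threshold;
    if (loud) {
      this.loudRun++;
      this.quietRun = 0;
    } else {
      this.quietRun++;
      this.loudRun = 0;
    }
    if (!this.speaking && this.loudRun >= this.opts.minSpeechFrames) {
      this.speaking = true;
      return 'speech.start';
    }
    if (this.speaking && this.quietRun >= this.opts.minSilenceFrames) {
      this.speaking = false;
      return 'speech.end';
    }
    return null;
  }
}
```

Requiring several consecutive frames before changing state is what keeps a single click or a brief pause from toggling the detector.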

Audio Enhancement

  • Echo suppression
  • Noise reduction
  • Automatic gain control
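In a browser, these three enhancements correspond to the standard audio track constraints `echoCancellation`, `noiseSuppression`, and `autoGainControl`. The sketch below maps a hypothetical options object onto those constraints. The `EnhancementOptions` names and the `buildAudioConstraints` helper are assumptions for illustration, and whether the PR applies enhancement this way is not stated in the description.

```typescript
// Hypothetical option names; the PR's actual configuration shape may differ.
interface EnhancementOptions {
  echoSuppression?: boolean;
  noiseReduction?: boolean;
  autoGainControl?: boolean;
}

// Local type mirroring the standard audio properties of MediaTrackConstraints,
// declared here so the sketch does not depend on DOM typings.
interface AudioCaptureConstraints {
  audio: {
    echoCancellation: boolean;
    noiseSuppression: boolean;
    autoGainControl: boolean;
  };
  video: false;
}

// Build capture constraints, enabling each enhancement unless it is turned off.
function buildAudioConstraints(opts: EnhancementOptions = {}): AudioCaptureConstraints {
  return {
    audio: {
      echoCancellation: opts.echoSuppression ?? true,
      noiseSuppression: opts.noiseReduction ?? true,
      autoGainControl: opts.autoGainControl ?? true,
    },
    video: false,
  };
}
```

In a browser, the result could be passed to `navigator.mediaDevices.getUserMedia(...)` to request a microphone stream with those settings.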

Developer Experience

  • TypeScript with full type safety
  • Event-driven API
  • Plugin pattern for RealtimeSession
  • Comprehensive metrics monitoring
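As a rough idea of what metrics monitoring for a pipeline like this might look like, the sketch below records per-stage latencies and reports simple aggregates. The `LatencyMetrics` class and the stage names are hypothetical and are not the PR's metrics API.

```typescript
// Hypothetical latency recorder; not the PR's metrics API.
class LatencyMetrics {
  private samples = new Map<string, number[]>();

  // Record one latency measurement (in milliseconds) for a named stage.
  record(stage: string, ms: number): void {
    const list = this.samples.get(stage) ?? [];
    list.push(ms);
    this.samples.set(stage, list);
  }

  // Mean latency for a stage, or undefined if nothing was recorded.
  average(stage: string): number | undefined {
    const list = this.samples.get(stage);
    if (!list || list.length === 0) return undefined;
    let sum = 0;
    for (let i = 0; i < list.length; i++) sum += list[i];
    return sum / list.length;
  }

  // Worst-case latency for a stage, or undefined if nothing was recorded.
  max(stage: string): number | undefined {
    const list = this.samples.get(stage);
    if (!list || list.length === 0) return undefined;
    return Math.max(...list);
  }
}
```

A caller could wrap each STT or TTS request with timestamps and pass the difference to `record('stt', elapsedMs)`.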

Usage Example

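import { createVoicePipeline } from '@openai/agents/realtime';

// Create a pipeline that uses gpt-realtime, a realtime voice, and Whisper STT.
const pipeline = createVoicePipeline({
  model: 'gpt-realtime',
  voice: 'marin', // or 'cedar'
  stt: { model: 'whisper-1' },
});

// Log the final transcript of what the user said.
pipeline.on('speech.final', (text) => {
  console.log('User said:', text);
});

// audioBuffer holds audio captured elsewhere (capture is not shown here).
await pipeline.processAudio(audioBuffer);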
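To try out handler code in the style of the usage example without the SDK or a microphone, a small stand-in with the same `on('speech.final', ...)` shape can be useful. The `StubPipeline` class and its `emit` method below are hypothetical test scaffolding and say nothing about how the PR's `VoicePipeline` is implemented.

```typescript
// Hypothetical stand-in for exercising 'speech.final' handlers in isolation.
type SpeechHandler = (text: string) => void;

class StubPipeline {
  private handlers = new Map<string, SpeechHandler[]>();

  // Register a handler; returns this so calls can be chained.
  on(event: 'speech.final', handler: SpeechHandler): this {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
    return this;
  }

  // Simulate the pipeline producing a final transcript.
  emit(event: 'speech.final', text: string): void {
    for (const handler of this.handlers.get(event) ?? []) handler(text);
  }
}

// Usage: collect transcripts exactly as application code would.
const transcripts: string[] = [];
const stub = new StubPipeline();
stub.on('speech.final', (text) => {
  transcripts.push(text);
});
stub.emit('speech.final', 'hello world');
```

After the last line runs, `transcripts` contains the single emitted string.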

Testing

The new tests pass alongside the existing test suite and follow the same patterns as existing SDK tests.

Breaking Changes

None. This is a purely additive feature that doesn't modify any existing APIs.

Checklist

  • Code follows SDK patterns and conventions
  • Tests written and passing
  • Documentation added
  • Example implementation provided
  • No breaking changes
  • TypeScript types included
  • Follows existing code style

Notes

This contribution provides a framework for voice pipeline orchestration that integrates with OpenAI's Realtime API. The implementation focuses on providing a clean abstraction over the complexity of audio streaming, transcription, and synthesis while maintaining low latency for real-time voice interactions.


Thank you for considering this contribution! 🙏


changeset-bot bot commented Aug 29, 2025

⚠️ No Changeset found

Latest commit: e839d59

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


Implements comprehensive voice pipeline orchestration for OpenAI's Realtime API:

- Voice Pipeline class for managing TTS/STT orchestration with gpt-realtime
- Support for Marin and Cedar realtime voices
- Whisper STT integration for speech-to-text
- WebRTC support for ultra-low latency (<100ms)
- Voice Activity Detection (VAD) capabilities
- Audio processing with configurable settings
- Metrics monitoring for pipeline performance
- Plugin system for easy RealtimeSession integration

The Voice Pipeline provides a framework for building voice-enabled applications
using OpenAI's Realtime API, handling the complexity of audio streaming,
transcription, and synthesis while maintaining low latency.

Features:
- Seamless integration with RealtimeSession
- Configurable audio processing (sample rate, encoding, buffer sizes)
- Real-time metrics (STT/TTS latency, processing time)
- WebRTC support for browser-based voice applications
- Event-driven architecture for audio and speech events
@Joe-Simo force-pushed the feature/voice-pipeline-orchestration branch from 0683968 to e839d59 on August 29, 2025 02:39
@seratch marked this pull request as draft on August 30, 2025 01:17