How to integrate AI STT/TTS with Zoom Video SDK on a headless server

Hi Zoom Dev Community,

I’m working on a project where I want to integrate AI speech-to-text (STT) and text-to-speech (TTS) into a Zoom Video SDK application running on a remote/headless server. The idea is to have an AI “agent” that can:

  1. Listen to participants in real time (via STT).

  2. Generate a response (via LLM or other AI logic).

  3. Speak back into the Zoom session (via TTS).
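
Conceptually, the loop is just those three steps chained together. Here's a trivial sketch of what I mean (the sttTranscribe / llmRespond / ttsSynthesize helpers are hypothetical placeholders for whichever STT/LLM/TTS providers I end up using, not real APIs):

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for real STT/LLM/TTS provider calls.
std::string sttTranscribe(const std::vector<char>& pcm);
std::string llmRespond(const std::string& transcript);
std::vector<char> ttsSynthesize(const std::string& reply);

// For each chunk of participant audio pulled out of the Zoom session,
// produce a chunk of synthesized speech to inject back in.
std::vector<char> handleAudioChunk(const std::vector<char>& pcmFromZoom)
{
    std::string transcript = sttTranscribe(pcmFromZoom); // 1. listen
    std::string reply      = llmRespond(transcript);     // 2. think
    return ttsSynthesize(reply);                         // 3. speak
}
```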

I’ve already managed to run the Zoom Video SDK on a server and successfully join sessions with audio/video. The challenges I’m facing are:

  • Since the server has no microphone or speakers, I can’t use normal input/output devices for the AI tool.

  • I need a way to capture participant audio directly from the Zoom SDK and feed it into my STT service (rough sketch of what I’m picturing just after this list).

  • I also need to inject the AI-generated TTS audio back into the Zoom session as if it were microphone input.
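
On the capture side, here's the shape of what I'm picturing, based on my reading of the raw-data sections of the Video SDK docs/headers. The onMixedAudioRawDataReceived / onOneWayAudioRawDataReceived callbacks and the AudioRawData accessors are my best guesses, so please correct me if I have the interfaces wrong:

```cpp
// Sketch only -- callback and accessor names are my assumptions from the
// Video SDK docs/headers; exact header names and signatures depend on the
// SDK version, so verify before relying on this.

class AgentDelegate : public IZoomVideoSDKDelegate
{
public:
    // Mixed audio of everyone in the session.
    void onMixedAudioRawDataReceived(AudioRawData* data) override
    {
        forwardToSTT(data->GetBuffer(),     // raw PCM bytes
                     data->GetBufferLen(),  // length in bytes
                     data->GetSampleRate(),
                     data->GetChannelNum());
    }

    // Per-participant audio, which would let me attribute speech to a speaker.
    void onOneWayAudioRawDataReceived(AudioRawData* data,
                                      IZoomVideoSDKUser* user) override
    {
        forwardToSTT(data->GetBuffer(), data->GetBufferLen(),
                     data->GetSampleRate(), data->GetChannelNum());
    }

    // ...remaining IZoomVideoSDKDelegate overrides omitted for brevity...

private:
    // Hypothetical helper that streams PCM to my STT service.
    void forwardToSTT(char* pcm, unsigned int len, int sampleRate, int channels);
};
```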

Questions:

  1. What is the recommended way to capture raw audio from participants inside the Video SDK?

  2. Can I continuously stream AI-generated PCM audio into sendAudioRawData() to make the bot “speak” in the meeting? (A sketch of what I have in mind follows these questions.)

  3. Are there constraints on the audio format (e.g., 16-bit PCM, 16 kHz vs. 48 kHz)?

  4. Is this the right approach, or is there a better way to implement an AI voice agent inside a Zoom Video SDK session?
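
Regarding question 2, my current plan (based on how I understand the virtual-device approach in the SDK headers) is to register a virtual mic and push TTS PCM through the sender it hands me. The IZoomVideoSDKVirtualAudioMic / IZoomVideoSDKAudioSender names below are taken from my reading of the headers, so treat them as assumptions:

```cpp
// Sketch only -- interface and method names from my reading of the Video SDK
// headers; please verify signatures against your SDK version.
#include <atomic>

class TTSVirtualMic : public IZoomVideoSDKVirtualAudioMic
{
public:
    void onMicInitialize(IZoomVideoSDKAudioSender* sender) override
    {
        sender_ = sender;  // keep the sender so the TTS thread can push audio
    }
    void onMicStartSend() override     { sending_ = true;  }
    void onMicStopSend() override      { sending_ = false; }
    void onMicUninitialized() override { sender_ = nullptr; }

    // Called by my TTS pipeline whenever a chunk of synthesized speech is
    // ready. Assumes 16-bit mono PCM; the right sample rate is my question 3.
    void pushTTSChunk(char* pcm, unsigned int lengthBytes, int sampleRate)
    {
        if (sending_ && sender_)
            sender_->Send(pcm, lengthBytes, sampleRate);
    }

private:
    IZoomVideoSDKAudioSender* sender_ = nullptr;
    std::atomic<bool> sending_{false};
};
```

I'm also assuming I should pace the Send() calls at real-time rate (e.g., 10–20 ms chunks) rather than dumping a whole utterance at once — is that correct?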

Ultimately, I want to create a bot that can “listen and talk” naturally in real time, without needing physical audio devices.

Any guidance, examples, or best practices would be greatly appreciated!

Thanks in advance :folded_hands: