Hi Zoom Dev Community,
I’m working on a project where I want to integrate AI speech-to-text (STT) and text-to-speech (TTS) into a Zoom Video SDK application running on a remote/headless server. The idea is to have an AI “agent” that can:

- Listen to participants in real time (via STT).
- Generate a response (via an LLM or other AI logic).
- Speak back into the Zoom session (via TTS).
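Roughly, here is the loop I have in mind. This is only a sketch of my intent: `stt()`, `think()`, and `tts()` are mock placeholders for the real STT/LLM/TTS services, not Zoom SDK calls.

```python
# Hypothetical agent loop: stt(), think(), and tts() are stand-ins for
# real STT/LLM/TTS services and are NOT Zoom SDK APIs.

def stt(pcm_frames: bytes) -> str:
    """Mock speech-to-text: pretend the audio said 'hello'."""
    return "hello"

def think(transcript: str) -> str:
    """Mock LLM step: generate a reply to the transcript."""
    return f"You said: {transcript}"

def tts(text: str) -> bytes:
    """Mock text-to-speech: return 1 s of silence as 16-bit mono 16 kHz PCM."""
    return b"\x00\x00" * 16000

def agent_step(incoming_pcm: bytes) -> bytes:
    """One listen -> think -> speak cycle."""
    transcript = stt(incoming_pcm)
    reply = think(transcript)
    return tts(reply)

reply_audio = agent_step(b"\x00\x00" * 160)
print(len(reply_audio))  # 32000 bytes = 1 s of 16-bit mono PCM at 16 kHz
```

The open question is what the SDK-facing edges of this loop (capture in, playback out) should look like.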
I’ve already managed to run the Zoom Video SDK on a server and successfully join sessions with audio/video. The challenges I’m facing are:

- Since the server has no microphone or speakers, I can’t use normal input/output devices for the AI tooling.
- I need a way to capture participant audio directly from the Zoom SDK and feed it into my STT service.
- I also need to inject the AI-generated TTS audio back into the Zoom session as if it were microphone input.
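For the capture side, my current plan is to bridge the SDK's raw-audio callback to the STT consumer through a thread-safe queue, so nothing blocks the SDK's audio thread. A minimal sketch of that bridging, where `on_audio_raw_data()` is a hypothetical stand-in for whatever callback the Video SDK actually invokes (the real name/signature may differ):

```python
import queue
import threading

# Bridge between an SDK audio callback and an STT worker, with no physical
# devices involved. on_audio_raw_data() is a placeholder for the SDK's
# raw-audio callback, not a real Zoom API.

audio_q: "queue.Queue[bytes]" = queue.Queue(maxsize=100)

def on_audio_raw_data(pcm_frame: bytes) -> None:
    """Called on the SDK's audio thread with each raw PCM frame; never block here."""
    try:
        audio_q.put_nowait(pcm_frame)
    except queue.Full:
        pass  # drop frames rather than stall the SDK's audio thread

def stt_worker(collected: list) -> None:
    """Drain frames and hand them to the STT service (mocked as a list append)."""
    while True:
        frame = audio_q.get()
        if frame is None:  # sentinel: shut down
            break
        collected.append(frame)

frames: list = []
t = threading.Thread(target=stt_worker, args=(frames,))
t.start()
for _ in range(5):
    on_audio_raw_data(b"\x00\x00" * 160)  # 10 ms of 16-bit mono 16 kHz PCM
audio_q.put(None)
t.join()
print(len(frames))  # 5
```

The injection side would presumably be the mirror image: a worker that pulls TTS frames off a queue and pushes them into the session at the source's sample rate.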
Questions:

- What is the recommended way to capture raw audio from participants inside the Video SDK?
- Can I continuously stream AI-generated PCM audio into `sendAudioRawData()` to make the bot “speak” in the meeting?
- Are there constraints around audio format (e.g., PCM 16-bit, 16 kHz vs. 48 kHz)?
- Is this the right approach, or is there a better way to implement an AI voice agent inside a Zoom Video SDK session?
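For context on the format question: most STT engines I’ve looked at expect 16-bit signed mono PCM at 16 kHz, while conferencing stacks often run at 32 or 48 kHz, so I’ll likely need to resample on both edges. A quick sanity check of per-frame byte counts (assuming 10 ms frames, a common unit for real-time audio):

```python
# Frame-size arithmetic for 16-bit (2-byte) PCM at common sample rates.
# Assumes 10 ms frames, a typical unit in real-time audio pipelines.

BYTES_PER_SAMPLE = 2  # 16-bit PCM
FRAME_MS = 10

def frame_bytes(sample_rate_hz: int, channels: int = 1) -> int:
    """Bytes in one 10 ms frame of 16-bit PCM."""
    samples = sample_rate_hz * FRAME_MS // 1000
    return samples * BYTES_PER_SAMPLE * channels

print(frame_bytes(16_000))  # 320 bytes per 10 ms at 16 kHz mono
print(frame_bytes(48_000))  # 960 bytes per 10 ms at 48 kHz mono
```

Knowing which of these the SDK expects (and whether it accepts arbitrary chunk sizes or fixed frames) would settle a lot of my design.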
Ultimately, I want to create a bot that can “listen and talk” naturally in real time, without needing physical audio devices.
Any guidance, examples, or best practices would be greatly appreciated!
Thanks in advance