Audio stream access from Zoom's SDK

Hello, I’m trying to build a live transcription service with the Zoom SDK, and I’m unsure about the best approach.

My first thought was to use the Video SDK to incorporate meetings into a web application since it offers access to the audio stream. However, it seems this access is only available on native platforms.

Does this mean that I must use the Video SDK within a Zoom App that runs inside the native Zoom client?

@miki, I’m making some assumptions here.

Are you trying to create a Zoom App that does live transcription during a Zoom Meeting?

You will probably need 2 components: a Zoom App and a Zoom Meeting SDK (Bot).

If that is the case, you will probably need a Zoom Meeting SDK (Bot) running on Linux / Windows that joins the meeting. Once it is in the meeting, it will be responsible for:

  • listening to the audio stream
  • sending the audio stream to a remote server or processing the audio stream locally
  • sending the transcribed text to your Zoom App, via a web service or WebSockets.
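
To make the data flow between those three steps a bit more concrete, here is a minimal Python sketch of the plumbing around the bot. It does not call the Zoom Meeting SDK at all. It assumes the bot already hands you raw 16-bit mono PCM at 16 kHz (check what your SDK build actually delivers), and the function names and JSON fields are made up for illustration.

```python
import json
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumption: adjust to the rate the bot actually delivers
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration chunks for streaming to a transcriber."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def transcript_message(speaker: str, text: str, final: bool) -> str:
    """Serialize one transcript update as JSON (hypothetical schema) for the Zoom App."""
    return json.dumps(
        {"type": "transcript", "speaker": speaker, "text": text, "final": final}
    )
```

Each chunk would go to your transcription server, and each `transcript_message(...)` string would be pushed to the Zoom App over whichever WebSocket or HTTP library you already use.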

@miki, there are 4 main ways to get the live audio stream from Zoom.

1. Use the Zoom RTMP live-streaming API

Pros:

  • Doesn’t require any 3rd party services
  • Lighter weight than building and running a Zoom bot

Cons:

  • Needs to be initiated on a per-meeting basis
  • You need to set up an RTMP server to receive the data, which requires engineering effort to deploy, scale, and monitor
  • Participants can get spooked by the “live” badge that appears in the meeting (even if it’s a private meeting)
  • No speaker separation
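
If you go the RTMP route, one common way to get transcription-ready audio out of the stream is to have ffmpeg pull it from your RTMP server and emit raw PCM. The sketch below only builds the command line. It assumes ffmpeg is installed and that your RTMP server republishes the Zoom stream at some URL; the URL shown is a placeholder, and 16 kHz is just a typical speech-to-text rate.

```python
import shlex
from typing import List


def ffmpeg_audio_command(rtmp_url: str, sample_rate_hz: int = 16_000) -> List[str]:
    """Build an ffmpeg command that reads an RTMP stream and writes raw PCM to stdout."""
    return [
        "ffmpeg",
        "-i", rtmp_url,               # input: the stream on your RTMP server
        "-vn",                        # drop the video track
        "-ac", "1",                   # downmix to mono
        "-ar", str(sample_rate_hz),   # resample for the transcriber
        "-f", "s16le",                # raw signed 16-bit little-endian PCM
        "pipe:1",                     # write to stdout
    ]


# Placeholder URL: replace with wherever your RTMP server exposes the stream.
cmd = ffmpeg_audio_command("rtmp://localhost/live/zoom-meeting")
print(shlex.join(cmd))
```

You could pass `cmd` to `subprocess.Popen(cmd, stdout=subprocess.PIPE)` and read PCM from its stdout. Note that this gives you one mixed audio track, which is why this option has no speaker separation.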

2. Build a desktop app to capture users’ computer audio

Pros:

  • One of the most cost-effective solutions, since audio processing can run on-device.

Cons:

  • You need to build a separate app for Windows, Mac and Linux
  • The app runs on users’ computers, so it can slow them down or make the fans spin up
  • No speaker separation

3. Build a Zoom bot

Pros:

  • Can get the separate audio streams per participant for perfect diarization / speaker labels

Cons:

  • It is very heavy-weight as you would need to spin up multiple servers to run the Zoom client for the bot
  • Running infrastructure for Zoom bot costs more than live streaming.
  • You need to encode the raw video and audio yourself
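
As a small illustration of what "encode it yourself" means for audio, here is a sketch that wraps raw PCM bytes in a WAV container using only Python's standard library. The format is an assumption (16-bit mono at 32 kHz); confirm the sample rate and channel count your bot actually receives before relying on it.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate_hz: int = 32_000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM in a WAV container so ordinary tools can read it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio (assumed)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

With per-participant streams, you would call this once per participant buffer to get one file per speaker. Video needs a real encoder and is considerably more work.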

4. Use Recall.ai

It’s a unified API that lets you send meeting bots to video conferencing platforms to capture the audio, video and transcription in real-time.

Pros:

  • Handles spinning up the servers, and providing the real-time raw audio/transcript so all you interact with is a simple API.
  • Works on any Zoom plan (including Free)
  • Gets speaker diarization / speaker labels
  • Works regardless of meeting platform

Cons:

  • It’s another 3rd party service in your stack

Let me know if you have any questions!