Audio stream access from Zoom's SDK

Hello, I’m trying to build a live transcription service with the Zoom SDK, and I’m unsure about the best approach.

My first thought was to use the Video SDK to incorporate meetings into a web application since it offers access to the audio stream. However, it seems this access is only available on native platforms.

Does this mean that I must use the Video SDK within a Zoom App that runs inside the native Zoom client?

@miki, I’m making a few assumptions here.

You are trying to create a Zoom App which helps to do live transcription during a Zoom Meeting?

You will probably need two components: a Zoom App and a Zoom Meeting SDK bot.

If that is the case, you will probably need a Zoom Meeting SDK bot running on Linux or Windows. It joins the meeting and, once inside, does the following:

  • listens to the audio stream
  • sends the audio stream to a remote server, or processes it locally
  • sends the transcribed text to your Zoom App via a web service or WebSockets.
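
To make the three steps a bit more concrete, here is a rough Python sketch of the plumbing between them. It does not call the Meeting SDK at all. It assumes the bot already hands you raw 16-bit mono PCM at 32 kHz (an assumption, check what your SDK build actually delivers), and the function names and JSON fields are made up for illustration.

```python
import json
from typing import Iterable, Iterator

SAMPLE_RATE = 32_000   # assumed sample rate of the bot's raw audio (Hz)
BYTES_PER_SAMPLE = 2   # assumed 16-bit mono PCM


def chunk_pcm(frames: Iterable[bytes], chunk_ms: int = 500) -> Iterator[bytes]:
    """Regroup arbitrarily sized PCM frames into fixed-duration chunks."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    buffer = bytearray()
    for frame in frames:
        buffer.extend(frame)
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]
    if buffer:
        yield bytes(buffer)  # flush the trailing partial chunk


def transcript_message(meeting_id: str, speaker: str, text: str, start_ms: int) -> str:
    """Serialize one transcribed segment as JSON (hypothetical message shape)."""
    return json.dumps({
        "type": "transcript",
        "meetingId": meeting_id,
        "speaker": speaker,
        "text": text,
        "startMs": start_ms,
    })
```

Each chunk from `chunk_pcm` would go to whatever speech-to-text engine you pick, and each result would be pushed to the Zoom App as a `transcript_message` string over your web service or WebSocket connection.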

@miki, there are 4 main ways to get the live audio stream from Zoom.

1. Use the Zoom RTMP live-streaming API

Pros:

  • Doesn’t require any 3rd party services
  • Lighter weight than building and running a Zoom bot

Cons:

  • Needs to be initiated on a per-meeting basis
  • You need to set up an RTMP server to receive the data, which requires engineering effort to deploy, scale, and monitor
  • Participants can get spooked by the “live” badge that appears in the meeting (even if it’s a private meeting)
  • No speaker separation
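
If you go the RTMP route, one possible way to get audio out of the stream is to let ffmpeg pull from your RTMP server and write raw PCM to stdout. This is only a sketch: it assumes ffmpeg is installed, that your RTMP server is already receiving the Zoom stream, and the URL in the usage note is a placeholder. The helper name is hypothetical.

```python
from typing import List


def ffmpeg_audio_command(rtmp_url: str, sample_rate: int = 16_000) -> List[str]:
    """Build an ffmpeg command that extracts mono 16-bit PCM from an RTMP stream."""
    return [
        "ffmpeg",
        "-i", rtmp_url,           # input: the RTMP stream (placeholder URL)
        "-vn",                    # drop the video track
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample for the transcription engine
        "-f", "s16le",            # raw signed 16-bit little-endian PCM
        "pipe:1",                 # write to stdout
    ]
```

You would run it with something like `subprocess.Popen(ffmpeg_audio_command("rtmp://localhost/live/demo"), stdout=subprocess.PIPE)` and read fixed-size blocks from `stdout`. Since the stream carries mixed meeting audio, you still get no speaker separation this way.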

2. Build a desktop app to capture users’ computer audio

Pros:

  • One of the most cost-effective solutions since audio processing can be run on-device.

Cons:

  • You need to build a separate app for Windows, Mac, and Linux
  • The app runs on users’ computers, so it can slow them down or make the fans spin up
  • No speaker separation

3. Build a Zoom bot

Pros:

  • Can get the separate audio streams per participant for perfect diarization / speaker labels

Cons:

  • It is very heavy-weight as you would need to spin up multiple servers to run the Zoom client for the bot
  • Running infrastructure for Zoom bot costs more than live streaming.
  • You need to encode the raw video and audio yourself
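
As a small illustration of the "encode it yourself" point for audio, raw PCM has no header, so most tools cannot open it until you wrap it in a container. A minimal sketch using only Python's standard `wave` module, assuming 16-bit mono PCM at 32 kHz (the format is an assumption, and `pcm_to_wav` is a hypothetical helper name):

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 32_000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample (2 = 16-bit)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

With per-participant streams you would call this once per participant buffer, which keeps each speaker in their own file. Video encoding is a much bigger job and is not covered here.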

4. Use Recall.ai

It’s a unified API that lets you send meeting bots to video conferencing platforms to capture the audio, video and transcription in real-time.

Pros:

  • Handles spinning up the servers, and providing the real-time raw audio/transcript so all you interact with is a simple API.
  • Works on any Zoom plan (including Free)
  • Gets speaker diarization / speaker labels
  • Works regardless of meeting platform

Cons:

  • It’s another 3rd party service in your stack

Let me know if you have any questions!


Hi Amanda, thank you for your reply. Can you please elaborate on the third point mentioned?

  1. What kind of a bot is this? Web based/server based?
  2. Is it built using the libraries provided by Zoom, or by making calls to Zoom APIs?
  3. How does it access the audio stream?

Thanks in advance.

Hi @miki,

I’m in the initial stages of planning a Zoom app primarily for a web interface. The app aims to access live streaming audio and participant details, sending this data to our server via REST or GraphQL API. We plan to use AI tools for generating meeting summaries to automate client business requirements.

Considering our focus on a web app initially, would you recommend starting with the REST API or SDK? If an SDK is preferable, which one would be best for developing a web interface MVP?

I appreciate any suggestions and guidance you can provide as we embark on this project.

Anand VM