Is there an API, or any other way, to transcribe audio from a Zoom call in real time so that we can use it in our backend for further processing? This would of course be done with the user's consent.

Is your feature request related to a problem? Please describe.
I want a feature that lets me transcribe Zoom call audio in real time and use it in my backend, with the user's consent.

Describe the solution you’d like
If I can get an API for the call's audio, or direct access to it, I can handle the rest myself.

@muskan, there are a few ways to get a real-time transcription from a Zoom call. Here are the three most common:

1. Use the Zoom live-streaming API

Pros:

  • Doesn’t require any 3rd party services
  • Lighter weight than building and running a Zoom bot

Cons:

  • The end user has to start the live stream manually in every meeting
  • You need to set up an RTMP server to receive the data, which requires engineering effort to deploy, scale, and monitor
  • Participants can get spooked by the “live” badge that appears in the meeting
  • No speaker separation: you receive a single mixed audio stream
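
To give a feel for the receiving side of option 1, here is a minimal sketch. It assumes you already run an RTMP server that Zoom's custom live stream is pushed to, that `ffmpeg` is installed on the machine, and that your speech-to-text engine accepts 16 kHz mono 16-bit PCM. The function names, the 100 ms chunk size, and the example URL are all made up for illustration; none of them come from Zoom.

```python
import subprocess
from typing import Iterator, List

SAMPLE_RATE = 16_000   # Hz; assumed input rate of the speech-to-text engine
BYTES_PER_SAMPLE = 2   # 16-bit signed PCM
CHUNK_MS = 100         # assumed chunk length sent to the transcriber


def ffmpeg_audio_command(rtmp_url: str) -> List[str]:
    """Build an ffmpeg command that pulls the stream, drops the video,
    and writes mono 16-bit PCM to stdout."""
    return [
        "ffmpeg", "-loglevel", "error",
        "-i", rtmp_url,           # stream served by your own RTMP server
        "-vn",                    # ignore the video track
        "-ac", "1",               # downmix to mono
        "-ar", str(SAMPLE_RATE),  # resample
        "-f", "s16le", "-",       # raw PCM on stdout
    ]


def chunk_size_bytes(chunk_ms: int = CHUNK_MS) -> int:
    """Number of bytes in one chunk of PCM audio."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000


def pcm_chunks(rtmp_url: str) -> Iterator[bytes]:
    """Yield fixed-size PCM chunks until the stream ends."""
    proc = subprocess.Popen(ffmpeg_audio_command(rtmp_url), stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size_bytes())
            if not chunk:
                break
            yield chunk
    finally:
        proc.kill()
        proc.wait()
```

You would then loop over `pcm_chunks("rtmp://your-server/live/your-stream-key")` (a placeholder URL) and forward each chunk to whatever transcription service you use. Note that the audio is the mixed meeting audio, so speaker labels have to come from somewhere else.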

2. Build a Zoom bot

Pros:

  • You get a separate audio stream per participant, which gives you reliable diarization / speaker labels
  • Doesn’t spook participants

Cons:

  • It is very heavyweight, since you need to spin up multiple servers to run the Zoom client for the bot
  • Running the infrastructure for a Zoom bot costs more than live streaming
  • You need to encode the raw video and audio yourself
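
To illustrate the "encode it yourself" point for audio, here is a minimal sketch that wraps a raw PCM buffer in a WAV container using only Python's standard library. It assumes the bot hands you 16-bit signed PCM per participant; the 32 kHz mono default and the `pcm_to_wav` name are assumptions for illustration, so check the format your SDK actually delivers.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 32_000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Hypothetical usage: one buffer per participant keeps speaker labels trivial.
per_speaker_pcm = {"alice": b"\x00\x00" * 32_000, "bob": b"\x00\x00" * 16_000}
per_speaker_wav = {name: pcm_to_wav(pcm) for name, pcm in per_speaker_pcm.items()}
```

Because each participant has their own buffer, the speaker label is simply the dictionary key, which is the main advantage of the bot approach over a mixed live stream. Video encoding is a much bigger job and is not covered by this sketch.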

3. Use Recall.ai

It’s a unified API that lets you send meeting bots to video conferencing platforms to capture the audio and video in real time.

Pros:

  • Handles spinning up the servers and providing the raw audio in real time, so all you interact with is a simple API
  • Gets speaker diarization / speaker labels
  • Works regardless of meeting platform

Cons:

  • It’s another third-party service in your stack

Let me know if you have any questions!