I am struggling to get started on a Zoom app that would transcribe meeting audio live and push this audio to an external API. I am not sure where to start: whether I should use the Meeting SDK, the Video SDK, Zoom Apps, OAuth, or Server-to-Server OAuth.
The idea is that I would like to have this application transcribe audio from every meeting the user is a part of, and after every meeting, push this transcription to an external API.
Hey @parthasarathy.madhav, unfortunately, there are no direct API endpoints to access the real-time transcript. However, here are 4 other ways you could explore to create a real-time transcript from a Zoom meeting.
1. Use the Zoom RTMP live-streaming API
Pros:
- Doesn’t require any 3rd party services
- Lighter weight than building and running a Zoom bot
Cons:
- Needs to be initiated on a per-meeting basis
- You need to set up an RTMP server to receive the data, which requires engineering effort to deploy, scale, and monitor
- Participants can get spooked by the “live” badge that appears in the meeting (even if it’s a private meeting)
- No speaker separation
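To make option 1 more concrete, here is a rough sketch of how the live stream could be configured and started through the Zoom REST API. It assumes the `PATCH /meetings/{meetingId}/livestream` and `PATCH /meetings/{meetingId}/livestream/status` endpoints and an OAuth access token with the right scopes. The function names, the display name, and the RTMP URL and key are all placeholders I made up for illustration, so check the request bodies against the current Zoom API reference before relying on them.

```python
import json
import urllib.request

ZOOM_API_BASE = "https://api.zoom.us/v2"


def build_livestream_requests(meeting_id, access_token, stream_url, stream_key, page_url):
    """Build the two PATCH requests: configure the stream, then start it.

    Hypothetical helper: the request bodies follow my reading of the
    Zoom live stream endpoints and should be verified against the docs.
    """
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }

    def patch(path, body):
        return urllib.request.Request(
            f"{ZOOM_API_BASE}{path}",
            data=json.dumps(body).encode("utf-8"),
            headers=headers,
            method="PATCH",
        )

    # Step 1: point the meeting at your own RTMP server.
    configure = patch(
        f"/meetings/{meeting_id}/livestream",
        {"stream_url": stream_url, "stream_key": stream_key, "page_url": page_url},
    )
    # Step 2: start streaming (the meeting has to be in progress).
    start = patch(
        f"/meetings/{meeting_id}/livestream/status",
        {
            "action": "start",
            "settings": {"active_speaker_name": True, "display_name": "Transcription stream"},
        },
    )
    return [configure, start]


def start_livestream(meeting_id, access_token, stream_url, stream_key, page_url):
    """Send both requests in order; raises urllib.error.HTTPError on failure."""
    for request in build_livestream_requests(
        meeting_id, access_token, stream_url, stream_key, page_url
    ):
        with urllib.request.urlopen(request) as response:
            response.read()
```

Because this has to run once per meeting, you would likely trigger `start_livestream` from something that knows when a meeting begins, and your RTMP server would then feed the received audio to whatever transcription engine you choose.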
2. Build a desktop app to capture users’ computer audio
Pros:
- One of the most cost-effective solutions
Cons:
- You need to build a separate app for Windows, Mac and Linux
- It is especially difficult to tap into computer audio on Mac
- The app runs on the user’s computer, so it can slow the machine down or make the fans spin up
- No speaker separation
- Not compliant with Zoom’s recording policies
3. Build a Zoom bot
Pros:
- Can get the separate audio streams per participant for perfect diarization / speaker labels
Cons:
- It is very heavy-weight as you would need to spin up multiple servers to run the Zoom client for the bot
- Running the infrastructure for a Zoom bot costs more than live streaming
- You need to encode the raw video and audio yourself
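As a small illustration of the "encode it yourself" point for option 3, here is a minimal sketch that collects raw PCM audio per participant and wraps it in a WAV container using only the Python standard library. It assumes the bot hands you 16-bit mono PCM at 32 kHz; that format, the `ParticipantAudioBuffer` class, and the `on_audio_frame` callback name are my own assumptions for the example, not something taken from the Zoom SDK.

```python
import io
import wave

# Assumed raw audio format; confirm against what your bot actually receives.
SAMPLE_RATE_HZ = 32000
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
CHANNELS = 1


class ParticipantAudioBuffer:
    """Accumulates raw PCM per participant so each speaker stays separate."""

    def __init__(self):
        self._pcm = {}

    def on_audio_frame(self, participant_id, pcm_bytes):
        # Hypothetical callback: call this for every raw audio frame received.
        self._pcm.setdefault(participant_id, bytearray()).extend(pcm_bytes)

    def participants(self):
        return list(self._pcm)

    def to_wav(self, participant_id):
        """Return one participant's audio as the bytes of a WAV file."""
        out = io.BytesIO()
        with wave.open(out, "wb") as wav:
            wav.setnchannels(CHANNELS)
            wav.setsampwidth(SAMPLE_WIDTH_BYTES)
            wav.setframerate(SAMPLE_RATE_HZ)
            wav.writeframes(bytes(self._pcm.get(participant_id, b"")))
        return out.getvalue()
```

Keeping one buffer per participant is what gives you the speaker labels: each WAV can be transcribed on its own and tagged with that participant. For true real-time use you would probably flush short chunks to a streaming transcription service instead of buffering a whole meeting, and use a compressed codec for anything you store.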
4. Use Recall.ai
Recall.ai is a unified API that lets you send meeting bots to video conferencing platforms to capture the audio, video, and transcription in real time.
Pros:
- Handles spinning up the servers and providing the real-time raw audio/transcript, so all you interact with is a simple API
- Works on any Zoom plan (including Free)
- Gets speaker diarization / speaker labels
- Works regardless of meeting platform
Cons:
- It’s another 3rd party service in your stack
Let me know if you have any questions!