Obtaining Zoom video in real time for object detection

Hi, I’d like to build a patient monitoring interface on top of Zoom by performing object detection of people in real time. In order to do this I need to get access to the raw realtime video data, and I found this discouraging post from 2020: Access to streaming data for object detection

Is it still true that this is unavailable or is there a way to do this nowadays?

Thanks

Hi @austinmw
Thanks for reaching out to the Zoom Developer Forum. I am happy to help here!
Unfortunately, this is still not available.
Cheers,
Elisa

@austinmw, this is definitely possible. There are 3 common ways you could access the real-time raw video data from Zoom.

1. Use the Zoom live-streaming API

Pros:

  • Doesn’t require any 3rd party services
  • Lighter weight than building and running a Zoom bot

Cons:

  • Needs to be initiated on a per-meeting basis
  • You need to set up an RTMP server to receive the data, which requires engineering effort to deploy, scale, and monitor
  • Participants can get spooked by the “live” badge that appears in the meeting, depending on the use case
  • Can’t get separate video stream per participant
  • No speaker diarization
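
To make option 1 a bit more concrete, here is a rough sketch of the two REST calls involved in pointing a meeting's live stream at your own RTMP server and then starting it. The endpoint paths and field names (`/meetings/{meetingId}/livestream`, `stream_url`, `stream_key`, `page_url`, `action`) are written from memory of the Zoom Meeting API, so treat them as assumptions and check the current API reference. The meeting ID, token, and URLs are hypothetical placeholders. The function only builds the requests and does not send them.

```python
import json
import urllib.request

ZOOM_API = "https://api.zoom.us/v2"  # assumed base URL for the Zoom REST API


def build_livestream_requests(meeting_id, access_token, stream_url, stream_key, page_url):
    """Build (but do not send) the requests that configure and start a live stream."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    # Step 1: tell Zoom where to push the RTMP stream (assumed field names).
    configure = urllib.request.Request(
        f"{ZOOM_API}/meetings/{meeting_id}/livestream",
        data=json.dumps(
            {"stream_url": stream_url, "stream_key": stream_key, "page_url": page_url}
        ).encode("utf-8"),
        headers=headers,
        method="PATCH",
    )
    # Step 2: start the stream. This has to be repeated for every meeting.
    start = urllib.request.Request(
        f"{ZOOM_API}/meetings/{meeting_id}/livestream/status",
        data=json.dumps({"action": "start"}).encode("utf-8"),
        headers=headers,
        method="PATCH",
    )
    return configure, start
```

You would pass each request to `urllib.request.urlopen` while the meeting is running. Your RTMP server then receives a single composited stream, which you decode into frames and feed to the detector.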

2. Build a Zoom bot

Pros:

  • Can get the separate audio streams per participant for perfect diarization / speaker labels
  • Can get separate video streams per participant
  • Doesn’t spook participants

Cons:

  • Extremely heavyweight: you would need to spin up multiple servers to run the Zoom client for the bot
  • Running the infrastructure for a Zoom bot costs more than live streaming
  • You need to encode the raw video and audio yourself
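
To give a feel for what handling the raw video yourself involves, here is a minimal sketch that assumes the bot receives frames as I420 (planar YUV 4:2:0) byte buffers. That format is an assumption, so check what your SDK actually delivers. The function names are hypothetical helpers, not part of any Zoom SDK.

```python
def split_i420(frame: bytes, width: int, height: int):
    """Split one I420 frame into its Y, U and V planes."""
    if width % 2 or height % 2:
        raise ValueError("this sketch assumes even width and height")
    y_size = width * height                 # full-resolution luma plane
    c_size = (width // 2) * (height // 2)   # each chroma plane is quarter size
    expected = y_size + 2 * c_size
    if len(frame) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(frame)}")
    y = frame[:y_size]
    u = frame[y_size:y_size + c_size]
    v = frame[y_size + c_size:]
    return y, u, v


def raw_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Uncompressed I420 data rate for a single video stream."""
    return (width * height * 3 // 2) * fps
```

For a single 640x360 stream at 25 fps, `raw_bytes_per_second(640, 360, 25)` comes to 8,640,000 bytes per second, and that is per participant. For detection you can often run on the raw planes directly (or on a converted RGB image) and only encode what you need to store.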

3. Use Recall.ai

It’s a unified API that lets you send meeting bots to video conferencing platforms to capture the audio and video in real-time.

Pros:

  • Handles spinning up the servers and providing the real-time raw video/audio, so all you interact with is a simple API
  • Can get separate video streams per participant
  • Can get the separate audio streams per participant for perfect diarization / speaker labels
  • Works regardless of meeting platform

Cons:

  • It’s another 3rd party service in your stack