Audio-Video Sync Issues with Raw Data from Zoom SDK on Linux

Description:

We’re experiencing persistent audio-video synchronization issues when processing raw YUV video and PCM audio data from the Zoom Meeting SDK on Linux. Despite multiple approaches, we cannot achieve proper sync.

Setup:

  • Linux Ubuntu 22.04, Zoom Meeting SDK
  • Capturing raw YUV420 video via onVideoRawDataReceived()
  • Capturing raw PCM audio via onMixedAudioRawDataReceived()
  • Using FFmpeg for post-processing and composition

Issues:

  1. Variable video speed: Video starts at ~1.5x speed, then settles to 1x
  2. Audio-video drift: Constant lag between audio and video streams
  3. Timestamp inconsistencies: Raw data timestamps appear unreliable

Attempted Solutions:

  • FFmpeg setpts and atempo filters with various values
  • GStreamer automatic synchronization with videomixer
  • Multiple participant layouts (e.g., 1 large + 3 small videos)
  • Forced constant frame rates with fps=30 filter
  • Zoom SDK’s GetTimeStamp() method
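For what it's worth, the correction factors behind the setpts/atempo attempts can be derived from measurements rather than tried by hand. A minimal sketch (the function names are illustrative, not part of FFmpeg or the SDK), assuming you can measure the real capture duration against the media's reported duration:

```python
# Sketch: deriving FFmpeg setpts/atempo values from measured durations.
# Assumes you measured wall-clock capture time vs. media duration.

def setpts_factor(real_duration_s: float, media_duration_s: float) -> float:
    """Factor F for setpts=F*PTS so the track plays over real_duration_s."""
    return real_duration_s / media_duration_s

def atempo_chain(speed: float) -> str:
    """Build an atempo filter chain; each atempo stage accepts only 0.5-2.0,
    so larger corrections must be chained."""
    stages = []
    while speed > 2.0:
        stages.append("atempo=2.0")
        speed /= 2.0
    while speed < 0.5:
        stages.append("atempo=0.5")
        speed /= 0.5
    stages.append(f"atempo={speed:.6f}")
    return ",".join(stages)

print(setpts_factor(60.0, 40.0))  # video that ran ~1.5x fast -> stretch by 1.5
print(atempo_chain(3.0))          # -> atempo=2.0,atempo=1.500000
```

A single guessed value can't fix the "starts at 1.5x, settles to 1x" pattern, though, because the speed error is not constant over the clip.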

Question:
What’s the recommended approach for maintaining proper audio-video sync when processing raw data from multiple participants? Are there specific timing considerations or SDK methods we should use for frame-accurate synchronization?

“We would also like to know how we can start recording automatically, without the host’s help, whether the host is from our organization or a different one.”
(i.e., for both internal and external meetings)

Any guidance would be greatly appreciated!

Hey @Venkat_Koushik, you could be seeing A/V drift and speed swings for several reasons:

  • Early frames are using wall time or callback order instead of SDK timestamps
  • Video pacing isn’t tied to a master clock, so the first seconds can run ~1.5× before settling
  • PTS origin varies per stream, so audio and video start at different zeros and then drift
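The last point is easy to demonstrate. A minimal sketch with made-up timestamps (not real SDK values):

```python
# Sketch: why differing PTS origins cause a constant A/V lag, and the fix.
# Timestamps in milliseconds; values are illustrative only.

audio_ts = [1000, 1020, 1040, 1060]  # audio stream's clock starts at 1000 ms
video_ts = [1250, 1283, 1316, 1349]  # video stream starts at 1250 ms (~30 fps)

# Naive mux: use raw timestamps as PTS -> constant 250 ms offset between tracks.
naive_offset = video_ts[0] - audio_ts[0]

# Fix: normalize each track to its own first timestamp so both start at t=0.
audio_pts = [t - audio_ts[0] for t in audio_ts]
video_pts = [t - video_ts[0] for t in video_ts]

print(naive_offset)                # 250
print(audio_pts[0], video_pts[0])  # 0 0
```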

A few things you could try to improve A/V sync:

  • Align all timings from SDK timestamps, not arrival time or system clock, using AudioRawData::GetTimeStamp and the Linux raw data callbacks
  • Buffer ~150–300 ms per stream as a jitter buffer, pick audio as master, normalize each track to t=0, and pace the video by drop/dup to your target FPS
  • If capture rate drifts, resample audio to the master clock and keep video aligned; Zoom’s guidance is to timestamp each audio and video frame and ensure they are played back in sync
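The drop/dup pacing from the second bullet can be sketched as below. This is a simplified model, assuming timestamps in milliseconds and frames already pulled out of the SDK callbacks, not a production implementation:

```python
# Sketch: pace video to a fixed FPS by dropping frames that arrive early
# and duplicating the last frame when input is late. (timestamp_ms, frame_id)
# pairs stand in for real frames; timestamps would come from the SDK.
from collections import deque

def pace_video(frames, target_fps=30):
    """frames: list of (timestamp_ms, frame_id), sorted by timestamp.
    Returns one frame_id per output slot at target_fps."""
    if not frames:
        return []
    period = 1000.0 / target_fps
    t0 = frames[0][0]
    end = frames[-1][0]
    queue = deque(frames)
    out = []
    last = None
    slot = 0
    while t0 + slot * period <= end:
        deadline = t0 + slot * period
        # Drop: consume every frame due by this slot, keep only the newest.
        while queue and queue[0][0] <= deadline:
            last = queue.popleft()[1]
        # Dup: if nothing new arrived, repeat the last frame.
        out.append(last)
        slot += 1
    return out

# Two frames in one 100 ms slot -> the older one is dropped.
print(pace_video([(0, "a"), (10, "b"), (100, "c")], target_fps=10))  # ['a', 'c']
# A 250 ms gap at 10 fps -> 'a' is duplicated to fill the gap.
print(pace_video([(0, "a"), (250, "b")], target_fps=10))  # ['a', 'a', 'a']
```

In a real pipeline you would run this continuously against the jitter buffer, with the audio track (resampled if needed) defining the clock that `deadline` is derived from.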

On auto-starting recording for internal or external meetings: you can request a meeting’s join token for local recording, which lets your bot start recording automatically after it joins the call. Note that this generally works only for meetings owned by the authenticated user/app.
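A minimal sketch of requesting that token via the Zoom REST API, assuming a Server-to-Server OAuth access token with the appropriate scopes (the meeting ID and bearer token below are placeholders):

```python
# Sketch: build the request for a meeting's local-recording join token
# (GET /v2/meetings/{meetingId}/jointoken/local_recording per Zoom's REST API).
# Placeholders only; add your own error handling and token refresh.
import urllib.request

def local_recording_token_request(meeting_id: str, access_token: str):
    url = f"https://api.zoom.us/v2/meetings/{meeting_id}/jointoken/local_recording"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )

req = local_recording_token_request("123456789", "YOUR_OAUTH_ACCESS_TOKEN")
print(req.full_url)
# The bot then passes the returned token when joining so it can start
# local recording without host intervention.
```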

If you’d rather not build and maintain the buffering and sync layer yourself, teams often use Recall.ai’s meeting bot API to pull real-time Zoom audio, video, and transcripts, offloading the multi-participant timing and layout orchestration.