This is a good question. Zoom's automated captions come from automated speech recognition (ASR) transcription. I believe they relay the same text you would receive in the VTT or TXT transcript files, attributed to each speaker. However, I believe CEA-708 is being used because you have the ability to customize the captions in a meeting or webinar.
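If you want to compare the captions against the transcript file, here is a minimal sketch for reading cues out of a VTT file. It assumes each cue has a standard WebVTT timing line and that the text may start with a "Speaker: " prefix; the `parse_vtt` helper and the sample data are hypothetical and not part of any Zoom API, so adjust to whatever your export actually contains.

```python
import re

# Matches a WebVTT timing line such as "00:00:01.000 --> 00:00:04.000".
CUE_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})"
)


def parse_vtt(text):
    """Return a list of cues as dicts with start, end, speaker and text.

    Assumption: a "Name: " prefix on the cue text is a speaker label.
    Cue text that happens to contain ": " would be split the same way.
    """
    cues = []
    # Cues are separated by blank lines; the "WEBVTT" header block has
    # no timing line and is skipped.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIMING.match(line.strip())
            if match:
                payload = " ".join(l.strip() for l in lines[i + 1:])
                speaker, sep, rest = payload.partition(": ")
                cues.append({
                    "start": match.group(1),
                    "end": match.group(2),
                    "speaker": speaker if sep else None,
                    "text": rest if sep else payload,
                })
                break
    return cues


# Made-up sample data for illustration only.
SAMPLE = """WEBVTT

1
00:00:01.000 --> 00:00:04.000
Alice: Hello everyone.

2
00:00:04.500 --> 00:00:06.000
Thanks for joining.
"""

for cue in parse_vtt(SAMPLE):
    print(cue["start"], cue["speaker"], cue["text"])
```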
Here’s a closed captioning guide that could also be useful: Live Streaming with RTMP - #4 by MaxM
There may be a few UI differences from what the guide shows, but it should still be generally accurate.