Uptick in recording.started webhook delays

Description
We’ve been observing an uptick in delays receiving recording.started webhooks. This unfortunately causes a large degradation in our app’s UX, and our users have started to complain.

Here is the data I have around delays since the beginning of June:

    date    | day_of_week | avg_two_std_dev |  percentile_95  |  percentile_50
------------+-------------+-----------------+-----------------+-----------------
 2020-06-01 | Monday      | 00:00:16.021279 | 00:00:50.442579 | 00:00:12.749569
 2020-06-02 | Tuesday     | 00:00:15.963014 | 00:01:06.92465  | 00:00:13.099279
 2020-06-03 | Wednesday   | 00:00:21.078589 | 00:01:14.396287 | 00:00:15.946571
 2020-06-04 | Thursday    | 00:00:20.511641 | 00:01:23.695327 | 00:00:14.270173
 2020-06-05 | Friday      | 00:00:15.233531 | 00:00:45.238371 | 00:00:12.707256
 2020-06-06 | Saturday    | 00:00:12.971628 | 00:00:31.363456 | 00:00:12.413408
 2020-06-07 | Sunday      | 00:00:12.605642 | 00:00:35.668267 | 00:00:11.128933
 2020-06-08 | Monday      | 00:00:15.418336 | 00:00:56.735335 | 00:00:11.965418
 2020-06-09 | Tuesday     | 00:00:18.231974 | 00:01:13.947889 | 00:00:12.628912
 2020-06-10 | Wednesday   | 00:00:20.24954  | 00:01:14.570044 | 00:00:13.164179
 2020-06-11 | Thursday    | 00:00:19.104737 | 00:01:19.646558 | 00:00:12.326409
 2020-06-12 | Friday      | 00:00:11.041388 | 00:00:57.802527 | 00:00:04.45315
 2020-06-13 | Saturday    | 00:00:04.802712 | 00:00:14.116997 | 00:00:03.096227
 2020-06-14 | Sunday      | 00:00:12.620403 | 00:00:46.024638 | 00:00:06.520203
 2020-06-15 | Monday      | 00:00:10.532329 | 00:00:51.779646 | 00:00:03.830514
 2020-06-16 | Tuesday     | 00:00:10.514079 | 00:01:13.330755 | 00:00:04.214024
 2020-06-17 | Wednesday   | 00:00:07.017967 | 00:00:57.419305 | 00:00:03.845976
 2020-06-18 | Thursday    | 00:00:08.298223 | 00:01:06.493188 | 00:00:03.665947
 2020-06-19 | Friday      | 00:00:03.841219 | 00:00:14.667136 | 00:00:02.598809
 2020-06-20 | Saturday    | 00:00:02.484391 | 00:00:10.073425 | 00:00:02.053527
 2020-06-21 | Sunday      | 00:00:03.062286 | 00:00:09.605838 | 00:00:02.122998
 2020-06-22 | Monday      | 00:00:04.352862 | 00:00:19.724626 | 00:00:02.927834
 2020-06-23 | Tuesday     | 00:00:04.857935 | 00:00:29.300136 | 00:00:03.424311
 2020-06-24 | Wednesday   | 00:00:06.007431 | 00:00:44.227375 | 00:00:03.553677
 2020-06-25 | Thursday    | 00:00:06.388686 | 00:00:39.514964 | 00:00:02.956353
 2020-06-26 | Friday      | 00:00:03.778851 | 00:00:21.198154 | 00:00:02.417401
 2020-06-27 | Saturday    | 00:00:03.163239 | 00:00:08.340153 | 00:00:03.214184
 2020-06-28 | Sunday      | 00:00:02.938334 | 00:00:11.469515 | 00:00:02.517636
 2020-06-29 | Monday      | 00:00:05.448739 | 00:00:21.400127 | 00:00:03.277315
 2020-06-30 | Tuesday     | 00:00:11.765401 | 00:02:00.743002 | 00:00:04.730113
 2020-07-01 | Wednesday   | 00:00:05.705405 | 00:00:31.70697  | 00:00:03.166139
 2020-07-02 | Thursday    | 00:00:06.871294 | 00:00:59.873723 | 00:00:03.845508
 2020-07-03 | Friday      | 00:00:06.260145 | 00:00:22.774157 | 00:00:04.071521
 2020-07-04 | Saturday    | 00:00:04.608581 | 00:00:19.114854 | 00:00:03.130253
 2020-07-05 | Sunday      | 00:00:03.462327 | 00:00:20.154267 | 00:00:03.292376
 2020-07-06 | Monday      | 00:00:05.918549 | 00:00:24.122897 | 00:00:03.369823
 2020-07-07 | Tuesday     | 00:00:33.959562 | 00:03:25.983952 | 00:00:12.536293
 2020-07-08 | Wednesday   | 00:00:20.876246 | 00:02:03.447585 | 00:00:06.711596
 2020-07-09 | Thursday    | 00:00:38.477688 | 00:03:40.809901 | 00:00:09.565021
 2020-07-10 | Friday      | 00:00:12.370818 | 00:02:32.168604 | 00:00:04.663384
 2020-07-11 | Saturday    | 00:00:03.69942  | 00:00:10.970256 | 00:00:02.969889
 2020-07-12 | Sunday      | 00:00:05.276426 | 00:00:17.522659 | 00:00:04.207764
 2020-07-13 | Monday      | 00:00:15.481143 | 00:01:10.20115  | 00:00:05.431154
 2020-07-14 | Tuesday     | 00:00:26.56394  | 00:02:38.371137 | 00:00:08.799974
 2020-07-15 | Wednesday   | 00:00:35.296247 | 00:03:44.236482 | 00:00:12.859589
 2020-07-16 | Thursday    | 00:00:33.385211 | 00:03:42.686415 | 00:00:10.051654
 2020-07-17 | Friday      | 00:00:22.230609 | 00:02:13.764114 | 00:00:03.740537
 2020-07-18 | Saturday    | 00:00:02.981186 | 00:00:07.242833 | 00:00:02.724824
 2020-07-19 | Sunday      | 00:00:03.456752 | 00:00:10.845208 | 00:00:02.805896
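
For anyone who wants to compare against their own numbers, here is a minimal Python sketch of how per-day percentiles like the ones above could be computed from pairs of timestamps. It is not the query that produced the table: the function names, the nearest-rank percentile method, and the sample data are all assumptions for illustration, and the avg_two_std_dev column is left out because its exact definition isn't given.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def daily_delay_stats(samples):
    """samples: iterable of (event_time, received_time) datetime pairs.

    Returns {date: (p50, p95)}, each value a timedelta.
    """
    by_day = defaultdict(list)
    for event_time, received_time in samples:
        # Delay = when the webhook arrived minus when the event happened.
        by_day[event_time.date()].append(received_time - event_time)

    stats = {}
    for day, delays in by_day.items():
        delays.sort()
        stats[day] = (percentile(delays, 50), percentile(delays, 95))
    return stats


# Hypothetical sample data, not taken from the table above.
base = datetime(2020, 7, 16, 15, 0, 0)
samples = [(base, base + timedelta(seconds=s)) for s in (2, 3, 4, 10, 240)]

for day, (p50, p95) in sorted(daily_delay_stats(samples).items()):
    print(day, "p50:", p50, "p95:", p95)
```

A single slow delivery barely moves the median but dominates the 95th percentile, which is why the two columns in the table diverge so sharply on the bad days.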

Is this something on the platform team’s radar to resolve?

Which App Type (OAuth / Chatbot / JWT / Webhook)?
User OAuth

Updated OP with the full day of stats from yesterday.

Here are the stats broken down by UTC hour for 07/16:

 hour | avg_two_std_dev |  percentile_95  |  percentile_50
------+-----------------+-----------------+-----------------
    0 | 00:00:02.125198 | 00:00:03.146512 | 00:00:01.808505
    1 | 00:00:03.329084 | 00:00:04.696485 | 00:00:03.184225
    2 | 00:00:09.926758 | 00:00:15.639179 | 00:00:09.926758
    5 | 00:00:02.519912 | 00:00:03.230893 | 00:00:02.489508
    6 | 00:00:02.726775 | 00:00:05.17247  | 00:00:01.477894
    7 | 00:00:02.456481 | 00:00:04.203141 | 00:00:02.228411
    9 | 00:00:02.753934 | 00:00:04.34714  | 00:00:02.993786
   10 | 00:00:02.397506 | 00:00:03.880715 | 00:00:01.824574
   11 | 00:00:03.698221 | 00:00:05.675607 | 00:00:03.698221
   12 | 00:00:03.302421 | 00:00:04.05543  | 00:00:03.302421
   13 | 00:00:04.926494 | 00:00:14.801445 | 00:00:02.986262
   14 | 00:00:53.225172 | 00:01:54.883006 | 00:00:45.7984
   15 | 00:01:05.852784 | 00:06:52.766924 | 00:01:43.542934
   16 | 00:00:36.437627 | 00:04:06.896411 | 00:00:17.55249
   17 | 00:00:55.041941 | 00:03:59.856861 | 00:00:37.274657
   18 | 00:00:50.963014 | 00:02:52.633823 | 00:00:17.285255
   19 | 00:00:23.68317  | 00:01:54.767004 | 00:00:06.381489
   20 | 00:00:36.655798 | 00:01:45.771758 | 00:00:28.077282
   21 | 00:00:17.069276 | 00:00:48.587531 | 00:00:07.524841
   22 | 00:00:03.342405 | 00:00:05.925283 | 00:00:03.007755
   23 | 00:00:04.258815 | 00:00:13.087449 | 00:00:02.553884

As you can see, it gets particularly bad between hours 14 and 21 UTC.

Updated with data from over the weekend. Saturdays and Sundays show no delays, so it appears to be a load issue. Is expanding capacity to process these queues on the roadmap?

Hey @ryan,

Our engineering team is looking into the issue (ZOOM-178685). I will provide you with updates.

Thanks,
Tommy

Hey @ryan,

We have identified the issue and are working to fix it.

Thanks,
Tommy

Just to substantiate this a little bit: this is also something we’ve observed with other events (meeting started and ended, participants joined and left). It’s also affecting us, as we’re counting on the webhook to take prompt action on an integration. Most of the time it works close to real time, but there are periods where events take several minutes to find their way to us.

I’m attaching graphs for the past 3 days of the number of seconds that events take to find their way to us. I’m happy to try to gather any other data that could help.

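It looks like the forum only lets me attach one image, so I’m only attaching 2020-08-05.

2020-08-05: [graph: event delay in seconds by time of day]

Here’s 2020-08-04: [graph: event delay in seconds by time of day]

And 2020-08-06: [graph: event delay in seconds by time of day]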

Thanks for the info @BenS,

I will pass this to our engineers to review. :)

To make sure we are on the same page, what are the X and Y units and for which webhook?

-Tommy

Sorry for my late reply, I need to tweak my notifications for this forum.

X is the time of day; Y is the delay in seconds between the time the event actually occurred and the time it showed up on the webhook (yes, it can take several minutes).

The events in question are: meeting.participant_joined, meeting.participant_left, meeting.ended, meeting.started, meeting.participant_jbh_joined, and meeting.participant_jbh_waiting.

I did not observe any difference in delay between these events. As far as I can tell, when a delay exists, it exists for all of them indiscriminately.

I can provide raw data if this would be helpful; I just thought I’d add my voice, substantiated by visual proof :)

Thank you for looking into it.

Looks like today is another very bad day. Seeing recording webhooks 30 minutes late. Any update on ZOOM-178685?

@Tommy These are probably the worst webhook delays we’ve seen so far. Any status updates here?

Hey @ryan,

Can you please share some meetingUUIDs and which specific webhooks are delayed?

Thanks,
Tommy

Hey @BenS,

If you could provide some meetingUUIDs with the delayed webhooks that would be great.

Thanks,
Tommy

All of these had recording.started webhooks delayed for more than 40 minutes this morning:

 RVejKbixSL6zZMefOhJk6Q==
 1nOvzHCpTpGUd/XEoMp2eQ==
 FM5gpgKETUuN4FvOeZu8gg==
 ZUeZAZ4DS9GIGe3AXt0DFQ==
 Hq2zOgzTRJiYOaRhYxELzg==
 OkG/WUyGTxCalm4mdXsvPQ==
 pAM6qWp1RT6VbGPzxglFcg==
 KNOQP96IRwuSpbW1iCiUGA==
 kf7UXbA8TyKDQPwA/7e7CA==
 zzC8usEFRHSwnELUPv7SwA==
 xMZmE3ztSJeOjos9kYePeQ==
 mGzWKivKQU+aHJuaKBtnTQ==
 a2wtcpNgRASGgeqsbF21/g==
 5wlaGLzVRBWW+KNpGxzsDw==
 PjwR9jBLRfOxCpgqvEib7g==
 hnjhHkNUTyGLTs/BQMLX9Q==

It does appear the backup has now cleared. We started noticing the delays around 14:50 UTC, and they started clearing up around 17:50 UTC. We’d really appreciate any info we can pass on to our rightfully unhappy users.

I’ve got meeting UUID 4bur3VZ/QRC7rq09vMBeAQ== with a participant.left event which occurred at 2020-08-13T13:29:39 but showed up at 2020-08-13 13:43:26.
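
To put a number on that gap, here is a small worked example in Python. It uses the two timestamps from this post and assumes both are in the same time zone (presumably UTC, though the post does not say).

```python
from datetime import datetime

# Timestamps copied from the post; both are assumed to be in the same time zone.
occurred = datetime.fromisoformat("2020-08-13T13:29:39")
received = datetime.fromisoformat("2020-08-13 13:43:26")

delay = received - occurred
print(delay, int(delay.total_seconds()))  # prints "0:13:47 827"
```

That is a delay of 13 minutes 47 seconds (827 seconds) for a single event.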

I can also point to the exact event and time of reception if that’s helpful. Here are 2 more UUIDs that were affected around that same time:
1jv+bj9oQjq8eRURyGhd5Q==
tclHkENKQTS6mujFQQulxA==

Thank you!

Thanks @BenS, @ryan,

We are looking into why this happened and will get back to you.

Thanks,
Tommy

Hey @BenS, @ryan,

Make sure you are sending a 200 OK response back within 3 seconds. We see delayed responses coming from your end, which is one factor in the issue you are seeing.

That being said, some of this is on our side as well, and we are increasing our server capacity to prevent delays like this.
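
One common way to stay under a response deadline like this is to acknowledge the request right away and do the real work afterwards. Below is a minimal sketch using only the Python standard library. The names accept, process_event, WebhookHandler, and serve are hypothetical, the in-memory queue is an assumption (queued events are lost if the process restarts), and request verification is left out entirely.

```python
import json
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory hand-off between the HTTP handler and the worker (assumption:
# a production service would more likely use a durable queue).
events = queue.Queue()


def accept(body: bytes) -> int:
    """Queue the payload and return the HTTP status to send back."""
    try:
        events.put(json.loads(body))
    except ValueError:  # invalid JSON or invalid UTF-8
        return 400
    return 200


def process_event(event):
    """Placeholder for the slow work (hypothetical)."""
    print("processing", event.get("event"))


def worker():
    """Drain the queue in the background, one event at a time."""
    while True:
        event = events.get()
        try:
            process_event(event)
        finally:
            events.task_done()


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = accept(self.rfile.read(length))
        # Reply before any processing happens.
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()


def serve(port=8080):
    """Start the background worker, then block serving requests."""
    threading.Thread(target=worker, daemon=True).start()
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling serve() starts the worker thread and listens on port 8080. The handler only parses and queues the body, so the time to respond no longer depends on how long process_event takes.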

Thanks,
Tommy

@tommy thanks, I’ll investigate why that may be on my end. It’s definitely pretty curious, since the processing I do is super lightweight and the server handles much more than the webhook call without such response times. I’ll start keeping track of the time between the request’s arrival and the time it finishes processing. Thank you for pointing that out.
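
A rough sketch of what that tracking could look like in Python, assuming the webhook is handled by a plain function: a decorator that logs how long each call takes and warns when it exceeds the 3-second limit mentioned above. The names timed, handle_webhook, and the logger name are hypothetical.

```python
import functools
import logging
import time

RESPONSE_BUDGET_SECONDS = 3.0  # the limit mentioned in the reply above

log = logging.getLogger("webhook.timing")


def timed(handler):
    """Wrap a handler and log how long each call takes."""

    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return handler(*args, **kwargs)
        finally:
            elapsed = time.monotonic() - start
            # Escalate to a warning once the handler goes over budget.
            over = elapsed > RESPONSE_BUDGET_SECONDS
            level = logging.WARNING if over else logging.INFO
            log.log(level, "%s took %.3fs", handler.__name__, elapsed)

    return wrapper


@timed
def handle_webhook(payload):
    """Hypothetical handler; returns the HTTP status to send."""
    return 200
```

This only measures time spent inside the handler. Anything that happens before the handler is called (a proxy, TLS termination, requests waiting for a free worker) is not captured, so the sender may still see a slower response than these logs suggest.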

And thank you for getting to the bottom of it on your end :). Are you saying that the server capacity has been increased already? If not, do you have an idea of what the timing will be? I’m curious to see if I can confirm the effects in the graphs.

Thanks! I love your position and how proactive you are on the forum threads. It’s refreshing to deal with knowledgeable people who can actually enact change. Hats off to Zoom & you for this.