Understanding Daily call quality logging and metrics
At Daily, we want to give all of our users as much information as possible about their calls, allowing users to gain a better understanding of call quality logging and metrics without intruding on call participants' privacy. And even though we don't save or record any audio and video (unless you want us to), calls can still generate a lot of metadata!
We have some great blog posts about viewing logs in our Daily Dashboard and understanding the logging and metrics found in our
/logs API endpoint, but looking through the logs to determine why a participant had a bad call experience can be daunting, to say the least.
This guide is written to help support teams at companies building on Daily know what to say to their end users when they report call issues. It should give you an understanding of what to look out for when reading through Daily call logs.
To get started, first we have to understand what we mean when we say call quality. At Daily, we define call quality as the user's perception of how good or bad a call was. In particular, a good quality call is one where a user does not perceive any issues in sending or receiving audio and video to/from others in a call.
This sounds simple, but user-perceived issues often depend on external factors like network issues, device capabilities, user error, and confusing UI. For example, a user may report not being able to hear others in a call. This could be because the sound output was muted or it could have been an issue with receiving the audio streams.
This guide will primarily focus on reading the logs and metrics to determine if issues were due to network or CPU concerns and will not address the plethora of other issues that can contribute to a "bad call."
The two most common constraints are:
- Bandwidth and network congestion
- CPU load on participant's machine
Being able to communicate instantly with anyone anywhere in the world, using streaming video over the internet, is a pretty amazing feat. But network issues can quickly and negatively impact call experience.
Some common network issues include:
- Poor WiFi or cellular connections
- Internet router constraints
- Network congestion (such as other applications sending or receiving a lot of bandwidth at the same time as the video call)
- VPNs and corporate firewalls
Browsers have some insight into a user's CPU performance, but that information is not available downstream to browser-based applications. Due to this lack of data, it can be nearly impossible to know if something else happening on the user's computer is causing call issues, but we'll give you a guide to some of the tea leaves below.
We rely on web browsers to convey any information about the participant's technical setup and configuration. We can then make some assumptions based on the data we are getting back. Browsers do not implement the WebRTC standard equally, and not all browsers consistently convey accurate information.
Daily collects and stores events from a call (these are the logs) and gathers statistics about tracks as they change over time (these are the metrics).
Call logs detail what happens in a call from the point of view of each participant. These logs include information about signaling connections, participant connections and events, participants' environments and their actions, and error states.
Call metrics help to characterize the performance and stability of the connection by providing statistics about each track and transport layer for every participant and their connections. Daily visualizes this information in the Daily Dashboard for each user of each session.
Below are some significant log messages to look out for. If your user can also include timestamps with their complaint, this can significantly cut down on the time spent searching in logs.
This video quality limitation reason message can be triggered by a variety of network conditions or CPU issues and can give some clues about what might have caused the low video resolution during a call.
The above log line will look something like this:
cam-video-videoQualityLimReason=bandwidth. The Track will be the specific video that is being limited and the Value is the reason why the limitation was implemented. This reason is implemented by the participant's browser, not by Daily. You can dig into the details by looking at the WebRTC stats specification.
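When scanning exported logs programmatically, a line in this Track=Value shape can be split apart with a small helper. This parser is a sketch of ours, not part of Daily's tooling, and assumes the format shown above:

```typescript
// Split a quality-limitation log line into its Track and Value parts.
// The format shown above is `<track>-videoQualityLimReason=<reason>`;
// this parser is illustrative and not part of Daily's API.
function parseQualityLimitation(
  logLine: string
): { track: string; reason: string } | null {
  const match = logLine.match(/^(.*)-videoQualityLimReason=(\w+)$/);
  if (!match) return null;
  return { track: match[1], reason: match[2] };
}

// The example line from above:
parseQualityLimitation("cam-video-videoQualityLimReason=bandwidth");
// → { track: "cam-video", reason: "bandwidth" }
```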
When the browser is encoding the video a participant is sending out, it may need to temporarily reduce the quality below what has been requested. The value of this log line will describe the reason for the limit. The only options are bandwidth, cpu, other, and none. If the issue has been resolved and the browser is announcing that it is returning the video to full expected quality, there will be a log line where the videoQualityLimReason will be none.
A quality limitation of cpu is the only case where we directly receive information from the browser that it has detected CPU constraints. CPU limitations can still negatively impact call quality in ways that manifest without this message.
These log messages give us insight into what the participant's browser is doing to adjust the participant's video resolution. This adjustment is determined by the browser and made in response to various conditions described above, such as bandwidth or CPU issues. These log events are generated anytime the video resolution changes.
The participant's browser (not Daily) implements these constraints, as defined by the RTCOutboundRtpStreamStats specification. This adjustment by the browser is only supported on Chrome and Chromium-based applications (Microsoft Edge, Electron).
As an example, take a call where our servers are sending 3 simulcast cam-video Tracks with the following resolutions:
- High (1280x720)
- Mid (640x360)
- Low (320x180)
Our logs will show:
The example above represents a normal and expected call. Here, the log
cam-video-HighestVideoSendFrameSize represents the highest video resolution that is being sent at that time. If the
HighestVideoSendFrameSize is less than the highest
videoSendFrame, there is probably an issue. Look for videoQualityLimReason log lines prior to this log for insight into what might be causing the browser to send frames that are smaller than what you requested.
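When scripting over exported logs, that comparison can be written as a quick check. The helper below is illustrative (not part of Daily's tooling) and assumes layer heights stand in for frame sizes:

```typescript
// Flag a potential send-side problem: the highest resolution actually being
// sent is below the highest simulcast layer requested. Layer heights are
// used as a proxy for frame size; this helper is illustrative.
function sendResolutionDegraded(
  requestedHeights: number[],
  highestSentHeight: number
): boolean {
  return highestSentHeight < Math.max(...requestedHeights);
}

// With the three simulcast layers above (720/360/180):
sendResolutionDegraded([720, 360, 180], 720); // → false: full quality
sendResolutionDegraded([720, 360, 180], 360); // → true: check videoQualityLimReason
```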
At the very beginning of the call, we log data sent from the participant's browser. The
Connection downlink = [Value] log line reports the browser's own network measurement at the beginning of the call.
Note that 10 is also the maximum possible number for this value: browsers cap the reported estimate, so a healthy connection will typically report exactly 10. Any number below that cap is a real measurement and can indicate network constraints that will impact the user.
This value is pulled from the NetworkInformation API's downlink property; it is the browser's bandwidth estimate at the beginning of the call, in megabits per second (Mbps). We recommend keeping the receiving bitrate of calls under 3 Mbps. If a connection downlink of less than 3 is reported, the participant will likely have a hard time even joining the call, and won't have a good experience if they do manage to connect.
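Put together, the downlink reading can be triaged roughly like this. The 10 Mbps cap and 3 Mbps guideline come from the discussion above; the function and its labels are our own, not part of any Daily API:

```typescript
// The NetworkInformation downlink value is reported in Mbps and capped at 10,
// so a healthy connection reads exactly 10. A rough triage based on the
// thresholds described above; the labels and function are illustrative.
function triageDownlink(
  downlinkMbps: number
): "at-cap" | "constrained" | "likely-unusable" {
  if (downlinkMbps >= 10) return "at-cap"; // at the cap: no sign of trouble
  if (downlinkMbps >= 3) return "constrained"; // real estimate under the cap
  return "likely-unusable"; // below the 3 Mbps guideline: joining may fail
}
```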
Another thing to note is that this log captures only the download speed and does not include upload speeds. If your users run a speed test on their machine, it is very likely that their upload rates are far worse than their download rates. So if the download rate reported here is low, their upload rate is probably even lower.
Note that this is only supported on Chrome and Chromium-based applications (Microsoft Edge, Electron).
This could point the blame at the participant's operating system and possibly hardware acceleration (either at the OS level or the browser level, configurable in Chrome). We have seen this cause some problems with encoding, ranging from keyframe issues to black frames.
We want to see the
videoEncoderImpl value be pretty much anything else, such as a known encoder value like
libvpx. Even something more complex like
SimulcastEncoderAdapter (libvpx, libvpx, libvpx) is fine.
There are cases when a
network-connection turns to
interrupted not because of a network issue, but because of an overloaded CPU. This can result in erratic logging and create
wss is stale logs. Seeing a
network-connection state move to interrupted repeatedly over a short period of time could be an indication of networking issues or CPU issues. If all other indicators point to the actual network performing well, the issue could be CPU-based. Look for any other indicators that CPU is to blame, like those outlined above.
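One way to quantify "repeatedly over a short period of time" when scripting over exported logs is to count interruptions inside a sliding window. The window size and function below are our own choices, not a Daily-defined threshold:

```typescript
// Count the densest cluster of network-connection interrupted events inside
// a sliding window; repeated flips in a short period suggest network or CPU
// trouble. Timestamps are in milliseconds; names here are illustrative.
function interruptionsInWindow(
  interruptedTimestamps: number[],
  windowMs = 60_000
): number {
  let max = 0;
  for (const start of interruptedTimestamps) {
    const inWindow = interruptedTimestamps.filter(
      (t) => t >= start && t < start + windowMs
    ).length;
    max = Math.max(max, inWindow);
  }
  return max;
}

// Three interruptions within one minute is worth investigating:
interruptionsInWindow([0, 20_000, 45_000, 300_000]); // → 3
```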
You can look for log lines that mention
wss is stale and
network-connection interrupted events. These indicate that a connection between the Daily client (embedded within your app) and our signaling server (the server that directs calls to the right people and places, among other things) has been temporarily broken and our code is working towards remedying that severed connection. If and when the client and server are reconnected, any missed messages are redelivered, but it can still cause some gaps and a poor call quality experience.
If paired with a message that reads
recv/send transport changed to disconnected, the user experience is more jarring because it means the user had a complete disconnect (from both UDP and WebSocket). When this happens, we reconnect all media connections. A disconnect on just one or the other could go undetected depending on how long the connections were disconnected. Transport disconnects (UDP) oftentimes reconnect almost instantaneously.
There are a few other indicators that can point to different call quality issues, but they do not definitively point to problems. These are additional things you can look out for or check when trying to get a better picture of the call based on the data we have access to.
The value here could be a negative or positive number. If this line is in the logs and the value is significantly negative (that is, a negative number whose magnitude is more than a couple hundred milliseconds), it can indicate an unexpected delay between the device sending data and when our servers receive that data. In particular, look out for reports where the magnitude exceeds 1 second (1000 ms).
This log line can help build the case that there is an issue, but is harder to rely on as a sole indicator of poor call performance due to bandwidth or device constraints. It is possible that the clock on the device is not perfectly in sync with global time and is not an accurate measurement.
We do our best to make adjustments to match the timestamps for client logs with the server timestamp (which fixes issues like the participant's computer clock being off), and this log was initially added to indicate this adjustment. In practice, we have been able to use it for identifying network lag, even if that wasn't the original intent.
This log message is only logged once, at the beginning of the call, so it is possible that the issue will correct itself or get worse as the call goes on. If your user is having issues for the entire duration of their call, this could be one to look for. But if your user starts off having a great call that suddenly experiences trouble 15 minutes into the call, this line will not provide much insight.
Chrome browsers will announce the estimated device memory of the device they are on. This value is a floating point number representing the device's estimated memory capacity in gigabytes. If this number is
2 or below, the device may not have a suitable amount of RAM to have good quality calls, especially if there are other programs running on the machine or other interactive elements in the application that can cause the device to struggle with memory management. This number could be high, but if there are enough other memory-hogging applications running, there could still be CPU issues.
The maximum value for this reported device memory is 8; even machines with far more RAM will report 8, so a high reading only rules out very low-memory devices.
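Because the reported value is clamped, a coarse bucketing is more honest than exact comparisons. The bucket names and thresholds below are our own, based on the 2 GB guideline above and Chrome's cap of 8:

```typescript
// navigator.deviceMemory is clamped by the browser (Chrome reports at most 8),
// so treat the reading as a coarse bucket rather than an exact figure.
// The bucket names and thresholds here are illustrative.
function memoryBucket(deviceMemoryGiB: number): "low" | "ok" | "at-cap" {
  if (deviceMemoryGiB <= 2) return "low"; // likely too little RAM for good calls
  if (deviceMemoryGiB >= 8) return "at-cap"; // capped: actual RAM may be far higher
  return "ok";
}
```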
In addition to the verbose logs we provide about events happening during a call, we also calculate metrics collected from WebRTC stats for deeper insights into call experience (and make them available in the Daily Dashboard). Below are a few things to keep an eye out for when looking at the metrics charts or if building your own.
A high average sending or receiving bitrate can indicate that the participant is receiving more data than their network can handle. We recommend that the overall bandwidth of the call stays under 3 Mbps (three million bits per second). Short spikes are okay, but look for any sustained heavy data; in our Dashboard, this will appear as numbers greater than 3000k in the Metrics charts. Meetings that are heavy on screen sharing can have especially large, lingering bursts of data, and participants' networks may struggle to keep up. Participants on slower connections will benefit from having even less data sent in order to have a consistently smooth call experience.
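The "short spikes are okay, sustained load is not" distinction can be expressed as a simple run check over bitrate samples. The run length, units (kbps, matching the Dashboard's 3000k-style numbers), and function are illustrative choices of ours:

```typescript
// Distinguish a short spike from sustained heavy data: flag only when the
// bitrate stays above the ~3 Mbps guideline for a run of consecutive samples.
// Sample values are in kbps; this helper and its defaults are illustrative.
function sustainedOverBudget(
  kbpsSamples: number[],
  limitKbps = 3000,
  runLength = 5
): boolean {
  let run = 0;
  for (const kbps of kbpsSamples) {
    run = kbps > limitKbps ? run + 1 : 0;
    if (run >= runLength) return true;
  }
  return false;
}

// A brief screen-share spike is tolerated:
sustainedOverBudget([1200, 4200, 1100, 1300, 1250, 1400]); // → false
// Five or more consecutive over-budget samples are not:
sustainedOverBudget([3500, 3600, 4200, 3900, 3700, 3600]); // → true
```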
Take a look at the packet loss metrics to determine if there was high and continuous packet loss for the participant. Sustained continuous packet loss could mean the network was always struggling to send and receive data. A short burst in packet loss could mean that something momentarily went wrong in the network. Calls should be able to fully recover from short bursts of packet loss.
A low value on the outbound-cam-video-XX track for this user could represent the encoder on the user's device struggling to encode video bytes.
Make sure to review our help material guide, which features templates for building out customer guides around best video call experiences. If you see any of the logs or metrics issues that we have outlined above, point your end users to the steps covered in that guide.
You can reach out to us at any time for help understanding your call quality. First, visit our status page to determine if there is a known outage or issue, especially if you are seeing recurring problems not covered here across multiple calls over a short period of time.
What can your team do to anticipate and react to changes, and create better overall app experiences for your users?
Engineers working to integrate Daily can start listening for our
network-quality-change event by adding an event handler on the Daily call frame or call object. This can help your application know when to make adjustments to improve the experience (such as limiting the amount of streaming audio/video data coming from or being received by the user). A temporary reduction in video quality can help the overall experience for those with poor networks or CPU constraints. You can see an example of this event in action by enabling the Network analytics display tab in a Daily Prebuilt room (configurable in the Dashboard).
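A sketch of how an app might react. This assumes the daily-js network-quality event payload carries a threshold of "good", "low", or "very-low"; the handler attachment is shown only as a comment, and reduceVideoQuality is a hypothetical app-level function:

```typescript
// Decide whether to step down outgoing video based on a network-quality
// threshold like the one daily-js reports ("good" | "low" | "very-low").
type NetworkThreshold = "good" | "low" | "very-low";

function shouldReduceVideo(threshold: NetworkThreshold): boolean {
  return threshold === "low" || threshold === "very-low";
}

// Attach sketch (assumes a daily-js call object named `callObject`;
// reduceVideoQuality is a hypothetical app-level function):
// callObject.on("network-quality-change", (ev) => {
//   if (shouldReduceVideo(ev.threshold)) reduceVideoQuality();
// });
```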