Build audio-only experiences with the Daily API
The Daily call object can be used to build voice-first and voice-only applications. This guide covers how to:
- Set up an audio-only app with Daily
- Get to know browser-imposed audio constraints
- Apply best practices to optimize sound quality
- Implement common audio-only app feature requests
- Analyze performance issues
All Daily audio-only experiences are built on two call object properties:
- videoSource: false turns off cameras.
- subscribeToTracksAutomatically: false turns on manual track subscriptions.
For audio-only applications, the camera stream needs to be disabled. To do this, set the videoSource call object property to false when setting up the call.
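A minimal sketch of this setup, assuming the daily-js factory method createCallObject() accepts videoSource alongside other call object properties. The Daily argument stands for the default export of @daily-co/daily-js and is passed in so the example carries no hard import; createAudioOnlyCall and the room URL in the usage comment are hypothetical.

```javascript
// Hypothetical helper. `Daily` is the default export of @daily-co/daily-js,
// supplied by the caller (for example: import Daily from '@daily-co/daily-js').
function createAudioOnlyCall(Daily, extraProperties = {}) {
  return Daily.createCallObject({
    ...extraProperties,
    // Listed last so it always wins: never request or publish a camera track.
    videoSource: false,
  });
}

// Usage (the room URL is a placeholder):
// const call = createAudioOnlyCall(Daily);
// await call.join({ url: 'https://your-domain.daily.co/your-room' });
```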
Daily calls operate on a publish/subscribe model: participants publish their own media tracks, and subscribe to other participants’ tracks. By default, Daily abstracts away the complexity of track management, and subscribes every participant to every other participant's tracks.
For audio-only applications, however, we recommend turning off that default behavior. As more participants join, subscribing a participant to all audio tracks could overwhelm the server routing all the tracks (the Selective Forwarding Unit, or SFU). The risk of unwanted background noise also increases if every participant automatically subscribes to every other participant’s audio.
Manual track subscriptions make it possible to selectively subscribe, unsubscribe, and "stage" audio tracks.
To set them up (and turn off the Daily default behavior), set the subscribeToTracksAutomatically call object property to false. This can be done through join(), or the setSubscribeToTracksAutomatically() method. Passing the property to join() applies the setting once, when the call is initialized.
setSubscribeToTracksAutomatically(), conversely, can be used throughout the call to dynamically update your settings. The latter is a good option if you want to conditionally use manual track subscriptions, such as waiting until there's a certain number of participants in the call.
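The conditional case could look something like the sketch below. It assumes call is a Daily call object that has already joined a room; the helper name and the threshold of 10 are hypothetical, not Daily recommendations.

```javascript
// Hypothetical cutoff for switching to manual track subscriptions.
const PARTICIPANT_THRESHOLD = 10;

function switchToManualSubscriptionsWhenBusy(call, threshold = PARTICIPANT_THRESHOLD) {
  let manual = false;

  const check = () => {
    // participants() returns an object keyed by 'local' and remote session IDs.
    const count = Object.keys(call.participants()).length;
    if (!manual && count >= threshold) {
      call.setSubscribeToTracksAutomatically(false);
      manual = true; // only switch once
    }
  };

  call.on('participant-joined', check);
  check(); // the room may already be busy when this runs
}
```

Once manual mode is on, the app is responsible for subscribing to the tracks it wants to hear.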
There are many ways to manage track subscriptions directly. For example, Daily Prebuilt supports larger calls by subscribing each participant to at most eight other participants' tracks. To do this, it keeps a queue of the most recent speakers that updates whenever the active-speaker-change event fires. After each active-speaker-change event, the participant ID (peerId) of the new active speaker is pushed to the top of the queue.
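The queue described above might be sketched as follows. This is not Daily Prebuilt's actual code: the helper names are hypothetical, and it assumes the event payload exposes activeSpeaker.peerId and that updateParticipant() accepts a setSubscribedTracks update.

```javascript
const MAX_SUBSCRIPTIONS = 8;

// Pure helper: move peerId to the front, drop duplicates, cap the length.
function pushRecentSpeaker(queue, peerId, max = MAX_SUBSCRIPTIONS) {
  return [peerId, ...queue.filter((id) => id !== peerId)].slice(0, max);
}

// Assumes `call` is a Daily call object using manual track subscriptions.
function subscribeToRecentSpeakers(call, max = MAX_SUBSCRIPTIONS) {
  let queue = [];

  call.on('active-speaker-change', (event) => {
    const peerId = event.activeSpeaker.peerId;
    // The local participant never needs to subscribe to their own audio.
    if (peerId === call.participants().local.session_id) return;

    const previous = queue;
    queue = pushRecentSpeaker(previous, peerId, max);

    // Unsubscribe from anyone who fell off the end of the queue.
    for (const id of previous) {
      if (!queue.includes(id)) {
        call.updateParticipant(id, { setSubscribedTracks: { audio: false } });
      }
    }
    // Subscribe to the new speaker if they were not already in the queue.
    if (!previous.includes(peerId)) {
      call.updateParticipant(peerId, { setSubscribedTracks: { audio: true } });
    }
  });

  return () => queue; // read access to the current queue
}
```

The sketch does not prune participants who leave the call; a fuller version would also listen for participant-left.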
Up to 100,000 participants can join a Daily audio-only room, with up to 25 microphones on at a time.
Daily is built on top of the WebRTC protocol that enables real-time communication between browsers. Before two clients can exchange data like audio tracks, they need to agree on a codec to compress and then decompress the media.
Daily calls use the Opus codec, which is the standard audio codec for WebRTC calls and is supported by all modern browsers. We use the recommended settings for 'Full-band Speech', so one audio stream typically consumes around 40 kbps.
Browsers apply their own echo cancellation, noise reduction and automatic gain control, but it doesn’t take many unmuted mics for background noise to start adding up. For the best participant experience, stick to no more than ten active mics at a time.
If <audio> tags are tied to visual elements, like an avatar that represents the participant, the audio stream will only be playable while the visual element is on the screen. This can lead to unintended user experiences. For example, scrolling out of view of the current speaker to see who else is listed on the call could stop audio entirely.
Render <audio> elements separately from visual elements to maintain consistent access to audio no matter what is currently displayed on a screen.
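One way to keep audio independent of the UI is to create the audio elements directly from Daily track events, as in the sketch below. It assumes the track-started and track-stopped events carry a track and (for track-started) a participant. The helper name, the container, and the deps parameter are hypothetical; deps exists only so the sketch can be exercised outside a browser.

```javascript
// `container` should be an element that stays mounted for the whole call,
// for example a hidden <div> near the document root.
function playAudioOutsideTheUi(call, container, deps = {}) {
  const doc = deps.doc || document;
  const StreamCtor = deps.StreamCtor || MediaStream;
  const elements = new Map(); // track id -> <audio> element

  call.on('track-started', (event) => {
    const isLocal = event.participant && event.participant.local;
    if (event.track.kind !== 'audio' || isLocal) return;

    const audio = doc.createElement('audio');
    audio.autoplay = true;
    audio.srcObject = new StreamCtor([event.track]);
    container.appendChild(audio);
    elements.set(event.track.id, audio);
  });

  call.on('track-stopped', (event) => {
    const audio = elements.get(event.track.id);
    if (!audio) return;
    audio.remove();
    elements.delete(event.track.id);
  });

  return elements;
}
```

Because the elements live in their own container, scrolling or unmounting a participant's avatar has no effect on playback.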
An AudioWorklet executes custom audio processing scripts in separate threads for low latency audio processing. This is useful for offloading expensive real-time audio related tasks. For example, Daily Prebuilt uses an AudioWorklet to detect microphone audio.
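As an illustration of the kind of work an AudioWorklet can take off the main thread, here is a minimal sketch of a processor that reports the microphone's RMS level. It is not Daily Prebuilt's implementation; the processor name mic-level and the message shape are hypothetical. The code is meant to be loaded as a worklet module with audioContext.audioWorklet.addModule().

```javascript
// Root-mean-square level of one block of samples (0 for silence).
function rmsLevel(samples) {
  if (samples.length === 0) return 0;
  let sumOfSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumOfSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumOfSquares / samples.length);
}

// AudioWorkletProcessor only exists inside an AudioWorkletGlobalScope,
// so the registration is guarded to keep the helper usable elsewhere.
if (typeof AudioWorkletProcessor !== 'undefined') {
  class MicLevelProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0][0]; // first channel of the first input
      if (channel) {
        // Posts once per render block; a real app would likely throttle this.
        this.port.postMessage({ level: rmsLevel(channel) });
      }
      return true; // keep the processor alive
    }
  }
  registerProcessor('mic-level', MicLevelProcessor);
}
```

On the main thread, an AudioWorkletNode created with the same name receives the levels through its port and can compare them against a threshold to decide whether the microphone is picking up sound.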
While none of the following features are required to build a successful audio-only application, they come up frequently enough in conversation that we wanted to share pseudocode.
Adding a visual indicator when a participant starts talking can help sighted attendees follow a conversation.
Listening for the Daily active-speaker-change event is the best way to update an app’s UI to reflect when the current speaker has changed. Under the hood, the Daily API detects whose audio input is currently the loudest to identify the active participant, and emits active-speaker-change when that value changes.
The React Party Line source code listens for active-speaker-change to update the activeSpeakerId stored in local state. An isActive boolean on the participant’s <Avatar /> component is set to true when their ID matches the activeSpeakerId.
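The same idea can be sketched without a framework. The helper below is hypothetical and is not the Party Line code; it assumes the event payload exposes activeSpeaker.peerId. In React, the onChange callback would typically be a state setter.

```javascript
// Tracks the current active speaker and answers "is this participant active?".
function trackActiveSpeaker(call, onChange = () => {}) {
  let activeSpeakerId = null;

  call.on('active-speaker-change', (event) => {
    activeSpeakerId = event.activeSpeaker.peerId;
    onChange(activeSpeakerId); // e.g. a React state setter
  });

  return {
    getActiveSpeakerId: () => activeSpeakerId,
    // Equivalent of the isActive flag passed to an avatar component.
    isActive: (sessionId) => activeSpeakerId !== null && sessionId === activeSpeakerId,
  };
}
```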
There are many other ways to keep track of and visualize speaker history. If you have ideas or need help, please reach out.
For situations with a few keynote speakers, moderated community forums, and other use cases, audio-only applications often need to give different participants different permissions.
In the Party Line demo app, for example, there are moderators, speakers, and listeners. Only moderators and speakers can unmute. Moderators have the ability to promote listeners to speakers (or moderators), mute speakers, and demote speakers to listeners.
Moderators, the participants who have access to all room privileges, are identified with meeting tokens that have the is_owner property set to true. A POST request to the Daily /meeting-tokens endpoint generates a meeting token.
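A server-side sketch of that request, assuming the REST API base URL https://api.daily.co/v1, bearer-token authentication with a Daily API key, and a JSON response containing a token field. The function names are hypothetical, and fetchImpl is only a seam for testing.

```javascript
// Builds the request for an owner (moderator) meeting token.
function buildOwnerTokenRequest(apiKey, roomName) {
  return {
    url: 'https://api.daily.co/v1/meeting-tokens',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        properties: { room_name: roomName, is_owner: true },
      }),
    },
  };
}

// Run this on a server only: the API key must never reach the browser.
async function createOwnerToken(apiKey, roomName, fetchImpl = fetch) {
  const { url, options } = buildOwnerTokenRequest(apiKey, roomName);
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Token request failed with status ${response.status}`);
  }
  const { token } = await response.json();
  return token;
}
```

The returned token is then passed to the client, which supplies it when joining the room.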
In a production environment, we recommend setting up custom endpoints that manage participant roles in a database. Then, the Daily sendAppMessage() method can be used to notify all participants when a change has been made and a new participant roster should be pulled.
To keep code client-side for demo purposes, Party Line takes a different approach. It uses sendAppMessage() to send state updates to participants, and then listens for the app-message event and handles the change client-side. We recommend handling these changes server-side for production-level apps, however.
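The production pattern (store roles server-side, then tell clients to refresh) could be wired up roughly like this. The message type and both helper names are hypothetical; the sketch assumes sendAppMessage() takes a data object and a recipient, with '*' meaning everyone else, and that app-message events carry data and fromId.

```javascript
// Hypothetical message type used to signal "the roster changed, refetch it".
const ROSTER_CHANGED = 'roster-changed';

// Call after a role change has been saved through your own endpoint.
function notifyRosterChanged(call) {
  call.sendAppMessage({ type: ROSTER_CHANGED }, '*');
}

// `refetchRoster` is your own function that pulls the roster from your server.
function onRosterChanged(call, refetchRoster) {
  call.on('app-message', (event) => {
    if (event.data && event.data.type === ROSTER_CHANGED) {
      refetchRoster(event.fromId);
    }
  });
}
```

The sender does not receive its own app message, so the client that made the change should refresh its roster directly.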
In audio-only calls, participants often speak over each other by accident or need a way to request a turn. A "raise your hand" feature helps: it lets a participant signal that they have something to say without disrupting the conversation.
There are many ways to build this. You could use the app-message event to let one or more other participants know you'd like to speak. The Party Line app instead uses setUserName() to update a participant's username when they want to speak, since the UI shows the hand emoji as part of that username.
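A sketch of the username approach, assuming the raised hand is stored as an emoji suffix on the name. The suffix format and helper names are hypothetical and may differ from what Party Line does; the sketch also assumes the local participant's name is available as participants().local.user_name.

```javascript
// Hypothetical convention: a raised hand is a trailing " ✋" on the username.
const HAND_SUFFIX = ' ✋';

function isHandRaised(userName) {
  return userName.endsWith(HAND_SUFFIX);
}

function toggleHandInName(userName) {
  return isHandRaised(userName)
    ? userName.slice(0, -HAND_SUFFIX.length)
    : userName + HAND_SUFFIX;
}

// Reads the local participant's current name and flips the hand suffix.
function toggleRaisedHand(call) {
  const current = call.participants().local.user_name || '';
  return call.setUserName(toggleHandInName(current));
}
```

Other clients can then call isHandRaised() on each participant's name to decide whether to highlight them in the UI.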
Audio-only calls require relatively low bandwidth and CPU power compared to video calls. If testing reveals high CPU usage or stale WebSocket connections, check whether something other than the call is causing the strain. For example, we've found complex CSS animations can have a noticeable impact.