Introduction to spatialization

In this guide, we will present some considerations you might face when building spatialization into your video or audio call application with Daily. This guide covers high-level considerations and the most relevant parts of Daily and other web APIs which you'll likely use during development.

This guide does not follow any single sample application. However, if you want to check out a sample implementation which uses some of the concepts covered here, you can head to this demo repository.

What is spatialization?

In the context of this guide, spatialization will refer to manipulating the behavior of a video or audio call in relation to some measure of participants' proximity to each other. This can include toggling video and audio based on participant proximity as well as applying more advanced effects. For example, you can fade video and audio based on distance, or pan audio depending on where the speaker is positioned relative to the listener.

Defining your world space and navigation

The first thing you will want to consider is how you want to define your world space and what your navigation should look like. There are many ways to approach this. We will focus on two examples:

  1. A "gamified", traversable virtual world
  2. Click-based participant grouping

1. Gamified world space

If your design calls for users to walk around a gamified virtual space using their arrow or WASD keys, you'll want to consider how participants will send updated coordinates to each other.

For a client-only approach, you can send coordinates with Daily's sendAppMessage() method and receive them through the app-message event. App messages are backed by a data channel in P2P mode and a WebSocket in SFU mode.
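As a rough sketch of the client-only approach, the snippet below broadcasts the local participant's coordinates with the daily-js sendAppMessage() method and records remote coordinates from the app-message event. The `kind: 'position'` payload shape, the function names, and the `positions` map are illustrative assumptions rather than part of Daily's API, and `call` is assumed to be a Daily call object that has already joined a room.

```javascript
// Broadcast the local participant's position to everyone in the call.
// The { kind, x, y } payload shape is an assumption made for this sketch.
function broadcastPosition(call, x, y) {
  call.sendAppMessage({ kind: 'position', x, y }, '*');
}

// Track the latest known position of each remote participant.
// Returns a Map of session ID -> { x, y } that updates as messages arrive.
function trackRemotePositions(call) {
  const positions = new Map();
  call.on('app-message', (ev) => {
    // Ignore app messages that are not position updates.
    if (!ev || !ev.data || ev.data.kind !== 'position') return;
    positions.set(ev.fromId, { x: ev.data.x, y: ev.data.y });
  });
  return positions;
}
```

To keep message volume down, you would likely send updates only when the position actually changes, or at a capped rate.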

If you'd like more direct control or need to store additional persistent data for your world, you may also consider running a server with your own WebSocket.

Gamified navigation in a virtual world increases the message load between clients, since coordinates are sent back and forth continuously. It also isn't the only way to make use of proximity features in a video or audio call application.

2. Proximity grouping

Another option is to introduce click-based navigation and grouping in your application.

One example that represents proximity grouping is letting users click on a DOM element to join a group. Users are considered to be in close proximity by nature of being assigned the same group ID.

Our spatialization demo uses both approaches within the same application, and includes a code example of proximity grouping. While users are traversing the world, their proximity is calculated from their distance to each other. When they join a dedicated desk zone, they are grouped into proximity by zone ID.
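Group-based proximity can be sketched with a simple lookup. This is a minimal illustration, not the demo's implementation: the `groupBySession` map (session ID to group or zone ID) and the function name are hypothetical.

```javascript
// Given a Map of session ID -> group ID (null/undefined when ungrouped),
// return the session IDs that share a group with the local participant.
function sessionsInSameGroup(groupBySession, localSessionId) {
  const localGroup = groupBySession.get(localSessionId);
  if (localGroup == null) return [];
  const peers = [];
  for (const [sessionId, groupId] of groupBySession) {
    if (sessionId !== localSessionId && groupId === localGroup) {
      peers.push(sessionId);
    }
  }
  return peers;
}
```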

Single room vs ad-hoc breakout rooms

Another aspect to consider is whether you will be using a single room or have users join different rooms as they traverse the space or join different groups. There are pros and cons to both approaches, and many applications will likely lean toward a mix of both.

One such scenario might involve users landing in a common "lobby" room. As they move to different parts of the world or start up conversations with smaller groups, those participants can be redirected to dedicated rooms. New Daily rooms can either be created at runtime or set up in advance via the Daily dashboard and linked from your world.

You might want to design a persistent office space in which there are some pre-made meeting rooms for users to join. At the same time, you may want users to be able to jump into a brand new call on the fly after meeting up in the lobby. In this case, you can use the Daily /rooms REST endpoint to generate a room for them.
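A minimal sketch of creating a room on the fly might look like the following. It assumes the request is made from your own server (the API key should never reach the browser), targets Daily's REST API at https://api.daily.co/v1/rooms, and sets an `exp` property so the room expires. The function names and the one-hour default are illustrative choices.

```javascript
// Build the HTTP request for creating a short-lived room.
// Run this on a server: the API key must never be shipped to browsers.
function buildCreateRoomRequest(apiKey, ttlSeconds, nowMs = Date.now()) {
  return {
    url: 'https://api.daily.co/v1/rooms',
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      // `exp` is a Unix timestamp (seconds) after which the room expires.
      body: JSON.stringify({
        properties: { exp: Math.floor(nowMs / 1000) + ttlSeconds },
      }),
    },
  };
}

// Send the request and return the new room's URL for clients to join.
// `fetchImpl` is injectable for testing; it defaults to the global fetch.
async function createAdHocRoom(apiKey, ttlSeconds = 3600, fetchImpl = fetch) {
  const { url, init } = buildCreateRoomRequest(apiKey, ttlSeconds);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`Room creation failed: ${res.status}`);
  const room = await res.json();
  return room.url;
}
```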

Daily Prebuilt vs call object mode

In some cases, you can use Daily Prebuilt in your proximity video or audio app. If you have users navigating in a space outside of a Daily call and then have them form a group to jump into a separate room, you can simply load up Daily Prebuilt for them on room join.

But for the more advanced and immersive spatialization applications (such as fading user tracks in and out as they navigate in a world), consider using the Daily call object. Call object mode provides more flexibility in how you visualize and control the Daily call.

Proximity-based video and audio

Once you've decided what "proximity" means in your context, you will want to toggle video and audio based on how close participants are to each other.

If you've gone with the gamified world approach where users have coordinates in a virtual space, you can use Euclidean distance to calculate their proximity.
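For a 2D world, the distance check can be sketched as below. The `{ x, y }` position shape, the function names, and the `radius` threshold are assumptions for illustration.

```javascript
// Straight-line (Euclidean) distance between two points in a 2D world.
function euclideanDistance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Two participants are "in proximity" when within `radius` world units.
function isWithinEarshot(a, b, radius) {
  return euclideanDistance(a, b) <= radius;
}
```

For example, participants at (0, 0) and (3, 4) are 5 units apart, so they are in proximity with a radius of 5 but not with a radius of 4.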

Alternatively, if you're taking a group-based-proximity approach, you can simply decide that everyone in a matching group ID is within close enough proximity to see and hear each other.

Proximity in a single-room approach

Depending on how many people you expect to be in a single room, you will likely want to limit, for performance reasons, which participants' video and audio tracks each user is subscribed to. There is little point in subscribing to the tracks of a user who is on the other side of the world until they come closer or join the same proximity group. This is where Daily track subscriptions come in.

This is also applicable to subscribing participants to others' screen share tracks. We recommend only subscribing each user to a maximum of two screen share tracks in parallel within a single world space.

To manage track subscriptions in a single room, take a look at the track subscription parts of the Daily API.
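As a hedged sketch of manual track subscriptions, the snippet below decides which remote participants are close enough and then applies that decision with the daily-js updateParticipant() method and its `setSubscribedTracks` option. It assumes automatic subscriptions were turned off earlier (for example with setSubscribeToTracksAutomatically(false)). The function names, the `remotePositions` map, and the `radius` value are hypothetical.

```javascript
// Decide, per remote session ID, whether we should be subscribed.
// `remotePositions` is a Map of session ID -> { x, y }.
function planSubscriptions(localPos, remotePositions, radius) {
  const plan = new Map();
  for (const [sessionId, pos] of remotePositions) {
    const dist = Math.hypot(pos.x - localPos.x, pos.y - localPos.y);
    plan.set(sessionId, dist <= radius);
  }
  return plan;
}

// Apply the plan to a Daily call object.
// Screen shares are left unsubscribed here; handle them separately.
function applySubscriptions(call, plan) {
  for (const [sessionId, subscribed] of plan) {
    call.updateParticipant(sessionId, {
      setSubscribedTracks: {
        audio: subscribed,
        video: subscribed,
        screenVideo: false,
        screenAudio: false,
      },
    });
  }
}
```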

Proximity in a breakout room approach

If you're using a breakout room approach, you will want to check out the following parts of the Daily API:

  • /rooms POST request to create a new room.
  • /rooms/:name DELETE request to delete a room.
  • join() daily-js method to join a room.
  • leave() daily-js method to leave a room (in case users start out in a lobby room and then hop to a new room).
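Moving a user from a lobby into a breakout room could be sketched as follows, assuming `call` is a Daily call object and `roomUrl` is a room URL supplied by your own backend. The helper name is hypothetical.

```javascript
// Leave the current room (for example, the lobby) and join another one.
async function hopToRoom(call, roomUrl) {
  await call.leave();
  await call.join({ url: roomUrl });
}
```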

Performance considerations

If the design of your world calls for it, it's possible to define the constraints with which video tracks are sent to call participants. The following call object properties are relevant:

  • sendSettings, a call object property allowing you to customize your simulcast layers.
  • userMediaVideoConstraints call object property, which allows you to set track constraints for the sender. Note that this sets constraints for the captured track itself and in turn affects the highest simulcast layer settings. This means you are setting the maximum possible resolution/frame rate that other participants can receive. Therefore, if using lower constraints along with multiple simulcast layers, receivers may experience undesirably low video quality if/when they drop to those lower layers. This is why you may want to customize your layers accordingly, by using the sendSettings property mentioned above.
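To make the interaction between capture constraints and simulcast layers concrete, here is a small arithmetic sketch. `scaleResolutionDownBy` is the standard WebRTC encoding parameter. The capture size, the layer factors, and the helper name are illustrative assumptions, and the exact shape Daily expects for sendSettings should be checked against the Daily reference docs.

```javascript
// Resolution a simulcast layer ends up with, given the captured size
// and the layer's scaleResolutionDownBy factor.
function layerResolution(capture, scaleResolutionDownBy) {
  return {
    width: Math.round(capture.width / scaleResolutionDownBy),
    height: Math.round(capture.height / scaleResolutionDownBy),
  };
}

// Hypothetical values: a low capture resolution and two layer factors.
const captureConstraints = { width: 640, height: 360 };
const layerFactors = { low: 4, high: 1 };

const lowLayer = layerResolution(captureConstraints, layerFactors.low);
const highLayer = layerResolution(captureConstraints, layerFactors.high);
console.log(lowLayer, highLayer);
```

With a 640x360 capture, a layer scaled down by 4 is only 160x90, which may be too low to be useful. Scaling that layer down by 2 instead would give 320x180.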

We recommend looking at our large meetings guide to learn more about simulcast layer control.

Depending on the size of your world space and your expected participant load, you might also want to eventually consider additional optimizations, such as breaking up the world into smaller "zones" and only processing position data for participants who are in the same zone.
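One way to sketch the zone idea is a square grid over the world, as below. The grid layout, the `zoneSize` parameter, and the function names are assumptions for illustration.

```javascript
// Map a world position to a square grid cell ("zone") ID.
function zoneIdFor(pos, zoneSize) {
  const col = Math.floor(pos.x / zoneSize);
  const row = Math.floor(pos.y / zoneSize);
  return `${col}:${row}`;
}

// Keep only the remote participants that share the local zone.
// `remotePositions` is a Map of session ID -> { x, y }.
function peersInLocalZone(localPos, remotePositions, zoneSize) {
  const localZone = zoneIdFor(localPos, zoneSize);
  const peers = [];
  for (const [sessionId, pos] of remotePositions) {
    if (zoneIdFor(pos, zoneSize) === localZone) peers.push(sessionId);
  }
  return peers;
}
```

Note that two participants standing on either side of a zone border can be close together yet fall into different zones, so you may also want to include adjacent zones in the check.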

Please don't hesitate to contact us if you would like some assistance in optimizing your spatialized application.

Spatial audio

In addition to enabling and disabling video and audio tracks based on proximity, you can also manipulate audio effects based on where the listener is in relation to each speaker.

One example of such audio manipulation could be panning. If the speaker is positioned on the local participant's right hand side, you can have their audio favor the local participant's right speaker. You can also vary the volume of the audio based on the speaker's distance from the listener.

This can be achieved by using the Web Audio API in conjunction with Daily. Some nodes of interest include the PannerNode, GainNode, and StereoPannerNode.

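If using a PannerNode, you will also want to configure your AudioListener. The panner's position and the listener's position share one coordinate space, and the spatialization effect is computed from the panner's position relative to the listener.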
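A minimal sketch of positioning the listener and a speaker's PannerNode follows. It assumes a 2D world mapped onto the Web Audio X/Z plane (with the default listener orientation facing negative Z), and `audioCtx` is an AudioContext you have created. The function names and the chosen panning and distance models are illustrative.

```javascript
// Place the listener at a world position on the X/Z plane.
// Some browsers lack the AudioParam position properties and need the
// deprecated setPosition() method instead.
function placeListener(audioCtx, pos) {
  const listener = audioCtx.listener;
  if (listener.positionX) {
    listener.positionX.value = pos.x;
    listener.positionY.value = 0;
    listener.positionZ.value = pos.y;
  } else {
    listener.setPosition(pos.x, 0, pos.y);
  }
}

// Create a panner for one speaker at a world position.
function createSpeakerPanner(audioCtx, pos) {
  const panner = audioCtx.createPanner();
  panner.panningModel = 'HRTF';
  panner.distanceModel = 'inverse';
  panner.positionX.value = pos.x;
  panner.positionY.value = 0;
  panner.positionZ.value = pos.y;
  return panner;
}
```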

If using a GainNode and/or StereoPannerNode, configuring a listener should not be required as they are not spatialized nodes. These nodes allow you to control the volume and pan of the output directly, regardless of position in space. Of course, you can determine the gain and pan values themselves based on participants' positions in your world.
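Deriving gain and pan values from world positions can be sketched with two small helpers. The linear falloff, the `maxDistance` parameter, and the function names are assumptions; other falloff curves may suit your world better.

```javascript
// Clamp a number into [min, max].
function clampTo(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Linear falloff: full volume at distance 0, silent at maxDistance.
function gainForDistance(distance, maxDistance) {
  return clampTo(1 - distance / maxDistance, 0, 1);
}

// Pan in [-1, 1] from horizontal offset: -1 is fully left, 1 fully right.
function panForOffset(listenerX, speakerX, maxDistance) {
  return clampTo((speakerX - listenerX) / maxDistance, -1, 1);
}
```

The results can then be assigned to a GainNode's `gain.value` and a StereoPannerNode's `pan.value`.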

Once your nodes are set up, you can create a MediaStreamAudioSourceNode from a MediaStream containing a Daily participant's audio track.

You would then chain your nodes, connecting the last link to a MediaStreamAudioDestinationNode.
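A GainNode chained between a source and a destination can be sketched as follows. This assumes a browser environment, an existing AudioContext (`audioCtx`), and an `audioTrack` taken from a Daily participant (for example, the participant's `tracks.audio.persistentTrack`). The function name and return shape are illustrative.

```javascript
// Route a participant's audio track through a GainNode.
// Returns the gain node (to adjust volume later) and the output stream,
// which can be assigned to an <audio> element's srcObject for playback.
function buildGainChain(audioCtx, audioTrack, initialGain = 1) {
  const source = audioCtx.createMediaStreamSource(new MediaStream([audioTrack]));
  const gainNode = audioCtx.createGain();
  const destination = audioCtx.createMediaStreamDestination();

  gainNode.gain.value = initialGain;
  source.connect(gainNode);
  gainNode.connect(destination);

  return { gainNode, outputStream: destination.stream };
}
```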

Depending on your target environment, you might encounter some Chromium issues when working with the Web Audio API. One especially relevant example is the lack of acoustic echo cancellation for remote streams, which can require you to implement an RTCPeerConnection loopback approach. Unfortunately, not all nodes seem to be compatible with this workaround.

Conclusion

As you can see, there are many considerations and decisions to make related to building an app with spatialization. In this guide, we went through the main concepts you'll want to look into when creating a spatial video/audio application with Daily. We also pointed out the most relevant parts of the Daily and Web Audio APIs, proximity definition options, and some design considerations to keep in mind.
