Integrating IP cameras with WebRTC-based streaming solutions is a topic that has been covered before in this blog: Interoperating WebRTC and IP cameras. It's been a while since that post, so here we would like to offer a recap of the basic concepts that were treated in the older article, together with a new perspective on the more technical decisions that one has to make when designing an architecture where IP cameras are published as WebRTC endpoints via Kurento Media Server.
The WebRTC transport
The first thing to worry about when trying to integrate IP cameras with a WebRTC application is the compatibility between streams of video and audio. The WebRTC specification is very clear about which video encoding formats (also known as codecs) are supported, and browsers in general adhere to what the standard says: any compliant implementation must be able to support both the VP8 and H.264 video codecs (see RFC 7742, Section 5: Mandatory-to-Implement Video Codec).
So, browsers complying with WebRTC will be able to understand video that has been encoded with either VP8 or H.264. Luckily enough, these two codecs are pretty much de facto standards in the industry, and so most (if not all) IP cameras will already support them (especially H.264, which is the most commonly found codec in video appliances).
That sounds great, doesn't it?
In theory, this could mean that the H.264 video generated by most cameras could be transmitted directly as a WebRTC stream, like this:
Fig. 1: Simplest configuration of an IP-to-WebRTC Media Gateway
That looks great, and actually the resource consumption of such a gateway would be minimal, because it would only be serving packets without any processing at all - only a conversion between transports.
There are, however, some big issues with this simplistic setup: What if not all of the receivers support the same codec? What if the network is unreliable and viewers don't have enough bandwidth to see the video?
WebRTC is not just about sending video as-is from the source. If that were the case, the whole complexity of the standard wouldn't make much sense! We could just as well go ahead and send the RTSP streams directly to the viewers. The point of WebRTC is to serve video in a secure, reliable and effective manner, and this must include responding appropriately when the network connectivity of the viewers is flaky and suffers from real-world issues such as congestion, packet loss, and other kinds of network misbehavior.
For this reason, WebRTC supports a variety of feedback mechanisms that allow any viewer to inform the sender of the video about current network conditions:
Fig. 2: Full Media Gateway configuration, able to perform codec transformations (transcoding) and to react to adverse network conditions
By introducing a decoder + encoder pair (effectively doing what is commonly referred to as transcoding), the gateway is able to solve both the codec compatibility and the network reliability issues. The intermediate encoder is then able to attend to the feedback information and react accordingly:
When there is network congestion, the receiver sends SRTCP control packets back to the gateway, which include REMB messages. These messages contain an estimate of the bandwidth that is actually available for video reception, and the gateway takes it as a hint to adapt its encoding bitrate accordingly.
When some packets are lost on their way, the receiver also sends SRTCP packets that contain a PLI message, asking the gateway to send a new video keyframe, which allows the receiver to recover from the missing video data.
For example, a common scenario is that the network gets congested (i.e. the viewer's reception bandwidth is temporarily reduced), and so the video encoder is instructed (via the REMB messages) to lower the bitrate and generate a lower-quality video in response. This video is then lighter (a lower bitrate means less data to transmit per second), so it can get through the congested network.
The feedback methods that WebRTC brings to the table tend to work pretty well for most typical real-world network mishaps, but they come at a cost: the media gateway now needs to do much more work. This is usually a worthwhile compromise.
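To make the two feedback paths more concrete, here is a minimal Python sketch of how a gateway's encoder might react to them. It is only a toy model: EncoderState, on_remb and on_pli are hypothetical names invented for this illustration, not part of Kurento or of any WebRTC library, and a real encoder would smooth and validate the estimates instead of applying them directly.

```python
from dataclasses import dataclass


@dataclass
class EncoderState:
    """Hypothetical stand-in for the settings of the gateway's video encoder."""
    bitrate_bps: int
    keyframe_requested: bool = False


def on_remb(state: EncoderState, estimated_bps: int) -> None:
    """Take the receiver's REMB bandwidth estimate as the new target bitrate."""
    state.bitrate_bps = estimated_bps


def on_pli(state: EncoderState) -> None:
    """Remember that the next encoded frame should be a keyframe."""
    state.keyframe_requested = True


encoder = EncoderState(bitrate_bps=4_000_000)
on_remb(encoder, 1_500_000)  # congestion: the viewer reports less bandwidth
on_pli(encoder)              # packet loss: the viewer asks for a keyframe
print(encoder)
```

Note that none of this is possible when the gateway only forwards packets: there is no encoder whose bitrate could be changed, and no way to produce a keyframe on demand.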
Using Kurento Media Server
Up until now, we have described the typical scenario and related difficulties for transmitting video from a camera (or any other source of video) to a WebRTC consumer, such as a web browser. The stated problems are common to all solutions, and any gateway will have to deal with them in one way or another.
Kurento Media Server offers a comprehensive solution that covers all of the described points. Acquiring a video stream from a variety of sources, together with the optional transcoding of the media, is handled by the PlayerEndpoint. Then everything related to the WebRTC communications is handled by the aptly named WebRtcEndpoint. With just these two components, you will have covered all the important details discussed in the previous section, and you will effectively have a fully working WebRTC Media Gateway for IP cameras:
Fig. 3: Implementation of an IP-to-WebRTC Media Gateway with Kurento Media Server
The key element in this graphic is the Agnostic transcoding, which in Kurento is performed by a component called agnosticbin. This component encapsulates the transcoding operation, which as described earlier is where SRTCP packets are handled, providing support for bitrate adjustments (in the case of REMB messages) or re-generating video keyframes (in the case of PLI messages).
It is also in the agnosticbin element where the codec of the video will be selected to better match the requirements of the receivers. So, if the original video was encoded as H.264, but the receivers only happened to support VP8, the video would be transformed into VP8 at this point.
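The codec decision described here can be sketched in a few lines of Python. This only models the compatibility check, not how agnosticbin is actually implemented; the function name pick_output_codec and the codec strings are assumptions made for the example.

```python
def pick_output_codec(source_codec: str, receiver_codecs: set[str]) -> tuple[str, bool]:
    """Return (codec to send, whether a codec conversion is needed)."""
    if not receiver_codecs:
        raise ValueError("the receiver must support at least one codec")
    if source_codec in receiver_codecs:
        # The receiver understands the camera's codec: no conversion needed.
        return source_codec, False
    # Otherwise pick one of the receiver's codecs (alphabetically, so that
    # the choice is deterministic) and convert the video into it.
    return sorted(receiver_codecs)[0], True


print(pick_output_codec("H264", {"VP8"}))          # ('VP8', True)
print(pick_output_codec("H264", {"VP8", "H264"}))  # ('H264', False)
```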
Another useful feature, this time from the WebRtcEndpoint, is that we can adjust the maximum and minimum video bitrate used for sending; the bandwidth estimation hints received through REMB messages will then be constrained to the range established for our application.
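Constraining the estimations to a range is a plain clamp operation. The following Python sketch shows the idea; constrain_remb is a hypothetical helper written for this post, and the bitrate values are arbitrary examples, not Kurento defaults.

```python
def constrain_remb(estimate_bps: int, min_bps: int, max_bps: int) -> int:
    """Clamp a REMB bandwidth estimate to the [min_bps, max_bps] range."""
    if min_bps > max_bps:
        raise ValueError("min_bps must not be greater than max_bps")
    return max(min_bps, min(estimate_bps, max_bps))


# Example range chosen by the application: 300 kbps to 2 Mbps.
print(constrain_remb(5_000_000, 300_000, 2_000_000))  # 2000000
print(constrain_remb(100_000, 300_000, 2_000_000))    # 300000
print(constrain_remb(800_000, 300_000, 2_000_000))    # 800000
```

The upper limit keeps an optimistic estimation from pushing the encoder beyond what the application wants to spend, while the lower limit prevents the quality from dropping below a usable level.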
Agnosticbin encoding tree
When implementing our application, we'll probably want to have multiple viewers for the same single stream. Does that mean that each and every viewer needs to have their own transcoding process?
That is a fair question, and the answer is: not quite. Having one transcoding process for each viewer would really be the ideal solution, as it would allow each receiver to have the bitrate adjusted according to its own needs. However, in real-world deployments this solution is overkill: it rapidly becomes a resource hog and does not scale to a large number of viewers.
Kurento implements a compromise solution, where the transcoding process is performed once per codec type, and then the resulting video is distributed among all consumers that required each given codec. The agnosticbin element inside Kurento is again the component in charge of building an encoding tree that covers codec requirements for all consumers, without doing the same work twice:
Fig. 4: Agnosticbin encoding tree for 2 VP8 and 2 H.264 consumers
The net effect of this is that your server won't use more CPU than really needed. Encoding video is quite an expensive task in terms of CPU usage, but once it's done, each additional WebRtcEndpoint consumer adds a negligible amount of extra resource usage.
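The "once per codec" idea can be illustrated with a small Python sketch that groups consumers by the codec they need. This is a simplified model of the encoding tree, not agnosticbin's real code; build_encoding_tree and the viewer names are made up for the example.

```python
from collections import defaultdict


def build_encoding_tree(consumers: dict[str, str]) -> dict[str, list[str]]:
    """Group consumer names by required codec: one branch per codec."""
    tree: dict[str, list[str]] = defaultdict(list)
    for name, codec in consumers.items():
        tree[codec].append(name)
    return dict(tree)


consumers = {
    "viewer1": "VP8",
    "viewer2": "VP8",
    "viewer3": "H264",
    "viewer4": "H264",
}
tree = build_encoding_tree(consumers)
print(tree)
# {'VP8': ['viewer1', 'viewer2'], 'H264': ['viewer3', 'viewer4']}
print(len(tree), "branches for", len(consumers), "consumers")
# 2 branches for 4 consumers
```

The expensive work grows with the number of branches (distinct codecs), not with the number of consumers attached to each branch.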
When writing our WebRTC application, we tend to have a lot of questions and wonder what the actual effect is of the different choices we can make:
Is there any actual difference, performance wise, between using VP8 or H.264 to serve video for our WebRTC consumers?
How much does it actually cost to perform transcoding in the WebRTC Gateway?
Does it matter if the IP camera provides an extremely high quality, if after all it is going to be re-encoded anyway?
In this section we'll compare several possible configurations, and show how different choices can lead to different compromises, in an attempt to shed some light on some of these questions.
All the plotted graphs shown in this section have been generated in a test system with the following characteristics:
CPU: Intel Core i5-5675C @ 3.10GHz (4 cores)
RAM: 16 GB DDR3
Network: 100 Mbps LAN
Lastly, on the software side, the following tests have these characteristics:
Tests that include simulating adverse network conditions use the slow-network helper script.
All plots show a span of 4 minutes with 4 RTSP streams. During the test duration, a new IP Camera stream to WebRTC pipeline was added to the system, once per minute, for a total of 4 RTSP streams and 4 WebRTC consumers at the end of the test. This explains why all the CPU / Memory usage graphs resemble a stair with 4 "steps", as each one of the steps corresponds to the addition of a new RTSP source and WebRTC consumer.
All tests have a maximum bitrate set on the WebRtcEndpoint, to constrain the received REMB bandwidth estimations inside a controlled range.
Transcoding vs direct encoded media
We have mentioned that Kurento has the agnosticbin element in charge of providing video codec interoperability between RTSP sources and WebRTC consumers. However, the transcoding process is optional inside the PlayerEndpoint. This means that it is entirely possible to reproduce the simple architecture of Figure 1 with KMS, just by using the PlayerEndpoint's builder parameter useEncodedMedia. When using this parameter, the agnostic element inside PlayerEndpoint is skipped altogether:
Fig. 5: Using Kurento's PlayerEndpoint with "useEncodedMedia"
The effect of this is that, as was shown in Figure 1, media is transmitted directly from the source to the WebRTC transport. Of course, this has the problem of not being able to use SRTCP feedback messages to adapt to adverse conditions, but it allows for a quick and simple gateway between an IP camera and WebRTC consumers.
The resource consumption of this configuration is very small:
Encoded media is the best option if good network conditions can be somehow guaranteed by external means (e.g. if the streaming is going to happen through a private Ethernet local network, or an intranet with QoS routing middleware). This mode is equivalent to just disabling all QoS related features of WebRTC, and instead just relying on external means. Of course this provides the best experience given that CPU and memory usage is reduced to the minimum. However, as we have already made clear, if any kind of network mishap happens the reception of video streams will suffer from any of several issues, such as: frozen frames, visual glitches caused by lost or corrupted frames, dropped connections, etc.
VP8 vs. H.264
There are different angles from which we can compare these two choices for WebRTC video encoding. The first is CPU and memory usage. Do both codecs use the same amount of CPU? We'll see this with a couple of graphs:
VP8 tends to use twice as much memory as H.264, and slightly more CPU too. Also, the CPU used by VP8 seems to grow slightly with higher bitrates, while the CPU usage of H.264 doesn't seem to be affected by the output bitrate.
REMB bandwidth estimations from Chrome
However, resource usage is not the only criterion by which we should decide whether to prefer one codec over the other. Let's see what happens from the client's point of view and analyze the content of REMB messages that Chrome sends to the media server, depending on the choice of codec:
WebRTC VP8 REMB bitrate rising towards maximum allowed (20 Mbps)
WebRTC H.264 REMB bitrate rising towards maximum allowed (20 Mbps)
VP8 is much more aggressive and is able to reach the maximum available bitrate of 20 Mbps. This aggressiveness means that it also has more chances of overshooting and surpassing the actual available bandwidth when there is network congestion, as we will see next. But undoubtedly, the benefit is that VP8 is able to provide better quality very quickly.
H.264 has a much slower reaction time, and it only gets to approx. 2.5 Mbps during the 4-minute test, even though the maximum bandwidth available was 20 Mbps. This means that, when using H.264, Chrome takes its time to ask for higher video qualities, the bitrate grows at a slower pace, and it takes much more time to reach the maximum level of quality. On the flip side, its slowness means a steadier change when network congestion happens, which is shown in the next graphs.
REMB bandwidth estimations from Chrome with network congestion
When network congestion is happening, the bandwidth available for the consumer to receive video gets smaller than usual, and the server should try to limit the encoding bitrate. These next graphs show the effect of simulating a maximum bandwidth of 5 Mbps for VP8 and 2.5 Mbps for H.264:
WebRTC VP8 bitrate rising towards congestion (5 Mbps)
WebRTC H.264 bitrate rising towards congestion (2.5 Mbps)
Here we see a clear example of the overshoot that was mentioned earlier: VP8 frantically tries to adjust to the actual available bandwidth, but in its attempts it goes higher than what is actually available, and in turn the REMB estimations make big drops in order to compensate. Then, the fast rise starts again.
On the other hand, H.264 is slow and steady, and we can even see slight drops in the available bitrate each time a new consumer is added to the session. The first REMB correction, after adding the second viewer, is quite large, but from then on each additional viewer affects the available per-stream bitrate by only a small amount. These measurements are per stream: the actual available bandwidth must be shared between all 4 consumers, so the bitrate per consumer gets lower and lower.
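A bit of arithmetic helps to see why the first correction is the largest one. The Python sketch below assumes an idealized, perfectly even split of the simulated 2.5 Mbps among all consumers; real REMB estimations will not follow this exactly, and fair_share_mbps is just a hypothetical helper for the calculation.

```python
def fair_share_mbps(link_mbps: float, consumers: int) -> float:
    """Bandwidth per consumer if the link is split evenly among all of them."""
    if consumers < 1:
        raise ValueError("there must be at least one consumer")
    return link_mbps / consumers


LINK_MBPS = 2.5  # simulated bandwidth limit used in the H.264 test
shares = [fair_share_mbps(LINK_MBPS, n) for n in range(1, 5)]
for n, share in enumerate(shares, start=1):
    print(f"{n} consumer(s): {share:.3f} Mbps each")
# 1 consumer(s): 2.500 Mbps each
# 2 consumer(s): 1.250 Mbps each
# 3 consumer(s): 0.833 Mbps each
# 4 consumer(s): 0.625 Mbps each
```

Under this assumption, going from one to two consumers halves the per-stream share (a drop of 1.25 Mbps), while the third and fourth consumers only take away about 0.42 and 0.21 Mbps more, respectively.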
High or low quality in the IP camera video
Should I configure my cameras to emit the highest possible quality?
This is a fair question. Given that the video is later going to be re-encoded to adapt to the network bandwidth, it seems that it might be better to have as much quality as possible in the original source.
This choice is again a matter of compromises. A higher quality video sent from the IP camera to Kurento will mean higher CPU and memory requirements to decode the stream:
In these tests, the original streams from the RTSP IP cameras were configured first as high quality 1080p, and later as medium quality 720p. The precise source video settings that were configured in the camera were these for the 1080p video:
H.264 High Profile
4000 kbps VBR @ 30 fps
Similarly, the 720p video was configured as follows:
H.264 Baseline Profile
2000 kbps VBR @ 30 fps
As you can see, regardless of our choice of final video codec, the original format of the source material has an impact on resource usage when transcoding the video for WebRTC.
Ideally, we should make a decision based on the lowest acceptable visual quality that our users should see under the best network conditions, because as the REMB bitrate rises, what they see on their screens will get closer and closer to the original source material generated by the camera.
Single camera, multiple viewers
As we mentioned earlier, the agnosticbin component is clever enough to avoid encoding the same video more times than needed, and instead it just builds an encoding "tree" that is able to provide a single encoded stream to multiple consumers.
The effect of this design choice is that CPU and memory usage are "paid" only once per codec, and subsequent consumers add a negligible amount of resource usage to the machine:
This graphic shows the evolution of CPU and memory usage across a test that lasted 12 minutes, where every minute a new WebRTC consumer was attached to the same RTSP source camera, for a total of 12 consumers. As you can see, the average use of machine resources is completely driven by the single transcoding that is initially done for the reception of the RTSP video.
Transcoding is almost a requirement in scenarios where a multitude of devices and video profiles will be involved, and it is also what provides the basic support for the bitrate adaptation needed to fight network congestion. But sometimes it's a bit difficult to know which are the best parameters for our specific case.
We hope this article helps shed some light on this topic, and that you can use Kurento to create your dream product.