Video calling for billions without internet
Reliable video calling and screen sharing over a cellular telephone call
The views and opinions expressed in this blog are purely those of my own and do not reflect the official positions of my current or past employers.
This past weekend, I came across a news article reporting that millions of students in India are stuck at home with no access to either the internet or online education. In fact, over half the world’s population still does not have an internet connection. While this digital divide has manifested itself in the U.S., especially with regard to children’s education during the Coronavirus pandemic lockdown, the problem is far worse in Asia and Africa, where fewer than 1 in 5 people are connected to the internet.
As a result of lockdowns being implemented globally, many adults and children are excluded from online education and telehealth.
Despite the 1 in 5 statistic mentioned above, the problem is actually worse. Even for those who have access to the internet, the price is high and the bandwidth is limited. For instance, when I talk to my parents in India, they frequently run out of their allocated 4 GB well before the allowance period ends, after which the bandwidth gets throttled. Stalled frames, choppy audio, painful delays, eventual disconnections, and subsequent retries are a normal occurrence, but the calls are still arguably much better than normal telephone conversations because I get to "see" them.
I understand when Andrew Stuart explains that video calling is better for people's mental health¹. But video calls typically require about 2 Mbps up and 2 Mbps down, as in the case of Zoom, a privilege many cannot afford.
To address this problem, I propose a new approach based on the insight that if we are willing to give up some realism in the rendering of faces and screens, then there is a whole new world of face and screen representations that can be derived for ultra-low bandwidth with an acceptable quality of experience.
This article explores such representations and methods that reduce the needed bandwidth from the normal 2 Mbps to as low as 1.5 Kbps, allowing video to be encoded along with telephone audio with minimal degradation in audio quality. The proposed solution can be implemented primarily in software, needing no change to the underlying infrastructure. This would in turn be cheaper, and would give access to people who are currently marginalized because they cannot afford an internet connection.
Cellular Telephone vs Internet
You can skip this section if you are convinced that cellular telephone calls are more reliable, faster, and better than VoIP calls.
From anecdotal experience, a cellular telephone call is often more reliable than a video call on many attributes, such as delay, dropped calls, ease of use, consistency of call quality, and long-distance options.
Video-call providers such as Google Meet, Discord, GoToMeeting, Amazon Chime, and potentially Facebook use a protocol called WebRTC that enables real-time communication between two or more parties. WebRTC relies on an internet protocol called User Datagram Protocol (UDP), which offers speedy delivery of packets but provides weak guarantees: some packets either do not arrive or arrive out of order, resulting in lost frames and jitter (a variation in the latency of a packet flow between peers)². Newer machine learning techniques, such as WaveNetEQ³ in Google Duo, improve the audio quality; however, video quality variations remain unsolved.
On the other hand, while cellular telephone calls are also increasingly packet-based, engineers have found ways of making packet-switched cellphone networks increasingly efficient by enforcing stringent quality standards and clever bandwidth management. Another reason why cellular calls work well is that the bandwidth needed for audio is much lower than that needed for video. Further, the encoding standards have been developed over many decades, allowing the human brain to fill in gaps. I will eventually argue that many of the goals of internet video calling, such as two-way video calling, multi-party video broadcasts and live events, screen sharing, and A/B testing of Quality of Experience (QoE)⁷, are also possible over cellular telephone calls. However, in the rest of this article, we shall focus only on two-way video calls to set the stage.
Architecture: Video Over Telephone
The primary goal in an educational video is to obtain the minimum video representation that prioritizes information transfer (e.g., whiteboard, sketches), followed by human actions (e.g., pointing, writing) and expressions (e.g., lips, eyebrows), while excluding distracting elements such as the background and non-expressive elements of a face, as well as annoying video call artifacts such as freeze frames and variable frame rates. Another architectural goal is to prioritize a reliable frame rate with low latency and low jitter (smooth and consistent), as well as high audio quality.
In any telephone call, the human tolerance for delay is at most 200 ms. For telephone calls, not every packet matters: the encoding schemes are so sophisticated that even if some packets are lost, the caller can still be heard. These schemes can handle multiple rates of transmission and adapt to the available bandwidth. As a result, the Adaptive Multi-Rate (AMR) codec is popular. The AMR codec uses eight source codecs with bit rates of 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15 and 4.75 kbps. For our analysis, we picked 7.95 kbps as our worst case, of which 4.75 kbps is allocated to audio and the remaining 3.2 kbps to two-way video. Assuming the video budget is split evenly between the two directions (1.6 kbps each), a frame rate of 24 frames per second, and a video compression ratio of 100 (H.264 lossy video compression can be as high as 200), we get a back-of-envelope target of approximately 1 KB/frame uncompressed. If we can meet this worst-case design target, we can surely do better when more bandwidth is available.
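The back-of-envelope budget above can be reproduced with a few lines of arithmetic. This is only a sketch: the even split of the 3.2 kbps video budget between the two call directions is an assumption, and the function name `frame_budget` is made up for illustration.

```python
# Back-of-envelope frame budget. The inputs are the figures quoted above;
# the even split of the video budget between directions is an assumption.
TOTAL_KBPS = 7.95          # AMR mode picked as the worst case
AUDIO_KBPS = 4.75          # lowest AMR mode, kept for audio
FPS = 24                   # frames per second
COMPRESSION_RATIO = 100    # assumed video compression ratio


def frame_budget(total_kbps, audio_kbps, fps, ratio):
    """Return (compressed, uncompressed) bytes per frame for one direction."""
    video_kbps = total_kbps - audio_kbps           # shared by both directions
    one_way_bits_per_s = video_kbps * 1000 / 2     # assumed even split
    compressed = one_way_bits_per_s / 8 / fps      # bytes per encoded frame
    return compressed, compressed * ratio


compressed_bytes, uncompressed_bytes = frame_budget(
    TOTAL_KBPS, AUDIO_KBPS, FPS, COMPRESSION_RATIO
)
print(f"compressed:   {compressed_bytes:.1f} bytes/frame")
print(f"uncompressed: {uncompressed_bytes:.0f} bytes/frame")
```

Under these assumptions each encoded frame gets roughly 8 bytes on the wire, which corresponds to a little under 1 KB per frame before compression.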
Video Capture
The first step is to capture video. For my prototype, I captured video at a frame size of 640×480 at 24 frames per second, as this frame rate is known to be good enough for aesthetic, film-like motion characteristics.
Near-realistic Video Representations
I experimented with different video formats and decided to pursue the bi-level image format (each pixel is either black or white), using Simple Global Thresholding to obtain the video below. Other usable bi-level thresholding techniques include Adaptive Mean Thresholding and Adaptive Gaussian Thresholding; the latter was less noisy, but I preferred to defer advanced video processing until after segmentation because I thought it best to prioritize processing for certain regions (e.g., face, whiteboard) instead of using global approaches. If neural computation is not possible on the sender’s side, then a simple global threshold might work.
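Simple global thresholding can be sketched in a few lines. This is a minimal pure-Python illustration, not the prototype's code: the cutoff of 128 and the helper name `global_threshold` are assumptions, and a real pipeline would more likely call a library routine such as OpenCV's `cv2.threshold`.

```python
def global_threshold(gray, threshold=128):
    """Map a grayscale frame (rows of 0-255 ints) to a bi-level frame.

    A pixel becomes 1 (white) if it is strictly brighter than the
    threshold, otherwise 0 (black). The default of 128 is an assumption.
    """
    return [[1 if px > threshold else 0 for px in row] for row in gray]


# Tiny made-up 2x4 "frame" used only to exercise the function.
frame = [
    [12, 40, 200, 255],
    [90, 130, 127, 128],
]
bilevel = global_threshold(frame)
print(bilevel)
```

A single fixed cutoff is cheap enough to run on a low-end sender, which is why it is a reasonable fallback when neural processing is unavailable.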
I also experimented with the ASCII rendering format as shown in the video below.
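One common way to produce an ASCII rendering is to map pixel brightness onto a ramp of characters of increasing visual density. The sketch below assumes a grayscale frame already downsampled to the desired character grid; the ramp string and the name `to_ascii` are illustrative choices, not the ones used for the video.

```python
ASCII_RAMP = " .:-=+*#%@"  # dark -> bright; an illustrative 10-step ramp


def to_ascii(gray, ramp=ASCII_RAMP):
    """Render rows of 0-255 grayscale values as lines of ASCII characters."""
    last = len(ramp) - 1
    return ["".join(ramp[px * last // 255] for px in row) for row in gray]


# Made-up 2x3 frame: one character per pixel.
lines = to_ascii([[0, 128, 255], [255, 128, 0]])
print("\n".join(lines))
```

Each character then costs one byte before compression, so the frame size is set directly by the chosen character grid.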
Additionally, I tried Neural Style Transfer techniques using example code based on the paper⁶ and the GitHub code provided by Justin Johnson, Alexandre Alahi, and Li Fei-Fei, but in the end I decided to pursue bi-level images for wholly pragmatic reasons.
Video Semantic Segmentation
The core idea here is that we can use Deep Learning to segment and prioritize specific areas of a frame. In this case, I prioritized the following segments in decreasing order of priority: whiteboard, hands, face, torso, and background.
I am using SuperAnnotate to annotate my videos and a combination of OpenCV and DaVinci Resolve 16 to enhance the segmented video frames.
Segment-based Video Enhancement
The next step is to remove the background, followed by exposure, shadows, and contrast enhancement. However, the exact parameters are highly dependent on the lighting as well as the exposure, hence I developed some heuristics for my training samples.
Instead of multiple steps involving exposure, shadow and contrast enhancement, I decided to apply the heuristics of S-curve enhancement.
In the tonal histogram for an RGB image, the lower-left quadrant represents the shadows while the upper-right quadrant represents the highlights. A simple heuristic could be to reduce the tones in the shadow regions by roughly one standard deviation and to increase the tones in the highlight regions by around two standard deviations.
In setting the threshold for bi-level images, I applied the heuristic of setting it at the peak of the distribution. This heuristic seems to apply well, at least for night lighting.
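Reading "the peak of the distribution" as the most frequent intensity value in the frame (an interpretation, since the exact procedure is not spelled out), the threshold choice can be sketched as follows. The helper name `peak_threshold` and the tie-breaking rule are assumptions.

```python
from collections import Counter


def peak_threshold(gray):
    """Return the most frequent intensity in a grayscale frame.

    Ties are broken toward the darker value, which is an arbitrary
    choice made only so the result is deterministic.
    """
    counts = Counter(px for row in gray for px in row)
    return min(counts, key=lambda value: (-counts[value], value))


# Made-up frame in which intensity 50 occurs most often.
sample = [
    [10, 50, 50],
    [50, 200, 10],
]
threshold = peak_threshold(sample)
print(threshold)
```

In practice the histogram would probably be smoothed or binned first, because a raw per-value count is sensitive to noise.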
Overall, the localized video pre-processing based on segmented video frames provided a bi-level image with higher clarity.
Vectorization
The next step was to convert a bi-level pixelated image into a vector image. As vector-based images are not made up of a specific number of dots, they can not only be scaled to a larger size without losing any quality, but also reduce the footprint of the image significantly.
Encoding
Encoding is necessary to compress the data to fit the available communication bandwidth as well as to meet the delay requirements. Most video encoding standards are based on the Discrete Cosine Transform (DCT). Standards such as H.264 can compress video by a factor of up to 200, but H.264 is not only quite compute heavy but also built for "natural" rendering. For this reason, I decided to explore video encoders optimized for our specific use. Compression technologies for video include spatial compression, which exploits redundancy within a single frame, and temporal compression, which exploits redundancy between frames, such as motion redundancy. Currently, I am experimenting with different encoding technologies to find a balance between Quality of Experience and compression. This investigation includes bi-level bitmap (pixel) based frames as well as vector-based frames. Popular methods of compression for bi-level bitmap images include run-length encoding, entropy encoding such as arithmetic coding, and the JBIG2 standard for bi-level images. Following this, I shall investigate exploiting motion redundancy.
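Of the bi-level methods listed above, run-length encoding is the simplest to illustrate. The sketch below encodes one row of 0/1 pixels as alternating run lengths; the convention that every row starts with a (possibly empty) run of zeros is an assumption made so that the pixel values themselves need not be stored.

```python
def rle_encode(row):
    """Encode a row of 0/1 pixels as run lengths, starting with a run of 0s."""
    runs = []
    current, length = 0, 0
    for px in row:
        if px == current:
            length += 1
        else:
            runs.append(length)        # close the current run
            current, length = px, 1    # start a run of the other value
    runs.append(length)
    return runs


def rle_decode(runs):
    """Invert rle_encode: expand alternating runs of 0s and 1s."""
    row, value = [], 0
    for length in runs:
        row.extend([value] * length)
        value ^= 1                     # runs alternate between 0 and 1
    return row


row = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
runs = rle_encode(row)
print(runs)
print(rle_decode(runs) == row)
```

Frames with the background removed contain long uniform runs, which is exactly the case where this kind of encoding pays off.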
Decoding and Rendering
After decoding the compressed streaming video on the receiver end, one needs to render the video either on a smartphone or on a larger screen such as a TV. Vector representations are easy to scale and work well for representing whiteboard information, but they result in non-realistic rendering of faces, which might still work well for our purposes. However, we need to test the Quality of Experience to confirm this. On the other hand, bitmap-based representations lead to pixelation when scaled. It should be noted that a few simple algorithms can be applied to enhance those frames, such as nearest-neighbor and bicubic interpolation, sharpening, and artifact removal.
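Nearest-neighbor scaling, the simplest of the options mentioned above, just repeats each pixel. The following is a minimal sketch for integer scale factors on a bi-level frame; the function name `upscale_nearest` is hypothetical.

```python
def upscale_nearest(frame, factor):
    """Upscale a frame (rows of pixels) by an integer factor.

    Every source pixel is repeated `factor` times horizontally and every
    row `factor` times vertically, which is nearest-neighbor scaling.
    """
    out = []
    for row in frame:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


small = [[0, 1]]
print(upscale_nearest(small, 2))
```

The blocky result is what the text refers to as pixelation; smoother interpolation or sharpening would be applied on top of or instead of this step.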
In conclusion, I started with a 640 x 480 RGB image (~1 MB/frame) and transformed it to a bi-level image (~38 KB/frame). Vectorization further reduced the frame size to about 10 KB/frame. While this is still far from the 1 KB/frame worst-case target we discussed earlier, I believe there is a chance to meet that target. It may be possible through custom compression techniques that exploit redundancies, such as having no background, or through further opportunities in extreme vectorization, such as line drawings.
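The frame sizes quoted above follow from simple arithmetic, assuming 8 bits per colour channel for the RGB frame and 1 bit per pixel, packed eight to a byte, for the bi-level frame. The 10 KB and 1 KB figures are taken from the text, not computed.

```python
WIDTH, HEIGHT = 640, 480

rgb_bytes = WIDTH * HEIGHT * 3        # 8 bits per channel, 3 channels
bilevel_bytes = WIDTH * HEIGHT // 8   # 1 bit per pixel, packed 8 per byte

VECTOR_BYTES = 10 * 1000              # ~10 KB/frame, as reported in the text
TARGET_BYTES = 1 * 1000               # ~1 KB/frame worst-case target

print("RGB frame:      ", rgb_bytes, "bytes")
print("bi-level frame: ", bilevel_bytes, "bytes")
print("reduction still needed:", VECTOR_BYTES / TARGET_BYTES, "x")
```

So the RGB frame is a little under 1 MB, the bi-level frame is about 38 KB, and roughly another tenfold reduction is needed after vectorization to reach the worst-case target.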
Research challenges and next steps
There are a number of approaches worth investigating based on available compute on either the sender or receiver ends, as well as the available communication bandwidth.
Video Optimizations
Areas to investigate include optimizing video encodings around spatial frequencies for better interpretation of human actions⁴, and encoding facial expressions with schemes such as the Facial Action Coding System⁵.
Novel representations
Simple representations such as ASCII, 3D avatars, and other human-understandable representations are worth investigating. End-to-end AI-based approaches such as autoencoders can be used to generate compact intermediate representations, though intermediate latent-space encodings cannot be understood by humans. Of all approaches, I am most optimistic about autoencoders and related technologies. Another approach is the Facial Action Coding System (FACS), where an intermediate representation consists of Action Units (AUs) that control different muscles and movements. Such a FACS intermediate representation can encode a face and its position in roughly 44 AUs, plus 3 values for XYZ translation and 3 for angles, resulting in 50 floats or 200 bytes per frame. This can be much lower after compression, almost certainly well below the target of 1 KB/frame. Such an encoding can be used to render an avatar on the receiver’s end. The latter two approaches (autoencoders and FACS) would also need heavy compute on both the sender and the receiver side.
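The 200-byte figure for a FACS-style frame can be checked by packing 50 single-precision floats. The layout below (44 AU intensities, then translation, then rotation, little-endian) is an assumed wire format for illustration only; nothing in this article fixes the field order or precision.

```python
import struct

NUM_AUS = 44                           # Action Units per frame, as quoted
FRAME_FORMAT = f"<{NUM_AUS + 6}f"      # assumed layout: 50 float32 values


def pack_frame(action_units, translation, rotation):
    """Pack AU intensities plus XYZ translation and 3 angles into bytes."""
    if len(action_units) != NUM_AUS:
        raise ValueError(f"expected {NUM_AUS} action units")
    if len(translation) != 3 or len(rotation) != 3:
        raise ValueError("translation and rotation need 3 values each")
    return struct.pack(FRAME_FORMAT, *action_units, *translation, *rotation)


def unpack_frame(payload):
    """Invert pack_frame, returning (action_units, translation, rotation)."""
    values = struct.unpack(FRAME_FORMAT, payload)
    return (
        list(values[:NUM_AUS]),
        values[NUM_AUS:NUM_AUS + 3],
        values[NUM_AUS + 3:],
    )


payload = pack_frame([0.5] * NUM_AUS, (1.0, 2.0, 3.0), (0.0, 0.25, -0.5))
print(len(payload), "bytes per frame")
```

Quantizing each value to fewer bits, or sending only the AUs that changed since the previous frame, are the obvious next steps for shrinking this further, though neither is evaluated here.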
Encoding alongside audio
This is the biggest challenge in my opinion. There is a plethora of audio encoding standards, such as Enhanced Voice Services, Variable-Rate Multimode Wideband, Adaptive Multi-Rate Wideband, Enhanced Variable Rate Codec, and GSM Enhanced Full Rate, among others, in use on either end of the call. When a call is initiated on one carrier in one country and terminates on another carrier in a different country, the quality of the streaming audio varies significantly depending on the encodings used as well as the quality of the connection. The challenge is to encode the video alongside the audio such that the video can be reliably decoded and will "survive" the transmission.
A framework for novel applications
There are lots of novel applications possible such as interactive teaching methods using Vector-graphic animations or even minimalist augmented reality along with video and audio calls.
User: Build Android app as pilot.
I would appreciate it if you provide some feedback, or pass this article on to someone or some company that might have the resources and talent to develop a complete solution and an Android app that can be deployed in developing countries for educational purposes. I have been working on this during the past weekend, and it is impractical for me to develop a solution quickly. I thank you in advance for your help!
[1] Lauren E. Sherman; Minas Michikyan; Patricia M. Greenfield; The effects of text, audio, video, and in-person communication on bonding between friends. Cyberpsychology: Journal of Psychosocial Research on Cyberspace, 2013, 7(2), Article 3.
[2] García, B.; Gallego, M.; Gortázar, F.; Bertolino, A., Understanding and estimating quality of experience in WebRTC applications. Computing 2018, 101, 1–23.
[3] Pablo Barrera, Software Engineer, Google Research and Florian Stimberg, Research Engineer, DeepMind, Improving Audio Quality in Duo with WaveNetEQ, April 1, 2020, Google AI Blog.
[4] Steven M. Thurman, Emily D. Grossman, Diagnostic spatial frequencies and human efficiency for discriminating actions, 2010 November 16, Springer, Attention, Perception, & Psychophysics.
[5] Paul Ekman, Facial action coding system, 1978.
[6] Justin Johnson, Alexandre Alahi, Li Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, 2016, European Conference on Computer Vision.
[7] Julie (Novak) Beckley, Andy Rhines, Jeffrey Wong, Matthew Wardrop, Toby Mao, Martin Tingley, Data Compression for Large-Scale Streaming Experimentation, Dec 2, 2019, Netflix Technology Blog.