Virtual Stage. How was it possible? | Microsoft Build 2020
As one of the two major events that Microsoft holds every year, Microsoft Build is the developer conference in which the company announces the latest updates to its tools and services. Although it is usually an in-person event, due to the worldwide COVID-19 crisis, Microsoft decided that this year it would consist of 48 hours of online sessions, open to anyone who registered.
Some questions immediately arose: since speakers could not travel, Microsoft needed to record them remotely, but how could it ensure image and audio quality? Where would the speakers be recorded, at home? How would the speakers' surroundings look when published, and what impression would they project to attendees?
These and many other related questions are usually solved by recording in a professional studio or on a stage like a television set, with professional camera assistants and lighting, which was not possible in the pandemic context. Fortunately, we found a solution!
An application that leverages the power of Azure Kinect and the latest breakthroughs in AI to record speakers at home as if they were in a professional recording studio, talking in front of a green screen. The recordings are then sent to post-production, where virtual stages, animations, and compositions can easily be added: imagination rules!
Huge thanks to the team, to the amazing devs at @plainconcepts, and the researchers at @UW for their work that helped us keep speakers safe at home at #MSBuild. Try it out for your own presentations! https://t.co/s6AEMYIWVq
— David Carmona (@davidcsa) May 21, 2020
How does it work?
This technology, developed in collaboration with Microsoft Corp., consists of two separate parts:
The Speaker Recorder App, which records a lecture using one or two Azure Kinect devices, and Background Matting, the backend that removes the background with high quality using a sophisticated AI model and the information from the Azure Kinect sensors.
The Speaker Recorder application captures the color and depth information from one or two Azure Kinect cameras (two cameras can be used to record two angles of the same lecture). The speaker can additionally use a presenter remote to move through PowerPoint slides and a wireless microphone, both of which are recorded along with the video. When the session is finished, the videos are uploaded to Azure, ready to be processed.
In Azure, the Background Matting service uses the depth information to generate a rough segmentation of the speaker. The backend then separates the speaker from the background with high precision, generating a video with a transparent background.
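The idea of turning depth data into a rough segmentation can be sketched in a few lines. This is not the project's actual code, just a minimal numpy illustration under assumed parameters: the Kinect reports depth in millimetres (0 meaning no reading), and we assume the speaker stands within a known depth range. Pixels near the range boundaries are marked "unknown" so the matting model can refine them.

```python
import numpy as np

def coarse_trimap(depth_mm: np.ndarray, near: float = 500.0, far: float = 2500.0,
                  band: float = 150.0) -> np.ndarray:
    """Build a rough trimap from a depth frame in millimetres.

    Pixels well inside the assumed speaker range become foreground (255),
    pixels clearly outside (or with no depth reading) become background (0),
    and pixels near the range boundaries are marked unknown (128).
    """
    fg = (depth_mm > near + band) & (depth_mm < far - band)
    bg = (depth_mm < near - band) | (depth_mm > far + band)
    trimap = np.full(depth_mm.shape, 128, dtype=np.uint8)
    trimap[fg] = 255
    trimap[bg] = 0
    return trimap

# 1500 mm sits inside the range, 0 and 3000 mm are background,
# 600 mm falls in the uncertain band around the near boundary.
print(coarse_trimap(np.array([[1500.0, 0.0, 600.0, 3000.0]])))
```

A real pipeline would refine the unknown band with the matting network rather than a fixed threshold, but the principle is the same: depth gives a cheap, color-independent prior.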
Background Matting is based on a brand-new technique from the University of Washington. Due to the lack of labeled training data portraying standing humans, the original AI model was trained on 512×512 square images and videos cropped at hip or knee length, resulting in poor quality when matting full-HD videos of standing humans.
To get a high-quality foreground in areas like hair, hands, or feet, we made two major contributions to the original method. First, we replaced the original segmentation step with the AI models of the Azure Kinect Body Tracking SDK, obtaining a segmentation that is more tolerant of color similarities and ambiguous areas of the image.
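Swapping in the Body Tracking SDK means the matting model receives a silhouette mask instead of a learned segmentation. One plausible way to feed such a mask to a matting network, sketched here with numpy only (the helper names and the 4-neighbour morphology are our own illustration, not the SDK's API), is to erode and dilate the binary body mask so that a thin band around the silhouette is marked "unknown" for the network to resolve:

```python
import numpy as np

def _dilate(m: np.ndarray) -> np.ndarray:
    """One step of 4-neighbour binary dilation via shifted views."""
    out = m.copy()
    out[1:, :] |= m[:-1, :]
    out[:-1, :] |= m[1:, :]
    out[:, 1:] |= m[:, :-1]
    out[:, :-1] |= m[:, 1:]
    return out

def mask_to_trimap(body_mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Turn a binary body mask (e.g. derived from a body-index map) into a
    trimap: certain foreground (255), certain background (0), and an
    unknown band (128) of `radius` pixels around the silhouette edge."""
    mask = body_mask.astype(bool)
    dilated, eroded = mask, mask
    for _ in range(radius):
        dilated = _dilate(dilated)
        eroded = ~_dilate(~eroded)  # erosion as dilation of the complement
    trimap = np.full(mask.shape, 128, dtype=np.uint8)
    trimap[eroded] = 255
    trimap[~dilated] = 0
    return trimap
```

The matting network then only has to estimate fine alpha values (hair strands, finger gaps) inside the narrow unknown band, which is far easier than segmenting the whole frame from color alone.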
Second, we split the body into two square images with a small overlap and process them separately. This allows the model to "see" better in difficult areas, like the shadow between the feet, without losing precision in hair or hands.
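The split-and-merge step can be illustrated with a small numpy sketch. This is our own simplification, not the project's code: we assume a portrait frame taller than it is wide (but no taller than two widths), cut it into a top and a bottom square that overlap, run a per-crop `process` function (standing in for the matting model), and feather-blend the overlap with linear weights so no seam appears:

```python
import numpy as np

def split_and_merge(frame: np.ndarray, process) -> np.ndarray:
    """Process a portrait frame as two overlapping square crops.

    `process` maps a (w, w) crop to a (w, w) result (e.g. an alpha matte).
    The two results are recombined, linearly blending the overlap rows.
    """
    h, w = frame.shape[:2]
    assert w < h <= 2 * w, "frame must split into two overlapping squares"
    overlap = 2 * w - h
    top = process(frame[:w])          # upper square: rows 0 .. w-1
    bottom = process(frame[h - w:])   # lower square: rows h-w .. h-1
    ramp = np.linspace(0.0, 1.0, overlap)[:, None]  # 0 -> top, 1 -> bottom
    out = np.empty((h, w), dtype=float)
    out[:h - w] = top[:h - w]
    out[w:] = bottom[overlap:]
    out[h - w:w] = (1.0 - ramp) * top[h - w:] + ramp * bottom[:overlap]
    return out

# Sanity check: with an identity `process`, merging reproduces the frame.
frame = np.arange(24.0).reshape(6, 4)
merged = split_and_merge(frame, lambda crop: crop)
```

For a 1080×1920 portrait frame this yields two 1080×1080 squares with 240 rows of overlap, so each crop matches the square aspect ratio the model was trained on.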
To download the code, test it, or get more technical details, please check GitHub.