Microsoft's VASA-1 takes AI-generated video one step closer to 'aw hell, we're all doomed'

A promotional image for Microsoft's VASA-1 generative AI image — (Image credit: Microsoft)

With generative AI being a key feature of all its new software and hardware projects, it should be no surprise that Microsoft has been developing its own machine learning models. VASA-1 is one such example, where a single image of a person and an audio track can be converted into a convincing video clip of said person speaking the recording.

Just a few years ago, anything created via generative AI was instantly identifiable, by several factors. With still images, it would be things like the number of fingers on a person's hand or even just something as simple as having the correct number of legs. AI-generated video was even worse, but at least it was very meme-worthy.

However, a research report from Microsoft shows that the obvious nature of generative AI is rapidly going to disappear. VASA-1 is a machine learning model that turns a single static image of a person's face into a short, realistic video, through the use of a speech audio track. The model examines the sound's changes in tone and pace and then creates a sequence of new images where the face is altered to match the speech.

Latest Videos From

I'm not doing it any justice with that description, because some of the examples posted by Microsoft are startlingly good. Others aren't so hot, though, and it's clear that the researchers selected the best examples to showcase what they've achieved. In particular, a short video demonstrating the use of the model in real-time highlights that it still has a long way to go before it becomes impossible to distinguish real reality from computer-generated reality.

"It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection."

This is probably why Microsoft's research remains behind closed doors right now. That said, I can't imagine it will be long before someone manages to not only replicate the work but improve it, and potentially use it for some nefarious purpose. On the other hand, if VASA-1 can be used to detect deepfakes and it could be implemented in the form of a simple desktop application, then this would be a big step forward—or rather, a step away from a world where AI dooms us all. Yay!

Nick, gaming, and computers all first met in the early 1980s. After leaving university, he became a physics and IT teacher and started writing about tech in the late 1990s. That resulted in him working with MadOnion to write the help files for 3DMark and PCMark. After a short stint working at Beyond3D.com, Nick joined Futuremark (MadOnion rebranded) full-time, as editor-in-chief for its PC gaming section, YouGamers. After the site shutdown, he became an engineering and computing lecturer for many years, but missed the writing bug. Cue four years at TechSpot.com covering everything and anything to do with tech and PCs. He freely admits to being far too obsessed with GPUs and open-world grindy RPGs, but who isn't these days?