Blog Economie Numérique - Microsoft: VALL-E can mimic your voice from a very short voice sample

At the very beginning of 2023, following major advances in AI writing with OpenAI’s ChatGPT, Microsoft’s VALL-E AI can start talking like you from just a three-second audio clip.

VALL-E was introduced in a paper posted on arXiv, the preprint server hosted by Cornell University. Developed by Microsoft, this AI needs only a three-second sample of a person’s voice before it can start talking just like that person.

A realistic text-to-speech system

This AI only needs to hear a three-second audio clip of a human voice to start replicating it, turning text into speech with very “realistic intonation and emotion depending on the context of the text”. It can also imitate the acoustics of a room. It was trained on 60,000 hours of English speech recorded by more than 7,000 unique speakers.

VALL-E builds on EnCodec, Meta’s AI-powered audio compression neural network, and Microsoft describes it as a “neural codec language model.” It also matches the environment in which the sample was recorded, “so if the speaker recorded their voice in an echo-y hall, the VALL-E output also sounds like it came from the same place”.

Generating anybody’s voice correctly with this AI can cause safety issues

The developers demonstrated that this text-to-speech (TTS) technology correctly mimics a speaker’s voice. “Alongside the speaker prompt and VALL-E’s output, you can compare the results with the ‘ground truth’ – the actual speaker reading the prompt text – and the ‘baseline’ result from current TTS technology.”

Imitating anybody’s voice can be amusing, but many question how safe it is, especially since the generated voice is very close to that of a real person and the technology could easily be misused. Microsoft said: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating.”

VALL-E limitations and improvements

VALL-E is an impressive AI, but it still has limitations: in some cases words are repeated or come out unintelligible. Additionally, some results sound machine-like, while others are more realistic.

Microsoft plans to scale up VALL-E’s training data and aims to reduce the number of words that are unclear or missed by the AI. The goal is “to improve the model performance across prosody, speaking style, and speaker similarity perspectives.”

So, when will VALL-E become publicly available?

For security reasons, this AI is not yet available for public use. However, the developers are working on a model that can tell a real person speaking apart from VALL-E’s output, in order to prevent misuse.

That is why Microsoft has not announced when VALL-E will be available to the public, while users wait impatiently to try this impressive AI.

Sources

https://www.engadget.com/microsofts-vall-e-ai-can-simulate-any-persons-voice-from-a-short-audio-sample-112520213.html?guccounter=1&guce_referrer=a

https://www.euronews.com/next/2023/01/10/after-chatgpt-and-dalle-meet-vall-e-the-text-to-speech-ai-that-mimics-anyones-voice

https://screenrant.com/microsoft-valle-ai-imitate-human-voice-audio-sample/