Microsoft’s VALL-E AI Can Mimic Any Voice From a 3-Second Sample

Trained on 60,000 hours of English language recordings.

Microsoft has unveiled a new AI model capable of listening to and simulating virtually anyone’s voice. While most AI models that recreate human voices require at least one minute of audio recording input, VALL-E needs a mere three-second sample.

The company calls VALL-E a “neural codec language model.” It builds on EnCodec, Meta’s neural audio codec, and generates speech from text. To train VALL-E, researchers used Meta’s Libri-Light library, which contains 60,000 hours of English-language recordings from more than 7,000 speakers.

While some VALL-E voices are surprisingly realistic, others are less convincing. For now, the voice fed into the system must sound somewhat similar to one of the speakers the model was trained on for the simulation to be accurate. Microsoft plans to continue developing the model to improve its accuracy and its pronunciation of certain words.

The code is not currently open source because of the risk of deepfakes. However, those interested can listen to samples on Microsoft’s VALL-E demo page.