Listen to Microsoft’s new speech AI that mimics your voice from 3 seconds of audio

Microsoft has revealed a tool that can mimic someone’s voice and speech when given just three seconds of sample audio to base it on.

The VALL-E tool is a neural codec language model, the researchers say, and can be used to synthesise speech. The idea is to improve text-to-speech capabilities and make the output sound a little more natural.

In a post on GitHub, Microsoft says that even with a very limited sample of speech, the technology is capable of maintaining the authenticity and emotion in the voice.

Whether the speaker is angry, amused, disgusted, or sleepy, VALL-E can have a go at maintaining the emotion when it simulates the voice. It’s not perfect yet, far from it, and seems to have trouble with some of the stronger accents, but all in all it’s quite impressive for a proof of concept.

The company trained the tool using technology created by Meta, called LibriLight. It has 60,000 hours of English-language speech from 7,000 speakers. Meta created the tech to attempt to fill in the gaps on audio calls when the signal is poor, but Microsoft has other goals in mind.

As with anything AI-related, there will be fears the technology could be misused to make it appear as if someone has said something they haven’t. This is something we’ve already experienced with video deepfakes.

However, if the technology is used for the right reasons, it could help people who have lost their voice communicate with others again in their own voice.

You can’t test it for yourself yet, but Microsoft has released a lot of samples (via Ars Technica) showcasing the technology.

In a post explaining the trials, Microsoft says: “VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”

VALL-E Overview