Microsoft researchers have announced a new application that uses artificial intelligence to mimic a person’s voice from just a few seconds of training audio. The resulting voice models can be used for text-to-speech applications.

The application, called VALL-E, can synthesize high-quality personalized speech from only a three-second enrolled recording of the speaker, used as an acoustic prompt, the researchers wrote in a paper published online on arXiv, a free distribution service and open-access archive for scholarly articles.

Programs already exist that can cut and paste speech into an audio stream, converting typed text into a speaker’s voice. However, such a program must first be trained to simulate a person’s voice, which can take an hour or more.

“One of the extraordinary things about this model is that it does it in a matter of seconds. It’s very impressive,” Ross Rubin, principal analyst at Reticle Research, a consumer technology advisory firm in New York City, told TechNewsWorld.

According to the researchers, VALL-E outperforms current state-of-the-art text-to-speech (TTS) systems in both speech naturalness and speaker similarity.

In addition, VALL-E can preserve the speaker’s emotions and acoustic environment. So if a speech sample was recorded on a phone, for example, text using that voice would sound like it was being read through a phone.

‘super impressive’

VALL-E is a noticeable improvement over previous state-of-the-art systems such as YourTTS, which was released in early 2022, said Giacomo Miceli, a computer scientist and creator of a website featuring an AI-generated, never-ending conversation between the synthesized voices of Werner Herzog and Slavoj Zizek.

“What is interesting about VALL-E is not just that it needs only three seconds of audio to clone a voice, but also how closely it can match that voice, its emotional tone, and any background noise,” Miceli told TechNewsWorld.

Ritu Jyoti, group vice president of AI and automation at IDC, a global market research company, called VALL-E “significant and highly impactful.”

“This is a significant improvement over previous models, which required a much longer training period to generate a new voice,” Jyoti told TechNewsWorld.

“It’s still early days for this technology, and more improvements are expected to make it more human-like,” she added.

emotion simulation questioned

Unlike OpenAI, the creator of ChatGPT, Microsoft has not opened VALL-E to the public, so questions remain about its performance. For example, are there factors that could cause degradation of the speech produced by the application?

“The longer the generated audio snippet, the more likely a human listener is to notice things that sound a bit off,” Miceli said. “Words in the synthesized speech may come out garbled, omitted, or duplicated.”

“It’s also possible that switching between emotional registers will feel unnatural,” he said.

There is also doubt about the application’s ability to simulate a speaker’s emotions. “It will be interesting to see how well that capability holds up,” said Mark N. Vena, president and principal analyst at SmartTech Research in San Jose, Calif.

“The claim that they can do this with only a few seconds of audio is hard to believe,” he continued, “given the current limitations of AI algorithms, which require a lot of voice samples.”

ethical concerns

Experts see beneficial applications for VALL-E, as well as some potentially harmful ones. Jyoti cited speech editing and replacing voice actors. Miceli said the technology could be used to build editing tools for podcasters, customize the voices of smart speakers, and be incorporated into messaging systems, chat rooms, video games, and even navigation systems.

“The other side of the coin is that a malicious user could clone a politician’s voice and have them say things that sound absurd or inflammatory, or simply spread false information or propaganda,” Miceli said.

If it’s as good as Microsoft claims, Vena sees huge potential for abuse of the technology. “At the level of financial services and security, it is not difficult to imagine use cases in which rogue actors could do really harmful things,” he said.


Jyoti also sees ethical concerns emerging around VALL-E. “As the technology advances, the voices produced by VALL-E and similar tools will become more convincing,” she explained. “That would open the door to realistic spam calls that mimic the voices of real people a potential victim knows.”

“Politicians and other public figures could also be impersonated,” she added.

“There could be potential security concerns,” she continued. “For example, some banks allow voice passwords, which raises concerns about misuse. We can expect an escalation in the arms race between AI-generated content and AI-detection software aimed at preventing misuse.”

“It is important to note that VALL-E is not currently available,” Jyoti said. “Overall, it is important to regulate AI. We will have to see what measures Microsoft takes to regulate the use of VALL-E.”

enter the lawyers

Legal issues may also arise around the technology. “Unfortunately, there may not be adequate existing legal tools to deal directly with such issues; instead, a hodgepodge of laws covering how the technology is misused may be used to curb that misuse,” said Michael L. Teich, a principal at Harness IP, a national intellectual property law firm.

“For example,” he continued, “voice cloning can result in a deepfake of a real person’s voice that could be used to deceive a listener or even to mimic the voice of an election candidate. While such misuse would raise legal issues in the areas of fraud, defamation, or electoral misinformation law, there is a lack of AI-specific laws that would address the use of the technology itself.”


“Further, depending on how the initial voice sample was obtained, there may be implications under the federal Wiretap Act and state wiretap laws if the sample was captured over, for example, a telephone line,” he said.

“Finally,” Teich said, “in limited circumstances, there may be First Amendment concerns if such voice cloning is used by a government actor to silence or dilute legitimate voices and keep them from exercising their free speech rights.”

“As these technologies mature and become more accessible, there may be a need for specific laws that directly address the technology and prevent its misuse,” he said.

making smart investments

Microsoft’s AI efforts have been making headlines in recent weeks. ChatGPT is expected to be incorporated into its Bing search engine this year, and possibly into its Office apps. The company also reportedly plans to invest $10 billion in OpenAI — and now there is VALL-E.

“I think they’re making a lot of smart investments,” said Bob O’Donnell, founder and principal analyst at Technalysis Research in Foster City, Calif., a technology market research and consulting firm.

“They jumped on the OpenAI bandwagon several years ago, so they’ve been behind the scenes on this for quite some time. Now it’s coming out in a big way,” O’Donnell told TechNewsWorld.

“They’ve had to play catch-up with Google, which is known for its AI, but Microsoft is making some aggressive moves to come to the forefront,” he continued. “They’re jumping on the popularity and the incredible coverage that all these things are getting.”

Rubin said, “Microsoft, having been the leader in productivity for the last 30 years, is looking to preserve and extend that leadership. AI may hold the key to that.”
