Microsoft researchers have announced a new application that uses artificial intelligence to mimic a person’s voice after just a few seconds of training. The resulting voice models can be used for text-to-speech applications.

The application, called VALL-E, can synthesize high-quality personalized speech with only a three-second enrolled recording of a speaker as an acoustic prompt, the researchers wrote in a paper published online on arXiv, a free distribution service and open-access archive for scholarly articles.

There are existing programs that can cut and paste speech into an audio stream, converting typed text into the speaker’s voice. However, those programs must first be trained to simulate a person’s voice, which can take an hour or more of recordings.

“One of the extraordinary things about this model is that it does it in a matter of seconds. It’s very impressive,” Ross Rubin, principal analyst at Reticle Research, a consumer technology advisory firm in New York City, told TechNewsWorld.

According to the researchers, VALL-E outperforms current state-of-the-art text-to-speech (TTS) systems in both speech naturalness and speaker similarity.

In addition, VALL-E can preserve the speaker’s emotions and acoustic environment. So if a speech sample was recorded on a phone, for example, text using that voice would sound like it was being read through a phone.

‘Super Impressive’

VALL-E is a noticeable improvement over previous state-of-the-art systems like YourTTS, which was released in early 2022, said Giacomo Miceli, a computer scientist and creator of a website featuring a never-ending, AI-generated conversation between the synthesized voices of Werner Herzog and Slavoj Zizek.

“The interesting thing about VALL-E is not just that it needs only three seconds of audio to clone a voice, but how closely it can mimic that voice, its emotion and timing, and any background noise,” Miceli told TechNewsWorld.

Ritu Jyoti, group vice president of AI and automation at IDC, a global market research company, called VALL-E “significant and highly impactful.”

“This is a significant improvement over previous models, which required a much longer training period to generate a new sound,” Jyoti told TechNewsWorld.

“It’s still early days for this technology, and more improvements are expected to make it sound more human-like,” she added.

Emotion Simulation Questioned

Unlike OpenAI, the creator of ChatGPT, Microsoft has not opened VALL-E to the public, so questions remain about its performance. For example, are there factors that could cause degradation of the speech produced by the application?

“The longer the generated audio snippet, the more likely a human is to hear things that sound a bit off,” Miceli said. “Words in the synthesized speech may be slurred, omitted, or duplicated.”

“It’s also possible that switching between emotional registers will feel unnatural,” he said.

There is also some doubt about the application’s ability to simulate a speaker’s emotions. “It will be interesting to see how well that capability holds up,” said Mark N. Vena, president and principal analyst at SmartTech Research in San Jose, Calif.

“The fact that they claim to do this with only a few seconds of audio is hard to believe,” he continued, “given the current limitations of AI algorithms, which require lots of voice samples.”

Ethical Concerns

Experts see beneficial applications for VALL-E, as well as some nefarious ones. Jyoti cited speech editing and replacing voice actors. Miceli said the technology could be used to build editing tools for podcasters and to customize the voices of smart speakers, as well as be incorporated into messaging systems, chat rooms, video games, and even navigation systems.

“The other side of the coin is that a malicious user could clone a politician’s voice and have them say things that sound absurd or inflammatory, or simply spread false information or propaganda,” Miceli said.

If VALL-E is as good as Microsoft claims, Vena sees huge abuse potential in the technology. “At the level of financial services and security, it’s not difficult to envision use cases where rogue actors could do really harmful things,” he said.

Jyoti also sees ethical concerns emerging around VALL-E. “As the technology advances, the voices produced by VALL-E and similar technologies will become more convincing,” she explained. “That would open the door to realistic spam calls that mimic the voices of real people a potential victim knows.”

“Politicians and other public figures can also be impersonated,” she added.

“There could also be potential security concerns,” she continued. “For example, some banks allow voice passwords, which raises concerns about misuse. We can expect an escalating arms race between AI-generated content and AI-detection software aimed at preventing misuse.

“It is important to note that VALL-E is not currently available,” Jyoti said. “Overall, it is important to regulate AI. We will have to see what measures Microsoft takes to regulate the use of VALL-E.”

Enter the Lawyers

Legal issues may also arise around the technology. “Unfortunately, there may not be existing, adequate legal tools to deal directly with such issues; instead, a hodgepodge of laws that cover how the technology is misused may be used to curb such misuse,” said Michael Teich, a principal at Harness IP, a national intellectual property law firm.

“For example,” he continued, “voice cloning can result in a deepfake of a real person’s voice that can be used to deceive a listener, or even to mimic the voice of an election candidate. While such misuse would raise legal issues under fraud, defamation, or electoral misinformation laws, there is a lack of AI-specific laws that would deal with the use of the technology itself.”

“Further, depending on how the initial voice sample was obtained, there may be implications under the federal Wiretap Act and state wiretap laws if the voice sample was obtained over, for example, a telephone line,” he said.

“Finally,” Teich said, “in limited circumstances, there may be First Amendment concerns if such voice cloning is done by a government actor to silence, denigrate, or dilute legitimate voices exercising their free speech rights.”

“As these technologies mature and become more accessible, there may be a need for specific laws to directly address the technology and prevent its misuse,” he said.

Making Smart Investments

Microsoft’s AI efforts have been making headlines in recent weeks. ChatGPT is expected to be incorporated into its Bing search engine this year, and possibly into its Office apps. The company also reportedly plans to invest $10 billion in OpenAI. And now there’s VALL-E.

“I think they’re making a lot of smart investments,” said Bob O’Donnell, founder and principal analyst at Technalysis Research in Foster City, Calif., a technology market research and consulting firm.

“They jumped on the OpenAI bandwagon several years ago, so they’ve been behind the scenes on this for quite some time. Now it’s coming out in a big way,” O’Donnell told TechNewsWorld.

“They’ve had to play catch-up with Google, which is known for its AI, but Microsoft is making some aggressive moves to come to the forefront,” he continued. “They’re jumping on the popularity and the incredible coverage that all these things are getting.”

Rubin said, “Microsoft, having been the leader in productivity for the last 30 years, is looking to preserve and extend that leadership. AI may hold the key to that.”

Having trouble understanding the person at the other end of the support line you’ve called for customer service? A Silicon Valley company wants to make problems like that a thing of the past.

The company, Sunus, makes software that uses artificial intelligence to remove accents in the speech of non-native, or even native, English speakers and output a more standard version of the language. “The program performs phonetic-based speech synthesis in real time,” Sharath Keshav Narayan, one of the firm’s founders, told TechNewsWorld.

Furthermore, the voice’s characteristics remain the same even after the accent is removed. The voice output by the software sounds like the voice input; only the accent has been changed. The gender of the speaker, for example, is preserved.

“What we’re doing is allowing agents to keep their identity, keep their tone, it doesn’t need to change,” said Sunus CEO Maxim Serebryakov.

“The call center market is huge. It’s 4% of India’s GDP, 14% of the Philippines’ GDP,” he told TechNewsWorld. “We’re not talking about a few thousand people who are discriminated against daily along with their cultural identity. We’re talking about hundreds of millions of people who are treated differently because of their voices.”

“The concept is sound. If they can make it work, it’s a big deal,” said Jack E. Gold, founder and principal analyst at J.Gold Associates, an IT consulting firm in Northborough, Mass.

“It can make companies more efficient and more effective and more responsive to consumers,” he told TechNewsWorld.

Talking Local

Gold explained that local people understand local dialects better and engage better with them. “Even talking to someone with a heavy Southern accent gives me pause sometimes,” said the Massachusetts resident. “If you can sound more like me, it increases the effectiveness of the call center.”

“Many call center employees are located overseas, and customers may have trouble understanding what they are saying because of strong accents,” John Harmon, a senior analyst at Coresight Research, a global advisory and research firm specializing in retail and technology, told TechNewsWorld.

“But the same could be true for regional American accents,” he said.

However, Taylor Goucher, COO of Connext Global Solutions, an outsourcing company in Honolulu, discounted accents as a source of customer frustration.

“It is well known that companies outsource call center support to different countries and to rural parts of the United States,” he told TechNewsWorld. “The bigger issue is sourcing the right employees and selecting the right training and processes to make them successful.”

Customer Perception

Harmon noted that consumers may have a negative reaction when they encounter a support person with a foreign accent on the other end of a support line. “A caller may feel that a company is not taking customer support seriously because it is looking for a cheaper solution by outsourcing service to a foreign call center,” he said.

“In addition,” he said, “some customers may feel that someone overseas may be less able to help them.”

Goucher cited a study conducted by Zendesk in 2011 that showed customer satisfaction dropped from 79% to 58% when a call center was relocated outside the United States. “Everyone I know has likely had a bad customer experience at some point in their life with an agent they couldn’t understand,” he observed.

He said the biggest problem with poor customer experience is the lack of support systems, training and management oversight in the call center.

“Too often we see companies take call centers offshore just to answer the phone,” he said. “In customer service, answering the phone isn’t the most important part. It’s what comes next.”

“Agents, accent or no accent, will be able to deliver a winning customer experience if they are the right people for the role, have the right training, and have the right tools to solve customer problems,” he said. “It’s easy to blame the accent as the problem.”

Prejudice Against Accents

When a customer support person doesn’t have the tools to solve a problem, it can be hugely frustrating for the customer, Gold said. “If I call someone, I want my problem solved, and I don’t want to go through 88 steps to get there,” he said. “It’s frustrating for me because I spent a lot of money with your company.”

“Anything that can be done to get over that hump faster has many benefits,” he continued. “From a consumer standpoint, I have the advantage of not being annoyed. Plus, if I can move faster, it means the service person can spend less time with me and handle more calls. And if they can understand my problem better, I won’t have to call about it again.”

Even if a customer support person has the equipment they need to provide the highest level of service, accents can affect the caller’s response to the person on the other end of the phone line.

“A customer may be bothered by having to decode a foreign accent,” Harmon said. “There’s also a stereotype that some American accents sound illiterate, and a customer may feel like they are getting cheap support.”

“In some cases, I think the biggest pre-existing bias is that if the agent has an accent, they won’t be able to solve my problem,” Goucher said.

Options for Voice

Serebryakov noted that one of the goals of Sunus is to provide people with options for their voice. “When we post photos on Instagram, we can use filters to represent ourselves however we want,” he explained. “But you don’t have a uniform medium for voice. Our mission at Sunus is to provide that kind of choice.”

Although Sunus initially targeted call centers for its technology, there are other areas that have potential for it.

“One of the biggest uses we see for the technology is in enterprise communications,” Narayan said. “We got a call from Samsung saying they have 70,000 engineers in Korea who interact with engineers in the US, and they don’t talk in team meetings because they’re afraid of how they’ll be perceived. That’s the next use case we want to solve.”

He said the technology also has potential in gaming, healthcare, telemedicine and education.

Sunus announced a $32 million Series A funding round in June 2022, the largest Series A in history for a speech technology company.