A Montreal-based AI startup called Lyrebird has taken the wraps off a voice imitation algorithm that the team says can not only mimic the speech of a real person but also shift its emotional cadence, and do all this with just a tiny snippet of real-world audio.
The public demo, released online yesterday, consists of a series of audio samples of (fake) speech generated by their algorithm from one-minute voice samples of the speakers. They've used voice samples of Donald Trump, Barack Obama and Hillary Clinton to demo the tech in action, for maximum FAKE NEWS impact, obviously.
Here’s a sample of the fake Obama:
And here’s a fake Trump:
And here’s a totally fabricated discussion between fake Trump, fake Obama and fake Clinton. Truly we live in the strangest times…
Lyrebird says it intends to offer an API in the future so that third parties can use the audio mimicry technology for their own ends. So if you think fake news online is bad now, wait until there's a technology that lets anyone trivially generate a 'recording' of a person apparently incriminating themselves.
The startup does have an ethics statement on its website that confronts head-on what it describes as the "important societal issues" raised by the technology's ability to fabricate recorded evidence. In it, the company states:
Voice recordings are currently considered as strong pieces of evidence in our societies and in particular in jurisdictions of many countries. Our technology questions the validity of such evidence as it allows to easily manipulate audio recordings. This could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.
By releasing our technology publicly and making it available to anyone, we want to ensure that there will be no such risks. We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible. More generally, we want to raise attention about the lack of evidence that audio recordings may represent in the near future.
Asked if they have any concerns about putting the tech into the wild, Alexandre de Brébisson, one of the PhD students developing the deep learning tech, told TechCrunch: “By releasing the API publicly and allowing anyone to use it, we want people to become aware that this technology exists and that audio recordings are not as reliable as we may think. It is similar to what Photoshop did.
“Not publishing the technology because of those potential misuses do not make sense to us as we think that the positive aspects overcome the bad ones (a hammer can be used to build but also to break). If we do not publish the technology ourselves, others will do it in the future (and, contrary to us, they might have bad intentions, maybe hiding it from a part of the population).”
It's a fair point, of course. You can't put a finger in the dam of engineering progress. But you can warn people to be smarter and to think more critically about the material they're (apparently) being exposed to. It's more proof, if proof were needed, of the value of critical and analytical thinking for intelligently navigating an ever-expanding digital realm that is increasingly augmenting and shapeshifting reality.
At this stage de Brébisson won’t give a timeframe for the release of the API, saying only that the beta version to copy a voice “will be available soon”, and that they’ll be adding new features over time. “We have been working for more than a year on the technology (at the MILA lab of the University of Montréal, we are advised by Yoshua Bengio, an AI pioneer),” he adds.
It's also not clear whether the Lyrebird API will be free; it sounds more like the plan is to put out a freemium API. De Brébisson says it won't "necessarily" be free. "Maybe simple features will, or initial samples will be," he tells TechCrunch. "What we meant is that anyone with Internet will be able to use our API — we are not selling the technology to a particular company or a particular government."
He does specify, though, that the monetization plan is to charge developers and companies according to the number of samples they request (e.g. 1,000 generated sentences for x dollars). "The first samples will be free," he confirms.
Here’s how Lyrebird is pitching what the API will be able to do:
In terms of potential applications for voice-mimicking technology, the sky is surely the limit. But the company's website suggests a few to get developers' creative juices flowing: personal assistants, audiobooks read in famous voices, connected devices of all stripes, speech synthesis for people with disabilities, and animation film and video game studios.
The voice quality in the samples still has a distinctly metallic rasp to my ear, a sort of audio uncanny valley, if I can put it that way. So it seems very unlikely that the technology could offer a like-for-like replacement for a professionally recorded audiobook, for example (at least not yet), though it will probably offer a more economical alternative.
De Brébisson also points out that the one-minute audio samples they've used as the source for the demo recordings do not contain all the "DNA of the voice", and claims: "More data would significantly improve the quality."
“We still believe that our voices have significantly more natural intonations than other published voices,” he says. “Sometimes we can hear a little bit of noise in our samples, it’s because we trained our models on real-world data and the model is learning the background noise or microphone noise. We are working hard on removing those artifacts for the release.”
Asked whether it will be possible to develop perfect speech synthesis in future, i.e. speech that is indistinguishable from the real thing, he says he believes this will indeed be possible in "a matter of years". So start tuning your aural expectations for the end of (technically) distinguishable reality.
The Lyrebird team has been bootstrapping development thus far, working on the core tech at the MILA lab as part of their PhD research; they say they wanted to release the website before raising any external capital.
De Brébisson says they've had "several offers" since yesterday's launch, so it seems likely this deep learning startup won't need to rely purely on its own financial resources for long.
“The launch was a success (100K visits in one day on the webpage, 1 million of samples have been listened in one day) and we have already been contacted by several famous investors,” he adds.
If you're wondering where Lyrebird's name comes from, its namesake is a real-life mimic: a bird capable of recreating the songs of at least 20 other species, along with assorted (and rather less dulcet) man-made sounds like camera shutters, car alarms and chainsaws. Aka fake news of the feathered variety.
Featured Image: Jonathan Zawada/Flickr under a CC BY-SA 2.0 license