Rupal: You live in Venice Beach? They do the "American Ninja Warrior" show out there, don't they?
NIN: I don't know. That's your guilty pleasure?
No, it's not. It's my children's guilty pleasure.
Don't blame it on your kids. That's the worst. Just own it. If it's you, it's you.
(Laughing) It's not me. I promise.
And you are based where?
I'm based in Boston.
When did VOCALiD start?
Officially, as a company, it started in May of this year, and the technology itself has been stuff I've been working on through my laboratory over the last 5-6 years or so.
What was the actual inspiration point for why you started working in this field at all?
I was really interested in communication for a long time, and I got in the field of communication sciences very accidentally. This particular idea came when I was at a conference and I was presenting my research on people who can't speak. Even though they can't speak, they can still produce something in their voice. I've been studying what it is about their voice that people who know them recognize about them, and how it is that mothers can still decode for a child who is pretty much unintelligible to anybody else.
At that same conference, a huge international conference in Denmark, I walked into an exhibit hall and saw an older man and a young woman having a conversation. They were using the exact same voice, this adult male voice. As I turned around, I saw those voices all around me, and I'd just finished giving a talk saying that every individual, even if they're non-speaking, has this unique voice about them. I thought, "We've got to be able to make voices that are unique to the individuals." That's kind of where it started.
That was in 2002. I told my graduate students about it and didn't think much of it, because we really didn't have funding for it. Later, we were going to a different conference and my student said, "I've still been working on that project that you told me about," and played me a demo. We applied for a grant, showed that program, and it started there.
And you were able to do it full time from then forward?
No, we got a couple of government grants to work on that project. That's one of about five or six projects in my lab. But in the last year or so we've focused quite a bit on this area, because it's as close to a commercialization phase as we've gotten. In May of this year, I decided we just had to go for it.
Who was the first person to benefit from your technology?
We have made several voices. Many experimental. Our very first voice was for a boy named William. He was nine. When he first heard his voice, he said, “Never heard me before.” More recently, we made a voice for a young woman who is now 17. What we'd been doing is using samples of her speech from another experiment that we'd done to create a voice for her. We were experimenting with our technology - voice-blending with her voice - for many years, and it wasn't until April 2013 when an NPR reporter picked up this story and essentially pushed me to act. The reporter said, "Have you ever had this young woman listen to her voice, even while you're creating it?" I said, "No, it's not ready yet. We're still experimenting, we're still experimenting." She said, "Would you please play it for her?" I'm like, "No, no, no."
Anyway, long story short, I reached out to the family again and I said, "By the way, we've also been doing this and would you like to hear it?" We played it for them. The voice that we had created for her utilized a sample from when she was 9 and when we played it for her, she was 17. We decided that we were going to create a new voice for her from the sample of her voice today. She was the very first person that we actually made a voice for.
We got another sample of her when she was 17. The way our technology works is we get a lot of speech from a healthy donor, about 3-4 hours of speech, and then we get whatever speech is still left from the person who's going to receive a voice. In her case it's really mostly just a vowel that she can vocalize. Then we blend those together to make this unique voice.
What we're trying to do is capture the essence of the person, their voice quality, their identity, which is then blended with the clarity of the captured donor voice. That's what we want to do.
How did she react when you played her the voice?
Oh, she was all smiles and in disbelief that this is possible. We recently gave her her voice, and I watched her as we loaded it onto an iPad. While we were getting it onto the iPad, I saw her fumbling with her phone. I thought she was distracted, texting with her friends, or whatever. What she was doing was deleting photos and videos off her phone so we could put the voice on her phone too.
She uses her voice a lot more now. One of the biggest tests to me is that she says that now when she goes to a restaurant, rather than hiding her device behind her and not speaking and just signing or doing other modes of communication, she wants to use her device now. That means she'll put herself into situations where she can be more who she is. And that's opening doors for her.
How does that make you feel?
Amazing, amazing. I really feel that in the last two years, as we've started to let people experience the technology, even though it may not be fully ready yet, we're learning so much more. It's been a huge growth experience for me.
As a scientist you are not ready to let it go until something's perfect. It's never perfect though, and you can learn so much from the user and what they're telling you about what's right and not right. It's empowering. It fuels all of what we're doing now.
Most of the people that you're helping, are they people with ALS and other kinds of diseases, or are they people who've suffered some kind of trauma?
Right now, the few beta users we have are actually all people who were born with congenital impairments. There are a number of people with ALS who also want voices. There are so many reasons why you may want or need an artificial voice that sounds like you either temporarily or permanently.
There are also people who are not speech impaired but who also use text-to-speech technology. Take someone who's blind or has poor vision. Many use screen-reading technology to read text out loud to them. They've shown an interest in this technology because they feel that when they compose their own written text, they lose their inner writing voice, which hinders creativity. That's especially true of people who lost their vision later in life.
That's a market we didn't even know about until we started talking about this work, which has been interesting.
Do they want it to sound like them or more like Scarlett Johansson?
The people that we've talked to wanted to sound like themselves because they can hear their own voice in their head.
Yeah. Externalizing their own inner voice.
Really fascinating. How close to commercialization is it at right now?
What we're in is the prototype phase with the beta users. This is a 3-part technology. There's the voice donation part and then there's the blending part, and then we have the synthesizer.
We're focusing a lot right now on the voice donation because, in the past, we'd find a donor that's about the same age and the same gender. That gives you an approximate match. Right? But now we've got lots of donors, so we can get a lot more detailed about the cultural and linguistic history of the recipient as well.
The focus is on banking a lot of people's voices. We've had some 20,000 people sign up to pledge to give their voice, and we have been building the web interface system. We hope that we can soon get people to start donating.
How many people have you banked already?
Right now we are building the tools and have a few beta users. We're still in the process of getting this technology up and running to start recording.
How long does it take for you to collect a good sample?
To get a good-sounding voice, you need about 3,000 sentences to be banked, which is about 3-4 hours of speech. The more speech you have, the better. You can get away with a little bit less, but it's just going to suffer in terms of quality.
In terms of what you'd like to have banked in order to have a critical mass of samples to provide anybody with a voice, is there any idea how many voices that takes?
There's two things. The more diversity we have in the speakers, the more nuanced kinds of things we can look for in a voice, the more characteristics we can try to match for.
What's also very critical is going deep. In other words, with the people donating, getting not just a few samples. The more speech they can generate for us, the better-sounding the overall voice we can create. It's both broad and deep. How many donor voices? I don't know, but we are going for diversity. Just one voice can generate hundreds of voices. Right? More than one person may match me, so I can donate my voice to more than one recipient.
If you could solve another problem in a completely different sphere, what would be that other thing you'd like to see be "not impossible"?
Oh, that's tough. I don't know if it's a health problem or social problem. But, how can we give kids enough of a dose of self-esteem that they are protected from anything. From depression, from anxiety, from fear. How can we make them strong, how can we toughen them up? When do we do that? I have no idea.
When is it that we make them strong and healthy psychologically? I think we've focused a lot on physical health and well-being, which is obviously really important, but I think psychological health and well-being and that inner core is important.
Does that mean you have a couple of small kids?
Yeah. I have a 7-year-old and a 9-year-old now. And they're amazing! But I always wonder, what can I expose them to so that they're robust to anything that is thrown at them?