Enterprise IT Watch Blog

Jul 2 2018   7:05AM GMT

How AI is teaching robots to speak fluent human

Michael Tidmarsh

Artificial intelligence


By James Kobielus (@jameskobielus)

The only place where robots actually look like real human beings is in the movies. And that’s only because cinema artists have perfected the use of live actors, motion-capture technology, and computer-generated imagery. I seriously doubt that Steven Spielberg will live long enough to direct an actual android playing a fictional android.

However, we’re already beginning to hear robotic speech that is indistinguishable from what emanates naturally from the human vocal tract. That’s pretty much what Google’s new Duplex technology has been able to achieve. Though the vendor demonstrated it coming from its next-generation smartphone operating system rather than from a distant R2-D2 relative, this new artificial intelligence (AI)-driven speech generation technology is clearly capable of being embedded in any device. So it probably won’t be long until robot manufacturers embed it in mechanisms that resemble sci-fi androids.

It’s only a matter of time before these “uncanny valley” robots speak fluent human, perhaps even more mellifluously than those of us who were born with the old-fashioned type of tongue. Or, as one recent article phrased it, many of us will start to regard the natural language processing (NLP) variety of AI as little more than a “chatbot in a robot suit.” Their robotic voices will echo across the uncanny valley as if they were coming from lips of flesh and blood.

Within our lifetimes, “robotic” will lose its connotation of the conspicuously artificial, especially where robots’ speaking voices are concerned. As robotic audio processing technology comes into our lives, it will create conversational user interfaces of amazing naturalness. Chatbots, now the subject of so much derision for their awkward speech, will become convincingly and automatically human with little or no training. And they will serve as the auditory nerve centers of any physical robot that is built to interact with you and me.

Generative audio is one of AI’s hottest research frontiers. This technology can now render any computer-generated voice into one that truly sounds like it was produced in a human vocal tract. It can translate text to speech with astonishing naturalness. It can also compose music that feels like it expresses some authentic feeling deep in the soul of an actual human musician.

To resemble humans with uncanny precision, robots should be attuned to and be able to produce the full range of non-verbal utterances of which people are capable. In addition, they should be able to create and respond to music just like people do. And, going beyond voices and instruments, they must be able to manipulate sound-producing objects with full fidelity in the same meaningful, environmentally aware patterns that we normally associate with human agency.

For robots to truly embody the full sonic signature of flesh-and-blood humans, generative audio techniques will need to advance to the point where AI can help devices achieve near-perfect mastery of human behavior in each of the following tasks:

  • Robotic mimicry: Vocalized language is one of the hallmarks of the human species. This is what Duplex has achieved, in terms of vocalizing with uncanny fidelity like some random human being, as opposed to mimicking the speech patterns of a specific individual. Within the Android mobile OS, Duplex enables Google Assistant to carry on natural-language phone conversations without betraying the fact that it’s a bot. Clearly, this innovation has triggered waves of concern about its potential to transform robo-calling into a tool for mass deception on an unprecedented scale. By relegating synthesized digital-assistant voices to history’s dustbin, Duplex has leapt over that uncanny valley clear to the other side. But for Duplex and equivalent speech generation technologies to become truly revolutionary, they will need to advance to the point where they have the flexibility to mimic any speaker in any dialect or language, and inject a believable quality of humanlike emotion into their delivery. This will, of course, require not just an underlying text-to-speech NLP technology of unprecedented sophistication, but also continual training of this technology from live or recorded speech samples from people in every linguistic community.
  • Robotic conversation: Language isn’t fully human unless it’s engaged in conversations. This is what the next version of Google Assistant has been built to do, but only in a very limited fashion. Rather than attempt to truly ace the Turing test, Google Assistant will be able to respond to questions with multiple subjects. It will also be able to continue conversations without the user having to constantly repeat the trigger phrase “Hey Google.” And, in league with Google Duplex and TensorFlow Extended, it will be able to carry on natural phone conversations, based on its ability to understand complex sentences, fast speech, long remarks, and speaker intent. But it won’t be able to engage in the back-and-forth of debate, argumentation, and engaged conversation at the same level as humans. For that, we’ll need technologies such as what IBM has built under its Project Debater. Though it’s still quite limited and not integrated with audio-speech recognition technology, this NLP-based AI program could conceivably win an argument, though it quite literally doesn’t know what it’s talking about. Instead, it was built and trained by IBM researchers on large repositories of textual human conversations of a topic-focused nature. The technology rapidly analyzes huge amounts of textual conversational records before constructing a plausible argument on a specific topic. It constructs arguments as sequences of conversation-contextual statements that combine winning elements of previous human arguments.
  • Robotic musicianship: Music is very much a human language that robots can consume, produce, and engage around. This is what the growing range of generative music programs is able to do, using convolutional neural networks, NLP, and other AI tools to compose and perform music of various sorts with the occasional lyric. For example, Google DeepMind’s WaveNet can generate convincing music-like recordings by training a deep learning model on audio recordings of human-performed classical piano pieces. Along these same lines, iZotope’s Neutron 2 uses AI to isolate individual instruments and voices in a recording, thereby facilitating AI-assisted remixing of those elements into a unique new recording-like object. However, these “source separation” capabilities are limited to relatively simple musical and vocal performances, and the challenges of isolating the vast range of instrumentals and vocals in recorded or live musical performances (such as in symphonic music) are still beyond the capabilities of even the best AI-based tools. But the potential of solutions that can do this is great, which explains why researchers such as Eriksholm, the R&D center for hearing aid manufacturer Oticon, are exploring use of sophisticated AI techniques such as convolutional recurrent neural networks to distinguish voices and other sounds in natural environments, with the hope that they could be generalizable to real-world musical contexts as well.
  • Robotic audio engineering: Humans inhabit a sonic cacophony that expresses the complexity of our lived and built experience as a social species. As an aspect of ambient engineering, audio can be sculpted into diverse forms and for myriad purposes. For example, acoustical engineering is a well-established discipline with applications in urban planning, architecture and interior design, facilities monitoring and management, noise-pollution mitigation, and other disciplines necessary to quality of life, privacy, and other societal concerns. Robotics could conceivably handle more of these functions through algorithmic tools in such areas as automated audio mastering and refinement (e.g., LANDR, which uses AI to automate setting of audio parameters). Along these lines, AI could also be used to build high-quality audio outputs from inputs gathered from low-quality microphones; to emulate analog audio where there are only digital audio inputs; to add binaural effects such as simulated stereo; to add reverb, echo, and Doppler shifts to provide spatial heft to sound outputs; and to apply selective noise cancellation, boosting, and multi-tracking to create an immersive audio tableau.

However, robots won’t truly be able to perform these amazing feats autonomously until they can emulate humans’ organic audio-processing abilities at a fine-grained level. With that in mind, I took great interest in a recent blog post in which AI engineer Daniel Rothmann outlines the layers of neuromorphic audio processing needed to equip robots with the ability to comprehend audio material with human-like sophistication.

What’s clear is that the technology needed to perfect humanlike robot audio capabilities has not yet been entirely worked out by the AI community. In his article, Rothmann mapped out a framework for robotic audio processing that aligns roughly with the architecture of the human auditory system:

  • Response of the eardrum to the air pressure fluctuations that produce our perception of sounds (as captured through raw digital audio samples)
  • Representation of those fluctuations on the fluid-filled cochlea organs in our inner ears (as processed through gammatone filterbanks)
  • Persistence of those representations in the sensory memories associated with auditory information (as processed through dilated circular buffers)
  • Encoding of those sensory memories in our central nervous system (as processed through long short-term memory encoders)
  • Processing of those neural encodings in the auditory cortices of our brains (as processed through neuromorphic circuitry that can distill it all into valid cognitive, affective, sensory, and other meaningful patterns)
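
To make the first three layers a little more concrete, here is a minimal sketch in plain Python. It is not Rothmann's implementation: the function and class names (erb, gammatone_ir, band_energy, filterbank_frame, DilatedBuffer) are hypothetical, the filter uses the textbook fourth-order gammatone impulse response with the Glasberg and Moore bandwidth approximation, and the "dilated" read (newest frame first, then every Nth older frame) is only one plausible interpretation of a dilated circular buffer. The long short-term memory encoder and the cortical layer are left out entirely.

```python
import math
from collections import deque


def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def gammatone_ir(fc_hz, sample_rate, duration_s=0.025, order=4):
    """Peak-normalised impulse response of one gammatone filter centred on fc_hz."""
    b = 1.019 * erb(fc_hz)  # bandwidth parameter of the filter
    n_taps = int(duration_s * sample_rate)
    ir = []
    for i in range(n_taps):
        t = i / sample_rate
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        ir.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]


def band_energy(samples, ir):
    """Mean squared output of `samples` convolved with `ir` (direct convolution)."""
    total = 0.0
    for n in range(len(samples)):
        acc = 0.0
        for k in range(min(n + 1, len(ir))):
            acc += ir[k] * samples[n - k]
        total += acc * acc
    return total / len(samples)


def filterbank_frame(samples, centre_freqs, sample_rate):
    """One 'cochlear' frame: the energy in each gammatone band."""
    return [band_energy(samples, gammatone_ir(fc, sample_rate))
            for fc in centre_freqs]


class DilatedBuffer:
    """Fixed-size circular buffer of frames, read back at a coarser time step.

    Assumption: 'dilation' here just means skipping older frames on read.
    """

    def __init__(self, capacity, dilation):
        self.frames = deque(maxlen=capacity)  # oldest frames fall off automatically
        self.dilation = dilation

    def push(self, frame):
        self.frames.append(frame)

    def read(self):
        """Newest frame first, then every `dilation`-th older frame."""
        return list(self.frames)[::-1][::self.dilation]


# Usage: a short 1 kHz tone, analysed in three bands and stored as one frame.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * i / sample_rate) for i in range(400)]
memory = DilatedBuffer(capacity=8, dilation=2)
memory.push(filterbank_frame(tone, [250.0, 1000.0, 3000.0], sample_rate))
```

In a fuller system, frames read from buffers like this would be what the memory encoder consumes; the direct convolution used here is chosen for readability, not speed.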

Teaching robots to speak fluent human will be no easy task. It is far more than a matter of building AI that is grounded in computational linguistics, machine translation, and situational awareness.

It goes beyond words. Giving robots human fluency requires that we embody them with mastery over the full range of the human faculties for shaping the sonic environments in which we live.
