Integrating voice and conversational AI

20th Apr 2020

Combining popular voice recognition technologies with conversational AI opens the door to an array of possibilities in the enterprise space.

The race is underway to ensure voice and speech recognition technology remains the dominant interaction method in the smart home market. Companies like Google and Amazon have invested heavily in voice, hoping to make their respective smart assistants as integral to our lives as personal computers and smartphones have been. Just as Google’s search algorithm revolutionized the consumption of information and upended the advertising industry, AI-driven voice computing promises a similar transformation, not just at home, but in the enterprise too. 

The spread of voice-enabled technology is only expected to grow as new services and smart home devices become more readily available, and advances in artificial intelligence are already making impressively futuristic applications a reality. Even so, a recent benchmarking report from Cognilytica found that voice assistants still require a lot of setup work before even half of their responses are acceptable. This is due, in part, to early voice recognition programs being only as good as the programmers who wrote them.

However, things are improving. Thanks to permanent connections to the internet and data centers, the complex mathematical models that power voice recognition technology are able to sift through huge amounts of data that businesses have spent years collecting, in order to learn and recognize different speech patterns.

The latest virtual agents can interpret vocabulary, regional accents, colloquialisms and the context of conversations, by analyzing everything - from recordings of call center agents talking with customers, to thousands of daily interactions with a digital assistant.

How blending conversational AI and voice recognition technology works

A solid conversational AI foundation is essential if any application of voice recognition technology is to be successful. The speech layer needs to be implemented on top of the virtual agent and treated as a regular integration - similar to how you would integrate a third-party application. The process is straightforward: speech is transcribed into text, which is sent to the virtual agent and matched (via conversational AI) to the right intent. Once the correct intent is identified, the virtual agent’s text response is passed back to the speech interface, which delivers it orally to the user.
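As a rough illustration of the speech layer being "just another integration", here is a minimal Python sketch. Everything in it is an assumption made for illustration: `VirtualAgent`, `VoiceChannel`, the keyword matching that stands in for real intent prediction, and the fake transcribe and synthesize functions that stand in for a real speech-to-text and text-to-speech service. None of it represents a specific vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AgentReply:
    intent: str
    text: str


class VirtualAgent:
    """Hypothetical text-only virtual agent.

    Keyword matching is a stand-in for real conversational AI intent prediction.
    """

    def __init__(self, keywords: Dict[str, str], answers: Dict[str, str], fallback: str):
        self.keywords = keywords  # keyword -> intent name
        self.answers = answers    # intent name -> text answer
        self.fallback = fallback

    def reply(self, text: str) -> AgentReply:
        lowered = text.lower()
        for keyword, intent in self.keywords.items():
            if keyword in lowered:
                return AgentReply(intent, self.answers[intent])
        return AgentReply("fallback", self.fallback)


class VoiceChannel:
    """Speech layer wrapped around the agent, like any other integration."""

    def __init__(
        self,
        agent: VirtualAgent,
        transcribe: Callable[[bytes], str],
        synthesize: Callable[[str], bytes],
    ):
        self.agent = agent
        self.transcribe = transcribe
        self.synthesize = synthesize

    def handle(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)      # 1. speech -> text
        reply = self.agent.reply(text)     # 2. text -> intent -> text answer
        return self.synthesize(reply.text)  # 3. text answer -> speech


# Placeholders: a real deployment would call a speech-to-text / text-to-speech service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


agent = VirtualAgent(
    keywords={"loan": "apply_for_loan"},
    answers={"apply_for_loan": "You can apply for a loan in our app."},
    fallback="Sorry, I did not understand that.",
)
channel = VoiceChannel(agent, fake_transcribe, fake_synthesize)
spoken = channel.handle(b"How do I apply for a loan?")
print(spoken.decode("utf-8"))
```

The point of the design is that the virtual agent never knows it is being spoken to: the same text-in, text-out agent serves both chat and voice, and only the channel wrapper changes.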

There are several technologies for voice available on the market today, including Google Assistant, Amazon Alexa and Microsoft Cortana. Each solution is vying for market share in an effort to become the main way that we communicate with the internet in the future. However, my team and I have identified three major pain points that need to be overcome in order for voice to be effectively and successfully integrated with conversational AI:

  • Presenting external links, media and buttons - the majority of chatbots and virtual agents are heavily based on either action links (i.e. buttons) used to move a conversation along, or external links that take a customer away from a conversation entirely. These pose a challenge for voice-only interactions, as do enrichments such as images and videos. Asking Google Home “How do I apply for a loan?” only to be told “Click here to do it” doesn’t make much sense in a voice setting and will ultimately lead to a failed interaction. (Note: products such as Google’s Nest Hub and Amazon’s Echo Show combine smart screen technology with voice integration to tackle some of these issues. However, the adoption rate of these products is drastically lower than that of their voice-only alternatives.)
  • Deciphering subtle linguistic nuances - while voice recognition technology continues to develop, it is still drastically less robust than its chat-based counterpart. Voice assistants are also often unable to distinguish between the subtle linguistic nuances that are common in speech. For instance, understanding the difference between the phrases “I have bought a car” and “I want to buy a car” can already be a difficult exercise for conversational AI, and adding speech on top only increases the complexity.
  • Keeping customers happy - customers will not always follow what’s known as “the happy path” - i.e. they don’t follow the conversation flow to the endpoint you desire. It’s only human nature to jump back and forth between conversation topics, yet this requires specific context action functionality from a conversational AI solution in order to keep up. For example, a user may ask a question about your company values, then follow up by asking “why?”. If the “why?” intent is defined as a context action for that topic, the virtual agent can explain why, say, your company believes people are key to its values. Without context action, the same “why?” intent is triggered but returns a boilerplate answer, which is too general and doesn’t actually answer the question.
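The context action idea in the last point can be sketched in a few lines of Python. This is only an assumed, simplified model: the intent names, the answer texts and the `Conversation` class are hypothetical, and a real conversational AI platform will expose context handling in its own way.

```python
from typing import Dict, Optional, Tuple

# Generic answers, used when no context-specific action applies (illustrative text).
GENERIC_ANSWERS: Dict[str, str] = {
    "company_values": "Our people are key to our values.",
    "why": "Could you tell me what you would like explained?",
}

# Context actions: (previous intent, current intent) -> context-specific answer.
CONTEXT_ACTIONS: Dict[Tuple[str, str], str] = {
    ("company_values", "why"): "We believe people are key to our values because our people deliver everything we do.",
}


class Conversation:
    """Remembers the previous intent so follow-up questions can be answered in context."""

    def __init__(self) -> None:
        self.last_intent: Optional[str] = None

    def respond(self, intent: str) -> str:
        generic = GENERIC_ANSWERS.get(intent, "Sorry, I did not understand that.")
        # Prefer a context action for this (previous, current) pair if one is defined.
        answer = CONTEXT_ACTIONS.get((self.last_intent, intent), generic)
        self.last_intent = intent
        return answer


with_context = Conversation()
with_context.respond("company_values")
print(with_context.respond("why"))   # context-specific explanation

without_context = Conversation()
print(without_context.respond("why"))  # generic boilerplate answer
```

The same "why" intent produces two different answers depending on what came before, which is the behaviour needed to keep up with users who leave the happy path.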

Getting started 

For organisations that are interested in exploring what is possible with voice-integrated conversational AI, Gartner recommends starting simultaneous pilots with both voice-enabled and text-based virtual agents to fully understand the opportunities and limitations of the technology. I would add the following recommendations too.

  • Keep content short and speech-friendly - It’s important to tailor content and responses so that they are better suited to speech integration. Text answers will, in general, need to be kept as brief as possible while still providing pertinent information. They will likely need to be even shorter than in chat, with consecutive answers kept to a minimum, or you run the risk of long, trailing answers when the text is converted to speech.
  • Clearly define your use case - Start with one or two use cases to test the waters and gain experience on how to talk to your customers. Map out limited use cases that actually make sense for voice rather than throwing everything at the wall to see what sticks.
  • Context functionality is crucial - Make sure that your vendor has support for context actions. These are crucial to maintaining natural conversation flows with customers and looping them back around to your desired end goals.
  • Start with a strong language understanding core - Robust language understanding is a must if you want your voice use cases to work well. A strong Natural Language Understanding (NLU) foundation is mandatory, while additional proprietary technologies, such as Automatic Semantic Understanding (ASU), hold the key to helping voice recognition technology reach its full potential. ASU enables a deeper understanding of a customer’s overall intent, allowing conversational AI to understand which words in a request are important, and when. Through this technology, we are able to reduce the risk of false positives to a minimum, which is fundamental to ensuring that a voice assistant can deliver the correct responses.
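To make the first recommendation more concrete, the following Python sketch shows one possible way to adapt an existing chat answer for speech: drop links that cannot be clicked in a voice setting and keep only the first couple of sentences. The function name `to_speech_friendly`, the two-sentence default and the sample answer are all assumptions for illustration, and the sentence splitting is deliberately naive.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# Naive sentence boundary: whitespace that follows ".", "!" or "?".
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def to_speech_friendly(text: str, max_sentences: int = 2) -> str:
    """Strip URLs and trim a chat answer to its first few sentences."""
    without_links = URL_PATTERN.sub("", text)
    collapsed = " ".join(without_links.split())  # normalise whitespace
    sentences = SENTENCE_BOUNDARY.split(collapsed)
    return " ".join(sentences[:max_sentences])


# Made-up chat answer, used only to demonstrate the function.
chat_answer = (
    "You can apply for a loan in the app. "
    "It takes about ten minutes. "
    "More details: https://example.com/loans"
)
print(to_speech_friendly(chat_answer))
```

In practice, rewriting answers specifically for voice will usually give better results than trimming chat content automatically, but a filter like this can act as a safety net during a pilot.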
