Kristen Kersh
Niamh Parsley
Robert Brogan

Listen up: designing for voice

While speech-to-text recognition has largely improved, voice-controlled services still lack the sophistication that people expect when communicating through a nuanced medium like speech. Even if this level of sophistication is attained, and the services understand and respond exactly as we expect them to, the challenge of intimacy remains.

When common interaction with phones shifted from calls to text, we created a controversial-yet-accepted balance of interacting with people directly while multitasking with our pocket computers. Voice interaction, however, necessitates a more public display of that human-computer interaction, one that is so uncomfortable that it directly inhibits its use. Think of the times you have used voice input on a phone: was it in public? Or in private with people around you? Or (more likely) in solitary settings?

Users have long struggled with gaps in their relationships to their devices – and this is most obvious with voice commands. Voice interactions have grown more sophisticated to bridge some of these gaps; they also quickly and directly communicate with people and many of their smart objects. Yet, this evolution is only in its nascent stages. There is still a long way to go.

At Fjord, we too have grappled with the question of how you properly design for voice. So we took to our Makeshop to play around and experiment with materials and scenarios to see what we could cook up. And voila! Out popped two critical things for designers when designing for voice: 1) important considerations to take under advisement and 2) best practices for actually doing it right.


“It’s not you, it’s me” 

We speak to computers differently than we speak to humans:

Humans have to re-learn how to communicate with machines. It is no longer optimal to speak queries or commands in the form of keywords like the old telephone prompts trained us to do. By moving away from this menu-prompt style of interaction, we can create affordances and produce models that let users communicate naturally with their devices.

“Can you hear me now?”

Voice commands are not ideal for all situations:

Voice command interfaces introduce a deeper level of intimacy with a device than we currently have with solely screen-based interactions. Not all environments are conducive to voice command interactions, though, and not every need can be appropriately addressed with voice commands.

For example, we found that hotels are appropriate settings for voice technology-enabled devices such as the Amazon Echo (which connects to the Alexa Voice Service) due to the individual and accommodating nature of the hotel room experience. Say a hotel guest is looking for information about local businesses and restaurants in an unfamiliar area. The guest can request restaurant recommendations from Alexa based on personal information provided during booking, such as data from services like Foursquare, Uber, or Seamless. Once the guest selects the location they would like to visit, Alexa can offer and set up transportation options.

Voice-enabled devices can also help hotel patrons communicate with the hotel staff. A person can request items such as toothpaste, shampoo, or extra towels from the hotel concierge using the voice technology in their room. The technology can also be used to avoid communicating with real people in the case of potentially embarrassing queries such as how to turn up the air conditioning or how to use the shower.

“When you don’t feel like listening”

Use supplemental UI when appropriate:

While the use of non-visual user interfaces, such as voice commands, allows us to reimagine information retrieval, there are still instances where supplemental visuals are beneficial. We found this to be the case when listening to Alexa caused too much user fatigue, yet the supplemental information provided on a screen was useful.

“Let them know that it works”

Yay, your product works, but does your user know that?

Providing the user with visual, haptic, or aural feedback to register that a device has received the input assures the user that their expectations of functionality are met. Although many designers are gravitating toward seamless interface design, we assert that with relatively novel interactions such as voice, smooth seams can guide the user in understanding the system without confusion.

A user will trust a system that not only provides feedback that they are being understood, but also provides ways to correct errors. A system should clearly identify an error and offer actionable ways to address it. The point of error, confusion, or misunderstanding should be isolated and identified by the system in order to give the user the opportunity to correct it.
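As a rough illustration of this idea, here is a minimal Python sketch of a reply that confirms what was understood, isolates the part of a request that was not, and offers ways to correct it. The `KNOWN_ITEMS` set, the `build_reply` function, and the wording of the replies are all hypothetical; they are not drawn from our prototype or from any real voice platform.

```python
# Hypothetical sketch: isolate the misunderstood part of a request
# and offer actionable corrections. Names and data are illustrative only.

KNOWN_ITEMS = {"toothpaste", "shampoo", "towels"}


def build_reply(requested_items):
    """Return a spoken reply that confirms what was understood and
    names exactly which items were not, with suggested next steps."""
    understood = [item for item in requested_items if item in KNOWN_ITEMS]
    unknown = [item for item in requested_items if item not in KNOWN_ITEMS]

    parts = []
    if understood:
        # Feedback: tell the user what the system registered.
        parts.append("Okay, I will send up " + " and ".join(understood) + ".")
    if unknown:
        # Error handling: name the point of confusion and offer a way forward.
        options = ", ".join(sorted(KNOWN_ITEMS))
        parts.append(
            "I did not recognize " + " or ".join(unknown) + ". "
            "You can ask for " + options + ", or say 'front desk' to speak to staff."
        )
    if not parts:
        parts.append("I did not catch a request. Try saying 'send up towels'.")
    return " ".join(parts)


print(build_reply(["towels", "slippers"]))
```

The point of the sketch is that the reply never fails as a whole: the recognized part of the request is confirmed, and only the unrecognized part is flagged, along with concrete alternatives.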

“All aboard!”

Welcome new users to the system:

Through the use of a simple and clear onboarding system or a quick prompt, a user should be able to communicate effectively with the system and have a clear understanding of its capabilities. The system should orient the user and allow for user discovery.

Clear and intuitive commands:

With an understanding of human intonation and other natural speech indicators, we are able to adjust the cadence of our voice-controlled devices to respond appropriately to user commands. The programs should be able to intuit simple commands based on the natural language input.

Expectations of capabilities:

User expectations of device capabilities can greatly affect a user’s experience with the device. This is especially true with interfaces such as voice that attempt to mimic human behavior to a certain extent. The higher the level of a device’s human likeness, the more human a user will expect it to behave. This will impact a user’s perception of the tool and potentially lead to unmet expectations.

Aim to create natural interactions that evoke a “human” feeling. The best practice for the time being should be to achieve a more human interaction, but not human simulation. As opposed to focusing on perfecting voice simulation, the focus should be placed on vocabulary and the way of speaking. The priority should be the conversational aspect of the interaction rather than the likeness of a human voice.

A how-to guide for designing for voice 

The integration of computer voice interaction will be seen in homes, hotels, classrooms, hospitals, cars and beyond in years to come. With the desire to integrate new capabilities throughout our daily lives, the challenge for designers is to question and validate which settings are appropriate. Advances in machine learning have made it easier to create more natural interfaces, and the rapid expansion and implementation of natural language processing helps get us there.

Through the development process, we have identified some guideposts between development and design that are pivotal to designing stellar voice command interfaces. These guidelines should help get you started:

1. Uncanny valley 

Users will expect certain functionality and interactions from systems depending on how human or machine they may seem. If a system too closely resembles a human, it will no longer be perceived as an advanced system, but instead as a defective human.

2. Verbosity

It is important to gauge when to strive for deep learning versus language processing. In order to avoid user fatigue, systems should adjust to provide appropriate levels of information.

3. Recognition versus recall

To limit frustration and cognitive load for the user, make actions, options, and information readily available. The user should not have to recall information from memory, but should be given the opportunity to recognize it when it is presented.

4. Targeted interactions

Similar to “recognition over recall,” targeted interactions focus on requests and tasks versus information exploration. Voice systems should be designed to primarily accommodate targeted interactions.

5. Supplemental UI

Systems that rely on non-visual user interfaces should utilize supplemental interfaces when necessary to convey additional information that is not appropriate for voice-based interfaces, as an alternative to information omission.

6. Accessibility

Systems and interactions should be designed to be usable by many people with varied abilities. Differing speech patterns, languages, and abilities should be accounted for and even trigger training moments for the system.

7. Feedback

The system status should be apparent to users when appropriate, such as during interactions. Error messages should be expressed in plain language with actionable solutions.
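To make guidelines 2 and 5 a little more concrete, here is a minimal Python sketch of how a system might limit what it speaks and hand the rest to a supplemental screen rather than omit it. The `plan_response` function, the `has_screen` flag, and the limit of three spoken items are assumptions made for illustration; they are not figures from our experiments or part of any real voice service.

```python
# Hypothetical sketch: keep spoken output short and move the overflow
# to a supplemental screen when one exists. The limit of 3 is an assumption.

def plan_response(results, has_screen, max_spoken=3):
    """Decide what to say aloud and what, if anything, to show on a screen."""
    if not results:
        return {"speech": "I did not find anything.", "screen": None}

    spoken = results[:max_spoken]
    remaining = len(results) - len(spoken)
    speech = "I found " + ", ".join(spoken) + "."
    screen = None

    if remaining > 0:
        if has_screen:
            # Supplemental UI: show the full list instead of reading it all.
            screen = list(results)
            speech += f" I have put all {len(results)} options on the screen."
        else:
            # Voice only: offer the rest on request instead of omitting it.
            speech += f" There are {remaining} more. Say 'more' to hear them."
    return {"speech": speech, "screen": screen}


restaurants = ["Luca's", "Noodle Bar", "The Grill", "Cafe Sol", "Tapas 21"]
print(plan_response(restaurants, has_screen=True)["speech"])
```

With a screen available, the spoken reply stays short and the full list appears visually; without one, the user is told how many options remain and how to ask for them.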
