I’m really interested in these. We’ve made some Alexa skills at home, and I think there’s huge, huge potential for opening up access to arts and cultural venues through voice interfaces.
Let me just explain where I’m coming from:
I have a fair whack of home automation set up. My partner, Stuart Turner, is quadriplegic and has 24-hour care. So, we have a random assortment of people coming into the house, having to quickly learn how it all works, twenty-four hours a day! It’s a little crazy – but also a constant supply of user test subjects.
So, I’ve seen first-hand how much more accessible voice interfaces can be to people with limited computer experience.
Stuart and I both work in tech, so we access our home automations through our beloved and ever-present computers (Stuart can move one finger a few millimetres; he uses voice and a binary switch).
For the care workers, at first, we taped some old iPads to the walls and thought that would work fine. Our very wonderful care workers did not find it fine.
They didn't like turning lights on and off on the app, or checking states in Home Assistant, or any of the things that were routinely expected of them.
Frankly, our expectations were unrealistic and the interface wasn't user-friendly enough. It was a problem because they would do things like turning the lights off at the wall – which broke access for Stuart. We made a dedicated network and let them use their own phones (for familiarity; which is a big issue). This worked a little better, but not much.
Then, when we put in the Amazon Alexa and Google Home hubs, it instantly opened up access to a lot of care workers who were not confident or comfortable on the screens. Massive improvement!
Oooohhh, I like this! Pretty much all care workers are on voice interfaces.
Why our care workers prefer voice interfaces:
- Screen interfaces have too many steps
- Can’t remember which apps do what on phones / tablets
- Scared of pressing the wrong button on screens (a BIG issue, which even plenty of reassurance did not help)
- Feels safer to get things wrong – it says sorry, and you try again
- If they don’t know what to do it's easy to ask Alexa
I would not have guessed any of this, so I had to find it out through testing. But now I do know it, I want to build great voice interfaces that more people can use.
Okay, life story over – back to Léonie …
Léonie took us through the history of voice in computing – from speech production to text-to-speech and speech-to-text – it’s been a work in progress for a surprisingly long time.
I was pretty amazed by the Voder from 1939:
The thing that really struck me about the Voder was the expression in the speech.
It’s rudimentary, to be sure, but it’s clear that even then, engineers understood that designing really engaging, human, and friendly voice interfaces is going to be hugely about designing expression. It’s the graphic design of voice.
And the great news is we can do that now!
In an amazingly instructive and useful talk, Léonie covered all the existing standards for Voice User Interfaces (VUI) and then went into some depth on Speech Synthesis Markup Language (SSML) – including expressive attributes.
This means we can mark up ways of saying things. We can get excited about good things, and be sombre about bad things. We can be reassuring, or energising, or funny.
I think this is so important for access as it’s going to make the way we communicate that much richer.
My rough notes:
Rate, pitch, volume. We can use this to design some expression and hierarchy into our content. We can speed up for less important data lists (like repeated content) and slow down for key points.
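As a rough sketch of how that pacing might look in SSML (the specific rate, pitch, and volume values here are my own illustration, not from the talk):

```xml
<speak>
    <prosody rate="fast" volume="soft">
        This list of repeated, less important items can go by quickly.
    </prosody>
    <prosody rate="slow" pitch="low" volume="loud">
        But this key point is slowed right down for emphasis.
    </prosody>
</speak>
```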
phoneme, lang, voices
SSML allows us to design pronunciation and choose voices.
We can say particular words in a French accent, for example, using lang.
We can choose to speak as a man or woman – there are lots of different voices.
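A sketch of both together – the voice name here is one of Amazon's stock voices, and the French phrase is just my example:

```xml
<speak>
    The French word for cat is
    <lang xml:lang="fr-FR">chat</lang>.
    <voice name="Brian">
        And this sentence is spoken in a different voice entirely.
    </voice>
</speak>
```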
At the moment I think Amazon only has a general UK voice, but they (and competitors, including the BBC) are developing regional voices. This means our regional venues could eventually all speak in their own voices, something I'm really excited about.
This is a culture issue on one level, but it's also super-practical. I don't have great hearing myself, and am regularly maddened by the way the train announcer mispronounces so many of the stops on the Leeds to Manchester line. It's hard enough hearing what's being said even without the mispronunciation!
With SSML we can design pronunciation down at the phoneme level. I'm sure a lot of users will appreciate that kind of detail and polish on place names, event titles, and personal names.
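For instance, Slaithwaite on that line is locally pronounced "Slawit" – something like this (my IPA string is approximate):

```xml
<speak>
    The next stop is
    <phoneme alphabet="ipa" ph="ˈslaʊɪt">Slaithwaite</phoneme>.
</speak>
```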
We can unpack shorthand, explain terms of art, and identify which meaning of a word we're using. We can also decide how to say things like dates:
The following example is spoken as "The tenth of September":
```xml
<speak>
    <say-as interpret-as="date" format="dm">10-9</say-as>
</speak>
```
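Unpacking shorthand works the same way with the sub element – the venue in this example is my own:

```xml
<speak>
    The exhibition opens at the
    <sub alias="Victoria and Albert Museum">V&amp;A</sub>
    next month.
</speak>
```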
SSML is a standard, but Alexa also has custom expressions. Léonie also demonstrated amazon:emotion (US only for now) and speechcon:
```xml
<speak>
    Here is an example of a speechcon.
    <say-as interpret-as="interjection">abracadabra!</say-as>.
</speak>
```
Speechcons are interjections that have been designed by Amazon.
There's a whole list of 'em here: UK Interjections – I particularly like cheerio and good grief.
Speech CSS
And just as a passing aside she mentioned that speech CSS may be back!
I designed some speech CSS for screenreaders back in 2010 or so, but at that time none of the screenreaders picked up these rules and we never took it further.
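For anyone who hasn't seen it, rules using the CSS Speech Module properties look something like this (the selector and values are just a sketch):

```css
/* A sketch using CSS Speech Module properties */
.warning {
    voice-family: female;
    voice-rate: slow;
    voice-volume: loud;
    pause-after: strong;
}
```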
I’m excited to know this might be revived.