Siri, Voice, GUIs, and Programs

GUIs aren’t always the best interface. Most people have probably had the unpleasant experience of hunting for a tool buried in a pile of menus whose organization made little sense. In an application of any real complexity, there’s usually a plethora of settings but limited screen space. Once the interface is laid out, every change means thinking about its effect on the interface as a whole, which can add up to a decent amount of work. A text-based environment solves some of these issues. Since the user just needs to know the correct text commands and isn’t constrained by screen size, the interface can more easily and naturally cover a wider scope.

The reason the everyday user doesn’t use a text-based interface is the learning curve. Users may never get past the strictness required to operate text-based tools, and as far as I know, no libraries exist that make it easy for programmers to deal with ambiguity. Make a mistake and the computer can only spit out a somewhat helpful error message. How helpful that message is depends on the user’s familiarity with the system. There’s no magical “did you mean this?” response.
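To make the “did you mean this?” idea concrete, here is a minimal sketch in Python. It uses the standard library’s difflib.get_close_matches, which only catches near-miss spellings; it does nothing for the harder problem of ambiguous phrasing. The command list and the respond function are hypothetical names made up for illustration.

```python
import difflib

# Hypothetical command set for an imaginary text-based tool.
KNOWN_COMMANDS = ["commit", "checkout", "status", "branch", "merge"]


def respond(command):
    """Run a known command, or suggest the closest known spelling."""
    if command in KNOWN_COMMANDS:
        return f"running {command}"
    # get_close_matches returns the best candidates first; the cutoff of 0.6
    # is an arbitrary similarity threshold chosen for this sketch.
    matches = difflib.get_close_matches(command, KNOWN_COMMANDS, n=1, cutoff=0.6)
    if matches:
        return f"unknown command '{command}'; did you mean '{matches[0]}'?"
    return f"unknown command '{command}'"


if __name__ == "__main__":
    print(respond("comit"))
    print(respond("status"))
```

A typo such as "comit" gets a suggestion instead of a bare error, but a differently worded request ("save my changes") would still fall through, which is the strictness problem described above.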

Siri is a foray into this world: speech maps directly into text, and once it is in text form, the same design principles can be applied. From what I’ve read, the number of commands is limited and it trips up from time to time, but Apple seems to have done a good job. If nothing else, they’ve thrown their marketing behind exposing the casual user to this type of interface. The biggest problem with previous systems, poor speech recognition aside, was the strictness of the commands. When the device can’t understand the user even though the user is pretty sure it should, the experience falls apart. Siri seems to handle a bunch of different cases very well, and I hope it can break past this sticking point. One huge advantage is that Apple has a large userbase and all the questions are sent back to the server. My reasoning is that, with that data, all the pesky edge cases can eventually be covered. It may take some time, but it’s not an impossible task. Then maybe we’ll have that awesome text-based system.

Update October 31: An article at Swombat articulates the point I’m trying to make pretty well, and there is a great discussion on Hacker News.

After reading some of the discussions, I feel that I’ve underestimated the difficulty of solving the edge cases. I can certainly see how increasing the domain size can lead to a larger number of errors. I need to look more closely at the research done in this area. I may be exposing my ignorance further, but I do believe there has been some innovation around Siri. In the pursuit of making Siri interoperate with other services, people have been using SMS. By setting up a contact whose phone number belongs to a bot and sending text messages to it, services can operate on the texts being sent. Essentially, people are creating specific domains that are blocked off from each other. For example, we can imagine a contact called Metro handling everything to do with metro schedules, so that any ambiguity would be confined to that particular domain.
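The contact-as-domain idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the contact names, handler functions, and send_text are hypothetical, and a real bot would sit behind an SMS gateway and query an actual schedule service rather than return canned strings.

```python
def metro_handler(text):
    """Hypothetical Metro bot: it only has to understand schedule questions."""
    words = text.lower().split()
    if "next" in words:
        return "metro: looking up the next departure"
    return "metro: sorry, I only understand schedule questions"


def weather_handler(text):
    """Hypothetical Weather bot, a separate domain with its own vocabulary."""
    return "weather: looking up the forecast"


# Each contact name stands in for a phone number owned by one bot.
CONTACTS = {"Metro": metro_handler, "Weather": weather_handler}


def send_text(contact, text):
    """Deliver a message to the bot behind a contact, if one is registered."""
    handler = CONTACTS.get(contact)
    if handler is None:
        return f"no bot registered for contact '{contact}'"
    return handler(text)


if __name__ == "__main__":
    print(send_text("Metro", "When is the next train?"))
```

Because the user picks the contact first, each handler only needs to resolve ambiguity within its own small vocabulary; the word "next" never has to be disambiguated between trains and forecasts.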
