This was originally posted January 14th, 2005 on my blog at robertdot.org before I pulled a lot of content off. I'm reposting because I realized when I got the iPhone 3GS that Apple had implemented almost exactly what I described below with their new voice control.
I was scouring the web looking for a program for my Nokia 3660 that would accept Speech-to-Dial after finding out that the technology is used in at least two phones. No luck finding any reasonable solution at this point.
I did stumble across an article called Speak & Dial by Mike Hogan. Basically, Mike says fewer than 40% of cell phone users use the "speak-to-dial" functionality in their phone.
My problem with the current crop of speak-to-dial is that it isn't true speech recognition. It is the audible equivalent to string matching. The phone says: if what-you-say sounds basically the same as what-you-said-before, then we have a match, and I'll dial the contact associated with it.
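To make the "audible string matching" idea concrete, here is a toy sketch. It assumes each voice tag has already been reduced to a short list of numbers (a made-up feature vector), and it treats a match as "the closest stored tag, if it is close enough." The names `match_voice_tag`, `stored_tags`, and `threshold` are my own inventions for illustration; I don't know how any particular phone actually compares the clips.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_voice_tag(sample, stored_tags, threshold=1.0):
    """Return the name whose stored template is closest to the sample,
    or None if nothing is closer than the threshold (hypothetical value)."""
    best_name, best_dist = None, threshold
    for name, template in stored_tags.items():
        d = distance(sample, template)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name

# Made-up templates standing in for recorded sound clips.
stored_tags = {
    "Erin Lee": [0.1, 0.9, 0.3],
    "Mom": [0.8, 0.2, 0.5],
}
```

Notice that nothing here knows what a word is. It only knows whether two recordings look alike, which is why it can't help with a contact who never got a tag.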
That's not good enough. Most phones are low on memory, so they can only store a certain number of these little sound clips. For example, one may hold 25. That's fine if I have 25 or fewer people in my phone. But I don't. After 3 or 4 months, I don't remember who has a voice tag and who doesn't.
But let's pretend that all phones had super-small terabyte hard drives in them. I can store thousands of voice tags. Then the problem is that I forget how I stored them. Usually, I stick to a standard of "First-Name Last-Name" when I do it, but most people aren't going to think ahead. Further, I know a few different people who have the same name. That might confuse a phone.
So, basically, any speak-to-dial based on the current system is flawed. But I have a solution.
First, we need a basic speech recognition chip. The chip turns voice into phone commands with little or no software overhead. It's all done on-chip.
The second thing we need is a text-to-speech app. This could be a chip, but I think a software program would be fine on most upper-tier phones.
The third thing is a set of commands for dialing. The first will be "call" and the second will be "dial". "Call" alerts the phone that a name will be given. "Dial" alerts the phone that it is going to dial a number.
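As a rough sketch of that command set, assume the recognition chip hands the phone a plain text string. The first word picks the mode and the rest is the name or number. The function name `parse_command` is hypothetical.

```python
def parse_command(utterance):
    """Split recognized text into (command, argument).

    Returns (None, text) when the first word is not "call" or "dial".
    """
    words = utterance.strip().split()
    if not words:
        return (None, "")
    command = words[0].lower()
    if command not in ("call", "dial"):
        return (None, utterance.strip())
    return (command, " ".join(words[1:]))
```

So `parse_command("Call Erin")` gives back `("call", "Erin")`, and anything that doesn't start with one of the two commands is ignored by the dialer.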
If I say "Call," I will then say a name. It can be a first name or last name or nickname or whatever, as long as it is stored in the phone's address book. If I say "Call Erin," the phone looks for any contacts named "Erin" and maybe "Aaron," depending on its sound recognition. If more than one is returned, the phone will prompt me: "Which one: Erin Lee or Erin Poe?" or "Which one: Lee or Poe?" To this I would reply, "Lee." Then the phone would dial Erin Lee. If no matches are found, the phone will say: "Sorry, no matches found."
If multiple numbers were found for the contact, the phone would prompt: "Which number? Cell, Home, Work ..." listing all numbers for that contact in the address book. The user could then reply, "Home."
At any point during a dial, the user should be able to cancel out with a voice command like "Cancel call." This could be predetermined or changed as a setting, so that if the user had a friend with the nickname Cancel Call, the phone could still call him.
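Here is a minimal sketch of that "Call" flow, assuming the address book is a simple list of contacts and the spoken name arrives as text. It only matches exact first or last names; the "Erin" versus "Aaron" sound-alike matching is left out. The contact data, the phone numbers, and the names `find_contacts` and `next_prompt` are all made up for illustration.

```python
CONTACTS = [
    {"first": "Erin", "last": "Lee",
     "numbers": {"cell": "5550101", "home": "5550102"}},
    {"first": "Erin", "last": "Poe", "numbers": {"cell": "5550103"}},
    {"first": "Kathy", "last": "Brodrecht", "numbers": {"home": "5550104"}},
]

def find_contacts(spoken_name, contacts):
    """Return contacts whose first or last name covers every spoken word."""
    spoken = spoken_name.lower().split()
    if not spoken:
        return []
    matches = []
    for contact in contacts:
        names = {contact["first"].lower(), contact["last"].lower()}
        if all(word in names for word in spoken):
            matches.append(contact)
    return matches

def next_prompt(spoken_name, contacts):
    """Decide what the phone should say (or do) next."""
    matches = find_contacts(spoken_name, contacts)
    if not matches:
        return "Sorry, no matches found."
    if len(matches) > 1:
        options = " or ".join(f"{c['first']} {c['last']}" for c in matches)
        return f"Which one: {options}?"
    contact = matches[0]
    if len(contact["numbers"]) > 1:
        labels = ", ".join(label.capitalize() for label in contact["numbers"])
        return f"Which number? {labels}"
    number = next(iter(contact["numbers"].values()))
    return f"Dialing {number}"
```

Saying "Erin" gets the "Which one" prompt, "Erin Lee" gets the "Which number?" prompt because she has two numbers, and "Kathy" dials straight through.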
If I say "Dial," I will then say a phone number. This should NOT be limiting! For example, if I want to dial "911", "2054826876", or "1800555TELLME" it should be able to figure it out. The only requirement is that you actually say numbers or letters. That way it doesn't have to guess if you meant "Night" or "Nite". It should, obviously, translate letters to their keypad equivalent, and understand "pound" and "star". It will then take these spoken commands and send them after a 2- or 3-second pause. If this isn't a standard number format (3, 10, 11, or 12 digits), it would prompt the user (repeating the phone number quickly) to make sure that is what s/he wanted to dial before sending, waiting for a "yes" or "no" answer.
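The "Dial" side could look something like the sketch below. It assumes the recognizer delivers one token per spoken item (a digit, a digit word, a single letter, "pound", or "star"), uses the standard phone keypad letter layout, and counts only digits (not # or *) when checking for a standard length. That last choice is my assumption, as are the names `tokens_to_keys` and `needs_confirmation`.

```python
KEYPAD = {
    "abc": "2", "def": "3", "ghi": "4", "jkl": "5",
    "mno": "6", "pqrs": "7", "tuv": "8", "wxyz": "9",
}
LETTER_TO_DIGIT = {
    letter: digit for letters, digit in KEYPAD.items() for letter in letters
}
WORD_TO_KEY = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "pound": "#", "star": "*",
}
STANDARD_LENGTHS = {3, 10, 11, 12}

def tokens_to_keys(tokens):
    """Translate spoken tokens into the keys the phone should send."""
    keys = []
    for token in tokens:
        t = token.lower()
        if t in WORD_TO_KEY:
            keys.append(WORD_TO_KEY[t])
        elif len(t) == 1 and t in "0123456789":
            keys.append(t)
        elif len(t) == 1 and t in LETTER_TO_DIGIT:
            keys.append(LETTER_TO_DIGIT[t])
        else:
            # Whole words like "Night" are rejected rather than guessed at.
            raise ValueError(f"not a number, letter, pound, or star: {token!r}")
    return "".join(keys)

def needs_confirmation(keys):
    """True when the digit count is not 3, 10, 11, or 12."""
    digit_count = sum(1 for k in keys if k.isdigit())
    return digit_count not in STANDARD_LENGTHS
```

With this, `tokens_to_keys(list("1800555TELLME"))` spells out the letters as keypad digits, and because the result has 13 digits the phone would read it back and wait for a "yes" or "no" before sending.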
The voice dial system should turn off once the call has been placed. This will avoid any voice answering services that use voice recognition, and avoid odd dialing during calls.
Caveats: Unless this chip is internationalized (I18N), you will have to have a different chip for each language. This alone is probably the only reason for staying with the current method. The current method doesn't care if you make clicks and beeps, use gibberish, Japanese, English, or any other language, as long as it can compare waveforms. That means there are no worries about languages.
Salvage: The only way I can think to salvage the current system is by allowing multiple voice tags for each user, and including more storage to accommodate them. As of now, at least on all phones I've dealt with, you can only use one tag per name. If I could, for example, add "Brodrecht", "Kathy", "Kathy Brodrecht", "Brodrecht Kathy", and "Home" to my mom's entry, I wouldn't have to remember which to say. Also, each phone number would need a designation. My T60D did this, but my Nokia 3360 does not. The T60D had an optional voice tag for each type of phone number (mobile, home, fax, work).
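The salvage idea boils down to a many-to-one lookup: several tags point at the same contact, and a tag can optionally carry a number type. In the sketch below, text strings stand in for the recorded sound clips, and the `TagStore` class is a hypothetical name of mine, not anything a phone ships with.

```python
class TagStore:
    """Maps many voice tags to one contact, with an optional number type."""

    def __init__(self):
        self._tags = {}  # tag -> (contact name, number type or None)

    def add(self, tag, contact, number_type=None):
        self._tags[tag.lower()] = (contact, number_type)

    def lookup(self, tag):
        """Return (contact, number_type), or None if the tag is unknown."""
        return self._tags.get(tag.lower())

store = TagStore()
for tag in ("Brodrecht", "Kathy", "Kathy Brodrecht", "Brodrecht Kathy"):
    store.add(tag, "Kathy Brodrecht")
store.add("Home", "Kathy Brodrecht", number_type="home")
```

Any of the five tags finds my mom's entry, and "Home" also says which number to dial. The cost is storage: five clips for one person instead of one.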
This is a first draft of my idea, and is open for your comments / questions to further refine the idea.