My guess is that the path of least resistance in this case is to use a third-party audio recognition library in combination with a high-level language (such as Java, or one of the .NET family languages such as C# or VB.NET).
You can start by doing some research in the areas of Digital Signal Processing (DSP) and Audio Recognition.
When you find a library or framework that has the capabilities you're interested in, and bindings in your language of choice, start implementing with it.
See MARF (a Java library) and maybe Microsoft's work in this area with the System.Speech.Recognition namespace (which, if I remember correctly, has been integrated into the newer Windows operating systems).
EDIT - Desktop vs. Run From Web
In the comments you asked about using Flash or Silverlight so that your solution can work both on the desktop and from the web.
First off, I would like to point out that both Flash and Silverlight actually run on the client computer. The difference is that they run in the context of a web browser, and that the user doesn't have to install the application. Otherwise they are not much different from a desktop application, and the user obviously has to have the Flash or Silverlight plugin installed for their browser.
If that's what you're after (i.e. the user not having to install your application), then you can look into Flash, Silverlight or Java Web Start. Actually, Java Web Start would probably be a good candidate because you could leverage the MARF framework.
However, if you do decide to go with Flash, Silverlight, or Java Web Start, there are some security issues that you might have to deal with, because accessing client system resources is bound to require privileges that most "web-based apps" don't typically need.
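To give a rough idea of what that looks like on the Java side, here is a minimal sketch that probes whether the application is allowed to open an audio capture line using the standard javax.sound.sampled API. I'm assuming the client resource in question is the microphone; the class name MicAccessCheck, the Result enum, and the 16 kHz / 16-bit / mono format are just illustrative choices, not anything MARF or Java Web Start requires.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

// Hypothetical helper: reports whether a capture line can be opened.
public class MicAccessCheck {

    enum Result { OK, NO_CAPTURE_DEVICE, DEVICE_BUSY, DENIED_BY_SECURITY_POLICY }

    static Result probeMicrophone() {
        // Assumed format: 16 kHz, 16-bit, mono, signed, little-endian PCM.
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
        try {
            if (!AudioSystem.isLineSupported(info)) {
                return Result.NO_CAPTURE_DEVICE;
            }
            TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
            try {
                line.open(format);   // may be refused in a restricted environment
                return Result.OK;
            } finally {
                line.close();        // harmless if the line never opened
            }
        } catch (SecurityException e) {
            // The runtime's security policy did not allow audio capture.
            return Result.DENIED_BY_SECURITY_POLICY;
        } catch (LineUnavailableException e) {
            // A matching line exists but is in use or otherwise unavailable.
            return Result.DEVICE_BUSY;
        } catch (IllegalArgumentException e) {
            // No installed mixer supports the requested line.
            return Result.NO_CAPTURE_DEVICE;
        }
    }

    public static void main(String[] args) {
        System.out.println("Microphone probe: " + probeMicrophone());
    }
}
```

If the probe comes back as DENIED_BY_SECURITY_POLICY, the application most likely needs to request extra permissions (or be signed) before it can record, which is the kind of hurdle I mean.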