I am writing an application that will behave similarly to the existing voice recognition but will send the sound data to a proprietary web service to perform the speech recognition. I am using the standard MediaRecorder (AMR-NB encoded), which seems well suited to speech recognition. The only data it provides while recording is the amplitude, via the getMaxAmplitude() method.

I am trying to detect when the person starts to talk, so that once the person has stopped talking for about 2 seconds I can send the sound data to the web service. Right now I am using a fixed amplitude threshold: if the amplitude goes over a value (e.g. 1500), I assume the person is speaking. My concern is that the amplitude levels may vary by device (e.g. Nexus One vs. Droid), so I am looking for a more standard approach that can be derived from the amplitude values.
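For reference, the start-then-silence logic described above could be sketched roughly as follows. This is only an illustration: the class name `EndOfSpeechDetector` and its parameters are hypothetical, and the readings are assumed to come from polling `MediaRecorder.getMaxAmplitude()` on a timer, with the caller supplying the timestamps.

```java
/** Hypothetical helper: reports when speech has started and then stopped for a while. */
final class EndOfSpeechDetector {
    private final int threshold;          // amplitude above this counts as speech (e.g. 1500)
    private final long silenceTimeoutMs;  // required silence after speech (e.g. 2000 ms)
    private boolean speechStarted = false;
    private long lastSpeechMs = 0;

    EndOfSpeechDetector(int threshold, long silenceTimeoutMs) {
        this.threshold = threshold;
        this.silenceTimeoutMs = silenceTimeoutMs;
    }

    /**
     * Feed one amplitude reading with its timestamp in milliseconds.
     * Returns true once speech has been heard and the amplitude has then
     * stayed at or below the threshold for at least silenceTimeoutMs.
     */
    boolean onSample(int amplitude, long nowMs) {
        if (amplitude > threshold) {
            speechStarted = true;
            lastSpeechMs = nowMs;   // restart the silence timer
            return false;
        }
        return speechStarted && (nowMs - lastSpeechMs) >= silenceTimeoutMs;
    }
}
```

Silence before the first loud reading never triggers the detector, so the recording is only sent after something has actually been said.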

P.S. I looked at graphing-amplitude but it doesn't provide a way to do it with just the amplitude.

A: 

Well, this might not be of much help, but how about starting by measuring the ambient (offset) noise that the device's microphone picks up in the application, and applying the threshold dynamically based on that? That way it would adapt to the different devices' microphones and also to the environment the user is in at a given time.
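A minimal sketch of that idea, under some assumptions: the class name `AdaptiveThreshold`, the multiplier, and the lower bound are all hypothetical, and the calibration readings are assumed to be `MediaRecorder.getMaxAmplitude()` values collected for a short while before the user starts talking.

```java
/** Hypothetical helper: derives a speech threshold from ambient noise readings. */
final class AdaptiveThreshold {
    private final double factor;     // how far above the noise floor counts as speech (assumed value)
    private final int minThreshold;  // lower bound so a very quiet room does not yield a tiny threshold
    private long sum = 0;
    private int count = 0;

    AdaptiveThreshold(double factor, int minThreshold) {
        this.factor = factor;
        this.minThreshold = minThreshold;
    }

    /** Feed one amplitude reading taken while the user is assumed to be silent. */
    void calibrate(int amplitude) {
        sum += amplitude;
        count++;
    }

    /** Mean calibration amplitude times the factor, never below minThreshold. */
    int threshold() {
        if (count == 0) {
            return minThreshold;
        }
        double noiseFloor = (double) sum / count;
        return Math.max(minThreshold, (int) Math.round(noiseFloor * factor));
    }

    boolean isSpeech(int amplitude) {
        return amplitude > threshold();
    }
}
```

Note that `getMaxAmplitude()` returns 0 the first time it is called, so that first reading is probably best left out of the calibration.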

Luis Miguel
This is pretty much what I am thinking of doing. Get the initial amplitude level and then apply a threshold based on that.
Eli
A: 

Most smartphones come with a proximity sensor, and Android has an API for using these sensors. This would be adequate for the job you described: when the user moves the phone near their ear, the app can start recording. It should be easy enough.

Sensor class for android

anto8421
I don't think I can use that since the person can just talk to the phone without putting it close to his/her ear.
Eli
A: 

1500 is too low a number. Measuring the change in amplitude will work better. However, it will still result in misdetections.
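One possible reading of "measuring the change in amplitude" is to compare each reading against a slowly moving baseline instead of a fixed number. The sketch below is an assumption-laden illustration, not a tested recipe: `AmplitudeChangeDetector`, the smoothing weight, the rise ratio, and the baseline floor are all hypothetical choices.

```java
/** Hypothetical helper: flags readings that jump well above a running baseline. */
final class AmplitudeChangeDetector {
    private final double alpha;        // smoothing weight for the baseline, between 0 and 1 (assumed)
    private final double riseRatio;    // a reading above baseline * riseRatio counts as a jump
    private final double minBaseline;  // floor, since getMaxAmplitude() can return 0
    private double baseline = -1;      // negative means no reading has been seen yet

    AmplitudeChangeDetector(double alpha, double riseRatio, double minBaseline) {
        this.alpha = alpha;
        this.riseRatio = riseRatio;
        this.minBaseline = minBaseline;
    }

    /** Feed one amplitude reading; returns true if it jumped above the baseline. */
    boolean onSample(int amplitude) {
        if (baseline < 0) {
            baseline = amplitude;   // first reading only seeds the baseline
            return false;
        }
        boolean jumped = amplitude > Math.max(baseline, minBaseline) * riseRatio;
        if (!jumped) {
            // Only quiet readings update the baseline, so speech does not raise it.
            baseline = alpha * amplitude + (1 - alpha) * baseline;
        }
        return jumped;
    }
}
```

Because the baseline keeps following the quiet readings, this adapts to the device and the room over time, but a sudden non-speech noise will still register as a jump.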

I fear the only way to solve this problem is to figure out how to recognize a simple word or tone rather than simply detect noise.

gregm
This is true. I will have to adjust the threshold, but it will have to be sensitive enough that it does not miss any spoken words, even if we get false detections. It will have to be blind listening, since I don't know of a way to recognize a word with just the amplitude as input.
Eli