There are a ton of tutorials and blog posts about recording sound from .NET. I have read them and fundamentally understand the various options, but I'm not sure which approach will be the simplest while still meeting my requirements:

Minimal

  • Start and stop of recording controlled by a .NET program
  • Record from default microphone to file
  • Minimize requirements on end-user's computer (i.e. ideally no requirement for latest DirectX)
  • Save to any common, compressed file format

Ideally

  • Remove quiet areas of recording
  • Trigger start/stop of recording based on presence of sound input (record only when there's something to record)
  • Ability to resume recording, appending to previously-saved file

I can work out the details of the implementation and am just looking for advice about the best path to start down given my requirement set.

+4  A: 

I prefer recording audio using the waveIn* API functions (waveInOpen etc.). Although this API is old (15+ years) and slightly more difficult to work with than you might like, it can do everything you mention above (except one), doesn't require DirectX at all, and works on every version of Windows going back to Windows 95 (although .Net doesn't work on anything prior to Windows 98), even including Windows Mobile (this last fact blew my mind when I discovered it).

The one thing it doesn't handle is saving to any common, compressed file format (but I don't think recording with DirectSound, the other major option, handles this either). However, there are a number of .Net-compatible libraries out there that can handle this requirement for you (NAudio comes well-recommended, although I've never used it). One of the advantages of recording with waveIn* (the same advantage accrues to DirectSound) is that you record into memory rather than directly to a file, so it's easy to do whatever you want with the audio (e.g. save it to a file, strip out the quiet parts, filter it via FFT, alter the format, etc.). Many of the .Net-compatible libraries are written to process in-memory buffers instead of, or in addition to, files, so having your audio always in memory is a big benefit.
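For a feel of the library route, here is a rough capture-to-WAV sketch using types from more recent NAudio releases (WaveInEvent and WaveFileWriter); treat the exact names as my assumption rather than something vouched for above, and note that WAV output is uncompressed, so the compressed-format requirement still needs a separate encoding step:

    using System;
    using System.Threading;
    using NAudio.Wave;

    class MicRecorder
    {
        static void Main()
        {
            // Capture the default input device at 44.1 kHz mono.
            var waveIn = new WaveInEvent { WaveFormat = new WaveFormat(44100, 1) };
            var writer = new WaveFileWriter("capture.wav", waveIn.WaveFormat);
            var done = new ManualResetEventSlim();

            // Each DataAvailable event hands you one filled buffer; this is
            // also the natural place to inspect or drop quiet buffers.
            waveIn.DataAvailable += (s, e) => writer.Write(e.Buffer, 0, e.BytesRecorded);
            waveIn.RecordingStopped += (s, e) =>
            {
                writer.Dispose();   // finalizes the WAV header
                waveIn.Dispose();
                done.Set();
            };

            waveIn.StartRecording();
            Console.WriteLine("Recording... press Enter to stop.");
            Console.ReadLine();

            waveIn.StopRecording();
            done.Wait();
        }
    }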

Triggering the starting and stopping of recording can be done, although not in the way you might be thinking. With the waveIn* API, you basically start recording from the default audio source, and the API starts filling up memory buffers with recorded sound. You receive a notification as each buffer is filled, and you can then do whatever you like with each buffer. For actually recording to a file, you can simply scan each buffer as it comes in, and if a buffer is empty (contains no audible sound) you simply discard it without writing the contents to a file.
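To make the discard-quiet-buffers idea concrete, here is a minimal sketch (my illustration, not code from the CodeProject sample below) of a peak-level silence test you could run on each filled buffer before writing it out. It assumes little-endian 16-bit PCM, as set up in the waveInOpen format, and the threshold is a placeholder:

    // Treat a filled buffer as "silent" if no 16-bit sample exceeds the
    // given peak threshold.
    static bool IsSilent(byte[] buffer, int bytesRecorded, int threshold)
    {
        for (int i = 0; i + 1 < bytesRecorded; i += 2)
        {
            // Reassemble each little-endian 16-bit sample.
            short sample = (short)(buffer[i] | (buffer[i + 1] << 8));
            if (Math.Abs((int)sample) > threshold)
                return false;   // audible content - keep this buffer
        }
        return true;            // nothing above threshold - safe to drop
    }

    // In the buffer-done handler (sketch):
    //   if (!IsSilent(data, bytesRecorded, 500))
    //       fileStream.Write(data, 0, bytesRecorded);

The threshold of 500 is arbitrary; as the comments below point out, a bare level check is a fairly crude way of separating speech from background noise.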

Here is a CodeProject sample that shows how to use both the waveIn* and waveOut* APIs:

http://www.codeproject.com/KB/audio-video/cswavrec.aspx?msg=2137882

I've actually worked with this project before in C#, and it works quite well.

MusiGenesis
Thank you for your thoughtful answer. The Code Project article you refer to is one of the sources I looked at more closely. I'll leave this open a while longer to see what other responses I get, but this is certainly useful.
Eric J.
@MusiGenesis: Can you elaborate on "contains no audible sound"? Is it enough to consider the buffer silence if no sample in it exceeds a certain threshold? Roughly how many (milli)seconds of audio does a typical buffer represent for this type of solution?
Eric J.
@Eric J: the size of each buffer is entirely up to you as the programmer. If you use very small buffers, you can have your UI respond visually to audio information very quickly (i.e. with a very low latency), but it will be more susceptible to glitches and interruptions. If you use very large buffers (like, say, 1 second / 44100 samples) you will get fewer glitches, but your UI will be at least 1 second behind what you're hearing.
MusiGenesis
An application I wrote recently did an FFT on each buffer as it came in, and displayed the spectrograph in (close to) realtime. In this application, I used buffers of 2048 samples, which is about 5 milliseconds. I chose 2048 by experimentation, because FFT has to be done on buffers whose length is a power of 2, and the 5 ms latency is small relative to the minimum 15 ms latency you're already going to face because of Windows scheduling.
MusiGenesis
Actually, your latency for something like this is already going to be worse than 15 ms. I learned from Larry Osterman (who is a user on SO and also an engineer on the Windows audio code [!]) that because of the CLR in .Net, your latency can be 250 ms or worse.
MusiGenesis
As far as a threshold for "no audible sound", that's really up to you, also. From experiments that I've done on this sort of thing, the difference in peak sample values between audible speech and background noise is often incredibly small, so an algorithm that determines the difference between the two usually requires a lot more than just the signal level. Most likely you would have to do an FFT, and look for characteristics in the frequency domain that distinguish speech (or whatever other sound you're looking for) from noise.
MusiGenesis
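One lightweight step beyond a bare level check (my illustration, not something prescribed above) is to combine short-term energy with zero-crossing rate: voiced speech tends to show decent energy with relatively few zero crossings, while hiss and broadband noise cross zero much more often. A full frequency-domain analysis, as suggested, is still the more robust approach.

    // Sketch of an energy + zero-crossing-rate heuristic on one buffer of
    // 16-bit PCM samples.  The thresholds are placeholders and would need
    // tuning against real recordings.
    static bool LooksLikeSpeech(short[] samples)
    {
        double sumSquares = 0;
        int zeroCrossings = 0;

        for (int i = 0; i < samples.Length; i++)
        {
            sumSquares += (double)samples[i] * samples[i];
            if (i > 0 && (samples[i] >= 0) != (samples[i - 1] >= 0))
                zeroCrossings++;
        }

        double rms = Math.Sqrt(sumSquares / samples.Length);
        double crossingsPerSample = (double)zeroCrossings / samples.Length;

        // Placeholder thresholds: enough energy to matter, but not the high
        // crossing rate typical of broadband noise.
        return rms > 300 && crossingsPerSample < 0.25;
    }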
Correction: 2048 samples is about 50 ms, not 5 ms (I knew it didn't seem right as I was typing). I just looked at my code again, and I was actually using buffers of 512 samples, which is about 12 ms.
MusiGenesis
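For reference, the buffer-duration arithmetic behind these figures is just samples divided by sample rate (a quick illustrative helper, not code from the thread):

    // Duration of one buffer in milliseconds at a given sample rate.
    //    2048 samples @ 44.1 kHz ->  ~46 ms (the corrected figure above)
    //     512 samples @ 44.1 kHz ->  ~12 ms
    //   44100 samples @ 44.1 kHz -> 1000 ms
    static double BufferDurationMs(int samples, int sampleRate)
    {
        return samples * 1000.0 / sampleRate;
    }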