I am working on a project for my computer security class and I have a couple questions. I had an idea to write a program that would search the whole hard drive looking for email addresses. I am just looking for addresses stored in plain text since it would be hard to find anything otherwise. I figured the best way to find addresses would be to use a regular expression.
I wrote an application in C# that works fairly well but it I would like to see if anyone has any better ideas. I am completely up for writing this in another language since I'm assuming C# isn't the best for this type of thing. So far the application I created just starts at the C:/ and recursively locates all files on the drive skipping those that aren't accessible. It also skips all common image, video, audio, compressed, and files over 512mb. This speeds it up quite a bit but there is a small chance that a large file could contain something useful. It takes about 12 seconds to generate the list of files and I'm guessing about an hour to check them all. One downside is that it uses about 50% cpu while scanning.
I'm looking for ideas on how to improve the search. Is there a faster way, a more efficient way, a more thorough way, things like that? I was trying to think if there was any way that you could tell if the file would contain plain text strings or not. Just let me know if you have any cool ideas. Thanks.