views:

116

answers:

3

Hello everybody.

My question regards the existence of a search algorithm for searching source code. In my project, I will have to implement an application that will search through a repository of source code (through a lot of source code files). All the files are from previous projects done within the company. We think that implementing a search software specially aimed at our company's needs will be a good idea.

The software will be done in C#/.NET and should be able to search based on function name, comments, class names, variable names etc. I don't think a simple string search or contains through thousands of files would solve this, so I'll probably need to implement some algorithm to increase the response speed for this.

Thus, does anybody know of any good search algorithms that could be aimed at searching through source code? Examples or articles on those search algorithms would be most welcomed.

Thanks in advance.

Cheers!

+1  A: 

You might consider the SD Source Code Search Engine.

It has everything you requested plus

  • understands the lexical syntax of many, many languages including C#
  • ignores language-specific whitespace; searches nicely across line breaks in languages where linebreaks aren't important
  • provides queries that understand the lexical syntax, which helps cut down false positives (doesn't find text equal to your target identifier in a string or comment unless you insist)
  • can search across multiple lanuages at the same moment-
  • preindexes the source code base to provide seconds-plural-barely searches on tens of thousands of files (much faster than grep at this scale)

The algorithm is "easy" and fits in the following paragraph :-} For each language, build a language accurate lexical analyzer that picks up the values of each token. Build very fast lexical analyzers based on compiled regular expressions so that you can extract all the tokens across huge source code bases in a few minutes. Build an indexer to process all the lexemes into indexed databases by content, retrieving file location positions (line and column, watch out for differences in line counting between different languages [GNU counts form character as a line break, most other langauges don not!]) accurately. Design a query language that lets you express queries over code, such as the following:

 'for' ... I=x ... I=x '<' N>100

so that you can search for interesting combinations, (in this example, for for loops with a loop *I*dentifier variable "x" whose limit is a *N*umeric constant larger than 100 ["..." means "near"]). Tie the query language to the token index data base. Build a GUI that accepts queries, displays hits and takes you to source code for a hit. Add logging facilities so you can show your query results to others easily.

Implementing all this is a lot more work :-{

Ira Baxter
A: 

This isn't an algorithm, but an existing product that may help. SciTools Understand has super fast search, even for a large code base.

(That sounds like a paid endorsement, so I should say I'm not affiliated with the company at all, I just use the product and know of this feature).

fatcat1111
A: 

You might want to look into 'self-indexing' - see this SO question and this DrDobbs article. With self-indexing, you can index a document or corpus such that you can efficiently find regular expression matches over the whole corpus. It's not likely to be simple to implement, though. :)

Nick Johnson