ansaurus

Question

Do you know of any search algorithms for searching through source code?

Answer 1

+1 A:

You might consider the SD Source Code Search Engine.

It has everything you requested plus

understands the lexical syntax of many, many languages including C#
ignores language-specific whitespace; searches nicely across line breaks in languages where linebreaks aren't important
provides queries that understand the lexical syntax, which helps cut down false positives (doesn't find text equal to your target identifier in a string or comment unless you insist)
can search across multiple lanuages at the same moment-
preindexes the source code base to provide seconds-plural-barely searches on tens of thousands of files (much faster than grep at this scale)

The algorithm is "easy" and fits in the following paragraph :-} For each language, build a language accurate lexical analyzer that picks up the values of each token. Build very fast lexical analyzers based on compiled regular expressions so that you can extract all the tokens across huge source code bases in a few minutes. Build an indexer to process all the lexemes into indexed databases by content, retrieving file location positions (line and column, watch out for differences in line counting between different languages [GNU counts form character as a line break, most other langauges don not!]) accurately. Design a query language that lets you express queries over code, such as the following:

 'for' ... I=x ... I=x '<' N>100

so that you can search for interesting combinations, (in this example, for for loops with a loop *I*dentifier variable "x" whose limit is a *N*umeric constant larger than 100 ["..." means "near"]). Tie the query language to the token index data base. Build a GUI that accepts queries, displays hits and takes you to source code for a hit. Add logging facilities so you can show your query results to others easily.

Implementing all this is a lot more work :-{

Ira Baxter 2010-02-23 17:46:48

Answer 2

A:

This isn't an algorithm, but an existing product that may help. SciTools Understand has super fast search, even for a large code base.

(That sounds like a paid endorsement, so I should say I'm not affiliated with the company at all, I just use the product and know of this feature).

fatcat1111 2010-02-23 17:57:15

Answer 3

A:

You might want to look into 'self-indexing' - see this SO question and this DrDobbs article. With self-indexing, you can index a document or corpus such that you can efficiently find regular expression matches over the whole corpus. It's not likely to be simple to implement, though. :)

Nick Johnson 2010-02-25 10:30:40

ansaurus

tags:

views:

answers:

Do you know of any search algorithms for searching through source code?

related questions