views:

23

answers:

3

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.

Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of different variable names used).

I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.

A: 

You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one: http://tagcloud.oclc.org/tagcloud/TagCloudDemo which probably won't work since it uses spaces as delimiters.

Adam Shiemke
A: 

I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.

You then would be able to run queries, and be able to see the most used formulas, which ones appear next to each other, etc.

Here's a quick article to get you started:

Lucene Analyzer, Tokenizer and TokenFilter

GalacticJello
A: 

The string matching is just the low hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in different order. For example suppose you have:

X+Y
Y+X

Your string matching approach won't realize that those are effectively the same. If you want to go a bit deeper I think you need to parse the formulas into an AST and actually compare the AST's. If you did that you could see that the tree's are actually the same since the binary operator '+' is commutative.

You could also apply reduction rules so you could evaluate complex functions into simpler ones, for example:

(X * A) + ( X * B)
X * ( A + B )

Those are also the same! String matching won't help you there.

  1. Parse into AST
  2. Reduce and Optimize the functions
  3. Compare the resulting AST to other ASTs

If you find a match then replace them with a call to a shared function.

justin.m.chase
Also if you have existing functions like "Trim" you could get the AST of that function and see if it matches sub trees in the functions you're evaluating.
justin.m.chase