views:

196

answers:

4

Hey guys,

I am looping through a large text file and im looking for lines that contain no more than 3 different characters (those characters, however, can be repeated indefinitely). I am assuming the best way to do this would be some sort of regular expression.

All help is appreciated.

(I am writing the script in PHP, if that helps)

+6  A: 

At the risk of getting downvoted, I will suggest regular expressions are not meant to handle this situation.

You can match a character or a set of characters, but you can't have it remember what characters of a set have already been found to exclude those from further match.

I suggest you maintain a character set, you reset it before you begin with a new line, and you add there elements while going over the line. As soon as the count of elements in the set exceeds 3, you drop the current line and proceed to the next.

Developer Art
Regular Expressions aren't really meant for a lot of the situations they get used for, this one is right up its alley though. RegExp builds deterministic state machines built for parsing strings, given back references, they can remember what characters were matched. Why take the time to write your own string parsing state machine when a regular expression engine can handle it for you?
gnarf
@gnaft: a regexp for will be vary complicated and unreadable, while a code will be clear. for example, in ruby: "line_string.chars.to_a.uniq.length < 4" will do the trick.
amikazmi
+4  A: 

Perhaps this will work:

preg_match("/^(.)\\1*(.)?(?:\\1*\\2*)*(.)?(?:\\1*\\2*\\3*)*$/", $string, $matches);
// aaaaa:Pass
// abababcaaabac:Pass
// aaadsdsdads:Pass
// aasasasassa:Pass
// aasdasdsadfasf:Fail

Explaination:

/
 ^                 #start of string
 (.)               #match any character in group 1
 \\1*              #match whatever group 1 was 0 or more times
 (.)?              #match any character in group 2 (optional)
 (?:\\1*\\2*)*     #match group 1 or 2, 0 or more times, 0 or more times 
                   #(non-capture group)
 (.)?              #match any character in group 3 (optional)
 (?:\\1*\\2*\\3*)* #match group 1, 2 or 3, 0 or more times, 0 or more times
                   #(non-capture group)
 $                 #end of string
/

An added benifit, $matches[1], [2], [3] will contain the three characters you want. The regular expression looks for the first character, then stores it and matches it up until something other than that character is found, catches that as a second character, matching either of those characters as many times as it can, catches the third character, and matches all three until the match fails or the string ends and the test passes.

EDIT

This regexp will be much faster because of the way the parsing engine and backtracking works, read bobince's answer for the explanation:

/^(.)\\1*(?:(.)(?:\\1|\\2)*(?:(.)(?:\\1|\\2|\\3)*)?)?$/
gnarf
Nice one! Regex may not be the best approach, but this works, and it gives the OP a concrete basis on which to decide whether to use regex or some other technique.
Alan Moore
FYI, I tried all these regexes on a largish file and your original regex hung. I got it to work by removing the question marks after the capturing groups, but the performance wasn't very impressive. But when I added a `'+'` to each quantifier to make them possessive, it outperformed all the others.
Alan Moore
A: 

for me - as a programmer with fair-enough regular expression knowledge this sounds not like a problem that you can solve using Regexp only.

more likely you will need to build a hashMap/array data structure key: character value:count and iterate the large text file, rebuilding the map for each line. at each new character check if the already-encountered character count is 2, if so, skip current line.

but im keen to be suprised if one mad regexp hacker will come up with a solution.

Andreas Petersson
+7  A: 

Regex optimisation fun time exercise for kids! Taking gnarf's regex as a starting point:

^(.)\1*(.)?(?:\1*\2*)*(.)?(?:\1*\2*\3*)*$

I noticed that there were nested and sequential *s here, which can cause a lot of backtracking. For example in 'abcaaax' it will try to match that last string of ‘a’s as a single \1* of length 3, a \1* of length two followed by a single \1, a \1 followed by a 2-length \1*, or three single-match \1s. That problem gets much worse when you have longer strings, especially when due to the regex there is nothing stopping \1 from being the same character as \2.

^(.)\1*(.)?(?:\1|\2)*(.)?(?:\1|\2|\3)*$

This was over twice as fast as the original, testing on Python's PCRE matcher. (It's quicker than setting it up in PHP, sorry.)

This still has a problem in that (.)? can match nothing, and then carry on with the rest of the match. \1|\2 will still match \1 even if there is no \2 to match, resulting in potential backtracking trying to introduce the \1|\2 and \1|\2|\3 clauses earlier when they can't result in a match. This can be solved by moving the ? optionalness around the whole of the trailing clauses:

^(.)\1*(?:(.)(?:\1|\2)*(?:(.)(?:\1|\2|\3)*)?)?$

This was twice as fast again.

There is still a potential problem in that any of \1, \2 and \3 can be the same character, potentially causing more backtracking when the expression does not match. This would stop it by using a negative lookahead to not match a previous character:

^(.)\1*(?:(?!\1)(.)(?:\1|\2)*(?:(?!\1|\2)(.)(?:\1|\2|\3)*)?)?$

However in Python with my random test data I did not notice a significant speedup from this. Your mileage may vary in PHP dependent on test data, but it might be good enough already. Possessive-matching (*+) might have helped if this were available here.

No regex performed better than the easier-to-read Python alternative:

len(set(s))<=3

The analogous method in PHP would probably be with count_chars:

strlen(count_chars($s, 3))<=3

I haven't tested the speed but I would very much expect this to be faster than regex, in addition to being much, much nicer to read.

So basically I just totally wasted my time fiddling with regexes. Don't waste your time, look for simple string methods first before resorting to regex!

bobince