views:

395

answers:

10

I need to do a search for people who are violating our "don't use social security numbers in your data" rule and need to know if there are performance differences (and why) between the two lines below.

Thanks.

[0-9]{3}-[0-9]{2}-[0-9]{4}

vs

\d\d\d-\d\d-\d\d\d\d



Requested Details:
engine: removed to stop confusion in tagging

+9  A: 

I think you would see very negligible differences in performance. Use the first one, as it is easier to read at a glance. Once the Regex is compiled (if you are compiling it before using it for reuse purposes), it would not matter anyway.

Don't optimize until you need to optimize.

Chris Ballance
+3  A: 

The performance difference, if any, will be absolutely neglible. You're likely to be optimizing wrong part of your application.

Anton Gogolev
+1  A: 

As with any performance question, the answer is to benchmark test it with your own data and find out. Post the results with some sample data, because this is a good question.

Bill the Lizard
I was hoping to get some good theory on building efficient searches. But, if no one gives me that I'll run and post.
Keng
+8  A: 

Performance aside, I recently found out that \d and [0-9] are not identical, because there are more than just 10 digits. Therefore, the second version might yield more false positives.

MiseryIndex
Yeah, I saw that too 8O)hopefully there's not a way for our people to even enter them in the first place though.
Keng
I'm doubting that anyone in Turkey will provide their social security number in the first place, though. :-)
Ken White
+3  A: 

The performance difference should be negligible. On an unrelated note, if the data you're dealing with are anything like the stuff I see, it might be useful to expand the search by making the dashes optional:

\b\d{3}-?\d{2}-?\d{4}\b

Update: Good point, Keng. The word-boundary trick is really useful, so I'd definitely include it in a first pass.

Hank Gay
yeah, I was thinking about that but because of all the other numbers that could be there I wanted to try and avoid 9 numbers in a row...but then if i used \b\d{9}\b that might help limit the false positives....hmmmm.
Keng
+1  A: 

This Ruby script says the first is marginally slower, but I would expect the differences on any engine to be negligible.

require 'benchmark'
include Benchmark

def random_ssn
  format "%03d-%02d-%04d", rand(1000), rand(100), rand(10000)
end

bm do |x|
  x.report("range") { 100_000.times { /[0-9]{3}-[0-9]{2}-[0-9]{4}/ =~ random_ssn } }
  x.report("digit") { 100_000.times { /\d\d\d-\d\d-\d\d\d\d/       =~ random_ssn } }
end

Results:

      user     system      total        real
range  1.080000   0.030000   1.110000 (  1.245579)
digit  0.980000   0.030000   1.010000 (  1.149390)
jcrossley3
Your benchmark uses data that will always trigger. With the OP's data, a match will be the odd one out.
innaM
Agree, but that is trivially remedied.
jcrossley3
+1  A: 

Of course the performance of the two expressions depends on the implementation of the regex engine you are using. The difference should be small, so don't optimize until you see it as a bottleneck.

Here is a little performance comparison, using perl 5.8.3 and a sample of 8MB of random data (digits, dashes, spaces):

time perl -ne 'if (/\d\d\d-\d\d-\d\d\d\d/) {print "."}' < numbers.txt
[output omitted]
real    0m0.143s
user    0m0.136s
sys     0m0.007s

time perl -ne 'if (/[0-9]{3}-[0-9]{2}-[0-9]{4}/) {print "."}' < numbers.txt
[output omitted]
real    0m0.166s
user    0m0.160s
sys     0m0.006s

So the first is actually a tiny bit faster (this is consistent across several calls).

Christian Berg
This is consistent with the ruby results, too.
jcrossley3
Which version of Perl?
Brad Gilbert
I'd have to check (I'm at a different machine now), but I think 5.8
Christian Berg
"This is perl, v5.8.3 built for x86_64-linux-thread-multi"I also edited my answer to include the version.
Christian Berg
+1  A: 

There are better optimization available apart from what you note :

Social security number cannot start from number greater than 772

So that instantly reduces your match group , now you can :

[0-7][0-9]{2}-[0-9]{2}-[0-9]{4}

I guess what I'm trying to say is that optimization need not be just technical.

EDIT

Changed the regex as according to comment. Thanks David!

Learning
Should be <code>[0-7][0-9]{2}-[0-9]{2}-[0-9]{4}</code> or else it won't match 580-...
David
this is what Regex Buddy says to use.\b(?!000)(?!666)(?:[0-6]\d{2}|7(?:[0-356]\d|7[012]))[- ](?!00)\d{2}[- ](?!0000)\d{4}\b
Keng
That's the highest number they've allocated up until now. I don't think there's anything guaranteeing that won't change in the future.
Kibbee
+1  A: 

Seconding the comment that this is not likely to be the performance bottleneck -- compared to I/O, etc., the difference is not likely to be measurable.

Having said that -- if you're concerned, measure it, don't guess.

Paul Roub
I was hoping to get some good theory on building efficient searches. But, if no one gives me that I'll run and post.
Keng
"Measurement > thought experiment" is good theory. :)
chaos
egad...I think it might be good to figure out 'on paper' how much plutonium to use before building the bomb...<8O() also, I need to avoid building it in the worst way possible (). Two words for you "catastrophic backtracking" http://www.regular-expressions.info/catastrophic.html
Keng
+1  A: 

I just tested this in .NET with the benchmark feature of Regex Hero.

Surprisingly the first expression is faster, albeit only marginally so. I performed 500,000 iterations on a valid social security number and here are the results:

1.547 seconds - [0-9]{3}-[0-9]{2}-[0-9]{4}

1.844 seconds - \d\d\d-\d\d-\d\d\d\d

I tested each one 3 times to make sure the benchmark was accurate. It's funny that the result in .NET is the exact opposite of the results in Ruby and Perl.

Steve Wortham
wow...I wouldn't have thought that (.Net vs Perl/Ruby). Does Perl and Ruby use the same RX engine? That might explain why those two are the same.
Keng
I don't think so. Looks like Ruby has its own regex engine. In fact all 3 use the same basic syntax as Perl but each engine is different.
Steve Wortham