ansaurus

Question

Is there a performance difference in these two social security number searches?

Answer 1

+9 A:

I think you would see very negligible differences in performance. Use the first one, as it is easier to read at a glance. Once the Regex is compiled (if you are compiling it before using it for reuse purposes), it would not matter anyway.

Don't optimize until you need to optimize.

Chris Ballance 2009-02-18 14:58:32

Answer 2

+3 A:

The performance difference, if any, will be absolutely neglible. You're likely to be optimizing wrong part of your application.

Anton Gogolev 2009-02-18 14:59:01

Answer 3

+1 A:

As with any performance question, the answer is to benchmark test it with your own data and find out. Post the results with some sample data, because this is a good question.

Bill the Lizard 2009-02-18 15:03:06

I was hoping to get some good theory on building efficient searches. But, if no one gives me that I'll run and post.

Keng 2009-02-18 15:06:43

Answer 4

+8 A:

Performance aside, I recently found out that \d and [0-9] are not identical, because there are more than just 10 digits. Therefore, the second version might yield more false positives.

MiseryIndex 2009-02-18 15:05:37

Yeah, I saw that too 8O)hopefully there's not a way for our people to even enter them in the first place though.

Keng 2009-02-18 15:10:52

I'm doubting that anyone in Turkey will provide their social security number in the first place, though. :-)

Ken White 2009-02-18 15:31:12

Answer 5

+3 A:

The performance difference should be negligible. On an unrelated note, if the data you're dealing with are anything like the stuff I see, it might be useful to expand the search by making the dashes optional:

\b\d{3}-?\d{2}-?\d{4}\b

Update: Good point, Keng. The word-boundary trick is really useful, so I'd definitely include it in a first pass.

Hank Gay 2009-02-18 15:07:07

yeah, I was thinking about that but because of all the other numbers that could be there I wanted to try and avoid 9 numbers in a row...but then if i used \b\d{9}\b that might help limit the false positives....hmmmm.

Keng 2009-02-18 15:16:50

Answer 6

+1 A:

This Ruby script says the first is marginally slower, but I would expect the differences on any engine to be negligible.

require 'benchmark'
include Benchmark

def random_ssn
  format "%03d-%02d-%04d", rand(1000), rand(100), rand(10000)
end

bm do |x|
  x.report("range") { 100_000.times { /[0-9]{3}-[0-9]{2}-[0-9]{4}/ =~ random_ssn } }
  x.report("digit") { 100_000.times { /\d\d\d-\d\d-\d\d\d\d/       =~ random_ssn } }
end

Results:

      user     system      total        real
range  1.080000   0.030000   1.110000 (  1.245579)
digit  0.980000   0.030000   1.010000 (  1.149390)

jcrossley3 2009-02-18 15:18:23

Your benchmark uses data that will always trigger. With the OP's data, a match will be the odd one out.

innaM 2009-02-18 15:33:26

Agree, but that is trivially remedied.

jcrossley3 2009-02-18 15:35:10

Answer 7

+1 A:

Of course the performance of the two expressions depends on the implementation of the regex engine you are using. The difference should be small, so don't optimize until you see it as a bottleneck.

Here is a little performance comparison, using perl 5.8.3 and a sample of 8MB of random data (digits, dashes, spaces):

time perl -ne 'if (/\d\d\d-\d\d-\d\d\d\d/) {print "."}' < numbers.txt
[output omitted]
real    0m0.143s
user    0m0.136s
sys     0m0.007s

time perl -ne 'if (/[0-9]{3}-[0-9]{2}-[0-9]{4}/) {print "."}' < numbers.txt
[output omitted]
real    0m0.166s
user    0m0.160s
sys     0m0.006s

So the first is actually a tiny bit faster (this is consistent across several calls).

Christian Berg 2009-02-18 15:23:18

This is consistent with the ruby results, too.

jcrossley3 2009-02-18 15:36:29

Which version of Perl?

Brad Gilbert 2009-02-18 15:56:26

I'd have to check (I'm at a different machine now), but I think 5.8

Christian Berg 2009-02-18 19:13:28

"This is perl, v5.8.3 built for x86_64-linux-thread-multi"I also edited my answer to include the version.

Christian Berg 2009-02-19 15:21:13

Answer 8

+1 A:

There are better optimization available apart from what you note :

Social security number cannot start from number greater than 772

So that instantly reduces your match group , now you can :

[0-7][0-9]{2}-[0-9]{2}-[0-9]{4}

I guess what I'm trying to say is that optimization need not be just technical.

EDIT

Changed the regex as according to comment. Thanks David!

Learning 2009-02-18 15:31:00

Should be <code>[0-7][0-9]{2}-[0-9]{2}-[0-9]{4}</code> or else it won't match 580-...

David 2009-02-18 15:45:37

this is what Regex Buddy says to use.\b(?!000)(?!666)(?:[0-6]\d{2}|7(?:[0-356]\d|7[012]))[- ](?!00)\d{2}[- ](?!0000)\d{4}\b

Keng 2009-02-18 15:48:24

That's the highest number they've allocated up until now. I don't think there's anything guaranteeing that won't change in the future.

Kibbee 2009-02-18 17:36:14

Answer 9

+1 A:

Seconding the comment that this is not likely to be the performance bottleneck -- compared to I/O, etc., the difference is not likely to be measurable.

Having said that -- if you're concerned, measure it, don't guess.

Paul Roub 2009-02-18 15:48:12

I was hoping to get some good theory on building efficient searches. But, if no one gives me that I'll run and post.

Keng 2009-02-18 15:50:05

"Measurement > thought experiment" is good theory. :)

chaos 2009-02-18 16:41:14

egad...I think it might be good to figure out 'on paper' how much plutonium to use before building the bomb...<8O() also, I need to avoid building it in the worst way possible (). Two words for you "catastrophic backtracking" http://www.regular-expressions.info/catastrophic.html

Keng 2009-02-18 17:20:41

Answer 10

+1 A:

I just tested this in .NET with the benchmark feature of Regex Hero.

Surprisingly the first expression is faster, albeit only marginally so. I performed 500,000 iterations on a valid social security number and here are the results:

1.547 seconds - [0-9]{3}-[0-9]{2}-[0-9]{4}

1.844 seconds - \d\d\d-\d\d-\d\d\d\d

I tested each one 3 times to make sure the benchmark was accurate. It's funny that the result in .NET is the exact opposite of the results in Ruby and Perl.

Steve Wortham 2009-08-14 21:34:43

wow...I wouldn't have thought that (.Net vs Perl/Ruby). Does Perl and Ruby use the same RX engine? That might explain why those two are the same.

Keng 2009-08-18 12:55:49

I don't think so. Looks like Ruby has its own regex engine. In fact all 3 use the same basic syntax as Perl but each engine is different.

Steve Wortham 2009-08-18 13:37:33

ansaurus

tags:

views:

answers:

Is there a performance difference in these two social security number searches?

related questions