How can you find the repeating sequences of at least 30 digits?

Sample of the data

2.375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054
7

My attempt in Vim

:g/\(\d\{4}\)\[^\1\]\1/
                |
                |----------- Problem here!

I do not know how to negate the first captured group (the back-reference \1).

+2  A: 

How about :g/\(\d\{30,\}\{2,\}\)/?

chaos
This does not find the repeating sequences. What is its purpose?
Masi
What does "find" mean to you?
chaos
I'm pretty sure it does (it's the same as mine, except matching arbitrarily many repetitions). Of course, you might want to try it without the :g on the front - you're trying to find them, not run a command on lines containing them.
Jefromi
Ah, if the repeating sequences must be next to each other (which makes sense), then this regular expression should certainly do the job.
Blixt
@Chaos: How did you extract the repeating sequence in the sample data? I get everything highlighted in the match. The data should contain at least four identical sequences of about 400 characters each, and `:g/\d\{400,\}\{2,\}/` does not help.
Masi
Masi, I added capturing parens to my expression. I'm guessing, based on your expression having them, that this is what you need. This is only a guess because it's completely unclear what "finding" and "extracting" the pattern means to you.
chaos
A: 

If it helps you on the way, the appropriate way to make sure that the following set of characters aren't the same as what is stored in back-reference #1 would be (?!\1). Note that the (?!) (negative look-ahead) group is a zero-width assertion (i.e., it will not change the position of the cursor, it just checks whether the regex should fail or not.)

Whether that is supported by the regex engine you're using, I don't know.
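
As a small illustration of the zero-width behaviour, here is a sketch in Python, chosen only because its `re` module accepts the (?!\1) syntax directly. The two-digit pattern and the sample strings are made up for the example and are not taken from the question's data.

```python
import re

# Two digits captured as group 1, then an assertion that the text
# following them is NOT another copy of group 1.
pattern = re.compile(r"(\d\d)(?!\1)")

m = pattern.match("1234")
# The look-ahead consumed nothing: the match ends right after group 1.
print(m.group(0), m.end())

# Here "12" is followed by another "12", so the assertion fails the match.
print(pattern.match("1212"))
```

The match on "1234" ends at offset 2 even though the assertion inspected the characters after it, which is what "zero-width" means here.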

Update

I made a quick sketch on paper, and something along these lines might work in PCRE. I haven't tested it and can't right now, but maybe it'll give you some ideas:

(?=(\d{30}))\d(?=\d{29,}?\1)

To ensure that I understood you correctly, the purpose of the above regex would be to match any sequence of 30 digits that also exists later in the whole string being searched.

My thoughts for the above regex were these:

  1. First I want to match a sequence of 30 digits, but I don't want to consume them since I want to check 1 digit later (not 30) next time. Therefore I use a look-ahead with a capturing group that stores the next 30 digits.
  2. Then I consume one digit to ensure I don't match the 30 digits with themselves.
  3. Then I match at least 29 digits (which means I'll be starting on the digit just outside the current sequence of digits) with a non-greedy quantifier, so that it will try 30, then 31, etc.
  4. Then I match the 30 digits I'm currently testing. If they exist later in the sequence, the regular expression will succeed; otherwise, it will fail.
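
The four steps above can be tried out with a minimal sketch in Python, whose `re` module accepts the same look-ahead syntax. The function name `later_repeats` and the parameter `n` are hypothetical, introduced here so the idea can be checked on short strings instead of 30-digit runs.

```python
import re

def later_repeats(text, n=30):
    """Return (offset, digits) for each n-digit run that occurs again later."""
    pattern = re.compile(
        rf"(?=(\d{{{n}}}))"        # step 1: capture the next n digits, consume none
        rf"\d"                     # step 2: consume one digit
        rf"(?=\d{{{n - 1},}}?\1)"  # steps 3-4: lazily skip >= n-1 digits, then find the copy
    )
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]

print(later_repeats("1234512349", n=4))  # [(0, '1234')]
```

Because each match consumes only one digit, `finditer` retries at every following offset, so on long inputs this approach can be slow.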
Blixt
Vim's regex supports lookahead, but not with that syntax: it uses \@= and \@! after the atom.
chaos
+2  A: 

I'm not sure why you need the negation. /\(\d\{4\}\)\1/ will match a sequence of (exactly) four digits immediately followed by a copy of itself. You probably want something like /\(\d\{30,\}\)\1/ to get your "at least 30". This appears to work for me, unless I've misunderstood what you're trying to search for. Note that since \{30,\} is greedy, you will get the longest possible repeated sequence at the position where the match starts.
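
To see the effect of greediness on a small scale, here is a sketch in Python rather than Vim. The pattern (\d{2,})\1 is the counterpart of the Vim pattern above with a minimum of 2 instead of 30, and the sample string is hypothetical.

```python
import re

sample = "1212121212"  # hypothetical data: "12" repeated five times

# Greedy: prefer the longest unit that is immediately followed by its copy.
greedy = re.search(r"(\d{2,})\1", sample)
# Lazy: prefer the shortest such unit.
lazy = re.search(r"(\d{2,}?)\1", sample)

print(greedy.group(1))  # 1212
print(lazy.group(1))    # 12
```

In Vim the lazy counterpart of \{30,\} would be \{-30,\}.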

Jefromi
A: 

This command matches lines containing 123451234 but not 111111111:

:g/\(\d\{4}\)\1\@!.\1/
  • \1\@!. uses a negative lookahead to say "make sure this position doesn't match (\@!) group 1 (\1), then consume a character (.)"
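
The same check can be reproduced outside Vim. This is a sketch in Python, where (?!\1) plays the role of \1\@!; it is an equivalent written for illustration, not the Vim command itself.

```python
import re

# Four digits, then one character that does not start a copy of them,
# then the four digits again.
pattern = re.compile(r"(\d{4})(?!\1).\1")

for line in ["123451234", "111111111"]:
    print(line, bool(pattern.search(line)))
```

The first line matches with "5" as the separating character. The second does not, because wherever four more 1s follow the group the negative look-ahead rejects the position, and everywhere else too few digits remain for the second copy.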
rampion
+1  A: 

First of all, to find your repeating numbers, you can use this simple search:

/\(\d\{5\}\).\{-}\1

This search finds repetitions of 5 digits. Unfortunately, Vim highlights from the start of the 5-digit number to the end of the repetition, including every digit in between, which makes it hard to see what the 5-digit number is. Also, because your number sequence repeats so much, the whole line ends up highlighted.

You will probably find it more useful to :set incsearch and type /\(\d\{5\}\).\{-}\1 or /\(\d\{5\}\)\ze.\{-}\1 (where \ze ends the highlighted match after the first five digits) without pressing Enter, so you can see what the digits are.
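
The difference between the whole match and the repeated digits can also be seen in a short Python sketch. The pattern (\d{5}).*?\1 is the counterpart of the Vim search above, with .*? standing in for .\{-}; the sample line is hypothetical.

```python
import re

line = "xx1234500012345yy"  # hypothetical sample line
m = re.search(r"(\d{5}).*?\1", line)

print(m.group(0))  # 1234500012345 (first copy, the digits between, second copy)
print(m.group(1))  # 12345 (just the repeated 5 digits)
```

Reading group 1 instead of the whole match serves the same purpose as \ze in the Vim search: it isolates the five digits from everything that lies between the two copies.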

This command might be more useful to you:

:syn region repeatSection matchgroup=Search start=/\z(\d\{30}\)/ matchgroup=Error end=/\z1/ oneline

This will highlight a sequence of 30 digits in yellow (first time it is seen) or red (when it is repeated). Note that this only works for a single line of text (multi-line isn't possible).

too much php