views:

427

answers:

14

I've been stuck trying to write this regular expression I need. Basically, I have a long string comprised of two different types of data:

  1. [a-f0-9]{32}
  2. [a-zA-Z0-9=]{x}

The thing is, x is only constant in the particular instance: if in one case, it happens to be 12, it will be 12 for that particular dataset, but next time I run the regular expression it might need to be 15, or 45 for example. I have an unpredictable number of type (1)'s between each piece of type (2). My goal is to "harvest" all the data of type (2).

For example I could have a string of the following form:

[a-f0-9]{192}
[a-zA-Z0-9=]{11}
[a-f0-9]{96}
[a-zA-Z0-9=]{11}
[af-0-9]{160}
[a-zA-Z0-9=]{11}

(All put together with no delimitations). I need it to return a string comprised of the 33 characters of the [a-zA-Z0-9=] character set. The fact that the number of characters in each of the substrings is constant in the instance (in the case above it was 11, but it could just have easily have been 13) is vital as since it contains the smaller character set it would otherwise be impossible to know where one string begins and the other ends.

I've been trying to get this to work for almost a month now, and I'm close to tearing out my hair. I'm not particularly good at regular expressions...

Example data:

3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
SGVsbG8gbXk
6dd11d15c419ac219901f14bdd999f38
0ad94e978ad624d15189f5230e5435a9
2dc19fe95e583e7d593dd52ae7e68a6e
465ffa6074a371a8958dad3ad271181a
23310939b981b4e56f2ecee26f82ec60
fe04bef49be47603d1278cc80673b226
gbmFtZSBpcy
3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
BvbGl2ZXIga
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
G9vcmF5IQ==

I would want to extract "SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIgaG9vcmF5IQ==".

A: 

How do you determine this magical x?

  • If you know x beforehand for each dataset, just use your regex, and replace x with the actual value before each invocation (in most languages you can compose an arbitrary string of characters and use that as a regex).
  • If you don't know x then I don't see how there is any answer, as it cannot be determined from the input data alone (as you point out).

Edit:

From your comment, 2) seems to apply: x not known in advance.

As pointed out, then there will in general be more than one solution for a given piece of input data.

You could write a program that will extract all the substrings that satisfy your criteria. If there is only one solution for a given input, you're lucky; otherwise you'll have to decide which you like best.

To extract the substrings, one idea (possibly not optimal) would be to just loop through all reasonable values for x, and try your regex for each x. If it matches, you have found one solution. If more than one x matches, there is more than one solution.

There is probably are more performant way to do this, but if you have a reasonably low upper bound for x this should be doable. (Obviously, the data size - 32 is always an upper bound for x, so this will in principle always work).

sleske
When doing it by hand, I determine x by trying a value that "seems likely", and if i end up with not enough or too many chars at the end, trying with a different x, etc. I'm trying to find a way to do this programatically.
Mala
A: 

Why not just do this:

^[a-zA-Z0-9]+==$

or

^[a-zA-Z0-9]+[=]+$
Bobby
this would also pick up all of the [a-f0-9] stuff, no?
Mala
No, not as far as I see it, since the + indicates that = is there at least ones.
Bobby
+5  A: 

I do not believe that regular expressions are the right tool for this problem.

One thing that bothers me is that the range [a-f0-9] is included in the range [a-zA-Z0-9=] and since there are no delimiters and the length of the records is variable, the boundary between two records seems pretty fuzzy.

You may have a heuristic that works to determine where the records start and end by finding a pattern in the data, and you may then apply regular expressions using this pattern, but it is unlikely that regular expressions will help you to uncover this pattern in the first place.

Eric Bréchemier
yes this is precisely my problem. I could write a regexp for the case where the short-strings are (x) chars long and loop through all reasonable x, but that seems like taking a bazooka to a mosquito..
Mala
in the worst case, x could be 32, and all the base64 characters could fall in the hex range: you would not be able to tell them apart, would you?
Eric Bréchemier
The "bazooka" method as I have implemented it works better and better (fewer false matches) as x increases. x = 32 wouldn't cause any particular problem. Of course you can design an example where it will fail, but the chance of real data accidentally failing is increasingly remote as x increases in size. x=1 causes the most problems.
Mark Byers
A: 

How's about something along the lines of:

([a-f0-9]*([a-zA-Z0-9=]*))*

And then just concatenate the matches of ([a-zA-Z0-9=]*).

Can you count on the [a-zA-Z0-9=]* part being the same length each time? Or do you have to verify it? If you must verify the length each time, then this problem is not solvable with a regex (i.e. it is not a regular language, but rather a context-free language at least).

Yuval A
the [a-zA-Z0-9=]* is the same length every time in a particular instance, yes. So in one dataset, it will always be, for example, 11. But in the next dataset it might be 13
Mala
A: 

Is this a chance that the last string you want to match finishes with '=='?

If not, you could match the line finishing with '==' first, calculate its size, then use it as your x to grab the other lines you wish to grab.

Raphink
Sometimes it will end with ==, sometimes with =, and sometimes without any =s. It's basically a base64 encoded string
Mala
OK. Then maybe you could try to filter out the lines that contain characters you wouldn't find in the other lines (capital letters, equal signs, etc.). All these lines should have the same `x` length, and then you can apply this to your regex.
Raphink
A: 

I really think that you can't harvest all of your pieces of type (2) if you don't know how much pieces of type (1) you will have and the length of them.

The best solution would be to parse the string line by line and to apply a regexp for each lines. If it matches the type (2) then concatenate it into your result string.

If your string isn't splitted by lines, make a preg_replace before to parse it.

dasilvj
+1  A: 

If you know the size of each field, I would simply use substr.

$a = substr($line,192,11);
$b = substr($line,299,11);
$c = substr($line,380,11);

or use str_split and convert the line to an array and build up the substrings from the array pieces.

tvanfosson
i do not know the lengths or locations...
Mala
+3  A: 

I don't think your "types" of data are well-defined enough to make the problem solvable for all cases, irrespective of whether you use regular expressions at all.

Since, judging from your example, type 1 can occur multiple times in a row, and type 2 can look like type 1 since the character sets overlap, I don't see how you can tell them apart for all cases even when you know X (which, judging from the question, I'm not sure you do).

As a primitive example, given a string of 2000 repetitions of the letter "a", how could you tell apart types 1 and 2?

If there is any possibility at all of having whatever is giving you that data put in explicit delimiters, do that. Otherwise, you'll have to use heuristics to disambiguate, and I don't think a regexp is the right tool for that.

Michael Borgwardt
A: 

It seems you don't really care about the content of the string, so this should do. Of course you have to know the number to use. Also I presume the data is all in one line (I presume you put the newline just to clarify)

^.{192}(.{11}).{96}(.{11}).{160}(.{11}).*$

Then you just have to merge the 3 last element from the matches.

== Added

Ok since the uppercase seems to be the indicator of where you need to extract.

What you have to do is first get all occurence of an UpperCase char, get the multiple of 32 smaller than each position, then use a substring to extract the content you want. How do you get the 11 again?

jfno
Using RegExp to do what this RegExp does is only a waste of time and resources.
BYK
the 192, 96 and 160 could be any multiples of 32, and there could be more of them too. The 11 is not constant between datasets either. It is only constant for a particular run
Mala
I know, it's a waste, I think the substring solution is the right solution, but he asks for RegExp.
jfno
+1  A: 

You are on the wrong path IMO. The pattern is a hex-str encoded data having base64 encoded parts put in it. This hex data should mean something which can be used to determine when the "needed" data starts. Also if the original data you are deaing with is split up to rows which have the same length, that should also mean something. You should "understand" the data, not use a a brainless RegExp pattern to match it which does not seem possible from here.

BYK
The hex data, as far as i can make out, is the result of md5'ing random strings, and as such does not "mean" anything. It's the base64 data that means something, but the hex is interspersed in such a way as to break up bits of the base64 that need to be together, so I can't just decode and filter out the garbage.
Mala
+2  A: 

It seems that the data you are parsing from between the hex strings is Base64. The actual problem you are describing seems unsolvable with the restrictions you have given (cannot assume any lengths etc.).

But the big thing you should be aware is that base64 character set also contains the characters '+' and '/'. The '=' characters are padding as the length of the entire (in your case, concatenated) base64 encoded bit is always an even multiple of 4 characters.

Nakedible
Ah thanks for the tip about those extra chars. All I knew about base64 was what I'd seen of it from observation, and I guess I hadn't noticed those!
Mala
+2  A: 

As some other answers have said, I think regular expressions are not right here, or at least not initially. You need to start with an algorithmic approach. Here's why: you can't know the value of x for sure. The best you can do is run through the data making estimates of x for each chunk of type 2. Then you need a mechanism for guessing the most likely value of x based on all the estimates (possibly using something like hill-climbing). After that, you could apply a regular expression or simply take out chunks of the appropriate length.

Ben
A: 

Or you could just check for allowed characters via regex and then check for the string length via property/function. Sounds like you are making things more complicated than they should be.

marko
+7  A: 

It's your lucky day! The problem is not solvable in general, but I believe that the following will nearly always give the right answer for typical data from real life:

<?php

$s = '
3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
SGVsbG8gbXk
6dd11d15c419ac219901f14bdd999f38
0ad94e978ad624d15189f5230e5435a9
2dc19fe95e583e7d593dd52ae7e68a6e
465ffa6074a371a8958dad3ad271181a
23310939b981b4e56f2ecee26f82ec60
fe04bef49be47603d1278cc80673b226
gbmFtZSBpcy
3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
BvbGl2ZXIga
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
G9vcmF5IQ==
';
$s = preg_replace('/\r?\n/', '', $s);

for ($i = 1; $i < 20; ++$i) {
    $pattern = "/^(([a-f0-9]{32})+([a-zA-Z0-9=]{{$i}})?)+$/";
    if (preg_match($pattern, $s)) {
        $pattern = "/(?:[a-f0-9]{32})+([a-zA-Z0-9=]{{$i}})/";
        $matches = array();
        preg_match_all($pattern, $s, $matches);
        print_r(join('', $matches[1]));
        break;
    }
}

Output in this case:

SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIgaG9vcmF5IQ==

I believe that the code could be improved, but I'm sure you're just happy to get something that works. I think this is similar to the "bazooka" method you described above, but I honestly don't think there is a better way. Note also that it is important to start with small guesses first to minimize the chance of false matches. The order of the terms in the regex is also important to increase the likelihood of choosing correctly when more than one choice is possible (try hardest to match first, greedy, then easiest match only if that fails).

Mark Byers
Note that in the question, he mentions the [a-f0-9] part will not always be 32 characters long (though that is the case in his example). So your solution might not work in the general case (whatever that may be...).
sleske
@sleske: Hmm... I missed that. Where does he say that?
Mark Byers
That's in fact the main problem, the length of the first piece can change all the time... >.<
dasilvj
@dasilvj: 192 = 32*6, 96 = 32*3, 160 = 32*5. So there's no problem actually. Any multiple of 32 is handled correctly. Notice the + after (...{32})
Mark Byers
@sleske He says in a duplicate question http://stackoverflow.com/questions/1760150/base64-encoded-data-mixed-in-with-random-hex that the hex data will always be multiples of 32 characters; and he says in the comments on another answer here that the hex data consists of md5 sums, which are 128 bits, or 32 hex digits long. So, matching against one or more 32 character long strings seems to be correct here, even if it isn't very well specified in the original question.
Brian Campbell
In the example on his other thread, the length of the base-64 strings also varies which is why I didn't suggest this solution originally. But he seems pretty clear this time that the base-64 fragments are always the same length, and that's why this solution works.
Mark Byers
Just a quick edit, the "[a-zA-Z0-9=]"s should be "[a-zA-Z0-9=+\/]"s. Sorry, that wasn't in my original spec but Nakedible kindly pointed out that base64 also contains the '/' and the '+' chars. But thank you so much, this works beautifully :) I definitely owe you a beer!
Mala
Mala: Actually I knew that the base-64 regex was wrong (it should of course contain 64 characters plus the = filler which can only be at the end). I just was concentrating on solving the issue and completely forgot to correct it or even to mention it. It's lucky that someone else pointed it out. You might also want to adjust the limits for the length of the base-64 segments that I hardcoded. I used 1 to 20 in the example but you can increase 20 to something higher if you need.
Mark Byers
Thanks, I already did so - I incrementally brought it up and actually ended up with a 100% success rate when I set it to 50. So yeah, in short, you are officially my hero
Mala