So I have an interesting problem: I have a string, and for the most part I know what to expect:

http://www.someurl.com/st=????????

Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ

Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.

The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab characters going forward, as many as I need to fill the 8-character quota. Is there a regex way to do this, or will I need to roll up my sleeves and go nested-loop style?

update:

To clear up some confusion, I get an input string that's like this:

[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????

except the garbage is in unpredictable locations in the string (though the end is never garbage) and has unpredictable length (at least, I have not been able to find a pattern in either). Usually the ?s are all together, hence me just grabbing the last 8 chars, but sometimes they aren't, which results in some missing data and returned garbage :-\

A: 

What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().

intgr
I basically get a string with garbage mixed in at unpredictable intervals, and I want to get the original string back. I don't think bin2hex() would help me with that
Mala
+1  A: 
$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case


$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);

Hah, that was a joke. Here's a regex for you:

$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
Dereleased
What about other characters, such as colons or slashes?
Aistina
Ah, the post has been edited. I'll update shortly.
Dereleased
Thanks! I'm not entirely sure what that regex does, but the output of an example input is 4Z56M9NQ9GP215, which is longer than 8 characters since the garbage can contain all these chars as well. Basically I need to discard anything between [garbage]s after the (hopefully) last '=' sign
Mala
+1  A: 

As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":

__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__
Sparr
It is unsolvable, true, but the best approximation is something along the lines of: collect from the back as far as looks reasonable; collect from the "something=" forward as long as looks reasonable; if the first part is 8 or more chars, use that, otherwise take as many from the second part as necessary to fill it to 8 chars.
Mala
+1, good point. But on the other hand, the ambiguous cases are probably in the minority. If you can identify the correct URL 90% of the time, that may still be worth it.
Frank Farmer
It's the "going as far as looks reasonable" that I'm having trouble with.
Mala
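
For what it's worth, the "collect from the back, then fill from the front" heuristic described in the comments could be sketched roughly like this. This is only a sketch under assumptions: the function name extractCode and its $length parameter are made up for illustration, "looks reasonable" is taken to mean an unbroken run of upper case letters or digits, and the forward part starts after the last '=' in the string (which will be wrong if the garbage itself contains a later '=').

```php
<?php
// Hypothetical helper: best-effort recovery of the trailing code.
// Returns null when the heuristic cannot collect enough characters.
function extractCode(string $input, int $length = 8): ?string
{
    // 1. Collect the unbroken run of A-Z / 0-9 at the very end.
    //    \z anchors at the absolute end of the string.
    preg_match('/[A-Z0-9]*\z/', $input, $m);
    $tail = $m[0];
    if (strlen($tail) >= $length) {
        return substr($tail, -$length);
    }

    // 2. Not enough: look at what follows the last '=' (assumption).
    $eq = strrpos($input, '=');
    if ($eq === false) {
        return null;
    }

    // Everything between that '=' and the trailing run.
    $middleLen = strlen($input) - $eq - 1 - strlen($tail);
    $middle = substr($input, $eq + 1, $middleLen);

    // Collect the unbroken run of A-Z / 0-9 right after the '='.
    preg_match('/^[A-Z0-9]*/', $middle, $m);
    $head = $m[0];

    // 3. Fill the quota from the front part, if it is long enough.
    $need = $length - strlen($tail);
    if (strlen($head) < $need) {
        return null;
    }
    return substr($head, 0, $need) . $tail;
}
```

The patterns run in byte mode (no /u flag), so multi-byte garbage can never match A-Z or 0-9. Ambiguous inputs like the ABCDEFGHI example above are still ambiguous; this simply picks the last 8 characters of the trailing run in that case.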