ansaurus

Question

php regular expression to filter out junk

Answer 1

A:

What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().

intgr 2009-11-18 23:03:08

I basically get a string with garbage mixed in at unpredictable intervals, and I want to get the original string back. I don't think bin2hex() would help me with that.

Mala 2009-11-18 23:16:44

Answer 2

+1 A:

$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case

$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);

Hah, that was a joke. Here's a regex for you:

$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
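For anyone unsure what that pattern does: `[^...]` is a negated character class, so the call deletes every byte that is not a letter, a digit, or one of `: . / _ -` (the trailing `-` is literal because it sits last in the class). A minimal sketch on the test case above; the function name `strip_junk` is made up for illustration and is not part of the original answer:

```php
<?php
// Hypothetical wrapper around the whitelist regex from the answer.
function strip_junk(string $input): string
{
    // Without the /u modifier the pattern is applied byte by byte,
    // so every byte of a multibyte junk character is removed.
    return preg_replace('/[^A-Za-z0-9:.\/_-]/', '', $input);
}

$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case
$clean = strip_junk($var);
echo $clean, "\n"; // http://www.example-website.com/joy_here.html
```

Note that this only helps when the junk falls outside the whitelist. Junk made of ordinary letters and digits passes straight through.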

Dereleased 2009-11-18 23:15:12

What about other characters, such as colons or slashes?

Aistina 2009-11-18 23:18:35

Ah, the post has been edited. I'll update shortly.

Dereleased 2009-11-18 23:22:30

Thanks! I'm not entirely sure what that regex does, but the output for an example input is 4Z56M9NQ9GP215, which is longer than 8 characters, since the garbage can contain all of these characters as well. Basically I need to discard anything between [garbage]s after the (hopefully) last '=' sign.

Mala 2009-11-18 23:26:22

Answer 3

+1 A:

As stated, the problem is unsolvable. If the garbage can contain "plain old normal" characters, and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":

__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__

Sparr 2009-11-18 23:29:39

True, it is unsolvable, but the best approximation is something along the lines of: collect from the back as far as looks reasonable; collect from the "something=" forward as long as looks reasonable; if the first part is at least 8 chars, use that, otherwise take as many from the second part as necessary to fill it to 8 chars.

Mala 2009-11-18 23:33:08

+1, good point. But on the other hand, the ambiguous cases are probably in the minority. If you can identify the correct URL 90% of the time, that may still be worth it.

Frank Farmer 2009-11-18 23:34:20

It's the "going as far as looks reasonable" part that I'm having trouble with.

Mala 2009-11-18 23:34:50
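One way to read the "collect from the back, then fill from the front" heuristic described in the comments, as a rough sketch only. It assumes that "looks reasonable" means an unbroken run of A-Z and 0-9, that the token is 8 characters long, and that junk bytes fall outside that set; none of this is confirmed by the thread, and the function name `guess_token` is hypothetical. Junk made of plain letters and digits cannot be detected this way.

```php
<?php
// Hypothetical helper: best-effort guess at a fixed-length token.
// Assumption: token characters are A-Z and 0-9, junk is anything else.
function guess_token(string $input, int $length = 8): ?string
{
    // Only look at what follows the last '=' sign.
    $pos = strrpos($input, '=');
    if ($pos === false) {
        return null;
    }
    $rest = substr($input, $pos + 1);

    // Split the remainder into runs of plausible token characters.
    preg_match_all('/[A-Z0-9]+/', $rest, $matches);
    $runs = $matches[0];
    if (count($runs) === 0) {
        return null;
    }

    // "Collect from the back": the last clean run.
    $tail = $runs[count($runs) - 1];
    if (strlen($tail) >= $length || count($runs) === 1) {
        return substr($tail, -$length);
    }

    // "Collect from the '=' forward": fill the gap from the first clean run.
    $head = $runs[0];
    return substr($head, 0, $length - strlen($tail)) . $tail;
}

// Sample from the answer above; prints BCDEFGHI, which is only one of the two candidates.
echo guess_token('__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__'), "\n";
```

On the ambiguous sample this picks "BCDEFGHI" simply because it prefers the tail. Nothing in the input says that "ABCDEFGH" is wrong.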
