Hi,

I'm completely new to regular expressions, and I need to extract all the words of at least 3 and at most 16 characters from a text (so I can enter that data into a MySQL database).

Currently, everything works, except for the regular expression:

/^.{3,16}$/

(I constructed this from a tutorial found using Google ;-) )

Thanks! Yvan

Sample Data:

rjm1986 * SinuhePalma * excel2010 * Jimineedles * 209663603 * C6A7XR * Snojog * XmafiaX * Cival2 * HitmanPirrie * MAX * 4163016 * Dredd23 * Daddy420 * mattpauley * Mykillurdeath * 244833585 * KCKnight * Greystoke * Fatbastard * Fucku4 * Davkar * Banchy2 * ET187 * Slayr69 * Nik1236 * SeriousAl * 315791 * 216996334 * K1ra * Koops1 * LastFallout * zmileben * bismark * Krlssi * FuckOff1 * 1owni * Ulme * Rxtvjq * halfdeadman * Jamacola * LBTG1008 * toypark * Magicman6497 * Tyboe187 * Bob187 * Zetrox

PHP Code (yeah, I know - it's kind of sloppy - this is only used to generate the queries...)

<?php
    //regexer.php

    $text = @$_REQUEST['fText'];
    if ($text == '') {
?>
<form method="post" action="">
    <input type="text" name="regex" />
    <textarea name="fText"></textarea>
    <br />
    <input type="submit"></input>
</form>
<?php 
    } else {
        preg_match_all($_REQUEST['regex'], $_REQUEST['fText'], $matches);
        header ("Content-type: text/plain");
        foreach ($matches as $match) {
            //print_r($match);
            echo ("INSERT INTO maf_codes (Code, GameID) VALUES ('$match', %GAMEID%);\n");
        }
    }
?>

Found a solution: replacing $_REQUEST['regex'] with the hardcoded regex worked ;)

+5  A: 

Try this:

/\b\w{3,16}\b/

Explained:

  • \b matches a word boundary
  • \w matches a word character (a letter, a digit, or an underscore)
  • {3,16} applies to the \w and it indicates that at least 3 and at most 16 characters should be matched.

FYI: I omitted the start anchor (^) and end anchor ($) from the regex in your question. It seems you want to find matches within longer strings of text, and the anchors would restrict matching to cases where the entire input string matched.
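To make the effect of the anchors concrete, here is a small sketch that runs both patterns against a shortened version of the sample data. The variable names are only for illustration.

```php
<?php
// Shortened sample in the same "word * word" layout as the question's data.
$input = 'rjm1986 * SinuhePalma * MAX * K1ra';

// Anchored: the whole string would have to be 3 to 16 characters long,
// and this one is far longer, so nothing matches.
$anchored = preg_match_all('/^.{3,16}$/', $input, $anchoredMatches);

// Word boundaries: each word is matched on its own.
$bounded = preg_match_all('/\b\w{3,16}\b/', $input, $boundedMatches);

echo "anchored: $anchored\n";
echo "bounded: $bounded\n";
print_r($boundedMatches[0]);
```

preg_match_all returns the number of full-pattern matches, so the two counts show the difference directly.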

UPDATE:

Here is the proof that this regex works:

<?php

$input = 'rjm1986 * SinuhePalma * excel2010 * Jimineedles * 209663603 * C6A7XR * Snojog * XmafiaX * Cival2 * HitmanPirrie * MAX * 4163016 * Dredd23 * Daddy420 * mattpauley * Mykillurdeath * 244833585 * KCKnight * Greystoke * Fatbastard * Fucku4 * Davkar * Banchy2 * ET187 * Slayr69 * Nik1236 * SeriousAl * 315791 * 216996334 * K1ra * Koops1 * LastFallout * zmileben * bismark * Krlssi * FuckOff1 * 1owni * Ulme * Rxtvjq * halfdeadman * Jamacola * LBTG1008 * toypark * Magicman6497 * Tyboe187 * Bob187 * Zetrox';

$matches = array();

preg_match_all('/\b\w{3,16}\b/', $input, $matches);

print_r($matches);

?>

Outputs:

Array
(
    [0] => Array
        (
            [0] => rjm1986
            [1] => SinuhePalma
            [2] => excel2010
            [3] => Jimineedles
            [4] => 209663603
            [5] => C6A7XR
            [6] => Snojog
            [7] => XmafiaX
            [8] => Cival2
            [9] => HitmanPirrie
            [10] => MAX
            [11] => 4163016
            [12] => Dredd23
            [13] => Daddy420
            [14] => mattpauley
            [15] => Mykillurdeath
            [16] => 244833585
            [17] => KCKnight
            [18] => Greystoke
            [19] => Fatbastard
            [20] => Fucku4
            [21] => Davkar
            [22] => Banchy2
            [23] => ET187
            [24] => Slayr69
            [25] => Nik1236
            [26] => SeriousAl
            [27] => 315791
            [28] => 216996334
            [29] => K1ra
            [30] => Koops1
            [31] => LastFallout
            [32] => zmileben
            [33] => bismark
            [34] => Krlssi
            [35] => FuckOff1
            [36] => 1owni
            [37] => Ulme
            [38] => Rxtvjq
            [39] => halfdeadman
            [40] => Jamacola
            [41] => LBTG1008
            [42] => toypark
            [43] => Magicman6497
            [44] => Tyboe187
            [45] => Bob187
            [46] => Zetrox
        )

)
Asaph
I get an empty array if I try to use it, other regexes work (eg [0-9][0-9][0-9] for 3 numbers). See my sample data attached above.
Yvan JANSSENS
Please post your php code.
Asaph
print_r returns an empty array...
Yvan JANSSENS
Your regex works, as stated in my post - Thanks!!
Yvan JANSSENS
+2  A: 

Can you tell us what exactly is not working? In any case, I think your regex should use the word boundary metacharacter \b:

/\b\w{3,16}\b/

Update: It works for me. This:

<?php
$a = array();

preg_match_all('/\b\w{3,16}\b/', "rjm1986 * SinuhePalma * excel2010 * Jimineedles * 209663603 * C6A7XR * Snojog * XmafiaX * Cival2 * HitmanPirrie * MAX * 4163016 * Dredd23 * Daddy420 * mattpauley * Mykillurdeath * 244833585 * KCKnight * Greystoke * Fatbastard * Fucku4 * Davkar * Banchy2 * ET187 * Slayr69 * Nik1236 * SeriousAl * 315791 * 216996334 * K1ra * Koops1 * LastFallout * zmileben * bismark * Krlssi * FuckOff1 * 1owni * Ulme * Rxtvjq * halfdeadman * Jamacola * LBTG1008 * toypark * Magicman6497 * Tyboe187 * Bob187 * Zetrox", $a);

print_r($a);

gives me:

Array
(
    [0] => Array
        (
            [0] => rjm1986
            [1] => SinuhePalma
            [2] => excel2010
            [3] => Jimineedles
            [4] => 209663603
            //.... lot more here...
            [45] => Bob187
            [46] => Zetrox
        )

)

Also note that the matches are in the first entry of the result array, so you have to do:

 foreach ($matches[0] as $match) {
        print_r($match);
        //...
 }

It is also good practice to initialize $matches before you use it (not strictly required, since preg_match_all fills it by reference):

$matches = array();
preg_match_all($_REQUEST['regex'], $_REQUEST['fText'], $matches);
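Putting the two notes together, a minimal sketch of the query-generating loop could look like this. It assumes the hardcoded pattern and a short stand-in for the submitted text; the table name and the %GAMEID% placeholder are taken from the question.

```php
<?php
// Stand-in for $_REQUEST['fText']; the real script would read the form field.
$text = 'rjm1986 * MAX * K1ra';

$matches = array();
preg_match_all('/\b\w{3,16}\b/', $text, $matches);

$queries = array();
// $matches[0] holds the full-pattern matches; $matches itself is an array of arrays.
foreach ($matches[0] as $match) {
    // \w cannot match a quote character, so $match is safe to embed here.
    $queries[] = "INSERT INTO maf_codes (Code, GameID) VALUES ('$match', %GAMEID%);";
}

echo implode("\n", $queries), "\n";
```

With a different pattern the matches could contain quotes, in which case the values would need escaping or a prepared statement.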
Felix Kling
Thanks. You gave me the idea to put the regex in the php code - that works. Getting it from the text field doesn't :(
Yvan JANSSENS
@Yvan JANSSENS: Well you can check with `print_r($_REQUEST)` what values are exactly sent and whether `$_REQUEST['regex']` contains something useful.
Felix Kling
@Yvan JANSSENS: Getting the regex from a form shouldn't pose a problem (security aside) unless you have `magic_quotes_gpc` turned on. In that case, backslashes and quotes in the submitted regex automatically get a backslash prepended, which ruins your regular expression.
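A rough sketch of that effect, using addslashes() as a stand-in for what `magic_quotes_gpc` did to incoming form data (the setting itself no longer exists in current PHP versions). The sample text and variable names are made up for the illustration.

```php
<?php
$pattern = '/\b\w{3,16}\b/';
$text = 'rjm1986 * MAX * K1ra';

// Simulate the escaping: every backslash in the pattern gets doubled.
$mangled = addslashes($pattern);

// The mangled pattern now looks for literal backslashes, so it finds nothing.
$withMangled = preg_match_all($mangled, $text, $m1);

// stripslashes() undoes the escaping and the pattern works again.
$withRestored = preg_match_all(stripslashes($mangled), $text, $m2);

echo "$mangled\n";
echo "mangled: $withMangled, restored: $withRestored\n";
```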
Asaph
The problem is solved now with the regular expression hardcoded. I only needed to scan 4 large documents, so manually altering the script was a manageable solution ;). Thanks!
Yvan JANSSENS
A: 

You can just use strlen().

$mystr="rjm1986 * SinuhePalma * excel2010 * Jimineedles * 209663603 * C6A7XR * Snojog * XmafiaX * Cival2 * HitmanPirrie * MAX * 4163016 * Dredd23 * Daddy420 * mattpauley * Mykillurdeath * 244833585 * KCKnight * Greystoke * Fatbastard * Fucku4 * Davkar * Banchy2 * ET187 * Slayr69 * Nik1236 * SeriousAl * 315791 * 216996334 * K1ra * Koops1 * LastFallout * zmileben * bismark * Krlssi * FuckOff1 * 1owni * Ulme * Rxtvjq * halfdeadman * Jamacola * LBTG1008 * toypark * Magicman6497 * Tyboe187 * Bob187 * Zetrox";
$s = explode(" ",$mystr);
foreach($s as $v){
    $len=strlen($v);
    if($len>=3 && $len<=16){
        echo "found: $v\n";
    }
}
ghostdog74
Sorry, but I need to filter the data from a file which contains HTML and other junk too...
Yvan JANSSENS
It doesn't matter; the concept of "at least 3 chars and at most 16" is still the same. Just use strlen to check each string you get once you have parsed the HTML data. Besides, the regex solutions posted will not strip HTML and junk for you either.
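As a minimal sketch of that idea, assuming strip_tags() is good enough to get rid of the markup (real-world HTML may need a proper parser), the length check could be applied like this. The sample HTML is invented for the example.

```php
<?php
// Hypothetical input; the real file contains HTML and other junk.
$html = '<p>MAX <b>K1ra</b> ab</p>';

// Drop the tags, then split the remaining text on whitespace.
$words = preg_split('/\s+/', strip_tags($html), -1, PREG_SPLIT_NO_EMPTY);

$kept = array();
foreach ($words as $word) {
    $len = strlen($word);
    // Keep only words of at least 3 and at most 16 characters.
    if ($len >= 3 && $len <= 16) {
        $kept[] = $word;
    }
}

print_r($kept);
```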
ghostdog74
+1  A: 

As others have said, the following will do it. Use it with preg_match_all, which already returns every match; PHP has no g modifier.

/\b\w{3,16}\b/

Your original line (below) didn't work because:

/^.{3,16}$/
  1. The ^ and $ anchor the match to the beginning and end of the whole input (or of a line, in multiline mode). It looks like you want to extract words from within a line.
  2. The . will match any character at all (except a newline, by default), including spaces or special characters.
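The second point can be seen in a short sketch that drops the anchors and compares . with \w on a two-word sample (the sample string is only an illustration):

```php
<?php
$input = 'MAX * K1ra';

// Unanchored dot: the match swallows the spaces and the asterisk too.
preg_match('/.{3,16}/', $input, $dot);

// \w stops at the first non-word character.
preg_match('/\w{3,16}/', $input, $word);

var_dump($dot[0], $word[0]);
```

So even without the anchors, . would return a run of text instead of single words.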
Alison R.
Not true. `\w` matches alpha-numeric. See the proof in my answer.
Asaph
I realized that and edited my answer, but (oops) didn't edit it completely. Fixed now.
Alison R.