I have an array of regular expressions and am trying to loop through a text document. I want to find the first pattern and use each match as a key in an array, then find the second pattern and store its matches as values. Every match of pattern 1 should become a new key, and all pattern 2 matches that follow, up to the next pattern 1 match, should be assigned to that key as its values.

Text document structure:

Subject: sometext

Email: [email protected]

source: www.google.com www.stackoverflow.com www.reddit.com

So I have an array of expressions:

$expressions=array(
                'email'=>'(\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)',
                'url'=>'([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)'
               );

I want to loop through my text document, match the email address and use it as an array key, then assign all URLs that follow as its values, so the output for the above text would be:

array(
  '[email protected]' => array (
      0 => 'www.google.com',
      1 => 'www.stackoverflow.com',
      2 => 'www.reddit.com'
    )
)
A: 

One way to do such a thing:

$parts = preg_split("/(emailexpr)/",$txt,-1,PREG_SPLIT_DELIM_CAPTURE);

$res = array();

// note: $parts[0] will be everything preceding the first emailexpr match
for ( $i=1; isset($parts[$i]); $i+=2 )
{
    $email = $parts[$i];
    $chunk = $parts[$i+1];
    if ( preg_match_all("/domainexpr/",$chunk,$match) )
    {
        $res[$email] = $match[0];
    }
}

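Replace emailexpr and domainexpr with your own regular expressions, without their delimiters. Note that emailexpr must not contain capturing groups of its own (use (?:...) instead): with PREG_SPLIT_DELIM_CAPTURE every captured group adds an element to $parts, and the loop above assumes a step of 2.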
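For illustration, here is a minimal runnable sketch of the split approach. The sample text, the addresses in it, and the two simplified patterns are assumptions made up for this example; they are not the asker's data or expressions, and real input may need stricter patterns.

```php
<?php
// Hypothetical sample input, modeled on the structure in the question.
$txt = "Subject: sometext\n"
     . "Email: alice@example.com\n"
     . "source: www.google.com www.stackoverflow.com\n"
     . "Subject: other\n"
     . "Email: bob@example.org\n"
     . "source: www.reddit.com\n";

// Simplified stand-ins for emailexpr and domainexpr (assumptions).
// Neither contains a capturing group of its own, so the split result
// strictly alternates between plain text and e-mail address.
$emailExpr  = '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}';
$domainExpr = '\bwww\.[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b';

// $parts[0] is the text before the first address, $parts[1] the first
// address, $parts[2] the text up to the next address, and so on.
$parts = preg_split("/($emailExpr)/", $txt, -1, PREG_SPLIT_DELIM_CAPTURE);

$res = array();
for ($i = 1; isset($parts[$i]); $i += 2) {
    $email = $parts[$i];
    $chunk = $parts[$i + 1];
    // Collect every URL-like token between this address and the next one.
    if (preg_match_all("/$domainExpr/", $chunk, $match)) {
        $res[$email] = $match[0];
    }
}
print_r($res);
```

With this input, $res ends up with one key per address, each holding the list of hosts found before the next address. If the same address can occur twice in the document, the later chunk overwrites the earlier one, so merge the arrays instead if that matters.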

mvds
+1  A: 

I would do:

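// FILE_SKIP_EMPTY_LINES only takes effect together with FILE_IGNORE_NEW_LINES
$lines = file('input_file', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$array = array();
$email = '';  // current key; stays empty until a valid address has been seen
foreach($lines as $line) {
  if(preg_match('/^Subject:/', $line)) {
    // a new record starts, forget the previous address
    $email = '';
  } elseif(preg_match('/^Email: (.*)$/', $line, $m)) {
    if(preg_match($expressions['email'], $m[1])) {
      $email = $m[1];
    }
  } elseif(preg_match('/^source: (.*)$/', $line, $m) && $email) {
    // check each space-separated token and keep the ones that look like URLs
    foreach(explode(' ', $m[1]) as $url) {
      if(preg_match($expressions['url'], $url)) {
        $array[$email][] = $url;
      }
    }
  }
}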
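The same line-by-line idea can also be written with preg_match_all, which replaces the explode/foreach/preg_match combination. The following is only a sketch under assumptions: the input lines, the address, and both simplified patterns are hypothetical stand-ins, not the asker's file or expressions.

```php
<?php
// Hypothetical input lines; in the code above these come from file().
$lines = array(
    'Subject: sometext',
    'Email: alice@example.com',
    'source: www.google.com www.stackoverflow.com www.reddit.com',
    'Subject: no address here',
    'source: www.ignored.com',
);

// Simplified stand-in patterns (assumptions, not the asker's expressions).
$emailExpr = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
$urlExpr   = '/\bwww\.[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/';

$result = array();
$email  = '';  // current key; empty until an address has been seen
foreach ($lines as $line) {
    if (preg_match('/^Subject:/', $line)) {
        $email = '';  // a new record starts
    } elseif (preg_match('/^Email:/', $line)
              && preg_match($emailExpr, $line, $m)) {
        $email = $m[0];
    } elseif ($email !== '' && preg_match('/^source:/', $line)
              && preg_match_all($urlExpr, $line, $m)) {
        // Append every URL found on this line to the current key.
        $result[$email] = array_merge(
            isset($result[$email]) ? $result[$email] : array(),
            $m[0]
        );
    }
}
print_r($result);
```

The last source line is skipped because no address has been seen since the preceding Subject line, which mirrors the `&& $email` check in the loop above.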
M42
This will complain about uninitialized array elements and an uninitialized variable, both in your handling of `$array`.
mvds
You should look into `preg_match_all`, which will make things cleaner (it would combine the `foreach`, `explode` and `preg_match`), plus it will prevent the warning about `$array[$email]` not being set.
mvds
Can you show me how that would be done?
sassy_geekette
@mvds: You're right, I missed `$array = array();`, updated. For the second point, I prefer to extract the URLs first but, sure, it can be done with `preg_match_all`.
M42
@M42: You miss an `if ( !isset($array[$email]) ) $array[$email] = array();` as well... @sassy: see my answer
mvds