I built a site a long time ago and now I want to place the data into a database without copying and pasting the 400+ pages that it has grown to so that I can make the site database driven.

My site has meta tags like this (each page different):

<meta name="clan_name" content="Dark Mage" />

So what I'm doing is using cURL to place the entire HTML page in a variable as a string. I can also do it with fopen etc., but I don't think it matters.

I need to sift through the string to find 'Dark Mage' and store it in a variable (so I can put it into SQL).

Any ideas on the best way to find Dark Mage to store in a variable? I was trying to use substr and then just subtracting the number of characters from the e in clan_name, but that was a bust.

+1  A: 

Use preg_match. A possible regular expression pattern is /clan_name.+content="([^"]+)"/
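
A minimal sketch of how that pattern might be used, assuming the page source is already in a variable (the name $html and the sample markup are made up for illustration):

```php
<?php
// Hypothetical sample input; in practice $html would hold the page fetched with cURL.
$html = '<head><meta name="clan_name" content="Dark Mage" /></head>';

// preg_match() returns 1 on a match; the captured group (index 1)
// holds the value of the content attribute.
$clanName = null;
if (preg_match('/clan_name.+content="([^"]+)"/', $html, $m)) {
    $clanName = $m[1];
}

echo $clanName, "\n";
```

Because .+ is greedy, a line holding several content="..." attributes after clan_name would yield the last one, so this is probably best suited to pages with one meta tag per line.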

Sundeep
+4  A: 

Just parse the page using the PHP DOM functions, specifically loadHTML(). You can then walk the tree or use xpath to find the nodes you are looking for.

<?php
$doc = new DOMDocument;
$doc->loadHTML($html);
$meta = $doc->getElementsByTagName('meta');
foreach ($meta as $data) {
  // $data is the current <meta> element; $meta is the whole node list
  $name = $data->getAttribute('name');
  if ($name == 'clan_name') {
    $content = $data->getAttribute('content');
    // TODO handle content for clan_name
  }
}
?>
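
The answer mentions XPath as an alternative to walking the tree. A rough sketch of that route, assuming the DOM extension is enabled; the sample markup is invented, and the libxml_use_internal_errors() call is an addition to keep warnings about sloppy markup quiet:

```php
<?php
// Hypothetical sample input, deliberately not well-formed (unclosed <p>).
$html = '<html><head><meta name="clan_name" content="Dark Mage" /></head>'
      . '<body><p>Welcome</body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the content attribute of the meta tag whose name is clan_name.
$nodes = $xpath->query('//meta[@name="clan_name"]/@content');

// Fall back to null when the page has no such meta tag.
$clanName = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
echo $clanName, "\n";
```

The query does the filtering that the foreach loop does by hand, so there is no need to inspect every meta element.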

EDIT If you want to remove certain tags (such as <script>) before you load your HTML string into the DOM document, try using the strip_tags() function. Something like this strips every tag except <meta>; note that the text between the removed tags is left in place:

<?php
  $html = strip_tags($html, '<meta>');
?>
jheddings
Does this require the HTML to be well-formed?
Ben Dunlap
Nope! That's the beauty of this function as opposed to loadXML(). Check out the example in the docs (you'll see it's not well-formed).
jheddings
Nice; that makes this a way better answer than mine. +1
Ben Dunlap
Always more than one way to solve a problem... If it's a constrained server install, perhaps the DOM features were disabled. Then a regex approach may be the only option.
jheddings
Thanks for all the help, both of you. I just have one more question: it's not letting me call loadHTML because of some JavaScript. I want to trim off everything starting at the JavaScript to just keep the meta tags. I'm trying: $html = strstr($html, '<script', true); but get Warning: Wrong parameter count for strstr(). Thanks!
krio
The third argument to strstr() is only available as of PHP 5.3, which isn't widely deployed. See the "Changelog" section at http://www.php.net/manual/en/function.strstr.php
Ben Dunlap
The 3rd parameter (`true`) was added in PHP 5.3. You may be running an older version.
jheddings
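
On a version without that third parameter, one possible workaround is to find the offset yourself and cut with substr(). This is only a sketch with made-up sample markup. Unlike strstr(), it uses the case-insensitive stripos() so that <SCRIPT is caught too, and it leaves $html untouched when no script tag is found:

```php
<?php
// Hypothetical sample input.
$html = '<head><meta name="clan_name" content="Dark Mage" />'
      . '<script type="text/javascript">var x = "<b>";</script></head>';

// stripos() returns the offset of the first match, or false if there is none.
$pos = stripos($html, '<script');
if ($pos !== false) {
    // Keep only the part of the page before the first <script tag.
    $html = substr($html, 0, $pos);
}

echo $html, "\n";
```

The strict !== false comparison matters because a match at offset 0 would otherwise be treated as "not found".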
+2  A: 

Use a regular expression like the following, with PHP's preg_match():

/<meta name="clan_name" content="([^"]+)"/

If you're not familiar with regular expressions, read on.

The forward-slashes at the beginning and end delimit the regular expression. The stuff inside the delimiters is pretty straightforward except toward the end.

The square-brackets delimit a character class, and the caret at the beginning of the character-class is a negation-operator; taken together, then, this character class:

[^"]

means "match any character that is not a double-quote".

The + is a quantifier which requires that the preceding item occur at least once, and matches as many of the preceding item as appear adjacent to the first. So this:

[^"]+

means "match one or more characters that are not double-quotes".

Finally, the parentheses cause the regular-expression engine to store anything between them in a subpattern. So this:

([^"]+)

means "match one or more characters that are not double-quotes and store them as a matched subpattern".

In PHP, preg_match() stores matches in an array that you pass by reference. The text matched by the full pattern is stored in the first element of the array, the text matched by the first sub-pattern in the second element, and so forth if there are additional sub-patterns.

So, assuming your HTML page is in the variable "$page", the following code:

$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);

if ($found) {
    $clan_name = $matches[1];
}

Should get you what you want.
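
If the pages carry several such meta tags, the same idea extends to capturing the name as well. A rough sketch using preg_match_all(), assuming every tag follows the exact name="..." content="..." layout shown in the question; the clan_rank tag is invented for the example:

```php
<?php
// Hypothetical sample input; $page would normally hold the fetched HTML.
$page = '<meta name="clan_name" content="Dark Mage" />' . "\n"
      . '<meta name="clan_rank" content="12" />';

$meta = array();
// PREG_SET_ORDER groups the captures per match: [full match, name, content].
$pattern = '/<meta name="([^"]+)" content="([^"]+)"/';
if (preg_match_all($pattern, $page, $sets, PREG_SET_ORDER)) {
    foreach ($sets as $set) {
        $meta[$set[1]] = $set[2];
    }
}

// Look the value up by name instead of relying on match positions.
if (isset($meta['clan_name'])) {
    $clan_name = $meta['clan_name'];
    echo $clan_name, "\n";
}
```

Keying the array by the captured name means each value can be mapped straight to a database column.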

Ben Dunlap
Nice writeup on regex and great explanation on the character classes...
jheddings
You could also capture the 'name' and let the `if` check that instead of relying on the match count.
jheddings
Can you give a super-quick example? I've never really liked checking count($matches) in cases like this; doesn't seem all that robust.
Ben Dunlap
jheddings
Got it, thanks. Now that I think about it, my "count($matches)" is redundant, because $found will be FALSE if the subpattern doesn't match. Editing.
Ben Dunlap