I have a string like the following:

$string = "
<paragraph>apples are red...</paragraph>
<paragraph>john is a boy..</paragraph>
<paragraph>this is dummy text......</paragraph>
";

I would like to split this string into an array containing the text found between the <paragraph></paragraph> tags. For example, something like this:

$string = "
<paragraph>apples are red...</paragraph>
<paragraph>john is a boy..</paragraph>
<paragraph>this is dummy text......</paragraph>
";

$paragraphs = splitParagraphs($string);
/* $paragraphs now contains:
   $paragraphs[0] = apples are red...
   $paragraphs[1] = john is a boy..
   $paragraphs[2] = this is dummy text......
*/

Any ideas?

P.S it should be case insensitive, <paragraph>, <PARAGRAPH>, <Paragraph> should all be treated the same way.

Edit: This is not XML. There are a lot of things in here that would break the structure of XML, so I cannot use SimpleXML etc. I need a regular expression that will parse this out.

A: 

If you're really parsing XML, then the PHP DOM is of use here. You may have a trivial example case above, but if you're parsing XML, I'd use a dedicated XML API.

Brian Agnew
No, it is not xml.
Click Upvote
A: 

This looks suspiciously like XML. If it indeed is, you should use a SimpleXMLElement or any other XML-parsing facility of PHP.

// wrap the input string from the question in a single root element
$xml = new SimpleXMLElement('<root>' . $string . '</root>');

foreach($xml->paragraph as $paragraph)
{
    // do stuff to $paragraph; casting it to a string gives the contents of the paragraph
}
zneak
+2  A: 

If this is a simple structure, with no nesting:

preg_split("#</?paragraph>#i", $string);

To ignore empty tokens:

preg_split("#</?paragraph>#i", $string, -1, PREG_SPLIT_NO_EMPTY);

Source: http://php.net/manual/en/function.preg-split.php
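
One thing to watch for: PREG_SPLIT_NO_EMPTY only drops zero-length tokens, so the newlines between the tags in the question's string would still come back as whitespace-only elements. A minimal sketch that trims each token and drops the blank ones, assuming leading and trailing whitespace inside a paragraph is not significant (that assumption is mine, not the question's):

```php
<?php
$string = "
<paragraph>apples are red...</paragraph>
<PARAGRAPH>john is a boy..</PARAGRAPH>
<Paragraph>this is dummy text......</Paragraph>
";

// Split on opening and closing tags, case-insensitively.
$tokens = preg_split('#</?paragraph>#i', $string);

// Trim every token, drop the ones that are empty after trimming,
// and reindex the result from 0.
$paragraphs = array_values(array_filter(array_map('trim', $tokens), 'strlen'));

print_r($paragraphs);
```

Note that this still keeps any non-blank text that sits outside the tags, since splitting cannot tell inside from outside.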

Kobi
I should add that I don't have PHP handy here. The regex is right, but you may have to tweak the syntax a little.
Kobi
is this case insensitive?
Click Upvote
should be, it has the `i` flag. here I'm using `#` for regex boundary, because `/` is part of the regex. this is common to avoid escaping.
Kobi
When I run this, I get this result: `Array( [0] => [1] => apples are red... [2] => [3] => john is a boy.. [4] => [5] => this is dummy text...... [6] => )` So it's working, but it gives a lot of blank elements as well. Anything in mind that could fix it?
Click Upvote
Take a look at the NO_EMPTY flag at http://docs.php.net/preg_split
VolkerK
@VolkerK - thanks. Already edited.
Kobi
+5  A: 

If this is actually XML then I agree with the other answers. But if it isn't valid XML, but just something that looks vaguely like XML then you should not try to parse it with an XML parser. Instead you can use a regular expression:

$matches = array();
preg_match_all(":<paragraph>(.*?)</paragraph>:is", $string, $matches);
$result = $matches[1];
print_r($result);

Output:

Array
(
    [0] => apples are red...
    [1] => john is a boy..
    [2] => this is dummy text......
)

Note that the `i` modifier makes the match case-insensitive and the `s` modifier lets `.` match newlines, so a paragraph can span several lines. All text not inside paragraph tags will be ignored.
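
For reference, the same pattern could be wrapped as the splitParagraphs() function the question asks for. This is only a sketch: the function name comes from the question, while the mixed-case, multi-line sample input is made up here to exercise the `i` and `s` modifiers.

```php
<?php
// Sketch of the splitParagraphs() helper named in the question.
function splitParagraphs($string)
{
    $matches = array();
    // i: tags match in any case; s: "." also matches newlines,
    // so a paragraph may span several lines.
    preg_match_all('#<paragraph>(.*?)</paragraph>#is', $string, $matches);
    // $matches[1] holds the text captured by the first group.
    return $matches[1];
}

// Hypothetical input: mixed-case tags, a multi-line paragraph, and stray text.
$input = "<paragraph>apples are red...</paragraph>\n"
       . "<PARAGRAPH>multi\nline</PARAGRAPH>\n"
       . "stray text outside any tag";

$paragraphs = splitParagraphs($input);
print_r($paragraphs);
```

The stray text after the last closing tag is not captured, because only the group between a matching pair of tags is returned.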

Mark Byers
Thanks, this works but in the array result it still retains the <paragraph> </paragraph> tags, can they be stripped off via the reg ex?
Click Upvote
@Click Upvote: Yes. Answer updated.
Mark Byers
Thanks a lot!
Click Upvote
@Mark Hi there, I'm having a problem with this regex. It would be great if you could help: http://stackoverflow.com/questions/3146752/php-regex-help-for-parsing-string
Click Upvote
A: 

So assuming that you've got some stuff in the paragraphs that is going to break XML format, or you're just looking to learn a bit more about regexp parsing, this should get the job done for the example you've posted. It's not particularly robust, but that's why people like to use XML: it has a formal syntax that makes it easy to parse, or easier, anyway. In particular, this solution depends on the string starting with a paragraph open tag and ending with a paragraph close tag, and on there being nothing but whitespace between each pair of paragraphs. So it's a very literal solution to your example problem. But since your example is the only existing specification for your custom data format, it was the best I could do :)

$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";
$paragraphs = preg_replace('/(^\s*<paragraph>|<\/paragraph>\s*$)/', '', preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/', $string));

What's going on here is that you're using, in the preg_split function call, zero-width lookaround assertions to find the beginning and end of each paragraph, and then calling preg_replace to crop out the tags from the beginning and end of each chunk. You end up with the contents of $paragraphs being

array (
  0 => 'apples are red...',
  1 => 'john is a boy..',
  2 => 'this is dummy text......',
)
intuited
Oh yes.. for case insensitivity you just add an i as the modifier for the two regexps. ie you add it after the final slash.
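
A sketch of what that looks like, with the `i` modifier added to both patterns. The mixed-case tags in the sample string are made up for illustration; the rest follows the code above.

```php
<?php
// Same approach as above, with tags in mixed case (hypothetical input).
$string = " <paragraph>apples are red...</paragraph> <PARAGRAPH>john is a boy..</PARAGRAPH> <Paragraph>this is dummy text......</Paragraph> ";

// Split at the whitespace between a close tag and the next open tag.
$chunks = preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/i', $string);

// Crop the open tag from the start and the close tag from the end of each chunk.
// preg_replace accepts an array subject and returns an array.
$paragraphs = preg_replace('/(^\s*<paragraph>|<\/paragraph>\s*$)/i', '', $chunks);

print_r($paragraphs);
```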
intuited
A: 

After your edits (case insensitive, and tags too big for XML parser to handle), the following should work:

$paragraphs = array();
$exploded = explode("</", $string);
array_pop($exploded); // remove the useless, final "paragraph>" item
foreach($exploded as $item)
{
    // strip the opening tag and the "paragraph>" left over from the previous
    // closing tag (case-insensitively), then trim the surrounding whitespace
    $paragraphs[] = trim(str_ireplace(array("<paragraph>", "paragraph>"), "", $item));
}
Mike Cialowicz