ansaurus

Question

String parsing help

Answer 1

A:

If you're really parsing XML, then the PHP DOM is of use here. You may have a trivial example case above, but if you're parsing XML, I'd use a dedicated XML API.

Brian Agnew 2010-03-25 21:11:19

No, it is not xml.

Click Upvote 2010-03-25 21:11:52

Answer 2

A:

This looks suspiciously like XML. If it is, you should use SimpleXMLElement or one of PHP's other XML-parsing facilities.

$xml = new SimpleXMLElement('<root>' . $paragraphs . '</root>');

foreach($xml->paragraph as $paragraph)
{
    // do stuff to $paragraph; its string value is the contents of the paragraph
}

zneak 2010-03-25 21:11:46

Answer 3

+2 A:

If this is a simple structure, with no nesting:

preg_split("#</?paragraph>#i", $string);

To ignore empty tokens:

preg_split("#</?paragraph>#i", $string, -1, PREG_SPLIT_NO_EMPTY);

Source: http://php.net/manual/en/function.preg-split.php
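A minimal sketch of the two calls above, run against sample input that is only assumed from the output quoted in this thread (newline-separated paragraphs, one pair of tags in upper case). It also shows one caveat: PREG_SPLIT_NO_EMPTY drops zero-length pieces only, so whitespace between tags still comes back as tokens unless you trim and filter them. The variable names are illustrative.

```php
<?php
// Sample input, assumed from the output quoted in this thread.
$string = "<paragraph>apples are red...</paragraph>\n"
        . "<PARAGRAPH>john is a boy..</PARAGRAPH>\n"
        . "<paragraph>this is dummy text......</paragraph>";

// Plain split: keeps the empty piece before the first tag, the "\n"
// pieces between tags, and the empty piece after the last tag.
$raw = preg_split("#</?paragraph>#i", $string);

// PREG_SPLIT_NO_EMPTY removes zero-length pieces, but the "\n" pieces remain.
$tokens = preg_split("#</?paragraph>#i", $string, -1, PREG_SPLIT_NO_EMPTY);

// Trim each token and drop whatever is empty after trimming.
$paragraphs = array_values(array_filter(array_map('trim', $tokens), 'strlen'));

print_r($paragraphs);
```

If the tags are never separated by whitespace, the PREG_SPLIT_NO_EMPTY call alone is enough.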

Kobi 2010-03-25 21:12:17

I should add that I don't have PHP handy here. The regex is right, but you may have to tweak the syntax a little.

Kobi 2010-03-25 21:14:59

Is this case-insensitive?

Click Upvote 2010-03-25 21:16:24

It should be; it has the `i` flag. Here I'm using `#` as the regex delimiter because `/` is part of the pattern. This is a common way to avoid escaping.

Kobi 2010-03-25 21:17:46

When I run this, I get this result: `Array( [0] => [1] => apples are red... [2] => [3] => john is a boy.. [4] => [5] => this is dummy text...... [6] => )` So it's working, but it's giving a lot of blank elements as well. Anything in mind that could fix it?

Click Upvote 2010-03-25 21:19:08

Take a look at the NO_EMPTY flag at http://docs.php.net/preg_split

VolkerK 2010-03-25 21:24:38

@VolkerK - thanks. Already edited.

Kobi 2010-03-25 21:26:40

Answer 4

A:

Mike Cialowicz 2010-03-25 21:13:57

Answer 5

+5 A:

If this is actually XML, then I agree with the other answers. But if it isn't valid XML and only looks vaguely like XML, you should not try to parse it with an XML parser. Instead, you can use a regular expression:

$matches = array();
preg_match_all(":<paragraph>(.*?)</paragraph>:is", $string, $matches);
$result = $matches[1];
print_r($result);

Output:

Array
(
    [0] => apples are red...
    [1] => john is a boy..
    [2] => this is dummy text......
)

Note that the i modifier makes the match case-insensitive and the s modifier lets the dot match newlines, so a paragraph can span several lines. All text not inside paragraph tags will be ignored.
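To see what each modifier contributes, here is a small sketch on made-up input (mixed-case tags, one paragraph spanning two lines, and stray text outside the tags). The input string and the variable names $with_s and $without_s are assumptions for illustration only.

```php
<?php
// Hypothetical input: mixed-case tags, a two-line paragraph, and stray text.
$string = "intro <Paragraph>line one\nline two</PARAGRAPH> stray <paragraph>second</paragraph>";

// With "s", the dot also matches "\n", so the two-line paragraph is captured.
preg_match_all(":<paragraph>(.*?)</paragraph>:is", $string, $with_s);

// Without "s", the dot stops at "\n", so the two-line paragraph is skipped.
preg_match_all(":<paragraph>(.*?)</paragraph>:i", $string, $without_s);

print_r($with_s[1]);    // both paragraphs; "intro" and "stray" are ignored
print_r($without_s[1]); // only "second"
```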

Mark Byers 2010-03-25 21:15:25

Thanks, this works, but the resulting array still retains the <paragraph> </paragraph> tags. Can they be stripped off via the regex?

Click Upvote 2010-03-25 21:22:20

@Click Upvote: Yes. Answer updated.

Mark Byers 2010-03-25 21:23:31

Thanks a lot!

Click Upvote 2010-03-25 21:28:41

@Mark Hi there, I'm having a problem with this regex. It would be cool if you could help: http://stackoverflow.com/questions/3146752/php-regex-help-for-parsing-string

Click Upvote 2010-06-30 05:54:16

Answer 6

A:

Assuming that you've got some stuff in the paragraphs that is going to break XML format, or you're just looking to learn a bit more about regexp parsing, this should get the job done for the example you've posted. It's not particularly robust, but that's why people like to use XML: it has a formal syntax that makes it easy to parse. Or easier, anyway.

In particular, this solution depends on the string starting with a paragraph tag and ending with a paragraph close tag, and on there being nothing but whitespace between each pair of paragraphs. So it's a very literal solution to your example problem. But since this is the only existing specification document for your custom data format, it was the best I could do :)

$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> <paragraph>this is dummy text......</paragraph> ";
$paragraphs = preg_replace('/(^\s*<paragraph>|<\/paragraph>\s*$)/', '', preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/', $string));

What's going on here is that you're using, in the preg_split function call, zero-width lookaround assertions to find the beginning and end of each paragraph, and then calling preg_replace to crop out the tags from the beginning and end of each chunk. You end up with the contents of $paragraphs being

array (
  0 => 'apples are red...',
  1 => 'john is a boy..',
  2 => 'this is dummy text......',
)
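The same two steps can be written out separately to make the intermediate result visible. This is only a sketch on a shortened version of the example string; $chunks is a name introduced here for illustration.

```php
<?php
$string = " <paragraph>apples are red...</paragraph> <paragraph>john is a boy..</paragraph> ";

// Step 1: split on the whitespace that sits between a closing tag and an
// opening tag. The lookbehind and lookahead match positions, not characters,
// so the tags themselves stay inside the chunks.
$chunks = preg_split('/(?<=<\/paragraph>)\s*(?=<paragraph>)/', $string);

// Step 2: crop the opening tag (plus leading whitespace) from the start of
// each chunk and the closing tag (plus trailing whitespace) from its end.
$paragraphs = preg_replace('/(^\s*<paragraph>|<\/paragraph>\s*$)/', '', $chunks);

print_r($chunks);
print_r($paragraphs);
```

Because preg_replace accepts an array as its subject, step 2 processes every chunk in one call.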

intuited 2010-03-25 21:30:51

Oh yes, for case insensitivity you just add an i as the modifier for the two regexps, i.e. you add it after the final slash.

intuited 2010-03-25 21:39:36

Answer 7

A:

After your edits (case insensitive, and tags too big for XML parser to handle), the following should work:

$paragraphs = array();
$exploded = explode("</", $string);
unset($exploded[count($exploded) - 1]); //remove the useless, final "paragraph>" item
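$exploded[0] = str_ireplace("<paragraph>", "", $exploded[0]); // first item is a special case; str_ireplace ignores case
foreach($exploded as $item)
{
    // strip the leftover close/open tag pair; assumes exactly one newline between paragraphs
    array_push($paragraphs, str_ireplace("paragraph>\n<paragraph>", "", $item));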
}

Mike Cialowicz 2010-03-25 21:32:47
