I've got a string like this:

####################
Section One
####################
Data A
Data B


####################
   Section Two
####################
Data C
Data D

etc.

I want to parse it into something like:

$arr = array(
    'Section One' => array('Data A', 'Data B'),
    'Section Two' => array('Data C', 'Data D')
);

At first I tried this:

$sections = preg_split("/(\r?\n)(\r?\n)#/", $file_content);

The problem is, the file isn't perfectly clean: sometimes there are different numbers of blank lines between the sections, or blank spaces between data rows.

The section head pattern itself seems to be relatively consistent:

####################
   Section Title
####################

The number of #'s is probably consistent, but I don't want to count on it. The white space on the title line is pretty random.

Once I have it split into sections, I think it'll be pretty straightforward, but any help writing a killer regex to get it there would be appreciated. (Or if there's a better approach than regex...)

+2  A: 

I'd take a multi-step approach:

  • split into section headings/content
  • parse each heading/content pair into the desired array structure

Here's an example, split into multiple lines so you can track what is going on:

Note the lack of sanity checking: this assumes nice, neat heading/content groups.
The regex was written for brevity and may or may not be sufficient for your needs.

// Split string on a line of text wrapped in lines of only #'s
$parts = preg_split('/^#+$\R(.+)\R^#+$/m', $subject, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
// Tidy up leading/trailing whitespace for each heading/content-block
$parts = array_map('trim', $parts);
// Chunk into array("heading", "content")
$parts = array_chunk($parts, 2);

// Create the final array
$sections = array();
foreach ($parts as $part) {
    $sections[$part[0]] = explode("\n", $part[1]);
}

// Let's take a look
var_dump($sections);
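
For input that is not so neat (CRLF line endings, blank lines between data rows, empty sections), the same split-then-pair idea could be sketched roughly as below. This is only an illustration under stated assumptions, not the answer's own code: the function name `parse_sections` is made up, and it assumes every heading is exactly a line of #'s, a title line, and another line of #'s.

```php
<?php
// Hypothetical helper; the name and structure are assumptions for illustration.
function parse_sections(string $text): array
{
    // Normalise CRLF / CR line endings to LF so ^ and $ behave predictably.
    $text = preg_replace('/\R/', "\n", $text);

    // Delimiter = hash line, title line, hash line. Only the title is captured,
    // with surrounding spaces and tabs left outside the capture group.
    $parts = preg_split(
        '/^#+[ \t]*\n[ \t]*(.*?)[ \t]*\n#+[ \t]*$/m',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    // $parts[0] is whatever precedes the first heading; after that the
    // entries alternate title, body, title, body, ...
    $sections = array();
    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        $rows = array_map('trim', explode("\n", $parts[$i + 1]));
        // Drop blank rows and reindex from 0.
        $sections[$parts[$i]] = array_values(array_filter($rows, 'strlen'));
    }
    return $sections;
}
```

Because `PREG_SPLIT_NO_EMPTY` is left out here, an empty section still contributes its (empty) body, so titles and bodies stay paired up.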
salathe
thanks for the help. I wound up going back and forth with @polygenelubricants....
sprugman
Oookkkk. I'll never quite understand this place. :-/
salathe
+1  A: 

I was able to quickly write this up:

<?php
$text = <<<EOT
####################
Section One
####################
Data B.Thing=bar#
.##.#%#

####################
   Empty Section!
####################
####################
   Last section
####################

Blah

   Blah C# C# C#

EOT;
$entries = array_chunk(
   preg_split("/^#+/m", $text, -1, PREG_SPLIT_NO_EMPTY),
   2
);
$sections = array();
foreach ($entries as $entry) {
  $key = trim($entry[0]);
  $value = preg_split("/\n/", $entry[1], -1, PREG_SPLIT_NO_EMPTY);
  $sections[$key] = $value;
} 
print_r($sections);
?>

The output (as run on ideone.com) is:

Array
(
    [Section One] => Array
        (
            [0] => Data B.Thing=bar#
            [1] => .##.#%#
        )

    [Empty Section!] => Array
        (
        )

    [Last section] => Array
        (
            [0] => Blah
            [1] =>    Blah C# C# C#
        )

)
polygenelubricants
That's awesome, thanks! But it doesn't quite work. :( It seems to choke on non-alpha characters in the data rows, which all of my data rows have, since they're name value pairs like "foo.bar=baz" http://ideone.com/u3xYo
sprugman
@sprugman, well, I wasn't sure what the data pattern is, but if you can guarantee that it will never contain `#` (e.g. no `"C# is awesome!"` or anything like that), then just use `[^#]+` instead of `[\w\s]+` http://ideone.com/zrx9n
polygenelubricants
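
To see why the two character classes mentioned in the comment above behave differently on name/value rows, here is a small sketch. The sample data is made up, and the patterns are reduced to just the character classes; the full regex from the earlier revision is not reproduced here.

```php
<?php
// Hypothetical data block of name=value rows.
$body = "foo.bar=baz\nname.first=Ann\n";

// [\w\s]+ stops at the first character that is neither a word character
// nor whitespace, which here is the "." after "foo".
preg_match('/^[\w\s]+/', $body, $m1);

// [^#]+ keeps going until it meets a "#", so it takes the whole block.
preg_match('/^[^#]+/', $body, $m2);

var_dump($m1[0]);           // string(3) "foo"
var_dump($m2[0] === $body); // bool(true)
```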
how 'bout if I guarantee that no line except the section delimiters will ever start with a #?
sprugman
@sprugman: check out latest revision. Tell me if there's anything else I can do.
polygenelubricants
thanks for all your help. I found another way to break it (unless I wasn't looking at the latest -- it seems to change the url every time you edit): http://ideone.com/2TOfp
sprugman
ah: found your latest (http://ideone.com/59Y3V) looking...
sprugman
I had to add a (\r?) to deal with the CRLFs of the source, but that seems to be working. Thanks!! http://ideone.com/V4X0D
sprugman
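
A rough sketch of what the CRLF adjustment might look like, applied to the code from the answer. The sample text is made up, and putting the optional `\r` in the row split is an assumption; the version that was actually used is the one behind the ideone link.

```php
<?php
// Made-up sample with Windows (CRLF) line endings.
$text = "####\r\nSection One\r\n####\r\nfoo.bar=baz\r\n\r\nqux=1\r\n";

// Split on runs of # at the start of a line, then pair title with body.
$entries = array_chunk(
    preg_split('/^#+/m', $text, -1, PREG_SPLIT_NO_EMPTY),
    2
);

$sections = array();
foreach ($entries as $entry) {
    // \r?\n consumes the optional carriage return together with each line
    // break, so rows do not end in a stray "\r" and blank lines are dropped.
    $rows = preg_split('/\r?\n/', $entry[1], -1, PREG_SPLIT_NO_EMPTY);
    $sections[trim($entry[0])] = $rows;
}
print_r($sections);
```

With a plain `/\n/` split, each row in this sample would keep a trailing `\r`, and the blank line would survive as a row containing only `\r`.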
@sprugman: you're the first person on stackoverflow who has gone back and forth with me on ideone.com; I think this is neat!!!
polygenelubricants
It worked pretty well, apart from having to update the url every time. (I wonder if they have a setting for that.) It might be nice to update your answer with the final version.... (Oh wait, maybe you did already. :)
sprugman