I'm trying to run preg_match_all on the response below to extract all the binary data. I've tried just about everything I can think of and can't get a single match.

I was hoping it'd be as simple as doing something like this:

preg_match_all("#\n\n(.*)\n--$boundary#",$body,$matches);

But I get nothing back. I've tried other variations too (\r, \n, |, and the i, s, m and U modifiers) and I still can't get it to match for some reason.

Here is a pseudo response not including the headers:

--boundary
content-type:image/jpeg

<binary data>
--boundary
content-type:image/jpeg

<binary data>
--boundary
content-type:image/jpeg

<binary data>
--boundary

Unfortunately the binary data isn't enclosed in < and >; it's just raw data with special characters spanning multiple lines.

Also: I think the problem lies in the actual binary data, because when I run preg_match_all on the sample above it works just fine, but when I try it on the real response with all the binary data in it, it doesn't work.

A: 

I don't have an answer regarding your regular expressions, but did you have a look at Zend_Mime?

André Hoffmann
Yeah, I have in the past. That Zend stuff is foreign to me, but thanks for the tip.
John
Well, if you don't want to learn the whole Zend Framework, you could still have a look at the Zend_Mime_Decode class. I suppose the splitMime function could be very helpful for you. Here's the link: http://framework.zend.com/code/browse/~raw,r=9280/Zend_Framework/trunk/library/Zend/Mime/Decode.php
André Hoffmann
That class uses a method similar to the one I've already written. I just want to see if I can parse the <binary data> with a single preg_match_all statement.
John
+1  A: 

Your expression seems to work fine for me on the data you provided. I pulled down your output.php, renamed it output.txt, then ran this script:

<?php

$body = file_get_contents('output.txt');
$boundary = '__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__';
preg_match_all("#\n\n(.*)\n--$boundary#",$body,$matches);
print_r($matches);

It seems to have worked fine, i.e. it printed this:

Array
(
    [0] => Array
        (
            [0] => 

    [body] => 
--__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__
            [1] => 

ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
--__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__
            [2] => 

ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
--__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__
            [3] => 

ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
--__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__
            [4] => 

ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
--__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__
        )

    [1] => Array
        (
            [0] =>     [body] => 
            [1] => ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
            [2] => ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
            [3] => ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
            [4] => ÿ( RAW IMAGE DATA CONTINUES OVER MULTIPLE LINES starts with "ÿ" ends with "ÿÙ" )ÿÙ
        )

)

Looks like $matches[1] contains the list of binary data you're after.

JasonWoof
I'm still getting two empty arrays :(
John
Oh... I think I found what's causing the problem. I get nothing when I run this script locally on my Windows XP PC using WAMP, but when I run it on my Linux server it works just fine. Any ideas?
John
Oh, and you were only supposed to run preg_match_all on the contents of [body], but other than that I guess it works just fine, just not on Windows. Hmm.
John
Ahh, Windows... Try passing FILE_BINARY as the 2nd parameter to file_get_contents().
JasonWoof
You might also try a double backslash before your n's (like \\n) so the escape is expanded by preg instead of by PHP's string parsing.
JasonWoof
Well, I'm not using file_get_contents on it. I'm getting it as a string, and I'm developing on Windows, but I need it to work on both Windows and Linux. Also, I don't quite understand what you mean in your second comment.
John
In my second comment, I meant try this: preg_match_all("#\\n\\n(.*)\\n--$boundary#",$body,$matches); (that is, \\n instead of \n). I don't have access to a Windows machine to test on, but I thought it'd be worth a shot.
JasonWoof
Yeah, I tried that and it didn't work :(
John
I think you should upload the output.txt file and test with that on your Windows machine. That would give a hint as to where the problem is.
JasonWoof
A: 

Ok, well I'm not all that familiar with PHP regular expressions...

Considering what you are trying to do, the dot-matches-newline s switch should work. Using this regular expression seemed to work on my end:

/<binary data>\r\n(.*?)\r\n--simple boundary/s

The *? quantifier is non-greedy, so it consumes only as much as it needs to reach the very first --simple boundary string it sees.

Your line endings may differ from mine (I'm on a Windows machine), so you may have to fire up a hex editor to see exactly what should be matched before and after the <binary data> content.
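
If a hex editor isn't handy, a few lines of PHP can answer the same question. This is only a sketch: the helper name count_line_endings and the sample string are made up for illustration and are not from the original response.

```php
<?php
// Hypothetical helper (not from the thread): report which line-ending
// styles occur in a string, so you know what the regex has to match.
function count_line_endings(string $data): array
{
    $crlf = substr_count($data, "\r\n");
    return [
        'crlf' => $crlf,
        // Bare LF / bare CR = all occurrences minus those inside a CRLF pair.
        'lf'   => substr_count($data, "\n") - $crlf,
        'cr'   => substr_count($data, "\r") - $crlf,
    ];
}

// Made-up sample standing in for the real response body.
$sample = "--b\r\ncontent-type:image/jpeg\r\n\r\n\xFF\xD8\xFF\xD9\r\n--b";

print_r(count_line_endings($sample));

// Hex dump of the first 16 bytes, as a stand-in for a hex editor.
echo bin2hex(substr($sample, 0, 16)), "\n";
```

Keep in mind that raw image data can itself contain 0x0D and 0x0A bytes, so the counts are only a hint; the bytes right around the boundary lines are the ones that matter.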

David Andres
Thanks for the tip, I'll try that. After seeing that JasonWoof was able to get it to work, I narrowed my issue down to the OS the script is run on rather than the regex itself, which works just fine on a Linux box.
John
Glad to hear that you've narrowed it down.
David Andres
+1  A: 

Line endings depend on the platform and protocol the data came from, and \n on its own only matches a bare line feed. Presumably your data is an HTTP request or an email? In that case, line breaks will be \r\n, so you need to test for that instead.
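
A minimal sketch of a pattern that accepts either style, assuming the parts look like the pseudo response in the question. The function name extract_parts is made up for illustration. \r?\n matches both \r\n and a bare \n, the s modifier lets . cross line breaks inside the binary data, and preg_quote() protects against regex metacharacters in the boundary.

```php
<?php
// Hypothetical helper: pull the body of every MIME part out of $body with
// a single preg_match_all call, accepting CRLF or bare LF line endings.
function extract_parts(string $body, string $boundary): array
{
    $b = preg_quote($boundary, '#'); // escape metacharacters in the boundary
    // \r?\n\r?\n : blank line that ends the part headers
    // (.*?)      : part data, non-greedy; "s" lets . match newlines
    // \r?\n--    : line break in front of the next boundary marker
    $pattern = "#\\r?\\n\\r?\\n(.*?)\\r?\\n--$b#s";

    if (preg_match_all($pattern, $body, $matches) === false) {
        // preg_last_error() reports why PCRE gave up.
        throw new RuntimeException('preg error code ' . preg_last_error());
    }
    return $matches[1];
}

// Made-up two-part sample with CRLF line endings.
$body = "--XYZ\r\ncontent-type:image/jpeg\r\n\r\n\xFF\xD8-one\n-\xFF\xD9\r\n"
      . "--XYZ\r\ncontent-type:image/jpeg\r\n\r\n\xFF\xD8-two-\xFF\xD9\r\n"
      . "--XYZ--\r\n";

foreach (extract_parts($body, 'XYZ') as $i => $data) {
    echo $i, ': ', strlen($data), " bytes\n";
}
```

On large payloads preg_match_all() can return false instead of matches, for example when a PCRE backtrack limit is hit, which is why the sketch checks the return value rather than assuming empty arrays mean "no match".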

troelskn
I did not know that. Thanks for the tip. So \r\n is pretty much universal?
John
No, but it's standard for most web-based protocols (HTTP and mail).
troelskn
A: 

Alternatively, you could parse with explode(). This should be much faster, it's not too complex, and it gives you the header info if you want it:

<?php

$body = file_get_contents('output.txt');
$boundary = '__NEXT_PART_gc0p4Jq0M2Yt08jU534c0p__';
$parts = explode("--$boundary", $body);
array_shift($parts); # delete up to the first boundary
array_pop($parts); # delete after the last boundary

$binaries = array();
foreach($parts as $part) {
    list($header, $binary) = explode("\n\n", $part, 2);
    $binaries[] = $binary;
}    

print_r($binaries);
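
The script above splits headers from data on "\n\n", which only works for LF line endings. Below is a sketch of the same idea that also tolerates \r\n; the function name split_parts and the sample data are made up for illustration, and it assumes every part has a blank line between its headers and its data.

```php
<?php
// Hypothetical variant of the explode() approach that accepts CRLF or LF.
function split_parts(string $body, string $boundary): array
{
    $parts = explode("--$boundary", $body);
    array_shift($parts); // drop everything before the first boundary
    array_pop($parts);   // drop everything after the last boundary

    $result = [];
    foreach ($parts as $part) {
        // Split headers from data at the first blank line (CRLF or LF).
        $pieces = preg_split('/\r?\n\r?\n/', $part, 2);
        if (count($pieces) < 2) {
            continue; // no blank line, so not a well-formed part
        }
        $result[] = [
            'header' => trim($pieces[0]),
            // Strip only the one line break that precedes the next boundary.
            'data'   => preg_replace('/\r?\n\z/', '', $pieces[1]),
        ];
    }
    return $result;
}

// Made-up two-part sample with CRLF line endings.
$body = "--XYZ\r\ncontent-type:image/jpeg\r\n\r\n\xFF\xD8-one-\xFF\xD9\r\n"
      . "--XYZ\r\ncontent-type:image/png\r\n\r\n\x89PNG-two\r\n"
      . "--XYZ--\r\n";

foreach (split_parts($body, 'XYZ') as $part) {
    echo $part['header'], ' => ', strlen($part['data']), " bytes\n";
}
```

Like the original, this assumes the boundary string never occurs inside the binary data itself.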
JasonWoof