I'm using cURL to fetch a web page and present it to our users. This worked well until I came across a website that makes heavy use of Ajax, with responses formatted like this:

33687|updatePanel|ctl00_SiteContentPlaceHolder_FormView1_upnlOTHER_NATL|
                                        <div id="ctl00_SiteContentPlaceHolder_FormView1_othernationalities">
                                            <h4>

                                                <span class="tooltip_text" onmousemove="widetip=false; tip=''; delayToolTip(event,tip,widetip,0,0);return false"
                                                    onmouseout="hideToolTip()">
                                                    <span id="ctl00_SiteContentPlaceHolder_FormView1_lblProvideOTHER_NATL">Provide the following information:</span></span>
                                            </h4>
|
266|scriptBlock|ScriptContentNoTags|
    document.getElementById('ctl00_SiteContentPlaceHolder_FormView1_dtlOTHER_NATL_ctl00_csvOTHER_NATL').dispose = function() {
        Array.remove(Page_Validators, document.getElementById('ctl00_SiteContentPlaceHolder_FormView1_dtlOTHER_NATL_ctl00_csvOTHER_NATL'));
    }

So each block of the response has 4 parts: parts 2 and 3 are just identifiers, part 4 is the real "body", and part 1 is the length of the body. The problem is that we modify the body, so I need to update the length in part 1 to match; otherwise a parsing error is thrown when the response is inserted into the web page.
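To make the format concrete, here is a minimal Python sketch that writes out one block with a freshly computed length. The function name `encode_block` is hypothetical, and the sketch assumes the length field counts characters of the body, as the description above suggests, rather than bytes.

```python
def encode_block(kind: str, ident: str, body: str) -> str:
    """Serialize one block as length|kind|ident|body| with a recomputed length."""
    # Assumption: the length field is the number of characters in the body.
    return f"{len(body)}|{kind}|{ident}|{body}|"


if __name__ == "__main__":
    # "alert(1);" is an illustrative body, not taken from the real response.
    print(encode_block("scriptBlock", "ScriptContentNoTags", "alert(1);"))
```

If the site counts bytes instead of characters, `len(body.encode("utf-8"))` would be the value to write; which one applies is not stated in the question.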

I'm trying to figure out a combination of shell commands (awk, sed, whatever) to:
a) read the saved file
b) run a regex on it to gather each individual block of information (using '(\d*?)\|(.*?)\|(.*?)\|(.*?)\|')
c) make the first capturing group equal to the length of the last capturing group
d) save all the regex matches to a new document or back to the original

Any input from "the collective" would be GREATLY appreciated.

A: 

It doesn't look like a single regex will solve this problem, because a backreference to the first capture group cannot be placed between {braces} to act as a repeat count. This is what I would ideally want:

(\d*?)\|([^|]+)\|([^|]+)\|(.{\1})\|
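A quick way to see the limitation is to try the pattern. The sketch below uses Python's `re` module purely as an example engine (an assumption; no engine is named here), and the two input strings are made up for the test. In that engine `{\1}` is not a valid repeat count, so it is read as literal text.

```python
import re

# The wished-for pattern from above, compiled unchanged.
pattern = re.compile(r"(\d*?)\|([^|]+)\|([^|]+)\|(.{\1})\|")

# A well-formed block with a 3-character body does not match,
# because "{\1}" is taken as the literal text "{", group 1, "}".
no_match = pattern.search("3|a|b|abc|")

# The pattern only matches when the body literally contains "{3}".
literal_match = pattern.search("3|a|b|x{3}|")

if __name__ == "__main__":
    print(no_match)
    print(literal_match.group(4))
```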

The length cannot be ignored either, because nothing indicates an escape character for the case where a | appears somewhere in the message body. I suggest a straight split on '|', storing the content in a 2-dimensional array. Check every fourth item against the declared length and, if it is too short, append a | and the next item, then advance the read counter. The following PHP illustrates the idea:

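$items = explode('|', $file);
$len = count($items);
$output = array();
$oi = 0;   // index of the current block
$ol = -1;  // index of the current field within the block
for ($i = 0; $i < $len; ++$i) {
  $output[$oi][++$ol] = $items[$i];
  if ($ol == 3) {
    // Field 0 is the declared body length. The body may itself contain '|',
    // so keep appending pieces until it reaches that length.
    $target = (int) $output[$oi][0];
    while (strlen($output[$oi][3]) < $target && $i + 1 < $len) {
      $output[$oi][3] .= '|' . $items[++$i];
    }
    ++$oi;
    $ol = -1;
  }
}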
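The code above only parses; the length still has to be rewritten after the body changes. A rough Python sketch of the full round trip follows (Python rather than PHP, chosen only for illustration). The names `parse_blocks`, `serialize_blocks` and `rewrite` are hypothetical. It assumes the response is parsed while its lengths are still correct, that is, before any body is modified, and that lengths count characters.

```python
def parse_blocks(text):
    """Split a response into [length, kind, ident, body] lists."""
    items = text.split("|")
    blocks, i = [], 0
    # Anything left over with fewer than 4 fields (such as the empty
    # string after the final '|') is ignored.
    while i + 3 < len(items):
        length, kind, ident, body = items[i:i + 4]
        i += 4
        # The body may contain '|'; keep joining pieces until it is long enough.
        while len(body) < int(length) and i < len(items):
            body += "|" + items[i]
            i += 1
        blocks.append([int(length), kind, ident, body])
    return blocks


def serialize_blocks(blocks):
    """Write the blocks back out, recomputing each length from its body."""
    return "".join(
        f"{len(body)}|{kind}|{ident}|{body}|" for _, kind, ident, body in blocks
    )


def rewrite(text, transform):
    """Parse, apply transform to every body, and re-serialize."""
    blocks = parse_blocks(text)
    for block in blocks:
        block[3] = transform(block[3])
    return serialize_blocks(blocks)


if __name__ == "__main__":
    # Made-up sample input; the first body contains a '|'.
    original = "3|updatePanel|p1|a|b|5|scriptBlock|s1|hello|"
    print(rewrite(original, lambda b: b.replace("hello", "hello world")))
```

The stored length is deliberately thrown away on output and recomputed from the body, so whatever the transform does, the two stay in sync. Reading the saved file and writing the result back would be a thin wrapper around `rewrite`.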
Patrick