I'm trying to extract data from the following page:

http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=&param2=&param3=&param4=&param5=2009-04-22&param6=37#

Which, conveniently and inefficiently enough, includes all the data embedded as a csv file in the header, set as a variable called gs_csv.

How do I extract this? document.body.innerHTML skips the header where the data is. What is the alternative that includes the header (or, better yet, returns the value associated with gs_csv)?

(Sorry, I'm new to all this. I've been searching through loads of documentation and trying a lot of things, but nothing so far has worked.)


Thanks to Sinan (this is mostly his solution transcribed into Python).

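import time

from win32com.client import Dispatch

ie = Dispatch("InternetExplorer.Application")
ie.Visible = False
ie.Navigate("http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=&param2=&param3=&param4=&param5=2009-04-22&param6=37#")

# Crude fixed wait for the page to finish loading
time.sleep(20)

# Body only; this does not include the scripts in the header
webpage = ie.document.body.innerHTML

# The second script in the document holds the gs_csv assignment
s1 = ie.document.scripts(1).text
# Skip the variable name plus the two characters after it, and drop the last 11 characters
s1 = s1[s1.find("gs_csv") + 8:-11]

# Raw string so the backslashes in the Windows path are not read as escapes
scriptfilepath = r"c:\FO Share\bmreports\script.txt"
with open(scriptfilepath, 'w') as scriptfile:
    # Replace literal backslash-n sequences in the script text with real newlines
    scriptfile.write(s1.replace('\\n', '\n'))

ie.Quit()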
A: 

Untested: Did you try looking at what Document.scripts contains?

UPDATE:

For some reason, I am having immense difficulty getting this to work with Windows Script Host (but then, I don't use it very often, apologies). Anyway, here is Perl source that works:

use strict;
use warnings;

use Win32::OLE qw(in);  # in() is needed to iterate over COM collections
$Win32::OLE::Warn = 3;

my $ie = get_ie();

$ie->{Visible} = 1;

$ie->Navigate(
    'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?'
    .'param1=&param2=&param3=&param4=&param5=2009-04-22&param6=37#'
);

sleep 1 until is_ready( $ie );

my $scripts = $ie->Document->{scripts};

for my $script (in $scripts ) {
    print $script->text;
}

sub is_ready { $_[0]->{ReadyState} == 4 }

sub get_ie {
    Win32::OLE->new('InternetExplorer.Application', 
        sub { $_[0] and $_[0]->Quit },
    );
}

__END__

C:\Temp> ie > output

output now contains everything within the script tags.
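
For Python readers, the `sleep 1 until is_ready` loop above could be sketched roughly as follows. The helper name `wait_until_ready`, the timeout values, and the `FakeBrowser` stand-in are assumptions made for illustration; with a real `InternetExplorer.Application` object obtained through `win32com.client.Dispatch`, the same `ReadyState` property would be polled instead.

```python
import time

READYSTATE_COMPLETE = 4  # same value the Perl is_ready() compares against


def wait_until_ready(browser, timeout=60.0, poll_interval=1.0):
    """Poll browser.ReadyState until it reports 4 or the timeout expires.

    Returns True if the page finished loading, False on timeout.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if browser.ReadyState == READYSTATE_COMPLETE:
            return True
        time.sleep(poll_interval)
    return False


class FakeBrowser(object):
    """Hypothetical stand-in that becomes ready after a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready

    @property
    def ReadyState(self):
        self._remaining -= 1
        return READYSTATE_COMPLETE if self._remaining <= 0 else 1


if __name__ == "__main__":
    print(wait_until_ready(FakeBrowser(3), timeout=5.0, poll_interval=0.01))
```

Polling with a timeout avoids both a fixed long sleep and an endless loop if the page never finishes loading.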

Sinan Ünür
Hi Sinan, as I said, I'm completely new to all this. Trying ie.document.scripts returns <COMObject <unknown>>. What should the syntax be? Thanks
Brendan
It is a collection: ie.document.scripts.item[0] should hold the first script in the document. My IE8 is giving me problems, so I can't test.
Sinan Ünür
ie.document.scripts.item[0] gives an error: TypeError: 'instancemethod' object is unsubscriptable
Brendan
Hi Sinan, thanks very much for your help. That works perfectly. Sorry I can't vote you up, it seems I'm not reputable enough to do so... :) Anyway, for future reference, the code in Python is appended.
Brendan
A: 

Fetch the source of that page using Ajax, and parse the response text like XML using jQuery. It should be simple enough to get the text of the first tag you encounter inside the

I'm out of touch with jQuery, or I would have posted code examples.

EDIT: I assume you are talking about fetching the csv on the client side.

Here Be Wolves
It's a static webpage, so I don't know what Ajax has to do with it. That seems overly complicated; I could extract it from the full HTML source if I knew how to return it.
Brendan
A: 

If this is just a one-off script, then extracting this CSV data is as simple as this:

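import urllib2  # Python 2; on Python 3 use urllib.request instead

response = urllib2.urlopen('http://www.bmreports.com/foo?bar?')
html = response.read()
# Take everything between the gs_csv assignment and the closing script tag
csv = html.split('gs_csv=')[1].split('</SCRIPT>')[0]

#process csv data here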
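
As a sketch of the processing step, the snippet below pulls a quoted `gs_csv` value out of some HTML text and reads it with the standard `csv` module. The sample HTML, the function name `extract_gs_csv`, the double-quoted assignment, and the literal `\n` row separators are all assumptions; the real page's layout may differ, so the pattern would need adjusting against the actual source.

```python
import csv
import io
import re

# Hypothetical sample; the real page's script block may be laid out differently.
SAMPLE_HTML = (
    '<HTML><HEAD><SCRIPT>var gs_csv = "a,b,c\\n1,2,3\\n4,5,6";</SCRIPT></HEAD>'
    '<BODY></BODY></HTML>'
)


def extract_gs_csv(html):
    """Return the rows of the gs_csv string as lists of fields."""
    match = re.search(r'gs_csv\s*=\s*"(.*?)"', html, re.DOTALL)
    if match is None:
        raise ValueError("gs_csv assignment not found")
    # Assumed: rows are separated by literal backslash-n sequences.
    text = match.group(1).replace('\\n', '\n')
    return list(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    for row in extract_gs_csv(SAMPLE_HTML):
        print(row)
```

A regular expression is used here rather than chained `split` calls so that a missing `gs_csv` assignment raises a clear error instead of an `IndexError`.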
Randle Taylor
Hi Randle, I was looking at that method this morning, but this is from behind a company firewall/proxy with NTLM authentication. I tried several different approaches and examples to get Python working through the proxy, but then gave up and thought it would be easier to script IE to get the document. From what I've read, Python and NTLM proxies don't play too well together. I assumed there would be some equivalent to innerHTML that returns the full HTML, so I thought it would be quick and easy to do it this way...
Brendan
@Brendan: You can use NTLMAPS to circumvent NTLM authentication in any application. It is written in Python. http://ntlmaps.sourceforge.net/
nosklo
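If a local proxy such as NTLMAPS is running, a script could be pointed at it along these lines. This is only a sketch: the address and port in `LOCAL_PROXY` and the helper name `make_proxied_opener` are assumptions (the actual listening port depends on the NTLMAPS configuration), and it uses `urllib.request`, the Python 3 successor to `urllib2`.

```python
from urllib.request import ProxyHandler, build_opener

# Assumption: a local NTLM-aware proxy is already listening on this address.
# The port is illustrative only; check the proxy's own configuration.
LOCAL_PROXY = "http://127.0.0.1:5865"


def make_proxied_opener(proxy_url=LOCAL_PROXY):
    """Build a urllib opener that sends HTTP requests through proxy_url."""
    handler = ProxyHandler({"http": proxy_url})
    return build_opener(handler)


if __name__ == "__main__":
    opener = make_proxied_opener()
    # opener.open(url) would now go via the local proxy; no request is made here.
    print(type(opener).__name__)
```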
nosklo, ntlmaps (as far as I can see) is a local proxy that routes through the NTLM Lan proxy, but has to be running all the time to field requests on localhost from another application. I could be wrong, but it's a bit awkward, and not very portable.
Brendan
A: 

Thanks to Sinan (this is mostly his solution transcribed into Python).

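import time

from win32com.client import Dispatch

ie = Dispatch("InternetExplorer.Application")
ie.Visible = False
ie.Navigate("http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=&param2=&param3=&param4=&param5=2009-04-22&param6=37#")

# Crude fixed wait for the page to finish loading
time.sleep(20)

# Body only; this does not include the scripts in the header
webpage = ie.document.body.innerHTML

# The second script in the document holds the gs_csv assignment
s1 = ie.document.scripts(1).text
# Skip the variable name plus the two characters after it, and drop the last 11 characters
s1 = s1[s1.find("gs_csv") + 8:-11]

# Raw string so the backslashes in the Windows path are not read as escapes
scriptfilepath = r"c:\FO Share\bmreports\script.txt"
with open(scriptfilepath, 'w') as scriptfile:
    # Replace literal backslash-n sequences in the script text with real newlines
    scriptfile.write(s1.replace('\\n', '\n'))

ie.Quit()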

Brendan
You should delete this answer and incorporate it into your original post using proper markdown (the question mark next to the textbox tells you how to properly post code, among other things). As for voting me up, it's no problem if you can't, but AFAIK you should at least be able to mark the answer that solved your problem.
Sinan Ünür