We need to poll a web page every 5 minutes and the web page is growing rather large. The web page is a directory listing and we need the last line (to get a file name). What is the best way to get just this last line?

(If this were a local file, I could seek back a little from the end of the file and read from there.)

Thanks, Richard

+1  A: 

You have two options:

  1. Use an HTTP range request (this is a separate feature from chunked transfer encoding). See http://msdn.microsoft.com/en-us/library/aa287673.aspx and pay attention to the Range request header field. Your server must also support it.

  2. Use FTP and do a "restart" on the ftp command with the offset you need.
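For the FTP option, a minimal Python sketch using the standard ftplib module might look like the following. It assumes the listing is also reachable as a file over anonymous FTP and that the server supports the SIZE and REST commands; `fetch_ftp_tail`, `restart_offset`, the host, and the path are hypothetical names for illustration.

```python
from ftplib import FTP


def restart_offset(size, tail_bytes):
    """Byte offset to restart from so that roughly tail_bytes remain."""
    return max(0, size - tail_bytes)


def fetch_ftp_tail(host, path, tail_bytes=1000):
    """Download only the end of a remote file (assumes anonymous login)."""
    chunks = []
    with FTP(host) as ftp:
        ftp.login()                 # anonymous login; an assumption
        ftp.voidcmd('TYPE I')       # binary mode, so SIZE reports bytes
        size = ftp.size(path) or 0  # falls back to a full download
        # rest= makes ftplib send a REST command before RETR
        ftp.retrbinary('RETR ' + path, chunks.append,
                       rest=restart_offset(size, tail_bytes))
    return b''.join(chunks)
```

Calling `fetch_ftp_tail('ftp.example.com', 'listing.txt')` (placeholder values) would then return the last kilobyte or so, from which the final line can be split off.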

Chris Lively
+2  A: 

HTTP lets you request only part of a resource, which means you can probably ask for the same page starting from a different offset, IIRC. Check the HTTP RFCs.

EDIT: after checking RFC-2616, it is the Range: HTTP header you want.

Keltia
A: 

Use FTP and resume programmatically?

Gordon Carpenter-Thompson
A: 

You could do this in Python using a combination of urllib.request (built in; it was urllib2 in Python 2) and the third-party Beautiful Soup module (pip install beautifulsoup4).

With this approach you load the whole page, since the data is streamed to your local machine in order. However, urllib.request makes it easy to connect and retrieve the page, and Beautiful Soup will turn the raw HTML into an easily navigated hierarchy you can traverse using "dot syntax".

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://example.com/listing/'  # placeholder: replace with the real page
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# assumes you're looking for a tag with an id='last-line' attribute on it
tag = soup.find(id='last-line')
# this will print a list of the contents of the tag
print(tag.contents)
# if only text is inside the tag you can use this
print(tag.string)
Soviut
A: 

A dirty hack would be to open it in Word and record a macro to grab the last line (which might involve deleting tables etc.)

The following VBA code opens the Google "define" result for "stack overflow" and removes the header and footer, leaving only the list of results:

Sub getWebpage()

Documents.Open FileName:="http://www.google.com/search?hl=en&safe=off&rls=com.microsoft%3A*&q=define%3A+stack+overflow"

With Selection
    .MoveDown Unit:=wdLine, Count:=8, Extend:=wdExtend
    .Delete Unit:=wdCharacter, Count:=1
    .MoveRight Unit:=wdCharacter, Count:=1
    .EndKey Unit:=wdStory
    .MoveUp Unit:=wdParagraph, Count:=5, Extend:=wdExtend
    .Delete Unit:=wdCharacter, Count:=1
End With

End Sub

Then grab the result and write it somewhere.

EDIT: This is pretty hideous, I just recorded and altered a little bit.

+12  A: 

HTTP 1.1 does support a set of headers to request only a particular range of bytes, including support for just the last n bytes of a file (using the "suffix" format). The Range header is defined in RFC 2616, section 14.35. For instance,

Range: bytes=-1000

for the last 1000 bytes. (Assuming the server supports the Range header, of course.)
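As a rough illustration, the suffix range could be sent from Python with the standard urllib.request module. This is only a sketch: `fetch_tail` and `last_line` are hypothetical helper names, and it assumes the page is UTF-8 text whose final non-empty line is the one you want.

```python
from urllib.request import Request, urlopen


def last_line(data):
    """Return the last non-empty line of a byte string, decoded as text."""
    text = data.decode('utf-8', errors='replace')
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return lines[-1] if lines else ''


def fetch_tail(url, n=1000):
    """Request only the last n bytes of url; returns (status, body)."""
    req = Request(url, headers={'Range': 'bytes=-%d' % n})
    with urlopen(req) as resp:
        # 206 means the range was honoured. 200 means the server ignored
        # the header and sent the whole page, which still ends with the
        # same last line.
        return resp.status, resp.read()
```

Usage would be along the lines of `status, body = fetch_tail(url)` followed by `last_line(body)`. For an HTML directory listing the final line may be closing markup rather than the file entry, so you may need to pick an earlier line from the tail.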

Eric Rosenberger
A: 

If you cannot get the Range header to work, then I suggest doing the work server side with a CGI script or whatever you are comfortable with. It seems wasteful to retrieve the whole file merely to examine the last line!

If you post which OS and web server you are using, I'm sure someone here will post you a working CGI script within minutes if you get stuck.
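In the meantime, here is a minimal sketch of what such a script could look like in Python. It assumes the server can run Python CGI scripts, that the directory behind the listing is readable by the script, and that the "last line" of the listing corresponds to the last name in sorted order; `LISTING_DIR`, `last_entry`, and `render` are hypothetical names.

```python
#!/usr/bin/env python3
import os
import sys

LISTING_DIR = '/var/www/files'  # hypothetical path; adjust to your setup


def last_entry(directory):
    """Name of the last entry in sorted order, or '' if the directory is empty."""
    names = sorted(os.listdir(directory))
    return names[-1] if names else ''


def render(directory):
    """Build a complete plain-text CGI response containing only that name."""
    return 'Content-Type: text/plain\r\n\r\n' + last_entry(directory) + '\n'


# CGI servers set GATEWAY_INTERFACE, so this only runs under a web server.
if __name__ == '__main__' and os.environ.get('GATEWAY_INTERFACE'):
    sys.stdout.write(render(LISTING_DIR))
```

The client then polls this small script instead of the full listing, so each request transfers a single file name.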

Daniel Paull