
I'm trying to download account transactions (an XML file) from a server. When I enter this URL from a browser:

https://secure.somesite.com:443/my/account/download_transactions.php?type=xml

it successfully downloads a correct XML file (assuming I've already logged in).

I want to do this programmatically with Ruby, and tried this code:

require 'rexml/document'
require 'net/http' 
require 'net/https'
include REXML

url = URI.parse("https://secure.somesite.com:443/my/account/download_transactions.php?type=xml")
# request_uri keeps the ?type=xml query string; url.path alone would drop it
req = Net::HTTP::Get.new(url.request_uri)
req.basic_auth 'userid', 'password'
req['Accept'] = 'application/xml'   # Content-Type belongs on bodies we send; Accept asks for XML back

http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
response = http.start { |http| http.request(req) }

root = Document.new(response.body).root

root.elements.each("transaction") do |t|
   id = t.elements["id"].text
   description = t.elements["description"].text
   puts "TRANSACTION ID='#{id}' DESCRIPTION='#{description}'"
end

Execution proceeds, but fails on the "Document.new":

RuntimeError: Illegal character '&' in raw string "??ࡱ?;??

The returned body is clearly not XML if printed: it is mostly unreadable bytes, with the occasional visible word suggesting it has something to do with the intended content. I also see the string "Arial1" mixed in with the unreadable bytes several times, which makes me think I'm receiving some format other than XML.

My question is, what am I doing wrong here? The XML file is definitely available (and correct if you examine the browser-obtained copy). Am I specifying something wrong with the SSL? The HTTPS request? Is there a different and proper way to reveal the correct body? Thanks in advance for your assistance!

A: 

Ruby should raise an exception if it can't handle the HTTPS connection, so I doubt SSL is the problem. Maybe the website is compressing the XML, and you need to uncompress it before parsing? See what headers are returned when you access the XML. If you are using Firefox, try the Live HTTP Headers extension.
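One way to check this from code: read the response's Content-Encoding header and inflate the body if the server gzipped it. A minimal sketch using the stdlib zlib (the header name is what Net::HTTP exposes; everything else here is illustrative):

```ruby
require 'zlib'
require 'stringio'

# Inflate the body only when the server declared gzip encoding.
# With a Net::HTTP response, call:
#   decode_body(response.body, response['Content-Encoding'])
def decode_body(body, content_encoding)
  if content_encoding == 'gzip'
    Zlib::GzipReader.new(StringIO.new(body)).read
  else
    body
  end
end
```

To see everything the server sent, `response.each_header { |k, v| puts "#{k}: #{v}" }` dumps all headers from a Net::HTTP response.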

sri
A: 

Interesting idea to check the headers. The successful browser sequence shows this in Live HTTP Headers:

https://secure.somesite.com/my/account/download_transactions.php?&type=xml

GET /my/account/download_transactions.php?type=xml HTTP/1.1
Host: secure.somesite.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: <obscured>

HTTP/1.x 200 OK
Date: Wed, 21 Oct 2009 13:13:08 GMT
Server: Apache/2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: must-revalidate, post-check=0,pre-check=0
Pragma: public
Content-Disposition: attachment; filename=stuff.xml
Connection: close
Transfer-Encoding: chunked
Content-Type: application/xml

I've tried to match all the HTTP headers by literally cutting and pasting the "Accept" lines from the capture above into my request, but the returned file is still garbled.

A hexdump of the response returned to my code shows long runs of 0x00 and 0xFF bytes, with the words "root" and "entry" near each other. A Wireshark capture of the unsuccessful Ruby sequence is less helpful, since it only shows the SSL-encrypted application data. But clearly a sizable chunk of data is being returned:

START DUMP
00000000: d0 cf 11 e0 a1 b1 1a e1 - 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 - 3b 00 03 00 fe ff 09 00  ........;.......
00000020: 06 00 00 00 00 00 00 00 - 00 00 00 00 01 00 00 00  ................
00000030: 04 00 00 00 00 00 00 00 - 00 10 00 00 00 00 00 00  ................
00000040: 01 00 00 00 fe ff ff ff - 00 00 00 00 05 00 00 00  ................
00000050: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
00000060: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
00000070: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
... and so on... non 00 and FF's appear much further down.
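For what it's worth, the leading bytes in that dump (d0 cf 11 e0 a1 b1 1a e1) are the magic number of an OLE2 compound document, the container behind legacy Excel .xls files, which would also explain the embedded "Arial1" font name. A rough sketch for sniffing a downloaded body by its magic bytes (a diagnostic aid, not an exhaustive detector):

```ruby
# First eight bytes of an OLE2 compound document -- the container
# format behind legacy Excel .xls (and Word .doc) files.
OLE2_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# Rough classification of a downloaded body by its leading bytes.
def sniff_format(body)
  bytes = body.b
  if bytes.start_with?(OLE2_MAGIC)
    :ole2_document     # likely an .xls spreadsheet, not XML
  elsif bytes.start_with?("\x1F\x8B".b)
    :gzip
  elsif bytes.lstrip.start_with?('<')
    :xml_or_html
  else
    :unknown
  end
end
```

Running the downloaded body through something like this would distinguish "wrong format" from "compressed" at a glance.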

I'm not sure what to try next. Any suggestions?

RubyNube
A: 

Fixed the problem myself. It turns out this site does not use HTTP basic authentication: I had to go through its login form to obtain a usable session cookie. I also simplified the solution by using Mechanize, a gem that handles much of the legwork of HTTP sessions, cookies, and forms.

require 'rubygems'
require 'mechanize'

login_username = "theusername"
login_password = "thepassword"

# get login page
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('https://somesite.com/login.php')

# fill out login form and submit
form = page.forms[0] # use first form on page
form['form[username]'] = login_username
form['form[password]'] = login_password
page = agent.submit(form)

# process returned page 
if page.uri.to_s.include?("login") 
  puts '---- LOGIN FAILED ----'
else
  puts '---- LOGIN SUCCESSFUL ----'
  xml_data = agent.get('https://secure.somesite.com:443/download_transactions.php?type=xml')
  puts xml_data.body
end
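With the login cookie in place, the fetched body can go back into the REXML loop from the question. A small sketch (the transaction/id/description element names are taken from the original post and are assumptions about the site's XML):

```ruby
require 'rexml/document'

# Parse the downloaded XML and print each transaction.
# Element names <transaction>, <id>, <description> assumed from the question.
def print_transactions(xml_string, out = $stdout)
  root = REXML::Document.new(xml_string).root
  root.elements.each('transaction') do |t|
    out.puts "TRANSACTION ID='#{t.elements['id'].text}' " \
             "DESCRIPTION='#{t.elements['description'].text}'"
  end
end
```

In the script above, `print_transactions(xml_data.body)` would replace the plain `puts xml_data.body`.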

The thing that threw me was how to set the form fields, which for some reason differed from the examples I'd seen.

RubyNube