I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:

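import com.mathworks.mde.desk.*;
% Open the abstract page in MATLAB's built-in web browser.
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end)); % skip the first 7 characters of the URL
pause(1);

% Poll until the page's HTML is available.
s={};
while isempty(s)
    s=char(wb.getHtmlText);
    pause(.1);
end
% Close the browser window.
desk=MLDesktop.getInstance;
desk.removeClient(wb);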

I can extract various bits of information from the HTML text that ends up in the variable s. However, the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").

Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?

(MATLAB sits on top of Java, so I believe a Java solution would work...)

+4  A: 

I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.

You can do this quite easily in Firefox using the Firebug plugin.

https://addons.mozilla.org/en-US/firefox/addon/1843

Once you have found the real server request then you can just request this URL or post to this URL instead of trying to run the JavaScript.
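As a rough sketch of that last step in MATLAB: assuming the request Firebug reveals is a plain GET that takes the report number as a parameter, something along these lines may work. The endpoint http://reports.example.com/fulltext, the parameter name reportNo, and the report number R-1234 are made-up placeholders, and the regular expression assumes the onClick attribute looks exactly like the one quoted in the question.

```matlab
% s holds the HTML of the abstract page, as in the question.
% A stand-in string with a made-up report number is used here for illustration.
s = '<a onClick="gotoFulltext('''',''R-1234'')">Full text</a>';

% Pull out the second argument of gotoFulltext('','...').
tok = regexp(s, 'gotoFulltext\(''[^'']*'',\s*''([^'']*)''\)', 'tokens', 'once');
reportNumber = tok{1};

% Hypothetical endpoint and parameter name: substitute the real ones seen in Firebug.
pdfURL = 'http://reports.example.com/fulltext';

% Save the response straight to disk so the binary PDF data is not mangled.
urlwrite(pdfURL, 'report.pdf', 'get', {'reportNo', reportNumber});
```

If the server expects a POST instead, the third argument would be 'post'; which one applies depends on what the captured request looks like.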

pjp
pjp's is the only sensible approach. You should also have the developer of the web interface to the internal database taken out and shot - or at least tell them to learn about progressive enhancement ;-)
NickFitz
This looks like a very promising route - I now have a URL which gets me the PDF - all I need to do now is work out how to get it into a variable... Firebug is rather handy!
Ian Hopkinson
Yes it's pretty nice.
pjp
+1  A: 

Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...


If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:

s = urlread('url') reads the content at a URL into the string s. If the server returns binary data, s will be unreadable.

Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:

s = urlread('http://samplepdf.com/sample.pdf');


If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:

urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');

Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:

If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:

open('temp.pdf');
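
If the raw bytes are what is wanted rather than the text, one possible workaround is to read the file saved by URLWRITE back in as unsigned 8-bit integers. This is only a sketch, and it assumes temp.pdf was written by the URLWRITE call above:

```matlab
% Open the saved PDF for binary reading.
fid = fopen('temp.pdf', 'r');
if fid == -1
    error('Could not open temp.pdf');
end

% Read the whole file as a column vector of raw bytes.
pdfBytes = fread(fid, Inf, '*uint8');
fclose(fid);

% A PDF file normally starts with the characters '%PDF'.
header = char(pdfBytes(1:4)');
```

Reading as '*uint8' keeps the data as bytes instead of converting it to doubles or characters, so nothing is altered on the way in.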
gnovice
My problem at the moment is that the URL requires authentication to access the contents, and I can't work out how to provide it via urlread. I believe there might be a route using a Java URL object. Using the webbrowser method above I can *see* the PDF document on screen, which is frustratingly close to what I want. The text-from-PDF functions look useful...
Ian Hopkinson
The `URLREAD` and `URLWRITE` functions accept an optional request method (`'get'` or `'post'`) followed by a cell array of parameter names and values. You would have to find out which parameter names the server expects for authentication, then pass them along with their values in that cell array. An example appears on this documentation page: http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_prog/f5-136137.html#f5-136158
gnovice
Dimitri Shvorob's solution for converting the PDF file to text works nicely.
Ian Hopkinson
@Ian: As expected... Dimitri is a well-respected contributor to the File Exchange. =)
gnovice