I need to parse some text from PDFs, but the PDF formatting results in extremely unreliable spacing. As a result, I have to ignore the spaces and work with a continuous stream of non-space characters.

Any suggestions on how to parse the string and put spaces back into the string by guessing?

I'm using Ruby. Or should I say I'musingruby?

Edit: I've pulled the text out using pdf-reader. Some of the PDF files are nicely formatted and some are not. An example of text mixed with positioning:

.7aspe-5.5cts-715.1o0.6f-708.5f-0.4aces-721.4that-716.3are-720.0i-1.8mportant-716.3in-713.9soc-5.5i-1.8alcommunica6.6tion6.3.-711.6Althoug6.3h-708.1m-1.9od6.3els-709.3o6.4f-702.8f5.4ace-707.9proc6.6essing-708.2haveproposed-611.2ways-615.5to-614.7deal-613.2with-613.0these-613.9diff10.4erent-613.7tasks,-611.9it-617.1remainsunclear-448.0how-450.7these-443.2mechanisms-451.7might-446.7be-447.7implemented-447.2in-450.3visualOne-418.9model-418.8of-417.3human-416.4face-421.9processing-417.5proposes-422.7that-419.8informa-tion-584.5is-578.0processed-586.1in-583.1specialised-584.7modules-577.0(Breen-584.4et-582.9al.,-582.32002;Bruce-382.1and-384.0Y92.0oung,-380.21986;-379.2Haxby-379.9et-380.5al.,-

And if I print just the string data (I added returns at the end of each line to keep it from messing up the layout here):

'Distinctrepresentationsforfacialidentityandchangeableaspectsoffacesinthehumantemporal lobeTimothyJ.Andrews*andMichaelP.EwbankDepartmentofPsychology,WolfsonResearchInstitute, UniversityofDurham,UKReceived23December2003;revised26March2004;accepted27July2004Availab leonline14October2004Theneuralsystemunderlyingfaceperceptionmustrepresenttheunchanging featuresofafacethatspecifyidentity,aswellasthechangeableaspectsofafacethatfacilitates ocialcommunication.However,thewayinformationaboutfacesisrepresentedinthebrainremainsc ontroversial.Inthisstudy,weusedfMRadaptation(thereductioninfMRIactivitythatfollowsthe repeatedpresentationofidenticalimages)toaskhowdifferentface-andobject-selectiveregionsofvisualcortexcontributetospecificaspectsoffaceperception'

The data is spit out by callbacks so if I print each string as it is returned it looks like this:

'The

-571.3

neural

-573.7

system

-577.4

underly

13.9

ing

-577.2

face

-573.0

perc

13.7

eption

-574.9

must

-572.1

repr

20.8

esent

-577.0

the

unchangin

14.4

g

-538.5

featur

16.5

es

-529.5

of

-536.6

a

-531.4

face

'

On examination, it looks like the true spaces are large negative numbers (< -300) and the false spaces are much smaller positive numbers. Thanks guys. Just getting to the point where I am asking the question clearly helped me answer it!
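For anyone landing here later, this is a minimal sketch of that idea. It assumes the callback output has already been collected into one array that mixes strings and numbers, as in the sample above. The method name `rebuild_text` and the constant are made up for illustration, and the -300 cutoff is only what I observed in my files, so it may need tuning for other PDFs.

```ruby
# Cutoff taken from the observation above: positioning values below
# roughly -300 look like real word gaps. This is an assumption, not a PDF rule.
SPACE_THRESHOLD = -300

# Joins text fragments, inserting a single space only where the
# positioning number between them is below the threshold.
def rebuild_text(items, threshold: SPACE_THRESHOLD)
  items.each_with_object(+"") do |item, out|
    case item
    when Numeric
      # Large negative value: treat as a true space (avoid doubling spaces).
      out << " " if item < threshold && !out.end_with?(" ")
    when String
      # Text fragment: append as-is.
      out << item
    end
  end
end

sample = ["The", -571.3, "neural", -573.7, "system", -577.4,
          "underly", 13.9, "ing", -577.2, "face", -573.0,
          "perc", 13.7, "eption"]
puts rebuild_text(sample)
# prints: The neural system underlying face perception
```

Fragments that arrive with no number between them (like "the" followed by "unchangin" in the sample) are joined directly by this sketch, so cases like that would need separate handling.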

+4  A: 

Hmmmm... I'd have to say that guessing is never a good idea. Finding the root cause of the problem and solving that is the answer; anything else is a kludge.

If the spacing is unreliable from the PDF, how is it unreliable? The PDF viewer needs to be able to space the text reliably, so the data is there somewhere; you just need to find it.

EDIT following comment: Parsing the text using a dictionary (your only other option really, apart from randomly inserting spaces and hoping for the best) and inserting spaces at the identified word boundaries (a real problem when dealing with punctuation, plurals that don't alter the base word, etc.) would, I believe, be a much greater programming challenge than correctly parsing the PDF in the first place. After all, PDF is clearly defined, whereas English is somewhat woolly.

Why not go down the route of existing solutions like ps2ascii on Linux: call the tool from your Ruby code and pick up the result?
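Something along these lines might do it. This is only a sketch: it assumes ps2ascii (part of Ghostscript) is installed and on the PATH, and that it writes the extracted text to standard output when given just an input file. The method name `extract_text` and its `tool:` parameter are made up for illustration.

```ruby
require "open3"

# Runs an external text extractor on a file and returns its standard output.
# Assumption: the tool takes the input path as its only argument and
# prints the extracted text to stdout, as ps2ascii is expected to do.
def extract_text(pdf_path, tool: "ps2ascii")
  stdout, stderr, status = Open3.capture3(tool, pdf_path)
  raise "#{tool} failed: #{stderr}" unless status.success?
  stdout
end

# Example call (needs ps2ascii installed and a real file):
#   text = extract_text("paper.pdf")
```

Passing the command and the path as separate arguments avoids going through a shell, so odd characters in file names are not a problem.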

Lazarus
There is no choice but a kludge. The formatting of the PDFs in this case contains positioning information at pretty random points rather than just spaces in the text. I don't want to have to write a PDF text position formatting parser, and I can tolerate some messiness in the data.
srboisvert
+2  A: 

PDF doesn't only store spaces as space characters, but also uses layout commands for spacing (so it doesn't print a space, but moves the "pen" to the right). Perhaps you should have a look at the PDF reference (the big PDF on the bottom of the site), Chapter 9 "Text" should be what you're looking for.

EDIT: After reading your comment to Lazarus' answer, this doesn't seem to be what you're looking for. I think you should try to get a word list from somewhere and try to split your text using it. A good strategy would be to do that using recursion, because for example:

"meandyou"

The first word could be "me" or "mean", but if you try "mean", the remainder "dyou" doesn't make sense, so it must be "me". The same goes for the next word, which could be "a", "an" or "and": only "and" leaves a remainder that makes sense.
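A rough sketch of that recursive splitting, assuming a tiny hard-coded word list (a real one would be loaded from a dictionary file). The names `WORDS` and `segment` are made up for illustration, and this version simply returns the first split that works rather than the most likely one:

```ruby
require "set"

# Hypothetical miniature word list, just enough for the example.
WORDS = Set.new(%w[a an and me mean you])

# Returns an array of words that exactly covers text, or nil if no
# split exists. The memo hash caches results for suffixes already tried,
# so dead ends like "dyou" are not explored twice.
def segment(text, words = WORDS, memo = {})
  return [] if text.empty?
  return memo[text] if memo.key?(text)

  result = nil
  1.upto(text.length) do |len|
    head = text[0, len]
    next unless words.include?(head)

    # Try this candidate word, then recurse on the remainder.
    rest = segment(text[len..-1], words, memo)
    if rest
      result = [head] + rest
      break
    end
  end
  memo[text] = result
end

p segment("meandyou")
# prints: ["me", "and", "you"]
```

With a full dictionary many strings have more than one valid split, so you would probably want to collect all candidates and rank them (for example by word frequency) instead of stopping at the first.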

schnaader
A: 

If it were me, I'd go back to the source PDFs and try a different method of extracting the text, such as iText (for Java) or maybe some kind of PDF-to-HTML-to-text conversion software.

banjollity