tags:

views:

1116

answers:

6
+2  Q: 

PDF to LaTeX on Linux

I know how to make a PDF from LaTeX. Is there a way to extract the LaTeX from a PDF I created earlier? And what if someone sends me a PDF and I like the formatting? Can I extract the LaTeX from it?

+4  A: 

Short version: No.

Long version: It's a lot like decompiling: You technically could, but it would involve lots of guessing and heuristics.

I'm not familiar with the PDF innards, but it will likely set fonts/sizes/position directly, instead of defining a format and applying it to headers and such, like in LaTeX.

Tordek
+4  A: 

LaTeX does not have a one-to-one conversion to PDF. With regard to your first question, I believe such a conversion may be technically possible, but I do not believe an application that does it exists yet. Similar to the way assembly can be decompiled back into a high-level language, there is probably a way to do it. However, a PDF is allowed to contain all manner of data: AutoCAD drawings, JPEG graphics, font files, forms, digital signatures, etc. LaTeX has no idea what these things are. So the answer to the second question is no: there is no way to extract equivalent LaTeX from an arbitrary PDF document.

Billy ONeal
+6  A: 

It's only possible if you embed the source of the document into the PDF file. See the attachfile package for doing this.
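As a minimal sketch of what that could look like (assuming the document is compiled with pdfLaTeX and that the main source file is named after the job, so that `\jobname.tex` resolves to it):

```latex
\documentclass{article}
\usepackage{attachfile}

\begin{document}
Some text.

% Attach this document's own source to the generated PDF.
% \jobname expands to the base name of the main file, so this
% assumes the source really is \jobname.tex (true unless the
% job name was overridden on the command line).
\attachfile{\jobname.tex}
\end{document}
```

A reader of the PDF can then save the attached `.tex` file from a viewer that supports file attachments. This only helps for PDFs you produce yourself; it does nothing for a PDF someone else sends you without the source attached.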

Will Robertson
Alternatively, you can add the clue-giving metadata using tagged PDF.
Charles Stewart
Yes, that's true, but I'm not aware of a pre-existing way of turning LaTeX source into a PDF via this route. Any suggestions?
Will Robertson
@Will: Sorry, didn't see your question until recently. Ross Moore has demonstrated pdfTeX additions that allow generation of PDFs where the mathematics is tagged with the TeX code that generates it. This is a long way from a complete answer to the question, but I think it shows that it is *possible*. There's more I want to say about this than fits in a comment. I'll just say it could make a great MSc thesis.
Charles Stewart
+1  A: 

See my answer to a related question (http://stackoverflow.com/questions/1621885/how-to-turn-a-dvi-to-tex/1622348#1622348)

To amplify: there is no requirement for characters to be stored in reading order. I have found PDFs where part of the sdrawkcab sdaer txet (and relies on the coordinates to place each character). That is very difficult to reconstruct, as it can depend on font metrics, which can use the appalling ASCII85 encoding.

peter.murray.rust
A: 

It may work with TeXmacs, which can import PDF files.

Aif
TeXmacs is abandonware that never tried to solve this problem.
Charles Stewart
Still, I have already done it.
Aif
Tell me more! I wrote off TeXmacs several years ago as an overengineered approach to a problem that didn't need a revolution. I guess you have a different view?
Charles Stewart
A: 

The best way to mine data from PDF files (given how complicated the format is) is to open them with Adobe Illustrator, convert the PDF to an SVG file, and then write your own code on top of an SVG parser library.

One efficient SVG parser library is Batik.

(On Linux, converting PDF to SVG is somewhat more involved: calcmaster.net/personal_projects/pdf2svg/)

PS: I have been trying for a long time to find a solution to the second part of your question, but I gathered from books such as "Visualizing Data" (Ben Fry, O'Reilly) that PDF, especially Adobe PDF, is too complex to parse, so use an SVG parser library instead.

Novemberland
OP asked for solutions on Linux...
TJ Ellis