Good evening everyone!

I'm converting ebook files to e-reader-optimized PDF files (the Sony e-reader can't properly justify text). To do so, I convert HTML to LaTeX and then build the LaTeX output with pdflatex.

The Sony reader has a function to look up words in a dictionary. However, it identifies words by analysing boxes, and pdflatex generates one box per line. As a result, I have lost the ability to use the dictionary search.

So here's my question: How do I tell pdflatex to put each word in a separate box?

Thanks!

EDIT:
I’m trying to tweak the output of the pdflatex command to make it produce one box per word. Consider this example:

\documentclass{minimal}

\begin{document}
    This is an example sentence.
\end{document}

When opened in a PDF editor after compilation, this sample appears as a single text box containing the sentence "This is an example sentence.". This is fine for most full-featured PDF readers. On my Sony e-reader, however, word selection is based on boxes; my PDF reader therefore selects the full sentence and fails to find a definition for the word I clicked.

I noticed that pdflatex stops at punctuation marks. How can I proceed to make it create one box per word? In the output, I would then have one box for "This", one for "is", one for "an", and so on.

A: 

Set the hyphenation penalty to 10000 (effectively infinite)

\hyphenpenalty=10000

and perhaps increase the typesetting tolerance

\tolerance=1000

See http://dcwww.camd.dtu.dk/~schiotz/comp/LatexTips/LatexTips.html#nohyphen.
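
For context, here is a minimal sketch of where these two settings would go, using the sample document from the question. The values are the ones suggested above; the placement in the preamble is an assumption. Note that this only changes line breaking and hyphenation, and nothing here is known to change how pdflatex groups words in the PDF output.

```latex
\documentclass{minimal}

% Forbid hyphenation: a penalty of 10000 is treated as infinite by TeX.
\hyphenpenalty=10000
% Allow looser lines so paragraphs can still be broken without hyphens.
\tolerance=1000

\begin{document}
    This is an example sentence.
\end{document}
```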


In case you don't know this, TeX makes layout decisions by assigning penalties to bad stuff (too much or too little white space (horizontal or vertical), widow or orphan lines, over- or under-full boxes, splitting footnotes across pages, and so on ad nauseam), then tries to minimize the overall penalty.

You can diddle the kinds of choices it makes quite extensively by adjusting the penalty values. A penalty of 10000 or more is treated as infinite, so a break at that point is forbidden outright. If no acceptable arrangement exists, the run does not stop; TeX typesets the best result it can and reports overfull or underfull boxes.

dmckee
Hmmm, but hyphens are no problem. The problem is words getting gathered in the same box...
CFP
+1  A: 

I'm guessing your trouble is not with boxes, but with your font encoding. Try putting the following just after your \documentclass{minimal}:

\usepackage{cmap} % Puts extra info in the PDF's font dictionary that helps searching
\usepackage{lmodern} % cmr, the default TeX font, has a wacky font layout
\usepackage[T1]{fontenc} % This and next line are recommended with lmodern
\usepackage{textcomp}
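
Put together with the sample document from the question, this would look roughly like the sketch below. It assumes the cmap, lmodern, and textcomp packages are installed in your TeX distribution; whether a given reader actually uses the extra font information is not guaranteed.

```latex
\documentclass{minimal}

\usepackage{cmap}        % Adds glyph-to-Unicode maps to the PDF's fonts
\usepackage{lmodern}     % Latin Modern instead of the default Computer Modern
\usepackage[T1]{fontenc} % T1 font encoding, recommended with lmodern
\usepackage{textcomp}    % Extra text symbols

\begin{document}
    This is an example sentence.
\end{document}
```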
Charles Stewart
Hmmm... In fact I also discussed this on the pdflatex mailing list, and it seems that what I'd actually need would be tags in the document. Changing the font encoding does nothing at all.
CFP
@CFP: Can you link to the discussion? The purpose of the information cmap puts in, namely a reverse map from glyphs to their Unicode information, is precisely to do what you want. If the reader ignores that information, then why do you think it will look at tags you attach to boxes?
Charles Stewart
See here: http://tug.org/pipermail/pdftex/2010-July/008427.html :) When I tried with a tagged PDF, the reader was able to select words correctly =)
CFP
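
For readers who want to experiment with per-word tags, here is a rough sketch of one possible approach. It is not the method from the linked thread, which is not reproduced here. It assumes the accsupp package, which wraps content in a PDF marked-content span carrying an ActualText entry; the \taggedword macro is a hypothetical helper introduced only for this illustration. Whether the Sony reader honours these spans is untested.

```latex
\documentclass{minimal}

\usepackage{accsupp} % Provides \BeginAccSupp and \EndAccSupp

% Hypothetical helper: wrap one word in its own marked-content span.
\newcommand{\taggedword}[1]{%
  \BeginAccSupp{ActualText={#1}}#1\EndAccSupp{}%
}

\begin{document}
    \taggedword{This} \taggedword{is} \taggedword{an}
    \taggedword{example} \taggedword{sentence}.
\end{document}
```

Since the source is generated from HTML anyway, the converter could emit such a wrapper around every word automatically.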