I'm using pdftotext to convert Spanish-language PDFs to text. Characters with accents or tildes come out wrong in a systematic way that requires further conversion: the accent or tilde appears in the converted text in the correct position, but without its letter. The letter almost always appears at the end of the output line instead. When it doesn't, I can fix those cases by hand.
For example, the pdf sentence
¿Por qué?
becomes
¿Por qu´? e
I know enough about sed, awk and grep to think it can be done with some combination of those - and that it would take me a long time. I intend to use this to process all the pdf files in a folder.
The sentences appear in Spanish-English pairs on separate lines. I'd like to concatenate the two with a semicolon delimiter, which is the import format of my flash card app (Anki), and delete all content that is not a Spanish-English sentence pair.
For example, convert this output
B:
¿Por qu´? e
Why?
into
¿Por qué?;Why?
Where there are multiple accents, tildes or a mix of both, the letters trailing the line are in the correct order and may be separated by commas. For example, the pdf sentence
Sí pero vi en la televisión que iba a llover.
becomes
S´ pero vi en la televisi´n que iba a llover. ı, o
or S´ pero vi en la televisi´n que iba a llover. ı o
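One possible way to do the accent repair, sketched in Python rather than sed or awk. It assumes Python 3, that the accent marker is U+00B4 and the tilde marker is U+02DC, and that the trailing letters follow the last end punctuation on the line; the name restore_accents is made up for illustration. Lines where the number of trailing letters does not match the number of markers are returned unchanged, so they can still be fixed by hand.

```python
import re
import unicodedata

# Assumed marker code points: spacing acute accent and small tilde.
MARKS = re.compile("[\u00b4\u02dc]")
# Combining equivalents used to build the accented letter.
COMBINING = {"\u00b4": "\u0301", "\u02dc": "\u0303"}
# Sentence up to the last end punctuation, then the trailing letters.
SPLIT = re.compile(r"^(.*[.!?])\s+([^.!?]+)$")


def restore_accents(line):
    """Put trailing letters back under their accent or tilde markers.

    Expects one line without its trailing newline.
    """
    match = SPLIT.match(line)
    if not match:
        return line  # nothing trails the end punctuation
    body, tail = match.groups()
    letters = [t for t in re.split(r"[,\s]+", tail) if t]
    marks = MARKS.findall(body)
    # Only repair when every marker has exactly one single-letter partner.
    if not marks or len(marks) != len(letters):
        return line
    if any(len(letter) != 1 for letter in letters):
        return line
    pending = iter(letters)

    def compose(found):
        # A dotless i stands in for a plain i in the extracted text.
        letter = next(pending).replace("\u0131", "i")
        return unicodedata.normalize("NFC", letter + COMBINING[found.group(0)])

    return MARKS.sub(compose, body)


# "S´ pero vi en la televisi´n que iba a llover. ı, o"
print(restore_accents(
    "S\u00b4 pero vi en la televisi\u00b4n que iba a llover. \u0131, o"))
# prints: Sí pero vi en la televisión que iba a llover.
```

A line such as "Rein´ ıciala.", where the letter is not at the end of the line, does not match and is left as it is.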
Output File Format
Each sentence always has end punctuation: "!", "?" or ".". For those unfamiliar with Spanish, vowels (aeiou) are the only letters that may have an accent, the letter "n" is the only one that may have a tilde, and both marks may be found on upper- and lowercase letters.
The first output line may contain the level and title of the pdf. The level and title always precede the first occurrence of "A:".
I'm not interested in the line "Key Vocabulary" or anything that appears on any subsequent lines.
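The filtering and pairing rules described so far could be sketched roughly as follows, again in Python 3. This is only an illustration under a few assumptions: speaker labels are a single capital letter followed by a colon, any line without end punctuation (such as a lesson code in parentheses) is not a sentence, and the kept lines strictly alternate Spanish then English. The name extract_pairs is hypothetical, and the sketch does not repair accents; that would be a separate step applied to each Spanish line.

```python
import re


def extract_pairs(text):
    """Return 'spanish;english' lines from one pdftotext output."""
    # Drop the level and title: everything before the first "A:".
    start = text.find("A:")
    if start != -1:
        text = text[start + 2:]
    # Ignore "Key Vocabulary" and everything after it.
    text = text.split("Key Vocabulary", 1)[0]

    sentences = []
    for raw in text.splitlines():
        # Remove a leading speaker label such as "A:" or "B:".
        line = re.sub(r"^\s*[A-Z]:\s*", "", raw).strip()
        # Keep only lines that contain end punctuation.
        if re.search(r"[.!?]", line):
            sentences.append(line)

    # Assumption: sentences alternate Spanish, English, Spanish, ...
    return [es + ";" + en
            for es, en in zip(sentences[0::2], sentences[1::2])]
```

Writing the result out with one pair per line gives a file in the semicolon-delimited form described above. A sentence that itself contains a semicolon would need extra handling, which this sketch does not attempt.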
pdftotext is run with UTF-8 encoding. My OS is Linux Mint 9, which is based on Ubuntu 10.04.
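For running the conversion over every pdf in a folder, a minimal Python 3 sketch might look like this. It assumes pdftotext is on the PATH, uses its -enc option for UTF-8 output, and passes "-" as the output file so the text goes to standard output. The function names are made up for illustration.

```python
import glob
import os
import subprocess


def pdftotext_command(pdf_path):
    """Build the pdftotext call for one file, writing text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def convert_folder(folder):
    """Return a dict mapping each pdf path in folder to its extracted text."""
    texts = {}
    for pdf_path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        raw = subprocess.check_output(pdftotext_command(pdf_path))
        texts[pdf_path] = raw.decode("utf-8")
    return texts
```

Each returned text can then be cleaned up and written to its own import file.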
Below are two sample output files.
Output 1
Elementary - Credit Card A:
(B0089)
Me da la cuenta, por favor.
Bring me the check, please.
B:
Se la doy enseguida.
I’ll bring it to you right away.
B:
Perd´n se˜or, pero no aceptamos tarjeta. o n
Sorry sir, but we don’t take cards.
A:
¿No aceptan ninguna tarjeta de cr´dito? e
You don’t take any credit cards?
Key Vocabulary
tarjeta cr´dito e cuenta
Noun Noun Noun
card credit bill
Output 2
Elementary - My computer is not working A: ¡No puede ser!
It can’t be!
(B0079)
B:
¿Qu´ pasa? e
What happened?
A:
Mi computadora no est´ funcionando. a
My computer is not working.
B:
Rein´ ıciala.
Restart it.
Key Vocabulary
funcionar
Verb
to work