I'm using pdftotext to convert Spanish language text. Characters with accents or tildes are garbled in a systematic way that requires further conversion. Each accent or tilde appears in the converted text in the correct position but without its letter. The letter almost always appears at the end of the output line. When it doesn't, I can fix the line by hand.

For example, the pdf sentence

¿Por qué?

becomes

¿Por qu´? e

I know enough about sed, awk and grep to think it can be done with some combination of those - and that it would take me a long time. I intend to use this to process all the pdf files in a folder.

The sentences appear in Spanish-English pairs on separate lines. I'd like to join each pair with a semicolon delimiter, which is the import format of my flash card app (Anki), and delete all content that is not a Spanish-English sentence pair.

For example, convert this output

B:

¿Por qu´? e
Why?

into

¿Por qué?;Why?

Where there are multiple accents, tildes or a mix of both, the letters trailing the line are in the correct order and may be separated by commas. For example, the pdf sentence

Sí pero vi en la televisión que iba a llover.

becomes

S´ pero vi en la televisi´n que iba a llover. ı, o

or S´ pero vi en la televisi´n que iba a llover. ı o

Output File Format

The sentences always have an end punctuation, either "!", "?" or ".". For those unfamiliar with Spanish, vowels (aeiou) are the only letters which may have an accent, the letter "n" is the only one that may have a tilde, and the 2 special characters may be found on both upper and lower case letters.

The first output line may contain the level and title of the pdf. The level and title always precede the first occurrence of "A:"

I'm not interested in the line "Key Vocabulary" or anything that appears on any subsequent lines.

pdftotext was run with UTF-8 encoding. My OS is Linux Mint 9, which is based on Ubuntu 10.04.
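For the "all the pdf files in a folder" part, the conversion step itself could be scripted roughly as follows. This is only a sketch, assuming Python 3 is available and that pdftotext is on the PATH; the helper names pdftotext_command and convert_folder are made up for illustration. It writes a .txt file next to each PDF using the -enc UTF-8 option.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path):
    """Build the pdftotext command line for one PDF (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    # -enc selects the output text encoding; the text file is the last argument.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def convert_folder(folder):
    """Run pdftotext on every *.pdf in folder and return the .txt paths."""
    outputs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        cmd = pdftotext_command(pdf)
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(cmd, check=True)
        outputs.append(Path(cmd[-1]))
    return outputs
```

Calling convert_folder("lessons") would then leave one text file per PDF in a (hypothetical) lessons directory, ready for the accent repair and pair extraction steps.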

Below are two sample output files.

Output 1

Elementary - Credit Card A:

(B0089)

Me da la cuenta, por favor.
Bring me the check, please.

B:

Se la doy enseguida.
I’ll bring it to you right away.

B:

Perd´n se˜or, pero no aceptamos tarjeta. o n
Sorry sir, but we don’t take cards.

A:

¿No aceptan ninguna tarjeta de cr´dito? e
You don’t take any credit cards?


Key Vocabulary

tarjeta cr´dito e cuenta

Noun Noun Noun

card credit bill

Output 2

Elementary - My computer is not working A: ¡No puede ser!
It can’t be!

(B0079)

B:

¿Qu´ pasa? e
What happened?

A:

Mi computadora no est´ funcionando. a
My computer is not working.

B:

Rein´ ıciala.
Restart it.


Key Vocabulary

funcionar

Verb

to work
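
One possible way to turn an output file like the two samples above into Anki import lines is sketched below in Python 3. It assumes the accents have already been repaired, that everything up to the first "A:" is the level and title, that everything from "Key Vocabulary" onward can be dropped, and that the remaining lines ending in ".", "!" or "?" alternate Spanish, English. The function name extract_pairs is made up for illustration.

```python
import re

# A sentence line ends with one of the three end punctuation marks.
ENDS_SENTENCE = re.compile(r"[.!?]$")


def extract_pairs(text):
    """Return 'spanish;english' lines from one converted lesson (sketch)."""
    # Drop "Key Vocabulary" and everything after it.
    text = text.split("Key Vocabulary", 1)[0]
    # The level and title always precede the first "A:", so drop them too.
    _, found, rest = text.partition("A:")
    if found:
        text = rest
    sentences = []
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines, speaker labels ("B:") and codes like "(B0089)"
        # do not end in sentence punctuation, so they are skipped here.
        if ENDS_SENTENCE.search(line):
            sentences.append(line)
    # Assumption: sentences alternate Spanish, English.
    return [es + ";" + en for es, en in zip(sentences[0::2], sentences[1::2])]
```

A sentence that itself contains a semicolon would need extra handling before import, and a line whose accents could not be repaired automatically still has to be fixed by hand.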
A: 

I think this would be difficult with sed or awk.

I suggest using Perl or Vim commands instead (if you know how to use Vim).

A vim command would be:

:%s/^.\{-}\zs´\(.*[.?!]\) ı\(,\|$\)/í\1/
:%s/^.\{-}\zs´\(.*[.?!]\) o\(,\|$\)/ó\1/
:%s/^.\{-}\zs´\(.*[.?!]\) e\(,\|$\)/é\1/
: " etc

Repeat these until no stray vowel is left at the end of a line after the closing punctuation (".", "?" or "!").

\zs sets the start of the match, and \1 is a back-reference to the group in \(...\), that is, everything between the accent mark and the closing punctuation.
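
If Vim is not an option, the same pairing of accent marks with trailing letters can be done in one pass per line. The following Python 3 sketch is not a tested production script: it assumes the stray marks are U+00B4 (acute) and U+02DC (small tilde), that the dotless ı is U+0131, and that a line is only repaired when the number of marks equals the number of trailing letters. The function name repair_line is made up for illustration.

```python
import re
import unicodedata

# Assumed code points for the stray marks in the pdftotext output.
COMBINING = {"\u00b4": "\u0301",   # spacing acute -> combining acute
             "\u02dc": "\u0303"}   # small tilde   -> combining tilde

# Sentence up to its last end punctuation, then single letters
# separated by spaces and/or commas (for example " \u0131, o").
TAIL = re.compile(r"^(.*[.!?])((?:[ ,]+[A-Za-z\u0131])+)\s*$")
MARK = re.compile("[\u00b4\u02dc]")


def repair_line(line):
    """Move trailing letters back under their accent marks (sketch)."""
    m = TAIL.match(line)
    if not m:
        return line                      # nothing trails the sentence
    body, tail = m.group(1), m.group(2)
    letters = re.findall(r"[A-Za-z\u0131]", tail)
    if len(MARK.findall(body)) != len(letters):
        return line                      # ambiguous: leave for manual fixing
    pending = iter(letters)

    def compose(match):
        # Dotless i becomes a plain i before the accent is attached.
        letter = next(pending).replace("\u0131", "i")
        return unicodedata.normalize("NFC", letter + COMBINING[match.group(0)])

    return MARK.sub(compose, body)
```

For example, repair_line applied to the garbled "¿Por qu´? e" line gives "¿Por qué?", while a line such as "Rein´ ıciala.", where the letter is not at the end, is returned unchanged and still needs fixing by hand.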

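If you want to process all the text files produced by pdftotext at once, do as follows. The e flag at the end of each substitution suppresses the "pattern not found" error, which would otherwise stop :bufdo at the first buffer without a match.

vim *.txt
:set hidden   "allows modifying a not-on-display buffer
:bufdo %s/^.\{-}\zs´\(.*[.?!]\) ı\(,\|$\)/í\1/e
:bufdo %s/^.\{-}\zs´\(.*[.?!]\) o\(,\|$\)/ó\1/e
: " etc
:next         "allows you to see other buffers to validate
:bufdo w      "will save all buffers
:q            "will quit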
Benoit
@Benoit: My knowledge of vim is close to zero, so I wasn't able to test this.
Bill Lapworth
A: 
Dennis Williamson
The "inconsistencies between your two output examples" are intentional, both are valid output. I tried to explain this in my question by saying "The first output line may contain the level and title of the pdf. The level and title always precede the first occurrence of "A:""
Bill Lapworth
Thank you for identifying where my examples did not match the question. I've updated the question.
Bill Lapworth
@Bill: I made a small change that should be able to handle cases where the first line includes a line of data.
Dennis Williamson