ansaurus

Question

Regex query: how can I search PDFs for a phrase where words in that phrase appear on more than one line?

Answer 1

A:

You have hard-coded spaces in your names. Replace them with \s+ and you should be OK.

E.g.:

CB\s+Richard\s+Ellis

What's happening is that when the phrase runs over a forced line break, the text no longer has a space (" ") character at that point. Instead it has \n or \r\n. \s+ matches one or more whitespace characters of any kind, including carriage returns and linefeeds.
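A minimal sketch of the difference, written in Python rather than PowerGREP (the sample sentence is made up for illustration):

```python
import re

# Hypothetical extracted text: the line break falls inside the name.
text = "Offices were let by CB Richard\r\nEllis on behalf of the landlord."

literal = re.compile(r"CB Richard Ellis")        # hard-coded spaces
flexible = re.compile(r"CB\s+Richard\s+Ellis")   # any run of whitespace

# The literal pattern fails because "\r\n" is not a space character.
print(literal.search(text))

# The \s+ pattern matches across the line break.
print(flexible.search(text))
```

The same pattern still matches the phrase when it sits on one line, so nothing is lost by switching to \s+.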

Robusto 2010-05-07 14:22:17

Thank you both for your answers. They make sense but unfortunately neither suggestion worked - the search still did not pick up the phrase in the example I used above. I am at a loss ... do you have any other ideas? I checked the original doc (InDesign CS3) and there is just one space between the words - which appear on different lines. This shows up on the PDF as more than one space because the text has not been justified, rather it is simply ranged left, which results in a ragged right-hand edge to the column of text. I'm wondering if this is a clue to what else I could include in the regex?

Alison 2010-05-07 15:14:40

@Alison: OK, I checked up on this, and it appears a PDF stores text in a binary (often compressed) form, not as the plain string you think you are searching. PowerGREP is supposed to be able to find strings like the ones you show above, which it does by creating an internal plain-text representation of the PDF. So if it's not working, I can only imagine that the PDF conversion from InDesign converted the text to vector graphics or something similar. Check the settings on your PDF converter to see if it's anything like that. Sorry to have nothing else to add.
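To illustrate why a regex run against the raw bytes of a file can miss text that is plainly visible on the page, here is a rough Python sketch. It only imitates the compression step (PDF content streams are commonly deflate-compressed); it is not a PDF parser, and the page text is invented:

```python
import re
import zlib

pattern = re.compile(rb"CB\s+Richard\s+Ellis")

# Hypothetical page text, repeated so that compression clearly kicks in.
page_text = b"CB Richard\nEllis " * 20

# Roughly what a PDF writer might store: the same bytes, deflate-compressed.
stored = zlib.compress(page_text)

# Searching the stored bytes finds nothing; the phrase is not there verbatim.
print(pattern.search(stored))

# Searching the decoded text finds the phrase.
print(pattern.search(zlib.decompress(stored)))
```

A real PDF adds fonts, encodings and text-positioning operators on top of this, which is why a tool has to build a plain-text representation before the regex can work.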

Robusto 2010-05-07 16:36:17

`grep` is traditionally a line-based activity. I'm sure PowerGREP *can* find matches that span multiple lines, but does it do so by default? I would look for an option to enable that.
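The line-based behaviour can be sketched in Python as follows. This assumes nothing about PowerGREP's actual options; it only shows why testing one line at a time cannot find a phrase that is split across two lines (the sample text is hypothetical):

```python
import re

pattern = re.compile(r"CB\s+Richard\s+Ellis")
text = "The agent was CB Richard\nEllis of London.\n"

# Line-based search, the way classic grep works: each line is tested alone.
line_hits = [line for line in text.splitlines() if pattern.search(line)]

# Whole-text search: the pattern sees the newline, and \s+ can consume it.
whole_hit = pattern.search(text)

print(line_hits)
print(whole_hit.group(0) if whole_hit else None)
```

Neither line contains the full phrase on its own, so the line-based list comes back empty while the whole-text search succeeds.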

Alan Moore 2010-05-07 21:55:18

Answer 2

A:

The regex token for matching whitespace is \s, so the pattern would be

\bCB\s+Richard\s+Ellis\b

(\s+ = match at least one whitespace character). Line breaks are \n (newline) and \r (carriage return), depending on your OS. So a character class, written with [], that includes all of them, [\r\n\s], would result in:

\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b

arnep 2010-05-07 14:22:21

Thank you both for your answers. They make sense but unfortunately neither suggestion worked - the search still did not pick up the phrase in the example I used above. I am at a loss ... do you have any other ideas? I checked the original doc (InDesign CS3) and there is just one space between the words - which appear on different lines. This shows up on the PDF as more than one space because the text has not been justified, rather it is simply ranged left, which results in a ragged right-hand edge to the column of text. I'm wondering if this is a clue to what else I could include in the regex?

Alison 2010-05-07 15:12:16

@arnep: `\s` already matches `\r` and `\n`; you don't have to list them separately.
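A quick check of that point in Python's `re` module (other regex flavors may differ in detail, and the sample strings are made up):

```python
import re

short = re.compile(r"\bCB\s+Richard\s+Ellis\b")
verbose = re.compile(r"\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b")

# Hypothetical inputs with different kinds of whitespace between the words,
# plus one string that should not match at all.
samples = [
    "CB Richard Ellis",
    "CB Richard\nEllis",
    "CB\r\nRichard   Ellis",
    "CB\tRichard\rEllis",
    "CBRichard Ellis",
]

# True for each sample where both patterns agree on whether the phrase is present.
agreement = [bool(short.search(s)) == bool(verbose.search(s)) for s in samples]
print(all(agreement))

# \s on its own matches each line-break character.
print(all(re.fullmatch(r"\s", ch) for ch in "\r\n"))
```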

Alan Moore 2010-05-07 21:24:32

@alan: yes, you are right, thank you for the clarification :-)

arnep 2010-05-10 08:14:23
