I am trying to set up an index page for the weekly magazine I work on. It is to show readers the names of companies mentioned in that week's issue, plus the page numbers they appear on.

I want to search all the PDF files for the week, where one PDF = one magazine page (originally made in Adobe InDesign CS3 and Adobe InCopy CS3).

I have set up a list of companies I want to search for and, using PowerGREP with delimited regular expressions, I am able to find most page numbers where a company is mentioned. However, where a company name contains two or more words, the search I am running will not pick up instances where the name is split across more than one line.

For example, when looking for "CB Richard Ellis" and "Cushman & Wakefield", I got no result when the text appeared like this:

DTZ beat BNP PRE, CB [line break here]

Richard Ellis and Cushman & [line break here]

Wakefield to secure the contract. [line end here]

Could someone advise me on how to write a regular expression that will ignore white space between words and ignore line endings, OR one that will look for the words including all types of white space (i.e. uneven spaces between words, spaces at the end of lines or line endings, and tabs)? I am guessing that this information is embedded somehow in PDF files.

Here is a sample of the set of terms I have asked PowerGREP to search for:

\bCB Richard Ellis\b
\bCB Richard Ellis Hotels\b
\bCentaur Services\b
\bChapman Herbert\b
\bCharities Property Fund\b
\bChetwoods Architects\b
\bChurch Commissioners\b
\bClive Emson\b
\bClothworkers’ Company\b
\bColliers CRE\b
\bCombined English Stores Group\b
\bCommercial Estates Group\b
\bConnells\b
\bCooke & Powell\b
\bCordea Savills\b
\bCrown Estate\b
\bCushman & Wakefield\b
\bCWM Retail Property Advisors\b

[Note that there is a delimited hard return between the \b at the end of each phrase and the \b at the beginning of the next phrase.]

By the way, I am a production journalist and not usually involved in finding IT-type solutions and am finding it difficult to get to grips with the technical language on the PowerGREP site.

Thanks for assistance

Alison

A: 

You have hard-coded spaces in your names. Replace them with \s+ and you should be OK.

E.g.:

CB\s+Richard\s+Ellis

What's happening is that a forced line break replaces the space (" ") character with \n or \r\n. Using \s+ means that you are looking for one or more whitespace characters of any kind, including carriage returns and linefeeds.
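
To illustrate the difference, here is a minimal sketch using Python's `re` module rather than PowerGREP itself, so PowerGREP's behaviour may differ in detail. The `text` string is an assumption: it is the sample from the question with a plain `\n` at each line break, which may not be what the PDF actually contains.

```python
import re

# Sample text from the question, assuming each line break is a plain "\n".
text = "DTZ beat BNP PRE, CB\nRichard Ellis and Cushman &\nWakefield to secure the contract.\n"

# A literal space only matches the space character, so the wrapped name is missed.
literal = re.compile(r"\bCB Richard Ellis\b")
# \s+ matches one or more whitespace characters of any kind, including line breaks.
flexible = re.compile(r"\bCB\s+Richard\s+Ellis\b")

found_literal = literal.search(text) is not None
found_flexible = flexible.search(text) is not None
print(found_literal, found_flexible)  # False True
```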

Robusto
Thank you both for your answers. They make sense but unfortunately neither suggestion worked - the search still did not pick up the phrase in the example I used above. I am at a loss ... do you have any other ideas? I checked the original doc (InDesign CS3) and there is just one space between the words - which appear on different lines. This shows up on the PDF as more than one space because the text has not been justified, rather it is simply ranged left, which results in a ragged right-hand edge to the column of text. I'm wondering if this is a clue to what else I could include in the regex?
Alison
@Alison: OK, I checked up on this, and it appears a PDF stores text in binary code, not as the readable string you think you are looking for. PowerGREP is supposed to be able to find strings like the ones you show above, which it does by creating an internal representation of the text as vanilla strings. So if it's not working, I can only imagine that the process of PDF conversion from InDesign may have converted the text to vector graphics, or something similar. Check the settings on your PDF converter to see if it's anything like that. Sorry to have nothing else to add.
Robusto
`grep` is traditionally a line-based activity. I'm sure PowerGREP *can* find matches that span multiple lines, but does it do so by default? I would look for an option to enable that.
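
The effect of line-based searching can be sketched in Python (an illustration only; it does not reproduce PowerGREP's options, and the sample `text` assumes plain `\n` line breaks). Even a whitespace-tolerant pattern finds nothing when each line is tested separately:

```python
import re

# Sample text from the question, assuming each line break is a plain "\n".
text = "DTZ beat BNP PRE, CB\nRichard Ellis and Cushman &\nWakefield to secure the contract.\n"
pattern = re.compile(r"\bCB\s+Richard\s+Ellis\b")

# Line-based search: each line is tested on its own, so the name is never seen whole.
hits_per_line = [line for line in text.splitlines() if pattern.search(line)]
# Whole-text search: the line break is just another whitespace character to \s+.
hit_whole = pattern.search(text)

print(len(hits_per_line), hit_whole is not None)  # 0 True
```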
Alan Moore
A: 

The regex for matching whitespace is \s, so it would be

\bCB\s+Richard\s+Ellis\b

(\s+ = match at least one whitespace character). Line breaks are \n (newline) and \r (carriage return), depending on your OS. Combining them in a character class with [], as [\r\n\s], would result in:

\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b
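
Rewriting every name in a long list by hand is error-prone, so the substitution could be automated. The following is a rough sketch in Python, not a PowerGREP feature. The helper name `whitespace_tolerant` is hypothetical, and the sample `text` assumes plain `\n` line breaks.

```python
import re

def whitespace_tolerant(name: str) -> str:
    """Turn a plain company name into a pattern allowing any whitespace between words."""
    words = name.split()
    # re.escape keeps characters such as "&" or "." literal inside the pattern.
    return r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"

names = ["CB Richard Ellis", "Cushman & Wakefield", "Connells"]
patterns = {name: re.compile(whitespace_tolerant(name)) for name in names}

# Sample text from the question, assuming each line break is a plain "\n".
text = "DTZ beat BNP PRE, CB\nRichard Ellis and Cushman &\nWakefield to secure the contract.\n"
mentioned = [name for name, pat in patterns.items() if pat.search(text)]
print(mentioned)  # ['CB Richard Ellis', 'Cushman & Wakefield']
```

The `\b` anchors assume each name starts and ends with a letter or digit, which holds for the names listed in the question.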
arnep
@arnep: `\s` already matches `\r` and `\n`; you don't have to list them separately.
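
This is easy to confirm in Python's `re` module; other regex engines are assumed, not verified here, to behave the same way for these characters:

```python
import re

line_endings = ["\n", "\r", "\r\n"]

# \s on its own already covers each style of line ending...
plain = [re.fullmatch(r"\s+", e) is not None for e in line_endings]
# ...so the longer class [\r\n\s] accepts exactly the same strings.
verbose = [re.fullmatch(r"[\r\n\s]+", e) is not None for e in line_endings]

print(all(plain), plain == verbose)  # True True
```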
Alan Moore
@alan: ya, you are right, thank you for clarification :-)
arnep