I need to do a "find and replace" on about 45k lines of a CSV file and then put this into a database.

I figured I should be able to do this with PHP and preg_replace but can't seem to figure out the expression...

The lines consist of one field and are all in the following format:

"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg" or "./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg"

The first part will always be a period, the second part will always be one alphanumeric character, the third will always be three alphanumeric characters and the fourth should always be between 1 and 13 alphanumeric characters.

I came up with the following, which seems right to me. However, I freely admit I don't know much about regular expressions; they're a little new to me, so I'm probably making a whole load of silly mistakes here...

$pattern = "/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)$/";
$new = preg_replace($pattern, " ", $i);

Anyway any and all help appreciated!

Thanks, Phil

+2  A: 

The only mistake I see is the end-of-string anchor $, which should be removed. Your expression is also missing the _ character:

/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z_]{1,13}\/)/

A more general pattern would be to just exclude the /:

/^(\.\/[^\/]{1}\/[^\/]{3}\/[^\/]{1,13}\/)/
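To make the difference concrete, here is a minimal sketch that applies both patterns above to the two sample lines from the question. The variable names are made up for illustration, and it replaces the prefix with an empty string rather than the space used in the question, which is an assumption about what is wanted in the database.

```php
<?php
// Sample lines taken from the question.
$lines = [
    './1/024/9780310320241/SPSTANDARD.9780310320241.jpg',
    './t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg',
];

// Explicit character classes, with "_" allowed in the last directory segment.
$strict = '/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z_]{1,13}\/)/';
// More general: any non-slash characters, with the same length limits.
$loose = '/^(\.\/[^\/]{1}\/[^\/]{3}\/[^\/]{1,13}\/)/';

$strictResults = [];
$looseResults = [];
foreach ($lines as $line) {
    // Strip the directory prefix (assumption: an empty replacement is wanted).
    $strictResults[] = preg_replace($strict, '', $line);
    $looseResults[] = preg_replace($loose, '', $line);
}

print_r($strictResults);
print_r($looseResults);
```

Both patterns leave only the file name for these two lines; the second one would also accept segments containing hyphens or other punctuation.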
Gumbo
Thanks, works fine now! Nice to know I was only making one tiny mistake! The second example throws an error, however: "Warning: preg_replace() [function.preg-replace]: Unknown modifier ']'". The first one works fine though. Thanks again!
phil
@phil: Fixed it.
Gumbo
A: 

The $ means the end of the string. So your pattern would only match something like ./1/024/9780310320241/ if it were alone on its line (./t/fla/8204909_flat/ would also need _ in the character class). Remove the $ and it will match the first four parts of your string, replacing them with a space.
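A small sketch of the effect of the anchor, using the pattern from the question with and without the trailing $. The variable names are illustrative only.

```php
<?php
// Pattern from the question, including the trailing "$" anchor.
$anchored = '/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)$/';
// The same pattern with the "$" removed.
$unanchored = '/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)/';

$prefixOnly = './1/024/9780310320241/';
$fullLine = './1/024/9780310320241/SPSTANDARD.9780310320241.jpg';

// preg_match() returns 1 on a match and 0 when nothing matches.
$anchoredOnPrefix = preg_match($anchored, $prefixOnly);  // prefix alone: matches
$anchoredOnFull = preg_match($anchored, $fullLine);      // file name follows: no match
$unanchoredOnFull = preg_match($unanchored, $fullLine);  // matches the leading prefix

// With the anchor gone, the prefix is replaced by a single space.
$replaced = preg_replace($unanchored, ' ', $fullLine);

var_dump($anchoredOnPrefix, $anchoredOnFull, $unanchoredOnFull, $replaced);
```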

Ölbaum
A: 
$pattern = "/(\.\/[0-9a-z]{1}\/[0-9a-z]{3}\/[0-9a-z\_]+\.(jpg|bmp|jpeg|png))\n/is";

I just saw that your example string doesn't end with /, so maybe you should remove it from the end of your pattern. Also, the underscore is used in the filename and should be in the character class.

stefita
+1  A: 

You should use PHP's builtin parser for extracting the values out of the csv before matching any patterns.
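As a rough sketch of that idea, the snippet below parses each line with str_getcsv() and only then applies a pattern to the first field. The in-memory sample data and variable names are assumptions for illustration; a real script would read the file, for example with fgetcsv().

```php
<?php
// Hypothetical sample: one quoted and one unquoted single-field line.
$csv = "\"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg\"\n"
     . "./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg\n";

// Directory-prefix pattern without the "$" anchor and with "_" allowed.
$pattern = '/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z_]{1,13}\/)/';

$cleaned = [];
foreach (explode("\n", trim($csv)) as $line) {
    // str_getcsv() strips any enclosing quotes and returns the fields as an array.
    $fields = str_getcsv($line, ',', '"', '\\');
    // Apply the regex to the parsed value, not to the raw CSV line.
    $cleaned[] = preg_replace($pattern, '', $fields[0]);
}

print_r($cleaned);
```

Parsing first means the pattern does not have to care whether a value was quoted in the file.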

soulmerge
The values do not have quotation marks surrounding them in the file that this is processing. Purely out of educational interest, how would I go about performing the same pattern replacement without using regex? I wouldn't know where to begin, I'm afraid.
phil
Sorry, I didn't read your question well enough. I guess you *must* use regular expressions here, but I would extract the values out of the csv first, and apply the RE afterwards.
soulmerge
A: 

I'm not sure I understand what you're asking. Do you mean every line in the file looks like that, and you want to process all of them? If so, this regex would do the trick:

'#^.*/#'

That simply matches everything up to and including the last slash, which is what your regex would do if it weren't for that rogue '$' everyone's talking about. If there are other lines in other formats that you want to leave alone, this regex will probably suit your needs:

'#^\./\w/\w{3}/\w{1,13}/#'

Notice how I changed the regex delimiter from '/' to '#' so I don't have to escape the slashes inside. You can use almost any punctuation character for the delimiters (but of course they both have to be the same).
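Here is a short sketch comparing the two regexes, plus the slash-delimited equivalent of the second one. The $other line is a made-up example of a line in a different format, and the variable names are illustrative.

```php
<?php
$greedy = '#^.*/#';                     // everything up to the last slash
$specific = '#^\./\w/\w{3}/\w{1,13}/#'; // only the expected directory layout
// The same specific pattern with "/" delimiters needs every slash escaped.
$specificSlashes = '/^\.\/\w\/\w{3}\/\w{1,13}\//';

$line = './t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg';
$other = './images/misc/logo.png'; // hypothetical line in another format

$greedyLine = preg_replace($greedy, '', $line);
$specificLine = preg_replace($specific, '', $line);
$slashesLine = preg_replace($specificSlashes, '', $line);

// The greedy pattern strips the path from any line; the specific one
// leaves lines in other formats untouched.
$greedyOther = preg_replace($greedy, '', $other);
$specificOther = preg_replace($specific, '', $other);

var_dump($greedyLine, $specificLine, $slashesLine, $greedyOther, $specificOther);
```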

Alan Moore
That's much cleaner. The lines should all be in the same format, but I don't want to assume that, so I used the second version; I just needed to change it to [\w-] to account for hyphens as well. Am I right in assuming that \w is alphanumeric characters and underscores?
phil
Yes, `\w` is the same as `[A-Za-z0-9_]`. In some other regex flavors it also matches accented letters plus letters and digits from other writing systems, but PHP's `\w` is limited to ASCII by default.
Alan Moore