tags:

views:

386

answers:

4

I sometimes import data from CSV files that were provided to me, into a mysql table.

In the last one I did, some of the entries has a weird bad character in front of the actual data, and it got imported in my database. Now I'm looking for a way to clean it up.

The bad data is in the mysql column 'email', it seems to be always right in front of the actual data. When trying to print it on my screen using PHP, it shows up as �. When exporting it to a CSV file, it looks like  , and if I SET CHARACTER SET utf8 before printing it on the screen using PHP, it looks like a normal space ' '.

I was thinking of writing a PHP script that goes over all my rows one at a time, fix the email address field, and update the row. However I'm not quite sure about the "fix the email" part!

I was thinking maybe to do a "explode" and use the bad character as a delimiter, but I don't know how to type that character into my code.

Is there maybe a way to find the underlying value/utf8/hex or whatever of that character, then find it in the string?

I hope it's clear enough.

Thanks

EDIT: In Hex, it looks like it's A0. What can I do to search and delete a character by its hex value? Either in PHP or directly in MySQL I guess ...

A: 

you can trim any leading unprintable ascii char with something like:

update t set email = substr(email, 2) where ascii(email) not between 32 and 126

you can get the ascii value of the offending char with this:

select ascii(email) as first_char
jspcal
The problem is that not all rows have that problem. So I don't want to erase legitimate first characters.
nute
you can use a where (added above)
jspcal
this selects almost ALL the rows. I think something is wrong with the WHERE part. Shouldn't it only check the first char?
nute
probably the vals have more unprintables, since the query will only select unprintables (or empty string)
jspcal
A: 

SELECT HEX(field) FROM table; should help determine the character.

ggiroux
It looks like it's 'A0'. Would that make sense?
nute
yes its the non-breaking space character. try to see if you can get rid of it with replace (on a test dataset) update table set email=replace(email,char(160),'');
ggiroux
it said 87 rows affected yet it still looks the same
nute
weird -- does select HEX(email) still shows some A0 ?
ggiroux
A: 

As an alternative solution, it might actually be easier to fix the issue at the source. I've encountered similar problems with CSV files exported from Excel and have generally found that using something along the lines of...

$correctedLine = mb_convert_variables('UTF-8', 'Windows-1252', $sourceLine);

...tends to rectify the issue. (That said, you'll need to ensure that you have the multi byte string extension compiled in/enabled.)

middaparka
A: 

I think I found a PHP answer that seems to work more reliably:

$newemail = preg_replace('/\xA0/', '', $row['oldemail']);

And then I'm going to update the row with the new email

nute
it worked! (comments have to be at least 15 characters, so I'm adding this)
nute