I am inserting text scraped from the web into my database. Some of the fields in the string contain unprintable or otherwise odd characters. For example,

if the text is "C__O__?__P__L__E__T__E",
then the text stored in the database is only "C__O__"

I know about h(), strip_tags(), sanitize, and so on, but I do not want to sanitize this SQL. ActiveRecord logs the SQL correctly, and when I run the logged query in phpMyAdmin it executes correctly. Something goes wrong between the SQL query being generated and it being executed.

Help is much appreciated.

A: 

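Hmmmm... using CGI escape, I found out that the character coming into the system is not what I expected it to be. It is not a real question mark (%3F) but a different byte (%D5) that only displays as a question mark. The first line below is what I actually received, the second is what I expected: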

C__%D5__M__P__L__%80___T__%80__
C__%3F__M__P__L__%3F___T__%3F__
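
A minimal sketch of that check, for anyone who wants to repeat it. The helper name show_bytes and the sample strings are made up for illustration; CGI.escape is the standard library call, and .b just treats the string as raw bytes so invalid sequences do not get in the way.

```ruby
require 'cgi'

# Hypothetical helper: percent-encode a string byte by byte so that
# characters which merely *display* as "?" reveal their real values.
def show_bytes(text)
  CGI.escape(text.b) # .b returns a binary (ASCII-8BIT) copy of the string
end

puts show_bytes("C__?__M")    # a genuine question mark  => C__%3F__M
puts show_bytes("C__\xD5__M") # a stray non-ASCII byte   => C__%D5__M
```

If the escaped output shows anything other than %3F where you see a question mark, the problem is the input bytes, not the SQL.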

In the end I used gsub to strip the non-printable characters before saving:

gsub(/[^[:print:]]/, '')

Only after removing the invalid characters from my string was I able to save the item properly. None of the other solutions worked, partly because the problem was not clearly understood up front.
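
A rough sketch of the same cleanup on a current Ruby. Two assumptions here that are not from the original setup: the helper name strip_unprintable is made up, and the scrub call is an addition, because on Ruby 1.9 and later a regex gsub raises ArgumentError when the string contains bytes that are invalid in its encoding (String#scrub needs Ruby 2.1 or later).

```ruby
# Hypothetical helper: drop bytes that are invalid in the string's encoding,
# then drop anything that is not a printable character.
def strip_unprintable(text)
  text.scrub('')                   # remove invalid byte sequences first
      .gsub(/[^[:print:]]/, '')    # then the gsub from the answer above
end

puts strip_unprintable("C__\xD5__M") # => C____M
puts strip_unprintable("A\u0000B")   # => AB
```

Note that this throws data away. It gets the row saved, but whatever character the stray byte was meant to be is lost.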

Ryan Oberoi
A: 

Can you escape the question mark using "\?"?

Toby Hede
+1  A: 

Just treat the question mark in the string as a placeholder and pass a string containing a question mark as its bind value. I haven't found any other way either:

["C__O__?__P__L__E__T__E", '?']

works perfectly.

A: 

I know this is way late, but I ran into the same problem when we were trying to process a file as UTF-8 that actually used the ISO-8859-1 character encoding. I suspect you had a similar issue in your scraping where you assumed the wrong encoding and it ended up causing things to fail.
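
If the bytes really are ISO-8859-1, something along these lines should convert them instead of dropping them. This is only a sketch: the helper name latin1_to_utf8 and the sample string are made up, and it assumes you already know (or have guessed correctly) that the source encoding is ISO-8859-1.

```ruby
# Hypothetical helper: relabel raw bytes as ISO-8859-1, then transcode to UTF-8.
# force_encoding only changes the label; encode does the actual conversion.
def latin1_to_utf8(bytes)
  bytes.dup
       .force_encoding(Encoding::ISO_8859_1)
       .encode(Encoding::UTF_8)
end

raw = "caf\xE9".b              # 0xE9 is "é" in ISO-8859-1, invalid alone in UTF-8
fixed = latin1_to_utf8(raw)
puts fixed                     # => café
puts fixed.valid_encoding?     # => true
```

Every byte value is defined in ISO-8859-1, so this never raises, but it will happily produce wrong characters if the input was actually in some other encoding.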

Russ