I'm accepting a string from a feed for display on the screen that may or may not include some rubbish I want to filter out. I don't want to filter normal symbols at all.

The values I want to remove look like this: �

It is only this that I want removed. Relevant technology is PHP.

Suggestions appreciated.

A: 

If you can't resolve the issue with the data from the feed and need to filter the information, then this may help:

PHP 5's filter_input is very good for filtering input strings and allows a fair amount of flexibility:

filter_input(input_type, variable, filter, options)

You can also filter all of your form data in one line if it requires the same filtering :)

There are some good examples and more information about it here:

http://www.w3schools.com/PHP/func%5Ffilter%5Finput.asp

The PHP site has more information on the options here: Validation Filters
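For a string that is already in a variable (such as text read from a feed) rather than in GET/POST input, the related filter_var function applies the same filters. What follows is only a minimal sketch, assuming the unwanted characters are all non-ASCII. The helper name strip_high_bytes is hypothetical. Note that it removes every byte above 127, so accented letters and symbols such as £ are dropped too:

```php
<?php
// Hypothetical helper: remove every byte above 0x7F from a string.
function strip_high_bytes(string $text): string
{
    // FILTER_UNSAFE_RAW changes nothing by itself; the flag does the stripping.
    $clean = filter_var($text, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);

    // filter_var returns false on failure; fall back to an empty string.
    return $clean === false ? '' : $clean;
}

echo strip_high_bytes("Price: 5 \u{FFFD}units"), "\n";
```

Adding FILTER_FLAG_STRIP_LOW as well would also remove control characters below 32, including tabs and newlines.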

Andi
A: 

You are looking for characters that are outside the range of glyphs that your font can display. You can find the maximum Unicode value that your font can display, and then create a regex that replaces anything above that value with an empty string. An example would be

s/[\u00FF-\uFFFF]//

This would strip character 255 (ÿ) and everything above it; start the range at \u0100 to keep character 255.
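In PHP specifically, the preg functions do not accept the \uXXXX escape. PCRE writes code points as \x{XXXX}, and the /u modifier is needed so that the pattern and the subject are treated as UTF-8. A rough sketch of the same idea, assuming the incoming text is valid UTF-8 (the function name strip_above_latin1 is made up for illustration):

```php
<?php
// Hypothetical helper: drop every character above U+00FF.
function strip_above_latin1(string $text): string
{
    // /u switches PCRE to UTF-8 mode so \x{...} can name code points above 255.
    $clean = preg_replace('/[^\x{0000}-\x{00FF}]/u', '', $text);

    // preg_replace returns null on error, e.g. when $text is not valid UTF-8.
    // In that case the input is returned unchanged.
    return $clean ?? $text;
}

echo strip_above_latin1("caf\u{00E9} \u{FFFD}ok"), "\n";
```

Characters up to U+00FF, such as é and £, survive; the replacement character U+FFFD and anything else above that range is removed.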

Chris Marasti-Georg
It's just Arial mate... do you happen to know what the maximum unicode value is for that?
Evernoob
Looking through the set in charmap, it looks like Arial includes a lot of Unicode glyphs, but there are some holes. For instance, it jumps from 04E9 to 05B0, with none of the glyphs between. You would either need a way to get that information from the font, simply strip everything above a certain range and accept that you may lose information, or deal with the data issues upstream. If it's coming from Office (which uses special quote/apostrophe characters), you could try using an Office font.
Chris Marasti-Georg
Yep. I just found that. And the other problem with the above solution is that regex rejects the range as being too large to match on.
Evernoob
You could try something like s/[^\u0000-\u00FF]//, which rejects any characters not in the range of 0-255.
Chris Marasti-Georg
A: 

That's going to be difficult for you to do, since you don't have a solid definition of what to filter and what to keep. Typically, characters that show up as empty squares are anything that the typeface you're using doesn't have a glyph for, so the definition of "stuff that shows up like this: �" is horribly inexact.

It would be much better for you to decide exactly what characters are valid (this is always a good approach anyway, with any kind of data cleanup) and discard everything that is not one of those. The PHP filter function is one possibility to do this, depending on the level of complexity and robustness you require.
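One way to express "decide what is valid and discard the rest" is an allowlist regex. The sketch below is only an illustration: it assumes UTF-8 input, treats printable ASCII plus tab, carriage return and line feed as valid, and takes any further permitted characters as a parameter. The function name keep_allowed and the default extras (£ and €) are assumptions, not something the feed dictates:

```php
<?php
// Hypothetical helper: keep printable ASCII, basic whitespace and a
// caller-supplied set of extra characters; remove everything else.
function keep_allowed(string $text, string $extra = "\u{00A3}\u{20AC}"): string
{
    // preg_quote escapes anything in $extra that is special inside a pattern.
    $pattern = '/[^\x20-\x7E\t\r\n' . preg_quote($extra, '/') . ']/u';

    // preg_replace returns null when $text is not valid UTF-8; an empty
    // string is returned in that case rather than unfiltered data.
    return preg_replace($pattern, '', $text) ?? '';
}

echo keep_allowed("Price: \u{00A3}5 \u{FFFD}\u{25A1} ok\n");
```

Widening or narrowing the set of valid characters then means editing one argument rather than a long escaped character class.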

Andrzej Doyle
+6  A: 

This is an encoding problem; you shouldn't try to clean those bogus characters, but rather understand why you're receiving them scrambled.

Try to get your data as Unicode, or reach an agreement with your feed provider so that you both use the same encoding.
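If the feed turns out to use a single-byte encoding while the page is served as UTF-8, converting on the way in may be enough. This is a sketch under stated assumptions: the mbstring extension is available, and Windows-1252 is only a guess at the source encoding, so it is left as a parameter. The function name to_utf8 is hypothetical:

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise convert from an
// assumed source encoding (Windows-1252 here is a guess, not a known fact).
function to_utf8(string $raw, string $assumedSource = 'Windows-1252'): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already valid UTF-8, nothing to do
    }
    return mb_convert_encoding($raw, 'UTF-8', $assumedSource);
}

echo to_utf8("\xA3" . "5"), "\n"; // byte 0xA3 is the pound sign in Windows-1252
```

The validity check is a heuristic: a short Windows-1252 string can happen to be valid UTF-8 by accident, so knowing the real encoding from the provider is still the safer route.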

Rubens Farias
The problem is that I don't have any control over how the data comes in. It comes in the form that I receive it and it's my job to get it into a state that's fit for the screen.
Evernoob
Then change your encoding to match the source.
Brock Woolf
The problem is probably not the encoding; it's probably the font used to display the characters.
Chris Marasti-Georg
font is Arial, and I'm coming up donuts so far guys.
Evernoob
@Evernoob have you looked at my suggestion of using filter_input? Was that not useful? Cheers
Andi
Yeah sorry didn't work SocialAddict but thanks anyway.
Evernoob
A: 

Thanks for the responses, guys. Unfortunately, those submitted had the following problems:

This one is wrong for obvious reasons, since it would also strip the normal symbols I want to keep:

ereg_replace("[^A-Za-z0-9]", "", $string);

This:

s/[\u00FF-\uFFFF]//

which also uses the deprecated ereg form of regex, didn't work either when I converted it to preg, because the range was simply too large for the regex to handle. Also, there are holes in that range that would allow rubbish to seep through.

This suggestion:

This is an encoding problem; you shouldn't try to clean those bogus characters, but rather understand why you're receiving them scrambled.

while valid, is no good because I don't have any control over how the data I receive is encoded. It comes from an external source. Sometimes there's garbage in there and sometimes there is not.

So, the solution I came up with was relatively dirty, but in the absence of something more robust I'm just accepting all standard letters, numbers and symbols and discarding the rest.

This does seem to work for now. The solution is as follows:

$fixT = str_replace("£", "&pound;", $string); 
$fixT = str_replace("€", "&euro;", $fixT);
$fixT = preg_replace("/[^a-zA-Z0-9\s\.\/:!\[\]\*\+\-\|\<\>@#\$%\^&\(\)_=\';,'\?\\\{\}`~\"]/", "", $fixT);

If anyone has any better ideas I'm still keen to hear them. Cheers.

Evernoob
The 3rd line of your solution could probably be changed to [^ -ÿ], which is the space character through the 255th character. It would strip out line feeds, carriage returns, and tabs, so if you wanted to leave that whitespace in, you could use [^ -ÿ\t\r\n], or [^ -ÿ\s]
Chris Marasti-Georg
A: 

Take a look at this question to get the value of each byte in your string. (This assumes that multibyte overloading is turned off.)

Once you have the bytes, you can use them to determine what these "rubbish" characters actually are. It's possible that they're a result of misinterpreting the encoding of the string, or displaying it in the wrong font, or something else. Post them here and people can help you further.
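A quick way to see the raw bytes is a hex dump. This is a minimal sketch using only core PHP functions; the function name hex_dump_bytes is made up for illustration:

```php
<?php
// Hypothetical helper: show each byte of a string as two hex digits.
function hex_dump_bytes(string $text): string
{
    // bin2hex gives one long hex string; split it into byte-sized pairs.
    return implode(' ', str_split(bin2hex($text), 2));
}

echo hex_dump_bytes("A\u{FFFD}"), "\n";
```

If the rubbish shows up as the bytes ef bf bd, the string already contains the UTF-8 replacement character U+FFFD, which suggests the original bytes were lost in an earlier conversion. A lone byte such as a3 or 93 points instead to a single-byte encoding being read as UTF-8.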

JW
Appreciate that, but it's a feed that changes daily. The garbage characters of today will be replaced with the garbage characters of tomorrow.
Evernoob
The point is not to find out what specific characters they are. The point is to find out what *type* of characters they are. There's a chance that they're actually not garbage, but that you're just misinterpreting the encoding. If that's the case, your best approach is not to strip them out, but to read the encoding correctly. But you won't know until you actually look at the data.
JW
A: 

Try this:

  • Download a sample from the feed manually.
  • Open it in Notepad++ or another advanced text editor (KATE on Linux is good for this).
  • Try changing the encoding and converting from one encoding to another.

If you find a setting that makes the characters display properly, then you'll need to either encode your site in that encoding, or convert it from that encoding to whatever you use on your site.
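The same experiment can be run in PHP instead of an editor by decoding a sample under several candidate encodings and comparing the results by eye. This is only a sketch: it assumes the mbstring extension, the candidate list is a guess, and the function name preview_encodings is hypothetical:

```php
<?php
// Hypothetical helper: decode the same raw bytes under several candidate
// source encodings and return each result as UTF-8.
function preview_encodings(string $raw, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        $previews[$encoding] = mb_convert_encoding($raw, 'UTF-8', $encoding);
    }
    return $previews;
}

// Bytes 0x93 and 0x94 are curly quotes in Windows-1252 but control
// characters in ISO-8859-1.
$sample = "\x93quoted\x94";
foreach (preview_encodings($sample, ['Windows-1252', 'ISO-8859-1']) as $name => $text) {
    echo $name, ': ', $text, "\n";
}
```

Whichever candidate produces readable text is the one to convert from before display.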

DisgruntledGoat