Hello,

I'm parsing a large text file using PHP, and some lines look like this: "äåòñêèå ïåñíè", or "ääò", or "åãîð ëåòîâ". Is there any way to check whether a string contains more than three characters like these?

Thank you.

+1  A: 

I'd avoid a regex.

Simply step through the string, looking at each character, and keep count of how many characters fit your criteria.
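A minimal sketch of that loop, assuming "characters that fit your criteria" means bytes with the high bit set (0x80 to 0xFF) in a single-byte encoding. The function name count_high_bytes is hypothetical, not something from the question.

```php
<?php
// Count the bytes in $s whose value is 0x80 or above.
// Assumption: the text is in a single-byte encoding, so one byte is one character.
function count_high_bytes(string $s): int
{
    $count = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        if (ord($s[$i]) >= 0x80) {
            $count++;
        }
    }
    return $count;
}

// Example: flag a line only when it has more than three such bytes.
$line = "\xE4\xE5\xF2\xF1\xEA\xE8\xE5"; // the bytes behind "äåòñêèå" in Latin-1
$suspicious = count_high_bytes($line) > 3;
```

The same loop can also copy the acceptable characters into a new string if the goal is removal rather than detection.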

Will
But how does this help to remove them?
pavium
Why avoid regexp? RegExp is built for "stepping through strings, looking at each character"
gnarf
Put the characters that are OK into a new string?
Will
A: 

Check with: /[^\d\s\w]{3,}/

mck89
A: 
/X.*?X.*?X/

Replace X with whatever characters you want or don't want (e.g. [\x80-\xFF]).

strager
A: 

It sounds like you might not be using the correct character encoding. A file on disk is just an array of bytes, and a character encoding is the convention that says, for example, that a byte with the value 77 is an uppercase M. Most character encodings map the values 0-127 to the same characters, but above that they all differ. Many newer character encodings use more than one byte per character, and often use the notion of a code point rather than a character.

You should become really comfortable with character encodings, especially Unicode, if you don't want to mangle and ruin character data.
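As an illustration, here is a sketch that reinterprets the bytes instead of stripping them. It assumes the file is actually Windows-1251 (Cyrillic), which is only a guess based on how the samples look, and it needs the mbstring extension. The byte values are the ones that display as "äåòñêèå ïåñíè" in a Latin-1 viewer.

```php
<?php
// Raw bytes as they sit in the file (shown as "äåòñêèå ïåñíè" when read as Latin-1).
$raw = "\xE4\xE5\xF2\xF1\xEA\xE8\xE5 \xEF\xE5\xF1\xED\xE8";

// Assumption: the real encoding is Windows-1251. Convert it to UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1251');

echo $utf8, "\n"; // детские песни
```

If the guess is wrong the output will still be nonsense, so it is worth trying a few candidate encodings on a handful of lines before converting the whole file.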

Jon Hess
That's how it looks with Western ISO8859-15 encoding. With UTF8 it looks like this: ������� �����
Psyche
You need to already know the character encoding. If this is just a one time thing, you could try guessing more of them. ISO-8859-1 is another common encoding.
Jon Hess
Or Windows-1252 which apparently is different from ISO-8859-1.
Jon Hess
+1  A: 

You could try:

if (preg_match("/(?:.*?[\x80-\xFF]){3,}/", $string)) {
  // report excess high-bit ascii
}

(?:           ; create a non-capture group
  .*?         ; match any number of characters, without being greedy.
  [\x80-\xFF] ; match a single high-bit character
)             ; end the group
{3,}          ; match the group 3 or more times

Your question title alludes to removing them:

$out = preg_replace('/[\x80-\xFF]/', '', $input);
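Note that {3,} matches three or more high-bit characters. If "more than three" is meant strictly, one option is to count the matches instead. This is a sketch under that reading; the function name has_too_many_high_bytes and the $limit parameter are hypothetical.

```php
<?php
// Return true when $s contains more than $limit bytes in the range 0x80-0xFF.
// preg_match_all() returns the number of full pattern matches.
function has_too_many_high_bytes(string $s, int $limit = 3): bool
{
    return preg_match_all('/[\x80-\xFF]/', $s) > $limit;
}

$three = has_too_many_high_bytes("\xE4\xE4\xF2");     // exactly three high bytes
$four  = has_too_many_high_bytes("\xE5\xE3\xEE\xF0"); // four high bytes
```

Both variants work on bytes, so they assume the pattern is not run in UTF-8 mode (no /u modifier).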
gnarf
Thanks gnarf, it seems to work fine.
Psyche
A: 

You can do:

$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));

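This replaces accented characters in a UTF-8 string with their closest ASCII equivalent (for example, é becomes e). Characters whose HTML entity does not match the pattern are not transliterated.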

Alix Axel
A: 

I use the functions below; I hope they help.

function just_clean($string)  
{  
// Replace other special chars  
$specialCharacters = array(  
'#' => '',  
'’' => '', 
'`' => '', 
'\'' => '', 
'$' => '',  
'%' => '',  
'&' => '',  
'@' => '',  
'.' => '',  
'€' => '',  
'+' => '',  
'=' => '',  
'§' => '',  
'\\' => '',  
'/' => '',
'`' => '',
'•' => ''
);

// each() was deprecated in PHP 7.2 and removed in PHP 8; str_replace() accepts an array of search strings directly.
$string = str_replace(array_keys($specialCharacters), '', $string);

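// Note: the three-argument form of strtr() maps byte by byte, so this only works
// when both this script and $string use a single-byte encoding such as ISO-8859-1.
$string = strtr($string,  
"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ",  
"AAAAAAaaaaaaOOOOOOooooooEEEEeeeeCcIIIIiiiiUUUUuuuuyNn"  
);  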

 // Remove all remaining other unknown characters  
$string = preg_replace('/[^a-zA-Z0-9\-]/', ' ', $string);  
$string = preg_replace('/^[\-]+/', '', $string);  
$string = preg_replace('/[\-]+$/', '', $string);  
$string = preg_replace('/[\-]{2,}/', ' ', $string);  
$string = clean_url($string);  
return $string;  
}

function clean_url($text)
{
$text=strtolower($text);
$code_entities_match = array( '&quot;' ,'!' ,'@' ,'#' ,'$' ,'%' ,'^' ,'&' ,'*' ,'(' ,')' ,'+' ,'{' ,'}' ,'|' ,':' ,'"' ,'<' ,'>' ,'?' ,'[' ,']' ,';' ,"'" ,',' ,'.' ,'_' ,'/' ,'*' ,'+' ,'~' ,'`' ,'=' ,'---' ,'--','--','-','’','`','•');
$code_entities_replace = array(' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ',' ',' ',' ',' ',' ',' ');
$text = str_replace($code_entities_match, $code_entities_replace, $text);
$text = trim($text," ");
$text=str_replace(" ","-",$text);
$text = cleanUnderScores($text);
return $text;
}

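// Collapse every run of consecutive hyphens into a single hyphen.
function cleanUnderScores($text)
{
    // Compare with !== false: strpos() returns 0 for a match at the start of
    // the string, and 0 != false is false, which would end the loop too early.
    while (strpos($text, "--") !== false) {
        $text = str_replace("--", "-", $text);
    }
    return $text;
}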
Pushpinder