Hello,
I'm parsing a large text file using PHP, and some lines look like "äåòñêèå ïåñíè", "ääò", or "åãîð ëåòîâ". Is there a way to check whether a string contains more than three characters like these?
Thank you.
I'd avoid a regex.
Simply step through the string, looking at each character, and keep count of how many characters fit your criteria.
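A minimal sketch of that loop, assuming "characters like this" means bytes with the high bit set (0x80-0xFF) in a single-byte encoding. The function name count_high_bytes is made up for illustration.

```php
<?php
// Hypothetical helper: counts the bytes in $s whose value is 0x80 or above.
function count_high_bytes(string $s): int
{
    $count = 0;
    $length = strlen($s);
    for ($i = 0; $i < $length; $i++) {
        if (ord($s[$i]) >= 0x80) {
            $count++;
        }
    }
    return $count;
}

// The bytes behind "äåòñêèå" when the file is read as a single-byte encoding.
$line = "\xE4\xE5\xF2\xF1\xEA\xE8\xE5";
$tooMany = count_high_bytes($line) > 3; // true: all seven bytes are high-bit
```

On UTF-8 input every non-ASCII character contributes two or more such bytes, so the threshold would need adjusting.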
/X.*?X.*?X/
Replace X with whatever characters you want or don't want (e.g. [\x80-\xFF]).
It sounds like you might not be using the correct character encoding. A file on disk is just an array of bytes, and a character encoding is the convention that says, for example, that a byte with the value 77 is an uppercase M. Most character encodings map the values 0-127 to the same characters, but above that they all differ. Many newer character encodings use more than one byte per character and often work in terms of code points rather than characters.
You should become really comfortable with character encodings, especially Unicode, if you don't want to mangle and ruin character data.
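As an illustration only: the sample lines look like Cyrillic text stored as Windows-1251 and displayed as Latin-1. That is a guess about the file, not something the question confirms. Under that assumption, and with the mbstring extension available, a sketch of the conversion could look like this:

```php
<?php
// Assumption: these are the raw bytes from the file, shown as "äåòñêèå" by a Latin-1 viewer.
$raw = "\xE4\xE5\xF2\xF1\xEA\xE8\xE5";

// The bytes are not valid UTF-8, which hints at a legacy single-byte encoding.
$isUtf8 = mb_check_encoding($raw, 'UTF-8'); // false

// Reinterpret the bytes as Windows-1251 (assumed) and convert them to UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'Windows-1251');
echo $utf8, "\n"; // детские
```

If the guess about the source encoding is wrong, the output will simply be a different kind of garbage, so it is worth checking a few lines by eye.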
You could try:
if (preg_match("/(?:.*?[\x80-\xFF]){3,}/", $string)) {
// report excess high-bit ascii
}
(?: ; create a non-capture group
.*? ; match any number of characters, without being greedy.
[\x80-\xFF] ; match a single high-bit character
) ; end the group
{3,} ; match the group 3 or more times
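A short sketch of how that pattern behaves on a few made-up sample strings, assuming single-byte input. The pattern is single-quoted here so that PCRE, rather than PHP, interprets the \x escapes; the result is the same either way.

```php
<?php
$pattern = '/(?:.*?[\x80-\xFF]){3,}/';

$threeHigh = "\xE4\xE4\xF2"; // "ääò" as single bytes: three high-bit bytes
$oneHigh   = "caf\xE9";      // one high-bit byte
$plain     = "plain ascii";  // none

$results = [
    preg_match($pattern, $threeHigh), // 1
    preg_match($pattern, $oneHigh),   // 0
    preg_match($pattern, $plain),     // 0
];
```

Since the question asks for more than three such characters, {4,} may be the quantifier actually wanted.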
Your question title alludes to removing such characters:
$out = preg_replace('/[\x80-\xFF]/', '', $input);
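For example, with hypothetical sample strings, this is roughly what the replacement does to single-byte and to UTF-8 input:

```php
<?php
// "café au lait" in ISO-8859-1, where é is the single byte 0xE9.
$input = "caf\xE9 au lait";
$out = preg_replace('/[\x80-\xFF]/', '', $input);
// $out is "caf au lait"

// "naïve" in UTF-8, where ï is the two bytes 0xC3 0xAF; both bytes are removed.
$utf8In = "na\xC3\xAFve";
$utf8Out = preg_replace('/[\x80-\xFF]/', '', $utf8In);
// $utf8Out is "nave"
```

Note that the characters are dropped entirely, not replaced with an ASCII equivalent.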
You can do:
$string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
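This replaces accented Latin characters that have a named HTML entity (é, ñ, ü and so on) with their unaccented ASCII base letters. Characters without such an entity are left unchanged.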
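A quick sketch of that approach on a made-up UTF-8 sample, written with byte escapes so it does not depend on how the script file itself is saved:

```php
<?php
// "Café Ñandú" encoded as UTF-8.
$string = "Caf\xC3\xA9 \xC3\x91and\xC3\xBA";

// htmlentities() turns é into &eacute;, Ñ into &Ntilde;, ú into &uacute;.
$entities = htmlentities($string, ENT_COMPAT, 'UTF-8');

// The regex keeps only the base letter(s) captured from each entity name.
$plain = preg_replace(
    '~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i',
    '$1',
    $entities
);
echo $plain, "\n"; // Cafe Nandu
```

Because htmlentities() also escapes &, < and > in ordinary text, the result may need html_entity_decode() afterwards if those characters matter.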
I use the functions below; I hope they help.
function just_clean($string)
{
// Replace other special chars
$specialCharacters = array(
'#' => '',
'’' => '',
'`' => '',
'\'' => '',
'$' => '',
'%' => '',
'&' => '',
'@' => '',
'.' => '',
'€' => '',
'+' => '',
'=' => '',
'§' => '',
'\\' => '',
'/' => '',
'`' => '',
'•' => ''
);
// each() was removed in PHP 8; str_replace() accepts an array of search strings.
$string = str_replace(array_keys($specialCharacters), '', $string);
$string = strtr($string,
"ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ",
"AAAAAAaaaaaaOOOOOOooooooEEEEeeeeCcIIIIiiiiUUUUuuuuyNn"
);
// Remove all remaining other unknown characters
$string = preg_replace('/[^a-zA-Z0-9\-]/', ' ', $string);
$string = preg_replace('/^[\-]+/', '', $string);
$string = preg_replace('/[\-]+$/', '', $string);
$string = preg_replace('/[\-]{2,}/', ' ', $string);
$string = clean_url($string);
return $string;
}
function clean_url($text)
{
$text=strtolower($text);
$code_entities_match = array( '"' ,'!' ,'@' ,'#' ,'$' ,'%' ,'^' ,'&' ,'*' ,'(' ,')' ,'+' ,'{' ,'}' ,'|' ,':' ,'"' ,'<' ,'>' ,'?' ,'[' ,']' ,';' ,"'" ,',' ,'.' ,'_' ,'/' ,'*' ,'+' ,'~' ,'`' ,'=' ,'---' ,'--','--','-','’','`','•');
$code_entities_replace = array(' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ' ,' ',' ',' ',' ',' ',' ',' ');
$text = str_replace($code_entities_match, $code_entities_replace, $text);
$text = trim($text," ");
$text=str_replace(" ","-",$text);
$text = cleanUnderScores($text);
return $text;
}
function cleanUnderScores($text)
{
$tst = $text;
$under = "--";
$pos = 0;
while (strpos($tst, $under) !== false) // strict check: a match at position 0 must not end the loop
{
//$pos = strpos($tst, $under);
$tst = str_replace("--", "-", $tst);
}
return $tst;
}
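One caveat: the three-argument strtr() call in just_clean() works byte by byte, so it only behaves as intended on single-byte encodings such as ISO-8859-1. For UTF-8 input, a possible variant is the array form of strtr(), which replaces whole substrings. The sketch below is not a drop-in replacement for the functions above; the name slugify_utf8 and the deliberately partial map are made up for illustration, and the script file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical UTF-8-safe variant; extend the map as needed.
function slugify_utf8(string $text): string
{
    $map = [
        'À' => 'A', 'Á' => 'A', 'à' => 'a', 'á' => 'a', 'ä' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ç' => 'c', 'ñ' => 'n',
        'ö' => 'o', 'û' => 'u', 'ü' => 'u',
    ];
    $text = strtr($text, $map);                       // array form: multibyte-safe
    $text = strtolower($text);
    $text = preg_replace('/[^a-z0-9]+/', '-', $text); // collapse everything else into one dash
    return trim($text, '-');
}

$slug = slugify_utf8('Crème brûlée & café!');
// $slug is "creme-brulee-cafe"
```

Collapsing runs of unwanted characters in a single preg_replace() also removes the need for a separate loop that squeezes repeated dashes.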