I need to sanitize article titles when (creative) users try to "attract attention" with some non-alphanum repetition.

Examples:

  • Buy my product !!!!!!!!!!!!!!!!!!!!!!!!
  • Buy my product !? !? !? !? !? !?
  • Buy my product !!!!!!!!!.......!!!!!!!!
  • Buy my product <-----------

Some acceptable solution would be to reduce the repetition of non-alphanum to 2.

So I would get:

  • Buy my product !!
  • Buy my product !? !?
  • Buy my product !!..!!
  • Buy my product <--

This solution did not work that well:

preg_replace('/(\W{2,})(?=\1+)/', '', $title)

Any idea how to do it in PHP with regex?

Other, better solutions are also welcome (I cannot strip all the non-alphanum characters, as they can be meaningful).

Edit: the objective is only to avoid the most common issues. The other creative cases will be sanitized manually or with another regex.

+2  A: 

That's really an inefficient problem to solve with a regex, especially if the repeated expression is arbitrarily large. In practice, it should be enough to cap the length of the repeated expression at something like 3 to 5 characters, which makes it a lot easier.

Something like

$title = preg_replace('/(\W{1,5})(?=\1{2,})/', '', $title);

should work.

Some preliminary testing shows that

$title = 'Buy my product !!!!!!!!!!!!!!!!!!!!!!!! Buy my product !? !? !? !? !? !? Buy my product !!!!!!!!!.......!!!!!!!! Buy my product <-----------';

$title = preg_replace('/(\W{1,5})(?=\1{2,})/', '', $title);

echo $title;

will output

Buy my product !! Buy my product !? !? Buy my product !!..!! Buy my product <--

This appears to pass all your test cases.
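
To check each title from the question on its own rather than as one concatenated string, a minimal sketch could look like the following. The helper name `collapse_repeats` is hypothetical; the pattern is the one shown above.

```php
<?php
// collapse_repeats() is a hypothetical helper name; the pattern is the
// one from the answer above.
function collapse_repeats(string $title): string
{
    // Remove any group of 1-5 non-word characters that is immediately
    // followed by two or more copies of itself, so only two copies remain.
    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace('/(\W{1,5})(?=\1{2,})/', '', $title) ?? $title;
}

$titles = [
    'Buy my product !!!!!!!!!!!!!!!!!!!!!!!!',
    'Buy my product !? !? !? !? !? !?',
    'Buy my product !!!!!!!!!.......!!!!!!!!',
    'Buy my product <-----------',
];

foreach ($titles as $title) {
    echo collapse_repeats($title), PHP_EOL;
}
```

The lookahead only removes a group while at least two further copies follow it, which is why each run stops shrinking once exactly two copies are left.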


Re: Gordon

Your string:

¸·´`·¸·´`·¸·´`·¸ Human ·-> creativity << is endless !¡!¡! ☻☺

doesn't repeat anything but the first part more than two times. It seems to require:

$title = preg_replace('/(\W{1,9})(?=\1{2,})/', '', $title);

before it simplifies to

¸·´`·¸·´`·¸ Human ·-> creativity << is endless !¡!¡! ☻☺

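(Which implies that preg_replace isn't Unicode-aware by default: without the u modifier the pattern matches bytes, and the repeated five-character unit is nine bytes long in UTF-8. Oh well.)

You can also adjust it so that each repeated sequence is kept only once: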
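
If the titles are known to be UTF-8, one way around the byte counting is the `u` pattern modifier together with Unicode property classes. This is only a sketch under that assumption: the function name `collapse_repeats_utf8` is hypothetical, and `[^\p{L}\p{N}]` is swapped in for `\W`, so the underscore is treated as a symbol too.

```php
<?php
// collapse_repeats_utf8() is a hypothetical name; assumes UTF-8 input.
function collapse_repeats_utf8(string $title): string
{
    // The 'u' modifier makes PCRE treat pattern and subject as UTF-8,
    // so {1,5} counts characters instead of bytes.
    // [^\p{L}\p{N}] matches anything that is not a Unicode letter or
    // number (unlike \W, this also matches the underscore).
    $result = preg_replace('/([^\p{L}\p{N}]{1,5})(?=\1{2,})/u', '', $title);

    // preg_replace() returns null on invalid UTF-8 or another PCRE error.
    return $result ?? $title;
}

echo collapse_repeats_utf8('¸·´`·¸·´`·¸·´`·¸ Human ·-> creativity << is endless !¡!¡! ☻☺'), PHP_EOL;
echo collapse_repeats_utf8('Купить товар !!!!!!'), PHP_EOL;
```

With character counting, a limit of 5 is enough for the decorated prefix in the string above, and Cyrillic letters are left alone because they are letters rather than symbols.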

$title = preg_replace('/(\W{1,9})(?=\1+)/', '', $title);

in which case it becomes:

¸·´`·¸ Human ·-> creativity < is endless !¡! ☻☺

If your point is that it's possible to create lots of "ASCII art" even when nothing is allowed to repeat more than twice, well, that's outside the scope of this question. For the purposes of keeping ASCII art to a minimum, I would recommend simply using something like:

preg_replace('/(\W{5})\W+/', '$1', $title);

(i.e. just cap the number of non-alphanumeric characters that can be displayed in a row. Note that this would need to be adjusted for compatibility with languages with non-Latin alphabets, like Russian.)
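
As a rough illustration of the capping idea, including the adjustment for non-Latin alphabets, here is a sketch that assumes UTF-8 input. The function name `cap_symbol_run` and its `$max` parameter are hypothetical, and `[^\p{L}\p{N}]` with the `u` modifier stands in for `\W`.

```php
<?php
// cap_symbol_run() and $max are hypothetical; assumes UTF-8 input.
function cap_symbol_run(string $title, int $max = 5): string
{
    $max = max(1, $max); // the quantifier needs a positive count

    // Anything that is not a Unicode letter or number counts as a symbol
    // (whitespace and the underscore included).
    $symbol = '[^\p{L}\p{N}]';

    // Capture the first $max symbols of a run and drop the rest of it.
    $pattern = '/(' . $symbol . '{' . $max . '})' . $symbol . '+/u';

    // preg_replace() returns null on invalid UTF-8 or another PCRE error.
    return preg_replace($pattern, '$1', $title) ?? $title;
}

echo cap_symbol_run('Buy my product <-----------'), PHP_EOL;
echo cap_symbol_run('Купить товар !!!!!!!!!!'), PHP_EOL;
```

Note that the space before the symbols is part of the run, so with a cap of 5 the first title keeps the space, the `<`, and three dashes.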

Zarel
@Gordon I've edited my reply; new version passes all strings in the question.
Zarel
@Zarel Try with the string I gave in the **comment** to the question please.
Gordon
Also: If you really just want to reduce "ASCII-art" in titles, it's enough just to do something like `preg_replace('/(\W{5})\W+/', '$1', $title);` (i.e. just cap the number of non-alphanumeric characters that can be displayed in a row. Note that this could cause problems with languages with non-Latin alphabets, like Russian.)
Zarel
@Zarel what the OP asks for cannot be solved. The possibilities to arrange chars are infinite. Even when multiple occurrences of odd characters are removed, it is still possible to use a string like "x X x X x Best offers here x X x X x" for which your solution does *nothing*. If this was possible, there would be no more spam mails.
Gordon
@Gordon Eliminating anything "ASCII-art"-like in titles is impossible, yes. Reducing the repetition of non-alnum characters to 2, however, is possible, and it happens to be the question that was asked. Here at Stack Overflow, we are interested in solutions to mathematical problems like the latter, not social problems like the former.
Zarel
@Gordon If you feel that the question is not a good question, then please keep your criticism to [comments and votes on] the question, rather than my answer, which I feel is a good solution to the question of how to reduce the repetition of non-alnum characters to 2.
Zarel
@Zarel The OP explicitly asked for sanitizing those parts from a string meant to *attract attention*. Your solution does not solve this problem. I gave you two examples already, where it does not work. Here is another one: "________ Best offers here ________".
Gordon
@Gordon @Zarel: sorry it was my fault. I rephrased the question and removed "ascii art" which was not a correct word. To avoid many cases I already clean up (html entities, ...) and check (uppercase ratio, whitespace ratio, non-alphanum ratio, non-alphanum position in the text, etc.) a lot. :)
Toto
@Gordon The OP's exact words are "Some acceptable solution would be to reduce the repetition of non-alphanum to 2", and the rest of the question discusses that solution. Again, if you feel that's incorrect, take it up with the OP, not me.
Zarel
@Gordon In addition, you are simultaneously claiming that the question of removing all parts of a string that attract unnecessary attention is impossible, and criticizing me for making what I believe is a fairly good effort to solve it. Many questions can't be completely solved for all corner cases - that doesn't mean we should stop trying.
Zarel
A: 

Use non-greedy search?

preg_replace('/(\W{2,}?)(?=\1+)/', '', '{{your data}}');

result is

  * Buy my product !!
  * Buy my product !?
  * Buy my product !!!...!!
  * Buy my product <---
KennyTM
it does not work :(
Toto
@Toto: strange, works for me. See if the updated full PHP code helps.
KennyTM
+1  A: 

I have an answer that's a little bit different

if (preg_match('/[^\da-z\s_-]/i', $str)) {

  // auto post, but flag to moderator to inspect title OR
  // instead of auto posting, put in 'waiting to be authorised' by a mod

}

I hope I have that regex correct, but I have not tested it. Basically it should detect when someone has in their title characters that are not 0-9, A-Z (case insensitive), whitespace, underscore and dash. Of course, you can modify this to suit your liking.
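
A small sketch of the flagging idea, with the assumption that the pattern is left unanchored so that a single disallowed character anywhere in the title is enough to trigger it. The function name `title_needs_review` is hypothetical.

```php
<?php
// title_needs_review() is a hypothetical name for the check described above.
function title_needs_review(string $title): bool
{
    // True when the title contains at least one character that is not
    // 0-9, A-Z (case-insensitive), whitespace, underscore or dash.
    return preg_match('/[^\da-z\s_-]/i', $title) === 1;
}

var_dump(title_needs_review('Buy my product'));     // no special characters
var_dump(title_needs_review('Buy my product !!!')); // contains "!"
```

This will also flag ordinary punctuation such as commas or apostrophes, so the character class would probably need widening before real use.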

It would also be a good idea to inform the end user

"Titles that deliberately try to attract attention without benefiting the product description may be deleted without warning"

alex