I like to keep track of the delicious.com/popular RSS feed. Lately, however, more and more of its items are pages in Asian languages. Since I do not understand any Asian languages, I would like to filter them out of the feed and save myself some time.

I've been trying to cook something up using Yahoo Pipes, but have not been able to get it working.

Does anyone have any ideas on how to make this work?

+1  A: 

I've had some luck at http://pipes.yahoo.com/pipes/pipe.info?_id=yJh1aRp_3hGaPi23tPvyrQ

The source of the pipe has all the info, but the key bit is running a filter with the regex `^[A-Za-z 0-9 \.,\?'""!@#\$%\^&\*\(\)-_=\+;:<>\/\\\|\}\{\[\]~]+$`.

This will filter out any feed items that use anything but standard ASCII in the title. Unfortunately, this means it will also filter out titles containing words like "résumé," but it should be pretty easy for you to adjust the regex to include common non-English characters from the languages you know.
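If you ever want to do the same thing outside of Pipes, here is a rough Python sketch of the idea. It is not the pipe's exact regex: it assumes "standard ASCII" means the printable range from space to tilde, and the names `ASCII_TITLE` and `keep_ascii_titles` are made up for this example.

```python
import re

# Assumption: "standard ASCII" = printable characters, space (0x20) to tilde (0x7E).
# This is a simplified stand-in for the character class used in the pipe.
ASCII_TITLE = re.compile(r"[\x20-\x7E]+")


def keep_ascii_titles(titles):
    """Return only the titles made up entirely of printable ASCII."""
    return [t for t in titles if ASCII_TITLE.fullmatch(t)]
```

As with the pipe, a title such as "My résumé" is dropped too, so you would need to widen the character class for any accented letters you want to keep.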

anschauung
Thanks! This will do fine for me.
A: 

You probably want to skip titles where more than X% of the characters are NOT from the code blocks assigned to the scripts of the languages that you can understand. For example, if you can't read Greek, Russian, Arabic, Hebrew, Armenian, Chinese, Japanese, Korean, Indic languages, etc., reject titles where more than (say) 10% of the characters are outside the range U+0000 to U+0233. This leaves you with the Latin alphabet. The margin of 10% allows for punctuation marks, and for the symbols outside the base alphabet that technical articles may use.
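A minimal Python sketch of that check might look like the following. The function names are invented for illustration, every character in the title (spaces included) is counted, and the 10% threshold and U+0233 cutoff are just the example values from above, so tune them to taste.

```python
def non_latin_ratio(title, max_code_point=0x0233):
    """Fraction of characters whose code point lies above max_code_point."""
    if not title:
        return 0.0  # assumption: an empty title has nothing to reject
    outside = sum(1 for ch in title if ord(ch) > max_code_point)
    return outside / len(title)


def keep_title(title, threshold=0.10):
    """Keep a title unless more than `threshold` of it is outside the range."""
    return non_latin_ratio(title) <= threshold
```

With these defaults, "My résumé tips" is kept because "é" is below U+0233, a title written entirely in Japanese is rejected, and a mostly English title containing a single arrow symbol still passes.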

John Machin