views: 114
answers: 4
The Russian alphabet includes many letters that look the same as letters in the English alphabet. Here is the list of these common letters: L='acekopuxy'

Now, given two huge lists R and E, each of the form [word_A, word_B, ...], where each word_N is a lowercase word, I want to create a list C that contains only those words that have the same spelling in E and in R. For example, the word 'cop' must be in C, because it is in the list R as well as in E.

Is there a polynomial-time way to do it?

P.S.: One important note: because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same:

LE='acekopuxy'
LR='асекориху'
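
For illustration, a tiny made-up example of the desired behaviour:

R = [u'сор', u'мир']   # Cyrillic words (sample data)
E = [u'cop', u'dog']   # English words (sample data)
# desired: C should contain 'cop' (equivalently, its Cyrillic twin 'сор'),
# because the two words are spelled with identical-looking letters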
+1  A: 

You can use regular expressions for this:

^[acekopuxy]+$

will match words that contain only those characters.

import re
regex = re.compile(r"^[acekopuxy]+$", re.I)
output = []
for word in mylist:
    if regex.match(word):
        output.append(word)

You'll need to do this for both lists, using the correct encodings. That means that for the Russian list, you'll need to use the equivalent Cyrillic characters, like ^[\u0441\u1234...]+$.

Then, if you want to find the words that "look the same", you could use a translation table to convert the words in one of the lists into the format of the other list, then convert the lists to sets and check their intersection.
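
A minimal sketch of that combined approach, assuming Python 3, the LE/LR strings from the question, and hypothetical E_words/R_words as the two input lists:

import re

LE = u'acekopuxy'
LR = u'асекориху'

# keep only the words spelled entirely with the shared-looking letters
latin_only = re.compile(u'^[' + LE + u']+$')
cyrillic_only = re.compile(u'^[' + LR + u']+$')
E_filtered = {w for w in E_words if latin_only.match(w)}
R_filtered = [w for w in R_words if cyrillic_only.match(w)]

# translation table: each Cyrillic look-alike -> its Latin twin
to_latin = {ord(r): ord(e) for e, r in zip(LE, LR)}

# words whose translated spelling also appears among the English words
C = [w for w in R_filtered if w.translate(to_latin) in E_filtered]

The up-front regex filtering is optional here: the final translate-and-lookup step already rejects words that use letters outside the shared set.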

Tim Pietzcker
I don't think that this will work. The Russian "c" is actually U+0441, and so on.
Boldewyn
Because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same.
psihodelia
+1  A: 
Eset = set(E)
table = {ord(r): ord(e) for e, r in zip(LE, LR)}  # map each Cyrillic look-alike to its Latin twin
C = [w for w in R if w.translate(table) in Eset]

Not sure if I understood the problem correctly, but assuming good hashing, this runs in O(n).
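
For example, with made-up data and the LE/LR strings from the question:

E = [u'cop', u'dog']
R = [u'сор', u'мир']
# the snippet above gives C == [u'сор'],
# because translating Cyrillic с, о, р yields 'cop', which is in Eset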

larsmans
+1  A: 

You need to tell the program yourself which characters are similar. Since they are all different Unicode code points, you will have to have a mapping like this:

RE_map = (
    (u'c', u'\u0441'),
    # ...and so on
)

Then, translate all words from R to their E representation:

for ec, rc in RE_map:
    string = string.replace(rc, ec)

and finally check if the string is now in E:

if string in E:
    print "The word consists of characters that look the same in Latin and Cyrillic."
Boldewyn
+2  A: 

You can use sets for this:

english_set = set(E)
russian_set = set(R)
common_words = english_set.intersection(russian_set)

I'm not sure I got the encoding part right, though. If that means that letters which look similar are actually different bytes, you can, for example, prepare the Russian list by replacing these letters with their English counterparts prior to doing the intersection.
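
A sketch of that preparation step, assuming Python 3 and the LE/LR strings from the question (to_latin and prepared_russian are hypothetical names), could be:

to_latin = {ord(r): ord(e) for e, r in zip(LE, LR)}         # Cyrillic look-alike -> Latin twin
prepared_russian = {w.translate(to_latin): w for w in R}    # latinized spelling -> original word
common_words = [prepared_russian[w] for w in english_set & set(prepared_russian)]

Keeping the original Cyrillic spellings as dictionary values lets you report the matches in either alphabet.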

Luper Rouch
Because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same.
psihodelia
What about the time complexity of set() and set.intersection()?
psihodelia
O(n), where n is the size of the smaller set.
Luper Rouch
The Python wiki was just updated to say that O(n) is the average case and O(n*m) the worst case.
Luper Rouch
O(n×m) worst-case time for set intersection? I can't believe that. It can be done in O(min(m,n)).
larsmans
I don't know the reason for this edit or whether it's correct; here is the page for reference: http://wiki.python.org/moin/TimeComplexity
Luper Rouch