views: 114
answers: 4
The Russian alphabet includes many letters that look the same as letters in the English alphabet. Here is the list of these common letters: L='acekopuxy'

Now, given two huge lists R and E, each of the form [word_A, word_B, ...], where each word_N is a lowercase word, I want to create a list C that contains only those words that have the same spelling in E and in R. For example, the word 'cop' must be in C, because it is in the list R as well as in E.

Is there a polynomial-time way to do it?

P.S.: One important note: because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same:

LE='acekopuxy'
LR='асекориху'
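
For illustration, a tiny made-up example of the desired behaviour:

R = [u'сор', u'мир']   # Cyrillic words (sample data)
E = [u'cop', u'dog']   # English words (sample data)
# desired: C should contain 'cop' (equivalently, its Cyrillic twin 'сор'),
# because the two words are spelled with identical-looking letters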
+1  A: 

You can use regular expressions for this:

^[acekopuxy]+$

will match words that contain only those characters.

import re
regex = re.compile(r"^[acekopuxy]+$", re.I)
output = []
for word in mylist:
    if regex.match(word):
        output.append(word)

You'll need to do this for both lists, using the correct encodings. That means that for the Russian list, you'll need to use the equivalent Cyrillic characters, like ^[\u0441\u1234...]+$.

Then, if you want to find the words that "look the same", you could use a translation table to convert the words in one of the lists into the format of the other list, then convert the lists to sets and check their intersection.
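
A minimal sketch of that combined approach, assuming Python 3, the LE/LR strings from the question, and hypothetical E_words/R_words as the two input lists:

import re

LE = u'acekopuxy'
LR = u'асекориху'

# keep only the words spelled entirely with the shared-looking letters
latin_only = re.compile(u'^[' + LE + u']+$')
cyrillic_only = re.compile(u'^[' + LR + u']+$')
E_filtered = {w for w in E_words if latin_only.match(w)}
R_filtered = [w for w in R_words if cyrillic_only.match(w)]

# translation table: each Cyrillic look-alike -> its Latin twin
to_latin = {ord(r): ord(e) for e, r in zip(LE, LR)}

# words whose translated spelling also appears among the English words
C = [w for w in R_filtered if w.translate(to_latin) in E_filtered]

The up-front regex filtering is optional here: the final translate-and-lookup step already rejects words that use letters outside the shared set.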

Tim Pietzcker
I don't think that this will work. The Russian "c" is actually U+0441, and so on.
Boldewyn
Because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same.
psihodelia
+1  A: 
Eset = set(E)
table = {ord(r): ord(e) for e, r in zip(LE, LR)}  # map each Cyrillic look-alike to its Latin twin
C = [w for w in R if w.translate(table) in Eset]

Not sure if I understood the problem correctly, but assuming good hashing, this runs in O(n).
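
For example, with made-up data and the LE/LR strings from the question:

E = [u'cop', u'dog']
R = [u'сор', u'мир']
# the snippet above gives C == [u'сор'],
# because translating Cyrillic с, о, р yields 'cop', which is in Eset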

larsmans
+1  A: 

You need to tell the program yourself which characters are similar. Since they are all different Unicode code points, you will have to have a mapping like this:

RE_map = (
    (u'c', u'\u0441'),
    # ...and so on
)

Then, translate all words from R to their E representation:

for ec, rc in RE_map:
    string = string.replace(rc, ec)

and finally check if the string is now in E:

if string in E:
    print "The word consists of characters that look the same in Latin and Cyrillic."
Boldewyn
+2  A: 

You can use sets for this:

english_set = set(E)
russian_set = set(R)
common_words = english_set.intersection(russian_set)

I'm not sure I got the encoding part right, though. If that means that letters which look similar are actually different bytes, you can, for example, prepare the Russian list by replacing these letters with their English counterparts prior to doing the intersection.
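
A sketch of that preparation step, assuming Python 3 and the LE/LR strings from the question (to_latin and prepared_russian are hypothetical names), could be:

to_latin = {ord(r): ord(e) for e, r in zip(LE, LR)}         # Cyrillic look-alike -> Latin twin
prepared_russian = {w.translate(to_latin): w for w in R}    # latinized spelling -> original word
common_words = [prepared_russian[w] for w in english_set & set(prepared_russian)]

Keeping the original Cyrillic spellings as dictionary values lets you report the matches in either alphabet.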

Luper Rouch
Because of the different character encodings, there are two L lists, LE for English letters and LR for Russian, but the appearance of their letters is the same.
psihodelia
What about the time complexity of set() and set.intersection()?
psihodelia
O(n), where n is the size of the smaller set.
Luper Rouch
The Python wiki was just updated to say that O(n) is the average case and O(n*m) the worst case.
Luper Rouch
O(n×m) worst-case time for set intersection? I can't believe that. It can be done in O(min(m,n)).
larsmans
I don't know the reason for this edit or whether it's correct; here is the page for reference: http://wiki.python.org/moin/TimeComplexity
Luper Rouch