views:

224

answers:

5

I am transliterating text from a source language (input file) to a target language (target file), so in my code I look up each character's equivalent in a dictionary. Certain characters in the input, such as the comma (,) and other special symbols, have no equivalent mapping. How do I check whether a character is in the dictionary, and how do I make sure that the special symbols with no mapping are still written to the target file unchanged? Thank you :).

+1  A: 

I think you want something like this:

tokenMapping = {"&&" : "and"}
sourceTokens = ["a", "&&", "b"]  # stand-in for the tokens read from the source file

for token in sourceTokens:
    translatedToken = tokenMapping[token] if token in tokenMapping else "transliteration unknown"

If there's a translation in the dictionary (e.g. "&&" -> "and"), it will use that. Else it will translate to "transliteration unknown".

Hope that helped.

EDIT: As LeafStorm suggested, a dictionary's get function can be used to simplify the above code. The code line in the loop would become

    translatedToken = tokenMapping.get(token, "transliteration unknown")
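
The two forms can be checked against each other with a small sketch. The mapping and the token list here are made up for illustration; a real program would take its tokens from the source file.

```python
# Hypothetical mapping and tokens, for illustration only.
token_mapping = {"&&": "and"}
tokens = ["x", "&&", "y"]

# Explicit membership test, as in the first version.
with_test = [
    token_mapping[t] if t in token_mapping else "transliteration unknown"
    for t in tokens
]

# dict.get with a default, as in the edit.
with_get = [token_mapping.get(t, "transliteration unknown") for t in tokens]

print(with_test == with_get)  # both lists hold the same items
```
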
AndiDog
I will check it and get back to you, Sir.
mgj
I need to run the entire code, and I am currently dealing with certain errors. I will surely get back to you, Sir. Thank you for your time :)
mgj
A: 
dictx = {}
for itm in my_source :
    dictx[itm] = dictx.get(itm, 0) + 1

I didn't completely understand the details of your question, but here's the simplest example I could think of that illustrates the pattern I think you are after.

The 'get' method is, I believe, what you want. It lets you look up a key in a dictionary and supply a default value to use if the key is not there, i.e., "I want dictx[itm] (the value assigned to the key 'itm'), but if 'itm' is not in the dictionary, give me the default instead." Note that 'get' itself does not add the key; the assignment in the snippet does.

This snippet loops through your source document ('my_source') and counts the frequency of the various items in it, incrementing the values of the keys already in your dictionary. When it reaches an item for which no key exists, no exception is thrown: 'get' returns the default '0', and the assignment adds the key with a count of 1.
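
To see the counting pattern run end to end, here is a small sketch. The sample string is made up, and collections.Counter from the standard library is shown only as a cross-check; it is not something the snippet above relies on.

```python
from collections import Counter

my_source = "hello world"  # hypothetical sample input

dictx = {}
for itm in my_source:
    # get() returns 0 for an unseen item, so its first occurrence stores 1.
    dictx[itm] = dictx.get(itm, 0) + 1

# Counter builds the same frequency table in one call.
same_as_counter = dictx == dict(Counter(my_source))
```
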

doug
Let me give you an example, Sir. Say the source file contains "Hi! What are you doing". I need to check each character, or set of characters, for its equivalent transliteration in a dictionary, but certain characters like '!' have no transliteration and are to be copied as they are from source to destination. My question was how to check whether a character is in the dictionary and print its equivalent if one exists, and if not, how to print the original character (like '!') unchanged. Thank you for your support, Sir :)
mgj
+3  A: 

My recommendation, given that rules is a mapping of the characters to their transliterated equivalents:

def transliterate(source_text, rules):
    results = []
    for char in source_text:
        results.append(rules.get(char, char))
    return ''.join(results)    # turns the list back into a string

A dict's get method will return either the value for a key or a default value if the key does not exist - normally the default value is None, but in this case, we gave the same character as the default value (the second argument) so that if the key is not found it will just return itself.
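
The three cases described above can be seen side by side in a short sketch. The one-entry rules mapping is a made-up example.

```python
rules = {"a": "A"}  # hypothetical one-entry mapping

mapped = rules.get("a")          # key exists: its value is returned
missing = rules.get("!")         # key missing, no default given: None
fallback = rules.get("!", "!")   # key missing: the character itself comes back

print(mapped, missing, fallback)
```
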

A more compact way to write this using generator expressions would be:

''.join((rules.get(char, char) for char in source_text))
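
Since the question mentions an input file and a target file, the same expression could be wrapped in a small helper along these lines. This is only a sketch: the function name, the file names, the UTF-8 encoding, and the sample rules are all assumptions for illustration.

```python
import io
import os
import tempfile

def transliterate_file(src_path, dst_path, rules, encoding="utf-8"):
    """Copy src_path to dst_path, replacing characters found in rules."""
    with io.open(src_path, "r", encoding=encoding) as src:
        source_text = src.read()
    # Characters without a rule fall back to themselves.
    output_text = "".join(rules.get(char, char) for char in source_text)
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(output_text)
    return output_text

# Demo in a temporary directory with made-up rules.
demo_rules = {"a": "4", "e": "3"}
with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "source.txt")
    dst_file = os.path.join(tmp, "target.txt")
    with io.open(src_file, "w", encoding="utf-8") as f:
        f.write("Hi! What are you doing")
    result = transliterate_file(src_file, dst_file, demo_rules)
    with io.open(dst_file, "r", encoding="utf-8") as f:
        written = f.read()
```

The '!' and the spaces have no entry in demo_rules, so they reach the target file unchanged.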
LeafStorm
Thank You Sir:)
mgj
A: 

This seems pretty straightforward. If your dictionary is char to char, then you would do something like

outstr = ''
for ch in instr:
    if ch in mydict:
        outstr += mydict[ch]
    else:
        outstr += ch

Here, instr is your input string and mydict contains your mapping of chars to chars.

If you want to check parts of words, I would recommend using two dictionaries: one that contains the characters that are contained in any word, and one that contains the words. You could use it like this:

outstr = ''
word = ''
for ch in instr:
    if ch in chardict:
        word += ch
    else:
        if len(word):
            if word in worddict:
                outstr += worddict[word]
            else:
                outstr += word
            word = ''
        outstr += ch
if len(word):
    # flush the last word; fall back to the word itself if it has no entry
    outstr += worddict.get(word, word)

chardict might contain all of the alphabet for instance. Of course, you might want to do some parts a little bit differently (like use something other than chardict to check if a char is to be considered part of a valid word - perhaps something with a binary search), but hopefully you get the idea.
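
As a rough sketch of the word-level idea, the loop can be packaged as a function and tried on the sentence from the question. The function name, the use of a set of ASCII letters for chardict, and the entries in worddict are assumptions made for this example.

```python
import string

def translate_words(instr, chardict, worddict):
    """Replace whole words found in worddict; copy everything else unchanged."""
    out = []
    word = ""
    for ch in instr:
        if ch in chardict:
            word += ch  # still inside a word
        else:
            # A non-word character ends the current word, if there is one.
            if word:
                out.append(worddict.get(word, word))
                word = ""
            out.append(ch)
    if word:
        out.append(worddict.get(word, word))  # flush a trailing word
    return "".join(out)

chardict = set(string.ascii_letters)          # a set is enough for membership tests
worddict = {"Hi": "Hello", "doing": "up to"}  # hypothetical word mapping
result = translate_words("Hi! What are you doing", chardict, worddict)
```

Words without an entry ("What", "are", "you") and the punctuation are copied through as they are.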

Justin Peel
+3  A: 

If you use the translate method of Unicode objects, as I recommended in answer to another question of yours, everything's done automatically for you exactly as you desire: each Unicode character c whose code point (ord(c)) is not in the transliteration dictionary is simply passed unchanged from input to output, just as you want. Why reinvent the wheel?
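
A minimal sketch of this approach, assuming Python 3, where every str is a Unicode string (in Python 2 the same method lives on unicode objects). The mapping below is made up for illustration.

```python
# translate() expects a table keyed by code points, i.e. ord() values.
rules = {ord("a"): "4", ord("e"): "3"}  # hypothetical mapping

text = "Hi! What are you doing"
result = text.translate(rules)

# str.maketrans builds the same kind of table from single-character keys.
table = str.maketrans({"a": "4", "e": "3"})
same_result = text.translate(table)
```

Characters such as '!' have no entry in the table, so they pass through untouched.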

Alex Martelli
Point taken, Sir :) I will try out this method.
mgj