I am transliterating text from a source language (input file) to a target language (target file), looking up each character in a dictionary of equivalent mappings in my code. Certain characters in the input, such as the comma (,) and other special symbols, have no equivalent mapping. How do I check whether a character has a mapping in the dictionary, and how do I make sure that the special symbols without a mapping are still written to the target file? Thank you :).
I think you want something like this:
tokenMapping = {"&&": "and"}
for token in sourceTokens:  # sourceTokens is a placeholder: any iterable of tokens read from the source file
    translatedToken = tokenMapping[token] if token in tokenMapping else "transliteration unknown"
If there's a translation in the dictionary (e.g. "&&" -> "and"), it will use that. Otherwise it falls back to "transliteration unknown".
Hope that helped.
EDIT: As LeafStorm suggested, a dictionary's get function can be used to simplify the above code. The line inside the loop would become
translatedToken = tokenMapping.get(token, "transliteration unknown")
dictx = {}
for itm in my_source:
    dictx[itm] = dictx.get(itm, 0) + 1
I didn't completely understand the details of your question, but here's the simplest example I could think of that illustrates the pattern I think you are after.
The 'get' method is, I believe, what you want. It lets you look up a key in a dictionary and supply a default value to be returned if the key is not there, i.e., "I want dictx[itm] (the value assigned to the key 'itm'), but if 'itm' is not in the dictionary, give me 0 instead."
This snippet loops through your source document ('my_source') and counts the frequency of the items in it. When it reaches an item whose key already exists, it increments that key's count. When it reaches an item for which no key exists, no exception is thrown: 'get' returns the default 0, and the assignment adds the key with a count of 1.
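To make the behaviour of 'get' concrete, here is a small sketch. The sample list is made up for illustration. It shows that 'get' on its own never adds a key; it is the assignment that does.

```python
# Sample data invented for illustration only.
my_source = ["a", ",", "b", "a"]

dictx = {}
missing = dictx.get("z", 0)  # returns the default 0; "z" is NOT added
assert "z" not in dictx

for itm in my_source:
    # get() supplies 0 for unseen items; the assignment creates or updates the key
    dictx[itm] = dictx.get(itm, 0) + 1

print(dictx)  # {'a': 2, ',': 1, 'b': 1}
```

Note that the comma is counted like any other item, so special symbols need no separate handling.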
My recommendation, given that rules is a mapping of the characters to their transliterated equivalents:
def transliterate(source_text, rules):  # wrapper name is illustrative; it makes the return valid
    results = []
    for char in source_text:
        results.append(rules.get(char, char))
    return ''.join(results)  # turns the list back into a string
A dict's get method will return either the value for a key or a default value if the key does not exist. Normally the default value is None, but in this case we gave the same character as the default value (the second argument), so that if the key is not found the character itself is returned.
A more compact way to write this using a generator expression would be:
''.join(rules.get(char, char) for char in source_text)
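Since the question is about going from an input file to a target file, here is a rough sketch of how the same get-with-default idea might be applied line by line. The function name transliterate_stream and the sample mapping are hypothetical, and in-memory streams stand in for real files so the example runs on its own.

```python
import io

def transliterate_stream(src, dst, rules):
    """Copy text from src to dst, replacing characters that have a mapping."""
    for line in src:
        # Unmapped characters (commas, spaces, newlines, ...) pass through as-is.
        dst.write(''.join(rules.get(char, char) for char in line))

# Hypothetical mapping, for illustration only.
rules = {"a": "α", "b": "β"}

# With real files you would pass open(in_path, encoding="utf-8") as src
# and open(out_path, "w", encoding="utf-8") as dst instead.
src = io.StringIO("ab, ba!\n")
dst = io.StringIO()
transliterate_stream(src, dst, rules)
print(dst.getvalue())  # αβ, βα!
```

Processing one line at a time keeps memory use small for large input files.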
This seems pretty straightforward. If your dictionary is char to char, then you would do something like
outstr = ''
for ch in instr:
    if ch in mydict:
        outstr += mydict[ch]
    else:
        outstr += ch
Here, instr is your input string and mydict contains your mapping of chars to chars.
If you want to check parts of words, I would recommend using two dictionaries: one that contains the characters that are contained in any word, and one that contains the words. You could use it like this:
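outstr = ''
word = ''
for ch in instr:
    if ch in chardict:
        # ch is part of a word: keep accumulating
        word += ch
    else:
        # ch ends the current word: flush the word, then emit ch unchanged
        if len(word):
            if word in worddict:
                outstr += worddict[word]
            else:
                outstr += word
            word = ''
        outstr += ch
# flush a trailing word if the input did not end with a separator
if len(word):
    if word in worddict:
        outstr += worddict[word]
    else:
        outstr += word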
chardict might contain the whole alphabet, for instance. Of course, you might want to do some parts a little differently (for example, use something other than chardict to decide whether a character counts as part of a valid word, perhaps something with a binary search), but hopefully you get the idea.
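As a possible alternative to the hand-written loop, the same word-level replacement can be sketched with the standard re module. This swaps in a different technique: the pattern \w+ plays the role of chardict (it also matches digits and underscores, which may or may not suit your text). The function name replace_words and the sample mapping are assumptions for illustration.

```python
import re

# Hypothetical word mapping, for illustration only.
worddict = {"hello": "bonjour", "world": "monde"}

def replace_words(instr, worddict):
    # \w+ matches runs of word characters; everything else (commas, spaces,
    # other symbols) is never matched and is therefore left untouched.
    return re.sub(r"\w+", lambda m: worddict.get(m.group(0), m.group(0)), instr)

print(replace_words("hello, big world!", worddict))  # bonjour, big monde!
```

Words without an entry in worddict (here "big") are returned as they are, matching the behaviour of the loop.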
If you use the translate method of Unicode objects, as I recommended in answer to another question of yours, everything's done automatically for you exactly as you desire: each Unicode character c whose code point (ord(c)) is not in the transliteration dictionary is simply passed unchanged from input to output, just as you want. Why reinvent the wheel?
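A minimal sketch of this approach, assuming Python 3, where every str is a Unicode string (the sample mapping is made up for illustration):

```python
# Hypothetical mapping, for illustration only.
rules = {"a": "α", "b": "β", "g": "γ"}

# str.maketrans converts single-character keys into the code-point-keyed
# table that translate() expects, like {ord(k): v for k, v in rules.items()}.
table = str.maketrans(rules)

print("abg, gab!".translate(table))  # αβγ, γαβ!
```

The comma, space and exclamation mark have no entry in the table, so translate copies them to the output unchanged.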