How do you create words which are not part of the English language, but sound English? For example: janertice, bellagom
You might be interested in "How do I determine if a random string sounds like English?"
Consider this algorithm, which is really just a degenerate case of a Markov chain.
A common practice is to build a Markov chain based on the letter transitions in a "training set" made of several words (nouns?) from an English lexicon, and then let this chain produce "random" words for you.
One approach that's relatively easy and effective is to run a Markov chain generator per-character instead of per-word, using a large corpus of English words as source material.
Here's an example of somebody doing it. They talk about Markov chains and dissociated press.
Here's some code I found. You can run it online at codepad.
import random

vowels = ["a", "e", "i", "o", "u"]
consonants = ['b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'q',
              'r', 's', 't', 'v', 'w', 'x', 'y', 'z']

def _vowel():
    return random.choice(vowels)

def _consonant():
    return random.choice(consonants)

def _cv():
    return _consonant() + _vowel()

def _cvc():
    return _cv() + _consonant()

def _syllable():
    return random.choice([_vowel, _cv, _cvc])()

def create_fake_word():
    """ This function generates a fake word by creating between two and three
    random syllables and then joining them together.
    """
    syllables = []
    for x in range(random.randint(2, 3)):
        syllables.append(_syllable())
    return "".join(syllables)

if __name__ == "__main__":
    print(create_fake_word())
Note: Linguistics is a hobby of mine, but I am in no way an expert in it.
First you need a "dictionary", so to speak, of English phonemes.
Then you simply string them together.
While not the most sophisticated or accurate solution, it should lead you to a generally acceptable outcome, and it's far simpler to implement if the complexities of the other solutions mentioned are more than you need.
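A minimal sketch of that phoneme idea, with a tiny hand-picked list of letter clusters standing in for a real phoneme dictionary (a serious version would pull them from a pronunciation lexicon instead):

import random

# Tiny, hand-picked onset/nucleus/coda clusters standing in for real English phonemes.
onsets = ["b", "br", "ch", "d", "fl", "g", "k", "m", "pr", "s", "st", "th", "tr"]
nuclei = ["a", "e", "i", "o", "u", "ai", "ea", "oo"]
codas = ["", "", "n", "m", "r", "l", "s", "t", "ck", "sh"]

def phoneme_word(syllables=2):
    """String onset + nucleus + coda clusters together to form a word."""
    return "".join(random.choice(onsets) + random.choice(nuclei) + random.choice(codas)
                   for _ in range(syllables))

print(phoneme_word())  # prints a random two-syllable word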
Using Markov chains is an easy way, as already pointed out. Just be careful that you don't end up with an Automated Curse Generator.
I think this story will answer your question quite nicely.
It describes the development of a Markov chain algorithm, including the pitfalls that come up along the way.
Take the start of one English word and the end of another and concatenate.
E.g.
Fortune + totality = fortality
You might want to add some more rules like only cutting your words on consonant-vowel boundaries and so on.
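A quick sketch of that splice approach; the word list here is just a stand-in, and the halfway cut is the crudest possible rule (the consonant-vowel boundary refinement mentioned above would give nicer results):

import random

words = ["fortune", "totality", "janitor", "bellows", "marble", "lantern"]  # stand-in word list

def splice():
    """Take the first part of one word and the last part of another and join them."""
    a, b = random.sample(words, 2)
    return a[:len(a) // 2 + 1] + b[len(b) // 2:]

print(splice())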
A Markov chain is the way to go, as others have already posted. Here is an overview of the algorithm:
- Let H be a dictionary mapping each letter to another dictionary that maps following letters to the frequency with which they occur.
- Initialize H by scanning through a corpus of text (for example, the Bible, or the Stack Overflow public data). This is a simple frequency count. An example entry might be H['t'] = {'t': 23, 'h': 300, 'a': 50}. Also create a special "start" symbol indicating the beginning of a word, and an "end" symbol for the end.
- Generate a word by starting with the "start" symbol, and then randomly picking a next letter based on the frequency counts. Generate each additional letter based on the last letter. For example, if the last letter is 't', then you will pick 'h' with probability 300/373, 't' with probability 23/373, and 'a' with probability 50/373. Stop when you hit the "end" symbol.
To make your algorithm more accurate, instead of mapping one letter to the next letters, you could map two letters to the next letter.
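Here is a compact sketch of that algorithm in Python (first-order, letter to letter); the corpus is a short stand-in list rather than a real lexicon, and the two-letter refinement suggested above would only require using two-character strings as the keys of H:

import random
from collections import defaultdict

START, END = "^", "$"  # special symbols marking the start and end of a word

def train(words):
    """Build H, where H[prev][next] is the number of times 'next' follows 'prev'."""
    H = defaultdict(lambda: defaultdict(int))
    for w in words:
        chars = [START] + list(w.lower()) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            H[prev][nxt] += 1
    return H

def generate(H):
    """Walk the chain from START, picking each letter with probability proportional to its count."""
    out, prev = [], START
    while True:
        letters, counts = zip(*H[prev].items())
        prev = random.choices(letters, weights=counts)[0]
        if prev == END:
            return "".join(out)
        out.append(prev)

corpus = ["fortune", "totality", "janitor", "bellows", "random", "generate"]  # stand-in corpus
H = train(corpus)
print(generate(H))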
Use n-grams based on an English corpus with n > 3; that gets you a reasonable approximation.
If you decide to go with a simple approach like the code Andy West suggested, you might get even better results by weighting the frequencies of vowels and consonants to correspond with those occurring normally in the English language: Wikipedia: Letter Frequency
You could even go as far as looking at the frequencies of paired letters or sequences of three letters, but at that point you're actually implementing the same idea as the Markov chain others have suggested. Is it more important that the "fake words" look potentially authentic to humans, or are the statistical properties of the words more important, such as in cryptographic applications?
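If you do weight by letter frequency, something like the following could stand in for the uniform _vowel() in the earlier code; the numbers are approximate percentages rounded from the Wikipedia letter-frequency table and should be treated as illustrative:

import random

# Approximate relative frequencies of English vowels (rounded figures from the
# Wikipedia "Letter frequency" article); illustrative, not authoritative.
vowel_weights = {"e": 12.7, "a": 8.2, "o": 7.5, "i": 7.0, "u": 2.8}

def _vowel():
    letters, weights = zip(*vowel_weights.items())
    return random.choices(letters, weights=weights)[0]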