The US Census Bureau uses a special encoding called “soundex” to locate information about a person. The soundex is an encoding of surnames (last names) based on the way a surname sounds rather than the way it is spelled. Surnames that sound the same but are spelled differently, like SMITH and SMYTH, have the same code and are filed together. The soundex coding system was developed so that you can find a surname even though it may have been recorded under various spellings.

In this lab you will design, code, and document a program that produces the soundex code for a given surname. The user will be prompted for a surname, and the program should output the corresponding code.

Basic Soundex Coding Rules

Every soundex encoding of a surname consists of a letter and three numbers. The letter used is always the first letter of the surname. The numbers are assigned to the remaining letters of the surname according to the soundex guide shown below. Zeroes are added at the end if necessary to always produce a four-character code. Additional letters are disregarded.

Soundex Coding Guide

Soundex assigns a number for various consonants. Consonants that sound alike are assigned the same number:

Number  Consonants

1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

Soundex disregards the letters A, E, I, O, U, H, W, and Y.
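The coding guide above can be held in a lookup table. This is only a minimal sketch: the names GROUPS, CODE_OF and code_for are made up for illustration, and returning None for a disregarded letter is an assumption, not part of the assignment.

```python
# Hypothetical names; the letter groups are copied from the coding guide.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}

# Invert the table so each consonant maps straight to its digit.
CODE_OF = {letter: digit
           for digit, letters in GROUPS.items()
           for letter in letters}


def code_for(letter):
    """Return the soundex digit for a letter, or None if it is disregarded."""
    return CODE_OF.get(letter.upper())
```

With this table, code_for("b") gives "1", while code_for("A") and code_for("h") give None because those letters are disregarded.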

There are 3 additional Soundex Coding Rules that are followed. A good program design would implement these each as one or more separate functions.

Rule 1. Names With Double Letters

If the surname has any double letters, they should be treated as one letter. For example:

Gutierrez is coded G362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).

Rule 2. Names with Letters Side-by-Side that have the Same Soundex Code Number

If the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter. Examples:

Pfister is coded as P236 (P, F ignored since it is considered same as P, 2 for the S, 3 for the T, 6 for the R).

Jackson is coded as J250 (J, 2 for the C, K ignored same as C, S ignored same as C, 5 for the N, 0 added).
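Rules 1 and 2 both amount to collapsing a run of neighbouring letters that share a code, so one possible sketch handles them together with itertools.groupby. The function name collapse_runs and the table CODES are hypothetical. It assumes a non-empty surname and deliberately ignores Rule 3, so it treats H and W like any other disregarded letter.

```python
from itertools import groupby

# Hypothetical table built from the coding guide: consonant -> digit.
CODES = {letter: str(digit)
         for digit, letters in enumerate(
             ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
         for letter in letters}


def collapse_runs(surname):
    """Apply Rules 1 and 2 only (assumes a non-empty surname)."""
    name = surname.upper()
    # Disregarded letters get an empty code, which also breaks up runs.
    codes = [CODES.get(letter, "") for letter in name]
    # groupby yields one key per run of equal neighbouring codes.
    run_codes = [code for code, _run in groupby(codes)]
    # The first run contains the first letter, which is kept as a letter.
    digits = "".join(run_codes[1:])
    return (name[0] + digits + "000")[:4]
```

This gives G362 for Gutierrez, P236 for Pfister and J250 for Jackson. Because Rule 3.b is not handled here, it would code Ashcraft as A226 rather than A261.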

Rule 3. Consonant Separators

3.a. If a vowel (A, E, I, O, U) separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded. Example:

Tymczak is coded as T522 (T, 5 for the M, 2 for the C, Z ignored (see "Side-by-Side" rule above), 2 for the K). Since the vowel "A" separates the Z and K, the K is coded.

3.b. If "H" or "W" separate two consonants that have the same soundex code, the consonant to the right is not coded. Example:

Ashcraft is coded A261 (A, 2 for the S, C ignored since same as S with H in between, 6 for the R, 1 for the F). It is not coded A226.
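One way to express Rule 3 is to remember the code of the previous letter, clear it when a vowel is seen (3.a), and leave it untouched when an H or W is seen (3.b). The sketch below is an assumption-laden illustration, not the required design: the names soundex and CODES are made up, Y and any non-letter are treated like vowels, and an empty surname is not handled.

```python
# Hypothetical table built from the coding guide: consonant -> digit.
CODES = {letter: str(digit)
         for digit, letters in enumerate(
             ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
         for letter in letters}


def soundex(surname):
    """Encode a non-empty surname, applying Rules 1, 2, 3.a and 3.b."""
    name = surname.upper()
    result = name[0]
    # Start from the first letter's code so the F in Pfister is skipped.
    prev = CODES.get(name[0])
    for letter in name[1:]:
        if letter in "HW":
            # Rule 3.b: H and W do not separate, so prev is left alone.
            continue
        code = CODES.get(letter)  # None for vowels (and, by assumption, Y)
        if code is not None and code != prev:
            result += code
        # Rule 3.a: a vowel sets prev to None, so the next consonant is coded.
        prev = code
    # Pad with zeroes, then drop any additional digits.
    return (result + "000")[:4]
```

Under these assumptions soundex("Tymczak") gives T522 and soundex("Ashcraft") gives A261, and the examples for Rules 1 and 2 still come out as G362, P236 and J250.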

So far this is my code:

surname = raw_input("Please enter surname:")
outstring = ""

outstring = outstring + surname[0]
for i in range (1, len(surname)):
        nextletter = surname[i]
        if nextletter in ['B','F','P','V']:
            outstring = outstring + '1'

        elif nextletter in ['C','G','J','K','Q','S','X','Z']:
            outstring = outstring + '2'

        elif nextletter in ['D','T']:
            outstring = outstring + '3'

        elif nextletter in ['L']:
            outstring = outstring + '4'

        elif nextletter in ['M','N']:
            outstring = outstring + '5'

        elif nextletter in ['R']:
            outstring = outstring + '6'

print outstring

This does the basic letter-to-number encoding, but I am not sure how to code the three rules. That is where I need help, so any help is appreciated.

+1  A: 

I would suggest you try the following.

  • Keep a CurrentCoded and a LastCoded variable to work with before appending anything to your output
  • Break down the system into useful functions, such as
    1. Boolean IsVowel(Char)
    2. Int Coded(Char)
    3. Boolean IsRule1(Char, Char)

Once you break it down nicely it should become easier to manage.
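As a rough illustration of that breakdown, the helpers might look like the sketch below. The Python names, the choice of 0 for a disregarded letter, and the extra is_rule2 helper are all assumptions added for the example. Each function can be tested on its own.

```python
# Hypothetical table built from the coding guide: consonant -> number.
CODES = {}
for number, letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = number


def is_vowel(ch):
    """True for A, E, I, O, U in either case."""
    return ch.upper() in "AEIOU"


def coded(ch):
    """Return the soundex number for ch, or 0 if ch is disregarded."""
    return CODES.get(ch.upper(), 0)


def is_rule1(last_ch, current_ch):
    """Rule 1: the same letter twice in a row."""
    return last_ch.upper() == current_ch.upper()


def is_rule2(last_ch, current_ch):
    """Rule 2: different neighbouring letters that share a number."""
    return (not is_rule1(last_ch, current_ch)
            and coded(current_ch) != 0
            and coded(last_ch) == coded(current_ch))
```

For example, is_rule1("r", "R") and is_rule2("P", "f") are both true, while coded("h") is 0.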

Robin Day
A: 

This is hardly perfect (for instance, it produces the wrong result if the input doesn't start with a letter), and it doesn't implement the rules as independently-testable functions, so it's not really going to serve as an answer to the homework question. But this is how I'd implement it:

>>> import re
>>> def soundex_prepare(s):
        """Prepare string for Soundex encoding.

        Remove non-alpha characters (and the not-of-interest W/H/Y), 
        convert to upper case, and remove all runs of repeated letters."""
        p = re.compile("[^a-gi-vxz]", re.IGNORECASE)
        s = re.sub(p, "", s).upper()
        for c in set(s):
            s = re.sub(c + "{2,}", c, s)
        return s

>>> def soundex_encode(s):
        """Encode a name string using the Soundex algorithm."""
        result = s[0].upper()
        s = soundex_prepare(s[1:])
        letters = 'ABCDEFGIJKLMNOPQRSTUVXZ'
        codes   = '.123.12.22455.12623.122'
        d = dict(zip(letters, codes))
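        # Seed with the first letter's code so an adjacent letter with the
        # same code (the F in Pfister) is skipped.
        prev_code = d.get(result, "")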
        for c in s:
            code = d[c]
            if code != "." and code != prev_code:
                result += code
            if len(result) >= 4: break
            prev_code = code
        return (result + "0000")[:4]
Robert Rossney