I'm learning about text processing in Java for a class. The example in class reads data from a file, processes the text, and writes the data (a List) back to a file. I understand the example: he reads each line into a String, adds that line to the list, uses .split(" "), and then calls Collections.sort to sort the data, returning one of the strings. However, if there are commas and extra whitespace, I don't know how to format those. I read up on regex but wasn't sure whether it was needed, since we haven't covered it, so I was going for the trim() method. But if I put trim() in the compare method of my class that implements Comparator (the one passed to Collections.sort), the correctly formatted string wouldn't get passed back, since compare returns an int. So I guess I'm looking for some general guidelines to help with this assignment, without giving away the answer completely. Thanks.

Edit: Assignment is to write the list in order, deleting duplicates and extra whitespace.

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class TextProcess
    {
        public static void main(String[] args)
        {
            try {
                // get data from class file
                List<String> data = TextFileUtils.readTextFile("addressbooktest.txt");
                // process data.  Really just the same address book that looks like
                // firstName, lastName, phone, email
                // with the commas, but deleting duplicates, the extra whitespace, and sorting alphabetically
                Collections.sort(data, FIRSTNAMECOMPARATOR);
                // write to output file
                TextFileUtils.writeTextFile(data, "parsedaddressbooktest.txt");
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }
        private static final FirstNameComparator FIRSTNAMECOMPARATOR = new FirstNameComparator();
    }

    class FirstNameComparator implements Comparator<String>
    {
        @Override
        public int compare(String s1, String s2)
        {
            // split each line into its comma-separated fields
            String[] st1 = s1.split(",");
            String[] st2 = s2.split(",");

            String firstName1 = st1[0].toUpperCase().trim();
            String lastName1 = st1[1].toUpperCase().trim();

            String firstName2 = st2[0].toUpperCase().trim();
            String lastName2 = st2[1].toUpperCase().trim();

            // order by first name, then by last name
            if (!(firstName1.equals(firstName2)))
                return firstName1.compareTo(firstName2);
            else
                return lastName1.compareTo(lastName2);
        }
    }
A: 

A Comparator is simply a way of determining the relative order of two items, nothing more. You'd use it when you want to control the order in which a collection of objects is sorted, but in this case it sounds like you're trying to mutate the objects within your comparator; this isn't going to work.

You're right that the trim() method will get rid of leading and trailing whitespace (subject to its own definition of whitespace, which is fine for simple use cases like yours). You'll need to use this earlier on; after you've extracted the "raw" data, of course, but before you add the data to the list.
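As a small illustration of that point (only a sketch; the class name `TrimDemo`, the helper `trimAll`, and the sample value are made up for this example): `trim()` returns a new String rather than changing the one it is called on, so the result has to be kept before it goes into the list.

```java
import java.util.ArrayList;
import java.util.List;

class TrimDemo {
    // Returns a new list in which every entry has been trimmed.
    static List<String> trimAll(List<String> raw) {
        List<String> cleaned = new ArrayList<>();
        for (String s : raw) {
            // trim() does not modify s; it returns a new String
            cleaned.add(s.trim());
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String raw = "   Alice  ";
        raw.trim();                           // result discarded, raw is unchanged
        System.out.println("[" + raw + "]");  // prints "[   Alice  ]"

        String cleaned = raw.trim();          // keep the returned value instead
        System.out.println("[" + cleaned + "]"); // prints "[Alice]"
    }
}
```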

Beyond that, you haven't actually said what the requirements are. I can assume that you need to discard trailing whitespace, but what about the commas? Should these be interpreted as element separators, in a functionally equivalent way to newlines? Or is something else needed?

I think you're on the right track in general; just think about the steps required and try to do each one separately as it's cleaner that way. From what I can tell, your steps might be something like:

  1. Identify and open a stream to read data from a file (done).
  2. Use this stream to serve up character data from the file, one line at a time (done).
  3. For each line, remove whitespace and split on commas.
  4. For each formatted string, add it to the list.
  5. Sort the list in a given order.
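
Steps 3 to 5 might look roughly like the sketch below. It is not a full solution: the names `AddressBookSteps`, `cleanLine` and `process` are invented here, the ", " separator used when rejoining the fields is an assumption about the desired output, and the sort uses plain (case-sensitive) String ordering where you would probably pass your own Comparator. Only whitespace around each field is removed; whitespace inside a field is left alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class AddressBookSteps {
    // Step 3: split on commas, trim each field, and rejoin with ", ".
    static String cleanLine(String line) {
        String[] fields = line.split(",");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(fields[i].trim());
        }
        return sb.toString();
    }

    // Steps 4 and 5: add each formatted line to a list, then sort the list.
    static List<String> process(List<String> rawLines) {
        List<String> cleaned = new ArrayList<>();
        for (String line : rawLines) {
            if (line.trim().isEmpty()) {
                continue; // skipping blank lines is an assumption
            }
            cleaned.add(cleanLine(line));
        }
        Collections.sort(cleaned); // natural String order; a Comparator could be passed here
        return cleaned;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("  Zed , Young,555 ", "Amy,Adams ,  444");
        for (String line : process(raw)) {
            System.out.println(line);
        }
    }
}
```

The point of the split is that the formatting happens before the sort, so the comparator only ever sees lines that are already clean.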
Andrzej Doyle
I don't agree that the comparator presented above mutates the items being compared. Strings can't be mutated, and the comparator doesn't modify the array either. Also, what do you mean by "the trim() method will get rid of leading and trailing whitespace"? If you use str.split(",") as above, this will not happen. You have to do it explicitly with the regex parameter.
Eyal Schneider
+1  A: 

I am not sure what exactly bothers you with the code, but here is what the code you presented seems to do:

1) It reads the lines of a text file, and organizes them as a list of strings, preserving their order (supposedly, because we don't see how TextFileUtils.readTextFile(..) is actually implemented).

2) Sorts the list in ascending name order. Each line is assumed to consist of a sequence of words delimited by commas, where the first word is the first name and the second is the last name. The primary ordering is by first name, and the secondary ordering is by last name. The use of String.split() is part of the FirstNameComparator implementation.

3) Writes back the original lines after being sorted, into a different text file.

A note about Comparators:

A Comparator defines a mechanism for comparing two items. Once the mechanism is implemented, you can use it for a variety of purposes where ordering matters (sorting, looking for the maximum/minimum, search trees, priority queues, etc.). Your explanation of the steps is not really accurate; the code does not read a file into a list, perform a split, and then sort. The splitting is actually a part of the sorting. The sorting algorithm invokes your comparator many times until it determines that the sorting is complete. Furthermore, the way it is implemented now, it will probably split the same line multiple times in order to compare it against different lines (not so efficient, but I suppose this is not the focus here).
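
To see this for yourself, here is a small experiment (a sketch only; `CountingComparatorDemo`, `BY_FIRST_FIELD` and the sample lines are invented for illustration, and the comparator is a simplified version that looks at the first field only). It counts how often `compare` is called during one sort. The exact number depends on the input and on the sorting algorithm, so it is printed rather than predicted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class CountingComparatorDemo {
    static int calls = 0;

    // Simplified comparator: orders lines by their first comma-separated field.
    static final Comparator<String> BY_FIRST_FIELD = new Comparator<String>() {
        @Override
        public int compare(String a, String b) {
            calls++; // every call splits both lines again
            String first1 = a.split(",")[0].trim().toUpperCase();
            String first2 = b.split(",")[0].trim().toUpperCase();
            return first1.compareTo(first2);
        }
    };

    // Sorts the list in place and returns how many times compare() was called.
    static int sortAndCount(List<String> lines) {
        calls = 0;
        Collections.sort(lines, BY_FIRST_FIELD);
        return calls;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList(
                "dan, d", "amy, a", "Cat, c", "bob, b", "eve, e"));
        int n = sortAndCount(lines);
        System.out.println(lines.size() + " lines, compare() called " + n + " times");
        System.out.println(lines);
    }
}
```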

Two more comments:

  • Regarding the way you parse lines: the current code deals only with commas. It doesn't remove whitespace from the lines that get written out. You can use a more complex regex to deal with other kinds of delimiters and with whitespace as well.

  • I don't see anything in the code that removes duplicates.
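
A rough sketch of both points, under some assumptions: the regex `\\s*,\\s*` treats a comma together with any whitespace around it as the delimiter, and a `LinkedHashSet` drops repeated lines while keeping the first occurrence of each. The names `NormalizeAndDedupe`, `normalize` and `dedupe` are made up for this example, two lines count as duplicates only if they are identical after normalization (case-sensitive), and `String.join` needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NormalizeAndDedupe {
    // Splits on a comma plus any surrounding whitespace, then rejoins with ", ".
    static String normalize(String line) {
        String[] fields = line.trim().split("\\s*,\\s*");
        return String.join(", ", fields);
    }

    // Normalizes every line and keeps only the first occurrence of each result.
    static List<String> dedupe(List<String> lines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            seen.add(normalize(line)); // add() ignores values already in the set
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("Amy,Adams", "  Amy ,  Adams ", "Bob , Jones");
        System.out.println(dedupe(raw)); // prints "[Amy, Adams, Bob, Jones]"
    }
}
```

The order matters here: duplicates that differ only in spacing are caught because normalization runs before the lines reach the set.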

Eyal Schneider