views:

134

answers:

6

For my data structures project, the goal is to read in a provided file containing over 10000 songs with artist, title and lyrics clearly marked, and each song is separated by a line with a single double quote. I've written this code to parse the text file, and it works, with a running time of just under 3 seconds to
read the 422K lines of text
create a Song object
add said Song to an ArrayList

The parsing code I wrote is:

if (songSource.canRead()) {  //checks to see if file is valid to read
    readIn= new Scanner(songSource);
    while (readIn.hasNextLine()) {
 do {
     readToken= readIn.nextLine();

             if (readToken.startsWith("ARTIST=\"")) {
  artist= readToken.split("\"")[1];
      } 
      if (readToken.startsWith("TITLE=\"")) {
  title= readToken.split("\"")[1];
      } 
      if (readToken.startsWith("LYRICS=\"")) {
  lyrics= readToken.split("\"")[1];
      } else {
  lyrics+= "\n"+readToken;
      }//end individual song if block
 } while (!readToken.startsWith("\"")); //end inner while loop

    songList.add(new Song(artist, title, lyrics));

    }//end while not EOF 
} //end if file can be read 

I was talking with my Intro to Algorithms professor about the code for this project, and he stated that I should try to be more defensive in my code to allow for inconsistencies in data provided by other people. Originally I was using if/else blocks between the Artist, Title and Lyrics fields, and on his suggestion I changed to sequential if statements. While I can see his point, using this code example, how can I be more defensive about allowing for input inconsistencies?

+4  A: 

I would replace e.g.:

artist= readToken.split("\"")[1];

with

String[] parts = readToken.split("\"");
if(parts.length >= 2) artist = parts[1];
else continue;

Other modifications would include:

  1. reset the local variables (so you don't accidentally get the wrong artist for a song, if no artist is supplied for some song after the first)
  2. decide what to do if some data is missing - do you still want to add the song to the song list?
sje397
Can the regexp be optimized or is it constructed for every iteration?
Thorbjørn Ravn Andersen
@Thorborn - that doesn't affect the code's robustness ... only its performance. And even then, not a lot. (IIRC, the Pattern.compile method uses a simple one-regex cache to avoid repeatedly compiling the same regex. In this case, it would be effective.)
Stephen C
Thanks for the example, you've given me some thinking to do.
Jason
+2  A: 

In the real world, there are some guarantees made regarding data integrity. In the case of dealing with user input (whether from stdin or a file) there is some project defined paradigm for notifying the user of a problem that requires attention.

For instance, when a compiler compiling code or a shell executing a script encounters an inconsistency it might halt and print the line containing the inconsistency with a second line below it that uses the "^" symbol to indicate the location of the problem.

So here are some basic question to ask yourself:
1. Is every line guaranteed to contain every field?
2. Is the ordering of the fields guaranteed?

If those are conditions of the input contract and are violated, you should ignore/report the line. If they are not conditions of the input, then you need to handle it .. which you currently do not.

Tim Bender
This kind of thing is exactly why I like this site. I think I'm running into some of the shortcomings of the way this CS program is designed. It seems to be more interested in getting the overall concepts first, then going into detail, but other than a handout on code syntax and whitespacing, there is very little emphasis placed on proper procedures like defensive coding.
Jason
A: 

Here are some issues that could be addressed:

  • Your code assumes that there is no whitespace before (for example) "ARTIST", none around the "=" sign and so on.

  • Your code assumes that the keywords are in all-caps. Someone could use lowercase or mixed case.

  • Your code assumes that a line that does not start with keyword=\" is a continuation of the song's lyrics. But what if the user entered ARTOST="Sting"? Or what if the user tried to use two lines for an artist name?

Finally, I'm not convinced that replacing "else if" with "if" in this case has made any difference to the code's robustness.

Stephen C
+1  A: 

I see a couple of things that are missing here Jason.

I think the if/else was fine and it won't change the logic. However, you should restrict the scope of your variables as much as possible. By declaring artist, title, etc. inside of the while loop, they will be initialized to null (or whatever) so if an entry is missing the artist then it won't get the last entry's value.

Also, what happens if title, artist, etc. has a quote in it? How is that handled? How about the Lyrics which seem to be multiple lines right?

What happens if there is an unknown field -- maybe a misspelling? It will be added to the end of Lyrics which doesn't seem right. Only once the LYRICS field has been found should you append to it. If lyrics is null then it will start with "null".

Gray
+2  A: 

You are assuming that the input is perfect. If you look at the way your application is currently setup, Based on a quick read of your algorithm the data would look like this

ARTIST="John"
TITLE="HELLO WORLD"
LYRICS="Sing Song All night long"
"

But consider the case

ARTIST="John"
TITLE="HELLO WORLD"
LYRICS="Sing Song All night long"
"
ARTIST="Peter"
LYRICS="Sing Song All night long"
"

Based on your algorithm, you now have 2 songs characterized as

songList = { Song("JOHN", "HELLO WORLD", "Sing Song All night long"),
             Song("Peter", "HELLO WORLD", "Sing Song All night long") }

With the current algorithm, the artist and title are exposed and will show up in the 2nd song even though they were not defined. You need to reset your three variables.

in your else you are just dumping the complete line into lyrics. What if you had already pulled Lyrics out, you are now overriding that. Test case

 ARTIST="John"
 LYRICS="Sing Song All night long"
 TILET="HELLO WORLD"
 "

Consider sending this record to an Error state. So when the batch read is completed, an error report can be generated and fixed.

Also you only consider EOF after an artist was read in. What if the EOF occurs during the Artist read, and the file does not end in ". You are going to get an exception there. In your do/while add another check for hasNextLine()

Sean
I picked your response as the answer, since even though I didn't use it in the current project, you've given me a concrete example on working around user input errors without crashing a program.
Jason
A: 

Deal with exceptions (I guess Scanner could throw InputMismatchException for an invalid character).

It looks like the do { } while (...) can loop endlessly if the file is ill-formed, and the end of the file is reached.

Nothing prevents artist or title from being empty.

pascal