views:

245

answers:

7

I have a text file that looks like this:

grn129          agri-
ac-214          ahss
hud114          ahss
lov1150         ahss
lov1160         ahss
lov1170         ahss
lov1210         ahss

What is the best way to parse this file using Java if I want to create a HashMap with the first column as the key and the second column as the value.

Should I use the Scanner class? Try to read in the whole file as a string and split it?

What is the best way?

+1  A: 

Using a Scanner or a normal FileReader + String.split() should both work fine. I think the speed differences are minimal, and unless you plan to read a very large file over and over again, it doesn't matter.

EDIT: Actually, for the second method, use a BufferedReader. It has a readLine() method, which makes things slightly easier.
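A minimal sketch of the Scanner approach (class and method names here are illustrative, not from the question):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class ScannerParse {
    public static Map<String, String> parse(File f) throws FileNotFoundException {
        Map<String, String> map = new HashMap<String, String>();
        Scanner sc = new Scanner(f);
        try {
            while (sc.hasNextLine()) {
                // A second Scanner splits the line on whitespace (the default delimiter)
                Scanner cols = new Scanner(sc.nextLine());
                if (!cols.hasNext()) continue;      // skip blank lines
                String key = cols.next();
                if (!cols.hasNext()) continue;      // skip lines with no second column
                map.put(key, cols.next());
            }
        } finally {
            sc.close();
        }
        return map;
    }
}
```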

Brendan Long
A: 

If you wish to follow the textbook solution, use StringTokenizer. It's straightforward, easy to learn, and quite simple. It can tolerate minor deviations in structure (a variable number of whitespace characters, unevenly formatted lines, etc.).

But if your text is known to be 100% well-formatted and predictable, then just read a bunch of lines into a buffer, take them one at a time, and pull the relevant substrings out into your HashMap key and value. It's faster than StringTokenizer, but lacks the flexibility.
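A short sketch of the StringTokenizer approach described above (the class name is illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class TokenizerParse {
    public static Map<String, String> parse(BufferedReader br) throws IOException {
        Map<String, String> map = new HashMap<String, String>();
        String line;
        while ((line = br.readLine()) != null) {
            // StringTokenizer splits on runs of whitespace by default
            StringTokenizer st = new StringTokenizer(line);
            if (st.countTokens() < 2) continue; // skip blank or uneven lines
            map.put(st.nextToken(), st.nextToken());
        }
        return map;
    }
}
```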

Etamar L.
StringTokenizer is not the textbook solution anymore: if not quite deprecated, it is at least considered a legacy class according to its JavaDoc.
Thilo
@Thilo: What are you supposed to use instead?
Brendan Long
According to the JavaDoc, String.split.
Thilo
+2  A: 

I don't know about the best way, but I suspect that the most efficient way would be to read one line at a time (using BufferedReader), split each line at the first whitespace character, and then trim both sides. However, whatever you like best is fine unless it needs to be super fast.

I am personally biased against loading an entire file all at once... aside from the fact that it assumes there is enough memory to hold the entire file, it doesn't allow for any parallel computation (for example, if input is coming in from a pipe). It makes sense to be able to process the input while it is still being generated.
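The split-at-first-whitespace idea above can be sketched like this (a hypothetical helper, not code from the answer):

```java
import java.util.HashMap;
import java.util.Map;

public class ManualSplit {
    // Split a line at the first whitespace character and trim both sides
    public static void putLine(Map<String, String> map, String line) {
        int i = 0;
        while (i < line.length() && !Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        if (i == 0 || i == line.length()) {
            return; // no key or no value on this line
        }
        map.put(line.substring(0, i).trim(), line.substring(i).trim());
    }
}
```

Scanning for the first whitespace by hand avoids the regex machinery of String.split(), which is where the speed difference discussed in the comments comes from.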

Michael Aaron Safyan
Yes, I've done some tests with BufferedReaders and Scanners, and using a BufferedReader and doing the split yourself (not using String's .split() method) is much faster than a Scanner, but BufferedReader + String.split() is about the same speed. Either way, it's a lot more work and probably not worth it most of the time.
Brendan Long
@Brendan, I was suggesting splitting manually (rather than with the String.split function), but you are right... it is likely not a significant difference.
Michael Aaron Safyan
+4  A: 

Here's how I would do it! I've been almost exclusively a Java programmer since 2000, so it might be a little old-fashioned. There is one line in particular I'm a little proud of:

new InputStreamReader(fin, "UTF-8");

http://www.joelonsoftware.com/articles/Unicode.html

Enjoy!

import java.io.*;
import java.util.*;

public class StackOverflow2565230 {

  public static void main(String[] args) throws Exception {
    Map<String, String> m = new LinkedHashMap<String, String>();
    FileInputStream fin = null;
    InputStreamReader isr = null;
    BufferedReader br = null;
    try {
      fin = new FileInputStream(args[0]);
      isr = new InputStreamReader(fin, "UTF-8");
      br = new BufferedReader(isr);
      String line = br.readLine();
      while (line != null) {
        // Regex to scan for 1 or more whitespace characters
        String[] toks = line.split("\\s+");
        m.put(toks[0], toks[1]);
        line = br.readLine();
      }
    } finally {
      if (br != null)  { br.close();  }
      if (isr != null) { isr.close(); }
      if (fin != null) { fin.close(); }
    }

    System.out.println(m);
  }

}

And here's the output:

julius@flower:~$ javac StackOverflow2565230.java 
julius@flower:~$ java -cp .  StackOverflow2565230  file.txt 
{grn129=agri-, ac-214=ahss, hud114=ahss, lov1150=ahss, lov1160=ahss, lov1170=ahss, lov1210=ahss}

Yes, my computer's name is Flower. Named after the skunk from Bambi.

One final note: because close() can throw an IOException, this is how I would really close the streams:

} finally {
  try {
    if (br != null) br.close();
  } finally {
    try {
      if (isr != null) isr.close();
    } finally {
      if (fin != null) fin.close();
    }
  }
}
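For what it's worth, on Java 7 and later the same nesting is handled automatically by try-with-resources (a sketch assuming Java 7+; resources are closed in reverse order of declaration, even when an exception is thrown):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class TryWithResourcesParse {
    public static Map<String, String> parse(String path) throws IOException {
        Map<String, String> m = new LinkedHashMap<>();
        // Each resource is closed automatically, in reverse order, even on exceptions
        try (FileInputStream fin = new FileInputStream(path);
             InputStreamReader isr = new InputStreamReader(fin, "UTF-8");
             BufferedReader br = new BufferedReader(isr)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] toks = line.split("\\s+");
                if (toks.length < 2) continue; // skip blank or malformed lines
                m.put(toks[0], toks[1]);
            }
        }
        return m;
    }
}
```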
Julius Davies
+1 would be my solution + LinkedHashMap, nice! "There is one line in particular I'm a little proud of" -> lol
Karussell
+3  A: 

Based on @Julius Davies's answer, here is a shorter version.

import java.io.*; 
import java.util.*; 

public class StackOverflow2565230b { 
  public static void main(String... args) throws IOException { 
    Map<String, String> m = new LinkedHashMap<String, String>(); 
    BufferedReader br = null; 
    try { 
      br = new BufferedReader(new FileReader(args[0])); 
      String line;
      while ((line = br.readLine()) != null) { 
        // Regex to scan for 1 or more whitespace characters 
        String[] toks = line.split("\\s+"); 
        m.put(toks[0], toks[1]); 
      } 
    } finally { 
      if (br != null) br.close(); // don't throw an NPE if the file wasn't found
    } 

    System.out.println(m); 
  } 
}
Peter Lawrey
Good point! if (br != null) br.close();
Julius Davies
A: 

Julius Davies's answer is fine.

However, I am afraid you will have to define the format of the text file to be parsed. For example, what is the separator character between the first column and the second? If it is not fixed, parsing becomes somewhat more difficult.

hguser
A: 

How about caching a regular expression? (String.split() would compile the regular expression on each call)

I'd be curious to see each of the methods performance-tested on several large files (100, 1k, 100k, 1M, 10M entries) to see how they compare.

import java.io.*;
import java.util.*;
import java.util.regex.*;

public class So2565230 {

    private static final Pattern rgx = Pattern.compile("^([^ ]+)[ ]+(.*)$");

    private static InputStream getTestData(String charEncoding) throws UnsupportedEncodingException {
        String nl = System.getProperty("line.separator");
        StringBuilder data = new StringBuilder();
        data.append(" bad data " + nl);
        data.append("grn129          agri-" + nl);
        data.append("grn129          agri-" + nl);
        data.append("ac-214          ahss" + nl);
        data.append("hud114          ahss" + nl);
        data.append("lov1150         ahss" + nl);
        data.append("lov1160         ahss" + nl);
        data.append("lov1170         ahss" + nl);
        data.append("lov1210         ahss" + nl);
        byte[] dataBytes = data.toString().getBytes(charEncoding);
        return new ByteArrayInputStream(dataBytes);
    }

    public static void main(final String[] args) throws IOException {
        String encoding = "UTF-8";

        Map<String, String> valuesMap = new LinkedHashMap<String, String>();

        InputStream is = getTestData(encoding);
        new So2565230().fill(valuesMap, is, encoding);

        for (Map.Entry<String, String> entry : valuesMap.entrySet()) {
            System.out.format("K=[%s] V=[%s]%n", entry.getKey(), entry.getValue());
        }
    }

    private void fill(Map<String, String> map, InputStream is, String charEncoding) throws IOException {
        BufferedReader bufReader = new BufferedReader(new InputStreamReader(is, charEncoding));
        for (String line = bufReader.readLine(); line != null; line = bufReader.readLine()) {
            Matcher m = rgx.matcher(line);
            if (!m.matches()) {
                System.err.println("Line has improper format (" + line + ")");
                continue;
            }
            String key = m.group(1);
            String value = m.group(2);
            if (map.put(key, value) != null) {
                System.err.println("Duplicate key detected: (" + line + ")");
            }
        }
    }
}
TJ