I need to parse some text files that have different types of delimiters (tildes, spaces, commas, pipes, caret characters).

The order of elements also differs depending on the delimiter, e.g.:

comma: A, B, C, D, E
caret: B, C, A, E, D
tilde: C, A, B, D, E

The delimiter is the same within the file but different from one file to another. From what I can tell, there are no delimiters within the data elements.

What's a good approach to do this in plain ol' Java?

+1  A: 
Traveling Tech Guy
The order of elements is not consistent from one file to the next. It also varies with the delimiter.
amsterdam
So, you're saying every file is completely different than the others?
Traveling Tech Guy
:) Yeah. It's a little tricky, isn't it. Sorry, forgot to mention about the order originally.
amsterdam
+2  A: 

I might start by playing with Java's StringTokenizer. This takes a string, and lets you find each token that is separated by a delimiter.

Here is one example from the net.

But you want to tokenize things from a file. In that case, you might want to play with Java's StreamTokenizer, which lets you parse input from a file stream.

edit

If you don't know the delimiters in advance, you could do a few things:

  1. Delimit based on all possible delimiters. If your data itself doesn't contain any of the delimiter characters, then this would work. (i.e., look for both "," and ";", provided that your data itself doesn't have either of those characters)
  2. If you have an idea of what your data is supposed to look like (supposed to be integers, or supposed to be single characters) then your code could try different delimiters (try "," first, then try ";", etc) until it parsed a line of text "correctly".
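A minimal sketch of the first option, assuming the candidate set is the five delimiters named in the question and that none of them occurs inside the data. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class AnyDelimiterTokens {
    // Candidate delimiters from the question: tilde, space, comma, pipe, caret.
    // Assumption: none of these characters occurs inside a data element.
    static final String CANDIDATES = "~ ,|^";

    // Splits a line on every candidate delimiter at once.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        // Each character of the second argument is treated as a delimiter.
        StringTokenizer st = new StringTokenizer(line, CANDIDATES);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("A, B, C, D, E")); // comma followed by a space
        System.out.println(tokens("B^C^A^E^D"));
        System.out.println(tokens("C~A~B~D~E"));
    }
}
```

Note that StringTokenizer never returns empty tokens, so a line with a missing value (two delimiters in a row) would come back with fewer tokens than expected.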
rascher
That example seems to require me to hardcode a list of delimiters. Is that right?
amsterdam
With my answer, that is correct - you would need to know the delimiter, and then pass that into the constructor of StringTokenizer (or into one of StreamTokenizer's setters.) I've edited my answer, though.
rascher
+1  A: 

If it's the same delimiter throughout the file, then you can probably pass the delimiter in when you load the file to parse it.

For example:

    void someFunction(char delimiter) {
        // do whatever you want to do with the file;
        // you can use StringTokenizer for this purpose
    }

Each time you load a file, you can call this function with that file's delimiter as the argument.

Hope this helps.. :-)

Richie
If this is an automated process, then I can't pass the delimiter as an argument because I don't know what's in the file at that point.
amsterdam
If "The delimiter is the same within the file but different from one file to another. From what I can tell, there are no delimiters within the data elements" is the case, then you can probably pass in a list of delimiters, one of which is the likely delimiter for the file.
Richie
+1  A: 

You could write a class that parses a file something like this:

interface MyParser {
  // Java interfaces cannot declare constructors; an implementing class
  // would provide: MyParser(char delimiter, List<String> fields)

  Map<String,String> ParseFile(InputStream file);
}

You'd pass the delimiter and an ordered list of fields to the constructor, then ask it to parse a file. You'd get back a map of field names (from the ordered list) to values.

The implementation of ParseFile would probably use split with the delimiter and then iterate through the array returned by split and the list of fields concurrently, creating the map as it went.
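One way this could look, as a sketch rather than a finished implementation. The class name FieldMapParser is made up, the method is spelled parseFile to follow Java naming conventions, and it returns one map per line instead of a single map, which is an assumption about what the caller wants. It also assumes UTF-8 input and closes the stream when done.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class FieldMapParser {
    private final String delimiterRegex;
    private final List<String> fields;

    FieldMapParser(char delimiter, List<String> fields) {
        // split() takes a regex, and | and ^ are regex metacharacters, so quote it.
        this.delimiterRegex = Pattern.quote(String.valueOf(delimiter));
        this.fields = fields;
    }

    // Returns one field-name-to-value map per non-blank line.
    List<Map<String, String>> parseFile(InputStream file) {
        List<Map<String, String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(file, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                // The -1 limit keeps trailing empty values instead of dropping them.
                String[] values = line.split(delimiterRegex, -1);
                Map<String, String> row = new LinkedHashMap<>();
                // Walk the field names and the split values side by side.
                for (int i = 0; i < fields.size() && i < values.length; i++) {
                    row.put(fields.get(i), values[i].trim());
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

For a caret-delimited file from the question, the caller would construct it as `new FieldMapParser('^', Arrays.asList("B", "C", "A", "E", "D"))`, so the differing field order is carried by the list handed to the constructor.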

Moishe
How would that take into account the different order of elements in each file depending on what the delimiter is?
amsterdam
A: 

Most of the open source CSV parsing libraries allow you to change the delimiter characters, and also have behavior built in to handle escaping. Opencsv seems to be the popular one nowadays, but I haven't used it yet. I was pretty happy with the Ostermiller csv library last time I had to do a lot of csv parsing.

Jason Gritman
I'm still hoping to write this myself in plain java but thanks for the links.
amsterdam
+1  A: 

One possible approach is to use the Java Compiler Compiler (https://javacc.dev.java.net/). With this you can write a set of rules for what you will accept and what delimiters might appear at any one time. The engine can be given rules to work around order issues depending on the delimiter in use. And the file could, if necessary, switch delimiters along the way.

Jonathan B
It's a pretty basic parsing problem. Seems like it should be easy enough to write myself without resorting to external tools. Thanks for the link. I'll save it in case I'm not able to figure out how to do it in plain java.
amsterdam
Yes, a parser generator would work well to define the syntax you'll see. That's likely the right way to do it, though I myself can never seem to get yacc/javacc to do things well.
Xepoch
+1  A: 

I like to read the first two lines of a file, and then test the delimiters. If you split on a delimiter, and both lines return the same non-zero number of pieces, then you've probably guessed the correct one. Here's an example program which checks the file names.txt.

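import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class DelimiterCheck { // class name is arbitrary

 // Candidate delimiters, tried in order.
 private static final String[] DELIMS = new String[] { "\t", ",", " " };

 public static void main(String[] args) throws IOException {
  File file = new File("etc/names.txt");

  String delim = getDelimiter(file);
  System.out.println("Delim is " + delim + " (" + (int) delim.charAt(0) + ")");
 }

 private static String getDelimiter(File file) throws IOException {
  // Read the first two lines once instead of reopening the file per delimiter.
  String first;
  String second;
  try (BufferedReader br = new BufferedReader(new FileReader(file))) {
   first = br.readLine();
   second = br.readLine();
  }
  if (first == null || second == null) {
   throw new IllegalStateException("Need at least two lines in file " + file);
  }
  for (String delim : DELIMS) {
   // split() takes a regex; quoting keeps delimiters such as | or ^ literal.
   String[] line0 = first.split(Pattern.quote(delim));
   String[] line1 = second.split(Pattern.quote(delim));
   if (line0.length == line1.length && line0.length > 1) {
    return delim;
   }
  }
  throw new IllegalStateException("Failed to find delimiter for file " + file);
 }
}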
brianegge
Let me make sure I understood you. You're saying to read in the file and test against the set of known possible delimiters. Not sure what you mean by "If you split on a delimiter, and both lines return the same non-zero number of pieces" -> Does this refer to a function for splitting strings based on a delimiter? My apologies, I'm new to Java.
amsterdam
+1  A: 

If the exact order of the fields is known when a specific delimiter is used, I'd just create a parser that would return a Record object for each line... something like below.

This does include a lot of hard-coded values, but I'm not sure how flexible you need this to be. I would consider this more of a scripty/hacky solution rather than something you could extend. If you don't know the delimiters, you could test the first line of the file by using the String.split() method and see if the number of columns matches the expected count.

    import java.util.StringTokenizer;

    class MyParser
    {
        public static Record parseLine(String line, char delimiter)
        {
            // StringTokenizer takes its delimiters as a String, not a char
            StringTokenizer st1 = new StringTokenizer(line, String.valueOf(delimiter));
            //You could easily use an array instead of these dumb variables
            String temp1,temp2,temp3,temp4,temp5;

            temp1 = st1.nextToken();
            // .. etc..

            Record ret = new Record();
            switch (delimiter)
            {
                case '^':
                    // caret order in the question is B, C, A, E, D
                    ret.A = temp3;
                    ret.B = temp1;
                    // ...etc...
                    break;
                case '~':
                    // ...etc...
                    break;
            }
            return ret;
        }
    }

    class Record
    {
        String A;
        String B;
        String C;
        String D;
        String E;
    }
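A fuller sketch of the same idea, using a lookup table in place of the switch and including the column-count check mentioned above. The class name is made up, the field orders are the three examples from the question, and a Map stands in for the Record class to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class OrderedRecordParser {
    // Field order per delimiter, taken from the question's examples.
    private static final Map<Character, String[]> ORDER = new HashMap<>();
    static {
        ORDER.put(',', new String[] { "A", "B", "C", "D", "E" });
        ORDER.put('^', new String[] { "B", "C", "A", "E", "D" });
        ORDER.put('~', new String[] { "C", "A", "B", "D", "E" });
    }

    // Returns field name -> value; throws if the line does not fit the expected layout.
    static Map<String, String> parseLine(String line, char delimiter) {
        String[] order = ORDER.get(delimiter);
        if (order == null) {
            throw new IllegalArgumentException("No field order known for delimiter " + delimiter);
        }
        // Quote the delimiter because split() treats its argument as a regex.
        String[] values = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        if (values.length != order.length) {
            throw new IllegalArgumentException(
                "Expected " + order.length + " columns but found " + values.length);
        }
        Map<String, String> record = new HashMap<>();
        for (int i = 0; i < order.length; i++) {
            record.put(order[i], values[i].trim());
        }
        return record;
    }

    public static void main(String[] args) {
        // Caret files list the fields as B, C, A, E, D.
        System.out.println(parseLine("b^c^a^e^d", '^').get("A")); // prints a
    }
}
```

Adding a new file layout then only needs one more entry in the table rather than another case block.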
llamaoo7
+1  A: 

You can use the StringTokenizer as mentioned earlier. Yes, you will need to specify a string containing all the possible delimiters. Don't forget to pass true for the "returnDelims" argument of the tokenizer's constructor. That way the delimiters come back as tokens too, so you will know which delimiter is used in the file and can then parse the data accordingly.
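A small sketch of that approach, relying on the three-argument StringTokenizer constructor, whose boolean flag makes the delimiters come back as one-character tokens. The class name, method name and candidate set are assumptions for illustration.

```java
import java.util.StringTokenizer;

class DelimiterFromTokenizer {
    // Assumed candidate set: tilde, comma, pipe, caret, space.
    static final String CANDIDATES = "~,|^ ";

    // Returns the first candidate delimiter met in the line, or null if there is none.
    static String firstDelimiter(String line) {
        // The third argument (true) asks the tokenizer to return delimiters as tokens.
        StringTokenizer st = new StringTokenizer(line.trim(), CANDIDATES, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() == 1 && CANDIDATES.contains(token)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstDelimiter("B^C^A^E^D"));
        System.out.println(firstDelimiter("A, B, C, D, E"));
    }
}
```

Once the delimiter is known from the first line, the rest of the file can be tokenized with that single character only.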

camickr
+1  A: 

One way to find the delimiter in the file is to use some kind of regex. A simple case would be to find the first character that isn't alphabetical or numerical: [^A-Za-z0-9]

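import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterFinder { // class name is arbitrary

 // Returns the first non-alphanumeric character, or null if there is none.
 static String getDelimiter(String str) {
  Pattern p = Pattern.compile("([^A-Za-z0-9])");
  Matcher m = p.matcher(str.trim()); //remove whitespace as first char(s)
  if(m.find())
   return m.group(0);
  else
   return null;
 }

 public static void main(String[] args) {
  String[] str = {" A, B, C, D", "A B C D", "A;B;C;D"};
  for(String s : str){
   // Quote the delimiter, because split() treats its argument as a regex
   String[] data = s.split(Pattern.quote(getDelimiter(s)));
   //do clever stuff with the array
  }
 }
}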

In this case I've loaded the data from an array instead of reading from a file. When reading from a file feed the first line to the getDelimiter method.
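A sketch of the file-reading variant, assuming the same first-non-alphanumeric rule as getDelimiter above. The class name and the readAll method are made up, and a Reader parameter is used so that a FileReader or any other source can be passed in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstLineDelimiter {
    private static final Pattern NON_ALNUM = Pattern.compile("[^A-Za-z0-9]");

    // The first non-alphanumeric character of the trimmed line, or null if none.
    static String getDelimiter(String str) {
        Matcher m = NON_ALNUM.matcher(str.trim());
        return m.find() ? m.group() : null;
    }

    // Detects the delimiter from the first line, then splits every line with it.
    static List<String[]> readAll(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line = br.readLine();
            if (line == null) {
                return rows; // empty input
            }
            String delim = getDelimiter(line);
            if (delim == null) {
                throw new IllegalStateException("No delimiter found in: " + line);
            }
            // Quote the delimiter and absorb any whitespace around it.
            String regex = "\\s*" + Pattern.quote(delim) + "\\s*";
            for (; line != null; line = br.readLine()) {
                rows.add(line.trim().split(regex, -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // A StringReader stands in for new FileReader(path) here.
        List<String[]> rows = readAll(new StringReader("C~A~B~D~E\nc~a~b~d~e\n"));
        System.out.println(rows.size() + " rows, " + rows.get(0).length + " columns");
    }
}
```

This only works while the data elements themselves are purely alphanumeric; a value containing a hyphen or a dot on the first line would be mistaken for the delimiter.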

Kennet