I am working on parsing plain text and converting it to key-value pairs. For example, plain text:

some_uninteresting_thing
key1 valueA, some_uninteresting_thing  valueB
key2 valueD
key3 some_uninteresting_thing  valueE 
key4 valueG(valueH, valueI)
key5 some_uninteresting_thing 

And possible mappings:

 Map(

 key1 ->(valueA, valueB,valueC), 
 key2 ->(valueD, valueE),
 key3 ->(valueF),
 key4 ->(valueH, valueI)

 ...
 )

And the result will be:

key1 ->(valueA, valueB)
key2 ->(valueD)
key4 ->(valueH, valueI)

(key5 shouldn't be mapped because it has no appropriate values.) As you can see, the plain text is lenient. What Java library will help to handle this?

A: 

You can use an Interpreter and a Builder.

The Interpreter parses the source and identifies keys and values, which are passed to the Builder, which constructs any data structure you desire.
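
A minimal sketch of that split, assuming a line-oriented format in which the first token of a line is the key and the remaining tokens are candidate values. The class names `KeyValueBuilder` and `LineInterpreter` are hypothetical and do not come from any library:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical builder: collects key/value pairs into a map of lists. */
class KeyValueBuilder {
    private final Map<String, List<String>> map = new LinkedHashMap<>();

    void add(String key, String value) {
        map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    Map<String, List<String>> build() {
        return map;
    }
}

/** Hypothetical interpreter: walks the text and reports keys and values to the builder. */
class LineInterpreter {
    static void interpret(String text, KeyValueBuilder builder) {
        for (String line : text.split("\\R")) {
            // Assumption: tokens are separated by whitespace, commas, or parentheses.
            String[] tokens = line.trim().split("[\\s,()]+");
            if (tokens.length < 2) {
                continue; // a lone token cannot be a key with values
            }
            for (int i = 1; i < tokens.length; i++) {
                builder.add(tokens[0], tokens[i]);
            }
        }
    }
}
```

The interpreter only knows how to read the text, and the builder only knows how to assemble the result, so either side can be swapped out (for example, a builder that drops values not present in the allowed mappings).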

+2  A: 

If you are familiar with formal languages, tokenization, grammars, etc., you could use a parser generator like JavaCC. JavaCC takes the grammar file that you write and generates Java code that parses the text file into a series of tokens or a syntax tree. There are plugins for Maven and Ant that can help integrate this additional source into your build.

For a runtime-only solution, there is RunCC, which I've used with good results. (I suspect it is not as fast as JavaCC, but for my case the performance was fine.)

There is also Chaperon, which converts plain text to XML, using a grammar file.

An alternative to these is to use an ad hoc mix of regex and StringTokenizer.
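
As a rough sketch of that ad hoc route, the snippet below uses a regex to separate the key from the rest of the line and a `StringTokenizer` to break the rest into values. The class name `AdHocLineSplitter`, the assumed line shape (key, whitespace, remainder), and the delimiter set are illustrative assumptions, since the exact text format is not specified:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AdHocLineSplitter {
    // Assumed line shape: a key, some whitespace, then the rest of the line.
    private static final Pattern KEY_AND_REST = Pattern.compile("^\\s*(\\S+)\\s+(.*)$");

    /**
     * Returns the key followed by any value tokens, or an empty list
     * when the line does not match KEY_AND_REST.
     */
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = KEY_AND_REST.matcher(line);
        if (!m.matches()) {
            return tokens;
        }
        tokens.add(m.group(1));
        // Assumption: values are delimited by spaces, tabs, commas, or parentheses.
        StringTokenizer st = new StringTokenizer(m.group(2), " \t,()");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}
```

Uninteresting tokens are not removed here; they simply end up in the list alongside the real values and can be discarded in a later filtering pass.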

With a parser project or regex armed and ready, your general approach is then like this:

  1. Write a grammar for your plain text file. Some details about your plain text format are missing, but you may simply be able to use BufferedReader.readLine() to read the lines of the file, and StringTokenizer to split each line into substrings at spaces and commas.
  2. From the strings you get from the parser, use the first string as the key and add the subsequent strings as values to a Map. E.g., in pseudocode:

    Map<String, List<String>> map = new HashMap<String, List<String>>();
    for each line {
        List<String> tokens = ...; // result of splitting the line
        String key = tokens.get(0);
        map.put(key, tokens.subList(1, tokens.size()));
    }

    Even if the parser doesn't filter out the uninteresting text, it will be filtered out later, in step 4.

  3. Build a parser with one of the above projects to parse the map file format. Again, you may be able to build a simple parser with regexes and StringTokenizer. Use the parser to build a map. The map has the same signature as above, i.e., Map<String,List<String>>.

  4. Finally, filter the input map against the allowed values map.

Something like this.

   // needs java.util.ArrayList, java.util.HashMap, java.util.List, java.util.Map
   Map<String,List<String>> input = ...; // from steps 1 and 2
   Map<String,List<String>> allowed = ...; // from step 3
   Map<String,List<String>> result = new HashMap<String,List<String>>(); // the final map
   for (String key : input.keySet()) {
      if (allowed.containsKey(key)) {
         List<String> outputValues = new ArrayList<String>();
         List<String> allowedValues = allowed.get(key);
         List<String> inputValues = input.get(key);
         // keep only the values that are allowed for this key
         for (String value : inputValues) {
            if (allowedValues.contains(value))
                outputValues.add(value);
         }
         // skip keys that end up with no allowed values
         if (!outputValues.isEmpty())
            result.put(key, outputValues);
      }
   }
   // the final result is in result
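
Step 3 leaves the parser for the mappings open. One possible regex-based sketch is shown below, assuming each entry looks like `key ->(v1, v2, ...)` as in the question's example. The class name `AllowedMapParser` and the pattern are assumptions for illustration, not a fixed format definition:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllowedMapParser {
    // Assumed entry shape, taken from the question: key ->(v1, v2, ...)
    private static final Pattern ENTRY =
            Pattern.compile("(\\w+)\\s*->\\s*\\(([^)]*)\\)");

    static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> allowed = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(text);
        // Scan the whole text; anything that is not an entry is skipped.
        while (m.find()) {
            List<String> values = new ArrayList<>();
            for (String v : m.group(2).split(",")) {
                String trimmed = v.trim();
                if (!trimmed.isEmpty()) {
                    values.add(trimmed);
                }
            }
            allowed.put(m.group(1), values);
        }
        return allowed;
    }
}
```

Because the matcher searches for entries rather than requiring the whole text to match, the surrounding `Map(` wrapper and missing commas between entries do not need special handling.
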
mdma