Hey guys, I am working on a project where I need to read some generic text. I am looking for an API with which I can read generic text and convert it to a .csv file. Can anyone please help? I am using Java on Windows.

More detail, to clarify:

Assume I have a PDF document or, for that matter, a document of any file type. I intend to use the "Print to Generic Text printer" option to get the file in that format. Then I intend to use some API that lets me programmatically read this Generic Text Format file and extract text from it.

So, whatever the file type (.doc, .pdf, .xls, etc.), I intend to create a Generic Text Format file using the print option, then run my code to read those files and extract some information.

PS: Assume that I have a status report form with standard fields. Some people might submit it as .pdf, some as .doc, and some as plain text. Every document contains the same fields, but probably with different layouts.

So I am looking for a generic solution by which I can convert every file type into the generic text file format and then apply some logic to extract my status report fields.

A: 

A generic free book: Text Processing in Python

The MYYN
A: 

Just use the standard Java classes for I/O:

BufferedWriter, File, FileWriter, IOException, PrintWriter

.csv is simply a comma-separated values file. So just name your output file with a .csv extension.

You'll also need to figure out how you'd like to split your content.

Here are Java examples to get you going:

writing to a text file

how to read lines from a file
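To make the "just write it with a .csv extension" idea concrete, here is a minimal sketch using only standard Java I/O classes. The class name `CsvWriterSketch`, the sample rows and the output file name `status.csv` are made up for illustration. It assumes the usual convention of quoting any field that contains a comma, a double quote or a line break, and doubling embedded quotes.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
class CsvWriterSketch {

    // Quote a field if it contains a comma, a quote or a line break,
    // and double any embedded quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join the escaped fields of one row with commas.
    static String toCsvLine(List<String> fields) {
        return fields.stream()
                .map(CsvWriterSketch::escape)
                .collect(Collectors.joining(","));
    }

    // Write all rows to the given file, one CSV line per row.
    static void writeCsv(Path out, List<List<String>> rows) throws IOException {
        try (PrintWriter writer = new PrintWriter(
                Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            for (List<String> row : rows) {
                writer.println(toCsvLine(row));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample data, invented for this sketch.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Task", "Status"),
                Arrays.asList("Build server", "Done, pending review"));
        writeCsv(Paths.get("status.csv"), rows);
    }
}
```

The second row contains a comma, so that field ends up wrapped in quotes in the output file. If the data never contains commas or quotes, plain string joining is enough.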

Chris Tek
+1  A: 

CSV is a format for data in columns. It's not very useful for, say, a Wikipedia article.

The Apache Tika library will take all kinds of data and turn it into bland XML, from which you can make CSV as you like.

It would help if you would edit your question to clarify 'generic' versus 'generated', and tell us more about the data.

As for Windows printer drivers, are you looking to do something like 'print to pdf' as 'print to csv'? If so, I suspect that you need to start from MSDN samples of printer drivers and code this the hard way.

The so-called 'generic text file format' is not a structured format. It's completely unpredictable what you will find in there for any given input to the printer system.
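Since the layout is unpredictable, about the best you can do is search for the field labels themselves. Here is a rough sketch of that idea, assuming each field appears on its own line as "Label: value" (or with "-" or "=" as the separator). That assumption may well not hold for real printer output. The class name `FieldExtractorSketch` and the labels are invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class FieldExtractorSketch {

    // For each label, find the first line that looks like
    // "<label> <separator> <value>" and keep the trimmed value.
    // Matching is case-insensitive; labels with no match are left out.
    static Map<String, String> extract(String text, List<String> labels) {
        Map<String, String> found = new LinkedHashMap<>();
        for (String label : labels) {
            Pattern p = Pattern.compile(
                    "(?im)^[ \\t]*" + Pattern.quote(label)
                    + "[ \\t]*[:=-][ \\t]*(.+?)[ \\t]*$");
            Matcher m = p.matcher(text);
            if (m.find()) {
                found.put(label, m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample text, invented for this sketch.
        String text = "STATUS REPORT\n"
                + "  Project :  Apollo  \n"
                + "owner - Jane Doe\n"
                + "Date: 2010-01-05\n";
        Map<String, String> fields =
                extract(text, Arrays.asList("Project", "Owner", "Date"));
        System.out.println(fields);
    }
}
```

Anything that puts the value on a different line from its label, or splits a form into columns, will defeat this, so treat it as a starting point only.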

bmargulies
+1  A: 

In Java this is more or less what you need to read a text file, assuming it's comma separated (just change the string in the "line.split" method if you need something else). It also skips the header.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public void parse(String filename) throws IOException {
        // try-with-resources closes the reader even if an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            br.readLine(); // skips header
            String line;
            while ((line = br.readLine()) != null) {
                String[] fields = line.split(",");
                // do whatever
                System.out.println(fields[0]);
            }
        }
    }
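One caveat with `line.split(",")`: it also splits on commas inside quoted fields. If your data can contain those, something like the following sketch may help. It assumes fields are quoted with double quotes, that embedded quotes are doubled, and that no field spans more than one line. The class name `CsvLineSplitterSketch` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class CsvLineSplitterSketch {

    // Split one CSV line on commas, ignoring commas inside double quotes.
    // A doubled quote ("") inside a quoted field becomes one literal quote.
    static List<String> splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitCsvLine("Build server,\"Done, pending review\""));
    }
}
```

You would call `splitCsvLine(line)` in place of `line.split(",")` in the loop above.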
Thrawn