Hi there, I'm facing a problem. Can anyone help me?

The problem is: I'm trying to open an MS Word 2003 document in Java, search for a specified String, and replace it with a new String. I use Apache POI to do that. My code is the following:

public void searchAndReplace(String inputFilename, String outputFilename,
        HashMap<String, String> replacements) {
    File outputFile = null;
    File inputFile = null;
    FileInputStream fileIStream = null;
    FileOutputStream fileOStream = null;
    BufferedInputStream bufIStream = null;
    BufferedOutputStream bufOStream = null;
    POIFSFileSystem fileSystem = null;
    HWPFDocument document = null;
    Range docRange = null;
    Paragraph paragraph = null;
    CharacterRun charRun = null;
    Set<String> keySet = null;
    Iterator<String> keySetIterator = null;
    int numParagraphs = 0;
    int numCharRuns = 0;
    String text = null;
    String key = null;
    String value = null;
    try {
        // Create an instance of the POIFSFileSystem class and
        // attach it to the Word document using an InputStream.
        inputFile = new File(inputFilename);
        fileIStream = new FileInputStream(inputFile);
        bufIStream = new BufferedInputStream(fileIStream);
        fileSystem = new POIFSFileSystem(bufIStream);
        document = new HWPFDocument(fileSystem);
        docRange = document.getRange();
        numParagraphs = docRange.numParagraphs();
        keySet = replacements.keySet();
        for (int i = 0; i < numParagraphs; i++) {
            paragraph = docRange.getParagraph(i);
            text = paragraph.text();
            numCharRuns = paragraph.numCharacterRuns();
            for (int j = 0; j < numCharRuns; j++) {
                charRun = paragraph.getCharacterRun(j);
                text = charRun.text();
                System.out.println("Character Run text: " + text);
                keySetIterator = keySet.iterator();
                while (keySetIterator.hasNext()) {
                    key = keySetIterator.next();
                    if (text.contains(key)) {
                        value = replacements.get(key);
                        charRun.replaceText(key, value);
                        docRange = document.getRange();
                        paragraph = docRange.getParagraph(i);
                        charRun = paragraph.getCharacterRun(j);
                        text = charRun.text();
                    }
                }
            }
        }
        bufIStream.close();
        bufIStream = null;
        outputFile = new File(outputFilename);
        fileOStream = new FileOutputStream(outputFile);
        bufOStream = new BufferedOutputStream(fileOStream);
        document.write(bufOStream);
    } catch (Exception ex) {
        System.out.println("Caught an: " + ex.getClass().getName());
        System.out.println("Message: " + ex.getMessage());
        System.out.println("Stacktrace follows.............");
        ex.printStackTrace(System.out);
    }
}

When I call this function with the following arguments:

HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);

When Test.doc contains a simple line such as "AAA EEE", it works. But when I use a more complicated file, the content is read successfully and Test1.doc is generated, yet when I try to open it, Word gives me the following error:

Word unable to read this document. It may be corrupt. Try one or more of the following: * Open and repair the file. * Open the file with Text Recovery converter. (C:\Test1.doc)

Please tell me what to do; I'm a beginner with POI and I don't have any good tutorials for it. Thank you in advance. Saeed

+2  A: 

First of all you should be closing your document.
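A minimal sketch of what closing could look like, assuming "closing your document" refers to the output streams in the question. It uses only the JDK; the class and method names are hypothetical. A BufferedOutputStream holds bytes in memory until it is flushed or closed, so a stream that is never closed can leave the tail of the file unwritten. Whether that is what corrupts Test1.doc here is not established.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of POI: shows the closing pattern only.
class StreamCloseSketch {

    // Writes data to target in small chunks through a buffered stream.
    static void writeInChunks(Path target, byte[] data, int chunkSize) throws IOException {
        // try-with-resources closes the stream even if a write throws;
        // close() flushes whatever is still sitting in the buffer.
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            for (int offset = 0; offset < data.length; offset += chunkSize) {
                int length = Math.min(chunkSize, data.length - offset);
                out.write(data, offset, length);
            }
        }
    }
}
```

In the question's method, the equivalent change would be to open bufOStream in a try-with-resources header (or close it in a finally block) around the call to document.write(bufOStream).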

Besides that, I suggest resaving your original Word document as a Word XML document, then manually changing the extension from .XML to .doc. Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).

Without seeing the actual Word document it's hard to say what's going on.

Unfortunately, there is not much documentation about POI, especially for Word documents.

AlbertoPL
First of all, thank you very much for your answer. I added a 'finally' section just to close the document; thanks for your interest. AAA and EEE are not my real values, I used them just as an example; my actual values look like <<SubSource>>, <<Date>> and so on. About your suggestion to save the doc file as XML from MS Word, can I ask a question: if I save the file as XML, can I open it with SAXParser and replace the text I need to replace, or will it be encrypted?
Saeed
Yes, you can open it with SAXParser once it is saved as XML.
AlbertoPL
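As an illustration of the SAXParser route discussed in these comments, here is a rough sketch that scans an XML string for text containing a placeholder. The element name w:t and the sample markup used with it are simplified assumptions standing in for a real Word XML file, and the class and method names are hypothetical. Text is buffered per element because a SAX parser may deliver one run of text in several characters() calls.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every <w:t> element that
// contains the given placeholder. The element name is an assumption.
class PlaceholderScanSketch {

    static List<String> findTextRuns(String xml, String placeholder) throws Exception {
        List<String> hits = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler() {
            private final StringBuilder current = new StringBuilder();
            private boolean inText = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("w:t".equals(qName)) {
                    inText = true;
                    current.setLength(0); // start a fresh buffer for this element
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    current.append(ch, start, length); // may be called several times
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("w:t".equals(qName)) {
                    inText = false;
                    if (current.indexOf(placeholder) >= 0) {
                        hits.add(current.toString());
                    }
                }
            }
        });
        return hits;
    }
}
```

SAX only reads the document. To write a modified copy you would have to emit the XML yourself while parsing, or load the file into a DOM instead.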
A: 

Could this be the issue?

pugmarx
+1  A: 

You could try the OpenOffice API, but there aren't many resources out there to tell you how to use it.

01
Thanks very much. I use these APIs just to open the .docx files, get the core document as an XML file, parse it with an XML parser, and then search for what I need using XPath, and everything is OK. Another solution, without the OpenXML API: open the .docx file in MS Word 2007, save it as an XML file (not 2003 XML), parse the XML file in Java, and replace what you need. With this solution you can also replace images. An image is stored in the XML file as a Base64-encoded string, and you can replace that string with the Base64 encoding of another image.
Saeed
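The parse, search with XPath, and replace approach described in this comment might look roughly like the sketch below, which uses only the JDK. The class and method names are hypothetical, the XPath expression simply visits every text node, and real Word XML would need its actual element names and namespaces. The second method shows how image bytes could be turned into a Base64 string for the image case.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class XmlReplaceSketch {

    // Replaces every occurrence of placeholder in the document's text nodes
    // and returns the modified XML as a string.
    static String replaceText(String xml, String placeholder, String value) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Select all text nodes; a narrower expression is better for real files.
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//text()", doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            String content = node.getNodeValue();
            if (content.contains(placeholder)) {
                node.setNodeValue(content.replace(placeholder, value));
            }
        }

        // Serialize the modified DOM back to text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Encodes raw image bytes as the Base64 string that would replace the old one.
    static String encodeImage(byte[] imageBytes) {
        return Base64.getEncoder().encodeToString(imageBytes);
    }
}
```

A placeholder that Word has split across several text nodes would not be matched by this per-node search.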
A: 

Hi all, I don't know whether it's OK to answer my own question, but just to share the knowledge, I'll do so.

After searching the web, the final solution I found is the docx4j library, which is very good for dealing with MS docx files. Its documentation is still thin and its forum is still in its early days, but overall it helped me do what I needed.

Thanks to all who helped me.

Saeed
The documentation has improved somewhat over the last 6 months or so. See http://dev.plutext.org/svn/docx4j/trunk/docx4j/docs/Docx4j_GettingStarted.html
plutext
A: 

You can also try this one: http://www.dancrintea.ro/doc-to-pdf/

gusti