I need to programmatically convert a Word XML file into an RTF file. This has become a requirement because of some third-party libraries. Is there an API or library that can do that?

The language is not really a problem because I just need the work done, but Java, .NET languages, or Python are preferred.

A: 

Java

I've used Apache POI in the past to parse Word documents, and it seemed to work pretty well. Then here are some libraries to write to RTF.

.NET

Here's an article about writing to a Word document in .NET. I'm sure you could use the same library for reading.

Python

Here is an article for Python.

Related Question

Also, here is a related, if not duplicate, question.

Jay Askren
A: 

Have a look at Docvert. You'll have to set it up yourself because the demo only lets you upload OpenOffice documents, I believe.

pocketfullofcheese
A: 

You can use AutoIt to automate opening the XML files in Word and doing a Save As to RTF.

I've used the user-defined functions for Word to save RTF files as plain text for conversion, and it works well. The syntax is very easy.

http://www.autoitscript.com/autoit3/index.shtml
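
The same open-and-save-as idea can also be driven from Python through Word's COM interface instead of AutoIt. This is a rough sketch rather than the AutoIt approach itself: it assumes Windows with Word and the pywin32 package installed, and `save_as_rtf` and `convert_with_word` are hypothetical helper names. The value 6 is Word's `wdFormatRTF` constant.

```python
import os

WD_FORMAT_RTF = 6           # Word's wdFormatRTF constant
WD_DO_NOT_SAVE_CHANGES = 0  # Word's wdDoNotSaveChanges constant


def save_as_rtf(word, input_path, output_path):
    """Open input_path in a running Word instance and save it as RTF."""
    # Word resolves relative paths against its own working directory,
    # so pass absolute paths.
    document = word.Documents.Open(os.path.abspath(input_path))
    try:
        document.SaveAs(os.path.abspath(output_path), WD_FORMAT_RTF)
    finally:
        document.Close(WD_DO_NOT_SAVE_CHANGES)


def convert_with_word(input_path, output_path):
    """Start Word via COM (requires pywin32), convert one file, then quit."""
    import win32com.client  # imported here so the module loads without pywin32

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        save_as_rtf(word, input_path, output_path)
    finally:
        word.Quit()
```

Keeping the Word instance as a parameter of `save_as_rtf` means one Word process could be reused for many files instead of being restarted each time.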

DevNull
A: 

A Python/Linux way:

You need the OpenOffice UNO bridge (on a server you can run OpenOffice in headless mode). With it you can convert any format OpenOffice can read into any format it can write:

See http://wiki.services.openoffice.org/wiki/Framework/Article/Filter/FilterList_OOo_3_0

Start OpenOffice in headless mode, listening on port 8100:

/usr/lib64/openoffice.org/program/soffice.bin -accept=socket,host=localhost,port=8100\;urp -headless
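
The same start-up could be scripted from Python. This is a minimal sketch, assuming the soffice.bin path and port from the command above; the helper names `build_soffice_command` and `start_headless_office` are made up for illustration. Newer LibreOffice builds may expect the double-dash forms `--accept` and `--headless`, so adjust the flags to your installation.

```python
import subprocess

# Path and port are taken from the shell command above; adjust for your system.
SOFFICE_BIN = "/usr/lib64/openoffice.org/program/soffice.bin"
DEFAULT_PORT = 8100


def build_soffice_command(binary=SOFFICE_BIN, port=DEFAULT_PORT):
    """Return the argument list for a headless, socket-listening OpenOffice."""
    # No shell is involved, so the ';' before 'urp' needs no backslash escape.
    accept = "-accept=socket,host=localhost,port=%d;urp" % port
    return [binary, accept, "-headless"]


def start_headless_office(binary=SOFFICE_BIN, port=DEFAULT_PORT):
    """Launch OpenOffice in the background and return the Popen handle."""
    return subprocess.Popen(build_soffice_command(binary, port))
```

`start_headless_office()` returns as soon as the process is spawned; the office process usually needs a moment before it accepts connections, so a client may have to retry.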

Python Example:

import uno
from os.path import abspath, isfile, splitext
from com.sun.star.beans import PropertyValue
from com.sun.star.task import ErrorCodeIOException
from com.sun.star.connection import NoConnectException

FAMILY_TEXT = "Text"
FAMILY_SPREADSHEET = "Spreadsheet"
FAMILY_PRESENTATION = "Presentation"
FAMILY_DRAWING = "Drawing"
DEFAULT_OPENOFFICE_PORT = 8100

FILTER_MAP = {
    "pdf": {
        FAMILY_TEXT: "writer_pdf_Export",
        FAMILY_SPREADSHEET: "calc_pdf_Export",
        FAMILY_PRESENTATION: "impress_pdf_Export",
        FAMILY_DRAWING: "draw_pdf_Export"
    },
    "html": {
        FAMILY_TEXT: "HTML (StarWriter)",
        FAMILY_SPREADSHEET: "HTML (StarCalc)",
        FAMILY_PRESENTATION: "impress_html_Export"
    },
    "odt": { FAMILY_TEXT: "writer8" },
    "doc": { FAMILY_TEXT: "MS Word 97" },
    "rtf": { FAMILY_TEXT: "Rich Text Format" },
    "txt": { FAMILY_TEXT: "Text" },
    "docx": { FAMILY_TEXT: "MS Word 2007 XML" },
    "ods": { FAMILY_SPREADSHEET: "calc8" },
    "xls": { FAMILY_SPREADSHEET: "MS Excel 97" },
    "odp": { FAMILY_PRESENTATION: "impress8" },
    "ppt": { FAMILY_PRESENTATION: "MS PowerPoint 97" },
    "swf": { FAMILY_PRESENTATION: "impress_flash_Export" }
}

class DocumentConverter:

    def __init__(self, port=DEFAULT_OPENOFFICE_PORT):
        localContext = uno.getComponentContext()
        resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
        try:
            self.context = resolver.resolve("uno:socket,host=localhost,port=%s;urp;StarOffice.ComponentContext" % port)
        except NoConnectException:
            raise Exception("failed to connect to OpenOffice.org on port %s" % port)
        self.desktop = self.context.ServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", self.context)

    def convert(self, inputFile, outputFile):

        inputUrl = self._toFileUrl(inputFile)
        outputUrl = self._toFileUrl(outputFile)

        document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, self._toProperties(Hidden=True))
        #document.setPropertyValue("DocumentTitle", "saf" ) TODO: Check how this can be set and set doc update mode to  FULL_UPDATE

        if self._detectFamily(document) == FAMILY_TEXT:
            indexes = document.getDocumentIndexes()
            for i in range(0, indexes.getCount()):
                index = indexes.getByIndex(i)
                index.update()

            try:
                document.refresh()
            except AttributeError:
                pass

            indexes = document.getDocumentIndexes()
            for i in range(0, indexes.getCount()):
                index = indexes.getByIndex(i)
                index.update()

        outputExt = self._getFileExt(outputFile)
        filterName = self._filterName(document, outputExt)

        try:
            document.storeToURL(outputUrl, self._toProperties(FilterName=filterName))
        finally:
            document.close(True)

    def _filterName(self, document, outputExt):
        family = self._detectFamily(document)
        try:
            filterByFamily = FILTER_MAP[outputExt]
        except KeyError:
            raise Exception("unknown output format: '%s'" % outputExt)
        try:
            return filterByFamily[family]
        except KeyError:
            raise Exception("unsupported conversion: from '%s' to '%s'" % (family, outputExt))

    def _detectFamily(self, document):
        if document.supportsService("com.sun.star.text.GenericTextDocument"):
            # NOTE: a GenericTextDocument is either a TextDocument, a WebDocument, or a GlobalDocument
            # but this further distinction doesn't seem to matter for conversions
            return FAMILY_TEXT
        if document.supportsService("com.sun.star.sheet.SpreadsheetDocument"):
            return FAMILY_SPREADSHEET
        if document.supportsService("com.sun.star.presentation.PresentationDocument"):
            return FAMILY_PRESENTATION
        if document.supportsService("com.sun.star.drawing.DrawingDocument"):
            return FAMILY_DRAWING
        raise Exception("unknown document family: %s" % document)

    def _getFileExt(self, path):
        ext = splitext(path)[1]
        if ext is not None:
            return ext[1:].lower()

    def _toFileUrl(self, path):
        return uno.systemPathToFileUrl(abspath(path))

    def _toProperties(self, **args):
        props = []
        for key in args:
            prop = PropertyValue()
            prop.Name = key
            prop.Value = args[key]
            props.append(prop)
        return tuple(props)

if __name__ == "__main__":
    from sys import argv, exit

    if len(argv) < 3:
        print("USAGE: python %s <input-file> <output-file>" % argv[0])
        exit(255)
    if not isfile(argv[1]):
        print("no such input file: %s" % argv[1])
        exit(1)

    try:
        converter = DocumentConverter()
        converter.convert(argv[1], argv[2])
    except Exception as exception:
        print("ERROR! " + str(exception))
        exit(1)
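
For the question's Word XML to RTF case, the output file only needs an .rtf extension, which makes `_filterName` pick the "Rich Text Format" filter. Below is a small batch sketch along those lines; `rtf_target` and `convert_all` are hypothetical helpers, and the converter argument is assumed to be a `DocumentConverter` instance from the script above.

```python
from os.path import splitext


def rtf_target(input_path):
    """Return input_path with its extension replaced by .rtf."""
    return splitext(input_path)[0] + ".rtf"


def convert_all(converter, input_paths):
    """Convert each input file to RTF and return the output paths.

    converter is any object with a convert(inputFile, outputFile) method,
    such as the DocumentConverter class above.
    """
    outputs = []
    for input_path in input_paths:
        output_path = rtf_target(input_path)
        converter.convert(input_path, output_path)
        outputs.append(output_path)
    return outputs
```

With OpenOffice listening on port 8100, `convert_all(DocumentConverter(), ["report.xml"])` would write report.rtf next to the input.
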
maersu
A: 

From Java, you could use Docmosis to do the conversion and, optionally, populate the document. It sits on top of OpenOffice to perform the format conversions. If you install OpenOffice and manually load and save a few example documents, you'll get a feel for whether the format conversions are good enough for you. If so, you can use Docmosis to drive it from Java.

jowierun