I have N Word documents (Office 2003) that I want to merge, in a chosen order, into a single Word document. How do I go about doing this in Ruby? Thanks!

Only the documents themselves were created in MS Office. I do not use Windows and would prefer a non-Windows solution.

EDIT: Would this be easier if the documents were ODT files rather than DOC files?

+2  A: 

There is a whole series of really good articles about Word and Ruby at http://rubyonwindows.blogspot.com/search/label/word. Word files are really complicated, at least before 2007, so you're better off automating Word to do it.

stimms
Automate how? Can you explain? Also, mine is a Linux server, if it matters.
Vijay Dev
The blog posts are quite helpful for teaching you how to do the automation, but since they automate Word they will only work on Windows, or maybe under Wine. You would probably do better to look at automating OpenOffice.
stimms
+3  A: 

The only non-Windows solution that I know of is the Ruby bindings for POI. After that, the code would be really similar to this .NET code: Merge Word Documents As Pages Of A Single Document Using VB.NET. The key code you'll want is Selection.InsertFile, called for as many documents as you need in the order you choose.

For ODT document merges, see this thread: http://cpanforum.com/threads/9938

Otaku
A: 

Understand, almost any answer to this question will depend on the constraints of the doc files you are using...

That being said, in my mind the first option if you are going to do this would be to convert them to a more easily parsed format. RTF is a great example, and if you can get them into this format, the RTF Pocket Guide from O'Reilly is a GREAT resource for understanding the structure of the files. Converting the files is pretty simple if you can install AbiWord on the Linux machine. From a command line, you'd just run:

abiword --to=rtf some_file_name.doc

Of course, in Ruby you'd just wrap these commands.
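
A minimal sketch of such a wrapper, assuming the abiword binary is installed and on the PATH. The helper names abiword_command and convert_with_abiword are made up for illustration; only the --to= option comes from the command above.

```ruby
# Hypothetical helper: builds the argument list for the abiword call shown above.
# Passing the arguments as an array (rather than one string) bypasses the shell,
# so file names containing spaces need no quoting.
def abiword_command(input_path, format)
  ["abiword", "--to=#{format}", input_path]
end

# Hypothetical helper: runs the conversion.
# Kernel#system returns true on a zero exit status, false on a non-zero one,
# and nil if the command could not be started (e.g. abiword is not installed).
def convert_with_abiword(input_path, format = "rtf")
  system(*abiword_command(input_path, format))
end

# Usage (needs abiword installed):
#   ok = convert_with_abiword("some_file_name.doc")
#   warn "conversion failed" unless ok
```

Checking the return value matters here, since a failed conversion would otherwise only show up later as a missing file during the merge.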

It's the merging that is more complicated, and it will depend on your files. You'll have to make some programmer decisions about whether you're going to combine the stylesheets in each individual doc, the font tables, and so on. The content just sits in the middle of the RTF file, but it's all the semantic and style data that you'll have to make choices about. There is no 'one way' here, simply because it depends on what you want on the other side. Here is where the RTF Pocket Guide is a great help: basically you'll want to use it to understand the structure of your RTF files, and decide what you do and don't want.

Otherwise, if you just want the content with NONE of the semantics, you could always convert them to txt files, then concat them. The command is very similar:

abiword --to=txt some_file_name.doc

This is dead simple: it will just spit out the text, and you can concat it and be done with it. But again, you'll lose ALL formatting of any sort.
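
For the plain-text route, the concatenation step could look roughly like this in Ruby. It is only a sketch: concat_text_files is a made-up helper name, and it assumes the .txt files have already been produced by the conversion step and are passed in the order you want them merged.

```ruby
# Hypothetical helper: concatenates already-converted text files, in the given
# order, into output_path. A separator (one newline by default) is written
# between documents so the last line of one does not run into the next.
def concat_text_files(paths, output_path, separator = "\n")
  File.open(output_path, "w") do |out|
    paths.each_with_index do |path, index|
      out.write(separator) if index > 0
      out.write(File.read(path))
    end
  end
  output_path
end

# Usage (the file names are placeholders):
#   concat_text_files(["first.txt", "second.txt", "third.txt"], "merged.txt")
```

The order of the array is the order of the merged document, so sorting or otherwise arranging the list is where the "in some order" requirement gets decided.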

jasonpgignac