I need a way to convert .doc or .docx files to .txt without installing anything. I also don't want to have to open Word manually to do this. It just needs to run automatically.

I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either.

Any suggestions?

+1  A: 

.doc files saved as WordprocessingML and .docx files both store their content as XML, so that XML can be parsed to retrieve the actual text of the document. You'll have to read their specifications to figure out which tags contain readable text.
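
If you'd rather look at a file before reading the whole specification, a rough way to find the text-carrying tags is to count which elements directly hold non-blank text. This is only a sketch using Python's standard library; the function name `text_bearing_tags` and the `SAMPLE` fragment (including its placeholder namespace URI) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_bearing_tags(xml_source):
    """Count, per element name, how many elements directly contain non-blank text."""
    counts = Counter()
    for elem in ET.fromstring(xml_source).iter():
        if elem.text and elem.text.strip():
            # ElementTree reports namespaced tags as "{uri}local"; keep only "local".
            local_name = elem.tag.rsplit("}", 1)[-1]
            counts[local_name] += 1
    return counts


# Hypothetical fragment shaped like a Word XML body (placeholder namespace).
SAMPLE = (
    '<w:body xmlns:w="http://example.invalid/w">'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>"
    "</w:body>"
)

if __name__ == "__main__":
    print(text_bearing_tags(SAMPLE))
```

Run against a real document's XML, the element names with the highest counts are the ones worth looking up in the specification first.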

AlbertoPL
+1  A: 

You can't do it in VBA if you don't want to start Word (or another Office application). Even if you meant VB, you'd still have to start a (hidden) instance of Word to do the processing.

Gary McGill
As long as it can be automated through a scheduled task on a Windows PC, it doesn't matter if Word is open. I'll reword the question.
CheeseConQueso
+2  A: 

Are you trying to do this without requiring any installed MS Office components? Even then VBA will require you to install the COM libraries to work.

How about the Perl Win32::OLE automation?

blispr
+3  A: 

I strongly recommend Aspose.Words if you can do Java or .NET. It can convert, without Word installed, between all major text file types.

Jim
+5  A: 

Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools > Macro > Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.

Here is an example using Win32::OLE:

#!/usr/bin/perl

use strict;
use warnings;

use File::Spec::Functions qw( catfile );

use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
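$Win32::OLE::Warn = 3;    # die on any OLE error

my $word = get_word();
$word->{Visible} = 0;     # keep the Word window hidden

my $doc = $word->{Documents}->Open(catfile $ENV{TEMP}, 'test.docx');

# wdFormatTextLineBreaks comes from the Word type library loaded above
$doc->SaveAs(
    catfile($ENV{TEMP}, 'test.txt'),
    wdFormatTextLineBreaks
);

$doc->Close(0);           # 0 = close without saving changes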

sub get_word {
    my $word;
    eval {
        $word = Win32::OLE->GetActiveObject('Word.Application');
    };

    die "$@\n" if $@;

    unless(defined $word) {
        $word = Win32::OLE->new('Word.Application', sub { $_[0]->Quit })
            or die "Oops, cannot start Word: ",
                   Win32::OLE->LastError, "\n";
    }
    return $word;
}
__END__
Sinan Ünür
+5  A: 

A simple Perl-only solution for docx:

  1. Use Archive::Zip to get the word/document.xml file from your docx file. (A docx is just a zipped archive.)

  2. Use XML::LibXML to parse it.

  3. Then use XML::LibXSLT to transform it into text or html format. Search the web to find a nice docx2txt.xsl file :)
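
For comparison, the same unzip-then-parse idea can be sketched with nothing but Python's standard library. This is not the Perl/XSLT route described above: instead of a stylesheet it simply joins the text runs of each paragraph, so tabs, line breaks inside a paragraph, headers, footnotes and so on are ignored. The names `docx_to_text` and `make_sample_docx` are made up for this sketch, and the sample archive contains only word/document.xml, so it is not a complete .docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the body text of a .docx, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter("{%s}p" % W_NS):
        # Each paragraph's visible text is spread over one or more <w:t> runs.
        runs = [t.text or "" for t in paragraph.iter("{%s}t" % W_NS)]
        lines.append("".join(runs))
    return "\n".join(lines)


def make_sample_docx():
    """Build a tiny in-memory archive for the demo (hypothetical test input)."""
    xml = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    ) % W_NS
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_to_text(make_sample_docx()))
```

For a real file you would call `docx_to_text("some.docx")` and write the result to a .txt file yourself.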

Cheers !

J.

jeje
+1  A: 

If you have some flavour of unix installed, you can use the 'strings' utility to find and extract all readable strings from the document. There will be some mess before and after the text you are looking for, but the results will be readable.
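
If no Unix tools are at hand, the core of what 'strings' does can be approximated in a few lines. This is only a rough sketch, not a reimplementation of the real utility: the name `printable_strings` is made up here, it looks for runs of printable ASCII bytes only, and text that the document stores as 16-bit characters may not show up at all.

```python
import re


def printable_strings(data, min_length=4):
    """Return runs of at least `min_length` printable ASCII bytes found in `data`."""
    # 0x20-0x7e is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


if __name__ == "__main__":
    # Made-up bytes standing in for a binary document.
    blob = b"\x00\x01Hello, world\x00ab\x02text\xff"
    for line in printable_strings(blob):
        print(line)
```

To try it on a real file, read it in binary mode first, for example `printable_strings(open(path, "rb").read())`.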

Ether
A: 

I need a way to convert .doc or .docx extensions to .txt without installing anything

for I in *.doc *.docx; do mv "$I" "$(echo "$I" | sed -E 's/\.docx?$/.txt/')"; done

Just joking.

You could use antiword for the older versions of Word documents, and try to parse the XML of the new ones.

fortran
hahah... funny stuff
CheeseConQueso
A: 

Hi, I have the same question, but I need the above code for my Linux system. I need to convert an OpenOffice word processor document to text, and I want to do it automatically, without opening that particular document. Can anyone help me with this?

Thanks in advance.

siva prasad
see my post above.
vladr
A: 

Note that you can also use OpenOffice to perform miscellaneous document, drawing, spreadsheet etc. conversions on both Windows and *nix platforms.

You can access OpenOffice programmatically (in a way analogous to COM on Windows) via UNO from a variety of languages for which a UNO binding exists, including from Perl via the OpenOffice::UNO module.

On the OpenOffice::UNO page you will also find a sample Perl scriptlet which opens a document. All you then need to do is export it to txt by using the document.storeToURL() method. See a Python example, which can be easily adapted to your Perl needs.

Cheers, V.

vladr