I have a set of Word documents that contain a lot of linked (non-embedded) images. The URLs that the images point to no longer exist. I would like to programmatically change the domain name of those URLs to something else. How can I go about doing this in Java or Python?
Perhaps the Microsoft Office Word binary file format specification could help you here, though someone who already did stuff like this might come up with a better answer.
This is the sort of thing that VBA is for:
Sub HlinkChanger()
    Dim oRange As Word.Range
    Dim oFld As Field
    Dim link As Variant
    With ActiveDocument
        ' turn plain-text URLs into hyperlink fields first
        .Range.AutoFormat
        For Each oRange In .StoryRanges
            ' walk every linked story (body, headers, footers, ...)
            Do
                For Each oFld In oRange.Fields
                    If oFld.Type = wdFieldHyperlink Then
                        For Each link In oFld.Result.Hyperlinks
                            ' the hyperlink is stored in link.Address;
                            ' swap the old domain for the new one
                            ' (both domains below are placeholders)
                            link.Address = Replace(link.Address, "old.example.com", "new.example.com")
                        Next link
                    End If
                Next oFld
                Set oRange = oRange.NextStoryRange
            Loop Until oRange Is Nothing
        Next oRange
    End With
End Sub
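Since the question asks for Python, here is a minimal sketch of just the URL rewriting step that the comments in the macro describe. It parses the URL instead of stripping a fixed number of characters, so it only touches links whose host really is the old domain. The function name `swap_domain` and the example.com domains are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_domain(url, old_domain, new_domain):
    """Return url with its host replaced, if the host equals old_domain.

    URLs on any other host are returned unchanged. An explicit port is
    kept; any user:password@ prefix is dropped (assumed not to occur).
    """
    parts = urlsplit(url)
    # urlsplit lowercases the hostname, so compare case-insensitively
    if parts.hostname != old_domain.lower():
        return url
    netloc = new_domain
    if parts.port is not None:
        netloc = f"{new_domain}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    print(swap_domain("http://old.example.com/img/a.png",
                      "old.example.com", "new.example.com"))
```

The result of `swap_domain` is what you would assign back to the link address, whichever API you end up using to reach it.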
You want to do this in Java or Python, so try OpenOffice. In OpenOffice, you can write macros in Java or Python.
I haven't checked the details, but there should be a way to change the image URLs from such a macro.
The VBA answer is the closest, because this is best done using the Microsoft Word COM API. However, you can use that API just as well from Python. I've used it myself to import data into a database from hundreds of forms that were Word documents.
This article explains the basics. Note that even though it creates a class wrapper for the WordDocument COM object, you don't need to do this if you don't want to. You can just access the COM API directly.
For documentation of the WordDocument COM API, open a Word document, press Alt+F11 to open the VBA editor, and then press F2 to view the Object Browser. This lets you browse through all of the objects and the methods that they provide. An introduction to Python and the COM object model is found here.
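To give an idea of what direct COM access looks like from Python, here is a rough sketch rather than a tested solution. It assumes Windows with Word and the pywin32 package installed, and it assumes the linked images are stored as INCLUDEPICTURE fields (check with Alt+F9 in Word; if yours show up as linked inline shapes instead, you would look at `InlineShapes` and `LinkFormat.SourceFullName` in the Object Browser). The function names and the example.com domains are made up.

```python
WD_FIELD_INCLUDE_PICTURE = 67  # Word constant wdFieldIncludePicture
WD_DO_NOT_SAVE_CHANGES = 0     # Word constant wdDoNotSaveChanges


def retarget_picture_fields(doc, old_domain, new_domain):
    """Rewrite the domain inside every INCLUDEPICTURE field code.

    doc is a Word Document COM object. Only doc.Fields (the main text
    story) is visited; headers and footers would need StoryRanges.
    Returns the number of fields that were changed.
    """
    changed = 0
    for field in doc.Fields:
        if field.Type != WD_FIELD_INCLUDE_PICTURE:
            continue
        code = field.Code.Text
        if old_domain in code:
            field.Code.Text = code.replace(old_domain, new_domain)
            changed += 1
    return changed


def process_file(path, old_domain, new_domain):
    """Open one document in Word, fix its picture fields, save it.

    path must be an absolute path, since Word resolves relative
    paths against its own working directory.
    """
    import win32com.client  # from pywin32; Windows only

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(path)
        try:
            changed = retarget_picture_fields(doc, old_domain, new_domain)
            if changed:
                doc.Save()
        finally:
            doc.Close(WD_DO_NOT_SAVE_CHANGES)
        return changed
    finally:
        word.Quit()
```

You would call something like `process_file(r"C:\docs\report.doc", "old.example.com", "new.example.com")` for each file. Work on copies of the documents first, since the changes are saved in place.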