Using Word for command line conversion of DOC files to XML
Or to RTF, or to whatever.
I've written before about using OpenOffice to convert Microsoft Office files to OpenOffice files (and hence XML) with a shell prompt command that starts up OpenOffice with the MS Office file, does a Save As, and then quits OpenOffice. Because it can be done from the command line, this makes conversion of multiple files with a batch file or shell script much easier.
I recently had to do the same thing with Word to convert Word files to MS XML, and it turned out to be similar: you write a macro that does the SaveAs and then quits, and you start up Word from the command line naming the file to convert and the macro to do the conversion.
The macro I wrote yesterday could use some refinement, but it works:
Sub SaveAsXML()
    NewFilename = (Replace(ActiveDocument.FullName, ".doc", ".xml"))
    ActiveDocument.SaveAs FileName:=NewFilename, FileFormat:=wdFormatXML
    Application.Quit
End Sub
(It seems like I have to write a bit of VB code about every three years, so with any luck that's it until 2010. I was sorry to hear that in my nephew's first year at the University of Kansas, the "Intro to Programming" course uses VB. As I said to my sister, "But you're not living in a Seattle suburb anymore!")
If you want this to save as something other than XML, see the other options for the FileFormat parameter.
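One refinement the macro could probably use: VBA's Replace substitutes every occurrence of ".doc" in the full path, so a folder name containing ".doc" would be altered too, and with the default binary comparison an uppercase .DOC extension is not matched at all. The Python sketch below only illustrates the safer rule (swap the final extension and nothing else); it is not a replacement for the macro. The names target_name, naive_name, and FORMATS are made up for this example, while the numbers are the values of Word's WdSaveFormat constants.

```python
from pathlib import PureWindowsPath

# Hypothetical lookup table for this sketch: output extension -> WdSaveFormat value.
FORMATS = {
    ".txt": 2,   # wdFormatText
    ".rtf": 6,   # wdFormatRTF
    ".xml": 11,  # wdFormatXML
}


def target_name(doc_path, ext=".xml"):
    """Return doc_path with only its final extension swapped for ext."""
    if ext not in FORMATS:
        raise ValueError(f"no FileFormat known for {ext!r}")
    return str(PureWindowsPath(doc_path).with_suffix(ext))


def naive_name(doc_path):
    """Mimic the macro's Replace call: every ".doc" in the path is replaced."""
    return doc_path.replace(".doc", ".xml")
```

For C:\my.docs\report.doc, naive_name gives C:\my.xmls\report.xml, while target_name gives C:\my.docs\report.xml.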
My word2xml.bat batch file to tell Word to start up with a given file and run the macro looks like this:
"C:\Program Files\Microsoft Office\OFFICE11\winword" %1 /mSaveAsXML
There are other command line options for winword.exe besides /m, but none looked very interesting to me.
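For a whole folder of files, the same command can be issued once per document from a script. The Python sketch below assumes the winword path from the batch file above (with the .exe extension spelled out) and that the SaveAsXML macro is stored where Word can find it at startup, such as the default template. build_command and convert_folder are names invented for this example.

```python
import subprocess
from pathlib import Path

# Assumed install location, taken from the batch file; adjust for your machine.
WINWORD = r"C:\Program Files\Microsoft Office\OFFICE11\winword.exe"


def build_command(doc_path, macro="SaveAsXML"):
    """Build the argument list: winword, the document, and the /m macro switch."""
    return [WINWORD, str(doc_path), f"/m{macro}"]


def convert_folder(folder):
    """Start Word once per .doc file; each run ends when the macro quits Word."""
    for doc in sorted(Path(folder).glob("*.doc")):
        # run() waits for winword to exit before moving on to the next file.
        subprocess.run(build_command(doc.resolve()))
```

Calling convert_folder on a directory converts its .doc files one after another, with a full Word startup and shutdown for each.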
As with my command line trick for converting MS Office files to OpenOffice files, this technique can get filed with quick and dirty Perl scripts. If you have a batch of files that need a one-time conversion some afternoon, it's great, but it's not really fast. If you're building a production system that needs to perform this conversion every day, there are some other options that will be more complex to set up but will run more quickly because they won't require starting up and shutting down the word processor for every document.
As far as what to do with the Word XML files once I have them, well, don't get me started...
Comments
>The macro I wrote yesterday could use some refinement,
>but it works:
thank you for the code
>As far as what to do with the Word XML files once I
>have them, well, don't get me started...
when will you give us another Word XML review? i found the old ones very insightful
greetings
marcelo
Posted by: marcelo | September 14, 2007 11:27 AM
Thanks Marco!
>when will you give us another Word XML review?
Let me put it this way: I look hard at certain technology for fun, and at other technology only because it's related to something I'm being paid to do.
Word XML does not fall in the "fun" category.
Posted by: Bob DuCharme | September 14, 2007 11:45 AM
Hi Bob,
Maybe it's because I already invested about a year of my life into WordML because I was paid to do it (writing for O'Reilly), but I think processing WordML can be fun too. It certainly gives me a lot of tough, real-world problems to try out XSLT 2.0's more advanced facilities on. WordML's format in itself isn't terribly nice in general, and I touched on some of its idiosyncrasies in the Office 2003 XML book, but it does have a certain consistency to it. Also, I've found XML-config-file-driven invocations of xsl:for-each-group to be a very powerful, generic way of reconstituting the hierarchy that's implicit in the relationship of flat lists of paragraph styles.
Evan
Posted by: Evan Lenz | September 14, 2007 10:26 PM
"...there are some other options that will be more complex to set up but will run more quickly because they won't require starting up and shutting down the word processor for every document"
TEASE!!! What are these other options? I need to do this on hundreds of files on a regular basis. If you can't write an article on how to accomplish this efficiently, could you drop some breadcrumbs to help me research it further?
Also, many thanks for writing this article; it gets me one BIG step closer to a solution.
Lynwood Hines
Posted by: Lynwood Hines | October 5, 2007 9:17 AM
Lynwood,
These other methods would involve telling an existing running process to open a file, save it as XML, close it, and then move on to the next file. If you were going to have OpenOffice do this, http://api.openoffice.org would be a place to start, but I would look through and ask questions on the appropriate OpenOffice mailing list before I started serious coding.
To do this with Word, I'm sure there's some VB or C# way to tell a running Word instance to do this, either via COM from outside the instance, or with some macro from inside of Word. For example, the macro might look in a certain directory and convert everything it finds there. As with OpenOffice, I'm sure there's a mailing list out there where you can find people to give you some more specific tips. I haven't done it myself, but this is what would guide my research.
Posted by: Bob DuCharme | October 5, 2007 10:27 AM
Bob,
Thank you for the quick response. I'll look into the OA api and COM communication approaches first. If I come up with a useful recipe I'll post it back here.
LH
Posted by: Lynwood Hines | October 8, 2007 9:57 AM
I figured out how to use OLE to tell Word to convert documents to text format. This approach would only require a minor tweak to generate XML, RTF, or any other supported format.
The example below is written in Perl and uses the Win32::OLE Perl package:
use Win32::OLE qw(in with);
use Win32::OLE::Const;
use Win32::OLE::Const 'Microsoft Word';
# Instantiate our very own MS Word process:
$wordApp = Win32::OLE->new('Word.Application', 'Quit');
$wordApp->{Visible}= 1; # Set to 0 to hide
# Load "foo.doc" into our instance of MS Word. Terminate with error message if something goes awry:
$wordApp->Documents->Open("foo.doc")
    or die("Unable to open Word document: ", Win32::OLE->LastError());
# Save the file as a text file. Delete the destination text file first so we don't have to contend with an overwrite warning window in Word:
unlink "foo.txt" if (-e "foo.txt");
$wordApp->ActiveDocument->SaveAs({
    FileName => "foo.txt",
    FileFormat => wdFormatDOSTextLineBreaks
});
# Close document, leave Word running to speed future conversions:
$wordApp->ActiveDocument->Close();
# When you are finished doing conversions, kill the word instance:
$wordApp->Quit;
Posted by: Lynwood Hines | October 13, 2007 3:19 PM
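For comparison, the same keep-Word-running idea can be sketched in Python, looping over a whole directory with a single Word instance. This assumes the third-party pywin32 package (win32com.client) on a Windows machine with Word installed. plan_jobs and convert_all are hypothetical helper names, and 11 is the value of Word's wdFormatXML constant. Absolute paths are used because Word, as a separate process, may resolve relative names against its own current directory rather than the script's.

```python
from pathlib import Path

WD_FORMAT_XML = 11  # value of Word's wdFormatXML constant


def plan_jobs(folder, ext=".xml"):
    """Return (source, target) absolute path pairs for each .doc file in folder."""
    folder = Path(folder).resolve()
    return [(src, src.with_suffix(ext)) for src in sorted(folder.glob("*.doc"))]


def convert_all(folder, file_format=WD_FORMAT_XML, ext=".xml"):
    """Convert every .doc in folder using one Word instance (Windows only)."""
    import win32com.client  # pywin32; imported here so plan_jobs works without it

    word = win32com.client.DispatchEx("Word.Application")  # start a new Word process
    word.Visible = False
    try:
        for src, dst in plan_jobs(folder, ext):
            if dst.exists():
                dst.unlink()  # sidestep an overwrite prompt, as the Perl version does
            doc = word.Documents.Open(str(src))
            doc.SaveAs(str(dst), FileFormat=file_format)
            doc.Close(False)  # False: do not save changes back to the document
    finally:
        word.Quit()  # shut Word down once, after the last document
```

Passing file_format=6 with ext=".rtf" should produce RTF instead, since 6 is the value of wdFormatRTF.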
Looks great, thanks!
Posted by: Bob DuCharme | October 13, 2007 3:36 PM