Sunday, January 9, 2011

Manipulating Microsoft Docx Files With Docx4j

Last week, I overheard a colleague who is responsible for testing our current project complaining loudly: she didn't want to write any more documents. I gathered that she hates documentation. So do I. I completely share her view, and I know lots of people feel the same. I also hate doing the same thing again and again. Routine, banal tasks are so boring that I always try to keep away from them. In my opinion, documents and documentation sit at the center of the boring activities in the software development process.

So I wondered whether documents could be written programmatically. Why not? Most of our documents are MS Word files, and there are several libraries for writing MS Word documents. I came across docx4j while looking through these libraries. Docx4j stands out for its support of the latest version of MS Word (currently 2007). To investigate the structure of the format, let's go into detail by creating a new docx file. Open MS Word, type your name and save the file. Then change the file extension from docx to zip and extract it with your favorite zip program. You will see a directory hierarchy like this:

Doc1
    _rels
        .rels
    docProps
        app.xml
        core.xml
    word
        _rels
            document.xml.rels
        theme
            theme1.xml
        document.xml
        fontTable.xml
        settings.xml
        styles.xml
        webSettings.xml
    [Content_Types].xml

As you can see, there are just directories and some XML files. Open word/document.xml with your favorite XML editor and find your name in it. You will see that your name sits between <w:t> and </w:t> tags. In other words, the docx file format is a zipped directory structure containing XML files, and these XML files hold the information you typed and saved. So we can extract and parse these XML files to read the content of the document. To change the content, we can edit the files and package them again. Docx4j does this for us with the help of JAXB.
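The zipped-directory structure described above can also be inspected from code with nothing but the JDK. The following is a minimal sketch, not part of docx4j or of the sample project: the class name DocxZipSketch and its methods are hypothetical, and the archive it reads is a tiny stand-in built in memory (not a document Word would open), so that the example runs without a real file. It also assumes every entry can be decoded as UTF-8 text, which holds for the XML parts but not for binary parts such as images.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper (not docx4j): treats a docx as the plain ZIP archive it is. */
class DocxZipSketch {

    /** Reads every file entry as "entry name -> UTF-8 text", in archive order. */
    static Map<String, String> readParts(InputStream docx) {
        Map<String, String> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, read);
                }
                parts.put(entry.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }

    /** Builds a two-entry stand-in archive in memory (assumption: illustrative only). */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(("<w:document xmlns:w=\"http://schemas.openxmlformats.org/"
                    + "wordprocessingml/2006/main\"><w:body><w:p><w:r>"
                    + "<w:t>Muammer</w:t></w:r></w:p></w:body></w:document>")
                    .getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> parts = readParts(new ByteArrayInputStream(sampleArchive()));
        parts.keySet().forEach(System.out::println);        // entry names
        System.out.println(parts.get("word/document.xml")); // the typed text lives here
    }
}
```

To try it on a real document, pass a FileInputStream for your own docx file to readParts instead of the in-memory sample.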

To show how Docx4j works, we will create a sample project. Assume that you have a son and intend to throw a party in order to celebrate his birthday. To invite your friends, you design an invitation card template. In this template you leave some parts blank, to be filled in with custom information, because you don't want to prepare another template next year.

To make this happen, we will start by creating a project. I prefer to use Maven (http://maven.apache.org/) to create the project structure, because it's easy. I will skip the project creation steps because they are not relevant here; you can refer to the Maven documentation if you want to see them. I'll give you a link to download the whole project at the end of this post.

Anyway, I created a project and a template for the invitation card. When creating the template, I left placeholders to mark its blank parts. Let's create a class named Invitation. This class is responsible for loading the template, replacing the placeholders with real values and saving the result as another Word document. The placeholder values (in a map), the template file path and the output directory are passed as constructor arguments:

import java.util.Map;

public class Invitation {

    // Placeholder names mapped to the values that replace them
    private Map<String, String> templateProperties;
    private String templateFilePath;
    private String outputFolderPath;

    public Invitation(Map<String, String> templateProperties,
            String templateFilePath, String outputFolderPath) {
        this.templateProperties = templateProperties;
        this.templateFilePath = templateFilePath;
        this.outputFolderPath = outputFolderPath;
    }
    ...

First we should load the template:

WordprocessingMLPackage template = WordprocessingMLPackage
        .load(new File(templateFilePath));

This gives us template, a WordprocessingMLPackage instance. WordprocessingMLPackage represents a docx package and holds the whole Word document.

Second, we should fill in the blanks by replacing the placeholders with the real values given in the map passed to the constructor. To do this, retrieve the "w:t" nodes via XPath and replace each ${key} placeholder in their values:

private void replacePlaceholders(WordprocessingMLPackage targetDocument,
        String nameOfTheInvitedGuest) throws JAXBException {

    // Select every w:t (text) element of the main document part
    List<Object> texts = targetDocument.getMainDocumentPart()
            .getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);

    for (Object obj : texts) {
        Text text = (Text) ((JAXBElement<?>) obj).getValue();

        String textValue = text.getValue();
        for (Object key : templateProperties.keySet()) {
            // Literal replace: unlike replaceAll, it does not treat '$' or
            // '\' in the replacement value as regex syntax
            textValue = textValue.replace("${" + key + "}",
                    (String) templateProperties.get(key));
        }

        text.setValue(textValue);
    }
}
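The replacement loop itself can be tried in isolation, without docx4j. The sketch below is a hypothetical stand-alone class (PlaceholderSketch and fill are names made up for illustration, and the guest and child names are sample data) that applies the same ${key} substitution to a plain string. It uses String.replace rather than a regex-based replaceAll, so a value such as "$10" is inserted literally instead of being read as a regex group reference. It also shows that a placeholder appearing several times in the same text is replaced everywhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-alone version of the ${key} substitution step. */
class PlaceholderSketch {

    /** Replaces every literal ${key} occurrence in text with its mapped value. */
    static String fill(String text, Map<String, String> values) {
        String result = text;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // String.replace substitutes ALL occurrences and takes both
            // arguments literally (no regex escaping needed)
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("guest", "Alice");
        values.put("child", "Deniz");

        System.out.println(fill(
                "Dear ${guest}, we hope ${guest} can join ${child}'s birthday party.",
                values));
        // prints: Dear Alice, we hope Alice can join Deniz's birthday party.
    }
}
```

One caveat when moving from strings back to real documents: Word sometimes splits a single placeholder across several runs, so that no individual "w:t" node contains the full ${key} text. A per-node replacement like this one will not match in that case.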

And finally, we should save the file:

template.save(new File(outputFolderPath + "/"
        + nameOfTheInvitedGuest + ".docx"));

That's all. We opened a file, traversed its content with XPath, changed some parts and saved it. It was a very simple example, intended to give you an idea of the docx file format and of Docx4j's functionality. As you saw, a docx file is a package made up of parts. In Docx4j this package can be loaded as a WordprocessingMLPackage instance. If you dig deeper, you will see that the WordprocessingMLPackage class has a main document part and an optional glossary document part. We obtained the main document part, picked up its text nodes via XPath and changed them. We didn't touch other MS Word concepts such as numbering, styles and images. These are separate parts as well and can be reached through WordprocessingMLPackage. If you work with these parts, you will find that you need to know more about the Open XML formats. You can find more information at http://dev.plutext.org.

In conclusion, my first impression is that Docx4j is a useful library for processing MS Word documents. However, you should be familiar with the Open XML formats in order to do more complex work. It offers a flexible way to edit XML content with JAXB, although this flexibility comes with some complexity. As a last word, I want to congratulate the Plutext team, the creators of Docx4j.
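To see what "picking up text nodes via XPath" means at the XML level, here is a minimal sketch that uses only the JDK's own XPath API instead of docx4j. The class TextNodeXPathSketch, its method and its sample XML are hypothetical. The expression "//w:t" is also an assumption, since the value of XPATH_TO_SELECT_TEXT_NODES is not shown in this post. The namespace URI bound to the "w" prefix is the WordprocessingML main namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical JDK-only illustration of selecting w:t elements with XPath. */
class TextNodeXPathSketch {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Stand-in for word/document.xml: one paragraph made of two runs. */
    static final String SAMPLE_XML =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
            + "<w:r><w:t xml:space=\"preserve\">Dear </w:t></w:r>"
            + "<w:r><w:t>${guest}</w:t></w:r>"
            + "</w:p></w:body></w:document>";

    /** Returns the text of every w:t element, in document order. */
    static List<String> textRuns(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the w: prefix to resolve
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return W_NS.equals(uri) ? "w" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return W_NS.equals(uri)
                            ? Collections.singletonList("w").iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate("//w:t", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textRuns(SAMPLE_XML));
    }
}
```

Docx4j spares you the parser and namespace setup and hands back JAXB objects instead of DOM nodes, but the selection idea is the same.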

Click here to download sample project.

8 comments:

tinne said...

Hello Muammer,

thank you for your sample project with a quick docx4j trial.

The sample project has a few maven issues and uncovers two weaknesses of docx4j in itself:

- docx4j depends on org.apache.commons:commons-vfs-patched:1.9.1 but only 1.1 and 1.1a are provided. An exclusion and extra inclusion help.
- the samples do not work with project build directories containing spaces, as the loading mechanism converts them into %20, which cannot be handled properly by nio on windows 7
- you need to configure maven-compiler-plugin to have source and target language level at least 1.5 to support generics.
- according to warnings, strange system default encodings are used by default. rather configure maven-compiler-plugin and maven-resource-plugin to some encoding, e.g. US-ASCII or UTF-8.

Thanks again for the five click demo!
Karsten

tinne said...

One thing I've forgotten:
- the tests need a simple sample log4j.xml in src/test/resources to pass the tests (blog does not allow to include sample)

- Also, the project configuration must not exclude src/test/resources from the build path

Best wishes,
Karsten

Muammer Yücel said...

Hi Karsten;

I have updated the sample project according to your valuable comments. Thank you very much for your advices.

Regards

Muammer

Nilanjan Raychaudhuri said...

Thank you for the blog post. You just saved few hours for me.

victor said...

Hello Muamer.

But what would happen if the same variable had to be placed several times within the template??

Thanks for your contribution, it's great!!!
Congratulations!!!

victor said...

Muamer Hello.

But, what if in the template would have to put the same variable several times?

Thank you for your contribution, this great!
Congratulations!

Amelia said...

Do you know how to convert docx to pdf? Or do you hire someone to do it?

Unknown said...

Good example but I have one question. What about situation when you need to fill your template twice by different values? So, you have one-page-template and after some manipulating you get two pages in one docx with different values...