Last week, I overheard one of my colleagues, who is responsible for testing our current project, complaining. She was yelling because she didn't want to write any more documents. I figured she hates documentation. Me too; I completely share her view, and I know lots of people agree with me. On top of that, I hate doing the same thing again and again. Routine, banal tasks are so boring that I always try to keep away from them. In my opinion, documents and documentation sit at the center of the boring activities in the software development process.
So I wondered whether documents could be written programmatically. Why not? Most of our documents are MS Word files, and there are several libraries for writing MS Word documents. I came across docx4j while looking through these libraries. Docx4j stands out for its support of the latest version of MS Word (currently 2007). To investigate the file structure, let's go into detail by creating a new docx file. Open MS Word, type your name and save the document. Then change the file extension from docx to zip and extract it with your favorite zip program. You will see a directory hierarchy like this:
Doc1
    _rels
        .rels
    docProps
        app.xml
        core.xml
    word
        _rels
            document.xml.rels
        theme
            theme1.xml
        document.xml
        fontTable.xml
        settings.xml
        styles.xml
        webSettings.xml
    [Content_Types].xml
As you can see, there are just directories and some XML files. Open word/document.xml with your favorite XML editor and find your name in it. As you will see, your name sits between <w:t> and </w:t> tags. We can say that the docx file format is a zipped directory structure containing XML files, and these XML files hold the information you typed and saved. So we can extract and parse these XML files to read the content of the document. To change the content, we can edit the files and package them again. Docx4j does this for us with the help of JAXB.
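If you would rather not rename and unzip the file by hand, the same inspection can be sketched with nothing but the JDK's zip support. This is only an illustration and not part of Docx4j: the class name DocxPeek, its method names and the default file name Doc1.docx are made up for this example. It assumes Java 11 or later and that the part is stored as UTF-8, which is what Word normally writes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for peeking into a docx file; not a Docx4j class.
public class DocxPeek {

    /** Returns the names of all entries (parts) stored in the docx archive. */
    public static List<String> listParts(Path docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Reads one part, for example "word/document.xml", as a string (UTF-8 assumed). */
    public static String readPart(Path docx, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                throw new IOException("No such part: " + partName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The file name is an assumption; pass your own path as the first argument.
        Path docx = Path.of(args.length > 0 ? args[0] : "Doc1.docx");
        listParts(docx).forEach(System.out::println);
        System.out.println(readPart(docx, "word/document.xml"));
    }
}
```

Run against the file you saved, it should print the same part names as the extracted directory tree, followed by the raw XML of the main document.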
To show how Docx4j works, we will create a sample project. Assume that you have a son and intend to throw a party to celebrate his birthday. To invite your friends, you design an invitation card template. In this template you leave some parts blank and fill them in later with custom information, because you don't want to prepare another template next year.
To make this happen, we will start by creating a project. I prefer to use Maven (http://maven.apache.org/) to create the project structure, because it's easy. I will skip the project creation steps, because they are not relevant here; you can refer to the Maven documentation if you want to see them. I'll give you a link to download the whole project at the end of this document.
Anyway, I created a project and a template for the invitation card. When creating the template, I marked the blank parts with named placeholders of the form ${name}. Let's create a class named Invitation. This class will be responsible for loading the template, replacing the placeholders with real values and saving the result as another Word document. The placeholder values (in a map), the template file path and the output directory are passed as constructor arguments:
public class Invitation {

    // Placeholder name -> value to substitute into the template
    private Map<String, String> templateProperties;
    private String templateFilePath;
    private String outputFolderPath;

    public Invitation(Map<String, String> templateProperties, String templateFilePath,
            String outputFolderPath) {
        this.templateProperties = templateProperties;
        this.templateFilePath = templateFilePath;
        this.outputFolderPath = outputFolderPath;
    }
    ...
First we should load the template:
WordprocessingMLPackage template = WordprocessingMLPackage
        .load(new File(templateFilePath));
This gives us the template object, a WordprocessingMLPackage instance. A WordprocessingMLPackage represents a docx package and holds the whole Word document.
Second, we should fill in the blanks by replacing the placeholders with the corresponding values from the map passed to the constructor. To do this, retrieve the "w:t" nodes via XPath and change their values:
private void replacePlaceholders(WordprocessingMLPackage targetDocument,
        String nameOfTheInvitedGuest) throws JAXBException {
    // Select every w:t (text) node in the main document part
    List<Object> texts = targetDocument.getMainDocumentPart()
            .getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
    for (Object obj : texts) {
        Text text = (Text) ((JAXBElement<?>) obj).getValue();
        String textValue = text.getValue();
        for (Object key : templateProperties.keySet()) {
            // Literal replacement of ${key}; unlike replaceAll, a '$' or '\'
            // in the value is not interpreted as a regex group reference
            textValue = textValue.replace("${" + key + "}",
                    (String) templateProperties.get(key));
        }
        text.setValue(textValue);
    }
}
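If you want to experiment with the idea before pulling in Docx4j, the same "select w:t nodes with XPath, then substitute" step can be sketched with the JDK's own DOM and XPath classes. Everything here is an assumption made for illustration: the class name PlaceholderDemo, its method names, the sample XML fragment and the "//w:t" expression are mine, and this is not how Docx4j works internally. One detail worth knowing: String.replaceAll treats both of its arguments as regex syntax, so a value such as "$5" would be read as a group reference. The sketch uses the literal String.replace instead. It assumes Java 9 or later.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical JDK-only demo of the placeholder step; not a Docx4j class.
public class PlaceholderDemo {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Replaces every ${key} literally, so '$' or '\' in a value is safe. */
    public static String fill(String text, Map<String, String> values) {
        String result = text;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        }
        return result;
    }

    /** Fills placeholders in all w:t nodes and returns their text, concatenated. */
    public static String fillTextNodes(String xml, Map<String, String> values)
            throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for the "w:" prefix to resolve
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });

        NodeList nodes = (NodeList) xpath.evaluate("//w:t", doc, XPathConstants.NODESET);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            String filled = fill(nodes.item(i).getTextContent(), values);
            nodes.item(i).setTextContent(filled); // update the DOM in place
            out.append(filled);
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p><w:r>"
                + "<w:t>Dear ${guest}, entry is ${price}.</w:t>"
                + "</w:r></w:p></w:body></w:document>";
        // Prints: Dear Ada, entry is $5.
        System.out.println(fillTextNodes(xml, Map.of("guest", "Ada", "price", "$5")));
    }
}
```

Placeholders with no matching key are simply left in the text, which makes a forgotten value easy to spot in the generated document.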
And finally, we should save the file:
template.save(new File(outputFolderPath + "/"
        + nameOfTheInvitedGuest + ".docx"));
That's all. We opened a file, traversed its content with XPath, changed some parts and saved it. It was a very simple example, intended to give you a feel for the docx file format and Docx4j's functionality. As you saw, a docx file is a package made up of parts. In Docx4j this package can be loaded as a WordprocessingMLPackage instance. If you dig deeper, you will see that the WordprocessingMLPackage class has a main document part and an optional glossary document part. We obtained the main document part, picked out its text nodes via XPath and changed them. We didn't touch other MS Word concepts such as numbering, styles and images. These are separate parts as well and can be reached through WordprocessingMLPackage. If you work with them, you will find that you need to know more about the Open XML formats. You can find more information at http://dev.plutext.org.
In conclusion, my first impression of Docx4j is that it is a useful library for processing MS Word documents. However, you should be familiar with the Open XML formats in order to do more complex work. It offers a flexible way to edit XML content with JAXB, although this flexibility comes with some complexity. As a last word, I want to congratulate the Plutext team, the creators of Docx4j.
Click here to download the sample project.