Issue
With Java 9 there was a change in the way javax.xml.transform.Transformer with OutputKeys.INDENT (https://docs.oracle.com/en/java/javase/11/docs/api/java.xml/javax/xml/transform/OutputKeys.html#INDENT) handles CDATA tags. In short, in Java 8 a tag named 'test' containing some character data would result in:
<test><![CDATA[data]]></test>
But with Java 9 the same input results in:
<test>
<![CDATA[data]]>
</test>
Which is not the same XML.
I understood (from a source no longer available) that for Java 9 there was a workaround using a DocumentBuilderFactory with setIgnoringElementContentWhitespace(true), but this no longer works for Java 11.
Does anyone know a way to deal with this in Java 11? I'm either looking for a way to prevent the extra newlines (but still be able to format my XML), or be able to ignore them when parsing the XML (preferably using SAX).
Unfortunately I don't know what the CDATA tag will actually contain in my application. It might begin or end with white space or newlines so I can't just strip them when reading the XML or actually setting the value in the resulting object.
Sample program to demonstrate the issue:
public static void main(String[] args) throws TransformerException, ParserConfigurationException, IOException, SAXException
{
    String data = "data";
    StreamSource source = new StreamSource(new StringReader("<foo><bar><![CDATA[" + data + "]]></bar></foo>"));
    StreamResult result = new StreamResult(new StringWriter());
    Transformer tform = TransformerFactory.newInstance().newTransformer();
    tform.setOutputProperty(OutputKeys.INDENT, "yes");
    tform.transform(source, result);
    String xml = result.getWriter().toString();
    System.out.println(xml); // I expect bar and CDATA to be on same line. This is true for Java 8, false for Java 11
    Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    String resultData = document.getElementsByTagName("bar")
            .item(0)
            .getTextContent();
    System.out.println(data.equals(resultData)); // True for Java 8, false for Java 11
}
EDIT: For future reference, I've submitted a bug report to Oracle, and this is fixed in Java 14: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8223291
Solution
As your code relies on unspecified behavior, extra explicit code seems better:
You want indentation like:
tform.setOutputProperty(OutputKeys.INDENT, "yes");
tform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
However, not for elements containing a CDATA section:
String xml = result.getWriter().toString();
// No indentation (whitespace) for elements with a CDATA section.
xml = xml.replaceAll("(?s)>\\s*(<\\!\\[CDATA\\[.*?]]>)\\s*</", ">$1</");
The regex uses:
(?s): DOTALL mode, so that . matches any character, including newline characters.
.*?: the shortest matching sequence, so as not to match "...]]>...]]>".
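As a self-contained illustration of the regex post-processing step, here is a minimal sketch. The class name CdataIndentFix and the method collapse are made up for this example, and the input string merely imitates what an indenting Transformer on Java 9 to 13 might emit. Whitespace inside the CDATA section is left alone because it is captured in group 1.

```java
import java.util.regex.Pattern;

public class CdataIndentFix {

    // (?s) lets '.' match newlines, so multi-line CDATA content is handled.
    // Group 1 captures the whole CDATA section, including its inner whitespace.
    private static final Pattern AROUND_CDATA =
            Pattern.compile("(?s)>\\s*(<!\\[CDATA\\[.*?]]>)\\s*</");

    /** Removes whitespace between a start tag, a CDATA section and the end tag. */
    static String collapse(String xml) {
        return AROUND_CDATA.matcher(xml).replaceAll(">$1</");
    }

    public static void main(String[] args) {
        // Hypothetical indented output, as produced on the affected Java versions.
        String indented = "<foo>\n    <bar>\n        <![CDATA[ line1\nline2 ]]>\n    </bar>\n</foo>";
        System.out.println(collapse(indented));
    }
}
```

Precompiling the Pattern is only a convenience when many documents are processed; String.replaceAll with the same expression behaves the same way.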
Alternatively: in a DOM tree (preserving CDATA sections) you can retrieve all CDATA sections via XPath, and remove their whitespace-only siblings through the parent element.
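The DOM variant could look roughly like the sketch below. It makes a few assumptions: the class and method names (CdataSiblingTrimmer, trim, readBar) are invented for illustration; the parser is the default, non-coalescing DocumentBuilderFactory, so CDATA sections show up as separate CDATASection nodes; and the tree is walked recursively instead of using XPath, since XPath 1.0 has no node test that selects only CDATA sections.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CdataSiblingTrimmer {

    /** Removes whitespace-only text siblings of every CDATA section below the given node. */
    static void trim(Node node) {
        boolean hasCdata = false;
        List<Node> blanks = new ArrayList<>();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                hasCdata = true;
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                blanks.add(child); // candidate: indentation added by the Transformer
            } else {
                trim(child); // descend into elements (no-op for other leaf nodes)
            }
        }
        // Only strip whitespace where the parent actually holds a CDATA section.
        if (hasCdata) {
            for (Node blank : blanks) {
                node.removeChild(blank);
            }
        }
    }

    /** Parses the XML, trims around CDATA sections and returns the text of the first 'bar' element. */
    static String readBar(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        trim(doc.getDocumentElement());
        return doc.getElementsByTagName("bar").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String indented = "<foo>\n    <bar>\n        <![CDATA[ data ]]>\n    </bar>\n</foo>";
        System.out.println("[" + readBar(indented) + "]");
    }
}
```

Leading and trailing whitespace inside the CDATA section survives, because only separate whitespace-only text nodes are removed. Note that an element mixing a CDATA section with other child elements would also lose the whitespace between those children.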
Answered By - Joop Eggen
Answer Checked By - Marie Seifert (JavaFixing Admin)