Apache Ant/Cleaning up HTML
Motivation
We want to clean up HTML that is not well formed. We will use the Apache Tika command-line application, run from Ant, to convert the dirty HTML into well-formed XHTML.
Sample Ant File
<project name="tika tests" default="extract-xhtml-from-html">
    <description>Sample invocations of Apache Tika</description>
    <!-- Directory containing the Tika application jar -->
    <property name="lib.dir" value="../lib"/>
    <property name="input-dirty-html-file" value="input-dirty.html"/>
    <property name="output-clean-xhtml-file" value="output-clean.xhtml"/>
    <target name="extract-xhtml-from-html">
        <echo message="Cleaning up dirty HTML file: ${input-dirty-html-file} to ${output-clean-xhtml-file}"/>
        <!-- Run Tika in a forked JVM: the input file is fed to standard input
             and standard output is written to the output file -->
        <java jar="${lib.dir}/tika-app-1.3.jar" fork="true" failonerror="true"
              maxmemory="128m" input="${input-dirty-html-file}" output="${output-clean-xhtml-file}">
            <!-- -x tells Tika to output XHTML -->
            <arg value="-x"/>
        </java>
    </target>
</project>
Sample Input
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Dirty HTML</title>
</head>
<body>
<p><b>test</b></p>
<p><b>test<b></p>
<p>test<br/>test</p>
<p>test<br>test<br>test</p>
<p>This is <B>bold, <I>bold italic, </b>italic, </i>normal text</p>
</body>
</html>
Sample Output
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Encoding" content="ISO-8859-1"/>
<meta name="Content-Type" content="application/xhtml+xml"/>
<meta name="dc:title" content="Dirty HTML"/>
<title>Dirty HTML</title>
</head>
<body>
<p>test</p>
<p>test</p>
<p>test
test</p>
<p>test
test
test</p>
<p>This is bold, bold italic, italic, normal text</p>
</body></html>
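To confirm that the result really is well formed, one option is to add a check to the build. The following is a minimal sketch and not part of the original build file: it assumes that Ant's optional xmlvalidate task is available in your installation, and the project and target names (tika output check, check-well-formed) are made up for illustration. With lenient="true" the task only checks that the file is well formed and does not validate it against a DTD.

```xml
<project name="tika output check" default="check-well-formed">
    <!-- Same output file name as in the sample Ant file; adjust as needed -->
    <property name="output-clean-xhtml-file" value="output-clean.xhtml"/>
    <target name="check-well-formed">
        <!-- lenient="true": check well-formedness only, no DTD validation -->
        <xmlvalidate file="${output-clean-xhtml-file}" lenient="true" failonerror="true"/>
        <echo message="${output-clean-xhtml-file} is well-formed XML"/>
    </target>
</project>
```

If the file is not well formed, the build should stop with an error at the xmlvalidate step, so the final echo is only reached for well-formed output.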