Package org.htmlcleaner
Class HtmlCleaner
java.lang.Object
  org.htmlcleaner.HtmlCleaner
public class HtmlCleaner extends java.lang.Object
Main HtmlCleaner class. It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the list of produced tokens, and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file, or any output stream.
Typical usage is the following:
  // create an instance of HtmlCleaner
  HtmlCleaner cleaner = new HtmlCleaner();

  // take default cleaner properties
  CleanerProperties props = cleaner.getProperties();

  // customize cleaner's behaviour with property setters
  props.setXXX(...);

  // Clean HTML taken from simple string, file, URL, input stream,
  // input source or reader. Result is root node of created
  // tree-like structure. Single cleaner instance may be safely used
  // multiple times.
  TagNode node = cleaner.clean(...);

  // optionally find parts of the DOM or modify some nodes
  TagNode[] myNodes = node.getElementsByXXX(...);
  // and/or
  Object[] myNodes = node.evaluateXPath(xPathExpression);
  // and/or
  aNode.removeFromTree();
  // and/or
  aNode.addAttribute(attName, attValue);
  // and/or
  aNode.removeAttribute(attName, attValue);
  // and/or
  cleaner.setInnerHtml(aNode, htmlContent);
  // and/or do some other tree manipulation/traversal

  // serialize a node to a file, output stream, DOM, JDom...
  new XXXSerializer(props).writeXmlXXX(aNode, ...);
  myJDom = new JDomSerializer(props, true).createJDom(aNode);
  myDom = new DomSerializer(props, true).createDOM(aNode);
-
-
Nested Class Summary
Nested Classes
Modifier and Type   Class   Description
private class HtmlCleaner.CleanTimeValues
private class HtmlCleaner.OpenTags
    Class that contains information and methods for managing the list of open but unhandled tags.
private class HtmlCleaner.TagPos
    Contains information about a single open tag.
-
Field Summary
Fields
Modifier and Type   Field   Description
static java.lang.String DEFAULT_CHARSET
private CleanerProperties properties
private ITagInfoProvider tagInfoProvider
private CleanerTransformations transformations
-
Constructor Summary
Constructors
Constructor   Description
HtmlCleaner()
    Constructor - creates cleaner instance with default tag info provider and default properties.
HtmlCleaner(CleanerProperties properties)
    Constructor - creates the instance with default tag info provider and specified properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider)
    Constructor - creates the instance with specified tag info provider and default properties.
HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
    Constructor - creates the instance with specified tag info provider and specified properties.
-
Method Summary
Modifier and Type   Method   Description
private void addAttributesToTag(TagNode tag, java.util.Map attributes)
    Adds attributes from the specified map to the specified tag.
private void addPossibleHeadCandidate(TagInfo tagInfo, TagNode tagNode, HtmlCleaner.CleanTimeValues cleanTimeValues)
    Checks if the specified tag with the specified info is a candidate for moving to the head section.
private void calculateRootNode(HtmlCleaner.CleanTimeValues cleanTimeValues)
    Assigns the root node to an internal variable.
TagNode clean(java.io.File file)
TagNode clean(java.io.File file, java.lang.String charset)
TagNode clean(java.io.InputStream in)
TagNode clean(java.io.InputStream in, java.lang.String charset)
TagNode clean(java.io.Reader reader)
TagNode clean(java.io.Reader reader, HtmlCleaner.CleanTimeValues cleanTimeValues)
    Basic version of the cleaning call.
TagNode clean(java.lang.String htmlContent)
TagNode clean(java.net.URL url)
    Creates an instance from the content downloaded from the specified URL.
TagNode clean(java.net.URL url, java.lang.String charset)
private void closeAll(java.util.List<BaseToken> nodeList, HtmlCleaner.CleanTimeValues cleanTimeValues)
    Closes all unclosed tags if there are any.
private java.util.List closeSnippet(java.util.List nodeList, HtmlCleaner.TagPos tagPos, java.lang.Object toNode, HtmlCleaner.CleanTimeValues cleanTimeValues)
private void createDocumentNodes(java.util.List listNodes, HtmlCleaner.CleanTimeValues cleanTimeValues)
private TagNode createTagNode(java.lang.String name, HtmlCleaner.CleanTimeValues cleanTimeValues)
private TagNode createTagNode(TagNode startTagToken)
java.lang.String getInnerHtml(TagNode node)
    For the specified node, returns its content as a string.
CleanerProperties getProperties()
ITagInfoProvider getTagInfoProvider()
CleanerTransformations getTransformations()
private boolean isAllowedInLastOpenTag(BaseToken token, HtmlCleaner.CleanTimeValues cleanTimeValues)
private boolean isFatalTagSatisfied(TagInfo tag, HtmlCleaner.CleanTimeValues cleanTimeValues)
    Checks if the open fatal tag is missing, if there is a fatal tag for the specified tag.
private boolean isStartToken(java.lang.Object o)
private TagNode makeTagNodeCopy(TagNode tagNode, HtmlCleaner.CleanTimeValues cleanTimeValues)
(package private) void makeTree(java.util.List<BaseToken> nodeList, java.util.ListIterator<BaseToken> nodeIterator, HtmlCleaner.CleanTimeValues cleanTimeValues)
private boolean mustAddRequiredParent(TagInfo tag, HtmlCleaner.CleanTimeValues cleanTimeValues)
    Checks if the specified tag requires a parent tag, but that parent tag is missing in the appropriate context.
private void saveToLastOpenTag(java.util.List nodeList, BaseToken tokenToAdd, HtmlCleaner.CleanTimeValues cleanTimeValues)
void setInnerHtml(TagNode node, java.lang.String content)
    For the specified tag node, defines its HTML content.
private void setPruneTags(java.lang.String pruneTags, HtmlCleaner.CleanTimeValues cleanTimeValues)
void setTransformations(CleanerTransformations transformations)
    Sets transformations for this cleaner instance.
-
-
-
Field Detail
-
DEFAULT_CHARSET
public static final java.lang.String DEFAULT_CHARSET
-
properties
private CleanerProperties properties
-
tagInfoProvider
private ITagInfoProvider tagInfoProvider
-
transformations
private CleanerTransformations transformations
-
-
Constructor Detail
-
HtmlCleaner
public HtmlCleaner()
Constructor - creates cleaner instance with default tag info provider and default properties.
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider)
Constructor - creates the instance with specified tag info provider and default properties.
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
-
HtmlCleaner
public HtmlCleaner(CleanerProperties properties)
Constructor - creates the instance with default tag info provider and specified properties.
- Parameters:
properties - Properties used during parsing and serializing
-
HtmlCleaner
public HtmlCleaner(ITagInfoProvider tagInfoProvider, CleanerProperties properties)
Constructor - creates the instance with specified tag info provider and specified properties.
- Parameters:
tagInfoProvider - Provider for tag filtering and balancing
properties - Properties used during parsing and serializing
-
-
Method Detail
-
clean
public TagNode clean(java.lang.String htmlContent)
-
clean
public TagNode clean(java.io.File file, java.lang.String charset) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.File file) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.net.URL url, java.lang.String charset) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.net.URL url) throws java.io.IOException
Creates an instance from the content downloaded from the specified URL. The HTML encoding is resolved by trying the following in order:
1. Reading the Content-Type response header.
2. Analyzing META tags at the beginning of the HTML.
3. Using the platform's default charset.
- Parameters:
url
- Throws:
java.io.IOException
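The encoding lookup order described for clean(java.net.URL) can be illustrated with a small standalone sketch. It is not HtmlCleaner's implementation: the class CharsetResolutionSketch, its method names, and the regular expressions are assumptions made for illustration, and only the three-step fallback order comes from the description above.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner: mirrors the documented fallback order only.
final class CharsetResolutionSketch {
    // Matches charset=NAME, with or without quotes around NAME.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches a single <meta ...> tag.
    private static final Pattern META =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Step 1: look for a charset parameter in a Content-Type header value.
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Step 2: look for a charset declaration in META tags near the start of the HTML.
    static String fromMetaTags(String htmlPrefix) {
        if (htmlPrefix == null) {
            return null;
        }
        Matcher meta = META.matcher(htmlPrefix);
        while (meta.find()) {
            String charset = fromContentType(meta.group());
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    // Steps 1 to 3 in order; the platform default is the last resort.
    static String resolve(String contentTypeHeader, String htmlPrefix) {
        String charset = fromContentType(contentTypeHeader);
        if (charset == null) {
            charset = fromMetaTags(htmlPrefix);
        }
        if (charset == null) {
            charset = Charset.defaultCharset().name();
        }
        return charset;
    }
}
```

In this sketch a charset named in the response header takes precedence over one declared in the markup, which matches the order listed above. When the encoding is known in advance, the clean(java.net.URL url, java.lang.String charset) overload accepts it directly.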
-
clean
public TagNode clean(java.io.InputStream in, java.lang.String charset) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.InputStream in) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.Reader reader) throws java.io.IOException
- Throws:
java.io.IOException
-
clean
public TagNode clean(java.io.Reader reader, HtmlCleaner.CleanTimeValues cleanTimeValues) throws java.io.IOException
Basic version of the cleaning call.
- Parameters:
reader
- Returns:
- A TagNode instance that is the root of the XML tree.
- Throws:
java.io.IOException
-
createTagNode
private TagNode createTagNode(java.lang.String name, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
makeTagNodeCopy
private TagNode makeTagNodeCopy(TagNode tagNode, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
calculateRootNode
private void calculateRootNode(HtmlCleaner.CleanTimeValues cleanTimeValues)
Assigns the root node to an internal variable. The root node of the result depends on the "omitHtmlEnvelope" parameter: if it is set, the first child of the body becomes the root node; otherwise html is the root node.
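The root selection rule for calculateRootNode can be sketched with a minimal tree type. The Node class and the chooseRoot method below are hypothetical stand-ins, not TagNode or the private method itself. Falling back to html when body is missing or empty is also an assumption, since the description does not cover that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the "omitHtmlEnvelope" rule; not HtmlCleaner code.
final class RootNodeSketch {
    // Minimal stand-in for a tag node: a name and an ordered list of children.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        // Returns the first direct child with the given name, or null if there is none.
        Node child(String childName) {
            for (Node c : children) {
                if (c.name.equals(childName)) {
                    return c;
                }
            }
            return null;
        }
    }

    // If omitHtmlEnvelope is set, the first child of body is the root; otherwise html is.
    static Node chooseRoot(Node html, boolean omitHtmlEnvelope) {
        if (omitHtmlEnvelope) {
            Node body = html.child("body");
            if (body != null && !body.children.isEmpty()) {
                return body.children.get(0);
            }
            // Assumption: with no usable body content, keep html as the root.
        }
        return html;
    }
}
```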
-
addAttributesToTag
private void addAttributesToTag(TagNode tag, java.util.Map attributes)
Adds attributes from the specified map to the specified tag. If an attribute already exists, it is preserved.
- Parameters:
tag
attributes
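The "existing attributes are preserved" rule for addAttributesToTag amounts to a merge that never overwrites. The sketch below shows that behaviour on plain maps. AttributeMergeSketch and addMissing are hypothetical names, and the real method works on a TagNode rather than on a map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of a non-overwriting attribute merge; not HtmlCleaner code.
final class AttributeMergeSketch {
    // Returns a copy of tagAttributes with entries from extra added only for absent names.
    static Map<String, String> addMissing(Map<String, String> tagAttributes,
                                          Map<String, String> extra) {
        Map<String, String> merged = new LinkedHashMap<>(tagAttributes);
        for (Map.Entry<String, String> entry : extra.entrySet()) {
            if (!merged.containsKey(entry.getKey())) {
                merged.put(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }
}
```

For example, merging {class=new, id=main} into a tag that already has class=old keeps class=old and adds id=main.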
-
isFatalTagSatisfied
private boolean isFatalTagSatisfied(TagInfo tag, HtmlCleaner.CleanTimeValues cleanTimeValues)
Checks if the open fatal tag is missing, if there is a fatal tag for the specified tag.
- Parameters:
tag
-
mustAddRequiredParent
private boolean mustAddRequiredParent(TagInfo tag, HtmlCleaner.CleanTimeValues cleanTimeValues)
Checks if the specified tag requires a parent tag, but that parent tag is missing in the appropriate context.
- Parameters:
tag
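The required-parent check can be pictured as a lookup against the currently open tags. The following is a simplified sketch under stated assumptions: the rule table REQUIRED_PARENT, the string-based tag names, and the Deque of open tags are all hypothetical. The real method takes a TagInfo and the private CleanTimeValues state, and may apply additional conditions not shown here.

```java
import java.util.Deque;
import java.util.Map;

// Hypothetical illustration of a required-parent check; not HtmlCleaner code.
final class RequiredParentSketch {
    // Assumed rule table: child tag name -> parent tag name it requires.
    static final Map<String, String> REQUIRED_PARENT = Map.of(
            "tr", "table",
            "td", "tr");

    // True if the tag requires a parent and that parent is not among the open tags.
    static boolean mustAddRequiredParent(String tagName, Deque<String> openTags) {
        String parent = REQUIRED_PARENT.get(tagName);
        return parent != null && !openTags.contains(parent);
    }
}
```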
-
isAllowedInLastOpenTag
private boolean isAllowedInLastOpenTag(BaseToken token, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
saveToLastOpenTag
private void saveToLastOpenTag(java.util.List nodeList, BaseToken tokenToAdd, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
isStartToken
private boolean isStartToken(java.lang.Object o)
-
makeTree
void makeTree(java.util.List<BaseToken> nodeList, java.util.ListIterator<BaseToken> nodeIterator, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
createDocumentNodes
private void createDocumentNodes(java.util.List listNodes, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
closeSnippet
private java.util.List closeSnippet(java.util.List nodeList, HtmlCleaner.TagPos tagPos, java.lang.Object toNode, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
closeAll
private void closeAll(java.util.List<BaseToken> nodeList, HtmlCleaner.CleanTimeValues cleanTimeValues)
Close all unclosed tags if there are any.
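Closing unclosed tags is commonly done by tracking open tags on a stack. The sketch below shows that general technique on a list of string tokens. It is not the library's algorithm: CloseAllSketch, the token format, and the policy of dropping unmatched end tags are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of stack-based tag balancing; not HtmlCleaner code.
final class CloseAllSketch {
    // Tokens are "<name>" for start tags, "</name>" for end tags, anything else is text.
    static List<String> closeAll(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>(); // most recently opened tag is on top
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (open.contains(name)) {
                    // Close everything opened after the matching start tag, then the tag itself.
                    String top;
                    do {
                        top = open.pop();
                        out.add("</" + top + ">");
                    } while (!top.equals(name));
                }
                // Assumption: an end tag with no matching open tag is dropped.
            } else {
                if (token.startsWith("<")) {
                    open.push(token.substring(1, token.length() - 1));
                }
                out.add(token);
            }
        }
        // Close whatever is still open at the end of input, innermost first.
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }
}
```

The sketch ignores attributes, self-closing tags, and per-tag nesting rules, all of which a real cleaner has to handle.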
-
addPossibleHeadCandidate
private void addPossibleHeadCandidate(TagInfo tagInfo, TagNode tagNode, HtmlCleaner.CleanTimeValues cleanTimeValues)
Checks if the specified tag with the specified info is a candidate for moving to the head section.
- Parameters:
tagInfo
tagNode
-
getProperties
public CleanerProperties getProperties()
-
setPruneTags
private void setPruneTags(java.lang.String pruneTags, HtmlCleaner.CleanTimeValues cleanTimeValues)
-
getTagInfoProvider
public ITagInfoProvider getTagInfoProvider()
- Returns:
- ITagInfoProvider instance for this HtmlCleaner
-
getTransformations
public CleanerTransformations getTransformations()
- Returns:
- Transformations defined for this instance of cleaner
-
setTransformations
public void setTransformations(CleanerTransformations transformations)
Sets transformations for this cleaner instance.
- Parameters:
transformations
-
getInnerHtml
public java.lang.String getInnerHtml(TagNode node)
For the specified node, returns its content as a string.
- Parameters:
node
-
setInnerHtml
public void setInnerHtml(TagNode node, java.lang.String content)
For the specified tag node, defines its HTML content. This causes the cleaner to re-clean the given HTML portion and insert it into the node in place of the previous content.
- Parameters:
node
content
-
-