Class HtmlCleaner


  • public class HtmlCleaner
    extends java.lang.Object
    Main HtmlCleaner class.

    It represents the public interface to the user. Its task is to call the tokenizer with the specified source HTML, traverse the list of produced tokens, and create the internal object model. It also offers a set of methods to write the resulting XML to a string, file, or any output stream.

    Typical usage is the following:

        // create an instance of HtmlCleaner
        HtmlCleaner cleaner = new HtmlCleaner();

        // take default cleaner properties
        CleanerProperties props = cleaner.getProperties();

        // customize cleaner's behaviour with property setters
        props.setXXX(...);

        // Clean HTML taken from simple string, file, URL, input stream,
        // input source or reader. Result is root node of created
        // tree-like structure. Single cleaner instance may be safely used
        // multiple times.
        TagNode node = cleaner.clean(...);

        // optionally find parts of the DOM or modify some nodes
        TagNode[] myNodes = node.getElementsByXXX(...);
        // and/or
        Object[] myNodes = node.evaluateXPath(xPathExpression);
        // and/or
        aNode.removeFromTree();
        // and/or
        aNode.addAttribute(attName, attValue);
        // and/or
        aNode.removeAttribute(attName);
        // and/or
        cleaner.setInnerHtml(aNode, htmlContent);
        // and/or do some other tree manipulation/traversal

        // serialize a node to a file, output stream, DOM, JDom...
        new XXXSerializer(props).writeXmlXXX(aNode, ...);
        myJDom = new JDomSerializer(props, true).createJDom(aNode);
        myDom = new DomSerializer(props, true).createDOM(aNode);

    • Constructor Detail

      • HtmlCleaner

        public HtmlCleaner()
        Constructor - creates the instance with default tag info provider and default properties.
      • HtmlCleaner

        public HtmlCleaner​(ITagInfoProvider tagInfoProvider)
        Constructor - creates the instance with specified tag info provider and default properties.
        Parameters:
        tagInfoProvider - Provider for tag filtering and balancing
      • HtmlCleaner

        public HtmlCleaner​(CleanerProperties properties)
        Constructor - creates the instance with default tag info provider and specified properties.
        Parameters:
        properties - Properties used during parsing and serializing
      • HtmlCleaner

        public HtmlCleaner​(ITagInfoProvider tagInfoProvider,
                           CleanerProperties properties)
        Constructor - creates the instance with specified tag info provider and specified properties.
        Parameters:
        tagInfoProvider - Provider for tag filtering and balancing
        properties - Properties used during parsing and serializing
    • Method Detail

      • clean

        public TagNode clean​(java.lang.String htmlContent)
      • clean

        public TagNode clean​(java.io.File file,
                             java.lang.String charset)
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • clean

        public TagNode clean​(java.io.File file)
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • clean

        public TagNode clean​(java.net.URL url,
                             java.lang.String charset)
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • clean

        public TagNode clean​(java.net.URL url)
                      throws java.io.IOException
        Creates a TagNode instance from the content downloaded from the specified URL. The HTML encoding is resolved by trying the following in order: 1. reading the Content-Type response header, 2. analyzing META tags at the beginning of the HTML, 3. using the platform's default charset.
        Parameters:
        url - URL of the HTML content to download and clean
        Returns:
        Root node of the resulting tree.
        Throws:
        java.io.IOException
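
        The three-step encoding fallback described above can be sketched with the standard library alone. This is an illustration of the documented order, not HtmlCleaner's actual implementation: the class name CharsetResolutionSketch, its methods, and the regular expressions are hypothetical, and the sketch assumes the caller has already obtained the Content-Type header value and the first part of the HTML as text.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the documented fallback order; not library code.
class CharsetResolutionSketch {

    // "charset=..." inside a Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // "charset=..." inside a META tag, covering both
    // <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset captured by the pattern, or null if the text is
    // null, nothing matches, or the JVM does not know the captured name.
    static Charset find(Pattern pattern, String text) {
        if (text == null) {
            return null;
        }
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException unknownOrIllegalName) {
            return null;
        }
    }

    // Step 1: header, step 2: META tags, step 3: platform default.
    static Charset resolve(String contentTypeHeader, String htmlPrefix) {
        Charset fromHeader = find(HEADER_CHARSET, contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        Charset fromMeta = find(META_CHARSET, htmlPrefix);
        if (fromMeta != null) {
            return fromMeta;
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // The header wins over the META tag.
        System.out.println(resolve("text/html; charset=UTF-8",
                "<meta charset=\"ISO-8859-1\">")); // UTF-8
        // No charset in the header, so the META tag is used.
        System.out.println(resolve("text/html",
                "<html><head><meta charset=\"ISO-8859-1\">")); // ISO-8859-1
    }
}
```

        When neither the header nor a META tag names a usable charset, the sketch returns Charset.defaultCharset(), which matches step 3 of the documented sequence.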
      • clean

        public TagNode clean​(java.io.InputStream in,
                             java.lang.String charset)
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • clean

        public TagNode clean​(java.io.InputStream in)
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • clean

        public TagNode clean​(java.io.Reader reader)
                      throws java.io.IOException
        Throws:
        java.io.IOException
      • clean

        public TagNode clean​(java.io.Reader reader,
                             HtmlCleaner.CleanTimeValues cleanTimeValues)
                      throws java.io.IOException
        Basic version of the cleaning call.
        Parameters:
        reader - Reader providing the HTML content to clean
        Returns:
        An instance of TagNode object which is the root of the XML tree.
        Throws:
        java.io.IOException
      • calculateRootNode

        private void calculateRootNode​(HtmlCleaner.CleanTimeValues cleanTimeValues)
        Assigns the root node to an internal variable. The root node of the result depends on the "omitHtmlEnvelope" parameter: if it is set, the first child of the body becomes the root node; otherwise the html element is the root node.
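
        The selection rule can be illustrated with a minimal stand-in tree. The RootNodeSketch class and its Node type below are hypothetical and are not the library's TagNode; falling back to the html node when the body has no children is an assumption of this sketch, since the description above does not cover that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the root selection rule; not library code.
class RootNodeSketch {

    // Minimal stand-in for a tree node (not the library's TagNode).
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    // With omitHtmlEnvelope set, the first child of body becomes the root;
    // otherwise the html node is the root.
    // Assumption: an empty body falls back to the html node.
    static Node chooseRoot(Node html, Node body, boolean omitHtmlEnvelope) {
        if (omitHtmlEnvelope && !body.children.isEmpty()) {
            return body.children.get(0);
        }
        return html;
    }

    public static void main(String[] args) {
        Node html = new Node("html");
        Node body = new Node("body");
        Node div = new Node("div");
        html.children.add(body);
        body.children.add(div);

        System.out.println(chooseRoot(html, body, true).name);  // div
        System.out.println(chooseRoot(html, body, false).name); // html
    }
}
```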
      • addAttributesToTag

        private void addAttributesToTag​(TagNode tag,
                                        java.util.Map attributes)
        Adds attributes from the specified map to the specified tag. If an attribute already exists on the tag, its existing value is preserved.
        Parameters:
        tag - Tag to which the attributes are added
        attributes - Map of attributes to add
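
        The "existing attributes are preserved" behaviour corresponds to an add-if-absent merge. The sketch below shows that technique on plain maps; AttributeMergeSketch and addMissingAttributes are hypothetical names, and representing a tag's attributes as a Map<String, String> is an assumption made for the example rather than a statement about the library's internals.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of an add-if-absent attribute merge; not library code.
class AttributeMergeSketch {

    // Copies each entry of extra into tagAttributes unless the tag already
    // has a value for that attribute name.
    static void addMissingAttributes(Map<String, String> tagAttributes,
                                     Map<String, String> extra) {
        for (Map.Entry<String, String> entry : extra.entrySet()) {
            tagAttributes.putIfAbsent(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> tagAttributes = new LinkedHashMap<>();
        tagAttributes.put("class", "intro");

        Map<String, String> extra = new LinkedHashMap<>();
        extra.put("class", "other"); // ignored: "class" is already set
        extra.put("id", "main");     // added: "id" was missing

        addMissingAttributes(tagAttributes, extra);
        System.out.println(tagAttributes); // {class=intro, id=main}
    }
}
```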
      • isFatalTagSatisfied

        private boolean isFatalTagSatisfied​(TagInfo tag,
                                            HtmlCleaner.CleanTimeValues cleanTimeValues)
        If a fatal tag is defined for the specified tag, checks whether the corresponding open fatal tag is missing.
        Parameters:
        tag -
      • mustAddRequiredParent

        private boolean mustAddRequiredParent​(TagInfo tag,
                                              HtmlCleaner.CleanTimeValues cleanTimeValues)
        Checks whether the specified tag requires a parent tag that is missing in the appropriate context.
        Parameters:
        tag -
      • createTagNode

        private TagNode createTagNode​(TagNode startTagToken)
      • isStartToken

        private boolean isStartToken​(java.lang.Object o)
      • addPossibleHeadCandidate

        private void addPossibleHeadCandidate​(TagInfo tagInfo,
                                              TagNode tagNode,
                                              HtmlCleaner.CleanTimeValues cleanTimeValues)
        Checks whether the specified tag, with the specified tag info, is a candidate for moving to the head section.
        Parameters:
        tagInfo -
        tagNode -
      • getTagInfoProvider

        public ITagInfoProvider getTagInfoProvider()
        Returns:
        ITagInfoProvider instance for this HtmlCleaner
      • getTransformations

        public CleanerTransformations getTransformations()
        Returns:
        Transformations defined for this instance of cleaner
      • setTransformations

        public void setTransformations​(CleanerTransformations transformations)
        Sets transformations for this cleaner instance.
        Parameters:
        transformations -
      • getInnerHtml

        public java.lang.String getInnerHtml​(TagNode node)
        For the specified node, returns its content as a string.
        Parameters:
        node -
      • setInnerHtml

        public void setInnerHtml​(TagNode node,
                                 java.lang.String content)
        For the specified tag node, defines its HTML content. This causes the cleaner to re-clean the given HTML portion and insert it into the node in place of the previous content.
        Parameters:
        node -
        content -