How To Get Orphaned Text With Jsoup?

I have HTML containing the following text: This is the first text More text here Another line of text Text in the span Another text in span

Solution 1:

I would go with a recursive method that takes your starting tag and iterates over its child nodes. For each TextNode, print its text. For each Element, call the method again on that element so its own child nodes are visited in turn.

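import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public static void main(String[] args) throws IOException
{
    // I put your HTML in the body tag in a local file
    Document doc = Jsoup.parse(new File("input/20160505.html"), "UTF-8");
    Element rootTag = doc.body();
    printTextOfTag(rootTag);
}

// Prints every text node below currentTag, in document order
public static void printTextOfTag(Element currentTag)
{
    List<Node> nodes = currentTag.childNodes();
    for(Node n : nodes)
    {
        if(n instanceof TextNode)
        {
            // Text directly inside currentTag, including "orphaned" text
            System.out.println(((TextNode)n).text());
        }
        else if(n instanceof Element)
        {
            // Descend into nested tags
            printTextOfTag((Element)n);
        }
    }
}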

Output

This is the first text

 More text here Another line of text 

Text in the span



Another text in span

 This is another line
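
The empty lines in the output most likely come from text nodes that contain only whitespace between tags. A minimal sketch of a variant that skips those nodes and returns the text as a list instead of printing it is shown here. The class name OrphanedText, the method collectText, and the sample markup are hypothetical and chosen only for illustration, since the tags of the original HTML are not known.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class OrphanedText
{
    // Collects the trimmed text of every non-blank TextNode below root, in document order
    static List<String> collectText(Element root)
    {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<String> out)
    {
        for (Node child : node.childNodes())
        {
            if (child instanceof TextNode)
            {
                TextNode text = (TextNode) child;
                // isBlank() is true for empty or whitespace-only text nodes
                if (!text.isBlank())
                {
                    out.add(text.text().trim());
                }
            }
            else if (child instanceof Element)
            {
                // Descend into nested tags
                collect(child, out);
            }
        }
    }

    public static void main(String[] args)
    {
        // Hypothetical markup, not the HTML from the question
        String html = "<div>First text<p>More text</p><span>In span</span> Last line</div>";
        Document doc = Jsoup.parse(html);
        for (String line : collectText(doc.body()))
        {
            System.out.println(line);
        }
    }
}
```

Returning a list rather than printing keeps the traversal reusable, for example when the orphaned text needs further processing before it is displayed.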
