Convert (render) HTML to Text with correct line-breaks

Alex picture Alex · May 1, 2015 · Viewed 13.7k times · Source

I need to convert HTML string to plain text (preferably using HTML Agility pack). With proper white-spaces and, especially, proper line-breaks.

And by "proper line-breaks" I mean that this code:

<div>
    <div>
        <div>
            line1
        </div>
    </div>
</div>
<div>line2</div>

Should be converted as

line1
line2

I.e. only one line-break.

Most of the solutions I've seen simply convert all <div> <br> <p> tags to \n which, obviously, s*cks.

Any suggestions for html-to-plaintext rendering logic for C#? Not the complete code, at least common logic answers like "replace all closing DIVs with line-breaks, but only if the next sibling is not a DIV too" will really help.

Things I tried: simply getting the .InnerText property (wrong obviously), regex (slow, painful, lots of hacks, also regexs are 12 times slower then HtmlAgilityPack - I measured it), this solution and similar (returns more line-breaks then required)

Answer

Serge Shultz picture Serge Shultz · May 6, 2015

The code below works correctly with the example provided, even deals with some weird stuff like <div><br></div>, there're still some things to improve, but the basic idea is there. See the comments.

public static string FormatLineBreaks(string html)
{
    //first - remove all the existing '\n' from HTML
    //they mean nothing in HTML, but break our logic
    html = html.Replace("\r", "").Replace("\n", " ");

    //now create an Html Agile Doc object
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //remove comments, head, style and script tags
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//comment() | //script | //style | //head"))
    {
        node.ParentNode.RemoveChild(node);
    }

    //now remove all "meaningless" inline elements like "span"
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//span | //label")) //add "b", "i" if required
    {
        node.ParentNode.ReplaceChild(HtmlNode.CreateNode(node.InnerHtml), node);
    }

    //block-elements - convert to line-breaks
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//p | //div")) //you could add more tags here
    {
        //we add a "\n" ONLY if the node contains some plain text as "direct" child
        //meaning - text is not nested inside children, but only one-level deep

        //use XPath to find direct "text" in element
        var txtNode = node.SelectSingleNode("text()");

        //no "direct" text - NOT ADDDING the \n !!!!
        if (txtNode == null || txtNode.InnerHtml.Trim() == "") continue;

        //"surround" the node with line breaks
        node.ParentNode.InsertBefore(doc.CreateTextNode("\r\n"), node);
        node.ParentNode.InsertAfter(doc.CreateTextNode("\r\n"), node);
    }

    //todo: might need to replace multiple "\n\n" into one here, I'm still testing...

    //now BR tags - simply replace with "\n" and forget
    foreach (HtmlNode node in doc.DocumentNode.SafeSelectNodes("//br"))
        node.ParentNode.ReplaceChild(doc.CreateTextNode("\r\n"), node);

    //finally - return the text which will have our inserted line-breaks in it
    return doc.DocumentNode.InnerText.Trim();

    //todo - you should probably add "&code;" processing, to decode all the &nbsp; and such
}    

//here's the extension method I use
private static HtmlNodeCollection SafeSelectNodes(this HtmlNode node, string selector)
{
    return (node.SelectNodes(selector) ?? new HtmlNodeCollection(node));
}