Extract text from HTML while preserving block-level element newlines

Dave Jarvis picture Dave Jarvis · Dec 4, 2013 · Viewed 8.2k times · Source

Background

Most questions about extracting text from HTML (i.e., stripping the tags) use:

jQuery( htmlString ).text();

While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).

Problem

Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.

A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.

I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.

Question

What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?

Example

Consider:

  1. Select and copy this entire question.
  2. Open the textarea example page.
  3. Paste the content into the textarea.

The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.

To further clarify, given any HTML content, such as:

   <h1>Header</h1>
   <p>Paragraph</p>
   <ul>
     <li>First</li>
     <li>Second</li>
   </ul>
   <dl>
     <dt>Term</dt>
       <dd>Definition</dd>
   </dl>
   <div>Div with <span>span</span>.<br />After the <a href="...">break</a>.</div>

How would you produce:

  Header
  Paragraph

    First
    Second

  Term
    Definition

  Div with span.
  After the break.

Note: Neither indentation nor non-normalized whitespace are relevant.

Answer

svidgen picture svidgen · Dec 4, 2013

Consider:

/**
 * Returns the style for a node.
 *
 * @param n The node to check.
 * @param p The property to retrieve (usually 'display').
 * @link http://www.quirksmode.org/dom/getstyles.html
 */
this.getStyle = function( n, p ) {
  return n.currentStyle ?
    n.currentStyle[p] :
    document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}

/**
 * Converts HTML to text, preserving semantic newlines for block-level
 * elements.
 *
 * @param node - The HTML node to perform text extraction.
 */
this.toText = function( node ) {
  var result = '';

  if( node.nodeType == document.TEXT_NODE ) {
    // Replace repeated spaces, newlines, and tabs with a single space.
    result = node.nodeValue.replace( /\s+/g, ' ' );
  }
  else {
    for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
      result += _this.toText( node.childNodes[i] );
    }

    var d = _this.getStyle( node, 'display' );

    if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
        node.tagName == 'BR' || node.tagName == 'HR' ) {
      result += '\n';
    }
  }

  return result;
}

http://jsfiddle.net/3mzrV/2/

That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.