Tim's Weblog
Tim Strehle’s links and thoughts on Web apps, software development and Digital Asset Management, since 2002.
2016-03-10

Turn HTML into plain text with proper whitespace (in XSLT and PHP)

Turning HTML into (unformatted) plain text seems simple at first: PHP has <a href="http://www.php.net/strip_tags">strip_tags()</a>, XSLT has <a href="https://www.w3.org/TR/xslt#value-of">xsl:value-of</a>. In practice, though, you’ll frequently find that words which should be separated by whitespace end up glued together.

Take this example, formatted in a deliberately weird way to get the point across:

If you select and copy this text in the browser, the result will look similar to the following:

Hello
World

First line
Second line.

    1
    2

Now look at what we get if we feed the same HTML source code into strip_tags() or xsl:value-of:

HelloWorld
First lineSecond line.
12

Words (“HelloWorld” instead of “Hello World”) and lines are glued together! To understand why this happens, note that there is no whitespace between many of the HTML tags in my example’s source, which is perfectly valid (I’m using XHTML to be able to process it via XSLT):

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Hello</h1><h2>World</h2>
<p class="c"><span class="u">F</span>irst l<a href="#">in</a>e<br /><b>S</b><del>e</del><em>c</em><i>o</i><ins>n</ins><mark>d</mark> <strike>l</strike><strong>i</strong><sub>n</sub><sup>e</sup><u>.</u></p>
<ul><li>1</li><li>2</li></ul>
</body>
</html>

strip_tags() simply removes all the angle-bracketed HTML tags, and xsl:value-of returns only the concatenated text content, which has the same effect. This is fine for inline elements; <span>4</span>2 is supposed to become 42. But it’s wrong for block-level elements (“typically displayed with a newline both before and after the element by browsers”): <p>4</p><p>2</p> means “4 2”, not “42”. Since neither function knows which HTML elements are block-level, they have no way to distinguish them from inline elements.
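To make the difference concrete, here is a minimal sketch that runs strip_tags() on the two snippets from the paragraph above. The variable names are only for illustration.

```php
<?php

// Inline element: removing the tags gives the intended text.
$inline = strip_tags('<span>4</span>2');

// Block-level elements: the paragraph boundary is lost,
// so the two paragraphs run together.
$block = strip_tags('<p>4</p><p>2</p>');

var_dump($inline); // string(2) "42"
var_dump($block);  // string(2) "42"
```

Both calls return the same string, although a browser would render the second snippet as two separate paragraphs.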

In our application, we’ve implemented a slightly hacky workaround for this problem. In essence, we maintain a list of inline elements (currently: a, b, del, em, i, ins, mark, span, strike, strong, sub, sup, u) whose tags are simply removed. All other elements are assumed to be block-level and get a newline appended. That includes br, which is not in the list and therefore produces a line break. We’ve implemented this twice, in XSLT and in PHP.

Here’s our XSLT solution – certainly not the most elegant one (I’m no XSLT guru):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    >

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:template match="/xhtml:html">
    <plaintext>
      <wrong>
        <xsl:value-of select="xhtml:body"/>
      </wrong>
      <correct>
        <xsl:apply-templates select="xhtml:body" mode="xhtml_to_plaintext"/>
      </correct>
    </plaintext>
  </xsl:template>
  
  <!-- Text nodes never match xhtml:*; the built-in template rule copies them in every mode -->
  <xsl:template match="xhtml:*" mode="xhtml_to_plaintext">
    <!-- Process the children first, then decide whether to append a newline -->
    <xsl:apply-templates mode="xhtml_to_plaintext"/>
    <xsl:variable name="html_element_type">
      <xsl:call-template name="get_html_element_type">
        <xsl:with-param name="e" select="local-name()"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="$html_element_type = 'block'">
      <xsl:text>&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

  <xsl:template name="get_html_element_type">
    <xsl:param name="e"/>
    <xsl:choose>     
      <xsl:when test="$e='a' or $e='b' or $e='del' or $e='em' or $e='i' or $e='ins' or $e='mark' or $e='span' or $e='strike' or $e='strong' or $e='sub' or $e='sup' or $e='u'">
        <xsl:text>inline</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text>block</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

And here’s the same thing in PHP:

<?php

function xhtmlToPlaintext($html)
{
    // Append a newline after every tag (opening or closing) that is not an inline element
    $html = preg_replace_callback('/(<.*?>)/', 'replaceCallback', $html);

    // Remove HTML tags with strip_tags()
    $plaintext = strip_tags($html);
    
    // Replace multiple spaces with a single space
    $plaintext = trim(preg_replace('/ +/', ' ', $plaintext));
    
    // Decode HTML entities like &quot;
    return html_entity_decode($plaintext, ENT_QUOTES, 'UTF-8');
}

function replaceCallback($matches)
{
    $inline_elements =
    [
        'a', 'b', 'del', 'em', 'i', 'ins', 'mark', 'span', 
        'strike', 'strong', 'sub', 'sup', 'u' 
    ];

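    // Extract the lowercase tag name: <span class="x"> or </SPAN> => span
    
    $replace = [ '<' => '', '>' => '', '/' => '' ];
    $parts = preg_split('/\s+/', trim(strtr($matches[ 1 ], $replace)));
    
    $e = strtolower($parts[ 0 ]);

    $result = $matches[ 0 ];

    // Anything that is not a known inline element is treated as block-level
    if (! in_array($e, $inline_elements, true))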
    {
        $result .= "\n";
    }

    return $result;
}

$html = file_get_contents('xhtml-in.html');

echo "Wrong:\n";
var_dump(trim(strip_tags($html)));

echo "Correct:\n";
var_dump(xhtmlToPlaintext($html));
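One possible refinement is to walk a parsed DOM tree instead of matching tags with a regular expression. The following is only a sketch, assuming the input is well-formed XHTML and PHP’s DOM extension is available. The function name domToPlaintext() and the shortened sample markup are made up for this illustration. It mirrors the XSLT logic: text nodes are copied, and every element that is not in the inline list gets a newline appended.

```php
<?php

// Hypothetical helper (not part of the code above): collects the text of
// a DOM subtree and appends a newline after every non-inline element.
function domToPlaintext(DOMNode $node, array $inlineElements): string
{
    // Text and CDATA nodes contribute their content unchanged
    if ($node instanceof DOMText) {
        return $node->data;
    }

    // Comments, processing instructions etc. contribute nothing
    if (!($node instanceof DOMElement)) {
        return '';
    }

    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= domToPlaintext($child, $inlineElements);
    }

    // Everything not listed as inline is assumed to be block-level
    if (!in_array($node->localName, $inlineElements, true)) {
        $text .= "\n";
    }

    return $text;
}

$inlineElements = [
    'a', 'b', 'del', 'em', 'i', 'ins', 'mark', 'span',
    'strike', 'strong', 'sub', 'sup', 'u',
];

// Shortened version of the example document, for illustration only
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    . '<h1>Hello</h1><h2>World</h2>'
    . '<p>First l<a href="#">in</a>e<br /><b>S</b>econd line.</p>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml);

$body = $doc->getElementsByTagNameNS('http://www.w3.org/1999/xhtml', 'body')->item(0);
$plaintext = trim(domToPlaintext($body, $inlineElements));

var_dump($plaintext);
```

Because the XML parser already resolves entities and ignores angle brackets inside attribute values, this variant needs neither html_entity_decode() nor a tag-matching regular expression. The trade-off is that it rejects input that is not well-formed XML.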

Suggestions for improvement are welcome!