{"id":1830,"date":"2016-03-10T00:00:00","date_gmt":"2016-03-09T23:00:00","guid":{"rendered":"https:\/\/wwwneu.strehle.de\/tim\/weblog\/archives\/2016\/03\/10\/1590-2\/"},"modified":"2025-07-31T21:57:40","modified_gmt":"2025-07-31T19:57:40","slug":"1590-2","status":"publish","type":"post","link":"https:\/\/www.strehle.de\/tim\/weblog\/archives\/2016\/03\/10\/1590-2\/","title":{"rendered":"Turn HTML into plain text with proper whitespace (in XSLT and PHP)"},"content":{"rendered":"\n<p>Turning HTML into (unformatted) plain text seems simple at first: PHP has <a href=\"http:\/\/www.php.net\/strip_tags\"><code>strip_tags()<\/code><\/a>, XSLT has <a href=\"https:\/\/www.w3.org\/TR\/xslt#value-of\"><code>xsl:value-of<\/code><\/a>. In practice, though, you\u2019ll frequently find that words are glued together which should have whitespace between them.<\/p>\n\n\n\n<p>Take this example \u2013 extra weirdly-formatted to get the point across:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/s3.eu-central-1.amazonaws.com\/files.strehle.de\/tim\/blog\/html-to-plaintext-browser-view.png\" alt=\"\"\/><\/figure>\n\n\n\n<p>If you select and copy this text in the browser, the result will look similar to the following:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Hello\nWorld\n\nFirst line\nSecond line.\n\n&nbsp;&nbsp;&nbsp; 1\n&nbsp;&nbsp;&nbsp; 2\n<\/code><\/pre>\n\n\n\n<p>Now look at what we get if we feed the same HTML source code into <code>strip_tags()<\/code> or <code>xsl:value-of<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>HelloWorld\nFirst lineSecond line.\n12\n<\/code><\/pre>\n\n\n\n<p>Words (\u201cHelloWorld\u201d instead of \u201cHello World\u201d) and lines are glued together! 
To understand why this happens, pay attention to the (perfectly valid) missing spaces between HTML tags in my example\u2019s source (I\u2019m using XHTML to be able to process it via XSLT):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\"&gt;\n&lt;body&gt;\n&lt;h1&gt;Hello&lt;\/h1&gt;&lt;h2&gt;World&lt;\/h2&gt;\n&lt;p class=\"c\"&gt;&lt;span class=\"u\"&gt;F&lt;\/span&gt;irst l&lt;a href=\"#\"&gt;in&lt;\/a&gt;e&lt;br \/&gt;&lt;b&gt;S&lt;\/b&gt;&lt;del&gt;e&lt;\/del&gt;&lt;em&gt;c&lt;\/em&gt;&lt;i&gt;o&lt;\/i&gt;&lt;ins&gt;n&lt;\/ins&gt;&lt;mark&gt;d&lt;\/mark&gt; &lt;strike&gt;l&lt;\/strike&gt;&lt;strong&gt;i&lt;\/strong&gt;&lt;sub&gt;n&lt;\/sub&gt;&lt;sup&gt;e&lt;\/sup&gt;&lt;u&gt;.&lt;\/u&gt;&lt;\/p&gt;\n&lt;ul&gt;&lt;li&gt;1&lt;\/li&gt;&lt;li&gt;2&lt;\/li&gt;&lt;\/ul&gt;\n&lt;\/body&gt;\n&lt;\/html&gt;\n<\/code><\/pre>\n\n\n\n<p>PHP and the XSLT processor simply remove all the angle-bracketed HTML tags. This is fine for inline elements; <code>&lt;span&gt;4&lt;\/span&gt;2<\/code> is supposed to become <code>42<\/code>. But it\u2019s wrong for <a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/HTML\/Block-level_elements\">block-level elements<\/a> (\u201ctypically displayed with a newline both before and after the element by browsers\u201d) \u2013 <code>&lt;p&gt;4&lt;\/p&gt;&lt;p&gt;2&lt;\/p&gt;<\/code> means <code>4 2<\/code> not <code>42<\/code>. Since the PHP and XSLT functions have no knowledge of block-level elements in HTML, they have no way to correctly distinguish between them and inline elements.<\/p>\n\n\n\n<p>In our application, we\u2019ve implemented a slightly hacky workaround that solves that problem. 
In essence, we\u2019re maintaining a list of inline elements (currently: <code>a<\/code>, <code>b<\/code>, <code>del<\/code>, <code>em<\/code>, <code>i<\/code>, <code>ins<\/code>, <code>mark<\/code>, <code>span<\/code>, <code>strike<\/code>, <code>strong<\/code>, <code>sub<\/code>, <code>sup<\/code>, <code>u<\/code>) whose tags are simply removed. All other elements are assumed to be block-level and get a newline appended. We\u2019ve implemented this twice, in XSLT and in PHP.<\/p>\n\n\n\n<p>Here\u2019s our XSLT solution \u2013 certainly not the most elegant one (I\u2019m no XSLT guru):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;xsl:stylesheet version=\"1.0\"\n&nbsp;&nbsp;&nbsp; xmlns:xhtml=\"http:\/\/www.w3.org\/1999\/xhtml\"\n&nbsp;&nbsp;&nbsp; xmlns:xsl=\"http:\/\/www.w3.org\/1999\/XSL\/Transform\"\n&nbsp;&nbsp;&nbsp; &gt;\n\n&nbsp; &lt;xsl:output method=\"xml\" encoding=\"UTF-8\" indent=\"yes\"\/&gt;\n\n&nbsp; &lt;xsl:template match=\"\/xhtml:html\"&gt;\n&nbsp;&nbsp;&nbsp; &lt;plaintext&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;wrong&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:value-of select=\"xhtml:body\"\/&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/wrong&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;correct&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:apply-templates select=\"xhtml:body\" mode=\"xhtml_to_plaintext\"\/&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/correct&gt;\n&nbsp;&nbsp;&nbsp; &lt;\/plaintext&gt;\n&nbsp; &lt;\/xsl:template&gt;\n&nbsp; \n&nbsp; &lt;xsl:template match=\"xhtml:*\" mode=\"xhtml_to_plaintext\"&gt;\n&nbsp;&nbsp;&nbsp; &lt;xsl:choose&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:when test=\"self::text()\"&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:value-of select=\".\"\/&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/xsl:when&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:otherwise&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
&lt;xsl:apply-templates mode=\"xhtml_to_plaintext\"\/&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:variable name=\"html_element_type\"&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:call-template name=\"get_html_element_type\"&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:with-param name=\"e\" select=\"local-name()\"\/&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/xsl:call-template&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/xsl:variable&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:if test=\"$html_element_type = 'block'\"&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:text&gt;\n&lt;\/xsl:text&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/xsl:if&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/xsl:otherwise&gt;\n&nbsp;&nbsp;&nbsp; &lt;\/xsl:choose&gt;\n&nbsp; &lt;\/xsl:template&gt;&nbsp; \n\n&nbsp; &lt;xsl:template name=\"get_html_element_type\"&gt;\n&nbsp;&nbsp;&nbsp; &lt;xsl:param name=\"e\"\/&gt;\n&nbsp;&nbsp;&nbsp; &lt;xsl:choose&gt;&nbsp;&nbsp;&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:when test=\"$e='a' or $e='b' or $e='del' or $e='em' or $e='i' or $e='ins' or $e='mark' or $e='span' or $e='strike' or $e='strong' or $e='sub' or $e='sup' or $e='u'\"&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:text&gt;inline&lt;\/xsl:text&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/xsl:when&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:otherwise&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;xsl:text&gt;block&lt;\/xsl:text&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;\/xsl:otherwise&gt;\n&nbsp;&nbsp;&nbsp; &lt;\/xsl:choose&gt;\n&nbsp; &lt;\/xsl:template&gt;\n\n&lt;\/xsl:stylesheet&gt;\n<\/code><\/pre>\n\n\n\n<p>And here\u2019s the same thing in PHP:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;?php\n\nfunction xhtmlToPlaintext($html)\n{\n&nbsp;&nbsp;&nbsp; \/\/ Add line breaks after block elements\n&nbsp;&nbsp;&nbsp; $html = 
preg_replace_callback('\/(&lt;.*?&gt;)\/', 'replaceCallback', $html);\n\n&nbsp;&nbsp;&nbsp; \/\/ Remove HTML tags with strip_tags()\n&nbsp;&nbsp;&nbsp; $plaintext = strip_tags($html);\n&nbsp;&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp; \/\/ Replace multiple spaces with a single space\n&nbsp;&nbsp;&nbsp; $plaintext = trim(preg_replace('\/ +\/s', ' ', $plaintext));\n&nbsp;&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp; \/\/ Decode HTML entities like &amp;quot;\n&nbsp;&nbsp;&nbsp; return html_entity_decode($plaintext, ENT_QUOTES, 'UTF-8');\n}\n\nfunction replaceCallback($matches)\n{\n&nbsp;&nbsp;&nbsp; $inline_elements =\n&nbsp;&nbsp;&nbsp; &#91;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'a', 'b', 'del', 'em', 'i', 'ins', 'mark', 'span', \n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 'strike', 'strong', 'sub', 'sup', 'u' \n&nbsp;&nbsp;&nbsp; ];\n\n&nbsp;&nbsp;&nbsp; \/\/ &lt;span class=\"x\"&gt; =&gt; span\n&nbsp;&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp; $replace = &#91; '&lt;' =&gt; '', '&gt;' =&gt; '', '\/' =&gt; '' ];\n&nbsp;&nbsp;&nbsp; $parts = explode(' ', trim(strtr($matches&#91; 1 ], $replace)));\n&nbsp;&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp; $e = $parts&#91; 0 ];\n\n&nbsp;&nbsp;&nbsp; $result = $matches&#91; 0 ];\n\n&nbsp;&nbsp;&nbsp; if (! in_array($e, $inline_elements))\n&nbsp;&nbsp;&nbsp; {\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $result .= \"\\n\";\n&nbsp;&nbsp;&nbsp; }\n\n&nbsp;&nbsp;&nbsp; return $result;\n}\n\n$html = file_get_contents('xhtml-in.html');\n\necho \"Wrong:\\n\";\nvar_dump(trim(strip_tags($html)));\n\necho \"Correct:\\n\";\nvar_dump(xhtmlToPlaintext($html));\n<\/code><\/pre>\n\n\n\n<p>Suggestions for improvement are welcome!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Turning HTML into (unformatted) plain text seems simple at first: PHP has strip_tags(), XSLT has xsl:value-of. 
In practice, though, you\u2019ll frequently find that words are glued together which should have whitespace between them. Take this example \u2013 extra weirdly-formatted to get the point across: If you select and copy this text in the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_share_on_mastodon":"0"},"categories":[1],"tags":[],"class_list":["post-1830","post","type-post","status-publish","format-standard","hentry","category-weblog"],"share_on_mastodon":{"url":"","error":""},"_links":{"self":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1830","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/comments?post=1830"}],"version-history":[{"count":1,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1830\/revisions"}],"predecessor-version":[{"id":1912,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/posts\/1830\/revisions\/1912"}],"wp:attachment":[{"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/media?parent=1830"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/categories?post=1830"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.strehle.de\/tim\/wp-json\/wp\/v2\/tags?post=1830"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}