在电子邮件中将 HTML 转换为 PHP 中的纯文本

我使用 TinyMCE来允许在我的网站文本的最小格式化。从生成的 HTML 中,我希望将其转换为用于电子邮件的纯文本。我一直在使用一个名为 Html2text的类,但它实际上缺乏对 UTF-8的支持。但是,我喜欢它将某些 HTML 标记映射到纯文本格式ーー比如在以前 HTML 中有 < i > 标记的文本周围放置下划线。

是否有人使用类似的方法在 PHP 中将 HTML 转换为纯文本?如果是这样的话: 您是否推荐任何我可以使用的第三方类?或者你如何最好地解决这个问题?

204171 次浏览

There's the trusty strip_tags function. It's not pretty though. It'll only sanitize. You could combine it with a string replace to get your fancy underscores.


<?php
// to strip all tags and wrap italics with underscore
strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));


// to preserve anchors...
str_replace("|a", "<a", strip_tags(str_replace("<a", "|a", $text)));


?>

Converting from HTML to text using a DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5:

Regarding UTF-8, the write-up on the "howto" page states:

PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.

The author provides several approaches to solving this and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.

Note the restrictions for commercial use.

Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load from HTML, and then iterates over the resulting DOM to extract plain text. Usage:

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);


// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

Although incomplete, it is open source and contributions are welcome.

Issues with other conversion scripts:

  • Since html2text (GPL) is not EPL-compatible.
  • lkessler's link (attribution) is incompatible with most open source licenses.

Markdownify converts HTML to Markdown, a plain-text formatting system used on this very site.

You can use lynx with -stdin and -dump options to achieve that:

<?php
$descriptorspec = array(
0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);


$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);


if (is_resource($process)) {
// $pipes now looks like this:
// 0 => writeable handle connected to child stdin
// 1 => readable handle connected to child stdout
// Any error output will be appended to htmp2txt.log


$stdin = $pipes[0];
fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>


<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
);
fclose($stdin);


echo stream_get_contents($pipes[1]);
fclose($pipes[1]);


// It is important that you close any pipes before calling
// proc_close in order to avoid a deadlock
$return_value = proc_close($process);


echo "command returned $return_value\n";
}

I have just found a PHP function "strip_tags()" and its working in my case.

I tried to convert the following HTML :

<p><span style="font-family: 'Verdana','sans-serif'; color: black; font-size: 7.5pt;">&nbsp;</span>Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry's lackluster performance during this time,  revenue has grown at an average annual rate&nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&nbsp; So despite the downturn, how were we  able to manage growth as an industry?</p>

After applying strip_tags() function, I have got the following output :

&amp;nbsp;Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&amp;nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry&#039;s lackluster performance during this time,  revenue has grown at an average annual rate&amp;nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&amp;nbsp; So despite the downturn, how were we  able to manage growth as an industry?

here is another solution:

$cleaner_input = strip_tags($text);

For other variations of sanitization functions, see:

https://github.com/ttodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php

You can test this function

function html2text($Document) {
$Rules = array ('@<script[^>]*?>.*?</script>@si',
'@<[\/\!]*?[^<>]*?>@si',
'@([\r\n])[\s]+@',
'@&(quot|#34);@i',
'@&(amp|#38);@i',
'@&(lt|#60);@i',
'@&(gt|#62);@i',
'@&(nbsp|#160);@i',
'@&(iexcl|#161);@i',
'@&(cent|#162);@i',
'@&(pound|#163);@i',
'@&(copy|#169);@i',
'@&(reg|#174);@i',
'@&#(d+);@e'
);
$Replace = array ('',
'',
'',
'',
'&',
'<',
'>',
' ',
chr(161),
chr(162),
chr(163),
chr(169),
chr(174),
'chr()'
);
return preg_replace($Rules, $Replace, $Document);
}

I didn't find any of the existing solutions fitting - simple HTML emails to simple plain text files.

I've opened up this repository, hope it helps someone. MIT license, by the way :)

https://github.com/RobQuistNL/SimpleHtmlToText

Example:

$myHtml = '<b>This is HTML</b><h1>Header</h1><br/><br/>Newlines';
echo (new Parser())->parseString($myHtml);

returns:

**This is HTML**
### Header ###




Newlines

I came around the same problem as the OP, and trying some solutions from the top answers above didn't prove to work for my scenarios. See why at the end.

Instead, I found this helpful script, to avoid confusion let's call it html2text_roundcube, available under GPL:

It's actually an updated version of an already mentioned script - http://www.chuggnutt.com/html2text.php - updated by RoundCube mail.

Usage:

$h2t = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');
echo $h2t->getText(); // prints Hello, "WORLD"

Why html2text_roundcube proved better than the others:

  • Script http://www.chuggnutt.com/html2text.php didn't work out of the box for cases with special HTML codes/names (eg &auml;), or unpaired quotes (eg <p>25" Monitor</p>).

  • Script https://github.com/soundasleep/html2text had no option to hide or group the links at the end of the text, making a usual HTML page look bloated with links when in text-plain format; customizing the code for special treatment of how the transformation is done is not as straight forward as simply editing an array in html2text_roundcube.

public function plainText($text)
{
$text = strip_tags($text, '<br><p><li>');
$text = preg_replace ('/<[^>]*>/', PHP_EOL, $text);


return $text;
}

$text = "string 1<br>string 2<br/><ul><li>string 3</li><li>string 4</li></ul><p>string 5</p>";

echo planText($text);

output
string 1
string 2
string 3
string 4
string 5

If you don't want to strip the tags completely and keep the content inside the tags, you can use the DOMDocument and extract the textContent of the root node like this:

function html2text($html) {
$dom = new DOMDocument();
$dom->loadHTML("<body>" . strip_tags($html, '<b><a><i><div><span><p>') . "</body>");
$xpath = new DOMXPath($dom);
$node = $xpath->query('body')->item(0);
return $node->textContent; // text
}


$p = 'this is <b>test</b>. <p>how are <i>you?</i>. <a href="#">I\'m fine!</a></p>';
print html2text($p);
// this is test. how are you?. I'm fine!

One advantage of this approach is that it does not require any external packages.

If you want to convert the HTML special characters and not just remove them as well as strip things down and prepare for plain text this was the solution that worked for me...

function htmlToPlainText($str){
$str = str_replace('&nbsp;', ' ', $str);
$str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
$str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
$str = html_entity_decode($str);
$str = htmlspecialchars_decode($str);
$str = strip_tags($str);


return $str;
}


$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>'


htmlToPlainText($string);
// "this is ( ) a test. Yes this is! & does it get processed?"`

html_entity_decode w/ ENT_QUOTES | ENT_XML1 converts things like &#39; htmlspecialchars_decode converts things like &amp; html_entity_decode converts things like '&lt; and strip_tags removes any HTML tags left over.

For texts in utf-8, it worked for me mb_convert_encoding. To process everything regardless of errors, make sure you use the "@".

The basic code I use is:

$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));


$body = $dom->getElementsByTagName('body')->item(0);
echo $body->textContent;

If you want something more advanced, you can iteratively analyze the nodes, but you will encounter many problems with whitespaces.

I have implemented a converter based on what I say here. If you are interested, you can download it from git https://github.com/kranemora/html2text

It may serve as a reference to make yours

You can use it like this:

$html = <<<EOF
<p>Welcome to <strong>html2text<strong></p>
<p>It's <em>works</em> for you?</p>
EOF;


$html2Text = new \kranemora\Html2Text\Html2Text;
$text = $html2Text->convert($html);