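How to extract img src, title and alt from HTML using PHP?

I would like to create a page where all the images that reside on my website are listed with their title and alternative text.

I already wrote a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML: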

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

I guess this should be done with some regex, but since the order of the tags may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard char by char way, but that's painful).

Just to give a small example of using PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

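I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using xpath and the xpath results simpler.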
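For comparison, here is a minimal sketch of the same lookup without the SimpleXML conversion, using DOMXPath directly on the DOMDocument. The function name listImageAttributes() is made up for this illustration and is not part of the answer above.

```php
<?php
// Hypothetical helper name, used only for this sketch.
function listImageAttributes(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = array();
    foreach ($xpath->query('//img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $rows[] = array(
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'),
            'alt'   => $img->getAttribute('alt'),
        );
    }
    return $rows;
}

$rows = listImageAttributes('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');
foreach ($rows as $row) {
    echo $row['src'] . ' ' . $row['alt'] . ' ' . $row['title'] . "\n";
}
```

Returning plain arrays keeps the caller independent of the DOM classes, which is a design choice of this sketch rather than a requirement.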

If it's XHTML, as your example is, you only need simpleXML.

<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>

Output:

object(SimpleXMLElement)#1 (1) {
["@attributes"]=>
array(3) {
["src"]=>
string(22) "/image/fluffybunny.jpg"
["title"]=>
string(16) "Harvey the bunny"
["alt"]=>
string(26) "a cute little fluffy bunny"
}
}

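EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and is likely to lead to unmaintainable and unreliable code. Better use an HTML parser.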

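Solution with regexp

In that case it's better to split the process into two parts:

  • get all the img tags
  • extract their metadata

I will assume your document is not strict xHTML, so you can't use an XML parser. For example, with this web page's source code: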

/* preg_match_all match the regexp in all the $html string and output everything as
an array in $result. "i" option is used to make it case insensitive */


preg_match_all('/<img[^>]+>/i',$html, $result);


print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />


[...]
)


)

Then we get all the img tag attributes with a loop:

$img = array();
foreach( $result[0] as $img_tag)
{
preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]);
}


print_r($img);


Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)


[1] => Array
(
[0] => src
[1] => alt
)


[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)


)


[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)


[1] => Array
(
[0] => src
[1] => alt
[2] => title
)


[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)


)


[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)


[1] => Array
(
[0] => src
[1] => alt
[2] => title
)


[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)


)


[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)


[1] => Array
(
[0] => src
[1] => alt
)


[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)


)


[..]
)


)

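Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can put together your own by using ob_start and loading / saving from a text file.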
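Output buffering with ob_start() is one way to build such a home-made cache. The following is only a sketch under assumptions: the helper name cachedOutput(), its parameters, and the cache file location are all invented for illustration.

```php
<?php
// Hypothetical helper: $cacheFile, $ttlSeconds and $render are illustrative names.
function cachedOutput(string $cacheFile, int $ttlSeconds, callable $render): string
{
    // Serve the cached copy while it is still fresh.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    ob_start();                 // start capturing everything echoed by $render
    $render();
    $output = ob_get_clean();   // stop capturing and fetch the buffer

    // Save the rendered page for later requests.
    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

// Assumed cache location; any writable path would do.
$cacheFile = sys_get_temp_dir() . '/image-list.cache.html';
echo cachedOutput($cacheFile, 3600, function () {
    // The expensive regexp work would go here.
    echo "<ul><li>/image/fluffybunny.jpg</li></ul>\n";
});
```

The callback only runs when the cache file is missing or older than the given lifetime, so the regexp cost is paid at most once per hour in this example.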

How does this stuff work?

First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.

The regexps:

<img[^>]+>

We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains non-">" chars and ends with a ">".

(alt|title|src)=("[^"]*")

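We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a '"', a bunch of stuff that is not '"', and ending with a '"'. The sub-strings between () are isolated as capture groups.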

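Finally, every time you want to deal with regexps, it is handy to have good tools to test them quickly. Check out an online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace every " with '.

If you mix both: first you should slap yourself :-), then try to use ("|') instead of " and [^ø] in place of [^"].

The loop has to iterate over $result[0], like this: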

foreach( $result[0] as $img_tag)

because preg_match_all returns an array of arrays.
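As a sketch of how the two steps might look with both quoting styles handled, the attribute pattern below captures the opening quote and uses a backreference to require the same closing quote. This is one possible variation, not the pattern from the answer, and the sample markup is made up for illustration.

```php
<?php
$html = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img alt='a big green prickly cactus' src='/image/pricklycactus.jpg'>
HTML;

// Step 1: collect every img tag; the full matches are in $result[0].
preg_match_all('/<img[^>]+>/i', $html, $result);

$img = array();
foreach ($result[0] as $img_tag) {
    // Step 2: group 2 captures the opening quote and \2 requires the same quote to close.
    // The lookbehind keeps names such as data-src from matching.
    preg_match_all('/(?<![\w-])(alt|title|src)\s*=\s*(["\'])(.*?)\2/i', $img_tag, $matches, PREG_SET_ORDER);

    $attributes = array();
    foreach ($matches as $match) {
        // $match[1] is the attribute name, $match[3] the value without quotes.
        $attributes[strtolower($match[1])] = $match[3];
    }
    $img[] = $attributes;
}

print_r($img);
```

Unquoted attribute values and values containing ">" are still not handled here, which is part of why a parser is the safer choice.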

$url="http://example.com";


$html = file_get_contents($url);


$doc = new DOMDocument();
@$doc->loadHTML($html);


$tags = $doc->getElementsByTagName('img');


foreach ($tags as $tag) {
echo $tag->getAttribute('src');
}

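I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress, and I was trying to get the src attribute so I could run it through timthumb.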

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);


// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

To grab the title or the alt instead, you can simply use $pattern = '/title="([^"]*)"/'; to grab the title or $pattern = '/alt="([^"]*)"/'; to grab the alt. Sadly, my regex isn't good enough to grab all three (alt/title/src) in one pass.
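One way to grab all three in a single pass is an alternation combined with preg_match_all. The sketch below assumes double-quoted attributes, and the $image string is a made-up stand-in for what get_the_post_thumbnail() returns.

```php
<?php
// Stand-in for the string returned by get_the_post_thumbnail(); the markup is invented for illustration.
$image = '<img width="150" height="150" src="/uploads/bunny.jpg" alt="a bunny" title="Harvey" />';

// One pattern with an alternation; each match is one attribute.
// The lookbehind keeps names such as data-src from matching.
$pattern = '/(?<![\w-])(src|alt|title)="([^"]*)"/i';
preg_match_all($pattern, $image, $matches, PREG_SET_ORDER);

// Default to empty strings so a missing attribute does not cause a notice.
$attrs = array('src' => '', 'alt' => '', 'title' => '');
foreach ($matches as $match) {
    $attrs[strtolower($match[1])] = $match[2];
}

echo $attrs['src'], "\n";
echo $attrs['alt'], "\n";
echo $attrs['title'], "\n";
```

Because each match carries its own attribute name, the order of the attributes in the tag does not matter.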

Here's a solution in PHP:

Just download QueryPath, and then do as follows:

$doc= qp($myHtmlDoc);


foreach($doc->xpath('//img') as $img) {


$src= $img->attr('src');
$title= $img->attr('title');
$alt= $img->attr('alt');


}

That's it, you're done!

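Here's a PHP function I cobbled together from all of the above info for a similar purpose, namely adjusting image tag width and height attributes on the fly. It is a bit clunky, perhaps, but seems to work dependably: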

function ReSizeImagesInHTML($HTMLContent,$MaximumWidth,$MaximumHeight) {


// find image tags
preg_match_all('/<img[^>]+>/i',$HTMLContent, $rawimagearray,PREG_SET_ORDER);


// put image tags in a simpler array
$imagearray = array();
for ($i = 0; $i < count($rawimagearray); $i++) {
array_push($imagearray, $rawimagearray[$i][0]);
}


// put image attributes in another array
$imageinfo = array();
foreach($imagearray as $img_tag) {


preg_match_all('/(src|width|height)=("[^"]*")/i',$img_tag, $imageinfo[$img_tag]);
}


// combine everything into one array
$AllImageInfo = array();
foreach($imagearray as $img_tag) {


$ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
$OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
$OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);


$NewWidth = $OrignialWidth;
$NewHeight = $OrignialHeight;
$AdjustDimensions = "F";


if($OrignialWidth > $MaximumWidth) {
$diff = $OrignialWidth-$MaximumWidth;
$percnt_reduced = (($diff/$OrignialWidth)*100);
$NewHeight = floor($OrignialHeight-(($percnt_reduced*$OrignialHeight)/100));
$NewWidth = floor($OrignialWidth-$diff);
$AdjustDimensions = "T";
}


if($OrignialHeight > $MaximumHeight) {
$diff = $OrignialHeight-$MaximumHeight;
$percnt_reduced = (($diff/$OrignialHeight)*100);
$NewWidth = floor($OrignialWidth-(($percnt_reduced*$OrignialWidth)/100));
$NewHeight= floor($OrignialHeight-$diff);
$AdjustDimensions = "T";
}


$thisImageInfo = array('OriginalImageTag' => $img_tag , 'ImageSource' => $ImageSource , 'OrignialWidth' => $OrignialWidth , 'OrignialHeight' => $OrignialHeight , 'NewWidth' => $NewWidth , 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
array_push($AllImageInfo, $thisImageInfo);
}


// build array of before and after tags
$ImageBeforeAndAfter = array();
for ($i = 0; $i < count($AllImageInfo); $i++) {


if($AllImageInfo[$i]['AdjustDimensions'] == "T") {
$NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
$NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);


$thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'] , 'NewImageTag' => $NewImageTag);
array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
}
}


// execute search and replace
for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
$HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'],$ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
}


return $HTMLContent;


}

You may use simplehtmldom. Most jQuery selectors are supported in simplehtmldom. An example is given below:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');


// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';


// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';

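I have read the many comments on this page complaining that using a dom parser is unnecessary overhead. Well, it may be more expensive than a mere regex call, but the OP has stated that there is no control over the order of the attributes in the img tags. This fact leads to unnecessary regex pattern convolution. Beyond that, using a dom parser provides the additional benefits of readability, maintainability, and dom-awareness (regex is not dom-aware).

I love regex and I answer lots of regex questions, but when dealing with valid HTML there is seldom a good reason to choose regex over a parser.

In the demonstration below, see how easily and cleanly DOMDocument handles img tag attributes in any order, with a mixture of quoting styles (and no quoting at all). Also notice that tags without a targeted attribute are not disruptive at all; an empty string is provided as the value.

Code: (Demo)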

$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;


libxml_use_internal_errors(true);  // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
echo "IMG#{$i}:\n";
echo "\tsrc = " , $img->getAttribute('src') , "\n";
echo "\ttitle = " , $img->getAttribute('title') , "\n";
echo "\talt = " , $img->getAttribute('alt') , "\n";
echo "---\n";
}

Output:

IMG#0:
src = /image/fluffybunny.jpg
title = Harvey the bunny
alt = a cute little fluffy bunny
---
IMG#1:
src = /image/pricklycactus.jpg
title = Roger the cactus
alt = a big green prickly cactus
---
IMG#2:
src = /image/noisycockatoo.jpg
title = Polly the cockatoo
alt = an annoying white cockatoo
---
IMG#3:
src = somethingelse
title = something
alt =
---

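Using this technique in professional code will leave you with a clean script, fewer hiccups to contend with, and fewer colleagues who wish you worked somewhere else.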