DOMDocument::loadHTML
(PHP 5, PHP 7, PHP 8)
DOMDocument::loadHTML - Load HTML from a string
Description
This function parses the HTML contained in the string source. Unlike loading XML, the HTML does not have to be well-formed to load.
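This function parses the input using an HTML 4 parser. Its rules differ from the HTML 5 parsing rules used by modern web browsers, so for some inputs the resulting DOM structure will be different. For that reason this function cannot be used safely to sanitize HTML.
The behaviour when parsing HTML depends on the version of libxml in use, particularly with regard to edge cases and error handling.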
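For parsing that conforms to HTML5, use Dom\HTMLDocument::createFromString() or Dom\HTMLDocument::createFromFile(), added in PHP 8.4.
As an example, some HTML elements implicitly close their parent element when encountered. The rules for automatically closing parent elements differ between HTML 4 and HTML 5, so the DOM structure that DOMDocument sees may differ from the one a web browser builds. This may allow an attacker to break the resulting HTML.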
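The difference can be observed by parsing the same string with both parsers. The following is a minimal sketch, assuming PHP 8.4 or later for the second half; the sample markup and variable names are illustrative. No particular output is claimed, because the legacy result depends on the libxml version.

```php
<?php
// Illustrative input: in HTML5, <section> implicitly closes an open <p>.
$html = '<!DOCTYPE html><html><body><p>one<section>two</section></body></html>';

// Legacy HTML 4 parser. Buffer libxml errors instead of emitting warnings.
$previous = libxml_use_internal_errors(true);
$legacy = new DOMDocument();
$legacy->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
echo "DOMDocument:\n", $legacy->saveHTML(), "\n";

// HTML5 parser, only available from PHP 8.4 onwards.
if (class_exists('Dom\\HTMLDocument')) {
    $modern = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
    echo "Dom\\HTMLDocument:\n", $modern->saveHtml(), "\n";
}
```

Comparing where `<section>` ends up relative to `<p>` in the two serializations shows whether the parsers disagree for this input.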
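Errors/Exceptions
If an empty string is passed as the source, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.
While malformed HTML should load successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.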
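As a rough sketch of handling these errors with the libxml functions: the markup below is deliberately malformed and purely illustrative, and which errors are reported (if any) depends on the libxml version.

```php
<?php
// Deliberately malformed, illustrative markup.
$html = '<html><body><p>Unclosed paragraph<div></span></body></html>';

// Buffer libxml errors internally and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the buffer and restore the caller's setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Restoring the previous setting keeps the change local, which matters when the code runs inside a larger application that relies on the default behaviour.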
Changelog
Version | Description |
---|---|
8.3.0 | This function now has a tentative bool return type. |
8.0.0 | Calling this function statically will now throw an Error. Previously, an E_DEPRECATED was raised. |
Examples
Example #1 Creating a Document
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>
See Also
- DOMDocument::loadHTMLFile() - Load HTML from a file
- DOMDocument::saveHTML() - Dumps the internal document into a string using HTML formatting
- DOMDocument::saveHTMLFile() - Dumps the internal document into a file using HTML formatting
User Contributed Notes 19 notes
You can also load HTML as UTF-8 using this simple hack:
<?php
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// dirty fix
foreach ($doc->childNodes as $item)
if ($item->nodeType == XML_PI_NODE)
$doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper
?>
If you are loading HTML content from a website in UTF-8 encoding and the meta tag with the content-type is not the first child of HEAD, the parser will not pick up the encoding. You can work around this as follows:
function domLoadHTML($html)
{$testDOM = new DOMDocument('1.0', 'UTF-8');
$testDOM->loadHTML($html);
$charset = NULL;
$searchInElemnt = function(&$item) use (&$searchInElemnt, &$charset)
{if($item->childNodes)
{foreach($item->childNodes as $childItem)
{switch($childItem->nodeName)
{case 'html':
case 'head':
$searchInElemnt($childItem);
break;
case 'meta':
$attributes = array();
foreach ($childItem->attributes as $attr)
{$attributes[mb_strtoupper($attr->localName)] = $attr->nodeValue;
}
if(array_key_exists('HTTP-EQUIV', $attributes) && (mb_strtoupper($attributes['HTTP-EQUIV']) == 'CONTENT-TYPE') && array_key_exists('CONTENT', $attributes) && preg_match('~[\s]*;[\s]*charset[\s]*=[\s]*([^\s]+)~', $attributes['CONTENT'], $matches))
{$charset = preg_replace('~[\s\']~', '', $matches[1]);
}
}
}
}
};
$searchInElemnt($testDOM);
if(isset($charset))
{$dom = new DOMDocument('1.0', $charset);
$dom->loadHTML('<?xml encoding="'.$charset.'">'.$html);
foreach ($dom->childNodes as $item)
if($item->nodeType == XML_PI_NODE)
{$dom->removeChild($item);
}
$dom->encoding = $charset;
}
else
{$dom = $testDOM;
}
return $dom;
};
DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.
This isn't well documented here. The solution to this is to implement a separate apparatus for dealing with just these errors.
Set libxml_use_internal_errors(true) before calling loadHTML. This will prevent errors from bubbling up to your default error handler. And you can then get at them (if you desire) using other libxml error functions.
You can find more info here http://www.php.net/manual/en/ref.libxml.php
When using loadHTML() to process UTF-8 pages, the output of DOM functions may not match the input. For example, where you expect "Cạnh tranh", you may receive mojibake such as "Cáº¡nh tranh". I suggest running mb_convert_encoding before loading a UTF-8 page:
<?php
$pageDom = new DomDocument();
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
@$pageDom->loadHTML($searchPage);
?>
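Note that the 'HTML-ENTITIES' encoding name is deprecated in mbstring as of PHP 8.2. One possible substitute, sketched here under the assumption that the mbstring extension is loaded, is mb_encode_numericentity(); the sample page and the `$convmap` variable are illustrative.

```php
<?php
$htmlUTF8Page = '<html><body><p>Cạnh tranh</p></body></html>'; // illustrative input

// Encode every code point above ASCII as a decimal numeric entity.
// Convmap format: [first code point, last code point, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$searchPage = mb_encode_numericentity($htmlUTF8Page, $convmap, 'UTF-8');

$pageDom = new DOMDocument();
$pageDom->loadHTML($searchPage);

// DOM string properties are returned as UTF-8 regardless of the input encoding.
$text = $pageDom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Because the string handed to the parser is pure ASCII, the result no longer depends on which encoding the parser guesses.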
To load HTML5 markup without warnings, suppress libxml error reports by passing `LIBXML_NOERROR` as the options argument of loadHTML. This only silences the errors; the parser is still the HTML 4 one.
Example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><section>I'M UNSUPPORTED</section></body></html>", LIBXML_NOERROR);
echo $doc->saveHTML();
?>
Pay attention when loading html that has a different charset than iso-8859-1. Since this method does not actively try to figure out what the html you are trying to load is encoded in (like most browsers do), you have to specify it in the html head. If, for instance, your html is in utf-8, make sure you have a meta tag in the html's head section:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>
If you do not specify the charset like this, all high-ascii bytes will be html-encoded. It is not enough to set the dom document you are loading the html in to UTF-8.
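When the markup comes without such a tag, one workaround is to prepend it before parsing. A minimal sketch, assuming the input is known to be UTF-8; the fragment and variable names are illustrative.

```php
<?php
$fragment = '<p>Cạnh tranh</p>'; // illustrative UTF-8 input without a charset declaration

// Prepend the HTML 4 style declaration so the parser switches to UTF-8
// before it reaches any non-ASCII bytes.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$doc = new DOMDocument();
$doc->loadHTML($meta . $fragment);

$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

The prepended meta element stays in the document's head, so remove it again if the serialized output must not contain it.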
If we load HTML5 tags such as <section> or <svg>, the following error is raised:
DOMDocument::loadHTML(): Tag section invalid in Entity
We can disable standard libxml errors (and enable user error handling) using libxml_use_internal_errors(true); before loadHTML();
This is quite useful in phpunit custom assertions as given in following example (if using phpunit test cases):
// Create a DOMDocument
$dom = new DOMDocument();
// fix html5/svg errors
libxml_use_internal_errors(true);
// Load html
$dom->loadHTML("<section></section>");
$htmlNodes = $dom->getElementsByTagName('section');
if ($htmlNodes->length == 0) {
$this->assertFalse(TRUE);
} else {
$this->assertTrue(TRUE);
}
loadHTML() and loadHTMLFile() may generate warnings whenever the HTML includes tags introduced in HTML5, such as nav, section, or footer (observed in PHP 8.1.6).
Try running the code below.
<?php
$file_name = 'PHP Runtime Configuration - Manual.html'; // Download this file from "https://www.php.net/manual/en/session.configuration.php" in advance.
$doc = new DOMDocument();
$doc->loadHTMLFile($file_name); // if set "LIBXML_NOERROR" as 2nd arg, no error
echo $doc->saveHTML();
// Warning: DOMDocument::loadHTMLFile(): Tag nav invalid in PHP Runtime Configuration - Manual.html, line: 63 in D:\xampp\htdocs\test\xml(dom)\loadHTML\index.php on line 6
?>
Warning: This does not function well with HTML5 elements such as SVG. Most of the advice on the Web is to turn off errors in order to have it work with HTML5.
Remember: If you use an HTML5 doctype and a meta element like so
<meta charset="utf-8">
your HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities. However, the HTML4-like version will work (as was pointed out 10 years ago by "bigtree at 29a"):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Be aware that this function doesn't actually understand HTML -- it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup, but has no idea which element contexts are allowed.
For example, with input like this where the first element isn't closed:
<span>hello <div>world</div>
loadHTML will change it to this, which is well-formed but invalid:
<span>hello <div>world</div></span>
Note that the elements of such a document will have no namespace, even with <html xmlns="http://www.w3.org/1999/xhtml">
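The missing namespace can be checked directly on the root element. A small sketch with illustrative markup:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Test</p></body></html>');

$root = $doc->documentElement;
echo $root->nodeName, "\n";
// Namespace of <html> as seen by DOMDocument after loadHTML().
var_dump($root->namespaceURI);
```

This matters mostly for XPath queries and namespace-aware methods, which have to address such elements without a namespace.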
It should be noted that when any text is provided within the body tag
outside of a containing element, the DOMDocument will encapsulate that
text into a paragraph tag (<p>).
For example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>
will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Test<br></p>
<div>Text</div>
</body></html>
while:
<?php
$doc = new DOMDocument();
$doc->loadHTML(
"<html><body><i>Test</i><br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>
will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<i>Test</i><br><div>Text</div>
</body></html>
For those of you who want to get elements by class from an external URL, I have 2 useful functions. In this example we get the '<h3 class="r">' elements back (search result headers) from a Google search:
1. Check the URL (if it is reachable, existing)
<?php
# URL Check
function url_check($url) {
$headers = @get_headers($url);
return is_array($headers) ? preg_match('/^HTTP\\/\\d+\\.\\d+\\s+2\\d\\d\\s+.*$/',$headers[0]) : false;
};
?>
2. Clean the element you want to get (remove all tags, tabs, new-lines etc.)
<?php
# Function to clean a string
function clean($text){
// Strip tags, collapse whitespace, replace ';' with '-', then decode entities
return html_entity_decode(trim(str_replace(';','-',preg_replace('/\s+/S', " ", strip_tags($text)))));
}
?>
After doing that, we can output the search result headers with following method:
<?php
$searchstring = 'djceejay';
$url = 'http://www.google.de/webhp#q='.$searchstring;
if(url_check($url)){
$doc = new DomDocument;
$doc->validateOnParse = true;
$doc->loadHtml(file_get_contents($url));
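// DOMDocument has no getElementByClass() method; query by class with XPath instead
$xpath = new DOMXPath($doc);
$node = $xpath->query('//h3[contains(concat(" ", normalize-space(@class), " "), " r ")]')->item(0);
$output = $node ? clean($node->textContent) : '';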
echo $output . '<br>';
}else{
echo 'URL not reachable!';// Throw message when URL not be called
}
?>
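Selecting elements by class is done with DOMXPath, since DOMDocument offers no class lookup of its own. The following self-contained sketch uses inline, illustrative markup in place of a downloaded page.

```php
<?php
// Illustrative markup standing in for a downloaded page.
$html = '<html><body><h3 class="r first">Result one</h3><h3 class="other">Skip</h3></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Match "r" as a whole word inside the space-separated class attribute.
$query = '//h3[contains(concat(" ", normalize-space(@class), " "), " r ")]';

$titles = [];
foreach ($xpath->query($query) as $node) {
    $titles[] = trim($node->textContent);
}
print_r($titles);
```

Padding the class attribute with spaces avoids false matches on class names that merely contain the letter, such as "other".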
Using loadHTML() automagically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or by default to 4.0 Transitional). If you set the doctype with DOMImplementation, it will be overridden.
I assumed it was possible to set it and then load HTML with the doctype I defined (in order to decide the doctype at runtime), and ran into a huge headache trying to find out where my doctype was going. Hopefully this helps someone else.
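The doctype that results from a load can be inspected through the doctype property. A short sketch with illustrative markup, using LIBXML_HTML_NODEFDTD to keep the parser from adding a default doctype:

```php
<?php
// No doctype in the input and LIBXML_HTML_NODEFDTD set: none is added.
$plain = new DOMDocument();
$plain->loadHTML('<html><body><p>Test</p></body></html>', LIBXML_HTML_NODEFDTD);
var_dump($plain->doctype);

// A doctype present in the input is the one the document ends up with.
$typed = new DOMDocument();
$typed->loadHTML('<!DOCTYPE html><html><body><p>Test</p></body></html>');
echo $typed->doctype->name, "\n";
```

To choose the doctype at runtime, one option is therefore to put it into the HTML string before calling loadHTML().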
If you want to get rid of all the "DOMText elements containing ONLY whitespace", maybe try
<?php
function loadHTML_noemptywhitespace(string $html, int $extra_flags = 0, int $exclude_flags = 0): DOMDocument
{
$flags = LIBXML_HTML_NODEFDTD | LIBXML_NOBLANKS | LIBXML_NONET;
$flags = ($flags | $extra_flags) & ~ $exclude_flags;
$domd = new DOMDocument();
$domd->preserveWhiteSpace = false;
@$domd->loadHTML('<?xml encoding="UTF-8">' . $html, $flags);
$removeAnnoyingWhitespaceTextNodes = function (\DOMNode $node) use (&$removeAnnoyingWhitespaceTextNodes): void {
if ($node->hasChildNodes()) {
// Warning: it's important to do it backwards; if you do it forwards, the index for DOMNodeList might become invalidated;
// that's why i don't use foreach() - don't change it (unless you know what you're doing, ofc)
for ($i = $node->childNodes->length - 1; $i >= 0; --$i) {
$removeAnnoyingWhitespaceTextNodes($node->childNodes->item($i));
}
}
if ($node->nodeType === XML_TEXT_NODE && !$node->hasChildNodes() && !$node->hasAttributes() && trim($node->textContent) === '') { // strict comparison: empty() would also remove a text node containing "0"
//echo "Removing annoying POS";
// var_dump($node);
$node->parentNode->removeChild($node);
} //elseif ($node instanceof DOMText) { echo "not removed"; var_dump($node, $node->hasChildNodes(), $node->hasAttributes(), trim($node->textContent)); }
};
$removeAnnoyingWhitespaceTextNodes($domd);
return $domd;
}
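An alternative sketch of the same cleanup uses an XPath query to collect the whitespace-only text nodes instead of walking the tree recursively. The markup and variable names are illustrative.

```php
<?php
$html = "<html><body>\n  <p>keep</p>\n  <p> 0 </p>\n</body></html>"; // illustrative input

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
// Select text nodes that are empty once whitespace is normalized,
// and copy them into an array before removing anything.
$blank = iterator_to_array($xpath->query('//text()[normalize-space(.) = ""]'));
foreach ($blank as $node) {
    $node->parentNode->removeChild($node);
}
echo $doc->saveHTML();
```

Text nodes with real content, including one holding only "0" plus spaces, are left in place because normalize-space() does not reduce them to an empty string.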