While hanging out in the Freenode ##php channel, a user came along asking how to create a table of contents from an HTML string by extracting the header elements. Many suggestions came through, including regular expressions and a jQuery solution. Using regular expressions to parse HTML in PHP is a bad idea and gets messy quickly. You can see what I mean by looking at the PHP implementation section towards the bottom of this page. That approach may work, and it may once have been the only option available, but we now have better tools that get the job done cleanly and in far fewer lines of code.
So, what’s our new alternative? It’s the Document Object Model, which PHP exposes through the DOMDocument class. It’s powerful, easy to use, and a great alternative to regular expressions or functions designed for XML.
So, you want to parse all the H2 tags from a website and generate a table of contents? No sweat! Check it out:
    /**
     * Use whatever method you like to grab your HTML string. We'll use a site I mentioned in this post
     * as an example.
     */
    $remote_site_data = file_get_contents('http://www.10stripe.com/articles/automatically-generate-table-of-contents-php.php');

    $dom_document = new DOMDocument();

    // Real-world HTML is rarely perfectly valid, so suppress the parser warnings.
    @$dom_document->loadHTML($remote_site_data);

    $headers = $dom_document->getElementsByTagName('h2');

    // Initialize the array so it is defined even when no H2 elements are found.
    $table_of_contents = [];

    foreach ($headers as $header) {
        $table_of_contents[] = trim($header->nodeValue);
    }

    var_dump($table_of_contents);

    /**
    array(5) {
      [0]=>
      string(10) "Motivation"
      [1]=>
      string(11) "This script"
      [2]=>
      string(12) "How it works"
      [3]=>
      string(3) "Fin"
      [4]=>
      string(17) "About the Author"
    }
    */
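The `@` operator in the snippet above hides every warning that `loadHTML()` raises on imperfect markup. If you would rather see what the parser complained about, one option is `libxml_use_internal_errors()`. This is only a sketch: the `$html` string is made-up sample markup standing in for a fetched page, and `$parse_errors` is a name I picked for illustration.

```php
<?php
// Made-up sample markup standing in for a fetched page (an assumption).
$html = '<html><body><h2> Motivation </h2><p>Text</p><h2>How it works</h2></body></html>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous_setting = libxml_use_internal_errors(true);

$dom_document = new DOMDocument();
$dom_document->loadHTML($html);

// Keep the collected errors around for logging, then clear the buffer
// and restore whatever error handling was active before.
$parse_errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous_setting);

$table_of_contents = [];
foreach ($dom_document->getElementsByTagName('h2') as $header) {
    $table_of_contents[] = trim($header->nodeValue);
}

print_r($table_of_contents);
```

The parsing result is the same as with `@`; the difference is that `$parse_errors` holds an array of `LibXMLError` objects you can log or ignore as you see fit.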
You can do whatever you’d like with the resulting array. You may want to grab only the H2 elements of a particular class, and it’s easy to alter our code to do just that. Just add a conditional before adding the header to your array:
    foreach ($headers as $header) {
        // Only keep headers whose class attribute is exactly 'specialclassname'.
        if ($header->getAttribute('class') == 'specialclassname') {
            $table_of_contents[] = trim($header->nodeValue);
        }
    }
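One caveat: the comparison above matches only when the class attribute is exactly `specialclassname`, so a header marked up as `class="specialclassname intro"` would be skipped. If your markup uses multiple classes, a `DOMXPath` query is one way around that. The sketch below runs against made-up sample markup, and the XPath expression is the common idiom for matching one class token among several, not anything specific to this post.

```php
<?php
// Made-up sample markup (an assumption) with single, multiple, and missing classes.
$html = '<html><body>'
      . '<h2 class="specialclassname intro">First</h2>'
      . '<h2 class="other">Second</h2>'
      . '<h2 class="specialclassname">Third</h2>'
      . '<h2>Fourth</h2>'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom_document);

// Pad the normalized class list with spaces so that ' specialclassname '
// matches a whole class token and never a substring of a longer name.
$query = "//h2[contains(concat(' ', normalize-space(@class), ' '), ' specialclassname ')]";

$table_of_contents = [];
foreach ($xpath->query($query) as $header) {
    $table_of_contents[] = trim($header->nodeValue);
}

print_r($table_of_contents);
```

With this sample markup, only the "First" and "Third" headers end up in the array.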
As you can see, DOMDocument is a powerful and flexible tool that lets you load a piece of HTML and manipulate it however you like.
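As a rough illustration of that kind of manipulation, the sketch below gives each H2 an `id` attribute and builds a linked table of contents pointing at those ids. The sample markup, the `section-N` naming scheme, and the variable names are all assumptions made for the example.

```php
<?php
// Made-up sample markup standing in for a real page (an assumption).
$html = '<html><body><h2>Motivation</h2><p>Why.</p><h2>How it works</h2><p>Details.</p></body></html>';

libxml_use_internal_errors(true);
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
libxml_clear_errors();

$toc_links = [];
$section_number = 0;

foreach ($dom_document->getElementsByTagName('h2') as $header) {
    $section_number++;

    // Hypothetical id scheme: section-1, section-2, ...
    $anchor = 'section-' . $section_number;
    $header->setAttribute('id', $anchor);

    // Escape the title because it is being placed back into HTML.
    $title = htmlspecialchars(trim($header->nodeValue));
    $toc_links[] = '<li><a href="#' . $anchor . '">' . $title . '</a></li>';
}

// The table of contents as an HTML list, plus the document with ids added.
$toc_html = "<ul>\n" . implode("\n", $toc_links) . "\n</ul>";
$updated_html = $dom_document->saveHTML();

echo $toc_html, "\n";
```

Here `$updated_html` holds the page with the new `id` attributes in place, so the links in `$toc_html` have targets to jump to.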