How do you parse and process HTML/XML in PHP?

How can one parse HTML/XML and extract information from it?

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than 3rd-party libs, and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of documents.

DOM is capable of parsing and modifying real-world (broken) HTML, and it can run XPath queries. It is based on libxml.
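To make the "broken HTML plus XPath" point concrete, here is a minimal sketch. The sample markup and variable names are made up for illustration; only the DOM extension calls themselves are standard.

```php
<?php
// Deliberately broken markup: <p>, <body>, <html> and the last <a> are never closed.
$html = '<html><body><p>See <a href="/docs">the docs</a> and <a href="/faq">the FAQ';

// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // loadHTML() uses libxml's HTML parser, which repairs the tree
libxml_clear_errors();

// Query the repaired document with XPath: every href attribute of every <a>.
$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

Whether you silence libxml errors like this or inspect them with libxml_get_errors() is a judgment call; for scraped pages I would usually just clear them.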

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.

A basic usage example can be found in "Grabbing the href attribute of an A element", and a general conceptual overview can be found at "DOMDocument in php".

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure that most of the issues you run into can be solved by searching/browsing Stack Overflow.

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML parser module, so chances are that using XMLReader for parsing broken HTML will be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.

A basic usage example can be found at "Getting values from h1 tags using PHP".
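As a rough sketch of the cursor model described above (the sample document is invented, and it has to be well-formed XML since this does not go through the HTML parser):

```php
<?php
$xml = '<page><h1>First</h1><p>text</p><h1>Second</h1></page>';

$reader = new XMLReader();
$reader->XML($xml);   // for files, XMLReader::open() takes a path or URI instead

$headings = [];
// read() moves the cursor to the next node and returns false at the end of the stream.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'h1') {
        // readString() returns the text content of the element under the cursor.
        $headings[] = $reader->readString();
    }
}
$reader->close();

print_r($headings);
```

Because only the current node is held in memory, the same loop should keep working on documents far too large to load into DOM.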

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but it will be more difficult to work with than the pull parser implemented by XMLReader.
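To illustrate what "push parser with handlers" means in practice, here is a small sketch. The feed markup and the $titles/$inItem variables are assumptions for the example; note how state has to be tracked by hand, which is the main reason it is harder to work with than XMLReader.

```php
<?php
$xml = '<feed><item>One</item><item>Two</item></feed>';

$titles = [];
$inItem = false;   // handlers must track where in the document they are

$parser = xml_parser_create();
// Keep element names as written; by default they are folded to upper case.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, $name, $attrs) use (&$inItem, &$titles) {
        if ($name === 'item') {
            $inItem = true;
            $titles[] = '';
        }
    },
    // Called for every closing tag.
    function ($parser, $name) use (&$inItem) {
        if ($name === 'item') {
            $inItem = false;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$inItem, &$titles) {
        if ($inItem) {
            $titles[count($titles) - 1] .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```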

SimpleXML

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML because it will choke.

A basic usage example can be found at "A simple program to CRUD node and node values of xml file", and there are lots of additional examples in the PHP Manual.
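A short sketch of both halves of that advice: property/array access on well-formed XML, and the failure on broken markup. The library/book document is invented for the example.

```php
<?php
$xml = <<<XML
<library>
  <book id="a1"><title>Dune</title></book>
  <book id="b2"><title>Emma</title></book>
</library>
XML;

// Report parse problems through libxml's error list instead of PHP warnings.
libxml_use_internal_errors(true);

$library = simplexml_load_string($xml);

$titles = [];
// Child elements are properties, attributes are array keys.
foreach ($library->book as $book) {
    $titles[(string) $book['id']] = (string) $book->title;
}

// Broken markup is not repaired: the loader simply returns false.
$broken = simplexml_load_string('<p>Unclosed <b>tags');
libxml_clear_errors();

print_r($titles);
var_dump($broken);
```

The explicit (string) casts matter: without them you are holding SimpleXMLElement objects, not plain strings.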

3rd-Party Libraries (libxml-based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDOM

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. It can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from the Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery (not updated for years)

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library. It is written in PHP5 and provides an additional command line interface (CLI).

Also see: https://github.com/electrolinux/phpquery

Zend_Dom

Zend_Dom provides tools for working with DOM documents and structures. Currently, we offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

QueryPath

QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files, but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. It can be installed via Composer.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions on all occasions of errors instead of PHP warnings or notices. It also adds various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "XML to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require little memory on large XML files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!

Requires PHP 5+.

Supports invalid HTML.

Find tags on an HTML page with selectors just like jQuery.

Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible, and the parser itself is rather slow and memory hungry. Not all jQuery selectors (such as child selectors) are possible. Any of the libxml-based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible HTML parser which allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape HTML, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser, but the support seems to have stopped, so this project is an adaptation of that previous work.

Again, I would not recommend this parser. It is rather slow, with high CPU usage. There is also no function to clear the memory of created DOM objects. These problems scale particularly with nested loops. The documentation itself is inaccurate and misspelled, with no responses to fixes since 14 Apr 16.

Ganon

A universal tokenizer and HTML/XML/RSS DOM parser

Ability to manipulate elements and their attributes

Supports invalid HTML and UTF8

Can perform advanced CSS3-like queries on elements (like jQuery -- namespaces supported)

An HTML beautifier (like HTML Tidy)

Minify CSS and JavaScript

Sort attributes, change character case, correct indentation, etc.

Extensible

Parsing documents using callbacks based on the current character/token

Operations separated into smaller functions for easy overriding

Fast and easy

Never used it. Can't tell if it's any good.

HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like

html5lib

Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification for maximum compatibility with major desktop web browsers.

We might see more dedicated parsers once HTML5 is finalized. There is also a blog post by the W3C titled "How-To for html 5 parsing" that is worth checking out.

Web Services

If you don't feel like programming PHP, you can also use web services. In general, I found very little utility for these, but that's just me and my use cases.

YQL

The YQL web service enables applications to query, filter, and combine data from different sources across the Internet. YQL statements have a SQL-like syntax, familiar to any developer with database experience.

ScraperWiki

ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the regex fail when it is not properly written. You should know what you are doing before using regex on HTML.
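Here is a small sketch of that brittleness. The pattern is a made-up stand-in for the kind of snippet people copy from the web, and hrefsViaDom() is a hypothetical helper written for this comparison, not part of any library.

```php
<?php
// Expects href to be the only attribute, written with double quotes.
$pattern = '~<a href="([^"]*)">~';

$original = '<a href="/home">Home</a>';
// The same link after a harmless markup change: extra attribute, single quotes.
$changed = "<a class='nav' href='/home'>Home</a>";

$regexOriginal = preg_match($pattern, $original); // 1: the pattern matches
$regexChanged  = preg_match($pattern, $changed);  // 0: the link is silently missed

// Hypothetical helper: the same extraction done with the DOM extension.
function hrefsViaDom(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

$domOriginal = hrefsViaDom($original);
$domChanged  = hrefsViaDom($changed);

var_dump($regexOriginal, $regexChanged, $domOriginal, $domChanged);
```

The parser returns the same result for both inputs because quoting style and attribute order are handled by the HTML parser, not by your pattern.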

HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules for each new regex you write. Regex are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job of this.

Also see "Parsing Html The Cthulhu Way".

Books

If you want to spend some money, have a look at

PHP Architect's Guide to Webscraping with PHP

I am not affiliated with PHP Architect or the authors.
