Extracting parts of an HTML document
From ActiveArchives
The html5lib parser turns the source text of an HTML page into a structured tree object. Once the page is parsed, you can use, for instance, CSS selectors or xpath expressions to select and extract portions of it.
You can use xpath expressions:
    import html5lib, lxml

    htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
    htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
    page = htmlparser.parse(htmlsource)
    # xpath() returns a list of matches; take the first one if there is any
    p = page.xpath("/html/body/p[2]")
    if p:
        p = p[0]
        print("".join(p.itertext()))
outputs: More stuff.
CSS selectors work as well:
    import html5lib, lxml, lxml.cssselect

    htmlsource = "<html><body><p>Example page.</p><p>More stuff.</p></body></html>"
    htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
    page = htmlparser.parse(htmlsource)
    selector = lxml.cssselect.CSSSelector("p")
    # calling the selector on the page returns every matching element
    for p in selector(page):
        print("-" * 20)
        print("".join(p.itertext()))
outputs:

    --------------------
    Example page.
    --------------------
    More stuff.
Function that takes a URL + xpath
NB: the function returns a list of matching fragments, since an xpath can match more than one element. So, if you expect only one result, use [0] to pull off the first (single) item. lxml.etree.tostring is used to re-serialize the result.
    import urllib.request, html5lib, lxml, lxml.etree

    def getXpath(url, xpath):
        htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
        request = urllib.request.Request(url)
        # send a browser-like User-Agent header with the request
        request.add_header("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5")
        f = urllib.request.urlopen(request)
        page = htmlparser.parse(f)
        return page.xpath(xpath)

    if __name__ == "__main__":
        url = "http://www.jabberwocky.com/carroll/walrus.html"
        xpath = "/html/body/p[6]"
        # encoding="unicode" makes tostring return text rather than bytes
        print(lxml.etree.tostring(getXpath(url, xpath)[0], encoding="unicode"))
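Indexing the result with [0] raises an IndexError when the xpath matches nothing, for example when a page has fewer paragraphs than expected. Below is a minimal sketch of a guard, written for Python 3 and run on an inline HTML string so that it needs no network access. The names parse_html and first_match are hypothetical, and the sketch assumes the xpath selects elements (an expression such as count(...) returns a number, not a list).

```python
import html5lib
import lxml.etree


def parse_html(source):
    """Parse an HTML string or file-like object into an lxml tree."""
    parser = html5lib.HTMLParser(
        tree=html5lib.treebuilders.getTreeBuilder("lxml"),
        namespaceHTMLElements=False,
    )
    return parser.parse(source)


def first_match(page, xpath):
    """Return the first element matching xpath, or None if nothing matches."""
    matches = page.xpath(xpath)
    return matches[0] if matches else None


page = parse_html("<html><body><p>One.</p><p>Two.</p></body></html>")

hit = first_match(page, "/html/body/p[2]")   # the second paragraph exists
miss = first_match(page, "/html/body/p[6]")  # there is no sixth paragraph

if hit is not None:
    print(lxml.etree.tostring(hit, encoding="unicode"))
print(miss)
```

Returning None keeps the "no match" case explicit, so the caller decides whether a missing fragment is an error.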
Function that takes a URL + CSS selector
    import urllib.request, html5lib, lxml, lxml.etree, lxml.cssselect

    def getCSS(url, selector):
        htmlparser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"), namespaceHTMLElements=False)
        request = urllib.request.Request(url)
        # send a browser-like User-Agent header with the request
        request.add_header("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; fr; rv:1.9.1.5) Gecko/20091109 Ubuntu/9.10 (karmic) Firefox/3.5.5")
        f = urllib.request.urlopen(request)
        page = htmlparser.parse(f)
        # compile the CSS selector text, then apply it to the parsed page
        css_selector = lxml.cssselect.CSSSelector(selector)
        return list(css_selector(page))

    # TEST
    if __name__ == "__main__":
        url = "http://www.jabberwocky.com/carroll/walrus.html"
        print(lxml.etree.tostring(getCSS(url, "p")[0], encoding="unicode"))