Parsing HTML using BeautifulSoup
BeautifulSoup is an HTML parsing library that allows you to “select” various DOM elements, and extract their attributes and text contents.
Video tutorial
To get started, you can watch this cheffing video tutorial that shows the basic steps of using requests and BeautifulSoup for crawling a website.
See the sushi-chef-shls code repo for the final version of the web crawling code that was used for this content source.
Scraping 101
The basic code to GET the HTML source of a webpage and parse it:
import requests
from bs4 import BeautifulSoup
url = 'https://somesite.edu'
html = requests.get(url).content
doc = BeautifulSoup(html, "html5lib")
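A slightly more defensive variant of the snippet above may be useful when crawling many pages. This is only a sketch: the helper name get_parsed_html and the 10-second timeout are illustrative assumptions, not part of requests or BeautifulSoup. It adds a timeout and a status check so that an error page is not silently parsed as content.

```python
import requests
from bs4 import BeautifulSoup


def get_parsed_html(url, parser="html5lib", timeout=10):
    """Hypothetical helper: download `url` and return it as a BeautifulSoup tree."""
    # Without a timeout, requests can wait indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, parser)
```

It would be called as doc = get_parsed_html('https://somesite.edu'). Passing parser="html.parser" selects Python's built-in parser if html5lib is not installed, though the two parsers can produce different trees for malformed HTML.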
You can now call the doc.find and doc.find_all methods to select various DOM elements:
special_ul = doc.find('ul', class_='some-special-class')
section_lis = special_ul.find_all('li', recursive=False) # search only immediate children
for section_li in section_lis:
    print('processing a section <li> right now...')
    print(section_li.prettify())  # useful for seeing the HTML when developing
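To see what recursive=False changes, the following self-contained sketch runs the same selection on a small inline document. The HTML is made up for illustration, and the built-in html.parser is used here only to avoid the html5lib dependency.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a top-level list with one nested sub-list.
html = """
<ul class="some-special-class">
  <li>Section 1
    <ul><li>Subsection 1.1</li></ul>
  </li>
  <li>Section 2</li>
</ul>
"""
doc = BeautifulSoup(html, "html.parser")
special_ul = doc.find("ul", class_="some-special-class")

# Default search: every <li> below special_ul, including nested ones.
all_lis = special_ul.find_all("li")
# recursive=False: only <li> elements that are direct children of special_ul.
top_level_lis = special_ul.find_all("li", recursive=False)

# find returns None when nothing matches, so check before chaining calls.
missing = doc.find("ul", class_="no-such-class")

print(len(all_lis), len(top_level_lis), missing)  # 3 2 None
```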
The most commonly used parts of the BeautifulSoup API are:
.find(tag_name, <spec>): find the first element with tag tag_name that has the attributes specified in <spec> (given as a dictionary). You can also use the shortcut options id and class_ (note the extra underscore). Returns None if nothing matches.
.find_all(tag_name, <spec>): same as above but returns a list of all matching elements. Use the optional keyword argument recursive=False to select only immediate child nodes (instead of including children of children, etc.).
.next_sibling: the node that follows the current one (for badly formatted pages with no useful selectors). Note that this can be a whitespace text node rather than a tag.
.get_text(): extracts the text contents of the node. See also the helper method called get_text that performs additional cleanup of newlines and spaces.
.extract(): removes an element from the DOM tree and returns it.
.decompose(): useful to remove any unwanted DOM elements (same as .extract() but throws away the extracted element).
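The calls listed above can be tried together on a small inline document. This is a minimal sketch: the HTML is invented for illustration and html.parser is used to keep the example dependency-free. It also uses find_next_sibling, a related BeautifulSoup method that skips over text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Invented HTML; note the newline between </h2> and the first <p>.
html = (
    '<div id="main">'
    '<h2>Title</h2>\n'
    '<p>First <b>bold</b> text.</p>'
    '<p class="ad">Buy now</p>'
    '<script>track()</script>'
    '</div>'
)
doc = BeautifulSoup(html, "html.parser")

# Attribute dictionary and the id shortcut select the same element.
main = doc.find("div", {"id": "main"})
assert main is doc.find("div", id="main")

heading = main.find("h2")
# next_sibling returns the very next node, here the newline text node.
whitespace = heading.next_sibling
# find_next_sibling skips text nodes and returns the next matching tag.
first_p = heading.find_next_sibling("p")
first_p_text = first_p.get_text()

# extract() removes the element from the tree and hands it back.
ad = main.find("p", class_="ad").extract()
# decompose() removes the element and discards it.
main.find("script").decompose()

# Join the remaining text pieces with single spaces, stripping each piece.
remaining_text = main.get_text(" ", strip=True)
print(first_p_text)
print(remaining_text)
```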
Example 1
Here is some sample code for getting the text of the LE mission statement:
from bs4 import BeautifulSoup
from ricecooker.utils.downloader import read
url = 'https://learningequality.org/'
html = read(url)
doc = BeautifulSoup(html, 'html5lib')
main_div = doc.find('div', {'id': 'body-content'})
mission_el = main_div.find('h3', class_='mission-state')
mission = mission_el.get_text().strip()
print(mission)
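Because find returns None when an element is missing, the chained calls in Example 1 raise an AttributeError if the page layout changes. One possible guard is sketched below. The function name get_mission and the sample HTML are illustrative assumptions that only mirror the id and class used above.

```python
from bs4 import BeautifulSoup


def get_mission(doc):
    """Hypothetical helper: return the mission text, or None if the markup is missing."""
    main_div = doc.find("div", {"id": "body-content"})
    if main_div is None:
        return None
    mission_el = main_div.find("h3", class_="mission-state")
    if mission_el is None:
        return None
    return mission_el.get_text().strip()


# Made-up HTML that mimics the structure Example 1 expects.
sample_html = (
    '<div id="body-content">'
    '<h3 class="mission-state">  Example mission text  </h3>'
    '</div>'
)
found = get_mission(BeautifulSoup(sample_html, "html.parser"))
not_found = get_mission(BeautifulSoup("<p>redesigned page</p>", "html.parser"))
print(found, not_found)
```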
Example 2
To print a list of all the links on the page, use the following code:
links = doc.find_all('a', href=True)  # skip <a> tags that have no href attribute
for link in links:
    print(link.get_text().strip(), '-->', link['href'])
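When the links are going to be followed by a crawler, relative href values usually need to be resolved against the page URL first. The sketch below does this with urljoin from the standard library. The base URL and the HTML are placeholders chosen for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://somesite.edu/courses/"  # placeholder page URL
html = (
    '<a href="/about">About</a>'
    '<a href="unit1.html"> Unit 1 </a>'
    '<a name="top">Anchor without href</a>'
    '<a href="https://example.org/x">External</a>'
)
doc = BeautifulSoup(html, "html.parser")

resolved_links = []
# href=True keeps only <a> tags that actually have an href attribute.
for link in doc.find_all("a", href=True):
    title = link.get_text().strip()
    absolute_url = urljoin(base_url, link["href"])  # resolve relative paths
    resolved_links.append((title, absolute_url))

for title, absolute_url in resolved_links:
    print(title, "-->", absolute_url)
```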
Further reading
For more info about BeautifulSoup, see the docs.
There are also some excellent tutorials online you can read: