Parsing HTML using BeautifulSoup

BeautifulSoup is an HTML parsing library that allows you to “select” various DOM elements and extract their attributes and text contents.

Video tutorial

To get started, you can watch this cheffing video tutorial, which walks through the basic steps of using requests and BeautifulSoup to crawl a website. See the sushi-chef-shls code repo for the final version of the web crawling code used for this content source.

Scraping 101

The basic code to GET the HTML source of a webpage and parse it:

import requests
from bs4 import BeautifulSoup

url = 'https://somesite.edu'
html = requests.get(url).content
doc = BeautifulSoup(html, "html5lib")
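
If the request fails or the server is picky about clients, a slightly more defensive variant of the same GET can help. This is just a sketch; the User-Agent string and the timeout value are arbitrary choices for illustration, not requirements of any particular site:

import requests
from bs4 import BeautifulSoup

url = 'https://somesite.edu'
response = requests.get(url, timeout=30,
                        headers={'User-Agent': 'Mozilla/5.0 (compatible; sushi-chef crawler)'})
response.raise_for_status()   # raise an exception on 4xx/5xx status codes
doc = BeautifulSoup(response.content, "html5lib")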

You can now call the doc.find and doc.find_all methods to select various DOM elements:

special_ul = doc.find('ul', class_='some-special-class')
section_lis = special_ul.find_all('li', recursive=False)  # search only immediate children
for section_li in section_lis:
    print('processing a section <li> right now...')
    print(section_li.prettify())  # useful for seeing the HTML when developing...

The most commonly used parts of the BeautifulSoup API are listed below (a short combined example follows the list):

  • .find(tag_name, <spec>): find the first occurrence of the tag tag_name whose attributes match <spec> (given as a dictionary); you can also use the shortcut keyword arguments id and class_ (note the extra underscore).

  • .find_all(tag_name, <spec>): same as above but returns a list of all matching elements. Use the optional keyword argument recursive=False to select only immediate child nodes (instead of including children of children, etc.).

  • .next_sibling: the node that immediately follows the current one at the same level of the tree (useful for badly formatted pages with no useful selectors); note that it may be a whitespace text node rather than a tag.

  • .get_text(): extracts the text contents of the node. See also the standalone get_text helper function, which performs additional cleanup of newlines and spaces.

  • .extract(): removes an element from the DOM tree and returns it.

  • .decompose(): removes unwanted DOM elements (same as .extract(), but the removed element is destroyed instead of being returned).
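
As a quick illustration of the last few items, here is a short sketch. The tag names and the sidebar class are invented for the example and will differ on a real site:

# doc is a BeautifulSoup document parsed as in the snippets above
for script in doc.find_all('script'):
    script.decompose()                        # throw away <script> tags we never want in the text

sidebar = doc.find('div', {'class': 'sidebar'})   # hypothetical element for this example
if sidebar is not None:
    sidebar.extract()                         # detach it from the tree but keep it for later use

heading = doc.find('h2')
if heading is not None:
    after_heading = heading.next_sibling      # whatever follows the <h2>; may be a whitespace text node

text = doc.get_text()
clean_text = ' '.join(text.split())           # collapse newlines and runs of spaces
print(clean_text)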

Example 1

Here is some sample code for getting the text of the LE mission statement:

from bs4 import BeautifulSoup
from ricecooker.utils.downloader import read

url = 'https://learningequality.org/'
html = read(url)
doc = BeautifulSoup(html, 'html5lib')

main_div = doc.find('div', {'id': 'body-content'})
mission_el = main_div.find('h3', class_='mission-state')
mission = mission_el.get_text().strip()
print(mission)

Example 2

To print a list of all the links on the page, use the following code:

links = doc.find_all('a', href=True)  # href=True skips <a> tags that have no href attribute
for link in links:
    print(link.get_text().strip(), '-->', link['href'])
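
The href values you get back are often relative paths. To turn them into absolute URLs (for example, so they can be fetched with requests.get), join them with the address the page was fetched from. A minimal sketch, assuming url still holds that address:

from urllib.parse import urljoin

for link in doc.find_all('a', href=True):
    absolute_url = urljoin(url, link['href'])  # resolve relative hrefs against the page URL
    print(link.get_text().strip(), '-->', absolute_url)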

Further reading

For more info about BeautifulSoup, see the docs at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

There are also some excellent tutorials online you can read: