Web scraping (also called data mining, web harvesting, or web data extraction) is the practice of extracting data from webpages by any means other than interacting with an API. Scraping a web page usually involves the same set of steps: using a library to request data from a web server, then querying and parsing the response (usually received as HTML). Web scraping is common in industries that rely heavily on data harvesting, such as e-commerce (comparing prices across different sellers, for example) or services that collect information about users or buyers. In this article, we’ll learn the process of web scraping using Python and BeautifulSoup.
Web scrapers are a great way to process large amounts of data. They let you skip the presentation layer of a webpage (JavaScript, images, and stylesheets) and work with the underlying content directly, without rendering anything in a browser. Of course, scraping webpages should be secondary to using APIs when possible. On the other hand, APIs can be problematic: responses may be inconsistent or poorly documented, or an API might not exist at all. Those are suitable cases for web scraping.
Let us learn the basic mechanics of web scraping: how to use Python to request information from a web server, how to perform basic handling of the server’s response with the help of the BeautifulSoup module, and how to interact with the data received.
Scraping example using BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Request the page and read the response body once (it can only be read a single time)
html = urlopen('https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html')
content = html.read()
print(content)

# Parse the response and print the first h1 tag found on the page
bs = BeautifulSoup(content, 'html.parser')
print(bs.h1)
Reading the snippet from the start, we import the urlopen function from the request module of the urllib package (a collection of modules for working with URLs). The request module defines functions and classes that help in opening URLs. urlopen is a function that opens the URL, provided either as a string (as in this case) or as a Request object. The only argument provided in the snippet is the URL we want to open, but the function accepts more optional parameters. We also import the BeautifulSoup object from the bs4 module. The BeautifulSoup object represents the parsed document and has built-in support for navigating and searching the document in its tree form. One thing to note is that urllib is a standard Python library (it comes prepackaged with your Python distribution), while BeautifulSoup is not and needs to be installed using the Python package manager – pip.
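If BeautifulSoup isn’t installed yet, and to illustrate one of urlopen’s optional parameters, here is a short sketch (the timeout value below is an arbitrary choice):

# Install BeautifulSoup first if needed:
#   pip install beautifulsoup4
from urllib.request import urlopen

# Besides the URL, urlopen accepts optional parameters such as timeout (in seconds)
html = urlopen(
    'https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html',
    timeout=10,
)
print(html.status)   # HTTP status code of the response, e.g. 200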
After running the snippet above we’ll have two results printed. The first one – print(content) – outputs the response body of the requested URL as a bytes object.
b' <!DOCTYPE html>\n<html lang="en" data-uri="archive.cms.cnn.com/_pages/h_48352b37546e12ac41cf938434485a30@published" data-layout-uri="archive.cms.cnn.com/_layouts/layout-with-rail/instances/world-article-v1@published">\n <head><script>if (/MSIE \\d|Trident.*rv:/.test(navigator.userAgent)) {document.write(\'<scr\' + \'ipt src="/js/pollyfills.js"></scr\' + \'ipt>\');}</... and so forth... (it's a really looong file!)
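Note the leading b' in the output: read() gives us raw bytes. A minimal sketch of turning that into a regular string, assuming the content variable from the snippet above and a UTF-8 encoded page:

# content holds the bytes returned by html.read() in the earlier snippet
text = content.decode('utf-8')   # assumes the page is encoded as UTF-8
print(text[:200])                # first 200 characters of the decoded HTML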
The second part of the snippet creates a BeautifulSoup object from the object to parse and the parser itself (the choice of parser usually makes little difference) and tries to access the first instance of the h1 tag found on the page. Keep in mind that this retrieves only the first instance! If we want to access or filter multiple elements, we can use different methods (such as find_all(), more on that later). The object to parse can be a string or an open filehandle, and the parser used here is Python’s built-in html.parser.
<h1 class="headline__text inline-placeholder" data-editable="headlineText" id="maincontent"> NASA announces team of scientists who will study mysterious ‘UFO’ events in the sky </h1>
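Since the object to parse can also be a plain string, here is a small self-contained sketch (the HTML below is made up for illustration):

from bs4 import BeautifulSoup

snippet = "<html><body><h1>Hello</h1><p>World</p></body></html>"
bs = BeautifulSoup(snippet, 'html.parser')

print(bs.h1)             # <h1>Hello</h1>
print(bs.h1.get_text())  # Hello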
Handling scraping results
Data on the internet is more often than not messy, volatile, and unpredictable, so we can run into multiple issues while parsing it. For example, we might provide the wrong URL, the program might have trouble locating the server, or it might fail to retrieve files from the server. We could also try to access a faulty or missing tag or element. In cases like these, the scraper would most probably stop its execution, so we need to handle volatile code and any exceptions that might appear. For example:
from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

try:
    html = urlopen("https://edition.cnmn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html")
except HTTPError as e:
    print("HTTP error.")
except URLError as e:
    print("Server error.")
else:
    bs = BeautifulSoup(html.read(), 'html.parser')
    print(bs.h1)
We’re using the urllib.error module, which defines the exception classes for errors raised while opening URLs. From that module we import HTTPError and URLError. URLError is the base class of the error module and is mostly used for catching unreachable or non-existent servers; HTTPError, on the other hand, is a subclass of URLError, with the difference that it can also be handled as a non-exceptional, file-like return value. The thing to remember when distinguishing these two errors is that URLError is raised on URL errors (duh!) and connectivity issues, while HTTPError is mostly raised on failed authentication and 4xx status codes.
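A small sketch of telling the two apart at runtime – HTTPError carries the status code, while URLError only carries a reason (the URL is the one from the article; whether anything is actually raised depends on the live server):

from urllib.error import HTTPError, URLError
from urllib.request import urlopen

try:
    html = urlopen("https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html")
except HTTPError as e:
    # HTTPError is raised for responses such as 404 or 403 and carries the status code
    print("HTTP error:", e.code, e.reason)
except URLError as e:
    # URLError covers lower-level problems such as an unreachable or non-existent server
    print("Server error:", e.reason)
else:
    print("Request succeeded:", html.status)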
Making sure elements exist
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getPageTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return
    try:
        bs = BeautifulSoup(html.read(), "html.parser")
        title = bs.h1
    except AttributeError as e:
        print("Error finding title")
        return
    print(title)
    return

getPageTitle("https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html")
<h1 class="headline__text inline-placeholder" data-editable="headlineText" id="maincontent"> NASA announces team of scientists who will study mysterious ‘UFO’ events in the sky </h1>
What is going on above with the AttributeError exception?
Let us say the server is up and we retrieve the content of the page successfully. It should be an easy ride from here on, right? Well, we might still encounter issues. For example, what happens if we access a non-existent element or an invalid reference (test this by searching for h6 elements instead of h1)? Unlike the exceptions raised by the urllib.error module, the execution of the program wouldn’t stop; an invalid reference simply returns the None object. In itself, this isn’t that big of a deal. But most of the time developers are unaware that they received a None object, and they will try to access further elements on that object, which triggers an AttributeError. So, in order to handle exceptions on missing attribute references and assignments, we catch the AttributeError object.
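A minimal sketch of that failure mode, using a made-up HTML string for illustration:

from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><body><h1>Only heading</h1></body></html>", 'html.parser')

missing = bs.h6          # there is no h6 tag, so this is None rather than an error
print(missing)           # None
# missing.get_text()     # would raise AttributeError: 'NoneType' object has no attribute 'get_text'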
Scraping tips
The best way to scrape without writing behemoth references (stuff like bs.find_all('p')[2].find_all('span').find('i').find_all('span')) is to investigate and use stylesheets. With CSS we get a much better way of differentiating HTML elements, especially with additional markup. Two div elements could be hard to distinguish, but if one of them has a class or an id we can target it directly and skip referencing elements by their index in the tree (like above). It is really easy to separate tags based on different classes: the first block of elements in the snippet below is miles easier to scrape than the second, as the short sketch after it shows.
<div class="sidebar"></div>
<div class="main" id="main"></div>
<div class="footer"></div>

<div></div>
<div></div>
<div></div>
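For example, with markup like the first block, targeting elements by class or id is a one-liner (a sketch; the class and id names come from the snippet above):

from bs4 import BeautifulSoup

html = '''
<div class="sidebar"></div>
<div class="main" id="main">Main content</div>
<div class="footer"></div>
'''
bs = BeautifulSoup(html, 'html.parser')

print(bs.find('div', id='main'))          # match by id
print(bs.find('div', class_='sidebar'))   # match by class
print(bs.select('div.footer'))            # CSS selector alternative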
Below we are using a new method called find_all() to extract a list of all div elements that have a class of related-content. We are no longer looking just at bs.div, which retrieves the first occurrence of the element; now we want a list of all elements that fit certain criteria. The for loop then iterates through the results, using get_text() to retrieve the content of each as a Unicode string.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html')
bs = BeautifulSoup(html, "html.parser")

related_content = bs.find_all('div', {'class': 'related-content'})
for content in related_content:
    print(content.get_text())
find() vs. find_all()
find() and find_all() are the most commonly used BeautifulSoup methods. Their only difference is the response they provide.
find_all(name, attrs, recursive, string, limit, **kwargs)
find_all() looks through a tag’s descendants and retrieves all descendants that match the provided filters. The name argument tells BeautifulSoup to only consider tags with certain names. Keyword arguments are used as filters on an element’s attributes. We can also search by CSS class, but the name ‘class’ is reserved in Python and would produce a syntax error, so we use the keyword argument class_ instead. The limit argument is used when we don’t need all the results; with it, we tell BeautifulSoup to stop searching after a certain number of matches. These are only the most basic functionalities; for a deeper look at find_all(), please reference the official documentation.
import re  # needed for the re.compile() filter below

bs.find_all('div')                      # will search for all div elements
bs.find_all(id='main')                  # will search for elements with the id of 'main'
bs.find_all(id=True)                    # will search for all elements that have an id
bs.find_all(href=re.compile("ufo"))     # will filter against the 'href' attribute (matches hrefs containing 'ufo')
bs.find_all("div", class_="footer")     # will search for div elements with the class of footer
bs.find_all("div", limit=2)             # will search for a maximum of two div elements
find(name, attrs, recursive, string, **kwargs)
find() is somewhat similar to find_all(), but it limits the result to only one element. In fact, the two lines of code below are nearly equivalent.
bs.find_all('div', limit=1)
bs.find('div')
Another difference is the type of empty response: if find_all() can’t find anything it will return an empty list, while find() will return the None object, as in the short sketch below.
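A quick, self-contained check of the two empty responses (the HTML string below is made up for illustration):

from bs4 import BeautifulSoup

bs = BeautifulSoup("<p>No headings at all</p>", 'html.parser')

print(bs.find_all('h6'))   # [] – an empty list, because there is no h6 tag
print(bs.find('h6'))       # None

# Guarding against both kinds of "nothing found"
if not bs.find_all('h6'):
    print("find_all() came back empty")
if bs.find('h6') is None:
    print("find() returned None")

In the next article, we’ll dig deeper into more advanced nuances of using Python and BeautifulSoup.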