Web scraping (also called data mining, web harvesting, or web data extraction) is the practice of extracting data from webpages by any means other than interacting with an API. Scraping a web page usually involves the same set of steps: using a library to request data from a web server, then querying and parsing the response (usually received as HTML). Web scraping is common in industries that rely heavily on data harvesting, such as e-commerce (comparing prices across different sellers, for example) or services that collect information about users or buyers. In this article, we’ll learn the process of web scraping using Python and BeautifulSoup.
Web scrapers are a great way to process large amounts of data. They let you skip the presentation layer of a webpage (JavaScript, images, and stylesheets) and work with the underlying content directly, without rendering anything in a browser. Of course, scraping webpages should be secondary to using APIs when possible. On the other hand, APIs can be problematic: responses may be inconsistent or poorly documented, or an API might not exist at all. Those are suitable cases for web scraping.
Let us learn the basic mechanics of web scraping: how to use Python to request information from a web server, how to perform basic handling of the server’s response with the help of the BeautifulSoup module, and how to interact with the data received.
Scraping example using BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Request the page and read the response body once (it can only be read a single time)
html = urlopen('https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html')
content = html.read()
print(content)

# Parse the response and print the first h1 tag found on the page
bs = BeautifulSoup(content, 'html.parser')
print(bs.h1)
Reading the snippet from the start, we import the urlopen function from the request module of the urllib package (a collection of modules for working with URLs). The request module defines functions and classes that help in opening URLs. urlopen is a function that opens the URL, provided either as a string (as in this case) or as a Request object. The only argument provided in the snippet is the URL we want to open, but the function accepts more optional parameters. We also import the BeautifulSoup object from the bs4 module. The BeautifulSoup object represents the parsed document and has built-in support for navigating and searching the document in its tree form. One thing to note is that urllib is a standard Python library (it comes prepackaged with your Python distribution), while BeautifulSoup is not and needs to be installed using the Python package manager – pip.
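If BeautifulSoup isn’t installed yet, and to illustrate one of urlopen’s optional parameters, here is a short sketch (the timeout value below is an arbitrary choice):

# Install BeautifulSoup first if needed:
#   pip install beautifulsoup4
from urllib.request import urlopen

# Besides the URL, urlopen accepts optional parameters such as timeout (in seconds)
html = urlopen(
    'https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html',
    timeout=10,
)
print(html.status)   # HTTP status code of the response, e.g. 200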
After running the snippet above we’ll have two results printed. The first one – print(content) – outputs the response body of the requested URL as a bytes object.
b' <!DOCTYPE html>\n<html lang="en" data-uri="archive.cms.cnn.com/_pages/h_48352b37546e12ac41cf938434485a30@published" data-layout-uri="archive.cms.cnn.com/_layouts/layout-with-rail/instances/world-article-v1@published">\n <head><script>if (/MSIE \\d|Trident.*rv:/.test(navigator.userAgent)) {document.write(\'<scr\' + \'ipt src="/js/pollyfills.js"></scr\' + \'ipt>\');}</... and so forth... (it's a really looong file!)
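Note the leading b' in the output: read() gives us raw bytes. A minimal sketch of turning that into a regular string, assuming the content variable from the snippet above and a UTF-8 encoded page:

# content holds the bytes returned by html.read() in the earlier snippet
text = content.decode('utf-8')   # assumes the page is encoded as UTF-8
print(text[:200])                # first 200 characters of the decoded HTML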
The second part of the snippet creates a BeautifulSoup object from the object to parse and the parser itself (the choice of parser usually makes little difference) and tries to access the first instance of the h1 tag found on the page. Keep in mind that this retrieves only the first instance! If we want to access or filter multiple elements, we can use different methods (such as find_all(), more on that later). The object to parse can be a string or an open filehandle, and the parser used here is Python’s built-in html.parser.
<h1 class="headline__text inline-placeholder" data-editable="headlineText" id="maincontent"> NASA announces team of scientists who will study mysterious ‘UFO’ events in the sky </h1>
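Since the object to parse can also be a plain string, here is a small self-contained sketch (the HTML below is made up for illustration):

from bs4 import BeautifulSoup

snippet = "<html><body><h1>Hello</h1><p>World</p></body></html>"
bs = BeautifulSoup(snippet, 'html.parser')

print(bs.h1)             # <h1>Hello</h1>
print(bs.h1.get_text())  # Hello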
Handling scraping results
Data on the internet is more often than not messy, volatile, and unpredictable, so we can run into multiple issues while parsing it. For example, we might provide the wrong URL, the program might have trouble locating the server, or it might fail to retrieve files from the server. We could also try to access a faulty or missing tag or element. In cases like these, the scraper would most probably stop its execution, so we need to handle volatile code and any exceptions that might appear. For example:
from urllib.error import HTTPError, URLError
from urllib.request import urlopen
from bs4 import BeautifulSoup

try:
    html = urlopen("https://edition.cnmn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html")
except HTTPError as e:
    print("HTTP error.")
except URLError as e:
    print("Server error.")
else:
    bs = BeautifulSoup(html.read(), 'html.parser')
    print(bs.h1)
We’re using the urllib.error module, which defines the exception classes for errors raised while opening URLs. From that module we import HTTPError and URLError. URLError is the base class of the error module and is mostly used for catching unreachable or non-existent servers; HTTPError, on the other hand, is a subclass of URLError, with the difference that it can also be handled as a non-exceptional, file-like return value. The thing to remember when distinguishing these two errors is that URLError is raised on URL errors (duh!) and connectivity issues, while HTTPError is mostly raised on failed authentication and 4xx status codes.
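A small sketch of telling the two apart at runtime – HTTPError carries the status code, while URLError only carries a reason (the URL is the one from the article; whether anything is actually raised depends on the live server):

from urllib.error import HTTPError, URLError
from urllib.request import urlopen

try:
    html = urlopen("https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html")
except HTTPError as e:
    # HTTPError is raised for responses such as 404 or 403 and carries the status code
    print("HTTP error:", e.code, e.reason)
except URLError as e:
    # URLError covers lower-level problems such as an unreachable or non-existent server
    print("Server error:", e.reason)
else:
    print("Request succeeded:", html.status)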
Making sure elements exist
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getPageTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return
    try:
        bs = BeautifulSoup(html.read(), "html.parser")
        title = bs.h1
    except AttributeError as e:
        print("Error finding title")
        return
    print(title)
    return

getPageTitle("https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html")
<h1 class="headline__text inline-placeholder" data-editable="headlineText" id="maincontent"> NASA announces team of scientists who will study mysterious ‘UFO’ events in the sky </h1>
What is going on above with the AttributeError exception?
Let us say the server is up and we retrieve the content of the page successfully. It should be an easy ride from here on, right? Well, we might still encounter issues. For example, what happens if we access a non-existent element or an invalid reference (test this by searching for h6 elements instead of h1)? Unlike the exceptions raised by the urllib.error module, the execution of the program wouldn’t stop; an invalid reference simply returns the None object. In itself, this isn’t that big of a deal. But most of the time developers are unaware that they received a None object, and they will try to access further elements on that object, which triggers an AttributeError. So, in order to handle exceptions on missing attribute references and assignments, we catch the AttributeError object.
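A minimal sketch of that failure mode, using a made-up HTML string for illustration:

from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><body><h1>Only heading</h1></body></html>", 'html.parser')

missing = bs.h6          # there is no h6 tag, so this is None rather than an error
print(missing)           # None
# missing.get_text()     # would raise AttributeError: 'NoneType' object has no attribute 'get_text'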
Scraping tips
The best way to scrape without writing behemoth references (stuff like bs.find_all('p')[2].find_all('span').find('i').find_all('span')) is to investigate and use stylesheets. With CSS we get a much better way of differentiating HTML elements, especially with additional markup. Two div elements could be hard to distinguish, but if one of them has a class or an id we can target it directly and skip referencing elements by their index in the tree (like above). It is really easy to separate tags based on different classes: the first block of elements in the snippet below is miles easier to scrape than the second, as the short sketch after it shows.
<div class="sidebar"></div>
<div class="main" id="main"></div>
<div class="footer"></div>

<div></div>
<div></div>
<div></div>
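For example, with markup like the first block, targeting elements by class or id is a one-liner (a sketch; the class and id names come from the snippet above):

from bs4 import BeautifulSoup

html = '''
<div class="sidebar"></div>
<div class="main" id="main">Main content</div>
<div class="footer"></div>
'''
bs = BeautifulSoup(html, 'html.parser')

print(bs.find('div', id='main'))          # match by id
print(bs.find('div', class_='sidebar'))   # match by class
print(bs.select('div.footer'))            # CSS selector alternative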
Below we are using a new method called find_all() to extract a list of all div elements that have a class of related-content. We are no longer looking just at bs.div, which retrieves the first occurrence of the element; now we want a list of all elements that fit certain criteria. The for loop then iterates through the results, using get_text() to retrieve the content of each as a Unicode string.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://edition.cnn.com/2022/10/24/world/ufos-nasa-team-study-scn/index.html')
bs = BeautifulSoup(html, "html.parser")

related_content = bs.find_all('div', {'class': 'related-content'})
for content in related_content:
    print(content.get_text())
find() vs. find_all()
find() and find_all() are the most commonly used BeautifulSoup methods. Their only difference is the response they provide.
find_all(name, attrs, recursive, string, limit, **kwargs)
find_all() looks through a tag’s descendants and retrieves all descendants that match the provided filters. The name argument tells BeautifulSoup to only consider tags with certain names. Keyword arguments are used as filters on an element’s attributes. We can also search by CSS class, but the name ‘class’ is reserved in Python and would produce a syntax error, so we use the keyword argument class_ instead. The limit argument is used when we don’t need all the results; with it, we tell BeautifulSoup to stop searching after a certain number of matches. These are only the most basic functionalities; for a deeper look at find_all(), please reference the official documentation.
import re  # needed for the re.compile() filter below

bs.find_all('div')                      # will search for all div elements
bs.find_all(id='main')                  # will search for elements with the id of 'main'
bs.find_all(id=True)                    # will search for all elements that have an id
bs.find_all(href=re.compile("ufo"))     # will filter against the 'href' attribute (matches hrefs containing 'ufo')
bs.find_all("div", class_="footer")     # will search for div elements with the class of footer
bs.find_all("div", limit=2)             # will search for a maximum of two div elements
find(name, attrs, recursive, string, **kwargs)
find() is somewhat similar to find_all(), but it limits the result to only one element. In fact, the two lines of code below are nearly equivalent.
bs.find_all('div', limit=1)
bs.find('div')
Another difference is the type of empty response: if find_all() can’t find anything it will return an empty list, while find() will return the None object, as in the short sketch below.
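A quick, self-contained check of the two empty responses (the HTML string below is made up for illustration):

from bs4 import BeautifulSoup

bs = BeautifulSoup("<p>No headings at all</p>", 'html.parser')

print(bs.find_all('h6'))   # [] – an empty list, because there is no h6 tag
print(bs.find('h6'))       # None

# Guarding against both kinds of "nothing found"
if not bs.find_all('h6'):
    print("find_all() came back empty")
if bs.find('h6') is None:
    print("find() returned None")

In the next article, we’ll dig deeper into more advanced nuances of using Python and BeautifulSoup.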