This post is part 1 of the "Advanced Scraping" series:
Static vs Dynamic
The Python documentation, Wikipedia, and most blogs (including this one) serve static content: when we request a URL, we get the final HTML back. In that case, a parser like BeautifulSoup is all you need. A short example of scraping a static page is demonstrated below. I have an overview of BeautifulSoup here.
This post will outline different strategies for scraping dynamic pages.
An example of scraping a static page
Let's start with an example of scraping a static page. This code demonstrates how to get the Introduction section of the Python style guide, PEP8:
```python
import requests
from bs4 import BeautifulSoup  # install with 'pip install BeautifulSoup4'

url = 'https://www.python.org/dev/peps/pep-0008/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# By inspecting the HTML in our browser, we find the introduction
# is contained in <div id='introduction'> ..... </div>
intro_div = soup.find(id='introduction')
print(intro_div.text)
```
```
Introduction

This document gives coding conventions for the Python code comprising the standard library in the main Python distribution. Please see the companion informational PEP describing style guidelines for the C code in the C implementation of Python.
....
```
Voila! If all you have is a static page, you are done!
The straightforward way to scrape a dynamic page
The straightforward way is to use Selenium:

- Initialize a driver (a Python object that controls a browser window).
- Direct the driver to the URL we want to scrape.
- Use a parser on the returned HTML.
The website https://webscraper.io provides some fake pages to test scraping on. Let's use the page https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops to get the product name and the price for the six items listed on the first page. These are randomly generated; at the time of writing, the products were an Asus VivoBook (295.99), two Prestigio SmartBs (299 each), an Acer Aspire ES1 (306.99), and two Lenovo V110s (322 and 356).
Once the HTML has been rendered by Selenium, each item has a div with class caption that contains the information we want. The product name is in a sub-div with class title, and the price is in a sub-div with class price. Here is code for scraping the product names and prices:
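A minimal sketch, assuming Chrome and the selenium package are installed (`pip install selenium`); the helper name `parse_products` is my own, and the parsing is split out so it can be tested on any rendered HTML:

```python
from bs4 import BeautifulSoup

URL = 'https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops'

def parse_products(html):
    """Extract (name, price) pairs from the rendered page HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    products = []
    for caption in soup.find_all('div', class_='caption'):
        title = caption.find(class_='title')
        # The full name is often in the 'title' attribute; fall back to the text
        name = title.get('title') or title.text
        price = caption.find(class_='price').text
        products.append((name.strip(), price.strip()))
    return products

def fetch_rendered_html(url=URL):
    """Drive a real browser, wait for the AJAX content, return the HTML."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()     # 1. initialize a driver
    driver.get(url)                 # 2. direct it to the URL
    # Wait until the AJAX call has populated the product titles
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'title')))
    html = driver.page_source
    driver.quit()
    return html

# Usage (requires a Chrome install, so commented out here):
# for name, price in parse_products(fetch_rendered_html()):  # 3. parse
#     print(name, price)
```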
Trying to scrape a dynamic site using requests
What would happen if we tried to load this e-commerce site using requests? That is, what if we didn't know it was a dynamic site?
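Here is a sketch of that attempt. No JavaScript runs when requests fetches the page, so the AJAX call that fills in the products never happens; the caption-counting helper below is my own name for illustration:

```python
from bs4 import BeautifulSoup

def count_captions(html):
    """Count the product 'caption' divs present in the HTML."""
    return len(BeautifulSoup(html, 'html.parser').find_all('div', class_='caption'))

# Usage (network required, so commented out here):
# import requests
# r = requests.get('https://www.webscraper.io/test-sites/e-commerce/ajax/computers/laptops')
# count_captions(r.text)  # the AJAX-loaded products are missing from the raw response
```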
The HTML we get back can be a little difficult to read directly. If you are using a terminal, you can save the results from r.text to a file and then load it in a browser. If you are using a Jupyter notebook, you can actually use a neat trick to render the output in your browser:
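A sketch of both workflows (the file name and the `save_html` helper are arbitrary choices of mine):

```python
from pathlib import Path

def save_html(html, path='page.html'):
    """Write the fetched HTML to disk so any browser can open and render it."""
    Path(path).write_text(html, encoding='utf-8')
    return path

# Terminal workflow: save the response, then open the file in a browser:
# import webbrowser
# webbrowser.open(save_html(r.text))

# Jupyter workflow: render it inline as the last expression of a cell:
# from IPython.display import HTML
# HTML(r.text)
```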
Alternatives to Selenium
Using Selenium is an (almost) sure-fire way of being able to generate any of the dynamic content that you need, because the pages are actually visited by a browser (albeit one controlled by Python rather than you). If you can see it while browsing, Selenium will be able to see it as well.
There are some drawbacks to using Selenium over pure requests:
- It's slow. We have to wait for pages to render, rather than just grabbing the data we want.
- It downloads images and assets, using bandwidth. Related to the previous point: even if we are just parsing for text, the browser will download all the ads and images on the site.
- Chrome takes a lot of memory. When scraping, we might want to have parallel scrapers running (e.g. one for each category of items on an e-commerce site) to allow us to finish faster. With Selenium, we need enough memory to run multiple browser instances.
- We might not need to parse at all. Some alternatives, such as calling the site's own API directly, return structured data instead of HTML.
- Selenium automation (like parsing) is often tedious and error-prone.
The bad news for using the alternative methods is that there are so many different ways of loading data that no single technique is guaranteed to work. The biggest advantage Selenium has is that it uses a browser, and with enough care, should be indistinguishable from you browsing the web yourself.
This is the first in a series of articles that will look at other techniques to get data from dynamic webpages. Because scraping requires a custom approach to each site we scrape, each technique will be presented as a case study. The examples will be detailed enough to enable you to try the technique on other sites.
| Technique | Description | Case studies |
| --- | --- | --- |
| Schema or OpenGraph metadata | OpenGraph is a standard for allowing sites like Facebook to easily find what your page is 'about'. We can scrape the relevant data directly from these tags. | ??? Need example ??? |
| XHR | Use the same API requests that the browser does to get the data. | Sephora lipsticks, Apple jobs |
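As a taste of the first technique: OpenGraph tags are plain `<meta>` elements in the static HTML head, so requests plus BeautifulSoup is enough. A minimal sketch, using an invented sample page (the property names follow the OpenGraph convention, but the values are made up):

```python
from bs4 import BeautifulSoup

def og_tags(html):
    """Return a dict of OpenGraph properties found in the page's meta tags."""
    soup = BeautifulSoup(html, 'html.parser')
    tags = {}
    for meta in soup.find_all('meta'):
        prop = meta.get('property', '')
        if prop.startswith('og:'):
            tags[prop] = meta.get('content')
    return tags

sample = '''<html><head>
<meta property="og:title" content="Asus VivoBook" />
<meta property="og:price:amount" content="295.99" />
</head><body></body></html>'''
print(og_tags(sample))  # {'og:title': 'Asus VivoBook', 'og:price:amount': '295.99'}
```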
The short list of pros and cons for using Selenium to scrape dynamic sites.
| Pros | Cons |
| --- | --- |
| Will work | Slow |
| | Bandwidth and memory intensive |
| | Requires error-prone parsing |