You might have seen data on a website that you need in a specific format to run your algorithms. The algorithm can be anything: a traditional machine learning algorithm, an NLP algorithm, a search engine, and so on. Web scraping lets us gather data that is available on different websites.

The web is full of data. But this data is not parsed, i.e. you can't split it into its components directly. Take a look at the following data:

Sherlock Holmes,
221b Baker St, Marylebone, 
London NW1 6XE, United Kingdom
Raw address of Sherlock

The above information is raw. You can't get the name, city, and country directly from such data. However, once you parse it, you can easily obtain the required information.

{
    "name": "Sherlock Holmes",
    "city": "London",
    "country": "United Kingdom",
    ...
}
Parsed address of Sherlock

In this tutorial, we will build a simple web scraper.

Overview of all the tasks:

  1. Select the website to scrape data from.
  2. Install and import the required libraries.
  3. Use a library such as requests to get the HTML content of the URL.
  4. Use a parser library such as BeautifulSoup to parse the HTML content obtained in step 3.
  5. Use the pandas library to store the data in CSV files.

The implementation

  1. In our case, we will use https://books.toscrape.com to scrape all the book categories available on the website. We will also scrape the names and prices of all the books listed on the first page.

  2. We require three libraries, viz. requests, pandas, and bs4 (BeautifulSoup). Since we will use the html5lib parser later, install it as well:

pip install requests pandas bs4 html5lib

To import the libraries:

import pandas as pd
import requests
from bs4 import BeautifulSoup

  3. We can get the HTML content of the webpage using the requests library.

WEBSITE = "https://books.toscrape.com"
html_content = requests.get(WEBSITE).content

In the above code, the requests library performs a GET request on WEBSITE, and the content property gives the content of the response.
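
If the request fails, for example due to a network error, html_content will not contain the page you expect. As an optional sketch, you can fail early with raise_for_status, a standard method on the requests response object:

response = requests.get(WEBSITE)
response.raise_for_status()      # raises an HTTPError for 4xx/5xx responses
html_content = response.content  # raw bytes of the page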

Anatomy of the website to scrape
  4. Now we have to parse the data using the BeautifulSoup library.
    However, before doing that, we have to understand the anatomy of the web page. The image above gives you an idea of where your data is and what you need to do to parse it.

On the left side, we can see the book categories. If we find out where the categories are present in the html_content, we can easily parse them.

To find this out, you can inspect the web page and select the HTML node that contains all the book category data.

You can use the Ctrl + Shift + I shortcut, or simply right-click and select Inspect Element in your browser.

Inspect element and find the required DOM Node

By looking at the DOM nodes, we can conclude that we simply need to find all the a tags inside the <ul class="nav nav-list"> node.

soup = BeautifulSoup(html_content, features='html5lib')

side_categories = soup.find("ul", class_="nav nav-list")
list_items = side_categories.find_all("a")

All the categories are now available in the list_items variable, so we can extract the category names and their links. The list also contains one link for Books, which we can simply remove.

cat_names = []
cat_links = []

for item in list_items:
  cat_names.append(item.text.strip())
  cat_links.append(item['href'])

We can follow the same process for the books listed on the first page, as in the sketch below.
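
The exact class names depend on the page's markup; based on inspecting the first page the same way as above, each book appears to be an <article class="product_pod"> element, with the title in the title attribute of its h3 > a tag and the price in a <p class="price_color"> element. Treat these selectors as assumptions and verify them with inspect element:

book_names = []
book_prices = []

# Each book listing on the first page (assumed markup: <article class="product_pod">).
for book in soup.find_all("article", class_="product_pod"):
  # The full title is assumed to live in the title attribute of the h3 > a tag.
  book_names.append(book.h3.a["title"])
  # The price is assumed to be in a <p class="price_color"> element.
  book_prices.append(book.find("p", class_="price_color").text.strip())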

  5. Now, we can store this data in CSV format using the pandas library.

Simply create a DataFrame and save it to a CSV file.

df = pd.DataFrame({
  "category_name": cat_names[1:],
  "category_link": cat_links[1:]
})

df.to_csv("scraped_categories.csv")

cat_names[1:] means all the elements in the list except the first (the Books link we wanted to remove).
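
If you also collected the book names and prices (as in the sketch above), the same pattern stores them in their own CSV file:

books_df = pd.DataFrame({
  "book_name": book_names,
  "book_price": book_prices
})

books_df.to_csv("scraped_books.csv")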

If you want to play with the code, you can do so in the following repl:

https://repl.it/@feat7/Book-scraper-demo

It also contains the code to scrape the book names and prices from the first page.

Going forward, to test your newly acquired scraping skills, you can try extracting detailed book information from each book's individual link. For example, this link contains the information of one book: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html

You can extract such information for all the books on the website.
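
As a starting point, here is a minimal sketch for a single detail page, reusing the requests and BeautifulSoup imports from above. The selectors used here (the product_main div, the price_color class, and the product_description section) are assumptions based on inspecting that page, so verify them with inspect element before relying on them:

BOOK_URL = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

book_html = requests.get(BOOK_URL).content
book_soup = BeautifulSoup(book_html, features='html5lib')

# Title and price are assumed to sit inside the main product block.
product_main = book_soup.find("div", class_="product_main")
title = product_main.find("h1").text.strip()
price = product_main.find("p", class_="price_color").text.strip()

# The description (when present) is assumed to be the paragraph that follows
# the div with id="product_description".
description_div = book_soup.find("div", id="product_description")
description = description_div.find_next_sibling("p").text.strip() if description_div else ""

print(title, price)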