How to scrape website data using Python

Web scraping is the automated collection of publicly available data. A web scraper can gather large volumes of public data from target websites in seconds, far faster and more accurately than manual copying.

While web scraping can be simple, it sometimes gets challenging. One of the best ways to start is with Python: its syntax is approachable, and a rich ecosystem of libraries makes it easy to build a web scraping tool.

In this tutorial, we’ll go over what you need to build a basic web scraping application that collects text-based data from a website, extracts links and image information, and saves the results to a CSV file. Options for more complex functionality are outlined in the conclusion, along with implementation ideas. By the end of this tutorial, you’ll have a solid grasp of web scraping fundamentals.

Prerequisites

  • Python 3.5 or higher installed
  • The requests module
  • The Beautiful Soup library
  • The lxml parser
  • Selenium

To verify that Python and pip installed correctly from the official website, open the command prompt and enter the commands python -V and pip -V.
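For example (the version numbers printed will vary with your installation):

python -V
pip -V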

Requests module

The Python requests module allows you to make HTTP requests to a specified URL and handle the response using its built-in functionality. With this one tool, you can manage both the request and response sides of your web interactions.

Installation

To install the requests library, run the following command in your terminal:

pip install requests

However, depending on your specific operating system, additional steps may be required for a successful installation. 

For example, on Windows you may need to add the pip executable to your PATH environment variable before running the install command. On macOS or Linux, you may need elevated permissions, such as sudo pip install requests, to install the package. Check the installation instructions for your specific operating system to ensure a smooth installation.
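If pip isn’t on your PATH, a reliable alternative on any platform is to invoke it through the Python interpreter itself:

python -m pip install requests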

Making a request

The Python requests module has built-in functions for making HTTP requests with the GET, POST, PUT, PATCH, and HEAD methods, following the request-response protocol between a client and a server. We will be using the GET method in this tutorial.
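Each of these verbs maps to a helper function of the same name in the requests module. Here’s a quick sketch; httpbin.org is used purely as an assumed test endpoint:

import requests

# Each HTTP method has a matching helper function in requests
r = requests.get('https://httpbin.org/get')
r = requests.post('https://httpbin.org/post', data={'key': 'value'})
r = requests.put('https://httpbin.org/put', data={'key': 'value'})
r = requests.patch('https://httpbin.org/patch', data={'key': 'value'})
r = requests.head('https://httpbin.org/get')  # headers only, no body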

GET method

This method retrieves information from a specified server. Any user-supplied parameters are encoded and appended to the URL as a query string.
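Rather than building the query string by hand, you can pass a params dictionary and let requests encode it for you. A minimal sketch, again using httpbin.org as an assumed test endpoint:

import requests

# requests encodes the params dict into the URL's query string
r = requests.get('https://httpbin.org/get', params={'q': 'web scraping'})

# The final URL includes the encoded parameters
print(r.url)  # https://httpbin.org/get?q=web+scraping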

Example: Python requests making GET request

import requests

# Making a GET request
r = requests.get('https://handbook.mattermost.com/')

# Check the response; printing the Response object
# shows its status code (200 means success)
print(r)

# Print the content of the response
print(r.content)

The code above makes a GET request to the URL https://handbook.mattermost.com/. Printing the Response object shows the status code of the response; a status code of 200 indicates that the request was successful. The code then prints the response’s content, which is the raw data obtained from the website.

Output:

[Screenshot: the Response object and page content printed to the console]

Response object

When you request a specific URL, the server returns a response. In Python, this Response object is returned by the requests helper functions, such as requests.get(), requests.post(), and requests.put().

Response is a rich object with several attributes and methods that help normalize data and build suitable code segments. For example, response.status_code returns the status code of the response, letting you determine whether or not the request was executed properly.

Response objects expose a wide range of properties and methods; the example below demonstrates two of them.

Example: Python requests response object

import requests

# Making a GET request
r = requests.get('https://handbook.mattermost.com/')

# print request object
print(r.url)

# print status code
print(r.status_code)

Output:

https://handbook.mattermost.com/
200
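Beyond url and status_code, the Response object exposes several other attributes you’ll reach for often. A quick sketch, not an exhaustive list:

import requests

r = requests.get('https://handbook.mattermost.com/')

print(r.ok)                       # True for any status code below 400
print(r.encoding)                 # the text encoding requests inferred
print(r.headers['Content-Type'])  # response headers behave like a dict
print(r.text[:100])               # body decoded as a string (first 100 chars)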

Beautiful Soup library

Beautiful Soup can extract information from HTML and XML files. It offers a parse tree and functions to navigate, search, or modify the parse tree.

Beautiful Soup is a Python tool that helps you to scrape and parse web pages. It can work with different types of parsers, including html5lib, html.parser, and lxml. Each of these parsers has its own features and is better suited for specific tasks.

html5lib is written in pure Python, which makes it slower than the other parsers, but it’s very flexible and can handle even the most poorly formed HTML and XML pages. It mimics the way web browsers read pages, which is useful when parsing pages that don’t follow all the rules of HTML and XML. For more, read the html5lib docs.

html.parser is the default parser that ships with Python’s standard library. It is also written in Python and is faster than html5lib. It’s good for parsing well-formed HTML pages. Learn more in the html.parser documentation.

lxml is a C-based parser that is faster than the other two and can also handle poorly formed HTML and XML pages. It’s a great option when speed is important. For more, see the lxml documentation.

If you want to work through well-formed pages quickly, use lxml. If the pages are not well-formed and you still need to parse them, use html5lib. And if you’re working with well-formed pages and don’t require speed, html.parser will do the job without any extra dependency.
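Whichever parser you pick, you select it with the second argument to the BeautifulSoup constructor. A minimal sketch (the last two lines require the lxml and html5lib packages to be installed):

from bs4 import BeautifulSoup

html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'

# The same document parsed three different ways
soup_builtin = BeautifulSoup(html, 'html.parser')  # ships with Python
soup_fast = BeautifulSoup(html, 'lxml')            # fastest option
soup_lenient = BeautifulSoup(html, 'html5lib')     # most forgiving

print(soup_builtin.title.string)  # Demo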

Installation

In the terminal, type the following:

pip install beautifulsoup4

Output:

[Screenshot: pip installing the beautifulsoup4 package]
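If you plan to use the lxml or html5lib parsers discussed above (lxml also appears in the CSV example later in this tutorial), install them the same way:

pip install lxml html5lib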

Inspecting the website

To extract information from an HTML page, we must first understand its structure; this allows us to select the specific data we want to scrape. We can inspect the structure by right-clicking on the page and selecting “Inspect Element”.

Selecting it opens the browser’s Developer Tools. For this tutorial, we will use Microsoft Edge.

With the Developer Tools, you can view the site’s Document Object Model (DOM). If you are unfamiliar with the DOM, think of it as the HTML structure of the page as it is displayed.

Take a look at this screenshot from the Edge browser’s inspection view:

[Screenshot: Developer Tools showing the page’s DOM]

Parsing the HTML

After obtaining the HTML of the page, we will parse the raw HTML code into useful information. To do this, we will create a BeautifulSoup object and specify the parser we want to use.

Note: The BeautifulSoup library supports multiple parsers for HTML, which differ in speed and leniency depending on the input. For more information about which parser is best for a given situation, refer to the documentation.

Example: Python BeautifulSoup parsing HTML

import requests
from bs4 import BeautifulSoup


# Making a GET request
r = requests.get('https://handbook.mattermost.com/')

# Parsing the HTML code
soup = BeautifulSoup(r.content, 'html.parser')

# Getting the title tag
print(soup.title)

# Getting tag name
print(soup.title.name)

# Getting the parent tag name
print(soup.title.parent.name)

# Use the string attribute to get
# the text inside the title tag
print(soup.title.string)

Output:

[Screenshot: the title tag, its name, and its parent tag’s name printed to the console]

Extracting text from the tags

In the previous examples, you may have noticed that the tags themselves are included in the scraped data. If you only want the text, use the text property, which returns just the text within a tag. Let’s take a new example and strip all the tags from the data.

Example: Removing tags from the content of a page

import requests
from bs4 import BeautifulSoup

# GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parse the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Find the main content container and print each paragraph's text
s = soup.find('div', class_='entry-content')
lines = s.find_all('p')
for line in lines:
    print(line.text)

Output:

[Screenshot: paragraph text printed without HTML tags]

Extracting links

We’ve already seen how to extract text; now let’s examine how to extract links from a page.

Example: Python BeautifulSoup for extracting links

import requests
from bs4 import BeautifulSoup


# Making a GET request
r = requests.get('https://handbook.mattermost.com/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# find all the anchor tags with "href"
for link in soup.find_all('a'):
	print(link.get('href'))

The code above fetches and parses a webpage, then extracts and prints all the hyperlinks on the page.

Output:

[Screenshot: a list of URLs extracted from the page]
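Note that href values are often relative paths rather than full URLs. If you need absolute URLs, one common approach is to resolve each link against the page’s base URL with urljoin. A minimal sketch:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://handbook.mattermost.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href')
    if href:  # skip anchor tags without an href attribute
        # urljoin resolves relative paths and leaves absolute URLs unchanged
        print(urljoin(base_url, href))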

Extracting image information

This code fetches and parses a webpage, then extracts and stores information about the images on the page before printing it out.

Example: Python BeautifulSoup to extract image

import requests
from bs4 import BeautifulSoup


# Making a GET request
r = requests.get('https://linktr.ee/ddavidking/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

images_list = []

images = soup.select('img')
for image in images:
	src = image.get('src')
	alt = image.get('alt')
	images_list.append({"src": src, "alt": alt})
	
for image in images_list:
	print(image)

Output:

[Screenshot: dictionaries of image src and alt attributes]
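If you want to download the images themselves rather than just record their URLs, the raw bytes are available via another GET request. A minimal sketch that saves the first absolutely-addressed image; the filename first_image.jpg is an assumption for illustration, since the real format may differ:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://linktr.ee/ddavidking/')
soup = BeautifulSoup(r.content, 'html.parser')

image = soup.select_one('img')  # the first image tag on the page
src = image.get('src') if image else None

if src and src.startswith('http'):  # this sketch only fetches absolute URLs
    img_data = requests.get(src).content
    with open('first_image.jpg', 'wb') as f:  # extension assumed for brevity
        f.write(img_data)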

Saving data to CSV

First, we’ll gather the rows we want to include in the CSV file as a list of lists. The result will then be exported to a CSV file using Python’s built-in csv module. Consider the example below for a better understanding.

Example: Python BeautifulSoup saving to CSV

Lastly, let’s look at how to use Python to scrape a table from a Wikipedia page and export it to a CSV file for further analysis.

import csv
import requests
from bs4 import BeautifulSoup

# Fetch the page's HTML
website = requests.get('https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts').text

# Parse it with the lxml parser
soup = BeautifulSoup(website, 'lxml')

# Find the first table on the page and gather its rows
table = soup.find('table')
table_rows = table.find_all('tr')

data = []
for tr in table_rows:
    td = tr.find_all('td')
    rows = [i.text for i in td]
    if rows:  # skip header rows, which use th cells instead of td
        data.append(rows)

# Export the data to a CSV file
with open('twitter_accounts.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)

print("Data has been saved to CSV file successfully!")

The code above uses the requests and BeautifulSoup libraries to scrape a Wikipedia page containing a table of the most-followed Twitter accounts. It makes a GET request to the page, parses the HTML with BeautifulSoup (using the lxml parser), locates the table element, and extracts the text from each row.

Finally, it exports the extracted data to a CSV file and prints: “Data has been saved to CSV file successfully!”

Output:

[Screenshot: the success message printed after the CSV file is written]
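To double-check the export, you can read the file back with the same csv module:

import csv

with open('twitter_accounts.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row)  # each row comes back as a list of strings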

Uses of scraped data

Below are a few examples of what you can do with data once you’ve scraped it:

  1. Data scraped from social media platforms can be used for sentiment analysis. By collecting and analyzing social media posts, businesses can gauge public opinion on their brand or industry and use this information to improve their marketing strategies and customer service.
  2. Scraped data can also power price comparison. If you scrape data from various online retailers, you can compare product prices and make informed purchasing decisions, which is particularly useful for consumers looking to find the best deals.
  3. Finally, scraped data supports market research. Collecting data from various sources lets you analyze trends and patterns in a particular industry or market, which can help businesses make informed decisions about their products or services.

Conclusion

Scraping website data using Python is a helpful technique for collecting and analyzing data from the web. It involves using Python libraries such as requests and Beautiful Soup to make HTTP requests to a website, retrieve the HTML or XML, and parse it into a structured format.

Following the steps above, you can scrape data from a website for a variety of purposes, such as sentiment analysis of a topic or of how a word is used on a social media platform, data analysis, machine learning, or web development. However, it’s important to always follow the website’s terms of service and use scraping responsibly to avoid legal issues.

Now that you know how to scrape website data using Python, it’s time to learn more about what you can do with said data. To continue your learning, check out the Beginner’s Guide to NumPy.


David is a software engineer and technical writer who is passionate about helping the developer community by sharing what he knows through writing.