How to scrape website data using Python
Web scraping is the automated collection of publicly available data from websites. A web scraper can gather large volumes of public data from target websites far faster than copying it by hand.
Web scraping can be simple, but it is sometimes challenging. Python is one of the best languages to start with: it is object-oriented, its syntax is easy to read, and many libraries make it easy to build a web scraping tool.
In this tutorial, we’ll go over what you need to get started with a basic web scraping application that collects text-based data from web pages and saves it to a file. Ideas for applying the scraped data are covered near the end. By following the steps, you’ll have a solid grasp of the basics of web scraping by the end of this tutorial.
Prerequisites
- Python 3.5 or higher
- Requests module
- Beautiful Soup library
- lxml
- Selenium
To verify the installation and version of Python and pip, open the command prompt and enter the following commands: “python -V” and “pip -V”. Each prints a version number if the installation from the official website was successful.
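The same check can also be made from inside Python. The following is a minimal sketch rather than part of the original setup steps: the helper name check_prerequisites is hypothetical, and it assumes the packages are imported under their usual names (requests, bs4, lxml, selenium).

```python
import importlib.util
import sys

# Import names of the packages listed in the prerequisites.
# Beautiful Soup is installed as "beautifulsoup4" but imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "lxml", "selenium"]


def check_prerequisites(min_version=(3, 5)):
    """Return a dict mapping each prerequisite to True (available) or False."""
    report = {"python_ok": sys.version_info[:2] >= min_version}
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = check_prerequisites()
for item, available in report.items():
    print("{}: {}".format(item, "OK" if available else "missing"))
```

Anything reported as missing can be installed with pip before continuing.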
Requests module
The Python requests module lets you make HTTP requests to a specified URL and work with the response through a simple, consistent interface.
Installation
To install the requests library, use the following command in your terminal:
pip install requests
However, depending on your operating system, additional steps may be required for a successful installation.
For example, on Windows you may need to add the pip executable to your PATH environment variable before running the install command. On a Mac or Linux system, you may need a different command, such as sudo pip install requests, to install the package with the necessary permissions (installing inside a virtual environment, or with pip install --user requests, avoids the need for sudo). Check the installation instructions for your specific operating system to ensure a smooth installation.
Making a request
The Python requests module has built-in methods for making HTTP requests with GET, POST, PUT, PATCH, or HEAD. HTTP is a request-response protocol between a client and a server. We will be using the GET request in this tutorial.
GET method
This method retrieves information from a server. Any parameters you send are URL-encoded and appended to the URL as a query string.
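To make the encoding step concrete, here is a small sketch that uses only the standard library. The helper build_get_url and the example.com address are illustrative assumptions. In practice, requests can do this encoding for you when you pass a dictionary through the params argument of requests.get.

```python
from urllib.parse import urlencode, urlsplit


def build_get_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    query = urlencode(params)
    if not query:
        return base_url
    # Use "&" if the URL already has a query string, otherwise start one with "?".
    separator = "&" if urlsplit(base_url).query else "?"
    return base_url + separator + query


url = build_get_url("https://example.com/search", {"q": "web scraping", "page": 2})
print(url)  # https://example.com/search?q=web+scraping&page=2
```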
Example: Python requests making GET request
import requests
# Making a GET request
r = requests.get('https://handbook.mattermost.com/')
# Print the response object; a successful request shows <Response [200]>
print(r)
# Print the raw content (bytes) of the response
print(r.content)
The code above makes a GET request to the URL “https://handbook.mattermost.com/”. It then prints the response object, whose representation includes the status code. The status code 200 indicates that the request was successful. Finally, the code prints the response’s content, which is the raw data returned by the website.
Response object
When you request a specific URL, the server returns a response. In Python, this Response object is returned by requests.method(), where method is get, post, put, and so on.
Response is a rich object whose attributes and methods make it easier to inspect the result and write code around it. For example, response.status_code returns the HTTP status code from the response’s status line, which tells you whether or not the request succeeded.
Response objects expose a wide range of attributes and methods, such as url, status_code, headers, text, and content.
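As a rough illustration of how a status code can be interpreted, the sketch below uses the standard library's http.HTTPStatus. The helper describe_status is hypothetical and not part of requests. With a real Response object you would pass in r.status_code, and requests also offers r.raise_for_status() to raise an exception for error codes.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, reason_phrase) for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in HTTPStatus end up here.
        phrase = "Unknown status"
    # Codes in the 2xx range mean the request succeeded.
    return 200 <= code < 300, phrase


for code in (200, 404, 503):
    ok, phrase = describe_status(code)
    print(code, phrase, "success" if ok else "failure")
```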
Example: Python requests response object
import requests
# Making a GET request
r = requests.get('https://handbook.mattermost.com/')
# print the URL of the response
print(r.url)
# print status code
print(r.status_code)
Output:
https://handbook.mattermost.com/
200
Beautiful Soup library
Beautiful Soup is a Python library that extracts information from HTML and XML files. It builds a parse tree and provides functions to navigate, search, or modify that tree. It does not download pages itself, which is why it is paired with requests.
Beautiful Soup can work with different parsers, including html5lib, html.parser, and lxml. Each of these parsers has its own features and is better suited for specific tasks.
html5lib is written in pure Python, which makes it slower than the other parsers. But it is very lenient and can handle even badly malformed HTML, because it parses pages the same way web browsers do. For more, read the html5lib documentation.
html.parser is the parser built into Python’s standard library, so it needs no extra installation. It is also written in Python and is faster than html5lib. It works well for reasonably well-formed HTML. For more, see the html.parser documentation.
lxml is a C-based parser that is faster than the other two. It can parse both HTML and XML and tolerates poorly formed markup. It’s a great option when speed is important. For more, see the lxml documentation.
In short, use lxml when speed matters, html5lib when the pages are badly formed and you need them parsed the way a browser would, and html.parser when you want to avoid extra dependencies and don’t need maximum speed.
Installation
In the terminal, type the following:
pip install beautifulsoup4
Inspecting the website
To extract information from an HTML page, we must first understand its structure. This allows us to select the specific data we want to scrape. We can do this by right-clicking on the page and selecting “Inspect” (labelled “Inspect Element” in some browsers).
Once you select it, the browser’s Developer Tools will open. For this tutorial, we will use Microsoft Edge.
You can view the site’s Document Object Model (DOM) with the Developer Tools. If you are unfamiliar with the DOM, think of it as the HTML structure of the page that is displayed.
[Screenshot: Microsoft Edge Developer Tools inspecting the page]
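To get a feel for the nested structure that the Developer Tools display, the sketch below prints an indented outline of the tags in a small HTML string using Python's built-in html.parser module. The class OutlineParser, the helper outline, and the sample markup are made up for illustration. The sketch assumes every opening tag has a matching closing tag, which real pages do not always guarantee.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect an indented outline of the tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent each tag by two spaces per nesting level.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = "<html><head><title>Demo</title></head><body><div><p>Hi</p></div></body></html>"
print("\n".join(outline(sample)))
```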
Parsing the HTML
After obtaining the HTML of the page, we will parse the raw HTML code into useful information. To do this, we will create a BeautifulSoup object and specify the parser we want to use.
Note: The BeautifulSoup library supports multiple libraries for parsing HTML, and they differ on speed and accuracy depending on the HTML. For more information about which parser is best for any given situation, refer to the documentation.
Example: Python BeautifulSoup parsing HTML
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://handbook.mattermost.com/')
# Parsing the HTML code (the parser name must be lowercase)
soup = BeautifulSoup(r.content, 'html.parser')
# Getting the title tag
print(soup.title)
# Getting the tag name
print(soup.title.name)
# Getting the parent tag name
print(soup.title.parent.name)
Extracting text from the tags
In the previous examples, you may have noticed that the tags are included when printing an element. If you want only the text, use the text property, which returns the text inside a tag without the markup. The next example prints the text of every paragraph in the main content of a page.
Example: Removing tags from the content of a page
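The difference is easiest to see on a small inline document, which also avoids a network request. The HTML string below is made up for illustration, and the sketch assumes Beautiful Soup is installed.

```python
from bs4 import BeautifulSoup

# A small inline document stands in for a downloaded page.
html = "<html><body><p>Hello, <b>world</b>!</p><p>  Second paragraph  </p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")

# Printing the tag itself includes the markup.
print(first)
# The text property returns only the text, including text in nested tags.
print(first.text)  # Hello, world!
# get_text(strip=True) also trims the surrounding whitespace.
print(second.get_text(strip=True))  # Second paragraph
```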
import requests
from bs4 import BeautifulSoup
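# GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
# Parse the HTML
soup = BeautifulSoup(r.content, 'html.parser')
# The 'entry-content' class is specific to this page's markup and may change
s = soup.find('div', class_='entry-content')
# find() returns None when nothing matches, so guard before searching inside it
lines = s.find_all('p') if s is not None else []
for line in lines:
    print(line.text)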
Extracting links
We’ve already seen how to extract text; now let’s examine how to extract links from a page.
Example: Python BeautifulSoup for extracting links
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://handbook.mattermost.com/')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
# Find all the anchor tags that have an "href" attribute
for link in soup.find_all('a', href=True):
    print(link.get('href'))
The code above fetches and parses a webpage, then extracts and prints all the hyperlinks on the page.
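The href values on a page are often relative (for example, /about), so a common follow-up step is resolving them against the page URL. The sketch below uses the standard library's urljoin. The helper absolute_links, the sample hrefs, and the example.com base URL are illustrative assumptions.

```python
from urllib.parse import urljoin, urlsplit


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep only http(s) links."""
    resolved = []
    for href in hrefs:
        if not href:
            # Skip missing or empty href values.
            continue
        url = urljoin(base_url, href)
        # Drop non-web schemes such as mailto: links.
        if urlsplit(url).scheme in ("http", "https"):
            resolved.append(url)
    return resolved


hrefs = ["/about", "contact.html", "https://example.org/x", "mailto:hi@example.com", None]
for url in absolute_links("https://example.com/docs/", hrefs):
    print(url)
```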
Extracting image information
The code below fetches and parses a webpage, then extracts and stores information about the images on the page. Once that’s done, it prints this information.
Example: Python BeautifulSoup extracting image information
import requests
from bs4 import BeautifulSoup
# Making a GET request
r = requests.get('https://linktr.ee/ddavidking/')
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
# Collect the src and alt attributes of every <img> tag
images_list = []
images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    images_list.append({"src": src, "alt": alt})
for image in images_list:
    print(image)
Saving data to CSV
First, we’ll build a list of rows, where each row is a list of the cell values we wish to include in the CSV file. The result will then be exported to a CSV file using Python’s built-in csv module. Consider the example below for a better understanding.
Example: Python BeautifulSoup saving to CSV
Lastly, let’s look at how to use Python to scrape and export data from a Wikipedia table to a CSV file for further analysis.
import csv
import requests
from bs4 import BeautifulSoup

# Download the page and parse it with lxml
website = requests.get('https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts').text
soup = BeautifulSoup(website, 'lxml')
# Take the first table on the page
table = soup.find('table')
table_rows = table.find_all('tr')
data = []
for tr in table_rows:
    # Include header cells (th) so the column names are kept
    cells = tr.find_all(['td', 'th'])
    row = [cell.text.strip() for cell in cells]
    if row:
        data.append(row)
# Export the data to a CSV file
with open('twitter_accounts.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(data)
print("Data has been saved to CSV file successfully!")
Using the requests and BeautifulSoup libraries, the code above scrapes a Wikipedia page that contains a table of the most-followed Twitter accounts. It uses requests to make a GET request to the page and BeautifulSoup to parse the HTML, find the table, and extract the text of the cells in each row.
Finally, it exports the extracted rows to a CSV file using the csv module and prints: “Data has been saved to CSV file successfully!”
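When the scraped records are dictionaries rather than lists, such as the images_list built in the image example, csv.DictWriter can write them along with a header row. The sketch below writes to an in-memory buffer so it runs without touching the disk. The sample records and the helper records_to_csv are made up for illustration.

```python
import csv
import io

# Hypothetical records in the same shape as images_list from the image example.
records = [
    {"src": "/img/logo.png", "alt": "Logo"},
    {"src": "/img/team.jpg", "alt": "Team photo"},
]


def records_to_csv(rows, fieldnames):
    """Return CSV text with a header row followed by one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = records_to_csv(records, ["src", "alt"])
print(csv_text)
```

To write to a file instead, pass a file opened with newline='' to csv.DictWriter, as in the example above.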
Uses of scraped data
The examples below show practical uses for scraped data:
- Sentiment analysis. By collecting and analyzing data from social media posts, businesses can gauge public opinion on their brand or industry and use this information to improve their marketing strategies and customer service.
- Price comparison. If you scrape data from various online retailers, you can compare the prices of products and make informed purchasing decisions. This is particularly useful for consumers looking to save money by finding the best deals.
- Market research. Collecting data from various sources allows you to analyze trends and patterns in a particular industry or market. This can be helpful for businesses looking to make informed decisions about their products or services.
Conclusion
Scraping website data using Python can be a helpful tool for collecting and analyzing data from the web. It involves using Python libraries such as Beautiful Soup and requests to make HTTP requests to a website, retrieve the HTML or XML data, and parse it into a structured format.
Following the steps above, you can scrape data from a website using Python for various purposes, such as sentiment analysis on a specific topic, tracking how a word is used on a social media platform, data analysis, machine learning, or web development. However, it’s important to always follow the website’s terms of service and to scrape responsibly to avoid legal issues.
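One common courtesy check, sketched below, is to consult a site's robots.txt file before fetching a page. This is an illustrative addition rather than a step from the tutorial, and passing it does not replace reading the site's terms of service. The robots.txt content, the user agent name my-scraper, and the helper is_allowed are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/handbook"))   # True
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/x"))  # False
```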
Now that you know how to scrape website data using Python, it’s time to learn more about what you can do with said data. To continue your learning, check out the Beginner’s Guide to NumPy.