Python BeautifulSoup tutorial is an introductory tutorial to the BeautifulSoup Python library. The examples find tags, traverse the document tree, modify the document, and scrape web pages.
Sometimes you find data on the web and there is no direct way to download it. Web scraping is the skill of extracting that data into a useful form that can then be imported and used in various ways; practical applications range from gathering the résumés of candidates with a specific skill to collecting material for research or personal interest. To harvest such data effectively, you need to become comfortable with web scraping, and the Python libraries requests and BeautifulSoup are powerful tools for the job. If you like to learn with hands-on examples and you have a basic understanding of Python and HTML, this tutorial is for you. The first part focuses on requesting and wrangling HTML with requests and BeautifulSoup; later sections look at scraping pages whose content is rendered by JavaScript.
BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, such as tag, navigable string, or comment.
We use the pip3 command to install the necessary modules. We need the lxml module, which BeautifulSoup uses as a parser, and the bs4 package, which provides BeautifulSoup itself.
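The install commands themselves are not shown in the text; on most systems both packages can be pulled from PyPI like this:

    $ pip3 install lxml
    $ pip3 install bs4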
In the examples, we will use the following HTML file:
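The original file is not reproduced in the text; the following stand-in is an assumption that contains everything the later examples refer to (a title, an empty meta tag, an h2 heading, a ul with the id mylist, list items containing 'BSD', and two paragraphs). Save it as index.html:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Header</title>
        <meta charset="utf-8">
      </head>
      <body>
        <h2>Operating systems</h2>
        <ul id="mylist" style="width:150px">
          <li>Solaris</li>
          <li>FreeBSD</li>
          <li>Debian</li>
          <li>NetBSD</li>
          <li>Windows</li>
        </ul>
        <p>FreeBSD is an advanced operating system used to power servers, desktops, and embedded platforms.</p>
        <p>Debian is a Unix-like operating system composed entirely of free software.</p>
      </body>
    </html>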
In the first example, we use the BeautifulSoup module to get three tags and print their HTML code.
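A minimal sketch of this first script, assuming the stand-in index.html above:

    from bs4 import BeautifulSoup

    # read the sample file and parse it with the lxml parser
    with open('index.html', 'r') as f:
        contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # the first occurrence of each of the three tags
    print(soup.h2)
    print(soup.head)
    print(soup.li)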
We import the BeautifulSoup class from the bs4 module. BeautifulSoup is the main class for doing the work.
We open the index.html file and read its contents with the read method. A BeautifulSoup object is created; the HTML data is passed to the constructor. The second argument specifies the parser.
Here we print the HTML code of two tags: h2 and head. There are multiple li elements; the line prints the first one.
The name attribute of a tag gives its name, and the text attribute gives its text content. The code example prints the HTML code, name, and text of the h2 tag.
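A sketch, again using the stand-in file:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    print(soup.h2)         # HTML code of the tag
    print(soup.h2.name)    # its name: 'h2'
    print(soup.h2.text)    # its text content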
With the recursiveChildGenerator method we traverse the HTML document.
The example goes through the document tree and prints the names of all HTML tags it contains.
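A sketch of the traversal:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # walk every node of the tree; only tag nodes have a name
    for child in soup.recursiveChildGenerator():
        if child.name:
            print(child.name)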
With the children attribute, we can get the children of a tag. The example retrieves the children of the html tag, places them into a Python list, and prints them to the console. Since the children attribute also returns whitespace between the tags, we add a condition to include only the tag names.
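A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    root = soup.html

    # whitespace between tags appears as strings whose name is None
    root_childs = [e.name for e in root.children if e.name is not None]
    print(root_childs)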
The html tag has two children: head and body.
With the descendants attribute we get all descendants (children of all levels) of a tag. The example retrieves all descendants of the body tag.
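A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    root = soup.body

    root_descendants = [e.name for e in root.descendants if e.name is not None]
    print(root_descendants)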
These are all the descendants of the body tag.
Requests is a simple Python HTTP library. It provides methods for accessing web resources via HTTP.
The example retrieves the title of a simple web page. It also prints its parent.
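A sketch; the URL is only a placeholder for any simple page:

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get('http://example.com')
    soup = BeautifulSoup(resp.text, 'lxml')

    print(soup.title)          # HTML code of the title
    print(soup.title.text)     # its text
    print(soup.title.parent)   # HTML code of its parent, the head tag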
We get the HTML data of the page.
We retrieve the HTML code of the title, its text, and the HTML code of its parent.
With the prettify method, we can make the HTML code look better.
We prettify the HTML code of a simple web page.
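A sketch, again with a placeholder URL:

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get('http://example.com')
    soup = BeautifulSoup(resp.text, 'lxml')

    print(soup.prettify())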
We can also serve HTML pages with a simple built-in HTTP server.
We create a public directory and copy the index.html file into it. Then we start the Python HTTP server.
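For example (the directory name is arbitrary; http.server listens on port 8000 by default):

    $ mkdir public
    $ cp index.html public/
    $ cd public
    $ python3 -m http.server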
Now we get the document from the locally running server.
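A sketch that fetches the file from the local server:

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get('http://localhost:8000/index.html')
    soup = BeautifulSoup(resp.text, 'lxml')

    print(soup.title.text)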
With the find method we can find elements by various means, including the element id.
The code example finds the ul tag that has the mylist id. The commented line in the sketch below shows an alternative way of doing the same task.
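A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # print(soup.find('ul', attrs={'id': 'mylist'}))   # alternative way
    print(soup.find('ul', id='mylist'))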
With the find_all method we can find all elements that meet some criteria. The code example finds and prints all li tags together with their text.
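A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    for tag in soup.find_all('li'):
        print(f'{tag} {tag.text}')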
The find_all method can take a list of elements to search for.
The example finds all h2 and p elements and prints their text.
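A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    for tag in soup.find_all(['h2', 'p']):
        print(' '.join(tag.text.split()))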
The find_all method can also take a function which determines what elements should be returned.
The example prints empty elements.
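A sketch; the predicate relies on the is_empty_element attribute of a tag:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    def is_empty(tag):
        # void elements such as meta have no content
        return tag.is_empty_element

    for tag in soup.find_all(is_empty):
        print(tag)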
The only empty element in the document is the meta tag.
It is also possible to find elements by using regular expressions.
The example prints the content of elements that contain the 'BSD' string.
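A sketch:

    import re
    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # match text nodes that contain the string 'BSD'
    for txt in soup.find_all(string=re.compile('BSD')):
        print(' '.join(txt.split()))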
With the select and select_one methods, we can use CSS selectors to find elements. This example uses a CSS selector to print the HTML code of the third li element.
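A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    print(soup.select_one('li:nth-of-type(3)'))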
This is the third li element of the list.
The # character is used in CSS to select tags by their id attributes. The example prints the element that has the mylist id.
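A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    print(soup.select_one('#mylist'))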
The append method appends a new tag to the HTML document. The example appends a new li tag to the list.
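A sketch; the 'OpenBSD' value is just an example:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    newtag = soup.new_tag('li')      # create a new tag
    newtag.string = 'OpenBSD'        # set its text

    ultag = soup.ul                  # reference to the ul tag
    ultag.append(newtag)             # append the new tag

    print(ultag.prettify())          # print the ul tag in a neat format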
First, we create a new tag with the new_tag method and give it some text. We get a reference to the ul tag, append the newly created tag to it, and print the ul tag in a neat format.
The insert method inserts a tag at the specified location. The example inserts a li tag at the third position of the ul tag.
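A sketch; insert positions are zero-based, so index 2 is the third position:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    newtag = soup.new_tag('li')
    newtag.string = 'OpenBSD'

    ultag = soup.ul
    ultag.insert(2, newtag)

    print(ultag.prettify())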
The replace_with method replaces the text of an element. The example finds a specific element with the find method and replaces its content with the replace_with method.
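A sketch; 'Windows' and 'OpenBSD' are example values from the stand-in file:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    tag = soup.find(string='Windows')   # find the text node
    tag.replace_with('OpenBSD')         # replace its content

    print(soup.ul.prettify())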
The decompose method removes a tag from the tree and destroys it. The example removes the second p element.
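A sketch:

    from bs4 import BeautifulSoup

    with open('index.html', 'r') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    ptag = soup.select_one('p:nth-of-type(2)')
    ptag.decompose()

    print(soup.body.prettify())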
In this tutorial, we have worked with the Python BeautifulSoup library.
The examples so far have dealt with static HTML, but many modern websites are dynamic: they use AJAX to load content after the initial request, or the whole site is built as a Single-Page Application (SPA).
In contrast to dynamic websites, static websites contain all the requested content on page load. A great example is a plain server-rendered page: its whole content is delivered as HTML during the initial page load.
To demonstrate the basic idea of a dynamic website, we can create a web page that contains dynamically rendered text. It does not make any request to fetch information; it simply renders different HTML after the page loads:
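The page itself is not reproduced in the text; a stand-in along the following lines (the markup and the placeholder string are assumptions) behaves as described. Save it as test.html:

    <html>
      <head>
        <title>Dynamic web page example</title>
        <script>
          // replace the placeholder once the page has loaded
          window.addEventListener('DOMContentLoaded', function () {
            document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt';
          });
        </script>
      </head>
      <body>
        <div id="test">Web scraping is hard</div>
      </body>
    </html>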
All we have here is an HTML file with a single <div> in the body that contains a placeholder text, which JavaScript replaces with different content after the page loads. To prove this, let's open the page in a browser and observe the dynamically replaced text: the browser displays the new text, wrapped in the same HTML tags.
Can't we use BeautifulSoup or LXML to parse it? Let's find out.
BeautifulSoup is one of the most popular Python libraries for HTML parsing; a large share of Python web scraping tutorials use it to extract the required content from HTML.
Let's use BeautifulSoup to extract the text inside the <div> from our sample above.
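A sketch of the snippet:

    import os
    from bs4 import BeautifulSoup

    # open test.html from the directory this script lives in
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'test.html')
    with open(path) as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # find the tag with id "test" and extract its text
    print(soup.find(id='test').text)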
This code snippet uses the os library to open our test HTML file (test.html) from the local directory and creates an instance of BeautifulSoup stored in the soup variable. Using soup, we find the tag with the id test and extract the text from it.
In the browser we saw that the content of the test page is I ❤️ ScrapingAnt, but the code snippet prints the original placeholder text from the raw HTML instead.
We need the HTML to be rendered in a browser to see the correct values; only then can we capture those values programmatically.
Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers by using a special connector - a webdriver.
To use Selenium with Chrome/Chromium, we'll need to download webdriver from the repository and place it into the project folder. Don't forget to install Selenium itself by executing:
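    $ pip3 install selenium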
The Selenium instantiation and scraping flow is the following: start the browser through the webdriver, load the target page, let its JavaScript render the content, grab the resulting HTML, and parse it with BeautifulSoup. In code, it looks roughly like this:
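(A sketch; the chromedriver path and the local file URL are assumptions, so adjust them to your setup.)

    import os
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # the chromedriver binary is assumed to sit in the project folder
    service = Service('./chromedriver')
    driver = webdriver.Chrome(service=service)

    # open the local test page; the browser executes its JavaScript
    driver.get('file://' + os.path.abspath('test.html'))

    # hand the rendered HTML over to BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.find(id='test').text)

    driver.quit()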
And finally, we'll receive the required result: I ❤️ ScrapingAnt.
Selenium usage for dynamic website scraping with Python is not complicated, and it lets you choose a specific browser and version, but it consists of several moving parts that have to be maintained. The code itself also contains some boilerplate, such as setting up the browser and the webdriver.
I like to use Selenium for my web scraping projects, but you can find easier ways to extract data from dynamic web pages below.
Puppeteer is a high-level API for controlling headless Chrome, and Pyppeteer is its unofficial Python port. It allows you to automate actions you would otherwise do manually in the browser: copy a page's text, download images, save a page as HTML or PDF, and so on.
To install Pyppeteer you can execute the following command:
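    $ pip3 install pyppeteer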
The usage of Pyppeteer for our needs is much simpler than Selenium:
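(A sketch; the local file path is an assumption.)

    import asyncio
    import os
    from bs4 import BeautifulSoup
    from pyppeteer import launch

    async def main():
        # launch a headless Chromium (Pyppeteer downloads it on the first run)
        browser = await launch()
        page = await browser.newPage()

        # open the local test page and let its JavaScript run
        await page.goto('file://' + os.path.abspath('test.html'))

        # grab the rendered HTML for further BeautifulSoup processing
        html = await page.content()
        await browser.close()

        soup = BeautifulSoup(html, 'lxml')
        print(soup.find(id='test').text)

    asyncio.run(main())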
I've tried to comment on every atomic part of the code for a better understanding. However, generally, we've just opened a browser page, loaded a local HTML file into it, and extracted the final rendered HTML for further BeautifulSoup processing.
As we can expect, the result is the same: I ❤️ ScrapingAnt.
We did it again, and this time we did not have to worry about finding, downloading, and connecting a webdriver to a browser. However, Pyppeteer looks abandoned and not properly maintained. The situation may change in the near future, but I'd suggest looking at a more powerful library.
That library is Playwright. Its API is almost the same as Pyppeteer's, but it offers both synchronous and asynchronous versions.
Installation is as simple as always:
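    $ pip3 install playwright
    $ playwright install

The second command downloads the browser binaries that Playwright drives.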
Let's rewrite the previous example using Playwright.
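(A sketch using the synchronous API; the local file path is an assumption.)

    import os
    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Playwright manages its own browser binaries
        browser = p.chromium.launch()
        page = browser.new_page()

        page.goto('file://' + os.path.abspath('test.html'))
        html = page.content()

        browser.close()

    soup = BeautifulSoup(html, 'lxml')
    print(soup.find(id='test').text)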
As a good tradition, we can observe our beloved output: I ❤️ ScrapingAnt.
We've gone through several different data extraction methods with Python, but is there any more straightforward way to implement this job? How can we scale our solution and scrape data with several threads?
Meet the web scraping API!
Using a web scraping API is the simplest option and requires only basic programming skills.
You do not need to maintain the browser, the libraries, proxies, webdrivers, or any other aspect of the web scraper, and you can focus on the most exciting part of the work: data analysis.
As the web scraping API runs on cloud servers, we have to serve our file somewhere in order to test it. I've created a repository with a single file: https://github.com/kami4ka/dynamic-website-example/blob/main/index.html
To check it out as rendered HTML, we can use another great tool, HTMLPreview. The final test URL for scraping dynamic web data looks like this: http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html
The scraping code itself is the simplest of the four approaches described. We'll use the ScrapingAnt client library to access the web scraping API.
Let's install it first:
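    $ pip3 install scrapingant-client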
And use the installed library:
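(A sketch based on the client's documented usage; the token value is a placeholder.)

    from bs4 import BeautifulSoup
    from scrapingant_client import ScrapingAntClient

    # '<YOUR-API-TOKEN>' is a placeholder; get a token from the ScrapingAnt dashboard
    client = ScrapingAntClient(token='<YOUR-API-TOKEN>')

    url = ('http://htmlpreview.github.io/?'
           'https://github.com/kami4ka/dynamic-website-example/blob/main/index.html')

    # the API renders the page in a headless browser in the cloud
    result = client.general_request(url)

    soup = BeautifulSoup(result.content, 'lxml')
    print(soup.find(id='test').text)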
To get your API token, please visit the Login page to authorize in the ScrapingAnt User panel. It's free.
And the result is still the required one.
All the headless browser magic happens in the cloud, so you only need to make an API call to get the result.
Check out the documentation for more info about ScrapingAnt API.
Happy web scraping, and don't forget to use proxies to avoid blocking 🚀