lxml is a pretty extensive library written for parsing XML and HTML documents very quickly, even handling messed-up tags in the process. We will also be using the Requests module instead of the already built-in urllib2 module due to improvements in speed and readability. You can easily install both using pip install lxml and pip install requests.
Python 3 is the best programming language for web scraping: it is fast, easy to work with, and most web scraping tools, including many that ship with Kali Linux, are written in Python. Enough theory, let's start scraping the web with Python! Web scraping is one of the most powerful things you can learn. First things first, you will need Python installed; read my article here to make sure you have Python and an IDE set up.
Let’s start with the imports:
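Based on the two libraries named above, the imports would be:

```python
from lxml import html
import requests
```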
Next we will use requests.get to retrieve the web page with our data, parse it using the html module, and save the results in tree:
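A sketch of that step. The URL below is the small example dataset this walkthrough appears to follow; it's an assumption, so substitute the page you actually want to scrape:

```python
# Fetch the page and build an element tree from the raw response bytes.
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
```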
(We need to use page.content rather than page.text because html.fromstring implicitly expects bytes as input.)
tree now contains the whole HTML file in a nice tree structure which we can go over two different ways: XPath and CSSSelect. In this example, we will focus on the former.
XPath is a way of locating information in structured documents such as HTML or XML documents. A good introduction to XPath is on W3Schools.
There are also various tools for obtaining the XPath of elements, such as FireBug for Firefox or the Chrome Inspector. If you're using Chrome, you can right-click an element, choose 'Inspect element', highlight the code, right-click again, and choose 'Copy XPath'.
After a quick analysis, we see that in our page the data is contained in two elements: one is a div with title 'buyer-name' and the other is a span with class 'item-price'.
Knowing this, we can create the correct XPath query and use the lxml xpath function like this:
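A sketch of the two queries, translating the title and class attributes identified above directly into XPath:

```python
# //div[@title="buyer-name"]/text() grabs the text of every buyer div;
# the second query does the same for the price spans.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')
```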
Let’s see what we got exactly:
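For instance, by printing both lists:

```python
print('Buyers:', buyers)
print('Prices:', prices)
```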
Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and Requests. We have it stored in memory as two lists. Now we can do all sorts of cool stuff with it: we can analyze it using Python or we can save it to a file and share it with the world.
Some more cool ideas to think about are modifying this script to iterate through the rest of the pages of this example dataset, or rewriting this application to use threads for improved speed.
The need for extracting data from websites is increasing. When conducting data-related projects such as price monitoring, business analytics, or news aggregation, we always need to record data from websites. However, copying and pasting data line by line is outdated. In this article, we will teach you how to become an "insider" in extracting data from websites: how to do web scraping with Python.
Web scraping is a technique that can help us transform unstructured HTML data into structured data in a spreadsheet or database. Besides writing Python code, accessing website data through an API or using data extraction tools like Octoparse are alternative options for web scraping.
Some big websites like Airbnb or Twitter provide APIs for developers to access their data. API stands for Application Programming Interface; it is the channel through which two applications communicate with each other. For most people, an API is the optimal approach to obtain data offered by the website itself.
However, most websites don't offer API services. Sometimes, even if they provide an API, the data you can get is not what you want. Therefore, writing a Python script to build a web crawler becomes another powerful and flexible solution.
So why should we use Python instead of other languages?
Now let's start our journey into web scraping using Python!
In this tutorial, we will show you how to scrape reviews from Yelp. We will use two libraries: BeautifulSoup from bs4 and urlopen from urllib.request. These two libraries are commonly used in building web crawlers with Python. The first step is to import them so that we can use their functions.
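Presumably the two imports look like this:

```python
from bs4 import BeautifulSoup
from urllib.request import urlopen
```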
We need to extract reviews from “https://www.yelp.com/biz/milk-and-cream-cereal-bar-new-york?osq=Ice+Cream”. So first, let’s save the URL in a variable called url. Then we can request this web page and save the server’s response in “ourUrl” by using the urlopen() function from urllib.request.
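A sketch of those two steps (note that Yelp may reject bare scripted requests, in which case you would need to send a browser-like User-Agent header):

```python
url = 'https://www.yelp.com/biz/milk-and-cream-cereal-bar-new-york?osq=Ice+Cream'
ourUrl = urlopen(url)  # file-like response whose body is the page's HTML
```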
Then we apply BeautifulSoup to parse the page.
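A minimal sketch; the parser name is an assumption ('html.parser' ships with Python, though 'lxml' would also work if installed):

```python
soup = BeautifulSoup(ourUrl, 'html.parser')
```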
Now that we have the “soup”, which is the parsed HTML of this website, we can use a function called prettify() to format the raw data and print it to see the nested structure of the HTML in the “soup”.
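For instance:

```python
print(soup.prettify())  # prints the page with one tag per line, indented by depth
```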
Next, we should find the reviews in this HTML, extract them, and store them. Every element on the page is marked by HTML tags and attributes, such as a class or an id. To find out which ones identify the reviews, we need to inspect them on the web page.
After right-clicking a review and choosing 'Inspect element' (or 'Inspect', depending on the browser), we can see the HTML of the reviews.
In this case, the reviews are located under the tag called “p”. So we will first use the function find_all() to find the parent node of the reviews, and then locate all elements with the tag “p” under that parent node in a loop. After finding all “p” elements, we store them in a list called “review”.
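A sketch of that loop. Yelp's real markup changes frequently, so the container class below is a hypothetical placeholder; use the class name you actually see in the inspector:

```python
review = []

# "review-content" is a hypothetical class for the parent node; replace
# it with whatever the inspector shows for the review container.
for parent in soup.find_all('div', {'class': 'review-content'}):
    for p in parent.find_all('p'):
        review.append(p)
```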
Now we have all the reviews from that page. Let’s see how many reviews we have extracted.
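For example:

```python
print(len(review))  # the page in this demo holds 20 reviews
```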
You may notice that there is still some useless text, such as “<p lang=’en’>” at the beginning of each review, “<br/>” in the middle of the reviews, and “</p>” at the end of each review.
“<br/>” stands for a single line break. We don’t need any line breaks in the reviews, so they have to go. Also, “<p lang=’en’>” and “</p>” are the beginning and end of the HTML tag, and we need to delete them too.
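Here is one way to do that cleanup. Rather than deleting those fragments with manual string replacement, this sketch leans on BeautifulSoup's get_text(), which removes the enclosing <p> tag and the <br/> breaks in one call:

```python
clean_review = []
for r in review:
    # get_text() drops every tag; the separator keeps words apart
    # wherever a <br/> used to sit.
    clean_review.append(r.get_text(separator=' ', strip=True))
```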
Finally, we successfully got all the clean reviews with fewer than 20 lines of code.
This is just a demo of scraping 20 reviews from Yelp. In real cases, we may face many other situations. For example, we may need steps like pagination to go to other pages and extract the rest of the reviews for this shop, or we may need to scrape other information such as the reviewer's name, location, review time, rating, check-in, and so on.
To implement the above operations and get more data, we would need to learn more functions and libraries, such as Selenium or regular expressions. It would be worthwhile to spend more time drilling into the challenges of web scraping.
However, if you are looking for a simple way to do web scraping, Octoparse can be your solution. Octoparse is a powerful web scraping tool that can help you easily obtain information from websites. Check out this tutorial about how to scrape reviews from Yelp with Octoparse, and feel free to contact us when you need a powerful web scraping tool for your business or project!
Author: Jiahao Wu
Japanese article: Web Scraping with Python, Explained (PythonによるWebスクレイピングを解説)
Spanish article: Web Scraping with Python: A Step-by-Step Guide (Web Scraping con Python: Guía Paso a Paso)
More web scraping articles are available on the official website.