© 2022 - aleteo.co
lxml is an extensive library for parsing XML and HTML documents very quickly, even handling malformed tags in the process. We will also use the Requests module instead of the built-in urllib2 module because of its improvements in speed and readability. You can easily install both using pip (pip install lxml and pip install requests).
Let’s start with the imports:
Next we will use requests.get to retrieve the web page with our data, parse it using the html module, and save the results in tree:
(We need to use page.content rather than page.text because html.fromstring implicitly expects bytes as input.)
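The steps so far can be sketched roughly as follows. This is a minimal sketch, not the original listing: the function name fetch_tree is made up for illustration, and the timeout and the raise_for_status check are additions that the text does not mention.

```python
from lxml import html
import requests


def fetch_tree(url):
    """Download a page and parse it into an lxml element tree.

    fetch_tree is a hypothetical helper name; the timeout and the
    raise_for_status() call are assumptions added for robustness.
    """
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # stop early on HTTP errors such as 404
    # page.content holds the raw bytes of the response body
    return html.fromstring(page.content)
```

Calling fetch_tree with the address of the page you want to scrape returns the tree object used in the rest of this example.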
tree now contains the whole HTML file in a nice tree structure which we can go over in two different ways: XPath and CSSSelect. In this example, we will focus on the former.
XPath is a way of locating information in structured documents such as HTML or XML documents. A good introduction to XPath is on W3Schools.
There are also various tools for obtaining the XPath of elements, such as FireBug for Firefox or the Chrome Inspector. If you’re using Chrome, you can right click an element, choose ‘Inspect element’, highlight the code, right click again, and choose ‘Copy XPath’.
After a quick analysis, we see that in our page the data is contained in two elements – one is a div with title ‘buyer-name’ and the other is a span with class ‘item-price’:
Knowing this we can create the correct XPath query and use the lxml xpath function like this:
Let’s see what we got exactly:
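Both steps, the XPath queries and the printout, can be tried on a small inline sample instead of the live page. The sketch below assumes markup shaped as described above (a div with title ‘buyer-name’ and a span with class ‘item-price’); the names and prices are made-up sample data.

```python
from lxml import html

# Inline stand-in for the downloaded page (made-up sample data).
SAMPLE_PAGE = b"""
<html><body>
  <div title="buyer-name">Alice Example</div>
  <span class="item-price">$12.50</span>
  <div title="buyer-name">Bob Example</div>
  <span class="item-price">$3.99</span>
</body></html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# Each query returns a list with the text of every matching element.
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

print("Buyers:", buyers)
print("Prices:", prices)
```

On a real page, tree would come from the downloaded content rather than from SAMPLE_PAGE; the two xpath calls stay the same.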
Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and Requests. We have it stored in memory as two lists. Now we can do all sorts of cool stuff with it: we can analyze it using Python or we can save it to a file and share it with the world.
Some more cool ideas to think about are modifying this script to iterate through the rest of the pages of this example dataset, or rewriting this application to use threads for improved speed.
In this chapter, let us learn how to scrape dynamic websites and the concepts involved.
Let us look at an example of a dynamic website and see why it is difficult to scrape. We will take the search page at http://example.webscraping.com/places/default/search as our example. But how can we tell that this website is dynamic? We can judge it from the output of the following Python script, which tries to scrape data from that page −
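The effect can be reproduced offline with a minimal sketch. It assumes the server sends markup like the sample below before any JavaScript has run; the element IDs (search_term, search, results) and the script name are assumptions for illustration, not taken from the real page.

```python
from lxml import html

# What a server might send before any JavaScript runs (assumed markup).
STATIC_HTML = b"""
<html><body>
  <form>
    <input id="search_term">
    <input type="button" id="search" value="Search">
  </form>
  <div id="results"></div>
  <script src="search.js"></script>
</body></html>
"""

tree = html.fromstring(STATIC_HTML)

# Look for result links inside the results container.
links = tree.xpath('//div[@id="results"]//a')
print(links)  # nothing is found: the links only exist after JavaScript adds them
```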
The above output shows that the example scraper failed to extract information because the <div> element we are trying to find is empty.
Reverse engineering the page is useful here, because it lets us understand how the web page loads its data dynamically.
To do this, we open the browser's inspection tools for the URL (Inspect element) and then click the Network tab to see all the requests made for that web page, including search.json with a path of /ajax. Instead of accessing the AJAX data from the browser's Network tab, we can also fetch it with the following Python script −
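A request of this kind might look like the sketch below. The endpoint follows the /ajax path and the search.json name mentioned above; the query parameter names (search_term, page, page_size) and the helper names are assumptions that should be checked against what the Network tab actually shows.

```python
import requests

# Endpoint built from the /ajax path and search.json name seen in the Network tab.
AJAX_URL = "http://example.webscraping.com/ajax/search.json"


def search_params(term, page=0, page_size=10):
    """Query parameters for one page of results (assumed parameter names)."""
    return {"search_term": term, "page": page, "page_size": page_size}


def fetch_search_results(term, page=0, page_size=10):
    """Request one page of search results and decode the JSON body."""
    response = requests.get(
        AJAX_URL, params=search_params(term, page, page_size), timeout=10
    )
    response.raise_for_status()
    return response.json()  # Requests parses the JSON into Python objects
```

For example, fetch_search_results("a") would ask for the first page of results for the letter ‘a’.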
The above script allows us to access the JSON response by using the json method of the Requests response object. Similarly, we can download the raw string response and load it with Python’s json.loads function. The following Python script does this. It scrapes all of the countries by searching for the letter ‘a’ and then iterating over the resulting pages of the JSON responses.
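The paging loop can be sketched as follows. This is an illustration under stated assumptions: the JSON keys records, country and num_pages are assumed field names, the helper names are hypothetical, and two made-up pages stand in for the server so that the sketch runs offline.

```python
import json
from typing import Callable, List


def collect_countries(fetch_page: Callable[[int], str]) -> List[str]:
    """Walk the paged JSON responses and return the sorted country names.

    fetch_page(page) must return the raw JSON text for that page number.
    The keys "records", "country" and "num_pages" are assumed field names.
    """
    countries = set()
    page = 0
    while True:
        data = json.loads(fetch_page(page))  # raw string -> Python dict
        for record in data["records"]:
            countries.add(record["country"])
        page += 1
        if page >= data["num_pages"]:
            break
    return sorted(countries)


def save_countries(countries: List[str], path: str = "countries.txt") -> None:
    """Write one country name per line."""
    with open(path, "w", encoding="utf-8") as countries_file:
        countries_file.write("\n".join(countries))


# Offline stand-in for the server: two made-up pages of results.
FAKE_PAGES = [
    '{"records": [{"country": "Albania"}, {"country": "Afghanistan"}], "num_pages": 2}',
    '{"records": [{"country": "Algeria"}], "num_pages": 2}',
]

countries = collect_countries(lambda page: FAKE_PAGES[page])
print(countries)
```

Against the live site, fetch_page would instead return the text attribute of a requests.get response for the given page number, and save_countries(countries) would write the result to countries.txt.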
After running the above script, we will get the following output, and the records will be saved in a file named countries.txt.
In the previous section, we reverse engineered the web page to learn how its API works and how we can use it to retrieve the results in a single request. However, we can face the following difficulties while doing reverse engineering −
Some websites are very hard to reverse engineer. For example, if the website is built with an advanced tool such as Google Web Toolkit (GWT), the resulting JavaScript code is machine-generated and difficult to understand.
In this example, we are going to render JavaScript with the familiar Python module Selenium. The following Python code renders a web page with the help of Selenium −
First, we need to import webdriver from selenium as follows −
Now, provide the path of the web driver that we have downloaded for our browser −
Now, provide the URL that we want to open in the web browser, which is now controlled by our Python script.
Now, we can use the ID of the search box to select that element.
Next, we can use JavaScript to set the select box content as follows −
The following line of code clicks the search button on the web page −
The next line of code sets an implicit wait of up to 45 seconds, so that element lookups give the AJAX request time to complete.
Now, for selecting country links, we can use the CSS selector as follows −
Now the text of each link can be extracted to create the list of countries −