© 2022 - aleteo.co
Jul 15, 2020 Selenium is a great tool for web scraping, especially when learning the basics. But, depending on your goals, it is sometimes easier to choose an already-built tool that does web scraping for you. Building your own scraper is a long and resource-costly procedure that might not be worth the time and effort. In this article I will show you how it is easy to scrape a web site using Selenium WebDriver. I will guide you through a sample project which is written in C# and uses WebDriver in conjunction with the Chrome browser to login on the testing page and scrape the text from the private area of the website.
In this article I will show you how it iseasy to scrape a web siteusingSelenium WebDriver. I will guide you through a sample project which is written inC#and usesWebDriverin conjunction with theChromebrowser to login on thetesting pageand scrape the text from the private area of the website.
First of all we need to get the latest version ofSelenium Client & WebDriver Language Bindings and theChrome Driver. Of course, you can download WebDriver bindings for any language (Java, C#, Python, Ruby), but within the scope of this sample project I will use the C# binding only. In the same manner, you can use any browser driver, but here I will use Chrome.
After downloading the libraries and the browser driver we need to include them in our Visual Studio solution:
In order to use the WebDriver in our program we need to add its namespaces:
Then, in the main function, we need to initialize the Chrome Driver:
This piece of code searches for thechromedriver.exefile. If this file is located in a directory different from the directory where our program is executed, then we need to specify explicitly its path in theChromeDriverconstructor.
When an instance of ChromeDriver is created, a new Chrome browser will be started. Now we can control this browser via thedrivervariable. Let’s navigate to the target URL first:
Then we can find the web page elements needed for us to login in the private area of the website:
Here we search for user name and password fields and the login button and put them into the corresponding variables. After we have found them, we can type in the user name and the password and press the login button:
At this point the new page will be loaded into the browser, and after it’s done we can scrape the text we need and save it into the file:
That’s it! At the end, I’d like to give you a bonus – saving a screenshot of the current page into a file:
Get the whole project.
I hope you are impressed with how easy it is to scrape web pages using the WebDriver. You can naturally press keys and click buttons as you would in working with the browser. You don’t even need to understand what kind of HTTP requests are sent and what cookies are stored; the browser does all this for you. This makes theWebDrivera wonderful tool in the hands of a web scraping specialist.
Web scraping or data extraction from internet sites can be done with various tools and methods. The more complex sites to scrape are always the ones that look for suspicious behaviors and non-human patterns. Therefore, the tools we use for scraping must simulate human behavior as much as possible.
The tools that are developed and can simulate human behavior are testing tools, and one of the most commonly used and well-known tools as such is Selenium.
In our previous blog post in this series we talked about puppeteer and how it is used for web scraping. In this article we will focus on another library called Selenium.
Selenium is a framework for web testing that allows simulating various browsers and was initially made for testing front-end components and websites. As you can probably guess, whatever one would like to test, another would like to scrape. And in the case of Selenium, this is a perfect library for scraping. Or is it?
Selenium was first developed by Jason Huggins in 2004 as an internal tool for a company he worked for. Since then it has been evolved a lot but the concept has remained the same: a framework that simulates (or in truth really operates as) a web browser.
Basically, Selenium is a library that can control an automated version of Google Chrome, Firefox, Safari, Vivaldi, etc. You can use it in an automated process and imitate a normal user’s behavior. If for example, you would like to check your competitor’s top 10 ranking products daily, you would write a piece of code that will open a Chrome window (automatically, using Selenium), surf to your competitor’s storefront or search results on Amazon and scrape the data of their leading products.
Selenium is composed of 5 components that make the testing (or scraping) process possible:
This is a real IDE for testing. This is actually a Chrome Extension or a Firefox Add on, and allows recording, editing and debugging tests. This component is less functional for scraping since usually scraping would be done using an API.
This is the component of Selenium that actually plays the browser’s part in the scraping. The driver is just like a remote control that connects to a specific browser (like in anything else, each remote control is designed to control a specific browser), and through it, we can control the browser and tell it what to do.
This is not a part of selenium itself, but more of a tool to allow us to use multiple selenium instances on remote machines.
Since our previous article talked about Puppeteer and praised it for being the best tool for web scraping, let’s examine the differences and similarities between Selenium and Puppeteer.
For scraping purposes, the fact that Puppeteer supports only Chrome really does not matter in my opinion. It matters more for testing usages when you would want to test your app or website on different browsers.
The advantages of Puppeteer over Selenium are immense. And on top of stands the fact that Puppeteer is faster than Selenium. If you are planning on a high scale scraping operation, that may be a point to consider.
In addition, Puppeteer has more options that are vital for scraping, such as network interception. This is another great advantage over Selenium that allows you to handle each network request and response that is generated while loading a page and log them, stop them, or generate more of them.
If you are planning on scraping for business purposes, either if to build a comparison website, a business intelligence tool, or any other business-oriented purpose, we would suggest you to use an API solution.
To sum it up, here are the main pros and cons of Selenium and puppeteer for web scraping.
Works with many programming languages.
It can be used with many different browsers and platforms.
Can manually record tests/small scrape operations.
It is powered by a great community.
Slower than Puppeteer.
It has less control over the way the scraping is done and has less advanced features.
Faster than other libraries.
It is easier to use – no need of installing any external driver.
It has more control, allows more options like Network Interception.
I know Puppeteer isn’t for everyone and you might be already into Selenium, want to write a python scraper or just don’t like Puppeteer for any reason. That’s ok and in this case let’s see how to use Selenium for web scraping.
Web scraping using Selenium is rather straightforward. I don’t want to go into a specific language (We will add more language-specific tutorials in the future), but the steps are very similar in every language. Feel free to use the official Selenium Documentation for in-depth details. The first and most important step is to install the Selenium web driver component.
npm install selenium-webdriver
Install selenium in Python:
pip install selenium
Then you can start using the library according to the documentation.
Scraping with Selenium is rather straight forwards. First, you need to get the HTML of the div, component or page you are scraping. This is done by navigating to that page using the web driver and then using a selector to extract the data you need. There are 3 key points you should notice though:
This is the pitfall you can most easily fall to in case you are a programmer who is starting his scraping journey – If you do not use a good web proxy service you will get blocked. All modern sites including Google, Amazon, Airbnb, eBay, and many others use advanced anti-scraping mechanisms. Some put more effort into it than others, but all of them will start blocking you very quickly if you don’t use a proxy service and change your address every X amount of requests. What is X? depends on the website, but it varies between 5 and 20 usually.
Once you have your proxy in place and have a rotation mechanism for the IPs you use, the number of times you will be blocked by websites will be reduced to around zero.
Crawling too fast is a big problem and will cause you to hit a lot of anti scrape defenses on the websites you are scraping. Of course, you want this to be as fast as possible, but scraping too fast is also sometimes painfully slow.
Try to play with it. Start with a request per second and increase this. make sure that for this test you are using a different IP address every time and that you are trying to fetch different objects from the website you are testing.
When sending a request for a webpage or an object, the browser sends a header called User-Agent. It is necessary to change this every few requests, and always send a “legitimate” User-Agent (one that conceals the fact that this is a headless browser). Read more about User-Agents.
In the next posts we will continue to discuss the challenges you may face when scraping online information and alternatives. You are also more than welcome to check our recent article about the differences between in house scraping and web scraping API.