Web scraping (also known as web harvesting or web data extraction) is a technique for extracting large amounts of data from websites and displaying it or storing it in a file for further use. Instead of going through the difficult process of physically extracting data, web scraping pulls data directly from a website by parsing the HTML of the web page itself, using automation to retrieve countless data points from any number of websites. In Ruby, the Mechanize library is commonly used for automating interaction with websites.
Web scraping is used to crawl and extract the required data from static websites as well as JS-rendered websites. There are a few tools available for web scraping in Ruby, such as Nokogiri, Capybara, and Kimurai, but Kimurai is the most powerful framework for scraping data. Kimurai is a web scraping framework in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and it allows us to scrape and interact with JavaScript-rendered websites.
Features:
You can scrape data from JS-rendered websites, including infinitely scrollable pages, and even static websites. Amazing, right!!
You can use this framework in 2 ways: inside a Rails application, or as a standalone Ruby script. First, set it up inside a Rails application:
rails _5.2.3_ new web_scrapping_demo --database=postgresql
rails db:create
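Note: for bundle install to pick up Kimurai, add the gem to the application's Gemfile first:
gem 'kimurai'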
bundle install
rails g model WebScrapper --parent Kimurai::Base
rails db:migrate
rails g controller WebScrappers index

Add the routes in config/routes.rb:

root 'web_scrappers#index'
resources :web_scrappers
Add a link in the index view to trigger the scraper:

<%= link_to 'Start Scrap', new_web_scrapper_path %>
In the controller, kick off the spider from the new action:

def new
  WebScrapper.crawl!
end
Note: Here, crawl! performs the full run of the spider. The parse method is very important and should be present in every spider; it is the entry point of any spider.
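For reference, here is a minimal sketch of what the generated spider class (app/models/web_scrapper.rb) might look like. The start URL, XPath selector, and config values below are placeholder assumptions, not values from the original post:

class WebScrapper < Kimurai::Base
  @name = "web_scrapper"
  @engine = :mechanize
  @start_urls = ["https://example.com"] # placeholder URL
  @config = {
    user_agent: "Mozilla/5.0",
    before_request: { delay: 2..4 }
  }

  def parse(response, url:, data: {})
    # Pick out the required nodes with XPath (placeholder selector)
    response.xpath("//div[@class='item']").each do |node|
      item = { title: node.xpath(".//h2").text.strip }
      # save_to appends each structured item to results.json
      save_to "results.json", item, format: :json
    end
  end
end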
Here,
@name = the name of the spider/web scraper
@engine = specifies the supported engine
@start_urls = an array of start URLs, processed one by one inside the parse method
@config = optional; provides various custom configurations such as user_agent, delay, etc.
Note: Several engines are supported here. If you use :mechanize, no configuration or installation is involved; it works for simple HTTP requests but cannot execute JavaScript. The other engines, :selenium_chrome, :poltergeist_phantomjs, and :selenium_firefox, are all JavaScript-capable and run the browser in headless mode.
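For example, switching the spider above from plain HTTP requests to headless Chrome would be a one-line change:

@engine = :selenium_chrome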
Here, in the above parse method,
response = the Nokogiri::HTML::Document object for the requested website
url = the String URL of the processed web page
data = used to pass data between two requests
The data to be fetched from the website is selected using XPath, and then structured as per the requirement.
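As a small illustration (the selector below is a placeholder), an XPath query on the response returns a Nokogiri node set that you can map into plain Ruby structures:

product_names = response.xpath("//h2[@class='product-name']").map { |node| node.text.strip }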
Now start the server with rails s and click the 'Start Scrap' link. The scraped data is saved to a results.json file using the save_to helper of the gem. Open the JSON file and you will see the scraped data. Hooray!! You have extracted information from the static website.
The second way is to run Kimurai as a standalone Ruby script, without Rails. Install the gem and run the spider file directly:
gem install kimurai
ruby filename.rb
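A standalone spider is the same class in a single file, with an explicit require at the top and a crawl! call at the bottom. A minimal sketch (filename.rb is just the placeholder name used above, and the URL and selector are assumptions):

require 'kimurai'

class WebScrapper < Kimurai::Base
  @name = "web_scrapper"
  @engine = :mechanize
  @start_urls = ["https://example.com"] # placeholder URL

  def parse(response, url:, data: {})
    # Same XPath-based extraction as in the Rails example
    response.xpath("//div[@class='item']").each do |node|
      save_to "results.json", { title: node.xpath(".//h2").text.strip }, format: :json
    end
  end
end

WebScrapper.crawl!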
Dynamic Websites / JS rendered websites:
Pre-requisites:
Install browsers with web drivers:
For Ubuntu 18.04:
Run the kimurai setup command:
$ kimurai setup localhost --local --ask-sudo
Note: It works using Ansible. If Ansible is not installed, install it using:
$ sudo apt install ansible
$ sudo apt install -q -y unzip wget tar openssl
$ sudo apt install -q -y xvfb
Again, you can use the framework in the same 2 ways. Inside a Rails application, change the @engine from :mechanize (used for static websites) to :selenium_chrome to scrape with the Chrome driver, as shown in the sketch below. As a standalone script, install the gem and run the file as before:
gem install kimurai
ruby filename.rb
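Here is a minimal sketch of a spider for a JS-rendered, infinitely scrolling page. The class name, URL, selector, and scroll count are all placeholder assumptions, not values from the original post; with :selenium_chrome, Kimurai exposes a browser object (a Capybara session) inside parse for interacting with the live page:

class JsWebScrapper < Kimurai::Base
  @name = "js_web_scrapper"
  @engine = :selenium_chrome
  @start_urls = ["https://example.com/feed"] # placeholder URL

  def parse(response, url:, data: {})
    # Scroll a few times so the page's JavaScript loads more content
    3.times do
      browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
      sleep 2
    end
    # Re-read the rendered page as a Nokogiri document
    response = browser.current_response
    response.xpath("//div[@class='post']").each do |node|
      save_to "results.json", { text: node.text.strip }, format: :json
    end
  end
end

Because browser.current_response returns a fresh Nokogiri document, the same XPath-based extraction used for static pages works once the JavaScript has run.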
You can find the whole source code here.
Visit BoTree Technologies for excellent Ruby on Rails web development services and hire Ruby on Rails web developers with experience in handling marketplace development projects.