
Proceedings of 8th International Research Conference, KDU, Published November 2015

A Comparative Study on Web Scraping

SCM de S Sirisuriya
Department of Computer Science, Faculty of Computing, General Sir John Kotelawala Defence University, Ratmalana,
Sri Lanka

Abstract: The World Wide Web contains information of many kinds and origins, including social, financial, security and academic information. Many people access information through the internet for educational purposes. Information on the web is available in different formats and through different access interfaces, so indexing or semantic processing of data across websites can be cumbersome. Web scraping is a technique that aims to address this issue. It transforms unstructured data on the web into structured data that can be stored and analysed in a central local database or spreadsheet. There are various web scraping techniques, including traditional copy-and-paste, text grepping and regular expression matching, HTTP programming, HTML parsing, DOM parsing, web scraping software, vertical aggregation platforms, semantic annotation recognizing and computer vision web-page analysers. Traditional copy-and-paste is the most basic technique, and it becomes tiresome when people need to scrape many datasets. Web scraping software is the easiest technique to use, since all the other techniques except traditional copy-and-paste require some form of technical expertise. Hundreds of web scraping tools are available today, most of them written in Java, Python or Ruby. Both open source and commercial web scraping software exist. Tools such as Yahoo Pipes, Google Web Scrapers and the OutWit Firefox extensions are well suited to beginners in web scraping. This study gives a comparative account of web scraping techniques and well-known web scraping software. To accomplish this, we compare and contrast several web scraping techniques and some well-known web scraping software. The outcome of this study is a review of web scraping techniques and software that can be used to extract data from educational websites.

Keywords: Web Scraping, Information Extraction

I. INTRODUCTION
Data is an essential part of any research, whether academic, marketing or scientific (SysNucleus, n.d.). People might want to collect and analyse data from multiple websites. Different websites that belong to the same category display information in different formats. Even within a single website, you may not be able to see all the data at once, because the data may be spread across multiple pages under various sections. Most websites do not allow you to save a copy of the data they display to your local storage (Penman et al., 2009). The only option is then to manually copy and paste the data shown by the website into a local file on your computer, which is a very tedious job that can take a lot of time. Web scraping is a technique with which people can extract data from multiple websites into a single spreadsheet or database, so that it becomes easy to analyse or even visualize the data. The aim of this study is to offer a review of web scraping techniques and software that can be used to extract data from websites.

The rest of the paper is arranged as follows. Section II gives an overview of web scraping. Section III describes practical uses of web scraping. Section IV describes some web scraping techniques. Section V gives a detailed description of web scraping software. Finally, Section VI presents the discussion, comparing several web scraping techniques and some well-known web scraping software.

II. OVERVIEW OF WEB SCRAPING
Web scraping is a technique for extracting unstructured data from websites and transforming that data into structured data that can be stored and analysed in a database. Web scraping is also known as web data extraction, web data scraping, web harvesting or screen scraping. Web scraping is a form of data mining. The overall goal of the web scraping process is to extract information from websites and transform it into an understandable structure such as a spreadsheet, a database or a comma-separated values (CSV) file, as shown in Figure 1. Data such as item pricing, stock pricing, different reports, market pricing and product details can be gathered through web scraping. Extracting targeted
information from websites helps you make effective business decisions.

Figure 1. Structure of web scraping

III. PRACTICES OF WEB SCRAPING
• Online price comparison
• Contact scraping
• Weather data monitoring
• Website change detection
• Research
• Web mashups: integrating data from multiple sources
• Extracting offers and discounts
• Scraping job posting information from job portals
• Collecting property listings from real estate websites
• Brand monitoring
• Extracting business details from business directory websites such as Yelp and Yellow Pages
• Collecting government data
• Market analysis

IV. WEB SCRAPING TECHNIQUES
A. Traditional copy and paste
Occasionally, manual human examination and copy-and-paste is the best, or the only workable, web scraping technique. However, it is an error-prone, boring and tiresome technique when people need to scrape many datasets (“Web scraping,” 2015a).

B. Text grepping and regular expression matching
This is a simple yet powerful approach to extracting information from web pages. It is based on the UNIX grep command or on the regular expression matching facilities of programming languages (“Web scraping,” 2015b).

C. Hypertext Transfer Protocol (HTTP) Programming
This technique is used to extract data from static and dynamic web pages. Data can be retrieved by posting HTTP requests to the remote web server using socket programming (“Web scraping,” 2015b).

D. Hyper Text Markup Language (HTML) Parsing
Semi-structured data query languages, such as XQuery and the Hyper Text Query Language (HTQL), can be used to parse HTML pages and to retrieve and transform page content (“Web scraping,” 2015b).

E. Document Object Model (DOM) Parsing
By embedding a full-fledged web browser, such as the Internet Explorer or Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages (“Web scraping,” 2015b).

F. Web Scraping Software
Many software tools are available that can be used to customize web scraping solutions. Such software may attempt to automatically recognize the data structure of a page, provide a recording interface that removes the need to write web scraping code manually, offer scripting functions that can be used to extract and transform content, or provide database interfaces that can store the scraped data in local databases (“Web scraping,” 2015b).

G. Vertical aggregation platforms
Several companies have developed vertical-specific harvesting platforms. These platforms create and monitor a multitude of “bots” for specific verticals with no direct human involvement and no work related to a specific target site. The preparation involves establishing the knowledge base for the entire vertical, after which the platform creates the bots automatically. The platform's robustness is measured by the quality of the information it retrieves (usually the number of fields) and by its scalability (how quickly it can scale up to hundreds or thousands of sites). This scalability is mostly used to target the long tail of sites that common aggregators find complicated or too labour-intensive to harvest content from (“Web scraping,” 2015b).

H. Semantic annotation recognizing
The pages being scraped may contain metadata or semantic markup and annotations, which can be used to locate specific data snippets. If the annotations are embedded in the pages, as with Microformats, this technique can be viewed as a special case of DOM parsing. Alternatively, the annotations, organized into a semantic layer, are stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from this layer before scraping the pages (“Web scraping,” 2015b).

I. Computer vision web-page analysers
There are efforts using machine learning and computer vision that attempt to identify and extract information from web pages by interpreting pages visually, as a human being might (“Web scraping,” 2015b).
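
As an illustration of how two of the techniques in Section IV turn a page into structured data, the following minimal Python sketch applies regular expression matching (IV-B) and HTML parsing (IV-D) to a small in-memory page and writes the result as CSV. The sample page, the e-mail pattern and the names extract_emails, TableParser, scrape_table and rows_to_csv are assumptions made for this illustration and are not taken from any of the tools reviewed in this paper. A real scraper would first retrieve the page over HTTP (IV-C), and the simple pattern used here would not cover every valid e-mail address.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would first fetch it over HTTP.
SAMPLE_HTML = """
<html><body>
  <table id="prices">
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Widget</td><td>$4.50</td></tr>
    <tr><td>Gadget</td><td>$12.00</td></tr>
  </table>
  <p>Contact: sales@example.com or support@example.com</p>
</body></html>
"""


def extract_emails(text):
    """Regular expression matching: return e-mail-like strings in page order."""
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)


class TableParser(HTMLParser):
    """HTML parsing: collect the cell text of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # row currently being built, or None
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text outside table cells (including whitespace) is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_table(html):
    """Parse the HTML and return the table rows as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def rows_to_csv(rows):
    """Serialize the structured rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


emails = extract_emails(SAMPLE_HTML)
rows = scrape_table(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(emails)
print(csv_text)
```

The regular expression works on the raw text and knows nothing about the page structure, while the parser follows the tag structure and ignores everything outside table cells. This difference is one reason why the two techniques are often combined in practice.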

V. WEB SCRAPING SOFTWARE
Web scraping software tools automate the manual copy-and-paste work of gathering large amounts of data from websites such as directory sites, real estate sites, classified websites and job boards. Suppose you want to scrape real estate property details for the UK. Without such tools, you would need to appoint several people to visit each property page and copy and paste the details from the websites into Excel. Done this way, it would take days or even months to get your property data ready to use. Web scraping automates this manual work programmatically by visiting each page, parsing the HTML and extracting the data. A number of web scraping tools are available on the market that can help you scrape data from the websites you want. The following is a list of some scraping tools.

The price of web scraping software varies with the features it provides and with the support and upgrade period. You can usually get a trial version and check whether it has all the scraping features that you need (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

A. Visual Web Ripper
Visual Web Ripper is one of the most advanced web scraping tools, created by the Sequentum group in 2006. It allows you to scrape data from websites such as business directories, simple web pages, classified sites, forums and e-commerce sites (eBay, Amazon, Magento sites). Once scraping finishes, the data can be exported to structured CSV, Excel or XML format (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

B. Web Content Extractor
Web Content Extractor (WCE) is a simple, user-oriented application developed by Newprosoft. It has a good wizard that guides the user through setting up a scraper. You can scrape data from a website with a few clicks, and Web Content Extractor is excellent for putting data into different formats such as Excel, text, HTML, Microsoft Access database, Structured Query Language (SQL) script file, MySQL script file, Extensible Markup Language (XML) file, HTTP submit form and Open Database Connectivity (ODBC) data source (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.) (“Software for Web Scraping,” n.d.).

C. Mozenda Web Scraper
Mozenda Web Scraper is a powerful web data extraction service. It can extract data from websites as well as PDFs. It has a simple point-and-select interface, so non-technical users can also build a simple scraper. Mozenda runs your scraping project (agent) in its cloud environment, which is the main difference between Mozenda and other scrapers (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

D. UiPath – Robotic Process Automation
UiPath can automatically log in to a website, extract data spanning multiple web pages, and filter and transform it into the format of the user's choice before integrating it into another application or web service. UiPath resembles a real browser with a real user, so it can extract data that most automation tools cannot even see (Savinkin, n.d.). No programming is needed to create intelligent web agents using its drag-and-drop graphical designer, but .NET developers retain complete control over the data (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

E. OutWit Hub
OutWit Hub is a powerful Firefox extension aimed at all users. The contents extracted from a web page are presented in an easy, visual way, without requiring any programming skills or advanced technical knowledge. Users can easily extract links, images, email addresses, data tables, etc. from a series of pages without ever seeing the source code. Extracted data can be exported to CSV, HTML, Excel or SQL databases, while images and documents are saved directly to your hard disk. OutWit Hub is well suited to beginners in web scraping (“Software for Web Scraping,” n.d.).

F. Screen Scraper
Screen Scraper is an advanced web scraping application that comes in three editions: Enterprise, Professional and Basic. The Basic edition is free to download and use, with basic scraping features (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.). The other editions take an inexperienced user considerable time to master. An important feature is that Screen Scraper can integrate with other systems, and its Java support allows you to write serious scripts for a large-scale program (Savinkin, n.d.).

G. WebHarvy
WebHarvy is a lightweight, visual, point-and-click scraping tool. It takes minimal time to master and to extract data. WebHarvy is best suited for quick scraping of text, URLs and images from web pages. Extracted data can be saved into common formats (CSV, Tab Separated Values (TSV), XML) and also as SQL for database input (SysNucleus, n.d.). It is best known for tabular data
extraction; it can extract data from well-structured HTML. It cannot extract data by deep crawling, and it cannot scrape Ajax-based data (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

H. Easy Web Extract
Easy Web Extract, by Web2Mine (founded in 2009), is designed for simple and quick data extraction. This scraping tool is written using .NET technology and allows you to apply built-in data transformation scripts (C#, VB, JS). Easy Web Extract is excellent for exporting data into Excel (CSV), text, XML file, HTML, MS Access DB, SQL script file, MySQL script file, HTTP submit form and ODBC data source. One shortcoming is that, while creating a scrape project, loading the URL sometimes takes a long time (Savinkin, n.d.).

I. WebSundew
WebSundew is an easy-to-use web scraping tool that provides a point-and-click user interface for defining the fields you want to scrape from web pages. This screen scraper is designed for high productivity and fast data extraction. The Enterprise edition allows the scraper to run on a remote server and publish extracted data through FTP (Savinkin, n.d.). It also supports image and file extraction. It can perform multilevel web extraction by deep crawling (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

J. Web Data Extractor
Web Data Extractor, by Automation Anywhere (United States, founded in 2003), is a web scraping tool specifically designed for scraping links, meta tags, body text, emails, and phone and fax numbers. It is not well suited to rule-based web scraping (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

K. Helium Scraper
Helium Scraper is a powerful web scraping tool with the features one needs to scrape data from web pages. It has a point-and-click user interface for defining scraping fields. It supports Ajax-based scraping, CAPTCHA-based scraping and proxies (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

L. WebExtractor 360
WebExtractor 360 is an open source web scraper. It uses regular expressions to scrape data from web pages, so you need a good knowledge of regular expressions to work with it. This scraper is completely free, and its source code is also provided.

M. FMiner
FMiner is one of the best visual web scraping tools and is built in Python. It has a nice diagrammatic representation of the scraping flow and actions. It also allows you to run custom Python code (“List of Web Harvester, Data Scraper, Web Scraping Software and Tools,” n.d.).

N. Scrapy
Scrapy is an open source and collaborative framework for extracting the data you need from websites. It is written in Python and runs on Linux, Windows and Mac.

O. import.io
import.io is a free online web scraper, founded in March 2012, that allows you to scrape various types of information and then organize the extracted information into data sets. import.io is a cloud-based platform, so you do not need to run the scraper on your own machine, and all your data is kept in the cloud. import.io is usable by all kinds of people, regardless of their technical ability (“Software for Web Scraping,” n.d.).

P. Web Scraper
Web Scraper offers two options for users: a free Google Chrome extension and an Enterprise Data Extraction Service. In the Google Chrome extension, the user creates a plan (sitemap) describing how a website should be traversed and what should be extracted. Using these sitemaps, Web Scraper navigates the site accordingly and extracts all the data. The scraped data can later be exported as CSV. The Enterprise Data Extraction Service is tailored to the level of results you require. It allows you to extract large amounts of data, run multiple scraping jobs at once, and even run them on a set schedule.

VI. DISCUSSION
Visual Web Ripper, Helium Scraper, Screen Scraper, OutWit Hub, Mozenda, WebSundew, Web Content Extractor and Easy Web Extract are commercial web scraping tools. Screen Scraper has a free Basic edition, OutWit Hub has a free Light version, and all the others have free trial versions. WebExtractor 360 and Scrapy are open source web scraping tools. import.io is a free online web scraper. The main difference between Mozenda and other scrapers is that it runs your scraping projects in the cloud.

Table 1. Comparison of Web Scraping Software

Web Scraping Software | Operating System | Data Export Formats
Visual Web Ripper | Windows | CSV, Excel, XML, SQL Server, MySQL, SQLite, Oracle and OleDB, customized C# or VB script file output
Helium Scraper | Windows | CSV, XML, MS Access database, MySQL script file
Screen Scraper | Windows, Mac, Unix/Linux | Text, HTML, SQL script file, MySQL script file, XML file, HTTP submit form
OutWit Hub | Windows, Mac OS X, Linux | CSV (TSV), HTML, Excel or SQL script
Mozenda | Windows | CSV, TSV or XML only
WebSundew | Windows | Text, CSV, Excel, XML; SQL Server, MySQL, Oracle and JDBC-compatible DB (Pro and Enterprise editions)
Web Content Extractor | Windows | Excel, text, HTML, MS Access DB, SQL script file, MySQL script file, XML file, HTTP submit form, ODBC data source
Easy Web Extract | Windows | Excel (CSV, TSV), text, HTML, MS Access DB, SQL script file, MySQL script file, XML file, HTTP submit form, ODBC data source

According to Table 1, all of the compared tools support the Windows operating system, and only Screen Scraper and OutWit Hub also run on Mac and Linux. Excel, CSV and XML files are the most common data export formats.

Disparate Data Collection, Email Address Extraction, Image Extraction, IP Address Extraction, Phone Number Extraction and Web Data Extraction are features common to import.io, Visual Web Ripper, Easy Web Extract and FMiner.

Disparate Data Collection, Document Extraction, Email Address Extraction, Image Extraction, Phone Number Extraction, Pricing Extraction and Web Data Extraction are key features of Helium Scraper.

Email Address Extraction, Image Extraction and Web Data Extraction are the main features of Web Data Extractor.

OutWit Hub and Visual Web Ripper are two scrapers that can extract HTML table and list data.

According to this comparative study, most web scrapers are quite generic and are mostly designed to perform common, simple tasks. In other words, they may not be as flexible and universal as you would expect. All web scraper developers try to make their products scrape all kinds of web pages, but we found that some web scraping software is better suited to one type of task and some to another.

ACKNOWLEDGMENT
The author would like to thank Dr. L. Ranathunga, Prof. S.P. Karunanayaka and Prof. N.A. Abdullah for their support.

REFERENCES
List of Web Harvester, Data Scraper, Web Scraping Software and Tools [WWW Document], n.d. WebData Scraping. URL http://webdata-scraping.com/web-scraping-software/ (accessed 6.9.15).

Penman, R.B., Baldwin, T., Martinez, D., 2009. Web scraping made simple with site scraper. Text.

Savinkin, I., n.d. UiPath – Robotic Process Automation Software. Web Scraping.

Savinkin, I., n.d. Screen Scraper Review. Web Scraping.

Savinkin, I., n.d. Easy Web Extract Review. Web Scraping.

Savinkin, I., n.d. WebSundew Data Extractor Review. Web Scraping.

Software for Web Scraping, n.d. Web Scraping.

SysNucleus, n.d. WebHarvy Web Scraper [WWW Document]. URL https://www.webharvy.com/articles/what-is-web-scraping.html (accessed 6.3.15).

Web scraping, 2015a. Wikipedia Free Encycl.

Web scraping, 2015b. Wikipedia Free Encycl.

BIOGRAPHY OF AUTHORS

S.C.M. de S Sirisuriya is a lecturer at General Sir John Kotelawala Defence University. She received her B.Sc. degree in Computer Science from the University of Sri Jayewardenepura. She completed her Master of Computer Science degree at the University of Colombo School of Computing. She is interested in the field of e-learning content evaluation and is reading for her MPhil degree in that area.
