

I.J. Information Technology and Computer Science, 2018, 1, 32-39
Published Online January 2018 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijitcs.2018.01.04

A Systematic Study of Data Wrangling


Malini M. Patil
Associate Professor, Dept. of Information Science and Engineering.
J.S.S Academy of Technical Education, Bengaluru, Karnataka
E-mail: [email protected]

Basavaraj N. Hiremath
Research Scholar, Dept. of Computer Science and Engineering,
JSSATE Research Centre, J.S.S Academy of Technical Education, Bengaluru, Karnataka
E-mail: [email protected]

Received: 26 September 2017; Accepted: 07 November 2017; Published: 08 January 2018

Abstract—The paper presents the theory, design and usage aspects of the data wrangling process used in data warehousing and business intelligence. Data wrangling is defined as an art of data transformation or data preparation. It is a method adapted for basic data management, in which data is properly processed and shaped, and is made available for the most convenient consumption by potential future users. Large historical data is either aggregated or stored as facts or dimensions in data warehouses to accommodate large ad hoc queries. Data wrangling enables fast processing of business queries with right solutions for both analysts and end users. The wrangler provides an interactive language and recommends predictive transformation scripts. This helps the user gain insight while reducing manual iterative processes; decision support systems are the best examples here. The methodologies associated with preparing data for mining insights are highly influenced by the impact of big data concepts, from the data source layer up to self-service analytics and visualization tools.

Index Terms—Business Intelligence, wrangler, prescriptive analytics, data integration, predictive transformation.

I. INTRODUCTION

The evolution of the data warehouse (DWH) and business intelligence (BI) started with a basic framework for maintaining a wide variety of data sources. In traditional systems, the data warehouse is built to achieve compliance auditing, data analysis, reporting and data mining. Large historical data is either aggregated or stored in facts to accommodate ad hoc queries. In building these dimensional models, the basic features focused on are 'clean data' and 'integrate data', so that when a query is requested from the downstream applications it supports meaningful analysis and decision making. This process of cleansing relieves computational complexities at the business intelligence layer and also helps performance. The two key processes involved are detecting discrepancies and transforming the content to a standard form to carry it to the next level of the data warehouse architecture. The wrangler provides an interactive transformation language with the power of learning, recommending predictive transformation scripts to users so that they gain insight into data with fewer manual iterative processes. There are also tools with learning methodologies that produce predictive scripts. Finally, the summary data set is published in a format compatible with a data visualization tool. In metadata management, data integration and cleansing play a vital role; their role is to understand how to utilize the automated suggestions for changing patterns, be it data types or mismatched values, in standardizing data or identifying missing values.

The term data wrangling is defined as a process of preparing data for analysis with data visualization aids that accelerate the process [1]. It allows reformatting, validating, standardizing, enriching and integrating a variety of data sources, and also provides room for self-service by allowing iterative discovery of patterns in the datasets. Wrangling is not dynamic data cleaning [2]. The process manages to reduce inconsistency by cleaning incomplete data objects and deleting outliers identified as abnormal objects; these methods involve distance metrics to produce a consistent dataset. Techniques of data profiling [3] are involved in a few of the data quality tools, which consist of iterative processes to clean corrupted data. These techniques have limitations in profiling data of multiple relations with dependencies; the optimization techniques and algorithms need to be defined in the tools.

Data wrangling refers to 'data preparation' associated with business-user savvy, i.e. self-service capability, and enables faster time to business insights and faster time to action on business solutions for business end users and analysts in today's analytics space. As per the recent best practices report from Transforming Data with Intelligence (TDWI) [4], the share of time spent preparing data compared to the time spent performing analysis is considerable, to the tune of 61 to 80 percent. The report emphasizes challenges like limited data access, poor data quality and delayed data preparation tasks. The best practices followed are as follows:

1. Evolve as an independent entity without the involvement of information technology experts, as that is inefficient and time consuming.
2. Data preparation should involve all types of corporate data sets, like data warehouses/data lakes, BI data, log data, web data, and historical data in documents and reports.
3. Create a hub of data community which eases collaboration for individuals and the organization, making them more informed, agile and productive.

The advent of more statistical and analytical methods for arriving at decision-making solutions for business needs prompted the use of intensive tools and graphical-user-interface-based reports. This advancement paved the way for in-memory and 'data visualization' tools in the industry, like Qlik Sense, Tableau, SiSense, and SAS. But their growth, though very fast, was limited to the 'BI reporting area'. On the other hand, managing data quality and the integration of multiple data sources on the upstream side of the data warehouse was inevitable. Because of the generous size of data, as described by the authors of [5], a big data ecosystem framework has emerged to cater to the challenges posed by the five dimensions of big data, viz. volume, velocity, veracity, value and variety. This trend has triggered growth in technology towards designing self-service, high-throughput systems with reduced deployment duration in all units of an end-to-end solution ecosystem. The data preparation layer focuses on "wrangling" of data by data wrangler tools: data management processes that increase the quality and completeness of the overall data while keeping it as simple and flexible as possible. The features of data wrangler tools involve natural-language-based suggestive scripts, transformation predictions, reuse of history scripts, formatting, data lineage and quality, and profiling and cleaning of data. The big data stack includes the following processes to be applied to data [6]: data platform integration, data preparation, analytics, and advanced analytics.

In all these activities data preparation consumes more time and effort than the other tasks, so the data preparation work requires self-service tools for data transformation. The basic process of building and working with the data warehouse is managed by extract, transform and load (ETL) tools. They have focused on the primary tasks of removing inconsistencies and errors in data, termed data cleansing [7], in decision support systems. The authors specify the need for a 'potter's wheel' approach to cleansing processes, detecting discrepancies in parallel and transforming them through an optimized program, in [8].

II. RELATED WORK

To wrangle data into a dataset that provides meaningful insights and to carry out the cleansing process, it once required writing code in idiosyncratic languages such as Perl and R, and editing manually with tools like MS-Excel [9]. There are processes where the user might use an iterative framework to cleanse the data at all stages, such as analysis in the visualization layer and on the downstream side. Understanding the sources of data problems indicates the level of data quality: for example, sensor data is collected with automated edits in a standard format, whereas manually edited data can carry relatively more errors and follow varying formats.

In later years, many research works were carried out on developing query and transformation languages. The authors of [10] suggest that future research should focus on data source formats, split transforms, complex transforms and format transforms, which extract semi-automatically from restricted text documents. The research works continued along with advancements in the structure and architecture of ETL tools. These structures helped in building warehousing for data mining [11]. The team of authors [12] from Berkeley and Stanford universities made breakthrough work in discovering the methodology and processes of the wrangler: an interactive visual specification of data transformation scripts, embedding learning algorithms and self-service recommending scripts which aid validation of data, created by automatic inference of relevant transforms. The description of data transformation in [13] specifies an exploratory way of data cleansing and mining which evolved into interesting statistical results; the results were published in an economic education commission report, as this leads to statutory regulations for organizations in Europe. The authors of [14] provide an insight into specifications for cleansing data in large databases. In publication [15], the authors stipulated the need for intelligently creating and recommending rules for reformatting; the algorithm automatically generates reformatting rules, dealing with an expert system called a 'tope' that can recognize and reformat. The concepts of the Stanford team [7] became a visualized tool, and it is patented for future development into enterprise-level, scalable and commercial tools.

This paper discusses the overview of the evolution of data integrator and data preparation tools used in the process of data wrangling. A broad description of the strategy for evolving from basic data cleansing tools to automation and self-service methods in the data warehousing and business intelligence space is discussed. To illustrate this, the desktop version of Trifacta software, a global leader in wrangling technology, is used as an example [16]. The tool creates a unique experience of partnership between user and machine. A data flow is also shown, with a case study of sample insurance data, in the later part of the paper. The design and usage of data wrangling are dealt with in this paper. The paper is organized as follows: in section III, the latest wrangling processes are defined in detail, which made the data warehousing space identify self-service methods to be carried out by business analysts, in order to understand how their own data gives insights to be derived from the dataset. Section III A illustrates a case study of the data wrangling process with a sample dataset, followed by conclusions.

Copyright © 2018 MECS I.J. Information Technology and Computer Science, 2018, 1, 32-39
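The shift described in Related Work, from one-off cleansing code in Perl, R or Excel to recorded, replayable transformation scripts, can be illustrated with a minimal sketch (plain Python; the record fields and cleansing steps are invented for illustration and are not taken from any tool discussed here):

```python
# Sketch of "recorded transform scripts": each cleansing step is a named
# function, and the recorded sequence can be replayed on later batches,
# which is the reuse-of-history idea that wrangler tools automate.
# Field names ("policy_id", "state") are hypothetical.

def strip_whitespace(row):
    """Trim stray spaces from every text field."""
    return {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}

def upper_state(row):
    """Standardize the state code to upper case."""
    row = dict(row)
    row["state"] = row["state"].upper()
    return row

# The recorded script: an ordered list of transforms.
script = [strip_whitespace, upper_state]

def replay(script, rows):
    """Apply every recorded transform, in order, to each record."""
    for step in script:
        rows = [step(r) for r in rows]
    return rows

batch = [{"policy_id": " p-101 ", "state": "ka"},
         {"policy_id": "p-102", "state": " mh "}]
clean = replay(script, batch)
# clean[0] == {"policy_id": "p-101", "state": "KA"}
```

Because the script is data rather than a sequence of ad hoc manual edits, the same steps can be rerun unchanged when a new batch arrives; this is the property that separates a wrangling script from spreadsheet editing.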


A. Overview of Data Wrangling Process

For building and designing different DWH and BI roadmaps for any business domain, the following factors are considered, namely:

• Advances in the technology landscape
• Performance of the complete process
• Cost of resources
• Ease of design and deployment
• Complexity of data and business needs
• Challenges in integration and data visualization
• Rapid development of data analytics solution frameworks

Earlier, business intelligence gave solutions for business analysis; now predictive and prescriptive analytics guide and influence the business in 'what happens' scenarios. These principles guide the data management decisions about data formats, data interpretation, data presentation and documentation. Some of the jargon associated is 'data munging' and 'janitor work'.

The purpose and usage of the data lake will not yield the expected results unless the information processing is fast and the underlying data governance is built in a well-structured form. The business analysis team could not deal with data stored in Hadoop systems, and could not bridge the techno-functional communication to accomplish what the business team wants, as most of the data is behind the frame, built and designed by the IT team. These situations end in the need for self-service data preparation processes.

Fig.1. Data Flow diagram for data preparation

This trend has fuelled the fast growth of enterprise ETL [8] or data integration tools, i.e. building the methodologies, processing frameworks and layers of data storage. In the process of evolving transformation rules for specific raw data, various data profiling and data quality tools are plugged in. In handling various data sources to build enterprise data warehouse relationships, a "data integrator" tool is used, which forms the main framework of ETL tools. At the other end, the robust reporting tools, the front end of the DWH built downstream of the ETL framework, were also transformed to cater to both canned and real-time reports under the umbrella of business intelligence, which helps in decision making.

Fig. 1 shows a typical data flow which needs to be carried out for preparing data subjected to exploratory analytics. In this data flow, the discussion of big data is carried out in two cases: one, a cluster with a Hadoop platform, and the other, a multi-application and multi-structured environment called a "data lake". In both cases the data preparation happens with data wrangling tools which have all the transformation language features. The data obtained from a data lake platform poses many challenging transformations, i.e. not only reshaping or resizing but also data governance tasks [17], which resemble "governance on need" [6]; the schema of the data is "schema on read", which is beyond the predefined silo structure. This situation demands self-service transformation tasks beyond IT-driven tasks. So, on-hand prescriptive scripts and visual and interactive data profiling methods support guided data wrangling steps to reshape and load data for the consumption of exploratory analytics. The analysis of the evolution of enterprise analytics is shown in Fig. 2.

Fig.2. Evolution of Enterprise Analytics

The self-service tool helps to identify typical transformation data patterns in easy-to-view visual forms. Four years back there were few frameworks, with open source tools released as data quality tools to be executed by IT teams. For example, TALEND DQ emerged as an open source tool for building solutions to analyze and design transformations by IT teams, but such tools lack automation scripts and predictions for probable transformations. On the other side, advancements in learning algorithms, pattern classification and fuzzy algorithms helped enterprise-level solution providers design transformation languages. The complexity of data size for identifying business analytical solutions is one of the important challenges among the five dimensions of big data.

The authors of [18] are of the opinion that wrangling is both an opportunity and a problem in the context of big data. The unstructured form of big data makes manual ETL approaches a problem. If the methods and techniques of wrangling become cost effective, impractical tasks are made practical.
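The "schema on read" behaviour mentioned above can be sketched in a few lines: records land in the lake as raw text, and a schema is imposed only when the data is read for analysis. The schema and field names below are hypothetical, not from the paper's dataset:

```python
# "Schema on read" sketch: raw records are stored untyped, and each
# reader applies its own schema, with type coercion and defaults, only
# at read time, rather than enforcing a warehouse schema on write.
import json

raw_lake = [
    '{"policy_id": "101", "premium": "1200.50"}',
    '{"policy_id": "102"}',               # premium missing in the raw record
]

schema = {"policy_id": int, "premium": float}

def read_with_schema(lines, schema, default=None):
    """Parse raw records and coerce each field per the schema at read time."""
    out = []
    for line in lines:
        rec = json.loads(line)
        out.append({f: (cast(rec[f]) if f in rec else default)
                    for f, cast in schema.items()})
    return out

rows = read_with_schema(raw_lake, schema)
# rows[0] == {"policy_id": 101, "premium": 1200.5}
```

The raw store never rejects a record, so missing fields surface as explicit defaults during analysis; this is the trade-off, flexibility on write against validation work on read, that drives the need for self-service preparation tools.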


Therefore, the traditional ETL tools struggled to support the analytical requirements and adapted their processes by framing tightly governed mappings released as builds, with considerable usage of BI and DWH methodologies. The data wrangling solutions were delivered fast by processing the 'raw' data into the datasets required for analytical solutions. Traditionally, both ETL technologies and data wrangling solutions are data preparation tools, but gearing up any analytics initiative requires cross-functional effort spanning various business verticals of organizations.

There is a need for analysis of data at the beginning, which ETL tools lack; this can be facilitated by wrangling processes. The combination of visualization, pattern classification of data using learning algorithms, and the interaction of analysts makes the process easy and fast, constituting self-serviced data preparation methods. This transformation language [19] is built with visualization interfaces to display profiles and to interact with the user by recommending predictive transformations for the specific data in wrangling processes. Earlier, programming [11] was used for automating transformations. It is difficult to program scripts for complex transformations using a programming language, spreadsheets and schema mapping jobs. So, an outcome of research was the design of a transformation language; the following basic transformations were considered in this regard.

Sizing: The data type and length of a data value matter a lot in cleansing, rounding and truncating integer and text values respectively.
Reshape: The data is considered in different shapes; pivot/un-pivot arranges rows and columns into a relevant grid frame.
Look up: To enrich data with more information, such as adding attributes from different data sources, provided a key ID is present.
Joins: Joins combine two different datasets with join keys.
Semantics: To create semantic mappings of meanings and relationships with learning tools.
Map positional transforms: To enrich the available data with time and geographical reference maps.
Aggregations and sorting: Aggregations (count, mean, max, list, min, etc.) by groups of attributes facilitate drill-down, with sorting by specific ranking of attribute values.

These transformation activities can be done by an architecture framework which includes connectivity to data sources of all sizes. This process supports the task of metadata management, as it is more effective for enrichment of data. An embedded knowledge learning system, i.e. using machine learning algorithms, is built to give the user data lineage and matching data sets. The algorithm efficiently identifies categories of attributes, such as geospatial [20]. A frame of the transformation language is depicted in the sample script of Fig. 3. It is used by a few wrangling tools which use their own enterprise-level 'wrangling languages' [15]; for example, the Refine tool uses the Google Refine Expression Language [14]. This framework must have an in-built structure to deal with data governance and security, whereas an ETL tool frames an extra module built at enterprise level. Though the current self-service BI tools give limitless visual analytics to showcase and understand data, they do not manage to give end-to-end data governance lineage among data sets. Wrangling languages enable expert users to perform expressive transformations with less difficulty and tedium; they can concentrate on analysis instead of being bogged down in preparing data. The design is built to suggest transforms without programming. The wrangler language uses the input data type as a semantic role to reduce tasks arising out of execution and evaluation, by providing natural language scripts and visual transform previews [8].

Fig.3. Sample script of transformation language

B. Tools for Data Wrangling

The data wrangling technology is evolving at a very fast rate to establish itself as enterprise-level, industry-standard software (self-service data preparation tools) [14]. To quote a few: ClearStory Data, Trifacta, Tamr, Paxata and Google Refine; other tools include Global IDs, IBM DataWorks and Informatica Springbok. These tools are basically used in industry as data transformation, data harmonization, data preparation, data integration, data refinery and data governance tools [15]. Currently Trifacta and Paxata are the strong performer tools [16]. In some organizations, the deployment is done with both ETL solutions and wrangling tools, as ETL solutions do the primary data integration and load into the enterprise data warehouse; from this system, the business users can experience data analysis by exploring wrangling solutions. Most data wrangling tools use predictive transformation scripts with underlying machine learning algorithms, visualization methods and scalable data technologies, namely Hadoop-based infrastructure framed by Cloudera, Hortonworks and IBM. In the earlier phase, most of the "traditional" enterprise ETL/BI tools were cleansing structured and semi-structured data to fit the RDBMS (relational database management system) frame. But the advent of big data challenged the growth of ETL tools, which were reshaped with various add-ons as plugins into the existing framework. So, the enterprise ETL tools were reformed as individually bundled tools to carry out specific purposes as per industry needs: data quality, data integrator, big data plugins, and cloud integrators.
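As a rough illustration of three of the basic transformations catalogued above (sizing, look up, and aggregation), here is a plain-Python sketch; the records, columns and branch names are invented and do not come from the paper's case study:

```python
# Plain-Python sketch of three catalogued transforms: "sizing" (rounding
# a value to whole minutes), a key-based lookup (enrichment from a
# second source), and a group-wise aggregation for drill-down.

policies = [
    {"policy_id": 1, "branch": "north", "travel_min": 31.6},
    {"policy_id": 2, "branch": "north", "travel_min": 12.2},
    {"policy_id": 3, "branch": "south", "travel_min": 45.9},
]
branch_lookup = {"north": "Bengaluru", "south": "Mysuru"}

# Sizing: represent travel time as whole minutes, as in the case study.
for p in policies:
    p["travel_min"] = round(p["travel_min"])

# Look up: enrich each record with an attribute from a second source,
# keyed on the branch code.
for p in policies:
    p["city"] = branch_lookup[p["branch"]]

# Aggregation: count of policies and maximum travel time per branch.
summary = {}
for p in policies:
    s = summary.setdefault(p["branch"], {"count": 0, "max": 0})
    s["count"] += 1
    s["max"] = max(s["max"], p["travel_min"])
# summary == {"north": {"count": 2, "max": 32}, "south": {"count": 1, "max": 46}}
```

A wrangling tool expresses each of these as a single declarative step in its transformation language and previews the result visually; the sketch only shows what each step computes.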


In parallel, in the BI space, as discussed earlier, because of the fast growth of data visualization and in-memory tools and their ease of use, the tools evolved into self-service BI tools that leverage ease of design and development, i.e. less intervention of IT resources and more use of domain expertise, to be called 'self-service tools'.

III. DATA WRANGLING PROCESS USING TRIFACTA

This section presents the data wrangling process using the desktop version of the Trifacta tool. The description of the dataset used for the wrangling process is also discussed.

A benchmark dataset used to investigate the features of the wrangling tool is taken from the USA government dataset catalog [21]. The data set is from the insurance domain and is publicly available, published to emphasize economic growth by the USA federal forum on open data. This case study has three different files to be imported, with scenarios of cleansing and standardizing [22]. The dataset has variations of data values with different data types, like integer for policy number, date format for policy date, travel time in minutes, and education profile in text, and has a primary key to process any table joins. The dataset has the scope to enrich, merge, identify missing values, round to decimals, and visualize in the wrangling tool with perfect variations in the attribute histograms (observed in Fig. 4).

Fig.4. Visualization of distribution of each data attribute

An example insurance company is considered to investigate its business insights, and a sample data set is selected to wrangle. This case study has three different files to be imported, with scenarios of cleansing and standardizing [22]: the main transaction file, the data file enriched by merging customer profiles, and the add-on data from different branches. Trifacta enables connectivity to various data sources. The data set is imported using the drag and drop features available in Trifacta and saved as a separate file. The data can be previewed to get a first look at the attributes and to know their relevance after importing. The data analysis is done in a grid view displaying each attribute and its data type patterns, in order to give clear data format quality tips, as shown in Fig. 4.

The user can get predictive transformation tips on the processed data values for standardization, cleaning and formatting, displayed at the lower pane of the grid, like rounding of integer values; e.g., the travel time of each insured vehicle is represented in minutes as a whole number and not as decimals. If a missing value is found, Trifacta intelligently suggests the profile of data values in each attribute, which is easy to analyze. To get an insight into valuable customers, the user can concentrate on data enrichment; one instance is a formatting suggestion for the date format.

All the features can be viewed in Fig. 5. The data enrichment is done by bringing in profile data of the nature of personal information, appended through lookup attributes. The lookup parameters are set and mapped to the other files as shown in Fig. 6.

Fig.5. Data format script of transformation language

Trifacta provides an option of retaining or deleting information by selecting a number of attributes in the file, as shown in Fig. 6. Trifacta also supports editing user-defined functions, which can be programmed using other languages like Java and Python. This feature can be used for doing sentiment analysis with social media data, as Trifacta is able to import JSON and other text data. Finally, merging of the data files received from other branches with the existing records is done by opening the tool menu and using the union tool, which prompts for the right keys and attributes. All the modified or processed data transformation steps are stored as stepwise procedures which can be reused and modified for future use. A graphical representation of the flow diagram is shown in Fig. 7. Once the job is run, the result summary parameters can be seen in Fig. 8. Data enrichment, data cleansing and data merging are done with simple steps of visualizing and analyzing the data; finally the summarized result data set is stored in the native format of a data visualization tool like Tableau or QlikView, or in CSV and JSON format.
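The case-study steps just described (union of branch files, lookup enrichment, date standardization, and missing-value detection) can be approximated outside Trifacta in a few lines of Python; the records and keys below are invented for illustration, while the real dataset is the insurance catalog cited as [21]:

```python
# Hedged sketch of the case-study flow: union a branch file with the
# main transaction file, standardize the date format, enrich each record
# via a customer-profile lookup, and surface missing values explicitly.
# All field names and values here are hypothetical.

main_file   = [{"policy": 1, "cust": "c1", "date": "2017/01/05"}]
branch_file = [{"policy": 2, "cust": "c2", "date": "2017/02/11"}]
profiles    = {"c1": "graduate"}          # customer id -> education profile

# Union: append the branch records to the main transactions.
transactions = main_file + branch_file

# Standardize the date format and enrich via the profile lookup;
# a missing profile is recorded as None rather than silently dropped.
for t in transactions:
    t["date"] = t["date"].replace("/", "-")
    t["education"] = profiles.get(t["cust"])

missing = [t["policy"] for t in transactions if t["education"] is None]
# missing == [2]
```

In the tool, each of these operations is a recorded recipe step with a visual preview; the sketch only makes explicit what the union, lookup and missing-value suggestions compute.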


Fig.6. Display of Lookup attributes

Thus, it is found that the wrangler tool allows human-machine interactive transformation of real world data. It enables business analysts to iteratively explore predictive transformation scripts with the help of highly trained learning algorithms. It is inefficient to wrangle data with Excel, which limits the data volume. The wrangling tools come with seamless cloud deployment, big data support and scalable capabilities, with a wide variety of data source connectivity, having evolved from the preliminary version displayed in Fig. 9 [19].

Fig.9. Preliminary version of wrangler tool

Fig.7. Graphical representation of flow diagram

The future of data science leverages data sources once reserved for data scientists, which are now made available for business analysis; interactive exploration, predictive transformation, intelligent execution and collaborative data governance are the drivers of the future to build an advanced analytics framework with prescriptive analytics to beat the race of analytical agility. The future insights of data science need scalable, faster and transparent processes. The [16] (2017 Q1) report cites that future data preparation tools must evolve urgently to identify customer insights, and must be self-service tools equipped with machine learning approaches, moving towards independence from technology management.

Fig.8. Results summary

IV. CONCLUSION

It is concluded that data processing has become a more sophisticated process. The future 'wrangling' tools should be aggressive in high throughput and in reducing the time taken to carry out wrangling tasks, achieving the goal of making data more accessible, informative and ready to explore and mine for business insights. The data wrangling solutions are exploratory in nature in arriving at analytics initiatives. The key element in encouraging the technology is to develop a strategy for skill enablement of a considerable number of business users and analysts, and to create a large corpus of data for machine learning models. The technology revolves around designing and supporting data integration, quality, governance, collaboration and enrichment [23].


The paper emphasizes the usage of the Trifacta tool for the data warehouse process, which is one of its unique kind. It is understood that data preparation, data visualization, validation, standardization, data enrichment and data integration encompass a data wrangling process.

REFERENCES

[1] Cline Don, Yueh Simon, Chapman Bruce, Stankov Boba, Gasiewski Al, Masters Dallas, Elder Kelly, Kelly Richard, Painter Thomas H., Miller Steve, Katzberg Steve, Mahrt Larry (2009), NASA Cold Land Processes Experiment (CLPX 2002/03): Airborne Remote Sensing.
[2] S. K. S and M. S. S, "A New Dynamic Data Cleaning Technique for Improving Incomplete Dataset Consistency," Int. J. Inf. Technol. Comput. Sci., vol. 9, no. 9, pp. 60-68, 2017.
[3] A. Fatima, N. Nazir, and M. G. Khan, "Data Cleaning In Data Warehouse: A Survey of Data Pre-processing Techniques and Tools," Int. J. Inf. Technol. Comput. Sci., vol. 9, no. 3, pp. 50-61, 2017.
[4] Stodder David (2016), WP 219 - EN, TDWI Best Practices Report: Improving Data Preparation for Business Analytics, Q3 2016. © 2016 by TDWI, a division of 1105 Media, Inc. [Accessed on: 3 May 2017].
[5] Richard Wray, "Internet data heads for 500bn gigabytes | Business | The Guardian," www.theguardian.com. [Online] Available: https://www.theguardian.com/business/2009/may/18/digital-content-expansion. [Accessed: 24-Oct-2017].
[6] Aslett Matt (research analyst, 451 Research) and Davis Will (head of marketing, Trifacta), "Trifacta maintains data preparation" [7 July 2017]. [Online] Available: https://451research.com [Accessed on: 01 August 2017].
[7] Kandel Sean, Paepcke Andreas, Hellerstein Joseph and Heer Jeffrey (2011), Wrangler: Interactive Visual Specification of Data Transformation Scripts, ACM Human Factors in Computing Systems (CHI), ACM 978-1-4503-0267-8/11/05.
[8] Chaudhuri S. and Dayal U. (1997), An overview of data warehousing and OLAP technology. In SIGMOD Record.
[9] S. Kandel et al., "Research directions in data wrangling: Visualizations and transformations for usable and credible data," Inf. Vis., vol. 10, no. 4, pp. 271-288, 2011.
[10] Chen W., Kifer M., and Warren D.S. (1993), "HiLog: A foundation for higher-order logic programming". In Journal of Logic Programming, volume 15, pages 187-230.
[11] Raman Vijayshankar and Hellerstein Joseph M. (2001), "Potter's Wheel: An Interactive Data Cleaning System", Proceedings of the 27th VLDB Conference.
[12] Norman D.A. (2013), The Design of Everyday Things, Basic Books. [Accessed on: 12 April 2017].
[13] Carey Lucy, "Self-Service Data Governance & Prepara-
[15] Data wrangling platform (2017) publication, www.trifacta.com. [Online] Available: https://www.trifacta.com/products/architecture/. [Accessed on: 01 May 2017].
[16] Little Cinny, The Forrester Wave™: Data Preparation Tools, Q1 2017, "The Seven Providers That Matter Most and How They Stack Up". [Accessed on: 13 March 2017].
[17] Parsons Mark A., Brodzik Mary J., Rutter Nick J. (2004), "Data management for the Cold Land Processes Experiment: improving hydrological science", Hydrological Processes, Hydrol. Process. 18, 3637-3653.
[18] T. Furche, G. Gottlob, L. Libkin, G. Orsi, and N. W. Paton, "Data Wrangling for Big Data: Challenges and Opportunities," EDBT, pp. 473-478, 2016.
[19] Kandel Sean, Paepcke Andreas, Hellerstein Joseph and Heer Jeffrey (2011), published image on Papers tab, www.vis.stanford.edu. [Online] Available: http://vis.stanford.edu/papers/wrangler%20paper [Accessed on: 25 May 2017].
[20] Ahuja S., Roth M., Gangadharaiah R., Schwarz P. and Bastidas R. (2016), "Using Machine Learning to Accelerate Data Wrangling", IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016, Barcelona, Spain, pp. 343-349. doi:10.1109/ICDMW.2016.0055.
[21] "Data catalog" Insurance dataset. [Online] www.data.gov. Available: https://catalog.data.gov/dataset. [Accessed: 24-Oct-2017].
[22] Endel Florian, Piringer Harald (2015), Data Wrangling: Making data useful again, IFAC-PapersOnLine 48-1 (2015) 111-112.
[23] Kumar V., Tan Pang-Ning, Steinbach Michael, Introduction to Data Mining. Dorling Kindersley (India) Pvt. Ltd: Pearson Education, 2012.

Authors' Profiles

Malini M. Patil is presently working as Associate Professor in the Department of Information Science and Engineering at J.S.S. Academy of Technical Education, Bangalore, Karnataka, INDIA. She received her Ph.D. degree from Bharathiar University in the year 2015.
Her research interests are big data analytics, bioinformatics, cloud computing and image processing. She has published more than 20 research papers in many reputed international journals. Published articles: Malini M. Patil and Prof. P. K. Srimani, "Performance analysis of Hoeffding trees in data streams by using massive online analysis framework", 2015, International Journal of Data Mining, Modelling and Management (IJDMMM), 7, 4, pp. 293-313, Inderscience Publishers. Malini M. Patil and Srimani P. K., "Mining Data streams with concept drift in massive online analysis frame work", WSES Transactions on computers, 2016/3, Volume 15
tion on Hadoop”, www.jaxenter.com. [Online] (May 29, Pages 133-142.
2014) Available: https://jaxenter.com/trifacta-ceo-the- Dr. M Patil is a member of IEEE, Institution of Engineers
evolution-of-data-transformation-and-its-impact-on-the- (India), Indian Society for Technical Education, Computer
bottom-line-107826.html [Accessed on01 April 2017]. Society of India. She is guiding four students. She has attended
[14] Google code online publication (n.d), and presented papers in many international conferences in India
www.code.google.com. [Online] Available: and Abroad. She is a recipient of distinguished woman in
https://code.google.com/archive/p/google-refine Science Award for the year 2017 from Venus International
https://github.com/OpenRefine/OpenRefine [Accessed on; Foundation. Contact email: [email protected]
28 March 2017].
Copyright © 2018 MECS I.J. Information Technology and Computer Science, 2018, 1, 32-39
Basavaraj N. Hiremath is a research scholar working in the field of artificial intelligence at the JSSATE Research Centre, Dept. of CSE, JSSATE, affiliated to VTU Belagavi, INDIA. He is pursuing research under Dr. Malini M. Patil, Associate Professor, Dept. of ISE, JSSATE. He completed an M.S. in Computer Cognition Technology in 2003 from the Department of Studies in Computer Science, University of Mysore, Karnataka, INDIA.
He has also worked in the information technology industry as a solution architect in the data warehouse, business intelligence, and analytics space across the airlines, retail, logistics, and FMCG business domains. Published article: B. N. Hiremath and M. M. Patil, "A Comprehensive Study of Text Analytics," CiiT International Journal of Artificial Intelligent Systems and Machine Learning, vol. 9, no. 4, pp. 70–77, 2017.
Mr. Hiremath is a Fellow of the Institution of Engineers (India), a member of IEEE, a member of the Computer Society of India, a life member of the Indian Society for Technical Education, and a member of the Association for the Advancement of Artificial Intelligence.
How to cite this paper: Malini M. Patil, Basavaraj N. Hiremath, "A Systematic Study of Data Wrangling", International Journal of Information Technology and Computer Science (IJITCS), Vol. 10, No. 1, pp. 32-39, 2018. DOI: 10.5815/ijitcs.2018.01.04