Data Transformation (Computing)
Data Transformation (Computing)
Data Transformation (Computing)
In computing, data transformation is the process of converting data from one format or structure into another format or
structure. It is a fundamental aspect of most data integration[1] and data management tasks such as data wrangling, data
warehousing, data integration and application integration.
Data transformation can be simple or complex based on the required changes to the data between the source (initial) data
and the target (final) data. Data transformation is typically performed via a mixture of manual and automated steps.[2]
Tools and technologies used for data transformation can vary widely based on the format, structure, complexity, and
volume of the data being transformed.
A master data recast is another form of data transformation where the entire database of data values is transformed or
recast without extracting the data from the database. All data in a well designed database is directly or indirectly related to
a limited set of master database tables by a network of foreign key constraints. Each foreign key constraint is dependent
upon a unique database index from the parent database table. Therefore, when the proper master database table is recast
with a different unique index, the directly and indirectly related data are also recast or restated. The directly and indirectly
related data may also still be viewed in the original form since the original unique index still exists with the master data.
Also, the database recast must be done in such a way as to not impact the applications architecture software.
When the data mapping is indirect via a mediating data model, the process is also called data mediation.
Data discovery
Data mapping
Code generation
Code execution
Data review
These steps are often the focus of developers or technical data analysts who may use multiple specialized tools to
perform their tasks.
Data discovery is the first step in the data transformation process. Typically the data is profiled using profiling tools or
sometimes using manually written profiling scripts to better understand the structure and characteristics of the data and
decide how it needs to be transformed.
Data mapping is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated etc. to
produce the final desired output. Developers or technical data analysts traditionally perform data mapping since they
work in the specific technologies to define the transformation rules (e.g. visual ETL tools,[3] transformation languages).
Code generation is the process of generating executable code (e.g. SQL, Python, R, or other executable instructions)
that will transform the data based on the desired and defined data mapping rules.[4] Typically, the data transformation
technologies generate this code[5] based on the definitions or metadata defined by the developers.
Code execution is the step whereby the generated code is executed against the data to create the desired output. The
executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to
manually execute the generated code.
Data review is the final step in the process, which focuses on ensuring the output data meets the transformation
requirements. It is typically the business user or final end-user of the data that performs this step. Any anomalies or errors
in the data that are found and communicated back to the developer or data analyst as new requirements to be
implemented in the transformation process.[1]
Traditionally, data transformation has been a bulk or batch process,[6] whereby developers write code or implement
transformation rules in a data integration tool, and then execute that code or those rules on large volumes of data.[7] This
process can follow the linear set of steps as described in the data transformation process above.
Batch data transformation is the cornerstone of virtually all data integration technologies such as data warehousing, data
migration and application integration.[1]
When data must be transformed and delivered with low latency, the term “microbatch” is often used.[6] This refers to
small batches of data (e.g. a small number of rows or small set of data objects) that can be processed very quickly and
delivered to the target system when needed.
Traditional data transformation processes have served companies well for decades. The various tools and technologies
(data profiling, data visualization, data cleansing, data integration etc.) have matured and most (if not all) enterprises
transform enormous volumes of data that feed internal and external applications, data warehouses and other data
stores.[8]
This traditional process also has limitations that hamper its overall efficiency and effectiveness.[1][2][7]
The people who need to use the data (e.g. business users) do not play a direct role in the data transformation process.[9]
Typically, users hand over the data transformation task to developers who have the necessary coding or technical skills to
define the transformations and execute them on the data.[8]
This process leaves the bulk of the work of defining the required transformations to the developer, which often in turn do
not have the same domain knowledge as the business user. The developer interprets the business user requirements and
implements the related code/logic. This has the potential of introducing errors into the process (through misinterpreted
requirements), and also increases the time to arrive at a solution.[9][10]
This problem has given rise to the need for agility and self-service in data integration (i.e. empowering the user of the
data and enabling them to transform the data themselves interactively).[7][10]
There are companies that provide self-service data transformation tools. They are aiming to efficiently analyze, map and
transform large volumes of data without the technical knowledge and process complexity that currently exists. While
these companies use traditional batch transformation, their tools enable more interactivity for users through visual
platforms and easily repeated scripts.[11]
Still, there might be some compatibility issues (e.g. new data sources like IoT may not work correctly with older tools)
and compliance limitations due to the difference in data governance, preparation and audit practices.[12]
Although interactive data transformation follows the same data integration process steps as batch data integration, the key
difference is that the steps are not necessarily followed in a linear fashion and typically don't require significant technical
skills for completion.[14]
Det finnes en rekke selskaper som tilbyr interaktive verktøy for datatransformasjon, eksempelvis oppstartsselskaper som
Trifacta, Alteryx og Paxata. De tar sikte på å gi effektiv analyse, avbildning og transformasjon av store datamengder
samtidig som de abstraherer bort noe av den tekniske kompleksiteten og prosessene som foregår under panseret
There are a number of companies which provide interactive data transformation tools, like for example the start-ups
Trifacta, Alteryx and Paxata. They are aiming to efficiently analyze, map and transform large volumes of data while at
the same time abstracting away some of the technical complexity and processes which take place under the hood.
Interactive data transformation solutions provide an integrated visual interface that combines the previously disparate
steps of data analysis, data mapping and code generation/execution and data inspection.[8] That is, if changes are made at
one step (like for example renaming), the software automatically updates the preceding or following steps accordingly.
Interfaces for interactive data transformation incorporate visualizations to show the user patterns and anomalies in the
data so they can identify erroneous or outlying values.[9]
Once they've finished transforming the data, the system can generate executable code/logic, which can be executed or
applied to subsequent similar data sets.
By removing the developer from the process, interactive data transformation systems shorten the time needed to prepare
and transform the data, eliminate costly errors in interpretation of user requirements and empower business users and
analysts to control their data and interact with it as needed.[10]
Transformational languages
There are numerous languages available for performing data transformation. Many transformation languages require a
grammar to be provided. In many cases, the grammar is structured using something closely resembling Backus–Naur
form (BNF). There are numerous languages available for such purposes varying in their accessibility (cost) and general
usefulness.[15] Examples of such languages include:
AWK - one of the oldest and popular textual data transformation language;
Perl - a high-level language with both procedural and object-oriented syntax capable of powerful
operations on binary or text data.
Template languages - specialized to transform data into documents (see also template processor);
TXL - prototyping language-based descriptions, used for source code or data transformation.
XSLT - the standard XML data transformation language (suitable by XQuery in many applications);
Additionally, companies such as Trifacta and Paxata have developed domain-specific transformational languages (DSL)
for servicing and transforming datasets. The development of domain-specific languages has been linked to increased
productivity and accessibility for non-technical users.[16] Trifacta's “Wrangle” is an example of such a domain specific
language.[17]
Another advantage of the recent domain-specific transformational languages trend is that a domain-specific
transformational language can abstract the underlying execution of the logic defined in the domain-specific
transformational language. They can also utilize that same logic in various processing engines, such as Spark,
MapReduce, and Dataflow. In other words, with a domain-specific transformational language, the transformation
language is not tied to the underlying engine.[17]
Although transformational languages are typically best suited for transformation, something as simple as regular
expressions can be used to achieve useful transformation. A text editor like vim, emacs or TextPad supports the use of
regular expressions with arguments. This would allow all instances of a particular pattern to be replaced with another
pattern using parts of the original pattern. For example:
In other words, all instances of a function invocation of foo with three arguments, followed by a function invocation with
two arguments would be replaced with a single function invocation using some or all of the original set of arguments.
Another advantage to using regular expressions is that they will not fail the null transform test. That is, using your
transformational language of choice, run a sample program through a transformation that doesn't perform any
transformations. Many transformational languages will fail this test.
See also
Data cleansing
Data mapping
Data integration
Data preparation
Data wrangling
Extract, transform, load
Information integration
References
1. CIO.com. Agile Comes to Data Integration. Retrieved from: https://www.cio.com/article/2378615/data-
management/agile-comes-to-data-integration.html Archived (https://web.archive.org/web/201708290354
36/https://www.cio.com/article/2378615/data-management/agile-comes-to-data-integration.html) 2017-
08-29 at the Wayback Machine
2. DataXFormer. Morcos, Abedjan, Ilyas, Ouzzani, Papotti, Stonebraker. An interactive data transformation
tool. Retrieved from: http://livinglab.mit.edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-
Data-Transformation-Tool.pdf Archived (https://web.archive.org/web/20190805211122/http://livinglab.mit.
edu/wp-content/uploads/2015/12/DataXFormer-An-Interactive-Data-Transformation-Tool.pdf) 2019-08-
05 at the Wayback Machine
3. DWBIMASTER. Top 10 ETL Tools. Retrieved from: http://dwbimaster.com/top-10-etl-tools/ Archived (http
s://web.archive.org/web/20170829035105/http://dwbimaster.com/top-10-etl-tools/) 2017-08-29 at the
Wayback Machine
4. Petr Aubrecht, Zdenek Kouba. Metadata driven data transformation. Retrieved from:
http://labe.felk.cvut.cz/~aubrech/bin/Sumatra.pdf Archived (https://web.archive.org/web/2021041612132
3/http://labe.felk.cvut.cz/~aubrech/bin/Sumatra.pdf) 2021-04-16 at the Wayback Machine
5. LearnDataModeling.com. Code Generators. Retrieved from:
http://www.learndatamodeling.com/tm_code_generator.php Archived (https://web.archive.org/web/20170
802064905/http://www.learndatamodeling.com/tm_code_generator.php) 2017-08-02 at the Wayback
Machine
6. TDWI. 10 Rules for Real-Time Data Integration. Retrieved from: https://tdwi.org/Articles/2012/12/11/10-
Rules-Real-Time-Data-Integration.aspx?Page=1 Archived (https://web.archive.org/web/2017082903250
4/https://tdwi.org/Articles/2012/12/11/10-Rules-Real-Time-Data-Integration.aspx?Page=1) 2017-08-29 at
the Wayback Machine
7. Tope Omitola, Andr´e Freitas, Edward Curry, Sean O'Riain, Nicholas Gibbins, and Nigel Shadbolt.
Capturing Interactive Data Transformation Operations using Provenance Workflows Retrieved from:
http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf
Archived (https://web.archive.org/web/20160131145724/http://andrefreitas.org/papers/preprint_capturin
g%20interactive_data_transformation_eswc_highlights.pdf) 2016-01-31 at the Wayback Machine
8. The Value of Data Transformation
9. Morton, Kristi -- Interactive Data Integration and Entity Resolution for Exploratory Visual Data Analytics.
Retrieved from: https://digital.lib.washington.edu/researchworks/handle/1773/35165 Archived (https://we
b.archive.org/web/20170907043519/https://digital.lib.washington.edu/researchworks/handle/1773/3516
5) 2017-09-07 at the Wayback Machine
10. McKinsey.com. Using Agile to Accelerate Data Transformation
11. "Why Self-Service Prep Is a Killer App for Big Data" (https://www.datanami.com/2016/05/31/self-service-
prep-killer-app-big-data/). Datanami. 2016-05-31. Archived (https://web.archive.org/web/2017092100172
4/https://www.datanami.com/2016/05/31/self-service-prep-killer-app-big-data/) from the original on 2017-
09-21. Retrieved 2017-09-20.
12. Sergio, Pablo (2022-05-27). "Your Practical Guide to Data Transformation" (https://blog.coupler.io/what-i
s-data-transformation/). Coupler.io Blog. Archived (https://web.archive.org/web/20220517173509/https://
blog.coupler.io/what-is-data-transformation/) from the original on 2022-05-17. Retrieved 2022-07-08.
13. Tope Omitola , Andr´e Freitas , Edward Curry , Sean O’Riain , Nicholas Gibbins , and Nigel Shadbolt.
Capturing Interactive Data Transformation Operations using Provenance Workflows Retrieved from:
http://andrefreitas.org/papers/preprint_capturing%20interactive_data_transformation_eswc_highlights.pdf
Archived (https://web.archive.org/web/20160131145724/http://andrefreitas.org/papers/preprint_capturin
g%20interactive_data_transformation_eswc_highlights.pdf) 2016-01-31 at the Wayback Machine
14. Peng Cong, Zhang Xiaoyi. Research and Design of Interactive Data Transformation and Migration
System for Heterogeneous Data Sources. Retrieved from:
https://ieeexplore.ieee.org/document/5211525/ Archived (https://web.archive.org/web/20180607184030/
https://ieeexplore.ieee.org/document/5211525/) 2018-06-07 at the Wayback Machine
15. DMOZ. Extraction and Transformation. Retrieved from:
https://dmoztools.net/Computers/Software/Databases/Data_Warehousing/Extraction_and_Transformation/
Archived (https://web.archive.org/web/20170829041136/https://dmoztools.net/Computers/Software/Data
bases/Data_Warehousing/Extraction_and_Transformation/) 2017-08-29 at the Wayback Machine
16. "Wrangle Language - Trifacta Wrangler - Trifacta Documentation" (https://docs.trifacta.com/display/PE/W
rangle+Language). docs.trifacta.com. Archived (https://web.archive.org/web/20170921045735/https://doc
s.trifacta.com/display/PE/Wrangle+Language) from the original on 2017-09-21. Retrieved 2017-09-20.
17. Kandel, Joe Hellerstein, Sean. "Advantages of a Domain-Specific Language Approach to Data
Transformation - Strata + Hadoop World in New York 2014" (https://conferences.oreilly.com/strata/stratan
y2014/public/schedule/detail/36612). conferences.oreilly.com. Archived (https://web.archive.org/web/201
70921000834/https://conferences.oreilly.com/strata/stratany2014/public/schedule/detail/36612) from the
original on 2017-09-21. Retrieved 2017-09-20.
External links
File Formats, Transformation, and Migration (https://en.wikiversity.org/wiki/Digital_Libraries/File_formats,
_transformation,_migration), a related Wikiversity article
Extraction and Transformation (https://curlie.org/Computers/Software/Databases/Data_Warehousing/Extr
action_and_Transformation) at Curlie