Working With XML Data: Optional

Information Management

Optional Working with XML Data


Module 2

IBM InfoSphere Information Server for Data Integration

© 2013 IBM Corporation



Module Objectives
After completing this topic, you should be able to:
■ Describe the XML stage

■ Understand the Schema Library Manager

■ Understand how to use the Assembly Editor for building an assembly

■ Understand how to read XML sources

■ Understand how to perform XML operations on hierarchical data

■ Describe how to transform relational to hierarchical sources


XML Document
■ XML syntax rules
‒ Document has exactly one root element
‒ Each start tag is matched by one end tag
‒ All elements are properly nested
‒ Attribute values are quoted
‒ XML elements are case sensitive
‒ Disallowed characters are not used in tags or values
■ DTD or Schema
‒ Elements and attributes that must or might be included, and their permitted
structure
‒ The structure is specified by a regular expression syntax
■ Well-formed document
‒ Follows XML syntax rules
■ Valid document
‒ Well-formed document
‒ Follows the rules defined in its DTD or schema
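The well-formedness rules above can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard library; the helper name `is_well_formed` is hypothetical, and the sketch does not check validity against a DTD or schema, which the standard library does not support.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for unmatched tags, bad nesting, unquoted attributes,
        # more than one root element, and similar syntax violations.
        return False


if __name__ == "__main__":
    samples = [
        "<a><b/></a>",   # one root, properly nested
        "<a><b></a>",    # start tag <b> has no end tag
        "<a/><b/>",      # two root elements
        "<a x=1/>",      # attribute value not quoted
        "<A></a>",       # element names are case sensitive
    ]
    for sample in samples:
        print(sample, is_well_formed(sample))
```

Only the first sample passes; each of the others breaks one of the syntax rules listed on the slide.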


XML Transformation

■ XML Source/Target Documents


‒ Files
‒ Databases
‒ Web services
‒ Events (MQ)

■ DataStage XML Transformation


‒ Schema Library Manager
➢ Type library
‒ XML Stage
➢ Can be used as source, target or transformer
‒ Assembly
➢ Describes the hierarchical data that flows through
the assembly steps


Schema Library Manager


■ Schemas are organized into
libraries.
‒ Each library can contain one or
more schemas. Different versions
of the same schema should be
organized into different schema
libraries.
■ A Schema library is a type library
‒ Type name uniqueness is
determined by the namespace
URI and local name of the top
level elements
■ The type library cannot contain
duplicate definitions
■ The library can be loaded using
multiple file selection or a zip file
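The uniqueness rule can be pictured as a dictionary keyed by (namespace URI, local name). This is only an illustrative sketch, not how the Schema Library Manager is implemented; the class `TypeLibrary`, the exception `DuplicateTypeError`, and the example namespaces are all hypothetical.

```python
class DuplicateTypeError(ValueError):
    """Raised when a type with the same namespace URI and local name exists."""


class TypeLibrary:
    """Toy model of a schema library that rejects duplicate definitions."""

    def __init__(self, name: str):
        self.name = name
        self._types = {}  # (namespace_uri, local_name) -> definition

    def add(self, namespace_uri: str, local_name: str, definition: str) -> None:
        key = (namespace_uri, local_name)
        if key in self._types:
            raise DuplicateTypeError(f"duplicate definition for {key}")
        self._types[key] = definition

    def __len__(self) -> int:
        return len(self._types)


if __name__ == "__main__":
    lib = TypeLibrary("hr_v1")
    lib.add("urn:example:hr", "employee", "<xs:element .../>")
    # Same local name in another namespace is a different type.
    lib.add("urn:example:sales", "employee", "<xs:element .../>")
    print(len(lib))
```

In this model, a second version of the same schema would collide on the same keys, which is why the slide recommends putting different versions into different libraries.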


XML Stage
■ Can be used as source, target or transformer
■ Parse
‒ Parse incoming XML files or messages and create hierarchical data
representation
■ Compose
‒ Convert hierarchical or relational records to XML file or message
■ Transformer
‒ Performs transformation on XML data using a set of hierarchical operations
➢ Aggregate, Sort, Regroup, etc.

■ Contains two editors:


‒ Stage editor
➢ Runtime properties
‒ Assembly editor
➢ Actual design (assembly)


Assembly Editor
■ Performs the actual design by combining a set of steps (operations) on XML data

[Figure: Assembly Editor screenshot. Callouts: "Palette", "Input/Output links", "Input Step", "Steps", "Output Step"]


Assembly Computational Model


■ A set of steps that perform enrichments or transformations on XML data
■ The output of one step can feed another step

[Diagram: Input Step -> Step 1 (Parse) -> Step 2 (Join) -> Step 3 (Aggregate) -> Output Step]

■ Input/Output steps have a distinguished role


■ Input step
‒ Transforms relational data to hierarchical data
■ Output step
‒ Transforms hierarchical data to relational data
■ Steps are aggregated between Input and Output steps


Schema Representation in the Assembly


■ Simplified model

[Slide callouts: "Complex types are simplified", "Primitive Types"]

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="employee">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="employee_name" type="xs:string"/>
              <xs:element name="age" type="xs:int"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
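To make the simplified model concrete, the sketch below parses a document that conforms to the schema above into a nested Python structure, with the primitive types mapped to `str` and `int`. It is an illustration only: the function `parse_department` and the sample values ("Ann", 41) are invented, and the XML stage builds its own internal representation.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document for the department schema.
DOC = """<department>
  <employee>
    <employee_name>Ann</employee_name>
    <age>41</age>
  </employee>
</department>"""


def parse_department(text: str) -> dict:
    """Turn a <department> document into a nested dict."""
    root = ET.fromstring(text)
    employee = root.find("employee")
    return {
        "employee": {
            # xs:string maps to str, xs:int maps to int in this sketch.
            "employee_name": employee.findtext("employee_name"),
            "age": int(employee.findtext("age")),
        }
    }


if __name__ == "__main__":
    print(parse_department(DOC))
```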


Schema Representation in the Assembly: Cont’d


■ Entire input tree is rooted in a list named “top”
■ Each input link corresponds to one list

■ Each data type has a schema representation


<xs:element name="book_title" type="xs:string"/>
<xs:element name="date_of_publish" type="xs:date"/>
<xs:element name="code" type="xs:ID"/>
<xs:element name="author" type="xs:string"/>
<xs:element name="cost" type="xs:decimal"/>


XML Transformation
■ XML job design is XML schema driven
■ It starts with an XML schema document (XSD file) imported into the Schema Library Manager
‒ The schema specifies the basic structure of the incoming XML file or message format, as well as the basic constraints on the XML document
‒ The schema is converted to an internal hierarchical schema tree representation and presented to the user for job design
■ Design the corresponding operations that fit business requirements
‒ Parsing, sorting, aggregation, composing, etc.


XML Parser
■ Parse incoming XML files or
messages and create hierarchical
data representation based on
given schema
■ The XML file can be read in 3
ways
‒ String Set
➢ Select a column from an input link
of the XML stage or an input
schema item from previous parser
steps
‒ Single File
➢ Specify the file path to be
processed
‒ File Set
➢ Select a column from an input link
of the XML stage. The selected
column contains the path name of
an XML file to be processed


Input Step
■ Used to map a relational data structure to a
hierarchical data structure
■ Activated when an Input link is defined
■ Output from the input step is used in further
steps
■ Each input link is transformed into a child list of
the InputLinks node
■ Two views
‒ Links view
➢ Same as columns tab of a
Stage
‒ Tree view
➢ Hierarchical view


XML Parser Document Root Schema

[Figure: XML Parser step, Document Root tab. Callouts: "Select Document Root tab", "Click to select"]


XML Validation
■ The XML file is validated against the schema defined in the Document Root tab using validation rules
■ Validation rules
‒ Value validation
➢ Checks the actual values in the XML document against the defined schema types
‒ Structure validation
➢ Checks that the XML document conforms to the schema specified in the document root
‒ Each rule consists of a condition and an action
➢ Condition: the validation check
➢ Action: how to handle the error when the check fails


Minimal Validation vs. Strict Validation


■ Minimal validation
‒ Ignores all violations and attempts to run the job
‒ The actions for the validation rules are set to Ignore
‒ Ensures maximum performance
‒ Default validation rule selection in the XML Parser

■ Strict validation
‒ Job aborts for any type of
validation error
‒ The actions for the validation
rules are set to Fatal
‒ Default validation rule selection
in the XML Composer
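The difference between the two modes comes down to the action attached to each rule. The sketch below models one value-validation rule with an "Ignore" or "Fatal" action. It is an assumption-laden illustration: `validate_int` and `FatalValidationError` are hypothetical names, Python's `int()` only approximates the `xs:int` lexical rules, and what the real stage passes downstream after an ignored violation is not stated in the source.

```python
class FatalValidationError(Exception):
    """Stands in for a job abort caused by a rule whose action is Fatal."""


def validate_int(value: str, action: str = "Ignore"):
    """Apply a value-validation rule (condition) with the given action.

    Condition: the text must convert to an integer.
    Action:    "Ignore" keeps the raw text and carries on,
               "Fatal" raises and stops processing.
    """
    try:
        return int(value)
    except ValueError:
        if action == "Fatal":
            raise FatalValidationError(f"{value!r} is not a valid integer")
        return value  # assumption: the unconverted text is passed through


if __name__ == "__main__":
    print(validate_int("41"))             # condition holds
    print(validate_int("n/a", "Ignore"))  # violation ignored
    try:
        validate_int("n/a", "Fatal")
    except FatalValidationError as err:
        print("aborted:", err)
```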


Additional XML Parser Features


■ Two context menu options let the user further adjust the XML schema
‒ Chunk
‒ Disable Type Derivatives
■ When an item in the schema is chunked, all data for that item and its children are concatenated into a single XML value
■ No validation is performed on the chunked data, as there is no schema to validate against. The type is simply XMLtype

[Figure callout: "chunk"]
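Chunking can be thought of as serializing a subtree back to text instead of descending into it. A minimal sketch of that idea with the standard library follows; the function `chunk` and the sample document are hypothetical, and the real stage's serialization details may differ.

```python
import xml.etree.ElementTree as ET


def chunk(root: ET.Element, path: str) -> str:
    """Return the element at `path`, with all its children, as one XML string."""
    node = root.find(path)
    if node is None:
        raise KeyError(f"no element matches {path!r}")
    return ET.tostring(node, encoding="unicode")


if __name__ == "__main__":
    doc = ET.fromstring(
        "<department><employee>"
        "<employee_name>Ann</employee_name><age>41</age>"
        "</employee></department>"
    )
    # The employee subtree becomes a single opaque string value.
    print(chunk(doc, "employee"))
```

Because the result is just a string, nothing inside it is typed or validated, which mirrors the slide's point about chunked data.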


Disable Type Derivative


■ The XML type can have derived types
‒ For example, there can be a base element and a child element that derives
from that base element
■ By default, in the XML stage, all the items of the child element will be
presented
■ If the user does not want to use the derived types, select Disable
Type Derivative in the context menu of the schema tree in the Document Root
tab

[Figure callout: "Disable Type Derivative"]


XSLT Support in XML Parser


■ Use when the XML input does not match the XML schema provided
■ XML Pack supports the XSLT 1.0 and XSLT 1.1 specifications
■ Select the “Enable Filtering” option in the XML Source tab
■ The XSLT stylesheet is applied to the XML document, and the output of the XSLT transformation is parsed by the XML Parser

[Figure: XML Source tab. Callouts: "Select", "Paste XSLT"]
[Diagram: XML File -> XSLT Stylesheet -> XML Parser Step, inside the XML Stage]


Parallel Parsing Support in XML Parser


■ Can be used for XML documents that have a small header and footer and a large body with many repeating nodes
■ Click the “Compute” button to generate the parallel parsing paths
■ At runtime, the XML document is split into multiple segments, each one for a specific parallel engine node


Output Step
■ Used to map a hierarchical data structure to a relational data structure
■ The Output step is activated when an output link is defined
■ Simple rules
‒ A list is mapped to another list
‒ Groups are not mapped
‒ A parent item is mapped before its children
■ Columns can be viewed in the Output tab


Output Step: Configuration

[Figure: Output step, Mappings tab. Callouts: "Select Mappings tab", "Map source to target"]


XML Composer
■ Creates an XML file or message based on a pre-defined schema
■ Allows the user to control validation types
■ The Header tab allows the user to add comments, processing instructions, and the XML declaration
■ Three target options
‒ Write the output to a file
‒ Write the output as an XML string
‒ Write the output as a large object and pass the reference from this stage downstream


XML Composer Document Root

[Figure: XML Composer, Document Root tab. Callouts: "Select root element", "Composer elements"]


XML Composer Validation Rules


■ Strict validation by default

[Figure: XML Composer validation rules. Callout: "Select root element"]


XML Schema Mapping Specification


■ Parent items need to be mapped before their child items are mapped
■ Lists are mapped to other lists
■ Mapping begins at the root of the target


XML Composer Output Header


■ Specifies header attributes


XML Schema Output Format


■ Format XML output file


XML Operations
■ Regroup
■ H-Join
■ Sort
■ Switch
■ Aggregation
■ Union
■ H-Pivot
■ V-Pivot
■ OrderJoin


Supporting Slides


Regroup Step
▪ Creates a parent-child hierarchy from a single list
▪ Removes redundancy in data by allowing the user to put the
repeating items into a parent list
▪ For best performance, the data coming into Regroup should be pre-sorted
– Unsorted data decreases performance and adds a significant memory requirement
– It is recommended to use the DataStage sort if possible
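The reason pre-sorted input helps can be seen in a small sketch: when rows with the same parent key arrive next to each other, each group can be emitted in a single streaming pass without holding the whole list in memory. The function `regroup`, the column names, and the `children` list name below are hypothetical; this is not the Regroup step's actual implementation.

```python
from itertools import groupby


def regroup(rows, parent_cols, child_cols, list_name="children"):
    """Build a parent-child hierarchy from a flat list of dict rows.

    Assumes `rows` is already sorted on `parent_cols`; groupby only
    merges adjacent rows with equal keys.
    """
    def parent_key(row):
        return tuple(row[c] for c in parent_cols)

    result = []
    for key, group in groupby(rows, key=parent_key):
        parent = dict(zip(parent_cols, key))   # repeated values stored once
        parent[list_name] = [{c: r[c] for c in child_cols} for r in group]
        result.append(parent)
    return result


if __name__ == "__main__":
    flat = [
        {"dept": "A100", "emp": "Cynthia"},
        {"dept": "A100", "emp": "Zen"},
        {"dept": "B200", "emp": "Raman"},
    ]
    for parent in regroup(flat, ["dept"], ["emp"]):
        print(parent)
```

With unsorted input, the same parent key would show up in several separate groups unless everything is buffered first, which matches the memory concern noted on the slide.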


Regroup Parent Items Configuration


[Figure: Regroup step configuration. Callouts: "Select list to regroup", "Select Scope", "Drag parent & child items"]


Regroup Keys Configuration


▪ Only the parent items can be selected as key values.

[Figure callout: "Select key for grouping"]


H-Join Transformation
▪ Transforms the items from two lists into a single list
▪ In the output, the child list is placed within the parent list

[Figure: HJoin example]
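As a rough picture of what a hierarchical join produces, the sketch below nests each child row under the parent row with a matching key. The function `h_join` and all field names are hypothetical, and the handling of children without a matching parent (dropped here) is an assumption, not documented behavior.

```python
from collections import defaultdict


def h_join(parents, children, parent_key, child_key, list_name="children"):
    """Place each child under the parent whose key it matches."""
    by_key = defaultdict(list)
    for child in children:
        by_key[child[child_key]].append(child)
    # Parents without children get an empty child list.
    return [{**p, list_name: by_key.get(p[parent_key], [])} for p in parents]


if __name__ == "__main__":
    departments = [{"dept_id": "A100"}, {"dept_id": "B200"}]
    employees = [
        {"name": "Cynthia", "dept": "A100"},
        {"name": "Zen", "dept": "A100"},
    ]
    for row in h_join(departments, employees, "dept_id", "dept"):
        print(row)
```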


H-Join Configuration
▪ Disk-based optimization is recommended for large input data

[Figure: H-Join configuration. Callouts: "Select Parent List", "Select Child List", "Select Parent and Child keys"]


Switch Step
▪ Used to filter the input list into multiple output lists based on the specified constraints
▪ Supported switch functions:
– between
– compare
– contains
– equals
– greater than
– isBlank
– isNull
– isTrue
– isFalse
– less than
– like [supports Java expression]

Example: switch by address_type="A"

Input list:
<Address>
  <street>121 Main Street</street>
  <city>San Jose</city>
  <state>California</state>
  <country>USA</country>
  <postalCode>95387</postalCode>
  <address_type>A</address_type>
</Address>
<Address>
  <street>20400 Junction Street</street>
  <city>San Mateo</city>
  <state>California</state>
  <country>USA</country>
  <postalCode>90200</postalCode>
  <address_type>B</address_type>
</Address>

Output lists: the first <Address> (address_type A) goes to one output list, and the second <Address> (address_type B) goes to another.
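The Switch example on this slide amounts to routing each list item by a predicate. A minimal sketch follows, using the two addresses from the slide reduced to a few fields; the function name `switch` and the two-list return shape are assumptions made for illustration.

```python
def switch(items, predicate):
    """Split `items` into (matching, remaining) lists, keeping input order."""
    matching, remaining = [], []
    for item in items:
        (matching if predicate(item) else remaining).append(item)
    return matching, remaining


if __name__ == "__main__":
    addresses = [
        {"street": "121 Main Street", "city": "San Jose", "address_type": "A"},
        {"street": "20400 Junction Street", "city": "San Mateo", "address_type": "B"},
    ]
    # Constraint from the slide: address_type equals "A".
    type_a, others = switch(addresses, lambda a: a["address_type"] == "A")
    print(type_a)
    print(others)
```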


Switch Configuration

[Figure: Switch step configuration. Callouts: "Select List for filtering", "Select Scope"]


Sort Step
▪ Sorts the items in a list based on one or more sort keys
▪ For relational records, it is recommended to sort in the DataStage job flow rather than inside the XML Stage

Example: sort <integer> in ascending order

Input                    Output
<integer>45</integer>    <integer>25</integer>
<integer>41</integer>    <integer>26</integer>
<integer>32</integer>    <integer>29</integer>
<integer>25</integer>    <integer>30</integer>
<integer>29</integer>    <integer>31</integer>
<integer>40</integer>    <integer>31</integer>
<integer>38</integer>    <integer>32</integer>
<integer>35</integer>    <integer>35</integer>
<integer>26</integer>    <integer>36</integer>
<integer>31</integer>    <integer>38</integer>
<integer>30</integer>    <integer>40</integer>
<integer>36</integer>    <integer>41</integer>
<integer>31</integer>    <integer>45</integer>
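The same ordering can be reproduced with a few lines of Python. This sketch models list items as dicts and supports one or more sort keys; the function name `sort_list` and the dict representation are assumptions, while the values are the ones shown on the slide.

```python
from operator import itemgetter


def sort_list(items, keys, descending=False):
    """Return `items` sorted on one or more keys (at least one is required)."""
    return sorted(items, key=itemgetter(*keys), reverse=descending)


if __name__ == "__main__":
    values = [45, 41, 32, 25, 29, 40, 38, 35, 26, 31, 30, 36, 31]
    items = [{"integer": v} for v in values]
    print([item["integer"] for item in sort_list(items, ["integer"])])
```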


Sort Step Configuration

[Figure: Sort step configuration. Callouts: "Select list for sorting", "Select Scope", "Select sorting key(s)"]


Aggregation Step
▪ Performs aggregation on the items in a list
▪ Supported aggregation functions
– average
– concat
– count
– first
– last
– maximum
– minimum
– sum
– variance

Example: aggregate with function “First”

Input:
<root>
  <integer>1001</integer>
  <integer>-3456</integer>
  <integer>23453</integer>
  <integer>32767</integer>
  <integer>-32768</integer>
  <integer>-234</integer>
  <integer>7932</integer>
</root>

Output:
<root>
  <integer>1001</integer>
</root>
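Several of the listed functions map directly onto Python built-ins, which makes the slide's example easy to check. The sketch below covers only a subset: `concat` and `variance` are left out because the source does not say which separator or which variance definition the step uses. The `AGGREGATES` table and the `aggregate` function are hypothetical names.

```python
from statistics import mean

# Subset of the aggregation functions named on the slide.
AGGREGATES = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "count": len,
    "sum": sum,
    "minimum": min,
    "maximum": max,
    "average": mean,
}


def aggregate(values, function):
    """Apply the named aggregation function to a non-empty list of values."""
    return AGGREGATES[function](list(values))


if __name__ == "__main__":
    integers = [1001, -3456, 23453, 32767, -32768, -234, 7932]
    for name in AGGREGATES:
        print(name, aggregate(integers, name))
```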


Aggregation Configuration

[Figure: Aggregation step configuration. Callouts: "Select List to aggregate", "Select Scope", "Select element and function to aggregate"]


Union Step
▪ The Union step combines two different lists
▪ The target list has to be pre-imported in the library (Union Type)
▪ Use two parser steps to read each document (list)

Input list 1:
<tns:EmployeeInfo employeeID="B6540" departmentID="A100">
  <EMP_Name>
    <firstName>Cynthia</firstName>
    <middleName>P</middleName>
    <lastName>Donald</lastName>
  </EMP_Name>
  <gender>female</gender>
  <DOB>1987-01-17</DOB>
  <title>Miss</title>
  <hireDate>2000-07-25</hireDate>
</tns:EmployeeInfo>

Input list 2:
<tns:Dept_employee employeeID="A8990" departmentID="A100">
  <name>
    <Emp_firstName>Zen</Emp_firstName>
    <Emp_middleName>P</Emp_middleName>
    <Emp_lastName>Wright</Emp_lastName>
  </name>
  <gender>male</gender>
  <dateOfBirth>1980-04-04</dateOfBirth>
  <title>Mr</title>
  <hireDate>2008-07-11</hireDate>
</tns:Dept_employee>

Output after Union:
<prn:employee employeeID="B6540" departmentID="A100">
  <name>
    <firstName>Cynthia</firstName>
    <middleName>P</middleName>
    <lastName>Donald</lastName>
  </name>
  <gender>female</gender>
  <dateOfBirth>1987-01-17</dateOfBirth>
  <title>Miss</title>
  <hireDate>2000-07-25</hireDate>
</prn:employee>
<prn:employee employeeID="A8990" departmentID="A100">
  <name>
    <firstName>Zen</firstName>
    <middleName>P</middleName>
    <lastName>Wright</lastName>
  </name>
  <gender>male</gender>
  <dateOfBirth>1980-04-04</dateOfBirth>
  <title>Mr</title>
  <hireDate>2008-07-11</hireDate>
</prn:employee>


Union Configuration
▪ Select the target schema as the Union Type
▪ Map source to target

[Figure: Union step configuration. Callouts: "Select Union Type tab", "Select Union Type", "Select Mappings tab", "Map source to target list"]


H-Pivot Step
▪ Converts columns to a single hierarchical column

[Figure: H-Pivot example. Callouts: "H-Pivot", "List"]
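One way to picture this is a row whose selected columns become the entries of a single list. The sketch below is only a guess at the shape: the function `h_pivot`, the `name`/`value` item layout, and the sample phone columns are all assumptions, since the slide does not show the exact output structure.

```python
def h_pivot(row, columns):
    """Turn the chosen columns of one row into a single list of items."""
    return [{"name": column, "value": row[column]} for column in columns]


if __name__ == "__main__":
    record = {"id": 1, "home_phone": "555-0100", "work_phone": "555-0199"}
    # Two sibling columns become one repeating list.
    print(h_pivot(record, ["home_phone", "work_phone"]))
```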


H-Pivot Configuration

[Figure: H-Pivot configuration. Callouts: "Select Scope", "Add columns to convert"]


V-Pivot Step
▪ Spreads values from one column into different columns according to a key value

[Figure: V-Pivot example. Callouts: "Column contains “car” & “Meals” values", "New columns"]
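The "car" and "Meals" values mentioned in the figure suggest rows tagged with a type that should become separate columns. A minimal sketch of that idea follows; the function `v_pivot`, the `type`/`amount` column names, the amounts, and the extra "Hotel" key are hypothetical, and the real step's handling of missing or repeated key values is not described in the source.

```python
def v_pivot(rows, key_col, value_col, key_values):
    """Spread `value_col` into one output column per expected key value."""
    pivoted = {key: None for key in key_values}  # missing keys stay None
    for row in rows:
        key = row[key_col]
        if key in pivoted:
            pivoted[key] = row[value_col]  # assumption: the last value wins
    return pivoted


if __name__ == "__main__":
    expenses = [
        {"type": "car", "amount": 120},
        {"type": "Meals", "amount": 45},
    ]
    print(v_pivot(expenses, "type", "amount", ["car", "Meals", "Hotel"]))
```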

V-Pivot Configuration

[Figure: V-Pivot configuration. Callouts: "Select Source of Rows", "Select scope", "Select column with types"]


Order Join Step


▪ Joins the items of two lists based on their position in the lists
▪ If one list has fewer items than the other, null values are added to the shorter list
▪ No joining keys are required

Example (Order Join):

Left list    Right list    Output list
"1001"       "Raman"       "1001","Raman"
"1006"       "Kely"        "1006","Kely"
"1005"       "Charles"     "1005","Charles"
"1010"       "Gitesh"      "1010","Gitesh"
"1026"       "Satish"      "1026","Satish"
"1018"       "Sarita"      "1018","Sarita"
"1003"       "Rakesh"      "1003","Rakesh"
"1033"                     "1033",""
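Positional joining with null padding is what `itertools.zip_longest` does, so the slide's example can be sketched directly. The function name `order_join` is hypothetical, and `None` stands in for the null value (the slide renders it as an empty string).

```python
from itertools import zip_longest


def order_join(left, right):
    """Pair items by position; the shorter list is padded with None."""
    return list(zip_longest(left, right, fillvalue=None))


if __name__ == "__main__":
    ids = ["1001", "1006", "1005", "1010", "1026", "1018", "1003", "1033"]
    names = ["Raman", "Kely", "Charles", "Gitesh", "Satish", "Sarita", "Rakesh"]
    for pair in order_join(ids, names):
        print(pair)
```

The left list supplies the first element of every pair, and the unmatched "1033" is paired with the null placeholder.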


Order Join Configuration

[Figure: Order Join configuration. Callouts: "Select Left list", "Select Right list", "Left list becomes first in the Output list"]


XML Pack Architecture


[Diagram: XML Pack architecture. Recoverable labels are listed below.]

Design Environment
▪ DataStage Designer
– XML Mapper Stage Custom UI
– Menu->Tools: XML Metadata Importer UI
– Assembly Editor UI
– Web 2.0 Clients

Information Server Domain
▪ XML Pack 3.0 App: Restful Services, XML Tables, Type Cache, E2 Compiler
▪ XMeta: Contracts Model (XSD Schemas + WSDL definitions), Assemblies Model (E2 Steps and Mappings), Common Model (ASCL)
▪ Connection labels: "Schemas & service definitions", "assembly definitions", "DataStage jobs containing assembly as a string property of the stage"

Runtime Environment
▪ Runs on the Engine node only, with no interaction with the services and repository tiers
▪ Streaming and event driven
▪ Two supported engines
– DS Engine (Server Jobs): Common Connector, JVM, E2 Connector
– PX Engine (PX Jobs): Common Connector, JVM, E2 Connector


Heap Size Setting


▪ Used by the Java Virtual Machine to run an XML Pack job
▪ Can be set in the stage editor
▪ A large heap size should be used when running a job that has a large schema or processes a large XML file
▪ Default size is 256 MB


WebSphere Application Server JVM Heap Size Setting

Default path: Application servers > server1 > Process Definition > Java Virtual Machine

[Figure: JVM settings page. Callout: "Set Size value"]
