Working With XML Data: Optional

Information Management
Optional Working with XML Data

Module 2
IBM InfoSphere Information Server for Data Integration
© 2013 IBM Corporation

Module Objectives
After completing this topic, you should be able to:
■ Describe the XML stage
■ Understand the Schema Library Manager
■ Understand how to use the Assembly Editor for building an assembly
■ Understand how to read XML sources
■ Understand how to perform XML operations on hierarchical data
■ Describe how to transform relational to hierarchical sources
XML Document
■ XML syntax rules
‒ Document has exactly one root element
‒ Each start tag is matched by one end tag
‒ All elements are properly nested
‒ Attribute values are quoted
‒ XML elements are case sensitive
‒ Disallowed characters are not used in tags or values
■ DTD or Schema
‒ Defines the elements and attributes that must or may be included, and their permitted structure
‒ The structure is specified using a regular-expression-like syntax
■ Well-formed document
‒ Follows XML syntax rules
■ Valid document
‒ Well-formed document
‒ Follows the rules defined in its DTD or schema
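The well-formedness rules above can be sketched with a parser check. This is a minimal illustration using Python's standard `xml.etree.ElementTree`; the sample documents are invented, and the check covers well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents, invented for illustration.
good = "<department><employee id='1'>Ann</employee></department>"
bad_nesting = "<department><employee>Ann</department></employee>"
two_roots = "<a/><b/>"

print(is_well_formed(good))         # one root, matched tags, quoted attribute
print(is_well_formed(bad_nesting))  # elements are not properly nested
print(is_well_formed(two_roots))    # more than one root element
```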
XML Transformation
■ XML Source/Target Documents

‒ Files
‒ Databases
‒ Web services
‒ Events (MQ)
■ DataStage XML Transformation

‒ Schema Library Manager
➢ Type library
‒ XML Stage
➢ Can be used as source, target or transformer
‒ Assembly
➢ Describes the hierarchical data that flows through
the assembly steps
Schema Library Manager

■ Schemas are organized into libraries.
‒ Each library can contain one or more schemas. Different versions of the same schema should be organized into different schema libraries.
■ A schema library is a type library
‒ Type name uniqueness is determined by the namespace URI and local name of the top-level elements
■ The type library cannot contain duplicate definitions
■ The library can be loaded using multiple file selection or a zip file
XML Stage
■ Can be used as source, target or transformer
■ Parse
‒ Parses incoming XML files or messages and creates a hierarchical data representation
■ Compose
‒ Converts hierarchical or relational records to an XML file or message
■ Transformer
‒ Performs transformations on XML data using a set of hierarchical operations
➢ Aggregate, Sort, Regroup, etc.
■ Contains two editors:

‒ Stage editor
➢ Runtime properties
‒ Assembly editor
➢ Actual design (assembly)
Assembly Editor
■ Performs the actual design by combining a set of steps or operations on XML data
[Figure: Assembly Editor window, with callouts for the Palette, Input/Output links, Input Step, Steps, and Output Step]
Assembly Computational Model

■ Set of steps that perform enrichments or transformations on XML data
■ Output from one step can feed another step
Input Step → Step 1 (Parse) → Step 2 (Join) → Step 3 (Aggregate) → Output Step
■ Input/Output steps have a distinguished role
■ Input step
‒ Transforms relational data to hierarchical data
■ Output step
‒ Transforms hierarchical data to relational data
■ Steps are placed between the Input and Output steps
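The step chain can be pictured as function composition. The sketch below is only an analogy in Python, not the XML stage's implementation; the step functions, the `employees` link name, and the sample rows are all hypothetical.

```python
from functools import reduce

def input_step(rows):
    # Relational rows become one list under a root named "top".
    return {"top": {"InputLinks": {"employees": list(rows)}}}

def sort_step(tree):
    # An illustrative intermediate step: sort the list by age.
    employees = tree["top"]["InputLinks"]["employees"]
    ordered = sorted(employees, key=lambda row: row["age"])
    return {"top": {"InputLinks": {"employees": ordered}}}

def output_step(tree):
    # Hierarchical data is flattened back into relational rows.
    return list(tree["top"]["InputLinks"]["employees"])

def run_assembly(steps, data):
    # Feed the output of each step into the next one.
    return reduce(lambda current, step: step(current), steps, data)

rows = [{"employee_name": "Ann", "age": 41}, {"employee_name": "Bob", "age": 30}]
result = run_assembly([input_step, sort_step, output_step], rows)
print(result)
```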
Schema Representation in the Assembly

■ Simplified model
‒ Complex types are simplified
‒ Leaf elements carry primitive types (here xs:string and xs:int)
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="employee">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="employee_name" type="xs:string"/>
              <xs:element name="age" type="xs:int"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
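One way to picture the simplified model is to walk the schema and keep only element names and leaf types. The following Python sketch does that for the schema above; the `simplify` helper and the "group" label for elements without a primitive type are assumptions made for illustration, and only inline complex types with `xs:sequence` are handled.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="department">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="employee">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="employee_name" type="xs:string"/>
              <xs:element name="age" type="xs:int"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

def simplify(element, depth=0):
    """Yield (depth, name, type) for an xs:element and its nested elements."""
    # "group" is a hypothetical label for elements that have no primitive type.
    yield depth, element.get("name"), element.get("type", "group")
    for child in element.findall(f"{XS}complexType/{XS}sequence/{XS}element"):
        yield from simplify(child, depth + 1)

schema = ET.fromstring(XSD)
tree = [row for top in schema.findall(f"{XS}element") for row in simplify(top)]
for depth, name, type_name in tree:
    print("  " * depth + f"{name}: {type_name}")
```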
Schema Representation in the Assembly: Cont’d

■ Entire input tree is rooted in a list named “top”
■ Each input link corresponds to one list
■ Each data type has a schema representation

<xs:element name="book_title" type="xs:string"/>
<xs:element name="date_of_publish" type="xs:date"/>
<xs:element name="code" type="xs:ID"/>
<xs:element name="author" type="xs:string"/>
<xs:element name="cost" type="xs:decimal"/>
XML Transformation
■ XML job design is XML schema driven
■ It starts with an XML schema document (XSD file) imported into the Schema Library Manager
‒ The schema specifies the basic structure of the incoming XML file or message format, as well as the basic constraints on the XML document
‒ The schema is converted to an internal hierarchical schema tree representation and presented to the end user for job design
■ Design the corresponding operations that fit business requirements
‒ Parsing, sorting, aggregation, composing, etc.
XML Parser
■ Parses incoming XML files or messages and creates a hierarchical data representation based on the given schema
■ The XML file can be read in three ways
‒ String Set
➢ Select a column from an input link of the XML stage or an input schema item from previous parser steps
‒ Single File
➢ Specify the path of the file to be processed
‒ File Set
➢ Select a column from an input link of the XML stage. The selected column contains the path name of an XML file to be processed
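The three source options differ only in where the XML text comes from. A rough Python analogy, assuming rows are dictionaries and using hypothetical column names `doc` and `path`:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def parse_string_set(rows, column):
    # String Set: each row carries the XML text itself in the chosen column.
    return [ET.fromstring(row[column]) for row in rows]

def parse_single_file(path):
    # Single File: one fixed file path configured up front.
    return ET.parse(path).getroot()

def parse_file_set(rows, column):
    # File Set: each row carries the path of an XML file in the chosen column.
    return [ET.parse(row[column]).getroot() for row in rows]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dept.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("<department><employee/></department>")

    from_strings = parse_string_set([{"doc": "<department/>"}], "doc")
    from_file = parse_single_file(path)
    from_file_set = parse_file_set([{"path": path}], "path")

print(from_strings[0].tag, from_file.tag, len(from_file_set))
```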
Input Step
■ Used to map a relational data structure to a hierarchical data structure
■ Activated when an input link is defined
■ Output from the Input step is used in further steps
■ Each input link is transformed into a child list of the InputLinks node
■ Two views
‒ Links view
➢ Same as the Columns tab of a stage
‒ Tree view
➢ Hierarchical view
XML Parser Document Root Schema

[Figure: XML Parser step, with callouts "Select Document Root tab" and "Click to select"]
XML Validation
■ The XML file is validated against the schema defined in the Document Root tab using validation rules
■ Validation rules
‒ Value validation
➢ Checks the actual values in the XML document against the defined schema types
‒ Structure validation
➢ Checks that the XML document conforms to the schema specified in the document root
‒ Each rule consists of a condition and an action
➢ Condition: the validation check
➢ Action: how to handle the error when the check fails
Minimal Validation vs. Strict Validation

■ Minimal validation
‒ Ignores all violations and attempts to run the job
‒ The actions for the validation rules are set to Ignore
‒ Ensures maximum performance
‒ Default validation rule selection in the XML Parser
■ Strict validation
‒ Job aborts for any type of validation error
‒ The actions for the validation rules are set to Fatal
‒ Default validation rule selection in the XML Composer
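The difference between the two modes comes down to the action attached to each rule. The Python sketch below models a rule as a condition plus an action; the `Action` enum, `apply_rule`, and the sample values are hypothetical and only mirror the Ignore and Fatal actions named above.

```python
from enum import Enum

class Action(Enum):
    IGNORE = "Ignore"
    FATAL = "Fatal"

class ValidationFailure(Exception):
    """Raised when a rule with the Fatal action fails."""

def is_int(text):
    # Condition: a value check against an integer type.
    try:
        int(text)
        return True
    except ValueError:
        return False

def apply_rule(values, condition, action):
    # Run the condition on every value; the action decides what a failure does.
    for value in values:
        if not condition(value) and action is Action.FATAL:
            raise ValidationFailure(f"invalid value: {value!r}")
    return list(values)

ages = ["30", "forty", "41"]
minimal = apply_rule(ages, is_int, Action.IGNORE)  # violations are ignored
try:
    apply_rule(ages, is_int, Action.FATAL)         # first violation aborts
    strict_aborted = False
except ValidationFailure:
    strict_aborted = True
print(minimal, strict_aborted)
```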
Additional XML Parser Features

■ Two context menu options let the user further tweak the XML schema
‒ Chunk
‒ Disable Type Derivatives
■ When an item in the schema is chunked, all data for that item and its children is concatenated into a single XML value
■ No validation is performed on the chunked data, as there is no schema to validate against. The type is simply XMLtype
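Chunking can be thought of as keeping a subtree as serialized text instead of parsing it into items. A minimal Python illustration, using an invented sample document:

```python
import xml.etree.ElementTree as ET

# Hypothetical input document, invented for illustration.
doc = ET.fromstring(
    "<department>"
    "<employee><employee_name>Ann</employee_name><age>30</age></employee>"
    "</department>"
)

# "Chunk" the employee item: the element and all of its children are kept
# as one XML string rather than as separate typed items.
employee = doc.find("employee")
chunk = ET.tostring(employee, encoding="unicode")
print(chunk)
```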
Disable Type Derivative

■ An XML type can have derived types
‒ For example, there can be a base element and a child element that derives from that base element
■ By default, in the XML stage, all the items of the child element are presented
■ If the user does not want to use the derived types, select Disable Type Derivative in the context menu of the schema tree in the Document Root tab
XSLT Support in XML Parser

■ Use when the XML input does not match the XML schema provided
■ XML Pack supports the XSLT 1.0 and XSLT 1.1 specifications
■ Select the “Enable Filtering” option in the XML Source tab, then paste the XSLT
■ The XSLT stylesheet is applied to the XML document, and the output of the XSLT transformation is parsed by the XML Parser
XML File → XSLT Stylesheet → XML Parser Step (inside the XML Stage)
Parallel Parsing Support in XML Parser

■ Can be used for XML documents that have a small header and footer and a large body with many repeating nodes
■ Click the “Compute” button to generate the parallel parsing paths
■ At runtime, the XML document is split into multiple segments, one for each parallel engine node
Output Step
■ Used to map a hierarchical data structure to a relational data structure
■ Activated when an output link is defined
■ Simple rules
‒ A list is mapped to another list
‒ Groups are not mapped
‒ A parent item is mapped before its child items
■ Columns can be viewed in the Output tab
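Mapping a hierarchical list to an output link amounts to producing one row per list item. A small Python sketch under that assumption, using an invented department document and a hypothetical `to_rows` helper:

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for illustration.
XML = """<department>
  <employee><employee_name>Ann</employee_name><age>30</age></employee>
  <employee><employee_name>Bob</employee_name><age>41</age></employee>
</department>"""

def to_rows(department):
    # Map the repeating employee list to one output row per item.
    return [
        {
            "employee_name": employee.findtext("employee_name"),
            "age": int(employee.findtext("age")),
        }
        for employee in department.findall("employee")
    ]

rows = to_rows(ET.fromstring(XML))
print(rows)
```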
Output Step: Configuration
[Figure: Output step configuration, with callouts "Select Mappings tab" and "Map Source to target"]
XML Composer
■ Creates an XML file or message based on a predefined schema
■ Allows the user to control validation types
■ The Header tab allows the user to add comments, processing instructions, and the XML declaration
■ Three target options
‒ Write the output to a file
‒ Write the output as an XML string
‒ Write the output as a large object and pass its reference from this stage downstream
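Composing is the reverse of parsing: relational rows are assembled into a hierarchical document. The Python sketch below builds an XML string with a declaration from sample rows; the `compose` helper, element names, and rows are assumptions for illustration and do not reflect the Composer's actual options.

```python
import xml.etree.ElementTree as ET

def compose(rows):
    # Build a department document with one employee element per row.
    root = ET.Element("department")
    for row in rows:
        employee = ET.SubElement(root, "employee")
        ET.SubElement(employee, "employee_name").text = row["employee_name"]
        ET.SubElement(employee, "age").text = str(row["age"])
    body = ET.tostring(root, encoding="unicode")
    # The declaration is prepended by hand to keep the output predictable.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

rows = [{"employee_name": "Ann", "age": 30}, {"employee_name": "Bob", "age": 41}]
xml_text = compose(rows)
print(xml_text)
```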
XML Composer Document Root
[Figure: XML Composer Document Root tab, with callouts "Select root element" and "Composer elements"]
XML Composer Validation Rules

■ Strict validation by default
[Figure: XML Composer validation rules, with callout "Select root element"]
XML Schema Mapping Specification

■ Parent items must be mapped before their child items
■ Lists are mapped to other lists
■ Mapping begins at the root of the target
XML Composer Output Header

■ Specifies header attributes
XML Schema Output Format

■ Format XML output file
XML Operations
■ Regroup
■ H-Join
■ Sort
■ Switch
■ Aggregation
■ Union
■ H-Pivot
■ V-Pivot
■ OrderJoin
Supporting Slides
Regroup Step
▪ Creates a parent-child hierarchy from a single list
▪ Removes redundancy in data by allowing the user to put the repeating items into a parent list
▪ For best performance, the data coming into Regroup should be pre-sorted
– Unsorted data decreases performance and adds a significant memory requirement
– It is recommended to use a DataStage sort if possible
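The regrouping idea, and why sorted input helps, can be sketched with `itertools.groupby`, which only groups adjacent items with equal keys. The `regroup` helper, column names, and rows below are hypothetical; this is an analogy, not the step's implementation.

```python
from itertools import groupby
from operator import itemgetter

def regroup(sorted_rows, key, child_column, list_name):
    # groupby only merges adjacent rows, so the input must already be
    # sorted on the key, which mirrors the pre-sort recommendation.
    return [
        {key: value, list_name: [row[child_column] for row in group]}
        for value, group in groupby(sorted_rows, key=itemgetter(key))
    ]

rows = [
    {"departmentID": "A100", "employee": "Cynthia"},
    {"departmentID": "A100", "employee": "Zen"},
    {"departmentID": "B200", "employee": "Raman"},
]
departments = regroup(rows, "departmentID", "employee", "employees")
print(departments)
```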
Regroup Parent Items Configuration

[Figure: Regroup configuration, with callouts "Select list to regroup", "Select Scope", and "Drag parent & child items"]
Regroup Keys Configuration

▪ Only the parent items can be selected as key values.
[Figure: Regroup keys configuration, with callout "Select key for grouping"]
H-Join Transformation
▪ Transforms the items from two lists into a single list
▪ The output has the child list placed within the parent list
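A hierarchical join nests each matching child item inside its parent. The Python sketch below shows the idea with dictionaries; the `h_join` helper, key names, and sample data are assumptions for illustration.

```python
from collections import defaultdict

def h_join(parents, children, parent_key, child_key, list_name):
    # Index the child list by its key, then nest matches under each parent.
    by_key = defaultdict(list)
    for child in children:
        by_key[child[child_key]].append(child)
    return [
        {**parent, list_name: by_key.get(parent[parent_key], [])}
        for parent in parents
    ]

departments = [{"departmentID": "A100"}, {"departmentID": "B200"}]
employees = [
    {"departmentID": "A100", "name": "Cynthia"},
    {"departmentID": "A100", "name": "Zen"},
]
joined = h_join(departments, employees, "departmentID", "departmentID", "employees")
print(joined)
```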
H-Join Configuration
▪ Disk-based optimization is recommended for large input data
[Figure: H-Join configuration, with callouts "Select Parent List", "Select Child List", and "Select Parent and Child keys"]
Switch Step
▪ Used to filter the input list into multiple output lists based on the specified constraints
▪ Supported switch functions:
– between
– compare
– contains
– equals
– greater than
– isBlank
– isNull
– isTrue
– isFalse
– less than
– like [supports Java expressions]
Example: Switch by address_type=“A”
Input list:
<Address>
  <street>121 Main Street</street>
  <city>San Jose</city>
  <state>California</state>
  <country>USA</country>
  <postalCode>95387</postalCode>
  <address_type>A</address_type>
</Address>
<Address>
  <street>20400 Junction Street</street>
  <city>San Mateo</city>
  <state>California</state>
  <country>USA</country>
  <postalCode>90200</postalCode>
  <address_type>B</address_type>
</Address>
Output list (address_type=“A”):
<Address>
  <street>121 Main Street</street>
  <city>San Jose</city>
  <state>California</state>
  <country>USA</country>
  <postalCode>95387</postalCode>
  <address_type>A</address_type>
</Address>
Output list (items that do not match):
<Address>
  <street>20400 Junction St</street>
  <city>San Mateo</city>
  <state>California</state>
  <country>USA</country>
  <postalCode>90200</postalCode>
  <address_type>B</address_type>
</Address>
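The address example can be sketched as a predicate that routes each item to one of two output lists. The `switch` helper below is a hypothetical Python analogy using an "equals" test on address_type; it is not the step's actual interface.

```python
def switch(items, predicate):
    # Route each item to the matched list or to the list of non-matching items.
    matched, unmatched = [], []
    for item in items:
        (matched if predicate(item) else unmatched).append(item)
    return matched, unmatched

addresses = [
    {"street": "121 Main Street", "city": "San Jose", "address_type": "A"},
    {"street": "20400 Junction Street", "city": "San Mateo", "address_type": "B"},
]
type_a, others = switch(addresses, lambda a: a["address_type"] == "A")
print(type_a)
print(others)
```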
Switch Configuration
Select List for filtering
Select Scope
Sort Step
▪ Sorts the items in a list based on one or more sort keys
▪ For relational records, it is recommended to sort in the DataStage job flow rather than inside the XML Stage
Example (only part of the figure is recoverable): Sort <integer> in ascending order
Input: <integer>45</integer> <integer>40</integer>
Output: <integer>25</integer> <integer>31</integer>
Sort Step Configuration
Select list for sorting
Select Scope
Select sorting key(s)
Aggregation Step
▪ Performs aggregation on the items in a list
▪ Supported aggregation functions
– average
– concat
– count
– first
– last
– maximum
– minimum
– sum
– variance
Example: Aggregate with function “First”
Input:
<root>
  <integer>1001</integer>
  <integer>-3456</integer>
  <integer>23453</integer>
  <integer>32767</integer>
  <integer>-32768</integer>
  <integer>-234</integer>
  <integer>7932</integer>
</root>
Output:
<root>
  <integer>1001</integer>
</root>
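Several of the listed functions can be reproduced on the example input with plain Python. This sketch parses the same integers and applies a few aggregates; the `AGGREGATES` table is a hypothetical stand-in, and variance is left out because the step's exact definition is not given here.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <integer>1001</integer>
  <integer>-3456</integer>
  <integer>23453</integer>
  <integer>32767</integer>
  <integer>-32768</integer>
  <integer>-234</integer>
  <integer>7932</integer>
</root>"""

values = [int(item.text) for item in ET.fromstring(XML).findall("integer")]

# Hypothetical lookup table mapping function names to Python callables.
AGGREGATES = {
    "first": lambda v: v[0],
    "last": lambda v: v[-1],
    "count": len,
    "sum": sum,
    "minimum": min,
    "maximum": max,
    "average": lambda v: sum(v) / len(v),
}

results = {name: function(values) for name, function in AGGREGATES.items()}
print(results["first"], results["sum"], results["count"])
```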
Aggregation Configuration
[Figure: Aggregation configuration, with callouts "Select List to aggregate", "Select Scope", and "Select element and function to aggregate"]
Union Step
▪ The Union step combines 2 different lists
▪ Target list has to be pre-imported in the library (Union Type)
▪ Use two parser steps to read each document (list)
Example:
Input 1:
<tns:EmployeeInfo employeeID="B6540" departmentID="A100">
  <EMP_Name>
    <firstName>Cynthia</firstName>
    <middleName>P</middleName>
    <lastName>Donald</lastName>
  </EMP_Name>
  <gender>female</gender>
  <DOB>1987-01-17</DOB>
  <title>Miss</title>
  <hireDate>2000-07-25</hireDate>
</tns:EmployeeInfo>
Input 2:
<tns:Dept_employee employeeID="A8990" departmentID="A100">
  <name>
    <Emp_firstName>Zen</Emp_firstName>
    <Emp_middleName>P</Emp_middleName>
    <Emp_lastName>Wright</Emp_lastName>
  </name>
  <gender>male</gender>
  <dateOfBirth>1980-04-04</dateOfBirth>
  <title>Mr</title>
  <hireDate>2008-07-11</hireDate>
</tns:Dept_employee>
Output after Union:
<prn:employee employeeID="B6540" departmentID="A100">
  <name>
    <firstName>Cynthia</firstName>
    <middleName>P</middleName>
    <lastName>Donald</lastName>
  </name>
  <gender>female</gender>
  <dateOfBirth>1987-01-17</dateOfBirth>
  <title>Miss</title>
  <hireDate>2000-07-25</hireDate>
</prn:employee>
<prn:employee employeeID="A8990" departmentID="A100">
  <name>
    <firstName>Zen</firstName>
    <middleName>P</middleName>
    <lastName>Wright</lastName>
  </name>
  <gender>male</gender>
  <dateOfBirth>1980-04-04</dateOfBirth>
  <title>Mr</title>
  <hireDate>2008-07-11</hireDate>
</prn:employee>
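The Union example boils down to two per-source mappings into one shared target shape. The Python sketch below uses shortened versions of the two inputs with the namespace prefixes dropped for brevity; the mapping functions and the dictionary target are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened inputs; namespace prefixes are omitted to keep the sketch small.
EMPLOYEE_INFO = """<EmployeeInfo employeeID="B6540" departmentID="A100">
  <EMP_Name><firstName>Cynthia</firstName><lastName>Donald</lastName></EMP_Name>
  <DOB>1987-01-17</DOB>
</EmployeeInfo>"""

DEPT_EMPLOYEE = """<Dept_employee employeeID="A8990" departmentID="A100">
  <name><Emp_firstName>Zen</Emp_firstName><Emp_lastName>Wright</Emp_lastName></name>
  <dateOfBirth>1980-04-04</dateOfBirth>
</Dept_employee>"""

def from_employee_info(element):
    # Map the first source layout to the common target layout.
    return {
        "employeeID": element.get("employeeID"),
        "firstName": element.findtext("EMP_Name/firstName"),
        "lastName": element.findtext("EMP_Name/lastName"),
        "dateOfBirth": element.findtext("DOB"),
    }

def from_dept_employee(element):
    # Map the second source layout to the same target layout.
    return {
        "employeeID": element.get("employeeID"),
        "firstName": element.findtext("name/Emp_firstName"),
        "lastName": element.findtext("name/Emp_lastName"),
        "dateOfBirth": element.findtext("dateOfBirth"),
    }

union = [
    from_employee_info(ET.fromstring(EMPLOYEE_INFO)),
    from_dept_employee(ET.fromstring(DEPT_EMPLOYEE)),
]
print(union)
```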
Union Configuration
▪ Select the target schema as the Union Type
▪ Map source to target
[Figure: Union configuration, with callouts "Select Union Type Tab", "Select Union Type", "Select Mappings Tab", and "Map Source to target list"]
H-Pivot Step
▪ Converts columns to a single hierarchical column
[Figure: H-Pivot step converting input columns into a list]
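One reading of the horizontal pivot is that several columns of a row become items of a single list. The Python sketch below follows that reading; the `h_pivot` helper, the name/value item shape, and the sample row are all assumptions.

```python
def h_pivot(row, columns):
    # Turn the selected columns of one row into a list of name/value items.
    return [{"name": column, "value": row[column]} for column in columns]

# Hypothetical row, invented for illustration.
row = {"employee": "1001", "home_phone": "555-0100", "work_phone": "555-0101"}
pivoted = h_pivot(row, ["home_phone", "work_phone"])
print(pivoted)
```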
H-Pivot Configuration
[Figure: H-Pivot configuration, with callouts "Select Scope" and "Add columns to convert"]
V-Pivot Step
▪ Spreads values from a column into different columns according to a key value.
[Figure: V-Pivot example in which a column containing “car” and “Meals” values is spread into new columns]
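The vertical pivot can be sketched as turning the distinct values of a type column into new column names. In this Python illustration the "car" and "Meals" values come from the figure, while the `v_pivot` helper, the other column names, and the amounts are invented.

```python
def v_pivot(rows, key, type_column, value_column):
    # Collect one output row per key; each type value becomes a new column.
    pivoted = {}
    for row in rows:
        target = pivoted.setdefault(row[key], {key: row[key]})
        target[row[type_column]] = row[value_column]
    return list(pivoted.values())

rows = [
    {"employee": "1001", "expense_type": "car", "amount": 50},
    {"employee": "1001", "expense_type": "Meals", "amount": 20},
    {"employee": "1006", "expense_type": "car", "amount": 35},
]
wide = v_pivot(rows, "employee", "expense_type", "amount")
print(wide)
```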
V-Pivot Configuration
[Figure: V-Pivot configuration, with callouts "Select Source of Rows", "Select scope", and "Select column with types"]
Order Join Step

▪ Joins two items based on their position in the lists
▪ If one list has fewer items than the other, then null values are added to the shorter list
▪ No joining keys required
Example:
Left list   Right list   Order Join output
"1001"      "Raman"      "1001","Raman"
"1006"      "Kely"       "1006","Kely"
"1005"      "Charles"    "1005","Charles"
"1010"      "Gitesh"     "1010","Gitesh"
"1026"      "Satish"     "1026","Satish"
"1018"      "Sarita"     "1018","Sarita"
"1003"      "Rakesh"     "1003","Rakesh"
"1033"                   "1033",""
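The positional join in the example behaves like `itertools.zip_longest`, which pads the shorter list. This Python sketch uses the values from the example and represents the padding as `None`, which is an assumption about how the null value would appear.

```python
from itertools import zip_longest

left = ["1001", "1006", "1005", "1010", "1026", "1018", "1003", "1033"]
right = ["Raman", "Kely", "Charles", "Gitesh", "Satish", "Sarita", "Rakesh"]

# Pair items by position; the shorter list is padded with None.
joined = list(zip_longest(left, right))
for pair in joined:
    print(pair)
```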
Order Join Configuration
[Figure: Order Join configuration, with callouts "Select Left list" and "Select Right list"; the left list becomes first in the output list]
XML Pack Architecture

▪ Runs on the Engine node only, with no interaction with the services and repository tiers
▪ Streaming and event driven
▪ Two supported engines
– DS Engine (Server Jobs)
– PX Engine (PX Jobs)
[Figure: XML Pack architecture diagram. Recoverable labels:
Design Environment: DataStage Designer (XML Mapper Stage Custom UI; Menu->Tools XML Metadata Importer UI); Web 2.0 Clients (Assembly Editor UI); schemas & service definitions and assembly definitions
Information Server Domain: XML Pack 3.0 App (Restful Services, XML Tables, Type Cache, E2 Compiler); DataStage jobs containing the assembly as a string property of the stage; XMeta with Contracts Model (XSD Schemas + WSDL definitions), Assemblies Model (E2 Steps and Mappings), and Common Model (ASCL)
Runtime Environment: DS Engine and PX Engine, each with Common Connector, JVM, and E2 Connector]
Heap Size Setting

▪ Used by the Java Virtual Machine to run an XML Pack job
▪ Can be set in the stage editor
▪ A large heap size should be used when running a job that has a large schema or processes a large XML file
▪ Default size is 256 MB
WebSphere Application Server JVM Heap Size Setting
Default path: Application servers > server1 > Process Definition > Java Virtual Machine
Set the heap size value