XML Parser


Parsing XML Data

Parsing XML
• Goal: read XML files into data structures in
programming languages

• Possible strategies
– Parse by hand with some reusable libraries
– Parse into generic tree structure
– Parse as sequence of events
– Automagically parse to language-specific objects
Parsing by-hand
• Advantages
– Complete control
– Good for simple needs: build on a regex package

• Disadvantages
– Must write the initial code yourself, even if it becomes
generalized
– Pretty tedious and error prone.
– Gets very hard when using schema or DTD to validate
– No one does this anymore
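
A minimal sketch of the by-hand approach for a very simple need. It assumes flat <number> elements with no attributes, nesting, comments or CDATA; the class and method names are illustrative only. Anything beyond this quickly runs into the disadvantages listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexNumbers {
    // Matches <number>...</number> with no attributes and no nested markup.
    private static final Pattern NUMBER = Pattern.compile("<number>(.*?)</number>");

    // Returns the text of every <number> element, in document order.
    static List<String> extract(String xml) {
        List<String> values = new ArrayList<>();
        Matcher m = NUMBER.matcher(xml);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("<sequence><number>3</number><number>5</number></sequence>"));
    }
}
```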
Parsing into generic tree structure
• Advantages
– Industry-wide, language neutral W3C standard exists called DOM
(Document Object Model)
– Learning DOM for one language makes it easy to learn for any
other
– As of JAXP 1.2, support for Schema
– Have to write much less code to get XML to something you want
to manipulate in your program

• Disadvantages
– Non-intuitive API, doesn’t take full advantage of Java
– Still quite a bit of work
What is JAXP?
• JAXP: Java API for XML Processing
– In the Java language, the definitions of these standard
APIs (together with the XSLT API) make up a set of
interfaces known as JAXP
– Java also provides standard implementations together
with a vendor pluggability layer
– Some of these come standard with J2SDK, others are
only available with the Web Services Developer Pack
– We will study these shortly
Another alternative
• JDOM: Native Java published API for
representing XML as tree
• Like DOM but much more Java-specific,
object oriented
• However, not supported by other languages
• Also, no support for schema
• Dom4j another alternative
JAXB
• JAXB: Java Architecture for XML Binding

• Defines an API for automagically representing XML schema as collections of Java classes.

• Most convenient for application programming

• Will cover next class


DOM-Document Object Model
About DOM
• Stands for Document Object Model

• A World Wide Web Consortium (w3c) standard

• Standard constantly adding new features: Level 3 Core became a W3C Recommendation in April 2004

• We'll cover most of the basics. There's always more, and it's always changing.
DOM abstraction layer in Java: architecture
Emphasis is on allowing vendors to supply their own DOM
implementation without requiring changes to source code
[Figure labels: "Returns specific parser implementation"; org.w3c.dom.Document]
Sample Code
// A factory instance is the parser implementation. It can be changed with a
// runtime System property. The JDK has a default; Xerces is much better.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

/* set some factory options here */

// From the factory one obtains an instance of the parser
DocumentBuilder builder = factory.newDocumentBuilder();

// xmlFile can be a java.io.File, an InputStream, etc.
Document doc = builder.parse(xmlFile);

For reference:
javax.xml.parsers.DocumentBuilderFactory
javax.xml.parsers.DocumentBuilder
org.w3c.dom.Document
Notice that the Document class comes from the w3c-specified bindings.
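
The factory, builder and parse steps can be put together as a small runnable sketch. It parses an in-memory string instead of a file so that it is self-contained; the class name DomParseSketch and the sample document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DomParseSketch {
    // Parses an XML string and returns the tag name of its root element.
    static String rootName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // one example of a factory option
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootName("<sequence><number>3</number></sequence>"));
    }
}
```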
Validation
• Note that by default the parser will not
validate against a schema or DTD

• As of JAXP 1.2, Java provides a default parser that can handle most schema features

• See next slide for details on how to set this up
Important: Schema validation
String JAXP_SCHEMA_LANGUAGE =
"http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
"http://www.w3.org/2001/XMLSchema";

Next, you need to configure DocumentBuilderFactory to generate a
namespace-aware, validating parser that uses XML Schema:

…
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
try {
    factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (IllegalArgumentException x) {
    // Happens if the parser does not support JAXP 1.2 ...
}
Associating document with schema

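• An XML file can be associated with a schema in two ways
1. Directly in the XML file in the regular way
2. Programmatically from Java

• The latter is done as:
– factory.setAttribute(JAXP_SCHEMA_SOURCE,
new File(schemaSource));
– Here JAXP_SCHEMA_SOURCE is the property name
"http://java.sun.com/xml/jaxp/properties/schemaSource"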
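
A minimal sketch that ties the schema language and schema source properties together. The one-element schema, the temporary file and the class name SchemaValidationSketch are assumptions made for illustration. Validation problems are reported to an ErrorHandler, so the sketch counts the calls to error().

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

class SchemaValidationSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
            "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Illustrative schema: a single <number> element holding an integer.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='number' type='xs:int'/>"
            + "</xs:schema>";

    // Returns how many validation errors the parser reported for the document.
    static int countErrors(String xml) throws Exception {
        // Write the schema to a temporary file so it can be passed as a File.
        File schemaFile = File.createTempFile("sketch", ".xsd");
        schemaFile.deleteOnExit();
        Files.write(schemaFile.toPath(), XSD.getBytes(StandardCharsets.UTF_8));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(JAXP_SCHEMA_SOURCE, schemaFile);

        final int[] errors = {0};
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { errors[0]++; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countErrors("<number>8</number>"));
        System.out.println(countErrors("<number>eight</number>"));
    }
}
```

Without an ErrorHandler, validation errors are easy to miss, because the parser still returns a Document for a well-formed but invalid file.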
A few notes
• Factory allows ease of switching parser
implementations
– Java provides simple DOM implementation, but
much better to use vendor-supplied when doing
serious work
– Xerces, part of apache project, is installed on
cluster as Eclipse plugin. We’ll use next week.
– Note that some properties are not supported by
all parser implementations.
Document object
• Once a Document object is obtained, there is a rich API to
manipulate it.

• First call is usually
Element root = doc.getDocumentElement();
This gets the root element of the Document as an
instance of the Element interface

• Note that Element extends Node and has methods
getNodeType(), getNodeName(), getNodeValue(), and
getChildNodes()
Types of Nodes
• Note that there are many types of Nodes (i.e.
subinterfaces of Node):
Attr, CDATASection, Comment, Document, DocumentFragment,
DocumentType, Element, Entity, EntityReference, Notation,
ProcessingInstruction, Text

Each of these has a special and non-obvious associated type, value, and name.

Standards are language-neutral and are specified in the chart that follows

Important: keep this chart nearby when using DOM

Node | nodeName | nodeValue | attributes | nodeType
Attr | Attr name | Value of attribute | null | 2
CDATASection | #cdata-section | CDATA content | null | 4
Comment | #comment | Comment content | null | 8
Document | #document | null | null | 9
DocumentFragment | #document-fragment | null | null | 11
DocumentType | Doc type name | null | null | 10
Element | Tag name | null | NamedNodeMap | 1
Entity | Entity name | null | null | 6
EntityReference | Name of entity referenced | null | null | 5
Notation | Notation name | null | null | 12
ProcessingInstruction | target | Entire content excluding the target | null | 7
Text | #text | Actual text | null | 3
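
The chart can be checked against a live parser. The sketch below lists the type, name and value of each child of the root element; the class name NodeChartSketch and the sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeChartSketch {
    // Returns "nodeType nodeName nodeValue" for each child of the root element.
    static List<String> describeChildren(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> rows = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            rows.add(n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue());
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // A text node, a comment and an empty element under the root.
        for (String row : describeChildren("<a>hi<!--note--><b/></a>")) {
            System.out.println(row);
        }
    }
}
```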
Transforming XML
The JAXP Transformation Packages

• JAXP Transformation APIs:
– javax.xml.transform
• This package defines the factory class you use to get a Transformer object. You then
configure the transformer with input (Source) and output (Result) objects, and invoke its
transform() method to make the transformation happen. The source and result objects are
created using classes from one of the other three packages.
– javax.xml.transform.dom
• Defines the DOMSource and DOMResult classes that let you use a DOM as an input to or
output from a transformation.
– javax.xml.transform.sax
• Defines the SAXSource and SAXResult classes that let you use a SAX event generator as
input to a transformation, or deliver SAX events as output to a SAX event processor.
– javax.xml.transform.stream
• Defines the StreamSource and StreamResult classes that let you use an I/O stream as an
input to or output from a transformation.
Transformer Architecture
Writing DOM to XML
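import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM {
    public static void main(String[] argv) throws Exception {
        // Parse the file named on the command line into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);

        // A Transformer created without a stylesheet copies source to result
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}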
Creating a DOM
• Sometimes you may want to create a DOM
tree directly in memory. This is done with:

DocumentBuilderFactory factory
    = DocumentBuilderFactory.newInstance();
DocumentBuilder builder
    = factory.newDocumentBuilder();
Document document = builder.newDocument();
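
A sketch of building a small tree in memory and serializing it with a Transformer. The element names reuse the sequence/number example from the JDOM slides; the class name CreateDomSketch and the choice to omit the XML declaration are assumptions made for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class CreateDomSketch {
    // Builds <sequence> with two <number> children and returns it as a string.
    static String build() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.newDocument();

        // Every node is created by the Document and then attached to the tree.
        Element root = document.createElement("sequence");
        document.appendChild(root);
        for (String value : new String[] {"3", "5"}) {
            Element number = document.createElement("number");
            number.setTextContent(value);
            root.appendChild(number);
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build());
    }
}
```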
Manipulating Nodes
• Once the root node is obtained, typical tree
methods exist to manipulate other elements:
boolean node.hasChildNodes()
NodeList node.getChildNodes()
Node node.getNextSibling()
Node node.getParentNode()
String node.getNodeValue()
String node.getNodeName()
String node.getTextContent()
void node.setNodeValue(String nodeValue)
Node node.insertBefore(Node newChild, Node refChild)
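
A short sketch of a few of these methods working together: it builds a tree, inserts a node between two siblings and then walks the siblings in order. The class name NodeOpsSketch and its helper method are assumptions made for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeOpsSketch {
    // Builds <sequence> with numbers 3 and 5, inserts 4 between them,
    // then walks the siblings and returns their text in document order.
    static String insertAndWalk() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("sequence");
        doc.appendChild(root);
        Element three = number(doc, "3");
        Element five = number(doc, "5");
        root.appendChild(three);
        root.appendChild(five);

        // insertBefore places the new child immediately before the reference child.
        root.insertBefore(number(doc, "4"), five);

        StringBuilder order = new StringBuilder();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            order.append(n.getTextContent());
        }
        return order.toString();
    }

    // Helper: creates <number>text</number> owned by the given document.
    private static Element number(Document doc, String text) {
        Element e = doc.createElement("number");
        e.setTextContent(text);
        return e;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(insertAndWalk());
    }
}
```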
JDOM
JDOM Motivation

• Unfortunately DOM suffers from a number of design flaws and
limitations that make it less than ideal as a Java API for processing
XML
– DOM had to be backwards compatible with the hackish, poorly thought out,
unplanned object models used in third generation web browsers.
– DOM was designed by a committee trying to reconcile differences between
the object models implemented by Netscape, Microsoft, and other vendors.
They needed a solution that was at least minimally acceptable to everybody,
which resulted in an API that's maximally acceptable to no one.
– DOM is a cross-language API defined in IDL, and thus limited to those
features and classes that are available in essentially all programming
languages, including not fully-object oriented scripting languages like
JavaScript and Visual Basic. It is a lowest common denominator API. It
does not take full advantage of Java, nor does it adhere to Java best
practices, naming conventions, and coding standards.
– DOM must work for both HTML (not just XHTML, but traditional malformed
HTML) and XML.
Some sample JDOM
<fibonacci/>

In JDOM:
Element element = new Element("fibonacci");

In DOM:
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
DOMImplementation impl = builder.getDOMImplementation();
Document doc = impl.createDocument(null, "Fibonacci_Numbers", null);
Element element = doc.createElement("fibonacci");

In JDOM:
Element element = new Element("fibonacci");
element.setText("8");
element.setAttribute("index", "6");

Extremely simple and intuitive!


More JDOM
• To create this element
<sequence>
<number>3</number>
<number>5</number>
</sequence>

Element element = new Element("sequence");


Element firstNumber = new Element("number");
Element secondNumber = new Element("number");
firstNumber.setText("3");
secondNumber.setText("5");
element.addContent(firstNumber);
element.addContent(secondNumber);
Parsing XML file with JDOM
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.io.IOException;
import java.util.*;

public class ElementLister {
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java ElementLister URL");
            return;
        }
        SAXBuilder builder = new SAXBuilder();
        try {
            Document doc = builder.build(args[0]);
            Element root = doc.getRootElement();
            listChildren(root, 0);
        }
        // indicates a well-formedness error
        catch (JDOMException e) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(e.getMessage());
        }
        catch (IOException e) {
            System.out.println(e);
        }
    }

    public static void listChildren(Element current, int depth) {
        printSpaces(depth);
        System.out.println(current.getName());
        List children = current.getChildren();
        Iterator iterator = children.iterator();
        while (iterator.hasNext()) {
            Element child = (Element) iterator.next();
            listChildren(child, depth + 1);
        }
    }

    private static void printSpaces(int n) {
        for (int i = 0; i < n; i++) {
            System.out.print(' ');
        }
    }
}
SAX

Simple API for XML


About SAX
• SAX in Java is hosted on SourceForge

• SAX is not a W3C standard

• Originated purely in Java

• Other languages have chosen to implement it in their own ways based on this prototype
SAX vs. …
– SAX is an alternative to DOM, but realize that
DOM is often built on top of SAX

– SAX and DOM do not compete with JAXP

– They do both compete with JAXB implementations
How a SAX parser works
• SAX parser scans an xml stream on the fly and responds to
certain parsing events as it encounters them.

• This is very different from digesting an entire XML document into memory.

• Much faster, requires less memory.

• However, need to reparse if you need to revisit data.


Obtaining a SAX parser
• Important classes
javax.xml.parsers.SAXParserFactory;
javax.xml.parsers.SAXParser;
javax.xml.parsers.ParserConfigurationException;

// get the parser
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();

// parse the document
saxParser.parse(new File(argv[0]), handler);
DefaultHandler
• Note that an event handler has to be passed to the
SAX parser.

• This must implement the interface
org.xml.sax.ContentHandler

• Easier to extend the adapter
org.xml.sax.helpers.DefaultHandler
Overriding Handler methods
• Most important methods to override
– void startDocument()
• Called once when document parsing begins
– void endDocument()
• Called once when parsing ends
– void startElement(...)
• Called each time an element begin tag is encountered
– void endElement(...)
• Called each time an element end tag is encountered
– void characters(...)
• Called zero or more times between startElement and endElement
calls to deliver accumulated character data
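
The order of these callbacks can be made visible by recording each event as it arrives. This is a minimal sketch; the class name EventTraceSketch and the event labels are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EventTraceSketch {
    // Parses the document and returns the SAX events in the order they fired.
    static List<String> trace(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void endDocument() { events.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes attrs) {
                events.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
            @Override public void characters(char[] buf, int offset, int len) {
                events.add("characters");
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(trace("<a><b/></a>"));
    }
}
```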
startElement
• public void startElement(
String namespaceURI, // namespace URI, if any
String sName, // local (nonqualified) name
String qName, // qualified name
Attributes attrs) // list of attributes

• Attribute info is obtained by querying the Attributes object.
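
A sketch of querying the Attributes object inside startElement. It collects name=value pairs for every element; the class name AttributesSketch is an assumption, and the sample document reuses the fibonacci element from the JDOM slides.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AttributesSketch {
    // Returns "name=value" for the attributes of every element, in document order.
    static List<String> attributes(String xml) throws Exception {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String namespaceURI, String localName,
                                     String qName, Attributes attrs) {
                // Attributes can be read by index or looked up by qualified name.
                for (int i = 0; i < attrs.getLength(); i++) {
                    found.add(attrs.getQName(i) + "=" + attrs.getValue(i));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attributes("<fibonacci index=\"6\">8</fibonacci>"));
    }
}
```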
Characters
• public void characters(
char buf[], // buffer of chars accumulated
int offset, // index of the first char in buf
int len) // number of chars

• Note, characters may be called more than once between
begin tag / end tag

• Also, mixed-content elements require careful handling
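
Because characters may be called several times for one element, a common pattern is to append each chunk to a buffer and read the buffer in endElement. The sketch below does this for one named element; the class name TextCollectorSketch is an assumption, and nested elements with the same name are not handled.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollectorSketch {
    // Returns the text of the last element with the given name, or null if none.
    static String textOf(String xml, final String elementName) throws Exception {
        final StringBuilder buffer = new StringBuilder();
        final String[] result = {null};
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (qName.equals(elementName)) {
                    inside = true;
                    buffer.setLength(0); // start a fresh buffer for this element
                }
            }

            @Override
            public void characters(char[] buf, int offset, int len) {
                if (inside) {
                    buffer.append(buf, offset, len); // may run several times
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(elementName)) {
                    inside = false;
                    result[0] = buffer.toString();
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOf("<note>a &lt; b</note>", "note"));
    }
}
```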


Entity references
• Recall that entity references are special character
sequences for referring to characters that have
special meaning in XML syntax
– '<' is &lt;
– '>' is &gt;
• In SAX these are automatically converted and
passed to the characters stream unless they are part
of a CDATA section
Choosing a Parser
• Choosing your Parser Implementation
– If no other factory class is specified, the default SAXParserFactory
class is used. To use a different manufacturer's parser, you can
change the value of the system property that points to it. You
can do that from the command line, like this:
• java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...

• The factory name you specify must be a fully qualified
class name (all package prefixes included). For more
information, see the documentation of the newInstance()
method of the SAXParserFactory class.
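
To see which implementation is actually selected, one can print the class of the factory that newInstance() returns. This is a small sketch; the class name WhichParserSketch is an assumption, and the name printed depends on the JDK and on any system property set on the command line.

```java
import javax.xml.parsers.SAXParserFactory;

class WhichParserSketch {
    // Returns the fully qualified class name of the factory in use.
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(factoryClassName());
    }
}
```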
Validating SAX Parsers
String JAXP_SCHEMA_LANGUAGE =
"http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
"http://www.w3.org/2001/XMLSchema";

Next, you need to configure SAXParserFactory to generate a
namespace-aware, validating parser that uses XML Schema. Unlike
DocumentBuilderFactory, SAXParserFactory has no setAttribute method, so
the schema language is set as a property on the SAXParser itself:

…
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser saxParser = factory.newSAXParser();
try {
    saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (SAXNotRecognizedException x) {
    // Happens if the parser does not support JAXP 1.2 ...
}
Transforming arbitrary data structures using SAX and Transformer
Goal
• Now that we know SAX and a little about
Transformations, there are some cool things we
can do.

• One immediate thing is to create XML files from plain text files with the help of a faux SAX parser

• Turns out to be more robust than doing it by hand
Transformers
• Recall that transformers easily let us go between
any source and result by arbitrary wirings of
– StreamSource / StreamResult
– SAXSource / SAXResult
– DOMSource / DOMResult

• We used this to write a DOM tree to an XML file

• Now we will use a SAXSource together with a StreamResult to convert our text file
Strategy
• We construct our own SAX parser, i.e. a class that
implements the XMLReader interface

• This class must have a parse method (among others)

• We use parse to read our input file and fire the
appropriate SAX events, rather than hand-coding
the Strings ourselves.
Main snippet
public static void main(String[] argv) throws Exception {
    // Create SAX "parser"
    StudentReader parser = new StudentReader();

    // Create transformer
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();

    // Use text file as Transformer source
    FileReader fr = new FileReader("students.txt");
    BufferedReader br = new BufferedReader(fr);
    InputSource inputSource = new InputSource(br);
    SAXSource source = new SAXSource(parser, inputSource);

    // Use text as result
    StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
}
XMLReader implementation

• To have a valid SAXSource we need a class that implements the
XMLReader interface

public void parse(InputSource input)
public void setContentHandler(ContentHandler handler)
public ContentHandler getContentHandler()
.
.
.

• Shown are the important methods for a simple app

See Course Examples for details
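
A self-contained sketch of the whole strategy, not the course's StudentReader. It assumes a text format of one student name per line, and the element names students and student as well as the class name LineReaderSketch are invented for illustration. The XMLReader methods that the Transformer does not need here are left as no-ops.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

class LineReaderSketch implements XMLReader {
    private static final AttributesImpl NO_ATTRS = new AttributesImpl();
    private ContentHandler handler;

    // Reads one name per line and fires the SAX events for
    // <students><student>name</student>...</students>.
    public void parse(InputSource input) throws IOException, SAXException {
        BufferedReader in = new BufferedReader(input.getCharacterStream());
        handler.startDocument();
        handler.startElement("", "students", "students", NO_ATTRS);
        String line;
        while ((line = in.readLine()) != null) {
            char[] text = line.trim().toCharArray();
            if (text.length == 0) {
                continue; // skip blank lines
            }
            handler.startElement("", "student", "student", NO_ATTRS);
            handler.characters(text, 0, text.length);
            handler.endElement("", "student", "student");
        }
        handler.endElement("", "students", "students");
        handler.endDocument();
    }

    public void parse(String systemId) throws SAXException {
        throw new SAXException("This sketch only reads from a character stream");
    }

    public void setContentHandler(ContentHandler h) { handler = h; }
    public ContentHandler getContentHandler() { return handler; }

    // The remaining XMLReader methods are no-ops in this sketch.
    public boolean getFeature(String name) { return false; }
    public void setFeature(String name, boolean value) { }
    public Object getProperty(String name) { return null; }
    public void setProperty(String name, Object value) { }
    public void setEntityResolver(EntityResolver r) { }
    public EntityResolver getEntityResolver() { return null; }
    public void setDTDHandler(DTDHandler h) { }
    public DTDHandler getDTDHandler() { return null; }
    public void setErrorHandler(ErrorHandler h) { }
    public ErrorHandler getErrorHandler() { return null; }

    // Wires the faux reader into a Transformer and returns the XML text.
    static String toXml(String text) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        SAXSource source = new SAXSource(new LineReaderSketch(),
                new InputSource(new StringReader(text)));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("Ann\nBob\n"));
    }
}
```

Letting the serializer write the tags means that characters such as < and & in the input text are escaped for us, which is one reason this tends to be more robust than concatenating strings.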
JAXB

Java Architecture for XML Binding


What is JAXB?
• JAXB defines the behavior of a standard set of tools and
interfaces that automatically generate java class files from
XML schema

• JAXB is a framework or architecture, not an implementation.

• Sun provides a reference implementation of JAXB with the
Java Web Services Developer Pack, available as a separate
download
http://java.sun.com/webservices/downloads/webservicespack.html
JAXB vs. DOM and SAX
• JAXB is a higher level construct than DOM or SAX
– DOM represents XML documents as generic trees
– SAX represents XML documents as generic event streams
– JAXB represents XML documents as Java classes with properties
that are specific to the particular XML document
• E.g. book.xml becomes Book.java with getTitle, setTitle, etc.

• JAXB thus requires almost no knowledge of XML to be
able to programmatically process XML documents!
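
To make the contrast concrete, here is a hand-written stand-in for the kind of class a binding would give you. It is not output of the JAXB binding compiler and uses no JAXB API; the Book class, its single title property and the sample document are assumptions made for illustration. The DOM code inside fromXml is the generic-tree work that a binding framework hides from the caller.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hand-written illustration only; NOT generated by a binding compiler.
class Book {
    private String title;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    // Generic-tree access: this code has to know the element names itself.
    static Book fromXml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        Book book = new Book();
        book.setTitle(root.getElementsByTagName("title").item(0).getTextContent());
        return book;
    }

    public static void main(String[] args) throws Exception {
        // Callers work with a typed property instead of nodes and node lists.
        Book book = fromXml("<book><title>Sample Title</title></book>");
        System.out.println(book.getTitle());
    }
}
```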
High-level comparison
• Before diving into details of JAXB, it’s good to
see a bird’s-eye-view of the difference between
JAXB and SAX and/or DOM-like parsers

• Study the books/ examples under the
examples/jaxb directory on the course website
JAXB steps
• We start by assuming that you have a valid installation of the Java Web Services
Developer Pack version 3. We cover these installation details later

• Using JAXB then requires several steps:
1. Run the binding compiler on the schema file to automagically produce
the appropriate Java class files
2. Compile the Java class files (the ant tool helps here)
3. Study the autogenerated API to learn what Java types have been created
4. Create a program that unmarshals an XML document into these elementary
data structures
Running binding compiler
• <install_dir>/jaxb/bin/xjc.sh -p test.jaxb books.xsd -d work
– xjc.sh : executes binding compiler
– -p test.jaxb : place resulting class files in package test.jaxb
– books.xsd : run compiler on schema books.xsd
– -d work : place resulting files in directory called work/

• Note that this creates a huge number of files that together represent the
content of the books.xsd schema as a set of Java classes

• It is not necessary to know all of these classes. We’ll study them only
at a high level so we can understand how to use them
Example: students.xsd
Generated interfaces
• xjc.sh -p test.lottery students.xsd

• This generates the following interfaces
– test/lottery/ObjectFactory.java
• Contains methods for generating instances of the interfaces
– test/lottery/Students.java
• Represents the root node <students>
– test/lottery/StudentsType.java
• Represents the unnamed type of each student object
Generated implementations
• Each interface is implemented in the impl
directory
– test/lottery/impl/StudentsImpl.java
• Vendor-specific implementation of the Students interface
– test/lottery/impl/StudentsTypeImpl.java
• Vendor-specific implementation of the StudentsType Interface
Compilation
• Next, the generated classes must be compiled:
– javac students/*.java students/impl/*.java

• CLASSPATH requires many jar files:
– jaxb/lib/*.jar
– jwsdp-shared/lib/*.jar
– jaxp/lib/**/*.jar

• Note: an ant buildfile (like a Java makefile) makes
this much easier. More on this later
Generated docs
• Java API docs for these classes are
generated in
– students/docs/api/*.html

• After bindings are generated, one usually
works directly through these API docs to
learn how to access/manipulate the XML
data.
Sample Programs
• Easiest way to learn is to cover certain generic sample
cases. These are all on the course website under
ace104/lesson6/examples

• Summary of examples:
– student/
• Use JAXB to read an xml document composed of a single student
complex type
– student/
• Same, but for an xml document composed of a sequence of such
student types of indefinite length
– purchaseOrder/
• Another read example, but for a more complex schema
Sample programs, cont
• Course examples, cont
– create-marshal
• Purchase-order example modified to create in memory and
write to XML
– modify-marshal
• Purchase-order example modified to read XML, change it and
write back to XML

• Study these examples!


Some additional JAXB details
Binding Data Types
• Default java datatype bindings can be found at:
http://java.sun.com/webservices/docs/1.3/tutorial/doc/JAXBWorks5.html

• These defaults can be changed if required for an application

• Also, name bindings are fairly standard changes of names to
things acceptable in the Java programming language

• See other binding rules on subsequent pages


Default binding rules summary
• The JAXB binding model follows the default binding rules summarized below:

• Bind the following to Java package:
– XML Namespace URI

• Bind the following XML Schema components to Java content interface:
– Named complex type
– Anonymous inlined type definition of an element declaration

• Bind to typesafe enum class:
– A named simple type definition with a basetype that derives from "xsd:NCName" and has enumeration facets.

• Bind the following XML Schema components to a Java Element interface:
– A global element declaration to an Element interface.
– Local element declaration that can be inserted into a general content list.

• Bind to Java property:
– Attribute use
– Particle with a term that is an element reference or local element declaration.

• Bind model group with a repeating occurrence and complex type definitions with mixed {content type} to:
– A general content property; a List content-property that holds Java instances representing element information items and character
data items.
End
