Lec12 XMLCS

Web Programming Course
Lecture 12 – XML
Briefly: The Power of XML
• XML is Extensible Markup Language
– Text-based representation for describing data structure
• Both human and machine readable
– Originated from Standard Generalized Markup
Language (SGML)
– Became a World Wide Web Consortium (W3C)
standard in 1998
• XML is a great choice for exchanging data
between disparate systems
Synergy between Java and XML
• Java+XML=Portable language+Portable
Data
• Allows using Java to generate and process XML data
– Use Java to access SQL databases
– Use Java to format data in XML
– Use Java to parse data
– Use Java to validate data
– Use Java to transform data
HTML and XML
• HTML and XML look similar, because they are
both SGML languages
– use elements enclosed in tags (e.g. <body>This is
an element</body>)
– use tag attributes (e.g.,
)
• More precisely,
– HTML is defined in SGML
– XML is a (very small) subset of SGML
HTML and XML
• HTML is for humans
– HTML describes web pages
– Browsers ignore and/or correct many HTML
errors, so HTML is often sloppy
• XML is for computers
– XML describes data
– The rules are strict and errors are not allowed
• In this way, XML is like a programming language
– Current versions of most browsers display XML
Example XML document
<?xml version="1.0"?>
<weatherReport>
<date>7/14/97</date>
<city>North Place</city>, <state>NX</state>
<country>USA</country>
High Temp: <high scale="F">103</high>
Low Temp: <low scale="F">70</low>
Morning: <morning>Partly cloudy, Hazy</morning>
Afternoon: <afternoon>Sunny &amp; hot</afternoon>
Evening: <evening>Clear and Cooler</evening>
</weatherReport>
Overall structure
• An XML document may start with one or more
processing instructions or directives:
<?xml-stylesheet type="text/css" href="ss.css"?>
• Following the directives, there must be exactly one root
element containing all the rest of the XML:
<weatherReport>
...
</weatherReport>
XML building blocks
• Aside from the directives, an XML document
is built from:
– elements: high in <high scale="F">103</high>
– tags, in pairs: <high scale="F"> and </high>
– attributes: scale="F" in <high scale="F">103</high>
– entities: &amp; in <afternoon>Sunny &amp; hot</afternoon>
– data: 103 in <high scale="F">103</high>
Elements and attributes
• Attributes and elements are interchangeable
• Example:
– As elements:
<name>
  <first>David</first>
  <last>Smith</last>
</name>
– As attributes:
<name first="David"
      last="Smith">
</name>
• Elements are easier to use from Java
• Attributes may contain elaborate metadata, such as
unique IDs
Well-formed XML
• In XML, every element must have both a start tag
and an end tag, e.g. <name> ... </name>
– Empty elements can be abbreviated: <break />.
– XML tags are case sensitive and may not begin
with the letters xml, in any combination of cases
• Elements must be properly nested
– e.g. not <b><i>bold and italic</b></i>
• XML document must have one and only one root
element
• The values of attributes must be enclosed in quotes
– e.g. <time unit="days">
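The rules above can be checked mechanically by handing a string to a parser and seeing whether it reports a fatal error. The following is a minimal sketch and not part of the lecture's code: the class name WellFormedCheck and the method isWellFormed are hypothetical, and it uses only the JAXP classes that ship with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, introduced only for illustration.
class WellFormedCheck {
    // Returns true if the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // a fatal parse error: one of the rules was violated
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name><first>David</first></name>")); // proper nesting
        System.out.println(isWellFormed("<b><i>bold and italic</b></i>"));     // bad nesting
        System.out.println(isWellFormed("<time unit=days>3</time>"));          // unquoted attribute
        System.out.println(isWellFormed("<a/><b/>"));                          // two root elements
    }
}
```

Only the first call should report true; the other three each break one rule from the list.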
DTDs and Namespaces
• DTDs are used to define the tags that can be
used in an XML document
• A document may refer to a number of DTDs
• Namespaces specify which DTD defines a
given tag
– This helps to avoid collisions between names
– XML: myDTD:myTag
– Note that colon (:) is used rather than a dot (.)
XML as a tree
• An XML document represents a hierarchy
• A hierarchy is a tree
novel
  foreword
    paragraph: This is the great American novel.
  chapter number="1"
    paragraph: It was a dark and stormy night.
    paragraph: Suddenly, a shot rang out!
Viewing XML
• XML is designed to be processed by computer
programs, not to be displayed to humans
• Nevertheless, almost all current Web browsers can
display XML documents
– They do not all display it the same way
– They may not display it at all if it has errors
• This is just an added value. Remember:
HTML is designed to be viewed,
XML is designed to be used
Stream Model
• Stream seen by parser is a sequence of elements
• As each XML element is seen, an event occurs
– Some code registered with the parser (the event
handler) is executed
• This approach is popularized by the Simple API
for XML (SAX)
• Problem:
– Hard to get a global view of the document
– Parsing state represented by global variables set by
the event handlers
Data Model
• The XML data is transformed into a navigable
data structure in memory
– Because of the nesting of XML elements, a tree data
structure is used
– The tree is navigated to discover the XML document
• This approach is popularized by the Document
Object Model (DOM)
• Problem:
– May require large amounts of memory
– May not be as fast as stream approach
• Some DOM parsers use SAX to build the tree
SAX and DOM
• SAX and DOM are standards for XML parsers
– DOM is a W3C standard
– SAX is an ad-hoc (but very popular) standard
• There are various implementations available
• Java implementations are provided as part of
JAXP (Java API for XML Processing)
• JAXP package is included in JDK starting from
JDK 1.4
– Is available separately for Java 1.3
Difference between SAX and DOM
• DOM reads the entire document into memory and
stores it as a tree data structure
• SAX reads the document and calls handler methods
for each element or block of text that it encounters
• Consequences:
– DOM provides "random access" into the document
– SAX provides only sequential access to the document
– DOM is slower and requires a large amount of memory, so it
is poorly suited to very large documents
– SAX is fast and requires very little memory, so it can be
used for huge documents
• This makes SAX much more popular for web sites
Parsing with SAX
• SAX uses the source-listener-delegate model for
parsing XML documents
– Source is XML data consisting of XML elements
– A listener written in Java is attached to the document
and listens for events
– When an event is fired, handling it is delegated to
one of the listener's methods
Callbacks
• SAX works through callbacks:
– The program calls the parser
– The parser calls methods provided by the program
Program: main(...) calls parse(...) in the SAX parser
The SAX parser calls back into the program:
startDocument(...)
startElement(...)
characters(...)
endElement( )
endDocument( )
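The callback order can be observed directly by recording each call. This is a small sketch under stated assumptions: the class CallbackTrace and its trace method are hypothetical names, and the input is parsed from a string rather than from a file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the order in which the parser calls back.
class CallbackTrace extends DefaultHandler {
    final List<String> calls = new ArrayList<>();

    @Override public void startDocument() { calls.add("startDocument"); }
    @Override public void endDocument() { calls.add("endDocument"); }
    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes atts) {
        calls.add("startElement " + qName);
    }
    @Override public void endElement(String uri, String localName, String qName) {
        calls.add("endElement " + qName);
    }
    @Override public void characters(char[] ch, int start, int length) {
        calls.add("characters");
    }

    // The program calls the parser; the parser calls the methods above.
    static List<String> trace(String xml) {
        CallbackTrace handler = new CallbackTrace();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.calls;
    }

    public static void main(String[] args) {
        trace("<display>Hello World!</display>").forEach(System.out::println);
    }
}
```

The list always starts with startDocument and ends with endDocument; how many characters entries appear in between is up to the parser.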
Simple SAX program
• The program consists of two classes:
– Sample -- This class contains the main method; it
• Gets a factory to make parsers
• Gets a parser from the factory
• Creates a Handler object to handle callbacks from the parser
• Tells the parser which handler to send its callbacks to
• Reads and parses the input XML file
– Handler -- This class contains handlers for three kinds of
callbacks:
• startElement callbacks, generated when a start tag is seen
• endElement callbacks, generated when an end tag is seen
• characters callbacks, generated for the contents of an element
The Sample class
import javax.xml.parsers.*; // for both SAX and DOM
import org.xml.sax.*;
import org.xml.sax.helpers.*;
// For simplicity, we let the operating system handle exceptions
// In "real life" this is poor programming practice
public class Sample {
public static void main(String args[]) throws Exception {
// Create a parser factory
SAXParserFactory factory = SAXParserFactory.newInstance();
// Tell factory that the parser must understand namespaces
factory.setNamespaceAware(true);
// Make the parser
SAXParser saxParser = factory.newSAXParser();
XMLReader parser = saxParser.getXMLReader();
The Sample class
// Create a handler
Handler handler = new Handler();
// Tell the parser to use this handler
parser.setContentHandler(handler);
// Finally, read and parse the document
parser.parse("hello.xml");
} // end of main
} // end of Sample class
• The parser reads the file hello.xml
• A relative name such as this is resolved against the directory
the program is run from, so the file should be located there
– The classpath is not searched
The Handler class
• public class Handler extends DefaultHandler {
– DefaultHandler is an adapter class that defines empty
methods to be overridden
• We define 3 methods to handle (1) start tags, (2)
contents, and (3) end tags.
– The methods will just print a line
– Each of these 3 methods throws a SAXException
• // SAX calls this when it encounters a start tag
public void startElement(String namespaceURI,
String localName, String qualifiedName,
Attributes attributes) throws SAXException {
System.out.println("startElement: " + qualifiedName);
}
The Handler class
• // SAX calls this method to pass in character data
public void characters(char ch[ ], int start, int length)
throws SAXException {
System.out.println("characters: \"" +
new String(ch, start, length) + "\"");
}
• // SAX calls this method when it encounters an end tag
public void endElement(String namespaceURI,
String localName,
String qualifiedName)
throws SAXException {
System.out.println("endElement: /" + qualifiedName);
}
} // End of Handler class
Results
• If the file hello.xml contains:
<display>Hello World!</display>
• Then the output from running java Sample will be:
startElement: display
characters: "Hello World!"
endElement: /display
More results
• Now suppose the file hello.xml contains:
– <?xml version="1.0"?>
<display>
<i>Hello</i> World!
</display>
• Notice that the root element, <display>, contains a nested
element and whitespace (including newlines)
• The result will be:
startElement: display
characters: "" // empty string
characters: "
" // newline
characters: " " // spaces
startElement: i
characters: "Hello"
endElement: /i
characters: "World!"
characters: "
" // another newline
endElement: /display
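Because the contents of one element can arrive in several characters callbacks, handlers commonly append every chunk to a buffer instead of treating one call as the whole text. A minimal sketch of that idea follows; TextCollector and collect are hypothetical names, and the input is supplied as a string.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers all character data of a document in one buffer.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times, even within one element: append each chunk.
        text.append(ch, start, length);
    }

    static String collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String xml = "<display>\n  <i>Hello</i> World!\n</display>";
        System.out.println(collect(xml).trim()); // prints Hello World!
    }
}
```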
Factories
• SAX uses a parser factory
– A factory is a design pattern alternative to constructors
• Factories allow the programmer to:
– Decide whether or not to create a new object
– Decide what kind of object to create
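class TrustMe {
private TrustMe() { } // private constructor
// static, so it can be called without an existing TrustMe
public static TrustMe makeTrust() { // factory method
if ( /* test of some sort */)
return new TrustMe();
return null; // the factory decided not to create an object
}
}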
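A complete, compilable variant of the factory idea is sketched below. It is an illustration only: the class Pooled, its getInstance method and the fresh parameter are hypothetical, and the "test of some sort" is replaced by a simple check for a cached instance.

```java
// Hypothetical example of a static factory method that decides
// whether to create a new object or hand back an existing one.
class Pooled {
    private static Pooled cached;   // last instance created, if any

    private Pooled() { }            // private: callers cannot use "new"

    // Factory method: the caller cannot tell whether an object was created.
    static Pooled getInstance(boolean fresh) {
        if (fresh || cached == null) {
            cached = new Pooled();
        }
        return cached;
    }

    public static void main(String[] args) {
        Pooled a = getInstance(false);
        Pooled b = getInstance(false);
        Pooled c = getInstance(true);
        System.out.println(a == b); // true: the cached object was reused
        System.out.println(a == c); // false: a new object was requested
    }
}
```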
Parser factories
• To create a SAX parser factory, call static method:
SAXParserFactory.newInstance()
– Returns an object of type SAXParserFactory
– It may throw a FactoryConfigurationError
• Then, the parser can be customized:
– public void setNamespaceAware(boolean awareness)
• Call this with true if you are using namespaces
• The default (if you don’t call this method) is false
– public void setValidating(boolean validating)
• Call this with true if you want to validate against a DTD
• The default (if you don’t call this method) is false
• Validation will give an error if you do not have a DTD
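The effect of setNamespaceAware can be seen in the arguments passed to startElement. The sketch below is an illustration under assumptions: NamespaceDemo and describeRoot are hypothetical names, and the namespace URI http://example.org/weather is made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows what the handler sees for the root element.
class NamespaceDemo {
    // Returns "namespaceURI|localName|qualifiedName" for the root element.
    static String describeRoot(String xml, boolean namespaceAware) {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (result[0] == null) {   // record the first element only
                    result[0] = uri + "|" + localName + "|" + qName;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<w:report xmlns:w='http://example.org/weather'/>";
        // prints http://example.org/weather|report|w:report
        System.out.println(describeRoot(xml, true));
        // Without namespace awareness only the qualified name is reliable.
        System.out.println(describeRoot(xml, false));
    }
}
```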
Getting a parser
• Once a SAXParserFactory factory has been set up,
parsers can be created with:
SAXParser saxParser = factory.newSAXParser();
XMLReader parser = saxParser.getXMLReader();
• Note: SAXParser is not thread-safe
• If parsing will be done in multiple threads,
create a separate SAXParser object for each thread
Declaring which handler to use
• Since the SAX parser will call the handlers, we
need to supply these methods
• Binding the parser with a handler:
Handler handler = new Handler();
parser.setContentHandler(handler);
• These statements could be combined:
parser.setContentHandler(new Handler());
• Finally, the parser is invoked on the file to parse:
parser.parse("hello.xml");
• Everything else is done in the handler methods
SAX handlers
• A callback handler must implement 4 interfaces:
– interface ContentHandler
• Handles basic parsing callbacks, e.g., element starts and ends
– interface DTDHandler
• Handles only notation and unparsed entity declarations
– interface EntityResolver
• Does customized handling for external entities
– interface ErrorHandler
• Must be implemented or parsing errors will be ignored!
• Implementing all these interfaces is a lot of work
– It is easier to use an adapter class
Class DefaultHandler
• DefaultHandler is an adapter class from package
org.xml.sax.helpers
• DefaultHandler implements ContentHandler,
DTDHandler, EntityResolver, and
ErrorHandler
• DefaultHandler provides empty methods for
every method declared in each of the interfaces
• To use this class, extend it and override the
methods that are important to the application
ContentHandler methods
• public void startElement(String namespaceURI,
String localName, String qualifiedName,
Attributes atts) throws SAXException
• This method is called at the beginning of elements
• When SAX calls startElement, it passes in a
parameter of type Attributes
• The following methods look up attributes by name
rather than by index:
– public int getIndex(String qualifiedName)
– public int getIndex(String uri, String localName)
– public String getValue(String qualifiedName)
– public String getValue(String uri, String localName)
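Looking up an attribute by name inside startElement might look like the sketch below. It reuses the scale attribute from the weather report example; the class ScaleReader and its read method are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the "scale" attribute of every element that has one.
class ScaleReader extends DefaultHandler {
    final Map<String, String> scales = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        String scale = atts.getValue("scale"); // lookup by qualified name; null if absent
        if (scale != null) {
            scales.put(qName, scale);
        }
    }

    static Map<String, String> read(String xml) {
        ScaleReader handler = new ScaleReader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.scales;
    }

    public static void main(String[] args) {
        String xml = "<weatherReport><city>North Place</city>"
                   + "<high scale=\"F\">103</high><low scale=\"F\">70</low></weatherReport>";
        System.out.println(read(xml)); // prints {high=F, low=F}
    }
}
```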
ContentHandler methods
• public void endElement(String namespaceURI,
String localName, String qualifiedName)
throws SAXException
• The parameters to endElement are the same as
those to startElement, except that the Attributes
parameter is omitted
• public void characters(char[] ch, int start, int
length) throws SAXException
• ch is an array of characters
– Only length characters, starting from ch[start], are the
contents of the element
Error Handling
• SAX error handling is unusual
• Most errors are ignored unless an error handler
(org.xml.sax.ErrorHandler) is registered
– Ignored errors can cause unexpected behavior
• The ErrorHandler interface declares:
– public void fatalError (SAXParseException exception)
throws SAXException // XML not well structured
– public void error (SAXParseException exception)
throws SAXException // XML validation error
– public void warning (SAXParseException exception)
throws SAXException // minor problem
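One way to make sure no problem goes unnoticed is to override all three methods and record what they receive. The following is a sketch, assuming a validating parser and a document that carries its own internal DTD; ReportingHandler and check are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every problem instead of ignoring it.
class ReportingHandler extends DefaultHandler {
    final List<String> problems = new ArrayList<>();

    @Override public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }
    @Override public void error(SAXParseException e) {
        problems.add("error: " + e.getMessage());   // validation error; parsing continues
    }
    @Override public void fatalError(SAXParseException e) throws SAXException {
        problems.add("fatal: " + e.getMessage());
        throw e;                                    // not well-formed; stop parsing
    }

    static List<String> check(String xml) {
        ReportingHandler handler = new ReportingHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);            // DTD violations go to error()
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Already recorded by fatalError().
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";
        System.out.println(check(dtd + "<a><b/></a>")); // valid
        System.out.println(check(dtd + "<a/>"));        // invalid: <b> is missing
        System.out.println(check(dtd + "<a><b></a>"));  // not well-formed
    }
}
```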
External parsers
• Alternatively, you can use an existing parser:
– Xerces, Electric XML, Expat, MSXML, CMarkup
• Stages of the parsing
– Get the URL object for the source
– Create InputSource object encapsulating the data
source
– Create the parser
– Launch the parser on the data source
Creating InputSource
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.net.*;
import java.util.*;
public class MyServlet extends HttpServlet {
private static URL url;
public void init() throws ServletException {
try {
url = new URL("http://server/data.xml");
} catch (MalformedURLException e) {
System.err.println(e);
}
}
Creating InputSource & Parser
public void doGet(HttpServletRequest req, HttpServletResponse resp)
throws IOException, ServletException {
resp.setContentType("text/html");
PrintWriter out = resp.getWriter();
out.println("<html><title> mytitle </title><body>");
InputStream in = url.openStream();
InputSource src = new InputSource(in);
try {
XMLReader parser = XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
parser.parse(src);
}
catch (SAXException e) { System.err.println(e); }
catch (IOException e) { System.err.println(e); }
out.println("</body></html>");
}
} // end of MyServlet class
Problems with SAX
• SAX provides only sequential access to the
document being processed
• SAX has only a local view of the current element
being processed
– Global knowledge of parsing must be stored in global
variables
– A single startElement() method for all elements
• In startElement() there are many “if-then-else” tests for
checking a specific element
• When an element is seen, a global flag is set
• When finished with the element global flag must be set to false
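The flag technique described above can be sketched as follows, using the city element from the weather report example. CityExtractor and extract are hypothetical names; the flag lives in an instance field of the handler, which plays the role of the global variable.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: extracts the text of <city> using a flag.
class CityExtractor extends DefaultHandler {
    private boolean inCity = false;                       // parsing state
    private final StringBuilder city = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if ("city".equals(qName)) {       // one method for all elements,
            inCity = true;                // so each element needs its own test
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCity) {
            city.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("city".equals(qName)) {
            inCity = false;               // reset the flag when the element ends
        }
    }

    static String extract(String xml) {
        CityExtractor handler = new CityExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.city.toString();
    }

    public static void main(String[] args) {
        String xml = "<weatherReport><date>7/14/97</date>"
                   + "<city>North Place</city><state>NX</state></weatherReport>";
        System.out.println(extract(xml)); // prints North Place
    }
}
```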
DOM
• DOM represents the XML document as a tree
– Hierarchical nature of tree maps well to hierarchical
nesting of XML elements
– Tree contains a global view of the document
• Makes navigation of document easy
• Allows modification of any subtree
• Easier processing than SAX but memory intensive!
• Like SAX, DOM is only an API
– Does not specify a parser
– Lists the API and requirements for the parser
• DOM parsers typically use SAX parsing
Simple DOM program
• First we need to create a DOM parser, called a
DocumentBuilder
• The parser is created, not by a constructor, but by
calling a static factory method
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder builder =
factory.newDocumentBuilder();
Simple DOM program
• An XML file hello.xml will be parsed
<display>Hello World!</display>
• To read this file, we add the following line :
Document document = builder.parse("hello.xml");
• document contains the entire XML file as a tree
• The following code finds the content of the root element
and prints it
Element root = document.getDocumentElement();
Node textNode = root.getFirstChild();
System.out.println(textNode.getNodeValue());
• The output of the program is: Hello World!
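Put together as one runnable class, the fragments above might look like this. It is a sketch, not the lecture's program: DomHello and rootText are hypothetical names, and the XML is read from a string instead of the file hello.xml so that the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class combining the DOM steps into one program.
class DomHello {
    // Returns the value of the root element's first child, or null if it has none.
    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            Element root = document.getDocumentElement();
            Node textNode = root.getFirstChild();
            return textNode == null ? null : textNode.getNodeValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<display>Hello World!</display>")); // prints Hello World!
    }
}
```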
Reading in the tree
• The parse method reads in the entire XML
document and represents it as a tree in memory
– For a large document, parsing could take a while
– If you want to interact with your program while it is
parsing, you need to use parser in a separate thread
• Practically, an XML parse tree may require up to 10
times as much memory as the original XML document
– If you have a lot of tree manipulation to do, DOM is
much more convenient than SAX
– If you do not have a lot of tree manipulation to do,
consider using SAX instead
Structure of the DOM tree
• The DOM tree is composed of Node objects
• Node is an interface
– Some of the more important sub-interfaces are Element,
Attr, and Text
• An Element node may have children
• Attr and Text nodes are the leaves of the tree
• Hence, the DOM tree is composed of Node objects
– Node objects can be downcast into specific types if needed
Operations on Nodes
• The results returned by getNodeName(), getNodeValue(),
getNodeType() and getAttributes() depend on the subtype
of the node, as follows:
                 Element        Text            Attr
getNodeName()    tag name       "#text"         name of attribute
getNodeValue()   null           text contents   value of attribute
getNodeType()    ELEMENT_NODE   TEXT_NODE       ATTRIBUTE_NODE
getAttributes()  NamedNodeMap   null            null
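The first three rows of the table can be reproduced with a few lines of code. This is a sketch under assumptions: NodeTable, parseRoot and describe are hypothetical names, and the element is the high element from the weather report example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: prints name, value and type for three kinds of node.
class NodeTable {
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    static String describe(Node n) {
        return n.getNodeName() + " / " + n.getNodeValue() + " / " + n.getNodeType();
    }

    public static void main(String[] args) {
        Element high = parseRoot("<high scale=\"F\">103</high>");
        System.out.println(describe(high));                           // high / null / 1
        System.out.println(describe(high.getFirstChild()));           // #text / 103 / 3
        System.out.println(describe(high.getAttributeNode("scale"))); // scale / F / 2
    }
}
```

The numbers are the values of Node.ELEMENT_NODE (1), Node.TEXT_NODE (3) and Node.ATTRIBUTE_NODE (2).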
Distinguishing Node types
• An easy way to handle different types of nodes:
switch(node.getNodeType()) {
case Node.ELEMENT_NODE:
Element element = (Element)node;
...;
break;
case Node.TEXT_NODE:
Text text = (Text)node;
...
break;
case Node.ATTRIBUTE_NODE:
Attr attr = (Attr)node;
...
break;
default: ...
}
Operations on Nodes
• Tree-walking methods that return a Node:
– getParentNode()
– getFirstChild()
– getNextSibling()
– getPreviousSibling()
– getLastChild()
• Test methods that return a boolean:
– hasAttributes()
– hasChildNodes()
Operations for Elements
• String getTagName()
– Returns the name of the tag
• boolean hasAttribute(String name)
– Returns true if this Element has the named attribute
• String getAttribute(String name)
– Returns the value of the named attribute
• boolean hasAttributes()
– Returns true if this Element has any attributes
• NamedNodeMap getAttributes()
– Returns a NamedNodeMap of all the Element’s
attributes
Operations on Texts
• Text is a subinterface of CharacterData and
inherits the following operations (among others):
– public String getData() throws DOMException
• Returns the text contents of this Text node
– public int getLength()
• Returns the number of Unicode characters in the text
– public String substringData(int offset, int count)
throws DOMException
• Returns a substring of the text contents
Operations on Attributes
• String getName()
– Returns the name of this attribute.
• Element getOwnerElement()
– Returns the Element node this attribute is attached to
• boolean getSpecified()
– Returns true if this attribute was explicitly given a
value in the document
• String getValue()
– Returns the value of the attribute as a String
Pre-order traversal
• The DOM is stored in memory as a tree
• Trees can be traversed using pre-order, in-order,
or post-order
• A simple way to traverse a tree is in preorder
• The general form of a pre-order traversal is:
– Visit the root
– Traverse each one of the sub-trees, in order
Pre-order traversal in Java
• static void simplePreorderPrint(String indent, Node node) {
printNode(indent, node);
if(node.hasChildNodes()) {
Node child = node.getFirstChild();
while (child != null) {
simplePreorderPrint(indent + " ", child);
child = child.getNextSibling();
}
}
}
• static void printNode(String indent, Node node) {
System.out.print(indent);
System.out.print(node.getNodeType() + " ");
System.out.print(node.getNodeName() + " ");
System.out.print(node.getNodeValue() + " ");
System.out.println(node.getAttributes());
}
Trying out the program
Input:
<?xml version="1.0"?>
<novel>
<chapter num="1">The Beginning</chapter>
<chapter num="2">The Middle</chapter>
<chapter num="3">The End</chapter>
</novel>
Output:
1 novel null
3 #text
null
1 chapter null num="1"
3 #text The Beginning null
3 #text
null
1 chapter null num="2"
3 #text The Middle null
3 #text
null
1 chapter null num="3"
3 #text The End null
3 #text
null
Things to think about:
What are the numbers?
Are the nulls in the right places?
Is the indentation as expected?
How could this program be improved?
Overview
• DOM, unlike SAX, allows creating and
modifying XML trees
• There are three basic kinds of operations:
– Creating a new DOM
– Modifying the structure of a DOM
– Modifying the content of a DOM
• Creating a new DOM requires a few extra
methods just to get started
– Afterwards, you can add elements through modifying
its structure and contents
Creating a new DOM
import javax.xml.parsers.*;
import org.w3c.dom.Document;
…
try {
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder builder =
factory.newDocumentBuilder();
Document doc = builder.newDocument();
}
catch (ParserConfigurationException e) { ... }
Creating structure
• The following are instance methods of Document:
– public Element createElement(String tagName)
– public Element createElementNS(String namespaceURI,
String qualifiedName)
– public Attr createAttribute(String name)
– public Attr createAttributeNS(String namespaceURI,
String qualifiedName)
– public ProcessingInstruction createProcessingInstruction
(String target, String data)
– public EntityReference createEntityReference(String name)
– public Text createTextNode(String data)
– public Comment createComment(String data)
Methods of Node
• public Node appendChild(Node newChild)
• public Node insertBefore(Node newChild, Node
refChild)
• public Node removeChild(Node oldChild)
• public Node replaceChild(Node newChild, Node
oldChild)
• public void setNodeValue(String nodeValue)
– Functionality depends on the type of the node
Methods of Element
• public void setAttribute(String name, String value)
• public Attr setAttributeNode(Attr newAttr)
• public void setAttributeNS(String namespaceURI,
String qualifiedName, String value)
• public Attr setAttributeNodeNS(Attr newAttr)
• public void removeAttribute(String name)
• public void removeAttributeNS(String namespaceURI,
String localName)
• public Attr removeAttributeNode(Attr oldAttr)
Method of Attribute
• public void setValue(String value)
• This is the only method that modifies an
Attribute
– The rest just retrieve information
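The creation and modification methods listed above can be combined as in the following sketch, which builds the name example from scratch. BuildName and build are hypothetical names, and the id attribute and its values are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: creates a DOM, then changes its content and structure.
class BuildName {
    static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        // Creating structure: the Document makes nodes, appendChild attaches them.
        Element name = doc.createElement("name");
        doc.appendChild(name);                       // becomes the root element
        Element first = doc.createElement("first");
        first.appendChild(doc.createTextNode("David"));
        name.appendChild(first);
        Element last = doc.createElement("last");
        last.appendChild(doc.createTextNode("Smith"));
        name.appendChild(last);

        // Modifying content: set an attribute, then change it through its Attr node.
        name.setAttribute("id", "n1");
        Attr id = name.getAttributeNode("id");
        id.setValue("n2");

        // Modifying structure: detach one child again.
        name.removeChild(last);
        return doc;
    }

    public static void main(String[] args) {
        Element root = build().getDocumentElement();
        System.out.println(root.getTagName());                 // name
        System.out.println(root.getAttribute("id"));           // n2
        System.out.println(root.getChildNodes().getLength());  // 1
    }
}
```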
Writing out the DOM as XML
• The DOM interfaces shown here provide no method for writing
out a DOM as XML
• Writing out a DOM is conceptually simple
– It is just a tree walk
• Practically, there are a lot of details
– Various node types
– Binding attributes
– …
• Doing a good job isn’t complicated, but it is
lengthy
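A deliberately incomplete sketch of such a tree walk is shown below. It handles only document, element, attribute and text nodes and skips everything else (comments, processing instructions, CDATA sections, namespaces, encodings), so it illustrates the idea rather than a finished writer. DomWriter, toXml and roundTrip are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical, simplified writer: a pre-order walk that emits XML text.
class DomWriter {
    static String toXml(Node node) {
        StringBuilder out = new StringBuilder();
        write(node, out);
        return out.toString();
    }

    private static void write(Node node, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                writeChildren(node, out);
                break;
            case Node.ELEMENT_NODE:
                out.append('<').append(node.getNodeName());
                NamedNodeMap atts = node.getAttributes();
                for (int i = 0; i < atts.getLength(); i++) {
                    Node a = atts.item(i);
                    out.append(' ').append(a.getNodeName()).append("=\"")
                       .append(escape(a.getNodeValue())).append('"');
                }
                if (!node.hasChildNodes()) {
                    out.append("/>");            // empty element
                    break;
                }
                out.append('>');
                writeChildren(node, out);
                out.append("</").append(node.getNodeName()).append('>');
                break;
            case Node.TEXT_NODE:
                out.append(escape(node.getNodeValue()));
                break;
            default:
                break;                           // other node types are skipped in this sketch
        }
    }

    private static void writeChildren(Node node, StringBuilder out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            write(c, out);
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Parses a string into a DOM and writes it back out.
    static String roundTrip(String xml) {
        try {
            return toXml(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<a x=\"1\"><b>t &amp; u</b><c/></a>"));
    }
}
```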
