Lec12 XMLCS


Web Programming Course

Lecture 12 – XML
Briefly: The Power of XML
• XML is Extensible Markup Language
– Text-based representation for describing data structure
• Both human and machine readable
– Originated from Standard Generalized Markup
Language (SGML)
– Became a World Wide Web Consortium (W3C)
standard in 1998
• XML is a great choice for exchanging data
between disparate systems
Synergy between Java and XML

• Java+XML=Portable language+Portable
Data
• Java can be used to generate and process XML data
– Use Java to access SQL databases
– Use Java to format data in XML
– Use Java to parse data
– Use Java to validate data
– Use Java to transform data
HTML and XML

• HTML and XML look similar, because they are
both SGML languages
– use elements enclosed in tags (e.g. <body>This is
an element</body>)
– use tag attributes (e.g.,
<font face="Verdana" size="+1" color="red">)
• More precisely,
– HTML is defined in SGML
– XML is a (very small) subset of SGML
HTML and XML

• HTML is for humans
– HTML describes web pages
– Browsers ignore and/or correct many HTML
errors, so HTML is often sloppy
• XML is for computers
– XML describes data
– The rules are strict and errors are not allowed
• In this way, XML is like a programming language
– Current versions of most browsers display XML
Example XML document
<?xml version="1.0"?>
<weatherReport>
<date>7/14/97</date>
<city>North Place</city>, <state>NX</state>
<country>USA</country>
High Temp: <high scale="F">103</high>
Low Temp: <low scale="F">70</low>
Morning: <morning>Partly cloudy, Hazy</morning>
Afternoon: <afternoon>Sunny &amp; hot</afternoon>
Evening: <evening>Clear and Cooler</evening>
</weatherReport>
Overall structure
• An XML document may start with an XML declaration and one
or more processing instructions (directives):
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="ss.css"?>
• Following the directives, there must be exactly one root
element containing all the rest of the XML:
<weatherReport>
...
</weatherReport>
XML building blocks

• Aside from the directives, an XML document
is built from:
– elements: high in <high scale="F">103</high>
– tags, in pairs: <high scale="F">103</high>
– attributes: <high scale="F">103</high>
– entities: <afternoon>Sunny &amp; hot</afternoon>
– data: <high scale="F">103</high>
Elements and attributes
• Attributes and elements are often interchangeable
• Example, using child elements:
<name>
<first>David</first>
<last>Smith</last>
</name>
• The same data, using attributes:
<name first="David" last="Smith">
</name>
• Elements are easier to use from Java
• Attributes may contain elaborate metadata, such as
unique IDs
Well-formed XML
• In XML, every element must have both a start tag
and an end tag, e.g. <name> ... </name>
– Empty elements can be abbreviated: <break />.
– XML tags are case sensitive and may not begin
with the letters xml, in any combination of cases
• Elements must be properly nested
– e.g. not <b><i>bold and italic</b></i>
• XML document must have one and only one root
element
• The values of attributes must be enclosed in quotes
– e.g. <time unit="days">
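The well-formedness rules above can be checked mechanically by handing a document to a parser. The following is a minimal sketch, assuming the JAXP SAX parser bundled with the JDK; the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal (well-formedness) errors
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<name>David</name>"));            // true
        System.out.println(isWellFormed("<break />"));                     // true
        System.out.println(isWellFormed("<b><i>bold and italic</b></i>")); // false: bad nesting
        System.out.println(isWellFormed("<a/><b/>"));                      // false: two roots
        System.out.println(isWellFormed("<time unit=days/>"));             // false: unquoted value
    }
}
```

Each rejected string breaks exactly one of the rules listed above, which makes the sketch a convenient way to experiment with them.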
DTDs and Namespaces
• DTDs are used to define the tags that can be
used in an XML document
• A document may refer to a number of DTDs
• Namespaces specify which DTD defines a
given tag
– This helps to avoid collisions between names
– XML: myDTD:myTag
– Note that colon (:) is used rather than a dot (.)
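To see how a prefixed name such as myDTD:myTag is split by a parser, the sketch below parses a one-element document with a namespace-aware DOM parser. It assumes the JDK's JAXP implementation; the class name NamespaceDemo and the URI http://example.com/myDTD are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo: splitting a qualified name into prefix, local name and URI.
class NamespaceDemo {
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The xmlns:myDTD attribute binds the prefix to an (assumed) URI
        Element root = parseRoot("<myDTD:myTag xmlns:myDTD=\"http://example.com/myDTD\"/>");
        System.out.println(root.getPrefix());       // myDTD
        System.out.println(root.getLocalName());    // myTag
        System.out.println(root.getNamespaceURI()); // http://example.com/myDTD
    }
}
```

Two tags with the same local name but different namespace URIs are treated as different names, which is how the collisions mentioned above are avoided.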
XML as a tree
• An XML document represents a hierarchy
• A hierarchy is a tree

[Figure: tree with root "novel"; its children are "foreword" and "chapter" (number="1"); below them are three "paragraph" nodes with the texts "This is the great American novel.", "It was a dark and stormy night." and "Suddenly, a shot rang out!"]
Viewing XML
• XML is designed to be processed by computer
programs, not to be displayed to humans
• Nevertheless, almost all current Web browsers can
display XML documents
– They do not all display it the same way
– They may not display it at all if it has errors
• This is just an added value. Remember:
HTML is designed to be viewed,
XML is designed to be used
Stream Model
• Stream seen by parser is a sequence of elements
• As each XML element is seen, an event occurs
– Some code registered with the parser (the event
handler) is executed
• This approach is popularized by the Simple API
for XML (SAX)
• Problem:
– Hard to get a global view of the document
– Parsing state represented by global variables set by
the event handlers
Data Model
• The XML data is transformed into a navigable
data structure in memory
– Because of the nesting of XML elements, a tree data
structure is used
– The tree is navigated to discover the XML document
• This approach is popularized by the Document
Object Model (DOM)
• Problem:
– May require large amounts of memory
– May not be as fast as stream approach
• Some DOM parsers use SAX to build the tree
SAX and DOM
• SAX and DOM are standards for XML parsers
– DOM is a W3C standard
– SAX is an ad-hoc (but very popular) standard
• There are various implementations available
• Java implementations are provided as part of
JAXP (Java API for XML Processing)
• JAXP package is included in JDK starting from
JDK 1.4
– Is available separately for Java 1.3
Difference between SAX and DOM
• DOM reads the entire document into memory and
stores it as a tree data structure
• SAX reads the document and calls handler methods
for each element or block of text that it encounters
• Consequences:
– DOM provides "random access" into the document
– SAX provides only sequential access to the document
– DOM is slower and requires a large amount of memory, so it
is impractical for very large documents
– SAX is fast and requires very little memory, so it can be
used for huge documents
• This makes SAX much more popular for web sites
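The contrast can be made concrete by solving the same small task both ways. The sketch below counts elements once with a SAX handler (one sequential pass, only a counter kept) and once with DOM (whole tree built, then queried). The class and method names are hypothetical; the parsers are the JDK's JAXP defaults.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical comparison: counting elements with SAX and with DOM.
class CountElements {
    // SAX: the parser calls startElement once per start tag
    static int countWithSax(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    // DOM: the whole document is in memory, so it can be queried at will
    static int countWithDom(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("*").getLength(); // "*" matches every element
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<novel><chapter num=\"1\">A</chapter><chapter num=\"2\">B</chapter></novel>";
        System.out.println(countWithSax(xml)); // 3
        System.out.println(countWithDom(xml)); // 3
    }
}
```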
Parsing with SAX
• SAX uses the source-listener-delegate model for
parsing XML documents
– Source is XML data consisting of XML elements
– A listener written in Java is attached to the document
and listens for events
– When an event is fired, a handler method is called to
process it
Callbacks
• SAX works through callbacks:
– The program calls the parser
– The parser calls methods provided by the program

[Diagram: the program's main(...) calls parse(...) on the SAX parser; the parser calls back the program's startDocument(...), startElement(...), characters(...), endElement( ) and endDocument( )]
Simple SAX program
• The program consists of two classes:
– Sample -- This class contains the main method; it
• Gets a factory to make parsers
• Gets a parser from the factory
• Creates a Handler object to handle callbacks from the parser
• Tells the parser which handler to send its callbacks to
• Reads and parses the input XML file
– Handler -- This class contains handlers for three kinds of
callbacks:
• startElement callbacks, generated when a start tag is seen
• endElement callbacks, generated when an end tag is seen
• characters callbacks, generated for the contents of an element
The Sample class
import javax.xml.parsers.*; // for both SAX and DOM
import org.xml.sax.*;
import org.xml.sax.helpers.*;

// For simplicity, we let the operating system handle exceptions
// In "real life" this is poor programming practice
public class Sample {
public static void main(String args[]) throws Exception {
// Create a parser factory
SAXParserFactory factory = SAXParserFactory.newInstance();
// Tell factory that the parser must understand namespaces
factory.setNamespaceAware(true);
// Make the parser
SAXParser saxParser = factory.newSAXParser();
XMLReader parser = saxParser.getXMLReader();
The Sample class
// Create a handler
Handler handler = new Handler();
// Tell the parser to use this handler
parser.setContentHandler(handler);
// Finally, read and parse the document
parser.parse("hello.xml");
} // end of main
} // end of Sample class

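• The parser reads the file hello.xml
• A relative name like this one is resolved against the
current working directory, not the classpath
– Run the program from the directory that contains the file
– Or pass a full path or URL instead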
The Handler class
• public class Handler extends DefaultHandler {
– DefaultHandler is an adapter class that defines empty
methods to be overridden
• We define 3 methods to handle (1) start tags, (2)
contents, and (3) end tags.
– The methods will just print a line
– Each of these 3 methods throws a SAXException
• // SAX calls this when it encounters a start tag
public void startElement(String namespaceURI,
String localName, String qualifiedName,
Attributes attributes) throws SAXException {
System.out.println("startElement: " + qualifiedName);
}
The Handler class
• // SAX calls this method to pass in character data
public void characters(char ch[ ], int start, int length)
throws SAXException {
System.out.println("characters: \"" +
new String(ch, start, length) + "\"");
}
• // SAX calls this method when it encounters an end tag
public void endElement(String namespaceURI,
String localName,
String qualifiedName)
throws SAXException {
System.out.println("endElement: /" + qualifiedName);
}
} // End of Handler class
Results
• If the file hello.xml contains:
<?xml version="1.0"?>
<display>Hello World!</display>
• Then the output from running java Sample will be:
startElement: display
characters: "Hello World!"
endElement: /display
More results
• Now suppose the file hello.xml contains:
<?xml version="1.0"?>
<display>
<i>Hello</i> World!
</display>
• Notice that the root element, <display>, contains a nested
element <i> and whitespace (including newlines)
• The result will be:
startElement: display
characters: "" // empty string
characters: "
" // newline
characters: " " // spaces
startElement: i
characters: "Hello"
endElement: /i
characters: "World!"
characters: "
" // another newline
endElement: /display
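Because the parser may deliver the text of one element in several characters callbacks, and also reports whitespace between tags, a common workaround is to buffer the text and emit it only at element boundaries. The sketch below is one way to do this; TextCollector, collect and flushText are hypothetical names, and trimming whitespace is a design choice that would be wrong for documents where whitespace matters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers character data and drops whitespace-only text.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("startElement: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called several times per text run
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("endElement: /" + qName);
    }

    // Emit the buffered text once, trimmed, and only if something is left
    private void flushText() {
        String text = buffer.toString().trim();
        buffer.setLength(0);
        if (!text.isEmpty()) {
            events.add("characters: \"" + text + "\"");
        }
    }

    static List<String> collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        for (String event : collect("<display>\n<i>Hello</i> World!\n</display>")) {
            System.out.println(event);
        }
    }
}
```

With this handler the whitespace-only callbacks disappear from the output, and each run of text is reported exactly once.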
Factories
• SAX uses a parser factory
– A factory is a design pattern alternative to constructors
• Factories allow the programmer to:
– Decide whether or not to create a new object
– Decide what kind of object to create
class TrustMe {
private TrustMe() { } // private constructor

public static TrustMe makeTrust() { // static factory method
if ( /* test of some sort */)
return new TrustMe();
return null; // the factory decided not to create an object
}
}
Parser factories
• To create a SAX parser factory, call static method:
SAXParserFactory.newInstance()
– Returns an object of type SAXParserFactory
– It may throw a FactoryConfigurationError
• Then, the parser can be customized:
– public void setNamespaceAware(boolean awareness)
• Call this with true if you are using namespaces
• The default (if you don’t call this method) is false
– public void setValidating(boolean validating)
• Call this with true if you want to validate against a DTD
• The default (if you don’t call this method) is false
• Validation will give an error if you do not have a DTD
Getting a parser
• Once a SAXParserFactory factory has been set up,
parsers can be created with:
SAXParser saxParser = factory.newSAXParser();
XMLReader parser = saxParser.getXMLReader();
• Note: SAXParser is not thread-safe
• If a parser will be used in multiple threads,
create a separate SAXParser object for each thread
Declaring which handler to use
• Since the SAX parser will call the handlers, we
need to supply these methods
• Binding the parser with a handler:
Handler handler = new Handler();
parser.setContentHandler(handler);
• These statements could be combined:
parser.setContentHandler(new Handler());
• Finally, the parser is invoked on the file to parse:
parser.parse("hello.xml");
• Everything else is done in the handler methods
SAX handlers
• A complete callback handler implements 4 interfaces:
– interface ContentHandler
• Handles basic parsing callbacks, e.g., element starts and ends
– interface DTDHandler
• Handles only notation and unparsed entity declarations
– interface EntityResolver
• Does customized handling for external entities
– interface ErrorHandler
• Must be implemented or parsing errors will be ignored!
• Implementing all these interfaces is a lot of work
– It is easier to use an adapter class
Class DefaultHandler
• DefaultHandler is an adapter class from package
org.xml.sax.helpers
• DefaultHandler implements ContentHandler,
DTDHandler, EntityResolver, and
ErrorHandler
• DefaultHandler provides empty methods for
every method declared in each of the interfaces
• To use this class, extend it and override the
methods that are important to the application
ContentHandler methods
• public void startElement(String namespaceURI,
String localName, String qualifiedName,
Attributes atts) throws SAXException
• This method is called at the beginning of elements
• When SAX calls startElement, it passes in a
parameter of type Attributes
• The following methods look up attributes by name
rather than by index:
– public int getIndex(String qualifiedName)
– public int getIndex(String uri, String localName)
– public String getValue(String qualifiedName)
– public String getValue(String uri, String localName)
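As an illustration of the lookup-by-name methods, the sketch below records the scale attribute of every element that has one, using a cut-down version of the weather report. ScaleReader and read are hypothetical names; getIndex returns -1 and getValue returns null when the attribute is absent.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps element name -> value of its "scale" attribute.
class ScaleReader extends DefaultHandler {
    private final Map<String, String> scales = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (atts.getIndex("scale") >= 0) {             // -1 means "no such attribute"
            scales.put(qName, atts.getValue("scale")); // lookup by qualified name
        }
    }

    static Map<String, String> read(String xml) {
        ScaleReader handler = new ScaleReader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.scales;
    }

    public static void main(String[] args) {
        String xml = "<weatherReport><city>North Place</city>"
                + "<high scale=\"F\">103</high><low scale=\"F\">70</low></weatherReport>";
        System.out.println(read(xml)); // {high=F, low=F}
    }
}
```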
ContentHandler methods
• public void endElement(String namespaceURI,
String localName, String qualifiedName)
throws SAXException
• The parameters to endElement are the same as
those to startElement, except that the Attributes
parameter is omitted
• public void characters(char[] ch, int start, int
length) throws SAXException
• ch is an array of characters
– Only length characters, starting from ch[start], are the
contents of the element
Error Handling
• SAX error handling is unusual
• Most errors are ignored unless an error handler
(org.xml.sax.ErrorHandler) is registered
– Ignored errors can cause unexpected behavior
• The ErrorHandler interface declares:
– public void fatalError (SAXParseException exception)
throws SAXException // XML not well structured
– public void error (SAXParseException exception)
throws SAXException // XML validation error
– public void warning (SAXParseException exception)
throws SAXException // minor problem
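The three callbacks can be tried out with a validating parser and a small internal DTD. The sketch below collects warnings and validation errors in a list and lets fatal errors abort the parse. CollectingHandler and validate are hypothetical names; the DTD is made up for the example, and the exact message texts depend on the parser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records problems instead of silently ignoring them.
class CollectingHandler extends DefaultHandler {
    final List<String> problems = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }

    @Override
    public void error(SAXParseException e) { // validation error, parsing continues
        problems.add("error: " + e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        throw e; // not well-formed, stop parsing
    }

    static List<String> validate(String xml) {
        CollectingHandler handler = new CollectingHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // validate against the DTD
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            handler.problems.add("fatal: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE display [<!ELEMENT display (#PCDATA)>]>";
        System.out.println(validate(dtd + "<display>Hello World!</display>")); // no problems
        System.out.println(validate(dtd + "<display><i>Hello</i></display>")); // validation errors
        System.out.println(validate("<display>Hello World!</display>"));       // no DTD at all
    }
}
```

Without the error override, the second document would parse without any visible complaint, which is the behavior the slide warns about.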
External parsers
• Alternatively, you can use an existing parser:
– Xerces, Electric XML, Expat, MSXML, CMarkup
• Stages of the parsing
– Get the URL object for the source
– Create InputSource object encapsulating the data
source
– Create the parser
– Launch the parser on the data source
Creating InputSource
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.servlet.*;
import javax.servlet.http.*;
import java.io.*;
import java.net.*;
import java.util.*;
public class MyServlet extends HttpServlet {
private static URL url;
public void init() throws ServletException {
try {
url = new URL("http://server/data.xml");
} catch (MalformedURLException e) {
System.err.println(e);
}
}
Creating InputSource & Parser
public void doGet(HttpServletRequest req, HttpServletResponse resp)
throws IOException, ServletException {
resp.setContentType("text/html");
PrintWriter out = resp.getWriter();
out.println("<html><title> mytitle </title><body>");
InputStream in = url.openStream();
InputSource src = new InputSource(in);
try {
XMLReader parser = XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
parser.parse(src);
}
catch (SAXException e) { System.err.println(e); }
catch (IOException e) { System.err.println(e); }
out.println("</body></html>");
}
} // end of MyServlet class
Problems with SAX
• SAX provides only sequential access to the
document being processed
• SAX has only a local view of the current element
being processed
– Global knowledge of parsing must be stored in global
variables
– A single startElement() method for all elements
• In startElement() there are many “if-then-else” tests for
checking a specific element
• When an element is seen, a global flag is set
• When finished with the element global flag must be set to false
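The flag technique described above looks roughly like the following sketch, which pulls the city and state out of a fragment of the weather report. PlaceExtractor and extractPlace are hypothetical names; the flags are instance fields here, which play the role of the "global" variables.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: flags remember which element is currently open.
class PlaceExtractor extends DefaultHandler {
    private boolean insideCity = false;
    private boolean insideState = false;
    private final StringBuilder city = new StringBuilder();
    private final StringBuilder state = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // one method for all elements, hence the if-then-else chain
        if ("city".equals(qName)) {
            insideCity = true;
        } else if ("state".equals(qName)) {
            insideState = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (insideCity) {
            city.append(ch, start, length);
        } else if (insideState) {
            state.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // the flags must be reset when the element is finished
        if ("city".equals(qName)) {
            insideCity = false;
        } else if ("state".equals(qName)) {
            insideState = false;
        }
    }

    static String extractPlace(String xml) {
        PlaceExtractor handler = new PlaceExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.city + ", " + handler.state;
    }

    public static void main(String[] args) {
        String xml = "<weatherReport><city>North Place</city>, <state>NX</state></weatherReport>";
        System.out.println(extractPlace(xml)); // North Place, NX
    }
}
```

Every additional element of interest needs another flag and another branch in each method, which is why this style becomes hard to maintain.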
DOM
• DOM represents the XML document as a tree
– Hierarchical nature of tree maps well to hierarchical
nesting of XML elements
– Tree contains a global view of the document
• Makes navigation of document easy
• Allows modification of any subtree
• Easier processing than SAX but memory intensive!
• Like SAX, DOM is only an API
– Does not specify a parser
– Lists the API and requirements for the parser
• DOM parsers typically use SAX parsing
Simple DOM program
• First we need to create a DOM parser, called a
DocumentBuilder
• The parser is created, not by a constructor, but by
calling a static factory method
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();

DocumentBuilder builder =
factory.newDocumentBuilder();
Simple DOM program
• An XML file hello.xml will be parsed
<?xml version="1.0"?>
<display>Hello World!</display>
• To read this file, we add the following line:
Document document = builder.parse("hello.xml");
• document contains the entire XML file as a tree
• The following code finds the content of the root element
and prints it
Element root = document.getDocumentElement();
Node textNode = root.getFirstChild();
System.out.println(textNode.getNodeValue());
• The output of the program is: Hello World!
Reading in the tree
• The parse method reads in the entire XML
document and represents it as a tree in memory
– For a large document, parsing could take a while
– If you want to interact with your program while it is
parsing, you need to run the parser in a separate thread
• Practically, an XML parse tree may require up to 10
times as much memory as the original XML document
– If you have a lot of tree manipulation to do, DOM is
much more convenient than SAX
– If you do not have a lot of tree manipulation to do,
consider using SAX instead
Structure of the DOM tree
• The DOM tree is composed of Node objects
• Node is an interface
– Some of the more important sub-interfaces are Element,
Attr, and Text
• An Element node may have children
• Attr and Text nodes are the leaves of the tree
• Hence, the DOM tree is composed of Node objects
– Node objects can be downcast into specific types if needed
Operations on Nodes
• The results returned by getNodeName(), getNodeValue(),
getNodeType() and getAttributes() depend on the subtype
of the node, as follows:
                  Element         Text            Attr
getNodeName()     tag name        "#text"         name of attribute
getNodeValue()    null            text contents   value of attribute
getNodeType()     ELEMENT_NODE    TEXT_NODE       ATTRIBUTE_NODE
getAttributes()   NamedNodeMap    null            null
Distinguishing Node types
• An easy way to handle different types of nodes:
switch(node.getNodeType()) {
case Node.ELEMENT_NODE:
Element element = (Element)node;
...;
break;
case Node.TEXT_NODE:
Text text = (Text)node;
...
break;
case Node.ATTRIBUTE_NODE:
Attr attr = (Attr)node;
...
break;
default: ...
}
Operations on Nodes
• Tree-walking methods that return a Node:
– getParentNode()
– getFirstChild()
– getNextSibling()
– getPreviousSibling()
– getLastChild()
• Test methods that return a boolean:
– hasAttributes()
– hasChildNodes()
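A typical use of these methods is a loop over the children of a node with getFirstChild() and getNextSibling(). The sketch below lists the names of the child elements only, skipping the whitespace Text nodes that sit between them. ChildWalker, parseRoot and childElementNames are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks the direct children of a node.
class ChildWalker {
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node child = parent.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) { // skip text nodes
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element novel = parseRoot("<novel>\n<foreword/>\n<chapter number=\"1\"/>\n</novel>");
        System.out.println(novel.hasChildNodes());    // true
        System.out.println(childElementNames(novel)); // [foreword, chapter]
    }
}
```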
Operations for Elements
• String getTagName()
– Returns the name of the tag
• boolean hasAttribute(String name)
– Returns true if this Element has the named attribute
• String getAttribute(String name)
– Returns the value of the named attribute
• boolean hasAttributes()
– Returns true if this Element has any attributes
• NamedNodeMap getAttributes()
– Returns a NamedNodeMap of all the Element’s
attributes
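The Element operations above can be exercised on the font tag from the HTML example. The sketch below is illustrative only; AttributeLister, parseRoot and attributesOf are hypothetical names. Note that getAttribute returns an empty string, not null, for an attribute that is not present.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical helper: copies an element's attributes into a sorted map.
class AttributeLister {
    static Map<String, String> attributesOf(Element element) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap map = element.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            Attr attr = (Attr) map.item(i); // every item of this map is an Attr node
            result.put(attr.getName(), attr.getValue());
        }
        return result;
    }

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element font = parseRoot("<font face=\"Verdana\" size=\"+1\" color=\"red\"/>");
        System.out.println(font.getTagName());                   // font
        System.out.println(font.hasAttribute("face"));           // true
        System.out.println(font.getAttribute("size"));           // +1
        System.out.println(font.getAttribute("missing").isEmpty()); // true
        System.out.println(attributesOf(font)); // {color=red, face=Verdana, size=+1}
    }
}
```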
Operations on Texts
• Text is a subinterface of CharacterData and
inherits the following operations (among others):
– public String getData() throws DOMException
• Returns the text contents of this Text node
– public int getLength()
• Returns the number of Unicode characters in the text
– public String substringData(int offset, int count)
throws DOMException
• Returns a substring of the text contents
Operations on Attributes
• String getName()
– Returns the name of this attribute.
• Element getOwnerElement()
– Returns the Element node this attribute is attached to
• boolean getSpecified()
– Returns true if this attribute was explicitly given a
value in the document
• String getValue()
– Returns the value of the attribute as a String
Pre-order traversal
• The DOM is stored in memory as a tree
• Trees can be traversed using pre-order, in-order,
or post-order
• A simple way to traverse a tree is in preorder
• The general form of a pre-order traversal is:
– Visit the root
– Traverse each one of the sub-trees, in order
Pre-order traversal in Java
• static void simplePreorderPrint(String indent, Node node) {
printNode(indent, node);
if(node.hasChildNodes()) {
Node child = node.getFirstChild();
while (child != null) {
simplePreorderPrint(indent + " ", child);
child = child.getNextSibling();
}
}
}
• static void printNode(String indent, Node node) {
System.out.print(indent);
System.out.print(node.getNodeType() + " ");
System.out.print(node.getNodeName() + " ");
System.out.print(node.getNodeValue() + " ");
System.out.println(node.getAttributes());
}
Trying out the program
Input:
<?xml version="1.0"?>
<novel>
<chapter num="1">The Beginning</chapter>
<chapter num="2">The Middle</chapter>
<chapter num="3">The End</chapter>
</novel>

Output:
1 novel null
 3 #text
 null
 1 chapter null num="1"
  3 #text The Beginning null
 3 #text
 null
 1 chapter null num="2"
  3 #text The Middle null
 3 #text
 null
 1 chapter null num="3"
  3 #text The End null
 3 #text
 null

Things to think about:
What are the numbers?
Are the nulls in the right places?
Is the indentation as expected?
How could this program be improved?
Overview
• DOM, unlike SAX, allows creating and
modifying XML trees
• There are three basic kinds of operations:
– Creating a new DOM
– Modifying the structure of a DOM
– Modifying the content of a DOM
• Creating a new DOM requires a few extra
methods just to get started
– Afterwards, you can add elements through modifying
its structure and contents
Creating a new DOM
import javax.xml.parsers.*;
import org.w3c.dom.Document;

try {
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder builder =
factory.newDocumentBuilder();
Document doc = builder.newDocument();
}
catch (ParserConfigurationException e) { ... }
Creating structure
• The following are instance methods of Document:
– public Element createElement(String tagName)
– public Element createElementNS(String namespaceURI,
String qualifiedName)
– public Attr createAttribute(String name)
– public Attr createAttributeNS(String namespaceURI,
String qualifiedName)
– public ProcessingInstruction createProcessingInstruction
(String target, String data)
– public EntityReference createEntityReference(String name)
– public Text createTextNode(String data)
– public Comment createComment(String data)
Methods of Node
• public Node appendChild(Node newChild)
• public Node insertBefore(Node newChild, Node
refChild)
• public Node removeChild(Node oldChild)
• public Node replaceChild(Node newChild, Node
oldChild)
• public void setNodeValue(String nodeValue)
– Functionality depends on the type of the node
Methods of Element
• public void setAttribute(String name, String value)
• public Attr setAttributeNode(Attr newAttr)
• public void setAttributeNS(String namespaceURI,
String qualifiedName, String value)
• public Attr setAttributeNodeNS(Attr newAttr)
• public void removeAttribute(String name)
• public void removeAttributeNS(String namespaceURI,
String localName)
• public Attr removeAttributeNode(Attr oldAttr)
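Putting the creation and modification methods together, the sketch below builds a small novel document from scratch and then changes its structure and attributes. NovelBuilder and buildNovel are hypothetical names; the element and attribute names are borrowed from the earlier examples.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example: creating and modifying a DOM tree.
class NovelBuilder {
    static Document buildNovel() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }

        // Create the structure: nodes are made by the Document, then attached
        Element novel = doc.createElement("novel");
        doc.appendChild(novel); // becomes the root element

        Element chapter = doc.createElement("chapter");
        chapter.setAttribute("num", "1");
        chapter.appendChild(doc.createTextNode("The Beginning"));
        novel.appendChild(chapter);

        // Modify the structure
        Element foreword = doc.createElement("foreword");
        novel.insertBefore(foreword, chapter); // foreword now precedes chapter
        Element scratch = doc.createElement("scratch");
        novel.appendChild(scratch);
        novel.removeChild(scratch);            // and it is gone again

        // Modify the attributes
        chapter.setAttribute("draft", "true");
        chapter.removeAttribute("draft");
        return doc;
    }

    public static void main(String[] args) {
        Element root = buildNovel().getDocumentElement();
        System.out.println(root.getFirstChild().getNodeName());   // foreword
        System.out.println(root.getLastChild().getTextContent()); // The Beginning
    }
}
```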
Method of Attribute
• public void setValue(String value)
• This is the only method that modifies an
Attribute
– The rest just retrieve information
Writing out the DOM as XML
• The DOM interfaces shown so far have no method for
writing out a DOM as XML
– JAXP's javax.xml.transform package can serialize a DOM
with an identity Transformer
• Writing out a DOM is conceptually simple
– It is just a tree walk
• Practically, there are a lot of details
– Various node types
– Binding attributes
– …
• Doing a good job isn’t complicated, but it is
lengthy
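One route that avoids a hand-written tree walk is the identity transform from javax.xml.transform, which is part of JAXP. The sketch below is a minimal example under that assumption; DomWriter, toXml and helloDocument are hypothetical names, and the XML declaration is omitted to keep the output short.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serializes a DOM tree back to XML text.
class DomWriter {
    static String toXml(Document doc) {
        try {
            // newTransformer() without a stylesheet copies the input unchanged
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds the same document as hello.xml, in memory
    static Document helloDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element display = doc.createElement("display");
            display.appendChild(doc.createTextNode("Hello World!"));
            doc.appendChild(display);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(helloDocument())); // <display>Hello World!</display>
    }
}
```

The transformer takes care of the details listed above, such as the different node types and the escaping of special characters.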
