Lec12 XMLCS
Lecture 12 – XML
Briefly: The Power of XML
• XML is Extensible Markup Language
– Text-based representation for describing data structure
• Both human and machine readable
– Originated from Standard Generalized Markup
Language (SGML)
– Became a World Wide Web Consortium (W3C)
standard in 1998
• XML is a great choice for exchanging data
between disparate systems
Synergy between Java and XML
• Java+XML=Portable language+Portable
Data
• Allows the use of Java to generate XML data
– Use Java to access SQL databases
– Use Java to format data in XML
– Use Java to parse data
– Use Java to validate data
– Use Java to transform data
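As an illustration of formatting data in XML from Java, here is a minimal sketch that builds a one-element document with the standard JAXP DOM and Transformer APIs and serializes it to a string. The class name XmlWriterSketch and the method toXml are hypothetical names chosen for this example; they are not part of the lecture material.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
class XmlWriterSketch {
    // Wraps a piece of text in a single element and returns the XML as a string.
    static String toXml(String tagName, String text) throws Exception {
        // Create an empty in-memory document.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement(tagName);
        root.setTextContent(text); // characters such as < and & are escaped on output
        doc.appendChild(root);

        // A Transformer created without a stylesheet copies its input unchanged,
        // so it can be used to write the tree out as text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("display", "Hello World!"));
    }
}
```

Building the tree and letting the library serialize it, rather than concatenating strings, leaves the escaping of special characters to the library.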
HTML and XML
[Figure: tree of a sample XML document with a novel element, foreword and chapter elements, and an attribute number="1"]
[Figure: SAX callbacks. The program's main(...) calls the SAX parser's parse(...); the parser calls back startDocument(...), startElement(...), characters(...), endElement( ) and endDocument( )]
Simple SAX program
• The program consists of two classes:
– Sample -- This class contains the main method; it
• Gets a factory to make parsers
• Gets a parser from the factory
• Creates a Handler object to handle callbacks from the parser
• Tells the parser which handler to send its callbacks to
• Reads and parses the input XML file
– Handler -- This class contains handlers for three kinds of
callbacks:
• startElement callbacks, generated when a start tag is seen
• endElement callbacks, generated when an end tag is seen
• characters callbacks, generated for the contents of an element
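The two classes described above could be sketched roughly as follows. This is a minimal sketch, not the lecture's original code: the log field, the trace method, and the fallback to a built-in sample string when no file name is given are assumptions added so the example runs on its own.

```java
import java.io.FileReader;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class Sample {
    // Hypothetical helper: parses the input and returns the handler's trace.
    static String trace(InputSource input) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance(); // get a factory
        SAXParser saxParser = factory.newSAXParser();              // get a parser from it
        XMLReader parser = saxParser.getXMLReader();
        Handler handler = new Handler();                           // create the handler
        parser.setContentHandler(handler);                         // tell the parser where to send callbacks
        parser.parse(input);                                       // read and parse the input
        return handler.log.toString();
    }

    public static void main(String[] args) throws Exception {
        // Use the file named on the command line, or a small built-in sample.
        InputSource input = args.length > 0
                ? new InputSource(new FileReader(args[0]))
                : new InputSource(new StringReader("<display>Hello World!</display>"));
        System.out.print(trace(input));
    }
}

// Handles the three kinds of callbacks; DefaultHandler supplies empty
// implementations for all the others.
class Handler extends DefaultHandler {
    final StringBuilder log = new StringBuilder(); // assumption: collect a trace instead of printing

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        log.append("start:").append(qName).append('\n');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.append("end:").append(qName).append('\n');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver an element's text in more than one call.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            log.append("chars:").append(text).append('\n');
        }
    }
}
```

Note that the handler never sees the document as a whole; it only reacts to events in the order the parser encounters them.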
The Sample class
import javax.xml.parsers.*; // for both SAX and DOM
import org.xml.sax.*;
import org.xml.sax.helpers.*;
DocumentBuilder builder =
factory.newDocumentBuilder();
Simple DOM program
• An XML file hello.xml will be parsed
<?xml version="1.0"?>
<display>Hello World!</display>
• To read this file, we add the following line:
Document document = builder.parse("hello.xml");
• document contains the entire XML file as a tree
• The following code finds the content of the root element
and prints it
Element root = document.getDocumentElement();
Node textNode = root.getFirstChild();
System.out.println(textNode.getNodeValue());
• The output of the program is: Hello World!
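Put together, the fragments above might form a complete program like the sketch below. To keep it self-contained, it parses the XML from a string rather than from hello.xml; the class name HelloDom and the method rootText are hypothetical names introduced for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name, for illustration only.
class HelloDom {
    // Parses the XML text and returns the value of the root element's first child.
    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Assumption: read from a string here; builder.parse("hello.xml") reads the file instead.
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        Element root = document.getDocumentElement(); // the <display> element
        Node textNode = root.getFirstChild();         // its Text child
        return textNode.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<display>Hello World!</display>";
        System.out.println(rootText(xml)); // prints: Hello World!
    }
}
```

This works only because the root element's first child is a Text node; an empty root element would make getFirstChild() return null.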
Reading in the tree
• The parse method reads in the entire XML
document and represents it as a tree in memory
– For a large document, parsing could take a while
– If you want to interact with your program while it is
parsing, you need to run the parser in a separate thread
• In practice, an XML parse tree may require up to 10
times as much memory as the original XML document
– If you have a lot of tree manipulation to do, DOM is
much more convenient than SAX
– If you do not have a lot of tree manipulation to do,
consider using SAX instead
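One way to keep a program responsive while a document is being parsed is to hand the parse to a background thread. The sketch below uses CompletableFuture for this; that choice, the class name BackgroundParse, and the method parseAsync are assumptions made for this example, and a plain Thread would serve equally well.

```java
import java.io.StringReader;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, for illustration only.
class BackgroundParse {
    // Starts parsing on a background thread and returns immediately.
    static CompletableFuture<Document> parseAsync(String xml) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                // The builder is created and used inside the background thread only.
                return DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                // Checked exceptions cannot escape the lambda, so wrap them.
                throw new CompletionException(e);
            }
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Document> pending = parseAsync("<display>Hello World!</display>");
        System.out.println("Parsing started; the main thread is free for other work");
        Document document = pending.join(); // wait here only when the tree is needed
        System.out.println(document.getDocumentElement().getTagName());
    }
}
```

The tree should be handed over to the rest of the program only after parsing completes, as join() ensures here.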
Structure of the DOM tree
• The DOM tree is composed of Node objects
• Node is an interface
– Some of the more important sub-interfaces are Element,
Attr, and Text
• An Element node may have children
• Attr and Text nodes are the leaves of the tree
• Hence, every object in the DOM tree can be handled as a Node
– Node objects can be downcast into specific types if needed
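To illustrate the structure described above, here is a minimal sketch that walks a DOM tree recursively and downcasts each Node to Element, Attr, or Text as needed. The class name DomWalk, the methods describe and dump, and the sample document are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical class name, for illustration only.
class DomWalk {
    // Appends one line per element or non-blank text node, indented by depth.
    static void describe(Node node, String indent, StringBuilder out) {
        if (node instanceof Element) {
            Element element = (Element) node; // downcast from Node
            out.append(indent).append("Element ").append(element.getTagName());
            NamedNodeMap attributes = element.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                Attr attr = (Attr) attributes.item(i); // attributes are Nodes too
                out.append(" @").append(attr.getName()).append('=').append(attr.getValue());
            }
            out.append('\n');
            NodeList children = element.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                describe(children.item(i), indent + "  ", out);
            }
        } else if (node instanceof Text) {
            String data = ((Text) node).getData().trim();
            if (!data.isEmpty()) {
                out.append(indent).append("Text ").append(data).append('\n');
            }
        }
        // Other node types (comments, processing instructions) are ignored here.
    }

    static String dump(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        describe(document.getDocumentElement(), "", out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(dump(
                "<novel><foreword>Hi</foreword><chapter number=\"1\">Once</chapter></novel>"));
    }
}
```

Attributes are reached through getAttributes() rather than getChildNodes(), which is why the sketch handles them separately from child elements and text.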
Operations on Nodes
• The results returned by getNodeName(), getNodeValue(),
getNodeType() and getAttributes() depend on the subtype
of the node, as follows:
Element Text Attr