

Download a Website for offline browsing

Use common Java classes to build an offloading utility


In this article, I guide you through the steps involved in designing a utility to
download a Website. This utility downloads only text and image files, but it
can easily be extended to download files of any type. At the end of the
article I'll provide tips on how you can extend the utility.

First, a brief introduction to URLs (Uniform Resource Locators) would not be out of
place. The general form of a URL is:
protocol://machinename[:port]/filename[#reference].

An absolute URL -- such as http://java.sun.com/products/jdk1.2 -- has all the
components required to identify the resource on the Web. In relative URLs, the protocol
and the machine name are inherited from the base URL embedded in the document
(base tag) or from the URL used to retrieve the document. For example, assume that you
have downloaded an HTML document using the URL
http://www.somesite.com/index.html and that this document has a link home.html. The
link actually points to http://www.somesite.com/home.html. For more information,
please see Resources.
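The resolution rule described above can be checked with a few lines of Java. This is a minimal sketch rather than code from the utility: the class RelativeUrlSketch and its resolve() method are hypothetical names, and the sketch assumes java.net.URI.resolve() is an acceptable stand-in for the resolution the URL class performs.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the article's utility.
final class RelativeUrlSketch {

    // Resolves a link found in a document against the URL used to retrieve it.
    static String resolve(String documentUrl, String link) {
        return URI.create(documentUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Relative link: protocol and machine name come from the document URL.
        System.out.println(resolve("http://www.somesite.com/index.html", "home.html"));
        // Absolute link: it already identifies the resource, so it is returned as is.
        System.out.println(resolve("http://www.somesite.com/index.html",
                "http://java.sun.com/products/jdk1.2"));
    }
}
```

The first call prints http://www.somesite.com/home.html, matching the example in the text.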
The utility I describe in this article uses the URL class in the java.net package. The
class provides three methods to obtain data from the URL. In this utility, I use the
method public final InputStream openStream() throws IOException to
establish a connection with the URL and to return an InputStream object to get the data
from the URL. Note that the data does not contain any of the HTTP headers. This
method hides all the intricacies of setting the appropriate parameters to make a
connection and connecting to the remote resource. It returns the InputStream, which
helps you to get the data as you would get any other file stream.
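To make the use of openStream() concrete, here is a small sketch. It is not taken from the utility's source: the class OpenStreamSketch, the methods readAll() and samplePage(), and the choice of UTF-8 are all assumptions made for illustration. So that it runs without a network connection, it reads from a file: URL that points at a temporary file; an http: URL is read in exactly the same way.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example class; names are illustrative only.
final class OpenStreamSketch {

    // Reads everything behind the URL as text, one line at a time.
    static String readAll(URL url) {
        StringBuilder text = new StringBuilder();
        try (InputStream in = url.openStream();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Writes a tiny page to a temporary file and returns it as a file: URL.
    static URL samplePage() {
        try {
            Path page = Files.createTempFile("sample", ".html");
            page.toFile().deleteOnExit();
            Files.write(page,
                    "<html><body>Hello</body></html>\n".getBytes(StandardCharsets.UTF_8));
            return page.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(readAll(samplePage()));
    }
}
```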
Some of the commonly used protocols are HTTP, FTP, Gopher, and News. This article
deals only with HTTP (HyperText Transfer Protocol), an application-level protocol
commonly used to transfer hypertext documents across the Internet. HTTP has gained
importance because of its simplicity and low overhead.
The main idea
Suppose you visit a Webpage containing links to several other pages that, in
turn, have links to still other pages. You want to download all those pages
onto your hard disk. How would you accomplish this? You could simply visit
all the pages and save them on your hard disk, right? However, that is not
only a tedious process but also an inconvenient one. The links in the pages
may not be pointing correctly (relative to the location of other pages you are
downloading), or the links might be absolute URLs pointing to the remote
machine (in which case, downloading the page becomes useless). You could
manipulate the links manually, but that would also be a painful process.

This utility lets you download all the pages of a Website in a graceful manner. It follows
these simple steps:
1. It downloads a page and stores all the links inside a vector
2. It loops (or iterates) over all the elements of the vector, repeating
Step 1 and Step 2 recursively

The utility consists of four classes: DownloadSite, Downloader, URLlist, and
ExtendedURL. You can download the source code from Resources.
DownloadSite
The DownloadSite class reads the command line arguments and does some
initialization. It contains the main() method. This utility takes at least one
but no more than two arguments. The first argument is the site name, and
the second is the location of the new directory created to hold the
downloaded files. If you do not specify the second argument, the files are
downloaded into the current directory.

If you need to use this utility behind a firewall, the changes should be done in
DownloadSite. See Resources for information on how to access the sites when you are
behind a firewall.
DownloadSite parses the command line arguments and passes them to the Downloader
class, which does the actual downloading.
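The argument handling described above might look roughly like the following. This is only a sketch under stated assumptions: the class DownloadSiteArgsSketch, the parse() method, the usage message, and the use of "." to stand for the current directory are hypothetical and are not taken from the source code in Resources.

```java
// Hypothetical sketch of the argument check; not the article's DownloadSite class.
final class DownloadSiteArgsSketch {

    // Returns {site, destinationDirectory}. The directory defaults to "."
    // (an assumed way of saying "the current directory").
    static String[] parse(String[] args) {
        if (args.length < 1 || args.length > 2) {
            throw new IllegalArgumentException(
                    "expected: <site> [destination-directory]");
        }
        String destination = (args.length == 2) ? args[1] : ".";
        return new String[] { args[0], destination };
    }

    public static void main(String[] args) {
        String[] parsed = parse(new String[] { "http://www.somesite.com/index.html" });
        System.out.println(parsed[0] + " -> " + parsed[1]);
    }
}
```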

Downloader
Downloader is the heart of the utility. This class contains the logic used to
download the pages and the code to manipulate the links.

You use recursion to download the pages. The logic is simple:


private void startDownload(URL u)
{
    ...
    /*
     * downloadAndFillVector downloads the file specified by URL u
     * (and also manipulates the links) and returns a vector of the
     * URLs in that file. After the execution of this statement,
     * listOfURL contains the URLs in the current page that need
     * to be downloaded.
     */
    listOfURL = downloadAndFillVector(in, out);
    ...
    /*
     * Loop through all the elements of the vector and
     * call startDownload recursively. The process repeats
     * until all the pages are downloaded.
     */
    sizeOfVector = listOfURL.size();
    for (int i = 0; i < sizeOfVector; i++)
        startDownload((URL) listOfURL.elementAt(i));
}

I should explain two private members of this class: private String hostName and
static Vector URLs:

hostName contains the machine name from the first page's URL (the
URL provided at the command line). In any page, you can have two
types of links: absolute and relative. If the link is relative, use this
hostName to retrieve the document. But if the link is absolute, you
must check whether or not the host name in the link is the same as
hostName. If it is, include this link in the list of URLs to be downloaded.
If it isn't, ignore this link. For example, if you are downloading a site,
say www.somesite.com, and one of its pages contains a link to
www.othersite.com, you do not want to download pages from
www.othersite.com.

URLs is the global vector where you keep adding all the pages you
download. When you get a link, check whether or not the link is
already present in URLs. This prevents you from downloading a page
twice. Another common scenario: Often a page a.html can link to
another page b.html, and the page b.html can also have a link to
a.html (from the Back button). static Vector URLs also helps you
avoid falling into such loops.
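The two checks just described (same host, not seen before) can be sketched as follows. The names LinkFilterSketch and shouldDownload() are hypothetical, and the sketch swaps the utility's static Vector for a HashSet purely to keep the example short; the idea is the same.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the hostName and URLs checks.
final class LinkFilterSketch {
    private final String hostName;                      // host of the first page
    private final Set<String> seen = new HashSet<>();   // stands in for the URLs vector

    LinkFilterSketch(String firstPageUrl) {
        this.hostName = URI.create(firstPageUrl).getHost();
        seen.add(firstPageUrl);
    }

    // Returns true only the first time a link on the starting host is offered.
    boolean shouldDownload(String absoluteUrl) {
        String host = URI.create(absoluteUrl).getHost();
        if (host == null || !host.equalsIgnoreCase(hostName)) {
            return false;                // link points to another site: ignore it
        }
        return seen.add(absoluteUrl);    // add() returns false for a known URL
    }

    public static void main(String[] args) {
        LinkFilterSketch filter = new LinkFilterSketch("http://www.somesite.com/a.html");
        System.out.println(filter.shouldDownload("http://www.somesite.com/b.html"));
        System.out.println(filter.shouldDownload("http://www.somesite.com/a.html"));
        System.out.println(filter.shouldDownload("http://www.othersite.com/index.html"));
    }
}
```

Because a.html is recorded when the filter is created, the link back from b.html to a.html is rejected, which is what stops the a.html/b.html loop.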

To download text files and binary files, you must have separate methods for each. From
the file extensions, decide whether the file is a binary (image) file or a text file. The
method nonTextFile() returns true if the file is not a text file. For efficiency, call a
different method, downloadNonTextFile(), to download binary files. This function
does not perform any file parsing.
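As a rough illustration of this split, the sketch below decides by extension and copies binary data byte for byte. It is not the utility's code: the class name BinaryDownloadSketch, the copy() helper, and the particular list of image extensions are assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical sketch; the extension list is an assumption.
final class BinaryDownloadSketch {
    private static final String[] IMAGE_EXTENSIONS = { ".gif", ".jpg", ".jpeg", ".png" };

    // True when the extension marks the file as binary (here: an image).
    static boolean nonTextFile(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (String extension : IMAGE_EXTENSIONS) {
            if (lower.endsWith(extension)) {
                return true;
            }
        }
        return false;
    }

    // Copies raw bytes without looking at them; returns the number copied.
    static long copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[4096];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(nonTextFile("images/back.GIF"));
        System.out.println(nonTextFile("docs/index.html"));
    }
}
```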
If the files are text files, you must extract the links and modify them appropriately. If
you wanted only to extract the links, you could use the Swing package to do so (see
Resources).
But for this article, you also want to change the links, so you should parse the files.

The general strategy

First, search for all occurrences of <a, <base, <img, and <frame (irrespective of case)
and store the characters up to the next closing > in a string. For example, in the case
of "Click <a href=enter.html>here</a> to enter," the sequence a href=enter.html
is stored in the string. Extract the URL from this string, but first make sure to take care
of several things:

If the link is absolute, such as http://www.somesite.com/docs/index.html, it
begins with a protocol name (http, ftp, news, and so forth -- although this article
is concerned only with http). In such cases, check whether or not the hostname is
the same as the hostName:
o If it isn't, do not modify the link or download the file.
o If the hostnames are the same, manipulate the link. Suppose the destination
directory is /work/tp; you then modify the link to
/work/tp/docs/index.html (meaning the destination directory plus the
link's filename) and add the unmodified link to the list of links to be
downloaded.

If the link begins with a slash (/), such as in /images/back.gif, the
hostname and protocol are guaranteed to be the same. Modify the link to include the
destination directory and add the unmodified link to the list of links to be
downloaded. For example, assume you have downloaded a page from
http://www.somesite.com/docs/index.html, which has a link /images/back.gif.
If the destination directory is /work/tp, the link should be modified to
/work/tp/images/back.gif.

If the link does not begin with a slash (/), you need not modify the link.
Add the link to the list of links to be downloaded.

If the link ends with a slash (/), modify the link to include index.html and
add the unmodified link to the list.

If the base tag (<base) is present in the document, evaluate the relative URLs
using the base URL. In such cases, first evaluate the filename using the base
URL and the relative links, then modify the link to the destination directory plus
the newly evaluated filename. The unmodified link is added to the list of links to
be downloaded. An important thing to note is that the base tag is removed from
the downloaded file. Here's an example of how the relative links are resolved in
this case:

<html>
<head>
<title>Just an example</title>
<base href="http://www.somesite.com/docs/someproduct/index.html">
</head>
<body>
<p>Let us manipulate <a href="../someotherproduct/index.html">this link</a>
</body>
</html>

Now, the relative URL ../someotherproduct/index.html would resolve to
http://www.somesite.com/docs/someotherproduct/index.html. The filename
(meaning the newly evaluated filename) is then
docs/someotherproduct/index.html. If the destination directory is /work/tp, the
link is modified to /work/tp/docs/someotherproduct/index.html.
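The link-rewriting rules in the list above can be condensed into one small method. The following is a sketch under assumptions, not the utility's processLink(): the class LinkRewriteSketch, the rewrite() method, and its parameters are hypothetical, and it relies on java.net.URI for parsing, so links that are not valid URI syntax (for example, links containing spaces) would need extra handling.

```java
import java.net.URI;

// Hypothetical condensation of the rewriting rules; names are illustrative.
final class LinkRewriteSketch {

    // Returns the link as it should appear in the saved file.
    // baseTagUrl is the URL from the <base> tag, or null if the page has none.
    static String rewrite(String link, String hostName, String destDir, String baseTagUrl) {
        URI uri = URI.create(link);
        String local;
        if (uri.isAbsolute()) {
            if (!hostName.equalsIgnoreCase(uri.getHost())) {
                return link;                         // other site: leave untouched
            }
            local = destDir + uri.getPath();         // same host: destDir + filename
        } else if (baseTagUrl != null) {
            // Evaluate the filename against the base URL first.
            local = destDir + URI.create(baseTagUrl).resolve(uri).getPath();
        } else if (link.startsWith("/")) {
            local = destDir + link;                  // host-relative link
        } else {
            local = link;                            // plain relative link: unchanged
        }
        return local.endsWith("/") ? local + "index.html" : local;
    }

    public static void main(String[] args) {
        String host = "www.somesite.com";
        System.out.println(rewrite("http://www.somesite.com/docs/index.html", host, "/work/tp", null));
        System.out.println(rewrite("/images/back.gif", host, "/work/tp", null));
        System.out.println(rewrite("../someotherproduct/index.html", host, "/work/tp",
                "http://www.somesite.com/docs/someproduct/index.html"));
    }
}
```

With the destination directory /work/tp, the three calls in main() reproduce the rewritten links given in the text: /work/tp/docs/index.html, /work/tp/images/back.gif, and /work/tp/docs/someotherproduct/index.html.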
A number of methods are written to parse the file and manipulate the links.
downloadAndFillVector() does the first-phase parsing. It scans the file being
downloaded for <a, <base, <img, and <frame, and it stores the characters up to > in a
StringBuffer. These characters are not written to the downloaded file, since you need
to modify the links present in this string. This string is passed to another method,
modifyLink().
Take a look at the following example:
<BODY>
<P>I just returned from vacation! Here is a photo of my family at the
lake:
<IMG SRC=image/family.gif alt="A photo of family at the lake">
</BODY>

After the first-phase parsing, the string obtained is:
IMG SRC=image/family.gif alt="A photo of family at the lake"
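For readers who want to experiment with the first phase, here is a compact sketch. It is not how downloadAndFillVector() is written: the utility scans the stream and fills a StringBuffer, whereas this hypothetical TagScanSketch class uses a regular expression over a string that is already in memory, simply because that is shorter to show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based stand-in for the first-phase scan.
final class TagScanSketch {
    // <a, <base, <img, or <frame in any case, then everything up to the next '>'.
    private static final Pattern LINK_TAG =
            Pattern.compile("<((?:a|base|img|frame)\\b[^>]*)>", Pattern.CASE_INSENSITIVE);

    // Returns the text between '<' and '>' for each tag of interest.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = LINK_TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<BODY>\n<P>I just returned from vacation!\n"
                + "<IMG SRC=image/family.gif alt=\"A photo of family at the lake\">\n</BODY>";
        System.out.println(extractTags(html));
    }
}
```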
modifyLink() does the second-phase parsing. It searches for the occurrence of href or
src in the string. Now, the link can take any one of the following forms:

SRC=image/family.gif
SRC =image/family.gif
SRC= image/family.gif
SRC = image/family.gif

Most browsers accept spaces on either side of the equal sign (=) between href or src
and the link. modifyLink() evaluates the index of a link's beginning (in this case, i).
This index, the string, and the length of the string are passed to another method,
processLink(). So, after the second-phase parsing, you have:
image/family.gif alt="A photo of family at the lake"

processLink() performs the final phase of parsing. It determines where the link
ends, which leaves you with image/family.gif. Depending on the link, processLink()
modifies the link and writes the modified link to the file. The unmodified link is
returned to downloadAndFillVector(), which adds the link to a vector.
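The second and third phases together amount to pulling the link value out of the tag string. The sketch below shows one way to do that; it is an assumption-laden illustration, not the utility's modifyLink() and processLink(). The class LinkAttributeSketch and the method extractLink() are hypothetical, and a regular expression replaces the index arithmetic the article describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based stand-in for the second and third phases.
final class LinkAttributeSketch {
    // href or src, optional spaces around '=', then a quoted or unquoted value.
    private static final Pattern LINK_ATTRIBUTE = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the link inside the tag string, or null if it has no href/src.
    static String extractLink(String tag) {
        Matcher matcher = LINK_ATTRIBUTE.matcher(tag);
        if (!matcher.find()) {
            return null;
        }
        for (int group = 1; group <= 3; group++) {
            if (matcher.group(group) != null) {
                return matcher.group(group);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractLink(
                "IMG SRC = image/family.gif alt=\"A photo of family at the lake\""));
        System.out.println(extractLink("a href=\"enter.html\""));
    }
}
```

Spaces on either side of the equal sign are absorbed by the pattern, so every spacing variant of the SRC attribute yields the same link, image/family.gif.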

downloadAndFillVector() does one more thing. If the base tag (<base) is present,
downloadAndFillVector() extracts the URL from the base tag and assigns it to
baseTagURL, which is a private member of Downloader. You use the baseTagURL to get
the actual file paths in case the links are relative. downloadAndFillVector() calls the
setBaseTagURL() method to extract the base URL from the base string, parsing the
same way you did in the first and second phase.


Finally, the list of links is passed to the formVectorOfURLs() method. This method creates
an object of the URLlist class, whose sole purpose is to generate complete URLs using
the base URL and the links so you can use them to download Web pages.

URLlist
URLlist is a simple class. It receives the base URL, which is either the URL specified
in the base tag or the URL used to retrieve the document, plus the list of links in the
page to be downloaded. From this, it generates a list of URLs and returns the list to the
Downloader class. It also adds the link to the global vector URLs. This class contains the
functionality that prevents a link from being downloaded twice.

ExtendedURL
The URL class provides a method getFile() that returns the filename (anything after
the machine name up to # or the end of the string). You need a way to get the directory
and the filename, and you need the directory name to maintain the same file structure.
URL, being a final class, cannot be extended, so you can use composition. (There are
basically two approaches in object-oriented programming to achieve code reusability:
inheritance and composition. In composition, you use the existing class as a member of
the new class, which is composed of the already existing class along with other
members. For more information, see Resources.) ExtendedURL has a member field of
type URL. The ExtendedURL class provides methods getDirectory() and getFile(),
which return the directory and the filename, respectively.
First, obtain the filename (directory plus the optional filename) using the getFile()
method of the URL class. Then search for the question mark (?) in the filename. A
question mark indicates that the query string is appended to the filename and that the
file is a script. You can't download scripts, so both directory and file are set to null. This
URL is not added to the list of URLs.
If the filename ends with /, the directory is set to the filename and the file is set to null.
In other cases, you can extract the directory and file from the filename by performing
some simple calculations.
One more thing determines the setting of the directory and file.

Consider these two URLs:

http://www.somesite.com/work/xyz.html

http://www.somesite.com/work/abc.mp3

In the first example, the directory is set to /work and the file is set to xyz.html. In the
second example, the file is abc.mp3; however, since here you are not interested in MP3
files, set the filename and directory to null.
Use the method fileOfInterest(), which returns true for the files that interest you.
You can add MP3 files in this method, so this utility can download MP3 files as well.
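The composition approach and the directory/file rules can be sketched as below. This is an illustration under assumptions rather than the ExtendedURL class from Resources: the class name ExtendedUrlSketch, the of() factory, the extension list in fileOfInterest(), and the choice of "/" as the directory of a file at the site root are all hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Locale;

// Hypothetical sketch of wrapping URL by composition.
final class ExtendedUrlSketch {
    private final URL url;            // URL is final, so it is held as a member
    private final String directory;
    private final String file;

    ExtendedUrlSketch(URL url) {
        this.url = url;
        String name = url.getFile();  // path plus "?query", if any
        String dir = null;
        String leaf = null;
        if (name.indexOf('?') < 0) {                  // a query string marks a script
            if (name.endsWith("/")) {
                dir = name;                           // directory only, no file
            } else if (fileOfInterest(name)) {
                int slash = name.lastIndexOf('/');
                dir = (slash <= 0) ? "/" : name.substring(0, slash);
                leaf = name.substring(slash + 1);
            }
        }
        this.directory = dir;
        this.file = leaf;
    }

    // Convenience factory so callers need not handle MalformedURLException.
    static ExtendedUrlSketch of(String spec) {
        try {
            return new ExtendedUrlSketch(URI.create(spec).toURL());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Assumed list of text and image extensions; add ".mp3" to fetch MP3 files.
    static boolean fileOfInterest(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        for (String ext : new String[] { ".html", ".htm", ".txt", ".gif", ".jpg", ".jpeg" }) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    URL getUrl() { return url; }
    String getDirectory() { return directory; }
    String getFile() { return file; }

    public static void main(String[] args) {
        ExtendedUrlSketch page = of("http://www.somesite.com/work/xyz.html");
        System.out.println(page.getDirectory() + " " + page.getFile());
        ExtendedUrlSketch song = of("http://www.somesite.com/work/abc.mp3");
        System.out.println(song.getDirectory() + " " + song.getFile());
    }
}
```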
To understand this utility further, you can browse through the source code in Resources.
This utility, however, has some limitations.
Limitations of the utility
The following list reveals the limitations of the utility described in this
article. You can easily overcome most of them to enhance your own
offloading utility:
1. It downloads only text and image files. This is a minor limitation you
can easily tackle by modifying the fileOfInterest() method of the
ExtendedURL class and the nonTextFile() method of the Downloader
class.
2. It cannot download applets. Applets are specified in HTML files as
<APPLET CODE=Simple.class CODEBASE="/examples"></APPLET>, where
CODE specifies the class file and the CODEBASE specifies the directory in
which this class file is located.

To overcome this limitation, search for <applet in the first phase of parsing,
as you do for <img, <frame, and other tags. During the second and third phases of
parsing, extract the filename and the directory from the CODE and CODEBASE
attributes. The link becomes /examples/Simple.class, which is just like
/images/family.gif.
Note that if both CODE and ARCHIVE are present in the APPLET tag, the filename
should be a jar file, not a class file. For example, <APPLET CODE=Simple.class
ARCHIVE=examples.jar CODEBASE="/examples"></APPLET> should be
/examples/examples.jar, not /examples/Simple.class.
fileOfInterest() should return true for class and jar files. You can modify the
nonTextFile() method of Downloader to return true for class and jar files.

3. It does not handle forms correctly. The form tag has the following
syntax:
<FORM ACTION="http://www.somesite.com/prog/adduser"
METHOD="POST">

If the URL specified with the ACTION attribute is absolute, this utility will work
well. If it is relative, you can handle this limitation easily by replacing the relative
URL with the absolute URL. Obtaining the absolute URL from the relative URL
is simple when you know the base URL:
URL absolute = new URL(baseURL, relative);
String absoluteURL = absolute.toString();

4. It does not download background images. This limitation can become
a major one depending on the context. The background attribute can
be present in some HTML tags. The value of this attribute is the
image to be displayed in the background, which this utility does not
download. To overcome this, search for the background attribute in
some of the HTML tags and then generate a complete URL to
download background images.
5. It does not handle scripts. At the time you modify the links, the utility
has no idea whether the link is a script (executable) or an ordinary
file. A simple guess can tackle most of the situations. If the filename
ends with .pl or .cgi, or if it has no extension, you can assume it is a
script. In that case, replace the relative link with the absolute link.
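The "simple guess" for scripts could be written along these lines. The class ScriptGuessSketch and the method looksLikeScript() are hypothetical, and treating a path that ends in a slash as a directory rather than a script is an added assumption, not something the utility is known to do.

```java
import java.util.Locale;

// Hypothetical heuristic; not part of the article's utility.
final class ScriptGuessSketch {

    // Guesses whether a link's path names a script: .pl, .cgi, or no extension.
    static boolean looksLikeScript(String path) {
        if (path.endsWith("/")) {
            return false;             // assumed: a directory link, not a script
        }
        String leaf = path.substring(path.lastIndexOf('/') + 1).toLowerCase(Locale.ROOT);
        return leaf.endsWith(".pl") || leaf.endsWith(".cgi") || leaf.indexOf('.') < 0;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeScript("/prog/adduser"));
        System.out.println(looksLikeScript("/docs/index.html"));
    }
}
```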

I have left these limitations in place to keep the utility simple. One of
the goals of this article was to show you how easily you can write a useful utility with
commonly used Java classes. If you provide this utility with a nice GUI and tackle some
of its limitations, it can serve as a full-fledged application.
About the author
Rakesh Didwania is a software engineer at Informix, India. Previously, he
worked at Fujitsu-ICIM, India, developing a translator that converts Informix
4GL code to C.
