A Website For Offline Browsing
First, a brief introduction to URLs (Uniform Resource Locators) would not be out of
place. The general form of a URL is:
protocol://machinename[:port]/filename[#reference].
only a tedious process but also an inconvenient one. The links in the pages
may not be pointing correctly (relative to the location of other pages you are
downloading), or the links might be absolute URLs pointing to the remote
machine (in which case, downloading the page becomes useless). You could
manipulate the links manually, but that would also be a painful process.
This utility lets you download all the pages of a Website in a graceful manner. It follows
these simple steps:
1. It downloads a page and stores all the links inside a vector
2. It loops (or iterates) over all the elements of the vector, repeating
Step 1 and Step 2 recursively
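The two steps above can be sketched as a short recursive traversal. This is only an illustration, not the utility's actual code: the in-memory site map, the class name CrawlSketch, and the helper linksOf() are hypothetical stand-ins for real page downloads, and a set of visited pages is used (rather than a Vector) so that pages linking back to each other do not cause endless recursion.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlSketch {
    // Hypothetical stand-in for "download a page and return the links found in it".
    // A real implementation would fetch the page over HTTP and parse the HTML.
    static List<String> linksOf(Map<String, List<String>> site, String page) {
        return site.getOrDefault(page, List.of());
    }

    // Returns every page reachable from start, in the order it was first visited.
    static Set<String> crawl(Map<String, List<String>> site, String start) {
        Set<String> visited = new LinkedHashSet<>();
        visit(site, start, visited);
        return visited;
    }

    private static void visit(Map<String, List<String>> site, String page, Set<String> visited) {
        if (!visited.add(page)) {
            return; // page was already handled, so stop here
        }
        // Step 1: "download" the page and store its links.
        List<String> links = new ArrayList<>(linksOf(site, page));
        // Step 2: iterate over the links, repeating both steps for each one.
        for (String link : links) {
            visit(site, link, visited);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
            "/index.html", List.of("/a.html", "/b.html"),
            "/a.html", List.of("/b.html", "/index.html"),
            "/b.html", List.of("/a.html"));
        System.out.println(crawl(site, "/index.html")); // prints [/index.html, /a.html, /b.html]
    }
}
```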
If you need to use this utility behind a firewall, the changes should be done in
DownloadSite. See Resources for information on how to access the sites when you are
behind a firewall.
DownloadSite
DownloadSite parses the command line arguments and passes them to the Downloader
class, which does the actual downloading.
Downloader
Downloader is the heart of the utility. This class contains the logic used to
I should explain two private members of this class: private String hostName and
static Vector URLs:
hostName contains the machine name from the first page's URL (the
URL provided at the command line). In any page, you can have two
types of links: absolute and relative. If the link is relative, use this
hostName to retrieve the document. But if the link is absolute, you
must check whether or not the host name in the link is the same as
hostName. If it is, include this link in the list of URLs to be downloaded.
If it isn't, ignore this link. For example, if you are downloading a site,
say www.somesite.com, and one of its pages contains a link to
www.othersite.com, you do not want to download pages from
www.othersite.com.
URLs is the global vector where you keep adding all the pages you
download. When you get a link, check whether or not the link is
already present in URLs. This prevents you from downloading a page
twice. Another common scenario: Often a page a.html can link to
another page b.html, and the page b.html can also have a link to
a.html (from the Back button). static Vector URLs also helps you
avoid falling into such loops.
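The roles of hostName and URLs can be illustrated with a small filter. This is a minimal sketch under assumptions, not the Downloader class itself: the class name LinkFilterSketch and the method shouldDownload() are hypothetical, a HashSet stands in for the static Vector, and two links that differ only in their #reference part are treated as the same page.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class LinkFilterSketch {
    private final String hostName;                     // host of the first page's URL
    private final Set<String> seen = new HashSet<>();  // plays the role of the URLs vector

    LinkFilterSketch(String hostName) {
        this.hostName = hostName;
    }

    // Returns true only if the link points to hostName and has not been seen before.
    boolean shouldDownload(String pageUrl, String link) {
        URL resolved;
        try {
            // Relative links are resolved against the page they appear in;
            // absolute links are taken as they are.
            resolved = new URL(new URL(pageUrl), link);
        } catch (MalformedURLException e) {
            return false; // unparsable link: ignore it
        }
        if (!hostName.equalsIgnoreCase(resolved.getHost())) {
            return false; // link leaves the site being downloaded
        }
        // getFile() excludes the #reference part, so a.html and a.html#top share one key.
        String key = resolved.getProtocol() + "://"
            + resolved.getHost().toLowerCase(Locale.ROOT) + resolved.getFile();
        return seen.add(key); // false when the page is already in the set
    }

    public static void main(String[] args) {
        LinkFilterSketch filter = new LinkFilterSketch("www.somesite.com");
        String page = "http://www.somesite.com/docs/a.html";
        System.out.println(filter.shouldDownload(page, "b.html"));
        System.out.println(filter.shouldDownload(page, "http://www.othersite.com/x.html"));
    }
}
```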
To download text files and binary files, you must have separate methods for each. From
the file extensions, decide whether the file is a binary (image) file or a text file. The
method nonTextFile() returns true if the file is not a text file. For efficiency, call a
different method, downloadNonTextFile(), to download binary files. This function
does not perform any file parsing.
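A check like nonTextFile() could look roughly as follows. The extension list and the class name FileTypeSketch are assumptions made for illustration, and copyBytes() is a hypothetical helper showing how a binary file can be written out unchanged, with no parsing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Locale;
import java.util.Set;

class FileTypeSketch {
    // Assumed list of binary (image) extensions; adjust it to the files you need.
    private static final Set<String> BINARY_EXTENSIONS = Set.of("gif", "jpg", "jpeg", "png");

    // Returns true if the extension marks the file as binary rather than text.
    static boolean nonTextFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension: treated as text in this sketch
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BINARY_EXTENSIONS.contains(extension);
    }

    // Copies a binary stream byte for byte, without looking at its content.
    static void copyBytes(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
    }

    public static void main(String[] args) {
        System.out.println(nonTextFile("image/family.gif")); // prints true
        System.out.println(nonTextFile("enter.html"));       // prints false
    }
}
```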
If the files are text files, you must extract the links and modify them appropriately. If
you wanted only to extract the links, you could use the Swing package to do so (see
Resources).
But for this article, you also want to change the links, so you should parse the files.
First, search for all occurrences of <a, <base, <img, and <frame (irrespective of case)
and store the characters up to the next closing > in a string. For example, in the case
of "Click <a href=enter.html>here</a> to enter," the sequence a href=enter.html
is stored in the string. Extract the URL from this string, but first make sure to take care
of several things:
If the link does not begin with a slash (/), you need not modify the link.
Add the link to the list of links to be downloaded.
If the link ends with a slash (/), modify the link to include index.html and
add the unmodified link to the list.
If the base tag (<base) is present in the document, evaluate the relative URLs
using the base URL. In such cases, first evaluate the filename using the base
URL and the relative links, then modify the link to the destination directory plus
the newly evaluated filename. The unmodified link is added to the list of links to
be downloaded. An important thing to note is that the base tag is removed from
the downloaded file. Here's an example of how the relative links are resolved in
this case:
<html>
<head>
<title>Just an example</title>
<base
href="http://www.somesite.com/docs/someproduct/index.html">
</head>
<body>
.
modifyLink() does the second-phase parsing. It searches for the occurrence of href or
src in the string. The attribute can be written in any of the following forms:
SRC=image/family.gif
SRC =image/family.gif
SRC = image/family.gif
or
Most browsers accept spaces on either side of the equal sign (=) between href or src
and the link. modifyLink() evaluates the index of a link's beginning (in this case, i).
This index, the string, and the length of the string are passed to another method,
processLink(). So, after the second-phase parsing, you have:
image/family.gif
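The first two parsing phases can also be sketched with regular expressions. This is not how modifyLink() is written; it is an alternative illustration, and the class name LinkParseSketch and the methods extractTags() and extractLink() are hypothetical. The patterns accept any letter case, optional spaces around the equal sign, and optional quotes around the link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkParseSketch {
    // Phase 1: the text between "<" and the next ">" for a, base, img, and frame tags.
    private static final Pattern TAG = Pattern.compile(
        "<((?:a|base|img|frame)\\s[^>]*)>", Pattern.CASE_INSENSITIVE);

    // Phase 2: the value of href or src, quoted or unquoted.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "\\b(?:href|src)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the inner text of every matching tag, in document order.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return tags;
    }

    // Returns the link inside one tag's text, or null if it has no href or src.
    static String extractLink(String tagText) {
        Matcher matcher = ATTRIBUTE.matcher(tagText);
        if (!matcher.find()) {
            return null;
        }
        for (int group = 1; group <= 3; group++) {
            if (matcher.group(group) != null) {
                return matcher.group(group);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "Click <a href=enter.html>here</a> <IMG SRC = image/family.gif>";
        for (String tag : extractTags(html)) {
            System.out.println(tag + " -> " + extractLink(tag));
        }
    }
}
```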
processLink() performs the final phase of parsing. It finds out what is at the end of the
link. Now you have image/family.gif. Depending on the link, processLink()
modifies the link and writes the modified link to the file. The unmodified link is
returned to downloadAndFillVector(), which adds the link to a vector.
downloadAndFillVector() does one more thing. If the base tag (<base) is present,
downloadAndFillVector() extracts the URL from the base tag and assigns it to
baseTagURL, which is a private member of Downloader. You use the baseTagURL to get
the actual file paths in case the links are relative. downloadAndFillVector() calls the
setBaseTagURL() method to extract the base URL from the base string, parsing the
URLlist
URLlist is a simple class. It receives the base URL, which is either the URL specified
in the base tag or the URL used to retrieve the document, plus the list of links in the
page to be downloaded. From this, it generates a list of URLs and returns the list to the
Downloader class. It also adds the link to the global vector URLs. This class contains the
functionality that prevents a link from being downloaded twice.
ExtendedURL
The URL class provides a method getFile() that returns the filename (anything after
the machine name up to # or the end of the string). You need a way to get the directory
and the filename, and you need the directory name to maintain the same file structure.
URL, being a final class, cannot be extended, so you can use composition. (There are
basically two approaches in object-oriented programming to achieve code reusability:
inheritance and composition. In composition, you use the existing class as a member of
the new class, which is composed of the already existing class along with other
members. For more information, see Resources.) ExtendedURL has a member field of
type URL. The ExtendedURL class provides methods getDirectory() and getFile(),
which return the directory and the filename, respectively.
First, obtain the filename (directory plus the optional filename) using the getFile()
method of the URL class. Then search for the question mark (?) in the filename. A
question mark indicates that the query string is appended to the filename and that the
file is a script. You can't download scripts, so both directory and file are set to null. This
URL is not added to the list of URLs.
If the filename ends with /, the directory is set to the filename and the file is set to null.
In other cases, you can extract the directory and file from the filename by performing
some simple calculations.
One more thing determines the setting of the directory and file: the type of the file.
Consider these two URLs:
http://www.somesite.com/work/xyz.html
http://www.somesite.com/work/abc.mp3
In the first example, the directory is set to /work and the file is set to xyz.html. In the
second example, the file is abc.mp3; however, since here you are not interested in MP3
files, set the filename and directory to null.
Use the method fileOfInterest(), which returns true for the files that interest you.
You can add MP3 files in this method, so this utility can download MP3 files as well.
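The rules above can be put together in a small composition-based class. This is a sketch under stated assumptions rather than the article's ExtendedURL: the class name ExtendedUrlSketch, the factory method of(), and the extension list in fileOfInterest() are hypothetical, and files directly under the root are given the directory "/".

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;
import java.util.Set;

class ExtendedUrlSketch {
    // Assumed list of interesting extensions; add "mp3" here to download MP3 files too.
    private static final Set<String> INTERESTING =
        Set.of("html", "htm", "txt", "gif", "jpg", "jpeg");

    private final URL url;     // composition: URL is final, so it is wrapped, not extended
    private String directory;  // stays null for scripts and uninteresting files
    private String file;       // stays null for directories, scripts, uninteresting files

    ExtendedUrlSketch(URL url) {
        this.url = url;
        split(url.getFile());
    }

    // Convenience factory that turns the checked exception into an unchecked one.
    static ExtendedUrlSketch of(String spec) {
        try {
            return new ExtendedUrlSketch(new URL(spec));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }

    private void split(String name) {
        if (name.indexOf('?') >= 0) return;   // query string: a script, skip it
        if (name.isEmpty()) name = "/";       // bare host name means the root directory
        if (name.endsWith("/")) {             // directory only, no file
            directory = name;
            return;
        }
        int slash = name.lastIndexOf('/');
        String candidate = name.substring(slash + 1);
        if (!fileOfInterest(candidate)) return;
        directory = slash <= 0 ? "/" : name.substring(0, slash);
        file = candidate;
    }

    // Returns true for file types this sketch chooses to download.
    static boolean fileOfInterest(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) return false;            // assumption: no extension, not of interest
        return INTERESTING.contains(fileName.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    String getDirectory() { return directory; }
    String getFile() { return file; }
    URL getURL() { return url; }
}
```

For http://www.somesite.com/work/xyz.html, getDirectory() returns /work and getFile() returns xyz.html; for the MP3 URL both return null.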
To understand this utility further, you can browse through the source code in Resources.
This utility, however, has some limitations.
Limitations of the utility
The following list reveals the limitations of the utility described in this
article. You can easily overcome most of them to enhance your own
offloading utility:
1. It downloads only text and image files. This is a minor limitation you
can easily tackle by modifying the fileOfInterest() method of the
ExtendedURL class and the nonTextFile() method of the Downloader
class.
2. It cannot download applets. Applets are specified in HTML files as
<APPLET CODE=Simple.class CODEBASE="/examples"></APPLET>, where
CODE specifies the class file and the CODEBASE specifies the directory in
which this class file is located.
To overcome this limitation, search for <applet in the first phase of parsing
similar to <img, <frame, and other tags. During the second and third phase of
parsing, extract the filename and the directory from the CODE and CODEBASE
attributes. The link becomes /examples/Simple.class, which is just like
/images/family.gif.
Note, if both CODE and ARCHIVE are present in the APPLET tag, the filename
should be a jar file, not a class file. For example, <APPLET CODE=Simple.class
ARCHIVE=examples.jar CODEBASE="/examples"></APPLET> should be
/examples/examples.jar, not /examples/Simple.class.
fileOfInterest() should return true for class and jar files. You can modify the
nonTextFile() method of Downloader to return true for class and jar files.
3. It does not handle forms correctly. The form tag has the following
syntax:
<FORM ACTION="http://www.somesite.com/prog/adduser"
METHOD="POST">
If the URL specified with the ACTION attribute is absolute, this utility will work
well. If it is relative, you can handle this limitation easily by replacing the relative
URL with the absolute URL. Obtaining the absolute URL from the relative URL
is simple when you know the base URL:
URL absolute = new URL(baseURL, relative);
String absoluteURL = absolute.toString();
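The same two-argument URL constructor also covers the relative links evaluated against a base tag earlier in the article. The following runnable sketch wraps it in a helper; the class name ResolveSketch and the method resolve() are hypothetical, and the relative links are made-up examples resolved against the base URL used in the base tag example.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ResolveSketch {
    // Resolves a relative (or absolute) link against a base URL.
    static String resolve(String baseURL, String relative) {
        try {
            URL absolute = new URL(new URL(baseURL), relative);
            return absolute.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + relative, e);
        }
    }

    public static void main(String[] args) {
        String base = "http://www.somesite.com/docs/someproduct/index.html";
        // Link relative to the base document's directory.
        System.out.println(resolve(base, "images/logo.gif"));
        // Link that starts with a slash is relative to the host root.
        System.out.println(resolve(base, "/prog/adduser"));
        // Absolute link is returned unchanged.
        System.out.println(resolve(base, "http://www.othersite.com/x.html"));
    }
}
```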
I left these limitations in place in this article to keep the utility simple. One of
the goals of this article was to show you how easily you can write a useful utility with
commonly used Java classes. If you provide this utility with a nice GUI and tackle some
of its limitations, it can serve as a full-fledged application.
About the author
Rakesh Didwania is a software engineer at Informix, India. Previously, he
worked at Fujitsu-ICIM, India, developing a translator that converts Informix
4GL code to C.