Copyright © 2002 Katrin Becker 1998-2002 Last Modified September 12, 2000 04:16 PM

Cpsc 461 Lab Set For:

Client-Side Search Engine 


[this set not finished yet]

sample webpage form

A Possible Plan of Attack.....
There are essentially 2 parts to this assignment:
1. Building a data file on the Server-Side that can be downloaded to the client and searched.
2. Writing an 'applet' that can be attached to a web page and that will, when executed read the datafile from the server, search it, and build a list of links that match the 'query' and then finally display it.

In order to accomplish part 1 you must first examine the requirements for part 2. You need to get at least a rough idea of what kind of file the search program(s) will need or can work with and how it will have to proceed. So...

Basics of the Web:

What is a browser anyway? What happens when you click on a link from a web browser? [the source page, images, embedded scripts, external scripts/applets, etc.]

A browser is essentially an application that can fetch a file from any URL and transfer it to your local machine (the one running the browser). It has the ability to read and display several types of files, including '.html', '.htm', and '.shtml'.
Note: HTML is not a programming language. It is often referred to as a language in the same sense that PostScript is a language. It is a special purpose language used to describe a page (document). The various parts of the language do everything from specifying the font and style to use to describing various images and other files that are to be embedded within the given document, and, these days, it can even be used to change a page that has already been displayed based on user (viewer) actions as well as other criteria.
When you click on a link, the browser (which is running on your local machine - a.k.a. the "client") tracks down the original file (according to its URL), connects to the machine that owns the file (this is the "server", which could be anywhere in the world plus a few places in space) and copies (downloads) it from the server into a local directory (some sort of "cache"). Once it has a copy of the "document", your browser parses the file according to its suffix (let's stick with '.html' for now). The browser displays the document as specified in the HTML description. Any references to external files (like images) are resolved (they also have to be copied from the server) and then they too are displayed.

Client vs Server

Who runs where? When you are surfing the web, you are the client, and the machine that holds each page you click through to is the server for the time that you are 'visiting' the page. When you create your own website, the machine that holds the files comprising your website is the server and anyone else using the web who clicks on one of your pages is then the client.

Who does what? In the old days things were pretty clear but these days with Java Applets, Dynamic HTML, and various scripting languages (like JavaScript and PHP) it gets a little fuzzy. Often it is the developer of the page that must decide what will be done where. Little things like "mouseovers" (when an image changes or something else happens every time your mouse passes over something on the page) are usually done entirely on the client's side. Other things, like most search engines, run on the server's side. The search request is collected on the client's side and sent back to the server, which then processes the request and sends back a page to display which contains the results.

Basics of HTML:
(note: it will be sufficient to prepare a solution that will only run under Netscape)
tags, anchors, forms, links, images, browsers

In an HTML page, various pre-defined objects are created and associated with each page. For example, there exists an array of image objects. Each image has an entry in this array along with various attributes that serve to describe it, such as: src (the source of the image - the file where it lives); height (how many pixels high the image is supposed to be when displayed), etc.
There is also a list for each anchor (an anchor is a part on the page we can reference), link, and form. Most objects can be referenced through their arrays by number (e.g. document.images[4] in JavaScript) - they are stored in the same order in which the HTML parser comes across their references in the HTML source (which is not always the order in which they appear on the visible page). The objects can also be given names (<img src="x.gif" name="thisone"> ) so we can then reference the image by name ( document.images["thisone"] ).

A note on notation: The values assigned to attributes must be enclosed in quotes. If you have something inside that 'value' that also needs quotes, then you must alternate between double and single quotes:
(<img src="x.gif" name="thisone" onClick=" function_call( 'arg-to-function' ) "> )

Dynamic HTML is an extension to 'regular' HTML that allows parts of the page to be changed after that page has already been loaded and displayed. It defines numerous "events" that you can detect and act upon.

HTML objects, attributes, event handlers (like onClick and reset)

The Form Page:
Look over the form page provided with this assignment and make sure you know where the user input goes and how to 'launch' your applet.

Calling Javascript from a Webpage:
There are basically 2 ways to run JavaScript from a web page. Note that everything in the HTML HEAD section is parsed before the BODY, so functions defined there are available by the time the BODY uses them.
1. define the code in the <HEAD> portion of the web page and then call the functions as necessary.

<HEAD>
<script language="JavaScript1.2" >
<!-- // the entire javascript code portion is embedded inside
        // an HTML comment that will keep the HTML parser from getting
        // confused by the javascript
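var N = 0; // defines a global variable called 'N'
function functionX( Var1, Var2 ) { // JavaScript declares functions with 'function'; it has no 'void' return type
// code goes here
}
// --> (the leading '//' keeps the JavaScript parser from choking on the end of the HTML comment)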
</script>
</HEAD>

2. put the code in a separate file (here "images.js") and then simply refer to it in the HEAD section
<HEAD>
<script language="JavaScript1.2" src="images.js"></script>
</HEAD>

Whichever way the code is defined, once that's been done, it can be used in the BODY portion like this:
<a href="http://www.cpsc.ucalgary.ca" target="parent" class="linkstyle"
onmouseover="linkOn('level1_0')"
onmouseout="linkOff('level1_0')" name="link-0">
<b><img src="../Resources/Clipart/pr_pin.gif" name="level1_0"
border="0" naturalsizeflag="2" height="26" width="26" align="CENTER"></b>
<font face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular" color="yellow">
<nobr><b> Cpsc Home</b></nobr></font></a>


Calling Java From a Web-Page:
Applet vs. Application: an application is a stand-alone program while an applet is designed to be run from a web-page. The applet has no "main" function or method; it is a class definition derived from Applet and is instantiated automatically when the page that refers to it is loaded into the browser.

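It is referenced in an HTML document like this (the WIDTH and HEIGHT values here are placeholders; the APPLET tag requires both):
<APPLET CODE="MyApp.class"
      CODEBASE="http://www.cpsc.ucalgary.ca/~becker/461/Asst/SearchEngine"
      WIDTH="400" HEIGHT="300">
      <PARAM NAME="arg1" VALUE="something">
</APPLET>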

sample web-page
sample java applet

For simplicity, make sure that the HTML page and the byte-code version (*.class) of the applet, as well as any data files you intend to read, are all in the same directory and that all files are world-readable (make the *.class file executable as well). Also make sure that the name of the *.class file is the same as the name of the Applet class it contains.

Meanwhile back at the host (the server), you will have written the Java source for your applet in a file called something.java. Use the Java "compiler" to convert it into byte-code:
            csb% javac something.java
This will generate the something.class file. This is the one that needs to go in the directory with your web-page.

Testing & Development:
When you are creating your java applet or messing with javascript you will probably want to keep one or 2 emacs windows open as well as a Netscape window open. Load the HTML page into Netscape (either by URL or as a local file) and when you want to test changes simply reload the page. Netscape will cache as much of the page as it can and normally when you click on the button to re-load the page, only the HTML document will actually be re-loaded. External stuff, like images, applets, and the data files will probably NOT be re-loaded so make sure you force Netscape to re-load all of them by holding the SHIFT button while clicking RE-LOAD.

Getting the Input:

Reading the Data File:
JavaScript is incapable of reading from or writing to a file (why?), so the search routine will have to be done in Java. The data file itself will most likely be a binary file (ASCII wastes too much space). In order for Java to be able to open and read the file, the data file must have the suffix '.bin'.

For info on reading directly from a URL click here. (thanks to SUN)
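
As a rough illustration of reading a binary data file from a URL in Java, here is a minimal sketch. The record layout (a word count, then each word followed by its list of URLs) and the names DataFileSketch, read and readFrom are assumptions made up for this example; your own file format will almost certainly differ.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DataFileSketch {
    // Hypothetical layout (an assumption, not the assignment's format):
    //   int wordCount, then for each word:
    //   UTF word, int urlCount, urlCount UTF strings.
    static Map<String, List<String>> read(InputStream raw) throws IOException {
        Map<String, List<String>> index = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(raw)) {
            int wordCount = in.readInt();
            for (int i = 0; i < wordCount; i++) {
                String word = in.readUTF();
                int urlCount = in.readInt();
                List<String> urls = new ArrayList<>();
                for (int j = 0; j < urlCount; j++) {
                    urls.add(in.readUTF());
                }
                index.put(word, urls);
            }
        }
        return index;
    }

    // Reads the data file straight from its URL on the server.
    static Map<String, List<String>> readFrom(URL dataFile) throws IOException {
        return read(dataFile.openStream());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java DataFileSketch <url-of-data-file>");
            return;
        }
        System.out.println(readFrom(new URL(args[0])));
    }
}
```

Keeping the parsing in read(InputStream) separate from the URL handling makes it possible to test the format with an in-memory stream before putting anything on the server.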

Displaying the Results:
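
The overview above asks the applet to search the data and build a list of links that match the query. One possible way to do that is sketched below, assuming the data file has already been loaded into a map from word to list of URLs. The class name ResultsSketch, the "every word must match" rule, and the choice of an HTML list as output are all assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ResultsSketch {
    // Returns the URLs that contain every word of the query
    // (AND semantics; this is an assumption, not a requirement).
    static List<String> search(Map<String, List<String>> index, String query) {
        Set<String> hits = null;
        for (String word : query.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            List<String> urls = index.getOrDefault(word, Collections.emptyList());
            if (hits == null) {
                hits = new TreeSet<>(urls);   // first word: start with its URLs
            } else {
                hits.retainAll(urls);         // later words: keep only common URLs
            }
        }
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // Formats the matches as an HTML list of links.
    // Note: the URLs are not escaped here; real code should do that.
    static String toHtml(List<String> urls) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (String url : urls) {
            html.append("<li><a href=\"").append(url).append("\">")
                .append(url).append("</a></li>\n");
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new HashMap<>();
        index.put("search", Arrays.asList("a.html", "b.html"));
        index.put("engine", Arrays.asList("a.html"));
        System.out.print(toHtml(search(index, "search engine")));
    }
}
```

The query is lower-cased before lookup, so this only works if the words in the data file were lower-cased when the file was built.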


Once you have a reasonable idea about how the search engine will have to work you will be in a position to look at the design of the data file. It should be as small as you can make it but still be reasonably efficient to work with. The utilities/language(s) you choose for implementing the data file generator are constrained only by what will run on our SUN systems.

Some suggestions:
awk (designed for text manipulation)
UNIX shell (has a nice sort and various other handy utilities)
Perl (has a set of modules designed to provide an interface to parse HTML documents - handy, huh?)

All of these have utilities for recognizing regular expressions (which you should learn something about).

One way or another you will probably end up having to go through the following steps:
1. walk through the directory tree and gather a list of words found in HTML documents along with some kind of reference to where these words were found (looking in other files like images, java source, etc. is unnecessary)
2. sort this list
3. pull out duplicates
4. condense the list so that each word appears only once followed by a list of URLs that contain the word (the word becomes your index)
5. get rid of uninteresting words (so, what is an uninteresting word?)
6. compress this final list so it takes up as little space as possible.
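
Steps 1 through 5 above can be sketched roughly as follows. This is only an illustration in Java (awk, the shell or Perl would do just as well): the class name IndexSketch, the tiny stop-word list, and the crude regular expression used to strip tags are all assumptions, and the pages are passed in as a map from URL to HTML text instead of being gathered by walking a directory tree. Step 6 (compression) is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IndexSketch {
    // Hypothetical list of "uninteresting" words; deciding what really
    // belongs here is part of the exercise.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    // Maps each interesting word to the set of URLs whose text contains it.
    // TreeMap/TreeSet keep words and URLs sorted and free of duplicates.
    static Map<String, Set<String>> build(Map<String, String> pages) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            // Crude tag removal; a real HTML parser would be more robust.
            String text = page.getValue().replaceAll("<[^>]*>", " ").toLowerCase();
            for (String word : text.split("[^a-z0-9]+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("a.html", "<html><b>The Search</b> engine</html>");
        pages.put("b.html", "<p>search and index</p>");
        for (Map.Entry<String, Set<String>> e : build(pages).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Using sorted collections folds the "sort", "pull out duplicates" and "condense" steps into one pass; with UNIX tools the same effect would come from separate sort and uniq stages.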