Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 10, 2000 04:56 PM

Programming Assignment:

Client-Side Search Engine 


NOTE: The programming parts of this assignment are worth ~2/3 of the total marks. The other 1/3 is for documentation and discussion.
  
You've been asked to create a simple and efficient 'database' to support a search engine for a website. The website will exist completely in one directory but may contain many sub-directories. Here's the catch: due to security concerns a typical server-side (CGI) search engine is not an option. You must implement a client-side search engine. Most  Server-Side Search-Engines gather queries made in a form at the client's side and then transfer them to a program that accesses a database of some sort stored on the server's side. In contrast, the client-side search engine will download both the 'database' and the program used to search it to the client's machine and do the search from there.
HINTS, CONSTRAINTS and CONSIDERATIONS:

2 Main Components:
1. program(s) to build (indices) and update the database file.
2. program(s) to do the retrieving - these must be connected to a webpage
 
Data consists of an undetermined number (could be from 10's to 1000's) of files in a given directory tree. If you allow the data file to be generated in an arbitrary directory you should be able to test it on any non-distributed website there is. To keep it simple, it must be testable on any website in the www.cpsc.ucalgary.ca domain. If you have your own website, try that. If not, you may use any of mine: ~becker/461 ~becker/461old ~becker/231 ~becker/233. Notice each has a 'main' directory and numerous sub-directories. The generator program must include all pages in all sub-directories. If time permits, a sample website will be placed in the 461 directory in the directory called WWW
A sample webpage form will be provided. This page (or one like it) must reside in a directory that contains the data file to be searched. You may make your own if you choose. It must be placed in a directory called www inside your home directory. You must set access to 755 for all directories and executable files and 644 for all ASCII files. Under normal circumstances, the data file and the form (which also contains the search programs in some way) will reside on the host machine and both will be downloaded to the client when they want to do a search. For the purposes of this assignment, all that needs to reside in your own "website" is the page containing the form and search program and a file containing the "database". Since the search only uses the data file ("database") we can pretend the rest is there even when it's not.


Part 1: outline/design
- submit a 2-5 page report (professional appearance is required) outlining your intended design both for the generator and the search engine
- describe the format of the data file and explain why you think this is a worthwhile design (for example: how does it address the constraints listed)


Part 2 the generator
This component may consist of multiple programs and utilities (it need not be a singlre program). If so, all parts must be "wrapped" in a shell program so the system's webmaster can use it by simply running a single script. One possible approach:
A:
Given a directory containing web files, write a program or utility that will extract words from each file in the directory and produce a list of word-link pairs (relative pathnames are permissible but not required):
example: 
Mary link1.html
had link1.html
had http://www.mysite.cpm/page2.html
 
Part 2 B:
Sort the output produced by Part 1 (note: efficiency counts);
Eliminate duplicates and uninteresting words ('in', 'the', possibly words that appear in all documents, html tags etc.);
Merge all the pairs so each word appears only once in the final list followed by the number of pages it appears in and a list of the URLs for the pages themselves:
Mary 1 link1.html
had 2 link1.html http://www.mysite.cpm/page2.html
Part 2 C:
Design a file structure that will allow for an efficient search when necessary but still leave us with only a single, relatively small file. Explain and Justify your choices.

Part 3: The Search Engine:
Build a simple query system that
- uses the file created above
- can be executed from a web-page (i.e. input comes from a web form, the 'program' will be automatically run when the user submits a query, and the results will be displayed in one or more html pages)
- given a word to search for, shows the URL for each file containing that word (again, efficiency counts)
-incorporate the use of simple Boolean expressions and the operators AND, OR, NOT to allow for more complicated searches
 
If you are creating your own form for the search, be sure to keep the user interface as simple as possible. That is not the focus of this assignment.
 
Part 4: Analysis and Discussion:
Describe and fully justify your structure(s) :
- Your implementation should make efficient use of space and minimize downloading time and searching as much as possible. Justify all design decisions regarding data structures used and how they affect efficiency and speed.
- Comment on any special problems encountered as a result of your design and possible solutions or modifications recommended for the next version.
- Remember that this application should be able to deal with thousands of files so design your structures and algorithms accordingly.
How is the index implemented?
- diagrams (like UML or some such) are acceptable
 
 
What kind of index search did you implement? What makes it efficient? What are the costs? Why did you choose this one?
- Remember, it must be efficient for a directory or set of directories that contain hundreds or thousands of documents.
 
Assuming that the website may be updated from time to time, how would you design an update program that can add to the index without having to re-create the entire set?



All explanations must be clear and concise or they will not be marked. All code MUST be fully documented and both a programmer's manual and a user's manual must be supplied (can be in same document but must be obviously separate). Programmer documentation must include a description of the structures used, algorithm outlines, and justification for your decisions.
If adequate documentation is not supplied your maximum grade will be C.



BONUS:
Write the generator so that it skips text found in html tags.
 
Write a program that will check the given directory for new or changed files and add to the index of keywords. Explain your design.
 
Suggest an alternate (more efficient) design, both for automatically building and maintaining keywords as well as for doing the searches and displaying the 'hits'.


IF YOU DEVIATE FROM THE SPECS YOU MUST HAVE A GOOD REASON. In other words you may use your own judgment but you must be able to convince your TA that it's really a good idea!

A major portion of the marks for this assignment is allowed for discussion. You need not implement the most efficient methods you know. Give it some thought at the beginning and then do the implementation. Do not redesign your program if you discover inefficiencies. Instead, discuss what is good about your design and what is not. Make some suggestions for how this program could be redesigned to be more efficient. Gross inefficiencies and efforts clearly below a third-year level or expertise are unacceptable.
Copyright © 2002 Katrin Becker 1998-2002 Last Modified August 10, 2000 04:56 PM