Computer Science 461Programming Assignment:

Document Retrieval System

 

NOTE: The programming parts of this assignment are worth ~2/3 of the total marks. The other 1/3 is for documentation and discussion.
 
Goal: To provide a set of programs that will allow a user to ask for a list of all documents in a given domain containing specified words or combinations of words. This assignment does not have to be done in C, C++, or Java. Other suggestions: awk or Perl.
 
HINT: Keep it simple. Speed and efficiency count. Be sure to keep the O/S interface in mind. You may not assume that the index will always fit in RAM. Make use of the facilities and utilities provided by the operating system (UNIX). Make sure you are able to justify all of your decisions.

2 Main Parts:
1. programs to build (indices) and update the files
2. programs to do the retrieving
 
Data consists of an undetermined number (could be from 10's to 1000's) of ASCII files in a given directory. A containing directory may be assumed to contain a file which has a directory listing of the text files (called "ListofFiles") as well as the directory which itself contains the actual text files. You may make no further assumptions about the locations or names of the directories or the text files.
Ex: /home/jr/461/DocSys:
    DocData/
    ListofFiles
     
    DocData/
    - contains the actual text files, in this case all end in .txt but this is not required

 




Part 1 A:
Given a directory containing text files, and a file that is a directory listing of the files themselves, write a program that will extract words from each file in the directory and produce a list of word-document pairs.
example:
 
Mary doc1.txt
had doc1.txt
had doc2.txt
 
Part 1 B:
Sort the output produced by Part 1 (note: efficiency counts);
Eliminate duplicates and uninteresting words ('in', 'the', words that appear in all documents, etc.);
Merge all the pairs so each word appears only once in the final list followed by the number of documents it appears in and a list of the documents themselves:
    Mary 1 doc1.txt
    had 2 doc1.txt doc2.txt

 

 
Part 2:
Build a query system that
- uses a reasonably efficient index
- creates an inverted file (each word is associated with a list of documents)
- given a word to search for, shows the document name and the first 4 lines of each document that contains the word (again, efficiency counts)
-incorporate the use of simple Boolean expressions and the operators AND OR NOT to allow for more complicated searches
 
Be sure to keep the user interface as simple as possible. That is not the focus of this assignment. If writing in C/C++ use command line arguments:
looksee {options} <dir_name> <query>
 
Part 3: Analysis and Discussion:
Describe and fully justify your structure(s) :
- Your implementation should make efficient use of disk space and minimize disk access as much as possible. Justify all design decisions regarding data structures used and how they affect efficiency and speed.
- Comment on any special problems encountered as a result of your design and possible solutions or modifications recommended for the next version.
- Include some discussion of file behaviour as it applies to possible fragmentation, wasted space, and expensive disk operations.
- Remember that this application should be able to deal with thousands of files so design your structures and algorithms accordingly.
How is the index implemented?
- diagrams (like UML or some such) are acceptable
 
 
What kind of index search did you implement? What makes it efficient? What are the costs? Why did you choose this one?
- Remember, it must be efficient for a directory or set of directories that contain hundreds or thousands of documents - even when the entire index can't be stored in memory at once.
 
Assuming that the document sets may be updated from time to time, how would you design an update program that can add to the index without having to re-create the entire set?

 




All explanations must be clear and concise or they will not be marked. All code MUST be fully documented and both a programmer's manual and a user's manual must be supplied (can be in same document but must be obviously separate). Programmer documentation must include a description of the structures used, algorithm outlines, and justification for your decisions. If adequate documentation is not supplied your maximum grade will be C.



BONUS:
Make it work for a directory containing sub-directories. Maintain separate indices for each directory but allow the user the option of searching all directories.
 
Write a program that will check the given directory for new files and add to the index of keywords. Explain your design.
 
Suggest an alternate (more efficient) design, both for automatically building and maintaining keywords as well as for doing the searches and displaying the first few lines of the documents.


IF YOU DEVIATE FROM THE SPECS YOU MUST HAVE A GOOD REASON. In other words you may use your own judgment but you must be able to convince your TA that it's really a good idea!

A major portion of the marks for this assignment is allowed for discussion. You need not implement the most efficient methods you know. Give it some thought at the beginning and then do the implementation. Do not re-design your program if you discover inefficiencies. Instead, discuss what is good about your design and what is not. Make some suggestions for how this program could be re-designed to be more efficient. Gross inefficiencies and efforts clearly below a third-year level or expertise are unacceptable.


Back to TopCopyright © Katrin Becker 1998, 1999 Last Modified