CPSC 461 Information Structures III
Research Assignment

Copyright (C) 2003 Katrin Becker Last Modified June 8, 2003 10:29 PM
Educational Object Repositories


Specifications
PART 2-A / PART 2-B
The "Work"
Specifications PART 3
The Post-Mortem
Assignment Links:

Reference Links:
Conducting a research project:
Related Paper: Proceedings of the eleventh international conference on Information and knowledge management 2002 , McLean, Virginia, USA
A compact and efficient image retrieval approach based on border/interior pixel classification
Authors Renato O. Stehling University of Campinas, Brazil, Mario A. Nascimento University of Alberta, Canada, Alexandre X. Falcão University of Campinas, Brazil
Comment: Note the references - many of these papers also relate to problems we are trying to solve.



Background

About This Assignment:

This assignment does not target any specific topic in 461. Instead, it draws upon various components to help solve a more complex problem. It is hoped that this makes it more interesting and relevant than, say, "Implement a varying-length record file".

What is an Educational Object Repository?
A collection of educational materials (text files, data files, music, movies, animations, images, etc.).
Media (the files) are of mixed types and formats.
There may be various source locations (places where the files reside).
This repository must be: structured, organized, searchable.
Why are we looking at it? (What has it got to do with CPSC 461?)
An example of a large collection of various files. The files may reside in many locations, but the "index" must connect them all.
What else is this good for? (Why are we doing this?)
The "searchable" part of the requirement is largely uncharted territory. Considerable work has gone into ways to annotate various files & media, but there are many problems associated with annotation. A more flexible and ultimately automatic approach would be to permit searching by other means. This is what we are going to explore here.
What are we doing with it? (What is this assignment)
We will propose and explore ways to search a collection of files that do not rely exclusively upon text annotations.
Goals, Skills, & Concepts
Some of the things you may get out of this assignment:
  1. Writing a proposal for some task you wish to carry out.
  2. Evaluating a proposal.
  3. How to design & scope a large-scale problem.
  4. Planning & executing the experimental process.
  5. Proceding through the analysis of results
  6. The benefits of reflective inquiry.
Description
There are two main categories of project that may be pursued for this assignment:
1. those WITH known solutions, and
2. those WITHOUT known solutions.
Each requires a somewhat different approach, and each will be judged using different criteria. You may choose which kind of problem you would like to explore.
Possible Projects:
MORE Structured / Known Solutions

A project related to these topics has a structured approach to a solution. Solutions to these problems exist (although it is certainly possible you may find a new one that no-one else has thought of before.) NOTE: This does not mean these projects will be easy.

LESS Structured / UN-Known Solutions

A project related to these topics will require you to design your own approach. Solutions to these problems often don't exist, and it is possible that your proposed solution will turn out to be one that will not achieve your goal. How you get there and what you learn from it are more important here than finding the right answer.

Design and test a prototype for an index structure. Choose one category of file (image; audio; animation; video) and explore searching mechanisms that does not involve the use of "tags" or annotation.
Explore searching mechanisms for the text documents in the collection. Examine methods for the structuring and organization of distributed data sets.


Approach, Hints & HELP
HINTS, CONSTRAINTS and CONSIDERATIONS:
FOR BOTH
Index Component.
Object Searching Component.
Keep it simple.
Get a basic "index" working first. Then expand on it and refine it.
KEEP NOTES on what works (and why); what doesn't (and why); suggest 'theories' if you don't know for sure. This will provide material for your post- mortem.
Keep it to the point. Don't spend too much energy on a pretty user interface - that is not the point of this assignment. The organization of the information you have in a manner that facillitates retrieval and manipulation is.
You will need to decide on a quantitative measure for your object. Start with just 1 or 2 measures for each object. Matches will be those objects that have values within a specified range of those measures. If these values are not producing the desired results, try adding others. It is far more likely that some combination of measures will yield usable results than that you will find a single, 'perfect' measure.
Speed and efficiency are very important.
Important "ilities" include:
- reliability
- efficiency (both time & space)
- maintainability
- useability (this system is to be used by people who may not have a great deal of computer experience)

- testability (provide a way to verify that your design is worthwhile)
- portability (must be useable from various systems)

- expandibility (can we add objects? can we add new kinds of objects)
- flexibility
This is NOT a simple problem. Don't expect to find the right matches easily.
Make sure you are able to justify all of your decisions. A solution that fails to produce a sound organization of objects, but shows evidence of sound planning and robust, considered design will get a better grade than a more successful index whose results are unexplained. A solution that fails to find the right kinds of objects, but shows evidence of proper testing and explanation will get a better grade than a more successful engine whose results are unexplained.
This assignment is NOT just about the code. It's about designing a well-thought out solution to a difficult problem; implementing it; improving it; and finally analyzing your results.
It is important to note that the usual requirement in CPSC that submitted code must work in order to pass still holds true here. The code must run and work correctly. The results may or may not be useable, but the code must work well enough to allow you to determine the viability of the hypothesis.
VERY IMPORTANT!!! You may make use of existing code found on the net or elsewhere but you MAY NOT use any code "wholesale" (i.e. in its entirely, w/o modification, and you MUST acknowledge your sources). Failure to follow this requirement constitutes cheating and such a submission will be given an 'F', along with a letter of explanation to the Department Head and Dean of Science. NO EXCEPTIONS.

Part 1: Outline/Design: [the requirements for part 1 are the same for both approaches - see "Writing a Proposal"]

- submit a 2-5 page report (professional appearance is required) outlining your intended design
- explain what attributes you intend to measure (with examples), and why you think they will be successful
- all design decisions must be justified

Part 2 - Version 1: The Index / Search Engine:
2 Main Components:
1. Program(s) to build (indices) and update the database file.
2. Program(s) to do the retrieving - these must be designed so they could be connected to a web page (your prototype need not run from the web)
The Generator
This component may consist of multiple programs and utilities (it need not be a single program). If so, all parts must be "wrapped" in a shell program so the system's web master can use it by simply running a single script.
One possible approach:
First:Given a directory containing objects, write a program or utility that will extract summary data from each object in the directory and produce a list of object data-web link pairs (relative pathnames are permissible but not required). Since you are creating one part of a much larger system, the summary data can be something contrived. You should prove your approach is viable by extracting "something" (some bytes of text) from at least two different types of objects (such as .html files and .gif files).
Next: Sort the output produced by Part 1 (note: efficiency counts);
Eliminate duplicates. Merge all the pairs so each object appears only once in the final list followed by the number of pages it appears in and a list of the URLs for the pages themselves.
Then: Design a file structure that will allow for an efficient search when necessary but still leave us with only a single, relatively small file. Explain and Justify your choices.

The Search Engine:
Build a simple query system that
- uses the file created above
- can potentially be executed from a web-page (i.e. input comes from a web form, the 'program' will be automatically run when the user submits a query, and the results will be displayed in one or more html pages)
- given a sample object to search for, shows the URL for each object deemed to be a match (again, efficiency counts)
 
If you are creating your own form for the search, be sure to keep the user interface as simple as possible. That is not the focus of this assignment.
Make sure you explain and justify any assumptions you are making about the other parts of the system (the ones you are not writing).
Part 2 - Version 2: The Object Match (Suggestion: choose one KIND of object only):
The match is driven by: "Here is an example of what I am looking for. Find me more like this."

2 Main Components:
1. Program(s) to read objects and gather data.
2. Program(s) to do the matching - these must have a simple, but effective user interface
The Collector :
This component may consist of multiple programs and utilities (it need not be a single program). If so, all parts must be "wrapped" in a shell program so the system's web master can use it by simply running a single script. One possible approach:
First:Given a directory containing objects, write a program or utility that will extract summary data from each object (or the right kind) in the directory and produce a file of object data-pathname pairs (relative pathnames are permissible but not required). This time the focus is on the summary data itself. This is where you should focus most of your efforts. The code used to find or collect the files from which you are extracting data is secondary. Several directories containing various objects will be provided. If you have chosen an object for which no sample data has been provided, you will have to collect it yourself. A minimum of 50 items is required.
Then: Design a file structure that will allow for an efficient search when necessary but still leave us with only a single, relatively small file. Part of the goal is to create a system where the eventual search will not have to look at each object each time. Rather, we will build a 'data base' of information on the objects we have and then when we have a new object to search for, we will generate the summary data on this one object, and then use that in our search. Explain and Justify your choices.
The Object Match:
Build a simple query system that
- uses the file created above
- given a sample object to search for, shows the URL or filename of each object deemed to be a match (again, efficiency counts) including any applicable summary analysis

Part 3: Analysis and Discussion [The Post-mortem]:

[Note that this is NOT a complete list, merely a few suggested questions to consider]

- How successful was your design?
- What could have been improved?
- How did your design change from your original, and why?
- If you had it to do again, what would you change? (and why, always why).
- What have you learned from this?

Describe and fully justify your structure(s) :
- Your implementation should make efficient use of space and minimize downloading time and searching as much as possible. Justify all design decisions regarding data structures used and how they affect efficiency and speed.
- Comment on any special problems encountered as a result of your design and possible solutions or modifications recommended for the next version.
- Remember that this application should be able to deal with thousands of objects so design your structures and algorithms accordingly.
How is the index implemented?
- diagrams (like UML or some such) are acceptable
 
 
What kind of index search did you implement? What makes it efficient? What are the costs? Why did you choose this one?
- Remember, it must be efficient for a directory or set of directories that contain hundreds or thousands of objects.
 
Assuming that the repository may be updated from time to time, how would you design an update program that can add to the index without having to re-create the entire set?

All explanations must be clear and concise or they will not be marked. All code MUST be fully documented and both a programmer's manual and a user's manual must be supplied as appropriate (these can be in same physical document but must be obviously separate). Programmer documentation must include a description of the structures used, algorithm outlines, and justification for your decisions.
If adequate documentation is not supplied your maximum grade will be C.



IF YOU DEVIATE FROM THE SPECS YOU MUST HAVE A GOOD REASON. In other words you may use your own judgment but you must be able to convince your TA that it's really a good idea!

A major portion of the marks for this assignment is allowed for discussion. You need not implement the most efficient methods you know. Give it some thought at the beginning and then do the implementation. Do not redesign your program if you discover inefficiencies. Instead, discuss what is good about your design and what is not. Make some suggestions for how this program could be redesigned to be more efficient. Gross inefficiencies and efforts clearly below a third-year level or expertise are unacceptable.

Testing & Marking

One requirement for each part of this assignment will be to grade it yourself and to submit your assessment along with your assignment. See self-assessment for more information.

Requirements:

See marking Rubrics.

Final Words

The rules for research are absolute: no compromises, no evasions, no shortcuts, no excuses, and no saving face.
( Locke, Spirduso, and Silverman, 2000, pp 25)


Copyright (C) 2003 Katrin Becker 1998-2002 Last Modified June 8, 2003 10:29 PM