Copyright © 2002 Katrin Becker 1998-2002 Last Modified May 16, 2003 11:35 PM

Programming Assignment:

Image Search Engine 


About This Assignment:

This assingment does not target any specific topic in 461. Instead, it draws upon various components to help solve a more complex problem. It is hoped that this makes it more interesting and relevant than, say, "Implement a B-Tree".


NOTE: The programming parts of this assignment are worth ~1/2 of the total marks. The other 1/2 is for design and discussion.


Scenario: I'm doing a project which involves the creation of a few web pages. I need to find some images. I have a few that I like but I want a few more that are similar. This assignment is to devise a "Search Engine" for images. For example, if I have a picture of some mountains, I want this "search engine" to be able to find more pictures of mountains.


The basic idea is as follows:

  1. Load the search page in my browser (let's say Netscape, for simplicity)
  2. Type in the URL (or local file name) of one (or more) images
  3. Your search engine will examine all available images in that site and search for one's similar to the one first given. [Note: This is a trick - see later for explanation.]
  4. Your search engine will provide a report of images examined, and display those images classified as "matches" along with some analysis data indicating why these images are considered as matches.

Before We Begin:

There are two main approaches to the solution of this assignment:
1. You can solve the Search Engine Component.
The solution to this is known. It is fairly clear-cut. It will involve numerous programs and will produce a solution that can be run as a demo. It will require you to design a file format for storing image data that has been gathered; creating a file in that format; searching it; and producing output.
2. You can solve the Image Matching Component.
The solution to this part is not known. By anyone. It isn't at all clear cut. In fact, you may end up with nothing definitive. You may end up with a bunch of data that indicates your 'solution' was the wrong one. This is one of the few assignments where bad results will be acceptable. This approach involves relatively little code but lots of running of your program and lots of data gathering and analysis. It will require your writing a number of components that will serve as testing tools and then running many images through your system; changing your parameters; and doing it again; and again; and again.
Do NOT attempt to do both. It's way too much work.

HINTS, CONSTRAINTS and CONSIDERATIONS:
FOR BOTH Search Engine Component. Image Matching Component.
Keep it simple.
Get the basic "engine" working first:
get it to read an image
get it to just find other images (worry about proper matching later)
get it to print a report
KEEP NOTES on what works (and why); what doesn't (and why); suggest 'theories' if you don't know for sure
Keep it to the point. The search engine must run in a web browser so use Javascript or Java and keep it to something that older browser's can still run. (I must be able to run it on my PC at home). Start with just 1 or 2 measures for each image. Matches will be those images that have values within a specified range of those measures. If these values are not producing the desired results, try adding others.
Speed and efficiency are very important. This is NOT a simple problem. Don't expect to find the right pictures easily.
Make sure you are able to justify all of your decisions. A solution that fails to find the right kinds of images but shows evidence of poper testing and explanation will get a better grade than a more successful engine whose results are unexplained.
This assignment is NOT just about the code. It's about designing a well-thought out solution to a difficult problem; implementing it; improving it; and finally analyzing your results.
VERY IMPORTANT!!! You may make use of existing code found on the net or elsewhere but you MAY NOT use any code "wholesale" (i.e. in its entirely, w/o modification, and you MUST acknowledge your sources). Failure to follow this requirement constitues cheating and such a submission will be given an 'F', along with a letter of explanation to the Department Head and Dean of Science. NO EXCEPTIONS.

Data consists of an undetermined number (could be from 10's to 1000's) of files in a given directory tree. To keep it simple, your system should be capable of gathering data on any website in the www.cpsc.ucalgary.ca domain. If you have your own website, try that. If not, you may use any of the sites in our department (www.cpsc.ucalgary.ca; pages.cpsc.ucalgary.ca) or any of mine: ~becker/461 ~becker/461old ~becker/231 ~becker/233. Notice each has a 'main' directory and numerous sub-directories. The engine should search all pages in all sub-directories. If time permits, a sample website will be placed in the 461 directory in the directory called WWW

Part 1: Outline/Design: [the requirements for part 1 are the same for both approaches]
- submit a 2-5 page report (professional appearance is required) outlining your intended design for the search engine
- explain what attributes you intend to measure (with examples), and why you think they will be successful
- all design decisions must be justified

Part 2 - Version 1: The Search Engine:
2 Main Components:
1. program(s) to build (indices) and update the database file.
2. program(s) to do the retrieving - these must be connected to a webpage
The Generator
This component may consist of multiple programs and utilities (it need not be a singlre program). If so, all parts must be "wrapped" in a shell program so the system's webmaster can use it by simply running a single script. One possible approach:
First:Given a directory containing images, write a program or utility that will extract summary data from each image in the directory and produce a list of image data-web link pairs (relative pathnames are permissible but not required):
Next: Sort the output produced by Part 1 (note: efficiency counts);
Eliminate duplicates. Merge all the pairs so each image appears only once in the final list followed by the number of pages it appears in and a list of the URLs for the pages themselves.
Then: Design a file structure that will allow for an efficient search when necessary but still leave us with only a single, relatively small file. Explain and Justify your choices.
The Search Engine:
Build a simple query system that
- uses the file created above
- can be executed from a web-page (i.e. input comes from a web form, the 'program' will be automatically run when the user submits a query, and the results will be displayed in one or more html pages)
- given a sample image to search for, shows the URL for each image deemed to be a match (again, efficiency counts)
 
If you are creating your own form for the search, be sure to keep the user interface as simple as possible. That is not the focus of this assignment.

Part 2 - Version 2: The Image Match:
2 Main Components:
1. program(s) to read images and gather data.
2. program(s) to do the matching - these must have a simple, but effective user interface
The Collector :
This component may consist of multiple programs and utilities (it need not be a singlre program). If so, all parts must be "wrapped" in a shell program so the system's webmaster can use it by simply running a single script. One possible approach:
First:Given a directory containing images, write a program or utility that will extract summary dtaa from each image in the directory and produce a file of image data-pathname pairs (relative pathnames are permissible but not required):
Then: Design a file structure that will allow for an efficient search when necessary but still leave us with only a single, relatively small file. Explain and Justify your choices.
The Image Match:
Build a simple query system that
- uses the file created above
- given a sample image to search for, shows the URL or filename of each image deemed to be a match (again, efficiency counts) including any applicable summary analysis

 Requirements:
Search Engine Component. Image Matching Component.
If you are creating your own form for the search, be sure to keep the user interface as simple as possible. That is not the focus of this assignment.
'C' Requirements:
- runs from appletviewer
- takes a single file name as input ( name.gif -or- name.jpg ), residing in current working directory
- searches a file of your own design containing image data (NOT the actual images)
- produces a text report with the results of the analysis and the names of all the files that are 'matches'
- implement minimal webcrawler capabilities {run as a stand-alone utility}
'B' Requirements:
- all the 'C' stuff, plus:
- allows image file to de a URL or another local filename (but not necessarily in the current directory)
- allows the option of displaying the images that were found to match (along with the text report)
- implement better webcrawler capabilities {run on a timer, or in background, or for a specified number of links, ...}
'A' Requirements:
- all the 'C' and 'B' stuff, plus:
- takes a single file name as input ( name.gif -or- name.jpg ), residing in current working directory
- searches a user-specified website for files called <something>.jpg -or- <something>.gif (in other words, allow user to restrict the scope of the search)
Keep the user interface as simple as possible. That is not the focus of this assignment.
'C' Requirements:
- takes a single file name as input ( name.gif -or- name.jpg ), residing in current working directory
- searches a predetermined directory for files called image001.jpg (-or- image001.gif) through image999.jpg (-or- image999.gif)
- produces a text report with the results of the analysis and the names of all the files that are 'matches'
'B' Requirements:
- all the 'C' stuff, plus:
- allows image file to be a URL or another local filename (but not necessarily in the current directory)
- allows the option of displaying the images that were found to match (along with the text report)
- more thorough testing & additional attributes
'A' Requirements:
- all the 'C' and 'B' stuff, plus:
- takes a single file name as input ( name.gif -or- name.jpg ), residing in current working directory
- searches a user-specified directory for files called <something>.jpg -or- <something>.gif
- allow user some 'tweeking' capabilities
BONUS:
[up to 10 points] - allows for multiple sample images rather than just one as input
[up to 10 points each] - deal with additional graphics file formats {must provide own test data}
[up to 20 points] - provide some way of indicating the status of the 'hunt' (from the web-crawler's perspective)
[??? points] - extra possible for particularly clever or well thought out 'hunters'
BONUS:
[up to 10 points] - allows for multiple sample images rather than just one as input
[up to 10 points each] - deal with additional graphics file formats {must provide own test data}
[??? points] - extra possible for particularly clever or well thought out 'hunters'


Part 3: Analysis and Discussion [The Post-Mortem]:
[Note that this is NOT a complete list, merely a few suggested questions to consider]

- how successful was your design?
- what could have been inproved?
- how did your design change from your original, and why?
- if you had it to do again, what would you change? (and why, always why).
- what have you learned from this?

Describe and fully justify your structure(s) :
- Your implementation should make efficient use of space and minimize downloading time and searching as much as possible. Justify all design decisions regarding data structures used and how they affect efficiency and speed.
- Comment on any special problems encountered as a result of your design and possible solutions or modifications recommended for the next version.
- Remember that this application should be able to deal with thousands of images so design your structures and algorithms accordingly.
How is the index implemented?
- diagrams (like UML or some such) are acceptable
 
 
What kind of index search did you implement? What makes it efficient? What are the costs? Why did you choose this one?
- Remember, it must be efficient for a directory or set of directories that contain hundreds or thousands of documents.
 
Assuming that the website may be updated from time to time, how would you design an update program that can add to the index without having to re-create the entire set?


All explanations must be clear and concise or they will not be marked. All code MUST be fully documented and both a programmer's manual and a user's manual must be supplied (can be in same document but must be obviously separate). Programmer documentation must include a description of the structures used, algorithm outlines, and justification for your decisions.
If adequate documentation is not supplied your maximum grade will be C.



IF YOU DEVIATE FROM THE SPECS YOU MUST HAVE A GOOD REASON. In other words you may use your own judgment but you must be able to convince your TA that it's really a good idea!

A major portion of the marks for this assignment is allowed for discussion. You need not implement the most efficient methods you know. Give it some thought at the beginning and then do the implementation. Do not redesign your program if you discover inefficiencies. Instead, discuss what is good about your design and what is not. Make some suggestions for how this program could be redesigned to be more efficient. Gross inefficiencies and efforts clearly below a third-year level or expertise are unacceptable.

Back to Top Copyright © 2002 Katrin Becker 1998-2002 Last Modified May 16, 2003 11:35 PM