The aim of this assignment is to construct a simple search engine. You are expected to program this in Python, although you will not be penalised for the use of another language.


Aims

  • To give students practical experience of information retrieval systems, experimental work and evaluation.

Learning outcomes

  • Improved program design and coding skills.
  • Improved research and communications skills.

Assessment criteria

You will submit code and present your system in a short demonstration.

Marks will be awarded for:

– the design and functionality of the system,
– the organisation of the demonstration,
– the quality of your code, which you will submit.


The breakdown of marks is given on the marking sheet (attached).
Note that the distribution of marks is indicative only and may change.

Description of the assignment


The assignment is to produce a search engine that works over the computing domain. Use the NIST crawler as the starting point.
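
As a rough illustration of what a crawl loop involves, the sketch below does a breadth-first traversal that stays on the host of the seed URL. It is not the NIST crawler. The names `crawl`, `LinkParser`, `fetch` and `max_pages` are hypothetical, and `fetch` is assumed to be any function you supply that returns the HTML for a URL, or None on failure.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed, staying on the seed's host.

    fetch is a caller-supplied function: URL -> HTML string or None.
    Returns a dict mapping each visited URL to its HTML.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the network code, so the loop can be tried out on a small in-memory set of pages before it is pointed at a real site.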

The task is to crawl and index the domain. You will have to strip out the formatting and make decisions about how to handle non-content material (menus, banners, …).
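
One possible way to strip formatting is sketched below, using only the standard library's `html.parser`. The choice of tags treated as non-content (`SKIP_TAGS`) is an assumption made for illustration, as are the names `TextExtractor` and `extract_text`; deciding what really counts as non-content is part of the assignment.

```python
from html.parser import HTMLParser

# Assumption: these elements hold non-content material on most sites.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML page with tags and skipped blocks removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Because the sketch relies on tag names rather than on anything specific to one site, it stays domain-independent, although it will miss menus and banners that are marked up with generic `div` elements.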




  1. Demonstration and questions

The report should show the output from crawling the inforet.cmp.uea.ac.uk domain.
You must submit a PDF version of the report via Blackboard, with your code.
You may be asked to answer questions on the operation of your system.

  2. Code

You must submit a copy of all the code used in your experiments as a single zipped file, via Blackboard; the filename should be of the form: yourstudentid_IRcoursework.zip.

Marks will be awarded for clear, well-structured code with appropriate and informative comments. Systems that are mostly built from code reused from third parties, are cluttered with redundant fragments, have an inconsistent layout, are uncommented, and so on will attract very few marks. Note that the unacknowledged use of code from third parties will be treated as plagiarism.

CMP-5036A Marking Sheet: Crawler and indexer

Student name  No.  
Marker name  
Demonstration of a crawl over a sample of the uea.ac.uk/computing domain, stripping out HTML formatting and any other non-content material using domain-independent code.

0 if the crawler does not work;
2 for a working crawler;
+1-2 for duplicate removal;
+1-2 for stripping unwanted material.
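
For illustration only, duplicate removal could be approached by hashing each page's extracted text, as in the sketch below. The function names `fingerprint` and `remove_duplicates` are hypothetical, and the sketch only catches pages whose text is identical after case and whitespace are normalised; near-duplicates would need a different technique.

```python
import hashlib


def fingerprint(text):
    """Hash of the text with case and whitespace differences removed."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def remove_duplicates(pages):
    """Keep the first URL seen for each distinct page text.

    pages maps URL -> extracted text; returns a filtered dict.
    """
    seen = set()
    unique = {}
    for url, text in pages.items():
        digest = fingerprint(text)
        if digest not in seen:
            seen.add(digest)
            unique[url] = text
    return unique
```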

Demonstration of an inverted index containing docids, postings and vocabulary tables from a crawl over the uea.ac.uk/computing domain, stored in a form suitable for subsequent retrieval.

0 if the crawl has not been completed;
3 for a file or files containing docids, postings and vocabulary (+1 each);
+1-2 for document lengths stored.
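
A minimal sketch of the data structures named above is given below, as one possible layout rather than a required one. The function names `build_index` and `save_index`, the dictionary layout, the simple tokeniser and the choice of JSON as the storage format are all assumptions made for illustration.

```python
import json
import re
from collections import Counter


def build_index(docs):
    """Build docids, vocabulary, postings and document lengths.

    docs maps URL -> plain text. Tokenisation here is a simple
    lowercase split on non-alphanumerics (an assumption).
    """
    docids = {}       # docid -> URL
    vocab = {}        # term -> termid
    postings = {}     # termid -> {docid: term frequency}
    doc_lengths = {}  # docid -> number of tokens
    for docid, (url, text) in enumerate(docs.items(), start=1):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        docids[docid] = url
        doc_lengths[docid] = len(tokens)
        for term, tf in Counter(tokens).items():
            # Assign the next free termid the first time a term is seen.
            termid = vocab.setdefault(term, len(vocab) + 1)
            postings.setdefault(termid, {})[docid] = tf
    return {"docids": docids, "vocab": vocab,
            "postings": postings, "doc_lengths": doc_lengths}


def save_index(index, path):
    """Write the index as JSON (integer keys become strings)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
```

Storing term frequencies in the postings and the token count of each document means the saved files already hold what a later retrieval stage would need for length-normalised ranking.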

Design and code
Evidence of a structured approach to design of code, use of appropriate programming conventions and good comments.

4 for design and conventions (0=inconsistent layout, confusing naming, poor use of control structures and functions, +1 consistent naming and layout, +1 good structure, +1 good use of functions, +1 domain-independent code);
2 for comments (+1 explaining the purpose of each function, +1 additional points about statements/blocks).

Organisation and conduct of the demonstration.

0 if poorly organised and not ready,
+1 if ready,
+1-2 for organisation.

Additional comment


