7071CEM Assignment Help
Information Retrieval Assignment help
There are two tasks in this coursework. You can use any general-purpose programming language of your choice to perform these tasks; however, Python is recommended. The tasks are specified below.
Task 1. Search Engine
Develop a vertical search engine similar to Google Scholar, but specialised to retrieve only papers/books published by a member of the Research Centre for Computational Science and Mathematical Modelling (CSM) at Coventry University:
That is, at least one of the co-authors is a member of CSM.
Your system crawls the relevant web pages and retrieves information about all available publications. For each publication, it extracts available data (such as authors, publication year, and title) and the links to both the publication page and the author’s profile (also called “pureportal” profile) page.
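As an illustration of the data listed above, one possible record for a crawled publication is sketched below. The class name and field names are assumptions made for this example; the brief only requires that authors, year, title, the publication link and the author profile links are captured where available.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class Publication:
    """One crawled publication (field names are illustrative, not prescribed)."""
    title: str
    publication_url: str
    authors: List[str] = field(default_factory=list)
    year: Optional[int] = None  # left as None when the page gives no year
    # Maps an author's name to their profile page URL, when one is available.
    author_profile_urls: Dict[str, str] = field(default_factory=dict)


def to_record(pub: Publication) -> dict:
    """Convert a publication to a plain dict, e.g. for saving as JSON."""
    return asdict(pub)
```

Keeping missing values as None or empty containers, rather than guessing them, makes it easier to see later which fields the crawler could not extract.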
Make sure that your crawler is polite, i.e. it obeys the robots.txt rules and does not hit the servers unnecessarily or too fast.
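A minimal sketch of such politeness checks, using Python's standard urllib.robotparser module, might look like the following. The class name PolitenessGate, the user agent string "csm-crawler" and the one-second default delay are assumptions made for this example. The robots.txt content is passed in as lines so the sketch can be tried without any network access.

```python
import time
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Decides whether a URL may be fetched and spaces out requests."""

    def __init__(self, robots_lines, user_agent="csm-crawler", default_delay=1.0):
        self.user_agent = user_agent  # hypothetical crawler name
        self.parser = RobotFileParser()
        self.parser.parse(robots_lines)
        # Use the site's Crawl-delay if it declares one, else our own default.
        delay = self.parser.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch url."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait(self):
        """Sleep until at least self.delay seconds have passed since the last call."""
        if self._last_request is not None:
            remaining = self.delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()
```

In a real crawler, the lines would come from the site's own robots.txt file, and the crawler would call allowed(url) and then wait() before each request.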
Because this information changes slowly, your crawler may be scheduled to look for new information, say, once per week. Ideally, it should do so automatically, as a scheduled task. Every time it runs, it should update the index with the new data.
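One way to make each scheduled run update the index rather than rebuild it is sketched below. The function name, the dictionary-based inverted index and the use of the publication URL as a unique key are assumptions for this example, and the whitespace tokenisation is deliberately simplistic. The weekly trigger itself could be left to an operating system scheduler such as cron or Windows Task Scheduler.

```python
def update_index(index, seen_urls, publications):
    """Add unseen publications to an inverted index; return how many were added.

    index:        dict mapping a term to the set of publication URLs containing it
    seen_urls:    set of publication URLs that are already indexed
    publications: iterable of dicts with at least "url" and "title" keys
    """
    added = 0
    for pub in publications:
        url = pub["url"]
        if url in seen_urls:
            continue  # already indexed in an earlier run
        seen_urls.add(url)
        for term in pub["title"].lower().split():
            index.setdefault(term, set()).add(url)
        added += 1
    return added
```

Because already-seen URLs are skipped, running the function again on the same crawl output leaves the index unchanged, so a repeated or overlapping run does no harm.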
Make sure you apply the required pre-processing tasks to both the crawled data and the users’ queries.
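The key point is that documents and queries go through the same pipeline. The sketch below shows one possible pipeline: lower-casing, tokenisation, stop word removal and stemming. The short stop word list and the suffix-stripping naive_stem function are simplified stand-ins invented for this example; a real system would more likely use a full stop word list and an established stemmer such as Porter's.

```python
import re

# Deliberately tiny stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "on", "to", "with", "is", "are"}


def naive_stem(token):
    """Strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenise, drop stop words and stem; use for documents AND queries."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

With this sketch, the title "Modelling of the Networks" and the query "network modelling" reduce to the same two terms, which is what allows them to match.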
From the user’s point of view, your system has an interface similar to the Google Scholar main page, where the user can type in queries/keywords about the resources they want to find. Your system then displays the results, sorted by relevance, in a similar way to Google Scholar. However, the search results are restricted to publications by CSM members only. Unless you intend to get a score higher than 70 or so, the user interface does not need to be web-based (like Google Scholar), and the standard Python interface in your IDE is enough. For better usability, though, it would be good to be able to click on the printed links instead of copy-pasting them into a browser.
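The brief does not prescribe a ranking model. As one possibility, the sketch below scores each publication with a simple TF-IDF weighting and returns the matching publications in descending order of score. The function name, the smoothed IDF formula and the length normalisation are choices made for this example, and the term lists are assumed to have been pre-processed already.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Return the ids of matching documents, best match first.

    query_terms: list of pre-processed query terms
    docs:        dict mapping a document id to its list of pre-processed terms
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log((n + 1) / (df[term] + 1)) + 1.0  # smoothed IDF
                score += (1 + math.log(tf[term])) * idf
        if score > 0:
            # Dampen the advantage of long documents (an illustrative choice).
            scores[doc_id] = score / math.sqrt(len(terms))
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that contain none of the query terms are left out of the result, so the printed list contains only relevant publications.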
NOTE: You must show in your report and viva that your system is accurate by trying various queries. For example, you must use both short and long queries, queries with and without stop words, queries with various keywords, and more challenging queries to prove the robustness of your system.