-->
当前位置:首页 > 题库 > 正文内容

主观题:Roll Your Own Mini Search Engine

Luz3年前 (2022-02-20)题库755
In this project, you are supposed to create your own mini search engine which can handle inquiries over “The Complete Works of William Shakespeare” (http://shakespeare.mit.edu/).

You may download the functions for handling stop words and stemming from the Internet, as long as you add the source in your reference list.

Your tasks are:

- (1) Run a word count over the Shakespeare set and try to identify the stop words (also called the **noisy** words) – How and where do you draw the line between “interesting” and “noisy” words?
- (2) Create your inverted index over the Shakespeare set with word stemming. The stop words identified in part (1) must not be included.
- (3) Write a query program on top of your inverted file index, which will accept a user-specified word (or phrase) and return the IDs of the documents that contain that word.
- (4) Run tests to show how the thresholds on query may affect the results.



### Grading Policy:

**Programming:** Write the programs for word counting (1 pt.), index generation (5 pts.) and query processing (3 pts.) with sufficient comments.

**Testing:** Design tests for the correctness of the inverted index (2 pts.) and thresholding for query (2 pts.). Write analysis and comments (3 pts.). **Bonus: What if you have 500 000 files and 400 000 000 distinct words? Will your program still work? (+2 pts.)**

**Documentation:** Chapter 1 (1 pt.), Chapter 2 (2 pts.), and finally a complete report (1 point for overall style of documentation).

**Note:** Anyone who does excellent job on answering the Bonus question will gain extra points.







答案:Rubric items: 6

Item 1: 1 point

The cover page must be presented with the title and the date of completion (+ 0.2 pts.). A complete problem description must be given in Chapter 1 (+0.8 pts.). Deduct points if:

- the cover page is not complete (-0.2)
- the introduction is a simple copy + paste of the assignment statement (-0.3)
- the introduction is not very clear -- in this chapter they are supposed to make it clear WHAT is to be done, besides why they are doing it (-0.1 ~ 0.5)
- others – please specify in the final comments.

Item 2: 2 points

Chapter 2 is supposed to contain the descriptions (pseudo-code preferred) of all the algorithms: word counter, index generator, and query processor (+0.6 pt. for each), plus the data structure of their index (+0.2 pt.). Deduct points if:

- the algorithm specification is not complete
- not making their algorithm easier to understand than a simple program (-0.5)
- only a program + comments, which is not acceptable (-1)
- some functions in STL are used without a thorough discussion on their implementations (-0.5)
- others – please specify in the final comments.


Item 3: 1 point

The overall style of documentation is supposed to be neat and clear. Deduct (at most 1) point if:

- the document looks like a patch work (-1)
- messy – some of the data in the charts and tables are missing (-1)
- the hand-in is not zipped with proper folders (-1)
- the hand-in is not complete – some files are missing (-1)
- others – please specify in the final comments.

Item 4: 4 points

A complete table of testing data must be given in Chapter 3, including testing the correctness of the inverted index (+2 pts.) and the thresholding for query (+2 pts.). Deduct points if:

- the purposes of some test cases are not clearly specified (-1)
- the testing results are not complete – the correctness of the inverted index is missing(-2)
- the testing results are not complete – the thresholding for query is missing(-2)
- others – please specify in the final comments.

Item 5: 3 points

In Chapter 4, analysis of both the time (+1 pt.) and space (+1 pt.) complexities must be given, together with sufficient discussions on the testing results (+1 pt.). **Bonus points are given for answering the bonus question (+1 ~ 2 pts.).** Deduct points if:

- there is no comments on the testing results (-1)
- some functions in STL are used without a thorough discussion on their complexities (-1)
- **analysis** of the **complexities** of time and/or space is missing -- they must show how they have reached their conclusions, instead of simply listing them (-2)
- others – please specify in the final comments.

Item 6: 9 points

The programmer is supposed to implement the three key functions: the word counting (+1 pt.), the index generation (+5 pts.), and the query processing (+3 pts.). Deduct points if:

- some of the programs are not working properly (-1 ~ 3)
- the word counting program is missing (-1)
- the index generation program is missing (-5)
- the query processing program is missing (-3)
- the programs are not well-commented (-9)
- others – please specify in the final comments.

发表评论

访客

◎欢迎参与讨论,请在这里发表您的看法和观点。