主观题:Roll Your Own Mini Search Engine
In this project, you are supposed to create your own mini search engine which can handle inquiries over “The Complete Works of William Shakespeare” (http://shakespeare.mit.edu/).
You may download the functions for handling stop words and stemming from the Internet, as long as you add the source in your reference list.
Your tasks are:
- (1) Run a word count over the Shakespeare set and try to identify the stop words (also called the **noisy** words) – How and where do you draw the line between “interesting” and “noisy” words?
- (2) Create your inverted index over the Shakespeare set with word stemming. The stop words identified in part (1) must not be included.
- (3) Write a query program on top of your inverted file index, which will accept a user-specified word (or phrase) and return the IDs of the documents that contain that word.
- (4) Run tests to show how the thresholds on query may affect the results.
### Grading Policy:
**Programming:** Write the programs for word counting (1 pt.), index generation (5 pts.) and query processing (3 pts.) with sufficient comments.
**Testing:** Design tests for the correctness of the inverted index (2 pts.) and thresholding for query (2 pts.). Write analysis and comments (3 pts.). **Bonus: What if you have 500 000 files and 400 000 000 distinct words? Will your program still work? (+2 pts.)**
**Documentation:** Chapter 1 (1 pt.), Chapter 2 (2 pts.), and finally a complete report (1 point for overall style of documentation).
**Note:** Anyone who does excellent job on answering the Bonus question will gain extra points.
答案:Rubric items: 6
Item 1: 1 point
The cover page must be presented with the title and the date of completion (+ 0.2 pts.). A complete problem description must be given in Chapter 1 (+0.8 pts.). Deduct points if:
- the cover page is not complete (-0.2)
- the introduction is a simple copy + paste of the assignment statement (-0.3)
- the introduction is not very clear -- in this chapter they are supposed to make it clear WHAT is to be done, besides why they are doing it (-0.1 ~ 0.5)
- others – please specify in the final comments.
Item 2: 2 points
Chapter 2 is supposed to contain the descriptions (pseudo-code preferred) of all the algorithms: word counter, index generator, and query processor (+0.6 pt. for each), plus the data structure of their index (+0.2 pt.). Deduct points if:
- the algorithm specification is not complete
- not making their algorithm easier to understand than a simple program (-0.5)
- only a program + comments, which is not acceptable (-1)
- some functions in STL are used without a thorough discussion on their implementations (-0.5)
- others – please specify in the final comments.
Item 3: 1 point
The overall style of documentation is supposed to be neat and clear. Deduct (at most 1) point if:
- the document looks like a patch work (-1)
- messy – some of the data in the charts and tables are missing (-1)
- the hand-in is not zipped with proper folders (-1)
- the hand-in is not complete – some files are missing (-1)
- others – please specify in the final comments.
Item 4: 4 points
A complete table of testing data must be given in Chapter 3, including testing the correctness of the inverted index (+2 pts.) and the thresholding for query (+2 pts.). Deduct points if:
- the purposes of some test cases are not clearly specified (-1)
- the testing results are not complete – the correctness of the inverted index is missing(-2)
- the testing results are not complete – the thresholding for query is missing(-2)
- others – please specify in the final comments.
Item 5: 3 points
In Chapter 4, analysis of both the time (+1 pt.) and space (+1 pt.) complexities must be given, together with sufficient discussions on the testing results (+1 pt.). **Bonus points are given for answering the bonus question (+1 ~ 2 pts.).** Deduct points if:
- there is no comments on the testing results (-1)
- some functions in STL are used without a thorough discussion on their complexities (-1)
- **analysis** of the **complexities** of time and/or space is missing -- they must show how they have reached their conclusions, instead of simply listing them (-2)
- others – please specify in the final comments.
Item 6: 9 points
The programmer is supposed to implement the three key functions: the word counting (+1 pt.), the index generation (+5 pts.), and the query processing (+3 pts.). Deduct points if:
- some of the programs are not working properly (-1 ~ 3)
- the word counting program is missing (-1)
- the index generation program is missing (-5)
- the query processing program is missing (-3)
- the programs are not well-commented (-9)
- others – please specify in the final comments.
You may download the functions for handling stop words and stemming from the Internet, as long as you add the source in your reference list.
Your tasks are:
- (1) Run a word count over the Shakespeare set and try to identify the stop words (also called the **noisy** words) – How and where do you draw the line between “interesting” and “noisy” words?
- (2) Create your inverted index over the Shakespeare set with word stemming. The stop words identified in part (1) must not be included.
- (3) Write a query program on top of your inverted file index, which will accept a user-specified word (or phrase) and return the IDs of the documents that contain that word.
- (4) Run tests to show how the thresholds on query may affect the results.
### Grading Policy:
**Programming:** Write the programs for word counting (1 pt.), index generation (5 pts.) and query processing (3 pts.) with sufficient comments.
**Testing:** Design tests for the correctness of the inverted index (2 pts.) and thresholding for query (2 pts.). Write analysis and comments (3 pts.). **Bonus: What if you have 500 000 files and 400 000 000 distinct words? Will your program still work? (+2 pts.)**
**Documentation:** Chapter 1 (1 pt.), Chapter 2 (2 pts.), and finally a complete report (1 point for overall style of documentation).
**Note:** Anyone who does excellent job on answering the Bonus question will gain extra points.
答案:Rubric items: 6
Item 1: 1 point
The cover page must be presented with the title and the date of completion (+ 0.2 pts.). A complete problem description must be given in Chapter 1 (+0.8 pts.). Deduct points if:
- the cover page is not complete (-0.2)
- the introduction is a simple copy + paste of the assignment statement (-0.3)
- the introduction is not very clear -- in this chapter they are supposed to make it clear WHAT is to be done, besides why they are doing it (-0.1 ~ 0.5)
- others – please specify in the final comments.
Item 2: 2 points
Chapter 2 is supposed to contain the descriptions (pseudo-code preferred) of all the algorithms: word counter, index generator, and query processor (+0.6 pt. for each), plus the data structure of their index (+0.2 pt.). Deduct points if:
- the algorithm specification is not complete
- not making their algorithm easier to understand than a simple program (-0.5)
- only a program + comments, which is not acceptable (-1)
- some functions in STL are used without a thorough discussion on their implementations (-0.5)
- others – please specify in the final comments.
Item 3: 1 point
The overall style of documentation is supposed to be neat and clear. Deduct (at most 1) point if:
- the document looks like a patch work (-1)
- messy – some of the data in the charts and tables are missing (-1)
- the hand-in is not zipped with proper folders (-1)
- the hand-in is not complete – some files are missing (-1)
- others – please specify in the final comments.
Item 4: 4 points
A complete table of testing data must be given in Chapter 3, including testing the correctness of the inverted index (+2 pts.) and the thresholding for query (+2 pts.). Deduct points if:
- the purposes of some test cases are not clearly specified (-1)
- the testing results are not complete – the correctness of the inverted index is missing(-2)
- the testing results are not complete – the thresholding for query is missing(-2)
- others – please specify in the final comments.
Item 5: 3 points
In Chapter 4, analysis of both the time (+1 pt.) and space (+1 pt.) complexities must be given, together with sufficient discussions on the testing results (+1 pt.). **Bonus points are given for answering the bonus question (+1 ~ 2 pts.).** Deduct points if:
- there is no comments on the testing results (-1)
- some functions in STL are used without a thorough discussion on their complexities (-1)
- **analysis** of the **complexities** of time and/or space is missing -- they must show how they have reached their conclusions, instead of simply listing them (-2)
- others – please specify in the final comments.
Item 6: 9 points
The programmer is supposed to implement the three key functions: the word counting (+1 pt.), the index generation (+5 pts.), and the query processing (+3 pts.). Deduct points if:
- some of the programs are not working properly (-1 ~ 3)
- the word counting program is missing (-1)
- the index generation program is missing (-5)
- the query processing program is missing (-3)
- the programs are not well-commented (-9)
- others – please specify in the final comments.