代写数据结构作业,实现 HashTable ,并且用HashTable完成一个文件索引工具。
![HashTable](https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg/315px-
Hash_table_3_1_1_0_1_0_0_SP.svg.png)
OBJECTIVES
- Open and manipulate files
- Implement a hashtable as part of your data structure library
- Index files
- Get familiar with command line parameters
SUMMARY
This week we’re building a hashtable, then using the hashtable to index the
files. After you complete your hashtable implementation, you’ll write a piece
of code that will traverse a file directory indexing the files. Then, you’ll
read each of those files and index each row of the file in another hashtable.
Finally, you’ll output the indexed data to a file.
HOW DOES THIS ALL FIT TOGETHER?
In the bigger scheme of things, we’re putting pieces together to build a
search engine. A search engine has a few components: a file crawler, a file
processor, an indexer, and a query processor. The file crawler starts in one
place, and traverses to find other files. On the web, this means starting at
one page and following all the links on the page; in a file system, it means
starting in one directory and traversing through the directory structure until
all files have been foun. The file processor takes each of those files and
finds significant words or concepts in the file. The indexer creates a data
structure that (essentially) populates a data structure that makes it easy to
find documents that have similar topics, or find documents that contain a
certain word. The query processor lets us work with that data structure to get
the results based on what we ask for.
PART 1: BUILD A HASHTABLE
First step is to build a hashtable. This is similar to the process we went
through with LinkedList. You are once again provided with some starting files.
See the a7 folder for the files.
Once again, you’re provided with an example of how to use the hashtable, as
well as example binaries that work so you can compare your output.
All of this should be in a htll folder in your a7 directory.
A few notes about hashtables:
- Here’s a set of slides explaining about hashtables.
- Aspnes (Yale) also has some info about C and hashtables.
- Hashtables need a key to map to a value. You get to choose what this value is. In this implementation, it must be an unsigned int. You’ll see in the example hashtable usage, we compute a FNV hash value of a key, then use that to put a new value in the hashtable.
PART 2: FILE PROCESSING
Now, we’re going to read in some movie info.
The program takes in a flag as an argument and outputs the results in a given
order. The file will be in CSV format (comma-separated values). For this
assignment, the format is such:-
represents an null value. The genre field is a comma-seperated list of
genres: it should be parsed, and when indexing via genre, the movie should be
considered as all genres.
id|type|title1|title2||year||duration|genre
A few rows:
tt7356766|tvEpisode|Walker Cup|Walker Cup|0|2017|-|-|Sport
tt7356768|tvSeries|Aaron Olivo’s Skits|Aaron Olivo’s Skits|0|2017|-|-|Comedy
tt7356770|short|Ducko|Ducko|0|2018|-|1|Comedy,Short
tt7356774|tvEpisode|Kitchen Disaster to Master|Kitchen Disaster to Master|0|2017|-|-|Comedy,Game-Show,Reality-TV
tt7356776|tvEpisode|Carolina Panthers at San Francisco 49ers|Carolina Panthers at San Francisco 49ers|0|2017|-|-|Sport
tt7358226|tvEpisode|Camra invisible|Camra invisible|0|1994|-|-|Adventure,Family,Thriller
The program accepts three different flags: -t, -g, -y.
- The -t flag will index the movies based on type
- The -g flag will index the movies based on genre
- The -y flag will index the movies based on year
- Only allow one flag at a time. It should be an error if more than one is provided.
- If no flags are set, index by genre.
- C has a tokenizer functionality. That might help process each line for you.
A sample invocation:
% ./index_movies -t data/test
Output to file_index.csv
The program is started for you, with the following structure:
Starting code is in the resources repo, a7 directory.
The README.md file has the steps you need to complete the code.
Some info on how to use command line arguments .
PART 3: OUTPUT THE INDEXED FILES TO AN “INDEX” FILE
After you’ve built the index, use the hashtable/index to generate a report of
the indexed files.
Create a new file called indexed_movies.txt. Iterate through your hashtable,
printing out each MovieSet in a format similar to below:
Genre: Crime
3 items
The Shawshank Redemption
The Usual Suspects
American History X
Genre: Adventure
2 items
The Lord of the Rings: The Fellowship of the Ring
Back to the Future
…
Type: Short
3 items
The Shawshank Redemption
The Usual Suspects
American History X
Type: tvEpisode
2 items
The Lord of the Rings: The Fellowship of the Ring
Back to the Future
Type: movie
5 items
…
- Print out the indexed type/value on one line, then the number of items in that value, then the movie name for each movie in that genre.
PART 4: (OPTIONAL) SORTING THE OUTPUT
If you’d like to take the program one step further, you can output the movies
in a sorted manner, utilizing C’s qsort function and a custom-built
comparator.
- As a suggestion, consider creating a struct to store each row. Also, look at the qsort() function in C. Specifically, look at how you can pass a custom comparator function into it that will let you dictate how values are sorted.
SUBMISSION
Submit your assignment by pushing your code to Github. Tag it with the tag
a7-final.
Tag your last commit, and submit the tag on Blackboard by the due date,
midnight (Pacific Time).
Things to remember
- Make sure your code compiles and runs; even if it crashes or isn’t complete. Comment what you need to in order to make sure it compiles.
- Unable to compile = losing 50% of your grade.
- Commit early and often.
- If you mistakenly tag your repo as a7-final and find yourself having to create a new final tag, delete the tag first. We don’t want to figure out which of a7-final, a7-final-a, a7-final2, … is the one we should be grading.
- If there’s no tag, automatic 0.
- Valgrind should report that there are no memory leaks (all heap has been freed) as well as NO ERRORS.
- If you have issues (you got sick, changed jobs, moved, …) that impact your ability to finish on time, TELL US BEFORE THE DEADLINE. You will lose at least 50% of your points if you don’t talk to us first