Introduction
代写一个YELP的数据库程序,涉及到Data Structure的Map实现。
相比C++,用C语言来实现Data Structure和Algorithm会复杂很多,几乎所有的基础结构都需要自己写。
Purpose
The purpose of this assignment is for you to:
- Increase your proficiency in C programming, your dexterity with dynamic memory allocation and your understanding of linked data structures, through programming a dictionary.
- Increase your understanding of how computational complexity can affect the performance of an algorithm by conducting orderly experiments with your program and comparing the results of your experimentation with theory.
- Increase your proficiency in using UNIX utilities.
Background
A dictionary is an abstract data type that stores and supports lookup of key,
value pairs. For example, in a telephone directory, the (string) key is a
person or company name, and the value is the phone number. In a student record
lookup, the key would be a student ID number and the value would be a complex
structure containing all the other information about the student.
A dictionary can be implemented in C using a number of underlying data
structures. Any implementation must support the operations: makedict a new
dictionary; insert a new item (key, value pair) into a dictionary; search for
a key in the dictionary, and return the associated value. Most dictionaries
will also support the operation delete an item.
Your task
In this assignment, you will create a simplified UNIX yelp.com (local business
directory) as a concrete instance of a dictionary, and will use it to look up
information about a specific business name, such as full address or opening
times.
There are two stages in this project. In each stage you will code a dictionary
in the C programming language. A binary search tree will be the underlying
data structure for both stages.
In this assignment the search keys are not guaranteed to be unique. In this
assignment we use variants of the binary search tree designed to handle
duplicates, i.e. by either dividing nodes using <= >, or by using < > and a
linked list for items with same key. You will use a Makefile to direct the
compilation of two separate executable programs, one for Stage 1 and one for
Stage 2, each of which uses a different variant of the binary search tree.
In both stages of the assignment, you will insert records into the dictionary
from a file. You will then look up and output the records (business) contained
by the dictionary, counting and outputting the number of key comparisons used
in the search.
You will report on the number of key comparisons used for search, compare the
number of key comparisons used by each stage, and analyse what would have been
expected theoretically. The report should cover each file used to initialize
the dictionary.
You are not required to implement the delete functionality.
Stage 1
In Stage 1 of this assignment, your Makefile will direct the compilation to
produce an executable program called yelp1. The program yelp1 takes two
command line arguments: the first argument is the name of the data file used
to build the dictionary; the second argument is the name of the output file,
containing the data located in the searches. The file consists of an
unspecified number of records, one per line, where the format of each record
is:
The field name is an alphabetic string of varying length, containing the name
of the business or the user. You may assume that this field contains no more
than 64 characters. The data field is a string containing all the data
collected about the business or the user. Although the average size of this
field is around 430 characters, the maximum size of this field can be 1,465
characters. Each field is separated by a semicolon “,”. It is a standard csv
format where the delimiter used is a comma.
The dictionary key consists of the name field. The data is the information
sought during lookup.
For the purposes of this assignment, you may assume that the input data is
well-formatted, that the input file is not empty, and that the maximum length
of an input record is 1,465 characters. This number could help you fixing a
reading buffer size.
In this first stage of the assignment, you will:
- Construct a binary search tree to store the information contained in the file specified in the command line argument. Each record should be stored in a separate Node.
- Search the binary search tree for records, based on their keys. The keys are read in from stdin, i.e. from the screen.
For testing, it is often convenient to create a file of keys to be searched,
one per line, and redirect the input from this file. Use the UNIX operator for
redirecting input from a file. - Examples of use:
yelp1 datafile outputfile then type in keys; or
yelp1 datafile outputfile < keyfile - Your program will look up each key and output the information (the data found) to the output file specified by the second command line parameter. If the key is not found in the tree, you must output the word NOTFOUND.
The number of key comparisons performed during both successful and
unsuccessful lookups have to be written to stdout. - Remember that the entries in the file do not necessarily have unique keys. Your search must locate all keys matching the search key, and output all the data found.
In Stage 1 of the assignment you will locate the duplicates by continuing your
search until you reach a leaf node, regardless of whether or not you have
already found a match or matches.
Note that the key is output to both the file and to stdout, for identification
purposes. Also note that the number of comparisons is only output at the end
of the search, so there is only one number for key comparisons per key, even
when multiple records have been located for that key.
The format need not be exactly as above. Variations in whitespace/tabs are
permitted.
Stage 2
In Stage 2, you will code a dictionary where all the duplicate keys in the
dictionary are returned, as previously, and additionally where the search is
more efficient than in Stage 1. Input and output are as for Stage 1, with the
information or NOTFOUND written to a file and the number of comparisons made
during the search written to stdout.
In Stage 2, however, you will structure your tree so that once a key is found,
all duplicate keys can be found without further key comparisons. Note that
comparing a key to NULL is not a full (costly) key comparison, and is not
counted as a key comparison in Stage 2 of this assignment when building the
report.
Experimentation
You will run various files through your program to test its accuracy and also
to examine the number of key comparisons used when searching different files.
You will report on the key comparisons used by your Stage 1 dictionary yelp1
for various data inputs and the key comparisons used by your Stage 2
dictionary yelp2 for various data inputs too. You will compare these results
with each other and, importantly with what you expected based on theory
(big-O).
Your experimentation should be systematic, varying the size and
characteristics of the files you use (e.g. sorted, random, duplicates, etc.),
and observing how the number of key comparisons varies. Repeating a test case
with different keys and taking the average can be useful.
Some useful UNIX commands for creating test files with different
characteristics include sort, sort -R (man sort for more information on the -R
option), and shuf. You can randomize your input data and pick the first x keys
as the lookup keywords.
You will write up your findings and submit your results separately through the
Turnitin system. You will compare your results with the two dictionary
implementations (stage1 and stage2) and also compare these results to what you
know about the theory of binary search trees.
Tables and graphs are useful presentation methods. Select only informative
data; more is not always better.
You should present your findings clearly, in light of what you know about the
data structures used in your programs and in light of their known
computational complexity. You may find that your results are what you
expected, based on theory. Alternatively, you may find your results do not
agree with theory. In either case, you should state what you expected from the
theory, and if there is a discrepancy you should suggest possible reasons. You
might want to discuss space-time trade-offs, if this is appropriate to your
code and data.
You are not constrained to any particular structure in this report, but a
useful way to present your findings might be:
- Introduction: Summary of data structures and inputs.
- Stage 1 and Stage 2:
- Data (number of key comparisons)
- Comparison of the two stages
- Comparison with theory
- Discussion
Implementation Requirements
The following implementation requirements must be adhered to:
- You must code your dictionary in the C programming language.
- You must code your dictionary in a modular way, so that your dictionary implementation could be used in another program without extensive rewriting or copying. This means that the dictionary operations are kept together in a separate .c file, with its own header (.h) file, separate from the main program. The main.c of stage1 can perfectly be the same main for stage2, in terms of dictionary operations.
- Your code should be easily extensible to allow for multiple dictionaries. This means that the functions for insertion, search, and deletion take as arguments not only the item being inserted or a key for searching and deleting, but also a pointer to a particular dictionary, e.g. insert(dict, item).
- In each stage, you must read the input file once only.
- Your program should store strings in a space-efficient manner. If you are using malloc() to create the space for a string, remember to allow space for the final end of string ‘\0’ (NULL).
- A Makefile is not provided for you. The Makefile should direct the compilation of two separate programs: yelp1 and yelp2. To use the Makefile, make sure is in the same directory of your code, and type make yelp1 to make the dictionary for Stage 1 and make yelp2 to make the dictionary for Stage 2. You must submit your makefile with your assignment.
Hint: If you havent used make before, try it on simple programs first. If it
doesnt work, read the error messages carefully. A common problem in compiling
multifile executables is in the included header files. Note also that the
whitespace before the command is a tab, and not multiple spaces. It is not a
good idea to code your program as a single file and then try to break it down
into multiple files. Start by using multiple files, with minimal content, and
make sure they are communicating with each other before starting more serious
coding.