比较几种搜索算法,对不同的数据进行搜索和分析。
Introduction
In Lab 1 “Rock Climber”, you learned how to climb rocks (well, sort of). You
got overly excited and couldn’t wait to tell your friends about your newly
acquired skill on Twitter. Obviously, you needed an excellent hashtag for your
tweet. No problem! We will find you one.
In this assignment, you will be exposed to several array searching and sorting
concepts and implementations, and asymptotic analyses in general. Work
individually. Follow the guidelines in the syllabus for any discussions with
others. You may work in pairs on extra credit contests, but then the extra
credit will be divided among the two of you. Extra credit entries must be bug
free to be eligible and turned in on time (no late days).
Files
After downloading the assignment tarball from the course website, extract the
files by running:
tar -xvf lab2-handout.tgz
from a terminal window. Some of the files worth looking at are listed below.
Files you won’t modify:
Makefile
lib/.h
lib/.c
lib/words
Files you may modify (and the files denoted by * will be submitted for
grading):
* duplicates.c
* trending.c
* validity.c
duplicates_test.c
trending_test.c
validity_test.c
Additionally, you should create a file called:
written.pdf
which contains the answers to the written parts of the assignment. For your
convenience, we’ve provided a LATEX template (written.tex).
Submission
To submit your programming assignment, first generate a handin archive
lab2-handin.tgz with the command
make package
then submit your lab2-handin.tgz file to your Subversion repository (svn) for
the class. Once you’ve completed all problems, you should also submit your
written.pdf to Gradescope.
Note that your code will not be graded until after the due date. It is your
responsibility to test your code and proofread your writeup thoroughly before
submitting.
#Hashtag
What is a hashtag you ask? Well, it is “a word or phrase preceded by the
symbol # that classifies or categorizes the accompanying text (such as a
tweet)”. For the purpose of this lab, we will ignore the preceding “#” sign,
because they appear in front of every hashtag. Furthermore, throughout the
lab, we will assume that the hashtags are case-insensitive. For example,
#OhWhatABeautifulMorning is represented as the string
“ohwhatabeautifulmorning”, or more precisely, a char array in C, of course.
#WhatHashtagsHaveIUsed
In this part of the assignment, given a sorted array of hashtags (i.e. an
array of strings sorted in alphabetical order) that you have used since your
first tweet, you will return a new sorted array of distinct hashtags. In other
words, the new array should contain the same strings as the original array,
but all strings are now distinct.
Task 1
In duplicates.c, implement the function
int number_distinct(char **hashtags, int n)
—|—
that counts the number of distinct strings among the first n elements of the
sorted array hashtags.
For example, for the following array
animals = {“bobcat”, “bobcat”, “chipmunk”, “eagle”, “eagle”, “eagle”, “tortoise”}
—|—number_distinct(animals, 7)
should evaluate to 4, for there are four kinds
of animal in the array, namely bobcat, chipmunk, eagle and tortoise. Your
solution should have O(n), a.k.a. linear, asymptotic running time.
You may use the C string library. In particular, you may find the following
functions useful in string.h:
// Comparing string
int strcmp(const char *str1, const char *str2)
// Copying string
char *strcpy(char *dest, const char *src)
—|—
Task 2
In duplicates.c, implement the function
int remove_duplicates(char **hashtags, char **result, int n)
—|—
which stores a new sorted array in result with all duplicates in the sorted
subarray of hashtags removed, and returns the number of distinct strings in
your result. Similar to number_distinct
, we only consider the first n
elements in the array. So calling remove_duplicates(animals, result, 5)
should return 3 and store result = {"bobcat", "chipmunk", "eagle"}
.
Again, your solution should have O(n) asymptotic running time.
To test your code for the above functions, you can simply add your own test
cases to the file duplicates_test.c, and run the following command:
>> make duplicates
#WhatsTrendingNow
Now you want to see what are the trending hashtags on the Internet. In this
part, you will implement some functions that analyze the frequencies of
hashtags of an input feed. The expected hashtags will be represented by a
sorted array of strings hashtags of length n (i.e. just like before, we only
consider the first n elements of the array), and we will maintain another
array of integers frequencies, where each element frequencies[i] is the number
of times we have seen hashtags[i] in the feed so far, where i ∈ [0, n).
Task 1
In trending.c, implement the function
int count_frequencies(char **hashtags, int *frequencies, int n, char **feed, int feedlength, bool fast)
—|—
which should evaluate to the number of occurrences of strings in feed that do
not appear in the array hashtags, and should update the frequency counts in
frequencies with the number of times each string in feed appears. Note that
feedlength is the size of the subarray we are considering. So for example,
when it is given
hashtags = {“bobcat”, “chipmunk”, “eagle”, “tortoise”}
—|—
and the corresponding frequencies
frequencies = {2,1,3,1}
—|—
then the input feed
feed = {“tortoise”, “tortoise”, “Hare”, “Hare”, “Hare”}
—|—
would cause the frequency count for “tortoise” to be incremented by 2, and
would cause 3 to be returned, because “Hare” was not in hashtags. So
frequencies now becomes {2,1,3,3}.
Note that a precondition of count_frequencies
is that the hashtags must be
sorted, a fact you should exploit. Your function should use the linear search
algorithm when fast is set to false and it should use the binary search
algorithm when fast is true. You can implement this choice with a simple if
statement that decides which function to call duplicating a lot of code is
unnecessary and unhelpful. Feel free to use two library functions we
implemented for you in lib/array_util.c:
// Linear search
int linear_search(char *s, char **A, int n)
// Binary search
int binary_search(char *s, char **A, int n)
—|—
both of which will return the index where string s appearred in subarray A of
length n, and return -1 if s did not appear in the subarray.
Task 2
In trending.c, implement the function
void top_three(char **hashtags, int *frequencies, char **result, int n)
—|—
which should find the three most common hashtags among the subarray hashtags
of length n, according to their counts in frequencies, and store them in
result by the order of most frequent to least frequent, i.e. result[0] is the
most common hashtag, result[1] is the second common one, and result[2] is the
third.
You implementation should have O(n) asymptotic running time, and furthermore,
it should loop over the array frequencies only once.
Just like the previous tasks, you can test your code by adding your own test
cases to the file trending_test.c, and then running:
make trending
#IsItValidThough
As an experienced Twitter user, you can recognize a hashtag in a jiffy, but
your friend, who is unfamiliar with Twitter, doesn’t know how to insert spaces
between the words in your hashtag to interpret it appropriately. You write
down “#imadventurous”, which obviously means “I’m adventurous”. He first tries
to scan through until he first sees a whole word, insert a space, and then
continue scanning. Depending on his vocabulary, he might see “I mad vent” and
then get stuck because there is no such word as “urous”.
Your friend is desperate to find out what exactly your hashtag means, so you
need to give him an algorithm for determining whether it is possible to insert
spaces into a spaceless string to produce a sequence of valid English words.
Task 1
In validity.c, implement the function
bool is_valid_hashtag(char *hashtag)
which should evaluate to true if hashtag can be split into a sequence of valid
English words and false otherwise. Your implementation should be recursive.
You may find the library functions in lib/string_util.c useful:
// Look up dictionary
bool is_word(char *s)
// Extract substring
char *substring(char *dest, char *src, int i, int len)
—|—
where is_word
looks the word s up in the dictionary (in lib/words) and
returns a Boolean indicating whether it is a valid word. is_word is case-
insensitive; substring returns a pointer to the resulting string dest of the
input string src starting at index i and of length len.
Task 2
With a little twist to the algorithm, you can in fact achieve O(n^2)
asymptotic running time for the above problem. First person/pair who submit a
working solution.
Again, you can test your code by adding your own test cases to the file
validity_test.c, and then running:
>> make validity