代写算法类作业,需要用到BFS和Union-Find两个算法。这个作业比较坑的地方是需要对算法进行优化,否则计算不出结果。
Overview
The assignment is sub-divided under the following headings:
- Part 0: Survey and Consent Form
- Part 1: Checkpoint: Pathfinder
- You will write a program to play (and win) the generalized Kevin Bacon trivia game. Your program will take as input ANY two actors/actresses and find the shortest sequence of shared movies between them. Part of this program (unweighted shortest path) is due at the checkpoint deadline.
- Part 2: Final Submission: Pathfinder (Complete) + Actor Connections — Deadline: Monday, November 28
- In addition to completing the weighted shortest path version of Pathfinder, you will write a program that starts with one actor in the graph and adds new movies step-by-step to find the earliest movie year to connect the start actor to the destination actor.
- Part 3: Final Submission: Additional Extension – Deadline: Monday, November 28
- You will write a program that will compute some other interesting statistic or action on a different graph of your choosing. Particularly complex extensions will earn a star point
Abstract
You will explore the “Six Degrees of Kevin Bacon” game. Specifically, you will
design and implement the data structures and algorithms necessary to explore
degrees of separation between Hollywood actors that act in the same movies.
The main focus of this assignment is to design and implement efficient graph-
based data structures and use them to explore large graphs.
What is the “Six Degrees of Kevin Bacon” game?
“Six Degrees of Kevin Bacon” is a parlor game based on the “six degrees of
separation” concept, which posits that any two people on Earth are six or
fewer acquaintance-links apart. That idea eventually morphed into this parlor
game wherein movie buffs challenge each other to find the shortest path
between an arbitrary actor and prolific Hollywood actor Kevin Bacon. It rests
on the assumption that any individual involved in the Hollywood, California,
film industry can be linked through his or her film roles to Kevin Bacon
within six steps. The game requires a group of players to try to connect any
such individual to Kevin Bacon as quickly as possible and in as few links as
possible. It can also be described as a trivia game based on the concept of
the small world phenomenon.
Part 0: Survey and Consent form (complete AFTER filling out CAPEs and
finishing most of PA4)
Link coming up Soon!!
Part 1: Checkpoint: Pathfinder (Unweighted)
We have provided you a tab-separated file movie_casts.tsv that contains the
majority of actors/actresses found in IMDb and the movies they have played in.
Specifically, the file looks like this (“TAB” denotes a single tab character):
Actor/Actress
50 CENT
50 CENT
50 CENT
50 CENT
50 CENT
50 CENT
50 CENT
50 CENT
…
The first column contains the name of the actor/actress, the second column
contains the name of a movie they played in, and the last column contains the
year the movie was made. Each line defines a single actor→movie relationship
in this manner (except for the first line, which is the header). You may
assume that actor→movie relationships will be grouped by actor name, but do
not assume they will be sorted.
Note that multiple movies made in different years can have the same name, so
use movie year as well as title when checking if two are the same. Also note
that some actors have a “(I)” appended to their name - so “Kevin Bacon” is
really “Kevin Bacon (I)”. Make sure you DO NOT format the names of actors or
movies beyond what is given in the tab-separated input file. In other words,
each actor’s name should be taken exactly as the actor’s name appears in the
movie_casts.tsv file: you do not have to (and should not) mess with it. During
grading, the actor’s name in the test file will match the actor’s name in the
movie_casts.tsv file.
In your graph, each actor/actress will define a single node. Two nodes (i.e.,
actors) will be connected by an undirected edge if the corresponding actors
played in the same movie.
Multiple undirected edges can exist between the same two nodes (which would
imply that the two actors played in multiple movies together). Once you load
the movie_casts.tsv file, you should expect to find 11,794 actors or nodes,
14,252 movies, and 4,016,412 directed edges - note that if we implement our
graph with directed edges, every undirected edge will be represented by two
directed edges. You may NOT use any pre-built data structures, like the Boost
Graph Library (BGL), besides what is provided in the C++ STL data structures.
For this part of PA4, you will write a program called pathfinder (in
pathfinder.cpp) to find paths between actors. It will take 4 command-line
arguments:
- The first argument is the name of a text file containing the movie casts in the same format as movie_casts.tsv. This file is quite large (6.4M), so you should create smaller versions to test your implementation as a first step.
- The second argument is a lower-case character equal to u or w - u means “build the graph with unweighted edges”, while w means “build the graph with weighted edges” (where weights are computed with the formula described later).
- The third argument is the name of a text file containing the pairs of actors for which you will find paths. The first line of the file is a header, and each row contains the names of two actors separated by a single tab character.
- The fourth argument is the name for your output text file, which will contain the shortest path between each pair of actors given in the input pairs file in argument 3. The first line of the output will be a header, and each row will contain a path for the corresponding pair of actors in the input pairs file (in the same order). Each path will be formatted as follows: ( )–[ #@ ]–>( )–[ #@ ]–>( )….etc where the movie listed between each pair of actors is one where they both had a role. Note that any given pair of actors may have played in multiple movies together (like Matt Damon and Ben Affleck), and if we are interested in a shortest weighted path, you must display the particular movie between each pair of actors that yields the minimum total path weight when combined with all other movies in the path.
Examples of the input pairs and output paths files are given in test_pairs.tsv
and out_paths_unweighted.tsv, respectively. So calling your program with:
> ./pathfinder movie_casts.tsv u test_pairs.tsv out_paths_unweighted.tsv
where test_pairs.tsv contains:
Actor1/Actress1 Actor2/Actress2
BACON, KEVIN (I)HOUNSOU, DJIMON
BACON, KEVIN (I)KIDMAN, NICOLE
BACON, KEVIN (I)WILLIS, BRUCE
BACON, KEVIN (I)GIAMATTI, PAUL
HOUNSOU, DJIMON50 CENT
should produce an output file out_paths_unweighted.tsv containing the
following (although the particular movies may not match, the total path
weights should match your output):
(actor)–[movie#@year]–>(actor)–…
(BACON, KEVIN (I))–[ELEPHANT WHITE#@2011]–>(HOUNSOU, DJIMON)
(BACON, KEVIN (I))–[SUPER#@2010]–>(MCKAY, COLE S.)–[FAR AND AWAY#@1992]–>(KIDMAN, NICOLE)
(BACON, KEVIN (I))–[SUPER#@2010]–>(MORENO, DARCEL WHITE)–[LAY THE FAVORITE#@2012]–>(WILLIS, BRUCE)
(BACON, KEVIN (I))–[A FEW GOOD MEN#@1992]–>(MOORE, DEMI)–[DECONSTRUCTING HARRY#@1997]–>(GIAMATTI, PAUL)
(HOUNSOU, DJIMON)–[IN AMERICA#@2002]–>(MARTINEZ, ADRIAN (I))–[MORNING GLORY#@2010]–>(50 CENT)
For the CHECKPOINT submission you are only required to have the unweighted
portion of pathfinder working. This means we will test your program with all 4
arguments, except the second argument will always be a u. The weighted portion
will be graded on the FINAL submission This is all that will be graded for the
checkpoint submission.
The complete pathfinder program (as described below) will be graded at the
final submission, so even if you don’t get the “unweighted edges” version of
your program working for the checkpoint, you still need to get the whole thing
working (weighted path find and unweighted) for the final submission.
Files Required:
- pathfinder.cpp
- Makefile (supporting at least make pathfinder)
- Any additional files needed to execute Part 1
Part 2: Final Submission: Pathfinder (Complete) + Actor Connections
In this part, the first thing you will do is complete pathfinder by
implementing the “weighted edges” version of your program (which is needed for
pathfinder for the Final Submission). In this version of your program, you can
treat unweighted edges as weighted edges with a weight of 1 (i.e., a “dummy”
weight), while truly weighted edges will have a weight equal to the age of the
movie (because we will want to choose newer movies over older movies when
connecting two actors). If we are defining an edge between two actors that
played in a movie made in year Y, then the weight of that edge will be:
weight = 1 + (2015 - Y)
Note that we are using 2015 instead of 2016, which is because the dataset only
contains movies released in 2015 and earlier. Don’t accidentally use 2016!
Calling your program with:
> ./pathfinder movie_casts.tsv w test_pairs.tsv out_paths_weighted.tsv
should produce an output file out_paths_weighted.tsv containing the following
(although the particular movies may not match, the total path weights should
match your output):
(actor)–[movie#@year]–>(actor)–…
(BACON, KEVIN (I))–[ELEPHANT WHITE#@2011]–>(HOUNSOU, DJIMON)
(BACON, KEVIN (I))–[R.I.P.D.#@2013]–>(HUSS, TOBY (I))–[LITTLE BOY#@2015]–>(CHAPLIN, BEN)–[CINDERELLA#@2015]–>(MARTIN, BARRIE (II))–[PADDINGTON#@2014]–>(KIDMAN, NICOLE)
(BACON, KEVIN (I))–[R.I.P.D.#@2013]–>(BELTRAN, JONNY)–[THE WEDDING RINGER#@2015]–>(ROGERS, MIMI (I))–[CAPTIVE#@2015]–>(WILLIS, BRUCE)
(BACON, KEVIN (I))–[R.I.P.D.#@2013]–>(HOWARD, ROSEMARY (II))–[THE AMAZING SPIDER-MAN 2#@2014]–>(GIAMATTI, PAUL)
(HOUNSOU, DJIMON)–[THE VATICAN TAPES#@2015]–>(SCOTT, DOUGRAY)–[TAKEN 3#@2014]–>(HARVEY, DON (I))–[THE PRINCE#@2014]–>(50 CENT)
In this part of the assignment, you will implement a program called
actorconnections. For a given movie database and a list of actor pairs, the
actorconnections program should answer the following query for every actor
pair (X, Y) in the given list: “After which year did actors X and Y become
connected?”
By connected, we mean that there exists a path between actors X and Y in the
equivalent movie graph (similar to that constructed in Part 1) with the
exception that the movie graph under consideration only includes movies that
were made until (i.e., before and including) a certain year.
actorconnections will take 4 command-line arguments:
- The first argument is the name of a text file containing the movie casts in the same format as movie_casts.tsv. Again, this file is quite large (6.4M), so you should create smaller versions to test your implementation as a first step.
- The second argument is the name of a text file containing the names of actor pairs on each line separated, with the two actor names are tab-separated (same format as test_pairs.tsv).
- The third argument is the name of your output text file, which should contain in each line an actor pair followed by the year (tab-separated) after which the corresponding actor pair became connected (you will do all actor pairs specified in the file from step 2, one on each line). If two actors are never connected or if one or both of the actors is not in the movie cast file given to you, simply append 9999 in the corresponding line of the output file. To further clarify, if the second argument was a file containing the actor pair “BLANCHETT, CATE” and “REEVES, KEANU” and they only became connected after adding a movie made in 1997, your program should output the actor pair and 1997 in their line of the output file. If they never became connected even after adding all the movies from in the movie cast file to your graph, you should output 9999 on that line.
- The fourth argument should be specified as either bfs or ufind. This option determines which algorithm will be used in your program. If the fourth argument is not given, by default your algorithm should use the union-find data structure (i.e., the equivalent of specifying ufind as the fourth argument). We will test your code with both flags.
Your output text file should look like the following:
Actor1Actor2 Year
BLANCHETT, CATEREEVES, KEANU 1997
KNAPP, DANIELWILLIS, BRUCE 9999
…
Calling your program with:
> ./refactorconnections movie_casts.tsv test_pairs.tsv out_connections_bfs.tsv ufind
should run your code (using the union-find algorithm) to produce an output
file out_connections_bfs.tsv containing the following:
Actor1Actor2 Year
BACON, KEVIN (I)HOUNSOU, DJIMON 1992
BACON, KEVIN (I)KIDMAN, NICOLE 1991
BACON, KEVIN (I)WILLIS, BRUCE 1990
BACON, KEVIN (I)GIAMATTI, PAUL 1992
HOUNSOU, DJIMON50 CENT 2003
We would like you to implement the actorconnections program using both BFS and
union-find and allow the user to select between the two by specifying the
appropriate value for the fourth argument to your executable: - BFS: To answer queries about the connection between actor pairs using BFS, we recommend you start with an empty graph containing only actor names and incrementally add movies in increasing order of the year of the movie. Every time you add a new set of movies made in a specific year, actors that were not connected before may become connected, which can be determined by running BFS on the updated graph.
- Union-Find: Alternatively, the disjoint-set (i.e., “union-find”) data structure allows you to keep track of all connected sets of actors without maintaining the corresponding graph structure. You might still consider adding movies incrementally, and if a movie creates a path between two actors that were not connected before, two disjoint sets would be merged into a single set in your union-find data structure. You should be able to then query your data structure about the connectivity of any specific actor pairs. The performance of your implementation will naturally depend on the efficiency of your Union-Find data structure. We will go over these topics in lecture as well.
We will not test you on the corner case where both the actors are the same.
You can handle it however you want.
Once you have completed and tested both implementations, compare the run time
of each implementation on a file containing actor pair pairs that you generate
yourself. The file should be in the same format as test_pairs.tsv. See how the
run times compare when you repeat the same query multiple times. For example,
in your input file (specified as the second argument), have the same actor
pair appear 100 times and calculate the time it took to answer all 100 queries
using your BFS implementation and then using your union-find implementation.
To time your code, you can use the timing classes given to you in PA2. Analyze
the timing results for the two implementations and write your analysis in the
file Report.pdf.
In addition, answer the following questions in Report.pdf: - Which implementation is better and by how much?
- When does the union-find data structure significantly outperform BFS (if at all)?
- What arguments can you provide to support your observations?
Files Required:
- pathfinder.cpp
- actorconnections.cpp
- extension.cpp
- Makefile (supporting at least make pathfinder, make actorconnections, and make extension)
- extension.txt
- Report.pdf
- Any additional files needed to execute Parts 1-3
Part 3: Final Submission -: Additional Extension
In Part 3, you should choose any additional extension in which you report some
interesting graph properties or solve some interesting problem on a data set
different from the kevin bacon movie data set that we provided. There are a
number of extensions you can make to this assignment, and rather than choosing
a data set and a problem for you, we’ve left it open for you to decide what
you want to do. Do you want to pull in data from movie reviews to augment your
weighting so you prefer more highly rated movies? Do you want to analyze
friend circles from the facebook ego network and make interesting conclusions?
Do you want to provide k friend recommendations to a user based on the number
of mutual/common friends? Do you want to study on how tweets evolve over time?
Pretty much any cool non-trivial graph extension is fair game here. If you are
concerned about whether or not your idea is “non-trivial” enough, ask on
Piazza. It is important that you solve a problem that is considerably
different from path-finder / actorConnections. You may not solve the same
problem on a different graph.
These videos give you some leads and classes of questions that you might
answer using graphs. This extension is really about your interests and your
creativity. Feel free to come up with a completely new idea/ solve similar
problems on other datasets/ tweak the ideas described in the videos. We really
encourage you to keep looking and research. Google is your friend here.
You need to implement your idea in a program called extension (in a file
called extension.cpp) and it is worth 3 points and potentially a star point.
For full credit, you will need to provide a short write-up in the file
extension.txt which includes the problem you solved, how you solved it, and
how you tested it. Note that this file will be in plain text format. Include
in the write-up how the grader should run your program. If this is not
provided, you will receive a 0 for this part. You must also include a
tsv/csv/data file for a smaller version of the graph for us to test the
program. The data file(s) that you submit must be small and must contain no
more than 100 vertices. Also provide a link to where we can download the
entire data set in extension.txt. If you did some pre-processing to the data
set, you must include that code as well in your submission and tell. You must
clearly give all steps to get the data set and reproduce the results that you
have provided. DO NOT PUSH THE ENTIRE DATA SET TO GIT.
REMEMBER TO ADD THE LINE !fileName TO THE .gitignore FILE. (where the fileName
is the fileName of the smaller version of the graph on which we can test your
code).
Since this is a challenging extension that requires a substantial amount of
work, the extension may be deemed Star Point worthy. Again, we won’t say yes
or no for any particular problem. The graders will deem it Star Point worthy
based on the difficulty of the problem attempted, correctness and quality of
write-up/testing.
Super challenging, complex extensions may even receive more than 1 star point.
Guidelines to Submit the Assignment
When you have completed the milestone you were working on (either Checkpoint
or Final), you MUST submit it. Before submitting, please insure that any main
methods you implement return 0 upon completion. You will do this by (1)
pushing your changes to GITHUB (i.e. push the required files ONLY) and (2)
Submitting on Vocareum, in that order.
For detailed instructions look at the ppt.
Format
We are giving very specific instructions on how to format your programs’
output because our auto-grader will parse your output in these formats, and
any deviation from these exact formats will cause the autograder to take off
points. There will be no special attention given to submissions that do not
output results in the correct format. Although we will still give partial
credit to the correctness of your results, if you do not follow the exact
formatting described here, you are at risk of losing all the points for that
portion of the assignment. NO EXCEPTIONS.