Perform statistical analysis and processing on crime data.
Files required for this assignment
- Assignment2.py - A starting program to help you load the data into Python
- COMP90059_CrimeData_Large_Clean.csv - A clean version of the main data to help you, with the functions other than cleaning… if you need it.
- COMP90059_CrimeData_Large_Dirty.csv - The main data you will need to work on for your final submission
- COMP90059_CrimeData_Small_Clean.csv - Small version of the clean data to let you work on data navigation… if you need it.
- COMP90059_CrimeData_Small_Dirty.csv - Small version of the data to help you clean and navigate the data… if you need it
There are FIVE (5) questions in this assignment. The fifth question will
require you to call the functions you wrote in the first four questions.
Things to look out for in solving the questions are:
- Never be afraid to create extra variables, e.g. to break up the code into conceptual sub-parts, improve readability, or avoid redundancy in your code.
- You are encouraged to write helper functions to simplify your code; you can write as many functions as you like, as long as one of them is the function you are asked to write.
- Commenting of code is something you will be marked on; get some practice writing comments in your code, focusing on:
  - Adding a header block, providing the developer's ID
  - Describing key variables when they are first defined (but not things like index variables in for loops)
  - Describing what “chunks” of code do (i.e. not every line, but chunks of code that perform a particular operation, such as “find the maximum value in the list” or “count the number of vowels”)
Background
The Australian crime statistics database holds crime statistical data that is
freely available on the Australian government website: data.gov.au/dataset.
This data indicates trends in crime covering the whole of Australia, which is
separated between counties and Local Government Authority areas (LGA), over a
number of years. The information held in these databases highlights the number
of crimes committed, from Trespass to Homicide, in a number of geographical
locations.
Your task for this assignment will be to take on a contract as a software
developer/analyst and respond to a realistic project requirement.
Congratulations! You have been appointed by the MUC (Made Up Company) through
the Australian Government to help them ascertain various information, from
within the online crime dataset. The dataset has been vandalised by high tech
criminals and will need your skills to help clean it up, prior to providing
basic analysis.
As part of your task, you are asked to develop an application that can cater
for this request. You will code four (4) functions that perform specific
tasks. In addition, you will write a “main” function that will utilise all the
new functions you have developed.
The Crime Statistics data is provided to you in one or more comma-separated
values (CSV) files. You will find this in the Assignment 2 folder on LMS.
CSV is a simple file format which is widely used for storing tabular data
(data that consists of columns and rows). In a CSV file, columns are separated
by commas, and rows are separated by newlines (so every line of the text file
corresponds to a row of the data). Usually, the first row of the file is a
header row which gives names for the columns.
The Crime Statistics data contains the following columns: ID, Statistical
Division or Subdivision, LGA, Offence category, Subcategory, Year statistics.
- ID (An integer unique ID assigned to each row of data)
- Statistical Division or Subdivision (The broad area the crimes were committed)
- LGA (The Local Governance Area that managed the crime area)
- Offence category (A title of the crime category)
- Subcategory (A breakdown of the crime within each category area)
- Year statistics (from 2002 through to 2012) (Holding a tally corresponding to each crime that took place in that year)
Supplied is a sample of the CSV data, as provided to you by the MUC. In fact,
we have provided 4 data samples: two large and two small. One sample of each
size is contaminated (vandalised), and the other two are clean. The small
samples and the clean samples are provided to assist you in developing your
code. Small samples are processed quicker. Clean samples allow you to
progress without completing the required cleaning task. Your final code
should work on the large-dirty data-set.
In order to clean up and analyse the data, you need a way to take data from a
CSV file and put it into a Python data structure. Fortunately, Python has a
built-in csv library which can do most of the work for you.
In this assignment, you won’t have to use the csv library directly, though. We
will provide you with a helper function called read_data which uses the csv
library to read the data and turn it into a dictionary of dictionaries. For
example, suppose the data above was stored in a file called CrimeDataSet.csv.
To work with this data in Python, we would call (from within the read_data.py
file’s working directory) the following:
read_data("CrimeDataSet.csv")
which would return the following Python dictionary:
{'1': {'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide', 'Subcategory': 'Murder (a)', '2002': '1', '2003': '0', '2004': 'zero', '2005': '1', '2006': '2', '2007': '1', '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012': '1'}}
Note
Notice that all of the values in the nested dictionaries are strings, even the
numeric values. If you want to use the values in numerical calculations, you
will have to typecast them yourself.
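As a rough illustration of the dictionary-of-dictionaries shape and of the string values, here is a minimal sketch of what a helper like read_data might do. It is not the supplied helper: the function name read_rows, the header names in SAMPLE_CSV and the use of an in-memory string instead of a file are all assumptions made for this example.

```python
import csv
import io

# Hypothetical CSV content; the header names in the supplied files may differ.
SAMPLE_CSV = """ID,Division,LGA,Offence,Subcategory,2002,2003,2004
1,Inner Sydney,Botany Bay,Homicide,Murder (a),1,0,zero
"""


def read_rows(handle):
    """Return {ID: {column: value}} from an open CSV file-like object."""
    rows = {}
    for row in csv.DictReader(handle):
        row_id = row.pop("ID")    # the ID becomes the outer key
        rows[row_id] = dict(row)  # the remaining columns form the inner dict
    return rows


data = read_rows(io.StringIO(SAMPLE_CSV))
print(type(data["1"]["2002"]))     # every value is a str, even the counts
print(int(data["1"]["2002"]) + 1)  # cast to int before doing arithmetic
```

Note that the dirty value 'zero' would make int() raise a ValueError, which is one reason the data has to be cleaned before any totals are calculated.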
Nested dictionaries can be confusing. Here are some simple examples of how to
access data in a nested dictionary:
# save the data in a variable
data = {'1': {'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide', 'Subcategory': 'Murder (a)', '2002': '1', '2003': '0', '2004': '0', '2005': '1', '2006': '2', '2007': '1', '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012': '1'}}
# What is the Division of ID '1'?
print(data['1']['Division'])
# What is the second ID's Subcategory?
# (works once the full data set is loaded; this one-row sample has no ID '2')
print(data['2']['Subcategory'])
# What is the sum over all years for ID '1'?
total = 0  # named total so the built-in sum() is not shadowed
for year_data in range(2002, 2012 + 1):
    total += int(data['1'][str(year_data)])
You have been provided with large CSV files containing Crime data within
Australia. Unfortunately, the data has been contaminated and is considered
“dirty”: some criminal-hackers have attacked the data, and intentionally
entered incorrect data values, to subvert the clear understanding of the
crimes committed.
Your first task as a programmer-analyst is to clean up the dirty data and fix
any issues caused by the criminals, for later analysis.
The errors in this data-set consist of the following changes, peppered or
scattered throughout:
- They have entered the text zero instead of the integer value 0
- They have also entered NULL instead of the integer value 0
- They have converted positive numbers to negative numbers (i.e. -10 instead of 10, etc).
- And they have replaced all entries for Trespass (within the Subcategory column) with a cruel capitalised string of text: MUC-SUCK!
To clarify, in the data set, any value given as zero should be an integer 0
(zero), any value with a NULL reference should also be an integer 0 (zero),
all integer values should be positive values (i.e. minus 20 should be
positive 20) and any derogatory remarks about the MUC within the Subcategory
field should actually be read as Trespass.
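The four repair rules can be sketched as two small helpers that work on a single value. This is only one possible reading of the rules: the helper names are hypothetical, the case-insensitive match on zero and NULL is an assumption, and the repaired values are kept as strings because read_data returns strings.

```python
def clean_year_value(value):
    """Return the repaired string for one year cell."""
    if value.lower() in ("zero", "null"):  # assumption: match any capitalisation
        return "0"
    if value.startswith("-"):              # negative counts become positive
        return value[1:]
    return value


def clean_subcategory(value):
    """Restore the vandalised Trespass label; leave other labels alone."""
    return "Trespass" if value == "MUC-SUCK!" else value


print(clean_year_value("zero"), clean_year_value("NULL"), clean_year_value("-10"))
print(clean_subcategory("MUC-SUCK!"))
```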
Task 1 (Clean data)
Write a function called clean which takes one argument called data. It should
be utilised like this:
clean(data)
The data value is a dictionary in the format returned by read_data. This data
has been read directly from a CSV file and is presumed contaminated or dirty!
Your function should construct and return a new data dictionary which is
identical to the input dictionary, except that invalid data values must be
replaced, as described above. You should not need to modify the argument
dictionary variable data. The cleaning process should keep a count of all the
data values it cleans and also return this total, along with the new cleaned
dictionary data set.
Let’s look at the data contained in CrimeDataSetDirty.csv:
{'1': {'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide', 'Subcategory': 'Murder (a)', '2002': '1', '2003': '0', '2004': 'zero', '2005': '1', '2006': '2', '2007': '-1', '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012': '1'}}
Clearly some of the values are invalid! Calling clean on this data would
yield the following cleaned dictionary:
{'1': {'Division': 'Inner Sydney', 'LGA': 'Botany Bay', 'Offence': 'Homicide', 'Subcategory': 'Murder (a)', '2002': '1', '2003': '0', '2004': '0', '2005': '1', '2006': '2', '2007': '1', '2008': '0', '2009': '1', '2010': '0', '2011': '0', '2012': '1'}}
Notice that the 2004 and 2007 values in the nested dictionary, previously
'zero' and '-1', are now '0' and '1' in the cleaned data. Don’t forget the
other repair alterations too, from the list above!
You can assume the following:
- The final cleaned data dictionary should not contain zero (as text), NULL or negative values;
- All year-column data entry statistics (once cleaned) are strings that can be cast to ints;
- Any references within the Subcategory-column, containing derogatory remarks about the MUC should be renamed from the derogatory remark back to Trespass
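A minimal sketch of one way clean could be structured is shown below. It builds a new dictionary rather than changing its argument and returns the cleaned data together with the repair count. The return order (dictionary first, count second), the rule that a year column is any key made only of digits, and the second sample row are assumptions for illustration.

```python
def clean(data):
    """Return (cleaned_copy, number_of_values_repaired); data is left untouched."""
    cleaned = {}
    fixes = 0
    for row_id, row in data.items():
        new_row = {}
        for key, value in row.items():
            fixed = value
            if key == "Subcategory" and value == "MUC-SUCK!":
                fixed = "Trespass"
            elif key.isdigit():  # assumption: year columns look like '2002'
                if value.lower() in ("zero", "null"):
                    fixed = "0"
                else:
                    fixed = value.lstrip("-")  # drop a leading minus sign
            if fixed != value:
                fixes += 1       # count every value that needed repair
            new_row[key] = fixed
        cleaned[row_id] = new_row
    return cleaned, fixes


# Made-up rows in the same shape as the example above (fewer year columns).
dirty = {
    "1": {"Division": "Inner Sydney", "LGA": "Botany Bay", "Offence": "Homicide",
          "Subcategory": "Murder (a)", "2002": "1", "2004": "zero", "2007": "-1"},
    "2": {"Division": "Inner Sydney", "LGA": "Botany Bay", "Offence": "Theft",
          "Subcategory": "MUC-SUCK!", "2002": "NULL", "2004": "3", "2007": "0"},
}
cleaned, fixes = clean(dirty)
print(fixes)
print(cleaned["1"]["2004"], cleaned["1"]["2007"], cleaned["2"]["Subcategory"])
```

Building fresh inner dictionaries keeps the original argument available for comparison, which can help when checking the repair count.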
Task 2 (Worst year)
Write a function called countCrimes that takes in two arguments: data and key.
The data value will be the dictionary containing crime data and the key value
will be a suitable value representing the year statistics data. The function
sums all the values within the key column and returns the total for that
year. For example, calling it with key ‘2012’ should return an integer
representing the total number of crimes in that column.
Using this function, and inside your main method, you must calculate the worst
year for crime (i.e. the year with the maximum total crimes), then store and
display that year and its crime total. You may assume the crime data in data
is “clean”, i.e. all invalid values have been replaced, so all values are
non-negative integers. A clean_data_set.csv file containing clean data is
supplied to allow you to test this method, in case you have not yet been able
to clean the dirty data-set.
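The following sketch shows one possible shape for countCrimes and for the worst-year calculation that would sit in main. The sample rows, the reduced year range and the variable names totals and worst_year are made up for the example; the real data covers 2002 to 2012.

```python
def countCrimes(data, key):
    """Sum the values of one year column over every row of clean data."""
    return sum(int(row[key]) for row in data.values())


# Made-up clean rows with only three year columns to keep the example short.
data = {
    "1": {"Division": "Inner Sydney", "Offence": "Homicide",
          "2002": "1", "2003": "0", "2004": "0"},
    "2": {"Division": "Inner Sydney", "Offence": "Theft",
          "2002": "3", "2003": "5", "2004": "2"},
}

# Total crimes per year, then the year with the largest total.
totals = {str(year): countCrimes(data, str(year)) for year in range(2002, 2004 + 1)}
worst_year = max(totals, key=totals.get)
print("Worst year:", worst_year, "with", totals[worst_year], "crimes")
```

If two years tie for the maximum, max returns the first one it encounters; the assignment text does not say how ties should be handled.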
Task 3 (Worst area)
The MUC are interested in the distribution of crime throughout the different
Statistical Subdivision areas. One way to establish this is to divide each
Subdivision into unique bins or dictionary keys; where a key holds a summation
of all the crimes for all the years within that subdivision area.
Write a function called worstCrime, which takes the argument data, adds up
the values for every year within each Subdivision, then returns a new
dictionary, where the key is the Subdivision name and the value is the total
of all crimes within that area over all years. From within the main function,
store and display the number of Subdivisions found and present the area with
the highest overall crime values as the ‘Worst Area’ and the area with the
lowest overall crime values as the ‘Best Area’.
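As a sketch of the binning idea, the function below accumulates one total per Subdivision. It assumes the Subdivision is stored under the 'Division' key, as in the example dictionary earlier, and that year columns are the keys made only of digits; the sample rows are invented.

```python
def worstCrime(data):
    """Return {subdivision: total crimes over all years} for clean data."""
    totals = {}
    for row in data.values():
        # Add up every year column of this row.
        row_total = sum(int(value) for key, value in row.items() if key.isdigit())
        area = row["Division"]
        totals[area] = totals.get(area, 0) + row_total
    return totals


# Made-up clean rows covering two subdivisions.
data = {
    "1": {"Division": "Inner Sydney", "Offence": "Homicide", "2002": "1", "2003": "0"},
    "2": {"Division": "Inner Sydney", "Offence": "Theft", "2002": "3", "2003": "5"},
    "3": {"Division": "Hunter", "Offence": "Theft", "2002": "2", "2003": "1"},
}

area_totals = worstCrime(data)
worst_area = max(area_totals, key=area_totals.get)
best_area = min(area_totals, key=area_totals.get)
print("Subdivisions found:", len(area_totals))
print("Worst Area:", worst_area, area_totals[worst_area])
print("Best Area:", best_area, area_totals[best_area])
```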
Task 4 (Most active criminal activity)
The MUC are also interested in learning which crime is performed the most
throughout the whole dataset. By acquiring this information, they will be able
to focus on reinforcing security levels, targeting that type of crime more
robustly.
Write a function called mostActiveCrime(data), which returns a dictionary of
the different crime types, which holds a tally (count) of how often those
crimes were committed overall. That is, each key within the dictionary will be
the name of a crime (such as Homicide, or Robbery, etc), and the values
therein, will be the tally of crimes for that particular crime throughout all
years. Finally, within your main function, from the returned dictionary, store
and display the most active/performed crime type and present it as the ‘Most
active Crime overall’, and include its summated value.
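A tally of this kind could look like the sketch below. It assumes the crime name is the value stored under the 'Offence' key (Homicide appears there in the example dictionary); the sample rows and their Subcategory labels are invented.

```python
def mostActiveCrime(data):
    """Return {offence category: total crimes over all years} for clean data."""
    tally = {}
    for row in data.values():
        row_total = sum(int(value) for key, value in row.items() if key.isdigit())
        crime = row["Offence"]
        tally[crime] = tally.get(crime, 0) + row_total
    return tally


# Made-up clean rows; the Subcategory labels are illustrative only.
data = {
    "1": {"Offence": "Homicide", "Subcategory": "Murder (a)", "2002": "1", "2003": "0"},
    "2": {"Offence": "Robbery", "Subcategory": "Robbery without a weapon",
          "2002": "4", "2003": "6"},
    "3": {"Offence": "Robbery", "Subcategory": "Robbery with a firearm",
          "2002": "2", "2003": "1"},
}

tally = mostActiveCrime(data)
top_crime = max(tally, key=tally.get)
print("Most active Crime overall:", top_crime, tally[top_crime])
```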
Task 5 (Providing a final report)
The government has asked the MUC to produce a report on the final status and
crime situation within their supplied dataset. The MUC have asked you to help
them locate the appropriate data for this report.
Write a function called report which takes a filename (called datafile) as an
argument. This function reads the original crime data contained under that
filename, then uses your function (clean) to clean the data. Following this,
you will use your newly created functions (1 to 4) to present some facts about
crime in Australia.
You should assume that the data in datafile is noisy. Your function should
calculate and return the following data-facts as a list:
- The total number of rows in the data file
- The total number of Subdivisions examined in the data
- The total number of Offence Categories
- The worst area and best area for crime (most and least crime counts respectively)
- The most active type of crime.
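To show how the pieces might fit together, here is a compact end-to-end sketch of report. Everything in it is a stand-in: read_data below is a simplified substitute for the supplied helper, the column names, the order of items in the returned list and the three sample rows are assumptions, and worstCrime and mostActiveCrime share a hypothetical helper called totals_by to keep the example short.

```python
import csv
import os
import tempfile


def read_data(filename):
    """Simplified stand-in for the supplied helper: {ID: {column: value}}."""
    data = {}
    with open(filename, newline="") as handle:
        for row in csv.DictReader(handle):
            row_id = row.pop("ID")
            data[row_id] = dict(row)
    return data


def clean(data):
    """Return (cleaned_copy, repair_count) following the four repair rules."""
    cleaned, fixes = {}, 0
    for row_id, row in data.items():
        new_row = {}
        for key, value in row.items():
            fixed = value
            if key == "Subcategory" and value == "MUC-SUCK!":
                fixed = "Trespass"
            elif key.isdigit():
                fixed = "0" if value.lower() in ("zero", "null") else value.lstrip("-")
            if fixed != value:
                fixes += 1
            new_row[key] = fixed
        cleaned[row_id] = new_row
    return cleaned, fixes


def totals_by(data, column):
    """Hypothetical helper: total of all year columns, grouped by one column."""
    totals = {}
    for row in data.values():
        count = sum(int(v) for k, v in row.items() if k.isdigit())
        totals[row[column]] = totals.get(row[column], 0) + count
    return totals


def worstCrime(data):
    return totals_by(data, "Division")


def mostActiveCrime(data):
    return totals_by(data, "Offence")


def report(datafile):
    """Read, clean and summarise the crime data as a list of facts."""
    data, _ = clean(read_data(datafile))
    areas = worstCrime(data)
    crimes = mostActiveCrime(data)
    return [
        len(data),                    # rows in the data file
        len(areas),                   # Subdivisions examined
        len(crimes),                  # Offence categories
        max(areas, key=areas.get),    # worst area
        min(areas, key=areas.get),    # best area
        max(crimes, key=crimes.get),  # most active type of crime
    ]


# Write a tiny made-up dirty file so the sketch can run on its own.
lines = [
    "ID,Division,LGA,Offence,Subcategory,2002,2003",
    "1,Inner Sydney,Botany Bay,Homicide,Murder (a),1,zero",
    "2,Inner Sydney,Botany Bay,Theft,MUC-SUCK!,-4,NULL",
    "3,Hunter,Newcastle,Theft,Fraud,2,1",
]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("\n".join(lines) + "\n")
result = report(tmp.name)
os.remove(tmp.name)
print(result)
```

The task text asks for the worst-year calculation to live in main, so this sketch leaves countCrimes out of report; whether the report should also include the worst year is not stated.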