代写数据分析作业的第一部分,使用Pandas完成实现数据抓取。
Requirement
In this stage, you’re going to write code to download the data files, load the
data to Pandas DataFrames, and then answer various questions about the data.
The questions you must answer are below. If a given cell is answering a
question number N, the cell should have a comment that looks like this:
#qN
… code that computes the answer …
—|—
For example, the cell that answers the first question should contain a comment
that says #q1.
Q1: what is the total population across all the countries in the dataset?
Hint 1: pd.read_json(URL) will return a DataFrame by downloading the JSON file
from online at URL. If the downloaded JSON contains a list of dictionaries,
each dictionary will be a row in the DataFrame.
Hint 2: review how to extract a single column as a Series from a DataFrame.
You can add all the values in a Series with the .sum() method.
Q2: what is the first URL in the capitals.txt page?
You may hardcode this URL in your program. You must, however, answer this
question by programmatically extracting the first line from capitals.txt.
Hint: use requests.get to download the capitals.txt, then split it into a
list.
Q3: what is the capital of China?
To solve this problem (and subsequent problems), use requests.get to download
every file listed in capitals.txt and combine all the data in a DataFrame.
Hint 1: construct a DataFrame where every row is from one of the files listed
in capitals.txt. This will be useful for answering other questions as well. If
rows is a list of dictionaries (each representing a row), you can easily
construct a DataFrame with this snippet: DataFrame(rows).
Hint 2: you can use fancy indexing to extract the row where the Country equals
“China”. Then, extract the Capital Series, from which you can grab the only
value with the Series.item() function.
Q4: which 5 countries have the southern-most capitals?
Format: produce your answer as a JSON-formatted list of five countries. The
list should be sorted so that the countries with capitals farther south are
first.
Hint 1: look at the documentation examples of how to sort a DataFrame with the
sort_values function.
Hint 2: look at examples that used the head function.
Q5: which 3 countries have the northern-most capitals?
Format: produce your answer as a JSON-formatted list of three countries. The
list should be sorted so that the countries with capitals farther north are
first.
Q6: for “birth-rate” and “death-rate”, what are various summary statistics
(e.g., mean, max, standard deviation, etc)?
Format: use the describe function on a DataFrame that contains birth-rate and
death-rate columns. You may include summary statistics for other columns in
your output, as long as your summary table has stats for birth and death.
Q7: for “literacy” and “phone”, what are various summary statistics (e.g.,
mean, max, standard deviation, etc)?
Format: a table generated by the describe function.
In some countries, it is standard to use commas instead of periods to indicate
decimals. The literacy and phone data is formatted this way (i.e., decimal
numbers represented as strings, with commas for decimals). You’ll need to
reformat the data to use periods (instead of commas), then convert the column
of strings to a column of floats.
Hint: learn how to use the astype and replace Pandas functions.
Q8: what is the largest land-locked country in Europe?
A “land-locked” country is one that has zero coastline. Largest is in terms of
area.
Q9: what is the largest land-locked country in Africa?
Same as Q8.
Q10: what is the largest land-locked country in South America?
Same as Q8.