Python Assignment Help: COMP9102 COVID-19


Analyse COVID-19 data using an open data API.
![COVID-19](https://upload.wikimedia.org/wikipedia/commons/thumb/4/48/Fphar-11-00937-g001.jpg/250px-Fphar-11-00937-g001.jpg)

Objective

In this activity, you will be asked to do three things:

  1. Query an open data API for public COVID information;
  2. Mold this information into several dataframes according to our instructions;
  3. Create quick data transformations and plots to answer specific questions.

This activity’s solutions should be provided in a single IPython Notebook
file, named CW2_A1.ipynb.

Sub-activity: Open Data COVID-19 API

The UK government has a portal with data about the Coronavirus in the UK; it
contains data on cases, deaths, and vaccinations on national and local levels.
The portal is available on this page: https://coronavirus.data.gov.uk . You can acquire the data by querying its
API.
We ask you to use the requests library in order to communicate with the API.
The documentation is available at: https://docs.python-requests.org/en/latest . Read the API
documentation carefully at [https://coronavirus.data.gov.uk/details/developers-guide/main-api](https://coronavirus.data.gov.uk/details/developers-guide/main-api).
Then complete the following tasks in order to acquire the data.

Task 1

Create a function get_API_data(filters, structure) that sends a specific query
to the API and retrieves all the data provided by the API that matches the
query. The function requires two arguments:

  • filters (dictionary) are the filters to be passed to the query, as specified in the API documentation. This will be a dictionary where keys are filter metrics and values are the values of those metrics. For example, you may want to filter the data by nation, date etc. As seen in the API documentation, filters are passed to the API’s URL as a URL parameter. This means you will have to format filters inside get_API_data in a way that the API can accept it as an argument.
  • structure (dictionary) specifies what information the query should return, again as specified in the API documentation. This will be a dictionary where the keys are the names you wish to give to each metric, and the values are the metrics as specified in the API. The structure argument specifies which attributes you wish to obtain from the records that match the filters, such as date, region, daily casualties etc. The argument is passed as a URL parameter to the API’s URL. This means you will have to format structure inside get_API_data in a way that the API can accept it as an argument.

The function get_API_data should return a list of dictionaries answering the
query.
To ensure you receive all data matching your query, use the page URL
parameter. The function should get data from all pages and return everything
as a single list of dictionaries.
An example of the full URL with the filters, structure, and page parameters defined
can be seen in the accompanying listing; this URL, when queried, returns the first page (pages
begin at 1) with data at a regional level, retrieving only the date and new
cases by publishing date, and naming them date and newCases, respectively.
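
The formatting and paging steps described above could be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the endpoint URL, the `key=value;key=value` filter format, the `pagination.next` field, and the 204 status for an empty page are assumptions that should be checked against the API documentation. The helpers `format_filters` and `format_structure` and the optional `session` argument are hypothetical additions, not part of the required signature.

```python
import json

# Assumed endpoint; confirm it in the API documentation before use.
API_ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"


def format_filters(filters):
    """Join a dict of filters into the assumed 'key=value;key=value' form."""
    return ";".join(f"{key}={value}" for key, value in filters.items())


def format_structure(structure):
    """Serialise the structure dict as a compact JSON string."""
    return json.dumps(structure, separators=(",", ":"))


def get_API_data(filters, structure, session=None):
    """Collect the records from every page into one list of dictionaries."""
    if session is None:
        import requests  # imported lazily so the helpers above work without it
        session = requests.Session()

    params = {
        "filters": format_filters(filters),
        "structure": format_structure(structure),
    }
    records = []
    page = 1  # pages begin at 1
    while True:
        response = session.get(
            API_ENDPOINT, params={**params, "page": page}, timeout=30
        )
        if response.status_code == 204:  # assumed: no content, no more pages
            break
        response.raise_for_status()
        payload = response.json()
        records.extend(payload["data"])
        # Assumed response layout: pagination.next is null on the last page.
        if not payload.get("pagination", {}).get("next"):
            break
        page += 1
    return records
```

Passing the parameters through `params` lets requests take care of URL encoding, so the braces and quotes of the JSON structure do not have to be escaped by hand.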

Task 2

Write a script that calls the function get_API_data twice, producing two lists
of dictionaries: results_json_national and results_json_regional. Both lists
should consist of dictionaries with the following key-value pairs:

  • date (string): The date to which this observation corresponds;
  • name (string) : The name of the area covered by this observation (could be a nation, region, a local authority, etc);
  • daily_cases (numeric) : The number of new cases at that date and in that area by specimen date;
  • cumulative_cases (numeric) : The cumulative number of cases at that date and in that area by specimen date;
  • daily_deaths (numeric) : The number of new deaths at that date and in that area after 28 days of a positive test, by publishing date;
  • cumulative_deaths (numeric) : The cumulative number of deaths at that date and in that area after 28 days of a positive test, by publishing date;
  • cumulative_vaccinated (numeric) : The cumulative number of people who completed their vaccination (both doses) by vaccination date;
  • vaccination_age (dictionary or list of dictionaries) : A demographic breakdown of cumulative vaccinations by age intervals for all people.

The first list of dictionaries obtained (results_json_national) should have
data at the national level (England, Wales, Scotland, Northern Ireland). The
second (results_json_regional) should have data at a regional level (London,
North West, North East, etc). Both should contain data for all dates covered
by the API.
Attention: Do not query the API too often, as you might be blocked or
compromise the API’s service. The API service is used by many other
organisations, which rely on it for vital tasks. It is your responsibility to
query the API by respecting its rules. We ask students to keep requests under
10 requests every 100 seconds, and 100 total requests every hour. When
querying the API, if your response has a 429 status code (or a similar code
indicating your query failed), check for a header called “Retry-After”, which
indicates how much time you have to wait before doing another query; you
should wait that long.
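
One way to honour the “Retry-After” header is sketched below. It is a hedged illustration only: `wait_time_from_headers` and `get_with_retry` are hypothetical helper names, the 10-second fallback is an arbitrary assumption, and the sketch assumes the header carries a number of seconds (it may also be an HTTP date, which is not parsed here).

```python
import time


def wait_time_from_headers(headers, default=10.0):
    """Return the seconds to wait, read from a Retry-After header if present."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        return max(0.0, float(value))
    except ValueError:
        # Retry-After may also be an HTTP date; fall back to the default here.
        return default


def get_with_retry(session, url, params, max_attempts=5, sleep=time.sleep):
    """GET a URL, pausing as instructed whenever the server answers 429."""
    for _ in range(max_attempts):
        response = session.get(url, params=params, timeout=30)
        if response.status_code != 429:
            return response
        sleep(wait_time_from_headers(response.headers))
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts")
```

The `sleep` argument exists only so the waiting behaviour can be checked without actually pausing; in normal use the default `time.sleep` applies.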

Sub-activity: Shaping the COVID data into different dataframes

These two lists of dictionaries from before are a good start. However, they
are not the easiest way to turn data into insight.
In the following, you will take the data from these lists of dictionaries and
turn it into Pandas dataframes. Dataframes have quick transformation,
summarising, and plotting functionalities which let you analyse the data more
easily.
The code should use native Pandas methods. Implementing the functionality
manually (e.g. using loops or directly accessing the arrays inside the
dataframes) will be penalised. Follow the library’s documentation. Remember
that Pandas methods can very often be chained; use that to your advantage.

Task 3

Concatenate the two lists of dictionaries (results_json_national and
results_json_regional) into a single list.

Task 4

Transform this list into a dataframe called covid_data, which should now have
one column for each metric retrieved from the API (date, name, daily_cases,
cumulative_cases, daily_deaths, cumulative_deaths, cumulative_vaccinated,
vaccination_age).
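
Tasks 3 and 4 might look like the following sketch. The records are tiny made-up stand-ins for the real API results, reduced to a few keys for brevity.

```python
import pandas as pd

# Made-up records standing in for the two API result lists.
results_json_national = [
    {"date": "2021-02-08", "name": "England", "daily_cases": 100, "vaccination_age": []},
]
results_json_regional = [
    {"date": "2021-02-08", "name": "London", "daily_cases": 40, "vaccination_age": []},
]

# Task 3: list concatenation keeps every dictionary, national records first.
results_json = results_json_national + results_json_regional

# Task 4: each dictionary becomes a row and each key becomes a column.
covid_data = pd.DataFrame(results_json)
```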

Task 5

The regional portion of the dataframe is a breakdown of the data from England.
Thus, all observations in England are contained in the dataframe twice. Hence
you can erase all rows in which the name column has the value “England”.

Task 6

The column name has an ambiguous title. Change it to area.

Task 7

The date column is of type object, which is for strings and other types. This
makes it harder to filter/select by month, year, to plot according to time,
etc. Convert this entire column to the datetime type.
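
Because Pandas methods chain, Tasks 5 to 7 could be expressed in a single pipeline. The sketch below uses made-up rows and is one possible arrangement, not the required one.

```python
import pandas as pd

# Made-up rows; the real dataframe has more columns.
covid_data = pd.DataFrame(
    {
        "date": ["2021-02-07", "2021-02-08", "2021-02-08"],
        "name": ["England", "London", "Wales"],
        "daily_cases": [100, 40, 10],
    }
)

covid_data = (
    covid_data
    .loc[lambda df: df["name"] != "England"]             # Task 5: drop England rows
    .rename(columns={"name": "area"})                     # Task 6: clearer column title
    .assign(date=lambda df: pd.to_datetime(df["date"]))   # Task 7: strings to datetime
    .reset_index(drop=True)
)
```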

Task 8

Print a summary of the dataframe, which includes the amount of missing data.
How you measure the amount of missing data is up to you. Please document your
decision in the code.
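
A possible summary is sketched below, on made-up data. The choice to measure missing data per column, both as a count and as a share of rows, is an assumption; the task leaves that decision open.

```python
import pandas as pd

# Made-up rows with some gaps.
covid_data = pd.DataFrame(
    {
        "area": ["London", "London", "Wales"],
        "cumulative_cases": [10.0, None, 5.0],
        "cumulative_deaths": [None, None, 1.0],
    }
)

# Decision: measure missing data per column, as a count and as a share of rows.
missing_summary = pd.DataFrame(
    {
        "missing_count": covid_data.isna().sum(),
        "missing_share": covid_data.isna().mean(),
    }
)

covid_data.info()        # dtypes and non-null counts per column
print(missing_summary)
```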

Task 9

For the cumulative metrics columns (cumulative_deaths, cumulative_cases,
cumulative_vaccinated), replace missing values with the most recent (up to the
date corresponding to that missing value) existing values for that area. If
none exist, leave it as it is. For example, if there is a missing value in the
cumulative_deaths column at the date 08-02-2021, look at all non-missing
values in the cumulative_deaths column whose date is earlier than 08-02-2021
and take the most recent.
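
The fill rule above amounts to a forward fill within each area once the rows are in date order. A minimal sketch on made-up data, assuming the column has already been renamed to area and the dates converted:

```python
import pandas as pd

# Made-up rows, deliberately out of date order.
covid_data = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-02-08", "2021-02-06", "2021-02-07", "2021-02-07", "2021-02-08"]
        ),
        "area": ["London", "London", "London", "Wales", "Wales"],
        "cumulative_deaths": [None, 5.0, None, None, 2.0],
    }
)
cumulative_cols = ["cumulative_deaths"]

# Sort so that "most recent earlier value" is simply the previous row per area.
covid_data = covid_data.sort_values(["area", "date"])
covid_data[cumulative_cols] = covid_data.groupby("area")[cumulative_cols].ffill()
```

Grouping by area prevents a value from one area leaking into another, and a gap with no earlier value in its area (the first Wales row here) stays missing.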

Task 10

Now, remove the rows that still have missing values in the cumulative metrics
columns mentioned in the last question.

Task 11

Rolling averages are often better indicators of daily quantitative metrics
than raw daily measures. Create two new columns: one with the 7-day rolling
average of new daily cases in that area, including the current day, and one
with the same calculation but for daily deaths. Name them daily_cases_roll_avg
and daily_deaths_roll_avg.
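
A sketch of the per-area rolling average, shown for cases only, follows. It assumes one row per area per day, so that a window of 7 rows equals 7 days, and it leaves the first six days of each area missing; both are assumptions rather than requirements of the task.

```python
import pandas as pd

# Made-up data: eight consecutive days for two areas.
covid_data = pd.DataFrame(
    {
        "date": pd.date_range("2021-02-01", periods=8).tolist() * 2,
        "area": ["London"] * 8 + ["Wales"] * 8,
        "daily_cases": list(range(1, 9)) + [10] * 8,
    }
).sort_values(["area", "date"])

# The window covers the current row and the six rows before it, within each area.
covid_data["daily_cases_roll_avg"] = (
    covid_data.groupby("area")["daily_cases"]
    .transform(lambda s: s.rolling(window=7).mean())
)
```

The same pattern applies to daily_deaths. Passing `min_periods=1` to `rolling` would instead average over however many days are available at the start of each series.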

Task 12

Now that we have the rolling averages, drop the columns daily_deaths and
daily_cases as they contain redundant information.

Task 13

A column in the dataframe covid_data has dictionaries as values. We can
transform this column into a separate dataframe. Copy the columns date, area,
and vaccination_age into a new dataframe named covid_data_vaccinations, and
drop the vaccination_age column from covid_data.

Task 14

Transform covid_data_vaccinations into a new dataframe called
covid_data_vaccinations_wide. Each row must represent available vaccination
metrics for a specific date, in a specific area, and for a specific age
interval. The dataframe must have the following columns:

  • date: The date when the observation was made;
  • area: The region/nation where the observation was made;
  • age: The age interval that the observation applies to;
  • VaccineRegisterPopulationByVaccinationDate: Number of people registered for vaccination;
  • cumPeopleVaccinatedCompleteByVaccinationDate: Cumulative number of people who completed their vaccination;
  • newPeopleVaccinatedCompleteByVaccinationDate: Number of new people completing their vaccination;
  • cumPeopleVaccinatedFirstDoseByVaccinationDate: Cumulative number of people who took their first dose of vaccination;
  • newPeopleVaccinatedFirstDoseByVaccinationDate: Number of new people taking their first dose of vaccination;
  • cumPeopleVaccinatedSecondDoseByVaccinationDate: Cumulative number of people who took their second dose of vaccination;
  • newPeopleVaccinatedSecondDoseByVaccinationDate: Number of new people taking their second dose of vaccination;
  • cumVaccinationFirstDoseUptakeByVaccinationDatePercentage: Percentage of people out of that demographic who took their first dose of vaccination;
  • cumVaccinationCompleteCoverageByVaccinationDatePercentage: Percentage of people out of that demographic who took all their doses of vaccination;
  • cumVaccinationSecondDoseUptakeByVaccinationDatePercentage: Percentage of people out of that demographic who took their second dose of vaccination.
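
One way to reach this shape is to explode the list column into one row per age interval and then expand the dictionaries into columns. The sketch below uses made-up values and only one metric; it assumes each vaccination_age cell holds a list of dictionaries, and it drops rows without a breakdown, which is a design choice rather than a requirement.

```python
import pandas as pd

# Made-up data: one row with a breakdown, one without.
covid_data_vaccinations = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-02-08", "2021-02-08"]),
        "area": ["London", "Scotland"],
        "vaccination_age": [
            [
                {"age": "50_54", "cumPeopleVaccinatedFirstDoseByVaccinationDate": 10},
                {"age": "55_59", "cumPeopleVaccinatedFirstDoseByVaccinationDate": 20},
            ],
            [],  # no breakdown reported for this row
        ],
    }
)

exploded = (
    covid_data_vaccinations
    .explode("vaccination_age")            # one row per age-interval dictionary
    .dropna(subset=["vaccination_age"])    # empty lists explode to NaN
    .reset_index(drop=True)
)

# Expand each dictionary into columns and put them next to date and area.
covid_data_vaccinations_wide = pd.concat(
    [
        exploded[["date", "area"]],
        pd.json_normalize(exploded["vaccination_age"].tolist()),
    ],
    axis=1,
)
```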

Sub-activity: Aggregating, plotting, and analysing

We have created dataframes for our analysis. We will ask you to answer several
questions with the data from the dataframes. For each question, follow the
same three steps:

  1. aggregate and/or shape the data to answer the question and save it as an intermediate dataframe;
  2. apply plot methods on the dataframe to create a single plot to visualise the transformed data;
  3. write your conclusion as comments or markdown.

Some questions will use data and plots from a previous question and require
you only to answer the question; in this case, either have a cell with only
comments or only a markdown cell with the answer.
Plotting should be done exclusively using native Pandas visualisation methods,
described here. To make these answers clear for us, we ask you to use concise
and clear transformations and to add comments to your code.

Task 15

Show the cumulative cases in London as they evolve through time.
Question: Is there a period in time in which the cases plateaued?

Task 16

Show the evolution through time of cumulative cases summed over all areas.
Question: How does the pattern seen in London hold country-wide?

Task 17

Now, instead of summing the data over areas, show us the evolution of
cumulative cases of different areas as different lines in a plot.
Question: What patterns do all nations/regions share?
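
The difference between the shapes needed for Tasks 16 and 17 can be sketched on made-up data as follows: a sum per date gives one series, while a pivot gives one column per area.

```python
import pandas as pd

# Made-up rows for two areas over two days.
covid_data = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-02-07", "2021-02-07", "2021-02-08", "2021-02-08"]),
        "area": ["London", "Wales", "London", "Wales"],
        "cumulative_cases": [100, 30, 120, 35],
    }
)

# Task 16 shape: one total per date.
cases_total = covid_data.groupby("date")["cumulative_cases"].sum()

# Task 17 shape: one row per date, one column per area.
cases_by_area = covid_data.pivot(index="date", columns="area", values="cumulative_cases")
```

Calling `.plot()` on `cases_total` draws a single line, while calling it on `cases_by_area` draws one line per column, that is, one per area.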

Task 18

Question: As a data scientist you will often need to interpret data insights,
based on your own judgement and expertise. Considering the data and plot from
the last question, what event could have taken place in June-July that could
justify the trend seen from there onward?

Task 19

Show us the evolution of cumulative deaths in London through time.
Question: Is there a noticeable period in time when the ongoing trend is
broken? When?

Task 20

Question: Based on the data and plot from the last question, is there any
similarity between trends in cumulative cases and cumulative deaths?

Task 21

Create a new column, cumulative_deaths_per_cases, showing the ratio between
cumulative deaths and cumulative cases in each row. Show us its sum over all
regions/nations as a function of time.
Question: What overall trends can be seen?
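
The ratio column and its sum over areas might be built as in this sketch, on made-up numbers. Rows with zero cumulative cases would produce infinite or missing ratios; how to treat them is left open here.

```python
import pandas as pd

# Made-up rows for two areas over two days.
covid_data = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-02-07", "2021-02-07", "2021-02-08", "2021-02-08"]),
        "area": ["London", "Wales", "London", "Wales"],
        "cumulative_cases": [100, 50, 200, 100],
        "cumulative_deaths": [10, 1, 30, 4],
    }
)

# Row-wise ratio of cumulative deaths to cumulative cases.
covid_data = covid_data.assign(
    cumulative_deaths_per_cases=lambda df: df["cumulative_deaths"] / df["cumulative_cases"]
)

# Sum of the ratio over all areas, indexed by date and ready for plotting.
ratio_by_date = covid_data.groupby("date")["cumulative_deaths_per_cases"].sum()
```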

Task 22

Question: Based on the data and plot from the last question, it seems like, in
June-July, the graph’s inclination gets steeper. What could be a reasonable
explanation?

Task 23

Show us the sum of cumulative vaccinations over all areas as a function of
time.
Question: Are there any relationships between the trends seen here and the
ones seen in Task 21?

Task 24

Show us the daily cases rolling average as a function of time, separated by
areas.
Question: Is there a specific area that seems to escape the general trend in
any way? Which one and how?

Task 25

Show us the daily cases rolling average as a function of time for the area
identified in the previous question alongside another area that follows the
general trend, in order to compare them.
Question: What reasons might there be to justify this difference?

Task 26

To be able to compare numbers of cases and deaths, we should normalise them.
Create two new columns, daily_cases_roll_avg_norm and
daily_deaths_roll_avg_norm, obtained by performing a simple normalisation on
all values in the daily_cases_roll_avg and daily_deaths_roll_avg columns; for
each column, you divide all values by the maximum value in that column.
Now, on the same line plot with date as the x-axis, plot two lines: the
normalised rolling average of deaths and the normalised rolling average of
cases summed over all areas.
Question: Are daily trends of cases and deaths increasing and decreasing at
the same rates? What part of the plot tells you this?
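
The max-normalisation and the sum over areas could be sketched as below, on made-up values. Dividing a dataframe by the series of its column maxima aligns on column names, so both columns are normalised in one step.

```python
import pandas as pd

# Made-up rolling averages for two areas over two days.
covid_data = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-02-07", "2021-02-07", "2021-02-08", "2021-02-08"]),
        "area": ["London", "Wales", "London", "Wales"],
        "daily_cases_roll_avg": [50.0, 10.0, 100.0, 20.0],
        "daily_deaths_roll_avg": [1.0, 0.5, 2.0, 1.0],
    }
)

roll_cols = ["daily_cases_roll_avg", "daily_deaths_roll_avg"]

# Divide every value by the maximum of its own column and name the result *_norm.
normalised = (covid_data[roll_cols] / covid_data[roll_cols].max()).add_suffix("_norm")
covid_data = covid_data.join(normalised)

# Sum the normalised values over all areas; plotting this gives the two lines.
norm_by_date = covid_data.groupby("date")[list(normalised.columns)].sum()
```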

Task 27

The dataframe covid_data_vaccinations_wide has some columns expressed as
percentage of population. First, split this dataframe into two dataframes, one
for London, one for Scotland.
Now, mould the London dataframe such that each row corresponds to a date, each
column corresponds to an age interval, and the data in a dataframe cell is the
value of cumVaccinationFirstDoseUptakeByVaccinationDatePercentage for that age
interval and date.
Plot the London dataframe as a line chart with multiple lines, each
representing an age interval, showing the growth in vaccination coverage per
age group.
Because this plot will generate over ten lines, colours will repeat. Add this
argument to your call of the plot() method: style=['--' for _ in range(10)] . This will force the first ten lines to become dashed.
Question: Were all age groups vaccinated equally and at the same time, or was
there a strategy employed? What strategy does the plot indicate and why?
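
The split and the reshaping described in this task might look like the following sketch. The rows, age labels, and percentages are made up for illustration.

```python
import pandas as pd

# Made-up rows for two areas and two age intervals.
covid_data_vaccinations_wide = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-02-07", "2021-02-07", "2021-02-08", "2021-02-08", "2021-02-08"]
        ),
        "area": ["London", "London", "London", "London", "Scotland"],
        "age": ["50_54", "55_59", "50_54", "55_59", "50_54"],
        "cumVaccinationFirstDoseUptakeByVaccinationDatePercentage": [
            10.0, 20.0, 12.0, 25.0, 15.0
        ],
    }
)
metric = "cumVaccinationFirstDoseUptakeByVaccinationDatePercentage"

# Split into one dataframe per area.
london = covid_data_vaccinations_wide.query("area == 'London'")
scotland = covid_data_vaccinations_wide.query("area == 'Scotland'")

# One row per date, one column per age interval, cells hold the uptake percentage.
london_uptake = london.pivot(index="date", columns="age", values=metric)
```

Calling `london_uptake.plot(style=['--' for _ in range(10)])` would then draw one line per age interval, with the first ten dashed.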

Task 28

Do the same transformations asked in the last question, but for the Scotland
dataframe.
Question: In both plots, compare how vaccination evolved for two sections of
population: 50-64 years and 65-79 years. Were there any differences in the
strategies employed between London and Scotland for dealing with both sections


Author: SafePoker
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit SafePoker when reposting.