In this post, we will deal with data from ECDC and we will explain basic data manipulation in Python with the Pandas package.
In our day, data is everywhere in enormous size and depth. Data science is an emerging field that penetrates every aspect of our life and, lately, it has proved to be an extraordinary weapon for predicting infections from Covid-19 and organizing strategies to limit the damage.
Matplotlib packages we code:
import pandas as pd import matplotlib.pyplot as plt
We download the excel file locally from ECDC site and open it using the
read_excel function of Pandas library. We have named the file as
data.xls in our case.
We can first explore the data and the columns of the dataframe df:
We observe the columns of the dataframe — in our case, we will use the columns:
deaths. Additionally, the name of the country is stored in column
We next select ‘
Italy’ as the country under study. A new column is created named
DateTime of type datetime where we store the day. In the following, we create a new dataframe with the name df_italia which is the same as the dataframe
df_italia_sorted but it is sorted according to the column
df_italia=df[df.countriesAndTerritories=='Italy'] df_italia['DateTime']=pd.to_datetime(df_italia['dateRep'],format="%d/%m/%Y") #We sort according to DateTime df_italia_sorted=df_italia.sort_values(by='DateTime') df_italia_selected=df_italia_sorted[df_italia_sorted.month>4]
We are interested in data after the month of April (i.e., May, June, July, August, … etc) so we choose to filter using the column
month and create a new
Since the data in columns
deaths may have great variation, it is practical in order to understand the trend to use a moving average. We choose a moving average of seven days and we create two new columns (
Moving Average Cases and
Moving Average Deaths) where we store the average values of
#Calculate moving average df_italia_selected['Moving Average Cases']=df_italia_selected.cases.rolling(7,min_periods=1).mean() df_italia_selected['Moving Average Deaths']=df_italia_selected.deaths.rolling(7,min_periods=1).mean()
We now plot the cases and deaths as functions of time. We choose the red color for cases and blue for deaths. It is useful to plot
deaths in the same figure with common x-axis in order to understand possible connection and relation. So, we use the subplots function and first create figure fig and axis ax1 (this will be the axis for the cases and it will be the left axis). We then create ax2 using
twinx function. The values for deaths will be our right axis. A dashed line is used for the average values.
#Figure fig, ax1=plt.subplots() color1='tab:red' ax1.plot(df_italia_selected['DateTime'], df_italia_selected['cases'], color=color1) ax1.plot(df_italia_selected['DateTime'], df_italia_selected['Moving Average Cases'], color=color1,linestyle='dashed') ax1.set_xlabel('Data') ax1.set_ylabel('Cases',color=color1) ax1.tick_params(axis='y',labelcolor=color1) locs, labels=plt.xticks() plt.setp(labels,rotation=90) ax2=ax1.twinx() #instantiate a second axes that shares the same x-axis color2='tab:blue' ax2.plot(df_italia_selected['DateTime'], df_italia_selected['deaths'], color=color2) ax2.plot(df_italia_selected['DateTime'], df_italia_selected['Moving Average Deaths'], color=color2,linestyle='dashed') ax2.set_ylabel('Deaths',color=color2) ax2.tick_params(axis='y',labelcolor=color2) fig.tight_layout() #otherwise the right y-label is slightly clipped
The figure below is the program output.
Cases and deaths as a function of data for Italy