≡ Menu

Basic data manipulation in Python…

In this post, we will deal with data from ECDC and we will explain basic data manipulation in Python with the Pandas package.

In our day, data is everywhere in enormous size and depth. Data science is an emerging field that penetrates every aspect of our life and, lately, it has proved to be an extraordinary weapon for predicting infections from Covid-19 and organizing strategies to limit the damage.

To import Pandas and Matplotlib packages we code:

import pandas as pd
import matplotlib.pyplot as plt

We download the excel file locally from ECDC site and open it using the read_excel function of Pandas library. We have named the file as data.xls in our case.

df=pd.read_excel("data.xlsx", engine="openpyxl")

We can first explore the data and the columns of the dataframe df:

We observe the columns of the dataframe — in our case, we will use the columns: dateRep, cases and deaths. Additionally, the name of the country is stored in column countriesAndTerritories.

We next select ‘Italy’ as the country under study. A new column is created named DateTime of type datetime where we store the day. In the following, we create a new dataframe with the name df_italia which is the same as the dataframe df_italia_sorted but it is sorted according to the column DateTime

df_italia=df[df.countriesAndTerritories=='Italy']
df_italia['DateTime']=pd.to_datetime(df_italia['dateRep'],format="%d/%m/%Y")

#We sort according to DateTime
df_italia_sorted=df_italia.sort_values(by='DateTime')

df_italia_selected=df_italia_sorted[df_italia_sorted.month>4]

We are interested in data after the month of April (i.e., May, June, July, August, … etc) so we choose to filter using the column month and create a new dataframe df_italia_selected.

Since the data in columns cases and deaths may have great variation, it is practical in order to understand the trend to use a moving average. We choose a moving average of seven days and we create two new columns (Moving Average Cases and Moving Average Deaths) where we store the average values of cases and deaths.

#Calculate moving average

df_italia_selected['Moving Average Cases']=df_italia_selected.cases.rolling(7,min_periods=1).mean()
df_italia_selected['Moving Average Deaths']=df_italia_selected.deaths.rolling(7,min_periods=1).mean()

We now plot the cases and deaths as functions of time. We choose the red color for cases and blue for deaths. It is useful to plot cases and deaths in the same figure with common x-axis in order to understand possible connection and relation. So, we use the subplots function and first create figure fig and axis ax1 (this will be the axis for the cases and it will be the left axis). We then create ax2 using twinx function. The values for deaths will be our right axis. A dashed line is used for the average values.

#Figure

fig, ax1=plt.subplots()
color1='tab:red'

ax1.plot(df_italia_selected['DateTime'], df_italia_selected['cases'], color=color1)

ax1.plot(df_italia_selected['DateTime'], df_italia_selected['Moving Average Cases'], color=color1,linestyle='dashed')

ax1.set_xlabel('Data')

ax1.set_ylabel('Cases',color=color1)
ax1.tick_params(axis='y',labelcolor=color1)

locs, labels=plt.xticks()
plt.setp(labels,rotation=90)

ax2=ax1.twinx() #instantiate a second axes that shares the same x-axis
color2='tab:blue'

ax2.plot(df_italia_selected['DateTime'], df_italia_selected['deaths'], color=color2)
ax2.plot(df_italia_selected['DateTime'], df_italia_selected['Moving Average Deaths'], color=color2,linestyle='dashed')

ax2.set_ylabel('Deaths',color=color2)
ax2.tick_params(axis='y',labelcolor=color2)

fig.tight_layout() #otherwise the right y-label is slightly clipped

The figure below is the program output.

Cases and deaths as a function of data for Italy

{ 0 comments… add one }

Rispondi