In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
In [3]:
covid = pd.read_csv("./Datas/owid-covid-data.csv")
Using the COVID data, let's find out which regions recorded the most confirmed cases.
In [19]:
covid.head()
Out[19]:
| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | Asia | Afghanistan | 2020-02-24 | 5.0 | 5.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
1 | AFG | Asia | Afghanistan | 2020-02-25 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
2 | AFG | Asia | Afghanistan | 2020-02-26 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
3 | AFG | Asia | Afghanistan | 2020-02-27 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
4 | AFG | Asia | Afghanistan | 2020-02-28 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
5 rows × 67 columns
In [37]:
covid.continent.unique()
Out[37]:
array(['Asia', nan, 'Europe', 'Africa', 'North America', 'South America', 'Oceania'], dtype=object)
In [38]:
df = covid.groupby("continent")["total_cases"].sum().sort_values(ascending=False)  # select the column before summing
In [39]:
df
Out[39]:
continent
Asia             2.645692e+10
Europe           2.602631e+10
North America    2.162411e+10
South America    1.457782e+10
Africa           3.025414e+09
Oceania          1.703324e+08
Name: total_cases, dtype: float64
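Because `total_cases` is a cumulative count, summing it over every date adds each day's running total together, which is likely why the figures above reach the tens of billions. The following is a minimal sketch of one alternative, assuming each location's largest `total_cases` value is its latest cumulative count. The small `sample` frame is invented purely for illustration and is not the OWID data.

```python
import pandas as pd

# Hypothetical miniature of the OWID layout (all values are invented).
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Asia", "Europe", "Europe"],
    "location": ["A", "A", "B", "B", "C", "C"],
    "date": ["2020-01-01", "2020-01-02"] * 3,
    "total_cases": [1.0, 3.0, 2.0, 5.0, 4.0, 4.0],
})

# Take the latest (largest) cumulative count per location first,
# then add the locations together within each continent.
latest = sample.groupby(["continent", "location"])["total_cases"].max()
by_continent = latest.groupby(level="continent").sum().sort_values(ascending=False)
print(by_continent)
```

With this approach each location contributes its final count once, instead of once per day.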
In [50]:
plt.plot(df)
plt.ylim(0,30000000000)
plt.xlabel("Continent")
plt.ylabel("cases")
plt.title("Total cases by Continents")
plt.xticks(rotation = 30)
plt.show()
In [108]:
covid.groupby("location")["total_cases"].sum().sort_values(ascending=False)
Out[108]:
location
World                  9.188144e+10
High income            4.233457e+10
Upper middle income    2.799119e+10
Asia                   2.645692e+10
Europe                 2.602631e+10
                           ...
Tokelau                0.000000e+00
Niue                   0.000000e+00
Northern Cyprus        0.000000e+00
Pitcairn               0.000000e+00
Guernsey               0.000000e+00
Name: total_cases, Length: 238, dtype: float64
In [125]:
covid_no_world = covid[covid["location"] != "World"]  # Series.drop() removes index labels, not values, so filter rows with a boolean mask
In [153]:
covid.location
Out[153]:
0         Afghanistan
1         Afghanistan
2         Afghanistan
3         Afghanistan
4         Afghanistan
             ...
162602       Zimbabwe
162603       Zimbabwe
162604       Zimbabwe
162605       Zimbabwe
162606       Zimbabwe
Name: location, Length: 162607, dtype: object
In [173]:
# Aggregate rows (world, continents, income groups) that are not individual countries
aggregates = ["World", "Asia", "Africa", "High income", "Upper middle income", "Europe",
              "North America", "Lower middle income", "European Union", "South America"]
location_covid = covid[~covid["location"].isin(aggregates)]
loc_covid = location_covid.groupby("location")["total_cases"].sum().sort_values(ascending=False)
In [174]:
loc_covid
Out[174]:
location
United States                1.833960e+10
India                        1.218840e+10
Brazil                       8.283593e+09
United Kingdom               3.189000e+09
France                       3.146283e+09
                                 ...
Pitcairn                     0.000000e+00
Sint Maarten (Dutch part)    0.000000e+00
Jersey                       0.000000e+00
Tuvalu                       0.000000e+00
Tokelau                      0.000000e+00
Name: total_cases, Length: 228, dtype: float64
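Listing aggregate names by hand makes it easy to miss one. The `continent` column printed earlier contains `nan`, which suggests that aggregate rows such as `World` have no continent. The following is a rough sketch under that assumption, using an invented `sample` frame rather than the real file. It would be worth checking the result against the hand-written list before relying on it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the OWID layout (all values are invented).
sample = pd.DataFrame({
    "continent": ["Asia", np.nan, "Europe", np.nan],
    "location": ["India", "World", "France", "High income"],
    "new_cases": [10.0, 25.0, 5.0, 12.0],
})

# Assumption: aggregate rows are exactly the rows with a missing continent.
countries_only = sample[sample["continent"].notna()]
per_location = (
    countries_only.groupby("location")["new_cases"].sum().sort_values(ascending=False)
)
print(per_location)
```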
In [183]:
df6 = loc_covid.head(10)
In [184]:
df6 = df6.sort_values()
In [210]:
loc_covid["China"]
Out[210]:
66506587.0
In [211]:
df6["China"] = loc_covid["China"]  # add China's value (66506587.0) for comparison
In [212]:
plt.plot(df6)
plt.xticks(rotation=30)
plt.show()
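The locations with the most confirmed COVID-19 cases were the US and India, both populous countries, in first and second place.
I wondered why China was not in the ranking, so I added it to the data above. Contrary to my expectations, its total was very low. In a case like this, there is reason to doubt the reliability of the data.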
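Since population is offered as the explanation for the ranking, one way to test it is to compare cases per million people. The sketch below assumes the data has a `population` column. That column is not visible in the truncated `head()` output above, so treat it as an assumption, and note that the `sample` values are invented.

```python
import pandas as pd

# Hypothetical miniature of the OWID layout (all values are invented).
# "population" is an assumed column name, not shown in the output above.
sample = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "total_cases": [100.0, 300.0, 50.0, 200.0],
    "population": [1_000_000.0, 1_000_000.0, 100_000.0, 100_000.0],
})

# Latest cumulative count and population per location.
latest = sample.groupby("location").agg(
    total_cases=("total_cases", "max"),
    population=("population", "max"),
)

# Normalise by population so large and small countries can be compared.
latest["cases_per_million"] = latest["total_cases"] * 1_000_000 / latest["population"]
ranking = latest["cases_per_million"].sort_values(ascending=False)
print(ranking)
```

In this toy example the smaller location B ranks first once population is taken into account, even though A has more cases in absolute terms.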
In [99]:
covid.groupby("date")["new_cases"].sum()
Out[99]:
date
2020-01        37493.0
2020-02       304387.0
2020-03      3557162.0
2020-04     10290649.0
2020-05     11776327.0
2020-06     17287360.0
2020-07     28649641.0
2020-08     32244386.0
2020-09     34966024.0
2020-10     51922678.0
2020-11     74467013.0
2020-12     82176550.0
2021-01     82395466.0
2021-02     47967457.0
2021-03     63534680.0
2021-04     94119668.0
2021-05     80545061.0
2021-06     46442899.0
2021-07     64331187.0
2021-08     81584162.0
2021-09     65240567.0
2021-10     54744350.0
2021-11     68848336.0
2021-12    111529439.0
2022-01    388495588.0
2022-02    179260175.0
Name: new_cases, dtype: float64
In [55]:
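# df2 is not defined in the cells shown; this daily sum is assumed from its printed output
df2 = covid.groupby("date")["new_cases"].sum()
plt.plot(df2)
plt.show()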
Plotting the dates day by day made the data too dense. Shall we try aggregating by month?
In [57]:
df2
Out[57]:
date
2020-01-01           0.0
2020-01-02           0.0
2020-01-03           0.0
2020-01-04           0.0
2020-01-05           0.0
                 ...
2022-02-13     6260275.0
2022-02-14     7732210.0
2022-02-15     7572471.0
2022-02-16    10836497.0
2022-02-17     8575864.0
Name: new_cases, Length: 779, dtype: float64
In [75]:
df3 = covid  # note: this is a reference, not a copy, so changes to df3 also change covid
In [76]:
df3["date"] = df3["date"].str[:7]  # keep only the "YYYY-MM" part of each date string
In [77]:
df3["date"]
Out[77]:
0         2020-02
1         2020-02
2         2020-02
3         2020-02
4         2020-02
           ...
162602    2022-02
162603    2022-02
162604    2022-02
162605    2022-02
162606    2022-02
Name: date, Length: 162607, dtype: object
In [84]:
df4 = df3.groupby("date")["new_cases"].sum()
In [89]:
plt.plot(df4)
plt.xticks(rotation=90)
plt.title("covid new cases by month")
plt.show()
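Slicing the date string works, but it overwrites the daily dates. A possible alternative, sketched here on an invented `sample` frame, is to parse the dates and group by a monthly period, which leaves the original `date` column untouched.

```python
import pandas as pd

# Hypothetical daily data (all values are invented).
sample = pd.DataFrame({
    "date": ["2020-01-30", "2020-01-31", "2020-02-01", "2020-02-02"],
    "new_cases": [1.0, 2.0, 3.0, 4.0],
})

# Build a separate monthly key instead of overwriting the "date" column.
month = pd.to_datetime(sample["date"]).dt.to_period("M")
monthly = sample.groupby(month)["new_cases"].sum()
print(monthly)
```

For plotting, the period index can be converted to text labels with `monthly.index.astype(str)`.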
In [214]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))
# Widen the notebook container to fit the window