Oh-Seung-Rok

Hello. I'm Oh Seung-rok, and I want to become an analyst who improves products and services through quantitative analysis and anticipates what consumers need. Portfolio [https://seungrok0317.com]

Kaggle - Covid19 Johns Hopkins

29 Mar 2021 » Kaggle

COVID-19 from the Johns Hopkins Dataset

  • Dataset: https://www.kaggle.com/antgoldbloom/covid19-data-from-john-hopkins-university
  • Live data that is updated in real time.
  • It contains time series data, so the trend over time is what matters.

Loading libraries and data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df_case = pd.read_csv('RAW_global_confirmed_cases.csv')
df_death = pd.read_csv('RAW_global_deaths.csv')

pd.set_option('display.max_columns', None)
df_case.head()
  Country/Region Province/State       Lat       Long  1/22/20  1/23/20  ...  3/23/21  3/24/21
0    Afghanistan            NaN  33.93911  67.709953        0        0  ...    56177    56192
1        Albania            NaN  41.1533   20.1683          0        0  ...   121847   122295
2        Algeria            NaN  28.0339    1.6596          0        0  ...   116349   116438
3        Andorra            NaN  42.5063    1.5218          0        0  ...    11591    11638
4         Angola            NaN -11.2027   17.8739          0        0  ...    21774    21836

(One column per day from 1/22/20 through 3/24/21; the intermediate date columns are omitted here.)

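Restructuring the data and visualization

  • Latitude and longitude are not used. Large countries are split into many provinces, so grouping the rows by country should make the analysis easier.
  • The date columns are stored as plain strings. Convert them to datetime objects and make the date the index.

Restructuring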

df_case['Country/Region'].value_counts()
China             33
Canada            16
United Kingdom    12
France            12
Australia          8
                  ..
Ethiopia           1
Montenegro         1
Uzbekistan         1
Nicaragua          1
Chad               1
Name: Country/Region, Length: 192, dtype: int64
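def fix_dataframe(df):
    # Drop the columns we do not need. 'Province/State' is dropped too, so that
    # sum() only sees numeric columns (newer pandas versions do not silently skip text columns).
    df = df.drop(['Province/State', 'Lat', 'Long'], axis=1)

    # Group by country and add up the provinces into one row per country.
    df = df.groupby('Country/Region').sum()

    # Transpose so that the dates become the index.
    df = df.transpose()

    # The labels are strings such as '1/22/20' (month/day/two-digit year); parse them as datetimes.
    df.index = pd.to_datetime(df.index, format='%m/%d/%y')
    df.index.name = 'Date'
    return df
  • Define a function that restructures the DataFrame, converts the dates to the datetime type, and uses them as the index.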
df_case = fix_dataframe(df_case)
df_death = fix_dataframe(df_death)
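
To make the reshaping concrete, here is a minimal sketch on a made-up three-row frame. The countries and numbers are hypothetical; the steps mirror those of fix_dataframe (this sketch also drops 'Province/State' so that the sum only touches numeric columns).

```python
import pandas as pd

# Hypothetical miniature of the raw file: two provinces of one country plus a second country.
raw = pd.DataFrame({
    'Country/Region': ['Canada', 'Canada', 'Chad'],
    'Province/State': ['Alberta', 'Ontario', None],
    'Lat': [53.9, 51.2, 15.4],
    'Long': [-116.5, -85.3, 18.7],
    '1/22/20': [0, 1, 0],
    '1/23/20': [2, 3, 1],
})

# Drop unused columns, sum per country, then put the dates on the index.
tidy = (
    raw.drop(columns=['Province/State', 'Lat', 'Long'])
       .groupby('Country/Region')
       .sum()
       .transpose()
)
tidy.index = pd.to_datetime(tidy.index, format='%m/%d/%y')
tidy.index.name = 'Date'

print(tidy)
```

The two Canadian provinces collapse into a single 'Canada' column, and each row is now one calendar day.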

Visualization

# Top 10 countries by confirmed cases

ten_cases = df_case.loc[df_case.index[-1]].sort_values(ascending=False)[:10]
ten_cases
Country/Region
US                30011839
Brazil            12220011
India             11787534
Russia             4433364
France             4374774
United Kingdom     4326645
Italy              3440862
Spain              3234319
Turkey             3091282
Germany            2722988
Name: 2021-03-24 00:00:00, dtype: int64
sns.barplot(x=ten_cases.index, y=ten_cases)
<matplotlib.axes._subplots.AxesSubplot at 0x211e06a8288>

output_14_1

ten_death = df_death.loc[df_death.index[-1]][ten_cases.index]
# Instead of sort_values, look up the deaths for the same top-10 countries by confirmed cases.
ten_death
Country/Region
US                545264
Brazil            300685
India             160692
Russia             94624
France             93083
United Kingdom    126621
Italy             106339
Spain              73744
Turkey             30462
Germany            75484
Name: 2021-03-24 00:00:00, dtype: int64
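
The selection pattern above (rank by cases, then reuse that index for deaths) can be sketched on toy data. The country labels and totals below are made up, and nlargest is used here as an alternative to sort_values plus slicing.

```python
import pandas as pd

# Hypothetical latest-day totals for four countries (not real figures).
latest_cases = pd.Series({'A': 500, 'B': 900, 'C': 100, 'D': 700})
latest_deaths = pd.Series({'A': 40, 'B': 10, 'C': 5, 'D': 30})

# Top 3 by cases; with no ties this matches sort_values(ascending=False)[:3].
top_cases = latest_cases.nlargest(3)

# Indexing deaths with the case-based index keeps the ordering by cases, not by deaths.
top_deaths = latest_deaths[top_cases.index]

print(top_cases.index.tolist())
print(top_deaths.tolist())
```

Because the deaths are not re-sorted, bars and line points stay aligned country by country when both are drawn on the same x-axis.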
# Show confirmed cases and deaths together.
plt.figure(figsize=(7,5))

sns.barplot(x=ten_cases.index, y=ten_cases, color='black')
plt.xticks(rotation=90, size=15)
plt.ylabel('Total Confirmed Cases', size=15)
plt.xlabel('')
plt.title('Total Confirmed Cases (%s)' %ten_cases.name.strftime('%Y-%m-%d'), size=15)

ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(ten_death.index, ten_death, 'r--')
ax2.set_ylabel('Total Deaths', color='red', size=15)
plt.show()

output_16_0

# Look at the US only.
plt.figure(figsize=(10,7))

plt.plot(df_case.index, df_case['US'], 'b-')
plt.ylabel('Confirmed Cases', color='blue')
plt.xlabel('Date')
plt.xlim(right=df_case.index[-1])
plt.ylim(0, df_case['US'].max()*1.1)
plt.title('US Cases & Deaths')

ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(df_death.index, df_death['US'], 'r--')
ax2.set_ylabel('Deaths', color='red')
ax2.set_ylim(0, df_death['US'].max()*1.3)

plt.show()

output_17_0

# Japan too, just for fun

plt.figure(figsize=(6,4))

plt.plot(df_case.index, df_case['Japan'], 'b-')
plt.ylabel('Confirmed Cases', color='blue')
plt.xlabel('Date')
plt.xlim(right=df_case.index[-1])
plt.ylim(0, df_case['Japan'].max()*1.1)
plt.title('Japan Cases & Deaths')

ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(df_death.index, df_death['Japan'], 'r--')
ax2.set_ylabel('Deaths', color='red')
ax2.set_ylim(0, df_death['Japan'].max()*1.3)

plt.show()

output_18_0

# South Korea
plt.figure(figsize=(6,4))

plt.plot(df_case.index, df_case['Korea, South'], 'b-')
plt.ylabel('Confirmed Cases', color='blue')
plt.xlabel('Date')
plt.xlim(right=df_case.index[-1])
plt.ylim(0, df_case['Korea, South'].max()*1.1)
plt.title('South Korea Cases & Deaths')

ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(df_death.index, df_death['Korea, South'], 'r--')
ax2.set_ylabel('Deaths', color='red')
ax2.set_ylim(0, df_death['Korea, South'].max()*1.3)

plt.show()

output_19_0
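
The three per-country plots above share the same code, so one option is to wrap it in a helper. The sketch below is a suggestion rather than part of the original notebook; the name plot_country and its parameters are hypothetical.

```python
import matplotlib.pyplot as plt

def plot_country(cases, deaths, country, figsize=(6, 4)):
    """Plot cumulative cases (left axis) and deaths (right axis) for one country."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot(cases.index, cases[country], 'b-')
    ax.set_ylabel('Confirmed Cases', color='blue')
    ax.set_xlabel('Date')
    ax.set_xlim(right=cases.index[-1])
    ax.set_ylim(0, cases[country].max() * 1.1)
    # The title is derived from the argument, so it always matches the plotted country.
    ax.set_title(f'{country} Cases & Deaths')

    # Second y-axis for deaths, sharing the same date axis.
    ax2 = ax.twinx()
    ax2.plot(deaths.index, deaths[country], 'r--')
    ax2.set_ylabel('Deaths', color='red')
    ax2.set_ylim(0, deaths[country].max() * 1.3)
    return fig, ax, ax2

# With the frames built earlier, usage would look like:
# plot_country(df_case, df_death, 'Korea, South'); plt.show()
```

Passing the country once avoids the copy-and-paste slip where a plot keeps the title of a different country.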

# What about daily figures instead of cumulative ones?
plt.plot(df_case.index, df_case['Korea, South'].diff(), 'b-')
plt.ylabel('Confirmed Cases', color='blue')
plt.xlabel('Date')
plt.xlim(right=df_case.index[-1])
plt.ylim(bottom = 0)
plt.title('South Korea Daily Cases & Deaths')

ax = plt.gca()
ax2 = ax.twinx()
ax2.plot(df_death.index, df_death['Korea, South'].diff(), 'r--')
ax2.set_ylabel('Deaths', color='red')
ax2.set_ylim(bottom=0)

plt.show()

output_20_0

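  • Appending diff() is all it takes: diff returns the difference between each value and the previous one, and since df_case[country] holds cumulative confirmed cases, the result is the daily confirmed cases.
  • The Daegu outbreak in March of last year (2020) produced the second-largest spike. As expected, deaths also rose after a lag following each surge in confirmed cases.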
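
As a small illustration of how diff() turns cumulative counts into daily counts, the sketch below uses the first five 'Korea, South' values that appear in the data (1, 1, 2, 2, 3). Filling the first NaN with the first cumulative value is an assumption made here, not something the original notebook does.

```python
import pandas as pd

# First five cumulative counts for 'Korea, South', starting 2020-01-22.
cumulative = pd.Series(
    [1, 1, 2, 2, 3],
    index=pd.date_range('2020-01-22', periods=5, freq='D'),
)

# diff() subtracts the previous day's value; the first entry has no predecessor and is NaN.
daily = cumulative.diff()

# Assumption: treat the first cumulative value as the cases reported on day one.
daily = daily.fillna(cumulative.iloc[0]).astype(int)

print(daily.tolist())  # [1, 0, 1, 0, 1]
```

Note that the plots above leave the NaN in place, which matplotlib simply skips.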

Preprocessing

  • Preprocess the data for use with FBProphet.
  • FBProphet usage follows the official reference: https://facebook.github.io/prophet/docs/quick_start.html#python-api
df_case.reset_index()[['Date', 'Korea, South']]
Country/Region       Date  Korea, South
0              2020-01-22             1
1              2020-01-23             1
2              2020-01-24             2
3              2020-01-25             2
4              2020-01-26             3
..                    ...           ...
423            2021-03-20         98665
424            2021-03-21         99075
425            2021-03-22         99421
426            2021-03-23         99846
427            2021-03-24        100276

428 rows × 2 columns

df = pd.DataFrame(df_case.reset_index()[['Date', 'Korea, South']].to_numpy(), columns=['ds', 'y'])
df
            ds       y
0   2020-01-22       1
1   2020-01-23       1
2   2020-01-24       2
3   2020-01-25       2
4   2020-01-26       3
..         ...     ...
423 2021-03-20   98665
424 2021-03-21   99075
425 2021-03-22   99421
426 2021-03-23   99846
427 2021-03-24  100276

428 rows × 2 columns
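
One detail worth knowing about the conversion above: going through to_numpy() on a frame with mixed column types produces an object array, so the numeric column comes back as object dtype. A possible alternative, sketched here on a hypothetical three-row stand-in for df_case, is to rename the columns instead.

```python
import pandas as pd

# Hypothetical stand-in for df_case: a datetime index named 'Date' and one country column.
cases = pd.DataFrame(
    {'Korea, South': [1, 1, 2]},
    index=pd.DatetimeIndex(['2020-01-22', '2020-01-23', '2020-01-24'], name='Date'),
)

# Route used above: values pass through an object array, so 'y' ends up as object dtype.
via_numpy = pd.DataFrame(cases.reset_index().to_numpy(), columns=['ds', 'y'])

# Alternative: renaming keeps datetime64 for 'ds' and an integer dtype for 'y'.
via_rename = cases.reset_index().rename(columns={'Date': 'ds', 'Korea, South': 'y'})

print(via_numpy.dtypes)
print(via_rename.dtypes)
```

Both frames hold the same values; the renamed version just avoids a later dtype surprise when doing arithmetic on 'y'.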

from math import floor

def train_test_split_df(df, test_size):
    div = floor(df.shape[0] * (1 - test_size))
    return df.loc[:div], df.loc[div + 1:]

train_df, test_df = train_test_split_df(df, 0.1)
train_df.shape
(386, 2)
train_df.tail()
            ds      y
381 2021-02-06  80896
382 2021-02-07  81185
383 2021-02-08  81487
384 2021-02-09  81930
385 2021-02-10  82434
test_df.head()
            ds      y
386 2021-02-11  82837
387 2021-02-12  83199
388 2021-02-13  83525
389 2021-02-14  83869
390 2021-02-15  84325
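
The split above relies on .loc being label-inclusive: df.loc[:div] keeps rows 0 through div, which is why the training set has 386 rows rather than 385. A position-based sketch that gives the same result is shown below; split_by_position is a hypothetical name, and the demo frame only imitates the length of the real series.

```python
from math import floor
import pandas as pd

def split_by_position(df, test_size):
    """Chronological split that mirrors train_test_split_df but uses positions."""
    div = floor(len(df) * (1 - test_size))
    # .loc[:div] is label-inclusive, so the original keeps div + 1 rows for training.
    return df.iloc[:div + 1], df.iloc[div + 1:]

# Hypothetical frame with the same length (428 rows) and start date as the series above.
demo = pd.DataFrame({'ds': pd.date_range('2020-01-22', periods=428), 'y': range(428)})
train, test = split_by_position(demo, 0.1)

print(train.shape, test.shape)  # (386, 2) (42, 2)
```

Using iloc makes the split independent of the index labels, which matters if the frame is ever filtered or re-indexed before splitting.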

Model training (Prophet)

from fbprophet import Prophet

model = Prophet(changepoint_range = 1.0)
model.fit(train_df)
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
C:\Users\dissi\anaconda31\lib\site-packages\pystan\misc.py:399: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  elif np.issubdtype(np.asarray(v).dtype, float):





<fbprophet.forecaster.Prophet at 0x211e324c848>
from fbprophet.plot import add_changepoints_to_plot
pred = model.predict(test_df)
model.plot(pred);

output_33_0

model.plot_components(pred);
# Weekly pattern

output_34_0

fig = model.plot(pred)
plt.plot(test_df['ds'], test_df['y'], 'g-', label='actual')
add_changepoints_to_plot(fig.gca(), model, pred)
plt.legend()
<matplotlib.legend.Legend at 0x211e346c048>

output_35_1

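  • The forecast is almost identical to the actual values.
  • This suggests the actual course followed the expected trend, without any irregular events.

Model evaluation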

from sklearn.metrics import r2_score
print('R2 Score : ', r2_score(test_df['y'], pred['yhat']))
R2 Score :  0.9928405923770204
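
For reference, the R2 score reported above is defined as 1 - SS_res / SS_tot. The sketch below computes it by hand with NumPy. The y_true values are the first five test rows shown earlier; the predictions are made-up numbers for illustration only, not the model's output.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the definition that sklearn's r2_score implements."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the forecast
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# First five test values from above; the predictions are hypothetical.
y_true = [82837, 83199, 83525, 83869, 84325]
y_pred = [82800, 83250, 83500, 83900, 84300]

print(r2_manual(y_true, y_pred))
```

Because a cumulative series climbs steadily, SS_tot is large compared with typical forecast errors, which tends to push R2 close to 1. Scoring the daily differences as well may give a stricter picture.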

Forecasting the next 30 days

model = Prophet(changepoint_range=1.0)
model.fit(df)
future = model.make_future_dataframe(30)
pred = model.predict(future)
model.plot(pred);

output_40_1
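
To see which dates the 30-day forecast covers, the following sketch imitates with plain pandas what make_future_dataframe(30) returns by default: the history dates followed by 30 additional daily dates. This is an emulation for illustration, not Prophet's own code.

```python
import pandas as pd

# History dates matching the data above: 2020-01-22 through 2021-03-24 (428 days).
history = pd.date_range('2020-01-22', '2021-03-24', freq='D')

# Emulation: append 30 daily dates after the last observed one.
future_dates = pd.date_range(history[-1] + pd.Timedelta(days=1), periods=30, freq='D')
future = pd.DataFrame({'ds': history.append(future_dates)})

print(len(future), future['ds'].iloc[-1].date())  # 458 2021-04-23
```

So the forecast above runs through 2021-04-23, and the returned predictions include the fitted values for the historical dates as well.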

Past data

  • A few specific events caused COVID-19 cases to surge. Here I want to analyze what the trend might have looked like if such an event had not happened.
df.loc[26:32]

# Cases rise sharply starting on February 20.
           ds    y
26 2020-02-17   30
27 2020-02-18   31
28 2020-02-19   31
29 2020-02-20  104
30 2020-02-21  204
31 2020-02-22  433
32 2020-02-23  602
model = Prophet(changepoint_range=1.0)
model.fit(df.loc[:28])
future2 = model.make_future_dataframe(30)
pred = model.predict(future2)
model.plot(pred);

plt.plot(df.loc[:58]['ds'], df.loc[:58]['y'], 'g-')
plt.show()

output_44_1

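  • The difference is extreme.
  • This shows how large an impact a single small event can have, and how hard it is to run a disease-control system that has to control and anticipate everything.
  • Even when everyone is careful, one small event can lead to such a frightening outcome, which is a good reminder not to grow complacent about prevention.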
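
The row numbers used in the comparison above (training on df.loc[:28], plotting actuals with df.loc[:58]) can be checked with a little date arithmetic. This sketch assumes only that row 0 is 2020-01-22, as shown in the data.

```python
import pandas as pd

first_day = pd.Timestamp('2020-01-22')            # row 0 of the series
last_train = first_day + pd.Timedelta(days=28)    # last row of df.loc[:28]
horizon_end = last_train + pd.Timedelta(days=30)  # end of the 30-day forecast

print(last_train.date(), horizon_end.date())  # 2020-02-19 2020-03-20
print((horizon_end - first_day).days)         # 58, matching df.loc[:58]
```

In other words, the model only sees data up to the day before the sharp rise on February 20, and the green line shows the actual counts over exactly the forecast window.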