US Election 2020

데이터 셋

https://www.kaggle.com/unanimad/us-election-2020 (선거관련) https://www.kaggle.com/muonneutrino/us-census-demographic-data (인구관련)

대통령 선거 이전에 보여지는 모습을 통해 인종, 상황, 배경이라는 변수에 따라 투표의 양상이 어떻게 벌어지는지 확인할 수 있는 데이터 셋이다.

라이브러리 설정 및 데이터 읽어들이기

president_country_candidate.csv : 대통령 투표 결과
governors_country_candidate.csv : 카운티 지사 투표 결과
senate_country_candidate.csv : 상원의원 투표 결과
house_candidate.csv : 하원의원 투표 결과
state : 주
county : 카운티(군)
district : 지구
candidate : 후보자
party : 후보자의 소속 정당
total_votes: 득표 수
won : 지역 투표 우승 여부

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df_pres = pd.read_csv("us-election-2020/president_county_candidate.csv")
df_gov = pd.read_csv("us-election-2020/governors_county_candidate.csv")
# 우선 두가지 사용.

# df_sen = pd.read_csv('us-election-2020/senate_country_candidate.csv')
# df_hou = pd.read_csv('us-election-2020/house_candidate.csv')

df_census = pd.read_csv("./us-census-demographic/acs2017_county_data.csv") # 인구 관련 조사

# State Code 관련 부가 자료
state_code = pd.read_html('https://www.infoplease.com/us/postal-information/state-abbreviations-and-state-postal-codes')[0]

EDA 및 기초 통계 분석

df_pres.head()

	state	county	candidate	party	total_votes	won
0	Delaware	Kent County	Joe Biden	DEM	44552	True
1	Delaware	Kent County	Donald Trump	REP	41009	False
2	Delaware	Kent County	Jo Jorgensen	LIB	1044	False
3	Delaware	Kent County	Howie Hawkins	GRN	420	False
4	Delaware	New Castle County	Joe Biden	DEM	195034	True

df_pres['candidate'].unique()

array(['Joe Biden', 'Donald Trump', 'Jo Jorgensen', 'Howie Hawkins',
       ' Write-ins', 'Gloria La Riva', 'Brock Pierce',
       'Rocky De La Fuente', 'Don Blankenship', 'Kanye West',
       'Brian Carroll', 'Ricki Sue King', 'Jade Simmons',
       'President Boddie', 'Bill Hammons', 'Tom Hoefling',
       'Alyson Kennedy', 'Jerome Segal', 'Phil Collins',
       ' None of these candidates', 'Sheila Samm Tittle', 'Dario Hunter',
       'Joe McHugh', 'Christopher LaFontaine', 'Keith McCormic',
       'Brooke Paige', 'Gary Swing', 'Richard Duncan', 'Blake Huber',
       'Kyle Kopitke', 'Zachary Scalf', 'Jesse Ventura', 'Connie Gammon',
       'John Richard Myers', 'Mark Charles', 'Princess Jacob-Fambro',
       'Joseph Kishore', 'Jordan Scott'], dtype=object)

df_pres.loc[df_pres['candidate'] == 'Jo Jorgensen']['total_votes'].sum()

df_gov.head()

	state	county	candidate	party	votes	won
0	Delaware	Kent County	John Carney	DEM	44352	True
1	Delaware	Kent County	Julianne Murray	REP	39332	False
2	Delaware	Kent County	Kathy DeMatteis	IPD	1115	False
3	Delaware	Kent County	John Machurek	LIB	616	False
4	Delaware	New Castle County	John Carney	DEM	191678	True

pd.set_option('display.max_columns', None)

df_census.head()

	CountyId	State	County	TotalPop	Men	Women	Hispanic	White	Black	Native	Asian	VotingAgeCitizen	Income	IncomeErr	IncomePerCap	IncomePerCapErr	Poverty	ChildPoverty	Professional	Service	Office	Construction	Production	Drive	Carpool	Transit	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
0	1001	Alabama	Autauga County	55036	26899	28137	2.7	75.4	18.9	0.3	0.9	41016	55317	2838	27824	2024	13.7	20.1	35.3	18.0	23.2	8.1	15.4	86.0	9.6	0.1	0.6	1.3	2.5	25.8	24112	74.1	20.2	5.6	0.1	5.2
1	1003	Alabama	Baldwin County	203360	99527	103833	4.4	83.1	9.5	0.8	0.7	155376	52562	1348	29364	735	11.8	16.1	35.7	18.2	25.6	9.7	10.8	84.7	7.6	0.1	0.8	1.1	5.6	27.0	89527	80.7	12.9	6.3	0.1	5.5
2	1005	Alabama	Barbour County	26201	13976	12225	4.2	45.7	47.8	0.2	0.6	20269	33368	2551	17561	798	27.2	44.9	25.0	16.8	22.6	11.5	24.1	83.4	11.1	0.3	2.2	1.7	1.3	23.4	8878	74.1	19.1	6.5	0.3	12.4
3	1007	Alabama	Bibb County	22580	12251	10329	2.4	74.6	22.0	0.4	0.0	17662	43404	3431	20911	1889	15.2	26.6	24.4	17.6	19.7	15.9	22.4	86.4	9.5	0.7	0.3	1.7	1.5	30.0	8171	76.0	17.4	6.3	0.3	8.2
4	1009	Alabama	Blount County	57667	28490	29177	9.0	87.4	1.5	0.3	0.1	42513	47412	2630	22021	850	15.6	25.4	28.5	12.9	23.3	15.8	19.5	86.8	10.2	0.1	0.4	0.4	2.1	35.0	21380	83.9	11.9	4.0	0.1	4.9

숫자로 표기되어있는 data, 백분율로 표기되어 있는 data가 있음. 백분율 경우 TotalPop을 가중치로 두어 전처리 해야 함.

state_code.head()

	State/District	Abbreviation	Postal Code
0	Alabama	Ala.	AL
1	Alaska	Alaska	AK
2	Arizona	Ariz.	AZ
3	Arkansas	Ark.	AR
4	California	Calif.	CA

County별 통계로 데이터프레임 구조 변경

data = df_pres.loc[df_pres['party'].apply(lambda s: str(s) in ['DEM', 'REP'])]

table_pres = pd.pivot_table(data=data, index=['state', 'county'], columns='party', values='total_votes')
table_pres.rename({'DEM':'Pres_DEM', 'REP':'Pres_REP'}, axis=1, inplace=True)
table_pres

	party	Pres_DEM	Pres_REP
state	county
Alabama	Autauga County	7503	19838
	Baldwin County	24578	83544
	Barbour County	4816	5622
	Bibb County	1986	7525
	Blount County	2640	24711
...	...	...	...
Wyoming	Sweetwater County	3823	12229
	Teton County	9848	4341
	Uinta County	1591	7496
	Washakie County	651	3245
	Weston County	360	3107

4633 rows × 2 columns

table_pres.isna().sum()

party
Pres_DEM    0
Pres_REP    0
dtype: int64

data2 = df_gov.loc[df_gov['party'].apply(lambda s: str(s) in ['DEM', 'REP'])]

table_gov = pd.pivot_table(data=data2, index=['state', 'county'], columns='party', values='votes')
table_gov.rename({'DEM':'Gov_DEM', 'REP':'Gov_REP'}, axis=1, inplace=True)
table_gov

	party	Gov_DEM	Gov_REP
state	county
Delaware	Kent County	44352	39332
	New Castle County	191678	82545
	Sussex County	56873	68435
Indiana	Adams County	2143	9441
Indiana	Allen County	53895	98406
...	...	...	...
West Virginia	Webster County	659	2552
	Wetzel County	1727	4559
	Wirt County	483	1947
	Wood County	9933	26232
	Wyoming County	1240	6941

1025 rows × 2 columns

table_gov.isna().sum()

party
Gov_DEM    0
Gov_REP    0
dtype: int64

df_census.columns

Index(['CountyId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
       'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
       'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
       'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
       'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
       'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
       'SelfEmployed', 'FamilyWork', 'Unemployment'],
      dtype='object')

df_census.drop(['Income', 'IncomeErr', 'IncomePerCapErr'], axis=1, inplace=True)

# state, county 컬럼 소문자로
df_census.rename({'State':'state', 'County':'county'},axis=1, inplace=True)
df_census.drop('CountyId', axis=1, inplace=True)
df_census.set_index(['state', 'county'], inplace=True)
df_census

		TotalPop	Men	Women	Hispanic	White	Black	Native	Asian	Pacific	VotingAgeCitizen	IncomePerCap	Poverty	ChildPoverty	Professional	Service	Office	Construction	Production	Drive	Carpool	Transit	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
state	county
Alabama	Autauga County	55036	26899	28137	2.7	75.4	18.9	0.3	0.9	0.0	41016	27824	13.7	20.1	35.3	18.0	23.2	8.1	15.4	86.0	9.6	0.1	0.6	1.3	2.5	25.8	24112	74.1	20.2	5.6	0.1	5.2
	Baldwin County	203360	99527	103833	4.4	83.1	9.5	0.8	0.7	0.0	155376	29364	11.8	16.1	35.7	18.2	25.6	9.7	10.8	84.7	7.6	0.1	0.8	1.1	5.6	27.0	89527	80.7	12.9	6.3	0.1	5.5
	Barbour County	26201	13976	12225	4.2	45.7	47.8	0.2	0.6	0.0	20269	17561	27.2	44.9	25.0	16.8	22.6	11.5	24.1	83.4	11.1	0.3	2.2	1.7	1.3	23.4	8878	74.1	19.1	6.5	0.3	12.4
	Bibb County	22580	12251	10329	2.4	74.6	22.0	0.4	0.0	0.0	17662	20911	15.2	26.6	24.4	17.6	19.7	15.9	22.4	86.4	9.5	0.7	0.3	1.7	1.5	30.0	8171	76.0	17.4	6.3	0.3	8.2
	Blount County	57667	28490	29177	9.0	87.4	1.5	0.3	0.1	0.0	42513	22021	15.6	25.4	28.5	12.9	23.3	15.8	19.5	86.8	10.2	0.1	0.4	0.4	2.1	35.0	21380	83.9	11.9	4.0	0.1	4.9
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Puerto Rico	Vega Baja Municipio	54754	26269	28485	96.7	3.1	0.1	0.0	0.0	0.0	42838	10197	43.8	49.4	28.6	20.2	25.9	11.1	14.2	92.0	4.2	0.9	1.4	0.6	0.9	31.6	14234	76.2	19.3	4.3	0.2	16.8
	Vieques Municipio	8931	4351	4580	95.7	4.0	0.0	0.0	0.0	0.0	7045	11136	36.8	68.2	20.9	38.4	16.4	16.9	7.3	76.3	16.9	0.0	5.0	0.0	1.7	14.9	2927	40.7	40.9	18.4	0.0	12.8
	Villalba Municipio	23659	11510	12149	99.7	0.2	0.1	0.0	0.0	0.0	18053	10449	50.0	67.9	22.5	21.2	22.7	14.1	19.5	83.1	11.8	0.1	2.1	0.0	2.8	28.4	6873	59.2	30.2	10.4	0.2	24.8
	Yabucoa Municipio	35025	16984	18041	99.9	0.1	0.0	0.0	0.0	0.0	27523	8672	52.4	62.1	27.7	26.0	20.7	9.5	16.0	87.6	9.2	0.0	1.4	1.8	0.1	30.5	7878	62.7	30.9	6.3	0.0	25.4
	Yauco Municipio	37585	18052	19533	99.8	0.2	0.0	0.0	0.0	0.0	29763	8124	50.4	58.2	30.4	20.2	25.6	12.6	11.3	82.8	8.2	2.2	1.7	0.1	5.0	24.4	8995	66.4	28.7	5.0	0.0	24.0

3220 rows × 31 columns

# 다중공선성을 피하기 위해 총인구컬럼과 겹치게 되는 남성-여성인구수, 유권자수, 고용인수를 비율로 바꿔줌.
df_census.drop('Women', axis=1, inplace=True) # 남성아니면 여성이므로 다중공선성 제거를 위해 여성컬럼 제거.

df_census['Men'] /= df_census['TotalPop']

df_census['VotingAgeCitizen'] /= df_census['TotalPop']
df_census['Employed'] /= df_census['TotalPop']

df_census.head()

		TotalPop	Men	Hispanic	White	Black	Native	Asian	Pacific	VotingAgeCitizen	IncomePerCap	Poverty	ChildPoverty	Professional	Service	Office	Construction	Production	Drive	Carpool	Transit	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
state	county
Alabama	Autauga County	55036	0.488753	2.7	75.4	18.9	0.3	0.9	0.0	0.745258	27824	13.7	20.1	35.3	18.0	23.2	8.1	15.4	86.0	9.6	0.1	0.6	1.3	2.5	25.8	0.438113	74.1	20.2	5.6	0.1	5.2
	Baldwin County	203360	0.489413	4.4	83.1	9.5	0.8	0.7	0.0	0.764044	29364	11.8	16.1	35.7	18.2	25.6	9.7	10.8	84.7	7.6	0.1	0.8	1.1	5.6	27.0	0.440239	80.7	12.9	6.3	0.1	5.5
	Barbour County	26201	0.533415	4.2	45.7	47.8	0.2	0.6	0.0	0.773596	17561	27.2	44.9	25.0	16.8	22.6	11.5	24.1	83.4	11.1	0.3	2.2	1.7	1.3	23.4	0.338842	74.1	19.1	6.5	0.3	12.4
	Bibb County	22580	0.542560	2.4	74.6	22.0	0.4	0.0	0.0	0.782197	20911	15.2	26.6	24.4	17.6	19.7	15.9	22.4	86.4	9.5	0.7	0.3	1.7	1.5	30.0	0.361869	76.0	17.4	6.3	0.3	8.2
	Blount County	57667	0.494043	9.0	87.4	1.5	0.3	0.1	0.0	0.737215	22021	15.6	25.4	28.5	12.9	23.3	15.8	19.5	86.8	10.2	0.1	0.4	0.4	2.1	35.0	0.370749	83.9	11.9	4.0	0.1	4.9

# 세 가지 데이터프레임 통합.
df = pd.concat([table_pres, table_gov, df_census], axis=1)
df

		Pres_DEM	Pres_REP	Gov_DEM	Gov_REP	TotalPop	Men	Hispanic	White	Black	Native	Asian	Pacific	VotingAgeCitizen	IncomePerCap	Poverty	ChildPoverty	Professional	Service	Office	Construction	Production	Drive	Carpool	Transit	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
state	county
Alabama	Autauga County	7503.0	19838.0	NaN	NaN	55036.0	0.488753	2.7	75.4	18.9	0.3	0.9	0.0	0.745258	27824.0	13.7	20.1	35.3	18.0	23.2	8.1	15.4	86.0	9.6	0.1	0.6	1.3	2.5	25.8	0.438113	74.1	20.2	5.6	0.1	5.2
	Baldwin County	24578.0	83544.0	NaN	NaN	203360.0	0.489413	4.4	83.1	9.5	0.8	0.7	0.0	0.764044	29364.0	11.8	16.1	35.7	18.2	25.6	9.7	10.8	84.7	7.6	0.1	0.8	1.1	5.6	27.0	0.440239	80.7	12.9	6.3	0.1	5.5
	Barbour County	4816.0	5622.0	NaN	NaN	26201.0	0.533415	4.2	45.7	47.8	0.2	0.6	0.0	0.773596	17561.0	27.2	44.9	25.0	16.8	22.6	11.5	24.1	83.4	11.1	0.3	2.2	1.7	1.3	23.4	0.338842	74.1	19.1	6.5	0.3	12.4
	Bibb County	1986.0	7525.0	NaN	NaN	22580.0	0.542560	2.4	74.6	22.0	0.4	0.0	0.0	0.782197	20911.0	15.2	26.6	24.4	17.6	19.7	15.9	22.4	86.4	9.5	0.7	0.3	1.7	1.5	30.0	0.361869	76.0	17.4	6.3	0.3	8.2
	Blount County	2640.0	24711.0	NaN	NaN	57667.0	0.494043	9.0	87.4	1.5	0.3	0.1	0.0	0.737215	22021.0	15.6	25.4	28.5	12.9	23.3	15.8	19.5	86.8	10.2	0.1	0.4	0.4	2.1	35.0	0.370749	83.9	11.9	4.0	0.1	4.9
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Wyoming	Sweetwater County	3823.0	12229.0	NaN	NaN	44527.0	0.516114	16.0	79.6	0.8	0.6	0.6	0.5	0.696768	31700.0	12.0	15.7	27.7	16.1	20.0	20.8	15.4	77.5	14.4	2.6	2.8	1.3	1.5	20.5	0.510679	78.4	17.8	3.8	0.0	5.2
	Teton County	9848.0	4341.0	NaN	NaN	22923.0	0.530864	15.0	81.5	0.5	0.3	2.2	0.0	0.728177	49200.0	6.8	2.8	39.4	25.4	17.0	11.7	6.5	68.3	6.7	3.8	11.7	3.8	5.7	14.3	0.632203	82.1	11.4	6.5	0.0	1.3
	Uinta County	1591.0	7496.0	NaN	NaN	20758.0	0.510309	9.1	87.7	0.1	0.9	0.1	0.0	0.685760	27115.0	14.9	20.0	30.4	19.4	18.1	16.1	16.1	77.4	14.9	3.3	1.1	1.3	2.0	19.9	0.459004	71.5	21.5	6.6	0.4	6.4
	Washakie County	651.0	3245.0	NaN	NaN	8253.0	0.498970	14.2	82.2	0.3	0.4	0.1	0.0	0.742154	27345.0	12.8	17.5	32.1	16.3	17.6	18.8	15.3	77.2	10.2	0.0	6.9	1.3	4.4	14.3	0.464437	69.8	22.0	8.1	0.2	6.1
	Weston County	360.0	3107.0	NaN	NaN	7117.0	0.527750	1.4	91.6	0.5	0.1	4.3	0.0	0.774203	30955.0	14.4	24.1	32.0	15.0	15.8	17.9	19.3	72.7	6.7	9.1	3.0	1.6	6.9	25.7	0.478713	68.2	21.9	8.8	1.1	2.2

4809 rows × 34 columns

카운티 지사를 선출하지 않는 county의 경우 NaN값으로 표시된다.

컬럼간 상관관계 살펴보기

plt.figure(figsize=(10,10))
sns.heatmap(df.corr())
plt.show()

output_30_0

전처리가 덜 되었음. 민주당을 뽑은 인원이 공화당에도 큰 영향을 미친다는 것은 인구자체에 영향을 받아 다중공선성이 일어나고 있다는 증거.
비율로 바꾸어 줄 필요가 있음.
Asian 투표율이 높음.

df_norm = df.copy()
df_norm['Pres_DEM'] /= df['Pres_DEM'] + df['Pres_REP']
df_norm['Pres_REP'] /= df['Pres_DEM'] + df['Pres_REP']
df_norm['Gov_DEM'] /= df['Gov_DEM'] + df['Gov_REP']
df_norm['Gov_REP'] /= df['Gov_DEM'] + df['Gov_REP']

plt.figure(figsize=(5,10))
sns.heatmap(df_norm.corr()[['Pres_DEM', 'Pres_REP']], annot=True)
plt.show()

output_33_0

인구가 많은 county일수록 민주당 지지, 백인은 공화동 유색인종은민주당
전문직,서비스직,사무직 민주당. 건설,생산,운송 공화당.

sns.jointplot(data=df_norm, x='White', y='Pres_REP', kind='hex')

<seaborn.axisgrid.JointGrid at 0x1bbd39a4f48>

output_35_1

단순히 백인비율이 많다고 공화당 지지가 높지는 않음. 아마 직업의 영향이 반영된듯 하다.

sns.jointplot(data=df_norm, x='White', y='Pres_REP', hue='Professional')

C:\Users\dissi\anaconda31\lib\site-packages\seaborn\distributions.py:306: UserWarning: Dataset has 0 variance; skipping density estimate.
  warnings.warn(msg, UserWarning)
C:\Users\dissi\anaconda31\lib\site-packages\seaborn\distributions.py:306: UserWarning: Dataset has 0 variance; skipping density estimate.
  warnings.warn(msg, UserWarning)

<seaborn.axisgrid.JointGrid at 0x1bbd3dd1f48>

output_37_2

아래로 내려갈수록(공화당 지지가 낮을수록) 전문직비율이 높음.

sns.jointplot(data=df_norm, x='Black', y='Pres_DEM', alpha=0.2)

<seaborn.axisgrid.JointGrid at 0x1bbd3934488>

output_39_1

상관성은 확실히 있으나 위의 plot과 마찬가지로 black 비율이 낮다고 민주당 지지도 낮지는 않음.
또한 흑인이 많은 county 자체도 많지는 않아 전체 데이터에 큰 영향을 주지는 않음.

Plotly로 시각화

전처리

import plotly.figure_factory as ff

df_sample = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/laucnty16.csv')
df_sample['State FIPS Code'] = df_sample['State FIPS Code'].apply(lambda x: str(x).zfill(2))
df_sample['County FIPS Code'] = df_sample['County FIPS Code'].apply(lambda x: str(x).zfill(3))
df_sample['FIPS'] = df_sample['State FIPS Code'] + df_sample['County FIPS Code']

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]

df_sample

	LAUS Code	State FIPS Code	County FIPS Code	County Name/State Abbreviation	Year	Labor Force	Employed	Unemployed	Unemployment Rate (%)	FIPS
0	CN0100100000000	01	001	Autauga County, AL	2016	25,649	24,297	1,352	5.3	01001
1	CN0100300000000	01	003	Baldwin County, AL	2016	89,931	85,061	4,870	5.4	01003
2	CN0100500000000	01	005	Barbour County, AL	2016	8,302	7,584	718	8.6	01005
3	CN0100700000000	01	007	Bibb County, AL	2016	8,573	8,004	569	6.6	01007
4	CN0100900000000	01	009	Blount County, AL	2016	24,525	23,171	1,354	5.5	01009
...	...	...	...	...	...	...	...	...	...	...
3214	CN7214500000000	72	145	Vega Baja Municipio, PR	2016	13,812	11,894	1,918	13.9	72145
3215	CN7214700000000	72	147	Vieques Municipio, PR	2016	3,287	2,939	348	10.6	72147
3216	CN7214900000000	72	149	Villalba Municipio, PR	2016	7,860	6,273	1,587	20.2	72149
3217	CN7215100000000	72	151	Yabucoa Municipio, PR	2016	9,137	7,591	1,546	16.9	72151
3218	CN7215300000000	72	153	Yauco Municipio, PR	2016	10,815	8,783	2,032	18.8	72153

3219 rows × 10 columns

state_code.head()

	State/District	Abbreviation	Postal Code
0	Alabama	Ala.	AL
1	Alaska	Alaska	AK
2	Arizona	Ariz.	AZ
3	Arkansas	Ark.	AR
4	California	Calif.	CA

state_map = state_code.set_index('State/District')['Postal Code']

state_map

State/District
Alabama                 AL
Alaska                  AK
Arizona                 AZ
Arkansas                AR
California              CA
Colorado                CO
Connecticut             CT
Delaware                DE
District of Columbia    DC
Florida                 FL
Georgia                 GA
Hawaii                  HI
Idaho                   ID
Illinois                IL
Indiana                 IN
Iowa                    IA
Kansas                  KS
Kentucky                KY
Louisiana               LA
Maine                   ME
Maryland                MD
Massachusetts           MA
Michigan                MI
Minnesota               MN
Mississippi             MS
Missouri                MO
Montana                 MT
Nebraska                NE
Nevada                  NV
New Hampshire           NH
New Jersey              NJ
New Mexico              NM
New York                NY
North Carolina          NC
North Dakota            ND
Ohio                    OH
Oklahoma                OK
Oregon                  OR
Pennsylvania            PA
Rhode Island            RI
South Carolina          SC
South Dakota            SD
Tennessee               TN
Texas                   TX
Utah                    UT
Vermont                 VT
Virginia                VA
Washington              WA
West Virginia           WV
Wisconsin               WI
Wyoming                 WY
Name: Postal Code, dtype: object

counties = df_norm.reset_index()['county'] + ', ' + df_norm.reset_index()['state'].map(state_map)

counties

        Autauga County, AL
        Baldwin County, AL
        Barbour County, AL
           Bibb County, AL
         Blount County, AL
                ...          
  Sweetwater County, WY
       Teton County, WY
       Uinta County, WY
    Washakie County, WY
      Weston County, WY
Length: 4809, dtype: object

counties_to_fips = df_sample.set_index('County Name/State Abbreviation')['FIPS']
counties_to_fips

County Name/State Abbreviation
Autauga County, AL         01001
Baldwin County, AL         01003
Barbour County, AL         01005
Bibb County, AL            01007
Blount County, AL          01009
                           ...  
Vega Baja Municipio, PR    72145
Vieques Municipio, PR      72147
Villalba Municipio, PR     72149
Yabucoa Municipio, PR      72151
Yauco Municipio, PR        72153
Name: FIPS, Length: 3219, dtype: object

fips = counties.map(counties_to_fips)
fips

# df_norm 의 지리정보를 fips코드 하나로 바꿈.

     01001
     01003
     01005
     01007
     01009
        ...  
  56037
  56039
  56041
  56043
  56045
Length: 4809, dtype: object

fips.isna().sum()

data = df_norm.reset_index()['Pres_DEM'][fips.notna()]
fips = fips[fips.notna()]

# data는 민주당 지지율을 index로 새로 구성하되 fips가 null값이 아닌 번호를 골라서
# Fips코드는 choropleth를 사용하기 위해(가시화)

시각화

fig = ff.create_choropleth(
    fips=fips, values=data,
    show_state_data = False,
    colorscale=colorscale,
    binning_endpoints=list(np.linspace(0.0, 1.0, len(colorscale) - 2)),
    show_hover=True, centroid_marker={'opacity':0},
    asp = 2.9, title="USA by Voting for DEM President"
)

fig.layout.template = None
fig.show()

전처리

df_norm.columns

Index(['Pres_DEM', 'Pres_REP', 'Gov_DEM', 'Gov_REP', 'TotalPop', 'Men',
       'Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific',
       'VotingAgeCitizen', 'IncomePerCap', 'Poverty', 'ChildPoverty',
       'Professional', 'Service', 'Office', 'Construction', 'Production',
       'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
       'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed',
       'FamilyWork', 'Unemployment'],
      dtype='object')

df_norm.dropna(inplace=True)
X = df_norm.drop(['Pres_DEM', 'Pres_REP', 'Gov_DEM', 'Gov_REP'], axis=1)
y = df_norm['Pres_DEM']

# 수치형 데이터 표준화
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X= pd.DataFrame(data=X_scaled, index=X.index, columns=X.columns)

X.head()

		TotalPop	Men	Hispanic	White	Black	Native	Asian	Pacific	VotingAgeCitizen	IncomePerCap	Poverty	ChildPoverty	Professional	Service	Office	Construction	Production	Drive	Carpool	Transit	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
state	county
Delaware	Kent County	0.651462	-0.946145	0.346057	-1.423723	1.751813	-0.191121	0.783427	-0.393531	-0.388638	0.313498	-0.444312	-0.300398	0.327031	0.447702	0.504056	-0.807226	-0.379883	0.447917	-0.170797	0.559713	-0.455170	-0.337372	-0.325794	0.608072	0.209912	-0.063847	0.791294	-0.836585	-0.243581	0.265911
	New Castle County	3.043033	-0.848175	0.828313	-1.695454	1.751813	-0.239442	3.218819	-0.393531	-0.665857	1.606755	-0.638605	-0.567309	1.978537	-0.315791	0.658756	-1.535077	-1.212547	0.221028	-0.857051	3.061686	-0.301446	-0.337372	-0.325794	0.492385	0.801126	1.168959	-0.715528	-0.999491	-0.404976	0.192397
	Sussex County	0.917027	-0.818852	0.759420	-0.634409	0.622026	-0.215282	0.210394	-0.393531	0.227859	1.122397	-0.620942	-0.044163	0.295272	0.147758	0.844395	-0.469295	-0.543150	0.632264	-1.001526	0.112932	-0.506412	-0.141175	0.055883	0.415259	-0.026356	0.503244	-0.251891	-0.470046	-0.404976	0.045370
Indiana	Adams County	-0.213551	-0.230718	-0.067305	0.581911	-0.432442	-0.263602	-0.362639	-0.393531	-1.625095	-0.755470	0.597804	0.916716	-1.149795	-0.615735	-0.516958	0.232561	1.660960	-0.658166	2.393626	-0.512562	-0.250204	0.447416	-0.244006	-0.066773	-0.205611	0.971710	-1.344751	-0.001691	0.401999	-0.285443
Indiana	Allen County	1.870146	-0.656809	0.414951	-0.653818	0.546706	-0.239442	1.857865	-0.393531	-1.068441	0.224871	-0.144041	0.083954	0.358791	-0.452129	0.937214	-1.405104	0.289514	0.759889	-0.495865	0.202288	-0.608895	-0.533569	-0.380320	-0.355993	0.707837	1.477161	-1.278517	-0.816222	-0.404976	0.118884

학습, 테스트 데이터 분리

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

PCA 이용 (Columns 수가 많음.)

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_)
plt.grid()

# 1부터 30개의 Columns. 10정도의 변수만 되어도 충분히 설명가능.

output_66_0

pca = PCA(n_components=10)
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

모델 적용

LightGBM 모델

from lightgbm import LGBMRegressor
model_reg = LGBMRegressor()
model_reg.fit(pca.transform(X_train), y_train)

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import classification_report
from math import sqrt

pred = model_reg.predict(pca.transform(X_test))
print(mean_absolute_error(y_test, pred))
print(sqrt(mean_squared_error(y_test, pred)))

0.06323869955871446
0.08461257022993489

print(classification_report(y_test > 0.5, pred > 0.5))

# y = Pres_DEM 0.5 이상일경우 True

              precision    recall  f1-score   support

       False       0.95      0.96      0.96       146
        True       0.62      0.59      0.61        17

    accuracy                           0.92       163
   macro avg       0.79      0.77      0.78       163
weighted avg       0.92      0.92      0.92       163

XGBoost 모델

from xgboost import XGBClassifier

model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train > 0.5)

[23:07:34] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)

plt.figure(figsize=(12,10))
plt.barh(x.columns, model_xgb.feature_importances_)
plt.show()

# 민주당 표에 영향을 미치는 것. 인구는 당연한 변수이므로 제외한다면 asian, transit, publicwork 눈여겨볼만하다.

output_76_0

pred = model_xgb.predict(X_test)
print(classification_report(y_test>0.5 , pred))

              precision    recall  f1-score   support

       False       0.96      0.97      0.97       146
        True       0.73      0.65      0.69        17

    accuracy                           0.94       163
   macro avg       0.85      0.81      0.83       163
weighted avg       0.94      0.94      0.94       163

Oh-Seung-Rok

Kaggle - US Election 2020