US Election 2020
데이터 셋
https://www.kaggle.com/unanimad/us-election-2020 (선거관련) https://www.kaggle.com/muonneutrino/us-census-demographic-data (인구관련)
- 대통령 선거 이전에 보여지는 모습을 통해 인종, 상황, 배경이라는 변수에 따라 투표의 양상이 어떻게 벌어지는지 확인할 수 있는 데이터 셋이다.
라이브러리 설정 및 데이터 읽어들이기
- president_country_candidate.csv : 대통령 투표 결과
- governors_country_candidate.csv : 카운티 지사 투표 결과
- senate_country_candidate.csv : 상원의원 투표 결과
house_candidate.csv : 하원의원 투표 결과
- state : 주
- county : 카운티(군)
- district : 지구
- candidate : 후보자
- party : 후보자의 소속 정당
- total_votes: 득표 수
- won : 지역 투표 우승 여부
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df_pres = pd.read_csv("us-election-2020/president_county_candidate.csv")
df_gov = pd.read_csv("us-election-2020/governors_county_candidate.csv")
# 우선 두가지 사용.
# df_sen = pd.read_csv('us-election-2020/senate_country_candidate.csv')
# df_hou = pd.read_csv('us-election-2020/house_candidate.csv')
df_census = pd.read_csv("./us-census-demographic/acs2017_county_data.csv") # 인구 관련 조사
# State Code 관련 부가 자료
state_code = pd.read_html('https://www.infoplease.com/us/postal-information/state-abbreviations-and-state-postal-codes')[0]
EDA 및 기초 통계 분석
df_pres.head()
state | county | candidate | party | total_votes | won | |
---|---|---|---|---|---|---|
0 | Delaware | Kent County | Joe Biden | DEM | 44552 | True |
1 | Delaware | Kent County | Donald Trump | REP | 41009 | False |
2 | Delaware | Kent County | Jo Jorgensen | LIB | 1044 | False |
3 | Delaware | Kent County | Howie Hawkins | GRN | 420 | False |
4 | Delaware | New Castle County | Joe Biden | DEM | 195034 | True |
df_pres['candidate'].unique()
array(['Joe Biden', 'Donald Trump', 'Jo Jorgensen', 'Howie Hawkins',
' Write-ins', 'Gloria La Riva', 'Brock Pierce',
'Rocky De La Fuente', 'Don Blankenship', 'Kanye West',
'Brian Carroll', 'Ricki Sue King', 'Jade Simmons',
'President Boddie', 'Bill Hammons', 'Tom Hoefling',
'Alyson Kennedy', 'Jerome Segal', 'Phil Collins',
' None of these candidates', 'Sheila Samm Tittle', 'Dario Hunter',
'Joe McHugh', 'Christopher LaFontaine', 'Keith McCormic',
'Brooke Paige', 'Gary Swing', 'Richard Duncan', 'Blake Huber',
'Kyle Kopitke', 'Zachary Scalf', 'Jesse Ventura', 'Connie Gammon',
'John Richard Myers', 'Mark Charles', 'Princess Jacob-Fambro',
'Joseph Kishore', 'Jordan Scott'], dtype=object)
df_pres.loc[df_pres['candidate'] == 'Jo Jorgensen']['total_votes'].sum()
1874183
df_gov.head()
state | county | candidate | party | votes | won | |
---|---|---|---|---|---|---|
0 | Delaware | Kent County | John Carney | DEM | 44352 | True |
1 | Delaware | Kent County | Julianne Murray | REP | 39332 | False |
2 | Delaware | Kent County | Kathy DeMatteis | IPD | 1115 | False |
3 | Delaware | Kent County | John Machurek | LIB | 616 | False |
4 | Delaware | New Castle County | John Carney | DEM | 191678 | True |
pd.set_option('display.max_columns', None)
df_census.head()
CountyId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | VotingAgeCitizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | Alabama | Autauga County | 55036 | 26899 | 28137 | 2.7 | 75.4 | 18.9 | 0.3 | 0.9 | 0.0 | 41016 | 55317 | 2838 | 27824 | 2024 | 13.7 | 20.1 | 35.3 | 18.0 | 23.2 | 8.1 | 15.4 | 86.0 | 9.6 | 0.1 | 0.6 | 1.3 | 2.5 | 25.8 | 24112 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 |
1 | 1003 | Alabama | Baldwin County | 203360 | 99527 | 103833 | 4.4 | 83.1 | 9.5 | 0.8 | 0.7 | 0.0 | 155376 | 52562 | 1348 | 29364 | 735 | 11.8 | 16.1 | 35.7 | 18.2 | 25.6 | 9.7 | 10.8 | 84.7 | 7.6 | 0.1 | 0.8 | 1.1 | 5.6 | 27.0 | 89527 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 |
2 | 1005 | Alabama | Barbour County | 26201 | 13976 | 12225 | 4.2 | 45.7 | 47.8 | 0.2 | 0.6 | 0.0 | 20269 | 33368 | 2551 | 17561 | 798 | 27.2 | 44.9 | 25.0 | 16.8 | 22.6 | 11.5 | 24.1 | 83.4 | 11.1 | 0.3 | 2.2 | 1.7 | 1.3 | 23.4 | 8878 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 |
3 | 1007 | Alabama | Bibb County | 22580 | 12251 | 10329 | 2.4 | 74.6 | 22.0 | 0.4 | 0.0 | 0.0 | 17662 | 43404 | 3431 | 20911 | 1889 | 15.2 | 26.6 | 24.4 | 17.6 | 19.7 | 15.9 | 22.4 | 86.4 | 9.5 | 0.7 | 0.3 | 1.7 | 1.5 | 30.0 | 8171 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 |
4 | 1009 | Alabama | Blount County | 57667 | 28490 | 29177 | 9.0 | 87.4 | 1.5 | 0.3 | 0.1 | 0.0 | 42513 | 47412 | 2630 | 22021 | 850 | 15.6 | 25.4 | 28.5 | 12.9 | 23.3 | 15.8 | 19.5 | 86.8 | 10.2 | 0.1 | 0.4 | 0.4 | 2.1 | 35.0 | 21380 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 |
- 숫자로 표기되어있는 data, 백분율로 표기되어 있는 data가 있음. 백분율 경우 TotalPop을 가중치로 두어 전처리 해야 함.
state_code.head()
State/District | Abbreviation | Postal Code | |
---|---|---|---|
0 | Alabama | Ala. | AL |
1 | Alaska | Alaska | AK |
2 | Arizona | Ariz. | AZ |
3 | Arkansas | Ark. | AR |
4 | California | Calif. | CA |
County별 통계로 데이터프레임 구조 변경
data = df_pres.loc[df_pres['party'].apply(lambda s: str(s) in ['DEM', 'REP'])]
table_pres = pd.pivot_table(data=data, index=['state', 'county'], columns='party', values='total_votes')
table_pres.rename({'DEM':'Pres_DEM', 'REP':'Pres_REP'}, axis=1, inplace=True)
table_pres
party | Pres_DEM | Pres_REP | |
---|---|---|---|
state | county | ||
Alabama | Autauga County | 7503 | 19838 |
Baldwin County | 24578 | 83544 | |
Barbour County | 4816 | 5622 | |
Bibb County | 1986 | 7525 | |
Blount County | 2640 | 24711 | |
... | ... | ... | ... |
Wyoming | Sweetwater County | 3823 | 12229 |
Teton County | 9848 | 4341 | |
Uinta County | 1591 | 7496 | |
Washakie County | 651 | 3245 | |
Weston County | 360 | 3107 |
4633 rows × 2 columns
table_pres.isna().sum()
party
Pres_DEM 0
Pres_REP 0
dtype: int64
data2 = df_gov.loc[df_gov['party'].apply(lambda s: str(s) in ['DEM', 'REP'])]
table_gov = pd.pivot_table(data=data2, index=['state', 'county'], columns='party', values='votes')
table_gov.rename({'DEM':'Gov_DEM', 'REP':'Gov_REP'}, axis=1, inplace=True)
table_gov
party | Gov_DEM | Gov_REP | |
---|---|---|---|
state | county | ||
Delaware | Kent County | 44352 | 39332 |
New Castle County | 191678 | 82545 | |
Sussex County | 56873 | 68435 | |
Indiana | Adams County | 2143 | 9441 |
Allen County | 53895 | 98406 | |
... | ... | ... | ... |
West Virginia | Webster County | 659 | 2552 |
Wetzel County | 1727 | 4559 | |
Wirt County | 483 | 1947 | |
Wood County | 9933 | 26232 | |
Wyoming County | 1240 | 6941 |
1025 rows × 2 columns
table_gov.isna().sum()
party
Gov_DEM 0
Gov_REP 0
dtype: int64
df_census.columns
Index(['CountyId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
'SelfEmployed', 'FamilyWork', 'Unemployment'],
dtype='object')
df_census.drop(['Income', 'IncomeErr', 'IncomePerCapErr'], axis=1, inplace=True)
# state, county 컬럼 소문자로
df_census.rename({'State':'state', 'County':'county'},axis=1, inplace=True)
df_census.drop('CountyId', axis=1, inplace=True)
df_census.set_index(['state', 'county'], inplace=True)
df_census
TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | VotingAgeCitizen | IncomePerCap | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
state | county | |||||||||||||||||||||||||||||||
Alabama | Autauga County | 55036 | 26899 | 28137 | 2.7 | 75.4 | 18.9 | 0.3 | 0.9 | 0.0 | 41016 | 27824 | 13.7 | 20.1 | 35.3 | 18.0 | 23.2 | 8.1 | 15.4 | 86.0 | 9.6 | 0.1 | 0.6 | 1.3 | 2.5 | 25.8 | 24112 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 |
Baldwin County | 203360 | 99527 | 103833 | 4.4 | 83.1 | 9.5 | 0.8 | 0.7 | 0.0 | 155376 | 29364 | 11.8 | 16.1 | 35.7 | 18.2 | 25.6 | 9.7 | 10.8 | 84.7 | 7.6 | 0.1 | 0.8 | 1.1 | 5.6 | 27.0 | 89527 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 | |
Barbour County | 26201 | 13976 | 12225 | 4.2 | 45.7 | 47.8 | 0.2 | 0.6 | 0.0 | 20269 | 17561 | 27.2 | 44.9 | 25.0 | 16.8 | 22.6 | 11.5 | 24.1 | 83.4 | 11.1 | 0.3 | 2.2 | 1.7 | 1.3 | 23.4 | 8878 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 | |
Bibb County | 22580 | 12251 | 10329 | 2.4 | 74.6 | 22.0 | 0.4 | 0.0 | 0.0 | 17662 | 20911 | 15.2 | 26.6 | 24.4 | 17.6 | 19.7 | 15.9 | 22.4 | 86.4 | 9.5 | 0.7 | 0.3 | 1.7 | 1.5 | 30.0 | 8171 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 | |
Blount County | 57667 | 28490 | 29177 | 9.0 | 87.4 | 1.5 | 0.3 | 0.1 | 0.0 | 42513 | 22021 | 15.6 | 25.4 | 28.5 | 12.9 | 23.3 | 15.8 | 19.5 | 86.8 | 10.2 | 0.1 | 0.4 | 0.4 | 2.1 | 35.0 | 21380 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Puerto Rico | Vega Baja Municipio | 54754 | 26269 | 28485 | 96.7 | 3.1 | 0.1 | 0.0 | 0.0 | 0.0 | 42838 | 10197 | 43.8 | 49.4 | 28.6 | 20.2 | 25.9 | 11.1 | 14.2 | 92.0 | 4.2 | 0.9 | 1.4 | 0.6 | 0.9 | 31.6 | 14234 | 76.2 | 19.3 | 4.3 | 0.2 | 16.8 |
Vieques Municipio | 8931 | 4351 | 4580 | 95.7 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7045 | 11136 | 36.8 | 68.2 | 20.9 | 38.4 | 16.4 | 16.9 | 7.3 | 76.3 | 16.9 | 0.0 | 5.0 | 0.0 | 1.7 | 14.9 | 2927 | 40.7 | 40.9 | 18.4 | 0.0 | 12.8 | |
Villalba Municipio | 23659 | 11510 | 12149 | 99.7 | 0.2 | 0.1 | 0.0 | 0.0 | 0.0 | 18053 | 10449 | 50.0 | 67.9 | 22.5 | 21.2 | 22.7 | 14.1 | 19.5 | 83.1 | 11.8 | 0.1 | 2.1 | 0.0 | 2.8 | 28.4 | 6873 | 59.2 | 30.2 | 10.4 | 0.2 | 24.8 | |
Yabucoa Municipio | 35025 | 16984 | 18041 | 99.9 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 27523 | 8672 | 52.4 | 62.1 | 27.7 | 26.0 | 20.7 | 9.5 | 16.0 | 87.6 | 9.2 | 0.0 | 1.4 | 1.8 | 0.1 | 30.5 | 7878 | 62.7 | 30.9 | 6.3 | 0.0 | 25.4 | |
Yauco Municipio | 37585 | 18052 | 19533 | 99.8 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 29763 | 8124 | 50.4 | 58.2 | 30.4 | 20.2 | 25.6 | 12.6 | 11.3 | 82.8 | 8.2 | 2.2 | 1.7 | 0.1 | 5.0 | 24.4 | 8995 | 66.4 | 28.7 | 5.0 | 0.0 | 24.0 |
3220 rows × 31 columns
# 다중공선성을 피하기 위해 총인구컬럼과 겹치게 되는 남성-여성인구수, 유권자수, 고용인수를 비율로 바꿔줌.
df_census.drop('Women', axis=1, inplace=True) # 남성아니면 여성이므로 다중공선성 제거를 위해 여성컬럼 제거.
df_census['Men'] /= df_census['TotalPop']
df_census['VotingAgeCitizen'] /= df_census['TotalPop']
df_census['Employed'] /= df_census['TotalPop']
df_census.head()
TotalPop | Men | Hispanic | White | Black | Native | Asian | Pacific | VotingAgeCitizen | IncomePerCap | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
state | county | ||||||||||||||||||||||||||||||
Alabama | Autauga County | 55036 | 0.488753 | 2.7 | 75.4 | 18.9 | 0.3 | 0.9 | 0.0 | 0.745258 | 27824 | 13.7 | 20.1 | 35.3 | 18.0 | 23.2 | 8.1 | 15.4 | 86.0 | 9.6 | 0.1 | 0.6 | 1.3 | 2.5 | 25.8 | 0.438113 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 |
Baldwin County | 203360 | 0.489413 | 4.4 | 83.1 | 9.5 | 0.8 | 0.7 | 0.0 | 0.764044 | 29364 | 11.8 | 16.1 | 35.7 | 18.2 | 25.6 | 9.7 | 10.8 | 84.7 | 7.6 | 0.1 | 0.8 | 1.1 | 5.6 | 27.0 | 0.440239 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 | |
Barbour County | 26201 | 0.533415 | 4.2 | 45.7 | 47.8 | 0.2 | 0.6 | 0.0 | 0.773596 | 17561 | 27.2 | 44.9 | 25.0 | 16.8 | 22.6 | 11.5 | 24.1 | 83.4 | 11.1 | 0.3 | 2.2 | 1.7 | 1.3 | 23.4 | 0.338842 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 | |
Bibb County | 22580 | 0.542560 | 2.4 | 74.6 | 22.0 | 0.4 | 0.0 | 0.0 | 0.782197 | 20911 | 15.2 | 26.6 | 24.4 | 17.6 | 19.7 | 15.9 | 22.4 | 86.4 | 9.5 | 0.7 | 0.3 | 1.7 | 1.5 | 30.0 | 0.361869 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 | |
Blount County | 57667 | 0.494043 | 9.0 | 87.4 | 1.5 | 0.3 | 0.1 | 0.0 | 0.737215 | 22021 | 15.6 | 25.4 | 28.5 | 12.9 | 23.3 | 15.8 | 19.5 | 86.8 | 10.2 | 0.1 | 0.4 | 0.4 | 2.1 | 35.0 | 0.370749 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 |
# 세 가지 데이터프레임 통합.
df = pd.concat([table_pres, table_gov, df_census], axis=1)
df
Pres_DEM | Pres_REP | Gov_DEM | Gov_REP | TotalPop | Men | Hispanic | White | Black | Native | Asian | Pacific | VotingAgeCitizen | IncomePerCap | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
state | county | ||||||||||||||||||||||||||||||||||
Alabama | Autauga County | 7503.0 | 19838.0 | NaN | NaN | 55036.0 | 0.488753 | 2.7 | 75.4 | 18.9 | 0.3 | 0.9 | 0.0 | 0.745258 | 27824.0 | 13.7 | 20.1 | 35.3 | 18.0 | 23.2 | 8.1 | 15.4 | 86.0 | 9.6 | 0.1 | 0.6 | 1.3 | 2.5 | 25.8 | 0.438113 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 |
Baldwin County | 24578.0 | 83544.0 | NaN | NaN | 203360.0 | 0.489413 | 4.4 | 83.1 | 9.5 | 0.8 | 0.7 | 0.0 | 0.764044 | 29364.0 | 11.8 | 16.1 | 35.7 | 18.2 | 25.6 | 9.7 | 10.8 | 84.7 | 7.6 | 0.1 | 0.8 | 1.1 | 5.6 | 27.0 | 0.440239 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 | |
Barbour County | 4816.0 | 5622.0 | NaN | NaN | 26201.0 | 0.533415 | 4.2 | 45.7 | 47.8 | 0.2 | 0.6 | 0.0 | 0.773596 | 17561.0 | 27.2 | 44.9 | 25.0 | 16.8 | 22.6 | 11.5 | 24.1 | 83.4 | 11.1 | 0.3 | 2.2 | 1.7 | 1.3 | 23.4 | 0.338842 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 | |
Bibb County | 1986.0 | 7525.0 | NaN | NaN | 22580.0 | 0.542560 | 2.4 | 74.6 | 22.0 | 0.4 | 0.0 | 0.0 | 0.782197 | 20911.0 | 15.2 | 26.6 | 24.4 | 17.6 | 19.7 | 15.9 | 22.4 | 86.4 | 9.5 | 0.7 | 0.3 | 1.7 | 1.5 | 30.0 | 0.361869 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 | |
Blount County | 2640.0 | 24711.0 | NaN | NaN | 57667.0 | 0.494043 | 9.0 | 87.4 | 1.5 | 0.3 | 0.1 | 0.0 | 0.737215 | 22021.0 | 15.6 | 25.4 | 28.5 | 12.9 | 23.3 | 15.8 | 19.5 | 86.8 | 10.2 | 0.1 | 0.4 | 0.4 | 2.1 | 35.0 | 0.370749 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Wyoming | Sweetwater County | 3823.0 | 12229.0 | NaN | NaN | 44527.0 | 0.516114 | 16.0 | 79.6 | 0.8 | 0.6 | 0.6 | 0.5 | 0.696768 | 31700.0 | 12.0 | 15.7 | 27.7 | 16.1 | 20.0 | 20.8 | 15.4 | 77.5 | 14.4 | 2.6 | 2.8 | 1.3 | 1.5 | 20.5 | 0.510679 | 78.4 | 17.8 | 3.8 | 0.0 | 5.2 |
Teton County | 9848.0 | 4341.0 | NaN | NaN | 22923.0 | 0.530864 | 15.0 | 81.5 | 0.5 | 0.3 | 2.2 | 0.0 | 0.728177 | 49200.0 | 6.8 | 2.8 | 39.4 | 25.4 | 17.0 | 11.7 | 6.5 | 68.3 | 6.7 | 3.8 | 11.7 | 3.8 | 5.7 | 14.3 | 0.632203 | 82.1 | 11.4 | 6.5 | 0.0 | 1.3 | |
Uinta County | 1591.0 | 7496.0 | NaN | NaN | 20758.0 | 0.510309 | 9.1 | 87.7 | 0.1 | 0.9 | 0.1 | 0.0 | 0.685760 | 27115.0 | 14.9 | 20.0 | 30.4 | 19.4 | 18.1 | 16.1 | 16.1 | 77.4 | 14.9 | 3.3 | 1.1 | 1.3 | 2.0 | 19.9 | 0.459004 | 71.5 | 21.5 | 6.6 | 0.4 | 6.4 | |
Washakie County | 651.0 | 3245.0 | NaN | NaN | 8253.0 | 0.498970 | 14.2 | 82.2 | 0.3 | 0.4 | 0.1 | 0.0 | 0.742154 | 27345.0 | 12.8 | 17.5 | 32.1 | 16.3 | 17.6 | 18.8 | 15.3 | 77.2 | 10.2 | 0.0 | 6.9 | 1.3 | 4.4 | 14.3 | 0.464437 | 69.8 | 22.0 | 8.1 | 0.2 | 6.1 | |
Weston County | 360.0 | 3107.0 | NaN | NaN | 7117.0 | 0.527750 | 1.4 | 91.6 | 0.5 | 0.1 | 4.3 | 0.0 | 0.774203 | 30955.0 | 14.4 | 24.1 | 32.0 | 15.0 | 15.8 | 17.9 | 19.3 | 72.7 | 6.7 | 9.1 | 3.0 | 1.6 | 6.9 | 25.7 | 0.478713 | 68.2 | 21.9 | 8.8 | 1.1 | 2.2 |
4809 rows × 34 columns
- 카운티 지사를 선출하지 않는 county의 경우 NaN값으로 표시된다.
컬럼간 상관관계 살펴보기
plt.figure(figsize=(10,10))
sns.heatmap(df.corr())
plt.show()
- 전처리가 덜 되었음. 민주당을 뽑은 인원이 공화당에도 큰 영향을 미친다는 것은 인구자체에 영향을 받아 다중공선성이 일어나고 있다는 증거.
- 비율로 바꾸어 줄 필요가 있음.
- Asian 투표율이 높음.
df_norm = df.copy()
df_norm['Pres_DEM'] /= df['Pres_DEM'] + df['Pres_REP']
df_norm['Pres_REP'] /= df['Pres_DEM'] + df['Pres_REP']
df_norm['Gov_DEM'] /= df['Gov_DEM'] + df['Gov_REP']
df_norm['Gov_REP'] /= df['Gov_DEM'] + df['Gov_REP']
plt.figure(figsize=(5,10))
sns.heatmap(df_norm.corr()[['Pres_DEM', 'Pres_REP']], annot=True)
plt.show()
- 인구가 많은 county일수록 민주당 지지, 백인은 공화동 유색인종은민주당
- 전문직,서비스직,사무직 민주당. 건설,생산,운송 공화당.
sns.jointplot(data=df_norm, x='White', y='Pres_REP', kind='hex')
<seaborn.axisgrid.JointGrid at 0x1bbd39a4f48>
- 단순히 백인비율이 많다고 공화당 지지가 높지는 않음. 아마 직업의 영향이 반영된듯 하다.
sns.jointplot(data=df_norm, x='White', y='Pres_REP', hue='Professional')
C:\Users\dissi\anaconda31\lib\site-packages\seaborn\distributions.py:306: UserWarning: Dataset has 0 variance; skipping density estimate.
warnings.warn(msg, UserWarning)
C:\Users\dissi\anaconda31\lib\site-packages\seaborn\distributions.py:306: UserWarning: Dataset has 0 variance; skipping density estimate.
warnings.warn(msg, UserWarning)
<seaborn.axisgrid.JointGrid at 0x1bbd3dd1f48>
- 아래로 내려갈수록(공화당 지지가 낮을수록) 전문직비율이 높음.
sns.jointplot(data=df_norm, x='Black', y='Pres_DEM', alpha=0.2)
<seaborn.axisgrid.JointGrid at 0x1bbd3934488>
- 상관성은 확실히 있으나 위의 plot과 마찬가지로 black 비율이 낮다고 민주당 지지도 낮지는 않음.
- 또한 흑인이 많은 county 자체도 많지는 않아 전체 데이터에 큰 영향을 주지는 않음.
Plotly로 시각화
전처리
import plotly.figure_factory as ff
df_sample = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/laucnty16.csv')
df_sample['State FIPS Code'] = df_sample['State FIPS Code'].apply(lambda x: str(x).zfill(2))
df_sample['County FIPS Code'] = df_sample['County FIPS Code'].apply(lambda x: str(x).zfill(3))
df_sample['FIPS'] = df_sample['State FIPS Code'] + df_sample['County FIPS Code']
colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
"#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
"#08519c","#0b4083","#08306b"]
df_sample
LAUS Code | State FIPS Code | County FIPS Code | County Name/State Abbreviation | Year | Labor Force | Employed | Unemployed | Unemployment Rate (%) | FIPS | |
---|---|---|---|---|---|---|---|---|---|---|
0 | CN0100100000000 | 01 | 001 | Autauga County, AL | 2016 | 25,649 | 24,297 | 1,352 | 5.3 | 01001 |
1 | CN0100300000000 | 01 | 003 | Baldwin County, AL | 2016 | 89,931 | 85,061 | 4,870 | 5.4 | 01003 |
2 | CN0100500000000 | 01 | 005 | Barbour County, AL | 2016 | 8,302 | 7,584 | 718 | 8.6 | 01005 |
3 | CN0100700000000 | 01 | 007 | Bibb County, AL | 2016 | 8,573 | 8,004 | 569 | 6.6 | 01007 |
4 | CN0100900000000 | 01 | 009 | Blount County, AL | 2016 | 24,525 | 23,171 | 1,354 | 5.5 | 01009 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3214 | CN7214500000000 | 72 | 145 | Vega Baja Municipio, PR | 2016 | 13,812 | 11,894 | 1,918 | 13.9 | 72145 |
3215 | CN7214700000000 | 72 | 147 | Vieques Municipio, PR | 2016 | 3,287 | 2,939 | 348 | 10.6 | 72147 |
3216 | CN7214900000000 | 72 | 149 | Villalba Municipio, PR | 2016 | 7,860 | 6,273 | 1,587 | 20.2 | 72149 |
3217 | CN7215100000000 | 72 | 151 | Yabucoa Municipio, PR | 2016 | 9,137 | 7,591 | 1,546 | 16.9 | 72151 |
3218 | CN7215300000000 | 72 | 153 | Yauco Municipio, PR | 2016 | 10,815 | 8,783 | 2,032 | 18.8 | 72153 |
3219 rows × 10 columns
state_code.head()
State/District | Abbreviation | Postal Code | |
---|---|---|---|
0 | Alabama | Ala. | AL |
1 | Alaska | Alaska | AK |
2 | Arizona | Ariz. | AZ |
3 | Arkansas | Ark. | AR |
4 | California | Calif. | CA |
state_map = state_code.set_index('State/District')['Postal Code']
state_map
State/District
Alabama AL
Alaska AK
Arizona AZ
Arkansas AR
California CA
Colorado CO
Connecticut CT
Delaware DE
District of Columbia DC
Florida FL
Georgia GA
Hawaii HI
Idaho ID
Illinois IL
Indiana IN
Iowa IA
Kansas KS
Kentucky KY
Louisiana LA
Maine ME
Maryland MD
Massachusetts MA
Michigan MI
Minnesota MN
Mississippi MS
Missouri MO
Montana MT
Nebraska NE
Nevada NV
New Hampshire NH
New Jersey NJ
New Mexico NM
New York NY
North Carolina NC
North Dakota ND
Ohio OH
Oklahoma OK
Oregon OR
Pennsylvania PA
Rhode Island RI
South Carolina SC
South Dakota SD
Tennessee TN
Texas TX
Utah UT
Vermont VT
Virginia VA
Washington WA
West Virginia WV
Wisconsin WI
Wyoming WY
Name: Postal Code, dtype: object
counties = df_norm.reset_index()['county'] + ', ' + df_norm.reset_index()['state'].map(state_map)
counties
0 Autauga County, AL
1 Baldwin County, AL
2 Barbour County, AL
3 Bibb County, AL
4 Blount County, AL
...
4804 Sweetwater County, WY
4805 Teton County, WY
4806 Uinta County, WY
4807 Washakie County, WY
4808 Weston County, WY
Length: 4809, dtype: object
counties_to_fips = df_sample.set_index('County Name/State Abbreviation')['FIPS']
counties_to_fips
County Name/State Abbreviation
Autauga County, AL 01001
Baldwin County, AL 01003
Barbour County, AL 01005
Bibb County, AL 01007
Blount County, AL 01009
...
Vega Baja Municipio, PR 72145
Vieques Municipio, PR 72147
Villalba Municipio, PR 72149
Yabucoa Municipio, PR 72151
Yauco Municipio, PR 72153
Name: FIPS, Length: 3219, dtype: object
fips = counties.map(counties_to_fips)
fips
# df_norm 의 지리정보를 fips코드 하나로 바꿈.
0 01001
1 01003
2 01005
3 01007
4 01009
...
4804 56037
4805 56039
4806 56041
4807 56043
4808 56045
Length: 4809, dtype: object
fips.isna().sum()
1681
data = df_norm.reset_index()['Pres_DEM'][fips.notna()]
fips = fips[fips.notna()]
# data는 민주당 지지율을 index로 새로 구성하되 fips가 null값이 아닌 번호를 골라서
# Fips코드는 choropleth를 사용하기 위해(가시화)
시각화
fig = ff.create_choropleth(
fips=fips, values=data,
show_state_data = False,
colorscale=colorscale,
binning_endpoints=list(np.linspace(0.0, 1.0, len(colorscale) - 2)),
show_hover=True, centroid_marker={'opacity':0},
asp = 2.9, title="USA by Voting for DEM President"
)
fig.layout.template = None
fig.show()
전처리
df_norm.columns
Index(['Pres_DEM', 'Pres_REP', 'Gov_DEM', 'Gov_REP', 'TotalPop', 'Men',
'Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific',
'VotingAgeCitizen', 'IncomePerCap', 'Poverty', 'ChildPoverty',
'Professional', 'Service', 'Office', 'Construction', 'Production',
'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed',
'FamilyWork', 'Unemployment'],
dtype='object')
df_norm.dropna(inplace=True)
X = df_norm.drop(['Pres_DEM', 'Pres_REP', 'Gov_DEM', 'Gov_REP'], axis=1)
y = df_norm['Pres_DEM']
# 수치형 데이터 표준화
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X= pd.DataFrame(data=X_scaled, index=X.index, columns=X.columns)
X.head()
TotalPop | Men | Hispanic | White | Black | Native | Asian | Pacific | VotingAgeCitizen | IncomePerCap | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
state | county | ||||||||||||||||||||||||||||||
Delaware | Kent County | 0.651462 | -0.946145 | 0.346057 | -1.423723 | 1.751813 | -0.191121 | 0.783427 | -0.393531 | -0.388638 | 0.313498 | -0.444312 | -0.300398 | 0.327031 | 0.447702 | 0.504056 | -0.807226 | -0.379883 | 0.447917 | -0.170797 | 0.559713 | -0.455170 | -0.337372 | -0.325794 | 0.608072 | 0.209912 | -0.063847 | 0.791294 | -0.836585 | -0.243581 | 0.265911 |
New Castle County | 3.043033 | -0.848175 | 0.828313 | -1.695454 | 1.751813 | -0.239442 | 3.218819 | -0.393531 | -0.665857 | 1.606755 | -0.638605 | -0.567309 | 1.978537 | -0.315791 | 0.658756 | -1.535077 | -1.212547 | 0.221028 | -0.857051 | 3.061686 | -0.301446 | -0.337372 | -0.325794 | 0.492385 | 0.801126 | 1.168959 | -0.715528 | -0.999491 | -0.404976 | 0.192397 | |
Sussex County | 0.917027 | -0.818852 | 0.759420 | -0.634409 | 0.622026 | -0.215282 | 0.210394 | -0.393531 | 0.227859 | 1.122397 | -0.620942 | -0.044163 | 0.295272 | 0.147758 | 0.844395 | -0.469295 | -0.543150 | 0.632264 | -1.001526 | 0.112932 | -0.506412 | -0.141175 | 0.055883 | 0.415259 | -0.026356 | 0.503244 | -0.251891 | -0.470046 | -0.404976 | 0.045370 | |
Indiana | Adams County | -0.213551 | -0.230718 | -0.067305 | 0.581911 | -0.432442 | -0.263602 | -0.362639 | -0.393531 | -1.625095 | -0.755470 | 0.597804 | 0.916716 | -1.149795 | -0.615735 | -0.516958 | 0.232561 | 1.660960 | -0.658166 | 2.393626 | -0.512562 | -0.250204 | 0.447416 | -0.244006 | -0.066773 | -0.205611 | 0.971710 | -1.344751 | -0.001691 | 0.401999 | -0.285443 |
Allen County | 1.870146 | -0.656809 | 0.414951 | -0.653818 | 0.546706 | -0.239442 | 1.857865 | -0.393531 | -1.068441 | 0.224871 | -0.144041 | 0.083954 | 0.358791 | -0.452129 | 0.937214 | -1.405104 | 0.289514 | 0.759889 | -0.495865 | 0.202288 | -0.608895 | -0.533569 | -0.380320 | -0.355993 | 0.707837 | 1.477161 | -1.278517 | -0.816222 | -0.404976 | 0.118884 |
학습, 테스트 데이터 분리
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
PCA 이용 (Columns 수가 많음.)
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X_train)
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_)
plt.grid()
# 1부터 30개의 Columns. 10정도의 변수만 되어도 충분히 설명가능.
pca = PCA(n_components=10)
pca.fit(X_train)
PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
모델 적용
LightGBM 모델
from lightgbm import LGBMRegressor
model_reg = LGBMRegressor()
model_reg.fit(pca.transform(X_train), y_train)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import classification_report
from math import sqrt
pred = model_reg.predict(pca.transform(X_test))
print(mean_absolute_error(y_test, pred))
print(sqrt(mean_squared_error(y_test, pred)))
0.06323869955871446
0.08461257022993489
print(classification_report(y_test > 0.5, pred > 0.5))
# y = Pres_DEM 0.5 이상일경우 True
precision recall f1-score support
False 0.95 0.96 0.96 146
True 0.62 0.59 0.61 17
accuracy 0.92 163
macro avg 0.79 0.77 0.78 163
weighted avg 0.92 0.92 0.92 163
XGBoost 모델
from xgboost import XGBClassifier
model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train > 0.5)
[23:07:34] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=4, num_parallel_tree=1,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', use_label_encoder=True,
validate_parameters=1, verbosity=None)
plt.figure(figsize=(12,10))
plt.barh(x.columns, model_xgb.feature_importances_)
plt.show()
# 민주당 표에 영향을 미치는 것. 인구는 당연한 변수이므로 제외한다면 asian, transit, publicwork 눈여겨볼만하다.
pred = model_xgb.predict(X_test)
print(classification_report(y_test>0.5 , pred))
precision recall f1-score support
False 0.96 0.97 0.97 146
True 0.73 0.65 0.69 17
accuracy 0.94 163
macro avg 0.85 0.81 0.83 163
weighted avg 0.94 0.94 0.94 163