Oh-Seung-Rok

Hello. I'm Oh Seung-Rok, and I want to become an analyst who improves products and services through quantitative analysis and anticipates what consumers want. Portfolio [https://seungrok0317.com]

Kaggle - US Election 2020

19 Mar 2021 » Kaggle

US Election 2020

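Dataset

https://www.kaggle.com/unanimad/us-election-2020 (election data) https://www.kaggle.com/muonneutrino/us-census-demographic-data (demographic data)

  • Together, these datasets let us examine how voting patterns vary with race, circumstances, and background, using conditions observed before the presidential election.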

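Library setup and loading the data

  • president_county_candidate.csv : presidential election results
  • governors_county_candidate.csv : governor election results by county
  • senate_county_candidate.csv : Senate election results
  • house_candidate.csv : House of Representatives election results

  • state : state
  • county : county
  • district : district
  • candidate : candidate
  • party : the candidate's party
  • total_votes : number of votes received
  • won : whether the candidate won in that area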
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df_pres = pd.read_csv("us-election-2020/president_county_candidate.csv")
df_gov = pd.read_csv("us-election-2020/governors_county_candidate.csv")
# Only these two are used for now.

# df_sen = pd.read_csv('us-election-2020/senate_county_candidate.csv')
# df_hou = pd.read_csv('us-election-2020/house_candidate.csv')

df_census = pd.read_csv("./us-census-demographic/acs2017_county_data.csv")  # census demographic data
# Supplementary table of state codes
state_code = pd.read_html('https://www.infoplease.com/us/postal-information/state-abbreviations-and-state-postal-codes')[0]

EDA and basic statistical analysis

df_pres.head()
      state             county      candidate  party  total_votes    won
0  Delaware        Kent County      Joe Biden    DEM        44552   True
1  Delaware        Kent County   Donald Trump    REP        41009  False
2  Delaware        Kent County   Jo Jorgensen    LIB         1044  False
3  Delaware        Kent County  Howie Hawkins    GRN          420  False
4  Delaware  New Castle County      Joe Biden    DEM       195034   True
df_pres['candidate'].unique()
array(['Joe Biden', 'Donald Trump', 'Jo Jorgensen', 'Howie Hawkins',
       ' Write-ins', 'Gloria La Riva', 'Brock Pierce',
       'Rocky De La Fuente', 'Don Blankenship', 'Kanye West',
       'Brian Carroll', 'Ricki Sue King', 'Jade Simmons',
       'President Boddie', 'Bill Hammons', 'Tom Hoefling',
       'Alyson Kennedy', 'Jerome Segal', 'Phil Collins',
       ' None of these candidates', 'Sheila Samm Tittle', 'Dario Hunter',
       'Joe McHugh', 'Christopher LaFontaine', 'Keith McCormic',
       'Brooke Paige', 'Gary Swing', 'Richard Duncan', 'Blake Huber',
       'Kyle Kopitke', 'Zachary Scalf', 'Jesse Ventura', 'Connie Gammon',
       'John Richard Myers', 'Mark Charles', 'Princess Jacob-Fambro',
       'Joseph Kishore', 'Jordan Scott'], dtype=object)
df_pres.loc[df_pres['candidate'] == 'Jo Jorgensen']['total_votes'].sum()
1874183
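
The same kind of total can be computed for every candidate at once with a groupby. A minimal sketch, assuming a small hand-built frame (`sample` is a hypothetical name) that reuses the five rows printed by df_pres.head():

```python
import pandas as pd

# Hypothetical mini-frame built from the five rows shown by df_pres.head().
sample = pd.DataFrame({
    "county": ["Kent County"] * 4 + ["New Castle County"],
    "candidate": ["Joe Biden", "Donald Trump", "Jo Jorgensen",
                  "Howie Hawkins", "Joe Biden"],
    "total_votes": [44552, 41009, 1044, 420, 195034],
})

# Sum the votes per candidate and sort from most to fewest votes.
totals = (sample.groupby("candidate")["total_votes"]
                .sum()
                .sort_values(ascending=False))
print(totals)
```

On the full df_pres the same two calls would rank all candidates nationally.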
df_gov.head()
      state             county        candidate  party   votes    won
0  Delaware        Kent County      John Carney    DEM   44352   True
1  Delaware        Kent County  Julianne Murray    REP   39332  False
2  Delaware        Kent County  Kathy DeMatteis    IPD    1115  False
3  Delaware        Kent County    John Machurek    LIB     616  False
4  Delaware  New Castle County      John Carney    DEM  191678   True
pd.set_option('display.max_columns', None)
df_census.head()
   CountyId    State          County  TotalPop    Men   Women  Hispanic  White  Black  Native  Asian  Pacific  \
0      1001  Alabama  Autauga County     55036  26899   28137       2.7   75.4   18.9     0.3    0.9      0.0
1      1003  Alabama  Baldwin County    203360  99527  103833       4.4   83.1    9.5     0.8    0.7      0.0
2      1005  Alabama  Barbour County     26201  13976   12225       4.2   45.7   47.8     0.2    0.6      0.0
3      1007  Alabama     Bibb County     22580  12251   10329       2.4   74.6   22.0     0.4    0.0      0.0
4      1009  Alabama   Blount County     57667  28490   29177       9.0   87.4    1.5     0.3    0.1      0.0

   VotingAgeCitizen  Income  IncomeErr  IncomePerCap  IncomePerCapErr  Poverty  ChildPoverty  Professional  Service  Office  \
0             41016   55317       2838         27824             2024     13.7          20.1          35.3     18.0    23.2
1            155376   52562       1348         29364              735     11.8          16.1          35.7     18.2    25.6
2             20269   33368       2551         17561              798     27.2          44.9          25.0     16.8    22.6
3             17662   43404       3431         20911             1889     15.2          26.6          24.4     17.6    19.7
4             42513   47412       2630         22021              850     15.6          25.4          28.5     12.9    23.3

   Construction  Production  Drive  Carpool  Transit  Walk  OtherTransp  WorkAtHome  MeanCommute  Employed  \
0           8.1        15.4   86.0      9.6      0.1   0.6          1.3         2.5         25.8     24112
1           9.7        10.8   84.7      7.6      0.1   0.8          1.1         5.6         27.0     89527
2          11.5        24.1   83.4     11.1      0.3   2.2          1.7         1.3         23.4      8878
3          15.9        22.4   86.4      9.5      0.7   0.3          1.7         1.5         30.0      8171
4          15.8        19.5   86.8     10.2      0.1   0.4          0.4         2.1         35.0     21380

   PrivateWork  PublicWork  SelfEmployed  FamilyWork  Unemployment
0         74.1        20.2           5.6         0.1           5.2
1         80.7        12.9           6.3         0.1           5.5
2         74.1        19.1           6.5         0.3          12.4
3         76.0        17.4           6.3         0.3           8.2
4         83.9        11.9           4.0         0.1           4.9
  • Some columns are raw counts and others are percentages. The percentage columns need to be preprocessed with TotalPop as the weight.
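
One way to read this note: when several counties are combined, a percentage column should be averaged with TotalPop as the weight rather than with a plain mean. A minimal sketch, assuming a hypothetical helper `weighted_pct` and two counties taken from the df_census.head() output:

```python
import numpy as np
import pandas as pd

# Two counties from df_census.head(); only the columns needed here.
counties = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "County": ["Autauga County", "Baldwin County"],
    "TotalPop": [55036, 203360],
    "White": [75.4, 83.1],   # percentage of each county's population
})

def weighted_pct(group: pd.DataFrame, col: str) -> float:
    """Population-weighted average of a percentage column (hypothetical helper)."""
    return float(np.average(group[col], weights=group["TotalPop"]))

plain_mean = counties["White"].mean()            # treats both counties equally
weighted_mean = weighted_pct(counties, "White")  # the larger county counts more
print(round(plain_mean, 2), round(weighted_mean, 2))
```

The weighted value sits closer to Baldwin County's figure because that county has almost four times the population.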
state_code.head()
  State/District Abbreviation Postal Code
0        Alabama         Ala.          AL
1         Alaska       Alaska          AK
2        Arizona        Ariz.          AZ
3       Arkansas         Ark.          AR
4     California       Calif.          CA

Restructuring the DataFrames into per-county statistics

# Keep only the rows for the two major parties.
data = df_pres.loc[df_pres['party'].isin(['DEM', 'REP'])]

table_pres = pd.pivot_table(data=data, index=['state', 'county'], columns='party', values='total_votes')
table_pres.rename({'DEM':'Pres_DEM', 'REP':'Pres_REP'}, axis=1, inplace=True)
table_pres
party                      Pres_DEM  Pres_REP
state   county
Alabama Autauga County         7503     19838
        Baldwin County        24578     83544
        Barbour County         4816      5622
        Bibb County            1986      7525
        Blount County          2640     24711
...                             ...       ...
Wyoming Sweetwater County      3823     12229
        Teton County           9848      4341
        Uinta County           1591      7496
        Washakie County         651      3245
        Weston County           360      3107

4633 rows × 2 columns

table_pres.isna().sum()
party
Pres_DEM    0
Pres_REP    0
dtype: int64
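
`pd.pivot_table` aggregates with the mean by default, so if a (state, county, party) key ever appeared more than once, the votes would be averaged rather than added. A sketch of that duplicate-key case with made-up numbers, comparing the default with an explicit `aggfunc="sum"`:

```python
import pandas as pd

# Made-up rows: the DEM entry for one county is split across two records.
rows = pd.DataFrame({
    "state": ["X", "X", "X"],
    "county": ["A County", "A County", "A County"],
    "party": ["DEM", "DEM", "REP"],
    "total_votes": [100, 50, 120],
})

index = ["state", "county"]
# Default aggregation is the mean: the two DEM records are averaged to 75.
by_mean = pd.pivot_table(rows, index=index, columns="party", values="total_votes")
# Summing keeps every vote: the DEM total is 150.
by_sum = pd.pivot_table(rows, index=index, columns="party",
                        values="total_votes", aggfunc="sum")
print(by_mean)
print(by_sum)
```

If every key is unique, as it may well be in this dataset, both calls give the same table.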
# Keep only the rows for the two major parties.
data2 = df_gov.loc[df_gov['party'].isin(['DEM', 'REP'])]

table_gov = pd.pivot_table(data=data2, index=['state', 'county'], columns='party', values='votes')
table_gov.rename({'DEM':'Gov_DEM', 'REP':'Gov_REP'}, axis=1, inplace=True)
table_gov
party                            Gov_DEM  Gov_REP
state         county
Delaware      Kent County          44352    39332
              New Castle County   191678    82545
              Sussex County        56873    68435
Indiana       Adams County          2143     9441
              Allen County         53895    98406
...                                  ...      ...
West Virginia Webster County         659     2552
              Wetzel County         1727     4559
              Wirt County            483     1947
              Wood County           9933    26232
              Wyoming County        1240     6941

1025 rows × 2 columns

table_gov.isna().sum()
party
Gov_DEM    0
Gov_REP    0
dtype: int64
df_census.columns
Index(['CountyId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
       'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
       'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
       'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
       'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
       'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
       'SelfEmployed', 'FamilyWork', 'Unemployment'],
      dtype='object')
df_census.drop(['Income', 'IncomeErr', 'IncomePerCapErr'], axis=1, inplace=True)
# Rename the State and County columns to lowercase
df_census.rename({'State':'state', 'County':'county'},axis=1, inplace=True)
df_census.drop('CountyId', axis=1, inplace=True)
df_census.set_index(['state', 'county'], inplace=True)
df_census
TotalPopMenWomenHispanicWhiteBlackNativeAsianPacificVotingAgeCitizenIncomePerCapPovertyChildPovertyProfessionalServiceOfficeConstructionProductionDriveCarpoolTransitWalkOtherTranspWorkAtHomeMeanCommuteEmployedPrivateWorkPublicWorkSelfEmployedFamilyWorkUnemployment
statecounty
AlabamaAutauga County5503626899281372.775.418.90.30.90.0410162782413.720.135.318.023.28.115.486.09.60.10.61.32.525.82411274.120.25.60.15.2
Baldwin County203360995271038334.483.19.50.80.70.01553762936411.816.135.718.225.69.710.884.77.60.10.81.15.627.08952780.712.96.30.15.5
Barbour County2620113976122254.245.747.80.20.60.0202691756127.244.925.016.822.611.524.183.411.10.32.21.71.323.4887874.119.16.50.312.4
Bibb County2258012251103292.474.622.00.40.00.0176622091115.226.624.417.619.715.922.486.49.50.70.31.71.530.0817176.017.46.30.38.2
Blount County5766728490291779.087.41.50.30.10.0425132202115.625.428.512.923.315.819.586.810.20.10.40.42.135.02138083.911.94.00.14.9
...................................................................................................
Puerto RicoVega Baja Municipio54754262692848596.73.10.10.00.00.0428381019743.849.428.620.225.911.114.292.04.20.91.40.60.931.61423476.219.34.30.216.8
Vieques Municipio89314351458095.74.00.00.00.00.070451113636.868.220.938.416.416.97.376.316.90.05.00.01.714.9292740.740.918.40.012.8
Villalba Municipio23659115101214999.70.20.10.00.00.0180531044950.067.922.521.222.714.119.583.111.80.12.10.02.828.4687359.230.210.40.224.8
Yabucoa Municipio35025169841804199.90.10.00.00.00.027523867252.462.127.726.020.79.516.087.69.20.01.41.80.130.5787862.730.96.30.025.4
Yauco Municipio37585180521953399.80.20.00.00.00.029763812450.458.230.420.225.612.611.382.88.22.21.70.15.024.4899566.428.75.00.024.0

3220 rows × 31 columns

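# To avoid multicollinearity, convert the counts that overlap with total population (men/women, voting-age citizens, employed) into ratios.
df_census.drop('Women', axis=1, inplace=True)  # Men and Women are complementary, so drop Women to remove the collinearity.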

df_census['Men'] /= df_census['TotalPop']
df_census['VotingAgeCitizen'] /= df_census['TotalPop']
df_census['Employed'] /= df_census['TotalPop']
df_census.head()
TotalPopMenHispanicWhiteBlackNativeAsianPacificVotingAgeCitizenIncomePerCapPovertyChildPovertyProfessionalServiceOfficeConstructionProductionDriveCarpoolTransitWalkOtherTranspWorkAtHomeMeanCommuteEmployedPrivateWorkPublicWorkSelfEmployedFamilyWorkUnemployment
statecounty
AlabamaAutauga County550360.4887532.775.418.90.30.90.00.7452582782413.720.135.318.023.28.115.486.09.60.10.61.32.525.80.43811374.120.25.60.15.2
Baldwin County2033600.4894134.483.19.50.80.70.00.7640442936411.816.135.718.225.69.710.884.77.60.10.81.15.627.00.44023980.712.96.30.15.5
Barbour County262010.5334154.245.747.80.20.60.00.7735961756127.244.925.016.822.611.524.183.411.10.32.21.71.323.40.33884274.119.16.50.312.4
Bibb County225800.5425602.474.622.00.40.00.00.7821972091115.226.624.417.619.715.922.486.49.50.70.31.71.530.00.36186976.017.46.30.38.2
Blount County576670.4940439.087.41.50.30.10.00.7372152202115.625.428.512.923.315.819.586.810.20.10.40.42.135.00.37074983.911.94.00.14.9
# Combine the three DataFrames.
df = pd.concat([table_pres, table_gov, df_census], axis=1)
df
Pres_DEMPres_REPGov_DEMGov_REPTotalPopMenHispanicWhiteBlackNativeAsianPacificVotingAgeCitizenIncomePerCapPovertyChildPovertyProfessionalServiceOfficeConstructionProductionDriveCarpoolTransitWalkOtherTranspWorkAtHomeMeanCommuteEmployedPrivateWorkPublicWorkSelfEmployedFamilyWorkUnemployment
statecounty
AlabamaAutauga County7503.019838.0NaNNaN55036.00.4887532.775.418.90.30.90.00.74525827824.013.720.135.318.023.28.115.486.09.60.10.61.32.525.80.43811374.120.25.60.15.2
Baldwin County24578.083544.0NaNNaN203360.00.4894134.483.19.50.80.70.00.76404429364.011.816.135.718.225.69.710.884.77.60.10.81.15.627.00.44023980.712.96.30.15.5
Barbour County4816.05622.0NaNNaN26201.00.5334154.245.747.80.20.60.00.77359617561.027.244.925.016.822.611.524.183.411.10.32.21.71.323.40.33884274.119.16.50.312.4
Bibb County1986.07525.0NaNNaN22580.00.5425602.474.622.00.40.00.00.78219720911.015.226.624.417.619.715.922.486.49.50.70.31.71.530.00.36186976.017.46.30.38.2
Blount County2640.024711.0NaNNaN57667.00.4940439.087.41.50.30.10.00.73721522021.015.625.428.512.923.315.819.586.810.20.10.40.42.135.00.37074983.911.94.00.14.9
............................................................................................................
WyomingSweetwater County3823.012229.0NaNNaN44527.00.51611416.079.60.80.60.60.50.69676831700.012.015.727.716.120.020.815.477.514.42.62.81.31.520.50.51067978.417.83.80.05.2
Teton County9848.04341.0NaNNaN22923.00.53086415.081.50.50.32.20.00.72817749200.06.82.839.425.417.011.76.568.36.73.811.73.85.714.30.63220382.111.46.50.01.3
Uinta County1591.07496.0NaNNaN20758.00.5103099.187.70.10.90.10.00.68576027115.014.920.030.419.418.116.116.177.414.93.31.11.32.019.90.45900471.521.56.60.46.4
Washakie County651.03245.0NaNNaN8253.00.49897014.282.20.30.40.10.00.74215427345.012.817.532.116.317.618.815.377.210.20.06.91.34.414.30.46443769.822.08.10.26.1
Weston County360.03107.0NaNNaN7117.00.5277501.491.60.50.14.30.00.77420330955.014.424.132.015.015.817.919.372.76.79.13.01.66.925.70.47871368.221.98.81.12.2

4809 rows × 34 columns

  • Counties in states that held no governor's election appear with NaN in the Gov_DEM and Gov_REP columns.
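
The NaN values come from the outer join that `pd.concat(..., axis=1)` performs on the (state, county) index. A small sketch with hand-built frames (`pres` and `gov` are hypothetical stand-ins for table_pres and table_gov):

```python
import pandas as pd

idx_pres = pd.MultiIndex.from_tuples(
    [("Alabama", "Autauga County"), ("Delaware", "Kent County")],
    names=["state", "county"])
idx_gov = pd.MultiIndex.from_tuples(
    [("Delaware", "Kent County")], names=["state", "county"])

# Values taken from the tables above.
pres = pd.DataFrame({"Pres_DEM": [7503, 44552]}, index=idx_pres)
gov = pd.DataFrame({"Gov_DEM": [44352]}, index=idx_gov)

# axis=1 aligns on the index and keeps the union of keys (outer join),
# so a county missing from `gov` gets NaN in the governor column.
merged = pd.concat([pres, gov], axis=1)
print(merged)
print(merged["Gov_DEM"].isna().sum())
```

Passing `join="inner"` instead would keep only the keys present in every frame.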

Examining correlations between columns

plt.figure(figsize=(10,10))
sns.heatmap(df.corr())
plt.show()

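[Figure output_30_0: correlation heatmap of all columns in df]

  • Preprocessing is still incomplete. The strong correlation between the Democratic and Republican vote counts is evidence that both are driven by population size, which produces multicollinearity.
  • The vote counts need to be converted to shares.
  • The Asian share correlates strongly with the vote counts.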
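
Why raw counts mislead while shares do not can be seen on synthetic data. The sketch below does not use the election data; the population distribution and share range are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Synthetic counties (not the real data): skewed populations and a
# Democratic share drawn independently of population.
population = rng.lognormal(mean=10.0, sigma=1.5, size=n)
dem_share = rng.uniform(0.2, 0.8, size=n)

dem_votes = population * dem_share
rep_votes = population * (1.0 - dem_share)

# Raw counts move together because both scale with population.
corr_counts = np.corrcoef(dem_votes, rep_votes)[0, 1]
# Two-party shares sum to 1, so their correlation is -1.
total = dem_votes + rep_votes
corr_shares = np.corrcoef(dem_votes / total, rep_votes / total)[0, 1]
print(corr_counts, corr_shares)
```

Even though the share is independent of population here, the two count columns come out positively correlated, which is the population effect described in the notes above.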
df_norm = df.copy()
df_norm['Pres_DEM'] /= df['Pres_DEM'] + df['Pres_REP']
df_norm['Pres_REP'] /= df['Pres_DEM'] + df['Pres_REP']
df_norm['Gov_DEM'] /= df['Gov_DEM'] + df['Gov_REP']
df_norm['Gov_REP'] /= df['Gov_DEM'] + df['Gov_REP']
plt.figure(figsize=(5,10))
sns.heatmap(df_norm.corr()[['Pres_DEM', 'Pres_REP']], annot=True)
plt.show()

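[Figure output_33_0: correlation of each column with Pres_DEM and Pres_REP]

  • Counties with larger populations lean Democratic. A higher White share goes with Republican support, and higher shares of people of color go with Democratic support.
  • Professional, service, and office occupations lean Democratic; construction, production, and transportation lean Republican.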
sns.jointplot(data=df_norm, x='White', y='Pres_REP', kind='hex')

[Figure output_35_1: hexbin joint plot of White against Pres_REP]

  • A high White share alone does not mean high Republican support. Occupation probably plays a part as well.
sns.jointplot(data=df_norm, x='White', y='Pres_REP', hue='Professional')

[Figure output_37_2: joint plot of White against Pres_REP, coloured by Professional]

  • Moving down the plot (lower Republican support), the share of professionals increases.
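
Since Professional is a continuous percentage, grouping it into a few bins can make a hue-coloured plot easier to read. A sketch using `pd.qcut` on a hypothetical frame `demo` with illustrative values; the three bin labels are assumptions:

```python
import pandas as pd

# Hypothetical stand-in for df_norm with illustrative values.
demo = pd.DataFrame({
    "White": [45.7, 74.6, 75.4, 83.1, 87.4, 81.5],
    "Pres_REP": [0.54, 0.79, 0.73, 0.77, 0.90, 0.31],
    "Professional": [25.0, 24.4, 35.3, 35.7, 28.5, 39.4],
})

# Split Professional into three equal-sized groups (terciles).
demo["Professional_bin"] = pd.qcut(
    demo["Professional"], q=3, labels=["low", "mid", "high"])
print(demo["Professional_bin"].value_counts())

# The binned column can then be passed as hue, for example:
# sns.jointplot(data=demo, x="White", y="Pres_REP", hue="Professional_bin")
```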
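sns.jointplot(data=df_norm, x='Black', y='Pres_DEM', alpha=0.2)

[Figure output_39_1: joint plot of Black against Pres_DEM]

  • The correlation is clear, but as in the plot above, a low Black share does not mean low Democratic support.
  • Counties with a large Black population are also relatively few, so they have little effect on the data as a whole.

Visualization with Plotly

Preprocessing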

import plotly.figure_factory as ff

df_sample = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/laucnty16.csv')
df_sample['State FIPS Code'] = df_sample['State FIPS Code'].apply(lambda x: str(x).zfill(2))
df_sample['County FIPS Code'] = df_sample['County FIPS Code'].apply(lambda x: str(x).zfill(3))
df_sample['FIPS'] = df_sample['State FIPS Code'] + df_sample['County FIPS Code']

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]
df_sample
            LAUS Code State FIPS Code County FIPS Code County Name/State Abbreviation  Year  \
0     CN0100100000000              01              001             Autauga County, AL  2016
1     CN0100300000000              01              003             Baldwin County, AL  2016
2     CN0100500000000              01              005             Barbour County, AL  2016
3     CN0100700000000              01              007                Bibb County, AL  2016
4     CN0100900000000              01              009              Blount County, AL  2016
...               ...             ...              ...                            ...   ...
3214  CN7214500000000              72              145        Vega Baja Municipio, PR  2016
3215  CN7214700000000              72              147          Vieques Municipio, PR  2016
3216  CN7214900000000              72              149         Villalba Municipio, PR  2016
3217  CN7215100000000              72              151          Yabucoa Municipio, PR  2016
3218  CN7215300000000              72              153            Yauco Municipio, PR  2016

     Labor Force Employed Unemployed  Unemployment Rate (%)   FIPS
0         25,649   24,297      1,352                    5.3  01001
1         89,931   85,061      4,870                    5.4  01003
2          8,302    7,584        718                    8.6  01005
3          8,573    8,004        569                    6.6  01007
4         24,525   23,171      1,354                    5.5  01009
...          ...      ...        ...                    ...    ...
3214      13,812   11,894      1,918                   13.9  72145
3215       3,287    2,939        348                   10.6  72147
3216       7,860    6,273      1,587                   20.2  72149
3217       9,137    7,591      1,546                   16.9  72151
3218      10,815    8,783      2,032                   18.8  72153

3219 rows × 10 columns

state_code.head()
  State/District Abbreviation Postal Code
0        Alabama         Ala.          AL
1         Alaska       Alaska          AK
2        Arizona        Ariz.          AZ
3       Arkansas         Ark.          AR
4     California       Calif.          CA
state_map = state_code.set_index('State/District')['Postal Code']
state_map
State/District
Alabama                 AL
Alaska                  AK
Arizona                 AZ
Arkansas                AR
California              CA
Colorado                CO
Connecticut             CT
Delaware                DE
District of Columbia    DC
Florida                 FL
Georgia                 GA
Hawaii                  HI
Idaho                   ID
Illinois                IL
Indiana                 IN
Iowa                    IA
Kansas                  KS
Kentucky                KY
Louisiana               LA
Maine                   ME
Maryland                MD
Massachusetts           MA
Michigan                MI
Minnesota               MN
Mississippi             MS
Missouri                MO
Montana                 MT
Nebraska                NE
Nevada                  NV
New Hampshire           NH
New Jersey              NJ
New Mexico              NM
New York                NY
North Carolina          NC
North Dakota            ND
Ohio                    OH
Oklahoma                OK
Oregon                  OR
Pennsylvania            PA
Rhode Island            RI
South Carolina          SC
South Dakota            SD
Tennessee               TN
Texas                   TX
Utah                    UT
Vermont                 VT
Virginia                VA
Washington              WA
West Virginia           WV
Wisconsin               WI
Wyoming                 WY
Name: Postal Code, dtype: object
counties = df_norm.reset_index()['county'] + ', ' + df_norm.reset_index()['state'].map(state_map)
counties
0          Autauga County, AL
1          Baldwin County, AL
2          Barbour County, AL
3             Bibb County, AL
4           Blount County, AL
                ...          
4804    Sweetwater County, WY
4805         Teton County, WY
4806         Uinta County, WY
4807      Washakie County, WY
4808        Weston County, WY
Length: 4809, dtype: object
counties_to_fips = df_sample.set_index('County Name/State Abbreviation')['FIPS']
counties_to_fips
County Name/State Abbreviation
Autauga County, AL         01001
Baldwin County, AL         01003
Barbour County, AL         01005
Bibb County, AL            01007
Blount County, AL          01009
                           ...  
Vega Baja Municipio, PR    72145
Vieques Municipio, PR      72147
Villalba Municipio, PR     72149
Yabucoa Municipio, PR      72151
Yauco Municipio, PR        72153
Name: FIPS, Length: 3219, dtype: object
# Replace df_norm's geographic keys (state, county) with a single FIPS code.
fips = counties.map(counties_to_fips)
fips
0       01001
1       01003
2       01005
3       01007
4       01009
        ...  
4804    56037
4805    56039
4806    56041
4807    56043
4808    56045
Length: 4809, dtype: object
fips.isna().sum()
1681
data = df_norm.reset_index()['Pres_DEM'][fips.notna()]
fips = fips[fips.notna()]

# data holds the Democratic vote share on a fresh integer index, keeping only rows whose FIPS code is not null.
# The FIPS codes are needed to draw the choropleth.
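
With 1681 of the 4809 labels failing to map, it can help to list which labels have no FIPS code before dropping them. A sketch on made-up labels (`labels` and `lookup` are hypothetical stand-ins for counties and counties_to_fips):

```python
import pandas as pd

# Hypothetical stand-ins; the second and fourth labels are made up and
# deliberately absent from the lookup table.
labels = pd.Series(["Autauga County, AL", "Some Township, ME",
                    "Baldwin County, AL", "Another Town, VT"])
lookup = pd.Series({"Autauga County, AL": "01001",
                    "Baldwin County, AL": "01003"})

mapped = labels.map(lookup)         # NaN where the label is not in lookup
unmatched = labels[mapped.isna()]   # labels that found no FIPS code

# Count unmatched labels per state abbreviation (text after the last comma).
per_state = unmatched.str.rsplit(", ", n=1).str[-1].value_counts()
print(unmatched.tolist())
print(per_state)
```

A per-state count like this shows whether the misses cluster in a few states, for instance where the election file uses a different geographic unit or spelling than the lookup table.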

Visualization

fig = ff.create_choropleth(
    fips=fips, values=data,
    show_state_data = False,
    colorscale=colorscale,
    binning_endpoints=list(np.linspace(0.0, 1.0, len(colorscale) - 2)),
    show_hover=True, centroid_marker={'opacity':0},
    asp = 2.9, title="USA by Voting for DEM President"
)

fig.layout.template = None
fig.show()
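
How the binning endpoints split the Democratic share into colour bins can be checked with NumPy alone. This sketch uses `np.digitize` as a stand-in to illustrate the idea; it is an assumption about the intent, not a description of how create_choropleth is implemented internally:

```python
import numpy as np

n_colors = 17                                    # length of colorscale above
endpoints = np.linspace(0.0, 1.0, n_colors - 2)  # 15 evenly spaced cut points

# 15 endpoints delimit 16 intervals (including the two open-ended ones),
# which fits within the 17 available colours.
n_bins = len(endpoints) + 1

# Interval index for a few example Democratic shares.
shares = np.array([0.30, 0.52, 0.95])
bin_index = np.digitize(shares, endpoints)
print(n_bins, bin_index)
```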

Preprocessing

df_norm.columns
Index(['Pres_DEM', 'Pres_REP', 'Gov_DEM', 'Gov_REP', 'TotalPop', 'Men',
       'Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific',
       'VotingAgeCitizen', 'IncomePerCap', 'Poverty', 'ChildPoverty',
       'Professional', 'Service', 'Office', 'Construction', 'Production',
       'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
       'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed',
       'FamilyWork', 'Unemployment'],
      dtype='object')
df_norm.dropna(inplace=True)
X = df_norm.drop(['Pres_DEM', 'Pres_REP', 'Gov_DEM', 'Gov_REP'], axis=1)
y = df_norm['Pres_DEM']
# Standardize the numeric features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X= pd.DataFrame(data=X_scaled, index=X.index, columns=X.columns)
X.head()
TotalPopMenHispanicWhiteBlackNativeAsianPacificVotingAgeCitizenIncomePerCapPovertyChildPovertyProfessionalServiceOfficeConstructionProductionDriveCarpoolTransitWalkOtherTranspWorkAtHomeMeanCommuteEmployedPrivateWorkPublicWorkSelfEmployedFamilyWorkUnemployment
statecounty
DelawareKent County0.651462-0.9461450.346057-1.4237231.751813-0.1911210.783427-0.393531-0.3886380.313498-0.444312-0.3003980.3270310.4477020.504056-0.807226-0.3798830.447917-0.1707970.559713-0.455170-0.337372-0.3257940.6080720.209912-0.0638470.791294-0.836585-0.2435810.265911
New Castle County3.043033-0.8481750.828313-1.6954541.751813-0.2394423.218819-0.393531-0.6658571.606755-0.638605-0.5673091.978537-0.3157910.658756-1.535077-1.2125470.221028-0.8570513.061686-0.301446-0.337372-0.3257940.4923850.8011261.168959-0.715528-0.999491-0.4049760.192397
Sussex County0.917027-0.8188520.759420-0.6344090.622026-0.2152820.210394-0.3935310.2278591.122397-0.620942-0.0441630.2952720.1477580.844395-0.469295-0.5431500.632264-1.0015260.112932-0.506412-0.1411750.0558830.415259-0.0263560.503244-0.251891-0.470046-0.4049760.045370
IndianaAdams County-0.213551-0.230718-0.0673050.581911-0.432442-0.263602-0.362639-0.393531-1.625095-0.7554700.5978040.916716-1.149795-0.615735-0.5169580.2325611.660960-0.6581662.393626-0.512562-0.2502040.447416-0.244006-0.066773-0.2056110.971710-1.344751-0.0016910.401999-0.285443
Allen County1.870146-0.6568090.414951-0.6538180.546706-0.2394421.857865-0.393531-1.0684410.224871-0.1440410.0839540.358791-0.4521290.937214-1.4051040.2895140.759889-0.4958650.202288-0.608895-0.533569-0.380320-0.3559930.7078371.477161-1.278517-0.816222-0.4049760.118884

Train/test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
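
In the cells above the scaler is fitted on all rows before the split, so statistics from the test rows feed into the scaled features. An alternative ordering, sketched on synthetic data (the array shapes and the `_demo` names are assumptions), fits the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y_demo = rng.uniform(0.0, 1.0, size=100)                # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# Fit on the training rows only, then apply the same transform to both sets.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)
print(X_tr_scaled.mean(axis=0), X_te_scaled.mean(axis=0))
```

The training columns end up with mean zero, while the test columns are only approximately centred, which is the expected behaviour when the test set is treated as unseen data.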

Using PCA (the number of columns is large)

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(X_train)
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_)
plt.grid()

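# Components 1 through 30. Roughly the first 10 components are enough to explain most of the variance.

[Figure output_66_0: explained variance of each principal component]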
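
Besides reading the elbow off the plot, the number of components can be chosen from the cumulative explained-variance ratio. A sketch on synthetic data; the 90% threshold and the data-generating setup are assumptions made for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 30 observed columns driven by 5 hidden factors plus noise.
factors = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 30))
X_syn = factors @ mixing + 0.1 * rng.normal(size=(300, 30))

pca_demo = PCA().fit(X_syn)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%.
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(k, cumulative[k - 1])
```

Applied to the fitted `pca` above, the same two lines would give a number to compare against the choice of 10 components.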

pca = PCA(n_components=10)
pca.fit(X_train)
PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Applying the models

LightGBM model

from lightgbm import LGBMRegressor
model_reg = LGBMRegressor()
model_reg.fit(pca.transform(X_train), y_train)
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
              importance_type='split', learning_rate=0.1, max_depth=-1,
              min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
              n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
              random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
              subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.metrics import classification_report
from math import sqrt
pred = model_reg.predict(pca.transform(X_test))
print(mean_absolute_error(y_test, pred))
print(sqrt(mean_squared_error(y_test, pred)))
0.06323869955871446
0.08461257022993489
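
The two printed numbers are the mean absolute error and the root mean squared error of the predicted Democratic share. Written out with NumPy on made-up values:

```python
import numpy as np

# Made-up true and predicted Democratic shares.
y_true = np.array([0.30, 0.45, 0.60, 0.25])
y_pred = np.array([0.35, 0.40, 0.50, 0.25])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
print(mae, rmse)
```

RMSE is never smaller than MAE and grows faster when a few predictions miss by a wide margin.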
# The label is True when Pres_DEM is greater than 0.5.
print(classification_report(y_test > 0.5, pred > 0.5))
              precision    recall  f1-score   support

       False       0.95      0.96      0.96       146
        True       0.62      0.59      0.61        17

    accuracy                           0.92       163
   macro avg       0.79      0.77      0.78       163
weighted avg       0.92      0.92      0.92       163

XGBoost model

from xgboost import XGBClassifier

model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train > 0.5)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)
plt.figure(figsize=(12,10))
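plt.barh(X.columns, model_xgb.feature_importances_)
plt.show()

# Features that influence the Democratic vote. Population is an obvious driver; setting it aside, Asian, Transit, and PublicWork are worth a closer look.

[Figure output_76_0: XGBoost feature importances]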

pred = model_xgb.predict(X_test)
print(classification_report(y_test>0.5 , pred))
              precision    recall  f1-score   support

       False       0.96      0.97      0.97       146
        True       0.73      0.65      0.69        17

    accuracy                           0.94       163
   macro avg       0.85      0.81      0.83       163
weighted avg       0.94      0.94      0.94       163
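
Because only 17 of the 163 test counties are in the True class, accuracy alone flatters both models. For comparison, the accuracy of a baseline that always predicts False can be computed from the support counts in the reports:

```python
# Support counts read from the classification reports above.
n_false = 146
n_true = 17
n_total = n_false + n_true

# Accuracy of a baseline that predicts False for every county.
baseline_accuracy = n_false / n_total
print(round(baseline_accuracy, 3))  # 0.896
```

Against a baseline of about 0.90, the reported accuracies of 0.92 and 0.94 are modest gains, so the precision and recall of the True class say more about how the models differ.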