Oh-Seung-Rok

Hello, I'm Seungrok Oh. I want to become an analyst who improves products and services through quantitative analysis and anticipates consumers' needs first. Portfolio: [https://seungrok0317.com]

Kaggle - Steel Plates Faults

08 Apr 2021 » Kaggle

Steel Plates Faults (steel plate manufacturing process data)

  • Dataset: https://www.kaggle.com/mahsateimourikia/faults-nna

  • Worked through following a FastCampus lecture walkthrough (https://www.fastcampus.co.kr/data_online_dl300).

  • A dataset for predicting defects from various characteristics of steel plates.
  • As manufacturing process data, datasets of this kind are mostly used either to predict defective products so their causes can be removed, or to forecast inventory so production matches demand.
  • Because data collection is automated, such data tends to have few missing values and good quality.
  • Practice applying most of the common machine-learning models.

Library setup and data loading

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('./Faults.NNA', delimiter='\t', header=None)

pd.set_option('display.max_columns', None)
df.head()
[df.head() output: first 5 rows across 34 numbered columns (0–33); the table did not survive export]
  • The columns are only numbered, so replace them with the predefined column names.
a_name = pd.read_csv('Faults27x7_var', delimiter=' ', header=None)
df.columns = a_name[0]
df.tail()
[df.tail() output: last 5 rows (indices 1936–1940) with the newly assigned column names X_Minimum … Other_Faults]
print(df.shape)
(1941, 34)
import os
n_cpu = os.cpu_count()
print(n_cpu)
n_thread = n_cpu*2
print(n_thread)
4
8
  • Since the computation can be heavy, it can be split across CPU cores (used later via n_jobs).
  • All features are numeric; a quick check follows below.
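A quick dtype check (not in the original notebook) to confirm this:

print(df.dtypes.value_counts())  # expect only int64/float64 dtypes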

7 target variables (what kind of defect occurred on the plate)

  • Pastry
  • Z_Scratch
  • K_Scatch
  • Stains
  • Dirtiness
  • Bumps
  • Other_Faults

27 explanatory variables (various features such as plate length, luminosity, thickness, and steel type)

  • X_Minimum
  • X_Maximum
  • Y_Minimum
  • Y_Maximum
  • Pixels_Areas
  • X_Perimeter
  • Y_Perimeter
  • Sum_of_Luminosity
  • Minimum_of_Luminosity
  • Maximum_of_Luminosity
  • Length_of_Conveyer
  • TypeOfSteel_A300
  • TypeOfSteel_A400
  • Steel_Plate_Thickness
  • Edges_Index
  • Empty_Index
  • Square_Index
  • Outside_X_Index
  • Edges_X_Index
  • Edges_Y_Index
  • Outside_Global_Index
  • LogOfAreas
  • Log_X_Index
  • Log_Y_Index
  • Orientation_Index
  • Luminosity_Index
  • SigmoidOfAreas

Data preprocessing and EDA (basic statistical analysis)

  • Since there are 7 target variables (each flagging a plate fault), build a list of the 7 columns cast to boolean.
conditions = [df['Pastry'].astype(bool),
             df['Z_Scratch'].astype(bool),
             df['K_Scatch'].astype(bool),
             df['Stains'].astype(bool),
             df['Dirtiness'].astype(bool),
             df['Bumps'].astype(bool),
             df['Other_Faults'].astype(bool)]

# The same list can also be built in one pass with map/lambda instead of writing astype seven times (see the sketch below).
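# For example, with a list comprehension over the seven fault columns
# (fault_cols is a hypothetical name, not defined in the original notebook):
# fault_cols = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']
# conditions = [df[c].astype(bool) for c in fault_cols]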
print(type(conditions))
print(type(conditions[0]))
print(len(conditions))
print(len(conditions[0]))
<class 'list'>
<class 'pandas.core.series.Series'>
7
1941
choices = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']

df['class'] = np.select(conditions, choices)

# np.select builds a column holding the choice whose condition is True for each row.
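# Note: np.select checks the conditions in order and returns the first matching choice;
# rows matching no condition would get the default value 0, which never appears in the
# class counts below, so every row carries at least one fault flag.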
df.sample(10)
[df.sample(10) output: 10 random rows; the new 'class' column shows labels such as K_Scatch, Z_Scratch, Bumps, and Other_Faults]
df.isnull().sum()
0
X_Minimum                0
X_Maximum                0
Y_Minimum                0
Y_Maximum                0
Pixels_Areas             0
X_Perimeter              0
Y_Perimeter              0
Sum_of_Luminosity        0
Minimum_of_Luminosity    0
Maximum_of_Luminosity    0
Length_of_Conveyer       0
TypeOfSteel_A300         0
TypeOfSteel_A400         0
Steel_Plate_Thickness    0
Edges_Index              0
Empty_Index              0
Square_Index             0
Outside_X_Index          0
Edges_X_Index            0
Edges_Y_Index            0
Outside_Global_Index     0
LogOfAreas               0
Log_X_Index              0
Log_Y_Index              0
Orientation_Index        0
Luminosity_Index         0
SigmoidOfAreas           0
Pastry                   0
Z_Scratch                0
K_Scatch                 0
Stains                   0
Dirtiness                0
Bumps                    0
Other_Faults             0
class                    0
dtype: int64
df.describe()
[df.describe() output: summary statistics (count, mean, std, min, quartiles, max) for all 34 columns; count = 1941 for every column]
df['class'].value_counts()
Other_Faults    673
Bumps           402
K_Scatch        391
Z_Scratch       190
Pastry          158
Stains           72
Dirtiness        55
Name: class, dtype: int64

Examining pairwise relationships with a scatter matrix

color_code = {'Pastry' : 'Red', 'Z_Scratch' : 'Blue', 'K_Scatch' : 'Green', 'Stains' : 'Black', 'Dirtiness' : 'Pink', 'Bumps' : 'Brown', 'Other_Faults' : 'Gold'}
color_list = [color_code.get(i) for i in df.loc[:,'class']]
pd.plotting.scatter_matrix(df.loc[:, df.columns!='class'], c=color_list, figsize=[30,30], alpha=0.3, s= 50, diagonal='hist')

[scatter matrix plot: pairwise scatter plots of all features, points colored by fault class]

Visualizing the categorical variable above

sns.set_style('white')

g = sns.catplot(data=df, x='class', kind='count', palette='YlGnBu', height=6)
g.ax.xaxis.set_label_text('Type of Defect')
g.ax.yaxis.set_label_text('Count')
g.ax.set_title('The number of Defects by Defect type')

for p in g.ax.patches: # each bar (patch) of the axes
    g.ax.annotate((p.get_height()), (p.get_x()+0.2, p.get_height()+10)) # write each bar's height just above it at (x, y)

[bar chart: "The number of Defects by Defect type", counts annotated above each bar]

Examining relationships between variables via correlation coefficients + heatmap

df.columns
Index(['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas', 'Pastry', 'Z_Scratch', 'K_Scatch', 'Stains',
       'Dirtiness', 'Bumps', 'Other_Faults', 'class'],
      dtype='object', name=0)
df_corTarget = df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas']]

corr = df_corTarget.corr()
corr
[corr output: 27×27 correlation matrix. Notable values include X_Minimum–X_Maximum ≈ 0.99, Y_Minimum–Y_Maximum ≈ 1.00, Pixels_Areas–Sum_of_Luminosity ≈ 0.98, and TypeOfSteel_A300–TypeOfSteel_A400 = −1.00]
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated in newer NumPy; plain bool works
mask[np.triu_indices_from(mask)] = True
# the matrix is symmetric about the diagonal, so show only one triangle


f, ax = plt.subplots(figsize = (11,9))
cmap = sns.diverging_palette(1,200, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0, linewidths=2)

[heatmap: lower triangle of the correlation matrix]

Train/test split

x = df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas']]

y = df['K_Scatch']
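# (binary target: 1 if the plate's fault is K_Scatch, 0 otherwise)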
from sklearn.model_selection import train_test_split
from scipy.stats import zscore

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=1, stratify=y)
# stratify: keep the proportion of y the same in the train and test splits
# standardization; plays the same role as sklearn.preprocessing.StandardScaler
x_train = x_train.apply(zscore)
x_test = x_test.apply(zscore)
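Note that apply(zscore) standardizes the test set with its own means and standard deviations. A stricter alternative (a sketch, not what the notebook ran) fits the scaler on the training data only:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)  # learn mean/std from the training split only
x_train = pd.DataFrame(scaler.transform(x_train), columns=x_train.columns, index=x_train.index)
x_test = pd.DataFrame(scaler.transform(x_test), columns=x_test.columns, index=x_test.index)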

Logistic regression classifier (grid search, Ridge/Lasso penalties, threshold)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn import metrics
lm = LogisticRegression(solver= 'liblinear')

# solver must be 'liblinear' so that both the ridge (l2) and lasso (l1) penalties below are supported
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  • Search for the optimal hyperparameters with a grid search.
parameters = {'penalty' : ['l1', 'l2'], 'C':[0.01, 0.1, 0.5, 0.9, 1, 5, 10], 'tol': [1e-4, 1e-2, 1, 1e2]}
GSLR = GridSearchCV(lm, parameters, cv=10, n_jobs=n_thread, scoring='accuracy')
GSLR.fit(x_train, y_train)
GridSearchCV(cv=10, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=8,
             param_grid={'C': [0.01, 0.1, 0.5, 0.9, 1, 5, 10],
                         'penalty': ['l1', 'l2'],
                         'tol': [0.0001, 0.01, 1, 100.0]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
print('final params' , GSLR.best_params_)
print('best score', GSLR.best_score_)
final params {'C': 5, 'penalty': 'l1', 'tol': 0.01}
best score 0.9729321753515301
  • If the best parameter lands at the edge of the searched range, values beyond the range should also be tried; for example:
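Had the best C come out at 10 (the top of the grid above), a follow-up search might extend that side. A hypothetical sketch (GSLR_wide is an illustrative name):

parameters = {'penalty' : ['l1', 'l2'], 'C':[10, 30, 100], 'tol': [1e-4, 1e-2]}  # widened C range
GSLR_wide = GridSearchCV(lm, parameters, cv=10, n_jobs=n_thread, scoring='accuracy')
GSLR_wide.fit(x_train, y_train)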

Model evaluation

predicted=GSLR.predict(x_test)

cMatrix = confusion_matrix(y_test, predicted)
print(cMatrix)
print('\n Accuracy:', GSLR.score(x_test, y_test))
[[305   6]
 [  5  73]]

 Accuracy: 0.9717223650385605
print(metrics.classification_report(y_test, predicted))
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       311
           1       0.92      0.94      0.93        78

    accuracy                           0.97       389
   macro avg       0.95      0.96      0.96       389
weighted avg       0.97      0.97      0.97       389

  • Visualize how accuracy changes as each parameter changes (a plotting sketch follows the printed scores below).
means = GSLR.cv_results_['mean_test_score']
stds = GSLR.cv_results_['std_test_score']

for mean, std, params in zip(means, stds, GSLR.cv_results_['params']):
    print ('%0.3f (+/-%0.03f) for %r' % (mean, std *2, params))
print()
0.945 (+/-0.037) for {'C': 0.01, 'penalty': 'l1', 'tol': 0.0001}
0.946 (+/-0.036) for {'C': 0.01, 'penalty': 'l1', 'tol': 0.01}
0.942 (+/-0.036) for {'C': 0.01, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.01, 'penalty': 'l1', 'tol': 100.0}
0.950 (+/-0.033) for {'C': 0.01, 'penalty': 'l2', 'tol': 0.0001}
0.950 (+/-0.033) for {'C': 0.01, 'penalty': 'l2', 'tol': 0.01}
0.954 (+/-0.035) for {'C': 0.01, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.01, 'penalty': 'l2', 'tol': 100.0}
0.964 (+/-0.028) for {'C': 0.1, 'penalty': 'l1', 'tol': 0.0001}
0.963 (+/-0.028) for {'C': 0.1, 'penalty': 'l1', 'tol': 0.01}
0.954 (+/-0.040) for {'C': 0.1, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.1, 'penalty': 'l1', 'tol': 100.0}
0.966 (+/-0.021) for {'C': 0.1, 'penalty': 'l2', 'tol': 0.0001}
0.966 (+/-0.021) for {'C': 0.1, 'penalty': 'l2', 'tol': 0.01}
0.957 (+/-0.032) for {'C': 0.1, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.1, 'penalty': 'l2', 'tol': 100.0}
0.969 (+/-0.028) for {'C': 0.5, 'penalty': 'l1', 'tol': 0.0001}
0.970 (+/-0.025) for {'C': 0.5, 'penalty': 'l1', 'tol': 0.01}
0.957 (+/-0.031) for {'C': 0.5, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.5, 'penalty': 'l1', 'tol': 100.0}
0.969 (+/-0.023) for {'C': 0.5, 'penalty': 'l2', 'tol': 0.0001}
0.968 (+/-0.020) for {'C': 0.5, 'penalty': 'l2', 'tol': 0.01}
0.958 (+/-0.030) for {'C': 0.5, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.5, 'penalty': 'l2', 'tol': 100.0}
0.969 (+/-0.030) for {'C': 0.9, 'penalty': 'l1', 'tol': 0.0001}
0.968 (+/-0.030) for {'C': 0.9, 'penalty': 'l1', 'tol': 0.01}
0.957 (+/-0.026) for {'C': 0.9, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.9, 'penalty': 'l1', 'tol': 100.0}
0.971 (+/-0.022) for {'C': 0.9, 'penalty': 'l2', 'tol': 0.0001}
0.972 (+/-0.025) for {'C': 0.9, 'penalty': 'l2', 'tol': 0.01}
0.958 (+/-0.030) for {'C': 0.9, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 0.9, 'penalty': 'l2', 'tol': 100.0}
0.970 (+/-0.030) for {'C': 1, 'penalty': 'l1', 'tol': 0.0001}
0.970 (+/-0.026) for {'C': 1, 'penalty': 'l1', 'tol': 0.01}
0.955 (+/-0.035) for {'C': 1, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 1, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.023) for {'C': 1, 'penalty': 'l2', 'tol': 0.0001}
0.972 (+/-0.025) for {'C': 1, 'penalty': 'l2', 'tol': 0.01}
0.958 (+/-0.030) for {'C': 1, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 1, 'penalty': 'l2', 'tol': 100.0}
0.970 (+/-0.027) for {'C': 5, 'penalty': 'l1', 'tol': 0.0001}
0.973 (+/-0.025) for {'C': 5, 'penalty': 'l1', 'tol': 0.01}
0.951 (+/-0.047) for {'C': 5, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 5, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.028) for {'C': 5, 'penalty': 'l2', 'tol': 0.0001}
0.972 (+/-0.028) for {'C': 5, 'penalty': 'l2', 'tol': 0.01}
0.959 (+/-0.030) for {'C': 5, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 5, 'penalty': 'l2', 'tol': 100.0}
0.972 (+/-0.025) for {'C': 10, 'penalty': 'l1', 'tol': 0.0001}
0.971 (+/-0.028) for {'C': 10, 'penalty': 'l1', 'tol': 0.01}
0.955 (+/-0.032) for {'C': 10, 'penalty': 'l1', 'tol': 1}
0.798 (+/-0.005) for {'C': 10, 'penalty': 'l1', 'tol': 100.0}
0.972 (+/-0.024) for {'C': 10, 'penalty': 'l2', 'tol': 0.0001}
0.971 (+/-0.024) for {'C': 10, 'penalty': 'l2', 'tol': 0.01}
0.959 (+/-0.030) for {'C': 10, 'penalty': 'l2', 'tol': 1}
0.798 (+/-0.005) for {'C': 10, 'penalty': 'l2', 'tol': 100.0}
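The loop above only prints the scores; a small sketch (not in the original notebook) that plots them instead, reusing the means and stds arrays already computed:

plt.figure(figsize=(12, 4))
plt.errorbar(range(len(means)), means, yerr=stds * 2, fmt='o-', ecolor='lightgray')  # mean CV accuracy with ±2 std bars
plt.xlabel('parameter combination (order of cv_results_)')
plt.ylabel('mean CV accuracy')
plt.title('Logistic regression grid search results')
plt.show()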

Decision tree

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()

# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
parameters = {'criterion' : ['gini', 'entropy'],'min_samples_split' : [2, 5, 10, 15], 'max_depth' : [None, 2], 'min_samples_leaf':[1,3,10,15], 'max_features':[None, 'sqrt', 'log2']}
GSDT = GridSearchCV(dt, parameters, cv=10, n_jobs=n_thread, scoring='accuracy')
GSDT.fit(x_train, y_train)
GridSearchCV(cv=10, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=8,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 2],
                         'max_features': [None, 'sqrt', 'log2'],
                         'min_samples_leaf': [1, 3, 10, 15],
                         'min_samples_split': [2, 5, 10, 15]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
print('final params', GSDT.best_params_)
print('ACC', GSDT.best_score_)
final params {'criterion': 'entropy', 'max_depth': None, 'max_features': None, 'min_samples_leaf': 3, 'min_samples_split': 2}
ACC 0.9780976013234077
predicted = GSDT.predict(x_test)
cMatrix = confusion_matrix(y_test, predicted)
print(cMatrix)
print(round(GSDT.score(x_test, y_test), 3))
print(metrics.classification_report(y_test, predicted))
[[308   3]
 [  8  70]]
0.972
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       311
           1       0.96      0.90      0.93        78

    accuracy                           0.97       389
   macro avg       0.97      0.94      0.95       389
weighted avg       0.97      0.97      0.97       389
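A single tree, unlike the ensembles below, can be inspected directly; one way (a sketch using sklearn's export_text, not in the original notebook) to read the top splits of the refit best tree:

from sklearn.tree import export_text

best_tree = GSDT.best_estimator_  # the tree refit on all training data with the best params
print(export_text(best_tree, feature_names=list(x.columns), max_depth=3))  # show only the top 3 levels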

Random forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
parameters = {'n_estimators':[20,50,100], 'criterion':['entropy'], 'min_samples_split':[2,5], 'max_depth':[None, 2], 'min_samples_leaf':[1, 3, 10], 'max_features':['sqrt']}
GSRF = GridSearchCV(rf, parameters, cv=10, n_jobs=n_thread, scoring='accuracy')
GSRF.fit(x_train, y_train)
GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='deprecated', n_jobs=8,
             param_grid={'criterion': ['entropy'], 'max_depth': [None, 2],
                         'max_features': ['sqrt'],
                         'min_samples_leaf': [1, 3, 10],
                         'min_samples_split': [2, 5],
                         'n_estimators': [20, 50, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
print('final params', GSRF.best_params_)
print('best score', GSRF.best_score_)
final params {'criterion': 'entropy', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
best score 0.9851736972704715
predicted = GSRF.predict(x_test)
cMatrix = confusion_matrix(y_test, predicted)
print(cMatrix)
print(metrics.classification_report(y_test, predicted))
[[311   0]
 [  4  74]]
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       311
           1       1.00      0.95      0.97        78

    accuracy                           0.99       389
   macro avg       0.99      0.97      0.98       389
weighted avg       0.99      0.99      0.99       389

  • Random forest has the highest accuracy so far, but it is hard to interpret: it cannot cleanly explain which factors most drive the 'K_Scatch' defects. Feature importances, sketched below, give at least a partial view.
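A partial remedy (a sketch, not in the original notebook): impurity-based feature importances from the refit forest.

importances = pd.Series(GSRF.best_estimator_.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False).head(10))  # the ten features the forest splits on most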

SVM (Support Vector Machine)

from sklearn import svm

svc = svm.SVC()

# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
parameters = {'C':[0.01, 0.1, 0.5, 0.9, 1.5, 10], 'kernel':['linear','rbf','poly'], 'gamma':[0.1, 1, 10]}
GS_SVM = GridSearchCV(svc, parameters, cv=10, n_jobs=n_thread, scoring='accuracy')
GS_SVM.fit(x_train, y_train)
GridSearchCV(cv=10, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=8,
             param_grid={'C': [0.01, 0.1, 0.5, 0.9, 1.5, 10],
                         'gamma': [0.1, 1, 10],
                         'kernel': ['linear', 'rbf', 'poly']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
print('final params', GS_SVM.best_params_)
print('final score', GS_SVM.best_score_)
final params {'C': 1.5, 'gamma': 0.1, 'kernel': 'rbf'}
final score 0.9819602977667493
predicted = GS_SVM.predict(x_test)
cMatrix = confusion_matrix(y_test, predicted)
print(cMatrix)
print(metrics.classification_report(y_test, predicted))
[[311   0]
 [  8  70]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       311
           1       1.00      0.90      0.95        78

    accuracy                           0.98       389
   macro avg       0.99      0.95      0.97       389
weighted avg       0.98      0.98      0.98       389

Artificial neural network

  • Reference: playground.tensorflow.org/
from sklearn.neural_network import MLPClassifier
nn_model = MLPClassifier(random_state=1)

# https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
  • For a neural network, the key decision is the hidden-layer configuration.
  • There is no exact method, only common heuristics; one hidden layer is usually enough, so start with one.
  • The number of hidden nodes is usually found via grid search.
  • A guideline for the node count: Number of Neurons = Training Data Samples / (Factor × (Input Neurons + Output Neurons)). A larger Factor gives fewer nodes; a smaller Factor, more.
x_train.shape
(1552, 27)
a = 1552/(10 * (27+1))  # 27 input features + 1 output variable; Factor = 10
b = 1552 /(1 * (27+1))  # Factor = 1
print(a, b)
5.542857142857143 55.42857142857143
  • With one hidden layer and Factor between 1 and 10, the node count falls roughly between 5 and 55, so run the grid search over that range.
parameters = {'alpha':[1e-3, 1e-1, 1e1], 'hidden_layer_sizes':[(5),(30),(56)], 'activation':['tanh', 'relu'], 'solver':['adam', 'lbfgs']}
GS_NN = GridSearchCV(nn_model, parameters, cv=10, n_jobs=n_thread, scoring='accuracy')

# alpha has a wide plausible range, so its search space needs exploring as well
GS_NN.fit(x_train, y_train)
C:\Users\dissi\anaconda31\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:571: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)





GridSearchCV(cv=10, error_score=nan,
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_fun=15000,
                                     max_iter=200, momentum=0.9,
                                     n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state=1, shuffle=True,
                                     solver='adam', tol=0.0001,
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='deprecated', n_jobs=8,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.001, 0.1, 10.0],
                         'hidden_layer_sizes': [5, 30, 56],
                         'solver': ['adam', 'lbfgs']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
print('final params', GS_NN.best_params_)
print('best score', GS_NN.best_score_)
final params {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 56, 'solver': 'adam'}
best score 0.9781058726220018
means = GS_NN.cv_results_['mean_test_score']
stds = GS_NN.cv_results_['std_test_score']

for mean, std, params in zip(means, stds, GS_NN.cv_results_['params']):
    print ('%0.3f (+/-%0.03f) for %r' % (mean, std *2, params))
print()
0.971 (+/-0.023) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.971 (+/-0.029) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.976 (+/-0.025) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.978 (+/-0.018) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.976 (+/-0.024) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 56, 'solver': 'adam'}
0.975 (+/-0.029) for {'activation': 'tanh', 'alpha': 0.001, 'hidden_layer_sizes': 56, 'solver': 'lbfgs'}
0.971 (+/-0.023) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.972 (+/-0.024) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.975 (+/-0.025) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.976 (+/-0.024) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.976 (+/-0.024) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 56, 'solver': 'adam'}
0.977 (+/-0.031) for {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': 56, 'solver': 'lbfgs'}
0.957 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.972 (+/-0.026) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.956 (+/-0.028) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.972 (+/-0.027) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.958 (+/-0.032) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 56, 'solver': 'adam'}
0.972 (+/-0.027) for {'activation': 'tanh', 'alpha': 10.0, 'hidden_layer_sizes': 56, 'solver': 'lbfgs'}
0.962 (+/-0.036) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.966 (+/-0.022) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.976 (+/-0.022) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.972 (+/-0.018) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.978 (+/-0.022) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 56, 'solver': 'adam'}
0.975 (+/-0.018) for {'activation': 'relu', 'alpha': 0.001, 'hidden_layer_sizes': 56, 'solver': 'lbfgs'}
0.961 (+/-0.036) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.970 (+/-0.028) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.976 (+/-0.022) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.978 (+/-0.027) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.977 (+/-0.022) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 56, 'solver': 'adam'}
0.978 (+/-0.024) for {'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': 56, 'solver': 'lbfgs'}
0.960 (+/-0.033) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'adam'}
0.973 (+/-0.026) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 5, 'solver': 'lbfgs'}
0.959 (+/-0.028) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'adam'}
0.974 (+/-0.024) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 30, 'solver': 'lbfgs'}
0.957 (+/-0.030) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 56, 'solver': 'adam'}
0.975 (+/-0.023) for {'activation': 'relu', 'alpha': 10.0, 'hidden_layer_sizes': 56, 'solver': 'lbfgs'}

  • With this data, the lbfgs solver tends to score higher than adam.
  • The search could also be run with more hidden layers; a sketch follows below.

  • Configuring hidden layers is hard, but guidelines exist for the node counts; start with a single hidden layer.
  • Then shift the node range around to find the best configuration.
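A hypothetical follow-up grid with two hidden layers (each tuple gives the node count per layer; GS_NN2 is an illustrative name):

parameters = {'alpha':[1e-3, 1e-1], 'hidden_layer_sizes':[(30, 30), (56, 28)], 'activation':['relu', 'tanh'], 'solver':['lbfgs']}
GS_NN2 = GridSearchCV(MLPClassifier(random_state=1), parameters, cv=10, n_jobs=n_thread, scoring='accuracy')
GS_NN2.fit(x_train, y_train)
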
predicted = GS_NN.predict(x_test)
cMatrix = confusion_matrix(y_test, predicted)
print(cMatrix)
print(metrics.classification_report(y_test, predicted))
[[307   4]
 [  4  74]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       311
           1       0.95      0.95      0.95        78

    accuracy                           0.98       389
   macro avg       0.97      0.97      0.97       389
weighted avg       0.98      0.98      0.98       389

Boosting (XGBoost, LightGBM)

xgboost

import xgboost as xgb
from sklearn.metrics import accuracy_score
xgb_model = xgb.XGBClassifier(objective='binary:logistic')

# https://xgboost.readthedocs.io/en/latest/parameter.html
# https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html
parameters = {
    'max_depth' : [5,8],
    'min_child_weight' :[1,5],
    'gamma':[0,1],
    'colsample_bytree': [0.8, 1],
    'colsample_bylevel': [0.9, 1],
    'n_estimators': [50, 100]
}

GS_xgb = GridSearchCV(xgb_model, param_grid = parameters, cv=10, n_jobs=n_thread, scoring='accuracy')
GS_xgb.fit(x_train, y_train)
C:\Users\dissi\anaconda31\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)


[21:34:51] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.





GridSearchCV(cv=10, error_score=nan,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_esti...
                                     subsample=None, tree_method=None,
                                     use_label_encoder=True,
                                     validate_parameters=None, verbosity=None),
             iid='deprecated', n_jobs=8,
             param_grid={'colsample_bylevel': [0.9, 1],
                         'colsample_bytree': [0.8, 1], 'gamma': [0, 1],
                         'max_depth': [5, 8], 'min_child_weight': [1, 5],
                         'n_estimators': [50, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
print('final params', GS_xgb.best_params_)
print('best score', GS_xgb.best_score_)
final params {'colsample_bylevel': 0.9, 'colsample_bytree': 0.8, 'gamma': 0, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 50}
best score 0.9864681555004138
predicted = GS_xgb.predict(x_test)
cMatrix = confusion_matrix(y_test, predicted)
print(cMatrix)
print(metrics.classification_report(y_test, predicted))
[[310   1]
 [  4  74]]
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       311
           1       0.99      0.95      0.97        78

    accuracy                           0.99       389
   macro avg       0.99      0.97      0.98       389
weighted avg       0.99      0.99      0.99       389
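The two warnings above suggest their own fix; a sketch of a constructor call that silences both (assuming xgboost 1.3+):

xgb_model = xgb.XGBClassifier(objective='binary:logistic',
                              use_label_encoder=False,   # labels are already 0/1 integers
                              eval_metric='logloss')     # pin the post-1.3 default metric explicitly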

lightgbm

import lightgbm as lgb

lgbm_model = lgb.LGBMClassifier(objective='binary')

# https://lightgbm.readthedocs.io/en/latest/Parameters.html
# https://lightgbm.readthedocs.io/en/latest/Parameters_Tuning.html
parameters ={
    'num_leaves' : [32, 64, 128],
    'min_data_in_leaf' : [1, 5, 10],
    'colsample_byree' : [0.8, 1],  # typo for 'colsample_bytree'; LightGBM ignored the unknown key (see the warning below), so this axis had no effect
    'n_estimators' : [100, 150]
    }

GS_lgbm = GridSearchCV(lgbm_model, parameters, cv=10, n_jobs = n_thread, scoring='accuracy')
GS_lgbm.fit(x_train, y_train)
[LightGBM] [Warning] Unknown parameter: colsample_byree
[LightGBM] [Warning] min_data_in_leaf is set=10, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=10





GridSearchCV(cv=10, error_score=nan,
             estimator=LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                      colsample_bytree=1.0,
                                      importance_type='split',
                                      learning_rate=0.1, max_depth=-1,
                                      min_child_samples=20,
                                      min_child_weight=0.001,
                                      min_split_gain=0.0, n_estimators=100,
                                      n_jobs=-1, num_leaves=31,
                                      objective='binary', random_state=None,
                                      reg_alpha=0.0, reg_lambda=0.0,
                                      silent=True, subsample=1.0,
                                      subsample_for_bin=200000,
                                      subsample_freq=0),
             iid='deprecated', n_jobs=8,
             param_grid={'colsample_byree': [0.8, 1],
                         'min_data_in_leaf': [1, 5, 10],
                         'n_estimators': [100, 150],
                         'num_leaves': [32, 64, 128]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=0)
print('final params', GS_lgbm.best_params_)
print('best score', GS_lgbm.best_score_)
final params {'colsample_byree': 0.8, 'min_data_in_leaf': 10, 'n_estimators': 100, 'num_leaves': 64}
best score 0.985827129859388
predicted = GS_lgbm.predict(x_test)
cMatrix = confusion_matrix(y_test, predicted)
print(cMatrix)
print(metrics.classification_report(y_test, predicted))
[[310   1]
 [  5  73]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       311
           1       0.99      0.94      0.96        78

    accuracy                           0.98       389
   macro avg       0.99      0.97      0.98       389
weighted avg       0.98      0.98      0.98       389