Oh-Seung-Rok

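Hello. I am Oh Seung-rok, and I want to become an analyst who improves products and services through quantitative analysis and identifies consumer needs ahead of time. Portfolio [https://seungrok0317.com]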

Kaggle - Used Cars

23 Mar 2021 » Kaggle

Used Cars Dataset

  • Dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

  • The goal with this dataset is to predict the price of a used car from its various attributes.

Setting up libraries and loading the data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('vehicles.csv')

pd.set_option('display.max_columns', None)
df.head()
   Unnamed: 0          id                                                url  region                     region_url
0           0  7240372487  https://auburn.craigslist.org/ctd/d/auburn-uni...  auburn  https://auburn.craigslist.org
1           1  7240309422  https://auburn.craigslist.org/cto/d/auburn-201...  auburn  https://auburn.craigslist.org
2           2  7240224296  https://auburn.craigslist.org/cto/d/auburn-200...  auburn  https://auburn.craigslist.org
3           3  7240103965  https://auburn.craigslist.org/cto/d/lanett-tru...  auburn  https://auburn.craigslist.org
4           4  7239983776  https://auburn.craigslist.org/cto/d/auburn-200...  auburn  https://auburn.craigslist.org

   price    year  manufacturer                 model  condition    cylinders    fuel  odometer  title_status  transmission
0  35990  2010.0     chevrolet  corvette grand sport       good  8 cylinders     gas   32742.0         clean         other
1   7500  2014.0       hyundai                sonata  excellent  4 cylinders     gas   93600.0         clean     automatic
2   4900  2006.0           bmw               x3 3.0i       good  6 cylinders     gas   87046.0         clean     automatic
3   2000  1974.0     chevrolet                  c-10       good  4 cylinders     gas  190000.0         clean     automatic
4  19500  2005.0          ford           f350 lariat  excellent  8 cylinders  diesel  116000.0          lien     automatic

                 VIN  drive       size    type  paint_color
0  1G1YU3DW1A5106980    rwd        NaN   other          NaN
1  5NPEC4AB0EH813529    fwd        NaN   sedan          NaN
2                NaN    NaN        NaN     SUV         blue
3                NaN    rwd  full-size  pickup         blue
4                NaN    4wd  full-size  pickup         blue

   image_url                                          description
0  https://images.craigslist.org/00N0N_ipkbHVZYf4...  Carvana is the safer way to buy a car During t...
1  https://images.craigslist.org/00s0s_gBHYmJ5o7y...  I'll move to another city and try to sell my c...
2  https://images.craigslist.org/00B0B_5zgEGWPOrt...  Clean 2006 BMW X3 3.0I. Beautiful and rare Bl...
3  https://images.craigslist.org/00M0M_6o7KcDpArw...  1974 chev. truck (LONG BED) NEW starter front ...
4  https://images.craigslist.org/00p0p_b95l1EgUfl...  2005 Ford F350 Lariat (Bullet Proofed). This t...

   state        lat        long              posting_date
0     al  32.590000  -85.480000  2020-12-02T08:11:30-0600
1     al  32.547500  -85.468200  2020-12-02T02:11:50-0600
2     al  32.616807  -85.464149  2020-12-01T19:50:41-0600
3     al  32.861600  -85.216100  2020-12-01T15:54:45-0600
4     al  32.547500  -85.468200  2020-12-01T12:53:56-0600
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458213 entries, 0 to 458212
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Unnamed: 0    458213 non-null  int64  
 1   id            458213 non-null  int64  
 2   url           458213 non-null  object 
 3   region        458213 non-null  object 
 4   region_url    458213 non-null  object 
 5   price         458213 non-null  int64  
 6   year          457163 non-null  float64
 7   manufacturer  439993 non-null  object 
 8   model         453367 non-null  object 
 9   condition     265273 non-null  object 
 10  cylinders     287073 non-null  object 
 11  fuel          454976 non-null  object 
 12  odometer      402910 non-null  float64
 13  title_status  455636 non-null  object 
 14  transmission  455771 non-null  object 
 15  VIN           270664 non-null  object 
 16  drive         324025 non-null  object 
 17  size          136865 non-null  object 
 18  type          345475 non-null  object 
 19  paint_color   317370 non-null  object 
 20  image_url     458185 non-null  object 
 21  description   458143 non-null  object 
 22  state         458213 non-null  object 
 23  lat           450765 non-null  float64
 24  long          450765 non-null  float64
 25  posting_date  458185 non-null  object 
dtypes: float64(4), int64(3), object(19)
memory usage: 90.9+ MB

EDA and basic statistics

df.isna().sum()
Unnamed: 0           0
id                   0
url                  0
region               0
region_url           0
price                0
year              1050
manufacturer     18220
model             4846
condition       192940
cylinders       171140
fuel              3237
odometer         55303
title_status      2577
transmission      2442
VIN             187549
drive           134188
size            321348
type            112738
paint_color     140843
image_url           28
description         70
state                0
lat               7448
long              7448
posting_date        28
dtype: int64
df.describe()
          Unnamed: 0            id         price           year      odometer            lat           long
count  458213.000000  4.582130e+05  4.582130e+05  457163.000000  4.029100e+05  450765.000000  450765.000000
mean   229106.000000  7.235233e+09  4.042093e+04    2010.746067  1.016698e+05      38.531925     -94.375824
std    132274.843786  4.594362e+06  8.194599e+06       8.868136  3.228623e+06       5.857378      18.076225
min         0.000000  7.208550e+09  0.000000e+00    1900.000000  0.000000e+00     -82.607549    -164.091797
25%    114553.000000  7.231953e+09  4.900000e+03    2008.000000  4.087700e+04      34.600000    -110.890427
50%    229106.000000  7.236409e+09  1.099500e+04    2013.000000  8.764100e+04      39.244500     -88.314889
75%    343659.000000  7.239321e+09  2.149500e+04    2016.000000  1.340000e+05      42.484503     -81.015022
max    458212.000000  7.241019e+09  3.615215e+09    2021.000000  2.043756e+09      82.049255     150.898969
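  • Price: the mean is about $40,000, while the median is only about $11,000.
  • Price: the minimum is 0 and the maximum is absurdly high (mis-entered values, i.e. outliers).
  • Odometer also contains bad data.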
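The gap between the mean and the median is itself a sign of outliers. Here is a minimal sketch on made-up prices (the values and the 1,000,000 cap are assumptions for illustration, not taken from the dataset) showing how a single absurd entry drags the mean while the median barely moves:

```python
import pandas as pd

# Hypothetical toy prices: five plausible listings plus one absurd entry.
prices = pd.Series([4900, 7500, 10995, 19500, 35990, 3_600_000_000])

mean_price = prices.mean()      # pulled far above any realistic price
median_price = prices.median()  # stays close to a typical listing
print(mean_price, median_price)

# Count suspicious rows: free listings and prices above an assumed cap.
n_zero = int((prices == 0).sum())
n_huge = int((prices > 1_000_000).sum())
print(n_zero, n_huge)
```

Counting zero-priced and implausibly expensive rows like this gives a rough idea of how much data a later outlier filter will remove.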

Removing unnecessary columns

  • Replace year with the vehicle's age (age).
df.columns
Index(['Unnamed: 0', 'id', 'url', 'region', 'region_url', 'price', 'year',
       'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer',
       'title_status', 'transmission', 'VIN', 'drive', 'size', 'type',
       'paint_color', 'image_url', 'description', 'state', 'lat', 'long',
       'posting_date'],
      dtype='object')
df.drop(['Unnamed: 0', 'id', 'url', 'region_url', 'VIN',
         'image_url', 'description', 'state', 'lat', 'long',
         'posting_date'], axis=1, inplace=True)
df['age'] = 2021 - df['year']
df.drop('year', axis=1, inplace=True)

Analyzing categorical data

len(df['manufacturer'].value_counts())
43
plt.figure(figsize=(8,10))
sns.countplot(data=df.fillna('n/a'), y='manufacturer', order=df.fillna('n/a')['manufacturer'].value_counts().index)

[Figure: count plot of listings per manufacturer, missing values shown as 'n/a']

len(df['model'].value_counts())

# for model, num in zip(df['model'].value_counts().index, df['model'].value_counts()):
#     print(model, num)
31520
len(df['condition'].value_counts())
6
len(df['cylinders'].value_counts())
8
sns.countplot(data=df.fillna('n/a'), y='transmission', order=df.fillna('n/a')['transmission'].value_counts().index)

[Figure: count plot of listings per transmission type]

sns.countplot(data=df.fillna('n/a'), y='drive', order=df.fillna('n/a')['drive'].value_counts().index)

[Figure: count plot of listings per drive type]

sns.countplot(data=df.fillna('n/a'), y='size', order=df.fillna('n/a')['size'].value_counts().index)

[Figure: count plot of listings per size]

sns.countplot(data=df.fillna('n/a'), y='type', order=df.fillna('n/a')['type'].value_counts().index)

[Figure: count plot of listings per vehicle type]

sns.countplot(data=df.fillna('n/a'), y='paint_color', order=df.fillna('n/a')['paint_color'].value_counts().index)

[Figure: count plot of listings per paint color]

Analyzing numerical data

df.columns
Index(['region', 'price', 'manufacturer', 'model', 'condition', 'cylinders',
       'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'size',
       'type', 'paint_color', 'age'],
      dtype='object')
plt.figure(figsize=(8,2))
sns.rugplot(data=df, x='odometer', height=1)

[Figure: rug plot of odometer]

sns.histplot(data=df, x='age', bins=20, kde=True)

[Figure: histogram of age with KDE curve]

Data preprocessing

Preprocessing categorical data

sns.boxplot(data=df.fillna('n/a'), x='manufacturer', y='price')

[Figure: box plot of price by manufacturer, before outlier removal]

  • The outliers are severe.
df.columns
Index(['region', 'price', 'manufacturer', 'model', 'condition', 'cylinders',
       'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'size',
       'type', 'paint_color', 'age'],
      dtype='object')
  • Plot the category counts to check how many categories in each of these columns carry meaningful data.
col= 'manufacturer'
counts = df[col].fillna('others').value_counts()
plt.grid()
plt.plot(range(len(counts)), counts)

# Around 10 categories looks appropriate.

[Figure: line plot of manufacturer category counts, sorted in descending order]

# Categories outside the top 10 are treated as 'others'.
n_categorical = 10
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s: s if str(s) not in others else 'others')
df['manufacturer'].value_counts()
others       134392
ford          79666
chevrolet     64977
toyota        38577
honda         25868
nissan        23654
jeep          21165
ram           17697
gmc           17267
dodge         16730
Name: manufacturer, dtype: int64
col= 'model'
counts = df[col].fillna('others').value_counts()
plt.grid()
plt.plot(range(len(counts)), counts)

[Figure: line plot of model category counts, sorted in descending order]

col= 'model'
counts = df[col].fillna('others').value_counts()
plt.grid()
plt.plot(range(len(counts[:20])), counts[:20])

[Figure: line plot of the 20 largest model category counts]

# Unlike manufacturer, model has a huge number of categories, so the lambda can run slowly.
# Slicing the index once, outside the lambda, avoids repeating that work for every row.
n_categorical = 10
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s: s if str(s) not in others else 'others')

df['model'].value_counts()
others            413556
f-150               8370
silverado 1500      5964
1500                4211
camry               4033
accord              3730
altima              3490
civic               3479
escape              3444
silverado           3090
Name: model, dtype: int64
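The same top-n grouping can also be written without a row-by-row lambda. Below is a minimal sketch, assuming a hypothetical helper keep_top_n that is not part of the original notebook. Note two differences from the cells above: it sends NaN to 'others' as well, and NaN never takes up one of the top-n slots.

```python
import pandas as pd

def keep_top_n(series: pd.Series, n: int, other_label: str = "others") -> pd.Series:
    """Keep the n most frequent values; map everything else (and NaN) to other_label."""
    top = series.value_counts().index[:n]  # value_counts skips NaN by default
    return series.where(series.isin(top), other_label)

# Toy data for illustration only.
models = pd.Series(["f-150", "camry", "f-150", "civic", None, "camry", "f-150", "x3 3.0i"])
reduced = keep_top_n(models, n=2)
print(reduced.value_counts())
```

Because isin and where operate on the whole column at once, this should scale better than apply on a column with tens of thousands of distinct values.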
col = 'condition'
counts = df[col].fillna('others').value_counts()
n_categorical = 3 
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s : s if str(s) not in others else 'others')
col = 'cylinders'
counts = df[col].fillna('others').value_counts()
n_categorical = 4 
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s : s if str(s) not in others else 'others')
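col = 'fuel'
# Note: the value_counts line is left commented out, so `counts` below still holds the
# 'cylinders' counts from the previous cell. In effect only the raw 'other' value is
# relabeled 'others'; gas, diesel, electric and hybrid are kept as they are.
# counts = df[col].fillna('others').value_counts()

n_categorical = 2
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s : s if str(s) not in others else 'others')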
df.drop('title_status', axis=1, inplace=True)
col = 'transmission'
counts = df[col].fillna('others').value_counts()
n_categorical = 3
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s : s if str(s) not in others else 'others')
col = 'drive'
df[col] = df[col].fillna('others')
col = 'size'
counts = df[col].fillna('others').value_counts()
n_categorical = 2
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s : s if str(s) not in others else 'others')
col = 'type'
counts = df[col].fillna('others').value_counts()
n_categorical = 8
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s : s if str(s) not in others else 'others')

df.loc[df[col] == 'other', col] = 'others'

# There is also a category named 'other', so it is merged into 'others'.
col = 'paint_color'
counts = df[col].fillna('others').value_counts()
n_categorical = 7
others = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s : s if str(s) not in others else 'others')

Preprocessing numerical data

# age looks fine; odometer and price need adjusting.
p1 = df['price'].quantile(0.99)
p2 = df['price'].quantile(0.1)
print(p1, p2)

# Price: remove outliers by dropping the top 1% and the bottom 10%.
59900.0 651.0
df = df[(p1 > df['price']) & (df['price'] > p2)]
o1 = df['odometer'].quantile(0.99)
o2 = df['odometer'].quantile(0.1)
print(o1, o2)

# Odometer: remove outliers by dropping the top 1% and the bottom 10% (quantile(0.1) is the 10th percentile).
270000.0 17553.0
df = df[(o1 > df['odometer']) & (df['odometer'] > o2)]
df.describe()

# Summary of the numerical data after removing outliers
               price       odometer            age
count  324382.000000  324382.000000  323860.000000
mean    15314.530106  102569.319602      10.174001
std     11298.917484   55165.135400       7.076283
min       652.000000   17555.000000       0.000000
25%      6500.000000   56199.000000       5.000000
50%     12388.000000   98146.000000       9.000000
75%     21000.000000  140482.750000      13.000000
max     59895.000000  269930.000000     121.000000
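The two filtering steps follow the same pattern, so they can be wrapped in one function. This is a rough sketch, assuming a hypothetical helper trim_by_quantile and a toy frame; neither is from the original notebook.

```python
import numpy as np
import pandas as pd

def trim_by_quantile(frame: pd.DataFrame, column: str, lower: float, upper: float) -> pd.DataFrame:
    """Keep rows whose value lies strictly between the lower and upper quantiles of column."""
    lo = frame[column].quantile(lower)
    hi = frame[column].quantile(upper)
    # Comparisons with NaN are False, so rows with a missing value are dropped too.
    return frame[(frame[column] > lo) & (frame[column] < hi)]

# Toy data: prices 0, 1, ..., 100.
toy = pd.DataFrame({"price": np.arange(0, 101)})
trimmed = trim_by_quantile(toy, "price", lower=0.10, upper=0.99)
print(len(toy), len(trimmed))
print(trimmed["price"].min(), trimmed["price"].max())
```

The NaN behavior matters here: filtering on odometer also removes every row whose odometer is missing, which is why odometer has no missing values in the summary above.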
plt.figure(figsize=(10,8))
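sns.boxplot(data=df, x='manufacturer', y='price')

[Figure: box plot of price by manufacturer, after outlier removal]

  • The overall price ranges are similar across manufacturers, but the typical prices (the medians marked in the boxes) differ enough to be worth comparing.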
plt.figure(figsize=(10,8))
sns.boxplot(data=df, x='model', y='price')

[Figure: box plot of price by model]

  • Even for the same model, the price varies widely with condition and mileage. Prices also differ from model to model.
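# Restrict to the numeric columns explicitly; recent pandas versions raise an error
# when corr() is called on a frame that still contains string columns.
sns.heatmap(df[['price', 'odometer', 'age']].corr(), annot=True, cmap='YlOrRd')

[Figure: correlation heatmap of price, odometer and age]

  • Odometer and age are both strongly correlated with price, and both in the negative direction.
  • Unsurprisingly, odometer and age are also correlated with each other.
  • Using both may therefore add redundancy and make the model less efficient.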
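To read the exact coefficients rather than colors, the correlation matrix can be printed directly. A small sketch on a toy frame built from the five rows shown by df.head() (with age computed as 2021 minus year); five rows are far too few to say anything about the full dataset, so this only illustrates the mechanics:

```python
import pandas as pd

# Toy frame: the five listings from df.head(), age = 2021 - year.
toy = pd.DataFrame({
    "price":    [35990, 7500, 4900, 2000, 19500],
    "odometer": [32742, 93600, 87046, 190000, 116000],
    "age":      [11, 7, 15, 47, 16],
    "fuel":     ["gas", "gas", "gas", "gas", "diesel"],
})

numeric = toy.select_dtypes(include="number")  # drop string columns before corr()
corr = numeric.corr()                          # Pearson correlation by default
print(corr.round(2))
```

A large absolute value in the odometer/age cell is what would justify dropping one of the two features.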
from sklearn.preprocessing import StandardScaler

X_num = df[['odometer', 'age']]

scaler = StandardScaler()
scaler.fit(X_num)
X_scaled = scaler.transform(X_num)
X_scaled = pd.DataFrame(X_scaled, index=X_num.index, columns=X_num.columns)

# Convert the categorical data to one-hot vectors
X_cat = df.drop(['price', 'odometer', 'age'], axis=1)
X_cat = pd.get_dummies(X_cat)

X = pd.concat([X_scaled, X_cat], axis=1)
y = df['price']
X.head()
   odometer       age  ...
0 -1.265789  0.116728  ...
1 -0.162591 -0.448541  ...
2 -0.281398  0.681997  ...
3  1.584893  5.204152  ...
4  0.243464  0.823315  ...

[5 rows x 462 columns]

The remaining columns are 0/1 indicator columns created by pd.get_dummies:
  • region_SF bay area through region_zanesville / cambridge (one column per region)
  • manufacturer_chevrolet, manufacturer_dodge, manufacturer_ford, manufacturer_gmc, manufacturer_honda, manufacturer_jeep, manufacturer_nissan, manufacturer_others, manufacturer_ram, manufacturer_toyota
  • model_1500, model_accord, model_altima, model_camry, model_civic, model_escape, model_f-150, model_others, model_silverado, model_silverado 1500
  • condition_excellent, condition_good, condition_others
  • cylinders_4 cylinders, cylinders_6 cylinders, cylinders_8 cylinders, cylinders_others
  • fuel_diesel, fuel_electric, fuel_gas, fuel_hybrid, fuel_others
  • transmission_automatic, transmission_manual, transmission_other
  • drive_4wd, drive_fwd, drive_others, drive_rwd
  • size_full-size, size_others
  • type_SUV, type_coupe, type_hatchback, type_others, type_pickup, type_sedan, type_truck
  • paint_color_black, paint_color_blue, paint_color_grey, paint_color_others, paint_color_red, paint_color_silver, paint_color_white
X.isna().sum()
odometer                   0
age                      522
region_SF bay area         0
region_abilene             0
region_akron / canton      0
                        ... 
paint_color_grey           0
paint_color_others         0
paint_color_red            0
paint_color_silver         0
paint_color_white          0
Length: 462, dtype: int64
X['age'].mean()
2.4065666124846312e-15
X['age'] = X['age'].fillna(0.0)

# After standardization the mean of age is essentially 0, so filling NaN with 0 amounts to mean imputation.
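Why filling with 0 is reasonable can be checked on a toy column. The sketch below standardizes by hand with pandas (using ddof=0, the population standard deviation, which as far as I know is what StandardScaler uses) and compares "scale, then fill 0" with "fill the mean, then scale". The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value.
age = pd.Series([11.0, 7.0, 15.0, np.nan, 16.0])

# pandas skips NaN when computing mean and std.
mean, std = age.mean(), age.std(ddof=0)
scaled = (age - mean) / std

filled_after = scaled.fillna(0.0)                 # fill 0 on the scaled column
filled_before = (age.fillna(mean) - mean) / std   # mean-impute first, then scale
same = bool(np.allclose(filled_after, filled_before))
print(same)
```

Both routes give the same column, because a value equal to the mean always maps to 0 after standardization.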

Splitting into training and test data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Prediction (regression model)

XGBoost Regression

from xgboost import XGBRegressor

model_reg = XGBRegressor()
model_reg.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=4, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
pred = model_reg.predict(X_test)
print(mean_absolute_error(y_test, pred))
print(sqrt(mean_squared_error(y_test, pred)))
3208.492644616665
4863.626919465305
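The first number is the mean absolute error (MAE) and the second is the root mean squared error (RMSE), both in dollars. As a quick reference, here is a sketch that computes the two metrics by hand with NumPy on made-up values (the function names and numbers are illustrative only):

```python
import numpy as np

def mae(y_true, y_pred) -> float:
    """Mean absolute error: the average size of the errors."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy prices, for illustration only.
y_true_toy = [10000, 20000, 30000]
y_pred_toy = [11000, 18000, 34000]
print(mae(y_true_toy, y_pred_toy))
print(rmse(y_true_toy, y_pred_toy))
```

RMSE is never smaller than MAE, and a wide gap between them, as in the result above, suggests that a small number of large errors dominates.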

Analyzing the results

plt.scatter(x=y_test, y=pred, alpha=0.005)
plt.plot([0,60000], [0,60000], 'r-')

[Figure: scatter plot of actual price (x) against predicted price (y), with the line y = x in red]

  • Some cars that are actually very cheap are predicted far too high, and overall the model tends to underestimate.
err = (pred - y_test) / y_test * 100
sns.histplot(err)
plt.xlabel('error(%)')
plt.xlim(-100, 100)

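# Histogram of the percentage error
(-100, 100)

[Figure: histogram of the percentage error, limited to -100% to 100%]

  • As concluded above, underestimates are the more common case, while the overestimates that do occur tend to have large percentage errors.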
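The impression from the histogram can be backed with a few summary numbers. A minimal sketch on made-up predictions (not the model's actual output) that measures how often the prediction is too low and how large the worst overestimate is:

```python
import numpy as np

# Toy actual and predicted prices, for illustration only.
actual = np.array([2000.0, 8000.0, 12000.0, 20000.0, 40000.0])
predicted = np.array([5000.0, 7000.0, 11000.0, 18500.0, 36000.0])

# Percentage error; dividing is safe here because very low prices were filtered out earlier.
err_pct = (predicted - actual) / actual * 100

share_under = float(np.mean(err_pct < 0))  # fraction of underestimated cars
median_err = float(np.median(err_pct))     # typical signed error
worst_over = float(err_pct.max())          # largest overestimate
print(share_under, median_err, worst_over)
```

A negative median together with a large maximum is the pattern described above: most cars are slightly underestimated, while a few cheap cars are overestimated by a lot.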
sns.histplot(x=y_test, y=pred)
plt.plot([0,60000], [0,60000], 'r-')

[Figure: 2D histogram of actual price (x) against predicted price (y), with the line y = x in red]