Oh-Seung-Rok

안녕하세요. 정량적인 분석을 통해 제품-서비스를 개선하고 소비자들의 욕구를 먼저 파악하는 분석가가 되고 싶은 오승록입니다. 포트폴리오 [https://seungrok0317.com]

Kaggle - New York City Airbnb

23 Mar 2021 » Kaggle

New York Cty Airbnb Open Data

  • 데이터 셋 : https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

  • 뉴욕시 에어비앤비에 전시된 여러 공간의 변수들을 통해 적당한 ‘이용료’를 파악-분류하고자 하는 데이터 셋이다.

라이브러리 설정 및 데이터 읽어들이기

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('AB_NYC_2019.csv')

pd.set_option('display.max_columns', None)
df.head()
idnamehost_idhost_nameneighbourhood_groupneighbourhoodlatitudelongituderoom_typepriceminimum_nightsnumber_of_reviewslast_reviewreviews_per_monthcalculated_host_listings_countavailability_365
02539Clean & quiet apt home by the park2787JohnBrooklynKensington40.64749-73.97237Private room149192018-10-190.216365
12595Skylit Midtown Castle2845JenniferManhattanMidtown40.75362-73.98377Entire home/apt2251452019-05-210.382355
23647THE VILLAGE OF HARLEM....NEW YORK !4632ElisabethManhattanHarlem40.80902-73.94190Private room15030NaNNaN1365
33831Cozy Entire Floor of Brownstone4869LisaRoxanneBrooklynClinton Hill40.68514-73.95976Entire home/apt8912702019-07-054.641194
45022Entire Apt: Spacious Studio/Loft by central park7192LauraManhattanEast Harlem40.79851-73.94399Entire home/apt801092018-11-190.1010
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     38843 non-null  object 
 13  reviews_per_month               38843 non-null  float64
 14  calculated_host_listings_count  48895 non-null  int64  
 15  availability_365                48895 non-null  int64  
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB
df.isna().sum()
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
  • last_review 와 reviews_per_month 숫자가 같은것으로 보아 동일한 개수의 데이터일 듯 하다.
(df['number_of_reviews'] == 0).sum()
10052
  • number of reviews가 0인 데이터가 10052인것으로 보아 해당 데이터가 last_review, reviews_per_month가 결측되어 있음을 알 수 있다.
# 곱셉 기능을 통해 True False 유무 확인. & 기능을 통해 True False 확인.
(df['reviews_per_month'].isna() & df['last_review'].isna()).sum()
10052
  • 10052로 null 값 생성. 즉 두 변수가 가지고 있는 결측치의 인덱스가 동일

EDA 및 기초통계 분석

불필요한 column 제거

  • ID, host_name, latitude, longitude은 직관적으로 가격에 영향을 및지 않을 것 같음. 위도와 경도의 경우 다를 수 있겠으나 일단 제거.
  • name 역시 자연어 처리를 통해 유의미한 결과값을 뽑아낼 수 있겠으나 일단 EDA - 데이터 분석을 위해 제거.
  • 리뷰관련된 변수의 경우 리뷰의 유무라는 새로운 변수 생성가능. 이용 가능 일수 역시 0일(이용 일수 미입력)이라는 변수 새롭게 생성 가능
df['room_type'].value_counts()
Entire home/apt    25409
Private room       22326
Shared room         1160
Name: room_type, dtype: int64
df['availability_365'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x269799e7508>

output_17_1

# 이용 가능 일수가 0인 데이터. 

(df['availability_365'] == 0).sum()
17533
df.describe()
idhost_idlatitudelongitudepriceminimum_nightsnumber_of_reviewsreviews_per_monthcalculated_host_listings_countavailability_365
count4.889500e+044.889500e+0448895.00000048895.00000048895.00000048895.00000048895.00000038843.00000048895.00000048895.000000
mean1.901714e+076.762001e+0740.728949-73.952170152.7206877.02996223.2744661.3732217.143982112.781327
std1.098311e+077.861097e+070.0545300.046157240.15417020.51055044.5505821.68044232.952519131.622289
min2.539000e+032.438000e+0340.499790-74.2444200.0000001.0000000.0000000.0100001.0000000.000000
25%9.471945e+067.822033e+0640.690100-73.98307069.0000001.0000001.0000000.1900001.0000000.000000
50%1.967728e+073.079382e+0740.723070-73.955680106.0000003.0000005.0000000.7200001.00000045.000000
75%2.915218e+071.074344e+0840.763115-73.936275175.0000005.00000024.0000002.0200002.000000227.000000
max3.648724e+072.743213e+0840.913060-73.71299010000.0000001250.000000629.00000058.500000327.000000365.000000
  • 가격의 경우 최소가격과 최대가격의 설정이 잘못 되었다. 최소 숙박일수 역시 최대값이 잘못 설정되었다.
df.columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')
df.drop(['id', 'name', 'host_name', 'latitude', 'longitude'], axis=1, inplace=True)
df.head()
host_idneighbourhood_groupneighbourhoodroom_typepriceminimum_nightsnumber_of_reviewslast_reviewreviews_per_monthcalculated_host_listings_countavailability_365
02787BrooklynKensingtonPrivate room149192018-10-190.216365
12845ManhattanMidtownEntire home/apt2251452019-05-210.382355
24632ManhattanHarlemPrivate room15030NaNNaN1365
34869BrooklynClinton HillEntire home/apt8912702019-07-054.641194
47192ManhattanEast HarlemEntire home/apt801092018-11-190.1010

수치형 데이터

  • 수치형 데이터를 통해서만 price를 예측할 수 있을까?
sns.jointplot(data=df, x='reviews_per_month', y='price')
<seaborn.axisgrid.JointGrid at 0x2697ab33148>

output_26_1

sns.heatmap(df.corr(), annot=True, cmap='YlOrRd')
<matplotlib.axes._subplots.AxesSubplot at 0x2697b48fec8>

output_27_1

  • 전처리가 되어 있지 않아서 상관성을 파악하기 어렵다.
  • 의외로 host_id가 상관성이 있어보이는데 사용일수와 0.2, 한달동안 리뷰는 0.3 으로 확인할 수 있다.
  • 새롭게 운영하는 사람일수록 큰 숫자의 id를 가지고 있고 새 시설인 만큼 리뷰도 많이 받고 오래 이용할 수 있기 때문인지 확인이 더 필요하다.
  • host_id에 있어서 누적리뷰와 한달동안 받는 리뷰가 다른 상관성을 띈다.

범주형 데이터

sns.boxplot(data=df, x='neighbourhood_group', y='price')
<matplotlib.axes._subplots.AxesSubplot at 0x2697e22af08>

output_30_1

  • 최대값이외에도 평균 기준값을 훨씬 초과하는 아웃라이어 price 값들이 있어서 확인하기 어렵다. 전처리 필요

데이터 전처리

범주형 데이터 전처리

df['neighbourhood_group'].value_counts()
Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: neighbourhood_group, dtype: int64
df['neighbourhood'].value_counts()
Williamsburg          3920
Bedford-Stuyvesant    3714
Harlem                2658
Bushwick              2465
Upper West Side       1971
                      ... 
Woodrow                  1
Fort Wadsworth           1
New Dorp                 1
Richmondtown             1
Rossville                1
Name: neighbourhood, Length: 221, dtype: int64
plt.plot(range(len(df['neighbourhood'].value_counts())), df['neighbourhood'].value_counts())
[<matplotlib.lines.Line2D at 0x2697e315588>]

output_36_1

  • 소수의 값 제거를 위해 50번째 이전 value 값만 잔존시킴.
ne = df['neighbourhood'].value_counts()[50:]
df['neighbourhood'] = df['neighbourhood'].apply(lambda s : s if str(s) not in ne[50:] else 'others')

수치형 데이터 전처리

sns.rugplot(data=df, x='price', height=1)
<matplotlib.axes._subplots.AxesSubplot at 0x2697e36a348>

output_41_1

print(df['price'].quantile(0.99))
print(df['price'].quantile(0.005))
799.0
26.0
  • price값 상위 1% 가 799달러 즉, 1000달러 이후의 데이터는 아웃라이어 값으로 판단할 수 있다.
  • price값 하위 아웃라이어 값 제거를 위해 0.5% 제거, 상위는 5% 제거
sns.rugplot(data=df, x='minimum_nights', height=1)
<matplotlib.axes._subplots.AxesSubplot at 0x2697e4aaf88>

output_44_1

print(df['minimum_nights'].quantile(0.98))
print(df['minimum_nights'].quantile(0.005))
30.0
1.0
  • minimum_nights 경우 최소값은 자를 필요가 없고 상위 2%정도로 자르면 될 듯 하다.
sns.rugplot(data=df, x='availability_365', height=1)
<matplotlib.axes._subplots.AxesSubplot at 0x269028d0e88>

output_47_1

df['availability_365'].quantile(0.3)
0.0
  • 30% 까지 0이라는 것은 미입력된 데이터가 상당하다는 뜻으로 차라리 이용일수가’0’인 새로운 범주형변수를 생성하는것이 효율적으로 보인다.
p1= df['price'].quantile(0.95)
p2= df['price'].quantile(0.005)

df = df[(p1 > df['price']) & (df['price'] > p2)]
df['price'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x269042e5fc8>

output_51_1

df['minimum_nights'].quantile(0.98)
30.0
  • price에서 아웃라이어 제거해도 상위 2%값 변하지 않음. 그대로 진행해도 좋다.
m1 = df['minimum_nights'].quantile(0.98)
df = df[df['minimum_nights'] < m1 ]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_num = df.drop(['neighbourhood_group', 'neighbourhood', 'room_type', 'avail_zero', 'review_exists','last_review','price'], axis=1)

scaler.fit(X_num)
X_scaled = scaler.transform(X_num)
X_scaled = pd.DataFrame(X_scaled, index=X_num.index, columns=X_num.columns)

# last_review 경우 수치형이긴 하지만 날짜형태

범주형 데이터 전처리

# availability_365 중 새로운 범주형 변수 생성

df['avail_zero'] = df['availability_365'].apply(lambda x : 'Zero' if x==0 else 'NonZero')
# review 중 Null 값 채우기.

df['review_exists'] = df['reviews_per_month'].isna().apply(lambda x: 'No' if x is True else 'Yes')

df.fillna(0, inplace=True) # 현재 미기입은 리뷰만 있음.
df.columns
Index(['host_id', 'neighbourhood_group', 'neighbourhood', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'avail_zero', 'review_exists'],
      dtype='object')
X_cat = df[['neighbourhood_group', 'neighbourhood', 'room_type', 'avail_zero', 'review_exists']]
X_cat = pd.get_dummies(X_cat)

# 선형회귀일 경우 get dummies 실행시 drop_first 그라디언트 부스트 사용시 X.
X = pd.concat([X_scaled, X_cat], axis=1)
y = df['price']
X.head()
host_idminimum_nightsnumber_of_reviewsreviews_per_monthcalculated_host_listings_countavailability_365neighbourhood_group_Bronxneighbourhood_group_Brooklynneighbourhood_group_Manhattanneighbourhood_group_Queensneighbourhood_group_Staten Islandneighbourhood_Arverneneighbourhood_Astorianeighbourhood_Battery Park Cityneighbourhood_Bay Ridgeneighbourhood_Bedford-Stuyvesantneighbourhood_Bensonhurstneighbourhood_Boerum Hillneighbourhood_Borough Parkneighbourhood_Briarwoodneighbourhood_Brighton Beachneighbourhood_Brooklyn Heightsneighbourhood_Brownsvilleneighbourhood_Bushwickneighbourhood_Canarsieneighbourhood_Carroll Gardensneighbourhood_Chelseaneighbourhood_Chinatownneighbourhood_Civic Centerneighbourhood_Clinton Hillneighbourhood_Cobble Hillneighbourhood_Concourseneighbourhood_Coronaneighbourhood_Crown Heightsneighbourhood_Cypress Hillsneighbourhood_Ditmars Steinwayneighbourhood_Downtown Brooklynneighbourhood_East Elmhurstneighbourhood_East Flatbushneighbourhood_East Harlemneighbourhood_East New Yorkneighbourhood_East Villageneighbourhood_Elmhurstneighbourhood_Financial Districtneighbourhood_Flatbushneighbourhood_Flatiron Districtneighbourhood_Flatlandsneighbourhood_Flushingneighbourhood_Fordhamneighbourhood_Forest Hillsneighbourhood_Fort Greeneneighbourhood_Fort Hamiltonneighbourhood_Glendaleneighbourhood_Gowanusneighbourhood_Gramercyneighbourhood_Gravesendneighbourhood_Greenpointneighbourhood_Greenwich Villageneighbourhood_Harlemneighbourhood_Hell's Kitchenneighbourhood_Inwoodneighbourhood_Jackson Heightsneighbourhood_Jamaicaneighbourhood_Kensingtonneighbourhood_Kingsbridgeneighbourhood_Kips Bayneighbourhood_Little Italyneighbourhood_Long Island Cityneighbourhood_Longwoodneighbourhood_Lower East Sideneighbourhood_Maspethneighbourhood_Midtownneighbourhood_Midwoodneighbourhood_Morningside Heightsneighbourhood_Mott Havenneighbourhood_Murray Hillneighbourhood_NoHoneighbourhood_Nolitaneighbourhood_Ozone Parkneighbourhood_Park Slopeneighbourhood_Port Morrisneighbourhood_Prospect Heightsneighbourhood_Prospect-Lefferts Gardensneighbourhood_Queens Villageneighbourhood_Red Hookneighbourhood_Rego Parkneighbourhood_Richmond Hillneighbourhood_Ridgewoodneighbourhood_Rockaway Beachneighbourhood_Roosevelt Islandneighbourhood_Rosedaleneighbourhood_Sheepshead Bayneighbourhood_SoHoneighbourhood_South Slopeneighbourhood_Springfield Gardensneighbourhood_St. Albansneighbourhood_St. Georgeneighbourhood_Sunnysideneighbourhood_Sunset Parkneighbourhood_Theater Districtneighbourhood_Tribecaneighbourhood_Two Bridgesneighbourhood_Upper East Sideneighbourhood_Upper West Sideneighbourhood_Wakefieldneighbourhood_Washington Heightsneighbourhood_West Villageneighbourhood_Williamsburgneighbourhood_Windsor Terraceneighbourhood_Woodhavenneighbourhood_Woodsideneighbourhood_othersroom_type_Entire home/aptroom_type_Private roomroom_type_Shared roomavail_zero_NonZeroavail_zero_Zeroreview_exists_Noreview_exists_Yes
0-0.848227-0.588004-0.352358-0.5882990.0660692.13258601000000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000101001
1-0.848227-0.5880040.419649-0.485721-0.0841342.05261600100000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000001001001
2-0.848204-0.119342-0.545360-0.715013-0.1216842.13258600100000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000101010
3-0.848200-0.5880045.2446922.084766-0.1216840.76509501000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000001001001
4-0.8481701.520973-0.352358-0.654673-0.121684-0.78632700100000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000001000101

학습 및 테스트 데이터 분리

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

분류하기 - XGBoost Regression 모델 적용

from xgboost import XGBRegressor

model_reg =XGBRegressor()
model_reg.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=4, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

모델 평가

from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt

pred = model_reg.predict(X_test)
print(mean_absolute_error(y_test, pred))
print(sqrt(mean_squared_error(y_test, pred)))
34.158105898456235
48.339277624918665
  • 낮은 값인지 높은값인지 직관적으로 쉬이 알기 어렵다.
plt.scatter(x=y_test, y=pred, alpha=0.1)
plt.plot([0,350], [0,350], 'r-')
[<matplotlib.lines.Line2D at 0x26907658148>]

output_70_1

  • price 값의 경우 단위별로 나누어져 있는 경우가 많아 구간별로 나누어진다.
  • 값이 낮을때는 overestimate 높을 경우는 underestimate 하는 현상이 일어남.
err = (pred - y_test) / y_test
sns.histplot(err)
plt.grid()

output_72_0