Student's Academic Performance

Dataset: https://www.kaggle.com/aljarah/xAPI-Edu-Data
This dataset is used to predict students' grades from their personal information and classroom activity data.

Library Setup and Data Loading

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('C:/Users/dissi/Kaggle Practice/xAPI-Edu-Data.csv')

df.sample(10)

	gender	NationalITy	PlaceofBirth	StageID	GradeID	SectionID	Topic	Semester	Relation	raisedhands	VisITedResources	AnnouncementsView	Discussion	ParentAnsweringSurvey	ParentschoolSatisfaction	StudentAbsenceDays	Class
453	F	Jordan	Jordan	MiddleSchool	G-08	A	Geology	S	Father	29	78	40	12	Yes	Good	Above-7	M
412	M	Palestine	Jordan	MiddleSchool	G-07	B	Biology	F	Father	78	80	66	51	Yes	Good	Under-7	M
339	F	Palestine	Jordan	lowerlevel	G-02	B	French	S	Father	79	89	11	14	No	Good	Under-7	M
55	M	KW	KuwaIT	MiddleSchool	G-07	A	Math	F	Father	16	14	6	20	Yes	Good	Above-7	L
471	M	Palestine	Jordan	MiddleSchool	G-08	A	History	S	Father	78	82	78	53	Yes	Good	Under-7	M
140	M	Tunis	Tunis	MiddleSchool	G-07	A	Quran	F	Father	10	60	5	20	Yes	Bad	Above-7	L
327	M	Jordan	Jordan	lowerlevel	G-02	A	French	S	Father	30	10	20	5	No	Bad	Above-7	L
228	M	KW	KuwaIT	HighSchool	G-11	B	Math	S	Mum	73	84	77	81	Yes	Good	Above-7	H
418	M	Palestine	Jordan	MiddleSchool	G-07	B	Biology	F	Father	88	90	76	81	Yes	Good	Under-7	H
378	M	Jordan	Jordan	lowerlevel	G-02	B	Arabic	F	Father	10	30	50	91	Yes	Bad	Above-7	L

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   gender                    480 non-null    object
 1   NationalITy               480 non-null    object
 2   PlaceofBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   raisedhands               480 non-null    int64 
 10  VisITedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentschoolSatisfaction  480 non-null    object
 15  StudentAbsenceDays        480 non-null    object
 16  Class                     480 non-null    object
dtypes: int64(4), object(13)
memory usage: 63.9+ KB

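gender : student's gender
NationalITy : student's nationality
PlaceofBirth : country where the student was born
StageID : school level the student attends (lowerlevel, MiddleSchool, HighSchool)
GradeID : grade level the student belongs to (e.g. G-02, G-08)
SectionID : name of the classroom section the student belongs to
Topic : course topic taken
Semester : semester in which the course was taken (F: first, S: second)
Relation : relationship of the primary guardian to the student (Father, Mum)
raisedhands : number of times the student raised a hand in class
VisITedResources : number of times the student checked the course content
Discussion : number of times the student took part in discussion groups
AnnouncementsView : number of times the student checked announcements
ParentAnsweringSurvey : whether the parent answered the school survey
ParentschoolSatisfaction : whether the parent is satisfied with the school
StudentAbsenceDays : number of days the student was absent (Above-7 / Under-7)
Class : student's grade class (L low, M middle, H high)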
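
The descriptions above can be cross-checked against the data by listing the distinct values of every non-numeric column. The following is only a minimal sketch: the small `toy` frame is a made-up stand-in for `df`, and the variable names are illustrative.

```python
import pandas as pd

# Made-up stand-in for df with a few of the columns described above.
toy = pd.DataFrame({
    "gender": ["F", "M", "M"],
    "StageID": ["lowerlevel", "MiddleSchool", "HighSchool"],
    "raisedhands": [10, 50, 90],
    "Class": ["L", "M", "H"],
})

# Keep only the non-numeric (categorical) columns.
categorical = toy.select_dtypes(exclude="number")

# Number of distinct levels per categorical column.
levels_per_column = categorical.nunique()
print(levels_per_column)

# Print the actual levels so they can be compared with the descriptions.
for col in categorical.columns:
    print(col, sorted(categorical[col].unique()))
```

Replacing `toy` with `df` would run the same check on the real data.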

df['NationalITy'].value_counts()

KW             179
Jordan         172
Palestine       28
Iraq            22
lebanon         17
Tunis           12
SaudiArabia     11
Egypt            9
Syria            7
Lybia            6
USA              6
Iran             6
Morocco          4
venzuela         1
Name: NationalITy, dtype: int64

df['PlaceofBirth'].unique()

array(['KuwaIT', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Iraq',
       'Palestine', 'Lybia'], dtype=object)
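
The two outputs above show that the same country is not always written the same way: NationalITy uses KW while PlaceofBirth uses KuwaIT, and labels such as lebanon, venzuela and Lybia are misspelled or inconsistently capitalized. The notebook keeps the raw labels, but if they were to be cleaned, one possible sketch looks like this. The `COUNTRY_FIXES` mapping and the `normalize_country` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical mapping from raw labels seen in the outputs above to cleaned names.
COUNTRY_FIXES = {
    "KW": "Kuwait",
    "KuwaIT": "Kuwait",
    "lebanon": "Lebanon",
    "venzuela": "Venezuela",
    "Lybia": "Libya",
}

def normalize_country(series: pd.Series) -> pd.Series:
    """Replace known raw labels with cleaned names and leave all others unchanged."""
    return series.replace(COUNTRY_FIXES)

# Made-up stand-in for df.
toy = pd.DataFrame({
    "NationalITy": ["KW", "Jordan", "venzuela"],
    "PlaceofBirth": ["KuwaIT", "Jordan", "venzuela"],
})

for col in ["NationalITy", "PlaceofBirth"]:
    toy[col] = normalize_country(toy[col])

# After cleaning, the two columns can be compared directly.
same_country = toy["NationalITy"] == toy["PlaceofBirth"]
print(toy)
```

Note that cleaning the labels would change the names of the one-hot columns produced later, so the rest of the notebook is written against the raw labels.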

EDA and Basic Statistical Analysis

Numerical Data

raisedhands, VisITedResources, AnnouncementsView, Discussion

df.describe()

	raisedhands	VisITedResources	AnnouncementsView	Discussion
count	480.000000	480.000000	480.000000	480.000000
mean	46.775000	54.797917	37.918750	43.283333
std	30.779223	33.080007	26.611244	27.637735
min	0.000000	0.000000	0.000000	1.000000
25%	15.750000	20.000000	14.000000	20.000000
50%	50.000000	65.000000	33.000000	39.000000
75%	75.000000	84.000000	58.000000	70.000000
max	100.000000	99.000000	98.000000	99.000000

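Judging from the summary, all four features span roughly the same 0 to 100 range with similar means and standard deviations, so the data can be considered reasonably balanced overall.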

sns.histplot(data=df, x='raisedhands',hue='Class', hue_order=['L', 'M', 'H'], kde=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1661d411708>

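[Figure: histogram of raisedhands by Class (L, M, H) with KDE curves]

The number of raised hands splits into two groups, low and high, and this split reflects the Class labels well. However, some students who raised their hands rarely, and some who raised them often, still ended up in the middle class, so this feature alone cannot separate the classes perfectly.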

sns.histplot(data=df, x='VisITedResources',hue='Class', hue_order=['L', 'M', 'H'], kde=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1661dc92108>

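[Figure: histogram of VisITedResources by Class (L, M, H) with KDE curves]

The pattern is similar to raisedhands, and this feature also reflects the Class labels well.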

sns.histplot(data=df, x='AnnouncementsView',hue='Class', hue_order=['L', 'M', 'H'], kde=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1661dd81948>

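[Figure: histogram of AnnouncementsView by Class (L, M, H) with KDE curves]

Low-performing students are easy to tell apart, but the boundary between middle and high performers is blurry.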

sns.histplot(data=df, x='Discussion',hue='Class', hue_order=['L', 'M', 'H'], kde=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1661de54548>

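[Figure: histogram of Discussion by Class (L, M, H) with KDE curves]

Compared with the other features, no clear trend is visible. (Low performers also participate to some extent, and high performers show two distinct groups.)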
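
The visual impressions from the four histograms can be backed up with per-class averages. This is only a sketch on a made-up `toy` frame (two of the four features, invented values); with the real data, `toy` would be replaced by `df` and `numeric_cols` would list all four columns.

```python
import pandas as pd

# Made-up stand-in for df with two of the numerical features.
toy = pd.DataFrame({
    "Class": ["L", "L", "M", "M", "H", "H"],
    "raisedhands": [10, 20, 40, 60, 80, 90],
    "Discussion": [30, 50, 20, 60, 40, 80],
})
numeric_cols = ["raisedhands", "Discussion"]

# Mean of each feature per class, ordered from low to high performers.
class_means = toy.groupby("Class")[numeric_cols].mean().reindex(["L", "M", "H"])
print(class_means)
```

A feature whose mean rises steadily from L to H separates the classes better than one whose class means are close together.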

# raisedhands and VisITedResources both separate the classes well, so look at them together with a jointplot.
sns.jointplot(data=df, x='VisITedResources', y='raisedhands', hue='Class', hue_order=['L','M',"H"])

<seaborn.axisgrid.JointGrid at 0x1661df00488>

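[Figure: joint plot of VisITedResources (x) and raisedhands (y), colored by Class]

Separating middle from high performers is still difficult, but low and middle performers can be separated more clearly when the two features are viewed together in two dimensions.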

sns.pairplot(df, hue='Class', hue_order=['L','M','H'])

<seaborn.axisgrid.PairGrid at 0x1661e05be08>

[Figure: pair plot of the numerical features, colored by Class]

Categorical Data

sns.countplot(data=df, x='Class', order=['L', 'M', 'H']) 

<matplotlib.axes._subplots.AxesSubplot at 0x1661e960dc8>

[Figure: count plot of Class (L, M, H)]

sns.countplot(data=df, x='gender', hue='Class', hue_order=['L', 'M', 'H'])

# Compare grades by gender

<matplotlib.axes._subplots.AxesSubplot at 0x1661fbad688>

[Figure: count plot of gender, split by Class]

sns.countplot(data=df, x='NationalITy', hue='Class', hue_order=['L', 'M', 'H'])
plt.xticks(rotation=90)
plt.show()

# Compare grades by nationality

[Figure: count plot of NationalITy, split by Class]

sns.countplot(data=df, x='ParentAnsweringSurvey', hue='Class', hue_order=['L', 'M', 'H'])

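# Compare grades by whether the parent answered the survey
# Parent school satisfaction is likely to be tied to the grades themselves, so it seems better to leave it out of the features.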

<matplotlib.axes._subplots.AxesSubplot at 0x1661eac8788>

[Figure: count plot of ParentAnsweringSurvey, split by Class]

sns.countplot(data=df, x='Topic', hue='Class', hue_order=['L', 'M', 'H'])
plt.xticks(rotation=90)
plt.show()

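# Compare grades by topic, to see which topics are difficult

[Figure: count plot of Topic, split by Class]

Encoding the Categorical Target Class as Numbers

To compare proportions across groups, Low is mapped to -1, Middle to 0, and High to 1.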

df['Class_value'] = df['Class'].map(dict(L=-1, M=0, H=1))

# Select the column before averaging so that non-numeric columns are not aggregated
gb_gender = df.groupby('gender')['Class_value'].mean()
gb_gender

gender
F    0.291429
M   -0.118033
Name: Class_value, dtype: float64
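
With the -1 / 0 / 1 encoding, the group mean has a direct reading: it equals the share of H students minus the share of L students in that group, since M contributes zero. The sketch below checks this on a small made-up frame; `toy` and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up stand-in for df.
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Class":  ["H", "H", "M", "L", "H", "L", "L", "M"],
})
toy["Class_value"] = toy["Class"].map({"L": -1, "M": 0, "H": 1})

# Mean encoded score per gender, as in the notebook.
mean_score = toy.groupby("gender")["Class_value"].mean()

# Row-normalized class shares per gender.
shares = pd.crosstab(toy["gender"], toy["Class"], normalize="index")

# Share of H minus share of L gives the same numbers as the mean score.
share_diff = shares["H"] - shares["L"]
print(mean_score)
print(share_diff)
```

Read this way, the value of 0.29 for F above means the share of H students among girls exceeds the share of L students by about 29 percentage points, while for M the L share is larger by about 12 points.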

plt.bar(gb_gender.index, gb_gender)

<BarContainer object of 2 artists>

[Figure: bar chart of mean Class_value by gender]

gb_Topic = df.groupby('Topic')['Class_value'].mean().sort_values()
plt.barh(gb_Topic.index, gb_Topic)

<BarContainer object of 12 artists>

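[Figure: horizontal bar chart of mean Class_value by Topic, sorted in ascending order]

Data Preprocessing

Converting Categorical Data to One-Hot Vectors

# The models need numeric inputs, so each string category is encoded as 0/1 indicator columns.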

df.columns

Index(['gender', 'NationalITy', 'PlaceofBirth', 'StageID', 'GradeID',
       'SectionID', 'Topic', 'Semester', 'Relation', 'raisedhands',
       'VisITedResources', 'AnnouncementsView', 'Discussion',
       'ParentAnsweringSurvey', 'ParentschoolSatisfaction',
       'StudentAbsenceDays', 'Class', 'Class_value'],
      dtype='object')

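# Set drop_first=True to reduce multicollinearity
# Columns that are not listed in `columns` are passed through unchanged rather than removed, so a column that should be excluded must be dropped explicitly with drop().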

X = pd.get_dummies(df.drop(['ParentschoolSatisfaction', 'Class', 'Class_value'], axis=1),
                   columns=['gender', 'NationalITy', 'PlaceofBirth', 'StageID', 'GradeID',
                            'SectionID', 'Topic', 'Semester', 'Relation','ParentAnsweringSurvey',
                            'StudentAbsenceDays'], drop_first=True)

y = df['Class']

X.tail()

	raisedhands	VisITedResources	AnnouncementsView	Discussion	NationalITy_Jordan	...	Topic_History	Semester_S	StudentAbsenceDays_Under-7
475	5	4	5	8	1	...	0	1	0
476	50	77	14	28	1	...	0	0	1
477	55	74	25	29	1	...	0	1	1
478	30	17	14	57	1	...	1	0	0
479	35	14	23	62	1	...	1	1	0

5 rows × 59 columns
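
To make the effect of `drop_first=True` concrete, the following sketch encodes a tiny made-up frame with and without the option. The `toy` frame is illustrative only; `pd.get_dummies` is called the same way as in the notebook.

```python
import pandas as pd

# Made-up stand-in for df.
toy = pd.DataFrame({
    "Semester": ["F", "S", "F"],
    "raisedhands": [10, 50, 90],
    "Class": ["L", "M", "H"],
})
features = toy.drop(columns=["Class"])  # the target must be dropped explicitly

# Without drop_first: one indicator column per level.
full = pd.get_dummies(features, columns=["Semester"])

# With drop_first: the first level (F) is dropped and becomes the baseline.
reduced = pd.get_dummies(features, columns=["Semester"], drop_first=True)

print(list(full.columns))     # ['raisedhands', 'Semester_F', 'Semester_S']
print(list(reduced.columns))  # ['raisedhands', 'Semester_S']
```

Because Semester_F and Semester_S always sum to 1, keeping both adds no information for a linear model; dropping one removes that exact linear dependence. The numeric column raisedhands passes through untouched in both cases.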

Splitting into Training and Test Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
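
The split above is purely random. With three classes of unequal size, one optional variation is a stratified split, which keeps the class proportions the same in both parts. This is a sketch on synthetic data, not what the notebook ran; `X_toy` and `y_toy` are made up, and `stratify` is a standard parameter of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with an imbalanced three-class target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array(["L"] * 20 + ["M"] * 50 + ["H"] * 30)

# stratify=y_toy keeps the 20/50/30 class ratio in both train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = dict(zip(labels, counts))
print(test_counts)
```

Using a stratified split on the real data would change which rows land in the test set, so the scores reported below would not be reproduced exactly.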

Model Training and Evaluation

Training a Logistic Regression Model

from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_train, y_train)

C:\Users\dissi\anaconda31\lib\site-packages\sklearn\linear_model\_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
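
The ConvergenceWarning above suggests either raising max_iter or scaling the data. Since the features mix 0 to 100 counts with 0/1 indicators, scaling is a natural thing to try. The sketch below shows one way to do it with a pipeline on synthetic data; it is an alternative to the notebook's model, not the model whose results are reported, and `X_toy`, `y_toy` and `model_scaled` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one count-like feature (0-100) and one 0/1 indicator.
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.integers(0, 101, size=200),
    rng.integers(0, 2, size=200),
]).astype(float)
y_toy = np.where(X_toy[:, 0] > 66, "H", np.where(X_toy[:, 0] > 33, "M", "L"))

# Standardize every feature, then fit the same classifier as in the notebook.
model_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_scaled.fit(X_toy, y_toy)

train_acc = model_scaled.score(X_toy, y_toy)
print(train_acc)
```

Putting the scaler inside the pipeline means it is fitted only on the data passed to `fit`, so the test set does not leak into the scaling statistics.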

# Evaluation

from sklearn.metrics import classification_report
pred = model_lr.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           H       0.77      0.67      0.72        55
           L       0.78      0.76      0.77        33
           M       0.59      0.68      0.63        56

    accuracy                           0.69       144
   macro avg       0.72      0.70      0.71       144
weighted avg       0.70      0.69      0.70       144
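
In the report above, class M has the lowest precision, which matches the EDA finding that middle performers are hard to separate from their neighbors. A confusion matrix shows which classes are being mixed up. The sketch below uses made-up labels purely to illustrate the call; on the real data `y_true` and `y_pred` would be `y_test` and `pred`.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels.
y_true = ["L", "L", "M", "M", "M", "H", "H", "H"]
y_pred = ["L", "M", "M", "M", "H", "H", "H", "M"]

# Fix the label order so rows and columns read L, M, H.
labels = ["L", "M", "H"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true classes, columns are predicted classes.
print(cm)
# [[1 1 0]
#  [0 2 1]
#  [0 1 2]]
```

In this toy example every error involves M, either as the true class or as the prediction, while L and H are never confused with each other.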

Training an XGBoost Model

from xgboost import XGBClassifier
model_xgb = XGBClassifier()
model_xgb.fit(X_train, y_train)

C:\Users\dissi\anaconda31\lib\site-packages\xgboost\sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)


[21:42:35] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=None)
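
The UserWarning above asks for labels encoded as integers starting from 0. One way to follow that advice is scikit-learn's LabelEncoder, sketched below on made-up labels. The XGBoost calls appear only as comments, and `pred_enc` is a stand-in for model output, so that the example runs without xgboost installed.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for y_train.
y_toy = ["M", "L", "H", "M", "H"]

# LabelEncoder sorts the classes, so H -> 0, L -> 1, M -> 2.
encoder = LabelEncoder()
y_enc = encoder.fit_transform(y_toy)
print(list(encoder.classes_), list(y_enc))

# With the real data the encoded labels would be passed to the model, e.g.:
#   model_xgb.fit(X_train, encoder.fit_transform(y_train))
#   pred_enc = model_xgb.predict(X_test)

# Stand-in for integer predictions returned by the model.
pred_enc = [0, 2, 1]

# Map integer predictions back to the original class names for the report.
pred_labels = encoder.inverse_transform(pred_enc)
print(list(pred_labels))
```

The encoder's class order H, L, M is the same alphabetical order that `model_lr.classes_` shows later in the notebook.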

# Evaluation

pred = model_xgb.predict(X_test)
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           H       0.79      0.69      0.74        55
           L       0.85      0.85      0.85        33
           M       0.65      0.73      0.69        56

    accuracy                           0.74       144
   macro avg       0.76      0.76      0.76       144
weighted avg       0.75      0.74      0.74       144

Result Analysis

model_lr.classes_

array(['H', 'L', 'M'], dtype=object)

# Logistic regression results

plt.figure(figsize=(15, 10))
plt.bar(X.columns, model_lr.coef_[0, :])  # influence of each feature on class H
plt.xticks(rotation=90)
plt.show()

[Figure: bar chart of logistic regression coefficients for class H]

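According to the coefficients, students with fewer than 7 absence days, whose parents answered the survey, whose guardian is the mother, whose nationality or birthplace is Saudi Arabia, and who take Math tend to receive high grades.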
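
With 59 features a bar chart is hard to read, so it can help to sort the coefficients of one class and look at both ends of the list. The sketch below does this on synthetic data; `class_coefficients` is a hypothetical helper written for this example, not part of scikit-learn. Note also that the notebook's features were not standardized, so coefficients of 0 to 100 counts and of 0/1 indicators are not on the same scale and should be compared with care.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def class_coefficients(model, feature_names, class_label):
    """Return one class's coefficients as a Series sorted from most negative to most positive."""
    row = list(model.classes_).index(class_label)  # row of coef_ for this class
    return pd.Series(model.coef_[row], index=feature_names).sort_values()

# Synthetic stand-in: 'signal' drives the class, 'noise' does not.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_toy = np.where(X_toy["signal"] > 0.5, "H",
                 np.where(X_toy["signal"] < -0.5, "L", "M"))

model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

coef_h = class_coefficients(model, X_toy.columns, "H")
print(coef_h)
```

On the real model the call would be `class_coefficients(model_lr, X.columns, "H")`, and `.tail(10)` on the result would list the features pushing hardest toward class H.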

# Logistic regression results (factors that lower the grade)
plt.figure(figsize=(15, 10))
plt.bar(X.columns, model_lr.coef_[1, :]) # L Class
plt.xticks(rotation=90)
plt.show()

[Figure: bar chart of logistic regression coefficients for class L]

# XGBoost results

plt.figure(figsize=(15, 10))
plt.bar(X.columns, model_xgb.feature_importances_)
plt.xticks(rotation=90)
plt.show()

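[Figure: bar chart of XGBoost feature importances]

Having fewer than 7 absence days, having the mother as guardian, raising hands often, and checking announcements often rank as the most important features and go together with high grades.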
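
Tree-based feature importances show how much the model relies on a feature, but not the direction of its effect, and they are computed on the training data. A complementary check, not used in the original notebook, is permutation importance on the held-out set, available as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data and a logistic regression so that it runs without xgboost; the same call accepts any fitted scikit-learn compatible estimator.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: column 0 carries the signal, column 1 is pure noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
model = LogisticRegression().fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and measure the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
importances = result.importances_mean
print(importances)
```

A feature whose shuffling barely changes the test score contributes little to predictions, whatever its training-time importance says.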

Oh-Seung-Rok

Kaggle - Student's Academic Performance
