[Transcription] Porto Seguro Exploratory Analysis and Prediction - not finished yet

have a good time 2021. 10. 31. 15:33

Analysis packages

# Code 1

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

pd.set_option('display.max_columns', 100)

Error: cannot import name 'Imputer' from 'sklearn.preprocessing'

Fix:

Instead of from sklearn.preprocessing import Imputer,

use from sklearn.impute import SimpleImputer. (Imputer was deprecated in scikit-learn 0.20 and removed in 0.22.)

Reference:

https://log-laboratory.tistory.com/328
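
As a minimal sketch of how the replacement import described above might be used: the toy array is made up, and treating -1 as the missing-value marker is an assumption based on the -1 entries visible in this dataset. The mean strategy is only one of several options.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration); -1 is assumed to mark a missing value.
X = np.array([[0.7, 10.0],
              [-1.0, 11.0],
              [0.9, -1.0]])

# Replace every -1 with the mean of the remaining values in the same column.
imputer = SimpleImputer(missing_values=-1, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Unlike the old Imputer, SimpleImputer takes no axis argument and always works column by column.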


# Code 2

trainset = pd.read_csv('Kaggle/train.csv')
testset = pd.read_csv('Kaggle/test.csv')

Read the CSV data files.

# Code 3

trainset.head()

Print the first few rows of trainset.

	id	target	ps_ind_01	ps_ind_02_cat	ps_ind_03	ps_ind_04_cat	ps_ind_05_cat	ps_ind_06_bin	ps_ind_07_bin	ps_ind_08_bin	ps_ind_09_bin	ps_ind_10_bin	ps_ind_11_bin	ps_ind_12_bin	ps_ind_13_bin	ps_ind_14	ps_ind_15	ps_ind_16_bin	ps_ind_17_bin	ps_ind_18_bin	ps_reg_01	ps_reg_02	ps_reg_03	ps_car_01_cat	ps_car_02_cat	ps_car_03_cat	ps_car_04_cat	ps_car_05_cat	ps_car_06_cat	ps_car_07_cat	ps_car_08_cat	ps_car_09_cat	ps_car_10_cat	ps_car_11_cat	ps_car_11	ps_car_12	ps_car_13	ps_car_14	ps_car_15	ps_calc_01	ps_calc_02	ps_calc_03	ps_calc_04	ps_calc_05	ps_calc_06	ps_calc_07	ps_calc_08	ps_calc_09	ps_calc_10	ps_calc_11	ps_calc_12	ps_calc_13	ps_calc_14	ps_calc_15_bin	ps_calc_16_bin	ps_calc_17_bin	ps_calc_18_bin	ps_calc_19_bin	ps_calc_20_bin
0	7	0	2	2	5	1	0	0	1	0	0	0	0	0	0	0	11	0	1	0	0.7	0.2	0.718070	10	1	-1	0	1	4	1	0	0	1	12	2	0.400000	0.883679	0.370810	3.605551	0.6	0.5	0.2	3	1	10	1	10	1	5	9	1	5	8	0	1	1	0	0	1
1	9	0	1	1	7	0	0	0	0	1	0	0	0	0	0	0	3	0	0	1	0.8	0.4	0.766078	11	1	-1	0	-1	11	1	1	2	1	19	3	0.316228	0.618817	0.388716	2.449490	0.3	0.1	0.3	2	1	9	5	8	1	7	3	1	1	9	0	1	1	0	1	0
2	13	0	5	4	9	1	0	0	0	1	0	0	0	0	0	0	12	1	0	0	0.0	0.0	-1.000000	7	1	-1	0	-1	14	1	1	2	1	60	1	0.316228	0.641586	0.347275	3.316625	0.5	0.7	0.1	2	2	9	1	8	2	7	4	2	7	7	0	1	1	0	1	0
3	16	0	0	1	2	0	0	1	0	0	0	0	0	0	0	0	8	1	0	0	0.9	0.2	0.580948	7	1	0	0	1	11	1	1	3	1	104	1	0.374166	0.542949	0.294958	2.000000	0.6	0.9	0.1	2	4	7	1	8	4	2	2	2	4	9	0	0	0	0	0	0
4	17	0	0	2	0	1	0	1	0	0	0	0	0	0	0	0	9	1	0	0	0.7	0.6	0.840759	11	1	-1	0	-1	14	1	1	2	1	82	3	0.316070	0.565832	0.365103	2.000000	0.4	0.6	0.0	2	2	6	3	10	2	12	3	1	1	3	0	0	0	1	1	0

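Looking at the data,

there are cat values such as ps_car_08_cat.

These are categorical variables stored as integers in the range 0 to n (a value of -1, as in ps_car_03_cat above, marks a missing entry in this dataset).

In pandas, the category dtype is meant for data where a limited set of values repeats, for example gender (male, female) or age group (teens, twenties, ...).

Reference: https://computer-science-student.tistory.com/302

There are also

bin values such as ps_calc_19_bin.

These are binary variables that take the value 0 or 1.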
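
The naming convention described above can be sketched as follows. The small DataFrame is made up and only mimics the column names. Converting to the pandas category dtype is optional and shown just to illustrate the dtype mentioned above; the notebook being transcribed keeps these columns as integers.

```python
import pandas as pd

# Tiny made-up frame that only mimics the column naming scheme.
df = pd.DataFrame({
    "ps_ind_02_cat": [2, 1, 4, 1],
    "ps_ind_06_bin": [0, 0, 1, 1],
    "ps_reg_01": [0.7, 0.8, 0.0, 0.9],
})

# Pick columns by their name suffix.
cat_cols = [c for c in df.columns if c.endswith("_cat")]
bin_cols = [c for c in df.columns if c.endswith("_bin")]

# Optionally store categorical columns with the pandas 'category' dtype.
for c in cat_cols:
    df[c] = df[c].astype("category")

print(cat_cols, bin_cols)
print(df.dtypes)
```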

# Code 4

print("Train dataset (rows, cols):",trainset.shape, "\nTest dataset (rows, cols):",testset.shape)

Train dataset (rows, cols): (595212, 59)

Test dataset (rows, cols): (892816, 58)

Check the number of rows and columns in trainset and testset.

testset has 58 columns, one fewer than trainset, because it has no target column.

Let's verify this.

# Code 5

print("Columns in train and not in test dataset:",set(trainset.columns)-set(testset.columns))

Columns in train and not in test dataset: {'target'}

Introduction of metadata

Metadata to be used:

use: input, ID, target
type: nominal, interval, ordinal, binary
preserve: True or False
dataType: int, float, char
category: ind, reg, car, calc

# Code 6

# uses code from https://www.kaggle.com/bertcarremans/data-preparation-exploration (see references)
data = []
for feature in trainset.columns:
    # Defining the role
    if feature == 'target':
        use = 'target'
    elif feature == 'id':
        use = 'id'
    else:
        use = 'input'
         
    # Defining the type
    if 'bin' in feature or feature == 'target':
        type = 'binary'
    elif 'cat' in feature or feature == 'id':
        type = 'categorical'
    elif trainset[feature].dtype == float or isinstance(trainset[feature].dtype, float):
        type = 'real'
    elif trainset[feature].dtype == int:
        type = 'integer'
        
    # Initialize preserve to True for all variables except for id
    preserve = True
    if feature == 'id':
        preserve = False
    
    # Defining the data type 
    dtype = trainset[feature].dtype
    
    category = 'none'
    # Defining the category
    if 'ind' in feature:
        category = 'individual'
    elif 'reg' in feature:
        category = 'registration'
    elif 'car' in feature:
        category = 'car'
    elif 'calc' in feature:
        category = 'calculated'
    
    
    # Creating a Dict that contains all the metadata for the variable
    feature_dictionary = {
        'varname': feature,
        'use': use,
        'type': type,
        'preserve': preserve,
        'dtype': dtype,
        'category' : category
    }
    data.append(feature_dictionary)
    
metadata = pd.DataFrame(data, columns=['varname', 'use', 'type', 'preserve', 'dtype', 'category'])
metadata.set_index('varname', inplace=True)
metadata

Code explanation

for feature in trainset.columns:

-> Loops over the columns of trainset, calling each column name feature.

The columns include

id, target, ps_ind_01, ps_ind_02_cat, ps_ind_03, ps_ind_04_cat, ps_ind_05_cat, ps_ind_06_bin, ps_ind_07_bin, ps_ind_08_bin, ps_ind_09_bin,

and so on.

if feature == 'target': use = 'target'

-> If feature is 'target', the variable use is set to 'target'; this value later fills the use column of the metadata table.

if 'bin' in feature or feature == 'target': type = 'binary'

-> If the feature name contains the string bin (e.g. ps_ind_06_bin),

or the feature is target,

the value stored in the type column is binary.

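elif trainset[feature].dtype == float or

-> If the dtype of this trainset feature is float, or

isinstance(trainset[feature].dtype, float):

-> if the dtype object of this trainset feature is an instance of float

(isinstance(instance, class/data type): returns True if the instance is of the given class or data type, otherwise False. Since a dtype object is never itself a float instance, this second check is always False here, and the dtype == float comparison is the one that actually matches.)

Reference:

https://devpouch.tistory.com/87

type = 'real'

-> then the value stored in the type column is real.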
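
To see how the two float checks above behave, here is a small sketch on a made-up Series (the values just imitate ps_reg_01). The pd.api.types.is_float_dtype helper is shown as one possible alternative and is not part of the code being transcribed.

```python
import pandas as pd

s = pd.Series([0.7, 0.8, 0.0])  # float64, like ps_reg_01

# A dtype compares equal to the Python type it corresponds to ...
print(s.dtype == float)
# ... but the dtype object itself is not an instance of float.
print(isinstance(s.dtype, float))
# pandas also ships a helper that covers every float width.
print(pd.api.types.is_float_dtype(s))
```

The first and third lines print True and the second prints False, so only the == comparison contributes to the elif condition.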

Running print(trainset.info())

shows information about trainset:

#   Column          Non-Null Count   Dtype
---  ------          --------------   -----
0   id              595212 non-null  int64
1   target          595212 non-null  int64
2   ps_ind_01       595212 non-null  int64
3   ps_ind_02_cat   595212 non-null  int64
4   ps_ind_03       595212 non-null  int64
5   ps_ind_04_cat   595212 non-null  int64
6   ps_ind_05_cat   595212 non-null  int64
7   ps_ind_06_bin   595212 non-null  int64
8   ps_ind_07_bin   595212 non-null  int64
9   ps_ind_08_bin   595212 non-null  int64
10  ps_ind_09_bin   595212 non-null  int64
11  ps_ind_10_bin   595212 non-null  int64
12  ps_ind_11_bin   595212 non-null  int64
13  ps_ind_12_bin   595212 non-null  int64
14  ps_ind_13_bin   595212 non-null  int64
15  ps_ind_14       595212 non-null  int64
16  ps_ind_15       595212 non-null  int64
17  ps_ind_16_bin   595212 non-null  int64
18  ps_ind_17_bin   595212 non-null  int64
19  ps_ind_18_bin   595212 non-null  int64
20  ps_reg_01       595212 non-null  float64
21  ps_reg_02       595212 non-null  float64
22  ps_reg_03       595212 non-null  float64
23  ps_car_01_cat   595212 non-null  int64
24  ps_car_02_cat   595212 non-null  int64
25  ps_car_03_cat   595212 non-null  int64
26  ps_car_04_cat   595212 non-null  int64
27  ps_car_05_cat   595212 non-null  int64
28  ps_car_06_cat   595212 non-null  int64
29  ps_car_07_cat   595212 non-null  int64
30  ps_car_08_cat   595212 non-null  int64
31  ps_car_09_cat   595212 non-null  int64
32  ps_car_10_cat   595212 non-null  int64
33  ps_car_11_cat   595212 non-null  int64
34  ps_car_11       595212 non-null  int64

This shows the data type of each column.

So a float column such as ps_reg_01 gets real in the type column.

dtype = trainset[feature].dtype

-> The metadata DataFrame built at the end of the code

records each feature's data type in its dtype column.

category = 'none'
# Defining the category
if 'ind' in feature:
    category = 'individual'

-> category is first initialized to none for every feature;

if the feature name contains the string ind (for example ps_ind_05_cat),

category = individual
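
As an alternative sketch of the same idea, the group can be read from the second token of the column name instead of using substring tests. The helper feature_category and the GROUP_LABELS table are hypothetical names introduced only for this illustration; the labels themselves are the ones used in the code above.

```python
# Mapping from the group token in the column name to a readable label.
GROUP_LABELS = {
    "ind": "individual",
    "reg": "registration",
    "car": "car",
    "calc": "calculated",
}

def feature_category(feature):
    """Return the group label for a column name such as 'ps_ind_05_cat'."""
    parts = feature.split("_")
    # Names look like ps_<group>_<number>[_cat|_bin]; id/target have no group.
    group = parts[1] if len(parts) > 1 else None
    return GROUP_LABELS.get(group, "none")

for name in ["ps_ind_05_cat", "ps_reg_03", "ps_car_11", "ps_calc_19_bin", "id", "target"]:
    print(name, "->", feature_category(name))
```

Columns without a group token, such as id and target, keep the default value none, just as in the original loop.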

feature_dictionary = {
    'varname': feature,
    'use': use,
    'type': type,
    'preserve': preserve,
    'dtype': dtype,
    'category' : category
}
data.append(feature_dictionary)

-> Each feature name (id, target, ps_ind_01, ps_ind_02_cat, ps_ind_03, etc.)

is stored under the varname key.

metadata = pd.DataFrame(data, columns=['varname', 'use', 'type', 'preserve', 'dtype', 'category'])
metadata.set_index('varname', inplace=True)

-> A DataFrame named metadata is built from data,

and metadata.set_index('varname', inplace=True)

makes varname the index of the table.

In the resulting table, varname appears on the left,

with id, target, ps_ind_01, ... listed downward as row labels and the use, type, preserve, dtype, and category values shown beside each.
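
A minimal sketch of why indexing by varname is convenient: once the variable name is the index, a row can be looked up by name and the table can be filtered to get lists of variable names. The three rows below are hand-written stand-ins for the loop output, with the dtype and category columns left out for brevity.

```python
import pandas as pd

# Hand-written rows standing in for the output of the loop above.
data = [
    {"varname": "id", "use": "id", "type": "categorical", "preserve": False},
    {"varname": "ps_ind_02_cat", "use": "input", "type": "categorical", "preserve": True},
    {"varname": "ps_reg_01", "use": "input", "type": "real", "preserve": True},
]
metadata = pd.DataFrame(data, columns=["varname", "use", "type", "preserve"])
metadata.set_index("varname", inplace=True)

# Row lookup by variable name.
print(metadata.loc["ps_reg_01", "type"])

# Names of all preserved categorical variables.
is_cat = metadata["type"] == "categorical"
cat_vars = metadata[is_cat & metadata["preserve"]].index.tolist()
print(cat_vars)
```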

Reference: https://www.kaggle.com/gpreda/porto-seguro-exploratory-analysis-and-prediction/notebook
