

Feature Selection

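We can measure how valuable a feature X is in the data by computing its gain (information gain) with respect to the target. Features with a low gain tell us little about the target, so when there are too many features, the lowest-scoring ones can be dropped.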

from math import log

from IPython.display import HTML, display
from pandas import DataFrame, read_csv
from sklearn.feature_selection import mutual_info_classif
from tabulate import tabulate

def table(df):
    # Render a DataFrame as an HTML table in the notebook, without the index column.
    display(HTML(tabulate(df, tablefmt='html', headers='keys', showindex=False)))

Let's load some sample data:

df = read_csv('play.csv', sep=';')
table(df)

outlook	temperature	humidity	windy	play
sunny	hot	high	False	no
sunny	hot	high	True	no
overcast	hot	high	False	yes
rainy	mild	high	False	yes
rainy	cool	normal	False	yes
rainy	cool	normal	True	no
overcast	cool	normal	True	yes
sunny	mild	high	False	no
sunny	cool	normal	False	yes
rainy	mild	normal	False	yes
sunny	mild	normal	True	yes
overcast	mild	high	True	yes
overcast	hot	normal	False	yes
rainy	mild	high	True	no
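The file `play.csv` is not included here, so the following is a minimal sketch for rebuilding the same 14 records from the table above. The names `CSV_TEXT` and `df_inline` are hypothetical; assigning the result to `df` should let the remaining cells run unchanged, assuming the file uses `;` as its separator as in the `read_csv` call.

```python
from io import StringIO

import pandas as pd

# The 14 records shown in the table above, separated by ';' like play.csv.
CSV_TEXT = """outlook;temperature;humidity;windy;play
sunny;hot;high;False;no
sunny;hot;high;True;no
overcast;hot;high;False;yes
rainy;mild;high;False;yes
rainy;cool;normal;False;yes
rainy;cool;normal;True;no
overcast;cool;normal;True;yes
sunny;mild;high;False;no
sunny;cool;normal;False;yes
rainy;mild;normal;False;yes
sunny;mild;normal;True;yes
overcast;mild;high;True;yes
overcast;hot;normal;False;yes
rainy;mild;high;True;no
"""

# Parse the text exactly as a file would be parsed.
df_inline = pd.read_csv(StringIO(CSV_TEXT), sep=";")
print(df_inline.shape)
```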

Target Entropy

Entropy (diversity) of the target column:

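$E(T) = -\sum_{i=1}^n P_i \log_2 P_i$

where $P_i$ is the proportion of records whose target takes its $i$-th value, and $n$ is the number of distinct target values. The logarithm is base 2, so the entropy is measured in bits.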

def findEntropy(column):
    # Group the global df by the column, then record each value's count and probability.
    rawGroups = df.groupby(column)
    targetGroups = [[key, len(data), len(data)/len(df)] for key,data in rawGroups]
    targetGroups = DataFrame(targetGroups, columns=['value', 'count', 'probability'])
    # Return the entropy in bits, the summary table, and the groupby object.
    return sum(-p*log(p,2) for p in targetGroups['probability']), targetGroups, rawGroups

entropyTarget, groupTargets, _ = findEntropy('play')
table(groupTargets)
print('entropy target =', entropyTarget)

value	count	probability
no	5	0.357143
yes	9	0.642857

entropy target = 0.9402859586706309
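As a check by hand, plugging the two probabilities from the table above into the entropy formula, with base-2 logarithms, gives the same value:

$E(T) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14} \approx 0.5305 + 0.4098 = 0.9403$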

Gain

The gain of a feature $X$ on data $T$:

$\operatorname{Gain}(T, X) = E(T) - \sum_{v \in \operatorname{values}(X)} \frac{|T_{X,v}|}{|T|} E(T_{X,v})$

where $T_{X,v}$ is the subset of records in $T$ for which $X = v$.
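As a worked example, take $X$ = outlook. Counting rows in the sample table, sunny has 2 yes and 3 no, overcast has 4 yes and 0 no, and rainy has 3 yes and 2 no. With base-2 logarithms:

$E(T_{\text{outlook},\,\text{sunny}}) = E(T_{\text{outlook},\,\text{rainy}}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.9710$

$E(T_{\text{outlook},\,\text{overcast}}) = 0$

$\operatorname{Gain}(T, \text{outlook}) \approx 0.9403 - \frac{5}{14}(0.9710) - \frac{4}{14}(0) - \frac{5}{14}(0.9710) \approx 0.2467$

The overcast subset is pure (all yes), so it contributes nothing to the weighted sum.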

def findGain(column):
    # Value distribution of the feature itself (shown for reference) and its groups.
    _, groupValues, rawGroups = findEntropy(column)
    table(groupValues)
    # Weighted average of the target entropy within each value of the feature.
    remainder = sum(len(data)/len(df)*sum(-x/len(data)*log(x/len(data),2)
                    for x in data.groupby('play').size()) for key,data in rawGroups)
    gain = entropyTarget - remainder
    print("gain of '%s': %f" % (column, gain))
    return gain

gains = [[x,findGain(x)] for x in ['outlook','temperature','humidity','windy']]

value	count	probability
overcast	4	0.285714
rainy	5	0.357143
sunny	5	0.357143

gain of 'outlook': 0.246750

value	count	probability
cool	4	0.285714
hot	4	0.285714
mild	6	0.428571

gain of 'temperature': 0.029223

value	count	probability
high	7	0.5
normal	7	0.5

gain of 'humidity': 0.151836

value	count	probability
False	8	0.571429
True	6	0.428571

gain of 'windy': 0.048127
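The gains above can be cross-checked with a version that does not depend on the global `df` or on pandas. This is only a sketch using the standard library; `ROWS`, `COLUMNS`, `entropy`, and `information_gain` are hypothetical names introduced for illustration, and the records are copied from the sample table.

```python
from collections import Counter, defaultdict
from math import log2

# The 14 records from the sample table: (outlook, temperature, humidity, windy, play).
COLUMNS = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature_idx, target_idx=-1):
    """Target entropy minus the weighted target entropy within each feature value."""
    targets = [row[target_idx] for row in rows]
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[feature_idx]].append(row[target_idx])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(targets) - remainder

for idx, name in enumerate(COLUMNS[:-1]):
    print(f"{name}: {information_gain(ROWS, idx):.6f}")
```

Passing the rows in as an argument, rather than reading a global, makes the function easy to reuse on a subset of the data, for example when computing gains recursively inside a decision tree.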

Overall Gain Score:

result = DataFrame(gains, columns=["Feature", "Gain Score"]).sort_values("Gain Score", ascending=False)
table(result)

print("'%s' has the highest gain score, while '%s' has the lowest" % (result.values[0,0], result.values[-1,0]))

Feature	Gain Score
outlook	0.24675
humidity	0.151836
windy	0.048127
temperature	0.0292226

'outlook' has the highest gain score, while 'temperature' has the lowest
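To actually reduce the number of features, one simple option is to keep only the highest-scoring ones. The sketch below assumes a hypothetical helper `select_top_k` and an arbitrary choice of `k=2`; the scores are copied from the table above, and a sensible cutoff depends on the data and the model.

```python
def select_top_k(scores, k):
    """Return the names of the k features with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Gain scores copied from the result table above.
gain_scores = {
    "outlook": 0.246750,
    "humidity": 0.151836,
    "windy": 0.048127,
    "temperature": 0.029223,
}

# k=2 is an illustrative choice, not a recommendation.
selected = select_top_k(gain_scores, k=2)
print(selected)  # ['outlook', 'humidity']
```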