Feature Selection

We can measure how valuable a feature X is in the data using its information gain (Feature Gain). Features with a low gain can then be dropped when there are too many of them.

from pandas import read_csv, DataFrame
from IPython.display import HTML, display
from tabulate import tabulate
from math import log
from sklearn.feature_selection import mutual_info_classif

def table(df):
    # Render a DataFrame as an HTML table in the notebook
    display(HTML(tabulate(df, tablefmt='html', headers='keys', showindex=False)))

Let's load some sample data:

df = read_csv('play.csv', sep=';')
table(df)
outlook temperature humidity windy play
sunny hot high False no
sunny hot high True no
overcast hot high False yes
rainy mild high False yes
rainy cool normal False yes
rainy cool normal True no
overcast cool normal True yes
sunny mild high False no
sunny cool normal False yes
rainy mild normal False yes
sunny mild normal True yes
overcast mild high True yes
overcast hot normal False yes
rainy mild high True no

Target Entropy

Entropy (how mixed the values are) of the target column:

E(T) = -\sum_{i=1}^{n} P_i \log_2 P_i

where P_i is the proportion of records in which the target takes the i-th of its n distinct values.

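def findEntropy(column):
    # Group the records by each distinct value of the column
    rawGroups = df.groupby(column)
    # For each value: its record count and its share of all records
    targetGroups = [[key, len(data), len(data)/len(df)] for key, data in rawGroups]
    targetGroups = DataFrame(targetGroups, columns=['value', 'count', 'probability'])
    # Shannon entropy in bits (log base 2), plus the summary table and the raw groups
    return sum(-p*log(p, 2) for p in targetGroups['probability']), targetGroups, rawGroups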

entropyTarget, groupTargets, _ = findEntropy('play')
table(groupTargets)
print('entropy target =', entropyTarget)
value count probability
no 5 0.357143
yes 9 0.642857
entropy target = 0.9402859586706309
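
The entropy formula can also be checked without pandas. The following is a minimal sketch using only the standard library; the function name `entropy` and the list `play` are illustrative and not part of the notebook. The list just reproduces the 5 "no" and 9 "yes" counts shown above.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    counts = Counter(values)
    total = len(values)
    # Counter only holds labels that occur, so log2(0) never arises
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Same class counts as the 'play' column: 5 x "no", 9 x "yes"
play = ["no"] * 5 + ["yes"] * 9
print(round(entropy(play), 6))  # 0.940286
```

A column with a single value has entropy 0, and a column split evenly between two values has entropy 1 bit, so the target here is fairly mixed.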

Gain

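The gain of a feature X with respect to data T:

\operatorname{Gain}(T, X) = E(T) - \sum_{v \in \operatorname{values}(X)} \frac{|T_{X,v}|}{|T|} E(T_{X,v})

where T_{X,v} is the subset of records in T for which X = v, and E(T_{X,v}) is the target entropy within that subset.
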
def findGain(column):
    # Value distribution of the feature, and the records grouped by feature value
    _, groupValues, rawGroups = findEntropy(column)
    table(groupValues)
    # Weighted average of the 'play' entropy within each feature value
    remainder = sum(len(data)/len(df) * sum(-n/len(data)*log(n/len(data), 2)
                    for n in data.groupby('play').size()) for _, data in rawGroups)
    gain = entropyTarget - remainder
    print("gain of '%s': %f" % (column, gain))
    return gain

gains = [[x,findGain(x)] for x in ['outlook','temperature','humidity','windy']]
value count probability
overcast 4 0.285714
rainy 5 0.357143
sunny 5 0.357143
gain of 'outlook': 0.246750
value count probability
cool 4 0.285714
hot 4 0.285714
mild 6 0.428571
gain of 'temperature': 0.029223
value count probability
high 7 0.5
normal 7 0.5
gain of 'humidity': 0.151836
value count probability
False 8 0.571429
True 6 0.428571
gain of 'windy': 0.048127
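
To make the gain formula concrete, here is a self-contained sketch that recomputes the gain of 'outlook' from plain lists, without the global `df`. The names `entropy` and `information_gain` are hypothetical helpers introduced for illustration; the two lists are copied row by row from the sample table above.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, target_values):
    """Target entropy minus the weighted target entropy within each feature value."""
    groups = defaultdict(list)
    for value, target in zip(feature_values, target_values):
        groups[value].append(target)
    total = len(target_values)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(target_values) - remainder

# 'outlook' and 'play' columns, in the same row order as the sample table
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print("%.6f" % information_gain(outlook, play))  # 0.246750
```

The 'overcast' group contains only "yes" records, so it contributes zero entropy to the weighted sum, which is a large part of why 'outlook' scores well.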

Overall Gain Score:

result = DataFrame(gains, columns=["Feature", "Gain Score"]).sort_values("Gain Score", ascending=False)
table(result)

print("'%s' has the highest gain score while '%s' has the lowest" % (result.values[0,0], result.values[-1,0]))
Feature Gain Score
outlook 0.24675
humidity 0.151836
windy 0.048127
temperature 0.0292226
'outlook' has the highest gain score while 'temperature' has the lowest
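
Once the features are ranked, reducing the feature set amounts to keeping the highest-scoring ones. The sketch below shows one possible way to do this; the helper `select_top_k` and the choice of k = 2 are assumptions made for illustration, and the scores are copied from the output above rather than recomputed.

```python
# Gain scores as reported above (rounded to six decimals)
gain_scores = {
    "outlook": 0.246750,
    "temperature": 0.029223,
    "humidity": 0.151836,
    "windy": 0.048127,
}

def select_top_k(scores, k):
    """Return the k feature names with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# k = 2 is an arbitrary choice for this example
selected = select_top_k(gain_scores, 2)
print(selected)  # ['outlook', 'humidity']
```

A fixed k is only one option. A minimum gain threshold would work just as well, and the right cutoff depends on the data and the model that will consume the features.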