Feature Selection
We can measure how valuable a feature X is in a dataset through its feature gain (information gain). Features with a low gain can then be dropped when a dataset has too many of them.
from math import log

from IPython.display import HTML, display
from pandas import DataFrame, read_csv
from sklearn.feature_selection import mutual_info_classif  # not used in this section
from tabulate import tabulate


def table(df):
    # Render a DataFrame as an HTML table in the notebook, hiding the index column
    display(HTML(tabulate(df, tablefmt='html', headers='keys', showindex=False)))
Let's load some sample data:
df = read_csv('play.csv', sep=';')
table(df)
outlook | temperature | humidity | windy | play |
---|---|---|---|---|
sunny | hot | high | False | no |
sunny | hot | high | True | no |
overcast | hot | high | False | yes |
rainy | mild | high | False | yes |
rainy | cool | normal | False | yes |
rainy | cool | normal | True | no |
overcast | cool | normal | True | yes |
sunny | mild | high | False | no |
sunny | cool | normal | False | yes |
rainy | mild | normal | False | yes |
sunny | mild | normal | True | yes |
overcast | mild | high | True | yes |
overcast | hot | normal | False | yes |
rainy | mild | high | True | no |
Target Entropy
The entropy (diversity) of the target column:
E(T) = \sum_{i=1}^{n} -P_i \log_2 P_i
where P_i is the proportion of records in which the i-th target value occurs, and n is the number of distinct target values.
def findEntropy(column):
    # Group the rows of the global DataFrame `df` by the distinct values in `column`
    rawGroups = df.groupby(column)
    # For each value: its count and its share of all rows
    targetGroups = [[key, len(data), len(data) / len(df)] for key, data in rawGroups]
    targetGroups = DataFrame(targetGroups, columns=['value', 'count', 'probability'])
    # Shannon entropy in bits (base-2 logarithm)
    entropy = sum(-p * log(p, 2) for p in targetGroups['probability'])
    return entropy, targetGroups, rawGroups


entropyTarget, groupTargets, _ = findEntropy('play')
table(groupTargets)
print('entropy target =', entropyTarget)
value | count | probability |
---|---|---|
no | 5 | 0.357143 |
yes | 9 | 0.642857 |
entropy target = 0.9402859586706309
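As a cross-check, the same entropy can be computed without pandas. The sketch below is only an illustration: the function name `entropy` and the list `play` are introduced here, with `play` copied row by row from the sample table's `play` column.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


# The 'play' column of the sample table, in row order: 9 'yes' and 5 'no'.
play = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
        'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

print(round(entropy(play), 6))  # 0.940286
```

A column with a single distinct value has entropy 0, and a 50/50 split between two values has entropy 1 bit, so 0.94 indicates a target that is fairly close to balanced.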
Gain
The gain of a feature X for data T:
\operatorname{Gain}(T, X) = E(T) - \sum_{v \in \operatorname{values}(X)} \frac{|T_{X,v}|}{|T|} E(T_{X,v})
where T_{X,v} is the subset of records in T for which X takes the value v.
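To make the formula concrete, here is a worked example for the `outlook` feature, using base-2 logarithms and the counts from the sample table: sunny has 2 yes and 3 no, overcast has 4 yes and 0 no, and rainy has 3 yes and 2 no. The intermediate values are rounded, so the last digit is approximate.

```latex
\begin{aligned}
E(T_{\text{outlook},\,\text{sunny}}) &= -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.9710 \\
E(T_{\text{outlook},\,\text{overcast}}) &= -\tfrac{4}{4}\log_2\tfrac{4}{4} = 0 \\
E(T_{\text{outlook},\,\text{rainy}}) &= -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.9710 \\
\operatorname{Gain}(T, \text{outlook}) &\approx 0.9403 - \left(\tfrac{5}{14}\cdot 0.9710 + \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.9710\right) \\
&\approx 0.9403 - 0.6935 = 0.2468
\end{aligned}
```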
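def findGain(column):
    # Value counts and row groups for this feature (its own entropy is not needed)
    _, groupValues, rawGroups = findEntropy(column)
    table(groupValues)
    # Weighted average of the target entropy inside each group of rows
    remainder = sum(
        len(data) / len(df)
        * sum(-n / len(data) * log(n / len(data), 2) for n in data.groupby('play').size())
        for key, data in rawGroups
    )
    gain = entropyTarget - remainder
    # "gain dari" is Indonesian for "gain of"
    print("gain dari '%s': %f" % (column, gain))
    return gain


gains = [[x, findGain(x)] for x in ['outlook', 'temperature', 'humidity', 'windy']]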
value | count | probability |
---|---|---|
overcast | 4 | 0.285714 |
rainy | 5 | 0.357143 |
sunny | 5 | 0.357143 |
gain dari 'outlook': 0.246750
value | count | probability |
---|---|---|
cool | 4 | 0.285714 |
hot | 4 | 0.285714 |
mild | 6 | 0.428571 |
gain dari 'temperature': 0.029223
value | count | probability |
---|---|---|
high | 7 | 0.5 |
normal | 7 | 0.5 |
gain dari 'humidity': 0.151836
value | count | probability |
---|---|---|
False | 8 | 0.571429 |
True | 6 | 0.428571 |
gain dari 'windy': 0.048127
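The four gains can also be reproduced with the standard library alone, which makes each step of the calculation easier to follow. This is a minimal sketch, not the notebook's own code: the names `entropy`, `information_gain`, `COLUMNS`, `DATA` and `rows` are introduced here, and `DATA` is the sample table typed in by hand.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Target entropy minus the weighted target entropy after splitting on `feature`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    all_labels = [row[target] for row in rows]
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(all_labels) - remainder


COLUMNS = ['outlook', 'temperature', 'humidity', 'windy', 'play']
# The 14 sample records, one per line, in the same order as the table.
DATA = """\
sunny hot high False no
sunny hot high True no
overcast hot high False yes
rainy mild high False yes
rainy cool normal False yes
rainy cool normal True no
overcast cool normal True yes
sunny mild high False no
sunny cool normal False yes
rainy mild normal False yes
sunny mild normal True yes
overcast mild high True yes
overcast hot normal False yes
rainy mild high True no
"""
rows = [dict(zip(COLUMNS, line.split())) for line in DATA.splitlines()]

for feature in COLUMNS[:-1]:
    print(feature, round(information_gain(rows, feature, 'play'), 6))
```

The printed values should agree with the gains reported above to six decimal places.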
Overall Gain Score:
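# Rank the features from highest to lowest gain
result = DataFrame(gains, columns=["Feature", "Gain Score"]).sort_values("Gain Score", ascending=False)
table(result)
# The message is Indonesian for: "'%s' has the highest gain score while '%s' has the lowest"
print("'%s' mempunyai gain score tertinggi sedangkan '%s' terendah" % (result.values[0, 0], result.values[-1, 0]))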
Feature | Gain Score |
---|---|
outlook | 0.24675 |
humidity | 0.151836 |
windy | 0.048127 |
temperature | 0.0292226 |
'outlook' mempunyai gain score tertinggi sedangkan 'temperature' terendah
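Once the features are ranked, the selection step itself is a simple filter. The sketch below shows one possible way to do it. The helper `select_features` and its parameters `k` and `min_gain` are hypothetical names introduced for illustration, and the cut-off values in the usage lines are arbitrary assumptions, not recommendations from the source.

```python
# Gain scores copied from the ranking table above.
gain_scores = {
    'outlook': 0.24675,
    'humidity': 0.151836,
    'windy': 0.048127,
    'temperature': 0.0292226,
}


def select_features(scores, k=None, min_gain=0.0):
    """Rank features by gain, keep those at or above `min_gain`, then the top `k`."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = [name for name in ranked if scores[name] >= min_gain]
    return kept if k is None else kept[:k]


print(select_features(gain_scores, k=2))            # ['outlook', 'humidity']
print(select_features(gain_scores, min_gain=0.04))  # ['outlook', 'humidity', 'windy']
```

Whether to cut by a fixed number of features or by a minimum gain depends on the dataset. With only four features, as here, the ranking is mostly useful for showing that `temperature` contributes the least.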