Feature Selection
We can measure how valuable a feature X is in the data by computing its information gain. Features with low gain can then be dropped, trimming an oversized feature set.
```python
from pandas import *
from IPython.display import HTML, display
from tabulate import tabulate
from math import log
from sklearn.feature_selection import mutual_info_classif

# Helper: render a DataFrame as an HTML table
def table(df):
    display(HTML(tabulate(df, tablefmt='html', headers='keys', showindex=False)))
```
Let's load some sample data:
```python
df = read_csv('play.csv', sep=';')
table(df)
```
| outlook | temperature | humidity | windy | play |
|---|---|---|---|---|
| sunny | hot | high | False | no |
| sunny | hot | high | True | no |
| overcast | hot | high | False | yes |
| rainy | mild | high | False | yes |
| rainy | cool | normal | False | yes |
| rainy | cool | normal | True | no |
| overcast | cool | normal | True | yes |
| sunny | mild | high | False | no |
| sunny | cool | normal | False | yes |
| rainy | mild | normal | False | yes |
| sunny | mild | normal | True | yes |
| overcast | mild | high | True | yes |
| overcast | hot | normal | False | yes |
| rainy | mild | high | True | no |
Target Entropy
The entropy (diversity) of the target column:
E(T) = \sum_{i=1}^{n} -P_i \log_2 P_i
where P_i is the probability (relative frequency) of class i among the records.
```python
def findEntropy(column):
    rawGroups = df.groupby(column)
    # Count and relative frequency of each distinct value in the column
    targetGroups = [[key, len(data), len(data)/df[column].size] for key, data in rawGroups]
    targetGroups = DataFrame(targetGroups, columns=['value', 'count', 'probability'])
    # Entropy in bits, plus the summary table and the raw groups for reuse
    return sum([-x*log(x, 2) for x in targetGroups['probability']]), targetGroups, rawGroups

entropyTarget, groupTargets, _ = findEntropy('play')
table(groupTargets)
print('entropy target =', entropyTarget)
```
| value | count | probability |
|---|---|---|
| no | 5 | 0.357143 |
| yes | 9 | 0.642857 |
entropy target = 0.9402859586706309
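As a quick check, plugging the two class probabilities from the table above into the entropy formula gives the same value:

E(\text{play}) = -\tfrac{5}{14}\log_2\tfrac{5}{14} - \tfrac{9}{14}\log_2\tfrac{9}{14} \approx 0.940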
Gain
The gain of a feature X on data T:
\operatorname{Gain}(T, X) = E(T) - \sum_{v \in \operatorname{values}(X)} \frac{|T_{X=v}|}{|T|}\, E(T_{X=v})

where T_{X=v} is the subset of records with X = v.
```python
def findGain(column):
    # Entropy and value distribution of the feature itself
    entropyColumn, groupColumns, rawColumns = findEntropy(column)
    table(groupColumns)
    # Target entropy minus the weighted entropy of the target within each value group
    gain = entropyTarget - sum(
        len(data)/len(df) * sum(-x/len(data)*log(x/len(data), 2) for x in data.groupby('play').size())
        for key, data in rawColumns
    )
    print("gain of '%s': %f" % (column, gain))
    return gain

gains = [[x, findGain(x)] for x in ['outlook', 'temperature', 'humidity', 'windy']]
```
| value | count | probability |
|---|---|---|
| overcast | 4 | 0.285714 |
| rainy | 5 | 0.357143 |
| sunny | 5 | 0.357143 |
gain of 'outlook': 0.246750
| value | count | probability |
|---|---|---|
| cool | 4 | 0.285714 |
| hot | 4 | 0.285714 |
| mild | 6 | 0.428571 |
gain of 'temperature': 0.029223
| value | count | probability |
|---|---|---|
| high | 7 | 0.5 |
| normal | 7 | 0.5 |
gain of 'humidity': 0.151836
| value | count | probability |
|---|---|---|
| False | 8 | 0.571429 |
| True | 6 | 0.428571 |
gain of 'windy': 0.048127
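These numbers can be verified by hand against the formula above. For 'outlook', for instance, the overcast subset is pure (entropy 0), while the rainy and sunny subsets each split 3:2 between yes and no (entropy ≈ 0.971), so:

\operatorname{Gain}(T, \text{outlook}) = 0.940 - \left(\tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971 + \tfrac{5}{14}\cdot 0.971\right) \approx 0.247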
Overall Gain Score:
```python
result = DataFrame(gains, columns=["Feature", "Gain Score"]).sort_values("Gain Score", ascending=False)
table(result)
print("'%s' has the highest gain score while '%s' has the lowest" % (result.values[0, 0], result.values[-1, 0]))
```
| Feature | Gain Score |
|---|---|
| outlook | 0.24675 |
| humidity | 0.151836 |
| windy | 0.048127 |
| temperature | 0.0292226 |
'outlook' has the highest gain score while 'temperature' has the lowest
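The mutual_info_classif import at the top goes unused in the manual calculation; as a rough cross-check, a similar ranking can be reproduced with scikit-learn. The sketch below is not part of the original notebook and assumes the categorical columns are ordinal-encoded first; for fully discrete features, mutual information equals information gain, but scikit-learn reports it in nats, so the values are divided by log 2 to compare with the log2-based gains above.

```python
# Sketch (assumption, not in the original notebook): cross-check the manual
# gains with sklearn's mutual information estimate for discrete features.
from sklearn.preprocessing import OrdinalEncoder

features = ['outlook', 'temperature', 'humidity', 'windy']
X = OrdinalEncoder().fit_transform(df[features].astype(str))
mi = mutual_info_classif(X, df['play'], discrete_features=True)  # reported in nats
table(DataFrame({'Feature': features, 'Gain Score': mi / log(2)}))  # convert to bits
```

To actually trim the feature set, the lowest-scoring column can then be dropped, e.g. df = df.drop(columns=['temperature']).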