Data Distance
How similar two records are can be assessed by measuring the distance between them in the dataset.
Before doing that, let's pull a small sample of the data and label it.
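
    from pandas import DataFrame, read_csv
    import itertools
    import scipy.spatial.distance as spad

    # Load the first four specimens: the specimen number plus four shape features.
    columns = ['Specimen Number', 'Eccentricity', 'Aspect Ratio', 'Elongation', 'Solidity']
    df = read_csv('leaf.csv', nrows=4, usecols=columns)

    # Map specimen numbers 1-4 to the labels A-D (this relies on 'Specimen Number'
    # being the first selected column); each feature is doubled and rounded to 2 decimals.
    labels = ["", "A", "B", "C", "D"]
    data = [[labels[int(x[0])]] + [round(i * 2, 2) for i in x[1:]]
            for x in df.values.tolist()]
    df = DataFrame(data, columns=['Class'] + columns[1:])
    df
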
| | Class | Eccentricity | Aspect Ratio | Elongation | Solidity |
|---|---|---|---|---|---|
| 0 | A | 1.45 | 2.95 | 0.65 | 1.97 |
| 1 | B | 1.48 | 3.05 | 0.72 | 1.96 |
| 2 | C | 1.53 | 3.15 | 0.78 | 1.96 |
| 3 | D | 1.48 | 2.92 | 0.71 | 1.95 |
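If `leaf.csv` is not at hand, the four labelled rows shown above can be rebuilt by hand so that the distance calculations can still be tried out. This is only a sketch: the numbers are copied from the table, and the names `sample_rows`, `feature_names` and `sample_df` are made up for illustration.

```python
from pandas import DataFrame

# Values copied from the sample table (already doubled and rounded).
# `sample_rows` is a hypothetical name, not part of the original notebook.
sample_rows = [
    ["A", 1.45, 2.95, 0.65, 1.97],
    ["B", 1.48, 3.05, 0.72, 1.96],
    ["C", 1.53, 3.15, 0.78, 1.96],
    ["D", 1.48, 2.92, 0.71, 1.95],
]
feature_names = ["Eccentricity", "Aspect Ratio", "Elongation", "Solidity"]

# Same layout as the table: a 'Class' label followed by the four features.
sample_df = DataFrame(sample_rows, columns=["Class"] + feature_names)
print(sample_df)
```

Each inner list has the same shape as a row of `data` (label first, then the features), so `sample_rows` could stand in for `data` in the distance calculations.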
Minkowski Distance
The Minkowski distance is the spatial distance between two records (x and y), where m is a real-valued parameter and n is the number of dimensions (attributes) of the entity.
d_{\operatorname{minkowski}} = \left(\sum_{i=1}^{n}|x_{i}-y_{i}|^{m}\right)^{\frac{1}{m}},m\geq 1
Special cases:
- If m = 1, it is called the Manhattan (city block) distance.
d_{\operatorname{manhattan}} = \sum_{i=1}^{n}|x_{i}-y_{i}|
- If m = 2, it is called the Euclidean distance.
d_{\operatorname{euclidean}} = \sqrt{\sum_{i=1}^{n}|x_{i}-y_{i}|^{2}}
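To make the formula concrete, here is a minimal pure-Python sketch of it, applied to records A and B from the sample. The function name `minkowski_distance` is chosen for illustration only; the notebook itself relies on `scipy.spatial.distance`.

```python
def minkowski_distance(x, y, m):
    """Minkowski distance between two equal-length sequences, for m >= 1."""
    if m < 1:
        raise ValueError("m must be >= 1")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sum |x_i - y_i|^m over all dimensions, then take the m-th root.
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

# Feature values of records A and B, copied from the sample table.
a = [1.45, 2.95, 0.65, 1.97]
b = [1.48, 3.05, 0.72, 1.96]

manhattan = minkowski_distance(a, b, 1)  # special case m = 1
euclidean = minkowski_distance(a, b, 2)  # special case m = 2
print(round(manhattan, 6), round(euclidean, 6))
```

For this pair the results should come out at roughly 0.21 for m = 1 and 0.1261 for m = 2.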
The illustration shows the city block (dashed line) and Euclidean (straight line) computations visually:

    # Compute the three distances for every pair of labelled records.
    columns = ['v1-v2', 'Manhattan distance (M=1)', 'Euclidean distance (M=2)', 'Minkowski distance at M=3']
    data2 = [(
        "{} - {}".format(a[0], b[0]),
        spad.cityblock(a[1:], b[1:]),
        spad.euclidean(a[1:], b[1:]),
        spad.minkowski(a[1:], b[1:], 3),
    ) for a, b in itertools.combinations(data, 2)]
    DataFrame(data2, columns=columns)

| | v1-v2 | Manhattan distance (M=1) | Euclidean distance (M=2) | Minkowski distance at M=3 |
|---|---|---|---|---|
| 0 | A - B | 0.21 | 0.126095 | 0.111091 |
| 1 | A - C | 0.42 | 0.251794 | 0.220426 |
| 2 | A - D | 0.14 | 0.076158 | 0.065265 |
| 3 | B - C | 0.21 | 0.126886 | 0.110275 |
| 4 | B - D | 0.15 | 0.130767 | 0.130039 |
| 5 | C - D | 0.36 | 0.245764 | 0.232918 |
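The same pairwise distances can also be obtained without an explicit loop over pairs. The sketch below uses `scipy.spatial.distance.pdist` and `squareform`; the feature values are copied from the sample table, and the variable names (`features`, `manhattan`, `euclidean`, `minkowski3`) are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows A, B, C, D of the sample table (features only, no label).
features = np.array([
    [1.45, 2.95, 0.65, 1.97],
    [1.48, 3.05, 0.72, 1.96],
    [1.53, 3.15, 0.78, 1.96],
    [1.48, 2.92, 0.71, 1.95],
])

# pdist returns a condensed vector in the same pair order as
# itertools.combinations: (A,B), (A,C), (A,D), (B,C), (B,D), (C,D).
manhattan = pdist(features, metric="cityblock")
euclidean = pdist(features, metric="euclidean")
minkowski3 = pdist(features, metric="minkowski", p=3)

# squareform expands the condensed vector into a symmetric 4x4 matrix.
print(squareform(euclidean).round(6))
```

The condensed form is compact for storage, while the square matrix is convenient when a distance has to be looked up by row and column index.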
Average Distance

    # Column labels follow what is actually computed here: the first column is the
    # Manhattan (cityblock) distance and the last two are both Euclidean distances.
    columns = ['v1-v2', 'Manhattan distance (M=1)', 'Euclidean distance (M=2)', 'Euclidean distance (M=2), repeated']
    data2 = [(
        "{} - {}".format(a[0], b[0]),
        spad.cityblock(a[1:], b[1:]),
        spad.euclidean(a[1:], b[1:]),
        spad.euclidean(a[1:], b[1:]),
    ) for a, b in itertools.combinations(data, 2)]
    DataFrame(data2, columns=columns)

| | v1-v2 | Manhattan distance (M=1) | Euclidean distance (M=2) | Euclidean distance (M=2), repeated |
|---|---|---|---|---|
| 0 | A - B | 0.21 | 0.126095 | 0.126095 |
| 1 | A - C | 0.42 | 0.251794 | 0.251794 |
| 2 | A - D | 0.14 | 0.076158 | 0.076158 |
| 3 | B - C | 0.21 | 0.126886 | 0.126886 |
| 4 | B - D | 0.15 | 0.130767 | 0.130767 |
| 5 | C - D | 0.36 | 0.245764 | 0.245764 |
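No formula for the average distance is given in this section. One definition found in the similarity-measure literature treats it as a Euclidean distance normalised by the number of dimensions, d_ave = sqrt((1/n) * sum_i (x_i - y_i)^2). The sketch below assumes that definition; it is not taken from the notebook, and the function name `average_distance` is hypothetical.

```python
import math

def average_distance(x, y):
    """ASSUMED definition: square root of the mean squared coordinate difference."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    squared_diffs = [(a - b) ** 2 for a, b in zip(x, y)]
    return math.sqrt(sum(squared_diffs) / len(x))

# Feature values of records A and B, copied from the sample table.
a = [1.45, 2.95, 0.65, 1.97]
b = [1.48, 3.05, 0.72, 1.96]

d_ave = average_distance(a, b)
# Under this definition d_ave equals the Euclidean distance divided by sqrt(n).
d_euclid = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
print(d_ave, d_euclid / math.sqrt(len(a)))
```

Under this assumption the average distance differs from the Euclidean distance only by the constant factor 1/sqrt(n), so for a fixed number of features it ranks pairs in the same order.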