Pearson, Kendall and Spearman: three correlation analysis methods
I won't copy the formulas here. The usual guidance is:
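Use the Pearson product-moment correlation coefficient when:
the two continuous variables are linearly related;
the data are normally distributed.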
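Use the Spearman and Kendall correlation coefficients when:
the data are ordered categorical values, or the values are clearly non-normal or their distribution is unknown. The computation first sorts the discrete data, or ranks the interval-scale values.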
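To make the guidance above concrete, here is a minimal sketch comparing the two kinds of coefficient on data that is strictly increasing but not linear. The class name MonotonicDemo and its helper names are made up for this illustration and are not part of the code in this post; the ranks here count upward from the smallest value.

```java
import java.util.Arrays;

// Illustrative sketch only: class and method names are hypothetical.
final class MonotonicDemo {

    // Pearson product-moment correlation of two equal-length arrays.
    static double pearson(double[] x, double[] y) {
        double xMean = Arrays.stream(x).average().orElse(Double.NaN);
        double yMean = Arrays.stream(y).average().orElse(Double.NaN);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - xMean;
            double dy = y[i] - yMean;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Average ranks (1 = smallest); tied values share the mean of their positions.
    static double[] ranks(double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int smaller = 0, equal = 0;
            for (double other : v) {
                if (other < v[i]) {
                    smaller++;
                } else if (other == v[i]) {
                    equal++; // counts v[i] itself as well
                }
            }
            r[i] = smaller + (equal + 1) / 2.0;
        }
        return r;
    }

    // Spearman is Pearson applied to the ranks instead of the raw values.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = Math.exp(x[i]); // strictly increasing, but far from linear
        }
        System.out.println("pearson  = " + pearson(x, y));
        System.out.println("spearman = " + spearman(x, y));
    }
}
```

On this data the rank-based coefficient comes out at 1 because the ordering of y matches the ordering of x exactly, while Pearson stays noticeably below 1 because the relationship is not a straight line.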
Straight to the code:
/**
 * Pearson product-moment correlation coefficient.
 * Returns 0 for empty or mismatched input, and NaN when either array is constant.
 */
public static double pearson(double x[], double y[]) {
    int size = x.length;
    if (size == 0 || size != y.length) {
        return 0;
    } else {
        double xavg = avg(x);
        double yavg = avg(y);
        double numerator = 0.0;    // sum of cross-deviations
        double denominatorx = 0.0; // sum of squared deviations of x
        double denominatory = 0.0; // sum of squared deviations of y
        for (int i = 0; i < size; i++) {
            numerator += (x[i] - xavg) * (y[i] - yavg);
            denominatorx += Math.pow((x[i] - xavg), 2);
            denominatory += Math.pow((y[i] - yavg), 2);
        }
        return numerator / Math.sqrt(denominatorx * denominatory);
    }
}
/** Arithmetic mean; NaN for an empty array. */
public static double avg(double x[]) {
    double sum = 0;
    for (int i = 0; i < x.length; i++) {
        sum += x[i];
    }
    return sum / x.length;
}
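/**
 * Spearman's rho: the Pearson correlation of the ranks.
 * Each array is ranked on its own, so the input does not have to be sorted by x.
 */
public static double spearman(double x[], double y[]) {
    double rx[] = getRank(x);
    double ry[] = getRank(y);
    return pearson(rx, ry);
}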
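/** Average ranks, where rank 1 is the largest value; tied values share the mean rank. */
public static double[] getRank(double x[]) {
    double r[] = new double[x.length];
    for (int i = 0; i < x.length; i++) {
        // cnt_gt: values greater than x[i]; cnt_eq: values equal to x[i], itself included
        int cnt_gt = 0, cnt_eq = 0;
        for (int j = 0; j < x.length; j++) {
            if (x[i] < x[j]) {
                cnt_gt++;
            }
            if (x[i] == x[j]) {
                cnt_eq++;
            }
        }
        // 2.0 forces floating-point division, so tied values get fractional ranks
        r[i] = cnt_gt + (1 + cnt_eq) / 2.0;
    }
    return r;
}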
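/**
 * Kendall's tau in its simplest form. Requires the pairs to be sorted by x in
 * ascending order (see sortAsX) and assumes there are no tied values.
 */
public static double kendall(double x[], double y[]) {
    long n = x.length;
    if (n < 2 || n != y.length) {
        return 0; // unusable input, same convention as pearson()
    }
    long cnt_bt = 0; // concordant pairs: y increases along with x
    for (int i = 0; i < y.length; i++) {
        for (int j = i + 1; j < y.length; j++) { // visit each pair once, never i == j
            if (y[i] < y[j]) {
                cnt_bt++;
            }
        }
    }
    // With P concordant pairs and no ties: tau = 4P / (n(n-1)) - 1
    return 4.0 * cnt_bt / (n * (n - 1)) - 1;
}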
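/**
 * Sorts the (x, y) pairs by x in ascending order, in place.
 * Needs java.util.ArrayList, Collections, Comparator and List. Point is not
 * shown in this post; it is assumed to be a small helper class with public
 * double fields x and y (java.awt.Point stores ints and would not work).
 */
public static void sortAsX(double x[], double y[]) {
    List<Point> list = new ArrayList<Point>();
    for (int i = 0; i < x.length; i++) {
        list.add(new Point(x[i], y[i]));
    }
    Collections.sort(list, new Comparator<Point>() {
        @Override
        public int compare(Point p1, Point p2) {
            return Double.compare(p1.x, p2.x);
        }
    });
    for (int i = 0; i < x.length; i++) {
        Point p = list.get(i);
        x[i] = p.x;
        y[i] = p.y;
    }
}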
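/**
 * Dispatches on type: 1 = Pearson, 2 = Spearman, anything else = Kendall.
 * Note that sortAsX reorders the caller's arrays in place.
 */
public static double calcCorr(double x[], double y[], int type) {
    if (type == 1) {
        return pearson(x, y);
    } else if (type == 2) {
        sortAsX(x, y);
        return spearman(x, y);
    } else {
        sortAsX(x, y);
        return kendall(x, y);
    }
}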
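Of course, the code ignores many edge cases (missing values, for example), and the Kendall coefficient is computed in the simplest, most direct way, so treat all of this as a reference only.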
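As one example of what the simple Kendall version leaves out, the sketch below computes the tie-corrected variant usually called tau-b. It compares every pair directly, so it does not need the data sorted by x first. This is a different technique from the function above, not a rewrite of it; the class name KendallTauB and the method name tauB are made up for this illustration, and the inputs are assumed to be finite numbers with no NaN.

```java
// Illustrative sketch only: class and method names are hypothetical.
final class KendallTauB {

    // Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)), where C and D are
    // the concordant and discordant pair counts, n0 = n(n-1)/2 is the number of
    // pairs, n1 the pairs tied in x and n2 the pairs tied in y.
    // Returns NaN for unusable input or when one variable is constant.
    static double tauB(double[] x, double[] y) {
        int n = x.length;
        if (n < 2 || n != y.length) {
            return Double.NaN;
        }
        long concordant = 0, discordant = 0, tiedX = 0, tiedY = 0;
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = x[j] - x[i];
                double dy = y[j] - y[i];
                if (dx == 0) {
                    tiedX++;
                }
                if (dy == 0) {
                    tiedY++;
                }
                if (dx != 0 && dy != 0) {
                    // Same direction in x and y means the pair is concordant.
                    if ((dx > 0) == (dy > 0)) {
                        concordant++;
                    } else {
                        discordant++;
                    }
                }
            }
        }
        long totalPairs = (long) n * (n - 1) / 2;
        double denominator = Math.sqrt((double) (totalPairs - tiedX) * (double) (totalPairs - tiedY));
        return (concordant - discordant) / denominator;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 2, 3};
        double[] y = {1, 2, 3, 3};
        System.out.println("tau-b = " + tauB(x, y));
    }
}
```

When there are no ties, the denominator reduces to n(n-1)/2 and the result equals 4P / (n(n-1)) - 1, the formula used by the simple version. Like that version, this one is O(n^2) in the number of observations.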
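Excel, R, SPSS and similar tools can all compute these coefficients, and porting the code above to other languages such as C, C++ or Python is straightforward.
In R, the call is: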
cor(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
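The use argument in the R call controls how missing values are treated; use = "complete.obs", for instance, drops every observation that has a missing value. The Java code in this post has no such step. A minimal sketch of one, assuming missing values are encoded as Double.NaN (an assumption made for this example, as are the names CompletePairs and dropMissing), might look like this:

```java
import java.util.Arrays;

// Illustrative sketch only: class and method names are hypothetical.
final class CompletePairs {

    // Keeps only the positions where both x[i] and y[i] are present (not NaN).
    // Returns a two-row array: row 0 is the filtered x, row 1 the filtered y.
    static double[][] dropMissing(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double[] keptX = new double[x.length];
        double[] keptY = new double[y.length];
        int kept = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                keptX[kept] = x[i];
                keptY[kept] = y[i];
                kept++;
            }
        }
        // Trim the buffers down to the number of complete pairs.
        return new double[][] {Arrays.copyOf(keptX, kept), Arrays.copyOf(keptY, kept)};
    }

    public static void main(String[] args) {
        double[] x = {1, Double.NaN, 3, 4};
        double[] y = {2, 5, Double.NaN, 8};
        double[][] clean = dropMissing(x, y);
        System.out.println("complete pairs kept: " + clean[0].length);
    }
}
```

The two filtered arrays can then be passed to whichever correlation function is needed. The inputs are left untouched, unlike sortAsX, which reorders its arguments in place.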