Details

Improvement

Status: Closed

Major

Resolution: Not A Problem

0.4

None
Description
For boolean data, the prefValue is always 1.0f, so the similarity arithmetic can be simplified.
For example:
1) DistributedEuclideanDistanceVectorSimilarity
package org.apache.mahout.math.hadoop.similarity.vector;

import org.apache.mahout.math.hadoop.similarity.Cooccurrence;

/**
 * distributed implementation of euclidean distance as vector similarity measure
 */
public class DistributedEuclideanDistanceVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
      double weightOfVectorB, int numberOfColumns) {
    double n = 0.0;
    double sumXYdiff2 = 0.0;
    for (Cooccurrence cooccurrence : cooccurrences) {
      double diff = cooccurrence.getValueA() - cooccurrence.getValueB();
      sumXYdiff2 += diff * diff;
      n++;
    }
    return n / (1.0 + Math.sqrt(sumXYdiff2));
  }
}
For boolean data every diff is 0, so this one always returns n (the number of cooccurrences).
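The claim can be checked with a minimal standalone sketch. It repeats the arithmetic of doComputeResult on two plain arrays instead of Mahout's Cooccurrence objects; the class name BooleanEuclideanCheck and the array-based signature are hypothetical and exist only for this illustration.

```java
// Hypothetical helper, not part of Mahout: same arithmetic as the
// Euclidean doComputeResult, applied to parallel arrays of cooccurring values.
final class BooleanEuclideanCheck {

    static double similarity(double[] valuesA, double[] valuesB) {
        double n = 0.0;
        double sumXYdiff2 = 0.0;
        for (int i = 0; i < valuesA.length; i++) {
            double diff = valuesA[i] - valuesB[i];
            sumXYdiff2 += diff * diff;
            n++;
        }
        return n / (1.0 + Math.sqrt(sumXYdiff2));
    }

    public static void main(String[] args) {
        // Boolean data: every cooccurring preference value is 1.0
        double[] a = {1.0, 1.0, 1.0};
        double[] b = {1.0, 1.0, 1.0};
        System.out.println(similarity(a, b)); // prints 3.0, the number of cooccurrences
    }
}
```

With all values equal to 1.0 the squared differences sum to 0, so the denominator is exactly 1.0 and the result is the cooccurrence count.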
2) DistributedUncenteredCosineVectorSimilarity
/**
 * distributed implementation of cosine similarity that does not center its data
 */
public class DistributedUncenteredCosineVectorSimilarity extends AbstractDistributedVectorSimilarity {

  @Override
  protected double doComputeResult(int rowA, int rowB, Iterable<Cooccurrence> cooccurrences, double weightOfVectorA,
      double weightOfVectorB, int numberOfColumns) {
    int n = 0;
    double sumXY = 0.0;
    double sumX2 = 0.0;
    double sumY2 = 0.0;
    for (Cooccurrence cooccurrence : cooccurrences) {
      double x = cooccurrence.getValueA();
      double y = cooccurrence.getValueB();
      sumXY += x * y;
      sumX2 += x * x;
      sumY2 += y * y;
      n++;
    }
    if (n == 0) {
      return Double.NaN;
    }
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      return Double.NaN;
    }
    return sumXY / denominator;
  }
}
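For boolean data sumXY, sumX2 and sumY2 all equal n, so this one always returns 1.0 (up to floating-point rounding) whenever there is at least one cooccurrence.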
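A similar standalone sketch shows the cosine case. It repeats the arithmetic of doComputeResult on plain arrays; the class name BooleanCosineCheck and its signature are hypothetical and not part of Mahout.

```java
// Hypothetical helper, not part of Mahout: same arithmetic as the
// uncentered cosine doComputeResult, applied to parallel arrays.
final class BooleanCosineCheck {

    static double similarity(double[] valuesA, double[] valuesB) {
        int n = 0;
        double sumXY = 0.0;
        double sumX2 = 0.0;
        double sumY2 = 0.0;
        for (int i = 0; i < valuesA.length; i++) {
            double x = valuesA[i];
            double y = valuesB[i];
            sumXY += x * y;
            sumX2 += x * x;
            sumY2 += y * y;
            n++;
        }
        if (n == 0) {
            return Double.NaN;
        }
        double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
        if (denominator == 0.0) {
            return Double.NaN;
        }
        return sumXY / denominator;
    }

    public static void main(String[] args) {
        // Boolean data: all three sums equal n = 4, so the result is 4 / (2 * 2)
        double[] a = {1.0, 1.0, 1.0, 1.0};
        double[] b = {1.0, 1.0, 1.0, 1.0};
        System.out.println(similarity(a, b)); // prints 1.0
    }
}
```

For counts that are not perfect squares, Math.sqrt(n) * Math.sqrt(n) may differ from n in the last bit, so the result can deviate from 1.0 by a rounding error; the measure still carries no information for boolean data.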
3) DistributedUncenteredZeroAssumingCosineVectorSimilarity
If n users like ItemA, m users like ItemB, and p users like both ItemA and ItemB,
DistributedUncenteredZeroAssumingCosineVectorSimilarity returns the cosine p / sqrt(m*n).
It can also be used for boolean data, but we could provide a simpler one that returns (p*p)/(m*n), the squared cosine, which needs no square root and so less computation.
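The proposal can be sketched from the three counts alone. This is a minimal illustration, assuming the zero-assuming cosine of two boolean vectors is the standard p / sqrt(n*m); the class BooleanCountSimilarity and its method names are hypothetical and not part of Mahout.

```java
// Hypothetical helper, not part of Mahout: similarity of two boolean
// item vectors computed from counts only.
// n = users who like ItemA, m = users who like ItemB, p = users who like both.
final class BooleanCountSimilarity {

    // Standard cosine of two 0/1 vectors (assumed form of the existing measure)
    static double cosine(long p, long n, long m) {
        if (n == 0 || m == 0) {
            return Double.NaN;
        }
        return p / Math.sqrt((double) n * m);
    }

    // Proposed simplification: (p*p)/(m*n), no square root needed
    static double squaredCosine(long p, long n, long m) {
        if (n == 0 || m == 0) {
            return Double.NaN;
        }
        return ((double) p * p) / ((double) n * m);
    }

    public static void main(String[] args) {
        System.out.println(cosine(3, 4, 9));        // prints 0.5
        System.out.println(squaredCosine(3, 4, 9)); // prints 0.25
    }
}
```

Because the cosine is non-negative for boolean data, squaring it preserves the ordering of item pairs, so the simplified value ranks neighbours the same way while its absolute values differ.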