[Machine Learning] Summary of Common Vector Space Metrics

Last updated: 2023/4/5 Fixed the display issue of mathematical symbols in Markdown
This article was first published on 若绾. Please indicate the source if reproduced.

Introduction#

When it comes to machine learning and data science, vector space distance is a very important concept. In many machine learning applications, data points are often represented in vector form. Therefore, it is crucial to understand how to calculate and compare distances between vectors. Vector space distance can be used to solve many problems, such as clustering, classification, dimensionality reduction, and more. In federated learning projects, vector space distance is particularly important because it helps us compare vectors from different devices or data sources to determine if they are similar enough for joint training.

This article will introduce the basic concepts of vector space distance, including Euclidean distance, Manhattan distance, Chebyshev distance, and more. We will discuss how these distance metrics are calculated, their advantages and disadvantages, and when to use each distance metric. We will also introduce some more advanced distance metric methods, such as Mahalanobis distance and cosine similarity, and explore their applicability in different scenarios. Hopefully, this article will help you better understand vector space distance and how to apply it to your federated learning projects.

Manhattan Distance#

Manhattan distance (L1 norm) is a method for measuring the distance between two vectors X and Y. Its formula is as follows:

$D_{M}(X,Y) = \sum_{i=1}^{n}|x_{i} - y_{i}|$

where:

$D_{M}(X,Y) $ represents the Manhattan distance between vectors X and Y.
$x_{i}$ and $y_{i}$ represent the ith component of vectors X and Y, respectively.
$n$ is the number of components in the vectors.

Manhattan distance measures the distance between two vectors by summing the absolute differences between their components. It is named after the grid-like layout of Manhattan streets, where the distance between two points is the sum of their horizontal and vertical distances. Compared to Euclidean distance, Manhattan distance is more suitable for applications in high-dimensional spaces because in high-dimensional spaces, the distance between two points is easier to calculate using horizontal and vertical distances. Manhattan distance is also widely used in machine learning and data science, for example, in clustering and classification problems, as well as in feature extraction in image and speech recognition.

Canberra Distance#

Canberra distance is a distance metric method used to measure the similarity between two vectors. It is commonly used in data analysis and information retrieval. Its formula is as follows:

$D_{c}(X,Y) = \sum_{i=1}^{n}\frac{|x_{i} - y_{i}|}{|x_{i}| + |y_{i}|}$

where:

$D_{c}(X,Y)$ represents the Canberra distance between vectors X and Y.
$x_{i}$ and $y_{i}$ represent the ith component of vectors X and Y, respectively.
$n$ is the number of components in the vectors.

Canberra distance is a distance metric that takes into account the magnitude of vector components and is suitable for situations where the magnitude of components is important, such as analyzing gene expression data. The difference between Canberra distance and other distance metrics is that it calculates the distance by taking the ratio of the absolute difference between components and the sum of their absolute values. When there are many zero-value components in the vector, the denominator may become very small, which can lead to distance instability. Canberra distance is also widely used in machine learning and data science, such as in classification, clustering, recommendation systems, and information retrieval.

Euclidean Distance#

Euclidean distance is a common method for measuring the distance between two vectors, and it is widely used in various tasks in the field of machine learning and data science, such as clustering, classification, and regression.

Euclidean distance can be used to calculate the distance between two vectors. Let's assume we have two vectors X and Y, with equal lengths, i.e., each vector contains n components. Then, the Euclidean distance between them can be calculated using the following formula:

$D_{E}(X,Y) = \sqrt{\sum_{i=1}^{n}(x_{i} - y_{i})^2}$

where:

$D_{E}(X,Y)$ represents the Euclidean distance between vectors X and Y.
$x_{i}$ and $y_{i}$ represent the ith component of vectors X and Y, respectively.
$n$ is the number of components in the vectors.

The calculation of Euclidean distance involves summing the squares of the differences between the components of the two vectors and taking the square root of the sum. It is named after the ancient Greek mathematician Euclid and is widely used in plane geometry as well. In machine learning and data science, Euclidean distance is often used to calculate the similarity or distance between two samples, as it helps us identify samples that are close to each other in feature space.

Standardized Euclidean Distance#

Standardized Euclidean distance is a distance metric method used to measure the distance between two vectors, taking into account the variability of vector components. It is commonly used in cases where the components have different measurement units and scales, and the variability of components is important.

The formula for calculating standardized Euclidean distance is as follows:

$D_{SE}(X,Y) = \sqrt{\sum_{i=1}^{n}\frac{(x_{i} - y_{i})^2}{s_{i}^2}}$

where:

$D_{SE}(X,Y)$ represents the standardized Euclidean distance between vectors X and Y.
$x_{i}$ and $y_{i}$ represent the ith component of vectors X and Y, respectively.
$n$ is the number of components in the vectors.
$s_{i}$ is the standard deviation of the ith component in the vectors.

Standardized Euclidean distance takes into account the variability of vector components and is usually applicable in cases where the components have different measurement units and scales, as well as when the variability of components is important. It is similar to Euclidean distance in terms of calculation, but when calculating the difference between components, it is normalized by dividing it by the standard deviation of the component.

Squared Euclidean Distance#

Squared Euclidean distance is a distance metric method used to calculate the distance between two vectors. Its calculation formula is as follows:

$D_{SE}^2(X,Y) = \sum_{i=1}^{n}(x_{i} - y_{i})^2$

where:

$D_{SE}^2(X,Y)$ represents the squared Euclidean distance between vectors X and Y.
$x_{i}$ and $y_{i}$ represent the ith component of vectors X and Y, respectively.
$n$ is the number of components in the vectors.

Squared Euclidean distance measures the distance between two vectors by summing the squares of the differences between their components. Compared to Euclidean distance, squared Euclidean distance avoids the square root operation on the sum, making the calculation more efficient. Squared Euclidean distance is also widely used in machine learning and data science, such as in clustering, classification, regression, and other tasks.

It is important to note that squared Euclidean distance ignores the scale and units of the components, so in some cases, it may not be suitable for measuring the distance between vectors. In such cases, other distance metric methods, such as standardized Euclidean distance, Manhattan distance, Chebyshev distance, etc., can be used.

Cosine Similarity#

Cosine similarity is a metric method used to measure the similarity between two vectors. Its calculation formula is as follows:

$cos(\theta) = \frac{X \cdot Y}{\left| X\right| \left| Y\right|} = \frac{\sum\limits_{i=1}^{n} x_i y_i}{\sqrt{\sum\limits_{i=1}^{n} x_i^2} \sqrt{\sum\limits_{i=1}^{n} y_i^2}}$

where:

$cos(\theta)$ represents the cosine similarity between vectors X and Y.
$X \cdot Y$ represents the dot product of vectors X and Y.
$\left| X\right|$ and $\left| Y\right|$ represent the magnitudes of vectors X and Y, respectively.
$x_i$ and $y_i$ represent the ith component of vectors X and Y, respectively.
$n$ is the number of components in the vectors.

Cosine similarity is a similarity metric that takes into account the angle between vectors. In fields such as natural language processing and information retrieval, cosine similarity is often used to measure the similarity between texts, as texts can be represented as vectors and the magnitudes of the components are not important. Compared to other distance metric methods, cosine similarity calculation is simpler and has good performance. For example, when two vectors are exactly the same, the cosine similarity is 1; when the angle between two vectors is 90 degrees, the cosine similarity is 0; when the two vectors have opposite directions, the cosine similarity is -1.

It is important to note that cosine similarity does not consider the magnitude differences between vector components, so it may not be suitable for some datasets.

Chebyshev Distance#

Chebyshev distance is a distance metric method used to measure the distance between two vectors. Its calculation involves finding the maximum absolute difference between the components of the two vectors. Let's assume we have two vectors X and Y, with equal lengths, i.e., each vector contains n components. Then, the Chebyshev distance between them can be calculated using the following formula:

$D_{C}(X,Y) = max_{i=1}^{n}|x_{i} - y_{i}|$

where:

$D_{C}(X,Y)$ represents the Chebyshev distance between vectors X and Y.
$x_{i}$ and $y_{i}$ represent the ith component of vectors X and Y, respectively.
$n$ is the number of components in the vectors.

Chebyshev distance is often used to measure the distance between vectors. Its calculation is similar to Manhattan distance, but it takes the absolute difference with the maximum value as the difference between components, which allows it to better capture the differences between vectors compared to Manhattan distance. Chebyshev distance is also widely used in machine learning and data science, such as in image processing, signal processing, time series analysis, and more.

It is important to note that Chebyshev distance may be affected by outliers, as it is based on the absolute difference between components. Therefore, in the presence of outliers, Chebyshev distance may give inaccurate distance measurements.

Mahalanobis Distance#

Mahalanobis distance is a distance metric method used to measure the distance between two vectors, taking into account the correlations between the components. Let's assume we have two vectors X and Y, with equal lengths, i.e., each vector contains n components. Then, the Mahalanobis distance between them can be calculated using the following formula:

$D_{M}(X,Y) = \sqrt{(X-Y)^T S^{-1} (X-Y)}$

where:

$D_{M}(X,Y)$ represents the Mahalanobis distance between vectors X and Y.
$X$ and $Y$ are two vectors of length n.
$S$ is the covariance matrix of size $n \times n$.

The calculation formula of Mahalanobis distance is similar to Euclidean distance, but it takes into account the correlations between the components. If the covariance matrix is the identity matrix, then Mahalanobis distance is equivalent to Euclidean distance. Compared to Euclidean distance, Mahalanobis distance can better capture the correlations between the components, so it has been widely used in various fields where the correlations between the components need to be considered, such as financial risk management, speech recognition, image recognition, and more.

It is important to note that Mahalanobis distance requires the components to follow a multivariate normal distribution, and the covariance matrix needs to be positive definite. If the components do not satisfy these conditions, Mahalanobis distance may give inaccurate distance measurements. In addition, Mahalanobis distance is also affected by the estimation error of the covariance matrix. In practical applications, we need to estimate the covariance matrix using sample data, so the size and quality of the samples can affect the accuracy of Mahalanobis distance.