Measuring Distances in High Dimensional Spaces: Why Average Group Vector Comparisons Exhibit Bias, and What to Do about it

Breanna Green, William Hobbs, Sofia Avila, Pedro L. Rodriguez, Arthur Spirling, Brandon Michael Stewart

Research output: Contribution to journal › Article › peer-review

Abstract

Analysts often seek to compare representations in high-dimensional space, e.g., embedding vectors of the same word across groups. We show that the distance measures calculated in such cases can exhibit considerable statistical bias that stems from uncertainty in the estimation of the elements of those vectors. This problem applies to Euclidean distance, cosine similarity, and other similar measures. After illustrating the severity of this problem for text-as-data applications, we provide and validate a bias correction for the squared Euclidean distance. This same correction also substantially reduces bias in ordinary Euclidean distance and cosine similarity estimates, but corrections for these measures are not quite unbiased and are (non-intuitively) bimodal when distances are close to zero. The estimators require obtaining the variance of the latent positions. We (will) implement the estimator in free software, and we offer recommendations for related work.
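
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator or software. It assumes a simplified setting in which each group's vector is the mean of independent Gaussian draws, and it applies the standard plug-in correction that subtracts the estimated sampling variance of each group mean from the squared Euclidean distance. The function names (`naive_sq_dist`, `corrected_sq_dist`, `simulate`) and all parameter values are hypothetical choices for illustration.

```python
import numpy as np


def naive_sq_dist(x, y):
    """Squared Euclidean distance between the two estimated group mean vectors."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    return float(diff @ diff)


def corrected_sq_dist(x, y):
    """Naive estimate minus the estimated sampling variance of each group mean.

    Assumption: rows are independent draws, so the variance of a group mean in
    one dimension is (sample variance) / (number of rows).
    """
    var_x = x.var(axis=0, ddof=1).sum() / x.shape[0]
    var_y = y.var(axis=0, ddof=1).sum() / y.shape[0]
    return naive_sq_dist(x, y) - float(var_x + var_y)


def simulate(n_reps=500, n=50, dim=300, true_gap=0.0, seed=0):
    """Average both estimators over repeated samples from two groups.

    The true squared distance between the group means is true_gap ** 2.
    """
    rng = np.random.default_rng(seed)
    mu_x = np.zeros(dim)
    mu_y = np.zeros(dim)
    mu_y[0] = true_gap  # the groups differ only in the first coordinate
    naive, corrected = [], []
    for _ in range(n_reps):
        x = rng.normal(mu_x, 1.0, size=(n, dim))
        y = rng.normal(mu_y, 1.0, size=(n, dim))
        naive.append(naive_sq_dist(x, y))
        corrected.append(corrected_sq_dist(x, y))
    return float(np.mean(naive)), float(np.mean(corrected))


if __name__ == "__main__":
    naive_mean, corrected_mean = simulate()
    # With identical groups the true squared distance is 0, yet the naive
    # estimate has expectation dim * (1/n + 1/n) = 300 * 2 / 50 = 12.
    print(f"naive mean:     {naive_mean:.3f}")
    print(f"corrected mean: {corrected_mean:.3f}")
```

In this setup the naive estimate is inflated by the summed variance of the two mean vectors, which grows with the number of dimensions, while the corrected estimate averages close to the true value. Individual corrected estimates can be negative when the true distance is near zero, which is one reason taking a square root afterwards does not yield an exactly unbiased Euclidean distance.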

Original language: American English
Journal: Political Analysis
State: Accepted/In press - 2025

ASJC Scopus subject areas

  • Sociology and Political Science
  • Political Science and International Relations

Keywords

  • Euclidean distance
  • big data
  • cosine similarity
  • point estimates
  • unbiasedness
  • word embeddings
