
Bo Zhang



Machine Learning on Statistical Manifold

Weiqing Gu
Second Reader(s)
Nicholas J. Pippenger


This senior thesis generalizes several fundamental machine learning algorithms from Euclidean space to the statistical manifold, an abstract space in which each point is a probability distribution. We adapt the support vector machine, k-means clustering, and hierarchical clustering to classify and cluster probability distributions, using statistical distances to measure the dissimilarity between objects. We describe a situation in which clustering probability distributions is needed and useful. Our empirical results show that clustering algorithms based on statistical distances often outperform the same algorithms with the Euclidean distance in complex scenarios. In particular, we apply statistical-distance-based hierarchical and k-means clustering to univariate normal distributions with $k = 2$ and $k = 3$ clusters, bivariate normal distributions with diagonal covariance matrices and $k = 3$ clusters, and discrete Poisson distributions with $k = 3$ clusters. Finally, we prove that the k-means clustering algorithm applied to discrete distributions with the Hellinger distance converges not only to a partial optimal solution but also to a local minimum.
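As a rough illustration of k-means clustering on discrete distributions with the Hellinger distance, the sketch below clusters a handful of truncated Poisson distributions. It is not the thesis's implementation. The function names (`poisson_pmf`, `hellinger`, `hellinger_kmeans`), the truncation of the Poisson support, the fixed initial centers, and the centroid update (averaging the square-root vectors and renormalizing them) are all assumptions made for this example. The update relies on the fact that the Hellinger distance $H(p, q) = \frac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2$ is a scaled Euclidean distance between square-root vectors, which lie on the unit sphere.

```python
import math
import numpy as np

def poisson_pmf(rate, support_size=80):
    """Poisson pmf truncated to {0, ..., support_size - 1} and renormalized (assumption)."""
    ks = np.arange(support_size)
    log_fact = np.array([math.lgamma(k + 1) for k in ks])
    pmf = np.exp(ks * math.log(rate) - rate - log_fact)
    return pmf / pmf.sum()

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hellinger_kmeans(dists, init_idx, n_iter=50):
    """Lloyd-style k-means on the rows of `dists` using the Hellinger distance.

    Centers are updated by averaging square-root vectors and projecting the
    mean back onto the unit sphere, so each center is again a distribution.
    This update rule is an assumption of the sketch.
    """
    roots = np.sqrt(dists)                    # shape (n, m), each row has unit norm
    centers = roots[list(init_idx)].copy()    # shape (k, m)
    for _ in range(n_iter):
        # Squared Euclidean distance between roots equals 2 * H^2,
        # so the argmin matches the nearest center in Hellinger distance.
        d2 = ((roots[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = roots[labels == j]
            if len(members) > 0:
                mean = members.mean(axis=0)
                new_centers[j] = mean / np.linalg.norm(mean)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers ** 2               # centers returned as distributions

# Hypothetical data: nine Poisson distributions in three well-separated groups.
rates = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0, 30.0, 31.0, 32.0]
dists = np.array([poisson_pmf(r) for r in rates])
labels, centers = hellinger_kmeans(dists, init_idx=(0, 3, 6))
print(labels)
```

With rates this far apart, the three groups should be recovered from the chosen initial centers. Closer rates or a different initialization may give a different local minimum, which is consistent with the convergence statement above.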
