From-Scratch EM Algorithm for GMM Matches scikit-learn on UMAP-Reduced Text Data #31216
dimitris-markopoulos started this conversation in Show and tell
I implemented the EM algorithm for multivariate Gaussian Mixture Models from scratch and benchmarked it against sklearn.mixture.GaussianMixture. On a UMAP-reduced version of a high-dimensional text dataset, the results aligned almost perfectly:
- Matching mixing weights, means, and covariances
- Adjusted Rand Index = 1.0000
- Component assignments match after greedy alignment via L2 distance
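The greedy alignment step can be sketched like this (a hypothetical helper of my own, not the author's code): each learned component is matched to the nearest still-unmatched reference component by the L2 distance between their means, after which labels can be permuted and the ARI computed.

```python
import numpy as np

def greedy_align(means_a, means_b):
    """Greedily match each component in means_a to the nearest
    unmatched component in means_b by L2 distance between means.
    Returns a dict: index in means_a -> index in means_b."""
    k = means_a.shape[0]
    mapping = {}
    unused = set(range(k))
    for i in range(k):
        dists = {j: np.linalg.norm(means_a[i] - means_b[j]) for j in unused}
        j_best = min(dists, key=dists.get)
        mapping[i] = j_best
        unused.remove(j_best)
    return mapping

# Toy check: means_b is a permuted copy of means_a, so the mapping
# should recover the permutation exactly.
means_a = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
means_b = means_a[[2, 0, 1]]
mapping = greedy_align(means_a, means_b)
```

Greedy matching is O(k²) and can fail when components are close together; the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) gives an optimal assignment, but for well-separated components the greedy version suffices.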
The implementation is object-oriented, numerically stable (with covariance regularization), and tracks parameter convergence across iterations. A direct comparison to scikit-learn is included.
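For readers curious what "numerically stable with covariance regularization" can look like in practice, here is a minimal single-iteration sketch (my own illustration, not the author's `ml_utils.py`): responsibilities are computed in log space, and a small ridge `reg` is added to each covariance diagonal, analogous to scikit-learn's `reg_covar`.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of a multivariate normal, using solve/slogdet
    rather than an explicit inverse for numerical stability."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def em_step(X, weights, means, covs, reg=1e-6):
    """One EM iteration for a GMM with diagonal covariance regularization."""
    n, d = X.shape
    k = len(weights)
    # E-step: responsibilities via log-sum-exp to avoid underflow
    log_r = np.stack([np.log(weights[j]) + log_gaussian(X, means[j], covs[j])
                      for j in range(k)], axis=1)
    log_norm = np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r - log_norm)
    # M-step: weighted maximum-likelihood updates
    nk = r.sum(axis=0)
    new_weights = nk / n
    new_means = (r.T @ X) / nk[:, None]
    new_covs = []
    for j in range(k):
        diff = X - new_means[j]
        cov = (r[:, j, None] * diff).T @ diff / nk[j]
        cov += reg * np.eye(d)  # ridge keeps covariances positive definite
        new_covs.append(cov)
    return new_weights, new_means, np.array(new_covs), log_norm.sum()

# Two well-separated toy clusters; log-likelihood should not decrease
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
w = np.array([0.5, 0.5])
m = np.array([[0.0, 0.0], [5.0, 5.0]])
c = np.array([np.eye(2), np.eye(2)])
w1, m1, c1, ll1 = em_step(X, w, m, c)
w2, m2, c2, ll2 = em_step(X, w1, m1, c1)
```

Without the `reg * np.eye(d)` term, a component that collapses onto a few points can produce a singular covariance and NaN log densities, which is the usual failure mode of naive EM implementations.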
- Notebook: `06_em_algorithm_fit_gmm.ipynb`
- Core class: `ml_utils.py`
Note: The convergence only matches this closely after dimensionality reduction with UMAP. On raw high-dimensional data, convergence is more sensitive to initialization.
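That sensitivity is one reason scikit-learn exposes `n_init`. A quick way to probe it on your own data (toy data here, not the author's dataset) is to compare a single initialization against the best of several restarts:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for a UMAP-reduced embedding: three 2-D clusters
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 3, 6)])

# A single initialization can land in a poor local optimum...
gm1 = GaussianMixture(n_components=3, n_init=1, random_state=0).fit(X)
# ...while keeping the best of several restarts is more robust.
gm10 = GaussianMixture(n_components=3, n_init=10, random_state=0).fit(X)
```

With the same `random_state`, the best-of-10 fit's final lower bound (`lower_bound_`) can never be worse than the single run's, since the restarts include that run's initialization.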
Happy to share this as a learning tool or discussion starter around reproducibility and clustering convergence diagnostics.
If you're interested in the full project, it's available here.