From-Scratch EM Algorithm for GMM Matches scikit-learn on UMAP-Reduced Text Data #31216
dimitris-markopoulos started this conversation in Show and tell
I implemented the EM algorithm for multivariate Gaussian Mixture Models from scratch and benchmarked it against sklearn.mixture.GaussianMixture. On a UMAP-reduced version of a high-dimensional text dataset, the results aligned almost perfectly:
- Matching mixing weights, means, and covariances
- Adjusted Rand Index = 1.0000
- Component assignments match after greedy alignment via L2 distance
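The greedy alignment step can be sketched like this (a hypothetical helper of my own, not the author's code): each learned component is matched to the nearest still-unmatched reference component by the L2 distance between their means, after which labels can be permuted and the ARI computed.

```python
import numpy as np

def greedy_align(means_a, means_b):
    """Greedily match each component in means_a to the nearest
    unmatched component in means_b by L2 distance between means.
    Returns a dict: index in means_a -> index in means_b."""
    k = means_a.shape[0]
    mapping = {}
    unused = set(range(k))
    for i in range(k):
        dists = {j: np.linalg.norm(means_a[i] - means_b[j]) for j in unused}
        j_best = min(dists, key=dists.get)
        mapping[i] = j_best
        unused.remove(j_best)
    return mapping

# Toy check: means_b is a permuted copy of means_a, so the mapping
# should recover the permutation exactly.
means_a = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
means_b = means_a[[2, 0, 1]]
mapping = greedy_align(means_a, means_b)
```

Greedy matching is O(k²) and can fail when components are close together; the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) gives an optimal assignment, but for well-separated components the greedy version suffices.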
The implementation is object-oriented, numerically stable (with covariance regularization), and tracks parameter convergence across iterations. A direct comparison to scikit-learn is included.
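For readers curious what "numerically stable with covariance regularization" can look like in practice, here is a minimal single-iteration sketch (my own illustration, not the author's `ml_utils.py`): responsibilities are computed in log space, and a small ridge `reg` is added to each covariance diagonal, analogous to scikit-learn's `reg_covar`.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of a multivariate normal, using solve/slogdet
    rather than an explicit inverse for numerical stability."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def em_step(X, weights, means, covs, reg=1e-6):
    """One EM iteration for a GMM with diagonal covariance regularization."""
    n, d = X.shape
    k = len(weights)
    # E-step: responsibilities via log-sum-exp to avoid underflow
    log_r = np.stack([np.log(weights[j]) + log_gaussian(X, means[j], covs[j])
                      for j in range(k)], axis=1)
    log_norm = np.logaddexp.reduce(log_r, axis=1, keepdims=True)
    r = np.exp(log_r - log_norm)
    # M-step: weighted maximum-likelihood updates
    nk = r.sum(axis=0)
    new_weights = nk / n
    new_means = (r.T @ X) / nk[:, None]
    new_covs = []
    for j in range(k):
        diff = X - new_means[j]
        cov = (r[:, j, None] * diff).T @ diff / nk[j]
        cov += reg * np.eye(d)  # ridge keeps covariances positive definite
        new_covs.append(cov)
    return new_weights, new_means, np.array(new_covs), log_norm.sum()

# Two well-separated toy clusters; log-likelihood should not decrease
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
w = np.array([0.5, 0.5])
m = np.array([[0.0, 0.0], [5.0, 5.0]])
c = np.array([np.eye(2), np.eye(2)])
w1, m1, c1, ll1 = em_step(X, w, m, c)
w2, m2, c2, ll2 = em_step(X, w1, m1, c1)
```

Without the `reg * np.eye(d)` term, a component that collapses onto a few points can produce a singular covariance and NaN log densities, which is the usual failure mode of naive EM implementations.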
- Notebook: `06_em_algorithm_fit_gmm.ipynb`
- Core class: `ml_utils.py`
Note: The convergence only matches this closely after dimensionality reduction with UMAP. On raw high-dimensional data, convergence is more sensitive to initialization.
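That sensitivity is one reason scikit-learn exposes `n_init`. A quick way to probe it on your own data (toy data here, not the author's dataset) is to compare a single initialization against the best of several restarts:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for a UMAP-reduced embedding: three 2-D clusters
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0, 3, 6)])

# A single initialization can land in a poor local optimum...
gm1 = GaussianMixture(n_components=3, n_init=1, random_state=0).fit(X)
# ...while keeping the best of several restarts is more robust.
gm10 = GaussianMixture(n_components=3, n_init=10, random_state=0).fit(X)
```

With the same `random_state`, the best-of-10 fit's final lower bound (`lower_bound_`) can never be worse than the single run's, since the restarts include that run's initialization.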
Happy to share this as a learning tool or discussion starter around reproducibility and clustering convergence diagnostics.
If you're interested in the full project, it's available here.