Model-based clustering based on parameterized finite Gaussian mixture models. Models are estimated by EM algorithm initialized by hierarchical model-based agglomerative clustering. The optimal model is then selected according to BIC.

KalinNonchev/mclustpy

mclustpy

mclustpy is a Python function for clustering data using the Mclust algorithm from the R package mclust. The function takes a 2D numpy array of data and returns a dictionary containing various output values computed by the Mclust algorithm.

Installation

mclustpy requires the following dependencies:

  • numpy
  • rpy2

To install mclustpy, you can use pip:

pip install mclustpy

Usage

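from mclustpy import mclustpy
import numpy as np

# Example input: 1000 observations of 10 variables, drawn uniformly from [0, 1)
data = np.random.rand(1000, 10)
print(data.shape)  # (1000, 10)

# Fit the 'EEE' model with G=9 and a fixed seed for reproducibility
res = mclustpy(data, G=9, modelNames='EEE', random_seed=2020)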

The mclustpy function takes the following parameters:

  • data: a 2D numpy array of data to be clustered.
  • G: an integer specifying the maximum number of mixture components to be considered (default is 9).
  • modelNames: a string specifying the model types to be considered (default is 'EEE').
  • random_seed: an integer specifying the random seed for reproducibility (default is 2020).
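Because data has to be a 2D numpy array, it can help to check the input before passing it on. The following is only a minimal sketch: as_data_matrix is a hypothetical helper, not part of mclustpy, and the rejection of NaN or infinite values is an assumption about what a sensible input looks like rather than a documented requirement.

```python
import numpy as np

def as_data_matrix(rows):
    """Hypothetical helper: coerce input to a 2D float array of shape (n, d)."""
    data = np.asarray(rows, dtype=float)
    if data.ndim == 1:
        # Treat a flat sequence as n observations of a single variable.
        data = data.reshape(-1, 1)
    if data.ndim != 2:
        raise ValueError(f"expected a 2D array, got {data.ndim} dimensions")
    if not np.isfinite(data).all():
        raise ValueError("data contains NaN or infinite values")
    return data

data = as_data_matrix([[1.0, 2.0], [1.5, 1.8], [8.0, 8.2], [7.9, 8.1]])
print(data.shape)  # (4, 2)

# The checked array can then be passed as the data argument of mclustpy(...).
```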

The function returns a dictionary containing the following output values:

  • call: the function call used to run the Mclust algorithm.
  • data: the input data as an R matrix.
  • modelName: the model name(s) selected by the algorithm.
  • n: the number of observations in the data.
  • d: the number of variables in the data.
  • G: the number of mixture components selected by the algorithm.
  • BIC: the Bayesian Information Criterion (BIC) values for all models considered.
  • loglik: the log-likelihood of the selected model.
  • df: the number of degrees of freedom (estimated parameters) in the selected model.
  • bic: the BIC value of the selected model.
  • icl: the Integrated Completed Likelihood (ICL) value of the selected model.
  • hypvol: the hypervolume parameter for the noise component, if one is included in the model.
  • parameters: the estimated parameters for each component in the selected model.
  • z: the posterior probabilities of assignment to each component for each observation.
  • classification: the classification of each observation under the selected model.
  • uncertainty: a measure of uncertainty in the classification of each observation.
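Several of these values are related to one another. In mclust, the classification is the component with the largest posterior probability in z, the uncertainty is one minus that largest probability, and BIC is computed as 2 * loglik - df * log(n), so larger values are better. The sketch below illustrates these relationships with numpy on a made-up stand-in dictionary. The helper names hard_assignments and bic_from_loglik are hypothetical, and it is an assumption that the corresponding entries of the real result can be converted with np.asarray.

```python
import math
import numpy as np

def hard_assignments(z):
    """Return (labels, uncertainty) from an (n, G) matrix of posterior probabilities."""
    z = np.asarray(z, dtype=float)
    labels = z.argmax(axis=1) + 1      # 1-based component labels, as in R
    uncertainty = 1.0 - z.max(axis=1)  # 0 means a fully confident assignment
    return labels, uncertainty

def bic_from_loglik(loglik, df, n):
    """BIC in the mclust sign convention: larger values indicate a better model."""
    return 2.0 * loglik - df * math.log(n)

# Made-up stand-in for a few entries of the result dictionary (not real output).
res = {"z": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]], "loglik": -100.0, "df": 5, "n": 50}

labels, uncertainty = hard_assignments(res["z"])
print(labels)       # [1 2 1]
print(uncertainty)  # approximately [0.1 0.2 0.5]
print(bic_from_loglik(res["loglik"], res["df"], res["n"]))  # about -219.56
```

An observation whose probabilities are split evenly, like the third row here, gets the first of the tied components and a high uncertainty, which makes it a natural candidate for manual inspection.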

For more information, see the original mclust documentation.
