We study the parallelization of categorical data clustering algorithms on an MPI platform. Clustering such data has been a daunting task even for sequential algorithms, mainly because of the difficulty of finding suitable similarity/distance measures. We propose a parallel version of the k-modes algorithm, called PV3, which maintains the clustering quality and achieves a reasonable speed-up. PV3 is guaranteed to produce the same output as its sequential counterpart while running in a parallel environment, which is the source of the speed-up. To make the clustering process deterministic and to produce better clustering results, we develop an initialization method based on the notion of density, called the Revised Density Method (RDM). We also suggest other RDM variants that address some of its shortcomings, and we then investigate ways to parallelize RDM and some of these variants. Finally, we propose an Ensemble Parallelizing Process (EPP) framework for the parallelization of categorical data clustering. This framework is flexible and can be used with different initialization and clustering algorithms at different levels of parallelism. Combining our RDM initialization techniques with the PV3 algorithm, we build an RDM realization of EPP, called RDM EPP, whose clustering quality ranks, to the best of our knowledge, among the top three k-modes-based clustering algorithms.
Examining Committee
Dr. Sudhir Mudur (Chair)
Dr. Shiri Nematollaah & Dhrubajyoti Goswami (Supervisors)