KMeansPP

class geoanalytics.clustering.KMeansPP.KMeansPP(dataframe)

Bases: object

About this algorithm

Description: KMeans++ improves on K-Means clustering by choosing the initial centroids more carefully, which gives more stable results and faster convergence. Here it is applied to the high-dimensional feature columns only; the x and y coordinate columns are excluded from clustering.
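The initialization idea can be illustrated in a few lines. The following is a minimal standard-library sketch of KMeans++ seeding (D² sampling), not the code this class runs (the class relies on scikit-learn). The names `squared_distance` and `kmeanspp_seeds` are hypothetical, and the sketch assumes the input contains at least k distinct points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeanspp_seeds(points, k, seed=0):
    """Pick k initial centroids from `points` using KMeans++ (D^2) sampling."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Next centroid: sampled with probability proportional to that distance,
        # so points far from the existing centroids are favored.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Hypothetical feature vectors (the x, y coordinate columns are not involved).
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
seeds = kmeanspp_seeds(points, k=3)
print(seeds)
```

Because an already chosen point has weight zero, it cannot be drawn again, which is why the seeds end up spread across the data rather than clumped together.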

Parameters:

dataframe (pd.DataFrame) – A Pandas DataFrame that contains the input dataset.
The first two columns must be spatial or positional features (e.g., ‘x’ and ‘y’).
All other columns are treated as feature vectors for clustering.

Attributes:

df (pd.DataFrame) – A copy of the input dataset with the first two columns renamed to ‘x’ and ‘y’.
start_time (float) – The clustering start time, recorded for runtime analysis.
memory_uss_kb (float) – USS (unique set size) memory usage in kilobytes, measured after execution.
memory_rss_kb (float) – RSS (resident set size) memory usage in kilobytes, measured after execution.
labels (pd.DataFrame) – Final DataFrame containing ‘x’, ‘y’, and the cluster label for each instance.
cluster_centers_ (np.ndarray) – Coordinates of the final cluster centroids after fitting.

Execution methods

Calling from a Python program

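import pandas as pd

from geoanalytics.clustering.KMeansPP import KMeansPP

# Load the dataset; the first two columns must be the x and y coordinates.
df = pd.read_csv('data.csv')

obj = KMeansPP(df)

# Plot WCSS for k = 1 to 10 to help choose the number of clusters.
obj.elbowMethod()

# Cluster with the chosen k; run() returns the labels and the cluster centers.
output = obj.run(k=3)
labelsDF = output[0]
clusterCenters = output[1]

# Save the results; outputFileCenters keeps its default value here.
obj.save(outputFileLabels='KMeansPPLabels.csv')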

Credits

The complete program was written by Raashika and revised by M.Charan Teja under the supervision of Professor Rage Uday Kiran.

elbowMethod(): Applies the elbow method to help decide the optimal number of clusters (k). It plots WCSS (within-cluster sum of squares) for k in range 1 to 10.
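To make the quantity behind the elbow plot concrete, here is a rough standard-library sketch that computes WCSS for k = 1 to 10 on a small made-up dataset. It uses a plain Lloyd's k-means with random initialization rather than the library's scikit-learn backend, and it prints the values instead of plotting them. The names `sq_dist`, `mean`, and `kmeans_wcss` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def kmeans_wcss(points, k, max_iter=100, seed=0):
    """Run a plain Lloyd's k-means and return the final WCSS."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # simple random init, not KMeans++
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster mean
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    # WCSS: squared distance from every point to its nearest center.
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Made-up data: three well-separated groups of four points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0), (21.0, 1.0),
]
wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 11)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

The "elbow" is the value of k after which WCSS stops dropping sharply; adding clusters beyond that point mostly splits groups that already fit well.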

getMemoryRSS(): Prints the memory usage (RSS) of the process in kilobytes.

getMemoryUSS(): Prints the memory usage (USS) of the process in kilobytes.

getRuntime(): Prints the total runtime of the clustering algorithm.
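For readers who want comparable numbers outside the class, the sketch below shows one way such runtime and memory figures could be collected. It assumes `psutil` as the measurement backend, which this page does not confirm, and falls back to `None` when `psutil` is not installed. The helper name `measure` is hypothetical.

```python
import time

try:
    import psutil  # third-party; assumed here, not confirmed by this page
except ImportError:
    psutil = None


def measure(func, *args, **kwargs):
    """Call func and return (result, runtime in seconds, USS in KB, RSS in KB)."""
    start_time = time.time()
    result = func(*args, **kwargs)
    runtime_s = time.time() - start_time

    memory_uss_kb = memory_rss_kb = None
    if psutil is not None:
        process = psutil.Process()  # the current process
        # psutil reports bytes; divide by 1024 to get kilobytes.
        memory_uss_kb = process.memory_full_info().uss / 1024
        memory_rss_kb = process.memory_info().rss / 1024
    return result, runtime_s, memory_uss_kb, memory_rss_kb


result, runtime_s, uss_kb, rss_kb = measure(sum, range(1000))
print(result, runtime_s, uss_kb, rss_kb)
```

RSS counts all memory resident for the process, including pages shared with other processes, while USS counts only the pages unique to it, so USS is usually the smaller of the two.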

run(k=4, max_iter=300)

Runs KMeans++ clustering on the input dataset using scikit-learn.

Parameters:

k – Number of clusters to form.
max_iter – Maximum number of iterations for a single run.

Returns:

A pair of values: a DataFrame with the original x, y and the cluster label of each row, followed by the cluster centers.
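As a rough picture of what a call like this involves, the sketch below clusters the feature columns with scikit-learn's `KMeans` using `init="k-means++"` and returns the same kind of pair. It is an approximation under stated assumptions, not the library's source: the function name `run_sketch`, the output column name `labels`, and the `n_init` and `random_state` settings are all hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans


def run_sketch(dataframe, k=4, max_iter=300):
    """Cluster every column after the first two; return (labels, centers)."""
    df = dataframe.copy()
    # Treat the first two columns as the spatial coordinates.
    df.columns = ["x", "y"] + list(df.columns[2:])
    features = df.iloc[:, 2:].to_numpy()

    model = KMeans(
        n_clusters=k,
        init="k-means++",
        max_iter=max_iter,
        n_init=10,        # assumed value
        random_state=0,   # assumed, for reproducibility
    )
    cluster_ids = model.fit_predict(features)

    labels = df.iloc[:, :2].copy()
    labels["labels"] = cluster_ids  # hypothetical column name
    return labels, model.cluster_centers_


# Made-up data: two coordinate columns followed by two feature columns.
data = pd.DataFrame({
    "lon": [1, 2, 3, 4, 5, 6],
    "lat": [1, 2, 3, 4, 5, 6],
    "f1": [0.0, 0.1, 5.0, 5.1, 10.0, 10.1],
    "f2": [0.0, 0.2, 5.0, 5.2, 10.0, 10.2],
})
labels_df, centers = run_sketch(data, k=3)
print(labels_df)
print(centers)
```

Note that the centers live in feature space, so with two feature columns each center has two values regardless of the x and y columns.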

save(outputFileLabels='KMeansPPLabels.csv', outputFileCenters='KMeansPPCenters.csv')

Saves the cluster labels and the cluster centers to the given output files.