KMeans

class geoanalytics.clustering.KMeans.KMeans(dataframe)

Bases: object

About this algorithm

Description:

K-Means clustering with a smarter centroid initialization for better stability and faster convergence. Here it is applied to high-dimensional data: the x, y coordinates are excluded from the clustering, and only the remaining feature columns are used.
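The description does not name the initialization scheme. k-means++ seeding is one widely used scheme of this kind (it is also the default `init` in scikit-learn), so the following is a minimal NumPy sketch under that assumption. The helper name `kmeans_pp_init` and the synthetic data are hypothetical and not part of geoanalytics.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X using k-means++ seeding."""
    n = X.shape[0]
    # First centroid: one data point chosen uniformly at random.
    chosen = [int(rng.integers(n))]
    for _ in range(1, k):
        centers = X[chosen]
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - centers[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so points far from the existing centroids are favoured.
        chosen.append(int(rng.choice(n, p=d2 / d2.sum())))
    return X[chosen]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # synthetic feature vectors, illustration only
init_centers = kmeans_pp_init(X, k=3, rng=rng)
```

Because an already chosen point has distance zero to itself, it cannot be drawn again, so the seeds are spread across the data rather than clumped together.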

Parameters:
  • dataframe (pd.DataFrame) – A Pandas DataFrame that contains the input dataset.

      • The first two columns must be spatial or positional features (e.g., ‘x’ and ‘y’).

      • All other columns are treated as feature vectors for clustering.
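As an illustration of the expected column layout, the sketch below builds a small DataFrame and splits it the way the parameter description implies. The column names (`lon`, `lat`, `band1`, ...) and the random values are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical input: two positional columns followed by three feature columns.
df = pd.DataFrame({
    "lon": rng.uniform(0, 10, size=6),
    "lat": rng.uniform(0, 10, size=6),
    "band1": rng.normal(size=6),
    "band2": rng.normal(size=6),
    "band3": rng.normal(size=6),
})

coords = df.iloc[:, :2]    # positional columns, kept out of the clustering
features = df.iloc[:, 2:]  # every remaining column is a clustering feature
```

The split is purely positional, so the order of the columns matters more than their names.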

Attributes:
  • df (pd.DataFrame) – Stores a copy of the input dataset, with the first two columns renamed to ‘x’ and ‘y’.

  • start_time (float) – Records the clustering start time for runtime analysis.

  • memory_uss_kb (float) – Measures USS memory usage in kilobytes after execution.

  • memory_rss_kb (float) – Measures RSS memory usage in kilobytes after execution.

  • labels (pd.DataFrame) – Final dataframe containing ‘x’, ‘y’, and cluster label for each instance.

  • cluster_centers_ (np.ndarray) – Coordinates of the final cluster centroids after fitting.

Execution methods

Calling from a Python program

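import pandas as pd

from geoanalytics.clustering.KMeans import KMeans

df = pd.read_csv('data.csv')  # first two columns: x, y; remaining columns: features

obj = KMeans(df)

obj.elbowMethod()  # plots WCSS against k to help choose k

output = obj.run(k=3)  # returns the labels DataFrame and the cluster centers

labelsDF = output[0]

clusterCenters = output[1]

obj.save(outputFileLabels='KMeansLabels.csv')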

Credits

The complete program was written by Raashika and revised by M.Charan Teja under the supervision of Professor Rage Uday Kiran.

elbowMethod()

Applies the elbow method to help choose the number of clusters (k). It plots the WCSS (within-cluster sum of squares) for k in the range 1 to 10.
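The WCSS curve behind the elbow plot can be sketched with scikit-learn directly. This is an illustration, not the library's own code: the synthetic blobs, the `n_init` and `random_state` values, and the choice to treat the upper bound 10 as inclusive are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans as SklearnKMeans

rng = np.random.default_rng(0)
# Synthetic features: three well-separated blobs in 4 dimensions.
blob_means = np.array([[0.0] * 4, [6.0] * 4, [-6.0] * 4])
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 4)) for m in blob_means])

ks = list(range(1, 11))  # assumes the upper bound 10 is inclusive
wcss = []
for k in ks:
    model = SklearnKMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS

# Plotting ks against wcss (for example with matplotlib) shows the "elbow";
# for this data the curve should flatten after k = 3.
```

The elbow is the value of k after which adding clusters reduces the WCSS only marginally.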

getMemoryRSS()

Prints the memory usage (RSS) of the process in kilobytes.

getMemoryUSS()

Prints the memory usage (USS) of the process in kilobytes.

getRuntime()

Prints the total runtime of the clustering algorithm.
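For readers who want to take the same three measurements outside the class, here is a minimal sketch. It assumes the third-party `psutil` package, which the documentation above does not confirm as the library's mechanism, and it uses a list comprehension as a stand-in for the clustering work.

```python
import os
import time

import psutil  # third-party; assumed here, not confirmed by the documentation

start_time = time.time()
squares = [i * i for i in range(100_000)]  # stand-in for the clustering work
runtime_s = time.time() - start_time

process = psutil.Process(os.getpid())
# RSS: resident set size, all physical memory currently held by the process.
memory_rss_kb = process.memory_info().rss / 1024
# USS: unique set size, memory that would be freed if the process ended now.
memory_uss_kb = process.memory_full_info().uss / 1024

print(f"Runtime: {runtime_s:.4f} s")
print(f"RSS: {memory_rss_kb:.0f} KB, USS: {memory_uss_kb:.0f} KB")
```

USS excludes pages shared with other processes, so it is usually the smaller and more representative of the two figures.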

run(k=4, max_iter=100)

Runs KMeans clustering on the input dataset using scikit-learn.

Parameters:
  • k – Number of clusters to form.

  • max_iter – Maximum number of iterations for a single run.

Returns:

Two values: a DataFrame with the original x, y and the cluster label of each instance, followed by the cluster centers.
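One way such a method could be put together with scikit-learn is sketched below. It is not the library's actual implementation: the function name `run_kmeans_sketch`, the `labels` column name, and the `n_init` and `random_state` settings are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans as SklearnKMeans


def run_kmeans_sketch(dataframe, k=4, max_iter=100):
    """Cluster the feature columns; return (labels DataFrame, centers)."""
    df = dataframe.copy()
    # Rename the two positional columns, as the df attribute describes.
    df.columns = ["x", "y"] + list(df.columns[2:])
    features = df.iloc[:, 2:].to_numpy()

    model = SklearnKMeans(n_clusters=k, max_iter=max_iter, n_init=10,
                          random_state=0)  # fixed seed: an assumption
    cluster_ids = model.fit_predict(features)

    labels = df[["x", "y"]].copy()
    labels["labels"] = cluster_ids  # column name is an assumption
    return labels, model.cluster_centers_


rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(60, 5)),
                    columns=["lon", "lat", "f1", "f2", "f3"])
labels_df, centers = run_kmeans_sketch(data, k=3)
```

The centers have one row per cluster and one column per feature; x and y do not appear in them because they are excluded from the fit.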

save(outputFileLabels='KMeansLabels.csv', outputFileCenters='KMeansCenters.csv')

Saves the cluster labels and the cluster centers to CSV files.
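The signature takes one file name for the labels and one for the centers. A rough equivalent with plain pandas could look like the sketch below, which assumes both outputs are written without an index column; the sample values and the temporary folder are only for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

labels = pd.DataFrame({"x": [0.0, 1.0], "y": [0.0, 1.0], "labels": [0, 1]})
centers = np.array([[0.1, 0.2], [0.9, 0.8]])

out_dir = tempfile.mkdtemp()  # temporary folder so the sketch leaves no clutter
labels_path = os.path.join(out_dir, "KMeansLabels.csv")
centers_path = os.path.join(out_dir, "KMeansCenters.csv")

labels.to_csv(labels_path, index=False)
# The centers are a NumPy array, so wrap them in a DataFrame before writing.
pd.DataFrame(centers).to_csv(centers_path, index=False)

reloaded = pd.read_csv(labels_path)
```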