Clustering for Preprocessing

1. The concept of clustering

Clustering is a technique used in data analysis and machine learning to group similar data points together into "clusters". The goal of clustering is to find patterns in the data by placing similar data points in the same group while separating dissimilar data points into different groups.
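To make the idea concrete, here is a minimal sketch of one common clustering method, k-means (Lloyd's algorithm), written with the Python standard library only. The article does not name a specific algorithm, so the choice of k-means, the `kmeans` helper, and the toy points are all assumptions made for illustration; in practice a library implementation would normally be used.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Minimal k-means (Lloyd's algorithm) returning (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # the first three points share one label, the last three the other
```

Which group gets label 0 and which gets label 1 depends on the random starting centroids; only the grouping itself is meaningful.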

2. The importance of preprocessing

Preprocessing data

Clustering can be used as a preprocessing step before building a machine learning model. For example, clustering can group similar data points together, and a separate model can then be built for each cluster.
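The "one model per cluster" idea can be sketched as follows. This is an illustration under stated assumptions, not a prescribed method: the cluster IDs are assumed to come from an earlier clustering step, the per-cluster model is a one-feature least-squares line, and the names `fit_line`, `models`, and `predict` are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical rows: (cluster_id, x, y). The cluster IDs are assumed to come
# from an earlier clustering step.
rows = [
    (0, 1.0, 2.0), (0, 2.0, 4.0), (0, 3.0, 6.0),      # cluster 0 follows y = 2x
    (1, 10.0, 5.0), (1, 11.0, 4.0), (1, 12.0, 3.0),   # cluster 1 follows y = 15 - x
]

# Group the rows by cluster, then fit one model per group.
by_cluster = defaultdict(list)
for cluster_id, x, y in rows:
    by_cluster[cluster_id].append((x, y))

models = {
    cluster_id: fit_line(*zip(*pairs)) for cluster_id, pairs in by_cluster.items()
}


def predict(cluster_id, x):
    """Route a new point to the model trained for its cluster."""
    slope, intercept = models[cluster_id]
    return slope * x + intercept
```

A single line fitted to all six rows would describe neither group well, whereas each per-cluster line matches its own group's trend.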

Note: Nowadays the preprocessing stage is the most laborious step; it may take 60-80% of an ML engineer's effort.

Two types of preprocessing

1. Clustering to reduce dimensionality: In an n-dimensional dataset (where n is the number of attributes), the computational complexity is proportional to the number of dimensions, n. With clustering, the n attributes can be converted or reduced to one categorical attribute, the "Cluster ID". This reduces the complexity, although there will be some loss of information because of the dimensionality reduction to one single attribute.

2. Clustering for object reduction: Assume that the number of customers for an organization is in the millions and the number of cluster groups is 100. For each of these 100 cluster groups, one "poster child" customer can be identified that represents the typical characteristics of all the customers in that cluster group. The poster child customer can be an actual customer or a fictional customer. The prototype of a cluster is the most common representation of all the customers in a group. This greatly reduces the record count, and the dataset can be made appropriate for classification by algorithms like k-nearest neighbors (k-NN), where computational complexity depends on the number of records.
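Both types of preprocessing can be sketched on one toy dataset. The sketch below rests on several assumptions: the centroids are taken as given from an earlier clustering run, the fictional prototype is the mean of a cluster's members, the actual prototype is the real member closest to that mean, and each prototype carries its cluster's majority class label. The data and the names `cluster_id`, `prototypes`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical centroids, assumed to come from an earlier clustering run
# (for example k-means) over three numeric customer attributes.
centroids = [(1.0, 1.0, 1.0), (9.0, 9.0, 9.0)]

# Hypothetical labelled customers: (attributes, class_label).
customers = [
    ((0.8, 1.2, 1.1), "low"), ((1.1, 0.9, 1.3), "low"), ((1.2, 1.0, 0.7), "low"),
    ((9.2, 8.8, 9.1), "high"), ((8.7, 9.3, 9.0), "high"), ((9.1, 9.0, 8.6), "low"),
]


def cluster_id(row):
    """Index of the centroid nearest to the row."""
    return min(range(len(centroids)), key=lambda c: math.dist(row, centroids[c]))


# 1. Dimensionality reduction: three attributes become one categorical ID.
ids = [cluster_id(attrs) for attrs, _ in customers]

# 2. Object reduction: keep one prototype record per cluster.
groups = defaultdict(list)
for cid, (attrs, label) in zip(ids, customers):
    groups[cid].append((attrs, label))

prototypes = {}
for cid, members in groups.items():
    rows = [attrs for attrs, _ in members]
    # Fictional "poster child": the mean of all members.
    mean = tuple(sum(dim) / len(rows) for dim in zip(*rows))
    # Actual "poster child": the real member closest to that mean.
    actual = min(rows, key=lambda r: math.dist(r, mean))
    majority = Counter(label for _, label in members).most_common(1)[0][0]
    prototypes[cid] = {"mean": mean, "actual": actual, "label": majority}


def classify(query):
    """1-nearest-neighbor over the prototypes instead of every customer."""
    nearest = min(prototypes.values(), key=lambda p: math.dist(query, p["mean"]))
    return nearest["label"]


print(ids)                        # [0, 0, 0, 1, 1, 1]
print(classify((1.0, 1.1, 0.9)))  # low
print(classify((8.9, 9.1, 9.0)))  # high
```

Here a query is compared against two prototypes rather than six customers. With millions of customers and 100 clusters, the same pattern would compare each query against 100 records. The cost is visible in the data too: the one "low" customer in the second cluster is absorbed into a "high" prototype, which is the information loss mentioned above.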