What is Clustering?
Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to each other than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
Types of Clustering
Broadly speaking, clustering can be divided into two subgroups:
- Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all.
- Soft Clustering: In soft clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, a retail store segmenting its customers into 10 clusters would assign each customer a probability of belonging to each of the 10 clusters.
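To make the distinction concrete, a soft assignment can be sketched with NumPy. The `soft_assign` helper, the points, and the centers below are invented for this illustration; it treats each cluster as an equal-weight, unit-variance Gaussian component, so responsibilities are proportional to exp(-distance²/2).

```python
import numpy as np

def soft_assign(points, centers):
    # Squared Euclidean distance from every point to every center
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Unnormalized responsibilities: closer centers get higher weight
    weights = np.exp(-d2 / 2.0)
    # Normalize each row into a probability distribution over clusters
    return weights / weights.sum(axis=1, keepdims=True)

points = np.array([[0.0, 0.0], [0.9, 1.1], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
probs = soft_assign(points, centers)
# Each row of probs sums to 1; a hard clustering would instead keep
# only the argmax of each row.
```

A hard assignment falls out of the same computation by taking `probs.argmax(axis=1)`.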
Types of clustering algorithms
Since the task of clustering is subjective, there are many means of achieving it. Every methodology follows a different set of rules for defining 'similarity' among data points. In fact, more than 100 clustering algorithms are known, but only a few are used widely. Let's look at them in detail:
- Connectivity models: As the name suggests, these models are based on the notion that data points closer in data space are more similar to each other than data points lying farther away. These models can follow two approaches. In the first approach, each data point starts as its own cluster, and clusters are aggregated as the distance between them decreases. In the second approach, all data points start as a single cluster, which is partitioned as the distance increases. The choice of distance function is also subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are hierarchical clustering algorithms and their variants.
- Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of a cluster. The K-Means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
- Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, the Gaussian distribution). These models often suffer from overfitting. A popular example of these models is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
- Density models: These models search the data space for areas of varied density of data points. They isolate regions of different density and assign the data points within a region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
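The mean-versus-median distinction above can be checked with a tiny 1-D example (the numbers are illustrative, not from the text): the mean minimizes the sum of squared distances, while the median minimizes the sum of absolute distances, which is why k-medians handles plain Euclidean error better.

```python
import statistics

# Illustrative 1-D data with an outlier at 10.0
data = [1.0, 2.0, 3.0, 10.0]

def sum_sq_error(c):
    # Total squared distance from c to the data (what k-means minimizes)
    return sum((x - c) ** 2 for x in data)

def sum_abs_error(c):
    # Total absolute distance from c (what k-medians minimizes in 1-D)
    return sum(abs(x - c) for x in data)

mean = statistics.mean(data)      # 4.0
median = statistics.median(data)  # 2.5

# The mean wins on squared error; the median wins on absolute error.
```

Running the two error functions at both centers shows `sum_sq_error(mean) < sum_sq_error(median)` but `sum_abs_error(median) < sum_abs_error(mean)`.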
K-Means Clustering Algorithm
Let’s say we have x1, x2, x3, …, xn as our inputs, and we want to split them into K clusters.
The steps to form clusters are:
Step 1: Choose K random points as cluster centers called centroids.
Step 2: Assign each x(i) to the closest cluster by calculating its Euclidean distance to each centroid.
Step 3: Identify new centroids by taking the average of the assigned points.
Step 4: Keep repeating step 2 and step 3 until convergence is achieved
Let’s take a detailed look at each of these steps.
We randomly pick K cluster centers (centroids). We name them c1, c2, …, ck, and we can say that

C = {c1, c2, …, ck}

Where C is the set of all centroids.
We assign each data point to its nearest center, which is accomplished by calculating the Euclidean distance:

assign(x) = arg min_{ci ∈ C} dist(ci, x)²

Where dist() is the Euclidean distance.
Here, we calculate each x value’s distance from each c value, i.e. the distances x1–c1, x1–c2, x1–c3, and so on. Then we find the lowest of these values and assign x1 to that particular centroid.
Similarly, we find the minimum distance for x2, x3, etc.
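In code, the assignment step for a single point might look like this (the point and the centroid coordinates are made up for illustration):

```python
import math

# One data point and three centroids (hypothetical values)
x1 = (1.0, 2.0)
centroids = {"c1": (0.0, 0.0), "c2": (1.0, 3.0), "c3": (5.0, 5.0)}

def dist(p, q):
    # Euclidean distance between two 2-D points
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Distance from x1 to every centroid, then pick the closest
distances = {name: dist(x1, c) for name, c in centroids.items()}
nearest = min(distances, key=distances.get)  # "c2" is closest here
```

The same loop is repeated for x2, x3, and so on, giving every point a cluster label.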
We identify the actual centroid by taking the average of all the points assigned to that cluster:

ci = (1 / |Si|) Σ_{xi ∈ Si} xi

Where Si is the set of all points assigned to the ith cluster, and |Si| is the number of points in it.
It means the original point, which we thought was the centroid, will shift to the new position, which is the actual centroid for each of these groups.
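Putting the averaging into code for one cluster (the assigned points below are hypothetical):

```python
# S_i: the points currently assigned to cluster i (illustrative values)
S_i = [(1.0, 2.0), (3.0, 4.0), (2.0, 0.0)]

# New centroid = component-wise mean of the assigned points
n = len(S_i)
c_i = (sum(p[0] for p in S_i) / n, sum(p[1] for p in S_i) / n)
# c_i is now (2.0, 2.0)
```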
Keep repeating step 2 and step 3 until convergence is achieved.
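Putting all the steps together, a minimal k-means can be sketched in NumPy. This is an illustrative implementation written for this article, not library code: it assumes well-separated data and does not guard against empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means on an (n, d) array X; no empty-cluster handling."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of three distinct points each (toy data)
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [10.0, 10.0], [10.5, 10.0], [10.0, 10.5]])
labels, centroids = kmeans(X, k=2)
```

On data like this, the two returned centroids end up near the centers of the two blobs, and each blob's points share one label.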
Use Cases of K-Means in Security
Crime analysis is defined as an analytical process that provides relevant information on crime patterns and trend correlations to assist personnel in planning the deployment of resources for the prevention and suppression of criminal activities. It is important to analyze crime for the following reasons:
1. To inform law enforcement about general and specific crime trends in a timely manner.
2. To take advantage of the wealth of information existing in the justice system and the public domain. Crime rates change rapidly, and improved analysis finds hidden patterns of crime, if any, without any explicit prior knowledge of these patterns.
Analysis of crime is essential for providing safety and security to the civilian population. Using data mining, we can discover critical information that can help local authorities detect crime and areas of importance. The main purpose of this analysis is to study crime, including theft, homicide, and various drug offenses, as well as suspicious activities, noise complaints, and burglar alarms, using qualitative and quantitative approaches.
K-means clustering is one of the methods of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
1. Initially, the number of clusters must be known; let it be k.
2. The first step is to choose a set of k instances as the initial cluster centers.
3. Next, the algorithm considers each instance and assigns it to the closest cluster.
4. The cluster centroids are recalculated, either after a whole cycle of re-assignments or after each individual assignment.
5. This process is iterated until the assignments no longer change.
The K-means algorithm's complexity is O(tkn), where n is the number of instances, k the number of clusters, and t the number of iterations, making it relatively efficient. It often terminates at a local optimum. Its disadvantages are that it is applicable only when the mean is defined, and that k, the number of clusters, must be specified in advance. It is unable to handle noisy data and outliers and is not suitable for discovering clusters with non-convex shapes.
K-means clustering helps us analyze historical crime rates and improve the crime resolution rate of the present. It lets agencies take action to prevent future incidents through preventive mechanisms based on observed patterns, reduce the training time of officers assigned to a new location with no prior knowledge of site-specific crimes, and increase operational efficiency by optimally redeploying limited resources to the right places at the right times.