
Here at Veraset, we know geolocation data can be messy and noisy, making it difficult to see potential trends within it. What can be helpful is to divide it into clusters based on data points’ proximity to each other and/or similarity in other attributes you want to measure. This can let you examine localized patterns of human mobility, behavior by demographic, retail store performance, and more.

But clustering geolocation data is a trickier task than it may seem. Depending on what you’re trying to analyze, some clustering strategies work better than others. To decide on the right one, you’ll have to ask yourself some questions. How can I deal with noisy or outlying data points? How many clusters will accurately represent the segments I want to analyze? Do cluster shapes accurately reflect how similar data points are? How do I minimize the algorithm randomly “guessing” which data points or clusters are related or not?

This article will help you find the answers. We'll first explain what it means to cluster geolocation data and why you would want to do it, along with some sample applications.

What is clustering for geolocation data?

Clustering is sorting a set of objects into smaller groups so objects in each group are somehow more similar to each other than objects in other groups. For geolocation data, it means mapping coordinates or addresses to meaningful places and population segments, then organizing them for analysis.

Clustering data points on house price and size into groups based on location

For instance, a brand may want to look at all of its stores in a particular geographic area to see how much foot traffic each one is receiving, and why that might be. Or a logistics company may want to isolate the areas and addresses where it needs to deliver shipments to make the trip easier and more efficient for drivers. Or a company might want to analyze the mobility and shopping habits of different demographics in an area to decide where to place out-of-home advertising.

Why clustering geolocation data is so important

There are a couple of reasons why it can be important to cluster geolocation data. One is that raw location data tends not to be very informative on its own; it usually needs to be put in the context of the places or groups of people it’s pointing to in order to be properly understood. Another is that it makes analysis easier by ensuring that comparisons are being made between places and people that actually have some sort of relationship with each other.

How is clustering used for geolocation data?

Clustered geolocation data is mainly meant to tie patterns of occurrences and human movement together with information about places and people. This lets analysts see where something is happening, what’s happening, who’s involved, and where those people are going next. This has applications in a number of fields, including these examples:

  • Healthcare: tracking and predicting the spread of a contagious disease
  • Urban planning: measuring which transport methods are most commonly used to get around a neighborhood
  • Transportation: determining where people who work in major cities or business districts commute from, and how long it typically takes them
  • News: sorting stories based on locality and subject matter
  • IT: detecting Internet traffic from localized networks for content filtering and cybersecurity
  • Entertainment & retail: understanding the other economic behaviors of people who shop at a certain store or attend a particular event
  • Government: speeding up census data collection by making inferences, based on mobility, of people who may be family members or coworkers

While there are many more use cases for geolocation data clustering, there are also several problems to solve and factors to consider in performing it correctly. Some are unique to geolocation data as a class, while others are more general issues with the process of clustering and the algorithm chosen to do it with. We’ll discuss some of them in the following sections.

Challenges with clustering geolocation data

It’s not always a simple task to cluster geolocation data, and there are several reasons why. One is the problem of noise. Environmental interference with location signals – from indoor settings, inclement weather, or nearby tall buildings, for example – can cause accuracy errors. Some inaccuracies can even be caused by the way the signal source or receiver is configured or designed.

Beyond noise are outliers: signals that may not have accuracy issues, but do not neatly conform to observable patterns that are relevant to the current analysis. For example, you may get GPS pings from cars traveling down streets near a shopping district, but these are likely not very meaningful if you’re trying to measure the foot traffic of shoppers. Like noise, outliers may need to be filtered out of a dataset, or they may prevent you from performing your analysis properly.

Dataset with a lot of outlying points

And as we mentioned, clustered location data often isn’t all that informative unless it is mapped to other, more semantic, geospatial data like points of interest. To illustrate, it is usually much less intuitive to say you were at 48.8584° N, 2.2945° E – or even 5 Av. Anatole France, 75007, Paris, France – than to tell someone you visited the Eiffel Tower. So data often has to be geocoded or reverse-geocoded before it can be processed or presented.

Location representation – especially through addresses – can be one of the biggest problems when cleaning geolocation data. We’ll explain why in greater detail below.

Using a standardized geolocation data format

Location data is commonly conceptualized as geographic coordinates produced by GPS. The reality, however, is it can come from several different sources and be represented by many different formats or attributes. For instance, a location may be simply a latitude and longitude pair, or it may contain additional information regarding things like altitude, elevation, and time. Location data coming from smartphones, meanwhile, may also have an IP address and/or mobile device ID attached to it.

Location data can also come in the form of a street address. This can be a particularly problematic format because it’s represented in numerous different ways in different parts of the world. For example, it may use various abbreviations (e.g. “Ave.” or “AV” for “Avenue”), place attributes in different orders, or have more or fewer attributes depending on what information is being represented (e.g. some addresses specify counties, districts, or other subregions while others do not). An address could even simply have spelling mistakes, unnecessary characters, improper use of uppercase and lowercase letters, or mistakes in spacing between characters or words.

All of these things can cause problems in matching location data attributes and ensuring you’re comparing data that’s actually related. That’s why it’s important to choose a standard geolocation data format, and clean the data so it conforms to that standard. An example is Placekey, a universal place identification standard that Veraset helped develop. We continue to use it – and so do many other geospatial scientists, businesses, and organizations.

Tips for cleaning geolocation data before clustering it

Cleaning geospatial data for clustering is a balancing act. On one hand, removing data points that are redundant or unlikely to fit any pattern can increase the speed and accuracy of the clustering algorithm you choose. On the other hand, excluding too many data points can make it difficult for some algorithms to detect areas of data dense enough to form clusters around. So these tips are merely guidelines; apply them to the degree they work best for you.

1. Start by cleaning up noise and outliers

Your data may have some points that aren’t particularly close to any other points. This probably means not many people spent enough time at those locations for them to indicate a meaningful place or pattern. So you can, for example, set a threshold for how many other points must be within a certain distance of a point, then filter out points that don’t meet the criteria. Some algorithms, like DBSCAN, have this cleaning feature built in. Many do not, however, so it can be useful to perform this step beforehand.

Clustered dataset showing before and after noise points are cleaned
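To make this concrete, here’s a minimal sketch of that kind of neighbor-count filter using scikit-learn’s BallTree with the haversine metric. The column names, the 100-meter radius, and the five-neighbor minimum are all illustrative assumptions, not fixed recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Hypothetical input: a DataFrame of pings with "latitude" and "longitude" in degrees.
df = pd.read_csv("pings.csv")

# BallTree's haversine metric expects coordinates in radians.
coords = np.radians(df[["latitude", "longitude"]].to_numpy())
tree = BallTree(coords, metric="haversine")

# Illustrative threshold: keep points with at least 5 neighbors within ~100 meters.
earth_radius_m = 6_371_000
neighbor_counts = tree.query_radius(coords, r=100 / earth_radius_m, count_only=True)

min_neighbors = 5
cleaned = df[neighbor_counts >= min_neighbors + 1]  # each point also counts itself
```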

2. If applicable and appropriate, clean points that indicate movement

Some types of geolocation data, like GPS, will include movement speed and direction in addition to latitude and longitude. In these cases, it can be advantageous to filter out data points that have speed greater than 0 (or another very small number). That way, you reduce outliers and noise from data points where people are in the process of moving from one location to another. Note that this won’t apply to all data, though, and that analyses interested in movement patterns may want to keep attributes related to speed and direction.
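If your data does carry a speed attribute, the filter itself is a one-liner. This sketch assumes a hypothetical "speed" column in meters per second; the 0.5 m/s cutoff is just an example of “another very small number.”

```python
# Keep only pings where the device was effectively stationary.
speed_threshold = 0.5  # meters per second; illustrative cutoff
stationary = df[df["speed"] <= speed_threshold]
```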

3. Clean redundant points that indicate idling, if necessary

On the other side of the coin, a person might stay in virtually the same place for an extended period of time (e.g. they are sitting or lying down). This can result in large clumps of data points layered on top of each other. So it may be worthwhile to filter out certain data points where their change in distance from the previous data point is very small. This can speed up the algorithm you’re using, but it can also cause accuracy issues if you clean too aggressively.
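One hedged way to do this in Python is to compute the great-circle distance from each ping to the previous ping from the same device, then drop pings that barely moved. The "device_id" and "timestamp" columns and the 10-meter threshold are assumptions for illustration.

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6_371_000 * np.arcsin(np.sqrt(a))

# Assumes one row per ping, sorted by device and time.
df = df.sort_values(["device_id", "timestamp"])
prev = df.groupby("device_id")[["latitude", "longitude"]].shift()
step_m = haversine_m(prev["latitude"], prev["longitude"], df["latitude"], df["longitude"])

# Keep the first ping per device (no previous point) and any ping that moved > 10 m.
thinned = df[step_m.isna() | (step_m > 10)]
```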

Common methods for clustering geolocation data

Cluster analysis of anything, including geolocation data, is further complicated by the fact that a “cluster” can be different things depending on its application. That’s why there are a number of different algorithms for modeling clusters in different ways. Here are some of the key ones, including how they operate and what their strengths or weaknesses are.

1. K-means clustering

K-means clustering is an algorithm that aims to define a pre-set number of clusters so that each data point in a cluster is the minimum distance away from a central (mean) point, or centroid. It is a centroid-based unsupervised machine learning algorithm, meaning that the computer is left to determine for itself the relationship(s) between data points without having to look for special metadata about them.

How it works: K-means clustering starts with the user inputting into a computer how many clusters they want to create (“k”). The computer then randomly places that many centroids among the data points and calculates the distance between each centroid and data point. Each data point is then assigned to the centroid it is closest to. From there, each centroid is repositioned to minimize the distance between itself and all data points assigned to it. Then each data point is reassigned to a cluster based on how close it is to the centroids’ new positions.

This pattern of calculating distances, assigning data points to clusters, and then repositioning the centroid of each cluster continues until one of two things happens. In some cases, a predefined number of iterations is run. Otherwise, the algorithm stops when cluster centroids no longer need to be repositioned because the data points assigned to them do not change.

Why it’s important: K-means clustering is one of the simplest and quickest clustering algorithms to implement. It’s most useful for large datasets, or when you go in knowing how many clusters you want to end up with. However, it can be prone to mistakes due to the random starting points of centroids; some variations use non-random starting points for centroids to improve accuracy. Using a suboptimal number of clusters can cause errors as well, which is why the “elbow method” – plotting average final point distance from associated centroids as a function of the number of clusters – is often used to select the optimal number of clusters.
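As a rough illustration, here’s how both the elbow method and the final fit might look with scikit-learn. Treating raw latitude/longitude as flat Euclidean coordinates is a simplification (projecting to a local planar coordinate system first is often better), and the choice of k = 4 is just an example.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical input: an (n, 2) array of latitude/longitude pairs.
X = df[["latitude", "longitude"]].to_numpy()

# Elbow method: plot inertia (within-cluster sum of squared distances) against k.
k_values = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in k_values]
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()

# Fit with the k chosen from the elbow plot (say 4) and keep labels and centroids.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
labels, centroids = kmeans.labels_, kmeans.cluster_centers_
```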

2. Affinity propagation

Affinity propagation is a clustering technique that uses data points themselves as centroids for clusters (i.e. “medoids”). It does this by taking into account how similar one data point is to another when compared against its similarity to all other data points.

How it works: Affinity propagation starts by calculating each data point’s similarity to every other point, typically as the negative of the sum of the squared differences between their attribute values. Then, a self-similarity value must be set for each data point; this is usually the median similarity value. A lower self-similarity value will produce fewer clusters, while a higher one will produce more.

Then each data point is assigned a “responsibility” to itself and other points, based on the similarity of the point(s) minus the highest similarity among all other points. From there, an “availability” score is calculated. A point’s self-availability is the sum of its positive responsibility values other than the one for itself. A point’s availability to any other point is its self-responsibility value plus any other positive responsibility values it has with other points, except the point for which its availability is being calculated.

From here, the “responsibility” and “availability” values of each point and pair of points are added. The point with the highest sum relative to another point becomes that point’s “exemplar”, or the centroid of the cluster that point belongs to. Points that share the same exemplar are part of the same cluster.

Why it’s important: Affinity propagation is useful in that it doesn’t require specifying a set number of clusters in advance. Instead, the algorithm calculates this for itself. It can also work with data in more than two dimensions, so it is useful when comparing location data points defined by attributes in addition to just latitude and longitude. However, like K-means, it’s poor at dealing with outliers, noise, and irregularly-shaped clusters.
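In scikit-learn, the self-similarity value described above is exposed as the `preference` parameter, which defaults to the median similarity. A minimal sketch, reusing the same X array from the K-means example:

```python
from sklearn.cluster import AffinityPropagation

# preference=None defaults to the median similarity; lower it for fewer clusters.
ap = AffinityPropagation(damping=0.9, preference=None, random_state=42)
labels = ap.fit_predict(X)

# Exemplars (medoids) are actual data points, not synthetic centroids.
exemplars = X[ap.cluster_centers_indices_]
```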

3. Self-organizing map

A self-organizing map (SOM) is a type of unsupervised neural network. It attempts to convert multi-attribute data into a representation with fewer dimensions – usually just two – so it’s more easily visualized and understood. At the same time, it also tries to maintain patterns in the underlying data by locating nodes (or neurons) corresponding to similar data points closer together.

How it works: An SOM starts with a grid of nodes that are typically assigned random values. Then a sample data point is picked from the dataset, and its values are compared against the values of all nodes. The most similar node has both its values and its position in the grid adjusted to be closer to the data point, as do its neighboring nodes to a lesser degree (this degree is affected by the distance between nodes and the current number of iterations). This process repeats itself for a different data point until all data points have been competed over (the process is called “competitive learning”), or the target number of iterations is reached.

Why it’s important: The biggest advantage of SOMs is they are easy to comprehend. They are designed to function similarly to how humans process sensory information, particularly sight. In fact, their “map” structure is similar to how most geospatial data is represented in other media. However, SOMs can end up being inaccurate if they have sparse or incomplete input data to train on (which limits clustering), or because initial node values are randomized (which is why some variants use non-random initial node values closer to expected data point values). Also, depending on the magnitude and order in which data points and their attributes are weighted, the same dataset may produce different maps with different relationships between clusters.
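If you want to experiment, the third-party MiniSom package is a common lightweight choice. The sketch below assumes a hypothetical third attribute ("dwell_minutes") alongside latitude and longitude, and the grid size and iteration count are arbitrary starting points.

```python
import numpy as np
from minisom import MiniSom

# Hypothetical input: rows of latitude, longitude, and dwell time, scaled per attribute.
data = df[["latitude", "longitude", "dwell_minutes"]].to_numpy()
data = (data - data.mean(axis=0)) / data.std(axis=0)

# A 10x10 grid of nodes, each with a weight vector the same length as one data row.
som = MiniSom(10, 10, data.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
som.random_weights_init(data)
som.train_random(data, num_iteration=1000)

# Map each point to its best-matching node; points sharing (or neighboring) a node
# can be treated as belonging to the same cluster.
node_assignments = [som.winner(row) for row in data]
```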

4. DBSCAN

DBSCAN (density-based spatial clustering of applications with noise) is an algorithm that, given a set of data points, groups them into clusters based on how many other data points are nearby. In this way, it separates data into areas of high density and low density.

How it works: In addition to the data itself, DBSCAN takes two inputs: the distance around each point to search (“eps”), and the minimum number of other points that must be found within that distance for a point to count as a core point of a cluster (“minPts”). Points that meet or exceed “minPts” are considered core points of a cluster. Points that do not meet “minPts”, but are within the “eps” distance of at least one core point, are considered border points of that cluster. Points that meet neither of these conditions are considered noise.

Why it’s important: DBSCAN has a few advantages. It does not need a predefined number of clusters, and it can detect clusters of non-convex shapes. Also, because it requires setting density parameters and can conceptualize outliers/noise, it is useful for analyzing geolocation data such as foot traffic (i.e. a minimum number of people within a certain distance). On the other hand, needing to set universal density parameters can be a weakness of this model, as setting them inaccurately can lead to imprecise results. This is especially true when dealing with datasets that have few dense areas, or whose areas vary greatly in density. This is why hierarchical versions that can handle multiple cluster densities, such as HDBSCAN and OPTICS (ordering points to identify the clustering structure), are sometimes used instead.
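Here’s a hedged example of plain DBSCAN on latitude/longitude, using scikit-learn’s haversine metric so “eps” can be expressed in meters. The 100-meter radius and 10-point minimum are illustrative, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The haversine metric expects radians, and eps must be expressed in radians too.
coords = np.radians(df[["latitude", "longitude"]].to_numpy())
earth_radius_m = 6_371_000
db = DBSCAN(eps=100 / earth_radius_m, min_samples=10, metric="haversine").fit(coords)

labels = db.labels_                      # the label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```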

5. Hierarchical clustering

Hierarchical clustering is a family of clustering techniques that create a series of cluster sets through several iterations. There are two general types of hierarchical clustering. Agglomerative clustering is more common; it starts with each data point belonging to its own cluster, and then merges clusters together over each iteration until the final iteration has all data points as part of the same cluster. Divisive clustering works in reverse: it starts with all data points as part of the same cluster, then divides them into separate clusters until the final iteration has each data point as its own cluster.

How it works: The main way hierarchical agglomerative clustering methods differ in operation is in how the algorithm chooses which clusters to merge in each iteration. In single-linkage clustering, it’s based on the minimum distance between two data points in separate clusters. In complete-linkage clustering, it’s based on the smallest maximum distance between two data points in separate clusters. In average-linkage clustering, it’s based on the smallest average distance between all data points in separate clusters. In Ward linkage clustering, it’s based on the smallest increase in total within-cluster variance that merging two clusters would produce.

Why it’s important: One benefit of hierarchical clustering is it shows the process through which the algorithm chooses to form clusters, so it’s easy to understand. It also doesn’t require a predetermined number of clusters; in fact, the algorithm can be stopped after a particular number of iterations once the desired number of clusters is achieved. However, hierarchical clustering can sometimes make arbitrary decisions in merging or splitting clusters. This is particularly the case when datasets are overly large or have lots of outliers and noise, or when data isn’t input in the same order each time.
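Here’s a small agglomerative example with SciPy; the dendrogram makes the merge order visible, and cutting the tree at four clusters is just an illustrative choice. Ward linkage is used, but "single", "complete", or "average" could be swapped in.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Build the full merge hierarchy with Ward linkage (X as in the earlier examples).
Z = linkage(X, method="ward")

# The dendrogram shows which clusters were merged at each iteration.
dendrogram(Z)
plt.show()

# Cut the tree once the desired number of clusters is reached (e.g. 4).
labels = fcluster(Z, t=4, criterion="maxclust")
```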

How to visualize geolocation data clusters

As we’ve mentioned, the point of grouping geolocation data into clusters is to observe how sets of places and people are similar or different. However, this is much more intelligible when it’s visualized as shapes on an actual map. Even more so if you add names of locations and groups, which are more easily understood than abstract addresses or coordinates.

Many clustering algorithms can be run in the Python programming language. Therefore, to demonstrate how to visualize the result, we will provide a basic sample workflow based on this starting point. We will also offer an alternative option at the end.

Step 1: convert the dataset into one recognized as geospatial

Code for converting a pandas dataframe into a GeoPandas geodataframe
(Image source: Towards Data Science – Claudia Ng)

A lot of data analysis and machine learning in Python is accomplished using the pandas software library. However, it isn’t always optimal for working with geospatial data. That’s why it can be advantageous to convert a dataset into a format the GeoPandas library can work with. GeoPandas was specifically designed to help Python process geospatial data.
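As a rough sketch of this step, assuming a DataFrame with "latitude", "longitude", and a "cluster" label column from one of the algorithms above, the conversion could look something like this:

```python
import geopandas as gpd
import pandas as pd

# Hypothetical input: clustered points with latitude/longitude and a cluster label.
df = pd.read_csv("clustered_points.csv")

gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326",  # WGS84 latitude/longitude
)
```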

Step 2: load shape files of the geographic area you’re working with

Code for setting visualization size and plotting shape files for the chosen geography
(Image source: Towards Data Science – Claudia Ng)

Next, you need to import the “.shp” and associated files of the map you want to work with. GeoPandas has built-in files corresponding to individual countries, but for larger or smaller geographic areas, you will need to find and import the files manually.

The example above shows the import of a shape file corresponding to New York City. Note the “z-order” is set to 1; this controls the order in which data elements are stacked on top of each other. We’ll explain what this means in the next step.
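A sketch of loading and plotting a base map, assuming a hypothetical shapefile of New York City boroughs (any .shp covering your study area works the same way):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

nyc = gpd.read_file("nyc_boroughs.shp")  # hypothetical path to the shapefile

fig, ax = plt.subplots(figsize=(12, 12))
nyc.plot(ax=ax, color="lightgrey", zorder=1)  # zorder=1 keeps the map at the bottom
```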

Step 3: plot data points on top of the imported map

Code for plotting data points on top of the geographic shape files
(Image source: Towards Data Science – Claudia Ng)

Now you can call the “.plot()” function for your geospatial data to plot the points on the map. Note that you want the “z-order” for the data points (2) to be higher than the map (1) so the former display on top of the latter and not the other way around.
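Continuing the sketch, the clustered points from step 1 go on the same axes with a higher z-order:

```python
# Plot the points above the base map, colored by their cluster label.
gdf.plot(ax=ax, column="cluster", markersize=5, zorder=2)
```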

Step 4: convert cluster centers to a separate series of points and plot them

Code for converting cluster centers to a GeoSeries and then plotting it on top of the other data layers
(Image source: Towards Data Science – Claudia Ng)

To make the centers of each cluster stand out, you have to convert them from part of your geospatial data frame into a separate object: a GeoSeries. Once you do that, you can plot this object on the map. Again, make sure the GeoSeries has a higher “z-order” than both the regular data points and the map (e.g. 3 > 1 or 2) so your cluster centers aren’t hidden under the other data objects.
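Assuming the K-means centroids from earlier (ordered latitude, longitude), the GeoSeries conversion and plot might look like this:

```python
import geopandas as gpd
from shapely.geometry import Point

centers = gpd.GeoSeries(
    [Point(lon, lat) for lat, lon in centroids],  # shapely Points take (x, y) = (lon, lat)
    crs="EPSG:4326",
)
centers.plot(ax=ax, color="red", markersize=60, zorder=3)  # zorder=3: above points and map
```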

Step 5: label, display, and save the visualization

Code for labeling, displaying, and saving the generated cluster visualization
(Image source: Towards Data Science – Claudia Ng)

Once all the heavy lifting is out of the way, all you need to do is give your visualization a title, label the X and Y axes as “Longitude” and “Latitude”, and display the whole thing by calling the “plt.show()” function. You may also wish to save the visualization for future reference.
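Wrapping up the sketch:

```python
ax.set_title("Geolocation data clusters")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")

plt.savefig("cluster_map.png", dpi=200)  # save before showing so the saved figure isn't blank
plt.show()
```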

Alternative option: Folium

Folium is an add-on module for Python that lets you plot your data on an interactive map without needing to import a shapefile to base it on. Instead, you pick a starting location and a zoom level, and Folium sets up the map for you. You can even customize how data points show up on the map, though it can take some clever coding.
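A minimal Folium sketch, with an arbitrary Manhattan starting view and a simple per-cluster color palette; the DataFrame columns are the same hypothetical ones used above:

```python
import folium

m = folium.Map(location=[40.7128, -74.0060], zoom_start=13)  # illustrative starting view

palette = ["red", "blue", "green", "purple", "orange"]
for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=3,
        color=palette[int(row["cluster"]) % len(palette)],
        fill=True,
    ).add_to(m)

m.save("cluster_map.html")  # open the saved HTML in a browser to explore the map
```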

We hope this article has given you some useful techniques and factors to consider when analyzing geospatial data through clustering. Ideally, you’ll now know how to decide which clustering strategies to employ, based on their strengths and weaknesses in certain situations. You have to have geolocation data as a starting point, though, and that’s where we at Veraset come in.

Our Movement dataset measures billions of foot traffic data points every day in over 150 countries around the world. Meanwhile, our Visits dataset precisely measures visits to over 6 million points of interest in the US, daily. If you’re interested in learning more about how our data can help you achieve your business objectives, get in touch with us here.