UMAP
The UMAP dimensionality reduction tool offers a workflow for dimensionality reduction and clustering that is particularly useful for high-dimensional data, such as hyperspectral images. The tool uses UMAP (Uniform Manifold Approximation and Projection) to reduce the dimensionality of the data, then clusters the result with HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).
Warning
UMAP requires access to Compute.
Tip
Mathematical details for both UMAP and HDBSCAN can be found in the original papers: UMAP and HDBSCAN.
Selecting data
The UMAP dialog is broken into three steps. The first step allows you to select the product, bands, and AOI for analysis. When you have selected the data you want to analyze, click Parameterize UMAP to move to the next step.
Warning
Selected data cannot have any fully masked bands. If necessary, use the data statistical analysis tool to clean data before running UMAP.
Note
Data will be limited to 50,000 points to manage runtimes.
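The tool does not document how the 50,000 points are chosen. As a rough illustration only, the sketch below assumes uniform random sampling without replacement; the function name `subsample` and the fixed seed are hypothetical and not part of the tool.

```python
import random

MAX_POINTS = 50_000  # limit stated in the note above


def subsample(points, max_points=MAX_POINTS, seed=0):
    """Return at most max_points items, sampled uniformly without replacement.

    Assumption: uniform random sampling. The tool's actual strategy may differ.
    """
    if len(points) <= max_points:
        return list(points)
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same pixels
    return rng.sample(points, max_points)


# Fake three-band pixels standing in for the selected product, bands, and AOI.
pixels = [(i, i + 1, i + 2) for i in range(120_000)]
sample = subsample(pixels)
print(len(sample))  # 50000
```

If the selection contains fewer points than the limit, everything is kept unchanged.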
Parameterize UMAP
UMAP works by computing a manifold of the input data and projecting that manifold to a lower dimensional space. Parameterizing UMAP, therefore, involves determining how that manifold is built (the neighborhood size and distance metric used), how many dimensions to use in the projection, and how clustered the low dimensional projection is. The second step of the dialog allows manipulation of these parameters and provides a plot of the embedding for analysis. When you have set the parameters, click Train model with chosen parameters to train the model and generate a plot of the embedding. When you are satisfied with the model, click Cluster data to move to the clustering step.
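For readers who script similar workflows, the four parameters in this step line up with keyword arguments of the open-source umap-learn library. The sketch below is an assumption about how such a mapping might look, not a description of this tool's internals; the `UmapParams` class and its validation rules are hypothetical, and the default values shown are umap-learn's.

```python
from dataclasses import dataclass, asdict

# The five metrics offered in the dialog, written as umap-learn metric names.
SUPPORTED_METRICS = {"euclidean", "manhattan", "chebyshev", "cosine", "correlation"}


@dataclass
class UmapParams:
    """Hypothetical container; field names match umap-learn keyword arguments."""
    n_neighbors: int = 15      # Local neighborhood size
    min_dist: float = 0.1      # Minimum distance
    n_components: int = 2      # Number of dimensions
    metric: str = "euclidean"  # Metric

    def validate(self):
        if self.n_neighbors < 2:
            raise ValueError("n_neighbors must be at least 2")
        if self.min_dist < 0:
            raise ValueError("min_dist must be non-negative")
        if self.n_components < 1:
            raise ValueError("n_components must be at least 1")
        if self.metric not in SUPPORTED_METRICS:
            raise ValueError(f"unsupported metric: {self.metric}")
        return self


params = UmapParams(n_neighbors=30, min_dist=0.05, n_components=3, metric="cosine").validate()
umap_kwargs = asdict(params)
print(umap_kwargs)

# With umap-learn installed, a model could then be trained along these lines:
#   import umap
#   embedding = umap.UMAP(**umap_kwargs).fit_transform(data)
```

Keeping the parameters in one validated object makes it easy to record which settings produced a given embedding while experimenting.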
Tip
More details on parameters can be found here
Tip
Points in the embedding plot will be colored according to the colors of the layer on the map. Clicking on a point in the plot will show its location on the map.
Note
The first time a model is trained with a given metric, the K nearest neighbors are computed and stored. Later training runs with the same metric reuse the stored neighbors, which makes parameter testing faster.
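The caching idea in this note can be sketched as follows. This is a toy brute-force version under stated assumptions: the `KnnCache` class, its `k_max` argument, and the `computations` counter are hypothetical, and the tool's real implementation is not documented here.

```python
import math


def euclidean(a, b):
    return math.dist(a, b)


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


METRICS = {"euclidean": euclidean, "manhattan": manhattan}


class KnnCache:
    """Compute the nearest-neighbor lists once per metric, then reuse them."""

    def __init__(self, points, k_max):
        self.points = points
        self.k_max = k_max      # largest neighborhood size we expect to test
        self._cache = {}        # metric name -> list of neighbor index lists
        self.computations = 0   # counts how often the expensive step runs

    def neighbors(self, metric, n_neighbors):
        if n_neighbors > self.k_max:
            raise ValueError("n_neighbors exceeds the cached k_max")
        if metric not in self._cache:
            self.computations += 1
            dist = METRICS[metric]
            graph = []
            for i, p in enumerate(self.points):
                others = [j for j in range(len(self.points)) if j != i]
                others.sort(key=lambda j: dist(p, self.points[j]))
                graph.append(others[: self.k_max])
            self._cache[metric] = graph
        # Smaller neighborhoods are just a prefix of the cached lists.
        return [row[:n_neighbors] for row in self._cache[metric]]


points = [(0, 0), (1, 0), (0, 2), (5, 5)]
cache = KnnCache(points, k_max=3)
cache.neighbors("euclidean", 2)
cache.neighbors("euclidean", 3)   # reuses the stored result
cache.neighbors("manhattan", 2)   # new metric, so neighbors are recomputed
print(cache.computations)  # 2
```

Only a change of metric triggers a new neighbor search in this sketch; changing the neighborhood size reads from the stored lists.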
Local neighborhood size
The number of neighboring points that UMAP will use when building the data manifold. This parameter balances the local and global structure of the data in the embedding: low values focus on local structure, and large values focus on global structure.
Minimum distance
The minimum distance between points in the final embedding. Lower values will create a clumpier embedding, while larger values will spread points farther apart.
Tip
This parameter has a strong effect on the number of clusters generated - small values tend to create fewer large clusters, and large values create more small clusters.
Number of dimensions
Number of dimensions for the embedding. Unlike simpler embedding algorithms such as t-SNE, UMAP can successfully embed into more than three dimensions.
Note
If the number of dimensions is greater than 3, the embedding plot will show the first three axes.
Metric
The metric for computing distances in the input data. Available metrics are:
- Euclidean: Simple Euclidean distance between points.
- Manhattan: Manhattan, or "Taxi cab" distance, is the sum of the absolute differences along each axis.
- Chebyshev: The Chebyshev metric defines the distance between points as the greatest of their differences along any axis.
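- Cosine: Cosine distance is one minus the cosine of the angle between the vectors, so it depends on their direction but not on their magnitude.
- Correlation: Correlation distance is one minus the Pearson correlation coefficient between the vectors, measuring the dependence between them.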
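The metrics listed above can be written out in a few lines of plain Python. This is a minimal sketch using the common definitions (cosine and correlation distance as one minus the similarity, as in SciPy); it is meant for intuition and is not the tool's own code.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # One minus the cosine of the angle between a and b (undefined for zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def correlation(a, b):
    # One minus the Pearson correlation: cosine distance of the mean-centered vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(chebyshev(a, b))  # 4.0

# Two made-up spectra with the same shape but different brightness.
dark, bright = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)
print(euclidean(dark, bright))    # clearly non-zero
print(cosine(dark, bright))       # approximately 0
print(correlation(dark, bright))  # approximately 0
```

The last three lines show why the choice matters for spectral data: Euclidean distance separates the two spectra because of their brightness difference, while cosine and correlation distance treat them as nearly identical.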
Tip
Euclidean, Manhattan, and Chebyshev distances can easily be visualized by their behavior on a chess board. Chebyshev distance corresponds to the number of moves a King makes getting from one square to the other, Manhattan distance behaves like a Rook moving one square at a time, and Euclidean distance is an ant moving in a straight line without regard for the board.
Embeddings plot
After training a model, the embedding will be plotted on either a 2D or 3D scatterplot, depending on the choice of dimension. Points on the plot will be colored the same as they are on the map, and clicking on a plotted point will show its location on the map in a vector layer called selected points.
Note
The axes for the embedding are dimensionless, and simply represent a new spatial position for the input points relative to all of the others.
Cluster data
Once the low dimensional manifold is built, the UMAP tool uses HDBSCAN to cluster the data in an unsupervised manner. HDBSCAN is better suited than k-means to the irregular shapes that UMAP generates, as it can find clusters with varied shapes, whereas k-means tends to build roughly spherical clusters. Use the Cluster data with chosen parameters button to run the clustering on your embedded data.
Tip
HDBSCAN will not attempt to cluster outlier points, and will instead classify them separately as noise.
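In the widely used hdbscan and scikit-learn implementations, noise points receive the label -1 and clusters are numbered from 0. Assuming this tool follows the same convention (which is not stated here), a clustering result could be summarized as in the sketch below; the `summarize` helper and the example labels are hypothetical.

```python
from collections import Counter

NOISE_LABEL = -1  # convention in hdbscan and scikit-learn; assumed here


def summarize(labels):
    """Count clusters and noise points in a list of cluster labels."""
    counts = Counter(labels)
    n_noise = counts.pop(NOISE_LABEL, 0)
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}


labels = [0, 0, 1, -1, 1, 1, -1, 2]  # made-up labels for eight points
summary = summarize(labels)
print(summary["n_clusters"])  # 3
print(summary["n_noise"])     # 2
```

A large noise count relative to the cluster sizes is often a hint that the clustering parameters are too strict for the embedding.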
Tip
When you run a clustering, the embeddings plot will change to show the clusters instead of the pixel RGBs.
Minimum cluster size
The smallest number of points that can be considered a cluster. Lower values tend to create more clusters.
Minimum number of samples
The minimum number of samples in the neighborhood of a point for it to be considered a "core" sample, and therefore considered for clustering instead of noise.
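One way to build intuition for this parameter is the core distance used by HDBSCAN: the distance from a point to its k-th nearest neighbor, with k set by the minimum number of samples. Points in dense regions have small core distances, and isolated points have large ones, which makes them more likely to be treated as noise. The sketch below is a simplified brute-force version; implementations differ on whether the point itself counts as a neighbor, and here it does not.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[min_samples - 1])
    return result


# Four points on a unit square plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
cores = core_distances(points, min_samples=2)
print(cores)  # the square corners get 1.0; the outlier gets a much larger value
```

Raising the minimum number of samples increases every core distance, so more points on the sparse fringes of the embedding end up classified as noise.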
Cluster selection epsilon
Clusters that are closer together than this value are merged. This will help if your clustering contains a large number of small, nearby clusters.
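The three clustering parameters correspond to keyword arguments of the open-source hdbscan library. The sketch below assumes that mapping for illustration; the `HdbscanParams` class and its checks are hypothetical, the defaults shown are the library's, and nothing here is confirmed as the tool's own configuration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class HdbscanParams:
    """Hypothetical container; field names match hdbscan keyword arguments."""
    min_cluster_size: int = 5               # Minimum cluster size
    min_samples: Optional[int] = None       # Minimum number of samples (None: library default)
    cluster_selection_epsilon: float = 0.0  # Cluster selection epsilon (0.0 disables merging)

    def validate(self):
        if self.min_cluster_size < 2:
            raise ValueError("min_cluster_size must be at least 2")
        if self.min_samples is not None and self.min_samples < 1:
            raise ValueError("min_samples must be at least 1")
        if self.cluster_selection_epsilon < 0:
            raise ValueError("cluster_selection_epsilon must be non-negative")
        return self


params = HdbscanParams(min_cluster_size=50, min_samples=10, cluster_selection_epsilon=0.5).validate()
hdbscan_kwargs = asdict(params)
print(hdbscan_kwargs)

# With the hdbscan package installed, clustering an embedding could look like:
#   import hdbscan
#   labels = hdbscan.HDBSCAN(**hdbscan_kwargs).fit_predict(embedding)
```

Note that the epsilon value is measured in the units of the embedding, so a sensible value depends on how spread out the embedding plot is.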
Deploy clustering
When you are happy with the clustering parameters, click Deploy clustering. This will deploy a Compute function that will apply the UMAP and HDBSCAN models over your chosen AOI, and add the result to the map.
Tip
Progress on the Compute task can be found here.
Note
Images for the clustered result will appear on the map as they are completed, but you may need to override the map's cache to see the latest images. This can be accomplished by moving around the map, or changing the settings of the layer.
Warning
Only the AOI where data were selected will be clustered. As the models are built using these data, applying the models to areas outside this AOI is not valid.