Lecture UnsupervisedML_SOM

The document discusses Self-Organizing Maps (SOM), an unsupervised machine learning method that enables dimensionality reduction and visualization of complex datasets and is particularly useful in materials informatics. It outlines the SOM algorithm, its advantages over other methods such as K-Means and PCA, and how it can be combined with these methods for enhanced data analysis. It also introduces two implementations of SOM, an augmented SOMPY and MiniSOM, and highlights their applications in materials research.

Uploaded by

cepem13540

Unsupervised ML: SOM, and how to choose a Data Science method
Quick Review of the methods we learned

> Statistical analysis
> Supervised ML
– Linear regression
– NN
– KNN
– Decision Tree
– SVM
> Unsupervised ML
– K-Means Clustering
– PCA
– …

Why another method, SOM?

> Demonstrate some data science methods that are not widely used or well known but can be very useful for materials informatics studies
> Introduce a method I have used and find well suited to the particular needs of many materials study applications
> Demonstrate how various data science methods can be used together to achieve better results
> Demonstrate a few projects that use the same method, so that we can understand the method from a user's point of view
What is a Self-Organizing Map (SOM)?

> An unsupervised ML method
> Dimensionality reduction, enabling powerful visualizations of the data:
– K-Means does clustering, but neither dimensionality reduction nor visualization
– PCA does dimensionality reduction, which enables visualization up to a point (it does not help when the first 3 principal components do not represent the data well), but it does not perform clustering. In addition, the PCA visualization does not preserve the original topographic information.
> Gives some insight into how the data is clustered in high dimensions
What is SOM

> You can think of a SOM as an artificial neural network with a single layer of neurons arranged in a two-dimensional matrix.
– The 2D matrix can be seen as a position map that captures the characteristics of the data
> Merits of SOM
– Effective for training on large datasets
– Because the map is a 2D matrix, the result can be visualized
– Preserves the topography of the original data
– Can present the Euclidean distance between data points
Algorithm of SOM
– Normalization of the input data, so that all features are distributed on comparable scales
– Initialization: each (x,y) position in the map is assigned one weight per input neuron (feature), which associates a weight vector with each map position.
– Iteration:
> Choose a sample from the dataset
> Calculate the Euclidean distance between that sample and each weight vector
> The (x,y) position “closest” to the sample is declared the Best Matching Unit (BMU)
> The weight vector of the BMU is adjusted to match the sample more closely. The amount of adjustment (learning) decreases as the iterations proceed.
> The weight vectors of the BMU's neighbors are also adjusted, to a lesser extent. The number of neighbors and how much they are adjusted depend on the hyperparameters and on the number of iterations.
– Convergence:
> Maximum number of iterations
> Monitoring of the topographic error
– Reference: https://link.springer.com/article/10.1007/BF00337288
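The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of SOMPY or MiniSOM: the function name `train_som`, the linear decay schedule, the Gaussian neighborhood, the rectangular grid, and all default values are assumptions chosen for readability.

```python
import numpy as np

def train_som(data, rows=6, cols=6, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM (illustrative sketch).

    Returns the weights, shape (rows, cols, n_features), and the normalized data.
    """
    rng = np.random.default_rng(seed)
    # Normalization: zero mean and unit variance for every feature
    std = data.std(axis=0)
    std[std == 0] = 1.0
    x = (data - data.mean(axis=0)) / std
    # Initialization: one random weight vector per (row, col) map position
    weights = rng.normal(size=(rows, cols, x.shape[1]))
    # Grid coordinates of every node, used to measure distances on the map
    gy, gx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for t in range(n_iter):
        # Assumed schedule: learning rate and neighborhood radius shrink linearly
        decay = 1.0 - t / n_iter
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 0.5)
        # Choose a sample and find its Best Matching Unit (BMU)
        sample = x[rng.integers(len(x))]
        dist = np.linalg.norm(weights - sample, axis=2)
        bi, bj = np.unravel_index(np.argmin(dist), dist.shape)
        # Gaussian neighborhood: the BMU moves most, its neighbors move less
        h = np.exp(-((gy - bi) ** 2 + (gx - bj) ** 2) / (2 * sigma ** 2))
        weights += lr * h[:, :, None] * (sample - weights)
    return weights, x

# Toy data (made up for illustration): two well-separated groups in 3 features
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (100, 3)), rng.normal(5, 0.3, (100, 3))])
weights, x = train_som(data)
```

Here the only convergence criterion is the maximum number of iterations; monitoring an error measure during training would need an extra check inside the loop.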
Self-Organizing Map (SOM)

How does it work?


$$x_i = (a_i,\ b_i,\ c_i,\ d_i,\ e_i,\ f_i)^{\mathsf{T}}$$
Two-Dimensional Mesh Structure

Each connection can deform

[Figure: 13 animation frames showing a two-dimensional mesh of nodes numbered 1–12 among points labeled a–f]
Self-Organizing Map (SOM) Algorithm

> Dragging Nodes


> “Flattening a crumpled paper”
U-matrix and how to use it to get insights for clustering

> After training, the nodes in the 2D map are not evenly distributed in the original feature space: data points that sit on adjacent nodes are not necessarily similar to each other in the higher-dimensional space.
> The U-matrix uses a heatmap to show the Euclidean distance between neighboring nodes
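One common way to compute a U-matrix is to average, for each node, the Euclidean distance between its weight vector and those of its immediate grid neighbors. The sketch below assumes a rectangular grid with 4-connected neighbors; the function name `u_matrix` and the toy weights are made up for illustration.

```python
import numpy as np

def u_matrix(weights):
    """Mean Euclidean distance from each node to its up/down/left/right neighbors."""
    rows, cols, _ = weights.shape
    um = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                # Skip neighbors that would fall outside the map
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(np.linalg.norm(weights[i, j] - weights[ni, nj]))
            um[i, j] = np.mean(dists)
    return um

# Toy map: the left half of the nodes sits at (0, 0), the right half at (10, 10)
toy_weights = np.zeros((4, 6, 2))
toy_weights[:, 3:, :] = 10.0
um = u_matrix(toy_weights)
```

In this toy map the large values appear only in the two columns on either side of the jump, which is the kind of ridge that separates clusters on a U-matrix heatmap.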
Using SOM in conjunction with other methods

> Since SOM is a dimensionality reduction method, for smaller datasets you can initialize the SOM map using the first 2 principal components, essentially the 2D PCA map
> K-Means can also be run on the same dataset, and the corresponding clusters can be visualized on the SOM map.
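A PCA-based initialization can be sketched as follows: the initial weight vectors are spread on a regular grid over the plane spanned by the first two principal components. The function name `pca_init` and the choice to span one standard deviation on either side of the mean are assumptions for this sketch, not the scheme used by any particular package.

```python
import numpy as np

def pca_init(data, rows, cols):
    """Initial SOM weights laid out on the plane of the first two principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions; s holds the singular values
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Standard deviation of the data along PC1 and PC2
    scale = s[:2] / np.sqrt(len(data) - 1)
    a = np.linspace(-1, 1, rows)   # grid position along PC1
    b = np.linspace(-1, 1, cols)   # grid position along PC2
    return (mean
            + a[:, None, None] * scale[0] * vt[0]
            + b[None, :, None] * scale[1] * vt[1])

# Random toy data, used only to show the shapes involved
rng = np.random.default_rng(0)
pca_data = rng.normal(size=(100, 4))
init_weights = pca_init(pca_data, rows=5, cols=7)
```

Starting from this ordered layout instead of random weights means the map is already roughly unfolded before the first training iteration.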
K-Means clustering and U-Matrix
They can be compared to validate the results!

> SOM can provide a means to visualize K-Means!
> If the K-Means cluster boundaries match the U-matrix boundaries well, then the training is successful
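A rough sketch of this comparison: cluster the nodes with K-Means, then mark the map positions where the cluster label changes, so that these boundaries can be laid over the U-matrix. Running K-Means on the node weight vectors (rather than on the raw samples) is an assumption made here to keep the example short, and `kmeans` below is a bare-bones Lloyd's algorithm written for illustration, not a library call.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Very small Lloyd's algorithm; returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            # Leave a center where it is if no point was assigned to it
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def cluster_boundaries(label_map):
    """True where a node's right or lower neighbor belongs to a different cluster."""
    edge = np.zeros(label_map.shape, dtype=bool)
    edge[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]
    edge[:-1, :] |= label_map[:-1, :] != label_map[1:, :]
    return edge

# Toy trained map: left half of the nodes near (0, 0), right half near (10, 10)
node_weights = np.zeros((4, 6, 2))
node_weights[:, 3:, :] = 10.0
label_map = kmeans(node_weights.reshape(-1, 2), k=2).reshape(4, 6)
edge = cluster_boundaries(label_map)
```

If the `True` entries of `edge` fall on the high-distance ridges of the U-matrix, the two methods agree on where the clusters separate.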
Different Implementations of SOM

> SOM is just an algorithm; there are many packages you can use that implement it
> We will introduce
– An augmented version of SOMPY, to which our group has contributed
– MiniSOM
The uniqueness and functions of augmented SOMPY

https://github.com/DataScienceUWMSE/SOM

> Uses PCA for initialization, and includes a K-Means clustering overlay
> “Heat maps” provide a way to visualize each feature after training
> The projection function helps users find additional correlations or patterns among features, including for categorical data
“Heatmap” concept

> Map each node’s weight for one feature onto the 2D map
> The number of heat maps equals the number of input variables
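In array terms, each heat map is one slice of the trained weight array along the feature axis. A minimal sketch, where the function name `component_planes`, the feature names, and the stand-in weights are all hypothetical:

```python
import numpy as np

def component_planes(weights, feature_names):
    """One (rows, cols) heat map per input variable, keyed by feature name."""
    return {name: weights[:, :, k] for k, name in enumerate(feature_names)}

# Stand-in for trained weights: a 2 x 4 map with 3 input variables
demo_weights = np.arange(24, dtype=float).reshape(2, 4, 3)
planes = component_planes(demo_weights, ["density", "modulus", "melting_point"])
```

Each entry of `planes` can then be drawn as an image (for example with a plotting library's heatmap function), giving as many heat maps as there are input variables.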
Example of utilizing the heatmap on materials research
Example 1 Granta Data Set: Experimental Commercial Materials Property Dataset
> The training data set contains 398 commercial materials and 21 numerical properties
Example of utilizing the heatmap on materials research
Example 1 Granta Data Set: Experimental Commercial Materials Property Dataset (continued)
Projecting information: concept

> Overlay one specific data property onto the SOM; this works even for categorical values
> Makes patterns easy to identify
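One simple way to project a categorical property is to send every sample to its best matching node and label each node with the most frequent category among the samples it receives. The sketch below assumes that approach; `project_labels`, the tiny 2 x 2 map, and the "metal"/"polymer" labels are invented for illustration and are not taken from the augmented SOMPY code.

```python
import numpy as np
from collections import Counter, defaultdict

def project_labels(weights, samples, labels):
    """Map each node (row, col) to the most common label among its samples."""
    hits = defaultdict(list)
    for sample, label in zip(samples, labels):
        dist = np.linalg.norm(weights - sample, axis=2)
        node = np.unravel_index(np.argmin(dist), dist.shape)
        hits[tuple(int(v) for v in node)].append(label)
    return {node: Counter(labs).most_common(1)[0][0] for node, labs in hits.items()}

# Tiny 2 x 2 map whose four nodes sit at the corners of a square
corner_weights = np.array([[[0.0, 0.0], [0.0, 10.0]],
                           [[10.0, 0.0], [10.0, 10.0]]])
samples = np.array([[0.1, 0.2], [0.3, -0.1], [9.8, 10.1]])
projection = project_labels(corner_weights, samples, ["metal", "metal", "polymer"])
```

Nodes that receive no sample are simply absent from the result, which on a plot shows up as empty cells.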
Example of utilizing the projection function on materials research

Example 1 Granta Data Set: Experimental Commercial Materials Property Dataset (continued), finding what makes the outliers unique
Example of utilizing the projection function on materials research

Example 2 OPV materials study using an experimental dataset

Reference: Y. Huang, J. Phys. Chem. C 2020, 124, 12871−12882

> The dataset includes 1203 donor polymers of Donor-Acceptor pairs, with properties related to the proficiency of the charge transfer.
Molecular Descriptors

Python packages for molecular descriptors

> There are Python tools that extract molecular structural or geometrical information from a molecule's notation, such as SMILES (Simplified Molecular-Input Line-Entry System)
> We will introduce Mordred (covered in the Hands-on session)
The advantage of using MiniSOM

> SOMPY is not as easy to use as the other packages introduced in this class.
– The augmented SOMPY has contributions from a few Materials Science researchers in our group, including your TA Jimin, Qian
> MiniSOM is relatively easy to use, well documented, and actively maintained, and it provides a basic implementation of the SOM algorithm
What MiniSOM provides

> It has:
– The core implementation of SOM
– Visualization
– U-Matrix (“distance map” in MiniSOM)
– Projection of a given feature onto the SOM

> It doesn’t have:
– PCA initialization by default (it is available as a separate `pca_weights_init` call)
– Heatmap generation for each feature
– K-Means clustering
Hyperparameters of SOM

> Length of input vectors (the number of properties)
> Map size, the most important one
> Map topology – rectangular or hexagonal
– Important in defining the notion of “neighbors”
> Sigma – spread of the neighborhood function
> Learning rate – initial learning rate, which decreases with the number of iterations
> Decay function – defines how much the learning rate and sigma decrease with the number of iterations
> Neighborhood function – defines how strongly the neighbors of the BMU are affected at each iteration (e.g. Gaussian, bubble, …)
> Activation distance function (e.g. Euclidean distance)
> Initialization method – random or PCA
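To make the roles of sigma, the decay function, and the neighborhood function concrete, here is a small sketch on a rectangular grid. The linear decay and the square "bubble" window are assumptions chosen for simplicity; actual packages may define these functions differently, and the function names are made up for this example.

```python
import numpy as np

def linear_decay(initial, t, n_iter):
    """Assumed decay: the value falls linearly from `initial` towards 0."""
    return initial * (1.0 - t / n_iter)

def gaussian_neighborhood(rows, cols, bmu, sigma):
    """Update strength that falls off smoothly with grid distance from the BMU."""
    gy, gx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    d2 = (gy - bmu[0]) ** 2 + (gx - bmu[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def bubble_neighborhood(rows, cols, bmu, sigma):
    """Update strength 1 inside a square window around the BMU, 0 outside."""
    gy, gx = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    inside = (np.abs(gy - bmu[0]) <= sigma) & (np.abs(gx - bmu[1]) <= sigma)
    return inside.astype(float)

# Halfway through training, on a 5 x 5 map with the BMU at the center
lr = linear_decay(0.5, t=50, n_iter=100)
sigma = linear_decay(2.0, t=50, n_iter=100)
g = gaussian_neighborhood(5, 5, bmu=(2, 2), sigma=sigma)
b = bubble_neighborhood(5, 5, bmu=(2, 2), sigma=sigma)
```

With the Gaussian function every node is nudged a little, while the bubble function updates only the nodes inside the window and leaves the rest untouched.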
Hands-on session and HW for this week
