What is a visualisation of data? August 9, 2013, 2PM, 302-309
This lecture is
about how to visualise multivariate
data. Since the 1950s, many methods have been developed for displaying a
high-dimensional data set as a two-dimensional scatter-plot. Recently there has been a breakthrough in
this area: some new methods have been developed that give much better
displays of certain types of data than was possible before. In particular, I will briefly describe the
method of t-distributed stochastic neighbour
embedding (t-SNE) developed by L. van der Maaten
and G. Hinton (JMLR 2008), with some examples. However, algorithms such as
t-SNE are highly non-linear and they give distorted views of the data: these
distortions are unavoidable because reducing dimensionality necessarily
distorts geometrical relationships. Because of these distortions,
scatter-plot visualisations are hard to interpret:
which parts of the scatter-plot are 'correct', and which are wrong? In the
main part of the talk, I will demonstrate a 2D display that gives not only a
scatter-plot, but also distance relationships between some data elements: we
call this display a 'neighbour-plot' or 'proxigram'. This
very simple idea makes it much easier to interpret visualisations
produced by powerful non-linear algorithms such as t-SNE. Many extensions of this
idea are possible. The last part of the talk is about how to give a formal
definition of a visualisation, so that the purpose
of a visualisation can be precisely defined, and
the quality of the visualisation can (in principle)
be precisely specified. The development of complex visualisations
such as proxigrams makes it necessary to have a
better theory of how to specify and to design visualisations
than currently exists. I will briefly discuss possible uses of advanced visualisation methods in machine learning diagnostics.
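
To make the question of which parts of a scatter-plot are 'correct' concrete, here is a minimal sketch. It is not the proxigram construction from the talk, which the abstract does not specify. It assumes one possible per-point quality score: the fraction of a point's k nearest neighbours in the original space that are also among its k nearest neighbours in the 2D display. The function names (`knn`, `neighbour_preservation`), the random test data, and the crude stand-in embedding (keeping the first two coordinates, not t-SNE) are all illustrative assumptions.

```python
import math
import random


def knn(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result.append(set(others[:k]))
    return result


def neighbour_preservation(high, low, k):
    """Per-point fraction of high-dimensional neighbours kept in the 2D display.

    A score of 1.0 means all k original neighbours are still neighbours in
    the display; a score near 0.0 marks a region the display has distorted.
    """
    high_nn = knn(high, k)
    low_nn = knn(low, k)
    return [len(h & l) / k for h, l in zip(high_nn, low_nn)]


random.seed(0)
# Hypothetical data: 50 points in 5 dimensions.
high = [[random.gauss(0, 1) for _ in range(5)] for _ in range(50)]
# Stand-in "embedding": keep the first two coordinates (NOT t-SNE).
low = [p[:2] for p in high]

scores = neighbour_preservation(high, low, k=5)
# Sanity check: comparing the data with itself preserves every neighbour.
identity_scores = neighbour_preservation(high, high, k=5)
```

Points with low scores are those whose positions in the display should be trusted least. A display in the spirit of the abstract might draw the original neighbour relationships for such points on top of the scatter-plot, though how the proxigram actually does this is left to the talk.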