What is a visualisation of data?
August 9, 2013, 2PM
This lecture is about how to visualise multivariate data. Since the 1950s, many methods have been developed for displaying a high-dimensional data-set as a two-dimensional scatter-plot. Recently there has been a breakthrough in this area: new methods give much better displays of certain types of data than was previously possible. In particular, I will briefly describe t-distributed stochastic neighbour embedding (t-SNE), developed by L. van der Maaten and G. Hinton (JMLR 2008), with some examples.

However, algorithms such as t-SNE are highly non-linear and give distorted views of the data. These distortions are unavoidable, because reducing dimensionality necessarily distorts geometrical relationships, and they make scatter-plot visualisations hard to interpret: which parts of the scatter-plot are 'correct', and which are wrong? In the main part of the talk, I will demonstrate a 2D display that shows not only a scatter-plot but also distance relationships between some data elements; we call this display a 'neighbour-plot' or 'proxigram'. This very simple idea makes it much easier to interpret visualisations produced by powerful non-linear algorithms such as t-SNE, and many extensions of it are possible.

The last part of the talk is about how to give a formal definition of a visualisation, so that its purpose can be precisely defined and its quality can (in principle) be precisely specified. The development of complex visualisations such as proxigrams calls for a better theory of how to specify and design visualisations than currently exists. I will also briefly discuss possible uses of advanced visualisation methods in machine learning diagnostics.
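
As a rough illustration of the idea of showing distance relationships alongside a scatter-plot, the sketch below gives one plausible reading of it; it is not the construction presented in the talk. For each point it finds the nearest neighbour in the original high-dimensional space and records the line segment joining the two points in the 2D display. The function names, the random data, and the stand-in embedding (simply keeping the first two coordinates, not t-SNE) are all assumptions made for this example.

```python
import math
import random


def distance(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_neighbour(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: distance(points[i], points[j]),
    )


def neighbour_segments(high_dim, embedding):
    """Pair each point's 2D position with the 2D position of its nearest
    neighbour in the ORIGINAL high-dimensional space."""
    segments = []
    for i in range(len(high_dim)):
        j = nearest_neighbour(high_dim, i)
        segments.append((embedding[i], embedding[j]))
    return segments


def neighbour_agreement(high_dim, embedding):
    """Fraction of points whose high-dimensional nearest neighbour is also
    their nearest neighbour in the 2D embedding."""
    n = len(high_dim)
    hits = sum(
        nearest_neighbour(high_dim, i) == nearest_neighbour(embedding, i)
        for i in range(n)
    )
    return hits / n


random.seed(0)
# Hypothetical data: 30 points in 5 dimensions.
data = [[random.gauss(0, 1) for _ in range(5)] for _ in range(30)]
# Stand-in embedding (an assumption, NOT t-SNE): keep the first two coordinates.
embedding = [p[:2] for p in data]

segments = neighbour_segments(data, embedding)
agreement = neighbour_agreement(data, embedding)
print(f"{len(segments)} segments, nearest-neighbour agreement {agreement:.2f}")
```

Drawing these segments over the scatter-plot would make distortion visible: a long segment marks a pair of points that are close in the original data but placed far apart in the display. The agreement score is one simple, assumed example of a numerical quality measure for a visualisation.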