March 11, 2021 | Documents

Seeing is Believing:
Toward Interpretable Data Visualization

Motivation

The human eye has been advocated as the ultimate data mining tool. When we see something, we can easily detect patterns, spot anomalies, and verify similarities. That is why, when dealing with data, it is particularly important to reach the visualization stage as early as possible. Doing so can significantly speed up the identification of data problems, and it also helps to communicate and illustrate research findings.

Imagine that you want to visualize the market capitalization of these four companies: Apple (AAPL), Microsoft (MSFT), Tesla (TSLA), and Zoom (ZM). It is quite easy to visualize the value of a single attribute, as you can see below.

From this visualization, we can discern that Apple currently has the highest market capitalization, or MarketCap, of approximately 2 trillion dollars, followed by Microsoft, Tesla, and then Zoom.

Now, assume that we have two attributes for each company to visualize: MarketCap and PERatio (price-to-earnings ratio). This is still not a problem, as long as we stay in two dimensions.

If we have more attributes per company, we can no longer use a plain 2D visualization. What we could do instead is use an embedding technique that combines several attributes into one. As an example, one could find a linear relationship such that:

x = w1 × PERatio + w2 × BookValuePerShare + …, and
y = w10 × EarningPerShare + w11 × EBITDA + …

and then plot each company on the (x,y) coordinates of this 2D space. There are many approaches for discovering this embedding and producing a low-dimensional projection on two dimensions.
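As a minimal sketch of such a linear combination, the snippet below computes (x, y) coordinates for one company. The company name, the attribute values, and the weights are all hypothetical and chosen only for illustration; in practice an embedding technique would learn the weights from the data.

```python
# Hypothetical attribute values for a made-up company (illustration only).
companies = {
    "Company A": {
        "PERatio": 30.0,
        "BookValuePerShare": 4.0,
        "EarningPerShare": 5.0,
        "EBITDA": 100.0,
    },
}

# Hypothetical weights: x combines two attributes, y combines two others.
weights_x = {"PERatio": 0.5, "BookValuePerShare": 2.0}
weights_y = {"EarningPerShare": 1.0, "EBITDA": 0.25}


def project(attrs, weights_x, weights_y):
    """Map a dict of attributes to 2D coordinates via weighted sums."""
    x = sum(w * attrs[name] for name, w in weights_x.items())
    y = sum(w * attrs[name] for name, w in weights_y.items())
    return x, y


print(project(companies["Company A"], weights_x, weights_y))  # (23.0, 30.0)
```

The resulting point can be placed on a scatter plot, but the coordinates 23.0 and 30.0 no longer correspond to any single attribute of the company.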

To illustrate this, we now use 8 attributes per company and embed these 8 values in two dimensions that aim to capture all of the original attribute values. Two popular embedding techniques, t-SNE and Principal Component Analysis (PCA), result in the following visualizations:

Left: Visualization using t-SNE. Right: Visualization using PCA

Here, the x and y coordinates are a combination of all the attributes that describe a company. The actual x and y values have no readily discernible meaning for the viewer of the plot. This example also highlights one of the main problems of visualizing high-dimensional data in two dimensions: it is no longer easy to understand what the x and y coordinates mean.
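To make this concrete for PCA, here is a minimal sketch using NumPy. It assumes a synthetic, randomly generated matrix of 20 companies with 8 attributes each (not the data behind the figures above), and the helper name `pca_2d` is introduced here only for illustration. The two rows of `components` hold the weights with which all 8 attributes are mixed into x and y.

```python
import numpy as np


def pca_2d(data):
    """Project the rows of `data` onto their first two principal components."""
    # Center each attribute so that the components describe variance.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:2]
    coords = centered @ components.T
    explained = singular_values**2 / np.sum(singular_values**2)
    return coords, components, explained[:2]


# Synthetic data (assumption): 20 companies x 8 attributes.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 8))

coords, components, explained = pca_2d(data)

# Each row lists the 8 weights that define one embedded axis.
print(components.round(2))
# Fraction of the total variance captured by each of the two axes.
print(explained.round(2))
```

With real financial attributes, one would typically standardize each column first, since attributes such as EBITDA and PERatio live on very different scales.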

Addressing interpretability – MoDE

To alleviate this inherent problem of interpretability when visualizing high-dimensional data, we worked toward creating a more interpretable visualization technique called MoDE.

The intuition here is that in most datasets there exists some very important attribute, which we would like to preserve very accurately when projecting the data on two dimensions. The rest of the attributes we can preserve approximately. MoDE tries to preserve that single attribute with very high fidelity, almost perfectly. In our example, when talking about companies or the various stock indices, market capitalization can play this role, as a single number for ranking companies.

Below we compare the visualization of MoDE with three other prevalent visualization methods: ISOMAP, t-SNE, and MDS. Each data point on the 2D plot is a company. Originally, each data point was represented as a 1024-dimensional time series of the stock value across several years. We see an example of the original time-series data on the left part of the figure. In addition to this time series, we also keep another attribute for each company: its market capitalization, which is the important attribute that we would like to preserve accurately.

The color of each data point shows the market capitalization, and you can see that MoDE offers a very smooth transition of colors from light yellow (high market capitalization) to purple (low market capitalization). One can observe that the other techniques do not preserve that feature as well while also preserving the other relationships of the original data.

Advantages of MoDE

In addition to being very interpretable, MoDE has several other desirable properties:

  1. It can operate even in the presence of inexact/approximate measurements. A unique characteristic of MoDE is that it does not require exact distances between the visualized entities like most visualization techniques do. One can simply provide ranges of values (lower and upper bounds) between the entities. This means that MoDE can also effectively visualize compressed or uncertain data!
  2. Very fast data processing and visualization.
  3. Accurate preservation of correlations and distances.
  4. Anytime method: this means that at any point of the algorithm execution the visualization is readily available, and it progressively improves. This is a particularly important aspect when dealing with Big Data.
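To illustrate what distance ranges (property 1) might look like, here is a minimal sketch that derives lower and upper bounds on the Euclidean distance between two entities whose coordinates are known only as intervals, for example after lossy compression. The function name `distance_bounds` and the interval representation are assumptions made for this example; they are not taken from the MoDE source code.

```python
import math


def distance_bounds(a, b):
    """Return (lower, upper) bounds on the Euclidean distance between a and b.

    `a` and `b` are lists of (low, high) intervals, one per coordinate.
    """
    lower_sq = 0.0
    upper_sq = 0.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b):
        # Smallest possible gap: zero if the two intervals overlap.
        gap = max(0.0, a_lo - b_hi, b_lo - a_hi)
        # Largest possible gap: between the farthest interval endpoints.
        span = max(a_hi - b_lo, b_hi - a_lo)
        lower_sq += gap * gap
        upper_sq += span * span
    return math.sqrt(lower_sq), math.sqrt(upper_sq)


# Two uncertain 2D points (hypothetical values).
a = [(0.0, 1.0), (0.0, 0.0)]
b = [(4.0, 5.0), (4.0, 4.0)]
lower, upper = distance_bounds(a, b)
print(lower, upper)
```

Computing such a pair for every two entities gives a matrix of lower bounds and a matrix of upper bounds, which is the kind of range information the text describes as sufficient input.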

You may see an example of the anytime nature of MoDE in the Figure below.

Resources

  1. An Interpretable Data Embedding under Uncertain Distance Information, by N. Freris, M. Vlachos, A. Ajalloeian; published at the IEEE International Conference on Data Mining (ICDM), 2020.
  2. If you are interested in using MoDE in your project, you may find the source code at this URL.

Acknowledgments

This work has been partially supported by the E4S research grant “Toward Interpretable Machine Learning”.

Michalis Vlachos, HEC, University of Lausanne