Visual Computing in the Life Sciences: Basic Plotting Techniques
Basic Techniques
This post covers basic techniques for visualizing data: heat maps, box plots, histograms, KDE, and violin plots. The next post will discuss scatterplots.
Heat Maps
Bioinformaticians who work with RNA-seq, one of the hottest fields, love heat maps because they are easy to use and, for small datasets, the results can be understood intuitively. Heat maps are really simple: they are a direct visual representation of a matrix, with a color encoding the number in each cell. In RNA-seq, rows are typically different genes and columns are different times after treatment. The values represented by the colors are the log of the relative expression level. If you are unsure which colors to choose, take a look at my other post on colors.
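To make the "matrix in, colors out" idea concrete, here is a minimal sketch in Python. The gene names, time points, and expression values are made up for illustration, and taking the first time point as the reference for "relative" expression is my assumption, not a fixed convention. The plotting function assumes matplotlib is installed; the log-ratio computation itself needs only the standard library.

```python
import math

# Hypothetical data: rows are genes, columns are time points after treatment.
# All names and numbers here are invented for illustration.
genes = ["geneA", "geneB", "geneC"]
timepoints = ["0h", "2h", "8h"]
expression = [
    [10.0, 20.0, 80.0],
    [50.0, 50.0, 25.0],
    [8.0, 4.0, 1.0],
]


def log2_relative(rows):
    """Return log2(value / value at the first time point) for every cell."""
    return [[math.log2(v / row[0]) for v in row] for row in rows]


log_ratios = log2_relative(expression)


def plot_heatmap(matrix, row_labels, col_labels):
    """Draw the matrix as a heat map (requires matplotlib)."""
    import matplotlib.pyplot as plt

    # Symmetric color limits so that 0 (no change) sits at the middle color.
    limit = max(abs(v) for row in matrix for v in row)
    fig, ax = plt.subplots()
    im = ax.imshow(matrix, cmap="RdBu_r", vmin=-limit, vmax=limit, aspect="auto")
    ax.set_xticks(range(len(col_labels)))
    ax.set_xticklabels(col_labels)
    ax.set_yticks(range(len(row_labels)))
    ax.set_yticklabels(row_labels)
    fig.colorbar(im, ax=ax, label="log2 relative expression")
    plt.show()
```

Calling plot_heatmap(log_ratios, genes, timepoints) should draw one colored cell per gene and time point. A diverging colormap is a reasonable choice here because log ratios have a natural midpoint at zero.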
Box Plots
Box plots (box-and-whisker plots) summarize the distribution of a single attribute. The central band marks the median. The box spans the interquartile range (IQR), from the 1st to the 3rd quartile. The whiskers extend to the most extreme points that lie within 1.5 IQRs of the 1st and 3rd quartiles. The whole plot therefore spans at most 4.0 IQRs (1.5 + 1 + 1.5), and points outside this range are drawn individually as ‘outliers’. Placing box plots side by side lets us compare distributions.
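The numbers behind a box plot can be computed in a few lines. The sketch below uses Python's standard library only; the sample values are made up, and the "inclusive" quartile method is just one of several conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics


def box_plot_stats(values):
    """Compute the quantities a box plot draws, using the 1.5 * IQR rule."""
    data = sorted(values)
    # Quartile conventions differ between tools; "inclusive" is one choice.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme points that are still inside the fences.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical measurements with one suspiciously large value.
measurements = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(measurements)
print(stats)
```

Note that the whiskers stop at actual data points, so in practice they are often shorter than the full 1.5 IQRs.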
Histogram & KDE
A histogram is a bar chart of the number of values that fall into predefined ranges (‘bins’). Choosing the bins is a major problem because different numbers of bins can produce very different pictures of the same data. Kernel density estimation (KDE) was invented to overcome this problem and gives a smooth version of a histogram.
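The effect of the bin count is easy to see on a tiny example. This is a minimal sketch with made-up values and equal-width bins between the minimum and maximum (an assumption; real tools offer other binning strategies).

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so they all land in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical data with a gap between 3 and 7.
values = [1, 2, 2, 3, 3, 3, 7, 8, 8, 9]
for n_bins in (2, 4, 8):
    print(n_bins, histogram_counts(values, n_bins))
```

With 2 bins the data looks like one lopsided lump, while with 4 or 8 bins the empty region between the two groups becomes visible.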
KDE places a density function (the kernel) centered on every single data point, sums these kernels, and normalizes the result so that it integrates to 1. The most common kernel is the Gaussian (normal) distribution; a t-distribution can also be used. The mean of the kernel does not matter because each kernel is centered on its own data point, but we do need to choose its spread, which in KDE is called the bandwidth (h). A small h emphasizes individual points and gives a spiky curve; a large h smooths over individual points and makes the graph smooth. A common rule of thumb for a Gaussian kernel is h ≈ 1.06 · σ · n^(-1/5), where σ is the sample standard deviation and n is the number of samples.
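The KDE recipe can be written out directly. The sketch below assumes a Gaussian kernel and uses made-up sample values; it evaluates the estimate f(x) = (1 / (n · h)) · Σ K((x − x_i) / h) at a point in the gap between the samples to show how the bandwidth changes the picture.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x: one kernel per sample, summed and normalized."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


# Hypothetical samples: three points close together and one far away.
samples = [1.0, 2.0, 2.5, 6.0]

# x = 4.25 lies in the gap between the cluster and the lone point.
for h in (0.3, 2.0):
    print(f"h={h}: density at 4.25 = {kde(4.25, samples, h):.6f}")
```

With the small bandwidth the density in the gap is practically zero, because each kernel hugs its own point. With the large bandwidth the kernels overlap and the gap is smoothed over.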
Violin Plots
Violin plots combine box plots with a KDE. A box plot cannot show the shape of the distribution, but a violin plot can. A split violin, which shows a different group on each half, can be used for comparison. Faceting combines the basic tools, e.g. color + marker type + line style + line width … When you use faceting, be careful: the more encodings you combine, the harder the plot becomes to interpret.
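Returning to violin plots: the outline of a violin is just a KDE curve mirrored around a center line. The sketch below computes that outline by hand with a Gaussian kernel and a hand-picked bandwidth (both are my assumptions), using made-up groups named control and treated. The plotting helper assumes matplotlib is installed and relies on its violinplot function, which picks its own bandwidth.

```python
import math


def gaussian_kde(y, samples, h):
    """Gaussian kernel density estimate at y with bandwidth h."""
    total = sum(math.exp(-0.5 * ((y - s) / h) ** 2) for s in samples)
    return total / (len(samples) * h * math.sqrt(2.0 * math.pi))


def violin_outline(samples, h, n_points=50):
    """Return grid positions plus the left and right edges of the violin."""
    lo = min(samples) - 3 * h
    hi = max(samples) + 3 * h
    step = (hi - lo) / (n_points - 1)
    ys = [lo + i * step for i in range(n_points)]
    right = [gaussian_kde(y, samples, h) for y in ys]
    left = [-w for w in right]  # mirror the density around the center line
    return ys, left, right


def plot_violins(groups, labels):
    """Draw one violin per group side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.violinplot(groups, showmedians=True)
    # By default the violins are placed at positions 1, 2, ..., N.
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    plt.show()


# Hypothetical measurements for two groups.
control = [4.1, 4.8, 5.0, 5.2, 5.5, 6.0]
treated = [5.9, 6.4, 7.0, 7.1, 7.8, 9.5]
ys, left, right = violin_outline(control, h=0.5)
```

Calling plot_violins([control, treated], ["control", "treated"]) should draw the two groups next to each other for comparison.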
This post was published on 9/1/2020.