ML: Basic Statistics, Mean, Median, and Mode

Jeheonpark
Sep 7, 2020


This post is the starting point of a series on machine learning, and its first chapter covers basic statistics. The mean, median, and mode are the most basic summaries of a dataset. This post focuses on these statistics and their applications.

Mean

Example data of expression of gene p53 in tumor and normal tissues
Mean equation and Variance

The mean is the best-known statistic. In our case, the means of the tumor and normal tissues are 6.070 and 1.257, respectively. Intuitively, you can tell the two clusters apart by their means, but we need a statistical test if we want to confirm that they really differ. A statistic summarizes a whole dataset with a single number, so if the data follow a known distribution, a few statistics are enough to describe them approximately. The mean is usually reported together with the variance because the mean and the variance are the two parameters of the normal distribution. In our case, the standard deviations (the square root of the variance) are 1.319 and 0.486 for the tumor and normal tissues, respectively.
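As a minimal sketch of the calculation, the snippet below computes the mean, the sample variance, and the standard deviation by hand. The numbers in `values` are made up for illustration and are not the p53 data from this post, and the choice of the n - 1 denominator (sample variance) is an assumption.

```python
import math

# Hypothetical measurements, NOT the p53 data from this post.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(values)
mean = sum(values) / n

# Sample variance: summed squared distance from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in values) / (n - 1)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean={mean:.3f} variance={variance:.3f} std={std_dev:.3f}")
```

The standard library's `statistics.mean`, `statistics.variance`, and `statistics.stdev` should give the same values for this list.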

Median

Median Absolute Deviation

The median is the data point at the center: it is larger than 50% of the data and smaller than the other 50%. The median is more robust than the mean because the mean is severely affected by outliers, while the median is not. In our case, the medians are 5.423 and 1.045 for the tumor and normal tissues, respectively. The median also has its own counterpart to the variance. The Median Absolute Deviation (MAD) is the median of the absolute distances between each data point and the median, and it measures how spread out the data is, like the variance and standard deviation. However, it is more robust than both of them.
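A small sketch of the median and the MAD, using the standard library `statistics.median`. The list `values` is a made-up example with one extreme value, and no scaling constant is applied to the MAD here (some libraries multiply it by about 1.4826 to make it comparable to a standard deviation).

```python
import statistics

# Hypothetical data with one extreme value, NOT the data from this post.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

# The median is the middle value of the sorted data.
median = statistics.median(values)

# MAD: the median of the absolute distances from the median.
abs_deviations = [abs(x - median) for x in values]
mad = statistics.median(abs_deviations)

print(f"median={median} MAD={mad}")
```

Even though one value is 100, both the median and the MAD stay close to the bulk of the data.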

Let’s look at the effect of an outlier.

The data with an outlier

The mean changes from 4.992 to 6.237, while the median changes from 5.423 to 5.831. You can see that the median is more robust. If you calculate the variance and the MAD, you will see the same pattern: the MAD changes less than the variance.
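The same comparison can be sketched on made-up numbers. The helper `summarize` and both lists are hypothetical and only illustrate the pattern. They do not reproduce the figures quoted above.

```python
import statistics

def summarize(data):
    """Return (mean, median, standard deviation, MAD) for a list of numbers."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    return statistics.mean(data), med, statistics.stdev(data), mad

# Hypothetical data: the second list is the first plus one outlier.
clean = [3.0, 4.0, 5.0, 6.0, 7.0]
with_outlier = clean + [50.0]

mean_a, med_a, std_a, mad_a = summarize(clean)
mean_b, med_b, std_b, mad_b = summarize(with_outlier)

# How far each statistic moves when the outlier is added.
print(f"mean shift:   {abs(mean_b - mean_a):.3f}")
print(f"median shift: {abs(med_b - med_a):.3f}")
print(f"std shift:    {abs(std_b - std_a):.3f}")
print(f"MAD shift:    {abs(mad_b - mad_a):.3f}")
```

In this toy case the mean and the standard deviation move by several units, while the median and the MAD each move by half a unit.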

Outliers

Outlier, I want to be an outlier in the data science field.

An outlier is a value that does not fit the rest of the data, i.e. it does not meet our expectations. Outliers can come from measurement errors or from rare events that really happened. Outliers can spoil our estimates because we compute statistics from a sample, and if the sample contains an outlier, the statistics are pulled toward it. If we want to avoid the effects of outliers, we can use robust statistics or remove the outliers.

Quantiles

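The median is the 50% quantile of the data. A quantile is the data point that is larger than a given percentage of the dataset. For example, the lower quartile is the 25% quantile and the upper quartile is the 75% quantile. The difference between the upper and lower quartiles is called the interquartile range (IQR).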
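A short sketch of quartiles and the IQR using `statistics.quantiles` from the standard library (Python 3.8+). The data is made up, and `method="inclusive"` is one of several common interpolation conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical data, NOT the data from this post.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four parts and returns the three cut points:
# lower quartile (25%), median (50%), upper quartile (75%).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The interquartile range is the distance between the upper and lower quartiles.
iqr = q3 - q1

print(f"Q1={q1} median={q2} Q3={q3} IQR={iqr}")
```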

Mode

KDE example for each statistic

The mode of a set of data is the most frequent value. In a histogram, the mode sits at the highest bar because that is where the observed frequency is at its maximum. A dataset can have more than one mode. With two modes, the histogram shows two peaks and the distribution is called bimodal.
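A minimal sketch with the standard library: `statistics.mode` returns a single most frequent value, and `statistics.multimode` (Python 3.8+) returns all of them. Both lists are made-up discrete values. For continuous measurements you would usually bin the data or use a density estimate first, which is not shown here.

```python
import statistics

# Hypothetical discrete data with a single most frequent value.
unimodal = [1, 2, 2, 2, 3]
single_mode = statistics.mode(unimodal)

# Hypothetical data where two values are tied for most frequent.
bimodal = [1, 2, 2, 3, 3, 4]
all_modes = statistics.multimode(bimodal)

print(f"mode of unimodal data: {single_mode}")
print(f"modes of bimodal data: {all_modes}")
```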

Correlation

Example of Pearson Correlation
Pearson’s Correlation

There are many correlation measures. The most popular one is Pearson’s correlation. It measures the linear dependency between x and y. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation). If the correlation is 0, there is no linear relationship between x and y. This does not mean they are independent, because they can still be related in a nonlinear way.
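The sketch below writes Pearson's correlation out by hand so the formula is visible. The function name `pearson` and the data are assumptions for illustration. The second dataset is a caution about reading too much into a coefficient of 0: there y is fully determined by x, yet the coefficient is 0 because the relationship is not linear.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: co-variation of x and y divided by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_var = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_var / math.sqrt(var_x * var_y)

# Hypothetical data.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * x + 1 for x in xs]   # exact linear relationship
parabola = [x ** 2 for x in xs]    # exact, but nonlinear and symmetric

r_linear = pearson(xs, linear)
r_parabola = pearson(xs, parabola)

print(f"linear:   r = {r_linear:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```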

The difference between correlation types

Spearman’s rank and Kendall’s rank correlations are alternatives to Pearson’s correlation when the relationship in the data is not linear. Spearman’s rank correlation converts the raw measurements into ranks and then computes Pearson’s correlation between the ranks. Because of this, it can capture any monotonic relationship between the data points, not only a linear one.
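The rank-then-Pearson procedure can be sketched as follows. The helpers `ranks`, `pearson`, and `spearman` are hypothetical names, the data is made up, and the ranking assumes there are no tied values (ties would need averaged ranks, which is not handled here).

```python
import math

def pearson(xs, ys):
    """Pearson correlation written out by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_var = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_var / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank from 1 (smallest) to n (largest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: y grows with x, but not linearly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 3 for x in xs]

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)

print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Since y always increases with x, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's coefficient stays below 1 because the relationship is curved.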

Covariance

Covariance

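Covariance measures how two variables vary together: it is positive when they tend to be above or below their means at the same time and negative when they move in opposite directions. It is closely related to Pearson’s correlation, which is the covariance divided by the product of the two standard deviations.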
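A small sketch of the sample covariance and its link to Pearson's correlation. The function name `covariance` and the data are assumptions for illustration, and the n - 1 denominator matches `statistics.stdev`.

```python
import statistics

def covariance(xs, ys):
    """Sample covariance: average product of deviations, with n - 1 in the denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

cov = covariance(xs, ys)

# Pearson correlation: covariance divided by both standard deviations.
corr = cov / (statistics.stdev(xs) * statistics.stdev(ys))

# Changing the units of x (here, multiplying by 10) scales the covariance
# but leaves the correlation unchanged.
xs_scaled = [10 * x for x in xs]
cov_scaled = covariance(xs_scaled, ys)
corr_scaled = cov_scaled / (statistics.stdev(xs_scaled) * statistics.stdev(ys))

print(f"cov={cov:.3f} corr={corr:.3f}")
print(f"after scaling x: cov={cov_scaled:.3f} corr={corr_scaled:.3f}")
```

This is why correlation is easier to compare across datasets: it is unit-free, while covariance depends on the scale of the measurements.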

This post was published on 9/7/2020.


Jeheon Park, Software Engineer at Kakao in South Korea