ML: Basic Statistics, Mean, Median, and Mode
This post is the starting point of a series of Machine Learning posts, and its first chapter covers basic statistics. Mean, median, and mode are the basic summary statistics of a dataset. This post focuses on these statistics and their applications.
Mean
The mean is the most well-known statistic. In our case, the means of the tumor and normal tissues are 6.070 and 1.257, respectively. You can intuitively distinguish the two clusters by their means, but we need to apply other tests if we want to confirm that they are different. A statistic summarizes a whole dataset with a specific number. Therefore, if the data follows a known distribution, a few statistics are enough to describe it approximately. The mean is frequently reported together with the variance because the mean and the variance are the two parameters of the normal distribution. In our case, the standard deviations (the square root of the variance) are 1.319 and 0.486 for the tumor and normal tissues, respectively.
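As a small illustration of the mean, variance, and standard deviation, here is a minimal sketch using Python's standard `statistics` module. The `values` list is a hypothetical toy sample, not the tumor or normal tissue measurements discussed in this post.

```python
import statistics

# Hypothetical toy values, NOT the tumor/normal measurements from this post.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(values)          # arithmetic mean
variance = statistics.pvariance(values)  # population variance (divides by n)
std_dev = statistics.pstdev(values)      # square root of the population variance

print(mean, variance, std_dev)
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n, which is the usual choice when the data is a sample from a larger population.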
Median
The median is the data point at the center: it is larger than 50% of the data and smaller than the other 50%. The median is more robust than the mean because the mean is severely affected by outliers, while the median is not. In our case, the medians are 5.423 and 1.045 for the tumor and normal tissues, respectively. There is also a median-based counterpart to the variance. The Median Absolute Deviation (MAD) measures how much the data is spread, like the variance and standard deviation. It is the median of the absolute differences between each data point and the median. However, it is more robust than the variance and standard deviation.
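The median and the Median Absolute Deviation can be sketched in a few lines of Python. The helper name `median_absolute_deviation` and the toy `values` are assumptions made for illustration, and this version is the plain (unscaled) MAD.

```python
import statistics

def median_absolute_deviation(values):
    """Median of the absolute differences from the median (unscaled MAD)."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

# Hypothetical toy values with one extreme point at the end.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

median = statistics.median(values)
mad = median_absolute_deviation(values)

print(median, mad)
```

Even though one value is far away from the others, both the median and the MAD stay close to the bulk of the data.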
Let’s look at the effects of outliers.
The mean changes from 4.992 to 6.237, while the median changes from 5.423 to 5.831. You can observe that the median is more robust. If you compare the variance and the MAD in the same way, you will see the same pattern: the MAD changes less.
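The same kind of comparison can be reproduced on a small made-up sample. The sketch below assumes a hypothetical `clean` list and a single added outlier of 30.0; it is not the data behind the numbers above, only an illustration of the effect.

```python
import statistics

def mad(values):
    """Unscaled Median Absolute Deviation."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

# Hypothetical toy sample, then the same sample with one outlier added.
clean = [4.0, 5.0, 5.5, 6.0, 6.5]
with_outlier = clean + [30.0]

# How far each statistic moves when the outlier is added.
mean_shift = abs(statistics.fmean(with_outlier) - statistics.fmean(clean))
median_shift = abs(statistics.median(with_outlier) - statistics.median(clean))
std_shift = abs(statistics.pstdev(with_outlier) - statistics.pstdev(clean))
mad_shift = abs(mad(with_outlier) - mad(clean))

print("mean shift:", mean_shift, "median shift:", median_shift)
print("std shift:", std_shift, "MAD shift:", mad_shift)
```

In this toy case the mean and the standard deviation move much more than the median and the MAD do.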
Outliers
An outlier is a value that does not fit the rest of the data, i.e., does not meet our expectations. Outliers can be generated by measurement errors or by rare events that really happened. Outliers can spoil our estimates because statistics are estimates extracted from sample datasets, and those samples may contain outliers. If we want to avoid the effects of outliers, we can use robust statistics or eliminate the outliers.
Quantiles
The median is the 50% quantile of the data. A quantile is the data point that is larger than a specific percentage of the dataset. For example, the lower and upper quartiles are the 25% and 75% quantiles. The difference between the upper and lower quartiles is called the interquartile range (IQR).
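Quartiles and the IQR can be computed with `statistics.quantiles` from the Python standard library. This is a minimal sketch on a hypothetical toy list; the `method="inclusive"` choice is an assumption, and other quantile conventions can give slightly different cut points.

```python
import statistics

# Hypothetical toy values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```

The middle cut point `q2` is the median of the data.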
Mode
The mode of a set of data is the most frequent value. In a histogram, the mode corresponds to the highest bar because it is the maximum of the observed frequencies. There can be more than one mode; when there are two, the distribution is called bimodal.
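For discrete values, the mode or modes can be found with `statistics.multimode`, which returns every value that reaches the highest frequency. The `values` list below is a hypothetical toy sample chosen to have two modes.

```python
import statistics

# Hypothetical toy values in which 2 and 3 both appear twice.
values = [1, 2, 2, 3, 3, 4]

modes = statistics.multimode(values)  # all values tied for the highest count

print(modes)
```

For continuous measurements, exact repeats are rare, so the mode is usually read from a histogram or a density estimate rather than from raw counts.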
Correlation
There are many correlation measures. The most popular one is Pearson’s correlation. It measures the linear dependency between x and y. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation). If the correlation is 0, there is no linear relationship between x and y; however, they can still be dependent in a non-linear way.
Spearman’s rank and Kendall’s rank correlations are alternatives to Pearson’s correlation when the relationship in the data is not linear. Spearman’s rank correlation converts the raw measurements into ranks and then computes the Pearson correlation between the ranks. It can therefore catch monotonic relationships between the data points, even when they are not linear.
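The rank-then-Pearson procedure for Spearman's correlation can be sketched as follows. The helper names `pearson`, `ranks`, and `spearman` are hypothetical, the toy data is made up, and the ranking helper assumes there are no tied values, which is a simplification.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (dev_x * dev_y)

def ranks(values):
    """Ranks from 1 to n. Assumes no tied values (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical toy data: y grows with x, but exponentially rather than linearly.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [math.exp(x) for x in xs]

print("Pearson: ", pearson(xs, ys))
print("Spearman:", spearman(xs, ys))
```

Because y always increases when x increases, the ranks agree perfectly and the Spearman value is 1, while the Pearson value is lower because the relationship is not a straight line.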
Covariance
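Covariance measures how two variables vary together: it is positive when they tend to increase together and negative when one tends to decrease as the other increases. It is related to Pearson’s correlation, which is the covariance divided by the product of the two standard deviations.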
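The link between covariance and Pearson's correlation is that the correlation equals the covariance divided by the product of the two standard deviations. Here is a minimal sketch of that relationship; the helper name `covariance` and the toy data are assumptions for illustration, and the population (divide by n) convention is used throughout.

```python
import statistics

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

cov = covariance(xs, ys)
# Dividing by both standard deviations rescales the covariance into [-1, 1].
r = cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

# Changing the units of x changes the covariance but not the correlation.
xs_scaled = [10.0 * x for x in xs]
cov_scaled = covariance(xs_scaled, ys)
r_scaled = cov_scaled / (statistics.pstdev(xs_scaled) * statistics.pstdev(ys))

print(cov, r)
print(cov_scaled, r_scaled)
```

The covariance depends on the units of the data, which is why the scaled correlation is usually easier to interpret.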
This post was published on 9/7/2020.