Calculating basic statistics formulas in Python
Data science needs a few basic measures to describe data well. Here are some formulas, implemented in Python, that make your life easy.
Table of contents:
- Creating the data first
- Finding the mean of the data
- Calculating the median
- Quantile
- Interquartile Range (IQR)
- Maximum distance
- Variance
- Standard deviation
- Covariance
- Correlation
- Appendix
Creating the data first
Let’s first create the data on which we will test our statistics functions.
from random import randint
a = [randint(-100, 100) for _ in range(100)]
a[0:10], len(a), sorted(a)[0:10]
Output:
([72, -66, -84, 74, 68, 58, 34, 32, 82, -36],
100,
[-100, -95, -93, -92, -92, -85, -84, -84, -82, -79])
As you may note, we created 100 random numbers from -100 to 100 (duplicates are possible).
Finding the mean of the data
def mean(a):
    return sum(a) / len(a)
In Python it is easy to calculate the mean: you sum all the elements and divide the sum by the number of elements.
Checking the mean of our list:
mean(a)
Output:
4.31
Calculating the median
def median(a):
    a = sorted(a)
    n = len(a)
    m = n // 2
    if n % 2 == 1:
        return a[m]
    else:
        return (a[m-1] + a[m]) / 2
Note that we sort the list first. This is not strictly necessary; there are selection algorithms that avoid sorting (see the sketch below). However, this way is easy to interpret.
The median is the middle element of the sorted list if the number of elements is odd. If it is even, the median is the average of the two middle elements.
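As a minimal sketch of one such approach (quickselect, with average O(n) running time; this helper is my own and not part of the formulas above):
import random

def quickselect(a, k):
    # return the k-th smallest element (0-based) without fully sorting
    pivot = random.choice(a)
    lows = [e for e in a if e < pivot]
    highs = [e for e in a if e > pivot]
    pivots = [e for e in a if e == pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))
For odd n, quickselect(a, len(a)//2) gives the median directly.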
Checking the median value:
median(a)
Output:
7.0
The mean is more sensitive to outliers than the median. This is why we sometimes use the median instead, or first remove the outliers and then calculate the mean.
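To see this on a small hypothetical list, one outlier shifts the mean dramatically but barely moves the median:
data = [1, 2, 3, 4, 5]
outliered = data + [1000]
mean(data), median(data)            # (3.0, 3)
mean(outliered), median(outliered)  # (~169.17, 3.5)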
Quantile
A generalization of the median is the quantile. The median is the value below which 50% of the data lies; the p-quantile is the value below which a fraction p of the data lies.
Formula:
def quantile(a, p):
    # p is a fraction between 0 and 1
    n = len(a)
    i = int(p * n)  # index into the sorted list
    return sorted(a)[i]
Example:
from random import randint
import pandas as pd
a = [randint(-100, 100) for _ in range(100)]
df = pd.DataFrame({'a': a})
quantile(a, 0.25), quantile(a, 0.5), quantile(a, 0.75)
Output:
(-55, -9, 54)
Note:
np.percentile(a, 25, interpolation='higher')
would be the same as quantile(a, 0.25) (with import numpy as np). By default np.percentile uses linear interpolation; in recent NumPy versions the keyword is method rather than interpolation.
Quantiles help describe the distribution of the data. The 0.5 quantile describes the middle of the data, also known as the median. The 0.25 and 0.75 quantiles are the lower and upper quartiles, and together they give you an idea of how skewed your data is.
As a rule of thumb, if the mean is lower than the median the data is left-skewed; if it is higher, the data is right-skewed.
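That rule of thumb is easy to turn into code; a small sketch (the helper name is my own):
def skew_hint(a):
    # rough heuristic: compare mean and median
    m, md = mean(a), median(a)
    if m < md:
        return 'left-skewed (mean < median)'
    if m > md:
        return 'right-skewed (mean > median)'
    return 'roughly symmetric'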
Interquartile Range (IQR)
Particularly useful quantiles are the quartiles of the distribution.
The quartiles divide the distribution into four equal parts.
The IQR describes the middle 50% of the values when they are ordered from lowest to highest.
To calculate the interquartile range we subtract the 0.25 quantile from the 0.75 quantile.
def interquartile_range(a):
    return quantile(a, 0.75) - quantile(a, 0.25)
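Checking on our data (the quantile outputs above were -55 for 0.25 and 54 for 0.75):
interquartile_range(a)
Output:
109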
Maximum distance
This measure is also known as the range, and it is really easy to compute: just subtract the minimum element of the list from the maximum.
def disp(a):
    return max(a) - min(a)
disp(a)
Output:
198
Variance
We calculate the variance by first calculating the mean m, then building the list of all deviations from the mean, and finally summing the squares of those deviations.
def variance(a):
    n = len(a)
    m = sum(a) / n
    # deviations from the mean
    d = [e - m for e in a]
    v = 0
    for e in d:
        v += e ** 2
    return v / n
variance(a)
Output:
3411.7139
At the end we divide the sum of squares by the number of elements n.
It is also possible to divide by n-1 instead. This is called Bessel’s correction.
def varianceb(a):
    n = len(a)
    m = sum(a) / n
    # deviations from the mean
    d = [e - m for e in a]
    v = 0
    for e in d:
        v += e ** 2
    return v / (n - 1)  # Bessel's correction
varianceb(a)
Output:
3446.1756565656565
Note that the variance is larger with Bessel’s correction: dividing by n-1 instead of n scales the result by n/(n-1), and indeed 3411.7139 * 100/99 = 3446.1756...
The Python statistics package uses Bessel’s correction to calculate variance and stdev.
import statistics
statistics.variance(a)
statistics.stdev(a)
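The statistics module also provides population versions that divide by n, matching our variance and std below:
statistics.pvariance(a)  # divides by n, matches variance(a)
statistics.pstdev(a)     # divides by n, matches std(a)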
Standard deviation
def std(a):
    n = len(a)
    m = sum(a) / n
    # deviations from the mean
    d = [e - m for e in a]
    v = 0
    for e in d:
        v += e ** 2
    return (v / n) ** 0.5
std(a)
Output:
58.409878445345186
The standard deviation is simply the square root of the variance.
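Since we already have variance, a one-line version does the same job (the name std_short is my own):
def std_short(a):
    return variance(a) ** 0.5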
With Bessel’s correction:
def stdb(a):  # Bessel's correction
    n = len(a)
    m = sum(a) / n
    # deviations from the mean
    d = [e - m for e in a]
    v = 0
    for e in d:
        v += e ** 2
    return (v / (n - 1)) ** 0.5
stdb(a)
Output:
58.704136622265864
Covariance
To calculate the covariance we need a second array, b:
b = [randint(-100, 100) for _ in range(100)]
def covariance(a, b):
    la = len(a)
    lb = len(b)
    assert la == lb, 'Non-equal dimensions'
    ma = sum(a) / la
    mb = sum(b) / lb
    # deviations from the means
    da = [e - ma for e in a]
    db = [e - mb for e in b]
    v = 0
    for i in range(la):
        v += da[i] * db[i]
    return v / (la - 1)  # Bessel's correction
covariance(a,b)
Here we again used a random array, b, with exactly 100 elements, the same length as a.
Output:
315.42797979797973
The known problem with covariance is that its magnitude is hard to interpret, since it depends on the scale of the inputs.
This is why we need correlation: the same measure as covariance, but normalized to the range [-1, 1].
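A quick hypothetical illustration of the scale problem: multiplying one input by 10 multiplies the covariance by 10, even though the underlying relationship has not changed:
b10 = [10 * e for e in b]
covariance(a, b10)  # ten times covariance(a, b)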
Correlation
Note that we need both the standard deviation function and the covariance function.
def correlation(a, b):
    std_a = stdb(a)
    std_b = stdb(b)
    if std_a > 0 and std_b > 0:
        return covariance(a, b) / std_a / std_b
    else:
        return 0
correlation(a, b)
And note the guard at the end: we must not divide by zero, so if either standard deviation is 0 we return 0.
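As a sanity check, NumPy’s corrcoef (assuming NumPy is installed) should agree with our function up to floating-point error:
import numpy as np
np.corrcoef(a, b)[0, 1]  # matches correlation(a, b)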
Appendix
What can we say about correlation?
Example 1:
a = [1, 3, 5, 7, 9]
b = [2, 4, 6, 8, 10]
correlation(a,b)
Output:
0.9999999999999999
We can say two variables are linearly dependent if the correlation is close to 1 or -1.
Example 2:
a = [-2, -1, 0, 1, 2]
b = [2, 1, 0, -1, -2]
correlation(a,b)
Output:
-0.9999999999999999
If the correlation is close to 0, we cannot see any linear dependence.
Note that there may be other kinds of dependence (like the quadratic one below), but correlation cannot detect them.
Example 3:
a = [-2, -1, 0, 1, 2]
b = [4, 1, 0, 1, 4]
correlation(a,b)
Output:
0.0
Absolute deviation is used less often
Standard deviation is more common, but that doesn’t mean absolute deviation isn’t my favorite measure of spread.
You would implement that like this:
def abs_dev(a):
    n = len(a)
    m = sum(a) / n
    # deviations from the mean
    d = [e - m for e in a]
    v = 0
    for e in d:
        v += abs(e)
    return v / n
abs_dev(a)
Output:
1.2
It is also cheaper to compute than the standard deviation, since there are no squares or square roots.
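A quick comparison on the appendix data (a = [-2, -1, 0, 1, 2]):
abs_dev(a), std(a)
Output:
(1.2, 1.4142135623730951)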
Pandas statistic methods
df.describe()  # summary statistics
df.mean()      # mean
df.median()    # median
df.max()       # maximum
df.min()       # minimum
df.count()     # count of all elements
df.std()       # standard deviation (with Bessel's correction)
df.mad()       # mean absolute deviation (removed in pandas 2.0)
df.var()       # variance (with Bessel's correction)
df.cov()       # covariance
df.corr()      # correlation
…
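Covariance and correlation need at least two columns; a quick sketch, assuming the 100-element arrays a and b from the covariance section:
df = pd.DataFrame({'a': a, 'b': b})
df.cov()   # 2x2 covariance matrix
df.corr()  # 2x2 Pearson correlation matrix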