I was working through a Dataquest exercise and noticed that the variance I get differs between the two packages.
For example, for [1, 2, 3, 4]:
from statistics import variance
import numpy as np

print(np.var([1, 2, 3, 4]))    # 1.25
print(variance([1, 2, 3, 4]))  # 1.6666666666666667
The expected answer in the exercise is calculated with np.var().
I guess this is because the latter is the sample variance rather than the population variance. Could anyone explain the difference?
Delta degrees of freedom: the divisor used in the calculation is N - ddof, where N represents the number of elements. By default, ddof is zero.

The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.

In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of a hypothetical infinite population, while ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables.
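The two functions agree once the divisor is made explicit. A minimal sketch, using only numpy and the standard statistics module:

from statistics import variance, pvariance
import numpy as np

data = [1, 2, 3, 4]

print(np.var(data))          # 1.25                (population variance, divisor N)
print(variance(data))        # 1.6666666666666667  (sample variance, divisor N - 1)

# Making the divisor explicit reconciles the two:
print(np.var(data, ddof=1))  # 1.6666666666666667  (matches statistics.variance)
print(pvariance(data))       # 1.25                (matches numpy's default)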
Statistical libraries like numpy divide by n (i.e. ddof=0) for what they call var or variance, and compute the standard deviation as its square root.
For more information, refer to the documentation: numpy doc
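As a sanity check, the two divisors can be applied by hand (a short sketch; the variable names are my own):

data = [1, 2, 3, 4]
n = len(data)
mean = sum(data) / n                      # 2.5
ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations = 5.0

print(ss / n)        # 1.25               (divisor n, numpy's default)
print(ss / (n - 1))  # 1.6666666666666667 (divisor n - 1, statistics.variance)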
It is correct that dividing by N-1 gives an unbiased estimate of the variance, which can give the impression that dividing by N-1 is therefore slightly more accurate, albeit a little more complex. What is too often not stated is that dividing by N gives the minimum-variance estimate of the variance, which is likely to be closer to the true variance than the unbiased estimate, as well as being somewhat simpler.
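Both points can be checked with a rough simulation sketch, assuming normally distributed data with known variance; the sample size and replication count here are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                                  # variance of N(0, 2**2)
samples = rng.normal(0, 2, size=(100_000, 10))  # 100,000 samples of size N = 10

var_unbiased = samples.var(axis=1, ddof=1)  # divisor N - 1
var_mle = samples.var(axis=1, ddof=0)       # divisor N

print(var_unbiased.mean())  # ~4.0: unbiased on average
print(var_mle.mean())       # ~3.6: biased low by a factor of (N - 1) / N

print(((var_unbiased - true_var) ** 2).mean())  # larger mean squared error
print(((var_mle - true_var) ** 2).mean())       # smaller mean squared error

On a run like this, the N divisor's mean squared error comes out noticeably below the N-1 divisor's, even though only the latter is centered on the true value.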