SciPy Statistics
All statistical functions are located in the scipy.stats subpackage. A fairly complete list of these functions, together with the available random variables, can be obtained from the subpackage’s docstring, for example by calling help(stats) after from scipy import stats. The module contains a large number of probability distributions and a growing library of statistical functions.
Each univariate distribution is an instance of a subclass of one of the base classes shown in the table below.
No. | Class | Description |
---|---|---|
1 | rv_continuous | A general continuous random variable class, suitable for subclassing. |
2 | rv_discrete | A general discrete random variable class, suitable for subclassing. |
3 | rv_histogram | Generates a distribution given by a histogram. |
Normal Continuous Random Variable
The normal distribution is a continuous probability distribution in which the random variable X can take any real value. The loc keyword specifies the mean, and the scale keyword specifies the standard deviation.
As an instance of the rv_continuous class, the norm object inherits a number of general methods from it and refines them with the specific details of this particular distribution.
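As a minimal sketch of how loc and scale act on norm, the example below freezes a normal distribution and queries a few of its inherited methods. The values 10 and 2 are assumptions chosen only for illustration.

```python
from scipy.stats import norm

# Hypothetical parameters for illustration: mean 10, standard deviation 2.
dist = norm(loc=10, scale=2)  # a "frozen" distribution with fixed loc and scale

mean_value = dist.mean()      # the mean equals loc
std_value = dist.std()        # the standard deviation equals scale
cdf_at_mean = dist.cdf(10)    # half of the probability mass lies below the mean

print(mean_value, std_value, cdf_at_mean)
```

Freezing the distribution avoids repeating loc and scale in every call. The same keywords can also be passed directly, as in norm.cdf(10, loc=10, scale=2).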
To compute the CDF of a number of points, we can pass it a list or a NumPy array. Let’s consider the following example.
from scipy.stats import norm
import numpy as np
print(norm.cdf(np.array([1, -1., 0, 1, 3, 4, -2, 6])))
The above program will produce the following output.
array([ 0.84134475, 0.15865525, 0.5 , 0.84134475, 0.9986501 ,
0.99996833, 0.02275013, 1. ])
To find the median of a distribution, we can use the percent point function (PPF), which is the inverse of the CDF: the median is the point at which the CDF equals 0.5. Let’s understand this with the following example.
from scipy.stats import norm
print(norm.ppf(0.5))
The above program will produce the following output.
0.0
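The inverse relationship between the PPF and the CDF can be checked with a short round trip. This is only a sketch, and the probabilities used are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm

# Arbitrary probabilities chosen for illustration.
probabilities = np.array([0.025, 0.5, 0.975])

quantiles = norm.ppf(probabilities)  # quantiles of the standard normal
round_trip = norm.cdf(quantiles)     # applying the CDF recovers the probabilities

print(quantiles)
print(np.allclose(round_trip, probabilities))
```

The quantile for 0.975 is approximately 1.96, the familiar critical value for a two-sided 95% interval.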
To generate a sequence of random variates, we use the rvs method with the size keyword argument, as shown in the following example.
from scipy.stats import norm
print(norm.rvs(size = 5))
The above program will produce output similar to the following.
array([ 0.20929928, -1.91049255, 0.41264672, -0.7135557 , -0.03833048])
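The above output is not reproducible, because a different sample is drawn on every run. To generate the same random numbers each time, pass a seed through the random_state keyword argument of rvs, or seed NumPy’s global generator with np.random.seed.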
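One way to make the sample repeatable is sketched below. The seed value 42 is an arbitrary assumption; any integer works.

```python
import numpy as np
from scipy.stats import norm

# The same integer seed yields the same sample on every call.
first = norm.rvs(size=5, random_state=42)
second = norm.rvs(size=5, random_state=42)
print(np.array_equal(first, second))

# A NumPy Generator can also be passed; it advances between calls,
# so consecutive draws from the same generator differ.
rng = np.random.default_rng(42)
third = norm.rvs(size=5, random_state=rng)
fourth = norm.rvs(size=5, random_state=rng)
print(np.array_equal(third, fourth))
```

Passing random_state explicitly keeps the seeding local to the call, which tends to be easier to reason about than changing global state with np.random.seed.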
Uniform Distribution
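A uniform distribution is available through the uniform object. With the loc and scale keywords, the distribution is uniform on the interval [loc, loc + scale]; in the following example that interval is [1, 5].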
from scipy.stats import uniform
print(uniform.cdf([0, 1, 2, 3, 4, 5], loc = 1, scale = 4))
The above program will produce the following output.
array([ 0. , 0. , 0.25, 0.5 , 0.75, 1. ])
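The values above follow from the closed form of the uniform CDF, (x - loc) / scale clipped to the range [0, 1]. The sketch below compares that formula with the library result for the same inputs.

```python
import numpy as np
from scipy.stats import uniform

x = np.array([0, 1, 2, 3, 4, 5])
loc, scale = 1, 4  # uniform on [loc, loc + scale] = [1, 5]

library_cdf = uniform.cdf(x, loc=loc, scale=scale)
# Closed form: linear inside the interval, 0 below it, 1 above it.
manual_cdf = np.clip((x - loc) / scale, 0.0, 1.0)

print(library_cdf)
print(np.allclose(library_cdf, manual_cdf))
```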
Building a Discrete Distribution
Let’s generate a random sample and compare the observed frequencies with the probabilities.
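A minimal sketch of this comparison, assuming a small made-up distribution: the support xk and the probabilities pk below are hypothetical values chosen for illustration, and the distribution is built with the values argument of rv_discrete.

```python
import numpy as np
from scipy import stats

# Hypothetical support and probabilities (the probabilities sum to 1).
xk = np.arange(7)
pk = np.array([0.1, 0.2, 0.3, 0.1, 0.1, 0.0, 0.2])
custom = stats.rv_discrete(name="custom", values=(xk, pk))

# Draw a seeded sample and compute the observed relative frequencies.
sample = custom.rvs(size=10000, random_state=0)
observed = np.bincount(sample.astype(int), minlength=len(xk)) / sample.size

for value, prob, freq in zip(xk, pk, observed):
    print(value, prob, round(freq, 3))
```

With 10,000 draws the observed frequencies should be close to pk, and the value 5, which has probability 0, should never appear.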
Binomial Distribution
As an instance of the rv_discrete class, the binom object inherits a set of general methods from it and complements them with details specific to this distribution. The shape parameters n and p give the number of trials and the probability of success in each trial. Let’s consider the following example.
from scipy.stats import binom
# CDF of the number of successes in n = 5 trials with success probability p = 0.5
print(binom.cdf([0, 1, 2, 3, 4, 5], n = 5, p = 0.5))
The above program will produce the following output.
[0.03125 0.1875  0.5     0.8125  0.96875 1.     ]
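For the binomial distribution itself, a short sketch of the probability mass function and the first two moments is given below. The parameters n = 5 and p = 0.5 (five fair coin flips) are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import binom

n, p = 5, 0.5        # hypothetical parameters: 5 fair coin flips
k = np.arange(n + 1) # possible numbers of successes: 0..5

pmf_values = binom.pmf(k, n, p)  # P(X = k) for each k
mean_value = binom.mean(n, p)    # n * p
var_value = binom.var(n, p)      # n * p * (1 - p)

print(pmf_values)
print(mean_value, var_value)
```

The probabilities are the binomial coefficients 1, 5, 10, 10, 5, 1 divided by 32, so they sum to 1.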
Descriptive Statistics
Basic statistics such as the minimum, maximum, mean, and variance take NumPy arrays as input and return the corresponding results. The following table describes some basic statistical functions in the scipy.stats package.
No. | Function | Description |
---|---|---|
1 | describe() | Computes several descriptive statistics for the passed array. |
2 | gmean() | Computes the geometric mean along the specified axis. |
3 | hmean() | Computes the harmonic mean along the specified axis. |
4 | kurtosis() | Computes the kurtosis of a data set. |
5 | mode() | Returns the modal (most common) value. |
6 | skew() | Computes the skewness of a data set. |
7 | f_oneway() | Performs a one-way ANOVA. |
8 | iqr() | Computes the interquartile range of the data along the specified axis. |
9 | zscore() | Computes the z-score of each value in a sample relative to the sample mean and standard deviation. |
10 | sem() | Computes the standard error of the mean (or standard error of measurement) of the values in the input array. |
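A few of the functions in the table can be tried out as in the sketch below. The input arrays are small made-up examples.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# describe() returns a named tuple with several statistics at once.
summary = stats.describe(x)
print(summary.nobs, summary.minmax, summary.mean, summary.variance)

# Geometric and harmonic means of a small made-up sample.
data = np.array([1.0, 2.0, 4.0])
geometric = stats.gmean(data)  # cube root of 1 * 2 * 4
harmonic = stats.hmean(data)   # 3 / (1/1 + 1/2 + 1/4)
print(geometric, harmonic)
```

Note that describe() reports the sample variance with ddof=1 by default, so for this x it gives 7.5, whereas x.var() uses ddof=0.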
Several of these functions have analogs in scipy.stats.mstats that work with masked arrays. For the simplest statistics, NumPy array methods are sufficient, as the following example shows.
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
# x.var() is the population variance (ddof = 0)
print(x.max(), x.min(), x.mean(), x.var())
The above program will produce the following output.
9 1 5.0 6.666666666666667
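The masked-array analogs in scipy.stats.mstats can be sketched as follows. The sentinel value -999 that marks a missing observation is an assumption made for this example.

```python
import numpy as np
from scipy import stats

# -999.0 is a hypothetical sentinel marking a missing observation.
raw = np.array([1.0, 2.0, 3.0, 4.0, -999.0, 6.0])
masked = np.ma.masked_values(raw, -999.0)  # hide the sentinel behind a mask
valid = raw[raw != -999.0]                 # the same data with the sentinel dropped

# The mstats version ignores masked entries.
masked_sem = float(stats.mstats.sem(masked))
plain_sem = stats.sem(valid)

print(masked_sem, plain_sem)
```

Both calls give the same standard error, because the masked entry is excluded from the count as well as from the standard deviation.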
T-test
Let’s see how the T-test works in SciPy.
ttest_1samp
This function calculates the T-test for the mean of one group of scores. It is a two-sided test of the null hypothesis that the expected value (mean) of a sample of independent observations ‘a’ is equal to the given population mean, popmean. Let’s consider the following example, which tests each of the two columns separately.
from scipy import stats
rvs = stats.norm.rvs(loc = 5, scale = 10, size = (50,2))
print(stats.ttest_1samp(rvs, 5.0))
The above program will produce output similar to the following; the exact values vary between runs because the sample is random.
Ttest_1sampResult(statistic = array([-1.40184894, 2.70158009]),
pvalue = array([ 0.16726344, 0.00945234]))
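To show what the test computes, the sketch below derives the statistic by hand as t = (sample mean - popmean) / (s / sqrt(n)), with s the sample standard deviation, and compares it with ttest_1samp. The seed and the one-dimensional sample are assumptions made so that the example is repeatable.

```python
import numpy as np
from scipy import stats

# Seeded sample so the example is repeatable (the seed is arbitrary).
sample = stats.norm.rvs(loc=5, scale=10, size=50, random_state=1)
popmean = 5.0
n = sample.size

# t statistic from the textbook formula, using the sample standard deviation.
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from Student's t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(sample, popmean)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```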
Comparing Two Samples
In the following example, there are two samples that can come from the same or different distributions, and we want to test whether these samples have the same statistical properties.
ttest_ind: Calculates the T-test for the means of two independent samples of scores. This is a two-sided test of the null hypothesis that the two independent samples have identical mean (expected) values. By default (equal_var = True), the test assumes that the populations have identical variances.
We can use this test if we observe two independent samples from the same or different populations. Let’s consider the following example.
from scipy import stats
rvs1 = stats.norm.rvs(loc = 5, scale = 10, size = 500)
rvs2 = stats.norm.rvs(loc = 5, scale = 10, size = 500)
print(stats.ttest_ind(rvs1, rvs2))
The above program will produce output similar to the following; the exact values vary between runs because the samples are random.
Ttest_indResult(statistic = -0.67406312233650278, pvalue = 0.50042727502272966)
You can test this with a new array of the same length, but with a different mean. Use a different value in loc and run the same test.
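A sketch of that variation is shown below. The shifted mean of 10 and the seeds are assumptions chosen for illustration.

```python
from scipy import stats

# Two seeded samples with the same scale but different means (hypothetical values).
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=0)
rvs3 = stats.norm.rvs(loc=10, scale=10, size=500, random_state=1)

# Standard test, assuming equal population variances.
pooled = stats.ttest_ind(rvs1, rvs3)
# Welch's variant, which does not assume equal variances.
welch = stats.ttest_ind(rvs1, rvs3, equal_var=False)

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

With means this far apart relative to the sample size, both p-values should be small, so the null hypothesis of equal means is rejected.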