Python function to get the t-statistic

ChrisProsser picture ChrisProsser · Oct 12, 2013 · Viewed 50.5k times · Source

I am looking for a Python function (or to write my own if there is not one) to get the t-statistic in order to use in a confidence interval calculation.

I have found tables that give answers for various probabilities / degrees of freedom like this one, but I would like to be able to calculate this for any given probability. For anyone not already familiar with this degrees of freedom is the number of data points (n) in your sample -1 and the numbers for the column headings at the top are probabilities (p) e.g. a 2 tailed significance level of 0.05 is used if you are looking up the t-score to use in the calculation for 95% confidence that if you repeated n tests the result would fall within the mean +/- the confidence interval.

I have looked into using various functions within scipy.stats, but none that I can see seem to allow for the simple inputs I described above.

Excel has a simple implementation of this e.g. to get the t-score for a sample of 1000, where I need to be 95% confident I would use: =TINV(0.05,999) and get the score ~1.96

Here is the code that I have used to implement confidence intervals so far, as you can see I am using a very crude way of getting the t-score at present (just allowing a few values for perc_conf and warning that it is not accurate for samples < 1000):

# -*- coding: utf-8 -*-
from __future__ import division
import math

def mean(lst):
    # μ = 1/N Σ(xi)
    return sum(lst) / float(len(lst))

def variance(lst):
    """
    Uses standard variance formula (sum of each (data point - mean) squared)
    all divided by number of data points
    """
    # σ² = 1/N Σ((xi-μ)²)
    mu = mean(lst)
    return 1.0/len(lst) * sum([(i-mu)**2 for i in lst])

def conf_int(lst, perc_conf=95):
    """
    Confidence interval - given a list of values compute the square root of
    the variance of the list (v) divided by the number of entries (n)
    multiplied by a constant factor of (c). This means that I can
    be confident of a result +/- this amount from the mean.
    The constant factor can be looked up from a table, for 95% confidence
    on a reasonable size sample (>=500) 1.96 is used.
    """
    if perc_conf == 95:
        c = 1.96
    elif perc_conf == 90:
        c = 1.64
    elif perc_conf == 99:
        c = 2.58
    else:
        c = 1.96
        print 'Only 90, 95 or 99 % are allowed for, using default 95%'
    n, v = len(lst), variance(lst)
    if n < 1000:
        print 'WARNING: constant factor may not be accurate for n < ~1000'
    return math.sqrt(v/n) * c

Here is an example call for the above code:

# Example: 1000 coin tosses on a fair coin. What is the range that I can be 95%
#          confident the result will f all within.

# list of 1000 perfectly distributed...
perc_conf_req = 95
n, p = 1000, 0.5 # sample_size, probability of heads for each coin
l = [0 for i in range(int(n*(1-p)))] + [1 for j in range(int(n*p))]
exp_heads = mean(l) * len(l)
c_int = conf_int(l, perc_conf_req)

print 'I can be '+str(perc_conf_req)+'% confident that the result of '+str(n)+ \
      ' coin flips will be within +/- '+str(round(c_int*100,2))+'% of '+\
      str(int(exp_heads))
x = round(n*c_int,0)
print 'i.e. between '+str(int(exp_heads-x))+' and '+str(int(exp_heads+x))+\
      ' heads (assuming a probability of '+str(p)+' for each flip).' 

The output for this is:

I can be 95% confident that the result of 1000 coin flips will be within +/- 3.1% of 500 i.e. between 469 and 531 heads (assuming a probability of 0.5 for each flip).

I also looked into calculating the t-distribution for a range and then returning the t-score that got the probability closest to that required, but I had issues implementing the formula. Let me know if this is relevant and you want to see the code, but I have assumed not as there is probably an easier way.

Thanks in advance.

Answer

henderso picture henderso · Oct 12, 2013

Have you tried scipy?

You will need to installl the scipy library...more about installing it here: http://www.scipy.org/install.html

Once installed, you can replicate the Excel functionality like such:

from scipy import stats
#Studnt, n=999, p<0.05, 2-tail
#equivalent to Excel TINV(0.05,999)
print stats.t.ppf(1-0.025, 999)

#Studnt, n=999, p<0.05%, Single tail
#equivalent to Excel TINV(2*0.05,999)
print stats.t.ppf(1-0.05, 999)

You can also read about installing the library here: how to install scipy for python?