First of all, I'm a newbie who has just started learning Python, and I'm trying to write some code to calculate the Gini index for a fake country. I've come up with the following:
GDP = (653200000000)
A = (0.49 * GDP) / 100 # Poorest 10%
B = (0.59 * GDP) / 100
C = (0.69 * GDP) / 100
D = (0.79 * GDP) / 100
E = (1.89 * GDP) / 100
F = (2.55 * GDP) / 100
G = (5.0 * GDP) / 100
H = (10.0 * GDP) / 100
I = (18.0 * GDP) / 100
J = (60.0 * GDP) / 100 # Richest 10%
# Divide into quintiles and total income within each quintile
Q1 = float(A + B) # lowest quintile
Q2 = float(C + D) # second quintile
Q3 = float(E + F) # third quintile
Q4 = float(G + H) # fourth quintile
Q5 = float(I + J) # fifth quintile
# Calculate the percent of total income in each quintile
T1 = float((100 * Q1) / GDP) / 100
T2 = float((100 * Q2) / GDP) / 100
T3 = float((100 * Q3) / GDP) / 100
T4 = float((100 * Q4) / GDP) / 100
T5 = float((100 * Q5) / GDP) / 100
TR = float(T1 + T2 + T3 + T4 + T5)
# Calculate the cumulative percentage of household income
H1 = float(T1)
H2 = float(T1+T2)
H3 = float(T1+T2+T3)
H4 = float(T1+T2+T3+T4)
H5 = float(T1+T2+T3+T4+T5)
# Magic! Using numpy to calculate area under Lorenz curve.
# Problem might be here?
import numpy as np
from numpy import trapz
# The y values. Cumulative percentage of incomes
y = np.array([Q1,Q2,Q3,Q4,Q5])
# Compute the area using the composite trapezoidal rule.
area_lorenz = trapz(y, dx=5)
# Calculate the area below the perfect equality line.
area_perfect = (Q5 * H5) / 2
# Seems to work fine until here.
# Manually calculated Gini using the values given for the areas above
# turns out at .58 which seems reasonable?
Gini = area_perfect - area_lorenz
# Prints utter nonsense.
print Gini
The result of Gini = area_perfect - area_lorenz just makes no sense. I took the values given by the area variables and did the math by hand, and it came out fairly OK, but when I try to get the program to do it, it gives me a completely nonsensical value (-1.7198...). What am I missing? Can someone point me in the right direction?
Thanks!
The first issue is that the equation for the Gini coefficient is not applied correctly:
gini = (area between Lorenz curve and perfect equality) / (area under perfect equality)
The denominator was not included in the calculation, and the area under the line of equality is computed with an incorrect equation (see the code below for a method using np.linspace and np.trapz).
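To see where the -1.7198... in the question comes from, the arithmetic can be traced by hand. The sketch below is a plain-Python re-computation, not the original script: `trapz_uniform` is a hypothetical stand-in for the `trapz(y, dx=5)` call, and the quintile shares are derived from the question's decile percentages.

```python
GDP = 653200000000

# Quintile shares of GDP implied by the question's decile percentages
quintile_shares = [0.0108, 0.0148, 0.0444, 0.15, 0.78]

# The question passes absolute, non-cumulative incomes Q1..Q5 to trapz
y = [share * GDP for share in quintile_shares]


def trapz_uniform(values, dx):
    """Composite trapezoidal rule with uniform spacing (stand-in for trapz)."""
    return dx * (sum(values) - (values[0] + values[-1]) / 2.0)


area_lorenz = trapz_uniform(y, dx=5)   # same call shape as in the question
area_perfect = (y[-1] * 1.0) / 2       # Q5 * H5 / 2, where H5 works out to 1.0

gini_as_written = area_perfect - area_lorenz
print(gini_as_written)                 # on the order of -1.72e12
```

The output is a difference of two areas measured in currency units, built from incomes that are not cumulative and never divided by the area under the line of equality, so it has no way of landing between 0 and 1.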
The first point of the Lorenz curve is also missing: the curve needs to start at 0, not at the first quintile's share. Although the area under the Lorenz curve between 0 and the first quintile is small, adding it noticeably changes the ratio to the area under the line of equality, which has to be extended to the origin as well.
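The effect of the missing origin can be checked with a few lines of plain Python. This is only a sketch: `trapezoid_area` and `gini_from_lorenz` are names made up for illustration, and the x values are assumed to be evenly spaced on [0, 1], so that leaving out the origin places the first quintile's share at x = 0.

```python
def trapezoid_area(x, y):
    """Composite trapezoidal rule for the points (x[k], y[k])."""
    return sum((x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2.0
               for k in range(1, len(x)))


def gini_from_lorenz(lorenz):
    """Gini from Lorenz ordinates, using evenly spaced x values on [0, 1]."""
    n = len(lorenz) - 1
    x = [k / float(n) for k in range(n + 1)]   # doubles as the equality line
    area_pe = trapezoid_area(x, x)
    area_lorenz = trapezoid_area(x, lorenz)
    return (area_pe - area_lorenz) / area_pe


# Cumulative quintile shares from the question's decile percentages
lorenz_no_origin = [0.0108, 0.0256, 0.07, 0.22, 1.0]
lorenz_with_origin = [0.0] + lorenz_no_origin

gini_no_origin = gini_from_lorenz(lorenz_no_origin)
gini_with_origin = gini_from_lorenz(lorenz_with_origin)
print(round(gini_no_origin, 2), round(gini_with_origin, 2))
```

With these inputs the coefficient comes out at about 0.59 without the origin and about 0.67 with it.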
The following provides an equivalent answer to the methods given in the answers to this question:
import numpy as np
GDP = 653200000000 # this value isn't actually needed
# Decile percents of global GDP
gdp_decile_percents = [0.49, 0.59, 0.69, 0.79, 1.89, 2.55, 5.0, 10.0, 18.0, 60.0]
# Compare with a tolerance: an exact == test on a float sum can fail from rounding
print('Percents sum to 100:', np.isclose(sum(gdp_decile_percents), 100))
gdp_decile_shares = [i/100 for i in gdp_decile_percents]
# Convert to quintile shares of total GDP
gdp_quintile_shares = [(gdp_decile_shares[i] + gdp_decile_shares[i+1]) for i in range(0, len(gdp_decile_shares), 2)]
# Insert 0 for the first value in the Lorenz curve
gdp_quintile_shares.insert(0, 0)
# Cumulative sum of shares (Lorenz curve values)
shares_cumsum = np.cumsum(a=gdp_quintile_shares, axis=None)
# Perfect equality line
pe_line = np.linspace(start=0.0, stop=1.0, num=len(shares_cumsum))
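# Integrate over the actual x values. The spacing between points is
# 1/(len(shares_cumsum) - 1), not 1/len(shares_cumsum); the ratio below is the same either way.
area_under_lorenz = np.trapz(y=shares_cumsum, x=pe_line)
area_under_pe = np.trapz(y=pe_line, x=pe_line)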
gini = (area_under_pe - area_under_lorenz) / area_under_pe
print('Gini coefficient:', gini)
The areas calculated with np.trapz give a coefficient of 0.67. The value calculated with trapz but without the first point of the Lorenz curve was 0.59. Our calculation of global inequality is now roughly equal to that provided by the methods in the question linked above (you do not need to add 0 to the lists/arrays in those methods). Note that using scipy.integrate.simps gives 0.69, meaning the methods in the other question agree more closely with trapezoidal integration than with Simpson integration.
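As a cross-check on the 0.67 figure that does not depend on np.trapz or on the choice of dx: for n equal-sized groups, the trapezoidal rule reduces to G = 1 - (1/n) * sum(Y[k-1] + Y[k]), where Y is the cumulative share starting at 0. A minimal sketch, with `gini_equal_groups` being a name made up for illustration:

```python
def gini_equal_groups(shares):
    """Trapezoidal-rule Gini for equal-sized groups, ordered poorest to richest.

    `shares` are the non-cumulative income shares and should sum to 1.
    """
    n = len(shares)
    cumulative = [0.0]                      # the Lorenz curve starts at 0
    for share in shares:
        cumulative.append(cumulative[-1] + share)
    pair_sums = sum(cumulative[k - 1] + cumulative[k] for k in range(1, n + 1))
    return 1.0 - pair_sums / n


quintile_shares = [0.0108, 0.0148, 0.0444, 0.15, 0.78]
decile_shares = [0.0049, 0.0059, 0.0069, 0.0079, 0.0189,
                 0.0255, 0.05, 0.10, 0.18, 0.60]

gini_quintiles = gini_equal_groups(quintile_shares)
gini_deciles = gini_equal_groups(decile_shares)
print(round(gini_quintiles, 4), round(gini_deciles, 4))
```

The quintile value matches the 0.67 above. Feeding in the ten deciles directly gives roughly 0.72, since collapsing to quintiles hides the inequality within each pair of deciles.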
Here's the plot, which uses plt.fill_between to color the area under the Lorenz curve:
from matplotlib import pyplot as plt
plt.plot(pe_line, shares_cumsum, label='lorenz_curve')
plt.plot(pe_line, pe_line, label='perfect_equality')
plt.fill_between(pe_line, shares_cumsum)
plt.title('Gini: {}'.format(gini), fontsize=20)
plt.ylabel('Cumulative Share of Global GDP', fontsize=15)
plt.xlabel('Income Quintiles (Lowest to Highest)', fontsize=15)
plt.legend()
plt.tight_layout()
plt.show()