Here is a question that might be worth some thought. Suppose one has a random sample of, say, 10,000 observations on the incomes of individuals and wants to record just two summary statistics for each of, say, 20 groups formed from these data. Which summary statistics should be recorded?

Let’s focus on two very common practices: (i) inspired by the histogram, the percentage of people and the income boundary for each group, denoted “ci” and “zi”, are recorded, i.e. the data consist of {ci, zi} for i = 1, …, 20; (ii) inspired by the Lorenz curve, “ci” and the group average incomes “mi” are recorded, i.e. the data consist of {ci, mi} for i = 1, …, 20. Which practice should be preferred? In one of the earlier posts I argued that the Lorenz curve is more intuitive, but what about statistical efficiency? I don’t know the answer in general (there may not be one), but here are a couple of points:
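
To make the two recording schemes concrete, here is a minimal sketch. The log-normal sample, the seed, and the equally sized groups are all illustrative assumptions, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample standing in for the 10,000 incomes (log-normal for illustration).
income = rng.lognormal(mean=10.0, sigma=0.8, size=10_000)

k = 20
c = np.arange(1, k + 1) / k              # cumulative population shares c_i
groups = np.array_split(np.sort(income), k)  # 20 equally sized groups of 500

# (i) histogram-style recording: {c_i, z_i}, z_i = upper income boundary of group i
z = np.array([g[-1] for g in groups])

# (ii) Lorenz-style recording: {c_i, m_i}, m_i = mean income within group i
m = np.array([g.mean() for g in groups])
```

With equally sized groups, the average of the group means recovers the overall sample mean exactly, while the boundaries z_i are sample quantiles at the shares c_i.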

- It is not clear how to answer this question in general, because estimation and inference at an arbitrary point, with this amount of information and under a nonparametric framework, are not well defined. The problem is basically how the observed points should be connected.
- Linear interpolation of Lorenz curves does not seem to be the best thing to do. In this paper, the authors show how to derive a Lorenz curve from a histogram (it turns out to be piecewise quadratic) and argue that the Lorenz curve obtained in this way should be preferred to a linearly interpolated one, since it lies within Mehran’s bounds while the linearly interpolated Lorenz curve is itself one of those bounds.
- There are other ways to interpolate Lorenz curves, e.g. using splines [Nick has a paper on this], and there are ways to estimate them (one of which we discuss below). The above result does not necessarily apply to these other methods.
- I think I have seen somewhere that the histogram is the maximum-entropy distribution if the data are (ci, zi). I am not, however, aware of a similar result for Lorenz curves.
- If I make a reasonable parametric assumption for an income distribution (e.g. a GB2 with reasonable parameter values), generate data from it, create equally sized groups, record the data once according to (i) and once according to (ii), and estimate the distribution in each case, then the estimated variances of the parameters turn out to be substantially smaller (less than half) for case (ii). So, despite the above remarks, it might not be incorrect to claim that for a typical income distribution, case (ii), i.e. the Lorenz type of data recording, is statistically more efficient.

I think this post can be expanded into a short paper. If you are interested in further development of this or have ideas on how to flexibly model a Lorenz curve please let me know.

Posted by Sriram on September 28, 2015 at 2:50 pm

– What do you mean by flexible modeling? In the literature, people have used the GB2, log-normal, and Singh-Maddala distributions to estimate Lorenz curves. How would you find out which distribution is flexible? Will you be using some information-theoretic measure(s) to find out which parametric distribution is the most flexible among a set of such distributions?

– Given that in the case of grouped data one may have a very small number of groups (10 or 20 in general), will estimating the parameters using GMM or NLS give you appropriate standard errors? Has anyone used Bayesian methods to estimate Lorenz curves? It is usually mentioned that for small samples a Bayesian approach would be better.

Posted by Reza on September 29, 2015 at 1:47 am

What I mean by flexible is a functional form whose flexibility you can increase by adding more terms (something like a series or splines) if needed. In other words, something robust that can approximate a whole range of different shapes (even the GB2 is restricted).
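
One way to read “adding more terms” is a series-type Lorenz form. The particular parameterization below, L(p) = p − p(1−p)·poly(p), is an illustrative choice and not from the post; it fixes L(0) = 0 and L(1) = 1 by construction, and each added coefficient adds flexibility:

```python
import numpy as np

def lorenz_series(p, b):
    # Series-type form: L(p) = p - p*(1-p)*(b[0] + b[1]*p + ...).
    # L(0) = 0 and L(1) = 1 hold for any coefficients; more terms, more shapes.
    return p - p * (1 - p) * np.polyval(b[::-1], p)

def fit_lorenz(c, L, degree):
    # Least-squares fit of the series coefficients to observed ordinates (c_i, L_i).
    X = (c * (1 - c))[:, None] * np.vander(c, degree + 1, increasing=True)
    b, *_ = np.linalg.lstsq(X, c - L, rcond=None)
    return b

# The Lorenz curve of a uniform distribution on [0, 1] is L(p) = p**2,
# which the degree-0 member of this family reproduces exactly (b[0] = 1).
c = np.linspace(0.05, 0.95, 19)
b = fit_lorenz(c, c**2, 0)
```

Note that nothing in this sketch enforces monotonicity or convexity of the fitted curve, which a serious implementation would need to constrain.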

One of the points of using a Lorenz curve is that you don’t want to specify a density function, and therefore a Bayesian approach is not straightforward. Even if you have the density, it is not straightforward to specify a likelihood function for grouped data, and that’s why we use GMM.
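
As a toy illustration of the GMM idea: match observed Lorenz ordinates to a model’s ordinates. This is a minimal sketch assuming a log-normal model and an identity weighting matrix (so it reduces to nonlinear least squares); the actual moment conditions and weighting in the approach referred to above may well differ:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def lorenz_lognormal(p, sigma):
    # Closed-form Lorenz curve of a log-normal: L(p) = Phi(Phi^{-1}(p) - sigma).
    return norm.cdf(norm.ppf(p) - sigma)

def gmm_sigma(c, L_obs):
    # Moment conditions: observed Lorenz ordinates minus model ordinates,
    # with an identity weighting matrix.
    obj = lambda s: np.sum((L_obs - lorenz_lognormal(c, s[0])) ** 2)
    return minimize(obj, x0=[1.0], bounds=[(1e-3, 5.0)]).x[0]

c = np.linspace(0.05, 0.95, 19)          # interior ordinates (drop c = 1)
sigma_hat = gmm_sigma(c, lorenz_lognormal(c, 0.8))
```

With noise-free ordinates the minimizer recovers the generating parameter; with grouped-data ordinates the same objective gives the GMM-style point estimate.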