Here is a question that may be worth some thought. Suppose one has a random sample of, say, 10,000 observations on individual incomes and wants to record just two summary statistics for each of, say, 20 groups formed from these data. Which summary statistics should be recorded?

Let’s focus on two very common practices. (i) Inspired by the histogram, record for each group the percentage of people “ci” and the group boundary “zi”, so the data consist of {ci, zi} for i = 1, …, 20. (ii) Inspired by the Lorenz curve, record “ci” and the group average income “mi”, so the data consist of {ci, mi} for i = 1, …, 20. Which practice should be preferred? In one of the earlier posts I argued that the Lorenz curve is more intuitive, but what about statistical efficiency? I don’t know the answer in general (there may not be one), but here are a few points:

- It is not clear how to answer this question in general, because with this little information and a nonparametric framework it is unclear how to carry out estimation and inference at an arbitrary point of the distribution. The problem is basically how the observed points should be connected.
- Linear interpolation for Lorenz curves does not seem to be the best thing to do. In this paper, the authors show how to derive a Lorenz curve from a histogram (it turns out to be piecewise quadratic) and argue that the Lorenz curve obtained in this way should be preferred to a linearly interpolated Lorenz curve, since it lies within Mehran’s bounds whereas the linearly interpolated Lorenz curve is itself one of the bounds.
- There are other ways to interpolate Lorenz curves e.g. using splines [Nick has a paper on this] and there are ways to estimate them (one of them we discuss below). The above result does not necessarily apply to these other methods.
- I think I have seen somewhere that the histogram is the maximum entropy distribution when the data are (ci, zi). I am not, however, aware of a similar result for Lorenz curves.
- If I make a reasonable parametric assumption for the income distribution (e.g. a GB2 with reasonable parameter values), generate data from it, create equally sized groups, record the data once according to (i) and once according to (ii), and estimate the distribution in each case, then the estimated variances of the parameters turn out to be substantially smaller (less than half) in case (ii). So, despite the above remarks, it might not be incorrect to claim that for a typical income distribution, case (ii), i.e. the Lorenz type of data recording, is statistically more efficient.
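
To make the two recording schemes concrete, here is a minimal Python sketch (it is not the code behind the experiment in the last point, and it does not estimate any parameters). It draws a sample from a GB2 using the ratio-of-gammas representation, with parameter values that are arbitrary assumptions, splits it into 20 equally sized groups, and records ci, zi and mi. It then builds the linearly interpolated Lorenz curve from {ci, mi} and the piecewise quadratic Lorenz curve implied by a histogram with uniform density inside each group. All function names are hypothetical, and the lower boundary z0 of the first group, which scheme (i) does not record, is assumed here to be the sample minimum.

```python
import numpy as np


def simulate_gb2(n, a, b, p, q, rng):
    """Draw n incomes from GB2(a, b, p, q) via a ratio of gamma variates."""
    g1 = rng.gamma(shape=p, size=n)
    g2 = rng.gamma(shape=q, size=n)
    # g1 / g2 is beta-prime(p, q); the power 1/a and the scale b give a GB2 draw.
    return b * (g1 / g2) ** (1.0 / a)


def group_summaries(income, n_groups):
    """Return population shares c, upper boundaries z and group means m."""
    x = np.sort(income)
    groups = np.array_split(x, n_groups)
    c = np.array([g.size / x.size for g in groups])
    z = np.array([g[-1] for g in groups])
    m = np.array([g.mean() for g in groups])
    return c, z, m


def lorenz_points(c, m):
    """Lorenz curve knots (cumulative population, cumulative income share)."""
    pop = np.concatenate([[0.0], np.cumsum(c)])
    inc = np.concatenate([[0.0], np.cumsum(c * m)])
    return pop, inc / inc[-1]


def gini_linear(pop, lor):
    """Gini coefficient of the linearly interpolated Lorenz curve."""
    area = np.sum((pop[1:] - pop[:-1]) * (lor[1:] + lor[:-1]) / 2.0)
    return 1.0 - 2.0 * area


def lorenz_from_histogram(c, z, z0, p):
    """Lorenz curve at p implied by uniform density within each group.

    z0 is an assumed lower boundary for the first group.
    """
    edges = np.concatenate([[z0], z])
    pop = np.concatenate([[0.0], np.cumsum(c)])
    # Under uniformity the implied group mean is the bin midpoint.
    mass = np.concatenate([[0.0], np.cumsum(c * (edges[:-1] + edges[1:]) / 2.0)])
    i = np.clip(np.searchsorted(pop, p, side="right") - 1, 0, len(c) - 1)
    t = p - pop[i]
    # Integrating the linear quantile function gives a quadratic in t.
    part = mass[i] + edges[i] * t + (edges[i + 1] - edges[i]) * t**2 / (2.0 * c[i])
    return part / mass[-1]


rng = np.random.default_rng(0)
# Illustrative parameter values only; they are not taken from the post.
income = simulate_gb2(10_000, a=3.0, b=100.0, p=1.0, q=1.5, rng=rng)

c, z, m = group_summaries(income, n_groups=20)   # scheme (i): (c, z); scheme (ii): (c, m)
pop, lor = lorenz_points(c, m)
gini_lower = gini_linear(pop, lor)

grid = np.linspace(0.0, 1.0, 5)
lor_hist = lorenz_from_histogram(c, z, z0=income.min(), p=grid)
```

The Gini coefficient computed from the linearly interpolated curve ignores inequality within groups, so it cannot exceed the Gini coefficient of the full sample. The histogram-based curve uses bin midpoints as implied group means, so its knots need not coincide with those built from the recorded mi.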

Two messages at the end: (i) This might be expandable into a short paper. If any readers are interested in developing this further, please contact me. (ii) If you have any ideas on how to flexibly model a Lorenz curve based on grouped data, please let me know.