[math-fun] Trimmed means and Multi-dimensional medians
While we're at it, the "trimmed mean" is another way to combine the robustness of the median with the acuity of the mean. The k-trimmed mean throws out the k most extreme data points (if I remember correctly, and probably k should be even) and then takes the mean of the rest. The pure mean is then equivalent to the 0-trimmed mean, and the pure median is equivalent to the (n-1)-trimmed mean (where n is the number of data points in the set).

Here's an interesting question: suppose we have data X_1, ..., X_n drawn from a Gaussian distribution with unknown mean mu and known variance 1. We wish to estimate mu with a guess muhat. Virtually everyone uses the sample mean of the dataset as an estimate of mu, but note that mu is also the *median* of the distribution. Under what circumstances would we be justified in preferring the sample median of the data to estimate mu? Since the sample average is a sufficient statistic, the answer might be never, but I'm not sure. Might it be the case that the sample median is preferable if we are using L1 loss, i.e., seeking to minimize E_mu |mu - muhat| ?

Here is another question about the median: is there a median that makes sense in two or more dimensions? Suppose (X,Y) ~ f(x,y), where f(x,y) is the continuous joint pdf of the random variables X and Y. Is there a reasonable quantity to call the median?

-Joshua

On 9/29/05, Mike Speciner <speciner@ll.mit.edu> wrote:
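[Editorial sketch: the two notions above can be made concrete in a few lines of Python. `trimmed_mean` assumes the usual symmetric reading of the poster's definition (k even, k/2 dropped from each end); `geometric_median` is one standard answer to the multi-dimensional question — the point minimizing the sum of Euclidean distances, computed here with Weiszfeld's iteration. Both function names are my own, not from the thread.]

```python
import math

def trimmed_mean(xs, k):
    """k-trimmed mean: drop the k/2 smallest and k/2 largest values
    (assuming symmetric trimming with k even), then average the rest."""
    if k % 2 != 0 or not 0 <= k < len(xs):
        raise ValueError("k must be even and less than len(xs)")
    s = sorted(xs)
    kept = s[k // 2 : len(s) - k // 2]
    return sum(kept) / len(kept)

def geometric_median(points, iters=100, tol=1e-9):
    """Weiszfeld's algorithm for the geometric (spatial) median in 2-D:
    the point minimizing the sum of Euclidean distances to the data."""
    x = sum(p[0] for p in points) / len(points)   # start at the centroid
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        wx = wy = wsum = 0.0
        for px, py in points:
            d = math.hypot(px - x, py - y)
            if d < tol:           # estimate landed on a data point; stop
                return (x, y)
            wx += px / d
            wy += py / d
            wsum += 1.0 / d
        x, y = wx / wsum, wy / wsum
    return (x, y)

print(trimmed_mean([1, 2, 3, 4, 100], 0))  # 22.0 -- the plain mean
print(trimmed_mean([1, 2, 3, 4, 100], 2))  # 3.0  -- outlier 100 discarded
print(trimmed_mean([1, 2, 3, 4, 100], 4))  # 3.0  -- only the median left
```

Note how the 0-trimmed mean is dragged to 22.0 by the outlier, while k = 2 already recovers 3.0; and with n = 5, the (n-1)-trimmed mean keeps only the middle point, matching the poster's claim for odd n.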
So, if y(x) is the histogram, the median is the m such that
integral(x<m) y(x) dx = integral(x>m) y(x) dx
while the mean is the m such that
integral(x<m) |x-m|*y(x) dx = integral(x>m) |x-m|*y(x) dx
This suggests a whole family of averages (using various functions of (x-m) for the weighting), though what use they might have escapes me.
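[Editorial sketch: Mike's two balance conditions are equivalent to saying the median minimizes sum |x - m| and the mean minimizes sum (x - m)^2, with other powers of |x - m| filling out the "whole family". A crude grid search over toy data (all names here are mine) makes this visible.]

```python
def m_star(xs, p, grid):
    """Grid-search the m minimizing sum |x - m|^p over the data."""
    return min(grid, key=lambda m: sum(abs(x - m) ** p for x in xs))

data = [1, 2, 3, 10]
grid = [i / 100 for i in range(1101)]   # 0.00 .. 11.00 in steps of 0.01

m2 = m_star(data, 2, grid)   # p = 2: the mean, 4.0
m1 = m_star(data, 1, grid)   # p = 1: a median (any m in [2, 3] ties)
print(m2, m1)
```

For p = 1 the objective is flat on the whole interval [2, 3], which is exactly the familiar ambiguity of the median for an even number of points; the grid search just returns the first tied value.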
--ms
David Gale wrote:
Jim, what is the Propp median if there are m zeros and m fives (and zero everything else)? Dan, if you're going to bring in averages at all then why not go all the way and use THE average? But maybe the CDC was using some sort of hybrid like the one you suggest.
D
At 09:14 PM 9/28/2005, you wrote:
The picture was supposed to show a rectangle of width 1 and height 2 whose bottom is centered at x=1, and to the right of it, a rectangle of width 1 and height 1 whose bottom is centered at x=2.
The base of the first rectangle goes from x=1/2 to x=3/2, and the base of the second rectangle goes from x=3/2 to x=5/2.
The total area under the histogram is (1)(2)+(1)(1) = 3.
The area to the left of the line x=5/4 is (5/4-1/2)(2) = 3/2, which is half of the total area. So x=5/4 is the "median".
Jim
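[Editorial sketch: Jim's "equal areas" median generalizes to any piecewise-constant histogram. The function below (my own name and representation, rectangles as (left, right, height) triples, sorted and non-overlapping) walks the cumulative area until it reaches half the total, then interpolates linearly inside the rectangle where that happens.]

```python
def histogram_median(rects):
    """Median of a piecewise-constant density given as (left, right, height)
    rectangles: the x where the cumulative area reaches half the total."""
    total = sum((r - l) * h for l, r, h in rects)
    target = total / 2
    acc = 0.0
    for l, r, h in rects:
        area = (r - l) * h
        if acc + area >= target:
            return l + (target - acc) / h   # linear within this rectangle
        acc += area
    return rects[-1][1]

# Jim's example: width-1 rectangles of heights 2 and 1
print(histogram_median([(0.5, 1.5, 2.0), (1.5, 2.5, 1.0)]))  # 1.25
```

The total area is 3, half is 3/2, and the first rectangle (height 2) accumulates 3/2 of area at x = 1/2 + (3/2)/2 = 5/4, reproducing Jim's answer.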
_______________________________________________
math-fun mailing list
math-fun@mailman.xmission.com
http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
--- joshua sweetkind-singer <sweetkindsinger@gmail.com> wrote:
Here's an interesting question: suppose we have data X_1, ..., X_n drawn from a Gaussian distribution with unknown mean mu and known variance 1. We wish to estimate mu with a guess muhat. Virtually everyone uses the sample mean of the dataset as an estimate of mu, but note that mu is also the *median* of the distribution. Under what circumstances would we be justified in preferring the sample median of the data to estimate mu? Since the sample average is a sufficient statistic, the answer might be never, but I'm not sure. Might it be the case that the sample median is preferable if we are using L1 loss, i.e., seeking to minimize E_mu |mu - muhat| ?
I would solve this problem using the Bayesian method. The posterior distribution for mu is then a Gaussian with mean equal to the sample mean and variance 1/n. This is all you can know about mu on the basis of the given information. For this particular estimation problem, where we are given that the underlying distribution is a Gaussian with unit variance, I would have no need for the sample median.

Now then, if you must pick a number muhat, and make some decision on that basis, and there is a cost c(mutrue, muhat) for being wrong, then you can calculate the muhat that minimizes the expected cost, using the p(mu) derived above.

Gene
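[Editorial sketch: Gene's claim — posterior N(sample mean, 1/n) under a flat prior — can be checked numerically by evaluating the unnormalized posterior exp(-0.5 * sum (x_i - mu)^2) on a fine grid. Toy data and variable names below are mine.]

```python
import math

xs = [0.3, -1.2, 0.5, 2.0]           # toy sample; sample mean = 0.4, n = 4
n = len(xs)

# Unnormalized posterior on a grid under the flat prior f(mu) = const
grid = [-5 + i / 100 for i in range(1001)]
w = [math.exp(-0.5 * sum((x - mu) ** 2 for x in xs)) for mu in grid]
Z = sum(w)
post_mean = sum(mu * wi for mu, wi in zip(grid, w)) / Z
post_var = sum((mu - post_mean) ** 2 * wi for mu, wi in zip(grid, w)) / Z
print(post_mean, post_var)   # ~0.4 (the sample mean) and ~0.25 (= 1/n)
```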
Yes, nice solution. A couple of points. First, the posterior distribution you state is specifically the one that results when you use the improper prior f(mu) = constant. Second, following through on your last paragraph, if the cost function is specifically c(mu, muhat) = |mu - muhat|, then the answer is the median of the posterior distribution — which, the posterior being Gaussian and hence symmetric, is the sample *mean*! So this drives home the point that, under this Bayesian framework, the sample median is not needed.

What if we take a frequentist point of view, though? Then things get more difficult. Is there a uniformly minimum-cost unbiased estimator for the median? If so, is it the sample mean, the sample median, or something else?

-Joshua
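[Editorial sketch: a quick Monte Carlo comparison of the two estimators under L1 loss, at toy parameters of my choosing (n = 11, mu = 0). It is suggestive only, not a proof, but it agrees with the asymptotic fact that the sample median's sampling standard deviation is about sqrt(pi/2) ≈ 1.25 times the sample mean's for Gaussian data, so the mean wins under L1 loss as well.]

```python
import random
import statistics

random.seed(1)
n, trials, mu = 11, 20000, 0.0
mean_loss = med_loss = 0.0
for _ in range(trials):
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    mean_loss += abs(statistics.fmean(xs) - mu)   # L1 loss of sample mean
    med_loss += abs(statistics.median(xs) - mu)   # L1 loss of sample median
mean_loss /= trials
med_loss /= trials
print(mean_loss, med_loss)   # the sample mean should incur smaller L1 loss
```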
participants (2)
- Eugene Salamin
- joshua sweetkind-singer