A common problem arises in the analysis of experimental data. There is a sample of real data, each member of which consists of a set of values $x_1, x_2, \dots$ -- for example, a set of Z decay events with inclusive leptons, for each of which there is a value of the momentum $p$ and transverse momentum $p_T$, or a set of measured particle tracks, each with a measured momentum and ionisation. You know that these arise from a number of sources: the lepton events from direct b decays, cascade b decays, c decays, and background; the tracks from $\pi$, K, and p hadrons. You wish to determine the proportions of the different sources in the data.
There is no analytic form available for the distributions of these sources as functions of the $x$ values, only samples of data generated by Monte Carlo simulation. You therefore have to bin the data, dividing the multidimensional space spanned by the $x$ values into $n$ bins. This gives a set of numbers $d_1, d_2, \dots, d_n$, where $d_i$ is the number of events in the real data that fall into bin $i$. Let $f_i$ be the predicted number of events in the bin, given by the strengths $P_j$ and the numbers $a_{ji}$ of Monte Carlo events from source $j$ in bin $i$:
\[
f_i = N_D \sum_{j=1}^{m} \frac{P_j\, a_{ji}}{N_j}
\]
where $N_D$ is the total number in the data sample, and
\[
N_j = \sum_{i=1}^{n} a_{ji}
\]
the total number in the MC sample for source $j$.
The $P_j$ are then the actual proportions and should sum to unity. It is convenient to incorporate these normalisation factors into the strength factors, writing $p_j = P_j N_D / N_j$, giving the equivalent form
\[
f_i = \sum_{j=1}^{m} p_j\, a_{ji} .
\]
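As a minimal numerical sketch of this prediction (the array contents are hypothetical toy values, invented purely for illustration), the $f_i = \sum_j p_j a_{ji}$ form can be computed as:

```python
import numpy as np

# Hypothetical toy setup: m = 2 sources, n = 4 bins.
# a[j, i] = number of MC events from source j falling in bin i.
a = np.array([[30., 50., 15., 5.],    # source 1 template
              [ 5., 15., 40., 40.]])  # source 2 template

N_j = a.sum(axis=1)        # total MC events per source, N_j = sum_i a_{ji}
N_D = 200                  # total number of real data events (assumed)
P = np.array([0.7, 0.3])   # true proportions, summing to unity

# Strength factors absorbing the normalisations: p_j = P_j * N_D / N_j
p = P * N_D / N_j

# Predicted bin contents: f_i = sum_j p_j a_{ji}
f = p @ a
```

Note that by construction $\sum_i f_i = \sum_j P_j N_D = N_D$, so the prediction is automatically normalised to the data sample size.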
One approach is then to estimate the $p_j$ by adjusting them to minimise
\[
\chi^2 = \sum_{i=1}^{n} \frac{(d_i - f_i)^2}{d_i} .
\]
This assumes that the distribution for $d_i$ is Gaussian; it is of course Poisson, but the Gaussian is a good approximation to the Poisson at large numbers.
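A least-squares fit of this form might be sketched as follows (the templates and data are invented toy numbers; `scipy.optimize.minimize` with the Nelder-Mead method is one of several possible minimisers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy MC templates a[j, i] and observed data d[i] (hypothetical numbers)
a = np.array([[30., 50., 15., 5.],
              [ 5., 15., 40., 40.]])
d = np.array([44., 80., 46., 30.])

def chi2(p):
    f = p @ a                         # f_i = sum_j p_j a_{ji}
    return np.sum((d - f) ** 2 / d)   # Gaussian approximation: var(d_i) = d_i

res = minimize(chi2, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
p_hat = res.x                         # fitted strength factors
```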
Unfortunately it often happens that many of the $d_i$ are small; in this case one can go back to the original Poisson distribution, and write down the probability for observing a particular $d_i$ as
\[
P(d_i) = e^{-f_i} \frac{f_i^{d_i}}{d_i!}
\]
and the estimates of the proportions are found by maximising the total likelihood or, for convenience, its logarithm (remembering that the logarithm of a product is the sum of the logarithms, and omitting the constant factorials)
\[
\ln L = \sum_{i=1}^{n} \bigl( d_i \ln f_i - f_i \bigr) .
\]
This accounts correctly for the small numbers of data events in the bins, and is a technique in general use. It is often referred to as a ``binned maximum likelihood'' fit.
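A sketch of such a binned maximum-likelihood fit, again with invented toy numbers, minimising the negative of the log-likelihood above:

```python
import numpy as np
from scipy.optimize import minimize

a = np.array([[30., 50., 15., 5.],
              [ 5., 15., 40., 40.]])
d = np.array([44., 80., 46., 30.])

def neg_log_like(p):
    f = p @ a
    if np.any(f <= 0):                 # guard: ln f_i undefined for f_i <= 0
        return np.inf
    return -np.sum(d * np.log(f) - f)  # -ln L, constant factorials omitted

res = minimize(neg_log_like, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
```

A handy cross-check: setting the derivatives of $\ln L$ with respect to the $p_j$ to zero implies $\sum_i f_i = \sum_i d_i$ at the maximum, so the fitted prediction reproduces the total number of data events.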
But this does not account for the fact that the Monte Carlo samples used may also be of finite size, leading to statistical fluctuations in the $a_{ji}$. In the form $f_i = \sum_j p_j a_{ji}$ it can be seen that these fluctuations are damped by a factor $p_j$, but we cannot hope that this is small.
So: disagreements between a particular $d_i$ and $f_i$ arise from incorrect $p_j$, from fluctuations in $d_i$, and from fluctuations in the $a_{ji}$. Binned maximum likelihood reckons with the first two sources, but not the third. In the $\chi^2$ formalism this can be dealt with by adjusting the error used in the denominator,
\[
\chi^2 = \sum_{i=1}^{n} \frac{(d_i - f_i)^2}{d_i + \sum_j p_j^2\, a_{ji}} ,
\]
but this still suffers from the incorrect Gaussian approximation. The problem is to find the equivalent treatment for the binned maximum likelihood method.
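The adjusted-denominator $\chi^2$ can be sketched in the same way (toy numbers again; the extra term $\sum_j p_j^2 a_{ji}$ is the Poisson variance of $f_i$ propagated from the MC bin contents):

```python
import numpy as np
from scipy.optimize import minimize

a = np.array([[30., 50., 15., 5.],
              [ 5., 15., 40., 40.]])
d = np.array([44., 80., 46., 30.])

def chi2_mc(p):
    f = p @ a
    # variance enlarged by the MC fluctuation term: var_i = d_i + sum_j p_j^2 a_{ji}
    var = d + (p ** 2) @ a
    return np.sum((d - f) ** 2 / var)

res = minimize(chi2_mc, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
```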