
The Problem.

A common problem arises in the analysis of experimental data. There is a sample of real data, each member of which consists of a set of values $x_r$ -- for example, a set of Z decay events with inclusive leptons, for each of which there is a value of $p$, $p_t$, $T$, $E_{vis}$, or a set of measured particle tracks, each with $p$, $dE/dx$, $\cos\theta$. You know that these arise from a number of sources: the lepton events from direct b decays, cascade b decays, c decays, and background; the tracks from $\pi$, K, and p hadrons. You wish to determine the proportions $P_j$ of the different sources in the data.

There is no analytic form available for the distributions of these sources as functions of the $x_r$, only samples of data generated by Monte Carlo simulation. You therefore have to bin the data, dividing the multidimensional space spanned by the $x_r$ values into $n$ bins. This gives a set of numbers $d_1, d_2, \ldots, d_n$, where $d_i$ is the number of events in the real data that fall into bin $i$. Let $f_i(P_1, P_2, \ldots, P_m)$ be the predicted number of events in the bin, given by the strengths $P_j$ and the numbers of Monte Carlo events $a_{ji}$ from source $j$ in bin $i$:

$$ f_i = N_D \sum_{j=1}^{m} P_j a_{ji} / N_j $$

where $N_D$ is the total number in the data sample, and $N_j$ the total number in the MC sample for source $j$:

$$ N_D = \sum_{i=1}^{n} d_i, \qquad N_j = \sum_{i=1}^{n} a_{ji} $$

The $P_j$ are then the actual proportions and should sum to unity. It is convenient to incorporate these normalisation factors into the strength factors, writing $p_j = N_D P_j / N_j$, giving the equivalent form

$$ f_i = \sum_{j=1}^{m} p_j a_{ji} $$
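The two equivalent forms of the prediction can be checked numerically. The following is a minimal sketch in plain Python; the function names (`predicted_counts`, `strengths`) and the toy histograms are illustrative assumptions, not part of the original text.

```python
def predicted_counts(P, a, n_data):
    """Return f_i = N_D * sum_j P_j * a_ji / N_j for every bin i.

    P      : proportions P_j, one per source
    a      : a[j][i] = number of MC events from source j in bin i
    n_data : N_D, total number of events in the data sample
    """
    N = [sum(row) for row in a]  # N_j: total MC events for source j
    n_bins = len(a[0])
    return [
        n_data * sum(P[j] * a[j][i] / N[j] for j in range(len(a)))
        for i in range(n_bins)
    ]


def strengths(P, a, n_data):
    """Return p_j = N_D * P_j / N_j, the normalisation-absorbing strengths."""
    N = [sum(row) for row in a]
    return [n_data * P[j] / N[j] for j in range(len(a))]


# Toy inputs (hypothetical): two sources, three bins.
a = [[10, 30, 60],   # MC histogram for source 1
     [50, 30, 20]]   # MC histogram for source 2
d = [15, 15, 20]     # data histogram
N_D = sum(d)
P = [0.25, 0.75]     # trial proportions, summing to unity

f = predicted_counts(P, a, N_D)
p = strengths(P, a, N_D)

# The same prediction written in terms of the p_j.
f_from_p = [sum(p[j] * a[j][i] for j in range(len(a))) for i in range(len(d))]
```

Because the $P_j$ sum to unity, the predicted counts in this sketch add up to $N_D$, which is a quick sanity check on any implementation.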

One approach is then to estimate the $p_j$ by adjusting them to minimise

$$ \chi^2 = \sum_i \frac{(d_i - f_i)^2}{d_i} $$

This $\chi^2$ assumes that the distribution for $d_i$ is Gaussian; it is of course Poisson, but the Gaussian is a good approximation to the Poisson at large numbers.
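As a rough illustration, the $\chi^2$ above can be evaluated as follows. This is a sketch only: the function name `chi2` is an assumption, and skipping bins with $d_i = 0$ (where the expression is undefined) is a choice made here for illustration, not something the text prescribes.

```python
def chi2(d, f):
    """Return sum_i (d_i - f_i)^2 / d_i, using d_i as the variance estimate.

    Bins with d_i == 0 are skipped (an assumption of this sketch), because
    the denominator vanishes there.
    """
    return sum((di - fi) ** 2 / di for di, fi in zip(d, f) if di > 0)


# Toy inputs (hypothetical): data counts and predicted counts in three bins.
d = [15, 15, 20]
f = [20.0, 15.0, 15.0]
value = chi2(d, f)
```

The need to special-case empty bins already hints at why this form becomes unreliable when the bin contents are small.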

Unfortunately it often happens that many of the $d_i$ are small. In this case one can go back to the original Poisson distribution, and write down the probability for observing a particular $d_i$ as

$$ e^{-f_i} \frac{f_i^{d_i}}{d_i!} $$

and the estimates of the proportions $p_j$ are found by maximising the total likelihood or, for convenience, its logarithm (remembering $a^b = e^{b \ln a}$, and omitting the constant factorials)

$$ \ln L = \sum_{i=1}^{n} \left( d_i \ln f_i - f_i \right) $$

This accounts correctly for the small numbers of data events in the bins, and is a technique in general use. It is often referred to as a ``binned maximum likelihood'' fit.
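A binned maximum likelihood fit of this kind can be sketched for the simplest case of two sources. The code below is an illustration under stated assumptions: the names `log_likelihood` and `fit_two_sources` are hypothetical, the constraint $P_1 + P_2 = 1$ is imposed by hand, and a plain grid scan stands in for the numerical minimiser that a real analysis would normally use.

```python
import math


def log_likelihood(d, f):
    """Return ln L = sum_i (d_i ln f_i - f_i), dropping the ln(d_i!) terms."""
    total = 0.0
    for di, fi in zip(d, f):
        if fi < 0.0:
            return -math.inf          # negative prediction is not allowed
        if fi == 0.0:
            if di > 0:
                return -math.inf      # data observed where none is predicted
            continue                  # d_i = 0 and f_i = 0 contributes nothing
        total += di * math.log(fi) - fi
    return total


def fit_two_sources(d, a, steps=1000):
    """Scan P_1 over [0, 1] with P_2 = 1 - P_1 and return (best P_1, ln L)."""
    n_data = sum(d)                    # N_D
    N = [sum(row) for row in a]        # N_j
    best_ll, best_P1 = -math.inf, None
    for k in range(steps + 1):
        P1 = k / steps
        f = [
            n_data * (P1 * a[0][i] / N[0] + (1.0 - P1) * a[1][i] / N[1])
            for i in range(len(d))
        ]
        ll = log_likelihood(d, f)
        if ll > best_ll:
            best_ll, best_P1 = ll, P1
    return best_P1, best_ll


# Toy inputs (hypothetical): small data counts, two MC histograms.
a = [[10, 30, 60],
     [50, 30, 20]]
d = [4, 3, 3]
P1_hat, ll_hat = fit_two_sources(d, a)
```

In this sketch the MC histograms `a` are treated as exact templates; only the Poisson fluctuations of the data enter the likelihood.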

But this does not account for the fact that the Monte Carlo samples used may also be of finite size, leading to statistical fluctuations in the $a_{ji}$. In the expression for $f_i$ in terms of the $P_j$ it can be seen that these are damped by a factor $N_D/N_j$, but we cannot hope that this is small.

So: disagreements between a particular $d_i$ and $f_i$ arise from incorrect $p_j$, from fluctuations in $d_i$, and from fluctuations in the $a_{ji}$. Binned maximum likelihood reckons with the first two sources, but not the third. In the $\chi^2$ formalism given earlier this can be dealt with by adjusting the error used in the denominator

$$ \chi^2 = \sum_i \frac{(d_i - f_i)^2}{d_i + N_D^2 \sum_j a_{ji} / N_j^2} $$

but this still suffers from the incorrect Gaussian approximation. The problem is to find the equivalent treatment for the binned maximum likelihood method.
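The effect of the enlarged denominator can be seen in a small numerical sketch. The function name `chi2_with_mc_error` and the toy histograms are assumptions for illustration; the denominator follows the adjusted $\chi^2$ expression as written in the text, with $d_i$ as the data variance and $N_D^2 \sum_j a_{ji}/N_j^2$ as the Monte Carlo term.

```python
def chi2_with_mc_error(d, f, a):
    """Return sum_i (d_i - f_i)^2 / (d_i + N_D^2 * sum_j a_ji / N_j^2).

    d : data counts per bin
    f : predicted counts per bin
    a : a[j][i] = number of MC events from source j in bin i
    """
    n_data = sum(d)                    # N_D
    N = [sum(row) for row in a]        # N_j
    total = 0.0
    for i, (di, fi) in enumerate(zip(d, f)):
        mc_term = n_data ** 2 * sum(a[j][i] / N[j] ** 2 for j in range(len(a)))
        denom = di + mc_term
        if denom > 0:                  # skip bins empty in both data and MC
            total += (di - fi) ** 2 / denom
    return total


# Toy inputs (hypothetical): MC samples only twice the size of the data.
a = [[10, 30, 60],
     [50, 30, 20]]
d = [15, 15, 20]
f = [20.0, 15.0, 15.0]

with_mc = chi2_with_mc_error(d, f, a)
without_mc = sum((di - fi) ** 2 / di for di, fi in zip(d, f))
```

With Monte Carlo samples this small, the extra term is comparable to $d_i$ itself, so the adjusted value comes out at roughly half the unadjusted one in this toy case.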



Janne Saarela
Tue May 16 09:09:27 METDST 1995