A common problem arises in the analysis of experimental data. There is a sample of real data, each member of which consists of a set of values $x_1, x_2, \dots$ -- for example, a set of Z decay events with inclusive leptons, for each of which there is a value of the momentum $p$ and transverse momentum $p_T$, or a set of measured particle tracks, each with a measured momentum and ionisation. You know that these arise from a number of sources: the lepton events from direct b decays, cascade b decays, c decays, and background; the tracks from $\pi$, K, and p hadrons. You wish to determine the proportions of the different sources in the data.
There is no analytic form available for the distributions of these sources as functions of the $x$ values, only samples of data generated by Monte Carlo simulation. You therefore have to bin the data, dividing the multidimensional space spanned by the $x$ values into $n$ bins. This gives a set of numbers $d_1, d_2, \dots, d_n$, where $d_i$ is the number of events in the real data that fall into bin $i$. Let $f_i$ be the predicted number of events in the bin, given by the strengths $P_j$ and the numbers $a_{ji}$ of Monte Carlo events from source $j$ in bin $i$:
\[
f_i = N_D \sum_{j=1}^{m} \frac{P_j\, a_{ji}}{N_j}
\]
where $N_D$ is the total number in the data sample, and
\[
N_j = \sum_{i=1}^{n} a_{ji}
\]
the total number in the MC sample for source $j$.
The $P_j$ are then the actual proportions and should sum to unity. It is convenient to incorporate these normalisation factors into the strength factors, writing $p_j = P_j N_D / N_j$, giving the equivalent form
\[
f_i = \sum_{j=1}^{m} p_j\, a_{ji} .
\]
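As a minimal numerical sketch of this prediction (the array contents are hypothetical toy values, invented purely for illustration), the $f_i = \sum_j p_j a_{ji}$ form can be computed as:

```python
import numpy as np

# Hypothetical toy setup: m = 2 sources, n = 4 bins.
# a[j, i] = number of MC events from source j falling in bin i.
a = np.array([[30., 50., 15., 5.],    # source 1 template
              [ 5., 15., 40., 40.]])  # source 2 template

N_j = a.sum(axis=1)        # total MC events per source, N_j = sum_i a_{ji}
N_D = 200                  # total number of real data events (assumed)
P = np.array([0.7, 0.3])   # true proportions, summing to unity

# Strength factors absorbing the normalisations: p_j = P_j * N_D / N_j
p = P * N_D / N_j

# Predicted bin contents: f_i = sum_j p_j a_{ji}
f = p @ a
```

Note that by construction $\sum_i f_i = \sum_j P_j N_D = N_D$, so the prediction is automatically normalised to the data sample size.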
One approach is then to estimate the $p_j$ by adjusting them to minimise
\[
\chi^2 = \sum_{i=1}^{n} \frac{(d_i - f_i)^2}{d_i} .
\]
This assumes that the distribution for $d_i$ is Gaussian; it is of course Poisson, but the Gaussian is a good approximation to the Poisson at large numbers.
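A least-squares fit of this form might be sketched as follows (the templates and data are invented toy numbers; `scipy.optimize.minimize` with the Nelder-Mead method is one of several possible minimisers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy MC templates a[j, i] and observed data d[i] (hypothetical numbers)
a = np.array([[30., 50., 15., 5.],
              [ 5., 15., 40., 40.]])
d = np.array([44., 80., 46., 30.])

def chi2(p):
    f = p @ a                         # f_i = sum_j p_j a_{ji}
    return np.sum((d - f) ** 2 / d)   # Gaussian approximation: var(d_i) = d_i

res = minimize(chi2, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
p_hat = res.x                         # fitted strength factors
```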
Unfortunately it often happens that many of the $d_i$ are small; in this case one can go back to the original Poisson distribution, and write down the probability for observing a particular $d_i$ as
\[
P(d_i) = e^{-f_i} \frac{f_i^{d_i}}{d_i!}
\]
and the estimates of the proportions are found by maximising the total likelihood or, for convenience, its logarithm (remembering that the logarithm of a product is the sum of the logarithms, and omitting the constant factorials)
\[
\ln L = \sum_{i=1}^{n} \bigl( d_i \ln f_i - f_i \bigr) .
\]
This accounts correctly for the small numbers of data events in the bins, and is a technique in general use. It is often referred to as a ``binned maximum likelihood'' fit.
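A sketch of such a binned maximum-likelihood fit, again with invented toy numbers, minimising the negative of the log-likelihood above:

```python
import numpy as np
from scipy.optimize import minimize

a = np.array([[30., 50., 15., 5.],
              [ 5., 15., 40., 40.]])
d = np.array([44., 80., 46., 30.])

def neg_log_like(p):
    f = p @ a
    if np.any(f <= 0):                 # guard: ln f_i undefined for f_i <= 0
        return np.inf
    return -np.sum(d * np.log(f) - f)  # -ln L, constant factorials omitted

res = minimize(neg_log_like, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
```

A handy cross-check: setting the derivatives of $\ln L$ with respect to the $p_j$ to zero implies $\sum_i f_i = \sum_i d_i$ at the maximum, so the fitted prediction reproduces the total number of data events.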
But this does not account for the fact that the Monte Carlo samples used may also be of finite size, leading to statistical fluctuations in the $a_{ji}$. In the form $f_i = \sum_j p_j a_{ji}$ it can be seen that these fluctuations are damped by a factor $p_j$, but we cannot hope that this is small.
So: disagreements between a particular $d_i$ and $f_i$ arise from incorrect $p_j$, from fluctuations in $d_i$, and from fluctuations in the $a_{ji}$. Binned maximum likelihood reckons with the first two sources, but not the third. In the $\chi^2$ formalism this can be dealt with by adjusting the error used in the denominator,
\[
\chi^2 = \sum_{i=1}^{n} \frac{(d_i - f_i)^2}{d_i + \sum_j p_j^2\, a_{ji}} ,
\]
but this still suffers from the incorrect Gaussian approximation. The problem is to find the equivalent treatment for the binned maximum likelihood method.
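The adjusted-denominator $\chi^2$ can be sketched in the same way (toy numbers again; the extra term $\sum_j p_j^2 a_{ji}$ is the Poisson variance of $f_i$ propagated from the MC bin contents):

```python
import numpy as np
from scipy.optimize import minimize

a = np.array([[30., 50., 15., 5.],
              [ 5., 15., 40., 40.]])
d = np.array([44., 80., 46., 30.])

def chi2_mc(p):
    f = p @ a
    # variance enlarged by the MC fluctuation term: var_i = d_i + sum_j p_j^2 a_{ji}
    var = d + (p ** 2) @ a
    return np.sum((d - f) ** 2 / var)

res = minimize(chi2_mc, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
```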