Stats help please

Removed User 19 Sep 2020

I know there are some maths heavyweights on here.  Perhaps one of them might be persuaded to give me some pointers?

I've got something that happens about 3 times/100 in a population.

I've got 100 populations, varying from a few tens of thousands to a couple of million.

So I'd expect if I plotted events against population size I'd get a decent trend line.  And I do, but R^2 is "only" about 0.8.

If the event is random but 3/100 on average, how do I work out the expected R^2 value?  Is simulation the only way?
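
For concreteness, this is roughly the sort of simulation I was imagining (a Python sketch; the population sizes are placeholders in the range I described, and the 3% is applied as a simple binomial draw):

import numpy as np

rng = np.random.default_rng(0)
p = 0.03
pops = rng.integers(20_000, 2_000_000, size=100)

r2_values = []
for _ in range(1000):                          # repeat the whole "experiment" many times
    events = rng.binomial(pops, p)             # each event happens with 3% probability
    slope, intercept = np.polyfit(pops, events, 1)
    fit = slope * pops + intercept
    ss_res = np.sum((events - fit) ** 2)
    ss_tot = np.sum((events - events.mean()) ** 2)
    r2_values.append(1 - ss_res / ss_tot)

print("expected R^2 for a purely random 3% event:",
      round(np.mean(r2_values), 4), "+/-", round(np.std(r2_values), 4))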

I also have some influencing parameters that I think will increase or decrease the event frequency.  Each of my hundred populations is impacted by either one or two of these.  I want to work out how much of the deviation from R^2=1 is attributable to each parameter.

It may well be that my O-level maths and R^2 are a red herring and I need a completely different tool.

All advice appreciated; helpful advice especially so!

 Jack B 19 Sep 2020
In reply to Removed UserBilberry:

I don't think plotting occurrence against population (sample?) size is going to get what you want.

Tell us more about these parameters... are they on-off, or are they variable in strength? How well do you know them?

If I understand you right, you have two parameters you want to test against, and you have populations with A, B, or A+B, but none with neither? Do you expect A and B to interact if they are both present, or just to both independently influence the occurrence?

What separates the populations, other than these influencing parameters? Could you lump them together or do they seem too different?

Post edited at 20:02
Removed User 19 Sep 2020
In reply to Jack B:

Hi Jack, thanks for replying.

If something happened rigidly 3 times in 100, then I'd have a y=0.03x graph; no wobbles.

If something has a 3% probability then, within a large population, I'd get an approximately normal distribution of outcome counts, I think?

So if I took a set of such populations, I'd expect some wobbling around the y=0.03x line depending on how close to the apex of the distribution each population is.  My first question above is what would be an "expected" amount of wobble.  The only tool I can remember is variance!

I think I have more wobble than would be "expected". I think that is because in some of my populations I have an influencing factor (or two) that should suppress the event to some degree. If I plot just those populations I get a line that is still wobbly, but is "clearly on inspection" different to the line without the influencing factor. I'd like to replace "clearly on inspection" with something a bit more robust!

The populations are subsets and can be summed.

It's not Covid by the way!

Post edited at 20:35
 elsewhere 19 Sep 2020
In reply to Removed UserBilberry:

Assuming the events are independent and randomly distributed the Poisson distribution is appropriate.

N = rP (N = number of events, r = 0.03, P = population, from a few tens of thousands to a couple of million)

Expected deviation: n = sqrt(rP)

When you plot the graph N vs P or make calculations, do you factor in that the error bars increase with sqrt(P)?

That will need to feed into your calculation of R^2: the data points have different uncertainties, so they should have different weightings.

The absolute uncertainty or expected deviation increases with sqrt(P).
A plot of N vs P will have gradient r, with larger deviations sqrt(rP) at larger P.

The relative uncertainty or expected deviation decreases with 1/sqrt(P).
A plot of N/P vs P will have gradient zero with smaller deviations sqrt(r/P) at larger P.
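
In rough numbers (a quick Python sketch; the P values are just examples):

import math

r = 0.03
for P in (30_000, 300_000, 2_000_000):
    N = r * P                        # expected number of events, N = rP
    n = math.sqrt(r * P)             # expected absolute deviation, sqrt(rP)
    rel = math.sqrt(r / P)           # expected deviation in the rate N/P, sqrt(r/P)
    print(f"P = {P:>9,}   N = {N:>8,.0f} +/- {n:5.0f}   rate = {r:.3f} +/- {rel:.5f}")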

My stats is very rudimentary so take that with a pinch of salt.

Post edited at 20:43
Removed User 19 Sep 2020
In reply to elsewhere:

Hi elsewhere, and thanks for the reply.  The actual numbers are very solid.  I know how many events have happened in each of my groups, and the population, to within fine tolerances.

I'd expect in a 1 million population that the event would happen very close to 3% of the time.  So if I'm seeing 2.8% it seems like something else is influencing.  I've got a set of populations with the influence and a set without.

 elsewhere 19 Sep 2020
In reply to Removed UserBilberry:

> Hi elsewhere, and thanks for the reply.  The actual numbers are very solid.  I know how many events have happened in each of my groups, and the population, to within fine tolerances.

You can know exactly how many events there were and the population. However, if the events are independent and random you can* get a different number of events in an identically sized population.

*and almost certainly will for 30,000 events in a population of a million. 

> I'd expect in a 1 million population that the event would happen very close to 3% of the time.  So if I'm seeing 2.8% it seems like something else is influencing.  I've got a set of populations with the influence and a set without.

3%, 1 million -> 30,000

N=30,000

n= sqrt(30,000)=173

You'd expect most (about two thirds) of your samples of 1 million population to be in the range 30,000 +/- 173 events, which is 3.00 +/- 0.02%.

2.8% is possible but vanishingly unlikely - that shortfall is more than ten standard deviations below the expected 30,000.

If your control samples obey 3.00 +/- 0.02% for P = 1,000,000 and other samples sit at 2.8% (P = 1,000,000), I'd be confident there is a difference.
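
A quick check of that arithmetic (Python, using the normal approximation to the Poisson):

import math

P, r = 1_000_000, 0.03
expected = r * P                     # 30,000
sigma = math.sqrt(expected)          # about 173
observed = 0.028 * P                 # 28,000

z = (expected - observed) / sigma
print(f"shortfall of {expected - observed:,.0f} events = {z:.1f} standard deviations")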

Ballpark figures to be taken with a pinch of salt.

Post edited at 21:10
 Jack B 19 Sep 2020
In reply to Removed UserBilberry:

> If something happened rigidly 3 times in 100, then I'd have a y=0.03x graph; no wobbles. [...]

> I think I have more wobble than would be "expected". 

This is broadly correct, but it is not a good way to demonstrate the effect the other factors have.

> The populations are subsets and can be summed.

I'm not sure what you mean by this. If your "populations" are subsets of a single large population, and one datum can exist in more than one subset, then summing them is not necessarily safe. However, if it is safe to sum them, then you can proceed like this (there's a rough code sketch after the list):

- Sum them into groups for A, B and A+B (and for neither if you have any).

- Calculate the mean occurrence rate for each group.

- Calculate the uncertainty in mean rate for each group, assuming a Poisson distribution.

- Compare them. If they are different by much more than the uncertainty, then you have a result. Yay!

- If they are different by less than the uncertainty, or only a bit more, then you can do some more maths to calculate the probability of seeing this difference by chance if there were no real underlying difference. Conventionally, if this probability is less than some threshold which varies between subjects (often 5%), we call it a result. This is called "statistical significance".
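
Roughly, in code (a Python sketch with made-up group labels and counts; the comparison at the end is a simple z-test on two Poisson rates, which may or may not suit your data):

import math

# Made-up example: total events and population for each summed group.
groups = {
    "A":   {"events": 2650, "population": 100_000},
    "B":   {"events": 3050, "population": 100_000},
    "A+B": {"events": 2500, "population": 100_000},
}

for name, g in groups.items():
    rate = g["events"] / g["population"]
    err = math.sqrt(g["events"]) / g["population"]    # Poisson uncertainty on the rate
    print(f"{name:4s} rate = {rate:.4f} +/- {err:.4f}")

# Compare two groups: difference in rates against the combined uncertainty.
a, b = groups["A"], groups["B"]
ra, rb = a["events"] / a["population"], b["events"] / b["population"]
ea, eb = math.sqrt(a["events"]) / a["population"], math.sqrt(b["events"]) / b["population"]
z = abs(ra - rb) / math.sqrt(ea**2 + eb**2)
p_value = math.erfc(z / math.sqrt(2))                 # two-sided, normal approximation
print(f"A vs B: {z:.1f} standard deviations apart, p ~ {p_value:.3g}")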

 wintertree 19 Sep 2020
In reply to Removed UserBilberry:

You want to look at the residual values (difference between model fit and data) and compare their magnitude to that of the measurement error.

For this sort of thing the “measurement error” is really more of a manifestation error that occurs due to the random nature of the event. What I might do (sketched in code after the list):

  1. Use the probability p estimated by your linear fit and the Poisson stats given by elsewhere to estimate the uncertainty on each data point as e = sqrt(Np), where N is the size of each sample set.  Then look at the squared residuals normalised to their error values: (the data points - the model fit)^2 / e^2.
  2. If they’re >> 1, the data has more variance than expected, or the model is a bad fit.
  3. If they’re ~1, the variance is as expected.
  4. If they’re << 1, the model fits suspiciously well.
  5. This technique is called “chi squared” and technically it’s rather dodgy to apply it to discrete data (Poisson not Gaussian), but if Np is more than about 20 I’d not worry too much if it’s just for internal curiosity.
  6. It’s best to do the linear fit by minimising the squared residuals, but where the variance is manifested in the data and not the measurement there’s a chicken-and-egg problem - you can always refine with a second such fit.
  7. The evaluation in 2-4 needs finessing for how many free parameters you have (e.g. the gradient and intercept of your linear fit).  Look into “reduced chi squared” for how to handle that numerically.  This adjusts the value you calculate for quality of fit based on free parameters and, with some work, can give you an estimate of the uncertainty on each fit parameter.
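
Something like this, as a very rough sketch (the data is simulated purely as a placeholder, and all the caveats above apply):

import numpy as np

rng = np.random.default_rng(1)
populations = rng.integers(20_000, 2_000_000, size=100)   # placeholder sample sizes
events = rng.binomial(populations, 0.03)                  # placeholder event counts

# Step 6: linear fit by minimising squared residuals (unweighted first pass).
slope, intercept = np.polyfit(populations, events, 1)
model = slope * populations + intercept

# Step 1: estimated uncertainty on each point, e = sqrt(N*p), using the fitted rate.
p_hat = slope
errors = np.sqrt(populations * p_hat)

# Steps 2-5 and 7: normalised squared residuals and reduced chi squared.
norm_sq_resid = (events - model) ** 2 / errors ** 2
dof = len(events) - 2                                     # two free parameters: slope and intercept
print("reduced chi squared:", norm_sq_resid.sum() / dof)  # expect roughly 1 if the model fits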

Here I have adapted the method I normally use for continuous data and measurement noise to discrete data and random manifestation noise; I expect I'll get crucified by a stats expert who works with the latter when they come along.

Edit: Attached is an example of the kind of plot I'd use - it's all I have to hand and is rather topical.  The black line on the top plot is my "model" - actually a local polynomial fit rather than your global fit of a fixed fraction.  My x-axis is date, yours is sample number.  Grey crosses are black data points.  The middle plot shows the residuals - how far off the black line each data point is.  The bottom plot is the normalised residuals - here I've divided them by the square root of the corresponding model value (dodgy in my case; in your case it would be sqrt(Np)).  The histogram then shows the distribution of the normalised residuals.  The main thing clearly visible here is that the residuals are not uniformly distributed: there is clear correlation along the x-axis, implying they aren't random (they'd fail a "Durbin-Watson" test).  There is a 7-day repeating structure in them associated with weekend/weekday issues.  Yours very definitely should be random if the x-axis is sample size; it's just their scale that you're interested in...

Post edited at 21:49

 Jon Read 20 Sep 2020
In reply to Removed UserBilberry:

Sounds like what you are trying to model is count data (number of events per time period), you want to adjust for the size of the population you are counting from, and also look for association with a couple of covariates. I would suggest a Poisson regression model with population size as an offset term. You would then be able to see how much of an effect each covariate has on the outcome (events rate), either in terms of decreasing or increasing the event rate.

It's not O-level stuff, I'm afraid.
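
In Python it would look roughly like this (a statsmodels sketch; the data is simulated as a placeholder and the column names are invented):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated placeholder data; swap in the real counts, populations and indicators.
rng = np.random.default_rng(2)
n = 100
population = rng.integers(20_000, 2_000_000, size=n)
factor_a = rng.integers(0, 2, size=n)                  # 0/1: population affected by parameter A
factor_b = rng.integers(0, 2, size=n)                  # 0/1: population affected by parameter B
rate = 0.03 * np.exp(-0.10 * factor_a + 0.05 * factor_b)
events = rng.poisson(population * rate)

df = pd.DataFrame({"events": events, "population": population,
                   "factor_a": factor_a, "factor_b": factor_b})

# Poisson regression with log(population) as the offset term.
model = smf.glm("events ~ factor_a + factor_b", data=df,
                family=sm.families.Poisson(),
                offset=np.log(df["population"])).fit()
print(model.summary())
print(np.exp(model.params))    # rate ratios: multiplicative effect of each factor on the event rate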

Post edited at 11:32
 seankenny 21 Sep 2020
In reply to Removed UserBilberry:

I’m a little confused as to why we have assumed the event is discrete (have I missed something?) and also where the 3% occurrence comes from - is there an underlying population model that gives you this?

Perhaps the best thing would be to do an OLS regression including the other two parameters, so that your 0.03 would be the constant term. Then check to see if your residuals are normally distributed - which may well be the case if you include the previously omitted variables - and if they are not, do some kind of weighting, use robust standard errors, etc.
 

You’d need Stata or similar to do this easily.
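
Or in Python rather than Stata, roughly (a sketch with simulated placeholder data and invented variable names):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data; replace with the real figures.
rng = np.random.default_rng(3)
n = 100
population = rng.integers(20_000, 2_000_000, size=n)
factor_a = rng.integers(0, 2, size=n)
factor_b = rng.integers(0, 2, size=n)
events = rng.binomial(population, 0.03 * (1 - 0.05 * factor_a + 0.03 * factor_b))

df = pd.DataFrame({"rate": events / population,
                   "factor_a": factor_a, "factor_b": factor_b})

# OLS of the observed rate on the two factors, with robust (HC3) standard errors.
# The intercept plays the role of the ~0.03 constant term.
ols = smf.ols("rate ~ factor_a + factor_b", data=df).fit(cov_type="HC3")
print(ols.summary())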

