Statistics question - trends

Stats novice here - if I am looking to establish whether there is a trend in data over time, what is the most appropriate test to use?

The variable is the proportion of people who experienced an outcome of interest in successive years.

Any help gratefully accepted!

 mbh 01 Jul 2021
In reply to no_more_scotch_eggs:

Time series data can be tricky, but a simple approach would be to do a regression analysis on the formula proportion ~ time. 

This works if the time variation of the proportion is more or less linear and will work best if the proportion is never close to either 100% or 0%.

You will get two coefficients out of this, the second of which is the gradient of the line. It will be positive if the proportion is increasing with time, and negative if it is decreasing.

There will also be a 'p-value', which is the probability that you would see a gradient at least as large as the one you got under a null hypothesis that the 'true' gradient is in fact zero - ie no trend. If the p-value is tiny (a common threshold is < 0.05), you can have some confidence that what you are seeing is a real thing. If the p-value is bigger than that threshold, then you can't. 

There are lots of caveats to this, but that is a start.
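
For what it's worth, a minimal sketch of that regression in Python (statsmodels' formula interface), assuming the yearly proportions are already to hand; the column names and values below are invented for illustration:

# Ordinary least squares on 'proportion ~ year'.
# Data values are placeholders, not real figures.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "year":       [2014, 2015, 2016, 2017, 2018, 2019, 2020],
    "proportion": [0.13, 0.14, 0.15, 0.14, 0.16, 0.17, 0.18],
})

fit = smf.ols("proportion ~ year", data=df).fit()
print(fit.params["year"])    # the gradient: positive = rising, negative = falling
print(fit.pvalues["year"])   # p-value against the null hypothesis of zero gradient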

 Jon Read 01 Jul 2021
In reply to no_more_scotch_eggs:

If you want to explore it informally and visually, use a running average (7 days, or a time window appropriate to the data).

For more formal tests, it depends on the data. If your data are independent single observations from individuals, then you could regress the binary outcome against year as a linear term -- this would be logistic regression.
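
A rough sketch of both suggestions in Python (pandas for the running average, statsmodels for the logistic regression on individual-level data); all names and values here are invented for illustration:

import pandas as pd
import statsmodels.formula.api as smf

# Informal visual check: a 7-point running average of a daily count series.
counts = pd.Series([3, 5, 4, 6, 7, 6, 8, 9, 7, 10, 11, 12],
                   index=pd.date_range("2021-01-01", periods=12, freq="D"))
smoothed = counts.rolling(window=7, center=True).mean()

# Formal test on individual-level data: logistic regression of a 0/1 outcome
# against year, one row per person (toy data, far too small to be meaningful).
patients = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2018, 2018, 2019, 2020, 2021],
    "died": [0,    1,    0,    0,    1,    0,    1,    0,    1,    1],
})
logit_fit = smf.logit("died ~ year", data=patients).fit()
print(logit_fit.params["year"])   # change in log-odds per year
print(logit_fit.pvalues["year"])  # p-value against no trend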

 Jack B 01 Jul 2021
In reply to no_more_scotch_eggs:

There are different tests and techniques for different types of data.  I know about some but not others. What are your variables like?

e.g. for time: do you have a load of dates when people experienced the outcome, or is it grouped into months/years already?  If the latter, roughly how many groups do you have and how many are in each group?

For outcomes: is it just did/didn't, or is there a scale, and if it's a scale how many steps does it have?

How many data points do you have? Tens? Thousands?

What proportion of people have the outcome? Is it close to 50/50 or is it one in a million? Can people have the outcome more than once?

Post edited at 12:06
 wintertree 01 Jul 2021
In reply to no_more_scotch_eggs:

What sort of noise statistics on your measurement do you expect? If the noise is gaussian or poissonian and expected to be uncorrelated over time....

  • I'd fit a 1st order polynomial (y=k0 + k1.x) to the data, and plot that as a line with scattered data markers.  
  • I'd then plot the residuals - the difference between the data and the fit.
    • If the residuals show no trends or correlation, and have a magnitude about that of the square root of the corresponding data points, it looks like the straight line describes the trend well to the expected noise levels.  At this point you can start to worry about the error bars on the fit parameters, so that you can describe them with some statistical confidence - or show that k1 ~ 0 is a good description, implying no trend.
    • If the residuals show a lot of structure (a whole bunch negative say, then some positive), this tells you that the trend is described by a higher order polynomial.  So, you can try the same sort of fitting approach with those.

That's always a handy way of taking a first stab at it to see if there's anything interesting in the data.  Beyond that the way you get a statistical probability that your trend is real depends wildly on the field and sub-field you are in.  I prefer fields where people use an analysis of the noise in the system to find probabilities...
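
A quick sketch of that first stab in Python (numpy + matplotlib), with made-up yearly proportions standing in for the real data:

# Fit a 1st order polynomial (y = k0 + k1*x) and inspect the residuals.
# The values are placeholders for illustration only.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2014, 2015, 2016, 2017, 2018, 2019, 2020])
y = np.array([0.13, 0.14, 0.15, 0.14, 0.16, 0.17, 0.18])

k1, k0 = np.polyfit(x, y, 1)      # polyfit returns coefficients, highest order first
fit = k0 + k1 * x
residuals = y - fit

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(x, y, "o", label="data")
ax1.plot(x, fit, "-", label=f"fit: k1 = {k1:.4f}")
ax1.legend()
ax2.plot(x, residuals, "o")       # structure here suggests a higher order trend
ax2.axhline(0, color="grey")
plt.show()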

Edit: If you're wondering "noise, what noise?" I don't mean measurement error but the statistical variation in outcomes over people.  If you took this dataset from each of a hundred separate groups of similar people, what would the standard deviation be on each data point?

Post edited at 12:18
In reply to no_more_scotch_eggs:

Thanks everyone- some clarifications

Data not collected yet. It will be healthcare related, looking at the proportion of people admitted to specialist wards who die during/soon after their admission.

Looking at a 7 year time series, with numbers who died/total admissions, giving a proportion who die in each year. The proportion who die will be smaller than the proportion who don't - around 15% from some pilot data. Likely to be data relating to around 2000 admissions, but multiple sites, so it may be available at summary level per site.  The hypothesis is that the proportion dying is rising over time as the characteristics of people admitted change due to a number of factors. 

 

 wintertree 01 Jul 2021
In reply to no_more_scotch_eggs:

Okay; numbers are large enough to fairly assume gaussian noise. 

Define:
x: time [years]
n: number of admissions [people/year]
d: number of deaths [people/year]
r = d/n : fatality rate [/year]

I would estimate the vertical error bar on the 'r' data points as:
e_r = r × sqrt(1/d + 1/n)

Which when d << n simplifies to:

e_r = sqrt(d)/n

This interprets as the dominant source of uncertainty being the Gaussian (actually large-number Poisson) noise on the smaller variable, d, and that translating fractionally into the ratio. 

I would then plot r vs x as a scatter plot, and do a least squares regression to the data to get a model, m, of the form:

m = k₀ + k₁x

I would then plot this as a line on the same plot.

  • If this gives k₁ > 0, and the line generally passes through the errorbars, and the rise over the period of 7 years is more than one full errorbar (7k₁ > 2×e_r), then it's looking odds on that you've got a significant, time correlated increase.   This is a coarse test (sketched in code below); proper ones exist to give confidence values.
  • If the line passes through the errorbars and rises, but the rise over time is less than a full errorbar, then you can probably also scribble a falling line that passes through the errorbars, and so the significance of your finding will be small.
  • If the line doesn't pass through most errorbars, you either have higher order behaviour than linear, or you have inconsistencies in your data.
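
For what it's worth, a minimal Python sketch of the recipe above, following wintertree's definitions; the admission and death counts are made up purely for illustration:

# Errorbar estimate e_r = r*sqrt(1/d + 1/n) and an errorbar-weighted linear fit.
# All counts below are placeholders.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2014, 2015, 2016, 2017, 2018, 2019, 2020])   # year
n = np.array([280, 290, 300, 270, 310, 290, 260])          # admissions per year
d = np.array([38, 41, 45, 42, 50, 49, 47])                 # deaths per year

r = d / n                                  # fatality rate
e_r = r * np.sqrt(1.0 / d + 1.0 / n)       # ~ sqrt(d)/n when d << n

# Weighted least squares fit of m = k0 + k1*x, weighting each point by 1/e_r.
k1, k0 = np.polyfit(x, r, 1, w=1.0 / e_r)

plt.errorbar(x, r, yerr=e_r, fmt="o")
plt.plot(x, k0 + k1 * x, "-")
plt.show()

# Coarse check: rise over the 7 years vs one full errorbar (7*k1 vs 2*e_r).
print("rise over 7 years:", 7 * k1, "  typical full errorbar:", 2 * e_r.mean())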

Edit: This is rather a physical sciences approach to health data, I imagine health people use different tests and methods; what you really need depends on what the purpose of your outputs is and who you need to convince I suspect.

Post edited at 13:47
cb294 01 Jul 2021
In reply to no_more_scotch_eggs:

For starters I would try regression analysis: fit your points to a line (unless you have an a priori reason to assume e.g. exponential or asymptotic growth) and check how well that line fits your data. The slope of the line will be your "trend".

Whichever program you use (this is easily doable even in Excel), it will give you a so-called correlation coefficient, R (or R² for goodness of fit). The closer this is to 1, the better the linear function fits your data.

There are additional ways of testing how well your regression model fits the data, e.g. the F test and others, which measure how likely it would be to get a result as skewed as yours if the hypothesis underlying the regression model (e.g., "the fraction of happy customers increases by 3% each year") is wrong (i.e. the fraction of happy customers does not depend on time). Again, this so-called p-value will be given automatically by the regression analysis in Excel; obviously, the smaller the p-value the better.
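
To make that concrete, a small sketch using scipy's linregress (rather than Excel), with illustrative numbers; it returns the slope, the correlation coefficient and the p-value in one call:

# Simple linear regression: slope, R and p-value in one call.
# The proportions are made up for illustration.
from scipy.stats import linregress

years    = [2014, 2015, 2016, 2017, 2018, 2019, 2020]
fraction = [0.13, 0.14, 0.15, 0.14, 0.16, 0.17, 0.18]

result = linregress(years, fraction)
print(result.slope)        # the "trend"
print(result.rvalue ** 2)  # R^2: the closer to 1, the better the linear fit
print(result.pvalue)       # p-value against the null of a zero slope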

CB

In reply to wintertree:

Thanks Wintertree - that was clearly set out. If I'm following correctly, you're suggesting regression analysis? At this point I'm just writing a proposal and need to set out the planned analysis method, but don't need to give details beyond the method being proposed. I can get my head round chi squared and t tests, but this is a bit more complex… Would it be fair to say regression analysis is the method, or are there different types of regression, so I would need to specify which type? 

 wintertree 01 Jul 2021
In reply to no_more_scotch_eggs:

I suffer with terminology greatly, word soup and all that.  Sorry, I'd have gone into less detail if I'd known this was for a proposal, not for applying to the pilot data.

To me, the "linear regression" is a form of "regression analysis" that is used to fit a linear model to data.  

Techniques such as looking at the R^2 value of that "linear regression" would form a further part of "regression analysis", so "fitting a model with linear regression and examining the R^2 values" would be a form of "regression analysis".

With 7 years of data, assuming this effect is small in scale and not large, a linear model would seem appropriate to test the hypothesis.

(Aside: I am not a fan of R^2 in many cases as an end point, as it does not take into account the explicit statistics of the noise on the inputs, just that there is a mismatch between the data and the fit.  If you consider the expected sources of noise, then you can test your hypothesis in the context of that.  How well does my model describe the data in the presence of the expected noise?  What do the noise weighted residuals tell me about systematic errors in my model fit?  The method I described is a hand wavy step towards using chi squared tests on the model fit, which opens up this stuff rather than sweeping it under the carpet into a black box generated p-value.)
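
In case it helps, a hand-wavy Python sketch of what that chi-squared check on a fit looks like, reusing the made-up counts and errorbars from the earlier sketch; a reduced chi-squared near 1 means the model matches the data within the expected noise:

# Chi-squared goodness of fit for a linear model with estimated errorbars.
# x, n, d are the same placeholder counts as in the earlier sketch.
import numpy as np
from scipy.stats import chi2

x = np.array([2014, 2015, 2016, 2017, 2018, 2019, 2020])
n = np.array([280, 290, 300, 270, 310, 290, 260])
d = np.array([38, 41, 45, 42, 50, 49, 47])

r = d / n
e_r = r * np.sqrt(1.0 / d + 1.0 / n)
k1, k0 = np.polyfit(x, r, 1, w=1.0 / e_r)

weighted_resid = (r - (k0 + k1 * x)) / e_r   # noise-weighted residuals
chisq = np.sum(weighted_resid ** 2)
dof = len(x) - 2                             # 7 points minus 2 fitted parameters

print(chisq / dof)          # reduced chi-squared: ~1 means fit matches data within noise
print(chi2.sf(chisq, dof))  # probability of a chi-squared at least this large by chance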

Sorry, not much help on the terminology.  Probably just confusion and ranting.

Post edited at 14:25
 seankenny 01 Jul 2021
In reply to no_more_scotch_eggs:

It's worth bearing in mind that, as this is a time series, you may well find an autoregressive model works quite well, ie that the value of variable Y at time t depends on Y at t-1. (With only seven years you can't do more than a one period lag really...) This would definitely show a trend.
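
A minimal sketch of a one-lag autoregression in Python (statsmodels AutoReg), purely to show the mechanics; with only seven yearly values the estimate will be very noisy, and the numbers are invented:

# AR(1): regress the yearly rate on its own value one year back.
# Seven points is barely enough, so treat this as illustrative only.
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

rate = np.array([0.13, 0.14, 0.15, 0.14, 0.16, 0.17, 0.18])  # made-up yearly proportions

ar_fit = AutoReg(rate, lags=1, trend="c").fit()
print(ar_fit.params)   # [constant, coefficient on the one-year lag]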

Given that the time series is so short, might it be better to think of this as panel data and then include the other factors, ie the patient characteristics, in your model so you can control for them? For grant writing purposes this would presumably be described as a regression analysis for sure. I am not super-hot on panel data analysis so I'm not sure exactly how you'd proceed (some kind of fixed effects model?), but it wouldn't be hard to find out. Is there anything you can use as a "treatment", even if it's the result of a natural experiment rather than an actual experiment? If so, a difference-in-differences model might work.
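
One simple way to get site fixed effects, if the data arrive as yearly summaries per site, is to put the site in as a categorical term in the regression; a hedged sketch with an invented layout:

# Yearly fatality rate per site, with site fixed effects via a categorical term.
# The DataFrame layout and values are invented for illustration.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020],
    "rate": [0.14, 0.15, 0.17, 0.12, 0.13, 0.15, 0.16, 0.17, 0.19],
})

fe_fit = smf.ols("rate ~ year + C(site)", data=panel).fit()
print(fe_fit.params["year"])   # yearly trend after allowing each site its own level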

I'd still follow wintertree's suggestions re your residuals!

Post edited at 15:14
In reply to seankenny:

Thanks, and to Wintertree- that’s the first time I’ve had regression explained in a way I can follow! 
 

And to everyone who answered - UKC remains a great resource for rapid replies to quite technical questions! 

 Jon Read 01 Jul 2021
In reply to no_more_scotch_eggs:

With respect to all posting here, given this sounds like proper human subject research (rather than someone asking for help with a hobby) you should be consulting with an in-house statistician or epidemiologist.  I can try to help direct you if you send me a pm.

Jon.

In reply to no_more_scotch_eggs:

A final question - the time series will include a period prior to the pandemic, which I think makes sense to analyse in itself, as it represents the effect of secular trends across a range of factors (bed numbers, numbers of people with the condition, etc) and is of primary interest.

But by the time this work happens (if it happens…) we will have over 2 years of pandemic-affected data too. Investigating the impact of the pandemic is not the primary purpose of this, but it would seem strange to exclude it, and excluding it would leave the work open to the criticism that it was out of date. 
 

So how to handle the pandemic years? Re-run the regression extending it through 2020-2021? Or combine the two pandemic years’ data, and compare to the two years prior combined, with something like a chi squared test? 

 wintertree 01 Jul 2021
In reply to no_more_scotch_eggs:

I'd second Jon's comments about getting someone domain specific to help with this (sort of what I was getting at with "This is rather a physical sciences approach to health data, I imagine health people use different tests and methods; what you really need depends on what the purpose of your outputs is and who you need to convince I suspect.").  Precise issues of language and method choice are probably going to make a big difference to how the reviewers view it.  I don't think health or life science people like assigning estimated errorbars based on expected noise statistics; they tend to use more "black box" methods, on which I shall restrict my comments lest I just add confusion.

In terms of the 2 potentially exceptional years: 2 data points is basically not enough to measure/fit/test anything, because you can fit basically anything other than a constant to any 2 data points regardless of their values. I'd probably run the regression on the pre-pandemic period, and then project that fit forwards into the pandemic era (as in just extend the straight line k0+k1x), and see where the pandemic data points and their errorbar estimates lie with respect to the line, and give a limited interpretation of whether they sit well with it or are way above it, with the latter suggesting some conflation. But again, people in the field might have a different take.
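
A rough sketch of that projection idea in Python, reusing the earlier made-up pre-pandemic numbers and inventing two pandemic-era points purely for illustration:

# Fit on the pre-pandemic years only, extend the line into the pandemic era,
# and see where the pandemic points sit relative to their errorbars.
# All values are placeholders.
import numpy as np
import matplotlib.pyplot as plt

x_pre = np.array([2014, 2015, 2016, 2017, 2018, 2019])
r_pre = np.array([0.13, 0.14, 0.15, 0.14, 0.16, 0.17])
e_pre = np.full_like(r_pre, 0.02)

k1, k0 = np.polyfit(x_pre, r_pre, 1, w=1.0 / e_pre)

x_pand = np.array([2020, 2021])
r_pand = np.array([0.21, 0.23])
e_pand = np.full_like(r_pand, 0.025)

x_all = np.arange(2014, 2022)
plt.errorbar(x_pre, r_pre, yerr=e_pre, fmt="o", label="pre-pandemic")
plt.errorbar(x_pand, r_pand, yerr=e_pand, fmt="s", label="pandemic")
plt.plot(x_all, k0 + k1 * x_all, "-", label="pre-pandemic trend, extended")
plt.legend()
plt.show()

# How many errorbars above the extended line do the pandemic points sit?
print((r_pand - (k0 + k1 * x_pand)) / e_pand)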

Post edited at 16:02
In reply to wintertree and Jon:

thanks- and PMed

 seankenny 01 Jul 2021
In reply to no_more_scotch_eggs:

> So how to handle the pandemic years? Re-run the regression extending it through 2020-2021? Or combine the two pandemic years’ data, and compare to the two years prior combined, with something like a chi squared test? 

Using a dummy variable to split the data into pre/post-pandemic sets should control for the effects of the pandemic period. As wintertree says, two years isn't enough, but at least this way you can extend the method using future data.
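
In Python terms, that dummy-variable regression might look something like this (the values and the 'pandemic' column are invented for illustration):

# Linear trend plus a pandemic dummy (0 = pre-pandemic year, 1 = pandemic year).
# All values are made up.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "year":     [2015, 2016, 2017, 2018, 2019, 2020, 2021],
    "rate":     [0.13, 0.14, 0.15, 0.14, 0.16, 0.21, 0.23],
    "pandemic": [0, 0, 0, 0, 0, 1, 1],
})

dummy_fit = smf.ols("rate ~ year + pandemic", data=df).fit()
print(dummy_fit.params["year"])      # underlying yearly trend
print(dummy_fit.params["pandemic"])  # estimated level shift in the pandemic years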

Andy Gamisou 03 Jul 2021
In reply to no_more_scotch_eggs:

There's a quite reasonable Facebook group (if you use FB) called "Statistical Data Analysis" which has members quite good at answering this sort of question.

In reply to wintertree:

I think that’s probably better than ‘shove it into Matlab and choose the fit that supports the outcome you want’. That’s how stats work isn’t it? 😂

1
cb294 04 Jul 2021
In reply to paul_in_cumbria:

Results first, data later, obviously! How else would you get your pet model published?

In all seriousness, you do have to have a model of your process before you start fitting your data!

Do you expect linear growth, exponential growth, saturation behaviour, oscillations, or whatever? Just taking a cloud of dots and drawing a line through it tells you nothing.

CB

In reply to cb294:

Always a big surprise when an FFT turns up something interesting…

 sparrigan 05 Jul 2021
In reply to no_more_scotch_eggs:

To second what Jon has said (and again with no disrespect intended to anyone posting here), I would *strongly* recommend that you involve an experienced statistician in your work if the results are important and will be used to draw actionable conclusions.  Without wanting to be patronising, there is an awful lot of context needed to determine which statistics can be correctly used to make inferences and a lot of care needed not to invalidate them. Stats is sadly a subject where a little bit of knowledge can be very dangerous. You will most likely simply not know if you get it wrong.

If you're associated in any way with an academic institution, they may well have a team specifically tasked with advising researchers on the use of statistics.

 wintertree 05 Jul 2021
In reply to paul_in_cumbria:

> I think that’s probably better than ‘shove it into Matlab and choose the fit that supports the outcome you want’. That’s how stats work isn’t it? 😂

Working across several discipline boundaries it's tempting to chime in...   I'll be polite, not least 'cos I'm the dunce in the room on this stuff.  What is interesting to me is how different subjects have evolved to use different sorts of statistical methods to test what's fundamentally similar data against their hypotheses.  Different cultures and histories exist in different fields and it's not always trivial to present one set of results across the boundaries.

Being an unsubtle person, I much prefer the kind of experiment that has a sufficiently unsubtle result that the nuances of statistical testing are replaced by a gap between error bars you could drive a bus through.   Assuming the errorbars aren't horrendously under-sized due to some methodological whoopsies arising from not understanding the true nature of the noise in your system.  I sometimes wonder if measurement systems should be taught from the noise up...

In reply to cb294:

> In all seriousness, you do have to have a model of your process before you start fitting your data!

One can always start with the hypothesis that there is no dependency, and test that.  

> Just taking a cloud of dots and drawing a line through it tells you nothing.

Unless you can put some errorbars on them..

cb294 06 Jul 2021
In reply to wintertree:

> One can always start with the hypothesis that there is no dependency, and test that.  

Why would you?  You very rarely operate in an intellectual vacuum where you have no idea what is going to happen or how something works. The days of going without a preconception and measuring any arbitrary phenomenon that looks interesting are pretty much over, in my branch of biology at least.

If you decide to throw resources at an experiment you better have a model you want to test!

> Unless you can put some errorbars on them..

The coolest "error bars" I have to deal with are instrument response functions of FLIM microscopes, where the error along the time axis is dominated by which side of the semiconductor layer of the APD the first electron got kicked out......

CB

 wintertree 06 Jul 2021
In reply to cb294:

> Why would you?  You very rarely operate in an intellectual vacuum where you have no idea what is going to happen or how something works. The days of going without a preconception and measuring any arbitrary phenomenon that looks interesting are pretty much over, in my branch of biology at least.

It's very field dependent, I think.

There're plenty of measurements out there where people expect there to be no dependency upon a certain variable, and it's still good form to check that there is no such dependency.   For that, your model is "no dependency" and a test against that is a perfectly valid model. You may have many variables and expect relations on some; checking for an absence of dependency on the other variables is more a case of good housekeeping and verifying that what you expect to be a dimension of statistical noise is just noise.

> If you decide to throw resources at an experiment you better have a model you want to test!

In the OP's case, having 7 time points and - presumably - expecting low growth, Taylor's theorem pretty much guarantees that if the effect is real and "sane", they're going to find a linear model and will struggle to get any statistical differentiation between higher order fits.  Perhaps herein lies a difference between designing and resourcing a microscopy experiment and scrabbling for significance in public health statistics, where you can't generally get large sample sizes, and where you can't access parallel universes to perform replicates.   

As the astute reader will note, I've almost completely avoided putting significance on any of the visualisations I've been doing over Covid; I'm not convinced there's a method I would actually believe if applied to the data. I'm sure some see that as an ultimate cop out on my behalf...  

> The coolest "error bars" I have to deal with are instrument response functions of FLIM microscopes, where the error along the time axis is dominated by which side of the semiconductor layer of the APD the first electron got kicked out......

Where do you fall on APDs vs PMTs for this stuff?  Neither have the tidiest noise statistics...

Having moved away from hi-tech microscopy towards rather lo-fi microscopy at large scales, I'm having withdrawal pangs from all the cool tech.  Then again my mobile phone has a photon counting, time of flight SPAD array in it.  Madness.

Post edited at 16:24
cb294 06 Jul 2021
In reply to wintertree:

> Where do you fall on APDs vs PMTs for this stuff?  Neither have the tidiest noise statistics...

APDs all the way; my main issue is sensitivity.

> Having moved away from hi-tech microscopy towards rather lo-fi microscopy at large scales, I'm having withdraws pangs from all the cool tech.  

Don't worry, you are just suffering from gadget envy:

youtube.com/watch?v=fui3H8j6phY&

> Then again my mobile phone has a photon counting, time of flight SPAD array in it.  Madness.

Remember how you would have killed for that kind of detector even 15 years ago?

 wintertree 06 Jul 2021
In reply to cb294:

> Remember how you would have killed for that kind of detector even 15 years ago?

I nearly worked with an early SPAD array about a decade ago; it was very cool then and it's just unbelievable that it's in my phone now.  The whole sensor suite in there is just mind blowing.

In terms of my applications at the time, however, both SPAD arrays and emCCDs came out worse than a decent, cooled, back-thinned CCD.  The multiplicative noise statistics on the photon detections are worse.  To this day, I'm not convinced that many uses of emCCDs in biological microscopy are justified...

In reply to wintertree:

SPAD array… I have a horrific vision of a grid like pattern of Dominic Cummings clones… 😳

Re the data mentioned in my OP - we'd be limited to what informatics teams are able and willing to get for us… and in reply to others who've suggested getting formal stats input from the institution: as a clinician on the periphery of university departments it's possible, but the timescales for this were tricky, and the proposal is only a rough outline at this stage - we will certainly be doing so if we get to the next stage! Thanks to everyone for the advice, and the interesting discussion that's followed.

