Intra Class Correlation - the Confusing Term Spanning Across ANOVA, Reliability & MLM

Statistician turned Data Scientist with a Psychology background. I create clear, practical content that makes statistics easy to understand.

Note: this post is part of a series of posts about Multi Level Modelling

The Intra Class Correlation (ICC) is a confusing term - because there are many types of ICC, each used in a different context. Even though they all stem from the same concept (between subject variance vs within subject variance), they are not interchangeable with each other.

ICC General Meaning

ICC is formally defined as “the degree of similarity between units inside the same cluster”. Taken literally, it is more of a concept than something concrete for now - because there are many ways you can operationalise it.

At its most generic, the idea is a fraction: between group variance divided by total variance. On its own this is still a meaningless quantity - because you (1) haven’t used it in any particular context of interest, and (2) haven’t figured out how exactly to compute the variances that you want. However, roughly speaking, all the ICC formulas you will subsequently see stem from this same core idea - compare the between group variance to the total variance (or to the within group variance in ANOVA) to see how large it is.

“ICC” in ANOVA

ANOVA is not exactly ICC - even though conceptually, the notion of comparing variances is similar. ANOVA seeks to tell you whether an effect is significant - which is done by comparing the between subject variance to the within subject variance. ICC, in contrast, focuses on the proportion of the between subject variance out of the TOTAL variance. These are very different goals!

But the overall framework of these 2 methods is similar - partition the variance, then compare some parts with each other. In ANOVA, you should recall that we typically compare the Mean Squares of the between and within group variance. The formula looks like this:

F = MS between / MS within = MSB / MSW
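To see the partitioning in action, here is a minimal sketch in plain Python that computes the two Mean Squares and their ratio by hand. The three groups and their scores are made-up numbers, and the function name is my own - it is only meant to illustrate the idea, not to replace a proper ANOVA routine.

```python
# Toy data: three groups with four scores each (made-up numbers).
groups = {
    "A": [4.0, 5.0, 6.0, 5.0],
    "B": [7.0, 8.0, 9.0, 8.0],
    "C": [5.0, 6.0, 7.0, 6.0],
}


def one_way_anova(groups):
    """Return (MSB, MSW, F) for a one-way ANOVA on a dict of score lists."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)

    ss_between = 0.0
    ss_within = 0.0
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        # Between: how far each group mean sits from the grand mean.
        ss_between += len(scores) * (group_mean - grand_mean) ** 2
        # Within: how far each score sits from its own group mean.
        ss_within += sum((x - group_mean) ** 2 for x in scores)

    msb = ss_between / (len(groups) - 1)
    msw = ss_within / (len(all_scores) - len(groups))
    return msb, msw, msb / msw


msb, msw, f_ratio = one_way_anova(groups)
print(f"MSB={msb:.3f}, MSW={msw:.3f}, F={f_ratio:.3f}")
# prints MSB=9.333, MSW=0.667, F=14.000
```

Notice that nothing here is a proportion of total variance - the F ratio only pits the between part against the within part.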

Given its theoretical link to the ICC methodology, ANOVA often gets lumped into talks about ICC - though honestly that hindered my understanding more than it helped me. We seldom call this ratio in ANOVA an “ICC” - the term is instead more commonly used in Psychometrics.

ICC in Psychometrics (Reliability)

Psychometrics is probably the area where you will see the term ICC used the most. If you have a Psychology background like me, this can actually become a stumbling block - because you get confused between the different forms of ICC, and as a result fail to see that the ICC being talked about in ANOVA & MLM is NOT the same ICC you learnt in Psychometrics.

Most commonly, ICC is used to measure reliability in Psychometrics. Why is that the case? This was something I struggled with for quite a while - because it was ingrained in me that reliability is the consistency of measurements - which I just could not relate to the between cluster variance. Shouldn’t only the within cluster variance (measurements of an individual) be considered?

I only managed to convince myself via a thought experiment. Think about it intuitively - imagine you have a tool to measure the height of people. While reliability is often described as the consistency of measurements, in truth this understanding is incomplete - because “consistency” already implicitly factors in the typical variance (between cluster variance) of scores in the first place. 

Say the tool measures the height of an individual to be 170cm, 171cm, 170.5cm. Is this tool “reliable”? Instinctively, you will probably say yes. Why? Because the measurements are quite consistent. More precisely, however, it is because the measurements are quite consistent relative to the variance of height between people (which probably ranges from 100 - 190cm?)

Why is this last part important? Because imagine if you lived in a world where people were all between 168 and 172cm in the first place. Is your tool still reliable then?

Visually, this “scale effect” would look like this:

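The exact same measurements have very different reliability depending on how much scores are supposed to differ across individuals (between subject variance) in the first place! Which is why the formal definition of reliability in Psychometrics is as follows:

Reliability = true score variance / total observed variance = true score variance / (true score variance + error variance)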

Where “true score variance” is essentially a proxy for the “between subject variance”, or the “scale” in which the measurements exist in the first place. (You can kind of think of it as the shooting board in which your arrows land - your shots don’t make sense without that landscape!)
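To put rough numbers on the height thought experiment, here is a minimal sketch assuming the classical definition of reliability (true score variance over true score variance plus error variance). The standard deviations for the tool and for the two “worlds” are made-up values chosen purely for illustration.

```python
def reliability(true_score_var, error_var):
    """Share of the observed variance that is true score variance."""
    return true_score_var / (true_score_var + error_var)


# Assumed measurement error of the tool: SD of 0.5 cm, so variance 0.25.
error_var = 0.5 ** 2

# World 1: heights differ a lot between people (assumed SD of 15 cm).
wide_world = reliability(true_score_var=15.0 ** 2, error_var=error_var)

# World 2: everyone is between 168 and 172 cm (assumed SD of 1 cm).
narrow_world = reliability(true_score_var=1.0 ** 2, error_var=error_var)

print(f"wide spread of heights:   {wide_world:.3f}")
print(f"narrow spread of heights: {narrow_world:.3f}")
# prints 0.999 for the wide world and 0.800 for the narrow world
```

The tool and its error are identical in both cases - only the backdrop of between person variance changed.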

Put more clearly in formula - ICC in psychometrics is always about:

ICC = signal variance / total variance

You will see that the various types of ICC merely differ in the computation of the signal & total variance - rather than being “fundamentally different” from one another. The most popular conceptualisation of this in Psychometrics is the Shrout and Fleiss (1979) types of ICC - which I will be focusing on here as well.

This conceptualisation of ICC generally takes the form of ICC (model type, number of ratings used to compute a person’s score) - or ICC (a, k) for short. There are 3 types of a, and k can be as large as you want it to be - so primarily I will focus on explaining the 3 model types of a.

ICC (1, k) - One Way Random Effects Model (One Way Random Effect ANOVA)

The dataset for a one way random effects model looks like that:

It is imperative to emphasise now: ICC IS NOT (exactly) ANOVA. In ANOVA, the “factor” would be the group conditions, whereas the “raters” are typically “subject scores”. 

Side note: here, subjects is a random factor as well, whereas in traditional ANOVA the grouping would be a fixed factor

Many people (especially those with a psychology background) will struggle with these mental gymnastics - because we are so instinctively used to the subject being the measurements that the notion of the subject being the “factor” just doesn't sit well with us. But if you think of the “factor” as “subject” - and “raters” as “measurements” - then you can start to see how ICC is fundamentally still parallel to ANOVA. Looking at the formula:

You will also notice that it is NOT computing an F ratio at all. The numerator term is trying to measure the “signal” - or the true score variance. Instead of taking the MS between as it is, it first subtracts the MS within (remember - within group variance is now within SUBJECT variance (vertical in the table) - representing the spread of the measurements of a SUBJECT in general) in order to compute the “signal”. The reason is that the between subject variance comprises 2 components - the true score variance, and the inherent measurement error. In order to isolate this true score variance, we need to subtract the error from the between subject (group) variance first.

And now, take a look at the denominator. The computation of “total variance” is given by MSB + MSW, scaled by 5. But wait - the MSB already has an error term! So why are we adding the error term again? Because the error in MSB represents the error reflected in EACH INDIVIDUAL SCORE - whereas the additional MSW reflects the error of computing a FINAL score for the subject (hence this is affected by k, the number of ratings used to compute a subject’s score!)
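As a rough sketch of the one way case, the snippet below computes MSB and MSW from a small table and plugs them into the Shrout and Fleiss (1979) one way forms - assuming (MSB - MSW) / (MSB + (k - 1) × MSW) for a single rating and (MSB - MSW) / MSB for the average of k ratings. The ratings are illustrative numbers, the function name is my own, and I have laid subjects out as rows here purely for convenience in code.

```python
# Rows are subjects, columns are the k ratings of that subject (illustrative numbers).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]


def one_way_mean_squares(table):
    """Return (MSB, MSW), treating each row (subject) as the grouping factor."""
    n = len(table)       # number of subjects
    k = len(table[0])    # ratings per subject
    grand_mean = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(table, row_means) for x in row)
    return ss_between / (n - 1), ss_within / (n * (k - 1))


k = len(ratings[0])
msb, msw = one_way_mean_squares(ratings)

icc_1_1 = (msb - msw) / (msb + (k - 1) * msw)  # reliability of one single rating
icc_1_k = (msb - msw) / msb                    # reliability of the mean of k ratings

print(f"MSB={msb:.2f}, MSW={msw:.2f}")
print(f"ICC(1,1)={icc_1_1:.2f}, ICC(1,k)={icc_1_k:.2f}")
# prints MSB=11.24, MSW=6.26 and ICC(1,1)=0.17, ICC(1,k)=0.44
```

The numerator is the same in both versions - only the “total variance” in the denominator changes with how many ratings go into the final score.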

ICC (2,k) - Two-Way Random Effects Model

ICC(2,k) differs from ICC (1,k) in that it also explicitly models the raters as a source of variance. Of course, this decision is not arbitrary - it is because the data format is fundamentally different from ICC(1,k) as well. The data for ICC(2,k) would look something like that:

It is important to note that BOTH RATERS and SUBJECT are being treated as random effects over here (i.e. not all levels of this variable are being captured). This distinction will be made clearer when we cover ICC(3,k). But for now, let’s look at the ICC(2,k) formula to see how it differs from ICC(1,k).

Notice that we have an additional term inside our formula - MS (rater). This reflects the between rater mean square - representing how each rater’s score, on average, differs from the global average score across all raters.

The important thing to note is that the MSE (previously MSW) - even though it plays the same role as in ICC(1,k) - is actually smaller than it would be in ICC(1,k). This is because as a “residual” variance, it only contains what is left after all other sources of variance (including between rater variance!) have been taken out. As a result the MSE is likely to be smaller than in ICC(1,k), resulting in a stronger “signal” (bigger numerator). This is also the same reason why ICC(2,k) tends to be larger than ICC(1,k) - you accounted for more sources of variance, thus naturally the signal is better!

The denominator term models in the additional rater variance as part of the total variance as well - hence an additional term.
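Here is a minimal sketch of the two way random effects case, assuming the Shrout and Fleiss (1979) forms: (MS subject - MSE) / (MS subject + (k - 1) × MSE + k × (MS rater - MSE) / n) for a single rating, and (MS subject - MSE) / (MS subject + (MS rater - MSE) / n) for the average of k ratings. The table is illustrative, with rows as subjects and columns as raters, and the function name is my own.

```python
# Rows are subjects, columns are raters; every rater rates every subject.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]


def two_way_mean_squares(table):
    """Return (MS_subject, MS_rater, MSE) for a subjects x raters table."""
    n = len(table)       # number of subjects
    k = len(table[0])    # number of raters
    grand_mean = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss_subject = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_rater = n * sum((m - grand_mean) ** 2 for m in col_means)
    # The residual is what is left once subjects AND raters are taken out.
    ss_error = ss_total - ss_subject - ss_rater

    return (
        ss_subject / (n - 1),
        ss_rater / (k - 1),
        ss_error / ((n - 1) * (k - 1)),
    )


n, k = len(ratings), len(ratings[0])
ms_subject, ms_rater, mse = two_way_mean_squares(ratings)

icc_2_1 = (ms_subject - mse) / (ms_subject + (k - 1) * mse + k * (ms_rater - mse) / n)
icc_2_k = (ms_subject - mse) / (ms_subject + (ms_rater - mse) / n)

print(f"MS subject={ms_subject:.2f}, MS rater={ms_rater:.2f}, MSE={mse:.2f}")
print(f"ICC(2,1)={icc_2_1:.2f}, ICC(2,k)={icc_2_k:.2f}")
# prints MS subject=11.24, MS rater=32.49, MSE=1.02 and ICC(2,1)=0.29, ICC(2,k)=0.62
```

On this table the raters clearly differ from each other (large MS rater), which is exactly the variance the residual no longer has to absorb.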

ICC (3,k) - Two Way Mixed Effect Model (Mixed Effect ANOVA)

In ICC (3,k) - subjects is a random factor, while raters are a fixed factor. What does this mean? It means that your raters are NOT arbitrary - they WILL BE the exact same raters you know you are going to use later on. It’s not any Tom, Dick or Harry that will be used as raters - it’s literally Abby, Brandon & Claire that are your raters perpetually.

You will notice that the data looks identical to ICC(2,k) - because it is! The only difference is that raters are no longer random - and you can’t really see this inside the dataset itself. The formula for ICC(3,k) looks identical to that of ICC(1,k):

But it is not identical because this is still fundamentally a 2 way ANOVA. In a two way design, the MSE is computed only after all other sources of variance have been taken out of the total (with n subjects and k raters):

MSE (two way) = (SS total - SS subject - SS rater) / ((n - 1)(k - 1))

This is not the same error term as in the 1 way ANOVA, where only the subject variance is taken out:

MSW (one way) = (SS total - SS subject) / (n(k - 1))

As a result, since MSE is smaller again in ICC(3,k) vs ICC(1,k), ICC(3,k) tends to have higher values than ICC(1,k).

Now take a look at the denominator. For ICC(3,k), we don’t add the rater mean square as an additional part of the total variance - and this is because it is no longer a random effect! If the raters are fixed - there’s no more “uncertainty” in the raters - and as such the reliability computation no longer needs to take this into account.

Since the denominator is smaller, ICC(3,k) thus tends to be higher than ICC(2,k) as well - because there is less uncertainty in the computation of that reliability!
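To compare the three models side by side, the sketch below computes the single rating versions on one illustrative table, assuming the Shrout and Fleiss (1979) forms and taking ICC(3,1) as (MS subject - MSE) / (MS subject + (k - 1) × MSE). The function and variable names are my own, and the ordering it shows holds for this table rather than being guaranteed for every dataset.

```python
# Rows are subjects, columns are raters (illustrative numbers).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]


def mean_squares(table):
    """Return (MS_subject, MS_rater, MSW_one_way, MSE_two_way)."""
    n = len(table)       # number of subjects
    k = len(table[0])    # number of raters
    grand_mean = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    ss_subject = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_rater = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_within = ss_total - ss_subject   # one way residual: raters still inside
    ss_error = ss_within - ss_rater     # two way residual: raters taken out

    return (
        ss_subject / (n - 1),
        ss_rater / (k - 1),
        ss_within / (n * (k - 1)),
        ss_error / ((n - 1) * (k - 1)),
    )


n, k = len(ratings), len(ratings[0])
ms_subject, ms_rater, msw, mse = mean_squares(ratings)

icc_1_1 = (ms_subject - msw) / (ms_subject + (k - 1) * msw)
icc_2_1 = (ms_subject - mse) / (ms_subject + (k - 1) * mse + k * (ms_rater - mse) / n)
icc_3_1 = (ms_subject - mse) / (ms_subject + (k - 1) * mse)

print(f"one way MSW={msw:.2f}, two way MSE={mse:.2f}")
print(f"ICC(1,1)={icc_1_1:.2f}, ICC(2,1)={icc_2_1:.2f}, ICC(3,1)={icc_3_1:.2f}")
# prints one way MSW=6.26, two way MSE=1.02
# prints ICC(1,1)=0.17, ICC(2,1)=0.29, ICC(3,1)=0.71
```

ICC(3,1) uses the same shape of formula as ICC(1,1), but with the smaller two way MSE in place of MSW - which is where the jump from 0.17 to 0.71 comes from on this table.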

ICC in Psychometrics, Summarized

From the table itself, you can’t really tell what k is. k merely represents the number of ratings you eventually use to compute the subject's score - and doesn't have an impact on the data design. This is in contrast to a - which is not an arbitrary statistical decision, but should instead reflect your research design in the first place. And with this, we have learnt how ICC is commonly used in Psychometrics.

ICC in Multi Level Modelling

And finally - we come to the crux of this entire series. ICC is essential in multilevel modelling because it allows us to examine the proportion of variance (out of total variance) due to clustering.

Take a closer look at what ICC is again:

ICC = between group variance / total variance

In psychometrics, the “group” was the subject - but if we change back to an ANOVA way of thinking, the “group” can reflect the “clusters” that the data naturally falls into! Remember - when you do any statistical test, you are interested in examining whether the effect you care about affects the DV across individuals. And an important prerequisite for these statistical tests is that the observations are independent - meaning that they should be randomly selected & unrelated to each other.

If there exist other natural groupings inside your dataset (aside from your fixed effect of interest) - and worse still, these groupings actually significantly affect your dependent variable - then obviously your analysis will be biased! In MLM - the ICC serves as a quick way to verify whether or not the other natural groupings are important - because it allows you to quantify whether or not the proportion of variance explained by this grouping is large relative to the total amount of variance you have.
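As a quick sketch of this check, the snippet below estimates the ICC for a clustered dataset using the one way ANOVA estimator (MSB - MSW) / (MSB + (n - 1) × MSW), which assumes balanced clusters of size n. The wards and scores are made up, and the function name is my own; in practice the ICC in MLM is usually read off the variance components of a fitted intercept-only model instead.

```python
# Made-up outcome scores for patients nested in three wards (4 patients each).
wards = {
    "ward_1": [60.0, 62.0, 61.0, 65.0],
    "ward_2": [70.0, 74.0, 72.0, 72.0],
    "ward_3": [66.0, 64.0, 68.0, 66.0],
}


def cluster_icc(clusters):
    """ANOVA-based ICC estimate for balanced clusters (equal cluster sizes)."""
    sizes = {len(scores) for scores in clusters.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes every cluster has the same size")
    n = sizes.pop()                      # observations per cluster
    g = len(clusters)                    # number of clusters
    all_scores = [x for scores in clusters.values() for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)

    ss_between = 0.0
    ss_within = 0.0
    for scores in clusters.values():
        cluster_mean = sum(scores) / n
        ss_between += n * (cluster_mean - grand_mean) ** 2
        ss_within += sum((x - cluster_mean) ** 2 for x in scores)

    msb = ss_between / (g - 1)
    msw = ss_within / (g * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)


icc = cluster_icc(wards)
print(f"ICC due to ward = {icc:.2f}")
# prints ICC due to ward = 0.88
```

An ICC this high says that patients in the same ward look far more alike than patients from different wards, so treating the 12 scores as independent observations would be hard to justify.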

Still confused? Let me confuse you more. Ideally, you actually want the ICC to be LOW for normal statistical analyses - because it means that this additional grouping is UNABLE to account for a large part of your total variance! In other words - you want “low reliability” in this context - because if you go back to the Psychometrics way of thinking, you don’t want the “true score variance” (variance from your “grouping factor”) to be large relative to the total variance anymore!

Side note: this is obviously an out of context wording - please do not use the term reliability in MLM! But the cross domain usage of the term ICC really confused me a lot - thus I felt it important to just handle the situation upfront so readers can appreciate the parallel.

Remember - in MLM - the hierarchy is a NUISANCE (random effect) variable. You aren’t that interested in it - you’re interested in accounting for it so that your actual effect can be studied properly. As such - if the “true score variance” of your nuisance grouping is high - it means that you really have to double down on MLM and account for this additional source of variance before appropriate conclusions can be drawn about your fixed factor!

Quantification of Variance at Levels

ICC in MLM also has an additional use - to quantify the variances at different levels of the dataset. Say your ICC is 0.4. It means that 40% of the total variance can be explained by the “hierarchy” in the dataset - and is due to the grouping rather than individual differences. 

Why is this important? Say you are the hospital director - thinking of where to create interventions to improve patient survival rates. The ICC you got was 0.7 - meaning that 70% of the variance in patient survival rates can be attributed to the ward the patient was in, and only 30% to the individual patients themselves. Obviously, this means that ward level interventions are going to have a lot more utility than individual level interventions - and your money should be directed to improving ward level variables!

For MLM with >2 levels - the ICC formula can be generalized further to take into account the further hierarchy - and give you the proportion of variance attributed to each level of the dataset. Say you are not just the director of a hospital - but of a cluster of hospitals. Now, 30% of the variance is attributed to the hospital, 50% to the ward, and 20% to individual patients. Still, your intervention should be at the ward level - not at the hospital or the individual patient level!
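The arithmetic behind these percentages is simply each level's variance component divided by the sum of all components. Here is a minimal sketch - the variance components are hypothetical numbers picked to reproduce the 30% / 50% / 20% split above, and the function name is my own.

```python
def variance_shares(components):
    """Map each level to its share of the total variance."""
    total = sum(components.values())
    return {level: var / total for level, var in components.items()}


# Hypothetical variance components from a three level model.
components = {"hospital": 3.0, "ward": 5.0, "patient": 2.0}

shares = variance_shares(components)
for level, share in shares.items():
    print(f"{level}: {share:.0%}")
# prints hospital: 30%, ward: 50%, patient: 20%

# The level holding the largest share is where interventions have the most room to act.
biggest_level = max(shares, key=shares.get)
print(f"largest share of variance: {biggest_level}")
# prints largest share of variance: ward
```

With only two levels, the share of the grouping level is exactly the ICC discussed earlier.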

Conclusion:

What a whirlwind! With the fundamentals set - you know how the confusing term ICC spans across domains, and can better appreciate the appropriate usage of ICC in the context of MLM. With this in place, we can go ahead and start building our MLM model - stay tuned!