In one of my first posts I tried to explain what 'driver analysis' is, and I promised to follow up with a post describing the various types of driver analysis in use and the pros and cons of each method. So here we are. You might like to review that post, but in summary, driver analysis refers to a family of statistical techniques for assessing the relationships between variables. In the context of employee engagement surveys this usually involves assessing the relationship between engagement items (psychological outcome questions about how people think, feel and behave) and all the other workplace items (questions about specific workplace and culture elements) we have in a survey1; the purpose being to identify the best workplace predictors of employee engagement2. These are the things that can be most effectively targeted in the workplace to improve employee engagement (or any other outcome measure we might choose).
So, what are the main types of driver analysis? Let's divide and conquer (and settle on some terminology). I won't be able to make it through all of the techniques in this one post - but we can get halfway there, and I will follow up with another post discussing two further, more technical techniques before providing a final verdict.
I have no idea if the term 'impact analysis' is commonly applied, but I have seen it used as a measure of the specific impact a 'driver' has on an outcome. The most basic form of this analysis involves dividing people into high and low groups on each workplace question - people who score in the bottom 50% are allocated to a low group and people in the top 50% to a high group. We then compare engagement scores for the two groups on each item - this gives us a real engagement score for each group, and the difference between the two becomes the measure of impact. The result is directly understandable and easy to interpret, and it is easily represented graphically in simple bar charts.
Pros - Easy to understand. Easy to visualise. Represented in same units as survey data.
Cons - Doesn't use all the information in the data because we lump so many people together in broad groups. Results can be unstable.
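To make the high/low comparison concrete, here is a minimal Python sketch. The function name impact_score, the toy responses, and the rule for ties (anyone at or below the median goes into the low group) are my own assumptions for illustration; there is no single agreed definition of the split.

```python
from statistics import mean, median

def impact_score(driver_scores, engagement_scores):
    """Median-split 'impact': mean engagement of the high group minus the low group.

    Assumption: respondents at or below the median driver score are 'low',
    everyone else is 'high'. Other tie rules would give different groups.
    """
    cut = median(driver_scores)
    pairs = list(zip(driver_scores, engagement_scores))
    low = [e for d, e in pairs if d <= cut]
    high = [e for d, e in pairs if d > cut]
    if not low or not high:
        raise ValueError("split produced an empty group; cannot compare")
    low_mean, high_mean = mean(low), mean(high)
    return low_mean, high_mean, high_mean - low_mean

# Hypothetical responses from ten people on a 1-5 scale.
driver = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]        # one workplace question
engagement = [2, 2, 3, 3, 3, 4, 3, 4, 5, 5]    # engagement score

low_mean, high_mean, impact = impact_score(driver, engagement)
print(f"low group: {low_mean:.1f}, high group: {high_mean:.1f}, impact: {impact:.1f}")
```

With this toy data the low group averages 2.6 and the high group 4.2, so the 'impact' is 1.6 scale points. Notice that a respondent who answered 4 is treated exactly like one who answered 5, which is the lost information mentioned in the cons.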
Correlational analysis, unlike impact analysis, uses every individual's response separately. There is actually a family of statistics that come under this heading, and they generally differ in the assumptions they make about the data. The best known and most used is the Pearson correlation, which is based on four assumptions. Feel free to do some reading yourself, but personally I would not want to bet a statistician that employee survey data meets these assumptions. Thankfully there are other correlation statistics (part of the non-parametric family) that are better suited to survey data because they just don't make those kinds of assumptions - non-parametric tests also have far groovier names, like the Kolmogorov-Smirnov and the Siegel-Tukey tests (neither of which is a correlation measure, but you get the idea). The correlational statistics best suited to survey data belong to the Kendall tau family - these use the relative rankings of responses rather than the raw scores. Correlations are quite simple to interpret once you get the hang of them too. They range from -1 to 1; where 1 means two variables are perfectly positively related (they move upwards and downwards together) and -1 means they are perfectly negatively related (as one goes up the other goes down and vice versa); 0 means there is no consistent relationship between them (they are pretty well disinterested in each other).
Pros - Uses most of the information in our data. Always in the same units (-1 to 1).
Cons - Slightly more abstract. Harder to show graphically (but can be done simply if necessary).
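For readers who want to see how a rank-based correlation works, here is a small pure-Python sketch of Kendall's tau-b, the variant that corrects for tied responses (and survey data has a lot of ties). The function name and the toy data are my own; in practice you would use a statistics package rather than this pair-by-pair loop, which gets slow for large samples.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: compares every pair of respondents by relative ranking."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied on both questions: ignored
        elif dx == 0:
            tied_x_only += 1         # same answer on x, different on y
        elif dy == 0:
            tied_y_only += 1         # same answer on y, different on x
        elif dx * dy > 0:
            concordant += 1          # the pair is ordered the same way on both
        else:
            discordant += 1          # the pair is ordered in opposite ways
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denom == 0:
        raise ValueError("tau-b is undefined when a variable never varies")
    return (concordant - discordant) / denom

# Hypothetical answers from four people to two questions.
print(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3]))
```

Only the ordering of the answers matters here, so stretching the scale (say, replacing every answer with its cube) leaves the result unchanged. That is why this family of statistics does not need the distributional assumptions that Pearson does.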
Similar to correlation, regression refers to a family of techniques that make different assumptions about the data. However, leaving that aside, the main point about regression is that it tries to do something additional; it tries to find the best combination of variables to predict an outcome. Impact analysis and correlation analysis look at the relationship between every survey question and engagement (or any other outcome) separately, and we look for those with the strongest relationship; regression looks for the best weighted combination of variables (generally the best weighted linear combination of variables in terms of predicting your outcome measure). What does best weighted combination mean? Well, that means one equation, out of thousands of other possible equations, that is picked based on some sort of statistical criterion or process. What does that mean practically? It means that we end up with a situation where the drivers have to be interpreted in the context of the specific equation arrived at. What does that mean more practically? It means there could be other important drivers sitting in very slightly different equations that are potentially better individual predictors of engagement, and it means your results are going to jump around from year to year3 because this technique is very sensitive to the particular dataset the equation is built for.
Pros - It sounds intelligent and difficult and sophisticated. For very large samples it can be academically interesting.
Cons - Employee datasets are usually highly multicollinear which is bad for regression techniques in particular3
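Here is a deliberately tiny, made-up illustration of that sensitivity. Two questions (q1 and q2, hypothetical names) are answered almost identically, and between 'year 1' and 'year 2' just two people change their engagement score by one point. The sketch fits an ordinary least-squares regression with two predictors using the standard closed-form solution and compares the weights with the simple Pearson correlations. All names and numbers are invented for the example, and real survey data will behave less neatly.

```python
from math import sqrt

def centred(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def pearson(x, y):
    cx, cy = centred(x), centred(y)
    return dot(cx, cy) / sqrt(dot(cx, cx) * dot(cy, cy))

def regression_weights(x1, x2, y):
    """Least-squares slopes for y ~ intercept + b1*x1 + b2*x2."""
    c1, c2, cy = centred(x1), centred(x2), centred(y)
    s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
    s1y, s2y = dot(c1, cy), dot(c2, cy)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are perfectly collinear")
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Toy data: six respondents, two highly correlated questions.
q1 = [1, 2, 3, 4, 5, 6]
q2 = [1, 2, 3, 4, 6, 5]
engagement_year1 = [1, 2, 3, 4, 5, 6]
engagement_year2 = [1, 2, 3, 4, 6, 5]   # two people moved by one point

print("correlation between q1 and q2:", round(pearson(q1, q2), 2))
for label, eng in (("year 1", engagement_year1), ("year 2", engagement_year2)):
    b1, b2 = regression_weights(q1, q2, eng)
    print(label,
          "| weights:", round(b1, 2), round(b2, 2),
          "| correlations:", round(pearson(q1, eng), 2), round(pearson(q2, eng), 2))
```

In this toy case the regression weights flip completely, from (1, 0) in year 1 to (0, 1) in year 2, so the 'top driver' changes. The simple correlations barely move: both questions stay at roughly .94 or above in both years.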
In a follow up post I will discuss some other techniques such as Structural Equation Modeling, and Relative Importance (or Weights) measures. These techniques are not commonly used in your day-to-day employee surveys but they do pop up in academic or larger scale research. This post is already well beyond length guidelines for a successful blog post - so let's quickly wrap it up.
In terms of picking a best solution for driver analysis, I would say that a correlational measure provides the best balance of sophistication, pragmatism and interpretability. Unlike impact analysis, correlations do not average all the individuals together into a small number of groups. Unlike regression, correlations do not try to answer a question that you didn't really ask, and correlations are more stable and straightforward to interpret in the face of multicollinearity. In terms of which correlational technique to use, I would recommend a non-parametric statistic that makes no assumptions about the distribution of your data - your survey data is likely to meet almost none of the assumptions of parametric statistics. I hope that was helpful.
1 I'm afraid I can't avoid footnotes here. They should help separate tangential issues for us in this slightly technical post. This footnote is to note that sometimes people use individual questions as the drivers and sometimes they use combinations of questions (themes), much like the engagement indexes. They argue (usually incorrectly) that this provides more statistical reliability and/or validity. There are problems with this approach that are often overlooked though. A simple example will suffice. Imagine we have a question that is actually the most strongly related to engagement, but it is put together with some other questions that are not related to engagement at all - like putting a good player in a really bad team. If we analyse at the question group or theme level we will never know that the item was a good predictor, because it is masked by having to associate with poor predictors, and we will be ignoring an important driver. People argue that this is not the case if their themes are reliable - but reliability is not the same as measurement unidimensionality or construct validity, and assessing those requires other statistical procedures that are almost never used, or practical, for your average employee survey.
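The good-player-in-a-bad-team problem can be shown with a toy example. Below, one invented item tracks engagement perfectly while two other invented items are unrelated to it; the 'theme' score is simply the average of the three. The item names and numbers are made up purely for illustration.

```python
from math import sqrt

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx = [v - mx for v in x]
    cy = [v - my for v in y]
    sxy = sum(a * b for a, b in zip(cx, cy))
    return sxy / sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))

# Six hypothetical respondents.
engagement = [1, 2, 3, 4, 5, 6]
item_a = [1, 2, 3, 4, 5, 6]   # the 'good player': moves exactly with engagement
item_b = [3, 5, 1, 1, 5, 3]   # unrelated to engagement
item_c = [5, 1, 3, 3, 1, 5]   # unrelated to engagement

# Theme score = plain average of the three items for each respondent.
theme = [(a + b + c) / 3 for a, b, c in zip(item_a, item_b, item_c)]

for name, scores in (("item A", item_a), ("item B", item_b),
                     ("item C", item_c), ("theme", theme)):
    print(name, round(pearson(scores, engagement), 2))
```

Here item A correlates 1.0 with engagement, items B and C correlate 0, and the theme lands at about .72. Looking only at the theme, we would see a weaker driver and have no way of telling that one item is doing all the work.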
2 Prediction is always contentious because a relationship does not prove a predictive relationship (or causality). Most people get this intuitively with an example: most of us would accept that weight is correlated with height - and indeed weight would turn up as a 'driver' of height - but most of us also understand that gaining weight will not make us taller. We must also use logic and judgement in what we enter into our driver analyses, or we can get correct but useless information. A frequent issue arises when 'job satisfaction' questions are entered as 'drivers', revealing that job satisfaction is a top driver of engagement - people do this because they believe engagement is more than just job satisfaction. Engagement is more than job satisfaction, but the two are not entirely independent, and one must consider whether it is useful to know that improving job satisfaction will improve engagement. So for a sensible driver analysis we need to focus on 'drivers' that represent aspects of the workplace we believe we may be able to change directly, and avoid things that are really just outcomes similar to engagement.
3 Employee data is nearly always multicollinear, which poses serious problems for most regression techniques. Multicollinear means that responses to many of our common survey questions are highly correlated with one another. Because most employee survey data is confidential and the relevant statistics are rarely reported, this is hard to point to directly. However, academic papers will sometimes provide the relevant statistics. For example, see this nice paper by Schaufeli & Bakker (2006; p. 712) in which they share with us that the median correlation between the factors of their engagement measure is > .9. This does not mean that all questions in employee surveys are this highly correlated, but I often see inter-item and inter-factor correlations in the > .8 range. It is also common for organisations to report their drivers flipping around from year to year, and regression will make this worse. Here's why.