There are four techniques that all people scientists need to understand. They are visualization, correlation and interrelationships, prediction and text analytics.
1. Visualization to understand the problem
Visualization is the foundation of people science, but perhaps not in the way you think. In this context I’m not talking about fancy visualizations or presentations. In my opinion, the foundation of people science is using visualization to screen data, understand its distribution and identify any problems.
Whenever I’ve skipped this step in the past, I’ve found that there’s a big gap in my data or that something has been coded incorrectly and is affecting the data. Visualizing your data is the easiest way to stop yourself from making mistakes like that.
There are several websites that specialize in visualization for data screening and can show you many different ways to do this. Look at your own data sets by jumping into R and using simple plots like histograms, scatter plots and multiple scatter plots to look for anything strange. You can quickly see if anything grabs your attention - perhaps the data is bunching to one side or a value looks like it’s in the wrong place.
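As a rough illustration of this kind of screening, here is a minimal sketch in Python rather than R, using only the standard library. The tenure values, the `None` markers for missing answers and the stray 99 code are all made up for the example, and `screen` and `text_histogram` are hypothetical helper names. A real project would more likely use a plotting library, but even a text histogram makes gaps and odd codes stand out.

```python
from collections import Counter

# Hypothetical tenure values in years. None marks a missing response and
# 99 stands in for a miscoded "not answered" value.
tenure_years = [1, 2, 2, 3, 3, 3, 4, 4, 5, None, 2, 3, 99, 1, 2, None, 3, 4]


def screen(values):
    """Return the number of missing entries and a frequency table of the rest."""
    missing = sum(1 for v in values if v is None)
    counts = Counter(v for v in values if v is not None)
    return missing, counts


def text_histogram(counts):
    """Build one line per observed value, with one '#' per observation."""
    return [f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts)]


missing, counts = screen(tenure_years)
print(f"missing responses: {missing}")
print("\n".join(text_histogram(counts)))
```

In the printed histogram the lone 99 sits far away from the other values, which is exactly the kind of coding problem this step is meant to catch.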
Non-visual techniques can also be used, but the visual ones tend to give you more information - and they're more fun!
2. Correlation to understand the structure of the data
Correlation or dimensionality may not seem very complex but they're a fantastic way to understand the structure in your data. Understanding the structure or the interrelationship between the variables will inform what techniques you use down the track.
For example, you may initially think you’re looking at 50 different things, but these may actually correlate into just three or four major underlying factors. It’s then very important and powerful to understand the structure and interrelationship between those three or four things. This can have huge implications for the techniques you use and any problems you have with those techniques down the track.
People often rush into a regression analysis or another modeling technique rather than first looking at factor or principal components analysis. These correlation and covariance techniques are a good way to see the direct relationships between variables and help you truly understand the basic structure within the data. If you don't have this understanding at the outset, you may include predictors in your later prediction models that are strongly interrelated, which makes their individual effects hard to separate and interpret. An understanding of correlations also gives you a sense of how your models may perform before you jump into more complex or higher-level modeling.
To identify interrelationships, first look at the correlation matrix for any data set and inspect and explore it visually. Color code any correlations and covariances - overlay them with different shades and colors to draw your attention to where variables are strongly related or not related. Then look at some resources on principal components as a rich way to engage with subjects before you become too abstract about them. For example, I might review some literature in the psychology domain about personality variables or cognitive abilities to better understand the area before I get started.
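The first step, building a correlation matrix and drawing attention to the strong relationships, might be sketched like this. The survey items and scores are invented, the 0.7 cut-off is an arbitrary assumption, and `pearson`, `correlation_matrix` and `strong_pairs` are hypothetical helpers. Listing the strong pairs stands in for the color coding described above.

```python
from math import sqrt

# Hypothetical survey items, each scored 1-5 by the same eight respondents.
survey = {
    "workload": [4, 5, 3, 4, 2, 5, 3, 4],
    "stress": [4, 5, 3, 5, 2, 4, 3, 4],
    "autonomy": [2, 1, 4, 2, 5, 1, 3, 2],
    "commute": [3, 3, 2, 4, 3, 2, 4, 3],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(data):
    """Nested dict holding the correlation of every variable with every other."""
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}


def strong_pairs(matrix, threshold=0.7):
    """Pairs whose absolute correlation meets the (assumed) threshold."""
    names = list(matrix)
    return [
        (a, b, round(matrix[a][b], 2))
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[a][b]) >= threshold
    ]


matrix = correlation_matrix(survey)
for a, b, r in strong_pairs(matrix):
    print(f"{a} ~ {b}: r = {r}")
```

In this made-up data, workload, stress and autonomy move together while commute stands apart, so what looks like four variables behaves more like two underlying factors.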
3. Prediction - crucial but easy to get wrong
Predictions aren’t a magic bullet - one of the biggest mistakes many people scientists make is to over-emphasize how well they can predict something. Some things are just too difficult to predict and the error rates will be high because you don't have all the variables you need at hand. It’s important to be aware of that and the limitations of the techniques that you’re using.
It’s best to start with things like basic linear regression and logistic regression. You can then move onto more complex and powerful tools like ridge and lasso regressions and random forests. These are helpful tools but it’s important to be aware of how accurate they are when you interpret them and also be able to convey that level of accuracy to others. People need to have a true and practical sense of what you’re really able to predict, and what you can’t.
It’s also common to overfit a model to a small data set and then claim that certain things are the best predictors. Look at the accuracy of the predictions, the error rates and how closely the model has fitted the data. There are many techniques that can help with this, like testing accuracy in holdout samples and checking whether the model replicates in larger data sets.
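A holdout check can be sketched in a few lines. The example below fits a basic linear regression on part of a small, invented data set and compares the error on the cases used for fitting with the error on cases held back. The variable names, the data and the choice to hold out every third case are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute gap between predicted and observed values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: an engagement score and a later satisfaction rating.
engagement = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rating = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1, 9.9, 11.2]

# Hold out every third case; the model never sees these while fitting.
train_x = [x for i, x in enumerate(engagement) if i % 3 != 2]
train_y = [y for i, y in enumerate(rating) if i % 3 != 2]
holdout_x = [x for i, x in enumerate(engagement) if i % 3 == 2]
holdout_y = [y for i, y in enumerate(rating) if i % 3 == 2]

model = fit_line(train_x, train_y)
train_error = mean_abs_error(model, train_x, train_y)
holdout_error = mean_abs_error(model, holdout_x, holdout_y)
print(f"training error: {train_error:.3f}, holdout error: {holdout_error:.3f}")
```

A holdout error that is much larger than the training error is a warning sign that the model has fitted noise in the small sample rather than something that will replicate.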
To avoid mistakes, it’s also helpful to start by predicting something simple or straightforward. For example, if you’re looking at churn, first just look at whether someone leaves an organization or not. Then, as you get more in-depth, you can use techniques that help your model generalize to other samples. This approach will make your ridge or lasso regressions really powerful and can also help when you start to predict nonlinear relationships, groups or categories with tools like random forests.
4. Text analytics are essential for identifying themes
Modern people scientists need to understand text analytics because almost any survey or feedback exercise you run in an organization can generate thousands of comments. Text analytics can help you turn those comments into numerical vectors and identify common themes.
One of the most common mistakes people make with text analytics is to put comments into word clouds straight away. Word clouds are great, but first you have to strip out all of the unimportant words (often called stop words) like ‘it’ and ‘the’. There are some great libraries that can do that and then pass the remaining words to a word cloud or clustering technique for you.
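To show what stripping out unimportant words does, here is a small sketch that counts the remaining words in a few invented comments. The stop-word list is a tiny hand-picked assumption, whereas text libraries ship much fuller lists, and `tokenize` and `content_word_counts` are hypothetical helper names.

```python
import re
from collections import Counter

# A deliberately tiny, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "a", "an", "and", "are", "but", "i", "in", "is", "it", "my", "of", "the", "to", "too",
}

# Hypothetical survey comments.
comments = [
    "The workload is too high and it is stressful",
    "I like my team but the workload is a problem",
    "My manager is supportive and the team is great",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def content_word_counts(texts, stop_words=STOP_WORDS):
    """Count every word that is not in the stop-word list."""
    counts = Counter()
    for text in texts:
        counts.update(word for word in tokenize(text) if word not in stop_words)
    return counts


word_counts = content_word_counts(comments)
for word, count in word_counts.most_common(5):
    print(word, count)
```

Without the filter, ‘is’ and ‘the’ would dominate the counts, and therefore the word cloud. With it, words like ‘workload’ and ‘team’ come to the top.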
Clustering techniques are crucial when doing text analytics. Techniques like word2vec embeddings, K-means clustering and t-SNE (t-distributed Stochastic Neighbor Embedding, a method for visualizing high-dimensional data) allow you to find structure in text data that you haven’t seen before. Before you start working with your own data set, it’s best to use dummy data sets that you can find online. This is because there’s often a lot of data cleaning that needs to be done before you can use a clustering technique. With the dummy data set you can experience some of the joys of text analytics before getting into the weeds.
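As a toy version of this idea, the sketch below turns six invented, already-cleaned comments into simple word-count vectors and groups them with a bare-bones K-means. It rests on several assumptions: it uses plain word counts rather than word2vec embeddings, the starting centroids are fixed by hand so the result is repeatable, and `vectorize` and `k_means` are hypothetical helpers. Real libraries choose starting points more carefully and handle far larger data.

```python
import re

# Hypothetical, already-cleaned comments (a stand-in for a dummy data set).
comments = [
    "workload too high long hours",
    "long hours heavy workload",
    "high workload and long hours",
    "great team supportive manager",
    "supportive manager friendly team",
    "friendly team great manager",
]


def vectorize(texts):
    """Turn each text into a vector of word counts over a shared vocabulary."""
    tokens = [re.findall(r"[a-z]+", text.lower()) for text in texts]
    vocab = sorted({word for doc in tokens for word in doc})
    return vocab, [[doc.count(word) for word in vocab] for doc in tokens]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, initial_centroids, max_iter=20):
    """Bare-bones K-means: assign to nearest centroid, then recompute the means."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vocab, vectors = vectorize(comments)
# Start from the first and last comment (an arbitrary choice for repeatability).
labels = k_means(vectors, [vectors[0], vectors[-1]])
for label, comment in zip(labels, comments):
    print(label, comment)
```

On this tidy dummy data the workload comments and the team comments fall into separate groups. Real comments are much messier, which is why the cleaning step matters so much.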
While these techniques form a great foundation for any people scientist, it’s important not to limit yourself to them. These are just techniques and they can, and should, be combined with techniques from other areas. Have fun and play around with your data, combine and explore it to help you understand and identify what it’s telling you.