I don’t have all of the answers, but I do have an opinion.
My take? Most data is messy, or at least contains some mess.
What do I mean by this? Very little data collected in the ‘real world’ is perfect. Even in controlled experimental designs, &*^% happens. And in quasi- or non-experimental designs, a plethora of contextual, design, and other study factors can muck up data.
And sure, I’ve heard the saying, “garbage in, garbage out,” but I think it’s a bit unrealistic. It almost implies that real-world data, which nearly always includes some mess (usually unintentional), can’t be useful. Of course, this doesn’t mean that we should create garbage data collection tools and expect rainbows and unicorns in our data, but I do think we can make use of data from less-than-perfect tools (after all, isn’t it often only clear that tools are less than perfect once data is in hand?).
So, how do we make use of what data we get, messy as it may be? And when is messy too messy to be useful? What do you do to ensure data integrity, to the extent possible?
In my experience, a great deal of mess can be avoided by being intentional about data collection. For starters – collect only what’s needed, in as simple a manner as possible, train data collectors adequately, and take care to match methods to needs (not necessarily in that order). This takes time – it means being clear about what the data needs are, and taking the time to align the questions asked and/or the tools used to collect that data to those needs. And preferably to test those questions or tools ahead of official use. Did I mention that this takes time? I think the time required is a huge factor in the why-bad-data-is-collected equation.
And when the data arrives messier than intended? Careful, intentional assessment of what’s there, what isn’t, and what that means for the planned analysis is key. Then messy data might mean re-grouping a bit – reorganizing your plans for analysis – and preparing an explanation of why, if need be. If you’ve involved stakeholders throughout the evaluation process, it’ll be all the easier to explain data limitations once data is in hand. In fact, stakeholders may be able to help you determine the degree to which messy data is limiting to begin with (or may help identify data limitations that an evaluator without a full enough understanding of the program context might miss).
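For readers who work with tabular data, that “assessment of what’s there and what isn’t” step can be as simple as profiling missingness before committing to an analysis plan. Here’s a minimal sketch in Python with pandas; the column names, toy data, and the 30% threshold are my own illustrative assumptions, not anything from a real evaluation:

```python
# Hypothetical sketch: a quick "what's there and what isn't" pass over
# survey-style data, flagging columns too sparse for the planned analysis.
import pandas as pd


def assess_missingness(df: pd.DataFrame, max_missing_frac: float = 0.3):
    """Return the fraction of missing values per column, plus a list of
    columns whose missingness exceeds the (assumed) tolerance."""
    missing = df.isna().mean()  # per-column fraction of missing values
    too_sparse = missing[missing > max_missing_frac].index.tolist()
    return missing, too_sparse


# Toy data: one mostly complete column, one mostly empty one.
df = pd.DataFrame({
    "satisfaction": [4, 5, None, 3, 4],
    "open_comment": [None, None, "great", None, None],
})

missing, too_sparse = assess_missingness(df)
print(too_sparse)  # columns that may force a re-grouped analysis plan
```

The point isn’t the code itself – it’s that making the “how messy is too messy” judgment explicit (here, a threshold) gives you something concrete to discuss with stakeholders when plans need to change.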
Now, before I am accused of using bad data or analyzing data that shouldn’t be analyzed in the first place, know that I do think there is such a thing as data that is too messy. I also think this is rare. In most cases, there’s something to be learned from any data captured; it just might not be what you thought you were going to learn.
What do you think? Any good messy data stories to share? Or other words of wisdom?