By Dave Anderson:
This has made me retch as a professional data geek:
The state [Florida] has been responsible for helping screen voters since 2006 when it launched a statewide voter registration database. The state database is supposed to check the names of registered voters against other databases, including ones that contain the names of people who have died and people who have been sent to prison.
I'm a data geek. One of my occasional tasks is to integrate my company's data set with lists that outside parties provide to us. A priori, I know that a very large proportion of the individuals should be on both lists. I've blocked out most of tomorrow for this task, as we just got a medium-sized list that needs to be crosswalked into our data set. I'll be working with the data geek intern (yay, I have a .25 FTE minion) to show him the ropes on how to work this process. We talked about the project for twenty minutes this afternoon, and the intern was shocked that this is not an easy process. Surely it is just a matter of comparing names, and names are easy.
Ahhh, to be understandably incompetent in the ways of data. Names suck as unique identifiers. Here are some common problems:
- Junior versus Jr. versus JR versus II
- Dave versus David
- David M Anderson versus DM Anderson versus David Anderson versus D Anderson
- Family groupings don't necessarily follow any coherent naming structure
- Mary Louise Jones versus Mary Louise Smith Jones versus Mary Smith-Jones versus Mary L Smith Jones etc.
My name in particular is a pain in the ass because it pairs a top-10 male first name for my age cohort with a very common last name. Googling "David Anderson" and restricting it to Pittsburgh produces numerous other individuals before you find anything related to me that is not Newshoggers-related. My wife is a bit easier for the data geek, as she has an uncommon first name. But the point is that names are a hideous identifier.
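To make the variant problem concrete, here is a minimal Python sketch of the kind of name normalization a crosswalk might start with. The `NICKNAMES` and `SUFFIXES` tables and the `name_key` function are hypothetical names invented for illustration, not anything from a real matching product, and a real table would need to be far larger.

```python
import re

# Hypothetical lookup tables for illustration; real ones would be much larger.
NICKNAMES = {"dave": "david"}
SUFFIXES = {"jr": "jr", "junior": "jr", "ii": "ii"}


def name_key(full_name):
    """Reduce a name to a (first, last, suffix) key, dropping middle names."""
    # Lowercase and strip punctuation such as the period in "Jr."
    tokens = re.sub(r"[^a-z\s-]", "", full_name.lower()).split()
    if not tokens:
        return ("", "", "")
    suffix = ""
    # Only treat the final token as a suffix if a first and last name remain.
    if len(tokens) > 2 and tokens[-1] in SUFFIXES:
        suffix = SUFFIXES[tokens.pop()]
    first = NICKNAMES.get(tokens[0], tokens[0])
    last = tokens[-1]
    return (first, last, suffix)


# These variants collapse to the same key...
print(name_key("Dave Anderson") == name_key("David M Anderson"))
# ...but initials and hyphenated surnames still slip through.
print(name_key("DM Anderson") == name_key("David Anderson"))
print(name_key("Mary Smith-Jones") == name_key("Mary Louise Jones"))
```

Even this toy version only catches the easy cases. It cannot tell whether "DM Anderson" is the same person as "David Anderson", and it has no opinion on whether "II" and "Jr." refer to the same man, which is the point: cleaning up names narrows the problem but does not turn a name into a unique identifier.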
Names combined with other information can be better as unique identifiers. However, there are strong limitations on using address data such as postal addresses. There are again significant naming convention problems, and ZIP codes have no actual boundaries, only imputed ones. ZIP codes commonly cross multiple municipalities and counties. Furthermore, center cities are often used as mailing addresses for multiple inner-ring suburbs. For instance, I live outside of the Pittsburgh city limits, but my ZIP code means my mailing address is "Pittsburgh, PA". Birthday data is a bit better, assuming accurate data entry, but again, there are numerous David Andersons born on my birthday, and they live in multiple states and have jacked up my credit report more than once.
The intern's eyes were glazing over when I got to the point about propensity scoring (i.e., a match on first name, last name, DOB, and ZIP code but a mismatch on middle initial and suffix is probably a valid match), wild-ass guesses that need to be sent back to the outside vendor for confirmation, and unique identifiers such as Social Security number, UPIN, NPI, or anything else. A match on EIN or TIN or SSN is a solid match.
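The scoring idea above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the field names, the `WEIGHTS` values, and the `accept` and `review` thresholds are made up and not calibrated on any real data, and `classify` is a hypothetical helper rather than a description of how any actual vendor or state system works.

```python
# Illustrative weights only; not calibrated on any real data.
WEIGHTS = {
    "first": 2,
    "last": 2,
    "dob": 3,
    "zip": 1,
    "middle_initial": 1,
    "suffix": 1,
}


def classify(a, b, accept=8, review=5):
    """Compare two records (dicts) and return a match decision."""
    # A shared strong identifier settles the question on its own.
    if a.get("ssn") and a.get("ssn") == b.get("ssn"):
        return "match"
    # Otherwise add up the weights of fields that are present and agree.
    score = sum(
        weight
        for field, weight in WEIGHTS.items()
        if a.get(field) and a.get(field) == b.get(field)
    )
    if score >= accept:
        return "probable match"
    if score >= review:
        return "send back for confirmation"
    return "no match"


# Made-up sample records: same name, DOB, and ZIP, different middle initial and suffix.
ours = {"first": "david", "last": "anderson", "dob": "1970-01-01",
        "zip": "15201", "middle_initial": "m", "suffix": ""}
theirs = {"first": "david", "last": "anderson", "dob": "1970-01-01",
          "zip": "15201", "middle_initial": "", "suffix": "jr"}
print(classify(ours, theirs))
```

With these invented weights, a name-only agreement scores 4 and falls below both thresholds, which is roughly the argument of this post in one number. The sketch ignores the case where both records carry an SSN and the two disagree; a real process would want to treat that as strong evidence against a match.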
The intern's ignorance is understandable, as this is his first exposure to intermediate data geekery. However, Florida's decision to use name matching for anything other than a PSA mailing to remind people to brush their teeth is not defensible as understandable ignorance. It is intentional and willful incompetence by someone, either the hiring entity or the contractor, and if it is the contractor, the state is guilty of neglect.
But that happens to be the entire point of this exercise: intentional neglect is useful to the Florida governing elite.