This is a guest post by Nick Parry. I recently gave Nick the challenge of creating a self-service way for our non-Tableau Desktop users to perform fuzzy matches against data they keep in spreadsheets. What he came up with was pretty ingenious:
Last week our team got a request to determine the value of a healthcare education campaign. The requesters had Excel sheets containing lists of potential patients and wanted to pull any matches from our patient data. Our goal was to pull a fuzzy-matched list of patients, based on last name and date of birth, for them to analyze.
Blending the data wasn't an option, since our patient source is so large and the lists they wanted to pull were kept only in Excel (and may or may not be updated regularly). They also had to look for hundreds of patients at a time, so manually filtering by those patients would have been time-consuming. What I wanted was a way to pull all of the patients at once, as easily as possible.
What I came up with was to let the user enter the full list of patient names and DOBs into a string parameter, and to create a filter that returns only the rows contained within that list. This way the user can copy an entire Excel column, paste it into the parameter, and pull the data with one search. Below is the calculation I used.
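It looks something like this, where [Patient List] is the string parameter the user pastes into and [Last Name] and [Date of Birth] are the patient fields (these names are placeholders; the date is rebuilt as M/D/YYYY to match the format of the pasted list):

    CONTAINS(
        UPPER([Patient List]),                  // the string parameter the user pastes into
        UPPER([Last Name] + ", "                // rebuild "Name, M/D/YYYY" to mirror the pasted entries
            + STR(MONTH([Date of Birth])) + "/"
            + STR(DAY([Date of Birth])) + "/"
            + STR(YEAR([Date of Birth])))
    )

Dropped onto the Filters shelf with True kept, this returns only the rows whose name and DOB combination appears somewhere in the pasted string.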
Essentially, this calculation concatenates patient name and birth date in our patient encounter data sources and checks whether that combination is found in the delimited parameter list the end user provides. Both the parameter and the values from the data source are converted to uppercase to prevent matching issues caused by inconsistent capitalization. So the end user would create a spreadsheet like the one below, copy and paste the values into the parameter in Tableau, and Tableau would respond with a list of all the fuzzy matches.
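Using a few rows from the test string further down, the spreadsheet might look something like this, with a helper column that strings each entry together with a trailing delimiter:

    Name                 DOB           Helper column (copied into the parameter)
    Nicole Brennan       10/22/2011    Nicole Brennan, 10/22/2011|
    Georgia Rosenberg    11/21/2011    Georgia Rosenberg, 11/21/2011|
    Ricardo Emerson      12/29/2011    Ricardo Emerson, 12/29/2011|

Since the calculation only checks for the substring, the exact delimiter isn't critical; it just keeps the entries readable.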
I did have some concern about the character limit on Tableau parameters, but after some quick testing I found they accept around 34,000 characters.
I've created an example of this using Superstore sales data, matching on customer name and order date. You can play with it yourself by clicking the image below. Test it by entering this string into the parameter: "Nicole Brennan, 10/22/2011|Georgia Rosenberg, 11/21/2011|Ricardo Emerson, 12/29/2011|Craig Molinari, 3/1/2012|Valerie Takahito, 4/5/2012|Pauline Chand, 8/1/2012|Andy Gerbode, 9/7/2012|Peter Fuller, 9/17/2012|David Philippe, 10/10/2012|Craig Carroll, 10/23/2012|Sam Craven, 11/10/2012|Lycoris Saunders, 11/14/2012|Duane Huffman, 11/15/2012|Kelly Williams, 11/20/2012|Harold Dahlen, 11/22/2012|Guy Phonely, 11/26/2012"
UPDATE:
One of the comments on this post asked why you wouldn't simply copy and paste the values into a custom quick filter. I had two thoughts on this. First, the approach described above better handles capitalization discrepancies, because you can run LOWER() or UPPER() on both sides of the equation. Second, I assumed this approach would be faster, since it doesn't have to scan through two large lists. To test this theory, Nick ran this query with thousands of names against a data set with millions of records. The result: the approach described in this post was more than 7x faster than simply copying the values into a quick filter. Pretty cool.