After graduating from the University of Georgia with a Management Information Systems degree, I started my professional career in consulting and audit at KPMG. Basically this was because it was one of the few companies hiring for a decent position in my field after the dot-com bubble burst. I spent 6 years as a generalist doing projects in a variety of areas like Sarbanes-Oxley documentation, Six Sigma, IT audit, IPO due diligence, mergers and acquisitions, etc. It seemed I was never doing the same thing twice. It was both exciting and stressful, but ultimately unhelpful in building deep expertise in any one particular area. I needed focus, and I also needed to get off the road.

One of the great things about KPMG, though, is that it gave me a lot of networking connections. I leveraged those to land a position at the Piedmont Heart Institute (PHI) doing project management and business development work. I had also managed to greatly increase my Excel skills while at KPMG, and that was pretty useful in helping me succeed at all the analysis work I found myself doing at PHI. I started following a few Excel and data visualization blogs to help me further enhance my skills. Some of them included chandoo.org, clearlyandsimply.com, peltiertech.com, excelhero.com/blog and junkcharts.typepad.com.

While at PHI I built some pretty sophisticated stuff in Excel. I built things that helped me optimize physician schedules, track patient throughput in our cath and EP labs, identify candidates for EP implants, do project portfolio management, identify new physician office opportunity areas, and many more interesting things. But with each new tool I developed in Excel, maintaining everything started to become onerous. I had macros that would automate data pulls, formulas that would merge data sets, and charts that would auto-populate into PowerPoint for presentation, but it was all still difficult to maintain. And God help the person who would come after me and need to figure out the complex web of technology that I had woven. I needed a better approach.

Reading through comments on the blogs above, I kept seeing people recommend Tableau as an easier way of accomplishing the same data presentation tasks. I was very proud of my Excel skills and took it as a challenge to prove that Excel was a superior BI tool. So I started re-engineering my Excel work in Tableau. Being new to the tool, I was fairly limited in my imagination of what was possible. I tried to make Tableau be Excel (not the best idea). Still, there was enough familiarity with the features to accelerate my learning because a lot of Excel skills are easily transferable to Tableau. Tableau's VizQL interface is essentially pivot tables on steroids. At the same time, I discovered authors like Stephen Few and Edward Tufte who opened my eyes to the science of data visualization. Up until that point, I basically displayed data in a way that I thought looked cool in a presentation.

After converting a few of my projects to Tableau, I was convinced of my own ignorance and admitted that I was wrong in assuming Excel was the best BI tool. (Side note: Excel still has a place, but mostly I use it for light data capture and data transformation work. I leave the presentation layer to Tableau.) Now that the scales were removed from my eyes, I started gaining huge efficiencies in my workflow. I was able to completely automate a lot of my work and offer the end user an interactive data exploration experience through dashboarding. Subsequent questions were built as drill-downs to more detail. And if I didn't anticipate all the questions, I had a lightning-fast ad-hoc analysis tool to answer questions on the fly. I no longer walked into meetings with a PowerPoint presentation containing just static aggregated information. I walked in with my laptop and an interactive dashboard. No longer would a wrong assumption require a few days of re-work and another meeting to address. I could fix the issue on the fly, in the meeting. The incredible speed to insight meant that decisions could be made in a single meeting instead of multiple meetings. My audience went from sitting back and talking broadly about ideas to leaning forward and engaging in the process of finding the answers to specific questions. A question would be asked and immediately answered. That would lead to more questions that would also be immediately answered. And usually in the same meeting we would have a draft of a dashboard that would automate that analysis process with new data each day. It was life altering and I was addicted.

My desire to understand the science of data visualization had a big impact as well. It enabled me to shift perspectives and offer a fresh look at the same data sets leading to new insight. Tableau was like a natural extension of my mind in this new world that helped me fail quickly as I iterated through different designs. On the data modeling and data quality side of things, I was able to easily identify mistakes and correct them quickly because the data was at my fingertips ready to tell me everything in an instant.

I finally knew what I wanted to do with the rest of my career. It wasn't just a job. It became a passion. My passion and project success eventually got me noticed by the corporate Financial Planning and Analysis team. They needed some help with monthly operating reports that took a week to compile each month using Crystal Reports, Excel and Adobe Acrobat. The results were printed in a 133-page book, one for each of our five hospitals, and shipped on a truck to company leadership 2 weeks after month-end close. I wanted to help, but I had a bigger vision. I had experienced the work that goes on in the trenches by analysts trying to make sense of data. I wanted to help everyone in the company do the same things that I could do at PHI. My new bosses decided to take this journey with me and let me run with the idea.

So I accepted the position to lead the BI program for Piedmont Healthcare. We quickly automated the monthly operating reporting so that it updates every day in the form of interactive dashboards. People subscribe to them and receive them each day in their email. If they have follow-up questions, they just click a link and use the drill-downs to answer them. And if they need further ad-hoc analysis, their analysts have access to the same disaggregated data sources that drive the dashboards. Life is good!

Now we've moved on from financial analytics to other areas of neglected or onerous insight. There are more than 130 people across the enterprise who have been trained and have the ability to develop reports from 100+ modeled data sources containing more than 25 billion data points. They have published more than 3,000 reports/dashboards that are being consumed by more than 350 people each week. We are doing more clinical and operational analysis now, which is really exciting to me because Piedmont Healthcare's mission is to better the lives and healthcare experience of our patients. If my work can have a positive impact on our core mission, I'm thrilled! In the almost 3 years I've been in this role, we've made a ton of progress but have only scratched the surface of what is possible. The future is bright with possibilities to transform healthcare with analytics and I'm so fortunate to have stumbled into this career. For those looking into this field, I say go for it! There has never been a better and more exciting time to be in this space.

Further Reading...

Earlier this year I made a visualization of my career. If you want to know more about my background, check it out by clicking on the image below. It will load up an interactive version that has audio narration and screenshot examples of my work.



2021 Wk 7

Challenge: https://preppindata.blogspot.com/2021/02/2021-week-7-vegan-shopping-list.html

The 2021 Week 7 #PreppinData challenge introduces the GROUP_CONCAT() and INSTR() functions in Exasol and expands on the use of scaffolding techniques to unpivot data contained in a single field with delimited values. I also used CTEs and regular expressions, and I built a view to avoid code repetition. The full SQL solution is posted below.
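
If these functions are new to you, here is a minimal sketch of what each one does on its own; the literal values and aliases are made up purely for illustration:

--INSTR() returns the character position of a search string (0 if not found)
SELECT INSTR('soy milk powder', 'milk') AS "Position" FROM DUAL;  --returns 5

--GROUP_CONCAT() aggregates string values from multiple rows into one delimited string
SELECT GROUP_CONCAT("Keyword" ORDER BY "Keyword") AS "Keywords"   --returns 'honey,milk'
FROM (SELECT 'milk' AS "Keyword" FROM DUAL
      UNION ALL
      SELECT 'honey' FROM DUAL) t;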

Lines 1-23 create our table structures and load the data into Exasol. Refer to the first post in this series for more about these DDL / DML statements. The bulk of the work is saved in a view that starts on line 25 so that it can be referenced to generate the two outputs, which just pull from the view with different filters. This is a good practice so that you don't repeat code multiple times for different uses and later run into version control issues when minor changes fork from the original code.
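
As a stripped-down sketch of that practice (the view name demo_vw and its contents are made up), the shared logic is defined once and each output is only a thin filter on top of it:

--shared logic lives in one place
CREATE OR REPLACE VIEW demo_vw AS
    SELECT 'Granola' AS "Product", 'honey' AS "Contains" FROM DUAL
    UNION ALL
    SELECT 'Oat Drink', CAST(NULL AS VARCHAR(20)) FROM DUAL;

--each output just filters the view, so changes to the logic flow to both
SELECT * FROM demo_vw WHERE "Contains" IS NULL;      --no keyword matched
SELECT * FROM demo_vw WHERE "Contains" IS NOT NULL;  --at least one keyword matched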

The view contains a couple of common table expressions (CTEs) that manipulate the keyword dataset. The challenge with this one is that the keywords exist in two discrete columns as comma-separated lists (shown below). The first column has ingredients and the second column has E numbers used to identify food additives. These two lists are not related, so they ultimately need to be combined into a single list of keywords. In retrospect, I probably could have eliminated the second CTE by concatenating the two fields in the first, but I'll just explain the steps as I did them.


The first CTE on lines 26-33, named "keyword_tall_pass1", converts the two comma-separated lists into one record per value as shown below. This is accomplished by cross joining to a subquery on line 32 that uses the CONNECT BY hierarchical query functionality to generate 30 records for us. Thirty is just a number I chose that is large enough to capture every value in the two lists. On line 33 I drop the excess records that I didn't need because there were only 16 items max between the two lists. The magic here is with the REGEXP_SUBSTR() functions. I used pattern matching to capture just alphabetic characters for the first list (line 29) or numeric characters for the second list (line 30) and kept the nth matching instance, where n is the RecordID value I generated on line 32. The result of this CTE is shown below: "Milk" was the first word, followed by "Whey", "Honey", etc. from the screenshot above, and likewise for the second list of E numbers.
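
Here is the same scaffolding idea boiled down to a self-contained sketch, with a made-up three-item list standing in for the real keyword columns:

--the derived table generates RecordID values 1..5; REGEXP_SUBSTR() pulls the nth alphabetic match
SELECT
    i."RecordID"
    ,REGEXP_SUBSTR('Milk, Whey, Honey', '(?i)([a-z]+)', 1, i."RecordID") AS "Ingredient"
FROM (SELECT level AS "RecordID" FROM DUAL CONNECT BY level <= 5) i;
--RecordID 1 returns 'Milk', 2 returns 'Whey', 3 returns 'Honey', and 4-5 return NULL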


The second CTE, named "nonvegan_keywords", on lines 35-38 just takes the Ingredient and ENumber columns shown above and stacks them on top of each other with UNION ALL. The ALL qualifier tells the query compiler not to bother checking for duplicate values between the two expressions. I also needed to prefix each number with the letter "E"; you can concatenate string values with the double pipe operator "||". It turned out that the E numbers weren't found in the shopping list data, so none of that data was used anyway.
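
As a tiny sketch of the stacking and concatenation (the values are made up):

SELECT 'milk' AS "Keyword" FROM DUAL
UNION ALL
SELECT 'E' || '120' FROM DUAL;  --two rows come back: 'milk' and 'E120'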

The final SELECT statement for the view appears on lines 41-50. This query uses the shopping list as the base table (line 48) and cross joins to the "nonvegan_keywords" CTE (line 49) so that each shopping list product's ingredients can be compared to every keyword in the list individually. I do this with the CASE statement you find on lines 46-47. Exasol is case sensitive, so I forced both the ingredients and the keywords to be lowercase and used the INSTR() function to see if an individual keyword is found in the list of ingredients. INSTR() returns the character location of the found text in a string, so if it is greater than zero I return the matched keyword. Any non-matched keywords are ignored and return NULL. 
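
In isolation, that keyword test looks like this; the product and keyword literals are made up:

SELECT
    CASE WHEN INSTR(LOWER('Oat Milk Drink'), LOWER('Milk')) > 0
         THEN 'Milk' END AS "Matched"
FROM DUAL;  --returns 'Milk' because INSTR() finds 'milk' at position 5 of 'oat milk drink'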

The CASE statement is wrapped in a GROUP_CONCAT() function, which is an aggregate function for string data. By default it comma-delimits the string values within the group, but you can choose a different delimiter. I then grouped by the Product, Description, and Ingredients fields on line 50 to get the dataset back to one line per product on the shopping list. The results are saved in a view (line 25) so I can call all this code again for my output queries.
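
A compact sketch of that aggregation step, again with made-up rows, shows the matches collapsing back to one line per product (SEPARATOR overrides the default comma delimiter):

SELECT
    "Product"
    ,GROUP_CONCAT("Keyword" ORDER BY "Keyword" SEPARATOR '; ') AS "Contains"
FROM (SELECT 'Granola' AS "Product", 'honey' AS "Keyword" FROM DUAL
      UNION ALL
      SELECT 'Granola', 'milk' FROM DUAL) t
GROUP BY "Product";  --one row comes back: Granola | 'honey; milk'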

The two output queries on lines 53-63 are very simple and largely the same. One just filters for products with a NULL value for the "Contains" field and the other for non-NULL values. This means one list is vegan and the other list is non-vegan.

 1  CREATE OR REPLACE TABLE DEV_MAJ."PD_2021_Wk7_ShoppingList" (
 2      "Product"     VARCHAR(255),
 3      "Description" VARCHAR(1000),
 4      "Ingredients" VARCHAR(2000)
 5  );
 6  CREATE OR REPLACE TABLE DEV_MAJ."PD_2021_Wk7_Keywords" (
 7      "AnimalIngredients" VARCHAR(1000),
 8      "ENumbers"          VARCHAR(1000)
 9  );
10  IMPORT INTO DEV_MAJ."PD_2021_Wk7_ShoppingList" FROM LOCAL CSV
11      FILE 'C:\Mark\Preppin Data\PD 2021 Wk 7 Shopping List.csv'
12      SKIP = 1
13      ROW SEPARATOR  = 'CRLF'
14      COLUMN SEPARATOR = ','
15      COLUMN DELIMITER = '"'
16  ;
17  IMPORT INTO DEV_MAJ."PD_2021_Wk7_Keywords" FROM LOCAL CSV
18      FILE 'C:\Mark\Preppin Data\PD 2021 Wk 7 Ingredients.csv'
19      SKIP = 1
20      ROW SEPARATOR  = 'CRLF'
21      COLUMN SEPARATOR = ','
22      COLUMN DELIMITER = '"'
23  ;
24
25  CREATE OR REPLACE VIEW DEV_MAJ."PD_2021_Wk7_ShoppingListKeywords_vw" AS
26      WITH keyword_tall_pass1 AS ( --generate unique rows for ingredients / e-numbers
27          SELECT
28              i."RecordID"
29              ,REGEXP_SUBSTR(k."AnimalIngredients",'(?i)([a-z]+)',1,i."RecordID") AS "Ingredient"
30              ,REGEXP_SUBSTR(k."ENumbers",'(?i)([0-9]+)',1,i."RecordID") AS "ENumber"
31          FROM DEV_MAJ."PD_2021_Wk7_Keywords" k
32              CROSS JOIN (SELECT level AS "RecordID" FROM DUAL CONNECT BY level < 30) i --30 is arbitrary
33          WHERE local."Ingredient" IS NOT NULL OR local."ENumber" IS NOT NULL --drop null records
34
35      ), nonvegan_keywords AS ( --stack ingredients / e-numbers
36          SELECT "Ingredient" AS "Keyword" FROM keyword_tall_pass1 k WHERE k."Ingredient" IS NOT NULL
37          UNION ALL
38          SELECT 'E' || "ENumber" AS "Keyword" FROM keyword_tall_pass1 k WHERE k."ENumber" IS NOT NULL
39      )
40
41      SELECT --return products w/ delimited list of matching ingredients
42          sl."Product"
43          ,sl."Description"
44          ,sl."Ingredients"
45          ,GROUP_CONCAT(
46              CASE WHEN INSTR(LOWER(sl."Ingredients"),LOWER(nvk."Keyword"))>0  --when ingredients contain keyword
47              THEN nvk."Keyword" END) AS "Contains"
48      FROM DEV_MAJ."PD_2021_Wk7_ShoppingList" sl
49          CROSS JOIN nonvegan_keywords nvk
50      GROUP BY 1,2,3
51  ;
52
53  --OUTPUT 1: Vegan Shopping List
54  SELECT slk."Product", slk."Description"
55  FROM DEV_MAJ."PD_2021_Wk7_ShoppingListKeywords_vw" slk
56  WHERE slk."Contains" IS NULL
57  ORDER BY 1;
58
59  --OUTPUT 2: Non-Vegan Shopping List
60  SELECT slk."Product", slk."Description", slk."Contains"
61  FROM DEV_MAJ."PD_2021_Wk7_ShoppingListKeywords_vw" slk
62  WHERE slk."Contains" IS NOT NULL
63  ORDER BY 1;

I hope you found this exercise informative. If so, share it with your friends and colleagues on your favorite social platforms. If there is a particular #PreppinData challenge you'd like me to re-create in SQL, let me know on Twitter @ugamarkj.

If you want to follow along, Exasol has a free Community Edition. It is pretty easy to stand up as a virtual machine with the free Oracle VirtualBox platform. I use DataGrip as my favorite database IDE, which is paid software, though you can use the free DBeaver platform if you prefer.