The Problem

Healthcare providers store enormous amounts of data. Whether it's financial or medical, that data must be kept private and secure, yet remain accessible when needed. In fact, many regulations dictate how long a company must retain certain types of data, and it's also wise to keep records on hand for any legal needs down the road. However, because of the licensing contracts between healthcare providers and healthcare software companies, this data can become expensive to retain, even when the provider no longer actively uses the software for anything but storage.

The Solution

As a database analyst intern for Harmony Healthcare IT, a company specializing in healthcare data ETL (extract, transform, load) processes, I helped port data from legacy healthcare systems to a database that could be accessed by our company's product, the Health Data Archiver. This consisted of three main steps:

1. I located the required data points in the legacy database (every system I worked on was a Microsoft SQL Server DBMS).
2. I created SQL views over those data points, transforming them into a format the Health Data Archiver application could read.
3. I extracted the data by creating new tables from those views, loaded them into a PostgreSQL database, and configured the Health Data Archiver to read from that new database.

This let our customers access their data through the Health Data Archiver, a web app, whenever they needed it, and stop paying for the expensive licenses and services tied to the old legacy systems.

On top of all this, I got to experience what it's like to work under strict security and privacy regulations (e.g., HIPAA), with enterprise cloud systems and data storage, and alongside a team in a professional setting.

Extract

After the systems analysts gained access to the healthcare system's database and made it queryable, they would pass it along to the database analysts, my role. From there I would review which data was required; this was documented in a spreadsheet created by the business analysts, which contained screenshots of the legacy system's GUI along with descriptive text. Then I would query the database to get a sense of its structure and the relationships between its tables. It was often most helpful to find the table containing the patient ID numbers first, because all the other data was related to it.
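The schema exploration above can be sketched as follows, using an in-memory SQLite database as a stand-in for the real SQL Server instance (all table and column names here are hypothetical):

```python
import sqlite3

# Stand-in for the legacy database: three hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patients (PatientId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Encounters (EncounterId INTEGER, PatientId INTEGER, VisitDate TEXT);
CREATE TABLE Medications (MedId INTEGER, PatientId INTEGER, Drug TEXT);
""")

# Find every table that carries a PatientId column -- these are the
# tables that hang off the central patient table.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
related = [t for t in tables
           if any(col[1] == "PatientId"
                  for col in conn.execute(f"PRAGMA table_info({t})"))]
print(related)  # all three tables reference PatientId
```

On SQL Server the equivalent lookup would go through `INFORMATION_SCHEMA.COLUMNS`, but the idea is the same: work outward from the patient ID.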

Transform

As I found data points, I would create views around the required columns. For example, if patient information was spread across several tables but made more sense archived in one, I would create a view that combined them. The deciding factor was whatever made it easiest for our application, the Health Data Archiver, to read the data. The tool I used for the extract and transform steps was SQL Server Management Studio.
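A minimal sketch of that transform step: demographics and contact details live in separate tables, and a view flattens them into the single shape the archiver reads. SQLite stands in for SQL Server here, and all names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Demographics (PatientId INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE Contacts (PatientId INTEGER, Phone TEXT);

INSERT INTO Demographics VALUES (1, 'Ada', 'Lovelace');
INSERT INTO Contacts VALUES (1, '555-0100');

-- The view the archiver would point at: one row per patient.
CREATE VIEW vw_Patient AS
SELECT d.PatientId, d.FirstName, d.LastName, c.Phone
FROM Demographics d
LEFT JOIN Contacts c ON c.PatientId = d.PatientId;
""")

print(conn.execute("SELECT * FROM vw_Patient").fetchall())
# -> [(1, 'Ada', 'Lovelace', '555-0100')]
```

Because it's a view rather than a copy, the underlying tables stay untouched while the archiver sees one clean table per concept.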

Load

With all of the views created, I would then use Pentaho to build a transformation pipeline that told each data point where to go in the new archive database. Running this job copied all of that data; as was the case for every project I worked on, it ported data from MS SQL Server to a PostgreSQL DBMS. I would then configure the Health Data Archiver to pull from this new archive database. In the end it would look and work much like the old legacy healthcare system, but in a read-only manner.
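At its core, the Pentaho job reads each row out of a source view and writes it into the matching archive table. Here is that idea as a stand-in sketch, with two SQLite databases in place of SQL Server and PostgreSQL (the view is modeled as a plain table, and all names are hypothetical):

```python
import sqlite3

# Source database: the view built during the transform step.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE vw_Patient (PatientId INTEGER, FirstName TEXT, LastName TEXT);
INSERT INTO vw_Patient VALUES (1, 'Ada', 'Lovelace'), (2, 'Grace', 'Hopper');
""")

# Archive database: the table the Health Data Archiver will read.
archive = sqlite3.connect(":memory:")
archive.execute(
    "CREATE TABLE patient (patient_id INTEGER, first_name TEXT, last_name TEXT)")

# Stream every row from the source view into the archive table.
rows = source.execute("SELECT PatientId, FirstName, LastName FROM vw_Patient")
archive.executemany("INSERT INTO patient VALUES (?, ?, ?)", rows)
archive.commit()

print(archive.execute("SELECT COUNT(*) FROM patient").fetchone()[0])  # -> 2
```

Pentaho adds type mapping, batching, and error handling on top, but the shape of the job is this select-then-insert loop per view.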
[Photo: The Data Team at Harmony Healthcare IT during our morning standup meeting.]

Parsing with Python

Sometimes there would be unstructured data (in that it wasn't in rows and columns) that we had to store. This might be contained in PDF files, HTML files, Rich Text Format, XML, JSON, etc. One of my colleagues was having trouble with a large number of HTML files that had been extracted. First, a long string of random text had been inserted that wasn't supposed to be in the documents, and second, several image tags were not formatted correctly. He could have gone through and fixed tens of thousands of these by hand, but we had Python on our side. I wrote a script using the BeautifulSoup HTML parsing library to remove the stray string and reformat the image tags. Run across so many files, it saved countless hours of work and made for a happy colleague and a happy customer. I also taught my colleague, a full-time employee, how to use and modify the script for future work.
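A simplified sketch of that cleanup script is below. The junk string and the exact tag defect are invented for illustration (here I assume the export wrote a bad `source` attribute instead of `src`); the real files had their own quirks, but the BeautifulSoup pattern is the same:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical stand-in for the random text the export inserted.
JUNK = "XJ9QZ-RANDOM-EXPORT-ARTIFACT"

def clean_html(html: str) -> str:
    """Strip the inserted junk string and repair malformed <img> tags."""
    soup = BeautifulSoup(html.replace(JUNK, ""), "html.parser")
    for img in soup.find_all("img"):
        # Hypothetical defect: a "source" attribute where "src" belongs.
        if "src" not in img.attrs and "source" in img.attrs:
            img["src"] = img.attrs.pop("source")
    return str(soup)

dirty = '<p>Visit note XJ9QZ-RANDOM-EXPORT-ARTIFACT</p><img source="scan1.png">'
print(clean_html(dirty))
# The junk string is gone and the tag gains a proper src attribute.
```

Looped over a directory of tens of thousands of files, a function like this turns a weeks-long manual job into a few minutes of runtime.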

Programming with PL/pgSQL

Our Manager of ETL and Data Services often had to do maintenance on the databases stored on our servers, and she needed a way to remove existing databases with a single script. She had been doing this line by line for some time, but I came up with a PL/pgSQL approach that generated a script she could run in one shot. This saved her a lot of time and taught me more about developing with SQL, including dynamic SQL.
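The PL/pgSQL version queried the catalog for database names and used dynamic SQL to emit one DROP DATABASE statement per match. The same generate-then-run idea can be sketched in Python (the name filter and database names are hypothetical):

```python
def drop_script(db_names, prefix="archive_old_"):
    """Return a single SQL script that drops every matching database."""
    return "\n".join(
        f'DROP DATABASE IF EXISTS "{name}";'
        for name in db_names
        if name.startswith(prefix)
    )

# In PostgreSQL the name list would come from the pg_database catalog.
names = ["archive_old_clinicA", "archive_old_clinicB", "postgres"]
print(drop_script(names))
```

Generating the statements rather than executing them directly also sidesteps the fact that DROP DATABASE can't run inside a transaction block, which is part of why a ready-to-run script is the convenient deliverable.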