Healthcare providers store a lot of data. Whether it's financial or medical, that data must remain private, secure, and accessible when needed. In fact, regulations dictate how long a company must retain many types of records, and keeping records on hand also protects against any legal needs down the road. However, because of the licensing contracts between healthcare providers and healthcare software companies, the data can become expensive to retain even after a provider has stopped using the software for anything but storage.
As a database analyst intern at Harmony Healthcare IT, a company specializing in healthcare data ETL (extract, transform, load) processes, I helped port data from legacy healthcare systems into a database that could be read by our company's product, the Health Data Archiver. This consisted of three main steps: first, I located the required data points in the legacy database (every system I worked on was a Microsoft SQL Server DBMS); second, I created SQL views over those data points that transformed them into a format the Health Data Archiver could read; and third, I extracted the data by creating new tables from those views, loaded the tables into a PostgreSQL database, and configured the Health Data Archiver to read from that new database. This let our customers access their data through the Health Data Archiver, a web app, whenever they needed it, and stop paying for the expensive licenses and services tied to the legacy systems.
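The three steps above can be sketched in miniature. The real work targeted Microsoft SQL Server as the source and PostgreSQL as the destination; here Python's built-in sqlite3 stands in for both purely so the example is self-contained, and the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step 1: the legacy system's table, located during data discovery.
cur.execute("CREATE TABLE legacy_patients (pat_id INTEGER, lname TEXT, fname TEXT, dob TEXT)")
cur.execute("INSERT INTO legacy_patients VALUES (1, 'DOE', 'JANE', '19800102')")

# Step 2: a view that transforms the data into the format the archiving
# application expects (e.g. a combined name and ISO-8601 dates).
cur.execute("""
    CREATE VIEW v_patients AS
    SELECT pat_id                          AS patient_id,
           lname || ', ' || fname          AS patient_name,
           substr(dob, 1, 4) || '-' || substr(dob, 5, 2)
               || '-' || substr(dob, 7, 2) AS birth_date
    FROM legacy_patients
""")

# Step 3: materialize the view into a new table; in the real process,
# this table was loaded into the PostgreSQL database the app reads from.
cur.execute("CREATE TABLE archive_patients AS SELECT * FROM v_patients")

print(cur.execute("SELECT * FROM archive_patients").fetchone())
# (1, 'DOE, JANE', '1980-01-02')
```

The view is the key piece: it leaves the legacy tables untouched while presenting the data in the target shape, so the extract step is just a `SELECT` against it.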
On top of all this, I got to experience what it's like to work under security and privacy regulations (e.g. HIPAA), with enterprise cloud systems and data storage, and on a team in a professional setting.
After the systems analysts gained access to a healthcare system's database and made it possible for us to query it, they would hand it off to the database analysts, my role. From there I would determine what data was required; this was documented in a spreadsheet, created by the business analysts, containing screenshots of the legacy system's GUI along with some descriptive text. Then I would query the database to get a sense of how its tables were related. It was often most helpful to find the table containing the Patient Id numbers first, because all of the other data related back to it.
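One way to start that discovery is to ask the database itself which tables carry the patient identifier. On SQL Server this would be a query against the system catalog (e.g. INFORMATION_SCHEMA.COLUMNS); the sketch below uses sqlite3 so it runs anywhere, with an invented schema and column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# An invented stand-in schema for a legacy system.
cur.execute("CREATE TABLE patients (pat_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE encounters (enc_id INTEGER, pat_id INTEGER, enc_date TEXT)")
cur.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")

# Find every table that has a pat_id column: those are the tables the
# rest of the patient data hangs off.
tables_with_patient_id = []
for (table,) in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall():
    columns = [row[1] for row in cur.execute(f"PRAGMA table_info({table})")]
    if "pat_id" in columns:
        tables_with_patient_id.append(table)

print(sorted(tables_with_patient_id))  # ['encounters', 'patients']
```

Once the identifier column is known, joins through it reveal how the remaining tables relate.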
Sometimes there would be unstructured data (in that it wasn't in rows and columns) that we had to store. It might be contained in PDF files, HTML files, Rich Text Format, XML, JSON, and so on. One of my colleagues was having trouble with a large number of extracted HTML files: first, a long string of random text had been inserted that wasn't supposed to be in the documents, and second, several image tags were not formatted correctly. He could have gone through tens of thousands of these by hand, but we had Python on our side. I wrote a script using the BeautifulSoup HTML parsing library to remove the junk string and reformat the image tags. Because it ran across so many files, it saved countless hours of work and made for a happy colleague and a happy customer. Additionally, I taught my colleague, a full-time employee, how to use the script and modify it for future use.
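A rough sketch of that batch cleanup: the real script used the BeautifulSoup parser, but this version uses only the standard library so it runs anywhere, and the junk string and malformed-tag pattern are invented stand-ins for the actual artifacts in the extracted files.

```python
import re
from pathlib import Path

# Invented stand-ins for the real artifacts.
JUNK_STRING = "XQZ9-EXPORT-ARTIFACT-XQZ9"
BAD_IMG = re.compile(r'<img src=([^\s">]+)>')  # unquoted src, no closing slash

def clean_html(text: str) -> str:
    # Drop the inserted junk string, then rewrite malformed image tags.
    text = text.replace(JUNK_STRING, "")
    return BAD_IMG.sub(r'<img src="\1" />', text)

def clean_directory(directory: str) -> None:
    # Rewrite every .html file in place, the way the batch job did.
    for path in Path(directory).glob("*.html"):
        path.write_text(clean_html(path.read_text(encoding="utf-8")),
                        encoding="utf-8")

sample = '<p>NoteXQZ9-EXPORT-ARTIFACT-XQZ9</p><img src=scan1.png>'
print(clean_html(sample))  # <p>Note</p><img src="scan1.png" />
```

Using a real HTML parser like BeautifulSoup, as the actual script did, is the more robust choice when the malformed markup varies from file to file; the regex version only works because the breakage here is assumed to be uniform.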