From spreadsheets to dbt

Like a lot of people in this field, my path has been anything but straight. I grew up building stuff. My dad was a cabinet maker and handyman. Working with him on weekends and in the summers, it really nurtured my love for building things. Fast forward to now, I’m creating things digitally and making cabinets on the weekends.

When I was in college I took up a job doing admin and analytical work. It was a small packaged goods manufacturer in Portland. When I started, they were doing everything on paper. Their order tracking was a stack of purchase orders, the inventory tracking system was a clipboard with numbers crossed off and a new number appended, production batches were tracked in a spiral notebook complete with the cocoa powder from the production room. The first thing I did was open up Excel to start putting everything I could in it: invoices, inventory counts, production information. It wasn’t long to begin to see the advantages of having the information structured digitally. I could readily see what was trending for a period, the value of a customer, production efficiency, and so on. Being able to calculate and see information like that was wild. I was hooked.

I started to read more on efficient ways to store data and that’s when I learned a very pivotal piece of information: tidy data. After that, I restructured how all the data was stored. We stopped inputting information each time and began referencing dimension tables. It was great! We began to utilize Google Forms for standardization and Google Sheets for storage. I would take the raw data from spreadsheets and create various views for reporting. Looker (now Data Studio) became the BI tool. The stack was a bit cumbersome but it worked well for what we needed at the time.

My next role was at a nursery brokerage. Here they used spreadsheets as most of their data warehousing. They used an old niche software for most of their operations and then would export to Excel for most of their reporting needs. This created a huge opportunity for me once I came on. One of the first things I did was begin the process of automating almost all of the data movement (EDI, API, exporting) to reduce the number of different people manipulating the data. 

After automation of the movement, then came the objective to improve the reporting with the obvious solution being implementing databases and BI tools. Unfortunately the niche software didn’t really have any connectors or APIs that I could tap into. So the initial database design and ingestion came from automating a CSV that was a massive view of a chunk of their data. On top of that was a workbook each for products, orders, inventory, pricing, shipping, and store information — each with their own set of worksheets, all maintained by hand. Getting a report meant exporting and manipulating the data. That was easily a couple hours a day. So I broke down the data monolith, developed the various star schemas with their respective fact and dimension tables, and coupled them together. Then I fed the various Excel workbooks into the database. Once everything was in place, questions that used to take hours or sometimes days were a few minutes.

It wasn’t too long before the Task Scheduler was loaded with various scripts performing a multitude of tasks ranging from data flowing out of the operations software to pulling data from retail partners via a Python data connection all on different schedules. This then brought the desire for a better solution: orchestration. I went back and forth trying to select between a few before I finally settled on Airflow and with the machine I was running it on, meant that Docker was also part of the solution. Airflow and Docker proved to be pivotal. Being able to visually see in Airflow what is scheduled when, previous runs, and most importantly logs truly accelerated much of the development. Being able to bring up containers, tearing them down, being able to exec in them when they are running, all proved very valuable when building this all out. I could also package it up easily and redeploy it somewhere else with just briefly setting up the environment.

This increase in development led to a lot of rewriting code. This rewriting code led to a lot of versions of scripts. These different versions then led to, I’m sure you can guess it: version control and git. Once I made the commit to using git, it wasn’t long before I wanted to take this version control closer to the data in the databases. This then brought me to dbt. I’m currently working through refactoring all the Python and SQL transformations to dbt. It’s been pretty great. Being able to segment the data more cleanly as it flows from raw to mart has helped with some of the confusing abstraction that took place in both Python and SQL.