A couple of days ago I stumbled across an article in which the author coined the term “git scraping”. The most compelling aspect of this process was that I, as a developer, don’t have to maintain servers or a large codebase to keep an up-to-date, versioned dataset. I could instead take advantage of the version control capabilities of git and Actions from Github. This process was so interesting to me that I had to try it for myself, so here is my journey, documented so that you can do the same for your own datasets.
Finding My Dataset
To fully take advantage of git versioning I wanted a dataset where there would be added value in seeing its changes over time. I also wanted those changes to be infrequent, so that the commits would be sparse and the individual updates easier to identify for this exercise.
After failing to find a good brewery dataset, I settled on monitoring a table of chief executive officers (CEOs) of notable companies. This list appears as a table within a Wikipedia page.
With my dataset now identified, I just had to put together some code to retrieve the information I wanted to store. I also wanted to expose it in a way that would be easy to ingest in an automated fashion in the future, so with that in mind I decided to store the data as a CSV file.
The solution that I came up with needs the following Python 3 packages. In the final repository these show up in the requirements.txt file.
- pandas — This could easily be another library like tablib, but I chose pandas in case there was some unforeseen processing involved.
- lxml — This library is leveraged under the hood by the pandas function that ingests HTML tables.
The first thing that I needed to do was to ingest the table from Wikipedia. Luckily Pandas has just the function for that, read_html.
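A minimal sketch of that ingestion step follows. The function and variable names here are illustrative rather than lifted from my repo, and read_html needs an HTML parser such as lxml installed:

```python
import pandas as pd

def ingest_table(source):
    """Return the first HTML table found in `source` as a DataFrame.

    read_html returns a list of DataFrames, one per <table> in the page;
    header=0 treats the first row as the column names, and [0] selects
    the first table.
    """
    return pd.read_html(source, header=0)[0]

if __name__ == "__main__":
    # Illustrative URL for the Wikipedia page holding the CEO table
    df = ingest_table("https://en.wikipedia.org/wiki/List_of_chief_executive_officers")
    print(df.head())
```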
In the code above I am retrieving the table by using read_html, identifying the header as the first row, and choosing the first table on the page as my DataFrame. At this point I have the data loaded into a pandas DataFrame, but I still need to export it to the CSV file format. Before I did so, I noticed an issue with the ingested data.
The ingested data still contained Wikipedia’s bracketed reference annotations in the results. To remove these, I just applied a regular expression that strips the annotations.
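The stripping boils down to a single substitution. Here is a sketch, assuming the annotations follow Wikipedia's usual bracketed style such as [1] or [note 2]:

```python
import re

def strip_annotations(text):
    """Remove Wikipedia-style reference annotations such as [1] or [note 2]."""
    return re.sub(r"\[[^\]]*\]", "", text)
```

For example, `strip_annotations("Tim Cook[3]")` returns `"Tim Cook"`.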
It was easier to convert the DataFrame to CSV first, so I converted my data format before applying the regular expression that strips the annotations. At this point all I needed to do was write the result to the file where I want the data to reside, and the dataset has been successfully landed in its intended destination.
If you want to use this code as a reference implementation here is the final main function that is run to update the dataset.
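The repo linked below has the real thing; a sketch of that main function, with an illustrative URL and output filename standing in for the actual values, looks like this:

```python
import re
import pandas as pd

def main():
    # Illustrative URL and output path -- see the linked repo for the real values
    url = "https://en.wikipedia.org/wiki/List_of_chief_executive_officers"
    # Load the first table on the page, using the first row as the header
    df = pd.read_html(url, header=0)[0]
    # Convert to CSV text, then strip bracketed reference annotations like [1]
    csv_text = re.sub(r"\[[^\]]*\]", "", df.to_csv(index=False))
    # Land the cleaned dataset in its destination file
    with open("ceo_dataset.csv", "w") as f:
        f.write(csv_text)

if __name__ == "__main__":
    main()
```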
Github Action Workflow
At this point I was able to ingest the table from Wikipedia and write the cleaned results to a CSV file. The next step was to schedule this process and commit the results to Github, if and only if the data changed.
Before I built out the steps of the Action workflow I needed to schedule it to run the code just developed every hour. To do this I defined, in the workflow’s yaml format, a cron schedule with a frequency of once an hour.
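The trigger block looks roughly like this (cron fields are minute, hour, day, month, weekday, so this fires at the top of every hour, UTC):

```yaml
on:
  push:
  schedule:
    # run at the top of every hour (UTC)
    - cron: '0 * * * *'
```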
Before building out my workflow I wanted to define the image that the Github Action runs on. For simplicity’s sake I just have it running on the latest version of Ubuntu, but in the future I will use some lighter image such as a Python slim variant or Alpine once I have had some time to go more in-depth with Actions.
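That amounts to a job definition like the following (the job name `scheduled` is just my label for it):

```yaml
jobs:
  scheduled:
    runs-on: ubuntu-latest
```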
This Github Action will have a number of steps in its workflow before the dataset can be monitored for changes. The first step of our cron workflow is to check out a working copy of our master branch. In my experiments lately I have noticed that this is usually the first step of most of the Actions I put together.
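Using the checkout action Github provides, that step looks roughly like:

```yaml
steps:
  - name: Check out repo
    uses: actions/checkout@v2
    with:
      # number of commits to fetch; 1 (the default) is all we need
      fetch-depth: 1
```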
We don’t actually need to specify the fetch-depth above, as the default is 1; it defines the number of commits to fetch. At first I was going to try to use a deeper fetch to compare versions of the dataset, but as you will notice later, this was a rabbit hole that didn’t need exploring. The way I rationalized all of that work was to keep a reference to this capability in my write-up for future readers to see.
Now that I had a working copy of the code to run in the Action workflow, the next step was to get Python 3.8 set up to start running the code. As with the previous step, you can see below that Github provides us an action for this as well. The Python version can be easily set, but for my example 3.8 was the target.
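With the setup-python action, pinning the interpreter version is a two-line `with` block:

```yaml
- name: Set up Python
  uses: actions/setup-python@v2
  with:
    python-version: '3.8'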
Now that we have Python installed, we want to install the required libraries. I included a requirements.txt file with this repo, but this step could just as easily be a pip install pandas lxml.
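Installing from the requirements file is a one-line run step:

```yaml
- name: Install dependencies
  run: pip install -r requirements.txt
```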
At this point in the workflow, everything required to run our code to update the dataset has been provided to the environment, so in this step we now do just that.
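Assuming the scraper lives in a file named main.py (the filename here is illustrative), the step is simply:

```yaml
- name: Update dataset
  run: python main.py
```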
At this point we have the latest version of our dataset, which may or may not have changed, so we only want to commit and push if the data has been updated. Note that the step exits early, without pushing, when nothing has changed. There are a few different patterns that can be used here, and Simon Willison has a reference to a few if you are curious. Otherwise, a clean way to achieve what we want can be found below.
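The trick is that `git commit` exits non-zero when there is nothing staged, so `|| exit 0` ends the step cleanly before the push. The bot identity configured here is just a placeholder:

```yaml
- name: Commit and push if it changed
  run: |-
    git config user.name "Automated"
    git config user.email "actions@users.noreply.github.com"
    git add -A
    timestamp=$(date -u)
    # commit fails (non-zero) when nothing changed, so we exit early
    git commit -m "Latest data: ${timestamp}" || exit 0
    git push
```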
Here is the Action workflow in all of its glory; feel free to use it for your own purposes however you want. At this point, all that is needed is to commit the fewer than 10 lines of Python code and the yaml with the Action workflow, and everything should be set up to have a versioned dataset with its history preserved.
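Assembled from the steps above, the workflow file (stored somewhere like .github/workflows/scrape.yml; the name and script path are illustrative) looks roughly like this:

```yaml
name: Scrape latest data

on:
  push:
  schedule:
    # run at the top of every hour (UTC)
    - cron: '0 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repo
        uses: actions/checkout@v2
        with:
          fetch-depth: 1
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Update dataset
        run: python main.py
      - name: Commit and push if it changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
```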
I usually find it easier when I have a reference implementation to work from so here is the repo that contains the complete codebase with the action workflow that I just stepped through.
As you can see, it’s actually pretty easy to ingest and keep a dataset up to date with a little bit of code and without your own infrastructure. Github Actions have really given us more options for simple automations. In this article’s example my repo was public, and in support of open source projects Github has made this completely free.
As my journey shows, with a little Python knowledge it isn’t difficult to start keeping datasets up to date, with versioning, by leveraging Github Actions.
I have another tutorial that I will be publishing this week so feel free to follow me so that you don’t miss it, and if you are interested in my other tutorials I have listed them below.
Read Other Tutorials By Dylan