Most people’s introduction to data analysis is using spreadsheets, like Microsoft Excel.
Spreadsheets are incredibly powerful and popular but have significant limitations, particularly when it comes to reproducibility, shareability, and processing large datasets.
When it’s time to get serious about data analysis, experienced practitioners will tell you to use code: it’s more reliable, easier to reproduce, and a whole lot more enjoyable.
But the transition from spreadsheet-based data analysis to code-based data analysis can be daunting if you are new to coding.
To make it easier for you to get started with code-based analysis, I’ve created a series of blog posts that use everyday examples to show how code can replace spreadsheets for visualizing, analyzing, and modelling data.
We will be using Python, the popular programming language that has become the de facto standard for code-based data analysis. All of the code was run in Filament, an online collaborative workspace.
Let’s get started.
What’s the weather like where you are?
To show you why analysis in Python is superior, I’m going to chat a bit about the weather.
Specifically, the weather in Oxford, though you can apply the same approach to your town as well. This post will show you how to automatically download weather data from a server, produce a DataFrame containing temperature information, plot different averages, and fit trendlines.
All of these activities could be done in a spreadsheet, but using code makes it easier to change the analysis, update the analysis as new data become available, analyse larger datasets, and share our analysis in a reproducible way. It’s also easier to produce beautiful graphics!
Download the data
The Radcliffe Meteorological Station in Oxford has downloadable weather data. For this example, I am going to explore daily temperature measurements, which are available from 1815!
The first thing we need to do is import Pandas, a powerful Python package for data analysis. We can then use the Pandas function read_csv to read the weather data directly from the Oxford University website. This gives us a DataFrame.
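A minimal sketch of this step is below. The real download URL and the exact column headings in the Radcliffe file aren't reproduced here, so the example uses an in-memory sample with hypothetical columns (`YYYY`, `MM`, `DD`, `Tmax °C`, `Tmin °C`) so that it runs anywhere:

```python
import io
import pandas as pd

# In the post the data are read straight from the station's download page;
# the commented-out URL below is a placeholder, not the real address.
# df = pd.read_csv("https://example.org/radcliffe/daily-temperatures.csv")

# A small in-memory sample with an assumed layout, standing in for the file:
sample = io.StringIO(
    "YYYY,MM,DD,Tmax °C,Tmin °C\n"
    "1815,1,1,8.3,2.1\n"
    "1815,1,2,7.9,1.4\n"
    "2020,1,1,9.6,4.7\n"
)
df = pd.read_csv(sample)
print(df.head())
```

With the real dataset, the only change is passing the download URL to `read_csv` instead of the `StringIO` object.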
The DataFrame consists of columns of data with headings indicating the content of each column.
This looks very much like what you would see in a spreadsheet. Where there are no data, we see NaN. However, some cells of the CSV file contain a text explanation, which we can easily replace with NaN later on.
We can find the data for a specific date by finding the row that contains the year, month and day that we want. In Excel we could achieve this by putting filters on the columns.
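In code, the filter becomes a boolean mask. The column names here are assumptions about the file's layout:

```python
import pandas as pd

# Hypothetical daily data; YYYY/MM/DD column names are assumptions.
df = pd.DataFrame({
    "YYYY": [2020, 2020, 2020],
    "MM":   [1, 1, 2],
    "DD":   [1, 2, 29],
    "Tmax °C": [9.6, 8.1, 11.0],
})

# Select the row for 1 January 2020 -- the code equivalent of column filters.
row = df[(df["YYYY"] == 2020) & (df["MM"] == 1) & (df["DD"] == 1)]
print(row)
```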
And if we just want the temperature on that day, we can extract a single value.
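One way to reduce the filtered row to a single number is `.loc` with the same mask, then `.iloc[0]` (again, the column name is an assumed stand-in for the real heading):

```python
import pandas as pd

# Hypothetical frame; the real data have one row per day since 1815.
df = pd.DataFrame({
    "YYYY": [2020], "MM": [1], "DD": [1], "Tmax °C": [4.7],
})

# .loc with a boolean mask plus .iloc[0] reduces the match to one number.
tmax = df.loc[(df["YYYY"] == 2020) & (df["MM"] == 1) & (df["DD"] == 1),
              "Tmax °C"].iloc[0]
print(f"{tmax} °C")  # the degree sign appears in the column name too
```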
This gives an output of 4.7 °C.
This is much easier than scrolling through significant amounts of data in a spreadsheet. The only tricky thing here is remembering how to write the degree sign!
We’ll now do a bit of data manipulation so that we can compare temperatures in different years.
What we want is a DataFrame in which the columns represent the years, and the rows represent each day of the year.
The first thing we will do is remove all the data from 29th February. We do that by finding the index (the left-most column) of all the rows where the month is 2 and the day is 29.
We then use .drop to remove these rows. Again, this could be done in a spreadsheet, but it’s much simpler using Python and Pandas.
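A sketch of the index-then-drop pattern, using the same assumed column names as before:

```python
import pandas as pd

df = pd.DataFrame({
    "YYYY": [2020, 2020, 2020],
    "MM":   [2, 2, 3],
    "DD":   [28, 29, 1],
    "Tmax °C": [7.0, 7.5, 8.0],
})

# Find the index labels of every 29 February, then drop those rows.
leap_idx = df[(df["MM"] == 2) & (df["DD"] == 29)].index
df = df.drop(leap_idx)
```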
Having done that, it’s useful to save the new DataFrame so that we don’t have to download and manipulate it again.
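Saving and reloading is a round trip through `to_csv` and `read_csv`; the filename here is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({"YYYY": [2020], "MM": [1], "DD": [1], "Tmax °C": [4.7]})

# Save the cleaned frame so later sessions can skip the download step.
df.to_csv("oxford_temperatures.csv", index=False)

# Reloading later is a one-liner.
df2 = pd.read_csv("oxford_temperatures.csv")
```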
Graphing the Temperature
Now let’s produce some nice graphs to visualise the daily temperature.
We will use functions from three of the most common Python libraries: Pandas, NumPy and Matplotlib. First we need to import these libraries, and then read in the data that we saved earlier.
An interesting question to answer using this data might be, ‘Is it unusually hot today?’, so let’s take a look at Tmax. To do this we create a new DataFrame containing the maximum temperature each day.
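One way to get years-as-columns, days-as-rows is `pivot_table`. The long-format input below is synthetic, with the same assumed column names used throughout:

```python
import pandas as pd

# Long-format daily data (two days across two years, for illustration).
df = pd.DataFrame({
    "YYYY": [2019, 2019, 2020, 2020],
    "MM":   [1, 1, 1, 1],
    "DD":   [1, 2, 1, 2],
    "Tmax °C": [8.0, 7.5, 9.6, 8.1],
})

# Reshape so each column is a year and each row is a day of the year.
Tmax = df.pivot_table(index=["MM", "DD"], columns="YYYY", values="Tmax °C")
print(Tmax)
```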
Although this code initially looks complicated, it is actually much simpler than performing the same task in Excel.
Now that we have Tmax for every day, it’s easy to plot the maximum temperature for every day of every year and to highlight any year we want. However, this isn’t very meaningful on its own, as we can see from the following graph.
In this plot, it’s really difficult to see whether 2020 is an outlier or not. To do this we need to calculate the mean and standard deviations of Tmax, and re-plot.
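A sketch of the mean-and-standard-deviation plot, using randomly generated stand-in data rather than the real Radcliffe measurements:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Days down the rows, years across the columns (synthetic stand-in data).
rng = np.random.default_rng(0)
Tmax = pd.DataFrame({y: 10 + 5 * rng.standard_normal(365)
                     for y in range(1990, 2021)})

mean = Tmax.mean(axis=1)  # mean over years, one value per day
std = Tmax.std(axis=1)

fig, ax = plt.subplots()
ax.plot(mean, color="black", label="mean")
ax.plot(mean + std, color="gray", label="±1 std")
ax.plot(mean - std, color="gray")
ax.plot(Tmax[2020], color="red", label="2020")
ax.set_xlabel("Day of year")
ax.set_ylabel("Tmax (°C)")
ax.legend()
fig.savefig("tmax_mean_std.png")
```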
Here the gray lines represent the standard deviation. It looks like there were some hot days in 2020!
Of course, if we take any single year there are likely to be outliers: we would expect about one third of days to be outside the standard deviation, about one sixth above and one sixth below.
If we want to see whether the temperature in general is increasing, perhaps it would be better to take averages over two periods of time.
We can now compare the average of the most recent 20 years with the mean and standard error from the period before 1990.
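The two-period comparison can be sketched like this, again with synthetic data in which the recent columns are deliberately warmed a little:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
years = list(range(1815, 2021))
# Rows are days of the year, columns are years; add 0.5 °C from 2000 on.
Tmax = pd.DataFrame(
    {y: 10 + 5 * rng.standard_normal(365) + (0.5 if y >= 2000 else 0.0)
     for y in years})

early = Tmax[[y for y in years if y < 1990]]
recent = Tmax[[y for y in years if y >= 2000]]

early_mean = early.mean(axis=1)
# Standard error of the daily mean over the pre-1990 years.
early_sem = early.std(axis=1) / np.sqrt(early.shape[1])
recent_mean = recent.mean(axis=1)
```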
So, it looks like the past 20 years have been warmer than the long-term average; in particular, the temperatures in winter are significantly higher.
Let’s explore this further by taking a closer look at how Tmax has changed over the years. We are interested in the hottest and coldest day each year, as well as the mean of Tmax for the whole year.
This shows some of the power of using Pandas DataFrames; above we calculated averages along rows, now we find them down columns, and we can even perform the calculation whilst plotting the graph.
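With years as columns, the per-year statistics are one-liners along `axis=0` (synthetic data again):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
Tmax = pd.DataFrame({y: 10 + 5 * rng.standard_normal(365)
                     for y in range(2000, 2021)})

# axis=0 aggregates down each column, i.e. over all days of one year.
hottest = Tmax.max(axis=0)
coldest = Tmax.min(axis=0)
yearly_mean = Tmax.mean(axis=0)
```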
From this graph it appears that temperatures have increased over the past few years; let’s quantify this by fitting some trendlines. Here we will redefine ‘recent’ years to start from 1980.
The code tells us that the gradient before 1980 was 0.44 °C per century, but since 1980 it has been equivalent to 4.9 °C per century.
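Those gradients come from the real Radcliffe data; the post doesn't reproduce the fitting code, but `np.polyfit` with degree 1 is the standard way to do it. The sketch below generates synthetic yearly means with a built-in change of slope so it runs anywhere:

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1815, 2021)
# Synthetic yearly means: slow warming, then faster after 1980, plus noise.
temps = (10 + 0.004 * (years - 1815)
         + np.where(years >= 1980, 0.04 * (years - 1980), 0.0)
         + 0.3 * rng.standard_normal(years.size))

early = years < 1980
# np.polyfit(..., 1) returns [gradient, intercept] of the best-fit line.
slope_early = np.polyfit(years[early], temps[early], 1)[0]
slope_recent = np.polyfit(years[~early], temps[~early], 1)[0]
print(f"before 1980: {slope_early * 100:.2f} °C/century")
print(f"since 1980:  {slope_recent * 100:.2f} °C/century")
```

Changing the comparison ranges is just a matter of editing the `early` mask, which is exactly the flexibility the next paragraph describes.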
One massive advantage of doing this in code is that it is easy to change the ranges of years used for comparisons.
Clearly the gradient in recent years is larger than that over the previous 165 years. To determine whether this is statistically significant, we need a measure of the error in the gradient calculation. This is known as the standard error, and it can easily be calculated using a different function, equivalent to LINEST in Excel (which I personally find painful to use).
The function outputs quite a bit of information; let’s just look at the values we want.
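The post doesn't name the function it uses; `scipy.stats.linregress` is one common choice that returns the slope together with its standard error, much like LINEST. A sketch on synthetic post-1980 yearly means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
years = np.arange(1980, 2021)
# Synthetic yearly means with a known underlying trend plus noise.
temps = 10 + 0.049 * (years - 1980) + 0.3 * rng.standard_normal(years.size)

# linregress returns slope, intercept, r-value, p-value and stderr.
result = stats.linregress(years, temps)
slope_per_century = result.slope * 100
# ~95% confidence interval: roughly two standard errors either side.
ci = 2 * result.stderr * 100
print(f"{slope_per_century:.1f} ± {ci:.1f} °C per century")
```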
The +/- values represent 95% confidence intervals. This means that over the past 40 years the average temperature has increased at a rate of somewhere between 4 and 6 °C per century, which is significant in both the statistical and the every-day meaning of the word.
In this post I have shown how Python can be used to perform simple visualizations and analyses of data.
Some of the steps will take a bit longer than using a spreadsheet. However, for large datasets or repetitive operations, using code reduces the scope for error, whilst making it easy to repeat or modify calculations. This is definitely worth the additional investment of time.
Finally, in nearly all the examples here, I have used the default matplotlib formats for the graphs, and they are already perfectly presentable. This is definitely an advantage over spreadsheets.
In this blog series I am going to explore Python functionality further, and hopefully inspire you to take your first steps into using Python yourself.