Most people’s introduction to data analysis is using spreadsheets, like Microsoft Excel.

Spreadsheets are incredibly powerful and popular but have significant limitations, particularly when it comes to reproducibility, shareability, and processing large datasets.

When it’s time to get serious about data analysis, experienced practitioners will tell you to use code: it’s more reliable, easier to reproduce, and a whole lot more enjoyable.

But the transition from spreadsheet-based data analysis to code-based data analysis can be daunting if you are new to coding.

To make it easier for you to get started with code-based analysis, I’ve created a series of blog posts that use everyday examples to show how code can replace spreadsheets for visualizing, analyzing, and modelling data.

We will be using Python, the popular programming language that has become the de facto standard for code-based data analysis. All of the code has been run in Filament, an online collaborative workspace.

Let’s get started.

What’s the weather like where you are?

To show you why analysis in Python is superior, I’m going to chat a bit about the weather.

Specifically, the weather in Oxford, but you can apply it to your town as well. This post will show you how to automatically download weather data from a server, produce a DataFrame containing temperature information, plot different averages, and fit trendlines.

All of these activities could be done in a spreadsheet, but using code makes it easier to change the analysis, update the analysis as new data become available, analyse larger datasets, and share our analysis in a reproducible way. It’s also easier to produce beautiful graphics!

Download the data

The Radcliffe Meteorological Station in Oxford has downloadable weather data [1]. For this example, I am going to explore daily temperature measurements, which are available from 1815!

The first thing we need to do is import Pandas, a powerful Python package for data analysis. We can then use the Pandas function read_csv to read the weather data directly from the Oxford University website. This gives us a DataFrame, which we can visualise as follows:
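A minimal sketch of that step (the URL here is a placeholder for the actual address of the station's CSV file):

```python
import pandas as pd

# Placeholder URL: substitute the real address of the station's CSV file.
url = "https://example.org/oxford-daily-weather.csv"
weather = pd.read_csv(url)
weather
```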

[Figure: DataFrame showing Oxford weather data since 1815]

The DataFrame consists of columns of data with headings indicating the content of each column.

This looks very much like what you would see in a spreadsheet. Where there are no data, we see NaN. However, some cells of the CSV file contain a text explanation, which we can easily replace with NaN later on.

We can find the data for a specific date by finding the row that contains the year, month and day that we want. In Excel we could achieve this by putting filters on the columns.
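A sketch using boolean filters, assuming the date columns are headed 'YYYY', 'MM' and 'DD':

```python
# Pick out the row for 1 January 1900.
# The column headings 'YYYY', 'MM' and 'DD' are assumptions.
row = weather[(weather['YYYY'] == 1900) &
              (weather['MM'] == 1) &
              (weather['DD'] == 1)]
row
```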

[Figure: Oxford weather data on 1 January 1900]

And if we just want the maximum temperature on that day, we can pull out the single value directly.
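A sketch, again with an assumed column heading of 'Tmax °C' (degree sign and all):

```python
# Select just the daily maximum temperature from the filtered row.
row['Tmax °C'].values[0]
```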

This gives an output of 4.7 °C.

This is much easier than scrolling through significant amounts of data in a spreadsheet. The only tricky thing here is remembering how to write the degree sign!

We’ll now do a bit of data manipulation so that we can compare temperatures in different years.

What we want is a DataFrame in which the columns represent the years, and the rows represent each day of the year.

The first thing we will do is remove all the data from 29th February. We do that by finding the index (the left-most column) of all the rows where the month is 2 and the day is 29.

We then use .drop to remove these rows. Again, this could be done in a spreadsheet, but it’s much simpler using Python and Pandas.
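A sketch, with the same assumed column names:

```python
# Find the index labels of every 29 February and drop those rows.
feb29 = weather[(weather['MM'] == 2) & (weather['DD'] == 29)].index
weather = weather.drop(feb29)
```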

Having done that, it’s useful to save the new DataFrame so that we don’t have to download and manipulate it again.
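Saving is a single call (the filename is illustrative):

```python
# Write the cleaned data to a local CSV file for later use.
weather.to_csv('oxford_weather_clean.csv', index=False)
```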

Graphing the Temperature

Now let’s produce some nice graphs to visualise the daily temperature.

We will use functions from three of the most common Python libraries: Pandas, NumPy and Matplotlib. First we need to import these libraries.

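Using the conventional aliases:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```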

And now read in the data that we saved earlier.

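For example, reloading the file saved earlier:

```python
# Reload the cleaned data saved above.
weather = pd.read_csv('oxford_weather_clean.csv')
```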

An interesting question to answer using this data might be, ‘Is it unusually hot today?’, so let’s take a look at Tmax. To do this we create a new DataFrame containing the maximum temperature each day.

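One way to build the table is with pivot_table, converting the temperature column to numeric first so that any text explanations become NaN; the column names are assumptions as before:

```python
# Coerce the temperature column to numbers; text entries become NaN.
weather['Tmax °C'] = pd.to_numeric(weather['Tmax °C'], errors='coerce')

# Reshape: one column per year, one row per day of the year.
tmax = weather.pivot_table(index=['MM', 'DD'], columns='YYYY',
                           values='Tmax °C')
tmax
```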

[Figure: DataFrame showing Oxford temperature data on each day of every year]

Although this code initially looks complicated, it is actually much simpler than performing the same task in Excel.

Now we have Tmax for every day, it’s easy to plot the maximum temperature on every day of every year and to highlight any year we want. However, this isn’t very meaningful, as we can see from the following graph.

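A sketch of the plot, with every year in light gray and 2020 drawn on top:

```python
fig, ax = plt.subplots()
# Each column of the array is one year, drawn as a thin gray line.
ax.plot(tmax.to_numpy(), color='lightgray', linewidth=0.5)
ax.plot(tmax[2020].to_numpy(), color='crimson', label='2020')
ax.set_xlabel('Day of year')
ax.set_ylabel('Daily maximum temperature (°C)')
ax.legend()
plt.show()
```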

[Figure: Oxford temperature data on each day of every year since 1815, with 2020 highlighted]

In this plot, it’s really difficult to see whether 2020 is an outlier or not. To do this we need to calculate the mean and standard deviations of Tmax, and re-plot.

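A sketch: compute the mean and standard deviation across the years (i.e. along each row), then overlay 2020:

```python
mean = tmax.mean(axis=1)
std = tmax.std(axis=1)

fig, ax = plt.subplots()
ax.plot(mean.to_numpy(), color='black', label='Mean')
ax.plot((mean + std).to_numpy(), color='gray', label='±1 standard deviation')
ax.plot((mean - std).to_numpy(), color='gray')
ax.plot(tmax[2020].to_numpy(), color='crimson', label='2020')
ax.set_xlabel('Day of year')
ax.set_ylabel('Daily maximum temperature (°C)')
ax.legend()
plt.show()
```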

[Figure: Mean and standard deviation of temperature on every day of the year, overlaid with data from 2020]

Here the gray lines represent the standard deviation. It looks like there were some hot days in 2020!

Of course, if we take any single year there are likely to be outliers: we would expect about one third of days to be outside the standard deviation, about one sixth above and one sixth below.

If we want to see whether the temperature in general is increasing, perhaps it would be better to take averages over two periods of time, as follows:

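One way to split the table into the two periods described above:

```python
# Baseline: all years before 1990; recent: 2000 onwards.
early = tmax.loc[:, tmax.columns < 1990]
recent = tmax.loc[:, tmax.columns >= 2000]

early_mean = early.mean(axis=1)
# Standard error of the baseline mean (std / sqrt of non-NaN count).
early_sem = early.std(axis=1) / np.sqrt(early.count(axis=1))
recent_mean = recent.mean(axis=1)
```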

We can now compare the average of the most recent 20 years to the mean and standard error from the period before 1990.

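And a sketch of the comparison plot:

```python
fig, ax = plt.subplots()
ax.plot(early_mean.to_numpy(), color='black', label='Mean before 1990')
ax.plot((early_mean + early_sem).to_numpy(), color='gray',
        label='±1 standard error')
ax.plot((early_mean - early_sem).to_numpy(), color='gray')
ax.plot(recent_mean.to_numpy(), color='crimson', label='Mean since 2000')
ax.set_xlabel('Day of year')
ax.set_ylabel('Daily maximum temperature (°C)')
ax.legend()
plt.show()
```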

[Figure: Mean and standard error for temperature data before 1990, compared to the mean after 2000]

So it looks like the past 20 years have been warmer than the long-term average; in particular, the temperatures in winter are significantly higher.

Calculating trends

Let’s explore this further by taking a closer look at how Tmax has changed over the years. We are interested in the hottest and coldest day each year, as well as the mean of Tmax for the whole year.

This shows some of the power of using Pandas DataFrames: above we calculated averages along rows; now we find them down columns, and we can even perform the calculation whilst plotting the graph.

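A sketch that aggregates down the columns as it plots:

```python
fig, ax = plt.subplots()
# Each aggregation collapses a year's column to a single number.
ax.plot(tmax.columns, tmax.mean(axis=0), label='Mean')
ax.plot(tmax.columns, tmax.max(axis=0), label='Hottest day')
ax.plot(tmax.columns, tmax.min(axis=0), label='Coldest day')
ax.set_xlabel('Year')
ax.set_ylabel('Daily maximum temperature (°C)')
ax.legend()
plt.show()
```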

[Figure: Mean, maximum and minimum daily temperature every year since 1815]

From this graph it appears that temperatures have increased over the past few years; let’s quantify this by fitting some trendlines. Here we will redefine ‘recent’ years to start from 1980.

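A sketch using NumPy's polyfit for the straight-line fits:

```python
years = tmax.columns.to_numpy()
annual_mean = tmax.mean(axis=0).to_numpy()

# Boolean masks for the two periods.
pre = years < 1980
post = years >= 1980

# Straight-line (degree 1) fits to each period.
fit_pre = np.polyfit(years[pre], annual_mean[pre], 1)
fit_post = np.polyfit(years[post], annual_mean[post], 1)

fig, ax = plt.subplots()
ax.plot(years, annual_mean, color='lightgray', label='Annual mean Tmax')
ax.plot(years[pre], np.polyval(fit_pre, years[pre]),
        label='Trend before 1980')
ax.plot(years[post], np.polyval(fit_post, years[post]),
        label='Trend since 1980')
ax.set_xlabel('Year')
ax.set_ylabel('Daily maximum temperature (°C)')
ax.legend()
plt.show()
```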

[Figure: Trendlines for the mean temperature before 1980 and after 1980]

The code tells us that the gradient before 1980 was 0.44 °C per century, but since 1980 it has been equivalent to 4.9 °C per century.

One massive advantage of doing this in code is that it is easy to change the ranges of years used for comparisons.

Clearly the gradient in recent years is larger than that over the previous 165 years. To determine whether this difference is statistically significant, we need a measure of the error in each gradient calculation. This is known as the standard error, and it can easily be calculated using a different function, equivalent to LINEST in Excel (which I personally find painful to use).

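That function is linregress from SciPy, which returns the gradient together with its standard error; this sketch continues from the arrays defined above:

```python
from scipy.stats import linregress

# Full regression results for the recent period.
result_post = linregress(years[post], annual_mean[post])
print(result_post)
```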

[Figure: Complete output from a linear regression analysis using the linregress function]

The function outputs quite a bit of information, so let’s just look at the values we want:

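A sketch that reports each gradient with an approximate 95% confidence interval (taken here as roughly two standard errors, which is an approximation on my part):

```python
result_pre = linregress(years[pre], annual_mean[pre])

# Approximate 95% confidence interval as two standard errors.
for label, res in [('before 1980', result_pre), ('since 1980', result_post)]:
    print(f'Gradient {label}: {res.slope:.4f} '
          f'± {2 * res.stderr:.4f} °C per year')
```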

[Figure: Relevant output from the linregress function. Gradient before 1980: 0.0044 ± 0.0012 °C per year; gradient since 1980: 0.049 ± 0.008 °C per year]

The ± values represent 95% confidence intervals. This means that over the past 40 years the average temperature has increased at a rate of somewhere between 4 and 6 °C per century, which is significant in both the statistical and the everyday sense of the word.

Summary

I have shown an example of how Python can be used to perform simple visualizations and analyses of data.

Some of the steps will take a bit longer than using a spreadsheet. However, for large datasets or repetitive operations, using code reduces the scope for error, whilst making it easy to repeat or modify calculations. This is definitely worth the additional investment of time.

Finally, in nearly all the examples here, I have used the default matplotlib formats for the graphs, and they are already perfectly presentable. This is definitely an advantage over spreadsheets.

In this blog series I am going to explore Python functionality further, and hopefully inspire you to take your first steps into using Python yourself.


Author

I am a Professor of Engineering Science at the University of Oxford. I am also CSO of Filament, the all-in-one workspace for data, analysis, notes and reports.
