With the abundance and proliferation of data in this day and age, there is an inherent need to store and reuse that wealth of information in a meaningful way. It is analogous to having a kitchen inundated with a multitude of utensils and tools to use, without having an organized way to manage them.
Well, chances are that you’re going to end up opening your canned lunch with the rear end of a dipper, unless you warehouse up real fast.
Data warehousing is the ability to cache, tokenize, analyze and reuse your curated data on demand in an unparalleled manner.
In a similar fashion to how your mother navigates around her immaculately well organized kitchen. Mind you, there is no one size fits all solution, and there are as many ways to warehouse as there are warehouses themselves.
Arguably, there are three key ingredients to implementing a successful data warehouse:
- Server: first and foremost you must provision a distributed database system that is both robust and resilient.
- Indexing: your database system should ideally have some form of indexing that allows you to access records at warp speed. Having a full-text index would be a bonus.
- Dashboard: you should have a staging area where you can import, export, visualize and mutate your data in an immutable way.
In this tutorial we will be addressing the first and last points mentioned above by creating a data warehouse where we can store datasets, arrays and records into. In addition we will create a dashboard where we can graphically interface with our warehouse to load, retrieve, mutate and visualize our data. Perhaps in another article, we will implement the second point i.e. a full-text indexed database.
PostgreSQL otherwise known as Postgres for short, is an open source relational database system that is more often than not the database of choice for developers due to its extended capabilities and relative ease of use.
Even though it is brandished as a structured database management system, it can also store non-structured data including but not limited to arrays and binary objects. Most importantly however, Postgres’s graphical user interface makes it too easy to provision and manage databases on the fly, something other database systems should take careful note of.
In our implementation of a data warehouse, we will be using a local Postgres server to store all our data in. Before we proceed, please download and install Postgres using this link.
During the installation you will be prompted to set a username, password and a local TCP port to connect to your server. The default port is 5432 which you may keep as is or modify if necessary. Once the installation is complete, you may login to the server by running the pgAdmin 4 application which will open a portal on your browser as shown below.
There will be a default database labeled postgres, however you may create your own by right clicking on the Databases menu and then selecting Create to provision a new database.
Now that you have provisioned your server and database, you should install the package sqlalchemy that will be used to connect to our database through Python. You can download and install this package by typing the following command into Anaconda prompt:
pip install sqlalchemy
In addition download, install and then import all of the other necessary libraries into your Python script as
Firstly, we will need to establish a connection between our records_db database and create a table where we can store records and arrays in. In addition, we will need to create another connection to the datasets_db database where we will store our datasets in:
As per Postgres’s naming convention, table names must start with underscores or letters (not numbers), must not contain dashes and must be less than 64 characters long. For our records table, we will create a name field with a datatype of text declared as a PRIMARY KEY and a details field as text which is Postgres’s notation for a single-dimension array. To acquire an exhaustive list of all the datatypes supported by Postgres, please refer to this link. We will use the name field as a primary key that will be used to search for records later on. In addition, please note that a more secure way of storing your database credentials, is to save them in a configuration file and then to invoke them as parameters in your code.
Subsequently, we will create the following five functions to write, update, read and list our data to/from our database:
Please be aware of SQL-injection vulnerabilities when concatenating strings to your queries. You may instead use parametrization to prevent SQL-injection as elaborated in this article.
Streamlit is a pure Python web framework that allows you to develop and deploy user interfaces and applications in real-time at your fingertips. For this tutorial, we will be using Streamlit to render a dashboard which we can use to interface with our Postgres database.
In the snippet shown below, we are using several text input widgets to insert the values for our records, arrays and names for the datasets. Furthermore we are using Streamlit’s functions to interactively visualize our dataset as a chart and as a dataframe.
You can run the dashboard on a local browser, by typing the following commands in Anaconda prompt. First, change your root directory to where your source code is saved:
Then type the following to run your app:
streamlit run file_name.py
And there you have it, a dashboard that can be used to tokenize, write, read, update, upload and visualize our data in real-time. The beauty of our data warehouse is that it can be scaled up to host as much data as you may need, all within the same structure that we have just created! If you want to learn more about the fundamentals of data warehouses, I would recommend signing up for University of Colorado’s course on data warehousing. It’s a great way to get up to speed.