Reading data is the first step in any data science project. Often, you’ll work with data in JSON format and run into problems at the very beginning.

In this article, you’ll learn how to use the Pandas built-in functions read_json() and json_normalize() to deal with the following common problems:

  1. Reading simple JSON from a local file
  2. Reading simple JSON from a URL
  3. Flattening nested list from JSON object
  4. Flattening nested list and dict from JSON object
  5. Extracting a value from deeply nested JSON

Please check out Notebook for the source code.

1. Reading simple JSON from a local file

Let’s begin with a simple example.

[
{
"id": "A001",
"name": "Tom",
"math": 60,
"physics": 66,
"chemistry": 61
},
{
"id": "A002",
"name": "James",
"math": 89,
"physics": 76,
"chemistry": 51
},
{
"id": "A003",
"name": "Jenny",
"math": 79,
"physics": 90,
"chemistry": 78
}
]

To read a JSON file via Pandas, we can use the read_json() method.

df = pd.read_json('data/simple.json')
image by author

The result looks great. Let’s take a look at the data types with df.info(). By default, columns that are numerical are cast to numeric types, for example, the math, physics, and chemistry columns have been cast to int64.

>>> df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3 non-null object
1 name 3 non-null object
2 math 3 non-null int64
3 physics 3 non-null int64
4 chemistry 3 non-null int64
dtypes: int64(3), object(2)
memory usage: 248.0+ bytes

2. Reading simple JSON from a URL

Pandas read_json() accepts a URL.

URL = 'http://raw.githubusercontent.com/BindiChen/machine-learning/master/data-analysis/027-pandas-convert-json/data/simple.json'df = pd.read_json(URL)
image by author
>>> df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3 non-null object
1 name 3 non-null object
2 math 3 non-null int64
3 physics 3 non-null int64
4 chemistry 3 non-null int64
dtypes: int64(3), object(2)
memory usage: 248.0+ bytes

Same as reading from a local file, it returns a DataFrame, and columns that are numerical are cast to numeric types by default.

3. Flattening nested list from JSON object

Pandas read_json() works great for flattened JSON like we have in the previous example. What about JSON with a nested list? Let’s see how to convert the following JSON into a DataFrame:

{
"school_name": "ABC primary school",
"class": "Year 1",
"students": [
{
"id": "A001",
"name": "Tom",
"math": 60,
"physics": 66,
"chemistry": 61
},
{
"id": "A002",
"name": "James",
"math": 89,
"physics": 76,
"chemistry": 51
},
{
"id": "A003",
"name": "Jenny",
"math": 79,
"physics": 90,
"chemistry": 78
}]

}

To read it with pandas read_json()

df = pd.read_json('data/nested_list.json')
image by author

After reading this JSON, we can see that our nested list is put up into a single column students. How can we flatten the nested list? One solution is to apply a custom function to flatten the values in students.

This certainly does our work, but it requires extra code to get the data in the form we require. We can solve this effectively using the Pandas json_normalize() function.

import json
# load data using Python JSON module
with open('data/nested_array.json','r') as f:
data = json.loads(f.read())# Flatten data
df_nested_list = pd.json_normalize(data, record_path =['students'])
image by author

data = json.loads(f.read()) load data using Python json module. After that, json_normalize() is called with the argument record_path set to ['students'] to flatten the nested list in students.

The result looks great but doesn’t include school_name and class. To include them, we can use the argument meta to specify a list of metadata we want in the result.

# To include school_name and class
df_nested_list = pd.json_normalize(
data,
record_path =['students'],
meta=['school_name', 'class']
)
image by author

4. Flattening nested list and dict from JSON object

Next, let’s try to read a more complex JSON data, with a nested list and a nested dictionary.

{
"school_name": "local primary school",
"class": "Year 1",
"info": {
"president": "John Kasich",
"address": "ABC road, London, UK",
"contacts": {
"email": "admin@e.com",
"tel": "123456789"
}
}
,
"students": [
{
"id": "A001",
"name": "Tom",
"math": 60,
"physics": 66,
"chemistry": 61
},
{
"id": "A002",
"name": "James",
"math": 89,
"physics": 76,
"chemistry": 51
},
{
"id": "A003",
"name": "Jenny",
"math": 79,
"physics": 90,
"chemistry": 78
}]

}

We will get a ValueError when trying to read it using read_json().

To read it probably, we can use json_normalize()

import json
# load data using Python JSON module
with open('data/nested_mix.json','r') as f:
data = json.loads(f.read())

# Normalizing data
df = pd.json_normalize(data, record_path =['students'])
image by author

And to include classpresident (a property of info), and tel (a property of contacts.info), we can use the argument meta to specify the path to the property.

df = pd.json_normalize(
data,
record_path =['students'],
meta=[
'class',
['info', 'president'],
['info', 'contacts', 'tel']
]

)
image by author

Please check out the following article if you would like to learn more about Pandas json_normalize():All Pandas json_normalize() you should know for flattening JSONSome of the most useful Pandas trickstowardsdatascience.com

5. Extracting a single value from deeply nested JSON

Pandas json_normalize() can do most of the work when working with nested data from a JSON file. However, it flattens the entire nested data when your goal might actually be to extract one value. For example, to extract the property math from the following JSON file.

{
"school_name": "local primary school",
"class": "Year 1",
"students": [
{
"id": "A001",
"name": "Tom",
"grade": {
"math": 60,
"physics": 66,
"chemistry": 61
}

},
{
"id": "A002",
"name": "James",
"grade": {
"math": 89,
"physics": 76,
"chemistry": 51
}

},
{
"id": "A003",
"name": "Jenny",
"grade": {
"math": 79,
"physics": 90,
"chemistry": 78
}
}]
}

How can we do that more effectively? The answer is using read_json with glom.

from glom import glomdf = pd.read_json('data/nested_deep.json')
df['students'].apply(lambda row: glom(row, 'grade.math'))0 60
1 89
2 79

Name: students, dtype: int64

glom is a Python library that allows us to use . notation to access property from a deeply nested object.

Conclusion

Pandas read_json() function is a quick and convenient way for converting simple flattened JSON into a Pandas DataFrame. When dealing with nested JSON, we can use the Pandas built-in json_normalize() function.

I hope this article will help you to save time in converting JSON data into a DataFrame. I recommend you to check out the documentation for read_json() and json_normalize() APIs, and to know about other things you can do.

Author

Machine Learning practitioner | Formerly health informatics at University of Oxford | Ph.D. | https://www.linkedin.com/in/bindi-chen-aa55571a/