Wondering what your LinkedIn connections look like? Use graphs to find out.

LinkedIn is an awesome place to connect with all kinds of people from various backgrounds. It’s the place where networking happens.

As an aspiring Data Scientist, I personally use LinkedIn to connect with Data Scientists around the world and find out what their day-to-day looks like, what tools they use, what problems they’re solving, etc. to learn more about the field.

I’ve been using LinkedIn for quite some time now and have connected with quite a few people, and I’ve always been curious about what my connections “look” like: which companies they work at, what positions they hold, and so on.

On LinkedIn, you only see a list of your connections, so it’s hard to visualize the entire network of your connections.

What’s the best way we can visualize networks? Using graphs!

Graphs

[Image: an example of a complex network graph (source)]

Graphs are one of the most important data structures and show up in many real-world settings. One of them is social networks.

In our case, it won’t be a complex network like the one you see above.

What we’ll be doing is creating a network that connects you to all the companies from your connections.

If you aren’t familiar with the terminology: a graph in this context is made up of vertices (also called nodes or points), which can store information and are connected by edges (also called links or lines).

In other words, there’ll be a single node (you) connected to dozens of other nodes (companies), where the number of connections at each company corresponds to the size of its node. For example, if you connect with tons of people working at Amazon, the Amazon node will be large.

If you didn’t get that, don’t worry! It’ll make a lot more sense with the visualizations later.

As always, here’s where you can find the code for this article:

Haven’t joined bitgrit’s new Discord server? You’re missing out! Join today for discussions on all things data science, AI, and more.

Download data

First, we need the data.

Here’s a step-by-step guide for getting a copy of your data on LinkedIn:

  1. Click on the Me dropdown on the homepage
  2. Head over to “Settings & Privacy”
  3. Click on “Get a copy of your data”

You should see something like this below 👇

Get a copy of your data page on LinkedIn

Check “Connections” only, and hit “Request archive”.

After a few minutes, you should get the archive file in your email.

Now that we have our data, let’s dive into the code!

Installing Dependencies

Before loading the data, we need to install some dependencies that don’t come by default on Google Colab.

!pip install pyjanitor pyvis --quiet

Load libraries

import pandas as pd
import janitor
import datetime

from IPython.core.display import display, HTML
from pyvis import network as net
import networkx as nx

  • janitor — Clean APIs for data cleaning; a Python implementation of the R package janitor
  • NetworkX — Create and analyze network graphs with Python
  • Pyvis — Visualize network graphs interactively

Loading data

df_ori = pd.read_csv("/content/drive/MyDrive/linkedin_network/Connections.csv", skiprows=2)
df_ori.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2757 entries, 0 to 2756
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   First Name     2747 non-null   object
 1   Last Name      2747 non-null   object
 2   Email Address  71 non-null     object
 3   Company        2668 non-null   object
 4   Position       2668 non-null   object
 5   Connected On   2757 non-null   object
dtypes: object(6)
memory usage: 129.4+ KB

The original csv file has notes on the first two rows, so we’re reading starting from line 3, which is what skiprows=2 does.
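If you’re curious what those two rows look like in your own export, here’s a quick way to peek at the raw file (a small sketch reusing the same path as above):

# peek at the first few raw lines of the export to see LinkedIn's notes
with open("/content/drive/MyDrive/linkedin_network/Connections.csv") as f:
  for _ in range(4):
    print(f.readline().rstrip())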

Running .info() on our data, I have a total of 2757 connections.

Data Cleaning

Our connections data out of the box isn’t clean, so let’s do some simple cleaning with the janitor library.

df = (
    df_ori
    .clean_names() # remove spacing and capitalization
    .drop(columns=['first_name', 'last_name', 'email_address']) # drop for privacy
    .dropna(subset=['company', 'position']) # drop missing values in company and position
    .to_datetime('connected_on', format='%d %b %Y')
  )
df.head()
   company                             position                                 connected_on
0  ClosedLoop.ai                       Data Scientist                           2021-10-25
1  Freelance                           IT Support                               2021-10-19
2  Erupture Angel Network              Data Partner                             2021-10-18
3  Digital Marketing Networks Sdn Bhd  Digital Marketing Trainer & Consultant   2021-10-17
4  SECURITI.ai                         Marketing Team Lead                      2021-10-16

If you’re not familiar with the syntax above, what it’s doing is chaining operations one after the other.

For example, df_ori.clean_names().drop(...) means you’re cleaning the column names, then dropping the columns, and so on, where the output of the first operation becomes the input of the next.

This style is known as method chaining (a functional, pipeline-like way of writing transformations), and I personally like it because it’s clean and simple.

What we did above is clean the column names, drop columns, remove missing values, and convert the date column to datetime format, all in one block!
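If the chained version still feels opaque, here’s the same cleaning written out step by step (a minimal sketch that should be equivalent to the block above):

# the same cleaning steps, one at a time
df = df_ori.clean_names()  # remove spacing and capitalization from column names
df = df.drop(columns=['first_name', 'last_name', 'email_address'])  # drop for privacy
df = df.dropna(subset=['company', 'position'])  # drop rows missing company or position
df = df.to_datetime('connected_on', format='%d %b %Y')  # parse the connection date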

Now that our data is in a good shape, let’s do some exploratory data analysis.

Simple EDA

For plotting, I’m using pandas built-in plotting capabilities for simplicity.

Top 10 Companies

df['company'].value_counts().head(10).plot(kind="barh").invert_yaxis();

What I’m doing here is getting the value counts of the company column, which act like a counter for every company, taking only the top 10, and plotting a horizontal bar chart. I also inverted the y-axis so that the names are easier to read.

Looks like tons of my connections work at Amazon, with IBM following close behind.

Notice there’s a “Freelance” category in our company column; digging into the data, I found that there’s also a “Self-employed” category.

We’ll remove those so that only company (and university) names remain.

pattern = "freelance|self-employed"
df = df[~df['company'].str.contains(pattern, case=False)]

Top 10 Positions

For positions, the “Data Scientist” title has a huge lead, which isn’t surprising because I’ve been connecting with tons of data scientists.
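The plot itself isn’t shown above, but it’s the same one-liner as the company chart, pointed at the position column (a sketch, assuming the same cleaned dataframe):

# top 10 positions, plotted the same way as the companies
df['position'].value_counts().head(10).plot(kind="barh").invert_yaxis();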

df['connected_on'].hist(xrot=35, bins=15);

Plotting the dates I connected with people, it seems I had spikes of connections around January 2020 and April 2021.

Now that we’ve done some really basic EDA, it’s time to create individual data frames for the companies and positions to make building the network easier.

Aggregate position and connection columns

df_company = df['company'].value_counts().reset_index()
df_company.columns = ['company', 'count']
df_company = df_company.sort_values(by="count", ascending=False)
df_company.head(10)
   company    count
0  Amazon     38
1  IBM        36
2  MoneyLion  28
3  Facebook   22
4  Google     20
5  PETRONAS   17
6  Grab       13
7  Microsoft  13
8  Accenture  12
9  Medium

Using value_counts() again, along with sort_values(), we’re able to get a new dataframe of the companies along with their counts.

We do the same process for the position column.
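The position dataframe isn’t shown in the original, but it mirrors the company one; here’s a sketch (the column names position and count are my assumption, matching how they’re used later):

# same aggregation for positions
df_position = df['position'].value_counts().reset_index()
df_position.columns = ['position', 'count']
df_position = df_position.sort_values(by="count", ascending=False)
df_position.head(10)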

Now that we have our data frames, it’s time to create our network.

Creating the network

Before we create our LinkedIn network, let’s start small, and figure out how PyVis and networkX work.

Simple Network

nt = net.Network(notebook=True)

g = nx.Graph()
g.add_node(0, label = "root") # initialize yourself as the central node
g.add_node(1, label = "Company 1", size=10, title="info1")
g.add_node(2, label = "Company 2", size=40, title="info2")
g.add_node(3, label = "Company 3", size=60, title="info3")
g.add_edge(0, 1)
g.add_edge(0, 2)
g.add_edge(0, 3)

nt.from_nx(g)
nt.show('nodes.html')
display(HTML('nodes.html'))

First things first, we create our Network object, which will take in a graph object at the end, and then we save the network as an HTML file.

To form our graph, we construct it with the Graph class and then add nodes to it. The add_node function takes in a couple of arguments.

The numbers you see at the start are unique IDs, which allow us to connect the nodes later on. You can also use the labels as IDs as long as they are unique.

You can control the size of a node by assigning an integer value to the size argument, and the title argument sets the hover info.

Try clicking on one of the nodes and hovering over it. You can see that the string you passed in is displayed!

After we have our graphs, we can also get properties like number of nodes and edges.

print(f"number of nodes: {g.number_of_nodes()}")
print(f"number of edges: {g.number_of_edges()}")

number of nodes: 4
number of edges: 3

Now that you understand how to create a simple network, let’s do it for our connections!

Reduce size of nodes

print(df_company.shape)
df_company_reduced = df_company.loc[df_company['count']>=5]
print(df_company_reduced.shape)

(1956, 2)
(45, 2)

print(df_position.shape)
df_position_reduced = df_position.loc[df_position['count']>=5]
print(df_position_reduced.shape)

(1535, 2)
(47, 2)

Currently we have almost 2,000 companies and over 1,500 positions, which would make our network incredibly large and slow down the visualization. To prevent that, we filter both down to entries with counts of 5 or more.

Creating network for connections

# initialize graph
g = nx.Graph()
g.add_node('root') # initialize yourself as the central node

# use iterrows to iterate through the data frame
for _, row in df_company_reduced.iterrows():

  # store company name and count
  company = row['company']
  count = row['count']

  title = f"<b>{company}</b> – {count}"
  positions = set([x for x in df[company == df['company']]['position']])
  positions = ''.join('<li>{}</li>'.format(x) for x in positions)

  position_list = f"<ul>{positions}</ul>"
  hover_info = title + position_list

  g.add_node(company, size=count*2, title=hover_info, color='#3449eb')
  g.add_edge('root', company, color='grey')

# generate the graph
nt = net.Network(height='700px', width='700px', bgcolor="black", font_color='white')
nt.from_nx(g)
nt.hrepulsion()
# more customization https://tinyurl.com/yf5lvvdm
nt.show('company_graph.html')
display(HTML('company_graph.html'))

First, we initialize the graph and add the root node, which represents you.

Then we use iterrows(), which allows us to iterate over the rows of our dataframe.

In each iteration, we save the company name and count for later use.

We want each node to display the positions our connections hold at that company, so we use a list comprehension to grab the positions and store them in a set (to prevent duplicates).

To make the hover information on our nodes prettier, we use HTML to format our information.

At the end of each loop, we add the new node and then add an edge linking it to our root node.

Once the graph is done building, we add it to the network and display it.

And Voila! We get this beautiful network!

In this network form, it’s much clearer who your connections are, and you get an idea of the span of your connections on LinkedIn.

The process is pretty much the same for our positions network.
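The code for the positions network isn’t reproduced here, but it follows the same pattern as the company loop above; here’s a sketch using df_position_reduced (column names assumed to be position and count):

# build the positions network the same way as the companies
g_pos = nx.Graph()
g_pos.add_node('root') # yourself as the central node

for _, row in df_position_reduced.iterrows():
  position = row['position']
  count = row['count']
  g_pos.add_node(position, size=count*2, title=f"<b>{position}</b> – {count}", color='#3449eb')
  g_pos.add_edge('root', position, color='grey')

nt = net.Network(height='700px', width='700px', bgcolor="black", font_color='white')
nt.from_nx(g_pos)
nt.hrepulsion()
nt.show('position_graph.html')
display(HTML('position_graph.html'))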

Look at how large the data scientist node is!

One thing I’ve noticed from this network is the Data Scientist title has many forms — “Principal Data Scientists”, “Sr. Data Scientists”, “Junior Data Scientists”, etc.

What we could’ve done is combine them all under the name “Data Scientist”.

I’ll leave that as a challenge for you readers!
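If you want a starting point, one simple (and admittedly crude) approach is to collapse every position containing “data scientist” into a single label before aggregating; this is a sketch, not part of the original code:

# collapse variants like "Sr. Data Scientist" into one "Data Scientist" label
mask = df['position'].str.contains("data scientist", case=False, na=False)
df.loc[mask, 'position'] = "Data Scientist"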

That’s all for this article, thank you for reading and I hope you found this article interesting!