Social Hygiene: Pruning my Twitter Feed with Plumes

While scrolling through my Twitter feed recently, I began to get a little annoyed at the amount of content that I was simply not interested in. In engineering terms, my signal-to-noise ratio was way too low.

This led me to think about my social media hygiene and how I don’t tend to prune my tweets, the people I follow, or my topics of interest.

So, over a few days off, I made a tool: plumes.

I designed plumes to be a simple Twitter CLI for day-to-day social media hygiene, allowing me to perform basic pruning operations. My end goal was to make a cron job that would perform typical operations at a scheduled interval.

In this post, I’ll be walking through my pruning process using plumes as I try to achieve a more reasonable level of Twitter content quality. I have two overall goals:

  1. Prune my tweets: delete old and low-value tweets
  2. Prune my friends: unfollow people based on certain criteria

Extracting Twitter Data

First, we need data.

The plumes CLI makes this pretty straightforward:

plumes tweets
plumes friends

The above commands give me all my tweets and all my friends (i.e., people I follow) as two JSON files, EngNadeau-tweets.json and EngNadeau-friends.json, respectively.

Analyzing and Pruning Tweets

Let’s take a look at my tweets. Using the JSON output from plumes, we can load the data and get an idea of my tweeting habits and quality.

Loading Data

Loading the data is a simple JSON -> DataFrame process:

import json
from pathlib import Path
import pandas as pd

# nicer pandas float formatting
pd.options.display.float_format = "{:g}".format

# load data
path = Path("EngNadeau-tweets.json")
with open(path) as f:
    data = json.load(f)

# convert to pandas dataframe
df = pd.json_normalize(data).pipe(
    lambda x: x.assign(**{"created_at": pd.to_datetime(x["created_at"])})
)

df

index | created_at | id | id_str | text | truncated | source | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | ...
0 | 2020-08-20 17:50:52+00:00 | 1296504993260896258 | 1296504993260896258 | Shoutout to @thedungeoncast for their breaks’ ... | False | <a href="http://twitter.com/download/iphone" r... | nan | None | nan | None | ...
1 | 2020-08-18 14:59:41+00:00 | 1295737134972903425 | 1295737134972903425 | My personal shoutout = eReader + @LibbyApp + @... | False | <a href="http://twitter.com/download/iphone" r... | nan | None | nan | None | ...
2 | 2020-08-18 03:00:46+00:00 | 1295556215351717888 | 1295556215351717888 | Accidentally published their private keys http... | False | <a href="http://twitter.com/download/iphone" r... | nan | None | nan | None | ...
3 | 2020-08-17 19:52:11+00:00 | 1295448359004786694 | 1295448359004786694 | This may be one of my favourite #bot features:... | False | <a href="https://mobile.twitter.com" rel="nofo... | 1.29545e+18 | 1295448001100623875 | 8.10231e+07 | 81023088 | ...
4 | 2020-08-17 19:50:46+00:00 | 1295448001100623875 | 1295448001100623875 | Pybotics -&gt; https://t.co/4YRC6gqOxf\nsemant... | False | <a href="https://mobile.twitter.com" rel="nofo... | 1.29545e+18 | 1295447755469594624 | 8.10231e+07 | 81023088 | ...
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1678 | 2014-08-19 20:16:49+00:00 | 501825129739845632 | 501825129739845632 | RT @Brainsight: Organized a few useful Brainsi... | False | <a href="http://twitter.com" rel="nofollow">Tw... | nan | None | nan | None | ...
1679 | 2014-08-16 22:39:25+00:00 | 500773855523504128 | 500773855523504128 | Researchers create 1,000-robot swarm\nhttp://t... | False | <a href="http://twitter.com/#!/download/ipad" ... | nan | None | nan | None | ...
1680 | 2014-08-07 04:03:12+00:00 | 497231458550177793 | 497231458550177793 | Officially got my Quebec junior engineering pe... | False | <a href="http://twitter.com/#!/download/ipad" ... | nan | None | nan | None | ...
1681 | 2014-08-07 02:33:08+00:00 | 497208793118552064 | 497208793118552064 | @Brainsight @JLMorris91 #brainsight #TMS http:... | False | <a href="http://twitter.com/#!/download/ipad" ... | 3.26715e+17 | 326715209219702785 | 5.04317e+08 | 504317349 | ...
1682 | 2014-06-20 03:23:49+00:00 | 479826930410475521 | 479826930410475521 | @CAJMTL &amp; @InterActionPMG, check out Face ... | False | <a href="http://twitter.com" rel="nofollow">Tw... | nan | None | 2.55997e+09 | 2559970724 | ...

Trailing columns (all NaN in the rows shown): retweeted_status.quoted_status.place.contained_within, retweeted_status.quoted_status.place.bounding_box.type, retweeted_status.quoted_status.place.bounding_box.coordinates, retweeted_status.quoted_status.entities.media, retweeted_status.quoted_status.extended_entities.media, retweeted_status.scopes.followers, retweeted_status.geo.type, retweeted_status.geo.coordinates, retweeted_status.coordinates.type, retweeted_status.coordinates.coordinates

1683 rows × 339 columns

Exploring Data

Wow, I never thought of myself as a “tweeter”, but 1600+ tweets is pretty cool.

Moreover, the data is very nicely structured, and pandas.json_normalize() is wonderful at converting the semi-structured JSON data into a flat table.
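To see what that flattening does, here is a minimal sketch on two made-up, heavily trimmed records. The records are hypothetical and not real plumes output; only `pd.json_normalize` itself is taken from the code above.

```python
import pandas as pd

# Hypothetical, heavily trimmed tweet records (not real plumes output)
records = [
    {"id": 1, "text": "hello", "user": {"screen_name": "alice", "followers_count": 10}},
    {"id": 2, "text": "RT hi", "retweeted_status": {"user": {"screen_name": "bob"}}},
]

flat = pd.json_normalize(records)

# Nested keys become dotted column names; keys missing from a record become NaN
print(sorted(flat.columns))
```

This is why the real table ends up with hundreds of columns such as retweeted_status.quoted_status.place.contained_within: every nested key that appears in any record gets its own column.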

Let’s take a closer look at some statistics:

df[["retweet_count", "favorite_count"]].describe()

       retweet_count  favorite_count
count           1683            1683
mean         177.448         0.75817
std          5146.03         1.75906
min                0               0
25%                0               0
50%                0               0
75%                1               1
max           209899              21

Hmmmm, I don’t remember ever going viral and getting 200k+ retweets.

Per the Twitter API docs, the retweet_count key of a Tweet object counts the source tweet’s retweets, not just my personal retweets.
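As a rough illustration of why the numbers get inflated, here is a small sketch on made-up data. It assumes that a retweet carries its source tweet under retweeted_status, so a flattened column such as retweeted_status.id is only populated for retweets; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: one original tweet and one retweet of a viral tweet
tweets = pd.DataFrame(
    {
        "text": ["my own tweet", "RT @someone: viral tweet"],
        "retweet_count": [2, 209899],
        # assumed to be populated only when the row is a retweet
        "retweeted_status.id": [None, 123.0],
    }
)

# A retweet embeds its source tweet, so this column is non-null for it
is_retweet = tweets["retweeted_status.id"].notna()

# Only the original tweet's own count remains once retweets are excluded
print(tweets.loc[~is_retweet, "retweet_count"].max())
```

This is only one possible way to separate the two groups; the analysis in this post filters on the retweeted column instead.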

Let’s filter out retweets to get just my personal tweets. Since we’re beginning to chain conditions, filters, transforms, etc. on our dataframe, I’ll begin using pipes to keep the code clean and efficient. (Note: I LOVE pipes).

(
    df.pipe(lambda x: x[~x["retweeted"]])
    .pipe(lambda x: x[["retweet_count", "favorite_count"]])
    .describe(percentiles=[0.5, 0.9, 0.95, 0.99])
)

       retweet_count  favorite_count
count           1376            1376
mean        0.205669        0.919331
std         0.732019         1.89699
min                0               0
50%                0               0
90%                1               2
95%                1               4
99%                4               9
max               11              21

Aha! That’s more like it.

So I have around 1.3k personal tweets and I definitely never went viral.

Tweets vs. Time

I wonder what this looks like over time?

from matplotlib import pyplot as plt
import matplotlib as mpl

mpl.rcParams["axes.spines.right"] = False
mpl.rcParams["axes.spines.top"] = False

figsize=(10, 4)

fig, ax = plt.subplots(figsize=figsize)

(
    df.pipe(lambda x: x[~x["retweeted"]])
    .set_index("created_at")
    .resample("2Q")
    .count()
    .pipe(lambda x: x.set_index(x.index.date))
    .pipe(lambda x: x["id"])
    .plot.bar(ax=ax, rot=30)
)

fig.suptitle("Tweets Over Time")
ax.set_xlabel("Quarter")
ax.set_ylabel("Number of Tweets")
fig.tight_layout()

[Figure: bar chart titled "Tweets Over Time" (x: Quarter, y: Number of Tweets)]

It appears that I tend to be cyclical with my tweets. Here are some highlights I can think of:

  • 2015-2016: Attended medical conferences and was tweeting to promote my company.
  • 2017-2018: Attended robotics conferences and was tweeting to promote my research.
  • 2019: Writing my PhD thesis and forgot about the rest of the world.

Interactions vs. Time

fig, ax = plt.subplots(figsize=figsize)

(
    df.pipe(lambda x: x[~x["retweeted"]])
    .set_index("created_at")
    # select the numeric columns first so sum() never touches text columns
    .pipe(lambda x: x[["retweet_count", "favorite_count"]])
    .resample("2Q")
    .sum()
    .pipe(lambda x: x.set_index(x.index.date))
    .plot.bar(ax=ax, rot=30)
)

fig.suptitle("Interactions Over Time")
ax.set_xlabel("Quarter")
ax.set_ylabel("Number of Interactions")
ax.legend()
fig.tight_layout()

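[Figure: bar chart titled "Interactions Over Time" (x: Quarter, y: Number of Interactions; series: retweet_count, favorite_count)]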

As expected, the number of interactions generally follows my tweeting frequency. The more you give, the more you get.

But, I did have a stellar quarter in 2017.

Pruning Tweets

With goal #1 in mind, let’s use plumes to prune (i.e., delete) old tweets that aren’t worth keeping around. Judging from the previous data and plots (e.g., the 99th percentile above), I’d be OK with deleting tweets that:

  • Are older than 60 days
  • Have fewer than 9 likes
  • Have fewer than 4 retweets
  • Are not self-liked by me

The command (and future cron job) will look like this:

# add --prune to switch from dry-run to deleting
plumes audit_tweets EngNadeau-tweets.json --min_likes 9 --min_retweets 4 --days 60 --self_favorited False

This results in 1325 identified tweets that will be deleted. Goodbye :)
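For intuition, the four criteria can be sketched as a plain pandas filter. This is an illustration on hypothetical rows, not how plumes is implemented; in particular, treating the favorited column as "liked by me" and fixing the reference date are assumptions.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the tweets DataFrame
now = pd.Timestamp("2020-08-21", tz="UTC")
tweets = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2020-08-18", "2017-05-01", "2016-03-10", "2015-01-05"], utc=True
        ),
        "favorite_count": [0, 12, 1, 0],
        "retweet_count": [0, 1, 0, 0],
        "favorited": [False, False, False, True],  # assumed "liked by me" flag
    }
)

# All four criteria must hold (AND) for a tweet to be flagged
to_prune = tweets[
    (tweets["created_at"] < now - pd.Timedelta(days=60))
    & (tweets["favorite_count"] < 9)
    & (tweets["retweet_count"] < 4)
    & ~tweets["favorited"]
]

print(len(to_prune))
```

In this sample, the recent tweet, the well-liked tweet, and the self-liked tweet all survive; only the old, low-engagement tweet is flagged.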

Analyzing and Pruning Friends

Next, let’s take a look at my friends (i.e., people I follow). Similar to the tweet analysis and pruning process, we’ll be using the plumes JSON output to get an idea of my following quality.

Loading Data

Like before, we will simply load the JSON data and convert to a DataFrame.

# load data
path = Path("EngNadeau-friends.json")
with open(path) as f:
    data = json.load(f)

# convert data to pandas dataframe
df = pd.json_normalize(data).pipe(
    lambda x: x.assign(**{"created_at": pd.to_datetime(x["created_at"])})
)

df

idid_strnamescreen_namelocationdescriptionurlprotectedfollowers_countfriends_count...status.retweeted_status.place.idstatus.retweeted_status.place.urlstatus.retweeted_status.place.place_typestatus.retweeted_status.place.namestatus.retweeted_status.place.full_namestatus.retweeted_status.place.country_codestatus.retweeted_status.place.countrystatus.retweeted_status.place.contained_withinstatus.retweeted_status.place.bounding_box.typestatus.retweeted_status.place.bounding_box.coordinates
08631527686315276Will StrafachchronicSan Francisco, CAbuilding great things. breaking others. | foun...https://t.co/7qRzHeZcxyFalse602395265...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
11575223515752235Zack WhittakerzackwhittakerNew York, NYSecurity editor @TechCrunch • Signal / WhatsAp...https://t.co/0I0oRqFMAyFalse59392998...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
2776454093816594433776454093816594433Oliver LimoyoOliverLimoyoPhD. candidate @UofTRobotics @VectorInst study...https://t.co/I8kDSFF4JpFalse71306...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
310793702784242728981079370278424272898AppsCyborgAppsCyborgWorldHome of all cyborg web apps. All our apps are ...https://t.co/djpFssnWsiFalse8250...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
41951009019510090Julian TogeliustogeliusNew York CityAI and games researcher.\nAssociate professor ...http://t.co/j74XjVzSpsFalse10675983...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
..................................................................
14081426915214269152Anthony HaanthonyhaNew York, NYJournalism for @TechCrunch, science fiction fo...https://t.co/2dWc2EzwK6False43278731...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
140923840712384071timoreillytimoreillyOakland, CAFounder and CEO, O'Reilly Media. Watching the ...https://t.co/HsFlR6PWvTFalse17666872121...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
14102293891422938914Steve WozniakstevewozLos Gatos, CaliforniaEngineers first! Human rights. Gadgets. Jokes ...http://t.co/gC1NnB1hglFalse62650192...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
141118354111835411Andy IhnatkoIhnatkoSector ZZ9 Plural Z AlphaTech contributor to @bospublicradio @WGBH, pod...http://t.co/xoCNd62XhnFalse889181916...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
14121722093417220934Al GorealgoreNashville, TNhttps://t.co/R5WtdSm0cWFalse304060538...NaNNaNNaNNaNNaNNaNNaNNaNNaNNaN

1413 rows × 131 columns

Like the tweets data before, the users JSON data is also nicely structured.

According to this (and my Twitter profile), I follow 1.4k+ people. I definitely don’t interact with that many people, nor pay attention to them. Let’s explore the data a little more and then clean this up.

Exploring Data

First, let’s take a look at basic user statistics:

df.filter(like="_count").describe()

followers_countfriends_countlisted_countfavourites_countstatuses_countstatus.retweeted_status.retweet_countstatus.retweeted_status.favorite_countstatus.retweet_countstatus.favorite_count
count1413141314131413141334834814131413
mean1.26141e+061898.86021.7611701.328546.41421.526146.5429.389490.885
std5.78562e+0616325.616785.941353.262321.87092.6531651.43723.265379.2
min7100061000
25%92132402337042970411.7500
50%48128630910304586301549.532
75%3439111412425192612378995.25434.51518
max1.21754e+086020532212821.03488e+066540806324736741263247164625

Above we have a lot of useful info. From a segmentation perspective, followers_count followed by statuses_count gives the biggest variance if we wanted to classify our users into groups.
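One quick way to check which columns spread the most is to rank them by standard deviation. The sketch below uses hypothetical numbers with the same column naming as the friends DataFrame; raw standard deviation depends on scale, so treat it as a rough first look rather than a proper segmentation.

```python
import pandas as pd

# Hypothetical user stats with the same column naming as the friends DataFrame
users = pd.DataFrame(
    {
        "followers_count": [120, 4_800, 1_250_000],
        "friends_count": [300, 650, 900],
        "statuses_count": [40, 5_900, 88_000],
        "screen_name": ["a", "b", "c"],
    }
)

# Rank the *_count columns by standard deviation, largest spread first
spread = users.filter(like="_count").std().sort_values(ascending=False)

print(spread.index.tolist())
```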

However, who are the super popular people that are outliers compared to the rest of my friends?

df.sort_values(by="followers_count", ascending=False).head()

index | id | id_str | name | screen_name | location | description | url | protected | followers_count | friends_count | ...
1407 | 813286 | 813286 | Barack Obama | BarackObama | Washington, DC | Dad, husband, President, citizen. | https://t.co/93Y27HEnnX | False | 121753699 | 602053 | ...
1000 | 10228272 | 10228272 | YouTube | YouTube | San Bruno, CA | Black Lives Matter. | https://t.co/qkVaJFk2CG | False | 72161138 | 1124 | ...
1403 | 428333 | 428333 | CNN Breaking News | cnnbrk | Everywhere | Breaking news from CNN Digital. Now 58M strong... | http://t.co/HjKR4r61U5 | False | 58499326 | 119 | ...
1406 | 50393960 | 50393960 | Bill Gates | BillGates | Seattle, WA | Sharing things I'm learning through my foundat... | https://t.co/emd1hfqSRD | False | 51520675 | 228 | ...
1323 | 759251 | 759251 | CNN | CNN |  | It’s our job to #GoThere & tell the most diffi... | http://t.co/IaghNW8Xm2 | False | 49599588 | 1108 | ...

Trailing columns (all NaN in the rows shown): status.retweeted_status.place.id, status.retweeted_status.place.url, status.retweeted_status.place.place_type, status.retweeted_status.place.name, status.retweeted_status.place.full_name, status.retweeted_status.place.country_code, status.retweeted_status.place.country, status.retweeted_status.place.contained_within, status.retweeted_status.place.bounding_box.type, status.retweeted_status.place.bounding_box.coordinates

5 rows × 131 columns

Oh… hello Obama.

But this also brings up another issue: who should I actually follow?

As much as I like Obama and Bill Gates, I don’t actually interact with them. This is especially true for CNN and YouTube.

If there is something truly worthwhile being tweeted by these people (or orgs), I’ll probably hear about it from my thousand other social media sources. What I want from Twitter is more personal content from people that provide intelligent ideas and good discussion topics.

Let’s start by exploring a user’s followers vs. friends and their Twitter Follower-to-Friend (TFF) ratio.

fig, ax = plt.subplots(figsize=figsize)

df.plot.scatter(x="friends_count", y="followers_count", ax=ax, label="Friend")

fig.suptitle("Friends' Followers vs. Friends")
ax.set_xlabel("Number of Friends")
ax.set_ylabel("Number of Followers")
ax.set_xlim([0, df["friends_count"].quantile(q=0.9)])
ax.set_ylim([0, df["followers_count"].quantile(q=0.9)])
ax.legend()
fig.tight_layout()

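[Figure: scatter plot titled "Friends' Followers vs. Friends" (x: Number of Friends, y: Number of Followers)]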

Taking the TFF ratio as followers divided by friends, the majority of outliers sit above a TFF ratio of 1. Above TFF=1, people have many more followers than friends (e.g., celebrities, popular people). Below TFF=1, you follow a lot of people, but people don’t follow you back (e.g., up-and-comers, bots).
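Here is a minimal sketch of the ratio itself, computed as followers divided by friends (the same definition as the ratio column used later in this post). The accounts are hypothetical, and treating a zero friend count as missing is a choice made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical accounts; column names match the friends DataFrame
users = pd.DataFrame(
    {
        "screen_name": ["celebrity", "peer", "follow_bot", "no_friends"],
        "followers_count": [1_000_000, 500, 20, 300],
        "friends_count": [100, 400, 4_000, 0],
    }
)

# TFF = followers / friends; a zero friend count yields inf, treated as missing
users["ratio"] = (users["followers_count"] / users["friends_count"]).replace(
    [np.inf, -np.inf], np.nan
)

print(users[["screen_name", "ratio"]])
```

The celebrity-style account lands far above 1, the peer sits just above 1, and the account that follows thousands of people without being followed back falls well below 1.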

So who should I follow? Let’s look at who I’ve interacted with in the past.

First, I’ll load my past retweets (EngNadeau-tweets.json) and favourited tweets (EngNadeau-favorites.json). Second, I’ll extract the users I’ve retweeted or liked in the past. Third, I’ll compare these users vs. my current friends.

import numpy as np

# load retweet data
path = Path("EngNadeau-tweets.json")
with open(path) as f:
    data = json.load(f)

df_retweeted = (
    pd.json_normalize(data)
    .pipe(lambda x: x.assign(**{"created_at": pd.to_datetime(x["created_at"])}))
    .pipe(lambda x: x[x["retweeted"]])
    .filter(like="retweeted_status")
    .filter(like="_count")
    .pipe(
        lambda x: x.assign(
            ratio=x["retweeted_status.user.followers_count"]
            / x["retweeted_status.user.friends_count"]
        )
    )
    .replace([np.inf, -np.inf], np.nan)
    .dropna(subset=["ratio"])
)

df_retweeted.describe()

retweeted_status.user.followers_countretweeted_status.user.friends_countretweeted_status.user.listed_countretweeted_status.user.favourites_countretweeted_status.user.statuses_countretweeted_status.retweet_countretweeted_status.favorite_countretweeted_status.quoted_status.user.followers_countretweeted_status.quoted_status.user.friends_countretweeted_status.quoted_status.user.listed_countretweeted_status.quoted_status.user.favourites_countretweeted_status.quoted_status.user.statuses_countretweeted_status.quoted_status.retweet_countretweeted_status.quoted_status.favorite_countratio
count30230230230230230230212121212121212302
mean6811726140.143436.956985.0827714.4960.6061533.558786131547.927723.755401.75173021878.679460.336533.61
std3.61614e+0668696.811923.934915.98922112127.6208022.99208e+061274.3925607.84931.233218.16146.833215642805.3
min15100871063936833661021020.0815217
25%993237.2536494.251753132113675951666.52686.251.757.50.872062
50%3265712981583.55369365747.51026.5140.545455060617.55.70978
75%31005.22142585.254501.2510565.8262914300.22179.25581.57356.2512165.2122.25205.7539.3201
max3.81207e+071.16459e+061127425763309596142098993609151.03793e+074320890271623911901021386111562392997

# load favourited tweets data
# run `plumes favorites`
path = Path("EngNadeau-favorites.json")
with open(path) as f:
    data = json.load(f)

df_favorites = (
    pd.json_normalize(data)
    .pipe(lambda x: x.assign(**{"created_at": pd.to_datetime(x["created_at"])}))
    .drop_duplicates(subset="user.screen_name")
    .filter(like="user.")
    .filter(like="_count")
    .pipe(lambda x: x.assign(ratio=x["user.followers_count"] / x["user.friends_count"]))
    .replace([np.inf, -np.inf], np.nan)
    .dropna(subset=["ratio"])
)

df_favorites.describe()

user.followers_countuser.friends_countuser.listed_countuser.favourites_countuser.statuses_countquoted_status.user.followers_countquoted_status.user.friends_countquoted_status.user.listed_countquoted_status.user.favourites_countquoted_status.user.statuses_countratio
count1571571571571571313131313157
mean1.03861e+068965.463013.0115381.725452.6522141244.15843.3087358.926584.381437.58
std9.96025e+0654112.119875.254028.696299.11158181362.511763.613425.18800.2615243.8
min1570027307070780.05
25%887337276081352114691492337271.38529
50%511083012927344793403079715794641393.74692
75%3157919685479695136701055614093186045668827.2923
max1.22369e+08601499220417576678963929335806369959534614831104190697

So, from the above, I typically interact with pretty popular people (retweeted_status.user.followers_count and user.followers_count), but most people fall within a TFF ratio of about 1 to 30.

What does this look like plotted?

fig, ax = plt.subplots(figsize=figsize)

df.plot.scatter(x="friends_count", y="followers_count", ax=ax, label="Friend")

df_retweeted.plot.scatter(
    x="retweeted_status.user.friends_count",
    y="retweeted_status.user.followers_count",
    ax=ax,
    label="Retweeted User",
    c="C1",
)

df_favorites.plot.scatter(
    x="user.friends_count", y="user.followers_count", ax=ax, label="Liked User", c="C2"
)

fig.suptitle("Users' Friends vs Followers")
ax.set_xlabel("Number of Friends")
ax.set_ylabel("Number of Followers")
ax.set_xlim([0, df["friends_count"].quantile(q=0.9)])
ax.set_ylim([0, df["followers_count"].quantile(q=0.7)])
ax.legend()
fig.tight_layout()

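[Figure: scatter plot titled "Users' Friends vs Followers" (x: Number of Friends, y: Number of Followers; series: Friend, Retweeted User, Liked User)]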

The vast majority of people I interact with are within the core grouping of less than 2k friends and 50k followers (i.e., a TFF ratio of 25).

Pruning Friends

With goal #2 in mind, we’ll use plumes to prune friends that I don’t typically interact with. From the previous data and plots, I’ll unfriend people that:

  • Haven’t tweeted in the last 30 days
  • Have a TFF ratio less than 1
  • Have a TFF ratio more than 30

Since plumes assumes the conditional flags are AND boolean operations, we’ll need to call the bool_or flag to convert the boolean algebra to OR conditions. The command will look like this:

# add --prune to switch from dry-run to deleting
poetry run plumes audit_users EngNadeau-friends.json --min_ratio 1 --max_ratio 30 --days 30 --bool_or

This results in 912 identified users that will be unfriended.
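To show why the OR switch matters here, the sketch below applies the same three conditions to a few hypothetical accounts in pandas. The days_since_tweet and ratio columns are illustrative names, and this is not how plumes evaluates its flags internally.

```python
import pandas as pd

# Hypothetical accounts; "days_since_tweet" and "ratio" are illustrative columns
users = pd.DataFrame(
    {
        "screen_name": ["active_peer", "dormant_peer", "celebrity", "follow_bot"],
        "days_since_tweet": [2, 400, 1, 3],
        "ratio": [5.0, 8.0, 200.0, 0.1],
    }
)

inactive = users["days_since_tweet"] > 30
below_min = users["ratio"] < 1
above_max = users["ratio"] > 30

# AND: every condition must hold, which can never happen for both ratio bounds
flagged_and = users[inactive & below_min & above_max]
# OR: any single condition is enough to flag the account
flagged_or = users[inactive | below_min | above_max]

print(len(flagged_and), flagged_or["screen_name"].tolist())
```

With AND, nothing is ever flagged because a ratio cannot be below 1 and above 30 at the same time. With OR, the dormant account, the celebrity, and the follow-bot are all flagged, while the active peer is kept.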

My Twitter feed has never looked so good. :)
