In July 2019, I started The Root, a weekly financial wellness newsletter that aims to foster conversations about money. Until now, I have only used Mailchimp, an automated marketing platform, to measure The Root's growth, success, and impact. In just six months, The Root has grown to over 900 subscribers, with an average open rate of 40% and an average click rate of 5%; the industry standard is an open rate of less than 18% and a click rate of 2%. A large part of my time this semester has been dedicated to The Root's success. Because of this, I want to dedicate this tutorial to optimizing, understanding, and visualizing the data gathered by Mailchimp, beyond the standard graphs it provides on its dashboard. Through this tutorial, I hope to gain a better understanding of email metrics and various techniques in data exploration, as well as provide a guide for others looking to gain insight into their email marketing efforts.
The data for this tutorial was downloaded using Mailchimp's export feature, which provides a copy of account data as a ZIP file downloadable from my account. The ZIP file contains a significant amount of data organized into folders, with the individual files in CSV format. I will be making use of the campaign report (campaigns.csv) and the audience lists (subscribed.csv and unsubscribed.csv).
Before diving into analyzing the data and running code, there are several steps I need to take. To preprocess the data, I organized the Mailchimp export and kept only what I needed: the campaign report (campaigns.csv) and the List data. More importantly, I quickly recognized that the List data contained sensitive information such as full names, emails, geolocation data, and more, so I decided to prioritize anonymizing it. To do this, I used numpy's random.permutation function and pandas reindexing. After that, I followed relatively standard tidying methods to get clean, easy-to-use data: concatenating dataframes, dropping unnecessary columns, converting columns to the proper data types, and more.
For this tutorial, a majority of the exploration involves gathering insights that summarize the many data points provided by Mailchimp. For subscriber data, this includes location, timezone, and opt-in date. For campaign data, this includes opens, clicks, unsubscribes, subject lines, send dates, and more.
I believe that time will be one of the most important factors in a successful email.
Although time is important, it became difficult to identify a single factor that leads to a successful email. Additionally, as I worked through the tutorial, I realized that the most helpful insights would lie in comparing subscriber growth to opens, clicks, and other metrics. This matters because consistent engagement is the single most important measure of a healthy email list. Overall, this analysis provides me with detailed insights into The Root's data and expectations for the future. Specifically, it is helping me make predictions for future growth using the concrete data from the past six months. Accurately predicting growth is useful for many reasons: I am currently presenting The Root at business pitch competitions, to potential investors, and more. I look forward to consistently digging into this data and continuing to explore. In the future, I plan to access the data through Mailchimp's API rather than static CSV files.
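As a preview of the kind of projection I have in mind, here is a minimal sketch of that idea: it fits a straight line to monthly sign-up counts and extends it a few months forward. It assumes the cleaned subscriber dataframe (called shuffled, built later in this tutorial) with its OPTIN_TIME column, so it is meant to be run after the cleaning steps below, and the straight-line fit is a deliberately simple stand-in rather than a real forecasting model.
import numpy as np
import pandas as pd
# Monthly sign-up counts from the cleaned subscriber data built later in this tutorial.
signups = shuffled.set_index("OPTIN_TIME").resample("M").size()
# Fit a straight line: x = month number, y = sign-ups in that month.
x = np.arange(len(signups))
slope, intercept = np.polyfit(x, signups.values, 1)
# Project the next three months, assuming growth stays roughly linear.
future_x = np.arange(len(signups), len(signups) + 3)
projection = slope * future_x + intercept
print(pd.Series(projection.round(), name = "projected monthly sign-ups"))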
For this project, I will be using Python and several imported libraries such as pandas, numpy, matplotlib, scikit-learn, seaborn, wordcloud, and more.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
# Widen the notebook and relax display limits so dataframes are easier to read.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
pd.options.display.max_rows = 500
pd.options.display.max_columns = 500
pd.options.display.width = 1000
First, I am loading in the subscriber data: both current subscribers and contacts who have unsubscribed.
sub_df = pd.read_csv("./newsletteraudience/subscribed.csv")
unsub_df = pd.read_csv("./newsletteraudience/unsubscribed.csv")
print(sub_df.shape)
print(unsub_df.shape)
Next, I am merging the subscribed and unsubscribed lists into a single contacts dataframe so we can get a better view of what we are working with.
contacts = pd.concat([sub_df, unsub_df], sort = False, ignore_index = True)
contacts.shape
contacts.dtypes
contacts.isnull().sum()
As you can see, there are a lot of missing values. We are only going to keep the columns that will aid us in understanding our data better.
The dataset loaded in below contains campaign data and is being stored in a pandas dataframe using read_csv().
df1 = pd.read_csv("./campaigns.csv")
In the original dataset, called "df1", there are 26 columns. Below is a detailed look into what is provided.
df1.shape
df1.dtypes
The "contacts" dataframe contains sensitive information such as full names, emails, and other identifying information. For these subscribers' safety, I will shuffle and anonymize the dataset. I am doing this below.
contacts = contacts[["Email?","MEMBER_RATING", "REGION", "TIMEZONE", "OPTIN_TIME"]]
shuffled = contacts.reindex(np.random.permutation(contacts.index))
shuffled.reset_index(inplace = True)
In order to identify subscribers without their sensitive information, I am assigning an email_id to each record.
shuffled['email_id'] = shuffled.index
shuffled.drop(columns = ['Email?', 'index'], inplace = True)
Now, I am converting OPTIN_TIME to its proper type using the pandas datetime functionality.
shuffled["OPTIN_TIME"] = pd.to_datetime(shuffled["OPTIN_TIME"])
shuffled.dtypes
Out of curiosity, let's see how many records have a NaN value.
shuffled.isnull().sum()
This number of NaNs will be something to look into for future data collection. It is interesting to see what is missing.
shuffled.dropna(inplace = True)
shuffled.set_index("email_id", inplace = True)
Above, I am taking the final steps to achieve a clean data set. Although not ideal, I am dropping the NaN values to make the data easier to work with. Lastly, I am setting the index to email_id.
shuffled.to_csv("cleaned.csv")
To finish, I am saving the cleaned data into a CSV file for future use and exploration.
Now, I am tidying up the campaign data to ensure I have the information I need. Because the CSV file includes so many columns, I need to choose the ones I would like to keep and then concatenate them into one dataframe, which I call "metrics."
df_interest = df1.iloc[:, 0:21]
metrics1 = df_interest.iloc[:, 1:5]
metrics2 = df_interest.iloc[:, 5:9]
metrics3 = df_interest.iloc[:, 12:18]
metrics = pd.concat([metrics1, metrics2, metrics3], axis = 1)
metrics.drop(['List', 'Send Weekday', 'Open Rate', 'Click Rate'], axis = 1, inplace = True)
In order to access the individual campaigns, I am inserting a campaign_id column using the index.
metrics['campaign_id'] = metrics.index
metrics["Subject"] = metrics["Subject"].astype(str, copy = True)
metrics["Send Date"] = pd.to_datetime(metrics["Send Date"])
metrics.set_index("Send Date", inplace = True)
metrics['send_time'] = metrics.index.time
As you can see, I am using the Send Date column as the index. It is being stored as a datetime variable. This will be useful for future analysis.
metrics.head()
metrics.dtypes
metrics.shape
Now, we have only the columns we need.
Let's get started with the analysis. Here is a view of the finalized subscriber dataframe we are going to be working with.
shuffled.head()
According to Mailchimp, regions are an estimate based on IP address data. A contact's IP address is first collected when they sign up, and the location estimate becomes more and more accurate as they engage with the email campaigns.
shuffled["REGION"].value_counts().head(10)
With a closer look at the REGION column, I can see that locations are stored as both uppercase and lowercase. In order to get the proper value counts of locations, I am going to convert all the REGION records to uppercase and take another look.
shuffled['region'] = shuffled['REGION'].str.upper()
shuffled.drop(columns = ['REGION'], inplace = True)
Now that the data has been tidied a bit further, I am getting the counts for each region to get a better overview of my subscriber list.
region_counts = shuffled["region"].value_counts(ascending = False)
region_counts_df = pd.DataFrame(region_counts)
region_counts_df.reset_index()  # display only; region_counts_df itself keeps region names in its index, which the pie chart below uses as labels
region_counts_df.head(10)
To get a clearer picture, I am going to look at the top five regions and visualize them through a pie chart.
region_counts_df.iloc[:5].plot.pie(y = 'region', figsize = (8,8), autopct = '%1.1f%%')
plt.title('Top 5 Regions', color='black')
As you can see, this pie chart shows that 41.7% of The Root's subscribers are in Louisiana. This makes sense: I have been mostly spreading the word to students on Tulane's campus!
First, I am making a copy of the shuffled dataframe, just to keep things organized!
rating_df = shuffled.copy()
Next, I am getting an overall look at how The Root's subscribers are rated and putting the counts in descending order.
rating_table = rating_df["MEMBER_RATING"].value_counts(ascending = False).to_frame()
rating_table
The table above shows an overall view of member rating counts. It helps, but... let's visualize it!
rating_table.plot.pie(y = "MEMBER_RATING", figsize = (8,8), autopct = '%1.1f%%')
plt.title('Member Ratings', color='black')
This graph gives me a good idea of how engaged The Root's subscribers are. It could definitely improve.
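To boil this distribution down to a single number, here is a quick sketch that computes the average member rating and the share of subscribers rated 4 or above; the 4+ cutoff is just my own working definition of "highly engaged," not a Mailchimp one.
# Average member rating across all subscribers (Mailchimp rates contacts from 1 to 5).
print("Average rating:", round(rating_df["MEMBER_RATING"].mean(), 2))
# Share of subscribers at rating 4 or 5 (my own cutoff for highly engaged).
print("Share rated 4+: {:.1%}".format((rating_df["MEMBER_RATING"] >= 4).mean()))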
First, I am looking for an overall view of our Campaign data and the relationship between the variables.
ax = sns.heatmap(metrics.corr()[['Successful Deliveries']].sort_values('Successful Deliveries'),
                 cmap = 'coolwarm', annot = True, xticklabels = True)
ax.set_title("Successful Deliveries and Metrics")
plt.yticks(rotation=0)
This graph shows the relationship between successful newsletter deliveries and the other metrics we are observing. As expected, hard bounces and successful deliveries have a very low correlation.
The Hard Bounces column is important to dive into. Hard bounces can occur for several reasons: the recipient email address doesn't exist, the domain name doesn't exist, or the recipient email server has completely blocked delivery.
Here, I am using seaborn's scatterplot functionality.
sns.scatterplot(metrics["Successful Deliveries"], metrics["Hard Bounces"]).set_title("Successful Deliveries vs Hard Bounces")
sns.barplot(data = metrics, x = "campaign_id", y = "Hard Bounces").set_title("Hard Bounces per Campaign")
These graphs show that there have only been a few instances of hard bounces in the overall history of the newsletter. Mailchimp makes an effort to clean subscriber lists, so if it detects a hard bounce it unsubscribes the contact. This explains the sparseness of these graphs.
Now, I want to look at the Soft Bounces metric. Compared to hard bounces, soft bounces can occur for different reasons, including: the recipient's mailbox is full, the recipient's email server is down or offline, or the email message is too large. Here are the same graphs as above, but using this metric.
sns.scatterplot(metrics["Successful Deliveries"], metrics["Soft Bounces"]).set_title("Successful Deliveries vs Soft Bounces")
sns.barplot(data = metrics, x = "campaign_id", y = "Soft Bounces").set_title("Soft Bounces per Campaign")
These graphs are helpful and show the difference in frequency between soft and hard bounces. Because this difference is my biggest takeaway, I am going to visualize a direct comparison between the two metrics.
To begin, I am copying the metrics dataframe, using only the columns I am interested in. I am calling the new dataframe bouncedf.
bouncedf = metrics[['campaign_id', 'Soft Bounces', 'Hard Bounces']].copy()
The next step involves creating a pivot table from bouncedf. I am doing this to stay organized and look at only the information I am currently investigating.
bounce_table = pd.pivot_table(bouncedf, values = ["Soft Bounces", "Hard Bounces"], index = "campaign_id", aggfunc = np.mean)
bounce_table.head()
Now, I am writing a small plotting function. Its input is a list of the metrics I want to compare, and it selects only those columns from the pivot table before plotting.
def bounce_type(measures):
    fig, ax = plt.subplots(figsize=(10,5))
    # Plot only the requested bounce columns; bounce_table is already indexed by campaign_id.
    bounce_table[measures].plot(kind = 'bar', ax = ax, grid = True)
    ax.set_ylabel('Bounces')
    ax.set_title('Bounces per Campaign')
    ax.set_xlabel("Campaign ID")
    ax.legend(measures)
    return ax
ax = bounce_type(['Hard Bounces', 'Soft Bounces'])
As you can see, soft bounces are much more common than hard bounces, and in many cases the contacts that bounce are cleared from the sending list.
sns.barplot(data = metrics, x = "campaign_id", y = "Total Opens").set_title("Total Opens per Campaign")
This graph is helpful, but may be difficult to understand without context. Out of curiosity, I have been resending campaigns to non-openers, which has not proved very useful. Additionally, The Root is sent out every Monday morning; sometimes I have seen a spike in subscribers on that day and have sent the campaign to those new contacts separately. Because that is a smaller list, the opens on those sends are smaller.
Now, I am using seaborn's lmplot. It is useful because it provides a regression line, which helps with gaining insight into overall expectations for the future.
# lmplot creates its own figure, so a separate plt.figure call is not needed.
sns.lmplot(x = "Total Clicks", y = "Unique Clicks", data = metrics,
           scatter_kws = {"color": "red", "alpha": 0.5, "s": 30})
This graph is interesting because it shows that a subscriber who clicks will usually click around two times throughout the newsletter.
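To put a number on that "around two times" observation, here is a quick check: the ratio of Total Clicks to Unique Clicks for each campaign, averaged across campaigns. Campaigns with zero unique clicks are excluded so the division is safe.
# Clicks per clicking subscriber, campaign by campaign (skip campaigns with no clicks).
clicks_per_clicker = metrics["Total Clicks"] / metrics["Unique Clicks"].replace(0, np.nan)
print("Average clicks per clicker:", round(clicks_per_clicker.mean(), 2))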
Again, I am using seaborn's lmplot functionality to measure consistency between Total Recipients and Unique Opens. I am pretty proud of the results.
# As above, lmplot manages its own figure, so the plt.figure call is dropped.
sns.lmplot(x = 'Total Recipients', y = 'Unique Opens', data = metrics, fit_reg = True,
           scatter_kws = {"color": "red", "alpha": 0.5, "s": 30})
This graph shows the positive correlation between Total Recipients and Unique Opens. This is interesting to see because, as the list has grown, people have continued opening emails. This is the opposite of a dead list, where the number of subscribers is high but actual engagement is close to nothing.
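To quantify that relationship, the quick check below computes the Pearson correlation between Total Recipients and Unique Opens, along with the slope of a straight-line fit; roughly, the slope says how many additional unique opens each extra recipient has historically been worth.
# Pearson correlation between list size and unique opens.
corr = metrics["Total Recipients"].corr(metrics["Unique Opens"])
# Slope of a straight-line fit: extra unique opens per additional recipient.
slope, intercept = np.polyfit(metrics["Total Recipients"], metrics["Unique Opens"], 1)
print("Correlation: {:.2f}, slope: {:.2f}".format(corr, slope))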
In order to see how opens have progressed over time, I am creating a column called month. I am doing this using the (very helpful) pandas DateTimeIndex functionality.
metrics['month'] = metrics.index.month
newdf = metrics[['Total Recipients', 'Unique Opens', 'Unique Clicks', 'month']].copy()
month_table = pd.pivot_table(newdf, values = ["Total Recipients", "Unique Opens", "Unique Clicks"], index = "month", aggfunc = np.mean)
month_table
month_table.plot(kind = 'bar')
This visualization tells me that the last few campaigns have not had as many opens as earlier ones. It will be interesting to figure out why, and what can be done with this data to improve.
I am going to take a look at this data through a line graph.
def by_month(measures):
    fig, ax = plt.subplots(figsize=(10,5))
    # month_table is already indexed by month, so just plot the requested columns.
    month_table[measures].plot(kind = 'line', ax = ax, grid = True)
    ax.set_ylabel('# of Individuals')
    ax.set_title('Metrics by Month')
    ax.set_xlabel("Month")
    ax.legend(measures)
    return ax
ax = by_month(['Unique Opens', 'Unique Clicks', 'Total Recipients'])
I am going to separate out hour from our time index. Through this, I hope to gain some insight into how powerful time is when sending out a newsletter.
metrics['hour'] = metrics.index.hour
ax = sns.heatmap(metrics.corr()[['hour']].sort_values('hour'),
                 cmap = 'coolwarm', annot = True, xticklabels = True)
ax.set_title("Time and metrics")
plt.yticks(rotation=0)
Through the heatmap above, I can see the relationship hour has to the various metrics. Notably, hour and Total Opens have the lowest correlation. I was not expecting this.
def by_hour(measures):
    fig, ax = plt.subplots(figsize=(10,5))
    # Select only the requested columns so the legend matches what is actually plotted.
    metrics.groupby('hour')[measures].mean().plot(kind = 'line', ax = ax, grid = True)
    ax.set_ylabel('# of Individuals')
    ax.set_title('Unique Engagement per Hour')
    ax.set_xlabel("Hour")
    ax.legend(measures)
    return ax
I created the function above to get a closer look into the relationship between hour and various important newsletter metrics such as Unique Opens and Unique Clicks.
ax = by_hour(['Unique Opens', 'Unique Clicks'])
Through this graph, I aimed to learn what times work best for sending out newsletters. It shows that morning sends, closer to 7:30 AM, work best. This graph is difficult to rely on fully, though, because my subscribers are in a few different time zones and I have largely sent out my emails in the early mornings, both of which could play a role in these findings.
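To gauge how much of a confound time zones really are, here is a quick look at the TIMEZONE column kept in the cleaned subscriber data; if most contacts share one zone, the hourly chart above is more trustworthy than I feared.
# Count subscribers per reported timezone (kept earlier in the shuffled dataframe).
timezone_counts = shuffled["TIMEZONE"].value_counts()
print(timezone_counts.head(10))
# Share of the list in the single most common timezone.
print("Top timezone share: {:.1%}".format(timezone_counts.iloc[0] / timezone_counts.sum()))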
Finally, I am curious about the Subject metric. Using the WordCloud import seems like an interesting way to explore it more. I am not sure how helpful the data will actually be for future action, though.
subjects = ''
for s in metrics["Subject"]:
    # Add a space so words from different subject lines do not run together.
    subjects += s + ' '
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'white',
                      stopwords = stopwords,
                      min_font_size = 10).generate(subjects)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
As expected, this final exploration was more out of curiosity! It is definitely interesting to see The Root's subjects in this form.
Email marketing and newsletters are popular and widely recognized as an important part of an organization's engagement with its audience. Although a great deal of data exists around the topic, I have realized that without these skills, it would be nearly impossible to have a clear view of how my newsletter, The Root, has grown over time.
In all, this tutorial has helped me gain a better understanding of newsletter data and email marketing efforts. Through several data analysis techniques, I am able to look carefully into different metrics such as time, bounces, and more. Additionally, I am now able to gain insight into who is subscribed to The Root, where they are from, and how engaged they are.
I look forward to continuing to explore the data I gather through my newsletter. Hopefully, knowing how to analyze and dig into data beyond what Mailchimp offers will help me make more concrete, data-driven decisions. Thanks for reading!