In July 2019, I started The Root, a weekly financial wellness newsletter that aims to foster conversations about money. Until now, I have only used Mailchimp, an automated marketing platform, to measure The Root's growth, success, and impact. In just six months, The Root has grown to over 900 subscribers, with an average open rate of 40% and an average click rate of 5%; the industry standard is an open rate of less than 18% and a click rate of 2%. A large part of my time this semester has been dedicated to The Root's success. Because of this, I want to dedicate this tutorial to optimizing, understanding, and visualizing the data gathered by Mailchimp, beyond the standard graphs it provides on its dashboard. Through this tutorial, I hope to gain a better understanding of email metrics and various techniques in data exploration, as well as provide a guide for others looking to gain insight into their email marketing efforts.
The data for this tutorial was downloaded using Mailchimp's export feature, which provides a copy of account data as a ZIP file downloadable from my account. The ZIP file contains a significant amount of data organized into folders, with the individual files in CSV format. I will be making use of the campaign report (campaigns.csv) and the audience lists (subscribed.csv and unsubscribed.csv).
Before diving into analyzing the data and running code, there are several steps I need to take. To preprocess the data, I organized the Mailchimp export and kept only what I needed: the campaign report (campaigns.csv) and the List data. More importantly, I quickly recognized that the List data contained sensitive information such as full names, emails, geolocation data, and more, so I decided to prioritize anonymizing it. To do this, I used numpy's random.permutation function and pandas reindexing. After that, I followed relatively standard tidying methods to get clean, easy-to-use data: concatenating dataframes, dropping unnecessary columns, converting columns to the proper data types, and more.
For this tutorial, a majority of the exploration involves gathering insights that summarize the many data points provided by Mailchimp. For subscriber data, this includes location, timezone, and opt-in date. For campaign data, this includes opens, clicks, unsubscribes, subject lines, send dates, and more.
I believe that time will be one of the most important factors in a successful email.
Although time is important, it became difficult to identify a single factor that leads to a successful email. Additionally, as I worked through the tutorial, I realized that the most helpful insights would lie in comparing subscriber growth to opens, clicks, and other metrics. This matters because consistent engagement is the single most important measure of a healthy email list. Overall, this analysis provides me with detailed insights into The Root's data and expectations for the future. Specifically, it is helping me make predictions for future growth using the concrete data from the past six months. Accurately predicting growth is useful for many reasons: I am currently presenting The Root at business pitch competitions, to potential investors, and more. I look forward to consistently digging into this data and continuing to explore. In the future, I plan to access the data through Mailchimp's API rather than static CSV files.
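As a preview of the kind of projection I have in mind, here is a minimal sketch of that idea: it fits a straight line to monthly sign-up counts and extends it a few months forward. It assumes the cleaned subscriber dataframe (called shuffled, built later in this tutorial) with its OPTIN_TIME column, so it is meant to be run after the cleaning steps below, and the straight-line fit is a deliberately simple stand-in rather than a real forecasting model.
import numpy as np
import pandas as pd
# Monthly sign-up counts from the cleaned subscriber data built later in this tutorial.
signups = shuffled.set_index("OPTIN_TIME").resample("M").size()
# Fit a straight line: x = month number, y = sign-ups in that month.
x = np.arange(len(signups))
slope, intercept = np.polyfit(x, signups.values, 1)
# Project the next three months, assuming growth stays roughly linear.
future_x = np.arange(len(signups), len(signups) + 3)
projection = slope * future_x + intercept
print(pd.Series(projection.round(), name = "projected monthly sign-ups"))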
For this project, I will be using Python and several imported libraries such as pandas, numpy, matplotlib, scikit-learn, seaborn, wordcloud, and more.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
# Widen the notebook and relax display limits so dataframes are easier to read.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
pd.options.display.max_rows = 500
pd.options.display.max_columns = 500
pd.options.display.width = 1000
First, I am loading in the subscriber data: both current subscribers and contacts who have unsubscribed.
sub_df = pd.read_csv("./newsletteraudience/subscribed.csv")
unsub_df = pd.read_csv("./newsletteraudience/unsubscribed.csv")
print(sub_df.shape)
print(unsub_df.shape)
Next, I am merging the subscribed and unsubscribed lists into a single contacts dataframe so we can get a better view of what we are working with.
contacts = pd.concat([sub_df, unsub_df], sort = False, ignore_index = True)
contacts.shape
contacts.dtypes
contacts.isnull().sum()
As you can see, there are a lot of missing values. We are only going to keep the columns that will aid us in understanding our data better.
The dataset loaded in below contains campaign data and is being stored in a pandas dataframe using read_csv().
df1 = pd.read_csv("./campaigns.csv")
In the original dataset, called "df1", there are 26 columns. Below is a detailed look into what is provided.
df1.shape
df1.dtypes
The "contacts" dataframe contains sensitive information such as full names, emails, and other identifying information. For these subscribers' safety, I will shuffle and anonymize the dataset. I am doing this below.
contacts = contacts[["Email?","MEMBER_RATING", "REGION", "TIMEZONE", "OPTIN_TIME"]]
shuffled = contacts.reindex(np.random.permutation(contacts.index))
shuffled.reset_index(inplace = True)
In order to identify subscribers without their sensitive information, I am assigning an email_id to each record.
shuffled['email_id'] = shuffled.index
shuffled.drop(columns = ['Email?', 'index'], inplace = True)
Now, I am converting OPTIN_TIME to its proper type using the pandas datetime functionality.
shuffled["OPTIN_TIME"] = pd.to_datetime(shuffled["OPTIN_TIME"])
shuffled.dtypes
Out of curiosity, let's see how many records have a NaN value.
shuffled.isnull().sum()
This number of NaNs will be something to look into for future data collection. It is interesting to see what is missing.
shuffled.dropna(inplace = True)
shuffled.set_index("email_id", inplace = True)
Above, I am taking the final steps to achieve a clean data set. Although not ideal, I am dropping the NaN values to make the data easier to work with. Lastly, I am setting the index to email_id.
shuffled.to_csv("cleaned.csv")
To finish, I am saving the cleaned data into a CSV file for future use and exploration.
Now, I am tidying up the campaign data to ensure I have the information I need. Because the CSV file includes so many columns, I need to choose the ones I would like to keep and then concatenate them into one dataframe, which I call "metrics."
df_interest = df1.iloc[:, 0:21]
metrics1 = df_interest.iloc[:, 1:5]
metrics2 = df_interest.iloc[:, 5:9]
metrics3 = df_interest.iloc[:, 12:18]
metrics = pd.concat([metrics1, metrics2, metrics3], axis = 1)
metrics.drop(['List', 'Send Weekday', 'Open Rate', 'Click Rate'], axis = 1, inplace = True)
In order to access the individual campaigns, I am inserting a campaign_id column using the index.
metrics['campaign_id'] = metrics.index
metrics["Subject"] = metrics["Subject"].astype(str, copy = True)
metrics["Send Date"] = pd.to_datetime(metrics["Send Date"])
metrics.set_index("Send Date", inplace = True)
metrics['send_time'] = metrics.index.time
As you can see, I am using the Send Date column as the index. It is being stored as a datetime variable. This will be useful for future analysis.
metrics.head()
metrics.dtypes
metrics.shape
Now, we have only the columns we need.
Let's get started with the analysis. Here is a view of the finalized subscriber dataframe we are going to be working with.
shuffled.head()
According to Mailchimp, regions are an estimate based on IP address data. A contact's IP address is first collected when they sign up, and the location estimate becomes more and more accurate as they engage with the email campaigns.
shuffled["REGION"].value_counts().head(10)
With a closer look at the REGION column, I can see that locations are stored as both uppercase and lowercase. In order to get the proper value counts of locations, I am going to convert all the REGION records to uppercase and take another look.
shuffled['region'] = shuffled['REGION'].str.upper()
shuffled.drop(columns = ['REGION'], inplace = True)
Now that the data has been tidied a bit further, I am getting the counts for each region to get a better overview of my subscriber list.
region_counts = shuffled["region"].value_counts(ascending = False)
region_counts_df = pd.DataFrame(region_counts)
region_counts_df.reset_index()  # display only; region_counts_df itself keeps region names in its index, which the pie chart below uses as labels
region_counts_df.head(10)
To get a clearer picture, I am going to look at the top five regions and visualize them through a pie chart.
region_counts_df.iloc[:5].plot.pie(y = 'region', figsize = (8,8), autopct = '%1.1f%%')
plt.title('Top 5 Regions', color='black')
As you can see, this pie chart shows that 41.7% of The Root's subscribers are in Louisiana. This makes sense: I have been mostly spreading the word to students on Tulane's campus!
First, I am making a copy of the shuffled dataframe, just to keep things organized!
rating_df = shuffled.copy()
Next, I am getting an overall look at how The Root's subscribers are rated and putting the counts in descending order.
rating_table = rating_df["MEMBER_RATING"].value_counts(ascending = False).to_frame()
rating_table
The table above shows an overall view of member rating counts. It helps, but... let's visualize it!
rating_table.plot.pie(y = "MEMBER_RATING", figsize = (8,8), autopct = '%1.1f%%')
plt.title('Member Ratings', color='black')
This graph gives me a good idea of how engaged The Root's subscribers are. It could definitely improve.
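To boil this distribution down to a single number, here is a quick sketch that computes the average member rating and the share of subscribers rated 4 or above; the 4+ cutoff is just my own working definition of "highly engaged," not a Mailchimp one.
# Average member rating across all subscribers (Mailchimp rates contacts from 1 to 5).
print("Average rating:", round(rating_df["MEMBER_RATING"].mean(), 2))
# Share of subscribers at rating 4 or 5 (my own cutoff for highly engaged).
print("Share rated 4+: {:.1%}".format((rating_df["MEMBER_RATING"] >= 4).mean()))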
First, I am looking for an overall view of our Campaign data and the relationship between the variables.
ax = sns.heatmap(metrics.corr()[['Successful Deliveries']].sort_values('Successful Deliveries'),
                 cmap = 'coolwarm', annot = True, xticklabels = True)
ax.set_title("Successful Deliveries and Metrics")
plt.yticks(rotation=0)
This graph shows the relationship between successful newsletter deliveries and the other metrics we are observing. As expected, hard bounces and successful deliveries have a very low correlation.
The Hard Bounces column is important to dive into. Hard bounces can occur for several reasons: the recipient email address doesn't exist, the domain name doesn't exist, or the recipient email server has completely blocked delivery.
Here, I am using seaborn's scatterplot functionality.
sns.scatterplot(metrics["Successful Deliveries"], metrics["Hard Bounces"]).set_title("Successful Deliveries vs Hard Bounces")
sns.barplot(data = metrics, x = "campaign_id", y = "Hard Bounces").set_title("Hard Bounces per Campaign")
These graphs show that there have only been a few instances of hard bounces in the overall history of the newsletter. Mailchimp makes an effort to clean subscriber lists, so if it detects a hard bounce it unsubscribes the contact. This explains the sparseness of these graphs.
Now, I want to look at the Soft Bounces metric. Compared to hard bounces, soft bounces can occur for different reasons, including: the recipient's mailbox is full, the recipient's email server is down or offline, or the email message is too large. Here are the same graphs as above, but using this metric.
sns.scatterplot(metrics["Successful Deliveries"], metrics["Soft Bounces"]).set_title("Successful Deliveries vs Soft Bounces")
sns.barplot(data = metrics, x = "campaign_id", y = "Soft Bounces").set_title("Soft Bounces per Campaign")
These graphs are helpful and show the difference in frequency between soft and hard bounces. Because this difference is my biggest takeaway, I am going to visualize a direct comparison between the two metrics.
To begin, I am copying the metrics dataframe, using only the columns I am interested in. I am calling the new dataframe bouncedf.
bouncedf = metrics[['campaign_id', 'Soft Bounces', 'Hard Bounces']].copy()
The next step involves creating a pivot table from bouncedf. I am doing this to stay organized and look at only the information I am currently investigating.
bounce_table = pd.pivot_table(bouncedf, values = ["Soft Bounces", "Hard Bounces"], index = "campaign_id", aggfunc = np.mean)
bounce_table.head()
Now, I am writing a small plotting function. Its input is a list of the metrics I want to compare, and it selects only those columns from the pivot table before plotting.
def bounce_type(measures):
    fig, ax = plt.subplots(figsize=(10,5))
    # Plot only the requested bounce columns; bounce_table is already indexed by campaign_id.
    bounce_table[measures].plot(kind = 'bar', ax = ax, grid = True)
    ax.set_ylabel('Bounces')
    ax.set_title('Bounces per Campaign')
    ax.set_xlabel("Campaign ID")
    ax.legend(measures)
    return ax
ax = bounce_type(['Hard Bounces', 'Soft Bounces'])
As you can see, soft bounces are much more common than hard bounces, and in many cases the contacts that bounce are cleared from the sending list.
sns.barplot(data = metrics, x = "campaign_id", y = "Total Opens").set_title("Total Opens per Campaign")
This graph is helpful, but may be difficult to understand without context. Out of curiosity, I have been resending campaigns to non-openers, which has not proved very useful. Additionally, The Root is sent out every Monday morning; sometimes I have seen a spike in subscribers on that day and have sent the campaign to those new contacts separately. Because that is a smaller list, the opens on those sends are smaller.
Now, I am using seaborn's lmplot. It is useful because it provides a regression line, which helps with gaining insight into overall expectations for the future.
# lmplot creates its own figure, so a separate plt.figure call is not needed.
sns.lmplot(x = "Total Clicks", y = "Unique Clicks", data = metrics,
           scatter_kws = {"color": "red", "alpha": 0.5, "s": 30})
This graph is interesting because it shows that a subscriber who clicks will usually click around two times throughout the newsletter.
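To put a number on that "around two times" observation, here is a quick check: the ratio of Total Clicks to Unique Clicks for each campaign, averaged across campaigns. Campaigns with zero unique clicks are excluded so the division is safe.
# Clicks per clicking subscriber, campaign by campaign (skip campaigns with no clicks).
clicks_per_clicker = metrics["Total Clicks"] / metrics["Unique Clicks"].replace(0, np.nan)
print("Average clicks per clicker:", round(clicks_per_clicker.mean(), 2))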
Again, I am using seaborn's lmplot functionality to measure consistency between Total Recipients and Unique Opens. I am pretty proud of the results.
# As above, lmplot manages its own figure, so the plt.figure call is dropped.
sns.lmplot(x = 'Total Recipients', y = 'Unique Opens', data = metrics, fit_reg = True,
           scatter_kws = {"color": "red", "alpha": 0.5, "s": 30})
This graph shows the positive correlation between Total Recipients and Unique Opens. This is interesting to see because, as the list has grown, people have continued opening emails. This is the opposite of a dead list, where the number of subscribers is high but actual engagement is close to nothing.
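To quantify that relationship, the quick check below computes the Pearson correlation between Total Recipients and Unique Opens, along with the slope of a straight-line fit; roughly, the slope says how many additional unique opens each extra recipient has historically been worth.
# Pearson correlation between list size and unique opens.
corr = metrics["Total Recipients"].corr(metrics["Unique Opens"])
# Slope of a straight-line fit: extra unique opens per additional recipient.
slope, intercept = np.polyfit(metrics["Total Recipients"], metrics["Unique Opens"], 1)
print("Correlation: {:.2f}, slope: {:.2f}".format(corr, slope))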
In order to see how opens have progressed over time, I am creating a column called month. I am doing this using the (very helpful) pandas DateTimeIndex functionality.
metrics['month'] = metrics.index.month
newdf = metrics[['Total Recipients', 'Unique Opens', 'Unique Clicks', 'month']].copy()
month_table = pd.pivot_table(newdf, values = ["Total Recipients", "Unique Opens", "Unique Clicks"], index = "month", aggfunc = np.mean)
month_table
month_table.plot(kind = 'bar')
This visualization tells me that the last few campaigns have not had as many opens as earlier ones. It will be interesting to figure out why, and what can be done with this data to improve.
I am going to take a look at this data through a line graph.
def by_month(measures):
    fig, ax = plt.subplots(figsize=(10,5))
    # month_table is already indexed by month, so just plot the requested columns.
    month_table[measures].plot(kind = 'line', ax = ax, grid = True)
    ax.set_ylabel('# of Individuals')
    ax.set_title('Metrics by Month')
    ax.set_xlabel("Month")
    ax.legend(measures)
    return ax
ax = by_month(['Unique Opens', 'Unique Clicks', 'Total Recipients'])
I am going to separate out hour from our time index. Through this, I hope to gain some insight into how powerful time is when sending out a newsletter.
metrics['hour'] = metrics.index.hour
ax = sns.heatmap(metrics.corr()[['hour']].sort_values('hour'),
                 cmap = 'coolwarm', annot = True, xticklabels = True)
ax.set_title("Time and metrics")
plt.yticks(rotation=0)
Through the heatmap above, I can see the relationship hour has to the various metrics. Notably, hour and Total Opens have the lowest correlation. I was not expecting this.
def by_hour(measures):
    fig, ax = plt.subplots(figsize=(10,5))
    # Select only the requested columns so the legend matches what is actually plotted.
    metrics.groupby('hour')[measures].mean().plot(kind = 'line', ax = ax, grid = True)
    ax.set_ylabel('# of Individuals')
    ax.set_title('Unique Engagement per Hour')
    ax.set_xlabel("Hour")
    ax.legend(measures)
    return ax
I created the function above to get a closer look into the relationship between hour and various important newsletter metrics such as Unique Opens and Unique Clicks.
ax = by_hour(['Unique Opens', 'Unique Clicks'])
Through this graph, I aimed to learn what times work best for sending out newsletters. It shows that morning sends, closer to 7:30 AM, work best. This graph is difficult to rely on fully, though, because my subscribers are in a few different time zones and I have largely sent out my emails in the early mornings, both of which could play a role in these findings.
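To gauge how much of a confound time zones really are, here is a quick look at the TIMEZONE column kept in the cleaned subscriber data; if most contacts share one zone, the hourly chart above is more trustworthy than I feared.
# Count subscribers per reported timezone (kept earlier in the shuffled dataframe).
timezone_counts = shuffled["TIMEZONE"].value_counts()
print(timezone_counts.head(10))
# Share of the list in the single most common timezone.
print("Top timezone share: {:.1%}".format(timezone_counts.iloc[0] / timezone_counts.sum()))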
Finally, I am curious about the Subject metric. Using the WordCloud import seems like an interesting way to explore it more. I am not sure how helpful the data will actually be for future action, though.
subjects = ''
for s in metrics["Subject"]:
    # Add a space so words from different subject lines do not run together.
    subjects += s + ' '
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'white',
                      stopwords = stopwords,
                      min_font_size = 10).generate(subjects)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
As expected, this final exploration was more out of curiosity! It is definitely interesting to see The Root's subjects in this form.
Email marketing and newsletters are popular and widely recognized as an important part of an organization's engagement with its audience. Although a great deal of data exists around the topic, I have realized that without these skills, it would be nearly impossible to have a clear view of how my newsletter, The Root, has grown over time.
In all, this tutorial has helped me gain a better understanding of newsletter data and email marketing efforts. Through several data analysis techniques, I am able to look carefully into different metrics such as time, bounces, and more. Additionally, I am now able to gain insight into who is subscribed to The Root, where they are from, and how engaged they are.
I look forward to continuing to explore the data I gather through my newsletter. Hopefully, knowing how to analyze and dig into data beyond what Mailchimp offers will help me make more concrete, data-driven decisions. Thanks for reading!