A few months ago I saw this goofy looking kid getting his articles ranked high on two of my favorite marketing sites Inbound.org and Growthhackers. His name was Nate Desmond and what he was writing about was extremely impressive. Funny enough on his site was a link to hire him, I did part-time.
Since then he’s been impressively helping the AppSumo team with our latest product, SumoMe.com. He showed me a few weeks ago how he can simply scrape any website to get insanely actionable data. I forced him to teach it to you below, enjoy!
|Note: This is a guest post by Nate Desmond. Be sure to read to the end for a sweet bonus!|
Ever notice how some of your blog posts garner massive attention while others flop like a failed startup?
The funny thing is, if you just make a few small changes, you can dramatically increase social sharing for your posts. With a few minor changes you can easily grow faster and monetize better.
In analyzing OkDork, for instance, I found that using statistics in the headline increased engagement by more than 10x!
While some of these best practices hold consistently across different websites, others (like the best day of the week to publish) vary from blog to blog.
Thus, instead of relying on general statistics, I wanted to discover what actually worked best for my blog (hint: it didn’t always match the general statistics). Applying my background in accounting, I figured out how to scrape my blog and analyze the results. Then I documented the process to make it easy for you to do.
Analyzing your blog data is a three step process:
First, scrape your basic website information (titles, authors, categories, dates, etc).
Second, add social statistics for each post (Twitter, Facebook, Reddit, etc).
Finally, use Excel to uncover interesting correlations that will help you grow your blog faster.
Everything’s easier to learn with an example, so I will share how I ran this blog analysis on OkDork. For extra fun, I’ll also share statistics from a couple other major blogs at the end.
Here’s the visual guide to doing it yourself:
1. Scrape Your Blog’s Post Information
Alright, so let’s get into scraping.
Web scraping sounds like some super technical process only computer hackers in dark basements can do, but it’s actually pretty easy.
After installing Import.io, I navigate to OkDork’s blog homepage:
Within Import.io, navigate to your blog homepage
Like most blogs, Noah conveniently uses post archive pages. Starting with the blog homepage and paginating on to the very first post ever published, these archive pages make it easy for us to quickly harvest his blog data. <Insert Dr. Evil laugh here>
Completing the basic settings
Navigating to the correct URL within Import.io, you just need to create a few settings. Let Import.io run it’s optimization, select crawler mode, and choose the multiple rows option.
Training rows to create our API
Using the visual highlighter, select each post block (or “row”) and click “train rows”. As you select rows, Import.io will look at the code behind each element and create an API to automatically recognize additional rows. Normally the API will start working after you select 2-4 rows.
Training columns to create our API
Once the rows are trained, you need to add the “columns” (information like the title, date, and url). To pull the URL, select the title and choose to pull “html” instead of “text”. We’ll then parse this html within Excel later in the process to pull the URL.
You’ll also notice that the API didn’t work automatically for all the elements. For the “date” column, I had to train the API by selecting a couple of dates before it started pulling correctly.
Training additional pages to perfect the API
After training the first page, you’ll need to add at least five more to make sure the API is configured correctly. In analyzing OkDork, I actually had to add 10 pages before it was working smoothly.
Using Excel to create a list of archive page URLs
Import.io also analyzes the URL structure of the pages you’ve added, so you can let it automatically crawl the website to find other archive pages. I’ve had a few problems with this feature though, so I prefer to generate the archive URLs myself.
Here’s the Excel formula I used for OkDork:
This combines the website URL with the page number in column A to quickly create a full list of 105 pages (the number currently on OkDork).
Running the crawler on OkDork
Pasting in the URLs generated in Excel and setting the page depth to “0” (so it won’t try to look for other pages), simply hit the big “Go” button. Sit back, eat some popcorn and enjoy the oncoming wave of data
2. Use Excel To Parse The URL
With the export downloaded in CSV format from Import.io, open the file in Excel. You will notice that the “URL” column actually contains the raw html links. We need to isolate the actual URL.
Using the “Text to Columns” function to isolate the post URL
To do this, simply move the URL column to the right most column and run a “Text to Columns” function. Choose “delimited” based on the quotation mark. From the resulting columns, find the column now containing the post’s URL and delete the extras.
Save this file as an Excel workbook.
3. Pull Social Stats for Each Post
We now have a beautiful list of urls, titles, comments, and dates. Before we can start analyzing though, we need to add the social stats for each post. We could manually look up each post on dozens of social networks, but wouldn’t it be nice to save a week of extra work?
Happily, you can use the OpenRefine program to easily (and freely) pull all the social statistics your heart desires.
Download the OpenRefine application, drag the icon to your Applications folder, and double-click to launch.
Once OpenRefine is launched in either platform, upload your Excel file and create a new project.
Using OpenRefine To Pull Social Stats
With the project open, simply select to add a new column by fetching URLs based on the URL column (as shown above). Set the throttle delay to “1” so the process doesn’t take all day, and past this bit of code into the expression box:
‘http://api.sharedcount.com/?url=’ + value
Thanks to SharedCount’s API, you can pull social stats on up to 100,000 URLs per day for free.
Parsing the number of Facebook shares
Now we simply parse the social stats API return into separate columns for each social network.
Here’s the code for Facebook:
And here’s the code for every other platform (just switch the name):
Once you have split social stats into separate columns, export the results back to Excel for the fun part — actually pulling results from the data.
Congratulations, you now have a fortune of social data on each blog post!
4. Combine Everything in Excel
Now that we have an export of hundreds of blog posts plus the social statistics for each one, we can start analyzing our data to find trends.
Adding formulas for day of the week and social totals
First, you’ll want to add two columns to Excel. Use the weekday formula to determine which day of the week each post was published:
Then use a simple SUM formula to add all the social stats + comments to get an idea of the post’s popularity.
Case Study #1: Key Findings for OkDork
Alright, so now that we’ve got all this data, what did I find that can help OkDork grow faster?
Finding #1: Publish on Sunday, Monday, or Wednesday
Using a PivotTable to find average social score by day of week
The way I found this can seem a bit complicated, but it’s really pretty easy once you’ve done it.
In a new tab, insert a pivot table, and add the “day of week” column in the “Column Labels” box and the “social stats” column in the “Values” box.
Make sure to set the social stats field to “average” instead of the default “count” (see screencast above).
Based on the pivot table we ran above, Saturday seems to be by far the best day for Noah to publish. After all, Saturday posts averaged 962 social interactions vs the next best day at 167.
However, all findings need to be evaluated with the “common sense” test.
When I noticed that Saturday (traditionally a low day) performed 6x better than average, my suspicions were raised. Looking into the details, I found that most of Saturday’s number comes from one hugely popular post. With that outlier removed, the actual Saturday average is closer to 31.
Removing that outlier from the data, we find that Sunday, Monday, and Wednesday are the three best days for Noah to publish.
Finding #2: Headlines starting with “Why” or “How” perform above average
Using “Text-to-columns” and a Pivot Table to find the best headline wording
Using a simple “text-to-columns” function (use delimited by space), I can isolate the first word of each post’s title. Plugging this into our fancy Pivot table, we get a long list of social stats for each headline intro word.
The top three headline start words (based on avg social score) are “Why”, “One”, and “How”. The word “One” is only used in one post, though, so the sample size is too small to be significant.
After removing words that have only been used in one or two posts, we can see that “Why” and “How” reliably correlate to increased social sharing. “What” is the third place winner, but it performs far below the other two.
Finding #3: Headlines with statistics perform above average
Using a couple formulas to check headlines for statistics
I noticed that some of the top posts seemed to include statistics (e.g. “Why I got Fired from Facebook (a $100 Million dollar lesson)” and “Why Content Goes Viral: What Analyzing 100 Million Articles Taught Us.”). So I put together a formula to test my hypothesis.
I’m sure there’s a better way to do this, but this quick and dirty formula just checks each headline for numbers:
With that data in hand, I simply ran another pivot table and found. Turns out, my hunch was right: Headlines with numbers perform approximately 10x better than headlines without.Click to Tweet
Case Study #2: Key Findings for Problogger
When I first started blogging 5+ years ago, Problogger was one of my favorite blogs to learn from. They’re still going strong today, so I thought it would be fun to run some numbers on their blog.
Key Finding #1: Monday is the best day to publish
Using a Pivot Table to find the best day for Problogger posts
Just like we did earlier for OkDork, I used a quick Pivot Table to find the best day of week for Problogger posts. Looks like Monday is a big day for Problogger (just like OkDork), but Problogger’s second best days are actually Thursday and Friday.
This is precisely why you should run these numbers on your blog. While OkDork and Problogger share some trends (Monday is good for both), they also are very different (Sun and Wed vs Thurs and Fri).
Key Finding #2: Guest posters get more social engagement
Using a Pivot Table to find the most popular authors
Since I was easily able to scrap the author names from Problogger, I wanted to see which authors wrote the most popular posts. Just running a Pivot Table and sorting by average score, we can quickly see these results.
Not surprisingly, Darren Rowse (the blog’s primary author) gets a very high social score. I was surprised to see, though, that guest authors actually score consistently higher. Probably because they bring some of their own audience.
It might be worth checking on your blog to see if guest authors similarly help bump social engagement.
Key Finding #3: Problogger readers are most active on Twitter (then Facebook)
Using a basic SUM formula to find the most popular sharing platforms
In marketing, you always want to discover where your customers are and meet them there. Thus, I’m interested to see where Problogger’s audience generally gathers on the internet.
Based on social sharing, it looks like Problogger readers are heavy Twitter users. Facebook also ranks quite highly, and LinkedIn makes a good showing.
Interestingly also, Problogger has a high volume of comments – more people comment than share on LinkedIn.
Case Study #3: GrowthHackers.com
So now that we’ve seen how data analysis works on blogs, lets try a different type of website. GrowthHackers.com is a community for growth professionals to share interesting articles and tips (highly recommended).
Key Finding #1: BrianBalfour.com drives the highest number of average votes
Using SUMIF and COUNTIF formulas to find blogs with the most votes
Say you wanted to write a few guest posts in the startup marketing area. You’d want to know which blogs drove the most engagement on GrowthHackers.com.
One key metric is the average number of votes that blog’s posts drive. Drawing from the raw list of posts, votes, and discussions I scraped from the website, I used a couple formulas to uncover the average votes per blog:
To find the total posts per blog:
To find the average votes per blog post:
After removing duplicates (so each blog is only listed once), I also filtered out blogs with fewer than 6 posts (to remove statistical anomalies).
When looking at blogs with 5 or more posts shared on GrowthHackers.com, it looks like Brian Balfour’s blog leads the pack.
Key Finding #2: KISSmetric’s blog has the largest total impact
Using SUMIF and SUM formulas to find the blogs with the largest total footprint
So it’s fun to see which blogs drive the most engagement per post, but what blogs have the largest total impact on the GrowthHackers.com community?
To find this, I used two twin SUMIF formulas to find the total votes and discussions for each website:
Using a simple SUM formula to add the results, I sorted the entire list by total footprint (the sum of votes + comments).
Removing Medium from the rankings (since it’s a platform, not a blog), I discovered that the KISSmetrics blog has made the largest total impact on the GrowthHackers.com community with Moz.com coming in at a close second.
Now that you’ve seen the power of data analysis, what will you find on your own blog?
Who is Nate?
An enthusiastic growth practitioner, you can find more of my marketing writing at NateDesmond.com.
When not playing with exciting data and studying behavioral psychology, you’ll probably find me practicing parkour (it’s like gymnastics) or reading a good book (currently loving Edmund Morris’ The Rise of Theodore Roosevelt).
Incoming search terms:
- best technique to scrape millions of pages (1)
- import io sitemap (1)
- okdork import io (1)
- www scrapee doul frames com/ (1)