The first 6 steps to homegrowing basic startup analytics

Quick intro to getting set up on analytics
I’ve been asked a few times recently, “Wow, these analytics you write about are great, but how does a startup begin to bite off the relevant parts?” This post is my attempt to answer that question.

First, let me recommend reading a previous post, called “omg I’m just a startup, I can’t do those fancy metrics.” In it, I cover some more general philosophical ideas about how to approach what to measure and what not to measure. It might be worth taking a look before you continue.

Now let’s move on to the first couple topics:

Step 0: Pre-product
Initially, the product development process should likely be focused on big-picture qualitative information, like whether or not your business is addressing the right audience, as well as the preferences of that audience. So don’t measure anything yet :)

Instead, spend your time gathering qualitative data, interviewing users, understanding the problem-behind-the-problem you’re trying to solve, and prototyping concepts.

Do this for a couple weeks!

Step 1: Prototypes
As you create prototypes of your product, you should throw up some free, simple analytics to get you some rough ideas of what’s happening inside the functionality. This likely means something like Google Analytics, although there is a very large universe of equivalent tools out there as well.

Google Analytics can’t really tell you much at this stage – it’s not very actionable. The main things I like to look at are new versus return visitors, top content pages, what pages are causing bounces, etc. Again, at this stage you are still primarily driven by qualitative research and ideas, and it’s hard for analytics to drive much of your thinking.

This prototype phase might last a month or a couple months.

Step 2: Traffic comes in, so data must be collected
As your product begins to mature, and you get a better sense for what you are trying to do with it, the next thing I might do is to figure out what the important pieces of data are, and confirm that they’re being measured. Nothing is worse than throwing data away that you might want to use later.

Generally, I prefer a single table or log that can be queried later that stores events. The right granularity of events is at the “business” event level, like “someone updated their profile” or “someone downloaded a video” rather than at the URL level. This ensures that you are getting a good amount of information from the logs but it’s not so overwhelming that you’re blowing up your database.

You might, for example, hold events in the rough key/value form:

user_id, event_name, value, datetime

Where it might look something like:

1000, profile.photo.update, 1, 9:30AM 3/14/2008

Make sense?

I prefer to start out via SQL so that the manipulations of the data are easy, although many large-scale systems eventually move to flat-files of some format.
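
To make the event log concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table name events, the column name created_at, and the helper track() are my own assumptions for illustration; any SQL database and any naming scheme would work just as well.

```python
import sqlite3
from datetime import datetime


def create_events_table(conn):
    # One row per "business" event, mirroring user_id, event_name, value, datetime.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            user_id    INTEGER NOT NULL,
            event_name TEXT    NOT NULL,
            value      REAL,
            created_at TEXT    NOT NULL
        )
        """
    )


def track(conn, user_id, event_name, value=1, when=None):
    # Hypothetical helper: call this server-side wherever a business event happens.
    when = when or datetime.now()
    conn.execute(
        "INSERT INTO events (user_id, event_name, value, created_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, event_name, value, when.isoformat(sep=" ", timespec="seconds")),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, just for the example
create_events_table(conn)
track(conn, 1000, "profile.photo.update", 1, datetime(2008, 3, 14, 9, 30))
rows = conn.execute(
    "SELECT user_id, event_name, value, created_at FROM events"
).fetchall()
```

Storing the timestamp as sortable text like 2008-03-14 09:30:00 keeps later date-range queries simple.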

Design-wise, here are some things to consider:

What’s your “event” hierarchy and what level of granularity do you want?
Do you want your analytics DB to be the same as your webapp DB?
How should you join data between your webapp stats and your analytics stats?
Where does it make sense to throw data away versus trying to store it forever?
How do you pass data into the analytics DB? Via a JS interface called by the client (like Google Analytics) or server-side within your methods?

There are really no wrong answers to the above – I’ve seen it done in many ways.
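
As one possible answer to the “separate DB” and “join” questions, here is a rough sketch in Python with sqlite3. It assumes a webapp table called users with a signup_source column, and a separate analytics database attached under the name analytics. All of those names are made up for the example, and your setup will likely differ.

```python
import sqlite3

# The main connection stands in for the webapp DB; the analytics DB is attached
# as a second, separate database so the two can still be joined in one query.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS analytics")

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, signup_source TEXT)")
conn.execute(
    "CREATE TABLE analytics.events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)

conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1000, "ads"), (1001, "seo"), (1002, "seo")],
)
conn.executemany(
    "INSERT INTO analytics.events VALUES (?, ?, ?, ?)",
    [
        (1000, "profile.photo.update", 1, "2008-03-14 09:30:00"),
        (1001, "video.download", 1, "2008-03-14 10:00:00"),
        (1002, "video.download", 1, "2008-03-14 11:15:00"),
    ],
)

# Join analytics events back to webapp data on the shared user_id.
events_by_source = conn.execute(
    """
    SELECT u.signup_source, e.event_name, COUNT(*) AS event_count
    FROM analytics.events AS e
    JOIN users AS u ON u.id = e.user_id
    GROUP BY u.signup_source, e.event_name
    ORDER BY u.signup_source, e.event_name
    """
).fetchall()
```

The only thing the two databases need to share is the user_id key, which is why it’s worth logging it on every event.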

Step 3: Identifying your user flows
Every web product ultimately has a bunch of user flows contained within it. For example, there might be a series of flows in how users come into the site, starting with ads, SEO, or otherwise. Similarly, once they get on the site, you might be trying to optimize their usage of the site. Identifying these flows is key since you are trying to find the “critical path” that is then optimized. Figure these flows out, and make sure you’re collecting the right data to optimize.
A good place to learn about these user flows is to read about ecommerce “funnels” and how folks go about breaking those down and optimizing them.
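
If you are logging business events into a single table, a funnel can be sketched in a few lines of Python. The step names below (landing.view, signup.start, signup.complete) are hypothetical, and this version only checks that a user hit every earlier step at some point; it does not check the order or timing of events.

```python
import sqlite3

# Hypothetical funnel steps; substitute the events on your own critical path.
FUNNEL_STEPS = ["landing.view", "signup.start", "signup.complete"]


def funnel_counts(conn, steps):
    """Count users who reached each step AND every step before it."""
    reached = None
    counts = []
    for step in steps:
        users = {
            row[0]
            for row in conn.execute(
                "SELECT DISTINCT user_id FROM events WHERE event_name = ?",
                (step,),
            )
        }
        # Keep only users who also made it through all previous steps.
        reached = users if reached is None else reached & users
        counts.append((step, len(reached)))
    return counts


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)
sample = (
    [(u, "landing.view") for u in (1, 2, 3, 4)]
    + [(u, "signup.start") for u in (1, 2, 3)]
    + [(u, "signup.complete") for u in (1, 2, 5)]  # user 5 skipped the funnel
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, 1, '2008-03-14 09:30:00')", sample
)

counts = funnel_counts(conn, FUNNEL_STEPS)
```

Reading the counts from one step to the next shows where users drop off, which is usually the first place to look for something to optimize.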

Step 4: Trying ad hoc queries
As users are coming into the system, it can then become a good idea to start gathering data into a standard format. This means creating a small set of queries that you might try to run to learn more about the critical paths that users are taking, and where you can adjust their flow. At this point, it’s important to have the vision of the product become fairly stable so that you are starting to optimize the edges rather than reinventing the core constantly.

The kinds of ad hoc queries worth doing revolve around whatever are the tactical goals of your business. If you are trying to come up with a monetization strategy, you should try to figure out your average order size and what percentage of users that start a buying process finish it. Once you create a small list of these queries, then you can start to formalize the ideas into specific metrics that you track daily.
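
For the monetization example, the two ad hoc queries might look something like the sketch below. It assumes events named checkout.start and checkout.complete, with the order amount stored in the value column. Those names and the sample numbers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "checkout.start", 1, "2008-03-14 09:00:00"),
        (2, "checkout.start", 1, "2008-03-14 09:05:00"),
        (3, "checkout.start", 1, "2008-03-14 09:10:00"),
        (4, "checkout.start", 1, "2008-03-14 09:15:00"),
        (1, "checkout.complete", 20.0, "2008-03-14 09:02:00"),
        (2, "checkout.complete", 40.0, "2008-03-14 09:09:00"),
    ],
)

# Average order size: mean of the amounts on completed checkouts.
avg_order_size = conn.execute(
    "SELECT AVG(value) FROM events WHERE event_name = 'checkout.complete'"
).fetchone()[0]

# Of the users who started a checkout, how many finished one?
started, finished = conn.execute(
    """
    SELECT COUNT(DISTINCT s.user_id), COUNT(DISTINCT c.user_id)
    FROM events AS s
    LEFT JOIN events AS c
      ON c.user_id = s.user_id AND c.event_name = 'checkout.complete'
    WHERE s.event_name = 'checkout.start'
    """
).fetchone()
completion_rate = finished / started if started else 0.0
```

Once a query like this proves useful a few times in a row, it’s a good candidate to become a tracked daily metric.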

If any ad hoc queries return data that is similar to what you could get out of Google Analytics (for example, aggregate numbers like pageviews and uniques), it’s probably a dumb idea to try to do those in-house. Don’t do more work than you have to! Instead, the only homegrown stuff should be so specific to your business that it’s easier to do in-house than to shoehorn it into a 3rd party analytics tool. Don’t waste your effort on numbers an off-the-shelf analytics package would get you.

Assuming that your product is stable, most startups will want to tackle this within the first few weeks (but obviously not until you have data).

Step 5: Formal in-house reporting
Once the product features (and thus the user flows) are sufficiently mature to invest in this area, then it makes sense to formalize the reports. Typically I would start out with a series of pretty plain HTML pages using tables that just print out the results of SQL queries. You can add finishing touches like percentages, key ratios, etc. as you go. I generally invest zero time into cute visualizations and graphs, and prefer to read the key numbers.
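
A plain report page really can be this simple. Here is a rough sketch in Python that runs one SQL query and prints the result as an HTML table. The function name render_report and the sample query are my own assumptions, and a real page would be served by whatever web framework you already use.

```python
import sqlite3
from html import escape


def render_report(conn, title, sql):
    """Run a query and return its result as a bare-bones HTML table."""
    cursor = conn.execute(sql)
    headers = [col[0] for col in cursor.description]
    lines = [f"<h1>{escape(title)}</h1>", "<table>"]
    lines.append(
        "<tr>" + "".join(f"<th>{escape(h)}</th>" for h in headers) + "</tr>"
    )
    for row in cursor:
        cells = "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        lines.append("<tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, 1, '2008-03-14 09:30:00')",
    [
        (1000, "profile.photo.update"),
        (1001, "video.download"),
        (1001, "video.download"),
        (1002, "video.download"),
    ],
)

report_html = render_report(
    conn,
    "Events by type",
    """
    SELECT event_name,
           COUNT(*) AS total_events,
           COUNT(DISTINCT user_id) AS unique_users
    FROM events
    GROUP BY event_name
    ORDER BY total_events DESC, event_name
    """,
)
```

Because the headers come straight from the query’s column aliases, adding a new report is mostly a matter of writing a new SQL statement.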

How many reports should you generate? I find that it’s pretty addictive to build reports and get a clear understanding of what’s actually happening in your product. So create enough that you can make key decisions, but don’t go too far either – you’ll hit diminishing returns quickly. Generally, 2-3 reports are good enough to start, but ultimately you’ll probably track dozens of dashboards, each focusing on a specific aspect of your business, like:

System performance and uptime
User acquisition via each method you use
Aggregate metrics
Retention
Engagement
Content creation?
Ads and monetization?
Pricing and revenue?
etc.

Anyway, get enough data but not too much – it’s a fine balance. For timing, it probably only makes sense to do this once the product is quite stable and the key user flows are stable as well. This is likely at least a month or two out from the prototype stage.

Step 6: Too much data! Reports are too slow!
If you’re lucky, eventually your reports will be too slow. At Revenue Science, we were gathering something like 1 billion pixel hits per day, and that had to be translated into reporting. Ouch. So you likely will go through a couple specific steps:

Reports will initially query the production server – eventually this doesn’t work and slows down the site
Reports and data are then moved off to a slave machine, where the queries still happen in real-time – but eventually this doesn’t work either because it’s too slow and there’s too much data
Reports and data are then pre-processed every hour, and then served up – which is fine, until your queries take too long, and you have to keep moving
Data is then replicated across a number of slave machines, where the pre-processing happens
etc.
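
The hourly pre-processing step above can be sketched as a small rollup job in Python. The summary table hourly_event_counts and the function rollup_hourly are names I made up for the example. This version recomputes every hour on each run to keep the sketch short, whereas a real job would likely only process the most recent hours.

```python
import sqlite3


def rollup_hourly(conn):
    """Pre-aggregate raw events into one row per (hour, event_name)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS hourly_event_counts (
            hour        TEXT    NOT NULL,
            event_name  TEXT    NOT NULL,
            event_count INTEGER NOT NULL,
            PRIMARY KEY (hour, event_name)
        )
        """
    )
    # INSERT OR REPLACE makes the job safe to re-run: existing rows are overwritten.
    conn.execute(
        """
        INSERT OR REPLACE INTO hourly_event_counts (hour, event_name, event_count)
        SELECT strftime('%Y-%m-%d %H:00', created_at), event_name, COUNT(*)
        FROM events
        GROUP BY 1, 2
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events "
    "(user_id INTEGER, event_name TEXT, value REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, 1, ?)",
    [
        (1000, "profile.photo.update", "2008-03-14 09:30:00"),
        (1001, "video.download", "2008-03-14 09:05:00"),
        (1002, "video.download", "2008-03-14 09:40:00"),
        (1001, "video.download", "2008-03-14 10:10:00"),
    ],
)

rollup_hourly(conn)
hourly = conn.execute(
    "SELECT hour, event_name, event_count FROM hourly_event_counts "
    "ORDER BY hour, event_name"
).fetchall()
```

Reports then read from the small summary table instead of scanning the raw event log on every page load.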

There are many, many layers of incremental improvements you can make here – but the toughest nut to crack, in the case where your web product is HUGE, is that you will be inserting more data into the system than the system can process within a reasonable time.

Then the more exotic technologies like Hadoop, HBase, Hypertable, etc. start to make a difference. Most sites don’t have to deal with this so I’ll stop here!

Conclusion
Eventually, most serious analytics-driven businesses have to build their own internal analytics. It’s not pretty, but it has to be done. Hopefully the above article gives some background on the key issues you might want to look at as you scale up your product.

Published by

Andrew Chen