2024.05
Background
We are building a website where people make "posts" and see other people's posts. People can "follow" certain accounts, and some kind of algorithm "recommends" posts to accounts.
Problem statement
We want to see how many views our posts are getting, and we want to distinguish between total impressions and reach.
How many unique people saw this post? If people come back to the post again and again, that's valuable information, but it should count as repeat impressions rather than additional reach.
We also want to see a graph of the growth of impressions and reach over time.
How to model the data
In previous design documents, I just presented the data model as a given. This time I want to try something different: I want to walk through the problem-solving process.
Views
How do we know the number of views? Well, every time someone views the post, we increment a counter. So that should be pretty easy.
Now the problem might be one of definition: how do we decide that someone has viewed the post? If the only way to view a post were by opening the page that holds it, this would be easy. But what if the post shows up on their news feed? Does that count as a view? What if we sent the post to the browser, but it was the 10th post in the feed and the user never scrolled down to it? Do we count it when we send it, or do we wait for a signal from the client that the user has scrolled past it?
What if the post is long and the thing that appears in the feed is just a summary? Do we count that as a view, or do we keep separate counts for summary views vs. detail views?
We can create two counters: summary_view and detail_view
What if the post is short, and the summary and the detail are the same thing? Well we should still count the detail_view separately because it does have significance: expanding the post details is not just about viewing its content, it's also about posting a reply or seeing what replies other people have left.
So, the solution is, for each post, have two counts:
summaryViews
detailViews
Simply increment them when there's a viewing event.
Now we need to answer the question: how do we count a view event? I think we can just let the client tell us when the post is brought into view.
This of course leaves the possibility that a client can just spam view requests to artificially inflate the view count on a post, hoping that this would increase its odds of being recommended to more people by the algorithm.
Perhaps we don't care because that kind of spamming would actually decrease the odds of the post being recommended as it decreases the engagement rate. But then this opens the door for the opposite kind of attack: artificially decreasing the engagement rate for a post you don't like by spamming view requests to increase its total view count.
At first we probably don't care, but if we do care, perhaps several months down the road, we can guard against this kind of behavior by adding another check before counting a view: only count a view from the same user if the previous view from that user was more than 10 minutes ago.
How do we know when a user last viewed a post? We could maintain a mapping where the key is the tuple (user_id, post_id) and the value is (view_timestamp). When a request arrives, we first check the mapping: if a record exists for the same (user_id, post_id) pair, we check the timestamp. If the timestamp is not older than 10 minutes, we ignore the request; we don't even update the timestamp. Or perhaps, we could increment a counter, so the value is not just the view_timestamp but also how many view requests arrived in the timeframe since view_timestamp. This allows us to detect spam: if, for example, we get 100 or more view requests within 10 minutes, there's a high probability the user is spamming requests to inflate the view count. Now, if the view timestamp is older than 10 minutes, we overwrite it, reset the counter to 1, and finally increase the view count on the post indicated by post_id.
Having this kind of mechanism in place should also make it very clear how to count reach. That is, views by unique users.
Simply speaking, if we see no entry at all for (user_id, post_id), then this is the first time this user sees this post, and we can increase the reach counter.
We can do the same thing with detail views, but this makes it likely that we will duplicate all that bookkeeping code.
So to avoid that, we add another parameter to the tuple above: (user_id, post_id, mode), where mode is a constant that indicates whether we are handling the summary views or the detail views. We can also use it for other things, like 'comments', 'reposts', etc. The output from the process is two bits of information: should this count as a new view, and should it count as an increase in reach?
Thinking about this in Go, we can define the following structs:
type PostViewMode int

const (
    SummaryView PostViewMode = 0
    DetailView  PostViewMode = 1
    MaxViewMode PostViewMode = 2
)

// The key identifying one user's views of one post in one mode.
type PostViewParams struct {
    UserId int
    PostId int
    Mode   PostViewMode
}

// The two bits of information described above: does the event count as a
// view, and does it count as an increase in reach?
type PostViewVerifyResult struct {
    IsValid  bool
    IsUnique bool
}

func VerifyPostView(params PostViewParams) PostViewVerifyResult

// Per (user, post, mode) tracking of the last counted view.
type UserPostViews struct {
    Timestamp time.Time
    Counter   int
}

// Per-post counters, indexed by view mode.
type PostAnalytics struct {
    Totals  [MaxViewMode]int
    Uniques [MaxViewMode]int
}

// "bucket" is shorthand for a key/value store: bucket Name(KeyType, ValueType).
bucket UserPostViewsFilter(PostViewParams, UserPostViews)
bucket PostAnalyticsBucket(postId int, PostAnalytics)
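VerifyPostView is only declared above. Here is a minimal sketch of what it could do, implementing the 10-minute rule described earlier. It assumes the declarations above, an `import "time"`, and a plain in-memory map standing in for the UserPostViewsFilter bucket:

// Illustrative threshold; the exact value is a product decision.
const repeatWindow = 10 * time.Minute

// In-memory stand-in for the UserPostViewsFilter bucket, for the sketch only.
var userPostViews = map[PostViewParams]UserPostViews{}

func VerifyPostView(params PostViewParams) PostViewVerifyResult {
    now := time.Now()
    entry, found := userPostViews[params]
    if found && now.Sub(entry.Timestamp) < repeatWindow {
        // A counted view already happened recently: remember the repeat for
        // spam detection, but don't count it as a view.
        entry.Counter++
        userPostViews[params] = entry
        return PostViewVerifyResult{IsValid: false, IsUnique: false}
    }
    // Either there was no entry, or the last counted view is old enough:
    // overwrite the entry and count the view. It only increases reach if
    // this is the very first view by this user.
    userPostViews[params] = UserPostViews{Timestamp: now, Counter: 1}
    return PostViewVerifyResult{IsValid: true, IsUnique: !found}
}

A real implementation would read and write the storage bucket instead of a process-local map, but the decision logic stays the same.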
You may have noticed the PostAnalytics struct uses an array instead of named fields like `TotalViews, UniqueViews, TotalDetailViews, UniqueDetailViews, etc.`.
This streamlines the process of verifying and updating the data.
Simply put, given a mode, we update the counts this way:
result := VerifyPostView(params)
if result.IsValid {
    analytics.Totals[params.Mode] += 1
    if result.IsUnique {
        analytics.Uniques[params.Mode] += 1
    }
}
This is generally a good way to keep the number of code paths small, especially when they'd be mostly identical.
Imagine if we had to do this instead, with named fields, and no "mode" parameter to VerifyPostView.
// code path for handling summary views
result := VerifyPostSummaryView(userId, postId)
if result.IsValid {
    summaryAnalytics.Totals += 1
    if result.IsUnique {
        summaryAnalytics.Uniques += 1
    }
}

// different (but identical) code path for handling detail views
result := VerifyPostDetailView(userId, postId)
if result.IsValid {
    detailAnalytics.Totals += 1
    if result.IsUnique {
        detailAnalytics.Uniques += 1
    }
}
// as we add more modes, we duplicate the code more and more
Visualizing
To draw a chart, what do we need? We need data points on a scale that makes sense for the chart's dimensions.
If we're drawing data from the last 7 days and the chart has a reasonable width, it might make sense to group the data by the hour: every dot on the chart represents the value of the views counter at that point in time, one dot for the views at 2024-05-26 14:00, the next dot for the value at 2024-05-26 15:00, and so on.
If we are viewing the chart over the range of a year, the granularity of an hour doesn't really make sense. A day would make more sense: what was the views counter set to at 2024-05-26 00:00?
If we imagine viewing the chart over a range of 10 years, the granularity of a single week might be enough.
So let's go with three levels of granularity: hour, day, week.
That should satisfy most conditions reasonably well.
What if you want to see analytics in finer detail? We could keep 10-minute granularity view counts for only the last 72 hours, for example; keeping them any longer would just waste too much storage.
The expected usage pattern is that the user wants to visualize the exposure of a post he made over time. In the same chart, we want to show the impressions, the reach, and the engagement numbers over a period of time. The granularity will be decided by the time range. I'm not worried too much right now about who will decide the granularity or how.
The query parameters will probably include the post id, the time range, the metric (the view mode from above), and the granularity. The output is a list of pairs, each pair being a timestamp and the view count for that timestamp.
The timestamp in the output can be chosen to represent either the beginning or the end of the "time window" specified by the granularity. For example, if the granularity is a day, the timestamp is "2024-05-26", and the views value is 550, does that mean the post had 550 views at the start of that day or at the end of it?
It doesn't matter which convention we pick, as long as we remain consistent and stick to it throughout the entire codebase.
So what should we pick? I think we should pick the choice that simplifies the process of storing and querying.
How do we "store" the data? Everytime we register a new view event, we have to know to which time bucket it belongs to, for each granularity we care about, and update the count for that granularity.
It's easier to define the granularity if we can "truncate" the timestamp to find the appropriate time value.
For example, let's say the time is 2024-05-26 14:12
Hour granularity: 2024-05-26 14:00
Day granularity: 2024-05-26 00:00
What about the week granularity? I think we can just try the Go standard library's time truncation function and see.
package main

import (
    "fmt"
    "time"
)

func main() {
    t := time.Date(2024, 5, 26, 14, 12, 0, 0, time.UTC)
    fmt.Println(t)
    fmt.Println(t.Truncate(time.Hour))
    fmt.Println(t.Truncate(time.Hour * 24))
    fmt.Println(t.Truncate(time.Hour * 24 * 7))
}
Outputs:
2024-05-26 14:12:00 +0000 UTC
2024-05-26 14:00:00 +0000 UTC
2024-05-26 00:00:00 +0000 UTC
2024-05-20 00:00:00 +0000 UTC
For the week, it truncates to a Monday. That's because Go's Truncate rounds the time down to a multiple of the duration counted from Go's zero time (00:00 on January 1 of year 1), which falls on a Monday, so whole multiples of 7 days always land on Mondays.
We can try the same thing in another language, say, JavaScript.
This kind of truncation works by computing (time - time % duration). It should be pretty easy to replicate in JavaScript using plain Unix timestamps (JS timestamps have millisecond accuracy).
let t = Date.parse("2024-05-26T14:12") // 1716700320000 (no timezone in the string, so it is parsed as local time; that's why the UTC hours below look shifted)
let s = 1000
let m = s * 60
let h = m * 60
let d = h * 24
let w = d * 7
// truncate to hour
new Date(t - t % h).toJSON() // 2024-05-26T05:00:00.000Z
// truncate to day
new Date(t - t % d).toJSON() // 2024-05-26T00:00:00.000Z
// truncate to week
new Date(t - t % w).toJSON() // 2024-05-23T00:00:00.000Z
Notice the week start point is now different: 2024-05-23 is a Thursday, not a Monday. That's because the Unix epoch, 1970-01-01, was a Thursday, while Go truncates relative to its own zero time, which falls on a Monday.
For general consistency it's probably better to stick to unix timestamps.
The answer is clear now: the timestamp represents the start of the time window for this bucket in this granularity.
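To keep the bucketing consistent across languages, we can apply the same (time - time % duration) arithmetic to Unix timestamps in Go as well. A minimal sketch, assuming timestamps in seconds:

package main

import "fmt"

const (
    hour = int64(60 * 60)
    day  = 24 * hour
    week = 7 * day
)

// truncateUnix rounds ts down to the start of the bucket that contains it.
func truncateUnix(ts, bucket int64) int64 {
    return ts - ts%bucket
}

func main() {
    ts := int64(1716732720)             // 2024-05-26 14:12:00 UTC
    fmt.Println(truncateUnix(ts, hour)) // 1716732000 = 2024-05-26 14:00 UTC
    fmt.Println(truncateUnix(ts, day))  // 1716681600 = 2024-05-26 00:00 UTC
    fmt.Println(truncateUnix(ts, week)) // 1716422400 = 2024-05-23 00:00 UTC (a Thursday)
}

Truncating this way naturally yields the start of the window, which is another point in favor of that convention.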
Now, it should be very clear how to update counts:
Every time we encounter a view, after updating the totals, we also update the totals in the current bucket for each time granularity.
Remember the above line where we did:
analytics.Totals[params.Mode] += 1
This means we already know the totals, so all we have to do is store them in the specific time bucket for each granularity.
Note however that this process does not guarantee that we will have entries for all the time buckets.
To give an example, let's say there are the following entries in our database. Assume they are all for the same post, for the same granularity (hour), for the same metric (total impressions).
Timestamp        | Value
-----------------|------
2024-05-26 13:00 | 2454
2024-05-26 14:00 | 2465
2024-05-26 17:00 | 2470
2024-05-26 18:00 | 2493
2024-05-26 19:00 | 2509
2024-05-26 20:00 | 2538
2024-05-26 21:00 | 2552
-----------------|------
Notice there are no entries for hours 15:00 and 16:00. Why does this happen? Because the process we perform is to set the total values in the bucket for the current time window. If no event occurs during the specific time window, there will be no entry.
How do we deal with this?
Well, if the client queries for views from, say, 15:00 to 20:00, we have nothing to show for 15:00, so we have to find the last entry before 15:00 and use its value. In this case, that's the 14:00 entry: since no events arrived during 15:00 or 16:00, its value is still the correct count for those hours. How you get the entry directly before the start of the range depends on your database engine. If it's a low-level storage engine that can iterate over entries in order with a cursor, you just position the cursor at the start of the range and move it back one step.
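Here is a sketch of how that backfilling could look once the matching entries and the carry-in value have been loaded. The types and names are made up for illustration; timestamps are Unix seconds truncated to the start of their bucket:

// Point is one bucket's state: the Unix timestamp of the bucket's start and
// the counter value at that point.
type Point struct {
    Timestamp int64
    Value     int
}

// fillGaps produces one point per bucket between start and end (inclusive),
// carrying the last known value forward over buckets that have no entry.
// entries must be sorted by Timestamp; carryIn is the value of the last entry
// before start (zero if there is none).
func fillGaps(entries []Point, start, end, bucket int64, carryIn int) []Point {
    var out []Point
    value := carryIn
    i := 0
    for ts := start; ts <= end; ts += bucket {
        // Consume every stored entry that falls in or before this bucket.
        for i < len(entries) && entries[i].Timestamp <= ts {
            value = entries[i].Value
            i++
        }
        out = append(out, Point{Timestamp: ts, Value: value})
    }
    return out
}

With the table above and a request for 15:00 to 20:00, carryIn would be 2465 (the 14:00 entry), so the 15:00 and 16:00 points both come out as 2465.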
With that, we can now describe the data model.
Note: all of the above was kind of a thinking process. I'm not necessarily going to use everything as-is. Don't be too surprised if the next section uses slightly different names or even structures than what I described above. I tried to streamline it a little bit more.
Data Model
As always, no actual code has been written yet, so think of this as a starting point to get you started, not as the final destination to arrive to.
type Metric int

const (
    SummaryViews Metric = 0
    DetailViews  Metric = 1
    Engagements  Metric = 2
    MaxMetric    Metric = 3
)

type Granularity int

const (
    AllTime        Granularity = 0
    Weekly         Granularity = 1
    Daily          Granularity = 2
    Hourly         Granularity = 3
    MaxGranularity Granularity = 4
)

// Key for tracking one user's repeated activity on one post, per metric.
type UserPostKey struct {
    UserId int
    PostId int
    Metric Metric
}

type UserRepeatTracking struct {
    Timestamp time.Time
    Count     int
}

// Key for one time bucket of one metric of one post.
type PostMetricsKey struct {
    PostId      int
    Granularity Granularity
    Metric      Metric
    Timestamp   int64 // Unix timestamp of the bucket's start, truncated per Granularity
}

bucket UserViews(UserPostKey, UserRepeatTracking)
bucket PostMetricsTotal(PostMetricsKey, int)
bucket PostMetricsUnique(PostMetricsKey, int)
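The bucket declarations above are storage-agnostic. To keep the process sketches below concrete, assume a set of hypothetical accessors over these buckets, for example behind an interface like this (the bool result reports whether an entry exists):

type AnalyticsStore interface {
    // UserViews bucket.
    GetUserViews(key UserPostKey) (UserRepeatTracking, bool)
    PutUserViews(key UserPostKey, value UserRepeatTracking)

    // PostMetricsTotal bucket.
    GetMetricTotal(key PostMetricsKey) (int, bool)
    PutMetricTotal(key PostMetricsKey, value int)

    // PostMetricsUnique bucket.
    GetMetricUnique(key PostMetricsKey) (int, bool)
    PutMetricUnique(key PostMetricsKey, value int)
}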
Processes
Handling an analytics event
Some user interacted in some way with a post:
View
Expand
Engagement (like, comment, etc)
The input is: user id, post id, metric type.
In response to this event, we want to record some analytics.
Check the value in the `UserViews` bucket that corresponds to the given input. If a value already exists, check the timestamp. If the activity is within a certain threshold (e.g. 10 minutes), increase the counter and don't proceed with any analytics recording. If the timestamp is older than the threshold, or there's no entry, reset the entry with the current timestamp and a counter value of 1. It's important to note whether there was an entry at all: if there was no entry, this counts as a new unique event; otherwise it's a regular event, but not a unique one.
For future consideration: if the value of the counter is way too high (e.g. 1000 events), it might indicate suspicious activity. (For now we don't specify how to handle it.)
Compute the appropriate timestamps for all granularities:
AllTime: The zero time
Weekly: Now, truncated by a 7 day duration
Daily: Now, truncated by a day duration
Hourly: Now, truncated by an hourly duration
If we are to add the event to the total counter:
Fetch the counter from the PostMetricsTotal bucket that corresponds to the AllTime granularity.
Increment it by one
Store the number in the PostMetricsTotal bucket for all the given granularities (using the timestamps we computed)
If we are to consider this event a unique event, do the same thing for the PostMetricsUnique bucket (see the sketch below).
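Putting the steps together, a rough sketch of the handler. It assumes the hypothetical AnalyticsStore interface sketched after the data model, the truncateUnix helper from the visualization section, and an illustrative 10-minute threshold:

const repeatThreshold = 10 * time.Minute

// bucketStart returns the truncated Unix timestamp for the given granularity.
func bucketStart(now time.Time, g Granularity) int64 {
    ts := now.Unix()
    switch g {
    case AllTime:
        return 0 // single all-time bucket, keyed by the zero time
    case Weekly:
        return truncateUnix(ts, 7*24*60*60)
    case Daily:
        return truncateUnix(ts, 24*60*60)
    default: // Hourly
        return truncateUnix(ts, 60*60)
    }
}

// HandleAnalyticsEvent records one interaction (view, expand, engagement).
func HandleAnalyticsEvent(store AnalyticsStore, userId, postId int, metric Metric) {
    now := time.Now()
    key := UserPostKey{UserId: userId, PostId: postId, Metric: metric}

    tracking, found := store.GetUserViews(key)
    if found && now.Sub(tracking.Timestamp) < repeatThreshold {
        // Repeat activity inside the threshold: remember it, but don't count it.
        tracking.Count++
        store.PutUserViews(key, tracking)
        return
    }
    store.PutUserViews(key, UserRepeatTracking{Timestamp: now, Count: 1})
    unique := !found // first ever interaction by this user: counts towards reach

    // Read the all-time counters once, increment them, then write the new
    // values into the bucket of every granularity.
    allTimeKey := PostMetricsKey{PostId: postId, Granularity: AllTime, Metric: metric}
    total, _ := store.GetMetricTotal(allTimeKey)
    total++
    uniqueTotal, _ := store.GetMetricUnique(allTimeKey)
    if unique {
        uniqueTotal++
    }
    for g := AllTime; g < MaxGranularity; g++ {
        k := PostMetricsKey{
            PostId:      postId,
            Granularity: g,
            Metric:      metric,
            Timestamp:   bucketStart(now, g),
        }
        store.PutMetricTotal(k, total)
        if unique {
            store.PutMetricUnique(k, uniqueTotal)
        }
    }
}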
Displaying metrics for a post
Given a specific post, the user wants to see how many total views it got and how many unique impressions it got. He also wants to see the total and unique values for the detail expands and the engagements.
The input is the post id.
Construct a PostMetricsKey query key that has the post id set to the input, while the granularity and timestamp are kept at their zero values.
Loop over the metrics values (0..<MaxMetric), and for each one, copy the key, set the Metric to the value from the loop, and load the value from both the PostMetricsTotal and PostMetricsUnique.
Alternatively, instead of looping explicitly, utilize the appropriate functionality in your database system to load all entries for the given post id at granularity 0 and timestamp 0. In other words, let the database system naturally load all the metrics without you explicitly looping over the metric values.
Collect the result into the appropriate structure for a response. Something like this could work:
type PostMetrics struct {
    Uniques [MaxMetric]int
    Totals  [MaxMetric]int
}
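A small sketch of that lookup, using the explicit loop and the hypothetical AnalyticsStore from before:

// LoadPostMetrics collects the all-time totals and uniques for every metric.
func LoadPostMetrics(store AnalyticsStore, postId int) PostMetrics {
    var out PostMetrics
    for m := Metric(0); m < MaxMetric; m++ {
        key := PostMetricsKey{
            PostId:      postId,
            Granularity: AllTime, // zero value
            Metric:      m,
            // Timestamp left at zero: the all-time bucket
        }
        out.Totals[m], _ = store.GetMetricTotal(key)
        out.Uniques[m], _ = store.GetMetricUnique(key)
    }
    return out
}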
Displaying post metrics in a chart
The goal is to show how the various engagement metrics change (grow) over time.
Input: Post ID, time range (start, end), granularity, the list of metrics.
For each listed metric, construct the appropriate PostMetricsKey using the given post id, granularity, and start time. Iterate over all the entries for this metric that fall within the given time range.
If there's a significant gap between the timestamp of the first entry that matches the query and the requested range's start time, find the one entry immediately before the requested start time and use its value to fill the leading buckets.
Return the data to the client to be displayed using the appropriate UI component (out of scope for this document).
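Finally, a rough sketch of assembling the chart data, reusing Point and fillGaps from the gap discussion above. scanTotals stands for the range query plus the one-step-back cursor move; its real shape depends on the storage engine, and the unique counts would be handled the same way:

// ChartSeries is one metric's line on the chart.
type ChartSeries struct {
    Metric Metric
    Points []Point
}

// bucketWidth maps a granularity to its window size in seconds.
// (AllTime makes no sense for a chart, so it is not handled here.)
func bucketWidth(g Granularity) int64 {
    switch g {
    case Weekly:
        return 7 * 24 * 60 * 60
    case Daily:
        return 24 * 60 * 60
    default:
        return 60 * 60
    }
}

// LoadChart builds one series per requested metric for the given time range.
func LoadChart(postId int, metrics []Metric, g Granularity, start, end int64) []ChartSeries {
    var out []ChartSeries
    for _, m := range metrics {
        // scanTotals is hypothetical: it returns the entries inside the range
        // plus the value of the last entry before it (the carry-in).
        entries, carryIn := scanTotals(postId, m, g, start, end)
        out = append(out, ChartSeries{
            Metric: m,
            Points: fillGaps(entries, start, end, bucketWidth(g), carryIn),
        })
    }
    return out
}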