funneljoin
In my previous blog post, I explored the awesome funneljoin package. Check that post out for an overview of the package and some examples.
In this post, I’m going to investigate business problems that can’t be modeled with gaps, and how we can work around that.
Gaps vs. deadlines
Gap
- if the next event must happen within a window of time, use gaps.
Example:
if a customer clicks on an ad, and then purchases the product 30 days later the company may not want to attribute that purchase to the ad.
Deadline
- if the next event must occur before a specific time, use deadlines.
Examples:
Turbo Tax wants to know the conversion rate between uploading your W-2s and filing your taxes. But they only care if you file your taxes before April 15.
A football coaches need to print off opponent reports before their upcoming game.
How to model deadlines
In the last blog post, I showed how funneljoin
supports gaps.
Let’s figure out how we can model deadlines using funneljoin
.
I’m going to start with the upload/submit/report logs again. Recall that the deadline to print is the same calendar week as the upload.
library(dplyr)
library(tidyr)
library(purrr)
library(funneljoin)
library(lubridate)
logs <- readRDS("funneljoin-logs.RDS")
logs
## # A tibble: 1,759 x 3
## date user event
## <date> <int> <chr>
## 1 2020-01-06 1 upload
## 2 2020-01-06 2 upload
## 3 2020-01-06 3 upload
## 4 2020-01-06 4 upload
## 5 2020-01-06 5 upload
## 6 2020-01-06 6 upload
## 7 2020-01-06 7 upload
## 8 2020-01-06 8 upload
## 9 2020-01-06 9 upload
## 10 2020-01-06 10 upload
## # … with 1,749 more rows
We want to separate the logs based on their deadline,
which is the week of the upload.
We can do that with lubridate::floor_date
and
tidyr::nest
.
nested_events <- logs %>%
mutate(deadline = floor_date(date, "week")) %>%
nest(events = -deadline)
nested_events
## # A tibble: 4 x 2
## deadline events
## <date> <S3: vctrs_list_of>
## 1 2020-01-05 18267 , 18267 , 18267 , 18267 , 18267 , 18267 , 18267 , 18267…
## 2 2020-01-12 18274 , 18274 , 18274 , 18274 , 18274 , 18274 , 18274 , 18274…
## 3 2020-01-19 18281 , 18281 , 18281 , 18281 , 18281 , 18281 , 18281 , 18281…
## 4 2020-01-26 18288 , 18288 , 18288 , 18288 , 18288 , 18288 , 18288 , 18288…
This gives us a set of events for each deadline period.
Now, we want to apply the funnel and summary for the events
in each deadline period.
We’ll use purrr::map
to calculate the conversions for each set of events.
customer_funnel <- function(.data) {
.data %>%
funnel_start(
moment_type = "upload",
moment = "event",
tstamp = "date",
user = "user"
) %>%
funnel_steps(
moment_types = c("submit", "print"),
type = "any-firstafter",
)
}
nested_events %>%
mutate(conversions = map(
events,
~customer_funnel(.) %>%
summarize_conversions(date_print)
)) %>%
unnest(conversions) %>%
select(-events)
## # A tibble: 4 x 4
## deadline nb_users nb_conversions pct_converted
## <date> <int> <int> <dbl>
## 1 2020-01-05 300 132 0.44
## 2 2020-01-12 250 184 0.736
## 3 2020-01-19 174 150 0.862
## 4 2020-01-26 118 118 1
In this case, the conversion percentages are the same whether we use gaps or deadline.
However, this isn’t real data. The simulated data I created doesn’t cover every scenario. With real data, we need to carefully think if we want gaps or deadlines.
What I’d like to see
As much as I love funneljoin
, I think the
nest -> funnel -> summarise -> unnest
workflow is a little clunky.
I think it would be cool if funnel_start
respected a grouped data frame,
when looking for the next step in the funnel.
For example, we can pass a grouped data frame to funnel_start
:
logs %>%
mutate(week = week(date)) %>%
group_by(week) %>%
funnel_start(
moment_type = "upload",
moment = "event",
tstamp = "date",
user = "user"
) %>%
funnel_steps(
moment_types = c("submit", "print"),
type = "any-firstafter"
) %>%
select(date_upload, date_print)
## # A tibble: 5,476 x 3
## # Groups: week_upload.x.x [4]
## week_upload.x.x date_upload date_print
## <dbl> <date> <date>
## 1 1 2020-01-06 2020-01-24
## 2 1 2020-01-06 2020-01-31
## 3 1 2020-01-06 2020-01-24
## 4 1 2020-01-06 2020-01-31
## 5 1 2020-01-06 2020-01-24
## 6 1 2020-01-06 2020-01-31
## 7 1 2020-01-06 2020-01-24
## 8 1 2020-01-06 2020-01-31
## 9 1 2020-01-06 2020-01-24
## 10 1 2020-01-06 2020-01-31
## # … with 5,466 more rows
But funneljoin
finds the firstafter
print date outside of the week
of the upload (the group parameter for the data-frame).
I want to find the firstafter
inside the week of the upload
(the group parameter).
Conclusion
Gaps or deadlines depend on the business we are trying to model.
While gaps are very easy to use with funneljoin
,
we can still analyze conversions
in a deadline-based funnel
with a little extra work.