My comments on Bug 1422892 started to get long, so I started untangling my thoughts here.
From the bug:
We experimented with using activity_date instead of submission_date when developing the clients_daily etl job.
We should summarize our findings and decide on which of these measures we'd like to standardize against in the future.
Summary of the problem
activity_date is generally preferable to submission_date
because it's closer to what we actually want to measure.
There's a delay between user activity and us receiving the data.
:chutten has some analysis
on the empirical difference between submission and activity dates,
if you want to read more.
95% of pings are received within two days of the actual activity,
but that means using submission_date "smears" data between today and yesterday (mostly).
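To make the "smearing" concrete, here is a minimal sketch using made-up pings. The tuple layout, the client IDs, and the `count_by` helper are all assumptions for illustration, not the real ping schema or any existing ETL code.

```python
from collections import Counter
from datetime import date

# Hypothetical toy pings: (client_id, activity_date, submission_date).
pings = [
    ("a", date(2018, 1, 1), date(2018, 1, 1)),
    ("b", date(2018, 1, 1), date(2018, 1, 1)),
    ("c", date(2018, 1, 1), date(2018, 1, 2)),  # received a day late
    ("d", date(2018, 1, 2), date(2018, 1, 2)),
    ("e", date(2018, 1, 2), date(2018, 1, 3)),  # received a day late
]


def count_by(pings, index):
    """Count pings per day, keyed by the date stored at tuple position `index`."""
    return Counter(ping[index] for ping in pings)


by_activity = count_by(pings, 1)    # when the activity happened
by_submission = count_by(pings, 2)  # when we received the ping

for day in sorted(set(by_activity) | set(by_submission)):
    print(day, "activity:", by_activity[day], "submission:", by_submission[day])
```

In this toy data, January 1 has three active clients, but only two of those pings are counted on that day when keyed by submission_date. The third shows up on January 2.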
submission_date is much easier to work with computationally.
When we partition by submission_date,
most jobs only need to process one day of data at a time.
This makes it much easier to continuously update datasets and backfill missing data.
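A rough sketch of why this matters, again on made-up data. The dictionary fields and the function names below are hypothetical and stand in for whatever the real jobs do. The point is only that a new submission_date partition is self-contained, while the same partition touches several activity_date values.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy pings; field names are assumptions for illustration.
pings = [
    {"client_id": "a", "activity_date": date(2018, 1, 1), "submission_date": date(2018, 1, 1)},
    {"client_id": "b", "activity_date": date(2018, 1, 1), "submission_date": date(2018, 1, 2)},
    {"client_id": "c", "activity_date": date(2018, 1, 2), "submission_date": date(2018, 1, 2)},
]


def partition_by_submission_date(pings):
    """Group pings into one bucket per submission_date."""
    partitions = defaultdict(list)
    for ping in pings:
        partitions[ping["submission_date"]].append(ping)
    return partitions


def process_day(partitions, day):
    """Count distinct clients using only the partition for `day`."""
    return len({ping["client_id"] for ping in partitions.get(day, [])})


def activity_dates_touched(partitions, day):
    """List the activity_date values found in the partition received on `day`."""
    return sorted({ping["activity_date"] for ping in partitions.get(day, [])})


partitions = partition_by_submission_date(pings)
new_day = date(2018, 1, 2)

# Keyed by submission_date: the daily job reads exactly one partition.
clients_on_new_day = process_day(partitions, new_day)

# Keyed by activity_date: the same new data changes results for older days too,
# so previously written output has to be regenerated.
touched = activity_dates_touched(partitions, new_day)
```

Here the January 2 partition contains activity from both January 1 and January 2, so a dataset keyed by activity_date would need its January 1 output rewritten after the fact.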
clients_daily is currently limited to 6 months of historical data
because the entire dataset needs to be regenerated every day.
This is inconvenient and causes real limitations when using the dataset.
The job takes between 90 and 120 minutes to run and currently finishes near 9:00 UTC.
Adding more data to this job will push that completion time back,
meaning the data will be unavailable for the first few working hours every day.
I see three possible options:
- Standardize to submission_date
- Standardize to activity_date and try to mitigate the performance losses
- Allow both, but provide guidance for when to use each configuration
So far, the data engineering team has strongly recommended using submission_date.
The difference between activity_date and submission_date
has become even smaller with our team's work on ping sender.
Without a strong counterargument, I recommend continuing with submission_date.
If we do have a strong reason to continue keying datasets by activity_date,
I recommend only using activity_date on "small" datasets.
These are datasets built over a sample of our data,
built over a rarer type of ping (e.g. not main pings),
or heavily aggregated (e.g. to country-day).
Someone should document when activity_date is [un]necessary,
for inclusion in docs.tmo.