You can't do data science in a GUI

I came across You can't do data science in a GUI by Hadley Wickham a little while ago. He hits on a lot of the same problems I mentioned in Don't make me code in your text box. Take a look if you have some time. In the first 15m …


Why bootstrap?

Over the next few quarters, I'm going to focus my attention on Mozilla's experimentation platform. One of the first questions we need to answer is how we're going to calculate and report the necessary measures of variance. Any experimentation platform needs to be able to compare metrics between two groups …


SQL Style Guide

I'm happy to announce that we now have a SQL style guide. Check it out!

If you have any suggestions, feel free to file a PR or issue in the docs repository.

Many thanks to all who participated in the St. Mocli conversation and @mreid for the review!


PSA: Don't use approximate counts for trends

I got caught giving some bad advice this week, so I decided to share here as penance. TL;DR: Probabilistic counts are great, but they shouldn't be used everywhere.


Counting stuff is hard. We use probabilistic algorithms pretty frequently at Mozilla. For example, when trying to get user counts, we …


Don't make me code in your text box!

Whenever I start a new data project, my first step is rooting out any false assumptions I have about the data.

The key here is iterating quickly. My workflow looks like this: Code a little, plot the data, what do you see? Ah, outliers. Code a little, plot the data …


The 5 Stages of Experiment Analysis

I've been thinking about experimentation a lot recently. Our team is spending a lot of effort trying to make Firefox experimentation feel easy. But what happens after the experiment's been run? There's not a clear process for taking experimental data and turning it into a decision.

I noted the importance …


Asking Questions

Will posted a great article a couple weeks ago, Giving and Receiving Help at Mozilla. I have been meaning to write a similar article for a while now. His post finally pushed me over the edge.

Be sure to read Will's post first. The rest of this article is an …


Managing Someday-Maybe Projects with a CLI

I have a problem managing projects I'm interested in but don't have time for. For example, the CLI for generating Slack alerts I posted about last year. Not really a priority, but helpful and not that complicated. I sat on that project for about a year before I could finally …


Removing Disqus

I'm removing Disqus from this blog. Disqus allowed readers to post comments on articles. I added it because it was easy to do, but I no longer think it's worth keeping.

If you'd like to share your thoughts, feel free to shoot me an email at harterrt on gmail. I …


CLI for alerts via Slack

I finally got a chance to scratch an itch today.

Problem

When working with bigger ETL jobs, I frequently run into jobs that take hours to run. I usually either step away from the computer or work on something less important while the job runs. I don't have a good …


Experiments are releases

Mission Control was a major 2017 initiative for the Firefox Data team. The goal is to provide release managers with near-real-time release-health metrics within minutes of a release going public. Will has a great write-up here if you want to read more.

The key here is that the data has to be …


Desirable features of experimentation tools

Introduction

At Mozilla, we're quickly climbing up our Data Science Hierarchy of Needs. I think the next big step for our data team is to make experimentation feel natural. There are a few components to this (e.g. training or culture) but improving the tooling is going to be …


Submission Date vs Activity Date

My comments on Bug 1422892 started to get long, so I started untangling my thoughts here.


From the bug:

We experimented with using activity_date instead of submission_date when developing the clients_daily etl job. We should summarize our findings and decide on which of these measures we'd like to standardize against …


OKRs and 4DX

I feel like I'm swimming in acronyms these days.

Earlier this year, my team started using Objectives and Key Results (OKRs) for our planning. It's been a learning process. I had some prior experience with OKRs at Google, but I've never felt like I was fully taking advantage of the …


Evaluating New Tools

At Mozilla, we're still relatively early in our data science journey. As such, we're always evaluating new tools to improve our analysis workflow (jupyter vs. Rmd), or make our infrastructure more usable (our home-rolled ATMO vs. databricks), or scale our knowledge (knowledge-repo vs. gitbook).

Most of these tools look like …


Documentation Style Guide

I just wrote up a style guide for our team's documentation. The documentation is rendered using Gitbook and hosted on GitHub Pages. You can find the PR here, but I figured it's worth sharing here as well.

Style Guide

Articles should be written in Markdown (not AsciiDoc). Markdown is usually …


Beer and Probes

Quick post to clear up some terminology. But first, an analogy to clear up my thinking:

Analogy

Temperature control is a big part of brewing beer. Throughout the brewing process I use a thermometer to measure the temperature of the soon-to-be beer. Because I take several temperature readings throughout the …


Bad Tools are Insidious

This is my first job making data tools that other people use. In the past, I've always been a data scientist - a consumer of these tools. I'm learning a lot.

Last quarter, I learned that bad tools are often hard to spot even when they're damaging productivity. I sum this …


Literature Review: Writing Great Documentation

I'm working on a big overhaul of my team's documentation. I've noticed writing documentation is a difficult thing to get right. I haven't seen any great examples for a data product, either. I don't have much experience in this area, so I decided to review what's already been written about …


Announcing the Cross Sectional Dataset

I'm happy to announce a new telemetry dataset!

The Cross Sectional dataset makes it easy to describe our users by providing summary statistics for each client. Like the Longitudinal table, there's one row for each client_id in a 1% sample of clients. However, the Cross Sectional dataset simplifies your analysis …

© Ryan T. Harter. Built using Pelican. Theme by Giulio Fidente on github.