Catch a wild distribution
Practicing the most basic bit of applied statistics you can do, and laying a foundation for what comes next.
To warm up for this class we will conduct a brief exercise. This exercise serves a few purposes. First, it will help us think about data-generating processes. Second, it will introduce us to some of the bestiary of statistical distributions that have been invented. Third, it will help us practice the whole cycle of applied statistics: question, simulation, data collection, fitting and checking a model, interpretation. Most importantly, it will help us get to know each other a bit!
Group Exercise
- Separate yourselves into groups. Each group should have at least 2 people. Suggestion: pair with people you’ve only just met.
- Each group should pick a different quantity to measure repeatedly. This can be anything that you can measure about 30 times in about 1.5 hours. Your measurements should all be of the SAME thing. Here are some ideas to get you started:
  - how long someone can hop on one foot
  - the amount of change in everyone’s pockets
  - how many birds you can hear in 2 minutes
- Hypothesize. What statistical distribution might describe these observations? Think about the kind of distribution that might represent this phenomenon! Take a moment and write down the name of the distribution, and why it might describe the thing you’re going to measure.
- Simulate. Create a simulation of the data you’re about to collect. Every statistical distribution has four functions in R, and here you can use two of them: the density function and the random number generator. These always start with `d` and `r` respectively, e.g. `dnorm()` and `rnorm()`. Simulate about as many numbers as you hope to collect. There’s a quick example below.
- Collect. After you have an idea of what your data might look like, go and get some! Try to get a good number of measurements (30 or more); this should take about 1 hour, time permitting.
- Visualize and fit. Enter the data into R and visualize them. Do they match your hypothesized distribution? How do you know? If they don’t match, why might that be? For this step you have a couple of choices:
  - compare histograms using `hist()` or `geom_histogram()`
  - `fitdistr()` (from the MASS package)
  - `optim()`
  - Stan (we’ll do this one tomorrow!)
- Discuss. We’ll go around the room and each group will do a brief show-and-tell of their chosen distribution, their measurement methods, their results, and what they might mean!
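Here is a minimal sketch of the simulate, visualize, and fit steps, using the hopping idea from the list above. Everything specific in it is an assumption made for illustration: the choice of a gamma distribution, the parameter values, and the variable names (`hop_times`, `neg_log_lik`, and so on). Swap in your own hypothesized distribution, and replace the simulated vector with your real measurements once you have them.

```r
library(MASS)  # provides fitdistr(); MASS ships with standard R installations

set.seed(1234)  # make the simulation reproducible

# Hypothesis (assumed for illustration): hopping times in seconds follow a
# gamma distribution. These parameter values are made up.
n_obs      <- 30
true_shape <- 4
true_rate  <- 0.1

# Simulate: the r* function draws random numbers from the distribution.
# Later, replace this vector with the measurements you collected.
hop_times <- rgamma(n_obs, shape = true_shape, rate = true_rate)

# Visualize: histogram on the density scale, with the hypothesized density
# (the d* function) drawn on top for comparison.
hist(hop_times, freq = FALSE, breaks = 10,
     main = "Simulated hopping times", xlab = "Seconds")
curve(dgamma(x, shape = true_shape, rate = true_rate), add = TRUE, lwd = 2)

# Fit, option 1: maximum likelihood with fitdistr()
fit_mass <- fitdistr(hop_times, densfun = "gamma")
print(fit_mass$estimate)

# Fit, option 2: the same idea by hand with optim().
# We minimize the negative log-likelihood. Parameters are optimized on the
# log scale so that shape and rate stay positive.
neg_log_lik <- function(log_par, y) {
  -sum(dgamma(y, shape = exp(log_par[1]), rate = exp(log_par[2]), log = TRUE))
}

# Method-of-moments starting values: shape = mean^2 / var, rate = mean / var
start <- log(c(mean(hop_times)^2 / var(hop_times),
               mean(hop_times) / var(hop_times)))

fit_optim <- optim(par = start, fn = neg_log_lik, y = hop_times)
print(exp(fit_optim$par))  # back-transform to shape and rate
```

The two fits should give very similar estimates of shape and rate, since both maximize the same likelihood. The other two functions in each family start with `p` (cumulative probability, e.g. `pgamma()`) and `q` (quantiles, e.g. `qgamma()`); they are not needed for this exercise.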
Evening exercises
Make sure you have a working computer set-up to fit the kinds of models that we are working on in this class. If you want to follow along, we’ll be using Stan, and specifically the `cmdstanr` interface. You can follow this vignette to install `cmdstan`; to confirm that everything works, run the examples in the section “Running MCMC”.
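As a rough sketch of what that check can look like, the snippet below tests whether `cmdstanr` can find a CmdStan installation and, if so, fits the small Bernoulli example model that is bundled with CmdStan. The helper name `cmdstan_ready()` is hypothetical (it is not part of `cmdstanr`), and the data values are assumed here purely for illustration; the vignette remains the authoritative set of instructions.

```r
# Returns TRUE only if the cmdstanr package is installed AND it can locate
# a CmdStan installation. `cmdstan_ready` is a hypothetical helper name.
cmdstan_ready <- function() {
  if (!requireNamespace("cmdstanr", quietly = TRUE)) {
    return(FALSE)
  }
  # cmdstan_version() signals an error when no CmdStan path is set
  version <- tryCatch(cmdstanr::cmdstan_version(), error = function(e) NULL)
  !is.null(version)
}

if (cmdstan_ready()) {
  library(cmdstanr)

  # Example model shipped with CmdStan
  stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
  mod <- cmdstan_model(stan_file)  # compiles the Stan program

  # Ten binary observations (values assumed for illustration)
  data_list <- list(N = 10, y = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1))

  fit <- mod$sample(data = data_list, seed = 123,
                    chains = 4, parallel_chains = 4)
  print(fit$summary())
} else {
  message("CmdStan not found. After installing the cmdstanr package, try ",
          "cmdstanr::check_cmdstan_toolchain() and cmdstanr::install_cmdstan().")
}
```

If the model compiles and a summary table prints, your set-up is ready for class. If you see the message instead, work through the installation steps in the vignette and run the check again.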