CourseKata Learning Objectives

Statistics and Data Science: A Modeling Approach

CourseKata Goals and Learning Objectives

High-level Performance Objective

At the end of the course students will be able to:

Generate research questions that could be answered with data, whether collected by students or provided to them, and engage in the cycle of data analysis: exploring variation, modeling variation, evaluating their models, and communicating the results of what they have learned from data analysis.

To achieve this high-level objective, students will master a wide range of specific learning objectives, as shown in the table below.

How to Read the Learning Goals and Objectives Table

What students will learn in CourseKata is defined by a hierarchical system of high-level goals, sub-goals, and learning objectives. High-level goals are shown in bold and numbered 1 through 4. Each high-level goal is organized into two levels of sub-goals, shown in bold and numbered 1.1, 1.2, etc., or 1.1.1, 1.1.2, etc.

The meaning of each sub-goal is then made clear by specific learning objectives, which are identified by codes combining letters and numbers: CM1, CM2, etc. The letters represent important words in the sub-goal: for example, CM = Concepts of Measurement.

Many objectives require effective use of R code to achieve them. This is indicated by (R) at the end of the objective. Numbers in parentheses, e.g., (2, 4), show the textbook chapters that address the objective.

 

ID

Learning Goals and Objectives

1. UNDERSTAND DATA

1.1. Understand where data come from

CM

1.1.1. Understand and apply concepts of measurement

CM1

Quantify variation in data by assigning numbers or categories to attributes (2)

CM2

Differentiate quantitative and categorical levels of measurement (2)

CM3

Explain the inherent issue of measurement error (2)

CM4

Recognize the limitations of measurement, including measurement bias (2)

SM

1.1.2. Understand sampling methodology

SM1

Explain that samples are the result of a sampling process and a data generating process (DGP) (2, 3)

SM2

Define independent random sampling and explain the consequences of violating it (2)

SM3

Explain and apply the definition of sampling variation and consider it a possible explanation for observed group differences (2, 4)

SM4

Recognize that samples (and even random samples) are not perfectly representative of the entire population (2)

SM5

Recognize that samples are studied in order to find out about the population and the DGP (2)

SM6

Recognize the limitations of sampling, including sampling bias (2)

RD

1.1.3. Understand elements of research design

RD1

Distinguish outcome variables versus explanatory variables (2, 3, 4)

RD2

Differentiate observational and experimental studies and recognize the limitations of each (4)

RD3

Differentiate random sampling and random assignment and explain the advantages of each (4)

1.2. Understand the organization of data

DF

1.2.1. Understand data frames

DF1

Interpret the structure of data: rows (i.e. observations) and columns (i.e. variables) (R) (2)

DF2

Differentiate and interpret variables versus values in a data set (2)

DF3

Differentiate and interpret levels of variables (2)

MD

1.2.2. Manipulate data in a data frame

MD1

Write code to perform basic R commands (R) (1-12)

MD2

Create summary variables (R) (2)

MD3

Aggregate variables across rows to create a new data frame (R) (2)

MD4

Recode quantitative variables into categorical variables, e.g., using ntile() (R) (2)

MD5

Identify and decide how to handle missing data (R) (2)

MD6

Filter, organize, and manipulate data in a data frame and interpret the results (e.g., arrange(), select()) (R) (2)
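The MD objectives above name specific dplyr verbs. A minimal sketch, assuming the dplyr package is installed; the study data frame and its variables are invented for illustration:

```r
library(dplyr)

# A hypothetical data frame: one row per student (observation)
study <- data.frame(
  id    = 1:6,
  hours = c(2, 5, 1, 7, 4, 6),
  score = c(60, 82, 55, 90, NA, 85)
)

# MD5: identify and handle missing data (here, drop rows with NA scores)
complete <- filter(study, !is.na(score))

# MD4: recode a quantitative variable into categories with ntile()
complete <- mutate(complete, hours_group = ntile(hours, 2))

# MD6: select and arrange columns, then inspect the result
arrange(select(complete, hours, score, hours_group), desc(score))
```

The same pipeline could handle missing data differently (e.g., imputation); dropping rows is just the simplest choice to illustrate MD5.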

1.3. Understand the purpose of data

DQ

1.3.1. Imagine data that could answer a question

DQ1

Generate questions that could be answered with a given data set (2)

QD

1.3.2. Generate questions you could ask of data

QD1

Evaluate appropriateness of data to specific questions and purposes (2)

2. EXPLORE VARIATION

2.1. Understand sources of variation

SV

2.1.1. Categorize sources of variation

SV1

Recognize that variation can be divided into explained and unexplained; unexplained variation into real (i.e. a product of the system) or induced; induced variation into measurement error, sampling error, or mistakes (4)

EV

2.1.2. Understand what it means to explain variation

EV1

Explain and apply an intuitive definition of ‘explaining variation’ (4)

EV2

Identify within and between group variations in a graph (4)

EV3

Apply the definition of explaining variation to graphs (e.g. scatterplots, faceted histograms) (4)

CCC

2.1.3. Understand correlation, causation, and confounding

CCC1

Recognize that explaining variation does not always mean the relationship is causal (4)

CCC2

Identify and apply the concepts of confounding variables and the problem of directionality (4, 7)

CCC3

Differentiate experimental from observational designs; understand which can lead to causal conclusions (4, 7, 8, 11)

CCC4

Recognize that randomness can cause an apparent relationship in data (4)

2.2. Describe distributions of data

CD

2.2.1. Understand concept of distribution

CD1

Define a distribution as the pattern of variation in a variable or combination of variables (3)

CD2

Describe a distribution using shape, center, and spread, and reason about the possible causes of a distribution's characteristics (3)

CD3

Reason about measures of central tendency (5)

DGP

2.2.2. Understand that distributions of data are generated by DGP, which is usually unknown

DGP1

Recognize that populations are the long-run result of a host of causal processes we call the DGP (3)

DGP2

Simulate samples of data from a DGP; anticipate what the data may look like (R) (3, 4, 9, 10)

DGP3

Generate hypotheses about the DGP based on distributions of data; recognize limitations (3, 4)

DGP4

Recognize the role of randomness and the law of large numbers in interpreting distributions of data (3)
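A minimal simulation sketch for DGP2 and DGP4, assuming a normal DGP; the mean and standard deviation are invented for illustration:

```r
set.seed(4)

# Simulate samples from an assumed normal DGP (mean 50, SD 10)
small_sample <- rnorm(10,    mean = 50, sd = 10)
large_sample <- rnorm(10000, mean = 50, sd = 10)

# Law of large numbers: small samples vary widely around the DGP mean,
# while large samples land close to it
mean(small_sample)
mean(large_sample)

hist(small_sample)  # a rough, lumpy picture of the DGP
hist(large_sample)  # a much smoother picture of the same DGP
```

Re-running without the seed shows DGP4 directly: each small sample looks different even though the DGP never changes.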

RUD

2.2.3. Represent univariate distributions in tables and graphs

RUD1

Create frequency tables and relative frequency tables (e.g., tally) (R) (3)

RUD2

Create and interpret box plots, histograms, and bar graphs to visualize univariate data (R) (3)

RUD3

Choose the appropriate visualization to represent a given variable (3)

RUD4

Create and interpret a five-number summary (e.g., using favstats()) (R) (3)

RUD5

Find and interpret the interquartile range (IQR) (4)

RUD6

Identify outliers and decide how to handle them (3)
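The RUD objectives can be sketched in base R; the course's tally() and favstats() (from the mosaic package) produce similar summaries. The thumb lengths below are invented:

```r
# Hypothetical univariate data: thumb lengths in mm
thumb <- c(56, 60, 61, 63, 64, 64, 66, 68, 70, 79)

# RUD1: frequency and relative frequency tables
table(thumb > 65)
prop.table(table(thumb > 65))

# RUD2: visualize the univariate distribution
hist(thumb)     # histogram
boxplot(thumb)  # box plot

# RUD4/RUD5: five-number summary and interquartile range
fivenum(thumb)  # min, Q1, median, Q3, max
IQR(thumb)
```

The box plot flags points beyond 1.5 × IQR from the quartiles, one common (but not mandatory) rule for spotting the outliers in RUD6.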

IUD

2.2.4. Interpret tables and graphs of univariate distributions

IUD1

Interpret the axes of a histogram: frequency is represented on the y-axis and the variable on the x-axis (3)

IUD2

Interpret bar graphs and understand why shape, center, and spread are not meaningful characteristics for these visualizations (3)

IUD3

Recognize and explain the differences and similarities between a frequency histogram and a relative frequency histogram (3)

2.3. Describe relationships in data

RBR

2.3.1. Represent bivariate relationships in tables and graphs

RBR1

Create two-way contingency tables (e.g., using the tally() function) with both frequencies and proportions (R) (4)

RBR2

Create scatterplots, faceted histograms, box plots, and jitter plots to visualize bivariate data (R) (4)

RBR3

Choose the appropriate visualization to represent a bivariate relationship (4)
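A base-R sketch of the RBR objectives; the course materials use ggformula (e.g., gf_point()) and tally() for comparable plots and tables. The data and variable names here are invented:

```r
# Hypothetical bivariate data
height <- c(60, 62, 64, 66, 68, 70, 72)
weight <- c(115, 120, 130, 140, 155, 165, 180)
sex    <- factor(c("F", "F", "F", "M", "F", "M", "M"))

# RBR2: two quantitative variables -> scatterplot
plot(weight ~ height)

# RBR2: quantitative outcome by categorical explanatory -> box plots
boxplot(weight ~ sex)

# RBR1: two categorical variables -> contingency table with
# frequencies and proportions
tall <- height > 65
table(sex, tall)
prop.table(table(sex, tall))
```

RBR3 is the decision shown here in miniature: the right display depends on whether each variable is quantitative or categorical.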

IBR

2.3.2. Interpret tables and graphs of bivariate relationships

IBR1

Interpret bivariate relationships presented in tables, scatter plots, and faceted histograms (4)

IBR2

Visually evaluate the strength of relationships in scatter plots, faceted histograms, and box plots (8)

WE

2.3.3. Represent relationships in word equations

WE1

Create path diagrams to represent bivariate relationships (4)

WE2

Write word equations to represent relationships between explanatory and outcome variables (4)

3. MODEL VARIATION

3.1. Understand data = model + error

SM

3.1.1. Understand concept of statistical models

SM1

Map DATA = MODEL + ERROR in different contexts (4, 5, 7, 8)

SM2

Define and use the concept of a statistical model as a function that produces a predicted score for each observation (5, 7, 8)

SM3

Differentiate between the empty model and a more complex model (5, 7, 8)

SM4

Identify the most appropriate model to use based on how variables are measured (e.g., group versus regression models) (5, 7, 8)

SM5

Distinguish between a model of data and a model of DGP (5, 6, 7, 8)

CE

3.1.2. Understand concept of error

CE1

Visually intuit which models or distributions show more error (4, 5, 7)

CE2

Define, calculate, and reason about residuals from models (5, 6, 7, 8)

CE3

Recognize that residuals aggregate to measures of error (7)

CE4

Recognize that error from the empty model is all of the unexplained variation (6)

PE

3.1.3. Understand partitioning error

PE1

Recognize that sum of squares from the empty model (SS total) can be partitioned into error and model sums of squares (6)

PE2

Recognize that SS total is determined by the outcome variable and so total variation for an outcome variable will be the same regardless of explanatory variables included in the model (7,8)

PE3

Interpret visual representations of SS error, SS model, and SS total from empty, group, and regression models (8)

TV

3.1.4. Understand how transforming variables can help build models

TV1

Use difference scores to model paired samples (7)

TV2

Use standardized variables (e.g., z-scores) to compare the strengths of linear relationships between different sets of variables (6, 8)

TV3

Calculate and interpret Pearson’s R, and estimate correlation coefficients from scatterplots (8)

TV4

Link transformed variables to visualizations (e.g., scatterplots, linear representations of models) (6, 8)

3.2. Specify models

GLM

3.2.1. Write models using GLM notation

GLM1

Use and interpret GLM notation to represent models of data, the DGP, and sampling distributions (5, 7, 8)

GLM2

Flexibly use GLM notation to represent group and regression models in different contexts (7, 8)

GLM3

Differentiate between variables and parameters in GLM equations (5, 7, 8)

GLM4

Use dummy codes to model multi-group data in GLM (7)

IPE

3.2.2. Interpret parameter estimates in context

IPE1

Map GLM components to contexts, graph, and word equations (5, 7, 8)

IPE2

Interpret GLM components computationally (5,7,8)

3.3. Fit models

FM

3.3.1. Understand what it means to fit a model

FM1

Recognize that fitting a model means calculating the best-fitting parameter estimates (5, 7, 8)

FM2

Identify that the mean acts as a balancing point for (error) residuals in a distribution (5)

FM3

Recognize that the best fitting model minimizes the appropriate measure of error (in this course, sum of squares) (6, 7, 8)

FM4

Understand the concept of “overfitting” a model (7)

EP

3.3.2. Estimate parameters

EP1

Differentiate between parameters and statistics; explain why a statistic is our best estimate of a (usually) unknowable parameter (5, 7, 8)

EP2

Estimate parameters for empty models, group models, and regression models (i.e., fit models) (5, 7, 8) (R)

EP3

Create and interpret visual representations of models (e.g., vline()) (5, 7, 8)

IE

3.3.3. Interpret estimates

IE1

Interpret the output of lm() for empty, group, and regression models (R) (5, 7, 8)

IE2

Interpret parameter estimates for group and regression models in context (8)

IE3

Write the best-fitting model based on lm() output (5, 7, 8)
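A minimal lm() sketch for the EP and IE objectives; the data and variable names are invented. The `score ~ NULL` form for the empty model is equivalent to `score ~ 1`:

```r
set.seed(10)

# Hypothetical data: score predicted from hours of study
hours <- runif(20, 0, 10)
score <- 50 + 4 * hours + rnorm(20, sd = 5)

# EP2: fit the empty model and a regression model
empty_model      <- lm(score ~ NULL)   # empty model: just b0 (the mean)
regression_model <- lm(score ~ hours)  # regression model: b0 + b1 * hours

# IE1/IE3: read the best-fitting estimates off the output
coef(empty_model)       # b0 equals mean(score)
coef(regression_model)  # b0 (intercept) and b1 (slope)
```

Reading `coef()` output back into GLM notation (e.g., Yi = b0 + b1*Xi + ei) is exactly the skill IE3 names.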

3.4. Assess model fit

UR

3.4.1. Use residuals to assess model fit

UR1

Recognize that error from the empty model comprises all of the unexplained variation (5)

UR2

Recognize that a residual from the empty model can be partitioned into error and model (7, 8)

UR3

Identify and interpret residuals on a graph of data for empty, group, and regression models (6, 7, 8)

UR4

Plot and interpret distributions of residuals (R) (7, 8)

UR5

Generate and analyze predictions and residuals from a model (R) (6, 7, 8)
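A sketch of UR4 and UR5 with predict() and resid(); data are invented:

```r
set.seed(5)

# Hypothetical data and a simple regression model
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)
model <- lm(y ~ x)

# UR5: one predicted score and one residual per observation
predictions <- predict(model)
residuals   <- resid(model)

# UR4: plot the distribution of residuals
hist(residuals)

# DATA = MODEL + ERROR: predictions + residuals reconstruct the data
max(abs(y - (predictions + residuals)))  # ~0 (floating-point precision)
```

The last line is the whole framework in one check: every observed value splits exactly into the model's prediction plus that observation's residual.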

AE

3.4.2. Quantify aggregate error around a model

AE1

Quantify aggregate error as Sum of Absolute Deviations (SAD), Sum of Squares (SS), variance, and standard deviation, and interpret these measures (R) (6)

AE2

Explain how sums of squares are constructed from residuals (6)

AE3

Recognize the pros and cons of different methods of quantifying error (6)

AE4

Calculate and interpret z-scores within a distribution or across distributions (6)
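A base-R sketch of the error measures in AE1 and the z-scores in AE4; the scores are invented:

```r
# Hypothetical scores
scores <- c(55, 60, 70, 80, 95)
resids <- scores - mean(scores)  # residuals from the empty model

# AE1: aggregate error measures built from the same residuals
sum(abs(resids))  # SAD: sum of absolute deviations
sum(resids^2)     # SS: sum of squares
var(scores)       # variance: SS / (n - 1)
sd(scores)        # standard deviation: sqrt(variance)

# AE4: z-scores (also available via scale(scores))
(scores - mean(scores)) / sd(scores)
```

Seeing variance and SD built from SS makes AE2's point concrete: each measure is a different way of aggregating the same residuals.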

CF

3.4.3. Compare the fit of two models

CF1

Use SS, PRE, and F statistic to compare models (7, 8)

CF2

Calculate (e.g., from ANOVA tables) and interpret PRE and the F statistic (R) (7, 8)

CF3

Interpret a t distribution and relate the critical t to critical z (10)

CF4

Interpret PRE consistently across ANOVA models and regression models: recognize that it expresses the same thing in both (8)
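The comparisons in CF1 and CF2 can be sketched in base R; the course's supernova package prints an ANOVA table with PRE included, but anova() and the sums of squares below expose the same quantities. Data are invented:

```r
set.seed(3)

# Hypothetical data with a real linear relationship
x <- runif(30, 0, 10)
y <- 5 + 1.5 * x + rnorm(30, sd = 3)

empty   <- lm(y ~ 1)  # empty model
complex <- lm(y ~ x)  # regression model

# CF1/CF2: F test comparing the two models
anova(empty, complex)

# PRE: proportional reduction in error from empty to complex model
ss_total <- sum(resid(empty)^2)
ss_error <- sum(resid(complex)^2)
(ss_total - ss_error) / ss_total
```

PRE computed this way is identical to R-squared from summary(complex), which is CF4's point: the same measure under two names.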

ES

3.4.4. Understand and calculate different measures of effect size

ES1

Describe and determine appropriate measures of effect size such as difference of means, PRE, Cohen’s d (7, 8)

3.5. Use models to generate predictions and probabilities

ND

3.5.1. Use the normal distribution to model variation

ND1

Explain why the normal distribution may be used to model variation in the DGP/population (6)

ND2

Identify the features of a normal distribution and use data to construct best fitting normal model of variation (e.g., defined by mean and standard deviation) (6)

PE

3.5.2. Make point predictions based on parameter estimates

PE1

Use parameter estimates in functions to make point predictions from the empty, group, and regression models (R) (6)

PD

3.5.3. Make likelihood predictions based on probability distributions

PD1

Explain the concept of probability distribution as a model of a random DGP (6, 9)

PD2

Explain the relationship between a discrete (e.g. data, simulated normal distributions) and continuous probability distribution (6)

PD3

Use the empirical rule to estimate the likelihood of scores under a normal distribution (6)

PD4

Use the distribution of data to predict likelihood of future observations falling within a specified range (6)

PD5

Use the normal distribution to estimate likelihood of future observations falling within a specified range (R) (6)
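A sketch of PD3 and PD5 with pnorm(); the mean and SD of the fitted normal model are invented:

```r
# Hypothetical fitted normal model of the DGP
m <- 100
s <- 15

# PD3: empirical rule check; about 68% of scores fall within one SD
pnorm(m + s, mean = m, sd = s) - pnorm(m - s, mean = m, sd = s)  # ~0.683

# PD5: likelihood that a future observation falls between 110 and 130
pnorm(130, mean = m, sd = s) - pnorm(110, mean = m, sd = s)
```

The same subtraction-of-two-pnorm() pattern answers any "between a and b" likelihood question under a normal model.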

4. EVALUATE MODELS

4.1. Model the distributions of estimates

CSD

4.1.1. Understand the concept of sampling distribution

CSD1

Explain the concept of a sampling distribution (9)

CSD2

Explain the problem a sampling distribution solves, i.e. that it helps to make sense of a particular estimate given sampling variation (9)

CSD3

Interpret sampling distributions in specific contexts (9)

CSD4

Recognize that sampling distributions can depict the random distribution of any statistic (not just the sample mean) from some DGP (9)

CSD5

Recognize that sampling distributions are imaginary (9)

FSD

4.1.2. Understand features of sampling distributions

FSD1

Describe the shape, center, and spread of a sampling distribution of estimates, including SDoM, b0, b1, F, and PRE (9, 10, 11)

FSD2

Explain how the shape, center, and spread of the population and the sample size affect the shape, center, and spread of sampling distributions (9)

FSD3

Define and calculate standard error (9)

FSD4

Identify standard error visually (9)

SDR

4.1.3. Construct sampling distributions (R)

SDR1

Construct sampling distributions by resampling (bootstrapping), simulation, randomization, and using mathematical probability distributions (R) (9, 10, 11)

SDR2

Explain the relationship between a discrete (e.g. data, simulated normal distributions) and continuous probability distribution (6)

SDR3

Be able to predict the center and shape of a sampling distribution depending on the method used to construct it (9, 10, 11)
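A base-R sketch of the bootstrapping route in SDR1; the course's resample() (mosaic package) performs the inner resampling step. The sample is invented:

```r
set.seed(1)

# A single hypothetical sample of 30 observations
sample_data <- rnorm(30, mean = 10, sd = 2)

# SDR1: bootstrap a sampling distribution of the mean by resampling
# with replacement from the sample itself
boot_means <- replicate(1000, mean(sample(sample_data, replace = TRUE)))

hist(boot_means)  # centered near the sample mean, narrower than the data
sd(boot_means)    # approximates the standard error of the mean (FSD3)
```

Swapping mean() for another statistic (e.g., a slope from lm()) bootstraps a sampling distribution of that statistic instead, which is CSD4's point.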

ISD

4.1.4. Interpret sampling distributions

ISD1

Calculate and evaluate the likelihood of selecting a random sample with a certain sample statistic (e.g., using a t distribution) (R) (9)

ISD2

Identify what questions can and cannot be addressed using sampling distributions; i.e. questions about individual observations require a population distribution whereas questions about samples require a sampling distribution (9)

4.2. Use confidence intervals to compare models

CI

4.2.1. Understand confidence intervals

CI1

Recognize that just as a fixed DGP could produce a range of estimates, a single estimate could be produced by a range of DGPs (9)

CI2

Recognize that a confidence interval represents a range of parameter values that could, with some degree of likelihood, have produced your estimate (10)

CI3

Define margin of error and be able to find it (10)

CI4

Recognize how standard error relates to confidence intervals (10)

CI5

Recognize that the upper and lower bounds of a confidence interval represent the highest and lowest parameter values beyond which our sample would have been unlikely (10)

CCI

4.2.2. Construct a confidence interval around an estimate (R)

CCI1

Construct confidence intervals using simulation, bootstrapping, and the mathematical probability distribution (i.e., z, t) for various estimates (e.g., b0, b1) (R) (10)

CCI2

Explain how level of confidence, sample size, and variance impact the size of a confidence interval (10)
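A sketch of the mathematical (t-based) route in CCI1 using confint(); data and variable names are invented:

```r
set.seed(2)

# Hypothetical data with a true slope of 2
x <- runif(25, 0, 10)
y <- 3 + 2 * x + rnorm(25)

model <- lm(y ~ x)

# CCI1: t-based 95% confidence intervals for b0 and b1
confint(model, level = 0.95)

# CCI2: a higher confidence level widens the interval
confint(model, level = 0.99)
```

Comparing the two calls shows one of CCI2's factors directly; rerunning with more data or less noise shows the other two.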

ICI

4.2.3. Interpret a confidence interval

ICI1

Interpret confidence intervals in regression and grouping models for various estimates (e.g., b0, b1) (10)

ICI2

Explain why you might not reject the empty model when a confidence interval for b1 contains 0 (10)

ICI3

Interpret a confidence interval correctly (10)

4.3. Compare models using the F distribution

F

4.3.1. Understand F

F1

Define F as a measure of the strength of a relationship per parameter used in a model, and as a ratio of variation explained to variation unexplained (11)

F2

Use simulation to explore which statistics can be modeled with the F distribution (R) (11)

F3

Interpret an F distribution, and distinguish the F statistic from the F distribution (11)

F4

Recognize that the shape of the F distribution depends on the degrees of freedom for the model and for error (11)

MC

4.3.2. Conduct model comparison

MC1

Conduct F tests for ANOVA and regression models in context (7, 8)

MC2

Identify the region that corresponds to p-value in a sampling distribution of F or PRE (11)

MC3

Define p-value as the probability of obtaining an F or PRE statistic as extreme or more extreme than the one observed assuming that the empty model is true (11)

MC4

Explain how the numbers in an ANOVA table are calculated and what they mean (7, 8)

IMC

4.3.3. Interpret results of model comparison

IMC1

Determine which statistics can be used to compare two group or three group models versus an empty model (11)

IMC2

Recognize the limitations of the F test in a three group model (11)

IMC3

Explain the relationship between an F test and a t test (11)

IMC4

Recognize and explain the need for simple effects tests (11)

IMC5

Use the results of ANOVA tables to make predictions about confidence intervals (11)

IMC6

Explain the problem of simultaneous comparisons and how it is addressed by the Bonferroni correction (11)

IMC7

Explain the inherent risk of p-hacking; calculate the likelihood of p-hacking (11)

IMC8

Explain and recognize Type I error and why its probability of occurring is never zero (11)

IMC9

Explain the use of an alpha level and how it relates to Type I and Type II error (11)

IMC10

Explain and recognize Type II error (11)

IMC11

Use the F statistic to compare models; intuit about effect size based on size of F ratio (7)

IMC12

Interpret a p-value and link it to the assumed data generating process (DGP) (11)