Course of Study

Statistics and Data Science II: A Modeling Approach

With standards alignment listed for CCSS, AP Stats, and Intro to Stats at Mesa College (Dual Enrollment) – fourth year course, with SDS I or AP Statistics as prerequisite

I. Course Description

A. UC/CSU “a-g” Subject Area: Mathematics - Statistics

B. Rationale for Course: To provide students with a rigorous course in Data Science - a combination of statistics and computer programming used to meaningfully model, interpret, and analyze data. The course is the second in a two-course series and continues building on topics across three different sets of standards: CCSS Mathematics (Statistics and Probability; Functions), AP Statistics, and basic AP Computer Science standards. It also incorporates the Mathematical Practice standards in every assignment and reading, providing frequent opportunities to practice critical thinking in mathematics.

C. Grade Level(s): 11th, 12th

D. Credits & Length of Course: Full year

E. Graduation Requirement: 4th year mathematics

F. Which Graduation Requirement is met? Mathematics

G. Prerequisites/Corequisites: Statistics and Data Science I: A Modeling Approach or AP Statistics

H. Classroom-based or Online/hybrid: Classroom

II. Course Overview: Goals and Student Outcomes

SDS II is designed as a follow-on course to SDS I, which is described in detail elsewhere. The course is divided into 6 units.

This course continues to develop students’ skills in data science and statistics by emphasizing a modeling approach. The General Linear Model (GLM) is used as a major connecting principle across both SDS I and SDS II. Students will learn to explore variation, model variation, and evaluate models to answer critical questions. Students will use the statistical programming language R to produce data analyses and data visualizations, and they will learn to align R code with the algebraic notation of the GLM that is commonly used by professional researchers. Authentic datasets and real-world questions will be explored while building coding skills with the same technology used by real data scientists, thus preparing students to work in a professional environment.

The goals of the course are for students to develop the habits of mind of a data scientist, such that they can problem solve flexibly with data in a variety of situations. The course also develops the practical skills needed across the stages of the data science cycle, including data collection, analysis, inferential tests, parameter estimation, and communication of results. Furthermore, a significant goal of the course is to provide students with skills that are future-oriented, not outdated, and tied to the real world so that they are better equipped for their future careers.

This course builds upon the High School Common Core State Standards for Statistics and Probability that involve the study of data science, as well as AP Statistics and Probability Standards. Additionally, the course meets the learning objectives for the California Community College “Introduction to Statistics” course (Math 110 according to the Course Identification Numbering System, https://c-id.net/descriptors/final/show/365).

A note about the connections to algebra in this course sequence (both SDS I and II): Commonly, students think of algebraic functions in terms of manipulation or substitution to solve for particular values of X (or Y). In the SDS course sequence, students’ prior knowledge of algebraic functions is engaged and developed for the purpose of making sense of relationships between variables. Instead of being given relationships and solving for X, students use X -- and consider many levels of X -- in order to model variation in Y. SDS I focuses on comparing models with no explanatory variables, models with one explanatory variable, and models with one explanatory variable with multiple dummy-coded categories, where the Xs are levels of one categorical variable.

SDS II focuses on more advanced modeling techniques. It begins with using inferential methods to go from fitting models to the sample to estimating population parameters. SDS II also introduces students to fitting and evaluating more complex models, such as multivariate models (where the Xs are different explanatory variables), polynomial models, and log/exponential models, as well as the limits of these models (e.g., cyclical data that are typically modeled with trigonometric functions).

Materials include an interactive textbook; mini-projects (implemented as Jupyter notebooks) during class sessions; and a variety of culminating projects that can be implemented either as final exams or as individual or group projects. For all the unit assessments detailed in the sections below, unless otherwise noted, students will be producing a Jupyter notebook that integrates code (programming in R), writing, data visualizations, and analyses (including simulations).

The course will be divided into 6 units:

Unit 1: (SDS I review) Explore and Model Variation in Data

Unit 2: Hypothesis Testing

Unit 3: Confidence Intervals, Bootstrapping, Simulation

Unit 4: Multiple Regression (Additive Models)

Unit 5: Multiple Regression (Interaction Models) and Special Topics in Statistics, Probability, and Data Science

Unit 6: Culminating Exam or Project

III. Course Content

Unit 1 (SDS I review) Explore and Model Variation in Data

Unit 1 for SDS II will be a review of the core concepts and data analysis procedures covered in SDS I. Starting with a new data set, students will go through the cycle of data analysis: discussing questions that could be answered using the data; cleaning data and constructing summary variables; exploring variation in data and making hypotheses about the Data Generating Process that could have produced the variation; specifying (in notation of the General Linear Model) and fitting statistical models with one outcome variable (quantitative) and one explanatory variable (categorical or quantitative); assessing the fit of models (e.g., using SS, PRE, and F-statistics); using models to generate predictions; drawing conclusions from the analyses; and communicating the results of their work in writing. This unit essentially focuses on reviewing models and comparing them to the empty model. Students will make plots and conduct all analyses using R.
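
To make the review concrete, the following is a minimal sketch, in base R, of the model comparison workflow this unit revisits: fitting the empty model and a one-predictor model, then comparing them with sums of squares, PRE, and an F test. The data frame census and the variables sleep_hours and grade are invented for illustration.

  # Empty model: predicts the grand mean for every student
  empty_model <- lm(sleep_hours ~ 1, data = census)

  # One-predictor model: grade as a categorical explanatory variable
  grade_model <- lm(sleep_hours ~ grade, data = census)

  # Compare the two models: sums of squares, F statistic, p-value
  anova(empty_model, grade_model)

  # PRE: proportional reduction in error relative to the empty model
  SS_empty <- sum(resid(empty_model)^2)
  SS_grade <- sum(resid(grade_model)^2)
  (SS_empty - SS_grade) / SS_empty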

Unit 1 Assignments:

School Census Project: In the school census project, students are given a large, messy data set with survey data from over 10,000 high school students and asked to explore variation and model variation for a client, a magazine that wants to report on the lives of modern high school students. In the format of a Jupyter notebook, students develop their own research question that can be explored with this data set, make plots to explore their initial hypothesis, clean the data, and find the best fitting model of their hypothesis. For example, for the question “how many languages do you speak?” a few of the survey responses stated numbers in the thousands. Students will have to consider whether they want to include these data in their analysis and if not, explain how they chose to clean their data and potential implications of that choice in their model. This project will include analyses and making visualizations but also include decision-making, writing, and organizing to make their analyses comprehensible to an audience. The final deliverable will be a Jupyter notebook in which students write a background and rationale for their question, present exploratory data analyses, model the outcome using a predictor variable, and write a conclusion.

Predictive Models of Voting: Among the most common models covered in news media are predictions of voting data. In this assignment, students are given data with voting behavior and census information from the 50 states in the US. Students will create predictive models of voting, visualize their models, and compare their models to those created by their classmates. They will have to debate which models are better by using measures of fit and error reduction.

Unit 1 Recommended Focus Standards

Common Core State Standards (Pg 86-112)

Statistics: S-ID 1 - 3, S-ID 5 - 9, S-IC 5, S-IC 6

AP Statistics Standards: 5.0, 6.0, 8.0, 10.0, 11.0, 12.0, 13.0, 14.0

CCC C-ID Math 110 Student Learning Objectives: 1, 3, 4, 5, 10, 15

Unit 1 Recommended Focus Mathematical Practices

1. Make sense of problems and persevere in solving them.

6. Attend to precision.

Unit 2: Hypothesis Testing

Unit 2 introduces students to the problem of statistical inference and the concept of hypothesis testing within a modern modeling framework. Thus far, students have learned to fit both group models and simple regression models to data. But how accurate are the parameter estimates they calculate based on data? Unit 2 starts by exploring the idea that there is variation across different random samples of a given size; that the specific sample included in a data set is only one of many possible samples that could have been studied; and that some random samples may not accurately represent the population or Data Generating Process (DGP). These ideas are explored visually first, using simulations of random samples from the empty model (a.k.a. the null hypothesis in the Null Hypothesis Significance Testing, NHST, tradition). Students then fit models to these simulated data and compute parameter estimates or statistics for each simulated sample. Finally, students learn to construct a sampling distribution of these parameter estimates and use them to evaluate the empty model. Algebraically, this amounts to evaluating a proposed model of the DGP.

These simulated sampling distributions will be used as probability distributions to calculate the likelihood of a variety of parameter estimates (e.g., b1, PRE, F) under the empty (or null) model. Concepts such as alpha and p-value, standard error, and Type I and Type II error will be explored in the context of these simulated sampling distributions, and students will explore in depth what it means to reject, or fail to reject, the empty model. Students will also connect their simulated sampling distributions to the mathematical models (e.g., the t distribution and F distribution). The unit also includes discussions of the limits of the hypothesis testing approach. The approach here builds up to what are traditionally called "two-sample t-tests and ANOVAs" but comes at these concepts through a mathematical modeling approach.
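
As a minimal sketch of this simulation logic (with a hypothetical data frame dat containing a quantitative outcome and a two-level explanatory variable group), the empty model can be simulated by shuffling the explanatory variable, refitting the model each time, and collecting the b1 estimates into a sampling distribution:

  # Simulate the sampling distribution of b1 under the empty model
  # by shuffling group, which breaks any real link to the outcome
  set.seed(12)
  sim_b1 <- replicate(1000, {
    shuffled <- transform(dat, group = sample(group))
    coef(lm(outcome ~ group, data = shuffled))[2]
  })

  # Where does the b1 from the actual data fall in this distribution?
  b1_actual <- coef(lm(outcome ~ group, data = dat))[2]
  hist(sim_b1)
  abline(v = b1_actual, col = "red")

  # p-value: proportion of simulated b1s at least as extreme as the actual one
  mean(abs(sim_b1) >= abs(b1_actual))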

Unit 2 Assignments:

Open Science Framework (OSF) Replication: Students will be given experimental data available from the Open Science Framework (https://osf.io/). They can then replicate the analyses reported in the paper and interpret the results in the context of the experiment’s hypothesis. For example, students have access to data from the now classic experiment by Carol Dweck and colleagues in which 5th graders were given different kinds of praise (either “you’re so smart” or “you must have tried very hard”) following a series of IQ test questions. Students can read about the experiment and its hypothesis. Then they would fit a model based on the hypothesis. Finally, they can use various inferential techniques (bootstrapping, randomization, or mathematical models) to evaluate their hypothesis.

Comparing simulation approaches to mathematical models of the sampling distribution: Given a data set (e.g., an experiment by Paul Piff, UCI, examining how feeling wealthy might affect ethical decision making), students will show that a variety of simulation approaches (e.g., randomization and bootstrapping) lead to similar conclusions as mathematical models of the sampling distribution based on the F-distribution and t-distribution. Students will show that hypothesis testing with at least two different methods converges on similar p-values and conclusions.
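
As a sketch of what converging methods might look like in code (the data frame piff and its variable names are invented for illustration), the randomization p-value from a simulation like the one above can be set beside the p-values from the mathematical models:

  # Mathematical model of the sampling distribution: the t distribution
  t.test(ethics ~ condition, data = piff, var.equal = TRUE)

  # The same comparison framed as an F test against the empty model
  anova(lm(ethics ~ condition, data = piff))

  # With two groups, F equals t squared, so these two p-values agree exactly;
  # a randomization p-value computed as in the sketch above should land nearby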

Unit 2 Recommended Focus Standards

Common Core State Standards (Pg 86-112)

Statistics: S-IC 1, S-IC 2, S-IC 5, S-IC 6, S-MD 1 - 4, S-MD 7

AP Statistics Standards: 4.0, 7.0, 9.0, 15.0, 16.0, 18.0

CCC C-ID Math 110 Student Learning Objectives: 2, 6, 7, 9, 10, 11, 12, 13, 14, 15

Unit 2 Recommended Focus Mathematical Practices

1. Make sense of problems and persevere in solving them.

2. Reason abstractly and quantitatively.

5. Use appropriate tools strategically.

Unit 3: Confidence Intervals (for Means, Proportions, and Variance), Bootstrapping, and Simulation

Having explored the logic and limitations of hypothesis testing, Unit 3 focuses on parameter estimation and confidence intervals, expanding the use of sampling distributions and how to construct them. Students now extend their best fitting models of data to estimating models of the DGP. Throughout, connections between hypothesis testing and parameter estimation are developed and both are related to the overall framework of model comparison. In the beginning of the unit, students learn another technique for constructing sampling distributions, bootstrapping. They compare sampling distributions created using different methods, and notice that while the estimates of Standard Error are similar across methods, differences can be understood in terms of the details of the Data Generating Process. Throughout, students discuss how choosing among statistical models and approaches depends on their purpose, in particular, understanding vs. prediction.

Students also expand their repertoire of computational thinking skills to include simulation in addition to techniques such as randomization and bootstrapping, freeing them to engage in more flexible and wide-ranging hypothetical thinking. Using simulation, students do a deep dive into the factors that influence the shape and spread of sampling distributions, and learn about the Central Limit Theorem. Armed with this deepening understanding of sampling distributions, students learn about the difficult concept of power (i.e., probability of replication), and use simulation to inform the design of future experiments. Although the course so far has taken a frequentist approach to statistics, students now revisit the logic of inference through the eyes of Bayes, along the way developing a more sophisticated understanding of probability and of the interpretation of confidence intervals for a variety of parameters.
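
A minimal bootstrap sketch in base R (with a hypothetical data frame dat and variables outcome and predictor): resample rows of the data with replacement, refit the model each time, and read a confidence interval off the middle of the bootstrapped sampling distribution.

  # Bootstrap the sampling distribution of the slope b1
  set.seed(7)
  boot_b1 <- replicate(1000, {
    rows <- sample(nrow(dat), replace = TRUE)
    coef(lm(outcome ~ predictor, data = dat[rows, ]))[2]
  })

  # A 95% confidence interval from the middle 95% of bootstrap estimates
  quantile(boot_b1, c(0.025, 0.975))

  # The standard deviation of the bootstrap distribution estimates the SE
  sd(boot_b1)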

Unit 3 Assignments:

Making Money as a YouTuber: Students have a lot of theories about how social media influencers make their money. Students will gather data from YouTubers who have videos disclosing how much money they make on YouTube. They can hypothesize what variables would predict income (e.g., number of subscribers, number of views, type of channel) and create their preferred model. Then students can create sampling distributions to estimate the range of parameter values (confidence intervals) that could give rise to this data. Students will write about their model, their parameter estimates, and what these values mean for their hypothesis in a data analysis report (housed in a Jupyter notebook). Students will also consider what population of YouTube content creators their model applies to.

Power Analysis: In both units 2 and 3, there is an emphasis on using simulation techniques to analyze data. In this assignment, students will use simulation techniques to figure out how much data they need to collect in order to detect some effect. Students will be presented with a recent finding that has generated some controversy. For example, one micro-economics study found that subjects assigned to be low in wealth (who were given fewer chances to win in repeated “Wheel of Fortune” type word puzzle games) performed worse in a subsequent cognitive task than subjects in the high-wealth condition (who had many chances to win). The original study was done with 56 subjects, and the effect size of interest is the difference between the two groups. Students will simulate different effect sizes to figure out how much data should be collected in order to have an 80% chance of replicating the study. Students will code their simulation, annotate their code, and explain their reasoning.
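
A minimal sketch of such a power simulation (all numbers illustrative rather than taken from the actual study): assume a true group difference, simulate many two-group experiments at a given sample size, and count how often the effect is detected at alpha = .05.

  # Estimated power = proportion of simulated studies that detect the effect
  power_sim <- function(n, effect, sd = 1, reps = 1000, alpha = 0.05) {
    hits <- replicate(reps, {
      control <- rnorm(n, mean = 0, sd = sd)
      treated <- rnorm(n, mean = effect, sd = sd)
      t.test(treated, control)$p.value < alpha
    })
    mean(hits)
  }

  # Increase the per-group sample size until estimated power reaches ~80%
  # (28 per group mirrors the original 56-subject study)
  sapply(c(28, 50, 100, 200), function(n) power_sim(n, effect = 0.4))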

Unit 3 Recommended Focus Standards

Common Core State Standards (Pg 86-112)

Statistics: S-IC 4, S-IC 6, S-MD 1 - 4, S-MD 7

AP Statistics Standards: 4.0, 9.0, 15.0, 16.0, 17.0

CCC C-ID Math 110 Student Learning Objectives: 2, 6, 7, 8, 10, 13, 15

Unit 3 Recommended Focus Mathematical Practices

3. Construct viable arguments and critique the reasoning of others.

4. Model with mathematics.

7. Look for and make use of structure.

8. Look for and express regularity in repeated reasoning.

Unit 4: Multiple Regression (Additive Models)

Up to this point students have used only a single explanatory variable in their models. In Unit 4 we extend students’ capabilities to multivariate models, starting with multiple regression. Students learn how to specify multiple regression models using the notation of the General Linear Model, and how to fit the models, which can include both quantitative and categorical predictors. The focus of this unit is multivariate models that can be expressed as a system of equations with the same slope (parallel lines). They learn how to interpret the parameter estimates and how to use parameter estimates to make predictions. Students then learn how to interpret ANOVA tables generated by multiple regression models: they develop an intuitive notion of how multicollinearity affects the partitioning of sums of squares (e.g., noticing that Type III sums of squares don’t add up, and understanding why this is the case), and they learn to use F tests to compare different complex models to each other as well as comparing overall model fit to the empty model.
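
A minimal sketch of the additive model in R (the data frame homes and its variables are invented for illustration): the GLM statement Yi = b0 + b1*X1i + b2*X2i + ei fits parallel lines, one per group.

  # Additive model: one quantitative and one categorical predictor;
  # every neighborhood shares the same slope but gets its own intercept
  additive_model <- lm(price ~ sqft + neighborhood, data = homes)
  coef(additive_model)

  # ANOVA table for the model (base R reports sequential, Type I sums
  # of squares; Type III tables require an additional package)
  anova(additive_model)

  # F test comparing the two-predictor model against a simpler model
  anova(lm(price ~ sqft, data = homes), additive_model)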

After mastering the basics of multiple regression, students also will explore data in which error is correlated within groups, consider the distinction between fixed and random effects in a mixed model regression framework, and introduce interaction effects into their models for the first time. They will think through the considerations that would lead them to either fix the slope and intercept, or to allow them to vary within groups.

Unit 4 Assignments:

Multivariate Models to Predict Heart Disease: The leading cause of death in the developed world is heart disease. However, we cannot do experiments to figure out what causes heart disease. Therefore, we need to create predictive models of heart disease that take into account multiple factors and then consider whether these factors may cause heart disease or not. Students will analyze data from the University of California, Irvine's Machine Learning Repository, which includes variables such as blood pressure, cholesterol, and max heart rate. They will determine whether adding variables leads to better predictions and interpret ANOVA tables generated by multiple regression models. Students will fit their more complex models using a training data set and then measure error from their models with a testing data set.
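
A minimal sketch of the training/testing workflow (variable names are placeholders, not the actual UCI column names):

  # Hold out 20% of the rows as a testing set
  set.seed(42)
  train_rows <- sample(nrow(heart), size = floor(0.8 * nrow(heart)))
  train <- heart[train_rows, ]
  test  <- heart[-train_rows, ]

  # Fit the multivariate model on the training data only
  fit <- lm(max_hr ~ age + rest_bp + chol, data = train)

  # Measure prediction error (RMSE) on data the model has never seen
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$max_hr - preds)^2))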

Predictive Models of Sports: Sports statistics are now used in a variety of contexts for a variety of purposes. Students will create multivariate models to predict various outcomes in basketball games (e.g., points, free throws, assists) to see which statistics are more easily predicted and which are not. Students will compare these models based on games to models based on data from individual players. Students will consider a purely predictive goal (e.g., just predicting the outcomes of games/players) for purposes such as fantasy sport but also consider a coach’s point: which explanatory variables cause important outcomes (and therefore teams should work on those skills)?

Unit 4 Recommended Focus Standards

Common Core State Standards (Pg 86-112)

Statistics: S-ID 1, S-ID 6, S-ID 6a - 6c, S-ID 7 - 9, S-IC 5 - 6, S-MD 1, S-MD 7

AP Statistics Standards: 12.0, 14.0, 15.0, 17.0, 18.0

CCC C-ID Math 110 Student Learning Objectives: 1, 7-11, 13-15

Unit 4 Recommended Focus Mathematical Practices

2. Reason abstractly and quantitatively.

5. Use appropriate tools strategically.

6. Attend to precision.

Unit 5: Multiple Regression (Interaction Models) and Special Topics in Statistics, Probability, and Data Science

In contrast to Unit 4, where students focus on additive models (systems of equations where the lines have the same slope), in Unit 5 students focus on interaction models (systems of equations where the lines have different slopes). They learn how to interpret the parameter estimates and how to use parameter estimates to make predictions.
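
In R, the move from the additive model to the interaction model is a small change to the model formula. A minimal sketch with invented names (later_anxiety, initial_anxiety, and condition foreshadow the Dog Therapy assignment below):

  # Additive model: parallel lines (same slope, different intercepts)
  additive_model <- lm(later_anxiety ~ initial_anxiety + condition, data = er)

  # Interaction model: each condition gets its own slope
  interact_model <- lm(later_anxiety ~ initial_anxiety * condition, data = er)

  # The initial_anxiety:condition coefficient estimates the difference in slopes
  coef(interact_model)

  # Does letting the slopes differ reduce enough error to justify the
  # extra parameter?
  anova(additive_model, interact_model)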

After that, teachers have the choice to extend these ideas of modeling in a variety of directions: delving into probability axioms, modeling with nonlinear functions (e.g., polynomial or log), or transforming variables (and then fitting a linear model). Students can delve into the statistical tools and axioms useful for understanding categorical outcomes (which include probability theory, permutations/combinations, binomial distributions, and chi-square tests). The simulation skills developed earlier (e.g., randomization and bootstrapping) are reprised here to examine the probabilities and patterns seen with categorical outcomes. Or students might wonder how to model "curves" in their data and realize the limitations of linear equations. Many of the concepts developed earlier (e.g., making predictions with functions, examining error from those predictions) can be repurposed from linear models to more sophisticated nonlinear functions.

Unit 5 Assignments:

Dog Therapy: People end up at the ER for a variety of reasons, and they might have real reason to feel anxiety there. A hospital conducted a randomized controlled trial to evaluate whether therapy dogs might be particularly helpful to patients who are highly anxious when arriving at the ER. Students discover that patients who arrived at the ER with high anxiety tend to have high levels of anxiety 30 or 90 minutes later, while patients who had lower levels of anxiety initially have similarly low levels after a delay. That's business as usual. But among patients who got 15 minutes with a therapy dog, even the initially anxious patients experience generally lower levels of anxiety after 30 or 90 minutes. Students model this with a system of equations, each equation representing the relationship between initial and later anxiety for one condition. The slopes differ such that the dog therapy group has a more gradual slope.

Can Money Buy Health and Happiness?: Many patterns of international health, wellness, and wealth are highly correlated. For example, richer countries also tend to be healthier and happier. But this relationship is not exactly linear. Students will explore measures of health and happiness from GapMinder and the Happy Planet Index and create predictive models using GDP as a measure of wealth. Linear models are deeply unsatisfying here because the data often depict diminishing returns of wealth. Students will explore the limits of linear functions, be introduced to polynomial and logarithmic models, and evaluate which models reduce error.
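
A minimal sketch of how these curved models might be fit and compared (the data frame countries and its variables are invented; gdp and happiness stand in for the GapMinder and Happy Planet Index measures):

  # Linear model: assumes each extra dollar of GDP buys the same happiness
  linear_model <- lm(happiness ~ gdp, data = countries)

  # Logarithmic model: captures diminishing returns of wealth
  log_model <- lm(happiness ~ log(gdp), data = countries)

  # Quadratic (polynomial) model: allows a different shape of curvature
  poly_model <- lm(happiness ~ poly(gdp, 2), data = countries)

  # Compare how much error each model leaves behind
  sapply(list(linear = linear_model, log = log_model, quadratic = poly_model),
         function(m) sum(resid(m)^2))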

Unit 5 Recommended Focus Standards

Common Core State Standards (Pg 86-112)

Statistics: S-ID 6a - 6c, S-ID 7 - 9, S-IC 5 - 6, S-MD 1, S-MD 7

AP Statistics Standards: 12.0, 14.0, 15.0, 17.0, 18.0

CCC C-ID Math 110 Student Learning Objectives: 1, 7-11, 13-15

Unit 5 Recommended Focus Mathematical Practices

2. Reason abstractly and quantitatively.

4. Model with mathematics.

7. Look for and make use of structure.

8. Look for and express regularity in repeated reasoning.

Unit 6: Culminating Exam or Project

Each lesson in units 1-5 includes stand-alone mini-projects or case studies in which students can apply what they are learning to new data sets. In unit 6, students will engage in a culminating project, independent of the mini-projects, that requires a lengthier and more in-depth data analysis report with five sections (detailed below). This authentic performance assessment may be administered as a final exam, or as a project that extends over the final part of the course. Additionally, the project could be done individually with randomized datasets, or in groups, depending on the instructional context.

The culminating project or exam is a structured Jupyter notebook that involves identifying a research question answerable with data, planning and running analyses to address that question, and communicating results in a data analysis report. It is written in clear prose, includes runnable code (with an emphasis on reproducible results), and includes five parts: (1) an introduction to the problem, rationale, and description of the data; (2) an exploration of the data, with charts, graphs, and tables created using R, as well as any data cleaning required; (3) a specification and fitting of a statistical model or models, expressed in General Linear Model notation, using the model as a function to generate predictions; (4) evaluation and comparison of alternative models using inferential statistics and/or simulation methods; and (5) a conclusion that returns to the questions motivating the analysis. The culminating project requires students to take all the knowledge of modeling they have (e.g., the different functions that can be applied: linear, polynomial, multivariate, etc.) and use it to appropriately model the situation at hand, producing a product recognizable by a working data scientist.
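
For section (3), the mapping from GLM notation to R code and predictions is the thread that runs through the whole course. As a minimal sketch (names hypothetical), the word equation Yi = b0 + b1*Xi + ei corresponds directly to a fitted model that can be used as a function:

  # GLM notation:  outcome_i = b0 + b1 * predictor_i + e_i
  report_model <- lm(outcome ~ predictor, data = dat)
  coef(report_model)  # estimates of b0 (intercept) and b1 (slope)

  # Using the fitted model as a function to generate predictions
  predict(report_model, newdata = data.frame(predictor = c(10, 20, 30)))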

Although the 5-part structure of the data analysis report is the same (examples of these data analysis report assessments used in California colleges can be found here: https://coursekata.org/teaching/resources), there is flexibility in the type of data and contexts students can be asked to explore. For example, some instructors may ask students to use large data sets available to educators (e.g., from the Inter-university Consortium for Political and Social Research) or the public (e.g., GapMinder, Kaggle, FiveThirtyEight). Others may require students to randomize their data from the datasets provided, or design a survey or experiment and collect their own data. The rubric used by college instructors is generalizable enough to evaluate the data analysis reports regardless of what data set and context is used. In addition to the template data analysis report, examples, and rubric, there is also a data analysis guide developed by college instructors.

Unit 6 Assignments:

Housing Costs: Data available on Zillow.com and other housing websites have made it possible to build models that can help predict changes in the housing market. New policies (such as new laws allowing California property owners to build duplexes and triplexes on their single-family lots) may change how housing is priced. Students will consider data before and after some policy change to see whether factors such as the number of units, bedrooms, bathrooms, and/or size (e.g., square footage) could be used to predict housing price. Timed exams in this Jupyter notebook format can be completed within one hour because instructors will set which factors students should take into account in their model. Although the structure is similar to a data analysis report (all 5 sections), the introduction (section 1) is much lighter because the hypotheses are set by the instructor. Section 2 is also lighter because there is minimal data cleaning involved.

Build a Better Catapult: Students will be given a limited set of items (such as a variety of popsicle sticks and rubber bands), and their task is to build a basic catapult (with directions shared with the entire class) for launching gummy bears. In groups, students are charged to engineer two improvements that could affect launch distance. They must design a means of collecting and analyzing evidence as to whether their improvements affected gummy bear launching relative to their basic catapult. To do so, they must design and conduct experiments to ascertain how much variation in launch distance is accounted for by each of their improvements (or by an interaction of their improvements). Students can build catapults, design improvements, and collect data in small groups starting in Unit 3. In Unit 4, they will begin sections 1 and 2 of their data analysis report and can start to build multivariate models of their data (section 3). By Unit 5, they should also consider polynomial and other nonlinear models if appropriate (section 3 continued). In Unit 6, they should either conduct simulations or use mathematical models of sampling error to evaluate their models and complete sections 4 and 5.

Unit 6 Recommended Focus Standards

Common Core State Standards (Pg 86-112)

Statistics: S-ID 1 - 3, S-ID 5 - 9, S-IC 1 - 2, S-IC 4 - 6, S-MD 1 - 4, S-MD 7

AP Statistics Standards: 4.0 - 12.0

CCC C-ID Math 110 Student Learning Objectives: 1 - 11, 13

Unit 6 Recommended Focus Mathematical Practices

1. Make sense of problems and persevere in solving them.

2. Reason abstractly and quantitatively.

3. Construct viable arguments and critique the reasoning of others.

4. Model with mathematics.

5. Use appropriate tools strategically.

6. Attend to precision.

7. Look for and make use of structure.

8. Look for and express regularity in repeated reasoning.

Topic Outline

This outline describes the scope of the course but not necessarily the sequence.

  1. Organizing concepts

  2. Modeling: DATA = MODEL + ERROR

  3. General linear model; GLM notation

  4. Data analysis using R

  5. Probability

    1. Probability under mathematical distributions

  6. Research methods

    1. Sampling

    2. Measurement: categorical v. quantitative variables

    3. Organizing data

  7. Visualizations

    1. Univariate visualizations: histograms

    2. Bivariate visualizations: frequency tables, faceted histograms, scatterplots, bar graphs, box plots

    3. Multivariate visualizations

  8. Descriptive statistics

    1. Summary statistics: center (mean, median), regression model

    2. Quantitative and categorical predictors; quantitative outcomes

    3. Z scores

  9. Modeling data with algebraic functions

    1. Constant functions (e.g., the mean as a model)

    2. Linear functions where X is a categorical variable (i.e., a group model)

    3. Linear functions where X is a quantitative variable (i.e., a regression model)

    4. Using functions to make predictions

    5. Calculating and aggregating residuals from function predictions (e.g., sum of residuals, sum of squared residuals, variance, standard deviation, r – the correlation coefficient, Proportion Reduction in Error – PRE also called r-squared).

  10. Inferential statistics and sampling distributions

    1. Mathematical distributions

      1. Probability under mathematical distributions

        1. Central limit theorem

        2. normal/Z distribution, t distribution, F distribution

      2. Computational techniques

        1. Simulation

        2. Bootstrapping

        3. Randomization

    2. Hypothesis testing

      1. t-test

      2. ANOVA

        1. posthoc comparisons

      3. Regression

      4. Type I and Type II error

        1. Concepts of power and effect size

      5. Confidence intervals

      6. ANCOVA

      7. Two-way ANOVA

      8. Multiple Regression

  11. Data Science and Statistics using R

    1. Data frames (includes matrices and vectors)

    2. Functions

      1. Algebraic functions

      2. Functions in computer programming

    3. Transformations of data

    4. Computational methods (e.g., simulation, randomization, bootstrapping)

    5. Evaluating and fitting models to data

    6. Interpreting and debugging code

IV. Instructional Methods and/or Strategies including Instructional Technology

This course is designed using the Practicing Connections instructional framework (Fries et al., 2021) and the Four-Component Instructional Design Model (Van Merriënboer, Clark, & De Croock, 2002). It consists of a rotation of:

1. Learning Tasks that are authentic real-life tasks that intentionally build from simple to complex

2. Supportive Information that provides cognitive strategies and mental models to help with both routine and non-routine aspects of the learning tasks

3. Procedural Information that provides just-in-time, step-by-step instruction to help learners become experts in recurring, more routine parts of the task

4. Part-Task Practice that, when necessary, provides extra practice for areas that require automaticity

All assessments and tasks will be aided by the use of computers and the programming language R in order to clean and analyze large data sets. Technology will also be used for the course textbook, formative assessments, and data visualization and presentation tools throughout the course.

V. Assessment Methods and/or Tools

Formative Assessments:

Reading

Over 1,200 formative reading assessments embedded in the textbook provide immediate self-checks before students come to class to participate in the learning tasks. Teachers also have immediate access to the results so they can review common misconceptions or misunderstandings from the reading and address them in a timely manner.

Learning Tasks/ Assignments

Each learning task given in class will typically be done in the format of a Jupyter notebook. Each assignment provides a different context and data set to be analyzed using the statistical strategies learned. The assignments build in complexity, with differing conclusions per data set.

R

R itself provides immediate feedback by indicating to the learner whether or not their code will run. Students have unlimited opportunities to edit and revise.

Quizzes

About 6-9 in-class quizzes will be given per semester to assess the progress of students’ learning and to gauge areas for review.

Summative Assessment

Culminating Project

At the end of each semester, students will either be given novel research questions to investigate or come up with their own given a data set. They will address the research question using the statistics and data science concepts and skills they have learned in the course. The culminating project and presentation demonstrate mastery of the course.

VI. Textbook(s) and Supplemental Instructional Materials

Textbook

Title: Statistics and Data Science: A Modeling Approach (XCD)

Authors: Ji Y. Son, Ph.D., & James W. Stigler, Ph.D.

Publisher: CourseKata (an online interactive textbook)

Ed: 2023, version 5.0 (2018, version 1.0)

Website: coursekata.org

Primary? Yes

References:

Fries, L., Son, J. Y., Givvin, K. B., & Stigler, J. W. (2021). Practicing connections: A framework to guide instructional design for developing understanding in complex domains. Educational Psychology Review, 33(2), 739-762.

Van Merriënboer, J. J., Clark, R. E., & De Croock, M. B. (2002). Blueprints for complex learning: The 4C/ID-model. Educational Technology Research and Development, 50(2), 39-61.
