Course of Study
Advanced Statistics and Data Science I: A Modeling Approach
Standards alignment is listed for CCSS and AP Statistics standards; Algebra II is a prerequisite.
I. Course Description
A. UC/CSU “a-g” Subject Area: Mathematics - Statistics
B. Rationale for Course: To provide students with a rigorous course in Data Science: a combination of statistics and computer programming used to meaningfully model, interpret, and analyze data. The course covers topics across three different sets of standards: CCSS Mathematics in Statistics and Probability, AP Statistics, and basic AP Computer Science standards. It also incorporates the Standards for Mathematical Practice in every assignment and reading, providing frequent opportunities to practice critical thinking in mathematics.
C. Grade Level(s): 11th, 12th
D. Credits & Length of Course: Full year
E. Graduation Requirement: 4th year of mathematics
F. Which Graduation Requirement is met? 4th year advanced mathematics
G. Pre-Requisites/Co-Requisites: Integrated Math 3 (IM3) or Algebra II
H. Classroom-based or Online/hybrid: Classroom based
II. Course Overview: Goals and Student Outcomes
Advanced Statistics and Data Science I is a faster-paced version of Statistics and Data Science I (SDS I) that covers more advanced content. This course develops skills in Data Science and statistics by emphasizing a modeling approach. The General Linear Model (GLM) is used as a major connecting principle among the many concepts covered in this course. Students will learn to use data and modeling to answer questions and to critically evaluate information. This is achieved, in part, by embedding opportunities for productive struggle, deliberate practice, and explicit connection-making between concepts into the structure of the course and its online textbook.
The goals of the course are for students to develop the habits of mind of a data scientist, so that they can problem-solve flexibly with data in a variety of situations. The course also aims to develop the practical skills needed across the stages of the Data Science cycle, including data collection, analysis, and communication of results. Furthermore, a significant goal of the course is to provide students with skills that are future-oriented, not outdated, and tied to the real world, so that they are better equipped for their future careers.
Throughout the course, data analyses and data visualizations will be done using the statistical programming language R, with the R code aligned to the algebraic notation of the GLM commonly used by professional researchers. Authentic datasets and real-world questions will be explored while building coding skills, thus providing a genuinely professional environment in which to work.
This course emphasizes the High School Common Core State Standards for Statistics and Probability that involve the study of Data Science, as well as AP Statistics and Probability Standards. Students authentically apply the Standards for Mathematical Practice throughout the course. By the end of the course, students will be prepared to explore quantitative and categorical data using numerical and visual summaries with the use of R; to model variation in categorical and quantitative data using the GLM; and to compare and evaluate models in terms of effect size and probabilities.
This course is divided into 7 units:
Unit 1: Where Data Come From
Unit 2: Exploring Variation
Unit 3: Modeling Variation - The Empty Model
Unit 4: Modeling Variation - The Complex Model
Unit 5: Model Comparison with the F-Statistic and PRE
Unit 6: Hypothesis Testing
Unit 7: Confidence Intervals (for Means, Proportions, and Variance), Bootstrapping, and Simulation
III. Course Content
Unit 1: Where Data Come From
This initial unit will familiarize students with where data come from, the process of cleaning data before it is ready for analysis, and the basics of R coding. They will understand and apply concepts of measurement, sampling methodology, and elements of research design by actually taking measurements and preparing them for analysis in R. This unit will also focus on understanding the organization of data and manipulating data in data frames. Further, it will introduce the purpose of data by having students practice imagining data that could answer a question and generating questions that could be asked of data.
This introductory unit will also provide a primer on the basics of R and get students comfortable with programming languages. It will cover the concepts of functions and arguments, and distinguish between objects and vectors. Students will learn the difference between cases, variables, and data frames, and recognize when data is in “Tidy Data” format. Additionally, it will provide practice with basic coding error troubleshooting.
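To illustrate the kinds of objects students first encounter, here is a minimal base-R sketch (the variable and data frame names are hypothetical examples, not part of the course materials):

# A vector holds one variable's worth of values
heights <- c(62, 65, 70, 68)            # heights in inches

# A function call: mean() is the function, heights is its argument
mean(heights)

# A data frame in "Tidy Data" format: each row is a case, each column a variable
class_data <- data.frame(
  student = c("A", "B", "C", "D"),
  height  = heights,
  gender  = c("female", "male", "male", "female")
)

# Inspect the structure: which variables are quantitative, which categorical?
str(class_data)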
Unit 1 Assignment: Critically Cleaning Data
Finger Measurements. In authentic data collection situations, there are often sources of error that data scientists have to figure out. For example, a data set may contain mistakes such as some test scores entered as percentages (90 to indicate 90%) and others entered as decimals (.9 to indicate 90%). These issues are often not disclosed to practitioners; instead, data scientists must examine and understand the data set well enough to figure them out. This necessitates an understanding of variables (and related data structures such as vectors or arrays) and values. Then they must clean the data. Students will measure their own finger lengths and then measure a partner’s finger lengths. The pair of measurements can be compared to get students thinking about sources of error, and then their measurements (along with their height and gender) will be transferred into R to engage in the practice of cleaning and filtering data before analysis. There will also be mistakes hidden in the data (for example, some students have entered their measurements in mm and others in cm), requiring students to think critically about the data, find the mistakes, and clean the data themselves. Once the raw data are ready for analysis, students have the opportunity to examine the distributions with visualizations (such as histograms and scatterplots) and with summary statistics.
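A minimal R sketch of the kind of cleaning this assignment calls for; the data frame fingers, its variables, and the 30 cm cutoff are hypothetical placeholders rather than the actual class data:

# Hypothetical raw data: thumb lengths intended to be in cm, but some entered in mm
fingers <- data.frame(
  Thumb  = c(6.0, 5.5, 60, 6.3, 58, 5.9),
  Height = c(62, 65, 70, 68, 66, 71)
)

# Entries recorded in mm are roughly 10 times too large; convert them to cm
fingers$Thumb <- ifelse(fingers$Thumb > 30, fingers$Thumb / 10, fingers$Thumb)

# Filter out any remaining impossible or missing values
fingers <- subset(fingers, Thumb > 0 & !is.na(Thumb))

# Examine the cleaned distribution and summary statistics
hist(fingers$Thumb)
summary(fingers$Thumb)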
Dating Profiles. In a large OKCupid dating profiles data set, there are odd data points such as people who are over 100 years old. Students will argue whether those data points should be included and examine the data to see what they can learn about those very old profiles.
NFL Suspensions. This data set includes NFL players who have been suspended for a variety of misbehaviors. A variable called suspensions indicates the number of games for which each player was suspended. However, while to human eyes this variable looks to be made up of numeric values, it is actually stored as a character (string) variable (e.g., the character “3” instead of the number 3). This is a common issue in authentic Data Science situations (even adults who have worked with Excel have run across it). Students will have to figure out why the data set does not seem to “behave” as they expect.
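A short R sketch of the underlying issue; the data frame and values here are hypothetical stand-ins for the actual data set:

# Hypothetical: number of suspensions stored as character strings, not numbers
nfl <- data.frame(player = c("A", "B", "C"),
                  suspensions = c("3", "1", "6"))

mean(nfl$suspensions)      # returns NA with a warning: the argument is not numeric

class(nfl$suspensions)     # "character" -- the source of the unexpected behavior
nfl$suspensions <- as.numeric(nfl$suspensions)
mean(nfl$suspensions)      # now works as expected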
Unit 1 Recommended Focus Standards
Common Core State Standards: S-IC 6
AP Statistics Standards: 14.0
Unit 1 Recommended Focus Mathematical Practices
1. Make sense of problems and persevere in solving them.
6. Attend to precision.
Unit 2: Exploring Variation
After gaining experience with collecting and cleaning data, students will begin learning the skills needed to explore variation in a set of variables. They will learn how to categorize sources of variation and what it means to use the variation in one variable to explain the variation in another. As they learn about variation, they will also begin learning how to summarize and describe distributions of data, to understand the concept of a distribution, and to recognize that distributions are the result of a data generating process that is usually unknown. Students will also sharpen their skills in producing and interpreting univariate and bivariate tables and graphs. Further, they will begin to learn how to describe relationships in data and to represent relationships in word equations, while also learning to distinguish between correlation and causation and to identify possible confounding variables.
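As an example of the Unit 2 toolkit, a minimal base-R sketch of univariate and bivariate summaries (the small data frame is a hypothetical stand-in for student-collected data):

# Hypothetical data: recall scores under two study conditions
study <- data.frame(
  condition = c("music", "music", "music", "silence", "silence", "silence"),
  recall    = c(12, 9, 14, 15, 13, 17)
)

# Univariate summaries: a frequency table and a histogram
table(study$condition)
hist(study$recall)

# Bivariate: does recall vary by condition? (word equation: recall = condition + other stuff)
boxplot(recall ~ condition, data = study)
tapply(study$recall, study$condition, mean)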
Unit 2 Assignment: Practicing Data Science
Music Experiment. The authentic practice of Data Science starts with a question. Students will explore the question “Does music help us memorize content better?” not just with their opinions but by designing and fully conducting a simple experiment. They will randomly assign classmates to try to memorize a passage either with or without music playing through their headphones, and participants must then recall as much as they can from the passage. Students have to go through the process of organizing data collection, deciding what to do with the data they have collected, and turning that data into a data frame with rows and columns. They then consider other relevant information that may need to be collected (gender, previous knowledge, etc.) and explore their results while trying to answer the bigger question: does music help? This activity is designed to reinforce the difference between experimental and observational design, and between categorical and quantitative measurements, while also providing practice interpreting and describing distributions in the context of an authentic question.
Engineering a Better Gummy Bear Launcher. In this study, we integrate science and engineering with Data Science as students start with a basic popsicle-stick-and-rubber-band gummy bear launcher and collect data on the effectiveness of the design. Students come to appreciate that there is a lot of variation in their launcher’s performance even when nothing is changed about the design. They also have the opportunity to modify their designs (e.g., adding another rubber band, adding another popsicle stick to the lever) and collect data to see if their change made an improvement. Their goal is to examine, with data, whether their modified design is truly an improvement.
Unit 2 Recommended Focus Standards
Common Core State Standards: S-ID 1, S-ID 3, S-ID 5, S-ID 6, S-IC 3, S-IC 6, S-MD 6, S-MD 7
AP Statistics Standards: 10.0, 14.0
Unit 2 Recommended Focus Mathematical Practices
2. Reason abstractly and quantitatively.
5. Use appropriate tools strategically.
Unit 3: Modeling Variation - The Empty Model
This unit will introduce the General Linear Model (GLM) and the notation of the empty model in the form DATA = MODEL + ERROR, with a focus on developing an understanding of the concepts of model and error and of how to partition error. Students will learn how to fit models, estimate parameters, and interpret estimates. Building upon these concepts, they will use models to generate predictions and probabilities: using the normal distribution to model variation, making point predictions based on parameter estimates, and making likelihood predictions based on probability distributions.
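A minimal sketch of fitting the empty model with base R’s lm(); the outcome variable Thumb is a hypothetical placeholder, and the course’s own exercises may use additional helper functions:

# Hypothetical outcome variable: thumb lengths in cm
Thumb <- c(6.0, 5.5, 6.3, 5.9, 6.4, 5.8)

# The empty model of DATA = MODEL + ERROR: the model is a single number
empty_model <- lm(Thumb ~ NULL)     # equivalent to lm(Thumb ~ 1), an intercept-only model
coef(empty_model)                   # the one parameter estimate, b0, equals mean(Thumb)

# ERROR: each data point's residual from the model
resid(empty_model)
sum(resid(empty_model))             # residuals from the mean balance out to (about) zero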
Unit 3 Assignment: The Power of the Mean
Betting on Basketball. One of the major reasons why we build models is to make predictions. In this assignment, students examine different models they can build to make predictions about how many points (or other variables of interest) an NBA team might score in their next game. Students will build models based on the mean, median, mode; consider which data will be included in order to estimate the parameters of their model; and depict error (in the form of residuals) from the model. They can then examine new data to see how their model did against other models.
Unusual Measurements. Students find similarities and differences across several scenarios that help them discover how the mean is an unbiased model. Each scenario employs an unusual or unrefined measurement strategy and is meant to reveal a surprising or unexpected outcome when the mean is used as an estimate. The scenarios include measuring height in feet and guessing the weight of a cow. While the measurements in these scenarios may lack precision, the mean will still balance the residuals and come close to the estimate from a more sensitive scale of measurement.
Unit 3 Recommended Focus Standards
Common Core State Standards: S-ID 2, S-ID 3, S-ID 4, S-ID 6b, S-IC 6, S-MD 1
AP Statistics Standards: 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 11.0
Unit 3 Recommended Focus Mathematical Practices
4. Model with mathematics.
7. Look for and make use of structure.
Unit 4: Modeling Variation - The Complex Model
The skills and concepts learned in fitting the empty model will be expanded upon to specify complex models. Students will write models using GLM notation and interpret model components in context. They will learn to specify and distinguish models with categorical and with quantitative explanatory variables. This unit will also preface the upcoming units on model evaluation by introducing methods for assessing how well a model fits the data. These include using residuals to assess model fit, quantifying aggregate error around a model, comparing the fit of two models, and understanding and calculating different measures of effect size.
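A brief R sketch of the two kinds of complex models described above, one with a categorical and one with a quantitative explanatory variable (the toy data frame and variable names are hypothetical):

# Hypothetical data: an outcome plus one categorical and one quantitative predictor
dat <- data.frame(
  Thumb  = c(6.0, 5.5, 6.3, 5.9, 6.4, 5.8),
  Sex    = c("female", "male", "male", "female", "male", "female"),
  Height = c(62, 70, 71, 64, 69, 63)
)

# Group model: categorical explanatory variable
sex_model <- lm(Thumb ~ Sex, data = dat)
coef(sex_model)            # b0 = mean of the reference group; b1 = difference between groups

# Regression model: quantitative explanatory variable
height_model <- lm(Thumb ~ Height, data = dat)
coef(height_model)         # b0 = intercept; b1 = predicted change in Thumb per inch of Height

# Residuals quantify how far each observation falls from the model's prediction
resid(height_model)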
Unit 4 Assignment: Develop Predictive Models
Election Model. This assignment takes students through the arc of exploring variation and modeling variation. Students will be provided data from US states with a multitude of variables to explore, including the proportion of the population who are high school graduates, median income, percentage of smokers, percentage of Trump voters in 2016, etc. They will be charged with predicting how different states might vote in an upcoming election year. Students must select and argue for a particular explanatory variable to be included in their model, visually consider how much variation might be explained by this explanatory variable, and fit both a simpler and a more complex model to predict vote percentage for a particular party. Then they will consider the implications of their model predictions, and of the error around them, for an election.
Replicating Moneyball. In the book Moneyball by Michael Lewis, the story follows a team with little money that wishes to figure out which non-obvious variables predict baseball players’ outcomes (e.g., runs scored in a season). In this activity, students imagine they are the manager of a baseball team: what would be obvious explanatory variables that every team might be interested in? What would be less obvious explanatory variables that still do a great job of predicting runs? Students create both more obvious and less obvious predictive models and attempt to find quality players who would be undervalued by the more obvious models.
Unit 4 Recommended Focus Standards
Common Core State Standards: S-ID 6a, S-ID 6b, S-IC 5, S-IC 6, S-MD 7
AP Statistics Standards: 5.0, 6.0, 8.0, 10.0, 11.0, 12.0, 13.0
Unit 4 Recommended Focus Mathematical Practices
4. Model with mathematics.
8. Look for and express regularity in repeated reasoning.
Unit 5: Model Comparison with the F-Statistic and PRE
This unit will focus on understanding the F-statistic, using it to compare models, and interpreting the results of those comparisons. Students will compare and contrast the F-statistic and Proportional Reduction in Error (PRE), and will learn how to use the F-statistic to compare multiple groups. Note that in some traditions of statistics, PRE is referred to as R-squared.
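A minimal base-R sketch of such a model comparison; the toy data are hypothetical, and the course’s materials may present the same comparison with dedicated functions:

# Hypothetical data
dat <- data.frame(
  Thumb  = c(6.0, 5.5, 6.3, 5.9, 6.4, 5.8, 6.1, 5.7),
  Height = c(62, 65, 71, 64, 69, 63, 66, 68)
)

empty_model   <- lm(Thumb ~ NULL, data = dat)
complex_model <- lm(Thumb ~ Height, data = dat)

# PRE: the proportional reduction in error from adding Height to the model
SS_empty   <- sum(resid(empty_model)^2)
SS_complex <- sum(resid(complex_model)^2)
PRE <- (SS_empty - SS_complex) / SS_empty      # same value as R-squared for this model
PRE

# The F-statistic for the same comparison
anova(empty_model, complex_model)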
Unit 5 Assignment: Mid-point Project
This unit will include a mid-point project in which students will exercise all of the skills they have acquired over the first part of the course.
Ice Melting Experiments. Which of these substances added to ice makes it melt faster: sugar, salt, sand, or nothing? Students will conduct experiments with small ice cubes and gather data on how many mL of water have melted. They will gather their class data together and examine whether there is variation across the different conditions, and even within conditions. They will build models and consider whether the differences seen in their data could be due to sampling variation. Interested students can additionally test, with data, whether other substances, or the amount of substance added, affect how quickly ice melts.
Public Health Data. Students will receive different public health questions from a government agency or non-profit, such as “What leads to longer life expectancies in countries?” Working from a large real-world data set, they will select possible explanatory variables, build models, and conduct the appropriate model comparisons. They will then prepare a brief report and presentation answering the entity’s question, warning of possible misinterpretations, and making recommendations.
Unit 5 Recommended Focus Standards
Common Core State Standards: S-ID 6, S-ID 6a, S-ID 6b, S-ID 6c, S-ID 7, S-ID 8, S-ID 9, S-IC 5, S-IC 6, S-MD 7
AP Statistics Standards: 5.0, 6.0, 8.0, 10.0, 11.0, 12.0, 13.0
Unit 5 Recommended Focus Mathematical Practices
1. Make sense of problems and persevere in solving them.
3. Construct viable arguments and critique the reasoning of others.
Unit 6: Hypothesis Testing
Unit 6 introduces students to the problem of statistical inference and the concept of hypothesis testing within a modern modeling framework. Thus far, students have learned to fit both group models and simple regression models to data. But how accurate are the parameter estimates they calculate based on data? Unit 6 starts by exploring the idea that there is variation across different random samples of a given size; that the specific sample included in a data set is only one of many possible samples that could have been studied; and that some random samples may not accurately represent the population or Data Generating Process (DGP). These ideas are explored visually first, using simulations of random samples from the empty model (a.k.a. the null hypothesis in the Null Hypothesis Significance Testing, NHST, tradition). Students then fit models to these simulated data and compute parameter estimates or statistics for each simulated sample. Finally, students learn to construct a sampling distribution of these parameter estimates and use it to evaluate the empty model as a model of the DGP.
These simulated sampling distributions will be used as probability distributions to calculate the likelihood of a variety of parameter estimates (e.g., b1, PRE, F) under the empty (or null) model. Concepts such as alpha, p-value, standard error, and Type I and Type II error will be explored in the context of these simulated sampling distributions, and students will explore in depth what it means to reject, or fail to reject, the empty model. Students will also connect their simulated sampling distributions to the mathematical models of them (e.g., the t-distribution and F-distribution). The unit also includes discussions of the limits of the hypothesis testing approach. The approach builds up to what are traditionally called two-sample t-tests and ANOVAs, but arrives at these concepts through a mathematical modeling approach.
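A minimal sketch of the randomization (shuffling) idea in base R; the toy experiment, group labels, and sample sizes are hypothetical, and the course may use dedicated shuffling functions:

# Hypothetical experiment: recall scores under two conditions, 10 students each
condition <- rep(c("music", "silence"), each = 10)
recall    <- c(rnorm(10, mean = 12, sd = 3), rnorm(10, mean = 14, sd = 3))

# The sample estimate of b1: the difference between the two group means
b1_sample <- coef(lm(recall ~ condition))[2]

# Simulate the empty model: shuffle the condition labels many times and refit
b1_shuffled <- replicate(1000, coef(lm(recall ~ sample(condition)))[2])

# Where does the sample b1 fall in this simulated sampling distribution?
hist(b1_shuffled)
mean(abs(b1_shuffled) >= abs(b1_sample))    # a two-tailed p-value under the empty model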
Unit 6 Assignment: Hypothesis Testing
Open Science Framework (OSF) Replication: Students will be given experimental data available from the Open Science Framework (https://osf.io/). They can then replicate the analyses written about in the paper and also interpret the results in the context of the experiment’s hypothesis. For example, data from the now-classic experiment by Carol Dweck and colleagues, in which 5th graders were given different kinds of praise (either “you’re so smart” or “you must have tried very hard”) following a series of IQ test questions, are available to students. Students can read about the experiment and its hypothesis. Then they fit a model based on the hypothesis. Finally, they can use various inferential techniques (bootstrapping, randomization, or mathematical models) to evaluate the hypothesis.
Comparing simulation approaches to mathematical models of the sampling distribution: Given a data set (e.g., an experiment by Paul Piff, UCI, examining how feeling wealthy might affect ethical decision making), students will show that a variety of simulation approaches (e.g., randomization and bootstrapping) lead to similar conclusions as mathematical models of the sampling distribution based on the F-distribution and t-distribution. Students will conduct hypothesis tests using at least two different methods and show that they converge on similar p-values and conclusions.
Unit 6 Recommended Focus Standards
Common Core State Standards: S-IC 1, S-IC 2, S-IC 6, S-MD 2-4, S-MD 7
AP Statistics Standards: 4.0, 7.0, 9.0, 15.0, 16.0, 18.0
Unit 6 Recommended Focus Mathematical Practices
1. Make sense of problems and persevere in solving them.
2. Reason abstractly and quantitatively.
5. Use appropriate tools strategically.
Unit 7: Confidence Intervals, Bootstrapping, and Simulation
Having explored the logic and limitations of hypothesis testing, Unit 7 focuses on parameter estimation and confidence intervals, expanding the use of sampling distributions and the ways of constructing them. Students now extend their best-fitting models of data to estimating models of the DGP. Throughout, connections between hypothesis testing and parameter estimation are developed, and both are related to the overall framework of model comparison. At the beginning of the unit, students learn another technique for constructing sampling distributions: bootstrapping. They compare sampling distributions created using different methods and notice that, while the estimates of standard error are similar across methods, the differences can be understood in terms of the details of the Data Generating Process. Throughout, students discuss how choosing among statistical models and approaches depends on their purpose, in particular understanding vs. prediction.
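A minimal base-R sketch of bootstrapping a sampling distribution and reading off a 95% confidence interval; the sample values are hypothetical toy data:

# Hypothetical sample: thumb lengths in cm
Thumb <- c(6.0, 5.5, 6.3, 5.9, 6.4, 5.8, 6.1, 5.7, 6.2, 6.0)

# Bootstrap: resample the data with replacement and re-estimate b0 each time
boot_means <- replicate(1000, mean(sample(Thumb, replace = TRUE)))

# The bootstrapped sampling distribution and its standard error
hist(boot_means)
sd(boot_means)                          # an estimate of the standard error of the mean

# A 95% confidence interval for b0, the parameter of the empty model
quantile(boot_means, c(0.025, 0.975))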
Students also expand their repertoire of computational thinking skills to include simulation in addition to techniques such as randomization and bootstrapping, freeing them to engage in more flexible and wide-ranging hypothetical thinking. Using simulation, students do a deep dive into the factors that influence the shape and spread of sampling distributions, and learn about the Central Limit Theorem. Armed with this deepening understanding of sampling distributions, students learn about the difficult concept of power (i.e., probability of replication), and use simulation to inform the design of future experiments. Although the course so far has taken a frequentist approach to statistics, students now revisit the logic of inference through the eyes of Bayes, along the way developing a more sophisticated understanding of probability and of the interpretation of confidence intervals for a variety of parameters.
Unit 7 Assignments: Culminating Projects
This unit will culminate in several projects in which students will exercise all of the skills they have acquired over the year.
Making Money as a YouTuber: Students have a lot of theories about how social media influencers make their money. Students will gather data from YouTubers who have videos disclosing how much money they make on YouTube. They can hypothesize what variables would predict income (e.g., number of subscribers, number of views, type of channel) and create their preferred model. Then students can create sampling distributions to estimate the range of parameter values (confidence intervals) that could give rise to this data. Students will write about their model, their parameter estimates, and what these values mean for their hypothesis in a data analysis report (housed in a Jupyter notebook). Students will also consider what population of YouTube content creators their model applies to.
Power Analysis: In both Units 6 and 7, there is an emphasis on using simulation techniques to analyze data. In this assignment, students will use simulation techniques to figure out how much data they need to collect in order to detect some effect. Students will be presented with a recent finding that has generated some controversy. For example, one micro-economics study found that subjects assigned to be low in wealth (who were given fewer chances to win in repeated “Wheel of Fortune”-type word puzzle games) performed worse on a subsequent cognitive task than subjects who were in the high-wealth condition (and had many chances to win). The original study was done with 56 subjects, and the effect size of interest is the difference between the two groups. Students will simulate different effect sizes to figure out how much data should be collected in order to have an 80% chance of replicating the study. Students will code their simulation, annotate their code, and explain their reasoning.
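A minimal sketch of the kind of power simulation students might code, in base R; the assumed effect size, standard deviation, and sample sizes below are hypothetical placeholders, not values from the original study:

# Assume (hypothetically) that the low-wealth group scores 0.5 SD lower on the task
simulate_study <- function(n_per_group, effect = 0.5) {
  low  <- rnorm(n_per_group, mean = 0, sd = 1)
  high <- rnorm(n_per_group, mean = effect, sd = 1)
  t.test(high, low)$p.value < 0.05      # did this simulated study detect the effect?
}

# Power: the proportion of simulated studies that detect the assumed effect
power_at_n <- function(n_per_group) {
  mean(replicate(1000, simulate_study(n_per_group)))
}

power_at_n(28)    # roughly the scale of the original design (56 subjects total)
power_at_n(64)    # a larger design; increase n until power reaches about 0.80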
Unit 7 Recommended Focus Standards
Common Core State Standards: S-IC 4, S-IC 6, S-MD 1-4, S-MD 7
AP Statistics Standards: 9.0, 15.0, 16.0, 17.0
(Note: the remaining AP Statistics standards, 1.0, 2.0, 3.0, and 19.0, are covered in optional Jupyter Notebooks/lesson plans.)
Unit 7 Recommended Focus Mathematical Practices
3. Construct viable arguments and critique the reasoning of others.
4. Model with mathematics.
7. Look for and make use of structure.
8. Look for and express regularity in repeated reasoning.
Topic Outline
This outline describes the scope of the course but not necessarily the sequence.
Organizing concepts
Modeling: DATA = MODEL + ERROR
General linear model; GLM notation
Data analysis using R
Probability
Law of large numbers
Sampling with and without replacement
Contingency tables
Probability under mathematical distributions
Research methods
Sampling
Measurement: categorical v. quantitative variables
Organizing data
Research design: correlational v. experimental
Correlation, causality, and confounding
Visualizations
Univariate visualizations: histograms, box plots, bar graphs
Bivariate visualizations: frequency tables, faceted histograms, scatterplots, bar graphs, box plots
Multivariate visualizations
Descriptive statistics
Summary statistics: center (mean, median, mode), shape (skew, normal, uniform, multimodal), spread (standard deviation, sums of squares, variance), five number summary, regression model, correlation coefficient
Quantitative and categorical predictors; quantitative outcomes
Z scores
Modeling data with algebraic functions
Constant functions (e.g., the mean as a model)
Linear functions where X is a categorical variable (i.e., a group model)
Linear functions where X is a quantitative variable (i.e., a regression model)
Using functions to make predictions
Calculating and aggregating residuals from function predictions (e.g., sum of residuals, sum of squared residuals, variance, standard deviation, the correlation coefficient r, and Proportional Reduction in Error (PRE), also called R-squared)
Inferential statistics and sampling distributions
Mathematical distributions
Probability under mathematical distributions
Central limit theorem
normal/Z distribution, t distribution, F distribution
Computational techniques
Simulation
Bootstrapping
Randomization
Hypothesis testing
t-test
ANOVA
post hoc comparisons
Regression
Type I and Type II error
Concepts of power and effect size
Confidence intervals
Data Science and Statistics using R
Data frames (includes matrices and vectors)
Functions
Algebraic functions
Functions in computer programming
Transformations of data
Computational methods (e.g., simulation, randomization, bootstrapping)
Evaluating and fitting models to data
Interpreting and debugging code
IV. Instructional Methods and/or Strategies including Instructional Technology
This course is designed using the Four-Component Instructional Design Model (Van Merriënboer, Clark, & De Croock, 2002). It consists of a rotation of:
1. Learning Tasks that are authentic, real-life tasks and intentionally build from simple to complex
2. Supportive Information that provides cognitive strategies and mental models to help with both routine and non-routine aspects of the learning tasks
3. Procedural Information that provides just-in-time, step-by-step instruction to help learners become expert in recurring, more routine parts of the task
4. Part-Task Practice that, when necessary, provides extra practice for areas that require automaticity
All assessments and tasks will be aided by the use of computers and the programming language R in order to clean and analyze large data sets. Technology will also be used for the course textbook, formative assessments, and data visualization and presentation tools throughout the course.
V. Assessment Methods and/or Tools
Formative Assessments:
Reading
Over 1,200 formative reading assessments are embedded in the textbook, providing immediate self-checks before students come into class to participate in the learning tasks. Teachers also have immediate access to the results, so they can review common misconceptions or misunderstandings from the reading and address them in a timely manner.
Learning Tasks/ Assignments
Each learning task given in class will provide a different data set to be analyzed using the statistical strategies learned. The tasks build in complexity, with differing conclusions for each data set.
R
R itself provides immediate feedback by indicating to the learner whether or not their code runs. Students have unlimited opportunities to edit and revise.
Quizzes
About 6-9 in-class quizzes will be given per semester to assess the progress of students’ learning and to gauge areas for review.
Summative Assessment:
Culminating Project: At the end of each semester, students will be given several novel problems to choose from and solve using Data Science. This culminating project and presentation will demonstrate mastery of the course.
VI. Textbook(s) and Supplemental Instructional Materials
Textbook
Title: Statistics and Data Science: A Modeling Approach (ABC)
Author: Ji Y. Son, Ph.D., & James W. Stigler, Ph.D.
Publisher: An online interactive textbook published by CourseKata
Ed: 2023, version 5.0 (2018, version 1.0)
Website: coursekata.org
Primary? yes
Reference: Van Merriënboer, J. J., Clark, R. E., & De Croock, M. B. (2002). Blueprints for complex learning: The 4C/ID-model. Educational Technology Research and Development, 50(2), 39-61.