Descriptive Statistics

Jacoya Thompson, Jacob Mills, Shruti Researcher
Mathematics
7-9 class periods
AP Statistics
v5

Unit Overview

By the end of this unit, students will be able to display and describe quantitative data sets using Python. Students will compare distributions of data, explain how outliers affect measures of center and spread, and develop a deeper understanding of standard deviation. 

Standards

Computational Thinking in STEM
  • Data Practices
    • Analyzing Data
    • Manipulating Data
    • Visualizing Data
  • Modeling and Simulation Practices
    • Using Computational Models to Understand a Concept
  • Computational Problem Solving Practices
    • Computer Programming

Underlying Lessons

  • Lesson 1. Engage - Creating a budget
  • Lesson 2. Engage - Datasets in Python
  • Lesson 3. Explore - Standard Deviation Unplugged
  • Lesson 4. Explain - Using Python to Display and Analyze Data
  • Lesson 5. Evaluate & Elaborate - Comparing Distributions

Lesson 1. Engage - Creating a budget

Jacoya Thompson, Jacob Mills, Shruti Researcher
Mathematics
1 class period, 33 minutes
AP Statistics
v5

Lesson 1 Overview

In this activity, students are given a scenario and a list of 30 data points, and are asked to make decisions based on their prior knowledge of descriptive statistics. Students should use a measure of center (mean, median, mode) and a measure of spread (interquartile range, standard deviation) in their explanation. Students may also choose to provide a plot of the sample data (histogram, stem plot, dot plot, etc.).

Lesson 1 Activities

  • 1.1. Engage - Creating a budget

1.0. Student Directions and Resources


Suppose that you own a trucking company. Each month, you budget for fuel expenses but the price of gasoline fluctuates from day to day. You gather a sample of prices per gallon from 30 different gas stations in your area. Your goal is to decide how much to budget for fuel expenses (in terms of price per gallon). 

1.1. Engage - Creating a budget


Suppose that you own a trucking company. Each month, you budget for fuel expenses but the price of gasoline fluctuates from day to day. You gather a sample of prices per gallon from 30 different gas stations in your area. 

Gas Station (#) Price per gallon (USD)
1 3.21
2 3.42
3 3.33
4 3.41
5 3.09
6 3.16
7 3.17
8 3.00
9 3.11
10 2.98
11 3.78
12 3.56
13 3.67
14 3.44
15 3.64
16 3.50
17 3.44
18 3.19
19 3.25
20 3.37
21 3.39
22 3.41
23 3.40
24 3.46
25 3.39
26 3.49
27 3.71
28 3.26
29 3.31
30 3.38

Question 1.1.1

How much should you budget for gasoline (per gallon)? Explain your reasoning based on the data set above. Keep in mind that you want to budget enough money for fuel expenses, but not too much, since there are other expenses for your business. 
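For teachers who want to check student reasoning against the numbers, here is a minimal sketch that summarizes the 30 prices from the table above. It uses Python's standard `statistics` module, which is an assumption made for brevity; the unit's notebooks may use other tools. Quartile conventions vary, so the IQR computed here may differ slightly from a graphing calculator's.

```python
import statistics

# Prices per gallon (USD) from the 30 gas stations in the table above.
prices = [
    3.21, 3.42, 3.33, 3.41, 3.09, 3.16, 3.17, 3.00, 3.11, 2.98,
    3.78, 3.56, 3.67, 3.44, 3.64, 3.50, 3.44, 3.19, 3.25, 3.37,
    3.39, 3.41, 3.40, 3.46, 3.39, 3.49, 3.71, 3.26, 3.31, 3.38,
]

mean_price = statistics.mean(prices)      # measure of center
median_price = statistics.median(prices)  # measure of center, resistant to outliers
std_price = statistics.stdev(prices)      # sample standard deviation (divides by n - 1)

# Quartiles using the "inclusive" method; other methods give slightly different cut points.
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1                             # measure of spread for the middle half

print(f"mean   = {mean_price:.3f}")
print(f"median = {median_price:.3f}")
print(f"std    = {std_price:.3f}")
print(f"IQR    = {iqr:.3f}")
```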



Lesson 2. Engage - Datasets in Python

Jacoya Thompson, Jacob Mills, Shruti Researcher
Mathematics
1 class period, 42 minutes
AP Statistics
v6

Lesson 2 Overview

In this lesson, students will complete a pre-assessment using Python and Jupyter notebooks.

Given the output table for a data set, students interpret the mean and standard deviation in context.

Example table output:

  1. Display the table in Python (screenshot/code)

  2. Interpret the standard deviation for the given data set (text box)

  3. Interpret the mean for the given data set (text box)

  4. Change the labels (25%, 50%, 75%) to (Q1, median, Q3) within the Python code (screenshot/code)

Lesson 2 Activities

  • 2.1. Introduction to Python & Jupyter notebooks
  • 2.2. Engage with data sets in Python
  • 2.3. Display a descriptive table only with information for counts, mean, and standard deviation for the data set
  • 2.4. Add the correct quartiles (Q1, Median, Q3) labels to 25%, 50%, 75% listed in the table
  • 2.5. Plot a boxplot of the data set
  • 2.6. Plot a histogram of the data set

2.0. Student Directions and Resources


Students will familiarize themselves with Jupyter notebook and various Python commands and functions. They will be asked to complete a set of questions based on a table they create and modify.

2.1. Introduction to Python & Jupyter notebooks


Python is a popular programming language created by Guido van Rossum. We will use it to make plots and perform calculations on various data sets.

We will use Jupyter notebooks, a web-based platform that acts like a diary for Python code.

For the exercises, once you open the Jupyter notebook to explore your data, refer back to the GIF and video below to see the results for each line of code.

  


2.2. Engage with data sets in Python


Click on the link below to start the pre-assessment:

https://mybinder.org/v2/gh/CT-STEM/Descriptive-Statistics/1.0

To begin, click on "STD_Part1"

Please note that you will need to have two tabs open on your Chromebook - one for CT-STEM and one for Jupyter notebook.


Question 2.2.1

Running the code will display a descriptive statistics table for the data set. Take a screenshot of this output and upload the image below. 

Steps for taking a partial screenshot:



2.3. Display a descriptive table only with information for counts, mean, and standard deviation for the data set



Question 2.3.1

Change the table so that only "count", "mean", and "std" are displayed. Take a screenshot of the new table and upload the image below.
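One possible way to trim the table, sketched with pandas. The variable name `ds` follows the `pd.read_csv` line used later in this unit; the column name `price` and the ten values (the first ten prices from the Lesson 1 table) are stand-ins, since the notebook loads its own data set.

```python
import pandas as pd

# Stand-in data: first ten prices from the Lesson 1 table (the notebook uses its own file).
ds = pd.DataFrame(
    {"price": [3.21, 3.42, 3.33, 3.41, 3.09, 3.16, 3.17, 3.00, 3.11, 2.98]}
)

full_table = ds.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max

# Select only the three requested rows by their labels.
short_table = full_table.loc[["count", "mean", "std"]]
print(short_table)
```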



Question 2.3.2

Interpret the mean for this data set (Average US Gas Price).



Question 2.3.3

Interpret the standard deviation for this data set (Average US Gas Price).



2.4. Add the correct quartiles (Q1, Median, Q3) labels to 25%, 50%, 75% listed in the table



Question 2.4.1

Change the labels for 25%, 50%, and 75% to the names of those quartiles. Take a screenshot of the updated table and upload the image below.
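Relabeling the rows can be sketched with the `rename` method of a pandas DataFrame. As before, the column name `price` and the values (the first ten prices from the Lesson 1 table) are stand-ins for whatever the notebook actually loads.

```python
import pandas as pd

# Stand-in data: first ten prices from the Lesson 1 table.
ds = pd.DataFrame(
    {"price": [3.21, 3.42, 3.33, 3.41, 3.09, 3.16, 3.17, 3.00, 3.11, 2.98]}
)

# describe() labels the quartile rows "25%", "50%", "75%"; map them to their names.
table = ds.describe().rename(index={"25%": "Q1", "50%": "median", "75%": "Q3"})
print(table)
```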



Question 2.4.2

What does Q1 represent, in the context of this situation?



Question 2.4.3

What does the median represent, in the context of this situation?



Question 2.4.4

What does Q3 represent, in the context of this situation?



2.5. Plot a boxplot of the data set



Question 2.5.1

Plot a boxplot of this data set using Python commands and functions. Take a screenshot of the boxplot and upload the image below.
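A minimal boxplot sketch using matplotlib, which is assumed here and may not be the exact plotting call the notebook uses. The data are the first ten prices from the Lesson 1 table, used as a stand-in.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; inside a Jupyter notebook this line can be dropped
import matplotlib.pyplot as plt

# Stand-in data: first ten prices from the Lesson 1 table.
prices = [3.21, 3.42, 3.33, 3.41, 3.09, 3.16, 3.17, 3.00, 3.11, 2.98]

fig, ax = plt.subplots()
parts = ax.boxplot(prices)  # returns the drawn pieces: boxes, medians, whiskers, caps, fliers
ax.set_ylabel("Price per gallon (USD)")
ax.set_title("Gas prices (stand-in sample)")

# The line inside the box is drawn at the median of the data.
median_drawn = parts["medians"][0].get_ydata()[0]
print("median shown on plot:", median_drawn)
```

In a notebook, the figure appears below the cell once it runs.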



Question 2.5.2

What does the length of the "box" represent in the boxplot?



Question 2.5.3

What descriptive statistics are NOT shown on the boxplot?



2.6. Plot a histogram of the data set



Question 2.6.1

Plot a histogram of this data set using Python commands and functions. Take a screenshot of the histogram and upload the image below.
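A minimal histogram sketch, again assuming matplotlib. The 30 prices from the Lesson 1 table are used as a stand-in for the notebook's data set, and the choice of 6 bins is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; inside a Jupyter notebook this line can be dropped
import matplotlib.pyplot as plt

# Stand-in data: the 30 prices per gallon (USD) from the Lesson 1 table.
prices = [
    3.21, 3.42, 3.33, 3.41, 3.09, 3.16, 3.17, 3.00, 3.11, 2.98,
    3.78, 3.56, 3.67, 3.44, 3.64, 3.50, 3.44, 3.19, 3.25, 3.37,
    3.39, 3.41, 3.40, 3.46, 3.39, 3.49, 3.71, 3.26, 3.31, 3.38,
]

fig, ax = plt.subplots()
# hist returns the count in each bin, the bin edges, and the drawn bars.
counts, edges, bars = ax.hist(prices, bins=6, edgecolor="black")
ax.set_xlabel("Price per gallon (USD)")
ax.set_ylabel("Number of gas stations")

print("counts per bin:", counts)
```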



Question 2.6.2

Describe the distribution (shape, center, spread, outliers). 



Question 2.6.3

What descriptive statistics are NOT shown on the histogram?



Lesson 3. Explore - Standard Deviation Unplugged

Jacoya Thompson, Jacob Mills, Shruti Researcher
Mathematics
1 class period, 42 minutes
AP Statistics
v4

Lesson 3 Overview

In groups of 3-4, students will be presented with the formula for sample standard deviation. They will write their own pseudo code or procedure for calculating sample standard deviation from a given data set. Students will test out their code and compare their output to their graphing calculator output. If the outputs don't match, they must debug their code until it matches the calculator output. After writing the correct code, students will answer some follow-up questions that dive deeper into standard deviation.

Lesson 3 Activities

  • 3.1. Standard Deviation - Unplugged
  • 3.2. Using your model
  • 3.3. Evaluating your model
  • 3.4. What happens when we add data points?
  • 3.5. Why do we need to understand standard deviation?

3.0. Student Directions and Resources


In groups of 3-4, students will be presented with the formula for sample standard deviation. They will write their own pseudo code or procedure for calculating sample standard deviation from a given data set. Students will test out their code and compare their output to their graphing calculator output. If the outputs don't match, they must debug their code until it matches the calculator output. After writing the correct code, students will answer some follow-up questions that dive deeper into standard deviation.

3.1. Standard Deviation - Unplugged


Here is the formula for sample standard deviation, as it appears in our textbook and AP formula sheet:
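For reference, the sample standard deviation of n values x_1, ..., x_n with mean x-bar is conventionally written as follows. This is the standard textbook form; the notation on a particular formula sheet may differ slightly (for example, s versus s_x).

```latex
s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad \text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```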


Question 3.1.1

Using what you know about order of operations and previous math courses, look at the formula above and write out a procedure or "pseudo code" that will take a given data set (e.g., 5 numbers) and calculate sample standard deviation. 
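One possible translation of such a procedure into Python, offered as a sketch rather than the expected student answer. The function name `sample_std` is hypothetical; the steps follow the order of operations in the sample standard deviation formula, and the example list is the data set used in Question 3.2.1.

```python
import math

def sample_std(values):
    """Sample standard deviation, computed one step of the formula at a time."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n                            # 1. find the mean
    squared_devs = [(x - mean) ** 2 for x in values]  # 2. subtract the mean, 3. square
    variance = sum(squared_devs) / (n - 1)            # 4. add up, 5. divide by n - 1
    return math.sqrt(variance)                        # 6. take the square root

result = sample_std([3, 7, 9, 11, 15])
print(round(result, 4))  # about 4.4721, the square root of 20
```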



3.2. Using your model



Question 3.2.1

Use your procedure or "pseudo code" to find the sample standard deviation for this 5 number data set. Write your solution in the textbox below. Be sure to show all steps. 

3     7     9     11     15



3.3. Evaluating your model



Question 3.3.1

Grab your graphing calculator and enter the same data set into List 1 (L1). Next, go to STAT --> CALC --> 1-Var Stats to display descriptive statistics. The sample standard deviation will be listed under "Sx". Does this value match your answer to Question 3.2.1? If it does, move on to section 3.4. If not, rewrite your procedure or "pseudo code" in the text box below. As a reminder, your procedure must take a list of values and, using the formula, calculate the sample standard deviation. Make sure the output of your procedure matches the calculator output.

 

3     7     9     11     15



Question 3.3.2

If your procedure was incorrect the first time, please write what went wrong in the text box below. 



3.4. What happens when we add data points?



Question 3.4.1

If we added the value "1" to our set, what effect would it have on the sample standard deviation? Why?

(If you're not sure, use your procedure or graphing calculator and find the new standard deviation with this point added.)



Question 3.4.2

If we added the value "20" to our set, what effect would it have on the sample standard deviation? Why?

(If you're not sure, use your procedure or graphing calculator and find the new standard deviation with this point added.)



Question 3.4.3

If we added the value "9" to our set, what effect would it have on the sample standard deviation? Why?

(If you're not sure, use your procedure or graphing calculator and find the new standard deviation with this point added.)
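The three questions above can also be checked in code. This sketch uses `statistics.stdev` from Python's standard library (an assumption; a calculator or a student's own procedure works equally well) on the data set 3, 7, 9, 11, 15.

```python
import statistics

base = [3, 7, 9, 11, 15]
base_std = statistics.stdev(base)  # sample standard deviation of the original set

# Sample standard deviation after adding each new value to a copy of the set.
results = {value: statistics.stdev(base + [value]) for value in (1, 20, 9)}

for value, new_std in results.items():
    direction = "increases" if new_std > base_std else "decreases"
    print(f"add {value:>2}: s = {new_std:.3f} ({direction} from {base_std:.3f})")
```

Values far from the mean of 9 (here 1 and 20) raise the standard deviation, while a value equal to the mean lowers it, because it adds nothing to the sum of squared deviations but increases the divisor.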



3.5. Why do we need to understand standard deviation?


Recall the gas price problem from earlier in this unit. You owned a trucking company and were trying to figure out how much to budget for gasoline (per gallon). For the questions that follow, let's assume that the current mean US gas price is $3.43/gallon with a standard deviation of $0.13/gallon. 


Question 3.5.1

What would be an amount ($ per gallon) that would be way too high to budget for gasoline? Justify your answer. 



Question 3.5.2

What would be an amount ($ per gallon) that would be way too low to budget for gasoline? Justify your answer. 



Question 3.5.3

Make a prediction for the shape of the distribution of US gas prices. Justify your answer. 



Lesson 4. Explain - Using Python to Display and Analyze Data

Jacoya Thompson, Jacob Mills, Shruti Researcher
Mathematics
1 class period, 42 minutes
AP Statistics
v5

Lesson 4 Overview

In this lesson, students explore Jupyter notebook and various Python commands and functions. Students will choose a data set, plot histograms, change bin width, and explore shape and the effect that outliers have on distributions.

Lesson 4 Activities

  • 4.1. Choose Data Set & Plot Descriptive Table
  • 4.2. Plot Histogram
  • 4.3. Manipulating the Histogram
  • 4.4. Plot Boxplot
  • 4.5. Summarize

4.0. Student Directions and Resources


In this lesson, you will:

  • Explore Jupyter notebook and various Python commands and functions
  • Choose a data set, plot a histogram, change bin width, and explore the effect that outliers have on distributions
  • Add data points, remove data points, and observe the effect this has on measures of center and spread

Please note that you will need to have 2 tabs open for this lesson (one for Jupyter notebook and one for CT-STEM)

4.1. Choose Data Set & Plot Descriptive Table


Open a separate tab and paste this URL into the address bar:

https://mybinder.org/v2/gh/CT-STEM/Descriptive-Statistics/1.0

Then, click on "STD_Part2"

Please note that you will need to have TWO tabs open on your Chromebook - one for CT-STEM and one for Jupyter notebook.


Question 4.1.1

Which data set did you choose to display and analyze? You will enter the name of this .csv file into the Jupyter notebook.

ds = pd.read_csv('FILE NAME')

  daily_exercise.csv
  daily_phoneuse.csv
  daily_sleep.csv
  daily_studying.csv
  igfollowers.csv
  studentheight.csv
  studentweight.csv


Question 4.1.2

Write the code necessary to display descriptive statistics (count, mean, standard deviation, etc.). If you forgot how to display this information, you may need to open the link from yesterday's pre-assessment (https://mybinder.org/v2/gh/CT-STEM/Descriptive-Statistics/1.0).

Based on the relationship between the mean and median, make a prediction of the shape of this distribution. Explain your reasoning. 
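A sketch of loading a file and comparing the mean with the median. The file contents and the column name `hours` below are hypothetical (the real .csv files live in the Binder repository and their columns may differ), so the data are read from an in-memory string instead of a file.

```python
import io
import pandas as pd

# Hypothetical file contents standing in for a file such as daily_sleep.csv.
csv_text = "hours\n6.5\n7.0\n7.5\n8.0\n7.0\n6.0\n9.5\n7.5\n"

# In the notebook this would be: ds = pd.read_csv('FILE NAME')
ds = pd.read_csv(io.StringIO(csv_text))

table = ds.describe()
print(table)

mean = table.loc["mean", "hours"]
median = table.loc["50%", "hours"]  # the 50% row is the median

# Rule of thumb only: a histogram is still needed to confirm the shape.
if mean > median:
    hint = "possibly skewed right"
elif mean < median:
    hint = "possibly skewed left"
else:
    hint = "possibly symmetric"
print(f"mean = {mean}, median = {median}: {hint}")
```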



4.2. Plot Histogram



Question 4.2.1

Plot the histogram for the data set you chose, take a screenshot on your Chromebook, and upload the image below.

Steps for taking a partial screenshot:



Question 4.2.2

Describe the shape, center, spread and any possible outliers in the distribution (in context).



4.3. Manipulating the Histogram



Question 4.3.1

When you changed the number of bins to fewer than 8, what did you notice about the distribution? Did the shape change? Did you learn more or less about the data set? Write your observations below.



Question 4.3.2

When you changed the number of bins to more than 8, what did you notice about the distribution? Did the shape change? Did you learn more or less about the data set? Write your observations below. 



Question 4.3.3

What is an appropriate number of bins for this data set? Why?
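To see how the bin count changes the picture without drawing anything, the counts per bin can be printed directly. This sketch uses `numpy.histogram` on hypothetical phone-use data; matplotlib's `hist` accepts the same `bins` argument.

```python
import numpy as np

# Hypothetical daily phone use in hours (not the class data).
data = [1.5, 2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0, 4.0, 4.5, 5.0, 5.5, 6.0, 9.0]

# For each bin count, store the counts per bin and the bin edges.
histograms = {bins: np.histogram(data, bins=bins) for bins in (4, 8, 16)}

for bins, (counts, edges) in histograms.items():
    width = edges[1] - edges[0]  # all bins have equal width
    print(f"{bins:>2} bins (width {width:.2f}): {counts.tolist()}")
```

With few bins the gap before the largest value can disappear into a wide bar; with many bins most bars hold only zero, one, or two observations and the overall shape gets harder to read.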



4.4. Plot Boxplot



Question 4.4.1

Plot the boxplot for your data set, take a screenshot and upload the image below. 



Question 4.4.2

What information does the boxplot show that the histogram does not?



Question 4.4.3

What are the pros and cons of boxplots?



4.5. Summarize



Question 4.5.1

Summarize what you did in today's lesson.

What did you like? What did you dislike?



Lesson 5. Evaluate & Elaborate - Comparing Distributions

Jacoya Thompson, Jacob Mills, Shruti Researcher
Mathematics
1 class period, 42 minutes
AP Statistics
v4

Lesson 5 Overview

Students will compare distributions for two different data sets of their choosing (height, weight, Instagram followers, etc.).

Students will observe the effect of removing points on measures of center and spread. 

Lesson 5 Activities

  • 5.1. Choose & Display Data
  • 5.2. Plot Histogram and Boxplot for each class
  • 5.3. Effect of removing data points near the mean
  • 5.4. Generalize
  • 5.5. Challenge - Dotplots
  • 5.6. Summarize

5.0. Student Directions and Resources


In this lesson, you will analyze class data from Jupyter repositories.

Click on this link below:
https://mybinder.org/v2/gh/CT-STEM/Descriptive-Statistics/1.0

Then, click on "CT-STEM's repositories", and then click on "Descriptive-Statistics".

Please note that you will need to have 2 tabs open for this lesson (one for Jupyter notebook and one for CT-STEM)

 

5.1. Choose & Display Data


In our Jupyter binder, we have a number of different variables we collected from two classes of high school students:

1. Instagram Followers

2. Daily Phone Use

3. Number of photos on phone

4. Number of hours spent sleeping per day

5. Number of hours spent studying per day

6. Student heights

7. Student weights

You will find each of these stored in a CSV (comma-separated values) file in our notebook. Pick one of the variables to analyze.


Question 5.1.1

Choose a variable you would like to analyze for each class.

  Height
  Hours spent studying
  Sleep
  Exercise
  Phone Use
  Instagram Followers
  # of pictures on phone


Question 5.1.2

Run through the first few lines of code until you display descriptive tables for both class periods. (You may need to look at previous Jupyter notebook files if you forgot the commands for these functions.)

Compare the standard deviations of these distributions. What does this tell us about 1st period versus 2nd period, in context of the variable you chose?



5.2. Plot Histogram and Boxplot for each class



Question 5.2.1

Plot a histogram and boxplot for each class in the Jupyter notebook.

Fill in the table below. 



Question 5.2.2

Write a few sentences comparing the distributions, in context (shape, center, spread, outliers). 



5.3. Effect of removing data points near the mean



Question 5.3.1

You will remove 1 data point in each class that is CLOSEST to the mean. 

Display descriptive tables for each class's data. What effect did removal of this point have on the mean and standard deviation in each set?



Question 5.3.2

Explain WHY removing a point closest to the mean had the observed effect. 



Question 5.3.3

You will remove 1 data point in each class that is FURTHEST from the mean. 

Display descriptive tables for each class's data. What effect did removal of this point have on the mean and standard deviation in each set?



Question 5.3.4

Explain WHY removing a point furthest from the mean had the observed effect. 
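The removal experiment in this section can be sketched as follows. The helper name `remove_by_distance` and the list of heights are hypothetical; the notebook may remove points in a different way, and the class data will give different numbers.

```python
import statistics

def remove_by_distance(values, farthest=False):
    """Return a copy of values without the point closest to (or farthest from) the mean."""
    mean = statistics.mean(values)
    pick = max if farthest else min
    target = pick(values, key=lambda x: abs(x - mean))  # first such point if there is a tie
    remaining = list(values)
    remaining.remove(target)  # removes a single occurrence
    return remaining

heights = [60, 62, 65, 66, 67, 70, 75]  # hypothetical heights in inches

without_closest = remove_by_distance(heights)
without_farthest = remove_by_distance(heights, farthest=True)

for label, data in [("original", heights),
                    ("closest removed", without_closest),
                    ("farthest removed", without_farthest)]:
    print(f"{label:>16}: mean = {statistics.mean(data):.2f}, "
          f"s = {statistics.stdev(data):.2f}")
```

In this sample, dropping the point nearest the mean barely moves the mean but raises the standard deviation, while dropping the farthest point shifts the mean and lowers the standard deviation.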



5.4. Generalize



Question 5.4.1

Complete the following statement:

Removing a point closest to the mean will _______________ the value of the sample standard deviation.

  have no effect
  increase
  decrease


Question 5.4.2

Complete the following statement:

Removing a point furthest from the mean will ________________ the value of the sample standard deviation. 

  have no effect
  increase
  decrease


Question 5.4.3

Complete the following statement:

If the mean is greater than the median, the shape of the distribution is ________________

  skewed left
  skewed right
  approximately symmetric


Question 5.4.4

Complete the following statement:

If the mean is less than the median, the shape of the distribution is ________________.

  skewed left
  skewed right
  approximately symmetric


5.5. Challenge - Dotplots


Python has no built-in function for creating dot plots.

However, we can work around this by counting how often each value occurs (as a histogram does) and then plotting those counts as stacked points on a scatterplot.

The last few lines of code in this notebook will plot a dotplot for any data set in our Jupyter notebook, but modifications have to be made for the plot to display accurately.

Choose a variable to analyze (height, weight, Instagram followers, etc) and debug the code so that dotplots are displayed for 1st period and 2nd period. 
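The counting-then-scatter idea described above can be sketched as follows. This is not the notebook's code: the function name `dotplot_points` and the sleep data are hypothetical, and matplotlib is assumed as the plotting library.

```python
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # draw off-screen; inside a Jupyter notebook this line can be dropped
import matplotlib.pyplot as plt

def dotplot_points(values):
    """Return x and y coordinates that stack one dot per observation above its value."""
    xs, ys = [], []
    for value, count in sorted(Counter(values).items()):  # tally each distinct value
        for height in range(1, count + 1):                # stack dots at heights 1..count
            xs.append(value)
            ys.append(height)
    return xs, ys

sleep_hours = [6, 7, 7, 8, 8, 8, 9]  # hypothetical data
xs, ys = dotplot_points(sleep_hours)

fig, ax = plt.subplots()
ax.scatter(xs, ys)
ax.set_yticks(range(1, max(ys) + 1))  # whole-number counts on the vertical axis
ax.set_xlabel("Hours of sleep")
ax.set_ylabel("Count")
```

For measurements with many distinct decimal values, rounding to a common precision before counting keeps the dots in visible stacks.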


Question 5.5.1

What changes did you have to make in the Python code?



Question 5.5.2

Copy and paste the code used to create the plot.



5.6. Summarize



Question 5.6.1

Summarize what you did and learned from today's lesson.

What did you like? What did you dislike?