Statistics

KS4

MA-KS4-D006

Collecting, interpreting and comparing data using graphical representations, measures of average and spread, sampling methods, and correlation.

National Curriculum context

Statistics at KS4 develops pupils' statistical literacy: the ability to collect, represent, interpret and evaluate data in order to answer questions, make decisions and identify bias. Pupils work with a wider range of graphical representations than at KS3, including cumulative frequency graphs, box plots and histograms with unequal class widths, and apply formal measures of spread including interquartile range. The curriculum requires pupils to compare distributions using summary statistics and graphical displays, to understand correlation and use it to make predictions (while recognising the distinction from causation), and to critique statistical claims and the quality of data collection methods. Sampling, including random, systematic and stratified methods, is introduced so that pupils understand the relationship between sample and population. Higher tier pupils additionally estimate the mean from grouped frequency tables and interpolate the median from cumulative frequency.

Concepts: 4
Clusters: 3
Prerequisites: 2
With difficulty levels: 4

AI Direct: 2
AI Facilitated: 2

Lesson Clusters

1

Understand sampling methods and collect data appropriately

introduction Curated

Data collection and sampling (random, systematic, stratified; bias identification) is the foundational GCSE statistics concept — understanding how data is gathered before it can be analysed.

1 concept Evidence and Argument
2

Construct and interpret statistical diagrams and calculate averages

practice Curated

Statistical diagrams (histograms, cumulative frequency, box plots) and averages/measures of spread (mean, median, mode, IQR) are the core GCSE statistical analysis cluster.

2 concepts Patterns
3

Describe and interpret bivariate data using scatter graphs and correlation

practice Curated

Scatter graphs, correlation (positive/negative/none) and lines of best fit represent the bivariate data and relationship strand, distinct from univariate distribution analysis.

1 concept Patterns

Prerequisites

Concepts from other domains that pupils should know before this domain.

Concepts (4)

Data Collection and Sampling

knowledge AI Direct

MA-KS4-C030

Understanding and applying sampling methods (random, systematic, stratified); identifying sources of bias; designing data collection methods including questionnaires and observation.

Teaching guidance

Use real-world contexts (election polling, quality control, medical trials) to motivate proper sampling. The distinction between population and sample should be made explicit from the beginning. Stratified sampling requires proportional allocation — pupils should calculate sample sizes from each stratum using the population ratio. Questionnaire design flaws (leading questions, ambiguous response categories) are worth analysing critically.

Vocabulary: population, sample, random sample, systematic sampling, stratified sampling, bias, questionnaire, census, strata, representative sample, hypothesis, data collection

Common misconceptions

Pupils confuse stratified and systematic sampling: stratified sampling draws from each group in proportion to its size, while systematic sampling takes every nth member of a list. Sample bias is difficult to identify without seeing the sampling process; pupils tend to assess bias only from the data. Many pupils believe a larger sample automatically eliminates bias, not recognising that a biased method stays biased at any scale and that a larger sample only makes the skewed result look more trustworthy.
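The contrast between systematic and stratified selection can be sketched in Python. This is a minimal illustration, assuming a hypothetical register of 60 pupils in two year groups; the function names, the pupil labels and the 10% sampling fraction are all invented for the example.

```python
import random


def systematic_sample(members, step, start=0):
    """Take every `step`-th member of the list, beginning at index `start`."""
    return members[start::step]


def stratified_sample(strata, fraction, rng):
    """Randomly take the same fraction from every stratum (group)."""
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical register: 40 pupils in Year 7 and 20 in Year 8.
strata = {
    "Year 7": [f"Y7-{i}" for i in range(40)],
    "Year 8": [f"Y8-{i}" for i in range(20)],
}
register = strata["Year 7"] + strata["Year 8"]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
every_sixth = systematic_sample(register, step=6)
by_year = stratified_sample(strata, fraction=0.1, rng=rng)

print(len(every_sixth))                            # 10
print({year: len(s) for year, s in by_year.items()})  # {'Year 7': 4, 'Year 8': 2}
```

The systematic sample depends only on position in the list, whereas the stratified sample is guaranteed to contain each year group in proportion to its size.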

Difficulty levels

Emerging

Understands the difference between a population and a sample, and can identify potential sources of bias in data collection.

Example task

A school wants to find out students' favourite lunch option. They survey 30 students from the football team. Explain why this sample may be biased.

Model response: The football team may not be representative of the whole school — they might prefer higher-calorie meals. A better sample would be a random selection from all year groups.

Developing

Selects and applies appropriate sampling methods (random, systematic, stratified) and designs data collection tools including questionnaires.

Example task

A school has 600 students: 200 in Year 7, 180 in Year 8, 120 in Year 9, 100 in Year 10. A stratified sample of 50 is needed. How many from each year?

Model response: Year 7: (200/600) x 50 = 16.7, round to 17. Year 8: (180/600) x 50 = 15. Year 9: (120/600) x 50 = 10. Year 10: (100/600) x 50 = 8.3, round to 8. Total: 17 + 15 + 10 + 8 = 50.
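The proportional allocation above can be checked with a short Python sketch. The function name `stratified_allocation` is hypothetical, and simple rounding is an assumption: with other figures the rounded values may not add up to the required sample size, so the total should always be checked.

```python
def stratified_allocation(stratum_sizes, sample_size):
    """Return exact and rounded sample sizes, proportional to each stratum."""
    total = sum(stratum_sizes.values())
    exact = {
        name: size * sample_size / total
        for name, size in stratum_sizes.items()
    }
    # Note: Python's round() sends exact halves to the nearest even number.
    rounded = {name: round(value) for name, value in exact.items()}
    return exact, rounded


year_groups = {"Year 7": 200, "Year 8": 180, "Year 9": 120, "Year 10": 100}
exact, rounded = stratified_allocation(year_groups, 50)

print(rounded)                # {'Year 7': 17, 'Year 8': 15, 'Year 9': 10, 'Year 10': 8}
print(sum(rounded.values()))  # 50
```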

Secure

Evaluates sampling methods, identifies limitations of data collection, and designs strategies to minimise bias and improve validity.

Example task

A researcher wants to find out how much time teenagers spend on social media. Compare using an online survey versus face-to-face interviews. Which is better?

Model response: Online survey: larger sample, cheaper, but self-selection bias (only those online will respond, and they may use social media more). Respondents may also exaggerate or underestimate. Face-to-face: more accurate responses (can clarify questions), but smaller sample, more expensive, and social desirability bias (respondents may understate usage). A better approach might be combining a stratified random online survey with follow-up interviews for a subsample.

Mastery

Designs complete statistical investigations with appropriate hypotheses, sampling strategies, and evaluates the reliability and validity of conclusions drawn from data.

Example task

Design a statistical investigation to test whether students who eat breakfast perform better in morning tests. Include hypothesis, sampling, data collection and potential confounders.

Model response: Hypothesis: Students who eat breakfast score higher on morning tests than those who do not. Sampling: Stratified random sample across year groups, aiming for 100+ students to ensure adequate power. Data collection: Anonymous questionnaire on breakfast habits (eaten/not eaten, type), plus morning test scores from school records. Confounders: sleep quality, prior attainment, socioeconomic status (linked to both breakfast and attainment). These should be recorded and controlled for in analysis. Limitation: This is observational, not experimental — correlation does not imply causation. Students who eat breakfast may also have other advantages.

Delivery rationale

Secondary maths concept — abstract, procedural, and objectively assessable.

Statistical Diagrams

process AI Facilitated

MA-KS4-C031

Constructing and interpreting bar charts, pie charts, histograms (equal and unequal class widths), cumulative frequency graphs, and box plots.

Teaching guidance

Frequency density = frequency / class width is the key concept for histograms with unequal class widths — pupils must understand that it is the area, not the height, of each bar that represents frequency. Cumulative frequency graphs should always be plotted at the upper class boundary. Box plots represent the five-number summary (min, Q1, median, Q3, max) and allow distribution comparison visually.

Vocabulary: histogram, frequency density, class width, cumulative frequency, box plot, quartile, median, range, interquartile range, pie chart, bar chart, frequency polygon, upper class boundary

Common misconceptions

Pupils draw histograms with equal-width bars and label the y-axis 'frequency' even when class widths are unequal, because frequency density is not intuitive. Cumulative frequency is plotted at midpoints rather than upper class boundaries. Box plots are confused with bar charts; the thickness of the box (measured perpendicular to the scale) has no meaning, only the positions of the lines along the scale matter.

Difficulty levels

Emerging

Constructs and interprets bar charts, pie charts and pictograms accurately, including reading scales and comparing categories.

Example task

A pie chart shows exam results: A* = 36 degrees, A = 72 degrees, B = 108 degrees, C = 90 degrees, D = 54 degrees. There are 200 students. How many got a B?

Model response: B sector = 108 degrees. Fraction = 108/360 = 3/10. Number of students = 3/10 x 200 = 60.

Developing

Constructs and interprets frequency polygons, cumulative frequency graphs (plotting at upper class boundaries), and reads quartiles from cumulative frequency.

Example task

Draw a cumulative frequency graph from: 0-10 (freq 4), 10-20 (freq 12), 20-30 (freq 18), 30-40 (freq 6). Estimate the median.

Model response: Cumulative frequencies: 4, 16, 34, 40. Plot at upper boundaries: (10, 4), (20, 16), (30, 34), (40, 40), starting the graph from (0, 0). Total = 40, so the median is at the 20th value. Reading across from 20 on the cumulative frequency axis gives approximately 22.
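Reading a value from a cumulative frequency graph can be imitated with linear interpolation between the plotted points. The sketch below assumes straight-line segments and a starting point of (0, 0); a smooth hand-drawn curve may give a slightly different reading. The helper name `read_across` is hypothetical.

```python
from itertools import accumulate

upper_bounds = [10, 20, 30, 40]
frequencies = [4, 12, 18, 6]

cumulative = list(accumulate(frequencies))  # [4, 16, 34, 40]
# Points are plotted at upper class boundaries, starting from (0, 0).
points = [(0, 0)] + list(zip(upper_bounds, cumulative))


def read_across(points, target):
    """Find the x-value where the joined-up points reach `target`."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target is outside the plotted range")


median_estimate = read_across(points, cumulative[-1] / 2)
print(round(median_estimate, 1))  # 22.2
```

The same helper could be called with a quarter and three quarters of the total to estimate Q1 and Q3.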

Secure

Constructs and interprets histograms with unequal class widths using frequency density, and compares distributions using box plots.

Example task

A histogram has these bars: class 0-5 with frequency density 2, class 5-15 with frequency density 3.5, class 15-20 with frequency density 4. Find the frequency for each class.

Model response: Frequency = frequency density x class width. Class 0-5: 2 x 5 = 10. Class 5-15: 3.5 x 10 = 35. Class 15-20: 4 x 5 = 20. Total = 65.
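As a quick check of the arithmetic, the relationship frequency = frequency density x class width can be written as a few lines of Python. The data structure (a list of class boundaries paired with a density) is an illustrative choice only.

```python
# Each entry: ((lower bound, upper bound), frequency density)
bars = [((0, 5), 2), ((5, 15), 3.5), ((15, 20), 4)]

# Frequency is the area of each bar: class width x frequency density.
frequencies = [(upper - lower) * density for (lower, upper), density in bars]

print(frequencies)       # [10, 35.0, 20]
print(sum(frequencies))  # 65.0
```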

Mastery

Interprets and compares complex statistical diagrams critically, including population pyramids, misleading graphs, and composite representations. Constructs histograms from raw grouped data and uses area to estimate probabilities.

Example task

A histogram shows journey times. The class 10-20 has frequency density 3.2 and the class 20-25 has frequency density 4.0. Estimate the probability that a randomly chosen journey takes between 15 and 25 minutes.

Model response: Assume journeys are spread evenly within each class. Area for 15-20 (half of the 10-20 bar): 3.2 x 5 = 16. Area for 20-25: 4.0 x 5 = 20. Estimated frequency for 15-25: 16 + 20 = 36. The total frequency (the sum of all bar areas) is needed to compute the probability. If total = 120, then P(15 to 25) = 36/120 = 0.3. The key insight is that area represents frequency in a histogram, so we can estimate probabilities from areas.
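The area argument generalises to any interval. The Python sketch below assumes values are spread evenly within each class, and it uses the illustrative total of 120 from the model response; the function name `frequency_between` is hypothetical.

```python
def frequency_between(bars, low, high):
    """Estimate the frequency between `low` and `high` from histogram areas."""
    total = 0.0
    for (start, end), density in bars:
        # Width of the part of this bar that lies inside [low, high].
        overlap = min(end, high) - max(start, low)
        if overlap > 0:
            total += overlap * density
    return total


# Only the two bars given in the task: ((lower, upper), frequency density)
bars = [((10, 20), 3.2), ((20, 25), 4.0)]

estimated_frequency = frequency_between(bars, 15, 25)
total_frequency = 120  # illustrative total, as in the model response
probability = estimated_frequency / total_frequency

print(estimated_frequency)    # 36.0
print(round(probability, 2))  # 0.3
```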

Delivery rationale

Secondary maths process concept — problem-solving benefits from structured AI delivery with facilitator for extended reasoning.

Averages and Measures of Spread

process AI Facilitated

MA-KS4-C032

Calculating mean, median, mode and range for ungrouped data; estimating mean from grouped frequency tables; finding quartiles and interquartile range from cumulative frequency.

Teaching guidance

Mean from grouped data requires finding midpoints and computing Σfm/Σf — the mean is an estimate because we do not know exact values within classes. Interquartile range (Q3 − Q1) measures spread while being resistant to outliers, unlike range — this statistical property is worth emphasising. Pupils should understand when each average is appropriate: mode for categorical data, median for skewed distributions, mean for symmetric data.

Vocabulary: mean, median, mode, range, interquartile range, quartile, grouped data, midpoint, frequency table, skew, outlier, average, measure of central tendency, spread

Common misconceptions

The mean from grouped data is always an estimate, but pupils often state it as an exact value. The median from a frequency table is the data value at the middle position, located using cumulative frequency; pupils instead take the middle row of the table or quote half the total frequency as the median itself. Interquartile range is sometimes computed as Q2 − Q1 (not Q3 − Q1) by pupils who confuse quartile numbering.

Difficulty levels

Emerging

Calculates mean, median, mode and range for small ungrouped data sets and knows what each measure represents.

Example task

Find the mean, median and mode of: 3, 5, 5, 6, 8, 9, 12.

Model response: Mean = (3+5+5+6+8+9+12)/7 = 48/7 = 6.86 (2 d.p.). Median = 6 (middle of 7 ordered values). Mode = 5 (appears twice).
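These values can be confirmed with Python's standard `statistics` module. This is a checking aid, not a replacement for the written method; the range is added here because the difficulty descriptor mentions it, although the task does not ask for it.

```python
from statistics import mean, median, mode

data = [3, 5, 5, 6, 8, 9, 12]

print(round(mean(data), 2))   # 6.86
print(median(data))           # 6
print(mode(data))             # 5
print(max(data) - min(data))  # 9 (range)
```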

Developing

Calculates the mean from a frequency table using sum of fx / sum of f, and chooses the most appropriate average for a given data set.

Example task

Find the mean from: Score 1 (freq 5), Score 2 (freq 8), Score 3 (freq 12), Score 4 (freq 3), Score 5 (freq 2).

Model response: Sum of fx = 1(5) + 2(8) + 3(12) + 4(3) + 5(2) = 5 + 16 + 36 + 12 + 10 = 79. Sum of f = 30. Mean = 79/30 = 2.63 (2 d.p.).
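The sum of fx / sum of f calculation maps directly onto a short Python sketch. Storing the table as a dictionary of score to frequency is an illustrative choice.

```python
# Frequency table from the task: score -> frequency
table = {1: 5, 2: 8, 3: 12, 4: 3, 5: 2}

sum_fx = sum(score * freq for score, freq in table.items())
sum_f = sum(table.values())
mean_score = sum_fx / sum_f

print(sum_fx, sum_f)         # 79 30
print(round(mean_score, 2))  # 2.63
```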

Secure

Estimates the mean from a grouped frequency table using midpoints, finds the modal class, and identifies the class containing the median. Understands why these are estimates.

Example task

Estimate the mean from: 0 < x <= 10 (freq 6), 10 < x <= 20 (freq 14), 20 < x <= 30 (freq 8), 30 < x <= 50 (freq 2).

Model response: Midpoints: 5, 15, 25, 40. Sum of fm = 6(5) + 14(15) + 8(25) + 2(40) = 30 + 210 + 200 + 80 = 520. Sum of f = 30. Estimated mean = 520/30 = 17.3 (1 d.p.). This is an estimate because we assume all values in each class equal the midpoint.
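The midpoint method can be sketched in Python as follows. Each class is stored as (lower bound, upper bound, frequency), which is an assumed layout for illustration; the result is an estimate for the reason given in the model response.

```python
# Grouped table from the task: (lower bound, upper bound, frequency)
classes = [(0, 10, 6), (10, 20, 14), (20, 30, 8), (30, 50, 2)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
sum_fm = sum(mid * freq for mid, (_, _, freq) in zip(midpoints, classes))
sum_f = sum(freq for _, _, freq in classes)
estimated_mean = sum_fm / sum_f

print(midpoints)                 # [5.0, 15.0, 25.0, 40.0]
print(sum_fm, sum_f)             # 520.0 30
print(round(estimated_mean, 1))  # 17.3
```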

Mastery

Uses cumulative frequency to estimate quartiles, interquartile range and percentiles. Compares distributions using summary statistics and evaluates which measures are most appropriate.

Example task

Two factories produce bolts. Factory A: mean diameter 10.02 mm, IQR 0.03 mm. Factory B: mean diameter 10.00 mm, IQR 0.12 mm. The target is 10.00 mm. Which factory is more reliable?

Model response: Factory B has the mean closer to target (10.00 vs 10.02) but much larger variability (IQR 0.12 vs 0.03). Factory A is slightly off-target but very consistent: the middle 50% of its bolts lie within 0.03 mm of each other. Factory B is centred correctly but produces bolts with wide variation, so some could be far from target. Factory A is more reliable for quality control because consistency matters more than being exactly centred (which can be recalibrated). The small systematic error in A is preferable to the large random error in B.

Delivery rationale

Secondary maths process concept — problem-solving benefits from structured AI delivery with facilitator for extended reasoning.

Scatter Graphs and Correlation

knowledge AI Direct

MA-KS4-C033

Plotting and interpreting scatter graphs of bivariate data; describing correlation (positive, negative, none); drawing and using lines of best fit; distinguishing correlation from causation.

Teaching guidance

Require pupils to describe correlation in context, not just label it positive/negative — 'as height increases, weight tends to increase' is more statistically meaningful than 'positive correlation'. Lines of best fit should pass through the mean point (x̄, ȳ) and pupils should use this property to check their line. The causation-correlation distinction is critical for statistical literacy and should be reinforced with counter-intuitive examples.

Vocabulary: scatter graph, bivariate, correlation, positive correlation, negative correlation, no correlation, line of best fit, outlier, interpolation, extrapolation, causation, association

Common misconceptions

Pupils believe strong correlation implies causation — the most important misconception in statistics. Lines of best fit are frequently drawn from (0,0) or through the most extreme points rather than balancing the data. Extrapolation beyond the data range is treated as equally reliable as interpolation within it.

Difficulty levels

Emerging

Plots scatter graphs from paired data and describes the overall trend informally (going up, going down, no pattern).

Example task

Plot these (height in cm, weight in kg) points on a scatter graph: (150, 45), (155, 50), (160, 55), (165, 52), (170, 60), (175, 58), (180, 65). Describe the trend.

Model response: The points show an upward trend: as height increases, weight tends to increase. The relationship is not perfect — not all points lie on a straight line.
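A line drawn by eye can be checked against the mean point of the data. The Python sketch below computes the mean point and, as a comparison only, a least-squares line. Least squares is a swapped-in technique that goes beyond drawing by eye, and the variable names follow the height and weight reading used in the model response.

```python
from statistics import mean

heights = [150, 155, 160, 165, 170, 175, 180]
weights = [45, 50, 55, 52, 60, 58, 65]

# The mean point: a well-balanced line of best fit should pass through it.
x_bar = mean(heights)
y_bar = mean(weights)

# Least-squares gradient and intercept (a check, not the by-eye method).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
s_xx = sum((x - x_bar) ** 2 for x in heights)
gradient = s_xy / s_xx
intercept = y_bar - gradient * x_bar

print(x_bar, y_bar)         # 165 55
print(round(gradient, 3))   # 0.579
print(round(intercept, 2))  # -40.46
```

A positive gradient matches the upward trend described, and by construction this line passes through the mean point (165, 55).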

Developing

Identifies and describes the type and strength of correlation (strong/weak positive, strong/weak negative, none) and draws a line of best fit by eye.

Example task

A scatter graph shows that as the age of a car increases, its value decreases. There is a clear downward trend with points close to a line. Describe the correlation.

Model response: There is a strong negative correlation: as the age of the car increases, its value tends to decrease. A line of best fit would slope downward. The strong correlation means the points are close to the line, so age is a good predictor of value.

Secure

Uses a line of best fit to make predictions, distinguishing between interpolation (reliable, within data range) and extrapolation (unreliable, outside data range).

Example task

A line of best fit for height (cm) vs shoe size has equation y = 0.1x - 7, valid for heights 150-190 cm. Estimate the shoe size for (a) height 170 cm, (b) height 210 cm.

Model response: (a) y = 0.1(170) - 7 = 17 - 7 = size 10. This is interpolation (170 is within the data range 150-190) so the estimate is reliable. (b) y = 0.1(210) - 7 = 21 - 7 = size 14. This is extrapolation (210 is outside the data range) so the estimate is unreliable — the linear relationship may not hold for very tall people.
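The distinction between interpolation and extrapolation amounts to a range check before the equation is used. A minimal Python sketch follows, using the equation and valid range from the task; the function name `predict_shoe_size` is hypothetical.

```python
def predict_shoe_size(height_cm, valid_range=(150, 190)):
    """Apply y = 0.1x - 7 and say whether x lies inside the data range."""
    estimate = 0.1 * height_cm - 7
    low, high = valid_range
    kind = "interpolation" if low <= height_cm <= high else "extrapolation"
    return estimate, kind


for height in (170, 210):
    size, kind = predict_shoe_size(height)
    print(f"{height} cm -> size {size:.0f} ({kind})")
# 170 cm -> size 10 (interpolation)
# 210 cm -> size 14 (extrapolation)
```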

Mastery

Critically evaluates bivariate data, distinguishes correlation from causation, identifies lurking variables, and interprets the equation and gradient of a line of best fit in context.

Example task

A study finds a strong positive correlation between ice cream sales and drowning incidents. A newspaper headline says 'Ice cream causes drowning'. Evaluate this claim.

Model response: The claim is false — it confuses correlation with causation. The lurking (confounding) variable is temperature/season: hot weather causes both increased ice cream sales and increased swimming (leading to more drowning incidents). The two variables are associated because they share a common cause, not because one causes the other. To establish causation, you would need a controlled experiment (impossible here for ethical reasons). This is a classic example of a spurious correlation driven by a confounding variable.

Delivery rationale

Secondary maths concept — abstract, procedural, and objectively assessable.