Statistics
KS4 · MA-KS4-D006
Collecting, interpreting and comparing data using graphical representations, measures of average and spread, sampling methods, and correlation.
National Curriculum context
Statistics at KS4 develops pupils' statistical literacy: the ability to collect, represent, interpret and evaluate data in order to answer questions, make decisions and identify bias. Pupils work with a wider range of graphical representations than at KS3, including cumulative frequency graphs, box plots and histograms with unequal class widths, and apply formal measures of spread including interquartile range. The curriculum requires pupils to compare distributions using summary statistics and graphical displays, to understand correlation and use it to make predictions (while recognising the distinction from causation), and to critique statistical claims and the quality of data collection methods. Sampling (random, systematic and stratified methods) is introduced so that pupils understand the relationship between sample and population. Higher tier pupils additionally estimate the mean from grouped frequency tables and interpolate the median from cumulative frequency.
4 Concepts · 3 Clusters · 2 Prerequisites · 4 With difficulty levels
Lesson Clusters
Understand sampling methods and collect data appropriately
introduction · Curated
Data collection and sampling (random, systematic, stratified; bias identification) is the foundational GCSE statistics concept: pupils need to understand how data is gathered before it can be analysed.
Construct and interpret statistical diagrams and calculate averages
practice · Curated
Statistical diagrams (histograms, cumulative frequency, box plots) and averages/measures of spread (mean, median, mode, IQR) are the core GCSE statistical analysis cluster.
Describe and interpret bivariate data using scatter graphs and correlation
practice · Curated
Scatter graphs, correlation (positive/negative/none) and lines of best fit represent the bivariate data and relationship strand, distinct from univariate distribution analysis.
Prerequisites
Concepts from other domains that pupils should know before this domain.
Concepts (4)
Data Collection and Sampling
knowledge · AI Direct · MA-KS4-C030
Understanding and applying sampling methods (random, systematic, stratified); identifying sources of bias; designing data collection methods including questionnaires and observation.
Teaching guidance
Use real-world contexts (election polling, quality control, medical trials) to motivate proper sampling. The distinction between population and sample should be made explicit from the beginning. Stratified sampling requires proportional allocation — pupils should calculate sample sizes from each stratum using the population ratio. Questionnaire design flaws (leading questions, ambiguous response categories) are worth analysing critically.
Common misconceptions
Pupils confuse stratified and systematic sampling: a stratified sample is drawn from each group in proportion to its size, whereas a systematic sample takes every nth member of a list. Sample bias is difficult to identify without seeing the sampling process; pupils tend to assess bias only from the data. Many pupils believe a larger sample automatically eliminates bias, not recognising that a biased method stays biased at any scale: a larger sample only reproduces the same bias with greater apparent precision.
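The contrast between the three sampling methods can be sketched in code. This is an illustrative sketch only: the function names, the seeded random generator and the list-based populations are assumptions made for the example, not part of the curriculum text.

```python
import random


def simple_random_sample(population, k, seed=0):
    """Every member has an equal chance of selection."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return rng.sample(population, k)


def systematic_sample(population, k, seed=0):
    """Take every nth member of the list after a random starting point."""
    step = len(population) // k
    start = random.Random(seed).randrange(step)
    return population[start::step][:k]


def stratified_sample(strata, k, seed=0):
    """Sample each group in proportion to its share of the population.

    `strata` maps a group name to the list of its members. Rounding each
    share can make the total differ slightly from k for awkward sizes.
    """
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        share = round(len(members) * k / total)
        sample.extend(rng.sample(members, share))
    return sample
```

For a register of 100 pupils and a sample of 10, `systematic_sample` returns pupils spaced exactly 10 places apart, whereas `stratified_sample` needs the pupils grouped first (for example by year) so that each group contributes its proportional share.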
Difficulty levels
Understands the difference between a population and a sample, and can identify potential sources of bias in data collection.
Example task
A school wants to find out students' favourite lunch option. They survey 30 students from the football team. Explain why this sample may be biased.
Model response: The football team may not be representative of the whole school — they might prefer higher-calorie meals. A better sample would be a random selection from all year groups.
Selects and applies appropriate sampling methods (random, systematic, stratified) and designs data collection tools including questionnaires.
Example task
A school has 600 students: 200 in Year 7, 180 in Year 8, 120 in Year 9, 100 in Year 10. A stratified sample of 50 is needed. How many from each year?
Model response: Year 7: (200/600) x 50 = 16.7, round to 17. Year 8: (180/600) x 50 = 15. Year 9: (120/600) x 50 = 10. Year 10: (100/600) x 50 = 8.3, round to 8. Total: 17 + 15 + 10 + 8 = 50.
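The proportional allocation in this response can be checked with a short sketch. The function name `stratified_allocation` is hypothetical, and simple rounding is assumed; for other year-group sizes the rounded values may not sum exactly to the required sample size and would need adjusting by hand.

```python
def stratified_allocation(group_sizes, sample_size):
    """Allocate the sample to each group in proportion to its size."""
    total = sum(group_sizes.values())
    return {
        group: round(size * sample_size / total)
        for group, size in group_sizes.items()
    }


year_sizes = {"Year 7": 200, "Year 8": 180, "Year 9": 120, "Year 10": 100}
allocation = stratified_allocation(year_sizes, 50)
print(allocation)
print("Total sampled:", sum(allocation.values()))
```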
Evaluates sampling methods, identifies limitations of data collection, and designs strategies to minimise bias and improve validity.
Example task
A researcher wants to find out how much time teenagers spend on social media. Compare using an online survey versus face-to-face interviews. Which is better?
Model response: Online survey: larger sample, cheaper, but self-selection bias (only those online will respond, and they may use social media more). Respondents may also exaggerate or underestimate. Face-to-face: more accurate responses (can clarify questions), but smaller sample, more expensive, and social desirability bias (respondents may understate usage). A better approach might be combining a stratified random online survey with follow-up interviews for a subsample.
Designs complete statistical investigations with appropriate hypotheses, sampling strategies, and evaluates the reliability and validity of conclusions drawn from data.
Example task
Design a statistical investigation to test whether students who eat breakfast perform better in morning tests. Include hypothesis, sampling, data collection and potential confounders.
Model response: Hypothesis: Students who eat breakfast score higher on morning tests than those who do not. Sampling: Stratified random sample across year groups, aiming for 100+ students to ensure adequate power. Data collection: Anonymous questionnaire on breakfast habits (eaten/not eaten, type), plus morning test scores from school records. Confounders: sleep quality, prior attainment, socioeconomic status (linked to both breakfast and attainment). These should be recorded and controlled for in analysis. Limitation: This is observational, not experimental — correlation does not imply causation. Students who eat breakfast may also have other advantages.
Delivery rationale
Secondary maths concept — abstract, procedural, and objectively assessable.
Statistical Diagrams
process · AI Facilitated · MA-KS4-C031
Constructing and interpreting bar charts, pie charts, histograms (equal and unequal class widths), cumulative frequency graphs, and box plots.
Teaching guidance
Frequency density = frequency / class width is the key concept for histograms with unequal class widths — pupils must understand that it is the area, not the height, of each bar that represents frequency. Cumulative frequency graphs should always be plotted at the upper class boundary. Box plots represent the five-number summary (min, Q1, median, Q3, max) and allow distribution comparison visually.
Common misconceptions
Pupils draw histograms with equal-width bars and label the y-axis 'frequency' even when class widths are unequal, because frequency density is not intuitive. Cumulative frequency is plotted at midpoints rather than upper class boundaries. Box plots are confused with bar charts; the thickness of the box (its extent perpendicular to the scale) has no meaning, only the positions of the lines along the scale matter.
Difficulty levels
Constructs and interprets bar charts, pie charts and pictograms accurately, including reading scales and comparing categories.
Example task
A pie chart shows exam results: A* = 36 degrees, A = 72 degrees, B = 108 degrees, C = 90 degrees, D = 54 degrees. There are 200 students. How many got a B?
Model response: B sector = 108 degrees. Fraction = 108/360 = 3/10. Number of students = 3/10 x 200 = 60.
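The same angle-to-frequency conversion can be applied to every sector as a check. This is a minimal sketch using the figures from the task; the variable names are illustrative.

```python
angles = {"A*": 36, "A": 72, "B": 108, "C": 90, "D": 54}
total_students = 200

# A complete pie chart must account for the full 360 degrees.
assert sum(angles.values()) == 360

# Each sector's share of the circle equals its share of the students.
counts = {
    grade: angle * total_students / 360
    for grade, angle in angles.items()
}
print(counts["B"])
```

Summing the counts should return the 200 students, which is a useful self-check for pupils.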
Constructs and interprets frequency polygons, cumulative frequency graphs (plotting at upper class boundaries), and reads quartiles from cumulative frequency.
Example task
Draw a cumulative frequency graph from: 0-10 (freq 4), 10-20 (freq 12), 20-30 (freq 18), 30-40 (freq 6). Estimate the median.
Model response: Cumulative frequencies: 4, 16, 34, 40. Plot at upper boundaries, starting from (0, 0): (10, 4), (20, 16), (30, 34), (40, 40). Total = 40, so the median is read off at a cumulative frequency of 20. Reading from the graph: approximately 22.
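Reading the median off the graph can be imitated with straight-line interpolation between the plotted points. This is a sketch under the assumption that the points are joined by straight lines and that the graph starts at (0, 0); a hand-drawn smooth curve may give a slightly different reading. The helper name `read_off` is hypothetical.

```python
from itertools import accumulate

upper_bounds = [10, 20, 30, 40]
frequencies = [4, 12, 18, 6]

cumulative = list(accumulate(frequencies))
# Plot at upper class boundaries, starting from the lower bound of the first class.
points = [(0, 0)] + list(zip(upper_bounds, cumulative))


def read_off(points, target_cf):
    """Find the x-value at which the cumulative frequency reaches target_cf."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target_cf <= c1 and c1 > c0:
            return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the cumulative frequency range")


median = read_off(points, cumulative[-1] / 2)
print(round(median, 1))
```

With these figures the interpolated median is about 22.2. The same function gives quartile estimates if the target is set to a quarter or three quarters of the total.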
Constructs and interprets histograms with unequal class widths using frequency density, and compares distributions using box plots.
Example task
A histogram has these bars: class 0-5 with frequency density 2, class 5-15 with frequency density 3.5, class 15-20 with frequency density 4. Find the frequency for each class.
Model response: Frequency = frequency density x class width. Class 0-5: 2 x 5 = 10. Class 5-15: 3.5 x 10 = 35. Class 15-20: 4 x 5 = 20. Total = 65.
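The relationship frequency = frequency density x class width can be sketched as follows, using the three bars from the task. The tuple layout (lower bound, upper bound, frequency density) is an assumption chosen for the example.

```python
# Each bar: (lower bound, upper bound, frequency density)
bars = [(0, 5, 2.0), (5, 15, 3.5), (15, 20, 4.0)]

# The area of each bar (width x height) gives the class frequency.
bar_frequencies = [(upper - lower) * density for lower, upper, density in bars]

print(bar_frequencies)
print("Total:", sum(bar_frequencies))
```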
Interprets and compares complex statistical diagrams critically, including population pyramids, misleading graphs, and composite representations. Constructs histograms from raw grouped data and uses area to estimate probabilities.
Example task
A histogram shows journey times. The class 10-20 has frequency density 3.2 and the class 20-25 has frequency density 4.0. Estimate the probability that a randomly chosen journey takes between 15 and 25 minutes.
Model response: Area for 15-20 (half of the 10-20 bar, assuming journeys are spread evenly across the class): 3.2 x 5 = 16. Area for 20-25: 4.0 x 5 = 20. Estimated frequency for 15-25: 16 + 20 = 36. The probability also needs the total frequency (the sum of all bar areas), which is not given here. If, for example, the total were 120, then P(15 to 25) = 36/120 = 0.3. The key insight is that area represents frequency in a histogram, so we can estimate probabilities from areas.
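Estimating a frequency from part of a bar can be sketched as an overlap calculation. The function name `frequency_between` is hypothetical, values are assumed to be spread evenly within each class, and the total of 120 is the illustrative figure from the model response, not data from the task.

```python
def frequency_between(bars, low, high):
    """Estimate the frequency between low and high from histogram bar areas."""
    total = 0.0
    for lower, upper, density in bars:
        # Width of the part of this bar that falls inside [low, high].
        overlap = max(0, min(high, upper) - max(low, lower))
        total += overlap * density
    return total


# Only the two bars described in the task: (lower, upper, frequency density)
bars = [(10, 20, 3.2), (20, 25, 4.0)]
estimated = frequency_between(bars, 15, 25)

assumed_total = 120  # illustrative total frequency, not given in the task
probability = estimated / assumed_total
print(round(estimated, 1), round(probability, 2))
```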
Delivery rationale
Secondary maths process concept — problem-solving benefits from structured AI delivery with facilitator for extended reasoning.
Averages and Measures of Spread
process · AI Facilitated · MA-KS4-C032
Calculating mean, median, mode and range for ungrouped data; estimating mean from grouped frequency tables; finding quartiles and interquartile range from cumulative frequency.
Teaching guidance
Mean from grouped data requires finding midpoints and computing Σfm/Σf — the mean is an estimate because we do not know exact values within classes. Interquartile range (Q3 − Q1) measures spread while being resistant to outliers, unlike range — this statistical property is worth emphasising. Pupils should understand when each average is appropriate: mode for categorical data, median for skewed distributions, mean for symmetric data.
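The resistance of the interquartile range to outliers can be demonstrated with a small sketch. The data values are invented for illustration, and the quartiles come from Python's `statistics.quantiles` with the 'inclusive' method; textbooks and exam boards may use a slightly different quartile convention, so the exact quartile values are an assumption of this example.

```python
from statistics import quantiles


def spread(data):
    """Return (range, interquartile range) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return max(data) - min(data), q3 - q1


typical = [2, 4, 5, 7, 8, 9, 11]
with_outlier = [2, 4, 5, 7, 8, 9, 110]  # largest value replaced by an outlier

range_a, iqr_a = spread(typical)
range_b, iqr_b = spread(with_outlier)
print(range_a, iqr_a)
print(range_b, iqr_b)
```

The range jumps from 9 to 108 when the outlier is introduced, while the interquartile range stays at 4 because the quartiles depend only on the middle of the ordered data.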
Common misconceptions
The mean from grouped data is always an estimate, yet pupils often state it as an exact value. The median from a frequency table must be located using cumulative frequency; pupils instead halve the total frequency and give that number, or the middle row of the table, as the median. Interquartile range is sometimes computed as Q2 − Q1 (not Q3 − Q1) by pupils who confuse quartile numbering.
Difficulty levels
Calculates mean, median, mode and range for small ungrouped data sets and knows which average each one represents.
Example task
Find the mean, median and mode of: 3, 5, 5, 6, 8, 9, 12.
Model response: Mean = (3+5+5+6+8+9+12)/7 = 48/7 = 6.86 (2 d.p.). Median = 6 (middle of 7 ordered values). Mode = 5 (appears twice).
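These values can be confirmed with Python's standard `statistics` module. This is a quick check for the teacher or pupil, not a substitute for the written method.

```python
from statistics import mean, median, mode

data = [3, 5, 5, 6, 8, 9, 12]

data_mean = mean(data)
data_median = median(data)
data_mode = mode(data)
data_range = max(data) - min(data)

print(round(data_mean, 2), data_median, data_mode, data_range)
```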
Calculates the mean from a frequency table using sum of fx / sum of f, and chooses the most appropriate average for a given data set.
Example task
Find the mean from: Score 1 (freq 5), Score 2 (freq 8), Score 3 (freq 12), Score 4 (freq 3), Score 5 (freq 2).
Model response: Sum of fx = 1(5) + 2(8) + 3(12) + 4(3) + 5(2) = 5 + 16 + 36 + 12 + 10 = 79. Sum of f = 30. Mean = 79/30 = 2.63 (2 d.p.).
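The sum of fx / sum of f calculation can be sketched with the frequency table stored as a dictionary. The layout (score mapped to frequency) is an assumption made for the example.

```python
# score -> frequency
table = {1: 5, 2: 8, 3: 12, 4: 3, 5: 2}

sum_fx = sum(score * freq for score, freq in table.items())
sum_f = sum(table.values())
table_mean = sum_fx / sum_f

print(sum_fx, sum_f, round(table_mean, 2))
```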
Estimates the mean from a grouped frequency table using midpoints, finds the modal class, and identifies the class containing the median. Understands why these are estimates.
Example task
Estimate the mean from: 0 < x <= 10 (freq 6), 10 < x <= 20 (freq 14), 20 < x <= 30 (freq 8), 30 < x <= 50 (freq 2).
Model response: Midpoints: 5, 15, 25, 40. Sum of fm = 6(5) + 14(15) + 8(25) + 2(40) = 30 + 210 + 200 + 80 = 520. Sum of f = 30. Estimated mean = 520/30 = 17.3 (1 d.p.). This is an estimate because we assume all values in each class equal the midpoint.
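For grouped data the only change is that each class is represented by its midpoint. The sketch below uses the classes from the task; the tuple layout (lower bound, upper bound, frequency) is illustrative.

```python
# Each class: (lower bound, upper bound, frequency)
classes = [(0, 10, 6), (10, 20, 14), (20, 30, 8), (30, 50, 2)]

# The midpoint stands in for every value in the class, so the result is an estimate.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
sum_fm = sum(mid * freq for mid, (_, _, freq) in zip(midpoints, classes))
sum_f = sum(freq for _, _, freq in classes)
estimated_mean = sum_fm / sum_f

print(midpoints, sum_fm, round(estimated_mean, 1))
```

Note that the last class is twice as wide as the others, so its midpoint is 40 rather than 35.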
Uses cumulative frequency to estimate quartiles, interquartile range and percentiles. Compares distributions using summary statistics and evaluates which measures are most appropriate.
Example task
Two factories produce bolts. Factory A: mean diameter 10.02 mm, IQR 0.03 mm. Factory B: mean diameter 10.00 mm, IQR 0.12 mm. The target is 10.00 mm. Which factory is more reliable?
Model response: Factory B has the mean closer to target (10.00 vs 10.02) but much larger variability (IQR 0.12 vs 0.03). Factory A is slightly off-target but very consistent: the middle 50% of its bolts span only 0.03 mm. Factory B is centred correctly but produces bolts with wide variation, so some could be far from target. Factory A is more reliable for quality control because consistency matters more than being exactly centred (an offset can be corrected by recalibration). The small systematic error in A is preferable to the large random error in B.
Delivery rationale
Secondary maths process concept — problem-solving benefits from structured AI delivery with facilitator for extended reasoning.
Scatter Graphs and Correlation
knowledge · AI Direct · MA-KS4-C033
Plotting and interpreting scatter graphs of bivariate data; describing correlation (positive, negative, none); drawing and using lines of best fit; distinguishing correlation from causation.
Teaching guidance
Require pupils to describe correlation in context, not just label it positive/negative — 'as height increases, weight tends to increase' is more statistically meaningful than 'positive correlation'. Lines of best fit should pass through the mean point (x̄, ȳ) and pupils should use this property to check their line. The causation-correlation distinction is critical for statistical literacy and should be reinforced with counter-intuitive examples.
Common misconceptions
Pupils believe strong correlation implies causation — the most important misconception in statistics. Lines of best fit are frequently drawn from (0,0) or through the most extreme points rather than balancing the data. Extrapolation beyond the data range is treated as equally reliable as interpolation within it.
Difficulty levels
Plots scatter graphs from paired data and describes the overall trend informally (going up, going down, no pattern).
Example task
Plot these (height in cm, weight in kg) pairs on a scatter graph: (150, 45), (155, 50), (160, 55), (165, 52), (170, 60), (175, 58), (180, 65). Describe the trend.
Model response: The points show an upward trend: as height increases, weight tends to increase. The relationship is not perfect — not all points lie on a straight line.
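A teacher checking a pupil's hand-drawn line might compute the mean point and a fitted line for the same seven (height, weight) pairs. The sketch below swaps in a least-squares fit, which goes beyond the by-eye method pupils use at this level; it is included only because a least-squares line always passes through the mean point, which makes it a convenient reference.

```python
heights = [150, 155, 160, 165, 170, 175, 180]
weights = [45, 50, 55, 52, 60, 58, 65]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Least-squares gradient: how much weight changes per unit of height.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
s_xx = sum((x - mean_x) ** 2 for x in heights)
gradient = s_xy / s_xx
intercept = mean_y - gradient * mean_x


def fitted(x):
    """Value on the fitted line at height x."""
    return gradient * x + intercept


print((mean_x, mean_y), round(gradient, 2))
```

The mean point is (165, 55) and the gradient is positive (about 0.58), consistent with the upward trend described in the response.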
Identifies and describes the type and strength of correlation (strong/weak positive, strong/weak negative, none) and draws a line of best fit by eye.
Example task
A scatter graph shows that as the age of a car increases, its value decreases. There is a clear downward trend with points close to a line. Describe the correlation.
Model response: There is a strong negative correlation: as the age of the car increases, its value tends to decrease. A line of best fit would slope downward. The strong correlation means the points are close to the line, so age is a good predictor of value.
Uses a line of best fit to make predictions, distinguishing between interpolation (reliable, within data range) and extrapolation (unreliable, outside data range).
Example task
A line of best fit for height (cm) vs shoe size has equation y = 0.1x - 7, valid for heights 150-190 cm. Estimate the shoe size for (a) height 170 cm, (b) height 210 cm.
Model response: (a) y = 0.1(170) - 7 = 17 - 7 = size 10. This is interpolation (170 is within the data range 150-190) so the estimate is reliable. (b) y = 0.1(210) - 7 = 21 - 7 = size 14. This is extrapolation (210 is outside the data range) so the estimate is unreliable — the linear relationship may not hold for very tall people.
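The distinction between interpolation and extrapolation can be made explicit by checking each input against the data range before trusting the estimate. This is a sketch using the equation and range from the task; the function name `predict_shoe_size` is hypothetical.

```python
def predict_shoe_size(height_cm, data_range=(150, 190)):
    """Apply y = 0.1x - 7 and report whether x lies inside the data range."""
    estimate = 0.1 * height_cm - 7
    low, high = data_range
    kind = "interpolation" if low <= height_cm <= high else "extrapolation"
    return estimate, kind


for height in (170, 210):
    size, kind = predict_shoe_size(height)
    print(height, round(size, 1), kind)
```

The equation produces an answer for any height, which is exactly why pupils need the range check: the calculation for 210 cm looks just as precise as the one for 170 cm, but only the latter is supported by the data.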
Critically evaluates bivariate data, distinguishes correlation from causation, identifies lurking variables, and interprets the equation and gradient of a line of best fit in context.
Example task
A study finds a strong positive correlation between ice cream sales and drowning incidents. A newspaper headline says 'Ice cream causes drowning'. Evaluate this claim.
Model response: The claim is not supported by the evidence: it confuses correlation with causation. The lurking (confounding) variable is temperature/season: hot weather causes both increased ice cream sales and increased swimming (leading to more drowning incidents). The two variables are associated because they share a common cause, not because one causes the other. To establish causation, you would need a controlled experiment (impossible here for ethical reasons). This is a classic example of a spurious correlation driven by a confounding variable.
Delivery rationale
Secondary maths concept — abstract, procedural, and objectively assessable.