Chapter 4: Correlation and Regression
Syllabus hours: 6 | Exam weight: 10 marks | Marks breakdown: correlation 3, regression 4, variation 1, partial/multiple correlation 2
Difficulty type: Medium-high | Version / Last Updated: 2026-04-18 | Not in syllabus: advanced nonlinear and machine-learning regression
Outcome: measure association, fit a regression line, interpret variation, and read multiple/partial correlation results from PYQs.
1. Fundamental Concepts
- Correlation measures strength and direction of linear association.
- Regression predicts one variable from another and gives an equation, not just a relationship score.
- Correlation is symmetric; regression is directional.
- Positive correlation means both variables move in the same direction; negative correlation means they move in opposite directions.
- Outliers can distort both correlation and regression heavily.
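The effect of a single outlier on r can be checked numerically. The sketch below is only an illustration: the data are made up (five points lying exactly on y = 2x, then one extra point at (6, 0)), and `pearson_r` is a hypothetical helper name, not something from the syllabus.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: five points lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# Same data plus one made-up outlier at (6, 0).
r_outlier = pearson_r(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_outlier, 3))
```

With these numbers, one added point pulls r from 1 down to about 0.14, which is why a scatter plot is worth checking before quoting r.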
2. Core Methods and Formulas
When to use: use correlation when the question asks how strongly two variables move together; use regression when the question asks to estimate one variable from another.
When not to use: do not treat correlation as causation; do not use regression predictions outside the observed range without caution.
3. Standard Models / Topics
Topic 1: Correlation Analysis and Tests
Basic notes: correlation gives a single number between -1 and 1. A value close to +1 means a strong positive linear relationship; close to -1 means a strong negative linear relationship; near 0 means a weak or absent linear relationship, although a curved relationship may still exist.
Conditions / use: use Pearson correlation for linear numerical data; avoid it as a summary if the relation is curved or heavily affected by outliers.
Formula recap: r = Cov(x, y) / (σx σy) and r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² · Σ(y - ȳ)²].
Seen-Before Check: if the problem asks for relationship strength, linear association, or “degree of correlation,” this topic applies.
[Core] Problem 1: Find the sign of correlation when both variables increase together.
Answer: positive correlation.
[Core] Problem 2: If r = -0.82, what does it indicate?
Answer: a strong negative linear relationship.
[Advanced] Problem 3: For x = 1, 2, 3, 4 and y = 2, 5, 4, 6, compute the correlation coefficient.
Answer: x̄ = 2.5, ȳ = 4.25, Σ(x - x̄)(y - ȳ) = 5.5, Σ(x - x̄)² = 5, Σ(y - ȳ)² = 8.75, so r = 5.5 / √(5 × 8.75) ≈ 0.83. The variables have a fairly strong positive linear association.
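The arithmetic in Problem 3 can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`sxy`, `sxx`, `syy`) are shorthand chosen here for the three sums of deviations.

```python
from math import sqrt

# Data from Problem 3.
x = [1, 2, 3, 4]
y = [2, 5, 4, 6]

n = len(x)
mean_x = sum(x) / n  # 2.5
mean_y = sum(y) / n  # 4.25

# Sums of products and squares of deviations from the means.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)
print(f"Sxy={sxy}, Sxx={sxx}, Syy={syy}, r={r:.4f}")
```

The three sums come out as 5.5, 5 and 8.75, giving r ≈ 0.83.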
Topic 2: Simple Regression and Interpretation
Basic notes: regression builds a prediction rule. In the regression line of y on x, x is the predictor and y is the response.
Conditions / use: use simple regression when there is one explanatory variable and one response variable.
Formula recap: y = a + bx, b = r σy / σx, a = ȳ - b x̄.
Seen-Before Check: “estimate y for a given x,” “regression line,” or “best fit straight line” are the clues.
[Core] Problem 1: Why does the regression line of y on x pass through (x̄, ȳ)?
Answer: because a = ȳ - b x̄, substituting x = x̄ into y = a + bx gives y = ȳ - b x̄ + b x̄ = ȳ.
[Core] Problem 2: If r is positive, what is the sign of the slope of y on x?
Answer: positive. The slope is b = r σy / σx, and standard deviations are positive, so b takes the sign of r.
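[Advanced] Problem 3: Given x̄=10, ȳ=20, r=0.8, σx=2, σy=5, find the regression line of y on x.
Answer: b = r σy / σx = 0.8 × 5 / 2 = 2, so y - 20 = 2(x - 10), i.e. y = 2x.
[PYQ-Trap] Problem 4: Using the same data, find the regression line of x on y.
Answer: the slope is now r σx / σy = 0.8 × 2 / 5 = 0.32, so x - 10 = 0.32(y - 20), i.e. x = 0.32y + 3.6. The trap is to rearrange y = 2x into x = 0.5y; the two regression lines coincide only when r = ±1.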
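Both regression lines for Problems 3 and 4 can be worked out from the summary statistics alone. The sketch below assumes the standard slope formulas (r σy / σx for y on x, r σx / σy for x on y); `regression_from_summary` is a hypothetical helper name introduced for illustration.

```python
def regression_from_summary(mean_x, mean_y, r, sd_x, sd_y):
    """Return (intercept, slope) for y on x, then for x on y."""
    b_yx = r * sd_y / sd_x          # slope of y on x
    a_yx = mean_y - b_yx * mean_x   # intercept of y on x
    b_xy = r * sd_x / sd_y          # slope of x on y
    a_xy = mean_x - b_xy * mean_y   # intercept of x on y
    return (a_yx, b_yx), (a_xy, b_xy)

# Summary statistics from Problem 3.
(a_yx, b_yx), (a_xy, b_xy) = regression_from_summary(10, 20, 0.8, 2, 5)

print(f"y on x: y = {a_yx:.2f} + {b_yx:.2f} x")
print(f"x on y: x = {a_xy:.2f} + {b_xy:.2f} y")

# The product of the two slopes equals r squared.
slope_product = b_yx * b_xy
print(round(slope_product, 4))
```

The product of the two slopes equals r² (here 2 × 0.32 = 0.64), a quick check that the x-on-y line was not obtained by simply inverting the y-on-x line.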
Topic 3: Explained, Unexplained, and Total Variation
Basic notes: total variation is the spread in the response values. Regression explains part of it; the rest is unexplained error.
Conditions / use: use this to interpret model quality and whether the fitted line is useful.
Formula recap: SST = SSR + SSE (total = explained + unexplained), r² = SSR / SST, and larger r² means a better fit.
Seen-Before Check: when a question asks for “variation explained by regression” or “coefficient of determination,” this topic is active.
[Core] Problem 1: What does r² = 0.81 mean?
Answer: 81% of the variation in y is explained by x in the fitted linear model.
[Core] Problem 2: If SST = 500 and SSE = 125, find SSR and r².
Answer: SSR = SST - SSE = 500 - 125 = 375, and r² = SSR / SST = 375 / 500 = 0.75.
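The split of total variation into explained and unexplained parts can be verified on a small data set. The sketch below reuses the four points from Topic 1, Problem 3, fits the least-squares line of y on x, and computes SST, SSR and SSE directly. The variable names are choices made here, not notation from the syllabus.

```python
# Data from Topic 1, Problem 3.
x = [1, 2, 3, 4]
y = [2, 5, 4, 6]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line of y on x: slope b = Sxy / Sxx (the same value as
# r * sd_y / sd_x), intercept a = mean_y - b * mean_x.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x
fitted = [a + b * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
ssr = sum((fi - mean_y) ** 2 for fi in fitted)          # explained variation
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
r_squared = ssr / sst

print(f"SST={sst:.2f}, SSR={ssr:.2f}, SSE={sse:.2f}, r^2={r_squared:.4f}")
```

For these points SST = 8.75, SSR = 6.05 and SSE = 2.70, so SSR + SSE = SST and r² ≈ 0.69, which matches the square of r ≈ 0.83.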
Topic 4: Multiple Regression and Partial Correlation
Basic notes: multiple regression uses more than one predictor. Partial correlation measures the relation between two variables after removing the influence of a third.
Conditions / use: use multiple regression when one response depends on several predictors; use partial correlation when the third-variable effect must be controlled.
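Formula recap: multiple regression has the form y = b0 + b1 x1 + b2 x2 + ... + bk xk, and partial correlation is written as r12.3 = (r12 - r13 r23) / √[(1 - r13²)(1 - r23²)], the correlation between variables 1 and 2 with variable 3 held constant.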
Seen-Before Check: “control for,” “adjust for,” “multiple predictors,” and “holding one variable constant” are the indicators.
[Core] Problem 1: Why is partial correlation useful?
Answer: it isolates the association between two variables after removing a confounding variable.
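A first-order partial correlation can be computed from the three pairwise correlations using the standard formula r12.3 = (r12 - r13 r23) / √[(1 - r13²)(1 - r23²)]. The sketch below is illustrative only: the three input correlations are hypothetical, and `partial_correlation` is a helper name introduced here.

```python
from math import sqrt

def partial_correlation(r12, r13, r23):
    """Correlation between variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical pairwise correlations, chosen only for illustration.
r12, r13, r23 = 0.8, 0.6, 0.5
r12_3 = partial_correlation(r12, r13, r23)

print(round(r12_3, 4))
```

With these made-up inputs the simple correlation of 0.8 falls to roughly 0.72 once the third variable is controlled, so part of the original association was shared with that variable.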
[Advanced] Problem 2: A model uses temperature and pressure to predict output. What regression type is this?
Answer: multiple regression.
4. Applied Problem Solving
- [Core] Decide whether the data calls for correlation or regression.
- [Core] Compute or interpret the regression slope and intercept.
- [PYQ-Trap] Explain why a high correlation does not automatically imply a useful predictive model.
5. System-Level Understanding
- Correlation summarizes association; regression turns association into a predictive equation.
- Residual variation tells you how much uncertainty remains after fitting the line.
- Partial and multiple correlation extend the same logic to more complex systems with hidden or confounding variables.
6. Quick Reference
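r = Σ(x - x̄)(y - ȳ) / √[Σ(x - x̄)² · Σ(y - ȳ)²] for linear relationship strength.
y = a + bx, with b = r σy / σx and a = ȳ - b x̄, for simple prediction.
SST = SSR + SSE and r² = SSR / SST for variation breakdown.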
Seen-Before Check: relationship, prediction, variation, or control variable.
7. Exam Tips
- Always mention whether the relation is positive or negative before giving the numerical result.
- For regression questions, define the response and predictor clearly.
- Use the words “explained variation” and “unexplained variation” correctly.
- Seen-Before Check: if a question says “hold one variable constant,” think partial correlation, not simple correlation.
8. Common Pitfalls
- Reversing the roles of x and y in the regression line.
- Claiming causation from correlation alone.
- Using a regression model outside the observed data range without caution.
- Confusing r with r².
9. Tools and Guides
- Correlation = association score; regression = prediction equation.
- Sign of r tells direction; magnitude tells strength.
- r² tells how much of the variation in the response the model explains; residuals and 1 - r² tell how much is still not explained.