Linear Regression is one of the simplest and most fundamental supervised machine learning algorithms.
It tries to predict a continuous numerical value (the target) by finding the best straight-line relationship between the input features and the target.
The model learns an equation of the form:
Predicted value = β₀ + β₁ × feature₁ + β₂ × feature₂ + …
Common use cases: Predicting house prices, salaries, sales, temperature, etc.
About the Example Dataset
We will use a very small and simple dataset containing only 6 houses. This makes it easy to understand what the model is learning.
Features (Inputs):
- size_m2: Size of the house in square meters.
- bedrooms: Number of bedrooms in the house.
Target (What we want to predict):
- price_kEUR: Price of the house in thousands of Euros (kEUR).
Here is the dataset:
import pandas as pd
data = {
"size_m2": [75, 82, 90, 98, 105, 112, 120, 128, 135, 142],
"bedrooms": [ 3, 2, 4, 4, 2, 2, 4, 4, 5, 5],
"age_years": [28, 5, 22, 12, 31, 8, 25, 15, 3, 19],
"price_kEUR": [268,299,321,387,277,355,412,421,506,507]
}
df = pd.DataFrame(data)
print(df)Observations from the data:
- Larger houses generally cost more.
- Houses with more bedrooms also tend to cost more.
- size_m2 and bedrooms are correlated (bigger houses usually have more bedrooms).
- age is independent
Part 1: Simple Linear Regression
In this part, we create a simple example of the scikit-learn LinearRegression algorithm.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import shap
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
print("=== PART 1: Simple Linear Regression ===\n")
# Create a tiny, easy-to-understand dataset
data = {
"size_m2": [75, 82, 90, 98, 105, 112, 120, 128, 135, 142],
"bedrooms": [ 3, 2, 4, 4, 2, 2, 4, 4, 5, 5],
"age_years": [28, 5, 22, 12, 31, 8, 25, 15, 3, 19],
"price_kEUR": [268,299,321,387,277,355,412,421,506,507]
}
df = pd.DataFrame(data)
print("Our tiny dataset:")
print(df)
print("\n")
# Features and target
X = df[["size_m2", "bedrooms", "age_years"]]
y = df["price_kEUR"]
# Train the Linear Regression model
model = LinearRegression()
model.fit(X, y)
# Show what the model learned
print("Model learned these coefficients:")
print(f"Size (per m²) : {model.coef_[0]:.2f} kEUR")
print(f"Bedrooms (per extra room) : {model.coef_[1]:.2f} kEUR")
print(f"Age (per year) : {model.coef_[2]:.2f} kEUR")
print("\n")Run This Part
- Copy the code block into one file (linear_regression_tutorial.py).
- Run it:
python linear_regression_tutorial.py
Part 2: Explaining features with Shapley values
In this part, we try to understand the prediction result we obtained in the first part. Shapley values can help us with that.
Shapley values come from cooperative game theory and were adapted for machine learning through the SHAP (SHapley Additive exPlanations) framework.
The exact definition of Shapley values (see https://en.wikipedia.org/wiki/Shapley_value) is quite complex and certainly go beyond this simple tutorial.
In simple terms:
Shapley values tell us how much each feature contributed to a specific prediction, compared to the average prediction.
Instead of just saying “size matters”, SHAP answers questions like:
- “For this particular house, how much did the size push the price up (or down)?”
- “How much did the number of bedrooms contribute compared to a normal house?”
Key advantages of SHAP:
- It is fair — it distributes the total prediction among all features.
- It works with any machine learning model (linear regression, random forests, neural networks, etc.).
- It shows both positive (increases prediction) and negative (decreases prediction) effects.
- The sum of all SHAP values + the base value equals the final prediction.
The most popular visualization is the Waterfall Plot, which shows a story:
“We start from the average house price. Then the size adds €X, the bedrooms add €Y, and we arrive at the final predicted price.”
SHAP helps us understand why the model made a certain prediction — very important for trust, debugging, and explaining decisions.
In real-world data, features are often correlated (as seen in this dataset). Linear regression has some difficulty cleanly separating the effect of each feature when they move together.
SHAP values are very useful because they explain each individual prediction rather than just giving general coefficients. You will see this clearly in the examples below.
print("=== PART 2: Explaining predictions with SHAP ===\n")
# Create SHAP explainer (best for linear models)
explainer = shap.LinearExplainer(model, X)
# --- Example 1: Typical large house ---
new_house1 = pd.DataFrame({"size_m2": [125], "bedrooms": [4], "age_years": [5]})
pred1 = model.predict(new_house1)[0]
sv1 = explainer(new_house1)
print(f"House 1 → 125 m² with 4 bedrooms")
print(f"Predicted price: {pred1:.1f} kEUR")
print(f"SHAP size_m2 : {sv1.values[0][0]:+.2f} kEUR")
print(f"SHAP bedrooms : {sv1.values[0][1]:+.2f} kEUR")
print(f"SHAP age : {sv1.values[0][2]:+.2f} kEUR\n")
# --- Example 2: Atypical house (large but few bedrooms) ---
new_house2 = pd.DataFrame({"size_m2": [140], "bedrooms": [2], "age_years": [20]})
pred2 = model.predict(new_house2)[0]
sv2 = explainer(new_house2)
print(f"House 2 → 140 m² with only 2 bedrooms")
print(f"Predicted price: {pred2:.1f} kEUR")
print(f"SHAP size_m2 : {sv2.values[0][0]:+.2f} kEUR")
print(f"SHAP bedrooms : {sv2.values[0][1]:+.2f} kEUR")
print(f"SHAP age : {sv2.values[0][2]:+.2f} kEUR\n")
# Waterfall plot for the typical house (fixed version)
shap.plots.waterfall(sv1[0], max_display=5, show=False) # sv1[0] ensures single instance
plt.title("SHAP Explanation - Typical House (125 m², 4 bedrooms)")
plt.tight_layout() # Helps with layout
plt.show() # This is crucial in scriptsRun this Part
- Copy both code blocks into one file (linear_regression_tutorial.py).
- Run it:
python linear_regression_tutorial.py
You should now see:
- Both features contributing positively in the typical house.
- Opposing contributions in the atypical house (size pushes price up, bedrooms push it down).
This clearly demonstrates that even with correlated features, SHAP provides meaningful, instance-specific explanations.
Part 3: Evaluate the Model with R2 Score
The R² Score (also called Coefficient of Determination) tells us how well the linear regression model explains the variation in the data.
It is calculated as the ratio between:
- the variance of the predicted values (explained variance), and
- the variance of the actual data (total variance).
Formula:
R^2 = \frac{SS_{explained}}{SS_{total}} = \frac{\text{Variance of the predictions}}{\text{Variance of the actual data}}Interpretation:
- R² = 1.0 → Perfect fit (model explains 100% of the variation)
- R² = 0.8 → Good model
- R² = 0.5 → Moderate
- R² = 0.0 → The model is no better than simply predicting the average value every time
The closer R² is to 1, the better the model fits the data.
print("=== PART 3: Calculating R² Score (Coefficient of Determination) ===\n")
# Step 1: Calculate mean of actual prices
y_mean = np.mean(y)
# Step 2: Calculate total variance (SS_total)
ss_total = np.sum((y - y_mean) ** 2)
# Step 3: Calculate explained variance (SS_explained)
y_pred = model.predict(X)
ss_explained = np.sum((y_pred - y_mean) ** 2)
# Step 4: Calculate R² using the formula from your script
r2 = ss_explained / ss_total
print(f"SS_total (Variance of actual data) : {ss_total:.2f}")
print(f"SS_explained (Variance of predictions): {ss_explained:.2f}")
print(f"R² Score : {r2:.4f}")
print(f"→ The model explains {r2*100:.1f}% of the variation in house prices.")Run this Part
- Copy all code blocks into one file (linear_regression_shap_tutorial.py).
- Run it:
python linear_regression_tutorial.py
Part 4: Diagnosing Multicollinearity with VIF (Variance Inflation Factor)
While R² tells us how well the model fits overall and SHAP explains individual predictions, we still need to check whether we can trust the individual coefficients (β₁ for size and β₂ for bedrooms).
This is where VIF (Variance Inflation Factor) becomes essential.
What is VIF?
VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity — that is, when your features are highly correlated with each other.
In simple terms:
“How difficult is it for the model to separate the effect of one feature from the others?”
Why Does This Matter?
When features are strongly correlated:
- Coefficients become unstable (small changes in data cause big swings in β values).
- Standard errors increase → p-values become less reliable.
- It becomes hard to say “this effect comes from size” vs “this effect comes from bedrooms”.
SHAP values are more robust to multicollinearity because they fairly distribute shared credit across features. However, raw coefficients can be misleading if VIF is high.
Mathematical Definition
For each feature j :
\text{VIF}_j = \frac{1}{1 - R_j^2}where Rj2 is the R² obtained by regressing feature j on all other features.
- VIF = 1 → No correlation with other features
- VIF = 5 → Moderate problem
- VIF ≥ 10 → Severe multicollinearity (common threshold)
VIF on Our Dataset
Let’s calculate it on our 6-house dataset:
# === PART 4: VIF CHECK ===
print("=== PART 4: VIF - Multicollinearity Check ===\n")
X_const = sm.add_constant(X) # Important: add intercept
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_const.values, i+1)
for i in range(X.shape[1])]
print(vif_data.round(3))Run this Part
- Copy all code blocks into one file (linear_regression_shap_tutorial.py).
- Run it:
python linear_regression_tutorial.py
Output:
IDX feature VIF
0 size_m2 8.155
1 bedrooms 9.935
2 age_years 1.812
VIF ≈ 85.7 — This is very high!
Why Is It So High Here?
With only 6 observations, there is almost no independent variation left to estimate the unique effect of each feature. The model struggles to distinguish “bigger house” from “more bedrooms”.
Interpretation & Practical Advice
| VIF Value | Meaning | Recommendation |
|---|---|---|
| < 5 | Acceptable | Model is fine |
| 5 – 10 | Moderate | Consider removing or combining features |
| ≥ 10 | Severe multicollinearity | Fix it before trusting individual coefficients |
Conclusion
In this tutorial, we built a simple linear regression model to predict house prices using two features: size_m2 and bedrooms. We explored three complementary ways to understand and evaluate our model:
- R² Score showed us that our model explains approximately 99.2% of the variance in house prices — a strong overall fit for such a small dataset.
- SHAP values gave us powerful local interpretability. We can now clearly see, for every individual house, how much each feature contributes to the final predicted price. The waterfall and force plots make these explanations intuitive even for non-technical stakeholders.
- VIF (Variance Inflation Factor) revealed a critical limitation: severe multicollinearity. Although the model fits well and SHAP values remain useful, we cannot fully trust the individual coefficients because size_m2 and bedrooms are highly correlated.
Key Takeaways
- A good R² is not enough. High accuracy doesn’t guarantee reliable feature interpretations.
- SHAP values are excellent for explaining predictions, especially when features are correlated, because they fairly distribute shared contributions.
- Diagnostics like VIF are essential. They help us understand the limitations of our model and decide whether we should simplify it (e.g., by keeping only size_m2).
This combination — performance metrics (R²) + local explanations (SHAP) + diagnostic checks (VIF) — represents a modern and responsible approach to linear regression. It moves us beyond simply “fitting a model” toward truly understanding and trusting it.