🏠 Melbourne House Price Prediction: Regression, Clustering & Classification

Dataset: Melbourne Housing Market, Kaggle (Full Version)


📌 Project Overview

This project builds a complete, end-to-end machine learning pipeline to predict Melbourne house prices using a rich real-world property dataset. The pipeline moves from raw data through exploratory analysis, feature engineering, unsupervised clustering, regression modeling, and multi-class classification, ending with two production-ready models exported for deployment on HuggingFace.

Research Question:

Given property characteristics, location attributes, and sale metadata available at listing time, can we accurately predict a Melbourne house's sale price, and assign each property to a meaningful price tier (Low / Mid / High)?

Why this matters: Real estate is one of the most consequential financial decisions a household makes. In Melbourne, one of the world's most expensive property markets, understanding what drives house prices is critical for buyers, sellers, and investors alike. A reliable price-prediction model can help buyers set realistic budgets, help sellers price competitively, and help analysts identify market segments. The key challenge is building a model without data leakage and with transparent, interpretable feature engineering grounded in real property market logic.



📊 Dataset Description

| Property | Value |
| --- | --- |
| Source | Kaggle: Melbourne Housing Market (Full Version, anthonypino) |
| File | Melbourne_housing_FULL.csv |
| Raw columns | 21 features (including target) |
| Target | Price: sale price in AUD |
| Task | Regression (price prediction) + Classification (price tier) |
| Geography | Metropolitan Melbourne, Australia |
| Time period | 2016–2018 |

📋 Raw Feature Dictionary: All 21 Original Columns

The raw dataset contains 21 columns. Below is every column, its type, description, and modeling decision.

Identifiers: Excluded or Used with Care

| Column | Type | Description | Decision |
| --- | --- | --- | --- |
| Address | object | Full property street address | ❌ too granular; near-unique per row, so no predictive value as-is |
| SellerG | object | Real estate agency or selling agent name | ❌ very high cardinality; excluded to avoid noise |

Target Variable

| Column | Type | Description | Decision |
| --- | --- | --- | --- |
| Price | float64 | House sale price in AUD; right-skewed | ✅ regression target; log-transform considered for modeling |

Property Characteristics

| Column | Type | Description | Decision |
| --- | --- | --- | --- |
| Rooms | int64 | Total number of rooms | ✅ strong predictor; used in engineered ratio features |
| Type | object | Property type: h=house/cottage/villa/terrace, u=unit/duplex, t=townhouse | ✅ one-hot encoded; strong price signal |
| Bedroom2 | float64 | Number of bedrooms (sourced from a secondary source) | ✅ used in rooms_per_bedroom ratio |
| Bathroom | float64 | Number of bathrooms | ✅ used in bathrooms_per_room ratio and interaction feature |
| Car | float64 | Number of car spaces | ✅ included in baseline numeric features |
| Landsize | float64 | Land area in square metres; right-skewed, many outliers | ✅ log-transformed → log_landsize; raw value retained for IQR analysis |
| BuildingArea | float64 | Building footprint in square metres; right-skewed | ✅ log-transformed → log_buildingarea; used in building_to_land_ratio |
| YearBuilt | float64 | Year the property was built; many missing values | ✅ included in baseline; imputed via median in pipeline |

Location Features

| Column | Type | Description | Decision |
| --- | --- | --- | --- |
| Suburb | object | Suburb name; high cardinality (~340 unique) | ✅ one-hot encoded (handle_unknown=ignore); high-signal location feature |
| Postcode | float64 | Postcode of the property | ✅ included as numeric feature |
| Distance | float64 | Distance from Melbourne Central Business District (km) | ✅ strong price predictor; log-transformed → log_distance; used in clustering |
| Regionname | object | General region in Melbourne (8 regions) | ✅ one-hot encoded |
| CouncilArea | object | Governing council area; lower cardinality than Suburb | ✅ one-hot encoded |
| Propertycount | float64 | Number of properties in the suburb at time of sale | ✅ proxy for suburb density; used in clustering |
| Lattitude | float64 | Property latitude | ✅ used in KMeans clustering only |
| Longtitude | float64 | Property longitude | ✅ used in KMeans clustering only |

Sale Metadata

| Column | Type | Description | Decision |
| --- | --- | --- | --- |
| Method | object | Sale method: S=sold, PI=passed in, SA=sold after, SP=sold prior, VB=vendor bid | ✅ one-hot encoded |
| Date | object | Date of sale; parsed to datetime (day-first format) | ✅ decomposed into sale_year, sale_month, sale_quarter |

πŸ” Part 2: Exploratory Data Analysis

Raw housing data is noisy and inconsistent. These steps were taken to make it analysis-ready.

2.1 Initial State

  • Duplicate rows possible in the scraped dataset
  • Price stored as object dtype in some versions; needs numeric coercion
  • Date stored as string in day-first format (DD/MM/YYYY), which requires explicit parsing
  • Significant missingness on structural columns: BuildingArea (~60%), YearBuilt (~54%), Landsize (~33%), Car (~23%), Bathroom and Bedroom2 (~22% each)
  • Strong numerical outliers in Price (luxury properties >$5M), Landsize (rural blocks >10,000 m²), and BuildingArea (implausible extremes)
  • SellerG (agent name) has very high cardinality with no generalizable price signal

Missingness reporting: Both raw counts and percentage of rows per column were printed so sparse columns (e.g. BuildingArea, YearBuilt) are easy to compare at a glance.

Pre-cleaning snapshot (tabular):

| Metric | Value |
| --- | --- |
| Rows x Columns | 34,857 x 21 |
| Numeric columns | 13 |
| Text/Categorical columns | 8 |

| Top missing columns (pre-cleaning) | Missing count | Missing % |
| --- | --- | --- |
| BuildingArea | 17,400 | 59.55% |
| YearBuilt | 15,744 | 53.88% |
| Landsize | 9,568 | 32.75% |
| Car | 6,860 | 23.48% |
| Bathroom | 6,558 | 22.45% |
| Bedroom2 | 6,552 | 22.42% |
| Price | 6,367 | 21.79% |
| Lattitude | 6,339 | 21.70% |

2.2 Cleaning Decisions

  • Removed exact duplicate rows to prevent biased learning and inflated metrics.
  • Parsed Date using explicit day-first datetime parsing (Australian date format, DD/MM/YYYY).
  • Converted Price to numeric with errors='coerce', which forces any string artifacts to NaN.
  • Dropped rows where Price is missing to preserve target integrity; a model cannot train without a label.
  • Left all feature missingness (BuildingArea, YearBuilt, etc.) intact for pipeline imputation; imputing before the train/test split would leak test-set statistics into training.
  • SellerG (real estate agent): excluded (very high cardinality, no generalizable signal).
  • Address: excluded (near-unique per row, no predictive value as a raw string).
  • Retained extreme high-value sales (luxury properties, large rural blocks); handled their influence via log scaling in feature engineering rather than dropping real data points.
  • Categorical profiling: after cleaning, object columns were summarized with describe(include=['object']) (counts, uniques, top category) to spot sparse labels before plotting.
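The cleaning decisions above can be sketched without any third-party dependency. This is only an illustration of the logic (exact-duplicate removal, numeric coercion of Price, day-first date parsing, dropping unlabeled rows); the notebook itself uses pandas, and the helper names `to_float` and `clean_rows` as well as the sample rows are hypothetical.

```python
from datetime import datetime

def to_float(value):
    """Mimic numeric coercion: anything unparseable becomes None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def clean_rows(rows):
    """Drop exact duplicates, coerce Price, parse day-first dates, drop unlabeled rows."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # keys are unique, so only keys are compared
        if key in seen:                   # exact duplicate of an earlier row
            continue
        seen.add(key)
        price = to_float(row.get("Price"))
        if price is None:                 # no target value, cannot be used for training
            continue
        out = dict(row)
        out["Price"] = price
        out["Date"] = datetime.strptime(row["Date"], "%d/%m/%Y")  # DD/MM/YYYY
        cleaned.append(out)               # feature gaps (None) are left for the pipeline
    return cleaned

# Hypothetical sample rows, for illustration only.
raw = [
    {"Suburb": "Kew", "Price": "1200000", "Date": "3/09/2016", "BuildingArea": None},
    {"Suburb": "Kew", "Price": "1200000", "Date": "3/09/2016", "BuildingArea": None},
    {"Suburb": "Toorak", "Price": "n/a", "Date": "4/02/2017", "BuildingArea": "210"},
]
cleaned = clean_rows(raw)
```

Note that the missing BuildingArea value survives cleaning untouched, which is the point of deferring imputation to the training pipeline.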

Post-cleaning snapshot (tabular):

| Metric | Value |
| --- | --- |
| Rows x Columns | 27,247 x 21 |
| Dropped rows (missing target + exact duplicates) | 6,367 |
| Date dtype | datetime64 (parsed day-first) |

| Top missing columns (post-cleaning) | Missing count | Missing % |
| --- | --- | --- |
| BuildingArea | 13,685 | 59.89% |
| YearBuilt | 12,376 | 54.16% |
| Landsize | 7,495 | 32.80% |
| Car | 5,347 | 23.40% |
| Bathroom | 5,117 | 22.39% |
| Bedroom2 | 5,113 | 22.38% |
| Lattitude | 4,949 | 21.66% |
| Longtitude | 4,949 | 21.66% |

| Object-column profile (post-cleaning) | Count | Unique | Top | Freq |
| --- | --- | --- | --- | --- |
| Suburb | 27,247 | 340 | Reservoir | 634 |
| Address | 27,247 | 22,466 | 5 Charles St | 4 |
| Type | 27,247 | 3 | h | 15,344 |
| Method | 27,247 | 5 | S | 14,881 |
| SellerG | 27,247 | 325 | Nelson | 2,372 |
| CouncilArea | 27,245 | 33 | Boroondara City Council | 2,221 |
| Regionname | 27,245 | 8 | Southern Metropolitan | 7,439 |

2.3 Summary Statistics After Cleaning

| Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Price | 27,247 | 1,056,543.22 | 646,613.71 | 85,000.00 | 637,000.00 | 880,000.00 | 1,300,000.00 | 11,200,000.00 |
| Rooms | 27,247 | 2.97 | 0.96 | 1.00 | 2.00 | 3.00 | 4.00 | 16.00 |
| Bathroom | 17,733 | 1.57 | 0.70 | 0.00 | 1.00 | 1.00 | 2.00 | 9.00 |
| Car | 17,503 | 1.67 | 0.98 | 0.00 | 1.00 | 2.00 | 2.00 | 18.00 |
| Landsize | 15,355 | 588.55 | 4,032.16 | 0.00 | 196.00 | 478.00 | 659.50 | 433,014.00 |
| BuildingArea | 9,165 | 154.10 | 479.10 | 0.00 | 97.00 | 130.00 | 178.00 | 44,515.00 |
| Distance | 27,247 | 10.92 | 6.49 | 0.00 | 6.40 | 10.20 | 13.80 | 48.10 |
| Propertycount | 27,245 | 7,533.97 | 4,487.78 | 83.00 | 4,280.00 | 6,567.00 | 10,331.00 | 21,650.00 |

2.4 Sanity Checks (Domain Rules)

Automated plausibility checks run on the cleaned data to catch scrape errors, wrong units, or bad merges before trusting aggregate charts:

  • All sale price values are non-negative: no negative prices.
  • Distance is non-negative: no properties with negative CBD distance.
  • Rooms, Bedroom2, Bathroom, Car are non-negative integers: no structural anomalies.
  • YearBuilt, where present, falls in a sensible historical range (e.g. 1800–2018).
  • Lattitude and Longtitude lie within the Victoria, Australia bounding box: no data entry errors placing properties overseas.

| Sanity Rule | Result | Notes |
| --- | --- | --- |
| Price >= 0 | PASS | No negative sale prices |
| Distance >= 0 | PASS | No negative CBD distances |
| Rooms >= 0 | PASS | No negative room counts |
| Bathroom >= 0 | PASS | No negative bathroom counts |
| YearBuilt in [1800, 2018] when present | FLAG | At least one out-of-range build year appears in raw source |
| Lattitude in Victoria bounds | PASS | Values fall inside expected range |
| Longtitude in Victoria bounds | PASS | Values fall inside expected range |

2.5 Outlier Documentation (Price / Landsize / BuildingArea)

  • Top properties table: The notebook lists the top 10 properties by Price so extreme luxury sales (multi-million-dollar mansions) are explicit, not only visible as scatter extremes.
  • Tukey IQR fences: Lower fence = Q1 − 1.5×IQR, upper fence = Q3 + 1.5×IQR. For heavily right-skewed property prices, many rows are expected to exceed the upper fence; that reflects the long-tailed, luxury-heavy structure of the Melbourne market, not bad data. The same applies to Landsize (rural blocks) and BuildingArea (mansions).
  • Decision: Keep those rows as real sales; use log scales and log-transformed features in models as needed.
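As a worked check of the fence formula, the short sketch below plugs in the Price and Landsize quartiles reported in this section. The function name `tukey_fences` is illustrative and not taken from the notebook.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_flagged(value, fences):
    """True when a value lies outside the fences (flagged, not dropped)."""
    lower, upper = fences
    return value < lower or value > upper

# Quartiles as reported in the summary statistics for this dataset.
price_fences = tukey_fences(637_000.0, 1_300_000.0)
landsize_fences = tukey_fences(196.0, 659.5)

# The top sale in the dataset (11.2M AUD) sits far above the upper fence.
top_sale_flagged = is_flagged(11_200_000.0, price_fences)
```

The negative lower fences are an artifact of the formula on right-skewed data; in practice only the upper fence flags any rows here.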

Top 10 properties by Price (post-cleaning):

| Suburb | Type | Rooms | Landsize | BuildingArea | Price |
| --- | --- | --- | --- | --- | --- |
| Brighton | h | 4 | 1,400.0 | NaN | 11,200,000 |
| Mulgrave | h | 3 | 744.0 | 117.0 | 9,000,000 |
| Canterbury | h | 5 | 2,079.0 | 464.3 | 8,000,000 |
| Hawthorn | h | 4 | 1,690.0 | 284.0 | 7,650,000 |
| Armadale | h | 4 | NaN | NaN | 7,000,000 |
| Armadale | h | 4 | NaN | NaN | 6,800,000 |
| Kew | h | 6 | 1,334.0 | 365.0 | 6,500,000 |
| Melbourne | u | 3 | NaN | NaN | 6,500,000 |
| Toorak | h | 4 | NaN | NaN | 6,460,000 |
| Middle Park | h | 5 | 553.0 | 308.0 | 6,400,000 |

IQR fence documentation:

| Feature | Q1 | Q3 | Lower Fence | Upper Fence | Outlier Rows | Outlier % |
| --- | --- | --- | --- | --- | --- | --- |
| Price | 637,000.0 | 1,300,000.0 | -357,500.0 | 2,294,500.0 | 1,088 | 4.76% |
| Landsize | 196.0 | 659.5 | -499.25 | 1,354.75 | 402 | 2.62% |
| BuildingArea | 97.0 | 178.0 | -24.50 | 299.50 | 473 | 5.16% |

📊 EDA Highlights

2.6 Property Market Overview

A. Price Distribution

Question: What does the overall distribution of Melbourne house prices look like, and how skewed is it?

Price distribution (log view)

  • Insight: Price is strongly right-skewed. The bulk of sales cluster between ~$400K and ~$1.5M, but the upper tail extends well past $5M. The log-scale view reveals a near-normal distribution, confirming that log(Price) is a natural regression target and that log-transforming heavy-tail features will better linearize their relationship with price.

B. Property Type Price Hierarchy (Median)

Question: Are houses systematically more expensive than units, and how wide is the gap?

Property type breakdown

  • Insight: Houses (h) have the highest median price, followed by townhouses (t), with units (u) at the bottom. This confirms Type as a strong predictor and supports one-hot encoding with u as a reference class in linear-style models.

2.7 Location & Geography

A. Regional Price Hierarchy

Question: Which Melbourne regions command the highest median prices?

Median price by region

  • Insight: Median prices differ substantially by region. The highest-priced regions (typically inner-city) sit well above the overall median, while outer growth corridors fall significantly below it. The spread confirms that even at a coarse regional level, location carries strong price signal, motivating both Regionname and the finer-grained Suburb as features in all models.

B. Distance from CBD vs Price

Question: Is the price gradient from the CBD linear, or does it flatten in outer suburbs?

Distance vs price

  • Insight: The relationship is negative overall (farther from the CBD generally means cheaper), but the gradient is non-linear. The steepest decline occurs in the 0–15 km inner band. Beyond ~25 km the price floor flattens. Very distant properties (>40 km) show a second cluster of moderate prices from growth-corridor developments, not a recovery of the inner premium. Wide spread at every distance confirms that Distance alone doesn't determine price: suburb quality, property type, and size all interact with it. log_distance better linearizes this gradient for regression models.

2.8 Property Size & Layout

A. Property Type vs Price Distribution

Question: How does structural size interact with the type-based price hierarchy?

Type-level pricing evidence is shown above in 2.6.B (Research Q1). This subsection extends the analysis to structural layout variables (rooms and bathrooms), where the strongest separations appear within and across property types.

B. Room Count and Bathrooms vs Price

Question: Does adding rooms and bathrooms drive price consistently across property types?

Rooms and bathrooms vs price

  • Insight: Median prices rise consistently with both room and bathroom counts, but the slope diverges sharply by property type. For houses, each additional room carries a much larger premium than for units. Properties with 4+ rooms and 2+ bathrooms sit in the upper price quartile regardless of suburb; the joint combination is more predictive than either feature alone. This motivated the rooms_x_bathrooms interaction feature and the rooms_per_bedroom and bathrooms_per_room ratios.

2.9 Correlation Analysis

Question: How do numeric features relate to each other and to Price?

Correlation heatmap

  • Insight:
    • Rooms, Bedroom2, Bathroom, Car correlate positively with Price: size and amenity features consistently point in the same direction.
    • Distance correlates negatively with Price: farther from the CBD, cheaper.
    • Rooms and Bedroom2 are strongly correlated with each other (~0.9); this multicollinearity motivates the rooms_per_bedroom ratio as a more orthogonal signal.
    • Landsize shows a weaker correlation with Price than expected: inner-city land is small but extremely expensive, while large rural blocks are cheap per m²; the log-transform addresses this non-linearity.
    • BuildingArea correlates more cleanly with Price than raw Landsize: usable floor space is a more direct value driver than total land.

📉 Part 3: Baseline Linear Regression

Goal: Establish a reproducible, leakage-free performance floor before any feature engineering. Every subsequent model must beat this baseline to justify its additional complexity.

Design Decisions

  • Feature set: all raw columns except Address, Date (after extracting year/month), and SellerG
  • Numeric pipeline: SimpleImputer(strategy='median') → StandardScaler
  • Categorical pipeline: SimpleImputer(strategy='most_frequent') → OneHotEncoder(handle_unknown='ignore')
  • All transformations inside a single ColumnTransformer, fit on train only, with zero test-set leakage
  • LinearRegression() with default parameters: no regularization, no tuning
  • 80/20 stratified split, random_state=42 for full reproducibility
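To make the "fit on train only" rule concrete, here is a dependency-free sketch of what a median imputer followed by a standard scaler does for a single numeric column. It is not the notebook's scikit-learn code; `fit_numeric`, `transform_numeric`, and the YearBuilt values are hypothetical. The population standard deviation is used because that is what StandardScaler computes.

```python
import statistics

def fit_numeric(train_values):
    """Learn the imputation median and scaling statistics from training data only."""
    observed = [v for v in train_values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in train_values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)  # population std (ddof=0)
    return {"median": median, "mean": mean, "std": std or 1.0}

def transform_numeric(values, params):
    """Apply the train-fitted median and scaling to any split."""
    filled = [params["median"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]

# Hypothetical YearBuilt values; None marks a missing entry.
train_year_built = [1950.0, 1970.0, None, 2000.0]
test_year_built = [None, 1990.0]

params = fit_numeric(train_year_built)                    # statistics come from train only
test_scaled = transform_numeric(test_year_built, params)  # test rows never influence them
```

The missing test value is filled with the training median (1970), so no statistic computed on the test split ever enters the model.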

Baseline Results

| Metric | Train | Test |
| --- | --- | --- |
| MAE | n/a | 229,318.5712 |
| MSE | n/a | 143,327,442,971.9292 |
| RMSE | n/a | 378,586.1104 |
| R² | n/a | 0.6646 |
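For reference, the four metrics can be computed from scratch as in the sketch below. The toy prices are made up for illustration; the final line only checks that the reported RMSE is the square root of the reported MSE.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # 1 - SS_res / SS_tot
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}

# Toy values (AUD), not taken from the dataset.
toy = regression_metrics([600_000, 900_000, 1_500_000],
                         [650_000, 850_000, 1_200_000])

# Consistency check on the reported baseline numbers.
reported_rmse = math.sqrt(143_327_442_971.9292)
```

Because errors are squared before averaging, the single 300K miss on the most expensive toy property dominates RMSE far more than it dominates MAE, which is the same effect luxury sales have on the baseline.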

Baseline Actual vs Predicted

Reading this plot: Points on the diagonal y=x line are perfect predictions. Points below the line are over-predictions (model said higher than reality); points above are under-predictions. The expected pattern for this dataset: a tight cluster for sub-$1M properties near the diagonal, spreading into a progressively wider band above $1.5M. The model consistently underestimates luxury properties because they are outliers the linear fit cannot represent. The 45° reference line is the "perfect model" benchmark.

Baseline Residual Diagnostics

Reading this plot: Residuals (actual − predicted) plotted against predicted price. A well-behaved linear model shows residuals randomly scattered around zero with constant spread. The expected pattern here is a fan shape: small residuals at low predicted prices ($300K–$700K) that widen dramatically above $1M. This is textbook heteroscedasticity: variance in prediction error increases with the magnitude of the target. The fan shape does not mean the model is broken; it is the natural consequence of using raw AUD price as the regression target on a right-skewed market. Two fixes: predict log(Price) and back-transform, or switch to a model family that builds local rules (Random Forest, Gradient Boosting) rather than fitting a single global line.

Key observations:

  • R² = 0.6646: the model explains approximately 66.46% of price variance. Meaningful signal is present, but a large fraction remains unexplained, consistent with a complex non-linear market.
  • Residuals show a fan shape (heteroscedasticity): errors are larger for high-priced properties. This is expected with raw price as the target, and it motivates a model family that handles non-constant variance (trees).
  • Feature importance: the coefficient plot shows that location features (particularly suburb one-hot dummies) dominate the baseline; the strongest single predictors are location-based, not property-size-based.

Top Coefficients β€” Baseline

Baseline Feature Importance

Reading this plot: Bars extending right = features that push predicted price up; bars extending left = features that push price down. The expected pattern: a handful of premium inner-city suburb dummies (Toorak, Hawthorn, South Yarra, etc.) are the largest positive bars, each adding hundreds of thousands of dollars to the prediction. A handful of outer-suburb dummies (Melton, Werribee, etc.) are the largest negative bars. Distance appears as a negative bar, consistent with the Q3 finding. Rooms and Bathroom appear as positive bars but with smaller magnitude than the location dummies, confirming that in Melbourne's property market, where you buy matters more than what you buy.

Key finding: Suburb dummies dominate the coefficient ranking. This is quantitative evidence for "location, location, location": the linear model assigns the majority of its explanatory power to suburb membership, not property size. This finding directly motivated the KMeans neighbourhood clustering in Part 4: clustering captures location signal in a form that generalises better than 300+ sparse suburb dummies, many of which appear in only a handful of training rows.


βš™οΈ Part 4: Feature Engineering

Feature engineering is the single most impactful step before model selection. Every feature created below is directly traceable to an EDA finding or a domain observation about the Melbourne property market.

4.1 Ten New Engineered Features

| Feature | Type | Source / Rationale |
| --- | --- | --- |
| sale_year | Numeric | Year extracted from Date; captures market cycle effects (2016 vs 2017 vs 2018) |
| sale_month | Numeric | Month extracted from Date; captures seasonality (spring auction peak in Melbourne) |
| sale_quarter | Numeric | Quarter extracted from Date; coarser time bucket than month |
| rooms_per_bedroom | Ratio | Rooms / Bedroom2; a high ratio means many living/utility rooms relative to bedrooms, signalling open-plan or luxury layouts |
| bathrooms_per_room | Ratio | Bathroom / Rooms; captures bathroom density, where high values signal premium fitouts |
| building_to_land_ratio | Ratio | BuildingArea / Landsize; how much of the land is built on, differentiating dense inner-city from sprawling outer properties |
| log_landsize | Log transform | log1p(Landsize); compresses the extreme right tail and brings the distribution near-normal for linear models |
| log_buildingarea | Log transform | log1p(BuildingArea); same rationale as log_landsize |
| log_distance | Log transform | log1p(Distance); linearizes the CBD-distance price gradient found in Q3 EDA |
| rooms_x_bathrooms | Interaction | Rooms × Bathroom; captures the joint premium of large, well-appointed homes (Q4 EDA) |
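The ten features can be derived row by row as in the following sketch. It is a plain-Python illustration rather than the notebook's implementation; in particular, returning None when a denominator is missing or zero is an assumption made here (Landsize and Bedroom2 can be 0 in this dataset), and the example row is hypothetical.

```python
import math
from datetime import datetime

def safe_ratio(numerator, denominator):
    """Return None instead of dividing by a missing or zero denominator (assumed policy)."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator

def safe_log1p(value):
    """log(1 + x), passing missing values through unchanged."""
    return None if value is None else math.log1p(value)

def engineer(row):
    """Build the ten engineered features for one property record."""
    sale_date = datetime.strptime(row["Date"], "%d/%m/%Y")  # day-first
    rooms, bath = row["Rooms"], row["Bathroom"]
    return {
        "sale_year": sale_date.year,
        "sale_month": sale_date.month,
        "sale_quarter": (sale_date.month - 1) // 3 + 1,
        "rooms_per_bedroom": safe_ratio(rooms, row["Bedroom2"]),
        "bathrooms_per_room": safe_ratio(bath, rooms),
        "building_to_land_ratio": safe_ratio(row["BuildingArea"], row["Landsize"]),
        "log_landsize": safe_log1p(row["Landsize"]),
        "log_buildingarea": safe_log1p(row["BuildingArea"]),
        "log_distance": safe_log1p(row["Distance"]),
        "rooms_x_bathrooms": None if bath is None else rooms * bath,
    }

# Hypothetical property, for illustration only.
example = engineer({"Date": "24/11/2017", "Rooms": 4, "Bedroom2": 3.0, "Bathroom": 2.0,
                    "Landsize": 600.0, "BuildingArea": 150.0, "Distance": 10.0})
```

Using log1p rather than log keeps zero-valued Landsize and Distance entries finite.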

4.2 Feature Evidence β€” Keep/Drop Rationale

Mutual information regression scores and chi-square statistics were computed for all features to confirm which engineered features add real signal. Key evidence is summarized directly in the ranked findings and keep/drop rationale below, so no separate screenshot table is required here.

4.3 KMeans Clustering: Neighbourhood Segmentation

Clustering was applied to capture neighbourhood-like spatial structure that raw suburb names encode imperfectly. The goal was to group properties by their location-profile signature: not by suburb boundary, but by the underlying spatial pattern of distance, density, and coordinates.

  • Clustering inputs: [Lattitude, Longtitude, Distance, Propertycount]
  • Preprocessing: SimpleImputer(median) → StandardScaler, fit on train-set clustering inputs only
  • Algorithm: KMeans(n_clusters=6, random_state=42, n_init=20)

Why k=6 in feature construction, but k=3 in silhouette validation? Part 4 feature construction used k=6 to generate the final cluster-derived features (cluster_label + cluster_dist_0..5) and to preserve a richer neighbourhood segmentation signal. In the separate validation sweep across candidate k values, the highest silhouette score was observed at k=3. Both facts are reported for transparency: k=6 was used for engineered features, while the sweep indicates k=3 as the most compact separation under silhouette.

Cluster-derived features added to each row:

| Feature | Type | Description |
| --- | --- | --- |
| cluster_label | Categorical (0–5) | Discrete cluster membership; one-hot encoded in pipeline |
| cluster_dist_0 | Continuous | Euclidean distance to centroid of cluster 0; atypicality signal |
| cluster_dist_1 | Continuous | Distance to centroid of cluster 1 |
| cluster_dist_2 | Continuous | Distance to centroid of cluster 2 |
| cluster_dist_3 | Continuous | Distance to centroid of cluster 3 |
| cluster_dist_4 | Continuous | Distance to centroid of cluster 4 |
| cluster_dist_5 | Continuous | Distance to centroid of cluster 5 |

The centroid-distance features (cluster_dist_k) add a continuous atypicality signal: a property far from its assigned cluster centroid is a spatial outlier within its cluster, a signal that can improve model predictions at the tails of the price distribution.
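A minimal sketch of how the cluster features relate to the fitted centroids: each row gets one Euclidean distance per centroid, and the label is the index of the nearest one. The two toy centroids below are hypothetical (the project uses six, in standardized 4-dimensional space), and `centroid_features` is an illustrative name.

```python
import math

def centroid_features(point, centroids):
    """Distance to every centroid plus the index of the nearest one."""
    distances = [math.dist(point, c) for c in centroids]
    label = min(range(len(centroids)), key=distances.__getitem__)
    features = {f"cluster_dist_{k}": d for k, d in enumerate(distances)}
    features["cluster_label"] = label
    return features

# Hypothetical centroids over (Lattitude, Longtitude, Distance, Propertycount),
# assumed to be already standardized with train-set statistics.
toy_centroids = [(0.0, 0.0, 0.0, 0.0), (3.0, 3.0, 0.0, 0.0)]
feats = centroid_features((0.0, 1.0, 0.0, 0.0), toy_centroids)
```

Keeping all distances, not just the nearest one, lets a downstream model see how close a property sits to neighbouring clusters as well.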

4.4 Cluster Visualization

PCA Cluster Projection

Reading this plot: PCA projects the 4-dimensional clustering space (Lattitude, Longtitude, Distance, Propertycount) into 2D. Each dot is a property, colored by its cluster assignment. Well-separated blobs = clusters with distinct location profiles. Expected pattern: an arc or gradient structure reflecting Melbourne's radial geography, with CBD-proximate clusters (inner ring) on one side, growth-corridor clusters (outer ring) on the other, and middle-ring clusters bridging the gap. Some overlap between adjacent clusters is normal and expected: suburb boundaries are fuzzy, and properties near a cluster boundary genuinely share characteristics with both groups. Tight, non-overlapping blobs would actually be suspicious: real spatial data rarely partitions perfectly in Euclidean distance.

Cluster interpretation:

| Cluster | Profile | Approximate Zone |
| --- | --- | --- |
| 0 | Transitional middle-ring profile with moderate distance-to-centroid values | Middle ring |
| 1 | Distinct location pocket with stronger separation from other centroids | Outer ring / edge corridors |
| 2 | Dense metro-like profile; dominant in the sample shown during cluster feature preview | Inner-to-middle metro |
| 3 | Higher atypicality-distance zone; captures mixed suburban profiles | Mixed transition belt |
| 4 | Peripheral cluster with larger centroid distances and broader spread | Outer suburban fringe |
| 5 | Alternative metro-suburban mix with moderate centroid distance signature | Established suburban ring |

4.5 Cluster Validation

Silhouette score and an ablation test (model performance with vs without cluster features) were run to confirm that clustering genuinely improves predictions rather than just adding noise.

| Check | Result |
| --- | --- |
| Best silhouette score (sweep k=2..8) | 0.6102 (best at k=3) |
| RMSE without cluster features | 378,586.1104 |
| RMSE with cluster features | 373,550.5124 |
| Gain from clustering | 5,035.5980 RMSE reduction |
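For readers unfamiliar with the silhouette score, the sketch below computes it from its definition: for each point, a is the mean distance to its own cluster, b is the mean distance to the nearest other cluster, and the point's score is (b − a) / max(a, b). This is an O(n²) illustration on four made-up points, not the library routine used in the sweep, and it assumes at least two clusters and no fully coincident points.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes >= 2 clusters are present."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated toy clusters.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
toy_score = silhouette_score(toy_points, [0, 0, 1, 1])
```

Scores near 1 indicate compact, well-separated clusters; the 0.6102 reported at k=3 describes compactness only and says nothing about predictive value, which is why the RMSE ablation is reported alongside it.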

4.6 Feature Engineering Impact: Isolated Proof

Same model (Linear Regression), same hyperparameters, same split:

| Stage | Features | RMSE | R² |
| --- | --- | --- | --- |
| Raw baseline (Part 3) | ~20 raw | 378,586.1104 | 0.6646 |
| Engineered (Part 4) | +10 engineered + 7 cluster | 373,550.5124 | 0.6734 |
| Gain | +17 features | -5,035.5980 | +0.0089 |

Feature engineering contributed measurable improvement even before switching model families.

4.7 Final Feature Matrix Summary

| Category | Count |
| --- | --- |
| Raw numeric (StandardScaler) | 12 |
| Engineered numeric (ratios, logs, interaction, time) | 10 |
| Cluster distance features | 6 |
| Cluster label (one-hot) | 6 |
| Categorical (one-hot encoded) | ~687 (train-split dependent) |
| Total features | ~721 (sparse, train-split dependent) |

Note: The exact one-hot dimensionality depends on which categories appear in the training split (OneHotEncoder(handle_unknown='ignore')), so totals can vary slightly across reruns/splits.


📈 Part 5: Three Improved Regression Models

All three models were trained on the full engineered feature matrix. Same pipeline structure, same stratified split, same seed. Performance differences are attributable to model architecture only.

Model Architectures

| Model | Architecture | Key Parameters |
| --- | --- | --- |
| Linear Regression (Engineered) | Global linear fit | Default, no regularization |
| Random Forest Regressor | 350 independent decision trees | max_depth=20, min_samples_leaf=2, n_jobs=-1 |
| Gradient Boosting Regressor | 300 sequential boosted trees | n_estimators=300, learning_rate=0.05, max_depth=3 |

Part 5 Results

| Model | MAE | RMSE | R² | RMSE vs Baseline |
| --- | --- | --- | --- | --- |
| Baseline LR (Part 3) | 229,318.5712 | 378,586.1104 | 0.6646 | n/a |
| Linear Regression (Engineered) | 226,190.4886 | 373,550.5124 | 0.6734 | -5,035.5980 |
| Gradient Boosting Regressor | 186,308.9222 | 325,527.3773 | 0.7520 | -53,058.7331 |
| Random Forest Regressor (WINNER) | 165,525.8822 | 301,281.4212 | 0.7876 | -77,304.6892 |

Model Comparison: RMSE and R²

Reading this plot: Each group of bars represents one model. Shorter RMSE bar = better. Taller R² bar = better. The expected pattern: baseline linear regression is tallest on RMSE and shortest on R². Random Forest should show the most dramatic improvement, with RMSE dropping from ~$378K to ~$301K (-$77K, -20%) and R² rising from 0.665 to 0.788. The Gradient Boosting bar should sit between linear and Random Forest. The visual gap between the linear and tree-based models makes the "non-linearity premium" immediately obvious without needing to read the numbers.

Winner: Actual vs Predicted

Reading this plot: Compare this to the baseline Actual vs Predicted above. The expected improvement: the cloud of points is tighter around the diagonal, especially in the $500K–$1.5M band (the bulk of the market). The luxury end (>$2M) will still show scatter, because these properties have idiosyncratic features (ocean views, heritage listing, development potential) that no tabular model captures from the available columns. The tighter diagonal indicates the Random Forest found the non-linear suburb × size interactions that the linear model averaged away.

Winner: Residual Diagnostics

Reading this plot: Compare to the baseline residual fan. The expected improvement: the fan narrows, i.e. residuals at high predicted values are smaller than the baseline, indicating the tree model handles the right tail better. Some systematic negative residuals may remain at the top end (model still underestimates the most expensive properties), but the spread should be visibly more homoscedastic than the baseline. If the fan is still wide, it signals that log-transforming the target Price would be the next meaningful improvement.

Feature Importance: Regression Winner

Regression Feature Importance

Reading this plot: Random Forest importance = average reduction in node impurity (MSE) from splitting on that feature, averaged across all 350 trees. Longer bar = feature the model relied on most. Expected top features: log_distance (CBD proximity drives price more than almost anything), select suburb dummies for the highest-value suburbs, log_buildingarea (usable floor space), rooms_x_bathrooms (the engineered interaction), and cluster_dist features (spatial atypicality). If log_distance or a suburb dummy tops the chart, it confirms the EDA finding that location dominates. If an engineered feature (e.g. building_to_land_ratio) appears above raw features, it validates that the feature engineering in Part 4 extracted real signal rather than noise.

Key observations:

  • Why Random Forest wins on this split: Random Forest achieved the lowest RMSE and highest R² among the tested regressors. Averaging 350 deep trees reduced variance while capturing non-linear interactions between location, size, and engineered ratio/log features, yielding the strongest generalization.

  • Why Gradient Boosting is a strong runner-up: Boosting still performed well and clearly outperformed linear regression by modeling non-linear effects. On this run, however, it did not beat Random Forest on RMSE.

  • Why engineered features dominate importance: In both tree models, log-transformed features (log_landsize, log_buildingarea, log_distance) rank near the top. Because tree splits are invariant to monotonic transforms, this mainly confirms that land size, floor area, and CBD distance are key price drivers; the log transform itself matters most for the linear models. The cluster_dist features also appear in the top rankings, supporting the neighbourhood segmentation approach.

Winner Declaration

Winner: Random Forest Regressor → winner_regression_model.pkl

Selection criterion: lowest RMSE with strong R² (balanced error reduction + explained variance).


πŸ† Part 6: Winning Regression Model Export

The winning regression pipeline (preprocessor + model) was serialized to pickle and uploaded to this HuggingFace repository.

import pickle

with open("winner_regression_model.pkl", "wb") as f:
    pickle.dump(winner_reg_model, f)

  • File: winner_regression_model.pkl
  • Test RMSE: 301,281.4212
  • Test R²: 0.7876
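Loading the exported file back mirrors the dump step. The round trip below uses a plain dictionary as a stand-in for the fitted pipeline (the real object is not reproduced here) and writes to a temporary directory, so the path is hypothetical. Two general cautions apply: only unpickle files from a trusted source, and load scikit-learn pickles with the same library version that produced them.

```python
import os
import pickle
import tempfile

# Stand-in object; in the project this would be the fitted preprocessing + model pipeline.
stand_in_model = {"name": "stand-in for the fitted pipeline", "test_rmse": 301281.4212}

# Write to a temporary location so the sketch does not touch real artifacts.
path = os.path.join(tempfile.mkdtemp(), "winner_regression_model.pkl")
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Reload: the restored object is what inference code would call predict() on.
with open(path, "rb") as f:
    restored = pickle.load(f)
```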


🏷️ Part 7: Regression → Classification

Why convert? A continuous price prediction is not always directly actionable for a buyer or investor. A price tier (Low / Mid / High) is. This section converts the regression target into an operationally meaningful 3-class classification problem.

7.1 Threshold Strategy: Quantile Binning

The thresholds were computed on the training set only (no leakage from test):

| Class | Label | Definition | Threshold |
| --- | --- | --- | --- |
| 0 | Low | Price ≤ 33rd percentile of train | ≤ 707,000 AUD |
| 1 | Mid | 33rd < Price ≤ 67th percentile of train | 707,000 – 1,120,000 AUD |
| 2 | High | Price > 67th percentile of train | > 1,120,000 AUD |

Why quantiles? Quantile thresholds produce near-balanced classes (~33% each), which avoids the class imbalance problem that business-rule thresholds (e.g., fixed dollar cutoffs) would create. Balanced classes mean the classifier can learn all three tiers equally well without requiring class-weight corrections.
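The binning rule can be sketched as follows. `assign_tier` applies the two reported cutoffs; `fit_thresholds` shows how such cutoffs would be learned from training prices only, using a simple linear-interpolation percentile that is an assumption here and may differ slightly from the method used in the notebook. All function names and the toy price list are illustrative.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) on pre-sorted data."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def fit_thresholds(train_prices):
    """Learn the 33rd/67th percentile cutoffs from the training split only."""
    ordered = sorted(train_prices)
    return percentile(ordered, 33), percentile(ordered, 67)

def assign_tier(price, low_cut, high_cut):
    """Map a price to 0 (Low), 1 (Mid) or 2 (High); boundaries are inclusive below."""
    if price <= low_cut:
        return 0
    if price <= high_cut:
        return 1
    return 2

# Cutoffs as reported for the training set in this section (AUD).
LOW_CUT, HIGH_CUT = 707_000, 1_120_000
tiers = [assign_tier(p, LOW_CUT, HIGH_CUT)
         for p in (650_000, 707_000, 880_000, 1_120_000, 2_500_000)]

# Toy training prices to show threshold fitting.
toy_cuts = fit_thresholds([100, 200, 300, 400, 500, 600, 700])
```

The same fixed cutoffs are then applied to the test split, which is why test-set class shares are close to, but not exactly, one third each.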

7.2 Class Balance

Class Balance: Train vs Test

| Class | Train Count | Train % | Test Count | Test % |
| --- | --- | --- | --- | --- |
| 0 (Low) | 7,269 | 33.35% | 1,840 | 33.76% |
| 1 (Mid) | 7,298 | 33.48% | 1,785 | 32.75% |
| 2 (High) | 7,230 | 33.17% | 1,825 | 33.49% |
| Imbalance ratio (max/min) | n/a | n/a | 1.03 | n/a |

Near-balanced classes (~33% each), so no class-weight correction is required.

7.3 Metric Priority for Part 8

Why macro-F1 over accuracy? With near-balanced classes, accuracy is informative, but a model could still perform well overall while systematically failing on one class (for example, always misclassifying mid-tier properties). Macro-F1 averages F1 equally across all three classes, so every tier must be well-predicted for the score to be high.

  • Primary metric: Macro-F1
  • Secondary metric: Accuracy
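A small from-scratch example shows how macro-F1 exposes a weak class that accuracy partly hides. The labels below are made up: the classifier gets Low and High entirely right but sends two of three Mid properties to Low.

```python
def f1_per_class(y_true, y_pred, classes):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# Toy labels: 0 = Low, 1 = Mid, 2 = High.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 2, 2, 2]

per_class = f1_per_class(y_true, y_pred, classes=(0, 1, 2))
macro_f1 = sum(per_class.values()) / len(per_class)  # unweighted mean over classes
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here accuracy is about 0.78 while the Mid class scores only 0.50 F1, pulling macro-F1 down to 0.75; the per-class breakdown makes the failing tier explicit.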

Precision vs Recall trade-off: For a property buyer, a False Negative (missing a high-value property in the High tier) is more costly than a False Positive (labeling a mid-tier property as high). For a seller, the reverse holds. Since both buyer and seller perspectives matter equally, neither precision nor recall is systematically weighted, and the balanced macro-F1 reflects this.
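To make the macro-F1 argument concrete, here is a toy sketch in plain Python. The labels are invented for illustration: a classifier that never predicts Mid still reaches 67% accuracy, but its macro-F1 drops to about 0.53 because the Mid class contributes an F1 of zero.

```python
def f1_for_class(y_true, y_pred, label):
    """One-vs-rest F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)

# Hypothetical predictions: Low and High are perfect, Mid is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 4), round(macro_f1(y_true, y_pred), 4))
```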


🧠 Part 8: Train & Evaluate Classification Models

8.1 Precision vs Recall: Context

For this housing price classification task:

  • False Positive (predicting High when actually Mid): the buyer spends attention on a property that is not premium, which is an opportunity cost.
  • False Negative (predicting Mid when actually High): buyer misses a premium property.

In an equal-weight framing (no specific business rule privileging buyers or sellers), macro-F1 is the right primary metric. Per-class recall on the High (2) class is a key secondary metric: missing premium properties is the most visible failure mode and the one most users care about.

8.2 Three Classification Models

| Model | Architecture | Key Parameters |
| --- | --- | --- |
| Logistic Regression | Global linear decision boundaries | max_iter=2000, random_state=42 |
| Random Forest Classifier | 350 independent decision trees | max_depth=18, min_samples_leaf=2, n_jobs=-1 |
| Gradient Boosting Classifier | 250 sequential boosted trees | n_estimators=250, learning_rate=0.05, max_depth=3 |

All models used the same ColumnTransformer pipeline from Part 4, fit on train only.
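A hedged sketch of how the three classifiers might be wrapped in a shared pipeline with the parameters from the table. Several things here are assumptions: the synthetic data, the `StandardScaler` standing in for the Part 4 ColumnTransformer, `n_estimators=350` for the Random Forest (inferred from "350 independent decision trees"), and `random_state=42` on the tree models (inferred from the "random_state=42 throughout" design decision).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the engineered Melbourne feature matrix.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=2000, random_state=42),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=350, max_depth=18, min_samples_leaf=2,
        n_jobs=-1, random_state=42),
    "Gradient Boosting Classifier": GradientBoostingClassifier(
        n_estimators=250, learning_rate=0.05, max_depth=3, random_state=42),
}

# Every model gets the same preprocessing step, fit together with the model
# on the training data only. StandardScaler is a placeholder preprocessor.
pipelines = {
    name: Pipeline([("preprocessor", StandardScaler()), ("model", model)])
    for name, model in models.items()
}
for pipe in pipelines.values():
    pipe.fit(X, y)

print({name: pipe.predict(X[:3]).tolist() for name, pipe in pipelines.items()})
```

Keeping the preprocessor inside each pipeline means a later `predict` call applies the statistics learned at fit time, which is what keeps the test split out of the fitted transformers.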

8.3 Evaluation Results

| Model | Accuracy | Macro-F1 | Weighted-F1 | ROC-AUC (OvR, macro) |
| --- | --- | --- | --- | --- |
| Logistic Regression | 0.7932 | 0.7925 | 0.7933 | 0.9282 |
| Random Forest Classifier | 0.7934 | 0.7917 | 0.7926 | 0.9304 |
| Gradient Boosting Classifier (WINNER) | 0.7963 | 0.7952 | 0.7961 | 0.9289 |

Model Comparison: Accuracy

Reading this plot: Three bars, one per model. All three cluster tightly in the 0.79–0.80 range; with near-balanced classes (~33% each), all models have a reasonable accuracy floor. The differences between bars are small in absolute terms (about 0.003) but meaningful: even a 0.3pp accuracy gain on a large portfolio translates to fewer misclassified properties. Gradient Boosting edges out the others.

Model Comparison: Macro-F1

Reading this plot: Macro-F1 is the primary metric; it averages F1 equally across all three price classes. A model that performs well on Low and High but poorly on Mid will show a lower macro-F1 than its accuracy suggests. Per the results table, the bars fall between 0.79 and 0.80, with Gradient Boosting highest (0.7952) and Random Forest lowest (0.7917), just below Logistic Regression (0.7925). The tight clustering confirms that all three models are genuinely competitive: the dataset is rich enough that even a linear classifier captures most of the signal.
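The four columns of the results table map onto standard scikit-learn metric calls. The sketch below runs them on synthetic data with a single stand-in classifier, so the printed numbers are not the ones reported above; only the metric definitions carry over.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data as a placeholder for the price-tier problem.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)  # shape (n_samples, 3), needed for ROC-AUC

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "macro_f1": f1_score(y_test, y_pred, average="macro"),
    "weighted_f1": f1_score(y_test, y_pred, average="weighted"),
    # One-vs-rest AUC per class, averaged with equal class weight.
    "roc_auc_ovr_macro": roc_auc_score(
        y_test, y_proba, multi_class="ovr", average="macro"),
}
print({name: round(value, 4) for name, value in scores.items()})
```

ROC-AUC is computed from predicted probabilities rather than hard labels, which is why it can rank models differently from accuracy and macro-F1, as it does for Random Forest in the table.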

8.4 Classification Reports

Logistic Regression

| Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| Low (0) | 0.8311 | 0.8500 | 0.8405 | 1840 |
| Mid (1) | 0.6871 | 0.6913 | 0.6892 | 1785 |
| High (2) | 0.8592 | 0.8356 | 0.8472 | 1825 |
| Macro avg | 0.7925 | 0.7923 | 0.7925 | 5450 |
| Weighted avg | 0.7934 | 0.7932 | 0.7933 | 5450 |

Random Forest Classifier

| Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| Low (0) | 0.8219 | 0.8582 | 0.8397 | 1840 |
| Mid (1) | 0.7012 | 0.6784 | 0.6896 | 1785 |
| High (2) | 0.8531 | 0.8405 | 0.8467 | 1825 |
| Macro avg | 0.7921 | 0.7923 | 0.7917 | 5450 |
| Weighted avg | 0.7928 | 0.7934 | 0.7926 | 5450 |

Gradient Boosting Classifier

| Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| Low (0) | 0.8267 | 0.8582 | 0.8421 | 1840 |
| Mid (1) | 0.6986 | 0.6908 | 0.6946 | 1785 |
| High (2) | 0.8608 | 0.8373 | 0.8489 | 1825 |
| Macro avg | 0.7954 | 0.7954 | 0.7952 | 5450 |
| Weighted avg | 0.7961 | 0.7963 | 0.7961 | 5450 |

8.5 Confusion Matrices

Logistic Regression

Confusion Matrix: Logistic Regression

Reading confusion matrices: Rows = actual class, columns = predicted class. The diagonal (top-left to bottom-right) shows correct predictions; a darker diagonal means a better model. Off-diagonal cells are errors. Expected dominant pattern across all three models: the Mid (1) class bleeds into both Low (0) and High (2). A property priced just above the q33 threshold is nearly indistinguishable from one just below it, so the model hedges. Low↔High confusions (top-right and bottom-left corners) should be rare: a $400K property looks nothing like a $2M property in the feature space.

Main confusion pattern: Most errors are boundary mistakes where actual Mid is predicted as Low or High; direct Low↔High confusion is limited.

Random Forest Classifier

Confusion Matrix: Random Forest

Main confusion pattern: Errors are concentrated around the Mid class boundaries (Mid→Low / Mid→High), with few extreme Low↔High swaps.

Gradient Boosting Classifier

Confusion Matrix: Gradient Boosting

Main confusion pattern: The winner still mainly confuses near-threshold Mid homes with adjacent tiers, while preserving strong separation of Low vs High.
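The "rows = actual, columns = predicted" convention and the split between adjacent-tier and extreme errors can be sketched in a few lines of plain Python. The labels below are invented for illustration and the helper names are hypothetical.

```python
def confusion_matrix_3(y_true, y_pred, n_classes=3):
    """matrix[actual][predicted] counts; rows are actual classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

def error_breakdown(matrix):
    """Split errors into adjacent-tier mistakes and Low<->High swaps."""
    adjacent = matrix[0][1] + matrix[1][0] + matrix[1][2] + matrix[2][1]
    extreme = matrix[0][2] + matrix[2][0]
    return {"adjacent": adjacent, "extreme": extreme}

# Hypothetical labels: 0 = Low, 1 = Mid, 2 = High.
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2, 2, 2, 0]

cm = confusion_matrix_3(y_true, y_pred)
for row in cm:
    print(row)
print(error_breakdown(cm))
```

On the real matrices, a large `adjacent` count next to a small `extreme` count is the boundary-confusion pattern described for all three models.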

8.6 Feature Importance β€” Classification Models

Classification Feature Importance

Key findings from feature importance:

  • Location features (suburb dummies, cluster features, Distance) consistently rank highly, confirming that "location, location, location" applies quantitatively, not just qualitatively.
  • Engineered ratio features (rooms_per_bedroom, bathrooms_per_room, building_to_land_ratio) appear in the top rankings, validating that structural efficiency captures price signal that raw room counts alone do not.
  • log_landsize and log_buildingarea rank strongly, confirming the importance of the log transform for compressing the right-skewed size distributions.
  • cluster_dist features appear in both tree model rankings, confirming that neighbourhood atypicality (distance from cluster centroid) adds genuine signal beyond the spatial cluster membership alone.

8.7 Winner Declaration

Winner: Gradient Boosting Classifier → winner_classification_model.pkl

Selected by: highest Macro-F1 (primary) + highest accuracy (tiebreaker).

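How tree models compare with Logistic Regression: The Melbourne property market is highly non-linear: suburb-level effects, interactions between property type and location, and the non-linear distance gradient all favor a model that can discover complex decision boundaries. Tree-based ensembles discover these boundaries automatically; logistic regression can only model them if the relevant features are explicitly engineered. The measured margin is small, though: Gradient Boosting leads Logistic Regression by about 0.003 Macro-F1 (0.7952 vs 0.7925), while Random Forest (0.7917) sits marginally below it, which is consistent with the engineered features already exposing most of the signal to the linear model.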

Why Gradient Boosting beats Random Forest: Gradient Boosting scores higher on Macro-F1 (0.7952 vs 0.7917) and accuracy (0.7963 vs 0.7934), although Random Forest has the slightly higher ROC-AUC (0.9304 vs 0.9289). A plausible explanation is sequential error correction, which focuses each tree on the properties previous trees misclassified. For this price-tier classification task, the hardest-to-classify properties are the mid-tier ones near the class boundaries, and boosting concentrates on those boundary cases, whereas Random Forest averages independently grown trees.


📊 Final Evaluation

Key Results Summary

| Milestone | Metric | Value |
| --- | --- | --- |
| Baseline Linear Regression | RMSE | 378,586.1104 |
| Baseline Linear Regression | R² | 0.6646 |
| After Feature Engineering (same model) | RMSE | 373,550.5124 |
| After Feature Engineering (same model) | R² | 0.6734 |
| Best Regression Model | RMSE | 301,281.4212 |
| Best Regression Model | R² | 0.7876 |
| Regression → Classification | Class 0 threshold | 707,000 AUD |
| Regression → Classification | Class 2 threshold | 1,120,000 AUD |
| Best Classification Model | Macro-F1 | 0.7952 |
| Best Classification Model | Accuracy | 0.7963 |
| Best Classification Model | ROC-AUC (OvR macro) | 0.9289 |

🚀 How to Load and Use the Models

import pickle
import numpy as np
import pandas as pd

# Load regression model (continuous price prediction)
with open("winner_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)

# Load classification model (3-class price tier)
with open("winner_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# Both models expect the same engineered feature matrix from Part 4
# X_new must be a DataFrame with the same columns as X_train_fe
# (raw + engineered columns; DO NOT pre-scale, the pipeline handles it)

# Regression: continuous price in AUD
y_price = reg_model.predict(X_new)

# Classification: discrete price tier
y_tier = clf_model.predict(X_new)
y_proba = clf_model.predict_proba(X_new)

tier_map = {0: "Low", 1: "Mid", 2: "High"}
tier_labels = [tier_map[t] for t in y_tier]

print("Predicted prices:", y_price[:5])
print("Predicted tiers:", tier_labels[:5])
print("Class probabilities (Low/Mid/High):")
print(np.round(y_proba[:5], 3))

Important notes:

  • X_new must include all engineered columns (sale_year, sale_month, sale_quarter, rooms_per_bedroom, bathrooms_per_room, building_to_land_ratio, log_landsize, log_buildingarea, log_distance, rooms_x_bathrooms) and the clustering columns (cluster_label, cluster_dist_0 through cluster_dist_5).
  • The KMeans clustering was fit on training data. For truly new properties, you will need to save and reload the clustering artifacts (imputer, scaler, kmeans object) alongside the model pipeline.
  • The classification thresholds (q33 = 707,000 AUD, q67 = 1,120,000 AUD) were derived from the training distribution. A property near a threshold boundary will have high uncertainty; use predict_proba to assess confidence.
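The note about clustering artifacts can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the coordinates are synthetic, the median imputation strategy is assumed, and the artifact dictionary layout is hypothetical. What it shows is the general pattern of fitting the imputer, scaler and KMeans on training data, persisting all three together, and reusing them to build `cluster_label` and the six `cluster_dist_*` columns for new rows.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic (latitude, longitude) pairs standing in for the training set.
train_coords = rng.normal(loc=[-37.8, 145.0], scale=0.1, size=(200, 2))

# Fit every clustering artifact on training data only.
imputer = SimpleImputer(strategy="median").fit(train_coords)
scaler = StandardScaler().fit(imputer.transform(train_coords))
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42).fit(
    scaler.transform(imputer.transform(train_coords)))

# Persist the three artifacts together, then reload them as a new session would.
blob = pickle.dumps({"imputer": imputer, "scaler": scaler, "kmeans": kmeans})
artifacts = pickle.loads(blob)

def cluster_features(coords, artifacts):
    """Return (cluster_label, distances to each of the 6 centroids)."""
    scaled = artifacts["scaler"].transform(artifacts["imputer"].transform(coords))
    labels = artifacts["kmeans"].predict(scaled)
    dists = artifacts["kmeans"].transform(scaled)  # one column per centroid
    return labels, dists

# Two new properties; the second has a missing latitude that gets imputed.
new_coords = np.array([[-37.81, 144.96], [np.nan, 145.10]])
cluster_label, cluster_dist = cluster_features(new_coords, artifacts)
print(cluster_label, np.round(cluster_dist, 3))
```

Column `k` of `cluster_dist` would correspond to `cluster_dist_k`, provided the same fitted KMeans object is used at training and inference time.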

🔎 Part 8 Additional Analysis: Classification Diagnostics Upgrade

Beyond the core classification report and confusion matrix, the notebook includes additional diagnostic visualizations for a complete picture of each model's behavior.

Per-Class Precision, Recall, F1 Summary

| Model | Class | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- |
| Logistic Regression | Low (0) | 0.8311 | 0.8500 | 0.8405 |
| Logistic Regression | Mid (1) | 0.6871 | 0.6913 | 0.6892 |
| Logistic Regression | High (2) | 0.8592 | 0.8356 | 0.8472 |
| Random Forest | Low (0) | 0.8219 | 0.8582 | 0.8397 |
| Random Forest | Mid (1) | 0.7012 | 0.6784 | 0.6896 |
| Random Forest | High (2) | 0.8531 | 0.8405 | 0.8467 |
| Gradient Boosting | Low (0) | 0.8267 | 0.8582 | 0.8421 |
| Gradient Boosting | Mid (1) | 0.6986 | 0.6908 | 0.6946 |
| Gradient Boosting | High (2) | 0.8608 | 0.8373 | 0.8489 |

Key pattern: The Mid (1) class is consistently the hardest to classify correctly across all three models. This is expected: properties priced near the q33 and q67 thresholds share characteristics with both adjacent classes. A property priced at exactly the q33 boundary is almost equally likely to be genuinely Low or genuinely Mid, because the signal is weakest at the class boundary. This boundary-region ambiguity is irreducible without additional features that distinguish near-boundary properties more sharply.

Precision–Recall Curves (Class 2: High Price)

PR Curve: High Price Class

The PR curve for the High (2) class shows the trade-off between the fraction of High-tier properties correctly identified (recall) and the fraction of predicted High-tier properties that are genuinely high (precision). At high recall, which corresponds to low decision thresholds, all models accept more false positives from the Mid class. The area under the PR curve indicates whether each model maintains useful precision while catching most High-tier properties.
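A one-vs-rest PR curve for the High class can be computed along these lines. The data and the classifier are synthetic stand-ins, so this only illustrates the mechanics: take the predicted probability of class 2, binarize the labels as "High vs everything else", and pass both to scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic 3-class data; class 2 plays the role of the High tier.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)

# Column 2 of predict_proba is P(class == 2) because clf.classes_ is [0, 1, 2].
high_scores = clf.predict_proba(X_test)[:, 2]
is_high = (y_test == 2).astype(int)

precision, recall, thresholds = precision_recall_curve(is_high, high_scores)
ap_high = average_precision_score(is_high, high_scores)
print(len(thresholds), round(ap_high, 4))
```

`average_precision_score` is used here as the summary of the area under the PR curve; whether the notebook uses that or a trapezoidal area is not stated in this section.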

Regression Diagnostics Upgrade: Improvement Table

| Model | RMSE delta vs baseline | R² delta vs baseline |
| --- | --- | --- |
| Linear Regression (Engineered) | -5,035.5980 | +0.0089 |
| Gradient Boosting Regressor | -53,058.7331 | +0.0874 |
| Random Forest Regressor (Winner) | -77,304.6892 | +0.1230 |

This table makes the engineering and model-selection gains directly comparable. A negative RMSE_delta means the model improved over baseline; a positive R2_delta means it explains more variance. The table was included to provide transparent evidence that each modeling step added genuine value rather than overfitting to the training set.
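The deltas follow directly from the scores in the Key Results Summary. The sketch below recomputes them for the two models whose raw RMSE and R² appear in that summary; the Gradient Boosting Regressor is left out because its raw scores are not listed there. Names such as `improvement` are illustrative.

```python
# Scores copied from the Key Results Summary table.
baseline = {"rmse": 378_586.1104, "r2": 0.6646}
model_scores = {
    "Linear Regression (Engineered)": {"rmse": 373_550.5124, "r2": 0.6734},
    "Random Forest Regressor (Winner)": {"rmse": 301_281.4212, "r2": 0.7876},
}

def improvement(scores, baseline_scores):
    """Negative rmse_delta and positive r2_delta both mean improvement."""
    return {
        "rmse_delta": scores["rmse"] - baseline_scores["rmse"],
        "r2_delta": scores["r2"] - baseline_scores["r2"],
    }

improvements = {name: improvement(s, baseline) for name, s in model_scores.items()}
for name, d in improvements.items():
    print(f"{name}: RMSE delta {d['rmse_delta']:+,.4f}, R2 delta {d['r2_delta']:+.4f}")
```

Recomputing from the rounded R² values gives +0.0088 for the engineered linear model rather than the +0.0089 in the table, presumably because the table was built from unrounded scores.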


🎯 Strategic Takeaways

  1. Location dominates every other signal. Suburb membership and distance from the CBD together explain more variance than all property-size features combined. A 2-bedroom unit in South Yarra will outsell a 4-bedroom house in Melton. The coefficient and importance rankings across every model confirm this: location features occupy the top of every chart. For buyers: suburb selection is the single highest-leverage decision. For sellers: pricing against comparable suburb sales matters far more than the number of rooms listed.

  2. The luxury segment is structurally different. Properties above ~$2M consistently have the largest prediction errors across all models. These properties trade on idiosyncratic features (ocean views, heritage listing, land-banking potential, architect design) that are not captured in any column of this dataset. The implication: the models are deployment-ready for the mainstream market ($400K–$1.5M) but should be used with caution for ultra-premium listings where domain expertise is irreplaceable.

  3. Spring auctions drive a seasonal price premium. sale_month was engineered as a predictive feature specifically because Melbourne's auction market has a well-documented spring peak (September–November). Properties listed in this window attract more bidders, driving competitive clearing prices above the annual median. The feature importance plots confirm this time signal adds genuine model value beyond what property characteristics alone explain.

  4. Structural efficiency matters more than raw size. building_to_land_ratio and bathrooms_per_room consistently outrank raw Landsize and BuildingArea in the tree model importance rankings. A compact, well-fitted inner-city property on a small block commands more per square metre than a large house on an oversized block in an outer suburb. Buyers optimising for value-per-dollar should focus on bathroom count and building coverage, not headline land size.

  5. A unit is not just a cheaper house; it is a different product. The type dummy (h vs u vs t) is one of the strongest features in the baseline model. The price gap between a house and a unit of identical room count in the same suburb is not explained by size alone: it reflects the land-ownership premium and the restriction on capital growth that unit ownership carries in Melbourne. This structural gap means buyers and investors should price units and houses on separate mental models.

  6. Tree models are non-negotiable for this problem. The jump from linear regression (RMSE $378K, R² 0.665) to Random Forest (RMSE $301K, R² 0.788) is not a tuning win; it is a fundamental architectural win. Melbourne property prices are determined by hundreds of suburb-level and property-type interactions that no global linear equation can capture. Any production pricing model for this market should use an ensemble method as its minimum viable architecture.


⚠️ Limitations

  • Data coverage ends 2018. The Melbourne property market shifted significantly after 2018: record-low interest rates (2020–2021) drove prices to historic highs; the 2022–2023 rate hiking cycle reversed much of that growth. Models trained on 2016–2018 data will systematically underestimate 2020–2021 prices and may overestimate 2023–2024 prices. Do not use for current market valuations without retraining on recent data.

  • 300+ suburb one-hot categories create sparse, noisy features. Suburbs with fewer than ~20 sales in the training set have unreliable one-hot coefficient estimates, because the model has too few examples to learn a stable suburb premium. These small-suburb dummies add dimensionality without adding reliable signal. A better approach: group low-frequency suburbs into an "other" category or use target encoding with cross-validation.

  • BuildingArea missing 47% of rows. Nearly half of all predictions for the regression target rely on the median imputation for building area, the most physically important continuous feature for price-per-square-metre analysis. The imputed median is a reasonable central estimate, but it suppresses the signal from this feature for a large fraction of the dataset.

  • No macroeconomic features. Interest rates, Melbourne population growth, immigration levels, vacancy rates, and housing supply pipeline are documented major drivers of Melbourne property prices, yet none are present in the dataset. The models capture the cross-sectional structure (which property type and location commands a premium) but not the time-series level (whether the whole market is rising or falling).

  • Geographic data used only for clustering. Lattitude and Longtitude feed the KMeans clustering but are not used as direct model features. A spatial regression approach (e.g. geographically weighted regression or a spatial lag term) could extract more precise location signal than the coarse 6-cluster discretisation.

  • All relationships are correlational. The models identify which features are associated with higher prices; they cannot tell us which features cause higher prices. Suburb prestige and building quality are correlated with many unmeasured factors (school catchment, public transport access, heritage character) that drive both the observed feature values and the price outcomes. Feature importance ≠ causal importance.
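The suburb-grouping idea mentioned in the limitations above can be sketched in plain Python. The function name, the `"Other"` label, and the sample suburbs are illustrative; the cutoff of 20 mirrors the "fewer than ~20 sales" figure in the text. The frequent set is learned from training data only, so unseen suburbs at inference time also fall into the catch-all category.

```python
from collections import Counter

def fit_rare_grouper(train_values, min_count=20, other_label="Other"):
    """Learn which categories are frequent enough to keep their own dummy."""
    counts = Counter(train_values)
    frequent = {value for value, n in counts.items() if n >= min_count}

    def encode(values):
        # Rare and never-seen categories both collapse into other_label.
        return [v if v in frequent else other_label for v in values]

    return frequent, encode

# Hypothetical training suburbs with very different sale counts.
train_suburbs = ["Richmond"] * 30 + ["Reservoir"] * 25 + ["Melton"] * 5
frequent, encode = fit_rare_grouper(train_suburbs, min_count=20)

new_suburbs = ["Richmond", "Melton", "Some New Suburb"]
encoded = encode(new_suburbs)
print(sorted(frequent), encoded)
```

Applied before one-hot encoding, this would reduce the number of suburb columns to the frequent suburbs plus one, at the cost of losing any real premium carried by the grouped suburbs.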


🚀 Live Demo

What it is

A live interactive demo lets anyone type in a property's attributes and get an instant price prediction and tier classification without opening a notebook or writing code. It runs on HuggingFace Spaces (free hosting) powered by Gradio (the same library used in most HuggingFace demos).

Live Demo


📦 Requirements

pandas>=1.3
numpy>=1.21
scikit-learn>=1.0
matplotlib>=3.4
seaborn>=0.11
plotly>=5.0
scipy>=1.7

Key Design Decisions

| Decision | Justification |
| --- | --- |
| Keep outliers in EDA | Extreme Melbourne prices are real market data (luxury, rural-block); removing them would bias the model against the tails |
| Log-transform Landsize/BuildingArea/Distance | Reduces right-skew by >90%; better linearizes these features for regression models |
| Defer all imputation to sklearn pipeline | Prevents any form of test-set information leaking into imputed train values |
| Fit all transformers on train only | Standard leakage-prevention practice; ColumnTransformer enforces this |
| Use handle_unknown='ignore' in OneHotEncoder | Suburbs in the test set may not appear in training; ignoring unseen categories prevents crashes |
| KMeans k=6 for feature construction | Used to create richer cluster-derived features (cluster_label + distances); separate silhouette sweep showed best compactness at k=3 |
| Quantile-based classification thresholds | Produces near-balanced classes (~33% each) without requiring class-weight correction |
| Macro-F1 as primary classification metric | Equal penalty for ignoring any price class; prevents a model from collapsing to the majority class |
| RMSE as primary regression metric | Penalizes large errors more than MAE; appropriate for property prices where a $500k prediction miss is much worse than a $50k miss |
| random_state=42 throughout | Full reproducibility: any reader can run the notebook and get identical results |
| 80/20 train/test split | Standard proportion for a dataset of this size; 20% gives a large enough test set for stable metric estimates |
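The RMSE-versus-MAE rationale in the table can be checked with a short worked example. The error values are made up: nine predictions miss by $50k and one misses by $500k. MAE averages them to $95k, while RMSE rises to roughly $165k because the single large miss is squared before averaging.

```python
import math

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error; large misses dominate after squaring."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical prediction errors in AUD.
mixed_errors = [50_000] * 9 + [500_000]
uniform_errors = [95_000] * 10  # same MAE, no outlier

print(mae(mixed_errors), round(rmse(mixed_errors), 1))
print(mae(uniform_errors), round(rmse(uniform_errors), 1))
```

Both error sets have the same MAE, but only the set with the $500k miss is penalized by RMSE, which is the behaviour the design decision relies on.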

Itay Morag
