🏠 Melbourne House Price Prediction — Regression, Clustering & Classification

Dataset: Melbourne Housing Market — Kaggle (Full Version)

Navigation

Project Overview
Dataset Description
Part 2: EDA — Data Integrity & Cleaning
EDA Highlights — Research Questions
Part 3: Baseline Linear Regression
Part 4: Feature Engineering & Clustering
Part 5: Three Improved Regression Models
Part 6: Winning Regression Model Export
Part 7: Regression -> Classification
Part 8: Classification Models & Evaluation
Final Evaluation
Strategic Takeaways
Limitations
Live Demo
How to Load and Use the Models
Requirements

Presentation

📌 Project Overview

This project builds a complete, end-to-end machine learning pipeline to predict Melbourne house prices using a rich real-world property dataset. The pipeline moves from raw data through exploratory analysis, feature engineering, unsupervised clustering, regression modeling, and multi-class classification — ending with two production-ready models exported for deployment on HuggingFace.

Research Question:

Given property characteristics, location attributes, and sale metadata available at listing time, can we accurately predict a Melbourne house's sale price — and assign each property to a meaningful price tier (Low / Mid / High)?

Why this matters: Real estate is one of the most consequential financial decisions a household makes. In Melbourne, one of the world's most expensive property markets, understanding what drives house prices is critical for buyers, sellers, and investors alike. A reliable price-prediction model can help buyers set realistic budgets, help sellers price competitively, and help analysts identify market segments. The key challenge is building a model without data leakage and with transparent, interpretable feature engineering grounded in real property market logic.

📓 View the Notebook

📊 Dataset Description

Property	Value
Source	Kaggle — Melbourne Housing Market (Full Version, anthonypino)
File	`Melbourne_housing_FULL.csv`
Raw columns	21 features (including target)
Target	`Price` — sale price in AUD
Task	Regression (price prediction) + Classification (price tier)
Geography	Metropolitan Melbourne, Australia
Time period	2016–2018

📋 Raw Feature Dictionary — All 21 Original Columns

The raw dataset contains 21 columns. Below is every column, its type, description, and modeling decision.

Identifiers — Excluded or Used with Care

Column	Type	Description	Decision
`Address`	object	Full property street address	❌ too granular; near-unique per row — no predictive value as-is
`SellerG`	object	Real estate agency or selling agent name	❌ very high cardinality; excluded to avoid noise

Target Variable

Column	Type	Description	Decision
`Price`	float64	House sale price in AUD — right-skewed	✅ regression target; log-transform considered for modeling

Property Characteristics

Column	Type	Description	Decision
`Rooms`	int64	Total number of rooms	✅ strong predictor; used in engineered ratio features
`Type`	object	Property type: `h`=house/cottage/villa/terrace, `u`=unit/duplex, `t`=townhouse	✅ one-hot encoded — strong price signal
`Bedroom2`	float64	Number of bedrooms (scraped from a different source)	✅ used in `rooms_per_bedroom` ratio
`Bathroom`	float64	Number of bathrooms	✅ used in `bathrooms_per_room` ratio and interaction feature
`Car`	float64	Number of car spaces	✅ included in baseline numeric features
`Landsize`	float64	Land area in square metres — right-skewed, many outliers	✅ log-transformed → `log_landsize`; raw value retained for IQR analysis
`BuildingArea`	float64	Building footprint in square metres — right-skewed	✅ log-transformed → `log_buildingarea`; used in `building_to_land_ratio`
`YearBuilt`	float64	Year the property was built — many missing values	✅ included in baseline; imputed via median in pipeline

Location Features

Column	Type	Description	Decision
`Suburb`	object	Suburb name, high cardinality (340 unique after cleaning)	✅ one-hot encoded (handle_unknown=ignore); high-signal location feature
`Postcode`	float64	Postcode of the property	✅ included as numeric feature
`Distance`	float64	Distance from Melbourne Central Business District (km)	✅ strong price predictor; log-transformed → `log_distance`; used in clustering
`Regionname`	object	General region in Melbourne (8 regions)	✅ one-hot encoded
`CouncilArea`	object	Governing council area — lower cardinality than Suburb	✅ one-hot encoded
`Propertycount`	float64	Number of properties in the suburb at time of sale	✅ proxy for suburb density; used in clustering
`Lattitude`	float64	Property latitude	✅ used in KMeans clustering only
`Longtitude`	float64	Property longitude	✅ used in KMeans clustering only

Sale Metadata

Column	Type	Description	Decision
`Method`	object	Sale method: S=sold, PI=passed in, SA=sold after, SP=sold prior, VB=vendor bid	✅ one-hot encoded
`Date`	object	Date of sale — parsed to datetime (day-first format)	✅ decomposed into `sale_year`, `sale_month`, `sale_quarter`

🔍 Part 2: Exploratory Data Analysis

Raw housing data is noisy and inconsistent. These steps were taken to make it analysis-ready.

2.1 Initial State

Duplicate rows possible in the scraped dataset
Price stored as object dtype in some versions; needs numeric coercion
Date stored as string in day-first format (DD/MM/YYYY) — requires explicit parsing
Significant missingness on structural columns: BuildingArea (~60%), YearBuilt (~54%), Landsize (~33%), Car (~23%), Bedroom2 (~22%)
Strong numerical outliers in Price (luxury properties >$5M), Landsize (rural blocks >10,000 m²), and BuildingArea (implausible extremes)
SellerG (agent name) has very high cardinality with no generalizable price signal

Missingness reporting: Both raw counts and percentage of rows per column were printed so sparse columns (e.g. BuildingArea, YearBuilt) are easy to compare at a glance.

Pre-cleaning snapshot (tabular):

Metric	Value
Rows x Columns	34,857 x 21
Numeric columns	13
Text/Categorical columns	8

Top missing columns (pre-cleaning)	Missing count	Missing %
`BuildingArea`	17,400	59.55%
`YearBuilt`	15,744	53.88%
`Landsize`	9,568	32.75%
`Car`	6,860	23.48%
`Bathroom`	6,558	22.45%
`Bedroom2`	6,552	22.42%
`Price`	6,367	21.79%
`Lattitude`	6,339	21.70%

2.2 Cleaning Decisions

Removed exact duplicate rows to prevent biased learning and inflated metrics.
Parsed Date using explicit day-first datetime parsing — Australian date format (DD/MM/YYYY).
Converted Price to numeric with errors='coerce' — forces any string artifacts to NaN.
Dropped rows where Price is missing — target integrity; cannot train without a label.
Left all feature missingness (BuildingArea, YearBuilt, etc.) intact for pipeline imputation — imputing before the train/test split would leak test-set statistics into training.
SellerG (real estate agent): excluded — very high cardinality, no generalizable signal.
Address: excluded — near-unique per row, no predictive value as a raw string.
Retained extreme high-value sales (luxury properties, large rural blocks); handled their influence via log scaling in feature engineering rather than dropping real data points.
Categorical profiling: after cleaning, object columns summarized with describe(include=['object']) (counts, uniques, top category) to spot sparse labels before plotting.
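The cleaning decisions above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `clean_housing` and the four demo rows are made up, and only the target-side steps are applied so that feature missingness stays intact for the pipeline.

```python
import pandas as pd

def clean_housing(raw: pd.DataFrame) -> pd.DataFrame:
    """Target-side cleaning only; feature NaNs are left for pipeline imputation."""
    df = raw.drop_duplicates().copy()
    # Day-first parsing: "3/09/2016" is 3 September, not 9 March.
    df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors="coerce")
    # String artifacts in Price become NaN instead of raising.
    df["Price"] = pd.to_numeric(df["Price"], errors="coerce")
    # Rows without a label cannot be used for supervised training.
    return df.dropna(subset=["Price"])

# Made-up rows for illustration (not taken from the dataset).
demo = pd.DataFrame({
    "Address": ["1 A St", "1 A St", "2 B St", "3 C St"],
    "Price": ["850000", "850000", "n/a", "1200000"],
    "Date": ["3/09/2016", "3/09/2016", "4/02/2017", "17/03/2018"],
    "BuildingArea": [120.0, 120.0, None, None],
})
cleaned = clean_housing(demo)
print(cleaned)
```

The duplicate row and the row with an unparseable price are dropped, while the missing `BuildingArea` value in the last row is deliberately kept.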

Post-cleaning snapshot (tabular):

Metric	Value
Rows x Columns	27,247 x 21
Dropped rows (missing target + exact duplicates)	6,367
Date dtype	`datetime64` (parsed day-first)

Top missing columns (post-cleaning)	Missing count	Missing %
`BuildingArea`	13,685	59.89%
`YearBuilt`	12,376	54.16%
`Landsize`	7,495	32.80%
`Car`	5,347	23.40%
`Bathroom`	5,117	22.39%
`Bedroom2`	5,113	22.38%
`Lattitude`	4,949	21.66%
`Longtitude`	4,949	21.66%

Object-column profile (post-cleaning)	Count	Unique	Top	Freq
`Suburb`	27,247	340	Reservoir	634
`Address`	27,247	22,466	5 Charles St	4
`Type`	27,247	3	h	15,344
`Method`	27,247	5	S	14,881
`SellerG`	27,247	325	Nelson	2,372
`CouncilArea`	27,245	33	Boroondara City Council	2,221
`Regionname`	27,245	8	Southern Metropolitan	7,439

2.3 Summary Statistics After Cleaning

Feature	Count	Mean	Std	Min	25%	50%	75%	Max
`Price`	27,247	1,056,543.22	646,613.71	85,000.00	637,000.00	880,000.00	1,300,000.00	11,200,000.00
`Rooms`	27,247	2.97	0.96	1.00	2.00	3.00	4.00	16.00
`Bathroom`	17,733	1.57	0.70	0.00	1.00	1.00	2.00	9.00
`Car`	17,503	1.67	0.98	0.00	1.00	2.00	2.00	18.00
`Landsize`	15,355	588.55	4,032.16	0.00	196.00	478.00	659.50	433,014.00
`BuildingArea`	9,165	154.10	479.10	0.00	97.00	130.00	178.00	44,515.00
`Distance`	27,247	10.92	6.49	0.00	6.40	10.20	13.80	48.10
`Propertycount`	27,245	7,533.97	4,487.78	83.00	4,280.00	6,567.00	10,331.00	21,650.00

2.4 Sanity Checks (Domain Rules)

Automated plausibility checks were run on the cleaned data to catch scrape errors, wrong units, or bad merges before trusting aggregate charts:

All sale price values are non-negative — no negative prices.
Distance is non-negative — no properties with negative CBD distance.
Rooms, Bedroom2, Bathroom, Car are non-negative integers — no structural anomalies.
YearBuilt, where present, is checked against a sensible historical range (1800–2018); at least one out-of-range value is flagged in the table below.
Lattitude and Longtitude lie within Victoria, Australia bounding box — no data entry errors placing properties overseas.

Sanity Rule	Result	Notes
`Price >= 0`	PASS	No negative sale prices
`Distance >= 0`	PASS	No negative CBD distances
`Rooms >= 0`	PASS	No negative room counts
`Bathroom >= 0`	PASS	No negative bathroom counts
`YearBuilt` in [1800, 2018] when present	FLAG	At least one out-of-range build year appears in raw source
`Lattitude` in Victoria bounds	PASS	Values fall inside expected range
`Longtitude` in Victoria bounds	PASS	Values fall inside expected range
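A rule table like the one above could be produced with a small helper along these lines. It is a sketch under assumptions: the function name `run_sanity_checks`, the Victoria bounding-box values, and the two demo rows (including the out-of-range build year) are illustrative and not taken from the notebook.

```python
import pandas as pd

# Assumed approximate bounding box for Victoria (illustrative values).
VIC_LAT = (-39.3, -33.9)
VIC_LON = (140.9, 150.1)

def run_sanity_checks(df: pd.DataFrame) -> dict:
    """Return PASS or FLAG per rule; missing values are skipped, not failed."""
    rules = {
        "Price >= 0": ("Price", lambda s: s >= 0),
        "Distance >= 0": ("Distance", lambda s: s >= 0),
        "Rooms >= 0": ("Rooms", lambda s: s >= 0),
        "Bathroom >= 0": ("Bathroom", lambda s: s >= 0),
        "YearBuilt in [1800, 2018]": ("YearBuilt", lambda s: s.between(1800, 2018)),
        "Lattitude in Victoria bounds": ("Lattitude", lambda s: s.between(*VIC_LAT)),
        "Longtitude in Victoria bounds": ("Longtitude", lambda s: s.between(*VIC_LON)),
    }
    report = {}
    for name, (col, rule) in rules.items():
        present = df[col].dropna()
        report[name] = "PASS" if rule(present).all() else "FLAG"
    return report

# Made-up rows; the second build year is deliberately implausible.
demo = pd.DataFrame({
    "Price": [850000.0, 1200000.0],
    "Distance": [2.5, 13.0],
    "Rooms": [2, 4],
    "Bathroom": [1.0, None],
    "YearBuilt": [1970.0, 2106.0],
    "Lattitude": [-37.80, -37.72],
    "Longtitude": [144.99, 145.05],
})
report = run_sanity_checks(demo)
print(report)
```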

2.5 Outlier Documentation (`Price` / `Landsize` / `BuildingArea`)

Top properties table: The notebook lists the top 10 properties by Price so extreme luxury sales (multi-million-dollar mansions) are explicit — not only visible as scatter extremes.
Tukey IQR fences: Lower fence = Q1 − 1.5×IQR, upper fence = Q3 + 1.5×IQR. For heavily right-skewed property prices, many rows are expected to exceed the upper fence; that reflects the long-tailed, luxury-heavy structure of the Melbourne market, not bad data. The same applies to Landsize (rural blocks) and BuildingArea (mansions).
Decision: Keep those rows as real sales; use log scales and log-transformed features in models as needed.
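As a worked check of the fence formula, the sketch below plugs in the Price quartiles reported in this README (Q1 = 637,000, Q3 = 1,300,000). The helper names `tukey_fences` and `count_outliers` are hypothetical.

```python
import pandas as pd

def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Lower and upper Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Number of non-missing values falling outside the fences."""
    clean = values.dropna()
    lower, upper = tukey_fences(clean.quantile(0.25), clean.quantile(0.75), k)
    return int(((clean < lower) | (clean > upper)).sum())

# IQR = 663,000, so 1.5 x IQR = 994,500.
price_lower, price_upper = tukey_fences(637_000.0, 1_300_000.0)
print(price_lower, price_upper)  # -357500.0 2294500.0
```

The negative lower fence is why only the upper fence matters in practice for Price, Landsize, and BuildingArea.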

Top 10 properties by Price (post-cleaning):

Suburb	Type	Rooms	Landsize	BuildingArea	Price
Brighton	h	4	1,400.0	NaN	11,200,000
Mulgrave	h	3	744.0	117.0	9,000,000
Canterbury	h	5	2,079.0	464.3	8,000,000
Hawthorn	h	4	1,690.0	284.0	7,650,000
Armadale	h	4	NaN	NaN	7,000,000
Armadale	h	4	NaN	NaN	6,800,000
Kew	h	6	1,334.0	365.0	6,500,000
Melbourne	u	3	NaN	NaN	6,500,000
Toorak	h	4	NaN	NaN	6,460,000
Middle Park	h	5	553.0	308.0	6,400,000

IQR fence documentation:

Feature	Q1	Q3	Lower Fence	Upper Fence	Outlier Rows	Outlier %
`Price`	637,000.0	1,300,000.0	-357,500.0	2,294,500.0	1,088	4.76%
`Landsize`	196.0	659.5	-499.25	1,354.75	402	2.62%
`BuildingArea`	97.0	178.0	-24.50	299.50	473	5.16%

📊 EDA Highlights

2.6 Property Market Overview

A. Price Distribution

Question: What does the overall distribution of Melbourne house prices look like, and how skewed is it?

Insight: Price is strongly right-skewed. The bulk of sales cluster between ~$400K and ~$1.5M, but the upper tail extends well past $5M. The log-scale view reveals a near-normal distribution, confirming that log(Price) is a natural regression target and that log-transforming heavy-tail features will better linearize their relationship with price.

B. Property Type Price Hierarchy (Median)

Question: Are houses systematically more expensive than units, and how wide is the gap?

Insight: Houses (h) have the highest median price, followed by townhouses (t), with units (u) at the bottom. This confirms Type as a strong predictor and supports one-hot encoding with u as a reference class in linear-style models.

2.7 Location & Geography

A. Regional Price Hierarchy

Question: Which Melbourne regions command the highest median prices?

Insight: Median prices differ substantially by region. The highest-priced regions (typically inner-city) sit well above the overall median, while outer growth corridors fall significantly below it. The spread confirms that even at a coarse regional level, location carries strong price signal — motivating both Regionname and the finer-grained Suburb as features in all models.

B. Distance from CBD vs Price

Question: Is the price gradient from the CBD linear, or does it flatten in outer suburbs?

Insight: The relationship is negative overall — farther from the CBD generally means cheaper — but the gradient is non-linear. The steepest decline occurs in the 0–15 km inner band. Beyond ~25 km the price floor flattens. Very distant properties (>40 km) show a second cluster of moderate prices from growth-corridor developments, not a recovery of the inner premium. Wide spread at every distance confirms that Distance alone doesn't determine price — suburb quality, property type, and size all interact with it. log_distance better linearizes this gradient for regression models.

2.8 Property Size & Layout

A. Property Type vs Price Distribution

Question: How does structural size interact with the type-based price hierarchy?

Type-level pricing evidence is shown above in 2.6.B (Research Q1). This subsection extends the analysis to structural layout variables (rooms and bathrooms), where the strongest separations appear within and across property types.

B. Room Count and Bathrooms vs Price

Question: Does adding rooms and bathrooms drive price consistently across property types?

Insight: Median prices rise consistently with both room and bathroom counts, but the slope diverges sharply by property type. For houses, each additional room carries a much larger premium than for units. Properties with 4+ rooms and 2+ bathrooms sit in the upper price quartile regardless of suburb — the joint combination is more predictive than either feature alone. This motivated the rooms_x_bathrooms interaction feature and the rooms_per_bedroom and bathrooms_per_room ratios.

2.9 Correlation Analysis

Question: How do numeric features relate to each other and to Price?

Insight:
- Rooms, Bedroom2, Bathroom, Car correlate positively with Price — size and amenity features consistently point in the same direction.
- Distance correlates negatively with Price — farther from CBD, cheaper.
- Rooms and Bedroom2 are strongly correlated with each other (~0.9) — multicollinearity motivates the rooms_per_bedroom ratio as a more orthogonal signal.
- Landsize shows a weaker correlation with Price than expected — inner-city land is small but extremely expensive, while large rural blocks are cheap per m²; log-transform addresses this non-linearity.
- BuildingArea correlates more cleanly with Price than raw Landsize — usable floor space is a more direct value driver than total land.

📉 Part 3: Baseline Linear Regression

Goal: Establish a reproducible, leakage-free performance floor before any feature engineering. Every subsequent model must beat this baseline to justify its additional complexity.

Design Decisions

Feature set: all raw columns except Address, Date (after extracting year/month), and SellerG
Numeric pipeline: SimpleImputer(strategy='median') → StandardScaler
Categorical pipeline: SimpleImputer(strategy='most_frequent') → OneHotEncoder(handle_unknown='ignore')
All transformations inside a single ColumnTransformer — fit on train only, zero test-set leakage
LinearRegression() with default parameters — no regularization, no tuning
80/20 stratified split, random_state=42 — fully reproducible
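The design decisions above map onto scikit-learn roughly as in the following sketch. The column lists and synthetic data are placeholders, and the split shown is a plain random 80/20 split: how the notebook stratifies a continuous target is not stated here, so stratification is omitted rather than guessed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_baseline(numeric_cols, categorical_cols) -> Pipeline:
    """Imputation, scaling, and encoding all live inside one pipeline."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# Synthetic stand-in data (not the Melbourne dataset).
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n).astype(float),
    "Distance": rng.uniform(1, 40, n),
    "Type": rng.choice(["h", "u", "t"], n),
})
demo.loc[::10, "Distance"] = np.nan  # missing values handled by the imputer
price = 300_000 + 150_000 * demo["Rooms"] + rng.normal(0, 20_000, n)

X_train, X_test, y_train, y_test = train_test_split(
    demo, price, test_size=0.2, random_state=42
)
baseline = build_baseline(["Rooms", "Distance"], ["Type"])
baseline.fit(X_train, y_train)  # imputer/scaler/encoder statistics come from train only
r2 = baseline.score(X_test, y_test)
print(round(r2, 4))
```

Because every transformer is fitted inside `Pipeline.fit`, test rows never influence the medians, scaling statistics, or category vocabulary.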

Baseline Results

Metric	Train	Test
MAE	—	229,318.5712
MSE	—	143,327,442,971.9292
RMSE	—	378,586.1104
R²	—	0.6646

Reading this plot: Points on the diagonal y=x line are perfect predictions. Points below the line are over-predictions (model said higher than reality); points above are under-predictions. The expected pattern for this dataset: a tight cluster for sub-$1M properties near the diagonal, spreading into a progressively wider band above $1.5M — the model consistently underestimates luxury properties because they are outliers the linear fit cannot represent. The 45° reference line is the "perfect model" benchmark.

Reading this plot: Residuals (actual − predicted) plotted against predicted price. A well-behaved linear model shows residuals randomly scattered around zero with constant spread. The expected pattern here is a fan shape — small residuals at low predicted prices ($300K–$700K) that widen dramatically above $1M. This is textbook heteroscedasticity: variance in prediction error increases with the magnitude of the target. The fan shape does not mean the model is broken — it is the natural consequence of using raw AUD price as the regression target on a right-skewed market. Two fixes: predict log(Price) and back-transform, or switch to a model family that builds local rules (Random Forest, Gradient Boosting) rather than fitting a single global line.
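The first fix mentioned above (predict log(Price) and back-transform) can be sketched with scikit-learn's `TransformedTargetRegressor`. This is an illustration on synthetic data, not something the baseline in this README uses; `log1p`/`expm1` are chosen here as one reasonable transform pair.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed prices whose log is linear in room count.
rng = np.random.default_rng(0)
rooms = rng.integers(1, 7, size=300).reshape(-1, 1).astype(float)
price = np.exp(12.5 + 0.25 * rooms[:, 0] + rng.normal(0, 0.2, size=300))

# The regressor is fitted on log1p(price); predictions are mapped back with expm1.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(rooms, price)
pred = log_model.predict(rooms)  # back on the original AUD scale
print(log_model.regressor_.coef_)
```

On the log scale the error spread is roughly constant, which is what removes the fan shape; predictions are also guaranteed to be positive after back-transformation.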

Key observations:

R² = 0.6646 — explains approximately 66.46% of price variance. Meaningful signal present but a large fraction unexplained, consistent with a complex non-linear market.
Residuals show a fan shape (heteroscedasticity): errors are larger for high-priced properties. This is expected with raw price as the target — it motivates a model family that handles non-constant variance (trees).
Feature importance: coefficient plot shows location features (particularly suburb one-hot dummies) dominate the baseline — the strongest single predictors are location-based, not property-size-based.

Top Coefficients — Baseline

Reading this plot: Bars extending right are features that push predicted price up; bars extending left are features that push price down. The expected pattern: a handful of premium inner-city suburb dummies (Toorak, Hawthorn, South Yarra, etc.) are the largest positive bars, each adding hundreds of thousands of dollars to the prediction. A handful of outer-suburb dummies (Melton, Werribee, etc.) are the largest negative bars. Distance appears as a negative bar, consistent with the Q3 finding (section 2.7.B, Distance from CBD vs Price). Rooms and Bathroom appear as positive bars but with smaller magnitude than the location dummies, confirming that in Melbourne's property market, where you buy matters more than what you buy.

Key finding: Suburb dummies dominate the coefficient ranking. This is the quantitative proof of "location, location, location" — the linear model assigns the majority of its explanatory power to suburb membership, not property size. This finding directly motivated the KMeans neighbourhood clustering in Part 4: clustering captures location signal in a form that generalises better than 300+ sparse suburb dummies that each appear in only a handful of training rows.

⚙️ Part 4: Feature Engineering

Feature engineering is the single most impactful step before model selection. Every feature created below is directly traceable to an EDA finding or a domain observation about the Melbourne property market.

4.1 Ten New Engineered Features

Feature	Type	Source / Rationale
`sale_year`	Numeric	Year extracted from `Date` — captures market cycle effects (2016 vs 2017 vs 2018)
`sale_month`	Numeric	Month extracted from `Date` — captures seasonality (spring auction peak in Melbourne)
`sale_quarter`	Numeric	Quarter extracted from `Date` — coarser time bucket than month
`rooms_per_bedroom`	Ratio	`Rooms / Bedroom2` — high ratio = many living/utility rooms relative to bedrooms; signals open-plan or luxury layouts
`bathrooms_per_room`	Ratio	`Bathroom / Rooms` — captures bathroom density; high values signal premium fitouts
`building_to_land_ratio`	Ratio	`BuildingArea / Landsize` — how much of the land is built on; differentiates dense inner-city from sprawling outer properties
`log_landsize`	Log transform	`log1p(Landsize)` — compresses the extreme right tail; brings distribution near-normal for linear models
`log_buildingarea`	Log transform	`log1p(BuildingArea)` — same rationale as log_landsize
`log_distance`	Log transform	`log1p(Distance)` — linearizes the CBD-distance price gradient found in Q3 EDA
`rooms_x_bathrooms`	Interaction	`Rooms × Bathroom` — captures the joint premium of large, well-appointed homes (Q4 EDA)
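The ten features in the table can be derived with a helper along these lines. The function name `add_engineered_features` and the demo rows are hypothetical, and mapping infinite ratios (zero bedrooms or zero land size) to NaN is an assumption made here so the downstream imputer can handle them; the notebook may treat those cases differently.

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add time, ratio, log, and interaction features; expects a datetime Date column."""
    out = df.copy()
    out["sale_year"] = out["Date"].dt.year
    out["sale_month"] = out["Date"].dt.month
    out["sale_quarter"] = out["Date"].dt.quarter
    out["rooms_per_bedroom"] = out["Rooms"] / out["Bedroom2"]
    out["bathrooms_per_room"] = out["Bathroom"] / out["Rooms"]
    out["building_to_land_ratio"] = out["BuildingArea"] / out["Landsize"]
    out["log_landsize"] = np.log1p(out["Landsize"])
    out["log_buildingarea"] = np.log1p(out["BuildingArea"])
    out["log_distance"] = np.log1p(out["Distance"])
    out["rooms_x_bathrooms"] = out["Rooms"] * out["Bathroom"]
    # Assumption: a zero denominator yields inf, which is replaced by NaN.
    ratio_cols = ["rooms_per_bedroom", "bathrooms_per_room", "building_to_land_ratio"]
    out[ratio_cols] = out[ratio_cols].replace([np.inf, -np.inf], np.nan)
    return out

# Made-up rows; the second has zero bedrooms and zero land size.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["3/09/2016", "17/03/2018"], dayfirst=True),
    "Rooms": [4, 2],
    "Bedroom2": [3.0, 0.0],
    "Bathroom": [2.0, 1.0],
    "Landsize": [600.0, 0.0],
    "BuildingArea": [150.0, 80.0],
    "Distance": [5.0, 12.0],
})
engineered = add_engineered_features(demo)
print(engineered.iloc[:, len(demo.columns):])
```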

4.2 Feature Evidence — Keep/Drop Rationale

Mutual information regression scores and chi-square statistics were computed for all features to confirm which engineered features add real signal. Key evidence is summarized in the ranked findings and keep/drop rationale below.

4.3 KMeans Clustering — Neighbourhood Segmentation

Clustering was applied to capture neighbourhood-like spatial structure that raw suburb names encode imperfectly. The goal was to group properties by their location-profile signature — not by suburb boundary, but by the underlying spatial pattern of distance, density, and coordinates.

Clustering inputs: [Lattitude, Longtitude, Distance, Propertycount]
Preprocessing: SimpleImputer(median) → StandardScaler, fit on train-set clustering inputs only
Algorithm: KMeans(n_clusters=6, random_state=42, n_init=20)

Why k=6 in feature construction, but k=3 in silhouette validation? Part 4 feature construction used k=6 to generate the final cluster-derived features (cluster_label + cluster_dist_0..5) and to preserve a richer neighbourhood segmentation signal. In the separate validation sweep across candidate k values, the highest silhouette score was observed at k=3. Both facts are reported for transparency: k=6 was used for engineered features, while the sweep indicates k=3 as the most compact separation under silhouette.

Cluster-derived features added to each row:

Feature	Type	Description
`cluster_label`	Categorical (0–5)	Discrete cluster membership — one-hot encoded in pipeline
`cluster_dist_0`	Continuous	Euclidean distance to centroid of cluster 0 — atypicality signal
`cluster_dist_1`	Continuous	Distance to centroid of cluster 1
`cluster_dist_2`	Continuous	Distance to centroid of cluster 2
`cluster_dist_3`	Continuous	Distance to centroid of cluster 3
`cluster_dist_4`	Continuous	Distance to centroid of cluster 4
`cluster_dist_5`	Continuous	Distance to centroid of cluster 5

The centroid-distance features (cluster_dist_k) add a continuous atypicality signal: a property far from its assigned cluster centroid is a spatial outlier within its cluster, a signal that can improve model predictions at the tails of the price distribution.
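One way to obtain both `cluster_label` and the six `cluster_dist_k` columns is to rely on `KMeans.predict` and `KMeans.transform`, the latter returning distances to every centroid. The sketch below uses random stand-in coordinates rather than real properties, and wrapping the three steps in a single `Pipeline` is a choice made here for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-ins for [Lattitude, Longtitude, Distance, Propertycount].
rng = np.random.default_rng(42)
train_loc = np.column_stack([
    rng.normal(-37.8, 0.1, 600),
    rng.normal(145.0, 0.15, 600),
    rng.uniform(0, 48, 600),
    rng.uniform(80, 21_000, 600),
])
train_loc[::25, 0] = np.nan  # simulate missing coordinates
test_loc = train_loc[:50] + rng.normal(0, 0.01, (50, 4))

cluster_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("kmeans", KMeans(n_clusters=6, random_state=42, n_init=20)),
])
cluster_pipe.fit(train_loc)  # fitted on training rows only

test_labels = cluster_pipe.predict(test_loc)   # cluster_label
test_dists = cluster_pipe.transform(test_loc)  # cluster_dist_0 .. cluster_dist_5
print(test_labels[:5], test_dists.shape)
```

Each row's label is simply the index of its smallest centroid distance, so the distance columns carry strictly more information than the label alone.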

4.4 Cluster Visualization

Reading this plot: PCA projects the 4-dimensional clustering space (Lattitude, Longtitude, Distance, Propertycount) into 2D. Each dot is a property, colored by its cluster assignment. Well-separated blobs = clusters with distinct location profiles. Expected pattern: an arc or gradient structure reflecting Melbourne's radial geography — CBD-proximate clusters (inner ring) sit on one side, growth-corridor clusters (outer ring) on the other, with middle-ring clusters bridging the gap. Some overlap between adjacent clusters is normal and expected: suburb boundaries are fuzzy, and properties near a cluster boundary genuinely share characteristics with both groups. Tight, non-overlapping blobs would actually be suspicious — real spatial data rarely partitions perfectly in Euclidean distance.

Cluster interpretation:

Cluster	Profile	Approximate Zone
0	Transitional middle-ring profile with moderate distance-to-centroid values	Middle ring
1	Distinct location pocket with stronger separation from other centroids	Outer ring / edge corridors
2	Dense metro-like profile; dominant in the sample shown during cluster feature preview	Inner-to-middle metro
3	Higher atypicality-distance zone; captures mixed suburban profiles	Mixed transition belt
4	Peripheral cluster with larger centroid distances and broader spread	Outer suburban fringe
5	Alternative metro-suburban mix with moderate centroid distance signature	Established suburban ring

4.5 Cluster Validation

Silhouette score and an ablation test (model performance with vs without cluster features) were run to confirm that clustering genuinely improves predictions rather than just adding noise.

Check	Result
Best silhouette score (sweep k=2..8)	0.6102 (best at k=3)
RMSE without cluster features	378,586.1104
RMSE with cluster features	373,550.5124
Gain from clustering	5,035.5980 RMSE reduction
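A silhouette sweep over k = 2..8 can be written as below. The helper name `silhouette_sweep` is hypothetical and the data are three synthetic, well-separated blobs, so the best k of 3 in this demo is a property of the toy data and not a reproduction of the 0.6102 score above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_sweep(X, k_values=range(2, 9), seed=42) -> dict:
    """Fit KMeans for each candidate k and record the mean silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=20).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    return scores

# Three tight synthetic blobs with far-apart centers.
rng = np.random.default_rng(0)
centers = np.array([[-10.0, -10.0], [0.0, 10.0], [10.0, -10.0]])
X_demo = np.vstack([c + rng.normal(0, 0.5, size=(100, 2)) for c in centers])

scores = silhouette_sweep(X_demo)
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Silhouette rewards compact, well-separated groups, which is why it can prefer a smaller k than the one that gives the most useful features for a downstream regressor.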

4.6 Feature Engineering Impact — Isolated Proof

Same model (Linear Regression), same hyperparameters, same split:

Stage	Features	RMSE	R²
Raw baseline (Part 3)	~20 raw	378,586.1104	0.6646
Engineered (Part 4)	+10 engineered + 7 cluster	373,550.5124	0.6734
Gain	+17 features	-5,035.5980	+0.0089

Feature engineering contributed measurable improvement even before switching model families.

4.7 Final Feature Matrix Summary

Category	Count
Raw numeric (StandardScaler)	12
Engineered numeric (ratios, logs, interaction, time)	10
Cluster distance features	6
Cluster label (one-hot)	6
Categorical (one-hot encoded)	~687 (train-split dependent)
Total features	~721 (sparse, train-split dependent)

Note: The exact one-hot dimensionality depends on which categories appear in the training split (OneHotEncoder(handle_unknown='ignore')), so totals can vary slightly across reruns/splits.

📈 Part 5: Three Improved Regression Models

All three models were trained on the full engineered feature matrix. Same pipeline structure, same stratified split, same seed. Performance differences are attributable to model architecture only.

Model Architectures

Model	Architecture	Key Parameters
Linear Regression (Engineered)	Global linear fit	Default — no regularization
Random Forest Regressor	350 independent decision trees	max_depth=20, min_samples_leaf=2, n_jobs=-1
Gradient Boosting Regressor	300 sequential boosted trees	n_estimators=300, learning_rate=0.05, max_depth=3
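The three configurations in the table could be instantiated as follows. Passing `random_state` to the tree models is an assumption (the table does not list it), and the comparison runs on a small synthetic non-linear target, so the printed RMSE values are unrelated to the results reported for the real data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def make_regressors(seed: int = 42) -> dict:
    """Model zoo using the hyperparameters listed in the table."""
    return {
        "Linear Regression (Engineered)": LinearRegression(),
        "Random Forest Regressor": RandomForestRegressor(
            n_estimators=350, max_depth=20, min_samples_leaf=2,
            n_jobs=-1, random_state=seed,
        ),
        "Gradient Boosting Regressor": GradientBoostingRegressor(
            n_estimators=300, learning_rate=0.05, max_depth=3, random_state=seed,
        ),
    }

# Synthetic target with a curve and a step, which a single global line cannot fit.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 3))
y = (500_000 + 300_000 * np.sin(3 * X[:, 0])
     + 200_000 * (X[:, 1] > 0.5) + rng.normal(0, 20_000, 400))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rmse = {}
for name, model in make_regressors().items():
    model.fit(X_tr, y_tr)
    rmse[name] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
print(rmse)
```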

Part 5 Results

Model	MAE	RMSE	R²	RMSE vs Baseline
Baseline LR (Part 3)	229,318.5712	378,586.1104	0.6646	—
Linear Regression (Engineered)	226,190.4886	373,550.5124	0.6734	-5,035.5980
Gradient Boosting Regressor	186,308.9222	325,527.3773	0.7520	-53,058.7331
Random Forest Regressor (WINNER)	165,525.8822	301,281.4212	0.7876	-77,304.6892

Reading this plot: Each group of bars represents one model. Shorter RMSE bar = better. Taller R² bar = better. The expected pattern: baseline linear regression is tallest on RMSE and shortest on R². Random Forest should show the most dramatic improvement — RMSE dropping from ~$378K to ~$301K (-$77K, -20%) and R² rising from 0.665 to 0.788. The Gradient Boosting bar should sit between linear and Random Forest. The visual gap between the linear and tree-based models makes the "non-linearity premium" immediately obvious without needing to read the numbers.

Reading this plot: Compare this to the baseline Actual vs Predicted above. The expected improvement: the cloud of points is tighter around the diagonal, especially in the $500K–$1.5M band (the bulk of the market). The luxury end (>$2M) will still show scatter — these properties have idiosyncratic features (ocean views, heritage listing, development potential) that no tabular model captures from the available columns. The tighter diagonal indicates the Random Forest found the non-linear suburb × size interactions that the linear model averaged away.

Reading this plot: Compare to the baseline residual fan. The expected improvement: the fan narrows, with residuals at high predicted values smaller than in the baseline, indicating the tree model handles the right tail better. Because residuals are defined as actual − predicted, some systematic positive residuals may remain at the top end (the model still underestimates the most expensive properties), but the spread should be visibly more homoscedastic than the baseline. If the fan is still wide, it signals that log-transforming the target Price would be the next meaningful improvement.

Feature Importance — Regression Winner

Reading this plot: Random Forest importance = average reduction in node impurity (MSE) from splitting on that feature, averaged across all 350 trees. Longer bar = feature the model relied on most. Expected top features: log_distance (CBD proximity drives price more than almost anything), select suburb dummies for the highest-value suburbs, log_buildingarea (usable floor space), rooms_x_bathrooms (the engineered interaction), and cluster_dist features (spatial atypicality). If log_distance or a suburb dummy tops the chart, it confirms the EDA finding that location dominates. If an engineered feature (e.g. building_to_land_ratio) appears above raw features, it validates that the feature engineering in Part 4 extracted real signal rather than noise.

Key observations:

Why Random Forest wins on this split: Random Forest achieved the lowest RMSE and highest R² among the tested regressors. Averaging 350 deep trees reduced variance while capturing non-linear interactions between location, size, and engineered ratio/log features, yielding the strongest generalization.
Why Gradient Boosting is a strong runner-up: Boosting still performed well and clearly outperformed linear regression by modeling non-linear effects. On this run, however, it did not beat Random Forest on RMSE.
Why engineered features rank highly: In both tree models, log-transformed features (log_landsize, log_buildingarea, log_distance) rank near the top. Because tree splits are invariant to monotonic transforms, this mainly confirms that land size, building area, and CBD distance carry the strongest signal; the log transforms themselves matter most for the linear model. The cluster_dist features also appear in the top rankings, validating the neighbourhood segmentation approach.

Winner Declaration

Winner: Random Forest Regressor → winner_regression_model.pkl

Selection criterion: lowest RMSE with strong R² (balanced error reduction + explained variance).

🏆 Part 6: Winning Regression Model Export

The winning regression pipeline (preprocessor + model) was serialized to pickle and uploaded to this HuggingFace repository.

```python
import pickle

# Serialize the full fitted pipeline (preprocessor + model) in one file.
with open("winner_regression_model.pkl", "wb") as f:
    pickle.dump(winner_reg_model, f)
```

File: winner_regression_model.pkl
Test RMSE: 301,281.4212
Test R²: 0.7876

🏷️ Part 7: Regression → Classification

Why convert? A continuous price prediction is not always directly actionable for a buyer or investor. A price tier — Low / Mid / High — is. This section converts the regression target into an operationally meaningful 3-class classification problem.

7.1 Threshold Strategy: Quantile Binning

The thresholds were computed on the training set only (no leakage from test):

Class	Label	Definition	Threshold
0	Low	Price ≤ 33rd percentile of train	≤ 707,000 AUD
1	Mid	33rd < Price ≤ 67th percentile of train	707,000 – 1,120,000 AUD
2	High	Price > 67th percentile of train	> 1,120,000 AUD

Why quantiles? Quantile thresholds produce near-balanced classes (~33% each), which avoids the class imbalance problem that business-rule thresholds (e.g., fixed dollar cutoffs) would create. Balanced classes mean the classifier can learn all three tiers equally well without requiring class-weight corrections.
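The binning rule can be sketched with NumPy as shown here. The helper names are hypothetical; the worked example reuses the two thresholds reported in the table (707,000 and 1,120,000 AUD), while the balance check runs on synthetic log-normal prices.

```python
import numpy as np

def fit_tier_thresholds(train_prices, quantiles=(0.33, 0.67)) -> np.ndarray:
    """Compute the two cut points from training prices only."""
    return np.quantile(np.asarray(train_prices, dtype=float), quantiles)

def assign_tier(prices, thresholds) -> np.ndarray:
    """0 = Low, 1 = Mid, 2 = High; right=True makes each upper bound inclusive."""
    return np.digitize(np.asarray(prices, dtype=float), np.asarray(thresholds), right=True)

# Worked example with the thresholds reported above.
example = assign_tier(
    [500_000, 707_000, 707_001, 1_120_000, 1_500_000],
    (707_000, 1_120_000),
)
print(example.tolist())  # [0, 0, 1, 1, 2]

# Balance check on synthetic prices: fit on train, then apply the same cuts.
rng = np.random.default_rng(42)
train_prices = rng.lognormal(mean=13.7, sigma=0.5, size=3000)
cuts = fit_tier_thresholds(train_prices)
train_share = np.bincount(assign_tier(train_prices, cuts), minlength=3) / len(train_prices)
print(train_share)
```

Applying the training cut points unchanged to test prices is what keeps the labels free of test-set information.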

7.2 Class Balance

Class	Train Count	Train %	Test Count	Test %
0 — Low	7,269	33.35%	1,840	33.76%
1 — Mid	7,298	33.48%	1,785	32.75%
2 — High	7,230	33.17%	1,825	33.49%
Imbalance ratio (max/min)	—	—	1.03	—

Near-balanced classes (~33% each) — no class-weight correction required.

7.3 Metric Priority for Part 8

Why macro-F1 over accuracy? With near-balanced classes, accuracy is informative — but a model could still perform well overall while systematically failing on one class (for example, always misclassifying mid-tier properties). Macro-F1 averages F1 equally across all three classes, so every tier must be well-predicted for the score to be high.
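A toy case makes the difference concrete. In the made-up labels below, a classifier never predicts the Mid tier: accuracy still looks moderate, while macro-F1 drops sharply because the Mid class contributes an F1 of zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Four properties per tier; every Mid property is pushed into Low or High.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted Mid class as 0 without a warning.
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(round(accuracy, 4))  # 0.6667
print(per_class_f1)        # Low 0.8, Mid 0.0, High 0.8
print(round(macro_f1, 4))  # 0.5333
```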

Primary metric: Macro-F1
Secondary metric: Accuracy

Precision vs Recall trade-off: For a property buyer, a False Negative (missing a high-value property in the High tier) is more costly than a False Positive (labeling a mid-tier property as high). For a seller, the reverse holds. Since both buyer and seller perspectives matter equally, neither precision nor recall is systematically weighted — the balanced macro-F1 reflects this.

🧠 Part 8: Train & Evaluate Classification Models

8.1 Precision vs Recall — Context

For this housing price classification task:

False Positive (predicting High when actually Mid): the buyer spends attention on a property that is not premium, which is an opportunity cost.
False Negative (predicting Mid when actually High): buyer misses a premium property.

In an equal-weight framing (no specific business rule privileging buyers or sellers), macro-F1 is the right primary metric. Per-class recall on the High (2) class is a key secondary metric — missing premium properties is the most visible failure mode and the one most users care about.

8.2 Three Classification Models

Model	Architecture	Key Parameters
Logistic Regression	Global linear decision boundaries	max_iter=2000, random_state=42
Random Forest Classifier	350 independent decision trees	max_depth=18, min_samples_leaf=2, n_jobs=-1
Gradient Boosting Classifier	250 sequential boosted trees	n_estimators=250, learning_rate=0.05, max_depth=3

All models used the same ColumnTransformer pipeline from Part 4 — fit on train only.
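The setup above could be wired together roughly as follows. This is a sketch under assumptions: `make_preprocessor` is a stand-in for the Part 4 ColumnTransformer (its exact steps are not shown in this section), the column names in the toy demo are hypothetical, and `random_state=42` on the two tree models is assumed from the project-wide seed. The hyperparameters are the ones listed in the table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def make_preprocessor(numeric_cols, categorical_cols):
    """Assumed stand-in for the Part 4 ColumnTransformer (not the original)."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])

def build_classifiers(numeric_cols, categorical_cols):
    """One pipeline per model, using the hyperparameters from the table."""
    estimators = {
        "Logistic Regression": LogisticRegression(max_iter=2000, random_state=42),
        "Random Forest Classifier": RandomForestClassifier(
            n_estimators=350, max_depth=18, min_samples_leaf=2,
            n_jobs=-1, random_state=42),
        "Gradient Boosting Classifier": GradientBoostingClassifier(
            n_estimators=250, learning_rate=0.05, max_depth=3, random_state=42),
    }
    return {
        name: Pipeline([
            ("prep", make_preprocessor(numeric_cols, categorical_cols)),
            ("model", est),
        ])
        for name, est in estimators.items()
    }

# Toy demo: hypothetical columns and random labels, only to show the API
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=120),
    "Distance": rng.uniform(1, 40, size=120),
    "Type": rng.choice(["h", "u", "t"], size=120),
})
y_demo = rng.integers(0, 3, size=120)

models = build_classifiers(["Rooms", "Distance"], ["Type"])
fitted = {name: pipe.fit(X_demo, y_demo) for name, pipe in models.items()}
```

Because the preprocessor lives inside each pipeline, calling `fit` on the training rows is what guarantees the imputer, scaler, and encoder never see test data.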

8.3 Evaluation Results

Model	Accuracy	Macro-F1	Weighted-F1	ROC-AUC (OvR, macro)
Logistic Regression	0.7932	0.7925	0.7933	0.9282
Random Forest Classifier	0.7934	0.7917	0.7926	0.9304
Gradient Boosting Classifier (WINNER)	0.7963	0.7952	0.7961	0.9289

Reading this plot: Three bars, one per model. All three cluster tightly in the 0.79–0.80 range. With near-balanced classes (~33% each), always guessing a single tier would score only about one third, so every model sits far above that floor. The differences between bars are small in absolute terms (~0.003, roughly 17 of the 5,450 test properties), though on a large portfolio even a 0.3pp accuracy gain translates to fewer misclassified properties. Gradient Boosting edges out the others.

Reading this plot: Macro-F1 is the primary metric; it averages F1 equally across all three price classes. A model that performs well on Low and High but poorly on Mid will have its macro-F1 pulled down by the weak Mid score. Expected pattern: bars between 0.79 and 0.80, with Gradient Boosting highest (0.7952) and Random Forest lowest (0.7917), just below Logistic Regression (0.7925). The tight clustering confirms that all three models are genuinely competitive: the dataset is rich enough that even a linear classifier captures most of the signal.
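The four metrics reported in 8.3 can be computed along these lines. The helper name `evaluate_classifier` and the toy probabilities are assumptions for illustration; the scikit-learn calls are standard.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_classifier(y_true, y_pred, y_proba):
    """Return the four headline metrics for a multi-class classifier.

    y_proba has shape (n_samples, n_classes), e.g. from predict_proba.
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        "roc_auc_ovr_macro": roc_auc_score(
            y_true, y_proba, multi_class="ovr", average="macro"),
    }

# Toy check with hypothetical probabilities for 6 properties
y_true = np.array([0, 0, 1, 1, 2, 2])
y_proba = np.array([
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.30, 0.50, 0.20],
    [0.40, 0.35, 0.25],   # a Mid property misread as Low
    [0.10, 0.30, 0.60],
    [0.10, 0.20, 0.70],
])
y_pred = y_proba.argmax(axis=1)

metrics = evaluate_classifier(y_true, y_pred, y_proba)
print(metrics)
```

With a fitted pipeline, `y_pred` and `y_proba` would come from `predict` and `predict_proba` on the held-out test set.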

8.4 Classification Reports

Logistic Regression

Class	Precision	Recall	F1-score	Support
Low (0)	0.8311	0.8500	0.8405	1840
Mid (1)	0.6871	0.6913	0.6892	1785
High (2)	0.8592	0.8356	0.8472	1825
Macro avg	0.7925	0.7923	0.7925	5450
Weighted avg	0.7934	0.7932	0.7933	5450

Random Forest Classifier

Class	Precision	Recall	F1-score	Support
Low (0)	0.8219	0.8582	0.8397	1840
Mid (1)	0.7012	0.6784	0.6896	1785
High (2)	0.8531	0.8405	0.8467	1825
Macro avg	0.7921	0.7923	0.7917	5450
Weighted avg	0.7928	0.7934	0.7926	5450

Gradient Boosting Classifier

Class	Precision	Recall	F1-score	Support
Low (0)	0.8267	0.8582	0.8421	1840
Mid (1)	0.6986	0.6908	0.6946	1785
High (2)	0.8608	0.8373	0.8489	1825
Macro avg	0.7954	0.7954	0.7952	5450
Weighted avg	0.7961	0.7963	0.7961	5450

8.5 Confusion Matrices

Logistic Regression

Reading confusion matrices: Rows = actual class, columns = predicted class. The diagonal (top-left to bottom-right) shows correct predictions — darker diagonal = better model. Off-diagonal cells are errors. Expected dominant pattern across all three models: the Mid (1) class bleeds into both Low (0) and High (2) — a property priced just above the q33 threshold is nearly indistinguishable from one just below it, so the model hedges. Low↔High confusions (top-right and bottom-left corners) should be rare — a $400K property looks nothing like a $2M property in the feature space.

Main confusion pattern: Most errors are boundary mistakes where actual Mid is predicted as Low or High; direct Low↔High confusion is limited.

Random Forest Classifier

Main confusion pattern: Errors are concentrated around the Mid class boundaries (Mid→Low / Mid→High), with few extreme Low↔High swaps.

Gradient Boosting Classifier

Main confusion pattern: The winner still mainly confuses near-threshold Mid homes with adjacent tiers, while preserving strong separation of Low vs High.
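The "errors concentrate on adjacent tiers" pattern can be quantified directly from a confusion matrix. The sketch below uses made-up labels and a hypothetical helper, `adjacent_error_share`, which reports the fraction of all errors that fall one tier away from the truth (Low↔Mid or Mid↔High) rather than in the Low↔High corners.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def adjacent_error_share(y_true, y_pred, n_classes=3):
    """Return the confusion matrix and the share of errors in a neighbouring tier."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    errors = cm.sum() - np.trace(cm)
    # Cells one step off the diagonal: Low<->Mid and Mid<->High
    adjacent = np.trace(cm, offset=1) + np.trace(cm, offset=-1)
    return cm, (adjacent / errors if errors else 0.0)

# Toy labels (hypothetical): most mistakes involve the Mid tier
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

cm, share = adjacent_error_share(y_true, y_pred)
print(cm)      # rows = actual class, columns = predicted class
print(share)   # fraction of errors that are adjacent-tier mistakes
```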

8.6 Feature Importance — Classification Models

Key findings from feature importance:

Location features (suburb dummies, cluster features, Distance) consistently rank highly — confirming that "location, location, location" applies quantitatively, not just qualitatively.
Engineered ratio features (rooms_per_bedroom, bathrooms_per_room, building_to_land_ratio) appear in the top rankings, validating that structural efficiency captures price signal that raw room counts alone do not.
log_landsize and log_buildingarea rank strongly — confirming the importance of the log transform for compressing the right-skewed size distributions.
cluster_dist features appear in both tree model rankings, confirming that neighbourhood atypicality (distance from cluster centroid) adds genuine signal beyond the spatial cluster membership alone.

8.7 Winner Declaration

Winner: Gradient Boosting Classifier → winner_classification_model.pkl

Selected by: highest Macro-F1 (primary) + highest accuracy (tiebreaker).

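Why a tree ensemble edges out Logistic Regression: The Melbourne property market is highly non-linear. Suburb-level effects, interactions between property type and location, and the non-linear distance gradient all favour a model that can discover complex decision boundaries. Tree-based ensembles find such boundaries automatically, whereas logistic regression can only model them if the relevant features are explicitly engineered. In this task the advantage is small: Gradient Boosting leads Logistic Regression by about 0.003 Macro-F1 (0.7952 vs 0.7925), and Random Forest (0.7917) falls slightly below it, which suggests the engineered features already expose most of the non-linear structure to the linear model.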

Why Gradient Boosting beats Random Forest: On the selection metrics it does so narrowly (Macro-F1 0.7952 vs 0.7917, accuracy 0.7963 vs 0.7934), although Random Forest has the higher ROC-AUC (0.9304 vs 0.9289). A plausible explanation is sequential error correction: each new tree focuses on the properties that earlier trees misclassified. For this price-tier classification task, the hardest-to-classify properties are the mid-tier ones near the class boundaries, and boosting concentrates on exactly those boundary cases, which the independent averaging of a Random Forest does not target.

📊 Final Evaluation

Key Results Summary

Milestone	Metric	Value
Baseline Linear Regression	RMSE	378,586.1104
Baseline Linear Regression	R²	0.6646
After Feature Engineering (same model)	RMSE	373,550.5124
After Feature Engineering (same model)	R²	0.6734
Best Regression Model	RMSE	301,281.4212
Best Regression Model	R²	0.7876
Regression → Classification	Class 0 threshold	707,000 AUD
Regression → Classification	Class 2 threshold	1,120,000 AUD
Best Classification Model	Macro-F1	0.7952
Best Classification Model	Accuracy	0.7963
Best Classification Model	ROC-AUC (OvR macro)	0.9289

🚀 How to Load and Use the Models

import pickle
import numpy as np
import pandas as pd

# Load regression model (continuous price prediction)
with open("winner_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)

# Load classification model (3-class price tier)
with open("winner_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# Both models expect the same engineered feature matrix from Part 4
# X_new must be a DataFrame with the same columns as X_train_fe
# (raw + engineered columns; DO NOT pre-scale — the pipeline handles it)

# Regression: continuous price in AUD
y_price = reg_model.predict(X_new)

# Classification: discrete price tier
y_tier = clf_model.predict(X_new)
y_proba = clf_model.predict_proba(X_new)

tier_map = {0: "Low", 1: "Mid", 2: "High"}
tier_labels = [tier_map[t] for t in y_tier]

print("Predicted prices:", y_price[:5])
print("Predicted tiers:", tier_labels[:5])
print("Class probabilities (Low/Mid/High):")
print(np.round(y_proba[:5], 3))

Important notes:

X_new must include all engineered columns (sale_year, sale_month, sale_quarter, rooms_per_bedroom, bathrooms_per_room, building_to_land_ratio, log_landsize, log_buildingarea, log_distance, rooms_x_bathrooms) and the clustering columns (cluster_label, cluster_dist_0 through cluster_dist_5).
The KMeans clustering was fit on training data. For truly new properties, you will need to save and reload the clustering artifacts (imputer, scaler, kmeans object) alongside the model pipeline.
The classification thresholds (q33 = 707,000 AUD, q67 = 1,120,000 AUD) were derived from the training distribution. A property near a threshold boundary will have high uncertainty — use predict_proba to assess confidence.
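For the clustering note above, one possible way to persist and reuse the clustering artifacts is sketched below. Everything here is an assumption for illustration: the clustering inputs (`Lattitude`, `Longtitude`), the median imputer, and the helper names are not taken from the notebook, and only the output column names (`cluster_label`, `cluster_dist_0` through `cluster_dist_5`) follow the list above.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

CLUSTER_INPUTS = ["Lattitude", "Longtitude"]  # assumed clustering inputs

def fit_cluster_artifacts(train_df, k=6, seed=42):
    """Fit imputer -> scaler -> KMeans on training rows only."""
    imputer = SimpleImputer(strategy="median").fit(train_df[CLUSTER_INPUTS])
    imputed = imputer.transform(train_df[CLUSTER_INPUTS])
    scaler = StandardScaler().fit(imputed)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
    kmeans.fit(scaler.transform(imputed))
    return imputer, scaler, kmeans

def add_cluster_features(df, imputer, scaler, kmeans):
    """Append cluster_label and cluster_dist_0..k-1 using fitted artifacts."""
    Z = scaler.transform(imputer.transform(df[CLUSTER_INPUTS]))
    out = df.copy()
    out["cluster_label"] = kmeans.predict(Z)
    dists = kmeans.transform(Z)  # Euclidean distance to each centroid
    for j in range(dists.shape[1]):
        out[f"cluster_dist_{j}"] = dists[:, j]
    return out

# Toy training coordinates (hypothetical, roughly around Melbourne)
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "Lattitude": rng.normal(-37.8, 0.1, 300),
    "Longtitude": rng.normal(145.0, 0.1, 300),
})

# Persist the artifacts next to the model pipeline (in-memory here;
# pickle.dump to a file works the same way)
blob = pickle.dumps(fit_cluster_artifacts(train_df))
imputer, scaler, kmeans = pickle.loads(blob)

new_df = pd.DataFrame({"Lattitude": [-37.81, np.nan],
                       "Longtitude": [144.96, 145.10]})
new_with_clusters = add_cluster_features(new_df, imputer, scaler, kmeans)
print(new_with_clusters.filter(like="cluster"))
```

The resulting columns would then be joined with the other engineered features before calling `predict` on either exported model.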

🔎 Part 8 Additional Analysis — Classification Diagnostics Upgrade

Beyond the core classification report and confusion matrix, the notebook includes additional diagnostic visualizations for a complete picture of each model's behavior.

Per-Class Precision, Recall, F1 Summary

Model	Class	Precision	Recall	F1-score
Logistic Regression	Low (0)	0.8311	0.8500	0.8405
Logistic Regression	Mid (1)	0.6871	0.6913	0.6892
Logistic Regression	High (2)	0.8592	0.8356	0.8472
Random Forest	Low (0)	0.8219	0.8582	0.8397
Random Forest	Mid (1)	0.7012	0.6784	0.6896
Random Forest	High (2)	0.8531	0.8405	0.8467
Gradient Boosting	Low (0)	0.8267	0.8582	0.8421
Gradient Boosting	Mid (1)	0.6986	0.6908	0.6946
Gradient Boosting	High (2)	0.8608	0.8373	0.8489

Key pattern: The Mid (1) class is consistently the hardest to classify correctly across all three models. This is expected: properties priced near the q33 and q67 thresholds share characteristics with both adjacent classes. A property priced just above or just below the q33 boundary could plausibly belong to either Low or Mid, because the signal is weakest at the class boundary. This boundary-region ambiguity is unlikely to shrink without additional features that distinguish near-boundary properties more sharply.

Precision–Recall Curves (Class 2 — High Price)

The PR curve for the High (2) class shows the trade-off between the fraction of High-tier properties correctly identified (recall) and the fraction of predicted High-tier properties that are genuinely High (precision). At operating points with high recall, all models accept more false positives from the Mid class. The area under the PR curve indicates how well each model maintains useful precision while catching most High-tier properties.
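A one-vs-rest PR curve for the High tier can be sketched as follows. The helper name `high_tier_pr` and the toy probabilities are illustrative assumptions; average precision is used here as the summary of the area under the PR curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def high_tier_pr(y_true, y_proba, high_class=2):
    """One-vs-rest precision-recall curve for the High tier."""
    is_high = (np.asarray(y_true) == high_class).astype(int)
    scores = np.asarray(y_proba)[:, high_class]   # predicted P(High)
    precision, recall, thresholds = precision_recall_curve(is_high, scores)
    ap = average_precision_score(is_high, scores)
    return precision, recall, thresholds, ap

# Toy probabilities (hypothetical): one Mid property scores above a High one
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_proba = np.array([
    [0.80, 0.15, 0.05],
    [0.60, 0.30, 0.10],
    [0.30, 0.50, 0.20],
    [0.10, 0.45, 0.45],
    [0.20, 0.50, 0.30],
    [0.10, 0.50, 0.40],
    [0.05, 0.25, 0.70],
    [0.05, 0.15, 0.80],
])

precision, recall, thresholds, ap = high_tier_pr(y_true, y_proba)
print(np.round(precision, 3), np.round(recall, 3), round(ap, 4))
```

In practice `y_proba` would be `clf_model.predict_proba(X_test)`, and the curve could be plotted with matplotlib.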

Regression Diagnostics Upgrade — Improvement Table

Model	RMSE delta vs baseline	R² delta vs baseline
Linear Regression (Engineered)	-5,035.5980	+0.0089
Gradient Boosting Regressor	-53,058.7331	+0.0874
Random Forest Regressor (Winner)	-77,304.6892	+0.1230

This table makes the engineering and model-selection gains directly comparable. A negative RMSE delta means the model improved over the baseline; a positive R² delta means it explains more variance. The table was included to provide transparent evidence that each modeling step added genuine value rather than overfitting to the training set.
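The deltas can be reproduced from the values in the Key Results Summary. The sketch below covers only the two models whose absolute RMSE and R² appear in that summary; the helper name and the added percentage column are illustrative. Because the inputs are the rounded R² values, the engineered-model R² delta comes out as 0.0088 rather than the table's +0.0089, which was presumably computed from unrounded scores.

```python
# Values copied from the Key Results Summary
baseline = {"rmse": 378_586.1104, "r2": 0.6646}
models = {
    "Linear Regression (Engineered)": {"rmse": 373_550.5124, "r2": 0.6734},
    "Random Forest Regressor (Winner)": {"rmse": 301_281.4212, "r2": 0.7876},
}

def deltas_vs_baseline(model, base):
    """Negative rmse_delta = lower error; positive r2_delta = more variance explained."""
    rmse_delta = model["rmse"] - base["rmse"]
    return {
        "rmse_delta": round(rmse_delta, 4),
        "r2_delta": round(model["r2"] - base["r2"], 4),
        "rmse_pct": round(100 * rmse_delta / base["rmse"], 2),
    }

improvement = {name: deltas_vs_baseline(m, baseline) for name, m in models.items()}
for name, d in improvement.items():
    print(name, d)
```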

🎯 Strategic Takeaways

Location dominates every other signal. Suburb membership and distance from the CBD together explain more variance than all property-size features combined. A 2-bedroom unit in South Yarra can sell for more than a 4-bedroom house in Melton. The coefficient and importance rankings across every model confirm this: location features occupy the top of every chart. For buyers: suburb selection is the single highest-leverage decision. For sellers: pricing against comparable suburb sales matters far more than the number of rooms listed.
The luxury segment is structurally different. Properties above ~$2M consistently have the largest prediction errors across all models. These properties trade on idiosyncratic features — ocean views, heritage listing, land-banking potential, architect design — that are not captured in any column of this dataset. The implication: the models are deployment-ready for the mainstream market ($400K–$1.5M) but should be used with caution for ultra-premium listings where domain expertise is irreplaceable.
Spring auctions drive a seasonal price premium. sale_month was engineered as a predictive feature specifically because Melbourne's auction market has a well-documented spring peak (September–November). Properties listed in this window attract more bidders, driving competitive clearing prices above the annual median. The feature importance plots confirm this time signal adds genuine model value beyond what property characteristics alone explain.
Structural efficiency matters more than raw size. building_to_land_ratio and bathrooms_per_room consistently outrank raw Landsize and BuildingArea in the tree model importance rankings. A compact, well-fitted inner-city property on a small block commands more per square metre than a large house on an oversized block in an outer suburb. Buyers optimising for value-per-dollar should focus on bathroom count and building coverage, not headline land size.
A unit is not just a cheaper house — it is a different product. The type dummy (h vs u vs t) is one of the strongest features in the baseline model. The price gap between a house and a unit of identical room count in the same suburb is not explained by size alone — it reflects the land-ownership premium and the restriction on capital growth that unit ownership carries in Melbourne. This structural gap means buyers and investors should price units and houses on separate mental models.
Tree models are non-negotiable for price regression. The jump from linear regression (RMSE $379K, R² 0.665) to Random Forest (RMSE $301K, R² 0.788) is not a tuning win; it is a fundamental architectural win. Melbourne property prices are determined by hundreds of suburb-level and property-type interactions that no global linear equation can capture. Any production pricing model for this market should use an ensemble method as its minimum viable architecture. For the three-tier classification task the gap is much narrower: Logistic Regression is within about 0.003 Macro-F1 of the winner.

⚠️ Limitations

Data coverage ends 2018. The Melbourne property market shifted significantly after 2018: record-low interest rates (2020–2021) drove prices to historic highs; the 2022–2023 rate hiking cycle reversed much of that growth. Models trained on 2016–2018 data will systematically underestimate 2020–2021 prices and may overestimate 2023–2024 prices. Do not use for current market valuations without retraining on recent data.
300+ suburb one-hot categories create sparse, noisy features. Suburbs with fewer than ~20 sales in the training set have unreliable one-hot coefficient estimates — the model has too few examples to learn a stable suburb premium. These small-suburb dummies add dimensionality without adding reliable signal. A better approach: group low-frequency suburbs into an "other" category or use target encoding with cross-validation.
BuildingArea is missing in 47% of rows. Nearly half of all predictions rely on a median-imputed building area, the most physically important continuous feature for price-per-square-metre analysis. Median imputation is a reasonable default, but it assigns the same value to every imputed row and so suppresses the signal from this feature for a large fraction of the dataset.
No macroeconomic features. Interest rates, Melbourne population growth, immigration levels, vacancy rates, and housing supply pipeline are documented major drivers of Melbourne property prices — none are present in the dataset. The models capture the cross-sectional structure (which property type and location commands a premium) but not the time-series level (whether the whole market is rising or falling).
Geographic data used only for clustering. Lattitude and Longtitude feed the KMeans clustering but are not used as direct model features. A spatial regression approach (e.g. geographically weighted regression or a spatial lag term) could extract more precise location signal than the coarse 6-cluster discretisation.
All relationships are correlational. The models identify which features are associated with higher prices — they cannot tell us which features cause higher prices. Suburb prestige and building quality are correlated with many unmeasured factors (school catchment, public transport access, heritage character) that drive both the observed feature values and the price outcomes. Feature importance ≠ causal importance.
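The low-frequency suburb grouping suggested in the limitations above could look like this. It is a sketch, not something the notebook does: the helper names, the "Other" label, and the toy suburbs are assumptions, and the threshold of 20 sales mirrors the "~20" mentioned in the text. The frequent set is learned from training rows only, so unseen test suburbs also fall into the "Other" bucket.

```python
import pandas as pd

def fit_frequent_suburbs(train_suburbs, min_count=20):
    """Return the set of suburbs with at least min_count training sales."""
    counts = train_suburbs.value_counts()
    return set(counts[counts >= min_count].index)

def group_rare_suburbs(suburbs, frequent, other_label="Other"):
    """Replace any suburb outside the frequent set, including unseen ones."""
    return suburbs.where(suburbs.isin(frequent), other_label)

# Toy data (hypothetical sale counts)
train = pd.Series(["Richmond"] * 25 + ["Reservoir"] * 30 + ["Melton"] * 3)
test = pd.Series(["Richmond", "Melton", "Kew"])

frequent = fit_frequent_suburbs(train)
grouped_test = group_rare_suburbs(test, frequent)
print(grouped_test.tolist())
```

The grouped column would then replace the raw suburb column before one-hot encoding, which shrinks the number of sparse dummy features.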

🚀 Live Demo

What it is

A live interactive demo lets anyone type in a property's attributes and get an instant price prediction and tier classification — without opening a notebook or writing code. It runs on HuggingFace Spaces (free hosting) powered by Gradio (the same library used in most HuggingFace demos).

📦 Requirements

pandas>=1.3
numpy>=1.21
scikit-learn>=1.0
matplotlib>=3.4
seaborn>=0.11
plotly>=5.0
scipy>=1.7

📝 Key Design Decisions

Decision	Justification
Keep outliers in EDA	Extreme Melbourne prices are real market data (luxury, rural-block); removing them would bias the model against the tails
Log-transform Landsize/BuildingArea/Distance	Reduces right-skew by >90%; better linearizes these features for regression models
Defer all imputation to sklearn pipeline	Prevents any form of test-set information leaking into imputed train values
Fit all transformers on train only	Standard leakage-prevention practice; ColumnTransformer enforces this
Use `handle_unknown='ignore'` in OneHotEncoder	Suburbs in the test set may not appear in training; ignoring unseen categories prevents crashes
KMeans k=6 for feature construction	Used to create richer cluster-derived features (`cluster_label` + distances); separate silhouette sweep showed best compactness at k=3
Quantile-based classification thresholds	Produces near-balanced classes (~33% each) without requiring class-weight correction
Macro-F1 as primary classification metric	Equal penalty for ignoring any price class; prevents a model from collapsing to the majority class
RMSE as primary regression metric	Penalizes large errors more than MAE; appropriate for property prices where a $500k prediction miss is much worse than a $50k miss
`random_state=42` throughout	Full reproducibility — any reader can run the notebook and get identical results
80/20 train/test split	Standard proportion for a dataset of this size; 20% gives a large enough test set for stable metric estimates
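As a small illustration of the `handle_unknown='ignore'` decision in the table, the toy example below (hypothetical suburbs, not the project's encoder) shows that a category unseen during fitting is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["Richmond"], ["Reservoir"]]))   # suburbs seen in "training"

# "Kew" was never seen during fit
encoded = enc.transform(np.array([["Richmond"], ["Kew"]])).toarray()

print(enc.categories_[0])
print(encoded)   # the unseen suburb becomes an all-zero row
```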

Itay Morag
