Melbourne House Price Prediction: Regression, Clustering & Classification
Dataset: Melbourne Housing Market, Kaggle (Full Version)
Navigation
- Project Overview
- Dataset Description
- Part 2: EDA (Data Integrity & Cleaning)
- EDA Highlights (Research Questions)
- Part 3: Baseline Linear Regression
- Part 4: Feature Engineering & Clustering
- Part 5: Three Improved Regression Models
- Part 6: Winning Regression Model Export
- Part 7: Regression -> Classification
- Part 8: Classification Models & Evaluation
- Final Evaluation
- Strategic Takeaways
- Limitations
- Live Demo
- How to Load and Use the Models
- Requirements
Presentation
Project Overview
This project builds a complete, end-to-end machine learning pipeline to predict Melbourne house prices using a rich real-world property dataset. The pipeline moves from raw data through exploratory analysis, feature engineering, unsupervised clustering, regression modeling, and multi-class classification, ending with two production-ready models exported for deployment on HuggingFace.
Research Question:
Given property characteristics, location attributes, and sale metadata available at listing time, can we accurately predict a Melbourne house's sale price, and assign each property to a meaningful price tier (Low / Mid / High)?
Why this matters: Real estate is one of the most consequential financial decisions a household makes. In Melbourne, one of the world's most expensive property markets, understanding what drives house prices is critical for buyers, sellers, and investors alike. A reliable price-prediction model can help buyers set realistic budgets, help sellers price competitively, and help analysts identify market segments. The key challenge is building a model without data leakage and with transparent, interpretable feature engineering grounded in real property market logic.
View the Notebook
Dataset Description
| Property | Value |
|---|---|
| Source | Kaggle: Melbourne Housing Market (Full Version, anthonypino) |
| File | Melbourne_housing_FULL.csv |
| Raw columns | 21 features (including target) |
| Target | Price (sale price in AUD) |
| Task | Regression (price prediction) + Classification (price tier) |
| Geography | Metropolitan Melbourne, Australia |
| Time period | 2016–2018 |
Raw Feature Dictionary: All 21 Original Columns
The raw dataset contains 21 columns. Below is every column, its type, description, and modeling decision.
Identifiers: Excluded or Used with Care
| Column | Type | Description | Decision |
|---|---|---|---|
| `Address` | object | Full property street address | Excluded: too granular; near-unique per row, so no predictive value as-is |
| `SellerG` | object | Real estate agency or selling agent name | Excluded: very high cardinality; dropped to avoid noise |
Target Variable
| Column | Type | Description | Decision |
|---|---|---|---|
| `Price` | float64 | House sale price in AUD; right-skewed | Regression target; log-transform considered for modeling |
Property Characteristics
| Column | Type | Description | Decision |
|---|---|---|---|
| `Rooms` | int64 | Total number of rooms | Kept: strong predictor; used in engineered ratio features |
| `Type` | object | Property type: h=house/cottage/villa/terrace, u=unit/duplex, t=townhouse | Kept: one-hot encoded; strong price signal |
| `Bedroom2` | float64 | Number of bedrooms (from a secondary source) | Kept: used in rooms_per_bedroom ratio |
| `Bathroom` | float64 | Number of bathrooms | Kept: used in bathrooms_per_room ratio and interaction feature |
| `Car` | float64 | Number of car spaces | Kept: included in baseline numeric features |
| `Landsize` | float64 | Land area in square metres; right-skewed, many outliers | Kept: log-transformed → log_landsize; raw value retained for IQR analysis |
| `BuildingArea` | float64 | Building footprint in square metres; right-skewed | Kept: log-transformed → log_buildingarea; used in building_to_land_ratio |
| `YearBuilt` | float64 | Year the property was built; many missing values | Kept: included in baseline; imputed via median in pipeline |
Location Features
| Column | Type | Description | Decision |
|---|---|---|---|
| `Suburb` | object | Suburb name; high cardinality (~300 unique) | Kept: one-hot encoded (handle_unknown=ignore); high-signal location feature |
| `Postcode` | float64 | Postcode of the property | Kept: included as numeric feature |
| `Distance` | float64 | Distance from Melbourne Central Business District (km) | Kept: strong price predictor; log-transformed → log_distance; used in clustering |
| `Regionname` | object | General region in Melbourne (8 regions) | Kept: one-hot encoded |
| `CouncilArea` | object | Governing council area; lower cardinality than Suburb | Kept: one-hot encoded |
| `Propertycount` | float64 | Number of properties in the suburb at time of sale | Kept: proxy for suburb density; used in clustering |
| `Lattitude` | float64 | Property latitude | Used in KMeans clustering only |
| `Longtitude` | float64 | Property longitude | Used in KMeans clustering only |
Sale Metadata
| Column | Type | Description | Decision |
|---|---|---|---|
| `Method` | object | Sale method: S=sold, PI=passed in, SA=sold after, SP=sold prior, VB=vendor bid | Kept: one-hot encoded |
| `Date` | object | Date of sale; parsed to datetime (day-first format) | Kept: decomposed into sale_year, sale_month, sale_quarter |
Part 2: Exploratory Data Analysis
Raw housing data is noisy and inconsistent. These steps were taken to make it analysis-ready.
2.1 Initial State
- Duplicate rows possible in the scraped dataset
- `Price` stored as object dtype in some versions; needs numeric coercion
- `Date` stored as string in day-first format (DD/MM/YYYY); requires explicit parsing
- Significant missingness on structural columns: `BuildingArea` (~60%), `YearBuilt` (~54%), `Landsize` (~33%), `Car` (~23%), `Bedroom2` (~22%); exact figures are in the pre-cleaning snapshot below
- Strong numerical outliers in `Price` (luxury properties >$5M), `Landsize` (rural blocks >10,000 m²), and `BuildingArea` (implausible extremes)
- `SellerG` (agent name) has very high cardinality with no generalizable price signal
Missingness reporting: Both raw counts and percentage of rows per column were printed
so sparse columns (e.g. BuildingArea, YearBuilt) are easy to compare at a glance.
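A report of this kind can be produced in a few lines of pandas. The sketch below is illustrative only: the helper name `missingness_report` and the toy frame are assumptions, not code taken from the notebook.

```python
import numpy as np
import pandas as pd


def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and percentage per column, most-missing first."""
    missing_count = df.isna().sum()
    missing_pct = (missing_count / len(df) * 100).round(2)
    report = pd.DataFrame(
        {"missing_count": missing_count, "missing_pct": missing_pct}
    )
    return report.sort_values("missing_count", ascending=False)


# Tiny synthetic frame standing in for the real CSV (hypothetical values).
toy = pd.DataFrame(
    {
        "BuildingArea": [120.0, np.nan, np.nan, 95.0],
        "YearBuilt": [1990.0, np.nan, 2005.0, 1970.0],
        "Rooms": [3, 2, 4, 3],
    }
)
report = missingness_report(toy)
print(report)
```

Sorting by count puts the sparsest columns at the top, which is the same ordering the snapshot tables use.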
Pre-cleaning snapshot (tabular):
| Metric | Value |
|---|---|
| Rows x Columns | 34,857 x 21 |
| Numeric columns | 13 |
| Text/Categorical columns | 8 |
| Top missing columns (pre-cleaning) | Missing count | Missing % |
|---|---|---|
| `BuildingArea` | 17,400 | 59.55% |
| `YearBuilt` | 15,744 | 53.88% |
| `Landsize` | 9,568 | 32.75% |
| `Car` | 6,860 | 23.48% |
| `Bathroom` | 6,558 | 22.45% |
| `Bedroom2` | 6,552 | 22.42% |
| `Price` | 6,367 | 21.79% |
| `Lattitude` | 6,339 | 21.70% |
2.2 Cleaning Decisions
- Removed exact duplicate rows to prevent biased learning and inflated metrics.
- Parsed `Date` using explicit day-first datetime parsing, matching the Australian date format (DD/MM/YYYY).
- Converted `Price` to numeric with `errors='coerce'`, which forces any string artifacts to NaN.
- Dropped rows where `Price` is missing to preserve target integrity; a model cannot train without a label.
- Left all feature missingness (`BuildingArea`, `YearBuilt`, etc.) intact for pipeline imputation; imputing before the train/test split would leak test-set statistics into training.
- `SellerG` (real estate agent): excluded; very high cardinality, no generalizable signal.
- `Address`: excluded; near-unique per row, no predictive value as a raw string.
- Retained extreme sales (luxury properties, large rural blocks); handled their influence via log scaling in feature engineering rather than dropping real data points.
- Categorical profiling: after cleaning, object columns summarized with `describe(include=['object'])` (counts, uniques, top category) to spot sparse labels before plotting.
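The first four cleaning decisions above can be sketched as a single function. This is a minimal illustration under stated assumptions: the name `clean_raw` and the toy rows are hypothetical, and an explicit `format="%d/%m/%Y"` is used here as one way to enforce day-first parsing (the notebook may use `dayfirst=True` instead).

```python
import pandas as pd


def clean_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, parse dates day-first, coerce the target, drop unlabeled rows."""
    out = df.drop_duplicates().copy()
    # Explicit day-first format: "3/09/2016" is 3 September, not 9 March.
    out["Date"] = pd.to_datetime(out["Date"], format="%d/%m/%Y", errors="coerce")
    # String artifacts in Price become NaN instead of raising.
    out["Price"] = pd.to_numeric(out["Price"], errors="coerce")
    # Only the target is required; feature gaps are left for the pipeline imputer.
    out = out.dropna(subset=["Price"])
    return out.reset_index(drop=True)


# Hypothetical rows: one exact duplicate and one unparseable price.
toy = pd.DataFrame(
    {
        "Date": ["3/09/2016", "3/09/2016", "12/11/2017", "4/02/2018"],
        "Price": ["850000", "850000", "n/a", "1200000"],
        "BuildingArea": [120.0, 120.0, None, None],
    }
)
cleaned = clean_raw(toy)
print(cleaned)
```

Note that the missing `BuildingArea` value in the surviving row is left untouched, in line with the decision to defer imputation to the modeling pipeline.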
Post-cleaning snapshot (tabular):
| Metric | Value |
|---|---|
| Rows x Columns | 27,247 x 21 |
| Dropped rows (missing target + exact duplicates) | 6,367 |
| Date dtype | datetime64 (parsed day-first) |
| Top missing columns (post-cleaning) | Missing count | Missing % |
|---|---|---|
| `BuildingArea` | 13,685 | 59.89% |
| `YearBuilt` | 12,376 | 54.16% |
| `Landsize` | 7,495 | 32.80% |
| `Car` | 5,347 | 23.40% |
| `Bathroom` | 5,117 | 22.39% |
| `Bedroom2` | 5,113 | 22.38% |
| `Lattitude` | 4,949 | 21.66% |
| `Longtitude` | 4,949 | 21.66% |
| Object-column profile (post-cleaning) | Count | Unique | Top | Freq |
|---|---|---|---|---|
| `Suburb` | 27,247 | 340 | Reservoir | 634 |
| `Address` | 27,247 | 22,466 | 5 Charles St | 4 |
| `Type` | 27,247 | 3 | h | 15,344 |
| `Method` | 27,247 | 5 | S | 14,881 |
| `SellerG` | 27,247 | 325 | Nelson | 2,372 |
| `CouncilArea` | 27,245 | 33 | Boroondara City Council | 2,221 |
| `Regionname` | 27,245 | 8 | Southern Metropolitan | 7,439 |
2.3 Summary Statistics After Cleaning
| Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| `Price` | 27,247 | 1,056,543.22 | 646,613.71 | 85,000.00 | 637,000.00 | 880,000.00 | 1,300,000.00 | 11,200,000.00 |
| `Rooms` | 27,247 | 2.97 | 0.96 | 1.00 | 2.00 | 3.00 | 4.00 | 16.00 |
| `Bathroom` | 17,733 | 1.57 | 0.70 | 0.00 | 1.00 | 1.00 | 2.00 | 9.00 |
| `Car` | 17,503 | 1.67 | 0.98 | 0.00 | 1.00 | 2.00 | 2.00 | 18.00 |
| `Landsize` | 15,355 | 588.55 | 4,032.16 | 0.00 | 196.00 | 478.00 | 659.50 | 433,014.00 |
| `BuildingArea` | 9,165 | 154.10 | 479.10 | 0.00 | 97.00 | 130.00 | 178.00 | 44,515.00 |
| `Distance` | 27,247 | 10.92 | 6.49 | 0.00 | 6.40 | 10.20 | 13.80 | 48.10 |
| `Propertycount` | 27,245 | 7,533.97 | 4,487.78 | 83.00 | 4,280.00 | 6,567.00 | 10,331.00 | 21,650.00 |
2.4 Sanity Checks (Domain Rules)
Automated plausibility checks run on the cleaned data to catch scrape errors, wrong units, or bad merges before trusting aggregate charts:
- All sale price values are non-negative: no negative prices.
- `Distance` is non-negative: no properties with negative CBD distance.
- `Rooms`, `Bedroom2`, `Bathroom`, `Car` are non-negative integers: no structural anomalies.
- `YearBuilt`, where present, is checked against a sensible historical range (e.g. 1800–2018).
- `Lattitude` and `Longtitude` lie within a Victoria, Australia bounding box: no data entry errors placing properties overseas.
| Sanity Rule | Result | Notes |
|---|---|---|
| `Price >= 0` | PASS | No negative sale prices |
| `Distance >= 0` | PASS | No negative CBD distances |
| `Rooms >= 0` | PASS | No negative room counts |
| `Bathroom >= 0` | PASS | No negative bathroom counts |
| `YearBuilt` in [1800, 2018] when present | FLAG | At least one out-of-range build year appears in raw source |
| `Lattitude` in Victoria bounds | PASS | Values fall inside expected range |
| `Longtitude` in Victoria bounds | PASS | Values fall inside expected range |
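A rule table like the one above could be generated programmatically. The following is a hedged sketch: the helper names are hypothetical, the toy rows are invented, and the Victoria bounding box is an assumed, deliberately loose approximation rather than the bounds used in the notebook.

```python
import numpy as np
import pandas as pd

# Assumed, deliberately loose bounding box for Victoria (not from the notebook).
VIC_LAT_BOUNDS = (-39.5, -33.5)
VIC_LON_BOUNDS = (140.5, 150.5)


def in_range(series: pd.Series, low: float, high: float = np.inf) -> bool:
    """True when every non-missing value lies in [low, high]."""
    return bool(series.dropna().between(low, high).all())


def sanity_report(df: pd.DataFrame) -> pd.Series:
    """Map each domain rule to PASS or FLAG."""
    rules = {
        "Price >= 0": in_range(df["Price"], 0),
        "Distance >= 0": in_range(df["Distance"], 0),
        "Rooms >= 0": in_range(df["Rooms"], 0),
        "Bathroom >= 0": in_range(df["Bathroom"], 0),
        "YearBuilt in [1800, 2018]": in_range(df["YearBuilt"], 1800, 2018),
        "Lattitude in Victoria bounds": in_range(df["Lattitude"], *VIC_LAT_BOUNDS),
        "Longtitude in Victoria bounds": in_range(df["Longtitude"], *VIC_LON_BOUNDS),
    }
    return pd.Series({name: "PASS" if ok else "FLAG" for name, ok in rules.items()})


# Hypothetical rows; the second YearBuilt is deliberately out of range.
toy = pd.DataFrame(
    {
        "Price": [850_000.0, 1_200_000.0],
        "Distance": [2.5, 13.0],
        "Rooms": [2, 4],
        "Bathroom": [1.0, np.nan],
        "YearBuilt": [1995.0, 2100.0],
        "Lattitude": [-37.80, -37.75],
        "Longtitude": [144.99, 145.05],
    }
)
report = sanity_report(toy)
print(report)
```

Missing values are skipped rather than flagged, so a rule only fails when a value that is actually present violates it.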
2.5 Outlier Documentation (Price / Landsize / BuildingArea)
- Top properties table: The notebook lists the top 10 properties by `Price` so extreme luxury sales (multi-million-dollar mansions) are explicit, not only visible as scatter extremes.
- Tukey IQR fences: Lower fence = Q1 − 1.5×IQR, upper fence = Q3 + 1.5×IQR. For heavily right-skewed property prices, many rows exceed the upper fence by expectation; that reflects the long-tailed, luxury-heavy structure of the Melbourne market, not bad data. The same applies to `Landsize` (rural blocks) and `BuildingArea` (mansions).
- Decision: Keep those rows as real sales; use log scales and log-transformed features in models as needed.
Top 10 properties by Price (post-cleaning):
| Suburb | Type | Rooms | Landsize | BuildingArea | Price |
|---|---|---|---|---|---|
| Brighton | h | 4 | 1,400.0 | NaN | 11,200,000 |
| Mulgrave | h | 3 | 744.0 | 117.0 | 9,000,000 |
| Canterbury | h | 5 | 2,079.0 | 464.3 | 8,000,000 |
| Hawthorn | h | 4 | 1,690.0 | 284.0 | 7,650,000 |
| Armadale | h | 4 | NaN | NaN | 7,000,000 |
| Armadale | h | 4 | NaN | NaN | 6,800,000 |
| Kew | h | 6 | 1,334.0 | 365.0 | 6,500,000 |
| Melbourne | u | 3 | NaN | NaN | 6,500,000 |
| Toorak | h | 4 | NaN | NaN | 6,460,000 |
| Middle Park | h | 5 | 553.0 | 308.0 | 6,400,000 |
IQR fence documentation:
| Feature | Q1 | Q3 | Lower Fence | Upper Fence | Outlier Rows | Outlier % |
|---|---|---|---|---|---|---|
| `Price` | 637,000.0 | 1,300,000.0 | -357,500.0 | 2,294,500.0 | 1,088 | 4.76% |
| `Landsize` | 196.0 | 659.5 | -499.25 | 1,354.75 | 402 | 2.62% |
| `BuildingArea` | 97.0 | 178.0 | -24.50 | 299.50 | 473 | 5.16% |
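The fence columns follow directly from the quartiles via the Tukey rule. As a small worked check, the sketch below recomputes them from the Q1 and Q3 values reported in the table; the function name `tukey_fences` is an illustrative choice, not a name from the notebook.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles as reported in the IQR fence table.
quartiles = {
    "Price": (637_000.0, 1_300_000.0),
    "Landsize": (196.0, 659.5),
    "BuildingArea": (97.0, 178.0),
}
fences = {name: tukey_fences(q1, q3) for name, (q1, q3) in quartiles.items()}
for name, (lower, upper) in fences.items():
    print(f"{name}: lower={lower:,.2f} upper={upper:,.2f}")
```

The lower fences come out negative for all three features, so in practice only the upper fence identifies outliers here.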
EDA Highlights
2.6 Property Market Overview
A. Price Distribution
Question: What does the overall distribution of Melbourne house prices look like, and how skewed is it?
- Insight: `Price` is strongly right-skewed. The bulk of sales cluster between ~$400K and ~$1.5M, but the upper tail extends well past $5M. The log-scale view reveals a near-normal distribution, confirming that `log(Price)` is a natural regression target and that log-transforming heavy-tail features will better linearize their relationship with price.
B. Property Type Price Hierarchy (Median)
Question: Are houses systematically more expensive than units, and how wide is the gap?
- Insight: Houses (`h`) have the highest median price, followed by townhouses (`t`), with units (`u`) at the bottom. This confirms `Type` as a strong predictor and supports one-hot encoding with `u` as a reference class in linear-style models.
2.7 Location & Geography
A. Regional Price Hierarchy
Question: Which Melbourne regions command the highest median prices?
- Insight: Median prices differ substantially by region. The highest-priced regions (typically inner-city) sit well above the overall median, while outer growth corridors fall significantly below it. The spread confirms that even at a coarse regional level, location carries strong price signal, motivating both `Regionname` and the finer-grained `Suburb` as features in all models.
B. Distance from CBD vs Price
Question: Is the price gradient from the CBD linear, or does it flatten in outer suburbs?
- Insight: The relationship is negative overall (farther from the CBD generally means cheaper), but the gradient is non-linear. The steepest decline occurs in the 0–15 km inner band. Beyond ~25 km the price floor flattens. Very distant properties (>40 km) show a second cluster of moderate prices from growth-corridor developments, not a recovery of the inner premium. Wide spread at every distance confirms that Distance alone doesn't determine price: suburb quality, property type, and size all interact with it. `log_distance` better linearizes this gradient for regression models.
2.8 Property Size & Layout
A. Property Type vs Price Distribution
Question: How does structural size interact with the type-based price hierarchy?
Type-level pricing evidence is shown above in 2.6.B (Research Q1). This subsection extends the analysis to structural layout variables (rooms and bathrooms), where the strongest separations appear within and across property types.
B. Room Count and Bathrooms vs Price
Question: Does adding rooms and bathrooms drive price consistently across property types?
- Insight: Median prices rise consistently with both room and bathroom counts, but the slope diverges sharply by property type. For houses, each additional room carries a much larger premium than for units. Properties with 4+ rooms and 2+ bathrooms sit in the upper price quartile regardless of suburb; the joint combination is more predictive than either feature alone. This motivated the `rooms_x_bathrooms` interaction feature and the `rooms_per_bedroom` and `bathrooms_per_room` ratios.
2.9 Correlation Analysis
Question: How do numeric features relate to each other and to Price?
- Insight:
  - `Rooms`, `Bedroom2`, `Bathroom`, `Car` correlate positively with `Price`: size and amenity features consistently point in the same direction.
  - `Distance` correlates negatively with `Price`: farther from CBD, cheaper.
  - `Rooms` and `Bedroom2` are strongly correlated with each other (~0.9); this multicollinearity motivates the `rooms_per_bedroom` ratio as a more orthogonal signal.
  - `Landsize` shows a weaker correlation with `Price` than expected: inner-city land is small but extremely expensive, while large rural blocks are cheap per m²; the log-transform addresses this non-linearity.
  - `BuildingArea` correlates more cleanly with `Price` than raw `Landsize`: usable floor space is a more direct value driver than total land.
Part 3: Baseline Linear Regression
Goal: Establish a reproducible, leakage-free performance floor before any feature engineering. Every subsequent model must beat this baseline to justify its additional complexity.
Design Decisions
- Feature set: all raw columns except `Address`, `Date` (after extracting year/month), and `SellerG`
- Numeric pipeline: `SimpleImputer(strategy='median')` → `StandardScaler`
- Categorical pipeline: `SimpleImputer(strategy='most_frequent')` → `OneHotEncoder(handle_unknown='ignore')`
- All transformations inside a single ColumnTransformer: fit on train only, zero test-set leakage
- `LinearRegression()` with default parameters: no regularization, no tuning
- 80/20 stratified split, `random_state=42`: fully reproducible
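The design decisions above map naturally onto a scikit-learn `Pipeline`. The sketch below shows one way to wire them together; the function name `build_baseline`, the step names, and the toy data are assumptions for illustration, and the train/test split is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_baseline(numeric_cols, categorical_cols) -> Pipeline:
    """Imputation, scaling, and encoding live inside the pipeline, so they are fit on train only."""
    numeric = Pipeline(
        [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
    )
    categorical = Pipeline(
        [
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )
    preprocess = ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)]
    )
    return Pipeline([("preprocess", preprocess), ("model", LinearRegression())])


# Hypothetical training rows with a missing numeric value.
X_train = pd.DataFrame(
    {
        "Rooms": [2, 3, 4, 3, 2, 5],
        "Distance": [2.5, 10.0, np.nan, 8.0, 3.0, 15.0],
        "Type": ["u", "h", "h", "t", "u", "h"],
    }
)
y_train = np.array([600e3, 900e3, 1.4e6, 950e3, 650e3, 1.1e6])
model = build_baseline(["Rooms", "Distance"], ["Type"]).fit(X_train, y_train)

# A new row with a missing Distance and a category unseen in training ("x").
X_new = pd.DataFrame({"Rooms": [3], "Distance": [np.nan], "Type": ["x"]})
preds = model.predict(X_new)
print(preds)
```

Because `handle_unknown='ignore'` encodes the unseen category as all zeros, prediction does not fail on suburbs or types that were absent from the training split.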
Baseline Results
| Metric | Train | Test |
|---|---|---|
| MAE | n/a | 229,318.5712 |
| MSE | n/a | 143,327,442,971.9292 |
| RMSE | n/a | 378,586.1104 |
| R² | n/a | 0.6646 |
Reading this plot: Points on the diagonal y=x line are perfect predictions. Points below the line are over-predictions (model said higher than reality); points above are under-predictions. The expected pattern for this dataset: a tight cluster for sub-$1M properties near the diagonal, spreading into a progressively wider band above $1.5M; the model consistently underestimates luxury properties because they are outliers the linear fit cannot represent. The 45° reference line is the "perfect model" benchmark.
Reading this plot: Residuals (actual − predicted) plotted against predicted price. A well-behaved
linear model shows residuals randomly scattered around zero with constant spread. The expected pattern
here is a fan shape: small residuals at low predicted prices ($300K–$700K) that widen dramatically
above $1M. This is textbook heteroscedasticity: variance in prediction error increases with the
magnitude of the target. The fan shape does not mean the model is broken; it is the natural
consequence of using raw AUD price as the regression target on a right-skewed market. Two fixes:
predict log(Price) and back-transform, or switch to a model family that builds local rules
(Random Forest, Gradient Boosting) rather than fitting a single global line.
Key observations:
- R² = 0.6646: the model explains approximately 66.46% of price variance. Meaningful signal is present, but a large fraction remains unexplained, consistent with a complex non-linear market.
- Residuals show a fan shape (heteroscedasticity): errors are larger for high-priced properties. This is expected with raw price as the target and motivates a model family that handles non-constant variance (trees).
- Feature importance: the coefficient plot shows location features (particularly suburb one-hot dummies) dominate the baseline; the strongest single predictors are location-based, not property-size-based.
Top Coefficients β Baseline
Reading this plot: Bars extending right = features that push predicted price up; bars extending
left = features that push price down. The expected pattern: a handful of premium inner-city suburb
dummies (Toorak, Hawthorn, South Yarra, etc.) are the largest positive bars, each adding hundreds
of thousands of dollars to the prediction. A handful of outer-suburb dummies (Melton, Werribee, etc.)
are the largest negative bars. Distance appears as a negative bar, consistent with the Q3 finding.
Rooms and Bathroom appear as positive bars but with smaller magnitude than the location dummies,
confirming that in Melbourne's property market, where you buy matters more than what you buy.
Key finding: Suburb dummies dominate the coefficient ranking. This is the quantitative proof of "location, location, location": the linear model assigns the majority of its explanatory power to suburb membership, not property size. This finding directly motivated the KMeans neighbourhood clustering in Part 4: clustering captures location signal in a form that generalises better than 300+ sparse suburb dummies that each appear in only a handful of training rows.
Part 4: Feature Engineering
Feature engineering is the final step before model selection. Every feature created below is directly traceable to an EDA finding or a domain observation about the Melbourne property market.
4.1 Ten New Engineered Features
| Feature | Type | Source / Rationale |
|---|---|---|
| `sale_year` | Numeric | Year extracted from Date; captures market cycle effects (2016 vs 2017 vs 2018) |
| `sale_month` | Numeric | Month extracted from Date; captures seasonality (spring auction peak in Melbourne) |
| `sale_quarter` | Numeric | Quarter extracted from Date; coarser time bucket than month |
| `rooms_per_bedroom` | Ratio | Rooms / Bedroom2; high ratio = many living/utility rooms relative to bedrooms; signals open-plan or luxury layouts |
| `bathrooms_per_room` | Ratio | Bathroom / Rooms; captures bathroom density; high values signal premium fitouts |
| `building_to_land_ratio` | Ratio | BuildingArea / Landsize; how much of the land is built on; differentiates dense inner-city from sprawling outer properties |
| `log_landsize` | Log transform | log1p(Landsize); compresses the extreme right tail; brings distribution near-normal for linear models |
| `log_buildingarea` | Log transform | log1p(BuildingArea); same rationale as log_landsize |
| `log_distance` | Log transform | log1p(Distance); linearizes the CBD-distance price gradient found in Q3 EDA |
| `rooms_x_bathrooms` | Interaction | Rooms × Bathroom; captures the joint premium of large, well-appointed homes (Q4 EDA) |
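The ten features in the table can be derived with plain pandas and NumPy. The sketch below is one possible implementation, not the notebook's code: the function name is hypothetical, and turning zero denominators into NaN (so the pipeline imputer handles them later) is an assumption about how division by zero is treated.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the ten engineered columns; expects Date already parsed to datetime."""
    out = df.copy()
    # Time decomposition
    out["sale_year"] = out["Date"].dt.year
    out["sale_month"] = out["Date"].dt.month
    out["sale_quarter"] = out["Date"].dt.quarter
    # Ratios; zero denominators become NaN (assumption) rather than inf
    out["rooms_per_bedroom"] = out["Rooms"] / out["Bedroom2"].replace(0, np.nan)
    out["bathrooms_per_room"] = out["Bathroom"] / out["Rooms"].replace(0, np.nan)
    out["building_to_land_ratio"] = out["BuildingArea"] / out["Landsize"].replace(0, np.nan)
    # log1p handles zeros safely: log1p(0) == 0
    out["log_landsize"] = np.log1p(out["Landsize"])
    out["log_buildingarea"] = np.log1p(out["BuildingArea"])
    out["log_distance"] = np.log1p(out["Distance"])
    # Interaction
    out["rooms_x_bathrooms"] = out["Rooms"] * out["Bathroom"]
    return out


# Hypothetical rows; the second has Landsize == 0 (e.g. a unit with no land).
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2017-09-03", "2018-02-04"]),
        "Rooms": [4, 2],
        "Bedroom2": [3.0, 2.0],
        "Bathroom": [2.0, 1.0],
        "Landsize": [500.0, 0.0],
        "BuildingArea": [150.0, 80.0],
        "Distance": [10.0, 2.5],
    }
)
features = add_engineered_features(toy)
print(features.iloc[:, len(toy.columns):])
```

These are row-wise transformations with no fitted statistics, so applying them before the train/test split does not leak information.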
4.2 Feature Evidence β Keep/Drop Rationale
Mutual information regression scores and chi-square statistics were computed for all features to confirm which engineered features add real signal. Key evidence is summarized directly in the ranked findings and keep/drop rationale below, so no separate screenshot table is required here.
4.3 KMeans Clustering β Neighbourhood Segmentation
Clustering was applied to capture neighbourhood-like spatial structure that raw suburb names encode imperfectly. The goal was to group properties by their location-profile signature β not by suburb boundary, but by the underlying spatial pattern of distance, density, and coordinates.
Clustering inputs: [Lattitude, Longtitude, Distance, Propertycount]
Preprocessing: SimpleImputer(median) → StandardScaler, fit on train-set clustering inputs only
Algorithm: KMeans(n_clusters=6, random_state=42, n_init=20)
Why k=6 in feature construction, but k=3 in silhouette validation?
Part 4 feature construction used k=6 to generate the final cluster-derived features
(cluster_label + cluster_dist_0..5) and to preserve a richer neighbourhood segmentation signal.
In the separate validation sweep across candidate k values, the highest silhouette score was observed
at k=3. Both facts are reported for transparency: k=6 was used for engineered features, while
the sweep indicates k=3 as the most compact separation under silhouette.
Cluster-derived features added to each row:
| Feature | Type | Description |
|---|---|---|
| `cluster_label` | Categorical (0–5) | Discrete cluster membership; one-hot encoded in pipeline |
| `cluster_dist_0` | Continuous | Euclidean distance to centroid of cluster 0; atypicality signal |
| `cluster_dist_1` | Continuous | Distance to centroid of cluster 1 |
| `cluster_dist_2` | Continuous | Distance to centroid of cluster 2 |
| `cluster_dist_3` | Continuous | Distance to centroid of cluster 3 |
| `cluster_dist_4` | Continuous | Distance to centroid of cluster 4 |
| `cluster_dist_5` | Continuous | Distance to centroid of cluster 5 |
The centroid-distance features (cluster_dist_k) add a continuous atypicality signal: a property far from its assigned cluster centroid is a spatial outlier within its tier, a signal that can improve model predictions at the tails of the price distribution.
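A sketch of how the label and centroid-distance features might be produced with scikit-learn follows. The KMeans parameters match those stated above; the function name `cluster_features` and the random stand-in data are assumptions. `KMeans.transform` returns the distance from each row to every centroid, which is what the `cluster_dist_k` columns describe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cluster_features(X_train, X_test, k=6, seed=42):
    """Fit imputer, scaler, and KMeans on train only; return (labels, distances) per split."""
    prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
    Z_train = prep.fit_transform(X_train)  # statistics come from train only
    Z_test = prep.transform(X_test)
    km = KMeans(n_clusters=k, random_state=seed, n_init=20).fit(Z_train)
    train = (km.predict(Z_train), km.transform(Z_train))
    test = (km.predict(Z_test), km.transform(Z_test))
    return train, test


# Random stand-in for [Lattitude, Longtitude, Distance, Propertycount], with some gaps.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[rng.random(X.shape) < 0.05] = np.nan
(train_labels, train_dist), (test_labels, test_dist) = cluster_features(X[:240], X[240:])
print(train_dist.shape, test_dist.shape)
```

Each row's label is the index of its smallest distance column, so the two feature types carry overlapping but not identical information: the label is discrete, while the distances show how typical the row is of every cluster.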
4.4 Cluster Visualization
Reading this plot: PCA projects the 4-dimensional clustering space (Lattitude, Longtitude, Distance, Propertycount) into 2D. Each dot is a property, colored by its cluster assignment. Well-separated blobs = clusters with distinct location profiles. Expected pattern: an arc or gradient structure reflecting Melbourne's radial geography, with CBD-proximate clusters (inner ring) on one side, growth-corridor clusters (outer ring) on the other, and middle-ring clusters bridging the gap. Some overlap between adjacent clusters is normal and expected: suburb boundaries are fuzzy, and properties near a cluster boundary genuinely share characteristics with both groups. Tight, non-overlapping blobs would actually be suspicious, since real spatial data rarely partitions perfectly in Euclidean distance.
Cluster interpretation:
| Cluster | Profile | Approximate Zone |
|---|---|---|
| 0 | Transitional middle-ring profile with moderate distance-to-centroid values | Middle ring |
| 1 | Distinct location pocket with stronger separation from other centroids | Outer ring / edge corridors |
| 2 | Dense metro-like profile; dominant in the sample shown during cluster feature preview | Inner-to-middle metro |
| 3 | Higher atypicality-distance zone; captures mixed suburban profiles | Mixed transition belt |
| 4 | Peripheral cluster with larger centroid distances and broader spread | Outer suburban fringe |
| 5 | Alternative metro-suburban mix with moderate centroid distance signature | Established suburban ring |
4.5 Cluster Validation
Silhouette score and an ablation test (model performance with vs without cluster features) were run to confirm that clustering genuinely improves predictions rather than just adding noise.
| Check | Result |
|---|---|
| Best silhouette score (sweep k=2..8) | 0.6102 (best at k=3) |
| RMSE without cluster features | 378,586.1104 |
| RMSE with cluster features | 373,550.5124 |
| Gain from clustering | 5,035.5980 RMSE reduction |
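The silhouette sweep over k=2..8 could look like the sketch below. It runs on synthetic blobs with three well-separated centres (an assumption chosen so the outcome is easy to interpret), not on the Melbourne data, and `silhouette_sweep` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_sweep(Z, k_values=range(2, 9), seed=42):
    """Return {k: silhouette score} for each candidate number of clusters."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(Z)
        scores[k] = float(silhouette_score(Z, labels))
    return scores


# Synthetic 4-D data with three clearly separated groups.
centers = [[0, 0, 0, 0], [10, 0, 0, 0], [0, 10, 0, 0]]
Z, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)
scores = silhouette_sweep(Z)
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.4f}")
print("best k:", best_k)
```

Silhouette rewards compact, well-separated partitions, so it tends to favour a coarse k; choosing a larger k for feature construction, as done here with k=6, trades some compactness for a finer location signal.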
4.6 Feature Engineering Impact: Isolated Proof
Same model (Linear Regression), same hyperparameters, same split:
| Stage | Features | RMSE | R² |
|---|---|---|---|
| Raw baseline (Part 3) | ~20 raw | 378,586.1104 | 0.6646 |
| Engineered (Part 4) | +10 engineered + 7 cluster | 373,550.5124 | 0.6734 |
| Gain | +17 features | -5,035.5980 | +0.0089 |
Feature engineering contributed measurable improvement even before switching model families.
4.7 Final Feature Matrix Summary
| Category | Count |
|---|---|
| Raw numeric (StandardScaler) | 12 |
| Engineered numeric (ratios, logs, interaction, time) | 10 |
| Cluster distance features | 6 |
| Cluster label (one-hot) | 6 |
| Categorical (one-hot encoded) | ~687 (train-split dependent) |
| Total features | ~721 (sparse, train-split dependent) |
Note: The exact one-hot dimensionality depends on which categories appear in the training split
(OneHotEncoder(handle_unknown='ignore')), so totals can vary slightly across reruns/splits.
Part 5: Three Improved Regression Models
All three models were trained on the full engineered feature matrix. Same pipeline structure, same stratified split, same seed. Performance differences are attributable to model architecture only.
Model Architectures
| Model | Architecture | Key Parameters |
|---|---|---|
| Linear Regression (Engineered) | Global linear fit | Default (no regularization) |
| Random Forest Regressor | 350 independent decision trees | max_depth=20, min_samples_leaf=2, n_jobs=-1 |
| Gradient Boosting Regressor | 300 sequential boosted trees | n_estimators=300, learning_rate=0.05, max_depth=3 |
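For reference, the three architectures could be instantiated and compared as sketched below. The hyperparameters come from the table; `random_state=42` on the estimators, the `evaluate` helper, and the synthetic data are assumptions made so the example is self-contained. In the actual project each estimator sits behind the shared preprocessing pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

candidates = {
    "Linear Regression (Engineered)": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(
        n_estimators=350, max_depth=20, min_samples_leaf=2, n_jobs=-1, random_state=42
    ),
    "Gradient Boosting Regressor": GradientBoostingRegressor(
        n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42
    ),
}


def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit on train, score on test with the three metrics used in the results table."""
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mse = mean_squared_error(y_te, pred)
    return {
        "MAE": float(mean_absolute_error(y_te, pred)),
        "RMSE": float(np.sqrt(mse)),
        "R2": float(r2_score(y_te, pred)),
    }


# Synthetic non-linear target standing in for the engineered feature matrix.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(300, 5))
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in candidates.items()}
for name, metrics in results.items():
    print(name, metrics)
```

Holding the split and seed fixed across candidates is what allows performance differences to be attributed to the model family rather than to sampling noise.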
Part 5 Results
| Model | MAE | RMSE | R² | RMSE vs Baseline |
|---|---|---|---|---|
| Baseline LR (Part 3) | 229,318.5712 | 378,586.1104 | 0.6646 | n/a |
| Linear Regression (Engineered) | 226,190.4886 | 373,550.5124 | 0.6734 | -5,035.5980 |
| Gradient Boosting Regressor | 186,308.9222 | 325,527.3773 | 0.7520 | -53,058.7331 |
| Random Forest Regressor (WINNER) | 165,525.8822 | 301,281.4212 | 0.7876 | -77,304.6892 |
Reading this plot: Each group of bars represents one model. Shorter RMSE bar = better. Taller R² bar = better. The expected pattern: baseline linear regression is tallest on RMSE and shortest on R². Random Forest should show the most dramatic improvement: RMSE dropping from ~$378K to ~$301K (-$77K, -20%) and R² rising from 0.665 to 0.788. The Gradient Boosting bar should sit between linear and Random Forest. The visual gap between the linear and tree-based models makes the "non-linearity premium" immediately obvious without needing to read the numbers.
Reading this plot: Compare this to the baseline Actual vs Predicted above. The expected improvement: the cloud of points is tighter around the diagonal, especially in the $500K–$1.5M band (the bulk of the market). The luxury end (>$2M) will still show scatter; these properties have idiosyncratic features (ocean views, heritage listing, development potential) that no tabular model captures from the available columns. The tighter diagonal indicates the Random Forest found the non-linear suburb × size interactions that the linear model averaged away.
Reading this plot: Compare to the baseline residual fan. The expected improvement: the fan narrows;
residuals at high predicted values are smaller than the baseline, indicating the tree model handles
the right tail better. Some systematic negative residuals may remain at the top end (model still
underestimates the most expensive properties), but the spread should be visibly more homoscedastic
than the baseline. If the fan is still wide, it signals that log-transforming the target Price
would be the next meaningful improvement.
Feature Importance β Regression Winner
Reading this plot: Random Forest importance = average reduction in node impurity (MSE) from
splitting on that feature, averaged across all 350 trees. Longer bar = feature the model relied on
most. Expected top features: log_distance (CBD proximity drives price more than almost anything),
select suburb dummies for the highest-value suburbs, log_buildingarea (usable floor space),
rooms_x_bathrooms (the engineered interaction), and cluster_dist features (spatial atypicality).
If log_distance or a suburb dummy tops the chart, it confirms the EDA finding that location
dominates. If an engineered feature (e.g. building_to_land_ratio) appears above raw features,
it validates that the feature engineering in Part 4 extracted real signal rather than noise.
Key observations:
Why Random Forest wins on this split: Random Forest achieved the lowest RMSE and highest RΒ² among the tested regressors. Averaging 350 deep trees reduced variance while capturing non-linear interactions between location, size, and engineered ratio/log features, yielding the strongest generalization.
Why Gradient Boosting is a strong runner-up: Boosting still performed well and clearly outperformed linear regression by modeling non-linear effects. On this run, however, it did not beat Random Forest on RMSE.
Why engineered features dominate importance: In both tree models, log-transformed features (log_landsize, log_buildingarea, log_distance) rank near the top β confirming that normalizing the right-skewed distributions was the most impactful single preprocessing decision. The cluster_dist features also appear in the top rankings, validating the neighbourhood segmentation approach.
Winner Declaration
Winner: Random Forest Regressor, saved as winner_regression_model.pkl
Selection criterion: lowest RMSE with strong R² (balanced error reduction + explained variance).
Part 6: Winning Regression Model Export
The winning regression pipeline (preprocessor + model) was serialized to pickle and uploaded to this HuggingFace repository.
```python
import pickle

# Serialize the full winning pipeline (preprocessor + model)
with open("winner_regression_model.pkl", "wb") as f:
    pickle.dump(winner_reg_model, f)
```
- File: winner_regression_model.pkl
- Test RMSE: 301,281.4212
- Test R²: 0.7876
Part 7: Regression → Classification
Why convert? A continuous price prediction is not always directly actionable for a buyer or investor. A price tier (Low / Mid / High) is. This section converts the regression target into an operationally meaningful 3-class classification problem.
7.1 Threshold Strategy: Quantile Binning
The thresholds were computed on the training set only (no leakage from test):
| Class | Label | Definition | Threshold |
|---|---|---|---|
| 0 | Low | Price ≤ 33rd percentile of train | ≤ 707,000 AUD |
| 1 | Mid | 33rd < Price ≤ 67th percentile of train | 707,000 – 1,120,000 AUD |
| 2 | High | Price > 67th percentile of train | > 1,120,000 AUD |
Why quantiles? Quantile thresholds produce near-balanced classes (~33% each), which avoids the class imbalance problem that business-rule thresholds (e.g., fixed dollar cutoffs) would create. Balanced classes mean the classifier can learn all three tiers equally well without requiring class-weight corrections.
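The train-only quantile binning described above could be sketched as follows. The prices are synthetic and the helper name `to_tier` is hypothetical; the notebook's actual implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed prices standing in for the real train/test targets (assumption)
y_train = rng.lognormal(mean=13.7, sigma=0.5, size=2000)
y_test = rng.lognormal(mean=13.7, sigma=0.5, size=500)

# Thresholds are computed from the training target only, so no test information leaks in
q33, q67 = np.quantile(y_train, [0.33, 0.67])


def to_tier(prices, low_cut, high_cut):
    """Map prices to 0 (Low), 1 (Mid), 2 (High) using fixed cut points."""
    # right=True puts a price exactly equal to a cut point into the lower tier,
    # matching "Price <= q33 is Low" and "q33 < Price <= q67 is Mid"
    return np.digitize(prices, bins=[low_cut, high_cut], right=True)


tier_train = to_tier(y_train, q33, q67)
tier_test = to_tier(y_test, q33, q67)  # same train-derived thresholds reused

print("train class shares:", np.bincount(tier_train, minlength=3) / len(tier_train))
print("test class shares: ", np.bincount(tier_test, minlength=3) / len(tier_test))
```

Reusing the training thresholds on the test set is what keeps the test class shares only approximately, not exactly, one third each.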
7.2 Class Balance
| Class | Train Count | Train % | Test Count | Test % |
|---|---|---|---|---|
| 0 – Low | 7,269 | 33.35% | 1,840 | 33.76% |
| 1 – Mid | 7,298 | 33.48% | 1,785 | 32.75% |
| 2 – High | 7,230 | 33.17% | 1,825 | 33.49% |
| Imbalance ratio (max/min) | - | - | 1.03 | - |
Near-balanced classes (~33% each), so no class-weight correction is required.
7.3 Metric Priority for Part 8
Why macro-F1 over accuracy? With near-balanced classes, accuracy is informative, but a model could still perform well overall while systematically failing on one class (for example, always misclassifying mid-tier properties). Macro-F1 averages F1 equally across all three classes, so every tier must be well-predicted for the score to be high.
- Primary metric: Macro-F1
- Secondary metric: Accuracy
Precision vs Recall trade-off: For a property buyer, a False Negative (missing a high-value property in the High tier) is more costly than a False Positive (labeling a mid-tier property as high). For a seller, the reverse holds. Since both buyer and seller perspectives matter equally, neither precision nor recall is systematically weighted; the balanced macro-F1 reflects this.
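To make the accuracy versus macro-F1 argument concrete, the toy example below (invented labels, not the project's predictions) shows a classifier that never predicts the Mid class: accuracy still looks moderate, while macro-F1 drops sharply.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (assumption): 100 properties per tier
y_true = np.array([0] * 100 + [1] * 100 + [2] * 100)

# A degenerate classifier: Low and High are always right,
# but every Mid property is pushed into Low or High
y_pred = np.array([0] * 100 + [0] * 50 + [2] * 50 + [2] * 100)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted Mid class as F1 = 0 without a warning
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)

print(f"accuracy = {acc:.3f}")          # 200 of 300 correct
print(f"macro-F1 = {macro:.3f}")        # mean of the three per-class F1 scores
print("per-class F1 =", np.round(per_class, 3))
```

Here Low and High each reach F1 = 0.8 and Mid scores 0, so macro-F1 is about 0.533 against an accuracy of about 0.667.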
Part 8: Train & Evaluate Classification Models
8.1 Precision vs Recall β Context
For this housing price classification task:
- False Positive (predicting High when actually Mid): the buyer spends attention on a property that is not premium; potential opportunity cost.
- False Negative (predicting Mid when actually High): the buyer misses a premium property.
In an equal-weight framing (no specific business rule privileging buyers or sellers), macro-F1 is the right primary metric. Per-class recall on the High (2) class is a key secondary metric: missing premium properties is the most visible failure mode and the one most users care about.
8.2 Three Classification Models
| Model | Architecture | Key Parameters |
|---|---|---|
| Logistic Regression | Global linear decision boundaries | max_iter=2000, random_state=42 |
| Random Forest Classifier | 350 independent decision trees | max_depth=18, min_samples_leaf=2, n_jobs=-1 |
| Gradient Boosting Classifier | 250 sequential boosted trees | n_estimators=250, learning_rate=0.05, max_depth=3 |
All models used the same ColumnTransformer pipeline from Part 4, fit on train only.
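The three configurations in the table could be wired up roughly as below. This is a sketch under assumptions: the data is synthetic, `StandardScaler` is only a placeholder for the Part 4 ColumnTransformer (which is not reproduced here), `n_estimators=350` for the forest is taken from the Architecture column, and `random_state=42` follows the design-decisions table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the engineered feature matrix (assumption)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=2000, random_state=42),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=350, max_depth=18, min_samples_leaf=2,
        n_jobs=-1, random_state=42,
    ),
    "Gradient Boosting Classifier": GradientBoostingClassifier(
        n_estimators=250, learning_rate=0.05, max_depth=3, random_state=42,
    ),
}

macro_f1 = {}
for name, estimator in models.items():
    # StandardScaler is a placeholder for the Part 4 ColumnTransformer
    pipe = Pipeline([("preprocess", StandardScaler()), ("model", estimator)])
    pipe.fit(X_train, y_train)  # preprocessing statistics come from the training split only
    macro_f1[name] = f1_score(y_test, pipe.predict(X_test), average="macro")

for name, score in macro_f1.items():
    print(f"{name}: macro-F1 = {score:.4f}")
```

Wrapping the preprocessor and estimator in one `Pipeline` is what guarantees the "fit on train only" property: calling `fit` never sees the test rows.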
8.3 Evaluation Results
| Model | Accuracy | Macro-F1 | Weighted-F1 | ROC-AUC (OvR, macro) |
|---|---|---|---|---|
| Logistic Regression | 0.7932 | 0.7925 | 0.7933 | 0.9282 |
| Random Forest Classifier | 0.7934 | 0.7917 | 0.7926 | 0.9304 |
| Gradient Boosting Classifier (WINNER) | 0.7963 | 0.7952 | 0.7961 | 0.9289 |
Reading this plot: Three bars, one per model. All three should cluster tightly in the 0.79–0.80 range. With near-balanced classes (~33% each), always predicting a single class would score only about 0.33, so every model is far above chance. The differences between bars are small in absolute terms (about 0.003, or 0.3pp): on the 5,450-property test set that is roughly 17 additional correct classifications, and on a larger portfolio the same gain translates to proportionally fewer misclassified properties. Gradient Boosting edges out the others.
Reading this plot: Macro-F1 is the primary metric; it averages F1 equally across all three price classes. A model that performs well on Low and High but poorly on Mid will show a low macro-F1 even with high accuracy. Expected pattern: bars between 0.79 and 0.80, with Gradient Boosting highest (0.7952) and Random Forest lowest (0.7917), as in the results table. The tight clustering confirms that all three models are genuinely competitive: the dataset is rich enough that even a linear classifier captures most of the signal.
8.4 Classification Reports
Logistic Regression
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Low (0) | 0.8311 | 0.8500 | 0.8405 | 1840 |
| Mid (1) | 0.6871 | 0.6913 | 0.6892 | 1785 |
| High (2) | 0.8592 | 0.8356 | 0.8472 | 1825 |
| Macro avg | 0.7925 | 0.7923 | 0.7925 | 5450 |
| Weighted avg | 0.7934 | 0.7932 | 0.7933 | 5450 |
Random Forest Classifier
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Low (0) | 0.8219 | 0.8582 | 0.8397 | 1840 |
| Mid (1) | 0.7012 | 0.6784 | 0.6896 | 1785 |
| High (2) | 0.8531 | 0.8405 | 0.8467 | 1825 |
| Macro avg | 0.7921 | 0.7923 | 0.7917 | 5450 |
| Weighted avg | 0.7928 | 0.7934 | 0.7926 | 5450 |
Gradient Boosting Classifier
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Low (0) | 0.8267 | 0.8582 | 0.8421 | 1840 |
| Mid (1) | 0.6986 | 0.6908 | 0.6946 | 1785 |
| High (2) | 0.8608 | 0.8373 | 0.8489 | 1825 |
| Macro avg | 0.7954 | 0.7954 | 0.7952 | 5450 |
| Weighted avg | 0.7961 | 0.7963 | 0.7961 | 5450 |
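As a quick consistency check, the macro and weighted F1 averages for the Gradient Boosting Classifier can be recomputed from its per-class rows. The numbers below are copied from the report above; only the variable names are introduced here.

```python
# Per-class F1 and support for the Gradient Boosting Classifier (from the report above)
f1 = {"Low": 0.8421, "Mid": 0.6946, "High": 0.8489}
support = {"Low": 1840, "Mid": 1785, "High": 1825}

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its test-set support
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro-F1    = {macro_f1:.4f}")     # 0.7952
print(f"weighted-F1 = {weighted_f1:.4f}")  # 0.7961
```

Because the three classes are near-balanced, the two averages differ only in the third decimal place.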
8.5 Confusion Matrices
Logistic Regression
Reading confusion matrices: Rows = actual class, columns = predicted class. The diagonal (top-left to bottom-right) shows correct predictions; darker diagonal = better model. Off-diagonal cells are errors. Expected dominant pattern across all three models: the Mid (1) class bleeds into both Low (0) and High (2). A property priced just above the q33 threshold is nearly indistinguishable from one just below it, so the model hedges. Low↔High confusions (top-right and bottom-left corners) should be rare: a $400K property looks nothing like a $2M property in the feature space.
Main confusion pattern: Most errors are boundary mistakes where actual Mid is predicted as Low or High; direct Low↔High confusion is limited.
Random Forest Classifier
Main confusion pattern: Errors are concentrated around the Mid class boundaries (Mid→Low / Mid→High), with few extreme Low↔High swaps.
Gradient Boosting Classifier
Main confusion pattern: The winner still mainly confuses near-threshold Mid homes with adjacent tiers, while preserving strong separation of Low vs High.
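The "adjacent-tier versus extreme" error split described in these patterns can be read directly off a confusion matrix. The labels below are a tiny invented example, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 0 = Low, 1 = Mid, 2 = High
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 2, 2, 2, 2, 1])

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

correct = np.trace(cm)
# Adjacent-tier errors: Low<->Mid and Mid<->High
adjacent_errors = cm[0, 1] + cm[1, 0] + cm[1, 2] + cm[2, 1]
# Extreme errors: Low<->High (top-right and bottom-left corners)
extreme_errors = cm[0, 2] + cm[2, 0]

print("correct:", correct, "adjacent:", adjacent_errors, "extreme:", extreme_errors)
```

In this toy case all four errors are adjacent-tier mistakes and the corner cells are zero, which is the shape the text expects from the real models.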
8.6 Feature Importance: Classification Models
Key findings from feature importance:
- Location features (suburb dummies, cluster features, `Distance`) consistently rank highly, confirming that "location, location, location" applies quantitatively, not just qualitatively.
- Engineered ratio features (`rooms_per_bedroom`, `bathrooms_per_room`, `building_to_land_ratio`) appear in the top rankings, validating that structural efficiency captures price signal that raw room counts alone do not.
- `log_landsize` and `log_buildingarea` rank strongly, confirming the importance of the log transform for compressing the right-skewed size distributions.
- `cluster_dist` features appear in both tree model rankings, confirming that neighbourhood atypicality (distance from cluster centroid) adds genuine signal beyond the spatial cluster membership alone.
8.7 Winner Declaration
Winner: Gradient Boosting Classifier, saved as winner_classification_model.pkl
Selected by: highest Macro-F1 (primary) + highest accuracy (tiebreaker).
Why Gradient Boosting edges out Logistic Regression: The Melbourne property market is highly non-linear: suburb-level effects, interactions between property type and location, and the non-linear distance gradient all favour a model that can learn complex decision boundaries. Tree-based ensembles learn these boundaries automatically; logistic regression can only model them if the relevant features are explicitly engineered. On this split the margin is small, though: Logistic Regression (macro-F1 0.7925) sits between Gradient Boosting (0.7952) and Random Forest (0.7917), which suggests the Part 4 features already expose most of the signal to a linear model.
Why Gradient Boosting beats Random Forest: A plausible explanation is sequential error correction, which focuses each new tree on the properties earlier trees misclassified. For this price-tier classification task, the hardest-to-classify properties are the mid-tier ones near the class boundaries, and boosting concentrates on those boundary cases in a way that independent tree averaging does not. The gap (macro-F1 0.7952 vs 0.7917) is small, so this is best read as a hypothesis rather than a tested result.
Final Evaluation
Key Results Summary
| Milestone | Metric | Value |
|---|---|---|
| Baseline Linear Regression | RMSE | 378,586.1104 |
| Baseline Linear Regression | R² | 0.6646 |
| After Feature Engineering (same model) | RMSE | 373,550.5124 |
| After Feature Engineering (same model) | R² | 0.6734 |
| Best Regression Model | RMSE | 301,281.4212 |
| Best Regression Model | R² | 0.7876 |
| Regression → Classification | Class 0 threshold | 707,000 AUD |
| Regression → Classification | Class 2 threshold | 1,120,000 AUD |
| Best Classification Model | Macro-F1 | 0.7952 |
| Best Classification Model | Accuracy | 0.7963 |
| Best Classification Model | ROC-AUC (OvR macro) | 0.9289 |
How to Load and Use the Models
```python
import pickle

import numpy as np
import pandas as pd

# Only unpickle files from sources you trust.
# Load regression model (continuous price prediction)
with open("winner_regression_model.pkl", "rb") as f:
    reg_model = pickle.load(f)

# Load classification model (3-class price tier)
with open("winner_classification_model.pkl", "rb") as f:
    clf_model = pickle.load(f)

# Both models expect the same engineered feature matrix from Part 4.
# X_new is a placeholder: supply a pandas DataFrame with the same columns as X_train_fe
# (raw + engineered columns; DO NOT pre-scale, the pipeline handles it).

# Regression: continuous price in AUD
y_price = reg_model.predict(X_new)

# Classification: discrete price tier and per-class probabilities
y_tier = clf_model.predict(X_new)
y_proba = clf_model.predict_proba(X_new)

tier_map = {0: "Low", 1: "Mid", 2: "High"}
tier_labels = [tier_map[t] for t in y_tier]

print("Predicted prices:", y_price[:5])
print("Predicted tiers:", tier_labels[:5])
print("Class probabilities (Low/Mid/High):")
print(np.round(y_proba[:5], 3))
```
Important notes:
- `X_new` must include all engineered columns (`sale_year`, `sale_month`, `sale_quarter`, `rooms_per_bedroom`, `bathrooms_per_room`, `building_to_land_ratio`, `log_landsize`, `log_buildingarea`, `log_distance`, `rooms_x_bathrooms`) and the clustering columns (`cluster_label`, `cluster_dist_0` through `cluster_dist_5`).
- The KMeans clustering was fit on training data. For truly new properties, you will need to save and reload the clustering artifacts (imputer, scaler, kmeans object) alongside the model pipeline.
- The classification thresholds (q33 = 707,000 AUD, q67 = 1,120,000 AUD) were derived from the training distribution. A property near a threshold boundary will have high uncertainty; use `predict_proba` to assess confidence.
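One possible way to persist and reuse the clustering artifacts mentioned in the notes is sketched below. It assumes the clustering ran on `Lattitude` and `Longtitude` with a median imputer, a standard scaler, and KMeans with k=6; the coordinates are synthetic, and the file name and the helper `add_cluster_features` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic training coordinates (assumption); column spellings follow the dataset
train_geo = pd.DataFrame({
    "Lattitude": rng.normal(-37.8, 0.1, 300),
    "Longtitude": rng.normal(145.0, 0.1, 300),
})

# Bundle imputer + scaler + KMeans so all three are saved and reloaded together
cluster_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=6, n_init=10, random_state=42)),
])
cluster_pipe.fit(train_geo)

# Hypothetical artifact path, stored next to the model pickles in practice
path = os.path.join(tempfile.mkdtemp(), "cluster_artifacts.joblib")
joblib.dump(cluster_pipe, path)
loaded = joblib.load(path)


def add_cluster_features(geo, pipe):
    """Return cluster_label plus one cluster_dist_i column per centroid."""
    labels = pipe.predict(geo)
    dists = pipe.transform(geo)  # distance from each row to every centroid
    cols = [f"cluster_dist_{i}" for i in range(dists.shape[1])]
    out = pd.DataFrame(dists, columns=cols, index=geo.index)
    out.insert(0, "cluster_label", labels)
    return out


# New properties, including one with a missing latitude (filled by the fitted imputer)
new_geo = pd.DataFrame({"Lattitude": [-37.81, np.nan], "Longtitude": [144.96, 145.2]})
features = add_cluster_features(new_geo, loaded)
print(features)
```

Keeping the three steps in a single pipeline object avoids the failure mode where a reloaded KMeans is applied to coordinates scaled with different statistics than at training time.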
Part 8 Additional Analysis: Classification Diagnostics Upgrade
Beyond the core classification report and confusion matrix, the notebook includes additional diagnostic visualizations for a complete picture of each model's behavior.
Per-Class Precision, Recall, F1 Summary
| Model | Class | Precision | Recall | F1-score |
|---|---|---|---|---|
| Logistic Regression | Low (0) | 0.8311 | 0.8500 | 0.8405 |
| Logistic Regression | Mid (1) | 0.6871 | 0.6913 | 0.6892 |
| Logistic Regression | High (2) | 0.8592 | 0.8356 | 0.8472 |
| Random Forest | Low (0) | 0.8219 | 0.8582 | 0.8397 |
| Random Forest | Mid (1) | 0.7012 | 0.6784 | 0.6896 |
| Random Forest | High (2) | 0.8531 | 0.8405 | 0.8467 |
| Gradient Boosting | Low (0) | 0.8267 | 0.8582 | 0.8421 |
| Gradient Boosting | Mid (1) | 0.6986 | 0.6908 | 0.6946 |
| Gradient Boosting | High (2) | 0.8608 | 0.8373 | 0.8489 |
Key pattern: The Mid (1) class is consistently the hardest to classify correctly across all three models. This is expected: properties priced near the q33 and q67 thresholds share characteristics with both adjacent classes. A property priced at exactly the q33 boundary is almost equally likely to be genuinely Low or genuinely Mid; the signal is weakest at the class boundary. This boundary-region ambiguity is irreducible without additional features that distinguish near-boundary properties more sharply.
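One simple way to surface such near-boundary cases at prediction time is the margin between the two largest class probabilities. The helper name `top2_margin`, the example probabilities, and the 0.10 cutoff are all illustrative assumptions, not part of the notebook.

```python
import numpy as np


def top2_margin(proba):
    """Largest minus second-largest class probability, per row."""
    ordered = np.sort(proba, axis=1)  # ascending within each row
    return ordered[:, -1] - ordered[:, -2]


# Illustrative predict_proba output, columns = Low / Mid / High
proba = np.array([
    [0.48, 0.47, 0.05],  # Low and Mid nearly tied: likely close to the q33 boundary
    [0.02, 0.08, 0.90],  # confidently High
])

margins = top2_margin(proba)
uncertain = margins < 0.10  # hypothetical cutoff for flagging a prediction for review

print(np.round(margins, 2), uncertain)
```

A small margin does not say which tier is correct; it only marks predictions where the model itself cannot separate two adjacent tiers.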
Precision–Recall Curves (Class 2: High Price)
The PR curve for the High (2) class shows the trade-off between the fraction of High-tier properties correctly identified (recall) and the fraction of predicted High-tier properties that are genuinely high (precision). At high recall thresholds, all models accept more false positives from the Mid class. The area under the PR curve confirms whether each model maintains useful precision while catching most High-tier properties.
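A one-vs-rest PR curve for a single class can be computed as sketched below. The data and the logistic-regression classifier are synthetic stand-ins; only the idea of treating class 2 as the positive label comes from the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (assumption), class 2 plays the role of "High"
X, y = make_classification(
    n_samples=900, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)

high = 2
# One-vs-rest framing: High is positive, Low and Mid are both negative
y_is_high = (y_te == high).astype(int)
# Pick the probability column that corresponds to the High class
p_high = clf.predict_proba(X_te)[:, list(clf.classes_).index(high)]

precision, recall, thresholds = precision_recall_curve(y_is_high, p_high)
ap = average_precision_score(y_is_high, p_high)  # summary of the area under the PR curve

print(f"average precision for class {high}: {ap:.3f}")
```

Plotting `recall` against `precision` gives the curve; lowering the probability threshold moves along it toward higher recall and, typically, lower precision.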
Regression Diagnostics Upgrade β Improvement Table
| Model | RMSE delta vs baseline | R² delta vs baseline |
|---|---|---|
| Linear Regression (Engineered) | -5,035.5980 | +0.0089 |
| Gradient Boosting Regressor | -53,058.7331 | +0.0874 |
| Random Forest Regressor (Winner) | -77,304.6892 | +0.1230 |
This table makes the engineering and model-selection gains directly comparable. A negative
RMSE_delta means the model improved over baseline; a positive R2_delta means it explains
more variance. The table was included to provide transparent evidence that each modeling
step added genuine value rather than overfitting to the training set.
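The deltas can be reproduced from the summary metrics reported earlier. The sketch below covers only the two models whose absolute RMSE and R² appear in this card; the dictionary layout and the relative-reduction figure are additions for illustration.

```python
# Test metrics as reported in the Key Results Summary
baseline = {"rmse": 378_586.1104, "r2": 0.6646}
models = {
    "Linear Regression (Engineered)": {"rmse": 373_550.5124, "r2": 0.6734},
    "Random Forest Regressor (Winner)": {"rmse": 301_281.4212, "r2": 0.7876},
}

deltas = {}
for name, m in models.items():
    deltas[name] = {
        "rmse_delta": m["rmse"] - baseline["rmse"],  # negative = lower error
        "r2_delta": m["r2"] - baseline["r2"],        # positive = more variance explained
        "rmse_pct": 100 * (m["rmse"] - baseline["rmse"]) / baseline["rmse"],
    }

for name, d in deltas.items():
    print(f"{name}: RMSE {d['rmse_delta']:+,.4f} ({d['rmse_pct']:+.1f}%), "
          f"R2 {d['r2_delta']:+.4f}")
```

Recomputing from the rounded R² values gives +0.0088 for the engineered linear model, one unit in the last digit below the table's +0.0089; the table was presumably computed from unrounded scores.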
Strategic Takeaways
Location dominates every other signal. Suburb membership and distance from the CBD together explain more variance than all property-size features combined. A 2-bedroom unit in South Yarra will outsell a 4-bedroom house in Melton. The coefficient and importance rankings across every model confirm this: location features occupy the top of every chart. For buyers: suburb selection is the single highest-leverage decision. For sellers: pricing against comparable suburb sales matters far more than the number of rooms listed.
The luxury segment is structurally different. Properties above ~$2M consistently have the largest prediction errors across all models. These properties trade on idiosyncratic features (ocean views, heritage listing, land-banking potential, architect design) that are not captured in any column of this dataset. The implication: the models are deployment-ready for the mainstream market ($400K–$1.5M) but should be used with caution for ultra-premium listings where domain expertise is irreplaceable.
Spring auctions drive a seasonal price premium. `sale_month` was engineered as a predictive feature specifically because Melbourne's auction market has a well-documented spring peak (September–November). Properties listed in this window attract more bidders, driving competitive clearing prices above the annual median. The feature importance plots confirm this time signal adds genuine model value beyond what property characteristics alone explain.
Structural efficiency matters more than raw size. `building_to_land_ratio` and `bathrooms_per_room` consistently outrank raw `Landsize` and `BuildingArea` in the tree model importance rankings. A compact, well-fitted inner-city property on a small block commands more per square metre than a large house on an oversized block in an outer suburb. Buyers optimising for value-per-dollar should focus on bathroom count and building coverage, not headline land size.
A unit is not just a cheaper house; it is a different product. The type dummy (`h` vs `u` vs `t`) is one of the strongest features in the baseline model. The price gap between a house and a unit of identical room count in the same suburb is not explained by size alone; it reflects the land-ownership premium and the restriction on capital growth that unit ownership carries in Melbourne. This structural gap means buyers and investors should price units and houses on separate mental models.
Tree models are non-negotiable for this problem. The jump from linear regression (RMSE $378K, R² 0.665) to Random Forest (RMSE $301K, R² 0.788) is not a tuning win; it is a fundamental architectural win. Melbourne property prices are determined by hundreds of suburb-level and property-type interactions that no global linear equation can capture. Any production pricing model for this market should use an ensemble method as its minimum viable architecture.
Limitations
Data coverage ends 2018. The Melbourne property market shifted significantly after 2018: record-low interest rates (2020–2021) drove prices to historic highs; the 2022–2023 rate hiking cycle reversed much of that growth. Models trained on 2016–2018 data will systematically underestimate 2020–2021 prices and may overestimate 2023–2024 prices. Do not use for current market valuations without retraining on recent data.
300+ suburb one-hot categories create sparse, noisy features. Suburbs with fewer than ~20 sales in the training set have unreliable one-hot coefficient estimates β the model has too few examples to learn a stable suburb premium. These small-suburb dummies add dimensionality without adding reliable signal. A better approach: group low-frequency suburbs into an "other" category or use target encoding with cross-validation.
`BuildingArea` missing 47% of rows. Nearly half of all predictions for the regression target rely on the median imputation for building area, the most physically important continuous feature for price-per-square-metre analysis. Imputed values are reasonable (conditioned on median), but they suppress the signal from this feature for a large fraction of the dataset.
No macroeconomic features. Interest rates, Melbourne population growth, immigration levels, vacancy rates, and housing supply pipeline are documented major drivers of Melbourne property prices; none are present in the dataset. The models capture the cross-sectional structure (which property type and location commands a premium) but not the time-series level (whether the whole market is rising or falling).
Geographic data used only for clustering. `Lattitude` and `Longtitude` feed the KMeans clustering but are not used as direct model features. A spatial regression approach (e.g. geographically weighted regression or a spatial lag term) could extract more precise location signal than the coarse 6-cluster discretisation.
All relationships are correlational. The models identify which features are associated with higher prices; they cannot tell us which features cause higher prices. Suburb prestige and building quality are correlated with many unmeasured factors (school catchment, public transport access, heritage character) that drive both the observed feature values and the price outcomes. Feature importance ≠ causal importance.
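For the sparse-suburb limitation, the suggested grouping of low-frequency suburbs into an "other" bucket could look roughly like this. The helper name `group_rare_categories`, the toy suburb counts, and the default `min_count=20` (echoing the "~20 sales" figure) are assumptions for illustration.

```python
import pandas as pd


def group_rare_categories(train_col, other_col, min_count=20, other_label="Other"):
    """Replace categories seen fewer than min_count times in training with other_label."""
    counts = train_col.value_counts()
    keep = set(counts[counts >= min_count].index)  # decided on training data only

    def _apply(col):
        # Values outside the keep set (rare or unseen) collapse into the shared bucket
        return col.where(col.isin(keep), other_label)

    return _apply(train_col), _apply(other_col)


# Toy suburb columns (assumption)
train_suburb = pd.Series(["Richmond"] * 25 + ["Kew"] * 20 + ["Melton"] * 3)
test_suburb = pd.Series(["Richmond", "Melton", "Unseen Suburb"])

train_grouped, test_grouped = group_rare_categories(train_suburb, test_suburb)

print(train_grouped.value_counts().to_dict())
print(test_grouped.tolist())
```

Deciding the keep set on the training column alone mirrors the leakage rule used elsewhere in the project, and it also routes suburbs never seen in training into the same bucket.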
Live Demo
What it is
A live interactive demo lets anyone type in a property's attributes and get an instant price prediction and tier classification, without opening a notebook or writing code. It runs on HuggingFace Spaces (free hosting) powered by Gradio (the same library used in most HuggingFace demos).
Requirements
pandas>=1.3
numpy>=1.21
scikit-learn>=1.0
matplotlib>=3.4
seaborn>=0.11
plotly>=5.0
scipy>=1.7
Key Design Decisions
| Decision | Justification |
|---|---|
| Keep outliers in EDA | Extreme Melbourne prices are real market data (luxury, rural-block); removing them would bias the model against the tails |
| Log-transform Landsize/BuildingArea/Distance | Reduces right-skew by >90%; better linearizes these features for regression models |
| Defer all imputation to sklearn pipeline | Prevents any form of test-set information leaking into imputed train values |
| Fit all transformers on train only | Standard leakage-prevention practice; ColumnTransformer enforces this |
| Use `handle_unknown='ignore'` in OneHotEncoder | Suburbs in the test set may not appear in training; ignoring unseen categories prevents crashes |
| KMeans k=6 for feature construction | Used to create richer cluster-derived features (cluster_label + distances); separate silhouette sweep showed best compactness at k=3 |
| Quantile-based classification thresholds | Produces near-balanced classes (~33% each) without requiring class-weight correction |
| Macro-F1 as primary classification metric | Equal penalty for ignoring any price class; prevents a model from collapsing to the majority class |
| RMSE as primary regression metric | Penalizes large errors more than MAE; appropriate for property prices where a $500k prediction miss is much worse than a $50k miss |
| `random_state=42` throughout | Full reproducibility: any reader can run the notebook and get identical results |
| 80/20 train/test split | Standard proportion for a dataset of this size; 20% gives a large enough test set for stable metric estimates |
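The effect of the `handle_unknown='ignore'` decision in the table above can be seen on a toy suburb column (invented values): a suburb that was absent at fit time encodes to an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): the test set contains a suburb never seen in training
train = pd.DataFrame({"Suburb": ["Richmond", "Kew", "Melton", "Richmond"]})
test = pd.DataFrame({"Suburb": ["Kew", "Brand New Suburb"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # learned categories, sorted: Kew, Melton, Richmond

# The default output is a sparse matrix; convert to dense for printing
encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError` on the unseen suburb.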
Itay Morag





















