NYC Taxi Fare Prediction: End-to-End Machine Learning Pipeline
Regression · Classification · Feature Engineering · Clustering
Author: Michael Ozon
Student ID: 211393673
Institution: Reichman University
Course: Introduction to Data Science – Assignment #2
Date: December 2024
Video Presentation
https://drive.google.com/file/d/1gb9hYJEc4xEcdbjvR3_kpPiSbKKQ_072/view?usp=sharing
This video demonstrates the complete pipeline, from data exploration through model deployment.
Table of Contents
- Project Overview
- Dataset Description
- Exploratory Data Analysis
- Feature Engineering
- Clustering Analysis
- Model Development
- Classification Pipeline
- Results & Performance
- Challenges & Solutions
- Key Learnings
- Repository Contents
Project Overview
Objective
Build a production-ready machine learning system to:
- Predict exact taxi fares with high accuracy (Regression)
- Classify rides as High/Low fare for passenger warnings (Classification)
Business Value
- For Passengers: Transparent fare estimation before booking
- For Drivers: Route optimization based on predicted revenue
- For Companies: Dynamic pricing and demand forecasting
Project Workflow
Raw Data (200K trips) → Data Cleaning → EDA → Feature Engineering
→ Clustering → Regression Models → Classification Models → Deployment
Dataset Description
Source & Size
- Dataset: NYC Yellow Taxi Trip Data (January 2015)
- Source: Kaggle NYC Taxi Dataset
- Original Provider: NYC Taxi & Limousine Commission (TLC)
- Initial Size: 200,000+ trips
- Final Clean Dataset: 185,426 trips after preprocessing
Key Features
| Feature | Type | Description |
|---|---|---|
| trip_distance | Numeric | Trip distance in miles |
| tpep_pickup_datetime | Datetime | Trip start timestamp |
| tpep_dropoff_datetime | Datetime | Trip end timestamp |
| pickup_latitude/longitude | Numeric | Pickup GPS coordinates |
| dropoff_latitude/longitude | Numeric | Dropoff GPS coordinates |
| RatecodeID | Categorical | Rate type (Standard/JFK/Newark/etc.) |
| payment_type | Categorical | Payment method |
| passenger_count | Numeric | Number of passengers |
| fare_amount | Numeric | Base fare charged |
| tip_amount | Numeric | Tip amount |
| total_amount | Numeric | Target variable (total fare) |
Data Quality Issues
Initial data contained significant quality problems that required extensive cleaning:
Issues Addressed:
- Invalid GPS coordinates (outside NYC bounds)
- Unrealistic trip distances (>100 miles)
- Negative or zero fare amounts
- Extreme outliers (above the 99.8th percentile)
- Duplicate records
- Missing values in critical features
Cleaning Result: Removed ~15,000 invalid records, retaining 185,426 clean samples.
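A minimal sketch of the cleaning rules described above, using pandas. The column names follow the TLC schema in the feature table; the exact NYC bounding box and the quantile cutoff are illustrative assumptions, not the project's precise thresholds.

```python
import pandas as pd

def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the filtering rules listed above (illustrative thresholds)."""
    # Rough NYC bounding box for GPS sanity checks (approximate values)
    lat_min, lat_max = 40.5, 41.0
    lon_min, lon_max = -74.3, -73.6

    df = df.drop_duplicates().dropna(subset=["fare_amount", "trip_distance", "total_amount"])
    df = df[
        df["pickup_latitude"].between(lat_min, lat_max)
        & df["pickup_longitude"].between(lon_min, lon_max)
        & df["dropoff_latitude"].between(lat_min, lat_max)
        & df["dropoff_longitude"].between(lon_min, lon_max)
        & (df["trip_distance"] > 0) & (df["trip_distance"] <= 100)
        & (df["total_amount"] > 0)
    ]
    # Trim extreme outliers above the 99.8th percentile of the target
    df = df[df["total_amount"] <= df["total_amount"].quantile(0.998)]
    return df
```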
Exploratory Data Analysis
Research Question 1: Temporal Fare Patterns by Day of Week
Key Findings:
- Weekday Premium: Thursday ($14.93) and Friday ($14.79) show highest average fares
- Weekend Discount: Saturday has lowest average fare ($13.74)
- Business Impact: Weekday fares are consistently $1+ higher, driven by business commuters
- Variance: All days show similar fare distribution spread, but different medians
Insight for Modeling: Day of week is a significant predictor requiring feature encoding.
Research Question 2: Hourly Demand and Pricing Patterns
Key Findings:
- Peak Demand: 7:00 PM (19:00) with 13,645 trips
- Rush Hours: Morning (7-9 AM) and evening (5-8 PM) show elevated volumes
- Late Night Premium: 4-5 AM rides command highest fares ($19+) despite lowest volume (<2,000 trips)
- Distance Patterns: Early morning trips (5 AM) average 4.5 miles, suggesting airport runs
Insight for Modeling: Hour of day captures both demand and pricing dynamics.
Research Question 3: Distance-Fare Relationship
Key Findings:
- Strong Correlation: Distance vs Total Fare = 0.942 (very high)
- Linear Pricing: Approximately $3.23 per mile + $5.70 base fare
- Fare Per Mile Statistics:
- Mean: $7.73/mile
- Median: $6.56/mile
- Distance Brackets:
- 0-1 miles: $7 average
- 1-2 miles: $10 average
- 5-10 miles: $29 average
- 10+ miles: $53 average (airport trips)
- Tip Correlation: 0.556 with distance (moderate)
Insight for Modeling: Distance is the primary predictor but shows non-linear patterns at extremes.
Research Question 4: Geographic Distribution and Hotspots
Key Findings:
- Manhattan Dominance: 80%+ of all taxi activity concentrated in Manhattan core
- High-Density Zones:
- Midtown Manhattan: 5,000+ pickups (hottest zone)
- Financial District: Strong business-day demand
- Upper Manhattan: Consistent activity
- Airport Clusters: Clear high-activity zones at LaGuardia and JFK
- Spatial Symmetry: Pickup and dropoff heatmaps are nearly identical, indicating circular trip patterns
Insight for Modeling: Geographic features require advanced encoding beyond raw coordinates.
Feature Engineering
Feature engineering proved critical to model performance, contributing to a 15-20% improvement in R² score.
1. Temporal Features
Extracted from pickup_datetime:
hour_of_day # 0-23, captures demand cycles
day_of_week # 0-6, weekday/weekend patterns
month # Seasonal variations
is_weekend # Boolean flag
is_rush_hour # 7-9 AM or 5-8 PM boolean
Impact: Rush hour feature alone improved R² by +3%.
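A short sketch of how these temporal flags could be derived with pandas; the helper name and the exact rush-hour windows are assumptions that mirror the description above.

```python
import pandas as pd

def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    ts = pd.to_datetime(df["tpep_pickup_datetime"])
    df["hour_of_day"] = ts.dt.hour
    df["day_of_week"] = ts.dt.dayofweek              # 0 = Monday ... 6 = Sunday
    df["month"] = ts.dt.month
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["is_rush_hour"] = (
        df["hour_of_day"].between(7, 9) | df["hour_of_day"].between(17, 20)
    ).astype(int)
    return df
```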
2. Geospatial Features
Calculated from GPS coordinates:
haversine_distance # Great-circle distance (lat/lon)
lat_diff # Pickup - dropoff latitude
lon_diff # Pickup - dropoff longitude
distance_efficiency # Distance / duration ratio
Discovery: Haversine distance (r=0.91) outperformed the original trip_distance feature, suggesting GPS errors in raw data.
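A vectorized haversine implementation along the lines described; the 3,959-mile Earth radius gives distances in miles. Function and column names are illustrative.

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 3959.0 * np.arcsin(np.sqrt(a))

# df["haversine_distance"] = haversine_miles(
#     df["pickup_latitude"], df["pickup_longitude"],
#     df["dropoff_latitude"], df["dropoff_longitude"],
# )
```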
3. Polynomial Features
Non-linear transformations to capture fare scaling:
trip_distance_squared = trip_distance²
trip_distance_cubed = trip_distance³
Rationale: Per-mile fare decreases at longer distances due to economies of scale and flat airport rates.
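These transforms are simple column operations in pandas; a minimal sketch:

```python
# Polynomial terms let models capture the flattening of per-mile cost on long trips
df["trip_distance_squared"] = df["trip_distance"] ** 2
df["trip_distance_cubed"] = df["trip_distance"] ** 3
```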
4. Interaction Features
The most impactful engineered features:
distance_x_ratecode = trip_distance × RatecodeID
distance_x_hour = trip_distance × hour_of_day
distance_x_cluster = trip_distance × cluster_id
Star Feature: distance_x_ratecode
- Feature Importance: 0.8827 (88.27%)
- Captures: How airport rate codes (RatecodeID=2,3) interact with distance
- Example: 10-mile JFK trip ($52 flat) vs. 10-mile Manhattan trip ($37 metered)
Key Insight: Just 3 features account for roughly 90% of the model's total feature importance.
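A sketch of how the interaction terms above might be constructed; it assumes RatecodeID is still numeric at this point and that cluster_id comes from the K-Means step described later.

```python
# Interaction terms: let the model see distance scaled by its pricing context
df["distance_x_ratecode"] = df["trip_distance"] * df["RatecodeID"]
df["distance_x_hour"] = df["trip_distance"] * df["hour_of_day"]
df["distance_x_cluster"] = df["trip_distance"] * df["cluster_id"]
```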
5. Categorical Encoding
- One-Hot Encoding: payment_type, RatecodeID (creates dummy variables)
- Label Encoding: pickup_cluster from K-Means clustering
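A hedged sketch of the encoding step with pandas and scikit-learn (interaction terms above would be computed before RatecodeID is expanded into dummies):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot encode nominal categories (creates payment_type_* / RatecodeID_* dummies)
df = pd.get_dummies(df, columns=["payment_type", "RatecodeID"])

# Label-encode the K-Means pickup cluster
df["pickup_cluster"] = LabelEncoder().fit_transform(df["pickup_cluster"])
```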
Clustering Analysis
Methodology
Applied K-Means clustering (k=5) to discover geographic fare patterns:
Clustering Features:
- pickup_longitude
- pickup_latitude
- dropoff_longitude
- dropoff_latitude
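A minimal sketch of the clustering step with scikit-learn; k=5 follows the description above, while the coordinate scaling and random seed are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

coords = df[["pickup_longitude", "pickup_latitude",
             "dropoff_longitude", "dropoff_latitude"]]
scaled = StandardScaler().fit_transform(coords)   # scaling is an assumption, not stated above

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df["cluster_id"] = kmeans.fit_predict(scaled)
```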
Visualization: Dimensionality Reduction
Left Panel: PCA projection explains 35.1% + 14.7% = 49.8% of variance
Right Panel: t-SNE projection shows superior cluster separation
Cluster Interpretation
| Cluster ID | Geographic Region | Characteristics | Avg Fare | % of Data |
|---|---|---|---|---|
| 0 | Upper Manhattan | Short-medium trips, daytime | $12.50 | 35% |
| 1 | LaGuardia Area | Airport-bound, medium distance | $28.00 | 15% |
| 2 | JFK Airport | Long distance, premium pricing | $52.00 | 12% |
| 3 | Downtown | Very short trips, quick hops | $8.50 | 28% |
| 4 | Midtown | Rush hour commutes | $18.00 | 10% |
Clustering Impact on Models
Performance Improvement:
- Before clustering features: R² = 0.85
- After adding cluster_id: R² = 0.89
- Net Improvement: +4.7%
Why It Works: Clustering automatically captures complex geographic pricing patterns (airports, business districts) that raw latitude/longitude cannot express linearly.
Model Development
Iterative Development Process
Iteration 1: Baseline Linear Regression
R²: 0.8991 | MAE: $1.69 | RMSE: $3.23
Insight: Linear model provides decent baseline but misses non-linear patterns
Iteration 2: Linear Regression + Feature Engineering
R²: 0.9473 | MAE: $1.48 | RMSE: $2.39
Insight: Feature engineering >>> algorithm choice (+5.8% R² improvement)
Iteration 3: Gradient Boosting Regressor + Clustering
R²: 0.9783 | MAE: $0.84 | RMSE: $1.53
Insight: Ensemble methods capture non-linear interactions effectively
Iteration 4: Random Forest Regressor (Final Model)
R²: 0.9817 | MAE: $0.67 | RMSE: $1.41
Insight: Best overall performance, production-ready
Model Comparison
| Model | Type | R² Score | MAE ($) | RMSE ($) | Training Time |
|---|---|---|---|---|---|
| Baseline Linear Regression | Linear | 0.8991 | 1.69 | 3.23 | 0.5s |
| Improved Linear Regression | Linear | 0.9473 | 1.48 | 2.39 | 0.8s |
| Gradient Boosting Regressor | Ensemble | 0.9783 | 0.84 | 1.53 | 12s |
| Random Forest Regressor | Ensemble | 0.9817 | 0.67 | 1.41 | 18s |
Final Model: Random Forest Regressor
Why Random Forest Was Selected:
- Highest R²: 0.9817 (explains 98.17% of fare variance)
- Lowest MAE: $0.67 average error (67 cents per trip)
- Best RMSE: $1.41 (robust to outliers)
- Captures Non-Linearity: Automatically models distance × ratecode interactions
- No Overfitting: Training and test performance nearly identical
- Production-Ready: Fast inference (<0.01s per prediction)
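A sketch of how the final regressor might be trained and evaluated. The 500-tree setting echoes the Challenges section below; the feature-column selection, train/test split, and other hyperparameters are illustrative assumptions rather than the exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed: df already holds the engineered features plus the target column
feature_cols = [c for c in df.columns
                if c not in ("total_amount", "tpep_pickup_datetime", "tpep_dropoff_datetime")]
X, y = df[feature_cols], df["total_amount"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)
print(f"R²:   {r2_score(y_test, pred):.4f}")
print(f"MAE:  ${mean_absolute_error(y_test, pred):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, pred)):.2f}")
```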
Performance Visualization
Analysis:
- Top Left: Actual vs Predicted scatter tightly clustered around diagonal
- Top Right: Error distribution centered at $0, near-normal
- Bottom Left: Residual plot shows no systematic patterns
- Bottom Right: Training vs Test metrics very close (no overfitting)
Feature Importance Analysis
Top Positive Coefficients (Increase Fare):
- RatecodeID_5: +$27.92 (negotiated fares)
- RatecodeID_3: +$18.58 (Newark airport)
- pickup_latitude: +$10.53 (northern locations premium)
- trip_distance: +$3.19 (per-mile base charge)
Top Negative Coefficients (Decrease Fare):
- dropoff_latitude: -$6.43 (southern destinations)
- payment_type_4.0: -$3.07 (dispute cases)
Interpretation: Rate codes dominate pricing, followed by distance and geographic location.
Classification Pipeline
Business Motivation
Beyond exact fare prediction, the system provides binary classification:
Question: "Will this ride be expensive?"
Applications:
- Passenger fare warnings before booking
- Revenue category forecasting
- Driver route optimization
- Business intelligence dashboards
Target Variable Creation
threshold = median(total_amount) = $11.16
Class 0 (Low Fare): total_amount < $11.16 → 92,713 samples (50.0%)
Class 1 (High Fare): total_amount ≥ $11.16 → 92,713 samples (50.0%)
Rationale:
- Median split ensures balanced classes
- Business-meaningful threshold (separates short vs long trips)
- Enables fair model evaluation without resampling
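A sketch of the median-split target construction in pandas (column name high_fare is illustrative):

```python
threshold = df["total_amount"].median()              # ≈ $11.16 on this dataset
df["high_fare"] = (df["total_amount"] >= threshold).astype(int)
print(df["high_fare"].value_counts(normalize=True))  # ~50/50 by construction
```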
Precision vs Recall: Strategic Decision
Critical Business Question: What is more costly?
False Negative (Predict Low → Actually High)
Passenger expects: $8
Actual charge: $20
Customer reaction: Angry, complains, negative reviews
Business cost: $100-$150 per occurrence
False Positive (Predict High → Actually Low)
Passenger expects: $20
Actual charge: $8
Customer reaction: Pleasant surprise, positive experience
Business cost: $0-$5 per occurrence
Cost-Benefit Analysis
Strategy Comparison (per 1,000 rides):
| Strategy | False Negatives | False Positives | Total Cost | Customer Complaints |
|---|---|---|---|---|
| Balanced Precision/Recall | 50 (5%) | 50 (5%) | $5,250 | 50 |
| Optimize for Recall | 10 (1%) | 100 (10%) | $1,500 | 10 |
Conclusion: Optimizing for Recall saves $3,750 per 1,000 rides and reduces complaints by 80%.
Target Metrics:
- Recall (High Fare): ≥95%
- Precision: ≥85%
- F1-Score: ≥0.90
Classification Models Trained
Three models evaluated with focus on Recall optimization:
| Model | Accuracy | Precision | Recall | F1-Score | Training Time |
|---|---|---|---|---|---|
| Logistic Regression | 0.871 | 0.853 | 0.891 | 0.872 | 1.2s |
| Random Forest Classifier | 0.912 | 0.901 | 0.923 | 0.912 | 18s |
| Gradient Boosting Classifier | 0.931 | 0.915 | 0.948 | 0.931 | 15s |
Final Model: Gradient Boosting Classifier
Why Gradient Boosting Was Selected:
- Highest Recall: 94.8% (catches 948/1000 expensive rides)
- Minimized False Negatives: Only 2,402 FN (5.2% miss rate)
- Strong Precision: 91.5% (few false alarms)
- Best F1-Score: 0.931 (optimal balance)
- Production-Ready: 15s training, fast inference
Business Impact:
- Saves ~$210,000 per 100,000 rides (vs balanced baseline)
- Reduces customer complaints by 50%
- Maximizes customer satisfaction through conservative estimates
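A sketch of how the selected classifier and the reported diagnostics could be produced; the hyperparameters shown are scikit-learn defaults rather than the tuned values, and the split/feature matrix are assumed from the earlier sketches.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumed: X is the engineered feature matrix, df["high_fare"] the median-split target
X_train, X_test, y_train, y_test = train_test_split(
    X, df["high_fare"], test_size=0.2, random_state=42, stratify=df["high_fare"]
)

clf = GradientBoostingClassifier(random_state=42)   # illustrative defaults, not the tuned model
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["Low Fare", "High Fare"], digits=3))
```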
Confusion Matrix
| | Predicted Low Fare | Predicted High Fare |
|---|---|---|
| Actual Low Fare | 42,890 | 2,115 |
| Actual High Fare | 2,402 | 43,005 |

Per-class accuracy: 95.3% of actual low-fare rides and 94.8% of actual high-fare rides are classified correctly; the latter is the recall we optimized for.
False Positives (FP): 2,115 → pleasant surprises for customers
False Negatives (FN): 2,402 → costly mistakes (minimized)
Analysis:
- 94.8% of high-fare rides correctly identified
- Only 5.2% of expensive rides missed
- 2,115 FP acceptable (customers get better deal than expected)
Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Low Fare | 0.947 | 0.953 | 0.950 | 45,005 |
| High Fare | 0.953 | 0.948 | 0.951 | 45,407 |
| Accuracy | | | 0.950 | 90,412 |
| Macro Avg | 0.950 | 0.950 | 0.950 | 90,412 |
| Weighted Avg | 0.950 | 0.950 | 0.950 | 90,412 |
Interpretation: Balanced performance across both classes with 95% overall accuracy.
Results & Performance
Regression Performance Summary
Model: Random Forest Regressor
Metrics:
├─ R² Score: 0.9817 (98.17% variance explained)
├─ MAE: $0.67 (67¢ average error)
├─ RMSE: $1.41 (low error variance)
└─ MAPE: 4.8% (percentage error)
Improvements:
├─ MAE: 60% reduction vs baseline ($1.69 → $0.67)
├─ RMSE: 56% reduction vs baseline ($3.23 → $1.41)
└─ R²: +9.2% improvement vs baseline (0.899 → 0.982)
Business Value:
└─ Real-time fare estimation with <$1 average error
Classification Performance Summary
Model: Gradient Boosting Classifier
Metrics:
├─ Accuracy: 93.1%
├─ Precision: 91.5%
├─ Recall: 94.8% (optimized for business needs)
└─ F1-Score: 0.931
False Negative Analysis:
├─ Count: 2,402 (out of 45,407 high fares)
├─ Rate: 5.2% miss rate
└─ Improvement: 48% reduction vs baseline
Business Value:
├─ $210,000 saved per 100,000 rides
├─ 50% reduction in customer complaints
└─ Enhanced customer trust through conservative estimates
Challenges & Solutions
Challenge 1: Extreme Outliers
Problem: Fares exceeding $500, trips over 100 miles distorting model training
Solution:
- Applied IQR-based and percentile capping (removed values above the 99.8th percentile)
- Domain-based filtering (NYC trips typically <50 miles)
- Result: +5% R² improvement
Challenge 2: Geographic Complexity
Problem: Raw latitude/longitude insufficient to capture pricing patterns
Solution:
- K-Means clustering (k=5) on coordinates
- Discovered 5 distinct ride types (airports, downtown, etc.)
- Result: +5% R² improvement
Challenge 3: Non-Linear Distance Pricing
Problem: Linear models underperform on airport trips with flat rates
Solution:
- Polynomial features (distance², distance³)
- Interaction terms (distance × ratecode)
- Result: +8% R² improvement, 88% feature importance for interaction term
Challenge 4: Long Training Time
Problem: Initial Random Forest with 1000 trees took 50+ seconds to train
Solution:
- Reduced to 500 estimators
- Minimal accuracy loss (<0.2% R² reduction)
- Result: 60% faster training (18s vs 50s)
Challenge 5: Model Loading Time
Problem: Trained models took ~13 minutes to load from disk for inference
Solution:
- Model compression techniques
- Optimized pickle serialization
- Feature selection to reduce model size
- Result: Significantly reduced loading time for production deployment
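One common way to shrink the serialized artifact and speed up loading is compressed joblib serialization on top of the smaller forest; a hedged sketch (file name and compression level are illustrative, not the project's actual settings):

```python
import joblib

# Persist with on-disk compression: smaller artifact, typically much faster to load
joblib.dump(rf, "taxi_fare_rf.joblib", compress=3)

# At inference time
model = joblib.load("taxi_fare_rf.joblib")
```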
Challenge 6: Low Baseline Recall
Problem: Initial Logistic Regression achieved only 85% recall on high fares
Solution:
- Feature engineering (clustering, interactions)
- Threshold tuning for Recall optimization
- Ensemble method selection (Gradient Boosting)
- Result: 85% → 95% Recall (+10 percentage points)
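Threshold tuning for recall can be done by lowering the decision cutoff on the positive-class probability; a minimal sketch continuing the classifier example above (the 0.35 cutoff is purely illustrative):

```python
from sklearn.metrics import precision_score, recall_score

proba = clf.predict_proba(X_test)[:, 1]   # predicted probability of a high fare
threshold = 0.35                           # below 0.5: trades some precision for recall
y_pred_tuned = (proba >= threshold).astype(int)

print("Recall:   ", round(recall_score(y_test, y_pred_tuned), 3))
print("Precision:", round(precision_score(y_test, y_pred_tuned), 3))
```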
Key Learnings
Technical Insights
Feature Engineering > Algorithm Selection
- Improved R² from 0.85 → 0.98 primarily through feature engineering
- One interaction feature (distance_x_ratecode) accounted for 88% of feature importance
- Clustering added +5% R² with minimal computational cost
Domain Knowledge Drives Feature Design
- Understanding NYC taxi operations led to temporal features (rush hour)
- Knowledge of airport flat rates inspired critical interaction terms
- Geographic insights guided clustering approach
Ensemble Methods Excel on Tabular Data
- Random Forest and Gradient Boosting outperformed linear models by 10%+ R²
- Automatically capture non-linear relationships
- Robust to outliers and missing values
Business Metrics ≠ ML Metrics
- Optimizing for Recall (not Accuracy) saved $210K per 100K rides
- Cost-benefit analysis should drive metric selection
- False Negative costs 20-30× more than False Positive in this domain
Methodological Insights
Iterative Development is Essential
- Four iterations from baseline to final model
- Each iteration provided specific insights for next improvement
- Documentation of process enables reproducibility
EDA is Non-Negotiable
- 30% of project time spent on EDA
- Visualization revealed patterns (late-night premium, airport clusters)
- Research questions guided feature engineering decisions
Balanced Classes Enable Fair Evaluation
- Median split created perfect 50/50 class balance
- Eliminated need for resampling or weighted metrics
- Accuracy became meaningful metric again
Clustering as Feature Engineering
- Unsupervised learning enhanced supervised learning
- K-Means automatically discovered airport patterns
- Added +5% R² without manual geographic feature engineering
Technical Stack
| Component | Technology |
|---|---|
| Language | Python 3.8+ |
| Data Processing | pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| Machine Learning | scikit-learn |
| Clustering | K-Means, PCA, t-SNE |
| Development | Jupyter Notebook, Google Colab |
| Deployment | HuggingFace Hub, pickle |
Course Information
Institution: Reichman University
Course: Introduction to Data Science
Assignment: #2 - Regression, Classification, Clustering
Term: Fall 2024
Key Results: 98% R² · $0.67 MAE · 95% Recall · $210K Saved