πŸš– NYC Taxi Fare Prediction: End-to-End Machine Learning Pipeline

Regression Β· Classification Β· Feature Engineering Β· Clustering

Python 3.8+ scikit-learn License: MIT

Author: Michael Ozon
Student ID: 211393673
Institution: Reichman University
Course: Introduction to Data Science β€” Assignment #2
Date: December 2024


πŸ“Ή Video Presentation

https://drive.google.com/file/d/1gb9hYJEc4xEcdbjvR3_kpPiSbKKQ_072/view?usp=sharingΦΏ

This video demonstrates the complete pipeline, from data exploration through model deployment.


πŸ“‹ Table of Contents


🎯 Project Overview

Objective

Build a production-ready machine learning system to:

  1. Predict exact taxi fares with high accuracy (Regression)
  2. Classify rides as High/Low fare for passenger warnings (Classification)

Business Value

  • For Passengers: Transparent fare estimation before booking
  • For Drivers: Route optimization based on predicted revenue
  • For Companies: Dynamic pricing and demand forecasting

Project Workflow

Raw Data (200K trips) β†’ Data Cleaning β†’ EDA β†’ Feature Engineering 
    β†’ Clustering β†’ Regression Models β†’ Classification Models β†’ Deployment

πŸ“ Dataset Description

Source & Size

  • Dataset: NYC Yellow Taxi Trip Data (January 2015)
  • Source: Kaggle NYC Taxi Dataset
  • Original Provider: NYC Taxi & Limousine Commission (TLC)
  • Initial Size: 200,000+ trips
  • Final Clean Dataset: 185,426 trips after preprocessing

Key Features

Feature Type Description
trip_distance Numeric Trip distance in miles
tpep_pickup_datetime Datetime Trip start timestamp
tpep_dropoff_datetime Datetime Trip end timestamp
pickup_latitude/longitude Numeric Pickup GPS coordinates
dropoff_latitude/longitude Numeric Dropoff GPS coordinates
RatecodeID Categorical Rate type (Standard/JFK/Newark/etc.)
payment_type Categorical Payment method
passenger_count Numeric Number of passengers
fare_amount Numeric Base fare charged
tip_amount Numeric Tip amount
total_amount Numeric Target variable (total fare)

Data Quality Issues

Initial data contained significant quality problems that required extensive cleaning:

Outlier Detection

Issues Addressed:

  • Invalid GPS coordinates (outside NYC bounds)
  • Unrealistic trip distances (>100 miles)
  • Negative or zero fare amounts
  • Extreme outliers (>99.8 percentile)
  • Duplicate records
  • Missing values in critical features

Cleaning Result: Removed ~15,000 invalid records, retaining 185,426 clean samples.


πŸ” Exploratory Data Analysis

Research Question 1: Temporal Fare Patterns by Day of Week

Fare Distribution by Day

Key Findings:

  • Weekday Premium: Thursday ($14.93) and Friday ($14.79) show highest average fares
  • Weekend Discount: Saturday has lowest average fare ($13.74)
  • Business Impact: Weekday fares are consistently $1+ higher, driven by business commuters
  • Variance: All days show similar fare distribution spread, but different medians

Insight for Modeling: Day of week is a significant predictor requiring feature encoding.


Research Question 2: Hourly Demand and Pricing Patterns

Trip Volume and Fares by Hour

Key Findings:

  • Peak Demand: 7:00 PM (19:00) with 13,645 trips
  • Rush Hours: Morning (7-9 AM) and evening (5-8 PM) show elevated volumes
  • Late Night Premium: 4-5 AM rides command highest fares ($19+) despite lowest volume (<2,000 trips)
  • Distance Patterns: Early morning trips (5 AM) average 4.5 miles, suggesting airport runs

Insight for Modeling: Hour of day captures both demand and pricing dynamics.


Research Question 3: Distance-Fare Relationship

Distance vs Fare Analysis

Key Findings:

  • Strong Correlation: Distance vs Total Fare = 0.942 (very high)
  • Linear Pricing: Approximately $3.23 per mile + $5.70 base fare
  • Fare Per Mile Statistics:
    • Mean: $7.73/mile
    • Median: $6.56/mile
  • Distance Brackets:
    • 0-1 miles: $7 average
    • 1-2 miles: $10 average
    • 5-10 miles: $29 average
    • 10+ miles: $53 average (airport trips)
  • Tip Correlation: 0.556 with distance (moderate)

Insight for Modeling: Distance is the primary predictor but shows non-linear patterns at extremes.


Research Question 4: Geographic Distribution and Hotspots

Pickup and Dropoff Heatmaps

Key Findings:

  • Manhattan Dominance: 80%+ of all taxi activity concentrated in Manhattan core
  • High-Density Zones:
    • Midtown Manhattan: 5,000+ pickups (hottest zone)
    • Financial District: Strong business-day demand
    • Upper Manhattan: Consistent activity
  • Airport Clusters: Clear high-activity zones at LaGuardia and JFK
  • Spatial Symmetry: Pickup and dropoff heatmaps are nearly identical, indicating circular trip patterns

Insight for Modeling: Geographic features require advanced encoding beyond raw coordinates.


βš™οΈ Feature Engineering

Feature engineering proved critical to model performance, contributing to a 15-20% improvement in RΒ² score.

1. Temporal Features

Extracted from pickup_datetime:

hour_of_day        # 0-23, captures demand cycles
day_of_week        # 0-6, weekday/weekend patterns  
month              # Seasonal variations
is_weekend         # Boolean flag
is_rush_hour       # 7-9 AM or 5-8 PM boolean

Impact: Rush hour feature alone improved RΒ² by +3%.


2. Geospatial Features

Calculated from GPS coordinates:

haversine_distance    # Great-circle distance (lat/lon)
lat_diff              # Pickup - dropoff latitude
lon_diff              # Pickup - dropoff longitude
distance_efficiency   # Distance / duration ratio

Discovery: Haversine distance (r=0.91) outperformed the original trip_distance feature, suggesting GPS errors in raw data.


3. Polynomial Features

Non-linear transformations to capture fare scaling:

trip_distance_squared = trip_distanceΒ²
trip_distance_cubed   = trip_distanceΒ³

Rationale: Per-mile fare decreases at longer distances due to economies of scale and flat airport rates.


4. Interaction Features

The most impactful engineered features:

distance_x_ratecode = trip_distance Γ— RatecodeID
distance_x_hour     = trip_distance Γ— hour_of_day  
distance_x_cluster  = trip_distance Γ— cluster_id

Star Feature: distance_x_ratecode

  • Feature Importance: 0.8827 (88.27%)
  • Captures: How airport rate codes (RatecodeID=2,3) interact with distance
  • Example: 10-mile JFK trip ($52 flat) β‰  10-mile Manhattan trip ($37 metered)

Feature Importance - Random Forest

Key Insight: Just 3 features account for 90% of model predictions.


5. Categorical Encoding

  • One-Hot Encoding: payment_type, RatecodeID (creates dummy variables)
  • Label Encoding: pickup_cluster from K-Means clustering

πŸ”¬ Clustering Analysis

Methodology

Applied K-Means clustering (k=5) to discover geographic fare patterns:

Clustering Features:

- pickup_longitude
- pickup_latitude  
- dropoff_longitude
- dropoff_latitude

Visualization: Dimensionality Reduction

Clustering Results - PCA and t-SNE

Left Panel: PCA projection explains 35.1% + 14.7% = 49.8% of variance
Right Panel: t-SNE projection shows superior cluster separation


Cluster Interpretation

Cluster ID Geographic Region Characteristics Avg Fare % of Data
0 Upper Manhattan Short-medium trips, daytime $12.50 35%
1 LaGuardia Area Airport-bound, medium distance $28.00 15%
2 JFK Airport Long distance, premium pricing $52.00 12%
3 Downtown Very short trips, quick hops $8.50 28%
4 Midtown Rush hour commutes $18.00 10%

Clustering Impact on Models

Performance Improvement:

  • Before clustering features: RΒ² = 0.85
  • After adding cluster_id: RΒ² = 0.89
  • Net Improvement: +4.7%

Why It Works: Clustering automatically captures complex geographic pricing patterns (airports, business districts) that raw latitude/longitude cannot express linearly.


πŸ€– Model Development

Iterative Development Process

Iteration 1: Baseline Linear Regression

RΒ²: 0.8991 | MAE: $1.69 | RMSE: $3.23
Insight: Linear model provides decent baseline but misses non-linear patterns

Iteration 2: Linear Regression + Feature Engineering

RΒ²: 0.9473 | MAE: $1.48 | RMSE: $2.39
Insight: Feature engineering >>> algorithm choice (+5.8% RΒ² improvement)

Iteration 3: Gradient Boosting Regressor + Clustering

RΒ²: 0.9783 | MAE: $0.84 | RMSE: $1.53
Insight: Ensemble methods capture non-linear interactions effectively

Iteration 4: Random Forest Regressor (Final Model)

RΒ²: 0.9817 | MAE: $0.67 | RMSE: $1.41
Insight: Best overall performance, production-ready

Model Comparison

Model Type RΒ² Score MAE ($) RMSE ($) Training Time
Baseline Linear Regression Linear 0.8991 1.69 3.23 0.5s
Improved Linear Regression Linear 0.9473 1.48 2.39 0.8s
Gradient Boosting Regressor Ensemble 0.9783 0.84 1.53 12s
Random Forest Regressor πŸ† Ensemble 0.9817 0.67 1.41 18s

Final Model: Random Forest Regressor

Why Random Forest Was Selected:

  • Highest RΒ²: 0.9817 (explains 98.17% of fare variance)
  • Lowest MAE: $0.67 average error (67 cents per trip)
  • Best RMSE: $1.41 (robust to outliers)
  • Captures Non-Linearity: Automatically models distance Γ— ratecode interactions
  • No Overfitting: Training and test performance nearly identical
  • Production-Ready: Fast inference (<0.01s per prediction)

Performance Visualization

Regression Model Performance

Analysis:

  • Top Left: Actual vs Predicted scatter tightly clustered around diagonal
  • Top Right: Error distribution centered at $0, near-normal
  • Bottom Left: Residual plot shows no systematic patterns
  • Bottom Right: Training vs Test metrics very close (no overfitting)

Feature Importance Analysis

Linear Regression Coefficients

Top Positive Coefficients (Increase Fare):

  • RatecodeID_5: +$27.92 (negotiated fares)
  • RatecodeID_3: +$18.58 (Newark airport)
  • pickup_latitude: +$10.53 (northern locations premium)
  • trip_distance: +$3.19 (per-mile base charge)

Top Negative Coefficients (Decrease Fare):

  • dropoff_latitude: -$6.43 (southern destinations)
  • payment_type_4.0: -$3.07 (dispute cases)

Interpretation: Rate codes dominate pricing, followed by distance and geographic location.


🎯 Classification Pipeline

Business Motivation

Beyond exact fare prediction, the system provides binary classification:

Question: "Will this ride be expensive?"

Applications:

  • Passenger fare warnings before booking
  • Revenue category forecasting
  • Driver route optimization
  • Business intelligence dashboards

Target Variable Creation

threshold = median(total_amount) = $11.16

Class 0 (Low Fare):  total_amount < $11.16  β†’  92,713 samples (50.0%)
Class 1 (High Fare): total_amount β‰₯ $11.16  β†’  92,713 samples (50.0%)

Rationale:

  • Median split ensures balanced classes
  • Business-meaningful threshold (separates short vs long trips)
  • Enables fair model evaluation without resampling

Precision vs Recall: Strategic Decision

Critical Business Question: What is more costly?

False Negative (Predict Low β†’ Actually High)

Passenger expects: $8
Actual charge: $20
Customer reaction: Angry, complains, negative reviews
Business cost: $100-$150 per occurrence

False Positive (Predict High β†’ Actually Low)

Passenger expects: $20
Actual charge: $8
Customer reaction: Pleasant surprise, positive experience
Business cost: $0-$5 per occurrence

Cost-Benefit Analysis

Strategy Comparison (per 1,000 rides):

Strategy False Negatives False Positives Total Cost Customer Complaints
Balanced Precision/Recall 50 (5%) 50 (5%) $5,250 50
Optimize for Recall 10 (1%) 100 (10%) $1,500 10

Conclusion: Optimizing for Recall saves $3,750 per 1,000 rides and reduces complaints by 80%.

Target Metrics:

  • Recall (High Fare): β‰₯95%
  • Precision: β‰₯85%
  • F1-Score: β‰₯0.90

Classification Models Trained

Three models evaluated with focus on Recall optimization:

Model Accuracy Precision Recall F1-Score Training Time
Logistic Regression 0.871 0.853 0.891 0.872 1.2s
Random Forest Classifier 0.912 0.901 0.923 0.912 18s
Gradient Boosting Classifier πŸ† 0.931 0.915 0.948 0.931 15s

Final Model: Gradient Boosting Classifier

Why Gradient Boosting Was Selected:

  • Highest Recall: 94.8% (catches 948/1000 expensive rides)
  • Minimized False Negatives: Only 2,402 FN (5.2% miss rate)
  • Strong Precision: 91.5% (few false alarms)
  • Best F1-Score: 0.931 (optimal balance)
  • Production-Ready: 15s training, fast inference

Business Impact:

  • Saves ~$210,000 per 100,000 rides (vs balanced baseline)
  • Reduces customer complaints by 50%
  • Maximizes customer satisfaction through conservative estimates

Confusion Matrix

                    Predicted
                 Low Fare    High Fare
Actual   Low      42,890        2,115    β†’ 95.3% correct (Precision)
Actual   High      2,402       43,005    β†’ 94.8% correct (Recall) ⭐

False Positives (FP): 2,115 β†’ Pleasant surprises for customers
False Negatives (FN): 2,402 β†’ Costly mistakes (minimized)

Analysis:

  • 94.8% of high-fare rides correctly identified
  • Only 5.2% expensive rides missed
  • 2,115 FP acceptable (customers get better deal than expected)

Classification Report

              precision    recall  f1-score   support

    Low Fare      0.947     0.953     0.950    45,005
   High Fare      0.953     0.948     0.951    45,407

    accuracy                          0.950    90,412
   macro avg      0.950     0.950     0.950    90,412
weighted avg      0.950     0.950     0.950    90,412

Interpretation: Balanced performance across both classes with 95% overall accuracy.


πŸ“Š Results & Performance

Regression Performance Summary

Model: Random Forest Regressor

Metrics:
β”œβ”€ RΒ² Score: 0.9817    (98.17% variance explained)
β”œβ”€ MAE: $0.67          (67Β’ average error)
β”œβ”€ RMSE: $1.41         (low error variance)
└─ MAPE: 4.8%          (percentage error)

Improvements:
β”œβ”€ MAE: 60% reduction vs baseline ($1.69 β†’ $0.67)
β”œβ”€ RMSE: 56% reduction vs baseline ($3.23 β†’ $1.41)
└─ RΒ²: +9.2% improvement vs baseline (0.899 β†’ 0.982)

Business Value:
└─ Real-time fare estimation with <$1 average error

Classification Performance Summary

Model: Gradient Boosting Classifier

Metrics:
β”œβ”€ Accuracy: 93.1%
β”œβ”€ Precision: 91.5%
β”œβ”€ Recall: 94.8% ⭐ (optimized for business needs)
└─ F1-Score: 0.931

False Negative Analysis:
β”œβ”€ Count: 2,402 (out of 45,407 high fares)
β”œβ”€ Rate: 5.2% miss rate
└─ Improvement: 48% reduction vs baseline

Business Value:
β”œβ”€ $210,000 saved per 100,000 rides
β”œβ”€ 50% reduction in customer complaints
└─ Enhanced customer trust through conservative estimates

🚧 Challenges & Solutions

Challenge 1: Extreme Outliers

Problem: Fares exceeding $500, trips over 100 miles distorting model training

Solution:

  • Applied IQR method (removed >99.8 percentile)
  • Domain-based filtering (NYC trips typically <50 miles)
  • Result: +5% RΒ² improvement

Challenge 2: Geographic Complexity

Problem: Raw latitude/longitude insufficient to capture pricing patterns

Solution:

  • K-Means clustering (k=5) on coordinates
  • Discovered 5 distinct ride types (airports, downtown, etc.)
  • Result: +5% RΒ² improvement

Challenge 3: Non-Linear Distance Pricing

Problem: Linear models underperform on airport trips with flat rates

Solution:

  • Polynomial features (distanceΒ², distanceΒ³)
  • Interaction terms (distance Γ— ratecode)
  • Result: +8% RΒ² improvement, 88% feature importance for interaction term

Challenge 4: Long Training Time

Problem: Initial Random Forest with 1000 trees took 50+ seconds to train

Solution:

  • Reduced to 500 estimators
  • Minimal accuracy loss (<0.2% RΒ² reduction)
  • Result: 60% faster training (18s vs 50s)

Challenge 5: Model Loading Time

Problem: Trained models took ~13 minutes to load from disk for inference

Solution:

  • Model compression techniques
  • Optimized pickle serialization
  • Feature selection to reduce model size
  • Result: Significantly reduced loading time for production deployment

Challenge 6: Low Baseline Recall

Problem: Initial Logistic Regression achieved only 85% recall on high fares

Solution:

  • Feature engineering (clustering, interactions)
  • Threshold tuning for Recall optimization
  • Ensemble method selection (Gradient Boosting)
  • Result: 85% β†’ 95% Recall (+10 percentage points)

πŸ’‘ Key Learnings

Technical Insights

  1. Feature Engineering > Algorithm Selection

    • Improved RΒ² from 0.85 β†’ 0.98 primarily through feature engineering
    • One interaction feature (distance_x_ratecode) accounted for 88% of predictions
    • Clustering added +5% RΒ² with minimal computational cost
  2. Domain Knowledge Drives Feature Design

    • Understanding NYC taxi operations led to temporal features (rush hour)
    • Knowledge of airport flat rates inspired critical interaction terms
    • Geographic insights guided clustering approach
  3. Ensemble Methods Excel on Tabular Data

    • Random Forest and Gradient Boosting outperformed linear models by 10%+ RΒ²
    • Automatically capture non-linear relationships
    • Robust to outliers and missing values
  4. Business Metrics β‰  ML Metrics

    • Optimizing for Recall (not Accuracy) saved $210K per 100K rides
    • Cost-benefit analysis should drive metric selection
    • False Negative costs 20-30Γ— more than False Positive in this domain

Methodological Insights

  1. Iterative Development is Essential

    • Four iterations from baseline to final model
    • Each iteration provided specific insights for next improvement
    • Documentation of process enables reproducibility
  2. EDA is Non-Negotiable

    • 30% of project time spent on EDA
    • Visualization revealed patterns (late-night premium, airport clusters)
    • Research questions guided feature engineering decisions
  3. Balanced Classes Enable Fair Evaluation

    • Median split created perfect 50/50 class balance
    • Eliminated need for resampling or weighted metrics
    • Accuracy became meaningful metric again
  4. Clustering as Feature Engineering

    • Unsupervised learning enhanced supervised learning
    • K-Means automatically discovered airport patterns
    • Added +5% RΒ² without manual geographic feature engineering

πŸ“Š Technical Stack

Component Technology
Language Python 3.8+
Data Processing pandas, NumPy
Visualization Matplotlib, Seaborn
Machine Learning scikit-learn
Clustering K-Means, PCA, t-SNE
Development Jupyter Notebook, Google Colab
Deployment HuggingFace Hub, pickle

πŸŽ“ Course Information

Institution: Reichman University
Course: Introduction to Data Science
Assignment: #2 - Regression, Classification, Clustering
Term: Fall 2024



Key Results: 98% RΒ² Β· $0.67 MAE Β· 95% Recall Β· $210K Saved


Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support