NYC Taxi Fare Prediction: End-to-End Machine Learning Pipeline
Regression · Classification · Feature Engineering · Clustering
Author: Michael Ozon
Student ID: 211393673
Institution: Reichman University
Course: Introduction to Data Science – Assignment #2
Date: December 2024
Video Presentation
https://drive.google.com/file/d/1gb9hYJEc4xEcdbjvR3_kpPiSbKKQ_072/view?usp=sharing
This video demonstrates the complete pipeline, from data exploration through model deployment.
Table of Contents
- Project Overview
- Dataset Description
- Exploratory Data Analysis
- Feature Engineering
- Clustering Analysis
- Model Development
- Classification Pipeline
- Results & Performance
- Challenges & Solutions
- Key Learnings
- Repository Contents
Project Overview
Objective
Build a production-ready machine learning system to:
- Predict exact taxi fares with high accuracy (Regression)
- Classify rides as High/Low fare for passenger warnings (Classification)
Business Value
- For Passengers: Transparent fare estimation before booking
- For Drivers: Route optimization based on predicted revenue
- For Companies: Dynamic pricing and demand forecasting
Project Workflow
Raw Data (200K trips) → Data Cleaning → EDA → Feature Engineering
→ Clustering → Regression Models → Classification Models → Deployment
Dataset Description
Source & Size
- Dataset: NYC Yellow Taxi Trip Data (January 2015)
- Source: Kaggle NYC Taxi Dataset
- Original Provider: NYC Taxi & Limousine Commission (TLC)
- Initial Size: 200,000+ trips
- Final Clean Dataset: 185,426 trips after preprocessing
Key Features
| Feature | Type | Description |
|---|---|---|
| trip_distance | Numeric | Trip distance in miles |
| tpep_pickup_datetime | Datetime | Trip start timestamp |
| tpep_dropoff_datetime | Datetime | Trip end timestamp |
| pickup_latitude/longitude | Numeric | Pickup GPS coordinates |
| dropoff_latitude/longitude | Numeric | Dropoff GPS coordinates |
| RatecodeID | Categorical | Rate type (Standard/JFK/Newark/etc.) |
| payment_type | Categorical | Payment method |
| passenger_count | Numeric | Number of passengers |
| fare_amount | Numeric | Base fare charged |
| tip_amount | Numeric | Tip amount |
| total_amount | Numeric | Target variable (total fare) |
Data Quality Issues
Initial data contained significant quality problems that required extensive cleaning:
Issues Addressed:
- Invalid GPS coordinates (outside NYC bounds)
- Unrealistic trip distances (>100 miles)
- Negative or zero fare amounts
- Extreme outliers (above the 99.8th percentile)
- Duplicate records
- Missing values in critical features
Cleaning Result: Removed ~15,000 invalid records, retaining 185,426 clean samples.
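A minimal sketch of the cleaning rules described above, using pandas. The column names follow the TLC schema in the feature table; the exact NYC bounding box and the quantile cutoff are illustrative assumptions, not the project's precise thresholds.

```python
import pandas as pd

def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the filtering rules listed above (illustrative thresholds)."""
    # Rough NYC bounding box for GPS sanity checks (approximate values)
    lat_min, lat_max = 40.5, 41.0
    lon_min, lon_max = -74.3, -73.6

    df = df.drop_duplicates().dropna(subset=["fare_amount", "trip_distance", "total_amount"])
    df = df[
        df["pickup_latitude"].between(lat_min, lat_max)
        & df["pickup_longitude"].between(lon_min, lon_max)
        & df["dropoff_latitude"].between(lat_min, lat_max)
        & df["dropoff_longitude"].between(lon_min, lon_max)
        & (df["trip_distance"] > 0) & (df["trip_distance"] <= 100)
        & (df["total_amount"] > 0)
    ]
    # Trim extreme outliers above the 99.8th percentile of the target
    df = df[df["total_amount"] <= df["total_amount"].quantile(0.998)]
    return df
```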
Exploratory Data Analysis
Research Question 1: Temporal Fare Patterns by Day of Week
Key Findings:
- Weekday Premium: Thursday ($14.93) and Friday ($14.79) show highest average fares
- Weekend Discount: Saturday has lowest average fare ($13.74)
- Business Impact: Weekday fares are consistently $1+ higher, driven by business commuters
- Variance: All days show similar fare distribution spread, but different medians
Insight for Modeling: Day of week is a significant predictor requiring feature encoding.
Research Question 2: Hourly Demand and Pricing Patterns
Key Findings:
- Peak Demand: 7:00 PM (19:00) with 13,645 trips
- Rush Hours: Morning (7-9 AM) and evening (5-8 PM) show elevated volumes
- Late Night Premium: 4-5 AM rides command highest fares ($19+) despite lowest volume (<2,000 trips)
- Distance Patterns: Early morning trips (5 AM) average 4.5 miles, suggesting airport runs
Insight for Modeling: Hour of day captures both demand and pricing dynamics.
Research Question 3: Distance-Fare Relationship
Key Findings:
- Strong Correlation: Distance vs Total Fare = 0.942 (very high)
- Linear Pricing: Approximately $3.23 per mile + $5.70 base fare
- Fare Per Mile Statistics:
- Mean: $7.73/mile
- Median: $6.56/mile
- Distance Brackets:
- 0-1 miles: $7 average
- 1-2 miles: $10 average
- 5-10 miles: $29 average
- 10+ miles: $53 average (airport trips)
- Tip Correlation: 0.556 with distance (moderate)
Insight for Modeling: Distance is the primary predictor but shows non-linear patterns at extremes.
Research Question 4: Geographic Distribution and Hotspots
Key Findings:
- Manhattan Dominance: 80%+ of all taxi activity concentrated in Manhattan core
- High-Density Zones:
- Midtown Manhattan: 5,000+ pickups (hottest zone)
- Financial District: Strong business-day demand
- Upper Manhattan: Consistent activity
- Airport Clusters: Clear high-activity zones at LaGuardia and JFK
- Spatial Symmetry: Pickup and dropoff heatmaps are nearly identical, indicating circular trip patterns
Insight for Modeling: Geographic features require advanced encoding beyond raw coordinates.
Feature Engineering
Feature engineering proved critical to model performance, contributing to a 15-20% improvement in R² score.
1. Temporal Features
Extracted from pickup_datetime:
hour_of_day # 0-23, captures demand cycles
day_of_week # 0-6, weekday/weekend patterns
month # Seasonal variations
is_weekend # Boolean flag
is_rush_hour # 7-9 AM or 5-8 PM boolean
Impact: Rush hour feature alone improved R² by +3%.
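A short sketch of how these temporal flags could be derived with pandas; the helper name and the exact rush-hour windows are assumptions that mirror the description above.

```python
import pandas as pd

def add_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    ts = pd.to_datetime(df["tpep_pickup_datetime"])
    df["hour_of_day"] = ts.dt.hour
    df["day_of_week"] = ts.dt.dayofweek              # 0 = Monday ... 6 = Sunday
    df["month"] = ts.dt.month
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
    df["is_rush_hour"] = (
        df["hour_of_day"].between(7, 9) | df["hour_of_day"].between(17, 20)
    ).astype(int)
    return df
```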
2. Geospatial Features
Calculated from GPS coordinates:
haversine_distance # Great-circle distance (lat/lon)
lat_diff # Pickup - dropoff latitude
lon_diff # Pickup - dropoff longitude
distance_efficiency # Distance / duration ratio
Discovery: Haversine distance (r=0.91) outperformed the original trip_distance feature, suggesting GPS errors in raw data.
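A vectorized haversine implementation along the lines described; the 3,959-mile Earth radius gives distances in miles. Function and column names are illustrative.

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 3959.0 * np.arcsin(np.sqrt(a))

# df["haversine_distance"] = haversine_miles(
#     df["pickup_latitude"], df["pickup_longitude"],
#     df["dropoff_latitude"], df["dropoff_longitude"],
# )
```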
3. Polynomial Features
Non-linear transformations to capture fare scaling:
trip_distance_squared = trip_distance²
trip_distance_cubed = trip_distance³
Rationale: Per-mile fare decreases at longer distances due to economies of scale and flat airport rates.
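These transforms are simple column operations in pandas; a minimal sketch:

```python
# Polynomial terms let models capture the flattening of per-mile cost on long trips
df["trip_distance_squared"] = df["trip_distance"] ** 2
df["trip_distance_cubed"] = df["trip_distance"] ** 3
```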
4. Interaction Features
The most impactful engineered features:
distance_x_ratecode = trip_distance × RatecodeID
distance_x_hour = trip_distance × hour_of_day
distance_x_cluster = trip_distance × cluster_id
Star Feature: distance_x_ratecode
- Feature Importance: 0.8827 (88.27%)
- Captures: How airport rate codes (RatecodeID=2,3) interact with distance
- Example: 10-mile JFK trip ($52 flat) vs. 10-mile Manhattan trip ($37 metered)
Key Insight: Just 3 features account for roughly 90% of the model's total feature importance.
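A sketch of how the interaction terms above might be constructed; it assumes RatecodeID is still numeric at this point and that cluster_id comes from the K-Means step described later.

```python
# Interaction terms: let the model see distance scaled by its pricing context
df["distance_x_ratecode"] = df["trip_distance"] * df["RatecodeID"]
df["distance_x_hour"] = df["trip_distance"] * df["hour_of_day"]
df["distance_x_cluster"] = df["trip_distance"] * df["cluster_id"]
```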
5. Categorical Encoding
- One-Hot Encoding: payment_type, RatecodeID (creates dummy variables)
- Label Encoding: pickup_cluster from K-Means clustering
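A hedged sketch of the encoding step with pandas and scikit-learn (interaction terms above would be computed before RatecodeID is expanded into dummies):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot encode nominal categories (creates payment_type_* / RatecodeID_* dummies)
df = pd.get_dummies(df, columns=["payment_type", "RatecodeID"])

# Label-encode the K-Means pickup cluster
df["pickup_cluster"] = LabelEncoder().fit_transform(df["pickup_cluster"])
```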
Clustering Analysis
Methodology
Applied K-Means clustering (k=5) to discover geographic fare patterns:
Clustering Features:
- pickup_longitude
- pickup_latitude
- dropoff_longitude
- dropoff_latitude
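A minimal sketch of the clustering step with scikit-learn; k=5 follows the description above, while the coordinate scaling and random seed are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

coords = df[["pickup_longitude", "pickup_latitude",
             "dropoff_longitude", "dropoff_latitude"]]
scaled = StandardScaler().fit_transform(coords)   # scaling is an assumption, not stated above

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df["cluster_id"] = kmeans.fit_predict(scaled)
```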
Visualization: Dimensionality Reduction
Left Panel: PCA projection explains 35.1% + 14.7% = 49.8% of variance
Right Panel: t-SNE projection shows superior cluster separation
Cluster Interpretation
| Cluster ID | Geographic Region | Characteristics | Avg Fare | % of Data |
|---|---|---|---|---|
| 0 | Upper Manhattan | Short-medium trips, daytime | $12.50 | 35% |
| 1 | LaGuardia Area | Airport-bound, medium distance | $28.00 | 15% |
| 2 | JFK Airport | Long distance, premium pricing | $52.00 | 12% |
| 3 | Downtown | Very short trips, quick hops | $8.50 | 28% |
| 4 | Midtown | Rush hour commutes | $18.00 | 10% |
Clustering Impact on Models
Performance Improvement:
- Before clustering features: R² = 0.85
- After adding cluster_id: R² = 0.89
- Net Improvement: +4.7%
Why It Works: Clustering automatically captures complex geographic pricing patterns (airports, business districts) that raw latitude/longitude cannot express linearly.
Model Development
Iterative Development Process
Iteration 1: Baseline Linear Regression
R²: 0.8991 | MAE: $1.69 | RMSE: $3.23
Insight: Linear model provides decent baseline but misses non-linear patterns
Iteration 2: Linear Regression + Feature Engineering
R²: 0.9473 | MAE: $1.48 | RMSE: $2.39
Insight: Feature engineering >>> algorithm choice (+5.8% R² improvement)
Iteration 3: Gradient Boosting Regressor + Clustering
R²: 0.9783 | MAE: $0.84 | RMSE: $1.53
Insight: Ensemble methods capture non-linear interactions effectively
Iteration 4: Random Forest Regressor (Final Model)
R²: 0.9817 | MAE: $0.67 | RMSE: $1.41
Insight: Best overall performance, production-ready
Model Comparison
| Model | Type | R² Score | MAE ($) | RMSE ($) | Training Time |
|---|---|---|---|---|---|
| Baseline Linear Regression | Linear | 0.8991 | 1.69 | 3.23 | 0.5s |
| Improved Linear Regression | Linear | 0.9473 | 1.48 | 2.39 | 0.8s |
| Gradient Boosting Regressor | Ensemble | 0.9783 | 0.84 | 1.53 | 12s |
| Random Forest Regressor | Ensemble | 0.9817 | 0.67 | 1.41 | 18s |
Final Model: Random Forest Regressor
Why Random Forest Was Selected:
- Highest R²: 0.9817 (explains 98.17% of fare variance)
- Lowest MAE: $0.67 average error (67 cents per trip)
- Best RMSE: $1.41 (robust to outliers)
- Captures Non-Linearity: Automatically models distance × ratecode interactions
- No Overfitting: Training and test performance nearly identical
- Production-Ready: Fast inference (<0.01s per prediction)
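A sketch of how the final regressor might be trained and evaluated. The 500-tree setting echoes the Challenges section below; the feature-column selection, train/test split, and other hyperparameters are illustrative assumptions rather than the exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed: df already holds the engineered features plus the target column
feature_cols = [c for c in df.columns
                if c not in ("total_amount", "tpep_pickup_datetime", "tpep_dropoff_datetime")]
X, y = df[feature_cols], df["total_amount"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)
print(f"R²:   {r2_score(y_test, pred):.4f}")
print(f"MAE:  ${mean_absolute_error(y_test, pred):.2f}")
print(f"RMSE: ${np.sqrt(mean_squared_error(y_test, pred)):.2f}")
```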
Performance Visualization
Analysis:
- Top Left: Actual vs Predicted scatter tightly clustered around diagonal
- Top Right: Error distribution centered at $0, near-normal
- Bottom Left: Residual plot shows no systematic patterns
- Bottom Right: Training vs Test metrics very close (no overfitting)
Feature Importance Analysis
Top Positive Coefficients (Increase Fare):
- RatecodeID_5: +$27.92 (negotiated fares)
- RatecodeID_3: +$18.58 (Newark airport)
- pickup_latitude: +$10.53 (northern locations premium)
- trip_distance: +$3.19 (per-mile base charge)
Top Negative Coefficients (Decrease Fare):
- dropoff_latitude: -$6.43 (southern destinations)
- payment_type_4.0: -$3.07 (dispute cases)
Interpretation: Rate codes dominate pricing, followed by distance and geographic location.
Classification Pipeline
Business Motivation
Beyond exact fare prediction, the system provides binary classification:
Question: "Will this ride be expensive?"
Applications:
- Passenger fare warnings before booking
- Revenue category forecasting
- Driver route optimization
- Business intelligence dashboards
Target Variable Creation
threshold = median(total_amount) = $11.16
Class 0 (Low Fare): total_amount < $11.16 → 92,713 samples (50.0%)
Class 1 (High Fare): total_amount ≥ $11.16 → 92,713 samples (50.0%)
Rationale:
- Median split ensures balanced classes
- Business-meaningful threshold (separates short vs long trips)
- Enables fair model evaluation without resampling
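A sketch of the median-split target construction in pandas (column name high_fare is illustrative):

```python
threshold = df["total_amount"].median()              # ≈ $11.16 on this dataset
df["high_fare"] = (df["total_amount"] >= threshold).astype(int)
print(df["high_fare"].value_counts(normalize=True))  # ~50/50 by construction
```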
Precision vs Recall: Strategic Decision
Critical Business Question: What is more costly?
False Negative (Predict Low → Actually High)
Passenger expects: $8
Actual charge: $20
Customer reaction: Angry, complains, negative reviews
Business cost: $100-$150 per occurrence
False Positive (Predict High → Actually Low)
Passenger expects: $20
Actual charge: $8
Customer reaction: Pleasant surprise, positive experience
Business cost: $0-$5 per occurrence
Cost-Benefit Analysis
Strategy Comparison (per 1,000 rides):
| Strategy | False Negatives | False Positives | Total Cost | Customer Complaints |
|---|---|---|---|---|
| Balanced Precision/Recall | 50 (5%) | 50 (5%) | $5,250 | 50 |
| Optimize for Recall | 10 (1%) | 100 (10%) | $1,500 | 10 |
Conclusion: Optimizing for Recall saves $3,750 per 1,000 rides and reduces complaints by 80%.
Target Metrics:
- Recall (High Fare): ≥95%
- Precision: ≥85%
- F1-Score: ≥0.90
Classification Models Trained
Three models evaluated with focus on Recall optimization:
| Model | Accuracy | Precision | Recall | F1-Score | Training Time |
|---|---|---|---|---|---|
| Logistic Regression | 0.871 | 0.853 | 0.891 | 0.872 | 1.2s |
| Random Forest Classifier | 0.912 | 0.901 | 0.923 | 0.912 | 18s |
| Gradient Boosting Classifier | 0.931 | 0.915 | 0.948 | 0.931 | 15s |
Final Model: Gradient Boosting Classifier
Why Gradient Boosting Was Selected:
- Highest Recall: 94.8% (catches 948/1000 expensive rides)
- Minimized False Negatives: Only 2,402 FN (5.2% miss rate)
- Strong Precision: 91.5% (few false alarms)
- Best F1-Score: 0.931 (optimal balance)
- Production-Ready: 15s training, fast inference
Business Impact:
- Saves ~$210,000 per 100,000 rides (vs balanced baseline)
- Reduces customer complaints by 50%
- Maximizes customer satisfaction through conservative estimates
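A sketch of how the selected classifier and the reported diagnostics could be produced; the hyperparameters shown are scikit-learn defaults rather than the tuned values, and the split/feature matrix are assumed from the earlier sketches.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumed: X is the engineered feature matrix, df["high_fare"] the median-split target
X_train, X_test, y_train, y_test = train_test_split(
    X, df["high_fare"], test_size=0.2, random_state=42, stratify=df["high_fare"]
)

clf = GradientBoostingClassifier(random_state=42)   # illustrative defaults, not the tuned model
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["Low Fare", "High Fare"], digits=3))
```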
Confusion Matrix
| | Predicted Low Fare | Predicted High Fare |
|---|---|---|
| Actual Low Fare | 42,890 | 2,115 |
| Actual High Fare | 2,402 | 43,005 |

Per-class accuracy: 95.3% of actual low-fare rides and 94.8% of actual high-fare rides are classified correctly; the latter is the recall we optimized for.
False Positives (FP): 2,115 → pleasant surprises for customers
False Negatives (FN): 2,402 → costly mistakes (minimized)
Analysis:
- 94.8% of high-fare rides correctly identified
- Only 5.2% of expensive rides missed
- 2,115 FP acceptable (customers get better deal than expected)
Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Low Fare | 0.947 | 0.953 | 0.950 | 45,005 |
| High Fare | 0.953 | 0.948 | 0.951 | 45,407 |
| Accuracy | | | 0.950 | 90,412 |
| Macro Avg | 0.950 | 0.950 | 0.950 | 90,412 |
| Weighted Avg | 0.950 | 0.950 | 0.950 | 90,412 |
Interpretation: Balanced performance across both classes with 95% overall accuracy.
Results & Performance
Regression Performance Summary
Model: Random Forest Regressor
Metrics:
├─ R² Score: 0.9817 (98.17% variance explained)
├─ MAE: $0.67 (67¢ average error)
├─ RMSE: $1.41 (low error variance)
└─ MAPE: 4.8% (percentage error)
Improvements:
├─ MAE: 60% reduction vs baseline ($1.69 → $0.67)
├─ RMSE: 56% reduction vs baseline ($3.23 → $1.41)
└─ R²: +9.2% improvement vs baseline (0.899 → 0.982)
Business Value:
└─ Real-time fare estimation with <$1 average error
Classification Performance Summary
Model: Gradient Boosting Classifier
Metrics:
├─ Accuracy: 93.1%
├─ Precision: 91.5%
├─ Recall: 94.8% (optimized for business needs)
└─ F1-Score: 0.931
False Negative Analysis:
├─ Count: 2,402 (out of 45,407 high fares)
├─ Rate: 5.2% miss rate
└─ Improvement: 48% reduction vs baseline
Business Value:
├─ $210,000 saved per 100,000 rides
├─ 50% reduction in customer complaints
└─ Enhanced customer trust through conservative estimates
Challenges & Solutions
Challenge 1: Extreme Outliers
Problem: Fares exceeding $500, trips over 100 miles distorting model training
Solution:
- Applied IQR-based and percentile capping (removed values above the 99.8th percentile)
- Domain-based filtering (NYC trips typically <50 miles)
- Result: +5% R² improvement
Challenge 2: Geographic Complexity
Problem: Raw latitude/longitude insufficient to capture pricing patterns
Solution:
- K-Means clustering (k=5) on coordinates
- Discovered 5 distinct ride types (airports, downtown, etc.)
- Result: +5% R² improvement
Challenge 3: Non-Linear Distance Pricing
Problem: Linear models underperform on airport trips with flat rates
Solution:
- Polynomial features (distance², distance³)
- Interaction terms (distance × ratecode)
- Result: +8% R² improvement, 88% feature importance for interaction term
Challenge 4: Long Training Time
Problem: Initial Random Forest with 1000 trees took 50+ seconds to train
Solution:
- Reduced to 500 estimators
- Minimal accuracy loss (<0.2% R² reduction)
- Result: 60% faster training (18s vs 50s)
Challenge 5: Model Loading Time
Problem: Trained models took ~13 minutes to load from disk for inference
Solution:
- Model compression techniques
- Optimized pickle serialization
- Feature selection to reduce model size
- Result: Significantly reduced loading time for production deployment
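One common way to shrink the serialized artifact and speed up loading is compressed joblib serialization on top of the smaller forest; a hedged sketch (file name and compression level are illustrative, not the project's actual settings):

```python
import joblib

# Persist with on-disk compression: smaller artifact, typically much faster to load
joblib.dump(rf, "taxi_fare_rf.joblib", compress=3)

# At inference time
model = joblib.load("taxi_fare_rf.joblib")
```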
Challenge 6: Low Baseline Recall
Problem: Initial Logistic Regression achieved only 85% recall on high fares
Solution:
- Feature engineering (clustering, interactions)
- Threshold tuning for Recall optimization
- Ensemble method selection (Gradient Boosting)
- Result: 85% → 95% Recall (+10 percentage points)
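Threshold tuning for recall can be done by lowering the decision cutoff on the positive-class probability; a minimal sketch continuing the classifier example above (the 0.35 cutoff is purely illustrative):

```python
from sklearn.metrics import precision_score, recall_score

proba = clf.predict_proba(X_test)[:, 1]   # predicted probability of a high fare
threshold = 0.35                           # below 0.5: trades some precision for recall
y_pred_tuned = (proba >= threshold).astype(int)

print("Recall:   ", round(recall_score(y_test, y_pred_tuned), 3))
print("Precision:", round(precision_score(y_test, y_pred_tuned), 3))
```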
Key Learnings
Technical Insights
Feature Engineering > Algorithm Selection
- Improved R² from 0.85 → 0.98 primarily through feature engineering
- One interaction feature (distance_x_ratecode) accounted for 88% of feature importance
- Clustering added +5% R² with minimal computational cost
Domain Knowledge Drives Feature Design
- Understanding NYC taxi operations led to temporal features (rush hour)
- Knowledge of airport flat rates inspired critical interaction terms
- Geographic insights guided clustering approach
Ensemble Methods Excel on Tabular Data
- Random Forest and Gradient Boosting outperformed linear models by 10%+ R²
- Automatically capture non-linear relationships
- Robust to outliers and missing values
Business Metrics ≠ ML Metrics
- Optimizing for Recall (not Accuracy) saved $210K per 100K rides
- Cost-benefit analysis should drive metric selection
- False Negative costs 20-30× more than False Positive in this domain
Methodological Insights
Iterative Development is Essential
- Four iterations from baseline to final model
- Each iteration provided specific insights for next improvement
- Documentation of process enables reproducibility
EDA is Non-Negotiable
- 30% of project time spent on EDA
- Visualization revealed patterns (late-night premium, airport clusters)
- Research questions guided feature engineering decisions
Balanced Classes Enable Fair Evaluation
- Median split created perfect 50/50 class balance
- Eliminated need for resampling or weighted metrics
- Accuracy became meaningful metric again
Clustering as Feature Engineering
- Unsupervised learning enhanced supervised learning
- K-Means automatically discovered airport patterns
- Added +5% R² without manual geographic feature engineering
Technical Stack
| Component | Technology |
|---|---|
| Language | Python 3.8+ |
| Data Processing | pandas, NumPy |
| Visualization | Matplotlib, Seaborn |
| Machine Learning | scikit-learn |
| Clustering | K-Means, PCA, t-SNE |
| Development | Jupyter Notebook, Google Colab |
| Deployment | HuggingFace Hub, pickle |
Course Information
Institution: Reichman University
Course: Introduction to Data Science
Assignment: #2 - Regression, Classification, Clustering
Term: Fall 2024
Key Results: 98% R² · $0.67 MAE · 95% Recall · $210K Saved