Our analysis primarily uses gradient-boosted decision trees to predict inspection rates while accounting for complex interactions between demographic characteristics. We also maintain Poisson regression models for statistical inference. This hybrid approach provides both accurate predictions and interpretable statistical estimates.
We use trip counts as exposure variables to calculate rates per 100 trips, accounting for varying travel patterns.
We report incidence rate ratios (IRRs) with 95% confidence intervals to show relative inspection rates between demographic groups.
A k-anonymity threshold (k = 50) ensures that groups with fewer than 50 respondents are suppressed from public results.
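As a rough illustration of the rate and IRR calculations described above, the sketch below computes inspection rates per 100 trips and a crude (unadjusted) IRR with a 95% Wald interval on the log scale. It is a simplification: the published IRRs come from regression models that adjust for other characteristics, and the group totals and function names here are made up for the example.

```python
import math

def rate_per_100(stops, trips):
    """Inspection rate per 100 trips, using trips as the exposure."""
    return 100.0 * stops / trips

def crude_irr_with_ci(stops_a, trips_a, stops_b, trips_b, z=1.96):
    """Unadjusted IRR of group A vs. group B with a Wald interval.

    The standard error of log(IRR) for two Poisson counts is
    sqrt(1/stops_a + 1/stops_b).
    """
    irr = (stops_a / trips_a) / (stops_b / trips_b)
    se_log = math.sqrt(1.0 / stops_a + 1.0 / stops_b)
    lower = math.exp(math.log(irr) - z * se_log)
    upper = math.exp(math.log(irr) + z * se_log)
    return irr, lower, upper

# Made-up totals for two hypothetical groups
rate_a = rate_per_100(stops=60, trips=2000)   # 3 inspections per 100 trips
rate_b = rate_per_100(stops=40, trips=2000)   # 2 inspections per 100 trips
irr, ci_low, ci_high = crude_irr_with_ci(60, 2000, 40, 2000)
print(f"IRR = {irr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

An IRR above 1 means group A is inspected more often per trip than group B; an interval that includes 1 means the difference is not statistically distinguishable at the 5% level.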
We collect anonymous self-reported data including:
Number of public transport journeys in the last 30 days
Number of times fare inspectors checked tickets during those trips
Optional characteristics including age, gender, ethnicity, height, etc.
We use LightGBM gradient boosting as our primary prediction model, with statistical models for inference:
• Gradient boosting machine learning algorithm
• Handles categorical features natively
• Automatically discovers feature interactions
• Weighted by trip counts for robust rate estimation
• SHAP values provide individual explanations
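To see why weighting by trip counts matters, consider the toy calculation below. It is not the LightGBM training code, only an arithmetic illustration with made-up respondents: an unweighted mean of per-person rates is dominated by people who reported very few trips, whereas a trip-weighted mean equals the pooled rate (total inspections divided by total trips).

```python
# Made-up respondents: the first took only 2 trips, so their rate is noisy
respondents = [
    {"stops": 1, "trips": 2},
    {"stops": 2, "trips": 40},
    {"stops": 3, "trips": 58},
]

rates = [r["stops"] / r["trips"] for r in respondents]
weights = [r["trips"] for r in respondents]

# Every respondent counts equally, regardless of how many trips they took
unweighted_mean = sum(rates) / len(rates)

# Each respondent counts in proportion to their exposure
weighted_mean = sum(w * x for w, x in zip(weights, rates)) / sum(weights)

# Total inspections over total trips
pooled_rate = sum(r["stops"] for r in respondents) / sum(r["trips"] for r in respondents)

print(f"unweighted: {unweighted_mean:.3f}")
print(f"weighted:   {weighted_mean:.3f}")
print(f"pooled:     {pooled_rate:.3f}")
```

In LightGBM, per-row weights of this kind can be supplied through the `weight` argument of `lightgbm.Dataset`; whether the project passes trip counts exactly this way is an assumption.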
log(E[stops_i]) = log(trips_i) + β₀ + β₁×age_i + β₂×gender_i + ... + βₖ×trait_k,i
The statistical model provides interpretable coefficients and confidence intervals, while LightGBM provides more accurate individual predictions.
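A short worked example may help in reading the equation. Because log(trips_i) enters as an offset, the model is equivalent to E[stops_i] = trips_i × exp(β₀ + β₁×x₁ + ...), so exp(β) for a characteristic is its IRR. The coefficients below are hypothetical values chosen for illustration, and the 0/1 indicator coding of characteristics is an assumption, not the fitted model.

```python
import math

# Hypothetical coefficients (not fitted values)
coef = {
    "intercept": math.log(0.02),      # baseline: 2 inspections per 100 trips
    "age_under_25": math.log(1.5),    # IRR of 1.5 for this indicator
    "male": math.log(1.2),            # IRR of 1.2 for this indicator
}

def expected_stops(trips, features, coef):
    """E[stops] = trips * exp(intercept + sum of coefficient * feature)."""
    linear = coef["intercept"] + sum(
        coef[name] * value for name, value in features.items()
    )
    return trips * math.exp(linear)

baseline = expected_stops(50, {"age_under_25": 0, "male": 0}, coef)
young = expected_stops(50, {"age_under_25": 1, "male": 0}, coef)
young_100_trips = expected_stops(100, {"age_under_25": 1, "male": 0}, coef)

print(baseline)          # about 1.0 expected inspection over 50 trips
print(young / baseline)  # about 1.5, the IRR for age_under_25
```

Doubling the number of trips doubles the expected number of inspections but leaves the rate per trip, and therefore every IRR, unchanged.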
We provide results from both approaches:
Personalized predictions based on your specific characteristics
Shows which factors most influence your prediction
Uncertainty estimates from model variation
Relative rates between demographic groups
Statistical significance testing
Population-wide patterns and disparities
We implement multiple privacy protections:
Groups with <50 respondents are suppressed from results
We don't publish detailed breakdowns that could identify small subgroups
Raw data is deleted after 12 months; only aggregates are retained
No personally identifiable information collected
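The small-group suppression rule can be sketched as a simple filter applied before anything is published. The threshold of 50 comes from the text; the data layout, group names, and function name are hypothetical.

```python
K_THRESHOLD = 50  # minimum respondents per published group, from the text

def suppress_small_groups(group_stats, k=K_THRESHOLD):
    """Keep only groups with at least k respondents."""
    return {
        name: stats
        for name, stats in group_stats.items()
        if stats["respondents"] >= k
    }

# Made-up aggregates for three hypothetical groups
group_stats = {
    "group_a": {"respondents": 320, "stops": 45, "trips": 6100},
    "group_b": {"respondents": 49, "stops": 9, "trips": 800},
    "group_c": {"respondents": 50, "stops": 7, "trips": 950},
}

public_stats = suppress_small_groups(group_stats)
print(sorted(public_stats))  # group_b has fewer than 50 respondents and is dropped
```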
We validate our models through:
3-fold cross-validation to assess model stability
Held-out 20% test set for unbiased performance evaluation
Identifying most predictive demographic factors
Ensuring explanation quality and interpretability
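One plausible way to combine a 20% test set with 3-fold cross-validation is sketched below: hold out the test set first, then rotate the remaining 80% through three folds. The order of these steps, the fixed seed, and the function name are assumptions; the text does not specify them.

```python
import random

def split_indices(n, test_fraction=0.2, n_folds=3, seed=0):
    """Shuffle row indices, hold out a test set, split the rest into folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    test, train = indices[:n_test], indices[n_test:]
    folds = [train[i::n_folds] for i in range(n_folds)]
    return train, test, folds

train, test, folds = split_indices(100)

for held_out in range(len(folds)):
    validate_on = folds[held_out]
    fit_on = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    # A model would be fit on `fit_on` and scored on `validate_on` here;
    # the test set is touched only once, after model selection is finished.
    print(f"fold {held_out}: fit on {len(fit_on)}, validate on {len(validate_on)}")
```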
Chi-square tests for Poisson model appropriateness
Automatic switching to Negative Binomial if needed
Examining patterns in statistical model residuals
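The Poisson check and the switch to Negative Binomial can be illustrated with the Pearson dispersion statistic: the Pearson chi-square divided by the residual degrees of freedom. Values well above 1 suggest overdispersion, which a Poisson model cannot capture. The cutoff of 1.5, the toy data, and the function name are hypothetical; a formal test would compare the chi-square statistic against its reference distribution instead of using a fixed cutoff.

```python
def pearson_dispersion(observed, expected, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, expected))
    dof = len(observed) - n_params
    return chi2 / dof

# Made-up inspection counts and fitted Poisson means for eight respondents
observed = [0, 1, 0, 5, 0, 7, 1, 0]
expected = [1.5, 2.0, 1.0, 2.5, 1.5, 3.0, 1.5, 1.0]

DISPERSION_CUTOFF = 1.5  # hypothetical threshold, not taken from the source

dispersion = pearson_dispersion(observed, expected, n_params=2)
family = "negative_binomial" if dispersion > DISPERSION_CUTOFF else "poisson"
print(f"dispersion = {dispersion:.2f}, chosen family = {family}")
```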
Sufficient sample size for LightGBM training
Adequate events for pattern detection
Enables meaningful comparisons and feature learning
Automatic flagging of unusual patterns (e.g., >100 trips/month)
Consistency checks and range validation
The system chooses the best-performing model (LightGBM preferred)
Models updated weekly with new data
Results depend on voluntary responses which may not represent all transport users
People may not perfectly remember inspection frequencies over 30 days
Survey respondents may differ systematically from non-respondents
Results may not be representative of all Melbourne transport routes/times
Testing many demographic groups increases chance of false discoveries
Unmeasured factors (route, time, behavior) may influence results
Some demographic groups may have limited representation
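The multiple-comparisons limitation above can be quantified with a short calculation. If m independent tests are each run at significance level α, the chance of at least one false positive is 1 − (1 − α)^m. The sketch also shows the Bonferroni-adjusted threshold, a common and conservative remedy; nothing in this methodology states that it is the correction used, and the figure of 20 comparisons is made up.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the family-wise rate near alpha."""
    return alpha / m

ALPHA = 0.05
N_COMPARISONS = 20  # hypothetical number of group comparisons

fwer = familywise_error(ALPHA, N_COMPARISONS)
per_test = bonferroni_threshold(ALPHA, N_COMPARISONS)
print(f"chance of at least one false discovery: {fwer:.1%}")
print(f"Bonferroni per-test threshold: {per_test}")
```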
For technical questions about our methodology, please contact us at methods@stopodds.com.au
This methodology is open source and available for review on GitHub