Railway Price Prediction


railway-price-prediction

March 28, 2024

1 Problem Statement: Develop a machine learning model to predict the ticket price for Indian Railways based on various factors such as ‘baseFare’, ‘reservationCharge’, ‘superfastCharge’, ‘serviceTax’, ‘cateringCharge’, ‘dynamicFare’, ‘duration’, and ‘distance’. The goal is to create a predictive model that can accurately estimate the ticket price by analyzing the relationships and patterns within these columns, thereby enhancing the efficiency and accuracy of fare predictions for passengers and stakeholders in the Indian Railways system.
1.0.1 Github Link: https://github.com/QaidJamali53/Railways-Price-Prediction

[37]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn

import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.metrics import explained_variance_score, max_error, mean_absolute_error

from mlxtend.evaluate import bias_variance_decomp

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from xgboost import XGBRegressor

from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import StandardScaler

from math import sqrt

[2]: df = pd.read_csv('price_data.csv')

[3]: df.head()

[3]: baseFare reservationCharge superfastCharge fuelAmount totalConcession \
0 1059 60 0 0.0 0
1 626 50 0 0.0 0
2 441 40 0 0.0 0
3 125 20 0 0.0 0
4 1059 60 0 0.0 0

tatkalFare serviceTax otherCharge cateringCharge dynamicFare \
0 0 56.0 0 0 0
1 0 34.0 0 0 0
2 0 24.0 0 0 0
3 0 0.0 0 0 0
4 0 56.0 0 0 0

totalFare availability trainNumber \
0 1175 [{'date': '2-12-2023', 'status': 'AVAILABLE-00… 11464
1 710 [{'date': '2-12-2023', 'status': 'AVAILABLE-00… 11464
2 505 [{'date': '2-12-2023', 'status': 'AVAILABLE-01… 11464
3 145 [{'date': '2-12-2023', 'status': 'AVAILABLE-00… 11464
4 1175 [{'date': '2-12-2023', 'status': 'AVAILABLE-00… 11464

timeStamp fromStnCode toStnCode classCode distance \
0 2023-10-03 22:13:07.781307 JBP SRID 1A 54
1 2023-10-03 22:13:07.781307 JBP SRID 2A 54
2 2023-10-03 22:13:07.781307 JBP SRID 3A 54
3 2023-10-03 22:13:07.781307 JBP SRID SL 54
4 2023-10-03 22:13:07.781307 JBP KKB 1A 69

duration
0 33.0
1 33.0
2 33.0
3 33.0
4 49.0

[4]: # df['timeStamp'] = pd.to_datetime(df['timeStamp'], format='%Y-%m-%d %H:%M:%S.%f', errors='coerce')

[5]: df.describe()

[5]: baseFare reservationCharge superfastCharge fuelAmount \
count 326643.000000 326643.000000 326643.000000 326643.0
mean 890.075122 38.157805 20.058091 0.0
std 793.765711 14.324952 24.646072 0.0
min 30.000000 15.000000 0.000000 0.0
25% 362.000000 20.000000 0.000000 0.0
50% 626.000000 40.000000 0.000000 0.0
75% 1222.000000 50.000000 45.000000 0.0
max 6541.000000 60.000000 75.000000 0.0

totalConcession tatkalFare serviceTax otherCharge \
count 326643.0 326643.0 326643.000000 326643.000000
mean 0.0 0.0 42.761510 0.053208
std 0.0 0.0 45.080793 1.018403
min 0.0 0.0 0.000000 0.000000
25% 0.0 0.0 0.000000 0.000000
50% 0.0 0.0 34.000000 0.000000
75% 0.0 0.0 65.000000 0.000000
max 0.0 0.0 334.000000 25.000000

cateringCharge dynamicFare totalFare trainNumber \
count 326643.000000 326643.000000 326643.000000 326643.000000
mean 4.990601 4.224771 1000.355281 14597.751870
std 42.586032 49.050982 872.003147 4734.686291
min 0.000000 0.000000 45.000000 1027.000000
25% 0.000000 0.000000 395.000000 12316.000000
50% 0.000000 0.000000 710.000000 13020.000000
75% 0.000000 0.000000 1370.000000 18477.000000
max 1365.000000 1518.000000 8375.000000 22645.000000

distance duration
count 326643.000000 326643.000000
mean 587.745153 644.124564
std 548.858324 581.665800
min 1.000000 -1423.000000
25% 164.000000 193.000000
50% 404.000000 462.000000
75% 861.000000 936.000000
max 3149.000000 3225.000000

1.1 The columns Fuel Amount, Total Concession and Tatkal Fare have zero standard deviation and therefore no impact on the total fare.
1.2 otherCharge adds an average of only 0.05 rupees to the total fare, which is negligible.
[6]: # Select all object and numeric columns
obj_cols = df.select_dtypes(include=['object', 'int64', 'float64']).columns

# Count the number of unique values in each of the columns saved in obj_cols
# and print the column name if it has only 1 unique value
for i in obj_cols:
    # calculate unique values
    unique_values = len(df[i].value_counts())
    if unique_values == 1:
        print(i, 'column has only 1 unique value')

fuelAmount column has only 1 unique value

totalConcession column has only 1 unique value

tatkalFare column has only 1 unique value

1.3 Fuel Amount, Total Concession, and Tatkal Fare each contain only a single unique value, which means they do not contribute to the prediction.
[7]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326643 entries, 0 to 326642
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   baseFare           326643 non-null  int64
 1   reservationCharge  326643 non-null  int64
 2   superfastCharge    326643 non-null  int64
 3   fuelAmount         326643 non-null  float64
 4   totalConcession    326643 non-null  int64
 5   tatkalFare         326643 non-null  int64
 6   serviceTax         326643 non-null  float64
 7   otherCharge        326643 non-null  int64
 8   cateringCharge     326643 non-null  int64
 9   dynamicFare        326643 non-null  int64
 10  totalFare          326643 non-null  int64
 11  availability       326643 non-null  object
 12  trainNumber        326643 non-null  int64
 13  timeStamp          326643 non-null  object
 14  fromStnCode        326643 non-null  object
 15  toStnCode          326643 non-null  object
 16  classCode          326643 non-null  object
 17  distance           326643 non-null  int64
 18  duration           326643 non-null  float64
dtypes: float64(3), int64(11), object(5)
memory usage: 47.3+ MB

1.4 The remaining columns have no null values, so the dataset contains no missing data.
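A one-line confirmation (hypothetical cell, not part of the original notebook):

[ ]: # Count missing values across the whole frame; 0 means no missing data
print(df.isnull().sum().sum())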
[8]: unique_trains=len(pd.unique(df['trainNumber']))
unique_trains

[8]: 533

1.5 The dataset contains data for 533 different trains.


1.6 Let's compare the columns with each other to identify relations between them.
[10]: numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

sns.pairplot(df[numerical_columns], height=2.5)
plt.suptitle('Pair Plot of Numerical Columns', y=1.02)
plt.show()
1.7 Reservation charges and superfast charges stay at fixed levels throughout; they are not affected by any other attribute.
1.8 Base fare is in a linear relation with all attributes except train number.
1.9 Total fare depends on Base Fare, Distance, Duration, Service Tax, Dynamic Fare and Catering Charge.
1.10 Total Fare has a linear relation with most of the attributes, which suggests that a Linear Regression model can be used.
[11]: print("\nDistribution of Numerical Features:")
df.hist(figsize=(12, 10))
plt.show()

Distribution of Numerical Features:

1.11 The histograms show that some attributes take a single value throughout, while others are spread across a wide range.
1.12 Duration, Distance, Total Fare, Service Tax and Base Fare occur less frequently at higher values, i.e. their distributions are right-skewed.
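A rough way to quantify this right skew (hypothetical cell, not in the original notebook) is to compute the skewness of the numeric columns:

[ ]: # Positive skewness indicates a long right tail; zero-variance columns return NaN
print(df.select_dtypes(include=['int64', 'float64']).skew().sort_values(ascending=False))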
[12]: print("\nDistribution of Categorical Features:")
plt.figure(figsize=(8, 6))
sns.countplot(x='classCode', data=df)
plt.show()

Distribution of Categorical Features:

1.13 Passengers mostly prefer SL, 3A and 2A class tickets when travelling.
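The exact counts behind the count plot can also be printed (hypothetical cell, not in the original notebook):

[ ]: # Number of tickets per class code, most common first
print(df['classCode'].value_counts())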


[13]: df_new = df.copy()
df_new.drop(['availability', 'timeStamp', 'fromStnCode', 'toStnCode', 'classCode'], axis=1, inplace=True)

print("\nCorrelation Analysis:")
correlation_matrix = df_new.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

Correlation Analysis:

2 Total Fare is highly correlated with Base Fare, Reservation Charge, Service Tax, Superfast Charge, Duration and Distance, and mildly correlated with Catering Charge and Dynamic Fare.

3 Fuel Amount, Total Concession and Tatkal Fare have no correlation with Total Fare. Other Charge and Train Number have the least impact on Total Fare.
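To back this up numerically, the correlations with totalFare can be ranked directly from the matrix computed above (hypothetical cell, not in the original notebook):

[ ]: # Correlation of each numeric column with totalFare, strongest first
print(correlation_matrix['totalFare'].drop('totalFare').sort_values(ascending=False))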
3.1 Removing useless Columns
[9]: df.drop(['fuelAmount', 'totalConcession', 'tatkalFare', 'otherCharge', 'trainNumber'], axis=1, inplace=True)

[10]: y=df.totalFare

[11]: '''df_now = df[['classCode', 'baseFare', 'reservationCharge', 'duration', 'serviceTax', 'distance']]

dummies = pd.get_dummies(df_now['classCode'])
merge = pd.concat([df_now, dummies], axis='columns')
X = merge.drop(['classCode'], axis='columns')'''

X = df[['baseFare', 'reservationCharge', 'superfastCharge', 'serviceTax',
        'cateringCharge', 'dynamicFare', 'duration', 'distance']]

[12]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=37)

4 Linear Regression
[32]: model = LinearRegression()

scores = []

LRpipeline = Pipeline([('scaler2', StandardScaler()),
                       ('LinearRegressor:', model)])
LRpipeline.fit(X_train, y_train)
y_pred = LRpipeline.predict(X_test)

LRpipeline.score(X_test, y_test)

4.1 The Linear Regression model is an excellent fit for this data set.
[14]: mse = mean_squared_error(y_test, y_pred)

X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values

# Bias-variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(LRpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=5)

# Print the results
print('Average expected loss: %.2f' % avg_expected_loss)
print('Average bias: %.2f' % avg_bias)
# print('Average variance: %.2f' % avg_var)

print('Explained variance_score:', explained_variance_score(y_test, y_pred))
print('Max_error:', max_error(y_test, y_pred))
print('Mean_absolute_error score:', mean_absolute_error(y_test, y_pred))
print('Mean_squared_error score:', mean_squared_error(y_test, y_pred))
print('Root mean_squared_error:', sqrt(mean_squared_error(y_test, y_pred)))

Average expected loss: 1.12

Average bias: 1.12

Explained variance_score: 0.9999985217764332

Max_error: 24.86785179398271

Mean_absolute_error score: 0.18015889734709314

Mean_squared_error score: 1.1186337306873104

Root mean_squared_error: 1.0576548258705722

4.2 These metrics collectively suggest that the model is performing exceptionally well on the dataset. The high explained variance, low error metrics, and low bias are indicative of a model that is accurately capturing the patterns in the data.
[24]: res1=y_test-y_pred
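The residuals stored in res1 are not visualised in this excerpt; a minimal sketch of how they could be plotted (hypothetical cell, using the matplotlib/seaborn imports above):

[ ]: # Distribution of the Linear Regression residuals
plt.figure(figsize=(8, 6))
sns.histplot(res1, bins=50, kde=True)
plt.title('Linear Regression residuals (y_test - y_pred)')
plt.xlabel('Residual')
plt.show()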

5 Random Forest Regressor


[33]: model = RandomForestRegressor(n_estimators=150, n_jobs=-1)

scores = []

RFpipeline = Pipeline([('scaler2', StandardScaler()),
                       ('RandomForestRegressor: ', model)])
RFpipeline.fit(X_train, y_train)
y_pred = RFpipeline.predict(X_test)

scores.append(RFpipeline.score(X_test, y_test))
print(scores)

[0.999982797020953]

[17]: mse = mean_squared_error(y_test, y_pred)

X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values

# Bias-variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(RFpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=8)

# Print the results
print('Average expected loss: %.2f' % avg_expected_loss)
print('Average bias: %.2f' % avg_bias)
# print('Average variance: %.2f' % avg_var)

print('Explained variance_score:', explained_variance_score(y_test, y_pred))
print('Max_error:', max_error(y_test, y_pred))
print('Mean_absolute_error score:', mean_absolute_error(y_test, y_pred))
print('Mean_squared_error score:', mean_squared_error(y_test, y_pred))
print('Root mean_squared_error:', sqrt(mean_squared_error(y_test, y_pred)))

Average expected loss: 26.62

Average bias: 19.35

Explained variance_score: 0.9999840194390366

Max_error: 557.9666666666662

Mean_absolute_error score: 0.18961252334350373

Mean_squared_error score: 12.095018476376428

Root mean_squared_error: 3.4777893087960963

5.1 While some metrics, like the explained variance score and low MAE, suggest
good predictive performance, the presence of a relatively high max error
and MSE indicates that there are instances with larger prediction errors.
[18]: res2=y_test-y_pred
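To see the instances with the larger errors mentioned in 5.1, the biggest absolute residuals of the Random Forest model can be listed (hypothetical cell, not in the original notebook):

[ ]: # Ten largest absolute residuals of the Random Forest predictions
print(res2.abs().sort_values(ascending=False).head(10))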

[19]: model = GradientBoostingRegressor(n_estimators=100)

scores = []

GBRpipeline = Pipeline([('scaler2', StandardScaler()),
                        ('GradientBoostingRegressor:', model)])
GBRpipeline.fit(X_train, y_train)
y_pred = GBRpipeline.predict(X_test)

GBRpipeline.score(X_test, y_test)

[19]: 0.9998374923857266

[20]: mse = mean_squared_error(y_test, y_pred)

X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values

# Bias-variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(GBRpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=5)

# Print the results
print('Average expected loss: %.2f' % avg_expected_loss)
print('Average bias: %.2f' % avg_bias)
# print('Average variance: %.2f' % avg_var)

print('Explained variance_score:', explained_variance_score(y_test, y_pred))
print('Max_error:', max_error(y_test, y_pred))
print('Mean_absolute_error score:', mean_absolute_error(y_test, y_pred))
print('Mean_squared_error score:', mean_squared_error(y_test, y_pred))
print('Root mean_squared_error:', sqrt(mean_squared_error(y_test, y_pred)))

Average expected loss: 137.18

Average bias: 110.85

Explained variance_score: 0.9998374927400341

Max_error: 374.4876395700476

Mean_absolute_error score: 6.58026339911843

Mean_squared_error score: 122.97529599574058

Root mean_squared_error: 11.08942270795647

5.2 Unlike the Random Forest and Linear Regression models, Gradient Boosting gives high error and loss. Consequently, this model is not a good fit for the current data set, since the attributes are linearly related and the data set is not complex enough to benefit from boosting.

6 XGBoost
[38]: model = XGBRegressor(n_jobs=-1)
estimators = 1000
model.set_params(n_estimators=estimators)

scores = []

XGpipeline = Pipeline([('scaler2', StandardScaler()),
                       ('XGBRegressor: ', model)])
XGpipeline.fit(X_train, y_train)
y_pred = XGpipeline.predict(X_test)

XGpipeline.score(X_test, y_test)

[38]: 0.9999476391046355

[40]: mse = mean_squared_error(y_test, y_pred)

X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values

# Bias-variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(XGpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=5)

# Print the results
print('Average expected loss: %.2f' % avg_expected_loss)
print('Average bias: %.2f' % avg_bias)
# print('Average variance: %.2f' % avg_var)

print('Explained variance_score:', explained_variance_score(y_test, y_pred))
print('Max_error:', max_error(y_test, y_pred))
print('Mean_absolute_error score:', mean_absolute_error(y_test, y_pred))
print('Mean_squared_error score:', mean_squared_error(y_test, y_pred))
print('Root mean_squared_error:', sqrt(mean_squared_error(y_test, y_pred)))

Average expected loss: 57.87

Average bias: 39.19

Explained variance_score: 0.9999476396115512

Max_error: 476.8935546875

Mean_absolute_error score: 0.9369068056022307

Mean_squared_error score: 39.62335324930129

Root mean_squared_error: 6.294708352997881

6.1 Overall, the high explained variance score suggests that the XGBoost model captures the patterns in the data very well. However, the relatively high max error indicates that there are instances with larger prediction errors.
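A possible follow-up (not part of the original notebook) is to check which features the XGBoost model relies on most; a minimal sketch, assuming the fitted XGpipeline from the cell above:

[ ]: # Feature importances of the XGBoost step inside the fitted pipeline
xgb_model = XGpipeline.named_steps['XGBRegressor: ']
importances = pd.Series(xgb_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))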
[28]: def plot_learning_curve(model, X, y, cv=5):
    train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=cv,
                                                            scoring='neg_mean_squared_error')

    # learning_curve returns negative MSE scores; negate them to get MSE
    train_scores_mean = -np.mean(train_scores, axis=1)
    test_scores_mean = -np.mean(test_scores, axis=1)

    plt.figure(figsize=(8, 6))
    plt.title('Learning Curve')
    plt.xlabel('Training Examples')
    plt.ylabel('Mean Squared Error')
    plt.plot(train_sizes, train_scores_mean, label='Training Error')
    plt.plot(train_sizes, test_scores_mean, label='Validation Error')
    plt.legend()
    plt.show()

[29]: plot_learning_curve(LinearRegression(), X, y, cv=5)

[30]: plot_learning_curve(RandomForestRegressor(), X, y, cv=5)

[31]: plot_learning_curve(GradientBoostingRegressor(), X, y, cv=5)

[41]: plot_learning_curve(XGBRegressor(), X, y, cv=5)

6.2 The learning curves show that:
6.3 1) Linear Regression has no training error, and the decreasing validation error confirms that the data set is linear. Linear Regression proves to be the best model, with a high prediction score and the lowest errors.
6.4 2) Random Forest Regressor gives a high predictive score but produces errors more frequently than the Linear Regression model.
6.5 3) Gradient Boosting and XGBoost show training error, which makes them unsatisfactory as prediction models.
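For a side-by-side summary supporting this conclusion, the test scores and RMSE of the four fitted pipelines could be printed together (hypothetical cell, assuming the pipelines fitted above are still in memory):

[ ]: # Compare test R^2 and RMSE of the four fitted pipelines
pipelines = {'Linear Regression': LRpipeline,
             'Random Forest': RFpipeline,
             'Gradient Boosting': GBRpipeline,
             'XGBoost': XGpipeline}
for name, pipe in pipelines.items():
    rmse = sqrt(mean_squared_error(y_test, pipe.predict(X_test)))
    print(f'{name}: R^2 = {pipe.score(X_test, y_test):.6f}, RMSE = {rmse:.2f}')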
