Railway Price Prediction
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
[2]: df = pd.read_csv('price_data.csv')
[3]: df.head()
duration
0 33.0
1 33.0
2 33.0
3 33.0
4 49.0
[5]: df.describe()
[5]:             baseFare  reservationCharge  superfastCharge  fuelAmount  \
     count  326643.000000      326643.000000    326643.000000    326643.0
     mean      890.075122          38.157805        20.058091         0.0
     std       793.765711          14.324952        24.646072         0.0
     min        30.000000          15.000000         0.000000         0.0
     25%       362.000000          20.000000         0.000000         0.0
     50%       626.000000          40.000000         0.000000         0.0
     75%      1222.000000          50.000000        45.000000         0.0
     max      6541.000000          60.000000        75.000000         0.0

                  distance       duration
     count  326643.000000  326643.000000
     mean      587.745153     644.124564
     std       548.858324     581.665800
     min         1.000000   -1423.000000
     25%       164.000000     193.000000
     50%       404.000000     462.000000
     75%       861.000000     936.000000
     max      3149.000000    3225.000000
1.1 The columns Fuel Amount, Total Concession and Tatkal Fare have zero standard deviation and therefore no impact on the total fare.
1.2 otherCharge contributes an average of only 0.05 rupees to the total fare, which is negligible.
[6]: # Select object and numeric columns
obj_cols = df.select_dtypes(include=['object', 'int64', 'float64']).columns
# Count the number of unique values in each of the columns saved in obj_cols
# and print the column name if it has only 1 unique value
for i in obj_cols:
    # calculate unique values
    unique_values = len(df[i].value_counts())
    if unique_values == 1:
        print(i, 'column has only 1 unique value')
1.3 Fuel Amount, Total Concession and Tatkal Fare each contain only a single unique value, which means they contribute nothing to the model.
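Since these columns carry no information, they can be dropped before the correlation analysis. A minimal sketch of that step; the exact column names and the name df_new (used in a later cell) are assumptions:

# Hypothetical reconstruction: drop the constant columns identified above and
# keep only numeric columns so that df_new.corr() works later
df_new = df.drop(columns=['fuelAmount', 'totalConcession', 'tatkalFare'])
df_new = df_new.select_dtypes(include='number')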
[7]: df.info()
<class 'pandas.core.frame.DataFrame'>
 5   tatkalFare   326643 non-null   int64
1.4 The remaining columns contain no null values, so the dataset has no missing values.
[8]: unique_trains=len(pd.unique(df['trainNumber']))
unique_trains
[8]: 533
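The pair-plot cell below references numerical_columns, whose definition is not shown in this extract. A plausible definition (an assumption):

# Hypothetical: the numeric feature columns used for the pair plot
numerical_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()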
sns.pairplot(df[numerical_columns], height=2.5)
plt.suptitle('Pair Plot of Numerical Columns', y=1.02)
plt.show()
1.7 Reservation charge and superfast charge take only a few fixed values; they are not affected by any other attribute.
1.8 Base fare has a linear relation with every attribute except train number.
1.9 Total fare depends on base fare, distance, duration, service tax, dynamic fare and catering charge.
1.10 Total fare has a linear relation with most of the attributes, so a Linear Regression model is a reasonable choice.
[11]: print("\nDistribution of Numerical Features:")
df.hist(figsize=(12, 10))
plt.show()
1.11 The histograms show that some attributes barely vary, while others are spread over a wide range of values.
1.12 Duration, distance, total fare, service tax and base fare occur less frequently at higher values; their distributions are right-skewed.
[12]: print("\nDistribution of Categorical Features:")
plt.figure(figsize=(8, 6))
sns.countplot(x='classCode', data=df)
plt.show()

print("\nCorrelation Analysis:")
correlation_matrix = df_new.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

Correlation Analysis:
2 Total Fare is Highly Correlated with Base Fare, Reservation Charge, Service Tax, Superfast Charge, Duration and Distance, and Mildly Correlated with Catering Charge and Dynamic Fare.
[10]: y = df.totalFare
'''
dummies = pd.get_dummies(df_now['classCode'])
merge = pd.concat([df_now, dummies], axis='columns')
X = merge.drop(['classCode'], axis='columns')'''
X = df[['baseFare', 'reservationCharge', 'superfastCharge', 'serviceTax',
        'cateringCharge', 'dynamicFare', 'duration', 'distance']]
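The cell that splits the data into training and test sets is not shown in this extract. A minimal sketch of what the model sections below rely on; the test size and random state are assumptions:

from sklearn.model_selection import train_test_split

# Hold out part of the data for evaluation (split parameters are assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)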
4 Linear Regression
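The construction of LRpipeline is also not shown. A plausible sketch, assuming a simple scaler-plus-model pipeline; the preprocessing step is an assumption:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Assumed pipeline: scaling followed by ordinary least squares
LRpipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
LRpipeline.fit(X_train, y_train)
y_pred = LRpipeline.predict(X_test)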
[32]: model = LinearRegression()
scores = []
LRpipeline.score(X_test, y_test)
4.1 The Linear Regression model is an excellent fit for this dataset.
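The metric and bias-variance cells below rely on imports that are not visible in this extract. Assuming scikit-learn metrics and mlxtend's decomposition helper are the ones used:

from sklearn.metrics import (explained_variance_score, max_error,
                             mean_absolute_error, mean_squared_error)
from mlxtend.evaluate import bias_variance_decomp  # assumed source of the helper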
[14]: mse = mean_squared_error(y_test, y_pred)
X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values
# Bias variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(LRpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=5,
                                                            )
Max_error: 24.86785179398271
4.2 These metrics collectively suggest that the model is performing exceptionally well on the dataset. The high explained variance, low error metrics, and low bias are indicative of a model that is accurately capturing the patterns in the data.
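The next cells evaluate a Random Forest model, but the cell that builds and fits RFpipeline is not shown. A plausible sketch; the pipeline contents and hyperparameters are assumptions:

from sklearn.ensemble import RandomForestRegressor

# Assumed pipeline: same scaling step as before, with a random forest model
RFpipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_jobs=-1, random_state=23)),
])
RFpipeline.fit(X_train, y_train)
y_pred = RFpipeline.predict(X_test)  # used by the residual and metric cells below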
[24]: res1 = y_test - y_pred
scores = []
scores.append(RFpipeline.score(X_test, y_test))
print(scores)

[0.999982797020953]

X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values
# Bias variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(RFpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=8,
                                                            )
print('Explained variance_score:', explained_variance_score(y_test, y_pred))
print('Max_error:', max_error(y_test, y_pred))
print('Mean_absolute_error score:', mean_absolute_error(y_test, y_pred))
print('Mean_squared_error score:', mean_squared_error(y_test, y_pred))
print('Root mean_squared_error:', sqrt(mean_squared_error(y_test, y_pred)))
Max_error: 557.9666666666662
5.1 While some metrics, like the explained variance score and low MAE, suggest
good predictive performance, the presence of a relatively high max error
and MSE indicates that there are instances with larger prediction errors.
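The Gradient Boosting cells below reference GBRpipeline, whose construction is likewise not shown. A plausible sketch; the pipeline contents and hyperparameters are assumptions:

from sklearn.ensemble import GradientBoostingRegressor

# Assumed pipeline: scaling followed by a gradient boosting model
GBRpipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GradientBoostingRegressor(random_state=23)),
])
GBRpipeline.fit(X_train, y_train)
y_pred = GBRpipeline.predict(X_test)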
[18]: res2=y_test-y_pred
scores = []
GBRpipeline.score(X_test, y_test)
[19]: 0.9998374923857266
X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values
# Bias variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(GBRpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=5,
                                                            )
Max_error: 374.4876395700476
5.2 Unlike the Random Forest and Linear Regression models, Gradient Boost gives high error and loss. Conclusively, this model is not a good fit for the current dataset, as the attributes are linearly related and the dataset is not complex enough.
6 XGBoost
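As with the earlier models, the cell that wraps the regressor in XGpipeline and fits it is not shown. A plausible construction, assuming the same scaler-plus-model pattern and the parameters configured in the next cell:

from xgboost import XGBRegressor

# Assumed pipeline: scaling followed by an XGBoost regressor
XGpipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', XGBRegressor(n_estimators=1000, n_jobs=-1)),
])
XGpipeline.fit(X_train, y_train)
y_pred = XGpipeline.predict(X_test)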
[38]: model = XGBRegressor(n_jobs=-1)
estimators = 1000
model.set_params(n_estimators=estimators)
scores = []
XGpipeline.score(X_test, y_test)
[38]: 0.9999476391046355
X_train_np = X_train.values
y_train_np = y_train.values
X_test_np = X_test.values
y_test_np = y_test.values
# Bias variance decomposition
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(XGpipeline,
                                                            X_train_np, y_train_np,
                                                            X_test_np, y_test_np,
                                                            loss='mse',
                                                            random_seed=23,
                                                            num_rounds=5,
                                                            )
print('Max_error:', max_error(y_test, y_pred))
print('Mean_absolute_error score:', mean_absolute_error(y_test, y_pred))
print('Mean_squared_error score:', mean_squared_error(y_test, y_pred))
print('Root mean_squared_error:', sqrt(mean_squared_error(y_test, y_pred)))
Max_error: 476.8935546875
6.1 Overall, the high explained variance score suggests that the XGBoost model captures the patterns in the data very well. However, the relatively high max error indicates that there are instances with larger prediction errors.
[28]: from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y, cv=5):
    # compute cross-validated training and validation scores for
    # increasing training-set sizes
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=cv, scoring='neg_mean_squared_error', n_jobs=-1)
    train_scores_mean = train_scores.mean(axis=1)
    test_scores_mean = test_scores.mean(axis=1)
    plt.figure(figsize=(8, 6))
    plt.title('Learning Curve')
    plt.xlabel('Training Examples')
    plt.ylabel('Negative Mean Squared Error')
    plt.plot(train_sizes, train_scores_mean, label='Training Error')
    plt.plot(train_sizes, test_scores_mean, label='Validation Error')
    plt.legend()
    plt.show()
[30]: plot_learning_curve(RandomForestRegressor(), X, y, cv=5)
[31]: plot_learning_curve(GradientBoostingRegressor(), X, y, cv=5)
[41]: plot_learning_curve(XGBRegressor(), X, y, cv=5)
6.2 The learning curves show that:
6.3 1) Linear Regression has essentially no training error, and its decreasing validation error confirms that the dataset is linear. Linear Regression proves to be the best model, with the highest prediction score and the smallest errors.
6.4 2) The Random Forest Regressor also gives a high prediction score, but it produces errors more frequently than the Linear Regression model.
6.5 3) Gradient Boost and XGBoost show training error, which makes them unsatisfactory as prediction models for this dataset.