Cryptocurrency Price Prediction and Analysis: Submitted To Prof. Vijayasherly V., SCOPE
J COMPONENT REPORT
E2 SLOT
Submitted by
NAME REG
MAYANK GUPTA 15BCE0477
HARSHIT SHARMA 15BCE0506
PIYUSH JAISWAL 15BCE0611
Submitted to
Prof. Vijayasherly V., SCOPE
DECLARATION
I hereby declare that the J Component report entitled “Cryptocurrency Price Prediction
and Analysis” submitted by me to Vellore Institute of Technology, Vellore-14 in partial
fulfilment of the requirement for the award of the degree of B. Tech in Computer science
and engineering is a record of bonafide undertaken by me under the supervision of Prof.
Vijayasherly V. I further declare that the work reported in this report has not been submitted
and will not be submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.
Table of Contents
Abstract………………………………………………………………page 4
1. Introduction………………………………………………………page 5
2. Literature Survey…………………………………………………pages 6-7
3. Methodology………………………………………………………pages 8-10
3.4 Polynomial Regression…………………………………………page 10
3.5 Ensemble Learning……………………………………………page 10
4. Results……………………………………………………………pages 11-12
5. Conclusion…………………………………………………………page 13
References……………………………………………………………page 14
ABSTRACT
1. Introduction
Digital currencies have attracted increasing attention in recent years, inevitably reaching academia, finance, and public policy. From the academic perspective, this importance arises from the fact that they have features that generate several conflicts in political and financial environments. Even the definition is ambiguous: as a product of information-technology conception, a digital currency can be described as a protocol, a platform, a currency, or a payment method (Athey et al. 2016). Among digital currencies, Bitcoin has captured almost all of this attention. This virtual currency was created in 2009 and serves as a peer-to-peer version of electronic cash that allows transactions on the internet without the intermediation of the financial system (Nakamoto 2008).
Digital coins or cryptocurrencies, so named because they use encryption systems to regulate the creation of coins and to verify transfers, must also be understood from an economic-analysis perspective. Hence, it is important to examine which social, financial, and macroeconomic factors determine their price, in order to understand their scope and consequences for the economy.
2. Literature Survey
Paper – 1
Among the notable attempts to model the prediction of extreme events in a systematic way
are those of Hallerberg et al. (2008), who assess under what circumstances extreme events
may be more predictable the bigger they are, and the recent work by Franzke (2012), who
develops a nonlinear stochastic-dynamical model. In the economic context, extreme events
mean a bubble formation or a bubble burst, and their precursors are of vital importance in
risk management. To extract the causal extent (deterministic segment) buried in the noisy
data, various techniques have been proposed, for instance recurrent neural network with
memory feedback (Elman, 1990) or support vector machines (Cortes and Vapnik, 1995). A
survey of recent methods can be found in the work of Akansu et al. (2016). Binary classifiers
separating the upward and downward trend (the positive or negative sign of the logarithmic
return), which are easily evaluated against the dataset in terms of hit ratios (the precision of
the binary classifier's output), are common.
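The labeling scheme these binary classifiers rely on can be sketched in a few lines. The price series below is invented purely for illustration: each step is labeled by the sign of its logarithmic return.

```python
import math

# Hypothetical price series (made-up numbers, for illustration only).
prices = [100.0, 102.0, 101.0, 105.0, 104.0]

# Logarithmic return between consecutive prices: r_t = ln(p_t / p_{t-1}).
log_returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

# Binary labels for an up/down classifier: +1 upward move, -1 downward move.
labels = [1 if r > 0 else -1 for r in log_returns]

print(labels)  # -> [1, -1, 1, -1]
```

A classifier's hit ratio is then just the fraction of these labels it predicts correctly.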
Paper – 2
Specifically, it assumes that the input features (i.e., the groups of 6 price points) are
conditionally independent given the label (i.e., a positive price change [+1] or a negative
price change [-1]). Like logistic regression, the support vector machine algorithm yields a
binary classification model while making very few assumptions about the dataset. Analysis
of price data from Coinbase (2017) shows that between August 30th, 2015 and October 19th,
2017, Bitcoin had a monthly volatility of 21.73%. Over that same time span, Ether had a
monthly volatility of 77.91%. For comparison, the S&P 500 has a historical monthly
volatility of about 14%, suggesting that the price of Ether is significantly less predictable
than that of either Bitcoin or common stock. (I.e., even as the price might vary, the
distribution describing the magnitude of the price changes between iterations would stay
constant.) This assumption guided the authors' idea of using price changes (and the sign of
the price change) as input features for an SVM-based
model, but the model underperformed the ARIMA-based model all the same, even when the
data was standardized and/or normalized.
Their results substantiate earlier research by Madan, Saluja, and Zhao (2014), who found
that by using the Bitcoin price sampled every 10 minutes as the primary feature for a
random-forest model, they could predict the direction of the next change in Bitcoin price
with 57.4% accuracy. The primary dataset consists of the price of Ether sampled at
approximately one-hour intervals between August 30, 2015 and December 2, 2017
(Etherchain, 2017). The variance of the dataset is large relative to its mean, and so the
authors initially attempted to reduce variance by truncating the dataset to include only data
points occurring after February 26th, 2017.
Most of these models used 6 price points as the input feature and were based on binary
classification algorithms, including logistic regression, support vector machines, random
forests, and naive Bayes. Hegazy and Mumford (2016) compute an exponentially smoothed
Bitcoin price every eight minutes; using the first five left derivatives of this price as features
in a decision-tree-based algorithm, they predict the direction of the next change in Bitcoin
price with 57.11% accuracy. (For reference, a naive model which takes no input and always
predicts the price will increase yields a baseline accuracy of 55.8%.) The authors
interrogated the dataset with t-SNE, LDA, and PCA and found that the classes are not
qualitatively well separated by any of those methods.
Final results are obtained by training on the training and development sets and testing on the
test set. Features were generated by grouping the original data points, which contained Ether
prices, into series of six points, such that each point was separated from its neighbors by one
hour. Other types of features were also tested, including the price change between time
points and the sign of the price change between time points, as well as normalized and
standardized versions of all the features already described. The truncation date marks a
natural inflection point in the price history, which seemed to support Bovaird's hypothesis,
and roughly indicates when large institutions increased their interest (since if there was no
such interest, the institutions would not have announced the consortium's formation).
However, truncating the dataset in this way did not significantly change results.
The ratio of positive to negative price changes in the dataset is almost 1:1; as such, the
models are evaluated based on their prediction accuracy.
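The six-point feature construction described above can be sketched as follows. The hourly prices here are invented for illustration; each overlapping window of six prices becomes one feature vector, labeled by the direction of the next price change.

```python
# Made-up hourly prices, for illustration only.
prices = [10.0, 11.0, 10.5, 10.8, 11.2, 11.1, 11.5, 11.3]

window = 6
features, labels = [], []
for i in range(len(prices) - window):
    features.append(prices[i:i + window])           # six consecutive prices
    change = prices[i + window] - prices[i + window - 1]
    labels.append(1 if change > 0 else -1)          # direction of next change

print(len(features), labels)  # -> 2 [1, -1]
```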
3. Methodology
After merging the entire data set, we apply the following four algorithms to it and finally
compare the accuracy of each model.
3.1 Multiple Linear Regression
This model generalizes simple linear regression in two ways: it allows the mean function
E(y) to depend on more than one explanatory variable, and it allows shapes other than
straight lines, although it does not allow arbitrary shapes.
Let y denote the dependent (or study) variable that is linearly related to k independent (or
explanatory) variables X1, X2, ..., Xk through the parameters β1, β2, ..., βk, and we write
y = β1X1 + β2X2 + ... + βkXk + ε.
This is called the multiple linear regression model. The parameters β1, β2, ..., βk are the
regression coefficients associated with X1, X2, ..., Xk respectively, and ε is the random
error component reflecting the difference between the observed and the fitted linear
relationship. There can be various reasons for such a difference, e.g., the joint effect of
variables not included in the model, random factors which cannot be accounted for in the
model, etc.
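As a minimal sketch, the model above can be fitted with scikit-learn on synthetic data. The two features and the true coefficients (3.0 and 2.0) below are invented for illustration and are not drawn from the project's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*X1 + 2*X2 + noise (coefficients are made up).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))   # two explanatory variables
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

# Ordinary least squares recovers the regression coefficients.
model = LinearRegression().fit(X, y)
print(np.round(model.coef_, 1))         # close to [3. 2.]
```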
3.2 Random Forest
Random Forest is a supervised learning algorithm. As the name suggests, it creates a forest
and makes it somewhat random. The "forest" it builds is an ensemble of decision trees, most
of the time trained with the "bagging" method. The general idea of bagging is that a
combination of learning models improves the overall result.
Why the Random Forest Algorithm?
- The same random forest algorithm can be used for both classification and regression tasks.
- As more trees are added to the forest, the random forest classifier becomes less prone to
overfitting.
- The random forest classifier can also model categorical values.
Pseudo Code for the Algorithm
1. Randomly select "k" features from the total "m" features, where k << m.
2. Among the "k" features, calculate the node "d" using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until the desired number of nodes has been reached.
Build the forest by repeating steps 1 to 4 "n" times to create "n" trees.
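The steps above can be sketched with scikit-learn's random-forest regressor, which handles the feature sampling and bagging internally. The sinusoidal toy data below is invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear signal: y = sin(x) + noise (made-up data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.05, size=200)

# An ensemble of 100 bagged decision trees averages their predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = forest.predict(np.array([[1.5]]))
print(round(float(pred[0]), 2))         # near sin(1.5) ≈ 1.0
```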
3.4 Polynomial Regression
Polynomial regression is a form of regression analysis in which the relationship
between the independent variable x and the dependent variable y is modelled as
an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship
between the value of x and the corresponding conditional mean of y, denoted E(y |x),
and has been used to describe nonlinear phenomena such as the growth rate of
tissues, the distribution of carbon isotopes in lake sediments, and the progression of
disease epidemics. Although polynomial regression fits a nonlinear model to the
data, as a statistical estimation problem it is linear, in the sense that the regression
function E(y | x) is linear in the unknown parameters that are estimated from the data.
For this reason, polynomial regression is considered to be a special case of multiple
linear regression.
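Because the model is linear in its parameters, it can be fitted by expanding x into its powers and running ordinary linear regression. The cubic toy data below is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic cubic data: y = 2x^3 - x + 1 (made-up, noise-free).
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2 * x[:, 0] ** 3 - x[:, 0] + 1

# Expand x into [1, x, x^2, x^3] and fit a linear model on those features.
poly = PolynomialFeatures(degree=3)
model = LinearRegression().fit(poly.fit_transform(x), y)
pred = model.predict(poly.fit_transform([[2.0]]))
print(round(float(pred[0]), 2))         # ≈ 2*8 - 2 + 1 = 15.0
```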
4. Results
Input:
Closing: 756
High: 798
Low: 567
Spread: 231
Output:
Analysis:
5. Conclusion
Cryptocurrencies have gained a sudden boom in the market and are disrupting the online
trading industry. With digitalization gaining a foothold in the market, cryptocurrencies can
become a great asset to countries and even show promise when it comes to security. After
the analysis we concluded that the prices of cryptocurrencies do not follow any linear trend
or simple pattern. The analysis yields a distinctive graph of the relationship between
cryptocurrency prices and various attributes. This uncertainty is the reason one cannot rely
on a single algorithm to predict their prices. Our model makes use of four different
prediction algorithms and, through ensemble learning, combines their best attributes to
produce a result superior to what any individual algorithm could produce. The dataset used
to train the model contains 10,000 data entries, and the accuracy obtained through our model
on the test set is more than 90%. We hence conclude that our model is well suited for
cryptocurrency price prediction, and as the training set grows over time, its accuracy should
keep improving.
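The ensemble step described above follows the logic of the appendix code: each algorithm's prediction is kept only if it falls inside the day's [low, high] band, and the survivors are averaged. The numbers below are made up for illustration.

```python
# Made-up per-algorithm predictions and price band, for illustration only.
low, high = 560.0, 800.0
predictions = [765.2, 910.4, 742.8, 778.1]   # one prediction per algorithm

# Keep only predictions inside the plausible band, then average them.
valid = [p for p in predictions if low <= p <= high]
final = sum(valid) / len(valid)
print(round(final, 2))  # -> 762.03 (910.4 is discarded as out of range)
```

This simple filter-and-average scheme discards a model's output on days when it is wildly off, which is one way to combine heterogeneous regressors without learning ensemble weights.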
References
[1] Kaggle, www.kaggle.com
[2] Blockchain.info, www.blockchain.info
[3] I. Madan, S. Saluja, and A. Zhao. Automated Bitcoin Trading via Machine Learning
Algorithms.
[4] O. Kodama, L. Pichl, and T. Kaizoji. Regime Change and Trend Prediction for
Cryptocurrency Time Series Data.
[5] T. B. Trafalis and H. Ince. Support Vector Machine for Regression and Applications to
Financial Forecasting. IJCNN 2000, 348-353. http://www.svms.org/regression/TrIn00.pdf
[6] H. Yang, L. Chan, and I. King. Support Vector Machine Regression for Volatile Stock
Market Prediction. Proceedings of the Third International Conference on Intelligent Data
Engineering and Automated Learning, 2002.
http://www.cse.cuhk.edu.hk/~lwchan/papers/ideal2002.pdf
[7] M. Chen, N. Narwal, and M. Schultz. Predicting Price Changes in Cryptocurrency.
Stanford University, Stanford, CA 94305.
Appendix: Sample Code
import numpy as np

choice = int(input())
if choice == 1:
    a = 0
    b = 1492
    c = "Bitcoin"
elif choice == 2:
    a = 1493
    b = 2397
    c = "Ethereum"
elif choice == 4:
    a = 4036
    b = 4224
    c = "Bitcoin Cash"
else:
    print("Invalid Choice\n")
    print("Please select a number between (1-10)\n")
    ch = int(input())
    if ch == 1:
        a = 0
        b = 1492
        c = "Bitcoin"
    elif ch == 2:
        a = 1493
        b = 2397
        c = "Ethereum"
        b = 7351
        c = "Litecoin"
print("\n")
print("Input the parameters")
high = float(input())
low = float(input())
close = float(input())
spread = float(input())
arr = np.array([[high, low, close, spread]])
p_sum = 0
count = 0
# Splitting the dataset into the Training set and Test set for Multiple Linear Regression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_mlr, y, test_size=0.2, random_state=0)
# Predicting the result for the user-supplied input using MLR
y_pred1 = regressor.predict(arr)
# Fitting the Random Forest Regression model to the dataset and predicting the result
# High
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=1000, random_state=0)
regressor.fit(X_high, y)
y_pred3_1 = regressor.predict(np.array([[high]]))
# Close
regressor = RandomForestRegressor(n_estimators=1000, random_state=0)
regressor.fit(X_close, y)
y_pred3_2 = regressor.predict(np.array([[close]]))
# Low
regressor = RandomForestRegressor(n_estimators=1000, random_state=0)
regressor.fit(X_low, y)
y_pred3_3 = regressor.predict(np.array([[low]]))
# Spread
regressor = RandomForestRegressor(n_estimators=1000, random_state=0)
regressor.fit(X_spread, y)
y_pred3_4 = regressor.predict(np.array([[spread]]))
print("\n\nThe dependence of the close value against the various input parameters is shown as follows:\n\n")
# High
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X_high)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)
y_pred2_1 = lin_reg2.predict(poly_reg.fit_transform(np.array([[high]])))
if low <= y_pred2_1 <= high:
    p_sum = p_sum + y_pred2_1
    count = count + 1
# Close
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X_close)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)
y_pred2_2 = lin_reg2.predict(poly_reg.fit_transform(np.array([[close]])))
# Low
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X_low)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)
y_pred2_3 = lin_reg2.predict(poly_reg.fit_transform(np.array([[low]])))
# Spread
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X_spread)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)
y_pred2_4 = lin_reg2.predict(poly_reg.fit_transform(np.array([[spread]])))
# Calculating the Final Predicted Value
prediction = p_sum / count
print("\n\nThe Predicted Opening Price for " + c + " for the following day is: " + str(prediction))