Real Estate Market Data Analysis and Prediction Based on Minor Advertisements
Data and Locations’ Geo-codes
Article in International Journal of Advanced Trends in Computer Science and Engineering · June 2020
DOI: 10.30534/ijatcse/2020/235932020
2 authors, including:
Waleed Al-Sit
Mu’tah University
All content following this page was uploaded by Waleed Al-Sit on 16 July 2020.
Waleed T. Al-Sit et al., International Journal of Advanced Trends in Computer Science and Engineering, 9(3), May – June 2020, 4077 – 4089
Table 1: Summary of the five similar works (inputs and reported results)
Dataset | Dataset Dimensions | Used Models | Data Splits | Results (Model: Evaluation Criteria, Value)
UCI’s Dataset (Housing Value of Boston Suburb) | 452 X 13 | SVM (1), LSSVM (2), PLS (3) | 400-52 | (1) MSE 10.7373, R. Time 0.4610s; (2) MSE 20.3730, R. Time 20.3730s; (3) MSE 25.0540, R. Time 0.7460s
Local Data in Spain | 1187 X 6 | MLP [1 Hidden Layer] | 952-237 | MLP: R2 0.8605, RMSE 39540.36, MAE 28551.34
“Zillow.com”, “magicbricks.com” | 21,000 X 15 | Linear Regression (1), Multivariate Regression (2), Polynomial Regression (3) | 80% - 20% | (1) RMSE 1.201918; (2) RMSE 16,545,470; (3) RMSE 11359157
bProperty.com | 3505 X 15 | GB-Regression (1), Random Forest (2), SVM Ensemble (3) | 80% - 20% | All: RMSE 0.1864 to 0.2340
Random sample from www.bluebook.co.nz | 200 X 7 | MLP | 80% - 20% | MLP: R2 0.6907 to 0.9, RMSE 449,111.46 to 1,014,721.92
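As a reading aid for the evaluation criteria reported in the table above, the following minimal sketch shows how MSE, RMSE, absolute-error measures and R2 are commonly computed. It is an illustration only, not code from any of the cited works: the function names are hypothetical, the absolute error is given in both mean and median variants, and the running time ("R. Time") criterion is not covered.

```python
import math
import statistics


def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared residuals."""
    return statistics.fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE, in price units."""
    return math.sqrt(mse(y_true, y_pred))


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute residuals."""
    return statistics.fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def median_absolute_error(y_true, y_pred):
    """Median of the absolute residuals (less sensitive to outliers)."""
    return statistics.median([abs(t - p) for t, p in zip(y_true, y_pred)])


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Because RMSE and the absolute-error measures are expressed in the units of the target, their magnitudes depend on the currency and price scale of each dataset, which is one reason the values in the table are not directly comparable across works; R2 is scale-free.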
techniques due to the straightforwardness of the problem and the size of the available work on it [8]. An accurate estimation first benefits property buyers: it helps them judge whether a certain property is priced fairly and gives them a better idea of how each attribute affects the price, and in what way. All of this facilitates making a purchase decision and setting a budget and priorities. In addition, it helps property investors know whether a purchase deal is a bargain with a high profit margin or not. On the other side of the same transaction are the property sellers and potential investors in the property sector, whether individuals or corporate-sized parties, who want to understand market prices, how much they can expect to charge for their properties, and which market aspects to concentrate on and which to dismiss. All of these are common use cases for data science techniques.

The contributions of the work presented here are threefold: (1) demonstrating the use of data science to figure out more of countries’ economic aspects, which requires huge resources to create some business and academic opportunities and insights; (2) presenting a comparative predictive modeling and analysis study of the Jordanian property market against five similar works conducted in different countries; and (3) reporting the insights resulting from the analysis upon adopting different machine learning techniques and evaluation measures for the presented problem.

The rest of the work is organized as follows. Section 2 presents similar works for other countries and their reported results. Section 3 provides the methodology adopted to conduct end-to-end data preprocessing and Exploratory Data Analysis (EDA) for predicting real estate apartment prices based on advertisement data and location geocodes. Section 4 provides the predictive modeling analysis of the different experiments conducted using machine learning. Section 5 provides the reported results alongside the needed discussion, and Section 6 concludes the presented work and outlines its next steps.

2. SIMILAR WORK

As this is a heavily covered topic, especially in recent years, the options to compare to and benefit from are plenty. Hence, we chose a sample of a few papers addressing the price estimation problem that felt closest and most relevant to this work.

We start with hedonic regression, in which a regression is performed on each attribute individually to observe the change in the target attribute (Price); the final equation of the predicted variable is then constructed from the weighted attribute estimates. It has its origin out of real estate pricing [9] and some interesting use cases [10]. Much of the literature on this problem studies alternative estimation methods, with a concentration on the use of machine learning. Table 1 summarizes five works, each of which is discussed further in this comparative work.

The first work [11] uses a UCI dataset of housing prices in Boston with 13 attributes and a relatively low number of instances (452). The authors present their work as a comparison of three different models, namely Partial Least Squares (PLS) regression, Support Vector Machine (SVM), and Least-Squares Support-Vector Machine (LSSVM), where the third is somewhat a kernelled version of regular SVMs with an optimization included by design. The results show that SVMs were superior to the other two methods on both Mean Squared Error (MSE) and model fitting time, though LSSVM would have smaller fitting time if the
Table 2: Dataset attributes and their types
Attribute | Type | Attribute | Type | Attribute | Type | Attribute | Type
ID | INT | Ad Images Count | INT | Air Conditioning | BIN | Nearby Facilities | BIN
Title | STR | City | STR | Heating | BIN | Security | BIN
Date | STR | Location | STR | Balcony | BIN | Built-in Wardrobes | BIN
Real Estate | BIN | No. Rooms | INT | Elevator | BIN | Swimming Pool | BIN
Paid Ad Feature.1 | BIN | No. Bath Rooms | INT | Garden | BIN | Solar Panels | BIN
Paid Ad Feature.2 | BIN | Area | INT | Garage Parking | BIN | Double pane Windows | BIN
Paid Ad Feature.3 | BIN | Floor | INT | Maid Room | BIN | Ad Post | STR
Paid Ad Feature.4 | BIN | Age | STR | Laundry Room | BIN | |
Price | INT | Payment Type | STR | Is Furnished | BIN | |
parameter optimization were dropped, while all models seem to yield acceptable results.

The next work, presented in [12], provides a good outlook on the Spanish real estate market. It utilizes data that spans a long time interval and a one-hidden-layer Multi-Layer Perceptron (MLP) to yield an estimation model of the real estate price based on some exogenous variables. The authors claim the superiority of one hidden layer over two, and the results obtained support the claim to some degree. However, such work should be examined further, considering the various architectures and parameters that an MLP could take, while also taking into account the benefits of two-hidden-layer models [13] beyond the general function approximation abilities that artificial neural networks have [14]. The data is somewhat rich vertically, with 1187 instances, but slim horizontally, with only 6 attributes. With the calculated R2, RMSE and Median Absolute Error (MAE), the work demonstrates the potential Artificial Neural Networks (ANNs) have for solving such problems and their improvement over traditional hedonic models, an observation shared by a good fraction of the literature.

On the more technical side, the authors in [15] make use of rich data with 21000 instances and 15 attributes and try several regression models, namely linear, multivariate and polynomial regression. However, the evaluation method leaves a lot to be desired: relying solely on RMSE leaves a need for data exploration to understand the results further. Still, the use of such an evaluation criterion is understandable given the nature of the experiment, where most regression models rely on minimizing some sort of error or residual to produce the model, and given the iterative nature of tuning both the models and the data.

Taking the ensemble approach, the authors in [16] build on the assumption that several models should yield better results, as the aggregated results of several models should lend more support to a certain decision. This work utilizes several models that have shown good results, exceeding the performance of other models that do not utilize the ensemble paradigm, like ANNs. With 3505 instances and 19 attributes, the dataset considered is feature-rich and should allow for better estimation, as the higher dimensionality reduces the complexity of the fitted model; at the same time, it might hinder the fitting and learning process due to the larger search space for the solution. The results are good yet could benefit from a clearer presentation; although lacking in clarity, they demonstrate the potential benefit of ensemble methods for this type of regression. In addition, the work addresses the effect of data preprocessing and transformation on the regression’s final output, with respect to the data categories and how each might affect the learning process in terms of model and training/fitting performance, while elaborating on the effects of tuning machine-learning ensembles with parameters such as depth and number of estimators [17].

Following the trend of neural models for price estimation, and going further with the hedonic versus ANN comparison for house price estimation under a similar definition of the problem, the authors in [18] dive deeper into both the hedonic and artificial neural network theories, alongside their histories and the influence of these theories on the corresponding models. The data used is rather scarce, with 200 instances and 7 attributes. Nevertheless, the models produced vary highly in prediction performance. This is rather obvious in terms of R-squared, and is clearly explained by the different architectures of the neural models used, in terms of number of neurons. However, the scarcity of data raises some questions regarding the top performance obtained, which is explained by the more complex model producing it. In addition, the number of test instances indicates less support for the results. Nonetheless, given the statistical nature of this work and its emphasis on the hedonic-neural comparison and attribute contributions, the analysis conducted gives an indication that these results are just
state where models are able to capture the trend in the data in order to produce a sane estimation. The data is then explored visually to understand its nature; this is also necessary for some preprocessing steps, for instance discovering outliers in order to remove them and examining the data distribution to better understand the results obtained from each model. Moreover, visualizing the data can provide valuable insights about the Jordanian real estate market, as visual aids can substitute, even partially, for the more defined metrics of the interactions between the attributes of the data [29], including price.

As mentioned earlier, the obtained dataset contains 34 attributes, listed in Table 2, and these attributes can be used for types of analysis other than estimating real estate prices. However, the data instances collected are for advertisements related specifically to apartments across Jordan over a short period (5 March 2020 to 28 April 2020).

3.3 Data Cleaning

Some essential steps must be done before any further processing takes place, to accommodate the nature of the tools to be used and to reduce the bias and errors that may appear in the final outcomes of the data, whether predictive models or simple statistics. These steps vary from dropping NULLs in the main anticipated features to dealing with inconsistent values, duplicates and obvious noise. They also include fixing data types and correcting and unifying the values of attributes, since the data was collected from online sources that are open to contribution from any seller or non-seller party who wants to make an apartment listing. Given the fair number of instances in the data, and assuming that the distribution and coverage of the data are preserved due to its size, a lot of messy and unclean data was dropped, which also contributes to the speed of the fitting and learning process in the predictive modelling part.

3.4 Data Preprocessing

Keeping in mind that the data should not be too clean and perfect, in order to leave room for the models to generalize over unseen data, several preprocessing steps were done, each serving some purpose. Overlapping with preprocessing, some transformations were applied to the features, each serving its own purpose.
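The cleaning steps of Section 3.3 (dropping NULLs in the main features, fixing data types, unifying values, and removing duplicates and obvious noise) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the attribute names come from Table 2, while the choice of "main" features, the duplicate key, the noise rule (non-positive price or area) and the helper names `to_int` and `clean_listings` are all hypothetical.

```python
# Assumed "main anticipated features"; the paper does not list them explicitly.
REQUIRED = ("Price", "Area", "No. Rooms", "City")


def to_int(value):
    """Parse values such as '85,000' or ' 120 ' into an int; None if not numeric."""
    try:
        return int(str(value).replace(",", "").strip())
    except ValueError:
        return None


def clean_listings(rows):
    """Return cleaned copies of the usable listings, in their original order."""
    cleaned, seen = [], set()
    for row in rows:
        # Drop rows with NULLs (None or empty string) in the main features.
        if any(row.get(key) in (None, "") for key in REQUIRED):
            continue
        # Fix data types: numeric attributes may arrive as free-text strings.
        price, area, rooms = (to_int(row[k]) for k in ("Price", "Area", "No. Rooms"))
        # Drop inconsistent values and obvious noise.
        if None in (price, area, rooms) or price <= 0 or area <= 0:
            continue
        # Unify categorical values, e.g. ' amman ' and 'Amman'.
        city = str(row["City"]).strip().title()
        # Drop duplicates, assuming identical main features mean the same ad.
        key = (price, area, rooms, city)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"Price": price, "Area": area, "No. Rooms": rooms, "City": city})
    return cleaned
```

Dropping rows rather than imputing values matches the choice described above, where messy instances are discarded on the assumption that the remaining data still preserves distribution and coverage.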
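The sensitivity of RBF-kernel SVMs to feature scaling, which the results discussion in this section points to, can be illustrated with a small sketch. It is not the experiment code of this work: it uses plain Python rather than a library, the two features (area and number of rooms) and their values are made up for illustration, and `gamma=0.5` is an arbitrary assumption.

```python
import math
import statistics


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel value exp(-gamma * ||a - b||^2); gamma is an assumed value."""
    return math.exp(-gamma * squared_distance(a, b))


# Hypothetical listings: [area in square meters, number of rooms].
listings = [[100, 1], [120, 1], [100, 4], [120, 4]]
scaled = standardize(listings)
```

On the raw values, a 20 square meter difference in area gives a squared distance of 400, while a 3-room difference gives only 9, so the area column dominates the kernel and the kernel value between such listings collapses toward zero. After standardization both differences contribute equally, which is consistent with the observation that SVMs behave very differently on scaled and unscaled data.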
shows minor improvement over the default model, but the R2 kept showing a higher deviation (i.e. overfitting), achieved as the MLP modeling techniques used random configuration lookup. These results can be attributed to the low number of instances searched and/or the efficiency of the default models to begin with on smaller datasets.

Finally, Linear and GB-Regression evaluated well using the R2 score function, with a small indication of a minor trend toward model overfitting, whilst the non-linear SVMs using an RBF kernel yielded two very different standard model score-based results. This indicates that SVMs perform better with scaled data: since they use distances, a drastic difference in distances will cloud the finer trends in the classifier (or the regression model used). Moreover, the best estimator configurations found show only minor improvement over the default model even when reporting with the R2 evaluation metric, which shows better results in the other models employed.

6. CONCLUSION

Not all of the previously mentioned aspects and benefits of working on such a problem have changed the state of the literature and the work on the Jordanian market. The scarcity of such works imposes a need for them and demonstrates an unutilized aspect of the economy that requires few resources to create some business and academic opportunities. In addition, such work helps in further understanding the Jordanian economy, potentially diagnosing some problems, recommending actions to reverse the decline of that economy, and identifying where correction should be concentrated. Going further to cover all the potential inductions and recommendations, amongst other possible extractions and solutions to this problem, starts with a methodological data science use case. The use case shown offers a comparative study of several predictive and descriptive methods over online-collected data on apartments for sale in Jordan and their prices, alongside the listed features. The work concentrates mainly on different data mining and machine learning techniques. Some anecdotal insights were listed that relate to the apartments market and to the techniques used in this work; while the experimental results might indicate some relations, consolidating the observed interrelations still needs a more thorough study with more comprehensive tests on bigger datasets. However, the work focuses on exploring some of the attribute interactions while preparing them for the predictive modelling, and on reshaping, transforming and filtering the data for better learning by addressing some basic data problems like distribution, outliers and attribute types incompatible with some models. Then, a selected subset of machine learning models that vary in complexity and behavior was fitted while tuning both the models and the data in a trial to enhance performance, judging based on several criteria (R2, MSE, MdAE, and MeAE) and trying to explain and evaluate the final results obtained. The next steps for this work will go deeper into different areas of the local and/or global markets, trying to provide insights about, enrich, and curate the local and/or global economic aspects, which should enhance the understanding and aid specialists in drawing insightful conclusions about the market’s different indicators using different data science techniques.

REFERENCES

1. M. Hoesli, C. Lizieri, and B. MacGregor. The inflation hedging characteristics of US and UK investments: a multi-factor error correction approach. The Journal of Real Estate Finance and Economics, 36(2), 183-206. 2008.
https://doi.org/10.1007/s11146-007-9062-6
2. S. J. A. P. Zavei, and M. M. Jusan. Exploring housing attributes selection based on Maslow's hierarchy of needs. Procedia-Social and Behavioral Sciences, 42, 311-319. 2012.
3. S. Hudson-Wilson, F. J. Fabozzi, and J. N. Gordon. Why real estate? The Journal of Portfolio Management, 29(5), 12-25. 2003.
4. J. Aizenman, and Y. Jinjarak. Real estate valuation, current account and credit growth patterns, before and after the 2008–9 crisis. Journal of International Money and Finance, 48, 249-270. 2014.
5. A. Anari, and J. Kolari. House Prices and Inflation. Real Estate Economics, 30(1), 67–84. 2002.
6. K. E. Case, E. L. Glaeser, and J. A. Parker. Real estate and the macroeconomy. Brookings Papers on Economic Activity, 2000(2), 119-162. 2000.
7. S. B. Billings. Hedonic amenity valuation and housing renovations. Real Estate Economics, 43(3), 652-682. 2015.
https://doi.org/10.1111/1540-6229.12093
8. N. Shinde, and K. Gawande. Survey on predicting property price. In 2018 International Conference on Automation and Computational Engineering (ICACE) (pp. 1-7). IEEE. October 2018.
9. B. Sopranzetti. Hedonic Regression Models. pp. 2119–2134.
https://doi.org/10.1007/978-1-4614-7750-1_78
10. D. Harrison Jr, and D. L. Rubinfeld. Hedonic housing prices and the demand for clean air. 1978.
11. J. Mu, F. Wu, and A. Zhang. Housing value forecasting based on machine learning methods. In Abstract and Applied Analysis (Vol. 2014). Hindawi. January 2014.
12. J. M. N. Tabales, J. M. Caridad, and F. J. R. Carmona. Artificial neural networks for predicting real estate price. Revista de Métodos Cuantitativos para la Economía y la Empresa, 15, 29-44. 2013.
13. E. D. Sontag. Feedback stabilization using two-hidden-layer nets. In 1991 American Control Conference (pp. 815-820). IEEE. June 1991.
14. Y. Li, and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in neural information processing systems (pp. 597-607). 2017.
15. R. Manjula, S. Jain, S. Srivastava, and P. R. Kher. Real estate value prediction using multivariate regression models. In IOP Conference Series: Materials Science and Engineering (Vol. 263, p. 042098). November 2017.
16. A. A. Neloy, H. S. Haque, and M. M. Ul Islam. Ensemble learning based rental apartment price prediction model by categorical features factoring. In Proceedings of the 2019 11th International Conference on Machine Learning and Computing (pp. 350-356). February 2019.
https://doi.org/10.1145/3318299.3318377
17. A. Singh, and R. Lakshmiganthan. Impact of different data types on classifier performance of random forest, naive bayes, and k-nearest neighbors algorithms. 2018.
18. V. Limsombunchai. House price prediction: hedonic price model vs. artificial neural network. In New Zealand agricultural and resource economics society conference (pp. 25-26). June 2004.
19. G. J. McKee, and D. Miljkovic. Data Aggregation and Information Loss (No. 381-2016-22080). 2007.
20. H. YILDIRIM. Property Value Assessment Using Artificial Neural Networks, Hedonic Regression and Nearest Neighbors Regression Methods. Selçuk Üniversitesi Mühendislik, Bilim ve Teknoloji Dergisi, 7(2), 387-404. 2019.
21. M. Miyamoto, and H. Tsubaki. Measuring technology and pricing differences in the digital still camera industry using improved hedonic price estimation. Behaviormetrika, 28(2), 111-152. 2001.
22. H. Xu, and F. Mueller. Work-in-progress: Making machine learning real-time predictable. IEEE Real-Time Systems Symposium (RTSS) (pp. 157-160). IEEE. December 2018.
23. N. H. Abroyan, and R. G. Hakobyan. A review of the usage of machine learning in real-time systems. Bulletin of the National Polytechnic University of Armenia. Information Technology, Electronics, Radio Engineering, (1), 46-54. 2016.
https://doi.org/10.22606/fsp.2017.12002
24. R. Bellazzi, C. Larizza, P. Magni, S. Montani, and G. De Nicolao. Intelligent analysis of clinical time series by combining structural filtering and temporal abstractions. In Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making (pp. 261-270). Springer, Berlin, Heidelberg. June 1999.
25. X. Chen, L. Wei, and J. Xu. House price prediction using LSTM. arXiv preprint arXiv:1709.08432. 2017.
26. E. Ahmed, and M. Moustafa. House price estimation from visual and textual features. arXiv preprint arXiv:1609.08399. 2016.
27. F. Zhang, B. Du, and L. Zhang. Scene classification via a gradient boosting random convolutional network framework. IEEE Transactions on Geoscience and Remote Sensing, 54(3), 1793-1802. 2015.
28. L. Breiman. Using iterated bagging to debias regressions. Machine Learning, 45(3), 261-277. 2001.
29. M. F. De Oliveira, and H. Levkowitz. From visual data exploration to visual data mining: A survey. IEEE transactions on visualization and computer graphics, 9(3), 378-394. 2003.
30. S. Lima, A. M. Gonçalves, and M. Costa. Time series forecasting using Holt-Winters exponential smoothing: An application to economic data. In AIP Conference Proceedings (Vol. 2186, No. 1, p. 090003). AIP Publishing LLC. December 2019.
https://doi.org/10.1063/1.5137999
31. C. Lee, O. Kwon, M. Kim, and D. Kwon. Early identification of emerging technologies: A machine learning approach using multiple patent indicators. Technological Forecasting and Social Change, 127, 291-303. 2018.
32. R. Berwick. An Idiot’s guide to Support vector machines (SVMs). Retrieved on October, 21, 2011. 2003.
33. A. Abanda, U. Mori, and J. A. Lozano. A review on distance based time series classification. Data Mining and Knowledge Discovery, 33(2), 378-412. 2019.
34. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, ... and J. Vanderplas. Scikit-learn: Machine learning in Python. The Journal of machine Learning Research, 12, 2825-2830. 2011.
35. R. A. Bottenberg, and J. H. Ward. Applied multiple linear regression (Vol. 63, No. 6). 6570th Personnel Research Laboratory, Aerospace Medical Division, Air Force Systems Command, Lackland Air Force Base. 1963.
36. E. W. Steyerberg, A. J. Vickers, N. R. Cook, T. Gerds, M. Gonen, N. Obuchowski, ... and M. W. Kattan. Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology (Cambridge, Mass.), 21(1), 128. 2010.
37. L. Breiman. Random Forests. Machine learning, 45(1), 5–32. 2001.
38. N. J. Nagelkerke. A note on a general definition of the coefficient of determination. Biometrika, 78(3), 691-692. 1991.
39. Z. Wang, and A. C. Bovik. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE signal processing magazine, 26(1), 98-117. 2009.
40. C. J. Willmott and K. Matsuura. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate research, 30(1), 79-82. 2005.
41. R. Tripathi, and D. P. Rai. Comparative Study of Software Cost Estimation Technique. International Journal of Advanced Research in Computer Science and Software Engineering, 6(1). 2016.
42. V. Kale, and F. Momin. Video Data Mining Framework for Surveillance Video. International Journal of Advanced Trends in Computer Science and Engineering, 2(3). 2013.
43. A. A. Mahule, and A. J. Agrawal. Hybrid Method for Improving Accuracy of Crop-Type Detection using Machine Learning. International Journal, 9(2). 2020.
https://doi.org/10.30534/ijatcse/2020/209922020
44. M. Akour, O. Al Qasem, H. Alsghaier, and K. Al-Radaideh. The effectiveness of using deep learning algorithms in predicting daily activities. International Journal, 8(5). 2019.