Dendrogram Clustering For 3D Data Analytics in Smart City


The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLII-4/W9, 2018

International Conference on Geomatics and Geospatial Technology (GGT 2018), 3–5 September 2018, Kuala Lumpur, Malaysia

DENDROGRAM CLUSTERING FOR 3D DATA ANALYTICS IN SMART CITY

S. Azri*, U. Ujang, A. Abdul Rahman

3D GIS Research Lab, Department of Geoinformation, Faculty of Built Environment and Survey, 81310 UTM Johor
Bahru, Malaysia - (suhaibah, mduznir, alias)@utm.my

KEY WORDS: Smart City, Dendrogram Clustering, 3D Spatial Database, 3D GIS, Data Analytics, Data Structure

ABSTRACT:

A smart city connects physical and social infrastructure with information technology to leverage the collective intelligence of the city. Cities will build huge data centres, holding data collected from sensors, social media, and legacy data sources. To be smart, cities need data analysis to identify infrastructure that needs improvement, to support city planning, and to perform predictive analysis for citizen safety and security. However, no matter how much a smart city focuses on up-to-date technology, data do not organize themselves in a database. Such tasks require a sophisticated database structure to produce informative output. Furthermore, the increasing number of smart cities and the data they generate contribute to the current phenomenon called big data. These large and complex data collections are difficult to process using regular database management tools or traditional data processing applications. Big data poses multiple challenges, including visualization, mining, analysis, capture, storage, search, and sharing. Efficient data analysis mechanisms are necessary to search and extract valuable patterns and knowledge from the big data of smart cities. In this paper, we present a technique for three-dimensional data analytics using a dendrogram clustering approach. Data are organized using this technique, and several outputs and analyses are produced to demonstrate the efficiency of the structure for three-dimensional data analytics in smart cities.

1. INTRODUCTION

In recent years, ‘Smart City’ and ‘Big Data’ have been among the most hyped buzzwords in business, government organizations and even academia. According to Mohanty et al. (2016), a smart city can be defined as a place where networks, services and infrastructure are made more efficient and sustainable with the aid of digital information and advanced technology. The explosive growth of Information and Communication Technology (ICT) has become a pace setter for transforming a city into a smart city. A smart city benefits its inhabitants with a greener, safer, faster and friendlier environment.

During the planning and operation of a smart city, data are collected from sensors, social media, and legacy data sources and analysed to identify infrastructure that needs to be improved or to predict disease outbreaks. The data produced by a smart city contribute to the current phenomenon called big data. Big data is a large and complex data collection that is difficult to process using a regular database management approach. It requires special data handling and a sophisticated data structure. A recent review by Hashem et al. (2016) of three smart cities, Helsinki, Stockholm and Copenhagen, showed that the amount of data generated using Internet of Things (IoT) technologies is huge. For example, 1030 databases were set up in 2013 by the Helsinki Region Infoshare Project to cover a wide range of urban phenomena, such as transport, economics, conditions, employment, and well-being.

Big data are worthless in a vacuum. Their potential value is unlocked only when they are leveraged to drive decision making. To enable such evidence-based decision making, organizations need efficient processes to turn high volumes of fast-moving and diverse data into meaningful insights. According to Labrinidis and Jagadish (2012), extracting insights from big data involves several processes, shown in Figure 1. There are five stages, categorized into two main processes: data management and analytics. In the data management phase, technologies are used to acquire and store the data. This is important for data preparation and retrieval for further analysis. Analytics, on the other hand, refers to the techniques used to analyse and acquire intelligence from big data. Several techniques exist for big data analytics. They can be used for both structured and unstructured data and include text analytics, audio analytics, video analytics, social media analytics and predictive analytics (Gandomi and Haider, 2015).

* Corresponding author

This contribution has been peer-reviewed.


https://doi.org/10.5194/isprs-archives-XLII-4-W9-247-2018 | © Authors 2018. CC BY 4.0 License. 247

Figure 1. Big data processes (Labrinidis and Jagadish, 2012)

The application of big data in a smart city has many benefits and challenges, including the availability of large computational and storage facilities to process the streams of data produced within a smart city environment. In this study, we aim to offer an alternative technique for 3D big data analytics in a smart city. This paper is motivated by the current availability of smart devices that generate large heterogeneous datasets every day and by the processing challenges that must be addressed to increase citizens’ quality of life and make their cities sustainable. The rest of this paper is organized as follows: the problems and motivation regarding the big data of smart cities and smart city challenges are discussed in the next section. In Section 3, the concept of the proposed method is explained together with its implementation. Section 4 presents the analysis and results of the experiment. Finally, the conclusions are presented in Section 5.

2. RESEARCH PROBLEMS AND MOTIVATIONS

In this section, the background to the big data problem is discussed and reviewed. Based on this problem, the proposed technique for 3D data analytics is discussed in the next section.

2.1 Existing Data Structure for Big Data Handling in the Database

This research focuses on big data processes with a smart city as the case study. With the evolution of computing technology, immense volumes of data can be managed without requiring supercomputers or high cost. Many tools and techniques are available for data management, such as Google BigTable, SimpleDB, Not Only SQL (NoSQL), Data Stream Management System (DSMS), MemcacheDB, and Voldemort (Chen et al., 2014). However, special tools and technologies are required to store, access, and analyse large amounts of data in near-real time. This is because big data differ from traditional data and cannot be stored on a single machine. Commonly used tools and techniques for big data handling are Hadoop, MapReduce, and BigTable. These tools have redefined data management because they process large amounts of data efficiently, cost effectively, and in a timely manner. However, these tools have some limitations, such as the limitation of queries on Hadoop. In a normal relational database, data is found and analysed using queries. Hadoop is not really a database: it stores data and retrieves data, but no queries such as SQL are involved. Hadoop is more of a data warehousing system.

While analysing big data using Hadoop has lived up to much of the hype, there are certain situations where running workloads on a database may be the better solution. For example, structured data in large volumes can be entered, stored, queried, and analysed in a simple and straightforward manner, so this type of data is best served by a database. In cases where organizations rely on time-sensitive data analysis, a traditional database is also the better fit, because it can offer a shorter time-to-insight for such datasets. With regard to the main aim and case study of this research, most of the data generated and used for smart city applications is structured, which leads to the option of using a database. However, storing data and extracting information in a database require a new, efficient approach. Therefore, this research introduces a new approach to big data handling for smart cities in 3D. The approach is intended to store and retrieve data efficiently in order to obtain informative output.

To process a huge volume of 3D data in a database, a specific 3D data structure is needed for data storage, retrieval and analytics. The current approach in databases is to use data constellation to arrange data in memory space. Two well-known 3D data structures are used in databases: the Octree and the 3D R-Tree (Guttman, 1984). The Octree extends the two-dimensional quad-tree to three dimensions. Each internal node of an Octree has exactly eight children. Octrees are most often used to partition a three-dimensional space by recursively subdividing it into eight octants. They are widely used in 3D graphics and 3D game applications. According to Keling et al. (2017), the Octree algorithm is simplified by exploiting the recursive nature of the Octree, but the drawback is that it requires long tree traversals. Besides that, voxelization requires more storage in memory space.
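The recursive subdivision into eight octants described for the Octree can be illustrated with a minimal sketch. It is not taken from the paper or from Keling et al. (2017); the class name `OctreeNode`, the `capacity` threshold and the sample points are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class OctreeNode:
    """Hypothetical point octree: a cube that splits into eight octants."""
    center: Point
    half: float                      # half of the cube edge length
    capacity: int = 4                # assumed split threshold
    points: List[Point] = field(default_factory=list)
    children: List["OctreeNode"] = field(default_factory=list)

    def _octant(self, p: Point) -> int:
        # One bit per axis: bit 0 = x, bit 1 = y, bit 2 = z.
        cx, cy, cz = self.center
        return int(p[0] >= cx) | (int(p[1] >= cy) << 1) | (int(p[2] >= cz) << 2)

    def _split(self) -> None:
        # Create the eight child cubes and push the stored points down.
        h = self.half / 2
        cx, cy, cz = self.center
        self.children = [
            OctreeNode((cx + (h if i & 1 else -h),
                        cy + (h if i & 2 else -h),
                        cz + (h if i & 4 else -h)), h, self.capacity)
            for i in range(8)
        ]
        for p in self.points:
            self.children[self._octant(p)].insert(p)
        self.points = []

    def insert(self, p: Point) -> None:
        if self.children:            # internal node: descend into one octant
            self.children[self._octant(p)].insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.half > 1e-9:
            self._split()

    def depth(self) -> int:
        return 1 + max((c.depth() for c in self.children), default=0)

    def count(self) -> int:
        return len(self.points) + sum(c.count() for c in self.children)


root = OctreeNode(center=(0.0, 0.0, 0.0), half=8.0, capacity=2)
for p in [(1, 1, 1), (-1, 2, 3), (5, 5, 5), (6, 6, 6), (-4, -4, -4)]:
    root.insert(p)
```

Each lookup has to walk from the root down through the octants, which gives a feel for the long tree traversals mentioned above when the tree becomes deep.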


The 3D R-Tree seems to be the most promising data structure for use in a database. In fact, it has been widely used in commercial database software such as Oracle. However, the transition of the R-Tree to 3D has increased the overlap among nodes and requires more storage. This leads to repetitive data and multipath queries among nodes (Azri et al., 2015). Although efforts have been made to improve the structure of the 3D R-Tree, applications using large amounts of data still face low data retrieval efficiency. A review by Azri et al. (2013) proposed merging multiple indexing methods to form a hybrid data structure. Looking forward to the smart city, there is a need to design a specific data handling approach or structure that enables computers and systems to analyse spatial big data for intelligent decision making. Thus, in this paper, we propose a 3D data structure to constellate the big data of a 3D smart city into a spatial database.

3. 3D DATA CONSTELLATION IN SMART CITY

In this section the proposed 3D data structure is introduced. The construction and development of the proposed structure is the most important part, because all of the information will be accessed through this structure prior to data analytics.

3.1 3D Data Constellation for Efficient Data Retrieval

Classified and Clustered Data Constellation (CCDC) is a 3D data structure that constellates 3D spatial data into a spatial database (Azri et al., 2016). The data structure is designed and developed based on two main filters, classification and clustering, and works on a hierarchical tree concept. The classification phase classifies each spatial object into a group based on its theme or type, for instance a zoning theme such as retail, housing or industrial. Then, the objects in each classification group are clustered using a clustering algorithm. In Azri et al. (2016), the clustering process is based on the k-means++ crisp clustering algorithm of Arthur and Vassilvitskii (2007). k-means++ introduced careful seeding to improve the k-means algorithm. With this approach, the initial seeds are defined and the remaining objects are then clustered based on the nearest distance to the initial seeds. This algorithm has been shown to improve accuracy with respect to the original algorithm.

The results of the classification and clustering phases are then mapped into a hierarchical tree structure. Data are retrieved by traversing the tree structure from a parent node to its child. The CCDC data structure offers a very small overlap and coverage area among nodes, which is one of the requirements for efficient data retrieval from the database. However, CCDC still cannot achieve zero overlap among nodes, and we believe that the construction of the CCDC structure is time consuming due to its two filter groups, classification and clustering.

3.2 Dendrogram Clustering

Dendrogram clustering is also known as hierarchical clustering analysis (HCA). It falls into two categories: top-down and bottom-up. The bottom-up algorithm treats each data point or object as a single cluster and then merges it with its closest pair until all clusters are merged into a single cluster. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the objects, the leaves being the clusters with only one sample. The other category is the top-down algorithm, also known as divisive clustering. This approach differs from bottom-up in that all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In this study, dendrogram clustering is constructed using the bottom-up algorithm. The algorithm of dendrogram clustering is given below, and the structure of dendrogram clustering under different distance metrics can be seen in Figure 2.

Algorithm 3.1 Dendrogram Clustering
Input: X Data Points
Output: X Clusters
1. create a single cluster for each of the X points
2. select a distance metric that measures the distance between two clusters
3. combine the two clusters with the smallest distance (nearest neighbour) into one
4. repeat step 3 until the root of the tree is reached (one cluster which contains all data)
5. list all clusters

One of the advantages of dendrogram clustering is that it does not require the number of clusters to be specified. Besides that, the algorithm is relatively insensitive to the choice of distance metric, whereas the choice of metric can be critical for other clustering algorithms. These advantages of hierarchical clustering come at the cost of lower efficiency, as it has a time complexity of O(n³), unlike the linear complexity of k-means. Dendrogram clustering is based on a distance matrix that has to be kept in memory. Because the distance matrix is symmetric, the memory it needs scales as N². Besides that, average link clustering scales as N³ in time, because for each cluster agglomeration the algorithm searches through the order of N² cluster dissimilarities to determine the pair of most similar clusters to merge, and the algorithm works through N−1 iterations. However, to overcome this issue the construction can be sped up with cluster seeding, as presented in Embrechts et al. (2013). In their study, k-means is applied as a cluster seeding step to speed up the performance.
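The bottom-up steps of Algorithm 3.1 can be sketched as follows. This is a deliberately naive illustration rather than the authors' implementation: it assumes Euclidean distance with single (nearest-neighbour) linkage, recomputes distances at every step instead of keeping a distance matrix, and uses the hypothetical names `single_linkage` and `dendrogram` and made-up sample points.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b, points):
    """Smallest pairwise distance between two clusters (nearest neighbour)."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def dendrogram(points):
    """Return the merge history as [(cluster_a, cluster_b, distance), ...]."""
    # Step 1: every point starts as its own cluster (tuples of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    # Step 4: repeat until a single root cluster remains (N - 1 merges).
    while len(clusters) > 1:
        # Steps 2-3: pick the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], points))
        d = single_linkage(a, b, points)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    # Step 5: the merge history lists every cluster that was formed.
    return merges


# Illustrative 3D points only; they are not data from the paper.
pts = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0), (50, 0, 0)]
merges = dendrogram(pts)
```

The loop runs N−1 times and scans every remaining cluster pair on each pass, which is where the cubic growth in running time discussed above comes from.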

Figure 2. Different dendrograms are produced with different distance metrics (Fisher, 1936).

3.3 Clustering the Smart City using Dendrogram

Buildings in a smart city blend two or more residential, commercial, cultural, institutional, and/or industrial uses. This type of development is also known as mixed-use development. Mixed use is one of the ten principles of Smart Growth, a planning strategy that seeks to foster community design and development that serves the economy, community, public health, and the environment. Mixed-use zoning allows for the horizontal and vertical combination of land uses in a given area. Commercial, residential, and in some instances even light industrial uses are fitted together to help create built environments where residents can live, work, and play. Figure 3 shows vertical and horizontal mixed-use development in an urban area.

Figure 3. Vertical and Horizontal mixed-use development

Managing this type of development theme in a smart city requires a more specific approach than usual. In the CCDC data structure, classification is done at an early stage, prior to the clustering process. Building units with the same business, such as retail, office or accommodation, are grouped and bounded in a parallelepiped. In this study, however, units with different themes appear as different categories and are only merged before the final grouping. In other words, objects are clustered within their own categories, and these categories are then combined to produce the root of the tree. The tree structure is created during the clustering process, unlike the CCDC data structure, in which the classification and clustering processes must be completed before the tree is created.

All information and parameters related to the buildings in a smart city are stored in the database using dendrogram clustering. The search operation starts from the parent node of the tree structure, descends into the group cluster and retrieves the leaf node. Each record in the structure is identified by a key identifier, which consists of the unique IDs of the group cluster and the building. Algorithm 3.2 presents the dendrogram clustering for processing different types of building in a smart city, and Figure 4 illustrates the tree structure for a few sample 3D buildings.
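The idea of clustering units within their own theme, joining the themes under one root, and then searching from the root through the group cluster to a leaf can be sketched as below. This is a simplified illustration under stated assumptions: the nested merges inside each theme are flattened into a single group node for brevity, and the `Node` class, the `theme:building_id` key layout and the sample units are hypothetical, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    record: Optional[dict] = None      # only leaf nodes carry a record


def build_tree(units: List[dict]) -> Node:
    """Group units by theme, then join all theme clusters under one root."""
    groups: Dict[str, Node] = {}
    for u in units:
        group = groups.setdefault(u["theme"], Node(label=u["theme"]))
        # Assumed key layout: group-cluster ID followed by building ID.
        key = f'{u["theme"]}:{u["building_id"]}'
        group.children.append(Node(label=key, record=u))
    return Node(label="root", children=list(groups.values()))


def find(root: Node, theme: str, building_id: str) -> Optional[dict]:
    """Search from the root into the group cluster, then to the leaf."""
    for group in root.children:
        if group.label == theme:
            for leaf in group.children:
                if leaf.label == f"{theme}:{building_id}":
                    return leaf.record
    return None


# Made-up sample units for illustration only.
units = [
    {"theme": "Office", "building_id": "B1", "floor": 3},
    {"theme": "Retail", "building_id": "B1", "floor": 1},
    {"theme": "Retail", "building_id": "B2", "floor": 1},
    {"theme": "Residential", "building_id": "B2", "floor": 5},
]
tree = build_tree(units)
hit = find(tree, "Retail", "B2")
```

Because the key names the group cluster first, a lookup only has to visit one theme branch instead of scanning every unit.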


Algorithm 3.2 Dendrogram Clustering for 3D Building in Smart City
Input: 3D Building with different type of urban mixed-use theme i.e. Retail, Office or Residential
Output: Dendrogram
1. create a single cluster for each of the n points in theme x
2. select a distance metric that measures the distance between two clusters
3. combine the two clusters with the smallest distance (nearest neighbour) into one
4. repeat steps 1 to 3 for all data until each theme forms one cluster
5. combine all themes to form the root of the tree
6. display the tree

Figure 4. Dendrogram tree for 3D buildings. [Leaf order in the figure: n1, n2, n4, n5, n7, n3, n6; legend: Office, Residential, Retail.]

Figure 4 shows several buildings, each containing a few units with different business themes: Office, Residential and Retail. By using dendrogram clustering, these units are organized in a tree structure and grouped with units of the same business theme.

4. EXPERIMENT AND ANALYSIS

Several experiments and analyses were performed to verify the proposed 3D data structure. The experiments retrieve records from the database and compare the performance of dendrogram clustering with the existing structure.

4.1 Retrieving Records

Records are retrieved from the database using Structured Query Language (SQL). In the statement, the structure identifies the records by their unique identification, derived from the cluster ID, and the record can then be retrieved directly from the cluster.

The result in Figure 5 shows 11 roof records that belong to a single building in a smart city. These records report the total incident radiation and absorption related to the urban heat island phenomenon in that city.

4.2 Page Analysis

An experiment was conducted to evaluate the average number of pages accessed, or Input/Output (I/O), incurred by the proposed 3D data structure. The test evaluates the effect of retrieving records from a large number of tuples. To test the performance of the query operation using the proposed 3D data structure, 1,000,000 building records are used to evaluate the average number of page accesses. Figure 6 illustrates the I/O cost for finding k records with and without the 3D data structure. As the figure shows, the proposed data structure requires minimal page access. Without the structure, the number of accessed pages is high due to multipath queries and repetitive data entries.

4.3 Response Time Analysis

The main objective of this study is to improve the efficiency of data retrieval from the database. Thus, a test was performed to analyse the query response time of the proposed structure. In the tests, k objects are retrieved and the retrieval time is measured and recorded in milliseconds (ms). The locations of 1,000,000 buildings are populated in the database, and queries are performed to retrieve different numbers k of objects from the database. The results are plotted in the graph presented in Figure 7. The proposed 3D data structure outperforms non-constellated data in response time, being 50% to 60% faster.
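As a rough illustration of the key-based retrieval evaluated in this section, the sketch below stores roof records under a composite key and fetches all roofs of one building with a single SQL statement. The paper does not give its schema or database engine, so the SQLite in-memory database, the table `roof_record`, its columns and the synthetic values are all assumptions chosen only to make the example runnable.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE roof_record (
        cluster_id  INTEGER NOT NULL,   -- hypothetical group-cluster ID
        building_id INTEGER NOT NULL,   -- hypothetical building ID
        roof_id     INTEGER NOT NULL,
        radiation   REAL,               -- synthetic placeholder value
        PRIMARY KEY (cluster_id, building_id, roof_id)
    )
""")

# Synthetic rows: 3 clusters x 4 buildings x 11 roof surfaces each.
rows = [(c, b, r, 100.0 + r)
        for c in range(3) for b in range(4) for r in range(11)]
conn.executemany("INSERT INTO roof_record VALUES (?, ?, ?, ?)", rows)


def roofs_of(cluster_id, building_id):
    """Fetch every roof record of one building via the composite key."""
    cur = conn.execute(
        "SELECT roof_id, radiation FROM roof_record "
        "WHERE cluster_id = ? AND building_id = ? ORDER BY roof_id",
        (cluster_id, building_id))
    return cur.fetchall()


records = roofs_of(1, 2)
```

Since the cluster ID leads the primary key, the engine can locate the building's rows through the key index rather than scanning the whole table, which is the kind of saving in page accesses that Sections 4.2 and 4.3 measure.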


Figure 5. Retrieved records from the database.

Figure 6. I/O vs k number of objects.

Figure 7. Response time analysis for k number of objects retrieval.

5. CONCLUSIONS


This paper proposed 3D dendrogram clustering to produce a hierarchical tree structure for data retrieval and analytics. The structure is constructed based on dendrogram clustering. The clustering groups objects with the same features and then joins the groups under the root of the tree. The datasets are retrieved based on specific IDs for each object. The k-means algorithm is used as a cluster seeding step to speed up the tree creation, because dendrogram clustering on its own has lower efficiency, with a time complexity of O(n³). Based on the tests and analyses of the proposed structure, the results and findings are as follows. The first test examines the ability of the structure to retrieve records from the database. The test shows that the structure is able to retrieve information on radiation and solar absorption for the roof of one building. The last two experiments were performed to measure the efficiency of the structure. The page analysis and response time analysis show that the structure supports better data analysis with fewer page accesses. We strongly believe that the data structure will benefit planners and spatial professionals and help them perform better data analytics for smarter cities.

REFERENCES

ARTHUR, D. & VASSILVITSKII, S. 2007. k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1027-1035.

AZRI, S., ANTON, F., UJANG, U., MIOC, D. & RAHMAN, A. A. 2015. Crisp Clustering Algorithm for 3D Geospatial Vector Data Quantization. In: BREUNIG, M., AL-DOORI, M., BUTWILOWSKI, E., KUPER, P. V., BENNER, J. & HAEFELE, K. H. (eds.) 3D Geoinformation Science: The Selected Papers of the 3D GeoInfo 2014. Cham: Springer International Publishing.

AZRI, S., UJANG, U., ANTON, F., MIOC, D. & RAHMAN, A. A. 2013. Review of Spatial Indexing Techniques for Large Urban Data Management.

AZRI, S., UJANG, U., CASTRO, F. A., RAHMAN, A. A. & MIOC, D. 2016. Classified and clustered data constellation: An efficient approach of 3D urban data management. ISPRS Journal of Photogrammetry and Remote Sensing, 113, 30-42.

CHEN, M., MAO, S. & LIU, Y. 2014. Big Data: A Survey. Mobile Networks and Applications, 19, 171-209.

EMBRECHTS, M. J., GATTI, C. J., LINTON, J. & ROYSAM, B. 2013. Hierarchical clustering for large data sets. Advances in Intelligent Signal Processing and Data Mining. Springer.

FISHER, R. A. 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.

GANDOMI, A. & HAIDER, M. 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35, 137-144.

GUTTMAN, A. 1984. R-trees: a dynamic index structure for spatial searching. SIGMOD Rec., 14, 47-57.

HASHEM, I. A. T., CHANG, V., ANUAR, N. B., ADEWOLE, K., YAQOOB, I., GANI, A., AHMED, E. & CHIROMA, H. 2016. The role of big data in smart city. International Journal of Information Management, 36, 748-758.

KELING, N., MOHAMAD YUSOFF, I., LATEH, H. & UJANG, U. 2017. Highly Efficient Computer Oriented Octree Data Structure and Neighbours Search in 3D GIS. In: ABDUL-RAHMAN, A. (ed.) Advances in 3D Geoinformation. Cham: Springer International Publishing.

LABRINIDIS, A. & JAGADISH, H. V. 2012. Challenges and opportunities with big data. Proceedings of the VLDB Endowment, 5, 2032-2033.

MOHANTY, S. P., CHOPPALI, U. & KOUGIANOS, E. 2016. Everything you wanted to know about smart cities: The Internet of things is the backbone. IEEE Consumer Electronics Magazine, 5, 60-70.
