Malicious Application Detection


Vol 13, Issue 06, June/2022

ISSN NO:0377-9254

Malicious Application Detection Using Machine Learning


1 G. Chandana, 2 B. Anusha, 3 S. Sri Devi, 4 Mrs. V. Sri Suma
B.Tech Student, Assistant Professor
DEPARTMENT OF INFORMATION TECHNOLOGY
CMR TECHNICAL CAMPUS, Hyderabad
ABSTRACT

Android plays a vital role in today's market. A recent survey found that nearly 84.4% of people use Android, which has rapidly become popular for both personal and business purposes. Android applications are widely adopted because of their features and the benefits they offer to users. Android places significant responsibility on application developers to design applications with an understanding of the security risks. Malware protection is a major security concern, because Android has been a major target of malicious applications. In Android applications, permission control is one of the main security mechanisms. This project explores the permission-induced risk in applications and the fundamentals of the Android security architecture, and it also focuses on security ranking algorithms that are unique to specific applications. We therefore propose a system that detects malware based on permission analysis and provides steps to keep applications from accessing unwanted permissions (that is, it limits permissions). It is also designed to reduce the probability of attacks that exploit vulnerabilities.

I. INTRODUCTION
1.1 OBJECTIVE OF THE PROJECT

In recent years, smartphone usage has increased steadily, and so has the number of Android application users. Because of this growth, some intruders create malicious Android applications as tools to steal sensitive data and to commit identity theft and fraud against mobile banking and mobile wallets. Many malicious application detection tools and software packages are available, but effective and efficient detection tools are still needed to handle the new, complex malicious apps created by intruders or hackers. In this paper we use machine learning approaches to detect malicious Android applications. First we gather a dataset of past malicious apps as a training set. Then, using the support vector machine and decision tree algorithms, we make a comparison between the training dataset and the trained dataset, which lets us predict malicious Android apps.

With an estimated market share of 70% to 80%, Android has become the most popular operating system for smartphones and tablets. Unsurprisingly, cyber-criminals have followed, expanding their malicious activities to mobile platforms. Mobile threat researchers have recognized an alarming increase in Android malware from 2012 to 2013 and estimate that the number of detected malicious applications is in the range of 120,000 to 718,000. To efficiently detect malware among applications available from official and third-party sources, many efforts in the past decade have studied the nature of smartphone platforms and their applications. The Android platform employs a permission system to restrict application privileges and so secure the sensitive resources of its users. The developer is responsible for determining which permissions an application requires, but an application needs the user's approval of the requested permissions to access private or otherwise restricted resources. Although the permission system can protect users from applications with invasive behaviors, its effectiveness depends heavily on the user's comprehension of the consequences of granting a permission. According to recent studies, many users do not understand what each permission means and grant them blindly, potentially allowing an application to access sensitive or private information. Another flaw is that the user cannot decide to grant single permissions while denying others. Many users will still confirm the installation even when an app requests a suspicious permission among many seemingly legitimate ones. The Android security model is based mainly on permissions.

1.2 PURPOSE OF THE PROJECT

The ultimate aim of the project is to improve the use of permissions for detecting malicious Android mobile applications using machine learning algorithms. As a result, the implementation of these permissions is of interest to us. An Android permission is a restriction limiting access to a part of the code or to data on the device. The limitation is imposed to protect critical data and code that could be misused to distort or damage a user's experience. Permissions are also used to allow or

www.jespublication.com Page No:1198


restrict application access to restricted APIs and resources. For example, the Android 'INTERNET' permission is required by apps to perform network communications, so opening a network connection is restricted by the 'INTERNET' permission. Likewise, an application must have the 'READ_CONTACTS' permission in order to read entries in a user's phonebook. To request a permission, the developer declares a "<uses-permission>" element in the Manifest file. The "android:name" field specifies the name of the permission.

1.3 PROJECT FEATURES

We propose a new method to detect malicious Android applications through machine learning techniques by analyzing the permissions extracted from the application itself. The features used for classification are the presence of the uses-permission and uses-feature tags in the manifest, as well as the number of permissions of each application. That is, the features are the individually requested permissions and the «uses-feature» tag. We examine the possibility of detecting malicious Android applications based on permissions and 20 features from Android application packages.

II. LITERATURE SURVEY

We studied the techniques that have been proposed to identify Android malware. In their work, Anshul et al. presented an idea to detect Android malware by network traffic analysis. Their approach identifies Android malware that is operated by a remote server. Such malware either accepts orders from the server or leaks sensitive data to it. First, they analyzed the network traffic of Android malware and then the traffic of normal applications, and they discovered the characteristics that distinguish malware traffic from non-malware traffic. In the second phase, they built a classifier using these network traffic features to detect the malware.

In another work, Anshul et al. proposed a technique called the PermPair method. They considered every pair of permissions as a possible input feature and decided, for each pair, whether that combination is vulnerable. Their method uses datasets from 3 different sources called Genome, Drebin and Koodous. Their approach had 3 phases. In the first phase, they constructed 4 different graphs by extracting permission pairs from each application. Of the 4 graphs, 3 are for malware and 1 is for benign applications. In the second phase, they merged the 3 malicious graphs into a single malicious graph. At the end of this phase, they had two graphs, one for malicious and one for benign applications. In the third and final phase of their method, they developed a model that calculates two scores, called the normal score and the malicious score, for every application and decides whether a particular application is malware or not.

The most commonly used properties in static and dynamic Android malware detection are permissions and network traffic features, respectively. Static permissions cannot identify sophisticated malware that is capable of update attacks, and dynamic network traffic cannot detect malware samples that have no network connection. Therefore, a hybrid model integrating both of these properties has been proposed. The authors extracted both permissions and network traffic features and combined them into a single vector. Using the K-medoids method, they partitioned the vectors into K clusters, and they used the K-Nearest Neighbours method to classify whether a particular application is malicious or not. K is chosen to be odd so that, among the K nearest neighbours, the counts of malicious and benign neighbours cannot tie.

In another work, Zhenlong Yuan et al. proposed a technique to associate static features with dynamic features and then classify the given Android applications as malicious or safe. They obtained the features used as input to their model in three stages:
• Static Phase
• Sensitive APIs
• Dynamic Phase
The static phase includes the permissions that are obtained by unzipping the apk file and parsing the xml files it contains. Another file, classes.dex, accounts for the sensitive API calls.

III. SYSTEM ANALYSIS
3.1 PROBLEM STATEMENT

Smartphones have become the most used device in day-to-day life. They offer users a variety of applications that are enriched with powerful features. It is almost impossible for anyone these days to spend a day without their smartphone. Of all smartphones, Android smartphones are the most widely used. This increasing popularity of Android smartphones has also attracted malicious attackers. Malicious activity can be carried out either by a single application or by a group of applications working together. The objective of this project is to create a model that can detect such malicious applications.

3.2 EXISTING SYSTEM

Numerous malware detection tools have been developed, but some tools may not be able to detect newly created malware applications and unknown malware applications infected by various Trojans, worms and spyware. Detecting a large number of
malicious applications among millions of Android applications is still a challenging task using traditional methods. The existing approach is a non-machine-learning way of detecting malicious applications based on their characteristics, properties and behavior.

DISADVANTAGES OF THE EXISTING SYSTEM

• Newly updated or newly created malicious applications are hard to identify.
• Non-machine-learning approaches are not reliable or efficient.
• Existing approaches cover only 30 permissions out of 300 app permissions. Because of this limited permission coverage, different types of attacks can occur.

3.3 PROPOSED SYSTEM

In this paper, we implement Significant Permission Identification (SIGPID). The goal of SIGPID is to make the use of app permissions more effective and efficient. The SIGPID system improves the accuracy and efficiency of malware application detection. With the help of machine learning algorithms such as SVM and Decision Tree, we make a comparison between training and trained datasets. The support vector machine acts as a classifier that separates malicious applications from benign apps.

ADVANTAGES OF THE PROPOSED SYSTEM

• Improves the percentage of malicious applications detected.
• Machine learning is more efficient than non-machine-learning algorithms.
• Able to detect new malicious Android applications.
• We only need to consider 22 out of 135 permissions to improve the runtime performance by 85.6%.

IV. ARCHITECTURE
4.1 PROJECT ARCHITECTURE

Web applications are by nature distributed applications, meaning that they are programs that run on more than one computer and communicate through a network or server. Specifically, web applications are accessed with a web browser and are popular because of the ease of using the browser as a user client. For the enterprise, not having to distribute and install software on potentially thousands of client computers is a key reason for their popularity. Web applications are used for web mail, online retail sales, discussion boards, weblogs, online banking, and more. One web application can be accessed and used by millions of people.

Figure no. 3.1 Project Architecture

4.2 MODULE DESCRIPTION

1. Permission
This module characterizes existing Android malware from various aspects, including the permissions requested. It identifies individually the permissions that are widely requested in both malicious and benign apps.

2. Combination of Permission
This method, based on network classification, helps to define irregular permission combinations requested by abnormal applications. It also considers the nature, sources and implications of sensitive data on Android devices in enterprise settings.

3. Feature Extraction
A new method to detect malicious Android applications through machine learning techniques by analyzing the permissions extracted from the application itself.

4. Classification
By combining results from various classifiers, this module can act as a quick filter to identify more suspicious applications. We propose a framework that intends to develop a machine learning-based malware detection system on Android to
detect malware applications and to enhance

V. IMPLEMENTATION METHODOLOGY

To distinguish malicious applications from benign applications, a decent dataset is required. The dataset can be downloaded from the Drebin dataset. We conduct extensive experiments, including 516 benign applications and 528 malicious applications. In this section the methodology followed is discussed in detail.

Dataset

Any machine learning model needs a dataset over which it can be trained, so data collection is one of the most important steps. We have worked on 3 different datasets and compared their results against each other:
• The 1st dataset is collected from Google and comprises 70 different applications, each having a set of 17 permissions.
• The 2nd is downloaded from Kaggle and lists 184 different permissions (features) for each of 29999 apps.
• The 3rd is downloaded from Kaggle. It has 138047 records, with each record consisting of 57 columns (permissions).

For training and testing purposes, we split the dataset into two parts. We used 80% of the dataset for training the machine learning model, and the remaining 20% was used for testing every machine learning model and calculating the performance of each model with metrics such as accuracy, F1-score, precision and recall.

Feature Engineering

The feature set used for training has a big impact on machine learning. Several studies have found that certain features are helpful in training machine learning-based malware classifiers. That is the reason we have used feature engineering in our implementation. In supervised learning, feature engineering is the process of selecting, manipulating, and transforming raw data into features. We use feature engineering for the following reasons:
• To perform imputation
• Handling outliers
• To achieve normalization
• Null value handling
• To remove invalid data
• To achieve scaling

Data Preprocessing

The process of converting raw data into a comprehensible format is known as data preprocessing. We cannot work with raw data, so this is a key stage in machine learning. Before using machine learning or data mining methods, make sure the data is of good quality. The purpose of data preprocessing is to ensure that the data is of good quality. The following criteria can be used to assess quality: accuracy, completeness, consistency, trustworthiness and understandability. Data preprocessing involves the following steps:
• Data cleaning: Correcting or deleting incorrect, corrupted, improperly formatted, duplicate, or incomplete data. We have removed the columns that have missing values. We have also removed any undesirable observations from our datasets, such as duplicates or irrelevant observations.
• Data transformation: Changing data from one format to another. String columns and decimal columns such as price are converted to binary.
• Data integration: Combining data from a variety of sources, including databases (both relational and non-relational), data cubes, files, and so on.
• Data reduction: It is possible to reduce the number of records, characteristics, or dimensions. It is carried out during feature selection using a correlation matrix.

We have worked on the last dataset in detail, and for the remaining datasets we have tested accuracy and compared the results. We made sure that the dataset contained enough examples of both malware and benign applications. There are 41323 examples of benign applications and 96724 examples of malicious applications, so we consider the dataset not to be severely skewed. Each permission column is a binary column, indicating whether a permission is requested or not.

A linear SVM classifier is fit to the data you supply and provides a hyperplane that separates your data into different classes. Once you have obtained the hyperplane, you may input some attributes to your classifier to check what the predicted class is. Support vectors are the points which can be considered edge cases. They lie very near the hyperplane. The two support vectors
corresponding to the two classes, benign and malicious respectively, are equidistant from the hyperplane with the maximum possible margin. In classification using support vector machines, the model with a polynomial kernel performs poorly when the positive examples and negative examples in the data overlap. One way to deal with this overlapping data is to use a support vector machine with a radial kernel, generally known as the Radial Basis Function (RBF). When the RBF kernel is used in an SVM, it behaves like a weighted nearest neighbor model. In other words, the nearest observations have a lot of influence on how we classify a new example. The value of the radial kernel function decreases as the distance between the two observations grows. The radial kernel function of two data observations a, b is as below.

$$\mathrm{RBF}(a, b) = e^{-\gamma (a - b)^{2}}$$

Decision Trees are a non-parametric supervised learning approach which can be used for classification problems as well as regression problems. The sole objective is to construct a machine learning model that predicts the class of a given instance by learning basic decision rules from feature values. First, we calculate the entropy, which is also known as a measure of uncertainty. Then, for each attribute A, we calculate the information gain. The attribute with the maximum information gain is selected as the root node, and this process continues. The formulas for entropy and information gain are given below, where p is the fraction of instances in S that belong to one of the two classes and S_v is the subset of S for which attribute A takes the value v (the information gain is stated here in its standard form).

$$E(S) = -\left[p \log(p) + (1 - p) \log(1 - p)\right]$$

$$IG(S, A) = E(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} E(S_v)$$

A Naive Bayes classifier is a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm. It is a group of algorithms sharing one common principle, namely that every pair of features used in classification is assumed to be independent of each other, given the class. For two events Ea, Eb, Bayes' theorem states that:

$$P(E_a \mid E_b) = \frac{P(E_b \mid E_a)\, P(E_a)}{P(E_b)}$$

Using this principle for the classification task, we can say that:

$$\Pr(\text{class} \mid \text{attributes}) = \frac{\Pr(\text{attributes} \mid \text{class})\, \Pr(\text{class})}{\Pr(\text{attributes})}$$

As we have two classes, malicious and safe, we calculate the probabilities for the application to be in the malicious class and to be in the safe class. The class with the higher probability is the class to which the application belongs.

VI. SCREENSHOTS

After building the SVM model, we trained and tested the data using the built SVM model and obtained an accuracy of 87.5%.

Screenshot no 6.1 Building SVM Model.

After building the Bayesian model, we trained and tested the data using the built Bayesian model and obtained an accuracy of 54.1%.

Screenshot no 6.2 Building Naïve Bayes Classification Model.

Screenshot no 6.3 SVM vs Decision Tree vs Naïve Bayes
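
The Naive Bayes decision rule described earlier (compute the posterior for the malicious class and for the safe class, then pick the larger) can be sketched in plain Python for binary permission vectors. This is a minimal illustration under stated assumptions, not the implementation used in this project: the function names `train_nb` and `predict_nb`, the toy data, the column names and the Laplace smoothing constant `alpha` are all hypothetical.

```python
import math

def train_nb(rows, labels, alpha=1.0):
    """Estimate class priors and per-feature Bernoulli likelihoods.

    rows   : list of equal-length lists of 0/1 permission flags
    labels : list of class names, one per row
    alpha  : Laplace smoothing constant (an assumption, not from the paper)
    """
    n_features = len(rows[0])
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(members) / len(rows)
        # P(feature_j = 1 | class), smoothed so no probability is exactly 0 or 1
        likelihood = [
            (sum(r[j] for r in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (prior, likelihood)
    return model

def predict_nb(model, row):
    """Return the class with the highest log posterior for one app."""
    scores = {}
    for cls, (prior, likelihood) in model.items():
        log_post = math.log(prior)
        for x, p in zip(row, likelihood):
            log_post += math.log(p if x else 1.0 - p)
        scores[cls] = log_post
    # Pr(attributes) is identical for every class, so it can be ignored
    return max(scores, key=scores.get)

# Toy data (hypothetical): columns = [SEND_SMS, INTERNET, READ_CONTACTS]
X = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
y = ["malicious", "malicious", "safe", "safe"]
model = train_nb(X, y)
print(predict_nb(model, [1, 1, 1]))
print(predict_nb(model, [0, 1, 0]))
```

Working with log probabilities avoids numerical underflow when there are many permission columns, which would matter for a dataset with dozens of features per record.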

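
To make the decision tree's root-node selection concrete, the sketch below computes the entropy of a set of labels and the information gain of each binary permission column, then picks the column with the largest gain. It is an illustrative example only: the helper names `entropy` and `information_gain`, the toy data, and the use of base-2 logarithms are assumptions rather than details taken from this project.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, j):
    """Entropy reduction obtained by splitting on binary feature j."""
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for r, y in zip(rows, labels) if r[j] == value]
        if subset:
            # Weight each branch by the share of samples that fall into it
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy data (hypothetical): two permission columns, four apps
X = [[1, 1], [1, 1], [0, 1], [0, 0]]
y = ["malicious", "malicious", "safe", "safe"]

gains = [information_gain(X, y, j) for j in range(2)]
root = max(range(2), key=lambda j: gains[j])
print(gains, root)
```

In this toy data the first column separates the two classes perfectly, so it has the larger gain and would be chosen as the root. The same procedure is then repeated on each branch.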


Screenshot no. 6.4 Learning curve

Screenshot no. 6.5 Performance of different models on different datasets

Screenshot no. 6.6 Dataset 1

Screenshot no. 6.7 Dataset 2

VII. CONCLUSION

In conclusion, our project can identify, with moderate success, applications that pose a potential threat based on the permissions that they request. Our application can scan applications on a phone at any time, and it alerts the user to do so when an installation or app update occurs. We believe that this is an important step in preventing Android malware, because this application brings to the user's attention all the possibly dangerous applications, allowing them to scrutinize more carefully the applications that they trust. This in turn will help users become more security-conscious overall. Even so, this is only a first step. Future work for this project will include increasing the accuracy of the classifier, migrating the Python portions of this project to Java, and integrating more advanced methods of detecting malicious behavior, such as looking at API calls (this follows a "defense in depth" strategy). One benefit of the decision tree classifier is its speed. It can serve as a preliminary screen for more advanced but slower methods, narrowing down the applications they will inspect. Lastly, taking into account application categories, such as being a game or an email client, would also help detect suspicious permissions and behaviors.

However, a set of Android applications operating together can also carry out a malicious activity. We call them colluding apps. In this case, the malicious activity is carried out by more than one application. Each application participating in the collusion does a small part of the malicious action, and the applications communicate with each other through covert channels. When a malicious activity cannot be performed by a single application, a group of applications coordinating with each other may still be able to perform it. This phenomenon is called Application Collusion. It is an emerging threat, because most Android malware detectors scan applications individually when determining whether each one is malware or not. Since the malicious activity here is carried out by a group of applications, those traditional detectors cannot detect it. We therefore need a model to detect these colluding applications. So far, very little research has been done on this, and datasets are scarce. We are trying to create or obtain a few applications that can perform collusion, so that we can do some research on them, which may eventually help in creating a model that can detect colluding applications. First, we have to obtain a template for collusion. Then, we have to try to split a malicious task into various steps and make each application perform one of the steps. By doing this, we can achieve collusion.
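
One way to see why scanning applications individually can miss collusion is to compare a per-app permission check with a check over groups of apps. The sketch below is a hypothetical illustration, not part of this project: the rule (a sensitive-data permission combined with a network permission), the app names and the permission sets are all assumptions, and real colluding apps would also need a communication channel that this toy check does not model.

```python
from itertools import combinations

# Assumed rule for illustration: reading contacts plus network access
# is treated as a risky combination.
SOURCE = "READ_CONTACTS"
SINK = "INTERNET"

def risky(perms):
    """True if a permission set can both read the data and send it out."""
    return SOURCE in perms and SINK in perms

def colluding_pairs(apps):
    """Pairs that are risky together although neither app is risky alone."""
    pairs = []
    for (a, pa), (b, pb) in combinations(sorted(apps.items()), 2):
        if not risky(pa) and not risky(pb) and risky(pa | pb):
            pairs.append((a, b))
    return pairs

# Hypothetical installed apps and their requested permissions
apps = {
    "notes_app": {"READ_CONTACTS"},
    "weather_app": {"INTERNET"},
    "flashlight": set(),
}

print([name for name, p in apps.items() if risky(p)])
print(colluding_pairs(apps))
```

No single app in this example trips the per-app rule, yet one pair does when their permissions are combined, which is the gap that a collusion-aware model would need to cover.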

VIII. FUTURE SCOPE

We are trying to create or obtain a few applications that can perform collusion, so that we can do some research on them, which may eventually help in creating a model that can detect colluding applications. First, we have to obtain a template for collusion. Then, we have to try to split a malicious task into various steps and make each application perform one of the steps. By doing this, we can achieve collusion.
