Sample Seminar 2
INTRODUCTION
The recent advent of novel sensing and display technologies has encouraged the development of a variety of multi-touch and gesture-based interactive systems. In these systems the user may interact directly with information using touch and natural hand gestures. Today there are many ways to connect to the digital world in a controlled environment using multi-touch and gesture-based interaction. Unfortunately, most gestural and multi-touch interactive systems are not mobile, and small mobile devices fail to provide the intuitive experience of full-sized gestural systems.
fig(1)
It replaces the physical mobile phone device with virtual multi-touch and natural gesture-based interaction on the user's palm, through which the user communicates with other digital devices over the network.
VSP essentially turns the human hand into a mobile phone, through which the user can connect to the digital world as well as to other people such as friends and relatives.
For robust deformation tracking, it is important to classify image pixels that do not belong to the deformable surface, so that the model is not fitted to wrong depth values.
VSP is basically a computer-vision based wearable and gestural information interface that
augments the physical world around us with digital information and proposes natural hand
gestures as the mechanism to interact with that information.
The Virtual Smartphone over IP system (IP: Internet Protocol) allows users to create virtual smartphone images in a mobile cloud and customize each image to meet different needs. Users can easily and freely tap into the power of a data center by installing desired mobile applications remotely in one of these images. Because the mobile applications are controlled remotely, they are not constrained by the processing-power, memory, and battery-life limits of a physical smartphone.
Application scenarios: to create a new smartphone image in the cloud, the user can simply select from a number of preconfigured image templates to get up and running immediately. The following are examples of how users might utilize the system and what types of image templates can be provided to match different user needs.
Remote sandbox: A virtual smartphone image can be used to execute unknown mobile applications from unverified third parties. This environment is conventionally called a sandbox because applications do not run natively on the physical device and can access only a tightly controlled set of remote resources visible from the virtual smartphone image. Network and data access is heavily restricted to minimize possible negative impacts of potentially harmful applications. A sandbox is particularly useful for Android users who would like to install less-trusted applications obtained outside the official Android Market.
Data leakage prevention: The system can also be used as a viable solution against data leakage if the data is stored in the data center and accessible only through one of the virtual smartphone images. Since only the graphic pixels of display images are transferred to the physical smartphone, corporate data is securely contained within the data center and never stored on the physical smartphone. This allows employees to work with the data remotely and securely without retaining it on their devices.
Performance boost: The fact that Android uses the same Java application framework on both x86 and ARM processors provides seamless application portability on these platforms.
[Fig. 1: Virtual Smartphone over IP. A local smartphone device (Android OS on an ARM CPU) connects over the network to a virtual smartphone (Android OS on an Intel x86 CPU) running on a hypervisor in the data-center cloud; only display images are sent back to the client.]
RELATED WORK
Recently, a great variety of multi-touch interaction and mobile device products or research prototypes have made it possible to directly manipulate user interface components using touch and natural hand gestures.
Most of these systems depend on the physical touch-based interaction between the
user’s fingers and physical screen and thus do not recognize and incorporate touch
independent freehand gestures.
VSP (Virtual Smart Phone) technology takes a different approach to computing and tries to make the digital aspect of our lives more intuitive, interactive and, above all, more natural. It is a lot of complex technology squeezed into a simple portable device. When we bring in connectivity, we can get instant, relevant visual information projected onto any object we pick up or interact with. The technology is mainly based on augmented reality, gesture recognition, and computer-vision-based algorithms.
AUGMENTED REALITY
The final target is to build a system such that the user cannot tell the difference between a scene of the real world and its virtual augmentation. Furnishing such a view to a surgeon in the operating theater, for example, would increase their performance and remove the need for additional fixtures. Augmented reality is essentially the superimposition of digital pictures onto the real-world environment, applying a sense of virtual reality. Smartphones use similar technology these days: the software automatically analyses the scene and produces a virtual image on the smartphone with the aid of a camera. In simple words, it uses the camera to establish the angle of the images.
How does it work?
Augmented reality reveals otherwise hidden information. Recently it has been used in advertising for cars and football boots, where it produces a 3D render of the product. This permits the customer to get a 360-degree view of the product; depending on the quality of the augmentation, this can go as far as viewing the size of the product and even letting the customer "wear" the item as viewed through their phone. With the help of a mobile app, the phone's camera identifies and tracks a marker, which is often a black-and-white barcode-like picture. The software analyses the marker and renders a virtual picture on the device screen. In other words, the app works with the camera to determine the angle and distance of the marker and render content accordingly.
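As a rough illustration of the "angle and distance" step, the sketch below assumes the four corners of a square marker have already been detected in the camera image and uses a pinhole camera model to recover the marker's pose; the marker size, corner coordinates, and camera intrinsics are made-up values.

```python
import numpy as np
import cv2  # OpenCV

# Hypothetical values: a 5 cm square marker and guessed camera intrinsics.
MARKER_SIZE = 0.05  # metres
object_points = np.array([           # marker corners in the marker's own frame
    [-MARKER_SIZE / 2,  MARKER_SIZE / 2, 0],
    [ MARKER_SIZE / 2,  MARKER_SIZE / 2, 0],
    [ MARKER_SIZE / 2, -MARKER_SIZE / 2, 0],
    [-MARKER_SIZE / 2, -MARKER_SIZE / 2, 0],
], dtype=float)

# Corner pixel positions as a marker detector might report them (made up).
image_points = np.array([[310, 220], [410, 225], [405, 330], [305, 325]],
                        dtype=float)

# Simple pinhole intrinsics: focal length 800 px, principal point at image centre.
camera_matrix = np.array([[800, 0, 320],
                          [0, 800, 240],
                          [0,   0,   1]], dtype=float)
dist_coeffs = np.zeros(5)            # assume no lens distortion

ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                              camera_matrix, dist_coeffs)
if ok:
    distance = float(np.linalg.norm(tvec))   # marker distance in metres
    rot_mat, _ = cv2.Rodrigues(rvec)         # rotation matrix (the "angle")
    print(f"distance ≈ {distance:.2f} m")
    print("rotation matrix:\n", rot_mat)
```

With the pose known, the renderer can draw the virtual object at the correct scale and orientation over the camera feed.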
History:
The history of augmented reality goes back to Sutherland's work in the 1960s, which used a head-mounted display (HMD) to present 3D graphics. In 1997, Azuma surveyed the field of augmented reality, its issues and the advancements it required. After this, there was rapid growth and progress in augmented reality.
Applications:
Augmented reality has many applications, some of which are as follows:
It is used in the medical field and in robotics.
It is used in manufacturing and repairs.
It is used in military aircraft.
It is also used for annotation, visualization, and entertainment.
Augmented Reality Vs Virtual Reality:
Augmented reality:
The system has to augment the scenes of a real world.
The user needs to maintain the presence of the genuine world.
It needs a technique for the combination of virtual and real worlds.
In augmented reality, it is difficult to register the virtual and real worlds.
Virtual reality:
All the elements have the immersive environment.
In virtual reality, the system controls the senses.
It needs a technique to feed virtual world to a user.
It is difficult to make the virtual world interesting.
Virtual Smart Phone uses Augmented Reality concept to superimpose digital
information on the physical world. With the help of advanced AR technology (e.g.
adding computer vision and object recognition) the information about the surrounding
real world of the user becomes interactive and digitally usable.
Artificial information about the environment and the objects in it can be stored and
retrieved as an information layer on top of the real world view.
The main hardware components for augmented reality are: display, tracking, input devices, and computer. A combination of a powerful CPU, camera, accelerometers, GPS, and solid-state compass is often present in modern smartphones, which makes them prospective AR platforms.
GESTURE RECOGNITION
Gesture recognition is a type of perceptual computing user interface that allows computers to
capture and interpret human gestures as commands. The general definition of gesture recognition is
the ability of a computer to understand gestures and execute commands based on those gestures.
Most consumers are familiar with the concept through Wii Fit, X-box and PlayStation games such
as “Just Dance” and “Kinect Sports.”
In order to understand how gesture recognition works, it is important to understand how the word “gesture” is defined. In its most general sense, the word gesture can refer to any non-verbal communication that is intended to communicate a specific message. In the world of gesture recognition, a gesture is defined as any physical movement, large or small, that can be interpreted by a motion sensor. It may include anything from the pointing of a finger to a roundhouse kick, or a nod of the head to a pinch or wave of the hand. Gestures can be broad and sweeping or small and contained. In some cases, the definition of “gesture” may also include voice or verbal commands.
Gesture recognition is an alternative user interface for providing real-time data to a computer.
Instead of typing with keys or tapping on a touch screen, a motion sensor perceives and interprets
movements as the primary source of data input. Here is what happens between the time a gesture is made and the moment the computer reacts:
A camera feeds image data into a sensing device that is connected to a computer. The sensing device typically uses an infrared sensor or projector to calculate depth.
Specially designed software identifies meaningful gestures from a predetermined gesture library, where each gesture is matched to a computer command.
The software then compares each registered real-time gesture against the library and interprets it as the closest matching gesture.
Once the gesture has been interpreted, the computer executes the command correlated to that specific gesture.
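A minimal sketch of this match-then-dispatch loop follows; the gesture features, library entries, and commands are invented for illustration, and real systems use much richer motion features.

```python
import numpy as np

# Hypothetical gesture library: each gesture is a short trajectory of (x, y)
# hand positions, normalised to start at the origin, mapped to a command name.
GESTURE_LIBRARY = {
    "swipe_right": np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float),
    "swipe_up":    np.array([[0, 0], [0, 1], [0, 2], [0, 3]], dtype=float),
    "tap":         np.array([[0, 0], [0, 0], [0, 0], [0, 0]], dtype=float),
}
COMMANDS = {
    "swipe_right": lambda: print("next page"),
    "swipe_up":    lambda: print("scroll up"),
    "tap":         lambda: print("select item"),
}

def interpret(observed, threshold=1.0):
    """Return the library gesture closest to the observed trajectory,
    or None if nothing is close enough."""
    observed = observed - observed[0]              # normalise to start at origin
    best_name, best_dist = None, float("inf")
    for name, template in GESTURE_LIBRARY.items():
        dist = np.linalg.norm(observed - template) # simple Euclidean match
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None

# A noisy real-time observation from the sensor (made up).
observation = np.array([[5, 5], [6.1, 5.0], [7.0, 5.1], [8.0, 4.9]])
gesture = interpret(observation)
if gesture is not None:
    COMMANDS[gesture]()        # execute the command mapped to the gesture
```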
For instance, Kinect looks at a range of human characteristics to provide the best command
recognition based on natural human inputs. It provides both skeletal and facial tracking in addition
to gesture recognition, voice recognition and in some cases the depth and color of the background
scene. Kinect reconstructs all of this data into printable three-dimensional (3D) models. The latest
Kinect developments include an adaptive user interface that can detect a user’s height.
Microsoft is leading the charge with Kinect, a gesture recognition platform that allows humans to
communicate with computers entirely through speaking and gesturing. Kinect gives computers,
“eyes, ears, and a brain.” There are a few other players in the space such
as SoftKinect, GestureTek, PointGrab, eyeSight and PrimeSense, an Israeli company recently
acquired by Apple. Emerging technologies from companies such as eyeSight go far beyond gaming
to allow for a new level of small motor precision and depth perception.
Gesture recognition has huge potential in creating interactive, engaging live experiences. Here are six gesture recognition examples that illustrate its potential to educate, simplify user experiences, and delight consumers.
1. In-store retail engagement
Gesture recognition has the power to deliver an exciting, seamless in-store experience. This
example uses Kinect to create an engaging retail experience by immersing the shopper in relevant
content, helping her to try on products and offering a game that allows the shopper to earn a
discount incentive.
2. Changing how we interact with traditional computers
A company named Leap Motion last year introduced the Leap Motion Controller, a gesture-based
computer interaction system for PC and Mac. A USB device and roughly the size of a Swiss army
knife, the controller allows users to interact with traditional computers with gesture control. It’s
very easy to see the live experience applications of this technology.
3. The operating room
Companies such as Microsoft and Siemens are working together to redefine the way that everyone
from motorists to surgeons accomplish highly sensitive tasks. These companies have been focused
on refining gesture recognition technology to focus on fine motor manipulation of images and
enable a surgeon to virtually grasp and move an object on a monitor.
4. Windshield wipers
Google and Ford are also reportedly working on a system that allows drivers to control features
such as air conditioning, windows and windshield wipers with gesture controls. The Cadillac CUE
system recognizes some gestures such as tap, flick, swipe and spread to scroll lists and zoom in on
maps.
5. Mobile payments
Seeper, a London-based startup, has created a technology called Seemove that has gone beyond
image and gesture recognition to object recognition. Ultimately, Seeper believes that their system
could allow people to manage personal media, such as photos or files, and even initiate online
payments using gestures.
6. Sign language interpreter
There are several examples of using gesture recognition to bridge the gap between deaf and non-deaf people who may not know sign language. One example, from Dani Martinez Capilla, shows how Kinect can understand and translate sign language, exploring the notion of breaking down communication barriers using gesture recognition.
Gesture recognition is a topic in computer science and language technology with the goal of
interpreting human gestures via mathematical algorithms.
Gesture recognition can be seen as a way for computers to begin to understand human body
language, thus building a richer bridge between machines and humans than primitive text user
interfaces or even GUIs (graphical user interfaces), which still limit the majority of input to
keyboard and mouse.
Gesture recognition enables humans to interface with a machine (human-machine interface, HMI) and interact naturally without any mechanical devices. Since gestures are used to communicate with a computer, we are mostly concerned with empty-handed semiotic gestures.
Gestures can originate from any bodily motion or state but commonly originate from the face or
hand. Current focuses in the field include emotion recognition from the face and hand gesture
recognition.
Many approaches have been made using cameras and computer vision algorithms to interpret
sign language.
COMPUTER VISION BASED ALGORITHM
Computer vision is the science and technology of machines that can see. As a scientific
discipline, computer vision is concerned with the theory behind artificial systems that extract
information from images.
The image data can take many forms, such as video sequences, views from multiple cameras,
or multi-dimensional data from a medical scanner. The software tracks the user’s gestures
using computer-vision based algorithms.
The computer vision system for tracking and recognizing the hand postures that control the
menus is based on a combination of multi-scale color feature detection, view based hierarchical
hand models and particle filtering.
The hand postures or states are represented in terms of hierarchies of multi-scale color image
features at different scales, with qualitative interrelations in terms of scale, position and
orientation. In each image, detection of multi-scale color features is performed.
The hand postures are then simultaneously detected and tracked using particle filtering, with an
extension of layered sampling referred to as hierarchical layered sampling.
To improve the performance of the system, a prior on skin color is included in the particle filtering.
Figure 2: Gesture-Recognized Mobile Keypad
VSP is also related to augmented reality, where digital information is superimposed on the user's view of a scene, but it differs in several significant ways.
First, VSP allows the user to interact with the projected information using hand gestures. Second, the information is projected onto the hand, object, or surface itself, rather than onto glasses, goggles, or a watch, which results in a very different user experience.
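Returning to the hand-posture tracking pipeline above, here is a highly simplified sketch of the particle-filter step; the skin-color likelihood and motion model are invented placeholders, not the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

def likelihood(positions, frame):
    """Placeholder observation model: score each particle by how 'skin-like'
    the frame is at that position (here just a made-up Gaussian blob)."""
    target = np.array([120.0, 80.0])          # pretend the hand is here
    d = np.linalg.norm(positions - target, axis=1)
    return np.exp(-0.5 * (d / 20.0) ** 2) + 1e-12

def particle_filter_step(particles, weights, frame):
    # 1. Propagate particles with a simple random-walk motion model.
    particles = particles + rng.normal(scale=5.0, size=particles.shape)
    # 2. Weight particles by the (skin-color) observation likelihood.
    weights = weights * likelihood(particles, frame)
    weights /= weights.sum()
    # 3. Resample to concentrate particles on likely hand positions.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights

# Track over a few (dummy) frames.
N = 500
particles = rng.uniform([0, 0], [320, 240], size=(N, 2))   # image coordinates
weights = np.full(N, 1.0 / N)
for frame in range(10):                                    # frames are unused stubs
    particles, weights = particle_filter_step(particles, weights, frame)
    estimate = particles.mean(axis=0)
    print(f"frame {frame}: estimated hand position ≈ {estimate.round(1)}")
```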
Here are a few formal textbook definitions of computer vision: “the construction of explicit, meaningful
descriptions of physical objects from images” (Ballard & Brown, 1982)
“computing properties of the 3D world from one or more digital images” (Trucco & Verri,
1998)
“to make useful decisions about real physical objects and scenes based on sensed images”
(Sockman & Shapiro, 2001)
1 — Image Classification
The problem of image classification goes like this: Given a set of images that are all labeled with a
single category, we’re asked to predict these categories for a novel set of test images and measure the
accuracy of the predictions. There are a variety of challenges associated with this task, including
viewpoint variation, scale variation, intra-class variation, image deformation, image occlusion,
illumination conditions, and background clutter.
How might we go about writing an algorithm that can classify images into distinct categories?
Computer Vision researchers have come up with a data-driven approach to solve this. Instead of
trying to specify what every one of the image categories of interest look like directly in code, they
provide the computer with many examples of each image class and then develop learning algorithms
that look at these examples and learn about the visual appearance of each class.
In other words, they first accumulate a training dataset of labeled images, then feed it to the
computer to process the data. Given that fact, the complete image classification pipeline can be
formalized as follows:
Our input is a training dataset that consists of N images, each labeled with one of K different
classes.
Then, we use this training set to train a classifier to learn what every one of the classes looks
like.
In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set
of images that it’s never seen before. We’ll then compare the true labels of these images to the
ones predicted by the classifier.
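As a toy illustration of this train-then-evaluate pipeline, here is a sketch with a synthetic dataset and a deliberately simple nearest-neighbour classifier standing in for a real model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "images": 64-dimensional feature vectors for K = 3 classes,
# N = 150 training examples generated around per-class centres.
K, N, D = 3, 150, 64
centres = rng.normal(size=(K, D))
train_y = rng.integers(0, K, size=N)
train_x = centres[train_y] + 0.5 * rng.normal(size=(N, D))

test_y = rng.integers(0, K, size=30)
test_x = centres[test_y] + 0.5 * rng.normal(size=(30, D))

def predict(x):
    """1-nearest-neighbour: return the label of the closest training example."""
    dists = np.linalg.norm(train_x - x, axis=1)
    return train_y[np.argmin(dists)]

predictions = np.array([predict(x) for x in test_x])
accuracy = (predictions == test_y).mean()
print(f"test accuracy: {accuracy:.2%}")   # compare predictions to true labels
```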
The most popular architecture used for image classification is Convolutional Neural Networks
(CNNs). A typical use case for CNNs is where you feed the network images and the network
classifies the data. CNNs tend to start with an input “scanner” which isn’t intended to parse all the
training data at once. For example, to input an image of 100 x 100 pixels, you wouldn’t want a layer
with 10,000 nodes.
Rather, you create a scanning input layer of say 10 x 10 which you feed the first 10 x 10 pixels of the
image. Once you passed that input, you feed it the next 10 x 10 pixels by moving the scanner one
pixel to the right. This technique is known as sliding windows.
This input data is then fed through convolutional layers instead of normal layers. Each node only
concerns itself with close neighboring cells. These convolutional layers also tend to shrink as they
become deeper, mostly by easily divisible factors of the input. Besides these convolutional layers,
they also often feature pooling layers. Pooling is a way to filter out details: a commonly found
pooling technique is max pooling, where we take, say, 2 x 2 pixels and pass on the pixel with the
most amount of a certain attribute.
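A minimal PyTorch sketch of such a network, with convolutions acting as the sliding "scanner", max pooling to shrink the feature maps, and a final classification layer; the layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Toy image classifier: conv layers slide small filters over the image,
    max pooling discards detail, a final linear layer predicts the class."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3-channel input image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 100x100 -> 50x50
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 50x50 -> 25x25
        )
        self.classifier = nn.Linear(32 * 25 * 25, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

model = TinyCNN()
dummy_batch = torch.randn(4, 3, 100, 100)    # 4 images of 100 x 100 pixels
print(model(dummy_batch).shape)              # torch.Size([4, 10]) class scores
```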
Most image classification techniques nowadays are trained on ImageNet, a dataset with
approximately 1.2 million high-resolution training images. Test images will be presented with no
initial annotation (no segmentation or labels), and algorithms will have to produce labelings
specifying what objects are present in the images. Some of the best existing computer vision
methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, and
XRCE. Typically, computer vision systems use complicated multi-stage pipelines, and the early
stages are typically hand-tuned by optimizing a few parameters.
The winner of the 2012 ImageNet competition (ILSVRC 2012), Alex Krizhevsky (NIPS 2012), developed a very deep
convolutional neural net of the type pioneered by Yann LeCun. Its architecture includes 7 hidden
layers, not counting some max pooling layers. The early layers were convolutional, while the last 2
layers were globally connected. The activation functions were rectified linear units in every hidden
layer. These train much faster and are more expressive than logistic units. In addition to that, it also
uses competitive normalization to suppress hidden activities when nearby units have stronger
activities. This helps with variations in intensity.
In terms of hardware requirements, Alex uses a very efficient implementation of convolutional nets
on 2 Nvidia GTX 580 GPUs (over 1000 fast little cores). The GPUs are very good for matrix-matrix
multiplies and also have very high bandwidth to memory. This allows him to train the network in a
week and makes it quick to combine results from 10 patches at test time. We can spread a network
over many cores if we can communicate the states fast enough. As cores get cheaper and datasets get
bigger, big neural nets will improve faster than old-fashioned CV systems. Since AlexNet, there have
been multiple new models using CNN as their backbone architecture and achieving excellent results
in ImageNet: ZFNet (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015), DenseNet (2016)
etc.
2 — Object Detection
The task of detecting objects within images usually involves outputting bounding boxes and labels for individual objects. This differs from the classification / localization task by applying classification and localization to many objects instead of just a single dominant object. There are only 2 classes of bounding box here: object bounding boxes and non-object bounding boxes. For example, in car detection, you have to detect all cars in a given image with their bounding boxes.
If we use the Sliding Window technique like the way we classify and localize images, we need to
apply a CNN to many different crops of the image. Because CNN classifies each crop as object or
background, we need to apply CNN to huge numbers of locations and scales, which is very
computationally expensive!
In order to cope with this, neural network researchers have proposed to use regions instead, where
we find “blobby” image regions that are likely to contain objects.
This is relatively fast to run. The first model that kicked things off was R-CNN (Region-based
Convolutional Neural Network). In a R-CNN, we first scan the input image for possible objects
using an algorithm called Selective Search, generating ~2,000 region proposals. Then we run a CNN
on top of each of these region proposals. Finally, we take the output of each CNN and feed it into an
SVM to classify the region and a linear regression to tighten the bounding box of the object.
Essentially, we turned object detection into an image classification problem. However, there are
some problems — the training is slow, a lot of disk space is required, and inference is also slow.
An immediate descendant to R-CNN is Fast R-CNN, which improves the detection speed through 2
augmentations: 1) Performing feature extraction before proposing regions, thus only running one
CNN over the entire image, and 2) Replacing SVM with a softmax layer, thus extending the neural
network for predictions instead of creating a new model.
Fast R-CNN performed much better in terms of speed, because it trains just one CNN for the entire
image. However, the selective search algorithm is still taking a lot of time to generate region
proposals.
Thus comes the invention of Faster R-CNN, which now is a canonical model for deep learning-
based object detection. It replaces the slow selective search algorithm with a fast neural network by
inserting a Region Proposal Network (RPN) to predict proposals from features. The RPN is used to
decide “where” to look in order to reduce the computational requirements of the overall inference
process. The RPN quickly and efficiently scans every location in order to assess whether further
processing needs to be carried out in a given region. It does that by outputting k bounding box
proposals each with 2 scores representing the probability of object or not at each location.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We
add a pooling layer, some fully-connected layers, and finally a softmax classification layer and
bounding box regressor.
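All of these detectors end up with many overlapping scored boxes, which are pruned with non-maximum suppression (NMS). A minimal numpy sketch of IoU plus greedy NMS follows; the boxes and threshold are illustrative, and real pipelines use optimized implementations.

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box against many; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the best box, drop overlapping ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_threshold]   # discard near-duplicates
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 160, 160]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]: the near-duplicate of box 0 is suppressed
```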
Altogether, Faster R-CNN achieved much better speeds and higher accuracy. It’s worth noting that
although future models did a lot to increase detection speeds, few models managed to outperform
Faster R-CNN by a significant margin. In other words, Faster R-CNN may not be the simplest or
fastest method for object detection, but it’s still one of the best performing.
Major Object Detection trends in recent years have shifted towards quicker, more efficient detection
systems. This was visible in approaches like You Only Look Once (YOLO), Single Shot MultiBox
Detector (SSD), and Region-Based Fully Convolutional Networks (R-FCN) as a move towards
sharing computation on a whole image. Hence, these approaches differentiate themselves from the
costly subnetworks associated with the 3 R-CNN techniques. The main rationale behind these trends
is to avoid having separate algorithms focus on their respective subproblems in isolation, as this
typically increases training time and can lower network accuracy.
3 — Object Tracking
Object Tracking refers to the process of following a specific object of interest, or multiple objects, in
a given scene. It traditionally has applications in video and real-world interactions where
observations are made following an initial object detection. Now, it’s crucial to autonomous driving
systems such as self-driving vehicles from companies like Uber and Tesla.
Object Tracking methods can be divided into 2 categories according to the observation model:
generative method and discriminative method. The generative method uses the generative model to
describe the apparent characteristics and minimizes the reconstruction error to search the object, such
as PCA.
The discriminative method can be used to distinguish between the object and the background, its
performance is more robust, and it gradually becomes the main method in tracking. The
discriminative method is also referred to as Tracking-by-Detection, and deep learning belongs to this
category. To achieve tracking-by-detection, we detect candidate objects for all frames and use deep
learning to recognize the wanted object from the candidates. There are 2 kinds of basic network models that can be used: stacked autoencoders (SAE) and convolutional neural networks (CNN).
The most popular deep network for tracking tasks using SAE is the Deep Learning Tracker (DLT), which proposes offline pre-training and online fine-tuning of the net. The process works like this:
Offline, a stacked denoising autoencoder is pre-trained in an unsupervised way on large-scale natural image datasets to obtain a general object representation. A stacked denoising autoencoder obtains a more robust feature representation by adding noise to the input images and reconstructing the original images.
The encoder part of the pre-trained network is then combined with a classifier to get the classification network, which is fine-tuned with positive and negative samples obtained from the initial frame so that it can discriminate the current object from the background. DLT uses a particle filter as the motion model to produce candidate patches of the current frame. The classification network outputs probability scores for these patches, meaning the confidence of their classifications, and the highest-scoring patch is chosen as the object.
For model updating, DLT uses a limited-threshold strategy, updating the model only when the tracking confidence drops below a set threshold.
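A compact PyTorch sketch of the denoising-autoencoder idea (corrupt the input, reconstruct the clean patch); the layer sizes, noise level, and training data are arbitrary stand-ins, not DLT's actual configuration.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Encoder compresses a noisy image patch; decoder reconstructs the clean patch."""
    def __init__(self, in_dim=32 * 32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

clean = torch.rand(64, 32 * 32)                      # stand-in for natural image patches
for step in range(100):
    noisy = clean + 0.2 * torch.randn_like(clean)    # corrupt the input
    recon = model(noisy)
    loss = loss_fn(recon, clean)                     # reconstruct the *clean* patches
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final reconstruction loss: {loss.item():.4f}")

# The trained encoder can then be reused as a feature extractor for tracking.
features = model.encoder(clean)
```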
Because of its superiority in image classification and object detection, CNN has become the
mainstream deep model in computer vision and in visual tracking. Generally speaking, a large-scale
CNN can be trained both as a classifier and as a tracker. 2 representative CNN-based tracking
algorithms are fully-convolutional network tracker (FCNT) and multi-domain CNN (MD Net).
FCNT successfully analyzes and takes advantage of the feature maps of the VGG model, which is pre-trained on ImageNet, and arrives at the following observations:
CNN feature maps can be used for localization and tracking.
Many CNN feature maps are noisy or un-related for the task of discriminating a particular
object from its background.
Higher layers capture semantic concepts on object categories, whereas lower layers encode
more discriminative features to capture intra-class variation.
Because of these observations, FCNT designs the feature selection network to select the most
relevant feature maps on the conv4–3 and conv5–3 layers of the VGG network. Then in order to
avoid overfitting on noisy ones, it also designs two extra channels (called SNet and GNet) for the selected feature maps from the two layers separately. The GNet captures the category information of
the object, while the SNet discriminates the object from a background with a similar appearance.
Both of the networks are initialized with the given bounding-box in the first frame to get heat maps
of the object, and for new frames, a region of interest (ROI) centered at the object location in the last
frame is cropped and propagated. At last, through SNet and GNet, the classifier gets two heat maps
for prediction, and the tracker decides which heat map will be used to generate the final tracking
result according to whether there are distractors. The pipeline of FCNT is shown below.
Different from the idea of FCNT, MD Net uses all the sequences of a video to track movements in them. The networks mentioned above use irrelevant image data to reduce the training demand of
tracking data, and this idea has some deviation from tracking. The object of one class in this video
can be the background in another video, so MD Net proposes the idea of multi-domain to distinguish
the object and background in every domain independently. And a domain indicates a set of videos
that contain the same kind of object.
As shown below, MD Net is divided into 2 parts: the shared layers and the K branches of domain-specific layers. Each branch contains a binary classification layer with softmax loss, which is used to distinguish the object and background in its domain, while the shared layers are shared across all domains to ensure a general representation.
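A sketch of this shared-backbone, per-domain-head pattern follows; the layer sizes and K are arbitrary, and this shows the general structure rather than MDNet's published architecture.

```python
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    """Shared feature layers plus K domain-specific binary (object vs background) heads."""
    def __init__(self, num_domains=5, feat_dim=128):
        super().__init__()
        self.shared = nn.Sequential(            # shared across all training videos
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # One 2-way classifier (object / background) per domain.
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 2) for _ in range(num_domains))

    def forward(self, x, domain):
        return self.heads[domain](self.shared(x))

net = MultiDomainNet()
patches = torch.randn(8, 3, 64, 64)       # candidate patches from one video
logits = net(patches, domain=2)           # use that video's domain-specific head
print(logits.shape)                       # torch.Size([8, 2])
```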
In recent years, deep learning researchers have tried different ways to adapt to features of the visual
tracking task. There are many directions that have been explored: applying other network models
such as Recurrent Neural Net and Deep Belief Net, designing the network structure to adapt to video
processing and end-to-end learning, optimizing the process, structure, and parameters, or even
combining deep learning with traditional methods of computer vision or approaches in other fields
such as Language Processing and Speech Recognition.
4 — Semantic Segmentation
Central to Computer Vision is the process of segmentation, which divides whole images into pixel
groupings which can then be labelled and classified.
Particularly, Semantic Segmentation tries to semantically understand the role of each pixel in the
image (e.g. is it a car, a motorbike, or some other type of class?). For example, in a street scene,
apart from recognizing the person, the road, the cars, the trees, etc., we also have to delineate the
boundaries of each object. Therefore, unlike classification, we need dense pixel-wise predictions
from our models.
As with other computer vision tasks, CNNs have had enormous success on segmentation problems.
One of the popular initial approaches was patch classification through a sliding window, where each
pixel was separately classified into classes using a patch of images around it. This, however, is very
inefficient computationally because we don’t reuse the shared features between overlapping patches.
The solution, instead, is UC Berkeley’s Fully Convolutional Networks (FCN), which popularized
end-to-end CNN architectures for dense predictions without any fully connected layers. This allowed
segmentation maps to be generated for images of any size and was also much faster compared to the
patch classification approach. Almost all subsequent approaches to semantic segmentation adopted
this paradigm.
However, one problem remains: convolutions at the original image resolution would be very expensive. To deal with this, FCN uses downsampling and upsampling inside the network. The downsampling layers are strided convolutions, while the upsampling layers are transposed convolutions.
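A minimal PyTorch sketch of this downsample-then-upsample pattern; the channel counts and depth are arbitrary, and real FCNs are built on large pretrained backbones.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Downsample with strided convolutions, upsample with transposed convolutions,
    and predict a class score for every pixel."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # H/2
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # H/4
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),    # back to H/2
            nn.ConvTranspose2d(32, num_classes, 2, stride=2),      # back to H
        )

    def forward(self, x):
        return self.up(self.down(x))       # per-pixel class scores

net = TinyFCN()
image = torch.randn(1, 3, 128, 128)
print(net(image).shape)   # torch.Size([1, 5, 128, 128]): dense pixel-wise predictions
```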
Despite the upsampling/downsampling layers, FCN produces coarse segmentation maps because of information loss during pooling. SegNet is a more memory-efficient architecture than FCN that uses max pooling and an encoder-decoder framework. In SegNet, shortcut/skip connections are introduced from higher-resolution feature maps to improve the coarseness of the upsampled output.
5 — Instance Segmentation
Beyond Semantic Segmentation, Instance Segmentation segments different instances of classes, such
as labelling 5 cars with 5 different colors. In classification, there’s generally an image with a single
object as the focus and the task is to say what that image is. But in order to segment instances, we
need to carry out far more complex tasks. We see complicated sights with multiple overlapping
objects and different backgrounds, and we not only classify these different objects but also identify
their boundaries, differences, and relations to one another!
So far, we’ve seen how to use CNN features in many interesting ways to effectively locate different
objects in an image with bounding boxes. Can we extend such techniques to locate exact pixels of
each object instead of just bounding boxes? This instance segmentation problem is explored at
Facebook AI using an architecture known as Mask R-CNN.
Much like Fast R-CNN and Faster R-CNN, Mask R-CNN’s underlying intuition is straightforward: given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel-level segmentation?
Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says
whether or not a given pixel is part of an object. The branch is a Fully Convolutional Network on top
of a CNN-based feature map. Given the CNN Feature Map as the input, the network outputs a matrix
with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as
a binary mask).
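A toy sketch of such a mask branch, a small fully convolutional head that turns an RoI feature map into a per-pixel object mask; the sizes are illustrative, not Mask R-CNN's actual head.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Fully convolutional branch: RoI feature map in, binary object mask out."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),  # upsample 14 -> 28
            nn.Conv2d(256, 1, 1),                                  # one mask channel
        )

    def forward(self, roi_features):
        logits = self.net(roi_features)
        return torch.sigmoid(logits) > 0.5    # 1 where the pixel belongs to the object

head = MaskHead()
roi_features = torch.randn(8, 256, 14, 14)   # 8 aligned RoI feature maps
masks = head(roi_features)
print(masks.shape, masks.dtype)              # torch.Size([8, 1, 28, 28]) torch.bool
```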
Additionally, when run without modifications on the original Faster R-CNN architecture, the regions
of the feature map selected by RoIPool (Region of Interests Pool) were slightly misaligned from the
regions of the original image. Since image segmentation requires pixel-level specificity, unlike
bounding boxes, this naturally led to inaccuracies. Mask R-CNN solves this problem by adjusting
RoIPool to be more precisely aligned using a method known as RoIAlign (Region of Interests
Align). Essentially, RoIAlign uses bilinear interpolation to avoid error in rounding, which causes
inaccuracies in detection and segmentation.
Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to generate remarkably precise segmentations.
These 5 major computer vision techniques can help a computer extract, analyze, and understand
useful information from a single or a sequence of images. There are many other advanced techniques
that I haven’t touched, including style transfer, colorization, action recognition, 3D objects, human
pose estimation, and more.
OBJECTIVE
The VSP invention is related to the transfer of data and the establishment of communication from one human body to another human body, or from a human body to digital devices and vice versa, without any platform dependency.
VSP is basically an attempt to make the communication between users and digital devices more tangible and interactive.
The objective of this invention is to establish connection/communication between humans, and also with digital devices, by a touch gesture on the human palm/hand. VSP works with two types of data transfer.
First, it establishes voice communication between users with the help of GSM technology, without any physical cellular phone.
Second, for the transfer of data between humans and also with digital devices, it makes use of the Internet, an intranet network, or any other type of data server to which devices and humans are connected. To distinguish one user from another, authentication methods such as username/password, drawing a pattern on the virtual screen, face recognition, palm recognition using palm lines, or fingerprint detection can be used. In VSP, voice communication from one human to another can be done either by using GSM or by Internet/intranet technology.
Data can thus be transferred from one human being to another, or to a device, using VSP. The first and second digital devices may be gesture-recognition VSP systems connected to a network including a data storage cloud, both using VSP technology.
WORKING
The working of VSP consists of five main steps: enabling and authenticating VSP, making a call, receiving a call, capturing image/video, and copying data and pasting/passing it to other VSP and digital devices, as follows.
fig(5)
A. Enabling VSP: -
The VSP is a wearable device, and the user can Enable (ON) / Disable (OFF) the device through the power button.
When the user enables the VSP device, an icon appears on the user's palm or arm, as selected by the user, showing the status (whether a user has signed in).
If not, the user can touch this icon to log in or to change users, using different authentication methods such as entering a username and password, drawing a secret sign or pattern, face recognition, picture selection, fingerprint detection, or palm-line detection. After a user has signed in successfully, VSP is ready for making and receiving calls and for other operations.
B. Make Call: -
After enabling VSP, the user is able to make calls and communicate with relatives and other persons.
To make a call, the user dials the mobile number using the virtual keypad or the voice-recognition system.
For establishing a call between two users, VSP uses two methods, as follows.
A. Make Call Using SIM: -
The VSP device has a micro SIM (Subscriber Identity Module) through which the device establishes the call using GSM/CDMA (Global System for Mobile Communications / Code Division Multiple Access) technology.
B. Make Call Using VOIP: -
The VSP device also has Wi-Fi (Wireless Fidelity) and mobile-data options which connect the device to the intranet/Internet; using this, the user is able to make calls with VOIP (Voice over IP) technology.
By using VOIP the user can call other VSP users as well as all other GSM and Internet VOIP-enabled digital devices. When the user is not connected to the Internet/intranet, the call is simply made using the SIM without prompting the user; but when the user is connected to the Internet, VSP asks the user to select the option by which they want to make the call, and the call is connected accordingly.
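A small sketch of this routing decision; the function name and prompt mechanism are invented for illustration, not part of the actual VSP design.

```python
def choose_call_method(internet_connected: bool, ask_user=None) -> str:
    """Pick how to place a call: SIM (GSM/CDMA) when offline,
    otherwise let the user choose between SIM and VOIP."""
    if not internet_connected:
        return "SIM"                       # no prompt needed when offline
    choice = ask_user(["SIM", "VOIP"])     # prompt shown on the palm display
    return choice if choice in ("SIM", "VOIP") else "SIM"

# Example: simulate the user picking VOIP when online.
print(choose_call_method(False, ask_user=lambda opts: "SIM"))   # -> SIM
print(choose_call_method(True,  ask_user=lambda opts: "VOIP"))  # -> VOIP
```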
C. Receive Call: -
When a VSP user is called by another VSP user, or by a user of another digital device (physical mobile phone, laptop, desktop, or PDA, Personal Digital Assistant), the notification of the incoming call is shown as per the user's selected profile. If the user selects vibrate mode, a small vibrator motor indicates the incoming call by vibration, and the identity of the calling user is shown on the back of the palm using the high-density projector of VSP.
If the user selects sound mode, the incoming call is notified by the selected ringtone together with the caller's name on the back of the palm. In silent mode, only the name of the caller is shown on the back of the palm.
To attend the incoming call, the user just touches or swipes the incoming-call icon, or uses another touch gesture selected by the user.
To speak to the caller, the user either uses a Bluetooth headset or a wired headset connected to the VSP device through its 3.0 connector. The user is also able to take the call directly using the VSP device's speaker and microphone.
For VOIP calls, both users must be connected to the Internet using Wi-Fi or mobile data.
D. Capture Image/Video: -
VSP is also able to capture high-quality images/video using its high-quality camera, by clicking the capture-image button or by using a gesture (making a frame with the index fingers and thumbs) to take photos.
After taking a picture, the system shows the picture on the user's hand using the VSP projector. To shoot video with the same gesture, the user just needs to change the camera mode from photo to video. The user can also zoom in or out while capturing images/video using hand gestures.
E. Copy Data: -
VSP allows users to transfer (copy/paste) data from one human body to another human body or device using a single touch gesture. To copy data, the user must first log in to the VSP device and be connected to the Internet/intranet.
To identify a copy event, VSP uses a long press (detected by a listener program) on a copyable data item shown on the user's arm by the VSP projector; keeping a finger on a data item for more than 1.5 seconds indicates that the item should be copied.
Whenever the user touches any copyable data item, a touch-listener program starts counting the time, and when the time exceeds the threshold (1.5 s) a message appears indicating that the data item is being copied, and it gets copied to the user's unique space in the data cloud.
Copying data to the data cloud can also be done in alternative ways (instead of a long press of 1.5 seconds); for example, double-tapping on the data item or drawing a circle around it to initiate the copy.
Using this process the user can copy multiple files for passing/pasting to another device; all the copied data is saved in the cloud on a temporary basis with a unique ID for each data item.
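A minimal sketch of the long-press detection logic described above; the event names, threshold handling, and cloud upload are placeholders.

```python
import time
import uuid

LONG_PRESS_THRESHOLD = 1.5   # seconds, as described above

class CopyListener:
    """Detects a long press on a data item and copies it to the user's cloud space."""
    def __init__(self, upload):
        self.upload = upload          # callable that sends the item to the data cloud
        self.press_started = None
        self.pressed_item = None

    def on_touch_down(self, item):
        self.press_started = time.monotonic()
        self.pressed_item = item

    def on_touch_up(self):
        if self.press_started is None:
            return None
        held = time.monotonic() - self.press_started
        self.press_started = None
        if held >= LONG_PRESS_THRESHOLD:
            item_id = str(uuid.uuid4())               # unique id for the copied item
            self.upload(item_id, self.pressed_item)
            return item_id                            # signal "data item copied"
        return None

# Example: a fake upload function standing in for the cloud API.
listener = CopyListener(upload=lambda item_id, item: print(f"copied {item!r} as {item_id}"))
listener.on_touch_down("photo_001.jpg")
time.sleep(1.6)                      # user holds the finger longer than 1.5 s
listener.on_touch_up()
```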
ARCHITECTURE DESIGN
In order to provide a good user experience to mobile phone users, we propose ViSP in a cloud environment to leverage virtualization technology. Figure I shows the high-level overview of ViSP. Virtual smartphones run on servers in the data center. A user can own one or more different virtual smartphones and can access a virtual smartphone using the ViSP client as long as he has Internet access.
Figure II presents the detailed architecture of the system. The ViSP client is an app on the smartphone which receives screen updates from the server and shows them on the physical smartphone. Thus, the user is expected to use the virtual smartphone as a separate mobile OS from the original OS. The client intercepts all touch events and sends them back to the server, so that the virtual smartphone can get user input from the physical device.
To make the platform easy to use and maintain, we must ensure the transparency of the architecture. The platform should not require any modifications to the virtual smartphone OSs. All work should be done at the hypervisor layer so that different mobile OSs can run on the platform as long as they can be virtualized on some hypervisor. This means that we must intercept the display of the virtual smartphone and send in the control events at the hypervisor level.
IMPLEMENTATION
We have implemented a prototype using Android as the server-side virtual smartphone OS. The Android emulator is provided in Google's Android SDK, and both the Android system and the emulator are open source in the Android Open Source Project.
A. ARM Native Code Support
Android apps are typically written in Java and run on the Android runtime, Dalvik before Android 4.4 and Android Runtime (ART) since Android 5.0. However, Android also allows developers to implement parts of their apps in native C/C++ code in order to speed up CPU-intensive programs. For instance, some game engines (e.g., Unity, Cocos2d-x) are written in native code for performance and portability. In the early versions of Android, ARM was the only platform officially supported by Google. Android-x86 is an unofficial initiative to port Android to AMD and Intel x86 chips. Though x86 and MIPS chips are officially supported by Google nowadays, there are still some apps which only contain ARM native code. To solve this problem, Intel has created a compatibility layer named libhoudini. This library acts as a binary translator, reading in ARM instructions and converting them into the corresponding x86 instructions on the fly, so most such apps can run on x86 Android as normal.
B. Screen Updates
Some traditional remote display protocols use client-pull mode to update the screen image. This means that the client is responsible for handling the screen updates: once the client thinks the screen should be updated, for example because some touch events have been fired or a fixed time interval has passed, it requests a new screen image from the server. Hence, the latency is relatively high, one Round Trip Time (RTT). In server-push mode, by contrast, the server sends the updates once the screen changes on the server side, and the latency is half an RTT, which is better than client-pull mode. In server-push mode, the server acts as the producer while the client is the consumer: the server appends screen updates to a buffer queue, and the client takes updates out of the queue. Unfortunately, if the network connection is slow and the producing speed is faster than the consuming speed, user experience will suffer. Once the client cannot consume the screen updates in time, the buffer queue becomes longer, leading to lag on the client side. Therefore, the server puts a sequence number in each screen-update packet, and the client sends back an acknowledgement packet to indicate that it has received a specific update. If the server produces many updates while the client has not yet responded, the update packets are dropped to avoid the lag.
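A simplified sketch of this acknowledgement-gated, drop-when-behind policy on the server side; the packet format, window size, and send function are invented placeholders, not the actual protocol.

```python
from dataclasses import dataclass

MAX_UNACKED = 8   # hypothetical limit on updates in flight before we start dropping

@dataclass
class ScreenUpdate:
    seq: int
    pixels: bytes

class UpdateSender:
    """Server-push screen updates: tag each update with a sequence number and
    drop new frames when the client falls too far behind."""
    def __init__(self, send):
        self.send = send              # callable that ships a packet to the client
        self.next_seq = 0
        self.last_acked = -1

    def push_frame(self, pixels: bytes):
        in_flight = self.next_seq - self.last_acked - 1
        if in_flight >= MAX_UNACKED:
            return False              # client is lagging: drop this update
        self.send(ScreenUpdate(self.next_seq, pixels))
        self.next_seq += 1
        return True

    def on_ack(self, seq: int):
        self.last_acked = max(self.last_acked, seq)

# Example with a stub network send.
sender = UpdateSender(send=lambda pkt: print(f"sent update #{pkt.seq}"))
for frame in range(12):
    sender.push_frame(b"\x00" * 16)   # later frames are dropped (no acks yet)
sender.on_ack(7)                      # client catches up
sender.push_frame(b"\x00" * 16)       # sending resumes
```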
EVALUATION
The experiments were conducted in a separate network environment. The server has a 3.1 GHz Intel Core i5-3450 CPU and 8 GB of system memory. The client app runs on an LG Nexus 5. The virtual smartphone OS is Android 5.1 running on our modified Android emulator, and the resolution of the virtual Android is 320x480.
CPU UTILIZATION ON SERVER
The CPU utilization of the server is presented in Figure III. The average CPU utilization is 29.25% on one core, so a quad-core processor like the one we use could host up to 10 virtual Android instances on one server. Running virtual Android on the server can greatly reduce the CPU burden of the physical device: all apps run on a server whose CPU is considerably more powerful than the CPU of a mobile device.
However, screen updates must be transferred through the network. The bandwidth cost must be considered, since users may be on 3G/4G and may be charged according to their data traffic. We recorded the bandwidth cost during a five-minute period; the raw figure is the bandwidth cost if the screen is sent directly to the client. From the third minute, we started a game on the virtual Android, so the bandwidth becomes rather high in the figure. The average bandwidth costs of raw and zlib encoding are 1199.2 kBps and 79.1 kBps respectively. The bandwidth could be further reduced if a lossy compression algorithm such as JPEG were used.
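As a rough illustration of why lossless compression helps so much here, the snippet below compresses a synthetic 320x480 16-bit screen buffer with Python's standard zlib module; the buffer contents are artificial, so the ratio will differ from the figures above.

```python
import zlib
import numpy as np

# Synthetic 320x480 screen at 2 bytes per pixel: large flat regions, like a UI.
frame = np.zeros((480, 320), dtype=np.uint16)
frame[100:200, 50:270] = 0x1F2E        # a "window" of uniform colour
raw = frame.tobytes()

compressed = zlib.compress(raw, level=6)
print(f"raw: {len(raw)} bytes, zlib: {len(compressed)} bytes, "
      f"ratio: {len(raw) / len(compressed):.1f}x")
```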
TECHNOLOGIES USED
a. Voice Call: -
In VSP, voice calls are made either by using the SIM (GSM/CDMA) or through the Internet using VOIP technology.
b. Data Transfer: -
Data transfer from one body to another body or device using VSP is done through the data cloud. To access the data cloud, the user must be connected to the Internet, either by Wi-Fi or by mobile data using the SIM.
ADVANTAGES & DISADVANTAGES
Instant Communication
Smart phones evolved from the earliest communication devices. Thus, they were created primarily to improve people's way of communicating with each other. The advent of smart phone technology modernized communications. It has paved the way for SMS, text messaging, calls, video chat, and apps that allow people to instantly communicate with everyone across the globe.
Web Surfing
The smart phones also make it convenient for people to surf the web. These devices are integrated
with mobile browsers that enable them to research and access websites anytime and anywhere.
According to a study, 10% of the total time spent by people on smart phones is used in opening
browsers to surf the internet. With this, people have easy access to information.
Camera
In this “selfie” generation, the camera is very important. It saves people from buying a separate digital camera to take photos and videos, especially now that millennials are fond of posting photos on social media. According to a 2014 Comtech study, the camera ranked third among the most important considerations for consumers buying a smartphone. Because of this, smart phone giants make sure their phones are equipped with the best camera.
Entertainment
Smart phones are also viewed as a source of entertainment: games, music, movies, and books. Based on 2016 statistics, more than 63.7 million people in North America use smart phones for gaming. Moreover, users can listen to their favorite music with iTunes and
Spotify, among others. Watching movies and reading e-books are also convenient through smart
phones.
Education
Smart phones also aid education, especially for children. With easy access to information and helpful content, children can learn more interactively by watching educational videos and using educational applications. They can also easily surf the internet if they want to look something up about a topic.
Productivity Apps
Smart phones can do almost everything with the help of apps. There are over 2 million apps in the Google Play Store and over 1.5 million in the Apple App Store. People spend 90% of their time on smart phones accessing apps, and an average user installs 36 apps on their smart phone. The functionality of apps varies widely: photo and video editors, ticket booking, online stores, payment systems, data analysis, personal assistants, etc.
GPS
Most smart phones now are equipped with the Global Positioning System (GPS). This technology allows people to locate addresses and areas all around the world. This has helped improve not just communication but, most especially, transportation.
Privacy
With smart phones, you can do whatever you want without anyone knowing it. You can snap photos
of yourself and secure your photo library with a password. You can also send messages to your
loved ones without the fear of anyone knowing it. Online transactions can also be done through
smart phones.
Disadvantages
Costly
Smart phones can be expensive, especially high-end phones with great specs and features. Apart from the smart phone itself, some applications must be purchased in order to fully use the functionality they offer. If you also want data connectivity, you need to maintain a data plan.
Poor Social Interaction
Based on data released by the analytics firm Flurry, people use smart phones at least 5 hours a day, and app usage increased by 69% in 2017. With this, "real" social interaction degrades: people interact less with others in person as they tend to spend more time on their smart phones.
Distraction
Despite the productivity, smart phones can really be distracting. Applications notify you when there
are messages, updates, latest offerings, etc. These interrupt the momentum which can potentially
affect your productivity. When you attend to these notifications, you’ll find yourself attached to the
phone.
Health Issues
Smart phones are also found to have a negative impact on your health. Smart phones emit
radiofrequency energy which can be absorbed by the tissues in the body. Sleep deprivation is also
one of the common bad effects of using smart phones. Moreover, phones produce HEV light which
can damage your eyes’ retina.
Addiction
When you wake up in the morning, do you find yourself checking your phone before anything else? If you do, this is an early sign of smart phone addiction. This problem may develop into a serious addiction, which may include addiction to games, social media, etc.
Privacy Threats
Even if smart phones are made private, there are still security risks and threats everywhere. Hackers are always present and viruses are potent. Smart phones are vulnerable to these threats when you access the internet, so you need to be extra cautious about the sites and links you open.
Extra Work
With a high-end smart phone comes… extra work. Smart phones are widely used in business. If you find yourself relying on various apps, then you are taking on extra workloads which did not even exist before. Moreover, your boss can instantly call you even in the middle of the night.
Uncensored Content
Lastly, there is a disadvantage to easy access to information and the internet. People, especially children, can see, intentionally or not, uncensored content including violence, pornography, etc. If you have children, make sure you regulate their use of smart phones.
APPLICATIONS
A virtual smartphone image can be used to execute unknown mobile applications from
unverified third parties. This environment is conventionally called a sandbox because applications
do not run natively on the physical device and can access only a tightly controlled set of remote
resources visible from the virtual smartphone image. Network and data access is heavily restricted
to minimize possible negative impacts of potentially harmful applications. A sandbox is particularly
useful for Android users who would like to install the less-trusted applications obtained outside the
official Android Market.
By executing computation-intensive applications remotely on virtual smartphones and transmitting
only the graphical results to the physical smartphone, we free users from the processing-power,
memory, and even battery-life limits of their smartphones.
CONCLUSION
VSP is basically a computer-vision based wearable and gestural interface that augments the
physical world around us with digital information and proposes natural hand gestures as the
mechanism to interact with that information.
It connects the physical world to the virtual world. VSP gives an intuitive way to communicate and transfer data between different users as well as different digital devices.
The VSP invention fulfils two of our future requirements. First, it frees us from physical dependence on devices. Second, it connects our physical world to the virtual world. Some applications of VSP were described in the Applications section above.