"Metaverse": Name: Usn
(18TES84)
“METAVERSE”
BACHELOR OF
ENGINEERING IN
TELECOMMUNICATION ENGINEERING
For the Academic year 2021-2022
DEPARTMENT OF TELECOMMUNICATION
ENGINEERING YEAR 2021-2022
CERTIFICATE
1)
2)
Sir M Visvesvaraya Institute of Technology
Department of Telecommunication Engg
Bangalore-562157
DECLARATION
I hereby declare the Technical Seminar Report on “METAVERSE” has been presented
under the guidance of Mr. Faiz Mohammad Karobari, Asst.Prof., Department of
Telecommunication Engineering Sir MVIT, Bengaluru. This topic has not been submitted
previously in the Dept. of TCE and any other Departments of Sir MVIT.
USN :1MV18TE011
ACKNOWLEDGEMENT
I would like to thank our guide and mentor Mr. FAIZ MOHAMMAD KAROBARI,
Asst. Professor, Dept. of TCE, SIR MVIT, for his valuable guidance and support in the
completion of this technical seminar.
I would like to extend my heartfelt gratitude to Dr. E KAVITHA, HOD Dept. of TCE,
SIR MVIT, for constant support and encouragement.
I would also like to thank my parents, friends and close ones who gave me active
support in completing this technical seminar report. I acknowledge them with a lot of
gratitude and regard; without all of the above, this technical seminar report would not have been
easily possible.
DIWAKAR SINHA
(1MV18TE011)
ABSTRACT
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different.
With the technological development of deep learning-based high-precision recognition
models and natural generation models, Metaverse is being strengthened with various factors,
from mobile-based always-on access to connectivity with reality using virtual currency. The
integration of enhanced social activities and neural net methods requires a new definition of
Metaverse suitable for the present, different from the previous Metaverse. This paper divides
the concepts and essential techniques necessary for realizing the Metaverse into three
components (i.e., hardware, software, and contents) and three approaches (i.e., user
interaction, implementation, and application) rather than marketing or hardware approach to
conduct a comprehensive analysis. Furthermore, we describe the essential methods, based on the
three components and techniques, as applied to representative Metaverse examples in the domains of
films, games, and studies: Ready Player One, Roblox, and Facebook research. Finally, we summarize
the limitations and directions for implementing the immersive Metaverse as social
influences, constraints, and open challenges.
TABLE OF CONTENTS
TOPIC
I. CERTIFICATE
II. DECLARATION
III. ACKNOWLEDGEMENT
IV. ABSTRACT
CHAPTER 1
INTRODUCTION
The Metaverse is expanding rapidly, as seen in ZEPETO serving 200 million subscribers and
an election campaign being run in the virtual space of Animal Crossing. In particular, Roblox has
150 million monthly active users, is used by 2/3 of children aged 9-12 in the US, and 1/3 of its users are
under 16. Early studies of the Metaverse focused on Second Life in 2006. However, the current
Metaverse is based on the social values of Generation Z, for whom the online self is no different from the offline
one. Therefore, as the proportion of social activities and contents grows, it differs from the
previous Metaverse, and a new definition is needed for the present. The novel Metaverse differs from
the earlier Metaverse in three ways. First, the rapid development of deep learning has dramatically
improved the accuracy of vision and language recognition, and the development of generative models
enables a more immersive environment and more natural movement. Processing time and complexity
have been reduced by using multimodal pre-trained models as end-to-end (E2E) solutions.
Second, the Metaverse was previously served through PC access and had low consistency due to
time and space constraints, but it is now easy to access the Metaverse anytime, anywhere
thanks to mobile devices that can connect to the Internet at all times. There are 50 million games in
Roblox and the accumulated monthly usage time is 3 billion hours. People spend more time there than on
social network services (e.g., TikTok, YouTube). It has a virtuous-cycle ecosystem in which the
inflow and income of producers increase as users and usage time increase while serving various
contents, and thus sales of digital advertisements increase. Lastly, the current Metaverse differs from
the previous one because program coding can be done inside the Metaverse world, and it is more
closely bonded to real life through virtual currency. The Metaverse expands with various social meanings (e.g.,
fashion, event, game, education, and office) based on immersive interaction.
Cryptocurrencies (e.g., Dime) serve as an economic bridge between the Metaverse and
the real world, giving people deeper social meaning. The Metaverse differs from augmented reality
(AR) and virtual reality (VR) in three ways. First, while VR-related studies focus on a physical
approach and rendering, the Metaverse has a strong aspect as a service with more sustainable content and
social meaning. Second, the Metaverse does not necessarily use AR and VR technologies: even if a
platform does not support VR and AR, it can be a Metaverse application. Lastly, a scalable
environment that can accommodate many people is essential for the Metaverse to reinforce social meaning.
Large-scale Metaverse implementation requires three components: (i) hardware improvements
(e.g., GPU memory, 5G); (ii) the development of recognition and expression models that leverage
the parallelism of the hardware; and (iii) the availability of content that people can immerse themselves in and
participate in. Despite the considerable research relating to the Metaverse, most of it focuses on social
meaning, and little attention has been paid to technologies for the Metaverse. This research presents a
comprehensive study of the applications and technologies that can give social meaning to the
Metaverse in terms of hardware, software, and content, with three approaches (i.e., user interactions,
implementation, and applications).
CHAPTER 2
LITERATURE SURVEY
[1] R. Cheng, N. Wu, S. Chen and B. Han, "Reality Check of Metaverse: A First Look at Commercial
Social Virtual Reality Platforms," 2022 IEEE Conference on Virtual Reality and 3D User Interfaces
Abstracts and Workshops (VRW), 2022, pp. 141-148, doi: 10.1109/VRW55335.2022.00040.
Metaverse has grasped the news headlines recently. While being heavily advocated by the industry,
there are great interests from academia as it demands various technological support from both
hardware and software research. There has been an extensive visionary discussion of metaverse
lately, but there are few studies on its technical challenges and user experience in practice. To fill this
critical gap, in this paper, we take a first look at Workrooms, arguably a premature metaverse
product released by Meta (a.k.a. Facebook).
[2] S.-M. Park and Y.-G. Kim, "A Metaverse: Taxonomy, Components, Applications, and Open
Challenges," in IEEE Access, vol. 10, pp. 4209-4251, 2022, doi: 10.1109/ACCESS.2021.3140175.
With the technological development of deep learning-based high-precision recognition models and
natural generation models, Metaverse is being strengthened with various factors, from mobile-based
always-on access to connectivity with reality using virtual currency. The integration of enhanced
social activities and neural-net methods requires a new definition of Metaverse suitable for the
present, different from the previous Metaverse. This paper divides the concepts and essential
techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and
contents) and three approaches (i.e., user interaction, implementation, and application) rather than
marketing or hardware approach to conduct a comprehensive analysis.
[3] S. Tayal, K. Rajagopal and V. Mahajan, "Virtual Reality based Metaverse of Gamification," 2022
6th International Conference on Computing Methodologies and Communication (ICCMC), 2022, pp.
1597-1604, doi: 10.1109/ICCMC53470.2022.9753727.
The Metaverse is a podium of user mesmerizing interactions in the virtual world with the support of
the Internet, software, and hardware required to create the exceptional immersive user experience.
Gamification is becoming part of our life, and with the growing technology and digitalization, it has
become a prevalent part of different user activities. The increasing reputation of Metaverse is
opening the more user immersive online interaction in the virtual internet world where users can
meet, participate, and collaborate for a specific purpose such as social interaction.
[4] K. Getchell, I. Oliver, A. Miller and C. Allison, "Metaverses as a Platform for Game Based
Learning," 2010 24th IEEE International Conference on Advanced Information Networking and
Applications, 2010, pp. 1195-1202, doi: 10.1109/AINA.2010.125.
This paper evaluates metaverses as a platform for game based learning. Metaverses such as Second
Life are a relatively new type of Internet application. Their functionality is similar to that offered by
3D multi-player online games, but differs in that users are able to construct the environment that
avatars inhabit and are not constrained by predefined goals of the type found within a game
environment. From a quality of service (QoS) perspective metaverses are similar to games in that the
timeliness of network communication is important, but differ in that their demands upon host server
systems and network traffic are more bandwidth intensive. This paper contributes to our
understanding of metaverses by presenting a case study of the application of Game Based Learning
(GBL) within a metaverse environment, by situating the case study within a survey of the state of the
art in GBL in metaverses and by analysing the QoS delivered by the widely used Second Life
metaverse under a range of evaluator-induced network conditions.
CHAPTER 3
WHAT IS METAVERSE?
To access Metaverse, you must first put on a headset, after which you can connect to the virtual
reality interface.
The term Metaverse was coined by Neal Stephenson in his science fiction novel Snow Crash,
published in 1992, in which he envisioned lifelike avatars who met in realistic 3D buildings and
other virtual reality environments.
In recent years, Metaverse has come to represent a utopian convergence of digital experiences fuelled
by Moore's Law: an aspiration to enable rich, real-time, globally interconnected virtual- and
augmented-reality environments that will enable billions of people to work, play, collaborate and
socialise in entirely new ways, according to IANS.
The metaverse is expected to create new revenue streams because it will provide companies with a
new way to promote goods and services, while also allowing them to collect new types of data
generated by user interactions. Before companies can expect a high return on investment for this type
of business initiative, however, there are still many technological challenges and social concerns that
need to be addressed.
Microsoft, Facebook (rebranded Meta), Nvidia and Roblox are often pointed out as pioneers in the
metaverse. These vendors are known for their research and development in mixed
reality technologies and are currently exploring ways to monetize head-tracking and other types of
user-generated data in the metaverse.
Future research is expected to focus on overcoming the current bandwidth, headset and haptic
considerations that put limits on today's metaverse applications.
CHAPTER 4
CHARACTERISTICS OF METAVERSE
As a new Internet application, Metaverse integrates a variety of new technologies and has the
characteristic of multi-technology; as a new social form, Metaverse has the characteristic of sociality; and as a
virtual world that is parallel to and closely related to the real world, Metaverse has the characteristic of
hyper-spatiotemporality.
4.1 MULTI-TECHNOLOGY
Metaverse integrates a variety of new technologies. It provides an immersive experience based on
augmented reality technology, generates a mirror image of the real world based on digital twin technology, and
builds an economic system based on blockchain technology.
4.2 SOCIALITY
As the definition says, the Metaverse is a new type of social form. Metaverse includes economic
systems, cultural systems, and legal systems, which are closely related to reality, but have their own
characteristics.
CHAPTER 5
ELEMENTS OF METAVERSE
5.1 WEB 3.0
Web 3.0 is a huge part of the Metaverse. Web 3.0 is the next version of the internet.
We will move away from centralized servers to a decentralized network system, so
there will not be any government or centralized entity maintaining the network.
5.5 CRYPTOCURRENCIES/TOKENS
Metaverse will have different currencies compared to the real world. It will move on
from fiat or paper-based currencies to digital currencies.
Digital tokens and cryptocurrencies will be the main source of value within the system.
Platforms within the Metaverse already offer various forms of tokens.
CHAPTER 6
METAVERSE COMPONENTS
Metaverse gives an experience immersive enough to be used in psychotherapy for patients. People know
that myths and novels are not real, yet they are moved by them. Similarly, Metaverse is not the real world but can
provide a tangible feeling, so services based on immersive, user-interactive stories can be provided. A representative
example of such an approach is a game based on two-way interaction. In order to serve the Metaverse like the
real world, users must be able to interact seamlessly and concurrently in an environment with presence. In
order to maintain a sustainable Metaverse, economic activity between users based on these interactions must
continue. In this section, we describe the Metaverse in terms of its components: hardware, software, and
contents.
The head-mounted display (HMD) shows images through its display and plays sound through its
speaker. The HMD is a basic input tool of Metaverse and is divided into non-see-through HMD, optical see-through
HMD, and video see-through HMD. The non-see-through method, which covers the user's view, provides a sense of
immersion in a completely virtual world. Optical see-through (mainly used in AR) overlays the
virtual world on the real one, and the overlaying process requires high hardware specifications. To complement this
method, video see-through HMD is used. The main issues with HMDs are that the headset is bulky and expensive and
has a short battery life. The HMD tracks position and orientation according to the movement of the head and delivers the
corresponding change of view in the virtual world by moving the screen. Because of problems with accuracy and delay
time, it is less accurate than estimating motion by external measurement, but it is widely used because it
can save space and cost.
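The mapping from tracked head orientation to a change of view can be illustrated with a minimal sketch. The function name and the axis convention (y up, yaw 0 looking along -z, positive yaw turning towards +x) are assumptions made for this example and are not taken from any particular headset SDK.

```python
import math

def view_direction(yaw_deg, pitch_deg):
    """Return a unit forward vector (x, y, z) for a head yaw and pitch in degrees.

    Assumed convention: y is up, yaw 0 looks along -z, positive yaw turns to +x.
    """
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    x = math.sin(yaw) * math.cos(pitch)
    y = math.sin(pitch)
    z = -math.cos(yaw) * math.cos(pitch)
    return (x, y, z)

# Looking straight ahead, then turning the head 90 degrees to the right.
print(view_direction(0, 0))
print(view_direction(90, 0))
```

A renderer would feed such a vector into its camera each frame; the delay between measuring the head pose and showing the updated view is the delay time mentioned above.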
Diverse circular coordination and input areas have been proposed for hand-based input devices. Detailed user
data modelling (e.g., mobile phone grip prediction) is required to convey the feel of a material through touch. Haptics
are divided into passive haptics, which give the texture of real objects, and active haptics, which create virtual pressure.
Passive haptics help the user understand the situation while giving presence, and active haptics enable more
effective interaction by adjusting what is delivered according to user feedback. Using real props (e.g., physical
degree and operational degree) in a virtual environment helps the user experience, while using a robotized
interface allows for more diverse interactions. Depending on how the device is installed, it is either
attached to the hand or mounted externally. Beyond conveying the feel of a material, haptics are
used in various forms (e.g., inducing muscle tension).
6.1.3 Non-Hand-Based Input Device
Auxiliary input means include eye tracking, head tracking, and voice input devices. Eye
tracking changes the viewpoint by predicting eye movement when users move their eyes
without turning their heads. It also reveals which object the user is paying
attention to. It has the advantage of reducing the image-processing load by generating high-resolution images
only in the region on which the user's gaze is focused (a foveated method). Overlaying the display on the arm is
more stable than displaying it in the air, because the display is repeatedly provided at a location the user can predict.
Voice input has an advantage for long texts and conversations when only a virtual keyboard is available or in an
environment where input is limited.
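The idea of spending rendering effort only where the user is looking can be sketched as a resolution scale that falls off with distance from the gaze point. The function name, the pixel-based fovea radius, and the inverse fall-off are illustrative assumptions, not values from a specific eye-tracking system.

```python
import math

def resolution_scale(pixel, gaze, fovea_radius=100.0, min_scale=0.25):
    """Return a rendering scale in [min_scale, 1.0] for a pixel given the gaze point.

    Pixels inside the (assumed) fovea radius are rendered at full resolution;
    farther pixels are rendered progressively coarser, down to min_scale.
    """
    distance = math.dist(pixel, gaze)
    if distance <= fovea_radius:
        return 1.0
    return max(min_scale, fovea_radius / distance)

gaze_point = (960, 540)  # hypothetical gaze position on a 1920x1080 display
print(resolution_scale((960, 540), gaze_point))   # at the gaze point
print(resolution_scale((1160, 540), gaze_point))  # 200 px away
```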
To make effective use of the physical sense of space or gravity, body tracking and treadmills are used
as auxiliary devices to provide accurate motion information. Motion input devices are also divided into
passive and active methods. The passive method delivers sensations to the user according to a
fixed scenario, and the active method provides appropriate feedback based on the user's
behaviour. They are used in various forms to give realism, from simple walking to 360-degree rotation.
Because there is a risk of injury to the user, a method of securing the user's waist is used with a treadmill.
Cognitive illusion plays an essential role in immersion, linking the objective reality of the physical space
and the subjective reality that users feel. There are two types of cognition: static cognition and dynamic
cognition. Static cognition is the proprioceptive senses (e.g., sight, hearing, and touch), while dynamic cognition
is sensory balance and body movement. In dynamic cognition, adaptation, attention, and behaviour are important
features. According to its target, cognition can be divided into cognition of the environment and cognition
of an object. In Metaverse in particular, it is important to reduce distortion in detection and recognition.
Methods for mitigating distortion include changing the shape of the kernel, changing the expression, and
increasing the input. Targets of object recognition include faces, poses, gestures, and gazes related to the body.
Such object recognition goes through the process of sensing, recording, recognizing, and tracking. There are two
types of stimulation: remote and proximity stimulation. There are bottom-up and top-down approaches to
perceiving stimuli. A concept of perception that is distinct from this intuitive sense is also needed.
Unconscious and conscious approaches are distinguished by whether movement differs under
repeated recognition. There are instinctive, behavioural, contemplative,
and emotional processing methods. The avatar is an important entity in the Metaverse; it is created,
and its actions are imitated using animation. Vision-based models estimate human poses, recognize hand gestures,
and predict gaze. To predict the gaze, iris, facial contour, and 3D gaze prediction are used.
Object recognition is the process of recognizing the size, shape, position, brightness, and colours of
objects according to distance. For scene recognition and object recognition, novel methods (e.g., modal
alignment, cross-modal attention, point cloud, and scene graph) are used. Scene recognition means correctly
recognizing what state the current scene is in and what components and configuration it has. In sub-graph-based
scene graph generation, object pairs are clustered into sub-graphs that share representations.
Scene graphs are a good way to compensate for the lack of explainability, which has emerged as a
limitation of neural network models. Some studies use generative methods and scene graphs to classify bodies
in overlapping situations and predict human postures behind walls. Object recognition is also important along
with scene recognition, and attention must be paid to human-centred scene analysis and non-contact
interaction (e.g., gaze, gesture, pose). When many objects are recognized using individual object detection, the
number of computations increases in proportion to the number of objects, so attempts are made to reduce the
computational burden by using abstraction. In particular, some studies (e.g., world models and
MONet) abstract multiple objects into representations for fast object recognition and efficient training.
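A scene graph can be pictured as a set of (subject, relation, object) triples. The sketch below only illustrates that data structure; the helper names and the example scene are hypothetical, and it does not reproduce any of the scene graph generation methods cited above.

```python
from collections import defaultdict

def build_scene_graph(triples):
    """Group (subject, relation, object) triples by subject."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)

def related_objects(graph, subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [obj for rel, obj in graph.get(subject, []) if rel == relation]

# Hypothetical scene: a person sitting on a chair and holding a cup.
scene = build_scene_graph([
    ("person", "sits_on", "chair"),
    ("person", "holds", "cup"),
    ("chair", "next_to", "table"),
])
print(related_objects(scene, "person", "holds"))
```

Because each relation is stored explicitly, a query such as "what is the person holding?" can be answered by a lookup, which is one reason scene graphs are considered easier to explain than a raw feature vector.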
Recognizing sounds and processing speech help users understand their surroundings and communicate with
other avatars. Conversation is a direct method of communicating with other avatars and giving instructions
to non-player characters (NPCs) in Metaverse. Because the Metaverse is accessed from various environments,
technology is needed that separates the user's own voice from the surrounding noise. In addition, the loudness of
a sound varies with distance. For a realistic environment in the Metaverse, voice recognition
technology is needed that considers the surrounding environment while adjusting the volume according to
distance.
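Volume adjustment by distance can be sketched with the inverse distance law, under which sound amplitude falls in proportion to 1/distance from a point source. The function name and the reference distance of 1 unit are assumptions for this example; real audio engines offer several roll-off models.

```python
def distance_gain(distance, reference_distance=1.0):
    """Return an amplitude gain in (0, 1] using the inverse distance law.

    Sources closer than the (assumed) reference distance are not amplified.
    """
    return reference_distance / max(distance, reference_distance)

# An avatar speaking 1, 2 and 4 units away from the listener.
for d in (1.0, 2.0, 4.0):
    print(d, distance_gain(d))
```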
Methods of generating the environment and objects in Metaverse are divided into those that
depict the real world and those that create a new imaginary environment. A realistic way
to reflect the real-world environment is to reproduce famous places (e.g., museums, the Eiffel Tower) and places
familiar to individuals (e.g., home, school). Alternatively, a hard-to-reach
environment (e.g., underwater, Mars) is created to provide a surreal experience. People and things are the main targets of
object generation. Object generation modules create an avatar or NPC of any desired human shape (e.g., a
celebrity, a family member) as a conversation partner, focusing on facial expressions and natural movements
of joints for fluent multimodal conversation. They also generate realistic objects detailed
enough to convey the texture of objects that exist in reality. Another type of object is one that does not exist:
imaginary animals (e.g., unicorns, dragons) and anthropomorphic objects (e.g., talking chairs).
Sound synthesis gives the user a sense of immersion, but it is less well researched
than vision. It creates sound within the space to give a feeling of presence and to increase the
sense of immersion. In particular, a voice suited to each character is an important means of expressing the
character's persona. Tacotron, a speech synthesis model, focuses on letting users use prosody to emphasize words or
express uncertainty. Prosody is the variation in the speech signal that remains after accounting for other sources
of variation (e.g., phonetics and channel effects); it captures meaningful aspects of utterances and transfers them by
subtractive methods.
CNNs and global context encoding are used to capture asymmetric dependencies and context patterns
between objects in real-time multi-party 3D motion capture and pose estimation. A graph reflecting the structural
characteristics of the body is used to interpret the meaning of actions more accurately when human bodies
overlap. Although it is possible to capture real-time 3D motion in difficult scenes with a single colour
camera and to separate human body structures (e.g., shaking hands), capturing close interactions
(e.g., hugs) is still limited.
CHAPTER 7
METAVERSE APPROACHES
Natural interaction is an essential condition for increasing immersion in the Metaverse. The Metaverse can
reproduce the faces of friends and celebrities to enable realistic interactions and can give users the illusion of being in
familiar and famous places. Temporary dissociation, concentration, and heightened enjoyment are important
factors in the interaction, and feelings of control, curiosity, and intrinsic motivation are used. The target of
interaction is mainly human, and hands are an important feature. Input devices are broadly divided into hand-held
devices and non-hand input devices. Fidelity, proprioception, and egocentric view are important for
interactions on physical devices. Since a 360-degree field of view is used as the receptive field for spatial
recognition, many images and distortion corrections are required for efficient video processing. In order to
reduce motion sickness and fatigue, conflicts between visual and bodily senses must be handled and alternative
sensory methods are needed. Multimodal sensory perception that handles speech, gestures, and dialog flows is also required.
Conversation is a basic approach to delivering user intent via voice recognition. Language
is used in many settings because it concisely describes complex situations in an implicit sense. It is
necessary to create a Metaverse environment that supports understanding the situation through language, abstraction,
question answering (QA), and translation. Language is used in the reinforcement learning (RL) domain as an effective
way to define goals and abstract human-comprehensible tasks. Some agents classify instructions into a single skill
level by mimicking human behaviour. When an agent faces an ambiguous situation, it clarifies the intent of the instruction
through a multi-turn conversation with an oracle.
Humans adapt efficiently and reason more abstractly by transferring knowledge across
tasks. People communicate not only through dialogue but also through multimodal information (e.g., facial expressions,
gestures, and tone of voice). Handling each modality separately makes it difficult to deal with multiple complex
emotions, so multimodal interaction is required. In general, multimodal input carries more information than unimodal
input and is advantageous for understanding the situation. Text and images in social media posts do not have the same
meaning but instead have more complex meanings that intersect semantically. In particular, multimodal learning
is most effective when the meanings of images and text are different. Since the advent of Transformers, studies
have learned vision and language together and reduced learning from scratch by using pre-trained
models.
Since the Metaverse handles many things in the cyber world, a model that handles multiple tasks
simultaneously is useful in terms of complexity. For such a model, knowledge distillation is used to make a
small model that performs many functions and handles other modality types (e.g., Visual QA). Such a model is relatively
easy to use for similar tasks but easily overfits when target-domain data is scarce and has a different distribution. E2E
methods are also used to perform various tasks effectively. Translatotron translates voice input directly into voice
output with a sequence-to-sequence model. Compared to a cascaded model, an E2E model has the advantage that
most of the input can be utilized without data loss along the way. Translatotron interprets a foreign language,
including its unique pronunciation and emotional meaning, and has the advantage of responding in a voice
that reflects the prosody of the actual speaker. An E2E spoken language understanding (SLU) method has also been
proposed for a cloud-based modular spoken dialog system (SDS), and it has been shown to be effective in situations
with low automatic speech recognition (ASR) accuracy.
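Knowledge distillation trains a small student model to match the softened output distribution of a larger teacher. The sketch below computes only that matching term, as a KL divergence over temperature-scaled softmax outputs; the function names, the example logits, and the temperature of 2.0 are illustrative assumptions, and the scaling factors and hard-label term used in full training recipes are omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]  # hypothetical teacher logits for three classes
student = [2.0, 1.5, 1.0]  # hypothetical student logits
print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pushes the small model towards the large model's behaviour.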
7.1.4 Embodied Interaction
The difference between the Metaverse and other general interactions is that the proportion of
embodied interactions (e.g., embodied QA and visual language navigation) is relatively high. While the skills required
for embodied question answering (EQA) and visual language navigation (VLN) are similar, they differ in whether the
subject is active or passive. While the purpose of visual question answering (VQA) is to answer text questions about a
given image, EQA requires an embodied agent to analyse sensor information that it obtains through active exploration.
For example, to answer a question about the colour of a car at a distance, the agent actively moves, recognizes,
and responds based on prior knowledge of the car's location and path. These EQA tasks have recently been
extended to conversational form, in which agents compensate for insufficient information by querying oracles
in order to perform the task. The factor that differentiates embodied interaction from 2D-based methods is exophora
resolution.
People communicate information non-verbally by pointing to an object instead of using language. When a user
points to a specific location with a finger, the gesture becomes an intended instruction. Exophora
resolution carries out specific instructions through multimodal interaction, including motion and speech,
whereas anaphora resolution simply links meaning between texts.
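A very simplified view of resolving a pointing gesture is to pick the object whose direction is closest in angle to the pointing ray. The function name, the coordinates, and the nearest-angle rule are assumptions made for illustration; the sketch ignores speech and assumes no object sits exactly at the ray origin.

```python
import math

def resolve_pointing(origin, direction, objects):
    """Return the name of the object closest in angle to the pointing ray.

    `objects` maps a name to a 3D position; `direction` need not be normalized.
    """
    def angle_to(position):
        to_obj = [p - o for p, o in zip(position, origin)]
        dot = sum(a * b for a, b in zip(direction, to_obj))
        norm = math.hypot(*direction) * math.hypot(*to_obj)
        return math.acos(max(-1.0, min(1.0, dot / norm)))

    return min(objects, key=lambda name: angle_to(objects[name]))

# Hypothetical scene: the user points roughly along +x ("that one").
scene = {"car": (10.0, 0.0, 0.0), "tree": (0.0, 10.0, 0.0)}
print(resolve_pointing((0.0, 0.0, 0.0), (1.0, 0.1, 0.0), scene))
```

In a multimodal system the selected object would then be combined with the spoken instruction, so that an utterance such as "what colour is that?" is grounded in the object being pointed at.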
7.2 METAVERSE IMPLEMENTATION
The process of Metaverse implementation is divided into a design phase, a model-training phase, an
operation phase, and an evaluation phase. The design phase considers goals and concept design, development
time and cost, risk estimates, constraints, user scenarios, scope and requirements, and feasibility of
implementation and evaluation. In the model training phase, data analysis, user modelling, scientific
methodology, iterative learning, and parameter tuning are performed. The operation phase considers system
considerations, simulations, job scheduling, network environments, and prototype demonstrations. The
evaluation phase deals with content fidelity, the authenticity of interactions, implementation feasibility, and
failover. This survey covers three types of Metaverse training model: multimodal inference, RL-based approaches,
and lifelong learning. In addition, it is necessary to consider multi-agent optimization, integration
optimization, and operational considerations from the perspective of Metaverse service operation.
Humans do not only interpret the literal meaning of utterances when communicating with others. When
information is given by the cognitive model, a person interprets its meaning, combines it with existing knowledge, and
infers the speaker's intentions. Verbal ambiguity is compensated for by determining the speaker's underlying intentions
from direct or indirect representations of the surrounding environment. For example, emotion recognition,
the initiator of emotional interaction, uses multimodal fusion to compensate for the lack of context in textual
information. Multimodal models do not always outperform single-mode models, so they should be used
according to the situation.
Recently, DialoGPT and VL-BERT have been proposed to implement dialog and visual-language tasks more
conveniently. Large-scale pre-trained language models (PLMs) (e.g., Bidirectional Encoder Representations from
Transformers (BERT), GPT-3) are used for downstream tasks by applying fine-tuning and few-shot learning.
Multi-agent RL, Imagination-augmented RL, and Language grounded RL are utilized in Metaverse
because RL is suitable for action in a situation without prior learning. Multi-agent RL provides realistic NPCs by
causing collaboration and disputes among various agents. Imagination-augmented RL has the feature of rapidly
stabilizing without enormous training data, and language-based RL is used for conversation.
Technically, RL achieves a goal by choosing the behaviour that earns the maximum reward given the state
received from the environment. It is divided into model-based RL and model-free RL according to whether a
model of the environment is available for the task. It is also divided into value-based and policy-based methods
according to what is trained. An on-policy method learns from experience generated by the same policy that is
being improved, whereas an off-policy method learns from experience generated by a different behaviour policy,
typically stored and re-sampled later. Complementary methods (e.g., episodic memory, world models, and
language-based RL) have been proposed to address the sample inefficiency and sparse rewards of RL.
Furthermore, more efficient approaches (e.g., offline RL and control RL) are emerging to solve fundamental
problems (e.g., sample inefficiency, unstable training). Unlike traditional off-policy RL and model-based RL,
offline RL uses only pre-collected training data and no further online interaction. Offline RL shows reliable
learning with batch training and good performance in a closed-loop environment. RL methods continue to
advance through knowledge sharing, memory, abstraction, and language grounding. The Diversity Is All You
Need (DIAYN) model learns useful skills without a reward function, just as humans explore an environment
without supervision. DIAYN acquires skills by maximizing an information-theoretic objective with a maximum
entropy policy.
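The categories above can be made concrete with a small example. The following is a minimal sketch of tabular Q-learning, which is one value-based, model-free, off-policy method; it is not taken from the surveyed work. The corridor environment, the function names, and all hyperparameters are assumptions chosen only for illustration.

```python
import random


def train_q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor (hypothetical environment).

    The agent starts in state 0 and receives reward 1 on reaching the
    rightmost state. Action 0 moves left, action 1 moves right.
    """
    rng = random.Random(seed)
    moves = (-1, +1)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Behaviour policy: epsilon-greedy, random when values are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1

            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0

            # Off-policy target: bootstrap from the greedy value of the next
            # state, regardless of which action the behaviour policy takes next.
            best_next = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Return the greedy action for every non-terminal state."""
    return [0 if row[0] > row[1] else 1 for row in q[:-1]]


if __name__ == "__main__":
    q_table = train_q_learning()
    print(greedy_policy(q_table))
```

The method is model-free because the update uses only sampled transitions, never a model of the environment. It is off-policy because the exploring behaviour policy differs from the greedy policy whose values are being learned.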
Relationships between multiple agents are divided into collaborative, competitive, and oracle
relationships. To use these relationships effectively in multi-agent settings, it is necessary to introduce a
mental model (e.g., the Theory of Mind (ToM), intrinsic motivation, and heterogeneous competition). There have
been attempts to address the limitations of neural networks by drawing on the concepts and experimental results
of psychology and neuroscience. In particular, the theory of mind, inductive bias, and intrinsic motivation have
proved effective in embodied visual language interaction. Episodic memory, unlike semantic memory, is a
declarative memory that contains information about the time and place of acquisition. Gradient episodic
memory reduces forgetting and transfers previous knowledge when a model is trained on a continuous stream of data.
An integrated platform is needed to handle the various modalities, events, and interactions.
ZEPETO is a platform provided entirely as a service, whereas Unity gives developers more freedom to
create the world they want.
Continuous service through human-centred design and multi-modal interaction is important from both an
artistic and a scientific point of view based on design philosophy. Meta RL based on few-shot
learning is used because real-time analysis of the service otherwise performs poorly. Graph RL, which exploits
the structural characteristics of knowledge, is also attracting attention. Because planning is essential for more
complex scenarios, there are many studies on planning RL. To provide a stable service on the integrated platform,
it is necessary to handle network bandwidth limits and failure response at the physical level. In addition,
measures against socially and politically sensitive issues (e.g., sanctions and hacking) are required.
Most research on the Metaverse is aimed at marketing and investment purposes and emphasizes social
utility. The domains in which the Metaverse is most widely deployed are games and some office applications.
Huggett argued that, in virtual heritage, there is a separation between present reality and virtual reality, and
conducted a study of existence and realism within virtual reality.
8.3.1 Simulation
The Metaverse is offered in various forms of application. Simulation began with games and is
now also used for research on social phenomena and for marketing simulation. Because simulation has an
educational effect, it is also used for education and museum visits. Simulations depicting real-world tasks are a
universally available application in the Metaverse. General simulation depends on a specific solution, whereas
Metaverse simulation runs inside the Metaverse itself, which sets it apart from general simulation.
8.3.2 Game
Games are the most common platform in the popularization of the Metaverse. Beyond pure
entertainment, games can also be used to simplify difficult tasks. Because payment
and personal information are widely used in the Metaverse, games based on blockchain technology have been
proposed. Hide and Seek is a simple yet effective simulation environment for multi-agent work that uses visual
representations of objects and scenes from an egocentric perspective.
8.3.3 Office
To supplement the sense of space that online B2B solutions and conferences lack, some
companies have introduced concepts from the offline office. In this way, the sounds occurring
in the office and its physical elements (e.g., desks and conference rooms) are given a sense of space.
Representative office applications (e.g., Branch, Gather, and Teamflow) use spatial audio
technology to render speech and footstep sounds according to distance. Branch adds game elements
that offer virtual currency and experience. Teamflow has the advantage of integrating work-related tools
(e.g., file sharing).
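Distance-dependent sound can be illustrated with a short sketch. The source does not say which attenuation model these products use, so the inverse-distance rolloff below, the function name `spatial_gain`, and the parameters `ref_distance` and `max_distance` are all assumptions made for illustration.

```python
import math


def spatial_gain(listener, source, ref_distance=1.0, max_distance=20.0):
    """Return a volume gain in [0, 1] for a sound heard at a distance.

    listener, source: (x, y) positions in the virtual office (hypothetical units).
    ref_distance: distance at or below which the sound plays at full volume.
    max_distance: distance at or beyond which the sound is inaudible.
    """
    distance = math.dist(listener, source)
    if distance >= max_distance:
        return 0.0
    # Inverse-distance rolloff, clamped so that nearby sounds never exceed 1.0.
    return ref_distance / max(distance, ref_distance)


if __name__ == "__main__":
    me = (0.0, 0.0)
    for name, position in [("colleague", (3.0, 4.0)), ("far desk", (25.0, 0.0))]:
        print(name, round(spatial_gain(me, position), 2))
```

With these assumed defaults, a colleague five units away is heard at one fifth of full volume, and a source beyond twenty units is silent.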
8.3.4 Social
Because users can change their avatar's skin colour and gender as desired, avatars have the advantage
of reducing preconceptions rooted in social discrimination during conversations. These embodied avatars are
better suited to simulating social problems than surveys and role-play.
8.3.5 Marketing
Economic activity is important content in the Metaverse. An ecosystem of sustained economic
activity forms as users consume clothes and goods provided by production companies and produce and
sell goods to other users. Because the Metaverse realistically reflects the characteristics of reality, it can
serve as a virtual world for predicting the future.
8.3.6 Education
Audio-visual education is an important application of the Metaverse with a high potential for
popularization. Experiential education is important because what you read in writing and what you feel while
experiencing it are different. For example, radiation is difficult to experience, so you may simply presume
that it is dangerous. In the Metaverse, learners can analyse and experience radioactivity technically and
scientifically, and the educational effect of doing so can be observed.
CHAPTER 9
METAVERSE LIMITATION
9.1 SUSTAINABILITY
Many advantages and applications have been described, but the sustainability of the Metaverse is
just as important. When the world’s population is maintained at a certain level, the world can grow
and fix problems, but when the number of users accessing it decreases, the world cannot be
maintained. In the concept of lifelogging, the sustainability of various social relationships is more
important than each event and task (e.g., games and simulations). To maintain continuity, a
connection (e.g., Metaverse access, messenger) must be kept alive continuously on a relatively
low-spec mobile device that can always be accessed. An episodic memory that effectively manages
the user’s log lets the user feel the comfort and advantage of accessing the Metaverse over a long
time. Storing all experiences in memory has limits in utilization and capacity, so research on
memory that effectively finds and reuses important episodes is needed. In addition, latecomer
platforms should consider import/export methods that carry over the existing user experience
and provide continued usability.
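As an illustration of keeping only important episodes under a capacity limit, the following is a minimal sketch and not a method described in the source. The class names, the numeric `importance` score, and the tag-overlap retrieval are all hypothetical; a real system would have to learn which episodes matter.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Episode:
    """One logged user experience; ordered by (importance, timestamp)."""
    importance: float
    timestamp: int
    summary: str = field(compare=False)
    tags: frozenset = field(compare=False, default=frozenset())


class EpisodicMemory:
    """Bounded store that evicts the least important episode when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap: least important episode sits at index 0

    def __len__(self):
        return len(self._heap)

    def store(self, episode):
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, episode)
        elif episode > self._heap[0]:
            # Replace the least important episode with the new one.
            heapq.heapreplace(self._heap, episode)

    def recall(self, query_tags, k=1):
        """Return up to k stored episodes sharing tags with the query."""
        query = frozenset(query_tags)
        matches = [e for e in self._heap if query & e.tags]
        matches.sort(key=lambda e: (len(query & e.tags), e.importance), reverse=True)
        return matches[:k]


if __name__ == "__main__":
    memory = EpisodicMemory(capacity=2)
    memory.store(Episode(0.9, 1, "met a friend at the concert", frozenset({"friend", "concert"})))
    memory.store(Episode(0.1, 2, "walked past a shop", frozenset({"shop"})))
    memory.store(Episode(0.6, 3, "bought a hat", frozenset({"shop", "hat"})))
    print([e.summary for e in memory.recall({"shop"})])
```

The min-heap keeps eviction cheap on a low-spec device, because the least important episode is always at the top of the heap.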
In terms of hardware sensors, although the Metaverse closely resembles the real world, some
sensations are better felt in real life (e.g., daylight, smell, stickiness, slipperiness, wind). In terms
of software, programs developed in the Metaverse without coding serve as a basis for high
compatibility across the Metaverse world. However, as a program becomes more sophisticated, this
approach reaches its limits in complex applications. In terms of content, dialog is developing into
a longer and more natural form of conversation based on persona, but it is still limited as a
sustainable lifelong learning conversation solution with various perspectives and philosophies
beyond merely exciting conversations. Humans basically have multiple personas, which are expressed
differently depending on time and place. Therefore, it is necessary to study more complex persona
modelling that takes the situation into account. From this point of view, environments and events are
important for showing the various personas of users and NPCs. For example, in the drama Westworld,
avatars perform various actions in the Metaverse, freed from the constraints and conditions of
reality. Therefore, NPCs in the role of residents of the Metaverse must be able to cope with various
unexpected situations because the allowable range of scenarios is wide. In addition, persona
design is important for an NPC to appear to act according to its own persona and will. NPCs can
take the form of humans, other living things (e.g., horse, dog, cat), and non-living things (e.g., desk,
clock).
9.3 DEVELOPMENT HURDLE
AR uses lightweight devices suitable for short experiences, whereas VR needs relatively heavy
and expensive devices for long experiences. Some approaches switch between AR and VR in one
piece of hardware to combine the advantages of both. Although this method has the advantage
of using AR and VR interchangeably, the device becomes expensive and heavy compared to a
single-mode device. Holograms are not yet a popular technology in the Metaverse, but they have
potential. Eye-worn lenses are another input method utilized in the Metaverse (e.g., Maya Lenz,
Mirage, Mojo lenz). The lens analyses the user’s information by tracking the direction of eye
movements, focus, blinks, and winks. For example, Maya Lenz is a wearable device in the form of a
contact lens, and Mirage replaces disliked content with positive
alternatives. Mojo lenz is used in conjunction with an assistive device worn around the neck to
seamlessly bring a variety of visual information (e.g., data feeds, people’s profiles, video calls,
translations, notifications) into the wearer’s vision.
Privacy and security are critical issues because the Metaverse collects behavioural data that is
more detailed than user conversations and internet history. Avatar two-factor authentication and
protection of transmitted data are essential, and more vigilance is needed with regard to crimes
that may occur in the Metaverse. In addition, the surveillance actions (e.g., inappropriate chat room
surveillance, censorship, and follow-up review) prompted by the surge in users suggest that
organizations playing the same role as the police and government in the real world are needed. There
are instances where people who are exemplary in the real world commit crimes in the Metaverse
under cover of online anonymity. The norms and restrictions of the Metaverse may differ from those
of the real world because the Metaverse is post-national and offers more degrees of freedom. Most
users familiar with the Metaverse belong to the young generation, which holds relatively diverse
social ideas. It is necessary to build a Metaverse with a worldview and ethical consciousness in
which various avatars can live, rather than a Metaverse as a mere physical space.
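The source does not specify how avatar two-factor authentication would be implemented. As one possible second factor, the sketch below computes time-based one-time passwords in the manner of RFC 6238 (TOTP), built on RFC 4226 (HOTP), using only the Python standard library. The function names and the verification window are assumptions for illustration, and a production system would also need secure secret storage and rate limiting.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA-1."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, now=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over a time counter."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, submitted: str, now=None, step: int = 30,
           window: int = 1) -> bool:
    """Accept codes from the current time step and `window` steps either side."""
    if now is None:
        now = time.time()
    counter = int(now // step)
    return any(
        hmac.compare_digest(hotp(secret, counter + drift), submitted)
        for drift in range(-window, window + 1)
        if counter + drift >= 0
    )


if __name__ == "__main__":
    shared_secret = b"12345678901234567890"  # RFC test secret, not for real use
    code = totp(shared_secret)
    print(code, verify(shared_secret, code))
```

In this sketch the avatar's login would succeed only when the password check passes and `verify` also accepts the code from the user's authenticator device.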
Since the Metaverse consists of a world that changes in real time for a large number of users
and NPCs, cross-disciplinary research is necessary. As an example of cross-disciplinary research, the
Metaverse leverages knowledge widely used in cognitive science (e.g., episodic memory, intrinsic
motivation, and theory of mind) to provide more immersive and sustainable services. Episodic
memory brings events from long ago into the present conversation and so induces natural conversation.
Intrinsic motivation allows an agent to perform multiple tasks consistently rather than a single task.
The theory of mind has the advantage of deepening conversation by understanding the other
person’s point of view. Other examples are the social sciences, psychology, and economics. An
environment in which a certain number of members live behind masked avatars differs from how
society currently operates. Neuroscience and psychological approaches from psychotherapy are used to
understand humans deeply and to maintain the Metaverse. The virtual currency of the Metaverse differs
from virtual currency in the real world in that it is used as a real product in a virtual environment,
so it can become a new variable from the point of view of economics and develop into a fused form.
CHAPTER 11
CONCLUSION
[1] Roblox Blog, Roblox Corp., 2022. Accessed: Nov. 1, 2021. [Online]. Available:
https://blog.roblox.com/
[2] C. Meier, J. Saorín, A. B. de León, and A. G. Cobos, ‘‘Using the Roblox Video Game Engine for
Creating Virtual tours and Learning about the Sculptural Heritage,’’ Int. J. Emerg. Technol. Learn.,
vol. 20, pp. 268–280, Oct. 2020.
[3] R. U. Long, ‘‘Roblox and effect on education,’’ M.S. thesis, Dept. Educ., Drury Univ.,
Springfield, MO, USA, 2019.
[4] J. Kemp and D. Livingstone, ‘‘Putting a second life ‘metaverse’ skin on learning management
systems,’’ in Proc. 2nd Life Educ. Workshop Second Life Community Conv., San Francisco, CA,
USA, vol. 20, 2006, pp. 22–47.
[5] A. M. Kaplan and M. Haenlein, ‘‘The fairyland of second life: Virtual social worlds and how to
use them,’’ Bus. Horizons, vol. 52, no. 6, pp. 563–572, Nov. 2009.
[6] M. R. Cagnina and M. Poian, ‘‘How to compete in the metaverse: The business models in second
life,’’ U Udine Econ. Work. Paper, Udine, Italy, Tech. Rep. 01-2007, 2007, doi:
10.2139/ssrn.1088779.
[7] H. Duan, J. Li, S. Fan, Z. Lin, X. Wu, and W. Cai, ‘‘Metaverse for social good: A university
campus prototype,’’ in Proc. 29th ACM Int. Conf. Multimedia, Oct. 2021, pp. 153–161.
[8] H.-S. Choi and S.-H. Kim, ‘‘A content service deployment plan for metaverse museum
exhibitions—Centering on the combination of beacons and HMDs,’’ Int. J. Inf. Manage., vol. 37, no.
1, pp. 1519–1527, Feb. 2017.
[13] R. Schroeder, H. Avon Huxor, and S. Andy, ‘‘Activeworlds: Geography and social interaction
in virtual reality,’’ Futures, vol. 33, no. 7, pp. 569–587, 2001.
[14] C. Jaynes, W. B. Seales, K. Calvert, Z. Fei, and J. Griffioen, ‘‘The metaverse: A networked
collection of inexpensive, self-configuring, immersive environments,’’ in Proc. Workshop Virtual
Environ. (EGVE), 2003, pp. 115–12.