AWS Project by AnwarAkhtar


Amazon CloudFront is a global content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to your viewers with low latency and high transfer speeds. CloudFront is integrated with AWS – including physical edge locations that are directly connected to the AWS global infrastructure, as well as software that works seamlessly with services such as AWS Shield for DDoS mitigation, Amazon S3, Elastic Load Balancing, or Amazon EC2 as origins for your applications, and Lambda@Edge to run custom code close to your viewers.

We can start with CloudFront in minutes, using the same AWS tools that we are already familiar with: APIs, the AWS Management Console, AWS CloudFormation, CLIs, and SDKs (a minimal SDK sketch follows the list below). CloudFront offers a simple, pay-as-you-go pricing model with no upfront fees or required long-term contracts, and support for CloudFront is included in your existing AWS Support subscription. It helps us build:

Global, Growing Content Delivery Network

Secure Content at the Edge

Programmable CDN

High Performance

Cost Effective

Deep Integration with Key AWS Services
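As a rough illustration of getting started through the SDK, the sketch below creates a CloudFront distribution in front of an S3 origin using boto3. The bucket name, origin ID, and cache-behavior values are hypothetical placeholders, not values from this project.

# Minimal sketch: create a CloudFront distribution with an S3 origin via boto3.
# Bucket name and origin ID below are placeholders (assumptions).
import time
import boto3

cloudfront = boto3.client("cloudfront")

response = cloudfront.create_distribution(
    DistributionConfig={
        "CallerReference": str(time.time()),   # unique string per request
        "Comment": "Demo distribution with an S3 origin",
        "Enabled": True,
        "Origins": {
            "Quantity": 1,
            "Items": [{
                "Id": "my-s3-origin",
                "DomainName": "example-bucket.s3.amazonaws.com",
                "S3OriginConfig": {"OriginAccessIdentity": ""},
            }],
        },
        "DefaultCacheBehavior": {
            "TargetOriginId": "my-s3-origin",
            "ViewerProtocolPolicy": "redirect-to-https",
            # Forward no cookies or query strings; cache for one day by default.
            "ForwardedValues": {
                "QueryString": False,
                "Cookies": {"Forward": "none"},
            },
            "MinTTL": 0,
            "DefaultTTL": 86400,
            "MaxTTL": 31536000,
        },
    }
)
print(response["Distribution"]["DomainName"])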

Things we can achieve with CloudFront

CloudFront is integrated and optimized to work with popular AWS services, including Amazon Simple Storage Service (Amazon S3), Amazon Elastic Compute Cloud (Amazon EC2), and Elastic Load Balancing.

It works with Amazon Route 53 to help speed up DNS resolution of applications delivered by CloudFront.

Integration with AWS Lambda allows you to execute custom logic across the AWS global network without
provisioning or managing servers.

With Amazon API Gateway you can further accelerate the delivery of your APIs.
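As a hedged illustration of the Lambda@Edge integration mentioned above, the handler below adds a security header to every viewer response served through CloudFront. The event shape follows the CloudFront Lambda@Edge event structure; the header name and value are illustrative only.

# Sketch of a Lambda@Edge function (viewer-response trigger) that adds a
# security header to responses served through CloudFront.
def lambda_handler(event, context):
    # CloudFront places the response object inside Records[0].cf.response.
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # Header keys must be lowercase; values are lists of {key, value} dicts.
    headers["strict-transport-security"] = [{
        "key": "Strict-Transport-Security",
        "value": "max-age=63072000; includeSubDomains",
    }]
    return response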

Project Case Study

A broadcasting company was having issues with its content delivery network, which did not fully meet its needs for delivering streamed media files. This led to the periodic failure of streamed videos to start playing, as well as the chance that some video streams would freeze and not restart.

Because there was no method of measuring performance degradation through the company's existing content delivery network, the team had difficulty identifying the source of these video streaming issues.

Solution: To improve the system and prevent these types of issues, our interactive team implemented a monitoring tool that could also be used to test other content delivery networks, including Amazon Web Services (Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3)).

After monitoring multiple CDNs for a few weeks, we found that CloudFront had a significantly lower error rate than the incumbent CDN. As a result, we migrated the majority of videos to Amazon S3 storage and delivered them via Amazon CloudFront.

We completed the migration of the content into Amazon S3 within a matter of weeks and subsequently began delivering that content via Amazon CloudFront.
After the migration, the company experienced fifty percent fewer errors in its video streaming performance. The department also conducts testing more quickly with the help of Amazon CloudFront's invalidation request feature and by analyzing CloudFront log files. This feature improved the team's testing by rapidly removing bad files and quickly refreshing the cache.
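A hedged sketch of the invalidation request feature mentioned above, using boto3; the distribution ID and path are placeholders.

# Sketch: invalidate cached objects on a CloudFront distribution so that
# refreshed files are fetched from the origin on the next request.
import time
import boto3

cloudfront = boto3.client("cloudfront")

cloudfront.create_invalidation(
    DistributionId="EDFDVBD6EXAMPLE",          # placeholder distribution ID
    InvalidationBatch={
        "Paths": {
            "Quantity": 1,
            "Items": ["/videos/broken-stream.mp4"],   # path to purge (placeholder)
        },
        "CallerReference": str(time.time()),   # unique string per request
    },
)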

“Amazon CloudFront fits well with the other AWS services used by this client. The team members have enjoyed this migration to Amazon CloudFront. Today, they are delivering nearly all of their streaming video through Amazon CloudFront. This equates to more than one petabyte of video content delivered every month.”

Customer feedback

“As with all the AWS services we leverage, using Amazon CloudFront is so simple and reliable that the team
doesn’t have to think about it. It all just works, freeing us to focus on building cool applications.”

Amazon Elastic MapReduce (EMR)


Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. We can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing,
data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

Project Case Study: 1


One of our clients' IT departments needed to protect users from shill or suspect content. The client, Yelp, uses an automated review filter to identify suspicious content and minimize exposure to consumers. The site also features a wide range of other features that help people discover new businesses (lists, special offers, and events) and communicate with each other. Additionally, business owners and managers are able to set up free accounts to post special offers, upload photos, and message customers. The company has also been focused on developing mobile apps and was recently voted into the iTunes Apps Hall of Fame. Yelp apps are also available for Android, BlackBerry, Windows 7, Palm Pre, and WAP. Local search advertising makes up the majority of Yelp's revenue stream. The search ads are colored light orange and clearly labeled “Sponsored Results.” Paying advertisers are not allowed to change or re-order their reviews. Yelp originally depended upon giant RAIDs to store its logs, along with a single local instance of Hadoop, and was running out of hard drive space and capacity on that Hadoop cluster.

Solution: We took the initiative to move to Amazon Elastic MapReduce (Amazon EMR), replaced the RAIDs with Amazon Simple Storage Service (Amazon S3), and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce.

We used Amazon S3 to store daily logs and photos, generating around 1.2 TB of logs per day. The company also uses Amazon EMR to power approximately 20 separate batch scripts, most of them processing the logs. Features powered by Amazon Elastic MapReduce include:

 People Who Viewed this Also Viewed
 Review highlights
 Autocomplete as you type on search
 Search spelling suggestions
 Top searches

Their jobs are written exclusively in Python, and Yelp uses its own open-source library, mrjob, to run its Hadoop streaming jobs on Amazon EMR, with boto to talk to Amazon S3. Yelp also uses s3cmd and the Ruby Elastic MapReduce utility for monitoring.
Yelp's developers advise others working with AWS to use the boto API as well as mrjob to ensure full utilization of Amazon Elastic MapReduce job flows. Yelp runs approximately 250 Amazon Elastic MapReduce jobs per day, processing 30 TB of data, which has helped with their Hadoop application development.
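As a hedged sketch of the mrjob approach described above (not Yelp's actual code), the job below counts log lines per HTTP status code and can be pointed at Amazon EMR with mrjob's -r emr runner. The log field layout is an assumption for illustration.

# Sketch of an mrjob Hadoop streaming job. The field layout of the log line is
# an assumption; run locally, or with "-r emr" to run on Amazon EMR.
from mrjob.job import MRJob


class MRStatusCodeCount(MRJob):
    def mapper(self, _, line):
        # Assume a whitespace-separated access log with the HTTP status
        # code in the 9th field (purely illustrative).
        fields = line.split()
        if len(fields) > 8:
            yield fields[8], 1

    def reducer(self, status_code, counts):
        yield status_code, sum(counts)


if __name__ == "__main__":
    MRStatusCodeCount.run()

For example, it could be launched as: python mr_status_count.py -r emr s3://example-bucket/logs/ (the bucket path is a placeholder).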

The Outcome and Benefits

Using Amazon Elastic MapReduce, the customer was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. “With AWS, our developers can now do things they couldn’t before,” says Marin. “Our systems team can focus their energies on other challenges.”

Project Case Study: 2

When you launch a cluster, Amazon EMR lets you choose applications that will run on your cluster. But what if you
want to deploy your own custom application?

This case study shows how to create an Apache Bigtop application, then install and run it on an EMR cluster.

Apache Bigtop is a community-maintained repository that supports a wide range of components and projects, including, but not limited to, Hadoop, HBase, and Spark. Bigtop supports various Linux packaging systems, such as RPM or Deb, to package applications, and supports application deployment and configuration on clusters using Puppet.

To create a Bigtop package for EMR, follow these steps:

1. Launch a development EMR cluster.
2. Clone the Bigtop public repository.
3. Add the application definition to bigtop.bom.
4. Create directories and configuration files for the application.
   o Create an RPM package.
   o Create a Yum repository.
5. Move the output repository to S3 to make it available for any new cluster where you want to install the new application.
6. Test the application.
7. Create a bootstrap script.
8. Launch an EMR cluster with the bootstrap script.

You will create an EMR cluster for development purposes. This provides you with the tools needed to create and test the Bigtop application, including Maven and Gradle, among other tools.
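A hedged sketch of step 8 above: launching an EMR cluster with a bootstrap action that installs the custom Bigtop package. The release label, instance types, roles, and S3 script path are placeholders, not values from this project.

# Sketch: launch an EMR cluster whose bootstrap action installs the custom
# Bigtop application from the Yum repository previously copied to S3.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="bigtop-app-cluster",                      # placeholder name
    ReleaseLabel="emr-5.30.0",                      # assumed release label
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "install-bigtop-app",
        "ScriptBootstrapAction": {
            # Placeholder path to the bootstrap script that registers the
            # S3-hosted Yum repository and installs the package.
            "Path": "s3://example-bucket/bootstrap/install-bigtop-app.sh",
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])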

Customers can use Amazon EMR and Apache Spark to build scalable big data pipelines. For large-scale production
pipelines, a common use case is to read complex data originating from a variety of sources. This data must be
transformed to make it useful to downstream applications, such as machine learning pipelines, analytics
dashboards, and business reports. Such pipelines often require Spark jobs to be run in parallel on Amazon EMR.

Project Case Study: 3: How to submit multiple Spark jobs in parallel on an EMR cluster using Apache Livy.

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. Apache Livy lets you send simple Scala or Python code over REST API calls instead of having to manage and deploy large jar files. This helps because it scales data pipelines easily, with multiple Spark jobs running in parallel rather than running serially through the EMR Step API. Customers can continue to take advantage of transient clusters as part of the workflow, resulting in cost savings.
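A hedged sketch of submitting a Spark job to an EMR cluster through Livy's batches REST endpoint; the master DNS name, application script path, and arguments are placeholders. Livy listens on port 8998 by default on EMR.

# Sketch: submit a Spark batch job to Apache Livy (running on the EMR master
# node, default port 8998) and poll until it finishes.
import time
import requests

LIVY_URL = "http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8998"  # placeholder master DNS

# POST /batches starts a new Spark job; "file" points at an application on S3.
batch = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "s3://example-bucket/jobs/transform_logs.py",  # placeholder script
        "args": ["s3://example-bucket/input/", "s3://example-bucket/output/"],
        "conf": {"spark.executor.memory": "4g"},
    },
    headers={"Content-Type": "application/json"},
).json()

# Poll GET /batches/{id}/state until the job reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()["state"]
    print("batch", batch["id"], "state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(30)

Submitting several such batches without waiting between them is what lets multiple Spark jobs run in parallel on the same cluster.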
We can use Apache Airflow to orchestrate the data pipeline. Airflow is an open-source task scheduler that helps manage ETL tasks. Customers love Apache Airflow because workflows can be scheduled and managed from one central location. With Airflow's configuration-as-code approach, automating the generation of workflows, ETL tasks, and dependencies is easy. It helps customers shift their focus from building and debugging data pipelines to focusing on the business problems.
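A hedged sketch of the orchestration idea, assuming Airflow 2.x: a small DAG that submits two Spark jobs to Livy in parallel through a hypothetical submit_livy_batch helper (for example, a thin wrapper around the requests-based sketch above), then runs a downstream aggregation step.

# Sketch of an Airflow 2.x DAG that fans out two Livy-submitted Spark jobs in
# parallel and then runs a downstream aggregation step.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def submit_livy_batch(script_path):
    """Hypothetical helper: POST the script to Livy's /batches endpoint and
    wait for completion (see the requests-based sketch earlier)."""
    print(f"submitting {script_path} to Livy")


with DAG(
    dag_id="parallel_spark_jobs",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform_a = PythonOperator(
        task_id="transform_movies",
        python_callable=submit_livy_batch,
        op_args=["s3://example-bucket/jobs/transform_movies.py"],  # placeholder
    )
    transform_b = PythonOperator(
        task_id="transform_ratings",
        python_callable=submit_livy_batch,
        op_args=["s3://example-bucket/jobs/transform_ratings.py"],  # placeholder
    )
    aggregate = PythonOperator(
        task_id="aggregate_results",
        python_callable=submit_livy_batch,
        op_args=["s3://example-bucket/jobs/aggregate.py"],  # placeholder
    )

    # Both transforms run in parallel; aggregation waits for both.
    [transform_a, transform_b] >> aggregate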

High-level Architecture

(Diagram: detailed technical configuration of the architecture to be deployed.)

We use an AWS CloudFormation script to launch the AWS services required to create this workflow. CloudFormation is a powerful service that allows you to describe and provision all the infrastructure and resources required for your cloud environment in simple JSON or YAML templates. In this case, the template includes the following:

 An Amazon Elastic Compute Cloud (Amazon EC2) instance where the Airflow server is to be installed.
 An Amazon Relational Database Service (Amazon RDS) instance, which stores the metadata for the Airflow server. Airflow interacts with its metadata using the SQLAlchemy library. Airflow recommends using MySQL or Postgres; we use a PostgreSQL RDS instance.
 AWS Identity and Access Management (IAM) roles that allow the EC2 instance to interact with the RDS instance.
 An Amazon Simple Storage Service (Amazon S3) bucket with the MovieLens data downloaded into it. The output of the transformed data is also written into this bucket.
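As a hedged illustration of deploying such a template with the SDK (the stack name, template URL, and parameter below are placeholders):

# Sketch: launch the CloudFormation stack that provisions the Airflow EC2
# instance, RDS metadata database, IAM roles, and S3 bucket described above.
import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="airflow-emr-pipeline",                                   # placeholder
    TemplateURL="https://example-bucket.s3.amazonaws.com/airflow_pipeline.yaml",
    Parameters=[
        {"ParameterKey": "DBPassword", "ParameterValue": "change-me"},  # placeholder
    ],
    # Required because the template creates IAM roles.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack is fully created.
cfn.get_waiter("stack_create_complete").wait(StackName="airflow-emr-pipeline")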
Project Case Study: Hadoop Performance Tuning (Resizing EMR Clusters (YARN) to Scale Down)

The following are some issues to consider when resizing your clusters.

EMR clusters can use two types of nodes for Hadoop tasks: core nodes and task nodes. Core nodes host persistent
data by running the HDFS DataNode process and run Hadoop tasks through YARN’s resource manager. Task nodes
only run Hadoop tasks through YARN and DO NOT store data in HDFS.

Issue: When scaling down task nodes on a running cluster, expect a short delay while any running Hadoop task on the cluster is decommissioned. This allows you to get the best usage of your task node by not losing task progress through interruption. However, if your job can tolerate the interruption, you can shorten the default timeout, as described in the solution below.

Solution: We can shorten the one-hour default timeout on the resize by adjusting the yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs property (in EMR 5.14) in yarn-site.xml. When this process times out, your task node is shut down regardless of any running tasks. This process is usually relatively quick, which makes it fast to scale down task nodes.
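A hedged sketch of applying that property through an EMR configuration classification at cluster launch; the 30-minute value is an assumption, not a recommendation from this project.

# Sketch: "yarn-site" configuration classification that shortens the default
# one-hour graceful-decommission timeout to 30 minutes. Pass this list as the
# Configurations parameter of run_job_flow (see the earlier launch sketch).
YARN_SITE_CONFIG = [{
    "Classification": "yarn-site",
    "Properties": {
        "yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs": "1800",
    },
}]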

Issue: While scaling down core nodes, Amazon EMR must also wait for HDFS to decommission to protect your
data. HDFS can take a relatively long time to decommission. This is because HDFS block replication is throttled by
design through configurations located in hdfs-site.xml. This in turn means that HDFS decommissioning is throttled.
This protects your cluster from a spiked workload if a node goes down, but it slows down decommissioning.

Solution: When scaling down a large number of core nodes, consider adjusting these configurations beforehand so that HDFS decommissioning, and therefore resizing, completes more quickly.

The HDFS configurations in hdfs-site.xml that have the most significant impact on throttling block replication are:

 dfs.datanode.balance.bandwidthPerSec: Bandwidth for each node's replication

 dfs.namenode.replication.max-streams: Max streams running for block replication

 dfs.namenode.replication.max-streams-hard-limit: Hard limit on max streams

 dfs.datanode.balance.max.concurrent.moves: Number of threads used by the block balancer for pending moves

 dfs.namenode.replication.work.multiplier.per.iteration: Used to determine the number of blocks to begin transferring immediately during each replication interval
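A hedged sketch of these settings expressed as an EMR "hdfs-site" configuration classification; the values are illustrative assumptions, not tuned recommendations.

# Sketch: "hdfs-site" configuration classification that loosens the default
# block-replication throttles before a large core-node scale-down. Values are
# illustrative; test them against your own workload first.
HDFS_SITE_CONFIG = [{
    "Classification": "hdfs-site",
    "Properties": {
        "dfs.datanode.balance.bandwidthPerSec": "100000000",   # ~100 MB/s (assumed)
        "dfs.namenode.replication.max-streams": "100",
        "dfs.namenode.replication.max-streams-hard-limit": "200",
        "dfs.datanode.balance.max.concurrent.moves": "500",
        "dfs.namenode.replication.work.multiplier.per.iteration": "30",
    },
}]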

Modifying these configurations can speed up the decommissioning time significantly. Try the following exercise to
see this difference for yourself.

1. Create an EMR cluster with the following hardware configuration:
 Master: 1 node – m3.xlarge
 Core: 6 nodes – m3.xlarge
2. Connect to the master node of your cluster using SSH (Secure Shell).
3. Load data into HDFS by using the following jobs:
4. Edit your hdfs-site.xml configs and then paste in the following configuration setup in the hdfs-site properties:
6. Resize your EMR cluster from six to five core nodes, and look in the EMR events tab to see how long the resize took.
7. Repeat the previous steps without modifying the configurations, and check the difference in resize time.

While performing this exercise, we saw the resizing time drop from more than 45 minutes (without config changes) to about 6 minutes (with modified hdfs-site configs). This exercise demonstrates how much HDFS is throttled under the default configurations. Although removing these throttles can be dangerous and the resulting performance should be tested first, doing so can significantly speed up decommissioning time and therefore resizing.

There are some additional things to consider when resizing clusters:

 Shrink resizing timeouts. We can configure EMR nodes in two ways: instance groups or instance fleets.
For more information, see Create a Cluster with Instance Fleets or Uniform Instance Groups. EMR has
implemented shrink resize timeouts when nodes are configured in instance fleets. This timeout prevents
an instance fleet from attempting to resize forever if something goes wrong during the resize. It currently
defaults to one day, so keep it in mind when you are resizing an instance fleet down.

If an instance fleet shrink request takes longer than one day, it finishes and pauses at however many instances are
currently running. On the other hand, instance groups have no default shrink resize timeout. However, both types
have the one-hour YARN timeout described earlier in the yarn.resourcemanager.nodemanager-graceful-
decommission-timeout-secs property (in EMR 5.14) in yarn-site.xml.

 Watch out for high frequency HDFS writes when resizing core nodes. If HDFS is receiving a lot of writes, it
will modify a large number of blocks that require replication. This replication can interfere with the block
replication from any decommissioning core nodes and significantly slow down the resizing process.

(Beware when modifying: Changing these configurations improperly, especially on a cluster with high load, can
seriously degrade cluster performance.)

AWS LAMBDA

AWS Lambda is a zero-administration compute platform for back-end web developers that runs your code for you
on the AWS Cloud and provides you with a fine-grained pricing structure. AWS Lambda runs your back-end code on
its own AWS compute fleet of Amazon EC2 instances across multiple Availability Zones in a region, which provides
the high availability, security, performance, and scalability of the AWS infrastructure.

AWS Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, and code monitoring and logging. All you need to do is supply your code in one of the languages that AWS Lambda supports (currently Node.js, Java, C#, Go, and Python); it runs within the AWS Lambda standard runtime environment using resources provided by Lambda.
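A minimal sketch of the code you supply, assuming the Python runtime; the handler name and response shape follow the common convention for an API-style invocation and are not specific to this project.

# Sketch of a minimal Python Lambda handler. Lambda invokes this function with
# the triggering event and a context object; the return value goes back to the
# caller (for example, API Gateway).
import json


def lambda_handler(event, context):
    name = event.get("name", "world")   # read a field from the event, if present
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }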

When using AWS Lambda, you are responsible only for your code. AWS Lambda manages the compute fleet, which offers a balance of memory, CPU, network, and other resources. This is in exchange for flexibility, which means you cannot log in to compute instances or customize the operating system or language runtime. These constraints enable AWS Lambda to perform operational and administrative activities on your behalf, including provisioning capacity, monitoring fleet health, applying security patches, deploying your code, and monitoring and logging your Lambda functions.

Project Case Study

A leading car information and shopping platform client has millions of visitors each month finding their perfect car. With products such as Edmunds Your Price, Your Lease, and Used+, shoppers can buy smarter with instant, upfront prices for cars and trucks. The company was planning a key update of its mobile application, with better image quality and faster load times on the company's website and mobile apps. They had just a few weeks to figure out how to process the company's library of 50 million vehicle images into several new aspect ratios and resolutions, which would result in more than half a billion new images.

The company's existing image-handling solution, based on Cloudera MapReduce clusters, wasn't the right fit for these jobs. "Development time was also taking too long, and achieving sufficient scale and processing power for this project would have required us to manage new clusters and incur new monthly costs of at least $10,000."

Solution:
The customer was advised to go all-in on Amazon Web Services (AWS), using a serverless solution orchestrated by AWS Lambda, a function-based service that runs code in response to events. We tasked an engineer with seeing how well a serverless solution would work; the engineer came up with a basic solution within hours, and we could evaluate the results by the end of the day.

The solution the team created uses Amazon Simple Storage Service (Amazon S3) for highly available object storage, with AWS Lambda functions triggered by an Amazon API Gateway endpoint. It also includes Amazon Athena, a serverless, interactive service that uses standard SQL to query large data sets by taking advantage of the query-in-place functionality of Amazon S3, avoiding the need to move the data to a separate analytics platform. By using Amazon S3 Standard storage, the client achieved 99.999999999 percent data durability and replication across three AWS Availability Zones.
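A hedged sketch of the kind of image-processing Lambda this architecture describes (not the client's actual code): it reads an original image from S3, resizes it, and writes the result back. It assumes the Pillow library is packaged with the function, and the bucket names, event key, and target size are placeholders.

# Sketch: Lambda function that resizes a source image from S3 and writes the
# resized copy to an output bucket. Pillow (PIL) is assumed to be bundled in
# the deployment package or a Lambda layer.
import io

import boto3
from PIL import Image

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-vehicle-images"            # placeholder
OUTPUT_BUCKET = "example-vehicle-images-resized"    # placeholder
TARGET_SIZE = (800, 600)                            # placeholder resolution


def lambda_handler(event, context):
    # Assume the invoking event (for example, from API Gateway) carries the
    # object key to process.
    key = event["image_key"]

    original = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"].read()

    image = Image.open(io.BytesIO(original))
    image.thumbnail(TARGET_SIZE)                    # resize in place, keeping aspect ratio

    buffer = io.BytesIO()
    image.save(buffer, format="JPEG")
    buffer.seek(0)

    s3.put_object(
        Bucket=OUTPUT_BUCKET,
        Key=f"800x600/{key}",
        Body=buffer,
        ContentType="image/jpeg",
    )
    return {"resized": key, "size": TARGET_SIZE}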

Outcome:
By using a serverless image-processing solution built on AWS, the service engineering team beat the release deadline and avoided the higher costs of a traditionally architected solution. "The project was completed on deadline for a one-time cost of $6,000, compared to the monthly charges of at least $10,000 we would have incurred if we had needed to purchase, provision, and configure new resources," says Mahajan. "Also, because we could scale to thousands of AWS Lambda function invocations and increase the memory allocation with just a few clicks, we needed only eight days to process all 50 million images into 700 million new images.”

The simplicity and flexibility of connecting services on AWS played a key role in how quickly the team was able to put the solution into production.

Customer feedback

"What took us just a few days to build using a serverless solution based on AWS Lambda would have taken us six
months to build from scratch,. “Our CTO and the rest of the project stakeholders were really happy with how much
money and time we saved
