Data Engineering Interviews
Big Data Engineering Interview Questions & Answers
2. SCALA PROGRAMMING
4. HIVE INTERVIEW QUESTION
Source: Internet
SQL Interview Question with Answers
○ FROM -> goes to Secondary files via primary file
○ WHERE -> applies filter condition (non-aggregate column)
○ GROUP BY -> groups data according to grouping predicate
○ HAVING -> applies filter condition (aggregate function)
○ SELECT -> dumps data in tempDB system database
○ ORDER BY -> sorts data ascending/descending
Normalization will reduce the performance of the read operation; SELECT
3. What are the three degrees of normalization and how is normalization done in each degree?
1NF:
2NF:
● There should not be any partial dependencies so they must be removed if they exist.
3NF:
● There should not be any transitive dependencies so they must be removed if they exist.
BCNF:
■ A stronger form of 3NF so it is also known as 3.5NF
■ We do not need to know much about it. Just know that here you compare between a prime attribute and a prime attribute and a non-key attribute and a non-key attribute.
● Triggers
● Indexes
Temporary DB object
● Cursors
○ 2. Foreign key
○ 3. Check
8. What is a derived column, how does it work, how does it affect the performance of a database and how can it be improved?
A derived column is a new column that is generated on the fly by applying expressions to transformation input columns.
Ex: FirstName + ' ' + LastName AS 'Full name'
A derived column affects the performance of the database because a temporary new column has to be created.
The execution plan can save the new column to give better performance next time.
9. What is a Transaction?
○ It is a set of TSQL statements that must be executed together as a single logical unit.
○ Has ACID properties:
Atomicity: Transactions on the DB should be all or nothing. So transactions make sure that either all operations in the transaction happen or none of them do.
Consistency: Values inside the DB should be consistent with the constraints and integrity of the DB before and after a transaction has completed or failed.
Isolation: Ensures that each transaction is separated from any other transaction occurring on the system.
Durability: After successfully being committed to the RDBMS the transaction will not be lost in the event of a system failure or error.
OLTP:
Normalization Level: highly normalized
Data Usage: Current Data (Database)
Processing: fast for delta operations (DML)
Operation: Delta operation (update, insert, delete) aka DML
Terms Used: table, columns and relationships
OLAP:
Normalization Level: highly denormalized
Data Usage: historical Data (Data warehouse)
Processing: fast for read operations
Operation: read operation (select)
Terms Used: dimension table, fact table
11. How do you copy just the structure of a table?
SELECT * INTO NewDB.TBL_Structure
FROM OldDB.TBL_Structure
WHERE 1=0 -- Put any condition that can never be true, so no rows are copied.
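The never-true filter from question 11 can be tried out with a small sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in for SQL Server; SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT is used instead, and the table names old_orders and new_orders are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO old_orders VALUES (?, ?, ?)",
    [(1, "Ann", 10.0), (2, "Bob", 25.5)],
)

# WHERE 1 = 0 is never true, so the columns are copied but no rows are.
# (SQLite stand-in for T-SQL's SELECT * INTO ... WHERE 1=0.)
conn.execute("CREATE TABLE new_orders AS SELECT * FROM old_orders WHERE 1 = 0")

# PRAGMA table_info returns one row per column; index 1 is the column name.
new_columns = [row[1] for row in conn.execute("PRAGMA table_info(new_orders)")]
new_row_count = conn.execute("SELECT COUNT(*) FROM new_orders").fetchone()[0]
print(new_columns, new_row_count)
```

The new table has the same column names as the old one and zero rows. As with SELECT ... INTO, only the column layout is carried over in this sketch, not keys or indexes.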
12. What are the different types of Joins?
○ INNER JOIN: Gets all the matching records from both the left and right tables based on joining columns.
○ LEFT OUTER JOIN: Gets all non-matching records from the left table AND one copy of matching records from both the tables based on the joining columns.
○ RIGHT OUTER JOIN: Gets all non-matching records from the right table AND one copy of matching records from both the tables based on the joining columns.
○ FULL OUTER JOIN: Gets all non-matching records from left table & all non-matching records from right table & one copy of matching records from both the tables.
○ CROSS JOIN: returns the Cartesian product.
13. What are the different types of Restricted Joins?
○ SELF JOIN: joining a table to itself
○ RESTRICTED LEFT OUTER JOIN: gets all non-matching records from left side
○ RESTRICTED RIGHT OUTER JOIN: gets all non-matching records from right side
○ RESTRICTED FULL OUTER JOIN: gets all non-matching records from left table & gets all non-matching records from right table.
14. What is a sub-query?
○ It is a query within a query
○ Syntax:
○ We can also have a subquery inside of another subquery and so on. This is called a nested Subquery. Maximum one can have is 32 levels of nested Sub-Queries.
15. What are the SET Operators?
○ SQL set operators allow you to combine results from two or more SELECT statements.
○ Syntax:
SELECT Col1, Col2, Col3 FROM T1 <SET OPERATOR>
SELECT Col1, Col2, Col3 FROM T2
○ Rule 1: The number of columns in first SELECT statement must be same as the number of columns in the second SELECT statement.
○ Rule 2: The metadata of all the columns in first SELECT statement MUST be exactly same as the metadata of all the columns in second SELECT statement accordingly.
○ Rule 3: ORDER BY clause does not work with the first SELECT statement.
○ UNION, UNION ALL, INTERSECT, EXCEPT
16. What is a derived table?
○ SELECT statement that is given an alias name and can now be treated as a virtual table and operations like joins, aggregations, etc. can be performed on it like on an actual table.
○ Scope is query bound, that is a derived table exists only in the query in which it was defined.
SELECT temp1.SalesOrderID, temp1.TotalDue FROM
(SELECT TOP 3 SalesOrderID, TotalDue FROM Sales.SalesOrderHeader ORDER BY TotalDue DESC) AS temp1 LEFT OUTER JOIN
(SELECT TOP 2 SalesOrderID, TotalDue FROM Sales.SalesOrderHeader ORDER BY TotalDue DESC) AS temp2 ON temp1.SalesOrderID = temp2.SalesOrderID WHERE temp2.SalesOrderID IS NULL
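The derived-table query in question 16 is a restricted left outer join: it keeps the row that is in the top three but not in the top two, which is the third-highest TotalDue. A minimal sketch of the same idea follows, assuming SQLite through Python's sqlite3 module as a stand-in for SQL Server (LIMIT replaces TOP) and a made-up sales_order table with four rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_order (sales_order_id INTEGER PRIMARY KEY, total_due REAL)"
)
conn.executemany(
    "INSERT INTO sales_order VALUES (?, ?)",
    [(1, 500.0), (2, 900.0), (3, 700.0), (4, 100.0)],
)

# temp1 and temp2 are derived tables: aliased SELECTs that exist only in this query.
query = """
SELECT temp1.sales_order_id, temp1.total_due
FROM (SELECT sales_order_id, total_due FROM sales_order
      ORDER BY total_due DESC LIMIT 3) AS temp1
LEFT OUTER JOIN
     (SELECT sales_order_id, total_due FROM sales_order
      ORDER BY total_due DESC LIMIT 2) AS temp2
  ON temp1.sales_order_id = temp2.sales_order_id
WHERE temp2.sales_order_id IS NULL
"""
third_highest = conn.execute(query).fetchall()
print(third_highest)  # [(1, 500.0)]
```

The IS NULL filter is what turns the ordinary left outer join into a restricted one: only the left-side rows without a match survive.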
○ Both the Indexed View and Base Table are always in sync at any given point.
○ Indexed Views cannot have NCI-H, always NCI-CI, therefore a duplicate set of the data will be created.
20. What does WITH CHECK do?
○ WITH CHECK (the WITH CHECK OPTION clause) is used with a VIEW.
○ It is used to restrict DML operations on the view according to the search predicate (WHERE clause) specified while creating the view.
○ Users cannot perform any DML operations that do not satisfy the conditions in WHERE clause while creating a
ROW_NUMBER() OVER(ORDER BY TotalDue), RANK() OVER(ORDER BY TotalDue), DENSE_RANK() OVER(ORDER BY TotalDue) FROM Sales.SalesOrderHeader
■ NTILE(n): Distributes the rows in an ordered partition into a specified number of groups.
22. What is PARTITION BY?
○ Creates partitions within the same result set and each partition gets its own ranking. That is, the rank starts from 1 for each partition.
○ Ex:
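As an illustration of question 22, the sketch below ranks rows inside each partition. It is not the source's own example: it assumes SQLite 3.25 or later (for window functions) through Python's sqlite3 module, and the sales table and its data are invented. The id column is added to the ROW_NUMBER ordering only to make ties deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 100), ("East", 200), ("East", 200), ("West", 50), ("West", 75)],
)

# PARTITION BY region restarts every ranking at 1 for each region.
query = """
SELECT region, amount,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC, id) AS row_num,
       RANK()       OVER (PARTITION BY region ORDER BY amount DESC)     AS rnk,
       DENSE_RANK() OVER (PARTITION BY region ORDER BY amount DESC)     AS dense_rnk
FROM sales
ORDER BY region, row_num
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)
```

With the two tied East rows, RANK gives 1, 1, 3 (it skips a value after a tie), DENSE_RANK gives 1, 1, 2 (no gap), and ROW_NUMBER gives 1, 2, 3. The West partition starts again from 1.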
23. What is Temporary Table and what are the two types of it?
○ They are tables just like regular tables but the main difference is their scope.
○ The scope of temp tables is temporary whereas regular tables permanently reside.
○ Temporary tables are stored in tempDB.
○ We can do all kinds of SQL operations with temporary tables just like regular tables like JOINs, GROUPING, ADDING CONSTRAINTS, etc.
○ Two types of Temporary Table
■ Local
#LocalTempTableName -- single pound sign
Only visible in the session in which they are created. It is session-bound.
■ Global
##GlobalTempTableName -- double pound sign
Global temporary tables are visible to all sessions after they are created, and are deleted when the session in which they were created is disconnected.
It is last logged-on user bound. In other words, a global temporary table will disappear when the last user on the session logs off.
24. Explain Variables ?
○ Variable is a memory space (place holder) that contains a scalar value EXCEPT table variables, which hold 2D data.
○ Variables in SQL Server are created using the DECLARE statement.
○ Variables are BATCH-BOUND.
○ Variables that start with @ are user-defined variables.
○ SQL Injection is a technique used to attack websites by inserting SQL code in web entry fields.
○ Moderator’s definition: when someone is able to write code at the front end using DSQL, he/she could use malicious code to drop, delete, or manipulate the database. There is no perfect protection from it but we can check whether commands such as 'DROP' or 'DELETE' are included in the command line.
27. What is SELF JOIN?
○ JOINing a table to itself
○ When it comes to SELF JOIN, the foreign key of a table points to its primary key.
○ Ex: Employee(Eid, Name, Title, Mid)
○ Know how to implement it!!!
28. What is Correlated Subquery?
○ It is a type of subquery in which the inner query depends on the outer query. This means that the subquery is executed repeatedly, once for each row of the outer query.
○ In a regular subquery, inner query generates a result set that is independent of the outer query.
○ Ex:
SELECT *
FROM HumanResources.Employee E
WHERE 5000 IN (SELECT S.Bonus
FROM Sales.SalesPerson S
WHERE S.SalesPersonID = E.EmployeeID)
○ The performance of a Correlated Subquery can be very slow because its inner query depends on the outer query. So the inner query is evaluated for every single row of the result of the outer query.
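Questions 27 and 28 can be tried together on the Employee(Eid, Name, Title, Mid) layout. This is a minimal sketch under stated assumptions: SQLite through Python's sqlite3 module stands in for SQL Server, Mid is taken to be the manager's Eid, and the four rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Mid is a foreign key that points back to Eid in the same table.
conn.execute(
    "CREATE TABLE Employee (Eid INTEGER PRIMARY KEY, Name TEXT, Title TEXT, "
    "Mid INTEGER REFERENCES Employee(Eid))"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "CEO", None),
        (2, "Ben", "Manager", 1),
        (3, "Chen", "Analyst", 2),
        (4, "Dee", "Analyst", 2),
    ],
)

# Self join: the same table plays the employee role (e) and the manager role (m).
pairs = conn.execute(
    "SELECT e.Name, m.Name FROM Employee e "
    "LEFT OUTER JOIN Employee m ON e.Mid = m.Eid ORDER BY e.Eid"
).fetchall()

# Correlated subquery: the inner query refers to the outer row through m.Eid,
# so it is evaluated once per outer row.
managers = conn.execute(
    "SELECT m.Name FROM Employee m "
    "WHERE EXISTS (SELECT 1 FROM Employee e WHERE e.Mid = m.Eid) ORDER BY m.Eid"
).fetchall()

print(pairs)
print(managers)
```

The left outer join keeps the top of the hierarchy, whose manager is NULL. The EXISTS query returns only the employees that someone else reports to.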
Does minimal logging, minimal as in not logging everything. TRUNCATE will remove the pointers that point to their pages, which are deallocated.
Faster, since TRUNCATE logs only the page deallocations instead of every deleted row. TRUNCATE resets the identity column.
Cannot have triggers on TRUNCATE.
Can perform only DML, but not DDL
Batch bound scope
DECLARE @var TABLE(...)
Cannot have indexes added after declaration
■ Scope of Table variables is batch bound.
■ Table variables can only have constraints that are declared inline (PRIMARY KEY, UNIQUE, CHECK); they cannot have FOREIGN KEY constraints.
■ Table variables cannot have indexes created with CREATE INDEX; they get indexes only through inline PRIMARY KEY or UNIQUE constraints.
■ Table variables do not generate statistics.
■ Cannot ALTER once declared (Again, no DDL statements).
Extended Stored Procedures (XP_****): stored procedures that can be used in other platforms such as Java or C++.
36. Explain the Types of SP..?
○ SP with no parameters:
○ SP with a single input parameter:
○ SP with multiple parameters:
○ SP with output parameters:
Extracting data from a stored procedure based on an input parameter and outputting them using output variables.
○ SP with RETURN statement (the return value is always single and integer value)
37. What are the characteristics of SP?
○ SP can have any kind of DML and DDL statements.
○ SP can have error handling (TRY ... CATCH).
○ SP can use all types of table.
○ SP can output multiple integer values using OUT parameters, but can return only one scalar INT value.
○ SP can take any input except a table variable.
○ SP can set default inputs.
○ SP can use DSQL.
○ SP can have nested SPs.
○ SP cannot output 2D data (cannot return and output table variables).
○ SP cannot be called from a SELECT statement. It can be executed using only an EXEC/EXECUTE statement.
38. What are the advantages of SP?
○ Precompiled code hence faster.
○ They allow modular programming, which means it allows you to break down a big chunk of code into smaller pieces of code. This way the code will be more readable and easier to manage.
○ Reusability.
○ Can enhance security of your application. Users can be granted permission to execute SP without having to have direct permissions on the objects referenced in the procedure.
○ Can reduce network traffic. An operation of hundreds of lines of code can be performed through a single statement that executes the code in the procedure rather than by sending hundreds of lines of code over the network.
○ SPs are pre-compiled, which means it has to have an Execution Plan so every time it gets executed after creating a new Execution Plan, it will save up to 70% of execution time. Without it, the SPs are just like any regular TSQL statements.
39. What is User Defined Functions (UDF)?
○ UDFs are a database object and a precompiled set of TSQL statements that can accept parameters, perform complex business calculation, and return the result of the action as a value.
○ The return value can either be a single scalar value or a result set (2D data).
○ UDFs are also pre-compiled and their execution plan is saved.
○ PASSING INPUT PARAMETER(S) IS/ARE OPTIONAL, BUT MUST HAVE A RETURN STATEMENT.
40. What is the difference between Stored Procedure and UDF?
Stored Procedure:
may or may not return any value. When it does, it must be scalar INT. Can create temporary tables. Can have robust error handling in SP (TRY/CATCH, transactions). Can include any DDL and DML statements.
UDF:
must return something, which can be either scalar/table valued. Cannot access temporary tables. No robust error handling available in UDF like TRY/CATCH and transactions. Cannot have any DDL and can do DML only with table variables.
41. What are the types of UDF?
1. Scalar
Deterministic UDF: UDF in which particular input results in particular output. In other words, the output depends on the input.
Non-deterministic UDF: UDF in which the output does not directly depend on the input.
2. In-line UDF:
UDFs that do not have any function body (BEGIN...END) and have only a RETURN statement. In-line UDF must return 2D data.
3. Multi-line or Table Valued Functions:
It is a UDF that has its own function body (BEGIN ... END) and can have multiple SQL statements that return a single output. Also must return 2D data in the form of table variable.
42. What is the difference between a nested UDF and recursive UDF?
○ Nested UDF: calling a UDF within a UDF
○ Recursive UDF: calling a UDF within itself
43. What is a Trigger?
○ It is a precompiled set of TSQL statements that are automatically executed on a particular DDL, DML or log-on
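To make the trigger definition in question 43 concrete, here is a small audit-trigger sketch. It assumes SQLite through Python's sqlite3 module rather than SQL Server, so the syntax differs from T-SQL: SQLite exposes the row before and after the change as OLD and NEW, where SQL Server exposes the same data through its deleted and inserted tables. The account and account_audit tables are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE account_audit (account_id INTEGER, old_balance INTEGER, new_balance INTEGER);

-- Fires automatically after every UPDATE of balance on account.
CREATE TRIGGER trg_account_update AFTER UPDATE OF balance ON account
BEGIN
    INSERT INTO account_audit VALUES (OLD.id, OLD.balance, NEW.balance);
END;

INSERT INTO account VALUES (1, 100);
UPDATE account SET balance = 150 WHERE id = 1;
""")

# The application never wrote to account_audit; the trigger did.
audit_rows = conn.execute("SELECT * FROM account_audit").fetchall()
print(audit_rows)  # [(1, 100, 150)]
```

The UPDATE statement itself contains no audit logic, which is the point of a trigger: the extra statements run as a side effect of the DML event.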
■ 4. Scrolling: moves anywhere.
■ 5. Read Only: prevents data manipulation to cursor data set.
45. What are ‘inserted’ and ‘deleted’ tables (aka. magic tables)?
○ They are tables that you can communicate with between the external code and trigger body.
○ The structure of inserted and deleted magic tables depends upon the structure of the table in a DML statement.
○ UPDATE is a combination of INSERT and DELETE, so its old record will be in the deleted table and its new record will be stored in the inserted table.
46. What are some String functions to remember?
LEN(string): returns the length of string.
UPPER(string) & LOWER(string): returns its upper/lower string
LTRIM(string) & RTRIM(string): remove leading/trailing spaces from the string
LEFT(string, length): extracts a certain number of characters from left side of the string
RIGHT(string, length): extracts a certain number of characters from right side of the string
SUBSTRING(string, starting_position, length): returns the sub string of the string
REVERSE(string): returns the reverse string of the string
Concatenation: Just use + sign for it
49. What is the difference between Table scan and seek ?
○ Scan: going through from the first page to the last page of an offset by offset or row by row.
○ Seek: going to the specific node and fetching the information needed.
○ ‘Seek’ is the fastest way to find and fetch the data. So if you look at your Execution Plan and all of the operations are seeks, that means it’s optimized.
50. Why are DML operations slower on Indexes?
○ It is because the sorting of indexes and the order of sorting has to be always maintained.
○ When inserting or deleting a value that is in the middle of the range of the index, everything has to be rearranged again. It cannot just insert a new value at the end of the index.
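The scan versus seek difference from question 49 can be observed in a query plan. The sketch below assumes SQLite through Python's sqlite3 module, whose EXPLAIN QUERY PLAN reports SCAN and SEARCH steps; these are close analogues of SQL Server's scan and seek operators, not the same thing. The person table, the index name and the data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, last_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(i, f"name{i}", f"city{i % 10}") for i in range(1000)],
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable plan step.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

lookup = "SELECT * FROM person WHERE last_name = 'name42'"

plan_without_index = plan(lookup)  # every row has to be visited
conn.execute("CREATE INDEX ix_person_last_name ON person (last_name)")
plan_with_index = plan(lookup)     # the index is used to jump to the matching rows

print(plan_without_index)
print(plan_with_index)
```

Before the index exists the plan is a full scan of the table. After it is created the same query becomes a search that names ix_person_last_name. The exact wording of the plan text varies between SQLite versions.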
key.
○ Then it will store the data in the leaf nodes.
○ Now the data is stored in your hard disk in a continuous manner.
52. What is the architecture in terms of a hard disk, extents and pages?
○ A hard disk is divided into Extents.
○ Every extent has eight pages.
○ Every page is 8 KB (8,192 bytes), of which at most 8,060 bytes can be used by a single row.
53. What are the nine different types of Indexes?
○ 1. Clustered
○ 2. Non-clustered
○ 3. Covering
○ 4. Full Text Index
○ 5. Spatial
○ 6. Unique
○ 7. Filtered
○ 8. XML
○ 9. Index View
54. What is a Clustering Key?
○ A column on which any type of index is created is called the Clustering Key for that particular index.
○ Clustered Indexes store data in a contiguous manner. In other words, they cluster the data into a certain spot on a hard disk continuously.
57. What are the four different types of searching information in a table?
○ 1. Table Scan -> the worst way
○ 2. Table Seek -> only theoretical, not possible
○ 3. Index Scan -> scanning leaf nodes
○ 4. Index Seek -> getting to the node needed, the best way
58. What is Fragmentation .?
○ Fragmentation is a phenomenon in which storage space is used inefficiently.
○ In SQL Server, Fragmentation occurs in case of DML statements on a table that has an index.
○ When any record is deleted from the table which has any index, it creates a memory bubble which causes fragmentation.
○ Fragmentation can also be caused due to page split, which is the way of building B-Tree dynamically according to the new records coming into the table.
○ Taking care of fragmentation levels and maintaining them is the major problem for Indexes.
○ Since Indexes slow down DML operations, we do not have a lot of indexes on OLTP, but it is recommended to have many different indexes in OLAP.
bubbles.
○ Every statistic holds the following info:
■ 1. The number of rows and pages occupied by a table’s data
■ 2. The time that statistics was last updated
■ 3. The average length of keys in a column
■ 4. Histogram showing the distribution of data in column
61. What are some optimization techniques in SQL?
1. Build indexes. Using indexes on a table will dramatically increase the performance of your read operation because it will allow you to perform index scan or index seek depending on your search predicates and select predicates instead of table scan. Building non-clustered indexes, you could also increase the performance further.
2. You could also use an appropriate covering index for your non clustered index because it could avoid performing a key lookup.
3. You could also use a filtered index for your non-clustered index since it allows you to create an index on a particular part of a table that is accessed more frequently than other parts.
4. You could also use an indexed view, which is a way to create one or more clustered indexes on the same table. In that way, the query optimizer will consider even the clustered keys on the indexed views so there might be a possible faster option to execute your query.
5. Do table partitioning. When a particular table has billions of records, it would be practical to partition the table so that it can increase the read operation performance. Every partitioned table will be considered as physical smaller tables internally.
6. Update statistics for TSQL so that the query optimizer will choose the most optimal path in getting the data from the underlying table. Statistics are histograms of maximum 200 sample values from columns separated by intervals.
9. Avoid using SELECT *. Because you are selecting everything, it will decrease the performance. Try to select only the columns you need.
10. Avoid using CURSOR because it is an object that goes over a table on a row-by-row basis, which is similar to the table scan. It is not really an effective way.
11. Avoid using unnecessary TRIGGER. If you have unnecessary triggers, they will be triggered needlessly. Not only slowing the performance down, it might mess up your whole program as well.
12. Manage Indexes using REORGANIZE or REBUILD.
The internal fragmentation happens when there are a lot of data bubbles on the leaf nodes of the b-tree and the leaf nodes are not used to their fullest capacity. By reorganizing, you can push the actual data on the b-tree to the left side of the leaf level and push the memory bubble to the right side. But it is still a temporary solution because the memory bubbles will still exist and just won’t be accessed much.
The external fragmentation occurs when the logical ordering of the b-tree pages does not match the physical ordering on the hard disk. By rebuilding, you can cluster them all together, which will solve not only the internal but also the external fragmentation issues. You can check the status of the fragmentation by using the Dynamic Management Function sys.dm_db_index_physical_stats(db_id, table_id, index_id, partition_num, flag), and looking at the columns avg_page_space_used_in_percent for the internal fragmentation and avg_fragmentation_in_percent for the external fragmentation.
13. Try to use JOIN instead of SET operators or SUB-QUERIES because set operators and sub-queries are slower than joins and you can implement the features of sets and sub-queries using joins.
14. Avoid using LIKE operators, which is a string matching operator but it is mighty slow.
15. Avoid using blocking operations such as order by or derived columns.
16. For the last resort, use the SQL Server Profiler. It generates a trace file, which is a really detailed version of execution plan. Then DTA (Database Engine Tuning Advisor) will take a trace file as its input, analyze it and give you recommendations on how to improve your query further.
62. How do you present the following tree in a form of a table?
      A
    /   \
   B     C
  / \   / \
 D   E F   G
CREATE TABLE tree (node CHAR(1), parentNode CHAR(1), [level] INT)
INSERT INTO tree VALUES
('A', null, 1),
('B', 'A', 2),
('C', 'A', 2),
('D', 'B', 3),
('E', 'B', 3),
('F', 'C', 3),
('G', 'C', 3)
SELECT * FROM tree
Result:
A NULL 1
B A 2
C A 2
D B 3
E B 3
F C 3
G C 3
63. How do you reverse a string without using REVERSE (‘string’) ?
CREATE PROC rev (@string VARCHAR(50)) AS
BEGIN
    DECLARE @new_string VARCHAR(50) = ''
    DECLARE @len INT = LEN(@string)
    -- Walk the string from its last character to its first.
    WHILE (@len <> 0)
    BEGIN
        DECLARE @char CHAR(1) = SUBSTRING(@string, @len, 1)
        SET @new_string = @new_string + @char
        SET @len = @len - 1
    END
    PRINT @new_string
END
EXEC rev 'dinesh'
64. What is Deadlock?
○ Deadlock is a situation where, say there are two transactions, the two transactions are waiting for each other to release their locks.
○ SQL Server automatically picks which transaction should be killed, which becomes the deadlock victim, rolls back its changes and throws an error message for it.
65. What is a Fact Table?
The primary table in a dimensional model where the numerical performance measurements (or facts) of the business are stored so they can be summarized to provide information about the history of the operation of an organization.
We use the term fact to represent a business measure. The level of granularity defines the grain of the fact table.
66. What is a Dimension Table?
Dimension tables are highly denormalized tables that contain the textual descriptions of the business and facts in their fact table.
Since it is not uncommon for a dimension table to have 50 to 100 attributes and dimension tables tend to be relatively shallow in terms of the number of rows, they are also called a wide table.
A dimension table has to have a surrogate key as its primary key and has to have a business/alternate key to link between the OLTP and OLAP.
67. What are the types of Measures?
○ Additive: measures that can be added across all dimensions (cost, sales).
○ Semi-Additive: measures that can be added across few dimensions and not with others.
○ Non-Additive: measures that cannot be added across any dimension (stock rates).
68. What is a Star Schema?
69. What is a Snowflake Schema?
○ It is a data warehouse design where one or more dimensions are further normalized.
○ Number of dimensions > number of fact table foreign keys
○ Normalization reduces redundancy so storage wise it is better but querying can be affected due to the excessive joins that need to be performed.
70. What is granularity?
○ The lowest level of information that is stored in the fact table.
○ Usually determined by the time dimension table.
○ The best granularity level would be per transaction but it would require a lot of memory.
71. What is a Surrogate Key?
○ It is a system generated key that is an identity column with the initial value and incremental value and ensures the uniqueness of the data in the dimension table.
○ Every dimension table must have a surrogate key to identify each record!!!
72. What are some advantages of using the Surrogate Key in a Data Warehouse?
○ 1. Using a SK, you can separate the Data Warehouse and the OLTP: to integrate data coming from heterogeneous sources, we need to differentiate between similar business keys from the OLTP. The keys in OLTP are the alternate key (business key).
○ 2. Performance: The fact table will have a composite key. If surrogate keys are used, then in the fact table, we will have integers for its foreign keys.
■ This requires less storage than VARCHAR.
■ The queries will run faster when you join on integers rather than VARCHAR.
■ The partitioning done on SK will be faster as these are in sequence.
○ 3. Historical Preservation: A data warehouse acts as a repository of historical data so there will be various versions of the same record and in order to differentiate between them, we need a SK then we can keep the history of data.
○ 4. Special Situations (Late Arriving Dimension): Fact table has a record that doesn’t have a match yet in the dimension table. Surrogate key usage enables the use of such a ‘not found’ record as a SK is not dependent on the
They are deep.
○ 2. Dimensional Tables
They hold textual data.
They contain attributes of their fact tables.
They are wide.
74. What are the types of dimension tables?
○ 1. Conformed Dimensions
■ when a particular dimension is connected to one or more fact tables. ex) time dimension
○ 2. Parent-child Dimensions
■ A parent-child dimension is distinguished by the fact that it contains a hierarchy based on a recursive relationship.
■ when a particular dimension points to its own surrogate key to show an unary relationship.
○ 3. Role Playing Dimensions
■ when a particular dimension plays different roles in the same fact table. ex) dim_time and orderDateKey, shippedDateKey...usually a time dimension table.
■ Role-playing dimensions conserve storage space, save processing time, and improve database manageability.
○ 4. Slowly Changing Dimensions: A dimension table whose data changes slowly, through inserting and updating of records.
■ 1. Type 0: columns where changes are not allowed - no change ex) DOB, SSN
■ 2. Type 1: columns where its values can be replaced without adding its new row - replacement
■ 3. Type 2: for any change for the value in a column, a new record will be added - historical data. Previous values are saved in records marked as outdated. For even a single type 2 column, startDate, EndDate, and status are needed.
■ 4. Type 3: advanced version of type 2 where you can set up the upper limit of history which drops the oldest record when the limit has been reached with the help of outside SQL implementation.
■ Type 0 ~ 2 are implemented on the column level.
○ 5. Degenerated Dimensions: a particular dimension that has a one-to-one relationship between itself and the fact table.
■ When a particular Dimension table grows at the same rate as a fact table, the actual dimension can be removed and the dimensions from the dimension table can be inserted into the actual fact table.
■ You can see this mostly when the granularity level of the facts is per transaction.
■ E.g. The dimension salesorderdate (or other dimensions in DimSalesOrder) would grow every time a sale is made, therefore the dimension (attributes) would be moved into the fact table.
○ 6. Junk Dimensions: holds all miscellaneous attributes that may or may not necessarily belong to any other dimensions. It could be yes/no, flags, or long open-ended text data.
75. What is your strategy for the incremental load?
I have used a combination of different techniques for the incremental load in my previous projects: time stamps, CDC (Change Data Capture), MERGE statement and CHECKSUM() in TSQL, LEFT OUTER JOIN, TRIGGER, the Lookup Transformation in SSIS.
CDC (Change Data Capture) is a method to capture data changes, such as INSERT, UPDATE and DELETE, happening in a source table by reading transaction log files. Using CDC in the process of an incremental load, you are going to be able to store the changes in a SQL table, enabling us to apply the changes to a target table incrementally.
In data warehousing, CDC is used for propagating changes in the source system into your data warehouse, updating dimensions in a data mart, propagating standing data changes into your data warehouse and such.
The advantages of CDC are:
- It is almost real time ETL.
- It can handle small volume of data.
- It can be more efficient than replication.
- It can be auditable.
- Its clean up can be configured.
Disadvantages of CDC are:
- Lots of change tables and functions
- Bad for big changes e.g. truncate & reload
Optimization of CDC:
- Stop the capture job during load
- When applying changes to target, it is ideal to use merge.
77. What is the difference between a connection and session ?
○ Connection: It is the number of instances connected to the database. An instance is created as soon as the application is opened again.
○ Session: A session runs queries. One connection allows multiple sessions.
78. What are all different types of collation sensitivity?
Following are different types of collation sensitivity -
Case Sensitivity - A and a and B and b.
Accent Sensitivity.
Width Sensitivity - Single byte character and double byte character.
79. What is CLAUSE?
SQL clause is defined to limit the result set by providing condition to the query. This usually filters some rows from the whole set of records.
Example - Query that has WHERE condition. Query that has HAVING condition.
80. What is Union, minus and Intersect commands?
UNION operator is used to combine the results of two queries, and it eliminates duplicate rows from the combined result.
MINUS operator is used to return rows from the first query but not from the second query. Rows that also appear in the second query are removed, and the remaining rows from the first query are displayed as the result set.
INTERSECT operator is used to return rows returned by both the queries.
81.How to fetch common records from two tables?
Common records result set can be achieved by -
Select studentID from student INTERSECT Select StudentID from Exam
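The set operators from questions 80 and 81 can be compared side by side on the student/Exam query. The sketch assumes SQLite through Python's sqlite3 module; SQLite (like SQL Server) spells Oracle's MINUS as EXCEPT. The two single-column tables and their rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (studentID INTEGER)")
conn.execute("CREATE TABLE exam (studentID INTEGER)")
conn.executemany("INSERT INTO student VALUES (?)", [(1,), (2,), (3,), (3,)])
conn.executemany("INSERT INTO exam VALUES (?)", [(3,), (4,)])

def run(operator):
    # Both SELECT lists have one column each, as the set operators require.
    sql = (
        f"SELECT studentID FROM student {operator} "
        "SELECT studentID FROM exam ORDER BY studentID"
    )
    return [row[0] for row in conn.execute(sql)]

union_rows = run("UNION")          # duplicates removed
union_all_rows = run("UNION ALL")  # duplicates kept
intersect_rows = run("INTERSECT")  # common records only
except_rows = run("EXCEPT")        # in student but not in exam (Oracle: MINUS)

print(union_rows, union_all_rows, intersect_rows, except_rows)
```

Note that the single ORDER BY at the end sorts the combined result, not the second SELECT on its own.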
Select studentId from (Select rowno, studentId from student) where mod(rowno,2)=1
83. How to select unique records from a table?
Select unique records from a table by using DISTINCT keyword.
Select DISTINCT StudentID, StudentName from Student;
84.How to remove duplicate rows from table?
Step 1: Selecting Duplicate rows from table
Select rollno FROM Student a WHERE ROWID <>
(Select max (rowid) from Student b where a.rollno=b.rollno);
Step 2: Delete duplicate rows
Delete FROM Student a WHERE ROWID <>
(Select max (rowid) from Student b where a.rollno=b.rollno);
85.What is ROWID and ROWNUM in SQL?
RowID
1.ROWID is nothing but Physical memory allocation
2.ROWID is permanent to that row which identifies the address of that row.
3.ROWID is 16 digit Hexadecimal number which uniquely identifies the rows.
4.ROWID returns PHYSICAL ADDRESS of that row.
5. ROWID is automatically generated unique id of a row and it is generated at the time of insertion of row.
6. ROWID is the fastest means of accessing data.
ROWNUM:
86. How to find count of duplicate rows?
Select rollno, count (rollno) from Student
Group by rollno Having count (rollno)>1 Order by count (rollno) desc;
87.How to find Third highest salary in Employee table using self-join?
Select * from Employee a Where 3 = (Select Count (distinct Salary) from Employee b where a.salary<=b.salary);
88. How to display following using query?
*
**
***
We cannot use dual table to display output given above. To display output use any table. I am using Student table.
SELECT lpad ('*', ROWNUM,'*') FROM Student WHERE ROWNUM <4;
89. How to display Date in DD-MON-YYYY format?
Select to_char (Hire_date,'DD-MON-YYYY') Date_Format from Employee;
90. If marks column contain the comma separated values from Student table. How to calculate the count of that comma separated values?
Student Name Marks
Dinesh 30,130,20,4
Kumar 100,20,30
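No answer to question 90 appears here, so the following is only one common approach, not necessarily the one the author had in mind: count the commas by comparing the string length before and after removing them, then add one. The sketch assumes SQLite through Python's sqlite3 module; LENGTH and REPLACE are also available in Oracle. The column names StudentName and Marks are assumed from the sample table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (StudentName TEXT, Marks TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?)",
    [("Dinesh", "30,130,20,4"), ("Kumar", "100,20,30")],
)

# Number of values = number of commas + 1.
query = """
SELECT StudentName,
       LENGTH(Marks) - LENGTH(REPLACE(Marks, ',', '')) + 1 AS mark_count
FROM Student
ORDER BY StudentName
"""
mark_counts = dict(conn.execute(query).fetchall())
print(mark_counts)  # {'Dinesh': 4, 'Kumar': 3}
```

The formula assumes every row holds at least one value and has no trailing comma; an empty Marks string would be counted as one value.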
If you run a parallel import, the map tasks will execute your query with different values substituted in
for $CONDITIONS. One mapper may execute "select * from TblDinesh WHERE (salary>=0 AND
salary < 10000)", and the next mapper may execute "select * from TblDinesh WHERE (salary >=
10000 AND salary < 20000)" and so on.
60) Is it possible to use the sqoop --direct option with HBase?
This function is incompatible with direct import. But Sqoop can do bulk loading as opposed to direct
writes. To use bulk loading, enable it using --hbase-bulkload.
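The way $CONDITIONS is replaced by a different range for each mapper can be pictured with a short Python sketch. This only illustrates the idea and is not Sqoop's actual splitting code; the function name split_conditions and the equal-width ranges are assumptions.

```python
def split_conditions(column, low, high, num_mappers):
    """Split [low, high) into contiguous ranges, one WHERE fragment per mapper."""
    step = (high - low) // num_mappers
    clauses = []
    for i in range(num_mappers):
        start = low + i * step
        # The last mapper takes whatever remains up to the upper bound.
        end = high if i == num_mappers - 1 else start + step
        clauses.append(f"({column} >= {start} AND {column} < {end})")
    return clauses

query = "select * from TblDinesh WHERE $CONDITIONS"
for clause in split_conditions("salary", 0, 20000, 2):
    # Each mapper runs the same query with its own range substituted in.
    print(query.replace("$CONDITIONS", clause))
```

Because the ranges do not overlap, every row is read by exactly one mapper.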
55) Can Sqoop run without a Hadoop cluster?
To run Sqoop commands, Hadoop is a mandatory prerequisite. You cannot run sqoop commands
without the Hadoop libraries.
61) Can I configure two sqoop commands so that they are dependent on each other? Like if the
first sqoop job is successful, the second gets triggered. If the first fails, the second should not run?
No, it is not possible using sqoop commands, but you can use Oozie for this. Create an Oozie
workflow. Execute the second action only if the first action succeeds.
56) Is it possible to import a file in fixed column length from the database using sqoop import?
To import columns of a fixed length from any database you can use a free-form query like below
sqoop import --connect jdbc:oracle:* --username Dinesh --password pwd
-e "select substr(COL1,1,4000),substr(COL2,1,4000) from table where \$CONDITIONS"
62) What is UBER mode and where are the settings to enable it in Hadoop?
Normally mappers and reducers run in containers allocated by the ResourceManager (RM); the RM creates a
separate container for each mapper and reducer. The uber configuration allows mappers and reducers to run in the
same process as the ApplicationMaster (AM).
58) How to pass Sqoop command as file arguments in Sqoop?
To specify an options file, simply create an options file in a convenient location and pass it to the
command line via the --options-file argument.
eg: sqoop --options-file /users/homer/work/import.txt --table TEST
Apache Hive is an open source data warehouse system. Its queries are similar to SQL queries. We can use
Hive for analyzing and querying large data sets on top of Hadoop.
2) Why do we need Hive?
Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a
database-like fashion and write SQL-like queries. It is suitable for accessing and analyzing data in
Hadoop using SQL syntax.
6) What are the types of tables in Hive?
There are two types of tables in Hive : Internal Table (aka Managed Table) and External Table.
7) What kind of data warehouse application is suitable for Hive?
Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS put
restrictions on what Hive can do. Hive is most suitable for data warehouse applications
that analyze relatively static data, tolerate slower response times and have no rapid changes in data.
Hive does not provide fundamental features required for OLTP (Online Transaction Processing). Hive
is suitable for data warehouse applications on large data sets.
11) Is it possible to use the same metastore by multiple users, in case of embedded Hive?
No, it is not possible to use metastores in sharing mode. It is recommended to use a standalone real
database like MySQL or PostgreSQL.
12) If you run hive server, what are the available mechanisms for connecting to it from an application?
There are the following ways by which you can connect with the Hive Server
1. Thrift Client: Using Thrift you can call Hive commands from various programming languages e.g.
C++, Java, PHP, Python and Ruby.
2. JDBC Driver : It supports the Java protocol.
3. ODBC Driver: It supports the ODBC protocol.
14) Which classes are used by Hive to read and write HDFS files?
The following classes are used by Hive to read and write HDFS files:
TextInputFormat or HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text
file format.
SequenceFileInputFormat or SequenceFileOutputFormat: These 2 classes read/write data in Hadoop
SequenceFile format.
15) Give examples of the SerDe classes which Hive uses to serialize and deserialize data?
Hive currently uses these SerDe classes to serialize and deserialize data:
MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records like CSV,
tab-separated or control-A separated records (quoting is not supported yet).
ThriftSerDe: This SerDe is used to read or write Thrift serialized objects. The class file for the Thrift
object must be loaded first.
DynamicSerDe: This SerDe also reads or writes Thrift serialized objects, but it understands Thrift DDL
so the schema of the object can be provided at runtime. It also supports a lot of different protocols,
including TBinaryProtocol, TJSONProtocol, TCTLSeparatedProtocol (which writes data in delimited
records).
16) How do you write your own custom SerDe and what is the need for that?
In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read
their own data format instead of writing to it.
For example, the RegexDeserializer will deserialize the data using the configuration parameter regex,
and possibly a list of column names.
If your SerDe supports DDL (basically, SerDe with parameterized columns and column types), you
probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from
scratch. The reason is that the framework passes DDL to SerDe through Thrift DDL format, and it is
non-trivial to write a Thrift DDL parser.
Depending on the nature of data the user has, the inbuilt SerDe may not satisfy the format of the data.
So users need to write their own Java code to satisfy their data format requirements.
17) What is ObjectInspector functionality?
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of
the individual columns.
ObjectInspector provides a uniform way to access complex objects that can be stored in multiple
formats in the memory, including:
Instance of a Java class (Thrift or native Java)
A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to
represent Map)
A lazily-initialized object (For example, a Struct of string fields stored in a single Java string object
with starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java Object. The
ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the
internal fields inside the Object.
In simple terms, ObjectInspector functionality in Hive is used to analyze the internal structure of the
columns, rows, and complex objects. It allows access to the internal fields inside the objects.
18) What is the functionality of Query Processor in Apache Hive?
This component implements the processing framework for converting SQL to a graph of map or
reduce jobs and the execution time framework to run those jobs in the order of dependencies with the
help of metastore details.
19) What is the limitation of Derby database for Hive metastore?
With Derby database, you cannot have multiple connections or multiple sessions instantiated at the
same time.
Derby database runs in the local mode and it creates a lock file so that multiple users cannot access
Hive simultaneously.
20) What are managed and external tables?
In the external table, metadata is controlled by Hive but the actual data will be controlled by some
other application. So, when you delete a table accidentally, only the metadata will be lost and the
actual data will reside wherever it is.
UNIONTYPE: It represents a column which can have a value that can belong to any of the data types
of your choice.
The values in a column are hashed into a number of buckets which is defined by the user. It is a way to
avoid too many partitions or nested partitions while ensuring optimized query output.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
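How column values are hashed into a user-defined number of buckets can be sketched in Python. This is a simplified illustration that assumes integer keys whose hash is the value itself; the function names bucket_for and assign_buckets are hypothetical and this is not Hive's own code.

```python
def bucket_for(key, num_buckets):
    """Map an integer key to a bucket index in the range 0..num_buckets-1."""
    # Assumption: the hash of an integer key is the key itself; the mask keeps it non-negative.
    return (key & 0x7FFFFFFF) % num_buckets

def assign_buckets(keys, num_buckets):
    """Group keys by the bucket they fall into."""
    buckets = {i: [] for i in range(num_buckets)}
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return buckets

# Hypothetical customer ids spread over 4 buckets.
print(assign_buckets([1, 2, 3, 4, 5, 6, 7, 8], 4))
```

Two tables bucketed on the same key with the same number of buckets place equal keys in the same bucket index, which is what makes a bucket map join possible.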
1. Text File format
2. Sequence File format
3. Parquet
4. Avro
5. RC file format
6. ORC
ORC has got indexing on every block based on the statistics min, max, sum, count on columns so
when you query, it will skip the blocks based on the indexing.
31) How to access HBase tables from Hive?
Using the Hive-HBase storage handler, you can access the HBase tables from Hive and once you are
connected, you can query HBase using the SQL queries from Hive. You can also join multiple tables
in HBase from Hive and retrieve the result.
28) How is SerDe different from File format in Hive?
SerDe stands for Serializer and Deserializer. It determines how to encode and decode the field values
or the column values from a record, that is, how you serialize and deserialize the values of a column.
But file format determines how records are stored in key value format or how you retrieve the
records from the table.
29) What is RegexSerDe?
Regex stands for a regular expression. Whenever the fields of a record have to be picked out by
pattern matching, the fields are stored based on that pattern.
RegexSerDe is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
In the SerDeproperties, you have to define your input pattern and output fields. For example, you have
to get the column values from the line xyz/pq@def if you want to take xyz, pq and def separately.
To extract the pattern, you can use:
32) When running a JOIN query, I see out-of-memory errors?
This is usually caused by the order of JOIN tables. Instead of [FROM tableA a JOIN tableB b ON ],
try [FROM tableB b JOIN tableA a ON ]. NOTE that if you are using LEFT OUTER JOIN, you might
want to change to RIGHT OUTER JOIN. This trick usually solves the problem. The rule of thumb is:
always put the table with a lot of rows having the same value in the join key on the rightmost side of
the JOIN.
33) Did you use MySQL as Metastore and face errors like
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure?
This is usually caused by MySQL servers closing connections after the connection has been idle for some
time. Running the following command on the MySQL server will solve the problem [set global
wait_timeout=120]
ORC stores collections of rows in one file and within the collection the row data will be stored in a
columnar format. With columnar format, it is very easy to compress, thus reducing a lot of storage
cost.
While querying also, it queries the particular column instead of querying the whole row as the records
are stored in columnar format.
34) Does Hive support Unicode?
You can use Unicode strings in data or comments, but cannot use them for database, table or column
names.
It does not provide OLTP transaction support, only OLAP. If the application requires OLTP,
switch to a NoSQL database. HQL queries have higher latency, due to MapReduce.
38) Mention what are the different modes of Hive?
Depending on the size of data nodes in Hadoop, Hive can operate in two modes. These modes are
Local mode and MapReduce mode.
39) Mention what is (HS2) HiveServer2?
It is a server interface that performs the following functions:
It allows remote clients to execute queries against Hive.
It retrieves the results of the mentioned queries.
Some advanced features, based on Thrift RPC, in its latest version include multi-client concurrency
and authentication.
42) Explain how can you change a column data type in Hive?
You can change a column data type in Hive by using the command
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
43) Mention what is the difference between order by and sort by in Hive?
SORT BY will sort the data within each reducer. You can use any number of reducers for a SORT BY
operation.
ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER
BY in Hive uses a single reducer.
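The difference between the two clauses can be pictured with plain Python lists, where each inner list stands for the rows that reached one reducer. This is only a conceptual sketch; the function names and the sample numbers are made up.

```python
def sort_by(reducer_inputs):
    """SORT BY: every reducer sorts only its own rows, so there is no global order."""
    return [sorted(rows) for rows in reducer_inputs]

def order_by(reducer_inputs):
    """ORDER BY: all rows pass through a single reducer and come out totally ordered."""
    all_rows = [row for rows in reducer_inputs for row in rows]
    return sorted(all_rows)

# Hypothetical salaries that were distributed to two reducers.
inputs = [[5000, 1000], [4000, 2000]]
print(sort_by(inputs))   # sorted within each reducer only
print(order_by(inputs))  # one fully sorted result
```

With SORT BY the output files are each sorted but may overlap in range, which is why ORDER BY is needed when a single totally ordered result is required.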
Hadoop developers sometimes take an array as input and convert it into separate table rows. To convert
complex data types into the desired table format, we can use the explode function.
45) Mention how can you stop a partition from being queried?
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER
TABLE statement.
47) What is the default location where Hive stores table data?
hdfs://namenode_server/user/hive/warehouse
49) Can we run unix shell commands from Hive? Give example.
Yes, using the ! mark just before the command.
For example, !pwd at the hive prompt will print the current directory.
51) What is the importance of .hiverc file?
It is a file containing a list of commands that need to run when the Hive CLI starts. For example setting the
strict mode to be true etc.
52) What are the default record and field delimiters used for Hive text files?
The default record delimiter is \n
And the field delimiters are \001, \002, \003
54) How do you list all databases whose name starts with p?
SHOW DATABASES LIKE 'p.*'
56) How can you delete the DBPROPERTY in Hive?
There is no way you can delete the DBPROPERTY.
This can be done with following query
61) When you point a partition of a hive table to a new directory, what happens to the data?
The data stays in the old location. It has to be moved manually.
Write a query to insert a new column (new_col INT) into a hive table (htab) at a position before an
existing column (x_col)
No. It only reduces the number of files which becomes easier for namenode to manage.
66) What is a Table generating Function on hive?
A table generating function is a function which takes a single column as argument and expands it to
multiple columns or rows. Example: explode().
63) While loading data into a hive table using the LOAD DATA clause, how do you specify it is
a hdfs file and not a local file ?
By omitting the LOCAL clause in the LOAD DATA statement.
68) What is the difference between LIKE and RLIKE operators in Hive?
The LIKE operator behaves the same way as the regular SQL operators used in select queries.
Example: street_name like '%Chi'
But the RLIKE operator uses more advanced regular expressions which are available in Java.
Example: street_name RLIKE '.*(Chi|Oho).*' which will select any word which has either Chi or
Oho in it.
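The two operators can be compared with a small Python sketch built on the standard re module. The helpers sql_like and rlike are hypothetical names that only emulate the matching rules described above; Python and Java regular expressions agree for a simple pattern like this one.

```python
import re

def sql_like(value, pattern):
    """Emulate LIKE: % matches any run of characters, _ matches exactly one character."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    # LIKE must match the whole value.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

def rlike(value, pattern):
    """Emulate RLIKE: true when any part of the value matches the regular expression."""
    return re.search(pattern, value) is not None

for street in ["North Chi", "Chicago", "Ohoville", "Boston"]:
    print(street, sql_like(street, "%Chi"), rlike(street, ".*(Chi|Oho).*"))
```

LIKE '%Chi' accepts only values that end in Chi, while the RLIKE pattern accepts any value that contains Chi or Oho anywhere.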
64) If you omit the OVERWRITE clause while creating a hive table, what happens to files which
are new and files which already exist?
The new incoming files are just added to the target directory and the existing files are simply
overwritten. Other files whose name does not match any of the incoming files will continue to exist. If
69) Is it possible to create Cartesian join between 2 tables, using Hive?
No. As this kind of Join can not be implemented in map reduce
70) What should be the order of table size in a join query?
In a join query the smallest table should be taken in the first position and the largest table should be taken in
the last position.
75) What types of costs are associated in creating index on hive tables?
Indexes occupy space and there is a processing cost in arranging the values of the column on which
the index is created.
Give the command to see the indexes on a table. SHOW INDEX ON table_name
This will list all the indexes created on any of the columns in the table table_name.
76) What does /*+ STREAMTABLE(table_name) */ do?
It is a query hint to stream a table into memory before running the query. It is a query optimization
technique.
77) Can a partition be archived? What are the advantages and Disadvantages?
Yes. A partition can be archived. The advantage is that it decreases the number of files stored in the namenode
and the archived file can be queried using Hive. The disadvantage is that it will cause less efficient queries
and does not offer any space savings.
80) How do you specify the table creator name when creating a table in Hive?
The TBLPROPERTIES clause is used to add the creator name while creating a table. The
TBLPROPERTIES is added like this:
TBLPROPERTIES('creator'= 'Joan')
81) Which method has to be overridden when we use custom UDF in Hive?
Whenever you write a custom UDF in Hive, you have to extend the UDF class and you have to
override the evaluate() function.
82) Suppose I have installed Apache Hive on top of my Hadoop cluster using default metastore
configuration. Then, what will happen if we have multiple clients trying to access Hive at the
same time?
83) Is it possible to change the default location of a managed table?
Yes, it is possible to change the default location of a managed table. It can be achieved by using the
clause LOCATION [hdfs_path].
We can solve this problem of query latency by partitioning the table according to each month. So, for
each month we will be scanning only the partitioned data instead of whole data sets.
As we know, we can not partition an existing non-partitioned table directly. So, we will be taking
following steps to solve the very problem:
INSERT OVERWRITE TABLE partitioned_transaction PARTITION (month) SELECT cust_id,
amount, country, month FROM transaction_details;
Now, we can perform the query using each partition and therefore, decrease the query time.
87) How can you add a new partition for the month December in the above partitioned table?
For adding a new partition in the above table partitioned_transaction, we will issue the command given
below:
ALTER TABLE partitioned_transaction ADD PARTITION (month='Dec') LOCATION
'/partitioned_transaction';
88) What is the default maximum dynamic partition that can be created by a mapper/reducer?
How can you change it?
By default the maximum number of partitions that can be created by a mapper or reducer is set to 100.
One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode = value
89) I am inserting data into a table based on partitions dynamically. But, I received
an error FAILED ERROR IN SEMANTIC ANALYSIS: Dynamic partition strict mode
requires at least one static partition column. How will you remove this error?
To remove this error one has to execute the following commands:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
90) Suppose, I have a CSV file sample.csv present in temp directory with the following entries:
id first_name last_name email gender ip_address
1 Hugh Jackman [email protected] Male 136.90.241.52
2 David Lawrence [email protected] Male 101.177.15.130
3 Andy Hall [email protected] Female 114.123.153.64
4 Samuel Jackson [email protected] Male 89.60.227.31
5 Emily Rose [email protected] Female 119.92.21.19
How will you consume this CSV file into the Hive warehouse using built-in SerDe?
SerDe stands for serializer/deserializer. A SerDe allows us to convert the unstructured bytes into a
record that we can process using Hive. SerDes are implemented using Java. Hive comes with several
built-in SerDes and many other third-party SerDes are also available.
Hive provides a specific SerDe for working with CSV files. We can use this SerDe for the sample.csv
by issuing the following commands:
CREATE EXTERNAL TABLE sample (id int, first_name string, last_name string, email string,
gender string, ip_address string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE
LOCATION '/temp';
Now, we can perform any query on the table sample:
SELECT first_name FROM sample WHERE gender = 'male';
91) Suppose, I have a lot of small CSV files present in input directory in HDFS and I want to
create a single Hive table corresponding to these files. The data in these files are in the format:
{id, name, e-mail, country}. Now, as we know, Hadoop performance degrades when we use lots
of small files.
So, how will you solve this problem where we want to create a single Hive table for lots of small
files without degrading the performance of the system?
One can use the SequenceFile format which will group these small files together to form a single
sequence file. The steps that will be followed in doing so are as follows:
1. Create a temporary table:
CREATE TABLE temp_table (id INT, name STRING, email STRING, country STRING) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Load the data into temp_table:
INSERT OVERWRITE TABLE sample SELECT * FROM temp_table;
Hence, a single SequenceFile is generated which contains the data present in all of the input files and
therefore, the problem of having lots of small files is finally eliminated.
92) Can We Change settings within Hive Session? If Yes, How?
Yes, we can change the settings within a Hive session, using the SET command. It helps to change Hive
job settings for an exact query.
Example: The following commands show that buckets are occupied according to the table definition.
hive> SET hive.enforce.bucketing=true;
We can see the current value of any property by using SET with the property name. SET will list all
the properties with their values set by Hive.
hive> SET hive.enforce.bucketing;
hive.enforce.bucketing=true
And this list will not include the defaults of Hadoop. So we should use the below:
SET -v
It will list all the properties including the Hadoop defaults in the system.
Check the output of jps command on a new node
94) Explain the concatenation function in Hive with an example?
The CONCAT function joins the input strings. We can specify any number of strings separated by a
comma.
Example:
CONCAT ('It','-','is','-','a','-','eLearning','-','provider');
Output:
It-is-a-eLearning-provider
So, every time we delimit the strings by '-'. If the delimiter is common for all the strings, then Hive
provides another command, CONCAT_WS. In this case, we have to specify the delimiter first.
CONCAT_WS ('-','It','is','a','eLearning','provider');
Output: It-is-a-eLearning-provider.
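The relationship between CONCAT and CONCAT_WS described above can be mirrored with two small Python helpers. These are hypothetical stand-ins for the Hive functions, written only to show that both calls build the same string; NULL handling is not modelled.

```python
def concat(*parts):
    """Like CONCAT: join the arguments exactly as given, separators included."""
    return "".join(parts)

def concat_ws(separator, *parts):
    """Like CONCAT_WS: the separator is given once, as the first argument."""
    return separator.join(parts)

with_concat = concat("It", "-", "is", "-", "a", "-", "eLearning", "-", "provider")
with_concat_ws = concat_ws("-", "It", "is", "a", "eLearning", "provider")
print(with_concat)
print(with_concat_ws)
```

CONCAT_WS is the shorter form whenever the same delimiter is repeated between every pair of strings.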
To remove the trailing space
RTRIM('BHAVESH ');
In the Reverse function, characters are reversed in the string.
Example:
REVERSE('BHAVESH');
Output:
HSEVAHB
96) Explain process to access sub directories recursively in Hive queries?
By using the below commands we can access sub directories recursively in Hive
hive> Set mapred.input.dir.recursive=true;
hive> Set hive.mapred.supports.subdirectories=true;
Hive tables can be pointed to the higher level directory and this is suitable for the directory structure
which is like /data/country/state/city/
The above contains three lines of headers that we do not want to include in our Hive query. To skip header
lines from our tables in Hive, set a table property that will allow us to skip the header lines.
CREATE EXTERNAL TABLE employee (name STRING, job STRING, dob STRING, id INT,
salary INT)
TBLPROPERTIES("skip.header.line.count"="2");
98) What is the maximum size of string data type supported by hive? Mention the binary formats
Hive supports?
The maximum size of string data type supported by hive is 2 GB.
Hive supports the text file format by default and it supports the binary formats Sequence files, ORC
files, Avro Data files, Parquet files.
Sequence files: The general binary format; splittable, compressible and row oriented.
ORC files: The full form of ORC is optimized row columnar format. It is a record columnar file and
column oriented storage file. It divides the table into row splits. Each split stores the values of the first
column first, followed by the subsequent columns.
AVRO datafiles: The same as a sequence file (splittable, compressible and row oriented), except that it
also supports schema evolution and multilingual binding.
99) What is the precedence order of HIVE configuration?
We are using a precedence hierarchy for setting the properties:
SET command in Hive
The command line -hiveconf option
hive-site.xml
100) If you run a select * query in Hive, Why does it not run MapReduce?
The hive.fetch.task.conversion property of Hive lowers the latency of MapReduce overhead and, in
effect, when executing queries like SELECT, LIMIT, etc., it skips the MapReduce function.
102) Explain about the different types of join in Hive?
HiveQL has 4 different types of joins -
JOIN - Similar to Inner Join in SQL.
FULL OUTER JOIN - Combines the records of both the left and right outer tables that fulfil the join
condition.
LEFT OUTER JOIN - All the rows from the left table are returned even if there are no matches in the
right table.
RIGHT OUTER JOIN - All the rows from the right table are returned even if there are no matches in
the left table.
IP address and port of the metastore host
104) What happens on executing the below query? After executing the below query, if you modify the
column how will the changes be tracked?
Hive> CREATE INDEX index_bonuspay ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
A possible workaround is to create a temporary table with STORED AS TEXTFILE, then LOAD DATA
into it, and then copy data from this table to the ORC table.
Here is an example:
CREATE TABLE test_details_txt( visit_id INT, store_id SMALLINT) STORED AS TEXTFILE;
CREATE TABLE test_details_orc( visit_id INT, store_id SMALLINT) STORED AS ORC;
Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.txt' INTO TABLE test_details_txt;
Copy to ORC table
INSERT INTO TABLE test_details_orc SELECT * FROM test_details_txt;
CREATE EXTERNAL TABLE tableex(id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES
("input.regex" = "^(\\d+)~\\*(.*)$")
STORED AS TEXTFILE LOCATION '/user/myusername';
107) Is there any way to get the column name along with the output while executing any query in
Hive?
If we want to see the column names of the table in HiveQL, the following hive conf property should
be set to true. hive> set hive.cli.print.header=true;
If you prefer to see the column names always then update the $HOME/.hiverc file with the above
setting in the first line.
Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands it
contains, if any.
Apache Tez Engine is an extensible framework for building high-performance batch processing and
interactive data processing. It is coordinated by YARN in Hadoop. Tez improved the MapReduce
paradigm by increasing the processing speed and maintaining the MapReduce ability to scale to
petabytes of data.
Tez engine can be enabled in your environment by setting hive.execution.engine to tez:
set hive.execution.engine=tez;
2. Use Vectorization
Vectorization improves the performance by fetching 1,024 rows in a single operation instead of
fetching a single row each time. It improves the performance for operations like filter, join, aggregation,
etc.
Vectorization can be enabled in the environment by executing the below commands.
• set hive.vectorized.execution.enabled=true;
• set hive.vectorized.execution.reduce.enabled=true;
format when it comes to reading, writing, and processing the data. It uses techniques like predicate
push-down, compression, and more to improve the performance of the query.
4. Use Partitioning
With partitioning, data is stored in separate individual folders on HDFS. Instead of querying the
whole dataset, it will query the partitioned dataset.
1) Create Temporary Table and Load Data Into Temporary Table
2) Create Partitioned Table
5. Use Bucketing
The Hive table is divided into a number of partitions and is called Hive Partition. Hive Partition is
further subdivided into clusters or buckets and is called bucketing or clustering.
Cost-Based Query Optimization
Hive optimizes each query's logical and physical execution plan before submitting it for final execution.
However, this was not based on the cost of the query in the initial version of Hive.
In later versions of Hive, the query is optimized according to its cost (like which
types of join to be performed, how to order joins, the degree of parallelism, etc.).
109) How do I query from a horizontal output to vertical output?
There should be an easier way to achieve that using the explode function and selecting data separately for
the prev and next columns.
110) Is there a simple way to replace non numeric characters in Hive excluding - to allow only -ve
and +ve numbers?
114) I dropped and recreated a Hive external table, but no data is shown. So what should I do?
This is because the table you created is a partitioned table. The insert you ran would have created a
partition for partition_column='abcxyz'. When you drop and re-create the table, Hive loses the
information about the partition, it only knows about the table.
Run the command below to get Hive to re-create the partitions based on the data.
MSCK REPAIR TABLE user_dinesh.sampletable;
Part -5 Spark Interview Questions with Answers
1. Spark Architecture
Apache Spark follows a master/slave architecture with two main daemons and a cluster manager -
3. Difference Between RDD, Dataframe, Dataset?
Resilient Distributed Dataset (RDD)
RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an
immutable distributed collection of elements of your data, partitioned across nodes in your cluster that
can be operated in parallel with a low-level API that offers transformations and actions.
DataFrames (DF)
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is
organized into named columns, like a table in a relational database. Designed to make large data sets
processing even easier, DataFrame allows developers to impose a structure onto a distributed
collection of data, allowing higher-level abstraction; it provides a domain specific language API to
manipulate your distributed data.
Datasets (DS)
Starting in Spark 2.0, Dataset takes on two distinct API characteristics: a strongly-typed API and an
untyped API, as shown in the table below. Conceptually, consider DataFrame as an alias for a
collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object.
Every spark application has same fixed heap size and fixed number of cores for a spark executor. The spark.conf.set("spark.executor.memory", "2g")
heap size is what referred to as the Spark executor memory which is controlled with the spark.
executor.memory property of the -executor-memory flag. Every spark application will have one
28.Why RDD is an immutable ?
executor on each worker node. The executor memory is basically a measure on how much memory of
the worker node will the application utilize. Following are the reasons:
Immutable data is always safe to share across multiple processes as well as multiple threads.
25. What is an “Accumulator”? Since RDD is immutable we can recreate the RDD any time. (From lineage graph). If the computation
is time-consuming, in that we can cache the RDD, which results in a performance improvement.

“Accumulators” are Spark’s offline debuggers. Similar to “Hadoop Counters”, “Accumulators” provide the number of “events” in a program.
Accumulators are variables that can be added to through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections.
“aggregateByKey()” and “combineByKey()” use accumulators.

26. What is SparkContext?
SparkContext was used as the channel to access all Spark functionality.
The Spark driver program uses the SparkContext to connect to the cluster through a resource manager (YARN or Mesos). A SparkConf is required to create the SparkContext object; it stores configuration parameters such as appName (to identify your Spark driver), application, number of cores, and memory size of the executors running on the worker nodes.
In order to use the APIs of SQL, Hive, and Streaming, separate contexts need to be created.
Example:
Creating the SparkConf:
val conf = new SparkConf().setAppName("Project").setMaster("spark://master:7077")
Creating the SparkContext:
val sc = new SparkContext(conf)

29. What is Partitioner?
A partitioner is an object that defines how the elements in a key-value pair RDD are partitioned by key. It maps each key to a partition ID from 0 to numPartitions - 1. It captures the data distribution at the output. With the help of the partitioner, the scheduler can optimize future operations. The contract of the partitioner ensures that records for a given key reside on a single partition.
We should choose a partitioner to use for cogroup-like operations. If any of the RDDs already has a partitioner, we should choose that one. Otherwise, we use a default HashPartitioner.
There are three types of partitioners in Spark:
a) Hash Partitioner :- Hash partitioning attempts to spread the data evenly across the partitions based on the key.
b) Range Partitioner :- In the range partitioning method, tuples having keys within the same range will appear on the same machine.
c) Custom Partitioner
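For question 29, the mapping from a key to a partition ID can be illustrated without a cluster. The following is a minimal sketch in plain Scala; SimpleHashPartitioner and PartitionerDemo are hypothetical names used only for illustration and are not Spark classes, although the modulo-of-hashCode idea is the same one hash partitioning relies on.

```scala
// Hypothetical stand-in for a hash partitioner; not Spark's own class.
final case class SimpleHashPartitioner(numPartitions: Int) {
  require(numPartitions > 0, "numPartitions must be positive")

  // Map a key to a partition ID in the range 0 to numPartitions - 1.
  def getPartition(key: Any): Int =
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw // keep the ID non-negative
    }
}

object PartitionerDemo {
  val partitioner = SimpleHashPartitioner(3)
  val records = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4)

  // Records grouped by the partition their key maps to.
  val byPartition: Map[Int, Seq[(String, Int)]] =
    records.groupBy { case (key, _) => partitioner.getPartition(key) }
}
```

Both records with key "a" end up in the same partition, which is the contract described in the answer: all records for a given key reside on a single partition.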
With SparkSession, there is no need to create separate contexts in order to use the APIs of SQL, Hive, and Streaming, as SparkSession includes all the APIs.
Once the SparkSession is instantiated, we can configure Spark’s run-time config properties.
Example:
Creating the Spark session:
val spark = SparkSession.builder.appName("WorldBankIndex").getOrCreate()
Configuring properties:
spark.conf.set("spark.sql.shuffle.partitions", 6)

1. A DataFrame is a distributed collection of data. In DataFrames, data is organized in named columns.
2. They are conceptually similar to a table in a relational database, but with richer optimizations.
3. DataFrames support both SQL queries and the DataFrame API.
4. We can process both structured and unstructured data formats through them, such as Avro, CSV, Elasticsearch, and Cassandra. They also work with storage systems such as HDFS, Hive tables, MySQL, etc.
5. In DataFrames, optimization is handled by Catalyst (the Catalyst optimizer). There are general libraries available to represent trees. In four phases, DataFrame uses Catalyst tree transformation:
31. What is Dataset?
A Dataset is an immutable collection of objects that are mapped to a relational schema. Datasets are strongly typed.
At the core of the Dataset API is an encoder, which is responsible for converting between JVM objects and the tabular representation. The tabular representation is stored in Spark’s internal binary format, which allows operations to be carried out on serialized data and improves memory utilization. Spark also supports automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long) and Scala case classes. The Dataset API offers many functional transformations (e.g. map, flatMap, filter).

32. What are the benefits of Datasets?
1. Static typing: With the static typing feature of Datasets, a developer can catch errors at compile time (which saves time and costs).
2. Run-time safety: Dataset APIs are all expressed as lambda functions and JVM typed objects, so any mismatch of typed parameters is detected at compile time. Analysis errors can also be detected at compile time when using Datasets, hence saving developer time and costs.
3. Performance and optimization: Dataset APIs are built on top of the Spark SQL engine, which uses Catalyst to generate an optimized logical and physical query plan, providing space and speed efficiency.
4. For processing demands like high-level expressions, filters, maps, aggregation, averages, sums, SQL queries, columnar access, and the use of lambda functions on semi-structured data, Datasets are best.
5. Datasets provide rich semantics, high-level abstractions, and domain-specific APIs.

disk. This facilitates a fresh environment for every batch, and we don’t have to worry about metadata build-up.

35. What is the difference between DSM and RDD?
On the basis of several features, the difference between RDD and DSM (distributed shared memory) is:

i. Read
RDD - The read operation in RDD is either coarse-grained or fine-grained. Coarse-grained means we can transform the whole dataset but not an individual element of the dataset, while fine-grained means we can transform an individual element of the dataset.
DSM - The read operation in distributed shared memory is fine-grained.

ii. Write
RDD - The write operation in RDD is coarse-grained.
DSM - The write operation is fine-grained in a distributed shared memory system.

iii. Consistency
RDD - The consistency of RDD is trivial because it is immutable in nature. We cannot alter the content of an RDD, i.e. any change on an RDD is permanent. Hence, the level of consistency is very high.
DSM - The system guarantees that if the programmer follows the rules, the memory will be consistent. Also, the results of memory operations will be predictable.
DSM - Fault tolerance is achieved by a checkpointing technique which allows applications to roll back to a recent checkpoint rather than restarting.

v. Straggler Mitigation
Stragglers, in general, are tasks that take more time to complete than their peers. This can happen for many reasons, such as load imbalance, I/O blocks, garbage collection, etc.
An issue with stragglers is that when the parallel computation is followed by synchronizations such as reductions, all the parallel tasks have to wait for the others.
RDD - It is possible to mitigate stragglers in RDDs by using backup tasks.
DSM - Straggler mitigation is quite difficult to achieve.

vi. Behavior if not enough RAM
RDD - If there is not enough space to store an RDD in RAM, the RDD is shifted to disk.
DSM - In this type of system, performance decreases if the RAM runs out of storage.

36. What is Speculative Execution in Spark and how to enable it?
One more point is that speculative execution does not stop the slow-running task; it launches a new task in parallel.
Tabular Form:
Spark Property >> Default Value >> Description
spark.speculation >> false >> Enables (true) or disables (false) speculative execution of tasks.
spark.speculation.interval >> 100ms >> The time interval to use before checking for speculative tasks.
spark.speculation.multiplier >> 1.5 >> How many times slower a task is than the median to be considered for speculation.
spark.speculation.quantile >> 0.75 >> The fraction of tasks that must be complete before speculation is enabled for a particular stage.

The basic semantics of fault tolerance in Apache Spark is that all Spark RDDs are immutable. Spark remembers the dependencies between every RDD involved in the operations through the lineage graph created in the DAG, and in the event of any failure, Spark refers to the lineage graph to re-apply the same operations and complete the tasks.
There are two types of failures: worker failure or driver failure. If a worker fails, the executors on that worker node are killed, along with the data in their memory. Using the lineage graph, those tasks are completed on other worker nodes. The data is also replicated to other worker nodes to achieve fault tolerance. There are two cases:
• 1. Data received and replicated - Data is received from the source and replicated across worker nodes. In the case of any failure, the data replication helps achieve fault tolerance.
• 2. Data received but not yet replicated - Data is received from the source but buffered for replication. In the case of any failure, the data needs to be retrieved from the source.

For stream inputs based on receivers, the fault tolerance is based on the type of receiver:
1. Reliable receiver - Once the data is received and replicated, an acknowledgment is sent to the source. If the receiver fails, the source will not receive an acknowledgment for the received data. When the receiver is restarted, the source will resend the data to achieve fault tolerance.
2. Unreliable receiver - The received data is not acknowledged to the source. In case of any failure, the source will not know whether the data has been received or not, and it will not resend the data, so there is data loss.
To overcome this data loss scenario, Write Ahead Logging (WAL) was introduced in Apache Spark 1.2. With WAL enabled, the intention of the operation is first noted down in a log file, such that if the driver fails and is restarted, the noted operations in that log file can be applied to the data. For sources that read streaming data, like Kafka or Flume, receivers receive the data and store it in the executor's memory. With WAL enabled, the received data is also stored in the log files.
WAL can be enabled by performing the below:
Setting the checkpoint directory, by using streamingContext.checkpoint(path)
Enabling the WAL logging, by setting spark.streaming.receiver.writeAheadLog.enable to true.

38. Explain the difference between reduceByKey, groupByKey, aggregateByKey and combineByKey?
1. groupByKey:
groupByKey can cause out of disk problems as data is sent over the network and collected on the reduce workers.
sc.textFile("hdfs://").flatMap(line => line.split(" ")).map(word => (word, 1))
.groupByKey().map { case (word, counts) => (word, counts.sum) }

2. reduceByKey:
Data is combined at each partition, so only one output per key at each partition is sent over the network. reduceByKey requires combining all your values into another value with the exact same type.
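The difference in network traffic between groupByKey and reduceByKey (question 38) can be sketched with plain Scala collections. This is only an illustration under stated assumptions: ShuffleSketch and its two hard-coded partitions are hypothetical, and ordinary Seq values stand in for RDD partitions.

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // Two hypothetical partitions of (word, 1) pairs.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1),
    Seq("a" -> 1, "b" -> 1, "b" -> 1)
  )

  // groupByKey style: every record crosses the network unchanged.
  def shuffledWithoutCombine: Seq[Pair] = partitions.flatten

  // reduceByKey style: combine inside each partition first,
  // so at most one record per key leaves each partition.
  def shuffledWithCombine: Seq[Pair] =
    partitions.flatMap { part =>
      part.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq
    }

  // Final reduce-side merge; both routes give the same totals.
  def totals(records: Seq[Pair]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
}
```

In this sketch six records are shuffled without the per-partition combine and only four with it, while the final totals are identical.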
3. aggregateByKey:
Same as reduceByKey, but it takes an initial value.
It takes 3 parameters as input: 1) initial value 2) combiner logic function 3) merge function
Example:-
val inp = Seq("dinesh=70","kumar=60","raja=40","ram=60","dinesh=50","dinesh=80","kumar=40","raja=40")
val rdd = sc.parallelize(inp, 3)
val pairRdd = rdd.map(_.split("=")).map(x => (x(0), x(1)))
val initial_val = 0
val addOp = (intVal: Int, StrVal: String) => intVal + StrVal.toInt
val mergeOp = (p1: Int, p2: Int) => p1 + p2
val out = pairRdd.aggregateByKey(initial_val)(addOp, mergeOp)
out.collect.foreach(println)

4. combineByKey:
Example:-
//Function to merge the values within a partition. Add 1 to the # of entries and inp to the existing inp
val mergeValue = (PartVal: (Int, Double), inp: Double) => {
  (PartVal._1 + 1, PartVal._2 + inp)
}
//Function to merge across the partitions
val mergeCombiners = (PartOutput1: (Int, Double), PartOutput2: (Int, Double)) => {
  (PartOutput1._1 + PartOutput2._1, PartOutput1._2 + PartOutput2._2)
}
//Function to calculate the average. Personinps is a custom type
val CalculateAvg = (personinp: (String, (Int, Double))) => {
  val (name, (numofinps, inp)) = personinp
  (name, inp / numofinps)
}
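To show how the three combineByKey functions fit together, here is a minimal sketch in plain Scala that computes a per-key average. It makes several assumptions: CombineByKeySketch is a hypothetical object, ordinary Seq values stand in for RDD partitions, and createCombiner is an assumed first step (it does not appear in the example above) that turns the first value seen for a key into an accumulator.

```scala
object CombineByKeySketch {
  type Acc = (Int, Double) // (number of entries, running sum)

  // Assumed first step: turn the first value of a key into an accumulator.
  val createCombiner: Double => Acc = inp => (1, inp)
  // Merge one more value into an accumulator within a partition.
  val mergeValue: (Acc, Double) => Acc = (acc, inp) => (acc._1 + 1, acc._2 + inp)
  // Merge two accumulators coming from different partitions.
  val mergeCombiners: (Acc, Acc) => Acc = (a, b) => (a._1 + b._1, a._2 + b._2)

  // Fold one partition's records into per-key accumulators.
  def combinePartition(part: Seq[(String, Double)]): Map[String, Acc] =
    part.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
      m.updated(k, m.get(k).map(acc => mergeValue(acc, v)).getOrElse(createCombiner(v)))
    }

  // Merge the per-partition results, then compute the average per key.
  def averages(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    partitions.map(combinePartition)
      .foldLeft(Map.empty[String, Acc]) { (left, right) =>
        right.foldLeft(left) { case (m, (k, acc)) =>
          m.updated(k, m.get(k).map(prev => mergeCombiners(prev, acc)).getOrElse(acc))
        }
      }
      .map { case (k, (n, sum)) => (k, sum / n) }
}
```

Because the accumulator type (Int, Double) differs from the value type Double, this is a case that reduceByKey cannot express directly, which is the usual reason for reaching for combineByKey or aggregateByKey.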
mapPartitionsWithIndex is similar to mapPartitions(), but it provides a second parameter, index, which keeps track of the partition.

40. Explain fold() operation in Spark?
fold() is an action. It is a wide operation (i.e. it combines data across multiple partitions and outputs a single value). It takes as input a function which has two parameters of the same type and outputs a single value of the input type.
It is similar to reduce but has one more argument, the 'ZERO VALUE' (say, the initial value), which is used in the initial call on each partition.
def fold(zeroValue: T)(op: (T, T) ⇒ T): T
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
zeroValue: The initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
Op: an operator used to both accumulate results within a partition and combine results from different partitions
Example :
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
rdd1.fold(5)(_+_)
Output : Int = 35
The zero value 5 is applied once in each of the 3 partitions and once more when the partition results are combined, so the result is 15 + 4 * 5 = 35.

textFile() :
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
For example, sc.textFile("/home/hdadmin/wc-data.txt") creates an RDD in which each individual line is an element.
Most users are familiar with textFile.

wholeTextFiles() :
def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Rather than creating a basic RDD, wholeTextFiles() returns a pair RDD.
For example, if you have a few files in a directory, wholeTextFiles() creates a pair RDD with the file name with path as the key and the whole file content as a string as the value.
Example:-
val myfilerdd = sc.wholeTextFiles("/home/hdadmin/MyFiles")
val keyrdd = myfilerdd.keys
keyrdd.collect
val filerdd = myfilerdd.values
filerdd.collect

42. What is cogroup() operation?
> It's a transformation.
> It's defined in the class org.apache.spark.rdd.PairRDDFunctions
def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))]
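The grouping that cogroup() performs (question 42) can be sketched for two inputs with plain Scala collections. This is an illustration only: CogroupSketch is a hypothetical helper, not the Spark method, and it works on in-memory Seq values rather than RDDs.

```scala
object CogroupSketch {
  // For every key found on either side, collect the values of each side separately.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftVals = left.filter(_._1 == k).map(_._2)
      val rightVals = right.filter(_._1 == k).map(_._2)
      (k, (leftVals, rightVals))
    }.toMap
  }
}
```

A key that exists on only one side still appears in the result, paired with an empty collection for the other side.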
Returns the top k (largest) elements from this RDD as defined by the specified implicit Ordering[T] and maintains the ordering. This does the opposite of takeOrdered.
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
Returns the first k (smallest) elements from this RDD as defined by the specified implicit Ordering[T] and maintains the ordering. This does the opposite of top.
Example :
val myrdd1 = sc.parallelize(List(5,7,9,13,51,89))
myrdd1.top(3) //Array[Int] = Array(89, 51, 13)
myrdd1.takeOrdered(3) //Array[Int] = Array(5, 7, 9)

Output:
Seq[Int] = WrappedArray(15, 39, 49)
Seq[Int] = WrappedArray(95)
Seq[Int] = WrappedArray(78)

48. How to kill a running Spark application?
Get the application ID from the Spark scheduler, for instance application_743159779306972_1234, and then run the command in a terminal like below:
yarn application -kill application_743159779306972_1234

49. How to stop INFO messages displaying on the Spark console?
Edit the Spark conf/log4j.properties file and change the following line:
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
For Spark 2.x, import and run the commands below:
import org.apache.log4j.{Logger,Level}
Logger.getLogger("org").setLevel(Level.ERROR)

Example:-
scala> customerDF.show
+---+--------+---+------+
|cId|    name|age|gender|
+---+--------+---+------+
|  1|   James| 21|     M|
|  2|     Liz| 25|     F|
|  3|    John| 31|     M|
|  4|Jennifer| 45|     F|
|  5|  Robert| 41|     M|
+---+--------+---+------+
scala> custDF.show
+---+-----+---+------+
|cId| name|age|gender|
+---+-----+---+------+
scala> customerDF.except(custDF).show
+---+--------+---+------+
|cId|    name|age|gender|
+---+--------+---+------+
|  5|  Robert| 41|     M|
|  3|    John| 31|     M|
|  4|Jennifer| 45|     F|
|  6|  Sandra| 45|     F|
+---+--------+---+------+

52. What are security options in Apache Spark?
Spark currently supports authentication via a shared secret. Authentication can be turned on via the spark.authenticate configuration parameter. This parameter controls whether the Spark communication protocols do authentication using the shared secret. This authentication is a basic handshake to make sure both sides have the same shared secret and are allowed to communicate. If the shared secret is not identical, they will not be allowed to communicate. The shared secret is created as follows:
For Spark on YARN deployments, configuring spark.authenticate to true will automatically handle generating and distributing the shared secret. Each application will use a unique shared secret.
For other types of Spark deployments, the Spark parameter spark.authenticate.secret should be configured on each of the nodes. This secret will be used by all the Master/Workers and applications.

53. What is Scala?

54. What are the Types of Variable Expressions available in Scala?
val (aka Values):
You can name the results of expressions with the val keyword. Referencing a value does not re-compute it.
var (aka Variables):
Variables are like values, except you can re-assign them. You can define a variable with the var keyword.
Example:
var x = 1 + 1
x = 3 // This compiles.

55. What is the difference between methods and functions in Scala?
Methods:-
Methods look and behave very similar to functions, but there are a few key differences between them.
Methods are defined with the def keyword. def is followed by a name, parameter lists, a return type, and a body.
Example:
def add(x: Int, y: Int): Int = x + y
println(add(1, 2)) // 3

Functions:-
Functions are expressions that take parameters.
You can define an anonymous function (i.e. no name) that returns a given integer plus one:
(x: Int) => x + 1
You can also name functions, like
val addOne = (x: Int) => x + 1
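As a small sketch of one practical difference from question 55: a function is a value that can be stored and passed around, while a method has to be converted into a function value first, which the compiler does when a function type is expected. MethodVsFunction and addAsFunction are hypothetical names used only for this illustration.

```scala
object MethodVsFunction {
  // A method: defined with def, it belongs to the enclosing object.
  def add(x: Int, y: Int): Int = x + y

  // A function value: an object that can be stored and passed around.
  val addOne: Int => Int = (x: Int) => x + 1

  // The method is converted to a function value because a function type is expected.
  val addAsFunction: (Int, Int) => Int = add

  // Function values can be passed to higher-order methods such as map.
  val incremented: List[Int] = List(1, 2, 3).map(addOne)
}
```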
59. What is Companion objects in scala?
An object with the same name as a class is called a companion object. Conversely, the class is the object’s companion class. A companion class or object can access the private members of its companion. Use a companion object for methods and values which are not specific to instances of the companion class.
Example:
import scala.math._
case class Circle(radius: Double) { import Circle._

AnyVal:
AnyVal represents value types. There are nine predefined value types and they are non-nullable: Double, Float, Long, Int, Short, Byte, Char, Unit, and Boolean. Unit is a value type which carries no meaningful information. There is exactly one instance of Unit, which can be declared literally like so: (). All functions must return something, so sometimes Unit is a useful return type.

AnyRef:

Nothing is a subtype of all types, also called the bottom type. There is no value that has type Nothing. A common use is to signal non-termination such as a thrown exception, program exit, or an infinite loop (i.e., it is the type of an expression which does not evaluate to a value, or a method that does not return normally).

Null:
Null is a subtype of all reference types (i.e. any subtype of AnyRef). It has a single value identified by the keyword literal null. Null is provided mostly for interoperability with other JVM languages and should almost never be used in Scala code.

Higher order functions take other functions as parameters or return a function as a result. This is possible because functions are first-class values in Scala. The terminology can get a bit confusing at this point, and we use the phrase “higher order function” for both methods and functions that take functions as parameters or that return a function.
Higher order functions, i.e. functions that accept other functions and functions that return functions, help to reduce redundant code.
One of the most common examples is the higher-order function map, which is available for collections in Scala.
val salaries = Seq(20000, 70000, 40000)
val doubleSalary = (x: Int) => x * 2
val newSalaries = salaries.map(doubleSalary) // List(40000, 140000, 80000)

63. What is Pattern matching in Scala?
Pattern matching is a mechanism for checking a value against a pattern. A successful match can also deconstruct a value into its constituent parts. It is a more powerful version of the switch statement in Java and it can likewise be used in place of a series of if/else statements.
Syntax:
import scala.util.Random
val x: Int = Random.nextInt(10)
x match {
  case 0 => "zero"
  case 1 => "one"
  case 2 => "two"
  case _ => "many"
}

def matchTest(x: Int): String = x match {
  case 1 => "one"
  case 2 => "two"
  case _ => "many"
}
matchTest(3) // many
matchTest(1) // one
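The examples for question 63 match on literal values only. The following sketch shows the other half of the description, deconstructing a value into its parts. Shape, Circle, Rectangle and area are hypothetical names chosen for the illustration.

```scala
object MatchSketch {
  sealed trait Shape
  final case class Circle(radius: Double) extends Shape
  final case class Rectangle(width: Double, height: Double) extends Shape

  // Each case both checks the type and extracts the fields of the value.
  def area(shape: Shape): Double = shape match {
    case Circle(r)       => math.Pi * r * r
    case Rectangle(w, h) => w * h
  }
}
```

Because Shape is sealed, the compiler can check that the match covers every subtype, so no catch-all case is needed here.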
Driver Node: The node that initiates the Spark session. Typically, this will be the server where the context is located.
Driver (Executor): The Driver Node will also show up in the Executor list.

65. What are the Configuration properties in Spark?
spark.executor.memory:- The maximum possible is managed by the YARN cluster, which cannot exceed the actual RAM available.
spark.executor.cores:- Number of cores assigned per executor, which cannot be higher than the cores available in each worker.
spark.executor.instances:- Number of executors to start. This property is acknowledged by the cluster if spark.dynamicAllocation.enabled is set to “false”.
spark.memory.fraction:- The default is set to 60% of the requested memory per executor.
spark.dynamicAllocation.enabled:- Overrides the mechanism that Spark provides to dynamically adjust resources. Disabling it provides more control over the number of executors that can be started, which in turn impacts the amount of storage available for the session. For more information, please see the Dynamic Resource Allocation page on the official Spark website.

66. What is Sealed classes?
Traits and classes can be marked sealed, which means all subtypes must be declared in the same file.
This is useful for pattern matching because we don’t need a “catch all” case. This assures that all subtypes are known.
Example:
sealed abstract class Furniture
case class Couch() extends Furniture
case class Chair() extends Furniture
def findPlaceToSit(piece: Furniture): String = piece match {
  case a: Couch => "Lie on the couch"
  case b: Chair => "Sit on the chair"
}

68. When not to rely on default type inference?
When a variable such as obj is initialized with null, the type inferred for obj is Null. Since the only value of that type is null, it is impossible to assign a different value later.

69. How can we debug spark application locally?
Yes, we can debug locally: setting breakpoints, inspecting variables, etc. Submit the Spark application in local mode like below:
spark-submit --name CodeTestDinesh --class DineshMainClass --master local[2] DineshApps.jar
Then make the Spark driver pause and wait for a connection from a debugger when it starts up, by adding an option like the following:
--conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
where agentlib:jdwp is the Java Debug Wire Protocol option, followed by a comma-separated list of sub-options:
1. transport: defines the connection protocol used between debugger and debuggee, either socket or "shared memory". You almost always want socket (dt_socket), except possibly in some cases on Microsoft Windows.
2. server: whether this process should be the server when talking to the debugger (or conversely, the client). You always need one server and one client. In this case, we're going to be the server and wait for a connection from the debugger.
3. suspend: whether to pause execution until a debugger has successfully connected. We turn this on so the driver won't start until the debugger connects.
4. address: here, this is the port to listen on (for incoming debugger connection requests). You can set it to any available port (you just have to make sure the debugger is configured to connect to this same port).

70. Map collection has Key and value then Key should be mutable or immutable?
The behavior of a Map is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is a key in the map. So the key should be immutable.
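The problem described in question 70 can be reproduced with a small sketch. It assumes java.util.HashMap as the map implementation and a hypothetical MutableKey class whose hashCode and equals depend on a mutable field; both are illustration choices, not part of the original answer.

```scala
import java.util.{HashMap => JHashMap}

object MutableKeySketch {
  // Hypothetical key type whose equality and hash depend on a mutable field.
  final class MutableKey(var id: Int) {
    override def hashCode: Int = id
    override def equals(other: Any): Boolean = other match {
      case k: MutableKey => k.id == id
      case _             => false
    }
  }

  // Returns (found before mutation, found after mutation).
  def demo(): (Boolean, Boolean) = {
    val key = new MutableKey(1)
    val map = new JHashMap[MutableKey, String]()
    map.put(key, "value")
    val before = map.containsKey(key)
    key.id = 2 // the entry is still filed under the old hash code
    val after = map.containsKey(key)
    (before, after)
  }
}
```

After the mutation the lookup uses the new hash code while the entry is still stored under the old one, so the same key object can no longer be found. An immutable key (for example a case class with val fields) avoids this.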
74. Apache Spark vs. Apache Storm?
Apache Spark is an in-memory distributed data analysis platform, primarily targeted at speeding up batch analysis jobs, iterative machine learning jobs, interactive query and graph processing.
One of Spark's primary distinctions is its use of RDDs or Resilient Distributed Datasets. RDDs are great for pipelining parallel operators for computation and are, by definition, immutable, which allows Spark a unique form of fault tolerance based on lineage information. If you are interested in, for example, executing a Hadoop

There is another file system called Tachyon. It is an in-memory file system used to run Spark in distributed mode.

78. Define about generic classes in scala?
Generic classes are classes which take a type as a parameter. They are particularly useful for collection classes.
Generic classes take a type as a parameter within square brackets []. One convention is to use the letter A as the type parameter identifier, though any parameter name may be used.
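A minimal sketch of a generic class for question 78, assuming a Stack backed by an immutable List. The field and method names (elements, push, peek, pop) are illustration choices and may differ from the class the original example had in mind.

```scala
// A generic stack; A is the type parameter chosen when the stack is created.
class Stack[A] {
  private var elements: List[A] = Nil

  // Put a new element on top of the stack.
  def push(x: A): Unit = elements = x :: elements

  // Look at the top element without removing it.
  def peek: A = elements.head

  // Remove and return the top element.
  def pop(): A = {
    val currentTop = peek
    elements = elements.tail
    currentTop
  }
}
```

With this definition a Stack[Int] accepts only Int values, and a call such as push("text") on it is rejected by the compiler.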
Example: The instance stack can only take Int values.
val stack = new Stack[Int]
stack.push(1)
stack.push(2)
println(stack.pop()) // prints 2
println(stack.pop()) // prints 1

79. How to enable tungsten sort shuffle in Spark 2.x?
SortShuffleManager is the one and only ShuffleManager in Spark, with the short name sort or tungsten-sort.
In other words, there's no way you could use any other ShuffleManager but SortShuffleManager (unless you enabled one using the spark.shuffle.manager property).

80. How to prevent Spark Executors from getting Lost when using YARN client mode?
If you're using YARN, the solution is to set
--conf spark.yarn.executor.memoryOverhead=600
Alternatively, if your cluster uses Mesos, you can try
--conf spark.mesos.executor.memoryOverhead=600
instead.

81. What is the relationship between the YARN Containers and the Spark Executors?
The first important fact is that the number of containers will always be the same as the number of executors created by a Spark application, e.g. via the --num-executors parameter in spark-submit.
Every container always allocates at least the amount of memory set by yarn.scheduler.minimum-allocation-mb. This means that if the parameter --executor-memory is set to e.g. only 1g but yarn.scheduler.minimum-allocation-mb is e.g. 6g, the container is much bigger than needed by the Spark application.
The other way round, if the parameter --executor-memory is set to something higher than the yarn.scheduler.minimum-allocation-mb value, e.g. 12g, the container will allocate more memory dynamically, but only if the requested amount of memory is smaller than or equal to the yarn.scheduler.maximum-allocation-mb value.
The value of yarn.nodemanager.resource.memory-mb determines how much memory can be allocated in sum by all containers of one host!
So setting yarn.scheduler.minimum-allocation-mb allows you to run smaller containers, e.g. for smaller executors (else it would be a waste of memory).
Setting yarn.scheduler.maximum-allocation-mb to the maximum value (e.g. equal to yarn.nodemanager.resource.memory-mb) allows you to define bigger executors (more memory is allocated if needed, e.g. by the --executor-memory parameter).

82. How to allocate the memory sizes for the spark jobs in cluster?
Before answering this question we have to concentrate on 3 main settings, which are:
1. NoOfExecutors,
2. Executor-memory and
3. Number of executor-cores
Let's go with an example: imagine we have a cluster with six nodes running NodeManagers, each with 16 cores and 64GB RAM. The NodeManager sizes, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We never allocate 100% of each resource to YARN containers because the node needs some resources to run the OS processes and Hadoop. In this case, we leave a gigabyte and a core for these system processes.
Cloudera Manager helps by accounting for these and configuring these YARN properties automatically. So the allocation would likely be chosen as --num-executors 6 --executor-cores 15 --executor-memory 63G.
However, this is the wrong approach because: 63GB plus the executor memory overhead won’t fit within the 63GB RAM of the NodeManagers. The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node. 15 cores per executor can lead to bad HDFS I/O throughput. So the best option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G.
This configuration results in three executors on all nodes except for the one with the Application Master, which will have two executors (6 * 3 - 1 = 17). --executor-memory was derived as 63 / 3 executors per node = 21; 21 * 0.07 = 1.47; 21 - 1.47 ~ 19.

83. How can tab autocompletion be enabled in pyspark?
Please import the below libraries in the pyspark shell:
import rlcompleter, readline
readline.parse_and_bind("tab: complete")

84. Can we execute two transformations on the same RDD in parallel in Apache Spark?
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so actions will be evaluated sequentially.
It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures) with correct configuration of the in-application scheduler or explicitly limited resources for each action.
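The non-blocking submission mentioned in question 84 can be sketched with Scala Futures. This is a plain-Scala illustration under stated assumptions: ConcurrentActionsSketch is hypothetical, and a local Vector stands in for an RDD, so sum and max play the role of two independent actions.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ConcurrentActionsSketch {
  // Stand-in for a cached dataset; in Spark this would be an RDD.
  val data: Vector[Int] = (1 to 100).toVector

  // Two independent "actions" submitted without blocking the caller.
  def run(): (Int, Int) = {
    val sumF: Future[Int] = Future(data.sum)
    val maxF: Future[Int] = Future(data.max)
    // Block only once, when both results are needed.
    Await.result(sumF.zip(maxF), 10.seconds)
  }
}
```

In a real Spark application the same pattern applies to actions on an RDD, subject to the scheduler configuration and resource limits mentioned in the answer.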
Regarding cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.

85. Which cluster type should I choose for Spark?
• Standalone - meaning Spark will manage its own cluster
• YARN - using Hadoop's YARN resource manager
• Mesos - Apache's dedicated resource manager project
Start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.
One advantage of Mesos over both YARN and standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users are running interactive shells.
In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.

88. What is Stateful Transformation?
A transformation that uses data or intermediate results from previous batches to compute the result of the current batch is called a stateful transformation. Stateful transformations are operations on DStreams that track data across time. Thus they make use of some data from previous batches to generate the results for a new batch.
In streaming, if we have a use case to track data across batches, then we need stateful DStreams.
For example, we may track a user’s interaction with a website during the user session, or we may track a particular Twitter hashtag across time and see which users across the globe are talking about it.
Types of stateful transformation:
1. Stateful DStreams are of two types: window based tracking and full session tracking.
For stateful tracking, all incoming data should be transformed to key-value pairs such that the key states can be tracked across batches. This is a precondition.
2. Window based tracking
In window based tracking the incoming batches are grouped in time intervals, i.e. batches are grouped every ‘x’ seconds. Further computations on these batches are done using slide intervals.

Part – 6 TOP 250+ Interviews Questions on AWS
Q1) What is AWS?
Answer: AWS stands for Amazon Web Services. AWS is a platform that provides on-demand resources for hosting web services, storage, networking, databases and other resources over the internet with pay-as-you-go pricing.

Q2) What are the components of AWS?
Answer: EC2 – Elastic Compute Cloud, S3 – Simple Storage Service, Route53, EBS – Elastic Block Store, Cloudwatch, and Key-pairs are a few of the components of AWS.

Q3) What are key-pairs?
Answer: Key-pairs are secure login information for your instances/virtual machines. To connect to the instances we use key-pairs that contain a public key and a private key.

Q4) What is S3?
Answer: S3 stands for Simple Storage Service. It is a storage service that provides an interface that you can use to store any amount of data, at any time, from anywhere in the world. With S3 you pay only for what you use and the payment model is pay-as-you-go.

Q5) What are the pricing models for EC2 instances?
Answer: The different pricing models for EC2 instances are as below,

Q6) What are the types of volumes for EC2 instances?
Answer: There are two types of volumes,
• Instance store volumes
• EBS – Elastic Block Stores

Q7) What are EBS volumes?
Answer: EBS stands for Elastic Block Stores. They are persistent volumes that you can attach to the instances. With EBS volumes, your data will be preserved even when you stop your instances, unlike your instance store volumes where the data is deleted when you stop the instances.

Q8) What are the types of volumes in EBS?
Answer: Following are the types of volumes in EBS,
• General purpose
• Provisioned IOPS
• Magnetic
• Cold HDD
• Throughput optimized

Q9) What are the different types of instances?
Answer: Following are the types of instances,
• General purpose
• Compute Optimized
• Storage Optimized
• Memory Optimized
• Accelerated Computing

Q10) What is auto-scaling and what are the components?
Answer: Auto scaling allows you to automatically scale up and scale down the number of instances depending on the CPU utilization or memory utilization. There are 2 components in Auto scaling: Auto-scaling groups and Launch Configuration.

Q11) What are reserved instances?
Answer: Reserved instances let you reserve a fixed capacity of EC2 instances. With reserved instances you will have to get into a contract of 1 year or 3 years.

Q13) What is an EIP?
Answer: EIP stands for Elastic IP address. It is designed for dynamic cloud computing. When you want your instances to keep a static IP address when you stop and restart them, you will be using an EIP address.

Q14) What is Cloudwatch?
Answer: Cloudwatch is a monitoring tool that you can use to monitor your various AWS resources, such as health checks, network, applications, etc.

Q15) What are the types in cloudwatch?
Answer: There are 2 types in cloudwatch: basic monitoring and detailed monitoring. Basic monitoring is free and detailed monitoring is chargeable.

Q16) What are the cloudwatch metrics that are available for EC2 instances?
Answer: Diskreads, Diskwrites, CPU utilization, networkpacketsIn, networkpacketsOut, networkIn, networkOut, CPUCreditUsage, CPUCreditBalance.

Q17) What is the minimum and maximum size of individual objects that you can store in S3
Q19) What is the default storage class in S3?
Answer: The default storage class in S3 is Standard (frequently accessed).
Q20) What is glacier?
Answer: Glacier is the backup or archival tool that you use to back up your data in S3.
Q21) How can you secure the access to your S3 bucket?
Answer: There are two ways that you can control the access to your S3 buckets,
• ACL – Access Control List
• Bucket policies
Q22) How can you encrypt data in S3?
Answer: You can encrypt the data by using the below methods,
• Server Side Encryption – S3 (AES 256 encryption)
• Server Side Encryption – KMS (Key Management Service)
• Server Side Encryption – C (customer-provided keys)
Q23) What are the parameters for S3 pricing?
Answer: The pricing model for S3 is as below,
• Storage used
• Number of requests you make
• Storage management
• Data transfer
• Transfer acceleration
Q24) What is the pre-requisite to work with Cross region replication in S3?
Answer: You need to enable versioning on both the source and destination buckets to work with cross
region replication. Also, the source and destination buckets should be in different regions.
Q25) What are roles?
Q27) What is cloudfront?
Answer: CloudFront is an AWS web service that provides businesses and application developers an
easy and efficient way to distribute their content with low latency and high data transfer speeds.
CloudFront is the content delivery network of AWS.
Q28) What are edge locations?
Answer: An edge location is the place where content is cached. When a user tries to access
some content, the content is looked up in the edge location. If it is not available there, the content
is fetched from the origin location and a copy is stored in the edge location.
Q29) What is the maximum individual archive that you can store in glacier?
Answer: You can store an individual archive of up to 40 TB.
Q30) What is VPC?
Answer: VPC stands for Virtual Private Cloud. VPC allows you to easily customize your networking
configuration. A VPC is a network that is logically isolated from other networks in the cloud. It allows
you to have your own IP address range, subnets, internet gateways, NAT gateways and security
groups.
Q31) What is VPC peering connection?
Answer: A VPC peering connection allows you to connect one VPC with another VPC. Instances in
these VPCs behave as if they are in the same network.
Q32) What are NAT gateways?
Answer: NAT stands for Network Address Translation. NAT gateways enable instances in a private
subnet to connect to the internet but prevent the internet from initiating a connection with those
instances.
Q33) How can you control the security to your VPC?
Answer: You can use security groups and NACLs (Network Access Control Lists) to control the
security of your VPC.
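To make the VPC answers above (Q30 onwards) more concrete, here is a minimal sketch of carving a VPC address range into subnets with Python's standard `ipaddress` module. The CIDR block, the prefix length and the helper name `plan_subnets` are illustrative assumptions, not AWS APIs; the figure of five reserved addresses per subnet is the one given in Q52 of this guide.

```python
import ipaddress
from itertools import islice
from typing import List, Tuple

# Assumption taken from Q52 of this guide: AWS reserves 5 addresses in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def plan_subnets(vpc_cidr: str, new_prefix: int, count: int) -> List[Tuple[str, int]]:
    """Return the first `count` subnets of the VPC range with their usable address counts."""
    vpc = ipaddress.ip_network(vpc_cidr)
    plan = []
    # subnets() yields equally sized, non-overlapping chunks of the VPC range.
    for subnet in islice(vpc.subnets(new_prefix=new_prefix), count):
        usable = subnet.num_addresses - AWS_RESERVED_PER_SUBNET
        plan.append((str(subnet), usable))
    return plan


if __name__ == "__main__":
    # Hypothetical VPC range split into two /24 subnets.
    for cidr, usable in plan_subnets("10.0.0.0/16", 24, 2):
        print(cidr, usable)
```

Whether a subnet is public or private is not decided by its address range but by its route table, as Q54 describes.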
Q34) What are the different types of storage gateway?
Answer: Following are the types of storage gateway.
• File gateway
• Volume gateway
• Tape gateway
Q35) What is a snowball?
Answer: Snowball is a data transport solution that uses secure appliances to transfer large amounts of
data into and out of AWS. Using Snowball, you can move huge amounts of data from one place to
another, which reduces your network costs and long transfer times and also provides better security.
Q40) What is the maximum size of messages in SQS?
Answer: The maximum size of messages in SQS is 256 KB.
Q41) What are the types of queues in SQS?
Answer: There are 2 types of queues in SQS.
• Standard queue
• FIFO (First In First Out)
Q42) What is multi-AZ RDS?
Answer: Multi-AZ (Availability Zone) RDS allows you to have a replica of your production database
in another availability zone. A Multi-AZ database is used for disaster recovery. You
will have an exact copy of your database, so when your primary database goes down, your
application will automatically fail over to the standby database.
Q43) What are the types of backups in RDS database?
Answer: There are 2 types of backups in RDS database.
Answer: Following are the two types of access that you can create.
• Programmatic access
• Console access
Q48) What are the benefits of auto scaling?
Answer: Security groups act as a firewall that controls the traffic for one or more instances. You can
associate one or more security groups with your instances when you launch them. You can add rules to
each security group that allow traffic to and from its associated instances. You can modify the rules of
a security group at any time; the new rules are automatically and immediately applied to all the
instances that are associated with the security group.
Q50) What are shared AMI’s?
Answer: Shared AMIs are AMIs that are created by other developers and made available for other
developers to use.
Q51) What is the difference between the classic load balancer and application load balancer?
Answer: The Application Load Balancer supports dynamic port mapping and multiple ports with
multiple listeners; the Classic Load Balancer provides one listener per port.
Q52) By default how many IP addresses does AWS reserve in a subnet?
Answer: 5
Q53) What is meant by subnet?
Answer: A large range of IP addresses divided into chunks; each chunk is known as a subnet.
Q54) How can you convert a public subnet to private subnet?
Answer: Remove the IGW route and add a NAT gateway, then associate the subnet with a private route table.
Q55) Is it possible to reduce an EBS volume?
Answer: No, it’s not possible. We can increase the size of a volume but not reduce it.
Q56) What is the use of elastic IPs, and are they charged by AWS?
Answer: These are IPv4 addresses which are used to connect to the instance from the internet. They
are charged when they are not attached to an instance.
Q57) One of my S3 buckets is deleted but I need to restore it. Is there any possible way?
Answer: If versioning is enabled, we can easily restore the deleted objects.
Q58) When I try to launch an EC2 instance I am getting Service limit exceeded. How to fix the
issue?
Q60) Is it possible to stop an RDS instance, and how can I do that?
Answer: Yes, it’s possible to stop RDS instances which are non-production and non-Multi-AZ.
Q61) What is meant by parameter groups in RDS, and what is the use of it?
Answer: Since RDS is a managed service, AWS exposes a wide set of parameters in RDS as a parameter
group, which can be modified as per requirement.
Q62) What is the use of tags and how are they useful?
Answer: Tags are used for identifying and grouping AWS resources.
Q63) I am viewing an AWS Console but unable to launch the instance, I receive an IAM Error.
How can I rectify it?
Answer: As an AWS user I don’t have access to use it; I need to be granted permissions to use it further.
Q64) I don’t want my AWS Account id to be exposed to users. How can I avoid it?
Answer: In the IAM console there is a sign-in URL option where I can replace the AWS account ID
with my own account alias.
Q65) By default how many Elastic IP addresses does AWS offer?
Answer: 5 Elastic IPs per region
Q66) You have enabled sticky sessions with ELB. What does it do with your instance?
Answer: It binds the user session to a specific instance.
Q67) Which type of load balancer makes routing decisions at either the transport layer or the
application layer and supports either EC2 or VPC?
Answer: Classic Load Balancer
Q68) Which is the virtual network interface that you can attach to an instance in a VPC?
Answer: Elastic Network Interface
Q69) You have launched a Linux instance in AWS EC2. While configuring the security group, you
have selected SSH, HTTP, HTTPS protocol. Why do we need to select SSH?
Answer: To verify that there is a rule that allows traffic from EC2 Instance to your computer
Q70) You have chosen a Windows instance with Classic and you want to make some changes to
the security group. How will these changes be effective?
Answer: Changes are automatically applied to Windows instances.
Q71) Load Balancer and DNS service comes under which type of cloud service?
Answer: IAAS-Storage
Q72) You have an EC2 instance that has an unencrypted volume. You want to create another
encrypted volume from this unencrypted volume. Which of the following steps can achieve
this?
Answer: Create a snapshot of the unencrypted volume, copy the snapshot (applying encryption
parameters) and create a volume from the copied snapshot.
Q73) Where does the user specify the maximum number of instances with the auto
scaling commands?
Answer: In the Auto Scaling group (its maximum size setting), not in the launch configuration.
Q74) Which are the types of AMI provided by AWS?
Answer: Instance store backed, EBS backed
Q75) After configuring ELB, you need to ensure that the user requests are always attached to
a single instance. What setting can you use?
Answer: Sticky session
Q76) When do I prefer Provisioned IOPS over the Standard RDS storage?
Answer: If you have batch-oriented workloads.
Q77) If I am running my DB Instance as a Multi-AZ deployment, can I use the standby
DB Instance for read or write operations along with the primary DB instance?
Answer: No. The standby DB instance cannot serve read or write operations; it takes over only when
the primary DB instance is not working.
Q78) Which AWS services will you use to collect and process e-commerce data for
near real-time analysis?
Answer: Amazon DynamoDB.
Q79) A company is deploying a new two-tier web application in AWS. The company has
limited staff and requires high availability, and the application requires complex
queries and table joins. Which configuration provides the solution for the company’s
requirements?
Answer: A managed relational database such as Amazon RDS in a Multi-AZ deployment. Amazon
DynamoDB, a NoSQL store, does not support table joins.
Q80) Which use cases are suitable for Amazon DynamoDB?
Answer: Storing metadata for Amazon S3 objects. Running relational joins and complex updates is
not a suitable use case, because DynamoDB does not support joins.
Q81) Your application has to retrieve data from your users’ mobile devices every 5 minutes
and the data is stored in DynamoDB. Later, every day at a particular time, the data is
extracted into S3 on a per-user basis, and your application is later used to visualize the
data to the user. You are asked to optimize the architecture of the backend system to
lower cost. What would you recommend?
Answer: Introduce Amazon ElastiCache to cache reads from the Amazon DynamoDB table and
reduce the provisioned read throughput.
Q82) You are running a website on EC2 instances deployed across multiple Availability
Zones with a Multi-AZ RDS MySQL Extra Large DB Instance. The site performs a high
number of small reads and writes per second and relies on an eventual consistency
model. After comprehensive tests you discover that there is read contention on RDS
MySQL. Which are the best approaches to meet these requirements?
Answer: Deploy an ElastiCache in-memory cache running in each availability zone, then
increase the RDS MySQL instance size and implement provisioned IOPS.
Q83) A startup is running a pilot deployment of around 100 sensors to measure street
noise and air quality in urban areas for 3 months. It was noted that every month
around 4 GB of sensor data is generated. The company uses a load-balanced, auto-
scaled layer of EC2 instances and an RDS database with 500 GB standard storage. The pilot
was a success and now they want to deploy at least 100K sensors, which need to be
supported by the backend. You need to store the data for at least 2 years to analyze it. Which
setup of the following would you prefer?
Answer: Replace the RDS instance with a 6-node Redshift cluster with 96 TB of storage.
Q84) Suppose you have an application where you have to render images and also do
some general computing. Which service will best fit your need?
Answer: Application Load Balancer.
Q85) How will you change the instance type for the instances which are running in your
application tier and using Auto Scaling? Where will you change it?
Answer: In the Auto Scaling launch configuration.
Q86) You have a content management system running on an Amazon EC2 instance that is
approaching 100% CPU utilization. Which option will reduce load on the Amazon EC2
instance?
Answer: Create a load balancer, and register the Amazon EC2 instance with it.
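The answers to Q81 and Q82 above both place an in-memory cache in front of a database so that repeated reads stop consuming database capacity. The following is a minimal cache-aside sketch in plain Python, under stated assumptions: a dictionary stands in for ElastiCache, and `load_from_db` is a hypothetical callable standing in for the DynamoDB or RDS read. It is not an AWS SDK example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Serve reads from an in-memory store and fall back to the database on a miss."""

    def __init__(
        self,
        load_from_db: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load_from_db = load_from_db  # hypothetical stand-in for the real database read
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            # Fresh cached copy: the database is not touched.
            self.hits += 1
            return entry[0]
        # Miss or expired entry: read from the database and keep a copy.
        self.misses += 1
        value = self._load_from_db(key)
        self._store[key] = (value, now + self._ttl)
        return value
```

With eventual consistency acceptable, as in Q82, the TTL bounds how stale a cached read can be; every hit is one read the database no longer has to serve, which is what allows the provisioned read throughput in Q81 to be lowered.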
Q90) A user has set up an Auto Scaling group. Due to some issue the group has failed to
launch a single instance for more than 24 hours. What will happen to Auto Scaling in
this condition?
Answer: Auto Scaling will suspend the scaling process.
Q91) You have an EC2 security group with several running EC2 instances. You
changed the security group rules to allow inbound traffic on a new port and protocol,
and then launched several new instances in the same security group. When do the new rules
apply?
Answer: Immediately, to all the instances in the security group.
Q92) To create a mirror image of your environment in another region for disaster
recovery, which of the following AWS resources do not need to be recreated in the second
region?
Answer: Route 53 record sets.
Q93) A customer wants to capture all client connection information from his load
balancer at an interval of 5 minutes. Which option should he choose for his
application?
Answer: Enable AWS CloudTrail for the load balancer.
Q94) Which of the services would you not use to deploy an app?
Answer: Lambda is not used to deploy an app.
Q95) How does Elastic Beanstalk apply updates?
Answer: By having a duplicate ready with the updates before swapping.
Q96) I created a key in the Oregon region to encrypt my data in the North Virginia region for
security purposes. I added two users to the key and an external AWS account. I wanted to
encrypt an object in S3, but when I tried, the key that I had just created was not listed. What
could be the reason and solution?
Answer: The key should be in the same region.
Q99) A user has created an application which will be hosted on EC2. The application
makes calls to DynamoDB to fetch certain data. The application uses the DynamoDB
SDK to connect from the EC2 instance. Which of the following is the best practice for security in
this scenario?
Answer: The user should attach an IAM role with DynamoDB access to the EC2 instance.
Q100) You have an application running on an EC2 instance which will allow users to
download files from a private S3 bucket using a pre-signed URL. Before generating the
URL the application should verify the existence of the file in S3. How should the application
use AWS credentials to access the S3 bucket securely?
Answer: Create an IAM role for EC2 that allows list access to objects in the S3 bucket. Launch
the instance with this role, and retrieve the role’s credentials from the EC2 instance metadata.
Q101) You use Amazon CloudWatch as your primary monitoring system
for a web application. After a recent software deployment, your users are getting
intermittent 500 Internal Server Errors when using the web application. You want
to create a CloudWatch alarm and notify the on-call engineer when these occur. How can
you accomplish this using AWS services?
Answer: Create a CloudWatch Logs group and define metric filters that capture
500 Internal Server Errors. Set a CloudWatch alarm on the metric and use
Amazon Simple Notification Service to notify the on-call engineer
when the CloudWatch alarm is triggered.
Q102) You are designing a multi-platform web application for AWS. The application will
run on EC2 instances and will be accessed from PCs, tablets and smart phones. The
supported accessing platforms are Windows, MacOS, iOS and Android. Separate
sticky session and SSL certificate setups are required for the different platform types.
Which describes the most cost-effective and performance-efficient architecture
setup?
Answer: Assign multiple ELBs to an EC2 instance or group of EC2 instances running
the common component of the web application, one ELB for each platform type. Session
stickiness and SSL termination are done at the ELBs.
Q103) You are migrating a legacy client-server application to AWS. The application responds
to a specific DNS domain (e.g. www.example.com) and has a 2-tier architecture, with
multiple application servers and a database server. Remote clients use TCP to
connect to the application servers. The application servers need to know the IP address of the
clients in order to function properly and are currently taking that information
from the TCP socket. A Multi-AZ RDS MySQL instance will be used for the database. During the
migration you can change the application code, but you have to file a change request. How would
you implement the architecture on AWS in order to maximize scalability and high
availability?
Answer: File a change request to implement Proxy Protocol support in the application. Use an
ELB with a TCP listener and Proxy Protocol enabled to distribute load on two application
servers in different AZs.
Q104) Your application currently leverages AWS Auto Scaling to grow and shrink as
load increases/decreases and has been performing well. Your marketing team expects
a steady ramp up in traffic to follow an upcoming campaign that will result in 20x growth in
traffic over 4 weeks. Your forecast for the approximate number of Amazon EC2 instances
necessary to meet peak demand is 175. What should you do to avoid potential service
disruptions during the ramp up in traffic?
Answer: Check the service limits in Trusted Advisor and adjust as necessary, so that the forecasted
count remains within the limits.
Q105) You have a web application running on six Amazon EC2 instances, consuming about
45% of resources on each instance. You are using auto-scaling to make sure that six
instances are running at all times. The number of requests this application processes is
consistent and does not experience spikes. The application is critical to your business and
you want high availability at all times. You want the load to be distributed evenly
between all instances. You also want to use the same Amazon Machine Image (AMI) for all
instances. Which architectural choices should you make?
Answer: Deploy 3 EC2 instances in one availability zone and 3 in another availability zone,
and use an Amazon Elastic Load Balancer.
Q106) You are designing an application that contains protected health information.
Security and compliance requirements for your application mandate that all protected
health information in the application use encryption at rest and in transit. The
application uses a three-tier architecture where data flows through the load
balancer and is stored on Amazon EBS volumes for processing, and the results are
stored in Amazon S3 using the AWS SDK. Which of the options satisfy the security
requirements?
Answer: Use TCP load balancing on the load balancer, SSL termination on the Amazon EC2
instances, OS-level disk encryption on the Amazon EBS volumes, and Amazon S3 with server-
side encryption; or use SSL termination on the load balancer, an SSL listener on the Amazon
EC2 instances, Amazon EBS encryption on the EBS volumes containing the PHI, and Amazon
S3 with server-side encryption.
Q107) A startup deploys its photo-sharing site in a VPC. An elastic load balancer
distributes web traffic across two subnets. The load balancer session stickiness is
configured to use the AWS-generated session cookie, with a session TTL of 5 minutes. The
web server Auto Scaling group is configured as min-size=4, max-size=4. The
startup is preparing for a public launch by running load-testing software installed on
a single Amazon Elastic Compute Cloud (EC2) instance running in us-west-2a. After 60
minutes of load-testing, the web server logs show the following:
WEBSERVER LOGS | # of HTTP requests from load-tester system | # of HTTP requests from private beta users
webserver #1 (subnet in us-west-2a) | 19,210 | 434
webserver #2 (subnet in us-west-2a) | 21,790 | 490
webserver #3 (subnet in us-west-2b) | 0 | 410
webserver #4 (subnet in us-west-2b) | 0 | 428
Which recommendations can help ensure that load-testing HTTP requests
are evenly distributed across the four web servers?
Answer: Re-configure the load-testing software to re-resolve DNS for each web
request.
Q108) To serve the web traffic for a popular product, your chief financial officer and IT
director have purchased 10 m1.large heavy utilization Reserved Instances (RIs) evenly
spread across two availability zones; Route 53 is used to deliver the traffic to an Elastic Load
Balancer (ELB). After several months, the product grows even more popular and you
need additional capacity. As a result, your company purchases two c3.2xlarge medium
utilization RIs. You register the two c3.2xlarge instances with your ELB and quickly
find that the m1.large instances are at 100% of capacity and the c3.2xlarge instances have
significant capacity that’s unused. Which option is the most cost effective and uses EC2
capacity most effectively?
Answer: Use a separate ELB for each instance type and distribute load to the ELBs with
Route 53 weighted round robin.
Q109) An AWS customer is deploying a web application that is composed of a front end
running on Amazon EC2 and confidential data that is stored on Amazon S3. The
customer’s security policy is that all access operations to this sensitive data
must be authenticated and authorized by a centralized access management system that is operated
by a separate security team. In addition, the web application team that owns and administers
the EC2 web front-end instances is prohibited from having any ability to access data that
circumvents this centralized access management system. Which configurations will
support these requirements?
Answer: Configure the web application to authenticate end users against the centralized access
management system. Have the web application provision trusted users STS tokens entitling
the download of approved data directly from Amazon S3.
Q110) An enterprise customer is starting their migration to the cloud. Their main reason for
migrating is agility, and they want to make their internal Microsoft Active Directory
available to the many applications running on AWS, so that internal users only have to
remember one set of credentials, and as a central point of user control for leavers
and joiners. How could they make their Active Directory secure and highly available
with minimal on-premises infrastructure changes in the most cost- and time-
efficient way?
Answer: Using a VPC, they could create an extension to their data center and make use
of resilient hardware IPsec tunnels; they could then have two domain controller
instances that are joined to the existing domain and reside within different subnets in different
availability zones.
Q111) What is Cloud Computing?
Answer: Cloud computing provides services to access programs, applications, storage,
networks and servers over the internet through a browser or client-side application on your PC,
laptop or mobile, without the end user installing, updating and maintaining them.
Answer:
• Lower computing cost
• Improved Performance
• No IT Maintenance
• Business connectivity
• Easily upgraded
• Device Independent
Q113) What are the deployment models used in Cloud?
Q117) What is meant by Region, Availability Zone and Edge Location?
Answer: Region: An independent collection of AWS resources in a defined geography. A collection
of data centers (Availability Zones). All availability zones in a region are connected by high bandwidth.
Availability Zones: An Availability Zone is simply a data center, designed as an independent failure
zone, with high-speed connectivity and low latency.
Edge Locations: Edge locations are an important part of the AWS infrastructure. Edge locations are CDN
endpoints for CloudFront to deliver content to end users with low latency.
Q122) What is AMI? What are the types in AMI?
Answer:
Amazon Machine Image is a special type of virtual appliance that is used to create a virtual machine
within the Amazon Elastic Compute Cloud. An AMI defines the initial software that will be in an
instance when it is launched.
Types of AMI:
• Published by AWS
• AWS Marketplace
• Generated from existing instances
• Uploaded virtual server
Q123) How are AWS EC2 instances addressed?
Answer:
• Public Domain Name System (DNS) name: When you launch an instance AWS creates a DNS
name that can be used to access the instance.
• Public IP: A launched instance may also have a public IP address. This IP address is assigned
from the addresses reserved by AWS and cannot be specified.
• Elastic IP: An Elastic IP address is an address unique on the internet that you reserve
independently and associate with an Amazon EC2 instance. This IP address persists until the
customer releases it and is not tied to the lifetime of an individual instance.
Q124) What is Security Group?
Answer: AWS allows you to control traffic in and out of your instance through virtual firewalls called
security groups. Security groups allow you to control traffic based on port, protocol and
source/destination.
Q125) When does your instance show the retired state?
Answer: The retired state is only available for Reserved Instances. Once the reserved instance’s term
(1 yr/3 yr) ends, it shows the retired state.
Q126) Scenario: My EC2 instance IP address changes automatically when the instance is stopped
and started. What is the reason for that and what is the solution?
Answer: AWS assigns a public IP automatically, but it changes dynamically on stop and start. In
that case we need to assign an Elastic IP to that instance; once assigned, it doesn’t change automatically.
Q127) What is Elastic Beanstalk?
Answer: AWS Elastic Beanstalk is the fastest and simplest way to get an application up and running
on AWS. Developers can simply upload their code and the service automatically handles all the details
such as resource provisioning, load balancing, auto scaling and monitoring.
Answer: Lightsail is designed to be the easiest way to launch and manage a virtual private server with
AWS. Lightsail plans include everything you need to jumpstart your project: a virtual machine, SSD-
based storage, data transfer, DNS management and a static IP.
Q129) What is EBS?
Answer: Amazon EBS provides persistent block-level storage volumes for use with Amazon EC2
instances. Each Amazon EBS volume is automatically replicated within its availability zone to protect
against component failure, offering high availability and durability. Amazon EBS volumes are available
in a variety of types that differ in performance characteristics and price.
Q130) How to compare EBS Volumes?
Answer: Magnetic Volume: Magnetic volumes have the lowest performance characteristics of all
Amazon EBS volume types.
EBS Volume size: 1 GB to 1 TB Average IOPS: 100 IOPS Maximum throughput: 40-90 MB/s
General-Purpose SSD: General-purpose SSD volumes offer cost-effective storage that is ideal for a
broad range of workloads. General-purpose SSD volumes are billed based on the amount of data space
provisioned, regardless of how much data you actually store on the volume.
EBS Volume size: 1 GB to 16 TB Maximum IOPS: up to 10000 IOPS Maximum throughput: 160 MB/s
Provisioned IOPS SSD: Provisioned IOPS SSD volumes are designed to meet the needs of I/O-
intensive workloads, particularly database workloads that are sensitive to storage performance and
consistency in random access I/O throughput. Provisioned IOPS SSD volumes provide predictable,
high performance.
EBS Volume size: 4 GB to 16 TB Maximum IOPS: up to 20000 IOPS Maximum throughput: 320 MB/s
Q131) What is cold HDD and Throughput-optimized HDD?
Answer: Cold HDD: Cold HDD volumes are designed for less frequently accessed workloads. These
volumes are significantly less expensive than throughput-optimized HDD volumes.
EBS Volume size: 500 GB to 16 TB Maximum IOPS: 200 IOPS Maximum throughput: 250 MB/s
Throughput-Optimized HDD: Throughput-optimized HDD volumes are low-cost HDD volumes
designed for frequently accessed, throughput-intensive workloads such as big data and data warehouses.
EBS Volume size: 500 GB to 16 TB Maximum IOPS: 500 IOPS Maximum throughput: 500 MB/s
Q132) What is Amazon EBS-Optimized instances?
Answer: Amazon EBS-optimized instances ensure that the Amazon EC2 instance is prepared to
take advantage of the I/O of the Amazon EBS volume. An Amazon EBS-optimized instance uses an
optimized configuration stack and provides additional dedicated capacity for Amazon EBS I/O. When you
select Amazon EBS-optimized for an instance you pay an additional hourly charge for that instance.
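The volume comparison in Q130 and Q131 can be read as a small lookup table. The sketch below encodes only the figures quoted in those answers (which may not match current AWS limits) and filters for volume types that meet a required size, IOPS and throughput. The names `VolumeType` and `volume_types_meeting` are illustrative, and minimum volume sizes are deliberately ignored to keep the example short.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VolumeType:
    name: str
    max_size_gb: int
    max_iops: int
    max_throughput_mb_s: int


# Figures as quoted in Q130/Q131 of this guide; treat them as assumptions, not current limits.
VOLUME_TYPES = [
    VolumeType("Magnetic", 1024, 100, 90),
    VolumeType("General-Purpose SSD", 16 * 1024, 10_000, 160),
    VolumeType("Provisioned IOPS SSD", 16 * 1024, 20_000, 320),
    VolumeType("Cold HDD", 16 * 1024, 200, 250),
    VolumeType("Throughput-Optimized HDD", 16 * 1024, 500, 500),
]


def volume_types_meeting(size_gb: int, iops: int, throughput_mb_s: int) -> List[str]:
    """Return the names of volume types whose quoted maxima cover the requirement."""
    return [
        v.name
        for v in VOLUME_TYPES
        if v.max_size_gb >= size_gb
        and v.max_iops >= iops
        and v.max_throughput_mb_s >= throughput_mb_s
    ]
```

For example, a database needing 5,000 IOPS is only covered by the two SSD types, while a scan-heavy workload needing 400 MB/s is only covered by Throughput-Optimized HDD.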
Answer:
Q135) What are the virtualization types available in AWS?
Answer:
Hardware-assisted virtualization (HVM): HVM instances are presented with a fully virtualized set of hardware, and they boot by executing the master boot record of the root block device of the image. It is the default virtualization type.
Paravirtualization (PV): These AMIs boot with a special boot loader called PV-GRUB. The ability of the guest kernel to communicate directly with the hypervisor results in greater performance levels than other virtualization approaches, but PV instances cannot take advantage of hardware extensions such as networking, GPU, etc. It is a customized virtualization image, and it can be used only for particular services.
Q140) Explain Amazon S3 lifecycle rules?
Answer: With Amazon S3 lifecycle configuration rules, you can significantly reduce your storage costs by automatically transitioning data from one storage class to another, or even automatically deleting data after a period of time. For example:
• Store backup data initially in Amazon S3 Standard
• After 30 days, transition to Amazon S3 Standard-IA
• After 90 days, transition to Amazon Glacier
• After 3 years, delete
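The example schedule under Q140 can be sketched as a small function. This is a minimal illustration, not how S3 evaluates rules internally: the function name storage_class_for_age and the class labels are hypothetical names chosen for the example, and 3 years is approximated as 1095 days.

```python
from typing import Optional

# Thresholds taken from the example schedule; 3 years is approximated as 1095 days.
STANDARD_IA_AFTER_DAYS = 30
GLACIER_AFTER_DAYS = 90
DELETE_AFTER_DAYS = 3 * 365


def storage_class_for_age(age_days: int) -> Optional[str]:
    """Return the storage class an object of this age would be in, or None once deleted."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= DELETE_AFTER_DAYS:
        return None  # the object has been expired (deleted)
    if age_days >= GLACIER_AFTER_DAYS:
        return "GLACIER"
    if age_days >= STANDARD_IA_AFTER_DAYS:
        return "STANDARD_IA"
    return "STANDARD"
```

In practice these rules are attached to the bucket as a lifecycle configuration and applied by S3 itself; the function only makes the thresholds easy to reason about.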
Q136) Differentiate Block storage and File storage?
Answer:
Block Storage: Block storage operates at a lower level, the raw storage device level, and manages data as a set of numbered, fixed-size blocks.
File Storage: File storage operates at a higher level, the operating system level, and manages data as a named hierarchy of files and folders.
Q137) What are the advantages and disadvantages of EFS?
Answer:
Advantages:
• Fully managed service
• File system grows and shrinks automatically to petabytes
• Can support thousands of concurrent connections
• Multi AZ replication
• Throughput scales automatically to ensure consistent low latency
Disadvantages:
• Not available in all regions
• Cross region capability not available
• More complicated to provision compared to S3 and EBS
Q138) What are the things we need to remember while creating an S3 bucket?
Answer:
• Amazon S3 bucket names are global
• This means bucket names must be unique across all AWS accounts
• Bucket names can contain up to 63 lowercase letters, numbers, hyphens and periods
• You can create and use multiple buckets
• You can have up to 100 per account by default
Q141) What is the relation between Amazon S3 and AWS KMS?
Answer: To encrypt Amazon S3 data at rest, you can use several variations of server-side encryption (SSE). Amazon S3 encrypts your data at the object level as it writes it to disks in its data centers and decrypts it for you when you access it. All SSE performed by Amazon S3 and AWS Key Management Service (AWS KMS) uses the 256-bit Advanced Encryption Standard (AES).
Q142) What is the function of cross region replication in Amazon S3?
Answer: Cross-region replication is a feature that allows you to asynchronously replicate all new objects in the source bucket in one AWS region to a target bucket in another region. To enable cross-region replication, versioning must be turned on for both the source and destination buckets. Cross-region replication is commonly used to reduce the latency required to access objects in Amazon S3.
Q143) How to create an encrypted EBS volume?
Answer: You need to select the Encrypt this volume option on the volume creation page. During creation, a new master key will be created unless you select a master key that you created separately in the service. Amazon uses the AWS Key Management Service (KMS) to handle key management.
Q144) Explain stateful and stateless firewalls.
Answer:
Stateful Firewall: A security group is a virtual stateful firewall that controls inbound and outbound network traffic to AWS resources and Amazon EC2 instances. It operates at the instance level and supports allow rules only. Return traffic is automatically allowed, regardless of any rules.
Stateless Firewall: A network access control list (ACL) is a virtual stateless firewall at the subnet level. It supports allow rules and deny rules. Return traffic must be explicitly allowed by rules.
Q145) What is NAT Instance and NAT Gateway?
Answer:
NAT instance: A network address translation (NAT) instance is an Amazon Linux Amazon Machine Image (AMI) that is designed to accept traffic from instances within a private subnet, translate the source IP address to the public IP address of the NAT instance, and forward the traffic to the internet gateway (IGW).
NAT gateway: A NAT gateway is an Amazon-managed resource that is designed to operate just like a NAT instance, but it is simpler to manage and highly available within an Availability Zone. It allows instances within a private subnet to access internet resources through the IGW via the NAT gateway.
Q146) What is VPC Peering?
Answer: An Amazon VPC peering connection is a networking connection between two Amazon VPCs that enables instances in either Amazon VPC to communicate with each other as if they were within the same network. You can create an Amazon VPC peering connection between your own Amazon VPCs or with an Amazon VPC in another AWS account, in the same region or across regions.
Q147) What is MFA in AWS?
Answer: Multi-factor authentication (MFA) adds an extra layer of security to your infrastructure by adding a second method of authentication beyond just a password or access key.
Q148) What are the authentication methods in AWS?
Answer:
• User Name/Password
• Access Key
• Access Key/Session Token
Q149) What is a data warehouse in AWS?
Answer: A data warehouse is a central repository for data that can come from one or more sources. Organizations typically use a data warehouse to compile reports and search the database using highly complex queries. A data warehouse is also typically updated on a batch schedule multiple times per day or per hour, compared to an OLTP (Online Transaction Processing) relational database that can be updated thousands of times per second.
Q150) What is meant by Multi-AZ in RDS?
Answer: Multi-AZ allows you to place a secondary copy of your database in another Availability Zone for disaster recovery purposes. Multi-AZ deployments are available for all types of Amazon RDS database engines. When you create a Multi-AZ DB instance, a primary instance is created in one Availability Zone and a secondary instance is created in another Availability Zone.
Q151) What is Amazon DynamoDB?
Answer: Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB makes it simple and cost-effective to store and retrieve any amount of data.
Q152) What is CloudFormation?
Answer: CloudFormation is a service that creates AWS infrastructure using code. It helps reduce the time needed to manage resources, so we can create our resources quickly.
Q153) How to plan Auto Scaling?
Answer:
• Manual Scaling
• Scheduled Scaling
• Dynamic Scaling
Q154) What is an Auto Scaling group?
Answer: An Auto Scaling group is a collection of Amazon EC2 instances managed by the Auto Scaling service. Each Auto Scaling group contains configuration options that control when Auto Scaling should launch new instances or terminate existing instances.
Q155) Differentiate Basic and Detailed monitoring in CloudWatch?
Answer:
Basic Monitoring: Basic monitoring sends data points to Amazon CloudWatch every five minutes for a limited number of preselected metrics at no charge.
Detailed Monitoring: Detailed monitoring sends data points to Amazon CloudWatch every minute and allows data aggregation for an additional charge.
Q156) What is the relationship between Route 53 and CloudFront?
Answer: CloudFront delivers content through edge locations, so Route 53 can be used alongside it for the content delivery network. Additionally, if you are using Amazon CloudFront, you can configure Route 53 to route internet traffic to those resources.
Q157) What are the routing policies available in Amazon Route 53?
Answer:
• Simple
• Weighted
• Latency Based
• Failover
• Geolocation
Q158) What is Amazon ElastiCache?
Answer: Amazon ElastiCache is a web service that simplifies the setup and management of a distributed in-memory caching environment.
• Cost Effective
• High Performance
• Scalable Caching Environment
• Using Memcached or Redis Cache Engine
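To illustrate what an in-memory cache such as ElastiCache is typically used for, here is a minimal cache-aside sketch. It rests on stated assumptions: a plain Python dict stands in for Redis or Memcached, and the class name CacheAside, the load callback and the ttl_seconds parameter are hypothetical names, not part of any AWS API.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside lookup with a per-entry time-to-live (TTL)."""

    def __init__(self, load: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load          # slow source of truth, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock
        # Stand-in for the cache engine: key -> (expiry time, value).
        self._store: Dict[str, Tuple[float, str]] = {}
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]        # cache hit: entry exists and has not expired
        # Cache miss: read from the source and populate the cache.
        self.misses += 1
        value = self._load(key)
        self._store[key] = (self._clock() + self._ttl, value)
        return value
```

The application asks the cache first and only falls back to the slower data source on a miss, which is the usual reason for placing a caching layer in front of a database.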
SNS (Simple Notification Service): SNS is a web service that coordinates and manages the delivery or sending of messages to recipients.
Q160) How to use Amazon SQS? What is AWS?
Answer: Amazon Web Services is a secure cloud services platform, offering compute power, database storage, content delivery and other functionality to help businesses scale and grow.
Answer: Low price – Consume only the amount of compute, storage and other IT resources needed. No long-term commitment, minimum spend or up-front expenditure is required.
Elastic and scalable – Quickly increase and decrease resources for applications to satisfy customer demand and control costs. Avoid provisioning resources up-front for projects with variable consumption rates or short lifetimes.
Q162) What is the way to secure data carried in the cloud?
• Avoid storing sensitive material in the cloud. …
• Read the user agreement to find out how your cloud service storage works. …
• Be serious about passwords. …
• Encrypt. …
• Use an encrypted cloud service.
Q163) Name the several layers of cloud computing?
Answer: Cloud computing can be broken up into three main services: Software-as-a-Service (SaaS), Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS). SaaS is at the top, PaaS in the middle, and IaaS at the lowest layer.
Q164) What is Lambda@Edge in AWS?
Answer: Lambda@Edge lets you run Lambda functions to modify content that CloudFront delivers, executing the functions in AWS locations closer to the viewer. The functions run in response to CloudFront events, without provisioning or managing servers.
Q165) Distinguish between scalability and flexibility?
Answer: Cloud computing offers businesses flexibility and scalability when it comes to computing needs:
Q166) What is IaaS?
Answer: IaaS is a cloud service that runs services on a “pay-for-what-you-use” basis.
IaaS providers include Amazon Web Services, Microsoft Azure and Google Compute Engine.
Users: IT Administrators
Q167) What is PaaS?
Answer: PaaS provides cloud platforms and runtime environments to develop, test and manage software.
Users: Software Developers
Q168) What is SaaS?
Answer: In SaaS, cloud providers host and manage the software application on a pay-as-you-go pricing model.
Q169) Which automation tools can help with spin-up services?
Answer: The API tools can be used for spin-up services and also for the written scripts. Those scripts could be coded in Perl, bash or other languages of your preference. There is one more option: configuration management and provisioning tools such as Puppet or its improved descendants. A tool called Scalr can also be used, and finally we can go with a managed solution like RightScale.
Q170) What is an AMI? How do I build one?
Answer: An Amazon Machine Image (AMI) describes the programs and settings that will be applied when you launch an EC2 instance. Once you have finished configuring the data, services, and applications on your ArcGIS Server instance, you can save your work as a custom AMI stored in Amazon EC2. You can scale out your site by using this custom AMI to launch additional instances.
Use the following process to create your own AMI using the AWS Management Console:
* Configure an EC2 instance and its attached EBS volumes in the exact way you want them created in the custom AMI.
1. Log out of your instance, but do not stop or terminate it.
2. Log in to the AWS Management Console, display the EC2 page for your region, then click Instances.
3. Choose the instance from which you want to create a custom AMI.
4. Click Actions and click Create Image.
5. Type a name for Image Name that is easily identifiable to you and, optionally, input text for Image Description.
6. Click Create Image.
Read the message box that appears. To view the AMI status, go to the AMIs page. Here you can see your AMI being created. It can take a while to create the AMI. Plan for at least 20 minutes, or longer if you have installed a lot of additional applications or data.
Q171) What are the main features of Amazon CloudFront?
Answer: Amazon CloudFront is a web service that speeds up delivery of your static and dynamic web content, such as .html, .css, .js, and image files, to your users. CloudFront delivers your content through a worldwide network of data centers called edge locations.
Q172) What are the features of the Amazon EC2 service?
Answer: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. Amazon EC2's simple web service interface allows you to obtain and configure capacity with minimal friction.
Q173) Explain storage for an Amazon EC2 instance.
Answer: An instance store is a temporary storage type located on disks that are physically attached to a host machine. … This article will introduce you to the AWS instance store storage type, compare it to AWS Elastic Block Storage (AWS EBS), and show you how to back up data stored on instance stores to AWS EBS.
Amazon SQS is a message queue service used by distributed applications to exchange messages through a polling model, and can be used to decouple sending and receiving components.
Q174) When attached to an Amazon VPC, which two components provide connectivity with external networks?
Answer:
• Internet Gateway (IGW)
• Virtual Private Gateway (VGW)
Q175) Which of the following are characteristics of Amazon VPC subnets?
Answer:
• Each subnet maps to a single Availability Zone.
• By default, all subnets can route between each other, whether they are private or public.
Q176) How can you send a request to Amazon S3?
Answer: Every interaction with Amazon S3 is either authenticated or anonymous. Authentication is a process of validating the identity of the requester trying to access an Amazon Web Services (AWS) product. Authenticated requests must include a signature value that authenticates the request sender. The signature value is, in part, generated from the requester's AWS access keys (access key ID and secret access key).
Q177) What is the best approach to secure data for carrying in the cloud?
Answer: Back up data locally. One of the most important things to consider while managing data is to ensure that you have backups for your data.
• Avoid Storing Sensitive Information. …
• Use Cloud Services that Encrypt Data. …
• Encrypt Your Data. …
• Install Anti-virus Software. …
• Make Passwords Stronger. …
• Test the Security Measures in Place.
Q178) What is AWS Certificate Manager?
Answer: AWS Certificate Manager is a service that lets you easily provision, manage, and deploy public and private Secure Sockets Layer/Transport Layer Security (SSL/TLS) certificates for use with AWS services and your internal connected resources. SSL/TLS certificates are used to secure network communications and establish the identity of websites over the Internet as well as resources on private networks. AWS Certificate Manager removes the time-consuming manual process of purchasing, uploading, and renewing SSL/TLS certificates.
Q179) What is the AWS Key Management Service?
Answer: AWS Key Management Service (AWS KMS) is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data. … AWS KMS is also integrated with AWS CloudTrail to provide encryption key usage logs to help meet your auditing, regulatory and compliance needs.
Q180) What is Amazon EMR?
Answer: Amazon Elastic MapReduce (EMR) is a service that provides a fully managed, hosted Hadoop framework on top of Amazon Elastic Compute Cloud (EC2).
Q181) What is Amazon Kinesis Firehose?
Answer: Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data stores and analytics tools. … It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration.
Q182) What is Amazon CloudSearch and what are its features?
Answer: Amazon CloudSearch is a scalable cloud-based search service that forms part of Amazon Web Services (AWS). CloudSearch is typically used to integrate customized search capabilities into other applications. According to Amazon, developers can set a search application up and deploy it fully in less than an hour.
Q184) Mention the work of an Amazon VPC router.
Answer: VPCs and subnets. A virtual private cloud (VPC) is a virtual network dedicated to your AWS account. It is logically isolated from other virtual networks in the AWS Cloud. You can launch your AWS resources, such as Amazon EC2 instances, into your VPC.
Q185) How can one connect a VPC to a corporate data center?
Answer: AWS Direct Connect enables you to securely connect your AWS environment to your on-premises data center or office location over a standard 1 gigabit or 10 gigabit Ethernet fiber-optic connection. AWS Direct Connect offers a dedicated high-speed, low-latency connection, which bypasses internet service providers in your network path. An AWS Direct Connect location provides access to Amazon Web Services in the region it is associated with, as well as access to other US regions. AWS Direct Connect allows you to logically partition the fiber-optic connections into multiple logical connections called Virtual Local Area Networks (VLANs). You can take advantage of these logical connections to improve security, differentiate traffic, and achieve compliance requirements.
Q186) Is it possible to use S3 with EC2 instances?
Answer: Yes, it can be used for instances with root devices backed by local instance storage. By using Amazon S3, developers have access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of websites. In order to execute systems in the Amazon EC2 environment, developers use the tools provided to load their Amazon Machine Images (AMIs) into Amazon S3 and to move them between Amazon S3 and Amazon EC2. Another use case could be for websites hosted on EC2 to load their static content from S3.
Q187) What is the difference between Amazon S3 and EBS?
Answer: EBS is for mounting directly onto EC2 server instances. S3 is object storage that is not always waiting to be accessed (and is therefore cheaper). There is then the even cheaper AWS Glacier, which is for long-term storage where you don't really expect to need to access the data, but wouldn't want to lose it.
There are then two main kinds of EBS – HDD (hard disk drives, i.e. magnetic spinning disks), which are fairly slow to access, and SSD (solid state drives), which are very fast to access, but more expensive.
• Finally, EBS can be purchased with or without Provisioned IOPS.
• Obviously these differences come with associated pricing differences, so it is worth paying attention to the differences and using the cheapest option that delivers the performance you require.
Answer: The principal components of AWS are:
Route 53: Route 53 is a highly scalable DNS web service.
Simple Storage Service (S3): S3 is the most widely used AWS storage web service.
Simple Email Service (SES): SES is a hosted transactional email service that allows one to smoothly send deliverable emails using a RESTful API call or through regular SMTP.
Identity and Access Management (IAM): IAM provides improved identity and security management for an AWS account.
Elastic Compute Cloud (EC2): EC2 is the central piece of the AWS ecosystem. It is responsible for providing on-demand and flexible computing resources with a “pay as you go” pricing model.
Elastic Block Store (EBS): EBS offers a continuous storage solution that can be seen in instances as a regular hard drive.
CloudWatch: CloudWatch allows the controller to view and gather key metrics and also set a series of alarms to be notified if there is any trouble.
This is among the frequently asked AWS developer interview questions. Just read the interviewer's mind and answer accordingly, either with the component names or with the descriptions alongside.
Q190) What do you mean by AMI? What does it include?
Answer: You may come across one or more AMI-related questions during your AWS developer interview. So, prepare yourself with a good knowledge of AMI.
AMI stands for Amazon Machine Image. It is an AWS template that provides the information (an application server, an operating system, and applications) required to launch an instance. This AMI is the copy of the AMI that is running in the cloud as a virtual server. You can launch instances from as many different AMIs as you need. An AMI consists of the following:
A root volume template for an existing instance
Launch permissions to determine which AWS accounts can use the AMI to launch instances
A block device mapping to determine the total volume that will be attached to the instance at the time of launch
Q191) Is vertical scaling possible on an Amazon instance?
Answer: Yes, vertical scaling is possible on an Amazon instance.
This is one of the common AWS developer interview questions. If the interviewer is expecting a detailed answer from you, explain the procedure for vertical scaling.
Q192) What is the connection between an AMI and an instance?
Answer: Many different types of instances can be launched from one AMI. The type of an instance generally determines the hardware components of the host computer that is used for the instance. Each type of instance has distinct computing and memory capabilities.
Once an instance is launched, it acts as a host, and user interaction with it is the same as with any other computer, but we have completely controlled access to our instances. AWS developer interview questions may contain one or more AMI-based questions, so prepare yourself for the AMI topic very well.
Q193) What is the difference between Amazon S3 and EC2?
Answer: The difference between Amazon S3 and EC2 is given below:
Q194) How many storage options are there for an EC2 instance?
Answer: There are four storage options for an Amazon EC2 instance:
Amazon EC2 is a common topic you may come across while going through AWS developer interview questions. Get a thorough knowledge of the EC2 instance and all the storage options for the EC2 instance.
Q195) What are the security best practices for Amazon EC2 instances?
Answer: There are a number of best practices for securing Amazon EC2 instances that are applicable whether instances are running in on-premises data centers or on virtual machines. Let us look at some general best practices:
Least Access: Make sure that your EC2 instance has controlled access to the instance as well as to the network. Grant access rights only to trusted entities.
Least Privilege: Follow the essential principle of least privilege for instances and users to perform their functions. Create roles with restricted access for the instances.
Configuration Management: Consider each EC2 instance a configuration item and use AWS configuration management services to have a baseline for the configuration of the instances, as these services include updated anti-virus software, security features and so on.
Whatever the job role, you may come across security-based AWS interview questions. So, get prepared with this question to crack the AWS developer interview.
Q196) Explain the features of Amazon EC2 services.
Answer: This is a very simple question but ranks high among AWS developer interview questions. Answer this question directly: the default number of buckets created in each AWS account is 100.
At the time of terminating an Amazon EC2 instance, a shutdown is performed in a normal way. During this, the deletion of all of the Amazon EBS volumes is performed. To avoid this, the value of the attribute deleteOnTermination is set to false. On termination, the instance also undergoes deletion, so the instance can't be started again.
Q202) What are the popular DevOps tools?
Answer: In an AWS DevOps Engineer interview, this is the most common AWS question for DevOps. To answer this question, mention the popular DevOps tools with the type of tool –
• Jenkins – Continuous Integration Tool
• Git – Version Control System Tool
• Nagios – Continuous Monitoring Tool
• Selenium – Continuous Testing Tool
• Docker – Containerization Tool
• Puppet, Chef, Ansible – Deployment and Configuration Management Tools
Q203) What are IAM Roles and Policies, and what is the difference between IAM Roles and Policies?
Answer: Roles are for AWS services, where we can assign permission for one AWS service to access another service.
Example – Giving an EC2 instance permission to access S3 bucket contents.
Policies are for users and groups, where we can assign permissions to users and groups.
Example – Giving a user permission to access the S3 buckets.
Q204) What are the default services we get when we create a custom AWS VPC?
Answer:
• Route Table
Answer: Use a NAT gateway in the VPC, or launch a NAT instance (EC2). Configure or attach the NAT gateway in a public subnet (one whose route table is attached to the IGW), and attach it to the route table that is already attached to the private subnet.
Q208) What are the differences between Security Groups and Network ACLs?
Answer:
Security Groups | Network ACL
Attached to an EC2 instance | Attached to a subnet
Stateful – changes made in incoming rules are automatically applied to the outgoing rule | Stateless – changes made in incoming rules are not applied to the outgoing rule
Blocking an IP address can't be done | An IP address can be blocked
Allow rules only; by default all rules are denied | Allow and Deny can be used
Q209) What is the difference between Route 53 and ELB?
Answer: Amazon Route 53 handles DNS. Route 53 gives you a web interface through which DNS can be managed. Using Route 53, it is possible to direct and fail over traffic. This can be achieved by using DNS routing policies.
One such routing policy is the failover routing policy: we set up a health check to monitor the application endpoints. If one of the endpoints is not available, Route 53 will automatically forward the traffic to the other endpoint.
Elastic Load Balancing
ELB automatically scales depending on demand, so sizing the load balancers to handle more traffic effectively is not required.
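The failover routing policy described under Q209 can be sketched as a simple selection rule. This is a simplified model under stated assumptions: the function name pick_endpoint and the example hostnames are hypothetical, health check results are passed in as a plain dict, and how Route 53 itself behaves when every health check fails is not modelled here.

```python
from typing import Dict, Optional, Sequence


def pick_endpoint(endpoints_in_priority_order: Sequence[str],
                  healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first endpoint whose health check passes, or None if all fail."""
    for endpoint in endpoints_in_priority_order:
        # An endpoint with no recorded health check result is treated as unhealthy.
        if healthy.get(endpoint, False):
            return endpoint
    return None


# Hypothetical endpoints: traffic goes to the primary unless its check fails.
endpoints = ["primary.example.com", "secondary.example.com"]
print(pick_endpoint(endpoints, {"primary.example.com": False,
                                "secondary.example.com": True}))
```

With the primary marked unhealthy, the call above selects secondary.example.com, which mirrors the "forward the traffic to the other endpoint" behaviour in the answer.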
Q210) What are the DB engines which can be used in AWS RDS?
Answer:
• MariaDB
• MySQL
• Microsoft SQL Server
• PostgreSQL
• Oracle
Q216) Difference between EBS, EFS and S3
Answer:
• We can access EBS only if it is mounted on an instance; at any one time, an EBS volume can be mounted on only one instance.
• EFS can be shared with multiple instances at the same time.
• S3 can be accessed without mounting it on instances.
Q217) Maximum number of buckets which can be created in AWS.
Answer: 100 buckets can be created by default in an AWS account. To get more buckets, you have to request an increase from Amazon.
Q218) Maximum number of EC2 instances which can be created in a VPC.
Answer: A maximum of 20 instances can be created in a VPC. We can create 20 reserved instances and request spot instances as per demand.
Q219) How can EBS be accessed?
Answer: EBS provides high-performance block-level storage which can be attached to a running EC2 instance. The storage can be formatted and mounted on the EC2 instance, and then it can be accessed.
Q220) Process to mount EBS to an EC2 instance
Answer:
• df -k
• mkfs.ext4 /dev/xvdf
• fdisk -l
• mkdir /my5gbdata
• mount /dev/xvdf /my5gbdata
Q221) How to add a volume permanently to an instance.
Answer: With each restart the volume will get unmounted from the instance. To keep it attached, edit /etc/fstab and add the following line:
/dev/xvdf /data ext4 defaults 0 0
<edit the file system name accordingly>
Q211) What are status checks in AWS EC2?
Answer:
System Status Checks – System status checks look into problems with the instance that need AWS help to resolve. When we see a system status check failure, we can wait for AWS to resolve the issue, or resolve it ourselves.
• Network connectivity
• System power
• Software issues in the data centre
• Hardware issues
Instance Status Checks – Instance status checks look into issues that need our involvement to fix. If a status check fails, we can reboot that particular instance.
• Failed system status checks
• Memory full
• Corrupted file system
• Kernel issues
Q212) To establish a peering connection between two VPCs, what condition must be met?
• CIDR blocks should overlap
• CIDR blocks should not overlap
• VPCs should be in the same region
• VPCs must belong to the same account
Answer: The CIDR blocks must not overlap between the VPCs setting up a peering connection. A peering connection is allowed within a region, across regions, and across different accounts.
Q213) Troubleshooting with EC2 instances:
Answer: Instance status checks
• If the instance status is 0/2, there might be a hardware issue.
• If the instance status is 1/2, there might be an issue with the OS.
Workaround: Restart the instance; if that still does not work, the logs will help to fix the issue.
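The non-overlap condition from Q212 can be checked before requesting a peering connection. Below is a minimal sketch using Python's standard ipaddress module; the function name can_peer is a hypothetical helper, and it checks only the CIDR condition, not accounts, regions or route tables.

```python
import ipaddress


def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """Return True when the two VPC CIDR blocks do not overlap."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return not net_a.overlaps(net_b)


# Example CIDR blocks chosen for illustration only.
print(can_peer("10.0.0.0/16", "10.1.0.0/16"))  # distinct ranges
print(can_peer("10.0.0.0/16", "10.0.1.0/24"))  # second range sits inside the first
```

The first pair is acceptable for peering, while the second is not because 10.0.1.0/24 falls inside 10.0.0.0/16.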
Q214) How can EC2 instances be resized?
Answer: EC2 instances are resizable (scale up or scale down) based on requirement.
Q215) EBS: It is a block-level storage volume which we can use after mounting it on EC2 instances.
Q222) What is the difference between the Service Role and the SAML Federated Role?
Answer: Service roles are meant for use by AWS services; based upon the policies attached to it, a service role has the scope to do its task. Example: in the case of automation, we can create a service role with the required policies attached to it.
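A service role is assumed by an AWS service through a trust policy on the role. The sketch below builds such a trust policy document as a Python dict; the helper name service_trust_policy is hypothetical, and ec2.amazonaws.com is used only as an example service principal. The permissions the role grants would live in separate policies attached to it.

```python
import json


def service_trust_policy(service_principal: str) -> dict:
    """Build a trust policy that lets one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The AWS service that is allowed to assume this role.
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


policy = service_trust_policy("ec2.amazonaws.com")
print(json.dumps(policy, indent=2))
```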
Answer: The root user has access to the entire AWS environment and does not have any policy attached to it, while an IAM user can perform its tasks only on the basis of the policies attached to it.
Answer: The principle of least privilege means granting a user or role only the permissions required to perform its task.
Answer: When an IAM user is created without any policy attached to it, the user will not be able to access any AWS service until a policy has been attached.
Q228) What is the precedence level between explicit allow and explicit deny.
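In IAM policy evaluation, an explicit deny takes precedence over an explicit allow, and a request with no matching allow is denied by default. The sketch below models only that ordering. It is a simplification under stated assumptions: statements are hypothetical (effect, action) pairs matched by exact action name, with no wildcards, resources or conditions.

```python
from typing import Iterable, Tuple

Statement = Tuple[str, str]  # (effect, action), e.g. ("Allow", "s3:GetObject")


def is_allowed(action: str, statements: Iterable[Statement]) -> bool:
    """Explicit deny wins; otherwise an explicit allow is needed; default is deny."""
    allowed = False
    for effect, stmt_action in statements:
        if stmt_action != action:
            continue  # statement does not apply to this action
        if effect == "Deny":
            return False  # an explicit deny overrides any allow
        if effect == "Allow":
            allowed = True
    return allowed  # stays False when nothing explicitly allowed the action
```

For example, a user with both ("Allow", "s3:GetObject") and ("Deny", "s3:GetObject") is refused, and a user with no statements at all is also refused.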
Answer: Creating a group makes user management much simpler: users with the same kind of permissions can be added to a group, and attaching a policy to the group is much simpler than doing the same thing manually for each user.
Q230) What is the difference between Administrator Access and Power User Access in terms of pre-built policies?
Answer: Administrator Access has full access to AWS resources, while Power User Access has admin access except for the user/group management permissions.
Answer: An identity provider helps in building trust between AWS and the corporate AD environment when we create a federated role.
Answer: It helps in securing the AWS environment, as we need not embed or distribute AWS security credentials in the application. As the credentials are temporary, we need not rotate or revoke them.
Follow Me : https://www.youtube.com/c/SauravAgarwal