
Technical Skills:

Java/J2EE: Java, J2EE, Struts, Liferay, Apache Solr
Big Data tools: Hortonworks, Cloudera, Hadoop, HDFS, MapReduce, Scala, Pig, Sqoop, Flume, Hive, Impala, PySpark, Apache NiFi, Oozie, H2O
Cloud Technologies: AWS, Microsoft Azure, Databricks, Open-Source Delta Lake
Databases: Oracle, PostgreSQL, DB2, SQL Server, MongoDB
Operating Systems: Windows, UNIX
Scripting & HTML: Python, JavaScript, Shell Scripting, HTML
IDEs: IBM RAD 7.0, IntelliJ, Eclipse 3.1, NetBeans
Code Collaboration tools: Bitbucket, GitHub, SVN, JIRA, Screwdriver, Azure DevOps
Academics:
Bachelor of Technology from J.N.T University, Hyderabad.
Certifications:
• Microsoft Certified Azure Data Engineer Associate.
Description:
The purpose of Aggrify is to track and manage the ETL jobs that run daily throughout Johnson and Johnson. The Aggrify application
has several reports that slice and present the data in different ways to help analyze how data is transformed across the company.

Business Deliverable - the name of the application as defined in the AGGRIFY database. A single Business Deliverable can have
multiple SLT IDs. This allows the application team to group their tasks and batches under separate but unique SLT IDs.
However, to allow grouping on the report, the teams can use the same Business Deliverable name for all their SLT IDs.
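
To make the Business Deliverable-to-SLT ID grouping concrete, the following is a minimal PySpark sketch of that roll-up; the table and column names (aggrify.slt_runs, business_deliverable, slt_id) are hypothetical placeholders, not the actual AGGRIFY schema.

    # Minimal illustrative sketch: group SLT IDs under a shared Business Deliverable name.
    # Table and column names are hypothetical placeholders, not the real AGGRIFY schema.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("aggrify-grouping").enableHiveSupport().getOrCreate()

    runs = spark.table("aggrify.slt_runs")

    # One row per Business Deliverable, with its distinct SLT IDs and the daily job count.
    report = (
        runs.groupBy("business_deliverable")
            .agg(
                F.collect_set("slt_id").alias("slt_ids"),
                F.count("*").alias("job_runs"),
            )
    )
    report.show(truncate=False)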

Environment: AWS, EMR, EC2, Hadoop 2.7.3, Spark 3.3, Python 3.8, Open-Source Delta Lake, Kafka, Oracle, VS Code.

Description: It is a digital platform that gives health plans a better understanding of their population, better care, and
reduced cost. It is a simple-to-use product used to analyze, monitor, intervene in, and improve the quality of care management.

Roles & Responsibilities:


• Designed and implemented ETL pipelines to consume data from multiple data sources.
• Extensively worked on migrating the existing application from Cloudera to the Azure platform.
• Worked with Databricks notebooks using Databricks Utilities, magic commands, etc.
• Worked with Databricks tables and the Databricks File System (DBFS), etc. (see the sketch after this list).
• Developed code in Azure Databricks for the curation of the source data.
• Automated the pipelines in Azure.
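
As referenced above, here is a minimal sketch of the kind of Databricks notebook work described, assuming a notebook context where the spark session and dbutils handle are predefined; the DBFS paths and table names are hypothetical placeholders.

    # Illustrative Databricks notebook cell (Python). Assumes the notebook-provided
    # `spark` session and `dbutils` handle; paths and table names are hypothetical.
    # (%sql / %fs magic commands would live in separate notebook cells.)

    # List raw files landed on DBFS.
    for f in dbutils.fs.ls("dbfs:/mnt/raw/source_system/"):
        print(f.path, f.size)

    # Read a raw CSV from DBFS, apply a light curation step, and save as a managed table.
    raw_df = (
        spark.read
             .option("header", "true")
             .csv("dbfs:/mnt/raw/source_system/customers.csv")
    )
    curated_df = raw_df.dropDuplicates(["customer_id"]).na.drop(subset=["customer_id"])
    curated_df.write.mode("overwrite").saveAsTable("curated.customers")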

Project Title: Rx Surveillance Tool

Description: The Rx Surveillance project monitors invoice claims to ensure they are processed correctly based on a set of criteria or
rules; thousands of claims are identified as outliers every day. The business needs to verify that each claim has been adjudicated properly.
This tool works as a single repository of all outlier claims and allows users to filter and write back comments. Writeback allows
users to add comments to a claim or mass-update multiple claims. Comment updates are then saved in real time to a database.

Role and Responsibilities:


• Designed and implemented ETL pipelines to consume data from multiple data sources.
• Developed Spark SQL code in Scala to migrate existing Impala code and enable faster, in-memory computing (see the sketch after this list).
• Developed scripts to load and process the data using HiveQL.
• Worked with Jenkins, an open-source automated build platform designed for CI/CD: it tests pull requests, builds the merged
commits, and deploys the code to the respective non-prod and prod environments.
• Customized scheduling in AutoSys to run the complete end-to-end flow of a self-contained model with integrated Spark
functionality.
• Production support.
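
The migration bullet above describes Scala code; for consistency with the other sketches in this document, the example below shows the equivalent Impala-to-Spark SQL pattern in PySpark. The database and table names are hypothetical placeholders.

    # Illustrative PySpark sketch of the Impala-to-Spark SQL migration pattern
    # (the project itself used Scala; this shows the equivalent PySpark calls).
    # Table names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("impala-to-spark-migration")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Query the same Hive-managed table that Impala previously served, directly from Spark SQL.
    claims = spark.sql("""
        SELECT claim_id, member_id, paid_amount
        FROM pharmacy.claims
        WHERE claim_status = 'OUTLIER'
    """)

    # Cache in memory so repeated downstream aggregations avoid re-reading from disk,
    # which is the main speed-up over re-running the Impala query each time.
    claims.cache()

    claims.groupBy("member_id").sum("paid_amount").show()
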
Environment: Microsoft Azure, Cloudera Hadoop 6.3.3, Impala 3.2.0, Hive 2.1.1, Oozie, Spark 2.4, Scala 2.11.2, GitHub, JIRA,
Jenkins, IntelliJ, HUE, AutoSys, Databricks.

Description: NSP (Network Service Personalization) is one of the critical system applications in the Network Personalization program.
It creates an interface that is specific to the needs of the Network Repair Bureau (NRB) technicians and ties into the functions they
perform day to day.
NSP leverages scoring data produced by the Network Data Lake (NDL, a Hadoop platform) based on business-identified Key
Performance Indicators (KPIs) and presents the customer experience in the form of an interactive Grafana dashboard. NSP also combines
NRB ticket information, provisioning data, and record details in a single pane of glass used for network outage troubleshooting. It
includes automating and streamlining processes, removing unnecessary troubleshooting steps, and downstream communication with other
applications to prevent redundant tickets from reaching the NRB.
The main objective of this project is to achieve better control over the automation of tickets and the customer experience.

Role and Responsibilities:


• Designed and implemented ETL pipelines to consume data from multiple data sources.
• Developed Spark SQL code in Scala to migrate existing Pig scripts and enable faster, in-memory computing.
• Developed Spark code to load data from Hive tables and CSV files and publish it to a Pulsar broker using batch and
stream processing (see the sketch after this list).
• Developed scripts to load and process the data using HiveQL and Pig.
• Worked with the Verizon data science team to provide and integrate the KPI scoring data into the Spark scoring code.
• Worked with the Screwdriver pipeline, an open-source automated build platform designed for CI/CD: it tests pull requests,
builds the merged commits, and deploys the code to the respective non-prod and prod environments.
• Customized scheduling in Oozie to run the complete end-to-end flow of a self-contained model with integrated Hive, Spark, and Pig jobs.
• Developed UDFs in Spark Scala for custom functionality.
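
As referenced above, a minimal sketch of the batch publish path: read a Hive table with Spark and push each row to a Pulsar topic. The project itself used Scala; for consistency this uses PySpark plus the Apache Pulsar Python client (pulsar-client). The service URL, topic, table, and column names are hypothetical placeholders.

    # Illustrative batch publish: Hive table -> Pulsar topic.
    # Service URL, topic, and table names are hypothetical placeholders.
    import json
    import pulsar
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ndl-to-pulsar").enableHiveSupport().getOrCreate()

    scores = spark.table("ndl.kpi_scores").select("circuit_id", "kpi_name", "score")

    def publish_partition(rows):
        # One Pulsar client/producer per partition to avoid per-row connection overhead.
        client = pulsar.Client("pulsar://broker.example.com:6650")
        producer = client.create_producer("persistent://public/default/kpi-scores")
        for row in rows:
            producer.send(json.dumps(row.asDict()).encode("utf-8"))
        producer.flush()
        client.close()

    scores.rdd.foreachPartition(publish_partition)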

Environment: Hadoop 2.8, HDFS, Pig 0.14.0, Hive 1.2, Oozie, Spark 2.4, Scala 2.11, GitHub, JIRA, Screwdriver, IntelliJ, HUE,
Jenkins

Description: Aetna is a US-based health care company, which sells traditional and consumer-directed health care insurance plans and
related services, such as medical, pharmaceutical, dental, behavioral health, long-term care, and disability plans. On average, Aetna
receives 1 million claims each day. The sheer number of providers, members, and plan types makes the pricing of these claims
incredibly complex. Through misinterpretation of provider contracts and human error, a small number of claims are paid improperly.
As part of the data science team, all the data from critical domains such as Aetna Medicare, Traditional Group membership, Member,
Plan, Claim, and Provider is migrated to the Hadoop environment. All the demographic information is moved from MySQL to Hadoop and
the analysis is done on that data. The claims data is also moved from MQ to Hadoop; after processing the claims, the response is sent
back to MQ and a history is built to track all the changes corresponding to the processed claims.

Role and Responsibilities:


• Designed and implemented data pipelines to consume data from heterogeneous data sources and build an integrated health
insurance claims view of the data, using the Hortonworks Data Platform, which consists of Red Hat Linux edge nodes with
access to the Hadoop Distributed File System (HDFS). Data processing and storage is done across a 1,000-node cluster.
• Developed scripts to load and process the data in Hive and Pig.
• Worked on performance tuning of Hive and Pig queries to improve data processing and retrieval.
• Worked closely with data scientists to provide the data required for building the model features.
• Created multiple Hive tables and implemented partitioning and dynamic partitioning in Hive for efficient data access (see the sketch after this list).
• Developed scripts for scheduling the jobs using the Zeke framework.
• Worked on migrating Pig scripts to PySpark code to enable faster, in-memory computing, and performed ad hoc analytics on
large/diverse data using PySpark.
• Customized scheduling in Oozie to run the complete end-to-end flow of integrated Hive and Pig jobs.
• Extensively used Git as a code repository for managing the day-to-day agile project development process and to keep track of
issues and blockers.
• Shared the production outcome analysis report from different jobs with the end users on a weekly basis.
• Developed UDFs in PySpark for custom functionality.
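
As referenced above, a minimal PySpark sketch of two items from this list: Hive dynamic partitioning and a small PySpark UDF. The database, table, and column names are hypothetical placeholders, not the actual claims schema.

    # Illustrative sketch: Hive dynamic partitioning plus a small PySpark UDF.
    # Database, table, and column names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("claims-curation").enableHiveSupport().getOrCreate()

    # Allow Hive-style dynamic partitioning so the partition value comes from the data itself.
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    # Example UDF: normalize a claim status code to a readable label.
    @F.udf(returnType=StringType())
    def normalize_status(code):
        return {"P": "PAID", "D": "DENIED"}.get(code, "UNKNOWN")

    claims = spark.table("staging.claims_raw").withColumn(
        "claim_status", normalize_status(F.col("status_code"))
    )

    # Write into a Hive table partitioned by service month; each month lands in its own partition.
    (claims.write
           .mode("overwrite")
           .partitionBy("service_month")
           .saveAsTable("curated.claims"))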

Environment: Hadoop 2.7, HDFS, Pig 0.14.0, Hive 0.13.0, Sqoop, Flume, Apache NiFi, Oozie, Git, JIRA, PySpark, H2O, Netezza,
MySQL, Aginity.

Description: “iTunes OPS Reporting” is a near-real-time data warehouse and reporting solution for the iTunes Online Store and acts as a
reporting system for external and operational reporting needs. It also publishes data to downstream systems such as Piano (for Label
Reporting) and ICA (for Business Objects reporting, campaign list pulls, and analytics). De-normalized data is used for publishing
various reports to the users. In addition, this project caters to the needs of the ITS (iTunes Store) business user groups. A lot of
complex analytical expertise is required, involving deep domain knowledge, a detailed understanding of iTunes features and its data
flow, and measuring the accuracy of the system in place.

Role and Responsibilities


• Worked on analyzing the Teradata procedures.
• Worked on developing the Graffle design documents for the Teradata procedures in Hive.
• Created the design document for implementing the equivalent HQL (see the sketch after this list).
• Developed HQL scripts for creating the tables and populating the data.
• Worked on testing map-side joins for performance.
• Developed Oozie scripts for automation.
• Production support.
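
As referenced above, a minimal sketch of the HQL pattern described here: create a partitioned table, populate it from a staging table, and enable map-join conversion for small tables. For consistency with the other sketches, the HQL is wrapped in a small Python driver that submits it through the standard hive -e CLI; the database, table, and column names are hypothetical placeholders.

    # Illustrative HQL: create a partitioned ORC table, populate it via a join, and
    # let Hive convert small-table joins to map-side joins. Submitted through `hive -e`.
    # Database, table, and column names are hypothetical placeholders.
    import subprocess

    HQL = """
    -- Let Hive use map-side joins for small dimension tables.
    SET hive.auto.convert.join=true;

    CREATE TABLE IF NOT EXISTS ops_reporting.daily_sales (
        item_id     BIGINT,
        store_front STRING,
        units       INT
    )
    PARTITIONED BY (report_date STRING)
    STORED AS ORC;

    INSERT OVERWRITE TABLE ops_reporting.daily_sales PARTITION (report_date='2015-01-01')
    SELECT s.item_id, s.store_front, s.units
    FROM staging.sales s
    JOIN staging.items i ON s.item_id = i.item_id;
    """

    subprocess.run(["hive", "-e", HQL], check=True)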

Environment: Hadoop 2.7, HDFS, Hive 0.13.0, Oozie, Git, JIRA, Java, Teradata
