Nagaraju Bachu
[email protected]
615-576-0066
https://www.linkedin.com/in/naga-raj-21a1b01ab/
Dallas, TX - 75252
Summary:
● Around 9 years of experience in Data Engineering, spanning designing algorithms, building models, and developing Data Mining, Data Acquisition, Data Preparation, Data Manipulation, Feature Engineering, Machine Learning, Validation, Visualization, and reporting solutions that scale across massive volumes of structured and unstructured data.
● Strong hands-on experience in Azure Databricks, ADLS, Spark, and Python.
● Extensively worked on Databricks to load data to Snowflake for data profiling.
● Experience with Apache Hadoop ecosystem components like Spark.
● Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
● Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
● Experienced with query optimization and performance tuning of SQL stored procedures, functions, SSIS
packages, SSRS reports, and so on.
● Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services in the AWS family.
● Proficient with Shell, Python, PowerShell, JSON, YAML, and Groovy scripting languages.
● Expertise in writing Automated shell scripts in Linux/Unix environments using bash.
● Executed Python scripts using AWS Lambda.
● Performed data engineering responsibilities using agile software engineering practices; migrated Matillion pipelines and Looker reports from Amazon Redshift.
● Extensive experience in loading and analyzing large datasets with the Hadoop framework (MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Spark, Impala, Scala) and NoSQL databases such as MongoDB, HBase, and Cassandra.
● Experience setting up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations, and using GCP Cloud CDN to deliver content from GCP.
● Worked on GCP services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
● Experience in developing MapReduce programs using Apache Hadoop for analyzing big data per requirements.
● Experience in using Microsoft Azure SQL database, Azure ML and Azure Data Factory.
● Hands-on experience writing AWS CloudFormation templates to create VPCs, subnets, EC2 instances, etc.
● Worked with both Scala and Python; created frameworks for processing data pipelines through Spark.
● Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
● Experienced in writing JSON/YAML templates for CloudFormation.
● Experience with Git, Git Bash, and Bitbucket.
● Responsible for building scalable, distributed data solutions.
● Experienced in both Waterfall and Agile (Scrum) development methodologies.
● Strong background in databases, including Oracle and MS SQL Server.
● Good experience in Data Modeling using Star and Snowflake schemas; well versed in UNIX shell wrappers and Oracle PL/SQL programming.
● Expertise in writing PySpark scripts for daily workloads based on business requirements (a brief illustrative sketch follows this summary).
● Maintained current knowledge of emerging cloud computing, data engineering, and RESTful API development technologies, tools, and techniques, and evaluated and recommended new tools and technologies as needed.
● Extensive experience in writing SQL to validate the database systems and for backend database testing.
● Gained expertise on the entire CI/CD Pipelines of an Analytics Project from Data Ingestion, Exploratory Analysis
to Model Development and Visualization to Solution Deployment.
● Created Ansible YAML playbooks, roles, and Bash shell scripts for application deployments.
● Set up Jenkins server and build jobs to provide continuous automated builds based on polling the Git source
control system to support development needs using Jenkins, Gradle, Git, and Maven.
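For illustration, a minimal sketch of the kind of daily PySpark workload mentioned in the summary above; it is only a sketch, and the table and column names (sales_raw, sales_daily, store_id, order_date, amount) are hypothetical.

# Minimal PySpark sketch of a daily batch workload (illustrative only; names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-aggregation").enableHiveSupport().getOrCreate()

# Read the previous day's raw records from a Hive table.
raw = spark.table("sales_raw").where(F.col("order_date") == F.date_sub(F.current_date(), 1))

# Aggregate per store and write the result back, partitioned by date.
daily = raw.groupBy("store_id", "order_date").agg(F.sum("amount").alias("total_amount"))
daily.write.mode("overwrite").partitionBy("order_date").saveAsTable("sales_daily")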
Technical Skills:
Technologies Used: Python, Java, JavaScript, Spark, Linux/Bash, Kubernetes, Databricks
AWS Services: EMR, Glue, Glue Crawler, Athena, Redshift, EC2, S3, IAM, QuickSight, SNS, SQS, EventBridge, Lambda, CloudFormation
Databases: Elasticsearch, Oracle, SQL, Postgres, Snowflake, DynamoDB
Azure Services: Azure Data Lake, Azure Data Factory (ADF), Azure Blob Storage, Azure SQL Analytics, Azure network components (Virtual Network, Network Security Group, Gateway, Load Balancer, etc.), Virtual Machines, ExpressRoute, Traffic Manager, VPN, Load Balancing, Auto Scaling
Build Technologies: Docker, Jenkins
Version Control: GitHub
Methodologies: Agile, Waterfall
Agile Tools: Rally, Jira, Confluence
Visualization: Power BI, Tableau
Education Details:
Bachelor’s Degree - Mahatma Gandhi Institute of Technology, Hyderabad, India 2014
Work Experience:
Client: CVS Health Group – TX
Role: Data Engineer Jan 2021 - Present
Responsibilities:
Designed and developed ETL integration patterns using Python on Spark. Participated in normalization/de-normalization, normal form analysis, and database design methodology. Used data modeling tools such as MS Visio and Erwin for the logical and physical design of databases.
Optimized PySpark jobs to run on secured clusters for faster data processing.
Used Python for SQL/CRUD operations in the database and for file extraction, transformation, and generation.
Developed Spark applications in Python (PySpark) on a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables.
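For illustration, a hedged sketch of that kind of PySpark load, assuming each source lands under its own path and its schema is inferred independently; the paths and table names are hypothetical.

# Illustrative-only sketch: load CSV files with differing schemas into Hive ORC tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-orc").enableHiveSupport().getOrCreate()

# Hypothetical landing paths; each file set has its own layout.
sources = {
    "customers": "/landing/customers/*.csv",
    "orders": "/landing/orders/*.csv",
}

for table, path in sources.items():
    # Infer each file set's schema independently, since the layouts differ.
    df = spark.read.option("header", True).option("inferSchema", True).csv(path)
    # Persist as an ORC-backed Hive table (the target database is assumed to exist).
    df.write.mode("overwrite").format("orc").saveAsTable(f"raw.{table}")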
Designed and developed Apache NiFi jobs to move files from transaction systems into the data lake raw zone.
Developed a framework for converting existing PowerCenter mappings to PySpark, Python, and Spark jobs.
Developed and implemented Apache Flink applications for real-time data processing, streaming analytics, and
batch processing.
Designed and optimized Flink data pipelines to ensure efficient data ingestion, transformation, and output.
Monitored and maintained Terraform infrastructure, ensuring resource optimization, cost efficiency, and
adherence to best practices for security and compliance.
Stayed updated with Terraform releases and new features, evaluating and implementing improvements to
infrastructure provisioning workflows and practices.
Created pipelines in ADF using linked services, datasets, and pipeline activities to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.
Extensive experience in developing data-driven web applications using the Angular framework.
Proficient in building interactive user interfaces and implementing responsive designs using Angular components,
directives, and services.
Strong knowledge of TypeScript and JavaScript, enabling seamless integration of backend data processing with
Angular frontend.
Designed Star and Snowflake data models for the Enterprise Data Warehouse using Erwin.
Created Spark clusters and configured high concurrency clusters using Azure Databricks to speed up the
preparation of high-quality data.
Responsible for ingesting data from various source systems (RDBMS, Flat files, Big Data) into Azure (Blob
Storage) using the framework model.
Built Azure WebJobs for Product Management teams to connect to different APIs and sources, extract the data, and load it into Azure Data Warehouse using WebJobs and Azure Functions.
Developed MapReduce programs and Pig scripts to aggregate daily eligible and qualified transaction details and store them in HDFS and Hive.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
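As an illustration of that kind of Hive-to-Spark conversion, a minimal sketch showing one hypothetical HQL aggregation expressed as both DataFrame and RDD transformations; the table and column names are assumptions.

# Hedged sketch: rewrite a Hive aggregation as Spark transformations (names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hql-to-spark").enableHiveSupport().getOrCreate()

# Original HQL: SELECT member_id, COUNT(*) FROM claims WHERE status = 'PAID' GROUP BY member_id
claims = spark.table("claims")

# Equivalent DataFrame transformations.
paid_counts_df = (claims.where(F.col("status") == "PAID")
                        .groupBy("member_id")
                        .count())

# The same logic expressed with RDD transformations.
paid_counts_rdd = (claims.rdd
                         .filter(lambda row: row["status"] == "PAID")
                         .map(lambda row: (row["member_id"], 1))
                         .reduceByKey(lambda a, b: a + b))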
Optimized MongoDB queries and indexes to enhance performance and ensure efficient data retrieval and
aggregation for analytical purposes.
Implemented data security measures in MongoDB by enforcing access controls, encryption, and data masking
techniques to protect sensitive information.
Performed data engineering responsibilities using agile software engineering practices. Migrated Matillion pipelines and Looker reports from Amazon Redshift.
Involved in converting HQL queries to Spark transformations using Spark RDDs with the support of Python and Scala.
Built ETL data pipelines on Hadoop/Teradata using Pig, Hive, and UDFs.
Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
Developed AWS CloudFormation scripts for hosting software.
Developed Apache Pig scripts to process HDFS data on Azure. Created Hive tables to store the processed results in tabular format.
Designed a serverless architecture using DynamoDB and AWS Lambda, with the Lambda code deployed from S3 buckets.
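A minimal sketch of that serverless pattern, assuming an S3 put event triggers a Lambda function that reads the new object and writes a summary item to DynamoDB; the bucket, key, and table names are hypothetical.

# Hedged Lambda handler sketch (illustrative only; resource names are hypothetical).
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("transactions-summary")

def handler(event, context):
    # An S3 put event invokes the function with the bucket and key of the new object.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    # Read the JSON payload that just landed in S3.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    payload = json.loads(body)

    # Persist a lightweight summary item keyed by the source object.
    table.put_item(Item={"object_key": key, "record_count": len(payload)})
    return {"statusCode": 200}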
Proficient in designing, developing, and maintaining microservices architecture to enable scalable and distributed
data processing using technologies such as Docker and Kubernetes.
Conducted performance tuning and optimization of MongoDB databases and microservices, identifying and
resolving bottlenecks to enhance overall system efficiency.
Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
Developed and maintained Airflow DAGs (Directed Acyclic Graphs) for data processing and ETL (Extract,
Transform, Load) workflows, ensuring that data is processed efficiently and accurately.
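For illustration, a minimal Airflow DAG sketch of that kind of daily ETL workflow; the DAG id, schedule, and task bodies are hypothetical placeholders.

# Minimal Airflow 2.x DAG sketch (illustrative only; task logic is hypothetical).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull source data")

def transform():
    print("clean and aggregate")

def load():
    print("write to the warehouse")

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run extract, then transform, then load.
    t_extract >> t_transform >> t_load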
Monitored Airflow workflows to ensure that they are running smoothly and troubleshoot any issues that arise,
ensuring that data is delivered on time and in the correct format.
Deployed a Kubernetes cluster on AWS with a master/worker architecture and wrote YAML manifests to create resources such as pods, deployments, autoscaling, load balancers, labels, health checks, namespaces, and ConfigMaps.
Created a data ingestion framework in Snowflake for different file formats using Snowflake stages and Snowpipe.
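A hedged sketch of that kind of Snowflake ingestion setup, issued through the Snowflake Python connector; the connection parameters, stage, pipe, table, and S3 path are hypothetical, and a real external stage would also need a storage integration or credentials.

# Illustrative-only setup of a stage and pipe via the Snowflake Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
)

ddl = [
    # External stage pointing at the landing location of the incoming files.
    "CREATE STAGE IF NOT EXISTS raw_stage URL='s3://my-bucket/landing/' "
    "FILE_FORMAT=(TYPE=CSV SKIP_HEADER=1)",
    # Pipe that continuously loads staged files into the target table.
    "CREATE PIPE IF NOT EXISTS raw_pipe AUTO_INGEST=TRUE AS "
    "COPY INTO raw_events FROM @raw_stage",
]

cur = conn.cursor()
try:
    for statement in ddl:
        cur.execute(statement)
finally:
    cur.close()
    conn.close()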
Implemented custom Azure Data Factory (ADF) pipeline activities and SCOPE scripts.
Set up Jenkins server and build jobs to provide continuous automated builds based on polling the Git source
control system to support development needs using Jenkins, Gradle, Git, and Maven.
Environment: Hadoop, Pig, Spark, Airflow, Spark SQL, Python, PySpark, Hive, HBase, ADF, Azure Databricks, Azure SQL, Scala, AWS, EC2, EBS, S3, VPC, Redshift, Oozie, Linux, Maven, Apache NiFi, Oracle, MySQL, Snowflake, HDFS, Jenkins, Unix shell scripting, CI/CD pipelines
Client: Dhruv soft Services Pvt Ltd – Hyderabad, India Feb 2014 - Dec 2016
Role: Python Developer
Responsibilities:
Worked extensively in SproutCore, managing the client side, with the backend in Python.
Expertise in Python scripting.
Designed the database architecture of NorthStar.
Migrated all database objects from SQL to Oracle.
Worked on the Oracle database for analyzing the data.
Implemented various performance techniques (partitioning, bucketing) in Oracle to improve performance.
Participated in project initiatives (PI planning) to plan and assess the technical work.
Used JIRA to keep track of sprint stories, tasks, and defects.
Supported many production release activities and actively interacted with business clients to resolve production issues quickly.
Performed data profiling and system analysis.
Utilized agile software development practices, coding, data and testing standards, secure coding practices, code reviews, source code management, continuous delivery, and software architecture.
Environment: Python, Oracle, SQL, Jupyter Notebook, SproutCore, UNIX, Jira, S3 buckets, SQL Server, MySQL, Git.