Hiring - Senior Big Data Developer with PySpark - Charlotte, NC; Dallas, TX; Phoenix, AZ; Raleigh, NC; Denver, CO; Minneapolis, MN; Salt Lake City, UT; San Antonio, TX; Des Moines, IA; St. Louis, MO
Job Title: Senior Big Data Developer with PySpark
Job Type: W2 Contract
Long-term contract
Hybrid schedule
Location: Charlotte, NC; Dallas, TX; Phoenix, AZ; Raleigh, NC; Denver, CO; Minneapolis, MN; Salt Lake City, UT; San Antonio, TX; Des Moines, IA; St. Louis, MO; Philadelphia, PA; Chicago, IL; Boston, MA; Seattle, WA; Palo Alto, CA; San Francisco, CA; New York City, NY; Jersey City, NJ
Top Skills' Details
1) Experience building/standing up an on-prem big data solution (Hadoop, Cloudera, Hortonworks), data lake, data warehouse, or similar solution.
2) All roles require 100% hands-on experience and strong data foundations. These skills MUST come from an on-prem environment, NOT the cloud; having both is fine, but the experience must include an on-prem big data environment.
3) Big Data Platform (data lake) and data warehouse engineering experience, preferably with the Hadoop stack: HDFS, Hive, SQL, Spark, Spark Streaming, Spark SQL, HBase, Kafka, Sqoop, Atlas, Flink, Cloudera Manager, Airflow, Impala, Tez, Hue, and a variety of source data connectors. Solid hands-on software engineer who can design and code Big Data pipeline frameworks (as a software product, ideally on Cloudera) – not just a “data engineer” implementing Spark jobs or a team lead for data engineers. Building self-service data pipelines that automate controls, ingest data into the ecosystem (data lake), and transform it for different consumption patterns to support GCP and Hadoop on-premise, bringing in massive volumes of cybersecurity data and validating data and data quality.
4) Solid PySpark developer – experience with Spark Core, Spark Streaming, Spark optimizations (knows how to optimize the code), and the PySpark API. Experience writing PySpark code; PySpark with solid Hadoop data lake foundations (see the sketch after this list).
5) Airflow experience – hands-on use of Airflow and developing workflows in it.
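For illustration only (not part of the posting): a minimal PySpark sketch of the hands-on work described in items 3–5, assuming a generic on-prem Hive/HDFS setup; all table and column names are hypothetical.

# Minimal PySpark sketch (illustrative only; table/column names are hypothetical).
# Shows common optimization habits: column pruning, predicate pushdown,
# a broadcast join for a small dimension table, and explicit partitioning on write.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("cyber-events-enrichment")
    .enableHiveSupport()   # assumes an on-prem Hive metastore is available
    .getOrCreate()
)

# Read only the needed columns and let Spark push the date filter down to the source.
events = (
    spark.table("security.raw_events")   # hypothetical Hive table
    .select("event_id", "src_ip", "event_ts", "event_type")
    .filter(F.col("event_ts") >= "2024-01-01")
)

# Small reference table: broadcast it to avoid a shuffle-heavy join.
ip_reputation = spark.table("security.ip_reputation")   # hypothetical
enriched = events.join(F.broadcast(ip_reputation), on="src_ip", how="left")

# Control partition layout and file counts before writing back to the data lake.
(
    enriched
    .repartition(200, "event_type")
    .write
    .mode("overwrite")
    .partitionBy("event_type")
    .saveAsTable("security.enriched_events")   # hypothetical target table
)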
Job Description
The Big Data Lead Software Engineer is responsible for owning and driving technical innovation with big data technologies. The individual is a subject matter expert technologist with strong Python experience and deep hands-on experience building data pipelines for the Hadoop platform as well as Google Cloud. This person will be part of successful Big Data implementations for large data integration initiatives. Candidates for this role must be willing to push the limits of traditional development paradigms typically found in a data-centric organization while embracing the opportunity to gain subject matter expertise in the cyber security domain.
In this role, you will:
§ Lead the design and development of sophisticated, resilient, and secure engineering solutions for modernizing our data ecosystem that typically involve multiple disciplines, including big data architecture, data pipelines, data management, and data modeling specific to consumer use cases.
§ Provide technical expertise for the design, implementation, maintenance, and control of data management services – especially end-to-end, scale-out data pipelines.
§ Develop self-service, multitenant capabilities on the cyber security data lake, including custom/off-the-shelf services integrated with the Hadoop platform and Google Cloud; use APIs and messaging to communicate across services; integrate with distributed data processing frameworks and data access engines built on the cluster; integrate with enterprise services for security, data governance, and automated data controls; and implement policies to enforce fine-grained data access.
§ Build, certify, and deploy highly automated services and features for data management (registering, classifying, collecting, loading, formatting, cleansing, structuring, transforming, reformatting, distributing, and archiving/purging) through the Data Ingestion, Processing, and Consumption stages of the analytical data lifecycle.
§ Provide the highest level of technical leadership in the design, engineering, deployment, and maintenance of solutions through collaborative efforts with the team and third-party vendors.
§ Design, code, test, debug, and document programs using Agile development practices.
§ Review and analyze complex data management technologies that require in-depth evaluation of multiple factors, including intangibles or unprecedented factors.
§ Assist in production deployments, including troubleshooting and problem resolution.
§ Collaborate with enterprise, data platform, data delivery, and other product teams to provide strategic solutions, influencing long range internal and enterprise level data architecture and change management strategies.
§ Provide technical leadership and recommendation into the future direction of data management technology and custom engineering designs.
§ Collaborate and consult with peers, colleagues, and managers to resolve issues and achieve goals.
10+ years of Big Data Platform (data lake) and data warehouse engineering experience, preferably with the Hadoop stack: HDFS, Hive, SQL, Spark, Spark Streaming, Spark SQL, HBase, Kafka, Sqoop, Atlas, Flink, Cloudera Manager, Airflow, Impala, Tez, Hue, and a variety of source data connectors. Solid hands-on software engineer who can design and code Big Data pipeline frameworks (as a software product, ideally on Cloudera) – not just a “data engineer” implementing Spark jobs or a team lead for data engineers. Building self-service data pipelines that automate controls, ingest data into the ecosystem (data lake), and transform it for different consumption patterns to support GCP and Hadoop on-premise, bringing in massive volumes of cybersecurity data and validating data and data quality. Reporting consumption – advanced analytics, data science, and ML.
- 3+ years of hands-on experience designing and building modern, resilient, and secure data pipelines, including movement, collection, integration, and transformation of structured/unstructured data with built-in automated data controls, built-in logging/monitoring/alerting, and pipeline orchestration managed to operational SLAs. Preferably using Airflow custom operators (at least 1 year of experience customizing them), DAGs, and connector plugins (a minimal sketch follows below).
- Python, Spark, PySpark; working with APIs to integrate different services; Google big data services – Cloud Dataproc, Datastore, BigQuery, Cloud Composer. On-prem: Apache Airflow as the core orchestrator; Kafka for streaming services, with data sourced from Kafka into Spark Streaming.
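For illustration only: a minimal Airflow sketch of the customization mentioned above – a small custom operator wired into a DAG. The operator name, DAG id, table name, and schedule are hypothetical, and the example assumes Airflow 2.4+.

# Minimal Airflow sketch (illustrative only): a small custom operator plus a DAG
# that orchestrates an ingest -> validate flow. All names/schedules are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import BaseOperator
from airflow.operators.bash import BashOperator


class DataQualityCheckOperator(BaseOperator):
    """Hypothetical custom operator: would fail the task if a row-count threshold isn't met."""

    def __init__(self, table: str, min_rows: int = 1, **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.min_rows = min_rows

    def execute(self, context):
        # In a real pipeline this would query Hive/Impala or a metastore;
        # here it only logs the intent so the sketch stays self-contained.
        self.log.info("Checking %s has at least %d rows", self.table, self.min_rows)


with DAG(
    dag_id="cyber_ingest_example",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # 'schedule' parameter assumes Airflow 2.4+
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_events",
        bash_command="echo 'placeholder for a Sqoop import or spark-submit'",
    )
    quality_check = DataQualityCheckOperator(
        task_id="validate_raw_events",
        table="security.raw_events",    # hypothetical table
        min_rows=1000,
    )
    ingest >> quality_check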
Skill set: Python, Spark/PySpark, and APIs to integrate with various services; Google big data services – Cloud Dataproc, Datastore, BigQuery, Cloud Composer. On-prem: Apache Airflow as the core orchestrator; Kafka for streaming services, with data sourced from Kafka into Spark Streaming (see the streaming sketch below). Building self-service data pipelines that support GCP and Hadoop on-premise, bring in massive volumes of cybersecurity data, and validate data and data quality. Reporting consumption – advanced analytics, data science, and ML.
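For illustration only: a minimal sketch of the Kafka-to-Spark-streaming pattern referenced above, assuming the spark-sql-kafka connector package is on the classpath; broker addresses, topic, schema, and paths are hypothetical.

# Minimal Spark Structured Streaming sketch (illustrative only): read security events
# from Kafka, parse the JSON payload, and land the records in the data lake as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-lake").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("src_ip", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "cyber.events")                 # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes in the 'value' column; parse it as JSON.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "/data/lake/cyber/events")                 # hypothetical HDFS path
    .option("checkpointLocation", "/data/checkpoints/cyber_events")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()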
Additional Skills & Qualifications
Additional skills that are a plus for any/all of the above candidates: GCP, Kafka/Kafka Connect, Hive DB development
Thanks & Regards
Javid Ahmad
Sr. Technical Recruiter
Tigerbells LLC
Suite 52, 1405 Chews Landing Rd.
Laurel Springs, NJ 08021
Phone: +1 609 759 1987
LinkedIn: linkedin.com/in/
Email: javid@tigerbells.com