Engineering

Data Engineer II

Chennai, Tamil Nadu
Work Type: Full Time

Job Summary:

Building on the foundation of the SDE-I role, the DE-II position takes on a greater level of responsibility and leadership. You'll play a crucial role in driving the evolution and efficiency of our data collection and analytics platform, which handles terabyte-scale data and billions of data points.

Key Responsibilities:

  • Lead the design, development, and optimization of large-scale data pipelines and infrastructure using technologies such as Apache Airflow, Spark, and Kafka.
  • Architect and implement distributed data processing solutions to handle terabyte-scale datasets and billions of records efficiently across multi-region cloud infrastructure (AWS, GCP, DigitalOcean).
  • Develop and maintain real-time data processing solutions for high-volume data collection operations using technologies like Spark Streaming and Kafka.
  • Optimize data storage strategies using technologies such as Amazon S3, HDFS, and Parquet/Avro file formats for efficient querying and cost management.
  • Build and maintain high-quality ETL pipelines, ensuring robust data collection and transformation processes with a focus on scalability and fault tolerance.
  • Collaborate with data analysts, researchers, and cross-functional teams to define and maintain data quality metrics, implement robust data validation, and enforce security best practices.
  • Mentor junior engineers (SDE-I) and foster a collaborative, growth-oriented environment.
  • Participate in technical discussions, contribute to architectural decisions, and proactively identify improvements for scalability, performance, and cost-efficiency.
  • Ensure application performance monitoring (APM) is in place, utilizing tools like Datadog, New Relic, or similar to proactively monitor and optimize system performance, detect bottlenecks, and ensure system health.
  • Implement effective data partitioning strategies and indexing for performance optimization in distributed databases such as DynamoDB, Cassandra, or HBase.
  • Stay current with advancements in data engineering, orchestration tools, and emerging cloud technologies, continually enhancing the platform’s capabilities.

Qualifications & Experience:

  • 4-5+ years of hands-on experience with Apache Airflow and other orchestration tools for managing large-scale workflows and data pipelines.
  • Expertise in AWS technologies such as Athena, AWS Glue, and DynamoDB, along with Apache Spark, PySpark, SQL, and NoSQL databases.
  • Experience designing and managing distributed data processing systems that scale to terabytes of data and billions of records on cloud platforms such as AWS, GCP, or DigitalOcean.
  • Proficiency in large-scale web crawling and data extraction, including Node.js, HTTP protocols, and browser automation tools such as Puppeteer, Playwright, and Chromium.
  • Experience with monitoring and observability tools such as Grafana, Prometheus, and Elasticsearch, along with familiarity with tracking and optimizing resource utilization in distributed systems.
  • Strong understanding of infrastructure as code using Terraform, automated CI/CD pipelines with Jenkins, and event-driven architecture with Kafka.
  • Experience with data lake architectures and optimizing storage using formats such as Parquet, Avro, or ORC.
  • Strong background in optimizing query performance and tuning data processing frameworks (Spark, Flink, or Hadoop) for efficient processing at scale.
  • Knowledge of containerization (Docker, Kubernetes) and orchestration for distributed system deployments.
  • Deep experience in designing resilient data systems with a focus on fault tolerance, data replication, and disaster recovery strategies in distributed environments.
  • Strong data engineering skills, including ETL pipeline development, stream processing, and distributed systems.
  • Excellent problem-solving abilities, with a collaborative mindset and strong communication skills.

Submit Your Application
