Summary
Data Engineer with 4+ years of experience building scalable ETL pipelines, data quality frameworks, and modern data warehouse (DWH) architectures. Proficient in Python, SQL, and dbt, with hands-on experience in Apache Airflow, Spark, and DevOps practices (CI/CD, Kubernetes). Experienced in designing reliable data architectures and optimizing query performance for high-load environments.
Work Experience
- Joined a newly formed Data Quality team and architected a configuration-driven ecosystem to ensure platform-wide data integrity:
  - Developed a Python/SQL framework for business-logic validation (replacing Great Expectations).
  - Built a scalable Spark-based reconciliation service for detecting discrepancies between event streams and S3 storage.
  - Engineered a dynamic DAG generator in Airflow: users supply SQL and YAML configs, which are parsed and stored in ClickHouse; Airflow continuously polls ClickHouse to auto-create, update, and schedule pipelines (sketched below). Scaled to orchestrate 2,000+ active data quality checks across the platform.
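A minimal sketch of the config-driven DAG generation pattern described above, assuming a hypothetical ClickHouse table dq.checks that holds the parsed YAML/SQL configs and a single-task check DAG; the real framework's schema, operators, and alerting are more involved.

```python
# Sketch of dynamic DAG generation: the Airflow scheduler re-parses this file,
# reads active check definitions from ClickHouse, and registers one DAG per check.
# Table/column names and connection settings are illustrative, not the real schema.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from clickhouse_driver import Client  # assumed ClickHouse client library


def run_check(sql: str, **_):
    """Execute one data quality check; fail the task if the SQL returns violations."""
    rows = Client(host="clickhouse").execute(sql)
    if rows:  # convention assumed here: the check SQL returns offending rows only
        raise ValueError(f"Data quality check failed: {len(rows)} bad rows")


def load_check_configs():
    """Fetch active check configs previously parsed from YAML into ClickHouse."""
    return Client(host="clickhouse").execute(
        "SELECT check_id, schedule, sql FROM dq.checks WHERE is_active = 1"
    )


for check_id, schedule, sql in load_check_configs():
    dag = DAG(
        dag_id=f"dq_check_{check_id}",
        start_date=datetime(2024, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    )
    PythonOperator(
        task_id="run_check",
        python_callable=run_check,
        op_kwargs={"sql": sql},
        dag=dag,
    )
    globals()[dag.dag_id] = dag  # expose each generated DAG to the scheduler
```

Registering generated DAGs in globals() is the standard Airflow idiom for dynamic DAGs, so new or updated checks appear on the scheduler's next parse without redeploying code.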
- Managed OpenMetadata infrastructure on Kubernetes (deployed via ArgoCD), customizing Helm charts and automating metric ingestion.
- Established CI/CD pipelines using GitLab CI, Tox, and pre-commit hooks; implemented database version control via Liquibase.
- Pioneered GenAI initiatives by developing the first AI agent prototype using LangGraph.
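As an illustration of the kind of LangGraph prototype referenced above, here is a minimal two-node plan/act graph; the state fields and node logic are placeholders rather than the actual agent.

```python
# Minimal LangGraph sketch: a two-step "plan -> act" agent graph.
# State fields and node bodies are illustrative placeholders only.
from typing import TypedDict

from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    question: str
    plan: str
    answer: str


def plan_step(state: AgentState) -> dict:
    # A real agent would call an LLM here to produce a plan.
    return {"plan": f"Look up the data needed to answer: {state['question']}"}


def act_step(state: AgentState) -> dict:
    # A real agent would execute tools / queries here.
    return {"answer": f"Executed plan: {state['plan']}"}


graph = StateGraph(AgentState)
graph.add_node("plan", plan_step)
graph.add_node("act", act_step)
graph.set_entry_point("plan")
graph.add_edge("plan", "act")
graph.add_edge("act", END)

agent = graph.compile()
print(agent.invoke({"question": "How many data quality checks failed today?"}))
```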
- Developed multi-gigabyte data marts for a team of 4 Data Scientists using a Spark-based internal ETL platform within the Hadoop ecosystem, accelerating training and feature engineering for ~10 ML models.
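A hedged sketch of the kind of Spark job behind such data marts; the table names, features, and output target are illustrative assumptions, not the actual pipeline.

```python
# Illustrative PySpark job: aggregate raw events into a feature mart for DS training.
# Table names, columns, and the output table are placeholder assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("user_feature_mart").getOrCreate()

events = spark.table("raw.user_events")  # assumed Hive table of raw events

features = (
    events
    .where(F.col("event_date") >= "2024-01-01")
    .groupBy("user_id")
    .agg(
        F.count("*").alias("events_cnt"),
        F.countDistinct("session_id").alias("sessions_cnt"),
        F.sum("purchase_amount").alias("gmv"),
    )
)

# Persist as a partitioned mart that downstream ML training jobs can read.
(features
 .withColumn("snapshot_date", F.current_date())
 .write.mode("overwrite")
 .partitionBy("snapshot_date")
 .saveAsTable("dm.user_features"))
```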
- Extended ETL capabilities by writing custom UDFs in Scala and optimizing complex SQL logic for high-load processing.
- Maintained the backend of a real-time scoring service (PostgreSQL) and orchestrated recurring data workflows using Apache Airflow.
- Led the data migration of 4 regional services to the SMEV4 standard, managing metadata registries and ensuring continuous data integrity.
- Designed an end-to-end reporting system: defined API requirements, built Python/SQL ETL pipelines, and created an inter-departmental dashboard tracking the percentage of overdue citizen applications.
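The core of the overdue-applications metric could look like the following pandas sketch; the API endpoint, field names, and the 30-day SLA are assumptions for illustration.

```python
# Sketch of the overdue-applications metric behind the dashboard.
# API URL, field names, and the 30-day SLA threshold are illustrative assumptions.
import pandas as pd
import requests

resp = requests.get("https://example.gov/api/applications", timeout=60)
apps = pd.DataFrame(resp.json())

apps["submitted_at"] = pd.to_datetime(apps["submitted_at"])
apps["resolved_at"] = pd.to_datetime(apps["resolved_at"])

sla = pd.Timedelta(days=30)
age = apps["resolved_at"].fillna(pd.Timestamp.now()) - apps["submitted_at"]
apps["is_overdue"] = age > sla

# Share of overdue applications per department, as tracked on the dashboard.
overdue_pct = apps.groupby("department")["is_overdue"].mean().mul(100).round(1)
print(overdue_pct)
```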
- Deployed and administered a self-hosted Apache Airflow instance on a dedicated server.
Skills
Programming: Python, SQL.
Big Data & ETL: Apache Airflow, Apache Spark, Hadoop (HDFS), dbt, Pandas.
DevOps & Infrastructure: Docker, Kubernetes, ArgoCD, Helm, GitLab CI/CD, Linux, Liquibase.
Other: OpenMetadata, LangGraph (GenAI), Data Quality methodologies.
Education
- Faculty: Control Systems and Navigation