Middle/Senior Site Reliability Engineer

Remote

Only from Hungary, Poland

Job Type

Full Time

Status

New

About the Customer

The leading provider of vehicle lifecycle solutions enables the companies that build, insure, repair, and replace vehicles to power the next generation of transportation. The company delivers advanced mobile, artificial intelligence, and connected car technologies through its platform. It connects a network of 350+ insurance companies, 24,000+ repair facilities, OEMs, parts suppliers, and third-party data and service providers. The customer's collective solutions inform decision-making, enhance productivity, and help clients deliver faster and better experiences for end consumers.

About the Project

The Site Reliability Engineer (SRE) will be embedded within the Product Development team and be responsible for the overall reliability and availability of the team's applications.
The customer has been working on next-generation analytics since 2018 and has migrated to Amazon EMR for cloud big data platform services. The customer provides software products and services to insurance companies, repair shops, OEMs, parts suppliers, and others, and has a variety of products in the auto physical damage, casualty, telematics, and parts domains. All of these applications share data with the analytics team to build an enterprise data lake, which also allows the customer to run next-generation analytics on the amassed data.

Tech Stack:
Languages: Python, Spark, SQL, Java, Scala, Shell programming
Operating System: Linux, Windows Server
Automation Tools (key skill): Terraform, Ansible
Query Engine: Presto, Hive, SparkSQL
RDBMS: Oracle, PostgreSQL, MySQL
AWS Technologies (key skills): AWS S3, EMR, EC2, EBS, VPC, IAM, CloudWatch, Redshift, Lambda, MSK, CloudEndure, Route53, MWAA (Amazon Managed Airflow), SNS, Secrets Manager, EMRFS
Hadoop Big Data Tech: HDFS, YARN, Hive, Spark, Kafka, Kafka Connectors, NiFi, Jupyter Notebook, Tez
Source Version Control: Git
CI/CD: Jenkins
Networking: Subnets, Routing, VPC Peering, VPC endpoints, Route53
BI Tool: Tableau
Monitoring and Data Visualization: AppDynamics, AlertSite, Nagios, Grafana, Prometheus, Kibana, Datadog, CloudWatch, Kafdrop
Incident Management: Remedyforce, PagerDuty
Certifications: AWS, Kubernetes, SRE, Java

Responsibilities

  • Participate in the entire software development lifecycle, from concept and design to testing

  • Help build an SRE culture by sharing best practices, approaches, documentation, and code with other engineering teams across the organization

  • Document tribal knowledge as you acquire it over time by creating runbooks/playbooks and ensuring critical system information is readily available to those who need it through dashboards

  • Configure and maintain the monitoring tooling as it relates to the target application

  • Monitor application/infrastructure and take steps to improve overall system software performance, availability, and reliability by incorporating changes through defined feedback loops within the software delivery lifecycle

  • Apply automation to any tasks/parts of the system that are performed manually

  • Work closely with software developers and testers to ensure the product meets non-functional requirements such as security, performance, and availability

  • Resolve NOC escalations and help prevent the recurrence of incidents by creating processes and automation

  • Be a key part of our response to high-severity internal customer incidents, ensuring we meet all SLAs and SLOs

  • Embrace failures and treat incidents as learning opportunities by conducting blameless postmortems

  • Participate in product engineering stand-ups and related design activities

  • Coach other team members to ensure systems are supported by following SRE best practices

Requirements

  • Past enterprise-level experience in DevOps, Software, Infrastructure, or Site Reliability Engineering with the ability to demonstrate an understanding of high-level technical briefs, talks, and ideas

  • Expertise in leading teams through troubleshooting, issue resolution, or escalations

  • Ability to document solutions, SRE architectural patterns, and best practices to ensure that teams have guidance as needed

  • Proven ability to dig through metrics, logs, and available sources to triage and resolve an incident at any time

  • Proficiency in the full software delivery lifecycle

  • Understanding of Microservices and APIs

  • Experience and interest in working in an Agile environment

  • Versed in system management, monitoring, and analysis to identify opportunities to improve service health, manageability, and reliability

  • Eager to solve problems and troubleshoot issues that arise day to day

  • Communication and interpersonal skills

Nice to have:

  • Experience functioning as an SRE in maintaining the reliability of the applications and infrastructure

  • Proficient in infrastructure as code practices

  • Knowledge of building CI/CD pipelines from scratch

  • Able to troubleshoot complicated, cross-platform issues spanning the OS, networking, database, and application layers in cloud-based environments

English level:

  • Intermediate+
