About the Customer
The leading provider of vehicle lifecycle solutions enables the companies that build, insure, repair, and replace vehicles to power the next generation of transportation. The company delivers advanced mobile, artificial intelligence, and connected car technologies through its platform. It connects a network of 350+ insurance companies, 24,000+ repair facilities, OEMs, parts suppliers, and third-party data and service providers. The customer's collective solutions inform decision-making, enhance productivity, and help clients deliver faster and better experiences for end consumers.
About the Project
The Site Reliability Engineer (SRE) will be embedded within the Product Development team and will be responsible for the overall reliability and availability of the team's applications.
The customer has been working on next-generation analytics since 2018 and has migrated to Amazon EMR for cloud big data platform services. The customer provides software products and services to insurance companies, repair shops, OEMs, parts suppliers, and others, and has a variety of products in the auto physical damage, casualty, telematics, and parts domains. All of these applications share data with the analytics team to build an enterprise data lake, which in turn allows the customer to run next-generation analytics on the amassed data.
Tech Stack:
Languages: Python, Spark, SQL, Java, Scala, Shell programming
Operating System: Linux, Windows Server
Automation Tools (key skill): Terraform, Ansible
Query Engine: Presto, Hive, SparkSQL
RDBMS: Oracle, PostgreSQL, MySQL
AWS Technologies (key skills): AWS S3, EMR, EC2, EBS, VPC, IAM, CloudWatch, Redshift, Lambda, MSK, CloudEndure, Route53, MWAA (Amazon Managed Airflow), SNS, Secrets Manager, EMRFS
Hadoop Big Data Tech: HDFS, YARN, Hive, Spark, Kafka, Kafka Connectors, NiFi, Jupyter Notebook, Tez
Source Version Control: Git
CI/CD: Jenkins
Networking: Subnets, Routing, VPC Peering, VPC endpoints, Route53
BI Tool: Tableau
Monitoring and Data Visualization: AppDynamics, AlertSite, Nagios, Grafana, Prometheus, Kibana, Datadog, CloudWatch, Kafdrop
Incident Management: Remedyforce, PagerDuty
Certifications: AWS, Kubernetes, SRE, Java
Responsibilities
Design, develop, and maintain CAD software applications using C# and .NET frameworks
Participate in the entire software development lifecycle, from concept and design to testing
Help build an SRE culture by sharing best practices, approaches, documentation, and code with other engineering teams across the organization
Document tribal knowledge as you acquire it over time by creating runbooks/playbooks and ensuring critical system information is readily available to those who need it through dashboards
Configure and maintain the monitoring tooling for the target application
Monitor application/infrastructure and take steps to improve overall system software performance, availability, and reliability by incorporating changes through defined feedback loops within the software delivery lifecycle
Automate any tasks or parts of the system that are performed manually
Work closely with software developers and testers to ensure the product meets non-functional requirements such as security, performance, and availability
Resolve NOC escalations and help prevent the recurrence of incidents by creating processes and automation
Be a key part of our response to high-severity internal customer incidents, ensuring we meet all SLAs and SLOs
Embrace failures and treat incidents as learning opportunities by conducting blameless postmortems
Participate in product engineering stand-ups and related design activities
Coach other team members to ensure systems are supported by following SRE best practices
Requirements
Past enterprise-level experience in DevOps, Software, Infrastructure, or Site Reliability Engineering with the ability to demonstrate an understanding of high-level technical briefs, talks, and ideas
Extensive experience leading teams in troubleshooting, issue resolution, or escalations
Ability to document solutions, SRE architectural patterns, and best practices to ensure that teams have guidance as needed
Proven ability to dig through metrics, logs, and available sources to triage and resolve an incident at any time
Proficiency in the full software delivery lifecycle
Understanding of Microservices and APIs
Experience and interest in working in an Agile environment
Versed in system management, monitoring, and analysis to identify opportunities to improve service health, manageability, and reliability
Eager to solve problems and troubleshoot issues that arise day to day
Communication and interpersonal skills
Nice to have:
Experience working as an SRE maintaining the reliability of applications and infrastructure
Proficiency in infrastructure-as-code practices
Knowledge of building CI/CD pipelines from scratch
Ability to troubleshoot complicated, cross-platform issues spanning OS, networking, database, and applications in cloud-based environments
English level:
Intermediate+