Pappaya Cloud

Contact Info

Svetlanas, St. Ann's Hill Road, KT16 9NN, United Kingdom

enquiries@pappayacloud.com

Job Details

The Site Reliability Engineer (SRE) role is an advanced, key role in managing and maintaining the complete Pappaya Cloud platform and infrastructure. It is a specialized role that requires a specific level of knowledge and work experience in a handful of technical disciplines. The role focuses on making sure Pappaya Cloud platforms and services are available to customers when they need them. It contributes to enhancing system reliability and availability within scope, prioritizing work based on the SLA and SLO metrics collected to measure the performance of the site, systems and services. The role combines system engineering and system administration to ensure the scalability, performance and reliability of the Pappaya Cloud platform, applications and services.
Responsibilities
  1. Monitor and manage the availability, performance and reliability of the Pappaya Cloud platform infrastructure, applications and services by employing effective methods to measure system health.
  2. Identify, compare and deploy the right monitoring tools, whether open source or commercial. Develop an architecture around monitoring and implement it to observe different types of data related to:

    • Resource monitoring: server monitoring to capture data related to RAM, CPU and disk space; monitoring of physical server hardware health, such as temperature and component uptime; and monitoring of cloud-based environments
    • Network monitoring
    • Application performance monitoring
    • Third party component monitoring
  3. Develop rules and thresholds to trigger alerts and events. Develop meaningful metrics and manage the end-to-end incident management process.
  4. Develop proactive improvement plans and execute projects that improve the reliability and performance of systems and applications, backed by meaningful monitoring data.
  5. Develop plans and project direction to build automation systems that manage the Pappaya Cloud platform infrastructure and applications.
  6. Provide primary operational support and engineering for multiple large-scale distributed software applications.
  7. Gather and analyse metrics from OS, network, storage and applications to assist in performance tuning and fault tolerance.
  8. Provide adequate capacity planning and management support to the operations and senior management teams to ensure systems can handle expected and unexpected traffic loads.
  9. Deploy technologies to identify unusual behaviour in the software and, more importantly, collect data that enables engineers and the operations team to identify the root cause of an issue. This includes data such as traces, metrics and logs.
  10. Evaluate service delivery quality and reliability through standard indicators such as SLOs, SLAs and SLIs.
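
As an illustration of the SLI and SLO evaluation mentioned in the responsibilities above, the following is a minimal sketch in Python. It assumes an availability SLI defined as the fraction of successful requests. The names `SloReport` and `evaluate_slo` and the 99.9% target are hypothetical and are not taken from Pappaya Cloud's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of successful requests
    slo_met: bool            # True when the SLI is at or above the target
    budget_remaining: float  # fraction of the error budget still unspent


def evaluate_slo(good_requests: int, total_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target.

    Hypothetical helper for illustration only.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")

    sli = good_requests / total_requests
    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    # A negative value means the budget has been overspent.
    budget_remaining = 1.0 - actual_failures / allowed_failures
    return SloReport(sli=sli, slo_met=sli >= slo_target, budget_remaining=budget_remaining)


if __name__ == "__main__":
    # Assumed numbers: 500 failed requests out of 1,000,000 against a 99.9% target.
    print(evaluate_slo(good_requests=999_500, total_requests=1_000_000, slo_target=0.999))
```

With these assumed numbers the target tolerates about 1,000 failures, so 500 failures leave roughly half of the error budget unspent. A remaining budget below zero would be one possible signal to prioritize reliability work over new changes.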
Skills and Abilities
  1. Because this is a shared role, it is important to have the skills to navigate across teams and the organization.
  2. Work effectively with cross-functional teams
  3. Attention to detail and data-driven decision making
  4. Keen interest in understanding trends in cloud and network technologies.
  5. Alignment with Pappaya Cloud business goals and priorities
Required Skill Set & Experience
  1. Systems Engineering, Systems Administration and Operations.
  2. Deep and comprehensive knowledge of operating systems and server management.
  3. Ability to physically build, configure, wire up and install hardware.
  4. Deep and comprehensive knowledge of resource virtualization and software-defined services.
  5. Expertise in network implementation and management.
  6. Good understanding of and experience in software development and software engineering
  7. Good understanding of and experience in quality assurance and test engineering.
  8. Comprehensive experience in building automation pipelines
  9. Database administration skills are good to have
Required Skill Set & Experience
  1. 5+ years of experience programming (structured and OOP) using one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript
  2. 3+ years of experience with distributed storage technologies such as NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN)
  3. 3+ years of experience in deploying and managing automation and orchestration technologies like Terraform, Ansible, SCCM, Puppet and Salt.
  4. 3+ years of experience in deploying and managing monitoring technologies like New Relic, Dynatrace, AppDynamics, Grafana and Datadog
  5. 3+ years of experience in deploying and managing observability technologies like OpenTelemetry etc.
  6. 3+ years of experience with reliability engineering tools like Litmus, Chaos Monkey and Chaos Toolkit
  7. 5+ years of experience in managing incident, problem and change management processes in ITSM tools like Remedy, Jira and Zenduty

 

  1. Coding experience beyond simple scripts
  2. Technical engineering
  3. Ability to define road maps and lifecycle management of internal applications / tools
  4. Good understanding of enterprise application design and architecture principles
  5. A proactive approach to identifying problems, bottlenecks and areas of improvement.
Education/Qualification
  1. Bachelor’s degree or equivalent in computer engineering or computer science preferred.
  2. Certification in a public cloud platform.
  3. Enterprise architecture certifications such as TOGAF, Open CA, SABSA etc.
  4. ITIL V4 certification preferred
Location

Chennai, India

Job Type

On-site - Full Time

Salary

As Per Industry Standards

Apply for this role now
