Jobs at NorthShore Resources, Inc.

Site Reliability Consultant

Minneapolis, MN
NorthShore Resources is seeking a Site Reliability Consultant for a project at one of our clients.
 
  1. Business need/project background: 
Hennepin County's Information Technology (IT) Department is seeking a Senior IT Site Reliability Engineer (SRE) to join its Production Support Unit. The Production Support Unit is responsible for a variety of operational services, with a focus on providing enterprise services (such as enterprise infrastructure monitoring) that result in a highly effective technology experience for our staff and business partners. As a lead member of our SRE service approach, and in support of our world-class hybrid cloud-based services, the consultant will advocate for product optimization, deploy scalable automation, monitor capacity and performance, coordinate incidents, perform root cause analysis, and conduct incident post-mortems.
 
  2. Scope of services/description of work to be performed:
  • Take responsibility for designing solutions that correspond to non-functional requirements such as availability, performance, security, and maintainability.
  • Leverage your expertise in coding, algorithms, complex analysis, enterprise incident coordination, and large-scale system design.
  • Model SRE culture of intellectual curiosity, problem solving, openness, collaboration, reasonable risk taking, and big thinking in a self-directed environment.
  • Build highly scalable platforms and fault-tolerant systems across a range of technologies.
  • Define service level objectives and drive their adoption and enforcement at both the service and experience levels.
  • Perform root-cause analysis of complex scaling and performance problems involving multiple integrated systems and services, networks, hardware, and software.
  • Set standards for deployments at scale, infrastructure reliability, and scalability.
  • Influence engineering teams through customer focus, world-class quality, effective communication, decisive and fast-moving solutions, and quick, constructive resolution of conflicts.
  • Manage service availability and scalability through process, tools, and automation.
  • Perform post-mortems and optimize incident response processes.
  • Lead incident response for production incidents; drive investigation, analysis, and troubleshooting to resolve production incidents and systematically drive down detection and mitigation times.
  • Bring a strong engineering focus to operations, putting your energy into incident prevention, automation frameworks, self-service infrastructure, logging and metrics, and operational scorecards.
  • Develop CI/CD processes to improve cadence.
  • Identify or utilize existing tools for logging, monitoring, event management, notification, runbook automation, and root cause analysis.
  • Partner with security engineers to develop plans and automation that aggressively and safely respond to new risks and vulnerabilities.
  • Develop, communicate, and monitor standard processes to promote the long-term health and sustainability of operational development tasks.
  3. Specific skills/experience required:
  • 2+ years of experience related to IT Site Reliability Engineering, such as configuration, monitoring, information management, AIOps, DevOps, technical architecture, cloud management systems, ITOM/HDIM, incident coordination, or other components of experience-centric operations.
  • Experience:
    • Experience building and maintaining application stacks in a hybrid cloud environment; expertise with Microsoft Azure is a plus.
    • Thought leader and mentor for internal and external technical talent.
    • 3-5 years or more building and scaling distributed systems leveraging web-scale technologies like Linux, Apache, MongoDB, Python, Oracle RDBMS, Redis, Postgres, and Hadoop.
    • Experience with Linux/Unix internals and system services like DNS, DHCP, TFTP, iptables, and SMTP, as well as networking protocols such as TCP, UDP, and HTTP.
    • Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell, PowerShell; familiarity with JSON, YAML, REST, and CLI tooling, and with CI/CD tools such as Travis, Drone, Jenkins, and Azure DevOps.
    • Hands-on experience using source control (Git, GitHub) and feature branching strategies.
  • Preferred technical and professional expertise:
    • Experience with containers, such as Docker, Kubernetes, and OpenShift.
    • Experience with monitoring and observability tools such as New Relic, Nagios, Icinga, or Sysdig.
    • Experience automating infrastructure, configuration management, testing, and deployments using tools like Ansible and Chef; able to explain the Infrastructure as Code paradigm.
    • Experience participating in security compliance efforts and drafting and/or reviewing IT policies.
  • Excellent interpersonal, written, and verbal communication skills.
  • Ability to:
    • Adapt to changing priorities, demands, and timelines.
    • Champion change throughout the organization.
    • Establish and maintain effective working relationships with all levels of the organization and contribute in a team environment.
    • Work as a leader in a team environment, ensuring customer satisfaction and technical excellence.
  4. Project deliverables:
  • Define and propose SRE tasks, expectations, training, and measurable outcomes to establish the foundations for SRE at Hennepin County.
  • Define and propose an implementation approach for establishing PIM, based on ITIL 4. This should include tasks, expectations, training plans, and measurable outcomes.
  • Actively participate in the definition and adoption of cloud migration strategies for high reliability.
  • Lead the definition of requirements and implementation of processes and procedures for operations management in a hybrid cloud environment.
 
 
 
