
Make your next engineering career move with SoloPoint.

Site Reliability Engineer

Berkeley, CA 94720

Employment Type: Contract
Category: Software Engineer, Application
Job Number: 10574
Pay Rate: $70/hr - $80/hr

Job Description

Qualifications:
  • BS in Computer Science, IT, Networking, or a similar field
  • A minimum of 5 years of related experience
  • Experience with network security: configuring and maintaining ACLs, knowledge of firewalls
  • Understanding of networks and network protocols
  • Strong hands-on knowledge of the Linux shell and working in a command-line (e.g., SSH) environment
  • C, C++, Perl, Java, Python, or a scripting language, with knowledge of standard software development practices
  • Knowledge of and ability to work on large data communications networks and IT infrastructure supporting highly available systems and applications
  • Working knowledge of Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization is helpful
  • Strong communication skills and ability to work effectively across multiple technical teams
  • Experience working in a 24/7 onsite team managing large data centers or other large installations
  • A certification in system administration (platforms or software), or other advanced education in computer science

Responsibilities:
 
  • Review and respond to alerts from computer systems, storage, network, and other data center/facility related systems
  • Create solutions that improve processes, prevent issues from recurring, and automate the response to all routine service conditions
  • Identify issues and propose solutions that will improve the ability to monitor or provide better automation for monitoring or triage
  • Respond to alerts from OMNI to ensure that the system continues to collect data 24x7 and provide real-time information for diagnosis
  • Develop and maintain tools within the monitoring pipeline in collaboration with the Operations Team
  • Create new software programs to provide alerts and notifications from the HPC system APIs and into the monitoring pipeline
  • Create new software configurations and solve technical issues so that programs can scale to denser data and deliver reliably at scale

 
