Find Jobs

Make your next engineering career move with SoloPoint.

Sorry, this position is no longer available. You are still welcome to submit your application, as we may have other positions that would be a good fit for you. Alternatively, you may want to apply to one of the following related jobs:

Site Reliability Engineer

Berkeley, CA 94720

Employment Type: Contract
Category: Software Engineer, Application
Job Number: 10574
Pay Rate: $70/hr - $80/hr

Job Description

Qualifications:
  • BS in Computer Science, IT, Networking, or a similar field
  • A minimum of 5 years of related experience
  • Experience with network security: configuring and maintaining ACLs, and knowledge of firewalls
  • Understanding of networks and network protocols
  • Strong hands-on knowledge of the Linux shell and of working in a command-line (e.g., SSH) environment
  • Proficiency in C, C++, Perl, Java, Python, or another scripting language, with knowledge of standard software development practices
  • Knowledge of and ability to work on large data communications networks and IT infrastructure supporting highly available systems and applications
  • Working knowledge of Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization is helpful
  • Strong communication skills and ability to work effectively across multiple technical teams
  • Experience working in a 24/7 onsite team managing large data centers or other large installations
  • A certification in system administration (platforms or software), or other advanced education in Computing Science

Responsibilities:
  • Review and respond to alerts from computer systems, storage, network, and other data center/facility related systems
  • Create solutions that improve processes, prevent issues from recurring, and automate the response to all routine service conditions
  • Identify issues and propose solutions that improve monitoring or provide better automation for monitoring and triage
  • Respond to alerts from OMNI to ensure that the system continues to collect data 24x7 to provide real time information for diagnoses
  • Develop and maintain tools within the monitoring pipeline in collaboration with the Operations Team
  • Create new software programs that deliver alerts and notifications from the HPC system APIs into the monitoring pipeline
  • Create new software configurations and solve technical issues so that programs can scale to denser data and deliver reliably at scale

 
