Software Engineer II, MIG AI/ML Infrastructure

Google (published 1 day ago)

Google's software engineers develop the next-generation technologies that change how billions of users connect, explore, and interact with information and one another. Our products need to handle information at massive scale, and extend well beyond web search. We're looking for engineers who bring fresh ideas from all areas, including information retrieval, distributed computing, large-scale system design, networking and data storage, security, artificial intelligence, natural language processing, UI design and mobile; the list goes on and is growing every day. As a software engineer, you will work on a specific project critical to Google’s needs with opportunities to switch teams and projects as you and our fast-paced business grow and evolve. We need our engineers to be versatile, display leadership qualities and be eager to take on new problems across the full stack as we continue to push technology forward.

The Google Compute Engine (GCE) Managed Infrastructure organization contributes to the GCE control plane layer that manages the lifecycle of virtual machines (VMs) in Google Cloud Platform (GCP). We translate the API calls that our users and clients make to GCE into operations orchestrated across multiple cloud services, provisioning the machines and ensuring that their behavior follows demand, schedules, and constraints. We also build Managed Instance Groups (MIGs) for running clusters of VMs in a unified way, with capabilities such as autohealing, autoscaling, and auto-updating.

The MIG AI/ML Infrastructure team focuses on Managed Instance Group support for new accelerator (TPU and GPU) VM families and features, to meet the massive computational needs of AI/ML workloads.

Google Cloud accelerates every organization’s ability to digitally transform its business and industry. We deliver enterprise-grade solutions that leverage Google’s cutting-edge technology, and tools that help developers build more sustainably. Customers in more than 200 countries and territories turn to Google Cloud as their trusted partner to enable growth and solve their most critical business problems.

Responsibilities

  • Write product or system development code. Participate in or lead design reviews with peers and stakeholders to decide among available technologies. Review code written by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).
  • Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback.
  • Triage product or system issues, and debug, track, and resolve them by analyzing their sources and their impact on hardware, network, or service operations and quality.
  • Implement and maintain reliable, large-scale computer systems. Participate in the analysis and design of solutions for orchestrating and managing the lifecycle of millions of virtual machines.
  • Work on complex and high-visibility projects supporting Google's TPU and GPU fleet, and collaborate with many stakeholders from across the GCP stack.

Minimum qualifications:

  • Bachelor’s degree or equivalent practical experience.
  • 1 year of experience with software development in one or more programming languages (e.g., Python, C, C++, Java, JavaScript).

Preferred qualifications:

  • Master's degree in Computer Science or a related technical field.
  • 1 year of experience with data structures and algorithms.
  • Experience with system architecture, distributed systems, and object-oriented analysis and design.
  • Experience building highly reliable systems and delivering cloud-based products.