Computer Vision/Machine Learning Engineer (Text-Alignment Understanding)
Summary
If you are passionate about advancing multi-modal understanding, building models that bridge text and visual perception, and shaping the next generation of intelligent on-device experiences, Apple is the right place for you. We are looking for engineers who combine technical depth, curiosity, and creativity to push the boundaries of what machine learning can do on-device.
Description
The computer vision algorithm engineer will join a dynamic team within the Video Engineering org, which develops on-device computer vision and machine perception technologies across Apple’s products. We balance research and product to deliver the highest-quality, state-of-the-art experiences, innovating across the full stack and partnering with cross-functional teams to bring our vision to life and into customers’ hands. You will collaborate closely with research scientists, framework engineers, and cross-functional product teams to deliver models that run efficiently across Apple’s ecosystem, from iPhone to Vision Pro.
Keywords: Concept Prompt; Text-Alignment; Open-Set Segmentation; Multi-Modal Understanding; Model Consolidation; On-Device Foundation Models
Responsibilities
- Design and develop models for text-aligned visual perception and concept-driven segmentation.
- Explore model consolidation and shared representation learning across multiple perception tasks.
- Investigate efficient adaptation of foundation models to on-device constraints.
- Prototype, benchmark, and optimize algorithms for runtime, power, and accuracy.
- Collaborate with research and product teams to integrate these technologies into Apple’s camera and video processing pipelines.
Minimum Qualifications
- M.S. or Ph.D. in Computer Science, Electrical Engineering, or related fields (e.g., mathematics, physics, computer engineering) with a focus on computer vision or machine learning.
- Solid experience in one or more of the following areas: open-vocabulary segmentation, text-image alignment, prompt-based vision models, or video understanding.
- Proficiency in deep learning frameworks (PyTorch, JAX) and programming languages (Python, C++).
- Demonstrated ability to prototype, evaluate, and deploy models in real-world systems.
- Strong written and verbal communication skills; ability to present ideas and results to diverse audiences.
Preferred Qualifications
- Publications in top-tier conferences (e.g., CVPR, ICCV, ECCV, NeurIPS, ICLR).
- Experience with large-scale pretraining or multi-modal foundation models.
- Understanding of generative models, visual-language alignment, or open-set recognition.
- Familiarity with optimizing models for efficient inference on mobile or embedded platforms.
- Passion for building scalable, high-quality systems and working in cross-functional teams.
Apple is an equal opportunity employer that is committed to inclusion and diversity, and we treat all applicants fairly and equally. Apple is also committed to working with and providing reasonable accommodation to applicants with physical and mental disabilities.