Research Scientist Intern, Real-Time Multimodal AI (PhD)

Meta

Description

Reality Labs is building the future of connection through world-class AR/VR hardware and software. The XR Tech AIX (AI Experiences) team is developing cutting-edge real-time AI systems that power next-generation communication experiences. We are creating intelligent agents that seamlessly interface with fine-tuned foundation models to enable rich, real-time interactions in video calling and telepresence scenarios.

We are seeking an exceptional Research Scientist Intern to join our team and contribute to the development of real-time multimodal AI systems. This role focuses on fine-tuning and optimizing large foundation models, particularly vision-language models, for real-time agent-based applications. You will work at the intersection of multimodal learning, real-time systems, and agentic AI.

Our internships are twelve (12) to twenty-four (24) weeks long with a flexible summer start date.

Responsibilities

- Research and develop novel approaches for fine-tuning large multimodal foundation models (vision-language, audio-visual) for real-time applications
- Design and implement efficient inference pipelines for deploying fine-tuned models in real-time communication scenarios
- Explore agentic architectures that leverage fine-tuned models as tools within larger AI systems
- Collaborate with cross-functional teams to integrate models into prototype experiences
- Document and present research progress with the goal of publishing findings at top-tier ML/CV conferences
- Contribute to building working prototypes that demonstrate the capabilities of fine-tuned multimodal models

Qualifications

- Currently has, or is in the process of obtaining, a PhD degree in Computer Science, Machine Learning, Electrical Engineering, or a related field
- 2+ years of research experience in one or more of the following areas: multimodal learning, vision-language models, large language models, or foundation model fine-tuning
- Hands-on experience fine-tuning large foundation models (e.g., LLaVA, InternVL, Qwen-VL, LLaMA, or similar)
- Strong programming skills in Python
- Experience with deep learning frameworks such as PyTorch
- Excellent communication skills and ability to work independently
- Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment
- Proven track record of achieving significant results as demonstrated by first-authored publications at leading conferences such as NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ICASSP, Interspeech, ACL, EMNLP, or similar
- Experience with speech-to-speech LLMs or audio-visual foundation models
- Familiarity with real-time communication systems (e.g., LiveKit, WebRTC) or low-latency inference optimization
- Experience with cloud infrastructure (AWS) and containerization (Docker)
- Experience with parameter-efficient fine-tuning techniques (LoRA, QLoRA, adapters, etc.)
- Experience with agentic AI systems, tool-use, or function-calling in LLMs
- Demonstrated software engineering experience via internships, work experience, or contributions to open source repositories (e.g., GitHub)
- Intent to return to degree program after completion of the internship

Compensation: $7,650/month to $12,134/month + benefits