Introducing Oryx: An MBZUAI library for Large Vision-Language Models
Updated: May 23
Today, we are publicly releasing a set of projects and demos for a family of large vision-language models developed at MBZUAI. Motivated by the reasoning, understanding, and conversational capabilities of large language models (LLMs), our broad goal is to advance LLMs for multi-modal and domain-specific dialogue. The current release showcases three projects on this theme within the Oryx library.
The first project, ClimateGPT [Eng & العربية], is a specialized LLM for conversations about climate change and sustainability in both English and Arabic. The goal is to use the capabilities of LLMs to build an educational and informational resource on climate-related topics that serves as an interactive, engaging, and responsible tool. ClimateGPT is also intended to help researchers and policy-makers search, extract, and gain insights from climate-change-related conversations. Our codebase, demo and a short video with highlighted results are available.
ClimateGPT is built on Vicuna and introduces a vector embedding and datastore framework that can be queried during model inference for accurate and precise information retrieval. We are also releasing a generated set of over 500k interactive conversational-style samples based on the ClimaBench and CCMRC climate change datasets. Fine-tuning on this augmented conversational data greatly improves the performance of the LLM. Additionally, this project marks the first release of a large Arabic dataset (>500k samples) dedicated to climate change and sustainability conversations.
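To make the retrieval idea concrete, here is a minimal sketch of an embedding datastore queried at inference time. It is not the ClimateGPT implementation: the `embed` function is a toy bag-of-words stand-in for a learned text encoder, and `VectorStore` and `build_prompt` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system uses a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory datastore of (embedding, passage) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text):
        self._items.append((embed(text), text))

    def search(self, query, k=1):
        """Return the k passages most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question, store, k=1):
    """Prepend retrieved passages to the question before it is sent to the LLM."""
    context = "\n".join(store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


store = VectorStore()
store.add("Sea levels are rising because of melting ice sheets.")
store.add("Solar panels convert sunlight into electricity.")
store.add("Deforestation reduces carbon absorption.")
print(build_prompt("Why are sea levels rising?", store))
```

The design point is that the retrieved passage, rather than the model's parameters alone, carries the facts into the prompt, so the datastore can be updated without retraining.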
The second project, Video-ChatGPT, is a conversational engine for videos. The model couples a dedicated video encoder with a large language model (LLM), aligned through a simple linear projection, enabling video understanding and conversation about videos. We provide the first set of 86k high-quality instruction samples collected specifically for videos. In addition, we present a human-assisted, semi-automatic annotation framework for obtaining high-quality video descriptions, which help in developing conversational instructions for videos. The project also develops a quantitative evaluation framework for benchmarking video conversation. Our codebase, demo and a short video with highlighted results are available.
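The alignment step can be pictured as mapping video features into the LLM's embedding space with a single linear layer. The sketch below is illustrative only: the pooling scheme (averaging over time and over patches), the function names, and all dimensions are assumptions made for the example, not details stated in this post.

```python
import random


def mean_rows(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pool_video(frames):
    """frames[t][p] is the feature vector of patch p in frame t.

    Assumed pooling: one token per patch (averaged over time) followed by
    one token per frame (averaged over patches).
    """
    num_patches = len(frames[0])
    temporal = [mean_rows([frame[p] for frame in frames]) for p in range(num_patches)]
    spatial = [mean_rows(frame) for frame in frames]
    return temporal + spatial


def linear_project(tokens, weight, bias):
    """Apply y = W x + b to each token; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


random.seed(0)
T, N, D, LLM_DIM = 4, 3, 8, 16  # frames, patches, encoder dim, LLM dim (all made up)
frames = [[[random.random() for _ in range(D)] for _ in range(N)] for _ in range(T)]
weight = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

video_tokens = linear_project(pool_video(frames), weight, bias)
print(len(video_tokens), len(video_tokens[0]))  # N + T tokens, each of size LLM_DIM
```

In a setup like this, the projected tokens would be placed alongside the text token embeddings so the LLM can attend to them as if they were words.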
The third project, XrayGPT, aims to automate the analysis of chest radiographs from the given X-ray images. Here, we align a frozen medical vision encoder (MedCLIP) with a fine-tuned LLM (Vicuna) using a simple linear transformation. The LLM is fine-tuned on medical data (100k real conversations between patients and doctors) and then on ~30k radiology conversations to acquire domain-specific and relevant features. We generate interactive and clean summaries (~217k) from the free-text radiology reports of two datasets (MIMIC-CXR and OpenI), which help improve performance through fine-tuning on high-quality domain-specific data. Our codebase, demo and a short video with highlighted results are available.
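The phrase "frozen encoder plus linear transformation" means that only the small projection layer is trained while the encoder weights stay fixed. The toy below illustrates that idea with plain gradient descent on a mean-squared-error objective. It is a sketch under stated assumptions, not the XrayGPT training code: the encoder is a fixed matrix standing in for MedCLIP, the targets are made-up numbers standing in for LLM embeddings, and `train_projection` is a hypothetical helper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def train_projection(encoder, images, targets, steps=200, lr=0.1):
    """Fit only the projection W so that W @ encoder(image) approximates target."""
    # Frozen encoder: features are computed once and the encoder is never updated.
    feats = [matvec(encoder, img) for img in images]
    out_dim, in_dim = len(targets[0]), len(feats[0])
    w = [[0.0] * in_dim for _ in range(out_dim)]
    n = len(feats)
    losses = []
    for _ in range(steps):
        grad = [[0.0] * in_dim for _ in range(out_dim)]
        loss = 0.0
        for f, t in zip(feats, targets):
            err = [p - y for p, y in zip(matvec(w, f), t)]
            loss += sum(e * e for e in err)
            for i, e in enumerate(err):
                for j, x in enumerate(f):
                    grad[i][j] += 2.0 * e * x
        losses.append(loss / n)
        # Gradient step on the projection weights only.
        for i in range(out_dim):
            for j in range(in_dim):
                w[i][j] -= lr * grad[i][j] / n
    return w, losses


# Made-up data: a 2x3 "encoder", three 3-pixel "images", scalar targets.
encoder = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
images = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
targets = [[2.0], [-1.0], [1.0]]

w, losses = train_projection(encoder, images, targets)
print(losses[0], losses[-1])
```

Because the encoder is frozen, its features can be cached once, which keeps the trainable parameter count small compared with fine-tuning the whole vision model.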