diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 3bf4882..c5364ed 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contribution Guide
-Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducability, and beyond its publication there are limited plans for future development of the repository.
+Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository.
While we welcome new pull requests and issues, please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged.
diff --git a/EVAL.md b/EVAL.md
new file mode 100644
index 0000000..3cf18c4
--- /dev/null
+++ b/EVAL.md
@@ -0,0 +1,17 @@
+# Evaluation
+All evaluation scripts provide usage details/examples in their first several lines of code.
+
+## Ferret-Bench
+Please follow [gpt4_eval_script.sh](ferret/eval/gpt4_eval_script.sh) to run inference on Ferret-Bench data and use GPT-4 to rate the responses. Note that the `openai` package must be installed and your OpenAI API key must be provided.
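+
+A minimal setup sketch (the environment-variable name below is an assumption for illustration; check `gpt4_eval_script.sh` for the exact variable and arguments it expects):
+```Shell
+pip install openai
+# Hypothetical variable name; export whatever key variable the script actually reads.
+export OPENAI_API_KEY="your-key-here"
+bash ferret/eval/gpt4_eval_script.sh
+```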
+
+## LVIS-Referring Object Classification
+Run `ferret/eval/model_lvis.py` following the usage in the file and then run `ferret/eval/eval_lvis.py`.
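+
+For reference, the scoring step mirrors the usage note at the top of `ferret/eval/eval_lvis.py`:
+```Shell
+python ferret/eval/eval_lvis.py --pred_file=[your generated result by running ferret/eval/model_lvis.py]
+```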
+
+## RefCOCO/RefCOCO+/RefCOCOg
+Run `ferret/eval/model_refcoco.py` following the usage in the file and then run `ferret/eval/eval_refexp.py`.
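+
+The scoring step follows the usage note at the top of `ferret/eval/eval_refexp.py`; the result paths below are the illustrative ones from that note:
+```Shell
+python ferret/eval/eval_refexp.py \
+    --prediction_file final_result/ferret_13b_checkpoint-final/refexp_result/finetune_refcocog_test \
+    --annotation_file data/annotations/finetune_refcocog_test.json
+```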
+
+## Flickr
+Run `ferret/eval/model_flickr.py` following the usage in the file and then run `ferret/eval/eval_flickr_entities.py`.
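+
+The scoring step follows the usage note at the top of `ferret/eval/eval_flickr_entities.py` (paths are illustrative):
+```Shell
+python ferret/eval/eval_flickr_entities.py \
+    --prediction_file result_checkpoint-final/flickr_result/final_flickr_mergedGT_test \
+    --annotation_file data/annotations/final_flickr_mergedGT_test.json \
+    --flickr_entities_path data/flickr30k
+```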
+
+## POPE
+Run `ferret/eval/model_pope.py` following the usage in the file and then run `ferret/eval/eval_pope.py`.
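+
+The scoring step follows the usage note at the top of `ferret/eval/eval_pope.py`, shown here for the adversarial split (paths are illustrative; the popular and random splits work the same way):
+```Shell
+python ferret/eval/eval_pope.py \
+    --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_adversarial \
+    --annotation_file data/pope/coco_pope_adversarial.json
+```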
\ No newline at end of file
diff --git a/README.md b/README.md
index aedd76d..b37f228 100644
--- a/README.md
+++ b/README.md
@@ -10,11 +10,12 @@ Brief description of the project.
# Ferret: Refer and Ground Anything Anywhere at Any Granularity
-An end-to-end MLLM that can accept any-form referring and ground anything in response.*
+*An End-to-End MLLM that Accepts Any-Form Referring and Grounds Anything in Response.* [[Paper](https://arxiv.org/abs/2310.07704)]
[Haoxuan You*](https://hxyou.github.io/), [Haotian Zhang*](https://haotian-zhang.github.io/), [Zhe Gan](https://zhegan27.github.io/), [Xianzhi Du](https://scholar.google.com/citations?user=l1hP40AAAAAJ&hl=en), [Bowen Zhang](https://zbwglory.github.io/), [Zirui Wang](https://www.cs.cmu.edu/~ziruiw/), [Liangliang Cao](http://llcao.net/), [Shih-Fu Chang](https://www.ee.columbia.edu/~sfchang/), [Yinfei Yang](https://sites.google.com/site/yinfeiyang/)
[*: equal contribution]
+
## Overview
@@ -25,4 +26,119 @@ An end-to-end MLLM that can accept any-form referring and ground anything in res
Key Contributions:
* Ferret Model - **Hybrid Region Representation + Spatial-aware Visual Sampler** enable fine-grained and open-vocabulary referring and grounding in MLLM.
* GRIT Dataset (~1.1M) - A **Large-scale, Hierarchical, Robust** ground-and-refer instruction tuning dataset.
-* Ferret-Bench - A multimodal evaluation benchmark that jointly requires **Referring/Grounding, Semantics, Knowledge, and Reasoning**.
\ No newline at end of file
+* Ferret-Bench - A multimodal evaluation benchmark that jointly requires **Referring/Grounding, Semantics, Knowledge, and Reasoning**.
+
+
+## Release
+- [10/30] 🔥 We released the code of the **FERRET** model.
+
+
+
+**Usage and License Notices**: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
+
+## Contents
+- [Install](#install)
+- [Train](#train)
+- [Evaluation](#evaluation)
+- [Demo](#demo)
+
+## Install
+
+1. Clone this repository and navigate to the FERRET folder
+```bash
+git clone https://github.com/apple/ml-ferret
+cd ml-ferret
+```
+
+2. Install the package
+```Shell
+conda create -n ferret python=3.10 -y
+conda activate ferret
+pip install --upgrade pip # enable PEP 660 support
+pip install -e .
+pip install pycocotools
+pip install protobuf==3.20.0
+```
+
+3. Install additional packages for training
+```Shell
+pip install ninja
+pip install flash-attn --no-build-isolation
+```
+
+
+## Train
+
+FERRET is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
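+
+For example, a sketch of the adjustment for 4 GPUs (an assumed setup; the released scripts use 8 GPUs, `per_device_train_batch_size=16`, and `gradient_accumulation_steps=1`, i.e. a global batch size of 128):
+```Shell
+# Keep the global batch size at 16 x 2 x 4 = 128.
+torchrun --nnodes=1 --nproc_per_node=4 --master_port=25001 \
+    ferret/train/train_mem.py \
+    --per_device_train_batch_size 16 \
+    --gradient_accumulation_steps 2 \
+    ...   # all remaining flags as in experiments/ferret_13b_train.sh
+```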
+
+### Hyperparameters
+We use a set of hyperparameters similar to LLaVA (Vicuna) for fine-tuning.
+
+| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| FERRET-7B | 128 | 2e-5 | 3 | 2048 | 0 |
+| FERRET-13B | 128 | 2e-5 | 3 | 2048 | 0 |
+
+### Prepare Vicuna checkpoint and LLaVA's projector
+
+Before you start, prepare our base model Vicuna, which is an instruction-tuned chatbot. Please download its weights following the instructions [here](https://github.com/lm-sys/FastChat#model-weights). Vicuna v1.3 is used in FERRET.
+
+Then download LLaVA's first-stage pre-trained projector weight ([7B](https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-7b-v1.3), [13B](https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-13b-v1.3)).
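+
+The training scripts expect both of these under `./model/` (see their `--model_name_or_path` and `--pretrain_mm_mlp_adapter` flags). A minimal layout sketch for the 13B setup, assuming `git-lfs` is installed:
+```Shell
+mkdir -p model
+# Place the prepared Vicuna v1.3 weights at ./model/vicuna-13b-v1-3
+# Fetch the LLaVA first-stage projector; the local directory name must match
+# the MODEL_VERSION used in the training script.
+git clone https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-13b-v1.3 \
+    model/llava-336px-pretrain-vicuna-13b-v1-3
+```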
+
+
+### FERRET Training
+
+Training scripts are provided for both model sizes ([7B](experiments/ferret_7b_train.sh), [13B](experiments/ferret_13b_train.sh)).
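+
+For example, once the datasets and base weights are in place as described above, a run can be launched from the repository root:
+```Shell
+bash experiments/ferret_7b_train.sh    # or: bash experiments/ferret_13b_train.sh
+```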
+
+
+## Evaluation
+
+Please see this [doc](EVAL.md) for the details.
+
+## Demo
+
+To run our demo, you need to train FERRET and use the checkpoints locally. The demo uses a Gradio web UI. Please run the following commands one by one.
+
+#### Launch a controller
+```Shell
+python -m ferret.serve.controller --host 0.0.0.0 --port 10000
+```
+
+#### Launch a Gradio web server
+```Shell
+python -m ferret.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --add_region_feature
+```
+
+#### Launch a model worker
+
+This is the worker that loads the checkpoint and performs inference on the GPU. Each worker is responsible for a single model specified in `--model-path`.
+
+```Shell
+CUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/FERRET-13B-v0 --add_region_feature
+```
+Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
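+
+To serve more than one checkpoint, a sketch (the 7B path below is an assumption): launch an additional worker with a different `--port`/`--worker` address and its own `--model-path`; the controller will then list both models.
+```Shell
+CUDA_VISIBLE_DEVICES=1 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40001 --worker http://localhost:40001 --model-path ./checkpoints/FERRET-7B-v0 --add_region_feature
+```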
+
+
+
+*Example of Ferret Interactive Demo.*
+
+
+## Citation
+
+If you find Ferret useful, please cite using this BibTeX:
+
+```bibtex
+@article{you2023ferret,
+ title={Ferret: Refer and Ground Anything Anywhere at Any Granularity},
+ author={You, Haoxuan and Zhang, Haotian and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zirui and Cao, Liangliang and Chang, Shih-Fu and Yang, Yinfei},
+ journal={arXiv preprint arXiv:2310.07704},
+ year={2023}
+}
+```
+
+## Acknowledgement
+
+- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
+- [Vicuna](https://github.com/lm-sys/FastChat): the LLM codebase.
diff --git a/experiments/ferret_13b_train.sh b/experiments/ferret_13b_train.sh
new file mode 100644
index 0000000..2dbe77b
--- /dev/null
+++ b/experiments/ferret_13b_train.sh
@@ -0,0 +1,100 @@
+#!/usr/bin/env bash
+set -xe
+
+mkdir -p checkpoints
+
+echo "Start Fine-Tuning"
+# =================== Training ======================
+data_path=(
+ 'dataset/git_instruction.json'
+ 'dataset/vg_objects.json'
+ 'dataset/vg_relations.json'
+ 'dataset/vg_regions.json'
+ 'dataset/grounded_llava_boxes_detail.json'
+ 'dataset/grounded_llava_boxes_complex_reasoning.json'
+ 'dataset/grounded_llava_boxes_conversation.json'
+ 'dataset/refexp_all.json'
+ 'dataset/flickr.json'
+ 'dataset/objects365.json'
+ )
+image_folder=(
+ 'dataset/coco2014/train2014'
+ 'dataset/vg/images'
+ 'dataset/vg/images'
+ 'dataset/vg/images'
+ 'dataset/coco2014/train2014'
+ 'dataset/coco2014/train2014'
+ 'dataset/coco2014/train2014'
+ 'data/refcoco/train2014'
+ 'data/flickr30k/flickr30k_images_split/train'
+ 'data/objects365_v1/train'
+ )
+data_multiple=(
+ 3
+ 1
+ 0.2
+ 0.2
+ 1
+ 1
+ 1
+ 1
+ 1
+ 1
+ )
+
+# convert array to string
+data_path="${data_path[@]}"
+image_folder="${image_folder[@]}"
+data_multiple="${data_multiple[@]}"
+
+################## VICUNA ##################
+PROMPT_VERSION=v1
+MODEL_VERSION="vicuna-13b-v1-3"
+################## VICUNA ##################
+
+torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
+ ferret/train/train_mem.py \
+ --lora_enable False \
+ --model_name_or_path ./model/$MODEL_VERSION \
+ --version $PROMPT_VERSION \
+ --data_path $data_path \
+ --image_folder $image_folder \
+ --data_multiple $data_multiple \
+ --vision_tower openai/clip-vit-large-patch14-336 \
+ --pretrain_mm_mlp_adapter ./model/llava-336px-pretrain-$MODEL_VERSION/mm_projector.bin \
+ --mm_vision_select_layer -2 \
+ --mm_use_im_start_end False \
+ --mm_use_im_patch_token False \
+ --bf16 True \
+ --output_dir ./checkpoints/ferret_13b \
+ --num_train_epochs 3 \
+ --per_device_train_batch_size 16 \
+ --per_device_eval_batch_size 4 \
+ --gradient_accumulation_steps 1 \
+ --evaluation_strategy "no" \
+ --save_strategy "steps" \
+ --save_steps 1500 \
+ --save_total_limit 3 \
+ --learning_rate 2e-5 \
+ --weight_decay 0. \
+ --warmup_ratio 0.03 \
+ --lr_scheduler_type "cosine" \
+ --logging_steps 1 \
+ --tf32 True \
+ --fsdp "full_shard auto_wrap" \
+ --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
+ --model_max_length 2048 \
+ --gradient_checkpointing True \
+ --dataloader_num_workers 8 \
+ --lazy_preprocess True \
+ --report_to tensorboard \
+ --point_input_sample 'segment_mask|center' \
+ --add_region_feature True \
+ --region_geo_sampler True \
+ --sampler_pooler_mode 'max' \
+ --refer_previous_point False \
+ --resized_image_h 336 \
+ --resized_image_w 336 \
+ --save_vision_tower True
+
diff --git a/experiments/ferret_7b_train.sh b/experiments/ferret_7b_train.sh
new file mode 100644
index 0000000..9b6dc61
--- /dev/null
+++ b/experiments/ferret_7b_train.sh
@@ -0,0 +1,99 @@
+#!/usr/bin/env bash
+set -xe
+
+mkdir -p checkpoints
+
+# =================== Training ======================
+data_path=(
+ 'dataset/git_instruction.json'
+ 'dataset/vg_objects.json'
+ 'dataset/vg_relations.json'
+ 'dataset/vg_regions.json'
+ 'dataset/grounded_llava_boxes_detail.json'
+ 'dataset/grounded_llava_boxes_complex_reasoning.json'
+ 'dataset/grounded_llava_boxes_conversation.json'
+ 'dataset/refexp_all.json'
+ 'dataset/flickr.json'
+ 'dataset/objects365.json'
+ )
+image_folder=(
+ 'dataset/coco2014/train2014'
+ 'dataset/vg/images'
+ 'dataset/vg/images'
+ 'dataset/vg/images'
+ 'dataset/coco2014/train2014'
+ 'dataset/coco2014/train2014'
+ 'dataset/coco2014/train2014'
+ 'data/refcoco/train2014'
+ 'data/flickr30k/flickr30k_images_split/train'
+ 'data/objects365_v1/train'
+ )
+data_multiple=(
+ 3
+ 1
+ 0.2
+ 0.2
+ 1
+ 1
+ 1
+ 1
+ 1
+ 1
+ )
+
+# convert array to string
+data_path="${data_path[@]}"
+image_folder="${image_folder[@]}"
+data_multiple="${data_multiple[@]}"
+
+################## VICUNA ##################
+PROMPT_VERSION=v1
+MODEL_VERSION="vicuna-7b-v1-3"
+################## VICUNA ##################
+
+torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
+ ferret/train/train_mem.py \
+ --lora_enable False \
+ --model_name_or_path ./model/$MODEL_VERSION \
+ --version $PROMPT_VERSION \
+ --data_path $data_path \
+ --image_folder $image_folder \
+ --data_multiple $data_multiple \
+ --vision_tower openai/clip-vit-large-patch14-336 \
+ --pretrain_mm_mlp_adapter ./model/llava-336px-pretrain-$MODEL_VERSION/mm_projector.bin \
+ --mm_vision_select_layer -2 \
+ --mm_use_im_start_end False \
+ --mm_use_im_patch_token False \
+ --bf16 True \
+ --output_dir ./checkpoints/ferret_7b \
+ --num_train_epochs 3 \
+ --per_device_train_batch_size 16 \
+ --per_device_eval_batch_size 4 \
+ --gradient_accumulation_steps 1 \
+ --evaluation_strategy "no" \
+ --save_strategy "steps" \
+ --save_steps 1500 \
+ --save_total_limit 3 \
+ --learning_rate 2e-5 \
+ --weight_decay 0. \
+ --warmup_ratio 0.03 \
+ --lr_scheduler_type "cosine" \
+ --logging_steps 1 \
+ --tf32 True \
+ --fsdp "full_shard auto_wrap" \
+ --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
+ --model_max_length 2048 \
+ --gradient_checkpointing True \
+ --dataloader_num_workers 8 \
+ --lazy_preprocess True \
+ --report_to tensorboard \
+ --point_input_sample 'segment_mask|center' \
+ --add_region_feature True \
+ --region_geo_sampler True \
+ --sampler_pooler_mode 'max' \
+ --refer_previous_point False \
+ --resized_image_h 336 \
+ --resized_image_w 336 \
+ --save_vision_tower True
+
diff --git a/ferret/__init__.py b/ferret/__init__.py
new file mode 100644
index 0000000..6ac4c54
--- /dev/null
+++ b/ferret/__init__.py
@@ -0,0 +1 @@
+from .model import FERRETLlamaForCausalLM
diff --git a/ferret/constants.py b/ferret/constants.py
new file mode 100644
index 0000000..be8cf02
--- /dev/null
+++ b/ferret/constants.py
@@ -0,0 +1,12 @@
+CONTROLLER_HEART_BEAT_EXPIRATION = 30
+WORKER_HEART_BEAT_INTERVAL = 15
+
+LOGDIR = "."
+
+# Model Constants
+IGNORE_INDEX = -100
+IMAGE_TOKEN_INDEX = -200
+DEFAULT_IMAGE_TOKEN = "<image>"
+DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
+DEFAULT_IM_START_TOKEN = "<im_start>"
+DEFAULT_IM_END_TOKEN = "<im_end>"
diff --git a/ferret/conversation.py b/ferret/conversation.py
new file mode 100644
index 0000000..23f90a8
--- /dev/null
+++ b/ferret/conversation.py
@@ -0,0 +1,275 @@
+import dataclasses
+from enum import auto, Enum
+from typing import List, Tuple
+
+VOCAB_IMAGE_W = 1000 # 224
+VOCAB_IMAGE_H = 1000 # 224
+
+class SeparatorStyle(Enum):
+ """Different separator style."""
+ SINGLE = auto()
+ TWO = auto()
+ MPT = auto()
+ PLAIN = auto()
+ LLAMA_2 = auto()
+
+
+@dataclasses.dataclass
+class Conversation:
+ """A class that keeps all conversation history."""
+ system: str
+ roles: List[str]
+ messages: List[List[str]]
+ offset: int
+ sep_style: SeparatorStyle = SeparatorStyle.SINGLE
+ sep: str = "###"
+ sep2: str = None
+ version: str = "Unknown"
+
+ skip_next: bool = False
+ first_round: bool = True
+
+
+ def get_prompt(self):
+ messages = self.messages
+ if len(messages) > 0 and type(messages[0][1]) is tuple:
+ messages = self.messages.copy()
+ init_role, init_msg = messages[0].copy()
+            init_msg = init_msg[0].replace("<image>", "").strip()
+ if 'mmtag' in self.version:
+ messages[0] = (init_role, init_msg)
+                messages.insert(0, (self.roles[0], "<Image><image></Image>"))
+ messages.insert(1, (self.roles[1], "Received."))
+ else:
+                messages[0] = (init_role, "<image>\n" + init_msg)
+
+ if self.sep_style == SeparatorStyle.SINGLE:
+ ret = self.system + self.sep
+ for role, message in messages:
+ if message:
+ if type(message) is tuple:
+ message, _, _ = message
+ ret += role + ": " + message + self.sep
+ else:
+ ret += role + ":"
+ elif self.sep_style == SeparatorStyle.TWO:
+ seps = [self.sep, self.sep2]
+ ret = self.system + seps[0]
+ for i, (role, message) in enumerate(messages):
+ if message:
+ if type(message) is tuple:
+ message, _, _ = message
+ ret += role + ": " + message + seps[i % 2]
+ else:
+ ret += role + ":"
+ elif self.sep_style == SeparatorStyle.MPT:
+ ret = self.system + self.sep
+ for role, message in messages:
+ if message:
+ if type(message) is tuple:
+ message, _, _ = message
+ ret += role + message + self.sep
+ else:
+ ret += role
+ elif self.sep_style == SeparatorStyle.LLAMA_2:
+            wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n"
+ wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
+ ret = ""
+
+ for i, (role, message) in enumerate(messages):
+ if i == 0:
+ assert message, "first message should not be none"
+ assert role == self.roles[0], "first message should come from user"
+ if message:
+ if type(message) is tuple:
+ message, _, _ = message
+ if i == 0: message = wrap_sys(self.system) + message
+ if i % 2 == 0:
+ message = wrap_inst(message)
+ ret += self.sep + message
+ else:
+ ret += " " + message + " " + self.sep2
+ else:
+ ret += ""
+ ret = ret.lstrip(self.sep)
+ elif self.sep_style == SeparatorStyle.PLAIN:
+ seps = [self.sep, self.sep2]
+ ret = self.system
+ for i, (role, message) in enumerate(messages):
+ if message:
+ if type(message) is tuple:
+ message, _, _ = message
+ ret += message + seps[i % 2]
+ else:
+ ret += ""
+ else:
+ raise ValueError(f"Invalid style: {self.sep_style}")
+
+ return ret
+
+ def append_message(self, role, message):
+ self.messages.append([role, message])
+
+ def get_images(self, return_pil=False):
+ images = []
+ for i, (role, msg) in enumerate(self.messages[self.offset:]):
+ if i % 2 == 0:
+ if type(msg) is tuple:
+ import base64
+ from io import BytesIO
+ from PIL import Image
+ msg, image, image_process_mode = msg
+ if image_process_mode == "Pad":
+ def expand2square(pil_img, background_color=(122, 116, 104)):
+ width, height = pil_img.size
+ if width == height:
+ return pil_img
+ elif width > height:
+ result = Image.new(pil_img.mode, (width, width), background_color)
+ result.paste(pil_img, (0, (width - height) // 2))
+ return result
+ else:
+ result = Image.new(pil_img.mode, (height, height), background_color)
+ result.paste(pil_img, ((height - width) // 2, 0))
+ return result
+ image = expand2square(image)
+ elif image_process_mode == "Crop":
+ pass
+ elif image_process_mode == "Raw+Processor":
+ pass
+ elif image_process_mode == "Resize":
+ image = image.resize((336, 336))
+ else:
+ raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
+
+ if image_process_mode != "Raw+Processor":
+ max_hw, min_hw = max(image.size), min(image.size)
+ aspect_ratio = max_hw / min_hw
+ max_len, min_len = 800, 400
+ shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
+ longest_edge = int(shortest_edge * aspect_ratio)
+ W, H = image.size
+ if H > W:
+ H, W = longest_edge, shortest_edge
+ else:
+ H, W = shortest_edge, longest_edge
+ image = image.resize((W, H))
+ print('Input Image Size:{}'.format(image.size))
+
+ if return_pil:
+ images.append(image)
+ else:
+ buffered = BytesIO()
+ image.save(buffered, format="PNG")
+ img_b64_str = base64.b64encode(buffered.getvalue()).decode()
+ images.append(img_b64_str)
+ return images
+
+ def to_gradio_chatbot(self):
+ ret = []
+ for i, (role, msg) in enumerate(self.messages[self.offset:]):
+ if i % 2 == 0:
+ if type(msg) is tuple:
+ import base64
+ from io import BytesIO
+ msg, image, image_process_mode = msg
+ if image_process_mode != "Raw+Processor":
+ max_hw, min_hw = max(image.size), min(image.size)
+ aspect_ratio = max_hw / min_hw
+ max_len, min_len = 800, 400
+ shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
+ longest_edge = int(shortest_edge * aspect_ratio)
+ W, H = image.size
+ if H > W:
+ H, W = longest_edge, shortest_edge
+ else:
+ H, W = shortest_edge, longest_edge
+ image = image.resize((W, H))
+ buffered = BytesIO()
+ image.save(buffered, format="JPEG")
+ img_b64_str = base64.b64encode(buffered.getvalue()).decode()
+                    img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}" alt="user upload image" />'
+ ret.append([img_str, None])
+                    msg = msg.replace('<image>', '').strip()
+ if len(msg) > 0:
+ ret.append([msg, None])
+ else:
+ ret.append([msg, None])
+ else:
+ ret[-1][-1] = msg
+ return ret
+
+ def copy(self):
+ return Conversation(
+ system=self.system,
+ roles=self.roles,
+ messages=[[x, y] for x, y in self.messages],
+ offset=self.offset,
+ sep_style=self.sep_style,
+ sep=self.sep,
+ sep2=self.sep2,
+ version=self.version)
+
+ def dict(self):
+ if len(self.get_images()) > 0:
+ return {
+ "system": self.system,
+ "roles": self.roles,
+ "messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
+ "offset": self.offset,
+ "sep": self.sep,
+ "sep2": self.sep2,
+ }
+ return {
+ "system": self.system,
+ "roles": self.roles,
+ "messages": self.messages,
+ "offset": self.offset,
+ "sep": self.sep,
+ "sep2": self.sep2,
+ }
+
+
+
+ferret_conv_vicuna_v1_original_system = Conversation(
+ system="A chat between a curious human and an artificial intelligence assistant. "
+ "Assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. "
+ "In images, points are represented by coordinates [x, y]. The top-left corner is [0, 0]. The bottom-right corner is [width-1, height-1]. "
+ "Increasing x moves right across the image while increasing y moves down. "
+ "A bounding box is marked by [x1, y1, x2, y2] with the top-left and bottom-right points being [x1, y1] and [x2, y2] respectively. "
+ f"The image size is assumed to be ({VOCAB_IMAGE_W}, {VOCAB_IMAGE_H}), i.e., width={VOCAB_IMAGE_W}, height={VOCAB_IMAGE_H}. "
+ "Follow the instructions carefully. ",
+ roles=("USER", "ASSISTANT"),
+ version="v1",
+ messages=(),
+ offset=0,
+ sep_style=SeparatorStyle.TWO,
+ sep=" ",
+ sep2="",
+)
+
+ferret_conv_vicuna_v1 = Conversation(
+ system="A chat between a human and an AI that understands visuals. "
+ "In images, [x, y] denotes points: top-left [0, 0], bottom-right [width-1, height-1]. "
+ "Increasing x moves right; y moves down. "
+ f"Bounding box: [x1, y1, x2, y2]. Image size: {VOCAB_IMAGE_W}x{VOCAB_IMAGE_H}. "
+ "Follow instructions. ",
+ roles=("USER", "ASSISTANT"),
+ version="v1",
+ messages=(),
+ offset=0,
+ sep_style=SeparatorStyle.TWO,
+ sep=" ",
+ sep2="",
+)
+
+
+default_conversation = ferret_conv_vicuna_v1
+conv_templates = {
+ "v1": ferret_conv_vicuna_v1,
+ "ferret_v1": ferret_conv_vicuna_v1,
+}
+
+
+if __name__ == "__main__":
+ print(default_conversation.get_prompt())
diff --git a/ferret/eval/eval_flickr_entities.py b/ferret/eval/eval_flickr_entities.py
new file mode 100644
index 0000000..c91594c
--- /dev/null
+++ b/ferret/eval/eval_flickr_entities.py
@@ -0,0 +1,611 @@
+"""
+Usage:
+
+python ferret/eval/eval_flickr_entities.py \
+ --prediction_file result_checkpoint-final/flickr_result/final_flickr_mergedGT_test \
+ --annotation_file data/annotations/final_flickr_mergedGT_test.json \
+ --flickr_entities_path data/flickr30k
+
+"""
+
+
+import xml.etree.ElementTree as ET
+from collections import defaultdict
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
+
+import numpy as np
+from prettytable import PrettyTable
+from tqdm import tqdm
+
+import json
+import os
+import re
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+
+
+def resize_bbox(box, image_w=None, image_h=None):
+ ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
+ ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
+
+ new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
+ int(box[2] * ratio_w), int(box[3] * ratio_h)]
+ return new_box
+
+
+def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
+ entities = []
+ boxes = []
+
+ start = 0
+ in_brackets = False
+ entity = ""
+ box = ""
+
+ for i, char in enumerate(text):
+ if char == '[':
+ in_brackets = True
+ entity = text[start:i].strip()
+ start = i + 1
+ elif char == ']':
+ in_brackets = False
+ box = text[start:i].strip()
+ start = i + 1
+
+ # Convert box string to list of integers
+ box_list = list(map(int, box.split(',')))
+ resized_box_list = resize_bbox(box_list, img_w, img_h)
+ entities.append(entity)
+ boxes.append(resized_box_list)
+
+ # Skip until the next entity (ignoring periods or other delimiters)
+ while start < len(text) and text[start] not in ['.', ',', ';', '!', '?']:
+ start += 1
+ start += 1 # Skip the delimiter
+
+ return entities, boxes
+
+
+def are_phrases_similar(phrase1, phrase2):
+ # Step 1: Convert to lower case
+ phrase1 = phrase1.lower()
+ phrase2 = phrase2.lower()
+
+ # Step 2: Standardize spacing around punctuation
+ phrase1 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase1).strip()
+ phrase2 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase2).strip()
+
+ # Step 3: Remove all punctuation
+ phrase1 = re.sub(r'[^\w\s]', '', phrase1)
+ phrase2 = re.sub(r'[^\w\s]', '', phrase2)
+
+ # Step 4: Remove extra white spaces
+ phrase1 = ' '.join(phrase1.split())
+ phrase2 = ' '.join(phrase2.split())
+
+ return phrase1 == phrase2
+
+
+def get_sentence_data(filename) -> List[Dict[str, Any]]:
+ """
+ Parses a sentence file from the Flickr30K Entities dataset
+
+ input:
+ filename - full file path to the sentence file to parse
+
+ output:
+ a list of dictionaries for each sentence with the following fields:
+ sentence - the original sentence
+ phrases - a list of dictionaries for each phrase with the
+ following fields:
+ phrase - the text of the annotated phrase
+ first_word_index - the position of the first word of
+ the phrase in the sentence
+ phrase_id - an identifier for this phrase
+ phrase_type - a list of the coarse categories this
+ phrase belongs to
+
+ """
+ with open(filename, "r") as f:
+ sentences = f.read().split("\n")
+
+ annotations = []
+ for sentence in sentences:
+ if not sentence:
+ continue
+
+ first_word = []
+ phrases = []
+ phrase_id = []
+ phrase_type = []
+ words = []
+ current_phrase = []
+ add_to_phrase = False
+ for token in sentence.split():
+ if add_to_phrase:
+ if token[-1] == "]":
+ add_to_phrase = False
+ token = token[:-1]
+ current_phrase.append(token)
+ phrases.append(" ".join(current_phrase))
+ current_phrase = []
+ else:
+ current_phrase.append(token)
+
+ words.append(token)
+ else:
+ if token[0] == "[":
+ add_to_phrase = True
+ first_word.append(len(words))
+ parts = token.split("/")
+ phrase_id.append(parts[1][3:])
+ phrase_type.append(parts[2:])
+ else:
+ words.append(token)
+
+ sentence_data = {"sentence": " ".join(words), "phrases": []}
+ for index, phrase, p_id, p_type in zip(first_word, phrases, phrase_id, phrase_type):
+ sentence_data["phrases"].append(
+ {"first_word_index": index, "phrase": phrase, "phrase_id": p_id, "phrase_type": p_type}
+ )
+
+ annotations.append(sentence_data)
+
+ return annotations
+
+
+def get_annotations(filename) -> Dict[str, Union[int, List[str], Dict[str, List[List[int]]]]]:
+ """
+ Parses the xml files in the Flickr30K Entities dataset
+
+ input:
+ filename - full file path to the annotations file to parse
+
+ output:
+ dictionary with the following fields:
+ scene - list of identifiers which were annotated as
+ pertaining to the whole scene
+ nobox - list of identifiers which were annotated as
+ not being visible in the image
+ boxes - a dictionary where the fields are identifiers
+ and the values are its list of boxes in the
+ [xmin ymin xmax ymax] format
+ height - int representing the height of the image
+ width - int representing the width of the image
+ depth - int representing the depth of the image
+ """
+ tree = ET.parse(filename)
+ root = tree.getroot()
+ size_container = root.findall("size")[0]
+ anno_info: Dict[str, Union[int, List[str], Dict[str, List[List[int]]]]] = {}
+ all_boxes: Dict[str, List[List[int]]] = {}
+ all_noboxes: List[str] = []
+ all_scenes: List[str] = []
+ for size_element in size_container:
+ assert size_element.text
+ anno_info[size_element.tag] = int(size_element.text)
+
+ for object_container in root.findall("object"):
+ for names in object_container.findall("name"):
+ box_id = names.text
+ assert box_id
+ box_container = object_container.findall("bndbox")
+ if len(box_container) > 0:
+ if box_id not in all_boxes:
+ all_boxes[box_id] = []
+ xmin = int(box_container[0].findall("xmin")[0].text)
+ ymin = int(box_container[0].findall("ymin")[0].text)
+ xmax = int(box_container[0].findall("xmax")[0].text)
+ ymax = int(box_container[0].findall("ymax")[0].text)
+ all_boxes[box_id].append([xmin, ymin, xmax, ymax])
+ else:
+ nobndbox = int(object_container.findall("nobndbox")[0].text)
+ if nobndbox > 0:
+ all_noboxes.append(box_id)
+
+ scene = int(object_container.findall("scene")[0].text)
+ if scene > 0:
+ all_scenes.append(box_id)
+ anno_info["boxes"] = all_boxes
+ anno_info["nobox"] = all_noboxes
+ anno_info["scene"] = all_scenes
+
+ return anno_info
+
+
+#### END of import from flickr30k_entities
+#### Bounding box utilities imported from torchvision and converted to numpy
+def box_area(boxes: np.array) -> np.array:
+ """
+ Computes the area of a set of bounding boxes, which are specified by its
+ (x1, y1, x2, y2) coordinates.
+
+ Args:
+ boxes (Tensor[N, 4]): boxes for which the area will be computed. They
+ are expected to be in (x1, y1, x2, y2) format with
+ ``0 <= x1 < x2`` and ``0 <= y1 < y2``.
+
+ Returns:
+ area (Tensor[N]): area for each box
+ """
+ assert boxes.ndim == 2 and boxes.shape[-1] == 4
+ return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
+
+
+# implementation from https://github.com/kuangliu/torchcv/blob/master/torchcv/utils/box.py
+# with slight modifications
+def _box_inter_union(boxes1: np.array, boxes2: np.array) -> Tuple[np.array, np.array]:
+ area1 = box_area(boxes1)
+ area2 = box_area(boxes2)
+
+ lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2]
+ rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2]
+
+ wh = (rb - lt).clip(min=0) # [N,M,2]
+ inter = wh[:, :, 0] * wh[:, :, 1] # [N,M]
+
+ union = area1[:, None] + area2 - inter
+
+ return inter, union
+
+
+def box_iou(boxes1: np.array, boxes2: np.array) -> np.array:
+ """
+ Return intersection-over-union (Jaccard index) of boxes.
+
+ Both sets of boxes are expected to be in ``(x1, y1, x2, y2)`` format with
+ ``0 <= x1 < x2`` and ``0 <= y1 < y2``.
+
+ Args:
+ boxes1 (Tensor[N, 4])
+ boxes2 (Tensor[M, 4])
+
+ Returns:
+ iou (Tensor[N, M]): the NxM matrix containing the pairwise IoU values for every element in boxes1 and boxes2
+ """
+ inter, union = _box_inter_union(boxes1, boxes2)
+ iou = inter / union
+ return iou
+
+
+#### End of import of box utilities
+
+def _merge_boxes(boxes: List[List[int]]) -> List[List[int]]:
+ """
+ Return the boxes corresponding to the smallest enclosing box containing all the provided boxes
+ The boxes are expected in [x1, y1, x2, y2] format
+ """
+ if len(boxes) == 1:
+ return boxes
+
+ np_boxes = np.asarray(boxes)
+
+ return [[np_boxes[:, 0].min(), np_boxes[:, 1].min(), np_boxes[:, 2].max(), np_boxes[:, 3].max()]]
+
+class RecallTracker:
+ """ Utility class to track recall@k for various k, split by categories"""
+
+ def __init__(self, topk: Sequence[int]):
+ """
+ Parameters:
+ - topk : tuple of ints corresponding to the recalls being tracked (eg, recall@1, recall@10, ...)
+ """
+
+ self.total_byk_bycat: Dict[int, Dict[str, int]] = {k: defaultdict(int) for k in topk}
+ self.positives_byk_bycat: Dict[int, Dict[str, int]] = {k: defaultdict(int) for k in topk}
+
+ def add_positive(self, k: int, category: str):
+ """Log a positive hit @k for given category"""
+ if k not in self.total_byk_bycat:
+ raise RuntimeError(f"{k} is not a valid recall threshold")
+ self.total_byk_bycat[k][category] += 1
+ self.positives_byk_bycat[k][category] += 1
+
+ def add_negative(self, k: int, category: str):
+ """Log a negative hit @k for given category"""
+ if k not in self.total_byk_bycat:
+ raise RuntimeError(f"{k} is not a valid recall threshold")
+ self.total_byk_bycat[k][category] += 1
+
+ def report(self) -> Dict[int, Dict[str, float]]:
+ """Return a condensed report of the results as a dict of dict.
+ report[k][cat] is the recall@k for the given category
+ """
+ report: Dict[int, Dict[str, float]] = {}
+ for k in self.total_byk_bycat:
+ assert k in self.positives_byk_bycat
+ report[k] = {
+ cat: self.positives_byk_bycat[k][cat] / self.total_byk_bycat[k][cat] for cat in self.total_byk_bycat[k]
+ }
+ return report
+
+
+class Flickr30kEntitiesRecallEvaluator:
+ def __init__(
+ self,
+ flickr_path: str,
+ subset: str = "test",
+ topk: Sequence[int] = (1, 5, 10, -1),
+ iou_thresh: float = 0.5,
+ merge_boxes: bool = False,
+ verbose: bool = True,
+ ):
+
+ assert subset in ["train", "test", "val"], f"Wrong flickr subset {subset}"
+
+ self.topk = topk
+ self.iou_thresh = iou_thresh
+
+ flickr_path = Path(flickr_path)
+
+ # We load the image ids corresponding to the current subset
+ with open(flickr_path / f"{subset}.txt") as file_d:
+ self.img_ids = [line.strip() for line in file_d]
+
+ if verbose:
+ print(f"Flickr subset contains {len(self.img_ids)} images")
+
+ # Read the box annotations for all the images
+ self.imgid2boxes: Dict[str, Dict[str, List[List[int]]]] = {}
+
+ if verbose:
+ print("Loading annotations...")
+
+ for img_id in self.img_ids:
+ anno_info = get_annotations(flickr_path / "Annotations" / f"{img_id}.xml")["boxes"]
+ if merge_boxes:
+ merged = {}
+ for phrase_id, boxes in anno_info.items():
+ merged[phrase_id] = _merge_boxes(boxes)
+ anno_info = merged
+ self.imgid2boxes[img_id] = anno_info
+
+ # Read the sentences annotations
+ self.imgid2sentences: Dict[str, List[List[Optional[Dict]]]] = {}
+
+ if verbose:
+ print("Loading annotations...")
+
+ self.all_ids: List[str] = []
+ tot_phrases = 0
+ for img_id in self.img_ids:
+ sentence_info = get_sentence_data(flickr_path / "Sentences" / f"{img_id}.txt")
+ self.imgid2sentences[img_id] = [None for _ in range(len(sentence_info))]
+
+ # Some phrases don't have boxes, we filter them.
+ for sent_id, sentence in enumerate(sentence_info):
+ phrases = [phrase for phrase in sentence["phrases"] if phrase["phrase_id"] in self.imgid2boxes[img_id]]
+ if len(phrases) > 0:
+ self.imgid2sentences[img_id][sent_id] = phrases
+ tot_phrases += len(phrases)
+
+ self.all_ids += [
+ f"{img_id}_{k}" for k in range(len(sentence_info)) if self.imgid2sentences[img_id][k] is not None
+ ]
+
+ if verbose:
+ print(f"There are {tot_phrases} phrases in {len(self.all_ids)} sentences to evaluate")
+
+ def evaluate(self, predictions: List[Dict]):
+ evaluated_ids = set()
+
+ recall_tracker = RecallTracker(self.topk)
+
+ for pred in predictions:
+ cur_id = f"{pred['image_id']}_{pred['sentence_id']}"
+ if cur_id in evaluated_ids:
+ print(
+ "Warning, multiple predictions found for sentence"
+ f"{pred['sentence_id']} in image {pred['image_id']}"
+ )
+ continue
+
+ # Skip the sentences with no valid phrase
+ if cur_id not in self.all_ids:
+ if len(pred["boxes"]) != 0:
+ print(
+ f"Warning, in image {pred['image_id']} we were not expecting predictions "
+ f"for sentence {pred['sentence_id']}. Ignoring them."
+ )
+ continue
+
+ evaluated_ids.add(cur_id)
+
+ pred_boxes = pred["boxes"]
+ if str(pred["image_id"]) not in self.imgid2sentences:
+ raise RuntimeError(f"Unknown image id {pred['image_id']}")
+ if not 0 <= int(pred["sentence_id"]) < len(self.imgid2sentences[str(pred["image_id"])]):
+ raise RuntimeError(f"Unknown sentence id {pred['sentence_id']}" f" in image {pred['image_id']}")
+ target_sentence = self.imgid2sentences[str(pred["image_id"])][int(pred["sentence_id"])]
+
+ phrases = self.imgid2sentences[str(pred["image_id"])][int(pred["sentence_id"])]
+ if len(pred_boxes) != len(phrases):
+ raise RuntimeError(
+ f"Error, got {len(pred_boxes)} predictions, expected {len(phrases)} "
+ f"for sentence {pred['sentence_id']} in image {pred['image_id']}"
+ )
+
+ for cur_boxes, phrase in zip(pred_boxes, phrases):
+ target_boxes = self.imgid2boxes[str(pred["image_id"])][phrase["phrase_id"]]
+
+ ious = box_iou(np.asarray(cur_boxes), np.asarray(target_boxes))
+ for k in self.topk:
+ maxi = 0
+ if k == -1:
+ maxi = ious.max()
+ else:
+ assert k > 0
+ maxi = ious[:k].max()
+ if maxi >= self.iou_thresh:
+ recall_tracker.add_positive(k, "all")
+ for phrase_type in phrase["phrase_type"]:
+ recall_tracker.add_positive(k, phrase_type)
+ else:
+ recall_tracker.add_negative(k, "all")
+ for phrase_type in phrase["phrase_type"]:
+ recall_tracker.add_negative(k, phrase_type)
+
+ if len(evaluated_ids) != len(self.all_ids):
+ print("ERROR, the number of evaluated sentence doesn't match. Missing predictions:")
+ un_processed = set(self.all_ids) - evaluated_ids
+ for missing in un_processed:
+ img_id, sent_id = missing.split("_")
+ print(f"\t sentence {sent_id} in image {img_id}")
+ raise RuntimeError("Missing predictions")
+
+ return recall_tracker.report()
+
+
+class Flickr30kEntitiesRecallEvaluatorFromJsonl(Flickr30kEntitiesRecallEvaluator):
+ def evaluate(self,
+ annotation_file: str,
+ prediction_file: str,
+ verbose: bool = False,
+ ):
+ recall_tracker = RecallTracker(self.topk)
+
+ gt_json = json.load(open(annotation_file, 'r', encoding='utf-8'))
+
+ # get the predictions
+ if os.path.isfile(prediction_file):
+ predictions = [json.loads(line) for line in open(prediction_file)]
+ elif os.path.isdir(prediction_file):
+ predictions = [json.loads(line) for pred_file in sorted(os.listdir(prediction_file)) for line in open(os.path.join(prediction_file, pred_file))]
+ else:
+ raise NotImplementedError('Not supported file format.')
+
+ predict_index = 0
+
+ valid_cnt = 0
+ for item in tqdm(gt_json['images']):
+ file_name = item["file_name"]
+ caption = item["caption"]
+ img_height = float(item['height'])
+ img_width = float(item['width'])
+ postive_item_pos = item['tokens_positive_eval']
+
+ # to verify
+ phrases_from_self = self.imgid2sentences[str(item['original_img_id'])][int(item['sentence_id'])]
+ for pos in postive_item_pos:
+ # pdb.set_trace()
+ if predict_index == len(predictions):
+ break
+
+ pos_start, pos_end = pos[0]
+ phrase = caption[pos_start:pos_end]
+ phrase_from_self = [p for p in phrases_from_self if p['phrase'] == phrase]
+ if len(phrase_from_self) == 0:
+ raise ValueError(f"Can't find the corresponding gt from two file {phrase} vs. {phrases_from_self}")
+ else:
+ phrase_from_self = phrase_from_self[0]
+
+ # get the prediction from text line
+ try:
+ prediction = predictions[predict_index]["text"]
+ except IndexError as e:
+ print("Raise Indexerror.")
+ print(f"prediction index / length: {predict_index} / {len(predictions)}")
+ import sys
+ sys.exit(0)
+ try:
+ entities, boxes = decode_bbox_from_caption(prediction, img_width, img_height, verbose=verbose)
+ assert len(entities) == len(boxes)
+ except ValueError as e:
+ entities, boxes = [], []
+
+ predict_boxes = []
+
+ for (entity, box) in zip(entities, boxes):
+ if not are_phrases_similar(entity, phrase): # get the matched noun phrase
+ # print(f"{entity} | {phrase}")
+ continue
+ else:
+ predict_boxes.append(box)
+
+ if len(predict_boxes) == 0:
+ print(f"Can't find valid bbox for the given phrase ({phrase}) in caption ({caption}), \n{prediction}")
+ print(f"We set a 0-area box to calculate recall result")
+ predict_boxes = [[0., 0., 0., 0.]]
+ # exit(0)
+
+ # evaluate
+ target_boxes = self.imgid2boxes[str(item['original_img_id'])][phrase_from_self["phrase_id"]]
+ ious = box_iou(np.asarray(predict_boxes), np.asarray(target_boxes))
+ for k in self.topk:
+ maxi = 0
+ if k == -1:
+ maxi = ious.max()
+ else:
+ assert k > 0
+ maxi = ious[:k].max()
+ if maxi >= self.iou_thresh:
+ recall_tracker.add_positive(k, "all")
+ for phrase_type in phrase_from_self["phrase_type"]:
+ recall_tracker.add_positive(k, phrase_type)
+ else:
+ recall_tracker.add_negative(k, "all")
+ for phrase_type in phrase_from_self["phrase_type"]:
+ recall_tracker.add_negative(k, phrase_type)
+
+ # pdb.set_trace()
+ valid_cnt += 1
+ predict_index += 1
+
+ print(f"Valid prediction {valid_cnt}/{len(predictions)}")
+ self.results = recall_tracker.report()
+ return self.results
+
+ def summarize(self):
+ table = PrettyTable()
+ all_cat = sorted(list(self.results.values())[0].keys())
+ table.field_names = ["Recall@k"] + all_cat
+
+ score = {}
+ for k, v in self.results.items():
+ cur_results = [v[cat] for cat in all_cat]
+ header = "Upper_bound" if k == -1 else f"Recall@{k}"
+
+ for cat in all_cat:
+ score[f"{header}_{cat}"] = v[cat]
+ table.add_row([header] + cur_results)
+
+ print(table)
+ return score
+
+
+if __name__ == '__main__':
+ import argparse
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--prediction_file', help='prediction_file')
+ parser.add_argument('--annotation_file', default='/path/to/final_flickr_mergedGT_test.json', help='annotation_file')
+ parser.add_argument('--flickr_entities_path', default='/path/to/flickr30k_entities', help='flickr entities')
+
+ args = parser.parse_args()
+
+ if os.path.isfile(args.prediction_file):
+ predictions = [json.loads(line) for line in open(args.prediction_file)]
+ elif os.path.isdir(args.prediction_file):
+ predictions = []
+
+ if '_test.json' in args.annotation_file:
+ subset = "test"
+ else:
+ subset = "val"
+
+ evaluator = Flickr30kEntitiesRecallEvaluatorFromJsonl(
+ flickr_path = args.flickr_entities_path,
+ subset = subset,
+ topk = (1, 5, 10, -1),
+ iou_thresh = 0.5,
+ merge_boxes = True,
+ verbose = True,
+ )
+
+ evaluator.evaluate(args.annotation_file, args.prediction_file, verbose=False)
+ score = evaluator.summarize()
+
+ with open(os.path.join(args.prediction_file, "metric.json"), "w") as f:
+ json.dump(score, f, indent=2)
\ No newline at end of file
diff --git a/ferret/eval/eval_gpt_review_3newclass.py b/ferret/eval/eval_gpt_review_3newclass.py
new file mode 100644
index 0000000..0869676
--- /dev/null
+++ b/ferret/eval/eval_gpt_review_3newclass.py
@@ -0,0 +1,160 @@
+import argparse
+import json
+import os
+
+import openai
+import time
+import re
+import pdb
+from tqdm import tqdm
+
+NUM_SECONDS_TO_SLEEP = 0.5
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+
+def get_eval(content: str, max_tokens: int):
+ while True:
+ try:
+ response = openai.ChatCompletion.create(
+ model='gpt-4-0314',
+ messages=[{
+ 'role': 'system',
+ 'content': 'You are a helpful and precise assistant for checking the quality of the answer.'
+ }, {
+ 'role': 'user',
+ 'content': content,
+ }],
+ temperature=0.2, # TODO: figure out which temperature is best for evaluation
+ max_tokens=max_tokens,
+ )
+ break
+ except openai.error.RateLimitError:
+ pass
+ except Exception as e:
+ print(e)
+ time.sleep(NUM_SECONDS_TO_SLEEP)
+
+ return response['choices'][0]['message']['content']
+
+def postprocess_answer(answer, category):
+ if category == 'refer_desc' or category == 'refer_reason':
+ pattern = r'\[.*?\]'
+ matches = re.findall(pattern, answer)
+ for match in matches:
+ answer = answer.replace(' '+match, '')
+ elif category == 'ground_conv':
+ pattern = r'\[.*?\]'
+ matches = re.findall(pattern, answer)
+ for match in matches:
+ coor_cur = match.replace('[', '')
+ coor_cur = coor_cur.replace(']', '')
+ coor_cur = coor_cur.split(',')
+ coor_cur = [float(i.strip()) for i in coor_cur]
+ try:
+ assert len(coor_cur) == 4
+ except:
+                print('Found an exception when parsing coordinates')
+ answer = answer.replace(match, '')
+ converted_box_coor = [coor_cur[0]/VOCAB_IMAGE_W, coor_cur[1]/VOCAB_IMAGE_H, coor_cur[2]/VOCAB_IMAGE_W, coor_cur[3]/VOCAB_IMAGE_H]
+ answer = answer.replace(match, f'[{converted_box_coor[0]:.3f}, {converted_box_coor[1]:.3f}, {converted_box_coor[2]:.3f}, {converted_box_coor[3]:.3f}]')
+
+ return answer
+
+
+def parse_score(review):
+ try:
+ score_pair = review.split('\n')[0]
+ score_pair = score_pair.replace(',', ' ')
+ sp = score_pair.split(' ')
+ if len(sp) == 2:
+ return [float(sp[0]), float(sp[1])]
+ else:
+ print('error', review)
+ return [-1, -1]
+ except Exception as e:
+ print(e)
+ print('error', review)
+ return [-1, -1]
+
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
+ parser.add_argument('-q', '--question')
+ parser.add_argument('-c', '--context')
+ parser.add_argument('-a', '--answer-list', nargs='+', default=[])
+ parser.add_argument('-r', '--rule')
+ parser.add_argument('-o', '--output')
+ parser.add_argument('--max-tokens', type=int, default=1024, help='maximum number of tokens produced in the output')
+ args = parser.parse_args()
+
+ f_q = open(os.path.expanduser(args.question))
+ f_ans1 = open(os.path.expanduser(args.answer_list[0]))
+ f_ans2 = open(os.path.expanduser(args.answer_list[1]))
+ rule_dict = json.load(open(os.path.expanduser(args.rule), 'r'))
+
+ if os.path.isfile(os.path.expanduser(args.output)):
+ cur_reviews = [json.loads(line) for line in open(os.path.expanduser(args.output))]
+ else:
+ cur_reviews = []
+
+ review_file = open(f'{args.output}', 'a')
+
+ context_list = [json.loads(line) for line in open(os.path.expanduser(args.context))]
+ image_to_context = {context['image']: context for context in context_list}
+
+ handles = []
+ idx = 0
+ for ques_js, ans1_js, ans2_js in tqdm(zip(f_q, f_ans1, f_ans2)):
+ ques = json.loads(ques_js)
+ ans1 = json.loads(ans1_js)
+ ans2 = json.loads(ans2_js)
+
+ inst = image_to_context[ques['image']]
+ # cap_str = '\n'.join(inst['captions'])
+ # box_str = '\n'.join([f'{instance["category"]}: {instance["bbox"]}' for instance in inst['instances']])
+
+ category = json.loads(ques_js)['category']
+ if category in rule_dict:
+ rule = rule_dict[category]
+ else:
+ assert False, f"Visual QA category not found in rule file: {category}."
+
+ # Assume ans2 is the predicted one.
+ processed_answer = postprocess_answer(ans2['text'], category)
+ # pdb.set_trace()
+ ans2['text'] = processed_answer
+ # if category == 'refer_desc':
+
+ prompt = rule['prompt']
+ role = rule['role']
+        content = (f'[Context]\n{inst["text"]}\n\n'
+ f'[Question]\n{ques["text"]}\n\n'
+ f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
+ f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
+ f'[System]\n{prompt}\n\n')
+ # content = (f'[Context]\n{cap_str}\n\n{box_str}\n\n'
+ # f'[Question]\n{ques["text"]}\n\n'
+ # f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
+ # f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
+ # f'[System]\n{prompt}\n\n')
+ cur_js = {
+ 'id': idx+1,
+ 'question_id': ques['question_id'],
+ 'answer1_id': ans1.get('answer_id', ans1['question_id']),
+ 'answer2_id': ans2.get('answer_id', ans2['question_id']),
+ 'category': category
+ }
+ if idx >= len(cur_reviews):
+ review = get_eval(content, args.max_tokens)
+ scores = parse_score(review)
+ cur_js['content'] = review
+ cur_js['tuple'] = scores
+ cur_js['answer1'] = ans1["text"]
+ cur_js['answer2'] = ans2["text"]
+ review_file.write(json.dumps(cur_js) + '\n')
+ review_file.flush()
+ else:
+ print(f'Skipping {idx} as we already have it.')
+ idx += 1
+ print(idx)
+ review_file.close()
diff --git a/ferret/eval/eval_lvis.py b/ferret/eval/eval_lvis.py
new file mode 100644
index 0000000..21830f9
--- /dev/null
+++ b/ferret/eval/eval_lvis.py
@@ -0,0 +1,69 @@
+"""
+Usage:
+- Eval Prediction:
+python ferret/eval/eval_lvis.py --pred_file=[your generated result by running ferret/eval/model_lvis.py]
+
+"""
+import argparse
+import json
+import os
+import re
+import random
+from PIL import Image
+import matplotlib.pyplot as plt
+import matplotlib.patches as patches
+import textwrap
+from tqdm import tqdm
+import pdb
+
+def get_args():
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--pred_file', type=str, default='/Users/youhaoxuan/research_misc/lvis_result/llava_answer_debug.jsonl')
+ return parser.parse_args()
+
+def remove_not_phrases_v2(text):
+ # Pattern covers the start of a phrase up to and including 'not' and any following characters until a comma or period
+ pattern = r"\s+not[^,.]*[,.]"
+ text = re.sub(pattern, "", text)
+ pattern = r"\s+no[^,.]*[,.]"
+ text = re.sub(pattern, "", text)
+ return text
+
+if __name__ == "__main__":
+ args = get_args()
+ # Fix the random seed
+ random.seed(42)
+ if os.path.isfile(args.pred_file):
+ predictions = [json.loads(line) for line in open(args.pred_file)]
+ elif os.path.isdir(args.pred_file):
+ predictions = [json.loads(line) for pred_file in os.listdir(args.pred_file) for line in open(os.path.join(args.pred_file, pred_file))]
+ else:
+ raise NotImplementedError('Not supported file format.')
+
+ total_correct = 0
+ for i in tqdm(predictions):
+ # Process name and synonyms
+ i['name'] = i['name'].replace('_', ' ').strip()
+ new_synonyms = []
+ for jj in i['synonyms']:
+ if '(' in jj:
+ assert ')' in jj
+ split_list = jj.split('(')
+ assert len(split_list) == 2
+ new_synonyms.append(split_list[0].replace('_', ' ').strip())
+ new_synonyms.append(split_list[1].replace('_', ' ').replace(')', '').strip())
+ else:
+ new_synonyms.append(jj.replace('_', ' ').strip())
+ i['synonyms'] = new_synonyms
+
+ # Match Result
+ processed_text = remove_not_phrases_v2(i['text'])
+ # pdb.set_trace()
+ if i['name'] in processed_text or any(syn_i in processed_text for syn_i in i['synonyms']):
+ total_correct += 1
+ else:
+ pass
+
+ acc = total_correct / len(predictions)
+ print(f'Acc:{acc*100:.3f}%')
+ # pdb.set_trace()
diff --git a/ferret/eval/eval_pope.py b/ferret/eval/eval_pope.py
new file mode 100644
index 0000000..85c2de3
--- /dev/null
+++ b/ferret/eval/eval_pope.py
@@ -0,0 +1,105 @@
+"""
+Usage:
+
+python ferret/eval/eval_pope.py \
+ --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_adversarial \
+ --annotation_file data/pope/coco_pope_adversarial.json
+
+python ferret/eval/eval_pope.py \
+ --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_popular \
+ --annotation_file data/pope/coco_pope_popular.json
+
+python ferret/eval/eval_pope.py \
+ --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_random \
+ --annotation_file data/pope/coco_pope_random.json
+
+"""
+import os
+import json
+
+def evaluate_pope(prediction_file, annotation_file):
+ # get the predictions
+ if os.path.isfile(prediction_file):
+ answers = [json.loads(line) for line in open(prediction_file)]
+ elif os.path.isdir(prediction_file):
+ answers = [json.loads(line) for pred_file in sorted(os.listdir(prediction_file)) for line in open(os.path.join(prediction_file, pred_file))]
+ else:
+ raise NotImplementedError('Not supported file format.')
+
+ label_list = [json.loads(q)['label'] for q in open(annotation_file, 'r')]
+
+ for answer in answers:
+ text = answer['answer']
+
+ # Only keep the first sentence
+ if text.find('.') != -1:
+ text = text.split('.')[0]
+
+ text = text.replace(',', '')
+ words = text.split(' ')
+ if 'No' in words or 'not' in words or 'no' in words:
+ answer['answer'] = 'no'
+ else:
+ answer['answer'] = 'yes'
+
+ for i in range(len(label_list)):
+ if label_list[i] == 'no':
+ label_list[i] = 0
+ else:
+ label_list[i] = 1
+
+ pred_list = []
+ for answer in answers:
+ if answer['answer'] == 'no':
+ pred_list.append(0)
+ else:
+ pred_list.append(1)
+
+ pos = 1
+ neg = 0
+ yes_ratio = pred_list.count(1) / len(pred_list)
+
+ TP, TN, FP, FN = 0, 0, 0, 0
+ for pred, label in zip(pred_list, label_list):
+ if pred == pos and label == pos:
+ TP += 1
+ elif pred == pos and label == neg:
+ FP += 1
+ elif pred == neg and label == neg:
+ TN += 1
+ elif pred == neg and label == pos:
+ FN += 1
+
+ print('TP\tFP\tTN\tFN\t')
+ print('{}\t{}\t{}\t{}'.format(TP, FP, TN, FN))
+
+ precision = float(TP) / float(TP + FP)
+ recall = float(TP) / float(TP + FN)
+ f1 = 2*precision*recall / (precision + recall)
+ acc = (TP + TN) / (TP + TN + FP + FN)
+ print('Accuracy: {}'.format(acc))
+ print('Precision: {}'.format(precision))
+ print('Recall: {}'.format(recall))
+ print('F1 score: {}'.format(f1))
+ print('Yes ratio: {}'.format(yes_ratio))
+
+ score = {"Accuracy": acc,
+ "Precision": precision,
+ "Recall": recall,
+ "F1 score": f1,
+ "Yes ratio": yes_ratio,
+ }
+
+    with open(os.path.join(prediction_file, 'metric.json'), "w") as f:
+ json.dump(score, f, indent=2)
+
+
+if __name__ == '__main__':
+ import argparse
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--prediction_file', help='prediction_file')
+ parser.add_argument('--annotation_file', default='/path/to/json_annotations', help='annotation_file')
+
+ args = parser.parse_args()
+ evaluate_pope(args.prediction_file, args.annotation_file)
\ No newline at end of file
diff --git a/ferret/eval/eval_refexp.py b/ferret/eval/eval_refexp.py
new file mode 100644
index 0000000..eede5c3
--- /dev/null
+++ b/ferret/eval/eval_refexp.py
@@ -0,0 +1,217 @@
+"""
+Usage:
+
+python ferret/eval/eval_refexp.py \
+ --prediction_file final_result/ferret_13b_checkpoint-final/refexp_result/finetune_refcocog_test \
+ --annotation_file data/annotations/finetune_refcocog_test.json
+
+"""
+import os
+import copy
+from collections import defaultdict
+from pathlib import Path
+from tqdm import tqdm
+
+import torch
+import torch.utils.data
+from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
+from prettytable import PrettyTable
+
+import re
+import json
+
+from misc.refcoco.box_ops import generalized_box_iou, box_iou
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+
+
+def resize_bbox(box, image_w=None, image_h=None):
+ ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
+ ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
+
+ new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
+ int(box[2] * ratio_w), int(box[3] * ratio_h)]
+ return new_box
+
+
+def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
+ entities = []
+ boxes = []
+
+ start = 0
+ in_brackets = False
+ entity = ""
+ box = ""
+
+ for i, char in enumerate(text):
+ if char == '[':
+ in_brackets = True
+ entity = text[start:i].strip()
+ start = i + 1
+ elif char == ']':
+ in_brackets = False
+ box = text[start:i].strip()
+ start = i + 1
+
+ # Convert box string to list of integers
+ box_list = list(map(int, box.split(',')))
+ resized_box_list = resize_bbox(box_list, img_w, img_h)
+ entities.append(entity)
+ boxes.append(resized_box_list)
+
+ # Skip until the next entity (ignoring periods or other delimiters)
+ while start < len(text) and text[start] not in ['.', ',', ';', '!', '?']:
+ start += 1
+ start += 1 # Skip the delimiter
+
+ return entities, boxes
+
+
+def are_phrases_similar(phrase1, phrase2):
+ # Step 1: Convert to lower case
+ phrase1 = phrase1.lower()
+ phrase2 = phrase2.lower()
+
+ # Step 2: Standardize spacing around punctuation
+ phrase1 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase1).strip()
+ phrase2 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase2).strip()
+
+ # Step 3: Remove all punctuation
+ phrase1 = re.sub(r'[^\w\s]', '', phrase1)
+ phrase2 = re.sub(r'[^\w\s]', '', phrase2)
+
+ # Step 4: Remove extra white spaces
+ phrase1 = ' '.join(phrase1.split())
+ phrase2 = ' '.join(phrase2.split())
+
+ return phrase1 == phrase2
+
+
+class RefExpEvaluatorFromJsonl(object):
+ def __init__(self, refexp_gt_path, k=(1, -1), thresh_iou=0.5):
+ assert isinstance(k, (list, tuple))
+ with open(refexp_gt_path, 'r') as f:
+ self.refexp_gt = json.load(f)
+ self.img_ids = [item['id'] for item in self.refexp_gt['images']]
+ print(f"Load {len(self.img_ids)} images")
+ print(f"Load {len(self.refexp_gt['annotations'])} annotations")
+ self.k = k
+ self.thresh_iou = thresh_iou
+
+ def summarize(self,
+ prediction_file: str,
+ verbose: bool = False,):
+
+ # get the predictions
+ if os.path.isfile(prediction_file):
+ predictions = [json.loads(line) for line in open(prediction_file)]
+ elif os.path.isdir(prediction_file):
+ predictions = [json.loads(line) for pred_file in os.listdir(prediction_file) for line in open(os.path.join(prediction_file, pred_file))]
+ else:
+ raise NotImplementedError('Not supported file format.')
+
+ # sort the predictions based on 'image_id'
+ predictions = sorted(predictions, key=lambda x: x['image_id'])
+
+ predict_index = 0
+
+ dataset2score = {
+ "refcoco": {k: 0.0 for k in self.k},
+ "refcoco+": {k: 0.0 for k in self.k},
+ "refcocog": {k: 0.0 for k in self.k},
+ }
+ dataset2count = {"refcoco": 0.0, "refcoco+": 0.0, "refcocog": 0.0}
+ for item_img, item_ann in tqdm(zip(self.refexp_gt['images'], self.refexp_gt['annotations'])):
+
+ # quit when evaluating all predictions
+ if predict_index == len(predictions):
+ break
+
+ if item_img['id'] != item_ann['image_id']:
+ raise ValueError(f"Ann\n{item_ann} \nis not matched\n {item_img}")
+
+ dataset_name = item_img['dataset_name']
+ img_height = item_img['height']
+ img_width = item_img['width']
+ caption = item_img['caption']
+ target_bbox = item_ann["bbox"]
+ converted_bbox = [
+ target_bbox[0],
+ target_bbox[1],
+ target_bbox[2] + target_bbox[0],
+ target_bbox[3] + target_bbox[1],
+ ]
+ target_bbox = torch.as_tensor(converted_bbox).view(-1, 4)
+
+ prediction = predictions[predict_index]["text"]
+ try:
+ entities, boxes = decode_bbox_from_caption(prediction, img_width, img_height, verbose=verbose)
+ except ValueError as e:
+ entities, boxes = [], []
+
+ predict_boxes = []
+ for (entity, box) in zip(entities, boxes):
+ if not are_phrases_similar(entity, caption):
+ if len(box) > 0:
+ predict_boxes.append(box)
+ else:
+ predict_boxes.append(box)
+
+ if len(predict_boxes) == 0:
+ print(f"Can't find valid bbox for the given phrase {caption}, \n{entities, boxes}")
+ print(f"We set a 0-area box to calculate result")
+ predict_boxes = [[0., 0., 0., 0.]]
+
+ predict_boxes = torch.as_tensor(predict_boxes).view(-1, 4).to(dtype=torch.float32)
+
+ iou, _ = box_iou(predict_boxes, target_bbox)
+ mean_iou, _ = box_iou(predict_boxes.mean(0).view(-1, 4), target_bbox)
+ for k in self.k:
+ if k == 'upper bound':
+ if max(iou) >= self.thresh_iou:
+ dataset2score[dataset_name][k] += 1.0
+ elif k == 'mean':
+ if max(mean_iou) >= self.thresh_iou:
+ dataset2score[dataset_name][k] += 1.0
+ else:
+ if max(iou[0, :k]) >= self.thresh_iou:
+ dataset2score[dataset_name][k] += 1.0
+
+ dataset2count[dataset_name] += 1.0
+ predict_index += 1
+
+ for key, value in dataset2score.items():
+ for k in self.k:
+ try:
+ value[k] /= dataset2count[key]
+ except ZeroDivisionError:
+ # no samples were evaluated for this dataset; leave its scores at 0
+ pass
+
+ results = {}
+ for key, value in dataset2score.items():
+ results[key] = sorted([v for k, v in value.items()])
+ print(f" Dataset: {key} - Precision @ 1, mean, all: {results[key]} \n")
+
+ return results
+
+
+if __name__ == '__main__':
+ import argparse
+
+ parser = argparse.ArgumentParser()
+ parser.add_argument('--prediction_file', help='prediction_file')
+ parser.add_argument('--annotation_file', default='/path/to/json_annotations', help='annotation_file')
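+ # prediction_file may be a single .jsonl file or a directory of .jsonl files; each line is a
+ # JSON object with at least "image_id" and "text" fields. annotation_file is the
+ # RefCOCO/RefCOCO+/RefCOCOg ground-truth json with "images" and "annotations" entries.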
+
+ args = parser.parse_args()
+
+ evaluator = RefExpEvaluatorFromJsonl(
+ refexp_gt_path=args.annotation_file,
+ k=(1, 'mean', 'upper bound'),
+ thresh_iou=0.5,
+ )
+
+ results = evaluator.summarize(args.prediction_file, verbose=False)
+
+ # write metrics alongside the predictions (prediction_file may be a file or a directory)
+ output_dir = args.prediction_file if os.path.isdir(args.prediction_file) else os.path.dirname(args.prediction_file)
+ with open(os.path.join(output_dir, "metric.json"), "w") as f:
+ json.dump(results, f, indent=2)
\ No newline at end of file
diff --git a/ferret/eval/ferret_gpt4_data/ground_conv/answer.jsonl b/ferret/eval/ferret_gpt4_data/ground_conv/answer.jsonl
new file mode 100644
index 0000000..1106fb2
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/ground_conv/answer.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "The man [0.201, 0.002, 0.940, 0.758] is performing a trick on a skateboard [0.012, 0.746, 0.664, 0.886]. He appears to be in mid-air, indicating that he is probably performing a jump or some sort of aerial trick."}
+{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "There are people [0.005, 0.562, 0.616, 0.824] standing in the service area [0.003, 0.416, 0.995, 0.996] of the airport, which appears to be a runway. A man [0.659, 0.572, 0.920, 0.844] is walking down the stairs [0.352, 0.676, 1.000, 0.994] of an unloading ramp. Another man [0.008, 0.554, 0.139, 0.800] is holding his baggage [0.107, 0.662, 0.179, 0.750] and bending over. There's also a woman [0.917, 0.610, 1.000, 0.724] dressed in a sleeveless black top [0.960, 0.626, 1.000, 0.668] with a black and white headband [0.923, 0.628, 0.952, 0.646]. There are several pieces of luggage [0.368, 0.702, 0.456, 0.780] [0.907, 0.706, 0.973, 0.786] on the ground [0.000, 0.282, 1.000, 0.976]."}
+{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "There is a train [0.002, 0.408, 1.000, 0.683] on the bridge [0.002, 0.608, 0.988, 0.877]."}
+{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "In the image, there are several children each holding a stuffed animal. A boy [0.102, 0.388, 0.498, 1.000] is holding a stuffed dog [0.391, 0.506, 0.622, 0.714], and he is wearing a black jacket [0.077, 0.520, 0.495, 0.910]. Another boy [0.717, 0.188, 1.000, 0.864] is holding a white stuffed animal [0.695, 0.356, 0.868, 0.580] wearing a red jacket. There is also a girl [0.372, 0.196, 0.603, 0.922] holding a grey stuffed dog [0.286, 0.298, 0.517, 0.422]. Another child [0.342, 0.390, 0.622, 1.000] is seen holding up a white stuffed animal [0.286, 0.298, 0.517, 0.422]. Furthermore, there is a baby [0.385, 0.034, 0.643, 0.434] being held by a lady [0.286, 0.000, 0.683, 0.560]."}
+{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "The man [0.075, 0.102, 0.704, 0.716] is sitting at a table [0.000, 0.592, 0.997, 1.000] and looking at a plate of food [0.416, 0.726, 0.856, 0.904]. There is a glass [0.275, 0.716, 0.501, 0.998] and a bottle [0.048, 0.712, 0.195, 1.002] on the table in front of him."}
+{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "The girl [0.112, 0.091, 0.868, 0.992] is petting a sheep [0.000, 0.003, 0.704, 0.320]. The hand [0.418, 0.373, 0.548, 0.592] of the girl is on the sheep."}
+{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "The main features of the building [0.000, 0.000, 1.000, 0.466] include a door [0.110, 0.370, 0.266, 0.518] with a picture [0.155, 0.378, 0.259, 0.442] on it. There's a stop light [0.366, 0.236, 0.638, 0.394] with a sad face [0.383, 0.244, 0.614, 0.384] on it nearby the building. In front of the building, there is a road [0.000, 0.490, 1.000, 1.000] with arrows [0.000, 0.616, 0.214, 0.644] painted on it. Next to the road, there is a sidewalk [0.666, 0.572, 0.993, 0.618]."}
+{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "The man [0.000, 0.024, 0.835, 1.002] in this image appears to be installing or fixing a toilet [0.564, 0.574, 1.000, 0.974]. He is wearing gloves for the task and is kneeling next to the toilet."}
+{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "On the desk, there are several items. There is a computer mouse [0.414, 0.753, 0.470, 0.811], a keyboard [0.415, 0.620, 0.650, 0.783], and a cup [0.350, 0.783, 0.417, 0.906]. A plate [0.183, 0.799, 0.326, 0.896] with a fork [0.203, 0.794, 0.270, 0.857] on it is also on the desk."}
+{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "The man [0.390, 0.432, 0.466, 0.793] is jumping in the grass [0.000, 0.610, 0.998, 0.997] to catch a Frisbee [0.400, 0.354, 0.446, 0.381] in the air. He is extending his arm [0.416, 0.459, 0.432, 0.565] to reach the Frisbee. He is wearing a shirt [0.402, 0.468, 0.458, 0.649] and pants [0.390, 0.658, 0.424, 0.763]. His shadow [0.492, 0.724, 0.622, 0.994] can be seen in the grass."}
+{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "The image depicts a small harbor scene where multiple boats [0.000, 0.461, 0.354, 0.579] [0.348, 0.501, 0.874, 0.789] [0.302, 0.461, 0.684, 0.611] are docked on land next to a body of water [0.000, 0.259, 1.000, 0.469]. There is a chain-link fence [0.002, 0.176, 0.996, 0.995] enclosing the boats. There are also surfboards [0.830, 0.448, 0.996, 0.552] [0.420, 0.384, 0.502, 0.411] [0.910, 0.768, 0.998, 0.877] [0.430, 0.344, 0.508, 0.371] [0.830, 0.565, 1.000, 0.712] [0.322, 0.307, 0.450, 0.341] [0.766, 0.251, 0.998, 0.368] [0.764, 0.704, 0.998, 0.829] and a rack [0.754, 0.168, 1.000, 0.901] in the scene. In the background, you can see a skyline of buildings [0.692, 0.195, 0.718, 0.248] [0.888, 0.173, 0.922, 0.227] [0.582, 0.211, 0.610, 0.256] [0.180, 0.259, 0.202, 0.293] [0.466, 0.208, 0.518, 0.272] under the blue sky [0.000, 0.000, 0.998, 0.317]."}
+{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "The man [0.000, 0.133, 0.600, 0.992] is interacting with a toilet [0.016, 0.042, 0.719, 0.996]. He is reaching out his hand [0.281, 0.125, 0.603, 0.562] to flush the toilet using the flusher [0.534, 0.092, 0.628, 0.300] located on top of the toilet tank [0.019, 0.021, 0.706, 0.579]."}
+{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "There is a blueberry cake [0.238, 0.093, 0.786, 0.787] topped with butter [0.454, 0.024, 0.638, 0.288] placed on a plate [0.166, 0.453, 1.000, 1.000] which is on the table [0.002, 0.365, 0.998, 0.997]. On the same table, there is another plate [0.628, 0.120, 0.998, 0.389] containing a mix of food [0.632, 0.123, 0.996, 0.336] including an egg [0.636, 0.125, 0.880, 0.267] and a sausage [0.766, 0.248, 0.984, 0.333]. There is also a cup [0.002, 0.000, 0.202, 0.667] of water [0.000, 0.000, 0.202, 0.667] on the table."}
+{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "The items placed on the table are a pair of napkins [0.541, 0.818, 0.601, 0.858], a pepper shaker [0.594, 0.822, 0.619, 0.854], and a salt shaker [0.612, 0.824, 0.637, 0.854]."}
+{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "The image shows a woman [0.009, 0.194, 0.497, 0.888] sitting on a wooden bench [0.000, 0.324, 0.731, 0.994] in a park [0.000, 0.002, 0.997, 1.000] during daytime. The park appears to have a lot of trees [0.554, 0.000, 0.997, 0.376] and there are people [0.386, 0.438, 0.449, 0.504] walking in front of the woman. The woman's purse [0.458, 0.488, 0.605, 0.694] is also on the bench next to her. The park seems to be enclosed by a fence [0.719, 0.310, 0.997, 0.372] and there is a building [0.090, 0.000, 0.686, 0.094] behind the trees."}
+{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "The image features a green and white bus [0.100, 0.376, 0.866, 0.805] that is driving down a city street."}
+{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "There are two zebras in the image. The first zebra [0.000, 0.000, 0.622, 0.790] and the second zebra [0.002, 0.000, 0.684, 0.682]."}
+{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "The person riding the horse is wearing boots [0.328, 0.339, 0.416, 0.492] and a shirt [0.388, 0.150, 0.508, 0.279]. They are also wearing a helmet [0.484, 0.096, 0.560, 0.162]. The person [0.320, 0.078, 0.552, 0.502] is riding the horse."}
+{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "The two men [0.144, 0.360, 0.246, 0.736] [0.730, 0.474, 0.780, 0.613] are playing a game of tennis [0.012, 0.384, 0.984, 0.934]. The first man is wearing a white shirt [0.164, 0.411, 0.222, 0.547], gray shorts [0.162, 0.535, 0.220, 0.628], and black sneakers [0.180, 0.709, 0.216, 0.739]. The second man is wearing white clothing [0.734, 0.492, 0.778, 0.601] and white sneakers [0.762, 0.598, 0.776, 0.613]. They are on a tennis court [0.000, 0.372, 0.988, 0.979] and are currently hitting a ball [0.640, 0.399, 0.648, 0.411] with their tennis rackets [0.214, 0.574, 0.238, 0.619] [0.768, 0.526, 0.808, 0.556]."}
+{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "The airplane [0.112, 0.300, 0.858, 0.640] is in the air, currently in flight. This can be inferred from the position of the sky [0.000, 0.000, 0.998, 1.000]. However, the airplane's landing gear [0.450, 0.592, 0.600, 0.643] is lowered, which indicates the airplane is in the process of landing or taking off."}
+{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "This image is set at a train station. There's a train [0.290, 0.400, 0.998, 0.784] moving on the railway tracks [0.000, 0.752, 0.520, 0.944]. The train station has a platform with an awning [0.000, 0.299, 0.132, 0.403] and a bench under a canopy [0.000, 0.301, 0.136, 0.400]. Near the platform, there is a building [0.000, 0.299, 0.204, 0.659] with a red brick wall [0.000, 0.392, 0.206, 0.611]. The station is surrounded by trees [0.208, 0.253, 0.322, 0.653] and the sky [0.000, 0.000, 0.998, 0.560] above shows some clouds [0.374, 0.067, 0.920, 0.312]."}
+{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "There are three people in the image. One person [0.066, 0.162, 0.318, 0.686] is wearing a black uniform [0.000, 0.222, 0.126, 0.646] and a hat [0.006, 0.162, 0.072, 0.198]. Another person [0.390, 0.344, 0.838, 0.894] is wearing a long black coat [0.405, 0.332, 0.835, 0.746] and pants [0.523, 0.736, 0.739, 0.858]. The last person [0.853, 0.154, 1.000, 0.650] is wearing jeans [0.853, 0.422, 1.000, 0.632]."}
+{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Sure, the bike [0.146, 0.109, 0.938, 0.803] is a Honda, as indicated by the Honda logo [0.322, 0.395, 0.378, 0.419]. It has a front wheel [0.150, 0.424, 0.366, 0.635] and a back tire [0.574, 0.496, 0.860, 0.800]. It also has a light [0.894, 0.411, 0.944, 0.520]. The bike features a shock absorber [0.626, 0.501, 0.698, 0.680] for smooth riding. It also has a handle [0.284, 0.109, 0.390, 0.384] for steering and a display [0.240, 0.275, 0.290, 0.328] for the rider's information. Not to forget the sylencer [0.462, 0.645, 0.816, 0.779] near the back tire."}
+{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "The setting is a mountainous region. There is a large mountain [0.000, 0.057, 0.992, 0.782] with a snow-covered peak [0.744, 0.042, 0.898, 0.119]. In front of the mountain, there is a lush pasture [0.000, 0.815, 0.984, 1.000] where cows [0.548, 0.860, 0.574, 0.896] [0.436, 0.860, 0.454, 0.890] are grazing. There are trailers [0.796, 0.910, 0.894, 0.997] [0.632, 0.899, 0.742, 0.994] in the pasture, probably for animal equipment and transportation. There are also trees [0.740, 0.409, 1.000, 0.982] around the area. All of this is under a clear sky [0.000, 0.000, 1.002, 0.257]."}
+{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "The zebras [0.730, 0.496, 0.796, 0.581] are in a fenced area [0.464, 0.531, 0.934, 0.848]. Near them, there is a truck [0.000, 0.416, 0.210, 0.805] on the road [0.180, 0.709, 0.432, 0.957]. They are also surrounded by trees [0.128, 0.000, 0.592, 0.597] and grass [0.544, 0.659, 0.840, 0.859]."}
+{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "The boy [0.792, 0.480, 0.938, 0.853] is holding a surfboard [0.790, 0.587, 0.960, 0.691]."}
+{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "There are three people in the image. One man [0.164, 0.455, 0.292, 0.997] is standing on the side [0.236, 0.675, 0.994, 0.997] wearing shorts [0.174, 0.699, 0.254, 0.864]. Another man [0.582, 0.476, 0.662, 0.870] is standing beside the elephant [0.328, 0.157, 0.638, 0.967] wearing a shirt [0.582, 0.521, 0.650, 0.681]. There is also a woman [0.288, 0.473, 0.420, 0.967] wearing a top [0.302, 0.539, 0.358, 0.696] touching the elephant. They are all on the side of a body of water [0.000, 0.488, 0.994, 1.000]."}
+{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "The boat [0.048, 0.552, 0.928, 0.819] is white and is of a large size. It has multiple levels [0.000, 0.709, 1.000, 0.829] [0.068, 0.616, 0.852, 0.688]. The side of the boat has a set of long black windows [0.374, 0.733, 0.790, 0.765]. Further, it has a silver railing [0.094, 0.557, 0.728, 0.624] [0.238, 0.597, 0.744, 0.627] on the top level. There are also red letters [0.414, 0.693, 0.654, 0.725] and blue water symbols [0.268, 0.688, 0.350, 0.779] on the side of the boat."}
+{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "A person [0.490, 0.136, 0.825, 0.998] with red hair [0.507, 0.142, 0.791, 0.642] is holding a cake [0.630, 0.670, 0.772, 0.750]. She is wearing a suit jacket [0.490, 0.422, 0.799, 0.998] and a necktie [0.571, 0.442, 0.674, 0.936]."}
+{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "In the image, there is a group of sheep [0.532, 0.546, 0.646, 0.662] [0.532, 0.666, 0.817, 0.810] grazing in a field [0.000, 0.002, 0.994, 0.998]. The field is bordered by a stone wall [0.000, 0.000, 0.769, 0.180] and is filled with plant life [0.000, 0.764, 0.601, 0.998]. There is also a bush [0.480, 0.000, 0.748, 0.084] and some trees [0.736, 0.036, 0.835, 0.100] present in the field. A few rocks [0.727, 0.410, 0.808, 0.470] can also be spotted in the scene."}
+{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "A child [0.408, 0.168, 0.606, 0.786] is holding the umbrella [0.296, 0.038, 0.782, 0.360]."}
+{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "There is a man [0.274, 0.000, 0.517, 0.792] standing in the kitchen [0.000, 0.000, 0.750, 0.849]. He is standing next to a counter [0.000, 0.329, 0.576, 0.398]. On the floor of the kitchen [0.000, 0.713, 1.000, 1.000], there is a small dog [0.462, 0.593, 0.568, 0.842]. Mugs [0.509, 0.123, 0.595, 0.266] are hanging on the wall [0.506, 0.019, 0.607, 0.384]. There is also a blender [0.015, 0.165, 0.080, 0.307] on the counter."}
+{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "There are several books [0.414, 0.208, 0.538, 0.364] [0.360, 0.202, 0.417, 0.360] [0.435, 0.480, 0.712, 0.578] and a bowl [0.072, 0.030, 0.288, 0.076] on the top shelf [0.000, 0.028, 0.607, 0.202]. On the middle shelf [0.207, 0.334, 0.997, 0.380], there are more books [0.414, 0.208, 0.538, 0.364] [0.360, 0.202, 0.417, 0.360]. The bottom shelf [0.324, 0.528, 0.997, 0.624] contains a stack of books [0.435, 0.480, 0.712, 0.578]."}
+{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "The man [0.163, 0.274, 0.491, 0.936] is preparing to serve a tennis ball. He is holding a tennis racket [0.235, 0.578, 0.304, 0.664] in his hand [0.253, 0.648, 0.299, 0.680]. He is wearing a cap [0.171, 0.388, 0.253, 0.476] on his head [0.173, 0.408, 0.256, 0.474], and shorts [0.216, 0.628, 0.432, 0.782]. His shadow [0.397, 0.898, 0.968, 0.956] is cast in front of him."}
+{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "The image shows two women [0.102, 0.099, 0.486, 0.984] [0.502, 0.000, 0.982, 0.997], both of them are eating hot dogs [0.190, 0.587, 0.350, 0.741] [0.676, 0.315, 0.882, 0.408]. One of the women is wearing sunglasses [0.630, 0.005, 0.794, 0.048] on her head. They seem to be standing on a street [0.042, 0.403, 0.092, 0.520], potentially walking while enjoying their meal."}
+{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "There is a man [0.542, 0.343, 0.812, 0.493] and a woman [0.644, 0.377, 0.834, 0.863] sitting on a bench [0.070, 0.493, 0.932, 0.960]. They are looking at the ocean [0.028, 0.319, 0.972, 0.821]. The man has his arm [0.658, 0.462, 0.828, 0.496] around the woman."}
+{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "The catcher [0.334, 0.193, 0.756, 0.940] is squatting on the field, ready to catch a ball with his gloved hand extended."}
+{"question_id": 38, "image": "000000271402.jpg", "category": "ground_conv", "text": "One girl [0.329, 0.148, 0.973, 0.892] is holding a tennis racket [0.462, 0.480, 0.713, 0.840] and the other girl [0.057, 0.102, 0.456, 0.898] is standing next to a scooter [0.097, 0.424, 0.592, 0.996]."}
+{"question_id": 39, "image": "000000018519.jpg", "category": "ground_conv", "text": "The skater has taken several safety measures. He is wearing a helmet [0.358, 0.354, 0.448, 0.422] which has a sticker [0.408, 0.358, 0.438, 0.368]. He also has a pad [0.540, 0.362, 0.595, 0.420] on his body. His elbow is protected by another pad [0.376, 0.512, 0.443, 0.554] and his knee is protected by a knee pad [0.450, 0.542, 0.512, 0.598]. He is also wearing a wrist brace [0.279, 0.524, 0.338, 0.564]. The skater is also wearing roller skates [0.647, 0.490, 0.709, 0.584] for the activity."}
+{"question_id": 40, "image": "000000106048.jpg", "category": "ground_conv", "text": "The objects on the bus include a design [0.228, 0.422, 0.438, 0.560], side mirrors [0.488, 0.314, 0.530, 0.428] [0.790, 0.332, 0.818, 0.455], wheels [0.266, 0.545, 0.294, 0.677] [0.248, 0.551, 0.264, 0.668] [0.444, 0.578, 0.472, 0.751], windows [0.510, 0.216, 0.796, 0.548] and a windshield [0.518, 0.222, 0.782, 0.545]. The bus [0.222, 0.144, 0.820, 0.757] itself."}
diff --git a/ferret/eval/ferret_gpt4_data/ground_conv/context.jsonl b/ferret/eval/ferret_gpt4_data/ground_conv/context.jsonl
new file mode 100644
index 0000000..7959d57
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/ground_conv/context.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : axle at [0.447, 0.814, 0.535, 0.856].\nObject 1 : background at [0.003, 0.744, 0.994, 0.988].\nObject 2 : bracelet at [0.820, 0.444, 0.859, 0.470].\nObject 3 : building at [0.012, 0.888, 0.099, 0.994].\nObject 4 : corner at [0.027, 0.890, 0.117, 0.992].\nObject 5 : fence at [0.030, 0.886, 1.000, 1.000].\nObject 6 : hair at [0.486, 0.078, 0.712, 0.216].\nObject 7 : jean pants at [0.246, 0.380, 0.841, 0.632].\nObject 8 : laces at [0.168, 0.562, 0.850, 0.674].\nObject 9 : logo at [0.429, 0.232, 0.583, 0.364].\nObject 10 : man at [0.201, 0.002, 0.940, 0.758].\nObject 11 : name at [0.000, 0.960, 0.321, 1.000].\nObject 12 : picture at [0.003, 0.004, 1.000, 0.998].\nObject 13 : poles at [0.180, 0.886, 0.432, 0.990].\nObject 14 : shirt at [0.324, 0.124, 0.694, 0.392].\nObject 15 : shoes at [0.189, 0.606, 0.946, 0.792].\nObject 16 : skateboard at [0.012, 0.746, 0.664, 0.886].\nObject 17 : sky at [0.012, 0.002, 1.000, 0.918].\nObject 18 : stadium lights at [0.147, 0.860, 0.456, 0.994].\nObject 19 : stitching at [0.312, 0.408, 0.754, 0.638].\nObject 20 : strip at [0.279, 0.770, 0.529, 0.802].\nObject 21 : top at [0.024, 0.830, 0.420, 0.936].\nObject 22 : trees at [0.024, 0.846, 1.000, 1.000].\nObject 23 : wheels at [0.012, 0.808, 0.586, 0.904].\nObject 24 : wrist at [0.802, 0.434, 0.856, 0.484].\n\nRelationships:\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 23 : wheels -> on a -> object 16 : skateboard.\nobject 14 : shirt -> has a -> object 9 : logo.\nobject 10 : man -> doing trick on -> object 16 : skateboard.\nobject 3 : building -> behind a -> object 5 : fence.\nobject 11 : name -> on -> object 12 : picture.\nobject 11 : name -> has a -> object 11 : name.\nobject 10 : man -> performing on a -> object 16 : skateboard.\nobject 4 : corner -> of -> object 3 : building.\nobject 18 : stadium lights -> are on -> object 13 : poles.\nobject 16 : skateboard -> has -> object 23 : wheels.\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 11 : name -> on -> object 12 : picture.\nobject 16 : skateboard -> under -> object 10 : man.\nobject 10 : man -> wearing -> object 15 : shoes.\nobject 3 : building -> behind -> object 5 : fence.\nobject 22 : trees -> in -> object 1 : background.\nobject 15 : shoes -> have -> object 8 : laces.\nobject 18 : stadium lights -> on -> object 13 : poles.\nobject 5 : fence -> behind -> object 10 : man.\nobject 20 : strip -> on -> object 16 : skateboard.\nobject 19 : stitching -> on -> object 7 : jean pants.\nobject 9 : logo -> on -> object 14 : shirt.\nobject 23 : wheels -> on -> object 16 : skateboard.\nobject 0 : axle -> on -> object 16 : skateboard.\nobject 21 : top -> of -> object 22 : trees.\n\nRegion Description:\nRegion Description at [0.030, 0.774, 0.643, 0.912] : a black skateboard with black wheels.\n\nGlobal Caption:\nA man flying through the air while riding a skateboard.\nA man is doing tricks on a skateboard.\nA skateboarder jumps while trying to perform a trick.\na man in the air standing above the skateboard\na person attempting a jump with a skateboard"}
+{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : baggage at [0.107, 0.662, 0.179, 0.750].\nObject 1 : baggage at [0.368, 0.706, 0.456, 0.782].\nObject 2 : building at [0.000, 0.000, 0.997, 0.326].\nObject 3 : cap at [0.784, 0.544, 0.824, 0.568].\nObject 4 : duffel bag at [0.584, 0.702, 0.643, 0.768].\nObject 5 : ground at [0.000, 0.282, 1.000, 0.976].\nObject 6 : hair at [0.920, 0.614, 0.973, 0.640].\nObject 7 : headband at [0.923, 0.628, 0.952, 0.646].\nObject 8 : jacket at [0.776, 0.568, 0.840, 0.642].\nObject 9 : line at [0.696, 0.750, 0.989, 0.794].\nObject 10 : lines at [0.000, 0.436, 0.851, 0.486].\nObject 11 : luggage at [0.907, 0.706, 0.973, 0.786].\nObject 12 : luggage at [0.368, 0.702, 0.456, 0.780].\nObject 13 : man at [0.008, 0.554, 0.139, 0.800].\nObject 14 : man at [0.659, 0.572, 0.920, 0.844].\nObject 15 : man at [0.771, 0.538, 0.843, 0.640].\nObject 16 : pavement at [0.003, 0.308, 0.992, 0.566].\nObject 17 : people at [0.005, 0.562, 0.616, 0.824].\nObject 18 : pillars at [0.211, 0.130, 0.235, 0.240].\nObject 19 : ramp at [0.179, 0.158, 0.707, 0.408].\nObject 20 : service area at [0.003, 0.416, 0.995, 0.996].\nObject 21 : stairs at [0.352, 0.676, 1.000, 0.994].\nObject 22 : sweater at [0.667, 0.634, 0.920, 0.824].\nObject 23 : top at [0.960, 0.626, 1.000, 0.668].\nObject 24 : truck at [0.781, 0.278, 0.997, 0.366].\nObject 25 : walls at [0.608, 0.000, 0.989, 0.320].\nObject 26 : wheel at [0.843, 0.338, 0.875, 0.366].\nObject 27 : woman at [0.917, 0.610, 1.000, 0.724].\n\nRelationships:\nobject 17 : people -> in -> object 20 : service area.\nobject 27 : woman -> bends over -> object 11 : luggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 12 : luggage -> on -> object 5 : ground.\nobject 13 : man -> carries -> object 0 : baggage.\nobject 14 : man -> wears -> object 22 : sweater.\nobject 15 : man -> wears -> object 3 : cap.\nobject 24 : truck -> in -> object 20 : service area.\nobject 15 : man -> wears -> object 8 : jacket.\nobject 10 : lines -> on -> object 16 : pavement.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 9 : line -> on -> object 16 : pavement.\nobject 24 : truck -> has -> object 26 : wheel.\nobject 2 : building -> has -> object 25 : walls.\nobject 15 : man -> on -> object 20 : service area.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 27 : woman -> wears -> object 7 : headband.\nobject 1 : baggage -> on -> object 20 : service area.\n\nRegion Description:\nRegion Description at [0.443, 0.528, 0.992, 0.850] : People standing in service area of airport..\nRegion Description at [0.648, 0.564, 0.960, 0.892] : Man walking down stairs of unloading ramp..\nRegion Description at [0.229, 0.698, 0.381, 0.776] : Black and red luggage sitting on ground..\nRegion Description at [0.957, 0.616, 0.997, 0.670] : Woman dressed in sleeveless black top..\nRegion Description at [0.011, 0.548, 0.211, 0.750] : Man holding his luggage and bending over.\nRegion Description at [0.893, 0.578, 0.995, 0.678] : woman with a black and white head band.\nRegion Description at [0.235, 0.684, 0.973, 0.816] : Rainbow of colors in the form of luggage.\n\nGlobal Caption:\nSome are standing outside a building with suitcases.\nA few people are getting of a plane.\nA group of people and luggage on a airport tarmac.\nSome people who are placing luggage on a runway.\nAn airport and plane unloading 
passengers with luggage."}
+{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : air conditioner at [0.004, 0.261, 0.018, 0.293].\nObject 1 : balcony at [0.048, 0.037, 0.100, 0.077].\nObject 2 : beam at [0.616, 0.621, 0.664, 0.824].\nObject 3 : beam at [0.490, 0.640, 0.532, 0.832].\nObject 4 : beam at [0.426, 0.640, 0.462, 0.835].\nObject 5 : bridge at [0.002, 0.608, 0.988, 0.877].\nObject 6 : bridge at [0.004, 0.453, 1.000, 0.867].\nObject 7 : building at [0.000, 0.000, 0.252, 0.469].\nObject 8 : bushes at [0.000, 0.939, 0.072, 0.997].\nObject 9 : colors at [0.194, 0.480, 0.330, 0.661].\nObject 10 : column at [0.618, 0.824, 0.676, 0.997].\nObject 11 : guard rails at [0.000, 0.496, 1.000, 0.624].\nObject 12 : light at [0.606, 0.192, 0.724, 0.243].\nObject 13 : light at [0.864, 0.947, 0.916, 1.000].\nObject 14 : metal support at [0.002, 0.603, 0.976, 0.995].\nObject 15 : pole at [0.700, 0.205, 0.724, 0.995].\nObject 16 : red line at [0.632, 0.851, 0.648, 0.995].\nObject 17 : sky at [0.250, 0.013, 1.000, 0.467].\nObject 18 : south west at [0.338, 0.616, 0.442, 0.651].\nObject 19 : street at [0.002, 0.861, 1.000, 0.997].\nObject 20 : train at [0.002, 0.408, 1.000, 0.683].\nObject 21 : window at [0.144, 0.013, 0.182, 0.064].\nObject 22 : window at [0.430, 0.485, 0.534, 0.595].\nObject 23 : window at [0.134, 0.091, 0.182, 0.155].\nObject 24 : window at [0.340, 0.504, 0.424, 0.613].\nObject 25 : window at [0.116, 0.944, 0.168, 1.000].\nObject 26 : windows at [0.762, 0.437, 0.920, 0.613].\nObject 27 : windows at [0.004, 0.000, 0.096, 0.088].\n\nRelationships:\nobject 10 : column -> supporting -> object 6 : bridge.\nobject 10 : column -> has -> object 16 : red line.\nobject 12 : light -> on -> object 15 : pole.\nobject 7 : building -> behind -> object 20 : train.\nobject 21 : window -> on -> object 7 : building.\nobject 1 : balcony -> on -> object 7 : building.\nobject 25 : window -> visible under -> object 5 : bridge.\nobject 12 : light -> on -> object 19 : street.\nobject 2 : beam -> of -> object 5 : bridge.\nobject 20 : train -> in -> object 9 : colors.\nobject 24 : window -> of -> object 20 : train.\nobject 22 : window -> of train -> object 20 : train.\nobject 5 : bridge -> on -> object 20 : train.\nobject 7 : building -> beside -> object 20 : train.\nobject 23 : window -> of -> object 7 : building.\nobject 12 : light -> on a -> object 15 : pole.\nobject 12 : light -> on -> object 15 : pole.\nobject 20 : train -> says -> object 18 : south west.\nobject 8 : bushes -> are in -> object 19 : street.\nobject 7 : building -> has many -> object 27 : windows.\nobject 7 : building -> has -> object 0 : air conditioner.\nobject 20 : train -> on -> object 6 : bridge.\nobject 12 : light -> in -> object 19 : street.\nobject 5 : bridge -> has -> object 11 : guard rails.\nobject 26 : windows -> on -> object 20 : train.\nobject 20 : train -> has -> object 18 : south west.\nobject 6 : bridge -> has -> object 14 : metal support.\nobject 9 : colors -> to -> object 20 : train.\n\nRegion Description:\nRegion Description at [0.602, 0.837, 0.696, 0.997] : a metal support column for the bridge.\n\nGlobal Caption:\nA train as it travels down the tracks over a bridge.\na colorful train going along an elevated track \nA train rides on a bridge past a building.\nA subway train that is passing over a train bridge.\na train on a train track on an elevated bridge"}
+{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : animal at [0.717, 0.042, 0.831, 0.152].\nObject 1 : animal at [0.114, 0.582, 0.348, 0.840].\nObject 2 : baby at [0.385, 0.034, 0.643, 0.434].\nObject 3 : baby at [0.911, 0.028, 1.000, 0.250].\nObject 4 : bear at [0.391, 0.506, 0.622, 0.714].\nObject 5 : bear at [0.695, 0.356, 0.868, 0.580].\nObject 6 : bear hand at [0.114, 0.630, 0.175, 0.660].\nObject 7 : black sock at [0.800, 0.796, 0.858, 0.834].\nObject 8 : blonde boy at [0.166, 0.170, 0.351, 0.460].\nObject 9 : boy at [0.102, 0.388, 0.498, 1.000].\nObject 10 : boy at [0.717, 0.188, 1.000, 0.864].\nObject 11 : child at [0.342, 0.390, 0.622, 1.000].\nObject 12 : coat at [0.077, 0.520, 0.495, 0.910].\nObject 13 : coat at [0.775, 0.296, 1.000, 0.616].\nObject 14 : coat at [0.397, 0.090, 0.634, 0.262].\nObject 15 : flip flops at [0.434, 0.756, 0.606, 0.910].\nObject 16 : girl at [0.372, 0.196, 0.603, 0.922].\nObject 17 : glasses at [0.191, 0.236, 0.308, 0.250].\nObject 18 : grass at [0.637, 0.652, 0.754, 0.788].\nObject 19 : hand at [0.714, 0.094, 0.788, 0.160].\nObject 20 : hands at [0.763, 0.380, 0.877, 0.430].\nObject 21 : hat at [0.757, 0.030, 0.889, 0.078].\nObject 22 : jacket at [0.357, 0.500, 0.622, 0.782].\nObject 23 : jacket at [0.422, 0.286, 0.603, 0.550].\nObject 24 : jacket at [0.163, 0.296, 0.320, 0.462].\nObject 25 : jacket at [0.911, 0.106, 1.000, 0.224].\nObject 26 : lady at [0.286, 0.000, 0.683, 0.560].\nObject 27 : man at [0.628, 0.030, 0.951, 0.742].\nObject 28 : shirt at [0.831, 0.306, 0.957, 0.404].\nObject 29 : shirt at [0.197, 0.296, 0.298, 0.370].\nObject 30 : shoe at [0.717, 0.804, 0.871, 0.864].\nObject 31 : sidewalk at [0.628, 0.574, 0.769, 0.632].\nObject 32 : stuffed animal at [0.286, 0.298, 0.517, 0.422].\n\nRelationships:\nobject 10 : boy -> wearing -> object 28 : shirt.\nobject 3 : baby -> wearing -> object 25 : jacket.\nobject 22 : jacket -> carrying -> object 4 : bear.\nobject 8 : blonde boy -> wears -> object 17 : glasses.\nobject 8 : blonde boy -> wears -> object 24 : jacket.\nobject 11 : child -> holding up -> object 32 : stuffed animal.\nobject 10 : boy -> holding up -> object 5 : bear.\nobject 30 : shoe -> with a -> object 7 : black sock.\nobject 10 : boy -> wearing -> object 7 : black sock.\nobject 26 : lady -> holding -> object 2 : baby.\nobject 16 : girl -> wearing -> object 15 : flip flops.\nobject 9 : boy -> wearing -> object 12 : coat.\nobject 10 : boy -> wearing a -> object 13 : coat.\nobject 4 : bear -> on -> object 20 : hands.\nobject 26 : lady -> carrying -> object 2 : baby.\nobject 0 : animal -> in -> object 19 : hand.\n\nRegion Description:\nRegion Description at [0.905, 0.020, 0.997, 0.272] : blonde haired baby wearing yellow jacket.\nRegion Description at [0.357, 0.388, 0.640, 0.730] : girl in blue jacket carrying blue dog.\nRegion Description at [0.071, 0.378, 0.498, 0.842] : boy in black jacket holding stuffed dog.\nRegion Description at [0.055, 0.572, 0.375, 0.846] : brown stuffed dog with red and white collar.\nRegion Description at [0.283, 0.194, 0.603, 0.400] : girl in pink jacket holding white stuffed animal.\nRegion Description at [0.695, 0.356, 0.874, 0.576] : White stuffed animal wearing a red jacket..\nRegion Description at [0.332, 0.394, 0.618, 0.992] : Little girl holding a grey stuffed dog..\nRegion Description at [0.372, 0.476, 0.723, 0.786] : little girl holding blue and white stuffed animal.\nRegion Description at [0.062, 0.556, 0.422, 0.840] : little boy holding 
brown and white stuffed animal.\n\nGlobal Caption:\na bunch of kids walking through some grass\nA group of children are holding various stuffed animals and dolls.\nKids walking while holding their stuffed animals. \nA group of kids holding teddy bears and looking happy.\nA group of children carrying stuffed animals walks across the grass. "}
+{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bottle at [0.048, 0.712, 0.195, 1.002].\nObject 1 : chair at [0.696, 0.500, 1.003, 0.718].\nObject 2 : cork at [0.053, 0.712, 0.139, 0.776].\nObject 3 : cup at [0.043, 0.736, 0.240, 0.916].\nObject 4 : dish at [0.416, 0.726, 0.856, 0.904].\nObject 5 : fruit at [0.629, 0.834, 0.675, 0.880].\nObject 6 : glass at [0.275, 0.716, 0.501, 0.998].\nObject 7 : glasses at [0.179, 0.242, 0.464, 0.322].\nObject 8 : hair at [0.536, 0.258, 0.656, 0.320].\nObject 9 : man at [0.075, 0.102, 0.704, 0.716].\nObject 10 : rasberries at [0.499, 0.750, 0.544, 0.786].\nObject 11 : raspberries at [0.664, 0.828, 0.741, 0.864].\nObject 12 : sauce at [0.565, 0.752, 0.715, 0.824].\nObject 13 : shirt at [0.600, 0.350, 0.645, 0.494].\nObject 14 : shirt at [0.635, 0.282, 0.997, 0.654].\nObject 15 : sign at [0.419, 0.134, 0.509, 0.184].\nObject 16 : sweater at [0.072, 0.288, 0.704, 0.718].\nObject 17 : table at [0.000, 0.592, 0.997, 1.000].\nObject 18 : window at [0.328, 0.000, 0.600, 0.298].\nObject 19 : woman at [0.531, 0.258, 0.768, 0.688].\n\nRelationships:\nobject 9 : man -> wearing -> object 7 : glasses.\nobject 0 : bottle -> on -> object 17 : table.\nobject 6 : glass -> on -> object 17 : table.\nobject 11 : raspberries -> on -> object 4 : dish.\nobject 9 : man -> wearing -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.640, 0.180, 0.989, 0.530] : Man wearing a black and orange stripe shirt.\nRegion Description at [0.413, 0.136, 0.512, 0.184] : Yellow closed sign with brown letters.\nRegion Description at [0.629, 0.186, 0.995, 0.706] : a man wearing and orange and black striped shirt.\nRegion Description at [0.528, 0.254, 0.717, 0.666] : a woman with a ponytail eating lunch.\nRegion Description at [0.152, 0.238, 0.459, 0.322] : a pair of black wire rimmed eye glasses.\nRegion Description at [0.029, 0.716, 0.243, 0.922] : empty cup that used to contain coffee.\nRegion Description at [0.264, 0.708, 0.867, 0.994] : A plate of food with a glass of water.\n\nGlobal Caption:\nA man sitting in front of a plate of food.\nA man at a wooden table looking at a plate of food.\na man smiling while looking at his plate of food\nA man sitting at a table with a plate filled with food.\nA man looking happily at some dish in front of him."}
+{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : edge at [0.498, 0.296, 0.530, 0.419].\nObject 1 : feet at [0.914, 0.235, 1.000, 0.291].\nObject 2 : floor at [0.000, 0.003, 0.994, 1.000].\nObject 3 : girl at [0.112, 0.091, 0.868, 0.992].\nObject 4 : grass at [0.000, 0.005, 0.998, 0.995].\nObject 5 : ground at [0.000, 0.005, 0.992, 0.992].\nObject 6 : hair at [0.542, 0.096, 0.910, 0.624].\nObject 7 : hand at [0.418, 0.373, 0.548, 0.592].\nObject 8 : jeans at [0.500, 0.216, 0.584, 0.400].\nObject 9 : sheep at [0.000, 0.003, 0.704, 0.320].\nObject 10 : shirt at [0.426, 0.504, 0.900, 0.992].\nObject 11 : shoe at [0.472, 0.379, 0.558, 0.453].\nObject 12 : sneakers at [0.904, 0.141, 0.968, 0.187].\nObject 13 : someon at [0.884, 0.019, 0.994, 0.192].\nObject 14 : strap at [0.744, 0.600, 0.872, 0.715].\nObject 15 : strip at [0.512, 0.520, 0.548, 0.589].\nObject 16 : sweater at [0.500, 0.555, 0.818, 0.997].\nObject 17 : tie at [0.532, 0.515, 0.546, 0.579].\nObject 18 : wool at [0.016, 0.171, 0.114, 0.411].\n\nRelationships:\nobject 7 : hand -> on -> object 9 : sheep.\nobject 3 : girl -> with -> object 10 : shirt.\nobject 6 : hair -> on -> object 3 : girl.\nobject 3 : girl -> has -> object 6 : hair.\nobject 6 : hair -> on -> object 3 : girl.\n\nRegion Description:\nRegion Description at [0.000, 0.027, 0.530, 0.744] : a sheep that has been recently shorn.\nRegion Description at [0.116, 0.032, 0.924, 0.992] : girl in front has gray sweater hanging over her left shoulder.\nRegion Description at [0.506, 0.045, 0.912, 0.845] : girl in front is facing away from the camera.\nRegion Description at [0.120, 0.053, 0.890, 0.989] : girl in front wears a gray and white striped T-shirt.\nRegion Description at [0.300, 0.005, 0.624, 0.451] : someone in jeans and brown shoes stands behind the sheep.\nRegion Description at [0.880, 0.003, 0.992, 0.949] : several people only visible from the feet.\n\nGlobal Caption:\nYoung woman with sheep on straw covered floor.\nA child places his hands on the head and neck of a sheep while another sheep looks at his face.\nA person petting the head of a cute fluffy sheep.\nA child is petting a sheep while another sheep watches.\nA woman kneeling to pet animals while others wait. "}
+{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arrows at [0.000, 0.616, 0.214, 0.644].\nObject 1 : awning at [0.159, 0.260, 0.293, 0.336].\nObject 2 : building at [0.000, 0.000, 1.000, 0.466].\nObject 3 : bushes at [0.693, 0.342, 1.000, 0.512].\nObject 4 : door at [0.110, 0.370, 0.266, 0.518].\nObject 5 : face at [0.390, 0.256, 0.614, 0.392].\nObject 6 : greenery at [0.824, 0.154, 0.997, 0.384].\nObject 7 : hitch at [0.221, 0.520, 0.259, 0.542].\nObject 8 : ladder at [0.110, 0.342, 0.283, 0.364].\nObject 9 : license plate at [0.141, 0.460, 0.234, 0.500].\nObject 10 : line at [0.017, 0.700, 0.266, 0.756].\nObject 11 : picture at [0.155, 0.378, 0.259, 0.442].\nObject 12 : plant barrier at [0.672, 0.482, 1.000, 0.606].\nObject 13 : planter at [0.676, 0.152, 1.000, 0.510].\nObject 14 : pole at [0.328, 0.068, 0.483, 0.994].\nObject 15 : road at [0.000, 0.490, 1.000, 1.000].\nObject 16 : roof at [0.117, 0.360, 0.283, 0.382].\nObject 17 : sad face at [0.383, 0.244, 0.614, 0.384].\nObject 18 : short term at [0.624, 0.040, 0.769, 0.080].\nObject 19 : sidewalk at [0.666, 0.572, 0.993, 0.618].\nObject 20 : sign at [0.621, 0.082, 0.772, 0.132].\nObject 21 : sign at [0.007, 0.144, 0.069, 0.204].\nObject 22 : signal at [0.266, 0.210, 0.679, 0.848].\nObject 23 : stop light at [0.366, 0.236, 0.638, 0.394].\nObject 24 : tail light at [0.100, 0.446, 0.121, 0.472].\nObject 25 : van at [0.076, 0.326, 0.297, 0.556].\nObject 26 : wall at [0.676, 0.500, 0.997, 0.604].\nObject 27 : window at [0.903, 0.000, 1.000, 0.086].\n\nRelationships:\nobject 23 : stop light -> with -> object 17 : sad face.\nobject 0 : arrows -> on -> object 15 : road.\nobject 12 : plant barrier -> beside -> object 15 : road.\nobject 11 : picture -> on -> object 4 : door.\nobject 10 : line -> painted in -> object 15 : road.\nobject 19 : sidewalk -> next to -> object 15 : road.\nobject 2 : building -> for -> object 18 : short term.\nobject 23 : stop light -> making -> object 5 : face.\nobject 3 : bushes -> just above -> object 26 : wall.\nobject 22 : signal -> on -> object 14 : pole.\nobject 25 : van -> has -> object 16 : roof.\nobject 25 : van -> has -> object 8 : ladder.\nobject 8 : ladder -> on -> object 16 : roof.\nobject 13 : planter -> by -> object 15 : road.\nobject 23 : stop light -> on -> object 22 : signal.\n\nRegion Description:\nRegion Description at [0.331, 0.852, 0.472, 0.996] : Pole holding traffic light on street.\nRegion Description at [0.600, 0.036, 0.793, 0.084] : Building offers short term office space.\nRegion Description at [0.603, 0.074, 0.776, 0.120] : Office space as small as 2,500 sq. ft. available.\nRegion Description at [0.003, 0.008, 0.972, 0.356] : an office building is in the background.\n\nGlobal Caption:\nA red traffic light with a sad face drawn over it.\nA street scene with a close of of a stop light.\nA red stoplight with a street in the background.\nA stop sign gives traffic a frown face.\nThe sign is now at a red light."}
+{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : back splash at [0.000, 0.000, 0.145, 0.278].\nObject 1 : blemish at [0.564, 0.178, 0.572, 0.184].\nObject 2 : blemish at [0.517, 0.214, 0.523, 0.222].\nObject 3 : checkered tile at [0.751, 0.226, 1.000, 0.422].\nObject 4 : checkered tile at [0.000, 0.000, 0.145, 0.282].\nObject 5 : cloth at [0.384, 0.878, 1.000, 1.002].\nObject 6 : face at [0.488, 0.106, 0.671, 0.270].\nObject 7 : faucet at [0.000, 0.170, 0.078, 0.268].\nObject 8 : floor at [0.000, 0.796, 1.000, 1.004].\nObject 9 : flooring at [0.000, 0.788, 1.000, 1.002].\nObject 10 : glove at [0.647, 0.622, 0.815, 0.780].\nObject 11 : hand at [0.633, 0.606, 0.815, 0.784].\nObject 12 : man at [0.000, 0.024, 0.835, 1.002].\nObject 13 : overalls at [0.000, 0.576, 0.702, 0.962].\nObject 14 : part at [0.043, 0.274, 0.133, 0.370].\nObject 15 : pipes at [0.000, 0.354, 0.046, 0.472].\nObject 16 : poster at [0.749, 0.226, 0.997, 0.426].\nObject 17 : seat at [0.581, 0.582, 1.000, 0.716].\nObject 18 : sill at [0.792, 0.032, 1.000, 0.094].\nObject 19 : sink at [0.000, 0.240, 0.136, 0.376].\nObject 20 : sock at [0.217, 0.856, 0.251, 0.892].\nObject 21 : tarp at [0.358, 0.868, 1.000, 1.004].\nObject 22 : tile at [0.749, 0.230, 1.000, 0.420].\nObject 23 : toilet at [0.564, 0.574, 1.000, 0.974].\nObject 24 : towel at [0.000, 0.872, 1.000, 1.002].\nObject 25 : wall at [0.000, 0.000, 1.000, 0.870].\nObject 26 : window at [0.777, 0.000, 1.000, 0.080].\n\nRelationships:\nobject 26 : window -> above -> object 23 : toilet.\nobject 21 : tarp -> to protect -> object 8 : floor.\nobject 14 : part -> of a bathroom -> object 19 : sink.\nobject 4 : checkered tile -> on bathroom -> object 25 : wall.\nobject 1 : blemish -> on -> object 6 : face.\nobject 2 : blemish -> on -> object 6 : face.\nobject 1 : blemish -> on -> object 6 : face.\nobject 6 : face -> on -> object 12 : man.\nobject 10 : glove -> on -> object 11 : hand.\n\nRegion Description:\nRegion Description at [0.685, 0.508, 0.879, 0.774] : the man is wearing gloves on his hands.\nRegion Description at [0.685, 0.638, 0.815, 0.764] : rubber glove on the man's right hand.\nRegion Description at [0.220, 0.860, 0.251, 0.886] : black and white design on man's sock.\nRegion Description at [0.000, 0.052, 0.124, 0.158] : black and white back splash for bathroom sink.\n\nGlobal Caption:\nA young man bending next to a toilet.\nA man is kneeling and holding on to a toilet.\nA man attempting to lift up a toilet off the floor.\nA man fixing a toilet in a black and white photo.\nA man wears gloves as he installs a toilet."}
+{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : computer mouse at [0.414, 0.753, 0.470, 0.811].\nObject 1 : cup at [0.350, 0.783, 0.417, 0.906].\nObject 2 : desk at [0.000, 0.488, 0.998, 0.999].\nObject 3 : fork at [0.203, 0.794, 0.270, 0.857].\nObject 4 : glass at [0.277, 0.703, 0.345, 0.816].\nObject 5 : head phones at [0.872, 0.556, 0.993, 0.634].\nObject 6 : keyboard at [0.415, 0.620, 0.650, 0.783].\nObject 7 : lamp at [0.000, 0.302, 0.214, 0.430].\nObject 8 : laptop at [0.491, 0.296, 0.703, 0.540].\nObject 9 : picture at [0.795, 0.204, 0.898, 0.358].\nObject 10 : plant at [0.192, 0.201, 0.391, 0.461].\nObject 11 : plate at [0.183, 0.799, 0.326, 0.896].\nObject 12 : screen at [0.237, 0.249, 0.504, 0.628].\nObject 13 : stand at [0.506, 0.531, 0.663, 0.617].\nObject 14 : window at [0.606, 0.000, 1.000, 0.346].\n\nRelationships:\nobject 0 : computer mouse -> on -> object 2 : desk.\nobject 8 : laptop -> on -> object 13 : stand.\nobject 6 : keyboard -> on -> object 2 : desk.\nobject 9 : picture -> near -> object 14 : window.\nobject 3 : fork -> on -> object 11 : plate.\n\nRegion Description:\n\nGlobal Caption:\na desk with a cup plate laptop monitor and keyboard\nA laptop sitting next to a monitor, keyboard and a mouse.\nA laptop and a desktop monitor are displayed on top of the desk.\nLarge office desk with computers near a window.\nA desk with a laptop, second monitor and keyboard."}
+{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : abs at [0.392, 0.628, 0.426, 0.664].\nObject 1 : arm at [0.416, 0.459, 0.432, 0.565].\nObject 2 : buildings at [0.242, 0.532, 0.640, 0.580].\nObject 3 : frisbee at [0.400, 0.354, 0.446, 0.381].\nObject 4 : grass at [0.000, 0.610, 0.998, 0.997].\nObject 5 : hand at [0.418, 0.423, 0.438, 0.474].\nObject 6 : legs at [0.420, 0.703, 0.456, 0.811].\nObject 7 : man at [0.390, 0.432, 0.466, 0.793].\nObject 8 : pants at [0.390, 0.658, 0.424, 0.763].\nObject 9 : shadow at [0.492, 0.724, 0.622, 0.994].\nObject 10 : shirt at [0.402, 0.468, 0.458, 0.649].\nObject 11 : sky at [0.002, 0.003, 0.996, 0.556].\nObject 12 : trees at [0.002, 0.498, 0.998, 0.646].\n\nRelationships:\nobject 7 : man -> tossing -> object 3 : frisbee.\nobject 7 : man -> has -> object 6 : legs.\nobject 7 : man -> playing -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> has -> object 5 : hand.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> jumping -> object 4 : grass.\nobject 2 : buildings -> behind -> object 4 : grass.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> extending -> object 1 : arm.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> exposing -> object 0 : abs.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 1 : arm -> reaching for -> object 3 : frisbee.\n\nRegion Description:\nRegion Description at [0.394, 0.658, 0.480, 0.826] : A person wearing black color trouser.\nRegion Description at [0.394, 0.435, 0.460, 0.796] : man in a red sweatshirt and jeans jumping.\nRegion Description at [0.390, 0.357, 0.464, 0.823] : man catching a frisbee in a wheat field.\nRegion Description at [0.012, 0.520, 0.996, 0.631] : trees and a village on a hill in the distance.\nRegion Description at [0.390, 0.423, 0.464, 0.649] : arm straight up and arm bent at elbow.\n\nGlobal Caption:\nA person trying to reach a Frisbee in a field with high brown grass.\nA young boy in a red top is playing with a red object tossed in the sky.\nA young man in a red jacket jumping for a Frizbee in a field.\nA guy is jumping to catch a frisbee in tall grass.\nA man jumps to catch a Frisbee flying through the air."}
+{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blue sky at [0.000, 0.000, 0.998, 0.317].\nObject 1 : boat at [0.000, 0.461, 0.354, 0.579].\nObject 2 : boat at [0.348, 0.501, 0.874, 0.789].\nObject 3 : boat at [0.302, 0.461, 0.684, 0.611].\nObject 4 : buildings at [0.692, 0.195, 0.718, 0.248].\nObject 5 : buildings at [0.888, 0.173, 0.922, 0.227].\nObject 6 : buildings at [0.582, 0.211, 0.610, 0.256].\nObject 7 : buildings at [0.180, 0.259, 0.202, 0.293].\nObject 8 : buildings at [0.466, 0.208, 0.518, 0.272].\nObject 9 : chain-link fence at [0.002, 0.176, 0.996, 0.995].\nObject 10 : cord at [0.412, 0.587, 0.626, 1.000].\nObject 11 : fence pole at [0.230, 0.227, 0.336, 1.000].\nObject 12 : grass at [0.000, 0.667, 0.756, 0.997].\nObject 13 : horizon at [0.000, 0.187, 1.000, 0.336].\nObject 14 : mast at [0.570, 0.000, 0.722, 0.571].\nObject 15 : rack at [0.754, 0.168, 1.000, 0.901].\nObject 16 : sail post at [0.586, 0.000, 0.628, 0.568].\nObject 17 : section at [0.272, 0.179, 0.994, 0.992].\nObject 18 : shelf at [0.762, 0.355, 1.000, 0.387].\nObject 19 : sky line at [0.012, 0.173, 0.994, 0.195].\nObject 20 : surfboard at [0.830, 0.448, 0.996, 0.552].\nObject 21 : surfboard at [0.420, 0.384, 0.502, 0.411].\nObject 22 : surfboard at [0.910, 0.768, 0.998, 0.877].\nObject 23 : surfboard at [0.430, 0.344, 0.508, 0.371].\nObject 24 : surfboard at [0.830, 0.565, 1.000, 0.712].\nObject 25 : surfboard at [0.322, 0.307, 0.450, 0.341].\nObject 26 : surfboard at [0.766, 0.251, 0.998, 0.368].\nObject 27 : surfboard at [0.764, 0.704, 0.998, 0.829].\nObject 28 : water at [0.000, 0.259, 1.000, 0.469].\nObject 29 : water way at [0.008, 0.272, 0.996, 0.432].\n\nRelationships:\nobject 25 : surfboard -> stacked on -> object 18 : shelf.\nobject 24 : surfboard -> stacked on -> object 18 : shelf.\nobject 20 : surfboard -> stacked on -> object 18 : shelf.\nobject 26 : surfboard -> stacked on -> object 18 : shelf.\nobject 15 : rack -> of -> object 20 : surfboard.\nobject 8 : buildings -> on -> object 13 : horizon.\nobject 6 : buildings -> on -> object 13 : horizon.\nobject 4 : buildings -> on -> object 13 : horizon.\nobject 7 : buildings -> on -> object 13 : horizon.\nobject 5 : buildings -> on -> object 13 : horizon.\nobject 14 : mast -> on -> object 2 : boat.\nobject 9 : chain-link fence -> near -> object 29 : water way.\nobject 17 : section -> of -> object 9 : chain-link fence.\n\nRegion Description:\nRegion Description at [0.020, 0.187, 0.972, 0.963] : boats and surfboards behind wire fencing.\nRegion Description at [0.000, 0.160, 0.990, 0.349] : trees and buildings on other side of water.\nRegion Description at [0.340, 0.493, 0.852, 0.613] : white covering pulled over top of boat.\nRegion Description at [0.010, 0.667, 0.516, 0.995] : green bushes beside the chain link fence.\nRegion Description at [0.018, 0.213, 0.992, 0.995] : Black chain link fence enclosing boats..\nRegion Description at [0.242, 0.211, 0.302, 0.989] : Black fence pole holding chain link fence..\nRegion Description at [0.374, 0.499, 0.804, 0.803] : Yellow and white boat with sail pole..\nRegion Description at [0.014, 0.181, 0.998, 0.296] : Skyline of gray buildings in the background..\nRegion Description at [0.000, 0.664, 0.994, 0.976] : Green shrubs growing along side of a lake..\nRegion Description at [0.774, 0.216, 0.996, 0.944] : Boat parts on an outdoor shelving unit..\nRegion Description at [0.006, 0.013, 0.150, 0.285] : Sail masks with no flag attached to them..\n\nGlobal 
Caption:\nBoats docked on land sitting side by side next to a lake.\nA small harbor with boats docked and on racks\nA collection of boats behind a fence by a body of water.\nBoats and surfboards docked at a harbor bay.\n\nMany boats as seen through a chain link fence."}
+{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm at [0.000, 0.125, 0.609, 0.988].\nObject 1 : bathroom tile at [0.009, 0.008, 0.994, 0.446].\nObject 2 : blue jeans at [0.369, 0.558, 0.722, 0.979].\nObject 3 : brush at [0.681, 0.208, 0.878, 0.500].\nObject 4 : brush holder at [0.716, 0.279, 0.891, 0.554].\nObject 5 : button at [0.519, 0.113, 0.584, 0.171].\nObject 6 : flusher at [0.534, 0.092, 0.628, 0.300].\nObject 7 : hand at [0.281, 0.125, 0.603, 0.562].\nObject 8 : holder at [0.713, 0.283, 0.903, 0.558].\nObject 9 : lid at [0.028, 0.046, 0.694, 0.446].\nObject 10 : man at [0.000, 0.133, 0.600, 0.992].\nObject 11 : seat at [0.138, 0.583, 0.722, 0.992].\nObject 12 : tank at [0.019, 0.021, 0.706, 0.579].\nObject 13 : tile at [0.794, 0.000, 1.000, 0.200].\nObject 14 : tile at [0.000, 0.000, 0.278, 0.129].\nObject 15 : toilet at [0.016, 0.042, 0.719, 0.996].\nObject 16 : toilet scrubber at [0.744, 0.192, 0.844, 0.521].\nObject 17 : toilet seat at [0.103, 0.517, 0.728, 0.996].\nObject 18 : wall at [0.659, 0.000, 0.978, 0.392].\nObject 19 : water at [0.369, 0.738, 0.500, 0.921].\n\nRelationships:\nobject 15 : toilet -> has -> object 11 : seat.\nobject 4 : brush holder -> by -> object 15 : toilet.\nobject 19 : water -> in -> object 15 : toilet.\nobject 6 : flusher -> on -> object 15 : toilet.\nobject 9 : lid -> on -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> has -> object 7 : hand.\nobject 0 : arm -> on -> object 15 : toilet.\nobject 14 : tile -> on -> object 18 : wall.\n\nRegion Description:\nRegion Description at [0.000, 0.046, 0.716, 0.987] : the arm reaching for the white toilet bowl.\nRegion Description at [0.716, 0.192, 0.894, 0.550] : the container and the toilet brush cleaner.\nRegion Description at [0.009, 0.042, 0.894, 0.992] : the toilet bowl next to the toilet bowl cleaner.\nRegion Description at [0.534, 0.087, 0.666, 0.329] : The hand is on the flusher in the image .\nRegion Description at [0.053, 0.158, 0.903, 0.875] : Porcelain toilet with flusher on top of the lid .\nRegion Description at [0.094, 0.154, 0.856, 0.942] : Man flushing the toilet in the bathroom .\n\nGlobal Caption:\nA hand is reaching out to the top if a toilet. \nA person flushing a toilet with a motion sensor.\nA person's hand flushing a toilet with a button on top of the tank. \na persons hand reaching for the top of a toilet\nA hand is reaching over a white toilet."}
+{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blueberry at [0.306, 0.312, 0.400, 0.429].\nObject 1 : butter at [0.454, 0.024, 0.638, 0.288].\nObject 2 : cake at [0.238, 0.093, 0.786, 0.787].\nObject 3 : cup at [0.002, 0.000, 0.202, 0.667].\nObject 4 : cup at [0.140, 0.008, 0.336, 0.456].\nObject 5 : egg at [0.636, 0.125, 0.880, 0.267].\nObject 6 : food at [0.632, 0.123, 0.996, 0.336].\nObject 7 : lemon at [0.514, 0.728, 0.798, 0.997].\nObject 8 : melon at [0.308, 0.768, 0.658, 0.997].\nObject 9 : orange at [0.514, 0.733, 0.794, 0.997].\nObject 10 : parsley at [0.372, 0.515, 0.762, 0.965].\nObject 11 : plate at [0.166, 0.453, 1.000, 1.000].\nObject 12 : plate at [0.628, 0.120, 0.998, 0.389].\nObject 13 : sausage at [0.766, 0.248, 0.984, 0.333].\nObject 14 : spot at [0.766, 0.600, 0.790, 0.637].\nObject 15 : table at [0.002, 0.365, 0.998, 0.997].\nObject 16 : water at [0.000, 0.000, 0.202, 0.667].\n\nRelationships:\nobject 7 : lemon -> on -> object 11 : plate.\nobject 10 : parsley -> on -> object 11 : plate.\nobject 6 : food -> on -> object 12 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 11 : plate -> has -> object 14 : spot.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 13 : sausage -> on -> object 12 : plate.\nobject 0 : blueberry -> on -> object 2 : cake.\nobject 5 : egg -> on -> object 12 : plate.\nobject 8 : melon -> on -> object 11 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 2 : cake -> on -> object 11 : plate.\nobject 16 : water -> in -> object 3 : cup.\nobject 13 : sausage -> on -> object 12 : plate.\n\nRegion Description:\nRegion Description at [0.678, 0.104, 0.942, 0.424] : There is food on the plate in the back.\nRegion Description at [0.456, 0.013, 0.636, 0.307] : White frosting on top of a piece of cake.\nRegion Description at [0.322, 0.752, 0.650, 0.997] : square of honey dew on a white plate.\n\nGlobal Caption:\nA bluebery cake is on a plate and is topped with butter.\nA piece of cake with butter on it sits next to an orange slice. \nA large piece of blueberry cake on a plate.\nA plate of food attractively arranged on a table.\nA plate of blueberry coffee cake with butter and an orange slice on a table with breakfast foods."}
+{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : napkins at [0.541, 0.818, 0.601, 0.858].\nObject 1 : pepper at [0.598, 0.836, 0.623, 0.860].\nObject 2 : post at [0.673, 0.494, 0.712, 0.926].\nObject 3 : restaurant sign at [0.548, 0.180, 0.779, 0.344].\nObject 4 : salt at [0.619, 0.838, 0.633, 0.850].\nObject 5 : shaker at [0.594, 0.822, 0.619, 0.854].\nObject 6 : shaker at [0.612, 0.824, 0.637, 0.854].\nObject 7 : table at [0.448, 0.834, 0.925, 0.998].\n\nRelationships:\nobject 4 : salt -> in -> object 6 : shaker.\nobject 0 : napkins -> on -> object 7 : table.\nobject 3 : restaurant sign -> on -> object 2 : post.\n\nRegion Description:\n\nGlobal Caption:\nThis is an empty table at a restaurant with ships in the background.\nThis table is covered by a blue Sam Adams umbrella\nAdvertising sign above a patio umbrella on sunny day.\nA lamp post stands next to an umbrella and table.\nAn umbrella is opened over an outdoor table."}
+{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm rest at [0.674, 0.486, 0.722, 0.560].\nObject 1 : bench at [0.000, 0.324, 0.731, 0.994].\nObject 2 : bricks at [0.075, 0.850, 0.180, 0.882].\nObject 3 : building at [0.090, 0.000, 0.686, 0.094].\nObject 4 : children at [0.470, 0.302, 0.539, 0.360].\nObject 5 : coat at [0.473, 0.322, 0.542, 0.364].\nObject 6 : daytime at [0.000, 0.002, 0.997, 1.000].\nObject 7 : fence at [0.719, 0.310, 0.997, 0.372].\nObject 8 : grass at [0.000, 0.364, 0.997, 0.720].\nObject 9 : jacket at [0.012, 0.424, 0.485, 0.690].\nObject 10 : jeans at [0.165, 0.748, 0.293, 0.844].\nObject 11 : leg at [0.168, 0.750, 0.308, 0.844].\nObject 12 : people at [0.386, 0.438, 0.449, 0.504].\nObject 13 : purse at [0.458, 0.488, 0.605, 0.694].\nObject 14 : shoe at [0.192, 0.836, 0.305, 0.890].\nObject 15 : strap at [0.677, 0.470, 0.814, 0.584].\nObject 16 : trees at [0.554, 0.000, 0.997, 0.376].\nObject 17 : woman at [0.009, 0.194, 0.497, 0.888].\n\nRelationships:\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 13 : purse -> has a -> object 15 : strap.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 3 : building -> behind -> object 16 : trees.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 9 : jacket -> on -> object 17 : woman.\nobject 12 : people -> near -> object 16 : trees.\nobject 17 : woman -> has a -> object 11 : leg.\nobject 1 : bench -> has an -> object 0 : arm rest.\nobject 15 : strap -> from -> object 13 : purse.\nobject 2 : bricks -> near -> object 1 : bench.\nobject 16 : trees -> in -> object 6 : daytime.\nobject 7 : fence -> under -> object 16 : trees.\nobject 12 : people -> in front of -> object 7 : fence.\nobject 13 : purse -> on -> object 1 : bench.\nobject 14 : shoe -> on -> object 2 : bricks.\n\nRegion Description:\nRegion Description at [0.096, 0.006, 0.662, 0.074] : Building with brown and white facade.\nRegion Description at [0.374, 0.298, 0.542, 0.360] : two people walking in front of woman.\n\nGlobal Caption:\nA woman sitting on top of a wooden bench near a park.\nA person sits on a wooden bench facing blooming trees.\nA woman sitting on a wooden bench viewing some beautiful trees.\nAdult sitting on wooden park bench in large open space.\nA woman sits on a bench watching the park."}
+{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : advertisement at [0.654, 0.400, 0.852, 0.555].\nObject 1 : area at [0.118, 0.379, 0.862, 0.787].\nObject 2 : back wheel at [0.432, 0.677, 0.490, 0.803].\nObject 3 : background at [0.004, 0.005, 0.998, 0.496].\nObject 4 : bottom at [0.488, 0.781, 0.526, 0.843].\nObject 5 : bus at [0.100, 0.376, 0.866, 0.805].\nObject 6 : door at [0.120, 0.456, 0.178, 0.696].\nObject 7 : front wheel at [0.172, 0.643, 0.204, 0.728].\nObject 8 : houses at [0.880, 0.344, 0.998, 0.483].\nObject 9 : light pole at [0.482, 0.005, 0.532, 0.840].\nObject 10 : list at [0.218, 0.395, 0.608, 0.461].\nObject 11 : message at [0.626, 0.184, 0.822, 0.328].\nObject 12 : name at [0.288, 0.629, 0.420, 0.731].\nObject 13 : person at [0.858, 0.560, 0.888, 0.715].\nObject 14 : pole at [0.680, 0.325, 0.708, 0.941].\nObject 15 : railing at [0.854, 0.589, 1.000, 0.704].\nObject 16 : sidewalk at [0.002, 0.688, 0.998, 1.000].\nObject 17 : sign at [0.578, 0.181, 0.826, 0.341].\nObject 18 : street at [0.000, 0.587, 0.998, 0.931].\nObject 19 : structure at [0.238, 0.293, 0.398, 0.424].\nObject 20 : symbol at [0.732, 0.427, 0.786, 0.469].\nObject 21 : tail lights at [0.812, 0.653, 0.860, 0.712].\nObject 22 : window at [0.342, 0.419, 0.424, 0.619].\nObject 23 : windows at [0.516, 0.392, 0.634, 0.627].\n\nRelationships:\nobject 10 : list -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 6 : door -> on -> object 5 : bus.\nobject 2 : back wheel -> of -> object 5 : bus.\nobject 7 : front wheel -> of -> object 5 : bus.\nobject 17 : sign -> on -> object 16 : sidewalk.\nobject 5 : bus -> on -> object 18 : street.\nobject 1 : area -> in -> object 3 : background.\nobject 8 : houses -> near -> object 5 : bus.\nobject 10 : list -> on -> object 5 : bus.\nobject 21 : tail lights -> on -> object 5 : bus.\nobject 5 : bus -> on -> object 18 : street.\nobject 4 : bottom -> of -> object 9 : light pole.\nobject 13 : person -> walking by -> object 5 : bus.\nobject 2 : back wheel -> on -> object 5 : bus.\nobject 17 : sign -> on -> object 18 : street.\nobject 22 : window -> of -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 8 : houses -> in -> object 3 : background.\nobject 14 : pole -> holding up -> object 17 : sign.\nobject 6 : door -> to -> object 5 : bus.\nobject 19 : structure -> in -> object 3 : background.\nobject 13 : person -> walking down -> object 16 : sidewalk.\nobject 15 : railing -> along -> object 16 : sidewalk.\nobject 17 : sign -> with -> object 11 : message.\nobject 6 : door -> of -> object 5 : bus.\nobject 14 : pole -> by -> object 18 : street.\nobject 14 : pole -> by -> object 5 : bus.\nobject 1 : area -> by -> object 16 : sidewalk.\nobject 13 : person -> walking across -> object 18 : street.\nobject 17 : sign -> attached to -> object 14 : pole.\nobject 11 : message -> on -> object 17 : sign.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 23 : windows -> on -> object 5 : bus.\nobject 6 : door -> on -> object 5 : bus.\nobject 5 : bus -> on -> object 18 : street.\n\nRegion Description:\nRegion Description at [0.576, 0.163, 0.838, 0.341] : street sign that reads All directions.\nRegion Description at [0.114, 0.323, 0.164, 0.448] : yellow and red structure in background.\nRegion Description at [0.580, 0.179, 0.838, 0.333] : a sign implying zero degrees equals 360 degrees.\n\nGlobal Caption:\na green and white bus is on the street\na public transit bus on a city street\nthe signs states all directions and points up\nAn empty city bus travels down a city street.\nA green and blue bus driving down a street."}
+{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : eye at [0.566, 0.526, 0.592, 0.565].\nObject 1 : grass at [0.004, 0.808, 0.118, 0.991].\nObject 2 : grass at [0.206, 0.853, 0.356, 0.982].\nObject 3 : leg at [0.232, 0.375, 0.312, 0.805].\nObject 4 : plant at [0.500, 0.736, 0.618, 0.796].\nObject 5 : sitck at [0.746, 0.042, 0.912, 0.339].\nObject 6 : zebra at [0.000, 0.000, 0.622, 0.790].\nObject 7 : zebra at [0.002, 0.000, 0.684, 0.682].\n\nRelationships:\nobject 7 : zebra -> eating -> object 4 : plant.\nobject 6 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> grazing in -> object 1 : grass.\nobject 6 : zebra -> grazing in -> object 1 : grass.\n\nRegion Description:\nRegion Description at [0.352, 0.093, 0.602, 0.393] : thin line of hair running down the neck.\n\nGlobal Caption:\nA pair of zebra's leaning over eating grass in a field.\nTwo zebra stand near bushes and tall grass.\nTwo zebras grazing from grass next to a tree.\nTwo zebra standing next to each other on a lush green field.\nTwo zebras are feeding on the grass by themselves."}
+{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bar at [0.444, 0.622, 0.640, 0.688].\nObject 1 : boots at [0.328, 0.339, 0.416, 0.492].\nObject 2 : bridal at [0.474, 0.246, 0.678, 0.432].\nObject 3 : food at [0.416, 0.646, 0.466, 0.715].\nObject 4 : foot at [0.324, 0.402, 0.380, 0.492].\nObject 5 : girl at [0.320, 0.078, 0.552, 0.502].\nObject 6 : grass at [0.012, 0.694, 0.998, 0.994].\nObject 7 : ground at [0.004, 0.679, 0.996, 0.913].\nObject 8 : helmet at [0.484, 0.096, 0.560, 0.162].\nObject 9 : hoof at [0.120, 0.853, 0.170, 0.925].\nObject 10 : horse at [0.024, 0.210, 0.690, 0.949].\nObject 11 : legs at [0.478, 0.453, 0.598, 0.637].\nObject 12 : legs at [0.130, 0.583, 0.278, 0.925].\nObject 13 : mane at [0.484, 0.186, 0.648, 0.279].\nObject 14 : person at [0.568, 0.568, 0.604, 0.640].\nObject 15 : poles at [0.460, 0.814, 0.538, 0.955].\nObject 16 : shirt at [0.580, 0.586, 0.594, 0.622].\nObject 17 : shirt at [0.388, 0.150, 0.508, 0.279].\nObject 18 : tail at [0.044, 0.357, 0.222, 0.784].\nObject 19 : tree at [0.720, 0.057, 0.874, 0.568].\nObject 20 : tree at [0.220, 0.000, 0.456, 0.586].\nObject 21 : trees at [0.730, 0.003, 0.986, 0.628].\nObject 22 : wall at [0.188, 0.276, 0.254, 0.393].\nObject 23 : water at [0.028, 0.468, 0.134, 0.574].\n\nRelationships:\nobject 5 : girl -> has -> object 1 : boots.\nobject 6 : grass -> under -> object 10 : horse.\nobject 21 : trees -> behind -> object 10 : horse.\nobject 10 : horse -> jumping -> object 15 : poles.\nobject 11 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 14 : person -> in -> object 16 : shirt.\nobject 10 : horse -> has -> object 9 : hoof.\n\nRegion Description:\n\nGlobal Caption:\nA young person ridding a horse jumps a gate in a competition.\nA man riding on a horse as it jumps over a pole. \nA woman is riding a horse as it jumps over a bar.\nthere is a woman jockey riding a hose over the hurdle\nA woman riding a horse jumps over an obstacle."}
+{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : ball at [0.640, 0.399, 0.648, 0.411].\nObject 1 : border at [0.040, 0.502, 1.000, 0.556].\nObject 2 : boundary lines at [0.030, 0.661, 1.000, 1.000].\nObject 3 : bushes at [0.020, 0.186, 0.104, 0.517].\nObject 4 : fence at [0.008, 0.366, 0.994, 0.565].\nObject 5 : fence at [0.024, 0.502, 0.996, 0.709].\nObject 6 : grass at [0.004, 0.529, 0.994, 0.997].\nObject 7 : man at [0.144, 0.360, 0.246, 0.736].\nObject 8 : man at [0.730, 0.474, 0.780, 0.613].\nObject 9 : pants at [0.732, 0.529, 0.778, 0.604].\nObject 10 : shirt at [0.164, 0.411, 0.222, 0.547].\nObject 11 : shorts at [0.162, 0.535, 0.220, 0.628].\nObject 12 : sign at [0.916, 0.405, 0.934, 0.438].\nObject 13 : sky at [0.006, 0.021, 0.990, 0.279].\nObject 14 : sneakers at [0.180, 0.709, 0.216, 0.739].\nObject 15 : sneakers at [0.762, 0.598, 0.776, 0.613].\nObject 16 : tennis at [0.012, 0.384, 0.984, 0.934].\nObject 17 : tennis court at [0.000, 0.372, 0.988, 0.979].\nObject 18 : tennis racket at [0.768, 0.526, 0.808, 0.556].\nObject 19 : tennis racket at [0.214, 0.574, 0.238, 0.619].\nObject 20 : trees at [0.586, 0.282, 0.692, 0.420].\nObject 21 : white at [0.734, 0.492, 0.778, 0.601].\n\nRelationships:\nobject 7 : man -> in -> object 10 : shirt.\nobject 7 : man -> with -> object 19 : tennis racket.\nobject 7 : man -> plays -> object 16 : tennis.\nobject 7 : man -> wears -> object 14 : sneakers.\nobject 8 : man -> wears -> object 15 : sneakers.\nobject 7 : man -> wears -> object 11 : shorts.\nobject 8 : man -> wears -> object 9 : pants.\nobject 5 : fence -> has -> object 1 : border.\nobject 20 : trees -> behind -> object 3 : bushes.\nobject 2 : boundary lines -> on -> object 17 : tennis court.\nobject 2 : boundary lines -> on -> object 6 : grass.\nobject 3 : bushes -> behind -> object 4 : fence.\nobject 20 : trees -> behind -> object 4 : fence.\nobject 7 : man -> has -> object 19 : tennis racket.\nobject 8 : man -> wears -> object 21 : white.\nobject 4 : fence -> around -> object 17 : tennis court.\nobject 20 : trees -> behind -> object 8 : man.\nobject 6 : grass -> on -> object 17 : tennis court.\nobject 8 : man -> has -> object 18 : tennis racket.\nobject 8 : man -> hitting -> object 0 : ball.\nobject 5 : fence -> on -> object 17 : tennis court.\n\nRegion Description:\nRegion Description at [0.024, 0.489, 0.998, 0.730] : The tennis net separating the sides of the players..\nRegion Description at [0.144, 0.652, 0.234, 0.745] : The black sneakers the player is wearing..\nRegion Description at [0.720, 0.577, 0.784, 0.613] : The white sneakers the player is wearing..\nRegion Description at [0.158, 0.544, 0.230, 0.628] : The gray shorts the player is wearing..\nRegion Description at [0.006, 0.402, 0.998, 0.574] : The trimmed bushes behind the player..\nRegion Description at [0.008, 0.168, 0.998, 0.402] : The trees behind the trimmed bushes behind the player..\nRegion Description at [0.006, 0.604, 0.998, 0.985] : The white boundary lines on the tennis court..\nRegion Description at [0.020, 0.447, 0.994, 0.760] : A black and white net stretches across the field.\nRegion Description at [0.060, 0.526, 0.984, 0.985] : The field has green grass with white lines.\nRegion Description at [0.016, 0.369, 0.978, 0.595] : A tall green shrub is behind the fence.\nRegion Description at [0.034, 0.150, 0.984, 0.393] : Trees are seen behind the fence and shrub.\nRegion Description at [0.588, 0.327, 0.850, 0.703] : The yellow ball is flying towards the man.\nRegion Description at [0.902, 0.378, 0.956, 0.529] : A black circular sign with the number five.\nRegion Description at [0.142, 0.354, 0.248, 0.736] : male in white t-shirt playing tennis.\nRegion Description at [0.200, 0.565, 0.244, 0.625] : Head of tennis racket of man playing.\nRegion Description at [0.726, 0.465, 0.786, 0.631] : Man in white preparing to hit tennis ball.\n\nGlobal Caption:\nTwo men playing a game of tennis on a court.\ntwo people playing tennis with rackets on a grass court\nTwo young men playing a game of tennis.\nPeople playing tennis on a court surrounded by green hedges.\ntHERE ARE TWO MEN PLAYING TENNIS ON THE TENNIS COURT"}
+{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : airline at [0.408, 0.420, 0.758, 0.502].\nObject 1 : airplane at [0.112, 0.300, 0.858, 0.640].\nObject 2 : engine at [0.652, 0.529, 0.730, 0.592].\nObject 3 : engine at [0.494, 0.502, 0.574, 0.577].\nObject 4 : fin at [0.208, 0.303, 0.320, 0.492].\nObject 5 : fin at [0.116, 0.480, 0.284, 0.526].\nObject 6 : front door at [0.752, 0.435, 0.772, 0.483].\nObject 7 : gear at [0.450, 0.592, 0.600, 0.643].\nObject 8 : letters at [0.694, 0.489, 0.732, 0.520].\nObject 9 : name at [0.398, 0.426, 0.760, 0.489].\nObject 10 : sky at [0.000, 0.000, 0.998, 1.000].\nObject 11 : window at [0.806, 0.438, 0.844, 0.456].\nObject 12 : windows at [0.326, 0.450, 0.750, 0.532].\nObject 13 : wing at [0.152, 0.426, 0.598, 0.538].\nObject 14 : wing at [0.116, 0.492, 0.282, 0.538].\n\nRelationships:\nobject 6 : front door -> of -> object 1 : airplane.\n\nRegion Description:\n\nGlobal Caption:\nAn airplane flying in the air during the day.\nA large aircraft is shown in the air.\nThe large jumbo jet has it's landing gear lowered.\nA large white airplane flies in the gray sky.\nAn airplane in route with a cloudy sky behind it."}
+{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : awning at [0.886, 0.000, 1.000, 0.240].\nObject 1 : awning at [0.000, 0.299, 0.132, 0.403].\nObject 2 : bench at [0.000, 0.592, 0.066, 0.683].\nObject 3 : building at [0.000, 0.299, 0.204, 0.659].\nObject 4 : canopy at [0.000, 0.301, 0.136, 0.400].\nObject 5 : car at [0.290, 0.400, 0.998, 0.784].\nObject 6 : clouds at [0.374, 0.067, 0.920, 0.312].\nObject 7 : door opening at [0.658, 0.501, 0.682, 0.680].\nObject 8 : door opening at [0.678, 0.509, 0.710, 0.675].\nObject 9 : exterior at [0.000, 0.400, 0.200, 0.669].\nObject 10 : front at [0.294, 0.400, 0.494, 0.739].\nObject 11 : gravel at [0.090, 0.837, 0.334, 0.997].\nObject 12 : headlights at [0.416, 0.624, 0.446, 0.656].\nObject 13 : headlights at [0.300, 0.624, 0.324, 0.651].\nObject 14 : markings at [0.606, 0.821, 0.770, 0.928].\nObject 15 : panel at [0.304, 0.421, 0.450, 0.677].\nObject 16 : pole at [0.030, 0.419, 0.062, 0.656].\nObject 17 : railway tracks at [0.000, 0.752, 0.520, 0.944].\nObject 18 : side walk at [0.192, 0.712, 1.000, 0.997].\nObject 19 : sky at [0.000, 0.000, 0.998, 0.560].\nObject 20 : train stop at [0.000, 0.000, 1.000, 1.000].\nObject 21 : trees at [0.208, 0.253, 0.322, 0.653].\nObject 22 : trim at [0.000, 0.333, 0.132, 0.403].\nObject 23 : wall at [0.000, 0.392, 0.206, 0.611].\nObject 24 : wheel at [0.844, 0.669, 0.884, 0.728].\nObject 25 : wheel at [0.792, 0.675, 0.840, 0.747].\nObject 26 : wheel at [0.516, 0.691, 0.620, 0.808].\nObject 27 : window at [0.316, 0.451, 0.458, 0.595].\nObject 28 : windows at [0.700, 0.547, 0.848, 0.632].\nObject 29 : windsheild wipers at [0.348, 0.499, 0.410, 0.584].\n\nRelationships:\nobject 6 : clouds -> in -> object 19 : sky.\nobject 2 : bench -> in -> object 4 : canopy.\nobject 22 : trim -> on -> object 1 : awning.\nobject 11 : gravel -> next to -> object 17 : railway tracks.\nobject 14 : markings -> on side of -> object 18 : side walk.\nobject 5 : car -> on -> object 17 : railway tracks.\n\nRegion Description:\nRegion Description at [0.288, 0.392, 0.510, 0.741] : the front of the train is yellow and white.\nRegion Description at [0.320, 0.451, 0.460, 0.592] : the front window of the train has windshield wipers.\nRegion Description at [0.292, 0.592, 0.456, 0.739] : the headlights are on front of the train.\nRegion Description at [0.010, 0.405, 0.220, 0.736] : a red brick wall is near the platform.\nRegion Description at [0.000, 0.288, 0.128, 0.707] : an aluminum canopy is on the platform.\nRegion Description at [0.016, 0.325, 0.100, 0.672] : a red steel pole is holding up the awning.\nRegion Description at [0.306, 0.395, 0.998, 0.733] : the train has windowed passenger cars.\nRegion Description at [0.300, 0.427, 0.492, 0.693] : the yellow and white front of a train.\nRegion Description at [0.510, 0.744, 0.834, 0.891] : white painted line beside a train track.\nRegion Description at [0.298, 0.408, 0.468, 0.661] : a yellow panel on the front of the train.\nRegion Description at [0.002, 0.397, 0.210, 0.675] : a red brick building on the side of the tracks.\nRegion Description at [0.844, 0.000, 0.998, 0.248] : an awning of a structure next to the train tracks.\nRegion Description at [0.294, 0.360, 0.516, 0.787] : front of a train car in yellow, white and blue.\nRegion Description at [0.194, 0.221, 0.286, 0.901] : trees on the side of a train station.\nRegion Description at [0.580, 0.821, 0.764, 0.931] : markings on the side of railway tracks.\nRegion Description at [0.632, 0.491, 0.726, 0.691] : white, blue and grey doors on the side of a train car.\nRegion Description at [0.500, 0.096, 0.916, 0.531] : skyline on the side of a train station.\n\nGlobal Caption:\nFast commuter train moving past an outdoor platform.\nA train on the track pulling by a train station.\nA train pulling into a station outside during the day.\nA passenger train moving through a rail yard\na long passenger train pulling up to a station"}
+{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : coat at [0.405, 0.332, 0.835, 0.746].\nObject 1 : decorative square at [0.000, 0.382, 1.000, 1.000].\nObject 2 : hat at [0.006, 0.162, 0.072, 0.198].\nObject 3 : jacket at [0.078, 0.222, 0.318, 0.430].\nObject 4 : jeans at [0.853, 0.422, 1.000, 0.632].\nObject 5 : leg at [0.853, 0.456, 0.928, 0.610].\nObject 6 : leg at [0.210, 0.458, 0.303, 0.638].\nObject 7 : leg at [0.000, 0.458, 0.060, 0.630].\nObject 8 : man at [0.066, 0.162, 0.318, 0.686].\nObject 9 : man at [0.850, 0.156, 1.000, 0.652].\nObject 10 : man at [0.390, 0.344, 0.838, 0.894].\nObject 11 : pants at [0.523, 0.736, 0.739, 0.858].\nObject 12 : person at [0.000, 0.162, 0.135, 0.668].\nObject 13 : person at [0.853, 0.154, 1.000, 0.650].\nObject 14 : section at [0.000, 0.134, 1.000, 1.000].\nObject 15 : sidewalk at [0.000, 0.388, 1.000, 1.000].\nObject 16 : umbrella at [0.168, 0.106, 0.910, 0.366].\nObject 17 : uniform at [0.000, 0.222, 0.126, 0.646].\nObject 18 : uniform at [0.105, 0.218, 0.318, 0.628].\n\nRelationships:\nobject 10 : man -> wearing -> object 11 : pants.\nobject 10 : man -> wearing -> object 0 : coat.\nobject 9 : man -> wearing -> object 4 : jeans.\nobject 8 : man -> wearing -> object 2 : hat.\nobject 8 : man -> wearing -> object 3 : jacket.\nobject 16 : umbrella -> has -> object 14 : section.\nobject 5 : leg -> of -> object 13 : person.\nobject 7 : leg -> of -> object 12 : person.\nobject 12 : person -> in -> object 17 : uniform.\n\nRegion Description:\nRegion Description at [0.066, 0.164, 0.318, 0.686] : the back of a man in a black uniform.\nRegion Description at [0.393, 0.324, 0.871, 0.766] : THIS MAN IS WEARING A LONG BLACK COAT.\nRegion Description at [0.468, 0.142, 0.634, 0.356] : THIS IS A RED SECTION ON THE UMBRELLA.\nRegion Description at [0.168, 0.140, 0.523, 0.292] : THIS IS A YELLOW SECTION ON THE UMBRELLA.\nRegion Description at [0.568, 0.138, 0.919, 0.232] : THIS IS A GREEN SECTION OF THE UMBRELLA.\n\nGlobal Caption:\nSeveral people walking on a sidewalk, with one man holding an umbrella.\nA person walking while carrying a rainbow umbrella\nA person is holding up a large colorful umbrella\na person walking down the street carrying a rainbow colored umbrella\nA person walking in a square carrying a rainbow colored umbrella."}
+{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : back tire at [0.574, 0.496, 0.860, 0.800].\nObject 1 : bike at [0.146, 0.109, 0.938, 0.803].\nObject 2 : bike indicators at [0.238, 0.363, 0.264, 0.389].\nObject 3 : car at [0.000, 0.077, 0.086, 0.157].\nObject 4 : display at [0.240, 0.275, 0.290, 0.328].\nObject 5 : exhaust pipe at [0.460, 0.661, 0.818, 0.773].\nObject 6 : front tire at [0.146, 0.419, 0.366, 0.637].\nObject 7 : front wheel at [0.150, 0.424, 0.366, 0.635].\nObject 8 : garage door at [0.000, 0.000, 0.214, 0.341].\nObject 9 : handle at [0.284, 0.109, 0.390, 0.384].\nObject 10 : honda logo at [0.322, 0.395, 0.378, 0.419].\nObject 11 : house at [0.420, 0.000, 0.736, 0.149].\nObject 12 : leather seat at [0.496, 0.355, 0.792, 0.517].\nObject 13 : light at [0.894, 0.411, 0.944, 0.520].\nObject 14 : orange light at [0.280, 0.419, 0.296, 0.467].\nObject 15 : shock at [0.258, 0.477, 0.296, 0.568].\nObject 16 : shock absorber at [0.626, 0.501, 0.698, 0.680].\nObject 17 : shrubs at [0.628, 0.021, 0.764, 0.200].\nObject 18 : small windshield at [0.210, 0.120, 0.256, 0.291].\nObject 19 : sylencer at [0.462, 0.645, 0.816, 0.779].\nObject 20 : trees at [0.256, 0.003, 0.444, 0.205].\n\nRelationships:\nobject 1 : bike -> has -> object 7 : front wheel.\nobject 1 : bike -> has -> object 0 : back tire.\nobject 1 : bike -> has -> object 19 : sylencer.\nobject 1 : bike -> has -> object 16 : shock absorber.\nobject 1 : bike -> has -> object 13 : light.\nobject 9 : handle -> on -> object 1 : bike.\nobject 4 : display -> on -> object 1 : bike.\n\nRegion Description:\n\nGlobal Caption:\nA black Honda motorcycle parked in front of a garage.\nA Honda motorcycle parked in a grass driveway\nA black Honda motorcycle with a dark burgundy seat.\nMa motorcycle parked on the gravel in front of a garage\nA motorcycle with its brake extended standing outside"}
+{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : background at [0.000, 0.000, 1.002, 0.997].\nObject 1 : bench at [0.604, 0.967, 0.672, 0.997].\nObject 2 : cow at [0.548, 0.860, 0.574, 0.896].\nObject 3 : cow at [0.436, 0.860, 0.454, 0.890].\nObject 4 : fence at [0.698, 0.949, 0.852, 0.997].\nObject 5 : moutain at [0.000, 0.057, 0.992, 0.782].\nObject 6 : pasture at [0.000, 0.815, 0.984, 1.000].\nObject 7 : peak at [0.744, 0.042, 0.898, 0.119].\nObject 8 : sky at [0.000, 0.000, 1.002, 0.257].\nObject 9 : snow at [0.210, 0.036, 0.962, 0.445].\nObject 10 : trailer at [0.796, 0.910, 0.894, 0.997].\nObject 11 : trailer at [0.632, 0.899, 0.742, 0.994].\nObject 12 : tree at [0.740, 0.409, 1.000, 0.982].\nObject 13 : tree at [0.638, 0.284, 0.652, 0.301].\n\nRelationships:\nobject 11 : trailer -> in -> object 6 : pasture.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 6 : pasture -> near -> object 5 : moutain.\nobject 3 : cow -> in -> object 6 : pasture.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 9 : snow -> on -> object 5 : moutain.\nobject 5 : moutain -> covered in -> object 9 : snow.\nobject 5 : moutain -> has -> object 7 : peak.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 5 : moutain -> in -> object 0 : background.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 11 : trailer -> near -> object 12 : tree.\nobject 5 : moutain -> has -> object 13 : tree.\nobject 7 : peak -> covered with -> object 9 : snow.\n\nRegion Description:\nRegion Description at [0.784, 0.901, 0.934, 0.991] : storage container for animal equipment.\nRegion Description at [0.828, 0.060, 0.880, 0.125] : The mountain is partially covered in snow..\nRegion Description at [0.840, 0.899, 0.920, 0.997] : horse trailer or cow trailer is silvertone, rectangular.\nRegion Description at [0.606, 0.919, 0.640, 0.982] : smaller trailer, white w/ brown+orange stripe.\nRegion Description at [0.060, 0.472, 0.540, 0.806] : a bare patch of earth amid lush green growth.\nRegion Description at [0.034, 0.839, 0.812, 0.973] : tiny cattle-containing fenceposts in the distance.\nRegion Description at [0.902, 0.827, 0.990, 0.997] : a split tree trunk in shadow, beneath leaves, shadow on ground.\nRegion Description at [0.734, 0.919, 0.802, 0.994] : an older station wagon/suv-type van thing.\nRegion Description at [0.090, 0.854, 0.124, 0.904] : a black & white animal stands alone, away from brown brethren, in the far distance.\n\nGlobal Caption:\nCows lounge in a field with a mountain backdrop.\nA VERY BIG MOUNTAIN AND ANIMALS SPREAD ACROSS A FARM.\nSeveral herd animals are on the grass by a mountain.\nCattle on a level pasture in a mountainous area.\nA bunch of cattle relax in a pasture located in the mountains"}
+{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : animal at [0.358, 0.509, 0.434, 0.664].\nObject 1 : area at [0.464, 0.531, 0.934, 0.848].\nObject 2 : branches at [0.266, 0.107, 0.470, 0.272].\nObject 3 : bushes at [0.598, 0.424, 0.622, 0.445].\nObject 4 : grass at [0.544, 0.659, 0.840, 0.859].\nObject 5 : hill at [0.574, 0.323, 0.624, 0.376].\nObject 6 : leaves at [0.808, 0.293, 0.918, 0.360].\nObject 7 : license plate at [0.000, 0.691, 0.064, 0.747].\nObject 8 : light at [0.170, 0.557, 0.186, 0.632].\nObject 9 : park at [0.250, 0.192, 0.818, 0.664].\nObject 10 : road at [0.180, 0.709, 0.432, 0.957].\nObject 11 : sky at [0.448, 0.053, 0.532, 0.187].\nObject 12 : tire at [0.070, 0.728, 0.130, 0.795].\nObject 13 : tree at [0.000, 0.000, 0.478, 0.600].\nObject 14 : trees at [0.128, 0.000, 0.592, 0.597].\nObject 15 : truck at [0.000, 0.416, 0.210, 0.805].\nObject 16 : zebras at [0.730, 0.496, 0.796, 0.581].\n\nRelationships:\nobject 7 : license plate -> on -> object 15 : truck.\nobject 12 : tire -> on -> object 15 : truck.\nobject 5 : hill -> in -> object 9 : park.\nobject 0 : animal -> in -> object 1 : area.\nobject 13 : tree -> has -> object 6 : leaves.\nobject 0 : animal -> on -> object 1 : area.\nobject 15 : truck -> on -> object 10 : road.\nobject 10 : road -> with -> object 15 : truck.\nobject 3 : bushes -> on -> object 1 : area.\nobject 16 : zebras -> in -> object 1 : area.\nobject 2 : branches -> on -> object 13 : tree.\n\nRegion Description:\nRegion Description at [0.338, 0.480, 0.438, 0.680] : zebra watching in opposite direction.\n\nGlobal Caption:\nZebras are grazing on grass by a car.\nZebras are standing in a fenced in area.\nA herd of zebras stand under tress near a road. \nSeveral zebras are on the grass by a truck. \nA bunch of zebras grazing near a road where vehicles are driving by."}
+{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : beach at [0.000, 0.720, 0.998, 1.000].\nObject 1 : boy at [0.792, 0.480, 0.938, 0.853].\nObject 2 : child at [0.322, 0.587, 0.376, 0.835].\nObject 3 : child at [0.320, 0.587, 0.374, 0.835].\nObject 4 : girl at [0.444, 0.539, 0.534, 0.856].\nObject 5 : man at [0.140, 0.443, 0.216, 0.845].\nObject 6 : man at [0.434, 0.459, 0.500, 0.760].\nObject 7 : man at [0.578, 0.459, 0.682, 0.845].\nObject 8 : ocean waters at [0.590, 0.419, 0.892, 0.629].\nObject 9 : people at [0.206, 0.456, 0.352, 0.851].\nObject 10 : person at [0.792, 0.480, 0.936, 0.851].\nObject 11 : shirt at [0.592, 0.496, 0.670, 0.629].\nObject 12 : shore at [0.000, 0.360, 0.998, 0.997].\nObject 13 : surfboard at [0.306, 0.709, 0.538, 0.853].\nObject 14 : surfboard at [0.790, 0.587, 0.960, 0.691].\nObject 15 : water at [0.384, 0.368, 0.544, 0.435].\nObject 16 : waves at [0.656, 0.709, 0.794, 0.779].\nObject 17 : wetsuit at [0.326, 0.629, 0.372, 0.773].\nObject 18 : woman at [0.208, 0.499, 0.304, 0.629].\n\nRelationships:\nobject 1 : boy -> holding -> object 14 : surfboard.\nobject 5 : man -> and -> object 18 : woman.\nobject 18 : woman -> and -> object 3 : child.\nobject 16 : waves -> coming to -> object 12 : shore.\nobject 7 : man -> looking down to -> object 15 : water.\nobject 2 : child -> with -> object 17 : wetsuit.\nobject 6 : man -> looking back to -> object 4 : girl.\nobject 4 : girl -> pulling -> object 13 : surfboard.\nobject 9 : people -> on -> object 0 : beach.\nobject 7 : man -> wearing -> object 11 : shirt.\n\nRegion Description:\nRegion Description at [0.096, 0.437, 0.970, 0.872] : Seven people headed to the water to surf..\nRegion Description at [0.390, 0.531, 0.540, 0.851] : Girl in yellow shirt and pony tail. .\nRegion Description at [0.312, 0.581, 0.374, 0.851] : Small child with red and black wetsuit..\nRegion Description at [0.578, 0.443, 0.688, 0.856] : Man with white shirt and grey wetsuit pants..\nRegion Description at [0.436, 0.440, 0.534, 0.872] : Man looking back to girl pulling surfboard..\nRegion Description at [0.444, 0.459, 0.552, 0.853] : A man and a little girl having a conversation.\nRegion Description at [0.104, 0.419, 0.314, 0.851] : A man and a woman walking toward the water.\n\nGlobal Caption:\nA group of people are taking surfing lessons.\nA group of men, women and children walking toward the water with surfboards.\nA mixed age group is going toward the ocean with surfboards.\nA group of surfers are carrying their surf boards into the ocean.\nSeveral people are getting ready to enter the water for surfing."}
+{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : body at [0.166, 0.539, 0.296, 0.997].\nObject 1 : boot at [0.594, 0.753, 0.620, 0.870].\nObject 2 : boot at [0.620, 0.744, 0.658, 0.858].\nObject 3 : bucket at [0.268, 0.744, 0.322, 0.828].\nObject 4 : clouds at [0.156, 0.000, 0.968, 0.328].\nObject 5 : ear at [0.590, 0.226, 0.638, 0.410].\nObject 6 : ear at [0.368, 0.208, 0.448, 0.434].\nObject 7 : elephant at [0.328, 0.157, 0.638, 0.967].\nObject 8 : eye at [0.476, 0.319, 0.504, 0.346].\nObject 9 : foot at [0.436, 0.901, 0.516, 0.958].\nObject 10 : grass at [0.950, 0.759, 0.996, 0.807].\nObject 11 : leg at [0.498, 0.572, 0.548, 0.898].\nObject 12 : leg at [0.408, 0.512, 0.516, 0.955].\nObject 13 : man at [0.582, 0.476, 0.662, 0.870].\nObject 14 : man at [0.164, 0.455, 0.292, 0.997].\nObject 15 : mountains at [0.000, 0.265, 0.376, 0.470].\nObject 16 : rock at [0.736, 0.895, 0.762, 0.934].\nObject 17 : sand at [0.240, 0.687, 0.998, 1.000].\nObject 18 : shirt at [0.582, 0.521, 0.650, 0.681].\nObject 19 : shorts at [0.174, 0.699, 0.254, 0.864].\nObject 20 : side at [0.236, 0.675, 0.994, 0.997].\nObject 21 : skirt at [0.298, 0.687, 0.360, 0.810].\nObject 22 : sky at [0.004, 0.000, 0.998, 0.355].\nObject 23 : top at [0.302, 0.539, 0.358, 0.696].\nObject 24 : tree at [0.012, 0.407, 0.076, 0.500].\nObject 25 : trunk at [0.506, 0.392, 0.600, 0.964].\nObject 26 : watch at [0.172, 0.711, 0.192, 0.732].\nObject 27 : water at [0.000, 0.488, 0.994, 1.000].\nObject 28 : woman at [0.288, 0.473, 0.420, 0.967].\n\nRelationships:\nobject 7 : elephant -> on -> object 20 : side.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 14 : man -> standing on -> object 20 : side.\nobject 14 : man -> standing beside -> object 7 : elephant.\nobject 10 : grass -> on -> object 20 : side.\nobject 28 : woman -> wearing -> object 23 : top.\nobject 13 : man -> wearing -> object 18 : shirt.\nobject 13 : man -> wearing -> object 1 : boot.\nobject 13 : man -> wearing -> object 2 : boot.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 7 : elephant -> has -> object 25 : trunk.\nobject 14 : man -> wearing -> object 19 : shorts.\nobject 28 : woman -> petting -> object 7 : elephant.\nobject 14 : man -> with -> object 7 : elephant.\nobject 28 : woman -> with -> object 7 : elephant.\nobject 13 : man -> with -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 9 : foot -> of an -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 11 : leg -> of -> object 7 : elephant.\nobject 12 : leg -> of -> object 7 : elephant.\nobject 5 : ear -> of -> object 7 : elephant.\nobject 6 : ear -> of -> object 7 : elephant.\nobject 8 : eye -> of -> object 7 : elephant.\nobject 27 : water -> behind -> object 7 : elephant.\n\nRegion Description:\nRegion Description at [0.338, 0.139, 0.618, 0.967] : the elephant standing on the lake side.\nRegion Description at [0.154, 0.392, 0.300, 0.964] : a man standing on the lake side with shorts.\nRegion Description at [0.574, 0.422, 0.686, 0.910] : the man standing beside the elephant.\nRegion Description at [0.292, 0.485, 0.378, 0.705] : this lady is wearing a blue tank top.\nRegion Description at [0.722, 0.768, 0.988, 0.964] : the sand is brown with green grass growing in it.\nRegion Description at [0.156, 0.669, 0.270, 0.910] : the man is wearing grey black and white shorts.\nRegion Description at [0.504, 0.560, 0.568, 0.898] : The front right leg of the elephant..\nRegion Description at [0.310, 0.536, 0.358, 0.690] : The light blue tank top the girl is wearing..\nRegion Description at [0.262, 0.732, 0.326, 0.825] : The black bucket in the girl's hand..\nRegion Description at [0.002, 0.443, 0.992, 0.994] : The water behind the people and the elephant..\n\nGlobal Caption:\nA group of people are standing next to an elephant emerging from the water.\na group of people stand beside of a giant elephant \nThree tourists pose for a picture next to an elephant.\nThree people stand with an elephant in front of a stream.\nThree people standing next to an elephant along a river."}
+{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : boat at [0.048, 0.552, 0.928, 0.819].\nObject 1 : building at [0.328, 0.493, 0.538, 0.613].\nObject 2 : building at [0.000, 0.467, 0.338, 0.651].\nObject 3 : building at [0.534, 0.096, 0.998, 0.637].\nObject 4 : canopies at [0.452, 0.504, 0.620, 0.600].\nObject 5 : container at [0.858, 0.643, 0.948, 0.712].\nObject 6 : dolphin at [0.282, 0.691, 0.344, 0.773].\nObject 7 : flag at [0.322, 0.563, 0.340, 0.597].\nObject 8 : ground at [0.822, 0.696, 0.880, 0.715].\nObject 9 : leaves at [0.002, 0.483, 0.080, 0.659].\nObject 10 : level at [0.000, 0.709, 1.000, 0.829].\nObject 11 : level at [0.068, 0.616, 0.852, 0.688].\nObject 12 : outdoor seating at [0.502, 0.579, 0.532, 0.624].\nObject 13 : pink writing at [0.414, 0.693, 0.654, 0.725].\nObject 14 : pole at [0.282, 0.416, 0.292, 0.515].\nObject 15 : railing at [0.094, 0.557, 0.728, 0.624].\nObject 16 : railing at [0.238, 0.597, 0.744, 0.627].\nObject 17 : reflection at [0.174, 0.808, 0.922, 0.848].\nObject 18 : roof at [0.000, 0.469, 0.280, 0.523].\nObject 19 : roof at [0.348, 0.509, 0.482, 0.568].\nObject 20 : roof at [0.920, 0.264, 0.980, 0.344].\nObject 21 : row at [0.700, 0.499, 0.878, 0.573].\nObject 22 : sea wall at [0.878, 0.712, 0.998, 0.819].\nObject 23 : shore at [0.000, 0.627, 0.996, 0.816].\nObject 24 : sky at [0.006, 0.000, 1.000, 0.517].\nObject 25 : steeple at [0.918, 0.088, 0.936, 0.237].\nObject 26 : symbol at [0.268, 0.688, 0.350, 0.779].\nObject 27 : symbol at [0.702, 0.693, 0.752, 0.725].\nObject 28 : tree at [0.472, 0.491, 0.592, 0.597].\nObject 29 : trees at [0.948, 0.573, 1.000, 0.691].\nObject 30 : trees at [0.000, 0.488, 0.080, 0.675].\nObject 31 : vehicle at [0.968, 0.653, 0.998, 0.693].\nObject 32 : water at [0.004, 0.813, 0.998, 0.992].\nObject 33 : water at [0.008, 0.717, 0.998, 0.981].\nObject 34 : window at [0.374, 0.733, 0.790, 0.765].\nObject 35 : window at [0.800, 0.491, 0.868, 0.576].\nObject 36 : window at [0.928, 0.512, 0.950, 0.576].\nObject 37 : window at [0.892, 0.395, 0.912, 0.443].\nObject 38 : window at [0.894, 0.517, 0.910, 0.571].\nObject 39 : window at [0.630, 0.493, 0.652, 0.565].\nObject 40 : windows at [0.384, 0.637, 0.724, 0.685].\n\nRelationships:\nobject 40 : windows -> on -> object 0 : boat.\nobject 17 : reflection -> in -> object 33 : water.\nobject 29 : trees -> growing on -> object 23 : shore.\nobject 30 : trees -> growing on -> object 23 : shore.\nobject 28 : tree -> growing on -> object 23 : shore.\nobject 18 : roof -> on -> object 2 : building.\nobject 5 : container -> on -> object 22 : sea wall.\nobject 0 : boat -> in -> object 32 : water.\nobject 0 : boat -> has -> object 15 : railing.\n\nRegion Description:\nRegion Description at [0.414, 0.691, 0.662, 0.725] : the are red letters on the side of the cruise ship.\nRegion Description at [0.370, 0.707, 0.780, 0.763] : there is a long set of black windows on the side of the cruise ship.\nRegion Description at [0.870, 0.243, 0.992, 0.357] : there is a red roof on this building.\nRegion Description at [0.538, 0.400, 0.712, 0.549] : there is red and gray building in the background.\nRegion Description at [0.054, 0.595, 0.312, 0.821] : there is two levels on this cruise ship.\nRegion Description at [0.370, 0.587, 0.664, 0.621] : there is a silver railing on the top level of the cruise ship.\nRegion Description at [0.858, 0.621, 0.952, 0.717] : there is a blue container on the dock.\nRegion Description at [0.876, 0.707, 0.996, 0.787] : there is a gray sea wall beside the ship.\nRegion Description at [0.268, 0.723, 0.346, 0.787] : there are blue water symbols on the side of the cruise ship.\nRegion Description at [0.000, 0.619, 0.024, 0.712] : there is a blue and white sign on the dock.\nRegion Description at [0.662, 0.533, 0.904, 0.603] : An outdoor canopy creates shade for customers. .\n\nGlobal Caption:\nA boat sits on the side of the dock.\nA large white boat in the open water.\nA white double decker boat n water next to buildings.\nA large cruise ship is traveling on the ocean. \nA Port River Dolphin Cruise ship sits in the water."}
+{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : boutonniere at [0.710, 0.574, 0.799, 0.660].\nObject 1 : cake at [0.630, 0.670, 0.772, 0.750].\nObject 2 : cake crumb at [0.710, 0.348, 0.721, 0.356].\nObject 3 : crown at [0.370, 0.006, 0.549, 0.056].\nObject 4 : dress at [0.000, 0.574, 0.582, 1.000].\nObject 5 : eye at [0.649, 0.244, 0.699, 0.272].\nObject 6 : eye at [0.735, 0.264, 0.769, 0.280].\nObject 7 : eyebrow at [0.655, 0.230, 0.710, 0.250].\nObject 8 : eyebrow at [0.741, 0.252, 0.780, 0.264].\nObject 9 : finger at [0.721, 0.772, 0.816, 0.800].\nObject 10 : finger at [0.535, 0.740, 0.685, 0.826].\nObject 11 : ground at [0.003, 0.888, 0.997, 1.000].\nObject 12 : hair at [0.507, 0.142, 0.791, 0.642].\nObject 13 : hair at [0.189, 0.044, 0.652, 0.374].\nObject 14 : hand at [0.721, 0.720, 0.822, 0.818].\nObject 15 : hand at [0.493, 0.710, 0.685, 0.826].\nObject 16 : head at [0.209, 0.048, 0.652, 0.360].\nObject 17 : mouth at [0.646, 0.310, 0.724, 0.352].\nObject 18 : neck at [0.560, 0.344, 0.663, 0.460].\nObject 19 : necklace at [0.357, 0.334, 0.471, 0.484].\nObject 20 : necktie at [0.571, 0.442, 0.674, 0.936].\nObject 21 : paper at [0.760, 0.792, 0.914, 0.934].\nObject 22 : person at [0.490, 0.136, 0.825, 0.998].\nObject 23 : plate at [0.579, 0.734, 0.816, 0.768].\nObject 24 : purse at [0.774, 0.792, 0.883, 0.840].\nObject 25 : ring at [0.786, 0.780, 0.794, 0.796].\nObject 26 : shirt at [0.554, 0.376, 0.691, 0.950].\nObject 27 : suit jacket at [0.490, 0.422, 0.799, 0.998].\nObject 28 : table at [0.696, 0.816, 0.997, 0.916].\nObject 29 : toilet at [0.000, 0.656, 0.997, 0.936].\nObject 30 : wallpaper at [0.003, 0.000, 0.916, 0.656].\n\nRelationships:\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> sitting by -> object 29 : toilet.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 2 : cake crumb -> on side of -> object 17 : mouth.\nobject 24 : purse -> on top of -> object 28 : table.\nobject 5 : eye -> of a -> object 22 : person.\nobject 6 : eye -> of a -> object 22 : person.\nobject 7 : eyebrow -> of -> object 22 : person.\nobject 8 : eyebrow -> of -> object 22 : person.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 3 : crown -> on top of -> object 16 : head.\nobject 20 : necktie -> worn on -> object 22 : person.\nobject 22 : person -> holding -> object 1 : cake.\nobject 14 : hand -> holding -> object 1 : cake.\nobject 22 : person -> wearing -> object 27 : suit jacket.\nobject 22 : person -> wearing -> object 4 : dress.\nobject 20 : necktie -> worn on -> object 18 : neck.\nobject 13 : hair -> on top of -> object 16 : head.\nobject 1 : cake -> on top of -> object 23 : plate.\nobject 25 : ring -> worn on -> object 9 : finger.\n\nRegion Description:\nRegion Description at [0.022, 0.020, 0.203, 0.312] : A green and yellow striped wallpaper.\nRegion Description at [0.000, 0.048, 0.613, 0.996] : woman wearing a strapless white wedding dress .\nRegion Description at [0.487, 0.136, 0.808, 0.986] : woman white red hair holding a piece of cake on a plate.\nRegion Description at [0.543, 0.674, 0.813, 0.826] : woman's hands holding a plate of cake.\nRegion Description at [0.579, 0.124, 0.788, 0.524] : red haired woman wearing a tie and suit jacket .\nRegion Description at [0.000, 0.012, 0.819, 0.996] : two people wearing formal wedding attire .\n\nGlobal Caption:\nThere are two people enjoying a wedding reception\nA woman in a wedding dress with another woman in a suit behind\nA woman in a wedding dress with another lady holding a piece of cake.\nA red head girl holding a piece of cake\nA bride is with a long red haired person with cake."}
+{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bush at [0.480, 0.000, 0.748, 0.084].\nObject 1 : ear at [0.544, 0.544, 0.571, 0.562].\nObject 2 : field at [0.000, 0.002, 0.994, 0.998].\nObject 3 : hill at [0.000, 0.000, 0.997, 0.998].\nObject 4 : plant at [0.000, 0.764, 0.601, 0.998].\nObject 5 : rock at [0.727, 0.410, 0.808, 0.470].\nObject 6 : sheep at [0.532, 0.546, 0.646, 0.662].\nObject 7 : sheep at [0.532, 0.666, 0.817, 0.810].\nObject 8 : tail at [0.565, 0.572, 0.604, 0.610].\nObject 9 : tree at [0.649, 0.000, 0.997, 0.334].\nObject 10 : trees at [0.736, 0.036, 0.835, 0.100].\nObject 11 : wall at [0.000, 0.000, 0.769, 0.180].\nObject 12 : weed at [0.417, 0.346, 0.492, 0.390].\n\nRelationships:\nobject 7 : sheep -> in a -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 11 : wall -> borders -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 10 : trees -> in -> object 2 : field.\nobject 6 : sheep -> has an -> object 1 : ear.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 12 : weed -> growing in -> object 2 : field.\nobject 7 : sheep -> on -> object 3 : hill.\nobject 4 : plant -> on -> object 2 : field.\nobject 5 : rock -> on -> object 3 : hill.\nobject 7 : sheep -> are in -> object 2 : field.\nobject 11 : wall -> running across -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 5 : rock -> in -> object 2 : field.\n\nRegion Description:\nRegion Description at [0.000, 0.072, 0.760, 0.160] : A stone wall boarding a field of sheep.\nRegion Description at [0.189, 0.032, 0.703, 0.178] : rocks and grass in the background of the pasture.\nRegion Description at [0.541, 0.662, 0.823, 0.802] : white sheep grazing in green grassy field.\nRegion Description at [0.538, 0.544, 0.646, 0.656] : white sheep grazing in green grassy field.\nRegion Description at [0.228, 0.374, 0.357, 0.436] : white sheep grazing in green grassy field.\nRegion Description at [0.607, 0.380, 0.712, 0.456] : white sheep grazing in green grassy field.\nRegion Description at [0.811, 0.296, 0.937, 0.338] : two white sheep grazing in green grassy field.\nRegion Description at [0.048, 0.200, 0.249, 0.242] : group of white sheep grazing in green grassy field.\nRegion Description at [0.213, 0.164, 0.336, 0.192] : group of white sheep grazing in green grassy field.\nRegion Description at [0.000, 0.006, 0.997, 0.172] : two long gray stone walls across field.\nRegion Description at [0.453, 0.000, 0.730, 0.062] : a stand of trees outside the stone fence.\n\nGlobal Caption:\nA group of sheep grazing in a grassy valley.\nSheep graze in a lushly green mountain meadow\nA flock of sheep walking along a grassy hillside grazing.\nA flock of sheep are grazing on a grassy slope.\nA group of sheep grazing in a grassy field."}
+{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bold writings at [0.492, 0.770, 0.556, 0.810].\nObject 1 : bottle at [0.468, 0.642, 0.634, 0.916].\nObject 2 : cart at [0.232, 0.328, 0.808, 0.998].\nObject 3 : child at [0.408, 0.168, 0.606, 0.786].\nObject 4 : cleaner at [0.466, 0.634, 0.636, 0.916].\nObject 5 : floor at [0.000, 0.190, 1.000, 1.000].\nObject 6 : green shirt at [0.000, 0.180, 0.078, 0.540].\nObject 7 : houses at [0.000, 0.000, 0.240, 0.414].\nObject 8 : leaves at [0.894, 0.202, 0.910, 0.204].\nObject 9 : line at [0.796, 0.954, 0.996, 0.966].\nObject 10 : lines at [0.828, 0.398, 0.998, 0.568].\nObject 11 : metal at [0.514, 0.116, 0.558, 0.292].\nObject 12 : metal at [0.234, 0.336, 0.802, 0.998].\nObject 13 : metal part at [0.512, 0.862, 0.566, 0.992].\nObject 14 : pants at [0.432, 0.524, 0.574, 0.670].\nObject 15 : person at [0.110, 0.070, 0.258, 0.456].\nObject 16 : person at [0.412, 0.166, 0.604, 0.784].\nObject 17 : person at [0.000, 0.182, 0.216, 0.958].\nObject 18 : sandal at [0.070, 0.862, 0.180, 0.954].\nObject 19 : shirt at [0.128, 0.120, 0.216, 0.260].\nObject 20 : shorts at [0.140, 0.222, 0.216, 0.348].\nObject 21 : skirt at [0.000, 0.470, 0.214, 0.894].\nObject 22 : umbrella at [0.296, 0.038, 0.782, 0.360].\nObject 23 : woman at [0.286, 0.000, 0.802, 0.812].\nObject 24 : writings at [0.512, 0.838, 0.564, 0.868].\n\nRelationships:\nobject 3 : child -> holding -> object 22 : umbrella.\nobject 23 : woman -> pushing -> object 2 : cart.\nobject 21 : skirt -> on -> object 17 : person.\nobject 10 : lines -> on -> object 5 : floor.\nobject 20 : shorts -> on -> object 15 : person.\nobject 16 : person -> next to -> object 2 : cart.\nobject 16 : person -> wearing -> object 21 : skirt.\nobject 18 : sandal -> on -> object 17 : person.\nobject 6 : green shirt -> on -> object 16 : person.\nobject 14 : pants -> on -> object 3 : child.\n\nRegion Description:\nRegion Description at [0.298, 0.050, 0.778, 0.422] : the opened umbrella the child is holding.\n\nGlobal Caption:\nA baby girl standing in a shopping cart holding an umbrella.\nA GIRL IS IN A GROCERY CART \nA little girl is riding in a shopping cart while holding her umbrella.\nA little girl inside of a shopping cart.\nA small child stands in a shopping cart with an umbrella."}
+{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bearded face at [0.371, 0.064, 0.393, 0.094].\nObject 1 : blender at [0.015, 0.165, 0.080, 0.307].\nObject 2 : box at [0.176, 0.249, 0.228, 0.329].\nObject 3 : buttons at [0.038, 0.268, 0.048, 0.275].\nObject 4 : counter at [0.567, 0.340, 0.738, 0.395].\nObject 5 : counter at [0.000, 0.329, 0.576, 0.398].\nObject 6 : curtain at [0.429, 0.048, 0.504, 0.318].\nObject 7 : curtain at [0.227, 0.000, 0.309, 0.287].\nObject 8 : dog at [0.462, 0.593, 0.568, 0.842].\nObject 9 : door knob at [0.242, 0.477, 0.253, 0.499].\nObject 10 : drawer at [0.112, 0.370, 0.259, 0.452].\nObject 11 : drawer at [0.284, 0.382, 0.394, 0.439].\nObject 12 : faucet at [0.338, 0.327, 0.388, 0.357].\nObject 13 : floor at [0.000, 0.713, 1.000, 1.000].\nObject 14 : kitchen at [0.000, 0.000, 0.750, 0.849].\nObject 15 : knob at [0.179, 0.398, 0.197, 0.422].\nObject 16 : knob at [0.340, 0.400, 0.352, 0.420].\nObject 17 : man at [0.274, 0.000, 0.517, 0.792].\nObject 18 : mugs at [0.509, 0.123, 0.595, 0.266].\nObject 19 : outlet at [0.107, 0.212, 0.143, 0.256].\nObject 20 : shoes at [0.391, 0.735, 0.476, 0.786].\nObject 21 : spatula at [0.126, 0.003, 0.153, 0.094].\nObject 22 : tile at [0.526, 0.592, 0.557, 0.634].\nObject 23 : wall at [0.003, 0.000, 0.220, 0.294].\nObject 24 : wall at [0.506, 0.019, 0.607, 0.384].\nObject 25 : window at [0.303, 0.016, 0.392, 0.328].\nObject 26 : wire at [0.097, 0.233, 0.129, 0.319].\n\nRelationships:\nobject 17 : man -> standing in -> object 14 : kitchen.\nobject 18 : mugs -> hanging on -> object 24 : wall.\nobject 1 : blender -> with -> object 3 : buttons.\nobject 17 : man -> with -> object 0 : bearded face.\nobject 26 : wire -> hanging from -> object 23 : wall.\nobject 8 : dog -> on -> object 13 : floor.\nobject 1 : blender -> on -> object 5 : counter.\nobject 6 : curtain -> on -> object 25 : window.\nobject 20 : shoes -> on -> object 17 : man.\n\nRegion Description:\nRegion Description at [0.056, 0.214, 0.140, 0.277] : A dark electric cord plugged into the wall.\nRegion Description at [0.000, 0.662, 0.116, 0.940] : A latter with onely one rung visible.\nRegion Description at [0.004, 0.698, 0.999, 0.991] : Durable Tan and brown laminent flooring.\nRegion Description at [0.004, 0.324, 0.739, 0.880] : cheap waferboard constructed cabinets .\nRegion Description at [0.514, 0.126, 0.588, 0.262] : convient and accessable way to store coffee mugs.\nRegion Description at [0.222, 0.001, 0.510, 0.286] : small window curtians with paisley design.\nRegion Description at [0.347, 0.053, 0.490, 0.312] : light weight flanel design mens shirt .\nRegion Description at [0.222, 0.004, 0.315, 0.303] : gold and white curtain on a kitchen window.\nRegion Description at [0.511, 0.126, 0.589, 0.261] : coffee cups hanging on the kitchen wall.\nRegion Description at [0.012, 0.149, 0.091, 0.340] : gold colored blinder sits on the counter.\nRegion Description at [-0.001, 0.000, 0.157, 0.122] : cooking utensils hanging against wall.\n\nGlobal Caption:\nA man standing next to a dog on the ground.\nA man is at a kitchen counter by a dog.\nAn man standing in a kitchen with a small puppy.\nthere is a small puppy on the kitchen floor\nA man in the kitchen standing with his dog."}
+{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : block at [0.156, 0.630, 0.357, 0.822].\nObject 1 : book at [0.414, 0.208, 0.538, 0.364].\nObject 2 : book at [0.360, 0.202, 0.417, 0.360].\nObject 3 : book at [0.426, 0.484, 0.691, 0.522].\nObject 4 : book at [0.399, 0.404, 0.520, 0.554].\nObject 5 : bowl at [0.072, 0.030, 0.288, 0.076].\nObject 6 : center at [0.850, 0.732, 0.886, 0.766].\nObject 7 : eye at [0.282, 0.506, 0.327, 0.532].\nObject 8 : eye at [0.189, 0.506, 0.237, 0.534].\nObject 9 : flower at [0.796, 0.462, 0.982, 0.550].\nObject 10 : flower at [0.817, 0.528, 0.976, 0.612].\nObject 11 : flower at [0.760, 0.678, 0.946, 0.824].\nObject 12 : flower at [0.691, 0.608, 0.838, 0.722].\nObject 13 : flower at [0.913, 0.680, 1.000, 0.770].\nObject 14 : object at [0.213, 0.840, 0.583, 0.972].\nObject 15 : picture at [0.778, 0.060, 1.000, 0.352].\nObject 16 : shelf at [0.324, 0.528, 0.997, 0.624].\nObject 17 : shelf at [0.207, 0.334, 0.997, 0.380].\nObject 18 : shelf at [0.000, 0.028, 0.607, 0.202].\nObject 19 : stack at [0.435, 0.480, 0.712, 0.578].\nObject 20 : statue at [0.147, 0.404, 0.372, 0.652].\nObject 21 : table at [0.000, 0.690, 1.003, 0.998].\nObject 22 : vase at [0.838, 0.774, 0.994, 0.974].\nObject 23 : water at [0.847, 0.864, 0.997, 0.984].\n\nRelationships:\nobject 20 : statue -> on -> object 0 : block.\nobject 14 : object -> on -> object 21 : table.\nobject 1 : book -> on -> object 17 : shelf.\nobject 4 : book -> on -> object 16 : shelf.\nobject 5 : bowl -> on -> object 18 : shelf.\nobject 22 : vase -> has -> object 23 : water.\nobject 20 : statue -> has -> object 8 : eye.\nobject 20 : statue -> has -> object 7 : eye.\nobject 20 : statue -> on -> object 0 : block.\nobject 9 : flower -> in -> object 22 : vase.\nobject 10 : flower -> in -> object 22 : vase.\nobject 12 : flower -> in -> object 22 : vase.\nobject 13 : flower -> in -> object 22 : vase.\nobject 3 : book -> in -> object 19 : stack.\nobject 11 : flower -> has -> object 6 : center.\nobject 1 : book -> on -> object 17 : shelf.\nobject 2 : book -> on -> object 17 : shelf.\nobject 11 : flower -> has -> object 6 : center.\nobject 3 : book -> on -> object 19 : stack.\nobject 19 : stack -> on -> object 16 : shelf.\nobject 20 : statue -> on -> object 0 : block.\n\nRegion Description:\n\nGlobal Caption:\na glass vase with some flowers coming out of it \nA room witb a statue, bookshelves, books and a vase with flowers in it.\nA desk with a vase containing flowers, a sculpture of a man's head and shelves behind it.\nA statue next to a vase of flowers on a shelf. \nThe bust of a man's head is next to a vase of flowers."}
+{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : area at [0.000, 0.002, 0.995, 0.996].\nObject 1 : background at [0.000, 0.132, 0.997, 0.268].\nObject 2 : cap at [0.171, 0.388, 0.253, 0.476].\nObject 3 : green/tennis court at [0.005, 0.720, 0.880, 0.994].\nObject 4 : hand at [0.253, 0.648, 0.299, 0.680].\nObject 5 : head at [0.173, 0.408, 0.256, 0.474].\nObject 6 : line at [0.397, 0.778, 0.501, 0.996].\nObject 7 : man at [0.163, 0.274, 0.491, 0.936].\nObject 8 : photo at [0.005, 0.004, 0.968, 0.976].\nObject 9 : pole at [0.019, 0.162, 0.035, 0.258].\nObject 10 : ses at [0.912, 0.962, 0.992, 0.994].\nObject 11 : shadow at [0.397, 0.898, 0.968, 0.956].\nObject 12 : shorts at [0.216, 0.628, 0.432, 0.782].\nObject 13 : sock at [0.325, 0.840, 0.376, 0.890].\nObject 14 : sport at [0.144, 0.270, 0.515, 0.944].\nObject 15 : tennis racket at [0.235, 0.578, 0.304, 0.664].\nObject 16 : tennis shoe at [0.213, 0.880, 0.280, 0.930].\nObject 17 : tennis shoe at [0.299, 0.886, 0.405, 0.936].\nObject 18 : trees at [0.269, 0.192, 0.995, 0.250].\nObject 19 : wrist at [0.384, 0.318, 0.429, 0.360].\nObject 20 : wristband at [0.384, 0.318, 0.432, 0.360].\n\nRelationships:\nobject 7 : man -> wearing -> object 12 : shorts.\nobject 4 : hand -> holding -> object 15 : tennis racket.\nobject 2 : cap -> on mans -> object 5 : head.\nobject 5 : head -> of a -> object 7 : man.\nobject 7 : man -> wearing a -> object 2 : cap.\nobject 7 : man -> wearing a -> object 13 : sock.\nobject 18 : trees -> in -> object 1 : background.\nobject 14 : sport -> in -> object 0 : area.\nobject 20 : wristband -> on a -> object 19 : wrist.\nobject 2 : cap -> on -> object 5 : head.\nobject 11 : shadow -> of -> object 7 : man.\nobject 12 : shorts -> on -> object 7 : man.\n\nRegion Description:\nRegion Description at [0.163, 0.322, 0.579, 0.926] : The tennis player is wearing all white.\nRegion Description at [0.397, 0.858, 0.936, 0.968] : Tennis player's shadow cast in front of him.\nRegion Description at [0.219, 0.560, 0.309, 0.680] : a black tennis racket in a man's hand.\nRegion Description at [0.341, 0.538, 0.480, 0.728] : a line judge at the side of a tennis court.\n\nGlobal Caption:\nA tennis player prepares to serve a tennis ball.\na tennis player in all white playing on a court \nA tennis player is reaching up with one arm and has a racquet in the other hand. \nThe tennis player throws the ball up to serve\nSpectators watching a man swinging at a tennis ball."}
+{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bag at [0.428, 0.435, 0.476, 0.528].\nObject 1 : bag at [0.322, 0.923, 0.498, 0.997].\nObject 2 : building at [0.000, 0.003, 0.158, 0.413].\nObject 3 : face at [0.246, 0.240, 0.374, 0.483].\nObject 4 : flag at [0.044, 0.013, 0.090, 0.149].\nObject 5 : girl at [0.538, 0.019, 0.968, 0.949].\nObject 6 : hand at [0.176, 0.680, 0.304, 0.821].\nObject 7 : hands at [0.660, 0.344, 0.756, 0.517].\nObject 8 : head at [0.560, 0.003, 0.822, 0.339].\nObject 9 : hot dog at [0.676, 0.315, 0.882, 0.408].\nObject 10 : hot dogs at [0.190, 0.587, 0.350, 0.741].\nObject 11 : jeans at [0.586, 0.843, 0.916, 0.995].\nObject 12 : lady at [0.572, 0.045, 0.952, 0.984].\nObject 13 : logo at [0.920, 0.069, 0.996, 0.165].\nObject 14 : man at [0.486, 0.235, 0.564, 0.509].\nObject 15 : man at [0.456, 0.213, 0.520, 0.317].\nObject 16 : maroon shirt at [0.546, 0.333, 0.928, 0.944].\nObject 17 : mouth at [0.288, 0.408, 0.356, 0.440].\nObject 18 : people at [0.552, 0.029, 0.876, 0.995].\nObject 19 : post at [0.104, 0.005, 0.138, 0.533].\nObject 20 : purse at [0.842, 0.661, 0.980, 0.888].\nObject 21 : purse strap at [0.270, 0.893, 0.390, 0.992].\nObject 22 : shadow at [0.934, 0.067, 0.996, 0.141].\nObject 23 : side at [0.922, 0.875, 0.998, 0.997].\nObject 24 : street at [0.042, 0.403, 0.092, 0.520].\nObject 25 : sunglasses at [0.630, 0.005, 0.794, 0.048].\nObject 26 : woman at [0.502, 0.000, 0.982, 0.997].\nObject 27 : woman at [0.102, 0.099, 0.486, 0.984].\nObject 28 : woman's shirt at [0.518, 0.320, 0.944, 0.949].\n\nRelationships:\nobject 0 : bag -> on -> object 15 : man.\nobject 13 : logo -> on -> object 2 : building.\nobject 25 : sunglasses -> on -> object 26 : woman.\nobject 25 : sunglasses -> on -> object 8 : head.\nobject 4 : flag -> on -> object 19 : post.\nobject 6 : hand -> holds -> object 10 : hot dogs.\nobject 27 : woman -> has -> object 17 : mouth.\nobject 12 : lady -> holding -> object 9 : hot dog.\nobject 9 : hot dog -> in -> object 7 : hands.\nobject 18 : people -> crossing -> object 24 : street.\nobject 27 : woman -> wearing -> object 11 : jeans.\nobject 5 : girl -> wears -> object 16 : maroon shirt.\n\nRegion Description:\nRegion Description at [0.038, 0.173, 0.540, 0.995] : Laughing girl in a green shirt holding a hotdog..\nRegion Description at [0.504, 0.000, 0.954, 0.989] : Black haired girl in maroon shirt wearing sunglasses on her head..\nRegion Description at [0.508, 0.000, 0.960, 0.979] : Girl looking at the hot dog she's holding in her hands.\nRegion Description at [0.040, 0.173, 0.536, 0.981] : Girl holding hot dog in her right hand.\nRegion Description at [0.926, 0.253, 0.998, 0.645] : Woman in a brown shirt and jeans crossing the street.\nRegion Description at [0.202, 0.563, 0.334, 0.995] : Blue purse strap around woman's shoulder.\nRegion Description at [0.146, 0.587, 0.370, 0.787] : woman holding hot dog in white napkin.\nRegion Description at [0.682, 0.229, 0.742, 0.315] : woman's mouth open looking at hot dog.\nRegion Description at [0.234, 0.213, 0.396, 0.507] : woman's face smiling with eyes closed.\n\nGlobal Caption:\nTwo Asian women eating chili dogs while standing on a street.\nTwo women preparing to eat a hot dog on a city side.\nThe woman are eating their hot dogs while walking.\nTwo young women are eating hot dogs while walking down the sidewalk.\nTwo women eat chili dogs on a city sidewalk. "}
+{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm at [0.658, 0.462, 0.828, 0.496].\nObject 1 : bench at [0.070, 0.493, 0.932, 0.960].\nObject 2 : concrete at [0.030, 0.810, 0.974, 0.997].\nObject 3 : foot at [0.724, 0.784, 0.782, 0.844].\nObject 4 : hair at [0.646, 0.367, 0.754, 0.472].\nObject 5 : hair at [0.564, 0.338, 0.652, 0.462].\nObject 6 : man at [0.542, 0.343, 0.812, 0.493].\nObject 7 : ocean at [0.028, 0.319, 0.972, 0.821].\nObject 8 : post at [0.090, 0.641, 0.102, 0.734].\nObject 9 : post at [0.924, 0.652, 0.944, 0.836].\nObject 10 : rail at [0.028, 0.620, 0.974, 0.660].\nObject 11 : seat at [0.072, 0.728, 0.928, 0.786].\nObject 12 : shoe at [0.720, 0.789, 0.782, 0.855].\nObject 13 : sky at [0.028, 0.037, 0.974, 0.325].\nObject 14 : slat at [0.072, 0.749, 0.928, 0.781].\nObject 15 : slat at [0.112, 0.499, 0.912, 0.522].\nObject 16 : slat at [0.126, 0.702, 0.912, 0.728].\nObject 17 : slat at [0.108, 0.594, 0.908, 0.625].\nObject 18 : slat at [0.106, 0.525, 0.908, 0.554].\nObject 19 : woman at [0.644, 0.377, 0.834, 0.863].\n\nRelationships:\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting with -> object 19 : woman.\nobject 6 : man -> has -> object 0 : arm.\nobject 0 : arm -> around -> object 19 : woman.\nobject 3 : foot -> wearing -> object 12 : shoe.\nobject 19 : woman -> has -> object 3 : foot.\nobject 3 : foot -> inside -> object 12 : shoe.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 19 : woman -> has -> object 4 : hair.\nobject 6 : man -> has -> object 5 : hair.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> backs up to -> object 1 : bench.\nobject 19 : woman -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 19 : woman -> relaxing on -> object 1 : bench.\nobject 6 : man -> relaxing on -> object 1 : bench.\nobject 19 : woman -> facing -> object 7 : ocean.\nobject 6 : man -> facing -> object 7 : ocean.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 6 : man -> relaxing with -> object 19 : woman.\nobject 6 : man -> on bench with -> object 19 : woman.\nobject 19 : woman -> resting on -> object 1 : bench.\nobject 6 : man -> resting on -> object 1 : bench.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 11 : seat -> part of -> object 1 : bench.\nobject 9 : post -> supporting -> object 10 : rail.\nobject 8 : post -> supporting -> object 10 : rail.\nobject 19 : woman -> has -> object 3 : foot.\nobject 12 : shoe -> belongs to -> object 19 : woman.\nobject 19 : woman -> has -> object 3 : foot.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 7 : ocean -> in front of -> object 1 : bench.\nobject 6 : man -> sitting next to -> object 19 : woman.\nobject 6 : man -> cuddling with -> object 19 : woman.\nobject 0 : arm -> around -> object 19 : woman.\nobject 6 : man -> silhouetted with -> object 19 : woman.\nobject 18 : slat -> part of -> object 1 : bench.\n\nRegion Description:\nRegion Description at [0.502, 0.309, 0.892, 0.512] : a man and woman looking at the ocean.\n\nGlobal Caption:\nTwo people sitting on a bench silhouetted against the sea.\nTwo people are sitting on a bench together in front of water.\nThe silhouette of two people 
sitting on a bench in front of the water.\nA couple is sitting on a bench in front of the water. \nA couple sits on a park bench and watches the water"}
+{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : catcher at [0.334, 0.193, 0.756, 0.940].\nObject 1 : field at [0.000, 0.000, 0.998, 0.997].\nObject 2 : glove at [0.660, 0.492, 0.764, 0.674].\nObject 3 : hand at [0.666, 0.498, 0.748, 0.665].\nObject 4 : helmet at [0.472, 0.187, 0.610, 0.444].\nObject 5 : jersey at [0.340, 0.332, 0.556, 0.695].\nObject 6 : line at [0.396, 0.656, 0.560, 0.731].\nObject 7 : lines at [0.866, 0.927, 1.000, 0.997].\nObject 8 : lines at [0.754, 0.837, 0.998, 0.867].\nObject 9 : pads at [0.562, 0.668, 0.634, 0.782].\nObject 10 : pants at [0.336, 0.640, 0.612, 0.858].\nObject 11 : sneakers at [0.406, 0.834, 0.544, 0.946].\nObject 12 : stripe at [0.608, 0.737, 0.998, 0.795].\nObject 13 : wrist band at [0.586, 0.583, 0.604, 0.640].\n\nRelationships:\nobject 0 : catcher -> in -> object 1 : field.\nobject 2 : glove -> on -> object 3 : hand.\nobject 6 : line -> on -> object 10 : pants.\n\nRegion Description:\nRegion Description at [0.546, 0.625, 0.626, 0.801] : The player is wearing knee and leg pads..\nRegion Description at [0.018, 0.665, 0.280, 0.825] : A brown dirt ground surface on a baseball field.\nRegion Description at [0.676, 0.701, 0.974, 0.979] : White chalk lines painted on a baseball field.\nRegion Description at [0.062, 0.130, 0.370, 0.535] : A green grass ground surface of a baseball field.\nRegion Description at [0.566, 0.580, 0.620, 0.656] : A black and red bracelet on a man's wrist.\n\nGlobal Caption:\nA catches crouches on a patch of dirt.\nA catcher squatting at a base with his gloved hand extended.\nA baseball catcher stands ready to catch a ball.\na catcher kneeling at the mound waiting for a baseball \nA catcher in white uniform during a baseball game."}
+{"question_id": 38, "image": "000000271402.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blonde hair at [0.193, 0.100, 0.375, 0.176].\nObject 1 : building at [0.804, 0.200, 0.906, 0.318].\nObject 2 : dress at [0.378, 0.284, 0.804, 0.652].\nObject 3 : fence at [0.607, 0.282, 0.997, 0.378].\nObject 4 : girl at [0.329, 0.148, 0.973, 0.892].\nObject 5 : girl at [0.057, 0.102, 0.456, 0.898].\nObject 6 : ground at [0.000, 0.374, 1.000, 0.916].\nObject 7 : hair at [0.320, 0.148, 0.517, 0.286].\nObject 8 : handle at [0.329, 0.432, 0.508, 0.480].\nObject 9 : handle at [0.091, 0.450, 0.299, 0.502].\nObject 10 : head at [0.335, 0.152, 0.508, 0.314].\nObject 11 : insignia at [0.447, 0.350, 0.502, 0.390].\nObject 12 : orange platform at [0.181, 0.816, 0.489, 0.998].\nObject 13 : orange wheel at [0.193, 0.820, 0.248, 0.876].\nObject 14 : pavement at [0.009, 0.370, 0.994, 0.996].\nObject 15 : racket at [0.462, 0.480, 0.713, 0.840].\nObject 16 : right shoe at [0.465, 0.778, 0.610, 0.886].\nObject 17 : scooter at [0.097, 0.424, 0.592, 0.996].\nObject 18 : shoe at [0.060, 0.794, 0.202, 0.902].\nObject 19 : shoe at [0.302, 0.780, 0.453, 0.874].\nObject 20 : skirt at [0.471, 0.514, 0.804, 0.654].\nObject 21 : sneaker at [0.849, 0.738, 0.970, 0.886].\nObject 22 : sock at [0.317, 0.776, 0.347, 0.798].\nObject 23 : sock at [0.130, 0.790, 0.184, 0.810].\n\nRelationships:\nobject 4 : girl -> on -> object 14 : pavement.\nobject 5 : girl -> wearing -> object 22 : sock.\nobject 5 : girl -> wearing -> object 23 : sock.\nobject 4 : girl -> wearing -> object 20 : skirt.\nobject 4 : girl -> holding -> object 15 : racket.\nobject 5 : girl -> with -> object 0 : blonde hair.\nobject 17 : scooter -> with -> object 8 : handle.\nobject 1 : building -> with -> object 3 : fence.\nobject 4 : girl -> with -> object 11 : insignia.\nobject 13 : orange wheel -> of -> object 17 : scooter.\n\nRegion Description:\nRegion Description at [0.858, 0.760, 0.970, 0.852] : Girl is wearing blue, white, pink, and gray shoes..\nRegion Description at [0.293, 0.136, 0.976, 0.884] : a little girl holding a tennis racket..\nRegion Description at [0.060, 0.086, 0.462, 0.908] : A little girl standing near a scooter..\nRegion Description at [0.308, 0.146, 0.985, 0.892] : young girl wearing velcro strapped tennis shoes.\nRegion Description at [0.082, 0.436, 0.601, 0.996] : orange scooter board with black handles.\nRegion Description at [0.755, 0.184, 0.973, 0.372] : a tall building with fence in foreground.\nRegion Description at [0.021, 0.096, 0.988, 0.928] : two young girls wearing white outfits.\nRegion Description at [0.311, 0.136, 0.991, 0.886] : young girl with insignia on white outfit.\nRegion Description at [0.175, 0.814, 0.266, 0.888] : orange colored back wheel of a scooter board.\nRegion Description at [0.453, 0.478, 0.725, 0.848] : lavender, yellow and pink colored tennis racket.\n\nGlobal Caption:\ntwo little girls in tennis uniforms standing next to a scooter\nTwo young girls with a tennis racket and a scooter.\nTwo little girls posing for a picture, on a tennis court.\nTwo young girls on a tennis court with a racquet and a scooter\nTwo cute girls with a scooter and tennis raquet."}
+{"question_id": 39, "image": "000000018519.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : concrete at [0.000, 0.576, 1.002, 0.998].\nObject 1 : elbow at [0.403, 0.538, 0.433, 0.552].\nObject 2 : fence at [0.000, 0.314, 0.998, 0.600].\nObject 3 : graffiti at [0.470, 0.856, 0.794, 0.998].\nObject 4 : grass at [0.000, 0.154, 1.002, 0.448].\nObject 5 : helmet at [0.358, 0.354, 0.448, 0.422].\nObject 6 : knee at [0.525, 0.608, 0.545, 0.622].\nObject 7 : knee pad at [0.450, 0.542, 0.512, 0.598].\nObject 8 : pad at [0.540, 0.362, 0.595, 0.420].\nObject 9 : pad at [0.512, 0.578, 0.592, 0.624].\nObject 10 : pad at [0.376, 0.512, 0.443, 0.554].\nObject 11 : park at [0.007, 0.006, 1.000, 0.578].\nObject 12 : pipe at [0.657, 0.300, 0.687, 0.578].\nObject 13 : pipe at [0.177, 0.324, 0.211, 0.590].\nObject 14 : rail at [0.000, 0.310, 1.000, 0.334].\nObject 15 : ramp at [0.000, 0.592, 1.002, 0.998].\nObject 16 : rock at [0.100, 0.302, 0.154, 0.326].\nObject 17 : shadow at [0.415, 0.642, 0.754, 0.912].\nObject 18 : shirt at [0.438, 0.376, 0.637, 0.514].\nObject 19 : shorts at [0.460, 0.500, 0.664, 0.580].\nObject 20 : skate at [0.647, 0.490, 0.709, 0.584].\nObject 21 : skater at [0.234, 0.352, 0.719, 0.624].\nObject 22 : sticker at [0.408, 0.358, 0.438, 0.368].\nObject 23 : tree at [0.122, 0.008, 0.677, 0.322].\nObject 24 : wheels at [0.689, 0.496, 0.721, 0.526].\nObject 25 : wrist brace at [0.279, 0.524, 0.338, 0.564].\n\nRelationships:\nobject 21 : skater -> has a -> object 17 : shadow.\nobject 20 : skate -> has -> object 24 : wheels.\nobject 23 : tree -> standing in a -> object 11 : park.\nobject 21 : skater -> wearing a -> object 5 : helmet.\nobject 10 : pad -> protecting an -> object 1 : elbow.\nobject 9 : pad -> protecting a -> object 6 : knee.\nobject 17 : shadow -> of a -> object 21 : skater.\nobject 15 : ramp -> has a -> object 3 : graffiti.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 16 : rock -> in -> object 4 : grass.\nobject 5 : helmet -> has a -> object 22 : sticker.\nobject 21 : skater -> wearing -> object 20 : skate.\nobject 21 : skater -> wearing a -> object 10 : pad.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 20 : skate.\nobject 17 : shadow -> on -> object 15 : ramp.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 21 : skater -> has a -> object 8 : pad.\nobject 21 : skater -> has a -> object 18 : shirt.\nobject 21 : skater -> has -> object 19 : shorts.\nobject 23 : tree -> behind -> object 21 : skater.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 9 : pad.\nobject 7 : knee pad -> for a -> object 21 : skater.\nobject 17 : shadow -> on -> object 0 : concrete.\nobject 3 : graffiti -> on -> object 0 : concrete.\n\nRegion Description:\nRegion Description at [0.391, 0.630, 0.776, 0.962] : Skater's shadow while performing a trick.\nRegion Description at [0.346, 0.342, 0.475, 0.440] : Man is wearing a black safety helmet.\nRegion Description at [0.184, 0.320, 0.741, 0.700] : a man roller skating at a skate park.\nRegion Description at [0.448, 0.636, 0.779, 0.940] : the shadow of the man cast on the cement ramp.\nRegion Description at [0.465, 0.856, 0.803, 0.996] : light blue painted graffiti on the cement ramp.\nRegion Description at [0.279, 0.524, 0.341, 0.570] : a black wrist guard on the man's wrist.\nRegion Description at [0.353, 0.352, 0.460, 0.422] : black helmet with several stickers on it.\nRegion Description at [0.644, 0.488, 0.719, 0.574] : the 
black rollerskate the man is wearing.\nRegion Description at [0.142, 0.314, 0.234, 0.604] : a grey post to the metal fence that is at the top of the ramp.\nRegion Description at [0.363, 0.500, 0.453, 0.566] : a black elbow pad the man is wearing.\nRegion Description at [0.405, 0.642, 0.746, 0.916] : shadow of a roller skater on concrete.\n\nGlobal Caption:\nA young man riding a skateboard down the side of a ramp.\nA man doing a trick on roller-skates in a skate park.\nA skateboarder performing a jump off the side of a ramp.\na man wearing roller skates doing a jump on the side of a wall \nThe man in the helmet is jumping while wearing roller skates. "}
+{"question_id": 40, "image": "000000106048.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : book at [0.218, 0.105, 0.834, 0.754].\nObject 1 : building at [0.050, 0.000, 1.000, 0.713].\nObject 2 : bus at [0.222, 0.144, 0.820, 0.757].\nObject 3 : bushes at [0.810, 0.401, 1.000, 0.680].\nObject 4 : design at [0.228, 0.422, 0.438, 0.560].\nObject 5 : ground at [0.000, 0.629, 1.002, 0.994].\nObject 6 : headlight at [0.738, 0.590, 0.796, 0.632].\nObject 7 : headlight at [0.522, 0.596, 0.610, 0.629].\nObject 8 : light at [0.604, 0.201, 0.706, 0.222].\nObject 9 : pavement at [0.002, 0.629, 0.996, 0.994].\nObject 10 : pipe at [0.172, 0.147, 0.208, 0.617].\nObject 11 : pipe at [0.438, 0.096, 0.458, 0.192].\nObject 12 : roof at [0.118, 0.000, 0.896, 0.174].\nObject 13 : side mirror at [0.488, 0.314, 0.530, 0.428].\nObject 14 : side mirror at [0.790, 0.332, 0.818, 0.455].\nObject 15 : street at [0.002, 0.611, 0.992, 0.991].\nObject 16 : stripe at [0.228, 0.428, 0.516, 0.569].\nObject 17 : trash can at [0.790, 0.569, 0.822, 0.662].\nObject 18 : wall at [0.858, 0.368, 0.920, 0.419].\nObject 19 : wheel at [0.266, 0.545, 0.294, 0.677].\nObject 20 : wheel at [0.248, 0.551, 0.264, 0.668].\nObject 21 : wheel at [0.444, 0.578, 0.472, 0.751].\nObject 22 : windows at [0.510, 0.216, 0.796, 0.548].\nObject 23 : windshield at [0.518, 0.222, 0.782, 0.545].\n\nRelationships:\nobject 10 : pipe -> running from -> object 12 : roof.\nobject 12 : roof -> to -> object 5 : ground.\nobject 17 : trash can -> next to -> object 3 : bushes.\nobject 3 : bushes -> by -> object 15 : street.\n\nRegion Description:\nRegion Description at [0.568, 0.524, 0.770, 0.599] : Divine Transportation written on front of bus.\nRegion Description at [0.162, 0.129, 0.212, 0.623] : black drain pipe running from the roof to the ground.\nRegion Description at [0.712, 0.177, 0.762, 0.240] : bus identification number on top of bus.\nRegion Description at [0.790, 0.557, 0.820, 0.647] : gray trash can next to bushes behind bus.\nRegion Description at [0.810, 0.407, 0.990, 0.692] : large green bushes in front of building.\nRegion Description at [0.670, 0.317, 0.740, 0.527] : black windshield wiper on windshield.\n\nGlobal Caption:\nA white bus driving past a tall building.\na black and white bus some bushes and building\nA white decorated bus is next to a building.\na large white bus that is by a building\nA large bus parked in a parking lot "}
diff --git a/ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl b/ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl
new file mode 100644
index 0000000..657cd70
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "What is the man in the image doing and tell me the coordinates of the man?"}
+{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "What's happening on the runway of the airport and provide me the coordinates of the runway and featured objects?"}
+{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "What kind of vehicle is on the bridge, and provide me the coordinates of the vehicle and the bridge?"}
+{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "What are the children doing, and what are they holding? Please provide the coordinates of the mentioned objects."}
+{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "What is the man doing and what objects are in front of him? Please provide the coordinates of these objects."}
+{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "What is the girl doing and provide me the coordinates of the girl and the objects she is interacting with."}
+{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "What are the main features of the building and provide me the coordinates of these features?"}
+{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "What is the man doing in this image and provide the coordinates of the toilet and the man."}
+{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "What items are located on the desk and tell me the coordinates of these items?"}
+{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "Describe the activities of the man and what is he wearing. Also, tell me the coordinates of the man and the items related to him."}
+{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "What is the scene in the image? Provide the coordinates of the main objects."}
+{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "What is the man doing, and what objects is he interacting with? Please provide the coordinates of the mentioned objects."}
+{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "What is the type of food placed on the table and provide me the coordinates of these objects?"}
+{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "What items are placed on the table and provide me the coordinates of mentioned items?"}
+{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "What is happening in this image and tell me the coordinates of the woman sitting?"}
+{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "What kind of bus is in the image, and what is it doing? Also, tell me the coordinates of the bus."}
+{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "How many zebras are in the image and provide me the coordinates of mentioned zebras?"}
+{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "What is the attire of the person who is riding the horse, and provide me the coordinates of that person?"}
+{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "What are the two men doing in the image? What are they wearing and provide me the coordinates of mentioned objects?"}
+{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "What state is the airplane in and tell me the coordinates of the mentioned objects?"}
+{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "What is the setting of this image and provide me the coordinates of the objects you mention?"}
+{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "How many people are there in the image and what are they wearing? Tell me the coordinates of the people and their clothing."}
+{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Can you tell me about the features of the bike and provide the coordinates of each feature mentioned?"}
+{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "What is the setting of this image? Please provide the coordinates of all objects mentioned in your explanation."}
+{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "Where are the zebras and what is near them? Provide coordinates of the mentioned objects."}
+{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "Who is holding a surfboard and provide me the coordinates of the surfboard?"}
+{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "What are the people doing and provide the coordinates of the mentioned objects."}
+{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "What are some of the features of the boat and provide me the coordinates of mentioned objects?"}
+{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "Who is holding the cake and what is she wearing, also tell me the coordinates of the mentioned objects?"}
+{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "What is happening in the field and provide the coordinates of mentioned objects?"}
+{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "Who is holding the umbrella and provide me the coordinates of the umbrella?"}
+{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "What's happening in the kitchen? Could you provide coordinates for the objects you mention?"}
+{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "What objects are present on the shelf and provide me the coordinates of mentioned objects?"}
+{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "What is the man doing in the image and tell me the coordinates of the mentioned objects?"}
+{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "Who are the two people in the image and what are they doing? Tell me the coordinates of mentioned objects."}
+{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "Who are sitting on the bench and what are they doing? Tell me the coordinates of mentioned objects."}
+{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "What is the catcher doing and tell me the coordinates of the catcher."}
+{"question_id": 38, "image": "000000271402.jpg", "category": "ground_conv", "text": "What are the girls holding and provide me the coordinates of these items?"}
+{"question_id": 39, "image": "000000018519.jpg", "category": "ground_conv", "text": "Can you describe the safety measures the skater has taken and provide the coordinates of these measures? "}
+{"question_id": 40, "image": "000000106048.jpg", "category": "ground_conv", "text": "What are the objects on the bus and provide me the coordinates of these objects?"}
diff --git a/ferret/eval/ferret_gpt4_data/refer_caption/answer.jsonl b/ferret/eval/ferret_gpt4_data/refer_caption/answer.jsonl
new file mode 100644
index 0000000..b005f46
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/refer_caption/answer.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000069138.jpg", "category": "refer_desc", "text": "The object is a sign that is placed on a building. The building is offering short term office space, which is stated on the sign. This sign is part of the overall street scene, with other objects nearby, such as a stop light on a pole and a road. The sign is directly related to the building, contributing to the information about the building's function to the overall scene."}
+{"question_id": 1, "image": "000000131138.jpg", "category": "refer_desc", "text": "The object is a plate, and on top of this plate is a fork. It's located on a desk. Nearby, there's also a cup and a computer mouse."}
+{"question_id": 2, "image": "000000475150.jpg", "category": "refer_desc", "text": "The region is primarily occupied by the face of a giraffe, which is white and brown in color. This portion of the giraffe's body is in close proximity to its neck, which is covered with wrinkles. The giraffe's eye and ear are also located within this region. The giraffe is standing in front of a tree, and its face and neck are spotted with tan and brown spots. The background is filled with branches of the tree and the sky."}
+{"question_id": 3, "image": "000000356424.jpg", "category": "refer_desc", "text": "In the region, there's a woman who seems to be having lunch. She is sitting near a table where there are several objects, including a plate of food that contains raspberries, a bottle, and a glass of water. There is also a man nearby, who is wearing glasses and a striped orange and black shirt. Both of them seem to be engaged in a meal."}
+{"question_id": 4, "image": "000000491090.jpg", "category": "refer_desc", "text": "The region corresponds to the back wheel of a motorcycle, which is an integral part of the black motorcycle with silver accessories. Close to the back wheel, you can find the tail pipe of the motorcycle located on the left. Above the back wheel, there's a person sitting on the motorcycle, wearing a sweater, jeans, and sneakers."}
+{"question_id": 5, "image": "000000484415.jpg", "category": "refer_desc", "text": "In the region, there is a container and a toilet brush cleaner. This region is right next to the toilet bowl, indicating that the brush cleaner is accessible for bathroom cleaning. It's crucial for maintaining the cleanliness of the toilet nearby."}
+{"question_id": 7, "image": "000000184324.jpg", "category": "refer_desc", "text": "The region is a crosswalk on a busy city street, highlighted by white stripes. It's being used by a cyclist and a group of people who are crossing the street. There are bikes on the road, and cars are also visible within the vicinity. A large white vehicle with a big windshield is also nearby. This bustling scene is typical for a city intersection."}
+{"question_id": 8, "image": "000000341058.jpg", "category": "refer_desc", "text": "The object is a salt shaker. It is located on a table along with a napkin and another shaker, which contains pepper. The table appears to be set for dining at a restaurant, as indicated by the presence of these objects."}
+{"question_id": 9, "image": "000000184384.jpg", "category": "refer_desc", "text": "In the region, there is a plate with various types of food on it. This includes a sausage, an egg, and a few other unidentified items. The food is attractively arranged on the plate, which is positioned towards the back of the table. The plate and its contents seem to be part of a larger meal setup on the table."}
+{"question_id": 10, "image": "000000259097.jpg", "category": "refer_desc", "text": "Region is full of trees and there is a village on a hill in the distance. These trees and buildings are located behind a grassy field where a man is seen jumping to catch a frisbee. The man's shadow can be seen on the grass."}
+{"question_id": 11, "image": "000000377882.jpg", "category": "refer_desc", "text": "The region contains a black fence pole, which seems to be part of a chain-link fence enclosing the area. This fence is next to a water way and encloses several boats and surfboards. There are buildings on the horizon, and some green shrubs growing along the side of the lake."}
+{"question_id": 12, "image": "000000415748.jpg", "category": "refer_desc", "text": "The region contains an elephant, which is quite large. There's a man riding on the back of the elephant, and they are moving close to a building. The shadow of the elephant can be seen on the ground. Additionally, the elephant's face and trunk are painted, which indicates some cultural significance."}
+{"question_id": 13, "image": "000000408120.jpg", "category": "refer_desc", "text": "In the region, there is a concrete surface which is part of the alley. It is placed alongside the curb and the road, and there is a car parked on it. Also, nearby, there is a girl holding an umbrella walking along this path."}
+{"question_id": 14, "image": "000000184400.jpg", "category": "refer_desc", "text": "In the region, there is a metal support column. This column is providing support for a bridge above it, which a train is passing over. The column also features a red line on it. This region is part of a larger scene that includes a train track on an elevated bridge."}
+{"question_id": 15, "image": "000000276018.jpg", "category": "refer_desc", "text": "The region is occupied by a boy who is wearing a black jacket. He is holding a brown stuffed dog with a red and white collar. The boy seems to be part of a larger group of children who are all holding various stuffed animals and dolls. They seem to be walking across some grassy area, possibly in some kind of event or gathering."}
+{"question_id": 16, "image": "000000376322.jpg", "category": "refer_desc", "text": "In the region, there is a man wearing a green shirt. He is sitting at a table, presumably in a social setting, along with other people. The table is full of items such as plates, glasses, and a decanter. One of the significant interactions is that the man is engaged in a conversation with the people around him."}
+{"question_id": 17, "image": "000000125472.jpg", "category": "refer_desc", "text": "This region is primarily occupied by a man, who appears to be in mid-air, performing a trick on a skateboard. The skateboard is beneath him. He is wearing jeans and shoes with laces, and has a bracelet on his wrist. In the background of this region, there are trees, a building, and a fence. The scene seems to be taking place in a stadium, as there are stadium lights on poles in the vicinity."}
+{"question_id": 18, "image": "000000361551.jpg", "category": "refer_desc", "text": "This region features a woman, who is dressed in a sleeveless black top. She is bending over her luggage, possibly preparing or checking something inside it. The woman is wearing a black and white headband as well. She is located in the service area of an airport, where there are other people standing around as well, some of them are holding their luggage. This scene is quite typical in an airport setting where passengers are usually seen handling their luggage."}
+{"question_id": 19, "image": "000000412240.jpg", "category": "refer_desc", "text": "This region primarily contains a shoe. The shoe appears to be placed on a floor, and light is reflecting off of it. A dog is sitting nearby on the floor as well, and the shoe is positioned next to the dog. The shoe features several distinct elements like laces, a heel, and a toe."}
+{"question_id": 20, "image": "000000130566.jpg", "category": "refer_desc", "text": "The region features windows on the side of a train engine. The train itself is traveling down a set of tracks, which are part of a larger railway system that includes multiple sets of tracks on the ground. Nearby, there are also electric lines hanging above the tracks. Further off, there are buildings, trees, and a wall, which add to the overall rural setting."}
+{"question_id": 21, "image": "000000421923.jpg", "category": "refer_desc", "text": "The object is a vase, and the object is a flower. The flower is in the vase, suggesting it is a decorative element within the room."}
+{"question_id": 22, "image": "000000513567.jpg", "category": "refer_desc", "text": "A woman, who is wearing a brown shirt and jeans, is crossing the street."}
+{"question_id": 23, "image": "000000543300.jpg", "category": "refer_desc", "text": "The region is displaying red letters. These letters are on the side of a large, white boat that's sitting in the water. The boat has two levels and there is a set of long, black windows on its side. A silver railing is present on the top level of the boat. Close to the boat, there are buildings with red roofs and outdoor canopies. There's also a blue container on the dock, and a gray sea wall next to the ship."}
+{"question_id": 24, "image": "000000241668.jpg", "category": "refer_desc", "text": "In the region, there is a woman with red hair. She's wearing a tie and a suit jacket, and is holding a plate with a piece of cake. The woman is dressed in formal attire, suggesting that she's attending a special occasion like a wedding."}
+{"question_id": 25, "image": "000000535578.jpg", "category": "refer_desc", "text": "The region contains rocks and grass, providing a background for the pasture. Nearby, there are white sheep grazing in the green grassy field. There are also trees and a bush in the vicinity. A stone wall is running across the grassy field, bordering it. Besides, there's a hill in the field where some sheep and a rock are located."}
+{"question_id": 26, "image": "000000277051.jpg", "category": "refer_desc", "text": "In this region, a bird is standing on the edge of a table. The table is covered with a red tablecloth and there are several objects on it, including a plate with food and crumbs, a bottle, and a steak knife. The bird is close to the knife and the plate with food. There's also a chair next to the table."}
+{"question_id": 27, "image": "000000018519.jpg", "category": "refer_desc", "text": "The region contains a black wrist guard that the skater is wearing. This wrist guard is part of the safety gear that the skater has on, which also includes a black helmet, elbow pad, knee pad, and a pair of roller skates. The skater is performing a trick at the skate park, his shadow is cast on the cement ramp, and there is a grey post to a metal fence at the top of the ramp nearby. Overall, this region is an important part of the scene, showing the skater's safety equipment."}
+{"question_id": 28, "image": "000000106048.jpg", "category": "refer_desc", "text": "This is a large decorated white bus. It seems to be driving past a tall building. You can see \"Divine Transportation\" written on the front of the bus. There's also a bus identification number on top. The bus features a design, including stripes, and there are headlights at the front. You can also see the side mirrors and wheels. Behind the bus, there's a gray trash can next to some large green bushes."}
+{"question_id": 29, "image": "000000058393.jpg", "category": "refer_desc", "text": "The region includes a man who is sitting on a bench. He has his arm around a woman, indicating a close relationship between them. They are both looking towards the ocean, suggesting that they are enjoying the view together. The bench they are sitting on is in front of the ocean."}
+{"question_id": 30, "image": "000000010764.jpg", "category": "refer_desc", "text": "This region is occupied by a baseball player wearing knee and leg pads. These pads are a part of the player's protective gear. The player, dressed as a catcher, is crouched on the field, ready to catch a ball. He is in a white uniform, which includes pants with a line on them, and he's wearing sneakers. His gloved hand is extended, prepared to receive. We can also see a black and red wrist band on his wrist. The field beneath him is brown dirt, contrasting with the green grass in the rest of the baseball field. Nearby, there are white chalk lines painted on the field."}
+{"question_id": 31, "image": "000000271402.jpg", "category": "refer_desc", "text": "This region contains a little girl who is standing near a scooter. The scooter has an orange board and black handles, and it's specifically located to the right of her. The girl has blonde hair and she's wearing white socks. She is also standing on the pavement."}
+{"question_id": 32, "image": "000000273493.jpg", "category": "refer_desc", "text": "In this region, a man in white clothing is preparing to hit a yellow tennis ball with his racket. He is on a tennis court with white boundary lines and a net in front of him. Behind him, there are a fence, trimmed bushes, and tall trees in the distance."}
+{"question_id": 33, "image": "000000360960.jpg", "category": "refer_desc", "text": "The region is where a man is found wearing a pair of pants. This man is also wearing a long black coat. He seems to be walking on a sidewalk or decorative square, which fills the background of the image."}
+{"question_id": 34, "image": "000000452122.jpg", "category": "refer_desc", "text": "In the region, there is an airplane's engine. The airplane seems to be in mid-flight, given the sky that surrounds it. The front door of the airplane is also visible in this region. The plane appears to be a commercial airline, as indicated by visible letters and windows. Notably, the landing gear of the airplane is lowered, suggesting that it's preparing to land."}
+{"question_id": 35, "image": "000000134722.jpg", "category": "refer_desc", "text": "The region contains the front window of a train, which has windshield wipers. This window is part of the front of the train, which is painted yellow and white. Also, the region is located near the headlights of the train."}
+{"question_id": 36, "image": "000000039484.jpg", "category": "refer_desc", "text": "In this region, there are people sitting at a table, likely dining or socializing outside a restaurant. This area is part of a bustling city street, filled with various cars, some parked and others potentially in motion. There are numerous buildings nearby, with diverse businesses and stores. One notable building nearby even has a marquee sign indicating \"for lease\". This scene suggests that the region is in a vibrant urban setting, where people are engaging in day-to-day activities such as dining outdoors and commuting by car."}
+{"question_id": 37, "image": "000000159311.jpg", "category": "refer_desc", "text": "The region is a patch of grass. There are two zebras standing in and grazing on this grass. They are feeding themselves and are near bushes and a tree."}
+{"question_id": 38, "image": "000000326174.jpg", "category": "refer_desc", "text": "In the region, there's a man and a little girl, they seem to be having a conversation. The man is looking back to the girl, who is pulling a surfboard, probably getting ready to surf. They are part of a larger group of people who are heading to the water with their surfboards."}
+{"question_id": 39, "image": "000000562207.jpg", "category": "refer_desc", "text": "In the region, there's a man standing wearing shorts. He is standing on the side of a lake, next to an elephant. The elephant is emerging from the water and seems to be interacting with the man and two other individuals not far from him. All three people appear to be tourists posing for a picture with the elephant. The surroundings include water, and some mountains and trees in the far distance, creating a serene and natural setting."}
+{"question_id": 40, "image": "000000332318.jpg", "category": "refer_desc", "text": "Within the region, there is a cow. This cow is in a pasture, which is located near a mountainous area. The mountain is partially covered in snow. There are also multiple trailers in the pasture, and one of them appears to be storage for animal equipment. The pasture and its surroundings provide a peaceful and natural living environment for the cows."}
diff --git a/ferret/eval/ferret_gpt4_data/refer_caption/context.jsonl b/ferret/eval/ferret_gpt4_data/refer_caption/context.jsonl
new file mode 100644
index 0000000..fc6570d
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/refer_caption/context.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000069138.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : arrows at [0.000, 0.616, 0.214, 0.644].\nObject 1 : awning at [0.159, 0.260, 0.293, 0.336].\nObject 2 : building at [0.000, 0.000, 1.000, 0.466].\nObject 3 : bushes at [0.693, 0.342, 1.000, 0.512].\nObject 4 : door at [0.110, 0.370, 0.266, 0.518].\nObject 5 : face at [0.390, 0.256, 0.614, 0.392].\nObject 6 : greenery at [0.824, 0.154, 0.997, 0.384].\nObject 7 : hitch at [0.221, 0.520, 0.259, 0.542].\nObject 8 : ladder at [0.110, 0.342, 0.283, 0.364].\nObject 9 : license plate at [0.141, 0.460, 0.234, 0.500].\nObject 10 : line at [0.017, 0.700, 0.266, 0.756].\nObject 11 : picture at [0.155, 0.378, 0.259, 0.442].\nObject 12 : plant barrier at [0.672, 0.482, 1.000, 0.606].\nObject 13 : planter at [0.676, 0.152, 1.000, 0.510].\nObject 14 : pole at [0.328, 0.068, 0.483, 0.994].\nObject 15 : road at [0.000, 0.490, 1.000, 1.000].\nObject 16 : roof at [0.117, 0.360, 0.283, 0.382].\nObject 17 : sad face at [0.383, 0.244, 0.614, 0.384].\nObject 18 : short term at [0.624, 0.040, 0.769, 0.080].\nObject 19 : sidewalk at [0.666, 0.572, 0.993, 0.618].\nObject 20 : sign at [0.621, 0.082, 0.772, 0.132].\nObject 21 : sign at [0.007, 0.144, 0.069, 0.204].\nObject 22 : signal at [0.266, 0.210, 0.679, 0.848].\nObject 23 : stop light at [0.366, 0.236, 0.638, 0.394].\nObject 24 : tail light at [0.100, 0.446, 0.121, 0.472].\nObject 25 : van at [0.076, 0.326, 0.297, 0.556].\nObject 26 : wall at [0.676, 0.500, 0.997, 0.604].\nObject 27 : window at [0.903, 0.000, 1.000, 0.086].\n\nRelationships:\nobject 23 : stop light -> with -> object 17 : sad face.\nobject 0 : arrows -> on -> object 15 : road.\nobject 12 : plant barrier -> beside -> object 15 : road.\nobject 11 : picture -> on -> object 4 : door.\nobject 10 : line -> painted in -> object 15 : road.\nobject 19 : sidewalk -> next to -> object 15 : road.\nobject 2 : building -> for -> object 18 : short term.\nobject 23 : stop light -> making -> object 5 : face.\nobject 3 : bushes -> just above -> object 26 : wall.\nobject 22 : signal -> on -> object 14 : pole.\nobject 25 : van -> has -> object 16 : roof.\nobject 25 : van -> has -> object 8 : ladder.\nobject 8 : ladder -> on -> object 16 : roof.\nobject 13 : planter -> by -> object 15 : road.\nobject 23 : stop light -> on -> object 22 : signal.\n\nRegion Description:\nRegion Description at [0.331, 0.852, 0.472, 0.996] : Pole holding traffic light on street.\nRegion Description at [0.600, 0.036, 0.793, 0.084] : Building offers short term office space.\nRegion Description at [0.603, 0.074, 0.776, 0.120] : Office space as small as 2,500 sq. ft. available.\nRegion Description at [0.003, 0.008, 0.972, 0.356] : an office building is in the background.\n\nGlobal Caption:\nA red traffic light with a sad face drawn over it.\nA street scene with a close of of a stop light.\nA red stoplight with a street in the background.\nA stop sign gives traffic a frown face.\nThe sign is now at a red light."}
+{"question_id": 1, "image": "000000131138.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : computer mouse at [0.414, 0.753, 0.470, 0.811].\nObject 1 : cup at [0.350, 0.783, 0.417, 0.906].\nObject 2 : desk at [0.000, 0.488, 0.998, 0.999].\nObject 3 : fork at [0.203, 0.794, 0.270, 0.857].\nObject 4 : glass at [0.277, 0.703, 0.345, 0.816].\nObject 5 : head phones at [0.872, 0.556, 0.993, 0.634].\nObject 6 : keyboard at [0.415, 0.620, 0.650, 0.783].\nObject 7 : lamp at [0.000, 0.302, 0.214, 0.430].\nObject 8 : laptop at [0.491, 0.296, 0.703, 0.540].\nObject 9 : picture at [0.795, 0.204, 0.898, 0.358].\nObject 10 : plant at [0.192, 0.201, 0.391, 0.461].\nObject 11 : plate at [0.183, 0.799, 0.326, 0.896].\nObject 12 : screen at [0.237, 0.249, 0.504, 0.628].\nObject 13 : stand at [0.506, 0.531, 0.663, 0.617].\nObject 14 : window at [0.606, 0.000, 1.000, 0.346].\n\nRelationships:\nobject 0 : computer mouse -> on -> object 2 : desk.\nobject 8 : laptop -> on -> object 13 : stand.\nobject 6 : keyboard -> on -> object 2 : desk.\nobject 9 : picture -> near -> object 14 : window.\nobject 3 : fork -> on -> object 11 : plate.\n\nRegion Description:\n\nGlobal Caption:\na desk with a cup plate laptop monitor and keyboard\nA laptop sitting next to a monitor, keyboard and a mouse.\nA laptop and a desktop monitor are displayed on top of the desk.\nLarge office desk with computers near a window.\nA desk with a laptop, second monitor and keyboard."}
+{"question_id": 2, "image": "000000475150.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : animal at [0.220, 0.105, 1.006, 0.997].\nObject 1 : branches at [0.000, 0.000, 1.000, 1.000].\nObject 2 : ear at [0.402, 0.288, 0.452, 0.378].\nObject 3 : eye at [0.332, 0.396, 0.378, 0.429].\nObject 4 : foliage at [0.584, 0.093, 0.748, 0.255].\nObject 5 : giraffe`s neck at [0.476, 0.264, 1.000, 1.003].\nObject 6 : head at [0.216, 0.102, 0.476, 0.706].\nObject 7 : mane at [0.576, 0.502, 0.836, 0.811].\nObject 8 : nose at [0.222, 0.640, 0.266, 0.703].\nObject 9 : sky at [0.000, 0.000, 1.000, 0.562].\nObject 10 : spot at [0.562, 0.535, 0.616, 0.625].\nObject 11 : spot at [0.560, 0.447, 0.592, 0.508].\nObject 12 : spot at [0.592, 0.444, 0.670, 0.556].\nObject 13 : spot at [0.622, 0.565, 0.694, 0.664].\nObject 14 : spot at [0.514, 0.483, 0.570, 0.571].\nObject 15 : spots at [0.700, 0.640, 0.806, 0.817].\nObject 16 : spots at [0.706, 0.823, 0.776, 0.943].\nObject 17 : spots at [0.852, 0.829, 0.984, 0.997].\nObject 18 : spots at [0.674, 0.547, 0.758, 0.655].\nObject 19 : spots at [0.774, 0.700, 0.902, 0.913].\nObject 20 : tree at [0.000, 0.000, 1.000, 1.000].\nObject 21 : wrinkles at [0.466, 0.468, 0.554, 0.586].\n\nRelationships:\nobject 20 : tree -> has -> object 4 : foliage.\nobject 21 : wrinkles -> on -> object 5 : giraffe`s neck.\nobject 3 : eye -> on a -> object 0 : animal.\nobject 4 : foliage -> in -> object 20 : tree.\nobject 1 : branches -> behind -> object 0 : animal.\nobject 14 : spot -> on -> object 0 : animal.\nobject 11 : spot -> on -> object 0 : animal.\nobject 10 : spot -> on -> object 0 : animal.\nobject 12 : spot -> on -> object 0 : animal.\nobject 13 : spot -> on -> object 0 : animal.\nobject 5 : giraffe`s neck -> on -> object 0 : animal.\nobject 3 : eye -> of -> object 0 : animal.\nobject 2 : ear -> of -> object 0 : animal.\nobject 6 : head -> of -> object 0 : animal.\n\nRegion Description:\nRegion Description at [0.616, 0.565, 0.956, 0.958] : the giraffe is spotted tan and brown.\nRegion Description at [0.288, 0.324, 0.572, 0.649] : the giraffes face is white and brown.\n\nGlobal Caption:\nA giraffe stands near a tree in the wilderness. \nA giraffe standing in front of a group of trees.\nA giraffe standing next to a leaf free tree.\nHead and neck of a giraffe in natural feeding habitat.\nA giraffe walking near a tree with very few leaves."}
+{"question_id": 3, "image": "000000356424.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : bottle at [0.048, 0.712, 0.195, 1.002].\nObject 1 : chair at [0.696, 0.500, 1.003, 0.718].\nObject 2 : cork at [0.053, 0.712, 0.139, 0.776].\nObject 3 : cup at [0.043, 0.736, 0.240, 0.916].\nObject 4 : dish at [0.416, 0.726, 0.856, 0.904].\nObject 5 : fruit at [0.629, 0.834, 0.675, 0.880].\nObject 6 : glass at [0.275, 0.716, 0.501, 0.998].\nObject 7 : glasses at [0.179, 0.242, 0.464, 0.322].\nObject 8 : hair at [0.536, 0.258, 0.656, 0.320].\nObject 9 : man at [0.075, 0.102, 0.704, 0.716].\nObject 10 : rasberries at [0.499, 0.750, 0.544, 0.786].\nObject 11 : raspberries at [0.664, 0.828, 0.741, 0.864].\nObject 12 : sauce at [0.565, 0.752, 0.715, 0.824].\nObject 13 : shirt at [0.600, 0.350, 0.645, 0.494].\nObject 14 : shirt at [0.635, 0.282, 0.997, 0.654].\nObject 15 : sign at [0.419, 0.134, 0.509, 0.184].\nObject 16 : sweater at [0.072, 0.288, 0.704, 0.718].\nObject 17 : table at [0.000, 0.592, 0.997, 1.000].\nObject 18 : window at [0.328, 0.000, 0.600, 0.298].\nObject 19 : woman at [0.531, 0.258, 0.768, 0.688].\n\nRelationships:\nobject 9 : man -> wearing -> object 7 : glasses.\nobject 0 : bottle -> on -> object 17 : table.\nobject 6 : glass -> on -> object 17 : table.\nobject 11 : raspberries -> on -> object 4 : dish.\nobject 9 : man -> wearing -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.640, 0.180, 0.989, 0.530] : Man wearing a black and orange stripe shirt.\nRegion Description at [0.413, 0.136, 0.512, 0.184] : Yellow closed sign with brown letters.\nRegion Description at [0.629, 0.186, 0.995, 0.706] : a man wearing and orange and black striped shirt.\nRegion Description at [0.528, 0.254, 0.717, 0.666] : a woman with a ponytail eating lunch.\nRegion Description at [0.152, 0.238, 0.459, 0.322] : a pair of black wire rimmed eye glasses.\nRegion Description at [0.029, 0.716, 0.243, 0.922] : empty cup that used to contain coffee.\nRegion Description at [0.264, 0.708, 0.867, 0.994] : A plate of food with a glass of water.\n\nGlobal Caption:\nA man sitting in front of a plate of food.\nA man at a wooden table looking at a plate of food.\na man smiling while looking at his plate of food\nA man sitting at a table with a plate filled with food.\nA man looking happily at some dish in front of him."}
+{"question_id": 4, "image": "000000491090.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : arm at [0.313, 0.238, 0.567, 0.512].\nObject 1 : back wheel at [0.107, 0.502, 0.307, 0.720].\nObject 2 : face at [0.430, 0.118, 0.535, 0.218].\nObject 3 : floor at [0.003, 0.380, 0.997, 0.998].\nObject 4 : front light at [0.765, 0.514, 0.890, 0.634].\nObject 5 : front wheel at [0.642, 0.706, 0.997, 0.996].\nObject 6 : garage door at [0.532, 0.002, 0.858, 0.096].\nObject 7 : glasses at [0.422, 0.140, 0.548, 0.168].\nObject 8 : hand at [0.457, 0.450, 0.561, 0.518].\nObject 9 : indicator light at [0.666, 0.578, 0.722, 0.620].\nObject 10 : jeans at [0.241, 0.438, 0.465, 0.712].\nObject 11 : lettering at [0.003, 0.062, 0.302, 0.146].\nObject 12 : license plate at [0.939, 0.594, 1.000, 0.654].\nObject 13 : mirrors at [0.428, 0.320, 0.559, 0.384].\nObject 14 : motorcycle at [0.067, 0.358, 0.989, 1.000].\nObject 15 : person at [0.227, 0.086, 0.765, 0.758].\nObject 16 : sneaker at [0.243, 0.646, 0.342, 0.758].\nObject 17 : sweater at [0.243, 0.192, 0.676, 0.486].\nObject 18 : tail pipe at [0.059, 0.524, 0.257, 0.706].\n\nRelationships:\nobject 15 : person -> has -> object 7 : glasses.\nobject 15 : person -> has -> object 16 : sneaker.\nobject 15 : person -> has -> object 17 : sweater.\nobject 15 : person -> has -> object 17 : sweater.\nobject 15 : person -> has on -> object 10 : jeans.\nobject 14 : motorcycle -> has -> object 5 : front wheel.\nobject 14 : motorcycle -> has -> object 1 : back wheel.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> on -> object 14 : motorcycle.\nobject 14 : motorcycle -> has -> object 18 : tail pipe.\nobject 15 : person -> sitting on -> object 14 : motorcycle.\nobject 15 : person -> wearing -> object 17 : sweater.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 8 : hand.\nobject 15 : person -> has -> object 7 : glasses.\nobject 13 : mirrors -> are on -> object 14 : motorcycle.\nobject 1 : back wheel -> on -> object 14 : motorcycle.\nobject 5 : front wheel -> on -> object 14 : motorcycle.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 2 : face.\nobject 15 : person -> has -> object 0 : arm.\nobject 15 : person -> sitting on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.444, 0.138, 0.521, 0.168] : The eyeglasses the person on the motorcycle is wearing..\nRegion Description at [0.230, 0.640, 0.361, 0.760] : The person on the motorcycle's sneaker..\nRegion Description at [0.297, 0.216, 0.449, 0.404] : The left sleeve of the person's sweater..\nRegion Description at [0.545, 0.254, 0.738, 0.404] : The right sleeve of the person's sweater..\nRegion Description at [0.644, 0.706, 0.997, 0.994] : The front wheel of the motorcycle the person is on..\nRegion Description at [0.102, 0.498, 0.329, 0.692] : The back wheel of the motorcycle the person is on..\nRegion Description at [0.775, 0.518, 0.896, 0.626] : The front light of the motorcycle the person is on..\nRegion Description at [0.439, 0.432, 0.751, 0.522] : The handle bars on the motorcycle the person is on..\nRegion Description at [0.059, 0.516, 0.310, 0.708] : The tail pipe of the motorcycle the person is on..\nRegion Description at [0.663, 0.568, 0.733, 0.634] : small circular orange indicator light.\nRegion Description at [0.056, 0.522, 0.257, 0.706] : stainless steel motorcycle tailpipe .\nRegion Description at 
[0.067, 0.318, 0.992, 0.992] : Black motorcycle with silver accessories.\nRegion Description at [0.636, 0.690, 0.989, 0.992] : Black front wheel and fender of motorcycle.\nRegion Description at [0.243, 0.640, 0.353, 0.754] : Black and white shoe of man on motorcycle.\n\nGlobal Caption:\nA man sitting on one of a group of motorcycles.\nA MAN IS SMILING SITTING ON A MOTOR BIKE \nA middle-aged man leans on a sports bike, smiling\nA person sits on top of a motorcycle with others.\nA woman riding on the back of a motorcycle."}
+{"question_id": 5, "image": "000000484415.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : arm at [0.000, 0.125, 0.609, 0.988].\nObject 1 : bathroom tile at [0.009, 0.008, 0.994, 0.446].\nObject 2 : blue jeans at [0.369, 0.558, 0.722, 0.979].\nObject 3 : brush at [0.681, 0.208, 0.878, 0.500].\nObject 4 : brush holder at [0.716, 0.279, 0.891, 0.554].\nObject 5 : button at [0.519, 0.113, 0.584, 0.171].\nObject 6 : flusher at [0.534, 0.092, 0.628, 0.300].\nObject 7 : hand at [0.281, 0.125, 0.603, 0.562].\nObject 8 : holder at [0.713, 0.283, 0.903, 0.558].\nObject 9 : lid at [0.028, 0.046, 0.694, 0.446].\nObject 10 : man at [0.000, 0.133, 0.600, 0.992].\nObject 11 : seat at [0.138, 0.583, 0.722, 0.992].\nObject 12 : tank at [0.019, 0.021, 0.706, 0.579].\nObject 13 : tile at [0.794, 0.000, 1.000, 0.200].\nObject 14 : tile at [0.000, 0.000, 0.278, 0.129].\nObject 15 : toilet at [0.016, 0.042, 0.719, 0.996].\nObject 16 : toilet scrubber at [0.744, 0.192, 0.844, 0.521].\nObject 17 : toilet seat at [0.103, 0.517, 0.728, 0.996].\nObject 18 : wall at [0.659, 0.000, 0.978, 0.392].\nObject 19 : water at [0.369, 0.738, 0.500, 0.921].\n\nRelationships:\nobject 15 : toilet -> has -> object 11 : seat.\nobject 4 : brush holder -> by -> object 15 : toilet.\nobject 19 : water -> in -> object 15 : toilet.\nobject 6 : flusher -> on -> object 15 : toilet.\nobject 9 : lid -> on -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> has -> object 7 : hand.\nobject 0 : arm -> on -> object 15 : toilet.\nobject 14 : tile -> on -> object 18 : wall.\n\nRegion Description:\nRegion Description at [0.000, 0.046, 0.716, 0.987] : the arm reaching for the white toilet bowl.\nRegion Description at [0.716, 0.192, 0.894, 0.550] : the container and the toilet brush cleaner.\nRegion Description at [0.009, 0.042, 0.894, 0.992] : the toilet bowl next to the toilet bowl cleaner.\nRegion Description at [0.534, 0.087, 0.666, 0.329] : The hand is on the flusher in the image .\nRegion Description at [0.053, 0.158, 0.903, 0.875] : Porcelain toilet with flusher on top of the lid .\nRegion Description at [0.094, 0.154, 0.856, 0.942] : Man flushing the toilet in the bathroom .\n\nGlobal Caption:\nA hand is reaching out to the top if a toilet. \nA person flushing a toilet with a motion sensor.\nA person's hand flushing a toilet with a button on top of the tank. \na persons hand reaching for the top of a toilet\nA hand is reaching over a white toilet."}
+{"question_id": 7, "image": "000000184324.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : awning at [0.514, 0.500, 0.736, 0.545].\nObject 1 : bag at [0.086, 0.723, 0.124, 0.777].\nObject 2 : bicycle at [0.716, 0.660, 0.756, 0.738].\nObject 3 : bikes at [0.710, 0.753, 0.864, 0.934].\nObject 4 : black jacket at [0.052, 0.569, 0.120, 0.723].\nObject 5 : blue jeans at [0.654, 0.678, 0.672, 0.729].\nObject 6 : building at [0.540, 0.042, 0.760, 0.617].\nObject 7 : building at [0.706, 0.000, 0.998, 0.726].\nObject 8 : bus at [0.186, 0.491, 0.246, 0.608].\nObject 9 : car at [0.432, 0.557, 0.538, 0.636].\nObject 10 : cars at [0.130, 0.491, 0.756, 0.630].\nObject 11 : coat at [0.128, 0.602, 0.212, 0.798].\nObject 12 : cross walk at [0.428, 0.750, 0.954, 1.000].\nObject 13 : cyclist at [0.752, 0.614, 0.860, 0.792].\nObject 14 : lines at [0.432, 0.608, 0.948, 1.000].\nObject 15 : man at [0.052, 0.518, 0.132, 0.898].\nObject 16 : people at [0.000, 0.515, 0.212, 1.000].\nObject 17 : people at [0.754, 0.605, 0.858, 0.756].\nObject 18 : pole at [0.954, 0.699, 0.970, 0.777].\nObject 19 : road at [0.004, 0.545, 1.000, 1.000].\nObject 20 : scarf at [0.032, 0.873, 0.134, 0.997].\nObject 21 : sidewalk at [0.536, 0.572, 0.668, 0.623].\nObject 22 : sign at [0.482, 0.470, 0.494, 0.494].\nObject 23 : sign at [0.810, 0.407, 0.970, 0.497].\nObject 24 : sign at [0.584, 0.434, 0.614, 0.494].\nObject 25 : store at [0.806, 0.395, 0.968, 0.720].\nObject 26 : street light at [0.640, 0.461, 0.652, 0.485].\nObject 27 : stripes at [0.452, 0.620, 0.944, 0.982].\nObject 28 : tires at [0.712, 0.747, 0.864, 0.931].\nObject 29 : tree at [0.280, 0.358, 0.340, 0.569].\nObject 30 : van at [0.460, 0.545, 0.488, 0.566].\nObject 31 : window at [0.820, 0.217, 0.884, 0.358].\nObject 32 : windshield at [0.192, 0.512, 0.242, 0.548].\nObject 33 : woman at [0.128, 0.569, 0.212, 0.913].\nObject 34 : woman at [0.650, 0.593, 0.688, 0.729].\nObject 35 : woman at [0.020, 0.765, 0.168, 1.000].\nObject 36 : writing at [0.838, 0.422, 0.948, 0.482].\n\nRelationships:\nobject 3 : bikes -> are on -> object 19 : road.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 17 : people -> are riding -> object 3 : bikes.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 17 : people -> are on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 12 : cross walk -> being used by a -> object 13 : cyclist.\nobject 17 : people -> are using -> object 12 : cross walk.\nobject 0 : awning -> above -> object 21 : sidewalk.\nobject 10 : cars -> are on -> object 19 : road.\nobject 26 : street light -> on -> object 6 : building.\nobject 27 : stripes -> on -> object 12 : cross walk.\nobject 7 : building -> has a -> object 31 : window.\nobject 3 : bikes -> have -> object 28 : tires.\nobject 35 : woman -> wearing a -> object 20 : scarf.\nobject 23 : sign -> for -> object 25 : store.\nobject 33 : woman -> wearing a -> object 11 : coat.\nobject 34 : woman -> wearing -> object 5 : blue jeans.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 14 : lines -> are on -> object 19 : road.\nobject 15 : man -> wearing a -> object 4 : black jacket.\nobject 30 : van -> on -> object 19 : road.\nobject 15 : man -> has a -> object 1 : bag.\nobject 8 : bus -> has a -> object 32 : windshield.\nobject 7 : building -> has a -> object 31 : window.\nobject 31 : window -> above -> object 23 : sign.\nobject 14 : lines -> are on -> object 19 : road.\nobject 18 : 
pole -> near -> object 7 : building.\nobject 35 : woman -> wearing a -> object 20 : scarf.\n\nRegion Description:\nRegion Description at [0.822, 0.395, 0.968, 0.500] : red writing above buisness along the street.\nRegion Description at [0.564, 0.771, 0.876, 0.991] : white stripes painted to indicate cross walk.\nRegion Description at [0.184, 0.485, 0.244, 0.605] : large white vehicle with big windshield.\nRegion Description at [0.478, 0.464, 0.492, 0.491] : blue street sign with a white P on it.\nRegion Description at [0.820, 0.220, 0.886, 0.370] : window on the building above red sign.\n\nGlobal Caption:\nA group of people walking across a busy city street.\nA fish eye lens shows the corner of a busy city street with bikes, people and buildings.\na number of people and cars on a city street\nAn oddly taken photo of some buildings and shops.\nA picture of a city intersection with period buildings and store fronts. "}
+{"question_id": 8, "image": "000000341058.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : napkins at [0.541, 0.818, 0.601, 0.858].\nObject 1 : pepper at [0.598, 0.836, 0.623, 0.860].\nObject 2 : post at [0.673, 0.494, 0.712, 0.926].\nObject 3 : restaurant sign at [0.548, 0.180, 0.779, 0.344].\nObject 4 : salt at [0.619, 0.838, 0.633, 0.850].\nObject 5 : shaker at [0.594, 0.822, 0.619, 0.854].\nObject 6 : shaker at [0.612, 0.824, 0.637, 0.854].\nObject 7 : table at [0.448, 0.834, 0.925, 0.998].\n\nRelationships:\nobject 4 : salt -> in -> object 6 : shaker.\nobject 0 : napkins -> on -> object 7 : table.\nobject 3 : restaurant sign -> on -> object 2 : post.\n\nRegion Description:\n\nGlobal Caption:\nThis is an empty table at a restaurant with ships in the background.\nThis table is covered by a blue Sam Adams umbrella\nAdvertising sign above a patio umbrella on sunny day.\nA lamp post stands next to an umbrella and table.\nAn umbrella is opened over an outdoor table."}
+{"question_id": 9, "image": "000000184384.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : blueberry at [0.306, 0.312, 0.400, 0.429].\nObject 1 : butter at [0.454, 0.024, 0.638, 0.288].\nObject 2 : cake at [0.238, 0.093, 0.786, 0.787].\nObject 3 : cup at [0.002, 0.000, 0.202, 0.667].\nObject 4 : cup at [0.140, 0.008, 0.336, 0.456].\nObject 5 : egg at [0.636, 0.125, 0.880, 0.267].\nObject 6 : food at [0.632, 0.123, 0.996, 0.336].\nObject 7 : lemon at [0.514, 0.728, 0.798, 0.997].\nObject 8 : melon at [0.308, 0.768, 0.658, 0.997].\nObject 9 : orange at [0.514, 0.733, 0.794, 0.997].\nObject 10 : parsley at [0.372, 0.515, 0.762, 0.965].\nObject 11 : plate at [0.166, 0.453, 1.000, 1.000].\nObject 12 : plate at [0.628, 0.120, 0.998, 0.389].\nObject 13 : sausage at [0.766, 0.248, 0.984, 0.333].\nObject 14 : spot at [0.766, 0.600, 0.790, 0.637].\nObject 15 : table at [0.002, 0.365, 0.998, 0.997].\nObject 16 : water at [0.000, 0.000, 0.202, 0.667].\n\nRelationships:\nobject 7 : lemon -> on -> object 11 : plate.\nobject 10 : parsley -> on -> object 11 : plate.\nobject 6 : food -> on -> object 12 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 11 : plate -> has -> object 14 : spot.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 13 : sausage -> on -> object 12 : plate.\nobject 0 : blueberry -> on -> object 2 : cake.\nobject 5 : egg -> on -> object 12 : plate.\nobject 8 : melon -> on -> object 11 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 2 : cake -> on -> object 11 : plate.\nobject 16 : water -> in -> object 3 : cup.\nobject 13 : sausage -> on -> object 12 : plate.\n\nRegion Description:\nRegion Description at [0.678, 0.104, 0.942, 0.424] : There is food on the plate in the back.\nRegion Description at [0.456, 0.013, 0.636, 0.307] : White frosting on top of a piece of cake.\nRegion Description at [0.322, 0.752, 0.650, 0.997] : square of honey dew on a white plate.\n\nGlobal Caption:\nA bluebery cake is on a plate and is topped with butter.\nA piece of cake with butter on it sits next to an orange slice. \nA large piece of blueberry cake on a plate.\nA plate of food attractively arranged on a table.\nA plate of blueberry coffee cake with butter and an orange slice on a table with breakfast foods."}
+{"question_id": 10, "image": "000000259097.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : abs at [0.392, 0.628, 0.426, 0.664].\nObject 1 : arm at [0.416, 0.459, 0.432, 0.565].\nObject 2 : buildings at [0.242, 0.532, 0.640, 0.580].\nObject 3 : frisbee at [0.400, 0.354, 0.446, 0.381].\nObject 4 : grass at [0.000, 0.610, 0.998, 0.997].\nObject 5 : hand at [0.418, 0.423, 0.438, 0.474].\nObject 6 : legs at [0.420, 0.703, 0.456, 0.811].\nObject 7 : man at [0.390, 0.432, 0.466, 0.793].\nObject 8 : pants at [0.390, 0.658, 0.424, 0.763].\nObject 9 : shadow at [0.492, 0.724, 0.622, 0.994].\nObject 10 : shirt at [0.402, 0.468, 0.458, 0.649].\nObject 11 : sky at [0.002, 0.003, 0.996, 0.556].\nObject 12 : trees at [0.002, 0.498, 0.998, 0.646].\n\nRelationships:\nobject 7 : man -> tossing -> object 3 : frisbee.\nobject 7 : man -> has -> object 6 : legs.\nobject 7 : man -> playing -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> has -> object 5 : hand.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> jumping -> object 4 : grass.\nobject 2 : buildings -> behind -> object 4 : grass.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> extending -> object 1 : arm.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> exposing -> object 0 : abs.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 1 : arm -> reaching for -> object 3 : frisbee.\n\nRegion Description:\nRegion Description at [0.394, 0.658, 0.480, 0.826] : A person wearing black color trouser.\nRegion Description at [0.394, 0.435, 0.460, 0.796] : man in a red sweatshirt and jeans jumping.\nRegion Description at [0.390, 0.357, 0.464, 0.823] : man catching a frisbee in a wheat field.\nRegion Description at [0.012, 0.520, 0.996, 0.631] : trees and a village on a hill in the distance.\nRegion Description at [0.390, 0.423, 0.464, 0.649] : arm straight up and arm bent at elbow.\n\nGlobal Caption:\nA person trying to reach a Frisbee in a field with high brown grass.\nA young boy in a red top is playing with a red object tossed in the sky.\nA young man in a red jacket jumping for a Frizbee in a field.\nA guy is jumping to catch a frisbee in tall grass.\nA man jumps to catch a Frisbee flying through the air."}
+{"question_id": 11, "image": "000000377882.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : blue sky at [0.000, 0.000, 0.998, 0.317].\nObject 1 : boat at [0.000, 0.461, 0.354, 0.579].\nObject 2 : boat at [0.348, 0.501, 0.874, 0.789].\nObject 3 : boat at [0.302, 0.461, 0.684, 0.611].\nObject 4 : buildings at [0.692, 0.195, 0.718, 0.248].\nObject 5 : buildings at [0.888, 0.173, 0.922, 0.227].\nObject 6 : buildings at [0.582, 0.211, 0.610, 0.256].\nObject 7 : buildings at [0.180, 0.259, 0.202, 0.293].\nObject 8 : buildings at [0.466, 0.208, 0.518, 0.272].\nObject 9 : chain-link fence at [0.002, 0.176, 0.996, 0.995].\nObject 10 : cord at [0.412, 0.587, 0.626, 1.000].\nObject 11 : fence pole at [0.230, 0.227, 0.336, 1.000].\nObject 12 : grass at [0.000, 0.667, 0.756, 0.997].\nObject 13 : horizon at [0.000, 0.187, 1.000, 0.336].\nObject 14 : mast at [0.570, 0.000, 0.722, 0.571].\nObject 15 : rack at [0.754, 0.168, 1.000, 0.901].\nObject 16 : sail post at [0.586, 0.000, 0.628, 0.568].\nObject 17 : section at [0.272, 0.179, 0.994, 0.992].\nObject 18 : shelf at [0.762, 0.355, 1.000, 0.387].\nObject 19 : sky line at [0.012, 0.173, 0.994, 0.195].\nObject 20 : surfboard at [0.830, 0.448, 0.996, 0.552].\nObject 21 : surfboard at [0.420, 0.384, 0.502, 0.411].\nObject 22 : surfboard at [0.910, 0.768, 0.998, 0.877].\nObject 23 : surfboard at [0.430, 0.344, 0.508, 0.371].\nObject 24 : surfboard at [0.830, 0.565, 1.000, 0.712].\nObject 25 : surfboard at [0.322, 0.307, 0.450, 0.341].\nObject 26 : surfboard at [0.766, 0.251, 0.998, 0.368].\nObject 27 : surfboard at [0.764, 0.704, 0.998, 0.829].\nObject 28 : water at [0.000, 0.259, 1.000, 0.469].\nObject 29 : water way at [0.008, 0.272, 0.996, 0.432].\n\nRelationships:\nobject 25 : surfboard -> stacked on -> object 18 : shelf.\nobject 24 : surfboard -> stacked on -> object 18 : shelf.\nobject 20 : surfboard -> stacked on -> object 18 : shelf.\nobject 26 : surfboard -> stacked on -> object 18 : shelf.\nobject 15 : rack -> of -> object 20 : surfboard.\nobject 8 : buildings -> on -> object 13 : horizon.\nobject 6 : buildings -> on -> object 13 : horizon.\nobject 4 : buildings -> on -> object 13 : horizon.\nobject 7 : buildings -> on -> object 13 : horizon.\nobject 5 : buildings -> on -> object 13 : horizon.\nobject 14 : mast -> on -> object 2 : boat.\nobject 9 : chain-link fence -> near -> object 29 : water way.\nobject 17 : section -> of -> object 9 : chain-link fence.\n\nRegion Description:\nRegion Description at [0.020, 0.187, 0.972, 0.963] : boats and surfboards behind wire fencing.\nRegion Description at [0.000, 0.160, 0.990, 0.349] : trees and buildings on other side of water.\nRegion Description at [0.340, 0.493, 0.852, 0.613] : white covering pulled over top of boat.\nRegion Description at [0.010, 0.667, 0.516, 0.995] : green bushes beside the chain link fence.\nRegion Description at [0.018, 0.213, 0.992, 0.995] : Black chain link fence enclosing boats..\nRegion Description at [0.242, 0.211, 0.302, 0.989] : Black fence pole holding chain link fence..\nRegion Description at [0.374, 0.499, 0.804, 0.803] : Yellow and white boat with sail pole..\nRegion Description at [0.014, 0.181, 0.998, 0.296] : Skyline of gray buildings in the background..\nRegion Description at [0.000, 0.664, 0.994, 0.976] : Green shrubs growing along side of a lake..\nRegion Description at [0.774, 0.216, 0.996, 0.944] : Boat parts on an outdoor shelving unit..\nRegion Description at [0.006, 0.013, 0.150, 0.285] : Sail masks with no flag attached to them..\n\nGlobal 
Caption:\nBoats docked on land sitting side by side next to a lake.\nA small harbor with boats docked and on racks\nA collection of boats behind a fence by a body of water.\nBoats and surfboards docked at a harbor bay.\n\nMany boats as seen through a chain link fence."}
+{"question_id": 12, "image": "000000415748.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : building at [0.000, 0.086, 0.697, 0.516].\nObject 1 : elephant at [0.084, 0.438, 0.727, 0.954].\nObject 2 : face at [0.411, 0.442, 0.670, 0.664].\nObject 3 : ground at [0.000, 0.742, 0.165, 0.998].\nObject 4 : man at [0.186, 0.246, 0.631, 0.516].\nObject 5 : shadow at [0.477, 0.812, 1.000, 0.958].\nObject 6 : sky at [0.006, 0.000, 0.228, 0.200].\nObject 7 : toe at [0.372, 0.900, 0.411, 0.924].\nObject 8 : tusk at [0.462, 0.670, 0.489, 0.692].\n\nRelationships:\nobject 4 : man -> on -> object 1 : elephant.\nobject 7 : toe -> of -> object 1 : elephant.\nobject 4 : man -> near -> object 0 : building.\nobject 4 : man -> on -> object 1 : elephant.\nobject 4 : man -> near -> object 1 : elephant.\nobject 8 : tusk -> on -> object 2 : face.\nobject 5 : shadow -> of -> object 1 : elephant.\nobject 5 : shadow -> on -> object 3 : ground.\nobject 4 : man -> close to -> object 0 : building.\nobject 0 : building -> close to -> object 1 : elephant.\n\nRegion Description:\nRegion Description at [0.411, 0.482, 0.634, 0.788] : elephant's face and trunk are painted.\n\nGlobal Caption:\nA man riding on the back of an elephant through a city street.\nMan riding on the back of a painted elephant. \nA man in colorful clothing riding a painted elephant.\na man in a white shirt is riding an elephant and some buildings\nAn old decorated elephant and its colorful rider"}
+{"question_id": 13, "image": "000000408120.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : alley at [0.052, 0.261, 0.948, 0.997].\nObject 1 : bars at [0.050, 0.000, 0.400, 0.682].\nObject 2 : black tire at [0.500, 0.219, 0.522, 0.249].\nObject 3 : brick at [0.784, 0.105, 0.818, 0.144].\nObject 4 : bricks at [0.926, 0.165, 0.946, 0.195].\nObject 5 : building at [0.742, 0.000, 0.954, 0.796].\nObject 6 : car at [0.418, 0.168, 0.526, 0.240].\nObject 7 : concrete at [0.394, 0.565, 0.570, 0.718].\nObject 8 : corner at [0.850, 0.934, 0.950, 1.000].\nObject 9 : curb at [0.050, 0.264, 0.396, 0.868].\nObject 10 : fence at [0.686, 0.252, 0.826, 0.565].\nObject 11 : flower at [0.580, 0.078, 0.608, 0.123].\nObject 12 : flowers at [0.598, 0.072, 0.634, 0.105].\nObject 13 : girl at [0.444, 0.249, 0.500, 0.480].\nObject 14 : photo at [0.044, 0.000, 0.956, 0.997].\nObject 15 : plants at [0.040, 0.324, 0.224, 0.685].\nObject 16 : polka dot at [0.430, 0.231, 0.450, 0.261].\nObject 17 : road at [0.048, 0.243, 0.954, 0.994].\nObject 18 : shirt at [0.456, 0.279, 0.496, 0.390].\nObject 19 : shoe at [0.484, 0.441, 0.496, 0.459].\nObject 20 : shoe at [0.452, 0.459, 0.470, 0.489].\nObject 21 : umbrella at [0.404, 0.189, 0.528, 0.297].\nObject 22 : wall at [0.738, 0.003, 0.950, 0.760].\nObject 23 : wall window at [0.524, 0.000, 0.538, 0.060].\nObject 24 : window at [0.570, 0.003, 0.586, 0.051].\nObject 25 : window at [0.524, 0.102, 0.538, 0.150].\n\nRelationships:\nobject 13 : girl -> with -> object 19 : shoe.\nobject 13 : girl -> with -> object 20 : shoe.\nobject 13 : girl -> with -> object 18 : shirt.\nobject 4 : bricks -> on -> object 5 : building.\nobject 15 : plants -> are near -> object 0 : alley.\nobject 6 : car -> on -> object 17 : road.\nobject 8 : corner -> of an -> object 0 : alley.\nobject 15 : plants -> in front of -> object 14 : photo.\nobject 21 : umbrella -> on -> object 13 : girl.\nobject 9 : curb -> built alongside -> object 17 : road.\n\nRegion Description:\nRegion Description at [0.038, 0.426, 0.162, 0.526] : patch of green plants in front of photo.\nRegion Description at [0.586, 0.060, 0.678, 0.138] : purple flowers inside of bush on right.\n\nGlobal Caption:\nA little girl that is standing with an umbrella.\nA little girl walking down a driveway carrying a pink umbrella.\nA LITTLE GIRL DRESSED IN PINK ALSO HAS A PINK UMBRELLA\nA small girl is holding an umbrella over her head\nA young girl carries and open unbrella while walking down an alley."}
+{"question_id": 14, "image": "000000184400.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : air conditioner at [0.004, 0.261, 0.018, 0.293].\nObject 1 : balcony at [0.048, 0.037, 0.100, 0.077].\nObject 2 : beam at [0.616, 0.621, 0.664, 0.824].\nObject 3 : beam at [0.490, 0.640, 0.532, 0.832].\nObject 4 : beam at [0.426, 0.640, 0.462, 0.835].\nObject 5 : bridge at [0.002, 0.608, 0.988, 0.877].\nObject 6 : bridge at [0.004, 0.453, 1.000, 0.867].\nObject 7 : building at [0.000, 0.000, 0.252, 0.469].\nObject 8 : bushes at [0.000, 0.939, 0.072, 0.997].\nObject 9 : colors at [0.194, 0.480, 0.330, 0.661].\nObject 10 : column at [0.618, 0.824, 0.676, 0.997].\nObject 11 : guard rails at [0.000, 0.496, 1.000, 0.624].\nObject 12 : light at [0.606, 0.192, 0.724, 0.243].\nObject 13 : light at [0.864, 0.947, 0.916, 1.000].\nObject 14 : metal support at [0.002, 0.603, 0.976, 0.995].\nObject 15 : pole at [0.700, 0.205, 0.724, 0.995].\nObject 16 : red line at [0.632, 0.851, 0.648, 0.995].\nObject 17 : sky at [0.250, 0.013, 1.000, 0.467].\nObject 18 : south west at [0.338, 0.616, 0.442, 0.651].\nObject 19 : street at [0.002, 0.861, 1.000, 0.997].\nObject 20 : train at [0.002, 0.408, 1.000, 0.683].\nObject 21 : window at [0.144, 0.013, 0.182, 0.064].\nObject 22 : window at [0.430, 0.485, 0.534, 0.595].\nObject 23 : window at [0.134, 0.091, 0.182, 0.155].\nObject 24 : window at [0.340, 0.504, 0.424, 0.613].\nObject 25 : window at [0.116, 0.944, 0.168, 1.000].\nObject 26 : windows at [0.762, 0.437, 0.920, 0.613].\nObject 27 : windows at [0.004, 0.000, 0.096, 0.088].\n\nRelationships:\nobject 10 : column -> supporting -> object 6 : bridge.\nobject 10 : column -> has -> object 16 : red line.\nobject 12 : light -> on -> object 15 : pole.\nobject 7 : building -> behind -> object 20 : train.\nobject 21 : window -> on -> object 7 : building.\nobject 1 : balcony -> on -> object 7 : building.\nobject 25 : window -> visible under -> object 5 : bridge.\nobject 12 : light -> on -> object 19 : street.\nobject 2 : beam -> of -> object 5 : bridge.\nobject 20 : train -> in -> object 9 : colors.\nobject 24 : window -> of -> object 20 : train.\nobject 22 : window -> of train -> object 20 : train.\nobject 5 : bridge -> on -> object 20 : train.\nobject 7 : building -> beside -> object 20 : train.\nobject 23 : window -> of -> object 7 : building.\nobject 12 : light -> on a -> object 15 : pole.\nobject 12 : light -> on -> object 15 : pole.\nobject 20 : train -> says -> object 18 : south west.\nobject 8 : bushes -> are in -> object 19 : street.\nobject 7 : building -> has many -> object 27 : windows.\nobject 7 : building -> has -> object 0 : air conditioner.\nobject 20 : train -> on -> object 6 : bridge.\nobject 12 : light -> in -> object 19 : street.\nobject 5 : bridge -> has -> object 11 : guard rails.\nobject 26 : windows -> on -> object 20 : train.\nobject 20 : train -> has -> object 18 : south west.\nobject 6 : bridge -> has -> object 14 : metal support.\nobject 9 : colors -> to -> object 20 : train.\n\nRegion Description:\nRegion Description at [0.602, 0.837, 0.696, 0.997] : a metal support column for the bridge.\n\nGlobal Caption:\nA train as it travels down the tracks over a bridge.\na colorful train going along an elevated track \nA train rides on a bridge past a building.\nA subway train that is passing over a train bridge.\na train on a train track on an elevated bridge"}
+{"question_id": 15, "image": "000000276018.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : animal at [0.717, 0.042, 0.831, 0.152].\nObject 1 : animal at [0.114, 0.582, 0.348, 0.840].\nObject 2 : baby at [0.385, 0.034, 0.643, 0.434].\nObject 3 : baby at [0.911, 0.028, 1.000, 0.250].\nObject 4 : bear at [0.391, 0.506, 0.622, 0.714].\nObject 5 : bear at [0.695, 0.356, 0.868, 0.580].\nObject 6 : bear hand at [0.114, 0.630, 0.175, 0.660].\nObject 7 : black sock at [0.800, 0.796, 0.858, 0.834].\nObject 8 : blonde boy at [0.166, 0.170, 0.351, 0.460].\nObject 9 : boy at [0.102, 0.388, 0.498, 1.000].\nObject 10 : boy at [0.717, 0.188, 1.000, 0.864].\nObject 11 : child at [0.342, 0.390, 0.622, 1.000].\nObject 12 : coat at [0.077, 0.520, 0.495, 0.910].\nObject 13 : coat at [0.775, 0.296, 1.000, 0.616].\nObject 14 : coat at [0.397, 0.090, 0.634, 0.262].\nObject 15 : flip flops at [0.434, 0.756, 0.606, 0.910].\nObject 16 : girl at [0.372, 0.196, 0.603, 0.922].\nObject 17 : glasses at [0.191, 0.236, 0.308, 0.250].\nObject 18 : grass at [0.637, 0.652, 0.754, 0.788].\nObject 19 : hand at [0.714, 0.094, 0.788, 0.160].\nObject 20 : hands at [0.763, 0.380, 0.877, 0.430].\nObject 21 : hat at [0.757, 0.030, 0.889, 0.078].\nObject 22 : jacket at [0.357, 0.500, 0.622, 0.782].\nObject 23 : jacket at [0.422, 0.286, 0.603, 0.550].\nObject 24 : jacket at [0.163, 0.296, 0.320, 0.462].\nObject 25 : jacket at [0.911, 0.106, 1.000, 0.224].\nObject 26 : lady at [0.286, 0.000, 0.683, 0.560].\nObject 27 : man at [0.628, 0.030, 0.951, 0.742].\nObject 28 : shirt at [0.831, 0.306, 0.957, 0.404].\nObject 29 : shirt at [0.197, 0.296, 0.298, 0.370].\nObject 30 : shoe at [0.717, 0.804, 0.871, 0.864].\nObject 31 : sidewalk at [0.628, 0.574, 0.769, 0.632].\nObject 32 : stuffed animal at [0.286, 0.298, 0.517, 0.422].\n\nRelationships:\nobject 10 : boy -> wearing -> object 28 : shirt.\nobject 3 : baby -> wearing -> object 25 : jacket.\nobject 22 : jacket -> carrying -> object 4 : bear.\nobject 8 : blonde boy -> wears -> object 17 : glasses.\nobject 8 : blonde boy -> wears -> object 24 : jacket.\nobject 11 : child -> holding up -> object 32 : stuffed animal.\nobject 10 : boy -> holding up -> object 5 : bear.\nobject 30 : shoe -> with a -> object 7 : black sock.\nobject 10 : boy -> wearing -> object 7 : black sock.\nobject 26 : lady -> holding -> object 2 : baby.\nobject 16 : girl -> wearing -> object 15 : flip flops.\nobject 9 : boy -> wearing -> object 12 : coat.\nobject 10 : boy -> wearing a -> object 13 : coat.\nobject 4 : bear -> on -> object 20 : hands.\nobject 26 : lady -> carrying -> object 2 : baby.\nobject 0 : animal -> in -> object 19 : hand.\n\nRegion Description:\nRegion Description at [0.905, 0.020, 0.997, 0.272] : blonde haired baby wearing yellow jacket.\nRegion Description at [0.357, 0.388, 0.640, 0.730] : girl in blue jacket carrying blue dog.\nRegion Description at [0.071, 0.378, 0.498, 0.842] : boy in black jacket holding stuffed dog.\nRegion Description at [0.055, 0.572, 0.375, 0.846] : brown stuffed dog with red and white collar.\nRegion Description at [0.283, 0.194, 0.603, 0.400] : girl in pink jacket holding white stuffed animal.\nRegion Description at [0.695, 0.356, 0.874, 0.576] : White stuffed animal wearing a red jacket..\nRegion Description at [0.332, 0.394, 0.618, 0.992] : Little girl holding a grey stuffed dog..\nRegion Description at [0.372, 0.476, 0.723, 0.786] : little girl holding blue and white stuffed animal.\nRegion Description at [0.062, 0.556, 0.422, 0.840] : little boy holding 
brown and white stuffed animal.\n\nGlobal Caption:\na bunch of kids walking through some grass\nA group of children are holding various stuffed animals and dolls.\nKids walking while holding their stuffed animals. \nA group of kids holding teddy bears and looking happy.\nA group of children carrying stuffed animals walks across the grass. "}
+{"question_id": 16, "image": "000000376322.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : beer at [0.404, 0.568, 0.505, 0.724].\nObject 1 : cell phone at [0.128, 0.726, 0.332, 0.784].\nObject 2 : decanter at [0.417, 0.416, 0.503, 0.574].\nObject 3 : fork at [0.142, 0.852, 0.553, 0.964].\nObject 4 : fork at [0.174, 0.616, 0.414, 0.660].\nObject 5 : fork at [0.107, 0.882, 0.545, 0.998].\nObject 6 : glass at [0.401, 0.568, 0.508, 0.726].\nObject 7 : glass at [0.773, 0.622, 0.880, 0.796].\nObject 8 : glasses at [0.013, 0.342, 0.139, 0.376].\nObject 9 : green shirt at [0.698, 0.376, 0.909, 0.620].\nObject 10 : hair at [0.607, 0.336, 0.743, 0.422].\nObject 11 : hair at [0.824, 0.244, 1.000, 0.474].\nObject 12 : man at [0.668, 0.252, 0.909, 0.622].\nObject 13 : man at [0.000, 0.304, 0.136, 0.808].\nObject 14 : plate at [0.102, 0.780, 0.404, 0.898].\nObject 15 : silver spoon at [0.698, 0.882, 0.799, 0.998].\nObject 16 : table at [0.000, 0.428, 0.997, 0.998].\nObject 17 : wall at [0.535, 0.194, 0.997, 0.370].\nObject 18 : watch at [0.570, 0.482, 0.596, 0.508].\nObject 19 : watch at [0.888, 0.486, 0.949, 0.514].\nObject 20 : white plate at [0.361, 0.712, 0.805, 0.860].\nObject 21 : woman at [0.813, 0.242, 1.000, 0.582].\nObject 22 : woman at [0.532, 0.338, 0.765, 0.550].\n\nRelationships:\nobject 21 : woman -> with -> object 11 : hair.\nobject 9 : green shirt -> on -> object 12 : man.\nobject 14 : plate -> on -> object 16 : table.\nobject 1 : cell phone -> on -> object 16 : table.\nobject 5 : fork -> on -> object 16 : table.\nobject 5 : fork -> on -> object 16 : table.\nobject 3 : fork -> on -> object 16 : table.\nobject 4 : fork -> on -> object 16 : table.\nobject 2 : decanter -> on -> object 16 : table.\nobject 12 : man -> wearing a -> object 9 : green shirt.\nobject 21 : woman -> wearing a -> object 19 : watch.\nobject 22 : woman -> wearing a -> object 18 : watch.\nobject 13 : man -> wearing -> object 8 : glasses.\nobject 10 : hair -> on -> object 22 : woman.\nobject 22 : woman -> at -> object 16 : table.\n\nRegion Description:\nRegion Description at [0.353, 0.700, 0.802, 0.860] : a round plate with six pieces of bread and two butter pats.\nRegion Description at [0.096, 0.778, 0.404, 0.892] : a plate with one slice of bread and one butter pat.\nRegion Description at [0.890, 0.698, 0.997, 0.992] : glass of red wine closest to the camera.\nRegion Description at [0.366, 0.710, 0.805, 0.856] : the round white plate under the bread and butter.\n\nGlobal Caption:\nA group of people are reading a menu at the table\nA group of people sit at a large table while talking.\nPeople sitting on the long table with plates of food. \nA long table full of people on both sides.\nA long table accommodating many people while eating"}
+{"question_id": 17, "image": "000000125472.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : axle at [0.447, 0.814, 0.535, 0.856].\nObject 1 : background at [0.003, 0.744, 0.994, 0.988].\nObject 2 : bracelet at [0.820, 0.444, 0.859, 0.470].\nObject 3 : building at [0.012, 0.888, 0.099, 0.994].\nObject 4 : corner at [0.027, 0.890, 0.117, 0.992].\nObject 5 : fence at [0.030, 0.886, 1.000, 1.000].\nObject 6 : hair at [0.486, 0.078, 0.712, 0.216].\nObject 7 : jean pants at [0.246, 0.380, 0.841, 0.632].\nObject 8 : laces at [0.168, 0.562, 0.850, 0.674].\nObject 9 : logo at [0.429, 0.232, 0.583, 0.364].\nObject 10 : man at [0.201, 0.002, 0.940, 0.758].\nObject 11 : name at [0.000, 0.960, 0.321, 1.000].\nObject 12 : picture at [0.003, 0.004, 1.000, 0.998].\nObject 13 : poles at [0.180, 0.886, 0.432, 0.990].\nObject 14 : shirt at [0.324, 0.124, 0.694, 0.392].\nObject 15 : shoes at [0.189, 0.606, 0.946, 0.792].\nObject 16 : skateboard at [0.012, 0.746, 0.664, 0.886].\nObject 17 : sky at [0.012, 0.002, 1.000, 0.918].\nObject 18 : stadium lights at [0.147, 0.860, 0.456, 0.994].\nObject 19 : stitching at [0.312, 0.408, 0.754, 0.638].\nObject 20 : strip at [0.279, 0.770, 0.529, 0.802].\nObject 21 : top at [0.024, 0.830, 0.420, 0.936].\nObject 22 : trees at [0.024, 0.846, 1.000, 1.000].\nObject 23 : wheels at [0.012, 0.808, 0.586, 0.904].\nObject 24 : wrist at [0.802, 0.434, 0.856, 0.484].\n\nRelationships:\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 23 : wheels -> on a -> object 16 : skateboard.\nobject 14 : shirt -> has a -> object 9 : logo.\nobject 10 : man -> doing trick on -> object 16 : skateboard.\nobject 3 : building -> behind a -> object 5 : fence.\nobject 11 : name -> on -> object 12 : picture.\nobject 11 : name -> has a -> object 11 : name.\nobject 10 : man -> performing on a -> object 16 : skateboard.\nobject 4 : corner -> of -> object 3 : building.\nobject 18 : stadium lights -> are on -> object 13 : poles.\nobject 16 : skateboard -> has -> object 23 : wheels.\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 11 : name -> on -> object 12 : picture.\nobject 16 : skateboard -> under -> object 10 : man.\nobject 10 : man -> wearing -> object 15 : shoes.\nobject 3 : building -> behind -> object 5 : fence.\nobject 22 : trees -> in -> object 1 : background.\nobject 15 : shoes -> have -> object 8 : laces.\nobject 18 : stadium lights -> on -> object 13 : poles.\nobject 5 : fence -> behind -> object 10 : man.\nobject 20 : strip -> on -> object 16 : skateboard.\nobject 19 : stitching -> on -> object 7 : jean pants.\nobject 9 : logo -> on -> object 14 : shirt.\nobject 23 : wheels -> on -> object 16 : skateboard.\nobject 0 : axle -> on -> object 16 : skateboard.\nobject 21 : top -> of -> object 22 : trees.\n\nRegion Description:\nRegion Description at [0.030, 0.774, 0.643, 0.912] : a black skateboard with black wheels.\n\nGlobal Caption:\nA man flying through the air while riding a skateboard.\nA man is doing tricks on a skateboard.\nA skateboarder jumps while trying to perform a trick.\na man in the air standing above the skateboard\na person attempting a jump with a skateboard"}
+{"question_id": 18, "image": "000000361551.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : baggage at [0.107, 0.662, 0.179, 0.750].\nObject 1 : baggage at [0.368, 0.706, 0.456, 0.782].\nObject 2 : building at [0.000, 0.000, 0.997, 0.326].\nObject 3 : cap at [0.784, 0.544, 0.824, 0.568].\nObject 4 : duffel bag at [0.584, 0.702, 0.643, 0.768].\nObject 5 : ground at [0.000, 0.282, 1.000, 0.976].\nObject 6 : hair at [0.920, 0.614, 0.973, 0.640].\nObject 7 : headband at [0.923, 0.628, 0.952, 0.646].\nObject 8 : jacket at [0.776, 0.568, 0.840, 0.642].\nObject 9 : line at [0.696, 0.750, 0.989, 0.794].\nObject 10 : lines at [0.000, 0.436, 0.851, 0.486].\nObject 11 : luggage at [0.907, 0.706, 0.973, 0.786].\nObject 12 : luggage at [0.368, 0.702, 0.456, 0.780].\nObject 13 : man at [0.008, 0.554, 0.139, 0.800].\nObject 14 : man at [0.659, 0.572, 0.920, 0.844].\nObject 15 : man at [0.771, 0.538, 0.843, 0.640].\nObject 16 : pavement at [0.003, 0.308, 0.992, 0.566].\nObject 17 : people at [0.005, 0.562, 0.616, 0.824].\nObject 18 : pillars at [0.211, 0.130, 0.235, 0.240].\nObject 19 : ramp at [0.179, 0.158, 0.707, 0.408].\nObject 20 : service area at [0.003, 0.416, 0.995, 0.996].\nObject 21 : stairs at [0.352, 0.676, 1.000, 0.994].\nObject 22 : sweater at [0.667, 0.634, 0.920, 0.824].\nObject 23 : top at [0.960, 0.626, 1.000, 0.668].\nObject 24 : truck at [0.781, 0.278, 0.997, 0.366].\nObject 25 : walls at [0.608, 0.000, 0.989, 0.320].\nObject 26 : wheel at [0.843, 0.338, 0.875, 0.366].\nObject 27 : woman at [0.917, 0.610, 1.000, 0.724].\n\nRelationships:\nobject 17 : people -> in -> object 20 : service area.\nobject 27 : woman -> bends over -> object 11 : luggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 12 : luggage -> on -> object 5 : ground.\nobject 13 : man -> carries -> object 0 : baggage.\nobject 14 : man -> wears -> object 22 : sweater.\nobject 15 : man -> wears -> object 3 : cap.\nobject 24 : truck -> in -> object 20 : service area.\nobject 15 : man -> wears -> object 8 : jacket.\nobject 10 : lines -> on -> object 16 : pavement.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 9 : line -> on -> object 16 : pavement.\nobject 24 : truck -> has -> object 26 : wheel.\nobject 2 : building -> has -> object 25 : walls.\nobject 15 : man -> on -> object 20 : service area.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 27 : woman -> wears -> object 7 : headband.\nobject 1 : baggage -> on -> object 20 : service area.\n\nRegion Description:\nRegion Description at [0.443, 0.528, 0.992, 0.850] : People standing in service area of airport..\nRegion Description at [0.648, 0.564, 0.960, 0.892] : Man walking down stairs of unloading ramp..\nRegion Description at [0.229, 0.698, 0.381, 0.776] : Black and red luggage sitting on ground..\nRegion Description at [0.957, 0.616, 0.997, 0.670] : Woman dressed in sleeveless black top..\nRegion Description at [0.011, 0.548, 0.211, 0.750] : Man holding his luggage and bending over.\nRegion Description at [0.893, 0.578, 0.995, 0.678] : woman with a black and white head band.\nRegion Description at [0.235, 0.684, 0.973, 0.816] : Rainbow of colors in the form of luggage.\n\nGlobal Caption:\nSome are standing outside a building with suitcases.\nA few people are getting of a plane.\nA group of people and luggage on a airport tarmac.\nSome people who are placing luggage on a runway.\nAn airport and plane unloading 
passengers with luggage."}
+{"question_id": 19, "image": "000000412240.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : characters at [0.528, 0.251, 0.992, 0.395].\nObject 1 : date at [0.646, 0.869, 0.824, 0.923].\nObject 2 : dog at [0.292, 0.131, 0.820, 0.771].\nObject 3 : eyes at [0.332, 0.219, 0.354, 0.243].\nObject 4 : floor at [0.002, 0.715, 1.000, 0.997].\nObject 5 : head at [0.290, 0.117, 0.500, 0.392].\nObject 6 : heel at [0.218, 0.629, 0.324, 0.741].\nObject 7 : laces at [0.398, 0.464, 0.540, 0.608].\nObject 8 : left eye at [0.420, 0.245, 0.446, 0.283].\nObject 9 : light at [0.320, 0.493, 0.608, 0.720].\nObject 10 : mouth at [0.318, 0.320, 0.392, 0.373].\nObject 11 : nose at [0.348, 0.283, 0.392, 0.328].\nObject 12 : panel at [0.690, 0.544, 1.000, 0.779].\nObject 13 : photo at [0.000, 0.003, 0.996, 0.997].\nObject 14 : shoe at [0.002, 0.437, 0.250, 0.720].\nObject 15 : shoe at [0.212, 0.445, 0.720, 0.787].\nObject 16 : symbol at [0.750, 0.600, 0.828, 0.699].\nObject 17 : tail at [0.734, 0.720, 0.824, 0.768].\nObject 18 : time at [0.852, 0.872, 0.938, 0.923].\nObject 19 : toe at [0.564, 0.643, 0.724, 0.776].\nObject 20 : year at [0.752, 0.877, 0.834, 0.923].\n\nRelationships:\nobject 3 : eyes -> of -> object 2 : dog.\nobject 1 : date -> of -> object 13 : photo.\nobject 6 : heel -> of -> object 15 : shoe.\nobject 2 : dog -> sitting on -> object 4 : floor.\nobject 15 : shoe -> next to -> object 2 : dog.\nobject 15 : shoe -> reflecting -> object 9 : light.\nobject 0 : characters -> playing -> object 0 : characters.\nobject 0 : characters -> playing -> object 0 : characters.\nobject 2 : dog -> has a -> object 8 : left eye.\nobject 5 : head -> of -> object 2 : dog.\nobject 3 : eyes -> of -> object 2 : dog.\nobject 11 : nose -> on a -> object 2 : dog.\nobject 10 : mouth -> on a -> object 2 : dog.\nobject 15 : shoe -> has -> object 7 : laces.\nobject 17 : tail -> of -> object 2 : dog.\nobject 15 : shoe -> has a -> object 6 : heel.\nobject 19 : toe -> of -> object 15 : shoe.\n\nRegion Description:\nRegion Description at [0.838, 0.837, 0.976, 0.968] : the time written in bottom right corner.\n\nGlobal Caption:\nA dog sitting behind a pair of black shoes.\nA dog sits on the floor next to some shoes. \nA puppy is sitting behind a pair of shoes.\na close up of a small dog near a pair of shoes\nA small black dog sits beside a pair of shoes."}
+{"question_id": 20, "image": "000000130566.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : buds at [0.130, 0.814, 0.334, 0.883].\nObject 1 : building at [0.622, 0.213, 0.708, 0.273].\nObject 2 : building at [0.708, 0.222, 0.994, 0.294].\nObject 3 : building at [0.472, 0.240, 0.602, 0.282].\nObject 4 : cars at [0.628, 0.411, 0.912, 0.739].\nObject 5 : electric lines at [0.000, 0.000, 0.912, 0.126].\nObject 6 : gravel at [0.382, 0.381, 0.878, 0.907].\nObject 7 : leaves at [0.736, 0.357, 0.764, 0.390].\nObject 8 : pole at [0.550, 0.589, 0.558, 0.724].\nObject 9 : sky at [0.322, 0.093, 0.852, 0.162].\nObject 10 : tracks at [0.382, 0.429, 0.502, 0.511].\nObject 11 : tracks at [0.374, 0.408, 0.692, 0.709].\nObject 12 : tracks at [0.706, 0.775, 0.942, 0.922].\nObject 13 : train at [0.016, 0.273, 0.906, 0.733].\nObject 14 : train tracks at [0.024, 0.291, 0.996, 0.997].\nObject 15 : tree at [0.760, 0.279, 0.998, 0.426].\nObject 16 : wall at [0.556, 0.721, 0.790, 0.991].\nObject 17 : windshield at [0.850, 0.523, 0.898, 0.583].\nObject 18 : windshield at [0.796, 0.526, 0.846, 0.580].\n\nRelationships:\nobject 18 : windshield -> on a -> object 13 : train.\nobject 12 : tracks -> for a -> object 13 : train.\nobject 15 : tree -> with -> object 7 : leaves.\nobject 5 : electric lines -> on -> object 14 : train tracks.\nobject 8 : pole -> beside -> object 13 : train.\nobject 16 : wall -> beside -> object 13 : train.\nobject 13 : train -> traveling down -> object 11 : tracks.\n\nRegion Description:\nRegion Description at [0.022, 0.258, 0.632, 0.679] : THESE CARS ARE FOR CARGO NOT PASSENGERS.\nRegion Description at [0.630, 0.471, 0.682, 0.550] : THE WINDOWS ARE ON THE SIDE OF THE ENGINE.\nRegion Description at [0.000, 0.024, 0.448, 0.144] : electric lines hanging above train tracks.\nRegion Description at [0.532, 0.571, 0.568, 0.727] : black metal pole beside train tracks.\nRegion Description at [0.782, 0.586, 0.918, 0.667] : yellow paint on the front of the train.\nRegion Description at [0.062, 0.300, 0.996, 0.997] : multiple sets of tracks on the ground.\nRegion Description at [0.026, 0.114, 0.950, 0.970] : a freight train travelling down the tracks.\nRegion Description at [0.054, 0.685, 0.684, 0.991] : wildflowers on the side of a train track.\nRegion Description at [0.002, 0.129, 0.998, 0.991] : the grass and trees around the tracks.\n\nGlobal Caption:\nA yellow train on the tracks with several cars\nA train pulls past an intersection in the rail in a rural area.\na long cargo train going down a track by some trees \nA train with a red and yellow engine on a railroad track.\nA train pulls a large number of cars through a junction."}
+{"question_id": 21, "image": "000000421923.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : block at [0.156, 0.630, 0.357, 0.822].\nObject 1 : book at [0.414, 0.208, 0.538, 0.364].\nObject 2 : book at [0.360, 0.202, 0.417, 0.360].\nObject 3 : book at [0.426, 0.484, 0.691, 0.522].\nObject 4 : book at [0.399, 0.404, 0.520, 0.554].\nObject 5 : bowl at [0.072, 0.030, 0.288, 0.076].\nObject 6 : center at [0.850, 0.732, 0.886, 0.766].\nObject 7 : eye at [0.282, 0.506, 0.327, 0.532].\nObject 8 : eye at [0.189, 0.506, 0.237, 0.534].\nObject 9 : flower at [0.796, 0.462, 0.982, 0.550].\nObject 10 : flower at [0.817, 0.528, 0.976, 0.612].\nObject 11 : flower at [0.760, 0.678, 0.946, 0.824].\nObject 12 : flower at [0.691, 0.608, 0.838, 0.722].\nObject 13 : flower at [0.913, 0.680, 1.000, 0.770].\nObject 14 : object at [0.213, 0.840, 0.583, 0.972].\nObject 15 : picture at [0.778, 0.060, 1.000, 0.352].\nObject 16 : shelf at [0.324, 0.528, 0.997, 0.624].\nObject 17 : shelf at [0.207, 0.334, 0.997, 0.380].\nObject 18 : shelf at [0.000, 0.028, 0.607, 0.202].\nObject 19 : stack at [0.435, 0.480, 0.712, 0.578].\nObject 20 : statue at [0.147, 0.404, 0.372, 0.652].\nObject 21 : table at [0.000, 0.690, 1.003, 0.998].\nObject 22 : vase at [0.838, 0.774, 0.994, 0.974].\nObject 23 : water at [0.847, 0.864, 0.997, 0.984].\n\nRelationships:\nobject 20 : statue -> on -> object 0 : block.\nobject 14 : object -> on -> object 21 : table.\nobject 1 : book -> on -> object 17 : shelf.\nobject 4 : book -> on -> object 16 : shelf.\nobject 5 : bowl -> on -> object 18 : shelf.\nobject 22 : vase -> has -> object 23 : water.\nobject 20 : statue -> has -> object 8 : eye.\nobject 20 : statue -> has -> object 7 : eye.\nobject 20 : statue -> on -> object 0 : block.\nobject 9 : flower -> in -> object 22 : vase.\nobject 10 : flower -> in -> object 22 : vase.\nobject 12 : flower -> in -> object 22 : vase.\nobject 13 : flower -> in -> object 22 : vase.\nobject 3 : book -> in -> object 19 : stack.\nobject 11 : flower -> has -> object 6 : center.\nobject 1 : book -> on -> object 17 : shelf.\nobject 2 : book -> on -> object 17 : shelf.\nobject 11 : flower -> has -> object 6 : center.\nobject 3 : book -> on -> object 19 : stack.\nobject 19 : stack -> on -> object 16 : shelf.\nobject 20 : statue -> on -> object 0 : block.\n\nRegion Description:\n\nGlobal Caption:\na glass vase with some flowers coming out of it \nA room witb a statue, bookshelves, books and a vase with flowers in it.\nA desk with a vase containing flowers, a sculpture of a man's head and shelves behind it.\nA statue next to a vase of flowers on a shelf. \nThe bust of a man's head is next to a vase of flowers."}
+{"question_id": 22, "image": "000000513567.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : bag at [0.428, 0.435, 0.476, 0.528].\nObject 1 : bag at [0.322, 0.923, 0.498, 0.997].\nObject 2 : building at [0.000, 0.003, 0.158, 0.413].\nObject 3 : face at [0.246, 0.240, 0.374, 0.483].\nObject 4 : flag at [0.044, 0.013, 0.090, 0.149].\nObject 5 : girl at [0.538, 0.019, 0.968, 0.949].\nObject 6 : hand at [0.176, 0.680, 0.304, 0.821].\nObject 7 : hands at [0.660, 0.344, 0.756, 0.517].\nObject 8 : head at [0.560, 0.003, 0.822, 0.339].\nObject 9 : hot dog at [0.676, 0.315, 0.882, 0.408].\nObject 10 : hot dogs at [0.190, 0.587, 0.350, 0.741].\nObject 11 : jeans at [0.586, 0.843, 0.916, 0.995].\nObject 12 : lady at [0.572, 0.045, 0.952, 0.984].\nObject 13 : logo at [0.920, 0.069, 0.996, 0.165].\nObject 14 : man at [0.486, 0.235, 0.564, 0.509].\nObject 15 : man at [0.456, 0.213, 0.520, 0.317].\nObject 16 : maroon shirt at [0.546, 0.333, 0.928, 0.944].\nObject 17 : mouth at [0.288, 0.408, 0.356, 0.440].\nObject 18 : people at [0.552, 0.029, 0.876, 0.995].\nObject 19 : post at [0.104, 0.005, 0.138, 0.533].\nObject 20 : purse at [0.842, 0.661, 0.980, 0.888].\nObject 21 : purse strap at [0.270, 0.893, 0.390, 0.992].\nObject 22 : shadow at [0.934, 0.067, 0.996, 0.141].\nObject 23 : side at [0.922, 0.875, 0.998, 0.997].\nObject 24 : street at [0.042, 0.403, 0.092, 0.520].\nObject 25 : sunglasses at [0.630, 0.005, 0.794, 0.048].\nObject 26 : woman at [0.502, 0.000, 0.982, 0.997].\nObject 27 : woman at [0.102, 0.099, 0.486, 0.984].\nObject 28 : woman's shirt at [0.518, 0.320, 0.944, 0.949].\n\nRelationships:\nobject 0 : bag -> on -> object 15 : man.\nobject 13 : logo -> on -> object 2 : building.\nobject 25 : sunglasses -> on -> object 26 : woman.\nobject 25 : sunglasses -> on -> object 8 : head.\nobject 4 : flag -> on -> object 19 : post.\nobject 6 : hand -> holds -> object 10 : hot dogs.\nobject 27 : woman -> has -> object 17 : mouth.\nobject 12 : lady -> holding -> object 9 : hot dog.\nobject 9 : hot dog -> in -> object 7 : hands.\nobject 18 : people -> crossing -> object 24 : street.\nobject 27 : woman -> wearing -> object 11 : jeans.\nobject 5 : girl -> wears -> object 16 : maroon shirt.\n\nRegion Description:\nRegion Description at [0.038, 0.173, 0.540, 0.995] : Laughing girl in a green shirt holding a hotdog..\nRegion Description at [0.504, 0.000, 0.954, 0.989] : Black haired girl in maroon shirt wearing sunglasses on her head..\nRegion Description at [0.508, 0.000, 0.960, 0.979] : Girl looking at the hot dog she's holding in her hands.\nRegion Description at [0.040, 0.173, 0.536, 0.981] : Girl holding hot dog in her right hand.\nRegion Description at [0.926, 0.253, 0.998, 0.645] : Woman in a brown shirt and jeans crossing the street.\nRegion Description at [0.202, 0.563, 0.334, 0.995] : Blue purse strap around woman's shoulder.\nRegion Description at [0.146, 0.587, 0.370, 0.787] : woman holding hot dog in white napkin.\nRegion Description at [0.682, 0.229, 0.742, 0.315] : woman's mouth open looking at hot dog.\nRegion Description at [0.234, 0.213, 0.396, 0.507] : woman's face smiling with eyes closed.\n\nGlobal Caption:\nTwo Asian women eating chili dogs while standing on a street.\nTwo women preparing to eat a hot dog on a city side.\nThe woman are eating their hot dogs while walking.\nTwo young women are eating hot dogs while walking down the sidewalk.\nTwo women eat chili dogs on a city sidewalk. "}
+{"question_id": 23, "image": "000000543300.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : boat at [0.048, 0.552, 0.928, 0.819].\nObject 1 : building at [0.328, 0.493, 0.538, 0.613].\nObject 2 : building at [0.000, 0.467, 0.338, 0.651].\nObject 3 : building at [0.534, 0.096, 0.998, 0.637].\nObject 4 : canopies at [0.452, 0.504, 0.620, 0.600].\nObject 5 : container at [0.858, 0.643, 0.948, 0.712].\nObject 6 : dolphin at [0.282, 0.691, 0.344, 0.773].\nObject 7 : flag at [0.322, 0.563, 0.340, 0.597].\nObject 8 : ground at [0.822, 0.696, 0.880, 0.715].\nObject 9 : leaves at [0.002, 0.483, 0.080, 0.659].\nObject 10 : level at [0.000, 0.709, 1.000, 0.829].\nObject 11 : level at [0.068, 0.616, 0.852, 0.688].\nObject 12 : outdoor seating at [0.502, 0.579, 0.532, 0.624].\nObject 13 : pink writing at [0.414, 0.693, 0.654, 0.725].\nObject 14 : pole at [0.282, 0.416, 0.292, 0.515].\nObject 15 : railing at [0.094, 0.557, 0.728, 0.624].\nObject 16 : railing at [0.238, 0.597, 0.744, 0.627].\nObject 17 : reflection at [0.174, 0.808, 0.922, 0.848].\nObject 18 : roof at [0.000, 0.469, 0.280, 0.523].\nObject 19 : roof at [0.348, 0.509, 0.482, 0.568].\nObject 20 : roof at [0.920, 0.264, 0.980, 0.344].\nObject 21 : row at [0.700, 0.499, 0.878, 0.573].\nObject 22 : sea wall at [0.878, 0.712, 0.998, 0.819].\nObject 23 : shore at [0.000, 0.627, 0.996, 0.816].\nObject 24 : sky at [0.006, 0.000, 1.000, 0.517].\nObject 25 : steeple at [0.918, 0.088, 0.936, 0.237].\nObject 26 : symbol at [0.268, 0.688, 0.350, 0.779].\nObject 27 : symbol at [0.702, 0.693, 0.752, 0.725].\nObject 28 : tree at [0.472, 0.491, 0.592, 0.597].\nObject 29 : trees at [0.948, 0.573, 1.000, 0.691].\nObject 30 : trees at [0.000, 0.488, 0.080, 0.675].\nObject 31 : vehicle at [0.968, 0.653, 0.998, 0.693].\nObject 32 : water at [0.004, 0.813, 0.998, 0.992].\nObject 33 : water at [0.008, 0.717, 0.998, 0.981].\nObject 34 : window at [0.374, 0.733, 0.790, 0.765].\nObject 35 : window at [0.800, 0.491, 0.868, 0.576].\nObject 36 : window at [0.928, 0.512, 0.950, 0.576].\nObject 37 : window at [0.892, 0.395, 0.912, 0.443].\nObject 38 : window at [0.894, 0.517, 0.910, 0.571].\nObject 39 : window at [0.630, 0.493, 0.652, 0.565].\nObject 40 : windows at [0.384, 0.637, 0.724, 0.685].\n\nRelationships:\nobject 40 : windows -> on -> object 0 : boat.\nobject 17 : reflection -> in -> object 33 : water.\nobject 29 : trees -> growing on -> object 23 : shore.\nobject 30 : trees -> growing on -> object 23 : shore.\nobject 28 : tree -> growing on -> object 23 : shore.\nobject 18 : roof -> on -> object 2 : building.\nobject 5 : container -> on -> object 22 : sea wall.\nobject 0 : boat -> in -> object 32 : water.\nobject 0 : boat -> has -> object 15 : railing.\n\nRegion Description:\nRegion Description at [0.414, 0.691, 0.662, 0.725] : the are red letters on the side of the cruise ship.\nRegion Description at [0.370, 0.707, 0.780, 0.763] : there is a long set of black windows on the side of the cruise ship.\nRegion Description at [0.870, 0.243, 0.992, 0.357] : there is a red roof on this building.\nRegion Description at [0.538, 0.400, 0.712, 0.549] : there is red and gray building in the background.\nRegion Description at [0.054, 0.595, 0.312, 0.821] : there is two levels on this cruise ship.\nRegion Description at [0.370, 0.587, 0.664, 0.621] : there is a silver railing on the top level of the cruise ship.\nRegion Description at [0.858, 0.621, 0.952, 0.717] : there is a blue container on the dock.\nRegion Description at [0.876, 0.707, 0.996, 0.787] : 
there is a gray sea wall beside the ship.\nRegion Description at [0.268, 0.723, 0.346, 0.787] : there are blue water symbols on the side of the cruise ship.\nRegion Description at [0.000, 0.619, 0.024, 0.712] : there is a blue and white sign on the dock.\nRegion Description at [0.662, 0.533, 0.904, 0.603] : An outdoor canopy creates shade for customers. .\n\nGlobal Caption:\nA boat sits on the side of the dock.\nA large white boat in the open water.\nA white double decker boat n water next to buildings.\nA large cruise ship is traveling on the ocean. \nA Port River Dolphin Cruise ship sits in the water."}
+{"question_id": 24, "image": "000000241668.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : boutonniere at [0.710, 0.574, 0.799, 0.660].\nObject 1 : cake at [0.630, 0.670, 0.772, 0.750].\nObject 2 : cake crumb at [0.710, 0.348, 0.721, 0.356].\nObject 3 : crown at [0.370, 0.006, 0.549, 0.056].\nObject 4 : dress at [0.000, 0.574, 0.582, 1.000].\nObject 5 : eye at [0.649, 0.244, 0.699, 0.272].\nObject 6 : eye at [0.735, 0.264, 0.769, 0.280].\nObject 7 : eyebrow at [0.655, 0.230, 0.710, 0.250].\nObject 8 : eyebrow at [0.741, 0.252, 0.780, 0.264].\nObject 9 : finger at [0.721, 0.772, 0.816, 0.800].\nObject 10 : finger at [0.535, 0.740, 0.685, 0.826].\nObject 11 : ground at [0.003, 0.888, 0.997, 1.000].\nObject 12 : hair at [0.507, 0.142, 0.791, 0.642].\nObject 13 : hair at [0.189, 0.044, 0.652, 0.374].\nObject 14 : hand at [0.721, 0.720, 0.822, 0.818].\nObject 15 : hand at [0.493, 0.710, 0.685, 0.826].\nObject 16 : head at [0.209, 0.048, 0.652, 0.360].\nObject 17 : mouth at [0.646, 0.310, 0.724, 0.352].\nObject 18 : neck at [0.560, 0.344, 0.663, 0.460].\nObject 19 : necklace at [0.357, 0.334, 0.471, 0.484].\nObject 20 : necktie at [0.571, 0.442, 0.674, 0.936].\nObject 21 : paper at [0.760, 0.792, 0.914, 0.934].\nObject 22 : person at [0.490, 0.136, 0.825, 0.998].\nObject 23 : plate at [0.579, 0.734, 0.816, 0.768].\nObject 24 : purse at [0.774, 0.792, 0.883, 0.840].\nObject 25 : ring at [0.786, 0.780, 0.794, 0.796].\nObject 26 : shirt at [0.554, 0.376, 0.691, 0.950].\nObject 27 : suit jacket at [0.490, 0.422, 0.799, 0.998].\nObject 28 : table at [0.696, 0.816, 0.997, 0.916].\nObject 29 : toilet at [0.000, 0.656, 0.997, 0.936].\nObject 30 : wallpaper at [0.003, 0.000, 0.916, 0.656].\n\nRelationships:\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> sitting by -> object 29 : toilet.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 2 : cake crumb -> on side of -> object 17 : mouth.\nobject 24 : purse -> on top of -> object 28 : table.\nobject 5 : eye -> of a -> object 22 : person.\nobject 6 : eye -> of a -> object 22 : person.\nobject 7 : eyebrow -> of -> object 22 : person.\nobject 8 : eyebrow -> of -> object 22 : person.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 3 : crown -> on top of -> object 16 : head.\nobject 20 : necktie -> worn on -> object 22 : person.\nobject 22 : person -> holding -> object 1 : cake.\nobject 14 : hand -> holding -> object 1 : cake.\nobject 22 : person -> wearing -> object 27 : suit jacket.\nobject 22 : person -> wearing -> object 4 : dress.\nobject 20 : necktie -> worn on -> object 18 : neck.\nobject 13 : hair -> on top of -> object 16 : head.\nobject 1 : cake -> on top of -> object 23 : plate.\nobject 25 : ring -> worn on -> object 9 : finger.\n\nRegion Description:\nRegion Description at [0.022, 0.020, 0.203, 0.312] : A green and yellow striped wallpaper.\nRegion Description at [0.000, 0.048, 0.613, 0.996] : woman wearing a strapless white wedding dress .\nRegion Description at [0.487, 0.136, 0.808, 0.986] : 
woman white red hair holding a piece of cake on a plate.\nRegion Description at [0.543, 0.674, 0.813, 0.826] : woman's hands holding a plate of cake.\nRegion Description at [0.579, 0.124, 0.788, 0.524] : red haired woman wearing a tie and suit jacket .\nRegion Description at [0.000, 0.012, 0.819, 0.996] : two people wearing formal wedding attire .\n\nGlobal Caption:\nThere are two people enjoying a wedding reception\nA woman in a wedding dress with another woman in a suit behind\nA woman in a wedding dress with another lady holding a piece of cake.\nA red head girl holding a piece of cake\nA bride is with a long red haired person with cake."}
+{"question_id": 25, "image": "000000535578.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : bush at [0.480, 0.000, 0.748, 0.084].\nObject 1 : ear at [0.544, 0.544, 0.571, 0.562].\nObject 2 : field at [0.000, 0.002, 0.994, 0.998].\nObject 3 : hill at [0.000, 0.000, 0.997, 0.998].\nObject 4 : plant at [0.000, 0.764, 0.601, 0.998].\nObject 5 : rock at [0.727, 0.410, 0.808, 0.470].\nObject 6 : sheep at [0.532, 0.546, 0.646, 0.662].\nObject 7 : sheep at [0.532, 0.666, 0.817, 0.810].\nObject 8 : tail at [0.565, 0.572, 0.604, 0.610].\nObject 9 : tree at [0.649, 0.000, 0.997, 0.334].\nObject 10 : trees at [0.736, 0.036, 0.835, 0.100].\nObject 11 : wall at [0.000, 0.000, 0.769, 0.180].\nObject 12 : weed at [0.417, 0.346, 0.492, 0.390].\n\nRelationships:\nobject 7 : sheep -> in a -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 11 : wall -> borders -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 10 : trees -> in -> object 2 : field.\nobject 6 : sheep -> has an -> object 1 : ear.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 12 : weed -> growing in -> object 2 : field.\nobject 7 : sheep -> on -> object 3 : hill.\nobject 4 : plant -> on -> object 2 : field.\nobject 5 : rock -> on -> object 3 : hill.\nobject 7 : sheep -> are in -> object 2 : field.\nobject 11 : wall -> running across -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 5 : rock -> in -> object 2 : field.\n\nRegion Description:\nRegion Description at [0.000, 0.072, 0.760, 0.160] : A stone wall boarding a field of sheep.\nRegion Description at [0.189, 0.032, 0.703, 0.178] : rocks and grass in the background of the pasture.\nRegion Description at [0.541, 0.662, 0.823, 0.802] : white sheep grazing in green grassy field.\nRegion Description at [0.538, 0.544, 0.646, 0.656] : white sheep grazing in green grassy field.\nRegion Description at [0.228, 0.374, 0.357, 0.436] : white sheep grazing in green grassy field.\nRegion Description at [0.607, 0.380, 0.712, 0.456] : white sheep grazing in green grassy field.\nRegion Description at [0.811, 0.296, 0.937, 0.338] : two white sheep grazing in green grassy field.\nRegion Description at [0.048, 0.200, 0.249, 0.242] : group of white sheep grazing in green grassy field.\nRegion Description at [0.213, 0.164, 0.336, 0.192] : group of white sheep grazing in green grassy field.\nRegion Description at [0.000, 0.006, 0.997, 0.172] : two long gray stone walls across field.\nRegion Description at [0.453, 0.000, 0.730, 0.062] : a stand of trees outside the stone fence.\n\nGlobal Caption:\nA group of sheep grazing in a grassy valley.\nSheep graze in a lushly green mountain meadow\nA flock of sheep walking along a grassy hillside grazing.\nA flock of sheep are grazing on a grassy slope.\nA group of sheep grazing in a grassy field."}
+{"question_id": 26, "image": "000000277051.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : bird at [0.400, 0.408, 0.688, 0.775].\nObject 1 : bird at [0.110, 0.468, 0.576, 0.820].\nObject 2 : bottle at [0.080, 0.003, 0.296, 0.721].\nObject 3 : chair at [0.678, 0.177, 0.882, 0.408].\nObject 4 : crumbs at [0.098, 0.835, 0.434, 1.000].\nObject 5 : feet at [0.514, 0.724, 0.562, 0.769].\nObject 6 : food at [0.000, 0.877, 0.180, 1.000].\nObject 7 : foot at [0.474, 0.706, 0.514, 0.733].\nObject 8 : ground at [0.518, 0.183, 0.620, 0.402].\nObject 9 : handle at [0.488, 0.796, 0.800, 0.940].\nObject 10 : knife at [0.000, 0.793, 0.800, 1.000].\nObject 11 : label at [0.080, 0.000, 0.260, 0.598].\nObject 12 : leg at [0.552, 0.652, 0.578, 0.742].\nObject 13 : leg at [0.508, 0.646, 0.540, 0.685].\nObject 14 : liquid at [0.092, 0.114, 0.294, 0.721].\nObject 15 : paper at [0.000, 0.658, 0.762, 1.003].\nObject 16 : placemat at [0.000, 0.658, 0.766, 1.000].\nObject 17 : plate at [0.000, 0.748, 0.618, 1.000].\nObject 18 : table at [0.742, 0.261, 1.002, 0.883].\nObject 19 : table at [0.000, 0.658, 1.000, 1.003].\nObject 20 : tablecloth at [0.000, 0.664, 1.002, 1.003].\nObject 21 : tablecloth at [0.596, 0.267, 1.000, 0.883].\n\nRelationships:\nobject 6 : food -> on -> object 17 : plate.\nobject 4 : crumbs -> on -> object 17 : plate.\nobject 3 : chair -> next to -> object 18 : table.\nobject 3 : chair -> beside -> object 18 : table.\n\nRegion Description:\nRegion Description at [0.050, 0.769, 0.804, 0.979] : a steak knife resting on the edge of a plate.\nRegion Description at [0.008, 0.724, 0.628, 0.994] : a white plate with food and crumbs on it.\nRegion Description at [0.040, 0.685, 0.380, 0.925] : a blue and white paper placemat underneath a plate.\nRegion Description at [0.636, 0.147, 0.906, 0.492] : a bird on a table with a chair behind it.\nRegion Description at [0.384, 0.372, 0.698, 0.787] : a bird standing on the edge of a table.\n\nGlobal Caption:\ntwo little sparrows standing on a table by a knife\ntwo gray white and brown birds a knife and a red table\nA couple of small birds standing on top of a table.\nTwo sparrows sit n a table with a red tablecloth at an outdoor cafe. \nTwo birds perched on a table near a plate of food."}
+{"question_id": 27, "image": "000000018519.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : concrete at [0.000, 0.576, 1.002, 0.998].\nObject 1 : elbow at [0.403, 0.538, 0.433, 0.552].\nObject 2 : fence at [0.000, 0.314, 0.998, 0.600].\nObject 3 : graffiti at [0.470, 0.856, 0.794, 0.998].\nObject 4 : grass at [0.000, 0.154, 1.002, 0.448].\nObject 5 : helmet at [0.358, 0.354, 0.448, 0.422].\nObject 6 : knee at [0.525, 0.608, 0.545, 0.622].\nObject 7 : knee pad at [0.450, 0.542, 0.512, 0.598].\nObject 8 : pad at [0.540, 0.362, 0.595, 0.420].\nObject 9 : pad at [0.512, 0.578, 0.592, 0.624].\nObject 10 : pad at [0.376, 0.512, 0.443, 0.554].\nObject 11 : park at [0.007, 0.006, 1.000, 0.578].\nObject 12 : pipe at [0.657, 0.300, 0.687, 0.578].\nObject 13 : pipe at [0.177, 0.324, 0.211, 0.590].\nObject 14 : rail at [0.000, 0.310, 1.000, 0.334].\nObject 15 : ramp at [0.000, 0.592, 1.002, 0.998].\nObject 16 : rock at [0.100, 0.302, 0.154, 0.326].\nObject 17 : shadow at [0.415, 0.642, 0.754, 0.912].\nObject 18 : shirt at [0.438, 0.376, 0.637, 0.514].\nObject 19 : shorts at [0.460, 0.500, 0.664, 0.580].\nObject 20 : skate at [0.647, 0.490, 0.709, 0.584].\nObject 21 : skater at [0.234, 0.352, 0.719, 0.624].\nObject 22 : sticker at [0.408, 0.358, 0.438, 0.368].\nObject 23 : tree at [0.122, 0.008, 0.677, 0.322].\nObject 24 : wheels at [0.689, 0.496, 0.721, 0.526].\nObject 25 : wrist brace at [0.279, 0.524, 0.338, 0.564].\n\nRelationships:\nobject 21 : skater -> has a -> object 17 : shadow.\nobject 20 : skate -> has -> object 24 : wheels.\nobject 23 : tree -> standing in a -> object 11 : park.\nobject 21 : skater -> wearing a -> object 5 : helmet.\nobject 10 : pad -> protecting an -> object 1 : elbow.\nobject 9 : pad -> protecting a -> object 6 : knee.\nobject 17 : shadow -> of a -> object 21 : skater.\nobject 15 : ramp -> has a -> object 3 : graffiti.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 16 : rock -> in -> object 4 : grass.\nobject 5 : helmet -> has a -> object 22 : sticker.\nobject 21 : skater -> wearing -> object 20 : skate.\nobject 21 : skater -> wearing a -> object 10 : pad.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 20 : skate.\nobject 17 : shadow -> on -> object 15 : ramp.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 21 : skater -> has a -> object 8 : pad.\nobject 21 : skater -> has a -> object 18 : shirt.\nobject 21 : skater -> has -> object 19 : shorts.\nobject 23 : tree -> behind -> object 21 : skater.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 9 : pad.\nobject 7 : knee pad -> for a -> object 21 : skater.\nobject 17 : shadow -> on -> object 0 : concrete.\nobject 3 : graffiti -> on -> object 0 : concrete.\n\nRegion Description:\nRegion Description at [0.391, 0.630, 0.776, 0.962] : Skater's shadow while performing a trick.\nRegion Description at [0.346, 0.342, 0.475, 0.440] : Man is wearing a black safety helmet.\nRegion Description at [0.184, 0.320, 0.741, 0.700] : a man roller skating at a skate park.\nRegion Description at [0.448, 0.636, 0.779, 0.940] : the shadow of the man cast on the cement ramp.\nRegion Description at [0.465, 0.856, 0.803, 0.996] : light blue painted graffiti on the cement ramp.\nRegion Description at [0.279, 0.524, 0.341, 0.570] : a black wrist guard on the man's wrist.\nRegion Description at [0.353, 0.352, 0.460, 0.422] : black helmet with several stickers on it.\nRegion Description at [0.644, 0.488, 0.719, 0.574] : the 
black rollerskate the man is wearing.\nRegion Description at [0.142, 0.314, 0.234, 0.604] : a grey post to the metal fence that is at the top of the ramp.\nRegion Description at [0.363, 0.500, 0.453, 0.566] : a black elbow pad the man is wearing.\nRegion Description at [0.405, 0.642, 0.746, 0.916] : shadow of a roller skater on concrete.\n\nGlobal Caption:\nA young man riding a skateboard down the side of a ramp.\nA man doing a trick on roller-skates in a skate park.\nA skateboarder performing a jump off the side of a ramp.\na man wearing roller skates doing a jump on the side of a wall \nThe man in the helmet is jumping while wearing roller skates. "}
+{"question_id": 28, "image": "000000106048.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : book at [0.218, 0.105, 0.834, 0.754].\nObject 1 : building at [0.050, 0.000, 1.000, 0.713].\nObject 2 : bus at [0.222, 0.144, 0.820, 0.757].\nObject 3 : bushes at [0.810, 0.401, 1.000, 0.680].\nObject 4 : design at [0.228, 0.422, 0.438, 0.560].\nObject 5 : ground at [0.000, 0.629, 1.002, 0.994].\nObject 6 : headlight at [0.738, 0.590, 0.796, 0.632].\nObject 7 : headlight at [0.522, 0.596, 0.610, 0.629].\nObject 8 : light at [0.604, 0.201, 0.706, 0.222].\nObject 9 : pavement at [0.002, 0.629, 0.996, 0.994].\nObject 10 : pipe at [0.172, 0.147, 0.208, 0.617].\nObject 11 : pipe at [0.438, 0.096, 0.458, 0.192].\nObject 12 : roof at [0.118, 0.000, 0.896, 0.174].\nObject 13 : side mirror at [0.488, 0.314, 0.530, 0.428].\nObject 14 : side mirror at [0.790, 0.332, 0.818, 0.455].\nObject 15 : street at [0.002, 0.611, 0.992, 0.991].\nObject 16 : stripe at [0.228, 0.428, 0.516, 0.569].\nObject 17 : trash can at [0.790, 0.569, 0.822, 0.662].\nObject 18 : wall at [0.858, 0.368, 0.920, 0.419].\nObject 19 : wheel at [0.266, 0.545, 0.294, 0.677].\nObject 20 : wheel at [0.248, 0.551, 0.264, 0.668].\nObject 21 : wheel at [0.444, 0.578, 0.472, 0.751].\nObject 22 : windows at [0.510, 0.216, 0.796, 0.548].\nObject 23 : windshield at [0.518, 0.222, 0.782, 0.545].\n\nRelationships:\nobject 10 : pipe -> running from -> object 12 : roof.\nobject 12 : roof -> to -> object 5 : ground.\nobject 17 : trash can -> next to -> object 3 : bushes.\nobject 3 : bushes -> by -> object 15 : street.\n\nRegion Description:\nRegion Description at [0.568, 0.524, 0.770, 0.599] : Divine Transportation written on front of bus.\nRegion Description at [0.162, 0.129, 0.212, 0.623] : black drain pipe running from the roof to the ground.\nRegion Description at [0.712, 0.177, 0.762, 0.240] : bus identification number on top of bus.\nRegion Description at [0.790, 0.557, 0.820, 0.647] : gray trash can next to bushes behind bus.\nRegion Description at [0.810, 0.407, 0.990, 0.692] : large green bushes in front of building.\nRegion Description at [0.670, 0.317, 0.740, 0.527] : black windshield wiper on windshield.\n\nGlobal Caption:\nA white bus driving past a tall building.\na black and white bus some bushes and building\nA white decorated bus is next to a building.\na large white bus that is by a building\nA large bus parked in a parking lot "}
+{"question_id": 29, "image": "000000058393.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : arm at [0.658, 0.462, 0.828, 0.496].\nObject 1 : bench at [0.070, 0.493, 0.932, 0.960].\nObject 2 : concrete at [0.030, 0.810, 0.974, 0.997].\nObject 3 : foot at [0.724, 0.784, 0.782, 0.844].\nObject 4 : hair at [0.646, 0.367, 0.754, 0.472].\nObject 5 : hair at [0.564, 0.338, 0.652, 0.462].\nObject 6 : man at [0.542, 0.343, 0.812, 0.493].\nObject 7 : ocean at [0.028, 0.319, 0.972, 0.821].\nObject 8 : post at [0.090, 0.641, 0.102, 0.734].\nObject 9 : post at [0.924, 0.652, 0.944, 0.836].\nObject 10 : rail at [0.028, 0.620, 0.974, 0.660].\nObject 11 : seat at [0.072, 0.728, 0.928, 0.786].\nObject 12 : shoe at [0.720, 0.789, 0.782, 0.855].\nObject 13 : sky at [0.028, 0.037, 0.974, 0.325].\nObject 14 : slat at [0.072, 0.749, 0.928, 0.781].\nObject 15 : slat at [0.112, 0.499, 0.912, 0.522].\nObject 16 : slat at [0.126, 0.702, 0.912, 0.728].\nObject 17 : slat at [0.108, 0.594, 0.908, 0.625].\nObject 18 : slat at [0.106, 0.525, 0.908, 0.554].\nObject 19 : woman at [0.644, 0.377, 0.834, 0.863].\n\nRelationships:\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting with -> object 19 : woman.\nobject 6 : man -> has -> object 0 : arm.\nobject 0 : arm -> around -> object 19 : woman.\nobject 3 : foot -> wearing -> object 12 : shoe.\nobject 19 : woman -> has -> object 3 : foot.\nobject 3 : foot -> inside -> object 12 : shoe.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 19 : woman -> has -> object 4 : hair.\nobject 6 : man -> has -> object 5 : hair.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> backs up to -> object 1 : bench.\nobject 19 : woman -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 19 : woman -> relaxing on -> object 1 : bench.\nobject 6 : man -> relaxing on -> object 1 : bench.\nobject 19 : woman -> facing -> object 7 : ocean.\nobject 6 : man -> facing -> object 7 : ocean.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 6 : man -> relaxing with -> object 19 : woman.\nobject 6 : man -> on bench with -> object 19 : woman.\nobject 19 : woman -> resting on -> object 1 : bench.\nobject 6 : man -> resting on -> object 1 : bench.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 11 : seat -> part of -> object 1 : bench.\nobject 9 : post -> supporting -> object 10 : rail.\nobject 8 : post -> supporting -> object 10 : rail.\nobject 19 : woman -> has -> object 3 : foot.\nobject 12 : shoe -> belongs to -> object 19 : woman.\nobject 19 : woman -> has -> object 3 : foot.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 7 : ocean -> in front of -> object 1 : bench.\nobject 6 : man -> sitting next to -> object 19 : woman.\nobject 6 : man -> cuddling with -> object 19 : woman.\nobject 0 : arm -> around -> object 19 : woman.\nobject 6 : man -> silhouetted with -> object 19 : woman.\nobject 18 : slat -> part of -> object 1 : bench.\n\nRegion Description:\nRegion Description at [0.502, 0.309, 0.892, 0.512] : a man and woman looking at the ocean.\n\nGlobal Caption:\nTwo people sitting on a bench silhouetted against the sea.\nTwo people are sitting on a bench together in front of water.\nThe silhouette of two people 
sitting on a bench in front of the water.\nA couple is sitting on a bench in front of the water. \nA couple sits on a park bench and watches the water"}
+{"question_id": 30, "image": "000000010764.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : catcher at [0.334, 0.193, 0.756, 0.940].\nObject 1 : field at [0.000, 0.000, 0.998, 0.997].\nObject 2 : glove at [0.660, 0.492, 0.764, 0.674].\nObject 3 : hand at [0.666, 0.498, 0.748, 0.665].\nObject 4 : helmet at [0.472, 0.187, 0.610, 0.444].\nObject 5 : jersey at [0.340, 0.332, 0.556, 0.695].\nObject 6 : line at [0.396, 0.656, 0.560, 0.731].\nObject 7 : lines at [0.866, 0.927, 1.000, 0.997].\nObject 8 : lines at [0.754, 0.837, 0.998, 0.867].\nObject 9 : pads at [0.562, 0.668, 0.634, 0.782].\nObject 10 : pants at [0.336, 0.640, 0.612, 0.858].\nObject 11 : sneakers at [0.406, 0.834, 0.544, 0.946].\nObject 12 : stripe at [0.608, 0.737, 0.998, 0.795].\nObject 13 : wrist band at [0.586, 0.583, 0.604, 0.640].\n\nRelationships:\nobject 0 : catcher -> in -> object 1 : field.\nobject 2 : glove -> on -> object 3 : hand.\nobject 6 : line -> on -> object 10 : pants.\n\nRegion Description:\nRegion Description at [0.546, 0.625, 0.626, 0.801] : The player is wearing knee and leg pads..\nRegion Description at [0.018, 0.665, 0.280, 0.825] : A brown dirt ground surface on a baseball field.\nRegion Description at [0.676, 0.701, 0.974, 0.979] : White chalk lines painted on a baseball field.\nRegion Description at [0.062, 0.130, 0.370, 0.535] : A green grass ground surface of a baseball field.\nRegion Description at [0.566, 0.580, 0.620, 0.656] : A black and red bracelet on a man's wrist.\n\nGlobal Caption:\nA catches crouches on a patch of dirt.\nA catcher squatting at a base with his gloved hand extended.\nA baseball catcher stands ready to catch a ball.\na catcher kneeling at the mound waiting for a baseball \nA catcher in white uniform during a baseball game."}
+{"question_id": 31, "image": "000000271402.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : blonde hair at [0.193, 0.100, 0.375, 0.176].\nObject 1 : building at [0.804, 0.200, 0.906, 0.318].\nObject 2 : dress at [0.378, 0.284, 0.804, 0.652].\nObject 3 : fence at [0.607, 0.282, 0.997, 0.378].\nObject 4 : girl at [0.329, 0.148, 0.973, 0.892].\nObject 5 : girl at [0.057, 0.102, 0.456, 0.898].\nObject 6 : ground at [0.000, 0.374, 1.000, 0.916].\nObject 7 : hair at [0.320, 0.148, 0.517, 0.286].\nObject 8 : handle at [0.329, 0.432, 0.508, 0.480].\nObject 9 : handle at [0.091, 0.450, 0.299, 0.502].\nObject 10 : head at [0.335, 0.152, 0.508, 0.314].\nObject 11 : insignia at [0.447, 0.350, 0.502, 0.390].\nObject 12 : orange platform at [0.181, 0.816, 0.489, 0.998].\nObject 13 : orange wheel at [0.193, 0.820, 0.248, 0.876].\nObject 14 : pavement at [0.009, 0.370, 0.994, 0.996].\nObject 15 : racket at [0.462, 0.480, 0.713, 0.840].\nObject 16 : right shoe at [0.465, 0.778, 0.610, 0.886].\nObject 17 : scooter at [0.097, 0.424, 0.592, 0.996].\nObject 18 : shoe at [0.060, 0.794, 0.202, 0.902].\nObject 19 : shoe at [0.302, 0.780, 0.453, 0.874].\nObject 20 : skirt at [0.471, 0.514, 0.804, 0.654].\nObject 21 : sneaker at [0.849, 0.738, 0.970, 0.886].\nObject 22 : sock at [0.317, 0.776, 0.347, 0.798].\nObject 23 : sock at [0.130, 0.790, 0.184, 0.810].\n\nRelationships:\nobject 4 : girl -> on -> object 14 : pavement.\nobject 5 : girl -> wearing -> object 22 : sock.\nobject 5 : girl -> wearing -> object 23 : sock.\nobject 4 : girl -> wearing -> object 20 : skirt.\nobject 4 : girl -> holding -> object 15 : racket.\nobject 5 : girl -> with -> object 0 : blonde hair.\nobject 17 : scooter -> with -> object 8 : handle.\nobject 1 : building -> with -> object 3 : fence.\nobject 4 : girl -> with -> object 11 : insignia.\nobject 13 : orange wheel -> of -> object 17 : scooter.\n\nRegion Description:\nRegion Description at [0.858, 0.760, 0.970, 0.852] : Girl is wearing blue, white, pink, and gray shoes..\nRegion Description at [0.293, 0.136, 0.976, 0.884] : a little girl holding a tennis racket..\nRegion Description at [0.060, 0.086, 0.462, 0.908] : A little girl standing near a scooter..\nRegion Description at [0.308, 0.146, 0.985, 0.892] : young girl wearing velcro strapped tennis shoes.\nRegion Description at [0.082, 0.436, 0.601, 0.996] : orange scooter board with black handles.\nRegion Description at [0.755, 0.184, 0.973, 0.372] : a tall building with fence in foreground.\nRegion Description at [0.021, 0.096, 0.988, 0.928] : two young girls wearing white outfits.\nRegion Description at [0.311, 0.136, 0.991, 0.886] : young girl with insignia on white outfit.\nRegion Description at [0.175, 0.814, 0.266, 0.888] : orange colored back wheel of a scooter board.\nRegion Description at [0.453, 0.478, 0.725, 0.848] : lavender, yellow and pink colored tennis racket.\n\nGlobal Caption:\ntwo little girls in tennis uniforms standing next to a scooter\nTwo young girls with a tennis racket and a scooter.\nTwo little girls posing for a picture, on a tennis court.\nTwo young girls on a tennis court with a racquet and a scooter\nTwo cute girls with a scooter and tennis raquet."}
+{"question_id": 32, "image": "000000273493.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : ball at [0.640, 0.399, 0.648, 0.411].\nObject 1 : border at [0.040, 0.502, 1.000, 0.556].\nObject 2 : boundary lines at [0.030, 0.661, 1.000, 1.000].\nObject 3 : bushes at [0.020, 0.186, 0.104, 0.517].\nObject 4 : fence at [0.008, 0.366, 0.994, 0.565].\nObject 5 : fence at [0.024, 0.502, 0.996, 0.709].\nObject 6 : grass at [0.004, 0.529, 0.994, 0.997].\nObject 7 : man at [0.144, 0.360, 0.246, 0.736].\nObject 8 : man at [0.730, 0.474, 0.780, 0.613].\nObject 9 : pants at [0.732, 0.529, 0.778, 0.604].\nObject 10 : shirt at [0.164, 0.411, 0.222, 0.547].\nObject 11 : shorts at [0.162, 0.535, 0.220, 0.628].\nObject 12 : sign at [0.916, 0.405, 0.934, 0.438].\nObject 13 : sky at [0.006, 0.021, 0.990, 0.279].\nObject 14 : sneakers at [0.180, 0.709, 0.216, 0.739].\nObject 15 : sneakers at [0.762, 0.598, 0.776, 0.613].\nObject 16 : tennis at [0.012, 0.384, 0.984, 0.934].\nObject 17 : tennis court at [0.000, 0.372, 0.988, 0.979].\nObject 18 : tennis racket at [0.768, 0.526, 0.808, 0.556].\nObject 19 : tennis racket at [0.214, 0.574, 0.238, 0.619].\nObject 20 : trees at [0.586, 0.282, 0.692, 0.420].\nObject 21 : white at [0.734, 0.492, 0.778, 0.601].\n\nRelationships:\nobject 7 : man -> in -> object 10 : shirt.\nobject 7 : man -> with -> object 19 : tennis racket.\nobject 7 : man -> plays -> object 16 : tennis.\nobject 7 : man -> wears -> object 14 : sneakers.\nobject 8 : man -> wears -> object 15 : sneakers.\nobject 7 : man -> wears -> object 11 : shorts.\nobject 8 : man -> wears -> object 9 : pants.\nobject 5 : fence -> has -> object 1 : border.\nobject 20 : trees -> behind -> object 3 : bushes.\nobject 2 : boundary lines -> on -> object 17 : tennis court.\nobject 2 : boundary lines -> on -> object 6 : grass.\nobject 3 : bushes -> behind -> object 4 : fence.\nobject 20 : trees -> behind -> object 4 : fence.\nobject 7 : man -> has -> object 19 : tennis racket.\nobject 8 : man -> wears -> object 21 : white.\nobject 4 : fence -> around -> object 17 : tennis court.\nobject 20 : trees -> behind -> object 8 : man.\nobject 6 : grass -> on -> object 17 : tennis court.\nobject 8 : man -> has -> object 18 : tennis racket.\nobject 8 : man -> hitting -> object 0 : ball.\nobject 5 : fence -> on -> object 17 : tennis court.\n\nRegion Description:\nRegion Description at [0.024, 0.489, 0.998, 0.730] : The tennis net separating the sides of the players..\nRegion Description at [0.144, 0.652, 0.234, 0.745] : The black sneakers the player is wearing..\nRegion Description at [0.720, 0.577, 0.784, 0.613] : The white sneakers the player is wearing..\nRegion Description at [0.158, 0.544, 0.230, 0.628] : The gray shorts the player is wearing..\nRegion Description at [0.006, 0.402, 0.998, 0.574] : The trimmed bushes behind the player..\nRegion Description at [0.008, 0.168, 0.998, 0.402] : The trees behind the trimmed bushes behind the player..\nRegion Description at [0.006, 0.604, 0.998, 0.985] : The white boundary lines on the tennis court..\nRegion Description at [0.020, 0.447, 0.994, 0.760] : A black and white net stretches across the field.\nRegion Description at [0.060, 0.526, 0.984, 0.985] : The field has green grass with white lines.\nRegion Description at [0.016, 0.369, 0.978, 0.595] : A tall green shrub is behind the fence.\nRegion Description at [0.034, 0.150, 0.984, 0.393] : Trees are seen behind the fence and shrub.\nRegion Description at [0.588, 0.327, 0.850, 0.703] : The yellow ball is flying towards the 
man.\nRegion Description at [0.902, 0.378, 0.956, 0.529] : A black circular sign with the number five.\nRegion Description at [0.142, 0.354, 0.248, 0.736] : male in white t-shirt playing tennis.\nRegion Description at [0.200, 0.565, 0.244, 0.625] : Head of tennis racket of man playing.\nRegion Description at [0.726, 0.465, 0.786, 0.631] : Man in white preparing to hit tennis ball.\n\nGlobal Caption:\nTwo men playing a game of tennis on a court.\ntwo people playing tennis with rackets on a grass court\nTwo young men playing a game of tennis.\nPeople playing tennis on a court surrounded by green hedges.\ntHERE ARE TWO MEN PLAYING TENNIS ON THE TENNIS COURT"}
+{"question_id": 33, "image": "000000360960.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : coat at [0.405, 0.332, 0.835, 0.746].\nObject 1 : decorative square at [0.000, 0.382, 1.000, 1.000].\nObject 2 : hat at [0.006, 0.162, 0.072, 0.198].\nObject 3 : jacket at [0.078, 0.222, 0.318, 0.430].\nObject 4 : jeans at [0.853, 0.422, 1.000, 0.632].\nObject 5 : leg at [0.853, 0.456, 0.928, 0.610].\nObject 6 : leg at [0.210, 0.458, 0.303, 0.638].\nObject 7 : leg at [0.000, 0.458, 0.060, 0.630].\nObject 8 : man at [0.066, 0.162, 0.318, 0.686].\nObject 9 : man at [0.850, 0.156, 1.000, 0.652].\nObject 10 : man at [0.390, 0.344, 0.838, 0.894].\nObject 11 : pants at [0.523, 0.736, 0.739, 0.858].\nObject 12 : person at [0.000, 0.162, 0.135, 0.668].\nObject 13 : person at [0.853, 0.154, 1.000, 0.650].\nObject 14 : section at [0.000, 0.134, 1.000, 1.000].\nObject 15 : sidewalk at [0.000, 0.388, 1.000, 1.000].\nObject 16 : umbrella at [0.168, 0.106, 0.910, 0.366].\nObject 17 : uniform at [0.000, 0.222, 0.126, 0.646].\nObject 18 : uniform at [0.105, 0.218, 0.318, 0.628].\n\nRelationships:\nobject 10 : man -> wearing -> object 11 : pants.\nobject 10 : man -> wearing -> object 0 : coat.\nobject 9 : man -> wearing -> object 4 : jeans.\nobject 8 : man -> wearing -> object 2 : hat.\nobject 8 : man -> wearing -> object 3 : jacket.\nobject 16 : umbrella -> has -> object 14 : section.\nobject 5 : leg -> of -> object 13 : person.\nobject 7 : leg -> of -> object 12 : person.\nobject 12 : person -> in -> object 17 : uniform.\n\nRegion Description:\nRegion Description at [0.066, 0.164, 0.318, 0.686] : the back of a man in a black uniform.\nRegion Description at [0.393, 0.324, 0.871, 0.766] : THIS MAN IS WEARING A LONG BLACK COAT.\nRegion Description at [0.468, 0.142, 0.634, 0.356] : THIS IS A RED SECTION ON THE UMBRELLA.\nRegion Description at [0.168, 0.140, 0.523, 0.292] : THIS IS A YELLOW SECTION ON THE UMBRELLA.\nRegion Description at [0.568, 0.138, 0.919, 0.232] : THIS IS A GREEN SECTION OF THE UMBRELLA.\n\nGlobal Caption:\nSeveral people walking on a sidewalk, with one man holding an umbrella.\nA person walking while carrying a rainbow umbrella\nA person is holding up a large colorful umbrella\na person walking down the street carrying a rainbow colored umbrella\nA person walking in a square carrying a rainbow colored umbrella."}
+{"question_id": 34, "image": "000000452122.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : airline at [0.408, 0.420, 0.758, 0.502].\nObject 1 : airplane at [0.112, 0.300, 0.858, 0.640].\nObject 2 : engine at [0.652, 0.529, 0.730, 0.592].\nObject 3 : engine at [0.494, 0.502, 0.574, 0.577].\nObject 4 : fin at [0.208, 0.303, 0.320, 0.492].\nObject 5 : fin at [0.116, 0.480, 0.284, 0.526].\nObject 6 : front door at [0.752, 0.435, 0.772, 0.483].\nObject 7 : gear at [0.450, 0.592, 0.600, 0.643].\nObject 8 : letters at [0.694, 0.489, 0.732, 0.520].\nObject 9 : name at [0.398, 0.426, 0.760, 0.489].\nObject 10 : sky at [0.000, 0.000, 0.998, 1.000].\nObject 11 : window at [0.806, 0.438, 0.844, 0.456].\nObject 12 : windows at [0.326, 0.450, 0.750, 0.532].\nObject 13 : wing at [0.152, 0.426, 0.598, 0.538].\nObject 14 : wing at [0.116, 0.492, 0.282, 0.538].\n\nRelationships:\nobject 6 : front door -> of -> object 1 : airplane.\n\nRegion Description:\n\nGlobal Caption:\nAn airplane flying in the air during the day.\nA large aircraft is shown in the air.\nThe large jumbo jet has it's landing gear lowered.\nA large white airplane flies in the gray sky.\nAn airplane in route with a cloudy sky behind it."}
+{"question_id": 35, "image": "000000134722.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : awning at [0.886, 0.000, 1.000, 0.240].\nObject 1 : awning at [0.000, 0.299, 0.132, 0.403].\nObject 2 : bench at [0.000, 0.592, 0.066, 0.683].\nObject 3 : building at [0.000, 0.299, 0.204, 0.659].\nObject 4 : canopy at [0.000, 0.301, 0.136, 0.400].\nObject 5 : car at [0.290, 0.400, 0.998, 0.784].\nObject 6 : clouds at [0.374, 0.067, 0.920, 0.312].\nObject 7 : door opening at [0.658, 0.501, 0.682, 0.680].\nObject 8 : door opening at [0.678, 0.509, 0.710, 0.675].\nObject 9 : exterior at [0.000, 0.400, 0.200, 0.669].\nObject 10 : front at [0.294, 0.400, 0.494, 0.739].\nObject 11 : gravel at [0.090, 0.837, 0.334, 0.997].\nObject 12 : headlights at [0.416, 0.624, 0.446, 0.656].\nObject 13 : headlights at [0.300, 0.624, 0.324, 0.651].\nObject 14 : markings at [0.606, 0.821, 0.770, 0.928].\nObject 15 : panel at [0.304, 0.421, 0.450, 0.677].\nObject 16 : pole at [0.030, 0.419, 0.062, 0.656].\nObject 17 : railway tracks at [0.000, 0.752, 0.520, 0.944].\nObject 18 : side walk at [0.192, 0.712, 1.000, 0.997].\nObject 19 : sky at [0.000, 0.000, 0.998, 0.560].\nObject 20 : train stop at [0.000, 0.000, 1.000, 1.000].\nObject 21 : trees at [0.208, 0.253, 0.322, 0.653].\nObject 22 : trim at [0.000, 0.333, 0.132, 0.403].\nObject 23 : wall at [0.000, 0.392, 0.206, 0.611].\nObject 24 : wheel at [0.844, 0.669, 0.884, 0.728].\nObject 25 : wheel at [0.792, 0.675, 0.840, 0.747].\nObject 26 : wheel at [0.516, 0.691, 0.620, 0.808].\nObject 27 : window at [0.316, 0.451, 0.458, 0.595].\nObject 28 : windows at [0.700, 0.547, 0.848, 0.632].\nObject 29 : windsheild wipers at [0.348, 0.499, 0.410, 0.584].\n\nRelationships:\nobject 6 : clouds -> in -> object 19 : sky.\nobject 2 : bench -> in -> object 4 : canopy.\nobject 22 : trim -> on -> object 1 : awning.\nobject 11 : gravel -> next to -> object 17 : railway tracks.\nobject 14 : markings -> on side of -> object 18 : side walk.\nobject 5 : car -> on -> object 17 : railway tracks.\n\nRegion Description:\nRegion Description at [0.288, 0.392, 0.510, 0.741] : the front of the train is yellow and white.\nRegion Description at [0.320, 0.451, 0.460, 0.592] : the front window of the train has windshield wipers.\nRegion Description at [0.292, 0.592, 0.456, 0.739] : the headlights are on front of the train.\nRegion Description at [0.010, 0.405, 0.220, 0.736] : a red brick wall is near the platform.\nRegion Description at [0.000, 0.288, 0.128, 0.707] : an aluminum canopy is on the platform.\nRegion Description at [0.016, 0.325, 0.100, 0.672] : a red steel pole is holding up the awning.\nRegion Description at [0.306, 0.395, 0.998, 0.733] : the train has windowed passenger cars.\nRegion Description at [0.300, 0.427, 0.492, 0.693] : the yellow and white front of a train.\nRegion Description at [0.510, 0.744, 0.834, 0.891] : white painted line beside a train track.\nRegion Description at [0.298, 0.408, 0.468, 0.661] : a yellow panel on the front of the train.\nRegion Description at [0.002, 0.397, 0.210, 0.675] : a red brick building on the side of the tracks.\nRegion Description at [0.844, 0.000, 0.998, 0.248] : an awning of a structure next to the train tracks.\nRegion Description at [0.294, 0.360, 0.516, 0.787] : front of a train car in yellow, white and blue.\nRegion Description at [0.194, 0.221, 0.286, 0.901] : trees on the side of a train station.\nRegion Description at [0.580, 0.821, 0.764, 0.931] : markings on the side of railway tracks.\nRegion Description at [0.632, 
0.491, 0.726, 0.691] : white, blue and grey doors on the side of a train car.\nRegion Description at [0.500, 0.096, 0.916, 0.531] : skyline on the side of a train station.\n\nGlobal Caption:\nFast commuter train moving past an outdoor platform.\nA train on the track pulling by a train station.\nA train pulling into a station outside during the day.\nA passenger train moving through a rail yard\na long passenger train pulling up to a station"}
+{"question_id": 36, "image": "000000039484.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : area at [0.002, 0.000, 0.998, 1.000].\nObject 1 : awning at [0.804, 0.625, 1.000, 0.718].\nObject 2 : billboard at [0.172, 0.367, 0.306, 0.484].\nObject 3 : building at [0.854, 0.246, 0.998, 0.815].\nObject 4 : building at [0.428, 0.537, 0.536, 0.801].\nObject 5 : building at [0.670, 0.481, 0.860, 0.827].\nObject 6 : building at [0.124, 0.531, 0.350, 0.798].\nObject 7 : car at [0.488, 0.774, 0.650, 0.900].\nObject 8 : car at [0.276, 0.762, 0.334, 0.850].\nObject 9 : car at [0.002, 0.804, 0.180, 1.000].\nObject 10 : car at [0.608, 0.768, 0.702, 0.894].\nObject 11 : car at [0.366, 0.783, 0.472, 0.871].\nObject 12 : car at [0.326, 0.757, 0.424, 0.865].\nObject 13 : car at [0.216, 0.786, 0.280, 0.839].\nObject 14 : car at [0.054, 0.833, 0.418, 1.000].\nObject 15 : car at [0.670, 0.768, 0.820, 0.909].\nObject 16 : crown at [0.176, 0.680, 0.210, 0.713].\nObject 17 : for lease at [0.542, 0.639, 0.602, 0.657].\nObject 18 : hat at [0.762, 0.525, 0.790, 0.554].\nObject 19 : lamp post at [0.604, 0.196, 0.738, 0.777].\nObject 20 : light at [0.718, 0.531, 0.740, 0.560].\nObject 21 : light at [0.612, 0.199, 0.650, 0.223].\nObject 22 : man at [0.756, 0.528, 0.794, 0.598].\nObject 23 : marquee at [0.526, 0.613, 0.644, 0.686].\nObject 24 : omnifest at [0.188, 0.431, 0.266, 0.463].\nObject 25 : outside at [0.000, 0.000, 0.998, 0.997].\nObject 26 : people at [0.816, 0.783, 0.990, 0.880].\nObject 27 : person at [0.856, 0.798, 0.890, 0.827].\nObject 28 : person at [0.880, 0.789, 0.910, 0.827].\nObject 29 : person at [0.944, 0.783, 0.966, 0.880].\nObject 30 : person at [0.956, 0.783, 0.984, 0.877].\nObject 31 : person at [0.906, 0.789, 0.924, 0.830].\nObject 32 : person at [0.368, 0.762, 0.402, 0.871].\nObject 33 : person at [0.916, 0.795, 0.942, 0.880].\nObject 34 : plane at [0.944, 0.026, 0.988, 0.053].\nObject 35 : pole at [0.002, 0.648, 0.076, 0.657].\nObject 36 : shirt at [0.856, 0.809, 0.892, 0.830].\nObject 37 : sign at [0.872, 0.525, 0.972, 0.595].\nObject 38 : sign at [0.738, 0.522, 0.796, 0.613].\nObject 39 : sign at [0.652, 0.326, 0.692, 0.372].\nObject 40 : sign at [0.432, 0.484, 0.522, 0.531].\nObject 41 : sky at [0.002, 0.000, 1.000, 0.630].\nObject 42 : street at [0.326, 0.853, 1.000, 1.000].\nObject 43 : street sign at [0.018, 0.645, 0.068, 0.666].\nObject 44 : table at [0.888, 0.824, 0.918, 0.880].\nObject 45 : van at [0.328, 0.754, 0.420, 0.865].\nObject 46 : wheel at [0.486, 0.839, 0.510, 0.891].\nObject 47 : wheel at [0.530, 0.842, 0.556, 0.894].\nObject 48 : wheel at [0.616, 0.865, 0.642, 0.903].\nObject 49 : wheel at [0.572, 0.880, 0.592, 0.900].\nObject 50 : window at [0.746, 0.777, 0.810, 0.812].\nObject 51 : window at [0.904, 0.413, 0.930, 0.516].\n\nRelationships:\nobject 7 : car -> on -> object 42 : street.\nobject 33 : person -> sitting -> object 25 : outside.\nobject 29 : person -> sitting -> object 25 : outside.\nobject 27 : person -> sitting -> object 25 : outside.\nobject 28 : person -> sitting -> object 25 : outside.\nobject 30 : person -> sitting -> object 25 : outside.\nobject 1 : awning -> on -> object 3 : building.\nobject 37 : sign -> on -> object 3 : building.\nobject 2 : billboard -> at top of -> object 6 : building.\nobject 32 : person -> getting out of -> object 11 : car.\nobject 19 : lamp post -> next to -> object 42 : street.\nobject 27 : person -> sitting at -> object 44 : table.\nobject 33 : person -> sitting at -> object 44 : table.\nobject 28 : person -> 
sitting at -> object 44 : table.\nobject 31 : person -> sitting at -> object 44 : table.\nobject 27 : person -> wearing -> object 36 : shirt.\nobject 2 : billboard -> on -> object 6 : building.\nobject 32 : person -> getting out of -> object 11 : car.\nobject 27 : person -> sitting -> object 25 : outside.\nobject 28 : person -> sitting -> object 25 : outside.\nobject 31 : person -> sitting -> object 25 : outside.\nobject 33 : person -> sitting -> object 25 : outside.\nobject 29 : person -> sitting -> object 25 : outside.\nobject 17 : for lease -> on -> object 23 : marquee.\nobject 47 : wheel -> on -> object 7 : car.\nobject 46 : wheel -> on -> object 7 : car.\nobject 48 : wheel -> on -> object 7 : car.\nobject 49 : wheel -> on -> object 7 : car.\nobject 51 : window -> on -> object 3 : building.\nobject 9 : car -> on -> object 42 : street.\nobject 39 : sign -> above -> object 5 : building.\nobject 14 : car -> parked on -> object 42 : street.\nobject 10 : car -> parked on -> object 42 : street.\nobject 13 : car -> parked on -> object 42 : street.\nobject 8 : car -> parked on -> object 42 : street.\nobject 12 : car -> parked on -> object 42 : street.\nobject 43 : street sign -> on -> object 35 : pole.\nobject 24 : omnifest -> on -> object 2 : billboard.\nobject 50 : window -> on -> object 15 : car.\nobject 17 : for lease -> on -> object 23 : marquee.\nobject 20 : light -> on -> object 19 : lamp post.\nobject 21 : light -> on -> object 19 : lamp post.\nobject 34 : plane -> in -> object 41 : sky.\nobject 32 : person -> getting out of -> object 11 : car.\nobject 24 : omnifest -> on -> object 2 : billboard.\nobject 45 : van -> in front of -> object 4 : building.\nobject 22 : man -> wearing -> object 18 : hat.\nobject 22 : man -> on -> object 38 : sign.\n\nRegion Description:\nRegion Description at [0.844, 0.777, 0.958, 0.897] : people sitting at table in front of restaurant.\nRegion Description at [0.530, 0.625, 0.640, 0.683] : a sign on a marquee that says \"for lease\".\nRegion Description at [0.166, 0.352, 0.296, 0.472] : large omnifest billboard advertisement.\nRegion Description at [0.074, 0.833, 0.428, 0.994] : parked silver car with tinted windshield.\nRegion Description at [0.088, 0.760, 0.178, 0.824] : Van in front of red building is purple.\n\nGlobal Caption:\nA variety of cars on a street with buildings.\nA variety of shops and stores on a busy city street.\nA busy street crammed with businesses and parked cars.\nA street full of cars that are driving and parked and a group of people sitting outside eating.\nA city has colorful buildings and signs among cars."}
+{"question_id": 37, "image": "000000159311.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : eye at [0.566, 0.526, 0.592, 0.565].\nObject 1 : grass at [0.004, 0.808, 0.118, 0.991].\nObject 2 : grass at [0.206, 0.853, 0.356, 0.982].\nObject 3 : leg at [0.232, 0.375, 0.312, 0.805].\nObject 4 : plant at [0.500, 0.736, 0.618, 0.796].\nObject 5 : sitck at [0.746, 0.042, 0.912, 0.339].\nObject 6 : zebra at [0.000, 0.000, 0.622, 0.790].\nObject 7 : zebra at [0.002, 0.000, 0.684, 0.682].\n\nRelationships:\nobject 7 : zebra -> eating -> object 4 : plant.\nobject 6 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> grazing in -> object 1 : grass.\nobject 6 : zebra -> grazing in -> object 1 : grass.\n\nRegion Description:\nRegion Description at [0.352, 0.093, 0.602, 0.393] : thin line of hair running down the neck.\n\nGlobal Caption:\nA pair of zebra's leaning over eating grass in a field.\nTwo zebra stand near bushes and tall grass.\nTwo zebras grazing from grass next to a tree.\nTwo zebra standing next to each other on a lush green field.\nTwo zebras are feeding on the grass by themselves."}
+{"question_id": 38, "image": "000000326174.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : beach at [0.000, 0.720, 0.998, 1.000].\nObject 1 : boy at [0.792, 0.480, 0.938, 0.853].\nObject 2 : child at [0.322, 0.587, 0.376, 0.835].\nObject 3 : child at [0.320, 0.587, 0.374, 0.835].\nObject 4 : girl at [0.444, 0.539, 0.534, 0.856].\nObject 5 : man at [0.140, 0.443, 0.216, 0.845].\nObject 6 : man at [0.434, 0.459, 0.500, 0.760].\nObject 7 : man at [0.578, 0.459, 0.682, 0.845].\nObject 8 : ocean waters at [0.590, 0.419, 0.892, 0.629].\nObject 9 : people at [0.206, 0.456, 0.352, 0.851].\nObject 10 : person at [0.792, 0.480, 0.936, 0.851].\nObject 11 : shirt at [0.592, 0.496, 0.670, 0.629].\nObject 12 : shore at [0.000, 0.360, 0.998, 0.997].\nObject 13 : surfboard at [0.306, 0.709, 0.538, 0.853].\nObject 14 : surfboard at [0.790, 0.587, 0.960, 0.691].\nObject 15 : water at [0.384, 0.368, 0.544, 0.435].\nObject 16 : waves at [0.656, 0.709, 0.794, 0.779].\nObject 17 : wetsuit at [0.326, 0.629, 0.372, 0.773].\nObject 18 : woman at [0.208, 0.499, 0.304, 0.629].\n\nRelationships:\nobject 1 : boy -> holding -> object 14 : surfboard.\nobject 5 : man -> and -> object 18 : woman.\nobject 18 : woman -> and -> object 3 : child.\nobject 16 : waves -> coming to -> object 12 : shore.\nobject 7 : man -> looking down to -> object 15 : water.\nobject 2 : child -> with -> object 17 : wetsuit.\nobject 6 : man -> looking back to -> object 4 : girl.\nobject 4 : girl -> pulling -> object 13 : surfboard.\nobject 9 : people -> on -> object 0 : beach.\nobject 7 : man -> wearing -> object 11 : shirt.\n\nRegion Description:\nRegion Description at [0.096, 0.437, 0.970, 0.872] : Seven people headed to the water to surf..\nRegion Description at [0.390, 0.531, 0.540, 0.851] : Girl in yellow shirt and pony tail. .\nRegion Description at [0.312, 0.581, 0.374, 0.851] : Small child with red and black wetsuit..\nRegion Description at [0.578, 0.443, 0.688, 0.856] : Man with white shirt and grey wetsuit pants..\nRegion Description at [0.436, 0.440, 0.534, 0.872] : Man looking back to girl pulling surfboard..\nRegion Description at [0.444, 0.459, 0.552, 0.853] : A man and a little girl having a conversation.\nRegion Description at [0.104, 0.419, 0.314, 0.851] : A man and a woman walking toward the water.\n\nGlobal Caption:\nA group of people are taking surfing lessons.\nA group of men, women and children walking toward the water with surfboards.\nA mixed age group is going toward the ocean with surfboards.\nA group of surfers are carrying their surf boards into the ocean.\nSeveral people are getting ready to enter the water for surfing."}
+{"question_id": 39, "image": "000000562207.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : body at [0.166, 0.539, 0.296, 0.997].\nObject 1 : boot at [0.594, 0.753, 0.620, 0.870].\nObject 2 : boot at [0.620, 0.744, 0.658, 0.858].\nObject 3 : bucket at [0.268, 0.744, 0.322, 0.828].\nObject 4 : clouds at [0.156, 0.000, 0.968, 0.328].\nObject 5 : ear at [0.590, 0.226, 0.638, 0.410].\nObject 6 : ear at [0.368, 0.208, 0.448, 0.434].\nObject 7 : elephant at [0.328, 0.157, 0.638, 0.967].\nObject 8 : eye at [0.476, 0.319, 0.504, 0.346].\nObject 9 : foot at [0.436, 0.901, 0.516, 0.958].\nObject 10 : grass at [0.950, 0.759, 0.996, 0.807].\nObject 11 : leg at [0.498, 0.572, 0.548, 0.898].\nObject 12 : leg at [0.408, 0.512, 0.516, 0.955].\nObject 13 : man at [0.582, 0.476, 0.662, 0.870].\nObject 14 : man at [0.164, 0.455, 0.292, 0.997].\nObject 15 : mountains at [0.000, 0.265, 0.376, 0.470].\nObject 16 : rock at [0.736, 0.895, 0.762, 0.934].\nObject 17 : sand at [0.240, 0.687, 0.998, 1.000].\nObject 18 : shirt at [0.582, 0.521, 0.650, 0.681].\nObject 19 : shorts at [0.174, 0.699, 0.254, 0.864].\nObject 20 : side at [0.236, 0.675, 0.994, 0.997].\nObject 21 : skirt at [0.298, 0.687, 0.360, 0.810].\nObject 22 : sky at [0.004, 0.000, 0.998, 0.355].\nObject 23 : top at [0.302, 0.539, 0.358, 0.696].\nObject 24 : tree at [0.012, 0.407, 0.076, 0.500].\nObject 25 : trunk at [0.506, 0.392, 0.600, 0.964].\nObject 26 : watch at [0.172, 0.711, 0.192, 0.732].\nObject 27 : water at [0.000, 0.488, 0.994, 1.000].\nObject 28 : woman at [0.288, 0.473, 0.420, 0.967].\n\nRelationships:\nobject 7 : elephant -> on -> object 20 : side.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 14 : man -> standing on -> object 20 : side.\nobject 14 : man -> standing beside -> object 7 : elephant.\nobject 10 : grass -> on -> object 20 : side.\nobject 28 : woman -> wearing -> object 23 : top.\nobject 13 : man -> wearing -> object 18 : shirt.\nobject 13 : man -> wearing -> object 1 : boot.\nobject 13 : man -> wearing -> object 2 : boot.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 7 : elephant -> has -> object 25 : trunk.\nobject 14 : man -> wearing -> object 19 : shorts.\nobject 28 : woman -> petting -> object 7 : elephant.\nobject 14 : man -> with -> object 7 : elephant.\nobject 28 : woman -> with -> object 7 : elephant.\nobject 13 : man -> with -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 9 : foot -> of an -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 11 : leg -> of -> object 7 : elephant.\nobject 12 : leg -> of -> object 7 : elephant.\nobject 5 : ear -> of -> object 7 : elephant.\nobject 6 : ear -> of -> object 7 : elephant.\nobject 8 : eye -> of -> object 7 : elephant.\nobject 27 : water -> behind -> object 7 : elephant.\n\nRegion Description:\nRegion Description at [0.338, 0.139, 0.618, 0.967] : the elephant standing on the lake side.\nRegion Description at [0.154, 0.392, 0.300, 0.964] : a man standing on the lake side with shorts.\nRegion Description at [0.574, 0.422, 0.686, 0.910] : the man standing beside the elephant.\nRegion Description at [0.292, 0.485, 0.378, 0.705] : this lady is wearing a blue tank top.\nRegion Description at [0.722, 0.768, 0.988, 0.964] : the sand is brown with green grass growing in it.\nRegion Description at [0.156, 0.669, 0.270, 0.910] : the man is wearing grey black and white shorts.\nRegion Description at [0.504, 0.560, 0.568, 0.898] : 
The front right leg of the elephant..\nRegion Description at [0.310, 0.536, 0.358, 0.690] : The light blue tank top the girl is wearing..\nRegion Description at [0.262, 0.732, 0.326, 0.825] : The black bucket in the girl's hand..\nRegion Description at [0.002, 0.443, 0.992, 0.994] : The water behind the people and the elephant..\n\nGlobal Caption:\nA group of people are standing next to an elephant emerging from the water.\na group of people stand beside of a giant elephant \nThree tourists pose for a picture next to an elephant.\nThree people stand with an elephant in front of a stream.\nThree people standing next to an elephant along a river."}
+{"question_id": 40, "image": "000000332318.jpg", "category": "refer_desc", "text": "Objects:\nObject 0 : background at [0.000, 0.000, 1.002, 0.997].\nObject 1 : bench at [0.604, 0.967, 0.672, 0.997].\nObject 2 : cow at [0.548, 0.860, 0.574, 0.896].\nObject 3 : cow at [0.436, 0.860, 0.454, 0.890].\nObject 4 : fence at [0.698, 0.949, 0.852, 0.997].\nObject 5 : moutain at [0.000, 0.057, 0.992, 0.782].\nObject 6 : pasture at [0.000, 0.815, 0.984, 1.000].\nObject 7 : peak at [0.744, 0.042, 0.898, 0.119].\nObject 8 : sky at [0.000, 0.000, 1.002, 0.257].\nObject 9 : snow at [0.210, 0.036, 0.962, 0.445].\nObject 10 : trailer at [0.796, 0.910, 0.894, 0.997].\nObject 11 : trailer at [0.632, 0.899, 0.742, 0.994].\nObject 12 : tree at [0.740, 0.409, 1.000, 0.982].\nObject 13 : tree at [0.638, 0.284, 0.652, 0.301].\n\nRelationships:\nobject 11 : trailer -> in -> object 6 : pasture.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 6 : pasture -> near -> object 5 : moutain.\nobject 3 : cow -> in -> object 6 : pasture.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 9 : snow -> on -> object 5 : moutain.\nobject 5 : moutain -> covered in -> object 9 : snow.\nobject 5 : moutain -> has -> object 7 : peak.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 5 : moutain -> in -> object 0 : background.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 11 : trailer -> near -> object 12 : tree.\nobject 5 : moutain -> has -> object 13 : tree.\nobject 7 : peak -> covered with -> object 9 : snow.\n\nRegion Description:\nRegion Description at [0.784, 0.901, 0.934, 0.991] : storage container for animal equipment.\nRegion Description at [0.828, 0.060, 0.880, 0.125] : The mountain is partially covered in snow..\nRegion Description at [0.840, 0.899, 0.920, 0.997] : horse trailer or cow trailer is silvertone, rectangular.\nRegion Description at [0.606, 0.919, 0.640, 0.982] : smaller trailer, white w/ brown+orange stripe.\nRegion Description at [0.060, 0.472, 0.540, 0.806] : a bare patch of earth amid lush green growth.\nRegion Description at [0.034, 0.839, 0.812, 0.973] : tiny cattle-containing fenceposts in the distance.\nRegion Description at [0.902, 0.827, 0.990, 0.997] : a split tree trunk in shadow, beneath leaves, shadow on ground.\nRegion Description at [0.734, 0.919, 0.802, 0.994] : an older station wagon/suv-type van thing.\nRegion Description at [0.090, 0.854, 0.124, 0.904] : a black & white animal stands alone, away from brown brethren, in the far distance.\n\nGlobal Caption:\nCows lounge in a field with a mountain backdrop.\nA VERY BIG MOUNTAIN AND ANIMALS SPREAD ACROSS A FARM.\nSeveral herd animals are on the grass by a mountain.\nCattle on a level pasture in a mountainous area.\nA bunch of cattle relax in a pasture located in the mountains"}
diff --git a/ferret/eval/ferret_gpt4_data/refer_caption/question.jsonl b/ferret/eval/ferret_gpt4_data/refer_caption/question.jsonl
new file mode 100644
index 0000000..1141a4e
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/refer_caption/question.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000069138.jpg", "category": "refer_desc", "text": "What is the interaction between the object [0.621, 0.082, 0.772, 0.132] and its surroundings?"}
+{"question_id": 1, "image": "000000131138.jpg", "category": "refer_desc", "text": "What is the interaction between the object [0.183, 0.799, 0.326, 0.896] and its surrounding?"}
+{"question_id": 2, "image": "000000475150.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.288, 0.324, 0.572, 0.649] and its interaction with the surrounding areas?"}
+{"question_id": 3, "image": "000000356424.jpg", "category": "refer_desc", "text": "What is happening in the region [0.528, 0.254, 0.717, 0.666] and what is its relationship to the surrounding objects?"}
+{"question_id": 4, "image": "000000491090.jpg", "category": "refer_desc", "text": "What can be said about the region [0.102, 0.498, 0.329, 0.692] in relation to nearby objects or elements?"}
+{"question_id": 5, "image": "000000484415.jpg", "category": "refer_desc", "text": "What can be observed in the region [0.716, 0.192, 0.894, 0.550] and how does it interact with the surroundings?"}
+{"question_id": 7, "image": "000000184324.jpg", "category": "refer_desc", "text": "What is happening within the region [0.564, 0.771, 0.876, 0.991] and how is it related to the nearby objects?"}
+{"question_id": 8, "image": "000000341058.jpg", "category": "refer_desc", "text": "What is the object [0.619, 0.838, 0.633, 0.850] and what is its relationship with nearby objects?"}
+{"question_id": 9, "image": "000000184384.jpg", "category": "refer_desc", "text": "What can you tell about the objects found in the region [0.628, 0.120, 0.998, 0.389]?"}
+{"question_id": 10, "image": "000000259097.jpg", "category": "refer_desc", "text": "What can be said about the region [0.012, 0.520, 0.996, 0.631] in relation to the surrounding areas?"}
+{"question_id": 11, "image": "000000377882.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.242, 0.211, 0.302, 0.989] and its surrounding context?"}
+{"question_id": 12, "image": "000000415748.jpg", "category": "refer_desc", "text": "What can you tell about the object [0.084, 0.438, 0.727, 0.954] and its interaction with nearby objects?"}
+{"question_id": 13, "image": "000000408120.jpg", "category": "refer_desc", "text": "What can you see within the region [0.394, 0.565, 0.570, 0.718] and what is its interaction with nearby objects?"}
+{"question_id": 14, "image": "000000184400.jpg", "category": "refer_desc", "text": "What is the interaction between the object [0.602, 0.837, 0.696, 0.997] and its surrounding objects?"}
+{"question_id": 15, "image": "000000276018.jpg", "category": "refer_desc", "text": "What can you tell me about the region [0.071, 0.378, 0.498, 0.842] and its interactions with nearby objects?"}
+{"question_id": 16, "image": "000000376322.jpg", "category": "refer_desc", "text": "What is the interaction between objects in the region [0.668, 0.252, 0.909, 0.622]?"}
+{"question_id": 17, "image": "000000125472.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.201, 0.002, 0.940, 0.758] and its interaction with surrounding objects?"}
+{"question_id": 18, "image": "000000361551.jpg", "category": "refer_desc", "text": "Can you tell me about the interaction happening in the region [0.957, 0.616, 0.997, 0.670] and its context?"}
+{"question_id": 19, "image": "000000412240.jpg", "category": "refer_desc", "text": "What can be said about the region [0.002, 0.437, 0.720, 0.787] in terms of the surrounding objects and their interactions?"}
+{"question_id": 20, "image": "000000130566.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.630, 0.471, 0.682, 0.550] and its interaction with the surrounding environment?"}
+{"question_id": 21, "image": "000000421923.jpg", "category": "refer_desc", "text": "What is the relationship between the object [0.838, 0.774, 0.994, 0.974] and object [0.796, 0.462, 0.982, 0.550]?"}
+{"question_id": 22, "image": "000000513567.jpg", "category": "refer_desc", "text": "What is happening in the region [0.926, 0.253, 0.998, 0.645]?"}
+{"question_id": 23, "image": "000000543300.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.414, 0.691, 0.662, 0.725] and how it relates to the surroundings?"}
+{"question_id": 24, "image": "000000241668.jpg", "category": "refer_desc", "text": "What is happening in the region [0.487, 0.136, 0.808, 0.986]?"}
+{"question_id": 25, "image": "000000535578.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.189, 0.032, 0.703, 0.178] and its surrounding areas?"}
+{"question_id": 26, "image": "000000277051.jpg", "category": "refer_desc", "text": "Describe the bird [0.384, 0.372, 0.698, 0.787] and its interactions with surrounding objects?"}
+{"question_id": 27, "image": "000000018519.jpg", "category": "refer_desc", "text": "What are the details of the region [0.279, 0.524, 0.341, 0.570] and how does it relate to the nearby objects?"}
+{"question_id": 28, "image": "000000106048.jpg", "category": "refer_desc", "text": "Can you describe what's happening in the region [0.222, 0.144, 0.820, 0.757]?"}
+{"question_id": 29, "image": "000000058393.jpg", "category": "refer_desc", "text": "What can you say about the interaction between objects in the region [0.542, 0.343, 0.812, 0.493]?"}
+{"question_id": 30, "image": "000000010764.jpg", "category": "refer_desc", "text": "Referencing the region [0.546, 0.625, 0.626, 0.801], can you describe what you see and how it interacts with the surrounding context?"}
+{"question_id": 31, "image": "000000271402.jpg", "category": "refer_desc", "text": "What can you tell me about the region [0.060, 0.086, 0.462, 0.908] and its relation to nearby objects?"}
+{"question_id": 32, "image": "000000273493.jpg", "category": "refer_desc", "text": "What is happening in the region [0.588, 0.327, 0.850, 0.703] with regard to its surroundings?"}
+{"question_id": 33, "image": "000000360960.jpg", "category": "refer_desc", "text": "Can you describe the region [0.524, 0.740, 0.734, 0.856] and its interaction with the surroundings?"}
+{"question_id": 34, "image": "000000452122.jpg", "category": "refer_desc", "text": "What is happening in the region [0.650, 0.428, 0.858, 0.600]?"}
+{"question_id": 35, "image": "000000134722.jpg", "category": "refer_desc", "text": "What can you say about the region [0.320, 0.451, 0.460, 0.592] and its relation with nearby objects?"}
+{"question_id": 36, "image": "000000039484.jpg", "category": "refer_desc", "text": "What is happening in the region [0.844, 0.777, 0.958, 0.897] and how does this relate to the surrounding area?"}
+{"question_id": 37, "image": "000000159311.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.206, 0.853, 0.356, 0.982] considering the surrounding entities and their interactions?"}
+{"question_id": 38, "image": "000000326174.jpg", "category": "refer_desc", "text": "Can you describe the interaction or relationship between the objects in the region [0.444, 0.459, 0.552, 0.853]?"}
+{"question_id": 39, "image": "000000562207.jpg", "category": "refer_desc", "text": "Can you describe what's happening in the region [0.154, 0.392, 0.300, 0.964] and how it relates to nearby objects or individuals?"}
+{"question_id": 40, "image": "000000332318.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.436, 0.860, 0.454, 0.890] and how does it relate to the rest of the scene?"}
diff --git a/ferret/eval/ferret_gpt4_data/refer_reason/answer.jsonl b/ferret/eval/ferret_gpt4_data/refer_reason/answer.jsonl
new file mode 100644
index 0000000..e0da64d
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/refer_reason/answer.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000130566.jpg", "category": "refer_reason", "text": "The object is a windshield on the train. The windshield is typically used to provide visibility for the train operator while protecting them from wind, debris, and other elements that could obstruct their view or pose a safety risk while the train is in motion."}
+{"question_id": 1, "image": "000000010764.jpg", "category": "refer_reason", "text": "The object is knee and leg pads which are worn by the player. These pads are used in sports like baseball to protect the player's knees and legs from injuries that could occur due to falls, abrupt movements, or impacts with other players or equipment like the ball."}
+{"question_id": 2, "image": "000000184324.jpg", "category": "refer_reason", "text": "The object is a blue street sign with a white \"P\" on it. This type of sign typically indicates a parking zone or area. Therefore, its purpose is to provide information about the availability of parking spaces to drivers."}
+{"question_id": 3, "image": "000000452122.jpg", "category": "refer_reason", "text": "The object is the landing gear of an airplane. The purpose of the landing gear is to support the airplane during landing and take-off. It also facilitates the airplane's movement when it is on the ground. In this image, the airplane appears to be in flight and the landing gear is lowered, which usually happens during landing or take-off procedures."}
+{"question_id": 4, "image": "000000032334.jpg", "category": "refer_reason", "text": "The object is a pair of glasses that the woman is wearing. These glasses, also known as eyeglasses or spectacles, are typically used to correct vision, protecting the eyes, or for fashion. In this context, considering the woman's age and the style of the glasses, they are likely used to correct her vision."}
+{"question_id": 5, "image": "000000360960.jpg", "category": "refer_reason", "text": "The man appears to be walking. He is wearing jeans and seems to be in an outdoor setting. Given the context of the image where several people are walking on a sidewalk and one man is holding a colorful umbrella, it is likely that the man might continue walking, possibly under the umbrella if it is raining or if he wants to take shelter from the sun."}
+{"question_id": 7, "image": "000000376322.jpg", "category": "refer_reason", "text": "The people are sitting on both sides of a long table. On the table, there are plates of food, including bread and butter, and glasses of red wine. Some people are reading a menu, suggesting they are at a restaurant. Given this setting, it appears to be a group meal or gathering, possibly a celebration or a business meal."}
+{"question_id": 8, "image": "000000271402.jpg", "category": "refer_reason", "text": "The object is a tennis racket held by a young girl. The purpose of the tennis racket is to hit a tennis ball in the sport of tennis. The girls seem to be dressed in tennis uniforms, suggesting they might be preparing to play or practicing the sport."}
+{"question_id": 9, "image": "000000356424.jpg", "category": "refer_reason", "text": "The object is a sign, specifically a yellow closed sign with brown letters. This sign is typically used in commercial settings to indicate that a store or service is not currently open to the public. This could relate to the scene in the image, possibly indicating that the man and woman are having a private meal in a restaurant that is currently closed to the public."}
+{"question_id": 10, "image": "000000131138.jpg", "category": "refer_reason", "text": "The object is a pair of headphones. In this setting, with a laptop, a monitor, a keyboard, and a mouse on the desk, the headphones are likely used for audio output. This could be for listening to music, taking video calls, or for any multimedia content that the user might be interacting with on the computer. The headphones help to keep the audio private and not disturb others in the environment, which is particularly useful in an office or shared workspace setting."}
+{"question_id": 11, "image": "000000332318.jpg", "category": "refer_reason", "text": "The object is a trailer. Given the rural setting of the image, with cows in a pasture and a mountain backdrop, this trailer is most likely used for the transport of farm animals such as cows or horses. It could also be used for storing animal equipment, as it's common in such settings."}
+{"question_id": 12, "image": "000000513567.jpg", "category": "refer_reason", "text": "The girl has her mouth open. Given the context of the image, this girl is holding a hot dog and looking at it, it's reasonable to infer that she is opening her mouth in anticipation of eating the hot dog. It's a common reaction when people are about to eat something delicious."}
+{"question_id": 13, "image": "000000134722.jpg", "category": "refer_reason", "text": "The object is windshield wipers, located on the front window of a train. The purpose of these wipers is to clear rain, snow, and other debris from the windshield, to improve the driver's visibility during poor weather conditions."}
+{"question_id": 14, "image": "000000341058.jpg", "category": "refer_reason", "text": "The object is a restaurant sign posted on a post. This is typically used to advertise and identify the restaurant to passers-by and potential customers. It can provide information such as the name of the restaurant, its logo, or other branding elements."}
+{"question_id": 15, "image": "000000277051.jpg", "category": "refer_reason", "text": "The object is a bottle. Bottles are typically used to hold and store different types of liquids. In this context, it might be used to store a beverage for the meal."}
+{"question_id": 16, "image": "000000376900.jpg", "category": "refer_reason", "text": "The object is a tennis racket, held by a man who is a tennis player. He is ready to serve the ball. Therefore, the tennis racket is used to hit the tennis ball in the game."}
+{"question_id": 17, "image": "000000412240.jpg", "category": "refer_reason", "text": "The region corresponds to a date. Judging by the context of the image, which features a dog sitting on the floor next to a pair of shoes, it's likely this date represents when the photo was taken, possibly indicating a special moment or event."}
+{"question_id": 18, "image": "000000179765.jpg", "category": "refer_reason", "text": "The object is a shock absorber on the bike. Its purpose is to absorb or dampen shock impulses. It does this by converting the kinetic energy of the shock into another form of energy (typically heat) which is then dissipated. In the context of the motorcycle, it is particularly useful in providing comfort and stability for the rider, especially when travelling over uneven or rough terrains."}
+{"question_id": 19, "image": "000000329219.jpg", "category": "refer_reason", "text": "The object is a dark electric cord that is plugged into the wall. It is used to transmit electrical power from the outlet to an electrical device, such as a blender, allowing it to operate."}
+{"question_id": 20, "image": "000000184384.jpg", "category": "refer_reason", "text": "The object is butter. The butter is spread on top of a blueberry cake. This could be done to add extra flavor to the cake, as butter can enhance the taste and texture of baked goods."}
+{"question_id": 21, "image": "000000018519.jpg", "category": "refer_reason", "text": "The region refers to a black wrist guard on a man's wrist. The wrist guard is used by the skater to protect his wrist while performing tricks, as it can help prevent injuries in case of a fall."}
+{"question_id": 22, "image": "000000415748.jpg", "category": "refer_reason", "text": "The object is a tusk, which is part of an elephant. The unusual thing about this tusk is that it is on the face of the elephant, indicating that the elephant is an adult, as tusks only grow in after an elephant has reached maturity. In addition, the tusk is part of the painted decoration on the elephant's face, which is not a common sight and is typically associated with specific cultural practices or festivals."}
+{"question_id": 23, "image": "000000543300.jpg", "category": "refer_reason", "text": "The region includes red letters on the side of a cruise ship. The purpose of these letters is typically to display the name of the ship or to showcase some identifying information about the ship, such as the port of registry. In this case, the letters likely serve as a way to identify the cruise ship as the \"Port River Dolphin Cruise\" ship. The letters are important for communication and identification purposes in maritime navigation."}
+{"question_id": 24, "image": "000000349184.jpg", "category": "refer_reason", "text": "The object is a purse. The purpose of a purse is generally to carry personal items such as wallet, keys, cosmetics, and other small belongings. Looking at the image, the purse seems to be used by the woman who is sitting on the bench. It's a common accessory for people, especially women, when they go out."}
+{"question_id": 25, "image": "000000042070.jpg", "category": "refer_reason", "text": "The object is a display on the front of a bus. The purpose of this display is to show the bus route name and number. This helps passengers identify the route and destination of the bus."}
+{"question_id": 26, "image": "000000241668.jpg", "category": "refer_reason", "text": "The object is a ring. Given the context of the image, which features two people in formal wedding attire, one wearing a wedding dress and the other holding a piece of cake, it can be inferred that this ring might be a wedding ring. Therefore, it's highly possible that a wedding ceremony is taking place."}
+{"question_id": 27, "image": "000000535578.jpg", "category": "refer_reason", "text": "The object is a stone wall that borders the field where the sheep are grazing. It serves as a boundary to keep the sheep contained within the field, preventing them from wandering off into unwanted areas. This is a common practice in sheep farming to manage and protect the herd."}
+{"question_id": 28, "image": "000000484415.jpg", "category": "refer_reason", "text": "The object is a brush, specifically a toilet scrubber. This tool is designed for cleaning the inside of a toilet bowl. It typically has stiff bristles and a long handle to allow for efficient cleaning while keeping the user's hand away from the toilet water and bowl."}
+{"question_id": 29, "image": "000000491090.jpg", "category": "refer_reason", "text": "The object is a small circular orange indicator light on the motorcycle. Its primary function is to indicate the direction or intention of the motorcycle's movement, typically used when the rider is about to make a turn or change lanes. It enhances safety by signaling the rider's intentions to other road users."}
+{"question_id": 30, "image": "000000276018.jpg", "category": "refer_reason", "text": "The item is a hat, worn by a boy. The boy is outside walking with a group of other kids, all holding stuffed animals. The hat is likely being used to protect the boy's head from the sun."}
+{"question_id": 31, "image": "000000361551.jpg", "category": "refer_reason", "text": "The object is a cap. It's worn by a man in the service area of what appears to be an airport. The cap is used for various purposes such as protection from the sun, keeping the head warm, or as a fashion accessory. In this case, it could be used for any of these purposes, or even to help identify the man as part of a particular group or organization."}
+{"question_id": 32, "image": "000000562207.jpg", "category": "refer_reason", "text": "The object is a bucket carried by a woman. Given that they are located beside an elephant on the side of a lake, it's likely that the bucket could be used for feeding or bathing the elephant."}
+{"question_id": 33, "image": "000000553990.jpg", "category": "refer_reason", "text": "The object is a bridal. A girl is seen riding a horse, and the bridal is being used by her to control and guide the horse. The bridal allows the rider to communicate with the horse by applying pressure on the horse's mouth, head, and neck. This is particularly important in activities such as the horse jumping event depicted."}
+{"question_id": 34, "image": "000000106048.jpg", "category": "refer_reason", "text": "The region has the text \"Divine Transportation\" written on the front of a bus. This is likely the name of the bus company or service. It's a common practice to display the company name on the front and sides of the bus. This serves the purpose of advertising the bus service and allowing passengers and others to identify the company operating the bus."}
+{"question_id": 35, "image": "000000421923.jpg", "category": "refer_reason", "text": "The object is a vase. It is filled with water and contains several flowers. Therefore, its purpose is decorative: it is used to display and support the flowers, enhancing the aesthetics of the room."}
+{"question_id": 36, "image": "000000273493.jpg", "category": "refer_reason", "text": "The object is a black circular sign with the number five on it. In the context of this image, this is likely a score or court number sign in a tennis court. So, its function is to indicate the score or the court number to help players, referees, and spectators keep track of the game progress or location."}
+{"question_id": 37, "image": "000000475150.jpg", "category": "refer_reason", "text": "The region refers to the lower part of a giraffe's neck and its body. The pattern in this region is spotted, with tan and brown colors. This pattern is a typical skin coloration of giraffes, indicating that the object is indeed a giraffe. The pattern provides camouflage and helps the giraffe blend in with its natural environment."}
+{"question_id": 38, "image": "000000125472.jpg", "category": "refer_reason", "text": "The object is a skateboard. The skateboard was popularly invented in the 1940s or 1950s when Californian surfers were looking for a way to surf on land when the waves were not suitable for surfing. The current image shows a man performing a trick on a skateboard, which indicates the evolved use of skateboards for recreational and sporting activities, particularly in skateboarding sports and competitions."}
+{"question_id": 39, "image": "000000069138.jpg", "category": "refer_reason", "text": "The object is a sign. The sign is providing information that the building nearby offers short term office space, as small as 2,500 sq. ft. Therefore, the purpose of this sign is to inform potential customers or tenants about the availability and flexibility of office space in that building."}
+{"question_id": 40, "image": "000000408120.jpg", "category": "refer_reason", "text": "The object is an umbrella, held by a little girl who is wearing a pink dress. The purpose of the umbrella in this scene is likely to protect the girl from the weather. Since there's no indication of rain in the image, it might be used to shield her from the sun."}
diff --git a/ferret/eval/ferret_gpt4_data/refer_reason/context.jsonl b/ferret/eval/ferret_gpt4_data/refer_reason/context.jsonl
new file mode 100644
index 0000000..87d6658
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/refer_reason/context.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000130566.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : buds at [0.130, 0.814, 0.334, 0.883].\nObject 1 : building at [0.622, 0.213, 0.708, 0.273].\nObject 2 : building at [0.708, 0.222, 0.994, 0.294].\nObject 3 : building at [0.472, 0.240, 0.602, 0.282].\nObject 4 : cars at [0.628, 0.411, 0.912, 0.739].\nObject 5 : electric lines at [0.000, 0.000, 0.912, 0.126].\nObject 6 : gravel at [0.382, 0.381, 0.878, 0.907].\nObject 7 : leaves at [0.736, 0.357, 0.764, 0.390].\nObject 8 : pole at [0.550, 0.589, 0.558, 0.724].\nObject 9 : sky at [0.322, 0.093, 0.852, 0.162].\nObject 10 : tracks at [0.382, 0.429, 0.502, 0.511].\nObject 11 : tracks at [0.374, 0.408, 0.692, 0.709].\nObject 12 : tracks at [0.706, 0.775, 0.942, 0.922].\nObject 13 : train at [0.016, 0.273, 0.906, 0.733].\nObject 14 : train tracks at [0.024, 0.291, 0.996, 0.997].\nObject 15 : tree at [0.760, 0.279, 0.998, 0.426].\nObject 16 : wall at [0.556, 0.721, 0.790, 0.991].\nObject 17 : windshield at [0.850, 0.523, 0.898, 0.583].\nObject 18 : windshield at [0.796, 0.526, 0.846, 0.580].\n\nRelationships:\nobject 18 : windshield -> on a -> object 13 : train.\nobject 12 : tracks -> for a -> object 13 : train.\nobject 15 : tree -> with -> object 7 : leaves.\nobject 5 : electric lines -> on -> object 14 : train tracks.\nobject 8 : pole -> beside -> object 13 : train.\nobject 16 : wall -> beside -> object 13 : train.\nobject 13 : train -> traveling down -> object 11 : tracks.\n\nRegion Description:\nRegion Description at [0.022, 0.258, 0.632, 0.679] : THESE CARS ARE FOR CARGO NOT PASSENGERS.\nRegion Description at [0.630, 0.471, 0.682, 0.550] : THE WINDOWS ARE ON THE SIDE OF THE ENGINE.\nRegion Description at [0.000, 0.024, 0.448, 0.144] : electric lines hanging above train tracks.\nRegion Description at [0.532, 0.571, 0.568, 0.727] : black metal pole beside train tracks.\nRegion Description at [0.782, 0.586, 0.918, 0.667] : yellow paint on the front of the train.\nRegion Description at [0.062, 0.300, 0.996, 0.997] : multiple sets of tracks on the ground.\nRegion Description at [0.026, 0.114, 0.950, 0.970] : a freight train travelling down the tracks.\nRegion Description at [0.054, 0.685, 0.684, 0.991] : wildflowers on the side of a train track.\nRegion Description at [0.002, 0.129, 0.998, 0.991] : the grass and trees around the tracks.\n\nGlobal Caption:\nA yellow train on the tracks with several cars\nA train pulls past an intersection in the rail in a rural area.\na long cargo train going down a track by some trees \nA train with a red and yellow engine on a railroad track.\nA train pulls a large number of cars through a junction."}
+{"question_id": 1, "image": "000000010764.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : catcher at [0.334, 0.193, 0.756, 0.940].\nObject 1 : field at [0.000, 0.000, 0.998, 0.997].\nObject 2 : glove at [0.660, 0.492, 0.764, 0.674].\nObject 3 : hand at [0.666, 0.498, 0.748, 0.665].\nObject 4 : helmet at [0.472, 0.187, 0.610, 0.444].\nObject 5 : jersey at [0.340, 0.332, 0.556, 0.695].\nObject 6 : line at [0.396, 0.656, 0.560, 0.731].\nObject 7 : lines at [0.866, 0.927, 1.000, 0.997].\nObject 8 : lines at [0.754, 0.837, 0.998, 0.867].\nObject 9 : pads at [0.562, 0.668, 0.634, 0.782].\nObject 10 : pants at [0.336, 0.640, 0.612, 0.858].\nObject 11 : sneakers at [0.406, 0.834, 0.544, 0.946].\nObject 12 : stripe at [0.608, 0.737, 0.998, 0.795].\nObject 13 : wrist band at [0.586, 0.583, 0.604, 0.640].\n\nRelationships:\nobject 0 : catcher -> in -> object 1 : field.\nobject 2 : glove -> on -> object 3 : hand.\nobject 6 : line -> on -> object 10 : pants.\n\nRegion Description:\nRegion Description at [0.546, 0.625, 0.626, 0.801] : The player is wearing knee and leg pads..\nRegion Description at [0.018, 0.665, 0.280, 0.825] : A brown dirt ground surface on a baseball field.\nRegion Description at [0.676, 0.701, 0.974, 0.979] : White chalk lines painted on a baseball field.\nRegion Description at [0.062, 0.130, 0.370, 0.535] : A green grass ground surface of a baseball field.\nRegion Description at [0.566, 0.580, 0.620, 0.656] : A black and red bracelet on a man's wrist.\n\nGlobal Caption:\nA catches crouches on a patch of dirt.\nA catcher squatting at a base with his gloved hand extended.\nA baseball catcher stands ready to catch a ball.\na catcher kneeling at the mound waiting for a baseball \nA catcher in white uniform during a baseball game."}
+{"question_id": 2, "image": "000000184324.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : awning at [0.514, 0.500, 0.736, 0.545].\nObject 1 : bag at [0.086, 0.723, 0.124, 0.777].\nObject 2 : bicycle at [0.716, 0.660, 0.756, 0.738].\nObject 3 : bikes at [0.710, 0.753, 0.864, 0.934].\nObject 4 : black jacket at [0.052, 0.569, 0.120, 0.723].\nObject 5 : blue jeans at [0.654, 0.678, 0.672, 0.729].\nObject 6 : building at [0.540, 0.042, 0.760, 0.617].\nObject 7 : building at [0.706, 0.000, 0.998, 0.726].\nObject 8 : bus at [0.186, 0.491, 0.246, 0.608].\nObject 9 : car at [0.432, 0.557, 0.538, 0.636].\nObject 10 : cars at [0.130, 0.491, 0.756, 0.630].\nObject 11 : coat at [0.128, 0.602, 0.212, 0.798].\nObject 12 : cross walk at [0.428, 0.750, 0.954, 1.000].\nObject 13 : cyclist at [0.752, 0.614, 0.860, 0.792].\nObject 14 : lines at [0.432, 0.608, 0.948, 1.000].\nObject 15 : man at [0.052, 0.518, 0.132, 0.898].\nObject 16 : people at [0.000, 0.515, 0.212, 1.000].\nObject 17 : people at [0.754, 0.605, 0.858, 0.756].\nObject 18 : pole at [0.954, 0.699, 0.970, 0.777].\nObject 19 : road at [0.004, 0.545, 1.000, 1.000].\nObject 20 : scarf at [0.032, 0.873, 0.134, 0.997].\nObject 21 : sidewalk at [0.536, 0.572, 0.668, 0.623].\nObject 22 : sign at [0.482, 0.470, 0.494, 0.494].\nObject 23 : sign at [0.810, 0.407, 0.970, 0.497].\nObject 24 : sign at [0.584, 0.434, 0.614, 0.494].\nObject 25 : store at [0.806, 0.395, 0.968, 0.720].\nObject 26 : street light at [0.640, 0.461, 0.652, 0.485].\nObject 27 : stripes at [0.452, 0.620, 0.944, 0.982].\nObject 28 : tires at [0.712, 0.747, 0.864, 0.931].\nObject 29 : tree at [0.280, 0.358, 0.340, 0.569].\nObject 30 : van at [0.460, 0.545, 0.488, 0.566].\nObject 31 : window at [0.820, 0.217, 0.884, 0.358].\nObject 32 : windshield at [0.192, 0.512, 0.242, 0.548].\nObject 33 : woman at [0.128, 0.569, 0.212, 0.913].\nObject 34 : woman at [0.650, 0.593, 0.688, 0.729].\nObject 35 : woman at [0.020, 0.765, 0.168, 1.000].\nObject 36 : writing at [0.838, 0.422, 0.948, 0.482].\n\nRelationships:\nobject 3 : bikes -> are on -> object 19 : road.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 17 : people -> are riding -> object 3 : bikes.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 17 : people -> are on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 12 : cross walk -> being used by a -> object 13 : cyclist.\nobject 17 : people -> are using -> object 12 : cross walk.\nobject 0 : awning -> above -> object 21 : sidewalk.\nobject 10 : cars -> are on -> object 19 : road.\nobject 26 : street light -> on -> object 6 : building.\nobject 27 : stripes -> on -> object 12 : cross walk.\nobject 7 : building -> has a -> object 31 : window.\nobject 3 : bikes -> have -> object 28 : tires.\nobject 35 : woman -> wearing a -> object 20 : scarf.\nobject 23 : sign -> for -> object 25 : store.\nobject 33 : woman -> wearing a -> object 11 : coat.\nobject 34 : woman -> wearing -> object 5 : blue jeans.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 14 : lines -> are on -> object 19 : road.\nobject 15 : man -> wearing a -> object 4 : black jacket.\nobject 30 : van -> on -> object 19 : road.\nobject 15 : man -> has a -> object 1 : bag.\nobject 8 : bus -> has a -> object 32 : windshield.\nobject 7 : building -> has a -> object 31 : window.\nobject 31 : window -> above -> object 23 : sign.\nobject 14 : lines -> are on -> object 19 : road.\nobject 18 : 
pole -> near -> object 7 : building.\nobject 35 : woman -> wearing a -> object 20 : scarf.\n\nRegion Description:\nRegion Description at [0.822, 0.395, 0.968, 0.500] : red writing above buisness along the street.\nRegion Description at [0.564, 0.771, 0.876, 0.991] : white stripes painted to indicate cross walk.\nRegion Description at [0.184, 0.485, 0.244, 0.605] : large white vehicle with big windshield.\nRegion Description at [0.478, 0.464, 0.492, 0.491] : blue street sign with a white P on it.\nRegion Description at [0.820, 0.220, 0.886, 0.370] : window on the building above red sign.\n\nGlobal Caption:\nA group of people walking across a busy city street.\nA fish eye lens shows the corner of a busy city street with bikes, people and buildings.\na number of people and cars on a city street\nAn oddly taken photo of some buildings and shops.\nA picture of a city intersection with period buildings and store fronts. "}
+{"question_id": 3, "image": "000000452122.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : airline at [0.408, 0.420, 0.758, 0.502].\nObject 1 : airplane at [0.112, 0.300, 0.858, 0.640].\nObject 2 : engine at [0.652, 0.529, 0.730, 0.592].\nObject 3 : engine at [0.494, 0.502, 0.574, 0.577].\nObject 4 : fin at [0.208, 0.303, 0.320, 0.492].\nObject 5 : fin at [0.116, 0.480, 0.284, 0.526].\nObject 6 : front door at [0.752, 0.435, 0.772, 0.483].\nObject 7 : gear at [0.450, 0.592, 0.600, 0.643].\nObject 8 : letters at [0.694, 0.489, 0.732, 0.520].\nObject 9 : name at [0.398, 0.426, 0.760, 0.489].\nObject 10 : sky at [0.000, 0.000, 0.998, 1.000].\nObject 11 : window at [0.806, 0.438, 0.844, 0.456].\nObject 12 : windows at [0.326, 0.450, 0.750, 0.532].\nObject 13 : wing at [0.152, 0.426, 0.598, 0.538].\nObject 14 : wing at [0.116, 0.492, 0.282, 0.538].\n\nRelationships:\nobject 6 : front door -> of -> object 1 : airplane.\n\nRegion Description:\n\nGlobal Caption:\nAn airplane flying in the air during the day.\nA large aircraft is shown in the air.\nThe large jumbo jet has it's landing gear lowered.\nA large white airplane flies in the gray sky.\nAn airplane in route with a cloudy sky behind it."}
+{"question_id": 4, "image": "000000032334.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : akimbo at [0.232, 0.275, 0.294, 0.328].\nObject 1 : counter at [0.000, 0.552, 0.214, 0.995].\nObject 2 : eyeglasses at [0.080, 0.269, 0.182, 0.339].\nObject 3 : face at [0.358, 0.261, 0.520, 0.504].\nObject 4 : glass at [0.406, 0.517, 0.526, 0.763].\nObject 5 : glasses at [0.350, 0.325, 0.516, 0.373].\nObject 6 : green at [0.538, 0.584, 0.590, 0.648].\nObject 7 : green shirt at [0.228, 0.475, 0.626, 0.997].\nObject 8 : ground at [0.274, 0.421, 0.380, 0.525].\nObject 9 : hair at [0.314, 0.155, 0.560, 0.496].\nObject 10 : man at [0.226, 0.221, 0.292, 0.395].\nObject 11 : man at [0.346, 0.000, 1.000, 1.000].\nObject 12 : man at [0.044, 0.184, 0.282, 0.704].\nObject 13 : menu at [0.002, 0.640, 0.196, 0.728].\nObject 14 : menu at [0.000, 0.771, 0.144, 0.931].\nObject 15 : shirt at [0.604, 0.491, 1.000, 0.992].\nObject 16 : teeth at [0.706, 0.371, 0.808, 0.416].\nObject 17 : wine at [0.270, 0.675, 0.342, 0.715].\nObject 18 : wine at [0.416, 0.677, 0.514, 0.752].\nObject 19 : wine glass at [0.256, 0.512, 0.370, 0.829].\nObject 20 : wine glass at [0.000, 0.573, 0.034, 0.760].\nObject 21 : woman at [0.210, 0.171, 0.618, 0.997].\nObject 22 : woman at [0.132, 0.165, 0.630, 1.000].\n\nRelationships:\nobject 22 : woman -> has -> object 9 : hair.\nobject 16 : teeth -> of -> object 11 : man.\nobject 10 : man -> standing -> object 0 : akimbo.\nobject 14 : menu -> on -> object 1 : counter.\nobject 22 : woman -> drinking -> object 17 : wine.\nobject 3 : face -> of -> object 22 : woman.\nobject 5 : glasses -> on -> object 3 : face.\nobject 22 : woman -> wearing -> object 7 : green shirt.\nobject 20 : wine glass -> next to -> object 13 : menu.\nobject 11 : man -> holding -> object 4 : glass.\nobject 12 : man -> wearing -> object 2 : eyeglasses.\nobject 22 : woman -> holding -> object 19 : wine glass.\n\nRegion Description:\nRegion Description at [0.356, 0.307, 0.512, 0.389] : The woman is wearing corrective lenses.\n\nGlobal Caption:\nTwo people are smiling holding empty wine glasses.\nMan and woman doing a toast with a glass of wine.\nA man and a woman toast their wine glasses.\nSome friends pose for a picture while holding wine glasses.\nTwo people, a man and a woman, are toasting with wine glasses."}
+{"question_id": 5, "image": "000000360960.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : coat at [0.405, 0.332, 0.835, 0.746].\nObject 1 : decorative square at [0.000, 0.382, 1.000, 1.000].\nObject 2 : hat at [0.006, 0.162, 0.072, 0.198].\nObject 3 : jacket at [0.078, 0.222, 0.318, 0.430].\nObject 4 : jeans at [0.853, 0.422, 1.000, 0.632].\nObject 5 : leg at [0.853, 0.456, 0.928, 0.610].\nObject 6 : leg at [0.210, 0.458, 0.303, 0.638].\nObject 7 : leg at [0.000, 0.458, 0.060, 0.630].\nObject 8 : man at [0.066, 0.162, 0.318, 0.686].\nObject 9 : man at [0.850, 0.156, 1.000, 0.652].\nObject 10 : man at [0.390, 0.344, 0.838, 0.894].\nObject 11 : pants at [0.523, 0.736, 0.739, 0.858].\nObject 12 : person at [0.000, 0.162, 0.135, 0.668].\nObject 13 : person at [0.853, 0.154, 1.000, 0.650].\nObject 14 : section at [0.000, 0.134, 1.000, 1.000].\nObject 15 : sidewalk at [0.000, 0.388, 1.000, 1.000].\nObject 16 : umbrella at [0.168, 0.106, 0.910, 0.366].\nObject 17 : uniform at [0.000, 0.222, 0.126, 0.646].\nObject 18 : uniform at [0.105, 0.218, 0.318, 0.628].\n\nRelationships:\nobject 10 : man -> wearing -> object 11 : pants.\nobject 10 : man -> wearing -> object 0 : coat.\nobject 9 : man -> wearing -> object 4 : jeans.\nobject 8 : man -> wearing -> object 2 : hat.\nobject 8 : man -> wearing -> object 3 : jacket.\nobject 16 : umbrella -> has -> object 14 : section.\nobject 5 : leg -> of -> object 13 : person.\nobject 7 : leg -> of -> object 12 : person.\nobject 12 : person -> in -> object 17 : uniform.\n\nRegion Description:\nRegion Description at [0.066, 0.164, 0.318, 0.686] : the back of a man in a black uniform.\nRegion Description at [0.393, 0.324, 0.871, 0.766] : THIS MAN IS WEARING A LONG BLACK COAT.\nRegion Description at [0.468, 0.142, 0.634, 0.356] : THIS IS A RED SECTION ON THE UMBRELLA.\nRegion Description at [0.168, 0.140, 0.523, 0.292] : THIS IS A YELLOW SECTION ON THE UMBRELLA.\nRegion Description at [0.568, 0.138, 0.919, 0.232] : THIS IS A GREEN SECTION OF THE UMBRELLA.\n\nGlobal Caption:\nSeveral people walking on a sidewalk, with one man holding an umbrella.\nA person walking while carrying a rainbow umbrella\nA person is holding up a large colorful umbrella\na person walking down the street carrying a rainbow colored umbrella\nA person walking in a square carrying a rainbow colored umbrella."}
+{"question_id": 7, "image": "000000376322.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : beer at [0.404, 0.568, 0.505, 0.724].\nObject 1 : cell phone at [0.128, 0.726, 0.332, 0.784].\nObject 2 : decanter at [0.417, 0.416, 0.503, 0.574].\nObject 3 : fork at [0.142, 0.852, 0.553, 0.964].\nObject 4 : fork at [0.174, 0.616, 0.414, 0.660].\nObject 5 : fork at [0.107, 0.882, 0.545, 0.998].\nObject 6 : glass at [0.401, 0.568, 0.508, 0.726].\nObject 7 : glass at [0.773, 0.622, 0.880, 0.796].\nObject 8 : glasses at [0.013, 0.342, 0.139, 0.376].\nObject 9 : green shirt at [0.698, 0.376, 0.909, 0.620].\nObject 10 : hair at [0.607, 0.336, 0.743, 0.422].\nObject 11 : hair at [0.824, 0.244, 1.000, 0.474].\nObject 12 : man at [0.668, 0.252, 0.909, 0.622].\nObject 13 : man at [0.000, 0.304, 0.136, 0.808].\nObject 14 : plate at [0.102, 0.780, 0.404, 0.898].\nObject 15 : silver spoon at [0.698, 0.882, 0.799, 0.998].\nObject 16 : table at [0.000, 0.428, 0.997, 0.998].\nObject 17 : wall at [0.535, 0.194, 0.997, 0.370].\nObject 18 : watch at [0.570, 0.482, 0.596, 0.508].\nObject 19 : watch at [0.888, 0.486, 0.949, 0.514].\nObject 20 : white plate at [0.361, 0.712, 0.805, 0.860].\nObject 21 : woman at [0.813, 0.242, 1.000, 0.582].\nObject 22 : woman at [0.532, 0.338, 0.765, 0.550].\n\nRelationships:\nobject 21 : woman -> with -> object 11 : hair.\nobject 9 : green shirt -> on -> object 12 : man.\nobject 14 : plate -> on -> object 16 : table.\nobject 1 : cell phone -> on -> object 16 : table.\nobject 5 : fork -> on -> object 16 : table.\nobject 5 : fork -> on -> object 16 : table.\nobject 3 : fork -> on -> object 16 : table.\nobject 4 : fork -> on -> object 16 : table.\nobject 2 : decanter -> on -> object 16 : table.\nobject 12 : man -> wearing a -> object 9 : green shirt.\nobject 21 : woman -> wearing a -> object 19 : watch.\nobject 22 : woman -> wearing a -> object 18 : watch.\nobject 13 : man -> wearing -> object 8 : glasses.\nobject 10 : hair -> on -> object 22 : woman.\nobject 22 : woman -> at -> object 16 : table.\n\nRegion Description:\nRegion Description at [0.353, 0.700, 0.802, 0.860] : a round plate with six pieces of bread and two butter pats.\nRegion Description at [0.096, 0.778, 0.404, 0.892] : a plate with one slice of bread and one butter pat.\nRegion Description at [0.890, 0.698, 0.997, 0.992] : glass of red wine closest to the camera.\nRegion Description at [0.366, 0.710, 0.805, 0.856] : the round white plate under the bread and butter.\n\nGlobal Caption:\nA group of people are reading a menu at the table\nA group of people sit at a large table while talking.\nPeople sitting on the long table with plates of food. \nA long table full of people on both sides.\nA long table accommodating many people while eating"}
+{"question_id": 8, "image": "000000271402.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : blonde hair at [0.193, 0.100, 0.375, 0.176].\nObject 1 : building at [0.804, 0.200, 0.906, 0.318].\nObject 2 : dress at [0.378, 0.284, 0.804, 0.652].\nObject 3 : fence at [0.607, 0.282, 0.997, 0.378].\nObject 4 : girl at [0.329, 0.148, 0.973, 0.892].\nObject 5 : girl at [0.057, 0.102, 0.456, 0.898].\nObject 6 : ground at [0.000, 0.374, 1.000, 0.916].\nObject 7 : hair at [0.320, 0.148, 0.517, 0.286].\nObject 8 : handle at [0.329, 0.432, 0.508, 0.480].\nObject 9 : handle at [0.091, 0.450, 0.299, 0.502].\nObject 10 : head at [0.335, 0.152, 0.508, 0.314].\nObject 11 : insignia at [0.447, 0.350, 0.502, 0.390].\nObject 12 : orange platform at [0.181, 0.816, 0.489, 0.998].\nObject 13 : orange wheel at [0.193, 0.820, 0.248, 0.876].\nObject 14 : pavement at [0.009, 0.370, 0.994, 0.996].\nObject 15 : racket at [0.462, 0.480, 0.713, 0.840].\nObject 16 : right shoe at [0.465, 0.778, 0.610, 0.886].\nObject 17 : scooter at [0.097, 0.424, 0.592, 0.996].\nObject 18 : shoe at [0.060, 0.794, 0.202, 0.902].\nObject 19 : shoe at [0.302, 0.780, 0.453, 0.874].\nObject 20 : skirt at [0.471, 0.514, 0.804, 0.654].\nObject 21 : sneaker at [0.849, 0.738, 0.970, 0.886].\nObject 22 : sock at [0.317, 0.776, 0.347, 0.798].\nObject 23 : sock at [0.130, 0.790, 0.184, 0.810].\n\nRelationships:\nobject 4 : girl -> on -> object 14 : pavement.\nobject 5 : girl -> wearing -> object 22 : sock.\nobject 5 : girl -> wearing -> object 23 : sock.\nobject 4 : girl -> wearing -> object 20 : skirt.\nobject 4 : girl -> holding -> object 15 : racket.\nobject 5 : girl -> with -> object 0 : blonde hair.\nobject 17 : scooter -> with -> object 8 : handle.\nobject 1 : building -> with -> object 3 : fence.\nobject 4 : girl -> with -> object 11 : insignia.\nobject 13 : orange wheel -> of -> object 17 : scooter.\n\nRegion Description:\nRegion Description at [0.858, 0.760, 0.970, 0.852] : Girl is wearing blue, white, pink, and gray shoes..\nRegion Description at [0.293, 0.136, 0.976, 0.884] : a little girl holding a tennis racket..\nRegion Description at [0.060, 0.086, 0.462, 0.908] : A little girl standing near a scooter..\nRegion Description at [0.308, 0.146, 0.985, 0.892] : young girl wearing velcro strapped tennis shoes.\nRegion Description at [0.082, 0.436, 0.601, 0.996] : orange scooter board with black handles.\nRegion Description at [0.755, 0.184, 0.973, 0.372] : a tall building with fence in foreground.\nRegion Description at [0.021, 0.096, 0.988, 0.928] : two young girls wearing white outfits.\nRegion Description at [0.311, 0.136, 0.991, 0.886] : young girl with insignia on white outfit.\nRegion Description at [0.175, 0.814, 0.266, 0.888] : orange colored back wheel of a scooter board.\nRegion Description at [0.453, 0.478, 0.725, 0.848] : lavender, yellow and pink colored tennis racket.\n\nGlobal Caption:\ntwo little girls in tennis uniforms standing next to a scooter\nTwo young girls with a tennis racket and a scooter.\nTwo little girls posing for a picture, on a tennis court.\nTwo young girls on a tennis court with a racquet and a scooter\nTwo cute girls with a scooter and tennis raquet."}
+{"question_id": 9, "image": "000000356424.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bottle at [0.048, 0.712, 0.195, 1.002].\nObject 1 : chair at [0.696, 0.500, 1.003, 0.718].\nObject 2 : cork at [0.053, 0.712, 0.139, 0.776].\nObject 3 : cup at [0.043, 0.736, 0.240, 0.916].\nObject 4 : dish at [0.416, 0.726, 0.856, 0.904].\nObject 5 : fruit at [0.629, 0.834, 0.675, 0.880].\nObject 6 : glass at [0.275, 0.716, 0.501, 0.998].\nObject 7 : glasses at [0.179, 0.242, 0.464, 0.322].\nObject 8 : hair at [0.536, 0.258, 0.656, 0.320].\nObject 9 : man at [0.075, 0.102, 0.704, 0.716].\nObject 10 : rasberries at [0.499, 0.750, 0.544, 0.786].\nObject 11 : raspberries at [0.664, 0.828, 0.741, 0.864].\nObject 12 : sauce at [0.565, 0.752, 0.715, 0.824].\nObject 13 : shirt at [0.600, 0.350, 0.645, 0.494].\nObject 14 : shirt at [0.635, 0.282, 0.997, 0.654].\nObject 15 : sign at [0.419, 0.134, 0.509, 0.184].\nObject 16 : sweater at [0.072, 0.288, 0.704, 0.718].\nObject 17 : table at [0.000, 0.592, 0.997, 1.000].\nObject 18 : window at [0.328, 0.000, 0.600, 0.298].\nObject 19 : woman at [0.531, 0.258, 0.768, 0.688].\n\nRelationships:\nobject 9 : man -> wearing -> object 7 : glasses.\nobject 0 : bottle -> on -> object 17 : table.\nobject 6 : glass -> on -> object 17 : table.\nobject 11 : raspberries -> on -> object 4 : dish.\nobject 9 : man -> wearing -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.640, 0.180, 0.989, 0.530] : Man wearing a black and orange stripe shirt.\nRegion Description at [0.413, 0.136, 0.512, 0.184] : Yellow closed sign with brown letters.\nRegion Description at [0.629, 0.186, 0.995, 0.706] : a man wearing and orange and black striped shirt.\nRegion Description at [0.528, 0.254, 0.717, 0.666] : a woman with a ponytail eating lunch.\nRegion Description at [0.152, 0.238, 0.459, 0.322] : a pair of black wire rimmed eye glasses.\nRegion Description at [0.029, 0.716, 0.243, 0.922] : empty cup that used to contain coffee.\nRegion Description at [0.264, 0.708, 0.867, 0.994] : A plate of food with a glass of water.\n\nGlobal Caption:\nA man sitting in front of a plate of food.\nA man at a wooden table looking at a plate of food.\na man smiling while looking at his plate of food\nA man sitting at a table with a plate filled with food.\nA man looking happily at some dish in front of him."}
+{"question_id": 10, "image": "000000131138.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : computer mouse at [0.414, 0.753, 0.470, 0.811].\nObject 1 : cup at [0.350, 0.783, 0.417, 0.906].\nObject 2 : desk at [0.000, 0.488, 0.998, 0.999].\nObject 3 : fork at [0.203, 0.794, 0.270, 0.857].\nObject 4 : glass at [0.277, 0.703, 0.345, 0.816].\nObject 5 : head phones at [0.872, 0.556, 0.993, 0.634].\nObject 6 : keyboard at [0.415, 0.620, 0.650, 0.783].\nObject 7 : lamp at [0.000, 0.302, 0.214, 0.430].\nObject 8 : laptop at [0.491, 0.296, 0.703, 0.540].\nObject 9 : picture at [0.795, 0.204, 0.898, 0.358].\nObject 10 : plant at [0.192, 0.201, 0.391, 0.461].\nObject 11 : plate at [0.183, 0.799, 0.326, 0.896].\nObject 12 : screen at [0.237, 0.249, 0.504, 0.628].\nObject 13 : stand at [0.506, 0.531, 0.663, 0.617].\nObject 14 : window at [0.606, 0.000, 1.000, 0.346].\n\nRelationships:\nobject 0 : computer mouse -> on -> object 2 : desk.\nobject 8 : laptop -> on -> object 13 : stand.\nobject 6 : keyboard -> on -> object 2 : desk.\nobject 9 : picture -> near -> object 14 : window.\nobject 3 : fork -> on -> object 11 : plate.\n\nRegion Description:\n\nGlobal Caption:\na desk with a cup plate laptop monitor and keyboard\nA laptop sitting next to a monitor, keyboard and a mouse.\nA laptop and a desktop monitor are displayed on top of the desk.\nLarge office desk with computers near a window.\nA desk with a laptop, second monitor and keyboard."}
+{"question_id": 11, "image": "000000332318.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : background at [0.000, 0.000, 1.002, 0.997].\nObject 1 : bench at [0.604, 0.967, 0.672, 0.997].\nObject 2 : cow at [0.548, 0.860, 0.574, 0.896].\nObject 3 : cow at [0.436, 0.860, 0.454, 0.890].\nObject 4 : fence at [0.698, 0.949, 0.852, 0.997].\nObject 5 : moutain at [0.000, 0.057, 0.992, 0.782].\nObject 6 : pasture at [0.000, 0.815, 0.984, 1.000].\nObject 7 : peak at [0.744, 0.042, 0.898, 0.119].\nObject 8 : sky at [0.000, 0.000, 1.002, 0.257].\nObject 9 : snow at [0.210, 0.036, 0.962, 0.445].\nObject 10 : trailer at [0.796, 0.910, 0.894, 0.997].\nObject 11 : trailer at [0.632, 0.899, 0.742, 0.994].\nObject 12 : tree at [0.740, 0.409, 1.000, 0.982].\nObject 13 : tree at [0.638, 0.284, 0.652, 0.301].\n\nRelationships:\nobject 11 : trailer -> in -> object 6 : pasture.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 6 : pasture -> near -> object 5 : moutain.\nobject 3 : cow -> in -> object 6 : pasture.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 9 : snow -> on -> object 5 : moutain.\nobject 5 : moutain -> covered in -> object 9 : snow.\nobject 5 : moutain -> has -> object 7 : peak.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 5 : moutain -> in -> object 0 : background.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 11 : trailer -> near -> object 12 : tree.\nobject 5 : moutain -> has -> object 13 : tree.\nobject 7 : peak -> covered with -> object 9 : snow.\n\nRegion Description:\nRegion Description at [0.784, 0.901, 0.934, 0.991] : storage container for animal equipment.\nRegion Description at [0.828, 0.060, 0.880, 0.125] : The mountain is partially covered in snow..\nRegion Description at [0.840, 0.899, 0.920, 0.997] : horse trailer or cow trailer is silvertone, rectangular.\nRegion Description at [0.606, 0.919, 0.640, 0.982] : smaller trailer, white w/ brown+orange stripe.\nRegion Description at [0.060, 0.472, 0.540, 0.806] : a bare patch of earth amid lush green growth.\nRegion Description at [0.034, 0.839, 0.812, 0.973] : tiny cattle-containing fenceposts in the distance.\nRegion Description at [0.902, 0.827, 0.990, 0.997] : a split tree trunk in shadow, beneath leaves, shadow on ground.\nRegion Description at [0.734, 0.919, 0.802, 0.994] : an older station wagon/suv-type van thing.\nRegion Description at [0.090, 0.854, 0.124, 0.904] : a black & white animal stands alone, away from brown brethren, in the far distance.\n\nGlobal Caption:\nCows lounge in a field with a mountain backdrop.\nA VERY BIG MOUNTAIN AND ANIMALS SPREAD ACROSS A FARM.\nSeveral herd animals are on the grass by a mountain.\nCattle on a level pasture in a mountainous area.\nA bunch of cattle relax in a pasture located in the mountains"}
+{"question_id": 12, "image": "000000513567.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bag at [0.428, 0.435, 0.476, 0.528].\nObject 1 : bag at [0.322, 0.923, 0.498, 0.997].\nObject 2 : building at [0.000, 0.003, 0.158, 0.413].\nObject 3 : face at [0.246, 0.240, 0.374, 0.483].\nObject 4 : flag at [0.044, 0.013, 0.090, 0.149].\nObject 5 : girl at [0.538, 0.019, 0.968, 0.949].\nObject 6 : hand at [0.176, 0.680, 0.304, 0.821].\nObject 7 : hands at [0.660, 0.344, 0.756, 0.517].\nObject 8 : head at [0.560, 0.003, 0.822, 0.339].\nObject 9 : hot dog at [0.676, 0.315, 0.882, 0.408].\nObject 10 : hot dogs at [0.190, 0.587, 0.350, 0.741].\nObject 11 : jeans at [0.586, 0.843, 0.916, 0.995].\nObject 12 : lady at [0.572, 0.045, 0.952, 0.984].\nObject 13 : logo at [0.920, 0.069, 0.996, 0.165].\nObject 14 : man at [0.486, 0.235, 0.564, 0.509].\nObject 15 : man at [0.456, 0.213, 0.520, 0.317].\nObject 16 : maroon shirt at [0.546, 0.333, 0.928, 0.944].\nObject 17 : mouth at [0.288, 0.408, 0.356, 0.440].\nObject 18 : people at [0.552, 0.029, 0.876, 0.995].\nObject 19 : post at [0.104, 0.005, 0.138, 0.533].\nObject 20 : purse at [0.842, 0.661, 0.980, 0.888].\nObject 21 : purse strap at [0.270, 0.893, 0.390, 0.992].\nObject 22 : shadow at [0.934, 0.067, 0.996, 0.141].\nObject 23 : side at [0.922, 0.875, 0.998, 0.997].\nObject 24 : street at [0.042, 0.403, 0.092, 0.520].\nObject 25 : sunglasses at [0.630, 0.005, 0.794, 0.048].\nObject 26 : woman at [0.502, 0.000, 0.982, 0.997].\nObject 27 : woman at [0.102, 0.099, 0.486, 0.984].\nObject 28 : woman's shirt at [0.518, 0.320, 0.944, 0.949].\n\nRelationships:\nobject 0 : bag -> on -> object 15 : man.\nobject 13 : logo -> on -> object 2 : building.\nobject 25 : sunglasses -> on -> object 26 : woman.\nobject 25 : sunglasses -> on -> object 8 : head.\nobject 4 : flag -> on -> object 19 : post.\nobject 6 : hand -> holds -> object 10 : hot dogs.\nobject 27 : woman -> has -> object 17 : mouth.\nobject 12 : lady -> holding -> object 9 : hot dog.\nobject 9 : hot dog -> in -> object 7 : hands.\nobject 18 : people -> crossing -> object 24 : street.\nobject 27 : woman -> wearing -> object 11 : jeans.\nobject 5 : girl -> wears -> object 16 : maroon shirt.\n\nRegion Description:\nRegion Description at [0.038, 0.173, 0.540, 0.995] : Laughing girl in a green shirt holding a hotdog..\nRegion Description at [0.504, 0.000, 0.954, 0.989] : Black haired girl in maroon shirt wearing sunglasses on her head..\nRegion Description at [0.508, 0.000, 0.960, 0.979] : Girl looking at the hot dog she's holding in her hands.\nRegion Description at [0.040, 0.173, 0.536, 0.981] : Girl holding hot dog in her right hand.\nRegion Description at [0.926, 0.253, 0.998, 0.645] : Woman in a brown shirt and jeans crossing the street.\nRegion Description at [0.202, 0.563, 0.334, 0.995] : Blue purse strap around woman's shoulder.\nRegion Description at [0.146, 0.587, 0.370, 0.787] : woman holding hot dog in white napkin.\nRegion Description at [0.682, 0.229, 0.742, 0.315] : woman's mouth open looking at hot dog.\nRegion Description at [0.234, 0.213, 0.396, 0.507] : woman's face smiling with eyes closed.\n\nGlobal Caption:\nTwo Asian women eating chili dogs while standing on a street.\nTwo women preparing to eat a hot dog on a city side.\nThe woman are eating their hot dogs while walking.\nTwo young women are eating hot dogs while walking down the sidewalk.\nTwo women eat chili dogs on a city sidewalk. "}
+{"question_id": 13, "image": "000000134722.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : awning at [0.886, 0.000, 1.000, 0.240].\nObject 1 : awning at [0.000, 0.299, 0.132, 0.403].\nObject 2 : bench at [0.000, 0.592, 0.066, 0.683].\nObject 3 : building at [0.000, 0.299, 0.204, 0.659].\nObject 4 : canopy at [0.000, 0.301, 0.136, 0.400].\nObject 5 : car at [0.290, 0.400, 0.998, 0.784].\nObject 6 : clouds at [0.374, 0.067, 0.920, 0.312].\nObject 7 : door opening at [0.658, 0.501, 0.682, 0.680].\nObject 8 : door opening at [0.678, 0.509, 0.710, 0.675].\nObject 9 : exterior at [0.000, 0.400, 0.200, 0.669].\nObject 10 : front at [0.294, 0.400, 0.494, 0.739].\nObject 11 : gravel at [0.090, 0.837, 0.334, 0.997].\nObject 12 : headlights at [0.416, 0.624, 0.446, 0.656].\nObject 13 : headlights at [0.300, 0.624, 0.324, 0.651].\nObject 14 : markings at [0.606, 0.821, 0.770, 0.928].\nObject 15 : panel at [0.304, 0.421, 0.450, 0.677].\nObject 16 : pole at [0.030, 0.419, 0.062, 0.656].\nObject 17 : railway tracks at [0.000, 0.752, 0.520, 0.944].\nObject 18 : side walk at [0.192, 0.712, 1.000, 0.997].\nObject 19 : sky at [0.000, 0.000, 0.998, 0.560].\nObject 20 : train stop at [0.000, 0.000, 1.000, 1.000].\nObject 21 : trees at [0.208, 0.253, 0.322, 0.653].\nObject 22 : trim at [0.000, 0.333, 0.132, 0.403].\nObject 23 : wall at [0.000, 0.392, 0.206, 0.611].\nObject 24 : wheel at [0.844, 0.669, 0.884, 0.728].\nObject 25 : wheel at [0.792, 0.675, 0.840, 0.747].\nObject 26 : wheel at [0.516, 0.691, 0.620, 0.808].\nObject 27 : window at [0.316, 0.451, 0.458, 0.595].\nObject 28 : windows at [0.700, 0.547, 0.848, 0.632].\nObject 29 : windsheild wipers at [0.348, 0.499, 0.410, 0.584].\n\nRelationships:\nobject 6 : clouds -> in -> object 19 : sky.\nobject 2 : bench -> in -> object 4 : canopy.\nobject 22 : trim -> on -> object 1 : awning.\nobject 11 : gravel -> next to -> object 17 : railway tracks.\nobject 14 : markings -> on side of -> object 18 : side walk.\nobject 5 : car -> on -> object 17 : railway tracks.\n\nRegion Description:\nRegion Description at [0.288, 0.392, 0.510, 0.741] : the front of the train is yellow and white.\nRegion Description at [0.320, 0.451, 0.460, 0.592] : the front window of the train has windshield wipers.\nRegion Description at [0.292, 0.592, 0.456, 0.739] : the headlights are on front of the train.\nRegion Description at [0.010, 0.405, 0.220, 0.736] : a red brick wall is near the platform.\nRegion Description at [0.000, 0.288, 0.128, 0.707] : an aluminum canopy is on the platform.\nRegion Description at [0.016, 0.325, 0.100, 0.672] : a red steel pole is holding up the awning.\nRegion Description at [0.306, 0.395, 0.998, 0.733] : the train has windowed passenger cars.\nRegion Description at [0.300, 0.427, 0.492, 0.693] : the yellow and white front of a train.\nRegion Description at [0.510, 0.744, 0.834, 0.891] : white painted line beside a train track.\nRegion Description at [0.298, 0.408, 0.468, 0.661] : a yellow panel on the front of the train.\nRegion Description at [0.002, 0.397, 0.210, 0.675] : a red brick building on the side of the tracks.\nRegion Description at [0.844, 0.000, 0.998, 0.248] : an awning of a structure next to the train tracks.\nRegion Description at [0.294, 0.360, 0.516, 0.787] : front of a train car in yellow, white and blue.\nRegion Description at [0.194, 0.221, 0.286, 0.901] : trees on the side of a train station.\nRegion Description at [0.580, 0.821, 0.764, 0.931] : markings on the side of railway tracks.\nRegion Description at [0.632, 
0.491, 0.726, 0.691] : white, blue and grey doors on the side of a train car.\nRegion Description at [0.500, 0.096, 0.916, 0.531] : skyline on the side of a train station.\n\nGlobal Caption:\nFast commuter train moving past an outdoor platform.\nA train on the track pulling by a train station.\nA train pulling into a station outside during the day.\nA passenger train moving through a rail yard\na long passenger train pulling up to a station"}
+{"question_id": 14, "image": "000000341058.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : napkins at [0.541, 0.818, 0.601, 0.858].\nObject 1 : pepper at [0.598, 0.836, 0.623, 0.860].\nObject 2 : post at [0.673, 0.494, 0.712, 0.926].\nObject 3 : restaurant sign at [0.548, 0.180, 0.779, 0.344].\nObject 4 : salt at [0.619, 0.838, 0.633, 0.850].\nObject 5 : shaker at [0.594, 0.822, 0.619, 0.854].\nObject 6 : shaker at [0.612, 0.824, 0.637, 0.854].\nObject 7 : table at [0.448, 0.834, 0.925, 0.998].\n\nRelationships:\nobject 4 : salt -> in -> object 6 : shaker.\nobject 0 : napkins -> on -> object 7 : table.\nobject 3 : restaurant sign -> on -> object 2 : post.\n\nRegion Description:\n\nGlobal Caption:\nThis is an empty table at a restaurant with ships in the background.\nThis table is covered by a blue Sam Adams umbrella\nAdvertising sign above a patio umbrella on sunny day.\nA lamp post stands next to an umbrella and table.\nAn umbrella is opened over an outdoor table."}
+{"question_id": 15, "image": "000000277051.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bird at [0.400, 0.408, 0.688, 0.775].\nObject 1 : bird at [0.110, 0.468, 0.576, 0.820].\nObject 2 : bottle at [0.080, 0.003, 0.296, 0.721].\nObject 3 : chair at [0.678, 0.177, 0.882, 0.408].\nObject 4 : crumbs at [0.098, 0.835, 0.434, 1.000].\nObject 5 : feet at [0.514, 0.724, 0.562, 0.769].\nObject 6 : food at [0.000, 0.877, 0.180, 1.000].\nObject 7 : foot at [0.474, 0.706, 0.514, 0.733].\nObject 8 : ground at [0.518, 0.183, 0.620, 0.402].\nObject 9 : handle at [0.488, 0.796, 0.800, 0.940].\nObject 10 : knife at [0.000, 0.793, 0.800, 1.000].\nObject 11 : label at [0.080, 0.000, 0.260, 0.598].\nObject 12 : leg at [0.552, 0.652, 0.578, 0.742].\nObject 13 : leg at [0.508, 0.646, 0.540, 0.685].\nObject 14 : liquid at [0.092, 0.114, 0.294, 0.721].\nObject 15 : paper at [0.000, 0.658, 0.762, 1.003].\nObject 16 : placemat at [0.000, 0.658, 0.766, 1.000].\nObject 17 : plate at [0.000, 0.748, 0.618, 1.000].\nObject 18 : table at [0.742, 0.261, 1.002, 0.883].\nObject 19 : table at [0.000, 0.658, 1.000, 1.003].\nObject 20 : tablecloth at [0.000, 0.664, 1.002, 1.003].\nObject 21 : tablecloth at [0.596, 0.267, 1.000, 0.883].\n\nRelationships:\nobject 6 : food -> on -> object 17 : plate.\nobject 4 : crumbs -> on -> object 17 : plate.\nobject 3 : chair -> next to -> object 18 : table.\nobject 3 : chair -> beside -> object 18 : table.\n\nRegion Description:\nRegion Description at [0.050, 0.769, 0.804, 0.979] : a steak knife resting on the edge of a plate.\nRegion Description at [0.008, 0.724, 0.628, 0.994] : a white plate with food and crumbs on it.\nRegion Description at [0.040, 0.685, 0.380, 0.925] : a blue and white paper placemat underneath a plate.\nRegion Description at [0.636, 0.147, 0.906, 0.492] : a bird on a table with a chair behind it.\nRegion Description at [0.384, 0.372, 0.698, 0.787] : a bird standing on the edge of a table.\n\nGlobal Caption:\ntwo little sparrows standing on a table by a knife\ntwo gray white and brown birds a knife and a red table\nA couple of small birds standing on top of a table.\nTwo sparrows sit n a table with a red tablecloth at an outdoor cafe. \nTwo birds perched on a table near a plate of food."}
+{"question_id": 16, "image": "000000376900.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : area at [0.000, 0.002, 0.995, 0.996].\nObject 1 : background at [0.000, 0.132, 0.997, 0.268].\nObject 2 : cap at [0.171, 0.388, 0.253, 0.476].\nObject 3 : green/tennis court at [0.005, 0.720, 0.880, 0.994].\nObject 4 : hand at [0.253, 0.648, 0.299, 0.680].\nObject 5 : head at [0.173, 0.408, 0.256, 0.474].\nObject 6 : line at [0.397, 0.778, 0.501, 0.996].\nObject 7 : man at [0.163, 0.274, 0.491, 0.936].\nObject 8 : photo at [0.005, 0.004, 0.968, 0.976].\nObject 9 : pole at [0.019, 0.162, 0.035, 0.258].\nObject 10 : ses at [0.912, 0.962, 0.992, 0.994].\nObject 11 : shadow at [0.397, 0.898, 0.968, 0.956].\nObject 12 : shorts at [0.216, 0.628, 0.432, 0.782].\nObject 13 : sock at [0.325, 0.840, 0.376, 0.890].\nObject 14 : sport at [0.144, 0.270, 0.515, 0.944].\nObject 15 : tennis racket at [0.235, 0.578, 0.304, 0.664].\nObject 16 : tennis shoe at [0.213, 0.880, 0.280, 0.930].\nObject 17 : tennis shoe at [0.299, 0.886, 0.405, 0.936].\nObject 18 : trees at [0.269, 0.192, 0.995, 0.250].\nObject 19 : wrist at [0.384, 0.318, 0.429, 0.360].\nObject 20 : wristband at [0.384, 0.318, 0.432, 0.360].\n\nRelationships:\nobject 7 : man -> wearing -> object 12 : shorts.\nobject 4 : hand -> holding -> object 15 : tennis racket.\nobject 2 : cap -> on mans -> object 5 : head.\nobject 5 : head -> of a -> object 7 : man.\nobject 7 : man -> wearing a -> object 2 : cap.\nobject 7 : man -> wearing a -> object 13 : sock.\nobject 18 : trees -> in -> object 1 : background.\nobject 14 : sport -> in -> object 0 : area.\nobject 20 : wristband -> on a -> object 19 : wrist.\nobject 2 : cap -> on -> object 5 : head.\nobject 11 : shadow -> of -> object 7 : man.\nobject 12 : shorts -> on -> object 7 : man.\n\nRegion Description:\nRegion Description at [0.163, 0.322, 0.579, 0.926] : The tennis player is wearing all white.\nRegion Description at [0.397, 0.858, 0.936, 0.968] : Tennis player's shadow cast in front of him.\nRegion Description at [0.219, 0.560, 0.309, 0.680] : a black tennis racket in a man's hand.\nRegion Description at [0.341, 0.538, 0.480, 0.728] : a line judge at the side of a tennis court.\n\nGlobal Caption:\nA tennis player prepares to serve a tennis ball.\na tennis player in all white playing on a court \nA tennis player is reaching up with one arm and has a racquet in the other hand. \nThe tennis player throws the ball up to serve\nSpectators watching a man swinging at a tennis ball."}
+{"question_id": 17, "image": "000000412240.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : characters at [0.528, 0.251, 0.992, 0.395].\nObject 1 : date at [0.646, 0.869, 0.824, 0.923].\nObject 2 : dog at [0.292, 0.131, 0.820, 0.771].\nObject 3 : eyes at [0.332, 0.219, 0.354, 0.243].\nObject 4 : floor at [0.002, 0.715, 1.000, 0.997].\nObject 5 : head at [0.290, 0.117, 0.500, 0.392].\nObject 6 : heel at [0.218, 0.629, 0.324, 0.741].\nObject 7 : laces at [0.398, 0.464, 0.540, 0.608].\nObject 8 : left eye at [0.420, 0.245, 0.446, 0.283].\nObject 9 : light at [0.320, 0.493, 0.608, 0.720].\nObject 10 : mouth at [0.318, 0.320, 0.392, 0.373].\nObject 11 : nose at [0.348, 0.283, 0.392, 0.328].\nObject 12 : panel at [0.690, 0.544, 1.000, 0.779].\nObject 13 : photo at [0.000, 0.003, 0.996, 0.997].\nObject 14 : shoe at [0.002, 0.437, 0.250, 0.720].\nObject 15 : shoe at [0.212, 0.445, 0.720, 0.787].\nObject 16 : symbol at [0.750, 0.600, 0.828, 0.699].\nObject 17 : tail at [0.734, 0.720, 0.824, 0.768].\nObject 18 : time at [0.852, 0.872, 0.938, 0.923].\nObject 19 : toe at [0.564, 0.643, 0.724, 0.776].\nObject 20 : year at [0.752, 0.877, 0.834, 0.923].\n\nRelationships:\nobject 3 : eyes -> of -> object 2 : dog.\nobject 1 : date -> of -> object 13 : photo.\nobject 6 : heel -> of -> object 15 : shoe.\nobject 2 : dog -> sitting on -> object 4 : floor.\nobject 15 : shoe -> next to -> object 2 : dog.\nobject 15 : shoe -> reflecting -> object 9 : light.\nobject 0 : characters -> playing -> object 0 : characters.\nobject 0 : characters -> playing -> object 0 : characters.\nobject 2 : dog -> has a -> object 8 : left eye.\nobject 5 : head -> of -> object 2 : dog.\nobject 3 : eyes -> of -> object 2 : dog.\nobject 11 : nose -> on a -> object 2 : dog.\nobject 10 : mouth -> on a -> object 2 : dog.\nobject 15 : shoe -> has -> object 7 : laces.\nobject 17 : tail -> of -> object 2 : dog.\nobject 15 : shoe -> has a -> object 6 : heel.\nobject 19 : toe -> of -> object 15 : shoe.\n\nRegion Description:\nRegion Description at [0.838, 0.837, 0.976, 0.968] : the time written in bottom right corner.\n\nGlobal Caption:\nA dog sitting behind a pair of black shoes.\nA dog sits on the floor next to some shoes. \nA puppy is sitting behind a pair of shoes.\na close up of a small dog near a pair of shoes\nA small black dog sits beside a pair of shoes."}
+{"question_id": 18, "image": "000000179765.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : back tire at [0.574, 0.496, 0.860, 0.800].\nObject 1 : bike at [0.146, 0.109, 0.938, 0.803].\nObject 2 : bike indicators at [0.238, 0.363, 0.264, 0.389].\nObject 3 : car at [0.000, 0.077, 0.086, 0.157].\nObject 4 : display at [0.240, 0.275, 0.290, 0.328].\nObject 5 : exhaust pipe at [0.460, 0.661, 0.818, 0.773].\nObject 6 : front tire at [0.146, 0.419, 0.366, 0.637].\nObject 7 : front wheel at [0.150, 0.424, 0.366, 0.635].\nObject 8 : garage door at [0.000, 0.000, 0.214, 0.341].\nObject 9 : handle at [0.284, 0.109, 0.390, 0.384].\nObject 10 : honda logo at [0.322, 0.395, 0.378, 0.419].\nObject 11 : house at [0.420, 0.000, 0.736, 0.149].\nObject 12 : leather seat at [0.496, 0.355, 0.792, 0.517].\nObject 13 : light at [0.894, 0.411, 0.944, 0.520].\nObject 14 : orange light at [0.280, 0.419, 0.296, 0.467].\nObject 15 : shock at [0.258, 0.477, 0.296, 0.568].\nObject 16 : shock absorber at [0.626, 0.501, 0.698, 0.680].\nObject 17 : shrubs at [0.628, 0.021, 0.764, 0.200].\nObject 18 : small windshield at [0.210, 0.120, 0.256, 0.291].\nObject 19 : sylencer at [0.462, 0.645, 0.816, 0.779].\nObject 20 : trees at [0.256, 0.003, 0.444, 0.205].\n\nRelationships:\nobject 1 : bike -> has -> object 7 : front wheel.\nobject 1 : bike -> has -> object 0 : back tire.\nobject 1 : bike -> has -> object 19 : sylencer.\nobject 1 : bike -> has -> object 16 : shock absorber.\nobject 1 : bike -> has -> object 13 : light.\nobject 9 : handle -> on -> object 1 : bike.\nobject 4 : display -> on -> object 1 : bike.\n\nRegion Description:\n\nGlobal Caption:\nA black Honda motorcycle parked in front of a garage.\nA Honda motorcycle parked in a grass driveway\nA black Honda motorcycle with a dark burgundy seat.\nMa motorcycle parked on the gravel in front of a garage\nA motorcycle with its brake extended standing outside"}
+{"question_id": 19, "image": "000000329219.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bearded face at [0.371, 0.064, 0.393, 0.094].\nObject 1 : blender at [0.015, 0.165, 0.080, 0.307].\nObject 2 : box at [0.176, 0.249, 0.228, 0.329].\nObject 3 : buttons at [0.038, 0.268, 0.048, 0.275].\nObject 4 : counter at [0.567, 0.340, 0.738, 0.395].\nObject 5 : counter at [0.000, 0.329, 0.576, 0.398].\nObject 6 : curtain at [0.429, 0.048, 0.504, 0.318].\nObject 7 : curtain at [0.227, 0.000, 0.309, 0.287].\nObject 8 : dog at [0.462, 0.593, 0.568, 0.842].\nObject 9 : door knob at [0.242, 0.477, 0.253, 0.499].\nObject 10 : drawer at [0.112, 0.370, 0.259, 0.452].\nObject 11 : drawer at [0.284, 0.382, 0.394, 0.439].\nObject 12 : faucet at [0.338, 0.327, 0.388, 0.357].\nObject 13 : floor at [0.000, 0.713, 1.000, 1.000].\nObject 14 : kitchen at [0.000, 0.000, 0.750, 0.849].\nObject 15 : knob at [0.179, 0.398, 0.197, 0.422].\nObject 16 : knob at [0.340, 0.400, 0.352, 0.420].\nObject 17 : man at [0.274, 0.000, 0.517, 0.792].\nObject 18 : mugs at [0.509, 0.123, 0.595, 0.266].\nObject 19 : outlet at [0.107, 0.212, 0.143, 0.256].\nObject 20 : shoes at [0.391, 0.735, 0.476, 0.786].\nObject 21 : spatula at [0.126, 0.003, 0.153, 0.094].\nObject 22 : tile at [0.526, 0.592, 0.557, 0.634].\nObject 23 : wall at [0.003, 0.000, 0.220, 0.294].\nObject 24 : wall at [0.506, 0.019, 0.607, 0.384].\nObject 25 : window at [0.303, 0.016, 0.392, 0.328].\nObject 26 : wire at [0.097, 0.233, 0.129, 0.319].\n\nRelationships:\nobject 17 : man -> standing in -> object 14 : kitchen.\nobject 18 : mugs -> hanging on -> object 24 : wall.\nobject 1 : blender -> with -> object 3 : buttons.\nobject 17 : man -> with -> object 0 : bearded face.\nobject 26 : wire -> hanging from -> object 23 : wall.\nobject 8 : dog -> on -> object 13 : floor.\nobject 1 : blender -> on -> object 5 : counter.\nobject 6 : curtain -> on -> object 25 : window.\nobject 20 : shoes -> on -> object 17 : man.\n\nRegion Description:\nRegion Description at [0.056, 0.214, 0.140, 0.277] : A dark electric cord plugged into the wall.\nRegion Description at [0.000, 0.662, 0.116, 0.940] : A latter with onely one rung visible.\nRegion Description at [0.004, 0.698, 0.999, 0.991] : Durable Tan and brown laminent flooring.\nRegion Description at [0.004, 0.324, 0.739, 0.880] : cheap waferboard constructed cabinets .\nRegion Description at [0.514, 0.126, 0.588, 0.262] : convient and accessable way to store coffee mugs.\nRegion Description at [0.222, 0.001, 0.510, 0.286] : small window curtians with paisley design.\nRegion Description at [0.347, 0.053, 0.490, 0.312] : light weight flanel design mens shirt .\nRegion Description at [0.222, 0.004, 0.315, 0.303] : gold and white curtain on a kitchen window.\nRegion Description at [0.511, 0.126, 0.589, 0.261] : coffee cups hanging on the kitchen wall.\nRegion Description at [0.012, 0.149, 0.091, 0.340] : gold colored blinder sits on the counter.\nRegion Description at [-0.001, 0.000, 0.157, 0.122] : cooking utensils hanging against wall.\n\nGlobal Caption:\nA man standing next to a dog on the ground.\nA man is at a kitchen counter by a dog.\nAn man standing in a kitchen with a small puppy.\nthere is a small puppy on the kitchen floor\nA man in the kitchen standing with his dog."}
+{"question_id": 20, "image": "000000184384.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : blueberry at [0.306, 0.312, 0.400, 0.429].\nObject 1 : butter at [0.454, 0.024, 0.638, 0.288].\nObject 2 : cake at [0.238, 0.093, 0.786, 0.787].\nObject 3 : cup at [0.002, 0.000, 0.202, 0.667].\nObject 4 : cup at [0.140, 0.008, 0.336, 0.456].\nObject 5 : egg at [0.636, 0.125, 0.880, 0.267].\nObject 6 : food at [0.632, 0.123, 0.996, 0.336].\nObject 7 : lemon at [0.514, 0.728, 0.798, 0.997].\nObject 8 : melon at [0.308, 0.768, 0.658, 0.997].\nObject 9 : orange at [0.514, 0.733, 0.794, 0.997].\nObject 10 : parsley at [0.372, 0.515, 0.762, 0.965].\nObject 11 : plate at [0.166, 0.453, 1.000, 1.000].\nObject 12 : plate at [0.628, 0.120, 0.998, 0.389].\nObject 13 : sausage at [0.766, 0.248, 0.984, 0.333].\nObject 14 : spot at [0.766, 0.600, 0.790, 0.637].\nObject 15 : table at [0.002, 0.365, 0.998, 0.997].\nObject 16 : water at [0.000, 0.000, 0.202, 0.667].\n\nRelationships:\nobject 7 : lemon -> on -> object 11 : plate.\nobject 10 : parsley -> on -> object 11 : plate.\nobject 6 : food -> on -> object 12 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 11 : plate -> has -> object 14 : spot.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 13 : sausage -> on -> object 12 : plate.\nobject 0 : blueberry -> on -> object 2 : cake.\nobject 5 : egg -> on -> object 12 : plate.\nobject 8 : melon -> on -> object 11 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 2 : cake -> on -> object 11 : plate.\nobject 16 : water -> in -> object 3 : cup.\nobject 13 : sausage -> on -> object 12 : plate.\n\nRegion Description:\nRegion Description at [0.678, 0.104, 0.942, 0.424] : There is food on the plate in the back.\nRegion Description at [0.456, 0.013, 0.636, 0.307] : White frosting on top of a piece of cake.\nRegion Description at [0.322, 0.752, 0.650, 0.997] : square of honey dew on a white plate.\n\nGlobal Caption:\nA bluebery cake is on a plate and is topped with butter.\nA piece of cake with butter on it sits next to an orange slice. \nA large piece of blueberry cake on a plate.\nA plate of food attractively arranged on a table.\nA plate of blueberry coffee cake with butter and an orange slice on a table with breakfast foods."}
+{"question_id": 21, "image": "000000018519.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : concrete at [0.000, 0.576, 1.002, 0.998].\nObject 1 : elbow at [0.403, 0.538, 0.433, 0.552].\nObject 2 : fence at [0.000, 0.314, 0.998, 0.600].\nObject 3 : graffiti at [0.470, 0.856, 0.794, 0.998].\nObject 4 : grass at [0.000, 0.154, 1.002, 0.448].\nObject 5 : helmet at [0.358, 0.354, 0.448, 0.422].\nObject 6 : knee at [0.525, 0.608, 0.545, 0.622].\nObject 7 : knee pad at [0.450, 0.542, 0.512, 0.598].\nObject 8 : pad at [0.540, 0.362, 0.595, 0.420].\nObject 9 : pad at [0.512, 0.578, 0.592, 0.624].\nObject 10 : pad at [0.376, 0.512, 0.443, 0.554].\nObject 11 : park at [0.007, 0.006, 1.000, 0.578].\nObject 12 : pipe at [0.657, 0.300, 0.687, 0.578].\nObject 13 : pipe at [0.177, 0.324, 0.211, 0.590].\nObject 14 : rail at [0.000, 0.310, 1.000, 0.334].\nObject 15 : ramp at [0.000, 0.592, 1.002, 0.998].\nObject 16 : rock at [0.100, 0.302, 0.154, 0.326].\nObject 17 : shadow at [0.415, 0.642, 0.754, 0.912].\nObject 18 : shirt at [0.438, 0.376, 0.637, 0.514].\nObject 19 : shorts at [0.460, 0.500, 0.664, 0.580].\nObject 20 : skate at [0.647, 0.490, 0.709, 0.584].\nObject 21 : skater at [0.234, 0.352, 0.719, 0.624].\nObject 22 : sticker at [0.408, 0.358, 0.438, 0.368].\nObject 23 : tree at [0.122, 0.008, 0.677, 0.322].\nObject 24 : wheels at [0.689, 0.496, 0.721, 0.526].\nObject 25 : wrist brace at [0.279, 0.524, 0.338, 0.564].\n\nRelationships:\nobject 21 : skater -> has a -> object 17 : shadow.\nobject 20 : skate -> has -> object 24 : wheels.\nobject 23 : tree -> standing in a -> object 11 : park.\nobject 21 : skater -> wearing a -> object 5 : helmet.\nobject 10 : pad -> protecting an -> object 1 : elbow.\nobject 9 : pad -> protecting a -> object 6 : knee.\nobject 17 : shadow -> of a -> object 21 : skater.\nobject 15 : ramp -> has a -> object 3 : graffiti.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 16 : rock -> in -> object 4 : grass.\nobject 5 : helmet -> has a -> object 22 : sticker.\nobject 21 : skater -> wearing -> object 20 : skate.\nobject 21 : skater -> wearing a -> object 10 : pad.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 20 : skate.\nobject 17 : shadow -> on -> object 15 : ramp.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 21 : skater -> has a -> object 8 : pad.\nobject 21 : skater -> has a -> object 18 : shirt.\nobject 21 : skater -> has -> object 19 : shorts.\nobject 23 : tree -> behind -> object 21 : skater.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 9 : pad.\nobject 7 : knee pad -> for a -> object 21 : skater.\nobject 17 : shadow -> on -> object 0 : concrete.\nobject 3 : graffiti -> on -> object 0 : concrete.\n\nRegion Description:\nRegion Description at [0.391, 0.630, 0.776, 0.962] : Skater's shadow while performing a trick.\nRegion Description at [0.346, 0.342, 0.475, 0.440] : Man is wearing a black safety helmet.\nRegion Description at [0.184, 0.320, 0.741, 0.700] : a man roller skating at a skate park.\nRegion Description at [0.448, 0.636, 0.779, 0.940] : the shadow of the man cast on the cement ramp.\nRegion Description at [0.465, 0.856, 0.803, 0.996] : light blue painted graffiti on the cement ramp.\nRegion Description at [0.279, 0.524, 0.341, 0.570] : a black wrist guard on the man's wrist.\nRegion Description at [0.353, 0.352, 0.460, 0.422] : black helmet with several stickers on it.\nRegion Description at [0.644, 0.488, 0.719, 0.574] : the 
black rollerskate the man is wearing.\nRegion Description at [0.142, 0.314, 0.234, 0.604] : a grey post to the metal fence that is at the top of the ramp.\nRegion Description at [0.363, 0.500, 0.453, 0.566] : a black elbow pad the man is wearing.\nRegion Description at [0.405, 0.642, 0.746, 0.916] : shadow of a roller skater on concrete.\n\nGlobal Caption:\nA young man riding a skateboard down the side of a ramp.\nA man doing a trick on roller-skates in a skate park.\nA skateboarder performing a jump off the side of a ramp.\na man wearing roller skates doing a jump on the side of a wall \nThe man in the helmet is jumping while wearing roller skates. "}
+{"question_id": 22, "image": "000000415748.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : building at [0.000, 0.086, 0.697, 0.516].\nObject 1 : elephant at [0.084, 0.438, 0.727, 0.954].\nObject 2 : face at [0.411, 0.442, 0.670, 0.664].\nObject 3 : ground at [0.000, 0.742, 0.165, 0.998].\nObject 4 : man at [0.186, 0.246, 0.631, 0.516].\nObject 5 : shadow at [0.477, 0.812, 1.000, 0.958].\nObject 6 : sky at [0.006, 0.000, 0.228, 0.200].\nObject 7 : toe at [0.372, 0.900, 0.411, 0.924].\nObject 8 : tusk at [0.462, 0.670, 0.489, 0.692].\n\nRelationships:\nobject 4 : man -> on -> object 1 : elephant.\nobject 7 : toe -> of -> object 1 : elephant.\nobject 4 : man -> near -> object 0 : building.\nobject 4 : man -> on -> object 1 : elephant.\nobject 4 : man -> near -> object 1 : elephant.\nobject 8 : tusk -> on -> object 2 : face.\nobject 5 : shadow -> of -> object 1 : elephant.\nobject 5 : shadow -> on -> object 3 : ground.\nobject 4 : man -> close to -> object 0 : building.\nobject 0 : building -> close to -> object 1 : elephant.\n\nRegion Description:\nRegion Description at [0.411, 0.482, 0.634, 0.788] : elephant's face and trunk are painted.\n\nGlobal Caption:\nA man riding on the back of an elephant through a city street.\nMan riding on the back of a painted elephant. \nA man in colorful clothing riding a painted elephant.\na man in a white shirt is riding an elephant and some buildings\nAn old decorated elephant and its colorful rider"}
+{"question_id": 23, "image": "000000543300.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : boat at [0.048, 0.552, 0.928, 0.819].\nObject 1 : building at [0.328, 0.493, 0.538, 0.613].\nObject 2 : building at [0.000, 0.467, 0.338, 0.651].\nObject 3 : building at [0.534, 0.096, 0.998, 0.637].\nObject 4 : canopies at [0.452, 0.504, 0.620, 0.600].\nObject 5 : container at [0.858, 0.643, 0.948, 0.712].\nObject 6 : dolphin at [0.282, 0.691, 0.344, 0.773].\nObject 7 : flag at [0.322, 0.563, 0.340, 0.597].\nObject 8 : ground at [0.822, 0.696, 0.880, 0.715].\nObject 9 : leaves at [0.002, 0.483, 0.080, 0.659].\nObject 10 : level at [0.000, 0.709, 1.000, 0.829].\nObject 11 : level at [0.068, 0.616, 0.852, 0.688].\nObject 12 : outdoor seating at [0.502, 0.579, 0.532, 0.624].\nObject 13 : pink writing at [0.414, 0.693, 0.654, 0.725].\nObject 14 : pole at [0.282, 0.416, 0.292, 0.515].\nObject 15 : railing at [0.094, 0.557, 0.728, 0.624].\nObject 16 : railing at [0.238, 0.597, 0.744, 0.627].\nObject 17 : reflection at [0.174, 0.808, 0.922, 0.848].\nObject 18 : roof at [0.000, 0.469, 0.280, 0.523].\nObject 19 : roof at [0.348, 0.509, 0.482, 0.568].\nObject 20 : roof at [0.920, 0.264, 0.980, 0.344].\nObject 21 : row at [0.700, 0.499, 0.878, 0.573].\nObject 22 : sea wall at [0.878, 0.712, 0.998, 0.819].\nObject 23 : shore at [0.000, 0.627, 0.996, 0.816].\nObject 24 : sky at [0.006, 0.000, 1.000, 0.517].\nObject 25 : steeple at [0.918, 0.088, 0.936, 0.237].\nObject 26 : symbol at [0.268, 0.688, 0.350, 0.779].\nObject 27 : symbol at [0.702, 0.693, 0.752, 0.725].\nObject 28 : tree at [0.472, 0.491, 0.592, 0.597].\nObject 29 : trees at [0.948, 0.573, 1.000, 0.691].\nObject 30 : trees at [0.000, 0.488, 0.080, 0.675].\nObject 31 : vehicle at [0.968, 0.653, 0.998, 0.693].\nObject 32 : water at [0.004, 0.813, 0.998, 0.992].\nObject 33 : water at [0.008, 0.717, 0.998, 0.981].\nObject 34 : window at [0.374, 0.733, 0.790, 0.765].\nObject 35 : window at [0.800, 0.491, 0.868, 0.576].\nObject 36 : window at [0.928, 0.512, 0.950, 0.576].\nObject 37 : window at [0.892, 0.395, 0.912, 0.443].\nObject 38 : window at [0.894, 0.517, 0.910, 0.571].\nObject 39 : window at [0.630, 0.493, 0.652, 0.565].\nObject 40 : windows at [0.384, 0.637, 0.724, 0.685].\n\nRelationships:\nobject 40 : windows -> on -> object 0 : boat.\nobject 17 : reflection -> in -> object 33 : water.\nobject 29 : trees -> growing on -> object 23 : shore.\nobject 30 : trees -> growing on -> object 23 : shore.\nobject 28 : tree -> growing on -> object 23 : shore.\nobject 18 : roof -> on -> object 2 : building.\nobject 5 : container -> on -> object 22 : sea wall.\nobject 0 : boat -> in -> object 32 : water.\nobject 0 : boat -> has -> object 15 : railing.\n\nRegion Description:\nRegion Description at [0.414, 0.691, 0.662, 0.725] : the are red letters on the side of the cruise ship.\nRegion Description at [0.370, 0.707, 0.780, 0.763] : there is a long set of black windows on the side of the cruise ship.\nRegion Description at [0.870, 0.243, 0.992, 0.357] : there is a red roof on this building.\nRegion Description at [0.538, 0.400, 0.712, 0.549] : there is red and gray building in the background.\nRegion Description at [0.054, 0.595, 0.312, 0.821] : there is two levels on this cruise ship.\nRegion Description at [0.370, 0.587, 0.664, 0.621] : there is a silver railing on the top level of the cruise ship.\nRegion Description at [0.858, 0.621, 0.952, 0.717] : there is a blue container on the dock.\nRegion Description at [0.876, 0.707, 0.996, 0.787] : 
there is a gray sea wall beside the ship.\nRegion Description at [0.268, 0.723, 0.346, 0.787] : there are blue water symbols on the side of the cruise ship.\nRegion Description at [0.000, 0.619, 0.024, 0.712] : there is a blue and white sign on the dock.\nRegion Description at [0.662, 0.533, 0.904, 0.603] : An outdoor canopy creates shade for customers. .\n\nGlobal Caption:\nA boat sits on the side of the dock.\nA large white boat in the open water.\nA white double decker boat n water next to buildings.\nA large cruise ship is traveling on the ocean. \nA Port River Dolphin Cruise ship sits in the water."}
+{"question_id": 24, "image": "000000349184.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arm rest at [0.674, 0.486, 0.722, 0.560].\nObject 1 : bench at [0.000, 0.324, 0.731, 0.994].\nObject 2 : bricks at [0.075, 0.850, 0.180, 0.882].\nObject 3 : building at [0.090, 0.000, 0.686, 0.094].\nObject 4 : children at [0.470, 0.302, 0.539, 0.360].\nObject 5 : coat at [0.473, 0.322, 0.542, 0.364].\nObject 6 : daytime at [0.000, 0.002, 0.997, 1.000].\nObject 7 : fence at [0.719, 0.310, 0.997, 0.372].\nObject 8 : grass at [0.000, 0.364, 0.997, 0.720].\nObject 9 : jacket at [0.012, 0.424, 0.485, 0.690].\nObject 10 : jeans at [0.165, 0.748, 0.293, 0.844].\nObject 11 : leg at [0.168, 0.750, 0.308, 0.844].\nObject 12 : people at [0.386, 0.438, 0.449, 0.504].\nObject 13 : purse at [0.458, 0.488, 0.605, 0.694].\nObject 14 : shoe at [0.192, 0.836, 0.305, 0.890].\nObject 15 : strap at [0.677, 0.470, 0.814, 0.584].\nObject 16 : trees at [0.554, 0.000, 0.997, 0.376].\nObject 17 : woman at [0.009, 0.194, 0.497, 0.888].\n\nRelationships:\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 13 : purse -> has a -> object 15 : strap.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 3 : building -> behind -> object 16 : trees.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 9 : jacket -> on -> object 17 : woman.\nobject 12 : people -> near -> object 16 : trees.\nobject 17 : woman -> has a -> object 11 : leg.\nobject 1 : bench -> has an -> object 0 : arm rest.\nobject 15 : strap -> from -> object 13 : purse.\nobject 2 : bricks -> near -> object 1 : bench.\nobject 16 : trees -> in -> object 6 : daytime.\nobject 7 : fence -> under -> object 16 : trees.\nobject 12 : people -> in front of -> object 7 : fence.\nobject 13 : purse -> on -> object 1 : bench.\nobject 14 : shoe -> on -> object 2 : bricks.\n\nRegion Description:\nRegion Description at [0.096, 0.006, 0.662, 0.074] : Building with brown and white facade.\nRegion Description at [0.374, 0.298, 0.542, 0.360] : two people walking in front of woman.\n\nGlobal Caption:\nA woman sitting on top of a wooden bench near a park.\nA person sits on a wooden bench facing blooming trees.\nA woman sitting on a wooden bench viewing some beautiful trees.\nAdult sitting on wooden park bench in large open space.\nA woman sits on a bench watching the park."}
+{"question_id": 25, "image": "000000042070.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : 61 at [0.268, 0.480, 0.310, 0.532].\nObject 1 : asphalt at [0.128, 0.632, 0.998, 0.992].\nObject 2 : bike rack at [0.278, 0.583, 0.762, 0.885].\nObject 3 : bottom at [0.166, 0.757, 0.864, 0.895].\nObject 4 : building at [0.000, 0.000, 0.194, 0.635].\nObject 5 : bus at [0.160, 0.035, 0.906, 0.910].\nObject 6 : corner at [0.868, 0.570, 0.998, 0.652].\nObject 7 : display at [0.258, 0.055, 0.770, 0.168].\nObject 8 : driver at [0.614, 0.350, 0.736, 0.520].\nObject 9 : driver's seat at [0.616, 0.367, 0.726, 0.500].\nObject 10 : font at [0.484, 0.695, 0.544, 0.725].\nObject 11 : headlight at [0.202, 0.685, 0.304, 0.728].\nObject 12 : information at [0.292, 0.077, 0.682, 0.160].\nObject 13 : license plate at [0.468, 0.677, 0.562, 0.745].\nObject 14 : light post at [0.948, 0.340, 0.992, 0.650].\nObject 15 : mirror at [0.848, 0.395, 0.900, 0.495].\nObject 16 : name at [0.448, 0.085, 0.670, 0.153].\nObject 17 : number at [0.744, 0.608, 0.818, 0.653].\nObject 18 : number at [0.488, 0.698, 0.492, 0.723].\nObject 19 : number at [0.308, 0.090, 0.334, 0.158].\nObject 20 : number at [0.308, 0.087, 0.356, 0.155].\nObject 21 : number at [0.268, 0.490, 0.308, 0.527].\nObject 22 : number at [0.268, 0.490, 0.288, 0.527].\nObject 23 : paint at [0.180, 0.562, 0.370, 0.780].\nObject 24 : pole at [0.952, 0.335, 0.988, 0.650].\nObject 25 : rack at [0.206, 0.480, 0.840, 0.820].\nObject 26 : red lettering at [0.534, 0.695, 0.546, 0.725].\nObject 27 : red lettering at [0.524, 0.698, 0.534, 0.725].\nObject 28 : red lettering at [0.512, 0.698, 0.524, 0.725].\nObject 29 : red lettering at [0.494, 0.695, 0.506, 0.722].\nObject 30 : red lettering at [0.484, 0.695, 0.494, 0.722].\nObject 31 : reflective light at [0.274, 0.043, 0.328, 0.073].\nObject 32 : reflective light at [0.716, 0.040, 0.748, 0.077].\nObject 33 : reflective light at [0.560, 0.043, 0.602, 0.055].\nObject 34 : reflective light at [0.430, 0.043, 0.468, 0.068].\nObject 35 : reflective light at [0.500, 0.037, 0.538, 0.075].\nObject 36 : road at [0.116, 0.632, 0.996, 0.995].\nObject 37 : sidewalk at [0.056, 0.765, 0.104, 0.818].\nObject 38 : steering wheel at [0.634, 0.445, 0.770, 0.495].\nObject 39 : street light at [0.948, 0.333, 0.992, 0.645].\nObject 40 : stripe at [0.918, 0.863, 0.998, 0.950].\nObject 41 : tree at [0.862, 0.470, 0.994, 0.632].\nObject 42 : window at [0.198, 0.175, 0.838, 0.550].\nObject 43 : windshield at [0.518, 0.170, 0.862, 0.570].\nObject 44 : windshield at [0.180, 0.170, 0.850, 0.560].\nObject 45 : wiper at [0.528, 0.362, 0.722, 0.630].\nObject 46 : wiper at [0.454, 0.370, 0.656, 0.633].\nObject 47 : word at [0.434, 0.080, 0.688, 0.163].\n\nRelationships:\nobject 13 : license plate -> on -> object 5 : bus.\nobject 17 : number -> on -> object 5 : bus.\nobject 2 : bike rack -> on -> object 5 : bus.\nobject 5 : bus -> parked on side of -> object 36 : road.\nobject 39 : street light -> on -> object 6 : corner.\nobject 38 : steering wheel -> on -> object 5 : bus.\nobject 13 : license plate -> with -> object 10 : font.\nobject 7 : display -> showing -> object 16 : name.\nobject 23 : paint -> on -> object 5 : bus.\nobject 20 : number -> on -> object 5 : bus.\nobject 47 : word -> on -> object 5 : bus.\n\nRegion Description:\nRegion Description at [0.040, 0.278, 0.940, 0.913] : The bus is parked on the side of road..\nRegion Description at [0.272, 0.055, 0.752, 0.168] : display showing the current bus route name and number.\nRegion 
Description at [0.168, 0.025, 0.864, 0.902] : White bus with green and white design.\nRegion Description at [0.008, 0.005, 0.266, 0.630] : Brick building with red and white stripes.\n\nGlobal Caption:\nA very big city bus on a big street.\nA large bus on the side of a street.\nBlue, white, and green passenger bus parked at a stop. \na city bus parked on the side of the road\nA white bus driving down a street next to a building."}
+{"question_id": 26, "image": "000000241668.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : boutonniere at [0.710, 0.574, 0.799, 0.660].\nObject 1 : cake at [0.630, 0.670, 0.772, 0.750].\nObject 2 : cake crumb at [0.710, 0.348, 0.721, 0.356].\nObject 3 : crown at [0.370, 0.006, 0.549, 0.056].\nObject 4 : dress at [0.000, 0.574, 0.582, 1.000].\nObject 5 : eye at [0.649, 0.244, 0.699, 0.272].\nObject 6 : eye at [0.735, 0.264, 0.769, 0.280].\nObject 7 : eyebrow at [0.655, 0.230, 0.710, 0.250].\nObject 8 : eyebrow at [0.741, 0.252, 0.780, 0.264].\nObject 9 : finger at [0.721, 0.772, 0.816, 0.800].\nObject 10 : finger at [0.535, 0.740, 0.685, 0.826].\nObject 11 : ground at [0.003, 0.888, 0.997, 1.000].\nObject 12 : hair at [0.507, 0.142, 0.791, 0.642].\nObject 13 : hair at [0.189, 0.044, 0.652, 0.374].\nObject 14 : hand at [0.721, 0.720, 0.822, 0.818].\nObject 15 : hand at [0.493, 0.710, 0.685, 0.826].\nObject 16 : head at [0.209, 0.048, 0.652, 0.360].\nObject 17 : mouth at [0.646, 0.310, 0.724, 0.352].\nObject 18 : neck at [0.560, 0.344, 0.663, 0.460].\nObject 19 : necklace at [0.357, 0.334, 0.471, 0.484].\nObject 20 : necktie at [0.571, 0.442, 0.674, 0.936].\nObject 21 : paper at [0.760, 0.792, 0.914, 0.934].\nObject 22 : person at [0.490, 0.136, 0.825, 0.998].\nObject 23 : plate at [0.579, 0.734, 0.816, 0.768].\nObject 24 : purse at [0.774, 0.792, 0.883, 0.840].\nObject 25 : ring at [0.786, 0.780, 0.794, 0.796].\nObject 26 : shirt at [0.554, 0.376, 0.691, 0.950].\nObject 27 : suit jacket at [0.490, 0.422, 0.799, 0.998].\nObject 28 : table at [0.696, 0.816, 0.997, 0.916].\nObject 29 : toilet at [0.000, 0.656, 0.997, 0.936].\nObject 30 : wallpaper at [0.003, 0.000, 0.916, 0.656].\n\nRelationships:\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> sitting by -> object 29 : toilet.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 2 : cake crumb -> on side of -> object 17 : mouth.\nobject 24 : purse -> on top of -> object 28 : table.\nobject 5 : eye -> of a -> object 22 : person.\nobject 6 : eye -> of a -> object 22 : person.\nobject 7 : eyebrow -> of -> object 22 : person.\nobject 8 : eyebrow -> of -> object 22 : person.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 3 : crown -> on top of -> object 16 : head.\nobject 20 : necktie -> worn on -> object 22 : person.\nobject 22 : person -> holding -> object 1 : cake.\nobject 14 : hand -> holding -> object 1 : cake.\nobject 22 : person -> wearing -> object 27 : suit jacket.\nobject 22 : person -> wearing -> object 4 : dress.\nobject 20 : necktie -> worn on -> object 18 : neck.\nobject 13 : hair -> on top of -> object 16 : head.\nobject 1 : cake -> on top of -> object 23 : plate.\nobject 25 : ring -> worn on -> object 9 : finger.\n\nRegion Description:\nRegion Description at [0.022, 0.020, 0.203, 0.312] : A green and yellow striped wallpaper.\nRegion Description at [0.000, 0.048, 0.613, 0.996] : woman wearing a strapless white wedding dress .\nRegion Description at [0.487, 0.136, 0.808, 0.986] 
: woman white red hair holding a piece of cake on a plate.\nRegion Description at [0.543, 0.674, 0.813, 0.826] : woman's hands holding a plate of cake.\nRegion Description at [0.579, 0.124, 0.788, 0.524] : red haired woman wearing a tie and suit jacket .\nRegion Description at [0.000, 0.012, 0.819, 0.996] : two people wearing formal wedding attire .\n\nGlobal Caption:\nThere are two people enjoying a wedding reception\nA woman in a wedding dress with another woman in a suit behind\nA woman in a wedding dress with another lady holding a piece of cake.\nA red head girl holding a piece of cake\nA bride is with a long red haired person with cake."}
+{"question_id": 27, "image": "000000535578.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bush at [0.480, 0.000, 0.748, 0.084].\nObject 1 : ear at [0.544, 0.544, 0.571, 0.562].\nObject 2 : field at [0.000, 0.002, 0.994, 0.998].\nObject 3 : hill at [0.000, 0.000, 0.997, 0.998].\nObject 4 : plant at [0.000, 0.764, 0.601, 0.998].\nObject 5 : rock at [0.727, 0.410, 0.808, 0.470].\nObject 6 : sheep at [0.532, 0.546, 0.646, 0.662].\nObject 7 : sheep at [0.532, 0.666, 0.817, 0.810].\nObject 8 : tail at [0.565, 0.572, 0.604, 0.610].\nObject 9 : tree at [0.649, 0.000, 0.997, 0.334].\nObject 10 : trees at [0.736, 0.036, 0.835, 0.100].\nObject 11 : wall at [0.000, 0.000, 0.769, 0.180].\nObject 12 : weed at [0.417, 0.346, 0.492, 0.390].\n\nRelationships:\nobject 7 : sheep -> in a -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 11 : wall -> borders -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 10 : trees -> in -> object 2 : field.\nobject 6 : sheep -> has an -> object 1 : ear.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 12 : weed -> growing in -> object 2 : field.\nobject 7 : sheep -> on -> object 3 : hill.\nobject 4 : plant -> on -> object 2 : field.\nobject 5 : rock -> on -> object 3 : hill.\nobject 7 : sheep -> are in -> object 2 : field.\nobject 11 : wall -> running across -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 5 : rock -> in -> object 2 : field.\n\nRegion Description:\nRegion Description at [0.000, 0.072, 0.760, 0.160] : A stone wall boarding a field of sheep.\nRegion Description at [0.189, 0.032, 0.703, 0.178] : rocks and grass in the background of the pasture.\nRegion Description at [0.541, 0.662, 0.823, 0.802] : white sheep grazing in green grassy field.\nRegion Description at [0.538, 0.544, 0.646, 0.656] : white sheep grazing in green grassy field.\nRegion Description at [0.228, 0.374, 0.357, 0.436] : white sheep grazing in green grassy field.\nRegion Description at [0.607, 0.380, 0.712, 0.456] : white sheep grazing in green grassy field.\nRegion Description at [0.811, 0.296, 0.937, 0.338] : two white sheep grazing in green grassy field.\nRegion Description at [0.048, 0.200, 0.249, 0.242] : group of white sheep grazing in green grassy field.\nRegion Description at [0.213, 0.164, 0.336, 0.192] : group of white sheep grazing in green grassy field.\nRegion Description at [0.000, 0.006, 0.997, 0.172] : two long gray stone walls across field.\nRegion Description at [0.453, 0.000, 0.730, 0.062] : a stand of trees outside the stone fence.\n\nGlobal Caption:\nA group of sheep grazing in a grassy valley.\nSheep graze in a lushly green mountain meadow\nA flock of sheep walking along a grassy hillside grazing.\nA flock of sheep are grazing on a grassy slope.\nA group of sheep grazing in a grassy field."}
+{"question_id": 28, "image": "000000484415.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arm at [0.000, 0.125, 0.609, 0.988].\nObject 1 : bathroom tile at [0.009, 0.008, 0.994, 0.446].\nObject 2 : blue jeans at [0.369, 0.558, 0.722, 0.979].\nObject 3 : brush at [0.681, 0.208, 0.878, 0.500].\nObject 4 : brush holder at [0.716, 0.279, 0.891, 0.554].\nObject 5 : button at [0.519, 0.113, 0.584, 0.171].\nObject 6 : flusher at [0.534, 0.092, 0.628, 0.300].\nObject 7 : hand at [0.281, 0.125, 0.603, 0.562].\nObject 8 : holder at [0.713, 0.283, 0.903, 0.558].\nObject 9 : lid at [0.028, 0.046, 0.694, 0.446].\nObject 10 : man at [0.000, 0.133, 0.600, 0.992].\nObject 11 : seat at [0.138, 0.583, 0.722, 0.992].\nObject 12 : tank at [0.019, 0.021, 0.706, 0.579].\nObject 13 : tile at [0.794, 0.000, 1.000, 0.200].\nObject 14 : tile at [0.000, 0.000, 0.278, 0.129].\nObject 15 : toilet at [0.016, 0.042, 0.719, 0.996].\nObject 16 : toilet scrubber at [0.744, 0.192, 0.844, 0.521].\nObject 17 : toilet seat at [0.103, 0.517, 0.728, 0.996].\nObject 18 : wall at [0.659, 0.000, 0.978, 0.392].\nObject 19 : water at [0.369, 0.738, 0.500, 0.921].\n\nRelationships:\nobject 15 : toilet -> has -> object 11 : seat.\nobject 4 : brush holder -> by -> object 15 : toilet.\nobject 19 : water -> in -> object 15 : toilet.\nobject 6 : flusher -> on -> object 15 : toilet.\nobject 9 : lid -> on -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> has -> object 7 : hand.\nobject 0 : arm -> on -> object 15 : toilet.\nobject 14 : tile -> on -> object 18 : wall.\n\nRegion Description:\nRegion Description at [0.000, 0.046, 0.716, 0.987] : the arm reaching for the white toilet bowl.\nRegion Description at [0.716, 0.192, 0.894, 0.550] : the container and the toilet brush cleaner.\nRegion Description at [0.009, 0.042, 0.894, 0.992] : the toilet bowl next to the toilet bowl cleaner.\nRegion Description at [0.534, 0.087, 0.666, 0.329] : The hand is on the flusher in the image .\nRegion Description at [0.053, 0.158, 0.903, 0.875] : Porcelain toilet with flusher on top of the lid .\nRegion Description at [0.094, 0.154, 0.856, 0.942] : Man flushing the toilet in the bathroom .\n\nGlobal Caption:\nA hand is reaching out to the top if a toilet. \nA person flushing a toilet with a motion sensor.\nA person's hand flushing a toilet with a button on top of the tank. \na persons hand reaching for the top of a toilet\nA hand is reaching over a white toilet."}
+{"question_id": 29, "image": "000000491090.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arm at [0.313, 0.238, 0.567, 0.512].\nObject 1 : back wheel at [0.107, 0.502, 0.307, 0.720].\nObject 2 : face at [0.430, 0.118, 0.535, 0.218].\nObject 3 : floor at [0.003, 0.380, 0.997, 0.998].\nObject 4 : front light at [0.765, 0.514, 0.890, 0.634].\nObject 5 : front wheel at [0.642, 0.706, 0.997, 0.996].\nObject 6 : garage door at [0.532, 0.002, 0.858, 0.096].\nObject 7 : glasses at [0.422, 0.140, 0.548, 0.168].\nObject 8 : hand at [0.457, 0.450, 0.561, 0.518].\nObject 9 : indicator light at [0.666, 0.578, 0.722, 0.620].\nObject 10 : jeans at [0.241, 0.438, 0.465, 0.712].\nObject 11 : lettering at [0.003, 0.062, 0.302, 0.146].\nObject 12 : license plate at [0.939, 0.594, 1.000, 0.654].\nObject 13 : mirrors at [0.428, 0.320, 0.559, 0.384].\nObject 14 : motorcycle at [0.067, 0.358, 0.989, 1.000].\nObject 15 : person at [0.227, 0.086, 0.765, 0.758].\nObject 16 : sneaker at [0.243, 0.646, 0.342, 0.758].\nObject 17 : sweater at [0.243, 0.192, 0.676, 0.486].\nObject 18 : tail pipe at [0.059, 0.524, 0.257, 0.706].\n\nRelationships:\nobject 15 : person -> has -> object 7 : glasses.\nobject 15 : person -> has -> object 16 : sneaker.\nobject 15 : person -> has -> object 17 : sweater.\nobject 15 : person -> has -> object 17 : sweater.\nobject 15 : person -> has on -> object 10 : jeans.\nobject 14 : motorcycle -> has -> object 5 : front wheel.\nobject 14 : motorcycle -> has -> object 1 : back wheel.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> on -> object 14 : motorcycle.\nobject 14 : motorcycle -> has -> object 18 : tail pipe.\nobject 15 : person -> sitting on -> object 14 : motorcycle.\nobject 15 : person -> wearing -> object 17 : sweater.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 8 : hand.\nobject 15 : person -> has -> object 7 : glasses.\nobject 13 : mirrors -> are on -> object 14 : motorcycle.\nobject 1 : back wheel -> on -> object 14 : motorcycle.\nobject 5 : front wheel -> on -> object 14 : motorcycle.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 2 : face.\nobject 15 : person -> has -> object 0 : arm.\nobject 15 : person -> sitting on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.444, 0.138, 0.521, 0.168] : The eyeglasses the person on the motorcycle is wearing..\nRegion Description at [0.230, 0.640, 0.361, 0.760] : The person on the motorcycle's sneaker..\nRegion Description at [0.297, 0.216, 0.449, 0.404] : The left sleeve of the person's sweater..\nRegion Description at [0.545, 0.254, 0.738, 0.404] : The right sleeve of the person's sweater..\nRegion Description at [0.644, 0.706, 0.997, 0.994] : The front wheel of the motorcycle the person is on..\nRegion Description at [0.102, 0.498, 0.329, 0.692] : The back wheel of the motorcycle the person is on..\nRegion Description at [0.775, 0.518, 0.896, 0.626] : The front light of the motorcycle the person is on..\nRegion Description at [0.439, 0.432, 0.751, 0.522] : The handle bars on the motorcycle the person is on..\nRegion Description at [0.059, 0.516, 0.310, 0.708] : The tail pipe of the motorcycle the person is on..\nRegion Description at [0.663, 0.568, 0.733, 0.634] : small circular orange indicator light.\nRegion Description at [0.056, 0.522, 0.257, 0.706] : stainless steel motorcycle tailpipe .\nRegion Description at 
[0.067, 0.318, 0.992, 0.992] : Black motorcycle with silver accessories.\nRegion Description at [0.636, 0.690, 0.989, 0.992] : Black front wheel and fender of motorcycle.\nRegion Description at [0.243, 0.640, 0.353, 0.754] : Black and white shoe of man on motorcycle.\n\nGlobal Caption:\nA man sitting on one of a group of motorcycles.\nA MAN IS SMILING SITTING ON A MOTOR BIKE \nA middle-aged man leans on a sports bike, smiling\nA person sits on top of a motorcycle with others.\nA woman riding on the back of a motorcycle."}
+{"question_id": 30, "image": "000000276018.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : animal at [0.717, 0.042, 0.831, 0.152].\nObject 1 : animal at [0.114, 0.582, 0.348, 0.840].\nObject 2 : baby at [0.385, 0.034, 0.643, 0.434].\nObject 3 : baby at [0.911, 0.028, 1.000, 0.250].\nObject 4 : bear at [0.391, 0.506, 0.622, 0.714].\nObject 5 : bear at [0.695, 0.356, 0.868, 0.580].\nObject 6 : bear hand at [0.114, 0.630, 0.175, 0.660].\nObject 7 : black sock at [0.800, 0.796, 0.858, 0.834].\nObject 8 : blonde boy at [0.166, 0.170, 0.351, 0.460].\nObject 9 : boy at [0.102, 0.388, 0.498, 1.000].\nObject 10 : boy at [0.717, 0.188, 1.000, 0.864].\nObject 11 : child at [0.342, 0.390, 0.622, 1.000].\nObject 12 : coat at [0.077, 0.520, 0.495, 0.910].\nObject 13 : coat at [0.775, 0.296, 1.000, 0.616].\nObject 14 : coat at [0.397, 0.090, 0.634, 0.262].\nObject 15 : flip flops at [0.434, 0.756, 0.606, 0.910].\nObject 16 : girl at [0.372, 0.196, 0.603, 0.922].\nObject 17 : glasses at [0.191, 0.236, 0.308, 0.250].\nObject 18 : grass at [0.637, 0.652, 0.754, 0.788].\nObject 19 : hand at [0.714, 0.094, 0.788, 0.160].\nObject 20 : hands at [0.763, 0.380, 0.877, 0.430].\nObject 21 : hat at [0.757, 0.030, 0.889, 0.078].\nObject 22 : jacket at [0.357, 0.500, 0.622, 0.782].\nObject 23 : jacket at [0.422, 0.286, 0.603, 0.550].\nObject 24 : jacket at [0.163, 0.296, 0.320, 0.462].\nObject 25 : jacket at [0.911, 0.106, 1.000, 0.224].\nObject 26 : lady at [0.286, 0.000, 0.683, 0.560].\nObject 27 : man at [0.628, 0.030, 0.951, 0.742].\nObject 28 : shirt at [0.831, 0.306, 0.957, 0.404].\nObject 29 : shirt at [0.197, 0.296, 0.298, 0.370].\nObject 30 : shoe at [0.717, 0.804, 0.871, 0.864].\nObject 31 : sidewalk at [0.628, 0.574, 0.769, 0.632].\nObject 32 : stuffed animal at [0.286, 0.298, 0.517, 0.422].\n\nRelationships:\nobject 10 : boy -> wearing -> object 28 : shirt.\nobject 3 : baby -> wearing -> object 25 : jacket.\nobject 22 : jacket -> carrying -> object 4 : bear.\nobject 8 : blonde boy -> wears -> object 17 : glasses.\nobject 8 : blonde boy -> wears -> object 24 : jacket.\nobject 11 : child -> holding up -> object 32 : stuffed animal.\nobject 10 : boy -> holding up -> object 5 : bear.\nobject 30 : shoe -> with a -> object 7 : black sock.\nobject 10 : boy -> wearing -> object 7 : black sock.\nobject 26 : lady -> holding -> object 2 : baby.\nobject 16 : girl -> wearing -> object 15 : flip flops.\nobject 9 : boy -> wearing -> object 12 : coat.\nobject 10 : boy -> wearing a -> object 13 : coat.\nobject 4 : bear -> on -> object 20 : hands.\nobject 26 : lady -> carrying -> object 2 : baby.\nobject 0 : animal -> in -> object 19 : hand.\n\nRegion Description:\nRegion Description at [0.905, 0.020, 0.997, 0.272] : blonde haired baby wearing yellow jacket.\nRegion Description at [0.357, 0.388, 0.640, 0.730] : girl in blue jacket carrying blue dog.\nRegion Description at [0.071, 0.378, 0.498, 0.842] : boy in black jacket holding stuffed dog.\nRegion Description at [0.055, 0.572, 0.375, 0.846] : brown stuffed dog with red and white collar.\nRegion Description at [0.283, 0.194, 0.603, 0.400] : girl in pink jacket holding white stuffed animal.\nRegion Description at [0.695, 0.356, 0.874, 0.576] : White stuffed animal wearing a red jacket..\nRegion Description at [0.332, 0.394, 0.618, 0.992] : Little girl holding a grey stuffed dog..\nRegion Description at [0.372, 0.476, 0.723, 0.786] : little girl holding blue and white stuffed animal.\nRegion Description at [0.062, 0.556, 0.422, 0.840] : little boy holding 
brown and white stuffed animal.\n\nGlobal Caption:\na bunch of kids walking through some grass\nA group of children are holding various stuffed animals and dolls.\nKids walking while holding their stuffed animals. \nA group of kids holding teddy bears and looking happy.\nA group of children carrying stuffed animals walks across the grass. "}
+{"question_id": 31, "image": "000000361551.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : baggage at [0.107, 0.662, 0.179, 0.750].\nObject 1 : baggage at [0.368, 0.706, 0.456, 0.782].\nObject 2 : building at [0.000, 0.000, 0.997, 0.326].\nObject 3 : cap at [0.784, 0.544, 0.824, 0.568].\nObject 4 : duffel bag at [0.584, 0.702, 0.643, 0.768].\nObject 5 : ground at [0.000, 0.282, 1.000, 0.976].\nObject 6 : hair at [0.920, 0.614, 0.973, 0.640].\nObject 7 : headband at [0.923, 0.628, 0.952, 0.646].\nObject 8 : jacket at [0.776, 0.568, 0.840, 0.642].\nObject 9 : line at [0.696, 0.750, 0.989, 0.794].\nObject 10 : lines at [0.000, 0.436, 0.851, 0.486].\nObject 11 : luggage at [0.907, 0.706, 0.973, 0.786].\nObject 12 : luggage at [0.368, 0.702, 0.456, 0.780].\nObject 13 : man at [0.008, 0.554, 0.139, 0.800].\nObject 14 : man at [0.659, 0.572, 0.920, 0.844].\nObject 15 : man at [0.771, 0.538, 0.843, 0.640].\nObject 16 : pavement at [0.003, 0.308, 0.992, 0.566].\nObject 17 : people at [0.005, 0.562, 0.616, 0.824].\nObject 18 : pillars at [0.211, 0.130, 0.235, 0.240].\nObject 19 : ramp at [0.179, 0.158, 0.707, 0.408].\nObject 20 : service area at [0.003, 0.416, 0.995, 0.996].\nObject 21 : stairs at [0.352, 0.676, 1.000, 0.994].\nObject 22 : sweater at [0.667, 0.634, 0.920, 0.824].\nObject 23 : top at [0.960, 0.626, 1.000, 0.668].\nObject 24 : truck at [0.781, 0.278, 0.997, 0.366].\nObject 25 : walls at [0.608, 0.000, 0.989, 0.320].\nObject 26 : wheel at [0.843, 0.338, 0.875, 0.366].\nObject 27 : woman at [0.917, 0.610, 1.000, 0.724].\n\nRelationships:\nobject 17 : people -> in -> object 20 : service area.\nobject 27 : woman -> bends over -> object 11 : luggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 12 : luggage -> on -> object 5 : ground.\nobject 13 : man -> carries -> object 0 : baggage.\nobject 14 : man -> wears -> object 22 : sweater.\nobject 15 : man -> wears -> object 3 : cap.\nobject 24 : truck -> in -> object 20 : service area.\nobject 15 : man -> wears -> object 8 : jacket.\nobject 10 : lines -> on -> object 16 : pavement.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 9 : line -> on -> object 16 : pavement.\nobject 24 : truck -> has -> object 26 : wheel.\nobject 2 : building -> has -> object 25 : walls.\nobject 15 : man -> on -> object 20 : service area.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 27 : woman -> wears -> object 7 : headband.\nobject 1 : baggage -> on -> object 20 : service area.\n\nRegion Description:\nRegion Description at [0.443, 0.528, 0.992, 0.850] : People standing in service area of airport..\nRegion Description at [0.648, 0.564, 0.960, 0.892] : Man walking down stairs of unloading ramp..\nRegion Description at [0.229, 0.698, 0.381, 0.776] : Black and red luggage sitting on ground..\nRegion Description at [0.957, 0.616, 0.997, 0.670] : Woman dressed in sleeveless black top..\nRegion Description at [0.011, 0.548, 0.211, 0.750] : Man holding his luggage and bending over.\nRegion Description at [0.893, 0.578, 0.995, 0.678] : woman with a black and white head band.\nRegion Description at [0.235, 0.684, 0.973, 0.816] : Rainbow of colors in the form of luggage.\n\nGlobal Caption:\nSome are standing outside a building with suitcases.\nA few people are getting of a plane.\nA group of people and luggage on a airport tarmac.\nSome people who are placing luggage on a runway.\nAn airport and plane unloading 
passengers with luggage."}
+{"question_id": 32, "image": "000000562207.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : body at [0.166, 0.539, 0.296, 0.997].\nObject 1 : boot at [0.594, 0.753, 0.620, 0.870].\nObject 2 : boot at [0.620, 0.744, 0.658, 0.858].\nObject 3 : bucket at [0.268, 0.744, 0.322, 0.828].\nObject 4 : clouds at [0.156, 0.000, 0.968, 0.328].\nObject 5 : ear at [0.590, 0.226, 0.638, 0.410].\nObject 6 : ear at [0.368, 0.208, 0.448, 0.434].\nObject 7 : elephant at [0.328, 0.157, 0.638, 0.967].\nObject 8 : eye at [0.476, 0.319, 0.504, 0.346].\nObject 9 : foot at [0.436, 0.901, 0.516, 0.958].\nObject 10 : grass at [0.950, 0.759, 0.996, 0.807].\nObject 11 : leg at [0.498, 0.572, 0.548, 0.898].\nObject 12 : leg at [0.408, 0.512, 0.516, 0.955].\nObject 13 : man at [0.582, 0.476, 0.662, 0.870].\nObject 14 : man at [0.164, 0.455, 0.292, 0.997].\nObject 15 : mountains at [0.000, 0.265, 0.376, 0.470].\nObject 16 : rock at [0.736, 0.895, 0.762, 0.934].\nObject 17 : sand at [0.240, 0.687, 0.998, 1.000].\nObject 18 : shirt at [0.582, 0.521, 0.650, 0.681].\nObject 19 : shorts at [0.174, 0.699, 0.254, 0.864].\nObject 20 : side at [0.236, 0.675, 0.994, 0.997].\nObject 21 : skirt at [0.298, 0.687, 0.360, 0.810].\nObject 22 : sky at [0.004, 0.000, 0.998, 0.355].\nObject 23 : top at [0.302, 0.539, 0.358, 0.696].\nObject 24 : tree at [0.012, 0.407, 0.076, 0.500].\nObject 25 : trunk at [0.506, 0.392, 0.600, 0.964].\nObject 26 : watch at [0.172, 0.711, 0.192, 0.732].\nObject 27 : water at [0.000, 0.488, 0.994, 1.000].\nObject 28 : woman at [0.288, 0.473, 0.420, 0.967].\n\nRelationships:\nobject 7 : elephant -> on -> object 20 : side.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 14 : man -> standing on -> object 20 : side.\nobject 14 : man -> standing beside -> object 7 : elephant.\nobject 10 : grass -> on -> object 20 : side.\nobject 28 : woman -> wearing -> object 23 : top.\nobject 13 : man -> wearing -> object 18 : shirt.\nobject 13 : man -> wearing -> object 1 : boot.\nobject 13 : man -> wearing -> object 2 : boot.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 7 : elephant -> has -> object 25 : trunk.\nobject 14 : man -> wearing -> object 19 : shorts.\nobject 28 : woman -> petting -> object 7 : elephant.\nobject 14 : man -> with -> object 7 : elephant.\nobject 28 : woman -> with -> object 7 : elephant.\nobject 13 : man -> with -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 9 : foot -> of an -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 11 : leg -> of -> object 7 : elephant.\nobject 12 : leg -> of -> object 7 : elephant.\nobject 5 : ear -> of -> object 7 : elephant.\nobject 6 : ear -> of -> object 7 : elephant.\nobject 8 : eye -> of -> object 7 : elephant.\nobject 27 : water -> behind -> object 7 : elephant.\n\nRegion Description:\nRegion Description at [0.338, 0.139, 0.618, 0.967] : the elephant standing on the lake side.\nRegion Description at [0.154, 0.392, 0.300, 0.964] : a man standing on the lake side with shorts.\nRegion Description at [0.574, 0.422, 0.686, 0.910] : the man standing beside the elephant.\nRegion Description at [0.292, 0.485, 0.378, 0.705] : this lady is wearing a blue tank top.\nRegion Description at [0.722, 0.768, 0.988, 0.964] : the sand is brown with green grass growing in it.\nRegion Description at [0.156, 0.669, 0.270, 0.910] : the man is wearing grey black and white shorts.\nRegion Description at [0.504, 0.560, 0.568, 0.898] : 
The front right leg of the elephant..\nRegion Description at [0.310, 0.536, 0.358, 0.690] : The light blue tank top the girl is wearing..\nRegion Description at [0.262, 0.732, 0.326, 0.825] : The black bucket in the girl's hand..\nRegion Description at [0.002, 0.443, 0.992, 0.994] : The water behind the people and the elephant..\n\nGlobal Caption:\nA group of people are standing next to an elephant emerging from the water.\na group of people stand beside of a giant elephant \nThree tourists pose for a picture next to an elephant.\nThree people stand with an elephant in front of a stream.\nThree people standing next to an elephant along a river."}
+{"question_id": 33, "image": "000000553990.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bar at [0.444, 0.622, 0.640, 0.688].\nObject 1 : boots at [0.328, 0.339, 0.416, 0.492].\nObject 2 : bridal at [0.474, 0.246, 0.678, 0.432].\nObject 3 : food at [0.416, 0.646, 0.466, 0.715].\nObject 4 : foot at [0.324, 0.402, 0.380, 0.492].\nObject 5 : girl at [0.320, 0.078, 0.552, 0.502].\nObject 6 : grass at [0.012, 0.694, 0.998, 0.994].\nObject 7 : ground at [0.004, 0.679, 0.996, 0.913].\nObject 8 : helmet at [0.484, 0.096, 0.560, 0.162].\nObject 9 : hoof at [0.120, 0.853, 0.170, 0.925].\nObject 10 : horse at [0.024, 0.210, 0.690, 0.949].\nObject 11 : legs at [0.478, 0.453, 0.598, 0.637].\nObject 12 : legs at [0.130, 0.583, 0.278, 0.925].\nObject 13 : mane at [0.484, 0.186, 0.648, 0.279].\nObject 14 : person at [0.568, 0.568, 0.604, 0.640].\nObject 15 : poles at [0.460, 0.814, 0.538, 0.955].\nObject 16 : shirt at [0.580, 0.586, 0.594, 0.622].\nObject 17 : shirt at [0.388, 0.150, 0.508, 0.279].\nObject 18 : tail at [0.044, 0.357, 0.222, 0.784].\nObject 19 : tree at [0.720, 0.057, 0.874, 0.568].\nObject 20 : tree at [0.220, 0.000, 0.456, 0.586].\nObject 21 : trees at [0.730, 0.003, 0.986, 0.628].\nObject 22 : wall at [0.188, 0.276, 0.254, 0.393].\nObject 23 : water at [0.028, 0.468, 0.134, 0.574].\n\nRelationships:\nobject 5 : girl -> has -> object 1 : boots.\nobject 6 : grass -> under -> object 10 : horse.\nobject 21 : trees -> behind -> object 10 : horse.\nobject 10 : horse -> jumping -> object 15 : poles.\nobject 11 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 14 : person -> in -> object 16 : shirt.\nobject 10 : horse -> has -> object 9 : hoof.\n\nRegion Description:\n\nGlobal Caption:\nA young person ridding a horse jumps a gate in a competition.\nA man riding on a horse as it jumps over a pole. \nA woman is riding a horse as it jumps over a bar.\nthere is a woman jockey riding a hose over the hurdle\nA woman riding a horse jumps over an obstacle."}
+{"question_id": 34, "image": "000000106048.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : book at [0.218, 0.105, 0.834, 0.754].\nObject 1 : building at [0.050, 0.000, 1.000, 0.713].\nObject 2 : bus at [0.222, 0.144, 0.820, 0.757].\nObject 3 : bushes at [0.810, 0.401, 1.000, 0.680].\nObject 4 : design at [0.228, 0.422, 0.438, 0.560].\nObject 5 : ground at [0.000, 0.629, 1.002, 0.994].\nObject 6 : headlight at [0.738, 0.590, 0.796, 0.632].\nObject 7 : headlight at [0.522, 0.596, 0.610, 0.629].\nObject 8 : light at [0.604, 0.201, 0.706, 0.222].\nObject 9 : pavement at [0.002, 0.629, 0.996, 0.994].\nObject 10 : pipe at [0.172, 0.147, 0.208, 0.617].\nObject 11 : pipe at [0.438, 0.096, 0.458, 0.192].\nObject 12 : roof at [0.118, 0.000, 0.896, 0.174].\nObject 13 : side mirror at [0.488, 0.314, 0.530, 0.428].\nObject 14 : side mirror at [0.790, 0.332, 0.818, 0.455].\nObject 15 : street at [0.002, 0.611, 0.992, 0.991].\nObject 16 : stripe at [0.228, 0.428, 0.516, 0.569].\nObject 17 : trash can at [0.790, 0.569, 0.822, 0.662].\nObject 18 : wall at [0.858, 0.368, 0.920, 0.419].\nObject 19 : wheel at [0.266, 0.545, 0.294, 0.677].\nObject 20 : wheel at [0.248, 0.551, 0.264, 0.668].\nObject 21 : wheel at [0.444, 0.578, 0.472, 0.751].\nObject 22 : windows at [0.510, 0.216, 0.796, 0.548].\nObject 23 : windshield at [0.518, 0.222, 0.782, 0.545].\n\nRelationships:\nobject 10 : pipe -> running from -> object 12 : roof.\nobject 12 : roof -> to -> object 5 : ground.\nobject 17 : trash can -> next to -> object 3 : bushes.\nobject 3 : bushes -> by -> object 15 : street.\n\nRegion Description:\nRegion Description at [0.568, 0.524, 0.770, 0.599] : Divine Transportation written on front of bus.\nRegion Description at [0.162, 0.129, 0.212, 0.623] : black drain pipe running from the roof to the ground.\nRegion Description at [0.712, 0.177, 0.762, 0.240] : bus identification number on top of bus.\nRegion Description at [0.790, 0.557, 0.820, 0.647] : gray trash can next to bushes behind bus.\nRegion Description at [0.810, 0.407, 0.990, 0.692] : large green bushes in front of building.\nRegion Description at [0.670, 0.317, 0.740, 0.527] : black windshield wiper on windshield.\n\nGlobal Caption:\nA white bus driving past a tall building.\na black and white bus some bushes and building\nA white decorated bus is next to a building.\na large white bus that is by a building\nA large bus parked in a parking lot "}
+{"question_id": 35, "image": "000000421923.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : block at [0.156, 0.630, 0.357, 0.822].\nObject 1 : book at [0.414, 0.208, 0.538, 0.364].\nObject 2 : book at [0.360, 0.202, 0.417, 0.360].\nObject 3 : book at [0.426, 0.484, 0.691, 0.522].\nObject 4 : book at [0.399, 0.404, 0.520, 0.554].\nObject 5 : bowl at [0.072, 0.030, 0.288, 0.076].\nObject 6 : center at [0.850, 0.732, 0.886, 0.766].\nObject 7 : eye at [0.282, 0.506, 0.327, 0.532].\nObject 8 : eye at [0.189, 0.506, 0.237, 0.534].\nObject 9 : flower at [0.796, 0.462, 0.982, 0.550].\nObject 10 : flower at [0.817, 0.528, 0.976, 0.612].\nObject 11 : flower at [0.760, 0.678, 0.946, 0.824].\nObject 12 : flower at [0.691, 0.608, 0.838, 0.722].\nObject 13 : flower at [0.913, 0.680, 1.000, 0.770].\nObject 14 : object at [0.213, 0.840, 0.583, 0.972].\nObject 15 : picture at [0.778, 0.060, 1.000, 0.352].\nObject 16 : shelf at [0.324, 0.528, 0.997, 0.624].\nObject 17 : shelf at [0.207, 0.334, 0.997, 0.380].\nObject 18 : shelf at [0.000, 0.028, 0.607, 0.202].\nObject 19 : stack at [0.435, 0.480, 0.712, 0.578].\nObject 20 : statue at [0.147, 0.404, 0.372, 0.652].\nObject 21 : table at [0.000, 0.690, 1.003, 0.998].\nObject 22 : vase at [0.838, 0.774, 0.994, 0.974].\nObject 23 : water at [0.847, 0.864, 0.997, 0.984].\n\nRelationships:\nobject 20 : statue -> on -> object 0 : block.\nobject 14 : object -> on -> object 21 : table.\nobject 1 : book -> on -> object 17 : shelf.\nobject 4 : book -> on -> object 16 : shelf.\nobject 5 : bowl -> on -> object 18 : shelf.\nobject 22 : vase -> has -> object 23 : water.\nobject 20 : statue -> has -> object 8 : eye.\nobject 20 : statue -> has -> object 7 : eye.\nobject 20 : statue -> on -> object 0 : block.\nobject 9 : flower -> in -> object 22 : vase.\nobject 10 : flower -> in -> object 22 : vase.\nobject 12 : flower -> in -> object 22 : vase.\nobject 13 : flower -> in -> object 22 : vase.\nobject 3 : book -> in -> object 19 : stack.\nobject 11 : flower -> has -> object 6 : center.\nobject 1 : book -> on -> object 17 : shelf.\nobject 2 : book -> on -> object 17 : shelf.\nobject 11 : flower -> has -> object 6 : center.\nobject 3 : book -> on -> object 19 : stack.\nobject 19 : stack -> on -> object 16 : shelf.\nobject 20 : statue -> on -> object 0 : block.\n\nRegion Description:\n\nGlobal Caption:\na glass vase with some flowers coming out of it \nA room witb a statue, bookshelves, books and a vase with flowers in it.\nA desk with a vase containing flowers, a sculpture of a man's head and shelves behind it.\nA statue next to a vase of flowers on a shelf. \nThe bust of a man's head is next to a vase of flowers."}
+{"question_id": 36, "image": "000000273493.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : ball at [0.640, 0.399, 0.648, 0.411].\nObject 1 : border at [0.040, 0.502, 1.000, 0.556].\nObject 2 : boundary lines at [0.030, 0.661, 1.000, 1.000].\nObject 3 : bushes at [0.020, 0.186, 0.104, 0.517].\nObject 4 : fence at [0.008, 0.366, 0.994, 0.565].\nObject 5 : fence at [0.024, 0.502, 0.996, 0.709].\nObject 6 : grass at [0.004, 0.529, 0.994, 0.997].\nObject 7 : man at [0.144, 0.360, 0.246, 0.736].\nObject 8 : man at [0.730, 0.474, 0.780, 0.613].\nObject 9 : pants at [0.732, 0.529, 0.778, 0.604].\nObject 10 : shirt at [0.164, 0.411, 0.222, 0.547].\nObject 11 : shorts at [0.162, 0.535, 0.220, 0.628].\nObject 12 : sign at [0.916, 0.405, 0.934, 0.438].\nObject 13 : sky at [0.006, 0.021, 0.990, 0.279].\nObject 14 : sneakers at [0.180, 0.709, 0.216, 0.739].\nObject 15 : sneakers at [0.762, 0.598, 0.776, 0.613].\nObject 16 : tennis at [0.012, 0.384, 0.984, 0.934].\nObject 17 : tennis court at [0.000, 0.372, 0.988, 0.979].\nObject 18 : tennis racket at [0.768, 0.526, 0.808, 0.556].\nObject 19 : tennis racket at [0.214, 0.574, 0.238, 0.619].\nObject 20 : trees at [0.586, 0.282, 0.692, 0.420].\nObject 21 : white at [0.734, 0.492, 0.778, 0.601].\n\nRelationships:\nobject 7 : man -> in -> object 10 : shirt.\nobject 7 : man -> with -> object 19 : tennis racket.\nobject 7 : man -> plays -> object 16 : tennis.\nobject 7 : man -> wears -> object 14 : sneakers.\nobject 8 : man -> wears -> object 15 : sneakers.\nobject 7 : man -> wears -> object 11 : shorts.\nobject 8 : man -> wears -> object 9 : pants.\nobject 5 : fence -> has -> object 1 : border.\nobject 20 : trees -> behind -> object 3 : bushes.\nobject 2 : boundary lines -> on -> object 17 : tennis court.\nobject 2 : boundary lines -> on -> object 6 : grass.\nobject 3 : bushes -> behind -> object 4 : fence.\nobject 20 : trees -> behind -> object 4 : fence.\nobject 7 : man -> has -> object 19 : tennis racket.\nobject 8 : man -> wears -> object 21 : white.\nobject 4 : fence -> around -> object 17 : tennis court.\nobject 20 : trees -> behind -> object 8 : man.\nobject 6 : grass -> on -> object 17 : tennis court.\nobject 8 : man -> has -> object 18 : tennis racket.\nobject 8 : man -> hitting -> object 0 : ball.\nobject 5 : fence -> on -> object 17 : tennis court.\n\nRegion Description:\nRegion Description at [0.024, 0.489, 0.998, 0.730] : The tennis net separating the sides of the players..\nRegion Description at [0.144, 0.652, 0.234, 0.745] : The black sneakers the player is wearing..\nRegion Description at [0.720, 0.577, 0.784, 0.613] : The white sneakers the player is wearing..\nRegion Description at [0.158, 0.544, 0.230, 0.628] : The gray shorts the player is wearing..\nRegion Description at [0.006, 0.402, 0.998, 0.574] : The trimmed bushes behind the player..\nRegion Description at [0.008, 0.168, 0.998, 0.402] : The trees behind the trimmed bushes behind the player..\nRegion Description at [0.006, 0.604, 0.998, 0.985] : The white boundary lines on the tennis court..\nRegion Description at [0.020, 0.447, 0.994, 0.760] : A black and white net stretches across the field.\nRegion Description at [0.060, 0.526, 0.984, 0.985] : The field has green grass with white lines.\nRegion Description at [0.016, 0.369, 0.978, 0.595] : A tall green shrub is behind the fence.\nRegion Description at [0.034, 0.150, 0.984, 0.393] : Trees are seen behind the fence and shrub.\nRegion Description at [0.588, 0.327, 0.850, 0.703] : The yellow ball is flying towards the 
man.\nRegion Description at [0.902, 0.378, 0.956, 0.529] : A black circular sign with the number five.\nRegion Description at [0.142, 0.354, 0.248, 0.736] : male in white t-shirt playing tennis.\nRegion Description at [0.200, 0.565, 0.244, 0.625] : Head of tennis racket of man playing.\nRegion Description at [0.726, 0.465, 0.786, 0.631] : Man in white preparing to hit tennis ball.\n\nGlobal Caption:\nTwo men playing a game of tennis on a court.\ntwo people playing tennis with rackets on a grass court\nTwo young men playing a game of tennis.\nPeople playing tennis on a court surrounded by green hedges.\ntHERE ARE TWO MEN PLAYING TENNIS ON THE TENNIS COURT"}
+{"question_id": 37, "image": "000000475150.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : animal at [0.220, 0.105, 1.006, 0.997].\nObject 1 : branches at [0.000, 0.000, 1.000, 1.000].\nObject 2 : ear at [0.402, 0.288, 0.452, 0.378].\nObject 3 : eye at [0.332, 0.396, 0.378, 0.429].\nObject 4 : foliage at [0.584, 0.093, 0.748, 0.255].\nObject 5 : giraffe`s neck at [0.476, 0.264, 1.000, 1.003].\nObject 6 : head at [0.216, 0.102, 0.476, 0.706].\nObject 7 : mane at [0.576, 0.502, 0.836, 0.811].\nObject 8 : nose at [0.222, 0.640, 0.266, 0.703].\nObject 9 : sky at [0.000, 0.000, 1.000, 0.562].\nObject 10 : spot at [0.562, 0.535, 0.616, 0.625].\nObject 11 : spot at [0.560, 0.447, 0.592, 0.508].\nObject 12 : spot at [0.592, 0.444, 0.670, 0.556].\nObject 13 : spot at [0.622, 0.565, 0.694, 0.664].\nObject 14 : spot at [0.514, 0.483, 0.570, 0.571].\nObject 15 : spots at [0.700, 0.640, 0.806, 0.817].\nObject 16 : spots at [0.706, 0.823, 0.776, 0.943].\nObject 17 : spots at [0.852, 0.829, 0.984, 0.997].\nObject 18 : spots at [0.674, 0.547, 0.758, 0.655].\nObject 19 : spots at [0.774, 0.700, 0.902, 0.913].\nObject 20 : tree at [0.000, 0.000, 1.000, 1.000].\nObject 21 : wrinkles at [0.466, 0.468, 0.554, 0.586].\n\nRelationships:\nobject 20 : tree -> has -> object 4 : foliage.\nobject 21 : wrinkles -> on -> object 5 : giraffe`s neck.\nobject 3 : eye -> on a -> object 0 : animal.\nobject 4 : foliage -> in -> object 20 : tree.\nobject 1 : branches -> behind -> object 0 : animal.\nobject 14 : spot -> on -> object 0 : animal.\nobject 11 : spot -> on -> object 0 : animal.\nobject 10 : spot -> on -> object 0 : animal.\nobject 12 : spot -> on -> object 0 : animal.\nobject 13 : spot -> on -> object 0 : animal.\nobject 5 : giraffe`s neck -> on -> object 0 : animal.\nobject 3 : eye -> of -> object 0 : animal.\nobject 2 : ear -> of -> object 0 : animal.\nobject 6 : head -> of -> object 0 : animal.\n\nRegion Description:\nRegion Description at [0.616, 0.565, 0.956, 0.958] : the giraffe is spotted tan and brown.\nRegion Description at [0.288, 0.324, 0.572, 0.649] : the giraffes face is white and brown.\n\nGlobal Caption:\nA giraffe stands near a tree in the wilderness. \nA giraffe standing in front of a group of trees.\nA giraffe standing next to a leaf free tree.\nHead and neck of a giraffe in natural feeding habitat.\nA giraffe walking near a tree with very few leaves."}
+{"question_id": 38, "image": "000000125472.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : axle at [0.447, 0.814, 0.535, 0.856].\nObject 1 : background at [0.003, 0.744, 0.994, 0.988].\nObject 2 : bracelet at [0.820, 0.444, 0.859, 0.470].\nObject 3 : building at [0.012, 0.888, 0.099, 0.994].\nObject 4 : corner at [0.027, 0.890, 0.117, 0.992].\nObject 5 : fence at [0.030, 0.886, 1.000, 1.000].\nObject 6 : hair at [0.486, 0.078, 0.712, 0.216].\nObject 7 : jean pants at [0.246, 0.380, 0.841, 0.632].\nObject 8 : laces at [0.168, 0.562, 0.850, 0.674].\nObject 9 : logo at [0.429, 0.232, 0.583, 0.364].\nObject 10 : man at [0.201, 0.002, 0.940, 0.758].\nObject 11 : name at [0.000, 0.960, 0.321, 1.000].\nObject 12 : picture at [0.003, 0.004, 1.000, 0.998].\nObject 13 : poles at [0.180, 0.886, 0.432, 0.990].\nObject 14 : shirt at [0.324, 0.124, 0.694, 0.392].\nObject 15 : shoes at [0.189, 0.606, 0.946, 0.792].\nObject 16 : skateboard at [0.012, 0.746, 0.664, 0.886].\nObject 17 : sky at [0.012, 0.002, 1.000, 0.918].\nObject 18 : stadium lights at [0.147, 0.860, 0.456, 0.994].\nObject 19 : stitching at [0.312, 0.408, 0.754, 0.638].\nObject 20 : strip at [0.279, 0.770, 0.529, 0.802].\nObject 21 : top at [0.024, 0.830, 0.420, 0.936].\nObject 22 : trees at [0.024, 0.846, 1.000, 1.000].\nObject 23 : wheels at [0.012, 0.808, 0.586, 0.904].\nObject 24 : wrist at [0.802, 0.434, 0.856, 0.484].\n\nRelationships:\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 23 : wheels -> on a -> object 16 : skateboard.\nobject 14 : shirt -> has a -> object 9 : logo.\nobject 10 : man -> doing trick on -> object 16 : skateboard.\nobject 3 : building -> behind a -> object 5 : fence.\nobject 11 : name -> on -> object 12 : picture.\nobject 11 : name -> has a -> object 11 : name.\nobject 10 : man -> performing on a -> object 16 : skateboard.\nobject 4 : corner -> of -> object 3 : building.\nobject 18 : stadium lights -> are on -> object 13 : poles.\nobject 16 : skateboard -> has -> object 23 : wheels.\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 11 : name -> on -> object 12 : picture.\nobject 16 : skateboard -> under -> object 10 : man.\nobject 10 : man -> wearing -> object 15 : shoes.\nobject 3 : building -> behind -> object 5 : fence.\nobject 22 : trees -> in -> object 1 : background.\nobject 15 : shoes -> have -> object 8 : laces.\nobject 18 : stadium lights -> on -> object 13 : poles.\nobject 5 : fence -> behind -> object 10 : man.\nobject 20 : strip -> on -> object 16 : skateboard.\nobject 19 : stitching -> on -> object 7 : jean pants.\nobject 9 : logo -> on -> object 14 : shirt.\nobject 23 : wheels -> on -> object 16 : skateboard.\nobject 0 : axle -> on -> object 16 : skateboard.\nobject 21 : top -> of -> object 22 : trees.\n\nRegion Description:\nRegion Description at [0.030, 0.774, 0.643, 0.912] : a black skateboard with black wheels.\n\nGlobal Caption:\nA man flying through the air while riding a skateboard.\nA man is doing tricks on a skateboard.\nA skateboarder jumps while trying to perform a trick.\na man in the air standing above the skateboard\na person attempting a jump with a skateboard"}
+{"question_id": 39, "image": "000000069138.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arrows at [0.000, 0.616, 0.214, 0.644].\nObject 1 : awning at [0.159, 0.260, 0.293, 0.336].\nObject 2 : building at [0.000, 0.000, 1.000, 0.466].\nObject 3 : bushes at [0.693, 0.342, 1.000, 0.512].\nObject 4 : door at [0.110, 0.370, 0.266, 0.518].\nObject 5 : face at [0.390, 0.256, 0.614, 0.392].\nObject 6 : greenery at [0.824, 0.154, 0.997, 0.384].\nObject 7 : hitch at [0.221, 0.520, 0.259, 0.542].\nObject 8 : ladder at [0.110, 0.342, 0.283, 0.364].\nObject 9 : license plate at [0.141, 0.460, 0.234, 0.500].\nObject 10 : line at [0.017, 0.700, 0.266, 0.756].\nObject 11 : picture at [0.155, 0.378, 0.259, 0.442].\nObject 12 : plant barrier at [0.672, 0.482, 1.000, 0.606].\nObject 13 : planter at [0.676, 0.152, 1.000, 0.510].\nObject 14 : pole at [0.328, 0.068, 0.483, 0.994].\nObject 15 : road at [0.000, 0.490, 1.000, 1.000].\nObject 16 : roof at [0.117, 0.360, 0.283, 0.382].\nObject 17 : sad face at [0.383, 0.244, 0.614, 0.384].\nObject 18 : short term at [0.624, 0.040, 0.769, 0.080].\nObject 19 : sidewalk at [0.666, 0.572, 0.993, 0.618].\nObject 20 : sign at [0.621, 0.082, 0.772, 0.132].\nObject 21 : sign at [0.007, 0.144, 0.069, 0.204].\nObject 22 : signal at [0.266, 0.210, 0.679, 0.848].\nObject 23 : stop light at [0.366, 0.236, 0.638, 0.394].\nObject 24 : tail light at [0.100, 0.446, 0.121, 0.472].\nObject 25 : van at [0.076, 0.326, 0.297, 0.556].\nObject 26 : wall at [0.676, 0.500, 0.997, 0.604].\nObject 27 : window at [0.903, 0.000, 1.000, 0.086].\n\nRelationships:\nobject 23 : stop light -> with -> object 17 : sad face.\nobject 0 : arrows -> on -> object 15 : road.\nobject 12 : plant barrier -> beside -> object 15 : road.\nobject 11 : picture -> on -> object 4 : door.\nobject 10 : line -> painted in -> object 15 : road.\nobject 19 : sidewalk -> next to -> object 15 : road.\nobject 2 : building -> for -> object 18 : short term.\nobject 23 : stop light -> making -> object 5 : face.\nobject 3 : bushes -> just above -> object 26 : wall.\nobject 22 : signal -> on -> object 14 : pole.\nobject 25 : van -> has -> object 16 : roof.\nobject 25 : van -> has -> object 8 : ladder.\nobject 8 : ladder -> on -> object 16 : roof.\nobject 13 : planter -> by -> object 15 : road.\nobject 23 : stop light -> on -> object 22 : signal.\n\nRegion Description:\nRegion Description at [0.331, 0.852, 0.472, 0.996] : Pole holding traffic light on street.\nRegion Description at [0.600, 0.036, 0.793, 0.084] : Building offers short term office space.\nRegion Description at [0.603, 0.074, 0.776, 0.120] : Office space as small as 2,500 sq. ft. available.\nRegion Description at [0.003, 0.008, 0.972, 0.356] : an office building is in the background.\n\nGlobal Caption:\nA red traffic light with a sad face drawn over it.\nA street scene with a close of of a stop light.\nA red stoplight with a street in the background.\nA stop sign gives traffic a frown face.\nThe sign is now at a red light."}
+{"question_id": 40, "image": "000000408120.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : alley at [0.052, 0.261, 0.948, 0.997].\nObject 1 : bars at [0.050, 0.000, 0.400, 0.682].\nObject 2 : black tire at [0.500, 0.219, 0.522, 0.249].\nObject 3 : brick at [0.784, 0.105, 0.818, 0.144].\nObject 4 : bricks at [0.926, 0.165, 0.946, 0.195].\nObject 5 : building at [0.742, 0.000, 0.954, 0.796].\nObject 6 : car at [0.418, 0.168, 0.526, 0.240].\nObject 7 : concrete at [0.394, 0.565, 0.570, 0.718].\nObject 8 : corner at [0.850, 0.934, 0.950, 1.000].\nObject 9 : curb at [0.050, 0.264, 0.396, 0.868].\nObject 10 : fence at [0.686, 0.252, 0.826, 0.565].\nObject 11 : flower at [0.580, 0.078, 0.608, 0.123].\nObject 12 : flowers at [0.598, 0.072, 0.634, 0.105].\nObject 13 : girl at [0.444, 0.249, 0.500, 0.480].\nObject 14 : photo at [0.044, 0.000, 0.956, 0.997].\nObject 15 : plants at [0.040, 0.324, 0.224, 0.685].\nObject 16 : polka dot at [0.430, 0.231, 0.450, 0.261].\nObject 17 : road at [0.048, 0.243, 0.954, 0.994].\nObject 18 : shirt at [0.456, 0.279, 0.496, 0.390].\nObject 19 : shoe at [0.484, 0.441, 0.496, 0.459].\nObject 20 : shoe at [0.452, 0.459, 0.470, 0.489].\nObject 21 : umbrella at [0.404, 0.189, 0.528, 0.297].\nObject 22 : wall at [0.738, 0.003, 0.950, 0.760].\nObject 23 : wall window at [0.524, 0.000, 0.538, 0.060].\nObject 24 : window at [0.570, 0.003, 0.586, 0.051].\nObject 25 : window at [0.524, 0.102, 0.538, 0.150].\n\nRelationships:\nobject 13 : girl -> with -> object 19 : shoe.\nobject 13 : girl -> with -> object 20 : shoe.\nobject 13 : girl -> with -> object 18 : shirt.\nobject 4 : bricks -> on -> object 5 : building.\nobject 15 : plants -> are near -> object 0 : alley.\nobject 6 : car -> on -> object 17 : road.\nobject 8 : corner -> of an -> object 0 : alley.\nobject 15 : plants -> in front of -> object 14 : photo.\nobject 21 : umbrella -> on -> object 13 : girl.\nobject 9 : curb -> built alongside -> object 17 : road.\n\nRegion Description:\nRegion Description at [0.038, 0.426, 0.162, 0.526] : patch of green plants in front of photo.\nRegion Description at [0.586, 0.060, 0.678, 0.138] : purple flowers inside of bush on right.\n\nGlobal Caption:\nA little girl that is standing with an umbrella.\nA little girl walking down a driveway carrying a pink umbrella.\nA LITTLE GIRL DRESSED IN PINK ALSO HAS A PINK UMBRELLA\nA small girl is holding an umbrella over her head\nA young girl carries and open unbrella while walking down an alley."}
diff --git a/ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl b/ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl
new file mode 100644
index 0000000..d4200c5
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl
@@ -0,0 +1,40 @@
+{"question_id": 0, "image": "000000130566.jpg", "category": "refer_reason", "text": "What might be the purpose of the object [0.850, 0.523, 0.898, 0.583] on the train?"}
+{"question_id": 1, "image": "000000010764.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.546, 0.625, 0.626, 0.801]?"}
+{"question_id": 2, "image": "000000184324.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.478, 0.464, 0.492, 0.491]?"}
+{"question_id": 3, "image": "000000452122.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.450, 0.592, 0.600, 0.643]?"}
+{"question_id": 4, "image": "000000032334.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.350, 0.325, 0.516, 0.373]?"}
+{"question_id": 5, "image": "000000360960.jpg", "category": "refer_reason", "text": "What might the man [0.850, 0.156, 1.000, 0.652] do next based on the current scene?"}
+{"question_id": 7, "image": "000000376322.jpg", "category": "refer_reason", "text": "What is the likely occasion for the people [0.000, 0.428, 0.997, 0.998]?"}
+{"question_id": 8, "image": "000000271402.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.462, 0.480, 0.713, 0.840]?"}
+{"question_id": 9, "image": "000000356424.jpg", "category": "refer_reason", "text": "What might be the purpose of the object [0.419, 0.134, 0.509, 0.184]?"}
+{"question_id": 10, "image": "000000131138.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.872, 0.556, 0.993, 0.634] in this setting?"}
+{"question_id": 11, "image": "000000332318.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.796, 0.910, 0.894, 0.997] in this rural setting?"}
+{"question_id": 12, "image": "000000513567.jpg", "category": "refer_reason", "text": "What could be the potential reason for the girl [0.682, 0.229, 0.742, 0.315] to have her mouth open?"}
+{"question_id": 13, "image": "000000134722.jpg", "category": "refer_reason", "text": "What's the purpose of the object [0.348, 0.499, 0.410, 0.584]?"}
+{"question_id": 14, "image": "000000341058.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.548, 0.180, 0.779, 0.344]?"}
+{"question_id": 15, "image": "000000277051.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.080, 0.003, 0.296, 0.721]?"}
+{"question_id": 16, "image": "000000376900.jpg", "category": "refer_reason", "text": "What is the object [0.235, 0.578, 0.304, 0.664] used for?"}
+{"question_id": 17, "image": "000000412240.jpg", "category": "refer_reason", "text": "What does the region [0.646, 0.869, 0.824, 0.923] likely represent in this image?"}
+{"question_id": 18, "image": "000000179765.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.626, 0.501, 0.698, 0.680] on the bike?"}
+{"question_id": 19, "image": "000000329219.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.056, 0.214, 0.140, 0.277]?"}
+{"question_id": 20, "image": "000000184384.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.454, 0.024, 0.638, 0.288] on the cake?"}
+{"question_id": 21, "image": "000000018519.jpg", "category": "refer_reason", "text": "What is the use of the object [0.279, 0.524, 0.341, 0.570]?"}
+{"question_id": 22, "image": "000000415748.jpg", "category": "refer_reason", "text": "Can you tell me what is unusual about the object [0.462, 0.670, 0.489, 0.692]?"}
+{"question_id": 23, "image": "000000543300.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.414, 0.691, 0.662, 0.725]?"}
+{"question_id": 24, "image": "000000349184.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.458, 0.488, 0.605, 0.694]?"}
+{"question_id": 25, "image": "000000042070.jpg", "category": "refer_reason", "text": "Can you tell what might be the purpose of the object [0.258, 0.055, 0.770, 0.168]?"}
+{"question_id": 26, "image": "000000241668.jpg", "category": "refer_reason", "text": "What can be inferred from the object [0.786, 0.780, 0.794, 0.796]?"}
+{"question_id": 27, "image": "000000535578.jpg", "category": "refer_reason", "text": "What purpose does the object [0.000, 0.072, 0.760, 0.160] serve in relation to the sheep?"}
+{"question_id": 28, "image": "000000484415.jpg", "category": "refer_reason", "text": "What is the function of the object [0.681, 0.208, 0.878, 0.500]?"}
+{"question_id": 29, "image": "000000491090.jpg", "category": "refer_reason", "text": "What might be the function of the object [0.663, 0.568, 0.733, 0.634]?"}
+{"question_id": 30, "image": "000000276018.jpg", "category": "refer_reason", "text": "What is the purpose of the item [0.757, 0.030, 0.889, 0.078]?"}
+{"question_id": 31, "image": "000000361551.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.784, 0.544, 0.824, 0.568]?"}
+{"question_id": 32, "image": "000000562207.jpg", "category": "refer_reason", "text": "What is the object [0.268, 0.744, 0.322, 0.828] used for?"}
+{"question_id": 33, "image": "000000553990.jpg", "category": "refer_reason", "text": "What is the function of the object [0.474, 0.246, 0.678, 0.432]?"}
+{"question_id": 34, "image": "000000106048.jpg", "category": "refer_reason", "text": "What could be the purpose of the text found in the region [0.568, 0.524, 0.770, 0.599]?"}
+{"question_id": 35, "image": "000000421923.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.838, 0.774, 0.994, 0.974]?"}
+{"question_id": 36, "image": "000000273493.jpg", "category": "refer_reason", "text": "What is the function of the object [0.916, 0.405, 0.934, 0.438]?"}
+{"question_id": 37, "image": "000000475150.jpg", "category": "refer_reason", "text": "What is the pattern on the object within the region [0.616, 0.565, 0.956, 0.958] and what does it indicate about the object?"}
+{"question_id": 38, "image": "000000125472.jpg", "category": "refer_reason", "text": "When was the object [0.012, 0.746, 0.664, 0.886] popularly invented?"}
+{"question_id": 39, "image": "000000069138.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.621, 0.082, 0.772, 0.132]?"}
+{"question_id": 40, "image": "000000408120.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.404, 0.189, 0.528, 0.297] in this scene?"}
diff --git a/ferret/eval/ferret_gpt4_data/rule.json b/ferret/eval/ferret_gpt4_data/rule.json
new file mode 100644
index 0000000..7294372
--- /dev/null
+++ b/ferret/eval/ferret_gpt4_data/rule.json
@@ -0,0 +1,5 @@
+{
+ "refer_desc": {"role": "Assistant", "prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. The user asks the question about specific region of an image. For your reference, the visual content in the image is represented with five descriptive sentences describing the same image. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y. Also, the relationships between pairs of objects are provided, in the format of object -> relationship -> subject, where the object/subject are indexed by object id from previous object lists as well as the object names. Also, several region description are given, each describing a box region of image, with detailed coordinates. \nPlease rate the spatial correspondence, helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space.\nIn the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."},
+ "refer_reason": {"role": "Assistant", "prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. The user asks the question about specific region of an image. For your reference, the visual content in the image is represented with five descriptive sentences describing the same image. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y. Also, the relationships between pairs of objects are provided, in the format of object -> relationship -> subject, where the object/subject are indexed by object id from previous object lists as well as the object names. Also, several region description are given, each describing a box region of image, with detailed coordinates. \nPlease rate the spatial correspondence, helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space.\nIn the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."},
+ "ground_conv": {"role": "Assistant", "prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. The user asks the question that requires model to predict the coordinates of relevant object. For your reference, the visual content in the image is represented with five descriptive sentences describing the same image. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y. Also, the relationships between pairs of objects are provided, in the format of object -> relationship -> subject, where the object/subject are indexed by object id from previous object lists as well as the object names. Also, several region description are given, each describing a box region of image, with detailed coordinates. \nPlease rate the predicted coordinates, helpfulness, relevance, accuracy, level of details of their responses. Specifically, pay your attention to the precision of the coordinates and whether it matches the object. Small deviation (<20% of ground-truth box width or height) of coordinates is allowed and shouldn't be punished. More than that, the degree of deviation should be reflected in scoring too. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space.\nIn the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."}
+}
\ No newline at end of file
diff --git a/ferret/eval/gpt4_eval_script.sh b/ferret/eval/gpt4_eval_script.sh
new file mode 100644
index 0000000..7700c7b
--- /dev/null
+++ b/ferret/eval/gpt4_eval_script.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+
+CHECKPOINT_FILE='ferret_ft/final-checkpoint'
+
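+# The script runs in three stages: (1) Ferret inference on the three Ferret-Bench
+# categories in parallel (one GPU per category), (2) GPT-4 review of Ferret's answers
+# against the provided reference answers using the prompts in rule.json, and
+# (3) aggregation of the reviews into the final scores.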
+CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_gpt4eval_3newclass --add_region_feature \
+ --model-path checkpoints/${CHECKPOINT_FILE} \
+ --data_path ferret/eval/ferret_gpt4_data/refer_desc/question.jsonl \
+ --answers-file gpt4_result/${CHECKPOINT_FILE}/refer_desc &
+CUDA_VISIBLE_DEVICES=1 python -m ferret.eval.model_gpt4eval_3newclass --add_region_feature \
+ --model-path checkpoints/${CHECKPOINT_FILE} \
+ --data_path ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl \
+ --answers-file gpt4_result/${CHECKPOINT_FILE}/ground_conv &
+CUDA_VISIBLE_DEVICES=2 python -m ferret.eval.model_gpt4eval_3newclass --add_region_feature \
+ --model-path checkpoints/${CHECKPOINT_FILE} \
+ --data_path ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl \
+ --answers-file gpt4_result/${CHECKPOINT_FILE}/refer_reason &
+
+wait
+echo "Finish Inference."
+
+OPENAI_API_KEY="xxx" python ferret/eval/eval_gpt_review_3newclass.py \
+ --question ferret/eval/ferret_gpt4_data/refer_desc/question.jsonl \
+ --context ferret/eval/ferret_gpt4_data/refer_desc/context.jsonl \
+ --answer-list \
+ ferret/eval/ferret_gpt4_data/refer_desc/answer.jsonl \
+ gpt4_result/${CHECKPOINT_FILE}/refer_desc/ferret_answer.jsonl \
+ --rule ferret/eval/ferret_gpt4_data/rule.json \
+ --output gpt4_result/${CHECKPOINT_FILE}/review_refer_desc.jsonl &
+OPENAI_API_KEY="xxx" python ferret/eval/eval_gpt_review_3newclass.py \
+ --question ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl \
+ --context ferret/eval/ferret_gpt4_data/ground_conv/context.jsonl \
+ --answer-list \
+ ferret/eval/ferret_gpt4_data/ground_conv/answer.jsonl \
+ gpt4_result/${CHECKPOINT_FILE}/ground_conv/ferret_answer.jsonl \
+ --rule ferret/eval/ferret_gpt4_data/rule.json \
+ --output gpt4_result/${CHECKPOINT_FILE}/review_ground_conv.jsonl &
+OPENAI_API_KEY="xxx" python ferret/eval/eval_gpt_review_3newclass.py \
+ --question ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl \
+ --context ferret/eval/ferret_gpt4_data/refer_reason/context.jsonl \
+ --answer-list \
+ ferret/eval/ferret_gpt4_data/refer_reason/answer.jsonl \
+ gpt4_result/${CHECKPOINT_FILE}/refer_reason/ferret_answer.jsonl \
+ --rule ferret/eval/ferret_gpt4_data/rule.json \
+ --output gpt4_result/${CHECKPOINT_FILE}/review_refer_reason.jsonl &
+
+wait
+echo "Finish Review."
+
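+# As requested in rule.json, each review contains two space-separated scores (one per
+# answer in --answer-list) followed by an explanation; summarize_gpt_review.py
+# aggregates these reviews into per-category scores.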
+echo "Gather final score."
+echo $CHECKPOINT_FILE
+python ferret/eval/summarize_gpt_review.py \
+ --dir=gpt4_result/${CHECKPOINT_FILE}
\ No newline at end of file
diff --git a/ferret/eval/model_flickr.py b/ferret/eval/model_flickr.py
new file mode 100644
index 0000000..99275e0
--- /dev/null
+++ b/ferret/eval/model_flickr.py
@@ -0,0 +1,260 @@
+"""
+Usage:
+
+--data_path: path of flickr30k annotation.
+--image_path: path of flickr30k test images.
+--answers-file: path of output result.
+
+Example:
+CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_flickr \
+ --model-path checkpoints/ferret_13b/checkpoint-final \
+ --image_path data/flickr30k/flickr30k_images_split/test \
+ --data_path data/annotations/final_flickr_mergedGT_test.json \
+ --answers-file flickr_result/test_answer \
+ --add_region_feature \
+ --chunk-idx 0 \
+ --num-chunks 1
+
+"""
+
+import argparse
+from typing import Any, Tuple
+import torch
+import os
+import json
+from tqdm import tqdm
+
+# Added
+from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
+from ferret.model.builder import load_pretrained_model
+from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+from ferret.conversation import conv_templates, SeparatorStyle
+from ferret.utils import disable_torch_init
+from PIL import Image
+import re
+import math
+import torchvision
+import numpy as np
+from copy import deepcopy
+
+# Added for visualization
+from PIL import Image, ImageDraw, ImageFont
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+
+def split_list(lst, n):
+ """Split a list into n (roughly) equal-sized chunks"""
+ chunk_size = math.ceil(len(lst) / n)  # ceiling division
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
+
+
+def get_chunk(lst, n, k):
+ chunks = split_list(lst, n)
+ return chunks[k]
+
+
+def plot_flickr(img, boxes, entities, mode='pred'):
+ if mode == "gt":
+ color = "green"
+ elif mode == "pred":
+ color = "blue"
+ draw = ImageDraw.Draw(img)
+ fnt = ImageFont.load_default()
+ for box, tk in zip(boxes, entities):
+ draw.rectangle([box[0], box[1], box[2], box[3]], outline=color)
+ draw.text((box[0], box[1]-5), f'{tk}', font=fnt, fill=color)
+ return img
+
+
+def remove_punctuation(text: str) -> str:
+ punct = [',',]
+ for p in punct:
+ text = text.replace(p, '')
+ return text.strip()
+
+
+def resize_bbox(box, image_w=None, image_h=None):
+ ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
+ ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
+
+ new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
+ int(box[2] * ratio_w), int(box[3] * ratio_h)]
+ return new_box
+
+
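+# find_bbox_template parses grounded responses of the form "<entity> [x1, y1, x2, y2]".
+# Illustrative example (hypothetical response): "a man [144, 360, 246, 736] holding a
+# racket [214, 574, 238, 619]" yields entities ["a man", "holding a racket"] and boxes
+# rescaled from the 1000x1000 text-coordinate space back to pixel coordinates.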
+def find_bbox_template(text, img_w, img_h):
+ entities = []
+ boxes = []
+
+ # Regular expression pattern to match entities and boxes
+ pattern = r'([a-zA-Z\s]+)\s+(\[[\d,\s]+\])'
+
+ # Find all matches in the text
+ matches = re.findall(pattern, text)
+
+ for entity, box_str in matches:
+ # Append the entity to the entities list
+ entities.append(entity.strip())
+
+ # Convert the box string to a list of integers and append to the boxes list
+ box = list(map(int, box_str.strip('[]').split(',')))
+ resized_box = resize_bbox(box, img_w, img_h)
+ boxes.append(resized_box)
+
+ return entities, boxes
+
+
+class FlickrGrounding(torchvision.datasets.CocoDetection):
+ def __init__(self, img_folder, ann_file, transforms):
+ super(FlickrGrounding, self).__init__(img_folder, ann_file)
+ self._transforms = transforms
+ # The "<obj>" placeholder is replaced with the comma-separated query entities in __getitem__.
+ self.question_prompt = "What are the locations of <obj>?"
+
+ def __getitem__(self, idx):
+ img, target = super(FlickrGrounding, self).__getitem__(idx)
+ image_id = self.ids[idx]
+ coco_img = self.coco.loadImgs(image_id)[0]
+ file_name = coco_img["file_name"]
+ caption = coco_img["caption"]
+ positive_item_pos = coco_img['tokens_positive_eval']
+ dataset_name = coco_img["dataset_name"] if "dataset_name" in coco_img else None
+ w, h = img.size
+
+ # token_positive = []
+ bboxes = []
+ entities = []
+ for anno in target:
+ bbox_xywh = anno["bbox"]
+ bbox_xyxy = np.array([bbox_xywh[0], bbox_xywh[1], bbox_xywh[0] + bbox_xywh[2], bbox_xywh[1] + bbox_xywh[3]])
+ bbox_xyxy[0::2] = bbox_xyxy[0::2].clip(min=0, max=w)
+ bbox_xyxy[1::2] = bbox_xyxy[1::2].clip(min=0, max=h)
+ bboxes.append(bbox_xyxy.tolist())
+ # tokens_positive = anno["tokens_positive"]
+ # token_positive.append(tokens_positive)
+
+ entities = [remove_punctuation(caption[t[0][0]:t[0][1]].lower()) for t in positive_item_pos]
+ obj_caption = ", ".join(entities)
+ assert "<obj>" in self.question_prompt
+ question = self.question_prompt.replace("<obj>", obj_caption)
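+ # Illustrative example (hypothetical sample): for the caption "Two men playing tennis
+ # on a court" with entity spans "two men" and "a court", the question becomes
+ # "What are the locations of two men, a court?".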
+
+ target = {"image_id": image_id, "file_name": file_name, "annotations": target,
+ "caption": caption, "img_w": w, "img_h": h, "question": question, "bboxes": bboxes, "entities": entities}
+ if self._transforms is not None:
+ img, target = self._transforms(img, target)
+ target["dataset_name"] = dataset_name
+ for extra_key in ["sentence_id", "original_img_id", "original_id", "task_id"]:
+ if extra_key in coco_img:
+ target[extra_key] = coco_img[extra_key]
+ return img, target
+
+
+def eval_model_flickr(args):
+ # Data
+ dataset = FlickrGrounding(img_folder=args.image_path,
+ ann_file=args.data_path,
+ transforms=None,
+ )
+ data_ids = range(len(dataset))
+
+ # Model
+ disable_torch_init()
+ model_path = os.path.expanduser(args.model_path)
+ model_name = get_model_name_from_path(model_path)
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
+
+ chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
+ answers_file = os.path.expanduser(args.answers_file)
+ os.makedirs(answers_file, exist_ok=True)
+ answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
+ ans_file = open(answers_file, "w")
+
+ for i, id in enumerate(tqdm(chunk_data_ids)):
+ img, ann = dataset[id]
+ qs = ann["question"]
+ cur_prompt = qs
+
+ # Plot GTs
+ # img = plot_flickr(img, ann["bboxes"], ann["entities"], mode="gt")
+
+ if model.config.mm_use_im_start_end:
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
+ else:
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
+
+ conv = conv_templates[args.conv_mode].copy()
+ conv.append_message(conv.roles[0], qs)
+ conv.append_message(conv.roles[1], None)
+ prompt = conv.get_prompt()
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
+ img_w, img_h = ann["img_w"], ann["img_h"]
+ image_tensor = image_processor.preprocess(img, return_tensors='pt', do_resize=True,
+ do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
+ keywords = [stop_str]
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+
+ with torch.inference_mode():
+ output_ids = model.generate(
+ input_ids,
+ images=image_tensor.unsqueeze(0).half().cuda(),
+ do_sample=True,
+ temperature=args.temperature,
+ top_p=args.top_p,
+ num_beams=args.num_beams,
+ # no_repeat_ngram_size=3,
+ max_new_tokens=1024,
+ use_cache=True,
+ stopping_criteria=[stopping_criteria],
+ )
+
+ input_token_len = input_ids.shape[1]
+ n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
+ if n_diff_input_output > 0:
+ print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
+ outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
+ outputs = outputs.strip()
+ if outputs.endswith(stop_str):
+ outputs = outputs[:-len(stop_str)]
+ outputs = outputs.strip()
+
+ # Plot Preds
+ # pred_entities, pred_bboxes = find_bbox_template(outputs, img_w=img_w, img_h=img_h)
+ # img = plot_flickr(img, pred_bboxes, pred_entities, mode="pred")
+ # img.save('flickr_result/images/{}.png'.format(i))
+
+ ans_file.write(json.dumps({"image_id": ann['original_img_id'],
+ "sentence_id": ann['sentence_id'],
+ "file_name": ann["file_name"],
+ "prompt": cur_prompt,
+ "text": outputs,
+ "width": ann['img_w'],
+ "height": ann['img_h'],
+ }) + "\n")
+ ans_file.flush()
+ ans_file.close()
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
+ parser.add_argument("--model-base", type=str, default=None)
+ parser.add_argument("--image_path", type=str, default="data/flickr30k/flickr30k_images_split/test")
+ parser.add_argument("--data_path", type=str, default="data/annotations/final_flickr_separateGT_test.json")
+ parser.add_argument("--answers-file", type=str, default="flickr_result/test_answer.jsonl")
+ parser.add_argument("--conv-mode", type=str, default="ferret_v1")
+ parser.add_argument("--num-chunks", type=int, default=1)
+ parser.add_argument("--chunk-idx", type=int, default=0)
+ parser.add_argument("--image_w", type=int, default=336) # 224
+ parser.add_argument("--image_h", type=int, default=336) # 224
+ parser.add_argument("--add_region_feature", action="store_true")
+ parser.add_argument("--temperature", type=float, default=0.001)
+ parser.add_argument("--top_p", type=float, default=None)
+ parser.add_argument("--num_beams", type=int, default=1)
+ args = parser.parse_args()
+
+ eval_model_flickr(args)
\ No newline at end of file
diff --git a/ferret/eval/model_gpt4eval_3newclass.py b/ferret/eval/model_gpt4eval_3newclass.py
new file mode 100644
index 0000000..b7cccf5
--- /dev/null
+++ b/ferret/eval/model_gpt4eval_3newclass.py
@@ -0,0 +1,274 @@
+"""
+Usage:
+
+--data_path: path of the Ferret-Bench question file (question.jsonl).
+--image_path: path of COCO val2017 images.
+--answers-file: output directory; the result is written to ferret_answer.jsonl inside it.
+
+Example:
+CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_gpt4eval_3newclass \
+ --model-path checkpoints/ferret_13b \
+ --data_path ferret/eval/ferret_gpt4_data/refer_desc/question.jsonl \
+ --answers-file gpt4_result/refer_desc/ferret_ft_clipL336_vicunaV1-3-13b_3Ep --add_region_feature
+
+"""
+
+import argparse
+import torch
+import os
+import json
+from tqdm import tqdm
+
+# Added
+from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
+from ferret.model.builder import load_pretrained_model
+from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+from ferret.conversation import conv_templates, SeparatorStyle
+from ferret.utils import disable_torch_init
+from PIL import Image
+import math
+import pdb
+import numpy as np
+from copy import deepcopy
+from functools import partial
+import re
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+def split_list(lst, n):
+ """Split a list into n (roughly) equal-sized chunks"""
+ chunk_size = math.ceil(len(lst) / n)  # ceiling division
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
+
+
+def get_chunk(lst, n, k):
+ chunks = split_list(lst, n)
+ return chunks[k]
+
+
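+# generate_mask_for_feature builds the binary mask that is later passed to the model as
+# region_masks: a 2-value coordinate is treated as a point (an 11x11 window around it is
+# set to 1), a 4-value coordinate as a box (the whole box is set to 1); an optional
+# free-form mask, if given, is intersected with the result.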
+def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
+ if mask is not None:
+ assert mask.shape[0] == raw_w and mask.shape[1] == raw_h
+ coor_mask = np.zeros((raw_w, raw_h))
+ # Assume it samples a point.
+ if len(coor) == 2:
+ # Define window size
+ span = 5
+ # Make sure the window does not exceed array bounds
+ x_min = max(0, coor[0] - span)
+ x_max = min(raw_w, coor[0] + span + 1)
+ y_min = max(0, coor[1] - span)
+ y_max = min(raw_h, coor[1] + span + 1)
+ coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
+ assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
+ elif len(coor) == 4:
+ # Box input or Sketch input.
+ coor_mask[coor[0]:coor[2]+1, coor[1]:coor[3]+1] = 1
+ if mask is not None:
+ coor_mask = coor_mask * mask
+ coor_mask = torch.from_numpy(coor_mask)
+ try:
+ assert len(coor_mask.nonzero()) != 0
+ except:
+ pdb.set_trace()
+ return coor_mask
+
+class GPTEval_Data():
+ def __init__(self, data_path, image_path, args) -> None:
+ datas = [json.loads(q) for q in open(os.path.expanduser(args.data_path), "r")]
+
+ for idx, i in enumerate(datas):
+ i['image_path'] = os.path.join(image_path, i['image'].split('/')[-1])
+ img_raw_w, img_raw_h = Image.open(i['image_path']).size
+ pattern = r'\[.*?\]'
+ matches = re.findall(pattern, i['text'])
+ question = i['text']
+ masks = []
+ for match in matches:
+ coor_cur = match.replace('[', '')
+ coor_cur = coor_cur.replace(']', '')
+ coor_cur = coor_cur.split(',')
+ coor_cur = [float(i.strip()) for i in coor_cur]
+ raw_box_coor = [int(coor_cur[0] * img_raw_w), int(coor_cur[1] * img_raw_h), int(coor_cur[2] * img_raw_w), int(coor_cur[3] * img_raw_h)]
+ converted_box_coor = [int(coor_cur[0] * VOCAB_IMAGE_W), int(coor_cur[1] * VOCAB_IMAGE_H), int(coor_cur[2] * VOCAB_IMAGE_W), int(coor_cur[3] * VOCAB_IMAGE_H)]
+ if args.add_region_feature:
+ question = question.replace(match, f'[{converted_box_coor[0]}, {converted_box_coor[1]}, {converted_box_coor[2]}, {converted_box_coor[3]}] {DEFAULT_REGION_FEA_TOKEN}')
+ generated_mask = generate_mask_for_feature(raw_box_coor, raw_w=img_raw_w, raw_h=img_raw_h, mask=None)
+ masks.append(generated_mask)
+ else:
+ question = question.replace(match, f'[{converted_box_coor[0]}, {converted_box_coor[1]}, {converted_box_coor[2]}, {converted_box_coor[3]}]')
+
+ # pdb.set_trace()
+ if args.add_region_feature:
+ i['region_masks'] = masks
+ else:
+ i['region_masks'] = None
+ i['question'] = question
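+ # Illustrative trace (hypothetical box): a question containing "[0.250, 0.400, 0.750, 0.800]"
+ # becomes, with --add_region_feature, "... [250, 400, 750, 800] <region_fea> ..." and a
+ # binary mask over the corresponding pixel box is appended to region_masks.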
+ # obj_list = [json.loads(line) for line in tqdm(open(data_path))]
+ # question_prompt = "Is the object in of the image a or a ?"
+
+ # for idx, i in enumerate(obj_list):
+ # i['image_path'] = os.path.join(image_path, i['image_path'].split('/')[-1])
+ # ratio_w = VOCAB_IMAGE_W * 1.0 / i['width']
+ # ratio_h = VOCAB_IMAGE_H * 1.0 / i['height']
+
+ # point_x_textvocab = int(i['sample_point'][0]*ratio_w)
+ # point_y_textvocab = int(i['sample_point'][1]*ratio_h)
+
+ # box_x1 = int(i['bbox_norm'][0]*i['width'])
+ # box_y1 = int(i['bbox_norm'][1]*i['height'])
+ # box_x2 = int(i['bbox_norm'][2]*i['width'])
+ # box_y2 = int(i['bbox_norm'][3]*i['height'])
+
+ # box_x1_textvocab = int(i['bbox_norm'][0]*VOCAB_IMAGE_W)
+ # box_y1_textvocab = int(i['bbox_norm'][1]*VOCAB_IMAGE_H)
+ # box_x2_textvocab = int(i['bbox_norm'][2]*VOCAB_IMAGE_W)
+ # box_y2_textvocab = int(i['bbox_norm'][3]*VOCAB_IMAGE_H)
+
+ # if args.region_format == 'point':
+ # region_coordinate_raw = [i['sample_point'][0], i['sample_point'][1]]
+ # i['question'] = question_prompt.replace('', '[{}, {}]'.format(point_x_textvocab, point_y_textvocab))
+ # segment_mask = None
+ # elif args.region_format == 'box' or args.region_format == 'segment':
+ # region_coordinate_raw = [box_x1, box_y1, box_x2, box_y2]
+ # i['question'] = question_prompt.replace('', '[{}, {}, {}, {}]'.format(box_x1_textvocab, box_y1_textvocab, box_x2_textvocab, box_y2_textvocab))
+ # if args.region_format == 'segment':
+ # segment_mask = np.array(i['segment_mask'])
+ # else:
+ # segment_mask = None
+ # else:
+ # raise NotImplementedError(f'{args.region_format} is not supported.')
+
+ # if args.add_region_feature:
+ # i['question'] = i['question'].replace('of the image', f'{DEFAULT_REGION_FEA_TOKEN} of the image')
+ # generated_mask = generate_mask_for_feature(region_coordinate_raw, raw_w=i['width'], raw_h=i['height'], mask=segment_mask)
+ # i['region_masks'] = [generated_mask]
+ # else:
+ # i['region_masks'] = None
+
+ # if idx % 2 == 0:
+ # i['question'] = i['question'].replace('', i['name'])
+ # i['question'] = i['question'].replace('', i['neg_class'])
+ # else:
+ # i['question'] = i['question'].replace('', i['name'])
+ # i['question'] = i['question'].replace('', i['neg_class'])
+
+ self.datas = datas
+ self._ids = range(len(self.datas))
+ # pdb.set_trace()
+
+ @property
+ def ids(self):
+ return deepcopy(self._ids)
+
+ def fetch_data(self, id):
+ ann = self.datas[id]
+ return ann
+
+
+def eval_model(args):
+ # Data
+ dataset = GPTEval_Data(data_path=args.data_path, image_path=args.image_path, args=args)
+ data_ids = dataset.ids
+
+ # Model
+ disable_torch_init()
+ model_path = os.path.expanduser(args.model_path)
+ model_name = get_model_name_from_path(model_path)
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
+
+ # chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
+ answers_file = os.path.expanduser(args.answers_file)
+ os.makedirs(answers_file, exist_ok=True)
+ # answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
+ answers_file = os.path.join(answers_file, f'ferret_answer.jsonl')
+ ans_file = open(answers_file, "w")
+
+ for i, id in enumerate(tqdm(data_ids)):
+ ann = dataset.fetch_data(id)
+ image_path = ann['image_path']
+
+ qs = ann['question']
+ cur_prompt = qs
+ # pdb.set_trace()
+ if model.config.mm_use_im_start_end:
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
+ else:
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
+
+ conv = conv_templates[args.conv_mode].copy()
+ conv.append_message(conv.roles[0], qs)
+ conv.append_message(conv.roles[1], None)
+ prompt = conv.get_prompt()
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
+ image = Image.open(image_path).convert('RGB')
+ image_tensor = image_processor.preprocess(image, return_tensors='pt', do_resize=True,
+ do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
+
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
+ keywords = [stop_str]
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+
+ region_masks = ann['region_masks']
+ if region_masks is not None:
+ region_masks = [[region_mask_i.cuda().half() for region_mask_i in region_masks]]
+ else:
+ region_masks = None
+
+ # pdb.set_trace()
+ with torch.inference_mode():
+ model.orig_forward = model.forward
+ model.forward = partial(
+ model.orig_forward,
+ region_masks=region_masks
+ )
+ output_ids = model.generate(
+ input_ids,
+ images=image_tensor.unsqueeze(0).half().cuda(),
+ do_sample=True,
+ temperature=args.temperature,
+ top_p=args.top_p,
+ num_beams=args.num_beams,
+ # no_repeat_ngram_size=3,
+ max_new_tokens=1024,
+ use_cache=True,
+ stopping_criteria=[stopping_criteria])
+ model.forward = model.orig_forward
+
+ input_token_len = input_ids.shape[1]
+ n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
+ if n_diff_input_output > 0:
+ print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
+ outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
+ outputs = outputs.strip()
+ if outputs.endswith(stop_str):
+ outputs = outputs[:-len(stop_str)]
+ outputs = outputs.strip()
+
+ # pdb.set_trace()
+ ans_file.write(json.dumps({"question_id": ann['question_id'],
+ "image_path":image_path,
+ "prompt": cur_prompt,
+ "text": outputs
+ }) + "\n")
+ ans_file.flush()
+ ans_file.close()
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
+ parser.add_argument("--model-base", type=str, default=None)
+ parser.add_argument("--image_path", type=str, default="dataset/cocoval2017")
+ parser.add_argument("--data_path", type=str, default="dataset/lvis/lvis_v1_minival_inserted_image_name.json")
+ parser.add_argument("--answers-file", type=str, default="lvis_result/answer.jsonl")
+ parser.add_argument("--conv-mode", type=str, default="ferret_v1")
+ parser.add_argument("--num-chunks", type=int, default=1)
+ parser.add_argument("--chunk-idx", type=int, default=0)
+ parser.add_argument("--image_w", type=int, default=336) # 224
+ parser.add_argument("--image_h", type=int, default=336) # 224
+ parser.add_argument("--add_region_feature", action="store_true")
+ parser.add_argument("--temperature", type=float, default=0.001)
+ parser.add_argument("--top_p", type=float, default=None)
+ parser.add_argument("--num_beams", type=int, default=1)
+ parser.add_argument("--region_format", type=str, default="box", choices=["point", "box", "segment"])
+ args = parser.parse_args()
+
+ eval_model(args)
diff --git a/ferret/eval/model_lvis.py b/ferret/eval/model_lvis.py
new file mode 100644
index 0000000..167c949
--- /dev/null
+++ b/ferret/eval/model_lvis.py
@@ -0,0 +1,277 @@
+"""
+Usage:
+--data_path: path of LVIS eval annotation.
+--image_path: path of coco val 2017 images.
+--answers-file: path of output result.
+Change --region_format to evaluate different types of referring regions. Choices: ["point", "box", "free_shape"]
+
+If eval on free-form shape:
+
+CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_lvis \
+ --model-path checkpoints/ferret_13b/final-checkpoint \
+ --data_path dataset/lvis/lvis_eval.jsonl \
+ --image_path dataset/cocoval2017 \
+ --answers-file lvis_result/ferret_13b_freeshape \
+ --add_region_feature \
+ --chunk-idx 0 \
+ --num-chunks 1 \
+ --region_format free_shape
+"""
+
+import argparse
+import torch
+import os
+import json
+from tqdm import tqdm
+
+# Added
+from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
+from ferret.model.builder import load_pretrained_model
+from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+from ferret.conversation import conv_templates, SeparatorStyle
+from ferret.utils import disable_torch_init
+from PIL import Image
+import math
+import pdb
+import numpy as np
+from copy import deepcopy
+from functools import partial
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+def split_list(lst, n):
+ """Split a list into n (roughly) equal-sized chunks"""
+ chunk_size = math.ceil(len(lst) / n)  # ceiling division
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
+
+
+def get_chunk(lst, n, k):
+ chunks = split_list(lst, n)
+ return chunks[k]
+
+
+def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
+ if mask is not None:
+ assert mask.shape[0] == raw_w and mask.shape[1] == raw_h
+ coor_mask = np.zeros((raw_w, raw_h))
+ # Assume it samples a point.
+ if len(coor) == 2:
+ # Define window size
+ span = 5
+ # Make sure the window does not exceed array bounds
+ x_min = max(0, coor[0] - span)
+ x_max = min(raw_w, coor[0] + span + 1)
+ y_min = max(0, coor[1] - span)
+ y_max = min(raw_h, coor[1] + span + 1)
+ coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
+ assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
+ elif len(coor) == 4:
+ # Box input or Sketch input.
+ coor_mask[coor[0]:coor[2]+1, coor[1]:coor[3]+1] = 1
+ if mask is not None:
+ coor_mask = coor_mask * mask
+ coor_mask = torch.from_numpy(coor_mask)
+ try:
+ assert len(coor_mask.nonzero()) != 0
+ except:
+ pdb.set_trace()
+ return coor_mask
+
+class LVISData_V1():
+ def __init__(self, data_path, image_path, args) -> None:
+ obj_list = [json.loads(line) for line in tqdm(open(data_path))]
+ # question_prompt = "Is the object in <location> of the image a <class0> or a <class1>?"
+ question_prompt = "Is the object <location> of the image a <class0> or a <class1>?"
+
+ for idx, i in enumerate(obj_list):
+ i['image_path'] = os.path.join(image_path, i['image_path'].split('/')[-1])
+ ratio_w = VOCAB_IMAGE_W * 1.0 / i['width']
+ ratio_h = VOCAB_IMAGE_H * 1.0 / i['height']
+
+ point_x_textvocab = int(i['sample_point'][0]*ratio_w)
+ point_y_textvocab = int(i['sample_point'][1]*ratio_h)
+
+ box_x1 = int(i['bbox_norm'][0]*i['width'])
+ box_y1 = int(i['bbox_norm'][1]*i['height'])
+ box_x2 = int(i['bbox_norm'][2]*i['width'])
+ box_y2 = int(i['bbox_norm'][3]*i['height'])
+
+ box_x1_textvocab = int(i['bbox_norm'][0]*VOCAB_IMAGE_W)
+ box_y1_textvocab = int(i['bbox_norm'][1]*VOCAB_IMAGE_H)
+ box_x2_textvocab = int(i['bbox_norm'][2]*VOCAB_IMAGE_W)
+ box_y2_textvocab = int(i['bbox_norm'][3]*VOCAB_IMAGE_H)
+
+ if args.region_format == 'point':
+ region_coordinate_raw = [i['sample_point'][0], i['sample_point'][1]]
+ if args.no_coor:
+ assert args.add_region_feature
+ i['question'] = question_prompt.replace('<location>', '')
+ else:
+ i['question'] = question_prompt.replace('<location>', '[{}, {}]'.format(point_x_textvocab, point_y_textvocab))
+ segment_mask = None
+
+ elif args.region_format == 'box' or args.region_format == 'free_shape':
+ if args.region_format == 'free_shape':
+ region_coordinate_raw = i['free_shape_bbox_raw']
+ box_x1_textvocab = int(i['free_shape_bbox_raw'][0]*ratio_w)
+ box_y1_textvocab = int(i['free_shape_bbox_raw'][1]*ratio_h)
+ box_x2_textvocab = int(i['free_shape_bbox_raw'][2]*ratio_w)
+ box_y2_textvocab = int(i['free_shape_bbox_raw'][3]*ratio_h)
+ else:
+ region_coordinate_raw = [box_x1, box_y1, box_x2, box_y2]
+
+ if args.no_coor:
+ assert args.add_region_feature
+ i['question'] = question_prompt.replace('<location>', '')
+ else:
+ i['question'] = question_prompt.replace('<location>', '[{}, {}, {}, {}]'.format(box_x1_textvocab, box_y1_textvocab, box_x2_textvocab, box_y2_textvocab))
+
+ if args.region_format == 'free_shape':
+ segment_mask = np.array(i['free_shape_segment_mask'])
+ else:
+ segment_mask = None
+ else:
+ raise NotImplementedError(f'{args.region_format} is not supported.')
+
+ if args.add_region_feature:
+ i['question'] = i['question'].replace('of the image', f'{DEFAULT_REGION_FEA_TOKEN} of the image')
+ generated_mask = generate_mask_for_feature(region_coordinate_raw, raw_w=i['width'], raw_h=i['height'], mask=segment_mask)
+ i['region_masks'] = [generated_mask]
+ else:
+ i['region_masks'] = None
+
+ # Alternate which of the ground-truth class and the negative class is named first in the question.
+ if idx % 2 == 0:
+ i['question'] = i['question'].replace('<class0>', i['name'])
+ i['question'] = i['question'].replace('<class1>', i['neg_class'])
+ else:
+ i['question'] = i['question'].replace('<class0>', i['neg_class'])
+ i['question'] = i['question'].replace('<class1>', i['name'])
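+ # Illustrative final question (hypothetical values): with --region_format box and
+ # --add_region_feature it reads
+ # "Is the object [412, 318, 655, 904] <region_fea> of the image a cat or a dog?".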
+
+ self.obj_list = obj_list
+ self._ids = range(len(self.obj_list))
+
+ @property
+ def ids(self):
+ return deepcopy(self._ids)
+
+ def fetch_data(self, id):
+ ann = self.obj_list[id]
+ return ann
+
+
+def eval_model(args):
+ # Data
+ dataset = LVISData_V1(data_path=args.data_path, image_path=args.image_path, args=args)
+ data_ids = dataset.ids
+
+ # Model
+ disable_torch_init()
+ model_path = os.path.expanduser(args.model_path)
+ model_name = get_model_name_from_path(model_path)
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
+
+ chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
+ answers_file = os.path.expanduser(args.answers_file)
+ os.makedirs(answers_file, exist_ok=True)
+ answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
+ ans_file = open(answers_file, "w")
+
+ for i, id in enumerate(tqdm(chunk_data_ids)):
+ ann = dataset.fetch_data(id)
+ image_path = ann['image_path']
+
+ qs = ann['question']
+ cur_prompt = qs
+ # pdb.set_trace()
+ if model.config.mm_use_im_start_end:
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
+ else:
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
+
+ conv = conv_templates[args.conv_mode].copy()
+ conv.append_message(conv.roles[0], qs)
+ conv.append_message(conv.roles[1], None)
+ prompt = conv.get_prompt()
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
+ image = Image.open(image_path).convert('RGB')
+ image_tensor = image_processor.preprocess(image, return_tensors='pt', do_resize=True,
+ do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
+
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
+ keywords = [stop_str]
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+
+ region_masks = ann['region_masks']
+ if region_masks is not None:
+ region_masks = [[region_mask_i.cuda().half() for region_mask_i in region_masks]]
+ else:
+ region_masks = None
+
+ # pdb.set_trace()
+ with torch.inference_mode():
+ model.orig_forward = model.forward
+ model.forward = partial(
+ model.orig_forward,
+ region_masks=region_masks
+ )
+ output_ids = model.generate(
+ input_ids,
+ images=image_tensor.unsqueeze(0).half().cuda(),
+ do_sample=True,
+ temperature=args.temperature,
+ top_p=args.top_p,
+ num_beams=args.num_beams,
+ # no_repeat_ngram_size=3,
+ max_new_tokens=1024,
+ use_cache=True,
+ stopping_criteria=[stopping_criteria])
+ model.forward = model.orig_forward
+
+ input_token_len = input_ids.shape[1]
+ n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
+ if n_diff_input_output > 0:
+ print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
+ outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
+ outputs = outputs.strip()
+ if outputs.endswith(stop_str):
+ outputs = outputs[:-len(stop_str)]
+ outputs = outputs.strip()
+
+ # pdb.set_trace()
+ ans_file.write(json.dumps({"id": ann['id'],
+ "image_path":image_path,
+ "prompt": cur_prompt,
+ "text": outputs,
+ "name":ann['name'],
+ "synonyms":ann['synonyms'],
+ "bbox":ann['bbox'],
+ "bbox_norm":ann['bbox_norm'],
+ "width": ann['width'],
+ "height": ann['height'],
+ }) + "\n")
+ ans_file.flush()
+ ans_file.close()
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
+ parser.add_argument("--model-base", type=str, default=None)
+ parser.add_argument("--image_path", type=str, default="dataset/cocoval2017")
+ parser.add_argument("--data_path", type=str, default="dataset/lvis/lvis_v1_minival_inserted_image_name.json")
+ parser.add_argument("--answers-file", type=str, default="lvis_result/answer.jsonl")
+ parser.add_argument("--conv-mode", type=str, default="ferret_v1")
+ parser.add_argument("--num-chunks", type=int, default=1)
+ parser.add_argument("--chunk-idx", type=int, default=0)
+ parser.add_argument("--image_w", type=int, default=336) # 224
+ parser.add_argument("--image_h", type=int, default=336) # 224
+ parser.add_argument("--add_region_feature", action="store_true")
+ parser.add_argument("--no_coor", action="store_true")
+ parser.add_argument("--temperature", type=float, default=0.001)
+ parser.add_argument("--top_p", type=float, default=None)
+ parser.add_argument("--num_beams", type=int, default=1)
+ parser.add_argument("--region_format", type=str, default="point", choices=["point", "box", "free_shape"])
+ args = parser.parse_args()
+
+ eval_model(args)
diff --git a/ferret/eval/model_point_cls_single_image.py b/ferret/eval/model_point_cls_single_image.py
new file mode 100644
index 0000000..765c870
--- /dev/null
+++ b/ferret/eval/model_point_cls_single_image.py
@@ -0,0 +1,181 @@
+"""
+Usage:
+- To evaluate on sampled center points of a single image:
+CUDA_VISIBLE_DEVICES=1 python -m ferret.eval.model_point_cls_single_image \
+ --model-path checkpoints/ferret_13b/checkpoint-4500 \
+ --img_path ferret/serve/examples/extreme_ironing.jpg \
+ --answers-file lvis_result/single_img/ \
+ --add_region_feature
+"""
+
+import argparse
+from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
+import torch
+import os
+import json
+from tqdm import tqdm
+
+from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
+from ferret.model.builder import load_pretrained_model
+from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+from ferret.conversation import conv_templates, SeparatorStyle
+from ferret.utils import disable_torch_init
+
+from PIL import Image
+import random
+import math
+from copy import deepcopy
+import pdb
+import numpy as np
+from functools import partial
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+
+
+
+def generate_mask_for_feature(coor, raw_w, raw_h):
+ coor_mask = np.zeros((raw_w, raw_h))
+ # Assume it samples a point.
+ if len(coor) == 2:
+ # Define window size
+ span = 5
+ # Make sure the window does not exceed array bounds
+ x_min = max(0, coor[0] - span)
+ x_max = min(raw_w, coor[0] + span + 1)
+ y_min = max(0, coor[1] - span)
+ y_max = min(raw_h, coor[1] + span + 1)
+ coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
+ assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
+ else:
+ raise NotImplementedError('Coordinates must be 2d.')
+ coor_mask = torch.from_numpy(coor_mask)
+ assert len(coor_mask.nonzero()) != 0
+ return coor_mask
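+
+# Illustrative example (assumed values): generate_mask_for_feature([500, 500], 1000, 1000)
+# returns a 1000x1000 mask whose 11x11 window [495:506, 495:506] is set to 1, i.e. a
+# +/-span (5 unit) neighborhood around the sampled point.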
+
+
+def eval_model(args):
+ # Model
+ disable_torch_init()
+ model_path = os.path.expanduser(args.model_path)
+ model_name = get_model_name_from_path(model_path)
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
+
+
+ image_path_list = ['ferret/serve/examples/extreme_ironing.jpg']
+ # image_path_list = ['ferret/serve/examples/2409138.jpg', 'ferret/serve/examples/extreme_ironing.jpg', 'ferret/serve/examples/2332136.jpg']
+ image_path = args.img_path
+ coor_list = []
+ grid_w = 10
+ grid_h = 10
+ for i in range(grid_w):
+ for j in range(grid_h):
+ coor_i = VOCAB_IMAGE_W * (i + 1) / (grid_w+1)
+ coor_j = VOCAB_IMAGE_H * (j + 1) / (grid_h+1)
+ coor_list.append([int(coor_i), int(coor_j)])
+
+ if args.add_region_feature:
+        question = f'What is the class of object <coor> {DEFAULT_REGION_FEA_TOKEN}?'
+ else:
+        question = 'What is the class of object <coor>?'
+
+ for image_path in image_path_list:
+ answers_file = os.path.expanduser(args.answers_file)
+ os.makedirs(answers_file, exist_ok=True)
+ image_name = image_path.split('.')[0].split('/')[-1]
+ answers_file = os.path.join(answers_file, f'{image_name}.jsonl')
+ ans_file = open(answers_file, "w")
+
+ for i, coor_i in enumerate(tqdm(coor_list)):
+            qs = question.replace('<coor>', '[{}, {}]'.format(int(coor_i[0]), int(coor_i[1])))
+ cur_prompt = qs
+
+ if model.config.mm_use_im_start_end:
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
+ else:
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
+
+ conv = conv_templates[args.conv_mode].copy()
+ conv.append_message(conv.roles[0], qs)
+ conv.append_message(conv.roles[1], None)
+ prompt = conv.get_prompt()
+ # inputs = tokenizer([prompt])
+
+ image = Image.open(image_path).convert('RGB')
+ # image.save(os.path.join(save_image_folder, image_file))
+ image_tensor = image_processor.preprocess(image, return_tensors='pt', do_resize=True,
+ do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
+ # image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
+
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
+
+ if args.add_region_feature:
+ generated_mask = generate_mask_for_feature(coor_i, VOCAB_IMAGE_W, VOCAB_IMAGE_H)
+ region_masks = [generated_mask]
+ region_masks = [[region_mask_i.cuda().half() for region_mask_i in region_masks]]
+ else:
+ region_masks = None
+
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
+ keywords = [stop_str]
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+
+ with torch.inference_mode():
+ model.orig_forward = model.forward
+ model.forward = partial(
+ model.orig_forward,
+ region_masks=region_masks
+ )
+ output_ids = model.generate(
+ input_ids,
+ images=image_tensor.unsqueeze(0).half().cuda(),
+ do_sample=True,
+ temperature=args.temperature,
+ max_new_tokens=1024,
+ num_beams=1,
+ use_cache=True,
+ stopping_criteria=[stopping_criteria])
+ model.forward = model.orig_forward
+
+ input_token_len = input_ids.shape[1]
+ n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
+ if n_diff_input_output > 0:
+ print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
+ outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
+
+ outputs = outputs.strip()
+ if outputs.endswith(stop_str):
+ outputs = outputs[:-len(stop_str)]
+ outputs = outputs.strip()
+
+ # pdb.set_trace()
+ img_w, img_h = image.size
+ ans_file.write(json.dumps({"img_w": img_w,
+ "img_h": img_h,
+ "VOCAB_IMAGE_W": VOCAB_IMAGE_W,
+ "VOCAB_IMAGE_H": VOCAB_IMAGE_H,
+ "coor": coor_i,
+ "image_path":image_path,
+ "prompt": cur_prompt,
+ "text": outputs,
+ }) + "\n")
+ ans_file.flush()
+ ans_file.close()
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
+ parser.add_argument("--model-base", type=str, default=None)
+    parser.add_argument("--img_path", type=str, default='ferret/serve/examples/extreme_ironing.jpg')
+ parser.add_argument("--answers-file", type=str, default="lvis_result/answer.jsonl")
+ parser.add_argument("--conv-mode", type=str, default="ferret_v1")
+ parser.add_argument("--image_w", type=int, default=336)
+ parser.add_argument("--image_h", type=int, default=336)
+ parser.add_argument("--answer_prompter", action="store_true")
+ parser.add_argument("--add_region_feature", action="store_true")
+ parser.add_argument("--temperature", type=float, default=0.001)
+ args = parser.parse_args()
+
+ eval_model(args)
diff --git a/ferret/eval/model_pope.py b/ferret/eval/model_pope.py
new file mode 100644
index 0000000..24343c3
--- /dev/null
+++ b/ferret/eval/model_pope.py
@@ -0,0 +1,212 @@
+"""
+Usage:
+--data_path: path to the POPE annotation file.
+--image_path: path to the COCO 2014 val images.
+--answers-file: directory where the output results are written.
+
+Example:
+CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_pope \
+ --model-path checkpoints/ferret_13b/checkpoint-final \
+ --image_path data/refcoco/val2014 \
+ --data_path data/pope/coco_pope_adversarial.json \
+ --answers-file pope/coco_pope_adversarial \
+ --add_region_feature \
+ --chunk-idx 0 \
+ --num-chunks 8
+
+"""
+
+import argparse
+from typing import Any, Tuple
+import torch
+import os
+import json
+from tqdm import tqdm
+
+# Added
+from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
+from ferret.model.builder import load_pretrained_model
+from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+from ferret.conversation import conv_templates, SeparatorStyle
+from ferret.utils import disable_torch_init
+from PIL import Image
+import re
+import math
+import torchvision
+import numpy as np
+from copy import deepcopy
+
+# Added for visualization
+from PIL import Image, ImageDraw, ImageFont
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+
+def split_list(lst, n):
+ """Split a list into n (roughly) equal-sized chunks"""
+    chunk_size = math.ceil(len(lst) / n)  # ceiling division
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
+
+def get_chunk(lst, n, k):
+ chunks = split_list(lst, n)
+ return chunks[k]
+
+
+def plot_pope(img, boxes, text):
+ draw = ImageDraw.Draw(img)
+ fnt = ImageFont.load_default()
+ draw.rectangle([boxes[0], boxes[1], boxes[2], boxes[3]], outline="blue")
+ draw.text((boxes[0], boxes[1]-5), f'{text}', font=fnt, fill="green")
+ return img
+
+
+def resize_bbox(box, image_w=None, image_h=None):
+ ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
+ ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
+
+ new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
+ int(box[2] * ratio_w), int(box[3] * ratio_h)]
+ return new_box
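+
+# Illustrative example (assumed values): for a 640x480 image, resize_bbox([250, 500, 750, 1000], 640, 480)
+# maps the model's 1000x1000 vocabulary coordinates back to pixel space, giving [160, 240, 480, 480].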
+
+
+def find_bbox_template_v3(text, img_w, img_h):
+ pattern = r'\[(\d+), (\d+), (\d+), (\d+)\]'
+ matches = re.findall(pattern, text)
+ new_bboxes = []
+ old_bboxes = []
+ for match in matches:
+ x1, y1, x2, y2 = map(int, match)
+ new_box = resize_bbox([x1, y1, x2, y2], img_w, img_h)
+ new_bboxes.append(new_box)
+ old_bboxes.append([x1, y1, x2, y2])
+
+ set_old_bboxes = sorted(set(map(tuple, old_bboxes)), key=list(map(tuple, old_bboxes)).index)
+ list_old_bboxes = list(map(list, set_old_bboxes))
+
+ set_bboxes = sorted(set(map(tuple, new_bboxes)), key=list(map(tuple, new_bboxes)).index)
+ list_bboxes = list(map(list, set_bboxes))
+
+ for i in range(len(list_bboxes)):
+ x1, y1, x2, y2 = list_old_bboxes[i]
+ obj_string = '[obj{}]'.format(i)
+ text = text.replace('[{}, {}, {}, {}]'.format(x1, y1, x2, y2), obj_string)
+
+ return text, list_bboxes
+
+
+class PopeGrounding():
+ def __init__(self, img_folder, ann_file):
+ self.img_folder = img_folder
+ self.ann_file = ann_file
+
+ self.label_list = [json.loads(q) for q in open(self.ann_file, 'r')]
+ self._ids = range(len(self.label_list))
+
+ def __getitem__(self, idx):
+ label = self.label_list[idx]
+ filename = label["image"]
+ image = Image.open(os.path.join(self.img_folder, filename)).convert('RGB')
+ question = label["text"]
+
+ return image, question
+
+ @property
+ def ids(self):
+ return deepcopy(self._ids)
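+
+# Assumed POPE annotation format (one JSON object per line), of which only "image" and "text"
+# are read above, e.g.:
+# {"question_id": 1, "image": "COCO_val2014_000000310196.jpg", "text": "Is there a dog in the image?", "label": "no"}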
+
+
+def eval_model_pope(args):
+ # Data
+ dataset = PopeGrounding(img_folder=args.image_path, ann_file=args.data_path)
+ data_ids = dataset.ids
+
+ # Model
+ disable_torch_init()
+ model_path = os.path.expanduser(args.model_path)
+ model_name = get_model_name_from_path(model_path)
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
+
+ chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
+ answers_file = os.path.expanduser(args.answers_file)
+ os.makedirs(answers_file, exist_ok=True)
+ answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
+ ans_file = open(answers_file, "w")
+
+ for i, id in enumerate(tqdm(chunk_data_ids)):
+ img, question = dataset[id]
+ qs = question
+ img_w, img_h = img.size
+
+ if model.config.mm_use_im_start_end:
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
+ else:
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
+
+ conv = conv_templates[args.conv_mode].copy()
+ conv.append_message(conv.roles[0], qs)
+ conv.append_message(conv.roles[1], None)
+ prompt = conv.get_prompt()
+
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
+ image_tensor = image_processor.preprocess(img, return_tensors='pt', do_resize=True,
+ do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
+ keywords = [stop_str]
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+
+ with torch.inference_mode():
+ output_ids = model.generate(
+ input_ids,
+ images=image_tensor.unsqueeze(0).half().cuda(),
+ do_sample=True,
+ temperature=args.temperature,
+ top_p=args.top_p,
+ num_beams=args.num_beams,
+ # no_repeat_ngram_size=3,
+ max_new_tokens=1024,
+ use_cache=True,
+ stopping_criteria=[stopping_criteria],
+ )
+
+ input_token_len = input_ids.shape[1]
+ n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
+ if n_diff_input_output > 0:
+ print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
+ outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
+ outputs = outputs.strip()
+ if outputs.endswith(stop_str):
+ outputs = outputs[:-len(stop_str)]
+ outputs = outputs.strip()
+
+ # Plot Preds
+ # text, bboxes = find_bbox_template_v3(outputs, img_w=img_w, img_h=img_h)
+ # # print(text, bboxes)
+ # img = plot_pope(img, bboxes[0], text)
+ # img.save('pope/images/{}.png'.format(i))
+
+ ans_file.write(json.dumps({"question": question,
+ "answer": outputs.lower(),
+ }) + "\n")
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
+ parser.add_argument("--model-base", type=str, default=None)
+ parser.add_argument("--image_path", type=str, default="data/refcoco/val2014")
+ parser.add_argument("--data_path", type=str, default="data/pope/coco_pope_popular.json")
+ parser.add_argument("--answers-file", type=str, default="pope/coco_pope_popular")
+ parser.add_argument("--conv-mode", type=str, default="ferret_v1")
+ parser.add_argument("--num-chunks", type=int, default=1)
+ parser.add_argument("--chunk-idx", type=int, default=0)
+ parser.add_argument("--image_w", type=int, default=336) # 224
+ parser.add_argument("--image_h", type=int, default=336) # 224
+ parser.add_argument("--add_region_feature", action="store_true")
+ parser.add_argument("--temperature", type=float, default=0.001)
+ parser.add_argument("--top_p", type=float, default=None)
+ parser.add_argument("--num_beams", type=int, default=1)
+ args = parser.parse_args()
+
+ eval_model_pope(args)
diff --git a/ferret/eval/model_refcoco.py b/ferret/eval/model_refcoco.py
new file mode 100644
index 0000000..08f6038
--- /dev/null
+++ b/ferret/eval/model_refcoco.py
@@ -0,0 +1,252 @@
+"""
+Usage:
+--data_path: path to the RefCOCO annotation file.
+--image_path: path to the RefCOCO images.
+--answers-file: directory where the output results are written.
+
+Example:
+CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_refcoco \
+ --model-path checkpoints/ferret_13b/checkpoint-final \
+ --image_path data/refcoco/train2014 \
+ --data_path data/annotations/finetune_refcocog_test.json \
+ --answers-file refexp_result/finetune_refcocog_test \
+ --add_region_feature \
+ --chunk-idx 0 \
+ --num-chunks 1
+
+"""
+
+import argparse
+from typing import Any, Tuple
+import torch
+import os
+import json
+from tqdm import tqdm
+
+# Added
+from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
+from ferret.model.builder import load_pretrained_model
+from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+from ferret.conversation import conv_templates, SeparatorStyle
+from ferret.utils import disable_torch_init
+from PIL import Image
+import re
+import math
+import torchvision
+import numpy as np
+from copy import deepcopy
+
+# Added for visualization
+from PIL import Image, ImageDraw, ImageFont
+
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+
+def split_list(lst, n):
+ """Split a list into n (roughly) equal-sized chunks"""
+    chunk_size = math.ceil(len(lst) / n)  # ceiling division
+ return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
+
+
+def get_chunk(lst, n, k):
+ chunks = split_list(lst, n)
+ return chunks[k]
+
+
+def plot_refexp(img, boxes, entities, mode='pred'):
+ if mode == "gt":
+ color = "green"
+ elif mode == "pred":
+ color = "blue"
+ draw = ImageDraw.Draw(img)
+ fnt = ImageFont.load_default()
+ draw.rectangle([boxes[0], boxes[1], boxes[2], boxes[3]], outline=color)
+ draw.text((boxes[0], boxes[1]-5), f'{entities}', font=fnt, fill=color)
+ return img
+
+
+def remove_punctuation(text: str) -> str:
+ punct = [',',]
+ for p in punct:
+ text = text.replace(p, '')
+ return text.strip()
+
+
+def resize_bbox(box, image_w=None, image_h=None):
+ ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
+ ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
+
+ new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
+ int(box[2] * ratio_w), int(box[3] * ratio_h)]
+ return new_box
+
+
+def find_bbox_template(text, img_w, img_h):
+ pattern = r'\[(\d+), (\d+), (\d+), (\d+)\]'
+ matches = re.findall(pattern, text)
+ new_bboxes = []
+ old_bboxes = []
+ for match in matches:
+ x1, y1, x2, y2 = map(int, match)
+ new_box = resize_bbox([x1, y1, x2, y2], img_w, img_h)
+ new_bboxes.append(new_box)
+ old_bboxes.append([x1, y1, x2, y2])
+
+ set_old_bboxes = sorted(set(map(tuple, old_bboxes)), key=list(map(tuple, old_bboxes)).index)
+ list_old_bboxes = list(map(list, set_old_bboxes))
+
+ set_bboxes = sorted(set(map(tuple, new_bboxes)), key=list(map(tuple, new_bboxes)).index)
+ list_bboxes = list(map(list, set_bboxes))
+
+ for i in range(len(list_bboxes)):
+ x1, y1, x2, y2 = list_old_bboxes[i]
+ text = text.replace('[{}, {}, {}, {}]'.format(x1, y1, x2, y2), '')
+
+ if text.endswith(" ."):
+ text = text[:-2]
+ split_text = text.split(" . ")
+ entities = [item.strip() for item in split_text if item.strip() != '']
+
+ return entities, list_bboxes
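+
+# Illustrative example (assumed model output format): for text such as
+# "a man [120, 330, 410, 980] . the taxi [50, 400, 900, 990] ." on a 640x480 image,
+# the boxes are stripped from the text and rescaled to pixels, returning
+# (['a man', 'the taxi'], [[76, 158, 262, 470], [32, 192, 576, 475]]).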
+
+
+class RefExpGrounding(torchvision.datasets.CocoDetection):
+ def __init__(self, img_folder, ann_file, transforms):
+ super(RefExpGrounding, self).__init__(img_folder, ann_file)
+ self._transforms = transforms
+        self.question_prompt = "What is the location of <obj> in the image?"
+
+ def __getitem__(self, idx):
+ img, target = super(RefExpGrounding, self).__getitem__(idx)
+ image_id = self.ids[idx]
+ coco_img = self.coco.loadImgs(image_id)[0]
+ file_name = coco_img["file_name"]
+ caption = coco_img["caption"]
+ dataset_name = coco_img["dataset_name"] if "dataset_name" in coco_img else None
+ assert len(target) == 1
+ bbox_xywh = target[0]["bbox"]
+ bbox_xyxy = np.array([bbox_xywh[0], bbox_xywh[1], bbox_xywh[0] + bbox_xywh[2], bbox_xywh[1] + bbox_xywh[3]])
+ w, h = img.size
+ bbox_xyxy[0::2].clip(min=0, max=w)
+ bbox_xyxy[1::2].clip(min=0, max=h)
+
+        assert "<obj>" in self.question_prompt
+        question = self.question_prompt.replace("<obj>", remove_punctuation(caption))
+
+ target = {"image_id": image_id, "file_name": file_name, "annotations": target, "caption": caption,
+ "img_w": w, "img_h": h, "question": question, "bboxes": bbox_xyxy.tolist(), "entities": [caption]}
+ if self._transforms is not None:
+ img, target = self._transforms(img, target)
+ target["dataset_name"] = dataset_name
+ for extra_key in ["sentence_id", "original_img_id", "original_id", "task_id"]:
+ if extra_key in coco_img:
+ target[extra_key] = coco_img[extra_key]
+ return img, target
+
+
+def eval_model_refexp(args):
+ # Data
+ dataset = RefExpGrounding(img_folder=args.image_path,
+ ann_file=args.data_path,
+ transforms=None,
+ )
+ data_ids = range(len(dataset))
+
+ # Model
+ disable_torch_init()
+ model_path = os.path.expanduser(args.model_path)
+ model_name = get_model_name_from_path(model_path)
+ tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
+
+ chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
+ answers_file = os.path.expanduser(args.answers_file)
+ os.makedirs(answers_file, exist_ok=True)
+ answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
+ ans_file = open(answers_file, "w")
+
+ for i, id in enumerate(tqdm(chunk_data_ids)):
+ img, ann = dataset[id]
+ qs = ann["question"]
+ cur_prompt = qs
+
+ # Plot GTs
+ # img = plot_refexp(img, ann["bboxes"], ann["entities"], mode="gt")
+
+ if model.config.mm_use_im_start_end:
+ qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
+ else:
+ qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
+
+ conv = conv_templates[args.conv_mode].copy()
+ conv.append_message(conv.roles[0], qs)
+ conv.append_message(conv.roles[1], None)
+ prompt = conv.get_prompt()
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
+ img_w, img_h = ann["img_w"], ann["img_h"]
+ image_tensor = image_processor.preprocess(img, return_tensors='pt', do_resize=True,
+ do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
+ stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
+ keywords = [stop_str]
+ stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
+
+ with torch.inference_mode():
+ output_ids = model.generate(
+ input_ids,
+ images=image_tensor.unsqueeze(0).half().cuda(),
+ do_sample=True,
+ temperature=args.temperature,
+ top_p=args.top_p,
+ num_beams=args.num_beams,
+ # no_repeat_ngram_size=3,
+ max_new_tokens=1024,
+ use_cache=True,
+ stopping_criteria=[stopping_criteria],
+ )
+
+ input_token_len = input_ids.shape[1]
+ n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
+ if n_diff_input_output > 0:
+ print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
+ outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
+ outputs = outputs.strip()
+ if outputs.endswith(stop_str):
+ outputs = outputs[:-len(stop_str)]
+ outputs = outputs.strip()
+
+ # Plot Preds
+ # pred_entities, pred_bboxes = find_bbox_template(outputs, img_w=img_w, img_h=img_h)
+ # img = plot_refexp(img, pred_bboxes[0], pred_entities, mode="pred")
+ # img.save('refexp_result/images/{}.png'.format(i))
+
+ ans_file.write(json.dumps({"image_id": ann['image_id'],
+ "file_name": ann["file_name"],
+ "prompt": cur_prompt,
+ "text": outputs,
+ "width": ann['img_w'],
+ "height": ann['img_h'],
+ }) + "\n")
+ ans_file.flush()
+ ans_file.close()
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
+ parser.add_argument("--model-base", type=str, default=None)
+ parser.add_argument("--image_path", type=str, default="data/refcoco/train2014")
+ parser.add_argument("--data_path", type=str, default="data/annotations/finetune_refcoco_testA.json")
+ parser.add_argument("--answers-file", type=str, default="refexp_result/refcoco_testA")
+ parser.add_argument("--conv-mode", type=str, default="ferret_v1")
+ parser.add_argument("--num-chunks", type=int, default=1)
+ parser.add_argument("--chunk-idx", type=int, default=0)
+ parser.add_argument("--image_w", type=int, default=336) # 224
+ parser.add_argument("--image_h", type=int, default=336) # 224
+ parser.add_argument("--add_region_feature", action="store_true")
+ parser.add_argument("--temperature", type=float, default=0.001)
+ parser.add_argument("--top_p", type=float, default=None)
+ parser.add_argument("--num_beams", type=int, default=1)
+ args = parser.parse_args()
+
+ eval_model_refexp(args)
\ No newline at end of file
diff --git a/ferret/eval/summarize_gpt_review.py b/ferret/eval/summarize_gpt_review.py
new file mode 100644
index 0000000..13b8bbd
--- /dev/null
+++ b/ferret/eval/summarize_gpt_review.py
@@ -0,0 +1,63 @@
+import json
+import os
+from collections import defaultdict
+
+import numpy as np
+
+import argparse
+
+def parse_args():
+ parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
+ parser.add_argument('-d', '--dir', default=None)
+ parser.add_argument('-f', '--files', nargs='*', default=None)
+ parser.add_argument('-i', '--ignore', nargs='*', default=None)
+ parser.add_argument('-s', '--save', action='store_true')
+ return parser.parse_args()
+
+
+if __name__ == '__main__':
+ args = parse_args()
+
+ if args.ignore is not None:
+ args.ignore = [int(x) for x in args.ignore]
+
+ if args.files is not None and len(args.files) > 0:
+ review_files = args.files
+ else:
+ review_files = [x for x in os.listdir(args.dir) if x.endswith('.jsonl') and (x.startswith('gpt4_text') or x.startswith('reviews_') or x.startswith('review_'))]
+
+ metrics = []
+ for review_file in sorted(review_files):
+ config = os.path.basename(review_file).replace('gpt4_text_', '').replace('.jsonl', '')
+ scores = defaultdict(list)
+ print(config)
+ with open(os.path.join(args.dir, review_file) if args.dir is not None else review_file) as f:
+ for review_str in f:
+ review = json.loads(review_str)
+ if args.ignore is not None and review['question_id'] in args.ignore:
+ continue
+ if 'category' in review:
+ scores[review['category']].append(review['tuple'])
+ scores['all'].append(review['tuple'])
+ else:
+ if 'tuple' in review:
+ scores['all'].append(review['tuple'])
+ else:
+ scores['all'].append(review['score'])
+ summ_dict = defaultdict(list)
+ for k, v in sorted(scores.items()):
+ stats = np.asarray(v).mean(0).tolist()
+ stats = [round(x, 3) for x in stats]
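+            # stats averages the per-review score tuples; assuming the tuple is (gpt4_score, model_score),
+            # the printed value is the model-to-GPT-4 ratio as a percentage.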
+ # print(k, stats, round(stats[1]/stats[0]*100, 1))
+ print(k, round(stats[1]/stats[0]*100, 1))
+ summ_dict[k] = round(stats[1]/stats[0]*100, 1)
+ print('=================================')
+ metrics.append(summ_dict)
+
+ if args.save:
+ with open(os.path.join(args.dir, 'metric.json'), 'w') as f:
+ json.dump(metrics, f, indent=2)
+
+
+
+
diff --git a/ferret/mm_utils.py b/ferret/mm_utils.py
new file mode 100644
index 0000000..8c1865a
--- /dev/null
+++ b/ferret/mm_utils.py
@@ -0,0 +1,74 @@
+from PIL import Image
+from io import BytesIO
+import base64
+
+import torch
+from transformers import StoppingCriteria
+from ferret.constants import IMAGE_TOKEN_INDEX
+
+
+def load_image_from_base64(image):
+ return Image.open(BytesIO(base64.b64decode(image)))
+
+
+def process_images(images, image_processor, model_cfg):
+ return image_processor(images, return_tensors='pt')['pixel_values']
+
+
+def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
+    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
+
+ def insert_separator(X, sep):
+ return [ele for sublist in zip(X, [sep]*len(X)) for ele in sublist][:-1]
+
+ input_ids = []
+ offset = 0
+ if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
+ offset = 1
+ input_ids.append(prompt_chunks[0][0])
+
+ for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
+ input_ids.extend(x[offset:])
+
+ if return_tensors is not None:
+ if return_tensors == 'pt':
+ return torch.tensor(input_ids, dtype=torch.long)
+ raise ValueError(f'Unsupported tensor type: {return_tensors}')
+ return input_ids
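+
+# Illustrative behavior (token ids depend on the tokenizer): for a prompt like
+# "USER: <image>\nDescribe the image. ASSISTANT:", the text on either side of '<image>'
+# is tokenized separately and the chunks are rejoined with the IMAGE_TOKEN_INDEX sentinel
+# in between, so the image placeholder ends up as a single id in input_ids.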
+
+
+def get_model_name_from_path(model_path):
+ model_path = model_path.strip("/")
+ model_paths = model_path.split("/")
+ if model_paths[-1].startswith('checkpoint-') or model_paths[-1].endswith('checkpoint'):
+ return model_paths[-2] + "_" + model_paths[-1]
+ else:
+ return model_paths[-1]
+
+
+
+
+class KeywordsStoppingCriteria(StoppingCriteria):
+ def __init__(self, keywords, tokenizer, input_ids):
+ self.keywords = keywords
+ self.keyword_ids = []
+ for keyword in keywords:
+ cur_keyword_ids = tokenizer(keyword).input_ids
+ if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
+ cur_keyword_ids = cur_keyword_ids[1:]
+ self.keyword_ids.append(torch.tensor(cur_keyword_ids))
+ self.tokenizer = tokenizer
+ self.start_len = input_ids.shape[1]
+
+ def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+ assert output_ids.shape[0] == 1, "Only support batch size 1 (yet)" # TODO
+ offset = min(output_ids.shape[1] - self.start_len, 3)
+ self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
+ for keyword_id in self.keyword_ids:
+            if (output_ids[0, -keyword_id.shape[0]:] == keyword_id).all():
+ return True
+ outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
+ for keyword in self.keywords:
+ if keyword in outputs:
+ return True
+ return False
diff --git a/ferret/model/__init__.py b/ferret/model/__init__.py
new file mode 100644
index 0000000..a4d920d
--- /dev/null
+++ b/ferret/model/__init__.py
@@ -0,0 +1 @@
+from .language_model.ferret_llama import FERRETLlamaForCausalLM, FERRETConfig
diff --git a/ferret/model/apply_delta.py b/ferret/model/apply_delta.py
new file mode 100644
index 0000000..75abb6e
--- /dev/null
+++ b/ferret/model/apply_delta.py
@@ -0,0 +1,48 @@
+"""
+Usage:
+python3 -m fastchat.model.apply_delta --base ~/model_weights/llama-7b --target ~/model_weights/vicuna-7b --delta lmsys/vicuna-7b-delta
+"""
+import argparse
+
+import torch
+from tqdm import tqdm
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from ferret import FERRETLlamaForCausalLM
+
+
+def apply_delta(base_model_path, target_model_path, delta_path):
+ print("Loading base model")
+ base = AutoModelForCausalLM.from_pretrained(
+ base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
+
+ print("Loading delta")
+ delta = FERRETLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
+ delta_tokenizer = AutoTokenizer.from_pretrained(delta_path)
+
+ print("Applying delta")
+ for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"):
+ if name not in base.state_dict():
+ assert name in ['model.mm_projector.weight', 'model.mm_projector.bias'], f'{name} not in base model'
+ continue
+ if param.data.shape == base.state_dict()[name].shape:
+ param.data += base.state_dict()[name]
+ else:
+ assert name in ['model.embed_tokens.weight', 'lm_head.weight'], \
+ f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}'
+ bparam = base.state_dict()[name]
+ param.data[:bparam.shape[0], :bparam.shape[1]] += bparam
+
+ print("Saving target model")
+ delta.save_pretrained(target_model_path)
+ delta_tokenizer.save_pretrained(target_model_path)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--base-model-path", type=str, required=True)
+ parser.add_argument("--target-model-path", type=str, required=True)
+ parser.add_argument("--delta-path", type=str, required=True)
+
+ args = parser.parse_args()
+
+ apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
diff --git a/ferret/model/builder.py b/ferret/model/builder.py
new file mode 100644
index 0000000..f901f53
--- /dev/null
+++ b/ferret/model/builder.py
@@ -0,0 +1,139 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import os
+import shutil
+import pdb
+
+from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
+import torch
+from ferret.model import *
+from ferret.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto"):
+ kwargs = {"device_map": device_map}
+
+ if load_8bit:
+ kwargs['load_in_8bit'] = True
+ elif load_4bit:
+ kwargs['load_in_4bit'] = True
+ kwargs['quantization_config'] = BitsAndBytesConfig(
+ load_in_4bit=True,
+ bnb_4bit_compute_dtype=torch.float16,
+ bnb_4bit_use_double_quant=True,
+ bnb_4bit_quant_type='nf4'
+ )
+ else:
+ kwargs['torch_dtype'] = torch.float16
+
+ if 'llava' in model_name.lower() or 'ferret' in model_name.lower():
+ # Load LLaVA/FERRET model
+ if 'lora' in model_name.lower() and model_base is not None:
+ lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
+ print('Loading LLaVA/FERRET from base model...')
+ model = FERRETLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
+            token_num, token_dim = model.lm_head.out_features, model.lm_head.in_features
+            if model.lm_head.weight.shape[0] != token_num:
+                model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
+                model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
+
+ print('Loading additional LLaVA/FERRET weights...')
+ if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
+ non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
+ else:
+ # this is probably from HF Hub
+ from huggingface_hub import hf_hub_download
+ def load_from_hf(repo_id, filename, subfolder=None):
+ cache_file = hf_hub_download(
+ repo_id=repo_id,
+ filename=filename,
+ subfolder=subfolder)
+ return torch.load(cache_file, map_location='cpu')
+ non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
+ non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}
+ if any(k.startswith('model.model.') for k in non_lora_trainables):
+ non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
+ model.load_state_dict(non_lora_trainables, strict=False)
+
+ from peft import PeftModel
+ print('Loading LoRA weights...')
+ model = PeftModel.from_pretrained(model, model_path)
+ print('Merging LoRA weights...')
+ model = model.merge_and_unload()
+ print('Model is loaded...')
+ elif model_base is not None:
+ # this may be mm projector only
+ print('Loading LLaVA/FERRET from base model...')
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
+ cfg_pretrained = AutoConfig.from_pretrained(model_path)
+ model = FERRETLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
+
+ mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
+ mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
+ model.load_state_dict(mm_projector_weights, strict=False)
+ else:
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
+ model = FERRETLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
+ else:
+ # Load language model
+ if model_base is not None:
+ # PEFT model
+ from peft import PeftModel
+ tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
+ model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")
+ print(f"Loading LoRA weights from {model_path}")
+ model = PeftModel.from_pretrained(model, model_path)
+ print(f"Merging weights")
+ model = model.merge_and_unload()
+ print('Convert to FP16...')
+ model.to(torch.float16)
+ else:
+ use_fast = False
+ tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
+ model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
+
+ image_processor = None
+
+ if 'llava' in model_name.lower() or 'ferret' in model_name.lower():
+ mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
+ mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
+ mm_im_region_fea_token = getattr(model.config, "im_region_fea_token", None)
+ if mm_use_im_patch_token:
+ tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
+ if mm_im_region_fea_token is not None:
+ tokenizer.add_tokens([DEFAULT_REGION_FEA_TOKEN], special_tokens=True)
+ if mm_use_im_start_end:
+ tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
+ model.resize_token_embeddings(len(tokenizer))
+
+ vision_tower = model.get_vision_tower()
+ vision_tower_path = os.path.join(model_path, 'vision_tower')
+ if not vision_tower.is_loaded or os.path.exists(vision_tower_path):
+ if os.path.exists(vision_tower_path):
+ print(f'Start Loading vision tower from {vision_tower_path}')
+ vision_tower.load_model(vision_tower_path=vision_tower_path)
+ print(f'Finish Loading vision tower from {vision_tower_path}')
+ else:
+ vision_tower.load_model()
+
+ vision_tower.to(device='cuda', dtype=torch.float16)
+ image_processor = vision_tower.image_processor
+
+ if hasattr(model.config, "max_sequence_length"):
+ context_len = model.config.max_sequence_length
+ else:
+ context_len = 2048
+
+ return tokenizer, model, image_processor, context_len
diff --git a/ferret/model/consolidate.py b/ferret/model/consolidate.py
new file mode 100644
index 0000000..8d516bc
--- /dev/null
+++ b/ferret/model/consolidate.py
@@ -0,0 +1,29 @@
+"""
+Usage:
+python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate
+"""
+import argparse
+
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from ferret.model import *
+from ferret.model.utils import auto_upgrade
+
+
+def consolidate_ckpt(src_path, dst_path):
+ print("Loading model")
+ auto_upgrade(src_path)
+ src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
+ src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False)
+ src_model.save_pretrained(dst_path)
+ src_tokenizer.save_pretrained(dst_path)
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--src", type=str, required=True)
+ parser.add_argument("--dst", type=str, required=True)
+
+ args = parser.parse_args()
+
+ consolidate_ckpt(args.src, args.dst)
diff --git a/ferret/model/ferret_arch.py b/ferret/model/ferret_arch.py
new file mode 100644
index 0000000..a5bbc5f
--- /dev/null
+++ b/ferret/model/ferret_arch.py
@@ -0,0 +1,678 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from abc import ABC, abstractmethod
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import math
+from .multimodal_encoder.builder import build_vision_tower
+import pdb
+
+from ferret.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+
+def rand_sample(x, max_len):
+ if x.shape[0] <= max_len:
+ return x
+ else:
+ rand_idx = torch.randperm(x.shape[0])[:max_len]
+ return x[rand_idx, :]
+
+def rand_sample_repeat(x, max_len):
+ if x.shape[0] < max_len:
+ indices = torch.randint(0, x.shape[0], (max_len-x.shape[0],))
+ # pdb.set_trace()
+ return torch.cat((x, x[indices]), dim=0)
+ elif x.shape[0] == max_len:
+ return x
+ else:
+ rand_idx = torch.randperm(x.shape[0])[:max_len]
+ return x[rand_idx, :]
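+
+# Illustrative example (assumed shapes): rand_sample_repeat(x, 512) on x of shape [300, 2]
+# appends 212 randomly repeated rows to reach [512, 2]; with more than 512 rows it instead
+# keeps a random subset of 512.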
+
+def point_sample(input, point_coords, return_dtype, **kwargs):
+ """
+ A wrapper around :function:`torch.nn.functional.grid_sample` to support 3D point_coords tensors.
+ Unlike :function:`torch.nn.functional.grid_sample` it assumes `point_coords` to lie inside
+ [0, 1] x [0, 1] square.
+
+ Args:
+ input (Tensor): A tensor of shape (N, C, H, W) that contains features map on a H x W grid.
+ point_coords (Tensor): A tensor of shape (N, P, 2) or (N, Hgrid, Wgrid, 2) that contains
+ [0, 1] x [0, 1] normalized point coordinates.
+
+ Returns:
+ output (Tensor): A tensor of shape (N, C, P) or (N, C, Hgrid, Wgrid) that contains
+ features for points in `point_coords`. The features are obtained via bilinear
+ interplation from `input` the same way as :function:`torch.nn.functional.grid_sample`.
+ """
+ add_dim = False
+ if point_coords.dim() == 3:
+ add_dim = True
+ point_coords = point_coords.unsqueeze(2)
+ # output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs)
+ output = F.grid_sample(input.float(), (2.0 * point_coords - 1.0).float(), **kwargs)
+ output = output.to(return_dtype)
+ if add_dim:
+ output = output.squeeze(3)
+ return output
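+
+# Illustrative usage (assumed shapes): with features of shape [1, C, H, W] and normalized
+# coordinates of shape [1, P, 2] in [0, 1], point_sample(feat, coords, torch.float16,
+# align_corners=True) returns bilinearly interpolated features of shape [1, C, P].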
+
+
+def farthest_point_sample(xyz, npoint):
+ """
+ Input:
+ xyz: pointcloud data, [B, N, 2]
+ npoint: number of samples
+ Return:
+ centroids: sampled pointcloud index, [B, npoint]
+ """
+ device = xyz.device
+ B, N, C = xyz.shape
+ centroids = torch.zeros(B, npoint, dtype=torch.long).to(device)
+ distance = torch.ones(B, N).to(device) * 1e10
+ farthest = torch.randint(0, N, (B,), dtype=torch.long).to(device)
+ batch_indices = torch.arange(B, dtype=torch.long).to(device)
+ for i in range(npoint):
+ centroids[:, i] = farthest
+ centroid = xyz[batch_indices, farthest, :].view(B, 1, 2)
+ dist = torch.sum((xyz - centroid) ** 2, -1)
+ distance = torch.min(distance, dist)
+ farthest = torch.max(distance, -1)[1]
+ return centroids
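+
+# Illustrative behavior: starting from a random seed index per batch, each iteration keeps, for
+# every point, its distance to the nearest already-chosen centroid and then selects the point
+# with the largest such distance, yielding npoint well-spread sample indices.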
+
+
+def index_points(points, idx):
+ """
+ Input:
+ points: input points data, [B, N, C]
+ idx: sample index data, [B, S]
+ Return:
+ new_points:, indexed points data, [B, S, C]
+ """
+ device = points.device
+ B = points.shape[0]
+ view_shape = list(idx.shape)
+ view_shape[1:] = [1] * (len(view_shape) - 1)
+ repeat_shape = list(idx.shape)
+ repeat_shape[0] = 1
+ batch_indices = torch.arange(B, dtype=torch.long).to(device).view(view_shape).repeat(repeat_shape)
+ new_points = points[batch_indices, idx, :]
+ return new_points
+
+
+def square_distance(src, dst):
+ """
+ Calculate Euclid distance between each two points.
+ src^T * dst = xn * xm + yn * ym + zn * zm;
+ sum(src^2, dim=-1) = xn*xn + yn*yn + zn*zn;
+ sum(dst^2, dim=-1) = xm*xm + ym*ym + zm*zm;
+ dist = (xn - xm)^2 + (yn - ym)^2 + (zn - zm)^2
+ = sum(src**2,dim=-1)+sum(dst**2,dim=-1)-2*src^T*dst
+ Input:
+ src: source points, [B, N, C]
+ dst: target points, [B, M, C]
+ Output:
+ dist: per-point square distance, [B, N, M]
+ """
+ B, N, _ = src.shape
+ _, M, _ = dst.shape
+ dist = -2 * torch.matmul(src, dst.permute(0, 2, 1))
+ dist += torch.sum(src ** 2, -1).view(B, N, 1)
+ dist += torch.sum(dst ** 2, -1).view(B, 1, M)
+ return dist
+
+
+def knn_point(nsample, xyz, new_xyz):
+ """
+ Input:
+ nsample: max sample number in local region
+ xyz: all points, [B, N, C]
+ new_xyz: query points, [B, S, C]
+ Return:
+ group_idx: grouped points index, [B, S, nsample]
+ """
+ sqrdists = square_distance(new_xyz, xyz)
+ _, group_idx = torch.topk(sqrdists, nsample, dim=-1, largest=False, sorted=False)
+ return group_idx
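+
+# Illustrative usage (assumed shapes): knn_point(24, all_points, new_points) with all_points of
+# shape [B, N, 2] and new_points of shape [B, S, 2] returns [B, S, 24] indices of the 24 nearest
+# original points around each sub-sampled point, as used by GeoRegionSampler below.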
+
+
+class ConvReLULN1D(nn.Module):
+ def __init__(self, in_channels, out_channels, kernel_size=1, bias=True):
+ super(ConvReLULN1D, self).__init__()
+ self.act = nn.ReLU(inplace=True)
+ self.net = nn.Sequential(
+ nn.Conv1d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, bias=bias),
+ self.act
+ )
+ self.norm = nn.LayerNorm(out_channels)
+
+ def forward(self, x):
+ # (B, C, N) -> (B, C_1, N)
+ x = self.net(x)
+ x = x.permute(0, 2, 1)
+ x = self.norm(x)
+ x = x.permute(0, 2, 1)
+
+ return x
+
+
+def normal_init(module, mean=0, std=1, bias=0):
+ if hasattr(module, 'weight') and module.weight is not None:
+ nn.init.normal_(module.weight, mean, std)
+ if hasattr(module, 'bias') and module.bias is not None:
+ nn.init.constant_(module.bias, bias)
+
+
+class GeoRegionSampler(nn.Module):
+ def __init__(self,
+ input_dim,
+ output_dim,
+ num_init_point,
+ num_sub_point,
+ num_neighbor,
+ pooler_mode='mean'):
+ super(GeoRegionSampler, self).__init__()
+ self.input_dim = input_dim
+ self.output_dim = output_dim
+ self.num_init_point = num_init_point
+ self.num_sub_point = num_sub_point
+ self.num_neighbor = num_neighbor
+
+ self.diff_projector_list = nn.ModuleList()
+ self.agg_projector_list = nn.ModuleList()
+ self.pooler_list = nn.ModuleList()
+
+ for ii in range(len(num_sub_point)):
+ self.diff_projector_list.append(nn.Linear(self.input_dim + 2, self.input_dim + 2))
+ self.agg_projector_list.append(ConvReLULN1D(in_channels=2*(self.input_dim + 2),
+ out_channels=self.input_dim,
+ ))
+ if pooler_mode == 'mean':
+ self.pooler_list.append(nn.AvgPool1d(kernel_size=num_neighbor[ii]))
+            elif pooler_mode == 'max':
+                self.pooler_list.append(nn.AdaptiveMaxPool1d(output_size=1))
+            else:
+                raise NotImplementedError(f'{pooler_mode} is not supported.')
+
+ self.flatten_projector = nn.Linear(self.input_dim * num_sub_point[-1], self.input_dim)
+ self.dim_projector = nn.Linear(self.input_dim, self.output_dim)
+
+ self.norm_init_weights()
+
+ # self.dtype = torch.float32
+ def norm_init_weights(self):
+ for m in self.modules():
+ if isinstance(m, nn.Conv2d):
+ normal_init(m, 0, 0.01)
+
+
+ def forward(self,
+ feature_map,
+ region_masks,
+ original_dtype,
+ return_dtype):
+
+ assert len(feature_map) == len(region_masks)
+
+ all_points = []
+ all_points_fea = []
+ all_points_img_ids = []
+ # Sample points and their features
+ for img_idx, (region_feature_map_i, region_masks_list_i) in enumerate(zip(feature_map, region_masks)):
+ if len(region_masks_list_i) != 0:
+ # (w, h)
+ ori_image_wh = torch.tensor([region_masks_list_i[0].shape[0], region_masks_list_i[0].shape[1]], device=region_masks_list_i[0].device)[None,]
+ # list of elements of shape [num_sample_point, 2]
+ # pdb.set_trace()
+ cur_non_zero_pos = [rand_sample_repeat((m.nonzero()/ori_image_wh), self.num_init_point) for m in region_masks_list_i]
+ # list -> [num_mask, num_sample_point, 2]
+ cur_non_zero_pos = torch.stack(cur_non_zero_pos)
+ # [HxW, C] -> [H, W, C] -> [C, H, W] -> [N, C, H, W]
+ h = w = int(math.sqrt(region_feature_map_i.shape[0]))
+ c = region_feature_map_i.shape[-1]
+ dup_region_feature_map_i = region_feature_map_i.reshape(h, w, c).permute(2, 0, 1)
+ dup_region_feature_map_i = dup_region_feature_map_i.unsqueeze(0).repeat(cur_non_zero_pos.shape[0], 1, 1, 1)
+ # [num_mask, C, H, W] x [num_mask, num_sample_point, 2] -> [num_mask, C, num_sample_point] -> [num_mask, num_sample_point, C]
+                # F.grid_sample doesn't support BF16. Need to transform into float32 then transform back.
+ dup_region_feature_map_i_ori_type = dup_region_feature_map_i.to(original_dtype)
+ region_feature_i = point_sample(dup_region_feature_map_i_ori_type,
+ cur_non_zero_pos.flip(dims=(2,)).type(original_dtype),
+ return_dtype,
+ align_corners=True,
+ )
+ # region_feature_i = region_feature_i.to(dup_region_feature_map_i.dtype)
+ region_feature_i = region_feature_i.transpose(-2, -1)
+
+ cur_img_ids = [img_idx] * len(cur_non_zero_pos)
+ # save to global list
+ all_points.append(cur_non_zero_pos)
+ all_points_fea.append(region_feature_i)
+ all_points_img_ids.extend(cur_img_ids)
+
+ # pdb.set_trace()
+ # No region found, return list of None.
+ if len(all_points) == 0:
+ return [None] * len(region_masks)
+
+ all_points = torch.cat(all_points, dim=0).to(return_dtype) # [B*num_mask, num_sample_point, 2]
+ all_points_fea = torch.cat(all_points_fea, dim=0) # [B*num_mask, num_sample_point, C]
+ all_points_img_ids = torch.tensor(all_points_img_ids, device=all_points_fea.device)
+ # pdb.set_trace()
+        assert all_points.shape[:-1] == all_points_fea.shape[:-1]
+
+ # Processing.
+ for stage_i in range(len(self.num_sub_point)):
+ cur_num_sub_point = self.num_sub_point[stage_i]
+ cur_num_neighbor = self.num_neighbor[stage_i]
+
+            all_points = all_points.contiguous() # xy [batch, points, xy]
+ fps_idx = farthest_point_sample(all_points, cur_num_sub_point).long()
+
+ new_points = index_points(all_points, fps_idx) # [B, npoint, 2]
+ new_points_fea = index_points(all_points_fea, fps_idx) # [B, npoint, d]
+
+ idx = knn_point(cur_num_neighbor, all_points, new_points)
+ grouped_points = index_points(all_points, idx) # [B, npoint, k, 2]
+ grouped_points_fea = index_points(all_points_fea, idx) # [B, npoint, k, d]
+
+ # pdb.set_trace()
+ local_points_fea = torch.cat([grouped_points_fea, grouped_points],dim=-1) # [B, npoint, k, d+2]
+ anchor_points_fea = torch.cat([new_points_fea, new_points],dim=-1).unsqueeze(-2)
+ diff_points_fea = local_points_fea-anchor_points_fea
+
+ diff_points_fea = self.diff_projector_list[stage_i](diff_points_fea)
+ gather_points_fea = torch.cat([diff_points_fea, anchor_points_fea.repeat(1, 1, cur_num_neighbor, 1)], dim=-1) # [B, npoint, k, 2(d+2)]
+
+ # pdb.set_trace()
+ b, n, s, d = gather_points_fea.size()
+ gather_points_fea = gather_points_fea.permute(0, 1, 3, 2) # [B, npoint, 2(d+2), k]
+ gather_points_fea = gather_points_fea.reshape(-1, d, s) # [B*npoint, 2(d+2), k]
+ gather_points_fea = self.agg_projector_list[stage_i](gather_points_fea) # [B*npoint, d, k]
+ # pdb.set_trace()
+ batch_size, new_dim, _ = gather_points_fea.size()
+ gather_points_fea = self.pooler_list[stage_i](gather_points_fea).view(batch_size, new_dim) # [B*npoint, d]
+ # gather_points_fea = F.adaptive_max_pool1d(gather_points_fea, 1).view(batch_size, -1) # [B*npoint, d]
+ # pdb.set_trace()
+ gather_points_fea = gather_points_fea.reshape(b, n, -1) # [B, npoint, d]
+ # pdb.set_trace()
+
+ all_points = new_points
+ all_points_fea = gather_points_fea
+
+ # pdb.set_trace()
+ x = all_points_fea.flatten(1, -1) # [B, npoint x d]
+ x = self.flatten_projector(x)
+ all_region_fea = self.dim_projector(x) # [B, d]
+
+ output_region_fea = []
+ for img_idx in range(len(region_masks)):
+ cur_mask = all_points_img_ids == img_idx
+ # pdb.set_trace()
+ if not cur_mask.any():
+ output_region_fea.append(None)
+ else:
+ output_region_fea.append(all_region_fea[cur_mask])
+
+ # pdb.set_trace()
+ return output_region_fea
+
+
+
+class FERRETMetaModel:
+
+ def __init__(self, config):
+ super(FERRETMetaModel, self).__init__(config)
+ self.max_sample_point = 512
+
+ if hasattr(config, "mm_vision_tower"):
+ self.vision_tower = build_vision_tower(config, delay_load=True)
+ self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
+
+ if hasattr(config, "region_fea_adapter"):
+ self.region_fea_adapter = nn.Linear(config.mm_hidden_size, config.hidden_size)
+
+ if hasattr(config, "region_geo_sampler"):
+ # pdb.set_trace()
+ self.region_geo_sampler = GeoRegionSampler(input_dim=config.mm_hidden_size,
+ output_dim=config.hidden_size,
+ num_init_point=self.max_sample_point,
+ num_sub_point=[128, 32],
+ num_neighbor=[24, 24],
+ pooler_mode=config.sampler_pooler_mode
+ )
+
+ def get_vision_tower(self):
+ vision_tower = getattr(self, 'vision_tower', None)
+ if type(vision_tower) is list:
+ vision_tower = vision_tower[0]
+ return vision_tower
+
+ def initialize_vision_modules(self, model_args, fsdp=None, add_region_feature=False, region_geo_sampler=False, sampler_pooler_mode='mean'):
+ vision_tower = model_args.vision_tower
+ mm_vision_select_layer = model_args.mm_vision_select_layer
+ mm_vision_select_feature = model_args.mm_vision_select_feature
+ pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter
+
+ self.config.mm_vision_tower = vision_tower
+
+ vision_tower = build_vision_tower(model_args)
+
+ if fsdp is not None and len(fsdp) > 0:
+ self.vision_tower = [vision_tower]
+ else:
+ self.vision_tower = vision_tower
+
+ self.config.use_mm_proj = True
+ self.config.mm_hidden_size = vision_tower.hidden_size
+ self.config.mm_vision_select_layer = mm_vision_select_layer
+ self.config.mm_vision_select_feature = mm_vision_select_feature
+
+ if not hasattr(self, 'mm_projector'):
+ self.mm_projector = nn.Linear(self.config.mm_hidden_size, self.config.hidden_size)
+
+ if add_region_feature:
+ if region_geo_sampler:
+ self.config.region_geo_sampler = True
+ self.config.sampler_pooler_mode = sampler_pooler_mode
+ # pdb.set_trace()
+ if not hasattr(self, 'region_geo_sampler'):
+ self.region_geo_sampler = GeoRegionSampler(input_dim=self.config.mm_hidden_size,
+ output_dim=self.config.hidden_size,
+ num_init_point=self.max_sample_point,
+ num_sub_point=[128, 32],
+ num_neighbor=[24, 24],
+ pooler_mode=sampler_pooler_mode
+ )
+ else:
+ self.config.region_fea_adapter = True
+ if not hasattr(self, 'region_fea_adapter'):
+ self.region_fea_adapter = nn.Linear(self.config.mm_hidden_size, self.config.hidden_size)
+
+ if pretrain_mm_mlp_adapter is not None:
+ mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location='cpu')
+ def get_w(weights, keyword):
+ return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k}
+
+ self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
+
+
+class FERRETMetaForCausalLM(ABC):
+
+ @abstractmethod
+ def get_model(self):
+ pass
+
+ def get_vision_tower(self):
+ return self.get_model().get_vision_tower()
+
+ def encode_images(self, images, region_flag=False, region_geo_sampler=False):
+ image_features = self.get_model().get_vision_tower()(images)
+ projected_image_features = self.get_model().mm_projector(image_features)
+
+ if region_flag:
+ if region_geo_sampler:
+ new_region_feature_map = image_features
+ else:
+ new_region_feature_map = self.get_model().region_fea_adapter(image_features)
+ else:
+ new_region_feature_map = None
+
+ return image_features, projected_image_features, new_region_feature_map
+
+
+ def extract_region_feature(self, region_feature_map, region_masks, original_dtype, return_dtype):
+ all_region_features = []
+ assert len(region_feature_map) == len(region_masks)
+ for region_feature_map_i, region_masks_list_i in zip(region_feature_map, region_masks):
+ if len(region_masks_list_i) == 0:
+ all_region_features.append(None)
+ else:
+ # (w, h)
+ ori_image_wh = torch.tensor([region_masks_list_i[0].shape[0], region_masks_list_i[0].shape[1]], device=region_masks_list_i[0].device)[None,]
+ # list of elements of shape [num_sample_point, 2]
+ non_zero_pos = [rand_sample((m.nonzero()/ori_image_wh), self.get_model().max_sample_point) for m in region_masks_list_i]
+ # [num_mask, num_sample_point(padded), 2]
+ non_zero_pos = nn.utils.rnn.pad_sequence(non_zero_pos, padding_value=-1, batch_first=True)
+ non_zero_pos_mask = ~(non_zero_pos.sum(dim=-1) < 0)
+ # [HxW, C] -> [H, W, C] -> [C, H, W] -> [N, C, H, W]
+ h = w = int(math.sqrt(region_feature_map_i.shape[0]))
+ c = region_feature_map_i.shape[-1]
+ dup_region_feature_map_i = region_feature_map_i.reshape(h, w, c).permute(2, 0, 1)
+ dup_region_feature_map_i = dup_region_feature_map_i.unsqueeze(0).repeat(non_zero_pos.shape[0], 1, 1, 1)
+ # [num_mask, C, H, W] x [num_mask, num_sample_point(padded), 2] -> [num_mask, C, num_sample_point(padded)]
+                # F.grid_sample doesn't support BF16. Need to transform into float32 then transform back.
+ dup_region_feature_map_i_ori_type = dup_region_feature_map_i.to(original_dtype)
+ # pdb.set_trace()
+ region_feature_i = point_sample(dup_region_feature_map_i_ori_type,
+ non_zero_pos.flip(dims=(2,)).type(original_dtype),
+ return_dtype,
+ align_corners=True
+ )
+ region_feature_i = region_feature_i.to(dup_region_feature_map_i.dtype)
+ # [num_mask, C]
+ region_feature_i = torch.stack([x[m].mean(dim=0) for x, m in zip(region_feature_i.transpose(1,2), non_zero_pos_mask)]).nan_to_num()
+ all_region_features.append(region_feature_i)
+
+ return all_region_features
+
+
+ def prepare_inputs_labels_for_multimodal(
+ self, input_ids, attention_mask, past_key_values, labels, images, region_masks
+ ):
+ if region_masks is not None:
+ region_flag = True
+ else:
+ region_flag = False
+ region_geo_sampler = region_flag and getattr(self.config, 'region_geo_sampler', False)
+
+ vision_tower = self.get_vision_tower()
+ if vision_tower is None or images is None or input_ids.shape[1] == 1:
+ if past_key_values is not None and vision_tower is not None and images is not None and input_ids.shape[1] == 1:
+ attention_mask = torch.ones((attention_mask.shape[0], past_key_values[-1][-1].shape[-2] + 1), dtype=attention_mask.dtype, device=attention_mask.device)
+ return input_ids, attention_mask, past_key_values, None, labels
+
+ if type(images) is list or images.ndim == 5:
+ assert region_flag == False
+ concat_images = torch.cat([image for image in images], dim=0)
+ raw_image_features, image_features, region_feature_map = self.encode_images(concat_images, region_flag, region_geo_sampler)
+ # image_features = self.encode_images(concat_images)
+ split_sizes = [image.shape[0] for image in images]
+ image_features = torch.split(image_features, split_sizes, dim=0)
+ image_features = [x.flatten(0, 1) for x in image_features]
+ else:
+ raw_image_features, image_features, region_feature_map = self.encode_images(images, region_flag, region_geo_sampler)
+
+ if region_flag:
+ if region_geo_sampler:
+ # pdb.set_trace()
+ region_features = self.get_model().region_geo_sampler(region_feature_map, region_masks,
+ original_dtype=raw_image_features.dtype,
+ return_dtype=image_features.dtype)
+ else:
+ region_features = self.extract_region_feature(region_feature_map, region_masks,
+ original_dtype=raw_image_features.dtype,
+ return_dtype=image_features.dtype)
+ assert len(region_features) == len(input_ids)
+
+ new_input_embeds = []
+ new_labels = [] if labels is not None else None
+ cur_image_idx = 0
+ for batch_idx, cur_input_ids in enumerate(input_ids):
+ if (cur_input_ids == IMAGE_TOKEN_INDEX).sum() == 0:
+ # multimodal LLM, but the current sample is not multimodal
+ cur_input_embeds = self.get_model().embed_tokens(cur_input_ids)
+ cur_input_embeds = cur_input_embeds + (0. * self.get_model().mm_projector(vision_tower.dummy_feature)).sum()
+ new_input_embeds.append(cur_input_embeds)
+ if labels is not None:
+ new_labels.append(labels[batch_idx])
+ cur_image_idx += 1
+ continue
+ image_token_indices = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0]
+ cur_new_input_embeds = []
+ if labels is not None:
+ cur_labels = labels[batch_idx]
+ cur_new_labels = []
+ assert cur_labels.shape == cur_input_ids.shape
+ while image_token_indices.numel() > 0:
+ cur_image_features = image_features[cur_image_idx]
+ image_token_start = image_token_indices[0]
+ if region_flag:
+ assert (cur_input_ids[:image_token_start] == self.config.im_region_fea_token).sum() == 0
+ # If start-end tokens are not used, the pretrained checkpoint only contains the mm projector.
+ if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
+ cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[:image_token_start-1]).detach())
+ cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[image_token_start-1:image_token_start]))
+ cur_new_input_embeds.append(cur_image_features)
+ cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[image_token_start+1:image_token_start+2]))
+ if labels is not None:
+ cur_new_labels.append(cur_labels[:image_token_start])
+ cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=labels.device, dtype=labels.dtype))
+ cur_new_labels.append(cur_labels[image_token_start:image_token_start+1])
+ cur_labels = cur_labels[image_token_start+2:]
+ else:
+ cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[:image_token_start]))
+ cur_new_input_embeds.append(cur_image_features)
+ if labels is not None:
+ cur_new_labels.append(cur_labels[:image_token_start])
+ cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=labels.device, dtype=labels.dtype))
+ cur_labels = cur_labels[image_token_start+1:]
+ cur_image_idx += 1
+ if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
+ cur_input_ids = cur_input_ids[image_token_start+2:]
+ else:
+ cur_input_ids = cur_input_ids[image_token_start+1:]
+ image_token_indices = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0]
+ if cur_input_ids.numel() > 0:
+ if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
+ text_input_embeds = self.get_model().embed_tokens(cur_input_ids).detach()
+ else:
+ text_input_embeds = self.get_model().embed_tokens(cur_input_ids)
+ if labels is not None:
+ cur_new_labels.append(cur_labels)
+
+ # Add region feature into text feature embeddings.
+ assert batch_idx+1 == cur_image_idx
+ if region_flag and region_features[batch_idx] is not None:
+ region_embs = torch.zeros_like(text_input_embeds)
+ region_replace_mask = (cur_input_ids == self.config.im_region_fea_token)
+ # pdb.set_trace()
+ region_embs[region_replace_mask] = region_features[batch_idx].to(text_input_embeds.dtype)
+ text_input_embeds = text_input_embeds * (~region_replace_mask).to(text_input_embeds.dtype)[:, None] + region_embs
+ # print('region_embs[..., 0].nonzero()', region_embs[..., 0].nonzero())
+ # raise NotImplementedError()
+ # pdb.set_trace()
+ else:
+ if hasattr(self.config, 'im_region_fea_token'):
+ assert (cur_input_ids == self.config.im_region_fea_token).sum() == 0
+
+ cur_new_input_embeds.append(text_input_embeds)
+ cur_new_input_embeds = [x.to(device=self.device) for x in cur_new_input_embeds]
+ cur_new_input_embeds = torch.cat(cur_new_input_embeds, dim=0)
+ new_input_embeds.append(cur_new_input_embeds)
+ if labels is not None:
+ cur_new_labels = torch.cat(cur_new_labels, dim=0)
+ new_labels.append(cur_new_labels)
+
+ if any(x.shape != new_input_embeds[0].shape for x in new_input_embeds):
+ max_len = max(x.shape[0] for x in new_input_embeds)
+
+ new_input_embeds_align = []
+ for cur_new_embed in new_input_embeds:
+ cur_new_embed = torch.cat((cur_new_embed, torch.zeros((max_len - cur_new_embed.shape[0], cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0)
+ new_input_embeds_align.append(cur_new_embed)
+ new_input_embeds = torch.stack(new_input_embeds_align, dim=0)
+
+ if labels is not None:
+ new_labels_align = []
+ _new_labels = new_labels
+ for cur_new_label in new_labels:
+ cur_new_label = torch.cat((cur_new_label, torch.full((max_len - cur_new_label.shape[0],), IGNORE_INDEX, dtype=cur_new_label.dtype, device=cur_new_label.device)), dim=0)
+ new_labels_align.append(cur_new_label)
+ new_labels = torch.stack(new_labels_align, dim=0)
+
+ if attention_mask is not None:
+ new_attention_mask = []
+ for cur_attention_mask, cur_new_labels, cur_new_labels_align in zip(attention_mask, _new_labels, new_labels):
+ new_attn_mask_pad_left = torch.full((cur_new_labels.shape[0] - labels.shape[1],), True, dtype=attention_mask.dtype, device=attention_mask.device)
+ new_attn_mask_pad_right = torch.full((cur_new_labels_align.shape[0] - cur_new_labels.shape[0],), False, dtype=attention_mask.dtype, device=attention_mask.device)
+ cur_new_attention_mask = torch.cat((new_attn_mask_pad_left, cur_attention_mask, new_attn_mask_pad_right), dim=0)
+ new_attention_mask.append(cur_new_attention_mask)
+ attention_mask = torch.stack(new_attention_mask, dim=0)
+ assert attention_mask.shape == new_labels.shape
+ else:
+ new_input_embeds = torch.stack(new_input_embeds, dim=0)
+ if labels is not None:
+ new_labels = torch.stack(new_labels, dim=0)
+
+ if attention_mask is not None:
+ new_attn_mask_pad_left = torch.full((attention_mask.shape[0], new_input_embeds.shape[1] - input_ids.shape[1]), True, dtype=attention_mask.dtype, device=attention_mask.device)
+ attention_mask = torch.cat((new_attn_mask_pad_left, attention_mask), dim=1)
+ assert attention_mask.shape == new_input_embeds.shape[:2]
+
+ return None, attention_mask, past_key_values, new_input_embeds, new_labels
+
+ def initialize_vision_tokenizer(self, model_args, tokenizer, add_region_feature=False):
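+ # Register the extra special tokens (image patch, region feature, image start/end),
+ # resize the embedding tables, and, when start/end tokens are added, initialize the
+ # new rows with the mean of the existing embeddings (optionally loading them from a
+ # pretrained mm projector checkpoint).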
+ if model_args.mm_use_im_patch_token:
+ tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
+ self.resize_token_embeddings(len(tokenizer))
+
+ if add_region_feature:
+ num_region_fea_tokens = tokenizer.add_tokens([DEFAULT_REGION_FEA_TOKEN], special_tokens=True)
+ self.config.im_region_fea_token = tokenizer.convert_tokens_to_ids([DEFAULT_REGION_FEA_TOKEN])[0]
+ self.resize_token_embeddings(len(tokenizer))
+
+ if model_args.mm_use_im_start_end:
+ num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
+ self.resize_token_embeddings(len(tokenizer))
+
+ if add_region_feature:
+ num_new_tokens = num_new_tokens + num_region_fea_tokens
+
+ if num_new_tokens > 0:
+ input_embeddings = self.get_input_embeddings().weight.data
+ output_embeddings = self.get_output_embeddings().weight.data
+
+ input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
+ dim=0, keepdim=True)
+ output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
+ dim=0, keepdim=True)
+
+ input_embeddings[-num_new_tokens:] = input_embeddings_avg
+ output_embeddings[-num_new_tokens:] = output_embeddings_avg
+
+ if model_args.tune_mm_mlp_adapter:
+ for p in self.get_input_embeddings().parameters():
+ p.requires_grad = True
+ for p in self.get_output_embeddings().parameters():
+ p.requires_grad = False
+
+ if model_args.pretrain_mm_mlp_adapter:
+ mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu')
+ embed_tokens_weight = mm_projector_weights['model.embed_tokens.weight']
+ if add_region_feature:
+ num_new_tokens = num_new_tokens - num_region_fea_tokens
+ assert num_new_tokens == 2
+ if input_embeddings.shape == embed_tokens_weight.shape:
+ input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
+ elif embed_tokens_weight.shape[0] == num_new_tokens:
+ input_embeddings[-num_new_tokens:] = embed_tokens_weight
+ else:
+ raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Number of new tokens: {num_new_tokens}.")
+ elif model_args.mm_use_im_patch_token:
+ if model_args.tune_mm_mlp_adapter:
+ for p in self.get_input_embeddings().parameters():
+ p.requires_grad = False
+ for p in self.get_output_embeddings().parameters():
+ p.requires_grad = False
diff --git a/ferret/model/language_model/ferret_llama.py b/ferret/model/language_model/ferret_llama.py
new file mode 100644
index 0000000..99ced13
--- /dev/null
+++ b/ferret/model/language_model/ferret_llama.py
@@ -0,0 +1,139 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.nn as nn
+from torch.nn import CrossEntropyLoss
+
+from transformers import AutoConfig, AutoModelForCausalLM, \
+ LlamaConfig, LlamaModel, LlamaForCausalLM
+
+from transformers.modeling_outputs import CausalLMOutputWithPast
+
+from ..ferret_arch import FERRETMetaModel, FERRETMetaForCausalLM
+
+
+class FERRETConfig(LlamaConfig):
+ model_type = "ferret"
+
+
+class FERRETLlamaModel(FERRETMetaModel, LlamaModel):
+ config_class = FERRETConfig
+
+ def __init__(self, config: LlamaConfig):
+ super(FERRETLlamaModel, self).__init__(config)
+
+
+class FERRETLlamaForCausalLM(LlamaForCausalLM, FERRETMetaForCausalLM):
+ config_class = FERRETConfig
+
+ def __init__(self, config):
+ super(LlamaForCausalLM, self).__init__(config)
+ self.model = FERRETLlamaModel(config)
+
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ def get_model(self):
+ return self.model
+
+ def forward(
+ self,
+ input_ids: torch.LongTensor = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ region_masks: Optional[List[torch.Tensor]] = None,
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
+ inputs_embeds: Optional[torch.FloatTensor] = None,
+ labels: Optional[torch.LongTensor] = None,
+ use_cache: Optional[bool] = None,
+ output_attentions: Optional[bool] = None,
+ output_hidden_states: Optional[bool] = None,
+ images: Optional[torch.FloatTensor] = None,
+ return_dict: Optional[bool] = None,
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images, region_masks=region_masks)
+
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+ outputs = self.model(
+ input_ids=input_ids,
+ attention_mask=attention_mask,
+ past_key_values=past_key_values,
+ inputs_embeds=inputs_embeds,
+ use_cache=use_cache,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict
+ )
+
+ hidden_states = outputs[0]
+ logits = self.lm_head(hidden_states)
+
+ loss = None
+ if labels is not None:
+ # Shift so that tokens < n predict n
+ shift_logits = logits[..., :-1, :].contiguous()
+ shift_labels = labels[..., 1:].contiguous()
+ # Flatten the tokens
+ loss_fct = CrossEntropyLoss()
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
+ shift_labels = shift_labels.view(-1)
+ # Enable model/pipeline parallelism
+ shift_labels = shift_labels.to(shift_logits.device)
+ loss = loss_fct(shift_logits, shift_labels)
+
+ if not return_dict:
+ output = (logits,) + outputs[1:]
+ return (loss,) + output if loss is not None else output
+
+ return CausalLMOutputWithPast(
+ loss=loss,
+ logits=logits,
+ past_key_values=outputs.past_key_values,
+ hidden_states=outputs.hidden_states,
+ attentions=outputs.attentions,
+ )
+
+ def prepare_inputs_for_generation(
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
+ ):
+ if past_key_values:
+ input_ids = input_ids[:, -1:]
+
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+ if inputs_embeds is not None and past_key_values is None:
+ model_inputs = {"inputs_embeds": inputs_embeds}
+ else:
+ model_inputs = {"input_ids": input_ids}
+
+ model_inputs.update(
+ {
+ "past_key_values": past_key_values,
+ "use_cache": kwargs.get("use_cache"),
+ "attention_mask": attention_mask,
+ "images": kwargs.get("images", None),
+ }
+ )
+ return model_inputs
+
+AutoConfig.register("ferret", FERRETConfig)
+AutoModelForCausalLM.register(FERRETConfig, FERRETLlamaForCausalLM)
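+
+# Minimal usage sketch, assuming a local FERRET checkpoint directory (the path below is
+# illustrative): once registered above, the model loads through the Auto classes.
+#   from transformers import AutoModelForCausalLM
+#   model = AutoModelForCausalLM.from_pretrained("path/to/ferret-checkpoint")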
diff --git a/ferret/model/multimodal_encoder/builder.py b/ferret/model/multimodal_encoder/builder.py
new file mode 100644
index 0000000..2b13589
--- /dev/null
+++ b/ferret/model/multimodal_encoder/builder.py
@@ -0,0 +1,11 @@
+import os
+from .clip_encoder import CLIPVisionTower
+
+
+def build_vision_tower(vision_tower_cfg, **kwargs):
+ vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None))
+ is_absolute_path_exists = os.path.exists(vision_tower)
+ if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion"):
+ return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
+
+ raise ValueError(f'Unknown vision tower: {vision_tower}')
diff --git a/ferret/model/multimodal_encoder/clip_encoder.py b/ferret/model/multimodal_encoder/clip_encoder.py
new file mode 100644
index 0000000..39ad057
--- /dev/null
+++ b/ferret/model/multimodal_encoder/clip_encoder.py
@@ -0,0 +1,123 @@
+import torch
+import torch.nn as nn
+
+from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig
+# Added for customized Processor.
+import math
+import numpy as np
+from typing import Dict
+from transformers.image_utils import PILImageResampling, ChannelDimension
+from transformers.image_processing_utils import get_size_dict
+from transformers.image_transforms import (
+ get_resize_output_image_size,
+ resize,
+)
+from typing import List, Optional, Tuple, Union
+
+class CLIPImageProcessor_GIT(CLIPImageProcessor):
+ def resize(
+ self,
+ image: np.ndarray,
+ size: Dict[str, int],
+ resample: PILImageResampling = PILImageResampling.BICUBIC,
+ data_format: Optional[Union[str, ChannelDimension]] = None,
+ **kwargs,
+ ) -> np.ndarray:
+ """
+ Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
+ resized to keep the input aspect ratio.
+ Args:
+ image (`np.ndarray`):
+ Image to resize.
+ size (`Dict[str, int]`):
+ Size of the output image.
+ resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
+ Resampling filter to use when resizing the image.
+ data_format (`str` or `ChannelDimension`, *optional*):
+ The channel dimension format of the image. If not provided, it will be the same as the input image.
+ """
+ size = get_size_dict(size, default_to_square=True, height_width_order=True)
+ # Hack(haoxuan): Bypass the shortest_edge detection. We hope to get a {"height": size[0], "width": size[1]}, where w=h.
+ # if "shortest_edge" not in size:
+ # raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}")
+ # output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=True)
+ output_size = get_resize_output_image_size(image, size=(size["height"], size["width"]), default_to_square=True)
+ return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs)
+
+
+class CLIPVisionTower(nn.Module):
+ def __init__(self, vision_tower, args, delay_load=False):
+ super().__init__()
+
+ self.is_loaded = False
+
+ self.vision_tower_name = vision_tower
+ self.select_layer = args.mm_vision_select_layer
+ self.select_feature = getattr(args, 'mm_vision_select_feature', 'patch')
+
+ if not delay_load:
+ self.load_model()
+ else:
+ self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
+
+ def load_model(self, vision_tower_path=None):
+ self.image_processor = CLIPImageProcessor_GIT.from_pretrained(self.vision_tower_name)
+ if vision_tower_path is not None:
+ self.vision_tower, loading_info = CLIPVisionModel.from_pretrained(vision_tower_path, output_loading_info=True)
+ print('loading_info:', loading_info)
+ else:
+ self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
+ self.vision_tower.requires_grad_(False)
+
+ self.is_loaded = True
+
+ def feature_select(self, image_forward_outs):
+ image_features = image_forward_outs.hidden_states[self.select_layer]
+ if self.select_feature == 'patch':
+ image_features = image_features[:, 1:]
+ elif self.select_feature == 'cls_patch':
+ image_features = image_features
+ else:
+ raise ValueError(f'Unexpected select feature: {self.select_feature}')
+ return image_features
+
+ @torch.no_grad()
+ def forward(self, images):
+ if type(images) is list:
+ image_features = []
+ for image in images:
+ image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
+ image_feature = self.feature_select(image_forward_out).to(image.dtype)
+ image_features.append(image_feature)
+ else:
+ image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
+ image_features = self.feature_select(image_forward_outs).to(images.dtype)
+
+ return image_features
+
+ @property
+ def dummy_feature(self):
+ return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
+
+ @property
+ def dtype(self):
+ return self.vision_tower.dtype
+
+ @property
+ def device(self):
+ return self.vision_tower.device
+
+ @property
+ def config(self):
+ if self.is_loaded:
+ return self.vision_tower.config
+ else:
+ return self.cfg_only
+
+ @property
+ def hidden_size(self):
+ return self.config.hidden_size
+
+ @property
+ def num_patches(self):
+ return (self.config.image_size // self.config.patch_size) ** 2
diff --git a/ferret/model/utils.py b/ferret/model/utils.py
new file mode 100644
index 0000000..bbdf3b2
--- /dev/null
+++ b/ferret/model/utils.py
@@ -0,0 +1,20 @@
+from transformers import AutoConfig
+
+
+def auto_upgrade(config):
+ cfg = AutoConfig.from_pretrained(config)
+ if 'llava' in config and 'llava' not in cfg.model_type:
+ assert cfg.model_type == 'llama'
+ print("You are using newer LLaVA code base, while the checkpoint of v0 is from older code base.")
+ print("You must upgrade the checkpoint to the new code base (this can be done automatically).")
+ confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]")
+ if confirm.lower() in ["y", "yes"]:
+ print("Upgrading checkpoint...")
+ assert len(cfg.architectures) == 1
+ setattr(cfg.__class__, "model_type", "llava")
+ cfg.architectures[0] = 'FERRETLlamaForCausalLM'
+ cfg.save_pretrained(config)
+ print("Checkpoint upgraded.")
+ else:
+ print("Checkpoint upgrade aborted.")
+ exit(1)
diff --git a/ferret/serve/__init__.py b/ferret/serve/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/ferret/serve/controller.py b/ferret/serve/controller.py
new file mode 100644
index 0000000..e01cc3a
--- /dev/null
+++ b/ferret/serve/controller.py
@@ -0,0 +1,298 @@
+"""
+A controller manages distributed workers.
+It sends worker addresses to clients.
+"""
+import argparse
+import asyncio
+import dataclasses
+from enum import Enum, auto
+import json
+import logging
+import time
+from typing import List, Union
+import threading
+
+from fastapi import FastAPI, Request
+from fastapi.responses import StreamingResponse
+import numpy as np
+import requests
+import uvicorn
+
+from ferret.constants import CONTROLLER_HEART_BEAT_EXPIRATION
+from ferret.utils import build_logger, server_error_msg
+
+
+logger = build_logger("controller", "controller.log")
+
+
+class DispatchMethod(Enum):
+ LOTTERY = auto()
+ SHORTEST_QUEUE = auto()
+
+ @classmethod
+ def from_str(cls, name):
+ if name == "lottery":
+ return cls.LOTTERY
+ elif name == "shortest_queue":
+ return cls.SHORTEST_QUEUE
+ else:
+ raise ValueError(f"Invalid dispatch method: {name}")
+
+
+@dataclasses.dataclass
+class WorkerInfo:
+ model_names: List[str]
+ speed: int
+ queue_length: int
+ check_heart_beat: bool
+ last_heart_beat: str
+
+
+def heart_beat_controller(controller):
+ while True:
+ time.sleep(CONTROLLER_HEART_BEAT_EXPIRATION)
+ controller.remove_stable_workers_by_expiration()
+
+
+class Controller:
+ def __init__(self, dispatch_method: str):
+ # Dict[str -> WorkerInfo]
+ self.worker_info = {}
+ self.dispatch_method = DispatchMethod.from_str(dispatch_method)
+
+ self.heart_beat_thread = threading.Thread(
+ target=heart_beat_controller, args=(self,))
+ self.heart_beat_thread.start()
+
+ logger.info("Init controller")
+
+ def register_worker(self, worker_name: str, check_heart_beat: bool,
+ worker_status: dict):
+ if worker_name not in self.worker_info:
+ logger.info(f"Register a new worker: {worker_name}")
+ else:
+ logger.info(f"Register an existing worker: {worker_name}")
+
+ if not worker_status:
+ worker_status = self.get_worker_status(worker_name)
+ if not worker_status:
+ return False
+
+ self.worker_info[worker_name] = WorkerInfo(
+ worker_status["model_names"], worker_status["speed"], worker_status["queue_length"],
+ check_heart_beat, time.time())
+
+ logger.info(f"Register done: {worker_name}, {worker_status}")
+ return True
+
+ def get_worker_status(self, worker_name: str):
+ try:
+ r = requests.post(worker_name + "/worker_get_status", timeout=5)
+ except requests.exceptions.RequestException as e:
+ logger.error(f"Get status fails: {worker_name}, {e}")
+ return None
+
+ if r.status_code != 200:
+ logger.error(f"Get status fails: {worker_name}, {r}")
+ return None
+
+ return r.json()
+
+ def remove_worker(self, worker_name: str):
+ del self.worker_info[worker_name]
+
+ def refresh_all_workers(self):
+ old_info = dict(self.worker_info)
+ self.worker_info = {}
+
+ for w_name, w_info in old_info.items():
+ if not self.register_worker(w_name, w_info.check_heart_beat, None):
+ logger.info(f"Remove stale worker: {w_name}")
+
+ def list_models(self):
+ model_names = set()
+
+ for w_name, w_info in self.worker_info.items():
+ model_names.update(w_info.model_names)
+
+ return list(model_names)
+
+ def get_worker_address(self, model_name: str):
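+ # Choose a worker serving `model_name`: LOTTERY samples workers with probability
+ # proportional to their advertised speed, SHORTEST_QUEUE picks the worker with the
+ # smallest queue_length / speed ratio.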
+ if self.dispatch_method == DispatchMethod.LOTTERY:
+ worker_names = []
+ worker_speeds = []
+ for w_name, w_info in self.worker_info.items():
+ if model_name in w_info.model_names:
+ worker_names.append(w_name)
+ worker_speeds.append(w_info.speed)
+ worker_speeds = np.array(worker_speeds, dtype=np.float32)
+ norm = np.sum(worker_speeds)
+ if norm < 1e-4:
+ return ""
+ worker_speeds = worker_speeds / norm
+ if True: # Directly return address
+ pt = np.random.choice(np.arange(len(worker_names)),
+ p=worker_speeds)
+ worker_name = worker_names[pt]
+ return worker_name
+
+ # Check status before returning
+ while True:
+ pt = np.random.choice(np.arange(len(worker_names)),
+ p=worker_speeds)
+ worker_name = worker_names[pt]
+
+ if self.get_worker_status(worker_name):
+ break
+ else:
+ self.remove_worker(worker_name)
+ worker_speeds[pt] = 0
+ norm = np.sum(worker_speeds)
+ if norm < 1e-4:
+ return ""
+ worker_speeds = worker_speeds / norm
+ continue
+ return worker_name
+ elif self.dispatch_method == DispatchMethod.SHORTEST_QUEUE:
+ worker_names = []
+ worker_qlen = []
+ for w_name, w_info in self.worker_info.items():
+ if model_name in w_info.model_names:
+ worker_names.append(w_name)
+ worker_qlen.append(w_info.queue_length / w_info.speed)
+ if len(worker_names) == 0:
+ return ""
+ min_index = np.argmin(worker_qlen)
+ w_name = worker_names[min_index]
+ self.worker_info[w_name].queue_length += 1
+ logger.info(f"names: {worker_names}, queue_lens: {worker_qlen}, ret: {w_name}")
+ return w_name
+ else:
+ raise ValueError(f"Invalid dispatch method: {self.dispatch_method}")
+
+ def receive_heart_beat(self, worker_name: str, queue_length: int):
+ if worker_name not in self.worker_info:
+ logger.info(f"Receive unknown heart beat. {worker_name}")
+ return False
+
+ self.worker_info[worker_name].queue_length = queue_length
+ self.worker_info[worker_name].last_heart_beat = time.time()
+ logger.info(f"Receive heart beat. {worker_name}")
+ return True
+
+ def remove_stable_workers_by_expiration(self):
+ expire = time.time() - CONTROLLER_HEART_BEAT_EXPIRATION
+ to_delete = []
+ for worker_name, w_info in self.worker_info.items():
+ if w_info.check_heart_beat and w_info.last_heart_beat < expire:
+ to_delete.append(worker_name)
+
+ for worker_name in to_delete:
+ self.remove_worker(worker_name)
+
+ def worker_api_generate_stream(self, params):
+ worker_addr = self.get_worker_address(params["model"])
+ if not worker_addr:
+ logger.info(f"no worker: {params['model']}")
+ ret = {
+ "text": server_error_msg,
+ "error_code": 2,
+ }
+ yield json.dumps(ret).encode() + b"\0"
+
+ try:
+ response = requests.post(worker_addr + "/worker_generate_stream",
+ json=params, stream=True, timeout=5)
+ for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
+ if chunk:
+ yield chunk + b"\0"
+ except requests.exceptions.RequestException as e:
+ logger.info(f"worker timeout: {worker_addr}")
+ ret = {
+ "text": server_error_msg,
+ "error_code": 3,
+ }
+ yield json.dumps(ret).encode() + b"\0"
+
+
+ # Let the controller act as a worker to achieve hierarchical
+ # management. This can be used to connect isolated sub networks.
+ def worker_api_get_status(self):
+ model_names = set()
+ speed = 0
+ queue_length = 0
+
+ for w_name in self.worker_info:
+ worker_status = self.get_worker_status(w_name)
+ if worker_status is not None:
+ model_names.update(worker_status["model_names"])
+ speed += worker_status["speed"]
+ queue_length += worker_status["queue_length"]
+
+ return {
+ "model_names": list(model_names),
+ "speed": speed,
+ "queue_length": queue_length,
+ }
+
+
+app = FastAPI()
+
+
+@app.post("/register_worker")
+async def register_worker(request: Request):
+ data = await request.json()
+ controller.register_worker(
+ data["worker_name"], data["check_heart_beat"],
+ data.get("worker_status", None))
+
+
+@app.post("/refresh_all_workers")
+async def refresh_all_workers():
+ models = controller.refresh_all_workers()
+
+
+@app.post("/list_models")
+async def list_models():
+ models = controller.list_models()
+ return {"models": models}
+
+
+@app.post("/get_worker_address")
+async def get_worker_address(request: Request):
+ data = await request.json()
+ addr = controller.get_worker_address(data["model"])
+ return {"address": addr}
+
+
+@app.post("/receive_heart_beat")
+async def receive_heart_beat(request: Request):
+ data = await request.json()
+ exist = controller.receive_heart_beat(
+ data["worker_name"], data["queue_length"])
+ return {"exist": exist}
+
+
+@app.post("/worker_generate_stream")
+async def worker_api_generate_stream(request: Request):
+ params = await request.json()
+ generator = controller.worker_api_generate_stream(params)
+ return StreamingResponse(generator)
+
+
+@app.post("/worker_get_status")
+async def worker_api_get_status(request: Request):
+ return controller.worker_api_get_status()
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--host", type=str, default="localhost")
+ parser.add_argument("--port", type=int, default=21001)
+ parser.add_argument("--dispatch-method", type=str, choices=[
+ "lottery", "shortest_queue"], default="shortest_queue")
+ args = parser.parse_args()
+ logger.info(f"args: {args}")
+
+ controller = Controller(args.dispatch_method)
+ uvicorn.run(app, host=args.host, port=args.port, log_level="info")
diff --git a/ferret/serve/dejavu/.uuid b/ferret/serve/dejavu/.uuid
new file mode 100644
index 0000000..9bdb22d
--- /dev/null
+++ b/ferret/serve/dejavu/.uuid
@@ -0,0 +1 @@
+06a92cd7-c698-4e59-b980-58e4bc162946
\ No newline at end of file
diff --git a/ferret/serve/dejavu/DejaVuSans-Bold.ttf b/ferret/serve/dejavu/DejaVuSans-Bold.ttf
new file mode 100644
index 0000000..06db62c
Binary files /dev/null and b/ferret/serve/dejavu/DejaVuSans-Bold.ttf differ
diff --git a/ferret/serve/dejavu/DejaVuSans.ttf b/ferret/serve/dejavu/DejaVuSans.ttf
new file mode 100644
index 0000000..2fbbe69
Binary files /dev/null and b/ferret/serve/dejavu/DejaVuSans.ttf differ
diff --git a/ferret/serve/dejavu/DejaVuSansMono-Bold.ttf b/ferret/serve/dejavu/DejaVuSansMono-Bold.ttf
new file mode 100644
index 0000000..b210eb5
Binary files /dev/null and b/ferret/serve/dejavu/DejaVuSansMono-Bold.ttf differ
diff --git a/ferret/serve/dejavu/DejaVuSansMono.ttf b/ferret/serve/dejavu/DejaVuSansMono.ttf
new file mode 100644
index 0000000..041cffc
Binary files /dev/null and b/ferret/serve/dejavu/DejaVuSansMono.ttf differ
diff --git a/ferret/serve/dejavu/DejaVuSerif-Bold.ttf b/ferret/serve/dejavu/DejaVuSerif-Bold.ttf
new file mode 100644
index 0000000..8162112
Binary files /dev/null and b/ferret/serve/dejavu/DejaVuSerif-Bold.ttf differ
diff --git a/ferret/serve/dejavu/DejaVuSerif.ttf b/ferret/serve/dejavu/DejaVuSerif.ttf
new file mode 100644
index 0000000..1b62977
Binary files /dev/null and b/ferret/serve/dejavu/DejaVuSerif.ttf differ
diff --git a/ferret/serve/examples/bathroom.jpg b/ferret/serve/examples/bathroom.jpg
new file mode 100644
index 0000000..cb1bbac
Binary files /dev/null and b/ferret/serve/examples/bathroom.jpg differ
diff --git a/ferret/serve/examples/ferret.jpg b/ferret/serve/examples/ferret.jpg
new file mode 100644
index 0000000..be3ff25
Binary files /dev/null and b/ferret/serve/examples/ferret.jpg differ
diff --git a/ferret/serve/examples/flickr_9472793441.jpg b/ferret/serve/examples/flickr_9472793441.jpg
new file mode 100644
index 0000000..f9198f2
Binary files /dev/null and b/ferret/serve/examples/flickr_9472793441.jpg differ
diff --git a/ferret/serve/examples/kitchen.png b/ferret/serve/examples/kitchen.png
new file mode 100644
index 0000000..e02cf99
Binary files /dev/null and b/ferret/serve/examples/kitchen.png differ
diff --git a/ferret/serve/examples/make_sandwitch.jpg b/ferret/serve/examples/make_sandwitch.jpg
new file mode 100644
index 0000000..6c1d0a5
Binary files /dev/null and b/ferret/serve/examples/make_sandwitch.jpg differ
diff --git a/ferret/serve/examples/room_planning.jpg b/ferret/serve/examples/room_planning.jpg
new file mode 100644
index 0000000..4a58ee2
Binary files /dev/null and b/ferret/serve/examples/room_planning.jpg differ
diff --git a/ferret/serve/examples/waterview.jpg b/ferret/serve/examples/waterview.jpg
new file mode 100644
index 0000000..6f44eba
Binary files /dev/null and b/ferret/serve/examples/waterview.jpg differ
diff --git a/ferret/serve/gradio_css.py b/ferret/serve/gradio_css.py
new file mode 100644
index 0000000..5545413
--- /dev/null
+++ b/ferret/serve/gradio_css.py
@@ -0,0 +1,73 @@
+code_highlight_css = (
+"""
+#chatbot .hll { background-color: #ffffcc }
+#chatbot .c { color: #408080; font-style: italic }
+#chatbot .err { border: 1px solid #FF0000 }
+#chatbot .k { color: #008000; font-weight: bold }
+#chatbot .o { color: #666666 }
+#chatbot .ch { color: #408080; font-style: italic }
+#chatbot .cm { color: #408080; font-style: italic }
+#chatbot .cp { color: #BC7A00 }
+#chatbot .cpf { color: #408080; font-style: italic }
+#chatbot .c1 { color: #408080; font-style: italic }
+#chatbot .cs { color: #408080; font-style: italic }
+#chatbot .gd { color: #A00000 }
+#chatbot .ge { font-style: italic }
+#chatbot .gr { color: #FF0000 }
+#chatbot .gh { color: #000080; font-weight: bold }
+#chatbot .gi { color: #00A000 }
+#chatbot .go { color: #888888 }
+#chatbot .gp { color: #000080; font-weight: bold }
+#chatbot .gs { font-weight: bold }
+#chatbot .gu { color: #800080; font-weight: bold }
+#chatbot .gt { color: #0044DD }
+#chatbot .kc { color: #008000; font-weight: bold }
+#chatbot .kd { color: #008000; font-weight: bold }
+#chatbot .kn { color: #008000; font-weight: bold }
+#chatbot .kp { color: #008000 }
+#chatbot .kr { color: #008000; font-weight: bold }
+#chatbot .kt { color: #B00040 }
+#chatbot .m { color: #666666 }
+#chatbot .s { color: #BA2121 }
+#chatbot .na { color: #7D9029 }
+#chatbot .nb { color: #008000 }
+#chatbot .nc { color: #0000FF; font-weight: bold }
+#chatbot .no { color: #880000 }
+#chatbot .nd { color: #AA22FF }
+#chatbot .ni { color: #999999; font-weight: bold }
+#chatbot .ne { color: #D2413A; font-weight: bold }
+#chatbot .nf { color: #0000FF }
+#chatbot .nl { color: #A0A000 }
+#chatbot .nn { color: #0000FF; font-weight: bold }
+#chatbot .nt { color: #008000; font-weight: bold }
+#chatbot .nv { color: #19177C }
+#chatbot .ow { color: #AA22FF; font-weight: bold }
+#chatbot .w { color: #bbbbbb }
+#chatbot .mb { color: #666666 }
+#chatbot .mf { color: #666666 }
+#chatbot .mh { color: #666666 }
+#chatbot .mi { color: #666666 }
+#chatbot .mo { color: #666666 }
+#chatbot .sa { color: #BA2121 }
+#chatbot .sb { color: #BA2121 }
+#chatbot .sc { color: #BA2121 }
+#chatbot .dl { color: #BA2121 }
+#chatbot .sd { color: #BA2121; font-style: italic }
+#chatbot .s2 { color: #BA2121 }
+#chatbot .se { color: #BB6622; font-weight: bold }
+#chatbot .sh { color: #BA2121 }
+#chatbot .si { color: #BB6688; font-weight: bold }
+#chatbot .sx { color: #008000 }
+#chatbot .sr { color: #BB6688 }
+#chatbot .s1 { color: #BA2121 }
+#chatbot .ss { color: #19177C }
+#chatbot .bp { color: #008000 }
+#chatbot .fm { color: #0000FF }
+#chatbot .vc { color: #19177C }
+#chatbot .vg { color: #19177C }
+#chatbot .vi { color: #19177C }
+#chatbot .vm { color: #19177C }
+#chatbot .il { color: #666666 }
+""")
+#.highlight { background: #f8f8f8; }
+
diff --git a/ferret/serve/gradio_web_server.py b/ferret/serve/gradio_web_server.py
new file mode 100644
index 0000000..c90f46c
--- /dev/null
+++ b/ferret/serve/gradio_web_server.py
@@ -0,0 +1,714 @@
+'''
+Usage:
+
+python -m ferret.serve.gradio_web_server --controller http://localhost:10000 --add_region_feature
+'''
+import argparse
+import datetime
+import json
+import os
+import time
+
+import gradio as gr
+import requests
+
+from ferret.conversation import (default_conversation, conv_templates,
+ SeparatorStyle)
+from ferret.constants import LOGDIR
+from ferret.utils import (build_logger, server_error_msg,
+ violates_moderation, moderation_msg)
+import hashlib
+# Added
+import re
+from copy import deepcopy
+from PIL import ImageDraw, ImageFont
+from gradio import processing_utils
+import numpy as np
+import torch
+import torch.nn.functional as F
+from scipy.ndimage import binary_dilation, binary_erosion
+import pdb
+from ferret.serve.gradio_css import code_highlight_css
+
+DEFAULT_REGION_REFER_TOKEN = "[region]"
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+
+logger = build_logger("gradio_web_server", "gradio_web_server.log")
+
+headers = {"User-Agent": "FERRET Client"}
+
+no_change_btn = gr.Button.update()
+enable_btn = gr.Button.update(interactive=True)
+disable_btn = gr.Button.update(interactive=False)
+
+priority = {
+ "vicuna-13b": "aaaaaaa",
+ "koala-13b": "aaaaaab",
+}
+
+VOCAB_IMAGE_W = 1000 # 224
+VOCAB_IMAGE_H = 1000 # 224
+
+def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
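+ # Build a binary mask of shape (raw_w, raw_h) for a referred region: a small window
+ # around a clicked point when `coor` has two values, or a filled box (optionally
+ # intersected with a free-form sketch mask) when `coor` has four values.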
+ if mask is not None:
+ assert mask.shape[0] == raw_w and mask.shape[1] == raw_h
+ coor_mask = torch.zeros((raw_w, raw_h))
+ # Assume it samples a point.
+ if len(coor) == 2:
+ # Define window size
+ span = 5
+ # Make sure the window does not exceed array bounds
+ x_min = max(0, coor[0] - span)
+ x_max = min(raw_w, coor[0] + span + 1)
+ y_min = max(0, coor[1] - span)
+ y_max = min(raw_h, coor[1] + span + 1)
+ coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
+ assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
+ elif len(coor) == 4:
+ # Box input or Sketch input.
+ coor_mask = torch.zeros((raw_w, raw_h))
+ coor_mask[coor[0]:coor[2]+1, coor[1]:coor[3]+1] = 1
+ if mask is not None:
+ coor_mask = coor_mask * mask
+ # coor_mask = torch.from_numpy(coor_mask)
+ # pdb.set_trace()
+ assert len(coor_mask.nonzero()) != 0
+ return coor_mask.tolist()
+
+
+def draw_box(coor, region_mask, region_ph, img, input_mode):
+ colors = ["red"]
+ draw = ImageDraw.Draw(img)
+ font = ImageFont.truetype("./ferret/serve/dejavu/DejaVuSans.ttf", size=18)
+ if input_mode == 'Box':
+ draw.rectangle([coor[0], coor[1], coor[2], coor[3]], outline=colors[0], width=4)
+ draw.rectangle([coor[0], coor[3] - int(font.size * 1.2), coor[0] + int((len(region_ph) + 0.8) * font.size * 0.6), coor[3]], outline=colors[0], fill=colors[0], width=4)
+ draw.text([coor[0] + int(font.size * 0.2), coor[3] - int(font.size*1.2)], region_ph, font=font, fill=(255,255,255))
+ elif input_mode == 'Point':
+ r = 8
+ leftUpPoint = (coor[0]-r, coor[1]-r)
+ rightDownPoint = (coor[0]+r, coor[1]+r)
+ twoPointList = [leftUpPoint, rightDownPoint]
+ draw.ellipse(twoPointList, outline=colors[0], width=4)
+ draw.rectangle([coor[0], coor[1], coor[0] + int((len(region_ph) + 0.8) * font.size * 0.6), coor[1] + int(font.size * 1.2)], outline=colors[0], fill=colors[0], width=4)
+ draw.text([coor[0] + int(font.size * 0.2), coor[1]], region_ph, font=font, fill=(255,255,255))
+ elif input_mode == 'Sketch':
+ draw.rectangle([coor[0], coor[3] - int(font.size * 1.2), coor[0] + int((len(region_ph) + 0.8) * font.size * 0.6), coor[3]], outline=colors[0], fill=colors[0], width=4)
+ draw.text([coor[0] + int(font.size * 0.2), coor[3] - int(font.size*1.2)], region_ph, font=font, fill=(255,255,255))
+ # Use morphological operations to find the boundary
+ mask = np.array(region_mask)
+ dilated = binary_dilation(mask, structure=np.ones((3,3)))
+ eroded = binary_erosion(mask, structure=np.ones((3,3)))
+ boundary = dilated ^ eroded # XOR operation to find the difference between dilated and eroded mask
+ # Loop over the boundary and paint the corresponding pixels
+ for i in range(boundary.shape[0]):
+ for j in range(boundary.shape[1]):
+ if boundary[i, j]:
+ # This is a pixel on the boundary, paint it red
+ draw.point((i, j), fill=colors[0])
+ else:
+ raise NotImplementedError(f'Input mode of {input_mode} is not implemented.')
+ return img
+
+
+def get_conv_log_filename():
+ t = datetime.datetime.now()
+ name = os.path.join(LOGDIR, f"{t.year}-{t.month:02d}-{t.day:02d}-conv.json")
+ return name
+
+
+def get_model_list():
+ ret = requests.post(args.controller_url + "/refresh_all_workers")
+ assert ret.status_code == 200
+ ret = requests.post(args.controller_url + "/list_models")
+ models = ret.json()["models"]
+ models.sort(key=lambda x: priority.get(x, x))
+ logger.info(f"Models: {models}")
+ return models
+
+
+get_window_url_params = """
+function() {
+ const params = new URLSearchParams(window.location.search);
+ url_params = Object.fromEntries(params);
+ console.log(url_params);
+ return url_params;
+ }
+"""
+
+
+def load_demo(url_params, request: gr.Request):
+ logger.info(f"load_demo. ip: {request.client.host}. params: {url_params}")
+
+ dropdown_update = gr.Dropdown.update(visible=True)
+ if "model" in url_params:
+ model = url_params["model"]
+ if model in models:
+ dropdown_update = gr.Dropdown.update(
+ value=model, visible=True)
+
+ state = default_conversation.copy()
+ return (state,
+ dropdown_update,
+ gr.Chatbot.update(visible=True),
+ gr.Textbox.update(visible=True),
+ gr.Button.update(visible=True),
+ gr.Row.update(visible=True),
+ gr.Accordion.update(visible=True))
+
+
+def load_demo_refresh_model_list(request: gr.Request):
+ logger.info(f"load_demo. ip: {request.client.host}")
+ models = get_model_list()
+ state = default_conversation.copy()
+ return (state, gr.Dropdown.update(
+ choices=models,
+ value=models[0] if len(models) > 0 else ""),
+ gr.Chatbot.update(visible=True),
+ gr.Textbox.update(visible=True),
+ gr.Button.update(visible=True),
+ gr.Row.update(visible=True),
+ gr.Accordion.update(visible=True))
+
+
+def vote_last_response(state, vote_type, model_selector, request: gr.Request):
+ with open(get_conv_log_filename(), "a") as fout:
+ data = {
+ "tstamp": round(time.time(), 4),
+ "type": vote_type,
+ "model": model_selector,
+ "state": state.dict(),
+ "ip": request.client.host,
+ }
+ fout.write(json.dumps(data) + "\n")
+
+
+def upvote_last_response(state, model_selector, request: gr.Request):
+ logger.info(f"upvote. ip: {request.client.host}")
+ vote_last_response(state, "upvote", model_selector, request)
+ return ("",) + (disable_btn,) * 3
+
+
+def downvote_last_response(state, model_selector, request: gr.Request):
+ logger.info(f"downvote. ip: {request.client.host}")
+ vote_last_response(state, "downvote", model_selector, request)
+ return ("",) + (disable_btn,) * 3
+
+
+def flag_last_response(state, model_selector, request: gr.Request):
+ logger.info(f"flag. ip: {request.client.host}")
+ vote_last_response(state, "flag", model_selector, request)
+ return ("",) + (disable_btn,) * 3
+
+
+def regenerate(state, image_process_mode, request: gr.Request):
+ logger.info(f"regenerate. ip: {request.client.host}")
+ state.messages[-1][-1] = None
+ prev_human_msg = state.messages[-2]
+ if type(prev_human_msg[1]) in (tuple, list):
+ prev_human_msg[1] = (*prev_human_msg[1][:2], image_process_mode)
+ state.skip_next = False
+ return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 5
+
+
+def clear_history(request: gr.Request):
+ logger.info(f"clear_history. ip: {request.client.host}")
+ state = default_conversation.copy()
+ return (state, state.to_gradio_chatbot(), "", None, None) + (disable_btn,) * 5 + \
+ (None, {'region_placeholder_tokens':[],'region_coordinates':[],'region_masks':[],'masks':[]}, [], None)
+
+
+def resize_bbox(box, image_w=None, image_h=None, default_wh=VOCAB_IMAGE_W):
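+ # Map a box from the model's vocabulary coordinate space (default_wh x default_wh,
+ # 1000 by default) to the displayed image's pixel coordinates.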
+ ratio_w = image_w * 1.0 / default_wh
+ ratio_h = image_h * 1.0 / default_wh
+
+ new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
+ int(box[2] * ratio_w), int(box[3] * ratio_h)]
+ return new_box
+
+
+def show_location(sketch_pad, chatbot):
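+ # Parse grounded boxes like [x1, y1, x2, y2] out of the model's replies, rescale them
+ # to the displayed image, draw them, and replace the raw coordinates in the chat
+ # history with [obj{i}] placeholders.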
+ image = sketch_pad['image']
+ img_w, img_h = image.size
+ new_bboxes = []
+ old_bboxes = []
+ # chatbot[0] is image.
+ text = chatbot[1:]
+ for round_i in text:
+ human_input = round_i[0]
+ model_output = round_i[1]
+ # TODO: Difference: vocab representation.
+ # pattern = r'\[x\d*=(\d+(?:\.\d+)?), y\d*=(\d+(?:\.\d+)?), x\d*=(\d+(?:\.\d+)?), y\d*=(\d+(?:\.\d+)?)\]'
+ pattern = r'\[(\d+(?:\.\d+)?), (\d+(?:\.\d+)?), (\d+(?:\.\d+)?), (\d+(?:\.\d+)?)\]'
+ matches = re.findall(pattern, model_output)
+ for match in matches:
+ x1, y1, x2, y2 = map(int, match)
+ new_box = resize_bbox([x1, y1, x2, y2], img_w, img_h)
+ new_bboxes.append(new_box)
+ old_bboxes.append([x1, y1, x2, y2])
+
+ set_old_bboxes = sorted(set(map(tuple, old_bboxes)), key=list(map(tuple, old_bboxes)).index)
+ list_old_bboxes = list(map(list, set_old_bboxes))
+
+ set_bboxes = sorted(set(map(tuple, new_bboxes)), key=list(map(tuple, new_bboxes)).index)
+ list_bboxes = list(map(list, set_bboxes))
+
+ output_image = deepcopy(image)
+ draw = ImageDraw.Draw(output_image)
+ font = ImageFont.truetype("./ferret/serve/dejavu/DejaVuSans.ttf", 28)
+ for i in range(len(list_bboxes)):
+ x1, y1, x2, y2 = list_old_bboxes[i]
+ x1_new, y1_new, x2_new, y2_new = list_bboxes[i]
+ obj_string = '[obj{}]'.format(i)
+ for round_i in text:
+ model_output = round_i[1]
+ model_output = model_output.replace('[{}, {}, {}, {}]'.format(x1, y1, x2, y2), obj_string)
+ round_i[1] = model_output
+ draw.rectangle([(x1_new, y1_new), (x2_new, y2_new)], outline="red", width=3)
+ draw.text((x1_new+2, y1_new+5), obj_string[1:-1], fill="red", font=font)
+
+ return (output_image, [chatbot[0]] + text, disable_btn)
+
+
+def add_text(state, text, image_process_mode, original_image, sketch_pad, request: gr.Request):
+ image = sketch_pad['image']
+
+ logger.info(f"add_text. ip: {request.client.host}. len: {len(text)}")
+ if len(text) <= 0 and image is None:
+ state.skip_next = True
+ return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 5
+ if args.moderate:
+ flagged = violates_moderation(text)
+ if flagged:
+ state.skip_next = True
+ return (state, state.to_gradio_chatbot(), moderation_msg, None) + (
+ no_change_btn,) * 5
+
+ text = text[:1536] # Hard cut-off
+ if original_image is None:
+ assert image is not None
+ original_image = image.copy()
+ print('No location, copy original image in add_text')
+
+ if image is not None:
+ if state.first_round:
+ text = text[:1200] # Hard cut-off for images
+ if '<image>' not in text:
+ # text = '<image>' + text
+ text = text + '\n<image>'
+ text = (text, original_image, image_process_mode)
+ if len(state.get_images(return_pil=True)) > 0:
+ new_state = default_conversation.copy()
+ new_state.first_round = False
+ state=new_state
+ print('First round add image finished.')
+
+ state.append_message(state.roles[0], text)
+ state.append_message(state.roles[1], None)
+ state.skip_next = False
+ return (state, state.to_gradio_chatbot(), "", original_image) + (disable_btn,) * 5
+
+
+def post_process_code(code):
+ sep = "\n```"
+ if sep in code:
+ blocks = code.split(sep)
+ if len(blocks) % 2 == 1:
+ for i in range(1, len(blocks), 2):
+ blocks[i] = blocks[i].replace("\\_", "_")
+ code = sep.join(blocks)
+ return code
+
+
+def format_region_prompt(prompt, refer_input_state):
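+ # Replace each region placeholder (e.g. [region0]) in the prompt with its vocabulary
+ # coordinates followed by the region feature token.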
+ for region_ph_index, region_ph_i in enumerate(refer_input_state['region_placeholder_tokens']):
+ prompt = prompt.replace(region_ph_i, '{} {}'.format(refer_input_state['region_coordinates'][region_ph_index], DEFAULT_REGION_FEA_TOKEN))
+ return prompt
+
+
+def http_bot(state, model_selector, temperature, top_p, max_new_tokens, refer_input_state, request: gr.Request):
+# def http_bot(state, model_selector, temperature, top_p, max_new_tokens, request: gr.Request):
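+ # Stream one chat turn: query the controller for a worker address, build the prompt
+ # (inlining referred regions when --add_region_feature is set), post it to the
+ # worker's /worker_generate_stream endpoint, and yield partial outputs to the UI.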
+ logger.info(f"http_bot. ip: {request.client.host}")
+ start_tstamp = time.time()
+ model_name = model_selector
+
+ if state.skip_next:
+ # This generate call is skipped due to invalid inputs
+ yield (state, state.to_gradio_chatbot()) + (no_change_btn,) * 5
+ return
+
+ if len(state.messages) == state.offset + 2:
+ # First round of conversation
+ template_name = 'ferret_v1'
+ new_state = conv_templates[template_name].copy()
+ new_state.append_message(new_state.roles[0], state.messages[-2][1])
+ new_state.append_message(new_state.roles[1], None)
+ state = new_state
+ state.first_round = False
+
+ # Query worker address
+ controller_url = args.controller_url
+ ret = requests.post(controller_url + "/get_worker_address",
+ json={"model": model_name})
+ worker_addr = ret.json()["address"]
+ logger.info(f"model_name: {model_name}, worker_addr: {worker_addr}")
+
+ # No available worker
+ if worker_addr == "":
+ state.messages[-1][-1] = server_error_msg
+ yield (state, state.to_gradio_chatbot(), disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
+ return
+
+ # Construct prompt
+ prompt = state.get_prompt()
+ if args.add_region_feature:
+ prompt = format_region_prompt(prompt, refer_input_state)
+
+ all_images = state.get_images(return_pil=True)
+ all_image_hash = [hashlib.md5(image.tobytes()).hexdigest() for image in all_images]
+ for image, hash in zip(all_images, all_image_hash):
+ t = datetime.datetime.now()
+ filename = os.path.join(LOGDIR, "serve_images", f"{t.year}-{t.month:02d}-{t.day:02d}", f"{hash}.jpg")
+ if not os.path.isfile(filename):
+ os.makedirs(os.path.dirname(filename), exist_ok=True)
+ image.save(filename)
+
+ # Make requests
+ pload = {
+ "model": model_name,
+ "prompt": prompt,
+ "temperature": float(temperature),
+ "top_p": float(top_p),
+ "max_new_tokens": min(int(max_new_tokens), 1536),
+ "stop": state.sep if state.sep_style in [SeparatorStyle.SINGLE, SeparatorStyle.MPT] else state.sep2,
+ "images": f'List of {len(state.get_images())} images: {all_image_hash}',
+ }
+ logger.info(f"==== request ====\n{pload}")
+ if args.add_region_feature:
+ pload['region_masks'] = refer_input_state['region_masks']
+ logger.info(f"==== add region_masks to request ====\n")
+
+ pload['images'] = state.get_images()
+ print(f'Input Prompt: {prompt}')
+
+ state.messages[-1][-1] = "▌"
+ yield (state, state.to_gradio_chatbot()) + (disable_btn,) * 5
+
+ try:
+ # Stream output
+ response = requests.post(worker_addr + "/worker_generate_stream",
+ headers=headers, json=pload, stream=True, timeout=10)
+ for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
+ if chunk:
+ data = json.loads(chunk.decode())
+ if data["error_code"] == 0:
+ output = data["text"][len(prompt):].strip()
+ output = post_process_code(output)
+ state.messages[-1][-1] = output + "▌"
+ yield (state, state.to_gradio_chatbot()) + (disable_btn,) * 5
+ else:
+ output = data["text"] + f" (error_code: {data['error_code']})"
+ state.messages[-1][-1] = output
+ yield (state, state.to_gradio_chatbot()) + (disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
+ return
+ time.sleep(0.03)
+ except requests.exceptions.RequestException as e:
+ state.messages[-1][-1] = server_error_msg
+ yield (state, state.to_gradio_chatbot()) + (disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
+ return
+
+ state.messages[-1][-1] = state.messages[-1][-1][:-1]
+ yield (state, state.to_gradio_chatbot()) + (enable_btn,) * 5
+
+ finish_tstamp = time.time()
+ logger.info(f"{output}")
+
+ with open(get_conv_log_filename(), "a") as fout:
+ data = {
+ "tstamp": round(finish_tstamp, 4),
+ "type": "chat",
+ "model": model_name,
+ "start": round(start_tstamp, 4),
+ "finish": round(finish_tstamp, 4),
+ "state": state.dict(),
+ "images": all_image_hash,
+ "ip": request.client.host,
+ }
+ fout.write(json.dumps(data) + "\n")
+
+title_markdown = ("""
+# 🦦 Ferret: Refer and Ground Anything Anywhere at Any Granularity
+[[Code](https://github.com/apple/ml-ferret)] [[Paper](https://arxiv.org/abs/2310.07704)]
+""")
+
+tos_markdown = ("""
+### Terms of use
+By using this service, users are required to agree to the following terms: The service is a research preview intended for non-commercial use only. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. The service may collect user dialogue data for future research.
+""")
+
+
+learn_more_markdown = ("""
+### License
+The service is a research preview intended for non-commercial use only
+""")
+
+
+css = code_highlight_css + """
+pre {
+ white-space: pre-wrap; /* Since CSS 2.1 */
+ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */
+ white-space: -pre-wrap; /* Opera 4-6 */
+ white-space: -o-pre-wrap; /* Opera 7 */
+ word-wrap: break-word; /* Internet Explorer 5.5+ */
+}
+"""
+
+Instructions = '''
+Instructions:
+1. Select a 'Referring Input Type'
+2. Draw on the image to refer to a region/point.
+3. Copy the region id from 'Referring Input Type' to refer to a region in your chat.
+'''
+
+class ImageMask(gr.components.Image):
+ """
+ Sets: source="canvas", tool="sketch"
+ """
+
+ is_template = True
+
+ def __init__(self, **kwargs):
+ super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
+
+ def preprocess(self, x):
+ return super().preprocess(x)
+
+
+def draw(input_mode, input, refer_input_state, refer_text_show, imagebox_refer):
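+ # Turn the latest sketch-pad stroke into a referred region: diff the new mask against
+ # the previous one, derive a point / box / sketch region, record its placeholder
+ # token, vocabulary coordinates and mask, and draw the parsed region on the preview image.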
+ if type(input) == dict:
+ image = deepcopy(input['image'])
+ mask = deepcopy(input['mask'])
+ else:
+ mask = deepcopy(input)
+
+ # W, H -> H, W, 3
+ image_new = np.asarray(image)
+ img_height = image_new.shape[0]
+ img_width = image_new.shape[1]
+
+ # W, H, 4 -> H, W
+ mask_new = np.asarray(mask)[:,:,0].copy()
+ mask_new = torch.from_numpy(mask_new)
+ mask_new = (F.interpolate(mask_new.unsqueeze(0).unsqueeze(0), (img_height, img_width), mode='bilinear') > 0)
+ mask_new = mask_new[0, 0].transpose(1, 0).long()
+
+ if len(refer_input_state['masks']) == 0:
+ last_mask = torch.zeros_like(mask_new)
+ else:
+ last_mask = refer_input_state['masks'][-1]
+
+ diff_mask = mask_new - last_mask
+ if torch.all(diff_mask == 0):
+ print('Init Uploading Images.')
+ return (refer_input_state, refer_text_show, image)
+ else:
+ refer_input_state['masks'].append(mask_new)
+
+ if input_mode == 'Point':
+ nonzero_points = diff_mask.nonzero()
+ nonzero_points_avg_x = torch.median(nonzero_points[:, 0])
+ nonzero_points_avg_y = torch.median(nonzero_points[:, 1])
+ sampled_coor = [nonzero_points_avg_x, nonzero_points_avg_y]
+ # pdb.set_trace()
+ cur_region_masks = generate_mask_for_feature(sampled_coor, raw_w=img_width, raw_h=img_height)
+ elif input_mode == 'Box' or input_mode == 'Sketch':
+ # pdb.set_trace()
+ x1x2 = diff_mask.max(1)[0].nonzero()[:, 0]
+ y1y2 = diff_mask.max(0)[0].nonzero()[:, 0]
+ y1, y2 = y1y2.min(), y1y2.max()
+ x1, x2 = x1x2.min(), x1x2.max()
+ # pdb.set_trace()
+ sampled_coor = [x1, y1, x2, y2]
+ if input_mode == 'Box':
+ cur_region_masks = generate_mask_for_feature(sampled_coor, raw_w=img_width, raw_h=img_height)
+ else:
+ cur_region_masks = generate_mask_for_feature(sampled_coor, raw_w=img_width, raw_h=img_height, mask=diff_mask)
+ else:
+ raise NotImplementedError(f'Input mode of {input_mode} is not Implemented.')
+
+ # TODO(haoxuan): Hack img_size to be 224 here; need to make it an argument.
+ if len(sampled_coor) == 2:
+ point_x = int(VOCAB_IMAGE_W * sampled_coor[0] / img_width)
+ point_y = int(VOCAB_IMAGE_H * sampled_coor[1] / img_height)
+ cur_region_coordinates = f'[{int(point_x)}, {int(point_y)}]'
+ elif len(sampled_coor) == 4:
+ point_x1 = int(VOCAB_IMAGE_W * sampled_coor[0] / img_width)
+ point_y1 = int(VOCAB_IMAGE_H * sampled_coor[1] / img_height)
+ point_x2 = int(VOCAB_IMAGE_W * sampled_coor[2] / img_width)
+ point_y2 = int(VOCAB_IMAGE_H * sampled_coor[3] / img_height)
+ cur_region_coordinates = f'[{int(point_x1)}, {int(point_y1)}, {int(point_x2)}, {int(point_y2)}]'
+
+ cur_region_id = len(refer_input_state['region_placeholder_tokens'])
+ cur_region_token = DEFAULT_REGION_REFER_TOKEN.split(']')[0] + str(cur_region_id) + ']'
+ refer_input_state['region_placeholder_tokens'].append(cur_region_token)
+ refer_input_state['region_coordinates'].append(cur_region_coordinates)
+ refer_input_state['region_masks'].append(cur_region_masks)
+ refer_text_show.append((cur_region_token, ''))
+
+ # Show Parsed Referring.
+ imagebox_refer = draw_box(sampled_coor, cur_region_masks, \
+ cur_region_token, imagebox_refer, input_mode)
+
+ return (refer_input_state, refer_text_show, imagebox_refer)
+
+def build_demo(embed_mode):
+ textbox = gr.Textbox(show_label=False, placeholder="Enter text and press ENTER", visible=False, container=False)
+ with gr.Blocks(title="FERRET", theme=gr.themes.Base(), css=css) as demo:
+ state = gr.State()
+
+ if not embed_mode:
+ gr.Markdown(title_markdown)
+ gr.Markdown(Instructions)
+
+ with gr.Row():
+ with gr.Column(scale=4):
+ with gr.Row(elem_id="model_selector_row"):
+ model_selector = gr.Dropdown(
+ choices=models,
+ value=models[0] if len(models) > 0 else "",
+ interactive=True,
+ show_label=False,
+ container=False)
+
+ original_image = gr.Image(type="pil", visible=False)
+ image_process_mode = gr.Radio(
+ ["Raw+Processor", "Crop", "Resize", "Pad"],
+ value="Raw+Processor",
+ label="Preprocess for non-square image",
+ visible=False)
+
+ # Added for any-format input.
+ sketch_pad = ImageMask(label="Image & Sketch", type="pil", elem_id="img2text")
+ refer_input_mode = gr.Radio(
+ ["Point", "Box", "Sketch"],
+ value="Point",
+ label="Referring Input Type")
+ refer_input_state = gr.State({'region_placeholder_tokens':[],
+ 'region_coordinates':[],
+ 'region_masks':[],
+ 'masks':[],
+ })
+ refer_text_show = gr.HighlightedText(value=[], label="Referring Input Cache")
+
+ imagebox_refer = gr.Image(type="pil", label="Parsed Referring Input")
+ imagebox_output = gr.Image(type="pil", label='Output Vis')
+
+ cur_dir = os.path.dirname(os.path.abspath(__file__))
+ gr.Examples(examples=[
+ # [f"{cur_dir}/examples/harry-potter-hogwarts.jpg", "What is in [region0]? And what do people use it for?"],
+ # [f"{cur_dir}/examples/ingredients.jpg", "What objects are in [region0] and [region1]?"],
+ # [f"{cur_dir}/examples/extreme_ironing.jpg", "What is unusual about this image? And tell me the coordinates of mentioned objects."],
+ [f"{cur_dir}/examples/ferret.jpg", "What's the relationship between object [region0] and object [region1]?"],
+ [f"{cur_dir}/examples/waterview.jpg", "What are the things I should be cautious about when I visit here? Tell me the coordinates in response."],
+ [f"{cur_dir}/examples/flickr_9472793441.jpg", "Describe the image in details."],
+ # [f"{cur_dir}/examples/coco_000000281759.jpg", "What are the locations of the woman wearing a blue dress, the woman in flowery top, the girl in purple dress, the girl wearing green shirt?"],
+ [f"{cur_dir}/examples/room_planning.jpg", "How to improve the design of the given room?"],
+ [f"{cur_dir}/examples/make_sandwitch.jpg", "How can I make a sandwich with available ingredients?"],
+ [f"{cur_dir}/examples/bathroom.jpg", "What is unusual about this image?"],
+ [f"{cur_dir}/examples/kitchen.png", "Is the object a man or a chicken? Explain the reason."],
+ ], inputs=[sketch_pad, textbox])
+
+ with gr.Accordion("Parameters", open=False, visible=False) as parameter_row:
+ temperature = gr.Slider(minimum=0.0, maximum=1.0, value=0.2, step=0.1, interactive=True, label="Temperature",)
+ top_p = gr.Slider(minimum=0.0, maximum=1.0, value=0.7, step=0.1, interactive=True, label="Top P",)
+ max_output_tokens = gr.Slider(minimum=0, maximum=1024, value=512, step=64, interactive=True, label="Max output tokens",)
+
+ with gr.Column(scale=5):
+ chatbot = gr.Chatbot(elem_id="chatbot", label="FERRET", visible=False).style(height=750)
+ with gr.Row():
+ with gr.Column(scale=8):
+ textbox.render()
+ with gr.Column(scale=1, min_width=60):
+ submit_btn = gr.Button(value="Submit", visible=False)
+ with gr.Row(visible=False) as button_row:
+ upvote_btn = gr.Button(value="👍 Upvote", interactive=False)
+ downvote_btn = gr.Button(value="👎 Downvote", interactive=False)
+ # flag_btn = gr.Button(value="⚠️ Flag", interactive=False)
+ #stop_btn = gr.Button(value="⏹️ Stop Generation", interactive=False)
+ regenerate_btn = gr.Button(value="🔄 Regenerate", interactive=False)
+ clear_btn = gr.Button(value="🗑️ Clear history", interactive=False)
+ location_btn = gr.Button(value="🪄 Show location", interactive=False)
+
+ if not embed_mode:
+ gr.Markdown(tos_markdown)
+ gr.Markdown(learn_more_markdown)
+ url_params = gr.JSON(visible=False)
+
+ # Register listeners
+ btn_list = [upvote_btn, downvote_btn, location_btn, regenerate_btn, clear_btn]
+ upvote_btn.click(upvote_last_response,
+ [state, model_selector], [textbox, upvote_btn, downvote_btn, location_btn])
+ downvote_btn.click(downvote_last_response,
+ [state, model_selector], [textbox, upvote_btn, downvote_btn, location_btn])
+ # flag_btn.click(flag_last_response,
+ # [state, model_selector], [textbox, upvote_btn, downvote_btn, flag_btn])
+ regenerate_btn.click(regenerate, [state, image_process_mode],
+ [state, chatbot, textbox] + btn_list).then(
+ http_bot, [state, model_selector, temperature, top_p, max_output_tokens, refer_input_state],
+ [state, chatbot] + btn_list)
+ clear_btn.click(clear_history, None, [state, chatbot, textbox, imagebox_output, original_image] + btn_list + \
+ [sketch_pad, refer_input_state, refer_text_show, imagebox_refer])
+ location_btn.click(show_location,
+ [sketch_pad, chatbot], [imagebox_output, chatbot, location_btn])
+
+ textbox.submit(add_text, [state, textbox, image_process_mode, original_image, sketch_pad], [state, chatbot, textbox, original_image] + btn_list
+ ).then(http_bot, [state, model_selector, temperature, top_p, max_output_tokens, refer_input_state],
+ [state, chatbot] + btn_list)
+
+ submit_btn.click(add_text, [state, textbox, image_process_mode, original_image, sketch_pad], [state, chatbot, textbox, original_image] + btn_list
+ ).then(http_bot, [state, model_selector, temperature, top_p, max_output_tokens, refer_input_state],
+ [state, chatbot] + btn_list)
+
+ sketch_pad.edit(
+ draw,
+ inputs=[refer_input_mode, sketch_pad, refer_input_state, refer_text_show, imagebox_refer],
+ outputs=[refer_input_state, refer_text_show, imagebox_refer],
+ queue=True,
+ )
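+        # Every sketch edit re-runs draw(), which parses the newest stroke and appends it to refer_input_state.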
+
+ if args.model_list_mode == "once":
+ demo.load(load_demo, [url_params], [state, model_selector,
+ chatbot, textbox, submit_btn, button_row, parameter_row],
+ _js=get_window_url_params)
+ elif args.model_list_mode == "reload":
+ demo.load(load_demo_refresh_model_list, None, [state, model_selector,
+ chatbot, textbox, submit_btn, button_row, parameter_row])
+ else:
+ raise ValueError(f"Unknown model list mode: {args.model_list_mode}")
+
+ return demo
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--host", type=str, default="0.0.0.0")
+ parser.add_argument("--port", type=int)
+ parser.add_argument("--controller-url", type=str, default="http://localhost:21001")
+ parser.add_argument("--concurrency-count", type=int, default=8)
+ parser.add_argument("--model-list-mode", type=str, default="once",
+ choices=["once", "reload"])
+ parser.add_argument("--share", action="store_true")
+ parser.add_argument("--moderate", action="store_true")
+ parser.add_argument("--embed", action="store_true")
+ parser.add_argument("--add_region_feature", action="store_true")
+ args = parser.parse_args()
+ logger.info(f"args: {args}")
+
+ models = get_model_list()
+
+ logger.info(args)
+ demo = build_demo(args.embed)
+ demo.queue(concurrency_count=args.concurrency_count, status_update_rate=10,
+ api_open=False).launch(
+ server_name=args.host, server_port=args.port, share=args.share)
diff --git a/ferret/serve/model_worker.py b/ferret/serve/model_worker.py
new file mode 100644
index 0000000..7557890
--- /dev/null
+++ b/ferret/serve/model_worker.py
@@ -0,0 +1,367 @@
+"""
+A model worker executes the model.
+Usage:
+
+CUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 \
+ --worker http://localhost:40000 --model-path checkpoints/xxx \
+ --multi-modal --add_region_feature
+
+"""
+import argparse
+import asyncio
+import json
+import time
+import threading
+import uuid
+
+from fastapi import FastAPI, Request, BackgroundTasks
+from fastapi.responses import StreamingResponse
+import requests
+import torch
+import uvicorn
+from functools import partial
+
+from ferret.constants import WORKER_HEART_BEAT_INTERVAL
+from ferret.utils import (build_logger, server_error_msg,
+ pretty_print_semaphore)
+from ferret.model.builder import load_pretrained_model
+from ferret.mm_utils import process_images, load_image_from_base64, tokenizer_image_token, KeywordsStoppingCriteria
+from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+# from transformers import TextIteratorStreamer
+from threading import Thread
+
+
+GB = 1 << 30
+
+worker_id = str(uuid.uuid4())[:6]
+logger = build_logger("model_worker", f"model_worker_{worker_id}.log")
+global_counter = 0
+
+model_semaphore = None
+
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+
+def heart_beat_worker(controller):
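+    # Periodically ping the controller so this worker stays registered.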
+
+ while True:
+ time.sleep(WORKER_HEART_BEAT_INTERVAL)
+ controller.send_heart_beat()
+
+
+class ModelWorker:
+ def __init__(self, controller_addr, worker_addr,
+ worker_id, no_register,
+ model_path, model_base, model_name,
+ load_8bit, load_4bit,
+ keep_aspect_ratio,
+ num_gpus,
+ add_region_feature,
+ image_w,
+ image_h):
+ self.image_w = image_w
+ self.image_h = image_h
+ self.controller_addr = controller_addr
+ self.worker_addr = worker_addr
+ self.worker_id = worker_id
+ if model_path.endswith("/"):
+ model_path = model_path[:-1]
+ if model_name is None:
+ model_paths = model_path.split("/")
+ if model_paths[-1].startswith('checkpoint-'):
+ self.model_name = model_paths[-2] + "_" + model_paths[-1]
+ else:
+ self.model_name = model_paths[-1]
+ else:
+ self.model_name = model_name
+
+ logger.info(f"Loading the model {self.model_name} on worker {worker_id} ...")
+ self.keep_aspect_ratio = keep_aspect_ratio
+ self.add_region_feature = add_region_feature
+ self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
+ model_path, model_base, self.model_name, load_8bit, load_4bit)
+ self.is_multimodal = 'llava' in self.model_name.lower() or 'ferret' in self.model_name.lower()
+
+ if not no_register:
+ self.register_to_controller()
+ self.heart_beat_thread = threading.Thread(
+ target=heart_beat_worker, args=(self,))
+ self.heart_beat_thread.start()
+
+ def register_to_controller(self):
+ logger.info("Register to controller")
+
+ url = self.controller_addr + "/register_worker"
+ data = {
+ "worker_name": self.worker_addr,
+ "check_heart_beat": True,
+ "worker_status": self.get_status()
+ }
+ r = requests.post(url, json=data)
+ assert r.status_code == 200
+
+ def send_heart_beat(self):
+ # logger.info(f"Send heart beat. Models: {[self.model_name]}. "
+ # f"Semaphore: {pretty_print_semaphore(model_semaphore)}. "
+ # f"global_counter: {global_counter}")
+
+ url = self.controller_addr + "/receive_heart_beat"
+
+ while True:
+ try:
+ ret = requests.post(url, json={
+ "worker_name": self.worker_addr,
+ "queue_length": self.get_queue_length()}, timeout=5)
+ exist = ret.json()["exist"]
+ break
+ except requests.exceptions.RequestException as e:
+ logger.error(f"heart beat error: {e}")
+ time.sleep(5)
+
+ if not exist:
+ self.register_to_controller()
+
+ def get_queue_length(self):
+ if model_semaphore is None:
+ return 0
+ else:
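+            # Occupied semaphore slots plus any requests still waiting to acquire one.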
+ return args.limit_model_concurrency - model_semaphore._value + (len(
+ model_semaphore._waiters) if model_semaphore._waiters is not None else 0)
+
+ def get_status(self):
+ return {
+ "model_names": [self.model_name],
+ "speed": 1,
+ "queue_length": self.get_queue_length(),
+ }
+
+ @torch.inference_mode()
+ def generate_stream(self, params):
+ tokenizer, model, image_processor = self.tokenizer, self.model, self.image_processor
+
+ image_w = self.image_w
+ image_h = self.image_h
+ prompt = params["prompt"]
+ ori_prompt = prompt
+ region_masks = params.get('region_masks', None)
+
+ images = params.get("images", None)
+ num_image_tokens = 0
+ if images is not None and len(images) > 0 and self.is_multimodal:
+ if len(images) > 0:
+ if len(images) != prompt.count(DEFAULT_IMAGE_TOKEN):
+ raise ValueError("Number of images does not match number of tokens in prompt")
+
+ images = [load_image_from_base64(image) for image in images]
+ if self.keep_aspect_ratio:
+ images = process_images(images, image_processor, model.config)
+ else:
+ images = image_processor(images, return_tensors='pt', do_resize=True, do_center_crop=False, size=[image_h, image_w])['pixel_values']
+
+ if type(images) is list:
+ images = [image.to(self.model.device, dtype=torch.float16) for image in images]
+ else:
+ images = images.to(self.model.device, dtype=torch.float16)
+
+ replace_token = DEFAULT_IMAGE_TOKEN
+ if getattr(self.model.config, 'mm_use_im_start_end', False):
+ replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
+ prompt = prompt.replace(DEFAULT_IMAGE_TOKEN, replace_token)
+
+ num_image_tokens = prompt.count(replace_token) * model.get_vision_tower().num_patches
+ else:
+ images = None
+ image_args = {"images": images}
+ else:
+ images = None
+ image_args = {}
+
+ if region_masks is not None:
+ assert self.add_region_feature
+ region_masks = [[torch.Tensor(region_mask_i).cuda().half() for region_mask_i in region_masks]]
+ image_args["region_masks"] = region_masks
+ logger.info("Add region_masks to image_args.")
+ else:
+ logger.info("No region_masks for this sample.")
+ region_masks = None
+
+ l_prompt = len(prompt)
+ temperature = float(params.get("temperature", 1.0))
+ top_p = float(params.get("top_p", 1.0))
+ max_context_length = getattr(model.config, 'max_position_embeddings', 2048)
+ max_new_tokens = min(int(params.get("max_new_tokens", 256)), 1024)
+ stop_str = params.get("stop", None)
+
+ stop_idx = None
+ if stop_str is not None:
+ stop_idx = tokenizer(stop_str).input_ids
+ if len(stop_idx) == 1:
+ stop_idx = stop_idx[0]
+ else:
+ stop_idx = None
+
+ # input_ids = tokenizer(prompt).input_ids
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=None)
+ output_ids = list(input_ids)
+ pred_ids = []
+
+ max_src_len = self.context_len - max_new_tokens - 8
+ input_ids = input_ids[-max_src_len:]
+
+ past_key_values = None
+ for i in range(max_new_tokens):
+ if i == 0:
+ out = model(
+ torch.as_tensor([input_ids]).cuda(),
+ use_cache=True,
+ **image_args)
+ logits = out.logits
+ past_key_values = out.past_key_values
+ else:
+ attention_mask = torch.ones(
+ 1, past_key_values[0][0].shape[-2] + 1, device="cuda")
+ out = model(input_ids=torch.as_tensor([[token]], device="cuda"),
+ use_cache=True,
+ attention_mask=attention_mask,
+ past_key_values=past_key_values,
+ region_masks=region_masks)
+ logits = out.logits
+ past_key_values = out.past_key_values
+
+ last_token_logits = logits[0][-1]
+ if temperature < 1e-4:
+ token = int(torch.argmax(last_token_logits))
+ else:
+ probs = torch.softmax(last_token_logits / temperature, dim=-1)
+ token = int(torch.multinomial(probs, num_samples=1))
+
+ output_ids.append(token)
+ pred_ids.append(token)
+
+ if stop_idx is not None and token == stop_idx:
+ stopped = True
+ elif token == tokenizer.eos_token_id:
+ stopped = True
+ else:
+ stopped = False
+
+ if i % args.stream_interval == 0 or i == max_new_tokens - 1 or stopped:
+ cur_out = tokenizer.decode(pred_ids, skip_special_tokens=True)
+                pos = cur_out.rfind(stop_str) if stop_str is not None else -1
+ if pos != -1:
+ cur_out = cur_out[:pos]
+ stopped = True
+ output = ori_prompt + cur_out
+
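+                # Stream the partial output as a JSON chunk terminated by a null byte.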
+ ret = {
+ "text": output,
+ "error_code": 0,
+ }
+ yield json.dumps(ret).encode() + b"\0"
+
+ if stopped:
+ break
+
+ if past_key_values is not None:
+ del past_key_values
+
+
+ def generate_stream_gate(self, params):
+ try:
+ for x in self.generate_stream(params):
+ yield x
+ except ValueError as e:
+ print("Caught ValueError:", e)
+ ret = {
+ "text": server_error_msg,
+ "error_code": 1,
+ }
+ yield json.dumps(ret).encode() + b"\0"
+ except torch.cuda.CudaError as e:
+ print("Caught torch.cuda.CudaError:", e)
+ ret = {
+ "text": server_error_msg,
+ "error_code": 1,
+ }
+ yield json.dumps(ret).encode() + b"\0"
+ except Exception as e:
+ print("Caught Unknown Error", e)
+ ret = {
+ "text": server_error_msg,
+ "error_code": 1,
+ }
+ yield json.dumps(ret).encode() + b"\0"
+
+
+app = FastAPI()
+
+
+def release_model_semaphore(fn=None):
+ model_semaphore.release()
+ if fn is not None:
+ fn()
+
+
+@app.post("/worker_generate_stream")
+async def generate_stream(request: Request):
+ global model_semaphore, global_counter
+ global_counter += 1
+ params = await request.json()
+
+ if model_semaphore is None:
+ model_semaphore = asyncio.Semaphore(args.limit_model_concurrency)
+ await model_semaphore.acquire()
+ worker.send_heart_beat()
+ generator = worker.generate_stream_gate(params)
+ background_tasks = BackgroundTasks()
+ background_tasks.add_task(partial(release_model_semaphore, fn=worker.send_heart_beat))
+ return StreamingResponse(generator, background=background_tasks)
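+# A minimal, illustrative client sketch for this endpoint (not part of the worker itself),
+# assuming a worker running at http://localhost:40000; each streamed chunk is a JSON object
+# terminated by a null byte:
+#
+#   import json, requests
+#   resp = requests.post("http://localhost:40000/worker_generate_stream",
+#                        json={"prompt": "Hello", "temperature": 0.2,
+#                              "max_new_tokens": 64, "stop": "</s>"},
+#                        stream=True)
+#   for chunk in resp.iter_lines(delimiter=b"\0"):
+#       if chunk:
+#           print(json.loads(chunk.decode())["text"])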
+
+
+@app.post("/worker_get_status")
+async def get_status(request: Request):
+ return worker.get_status()
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--host", type=str, default="localhost")
+ parser.add_argument("--port", type=int, default=21002)
+ parser.add_argument("--worker-address", type=str,
+ default="http://localhost:21002")
+ parser.add_argument("--controller-address", type=str,
+ default="http://localhost:21001")
+ parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
+ parser.add_argument("--model-base", type=str, default=None)
+ parser.add_argument("--model-name", type=str)
+    parser.add_argument("--multi-modal", action="store_true", help="Multimodal mode is automatically detected from the model name; please make sure `ferret` is included in the model path.")
+ parser.add_argument("--keep-aspect-ratio", action="store_true")
+ parser.add_argument("--num-gpus", type=int, default=1)
+ parser.add_argument("--limit-model-concurrency", type=int, default=5)
+ parser.add_argument("--stream-interval", type=int, default=1)
+ parser.add_argument("--no-register", action="store_true")
+ parser.add_argument("--load-8bit", action="store_true")
+ parser.add_argument("--load-4bit", action="store_true")
+ parser.add_argument("--add_region_feature", action="store_true")
+ parser.add_argument("--image_w", type=int, default=336) # 224
+ parser.add_argument("--image_h", type=int, default=336) # 224
+ args = parser.parse_args()
+ logger.info(f"args: {args}")
+
+ if args.multi_modal:
+        logger.warning("Multimodal mode is automatically detected from the model name; please make sure `ferret` is included in the model path.")
+
+ worker = ModelWorker(args.controller_address,
+ args.worker_address,
+ worker_id,
+ args.no_register,
+ args.model_path,
+ args.model_base,
+ args.model_name,
+ args.load_8bit,
+ args.load_4bit,
+ args.keep_aspect_ratio,
+ args.num_gpus,
+ args.add_region_feature,
+ args.image_w,
+ args.image_h)
+ uvicorn.run(app, host=args.host, port=args.port, log_level="info")
diff --git a/ferret/serve/register_worker.py b/ferret/serve/register_worker.py
new file mode 100644
index 0000000..2c2c402
--- /dev/null
+++ b/ferret/serve/register_worker.py
@@ -0,0 +1,26 @@
+"""
+Manually register workers.
+
+Usage:
+python3 -m ferret.serve.register_worker --controller http://localhost:21001 --worker-name http://localhost:21002
+"""
+
+import argparse
+
+import requests
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--controller-address", type=str)
+ parser.add_argument("--worker-name", type=str)
+ parser.add_argument("--check-heart-beat", action="store_true")
+ args = parser.parse_args()
+
+ url = args.controller_address + "/register_worker"
+ data = {
+ "worker_name": args.worker_name,
+ "check_heart_beat": args.check_heart_beat,
+ "worker_status": None,
+ }
+ r = requests.post(url, json=data)
+ assert r.status_code == 200
diff --git a/ferret/train/ferret_trainer.py b/ferret/train/ferret_trainer.py
new file mode 100644
index 0000000..00669f8
--- /dev/null
+++ b/ferret/train/ferret_trainer.py
@@ -0,0 +1,66 @@
+import os
+import torch
+
+from transformers import Trainer
+from typing import Optional
+
+
+def maybe_zero_3(param, ignore_status=False, name=None):
+ from deepspeed import zero
+ from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
+ if hasattr(param, "ds_id"):
+ if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
+ if not ignore_status:
+ print(name, 'no ignore status')
+ with zero.GatheredParameters([param]):
+ param = param.data.detach().cpu().clone()
+ else:
+ param = param.detach().cpu().clone()
+ return param
+
+
+def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
+ to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
+ to_return = {k: maybe_zero_3(v, ignore_status=True, name=k).cpu() for k, v in to_return.items()}
+ return to_return
+
+
+class FERRETTrainer(Trainer):
+
+ def _save_checkpoint(self, model, trial, metrics=None):
+ if getattr(self.args, 'tune_mm_mlp_adapter', False):
+ from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
+ checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
+
+ run_dir = self._get_output_dir(trial=trial)
+ output_dir = os.path.join(run_dir, checkpoint_folder)
+
+ # Only save Adapter
+ keys_to_match = ['mm_projector']
+ if getattr(self.args, "use_im_start_end", False):
+ keys_to_match.extend(['embed_tokens', 'embed_in'])
+
+ weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match)
+
+ if self.args.local_rank == 0 or self.args.local_rank == -1:
+ self.model.config.save_pretrained(output_dir)
+ torch.save(weight_to_save, os.path.join(output_dir, f'mm_projector.bin'))
+ else:
+ super(FERRETTrainer, self)._save_checkpoint(model, trial, metrics)
+
+ if getattr(self.args, 'save_vision_tower', False):
+ from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
+ checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
+ run_dir = self._get_output_dir(trial=trial)
+ output_dir = os.path.join(run_dir, checkpoint_folder, 'vision_tower')
+ if self.args.local_rank == 0 or self.args.local_rank == -1:
+ vision_tower = self.model.model.get_vision_tower().vision_tower
+ vision_tower.save_pretrained(output_dir)
+ print(f'Save vision tower ckpt to {output_dir}/vision_tower')
+
+
+ def _save(self, output_dir: Optional[str] = None, state_dict=None):
+ if getattr(self.args, 'tune_mm_mlp_adapter', False):
+ pass
+ else:
+ super(FERRETTrainer, self)._save(output_dir, state_dict)
diff --git a/ferret/train/llama_flash_attn_monkey_patch.py b/ferret/train/llama_flash_attn_monkey_patch.py
new file mode 100644
index 0000000..89f9c3b
--- /dev/null
+++ b/ferret/train/llama_flash_attn_monkey_patch.py
@@ -0,0 +1,102 @@
+# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
+from typing import List, Optional, Tuple
+
+import torch
+from torch import nn
+
+import transformers
+from transformers.models.llama.modeling_llama import apply_rotary_pos_emb
+
+from einops import rearrange
+
+from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
+from flash_attn.bert_padding import unpad_input, pad_input
+
+def forward(
+ self,
+ hidden_states: torch.Tensor,
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
+ attention_mask: Optional[torch.Tensor] = None,
+ output_attentions: bool = False,
+ use_cache: bool = False,
+) -> Tuple[torch.Tensor, Optional[torch.Tensor],
+ Optional[Tuple[torch.Tensor]]]:
+ """Input shape: Batch x Time x Channel
+
+ attention_mask: [bsz, q_len]
+ """
+ bsz, q_len, _ = hidden_states.size()
+
+ query_states = self.q_proj(hidden_states).view(
+ bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ key_states = self.k_proj(hidden_states).view(
+ bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ value_states = self.v_proj(hidden_states).view(
+ bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+ # [bsz, q_len, nh, hd]
+ # [bsz, nh, q_len, hd]
+
+ kv_seq_len = key_states.shape[-2]
+ offset = 0
+ if past_key_value is not None:
+ offset = past_key_value[0].shape[-2]
+ kv_seq_len += offset
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
+ query_states, key_states = apply_rotary_pos_emb(query_states,
+ key_states,
+ cos,
+ sin,
+ offset=offset)
+ # [bsz, nh, t, hd]
+ assert not output_attentions, "output_attentions is not supported"
+ assert not use_cache, "use_cache is not supported"
+ assert past_key_value is None, "past_key_value is not supported"
+
+ # Flash attention codes from
+ # https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attention.py
+
+ # transform the data into the format required by flash attention
+ qkv = torch.stack([query_states, key_states, value_states], dim=2) # [bsz, nh, 3, q_len, hd]
+ qkv = qkv.transpose(1, 3) # [bsz, q_len, 3, nh, hd]
+ # We have disabled _prepare_decoder_attention_mask in LlamaModel
+ # the attention_mask should be the same as the key_padding_mask
+ key_padding_mask = attention_mask
+
+
+ if key_padding_mask is None:
+ qkv = rearrange(qkv, 'b s ... -> (b s) ...')
+ max_s = q_len
+ cu_q_lens = torch.arange(0, (bsz + 1) * q_len, step=q_len, dtype=torch.int32,
+ device=qkv.device)
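+        # cu_q_lens gives the cumulative sequence lengths expected by the varlen kernel; with no padding every sequence has length q_len.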
+ output = flash_attn_unpadded_qkvpacked_func(
+ qkv, cu_q_lens, max_s, 0.0,
+ softmax_scale=None, causal=True
+ )
+ output = rearrange(output, '(b s) ... -> b s ...', b=bsz)
+ else:
+ nheads = qkv.shape[-2]
+ x = rearrange(qkv, 'b s three h d -> b s (three h d)')
+ x_unpad, indices, cu_q_lens, max_s = unpad_input(x, key_padding_mask)
+ x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
+ output_unpad = flash_attn_unpadded_qkvpacked_func(
+ x_unpad, cu_q_lens, max_s, 0.0,
+ softmax_scale=None, causal=True
+ )
+ output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
+ indices, bsz, q_len),
+ 'b s (h d) -> b s h d', h=nheads)
+ return self.o_proj(rearrange(output,
+ 'b s h d -> b s (h d)')), None, None
+
+
+# Disable the transformation of the attention mask in LlamaModel as the flash attention
+# requires the attention mask to be the same as the key_padding_mask
+def _prepare_decoder_attention_mask(self, attention_mask, input_shape,
+ inputs_embeds, past_key_values_length):
+ # [bsz, seq_len]
+ return attention_mask
+
+
+def replace_llama_attn_with_flash_attn():
+ transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = _prepare_decoder_attention_mask
+ transformers.models.llama.modeling_llama.LlamaAttention.forward = forward
diff --git a/ferret/train/train.py b/ferret/train/train.py
new file mode 100644
index 0000000..4a20b43
--- /dev/null
+++ b/ferret/train/train.py
@@ -0,0 +1,1387 @@
+# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
+# Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
+# Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import copy
+from copy import deepcopy
+from dataclasses import dataclass, field
+import json
+import logging
+import pathlib
+from typing import Dict, Optional, Sequence, Union, List
+
+import torch
+
+import transformers
+
+from ferret.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
+from torch.utils.data import Dataset
+from ferret.train.ferret_trainer import FERRETTrainer
+
+from ferret import conversation as conversation_lib
+from ferret.model import *
+from ferret.mm_utils import tokenizer_image_token
+
+from PIL import Image
+import torch.nn as nn
+import random
+import math
+import pdb
+from pycocotools import mask as mask_util
+import numpy as np
+import re
+
+DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
+VOCAB_IMAGE_W = 1000
+VOCAB_IMAGE_H = 1000
+
+
+local_rank = None
+
+
+def rank0_print(*args):
+ if local_rank == 0:
+ print(*args)
+
+
+@dataclass
+class ModelArguments:
+ model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
+ version: Optional[str] = field(default="v0")
+ freeze_backbone: bool = field(default=False)
+ tune_mm_mlp_adapter: bool = field(default=False)
+ vision_tower: Optional[str] = field(default=None)
+ mm_vision_select_layer: Optional[int] = field(default=-1) # default to the last layer
+ pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
+ mm_use_im_start_end: bool = field(default=False)
+ mm_use_im_patch_token: bool = field(default=True)
+ mm_vision_select_feature: Optional[str] = field(default="patch")
+ add_region_feature: bool = False
+ region_geo_sampler: bool = False
+ sampler_pooler_mode: str = field(default='mean') # Support 'mean' or 'max'
+ no_coor: bool = False
+ save_vision_tower: bool = field(default=False)
+
+
+@dataclass
+class DataArguments:
+ # data_path: str = field(default=None,
+ # metadata={"help": "Path to the training data."})
+ data_path: List[str] = field(default=None,
+ metadata={"help": "Path to the training data."})
+ data_multiple: List[float] = field(default=None,
+                                       metadata={"help": "Data multiplier for each dataset when mixed. None means direct concat."})
+ lazy_preprocess: bool = False
+ is_multimodal: bool = False
+ image_folder: List[str] = field(default=None)
+ image_aspect_ratio: str = 'square_nocrop' # Original Default:'square'
+ resized_image_h: int = 336 # 224
+ resized_image_w: int = 336 # 224
+ point_input_sample: str = 'segment_mask|uniform' # 'segment_mask|uniform', 'segment_mask|center', 'segment_mask|gaussian', 'gaussian', 'center'
+ refer_previous_point: bool = True
+ image_grid_pinpoints: Optional[str] = field(default=None)
+
+
+@dataclass
+class TrainingArguments(transformers.TrainingArguments):
+ cache_dir: Optional[str] = field(default=None)
+ optim: str = field(default="adamw_torch")
+ remove_unused_columns: bool = field(default=False)
+ freeze_mm_mlp_adapter: bool = field(default=False)
+ mpt_attn_impl: Optional[str] = field(default="triton")
+ model_max_length: int = field(
+ default=512,
+ metadata={
+ "help":
+ "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
+ },
+ )
+ double_quant: bool = field(
+ default=True,
+ metadata={"help": "Compress the quantization statistics through double quantization."}
+ )
+ quant_type: str = field(
+ default="nf4",
+ metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."}
+ )
+ bits: int = field(
+ default=16,
+ metadata={"help": "How many bits to use."}
+ )
+ lora_enable: bool = False
+ lora_r: int = 64
+ lora_alpha: int = 16
+ lora_dropout: float = 0.05
+ lora_weight_path: str = ""
+ lora_bias: str = "none"
+
+
+def maybe_zero_3(param, ignore_status=False, name=None):
+ from deepspeed import zero
+ from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
+ if hasattr(param, "ds_id"):
+ if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
+ if not ignore_status:
+ logging.warning(f"{name}: param.ds_status != ZeroParamStatus.NOT_AVAILABLE: {param.ds_status}")
+ with zero.GatheredParameters([param]):
+ param = param.data.detach().cpu().clone()
+ else:
+ param = param.detach().cpu().clone()
+ return param
+
+
+# Borrowed from peft.utils.get_peft_model_state_dict
+def get_peft_state_maybe_zero_3(named_params, bias):
+ if bias == "none":
+ to_return = {k: t for k, t in named_params if "lora_" in k}
+ elif bias == "all":
+ to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
+ elif bias == "lora_only":
+ to_return = {}
+ maybe_lora_bias = {}
+ lora_bias_names = set()
+ for k, t in named_params:
+ if "lora_" in k:
+ to_return[k] = t
+ bias_name = k.split("lora_")[0] + "bias"
+ lora_bias_names.add(bias_name)
+ elif "bias" in k:
+ maybe_lora_bias[k] = t
+        for k, t in maybe_lora_bias.items():
+            if k in lora_bias_names:
+                to_return[k] = t
+ else:
+ raise NotImplementedError
+ to_return = {k: maybe_zero_3(v, name=k) for k, v in to_return.items()}
+ return to_return
+
+
+def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
+ to_return = {k: t for k, t in named_params if "lora_" not in k}
+ if require_grad_only:
+ to_return = {k: t for k, t in to_return.items() if t.requires_grad}
+ to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
+ return to_return
+
+
+def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
+ to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
+ to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
+ return to_return
+
+
+def find_all_linear_names(model):
+ cls = torch.nn.Linear
+ lora_module_names = set()
+ for name, module in model.named_modules():
+ if isinstance(module, cls):
+ names = name.split('.')
+ lora_module_names.add(names[0] if len(names) == 1 else names[-1])
+
+
+ if 'lm_head' in lora_module_names: # needed for 16-bit
+ lora_module_names.remove('lm_head')
+ return list(lora_module_names)
+
+
+def safe_save_model_for_hf_trainer(trainer: transformers.Trainer,
+ output_dir: str,
+ save_vision_tower: bool):
+ """Collects the state dict and dump to disk."""
+ output_dir = os.path.join(output_dir, 'final-checkpoint')
+ if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
+ os.makedirs(output_dir, exist_ok=True)
+
+ if getattr(trainer.args, "tune_mm_mlp_adapter", False):
+ # Only save Adapter
+ keys_to_match = ['mm_projector']
+ if getattr(trainer.args, "use_im_start_end", False):
+ keys_to_match.extend(['embed_tokens', 'embed_in'])
+
+ weight_to_save = get_mm_adapter_state_maybe_zero_3(trainer.model.named_parameters(), keys_to_match)
+ trainer.model.config.save_pretrained(output_dir)
+
+ current_folder = output_dir.split('/')[-1]
+ parent_folder = os.path.dirname(output_dir)
+ if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
+ if current_folder.startswith('checkpoint-'):
+ mm_projector_folder = os.path.join(parent_folder, "mm_projector")
+ os.makedirs(mm_projector_folder, exist_ok=True)
+ torch.save(weight_to_save, os.path.join(mm_projector_folder, f'{current_folder}.bin'))
+ else:
+ torch.save(weight_to_save, os.path.join(output_dir, f'mm_projector.bin'))
+ return
+
+ if trainer.deepspeed:
+ torch.cuda.synchronize()
+ trainer.save_model(output_dir)
+ return
+
+ if save_vision_tower:
+ if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
+ current_folder = output_dir.split('/')[-1]
+ parent_folder = os.path.dirname(output_dir)
+ if current_folder.startswith('checkpoint-'):
+ vision_tower_folder = os.path.join(parent_folder, "vision_tower")
+ else:
+ vision_tower_folder = os.path.join(output_dir, "vision_tower")
+ trainer.model.model.get_vision_tower().vision_tower.save_pretrained(vision_tower_folder)
+ print(f'Save vision tower ckpt to {vision_tower_folder}')
+
+
+ state_dict = trainer.model.state_dict()
+ if trainer.args.should_save:
+ cpu_state_dict = {
+ key: value.cpu()
+ for key, value in state_dict.items()
+ }
+ del state_dict
+ trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
+
+
+def smart_tokenizer_and_embedding_resize(
+ special_tokens_dict: Dict,
+ tokenizer: transformers.PreTrainedTokenizer,
+ model: transformers.PreTrainedModel,
+):
+ """Resize tokenizer and embedding.
+
+ Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
+ """
+ num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
+ model.resize_token_embeddings(len(tokenizer))
+
+ if num_new_tokens > 0:
+ input_embeddings = model.get_input_embeddings().weight.data
+ output_embeddings = model.get_output_embeddings().weight.data
+
+ input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
+ dim=0, keepdim=True)
+ output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
+ dim=0, keepdim=True)
+
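+        # Initialize the new token embeddings with the mean of the existing embeddings.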
+ input_embeddings[-num_new_tokens:] = input_embeddings_avg
+ output_embeddings[-num_new_tokens:] = output_embeddings_avg
+
+
+def _tokenize_fn(strings: Sequence[str],
+ tokenizer: transformers.PreTrainedTokenizer) -> Dict:
+ """Tokenize a list of strings."""
+ tokenized_list = [
+ tokenizer(
+ text,
+ return_tensors="pt",
+ padding="longest",
+ max_length=tokenizer.model_max_length,
+ truncation=True,
+ ) for text in strings
+ ]
+ input_ids = labels = [
+ tokenized.input_ids[0] for tokenized in tokenized_list
+ ]
+ input_ids_lens = labels_lens = [
+ tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item()
+ for tokenized in tokenized_list
+ ]
+ return dict(
+ input_ids=input_ids,
+ labels=labels,
+ input_ids_lens=input_ids_lens,
+ labels_lens=labels_lens,
+ )
+
+
+def _mask_targets(target, tokenized_lens, speakers):
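+    # Mask the header and every human turn with IGNORE_INDEX so the loss is computed only on assistant replies.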
+ # cur_idx = 0
+ cur_idx = tokenized_lens[0]
+ tokenized_lens = tokenized_lens[1:]
+ target[:cur_idx] = IGNORE_INDEX
+ for tokenized_len, speaker in zip(tokenized_lens, speakers):
+ if speaker == "human":
+ target[cur_idx+2:cur_idx + tokenized_len] = IGNORE_INDEX
+ cur_idx += tokenized_len
+
+
+def _add_speaker_and_signal(header, source, get_conversation=True):
+ """Add speaker and start/end signal on each round."""
+ BEGIN_SIGNAL = "### "
+ END_SIGNAL = "\n"
+ conversation = header
+ for sentence in source:
+ from_str = sentence["from"]
+ if from_str.lower() == "human":
+ from_str = conversation_lib.default_conversation.roles[0]
+ elif from_str.lower() == "gpt":
+ from_str = conversation_lib.default_conversation.roles[1]
+ else:
+ from_str = 'unknown'
+ sentence["value"] = (BEGIN_SIGNAL + from_str + ": " +
+ sentence["value"] + END_SIGNAL)
+ if get_conversation:
+ conversation += sentence["value"]
+ conversation += BEGIN_SIGNAL
+ return conversation
+
+
+def preprocess_multimodal(
+ sources: Sequence[str],
+ data_args: DataArguments
+) -> Dict:
+ is_multimodal = data_args.is_multimodal
+ if not is_multimodal:
+ return sources
+
+ for source in sources:
+ for sentence in source:
+ if DEFAULT_IMAGE_TOKEN in sentence['value']:
+ sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '').strip()
+ sentence['value'] = DEFAULT_IMAGE_TOKEN + '\n' + sentence['value']
+ sentence['value'] = sentence['value'].strip()
+ if "mmtag" in conversation_lib.default_conversation.version:
+                    sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '<Image>' + DEFAULT_IMAGE_TOKEN + '</Image>')
+ replace_token = DEFAULT_IMAGE_TOKEN
+ if data_args.mm_use_im_start_end:
+ replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
+ sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)
+
+ # print('sources["value"]:', [j["value"] for j in sources[0]])
+ # raise NotImplementedError()
+ # pdb.set_trace()
+ return sources
+
+
+def preprocess_llama_2(
+ sources,
+ tokenizer: transformers.PreTrainedTokenizer,
+ has_image: bool = False
+) -> Dict:
+ conv = conversation_lib.default_conversation.copy()
+ roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
+
+ # Apply prompt templates
+ conversations = []
+ for i, source in enumerate(sources):
+ if roles[source[0]["from"]] != conv.roles[0]:
+ # Skip the first one if it is not from human
+ source = source[1:]
+
+ conv.messages = []
+ for j, sentence in enumerate(source):
+ role = roles[sentence["from"]]
+ assert role == conv.roles[j % 2], f"{i}"
+ conv.append_message(role, sentence["value"])
+ conversations.append(conv.get_prompt())
+
+ # Tokenize conversations
+
+ if has_image:
+ input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
+ else:
+ input_ids = tokenizer(
+ conversations,
+ return_tensors="pt",
+ padding="longest",
+ max_length=tokenizer.model_max_length,
+ truncation=True,
+ ).input_ids
+
+ targets = input_ids.clone()
+
+ assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2
+
+ # Mask targets
+ sep = "[/INST] "
+ for conversation, target in zip(conversations, targets):
+ total_len = int(target.ne(tokenizer.pad_token_id).sum())
+
+ rounds = conversation.split(conv.sep2)
+ cur_len = 1
+ target[:cur_len] = IGNORE_INDEX
+ for i, rou in enumerate(rounds):
+ if rou == "":
+ break
+
+ parts = rou.split(sep)
+ if len(parts) != 2:
+ break
+ parts[0] += sep
+
+ if has_image:
+ round_len = len(tokenizer_image_token(rou, tokenizer))
+ instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
+ else:
+ round_len = len(tokenizer(rou).input_ids)
+ instruction_len = len(tokenizer(parts[0]).input_ids) - 2
+
+ target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
+
+ cur_len += round_len
+ target[cur_len:] = IGNORE_INDEX
+
+ if cur_len < tokenizer.model_max_length:
+ if cur_len != total_len:
+ target[:] = IGNORE_INDEX
+ print(
+ f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
+ f" (ignored)"
+ )
+
+ return dict(
+ input_ids=input_ids,
+ labels=targets,
+ )
+
+
+def preprocess_v1(
+ sources,
+ tokenizer: transformers.PreTrainedTokenizer,
+ has_image: bool = False
+) -> Dict:
+ conv = conversation_lib.default_conversation.copy()
+ roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
+
+ # Apply prompt templates
+ conversations = []
+ for i, source in enumerate(sources):
+ if roles[source[0]["from"]] != conv.roles[0]:
+ # Skip the first one if it is not from human
+ source = source[1:]
+
+ conv.messages = []
+ for j, sentence in enumerate(source):
+ role = roles[sentence["from"]]
+ assert role == conv.roles[j % 2], f"{i}"
+ conv.append_message(role, sentence["value"])
+ conversations.append(conv.get_prompt())
+
+ # Tokenize conversations
+
+ if has_image:
+ input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations], dim=0)
+ else:
+ input_ids = tokenizer(
+ conversations,
+ return_tensors="pt",
+ padding="longest",
+ max_length=tokenizer.model_max_length,
+ truncation=True,
+ ).input_ids
+
+ targets = input_ids.clone()
+
+ assert conv.sep_style == conversation_lib.SeparatorStyle.TWO
+
+ # Mask targets
+ sep = conv.sep + conv.roles[1] + ": "
+ for conversation, target in zip(conversations, targets):
+ total_len = int(target.ne(tokenizer.pad_token_id).sum())
+
+ rounds = conversation.split(conv.sep2)
+ cur_len = 1
+ target[:cur_len] = IGNORE_INDEX
+ for i, rou in enumerate(rounds):
+ if rou == "":
+ break
+
+ parts = rou.split(sep)
+ if len(parts) != 2:
+ break
+ parts[0] += sep
+
+ if has_image:
+ round_len = len(tokenizer_image_token(rou, tokenizer))
+ instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
+ else:
+ round_len = len(tokenizer(rou).input_ids)
+ instruction_len = len(tokenizer(parts[0]).input_ids) - 2
+
+ target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
+
+ cur_len += round_len
+ target[cur_len:] = IGNORE_INDEX
+
+ if cur_len < tokenizer.model_max_length:
+ if cur_len != total_len:
+ target[:] = IGNORE_INDEX
+ print(
+ f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
+ f" (ignored)"
+ )
+
+ # pdb.set_trace()
+ # print('conversations:', conversations)
+ # print('input_ids', input_ids)
+ # print('target', target)
+ # raise NotImplementedError()
+
+ return dict(
+ input_ids=input_ids,
+ labels=targets,
+ )
+
+
+
+def preprocess_plain(
+ sources: Sequence[str],
+ tokenizer: transformers.PreTrainedTokenizer,
+) -> Dict:
+ # add end signal and concatenate together
+ conversations = []
+ for source in sources:
+ assert len(source) == 2
+ assert DEFAULT_IMAGE_TOKEN in source[0]['value']
+ source[0]['value'] = DEFAULT_IMAGE_TOKEN
+ conversation = source[0]['value'] + source[1]['value'] + conversation_lib.default_conversation.sep
+ conversations.append(conversation)
+ # tokenize conversations
+ input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations]
+ targets = copy.deepcopy(input_ids)
+ for target, source in zip(targets, sources):
+ tokenized_len = len(tokenizer_image_token(source[0]['value'], tokenizer))
+ target[:tokenized_len] = IGNORE_INDEX
+
+ return dict(input_ids=input_ids, labels=targets)
+
+
+def preprocess(
+ sources: Sequence[str],
+ tokenizer: transformers.PreTrainedTokenizer,
+ has_image: bool = False,
+ # add_region_feature: bool=False,
+) -> Dict:
+ """
+ Given a list of sources, each is a conversation list. This transform:
+    1. Add signal '### ' at the beginning of each sentence, with end signal '\n';
+ 2. Concatenate conversations together;
+ 3. Tokenize the concatenated conversation;
+ 4. Make a deepcopy as the target. Mask human words with IGNORE_INDEX.
+ """
+ if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
+ return preprocess_plain(sources, tokenizer)
+ if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
+ return preprocess_llama_2(sources, tokenizer, has_image=has_image)
+ if conversation_lib.default_conversation.version.startswith("v1"):
+ return preprocess_v1(sources, tokenizer, has_image=has_image)
+ # add end signal and concatenate together
+ conversations = []
+ for source in sources:
+ header = f"{conversation_lib.default_conversation.system}\n\n"
+ conversation = _add_speaker_and_signal(header, source)
+ conversations.append(conversation)
+ # tokenize conversations
+ def get_tokenize_len(prompts):
+ return [len(tokenizer_image_token(prompt, tokenizer)) for prompt in prompts]
+
+ if has_image:
+ input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in conversations]
+ else:
+ conversations_tokenized = _tokenize_fn(conversations, tokenizer)
+ input_ids = conversations_tokenized["input_ids"]
+
+ targets = copy.deepcopy(input_ids)
+ for target, source in zip(targets, sources):
+ if has_image:
+ tokenized_lens = get_tokenize_len([header] + [s["value"] for s in source])
+ else:
+ tokenized_lens = _tokenize_fn([header] + [s["value"] for s in source], tokenizer)["input_ids_lens"]
+ speakers = [sentence["from"] for sentence in source]
+ _mask_targets(target, tokenized_lens, speakers)
+
+ # pdb.set_trace()
+ # if add_region_feature:
+ # target[target==tokenizer.convert_tokens_to_ids([DEFAULT_REGION_FEA_TOKEN])[0]] = IGNORE_INDEX
+
+
+ return dict(input_ids=input_ids, labels=targets)
+
+def extend_list(original_list, multiplier):
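+    # e.g. a multiplier of 2.5 keeps two full copies of the list plus about half of its items sampled at random.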
+ # Calculate how many elements to replicate and how many to select randomly
+ replicate_elements = math.floor(multiplier)
+ random_elements = multiplier - replicate_elements
+
+ # Replicate the list
+ replicated_list = original_list * replicate_elements
+
+ # Calculate how many elements to randomly select
+ select_elements = math.ceil(len(original_list) * random_elements)
+
+ # Randomly select elements and append to the replicated list
+ for _ in range(select_elements):
+ random_element = random.choice(original_list)
+ replicated_list.append(random_element)
+
+ return replicated_list
+
+
+def extract_coors(s):
+ # Regex pattern to match brackets content
+ brackets_pattern = r'\[(.*?)\]'
+
+ # Regex pattern to match values
+ values_pattern = r'=\s*([^,\]]+)'
+
+ # Find all bracketed strings
+ brackets = re.findall(brackets_pattern, s)
+
+ # Define a list to hold the list of values
+ values_list = []
+
+ # Extract values from each bracketed string
+ for bracket in brackets:
+ # Find all matches in the string
+ matches = re.findall(values_pattern, bracket)
+ # Convert matches to integers and add to values_list
+ values_list.append([int(match) for match in matches])
+
+ return values_list
+
+def regulate_box(box, img_w, img_h):
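+    # Clamp the box to the valid pixel range [0, img_w-1] x [0, img_h-1].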
+ return [max(0, min(box[0], img_w-1)), max(0, min(box[1], img_h-1)), max(0, min(box[2], img_w-1)), max(0, min(box[3], img_h-1))]
+
+
+class LazySupervisedDataset(Dataset):
+ """Dataset for supervised fine-tuning."""
+
+ def load_vg_object(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'vg_object'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ data_i['location_instruction'] = True
+ return datas
+
+ def load_vg_yesno_object(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'vg_yesno_object'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ data_i['location_instruction'] = True
+ return datas
+
+ def load_vg_attribute(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'vg_attribute'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ data_i['location_instruction'] = True
+ return datas
+
+ def load_vg_relation(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'vg_relation'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ data_i['location_instruction'] = True
+ return datas
+
+ def load_vg_region(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'vg_region'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ data_i['location_instruction'] = True
+ return datas
+
+ def load_git_instruction(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'git_instruction'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ data_i['location_instruction'] = True
+ return datas
+
+ def load_llava(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'llava_instruct'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ return datas
+
+ def load_grounded_llava_boxes(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i["dataset"] = "grounded_llava_boxes"
+ data_i["location_instruction"] = True
+ data_i["image"] = os.path.join(image_folder, data_i["image"])
+ return datas
+
+ def load_refexp(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i["dataset"] = "refexp"
+ data_i["location_instruction"] = True
+ data_i["image"] = os.path.join(image_folder, data_i['image'])
+ return datas
+
+ def load_flickr(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i["dataset"] = "flickr"
+ data_i["location_instruction"] = True
+ data_i["image"] = os.path.join(image_folder, data_i['image'])
+ return datas
+
+ def load_objects365(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i["dataset"] = "objects365"
+ data_i["location_instruction"] = True
+ data_i["image"] = os.path.join(image_folder, data_i['image'])
+ return datas
+
+ def load_cc3m(self, data_path, image_folder):
+ datas = json.load(open(data_path, "r"))
+ for data_i in datas:
+ data_i['dataset'] = 'cc3m_595k'
+ data_i['image'] = os.path.join(image_folder, data_i['image'])
+ return datas
+
+ def __init__(self, data_path: str,
+ tokenizer: transformers.PreTrainedTokenizer,
+ model_args: DataArguments,
+ data_args: DataArguments):
+ super(LazySupervisedDataset, self).__init__()
+ data_multiple = data_args.data_multiple
+ if not isinstance(data_path, list):
+ data_path = [data_path]
+
+ image_folders = data_args.image_folder
+ if not isinstance(data_args.image_folder, list):
+ image_folders = [data_args.image_folder]
+
+
+ list_data_dict = []
+ for data_path_i, image_folder_i in zip(data_path, image_folders):
+ if 'vg_object' in data_path_i:
+ logging.warning(f"Loading vg_object data")
+ list_data_dict.append(self.load_vg_object(data_path_i, image_folder_i))
+ elif 'vg_yesno_object' in data_path_i:
+ logging.warning(f"Loading vg_yesno_object data")
+ list_data_dict.append(self.load_vg_yesno_object(data_path_i, image_folder_i))
+ elif 'vg_attribute' in data_path_i:
+ logging.warning(f"Loading vg_attribute data")
+ list_data_dict.append(self.load_vg_attribute(data_path_i, image_folder_i))
+ elif 'vg_relation' in data_path_i:
+ logging.warning(f"Loading vg_relation data")
+ list_data_dict.append(self.load_vg_relation(data_path_i, image_folder_i))
+ elif 'vg_region' in data_path_i:
+ logging.warning(f"Loading vg_region data")
+ list_data_dict.append(self.load_vg_region(data_path_i, image_folder_i))
+ # elif 'grounded_llava_objects' in data_path_i:
+ # logging.warning(f"Loading grounded_llava_objects data")
+ # list_data_dict.append(self.load_grounded_llava_objects(data_path_i, image_folder_i))
+ elif 'git_instruction' in data_path_i:
+ logging.warning(f"Loading git_instruction data")
+ if data_multiple is None:
+                logging.warning(f"Multiplying git_instruction data by 3 to bring it to around 100k samples")
+ list_data_dict.append(self.load_git_instruction(data_path_i, image_folder_i) * 3)
+ else:
+ list_data_dict.append(self.load_git_instruction(data_path_i, image_folder_i))
+ elif 'llava_instruct' in data_path_i:
+ logging.warning(f"Loading llava_instruct data")
+ list_data_dict.append(self.load_llava(data_path_i, image_folder_i))
+ elif 'grounded_llava_boxes' in data_path_i:
+ logging.warning(f"Loading grounded_llava_boxes data")
+ list_data_dict.append(self.load_grounded_llava_boxes(data_path_i, image_folder_i))
+ elif 'refexp' in data_path_i:
+ logging.warning(f"Loading refexp data")
+ list_data_dict.append(self.load_refexp(data_path_i, image_folder_i))
+ elif 'flickr' in data_path_i:
+ logging.warning(f"Loading flickr data")
+ list_data_dict.append(self.load_flickr(data_path_i, image_folder_i))
+ elif 'objects365' in data_path_i:
+ logging.warning(f"Loading o365 data")
+ list_data_dict.append(self.load_objects365(data_path_i, image_folder_i))
+ elif 'cc3m_595k' in data_path_i:
+ logging.warning(f"Loading cc3m_595k data")
+ list_data_dict.append(self.load_cc3m(data_path_i, image_folder_i))
+ else:
+ raise ValueError(f'{data_path_i} Not Supported.')
+
+ if data_multiple is None:
+ # Concat all data directly and Shuffle.
+ list_data_dict = [item for dataset_i in list_data_dict for item in dataset_i]
+ random.shuffle(list_data_dict)
+ else:
+ new_list_data_dict = []
+ for data_scaler_i, dataset_i in zip(data_multiple, list_data_dict):
+ dataset_name_i = dataset_i[0]['dataset']
+ logging.warning(f"Multiplying {dataset_name_i} by {data_scaler_i} times")
+ new_dataset_i = extend_list(dataset_i, data_scaler_i)
+ new_list_data_dict.extend(new_dataset_i)
+ list_data_dict = new_list_data_dict
+ random.shuffle(list_data_dict)
+
+ rank0_print("Formatting inputs...Skip in lazy mode")
+ self.tokenizer = tokenizer
+ self.list_data_dict = list_data_dict
+ self.data_args = data_args
+ self.model_args = model_args
+ self.point_input_sample = self.data_args.point_input_sample
+ self.add_region_feature = self.model_args.add_region_feature
+ self.no_coor = self.model_args.no_coor
+ self.refer_previous_point = self.data_args.refer_previous_point
+
+ def __len__(self):
+ return len(self.list_data_dict)
+
+ def get_obj_center(self, box, ratio_w, ratio_h, std_dev_weight=0.15):
+ box_center_w = ratio_w * (box[0]+box[2])/2.0
+ box_center_h = ratio_h * (box[1]+box[3])/2.0
+
+ box_min_w = ratio_w * box[0]
+ box_max_w = ratio_w * box[2]
+
+ box_min_h = ratio_h * box[1]
+ box_max_h = ratio_h * box[3]
+
+        # Set the std of Gaussian sampling; roughly 68%/95% of samples fall within +-1/2 std devs.
+ gau_std_w = (box_max_w - box_min_w)*std_dev_weight
+ gau_std_h = (box_max_h - box_min_h)*std_dev_weight
+
+ def sample_float_within_range(mean, std_dev, min_val, max_val):
+ while True:
+ x = random.gauss(mean[0], std_dev[0])
+ y = random.gauss(mean[1], std_dev[1])
+ if min_val[0] <= x <= max_val[0] and min_val[1] <= y <= max_val[1]:
+ return x, y
+
+ jit_x, jit_y = sample_float_within_range(mean=[box_center_w, box_center_h], std_dev=[gau_std_w, gau_std_h],
+ min_val=[box_min_w, box_min_h], max_val=[box_max_w, box_max_h])
+
+ return jit_x, jit_y
+
+ def sample_point_in_segment(self, mask, ratio_w, ratio_h, box=None, sampling='uniform'):
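+        # The segmentation is stored as COCO RLE; pycocotools expects 'counts' as bytes before decoding into a binary [h, w] mask.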
+ mask['counts'] = mask['counts'].encode('ascii')
+ bin_mask = mask_util.decode(mask)
+ # Get the indices of True elements in the mask
+ # Note here the size of bin_mask is [h, w].
+ indices = np.transpose(np.nonzero(bin_mask))
+ if sampling == 'center' or sampling == 'gaussian':
+ if sampling == 'center':
+ box_anchor_w = int((box[0]+box[2])/2.0)
+ box_anchor_h = int((box[1]+box[3])/2.0)
+ elif sampling == 'gaussian':
+                # Sample a point from a centered Gaussian distribution; ratio_w and ratio_h are set to 1 to keep the original width/height.
+ box_anchor_w, box_anchor_h = self.get_obj_center(box, 1, 1, std_dev_weight=0.15)
+ # get 1000 random items from the indices
+ sampled_list = random.sample(list(range(len(indices))), min(1000, len(indices)))
+ min_dis = 1e6
+ min_point = None
+ for sampled_i in sampled_list:
+ point_i = indices[sampled_i]
+ dis_i = (point_i[0] - box_anchor_h)**2 + (point_i[1] - box_anchor_w)**2
+ if dis_i <= min_dis or min_point is None:
+ min_dis = dis_i
+ min_point = point_i
+ point = min_point
+ elif sampling == 'uniform':
+ # Randomly select an index
+ random_index = np.random.choice(len(indices))
+ # Get the randomly selected point
+ point = indices[random_index]
+ else:
+ raise NotImplementedError(f'Not support {sampling}.')
+ # Note here point is in original image size and its order is [h, w].
+ cor_x = point[1] * ratio_w
+ cor_y = point[0] * ratio_h
+ return cor_x, cor_y
+
+
+ def get_bbox_coor(self, box, ratio_w, ratio_h):
+ return box[0] * ratio_w, box[1] * ratio_h, box[2] * ratio_w, box[3] * ratio_h
+
+
+ def generate_mask_for_feature(self, coor, box, mask, raw_w, raw_h):
+ # Build SAM mask
+ if mask is not None:
+ mask['counts'] = mask['counts'].encode('ascii')
+ sam_mask = mask_util.decode(mask)
+ # Note [h, w] -> [w, h].
+ sam_mask = np.transpose(sam_mask)
+ else:
+ sam_mask = None
+ # Build box mask
+ box_mask = np.zeros((raw_w, raw_h))
+ box_mask[box[0]:box[2]+1, box[1]:box[3]+1] = 1
+
+ coor_mask = np.zeros((raw_w, raw_h))
+ # Assume it samples a point.
+ if len(coor) == 2:
+ # Define window size
+ span = 5
+ # Make sure the window does not exceed array bounds
+ x_min = max(0, coor[0] - span)
+ x_max = min(raw_w, coor[0] + span + 1)
+ y_min = max(0, coor[1] - span)
+ y_max = min(raw_h, coor[1] + span + 1)
+ coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
+            # The SAM mask may extend beyond the bounding box, so intersect with sam_mask or box_mask, not both.
+ coor_mask = coor_mask * sam_mask if sam_mask is not None else coor_mask * box_mask
+ assert (coor_mask==1).any(), f"coor: {coor}, box: {box}, raw_w: {raw_w}, raw_h: {raw_h}"
+ elif len(coor) == 4:
+ coor_mask = box_mask * sam_mask if sam_mask is not None else box_mask
+ if (coor_mask==0).all():
+                print('Found a sample whose SAM mask and box do not overlap; using the box mask only.')
+ coor_mask = box_mask
+ assert (coor_mask==1).any(), f"coor: {coor}, box: {box}, raw_w: {raw_w}, raw_h: {raw_h}"
+ else:
+ raise NotImplementedError('Coordinates must be 2d or 4d.')
+ coor_mask = torch.from_numpy(coor_mask)
+ assert len(coor_mask.nonzero()) != 0
+
+ return coor_mask
+
+
+ def __getitem__(self, i) -> Dict[str, torch.Tensor]:
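+        # Load and preprocess the image, replace the location placeholders in the conversation
+        # with point/box coordinates (plus an optional region-feature token), then tokenize the
+        # conversation into input_ids/labels.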
+ sources = deepcopy(self.list_data_dict[i])
+ # sources = self.list_data_dict[i]
+ cache_region_masks = []
+ if isinstance(i, int):
+ sources = [sources]
+ assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME
+ if 'image' in sources[0]:
+ image_file = self.list_data_dict[i]['image']
+ # image_folder = self.data_args.image_folder
+ processor = self.data_args.image_processor
+ # image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')
+ image = Image.open(image_file).convert('RGB')
+ if self.data_args.image_aspect_ratio == 'pad':
+ def expand2square(pil_img, background_color):
+ width, height = pil_img.size
+ if width == height:
+ return pil_img
+ elif width > height:
+ result = Image.new(pil_img.mode, (width, width), background_color)
+ result.paste(pil_img, (0, (width - height) // 2))
+ return result
+ else:
+ result = Image.new(pil_img.mode, (height, height), background_color)
+ result.paste(pil_img, ((height - width) // 2, 0))
+ return result
+ image = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
+ image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
+ elif self.data_args.image_aspect_ratio == 'square_nocrop':
+ resized_image_h = self.data_args.resized_image_h
+ resized_image_w = self.data_args.resized_image_w
+ image = processor.preprocess(image, return_tensors='pt', do_resize=True, do_center_crop=False, size=[resized_image_h, resized_image_w])['pixel_values'][0]
+ else:
+ image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
+
+        # Process locations/coordinates.
+ if 'location_instruction' in sources[0]:
+ assert sources[0]['dataset'] in ['vg_object', 'vg_yesno_object', 'vg_attribute', 'vg_relation', 'vg_region', \
+ 'git_instruction', 'grounded_llava_boxes', 'refexp', 'flickr',\
+ 'objects365']
+ ratio_w = VOCAB_IMAGE_W * 1.0 / sources[0]['image_w']
+ ratio_h = VOCAB_IMAGE_H * 1.0 / sources[0]['image_h']
+ conversation = deepcopy(sources[0]['conversations'])
+ assert len(sources) == 1
+ for box_list_idx, box_list_i in enumerate(sources[0]['box_x1y1x2y2']):
+                # For a human turn, build a fresh cache to store the points sampled in this round of input.
+ if box_list_idx % 2 == 0 and self.refer_previous_point:
+ point_box_cache = {}
+
+ # Replace location placeholders with points or boxes.
+ if len(box_list_i) == 0:
+ # No location mentioned in this round of conversation.
+ continue
+
+ if box_list_idx % 2 == 0:
+                    # Randomly choose point or box as the coordinate format for the human turn.
+ location_format = random.choice(['point', 'box'])
+ else:
+                    # Always output a box in the model response.
+ location_format = 'box'
+ cur_conv = conversation[box_list_idx]['value']
+                # Iteratively replace the placeholders in the current turn with real box/point coordinates.
+ for box_idx, box_i in enumerate(box_list_i):
+ box_i = regulate_box(box_i, sources[0]['image_w'], sources[0]['image_h'])
+ if location_format == 'box':
+                        # If this box was mentioned in the last human input, reuse the same coordinates the human used.
+ if 'point_box_cache' in locals() and tuple(box_i) in point_box_cache:
+ raw_coor_i = point_box_cache[tuple(box_i)]
+ coor_i = f'[{int(raw_coor_i[0])}, {int(raw_coor_i[1])}]'
+ else:
+ raw_coor_i = self.get_bbox_coor(box=box_i, ratio_w=ratio_w, ratio_h=ratio_h)
+ coor_i = f'[{int(raw_coor_i[0])}, {int(raw_coor_i[1])}, {int(raw_coor_i[2])}, {int(raw_coor_i[3])}]'
+ elif location_format == 'point':
+ # Assert it's human input.
+ assert box_list_idx % 2 == 0
+                        # If this box was mentioned earlier in this round of human input, reuse the previously sampled coordinates.
+ if 'point_box_cache' in locals() and tuple(box_i) in point_box_cache:
+ raw_coor_i = point_box_cache[tuple(box_i)]
+ else:
+ if 'segment_mask' in self.point_input_sample:
+ if 'masks' in sources[0]:
+ cur_mask = deepcopy(sources[0]['masks'][box_list_idx][box_idx])
+ assert cur_mask['size'][0] == sources[0]['image_h']
+ assert cur_mask['size'][1] == sources[0]['image_w']
+ if 'uniform' in self.point_input_sample.split('|')[1]:
+ obj_center_x, obj_center_y = self.sample_point_in_segment(mask=cur_mask, ratio_w=ratio_w, ratio_h=ratio_h)
+ elif 'center' in self.point_input_sample.split('|')[1]:
+ obj_center_x, obj_center_y = self.sample_point_in_segment(mask=cur_mask, ratio_w=ratio_w, ratio_h=ratio_h, box=box_i, sampling='center')
+ elif 'gaussian' in self.point_input_sample.split('|')[1]:
+ obj_center_x, obj_center_y = self.sample_point_in_segment(mask=cur_mask, ratio_w=ratio_w, ratio_h=ratio_h, box=box_i, sampling='gaussian')
+ else:
+ # Not all data have/need segment masks.
+ obj_center_x, obj_center_y = self.get_obj_center(box=box_i, ratio_w=ratio_w, ratio_h=ratio_h, std_dev_weight=0.15)
+ elif self.point_input_sample == 'gaussian':
+ obj_center_x, obj_center_y = self.get_obj_center(box=box_i, ratio_w=ratio_w, ratio_h=ratio_h, std_dev_weight=0.15)
+ elif self.point_input_sample == 'center':
+ obj_center_x = ratio_w * (box_i[0]+box_i[2])/2.0
+ obj_center_y = ratio_h * (box_i[1]+box_i[3])/2.0
+ else:
+                                    raise NotImplementedError(f'Point sampling method {self.point_input_sample} is not supported.')
+ raw_coor_i = [obj_center_x, obj_center_y]
+ if 'point_box_cache' in locals() and self.refer_previous_point:
+ point_box_cache[tuple(box_i)] = raw_coor_i
+ coor_i = f'[{int(raw_coor_i[0])}, {int(raw_coor_i[1])}]'
+                        assert f'<bbox_location{box_idx}>' in cur_conv
+ if self.add_region_feature and box_list_idx % 2 == 0:
+ if self.no_coor:
+                                cur_conv = cur_conv.replace(f'<bbox_location{box_idx}>', f'{DEFAULT_REGION_FEA_TOKEN}')
+ else:
+                                cur_conv = cur_conv.replace(f'<bbox_location{box_idx}>', coor_i + f' {DEFAULT_REGION_FEA_TOKEN}')
+ cur_box = box_i
+ cur_mask = deepcopy(sources[0]['masks'][box_list_idx][box_idx]) if 'masks' in sources[0] else None
+ ori_size_raw_coor_i = [raw_coor_i[0]/ratio_w, raw_coor_i[1]/ratio_h, raw_coor_i[2]/ratio_w, raw_coor_i[3]/ratio_h] if len(raw_coor_i) == 4 \
+ else [raw_coor_i[0]/ratio_w, raw_coor_i[1]/ratio_h]
+ cur_region_mask = self.generate_mask_for_feature(ori_size_raw_coor_i, cur_box, cur_mask, raw_w=sources[0]['image_w'], raw_h=sources[0]['image_h'])
+ cache_region_masks.append(cur_region_mask)
+ # print('cur_conv:', cur_conv)
+ # print('cur_region_mask:', cur_region_mask.nonzero())
+ # raise NotImplementedError()
+ # pdb.set_trace()
+ else:
+ if self.no_coor:
+                                cur_conv = cur_conv.replace(f'<bbox_location{box_idx}>', '')
+ else:
+                                cur_conv = cur_conv.replace(f'<bbox_location{box_idx}>', coor_i)
+ # Assign this round of conv back.
+ conversation[box_list_idx]['value'] = cur_conv
+ sources[0]['conversations'] = conversation
+ # print(conversation)
+ # exit(0)
+
+ # cur_token_len = (image.shape[1]//14) * (image.shape[2]//14) # FIXME: 14 is hardcoded patch size
+ sources = preprocess_multimodal(
+ copy.deepcopy([e["conversations"] for e in sources]),
+ self.data_args)
+ else:
+ sources = copy.deepcopy([e["conversations"] for e in sources])
+ data_dict = preprocess(
+ sources,
+ self.tokenizer,
+ has_image=('image' in self.list_data_dict[i]),
+ # add_region_feature=self.add_region_feature
+ )
+ if isinstance(i, int):
+ data_dict = dict(input_ids=data_dict["input_ids"][0],
+ labels=data_dict["labels"][0])
+
+        # image exists in the data
+ if 'image' in self.list_data_dict[i]:
+ data_dict['image'] = image
+ elif self.data_args.is_multimodal:
+ # image does not exist in the data, but the model is multimodal
+ crop_size = self.data_args.image_processor.crop_size
+ data_dict['image'] = torch.zeros(3, crop_size['height'], crop_size['width'])
+ if self.add_region_feature:
+ data_dict['region_masks'] = cache_region_masks
+ return data_dict
+
+
+@dataclass
+class DataCollatorForSupervisedDataset(object):
+ """Collate examples for supervised fine-tuning."""
+
+ tokenizer: transformers.PreTrainedTokenizer
+
+ def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
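+        # Pad input_ids/labels to the longest sequence in the batch, truncate to
+        # model_max_length, and carry images (stacked when shapes match) and region masks through.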
+ input_ids, labels = tuple([instance[key] for instance in instances]
+ for key in ("input_ids", "labels"))
+ input_ids = torch.nn.utils.rnn.pad_sequence(
+ input_ids,
+ batch_first=True,
+ padding_value=self.tokenizer.pad_token_id)
+ labels = torch.nn.utils.rnn.pad_sequence(labels,
+ batch_first=True,
+ padding_value=IGNORE_INDEX)
+ input_ids = input_ids[:, :self.tokenizer.model_max_length]
+ labels = labels[:, :self.tokenizer.model_max_length]
+ batch = dict(
+ input_ids=input_ids,
+ labels=labels,
+ attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
+ )
+
+ if 'image' in instances[0]:
+ images = [instance['image'] for instance in instances]
+ if all(x is not None and x.shape == images[0].shape for x in images):
+ batch['images'] = torch.stack(images)
+ else:
+ batch['images'] = images
+
+ if 'region_masks' in instances[0]:
+ region_masks = [instance['region_masks'] for instance in instances]
+ batch['region_masks'] = region_masks
+
+ return batch
+
+
+def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer,
+ data_args,
+ model_args) -> Dict:
+ """Make dataset and collator for supervised fine-tuning."""
+ train_dataset = LazySupervisedDataset(tokenizer=tokenizer,
+ data_path=data_args.data_path,
+ data_args=data_args,
+ model_args=model_args)
+ data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
+ return dict(train_dataset=train_dataset,
+ eval_dataset=None,
+ data_collator=data_collator)
+
+
+def train():
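+    # Parse model/data/training arguments, build the (optionally quantized or LoRA-wrapped)
+    # FERRET model and tokenizer, set up the vision modules and dataset, then run the trainer.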
+ global local_rank
+
+ parser = transformers.HfArgumentParser(
+ (ModelArguments, DataArguments, TrainingArguments))
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
+ if model_args.no_coor:
+ assert model_args.add_region_feature
+ local_rank = training_args.local_rank
+ compute_dtype = (torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32))
+
+ bnb_model_from_pretrained_args = {}
+ if training_args.bits in [4, 8]:
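+        # Build the bitsandbytes kwargs for 4-/8-bit quantized loading.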
+ from transformers import BitsAndBytesConfig
+ bnb_model_from_pretrained_args.update(dict(
+ device_map={"": training_args.device},
+ load_in_4bit=training_args.bits == 4,
+ load_in_8bit=training_args.bits == 8,
+ quantization_config=BitsAndBytesConfig(
+ load_in_4bit=training_args.bits == 4,
+ load_in_8bit=training_args.bits == 8,
+ llm_int8_threshold=6.0,
+ llm_int8_has_fp16_weight=False,
+ bnb_4bit_compute_dtype=compute_dtype,
+ bnb_4bit_use_double_quant=training_args.double_quant,
+ bnb_4bit_quant_type=training_args.quant_type # {'fp4', 'nf4'}
+ )
+ ))
+
+ if model_args.vision_tower is not None:
+ model = FERRETLlamaForCausalLM.from_pretrained(
+ model_args.model_name_or_path,
+ cache_dir=training_args.cache_dir,
+ **bnb_model_from_pretrained_args
+ )
+ else:
+ model = transformers.LlamaForCausalLM.from_pretrained(
+ model_args.model_name_or_path,
+ cache_dir=training_args.cache_dir,
+ **bnb_model_from_pretrained_args
+ )
+ model.config.use_cache = False
+
+ if model_args.freeze_backbone:
+ model.model.requires_grad_(False)
+
+ if training_args.bits in [4, 8]:
+ from peft import prepare_model_for_kbit_training
+ model.config.torch_dtype=(torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32))
+ model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
+
+ if training_args.gradient_checkpointing and not training_args.fsdp:
+ if hasattr(model, "enable_input_require_grads"):
+ model.enable_input_require_grads()
+ else:
+ def make_inputs_require_grad(module, input, output):
+ output.requires_grad_(True)
+ model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
+
+ if training_args.lora_enable:
+ from peft import LoraConfig, get_peft_model
+ lora_config = LoraConfig(
+ r=training_args.lora_r,
+ lora_alpha=training_args.lora_alpha,
+ target_modules=find_all_linear_names(model),
+ lora_dropout=training_args.lora_dropout,
+ bias=training_args.lora_bias,
+ task_type="CAUSAL_LM",
+ )
+ if training_args.bits == 16:
+ if training_args.bf16:
+ model.to(torch.bfloat16)
+ if training_args.fp16:
+ model.to(torch.float16)
+ rank0_print("Adding LoRA adapters...")
+ model = get_peft_model(model, lora_config)
+ # model.print_trainable_parameters()
+
+ tokenizer = transformers.AutoTokenizer.from_pretrained(
+ model_args.model_name_or_path,
+ cache_dir=training_args.cache_dir,
+ model_max_length=training_args.model_max_length,
+ padding_side="right",
+ use_fast=False,
+ )
+
+ if model_args.version == "v0":
+ if tokenizer.pad_token is None:
+ smart_tokenizer_and_embedding_resize(
+ special_tokens_dict=dict(pad_token="[PAD]"),
+ tokenizer=tokenizer,
+ model=model,
+ )
+ elif model_args.version == "v0.5":
+ tokenizer.pad_token = tokenizer.unk_token
+ else:
+ tokenizer.pad_token = tokenizer.unk_token
+ if model_args.version in conversation_lib.conv_templates:
+ conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
+ else:
+ conversation_lib.default_conversation = conversation_lib.conv_templates["ferret_v1"]
+
+ if model_args.vision_tower is not None:
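+        # Set up the vision tower, multimodal projector, and (when enabled) the region
+        # feature sampler, then cast the vision tower to the training dtype.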
+ model.get_model().initialize_vision_modules(
+ model_args=model_args,
+ fsdp=training_args.fsdp,
+ add_region_feature=model_args.add_region_feature,
+ region_geo_sampler=model_args.region_geo_sampler,
+ sampler_pooler_mode=model_args.sampler_pooler_mode
+ )
+
+ vision_tower = model.get_vision_tower()
+
+ dtype = torch.float32
+ if training_args.fp16:
+ dtype = torch.float16
+ if training_args.bf16:
+ dtype = torch.bfloat16
+
+ vision_tower.to(dtype=dtype, device=training_args.device)
+
+ data_args.image_processor = vision_tower.image_processor
+ data_args.is_multimodal = True
+
+ model.config.image_aspect_ratio = data_args.image_aspect_ratio
+ model.config.image_grid_pinpoints = data_args.image_grid_pinpoints
+ training_args.save_vision_tower = model_args.save_vision_tower
+
+ model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
+ if model_args.tune_mm_mlp_adapter:
+ model.requires_grad_(False)
+ for p in model.get_model().mm_projector.parameters():
+ p.requires_grad = True
+
+ model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter
+ if training_args.freeze_mm_mlp_adapter:
+ for p in model.get_model().mm_projector.parameters():
+ p.requires_grad = False
+
+ if model_args.add_region_feature:
+ if model_args.region_geo_sampler:
+ for p in model.get_model().region_geo_sampler.parameters():
+ p.requires_grad = True
+ else:
+ for p in model.get_model().region_fea_adapter.parameters():
+ p.requires_grad = True
+
+ if training_args.bits in [4, 8]:
+ model.get_model().mm_projector.to(dtype=compute_dtype, device=training_args.device)
+
+ model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
+ training_args.use_im_start_end = model_args.mm_use_im_start_end
+ model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
+ model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer, add_region_feature=model_args.add_region_feature)
+
+ if training_args.bits in [4, 8]:
+ from peft.tuners.lora import LoraLayer
+ for name, module in model.named_modules():
+ if isinstance(module, LoraLayer):
+ if training_args.bf16:
+ module = module.to(torch.bfloat16)
+ if 'norm' in name:
+ module = module.to(torch.float32)
+ if 'lm_head' in name or 'embed_tokens' in name:
+ if hasattr(module, 'weight'):
+ if training_args.bf16 and module.weight.dtype == torch.float32:
+ module = module.to(torch.bfloat16)
+
+ # pdb.set_trace()
+ params_no_grad = [n for n, p in model.named_parameters() if not p.requires_grad]
+ params_has_grad = [n for n, p in model.named_parameters() if p.requires_grad]
+ print(f'Params being frozen: {params_no_grad}.')
+ print(f'Params being updated: {params_has_grad}.')
+ if len(params_no_grad) > 0:
+ if training_args.fsdp is not None and len(training_args.fsdp) > 0:
+ if len(params_no_grad) < 10:
+                print('[WARNING] Attempting to use FSDP while {} parameters do not require gradients: {}'.format(len(params_no_grad), params_no_grad))
+            else:
+                print('[WARNING] Attempting to use FSDP while {} parameters do not require gradients: {}...(omitted)'.format(len(params_no_grad), ', '.join(params_no_grad[:10])))
+            print("[WARNING] Attempting to use FSDP with partially frozen parameters, this is experimental.")
+ print("[WARNING] As of 4/30/23, this feature requires PyTorch-nightly build. See here for details: https://github.com/haotian-liu/LLaVA#experimental-use-fsdp-to-save-memory-in-pretraining")
+
+ from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
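+            # Default use_orig_params to True so FSDP can wrap a model with partially
+            # frozen parameters (see the warnings above).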
+ def patch_FSDP_use_orig_params(func):
+ def wrap_func(*args, **kwargs):
+ use_orig_params = kwargs.pop('use_orig_params', True)
+ return func(*args, **kwargs, use_orig_params=use_orig_params)
+ return wrap_func
+
+ FSDP.__init__ = patch_FSDP_use_orig_params(FSDP.__init__)
+
+ data_module = make_supervised_data_module(tokenizer=tokenizer,
+ data_args=data_args,
+ model_args=model_args)
+ trainer = FERRETTrainer(model=model,
+ tokenizer=tokenizer,
+ args=training_args,
+ **data_module)
+ # pdb.set_trace()
+
+ if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
+ trainer.train(resume_from_checkpoint=True)
+ else:
+ trainer.train()
+ trainer.save_state()
+
+ if not training_args.fsdp:
+ model.config.use_cache = True
+
+ if training_args.lora_enable:
+ state_dict = get_peft_state_maybe_zero_3(
+ model.named_parameters(), training_args.lora_bias
+ )
+ non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(
+ model.named_parameters()
+ )
+ if training_args.local_rank == 0 or training_args.local_rank == -1:
+ model.config.save_pretrained(training_args.output_dir)
+ model.save_pretrained(training_args.output_dir, state_dict=state_dict)
+ torch.save(non_lora_state_dict, os.path.join(training_args.output_dir, 'non_lora_trainables.bin'))
+ else:
+ safe_save_model_for_hf_trainer(trainer=trainer,
+ output_dir=training_args.output_dir,
+ save_vision_tower=model_args.save_vision_tower)
+
+
+if __name__ == "__main__":
+ train()
diff --git a/ferret/train/train_mem.py b/ferret/train/train_mem.py
new file mode 100644
index 0000000..92be452
--- /dev/null
+++ b/ferret/train/train_mem.py
@@ -0,0 +1,13 @@
+# Adapted from https://github.com/lm-sys/FastChat. Below is the original copyright:
+# Adapted from tatsu-lab@stanford_alpaca. Below is the original copyright:
+# Make it more memory efficient by monkey patching the LLaMA model with FlashAttn.
+
+# Need to call this before importing transformers.
+from ferret.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
+
+replace_llama_attn_with_flash_attn()
+
+from ferret.train.train import train
+
+if __name__ == "__main__":
+ train()
diff --git a/ferret/utils.py b/ferret/utils.py
new file mode 100644
index 0000000..a681120
--- /dev/null
+++ b/ferret/utils.py
@@ -0,0 +1,126 @@
+import datetime
+import logging
+import logging.handlers
+import os
+import sys
+
+import requests
+
+from ferret.constants import LOGDIR
+
+server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
+moderation_msg = "YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES. PLEASE TRY AGAIN."
+
+handler = None
+
+
+def build_logger(logger_name, logger_filename):
+ global handler
+
+ formatter = logging.Formatter(
+ fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+ datefmt="%Y-%m-%d %H:%M:%S",
+ )
+
+ # Set the format of root handlers
+ if not logging.getLogger().handlers:
+ logging.basicConfig(level=logging.INFO)
+ logging.getLogger().handlers[0].setFormatter(formatter)
+
+ # Redirect stdout and stderr to loggers
+ stdout_logger = logging.getLogger("stdout")
+ stdout_logger.setLevel(logging.INFO)
+ sl = StreamToLogger(stdout_logger, logging.INFO)
+ sys.stdout = sl
+
+ stderr_logger = logging.getLogger("stderr")
+ stderr_logger.setLevel(logging.ERROR)
+ sl = StreamToLogger(stderr_logger, logging.ERROR)
+ sys.stderr = sl
+
+ # Get logger
+ logger = logging.getLogger(logger_name)
+ logger.setLevel(logging.INFO)
+
+ # Add a file handler for all loggers
+ if handler is None:
+ os.makedirs(LOGDIR, exist_ok=True)
+ filename = os.path.join(LOGDIR, logger_filename)
+ handler = logging.handlers.TimedRotatingFileHandler(
+ filename, when='D', utc=True)
+ handler.setFormatter(formatter)
+
+ for name, item in logging.root.manager.loggerDict.items():
+ if isinstance(item, logging.Logger):
+ item.addHandler(handler)
+
+ return logger
+
+
+class StreamToLogger(object):
+ """
+ Fake file-like stream object that redirects writes to a logger instance.
+ """
+ def __init__(self, logger, log_level=logging.INFO):
+ self.terminal = sys.stdout
+ self.logger = logger
+ self.log_level = log_level
+ self.linebuf = ''
+
+ def __getattr__(self, attr):
+ return getattr(self.terminal, attr)
+
+ def write(self, buf):
+ temp_linebuf = self.linebuf + buf
+ self.linebuf = ''
+ for line in temp_linebuf.splitlines(True):
+ # From the io.TextIOWrapper docs:
+ # On output, if newline is None, any '\n' characters written
+ # are translated to the system default line separator.
+ # By default sys.stdout.write() expects '\n' newlines and then
+ # translates them so this is still cross platform.
+ if line[-1] == '\n':
+ self.logger.log(self.log_level, line.rstrip())
+ else:
+ self.linebuf += line
+
+ def flush(self):
+ if self.linebuf != '':
+ self.logger.log(self.log_level, self.linebuf.rstrip())
+ self.linebuf = ''
+
+
+def disable_torch_init():
+ """
+ Disable the redundant torch default initialization to accelerate model creation.
+ """
+ import torch
+ setattr(torch.nn.Linear, "reset_parameters", lambda self: None)
+ setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None)
+
+
+def violates_moderation(text):
+ """
+    Check whether the text violates the OpenAI moderation API.
+ """
+ url = "https://api.openai.com/v1/moderations"
+ headers = {"Content-Type": "application/json",
+ "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]}
+ text = text.replace("\n", "")
+ data = "{" + '"input": ' + f'"{text}"' + "}"
+ data = data.encode("utf-8")
+ try:
+ ret = requests.post(url, headers=headers, data=data, timeout=5)
+ flagged = ret.json()["results"][0]["flagged"]
+ except requests.exceptions.RequestException as e:
+ flagged = False
+ except KeyError as e:
+ flagged = False
+
+ return flagged
+
+
+def pretty_print_semaphore(semaphore):
+ if semaphore is None:
+ return "None"
+ return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})"
diff --git a/figs/ferret_demo.png b/figs/ferret_demo.png
new file mode 100644
index 0000000..89433ee
Binary files /dev/null and b/figs/ferret_demo.png differ
diff --git a/pyproject.toml b/pyproject.toml
new file mode 100644
index 0000000..aa285b7
--- /dev/null
+++ b/pyproject.toml
@@ -0,0 +1,38 @@
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "ferret"
+version = "1.0.1"
+description = "Towards GPT-4 like large language and visual assistant."
+readme = "README.md"
+requires-python = ">=3.8"
+classifiers = [
+ "Programming Language :: Python :: 3",
+ "License :: OSI Approved :: Apache Software License",
+]
+dependencies = [
+ "einops", "fastapi", "gradio==3.26", "markdown2[all]", "numpy",
+ "requests", "sentencepiece", "tokenizers>=0.12.1",
+ "torch", "torchvision", "uvicorn", "wandb",
+ "shortuuid", "httpx==0.24.0",
+ "deepspeed==0.9.5",
+ "peft==0.4.0",
+ "transformers @ git+https://github.com/huggingface/transformers.git@cae78c46",
+ "accelerate==0.21.0",
+ "bitsandbytes==0.41.0",
+ "scikit-learn==1.2.2",
+ "sentencepiece==0.1.99",
+ "einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13", "openai",
+ "gradio_client==0.1.2"
+]
+
+[project.urls]
+"Homepage" = "https://github.com/apple/ml-ferret"
+
+[tool.setuptools.packages.find]
+exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
+
+[tool.wheel]
+exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]