first code commit

This commit is contained in:
Haotian Zhang 2023-10-30 20:44:41 -07:00
parent 941c75ad7e
commit 5fc7ff83ef
68 changed files with 8194 additions and 3 deletions

View File

@ -1,6 +1,6 @@
# Contribution Guide
Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducability, and beyond its publication there are limited plans for future development of the repository.
Thanks for your interest in contributing. This project was released to accompany a research paper for purposes of reproducibility, and beyond its publication there are limited plans for future development of the repository.
While we welcome new pull requests and issues please note that our response may be limited. Forks and out-of-tree improvements are strongly encouraged.

17
EVAL.md Normal file
View File

@ -0,0 +1,17 @@
# Evaluation
All evaluation scripts provide usage details/examples in the first several lines of code.
## Ferret-Bench
Please follow [gpt4_eval_script.sh](ferret/eval/gpt4_eval_script.sh) to run inference on Ferret-Bench data and use GPT-4 to rate the outputs. Note that the `openai` package must be installed and the user's OPENAI_KEY must be provided.
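A minimal setup sketch for the GPT-4 rating step (the exact environment-variable name the script reads is an assumption; check `gpt4_eval_script.sh` for the name it expects):
```Shell
pip install openai
# Provide your OpenAI key for the GPT-4 rating calls
# (variable name assumed; see gpt4_eval_script.sh for the exact one it reads).
export OPENAI_API_KEY=<your key>
```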
## LVIS-Referring Object Classification
Run `ferret/eval/model_lvis.py` following the usage in the file and then run `ferret/eval/eval_lvis.py`.
## RefCOCO/RefCOCO+/RefCOCOg
Run `ferret/eval/model_refcoco.py` following the usage in the file and then run `ferret/eval/eval_refexp.py`.
## Flickr
Run `ferret/eval/model_flickr.py` following the usage in the file and then run `ferret/eval/eval_flickr_entities.py`.
## POPE
Run `ferret/eval/model_pope.py` following the usage in the file and then run `ferret/eval/eval_pope.py`.
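Each benchmark above follows the same two-step pattern: a `model_*.py` script generates predictions, and the matching `eval_*.py` script scores them against the annotations. For example, scoring POPE predictions looks like this (paths are illustrative, taken from the usage notes at the top of `eval_pope.py`):
```Shell
python ferret/eval/eval_pope.py \
    --prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_adversarial \
    --annotation_file data/pope/coco_pope_adversarial.json
```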

120
README.md
View File

@ -10,11 +10,12 @@ Brief description of the project.
# <img src="figs/ferret_icon.png" alt="Alt text for the image" width="40" height="45"> Ferret: Refer and Ground Anything Anywhere at Any Granularity
An end-to-end MLLM that can accept any-form referring and ground anything in response.*
*An End-to-End MLLM that Accepts Any-Form Referring and Grounds Anything in Response.* [[Paper](https://arxiv.org/abs/2310.07704)]
[Haoxuan You*](https://hxyou.github.io/), [Haotian Zhang*](https://haotian-zhang.github.io/), [Zhe Gan](https://zhegan27.github.io/), [Xianzhi Du](https://scholar.google.com/citations?user=l1hP40AAAAAJ&hl=en), [Bowen Zhang](https://zbwglory.github.io/), [Zirui Wang](https://www.cs.cmu.edu/~ziruiw/), [Liangliang Cao](http://llcao.net/), [Shih-Fu Chang](https://www.ee.columbia.edu/~sfchang/), [Yinfei Yang](https://sites.google.com/site/yinfeiyang/)
[*: equal contribution]
## Overview
<p align="center">
@ -25,4 +26,119 @@ An end-to-end MLLM that can accept any-form referring and ground anything in res
Key Contributions:
* Ferret Model - **Hybrid Region Representation + Spatial-aware Visual Sampler** enable fine-grained and open-vocabulary referring and grounding in MLLMs.
* GRIT Dataset (~1.1M) - A **Large-scale, Hierarchical, Robust** ground-and-refer instruction tuning dataset.
* Ferret-Bench - A multimodal evaluation benchmark that jointly requires **Referring/Grounding, Semantics, Knowledge, and Reasoning**.
* Ferret-Bench - A multimodal evaluation benchmark that jointly requires **Referring/Grounding, Semantics, Knowledge, and Reasoning**.
## Release
- [10/30] 🔥 We released the code of the **FERRET** model.
**Usage and License Notices**: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
## Contents
- [Install](#install)
- [Train](#train)
- [Evaluation](#evaluation)
- [Demo](#demo)
## Install
1. Clone this repository and navigate to the FERRET folder
```bash
git clone https://github.com/apple/ml-ferret
cd ml-ferret
```
2. Install the package
```Shell
conda create -n ferret python=3.10 -y
conda activate ferret
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install pycocotools
pip install protobuf==3.20.0
```
3. Install additional packages for training
```
pip install ninja
pip install flash-attn --no-build-isolation
```
## Train
FERRET is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.
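For reference, the released scripts use `per_device_train_batch_size` 16 with `gradient_accumulation_steps` 1 on 8 GPUs, i.e. a global batch size of 16 x 1 x 8 = 128 as listed in the table below; to train on 4 GPUs, for example, you could keep the global batch size at 128 by setting `gradient_accumulation_steps` to 2.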
### Hyperparameters
We use a set of hyperparameters similar to LLaVA (Vicuna) during fine-tuning.
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| FERRET-7B | 128 | 2e-5 | 3 | 2048 | 0 |
| FERRET-13B | 128 | 2e-5 | 3 | 2048 | 0 |
### Prepare Vicuna checkpoint and LLaVA's projector
Before you start, prepare our base model Vicuna, which is an instruction-tuned chatbot. Please download its weights following the instructions [here](https://github.com/lm-sys/FastChat#model-weights). Vicuna v1.3 is used in FERRET.
Then download LLaVA's first-stage pre-trained projector weight ([7B](https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-7b-v1.3), [13B](https://huggingface.co/liuhaotian/llava-336px-pretrain-vicuna-13b-v1.3)).
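The released training scripts (next section) load both of these from a local `./model/` directory. A sketch of the layout they expect, assuming the default `MODEL_VERSION` names used in the scripts (substitute the 7B names accordingly):
```
model/
├── vicuna-13b-v1-3/                                  # Vicuna v1.3 weights (base LLM)
└── llava-336px-pretrain-vicuna-13b-v1-3/
    └── mm_projector.bin                              # LLaVA stage-1 projector weight
```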
### FERRET Training
Training scripts for both model sizes are provided ([7B](experiments/ferret_7b_train.sh), [13B](experiments/ferret_13b_train.sh)).
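Assuming the datasets referenced inside the scripts and the weights above are in place, fine-tuning is a single `torchrun` launch on 8 GPUs, e.g.:
```Shell
bash experiments/ferret_13b_train.sh   # or experiments/ferret_7b_train.sh for the 7B model
```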
## Evaluation
Please see this [doc](EVAL.md) for the details.
## Demo
To run our demo, you need to train FERRET and use the checkpoints locally. A Gradio web UI is used. Please run the following commands one by one.
#### Launch a controller
```Shell
python -m ferret.serve.controller --host 0.0.0.0 --port 10000
```
#### Launch a Gradio web server
```Shell
python -m ferret.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload --add_region_feature
```
#### Launch a model worker
This is the worker that loads the checkpoint and performs inference on the GPU. Each worker is responsible for a single model specified in `--model-path`.
```Shell
CUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path ./checkpoints/FERRET-13B-v0 --add_region_feature
```
Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
<p align="center">
<img src="figs/ferret_demo.png" width="105%"></a> <br>
Example of Ferret Interactive Demo.
</p>
## Citation
If you find Ferret useful, please cite using this BibTeX:
```bibtex
@article{you2023ferret,
title={Ferret: Refer and Ground Anything Anywhere at Any Granularity},
author={You, Haoxuan and Zhang, Haotian and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zirui and Cao, Liangliang and Chang, Shih-Fu and Yang, Yinfei},
journal={arXiv preprint arXiv:2310.07704},
year={2023}
}
```
## Acknowledgement
- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon.
- [Vicuna](https://github.com/lm-sys/FastChat): the LLM codebase.

100
experiments/ferret_13b_train.sh Normal file
View File

@ -0,0 +1,100 @@
#!/usr/bin/env bash
set -xe
mkdir -p checkpoints
echo "Start Fine-Tuning"
# =================== Training ======================
data_path=(
'dataset/git_instruction.json'
'dataset/vg_objects.json'
'dataset/vg_relations.json'
'dataset/vg_regions.json'
'dataset/grounded_llava_boxes_detail.json'
'dataset/grounded_llava_boxes_complex_reasoning.json'
'dataset/grounded_llava_boxes_conversation.json'
'dataset/refexp_all.json'
'dataset/flickr.json'
'dataset/objects365.json'
)
image_folder=(
'dataset/coco2014/train2014'
'dataset/vg/images'
'dataset/vg/images'
'dataset/vg/images'
'dataset/coco2014/train2014'
'dataset/coco2014/train2014'
'dataset/coco2014/train2014'
'data/refcoco/train2014'
'data/flickr30k/flickr30k_images_split/train'
'data/objects365_v1/train'
)
data_multiple=(
3
1
0.2
0.2
1
1
1
1
1
1
)
# convert array to string
data_path="${data_path[@]}"
image_folder="${image_folder[@]}"
data_multiple="${data_multiple[@]}"
################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-13b-v1-3"
################## VICUNA ##################
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
ferret/train/train_mem.py \
--lora_enable False \
--model_name_or_path ./model/$MODEL_VERSION \
--version $PROMPT_VERSION \
--data_path $data_path \
--image_folder $image_folder \
--data_multiple $data_multiple \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter ./model/llava-336px-pretrain-$MODEL_VERSION/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/ferret_13b \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1500 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to tensorboard \
--point_input_sample 'segment_mask|center' \
--add_region_feature True \
--region_geo_sampler True \
--sampler_pooler_mode 'max' \
--refer_previous_point False \
--resized_image_h 336 \
--resized_image_w 336 \
--save_vision_tower True

99
experiments/ferret_7b_train.sh Normal file
View File

@ -0,0 +1,99 @@
#!/usr/bin/env bash
set -xe
mkdir -p checkpoints
# =================== Training ======================
data_path=(
'dataset/git_instruction.json'
'dataset/vg_objects.json'
'dataset/vg_relations.json'
'dataset/vg_regions.json'
'dataset/grounded_llava_boxes_detail.json'
'dataset/grounded_llava_boxes_complex_reasoning.json'
'dataset/grounded_llava_boxes_conversation.json'
'dataset/refexp_all.json'
'dataset/flickr.json'
'dataset/objects365.json'
)
image_folder=(
'dataset/coco2014/train2014'
'dataset/vg/images'
'dataset/vg/images'
'dataset/vg/images'
'dataset/coco2014/train2014'
'dataset/coco2014/train2014'
'dataset/coco2014/train2014'
'data/refcoco/train2014'
'data/flickr30k/flickr30k_images_split/train'
'data/objects365_v1/train'
)
data_multiple=(
3
1
0.2
0.2
1
1
1
1
1
1
)
# convert array to string
data_path="${data_path[@]}"
image_folder="${image_folder[@]}"
data_multiple="${data_multiple[@]}"
################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-7b-v1-3"
################## VICUNA ##################
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
ferret/train/train_mem.py \
--lora_enable False \
--model_name_or_path ./model/$MODEL_VERSION \
--version $PROMPT_VERSION \
--data_path $data_path \
--image_folder $image_folder \
--data_multiple $data_multiple \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter ./model/llava-336px-pretrain-$MODEL_VERSION/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/ferret_7b \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1500 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to tensorboard \
--point_input_sample 'segment_mask|center' \
--add_region_feature True \
--region_geo_sampler True \
--sampler_pooler_mode 'max' \
--refer_previous_point False \
--resized_image_h 336 \
--resized_image_w 336 \
--save_vision_tower True

1
ferret/__init__.py Normal file
View File

@ -0,0 +1 @@
from .model import FERRETLlamaForCausalLM

12
ferret/constants.py Normal file
View File

@ -0,0 +1,12 @@
CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15
LOGDIR = "."
# Model Constants
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"

275
ferret/conversation.py Normal file
View File

@ -0,0 +1,275 @@
import dataclasses
from enum import auto, Enum
from typing import List, Tuple
VOCAB_IMAGE_W = 1000 # 224
VOCAB_IMAGE_H = 1000 # 224
class SeparatorStyle(Enum):
"""Different separator style."""
SINGLE = auto()
TWO = auto()
MPT = auto()
PLAIN = auto()
LLAMA_2 = auto()
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.SINGLE
sep: str = "###"
sep2: str = None
version: str = "Unknown"
skip_next: bool = False
first_round: bool = True
def get_prompt(self):
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0].replace("<image>", "").strip()
if 'mmtag' in self.version:
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
else:
messages[0] = (init_role, "<image>\n" + init_msg)
if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + self.sep
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.MPT:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.LLAMA_2:
wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n"
wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
ret = ""
for i, (role, message) in enumerate(messages):
if i == 0:
assert message, "first message should not be none"
assert role == self.roles[0], "first message should come from user"
if message:
if type(message) is tuple:
message, _, _ = message
if i == 0: message = wrap_sys(self.system) + message
if i % 2 == 0:
message = wrap_inst(message)
ret += self.sep + message
else:
ret += " " + message + " " + self.sep2
else:
ret += ""
ret = ret.lstrip(self.sep)
elif self.sep_style == SeparatorStyle.PLAIN:
seps = [self.sep, self.sep2]
ret = self.system
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += message + seps[i % 2]
else:
ret += ""
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def get_images(self, return_pil=False):
images = []
for i, (role, msg) in enumerate(self.messages[self.offset:]):
if i % 2 == 0:
if type(msg) is tuple:
import base64
from io import BytesIO
from PIL import Image
msg, image, image_process_mode = msg
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode == "Crop":
pass
elif image_process_mode == "Raw+Processor":
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
if image_process_mode != "Raw+Processor":
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 800, 400
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
print('Input Image Size:{}'.format(image.size))
if return_pil:
images.append(image)
else:
buffered = BytesIO()
image.save(buffered, format="PNG")
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
images.append(img_b64_str)
return images
def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset:]):
if i % 2 == 0:
if type(msg) is tuple:
import base64
from io import BytesIO
msg, image, image_process_mode = msg
if image_process_mode != "Raw+Processor":
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 800, 400
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
buffered = BytesIO()
image.save(buffered, format="JPEG")
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
img_str = f'<img src="data:image/png;base64,{img_b64_str}" alt="user upload image" />'
ret.append([img_str, None])
msg = msg.replace('<image>', '').strip()
if len(msg) > 0:
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self):
return Conversation(
system=self.system,
roles=self.roles,
messages=[[x, y] for x, y in self.messages],
offset=self.offset,
sep_style=self.sep_style,
sep=self.sep,
sep2=self.sep2,
version=self.version)
def dict(self):
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
ferret_conv_vicuna_v1_original_system = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. "
"Assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language. "
"In images, points are represented by coordinates [x, y]. The top-left corner is [0, 0]. The bottom-right corner is [width-1, height-1]. "
"Increasing x moves right across the image while increasing y moves down. "
"A bounding box is marked by [x1, y1, x2, y2] with the top-left and bottom-right points being [x1, y1] and [x2, y2] respectively. "
f"The image size is assumed to be ({VOCAB_IMAGE_W}, {VOCAB_IMAGE_H}), i.e., width={VOCAB_IMAGE_W}, height={VOCAB_IMAGE_H}. "
"Follow the instructions carefully. ",
roles=("USER", "ASSISTANT"),
version="v1",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
ferret_conv_vicuna_v1 = Conversation(
system="A chat between a human and an AI that understands visuals. "
"In images, [x, y] denotes points: top-left [0, 0], bottom-right [width-1, height-1]. "
"Increasing x moves right; y moves down. "
f"Bounding box: [x1, y1, x2, y2]. Image size: {VOCAB_IMAGE_W}x{VOCAB_IMAGE_H}. "
"Follow instructions. ",
roles=("USER", "ASSISTANT"),
version="v1",
messages=(),
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
default_conversation = ferret_conv_vicuna_v1
conv_templates = {
"v1": ferret_conv_vicuna_v1,
"ferret_v1": ferret_conv_vicuna_v1,
}
if __name__ == "__main__":
print(default_conversation.get_prompt())

611
ferret/eval/eval_flickr_entities.py Normal file
View File

@ -0,0 +1,611 @@
"""
Usage:
python ferret/eval/eval_flickr_entities.py \
--prediction_file result_checkpoint-final/flickr_result/final_flickr_mergedGT_test \
--annotation_file data/annotations/final_flickr_mergedGT_test.json \
--flickr_entities_path data/flickr30k
"""
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
import numpy as np
from prettytable import PrettyTable
from tqdm import tqdm
import json
import os
import re
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
def resize_bbox(box, image_w=None, image_h=None):
ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
int(box[2] * ratio_w), int(box[3] * ratio_h)]
return new_box
def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
entities = []
boxes = []
start = 0
in_brackets = False
entity = ""
box = ""
for i, char in enumerate(text):
if char == '[':
in_brackets = True
entity = text[start:i].strip()
start = i + 1
elif char == ']':
in_brackets = False
box = text[start:i].strip()
start = i + 1
# Convert box string to list of integers
box_list = list(map(int, box.split(',')))
resized_box_list = resize_bbox(box_list, img_w, img_h)
entities.append(entity)
boxes.append(resized_box_list)
# Skip until the next entity (ignoring periods or other delimiters)
while start < len(text) and text[start] not in ['.', ',', ';', '!', '?']:
start += 1
start += 1 # Skip the delimiter
return entities, boxes
def are_phrases_similar(phrase1, phrase2):
# Step 1: Convert to lower case
phrase1 = phrase1.lower()
phrase2 = phrase2.lower()
# Step 2: Standardize spacing around punctuation
phrase1 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase1).strip()
phrase2 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase2).strip()
# Step 3: Remove all punctuation
phrase1 = re.sub(r'[^\w\s]', '', phrase1)
phrase2 = re.sub(r'[^\w\s]', '', phrase2)
# Step 4: Remove extra white spaces
phrase1 = ' '.join(phrase1.split())
phrase2 = ' '.join(phrase2.split())
return phrase1 == phrase2
def get_sentence_data(filename) -> List[Dict[str, Any]]:
"""
Parses a sentence file from the Flickr30K Entities dataset
input:
filename - full file path to the sentence file to parse
output:
a list of dictionaries for each sentence with the following fields:
sentence - the original sentence
phrases - a list of dictionaries for each phrase with the
following fields:
phrase - the text of the annotated phrase
first_word_index - the position of the first word of
the phrase in the sentence
phrase_id - an identifier for this phrase
phrase_type - a list of the coarse categories this
phrase belongs to
"""
with open(filename, "r") as f:
sentences = f.read().split("\n")
annotations = []
for sentence in sentences:
if not sentence:
continue
first_word = []
phrases = []
phrase_id = []
phrase_type = []
words = []
current_phrase = []
add_to_phrase = False
for token in sentence.split():
if add_to_phrase:
if token[-1] == "]":
add_to_phrase = False
token = token[:-1]
current_phrase.append(token)
phrases.append(" ".join(current_phrase))
current_phrase = []
else:
current_phrase.append(token)
words.append(token)
else:
if token[0] == "[":
add_to_phrase = True
first_word.append(len(words))
parts = token.split("/")
phrase_id.append(parts[1][3:])
phrase_type.append(parts[2:])
else:
words.append(token)
sentence_data = {"sentence": " ".join(words), "phrases": []}
for index, phrase, p_id, p_type in zip(first_word, phrases, phrase_id, phrase_type):
sentence_data["phrases"].append(
{"first_word_index": index, "phrase": phrase, "phrase_id": p_id, "phrase_type": p_type}
)
annotations.append(sentence_data)
return annotations
def get_annotations(filename) -> Dict[str, Union[int, List[str], Dict[str, List[List[int]]]]]:
"""
Parses the xml files in the Flickr30K Entities dataset
input:
filename - full file path to the annotations file to parse
output:
dictionary with the following fields:
scene - list of identifiers which were annotated as
pertaining to the whole scene
nobox - list of identifiers which were annotated as
not being visible in the image
boxes - a dictionary where the fields are identifiers
and the values are its list of boxes in the
[xmin ymin xmax ymax] format
height - int representing the height of the image
width - int representing the width of the image
depth - int representing the depth of the image
"""
tree = ET.parse(filename)
root = tree.getroot()
size_container = root.findall("size")[0]
anno_info: Dict[str, Union[int, List[str], Dict[str, List[List[int]]]]] = {}
all_boxes: Dict[str, List[List[int]]] = {}
all_noboxes: List[str] = []
all_scenes: List[str] = []
for size_element in size_container:
assert size_element.text
anno_info[size_element.tag] = int(size_element.text)
for object_container in root.findall("object"):
for names in object_container.findall("name"):
box_id = names.text
assert box_id
box_container = object_container.findall("bndbox")
if len(box_container) > 0:
if box_id not in all_boxes:
all_boxes[box_id] = []
xmin = int(box_container[0].findall("xmin")[0].text)
ymin = int(box_container[0].findall("ymin")[0].text)
xmax = int(box_container[0].findall("xmax")[0].text)
ymax = int(box_container[0].findall("ymax")[0].text)
all_boxes[box_id].append([xmin, ymin, xmax, ymax])
else:
nobndbox = int(object_container.findall("nobndbox")[0].text)
if nobndbox > 0:
all_noboxes.append(box_id)
scene = int(object_container.findall("scene")[0].text)
if scene > 0:
all_scenes.append(box_id)
anno_info["boxes"] = all_boxes
anno_info["nobox"] = all_noboxes
anno_info["scene"] = all_scenes
return anno_info
#### END of import from flickr30k_entities
#### Bounding box utilities imported from torchvision and converted to numpy
def box_area(boxes: np.array) -> np.array:
"""
Computes the area of a set of bounding boxes, which are specified by its
(x1, y1, x2, y2) coordinates.
Args:
boxes (Tensor[N, 4]): boxes for which the area will be computed. They
are expected to be in (x1, y1, x2, y2) format with
``0 <= x1 < x2`` and ``0 <= y1 < y2``.
Returns:
area (Tensor[N]): area for each box
"""
assert boxes.ndim == 2 and boxes.shape[-1] == 4
return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
# implementation from https://github.com/kuangliu/torchcv/blob/master/torchcv/utils/box.py
# with slight modifications
def _box_inter_union(boxes1: np.array, boxes2: np.array) -> Tuple[np.array, np.array]:
area1 = box_area(boxes1)
area2 = box_area(boxes2)
lt = np.maximum(boxes1[:, None, :2], boxes2[:, :2]) # [N,M,2]
rb = np.minimum(boxes1[:, None, 2:], boxes2[:, 2:]) # [N,M,2]
wh = (rb - lt).clip(min=0) # [N,M,2]
inter = wh[:, :, 0] * wh[:, :, 1] # [N,M]
union = area1[:, None] + area2 - inter
return inter, union
def box_iou(boxes1: np.array, boxes2: np.array) -> np.array:
"""
Return intersection-over-union (Jaccard index) of boxes.
Both sets of boxes are expected to be in ``(x1, y1, x2, y2)`` format with
``0 <= x1 < x2`` and ``0 <= y1 < y2``.
Args:
boxes1 (Tensor[N, 4])
boxes2 (Tensor[M, 4])
Returns:
iou (Tensor[N, M]): the NxM matrix containing the pairwise IoU values for every element in boxes1 and boxes2
"""
inter, union = _box_inter_union(boxes1, boxes2)
iou = inter / union
return iou
#### End of import of box utilities
def _merge_boxes(boxes: List[List[int]]) -> List[List[int]]:
"""
Return the boxes corresponding to the smallest enclosing box containing all the provided boxes
The boxes are expected in [x1, y1, x2, y2] format
"""
if len(boxes) == 1:
return boxes
np_boxes = np.asarray(boxes)
return [[np_boxes[:, 0].min(), np_boxes[:, 1].min(), np_boxes[:, 2].max(), np_boxes[:, 3].max()]]
class RecallTracker:
""" Utility class to track recall@k for various k, split by categories"""
def __init__(self, topk: Sequence[int]):
"""
Parameters:
- topk : tuple of ints corresponding to the recalls being tracked (eg, recall@1, recall@10, ...)
"""
self.total_byk_bycat: Dict[int, Dict[str, int]] = {k: defaultdict(int) for k in topk}
self.positives_byk_bycat: Dict[int, Dict[str, int]] = {k: defaultdict(int) for k in topk}
def add_positive(self, k: int, category: str):
"""Log a positive hit @k for given category"""
if k not in self.total_byk_bycat:
raise RuntimeError(f"{k} is not a valid recall threshold")
self.total_byk_bycat[k][category] += 1
self.positives_byk_bycat[k][category] += 1
def add_negative(self, k: int, category: str):
"""Log a negative hit @k for given category"""
if k not in self.total_byk_bycat:
raise RuntimeError(f"{k} is not a valid recall threshold")
self.total_byk_bycat[k][category] += 1
def report(self) -> Dict[int, Dict[str, float]]:
"""Return a condensed report of the results as a dict of dict.
report[k][cat] is the recall@k for the given category
"""
report: Dict[int, Dict[str, float]] = {}
for k in self.total_byk_bycat:
assert k in self.positives_byk_bycat
report[k] = {
cat: self.positives_byk_bycat[k][cat] / self.total_byk_bycat[k][cat] for cat in self.total_byk_bycat[k]
}
return report
class Flickr30kEntitiesRecallEvaluator:
def __init__(
self,
flickr_path: str,
subset: str = "test",
topk: Sequence[int] = (1, 5, 10, -1),
iou_thresh: float = 0.5,
merge_boxes: bool = False,
verbose: bool = True,
):
assert subset in ["train", "test", "val"], f"Wrong flickr subset {subset}"
self.topk = topk
self.iou_thresh = iou_thresh
flickr_path = Path(flickr_path)
# We load the image ids corresponding to the current subset
with open(flickr_path / f"{subset}.txt") as file_d:
self.img_ids = [line.strip() for line in file_d]
if verbose:
print(f"Flickr subset contains {len(self.img_ids)} images")
# Read the box annotations for all the images
self.imgid2boxes: Dict[str, Dict[str, List[List[int]]]] = {}
if verbose:
print("Loading annotations...")
for img_id in self.img_ids:
anno_info = get_annotations(flickr_path / "Annotations" / f"{img_id}.xml")["boxes"]
if merge_boxes:
merged = {}
for phrase_id, boxes in anno_info.items():
merged[phrase_id] = _merge_boxes(boxes)
anno_info = merged
self.imgid2boxes[img_id] = anno_info
# Read the sentences annotations
self.imgid2sentences: Dict[str, List[List[Optional[Dict]]]] = {}
if verbose:
print("Loading annotations...")
self.all_ids: List[str] = []
tot_phrases = 0
for img_id in self.img_ids:
sentence_info = get_sentence_data(flickr_path / "Sentences" / f"{img_id}.txt")
self.imgid2sentences[img_id] = [None for _ in range(len(sentence_info))]
# Some phrases don't have boxes, we filter them.
for sent_id, sentence in enumerate(sentence_info):
phrases = [phrase for phrase in sentence["phrases"] if phrase["phrase_id"] in self.imgid2boxes[img_id]]
if len(phrases) > 0:
self.imgid2sentences[img_id][sent_id] = phrases
tot_phrases += len(phrases)
self.all_ids += [
f"{img_id}_{k}" for k in range(len(sentence_info)) if self.imgid2sentences[img_id][k] is not None
]
if verbose:
print(f"There are {tot_phrases} phrases in {len(self.all_ids)} sentences to evaluate")
def evaluate(self, predictions: List[Dict]):
evaluated_ids = set()
recall_tracker = RecallTracker(self.topk)
for pred in predictions:
cur_id = f"{pred['image_id']}_{pred['sentence_id']}"
if cur_id in evaluated_ids:
print(
"Warning, multiple predictions found for sentence"
f"{pred['sentence_id']} in image {pred['image_id']}"
)
continue
# Skip the sentences with no valid phrase
if cur_id not in self.all_ids:
if len(pred["boxes"]) != 0:
print(
f"Warning, in image {pred['image_id']} we were not expecting predictions "
f"for sentence {pred['sentence_id']}. Ignoring them."
)
continue
evaluated_ids.add(cur_id)
pred_boxes = pred["boxes"]
if str(pred["image_id"]) not in self.imgid2sentences:
raise RuntimeError(f"Unknown image id {pred['image_id']}")
if not 0 <= int(pred["sentence_id"]) < len(self.imgid2sentences[str(pred["image_id"])]):
raise RuntimeError(f"Unknown sentence id {pred['sentence_id']}" f" in image {pred['image_id']}")
target_sentence = self.imgid2sentences[str(pred["image_id"])][int(pred["sentence_id"])]
phrases = self.imgid2sentences[str(pred["image_id"])][int(pred["sentence_id"])]
if len(pred_boxes) != len(phrases):
raise RuntimeError(
f"Error, got {len(pred_boxes)} predictions, expected {len(phrases)} "
f"for sentence {pred['sentence_id']} in image {pred['image_id']}"
)
for cur_boxes, phrase in zip(pred_boxes, phrases):
target_boxes = self.imgid2boxes[str(pred["image_id"])][phrase["phrase_id"]]
ious = box_iou(np.asarray(cur_boxes), np.asarray(target_boxes))
for k in self.topk:
maxi = 0
if k == -1:
maxi = ious.max()
else:
assert k > 0
maxi = ious[:k].max()
if maxi >= self.iou_thresh:
recall_tracker.add_positive(k, "all")
for phrase_type in phrase["phrase_type"]:
recall_tracker.add_positive(k, phrase_type)
else:
recall_tracker.add_negative(k, "all")
for phrase_type in phrase["phrase_type"]:
recall_tracker.add_negative(k, phrase_type)
if len(evaluated_ids) != len(self.all_ids):
print("ERROR, the number of evaluated sentence doesn't match. Missing predictions:")
un_processed = set(self.all_ids) - evaluated_ids
for missing in un_processed:
img_id, sent_id = missing.split("_")
print(f"\t sentence {sent_id} in image {img_id}")
raise RuntimeError("Missing predictions")
return recall_tracker.report()
class Flickr30kEntitiesRecallEvaluatorFromJsonl(Flickr30kEntitiesRecallEvaluator):
def evaluate(self,
annotation_file: str,
prediction_file: str,
verbose: bool = False,
):
recall_tracker = RecallTracker(self.topk)
gt_json = json.load(open(annotation_file, 'r', encoding='utf-8'))
# get the predictions
if os.path.isfile(prediction_file):
predictions = [json.loads(line) for line in open(prediction_file)]
elif os.path.isdir(prediction_file):
predictions = [json.loads(line) for pred_file in sorted(os.listdir(prediction_file)) for line in open(os.path.join(prediction_file, pred_file))]
else:
raise NotImplementedError('Not supported file format.')
predict_index = 0
valid_cnt = 0
for item in tqdm(gt_json['images']):
file_name = item["file_name"]
caption = item["caption"]
img_height = float(item['height'])
img_width = float(item['width'])
postive_item_pos = item['tokens_positive_eval']
# to verify
phrases_from_self = self.imgid2sentences[str(item['original_img_id'])][int(item['sentence_id'])]
for pos in postive_item_pos:
# pdb.set_trace()
if predict_index == len(predictions):
break
pos_start, pos_end = pos[0]
phrase = caption[pos_start:pos_end]
phrase_from_self = [p for p in phrases_from_self if p['phrase'] == phrase]
if len(phrase_from_self) == 0:
raise ValueError(f"Can't find the corresponding gt from two file {phrase} vs. {phrases_from_self}")
else:
phrase_from_self = phrase_from_self[0]
# get the prediction from text line
try:
prediction = predictions[predict_index]["text"]
except IndexError as e:
print("Raise Indexerror.")
print(f"prediction index / length: {predict_index} / {len(predictions)}")
import sys
sys.exit(0)
try:
entities, boxes = decode_bbox_from_caption(prediction, img_width, img_height, verbose=verbose)
assert len(entities) == len(boxes)
except ValueError as e:
entities, boxes = [], []
predict_boxes = []
for (entity, box) in zip(entities, boxes):
if not are_phrases_similar(entity, phrase): # get the matched noun phrase
# print(f"{entity} | {phrase}")
continue
else:
predict_boxes.append(box)
if len(predict_boxes) == 0:
print(f"Can't find valid bbox for the given phrase ({phrase}) in caption ({caption}), \n{prediction}")
print(f"We set a 0-area box to calculate recall result")
predict_boxes = [[0., 0., 0., 0.]]
# exit(0)
# evaluate
target_boxes = self.imgid2boxes[str(item['original_img_id'])][phrase_from_self["phrase_id"]]
ious = box_iou(np.asarray(predict_boxes), np.asarray(target_boxes))
for k in self.topk:
maxi = 0
if k == -1:
maxi = ious.max()
else:
assert k > 0
maxi = ious[:k].max()
if maxi >= self.iou_thresh:
recall_tracker.add_positive(k, "all")
for phrase_type in phrase_from_self["phrase_type"]:
recall_tracker.add_positive(k, phrase_type)
else:
recall_tracker.add_negative(k, "all")
for phrase_type in phrase_from_self["phrase_type"]:
recall_tracker.add_negative(k, phrase_type)
# pdb.set_trace()
valid_cnt += 1
predict_index += 1
print(f"Valid prediction {valid_cnt}/{len(predictions)}")
self.results = recall_tracker.report()
return self.results
def summarize(self):
table = PrettyTable()
all_cat = sorted(list(self.results.values())[0].keys())
table.field_names = ["Recall@k"] + all_cat
score = {}
for k, v in self.results.items():
cur_results = [v[cat] for cat in all_cat]
header = "Upper_bound" if k == -1 else f"Recall@{k}"
for cat in all_cat:
score[f"{header}_{cat}"] = v[cat]
table.add_row([header] + cur_results)
print(table)
return score
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--prediction_file', help='prediction_file')
parser.add_argument('--annotation_file', default='/path/to/final_flickr_mergedGT_test.json', help='annotation_file')
parser.add_argument('--flickr_entities_path', default='/path/to/flickr30k_entities', help='flickr entities')
args = parser.parse_args()
if os.path.isfile(args.prediction_file):
predictions = [json.loads(line) for line in open(args.prediction_file)]
elif os.path.isdir(args.prediction_file):
predictions = []
if '_test.json' in args.annotation_file:
subset = "test"
else:
subset = "val"
evaluator = Flickr30kEntitiesRecallEvaluatorFromJsonl(
flickr_path = args.flickr_entities_path,
subset = subset,
topk = (1, 5, 10, -1),
iou_thresh = 0.5,
merge_boxes = True,
verbose = True,
)
evaluator.evaluate(args.annotation_file, args.prediction_file, verbose=False)
score = evaluator.summarize()
with open(os.path.join(args.prediction_file, "metric.json"), "w") as f:
json.dump(score, f, indent=2)

View File

@ -0,0 +1,160 @@
import argparse
import json
import os
import openai
import time
import re
import pdb
from tqdm import tqdm
NUM_SECONDS_TO_SLEEP = 0.5
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
def get_eval(content: str, max_tokens: int):
while True:
try:
response = openai.ChatCompletion.create(
model='gpt-4-0314',
messages=[{
'role': 'system',
'content': 'You are a helpful and precise assistant for checking the quality of the answer.'
}, {
'role': 'user',
'content': content,
}],
temperature=0.2, # TODO: figure out which temperature is best for evaluation
max_tokens=max_tokens,
)
break
except openai.error.RateLimitError:
pass
except Exception as e:
print(e)
time.sleep(NUM_SECONDS_TO_SLEEP)
return response['choices'][0]['message']['content']
def postprocess_answer(answer, category):
if category == 'refer_desc' or category == 'refer_reason':
pattern = r'\[.*?\]'
matches = re.findall(pattern, answer)
for match in matches:
answer = answer.replace(' '+match, '')
elif category == 'ground_conv':
pattern = r'\[.*?\]'
matches = re.findall(pattern, answer)
for match in matches:
coor_cur = match.replace('[', '')
coor_cur = coor_cur.replace(']', '')
coor_cur = coor_cur.split(',')
coor_cur = [float(i.strip()) for i in coor_cur]
try:
assert len(coor_cur) == 4
except:
print('Found an exception when parsing coordinates')
answer = answer.replace(match, '')
converted_box_coor = [coor_cur[0]/VOCAB_IMAGE_W, coor_cur[1]/VOCAB_IMAGE_H, coor_cur[2]/VOCAB_IMAGE_W, coor_cur[3]/VOCAB_IMAGE_H]
answer = answer.replace(match, f'[{converted_box_coor[0]:.3f}, {converted_box_coor[1]:.3f}, {converted_box_coor[2]:.3f}, {converted_box_coor[3]:.3f}]')
return answer
def parse_score(review):
try:
score_pair = review.split('\n')[0]
score_pair = score_pair.replace(',', ' ')
sp = score_pair.split(' ')
if len(sp) == 2:
return [float(sp[0]), float(sp[1])]
else:
print('error', review)
return [-1, -1]
except Exception as e:
print(e)
print('error', review)
return [-1, -1]
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
parser.add_argument('-q', '--question')
parser.add_argument('-c', '--context')
parser.add_argument('-a', '--answer-list', nargs='+', default=[])
parser.add_argument('-r', '--rule')
parser.add_argument('-o', '--output')
parser.add_argument('--max-tokens', type=int, default=1024, help='maximum number of tokens produced in the output')
args = parser.parse_args()
f_q = open(os.path.expanduser(args.question))
f_ans1 = open(os.path.expanduser(args.answer_list[0]))
f_ans2 = open(os.path.expanduser(args.answer_list[1]))
rule_dict = json.load(open(os.path.expanduser(args.rule), 'r'))
if os.path.isfile(os.path.expanduser(args.output)):
cur_reviews = [json.loads(line) for line in open(os.path.expanduser(args.output))]
else:
cur_reviews = []
review_file = open(f'{args.output}', 'a')
context_list = [json.loads(line) for line in open(os.path.expanduser(args.context))]
image_to_context = {context['image']: context for context in context_list}
handles = []
idx = 0
for ques_js, ans1_js, ans2_js in tqdm(zip(f_q, f_ans1, f_ans2)):
ques = json.loads(ques_js)
ans1 = json.loads(ans1_js)
ans2 = json.loads(ans2_js)
inst = image_to_context[ques['image']]
# cap_str = '\n'.join(inst['captions'])
# box_str = '\n'.join([f'{instance["category"]}: {instance["bbox"]}' for instance in inst['instances']])
category = json.loads(ques_js)['category']
if category in rule_dict:
rule = rule_dict[category]
else:
assert False, f"Visual QA category not found in rule file: {category}."
# Assume ans2 is the predicted one.
processed_answer = postprocess_answer(ans2['text'], category)
# pdb.set_trace()
ans2['text'] = processed_answer
# if category == 'refer_desc':
prompt = rule['prompt']
role = rule['role']
content = (f'[Context]\n{inst["text"]}\n\n'
f'[Question]\n{ques["text"]}\n\n'
f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
f'[System]\n{prompt}\n\n')
# content = (f'[Context]\n{cap_str}\n\n{box_str}\n\n'
# f'[Question]\n{ques["text"]}\n\n'
# f'[{role} 1]\n{ans1["text"]}\n\n[End of {role} 1]\n\n'
# f'[{role} 2]\n{ans2["text"]}\n\n[End of {role} 2]\n\n'
# f'[System]\n{prompt}\n\n')
cur_js = {
'id': idx+1,
'question_id': ques['question_id'],
'answer1_id': ans1.get('answer_id', ans1['question_id']),
'answer2_id': ans2.get('answer_id', ans2['question_id']),
'category': category
}
if idx >= len(cur_reviews):
review = get_eval(content, args.max_tokens)
scores = parse_score(review)
cur_js['content'] = review
cur_js['tuple'] = scores
cur_js['answer1'] = ans1["text"]
cur_js['answer2'] = ans2["text"]
review_file.write(json.dumps(cur_js) + '\n')
review_file.flush()
else:
print(f'Skipping {idx} as we already have it.')
idx += 1
print(idx)
review_file.close()

69
ferret/eval/eval_lvis.py Normal file
View File

@ -0,0 +1,69 @@
"""
Usage:
- Eval Prediction:
python ferret/eval/eval_lvis.py --pred_file=[your generated result by running ferret/eval/model_lvis.py]
"""
import argparse
import json
import os
import re
import random
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import textwrap
from tqdm import tqdm
import pdb
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument('--pred_file', type=str, default='/Users/youhaoxuan/research_misc/lvis_result/llava_answer_debug.jsonl')
return parser.parse_args()
def remove_not_phrases_v2(text):
# Pattern covers the start of a phrase up to and including 'not' and any following characters until a comma or period
pattern = r"\s+not[^,.]*[,.]"
text = re.sub(pattern, "", text)
pattern = r"\s+no[^,.]*[,.]"
text = re.sub(pattern, "", text)
return text
if __name__ == "__main__":
args = get_args()
# Fix the random seed
random.seed(42)
if os.path.isfile(args.pred_file):
predictions = [json.loads(line) for line in open(args.pred_file)]
elif os.path.isdir(args.pred_file):
predictions = [json.loads(line) for pred_file in os.listdir(args.pred_file) for line in open(os.path.join(args.pred_file, pred_file))]
else:
raise NotImplementedError('Not supported file format.')
total_correct = 0
for i in tqdm(predictions):
# Process name and synonyms
i['name'] = i['name'].replace('_', ' ').strip()
new_synonyms = []
for jj in i['synonyms']:
if '(' in jj:
assert ')' in jj
split_list = jj.split('(')
assert len(split_list) == 2
new_synonyms.append(split_list[0].replace('_', ' ').strip())
new_synonyms.append(split_list[1].replace('_', ' ').replace(')', '').strip())
else:
new_synonyms.append(jj.replace('_', ' ').strip())
i['synonyms'] = new_synonyms
# Match Result
processed_text = remove_not_phrases_v2(i['text'])
# pdb.set_trace()
if i['name'] in processed_text or any(syn_i in processed_text for syn_i in i['synonyms']):
total_correct += 1
else:
pass
acc = total_correct / len(predictions)
print(f'Acc:{acc*100:.3f}%')
# pdb.set_trace()

105
ferret/eval/eval_pope.py Normal file
View File

@ -0,0 +1,105 @@
"""
Usage:
python ferret/eval/eval_pope.py \
--prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_adversarial \
--annotation_file data/pope/coco_pope_adversarial.json
python ferret/eval/eval_pope.py \
--prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_popular \
--annotation_file data/pope/coco_pope_popular.json
python ferret/eval/eval_pope.py \
--prediction_file final_result/ferret_13b_checkpoint-final/pope_result/coco_pope_random \
--annotation_file data/pope/coco_pope_random.json
"""
import os
import json
def evaluate_pope(prediction_file, annotation_file):
# get the predictions
if os.path.isfile(prediction_file):
answers = [json.loads(line) for line in open(prediction_file)]
elif os.path.isdir(prediction_file):
answers = [json.loads(line) for pred_file in sorted(os.listdir(prediction_file)) for line in open(os.path.join(prediction_file, pred_file))]
else:
raise NotImplementedError('Not supported file format.')
label_list = [json.loads(q)['label'] for q in open(annotation_file, 'r')]
for answer in answers:
text = answer['answer']
# Only keep the first sentence
if text.find('.') != -1:
text = text.split('.')[0]
text = text.replace(',', '')
words = text.split(' ')
if 'No' in words or 'not' in words or 'no' in words:
answer['answer'] = 'no'
else:
answer['answer'] = 'yes'
for i in range(len(label_list)):
if label_list[i] == 'no':
label_list[i] = 0
else:
label_list[i] = 1
pred_list = []
for answer in answers:
if answer['answer'] == 'no':
pred_list.append(0)
else:
pred_list.append(1)
pos = 1
neg = 0
yes_ratio = pred_list.count(1) / len(pred_list)
TP, TN, FP, FN = 0, 0, 0, 0
for pred, label in zip(pred_list, label_list):
if pred == pos and label == pos:
TP += 1
elif pred == pos and label == neg:
FP += 1
elif pred == neg and label == neg:
TN += 1
elif pred == neg and label == pos:
FN += 1
print('TP\tFP\tTN\tFN\t')
print('{}\t{}\t{}\t{}'.format(TP, FP, TN, FN))
precision = float(TP) / float(TP + FP)
recall = float(TP) / float(TP + FN)
f1 = 2*precision*recall / (precision + recall)
acc = (TP + TN) / (TP + TN + FP + FN)
print('Accuracy: {}'.format(acc))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 score: {}'.format(f1))
print('Yes ratio: {}'.format(yes_ratio))
score = {"Accuracy": acc,
"Precision": precision,
"Recall": recall,
"F1 score": f1,
"Yes ratio": yes_ratio,
}
with open(os.path.join(args.prediction_file, 'metric.json'), "w") as f:
json.dump(score, f, indent=2)
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--prediction_file', help='prediction_file')
parser.add_argument('--annotation_file', default='/path/to/json_annotations', help='annotation_file')
args = parser.parse_args()
evaluate_pope(args.prediction_file, args.annotation_file)

217
ferret/eval/eval_refexp.py Normal file
View File

@ -0,0 +1,217 @@
"""
Usage:
python ferret/eval/eval_refexp.py \
--prediction_file final_result/ferret_13b_checkpoint-final/refexp_result/finetune_refcocog_test \
--annotation_file data/annotations/finetune_refcocog_test.json
"""
import os
import copy
from collections import defaultdict
from pathlib import Path
from tqdm import tqdm
import torch
import torch.utils.data
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
from prettytable import PrettyTable
import re
import json
from misc.refcoco.box_ops import generalized_box_iou, box_iou
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
def resize_bbox(box, image_w=None, image_h=None):
ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
int(box[2] * ratio_w), int(box[3] * ratio_h)]
return new_box
def decode_bbox_from_caption(text, img_w, img_h, verbose=False):
entities = []
boxes = []
start = 0
in_brackets = False
entity = ""
box = ""
for i, char in enumerate(text):
if char == '[':
in_brackets = True
entity = text[start:i].strip()
start = i + 1
elif char == ']':
in_brackets = False
box = text[start:i].strip()
start = i + 1
# Convert box string to list of integers
box_list = list(map(int, box.split(',')))
resized_box_list = resize_bbox(box_list, img_w, img_h)
entities.append(entity)
boxes.append(resized_box_list)
# Skip until the next entity (ignoring periods or other delimiters)
while start < len(text) and text[start] not in ['.', ',', ';', '!', '?']:
start += 1
start += 1 # Skip the delimiter
return entities, boxes
def are_phrases_similar(phrase1, phrase2):
# Step 1: Convert to lower case
phrase1 = phrase1.lower()
phrase2 = phrase2.lower()
# Step 2: Standardize spacing around punctuation
phrase1 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase1).strip()
phrase2 = re.sub(r'\s*([\'",.;!?|:])\s*', r'\1 ', phrase2).strip()
# Step 3: Remove all punctuation
phrase1 = re.sub(r'[^\w\s]', '', phrase1)
phrase2 = re.sub(r'[^\w\s]', '', phrase2)
# Step 4: Remove extra white spaces
phrase1 = ' '.join(phrase1.split())
phrase2 = ' '.join(phrase2.split())
return phrase1 == phrase2
class RefExpEvaluatorFromJsonl(object):
def __init__(self, refexp_gt_path, k=(1, -1), thresh_iou=0.5):
assert isinstance(k, (list, tuple))
with open(refexp_gt_path, 'r') as f:
self.refexp_gt = json.load(f)
self.img_ids = [item['id'] for item in self.refexp_gt['images']]
print(f"Load {len(self.img_ids)} images")
print(f"Load {len(self.refexp_gt['annotations'])} annotations")
self.k = k
self.thresh_iou = thresh_iou
def summarize(self,
prediction_file: str,
verbose: bool = False,):
# get the predictions
if os.path.isfile(prediction_file):
predictions = [json.loads(line) for line in open(prediction_file)]
elif os.path.isdir(prediction_file):
predictions = [json.loads(line) for pred_file in os.listdir(prediction_file) for line in open(os.path.join(prediction_file, pred_file))]
else:
raise NotImplementedError('Not supported file format.')
# sort the predictions based on 'image_id'
predictions = sorted(predictions, key=lambda x: x['image_id'])
predict_index = 0
dataset2score = {
"refcoco": {k: 0.0 for k in self.k},
"refcoco+": {k: 0.0 for k in self.k},
"refcocog": {k: 0.0 for k in self.k},
}
dataset2count = {"refcoco": 0.0, "refcoco+": 0.0, "refcocog": 0.0}
for item_img, item_ann in tqdm(zip(self.refexp_gt['images'], self.refexp_gt['annotations'])):
# quit when evaluating all predictions
if predict_index == len(predictions):
break
if item_img['id'] != item_ann['image_id']:
raise ValueError(f"Ann\n{item_ann} \nis not matched\n {item_img}")
dataset_name = item_img['dataset_name']
img_height = item_img['height']
img_width = item_img['width']
caption = item_img['caption']
target_bbox = item_ann["bbox"]
converted_bbox = [
target_bbox[0],
target_bbox[1],
target_bbox[2] + target_bbox[0],
target_bbox[3] + target_bbox[1],
]
target_bbox = torch.as_tensor(converted_bbox).view(-1, 4)
prediction = predictions[predict_index]["text"]
try:
entities, boxes = decode_bbox_from_caption(prediction, img_width, img_height, verbose=verbose)
except ValueError as e:
entities, boxes = [], []
predict_boxes = []
for (entity, box) in zip(entities, boxes):
if not are_phrases_similar(entity, caption):
if len(box) > 0:
predict_boxes.append(box)
else:
predict_boxes.append(box)
if len(predict_boxes) == 0:
print(f"Can't find valid bbox for the given phrase {caption}, \n{entities, boxes}")
print(f"We set a 0-area box to calculate result")
predict_boxes = [[0., 0., 0., 0.]]
predict_boxes = torch.as_tensor(predict_boxes).view(-1, 4).to(dtype=torch.float32)
iou, _ = box_iou(predict_boxes, target_bbox)
mean_iou, _ = box_iou(predict_boxes.mean(0).view(-1, 4), target_bbox)
for k in self.k:
if k == 'upper bound':
if max(iou) >= self.thresh_iou:
dataset2score[dataset_name][k] += 1.0
elif k == 'mean':
if max(mean_iou) >= self.thresh_iou:
dataset2score[dataset_name][k] += 1.0
else:
if max(iou[0, :k]) >= self.thresh_iou:
dataset2score[dataset_name][k] += 1.0
dataset2count[dataset_name] += 1.0
predict_index += 1
for key, value in dataset2score.items():
for k in self.k:
try:
value[k] /= dataset2count[key]
except:
pass
results = {}
for key, value in dataset2score.items():
results[key] = sorted([v for k, v in value.items()])
print(f" Dataset: {key} - Precision @ 1, mean, all: {results[key]} \n")
return results
if __name__ == '__main__':
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--prediction_file', help='prediction_file')
parser.add_argument('--annotation_file', default='/path/to/json_annotations', help='annotation_file')
args = parser.parse_args()
evaluator = RefExpEvaluatorFromJsonl(
refexp_gt_path=args.annotation_file,
k=(1, 'mean', 'upper bound'),
thresh_iou=0.5,
)
results = evaluator.summarize(args.prediction_file, verbose=False)
with open(os.path.join(args.prediction_file, "metric.json"), "w") as f:
json.dump(results, f, indent=2)

View File

@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "The man [0.201, 0.002, 0.940, 0.758] is performing a trick on a skateboard [0.012, 0.746, 0.664, 0.886]. He appears to be in mid-air, indicating that he is probably performing a jump or some sort of aerial trick."}
{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "There are people [0.005, 0.562, 0.616, 0.824] standing in the service area [0.003, 0.416, 0.995, 0.996] of the airport, which appears to be a runway. A man [0.659, 0.572, 0.920, 0.844] is walking down the stairs [0.352, 0.676, 1.000, 0.994] of an unloading ramp. Another man [0.008, 0.554, 0.139, 0.800] is holding his baggage [0.107, 0.662, 0.179, 0.750] and bending over. There's also a woman [0.917, 0.610, 1.000, 0.724] dressed in a sleeveless black top [0.960, 0.626, 1.000, 0.668] with a black and white headband [0.923, 0.628, 0.952, 0.646]. There are several pieces of luggage [0.368, 0.702, 0.456, 0.780] [0.907, 0.706, 0.973, 0.786] on the ground [0.000, 0.282, 1.000, 0.976]."}
{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "There is a train [0.002, 0.408, 1.000, 0.683] on the bridge [0.002, 0.608, 0.988, 0.877]."}
{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "In the image, there are several children each holding a stuffed animal. A boy [0.102, 0.388, 0.498, 1.000] is holding a stuffed dog [0.391, 0.506, 0.622, 0.714], and he is wearing a black jacket [0.077, 0.520, 0.495, 0.910]. Another boy [0.717, 0.188, 1.000, 0.864] is holding a white stuffed animal [0.695, 0.356, 0.868, 0.580] wearing a red jacket. There is also a girl [0.372, 0.196, 0.603, 0.922] holding a grey stuffed dog [0.286, 0.298, 0.517, 0.422]. Another child [0.342, 0.390, 0.622, 1.000] is seen holding up a white stuffed animal [0.286, 0.298, 0.517, 0.422]. Furthermore, there is a baby [0.385, 0.034, 0.643, 0.434] being held by a lady [0.286, 0.000, 0.683, 0.560]."}
{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "The man [0.075, 0.102, 0.704, 0.716] is sitting at a table [0.000, 0.592, 0.997, 1.000] and looking at a plate of food [0.416, 0.726, 0.856, 0.904]. There is a glass [0.275, 0.716, 0.501, 0.998] and a bottle [0.048, 0.712, 0.195, 1.002] on the table in front of him."}
{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "The girl [0.112, 0.091, 0.868, 0.992] is petting a sheep [0.000, 0.003, 0.704, 0.320]. The hand [0.418, 0.373, 0.548, 0.592] of the girl is on the sheep."}
{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "The main features of the building [0.000, 0.000, 1.000, 0.466] include a door [0.110, 0.370, 0.266, 0.518] with a picture [0.155, 0.378, 0.259, 0.442] on it. There's a stop light [0.366, 0.236, 0.638, 0.394] with a sad face [0.383, 0.244, 0.614, 0.384] on it nearby the building. In front of the building, there is a road [0.000, 0.490, 1.000, 1.000] with arrows [0.000, 0.616, 0.214, 0.644] painted on it. Next to the road, there is a sidewalk [0.666, 0.572, 0.993, 0.618]."}
{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "The man [0.000, 0.024, 0.835, 1.002] in this image appears to be installing or fixing a toilet [0.564, 0.574, 1.000, 0.974]. He is wearing gloves for the task and is kneeling next to the toilet."}
{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "On the desk, there are several items. There is a computer mouse [0.414, 0.753, 0.470, 0.811], a keyboard [0.415, 0.620, 0.650, 0.783], and a cup [0.350, 0.783, 0.417, 0.906]. A plate [0.183, 0.799, 0.326, 0.896] with a fork [0.203, 0.794, 0.270, 0.857] on it is also on the desk."}
{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "The man [0.390, 0.432, 0.466, 0.793] is jumping in the grass [0.000, 0.610, 0.998, 0.997] to catch a Frisbee [0.400, 0.354, 0.446, 0.381] in the air. He is extending his arm [0.416, 0.459, 0.432, 0.565] to reach the Frisbee. He is wearing a shirt [0.402, 0.468, 0.458, 0.649] and pants [0.390, 0.658, 0.424, 0.763]. His shadow [0.492, 0.724, 0.622, 0.994] can be seen in the grass."}
{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "The image depicts a small harbor scene where multiple boats [0.000, 0.461, 0.354, 0.579] [0.348, 0.501, 0.874, 0.789] [0.302, 0.461, 0.684, 0.611] are docked on land next to a body of water [0.000, 0.259, 1.000, 0.469]. There is a chain-link fence [0.002, 0.176, 0.996, 0.995] enclosing the boats. There are also surfboards [0.830, 0.448, 0.996, 0.552] [0.420, 0.384, 0.502, 0.411] [0.910, 0.768, 0.998, 0.877] [0.430, 0.344, 0.508, 0.371] [0.830, 0.565, 1.000, 0.712] [0.322, 0.307, 0.450, 0.341] [0.766, 0.251, 0.998, 0.368] [0.764, 0.704, 0.998, 0.829] and a rack [0.754, 0.168, 1.000, 0.901] in the scene. In the background, you can see a skyline of buildings [0.692, 0.195, 0.718, 0.248] [0.888, 0.173, 0.922, 0.227] [0.582, 0.211, 0.610, 0.256] [0.180, 0.259, 0.202, 0.293] [0.466, 0.208, 0.518, 0.272] under the blue sky [0.000, 0.000, 0.998, 0.317]."}
{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "The man [0.000, 0.133, 0.600, 0.992] is interacting with a toilet [0.016, 0.042, 0.719, 0.996]. He is reaching out his hand [0.281, 0.125, 0.603, 0.562] to flush the toilet using the flusher [0.534, 0.092, 0.628, 0.300] located on top of the toilet tank [0.019, 0.021, 0.706, 0.579]."}
{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "There is a blueberry cake [0.238, 0.093, 0.786, 0.787] topped with butter [0.454, 0.024, 0.638, 0.288] placed on a plate [0.166, 0.453, 1.000, 1.000] which is on the table [0.002, 0.365, 0.998, 0.997]. On the same table, there is another plate [0.628, 0.120, 0.998, 0.389] containing a mix of food [0.632, 0.123, 0.996, 0.336] including an egg [0.636, 0.125, 0.880, 0.267] and a sausage [0.766, 0.248, 0.984, 0.333]. There is also a cup [0.002, 0.000, 0.202, 0.667] of water [0.000, 0.000, 0.202, 0.667] on the table."}
{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "The items placed on the table are a pair of napkins [0.541, 0.818, 0.601, 0.858], a pepper shaker [0.594, 0.822, 0.619, 0.854], and a salt shaker [0.612, 0.824, 0.637, 0.854]."}
{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "The image shows a woman [0.009, 0.194, 0.497, 0.888] sitting on a wooden bench [0.000, 0.324, 0.731, 0.994] in a park [0.000, 0.002, 0.997, 1.000] during daytime. The park appears to have a lot of trees [0.554, 0.000, 0.997, 0.376] and there are people [0.386, 0.438, 0.449, 0.504] walking in front of the woman. The woman's purse [0.458, 0.488, 0.605, 0.694] is also on the bench next to her. The park seems to be enclosed by a fence [0.719, 0.310, 0.997, 0.372] and there is a building [0.090, 0.000, 0.686, 0.094] behind the trees."}
{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "The image features a green and white bus [0.100, 0.376, 0.866, 0.805] that is driving down a city street."}
{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "There are two zebras in the image. The first zebra [0.000, 0.000, 0.622, 0.790] and the second zebra [0.002, 0.000, 0.684, 0.682]."}
{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "The person riding the horse is wearing boots [0.328, 0.339, 0.416, 0.492] and a shirt [0.388, 0.150, 0.508, 0.279]. They are also wearing a helmet [0.484, 0.096, 0.560, 0.162]. The person [0.320, 0.078, 0.552, 0.502] is riding the horse."}
{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "The two men [0.144, 0.360, 0.246, 0.736] [0.730, 0.474, 0.780, 0.613] are playing a game of tennis [0.012, 0.384, 0.984, 0.934]. The first man is wearing a white shirt [0.164, 0.411, 0.222, 0.547], gray shorts [0.162, 0.535, 0.220, 0.628], and black sneakers [0.180, 0.709, 0.216, 0.739]. The second man is wearing white clothing [0.734, 0.492, 0.778, 0.601] and white sneakers [0.762, 0.598, 0.776, 0.613]. They are on a tennis court [0.000, 0.372, 0.988, 0.979] and are currently hitting a ball [0.640, 0.399, 0.648, 0.411] with their tennis rackets [0.214, 0.574, 0.238, 0.619] [0.768, 0.526, 0.808, 0.556]."}
{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "The airplane [0.112, 0.300, 0.858, 0.640] is in the air, currently in flight. This can be inferred from the position of the sky [0.000, 0.000, 0.998, 1.000]. However, the airplane's landing gear [0.450, 0.592, 0.600, 0.643] is lowered, which indicates the airplane is in the process of landing or taking off."}
{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "This image is set at a train station. There's a train [0.290, 0.400, 0.998, 0.784] moving on the railway tracks [0.000, 0.752, 0.520, 0.944]. The train station has a platform with an awning [0.000, 0.299, 0.132, 0.403] and a bench under a canopy [0.000, 0.301, 0.136, 0.400]. Near the platform, there is a building [0.000, 0.299, 0.204, 0.659] with a red brick wall [0.000, 0.392, 0.206, 0.611]. The station is surrounded by trees [0.208, 0.253, 0.322, 0.653] and the sky [0.000, 0.000, 0.998, 0.560] above shows some clouds [0.374, 0.067, 0.920, 0.312]."}
{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "There are three people in the image. One person [0.066, 0.162, 0.318, 0.686] is wearing a black uniform [0.000, 0.222, 0.126, 0.646] and a hat [0.006, 0.162, 0.072, 0.198]. Another person [0.390, 0.344, 0.838, 0.894] is wearing a long black coat [0.405, 0.332, 0.835, 0.746] and pants [0.523, 0.736, 0.739, 0.858]. The last person [0.853, 0.154, 1.000, 0.650] is wearing jeans [0.853, 0.422, 1.000, 0.632]."}
{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Sure, the bike [0.146, 0.109, 0.938, 0.803] is a Honda, as indicated by the Honda logo [0.322, 0.395, 0.378, 0.419]. It has a front wheel [0.150, 0.424, 0.366, 0.635] and a back tire [0.574, 0.496, 0.860, 0.800]. It also has a light [0.894, 0.411, 0.944, 0.520]. The bike features a shock absorber [0.626, 0.501, 0.698, 0.680] for smooth riding. It also has a handle [0.284, 0.109, 0.390, 0.384] for steering and a display [0.240, 0.275, 0.290, 0.328] for the rider's information. Not to forget the sylencer [0.462, 0.645, 0.816, 0.779] near the back tire."}
{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "The setting is a mountainous region. There is a large mountain [0.000, 0.057, 0.992, 0.782] with a snow-covered peak [0.744, 0.042, 0.898, 0.119]. In front of the mountain, there is a lush pasture [0.000, 0.815, 0.984, 1.000] where cows [0.548, 0.860, 0.574, 0.896] [0.436, 0.860, 0.454, 0.890] are grazing. There are trailers [0.796, 0.910, 0.894, 0.997] [0.632, 0.899, 0.742, 0.994] in the pasture, probably for animal equipment and transportation. There are also trees [0.740, 0.409, 1.000, 0.982] around the area. All of this is under a clear sky [0.000, 0.000, 1.002, 0.257]."}
{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "The zebras [0.730, 0.496, 0.796, 0.581] are in a fenced area [0.464, 0.531, 0.934, 0.848]. Near them, there is a truck [0.000, 0.416, 0.210, 0.805] on the road [0.180, 0.709, 0.432, 0.957]. They are also surrounded by trees [0.128, 0.000, 0.592, 0.597] and grass [0.544, 0.659, 0.840, 0.859]."}
{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "The boy [0.792, 0.480, 0.938, 0.853] is holding a surfboard [0.790, 0.587, 0.960, 0.691]."}
{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "There are three people in the image. One man [0.164, 0.455, 0.292, 0.997] is standing on the side [0.236, 0.675, 0.994, 0.997] wearing shorts [0.174, 0.699, 0.254, 0.864]. Another man [0.582, 0.476, 0.662, 0.870] is standing beside the elephant [0.328, 0.157, 0.638, 0.967] wearing a shirt [0.582, 0.521, 0.650, 0.681]. There is also a woman [0.288, 0.473, 0.420, 0.967] wearing a top [0.302, 0.539, 0.358, 0.696] touching the elephant. They are all on the side of a body of water [0.000, 0.488, 0.994, 1.000]."}
{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "The boat [0.048, 0.552, 0.928, 0.819] is white and is of a large size. It has multiple levels [0.000, 0.709, 1.000, 0.829] [0.068, 0.616, 0.852, 0.688]. The side of the boat has a set of long black windows [0.374, 0.733, 0.790, 0.765]. Further, it has a silver railing [0.094, 0.557, 0.728, 0.624] [0.238, 0.597, 0.744, 0.627] on the top level. There are also red letters [0.414, 0.693, 0.654, 0.725] and blue water symbols [0.268, 0.688, 0.350, 0.779] on the side of the boat."}
{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "A person [0.490, 0.136, 0.825, 0.998] with red hair [0.507, 0.142, 0.791, 0.642] is holding a cake [0.630, 0.670, 0.772, 0.750]. She is wearing a suit jacket [0.490, 0.422, 0.799, 0.998] and a necktie [0.571, 0.442, 0.674, 0.936]."}
{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "In the image, there is a group of sheep [0.532, 0.546, 0.646, 0.662] [0.532, 0.666, 0.817, 0.810] grazing in a field [0.000, 0.002, 0.994, 0.998]. The field is bordered by a stone wall [0.000, 0.000, 0.769, 0.180] and is filled with plant life [0.000, 0.764, 0.601, 0.998]. There is also a bush [0.480, 0.000, 0.748, 0.084] and some trees [0.736, 0.036, 0.835, 0.100] present in the field. A few rocks [0.727, 0.410, 0.808, 0.470] can also be spotted in the scene."}
{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "A child [0.408, 0.168, 0.606, 0.786] is holding the umbrella [0.296, 0.038, 0.782, 0.360]."}
{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "There is a man [0.274, 0.000, 0.517, 0.792] standing in the kitchen [0.000, 0.000, 0.750, 0.849]. He is standing next to a counter [0.000, 0.329, 0.576, 0.398]. On the floor of the kitchen [0.000, 0.713, 1.000, 1.000], there is a small dog [0.462, 0.593, 0.568, 0.842]. Mugs [0.509, 0.123, 0.595, 0.266] are hanging on the wall [0.506, 0.019, 0.607, 0.384]. There is also a blender [0.015, 0.165, 0.080, 0.307] on the counter."}
{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "There are several books [0.414, 0.208, 0.538, 0.364] [0.360, 0.202, 0.417, 0.360] [0.435, 0.480, 0.712, 0.578] and a bowl [0.072, 0.030, 0.288, 0.076] on the top shelf [0.000, 0.028, 0.607, 0.202]. On the middle shelf [0.207, 0.334, 0.997, 0.380], there are more books [0.414, 0.208, 0.538, 0.364] [0.360, 0.202, 0.417, 0.360]. The bottom shelf [0.324, 0.528, 0.997, 0.624] contains a stack of books [0.435, 0.480, 0.712, 0.578]."}
{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "The man [0.163, 0.274, 0.491, 0.936] is preparing to serve a tennis ball. He is holding a tennis racket [0.235, 0.578, 0.304, 0.664] in his hand [0.253, 0.648, 0.299, 0.680]. He is wearing a cap [0.171, 0.388, 0.253, 0.476] on his head [0.173, 0.408, 0.256, 0.474], and shorts [0.216, 0.628, 0.432, 0.782]. His shadow [0.397, 0.898, 0.968, 0.956] is cast in front of him."}
{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "The image shows two women [0.102, 0.099, 0.486, 0.984] [0.502, 0.000, 0.982, 0.997], both of them are eating hot dogs [0.190, 0.587, 0.350, 0.741] [0.676, 0.315, 0.882, 0.408]. One of the women is wearing sunglasses [0.630, 0.005, 0.794, 0.048] on her head. They seem to be standing on a street [0.042, 0.403, 0.092, 0.520], potentially walking while enjoying their meal."}
{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "There is a man [0.542, 0.343, 0.812, 0.493] and a woman [0.644, 0.377, 0.834, 0.863] sitting on a bench [0.070, 0.493, 0.932, 0.960]. They are looking at the ocean [0.028, 0.319, 0.972, 0.821]. The man has his arm [0.658, 0.462, 0.828, 0.496] around the woman."}
{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "The catcher [0.334, 0.193, 0.756, 0.940] is squatting on the field, ready to catch a ball with his gloved hand extended."}
{"question_id": 38, "image": "000000271402.jpg", "category": "ground_conv", "text": "One girl [0.329, 0.148, 0.973, 0.892] is holding a tennis racket [0.462, 0.480, 0.713, 0.840] and the other girl [0.057, 0.102, 0.456, 0.898] is standing next to a scooter [0.097, 0.424, 0.592, 0.996]."}
{"question_id": 39, "image": "000000018519.jpg", "category": "ground_conv", "text": "The skater has taken several safety measures. He is wearing a helmet [0.358, 0.354, 0.448, 0.422] which has a sticker [0.408, 0.358, 0.438, 0.368]. He also has a pad [0.540, 0.362, 0.595, 0.420] on his body. His elbow is protected by another pad [0.376, 0.512, 0.443, 0.554] and his knee is protected by a knee pad [0.450, 0.542, 0.512, 0.598]. He is also wearing a wrist brace [0.279, 0.524, 0.338, 0.564]. The skater is also wearing roller skates [0.647, 0.490, 0.709, 0.584] for the activity."}
{"question_id": 40, "image": "000000106048.jpg", "category": "ground_conv", "text": "The objects on the bus include a design [0.228, 0.422, 0.438, 0.560], side mirrors [0.488, 0.314, 0.530, 0.428] [0.790, 0.332, 0.818, 0.455], wheels [0.266, 0.545, 0.294, 0.677] [0.248, 0.551, 0.264, 0.668] [0.444, 0.578, 0.472, 0.751], windows [0.510, 0.216, 0.796, 0.548] and a windshield [0.518, 0.222, 0.782, 0.545]. The bus [0.222, 0.144, 0.820, 0.757] itself."}

View File

@@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : axle at [0.447, 0.814, 0.535, 0.856].\nObject 1 : background at [0.003, 0.744, 0.994, 0.988].\nObject 2 : bracelet at [0.820, 0.444, 0.859, 0.470].\nObject 3 : building at [0.012, 0.888, 0.099, 0.994].\nObject 4 : corner at [0.027, 0.890, 0.117, 0.992].\nObject 5 : fence at [0.030, 0.886, 1.000, 1.000].\nObject 6 : hair at [0.486, 0.078, 0.712, 0.216].\nObject 7 : jean pants at [0.246, 0.380, 0.841, 0.632].\nObject 8 : laces at [0.168, 0.562, 0.850, 0.674].\nObject 9 : logo at [0.429, 0.232, 0.583, 0.364].\nObject 10 : man at [0.201, 0.002, 0.940, 0.758].\nObject 11 : name at [0.000, 0.960, 0.321, 1.000].\nObject 12 : picture at [0.003, 0.004, 1.000, 0.998].\nObject 13 : poles at [0.180, 0.886, 0.432, 0.990].\nObject 14 : shirt at [0.324, 0.124, 0.694, 0.392].\nObject 15 : shoes at [0.189, 0.606, 0.946, 0.792].\nObject 16 : skateboard at [0.012, 0.746, 0.664, 0.886].\nObject 17 : sky at [0.012, 0.002, 1.000, 0.918].\nObject 18 : stadium lights at [0.147, 0.860, 0.456, 0.994].\nObject 19 : stitching at [0.312, 0.408, 0.754, 0.638].\nObject 20 : strip at [0.279, 0.770, 0.529, 0.802].\nObject 21 : top at [0.024, 0.830, 0.420, 0.936].\nObject 22 : trees at [0.024, 0.846, 1.000, 1.000].\nObject 23 : wheels at [0.012, 0.808, 0.586, 0.904].\nObject 24 : wrist at [0.802, 0.434, 0.856, 0.484].\n\nRelationships:\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 23 : wheels -> on a -> object 16 : skateboard.\nobject 14 : shirt -> has a -> object 9 : logo.\nobject 10 : man -> doing trick on -> object 16 : skateboard.\nobject 3 : building -> behind a -> object 5 : fence.\nobject 11 : name -> on -> object 12 : picture.\nobject 11 : name -> has a -> object 11 : name.\nobject 10 : man -> performing on a -> object 16 : skateboard.\nobject 4 : corner -> of -> object 3 : building.\nobject 18 : stadium lights -> are on -> object 13 : poles.\nobject 16 : skateboard -> has -> object 23 : wheels.\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 11 : name -> on -> object 12 : picture.\nobject 16 : skateboard -> under -> object 10 : man.\nobject 10 : man -> wearing -> object 15 : shoes.\nobject 3 : building -> behind -> object 5 : fence.\nobject 22 : trees -> in -> object 1 : background.\nobject 15 : shoes -> have -> object 8 : laces.\nobject 18 : stadium lights -> on -> object 13 : poles.\nobject 5 : fence -> behind -> object 10 : man.\nobject 20 : strip -> on -> object 16 : skateboard.\nobject 19 : stitching -> on -> object 7 : jean pants.\nobject 9 : logo -> on -> object 14 : shirt.\nobject 23 : wheels -> on -> object 16 : skateboard.\nobject 0 : axle -> on -> object 16 : skateboard.\nobject 21 : top -> of -> object 22 : trees.\n\nRegion Description:\nRegion Description at [0.030, 0.774, 0.643, 0.912] : a black skateboard with black wheels.\n\nGlobal Caption:\nA man flying through the air while riding a skateboard.\nA man is doing tricks on a skateboard.\nA skateboarder jumps while trying to perform a trick.\na man in the air standing above the skateboard\na person attempting a jump with a skateboard"}
{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : baggage at [0.107, 0.662, 0.179, 0.750].\nObject 1 : baggage at [0.368, 0.706, 0.456, 0.782].\nObject 2 : building at [0.000, 0.000, 0.997, 0.326].\nObject 3 : cap at [0.784, 0.544, 0.824, 0.568].\nObject 4 : duffel bag at [0.584, 0.702, 0.643, 0.768].\nObject 5 : ground at [0.000, 0.282, 1.000, 0.976].\nObject 6 : hair at [0.920, 0.614, 0.973, 0.640].\nObject 7 : headband at [0.923, 0.628, 0.952, 0.646].\nObject 8 : jacket at [0.776, 0.568, 0.840, 0.642].\nObject 9 : line at [0.696, 0.750, 0.989, 0.794].\nObject 10 : lines at [0.000, 0.436, 0.851, 0.486].\nObject 11 : luggage at [0.907, 0.706, 0.973, 0.786].\nObject 12 : luggage at [0.368, 0.702, 0.456, 0.780].\nObject 13 : man at [0.008, 0.554, 0.139, 0.800].\nObject 14 : man at [0.659, 0.572, 0.920, 0.844].\nObject 15 : man at [0.771, 0.538, 0.843, 0.640].\nObject 16 : pavement at [0.003, 0.308, 0.992, 0.566].\nObject 17 : people at [0.005, 0.562, 0.616, 0.824].\nObject 18 : pillars at [0.211, 0.130, 0.235, 0.240].\nObject 19 : ramp at [0.179, 0.158, 0.707, 0.408].\nObject 20 : service area at [0.003, 0.416, 0.995, 0.996].\nObject 21 : stairs at [0.352, 0.676, 1.000, 0.994].\nObject 22 : sweater at [0.667, 0.634, 0.920, 0.824].\nObject 23 : top at [0.960, 0.626, 1.000, 0.668].\nObject 24 : truck at [0.781, 0.278, 0.997, 0.366].\nObject 25 : walls at [0.608, 0.000, 0.989, 0.320].\nObject 26 : wheel at [0.843, 0.338, 0.875, 0.366].\nObject 27 : woman at [0.917, 0.610, 1.000, 0.724].\n\nRelationships:\nobject 17 : people -> in -> object 20 : service area.\nobject 27 : woman -> bends over -> object 11 : luggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 12 : luggage -> on -> object 5 : ground.\nobject 13 : man -> carries -> object 0 : baggage.\nobject 14 : man -> wears -> object 22 : sweater.\nobject 15 : man -> wears -> object 3 : cap.\nobject 24 : truck -> in -> object 20 : service area.\nobject 15 : man -> wears -> object 8 : jacket.\nobject 10 : lines -> on -> object 16 : pavement.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 9 : line -> on -> object 16 : pavement.\nobject 24 : truck -> has -> object 26 : wheel.\nobject 2 : building -> has -> object 25 : walls.\nobject 15 : man -> on -> object 20 : service area.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 27 : woman -> wears -> object 7 : headband.\nobject 1 : baggage -> on -> object 20 : service area.\n\nRegion Description:\nRegion Description at [0.443, 0.528, 0.992, 0.850] : People standing in service area of airport..\nRegion Description at [0.648, 0.564, 0.960, 0.892] : Man walking down stairs of unloading ramp..\nRegion Description at [0.229, 0.698, 0.381, 0.776] : Black and red luggage sitting on ground..\nRegion Description at [0.957, 0.616, 0.997, 0.670] : Woman dressed in sleeveless black top..\nRegion Description at [0.011, 0.548, 0.211, 0.750] : Man holding his luggage and bending over.\nRegion Description at [0.893, 0.578, 0.995, 0.678] : woman with a black and white head band.\nRegion Description at [0.235, 0.684, 0.973, 0.816] : Rainbow of colors in the form of luggage.\n\nGlobal Caption:\nSome are standing outside a building with suitcases.\nA few people are getting of a plane.\nA group of people and luggage on a airport tarmac.\nSome people who are placing luggage on a runway.\nAn airport and plane unloading 
passengers with luggage."}
{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : air conditioner at [0.004, 0.261, 0.018, 0.293].\nObject 1 : balcony at [0.048, 0.037, 0.100, 0.077].\nObject 2 : beam at [0.616, 0.621, 0.664, 0.824].\nObject 3 : beam at [0.490, 0.640, 0.532, 0.832].\nObject 4 : beam at [0.426, 0.640, 0.462, 0.835].\nObject 5 : bridge at [0.002, 0.608, 0.988, 0.877].\nObject 6 : bridge at [0.004, 0.453, 1.000, 0.867].\nObject 7 : building at [0.000, 0.000, 0.252, 0.469].\nObject 8 : bushes at [0.000, 0.939, 0.072, 0.997].\nObject 9 : colors at [0.194, 0.480, 0.330, 0.661].\nObject 10 : column at [0.618, 0.824, 0.676, 0.997].\nObject 11 : guard rails at [0.000, 0.496, 1.000, 0.624].\nObject 12 : light at [0.606, 0.192, 0.724, 0.243].\nObject 13 : light at [0.864, 0.947, 0.916, 1.000].\nObject 14 : metal support at [0.002, 0.603, 0.976, 0.995].\nObject 15 : pole at [0.700, 0.205, 0.724, 0.995].\nObject 16 : red line at [0.632, 0.851, 0.648, 0.995].\nObject 17 : sky at [0.250, 0.013, 1.000, 0.467].\nObject 18 : south west at [0.338, 0.616, 0.442, 0.651].\nObject 19 : street at [0.002, 0.861, 1.000, 0.997].\nObject 20 : train at [0.002, 0.408, 1.000, 0.683].\nObject 21 : window at [0.144, 0.013, 0.182, 0.064].\nObject 22 : window at [0.430, 0.485, 0.534, 0.595].\nObject 23 : window at [0.134, 0.091, 0.182, 0.155].\nObject 24 : window at [0.340, 0.504, 0.424, 0.613].\nObject 25 : window at [0.116, 0.944, 0.168, 1.000].\nObject 26 : windows at [0.762, 0.437, 0.920, 0.613].\nObject 27 : windows at [0.004, 0.000, 0.096, 0.088].\n\nRelationships:\nobject 10 : column -> supporting -> object 6 : bridge.\nobject 10 : column -> has -> object 16 : red line.\nobject 12 : light -> on -> object 15 : pole.\nobject 7 : building -> behind -> object 20 : train.\nobject 21 : window -> on -> object 7 : building.\nobject 1 : balcony -> on -> object 7 : building.\nobject 25 : window -> visible under -> object 5 : bridge.\nobject 12 : light -> on -> object 19 : street.\nobject 2 : beam -> of -> object 5 : bridge.\nobject 20 : train -> in -> object 9 : colors.\nobject 24 : window -> of -> object 20 : train.\nobject 22 : window -> of train -> object 20 : train.\nobject 5 : bridge -> on -> object 20 : train.\nobject 7 : building -> beside -> object 20 : train.\nobject 23 : window -> of -> object 7 : building.\nobject 12 : light -> on a -> object 15 : pole.\nobject 12 : light -> on -> object 15 : pole.\nobject 20 : train -> says -> object 18 : south west.\nobject 8 : bushes -> are in -> object 19 : street.\nobject 7 : building -> has many -> object 27 : windows.\nobject 7 : building -> has -> object 0 : air conditioner.\nobject 20 : train -> on -> object 6 : bridge.\nobject 12 : light -> in -> object 19 : street.\nobject 5 : bridge -> has -> object 11 : guard rails.\nobject 26 : windows -> on -> object 20 : train.\nobject 20 : train -> has -> object 18 : south west.\nobject 6 : bridge -> has -> object 14 : metal support.\nobject 9 : colors -> to -> object 20 : train.\n\nRegion Description:\nRegion Description at [0.602, 0.837, 0.696, 0.997] : a metal support column for the bridge.\n\nGlobal Caption:\nA train as it travels down the tracks over a bridge.\na colorful train going along an elevated track \nA train rides on a bridge past a building.\nA subway train that is passing over a train bridge.\na train on a train track on an elevated bridge"}
{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : animal at [0.717, 0.042, 0.831, 0.152].\nObject 1 : animal at [0.114, 0.582, 0.348, 0.840].\nObject 2 : baby at [0.385, 0.034, 0.643, 0.434].\nObject 3 : baby at [0.911, 0.028, 1.000, 0.250].\nObject 4 : bear at [0.391, 0.506, 0.622, 0.714].\nObject 5 : bear at [0.695, 0.356, 0.868, 0.580].\nObject 6 : bear hand at [0.114, 0.630, 0.175, 0.660].\nObject 7 : black sock at [0.800, 0.796, 0.858, 0.834].\nObject 8 : blonde boy at [0.166, 0.170, 0.351, 0.460].\nObject 9 : boy at [0.102, 0.388, 0.498, 1.000].\nObject 10 : boy at [0.717, 0.188, 1.000, 0.864].\nObject 11 : child at [0.342, 0.390, 0.622, 1.000].\nObject 12 : coat at [0.077, 0.520, 0.495, 0.910].\nObject 13 : coat at [0.775, 0.296, 1.000, 0.616].\nObject 14 : coat at [0.397, 0.090, 0.634, 0.262].\nObject 15 : flip flops at [0.434, 0.756, 0.606, 0.910].\nObject 16 : girl at [0.372, 0.196, 0.603, 0.922].\nObject 17 : glasses at [0.191, 0.236, 0.308, 0.250].\nObject 18 : grass at [0.637, 0.652, 0.754, 0.788].\nObject 19 : hand at [0.714, 0.094, 0.788, 0.160].\nObject 20 : hands at [0.763, 0.380, 0.877, 0.430].\nObject 21 : hat at [0.757, 0.030, 0.889, 0.078].\nObject 22 : jacket at [0.357, 0.500, 0.622, 0.782].\nObject 23 : jacket at [0.422, 0.286, 0.603, 0.550].\nObject 24 : jacket at [0.163, 0.296, 0.320, 0.462].\nObject 25 : jacket at [0.911, 0.106, 1.000, 0.224].\nObject 26 : lady at [0.286, 0.000, 0.683, 0.560].\nObject 27 : man at [0.628, 0.030, 0.951, 0.742].\nObject 28 : shirt at [0.831, 0.306, 0.957, 0.404].\nObject 29 : shirt at [0.197, 0.296, 0.298, 0.370].\nObject 30 : shoe at [0.717, 0.804, 0.871, 0.864].\nObject 31 : sidewalk at [0.628, 0.574, 0.769, 0.632].\nObject 32 : stuffed animal at [0.286, 0.298, 0.517, 0.422].\n\nRelationships:\nobject 10 : boy -> wearing -> object 28 : shirt.\nobject 3 : baby -> wearing -> object 25 : jacket.\nobject 22 : jacket -> carrying -> object 4 : bear.\nobject 8 : blonde boy -> wears -> object 17 : glasses.\nobject 8 : blonde boy -> wears -> object 24 : jacket.\nobject 11 : child -> holding up -> object 32 : stuffed animal.\nobject 10 : boy -> holding up -> object 5 : bear.\nobject 30 : shoe -> with a -> object 7 : black sock.\nobject 10 : boy -> wearing -> object 7 : black sock.\nobject 26 : lady -> holding -> object 2 : baby.\nobject 16 : girl -> wearing -> object 15 : flip flops.\nobject 9 : boy -> wearing -> object 12 : coat.\nobject 10 : boy -> wearing a -> object 13 : coat.\nobject 4 : bear -> on -> object 20 : hands.\nobject 26 : lady -> carrying -> object 2 : baby.\nobject 0 : animal -> in -> object 19 : hand.\n\nRegion Description:\nRegion Description at [0.905, 0.020, 0.997, 0.272] : blonde haired baby wearing yellow jacket.\nRegion Description at [0.357, 0.388, 0.640, 0.730] : girl in blue jacket carrying blue dog.\nRegion Description at [0.071, 0.378, 0.498, 0.842] : boy in black jacket holding stuffed dog.\nRegion Description at [0.055, 0.572, 0.375, 0.846] : brown stuffed dog with red and white collar.\nRegion Description at [0.283, 0.194, 0.603, 0.400] : girl in pink jacket holding white stuffed animal.\nRegion Description at [0.695, 0.356, 0.874, 0.576] : White stuffed animal wearing a red jacket..\nRegion Description at [0.332, 0.394, 0.618, 0.992] : Little girl holding a grey stuffed dog..\nRegion Description at [0.372, 0.476, 0.723, 0.786] : little girl holding blue and white stuffed animal.\nRegion Description at [0.062, 0.556, 0.422, 0.840] : little boy holding 
brown and white stuffed animal.\n\nGlobal Caption:\na bunch of kids walking through some grass\nA group of children are holding various stuffed animals and dolls.\nKids walking while holding their stuffed animals. \nA group of kids holding teddy bears and looking happy.\nA group of children carrying stuffed animals walks across the grass. "}
{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bottle at [0.048, 0.712, 0.195, 1.002].\nObject 1 : chair at [0.696, 0.500, 1.003, 0.718].\nObject 2 : cork at [0.053, 0.712, 0.139, 0.776].\nObject 3 : cup at [0.043, 0.736, 0.240, 0.916].\nObject 4 : dish at [0.416, 0.726, 0.856, 0.904].\nObject 5 : fruit at [0.629, 0.834, 0.675, 0.880].\nObject 6 : glass at [0.275, 0.716, 0.501, 0.998].\nObject 7 : glasses at [0.179, 0.242, 0.464, 0.322].\nObject 8 : hair at [0.536, 0.258, 0.656, 0.320].\nObject 9 : man at [0.075, 0.102, 0.704, 0.716].\nObject 10 : rasberries at [0.499, 0.750, 0.544, 0.786].\nObject 11 : raspberries at [0.664, 0.828, 0.741, 0.864].\nObject 12 : sauce at [0.565, 0.752, 0.715, 0.824].\nObject 13 : shirt at [0.600, 0.350, 0.645, 0.494].\nObject 14 : shirt at [0.635, 0.282, 0.997, 0.654].\nObject 15 : sign at [0.419, 0.134, 0.509, 0.184].\nObject 16 : sweater at [0.072, 0.288, 0.704, 0.718].\nObject 17 : table at [0.000, 0.592, 0.997, 1.000].\nObject 18 : window at [0.328, 0.000, 0.600, 0.298].\nObject 19 : woman at [0.531, 0.258, 0.768, 0.688].\n\nRelationships:\nobject 9 : man -> wearing -> object 7 : glasses.\nobject 0 : bottle -> on -> object 17 : table.\nobject 6 : glass -> on -> object 17 : table.\nobject 11 : raspberries -> on -> object 4 : dish.\nobject 9 : man -> wearing -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.640, 0.180, 0.989, 0.530] : Man wearing a black and orange stripe shirt.\nRegion Description at [0.413, 0.136, 0.512, 0.184] : Yellow closed sign with brown letters.\nRegion Description at [0.629, 0.186, 0.995, 0.706] : a man wearing and orange and black striped shirt.\nRegion Description at [0.528, 0.254, 0.717, 0.666] : a woman with a ponytail eating lunch.\nRegion Description at [0.152, 0.238, 0.459, 0.322] : a pair of black wire rimmed eye glasses.\nRegion Description at [0.029, 0.716, 0.243, 0.922] : empty cup that used to contain coffee.\nRegion Description at [0.264, 0.708, 0.867, 0.994] : A plate of food with a glass of water.\n\nGlobal Caption:\nA man sitting in front of a plate of food.\nA man at a wooden table looking at a plate of food.\na man smiling while looking at his plate of food\nA man sitting at a table with a plate filled with food.\nA man looking happily at some dish in front of him."}
{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : edge at [0.498, 0.296, 0.530, 0.419].\nObject 1 : feet at [0.914, 0.235, 1.000, 0.291].\nObject 2 : floor at [0.000, 0.003, 0.994, 1.000].\nObject 3 : girl at [0.112, 0.091, 0.868, 0.992].\nObject 4 : grass at [0.000, 0.005, 0.998, 0.995].\nObject 5 : ground at [0.000, 0.005, 0.992, 0.992].\nObject 6 : hair at [0.542, 0.096, 0.910, 0.624].\nObject 7 : hand at [0.418, 0.373, 0.548, 0.592].\nObject 8 : jeans at [0.500, 0.216, 0.584, 0.400].\nObject 9 : sheep at [0.000, 0.003, 0.704, 0.320].\nObject 10 : shirt at [0.426, 0.504, 0.900, 0.992].\nObject 11 : shoe at [0.472, 0.379, 0.558, 0.453].\nObject 12 : sneakers at [0.904, 0.141, 0.968, 0.187].\nObject 13 : someon at [0.884, 0.019, 0.994, 0.192].\nObject 14 : strap at [0.744, 0.600, 0.872, 0.715].\nObject 15 : strip at [0.512, 0.520, 0.548, 0.589].\nObject 16 : sweater at [0.500, 0.555, 0.818, 0.997].\nObject 17 : tie at [0.532, 0.515, 0.546, 0.579].\nObject 18 : wool at [0.016, 0.171, 0.114, 0.411].\n\nRelationships:\nobject 7 : hand -> on -> object 9 : sheep.\nobject 3 : girl -> with -> object 10 : shirt.\nobject 6 : hair -> on -> object 3 : girl.\nobject 3 : girl -> has -> object 6 : hair.\nobject 6 : hair -> on -> object 3 : girl.\n\nRegion Description:\nRegion Description at [0.000, 0.027, 0.530, 0.744] : a sheep that has been recently shorn.\nRegion Description at [0.116, 0.032, 0.924, 0.992] : girl in front has gray sweater hanging over her left shoulder.\nRegion Description at [0.506, 0.045, 0.912, 0.845] : girl in front is facing away from the camera.\nRegion Description at [0.120, 0.053, 0.890, 0.989] : girl in front wears a gray and white striped T-shirt.\nRegion Description at [0.300, 0.005, 0.624, 0.451] : someone in jeans and brown shoes stands behind the sheep.\nRegion Description at [0.880, 0.003, 0.992, 0.949] : several people only visible from the feet.\n\nGlobal Caption:\nYoung woman with sheep on straw covered floor.\nA child places his hands on the head and neck of a sheep while another sheep looks at his face.\nA person petting the head of a cute fluffy sheep.\nA child is petting a sheep while another sheep watches.\nA woman kneeling to pet animals while others wait. "}
{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arrows at [0.000, 0.616, 0.214, 0.644].\nObject 1 : awning at [0.159, 0.260, 0.293, 0.336].\nObject 2 : building at [0.000, 0.000, 1.000, 0.466].\nObject 3 : bushes at [0.693, 0.342, 1.000, 0.512].\nObject 4 : door at [0.110, 0.370, 0.266, 0.518].\nObject 5 : face at [0.390, 0.256, 0.614, 0.392].\nObject 6 : greenery at [0.824, 0.154, 0.997, 0.384].\nObject 7 : hitch at [0.221, 0.520, 0.259, 0.542].\nObject 8 : ladder at [0.110, 0.342, 0.283, 0.364].\nObject 9 : license plate at [0.141, 0.460, 0.234, 0.500].\nObject 10 : line at [0.017, 0.700, 0.266, 0.756].\nObject 11 : picture at [0.155, 0.378, 0.259, 0.442].\nObject 12 : plant barrier at [0.672, 0.482, 1.000, 0.606].\nObject 13 : planter at [0.676, 0.152, 1.000, 0.510].\nObject 14 : pole at [0.328, 0.068, 0.483, 0.994].\nObject 15 : road at [0.000, 0.490, 1.000, 1.000].\nObject 16 : roof at [0.117, 0.360, 0.283, 0.382].\nObject 17 : sad face at [0.383, 0.244, 0.614, 0.384].\nObject 18 : short term at [0.624, 0.040, 0.769, 0.080].\nObject 19 : sidewalk at [0.666, 0.572, 0.993, 0.618].\nObject 20 : sign at [0.621, 0.082, 0.772, 0.132].\nObject 21 : sign at [0.007, 0.144, 0.069, 0.204].\nObject 22 : signal at [0.266, 0.210, 0.679, 0.848].\nObject 23 : stop light at [0.366, 0.236, 0.638, 0.394].\nObject 24 : tail light at [0.100, 0.446, 0.121, 0.472].\nObject 25 : van at [0.076, 0.326, 0.297, 0.556].\nObject 26 : wall at [0.676, 0.500, 0.997, 0.604].\nObject 27 : window at [0.903, 0.000, 1.000, 0.086].\n\nRelationships:\nobject 23 : stop light -> with -> object 17 : sad face.\nobject 0 : arrows -> on -> object 15 : road.\nobject 12 : plant barrier -> beside -> object 15 : road.\nobject 11 : picture -> on -> object 4 : door.\nobject 10 : line -> painted in -> object 15 : road.\nobject 19 : sidewalk -> next to -> object 15 : road.\nobject 2 : building -> for -> object 18 : short term.\nobject 23 : stop light -> making -> object 5 : face.\nobject 3 : bushes -> just above -> object 26 : wall.\nobject 22 : signal -> on -> object 14 : pole.\nobject 25 : van -> has -> object 16 : roof.\nobject 25 : van -> has -> object 8 : ladder.\nobject 8 : ladder -> on -> object 16 : roof.\nobject 13 : planter -> by -> object 15 : road.\nobject 23 : stop light -> on -> object 22 : signal.\n\nRegion Description:\nRegion Description at [0.331, 0.852, 0.472, 0.996] : Pole holding traffic light on street.\nRegion Description at [0.600, 0.036, 0.793, 0.084] : Building offers short term office space.\nRegion Description at [0.603, 0.074, 0.776, 0.120] : Office space as small as 2,500 sq. ft. available.\nRegion Description at [0.003, 0.008, 0.972, 0.356] : an office building is in the background.\n\nGlobal Caption:\nA red traffic light with a sad face drawn over it.\nA street scene with a close of of a stop light.\nA red stoplight with a street in the background.\nA stop sign gives traffic a frown face.\nThe sign is now at a red light."}
{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : back splash at [0.000, 0.000, 0.145, 0.278].\nObject 1 : blemish at [0.564, 0.178, 0.572, 0.184].\nObject 2 : blemish at [0.517, 0.214, 0.523, 0.222].\nObject 3 : checkered tile at [0.751, 0.226, 1.000, 0.422].\nObject 4 : checkered tile at [0.000, 0.000, 0.145, 0.282].\nObject 5 : cloth at [0.384, 0.878, 1.000, 1.002].\nObject 6 : face at [0.488, 0.106, 0.671, 0.270].\nObject 7 : faucet at [0.000, 0.170, 0.078, 0.268].\nObject 8 : floor at [0.000, 0.796, 1.000, 1.004].\nObject 9 : flooring at [0.000, 0.788, 1.000, 1.002].\nObject 10 : glove at [0.647, 0.622, 0.815, 0.780].\nObject 11 : hand at [0.633, 0.606, 0.815, 0.784].\nObject 12 : man at [0.000, 0.024, 0.835, 1.002].\nObject 13 : overalls at [0.000, 0.576, 0.702, 0.962].\nObject 14 : part at [0.043, 0.274, 0.133, 0.370].\nObject 15 : pipes at [0.000, 0.354, 0.046, 0.472].\nObject 16 : poster at [0.749, 0.226, 0.997, 0.426].\nObject 17 : seat at [0.581, 0.582, 1.000, 0.716].\nObject 18 : sill at [0.792, 0.032, 1.000, 0.094].\nObject 19 : sink at [0.000, 0.240, 0.136, 0.376].\nObject 20 : sock at [0.217, 0.856, 0.251, 0.892].\nObject 21 : tarp at [0.358, 0.868, 1.000, 1.004].\nObject 22 : tile at [0.749, 0.230, 1.000, 0.420].\nObject 23 : toilet at [0.564, 0.574, 1.000, 0.974].\nObject 24 : towel at [0.000, 0.872, 1.000, 1.002].\nObject 25 : wall at [0.000, 0.000, 1.000, 0.870].\nObject 26 : window at [0.777, 0.000, 1.000, 0.080].\n\nRelationships:\nobject 26 : window -> above -> object 23 : toilet.\nobject 21 : tarp -> to protect -> object 8 : floor.\nobject 14 : part -> of a bathroom -> object 19 : sink.\nobject 4 : checkered tile -> on bathroom -> object 25 : wall.\nobject 1 : blemish -> on -> object 6 : face.\nobject 2 : blemish -> on -> object 6 : face.\nobject 1 : blemish -> on -> object 6 : face.\nobject 6 : face -> on -> object 12 : man.\nobject 10 : glove -> on -> object 11 : hand.\n\nRegion Description:\nRegion Description at [0.685, 0.508, 0.879, 0.774] : the man is wearing gloves on his hands.\nRegion Description at [0.685, 0.638, 0.815, 0.764] : rubber glove on the man's right hand.\nRegion Description at [0.220, 0.860, 0.251, 0.886] : black and white design on man's sock.\nRegion Description at [0.000, 0.052, 0.124, 0.158] : black and white back splash for bathroom sink.\n\nGlobal Caption:\nA young man bending next to a toilet.\nA man is kneeling and holding on to a toilet.\nA man attempting to lift up a toilet off the floor.\nA man fixing a toilet in a black and white photo.\nA man wears gloves as he installs a toilet."}
{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : computer mouse at [0.414, 0.753, 0.470, 0.811].\nObject 1 : cup at [0.350, 0.783, 0.417, 0.906].\nObject 2 : desk at [0.000, 0.488, 0.998, 0.999].\nObject 3 : fork at [0.203, 0.794, 0.270, 0.857].\nObject 4 : glass at [0.277, 0.703, 0.345, 0.816].\nObject 5 : head phones at [0.872, 0.556, 0.993, 0.634].\nObject 6 : keyboard at [0.415, 0.620, 0.650, 0.783].\nObject 7 : lamp at [0.000, 0.302, 0.214, 0.430].\nObject 8 : laptop at [0.491, 0.296, 0.703, 0.540].\nObject 9 : picture at [0.795, 0.204, 0.898, 0.358].\nObject 10 : plant at [0.192, 0.201, 0.391, 0.461].\nObject 11 : plate at [0.183, 0.799, 0.326, 0.896].\nObject 12 : screen at [0.237, 0.249, 0.504, 0.628].\nObject 13 : stand at [0.506, 0.531, 0.663, 0.617].\nObject 14 : window at [0.606, 0.000, 1.000, 0.346].\n\nRelationships:\nobject 0 : computer mouse -> on -> object 2 : desk.\nobject 8 : laptop -> on -> object 13 : stand.\nobject 6 : keyboard -> on -> object 2 : desk.\nobject 9 : picture -> near -> object 14 : window.\nobject 3 : fork -> on -> object 11 : plate.\n\nRegion Description:\n\nGlobal Caption:\na desk with a cup plate laptop monitor and keyboard\nA laptop sitting next to a monitor, keyboard and a mouse.\nA laptop and a desktop monitor are displayed on top of the desk.\nLarge office desk with computers near a window.\nA desk with a laptop, second monitor and keyboard."}
{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : abs at [0.392, 0.628, 0.426, 0.664].\nObject 1 : arm at [0.416, 0.459, 0.432, 0.565].\nObject 2 : buildings at [0.242, 0.532, 0.640, 0.580].\nObject 3 : frisbee at [0.400, 0.354, 0.446, 0.381].\nObject 4 : grass at [0.000, 0.610, 0.998, 0.997].\nObject 5 : hand at [0.418, 0.423, 0.438, 0.474].\nObject 6 : legs at [0.420, 0.703, 0.456, 0.811].\nObject 7 : man at [0.390, 0.432, 0.466, 0.793].\nObject 8 : pants at [0.390, 0.658, 0.424, 0.763].\nObject 9 : shadow at [0.492, 0.724, 0.622, 0.994].\nObject 10 : shirt at [0.402, 0.468, 0.458, 0.649].\nObject 11 : sky at [0.002, 0.003, 0.996, 0.556].\nObject 12 : trees at [0.002, 0.498, 0.998, 0.646].\n\nRelationships:\nobject 7 : man -> tossing -> object 3 : frisbee.\nobject 7 : man -> has -> object 6 : legs.\nobject 7 : man -> playing -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> has -> object 5 : hand.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 7 : man -> wearing -> object 10 : shirt.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> wearing -> object 8 : pants.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> jumping -> object 4 : grass.\nobject 2 : buildings -> behind -> object 4 : grass.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 7 : man -> extending -> object 1 : arm.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 7 : man -> exposing -> object 0 : abs.\nobject 7 : man -> catching -> object 3 : frisbee.\nobject 3 : frisbee -> in -> object 11 : sky.\nobject 9 : shadow -> in -> object 4 : grass.\nobject 2 : buildings -> near -> object 12 : trees.\nobject 1 : arm -> reaching for -> object 3 : frisbee.\n\nRegion Description:\nRegion Description at [0.394, 0.658, 0.480, 0.826] : A person wearing black color trouser.\nRegion Description at [0.394, 0.435, 0.460, 0.796] : man in a red sweatshirt and jeans jumping.\nRegion Description at [0.390, 0.357, 0.464, 0.823] : man catching a frisbee in a wheat field.\nRegion Description at [0.012, 0.520, 0.996, 0.631] : trees and a village on a hill in the distance.\nRegion Description at [0.390, 0.423, 0.464, 0.649] : arm straight up and arm bent at elbow.\n\nGlobal Caption:\nA person trying to reach a Frisbee in a field with high brown grass.\nA young boy in a red top is playing with a red object tossed in the sky.\nA young man in a red jacket jumping for a Frizbee in a field.\nA guy is jumping to catch a frisbee in tall grass.\nA man jumps to catch a Frisbee flying through the air."}
{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blue sky at [0.000, 0.000, 0.998, 0.317].\nObject 1 : boat at [0.000, 0.461, 0.354, 0.579].\nObject 2 : boat at [0.348, 0.501, 0.874, 0.789].\nObject 3 : boat at [0.302, 0.461, 0.684, 0.611].\nObject 4 : buildings at [0.692, 0.195, 0.718, 0.248].\nObject 5 : buildings at [0.888, 0.173, 0.922, 0.227].\nObject 6 : buildings at [0.582, 0.211, 0.610, 0.256].\nObject 7 : buildings at [0.180, 0.259, 0.202, 0.293].\nObject 8 : buildings at [0.466, 0.208, 0.518, 0.272].\nObject 9 : chain-link fence at [0.002, 0.176, 0.996, 0.995].\nObject 10 : cord at [0.412, 0.587, 0.626, 1.000].\nObject 11 : fence pole at [0.230, 0.227, 0.336, 1.000].\nObject 12 : grass at [0.000, 0.667, 0.756, 0.997].\nObject 13 : horizon at [0.000, 0.187, 1.000, 0.336].\nObject 14 : mast at [0.570, 0.000, 0.722, 0.571].\nObject 15 : rack at [0.754, 0.168, 1.000, 0.901].\nObject 16 : sail post at [0.586, 0.000, 0.628, 0.568].\nObject 17 : section at [0.272, 0.179, 0.994, 0.992].\nObject 18 : shelf at [0.762, 0.355, 1.000, 0.387].\nObject 19 : sky line at [0.012, 0.173, 0.994, 0.195].\nObject 20 : surfboard at [0.830, 0.448, 0.996, 0.552].\nObject 21 : surfboard at [0.420, 0.384, 0.502, 0.411].\nObject 22 : surfboard at [0.910, 0.768, 0.998, 0.877].\nObject 23 : surfboard at [0.430, 0.344, 0.508, 0.371].\nObject 24 : surfboard at [0.830, 0.565, 1.000, 0.712].\nObject 25 : surfboard at [0.322, 0.307, 0.450, 0.341].\nObject 26 : surfboard at [0.766, 0.251, 0.998, 0.368].\nObject 27 : surfboard at [0.764, 0.704, 0.998, 0.829].\nObject 28 : water at [0.000, 0.259, 1.000, 0.469].\nObject 29 : water way at [0.008, 0.272, 0.996, 0.432].\n\nRelationships:\nobject 25 : surfboard -> stacked on -> object 18 : shelf.\nobject 24 : surfboard -> stacked on -> object 18 : shelf.\nobject 20 : surfboard -> stacked on -> object 18 : shelf.\nobject 26 : surfboard -> stacked on -> object 18 : shelf.\nobject 15 : rack -> of -> object 20 : surfboard.\nobject 8 : buildings -> on -> object 13 : horizon.\nobject 6 : buildings -> on -> object 13 : horizon.\nobject 4 : buildings -> on -> object 13 : horizon.\nobject 7 : buildings -> on -> object 13 : horizon.\nobject 5 : buildings -> on -> object 13 : horizon.\nobject 14 : mast -> on -> object 2 : boat.\nobject 9 : chain-link fence -> near -> object 29 : water way.\nobject 17 : section -> of -> object 9 : chain-link fence.\n\nRegion Description:\nRegion Description at [0.020, 0.187, 0.972, 0.963] : boats and surfboards behind wire fencing.\nRegion Description at [0.000, 0.160, 0.990, 0.349] : trees and buildings on other side of water.\nRegion Description at [0.340, 0.493, 0.852, 0.613] : white covering pulled over top of boat.\nRegion Description at [0.010, 0.667, 0.516, 0.995] : green bushes beside the chain link fence.\nRegion Description at [0.018, 0.213, 0.992, 0.995] : Black chain link fence enclosing boats..\nRegion Description at [0.242, 0.211, 0.302, 0.989] : Black fence pole holding chain link fence..\nRegion Description at [0.374, 0.499, 0.804, 0.803] : Yellow and white boat with sail pole..\nRegion Description at [0.014, 0.181, 0.998, 0.296] : Skyline of gray buildings in the background..\nRegion Description at [0.000, 0.664, 0.994, 0.976] : Green shrubs growing along side of a lake..\nRegion Description at [0.774, 0.216, 0.996, 0.944] : Boat parts on an outdoor shelving unit..\nRegion Description at [0.006, 0.013, 0.150, 0.285] : Sail masks with no flag attached to them..\n\nGlobal 
Caption:\nBoats docked on land sitting side by side next to a lake.\nA small harbor with boats docked and on racks\nA collection of boats behind a fence by a body of water.\nBoats and surfboards docked at a harbor bay.\n\nMany boats as seen through a chain link fence."}
{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm at [0.000, 0.125, 0.609, 0.988].\nObject 1 : bathroom tile at [0.009, 0.008, 0.994, 0.446].\nObject 2 : blue jeans at [0.369, 0.558, 0.722, 0.979].\nObject 3 : brush at [0.681, 0.208, 0.878, 0.500].\nObject 4 : brush holder at [0.716, 0.279, 0.891, 0.554].\nObject 5 : button at [0.519, 0.113, 0.584, 0.171].\nObject 6 : flusher at [0.534, 0.092, 0.628, 0.300].\nObject 7 : hand at [0.281, 0.125, 0.603, 0.562].\nObject 8 : holder at [0.713, 0.283, 0.903, 0.558].\nObject 9 : lid at [0.028, 0.046, 0.694, 0.446].\nObject 10 : man at [0.000, 0.133, 0.600, 0.992].\nObject 11 : seat at [0.138, 0.583, 0.722, 0.992].\nObject 12 : tank at [0.019, 0.021, 0.706, 0.579].\nObject 13 : tile at [0.794, 0.000, 1.000, 0.200].\nObject 14 : tile at [0.000, 0.000, 0.278, 0.129].\nObject 15 : toilet at [0.016, 0.042, 0.719, 0.996].\nObject 16 : toilet scrubber at [0.744, 0.192, 0.844, 0.521].\nObject 17 : toilet seat at [0.103, 0.517, 0.728, 0.996].\nObject 18 : wall at [0.659, 0.000, 0.978, 0.392].\nObject 19 : water at [0.369, 0.738, 0.500, 0.921].\n\nRelationships:\nobject 15 : toilet -> has -> object 11 : seat.\nobject 4 : brush holder -> by -> object 15 : toilet.\nobject 19 : water -> in -> object 15 : toilet.\nobject 6 : flusher -> on -> object 15 : toilet.\nobject 9 : lid -> on -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> has -> object 7 : hand.\nobject 0 : arm -> on -> object 15 : toilet.\nobject 14 : tile -> on -> object 18 : wall.\n\nRegion Description:\nRegion Description at [0.000, 0.046, 0.716, 0.987] : the arm reaching for the white toilet bowl.\nRegion Description at [0.716, 0.192, 0.894, 0.550] : the container and the toilet brush cleaner.\nRegion Description at [0.009, 0.042, 0.894, 0.992] : the toilet bowl next to the toilet bowl cleaner.\nRegion Description at [0.534, 0.087, 0.666, 0.329] : The hand is on the flusher in the image .\nRegion Description at [0.053, 0.158, 0.903, 0.875] : Porcelain toilet with flusher on top of the lid .\nRegion Description at [0.094, 0.154, 0.856, 0.942] : Man flushing the toilet in the bathroom .\n\nGlobal Caption:\nA hand is reaching out to the top if a toilet. \nA person flushing a toilet with a motion sensor.\nA person's hand flushing a toilet with a button on top of the tank. \na persons hand reaching for the top of a toilet\nA hand is reaching over a white toilet."}
{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blueberry at [0.306, 0.312, 0.400, 0.429].\nObject 1 : butter at [0.454, 0.024, 0.638, 0.288].\nObject 2 : cake at [0.238, 0.093, 0.786, 0.787].\nObject 3 : cup at [0.002, 0.000, 0.202, 0.667].\nObject 4 : cup at [0.140, 0.008, 0.336, 0.456].\nObject 5 : egg at [0.636, 0.125, 0.880, 0.267].\nObject 6 : food at [0.632, 0.123, 0.996, 0.336].\nObject 7 : lemon at [0.514, 0.728, 0.798, 0.997].\nObject 8 : melon at [0.308, 0.768, 0.658, 0.997].\nObject 9 : orange at [0.514, 0.733, 0.794, 0.997].\nObject 10 : parsley at [0.372, 0.515, 0.762, 0.965].\nObject 11 : plate at [0.166, 0.453, 1.000, 1.000].\nObject 12 : plate at [0.628, 0.120, 0.998, 0.389].\nObject 13 : sausage at [0.766, 0.248, 0.984, 0.333].\nObject 14 : spot at [0.766, 0.600, 0.790, 0.637].\nObject 15 : table at [0.002, 0.365, 0.998, 0.997].\nObject 16 : water at [0.000, 0.000, 0.202, 0.667].\n\nRelationships:\nobject 7 : lemon -> on -> object 11 : plate.\nobject 10 : parsley -> on -> object 11 : plate.\nobject 6 : food -> on -> object 12 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 11 : plate -> has -> object 14 : spot.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 13 : sausage -> on -> object 12 : plate.\nobject 0 : blueberry -> on -> object 2 : cake.\nobject 5 : egg -> on -> object 12 : plate.\nobject 8 : melon -> on -> object 11 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 2 : cake -> on -> object 11 : plate.\nobject 16 : water -> in -> object 3 : cup.\nobject 13 : sausage -> on -> object 12 : plate.\n\nRegion Description:\nRegion Description at [0.678, 0.104, 0.942, 0.424] : There is food on the plate in the back.\nRegion Description at [0.456, 0.013, 0.636, 0.307] : White frosting on top of a piece of cake.\nRegion Description at [0.322, 0.752, 0.650, 0.997] : square of honey dew on a white plate.\n\nGlobal Caption:\nA bluebery cake is on a plate and is topped with butter.\nA piece of cake with butter on it sits next to an orange slice. \nA large piece of blueberry cake on a plate.\nA plate of food attractively arranged on a table.\nA plate of blueberry coffee cake with butter and an orange slice on a table with breakfast foods."}
{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : napkins at [0.541, 0.818, 0.601, 0.858].\nObject 1 : pepper at [0.598, 0.836, 0.623, 0.860].\nObject 2 : post at [0.673, 0.494, 0.712, 0.926].\nObject 3 : restaurant sign at [0.548, 0.180, 0.779, 0.344].\nObject 4 : salt at [0.619, 0.838, 0.633, 0.850].\nObject 5 : shaker at [0.594, 0.822, 0.619, 0.854].\nObject 6 : shaker at [0.612, 0.824, 0.637, 0.854].\nObject 7 : table at [0.448, 0.834, 0.925, 0.998].\n\nRelationships:\nobject 4 : salt -> in -> object 6 : shaker.\nobject 0 : napkins -> on -> object 7 : table.\nobject 3 : restaurant sign -> on -> object 2 : post.\n\nRegion Description:\n\nGlobal Caption:\nThis is an empty table at a restaurant with ships in the background.\nThis table is covered by a blue Sam Adams umbrella\nAdvertising sign above a patio umbrella on sunny day.\nA lamp post stands next to an umbrella and table.\nAn umbrella is opened over an outdoor table."}
{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm rest at [0.674, 0.486, 0.722, 0.560].\nObject 1 : bench at [0.000, 0.324, 0.731, 0.994].\nObject 2 : bricks at [0.075, 0.850, 0.180, 0.882].\nObject 3 : building at [0.090, 0.000, 0.686, 0.094].\nObject 4 : children at [0.470, 0.302, 0.539, 0.360].\nObject 5 : coat at [0.473, 0.322, 0.542, 0.364].\nObject 6 : daytime at [0.000, 0.002, 0.997, 1.000].\nObject 7 : fence at [0.719, 0.310, 0.997, 0.372].\nObject 8 : grass at [0.000, 0.364, 0.997, 0.720].\nObject 9 : jacket at [0.012, 0.424, 0.485, 0.690].\nObject 10 : jeans at [0.165, 0.748, 0.293, 0.844].\nObject 11 : leg at [0.168, 0.750, 0.308, 0.844].\nObject 12 : people at [0.386, 0.438, 0.449, 0.504].\nObject 13 : purse at [0.458, 0.488, 0.605, 0.694].\nObject 14 : shoe at [0.192, 0.836, 0.305, 0.890].\nObject 15 : strap at [0.677, 0.470, 0.814, 0.584].\nObject 16 : trees at [0.554, 0.000, 0.997, 0.376].\nObject 17 : woman at [0.009, 0.194, 0.497, 0.888].\n\nRelationships:\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 13 : purse -> has a -> object 15 : strap.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 3 : building -> behind -> object 16 : trees.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 9 : jacket -> on -> object 17 : woman.\nobject 12 : people -> near -> object 16 : trees.\nobject 17 : woman -> has a -> object 11 : leg.\nobject 1 : bench -> has an -> object 0 : arm rest.\nobject 15 : strap -> from -> object 13 : purse.\nobject 2 : bricks -> near -> object 1 : bench.\nobject 16 : trees -> in -> object 6 : daytime.\nobject 7 : fence -> under -> object 16 : trees.\nobject 12 : people -> in front of -> object 7 : fence.\nobject 13 : purse -> on -> object 1 : bench.\nobject 14 : shoe -> on -> object 2 : bricks.\n\nRegion Description:\nRegion Description at [0.096, 0.006, 0.662, 0.074] : Building with brown and white facade.\nRegion Description at [0.374, 0.298, 0.542, 0.360] : two people walking in front of woman.\n\nGlobal Caption:\nA woman sitting on top of a wooden bench near a park.\nA person sits on a wooden bench facing blooming trees.\nA woman sitting on a wooden bench viewing some beautiful trees.\nAdult sitting on wooden park bench in large open space.\nA woman sits on a bench watching the park."}
{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : advertisement at [0.654, 0.400, 0.852, 0.555].\nObject 1 : area at [0.118, 0.379, 0.862, 0.787].\nObject 2 : back wheel at [0.432, 0.677, 0.490, 0.803].\nObject 3 : background at [0.004, 0.005, 0.998, 0.496].\nObject 4 : bottom at [0.488, 0.781, 0.526, 0.843].\nObject 5 : bus at [0.100, 0.376, 0.866, 0.805].\nObject 6 : door at [0.120, 0.456, 0.178, 0.696].\nObject 7 : front wheel at [0.172, 0.643, 0.204, 0.728].\nObject 8 : houses at [0.880, 0.344, 0.998, 0.483].\nObject 9 : light pole at [0.482, 0.005, 0.532, 0.840].\nObject 10 : list at [0.218, 0.395, 0.608, 0.461].\nObject 11 : message at [0.626, 0.184, 0.822, 0.328].\nObject 12 : name at [0.288, 0.629, 0.420, 0.731].\nObject 13 : person at [0.858, 0.560, 0.888, 0.715].\nObject 14 : pole at [0.680, 0.325, 0.708, 0.941].\nObject 15 : railing at [0.854, 0.589, 1.000, 0.704].\nObject 16 : sidewalk at [0.002, 0.688, 0.998, 1.000].\nObject 17 : sign at [0.578, 0.181, 0.826, 0.341].\nObject 18 : street at [0.000, 0.587, 0.998, 0.931].\nObject 19 : structure at [0.238, 0.293, 0.398, 0.424].\nObject 20 : symbol at [0.732, 0.427, 0.786, 0.469].\nObject 21 : tail lights at [0.812, 0.653, 0.860, 0.712].\nObject 22 : window at [0.342, 0.419, 0.424, 0.619].\nObject 23 : windows at [0.516, 0.392, 0.634, 0.627].\n\nRelationships:\nobject 10 : list -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 6 : door -> on -> object 5 : bus.\nobject 2 : back wheel -> of -> object 5 : bus.\nobject 7 : front wheel -> of -> object 5 : bus.\nobject 17 : sign -> on -> object 16 : sidewalk.\nobject 5 : bus -> on -> object 18 : street.\nobject 1 : area -> in -> object 3 : background.\nobject 8 : houses -> near -> object 5 : bus.\nobject 10 : list -> on -> object 5 : bus.\nobject 21 : tail lights -> on -> object 5 : bus.\nobject 5 : bus -> on -> object 18 : street.\nobject 4 : bottom -> of -> object 9 : light pole.\nobject 13 : person -> walking by -> object 5 : bus.\nobject 2 : back wheel -> on -> object 5 : bus.\nobject 17 : sign -> on -> object 18 : street.\nobject 22 : window -> of -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 8 : houses -> in -> object 3 : background.\nobject 14 : pole -> holding up -> object 17 : sign.\nobject 6 : door -> to -> object 5 : bus.\nobject 19 : structure -> in -> object 3 : background.\nobject 13 : person -> walking down -> object 16 : sidewalk.\nobject 15 : railing -> along -> object 16 : sidewalk.\nobject 17 : sign -> with -> object 11 : message.\nobject 6 : door -> of -> object 5 : bus.\nobject 14 : pole -> by -> object 18 : street.\nobject 14 : pole -> by -> object 5 : bus.\nobject 1 : area -> by -> object 16 : sidewalk.\nobject 13 : person -> walking across -> object 18 : street.\nobject 17 : sign -> attached to -> object 14 : pole.\nobject 11 : message -> on -> object 17 : sign.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 12 : name -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 0 : advertisement -> on -> object 5 : bus.\nobject 23 : windows -> on -> object 5 : bus.\nobject 6 : door -> on -> object 5 : bus.\nobject 5 : bus -> on -> object 18 : street.\n\nRegion Description:\nRegion Description at [0.576, 0.163, 0.838, 0.341] : street sign that reads All directions.\nRegion Description at [0.114, 0.323, 0.164, 0.448] : yellow and red structure in background.\nRegion Description 
at [0.580, 0.179, 0.838, 0.333] : a sign implying zero degrees equals 360 degrees.\n\nGlobal Caption:\na green and white bus is on the street\na public transit bus on a city street\nthe signs states all directions and points up\nAn empty city bus travels down a city street.\nA green and blue bus driving down a street."}
{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : eye at [0.566, 0.526, 0.592, 0.565].\nObject 1 : grass at [0.004, 0.808, 0.118, 0.991].\nObject 2 : grass at [0.206, 0.853, 0.356, 0.982].\nObject 3 : leg at [0.232, 0.375, 0.312, 0.805].\nObject 4 : plant at [0.500, 0.736, 0.618, 0.796].\nObject 5 : sitck at [0.746, 0.042, 0.912, 0.339].\nObject 6 : zebra at [0.000, 0.000, 0.622, 0.790].\nObject 7 : zebra at [0.002, 0.000, 0.684, 0.682].\n\nRelationships:\nobject 7 : zebra -> eating -> object 4 : plant.\nobject 6 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> standing in -> object 1 : grass.\nobject 7 : zebra -> grazing in -> object 1 : grass.\nobject 6 : zebra -> grazing in -> object 1 : grass.\n\nRegion Description:\nRegion Description at [0.352, 0.093, 0.602, 0.393] : thin line of hair running down the neck.\n\nGlobal Caption:\nA pair of zebra's leaning over eating grass in a field.\nTwo zebra stand near bushes and tall grass.\nTwo zebras grazing from grass next to a tree.\nTwo zebra standing next to each other on a lush green field.\nTwo zebras are feeding on the grass by themselves."}
{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bar at [0.444, 0.622, 0.640, 0.688].\nObject 1 : boots at [0.328, 0.339, 0.416, 0.492].\nObject 2 : bridal at [0.474, 0.246, 0.678, 0.432].\nObject 3 : food at [0.416, 0.646, 0.466, 0.715].\nObject 4 : foot at [0.324, 0.402, 0.380, 0.492].\nObject 5 : girl at [0.320, 0.078, 0.552, 0.502].\nObject 6 : grass at [0.012, 0.694, 0.998, 0.994].\nObject 7 : ground at [0.004, 0.679, 0.996, 0.913].\nObject 8 : helmet at [0.484, 0.096, 0.560, 0.162].\nObject 9 : hoof at [0.120, 0.853, 0.170, 0.925].\nObject 10 : horse at [0.024, 0.210, 0.690, 0.949].\nObject 11 : legs at [0.478, 0.453, 0.598, 0.637].\nObject 12 : legs at [0.130, 0.583, 0.278, 0.925].\nObject 13 : mane at [0.484, 0.186, 0.648, 0.279].\nObject 14 : person at [0.568, 0.568, 0.604, 0.640].\nObject 15 : poles at [0.460, 0.814, 0.538, 0.955].\nObject 16 : shirt at [0.580, 0.586, 0.594, 0.622].\nObject 17 : shirt at [0.388, 0.150, 0.508, 0.279].\nObject 18 : tail at [0.044, 0.357, 0.222, 0.784].\nObject 19 : tree at [0.720, 0.057, 0.874, 0.568].\nObject 20 : tree at [0.220, 0.000, 0.456, 0.586].\nObject 21 : trees at [0.730, 0.003, 0.986, 0.628].\nObject 22 : wall at [0.188, 0.276, 0.254, 0.393].\nObject 23 : water at [0.028, 0.468, 0.134, 0.574].\n\nRelationships:\nobject 5 : girl -> has -> object 1 : boots.\nobject 6 : grass -> under -> object 10 : horse.\nobject 21 : trees -> behind -> object 10 : horse.\nobject 10 : horse -> jumping -> object 15 : poles.\nobject 11 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 14 : person -> in -> object 16 : shirt.\nobject 10 : horse -> has -> object 9 : hoof.\n\nRegion Description:\n\nGlobal Caption:\nA young person ridding a horse jumps a gate in a competition.\nA man riding on a horse as it jumps over a pole. \nA woman is riding a horse as it jumps over a bar.\nthere is a woman jockey riding a hose over the hurdle\nA woman riding a horse jumps over an obstacle."}
{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : ball at [0.640, 0.399, 0.648, 0.411].\nObject 1 : border at [0.040, 0.502, 1.000, 0.556].\nObject 2 : boundary lines at [0.030, 0.661, 1.000, 1.000].\nObject 3 : bushes at [0.020, 0.186, 0.104, 0.517].\nObject 4 : fence at [0.008, 0.366, 0.994, 0.565].\nObject 5 : fence at [0.024, 0.502, 0.996, 0.709].\nObject 6 : grass at [0.004, 0.529, 0.994, 0.997].\nObject 7 : man at [0.144, 0.360, 0.246, 0.736].\nObject 8 : man at [0.730, 0.474, 0.780, 0.613].\nObject 9 : pants at [0.732, 0.529, 0.778, 0.604].\nObject 10 : shirt at [0.164, 0.411, 0.222, 0.547].\nObject 11 : shorts at [0.162, 0.535, 0.220, 0.628].\nObject 12 : sign at [0.916, 0.405, 0.934, 0.438].\nObject 13 : sky at [0.006, 0.021, 0.990, 0.279].\nObject 14 : sneakers at [0.180, 0.709, 0.216, 0.739].\nObject 15 : sneakers at [0.762, 0.598, 0.776, 0.613].\nObject 16 : tennis at [0.012, 0.384, 0.984, 0.934].\nObject 17 : tennis court at [0.000, 0.372, 0.988, 0.979].\nObject 18 : tennis racket at [0.768, 0.526, 0.808, 0.556].\nObject 19 : tennis racket at [0.214, 0.574, 0.238, 0.619].\nObject 20 : trees at [0.586, 0.282, 0.692, 0.420].\nObject 21 : white at [0.734, 0.492, 0.778, 0.601].\n\nRelationships:\nobject 7 : man -> in -> object 10 : shirt.\nobject 7 : man -> with -> object 19 : tennis racket.\nobject 7 : man -> plays -> object 16 : tennis.\nobject 7 : man -> wears -> object 14 : sneakers.\nobject 8 : man -> wears -> object 15 : sneakers.\nobject 7 : man -> wears -> object 11 : shorts.\nobject 8 : man -> wears -> object 9 : pants.\nobject 5 : fence -> has -> object 1 : border.\nobject 20 : trees -> behind -> object 3 : bushes.\nobject 2 : boundary lines -> on -> object 17 : tennis court.\nobject 2 : boundary lines -> on -> object 6 : grass.\nobject 3 : bushes -> behind -> object 4 : fence.\nobject 20 : trees -> behind -> object 4 : fence.\nobject 7 : man -> has -> object 19 : tennis racket.\nobject 8 : man -> wears -> object 21 : white.\nobject 4 : fence -> around -> object 17 : tennis court.\nobject 20 : trees -> behind -> object 8 : man.\nobject 6 : grass -> on -> object 17 : tennis court.\nobject 8 : man -> has -> object 18 : tennis racket.\nobject 8 : man -> hitting -> object 0 : ball.\nobject 5 : fence -> on -> object 17 : tennis court.\n\nRegion Description:\nRegion Description at [0.024, 0.489, 0.998, 0.730] : The tennis net separating the sides of the players..\nRegion Description at [0.144, 0.652, 0.234, 0.745] : The black sneakers the player is wearing..\nRegion Description at [0.720, 0.577, 0.784, 0.613] : The white sneakers the player is wearing..\nRegion Description at [0.158, 0.544, 0.230, 0.628] : The gray shorts the player is wearing..\nRegion Description at [0.006, 0.402, 0.998, 0.574] : The trimmed bushes behind the player..\nRegion Description at [0.008, 0.168, 0.998, 0.402] : The trees behind the trimmed bushes behind the player..\nRegion Description at [0.006, 0.604, 0.998, 0.985] : The white boundary lines on the tennis court..\nRegion Description at [0.020, 0.447, 0.994, 0.760] : A black and white net stretches across the field.\nRegion Description at [0.060, 0.526, 0.984, 0.985] : The field has green grass with white lines.\nRegion Description at [0.016, 0.369, 0.978, 0.595] : A tall green shrub is behind the fence.\nRegion Description at [0.034, 0.150, 0.984, 0.393] : Trees are seen behind the fence and shrub.\nRegion Description at [0.588, 0.327, 0.850, 0.703] : The yellow ball is flying towards the 
man.\nRegion Description at [0.902, 0.378, 0.956, 0.529] : A black circular sign with the number five.\nRegion Description at [0.142, 0.354, 0.248, 0.736] : male in white t-shirt playing tennis.\nRegion Description at [0.200, 0.565, 0.244, 0.625] : Head of tennis racket of man playing.\nRegion Description at [0.726, 0.465, 0.786, 0.631] : Man in white preparing to hit tennis ball.\n\nGlobal Caption:\nTwo men playing a game of tennis on a court.\ntwo people playing tennis with rackets on a grass court\nTwo young men playing a game of tennis.\nPeople playing tennis on a court surrounded by green hedges.\ntHERE ARE TWO MEN PLAYING TENNIS ON THE TENNIS COURT"}
{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : airline at [0.408, 0.420, 0.758, 0.502].\nObject 1 : airplane at [0.112, 0.300, 0.858, 0.640].\nObject 2 : engine at [0.652, 0.529, 0.730, 0.592].\nObject 3 : engine at [0.494, 0.502, 0.574, 0.577].\nObject 4 : fin at [0.208, 0.303, 0.320, 0.492].\nObject 5 : fin at [0.116, 0.480, 0.284, 0.526].\nObject 6 : front door at [0.752, 0.435, 0.772, 0.483].\nObject 7 : gear at [0.450, 0.592, 0.600, 0.643].\nObject 8 : letters at [0.694, 0.489, 0.732, 0.520].\nObject 9 : name at [0.398, 0.426, 0.760, 0.489].\nObject 10 : sky at [0.000, 0.000, 0.998, 1.000].\nObject 11 : window at [0.806, 0.438, 0.844, 0.456].\nObject 12 : windows at [0.326, 0.450, 0.750, 0.532].\nObject 13 : wing at [0.152, 0.426, 0.598, 0.538].\nObject 14 : wing at [0.116, 0.492, 0.282, 0.538].\n\nRelationships:\nobject 6 : front door -> of -> object 1 : airplane.\n\nRegion Description:\n\nGlobal Caption:\nAn airplane flying in the air during the day.\nA large aircraft is shown in the air.\nThe large jumbo jet has it's landing gear lowered.\nA large white airplane flies in the gray sky.\nAn airplane in route with a cloudy sky behind it."}
{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : awning at [0.886, 0.000, 1.000, 0.240].\nObject 1 : awning at [0.000, 0.299, 0.132, 0.403].\nObject 2 : bench at [0.000, 0.592, 0.066, 0.683].\nObject 3 : building at [0.000, 0.299, 0.204, 0.659].\nObject 4 : canopy at [0.000, 0.301, 0.136, 0.400].\nObject 5 : car at [0.290, 0.400, 0.998, 0.784].\nObject 6 : clouds at [0.374, 0.067, 0.920, 0.312].\nObject 7 : door opening at [0.658, 0.501, 0.682, 0.680].\nObject 8 : door opening at [0.678, 0.509, 0.710, 0.675].\nObject 9 : exterior at [0.000, 0.400, 0.200, 0.669].\nObject 10 : front at [0.294, 0.400, 0.494, 0.739].\nObject 11 : gravel at [0.090, 0.837, 0.334, 0.997].\nObject 12 : headlights at [0.416, 0.624, 0.446, 0.656].\nObject 13 : headlights at [0.300, 0.624, 0.324, 0.651].\nObject 14 : markings at [0.606, 0.821, 0.770, 0.928].\nObject 15 : panel at [0.304, 0.421, 0.450, 0.677].\nObject 16 : pole at [0.030, 0.419, 0.062, 0.656].\nObject 17 : railway tracks at [0.000, 0.752, 0.520, 0.944].\nObject 18 : side walk at [0.192, 0.712, 1.000, 0.997].\nObject 19 : sky at [0.000, 0.000, 0.998, 0.560].\nObject 20 : train stop at [0.000, 0.000, 1.000, 1.000].\nObject 21 : trees at [0.208, 0.253, 0.322, 0.653].\nObject 22 : trim at [0.000, 0.333, 0.132, 0.403].\nObject 23 : wall at [0.000, 0.392, 0.206, 0.611].\nObject 24 : wheel at [0.844, 0.669, 0.884, 0.728].\nObject 25 : wheel at [0.792, 0.675, 0.840, 0.747].\nObject 26 : wheel at [0.516, 0.691, 0.620, 0.808].\nObject 27 : window at [0.316, 0.451, 0.458, 0.595].\nObject 28 : windows at [0.700, 0.547, 0.848, 0.632].\nObject 29 : windsheild wipers at [0.348, 0.499, 0.410, 0.584].\n\nRelationships:\nobject 6 : clouds -> in -> object 19 : sky.\nobject 2 : bench -> in -> object 4 : canopy.\nobject 22 : trim -> on -> object 1 : awning.\nobject 11 : gravel -> next to -> object 17 : railway tracks.\nobject 14 : markings -> on side of -> object 18 : side walk.\nobject 5 : car -> on -> object 17 : railway tracks.\n\nRegion Description:\nRegion Description at [0.288, 0.392, 0.510, 0.741] : the front of the train is yellow and white.\nRegion Description at [0.320, 0.451, 0.460, 0.592] : the front window of the train has windshield wipers.\nRegion Description at [0.292, 0.592, 0.456, 0.739] : the headlights are on front of the train.\nRegion Description at [0.010, 0.405, 0.220, 0.736] : a red brick wall is near the platform.\nRegion Description at [0.000, 0.288, 0.128, 0.707] : an aluminum canopy is on the platform.\nRegion Description at [0.016, 0.325, 0.100, 0.672] : a red steel pole is holding up the awning.\nRegion Description at [0.306, 0.395, 0.998, 0.733] : the train has windowed passenger cars.\nRegion Description at [0.300, 0.427, 0.492, 0.693] : the yellow and white front of a train.\nRegion Description at [0.510, 0.744, 0.834, 0.891] : white painted line beside a train track.\nRegion Description at [0.298, 0.408, 0.468, 0.661] : a yellow panel on the front of the train.\nRegion Description at [0.002, 0.397, 0.210, 0.675] : a red brick building on the side of the tracks.\nRegion Description at [0.844, 0.000, 0.998, 0.248] : an awning of a structure next to the train tracks.\nRegion Description at [0.294, 0.360, 0.516, 0.787] : front of a train car in yellow, white and blue.\nRegion Description at [0.194, 0.221, 0.286, 0.901] : trees on the side of a train station.\nRegion Description at [0.580, 0.821, 0.764, 0.931] : markings on the side of railway tracks.\nRegion Description at [0.632, 
0.491, 0.726, 0.691] : white, blue and grey doors on the side of a train car.\nRegion Description at [0.500, 0.096, 0.916, 0.531] : skyline on the side of a train station.\n\nGlobal Caption:\nFast commuter train moving past an outdoor platform.\nA train on the track pulling by a train station.\nA train pulling into a station outside during the day.\nA passenger train moving through a rail yard\na long passenger train pulling up to a station"}
{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : coat at [0.405, 0.332, 0.835, 0.746].\nObject 1 : decorative square at [0.000, 0.382, 1.000, 1.000].\nObject 2 : hat at [0.006, 0.162, 0.072, 0.198].\nObject 3 : jacket at [0.078, 0.222, 0.318, 0.430].\nObject 4 : jeans at [0.853, 0.422, 1.000, 0.632].\nObject 5 : leg at [0.853, 0.456, 0.928, 0.610].\nObject 6 : leg at [0.210, 0.458, 0.303, 0.638].\nObject 7 : leg at [0.000, 0.458, 0.060, 0.630].\nObject 8 : man at [0.066, 0.162, 0.318, 0.686].\nObject 9 : man at [0.850, 0.156, 1.000, 0.652].\nObject 10 : man at [0.390, 0.344, 0.838, 0.894].\nObject 11 : pants at [0.523, 0.736, 0.739, 0.858].\nObject 12 : person at [0.000, 0.162, 0.135, 0.668].\nObject 13 : person at [0.853, 0.154, 1.000, 0.650].\nObject 14 : section at [0.000, 0.134, 1.000, 1.000].\nObject 15 : sidewalk at [0.000, 0.388, 1.000, 1.000].\nObject 16 : umbrella at [0.168, 0.106, 0.910, 0.366].\nObject 17 : uniform at [0.000, 0.222, 0.126, 0.646].\nObject 18 : uniform at [0.105, 0.218, 0.318, 0.628].\n\nRelationships:\nobject 10 : man -> wearing -> object 11 : pants.\nobject 10 : man -> wearing -> object 0 : coat.\nobject 9 : man -> wearing -> object 4 : jeans.\nobject 8 : man -> wearing -> object 2 : hat.\nobject 8 : man -> wearing -> object 3 : jacket.\nobject 16 : umbrella -> has -> object 14 : section.\nobject 5 : leg -> of -> object 13 : person.\nobject 7 : leg -> of -> object 12 : person.\nobject 12 : person -> in -> object 17 : uniform.\n\nRegion Description:\nRegion Description at [0.066, 0.164, 0.318, 0.686] : the back of a man in a black uniform.\nRegion Description at [0.393, 0.324, 0.871, 0.766] : THIS MAN IS WEARING A LONG BLACK COAT.\nRegion Description at [0.468, 0.142, 0.634, 0.356] : THIS IS A RED SECTION ON THE UMBRELLA.\nRegion Description at [0.168, 0.140, 0.523, 0.292] : THIS IS A YELLOW SECTION ON THE UMBRELLA.\nRegion Description at [0.568, 0.138, 0.919, 0.232] : THIS IS A GREEN SECTION OF THE UMBRELLA.\n\nGlobal Caption:\nSeveral people walking on a sidewalk, with one man holding an umbrella.\nA person walking while carrying a rainbow umbrella\nA person is holding up a large colorful umbrella\na person walking down the street carrying a rainbow colored umbrella\nA person walking in a square carrying a rainbow colored umbrella."}
{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : back tire at [0.574, 0.496, 0.860, 0.800].\nObject 1 : bike at [0.146, 0.109, 0.938, 0.803].\nObject 2 : bike indicators at [0.238, 0.363, 0.264, 0.389].\nObject 3 : car at [0.000, 0.077, 0.086, 0.157].\nObject 4 : display at [0.240, 0.275, 0.290, 0.328].\nObject 5 : exhaust pipe at [0.460, 0.661, 0.818, 0.773].\nObject 6 : front tire at [0.146, 0.419, 0.366, 0.637].\nObject 7 : front wheel at [0.150, 0.424, 0.366, 0.635].\nObject 8 : garage door at [0.000, 0.000, 0.214, 0.341].\nObject 9 : handle at [0.284, 0.109, 0.390, 0.384].\nObject 10 : honda logo at [0.322, 0.395, 0.378, 0.419].\nObject 11 : house at [0.420, 0.000, 0.736, 0.149].\nObject 12 : leather seat at [0.496, 0.355, 0.792, 0.517].\nObject 13 : light at [0.894, 0.411, 0.944, 0.520].\nObject 14 : orange light at [0.280, 0.419, 0.296, 0.467].\nObject 15 : shock at [0.258, 0.477, 0.296, 0.568].\nObject 16 : shock absorber at [0.626, 0.501, 0.698, 0.680].\nObject 17 : shrubs at [0.628, 0.021, 0.764, 0.200].\nObject 18 : small windshield at [0.210, 0.120, 0.256, 0.291].\nObject 19 : sylencer at [0.462, 0.645, 0.816, 0.779].\nObject 20 : trees at [0.256, 0.003, 0.444, 0.205].\n\nRelationships:\nobject 1 : bike -> has -> object 7 : front wheel.\nobject 1 : bike -> has -> object 0 : back tire.\nobject 1 : bike -> has -> object 19 : sylencer.\nobject 1 : bike -> has -> object 16 : shock absorber.\nobject 1 : bike -> has -> object 13 : light.\nobject 9 : handle -> on -> object 1 : bike.\nobject 4 : display -> on -> object 1 : bike.\n\nRegion Description:\n\nGlobal Caption:\nA black Honda motorcycle parked in front of a garage.\nA Honda motorcycle parked in a grass driveway\nA black Honda motorcycle with a dark burgundy seat.\nMa motorcycle parked on the gravel in front of a garage\nA motorcycle with its brake extended standing outside"}
{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : background at [0.000, 0.000, 1.002, 0.997].\nObject 1 : bench at [0.604, 0.967, 0.672, 0.997].\nObject 2 : cow at [0.548, 0.860, 0.574, 0.896].\nObject 3 : cow at [0.436, 0.860, 0.454, 0.890].\nObject 4 : fence at [0.698, 0.949, 0.852, 0.997].\nObject 5 : moutain at [0.000, 0.057, 0.992, 0.782].\nObject 6 : pasture at [0.000, 0.815, 0.984, 1.000].\nObject 7 : peak at [0.744, 0.042, 0.898, 0.119].\nObject 8 : sky at [0.000, 0.000, 1.002, 0.257].\nObject 9 : snow at [0.210, 0.036, 0.962, 0.445].\nObject 10 : trailer at [0.796, 0.910, 0.894, 0.997].\nObject 11 : trailer at [0.632, 0.899, 0.742, 0.994].\nObject 12 : tree at [0.740, 0.409, 1.000, 0.982].\nObject 13 : tree at [0.638, 0.284, 0.652, 0.301].\n\nRelationships:\nobject 11 : trailer -> in -> object 6 : pasture.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 6 : pasture -> near -> object 5 : moutain.\nobject 3 : cow -> in -> object 6 : pasture.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 9 : snow -> on -> object 5 : moutain.\nobject 5 : moutain -> covered in -> object 9 : snow.\nobject 5 : moutain -> has -> object 7 : peak.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 5 : moutain -> in -> object 0 : background.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 11 : trailer -> near -> object 12 : tree.\nobject 5 : moutain -> has -> object 13 : tree.\nobject 7 : peak -> covered with -> object 9 : snow.\n\nRegion Description:\nRegion Description at [0.784, 0.901, 0.934, 0.991] : storage container for animal equipment.\nRegion Description at [0.828, 0.060, 0.880, 0.125] : The mountain is partially covered in snow..\nRegion Description at [0.840, 0.899, 0.920, 0.997] : horse trailer or cow trailer is silvertone, rectangular.\nRegion Description at [0.606, 0.919, 0.640, 0.982] : smaller trailer, white w/ brown+orange stripe.\nRegion Description at [0.060, 0.472, 0.540, 0.806] : a bare patch of earth amid lush green growth.\nRegion Description at [0.034, 0.839, 0.812, 0.973] : tiny cattle-containing fenceposts in the distance.\nRegion Description at [0.902, 0.827, 0.990, 0.997] : a split tree trunk in shadow, beneath leaves, shadow on ground.\nRegion Description at [0.734, 0.919, 0.802, 0.994] : an older station wagon/suv-type van thing.\nRegion Description at [0.090, 0.854, 0.124, 0.904] : a black & white animal stands alone, away from brown brethren, in the far distance.\n\nGlobal Caption:\nCows lounge in a field with a mountain backdrop.\nA VERY BIG MOUNTAIN AND ANIMALS SPREAD ACROSS A FARM.\nSeveral herd animals are on the grass by a mountain.\nCattle on a level pasture in a mountainous area.\nA bunch of cattle relax in a pasture located in the mountains"}
{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : animal at [0.358, 0.509, 0.434, 0.664].\nObject 1 : area at [0.464, 0.531, 0.934, 0.848].\nObject 2 : branches at [0.266, 0.107, 0.470, 0.272].\nObject 3 : bushes at [0.598, 0.424, 0.622, 0.445].\nObject 4 : grass at [0.544, 0.659, 0.840, 0.859].\nObject 5 : hill at [0.574, 0.323, 0.624, 0.376].\nObject 6 : leaves at [0.808, 0.293, 0.918, 0.360].\nObject 7 : license plate at [0.000, 0.691, 0.064, 0.747].\nObject 8 : light at [0.170, 0.557, 0.186, 0.632].\nObject 9 : park at [0.250, 0.192, 0.818, 0.664].\nObject 10 : road at [0.180, 0.709, 0.432, 0.957].\nObject 11 : sky at [0.448, 0.053, 0.532, 0.187].\nObject 12 : tire at [0.070, 0.728, 0.130, 0.795].\nObject 13 : tree at [0.000, 0.000, 0.478, 0.600].\nObject 14 : trees at [0.128, 0.000, 0.592, 0.597].\nObject 15 : truck at [0.000, 0.416, 0.210, 0.805].\nObject 16 : zebras at [0.730, 0.496, 0.796, 0.581].\n\nRelationships:\nobject 7 : license plate -> on -> object 15 : truck.\nobject 12 : tire -> on -> object 15 : truck.\nobject 5 : hill -> in -> object 9 : park.\nobject 0 : animal -> in -> object 1 : area.\nobject 13 : tree -> has -> object 6 : leaves.\nobject 0 : animal -> on -> object 1 : area.\nobject 15 : truck -> on -> object 10 : road.\nobject 10 : road -> with -> object 15 : truck.\nobject 3 : bushes -> on -> object 1 : area.\nobject 16 : zebras -> in -> object 1 : area.\nobject 2 : branches -> on -> object 13 : tree.\n\nRegion Description:\nRegion Description at [0.338, 0.480, 0.438, 0.680] : zebra watching in opposite direction.\n\nGlobal Caption:\nZebras are grazing on grass by a car.\nZebras are standing in a fenced in area.\nA herd of zebras stand under tress near a road. \nSeveral zebras are on the grass by a truck. \nA bunch of zebras grazing near a road where vehicles are driving by."}
{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : beach at [0.000, 0.720, 0.998, 1.000].\nObject 1 : boy at [0.792, 0.480, 0.938, 0.853].\nObject 2 : child at [0.322, 0.587, 0.376, 0.835].\nObject 3 : child at [0.320, 0.587, 0.374, 0.835].\nObject 4 : girl at [0.444, 0.539, 0.534, 0.856].\nObject 5 : man at [0.140, 0.443, 0.216, 0.845].\nObject 6 : man at [0.434, 0.459, 0.500, 0.760].\nObject 7 : man at [0.578, 0.459, 0.682, 0.845].\nObject 8 : ocean waters at [0.590, 0.419, 0.892, 0.629].\nObject 9 : people at [0.206, 0.456, 0.352, 0.851].\nObject 10 : person at [0.792, 0.480, 0.936, 0.851].\nObject 11 : shirt at [0.592, 0.496, 0.670, 0.629].\nObject 12 : shore at [0.000, 0.360, 0.998, 0.997].\nObject 13 : surfboard at [0.306, 0.709, 0.538, 0.853].\nObject 14 : surfboard at [0.790, 0.587, 0.960, 0.691].\nObject 15 : water at [0.384, 0.368, 0.544, 0.435].\nObject 16 : waves at [0.656, 0.709, 0.794, 0.779].\nObject 17 : wetsuit at [0.326, 0.629, 0.372, 0.773].\nObject 18 : woman at [0.208, 0.499, 0.304, 0.629].\n\nRelationships:\nobject 1 : boy -> holding -> object 14 : surfboard.\nobject 5 : man -> and -> object 18 : woman.\nobject 18 : woman -> and -> object 3 : child.\nobject 16 : waves -> coming to -> object 12 : shore.\nobject 7 : man -> looking down to -> object 15 : water.\nobject 2 : child -> with -> object 17 : wetsuit.\nobject 6 : man -> looking back to -> object 4 : girl.\nobject 4 : girl -> pulling -> object 13 : surfboard.\nobject 9 : people -> on -> object 0 : beach.\nobject 7 : man -> wearing -> object 11 : shirt.\n\nRegion Description:\nRegion Description at [0.096, 0.437, 0.970, 0.872] : Seven people headed to the water to surf..\nRegion Description at [0.390, 0.531, 0.540, 0.851] : Girl in yellow shirt and pony tail. .\nRegion Description at [0.312, 0.581, 0.374, 0.851] : Small child with red and black wetsuit..\nRegion Description at [0.578, 0.443, 0.688, 0.856] : Man with white shirt and grey wetsuit pants..\nRegion Description at [0.436, 0.440, 0.534, 0.872] : Man looking back to girl pulling surfboard..\nRegion Description at [0.444, 0.459, 0.552, 0.853] : A man and a little girl having a conversation.\nRegion Description at [0.104, 0.419, 0.314, 0.851] : A man and a woman walking toward the water.\n\nGlobal Caption:\nA group of people are taking surfing lessons.\nA group of men, women and children walking toward the water with surfboards.\nA mixed age group is going toward the ocean with surfboards.\nA group of surfers are carrying their surf boards into the ocean.\nSeveral people are getting ready to enter the water for surfing."}
{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : body at [0.166, 0.539, 0.296, 0.997].\nObject 1 : boot at [0.594, 0.753, 0.620, 0.870].\nObject 2 : boot at [0.620, 0.744, 0.658, 0.858].\nObject 3 : bucket at [0.268, 0.744, 0.322, 0.828].\nObject 4 : clouds at [0.156, 0.000, 0.968, 0.328].\nObject 5 : ear at [0.590, 0.226, 0.638, 0.410].\nObject 6 : ear at [0.368, 0.208, 0.448, 0.434].\nObject 7 : elephant at [0.328, 0.157, 0.638, 0.967].\nObject 8 : eye at [0.476, 0.319, 0.504, 0.346].\nObject 9 : foot at [0.436, 0.901, 0.516, 0.958].\nObject 10 : grass at [0.950, 0.759, 0.996, 0.807].\nObject 11 : leg at [0.498, 0.572, 0.548, 0.898].\nObject 12 : leg at [0.408, 0.512, 0.516, 0.955].\nObject 13 : man at [0.582, 0.476, 0.662, 0.870].\nObject 14 : man at [0.164, 0.455, 0.292, 0.997].\nObject 15 : mountains at [0.000, 0.265, 0.376, 0.470].\nObject 16 : rock at [0.736, 0.895, 0.762, 0.934].\nObject 17 : sand at [0.240, 0.687, 0.998, 1.000].\nObject 18 : shirt at [0.582, 0.521, 0.650, 0.681].\nObject 19 : shorts at [0.174, 0.699, 0.254, 0.864].\nObject 20 : side at [0.236, 0.675, 0.994, 0.997].\nObject 21 : skirt at [0.298, 0.687, 0.360, 0.810].\nObject 22 : sky at [0.004, 0.000, 0.998, 0.355].\nObject 23 : top at [0.302, 0.539, 0.358, 0.696].\nObject 24 : tree at [0.012, 0.407, 0.076, 0.500].\nObject 25 : trunk at [0.506, 0.392, 0.600, 0.964].\nObject 26 : watch at [0.172, 0.711, 0.192, 0.732].\nObject 27 : water at [0.000, 0.488, 0.994, 1.000].\nObject 28 : woman at [0.288, 0.473, 0.420, 0.967].\n\nRelationships:\nobject 7 : elephant -> on -> object 20 : side.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 14 : man -> standing on -> object 20 : side.\nobject 14 : man -> standing beside -> object 7 : elephant.\nobject 10 : grass -> on -> object 20 : side.\nobject 28 : woman -> wearing -> object 23 : top.\nobject 13 : man -> wearing -> object 18 : shirt.\nobject 13 : man -> wearing -> object 1 : boot.\nobject 13 : man -> wearing -> object 2 : boot.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 7 : elephant -> has -> object 25 : trunk.\nobject 14 : man -> wearing -> object 19 : shorts.\nobject 28 : woman -> petting -> object 7 : elephant.\nobject 14 : man -> with -> object 7 : elephant.\nobject 28 : woman -> with -> object 7 : elephant.\nobject 13 : man -> with -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 9 : foot -> of an -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 11 : leg -> of -> object 7 : elephant.\nobject 12 : leg -> of -> object 7 : elephant.\nobject 5 : ear -> of -> object 7 : elephant.\nobject 6 : ear -> of -> object 7 : elephant.\nobject 8 : eye -> of -> object 7 : elephant.\nobject 27 : water -> behind -> object 7 : elephant.\n\nRegion Description:\nRegion Description at [0.338, 0.139, 0.618, 0.967] : the elephant standing on the lake side.\nRegion Description at [0.154, 0.392, 0.300, 0.964] : a man standing on the lake side with shorts.\nRegion Description at [0.574, 0.422, 0.686, 0.910] : the man standing beside the elephant.\nRegion Description at [0.292, 0.485, 0.378, 0.705] : this lady is wearing a blue tank top.\nRegion Description at [0.722, 0.768, 0.988, 0.964] : the sand is brown with green grass growing in it.\nRegion Description at [0.156, 0.669, 0.270, 0.910] : the man is wearing grey black and white shorts.\nRegion Description at [0.504, 0.560, 0.568, 0.898] : 
The front right leg of the elephant..\nRegion Description at [0.310, 0.536, 0.358, 0.690] : The light blue tank top the girl is wearing..\nRegion Description at [0.262, 0.732, 0.326, 0.825] : The black bucket in the girl's hand..\nRegion Description at [0.002, 0.443, 0.992, 0.994] : The water behind the people and the elephant..\n\nGlobal Caption:\nA group of people are standing next to an elephant emerging from the water.\na group of people stand beside of a giant elephant \nThree tourists pose for a picture next to an elephant.\nThree people stand with an elephant in front of a stream.\nThree people standing next to an elephant along a river."}
{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : boat at [0.048, 0.552, 0.928, 0.819].\nObject 1 : building at [0.328, 0.493, 0.538, 0.613].\nObject 2 : building at [0.000, 0.467, 0.338, 0.651].\nObject 3 : building at [0.534, 0.096, 0.998, 0.637].\nObject 4 : canopies at [0.452, 0.504, 0.620, 0.600].\nObject 5 : container at [0.858, 0.643, 0.948, 0.712].\nObject 6 : dolphin at [0.282, 0.691, 0.344, 0.773].\nObject 7 : flag at [0.322, 0.563, 0.340, 0.597].\nObject 8 : ground at [0.822, 0.696, 0.880, 0.715].\nObject 9 : leaves at [0.002, 0.483, 0.080, 0.659].\nObject 10 : level at [0.000, 0.709, 1.000, 0.829].\nObject 11 : level at [0.068, 0.616, 0.852, 0.688].\nObject 12 : outdoor seating at [0.502, 0.579, 0.532, 0.624].\nObject 13 : pink writing at [0.414, 0.693, 0.654, 0.725].\nObject 14 : pole at [0.282, 0.416, 0.292, 0.515].\nObject 15 : railing at [0.094, 0.557, 0.728, 0.624].\nObject 16 : railing at [0.238, 0.597, 0.744, 0.627].\nObject 17 : reflection at [0.174, 0.808, 0.922, 0.848].\nObject 18 : roof at [0.000, 0.469, 0.280, 0.523].\nObject 19 : roof at [0.348, 0.509, 0.482, 0.568].\nObject 20 : roof at [0.920, 0.264, 0.980, 0.344].\nObject 21 : row at [0.700, 0.499, 0.878, 0.573].\nObject 22 : sea wall at [0.878, 0.712, 0.998, 0.819].\nObject 23 : shore at [0.000, 0.627, 0.996, 0.816].\nObject 24 : sky at [0.006, 0.000, 1.000, 0.517].\nObject 25 : steeple at [0.918, 0.088, 0.936, 0.237].\nObject 26 : symbol at [0.268, 0.688, 0.350, 0.779].\nObject 27 : symbol at [0.702, 0.693, 0.752, 0.725].\nObject 28 : tree at [0.472, 0.491, 0.592, 0.597].\nObject 29 : trees at [0.948, 0.573, 1.000, 0.691].\nObject 30 : trees at [0.000, 0.488, 0.080, 0.675].\nObject 31 : vehicle at [0.968, 0.653, 0.998, 0.693].\nObject 32 : water at [0.004, 0.813, 0.998, 0.992].\nObject 33 : water at [0.008, 0.717, 0.998, 0.981].\nObject 34 : window at [0.374, 0.733, 0.790, 0.765].\nObject 35 : window at [0.800, 0.491, 0.868, 0.576].\nObject 36 : window at [0.928, 0.512, 0.950, 0.576].\nObject 37 : window at [0.892, 0.395, 0.912, 0.443].\nObject 38 : window at [0.894, 0.517, 0.910, 0.571].\nObject 39 : window at [0.630, 0.493, 0.652, 0.565].\nObject 40 : windows at [0.384, 0.637, 0.724, 0.685].\n\nRelationships:\nobject 40 : windows -> on -> object 0 : boat.\nobject 17 : reflection -> in -> object 33 : water.\nobject 29 : trees -> growing on -> object 23 : shore.\nobject 30 : trees -> growing on -> object 23 : shore.\nobject 28 : tree -> growing on -> object 23 : shore.\nobject 18 : roof -> on -> object 2 : building.\nobject 5 : container -> on -> object 22 : sea wall.\nobject 0 : boat -> in -> object 32 : water.\nobject 0 : boat -> has -> object 15 : railing.\n\nRegion Description:\nRegion Description at [0.414, 0.691, 0.662, 0.725] : the are red letters on the side of the cruise ship.\nRegion Description at [0.370, 0.707, 0.780, 0.763] : there is a long set of black windows on the side of the cruise ship.\nRegion Description at [0.870, 0.243, 0.992, 0.357] : there is a red roof on this building.\nRegion Description at [0.538, 0.400, 0.712, 0.549] : there is red and gray building in the background.\nRegion Description at [0.054, 0.595, 0.312, 0.821] : there is two levels on this cruise ship.\nRegion Description at [0.370, 0.587, 0.664, 0.621] : there is a silver railing on the top level of the cruise ship.\nRegion Description at [0.858, 0.621, 0.952, 0.717] : there is a blue container on the dock.\nRegion Description at [0.876, 0.707, 0.996, 0.787] : 
there is a gray sea wall beside the ship.\nRegion Description at [0.268, 0.723, 0.346, 0.787] : there are blue water symbols on the side of the cruise ship.\nRegion Description at [0.000, 0.619, 0.024, 0.712] : there is a blue and white sign on the dock.\nRegion Description at [0.662, 0.533, 0.904, 0.603] : An outdoor canopy creates shade for customers. .\n\nGlobal Caption:\nA boat sits on the side of the dock.\nA large white boat in the open water.\nA white double decker boat n water next to buildings.\nA large cruise ship is traveling on the ocean. \nA Port River Dolphin Cruise ship sits in the water."}
{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : boutonniere at [0.710, 0.574, 0.799, 0.660].\nObject 1 : cake at [0.630, 0.670, 0.772, 0.750].\nObject 2 : cake crumb at [0.710, 0.348, 0.721, 0.356].\nObject 3 : crown at [0.370, 0.006, 0.549, 0.056].\nObject 4 : dress at [0.000, 0.574, 0.582, 1.000].\nObject 5 : eye at [0.649, 0.244, 0.699, 0.272].\nObject 6 : eye at [0.735, 0.264, 0.769, 0.280].\nObject 7 : eyebrow at [0.655, 0.230, 0.710, 0.250].\nObject 8 : eyebrow at [0.741, 0.252, 0.780, 0.264].\nObject 9 : finger at [0.721, 0.772, 0.816, 0.800].\nObject 10 : finger at [0.535, 0.740, 0.685, 0.826].\nObject 11 : ground at [0.003, 0.888, 0.997, 1.000].\nObject 12 : hair at [0.507, 0.142, 0.791, 0.642].\nObject 13 : hair at [0.189, 0.044, 0.652, 0.374].\nObject 14 : hand at [0.721, 0.720, 0.822, 0.818].\nObject 15 : hand at [0.493, 0.710, 0.685, 0.826].\nObject 16 : head at [0.209, 0.048, 0.652, 0.360].\nObject 17 : mouth at [0.646, 0.310, 0.724, 0.352].\nObject 18 : neck at [0.560, 0.344, 0.663, 0.460].\nObject 19 : necklace at [0.357, 0.334, 0.471, 0.484].\nObject 20 : necktie at [0.571, 0.442, 0.674, 0.936].\nObject 21 : paper at [0.760, 0.792, 0.914, 0.934].\nObject 22 : person at [0.490, 0.136, 0.825, 0.998].\nObject 23 : plate at [0.579, 0.734, 0.816, 0.768].\nObject 24 : purse at [0.774, 0.792, 0.883, 0.840].\nObject 25 : ring at [0.786, 0.780, 0.794, 0.796].\nObject 26 : shirt at [0.554, 0.376, 0.691, 0.950].\nObject 27 : suit jacket at [0.490, 0.422, 0.799, 0.998].\nObject 28 : table at [0.696, 0.816, 0.997, 0.916].\nObject 29 : toilet at [0.000, 0.656, 0.997, 0.936].\nObject 30 : wallpaper at [0.003, 0.000, 0.916, 0.656].\n\nRelationships:\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> sitting by -> object 29 : toilet.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 2 : cake crumb -> on side of -> object 17 : mouth.\nobject 24 : purse -> on top of -> object 28 : table.\nobject 5 : eye -> of a -> object 22 : person.\nobject 6 : eye -> of a -> object 22 : person.\nobject 7 : eyebrow -> of -> object 22 : person.\nobject 8 : eyebrow -> of -> object 22 : person.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 3 : crown -> on top of -> object 16 : head.\nobject 20 : necktie -> worn on -> object 22 : person.\nobject 22 : person -> holding -> object 1 : cake.\nobject 14 : hand -> holding -> object 1 : cake.\nobject 22 : person -> wearing -> object 27 : suit jacket.\nobject 22 : person -> wearing -> object 4 : dress.\nobject 20 : necktie -> worn on -> object 18 : neck.\nobject 13 : hair -> on top of -> object 16 : head.\nobject 1 : cake -> on top of -> object 23 : plate.\nobject 25 : ring -> worn on -> object 9 : finger.\n\nRegion Description:\nRegion Description at [0.022, 0.020, 0.203, 0.312] : A green and yellow striped wallpaper.\nRegion Description at [0.000, 0.048, 0.613, 0.996] : woman wearing a strapless white wedding dress .\nRegion Description at [0.487, 0.136, 0.808, 0.986] : 
woman white red hair holding a piece of cake on a plate.\nRegion Description at [0.543, 0.674, 0.813, 0.826] : woman's hands holding a plate of cake.\nRegion Description at [0.579, 0.124, 0.788, 0.524] : red haired woman wearing a tie and suit jacket .\nRegion Description at [0.000, 0.012, 0.819, 0.996] : two people wearing formal wedding attire .\n\nGlobal Caption:\nThere are two people enjoying a wedding reception\nA woman in a wedding dress with another woman in a suit behind\nA woman in a wedding dress with another lady holding a piece of cake.\nA red head girl holding a piece of cake\nA bride is with a long red haired person with cake."}
{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bush at [0.480, 0.000, 0.748, 0.084].\nObject 1 : ear at [0.544, 0.544, 0.571, 0.562].\nObject 2 : field at [0.000, 0.002, 0.994, 0.998].\nObject 3 : hill at [0.000, 0.000, 0.997, 0.998].\nObject 4 : plant at [0.000, 0.764, 0.601, 0.998].\nObject 5 : rock at [0.727, 0.410, 0.808, 0.470].\nObject 6 : sheep at [0.532, 0.546, 0.646, 0.662].\nObject 7 : sheep at [0.532, 0.666, 0.817, 0.810].\nObject 8 : tail at [0.565, 0.572, 0.604, 0.610].\nObject 9 : tree at [0.649, 0.000, 0.997, 0.334].\nObject 10 : trees at [0.736, 0.036, 0.835, 0.100].\nObject 11 : wall at [0.000, 0.000, 0.769, 0.180].\nObject 12 : weed at [0.417, 0.346, 0.492, 0.390].\n\nRelationships:\nobject 7 : sheep -> in a -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 11 : wall -> borders -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 10 : trees -> in -> object 2 : field.\nobject 6 : sheep -> has an -> object 1 : ear.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 12 : weed -> growing in -> object 2 : field.\nobject 7 : sheep -> on -> object 3 : hill.\nobject 4 : plant -> on -> object 2 : field.\nobject 5 : rock -> on -> object 3 : hill.\nobject 7 : sheep -> are in -> object 2 : field.\nobject 11 : wall -> running across -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 5 : rock -> in -> object 2 : field.\n\nRegion Description:\nRegion Description at [0.000, 0.072, 0.760, 0.160] : A stone wall boarding a field of sheep.\nRegion Description at [0.189, 0.032, 0.703, 0.178] : rocks and grass in the background of the pasture.\nRegion Description at [0.541, 0.662, 0.823, 0.802] : white sheep grazing in green grassy field.\nRegion Description at [0.538, 0.544, 0.646, 0.656] : white sheep grazing in green grassy field.\nRegion Description at [0.228, 0.374, 0.357, 0.436] : white sheep grazing in green grassy field.\nRegion Description at [0.607, 0.380, 0.712, 0.456] : white sheep grazing in green grassy field.\nRegion Description at [0.811, 0.296, 0.937, 0.338] : two white sheep grazing in green grassy field.\nRegion Description at [0.048, 0.200, 0.249, 0.242] : group of white sheep grazing in green grassy field.\nRegion Description at [0.213, 0.164, 0.336, 0.192] : group of white sheep grazing in green grassy field.\nRegion Description at [0.000, 0.006, 0.997, 0.172] : two long gray stone walls across field.\nRegion Description at [0.453, 0.000, 0.730, 0.062] : a stand of trees outside the stone fence.\n\nGlobal Caption:\nA group of sheep grazing in a grassy valley.\nSheep graze in a lushly green mountain meadow\nA flock of sheep walking along a grassy hillside grazing.\nA flock of sheep are grazing on a grassy slope.\nA group of sheep grazing in a grassy field."}
{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bold writings at [0.492, 0.770, 0.556, 0.810].\nObject 1 : bottle at [0.468, 0.642, 0.634, 0.916].\nObject 2 : cart at [0.232, 0.328, 0.808, 0.998].\nObject 3 : child at [0.408, 0.168, 0.606, 0.786].\nObject 4 : cleaner at [0.466, 0.634, 0.636, 0.916].\nObject 5 : floor at [0.000, 0.190, 1.000, 1.000].\nObject 6 : green shirt at [0.000, 0.180, 0.078, 0.540].\nObject 7 : houses at [0.000, 0.000, 0.240, 0.414].\nObject 8 : leaves at [0.894, 0.202, 0.910, 0.204].\nObject 9 : line at [0.796, 0.954, 0.996, 0.966].\nObject 10 : lines at [0.828, 0.398, 0.998, 0.568].\nObject 11 : metal at [0.514, 0.116, 0.558, 0.292].\nObject 12 : metal at [0.234, 0.336, 0.802, 0.998].\nObject 13 : metal part at [0.512, 0.862, 0.566, 0.992].\nObject 14 : pants at [0.432, 0.524, 0.574, 0.670].\nObject 15 : person at [0.110, 0.070, 0.258, 0.456].\nObject 16 : person at [0.412, 0.166, 0.604, 0.784].\nObject 17 : person at [0.000, 0.182, 0.216, 0.958].\nObject 18 : sandal at [0.070, 0.862, 0.180, 0.954].\nObject 19 : shirt at [0.128, 0.120, 0.216, 0.260].\nObject 20 : shorts at [0.140, 0.222, 0.216, 0.348].\nObject 21 : skirt at [0.000, 0.470, 0.214, 0.894].\nObject 22 : umbrella at [0.296, 0.038, 0.782, 0.360].\nObject 23 : woman at [0.286, 0.000, 0.802, 0.812].\nObject 24 : writings at [0.512, 0.838, 0.564, 0.868].\n\nRelationships:\nobject 3 : child -> holding -> object 22 : umbrella.\nobject 23 : woman -> pushing -> object 2 : cart.\nobject 21 : skirt -> on -> object 17 : person.\nobject 10 : lines -> on -> object 5 : floor.\nobject 20 : shorts -> on -> object 15 : person.\nobject 16 : person -> next to -> object 2 : cart.\nobject 16 : person -> wearing -> object 21 : skirt.\nobject 18 : sandal -> on -> object 17 : person.\nobject 6 : green shirt -> on -> object 16 : person.\nobject 14 : pants -> on -> object 3 : child.\n\nRegion Description:\nRegion Description at [0.298, 0.050, 0.778, 0.422] : the opened umbrella the child is holding.\n\nGlobal Caption:\nA baby girl standing in a shopping cart holding an umbrella.\nA GIRL IS IN A GROCERY CART \nA little girl is riding in a shopping cart while holding her umbrella.\nA little girl inside of a shopping cart.\nA small child stands in a shopping cart with an umbrella."}
{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bearded face at [0.371, 0.064, 0.393, 0.094].\nObject 1 : blender at [0.015, 0.165, 0.080, 0.307].\nObject 2 : box at [0.176, 0.249, 0.228, 0.329].\nObject 3 : buttons at [0.038, 0.268, 0.048, 0.275].\nObject 4 : counter at [0.567, 0.340, 0.738, 0.395].\nObject 5 : counter at [0.000, 0.329, 0.576, 0.398].\nObject 6 : curtain at [0.429, 0.048, 0.504, 0.318].\nObject 7 : curtain at [0.227, 0.000, 0.309, 0.287].\nObject 8 : dog at [0.462, 0.593, 0.568, 0.842].\nObject 9 : door knob at [0.242, 0.477, 0.253, 0.499].\nObject 10 : drawer at [0.112, 0.370, 0.259, 0.452].\nObject 11 : drawer at [0.284, 0.382, 0.394, 0.439].\nObject 12 : faucet at [0.338, 0.327, 0.388, 0.357].\nObject 13 : floor at [0.000, 0.713, 1.000, 1.000].\nObject 14 : kitchen at [0.000, 0.000, 0.750, 0.849].\nObject 15 : knob at [0.179, 0.398, 0.197, 0.422].\nObject 16 : knob at [0.340, 0.400, 0.352, 0.420].\nObject 17 : man at [0.274, 0.000, 0.517, 0.792].\nObject 18 : mugs at [0.509, 0.123, 0.595, 0.266].\nObject 19 : outlet at [0.107, 0.212, 0.143, 0.256].\nObject 20 : shoes at [0.391, 0.735, 0.476, 0.786].\nObject 21 : spatula at [0.126, 0.003, 0.153, 0.094].\nObject 22 : tile at [0.526, 0.592, 0.557, 0.634].\nObject 23 : wall at [0.003, 0.000, 0.220, 0.294].\nObject 24 : wall at [0.506, 0.019, 0.607, 0.384].\nObject 25 : window at [0.303, 0.016, 0.392, 0.328].\nObject 26 : wire at [0.097, 0.233, 0.129, 0.319].\n\nRelationships:\nobject 17 : man -> standing in -> object 14 : kitchen.\nobject 18 : mugs -> hanging on -> object 24 : wall.\nobject 1 : blender -> with -> object 3 : buttons.\nobject 17 : man -> with -> object 0 : bearded face.\nobject 26 : wire -> hanging from -> object 23 : wall.\nobject 8 : dog -> on -> object 13 : floor.\nobject 1 : blender -> on -> object 5 : counter.\nobject 6 : curtain -> on -> object 25 : window.\nobject 20 : shoes -> on -> object 17 : man.\n\nRegion Description:\nRegion Description at [0.056, 0.214, 0.140, 0.277] : A dark electric cord plugged into the wall.\nRegion Description at [0.000, 0.662, 0.116, 0.940] : A latter with onely one rung visible.\nRegion Description at [0.004, 0.698, 0.999, 0.991] : Durable Tan and brown laminent flooring.\nRegion Description at [0.004, 0.324, 0.739, 0.880] : cheap waferboard constructed cabinets .\nRegion Description at [0.514, 0.126, 0.588, 0.262] : convient and accessable way to store coffee mugs.\nRegion Description at [0.222, 0.001, 0.510, 0.286] : small window curtians with paisley design.\nRegion Description at [0.347, 0.053, 0.490, 0.312] : light weight flanel design mens shirt .\nRegion Description at [0.222, 0.004, 0.315, 0.303] : gold and white curtain on a kitchen window.\nRegion Description at [0.511, 0.126, 0.589, 0.261] : coffee cups hanging on the kitchen wall.\nRegion Description at [0.012, 0.149, 0.091, 0.340] : gold colored blinder sits on the counter.\nRegion Description at [-0.001, 0.000, 0.157, 0.122] : cooking utensils hanging against wall.\n\nGlobal Caption:\nA man standing next to a dog on the ground.\nA man is at a kitchen counter by a dog.\nAn man standing in a kitchen with a small puppy.\nthere is a small puppy on the kitchen floor\nA man in the kitchen standing with his dog."}
{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : block at [0.156, 0.630, 0.357, 0.822].\nObject 1 : book at [0.414, 0.208, 0.538, 0.364].\nObject 2 : book at [0.360, 0.202, 0.417, 0.360].\nObject 3 : book at [0.426, 0.484, 0.691, 0.522].\nObject 4 : book at [0.399, 0.404, 0.520, 0.554].\nObject 5 : bowl at [0.072, 0.030, 0.288, 0.076].\nObject 6 : center at [0.850, 0.732, 0.886, 0.766].\nObject 7 : eye at [0.282, 0.506, 0.327, 0.532].\nObject 8 : eye at [0.189, 0.506, 0.237, 0.534].\nObject 9 : flower at [0.796, 0.462, 0.982, 0.550].\nObject 10 : flower at [0.817, 0.528, 0.976, 0.612].\nObject 11 : flower at [0.760, 0.678, 0.946, 0.824].\nObject 12 : flower at [0.691, 0.608, 0.838, 0.722].\nObject 13 : flower at [0.913, 0.680, 1.000, 0.770].\nObject 14 : object at [0.213, 0.840, 0.583, 0.972].\nObject 15 : picture at [0.778, 0.060, 1.000, 0.352].\nObject 16 : shelf at [0.324, 0.528, 0.997, 0.624].\nObject 17 : shelf at [0.207, 0.334, 0.997, 0.380].\nObject 18 : shelf at [0.000, 0.028, 0.607, 0.202].\nObject 19 : stack at [0.435, 0.480, 0.712, 0.578].\nObject 20 : statue at [0.147, 0.404, 0.372, 0.652].\nObject 21 : table at [0.000, 0.690, 1.003, 0.998].\nObject 22 : vase at [0.838, 0.774, 0.994, 0.974].\nObject 23 : water at [0.847, 0.864, 0.997, 0.984].\n\nRelationships:\nobject 20 : statue -> on -> object 0 : block.\nobject 14 : object -> on -> object 21 : table.\nobject 1 : book -> on -> object 17 : shelf.\nobject 4 : book -> on -> object 16 : shelf.\nobject 5 : bowl -> on -> object 18 : shelf.\nobject 22 : vase -> has -> object 23 : water.\nobject 20 : statue -> has -> object 8 : eye.\nobject 20 : statue -> has -> object 7 : eye.\nobject 20 : statue -> on -> object 0 : block.\nobject 9 : flower -> in -> object 22 : vase.\nobject 10 : flower -> in -> object 22 : vase.\nobject 12 : flower -> in -> object 22 : vase.\nobject 13 : flower -> in -> object 22 : vase.\nobject 3 : book -> in -> object 19 : stack.\nobject 11 : flower -> has -> object 6 : center.\nobject 1 : book -> on -> object 17 : shelf.\nobject 2 : book -> on -> object 17 : shelf.\nobject 11 : flower -> has -> object 6 : center.\nobject 3 : book -> on -> object 19 : stack.\nobject 19 : stack -> on -> object 16 : shelf.\nobject 20 : statue -> on -> object 0 : block.\n\nRegion Description:\n\nGlobal Caption:\na glass vase with some flowers coming out of it \nA room witb a statue, bookshelves, books and a vase with flowers in it.\nA desk with a vase containing flowers, a sculpture of a man's head and shelves behind it.\nA statue next to a vase of flowers on a shelf. \nThe bust of a man's head is next to a vase of flowers."}
{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : area at [0.000, 0.002, 0.995, 0.996].\nObject 1 : background at [0.000, 0.132, 0.997, 0.268].\nObject 2 : cap at [0.171, 0.388, 0.253, 0.476].\nObject 3 : green/tennis court at [0.005, 0.720, 0.880, 0.994].\nObject 4 : hand at [0.253, 0.648, 0.299, 0.680].\nObject 5 : head at [0.173, 0.408, 0.256, 0.474].\nObject 6 : line at [0.397, 0.778, 0.501, 0.996].\nObject 7 : man at [0.163, 0.274, 0.491, 0.936].\nObject 8 : photo at [0.005, 0.004, 0.968, 0.976].\nObject 9 : pole at [0.019, 0.162, 0.035, 0.258].\nObject 10 : ses at [0.912, 0.962, 0.992, 0.994].\nObject 11 : shadow at [0.397, 0.898, 0.968, 0.956].\nObject 12 : shorts at [0.216, 0.628, 0.432, 0.782].\nObject 13 : sock at [0.325, 0.840, 0.376, 0.890].\nObject 14 : sport at [0.144, 0.270, 0.515, 0.944].\nObject 15 : tennis racket at [0.235, 0.578, 0.304, 0.664].\nObject 16 : tennis shoe at [0.213, 0.880, 0.280, 0.930].\nObject 17 : tennis shoe at [0.299, 0.886, 0.405, 0.936].\nObject 18 : trees at [0.269, 0.192, 0.995, 0.250].\nObject 19 : wrist at [0.384, 0.318, 0.429, 0.360].\nObject 20 : wristband at [0.384, 0.318, 0.432, 0.360].\n\nRelationships:\nobject 7 : man -> wearing -> object 12 : shorts.\nobject 4 : hand -> holding -> object 15 : tennis racket.\nobject 2 : cap -> on mans -> object 5 : head.\nobject 5 : head -> of a -> object 7 : man.\nobject 7 : man -> wearing a -> object 2 : cap.\nobject 7 : man -> wearing a -> object 13 : sock.\nobject 18 : trees -> in -> object 1 : background.\nobject 14 : sport -> in -> object 0 : area.\nobject 20 : wristband -> on a -> object 19 : wrist.\nobject 2 : cap -> on -> object 5 : head.\nobject 11 : shadow -> of -> object 7 : man.\nobject 12 : shorts -> on -> object 7 : man.\n\nRegion Description:\nRegion Description at [0.163, 0.322, 0.579, 0.926] : The tennis player is wearing all white.\nRegion Description at [0.397, 0.858, 0.936, 0.968] : Tennis player's shadow cast in front of him.\nRegion Description at [0.219, 0.560, 0.309, 0.680] : a black tennis racket in a man's hand.\nRegion Description at [0.341, 0.538, 0.480, 0.728] : a line judge at the side of a tennis court.\n\nGlobal Caption:\nA tennis player prepares to serve a tennis ball.\na tennis player in all white playing on a court \nA tennis player is reaching up with one arm and has a racquet in the other hand. \nThe tennis player throws the ball up to serve\nSpectators watching a man swinging at a tennis ball."}
{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : bag at [0.428, 0.435, 0.476, 0.528].\nObject 1 : bag at [0.322, 0.923, 0.498, 0.997].\nObject 2 : building at [0.000, 0.003, 0.158, 0.413].\nObject 3 : face at [0.246, 0.240, 0.374, 0.483].\nObject 4 : flag at [0.044, 0.013, 0.090, 0.149].\nObject 5 : girl at [0.538, 0.019, 0.968, 0.949].\nObject 6 : hand at [0.176, 0.680, 0.304, 0.821].\nObject 7 : hands at [0.660, 0.344, 0.756, 0.517].\nObject 8 : head at [0.560, 0.003, 0.822, 0.339].\nObject 9 : hot dog at [0.676, 0.315, 0.882, 0.408].\nObject 10 : hot dogs at [0.190, 0.587, 0.350, 0.741].\nObject 11 : jeans at [0.586, 0.843, 0.916, 0.995].\nObject 12 : lady at [0.572, 0.045, 0.952, 0.984].\nObject 13 : logo at [0.920, 0.069, 0.996, 0.165].\nObject 14 : man at [0.486, 0.235, 0.564, 0.509].\nObject 15 : man at [0.456, 0.213, 0.520, 0.317].\nObject 16 : maroon shirt at [0.546, 0.333, 0.928, 0.944].\nObject 17 : mouth at [0.288, 0.408, 0.356, 0.440].\nObject 18 : people at [0.552, 0.029, 0.876, 0.995].\nObject 19 : post at [0.104, 0.005, 0.138, 0.533].\nObject 20 : purse at [0.842, 0.661, 0.980, 0.888].\nObject 21 : purse strap at [0.270, 0.893, 0.390, 0.992].\nObject 22 : shadow at [0.934, 0.067, 0.996, 0.141].\nObject 23 : side at [0.922, 0.875, 0.998, 0.997].\nObject 24 : street at [0.042, 0.403, 0.092, 0.520].\nObject 25 : sunglasses at [0.630, 0.005, 0.794, 0.048].\nObject 26 : woman at [0.502, 0.000, 0.982, 0.997].\nObject 27 : woman at [0.102, 0.099, 0.486, 0.984].\nObject 28 : woman's shirt at [0.518, 0.320, 0.944, 0.949].\n\nRelationships:\nobject 0 : bag -> on -> object 15 : man.\nobject 13 : logo -> on -> object 2 : building.\nobject 25 : sunglasses -> on -> object 26 : woman.\nobject 25 : sunglasses -> on -> object 8 : head.\nobject 4 : flag -> on -> object 19 : post.\nobject 6 : hand -> holds -> object 10 : hot dogs.\nobject 27 : woman -> has -> object 17 : mouth.\nobject 12 : lady -> holding -> object 9 : hot dog.\nobject 9 : hot dog -> in -> object 7 : hands.\nobject 18 : people -> crossing -> object 24 : street.\nobject 27 : woman -> wearing -> object 11 : jeans.\nobject 5 : girl -> wears -> object 16 : maroon shirt.\n\nRegion Description:\nRegion Description at [0.038, 0.173, 0.540, 0.995] : Laughing girl in a green shirt holding a hotdog..\nRegion Description at [0.504, 0.000, 0.954, 0.989] : Black haired girl in maroon shirt wearing sunglasses on her head..\nRegion Description at [0.508, 0.000, 0.960, 0.979] : Girl looking at the hot dog she's holding in her hands.\nRegion Description at [0.040, 0.173, 0.536, 0.981] : Girl holding hot dog in her right hand.\nRegion Description at [0.926, 0.253, 0.998, 0.645] : Woman in a brown shirt and jeans crossing the street.\nRegion Description at [0.202, 0.563, 0.334, 0.995] : Blue purse strap around woman's shoulder.\nRegion Description at [0.146, 0.587, 0.370, 0.787] : woman holding hot dog in white napkin.\nRegion Description at [0.682, 0.229, 0.742, 0.315] : woman's mouth open looking at hot dog.\nRegion Description at [0.234, 0.213, 0.396, 0.507] : woman's face smiling with eyes closed.\n\nGlobal Caption:\nTwo Asian women eating chili dogs while standing on a street.\nTwo women preparing to eat a hot dog on a city side.\nThe woman are eating their hot dogs while walking.\nTwo young women are eating hot dogs while walking down the sidewalk.\nTwo women eat chili dogs on a city sidewalk. "}
{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : arm at [0.658, 0.462, 0.828, 0.496].\nObject 1 : bench at [0.070, 0.493, 0.932, 0.960].\nObject 2 : concrete at [0.030, 0.810, 0.974, 0.997].\nObject 3 : foot at [0.724, 0.784, 0.782, 0.844].\nObject 4 : hair at [0.646, 0.367, 0.754, 0.472].\nObject 5 : hair at [0.564, 0.338, 0.652, 0.462].\nObject 6 : man at [0.542, 0.343, 0.812, 0.493].\nObject 7 : ocean at [0.028, 0.319, 0.972, 0.821].\nObject 8 : post at [0.090, 0.641, 0.102, 0.734].\nObject 9 : post at [0.924, 0.652, 0.944, 0.836].\nObject 10 : rail at [0.028, 0.620, 0.974, 0.660].\nObject 11 : seat at [0.072, 0.728, 0.928, 0.786].\nObject 12 : shoe at [0.720, 0.789, 0.782, 0.855].\nObject 13 : sky at [0.028, 0.037, 0.974, 0.325].\nObject 14 : slat at [0.072, 0.749, 0.928, 0.781].\nObject 15 : slat at [0.112, 0.499, 0.912, 0.522].\nObject 16 : slat at [0.126, 0.702, 0.912, 0.728].\nObject 17 : slat at [0.108, 0.594, 0.908, 0.625].\nObject 18 : slat at [0.106, 0.525, 0.908, 0.554].\nObject 19 : woman at [0.644, 0.377, 0.834, 0.863].\n\nRelationships:\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting with -> object 19 : woman.\nobject 6 : man -> has -> object 0 : arm.\nobject 0 : arm -> around -> object 19 : woman.\nobject 3 : foot -> wearing -> object 12 : shoe.\nobject 19 : woman -> has -> object 3 : foot.\nobject 3 : foot -> inside -> object 12 : shoe.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 19 : woman -> has -> object 4 : hair.\nobject 6 : man -> has -> object 5 : hair.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> in front of -> object 7 : ocean.\nobject 1 : bench -> backs up to -> object 1 : bench.\nobject 19 : woman -> sitting on -> object 1 : bench.\nobject 6 : man -> sitting on -> object 1 : bench.\nobject 19 : woman -> relaxing on -> object 1 : bench.\nobject 6 : man -> relaxing on -> object 1 : bench.\nobject 19 : woman -> facing -> object 7 : ocean.\nobject 6 : man -> facing -> object 7 : ocean.\nobject 19 : woman -> looking at -> object 7 : ocean.\nobject 6 : man -> looking at -> object 7 : ocean.\nobject 6 : man -> relaxing with -> object 19 : woman.\nobject 6 : man -> on bench with -> object 19 : woman.\nobject 19 : woman -> resting on -> object 1 : bench.\nobject 6 : man -> resting on -> object 1 : bench.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 1 : bench -> near -> object 7 : ocean.\nobject 11 : seat -> part of -> object 1 : bench.\nobject 9 : post -> supporting -> object 10 : rail.\nobject 8 : post -> supporting -> object 10 : rail.\nobject 19 : woman -> has -> object 3 : foot.\nobject 12 : shoe -> belongs to -> object 19 : woman.\nobject 19 : woman -> has -> object 3 : foot.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 2 : concrete -> under -> object 1 : bench.\nobject 7 : ocean -> in front of -> object 1 : bench.\nobject 6 : man -> sitting next to -> object 19 : woman.\nobject 6 : man -> cuddling with -> object 19 : woman.\nobject 0 : arm -> around -> object 19 : woman.\nobject 6 : man -> silhouetted with -> object 19 : woman.\nobject 18 : slat -> part of -> object 1 : bench.\n\nRegion Description:\nRegion Description at [0.502, 0.309, 0.892, 0.512] : a man and woman looking at the ocean.\n\nGlobal Caption:\nTwo people sitting on a bench silhouetted against the sea.\nTwo people are sitting on a bench together in front of water.\nThe silhouette of two people 
sitting on a bench in front of the water.\nA couple is sitting on a bench in front of the water. \nA couple sits on a park bench and watches the water"}
{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : catcher at [0.334, 0.193, 0.756, 0.940].\nObject 1 : field at [0.000, 0.000, 0.998, 0.997].\nObject 2 : glove at [0.660, 0.492, 0.764, 0.674].\nObject 3 : hand at [0.666, 0.498, 0.748, 0.665].\nObject 4 : helmet at [0.472, 0.187, 0.610, 0.444].\nObject 5 : jersey at [0.340, 0.332, 0.556, 0.695].\nObject 6 : line at [0.396, 0.656, 0.560, 0.731].\nObject 7 : lines at [0.866, 0.927, 1.000, 0.997].\nObject 8 : lines at [0.754, 0.837, 0.998, 0.867].\nObject 9 : pads at [0.562, 0.668, 0.634, 0.782].\nObject 10 : pants at [0.336, 0.640, 0.612, 0.858].\nObject 11 : sneakers at [0.406, 0.834, 0.544, 0.946].\nObject 12 : stripe at [0.608, 0.737, 0.998, 0.795].\nObject 13 : wrist band at [0.586, 0.583, 0.604, 0.640].\n\nRelationships:\nobject 0 : catcher -> in -> object 1 : field.\nobject 2 : glove -> on -> object 3 : hand.\nobject 6 : line -> on -> object 10 : pants.\n\nRegion Description:\nRegion Description at [0.546, 0.625, 0.626, 0.801] : The player is wearing knee and leg pads..\nRegion Description at [0.018, 0.665, 0.280, 0.825] : A brown dirt ground surface on a baseball field.\nRegion Description at [0.676, 0.701, 0.974, 0.979] : White chalk lines painted on a baseball field.\nRegion Description at [0.062, 0.130, 0.370, 0.535] : A green grass ground surface of a baseball field.\nRegion Description at [0.566, 0.580, 0.620, 0.656] : A black and red bracelet on a man's wrist.\n\nGlobal Caption:\nA catches crouches on a patch of dirt.\nA catcher squatting at a base with his gloved hand extended.\nA baseball catcher stands ready to catch a ball.\na catcher kneeling at the mound waiting for a baseball \nA catcher in white uniform during a baseball game."}
{"question_id": 38, "image": "000000271402.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : blonde hair at [0.193, 0.100, 0.375, 0.176].\nObject 1 : building at [0.804, 0.200, 0.906, 0.318].\nObject 2 : dress at [0.378, 0.284, 0.804, 0.652].\nObject 3 : fence at [0.607, 0.282, 0.997, 0.378].\nObject 4 : girl at [0.329, 0.148, 0.973, 0.892].\nObject 5 : girl at [0.057, 0.102, 0.456, 0.898].\nObject 6 : ground at [0.000, 0.374, 1.000, 0.916].\nObject 7 : hair at [0.320, 0.148, 0.517, 0.286].\nObject 8 : handle at [0.329, 0.432, 0.508, 0.480].\nObject 9 : handle at [0.091, 0.450, 0.299, 0.502].\nObject 10 : head at [0.335, 0.152, 0.508, 0.314].\nObject 11 : insignia at [0.447, 0.350, 0.502, 0.390].\nObject 12 : orange platform at [0.181, 0.816, 0.489, 0.998].\nObject 13 : orange wheel at [0.193, 0.820, 0.248, 0.876].\nObject 14 : pavement at [0.009, 0.370, 0.994, 0.996].\nObject 15 : racket at [0.462, 0.480, 0.713, 0.840].\nObject 16 : right shoe at [0.465, 0.778, 0.610, 0.886].\nObject 17 : scooter at [0.097, 0.424, 0.592, 0.996].\nObject 18 : shoe at [0.060, 0.794, 0.202, 0.902].\nObject 19 : shoe at [0.302, 0.780, 0.453, 0.874].\nObject 20 : skirt at [0.471, 0.514, 0.804, 0.654].\nObject 21 : sneaker at [0.849, 0.738, 0.970, 0.886].\nObject 22 : sock at [0.317, 0.776, 0.347, 0.798].\nObject 23 : sock at [0.130, 0.790, 0.184, 0.810].\n\nRelationships:\nobject 4 : girl -> on -> object 14 : pavement.\nobject 5 : girl -> wearing -> object 22 : sock.\nobject 5 : girl -> wearing -> object 23 : sock.\nobject 4 : girl -> wearing -> object 20 : skirt.\nobject 4 : girl -> holding -> object 15 : racket.\nobject 5 : girl -> with -> object 0 : blonde hair.\nobject 17 : scooter -> with -> object 8 : handle.\nobject 1 : building -> with -> object 3 : fence.\nobject 4 : girl -> with -> object 11 : insignia.\nobject 13 : orange wheel -> of -> object 17 : scooter.\n\nRegion Description:\nRegion Description at [0.858, 0.760, 0.970, 0.852] : Girl is wearing blue, white, pink, and gray shoes..\nRegion Description at [0.293, 0.136, 0.976, 0.884] : a little girl holding a tennis racket..\nRegion Description at [0.060, 0.086, 0.462, 0.908] : A little girl standing near a scooter..\nRegion Description at [0.308, 0.146, 0.985, 0.892] : young girl wearing velcro strapped tennis shoes.\nRegion Description at [0.082, 0.436, 0.601, 0.996] : orange scooter board with black handles.\nRegion Description at [0.755, 0.184, 0.973, 0.372] : a tall building with fence in foreground.\nRegion Description at [0.021, 0.096, 0.988, 0.928] : two young girls wearing white outfits.\nRegion Description at [0.311, 0.136, 0.991, 0.886] : young girl with insignia on white outfit.\nRegion Description at [0.175, 0.814, 0.266, 0.888] : orange colored back wheel of a scooter board.\nRegion Description at [0.453, 0.478, 0.725, 0.848] : lavender, yellow and pink colored tennis racket.\n\nGlobal Caption:\ntwo little girls in tennis uniforms standing next to a scooter\nTwo young girls with a tennis racket and a scooter.\nTwo little girls posing for a picture, on a tennis court.\nTwo young girls on a tennis court with a racquet and a scooter\nTwo cute girls with a scooter and tennis raquet."}
{"question_id": 39, "image": "000000018519.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : concrete at [0.000, 0.576, 1.002, 0.998].\nObject 1 : elbow at [0.403, 0.538, 0.433, 0.552].\nObject 2 : fence at [0.000, 0.314, 0.998, 0.600].\nObject 3 : graffiti at [0.470, 0.856, 0.794, 0.998].\nObject 4 : grass at [0.000, 0.154, 1.002, 0.448].\nObject 5 : helmet at [0.358, 0.354, 0.448, 0.422].\nObject 6 : knee at [0.525, 0.608, 0.545, 0.622].\nObject 7 : knee pad at [0.450, 0.542, 0.512, 0.598].\nObject 8 : pad at [0.540, 0.362, 0.595, 0.420].\nObject 9 : pad at [0.512, 0.578, 0.592, 0.624].\nObject 10 : pad at [0.376, 0.512, 0.443, 0.554].\nObject 11 : park at [0.007, 0.006, 1.000, 0.578].\nObject 12 : pipe at [0.657, 0.300, 0.687, 0.578].\nObject 13 : pipe at [0.177, 0.324, 0.211, 0.590].\nObject 14 : rail at [0.000, 0.310, 1.000, 0.334].\nObject 15 : ramp at [0.000, 0.592, 1.002, 0.998].\nObject 16 : rock at [0.100, 0.302, 0.154, 0.326].\nObject 17 : shadow at [0.415, 0.642, 0.754, 0.912].\nObject 18 : shirt at [0.438, 0.376, 0.637, 0.514].\nObject 19 : shorts at [0.460, 0.500, 0.664, 0.580].\nObject 20 : skate at [0.647, 0.490, 0.709, 0.584].\nObject 21 : skater at [0.234, 0.352, 0.719, 0.624].\nObject 22 : sticker at [0.408, 0.358, 0.438, 0.368].\nObject 23 : tree at [0.122, 0.008, 0.677, 0.322].\nObject 24 : wheels at [0.689, 0.496, 0.721, 0.526].\nObject 25 : wrist brace at [0.279, 0.524, 0.338, 0.564].\n\nRelationships:\nobject 21 : skater -> has a -> object 17 : shadow.\nobject 20 : skate -> has -> object 24 : wheels.\nobject 23 : tree -> standing in a -> object 11 : park.\nobject 21 : skater -> wearing a -> object 5 : helmet.\nobject 10 : pad -> protecting an -> object 1 : elbow.\nobject 9 : pad -> protecting a -> object 6 : knee.\nobject 17 : shadow -> of a -> object 21 : skater.\nobject 15 : ramp -> has a -> object 3 : graffiti.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 16 : rock -> in -> object 4 : grass.\nobject 5 : helmet -> has a -> object 22 : sticker.\nobject 21 : skater -> wearing -> object 20 : skate.\nobject 21 : skater -> wearing a -> object 10 : pad.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 20 : skate.\nobject 17 : shadow -> on -> object 15 : ramp.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 21 : skater -> has a -> object 8 : pad.\nobject 21 : skater -> has a -> object 18 : shirt.\nobject 21 : skater -> has -> object 19 : shorts.\nobject 23 : tree -> behind -> object 21 : skater.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 9 : pad.\nobject 7 : knee pad -> for a -> object 21 : skater.\nobject 17 : shadow -> on -> object 0 : concrete.\nobject 3 : graffiti -> on -> object 0 : concrete.\n\nRegion Description:\nRegion Description at [0.391, 0.630, 0.776, 0.962] : Skater's shadow while performing a trick.\nRegion Description at [0.346, 0.342, 0.475, 0.440] : Man is wearing a black safety helmet.\nRegion Description at [0.184, 0.320, 0.741, 0.700] : a man roller skating at a skate park.\nRegion Description at [0.448, 0.636, 0.779, 0.940] : the shadow of the man cast on the cement ramp.\nRegion Description at [0.465, 0.856, 0.803, 0.996] : light blue painted graffiti on the cement ramp.\nRegion Description at [0.279, 0.524, 0.341, 0.570] : a black wrist guard on the man's wrist.\nRegion Description at [0.353, 0.352, 0.460, 0.422] : black helmet with several stickers on it.\nRegion Description at [0.644, 0.488, 0.719, 0.574] : the 
black rollerskate the man is wearing.\nRegion Description at [0.142, 0.314, 0.234, 0.604] : a grey post to the metal fence that is at the top of the ramp.\nRegion Description at [0.363, 0.500, 0.453, 0.566] : a black elbow pad the man is wearing.\nRegion Description at [0.405, 0.642, 0.746, 0.916] : shadow of a roller skater on concrete.\n\nGlobal Caption:\nA young man riding a skateboard down the side of a ramp.\nA man doing a trick on roller-skates in a skate park.\nA skateboarder performing a jump off the side of a ramp.\na man wearing roller skates doing a jump on the side of a wall \nThe man in the helmet is jumping while wearing roller skates. "}
{"question_id": 40, "image": "000000106048.jpg", "category": "ground_conv", "text": "Objects:\nObject 0 : book at [0.218, 0.105, 0.834, 0.754].\nObject 1 : building at [0.050, 0.000, 1.000, 0.713].\nObject 2 : bus at [0.222, 0.144, 0.820, 0.757].\nObject 3 : bushes at [0.810, 0.401, 1.000, 0.680].\nObject 4 : design at [0.228, 0.422, 0.438, 0.560].\nObject 5 : ground at [0.000, 0.629, 1.002, 0.994].\nObject 6 : headlight at [0.738, 0.590, 0.796, 0.632].\nObject 7 : headlight at [0.522, 0.596, 0.610, 0.629].\nObject 8 : light at [0.604, 0.201, 0.706, 0.222].\nObject 9 : pavement at [0.002, 0.629, 0.996, 0.994].\nObject 10 : pipe at [0.172, 0.147, 0.208, 0.617].\nObject 11 : pipe at [0.438, 0.096, 0.458, 0.192].\nObject 12 : roof at [0.118, 0.000, 0.896, 0.174].\nObject 13 : side mirror at [0.488, 0.314, 0.530, 0.428].\nObject 14 : side mirror at [0.790, 0.332, 0.818, 0.455].\nObject 15 : street at [0.002, 0.611, 0.992, 0.991].\nObject 16 : stripe at [0.228, 0.428, 0.516, 0.569].\nObject 17 : trash can at [0.790, 0.569, 0.822, 0.662].\nObject 18 : wall at [0.858, 0.368, 0.920, 0.419].\nObject 19 : wheel at [0.266, 0.545, 0.294, 0.677].\nObject 20 : wheel at [0.248, 0.551, 0.264, 0.668].\nObject 21 : wheel at [0.444, 0.578, 0.472, 0.751].\nObject 22 : windows at [0.510, 0.216, 0.796, 0.548].\nObject 23 : windshield at [0.518, 0.222, 0.782, 0.545].\n\nRelationships:\nobject 10 : pipe -> running from -> object 12 : roof.\nobject 12 : roof -> to -> object 5 : ground.\nobject 17 : trash can -> next to -> object 3 : bushes.\nobject 3 : bushes -> by -> object 15 : street.\n\nRegion Description:\nRegion Description at [0.568, 0.524, 0.770, 0.599] : Divine Transportation written on front of bus.\nRegion Description at [0.162, 0.129, 0.212, 0.623] : black drain pipe running from the roof to the ground.\nRegion Description at [0.712, 0.177, 0.762, 0.240] : bus identification number on top of bus.\nRegion Description at [0.790, 0.557, 0.820, 0.647] : gray trash can next to bushes behind bus.\nRegion Description at [0.810, 0.407, 0.990, 0.692] : large green bushes in front of building.\nRegion Description at [0.670, 0.317, 0.740, 0.527] : black windshield wiper on windshield.\n\nGlobal Caption:\nA white bus driving past a tall building.\na black and white bus some bushes and building\nA white decorated bus is next to a building.\na large white bus that is by a building\nA large bus parked in a parking lot "}

View File

@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000125472.jpg", "category": "ground_conv", "text": "What is the man in the image doing and tell me the coordinates of the man?"}
{"question_id": 2, "image": "000000361551.jpg", "category": "ground_conv", "text": "What's happening on the runway of the airport and provide me the coordinates of the runway and featured objects?"}
{"question_id": 3, "image": "000000184400.jpg", "category": "ground_conv", "text": "What kind of vehicle is on the bridge, and provide me the coordinates of the vehicle and the bridge?"}
{"question_id": 4, "image": "000000276018.jpg", "category": "ground_conv", "text": "What are the children doing, and what are they holding? Please provide the coordinates of the mentioned objects."}
{"question_id": 5, "image": "000000356424.jpg", "category": "ground_conv", "text": "What is the man doing and what objects are in front of him? Please provide the coordinates of these objects."}
{"question_id": 6, "image": "000000458755.jpg", "category": "ground_conv", "text": "What is the girl doing and provide me the coordinates of the girl and the objects she is interacting with."}
{"question_id": 7, "image": "000000069138.jpg", "category": "ground_conv", "text": "What are the main features of the building and provide me the coordinates of these features?"}
{"question_id": 8, "image": "000000003156.jpg", "category": "ground_conv", "text": "What is the man doing in this image and provide the coordinates of the toilet and the man."}
{"question_id": 9, "image": "000000131138.jpg", "category": "ground_conv", "text": "What items are located on the desk and tell me the coordinates of these items?"}
{"question_id": 10, "image": "000000259097.jpg", "category": "ground_conv", "text": "Describe the activities of the man and what is he wearing. Also, tell me the coordinates of the man and the items related to him."}
{"question_id": 11, "image": "000000377882.jpg", "category": "ground_conv", "text": "What is the scene in the image? Provide the coordinates of the main objects."}
{"question_id": 12, "image": "000000484415.jpg", "category": "ground_conv", "text": "What is the man doing, and what objects is he interacting with? Please provide the coordinates of the mentioned objects."}
{"question_id": 13, "image": "000000184384.jpg", "category": "ground_conv", "text": "What is the type of food placed on the table and provide me the coordinates of these objects?"}
{"question_id": 14, "image": "000000341058.jpg", "category": "ground_conv", "text": "What items are placed on the table and provide me the coordinates of mentioned items?"}
{"question_id": 15, "image": "000000349184.jpg", "category": "ground_conv", "text": "What is happening in this image and tell me the coordinates of the woman sitting?"}
{"question_id": 16, "image": "000000516143.jpg", "category": "ground_conv", "text": "What kind of bus is in the image, and what is it doing? Also, tell me the coordinates of the bus."}
{"question_id": 17, "image": "000000159311.jpg", "category": "ground_conv", "text": "How many zebras are in the image and provide me the coordinates of mentioned zebras?"}
{"question_id": 18, "image": "000000553990.jpg", "category": "ground_conv", "text": "What is the attire of the person who is riding the horse, and provide me the coordinates of that person?"}
{"question_id": 19, "image": "000000273493.jpg", "category": "ground_conv", "text": "What are the two men doing in the image? What are they wearing and provide me the coordinates of mentioned objects?"}
{"question_id": 20, "image": "000000452122.jpg", "category": "ground_conv", "text": "What state is the airplane in and tell me the coordinates of the mentioned objects?"}
{"question_id": 21, "image": "000000134722.jpg", "category": "ground_conv", "text": "What is the setting of this image and provide me the coordinates of the objects you mention?"}
{"question_id": 22, "image": "000000360960.jpg", "category": "ground_conv", "text": "How many people are there in the image and what are they wearing? Tell me the coordinates of the people and their clothing."}
{"question_id": 23, "image": "000000179765.jpg", "category": "ground_conv", "text": "Can you tell me about the features of the bike and provide the coordinates of each feature mentioned?"}
{"question_id": 24, "image": "000000332318.jpg", "category": "ground_conv", "text": "What is the setting of this image? Please provide the coordinates of all objects mentioned in your explanation."}
{"question_id": 25, "image": "000000305695.jpg", "category": "ground_conv", "text": "Where are the zebras and what is near them? Provide coordinates of the mentioned objects."}
{"question_id": 26, "image": "000000326174.jpg", "category": "ground_conv", "text": "Who is holding a surfboard and provide me the coordinates of the surfboard?"}
{"question_id": 27, "image": "000000562207.jpg", "category": "ground_conv", "text": "What are the people doing and provide the coordinates of the mentioned objects."}
{"question_id": 28, "image": "000000543300.jpg", "category": "ground_conv", "text": "What are some of the features of the boat and provide me the coordinates of mentioned objects?"}
{"question_id": 29, "image": "000000241668.jpg", "category": "ground_conv", "text": "Who is holding the cake and what is she wearing, also tell me the coordinates of the mentioned objects?"}
{"question_id": 30, "image": "000000535578.jpg", "category": "ground_conv", "text": "What is happening in the field and provide the coordinates of mentioned objects?"}
{"question_id": 31, "image": "000000443969.jpg", "category": "ground_conv", "text": "Who is holding the umbrella and provide me the coordinates of the umbrella?"}
{"question_id": 32, "image": "000000329219.jpg", "category": "ground_conv", "text": "What's happening in the kitchen? Could you provide coordinates for the objects you mention?"}
{"question_id": 33, "image": "000000421923.jpg", "category": "ground_conv", "text": "What objects are present on the shelf and provide me the coordinates of mentioned objects?"}
{"question_id": 34, "image": "000000376900.jpg", "category": "ground_conv", "text": "What is the man doing in the image and tell me the coordinates of the mentioned objects?"}
{"question_id": 35, "image": "000000513567.jpg", "category": "ground_conv", "text": "Who are the two people in the image and what are they doing? Tell me the coordinates of mentioned objects."}
{"question_id": 36, "image": "000000058393.jpg", "category": "ground_conv", "text": "Who are sitting on the bench and what are they doing? Tell me the coordinates of mentioned objects."}
{"question_id": 37, "image": "000000010764.jpg", "category": "ground_conv", "text": "What is the catcher doing and tell me the coordinates of the catcher."}
{"question_id": 38, "image": "000000271402.jpg", "category": "ground_conv", "text": "What are the girls holding and provide me the coordinates of these items?"}
{"question_id": 39, "image": "000000018519.jpg", "category": "ground_conv", "text": "Can you describe the safety measures the skater has taken and provide the coordinates of these measures? "}
{"question_id": 40, "image": "000000106048.jpg", "category": "ground_conv", "text": "What are the objects on the bus and provide me the coordinates of these objects?"}

View File

@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000069138.jpg", "category": "refer_desc", "text": "The object is a sign that is placed on a building. The building is offering short term office space, which is stated on the sign. This sign is part of the overall street scene, with other objects nearby, such as a stop light on a pole and a road. The sign is directly related to the building, contributing to the information about the building's function to the overall scene."}
{"question_id": 1, "image": "000000131138.jpg", "category": "refer_desc", "text": "The object is a plate, and on top of this plate is a fork. It's located on a desk. Nearby, there's also a cup and a computer mouse."}
{"question_id": 2, "image": "000000475150.jpg", "category": "refer_desc", "text": "The region is primarily occupied by the face of a giraffe, which is white and brown in color. This portion of the giraffe's body is in close proximity to its neck, which is covered with wrinkles. The giraffe's eye and ear are also located within this region. The giraffe is standing in front of a tree, and its face and neck are spotted with tan and brown spots. The background is filled with branches of the tree and the sky."}
{"question_id": 3, "image": "000000356424.jpg", "category": "refer_desc", "text": "In the region, there's a woman who seems to be having lunch. She is sitting near a table where there are several objects, including a plate of food that contains raspberries, a bottle, and a glass of water. There is also a man nearby, who is wearing glasses and a striped orange and black shirt. Both of them seem to be engaged in a meal."}
{"question_id": 4, "image": "000000491090.jpg", "category": "refer_desc", "text": "The region corresponds to the back wheel of a motorcycle, which is an integral part of the black motorcycle with silver accessories. Close to the back wheel, you can find the tail pipe of the motorcycle located on the left. Above the back wheel, there's a person sitting on the motorcycle, wearing a sweater, jeans, and sneakers."}
{"question_id": 5, "image": "000000484415.jpg", "category": "refer_desc", "text": "In the region, there is a container and a toilet brush cleaner. This region is right next to the toilet bowl, indicating that the brush cleaner is accessible for bathroom cleaning. It's crucial for maintaining the cleanliness of the toilet nearby."}
{"question_id": 7, "image": "000000184324.jpg", "category": "refer_desc", "text": "The region is a crosswalk on a busy city street, highlighted by white stripes. It's being used by a cyclist and a group of people who are crossing the street. There are bikes on the road, and cars are also visible within the vicinity. A large white vehicle with a big windshield is also nearby. This bustling scene is typical for a city intersection."}
{"question_id": 8, "image": "000000341058.jpg", "category": "refer_desc", "text": "The object is a salt shaker. It is located on a table along with a napkin and another shaker, which contains pepper. The table appears to be set for dining at a restaurant, as indicated by the presence of these objects."}
{"question_id": 9, "image": "000000184384.jpg", "category": "refer_desc", "text": "In the region, there is a plate with various types of food on it. This includes a sausage, an egg, and a few other unidentified items. The food is attractively arranged on the plate, which is positioned towards the back of the table. The plate and its contents seem to be part of a larger meal setup on the table."}
{"question_id": 10, "image": "000000259097.jpg", "category": "refer_desc", "text": "Region is full of trees and there is a village on a hill in the distance. These trees and buildings are located behind a grassy field where a man is seen jumping to catch a frisbee. The man's shadow can be seen on the grass."}
{"question_id": 11, "image": "000000377882.jpg", "category": "refer_desc", "text": "The region contains a black fence pole, which seems to be part of a chain-link fence enclosing the area. This fence is next to a water way and encloses several boats and surfboards. There are buildings on the horizon, and some green shrubs growing along the side of the lake."}
{"question_id": 12, "image": "000000415748.jpg", "category": "refer_desc", "text": "The region contains an elephant, which is quite large. There's a man riding on the back of the elephant, and they are moving close to a building. The shadow of the elephant can be seen on the ground. Additionally, the elephant's face and trunk are painted, which indicates some cultural significance."}
{"question_id": 13, "image": "000000408120.jpg", "category": "refer_desc", "text": "In the region, there is a concrete surface which is part of the alley. It is placed alongside the curb and the road, and there is a car parked on it. Also, nearby, there is a girl holding an umbrella walking along this path."}
{"question_id": 14, "image": "000000184400.jpg", "category": "refer_desc", "text": "In the region, there is a metal support column. This column is providing support for a bridge above it, which a train is passing over. The column also features a red line on it. This region is part of a larger scene that includes a train track on an elevated bridge."}
{"question_id": 15, "image": "000000276018.jpg", "category": "refer_desc", "text": "The region is occupied by a boy who is wearing a black jacket. He is holding a brown stuffed dog with a red and white collar. The boy seems to be part of a larger group of children who are all holding various stuffed animals and dolls. They seem to be walking across some grassy area, possibly in some kind of event or gathering."}
{"question_id": 16, "image": "000000376322.jpg", "category": "refer_desc", "text": "In the region, there is a man wearing a green shirt. He is sitting at a table, presumably in a social setting, along with other people. The table is full of items such as plates, glasses, and a decanter. One of the significant interactions is that the man is engaged in a conversation with the people around him."}
{"question_id": 17, "image": "000000125472.jpg", "category": "refer_desc", "text": "This region is primarily occupied by a man, who appears to be in mid-air, performing a trick on a skateboard. The skateboard is beneath him. He is wearing jeans and shoes with laces, and has a bracelet on his wrist. In the background of this region, there are trees, a building, and a fence. The scene seems to be taking place in a stadium, as there are stadium lights on poles in the vicinity."}
{"question_id": 18, "image": "000000361551.jpg", "category": "refer_desc", "text": "This region features a woman, who is dressed in a sleeveless black top. She is bending over her luggage, possibly preparing or checking something inside it. The woman is wearing a black and white headband as well. She is located in the service area of an airport, where there are other people standing around as well, some of them are holding their luggage. This scene is quite typical in an airport setting where passengers are usually seen handling their luggage."}
{"question_id": 19, "image": "000000412240.jpg", "category": "refer_desc", "text": "This region primarily contains a shoe. The shoe appears to be placed on a floor, and light is reflecting off of it. A dog is sitting nearby on the floor as well, and the shoe is positioned next to the dog. The shoe features several distinct elements like laces, a heel, and a toe."}
{"question_id": 20, "image": "000000130566.jpg", "category": "refer_desc", "text": "The region features windows on the side of a train engine. The train itself is traveling down a set of tracks, which are part of a larger railway system that includes multiple sets of tracks on the ground. Nearby, there are also electric lines hanging above the tracks. Further off, there are buildings, trees, and a wall, which add to the overall rural setting."}
{"question_id": 21, "image": "000000421923.jpg", "category": "refer_desc", "text": "The object is a vase, and the object is a flower. The flower is in the vase, suggesting it is a decorative element within the room."}
{"question_id": 22, "image": "000000513567.jpg", "category": "refer_desc", "text": "A woman, who is wearing a brown shirt and jeans, is crossing the street."}
{"question_id": 23, "image": "000000543300.jpg", "category": "refer_desc", "text": "The region is displaying red letters. These letters are on the side of a large, white boat that's sitting in the water. The boat has two levels and there is a set of long, black windows on its side. A silver railing is present on the top level of the boat. Close to the boat, there are buildings with red roofs and outdoor canopies. There's also a blue container on the dock, and a gray sea wall next to the ship."}
{"question_id": 24, "image": "000000241668.jpg", "category": "refer_desc", "text": "In the region, there is a woman with red hair. She's wearing a tie and a suit jacket, and is holding a plate with a piece of cake. The woman is dressed in formal attire, suggesting that she's attending a special occasion like a wedding."}
{"question_id": 25, "image": "000000535578.jpg", "category": "refer_desc", "text": "The region contains rocks and grass, providing a background for the pasture. Nearby, there are white sheep grazing in the green grassy field. There are also trees and a bush in the vicinity. A stone wall is running across the grassy field, bordering it. Besides, there's a hill in the field where some sheep and a rock are located."}
{"question_id": 26, "image": "000000277051.jpg", "category": "refer_desc", "text": "In this region, a bird is standing on the edge of a table. The table is covered with a red tablecloth and there are several objects on it, including a plate with food and crumbs, a bottle, and a steak knife. The bird is close to the knife and the plate with food. There's also a chair next to the table."}
{"question_id": 27, "image": "000000018519.jpg", "category": "refer_desc", "text": "The region contains a black wrist guard that the skater is wearing. This wrist guard is part of the safety gear that the skater has on, which also includes a black helmet, elbow pad, knee pad, and a pair of roller skates. The skater is performing a trick at the skate park, his shadow is cast on the cement ramp, and there is a grey post to a metal fence at the top of the ramp nearby. Overall, this region is an important part of the scene, showing the skater's safety equipment."}
{"question_id": 28, "image": "000000106048.jpg", "category": "refer_desc", "text": "This is a large decorated white bus. It seems to be driving past a tall building. You can see \"Divine Transportation\" written on the front of the bus. There's also a bus identification number on top. The bus features a design, including stripes, and there are headlights at the front. You can also see the side mirrors and wheels. Behind the bus, there's a gray trash can next to some large green bushes."}
{"question_id": 29, "image": "000000058393.jpg", "category": "refer_desc", "text": "The region includes a man who is sitting on a bench. He has his arm around a woman, indicating a close relationship between them. They are both looking towards the ocean, suggesting that they are enjoying the view together. The bench they are sitting on is in front of the ocean."}
{"question_id": 30, "image": "000000010764.jpg", "category": "refer_desc", "text": "This region is occupied by a baseball player wearing knee and leg pads. These pads are a part of the player's protective gear. The player, dressed as a catcher, is crouched on the field, ready to catch a ball. He is in a white uniform, which includes pants with a line on them, and he's wearing sneakers. His gloved hand is extended, prepared to receive. We can also see a black and red wrist band on his wrist. The field beneath him is brown dirt, contrasting with the green grass in the rest of the baseball field. Nearby, there are white chalk lines painted on the field."}
{"question_id": 31, "image": "000000271402.jpg", "category": "refer_desc", "text": "This region contains a little girl who is standing near a scooter. The scooter has an orange board and black handles, and it's specifically located to the right of her. The girl has blonde hair and she's wearing white socks. She is also standing on the pavement."}
{"question_id": 32, "image": "000000273493.jpg", "category": "refer_desc", "text": "In this region, a man in white clothing is preparing to hit a yellow tennis ball with his racket. He is on a tennis court with white boundary lines and a net in front of him. Behind him, there are a fence, trimmed bushes, and tall trees in the distance."}
{"question_id": 33, "image": "000000360960.jpg", "category": "refer_desc", "text": "The region is where a man is found wearing a pair of pants. This man is also wearing a long black coat. He seems to be walking on a sidewalk or decorative square, which fills the background of the image."}
{"question_id": 34, "image": "000000452122.jpg", "category": "refer_desc", "text": "In the region, there is an airplane's engine. The airplane seems to be in mid-flight, given the sky that surrounds it. The front door of the airplane is also visible in this region. The plane appears to be a commercial airline, as indicated by visible letters and windows. Notably, the landing gear of the airplane is lowered, suggesting that it's preparing to land."}
{"question_id": 35, "image": "000000134722.jpg", "category": "refer_desc", "text": "The region contains the front window of a train, which has windshield wipers. This window is part of the front of the train, which is painted yellow and white. Also, the region is located near the headlights of the train."}
{"question_id": 36, "image": "000000039484.jpg", "category": "refer_desc", "text": "In this region, there are people sitting at a table, likely dining or socializing outside a restaurant. This area is part of a bustling city street, filled with various cars, some parked and others potentially in motion. There are numerous buildings nearby, with diverse businesses and stores. One notable building nearby even has a marquee sign indicating \"for lease\". This scene suggests that the region is in a vibrant urban setting, where people are engaging in day-to-day activities such as dining outdoors and commuting by car."}
{"question_id": 37, "image": "000000159311.jpg", "category": "refer_desc", "text": "The region is a patch of grass. There are two zebras standing in and grazing on this grass. They are feeding themselves and are near bushes and a tree."}
{"question_id": 38, "image": "000000326174.jpg", "category": "refer_desc", "text": "In the region, there's a man and a little girl, they seem to be having a conversation. The man is looking back to the girl, who is pulling a surfboard, probably getting ready to surf. They are part of a larger group of people who are heading to the water with their surfboards."}
{"question_id": 39, "image": "000000562207.jpg", "category": "refer_desc", "text": "In the region, there's a man standing wearing shorts. He is standing on the side of a lake, next to an elephant. The elephant is emerging from the water and seems to be interacting with the man and two other individuals not far from him. All three people appear to be tourists posing for a picture with the elephant. The surroundings include water, and some mountains and trees in the far distance, creating a serene and natural setting."}
{"question_id": 40, "image": "000000332318.jpg", "category": "refer_desc", "text": "Within the region, there is a cow. This cow is in a pasture, which is located near a mountainous area. The mountain is partially covered in snow. There are also multiple trailers in the pasture, and one of them appears to be storage for animal equipment. The pasture and its surroundings provide a peaceful and natural living environment for the cows."}

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000069138.jpg", "category": "refer_desc", "text": "What is the interaction between the object [0.621, 0.082, 0.772, 0.132] and its surroundings?"}
{"question_id": 1, "image": "000000131138.jpg", "category": "refer_desc", "text": "What is the interaction between the object [0.183, 0.799, 0.326, 0.896] and its surrounding?"}
{"question_id": 2, "image": "000000475150.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.288, 0.324, 0.572, 0.649] and its interaction with the surrounding areas?"}
{"question_id": 3, "image": "000000356424.jpg", "category": "refer_desc", "text": "What is happening in the region [0.528, 0.254, 0.717, 0.666] and what is its relationship to the surrounding objects?"}
{"question_id": 4, "image": "000000491090.jpg", "category": "refer_desc", "text": "What can be said about the region [0.102, 0.498, 0.329, 0.692] in relation to nearby objects or elements?"}
{"question_id": 5, "image": "000000484415.jpg", "category": "refer_desc", "text": "What can be observed in the region [0.716, 0.192, 0.894, 0.550] and how does it interact with the surroundings?"}
{"question_id": 7, "image": "000000184324.jpg", "category": "refer_desc", "text": "What is happening within the region [0.564, 0.771, 0.876, 0.991] and how is it related to the nearby objects?"}
{"question_id": 8, "image": "000000341058.jpg", "category": "refer_desc", "text": "What is the object [0.619, 0.838, 0.633, 0.850] and what is its relationship with nearby objects?"}
{"question_id": 9, "image": "000000184384.jpg", "category": "refer_desc", "text": "What can you tell about the objects found in the region [0.628, 0.120, 0.998, 0.389]?"}
{"question_id": 10, "image": "000000259097.jpg", "category": "refer_desc", "text": "What can be said about the region [0.012, 0.520, 0.996, 0.631] in relation to the surrounding areas?"}
{"question_id": 11, "image": "000000377882.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.242, 0.211, 0.302, 0.989] and its surrounding context?"}
{"question_id": 12, "image": "000000415748.jpg", "category": "refer_desc", "text": "What can you tell about the object [0.084, 0.438, 0.727, 0.954] and its interaction with nearby objects?"}
{"question_id": 13, "image": "000000408120.jpg", "category": "refer_desc", "text": "What can you see within the region [0.394, 0.565, 0.570, 0.718] and what is its interaction with nearby objects?"}
{"question_id": 14, "image": "000000184400.jpg", "category": "refer_desc", "text": "What is the interaction between the object [0.602, 0.837, 0.696, 0.997] and its surrounding objects?"}
{"question_id": 15, "image": "000000276018.jpg", "category": "refer_desc", "text": "What can you tell me about the region [0.071, 0.378, 0.498, 0.842] and its interactions with nearby objects?"}
{"question_id": 16, "image": "000000376322.jpg", "category": "refer_desc", "text": "What is the interaction between objects in the region [0.668, 0.252, 0.909, 0.622]?"}
{"question_id": 17, "image": "000000125472.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.201, 0.002, 0.940, 0.758] and its interaction with surrounding objects?"}
{"question_id": 18, "image": "000000361551.jpg", "category": "refer_desc", "text": "Can you tell me about the interaction happening in the region [0.957, 0.616, 0.997, 0.670] and its context?"}
{"question_id": 19, "image": "000000412240.jpg", "category": "refer_desc", "text": "What can be said about the region [0.002, 0.437, 0.720, 0.787] in terms of the surrounding objects and their interactions?"}
{"question_id": 20, "image": "000000130566.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.630, 0.471, 0.682, 0.550] and its interaction with the surrounding environment?"}
{"question_id": 21, "image": "000000421923.jpg", "category": "refer_desc", "text": "What is the relationship between the object [0.838, 0.774, 0.994, 0.974] and object [0.796, 0.462, 0.982, 0.550]?"}
{"question_id": 22, "image": "000000513567.jpg", "category": "refer_desc", "text": "What is happening in the region [0.926, 0.253, 0.998, 0.645]?"}
{"question_id": 23, "image": "000000543300.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.414, 0.691, 0.662, 0.725] and how it relates to the surroundings?"}
{"question_id": 24, "image": "000000241668.jpg", "category": "refer_desc", "text": "What is happening in the region [0.487, 0.136, 0.808, 0.986]?"}
{"question_id": 25, "image": "000000535578.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.189, 0.032, 0.703, 0.178] and its surrounding areas?"}
{"question_id": 26, "image": "000000277051.jpg", "category": "refer_desc", "text": "Describe the bird [0.384, 0.372, 0.698, 0.787] and its interactions with surrounding objects?"}
{"question_id": 27, "image": "000000018519.jpg", "category": "refer_desc", "text": "What are the details of the region [0.279, 0.524, 0.341, 0.570] and how does it relate to the nearby objects?"}
{"question_id": 28, "image": "000000106048.jpg", "category": "refer_desc", "text": "Can you describe what's happening in the region [0.222, 0.144, 0.820, 0.757]?"}
{"question_id": 29, "image": "000000058393.jpg", "category": "refer_desc", "text": "What can you say about the interaction between objects in the region [0.542, 0.343, 0.812, 0.493]?"}
{"question_id": 30, "image": "000000010764.jpg", "category": "refer_desc", "text": "Referencing the region [0.546, 0.625, 0.626, 0.801], can you describe what you see and how it interacts with the surrounding context?"}
{"question_id": 31, "image": "000000271402.jpg", "category": "refer_desc", "text": "What can you tell me about the region [0.060, 0.086, 0.462, 0.908] and its relation to nearby objects?"}
{"question_id": 32, "image": "000000273493.jpg", "category": "refer_desc", "text": "What is happening in the region [0.588, 0.327, 0.850, 0.703] with regard to its surroundings?"}
{"question_id": 33, "image": "000000360960.jpg", "category": "refer_desc", "text": "Can you describe the region [0.524, 0.740, 0.734, 0.856] and its interaction with the surroundings?"}
{"question_id": 34, "image": "000000452122.jpg", "category": "refer_desc", "text": "What is happening in the region [0.650, 0.428, 0.858, 0.600]?"}
{"question_id": 35, "image": "000000134722.jpg", "category": "refer_desc", "text": "What can you say about the region [0.320, 0.451, 0.460, 0.592] and its relation with nearby objects?"}
{"question_id": 36, "image": "000000039484.jpg", "category": "refer_desc", "text": "What is happening in the region [0.844, 0.777, 0.958, 0.897] and how does this relate to the surrounding area?"}
{"question_id": 37, "image": "000000159311.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.206, 0.853, 0.356, 0.982] considering the surrounding entities and their interactions?"}
{"question_id": 38, "image": "000000326174.jpg", "category": "refer_desc", "text": "Can you describe the interaction or relationship between the objects in the region [0.444, 0.459, 0.552, 0.853]?"}
{"question_id": 39, "image": "000000562207.jpg", "category": "refer_desc", "text": "Can you describe what's happening in the region [0.154, 0.392, 0.300, 0.964] and how it relates to nearby objects or individuals?"}
{"question_id": 40, "image": "000000332318.jpg", "category": "refer_desc", "text": "What can you tell about the region [0.436, 0.860, 0.454, 0.890] and how does it relate to the rest of the scene?"}

View File

@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000130566.jpg", "category": "refer_reason", "text": "The object is a windshield on the train. The windshield is typically used to provide visibility for the train operator while protecting them from wind, debris, and other elements that could obstruct their view or pose a safety risk while the train is in motion."}
{"question_id": 1, "image": "000000010764.jpg", "category": "refer_reason", "text": "The object is knee and leg pads which are worn by the player. These pads are used in sports like baseball to protect the player's knees and legs from injuries that could occur due to falls, abrupt movements, or impacts with other players or equipment like the ball."}
{"question_id": 2, "image": "000000184324.jpg", "category": "refer_reason", "text": "The object is a blue street sign with a white \"P\" on it. This type of sign typically indicates a parking zone or area. Therefore, its purpose is to provide information about the availability of parking spaces to drivers."}
{"question_id": 3, "image": "000000452122.jpg", "category": "refer_reason", "text": "The object is the landing gear of an airplane. The purpose of the landing gear is to support the airplane during landing and take-off. It also facilitates the airplane's movement when it is on the ground. In this image, the airplane appears to be in flight and the landing gear is lowered, which usually happens during landing or take-off procedures."}
{"question_id": 4, "image": "000000032334.jpg", "category": "refer_reason", "text": "The object is a pair of glasses that the woman is wearing. These glasses, also known as eyeglasses or spectacles, are typically used to correct vision, protecting the eyes, or for fashion. In this context, considering the woman's age and the style of the glasses, they are likely used to correct her vision."}
{"question_id": 5, "image": "000000360960.jpg", "category": "refer_reason", "text": "The man appears to be walking. He is wearing jeans and seems to be in an outdoor setting. Given the context of the image where several people are walking on a sidewalk and one man is holding a colorful umbrella, it is likely that the man might continue walking, possibly under the umbrella if it is raining or if he wants to take shelter from the sun."}
{"question_id": 7, "image": "000000376322.jpg", "category": "refer_reason", "text": "The people are sitting on both sides of a long table. On the table, there are plates of food, including bread and butter, and glasses of red wine. Some people are reading a menu, suggesting they are at a restaurant. Given this setting, it appears to be a group meal or gathering, possibly a celebration or a business meal."}
{"question_id": 8, "image": "000000271402.jpg", "category": "refer_reason", "text": "The object is a tennis racket held by a young girl. The purpose of the tennis racket is to hit a tennis ball in the sport of tennis. The girls seem to be dressed in tennis uniforms, suggesting they might be preparing to play or practicing the sport."}
{"question_id": 9, "image": "000000356424.jpg", "category": "refer_reason", "text": "The object is a sign, specifically a yellow closed sign with brown letters. This sign is typically used in commercial settings to indicate that a store or service is not currently open to the public. This could relate to the scene in the image, possibly indicating that the man and woman are having a private meal in a restaurant that is currently closed to the public."}
{"question_id": 10, "image": "000000131138.jpg", "category": "refer_reason", "text": "The object is a pair of headphones. In this setting, with a laptop, a monitor, a keyboard, and a mouse on the desk, the headphones are likely used for audio output. This could be for listening to music, taking video calls, or for any multimedia content that the user might be interacting with on the computer. The headphones help to keep the audio private and not disturb others in the environment, which is particularly useful in an office or shared workspace setting."}
{"question_id": 11, "image": "000000332318.jpg", "category": "refer_reason", "text": "The object is a trailer. Given the rural setting of the image, with cows in a pasture and a mountain backdrop, this trailer is most likely used for the transport of farm animals such as cows or horses. It could also be used for storing animal equipment, as it's common in such settings."}
{"question_id": 12, "image": "000000513567.jpg", "category": "refer_reason", "text": "The girl has her mouth open. Given the context of the image, this girl is holding a hot dog and looking at it, it's reasonable to infer that she is opening her mouth in anticipation of eating the hot dog. It's a common reaction when people are about to eat something delicious."}
{"question_id": 13, "image": "000000134722.jpg", "category": "refer_reason", "text": "The object is windshield wipers, located on the front window of a train. The purpose of these wipers is to clear rain, snow, and other debris from the windshield, to improve the driver's visibility during poor weather conditions."}
{"question_id": 14, "image": "000000341058.jpg", "category": "refer_reason", "text": "The object is a restaurant sign posted on a post. This is typically used to advertise and identify the restaurant to passers-by and potential customers. It can provide information such as the name of the restaurant, its logo, or other branding elements."}
{"question_id": 15, "image": "000000277051.jpg", "category": "refer_reason", "text": "The object is a bottle. Bottles are typically used to hold and store different types of liquids. In this context, it might be used to store a beverage for the meal."}
{"question_id": 16, "image": "000000376900.jpg", "category": "refer_reason", "text": "The object is a tennis racket, held by a man who is a tennis player. He is ready to serve the ball. Therefore, the tennis racket is used to hit the tennis ball in the game."}
{"question_id": 17, "image": "000000412240.jpg", "category": "refer_reason", "text": "The region corresponds to a date. Judging by the context of the image, which features a dog sitting on the floor next to a pair of shoes, it's likely this date represents when the photo was taken, possibly indicating a special moment or event."}
{"question_id": 18, "image": "000000179765.jpg", "category": "refer_reason", "text": "The object is a shock absorber on the bike. Its purpose is to absorb or dampen shock impulses. It does this by converting the kinetic energy of the shock into another form of energy (typically heat) which is then dissipated. In the context of the motorcycle, it is particularly useful in providing comfort and stability for the rider, especially when travelling over uneven or rough terrains."}
{"question_id": 19, "image": "000000329219.jpg", "category": "refer_reason", "text": "The object is a dark electric cord that is plugged into the wall. It is used to transmit electrical power from the outlet to an electrical device, such as a blender, allowing it to operate."}
{"question_id": 20, "image": "000000184384.jpg", "category": "refer_reason", "text": "The object is butter. The butter is spread on top of a blueberry cake. This could be done to add extra flavor to the cake, as butter can enhance the taste and texture of baked goods."}
{"question_id": 21, "image": "000000018519.jpg", "category": "refer_reason", "text": "The region refers to a black wrist guard on a man's wrist. The wrist guard is used by the skater to protect his wrist while performing tricks, as it can help prevent injuries in case of a fall."}
{"question_id": 22, "image": "000000415748.jpg", "category": "refer_reason", "text": "The object is a tusk, which is part of an elephant. The unusual thing about this tusk is that it is on the face of the elephant, indicating that the elephant is an adult, as tusks only grow in after an elephant has reached maturity. In addition, the tusk is part of the painted decoration on the elephant's face, which is not a common sight and is typically associated with specific cultural practices or festivals."}
{"question_id": 23, "image": "000000543300.jpg", "category": "refer_reason", "text": "The region includes red letters on the side of a cruise ship. The purpose of these letters is typically to display the name of the ship or to showcase some identifying information about the ship, such as the port of registry. In this case, the letters likely serve as a way to identify the cruise ship as the \"Port River Dolphin Cruise\" ship. The letters are important for communication and identification purposes in maritime navigation."}
{"question_id": 24, "image": "000000349184.jpg", "category": "refer_reason", "text": "The object is a purse. The purpose of a purse is generally to carry personal items such as wallet, keys, cosmetics, and other small belongings. Looking at the image, the purse seems to be used by the woman who is sitting on the bench. It's a common accessory for people, especially women, when they go out."}
{"question_id": 25, "image": "000000042070.jpg", "category": "refer_reason", "text": "The object is a display on the front of a bus. The purpose of this display is to show the bus route name and number. This helps passengers identify the route and destination of the bus."}
{"question_id": 26, "image": "000000241668.jpg", "category": "refer_reason", "text": "The object is a ring. Given the context of the image, which features two people in formal wedding attire, one wearing a wedding dress and the other holding a piece of cake, it can be inferred that this ring might be a wedding ring. Therefore, it's highly possible that a wedding ceremony is taking place."}
{"question_id": 27, "image": "000000535578.jpg", "category": "refer_reason", "text": "The object is a stone wall that borders the field where the sheep are grazing. It serves as a boundary to keep the sheep contained within the field, preventing them from wandering off into unwanted areas. This is a common practice in sheep farming to manage and protect the herd."}
{"question_id": 28, "image": "000000484415.jpg", "category": "refer_reason", "text": "The object is a brush, specifically a toilet scrubber. This tool is designed for cleaning the inside of a toilet bowl. It typically has stiff bristles and a long handle to allow for efficient cleaning while keeping the user's hand away from the toilet water and bowl."}
{"question_id": 29, "image": "000000491090.jpg", "category": "refer_reason", "text": "The object is a small circular orange indicator light on the motorcycle. Its primary function is to indicate the direction or intention of the motorcycle's movement, typically used when the rider is about to make a turn or change lanes. It enhances safety by signaling the rider's intentions to other road users."}
{"question_id": 30, "image": "000000276018.jpg", "category": "refer_reason", "text": "The item is a hat, worn by a boy. The boy is outside walking with a group of other kids, all holding stuffed animals. The hat is likely being used to protect the boy's head from the sun."}
{"question_id": 31, "image": "000000361551.jpg", "category": "refer_reason", "text": "The object is a cap. It's worn by a man in the service area of what appears to be an airport. The cap is used for various purposes such as protection from the sun, keeping the head warm, or as a fashion accessory. In this case, it could be used for any of these purposes, or even to help identify the man as part of a particular group or organization."}
{"question_id": 32, "image": "000000562207.jpg", "category": "refer_reason", "text": "The object is a bucket carried by a woman. Given that they are located beside an elephant on the side of a lake, it's likely that the bucket could be used for feeding or bathing the elephant."}
{"question_id": 33, "image": "000000553990.jpg", "category": "refer_reason", "text": "The object is a bridal. A girl is seen riding a horse, and the bridal is being used by her to control and guide the horse. The bridal allows the rider to communicate with the horse by applying pressure on the horse's mouth, head, and neck. This is particularly important in activities such as the horse jumping event depicted."}
{"question_id": 34, "image": "000000106048.jpg", "category": "refer_reason", "text": "The region has the text \"Divine Transportation\" written on the front of a bus. This is likely the name of the bus company or service. It's a common practice to display the company name on the front and sides of the bus. This serves the purpose of advertising the bus service and allowing passengers and others to identify the company operating the bus."}
{"question_id": 35, "image": "000000421923.jpg", "category": "refer_reason", "text": "The object is a vase. It is filled with water and contains several flowers. Therefore, its purpose is decorative: it is used to display and support the flowers, enhancing the aesthetics of the room."}
{"question_id": 36, "image": "000000273493.jpg", "category": "refer_reason", "text": "The object is a black circular sign with the number five on it. In the context of this image, this is likely a score or court number sign in a tennis court. So, its function is to indicate the score or the court number to help players, referees, and spectators keep track of the game progress or location."}
{"question_id": 37, "image": "000000475150.jpg", "category": "refer_reason", "text": "The region refers to the lower part of a giraffe's neck and its body. The pattern in this region is spotted, with tan and brown colors. This pattern is a typical skin coloration of giraffes, indicating that the object is indeed a giraffe. The pattern provides camouflage and helps the giraffe blend in with its natural environment."}
{"question_id": 38, "image": "000000125472.jpg", "category": "refer_reason", "text": "The object is a skateboard. The skateboard was popularly invented in the 1940s or 1950s when Californian surfers were looking for a way to surf on land when the waves were not suitable for surfing. The current image shows a man performing a trick on a skateboard, which indicates the evolved use of skateboards for recreational and sporting activities, particularly in skateboarding sports and competitions."}
{"question_id": 39, "image": "000000069138.jpg", "category": "refer_reason", "text": "The object is a sign. The sign is providing information that the building nearby offers short term office space, as small as 2,500 sq. ft. Therefore, the purpose of this sign is to inform potential customers or tenants about the availability and flexibility of office space in that building."}
{"question_id": 40, "image": "000000408120.jpg", "category": "refer_reason", "text": "The object is an umbrella, held by a little girl who is wearing a pink dress. The purpose of the umbrella in this scene is likely to protect the girl from the weather. Since there's no indication of rain in the image, it might be used to shield her from the sun."}

View File

@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000130566.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : buds at [0.130, 0.814, 0.334, 0.883].\nObject 1 : building at [0.622, 0.213, 0.708, 0.273].\nObject 2 : building at [0.708, 0.222, 0.994, 0.294].\nObject 3 : building at [0.472, 0.240, 0.602, 0.282].\nObject 4 : cars at [0.628, 0.411, 0.912, 0.739].\nObject 5 : electric lines at [0.000, 0.000, 0.912, 0.126].\nObject 6 : gravel at [0.382, 0.381, 0.878, 0.907].\nObject 7 : leaves at [0.736, 0.357, 0.764, 0.390].\nObject 8 : pole at [0.550, 0.589, 0.558, 0.724].\nObject 9 : sky at [0.322, 0.093, 0.852, 0.162].\nObject 10 : tracks at [0.382, 0.429, 0.502, 0.511].\nObject 11 : tracks at [0.374, 0.408, 0.692, 0.709].\nObject 12 : tracks at [0.706, 0.775, 0.942, 0.922].\nObject 13 : train at [0.016, 0.273, 0.906, 0.733].\nObject 14 : train tracks at [0.024, 0.291, 0.996, 0.997].\nObject 15 : tree at [0.760, 0.279, 0.998, 0.426].\nObject 16 : wall at [0.556, 0.721, 0.790, 0.991].\nObject 17 : windshield at [0.850, 0.523, 0.898, 0.583].\nObject 18 : windshield at [0.796, 0.526, 0.846, 0.580].\n\nRelationships:\nobject 18 : windshield -> on a -> object 13 : train.\nobject 12 : tracks -> for a -> object 13 : train.\nobject 15 : tree -> with -> object 7 : leaves.\nobject 5 : electric lines -> on -> object 14 : train tracks.\nobject 8 : pole -> beside -> object 13 : train.\nobject 16 : wall -> beside -> object 13 : train.\nobject 13 : train -> traveling down -> object 11 : tracks.\n\nRegion Description:\nRegion Description at [0.022, 0.258, 0.632, 0.679] : THESE CARS ARE FOR CARGO NOT PASSENGERS.\nRegion Description at [0.630, 0.471, 0.682, 0.550] : THE WINDOWS ARE ON THE SIDE OF THE ENGINE.\nRegion Description at [0.000, 0.024, 0.448, 0.144] : electric lines hanging above train tracks.\nRegion Description at [0.532, 0.571, 0.568, 0.727] : black metal pole beside train tracks.\nRegion Description at [0.782, 0.586, 0.918, 0.667] : yellow paint on the front of the train.\nRegion Description at [0.062, 0.300, 0.996, 0.997] : multiple sets of tracks on the ground.\nRegion Description at [0.026, 0.114, 0.950, 0.970] : a freight train travelling down the tracks.\nRegion Description at [0.054, 0.685, 0.684, 0.991] : wildflowers on the side of a train track.\nRegion Description at [0.002, 0.129, 0.998, 0.991] : the grass and trees around the tracks.\n\nGlobal Caption:\nA yellow train on the tracks with several cars\nA train pulls past an intersection in the rail in a rural area.\na long cargo train going down a track by some trees \nA train with a red and yellow engine on a railroad track.\nA train pulls a large number of cars through a junction."}
{"question_id": 1, "image": "000000010764.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : catcher at [0.334, 0.193, 0.756, 0.940].\nObject 1 : field at [0.000, 0.000, 0.998, 0.997].\nObject 2 : glove at [0.660, 0.492, 0.764, 0.674].\nObject 3 : hand at [0.666, 0.498, 0.748, 0.665].\nObject 4 : helmet at [0.472, 0.187, 0.610, 0.444].\nObject 5 : jersey at [0.340, 0.332, 0.556, 0.695].\nObject 6 : line at [0.396, 0.656, 0.560, 0.731].\nObject 7 : lines at [0.866, 0.927, 1.000, 0.997].\nObject 8 : lines at [0.754, 0.837, 0.998, 0.867].\nObject 9 : pads at [0.562, 0.668, 0.634, 0.782].\nObject 10 : pants at [0.336, 0.640, 0.612, 0.858].\nObject 11 : sneakers at [0.406, 0.834, 0.544, 0.946].\nObject 12 : stripe at [0.608, 0.737, 0.998, 0.795].\nObject 13 : wrist band at [0.586, 0.583, 0.604, 0.640].\n\nRelationships:\nobject 0 : catcher -> in -> object 1 : field.\nobject 2 : glove -> on -> object 3 : hand.\nobject 6 : line -> on -> object 10 : pants.\n\nRegion Description:\nRegion Description at [0.546, 0.625, 0.626, 0.801] : The player is wearing knee and leg pads..\nRegion Description at [0.018, 0.665, 0.280, 0.825] : A brown dirt ground surface on a baseball field.\nRegion Description at [0.676, 0.701, 0.974, 0.979] : White chalk lines painted on a baseball field.\nRegion Description at [0.062, 0.130, 0.370, 0.535] : A green grass ground surface of a baseball field.\nRegion Description at [0.566, 0.580, 0.620, 0.656] : A black and red bracelet on a man's wrist.\n\nGlobal Caption:\nA catches crouches on a patch of dirt.\nA catcher squatting at a base with his gloved hand extended.\nA baseball catcher stands ready to catch a ball.\na catcher kneeling at the mound waiting for a baseball \nA catcher in white uniform during a baseball game."}
{"question_id": 2, "image": "000000184324.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : awning at [0.514, 0.500, 0.736, 0.545].\nObject 1 : bag at [0.086, 0.723, 0.124, 0.777].\nObject 2 : bicycle at [0.716, 0.660, 0.756, 0.738].\nObject 3 : bikes at [0.710, 0.753, 0.864, 0.934].\nObject 4 : black jacket at [0.052, 0.569, 0.120, 0.723].\nObject 5 : blue jeans at [0.654, 0.678, 0.672, 0.729].\nObject 6 : building at [0.540, 0.042, 0.760, 0.617].\nObject 7 : building at [0.706, 0.000, 0.998, 0.726].\nObject 8 : bus at [0.186, 0.491, 0.246, 0.608].\nObject 9 : car at [0.432, 0.557, 0.538, 0.636].\nObject 10 : cars at [0.130, 0.491, 0.756, 0.630].\nObject 11 : coat at [0.128, 0.602, 0.212, 0.798].\nObject 12 : cross walk at [0.428, 0.750, 0.954, 1.000].\nObject 13 : cyclist at [0.752, 0.614, 0.860, 0.792].\nObject 14 : lines at [0.432, 0.608, 0.948, 1.000].\nObject 15 : man at [0.052, 0.518, 0.132, 0.898].\nObject 16 : people at [0.000, 0.515, 0.212, 1.000].\nObject 17 : people at [0.754, 0.605, 0.858, 0.756].\nObject 18 : pole at [0.954, 0.699, 0.970, 0.777].\nObject 19 : road at [0.004, 0.545, 1.000, 1.000].\nObject 20 : scarf at [0.032, 0.873, 0.134, 0.997].\nObject 21 : sidewalk at [0.536, 0.572, 0.668, 0.623].\nObject 22 : sign at [0.482, 0.470, 0.494, 0.494].\nObject 23 : sign at [0.810, 0.407, 0.970, 0.497].\nObject 24 : sign at [0.584, 0.434, 0.614, 0.494].\nObject 25 : store at [0.806, 0.395, 0.968, 0.720].\nObject 26 : street light at [0.640, 0.461, 0.652, 0.485].\nObject 27 : stripes at [0.452, 0.620, 0.944, 0.982].\nObject 28 : tires at [0.712, 0.747, 0.864, 0.931].\nObject 29 : tree at [0.280, 0.358, 0.340, 0.569].\nObject 30 : van at [0.460, 0.545, 0.488, 0.566].\nObject 31 : window at [0.820, 0.217, 0.884, 0.358].\nObject 32 : windshield at [0.192, 0.512, 0.242, 0.548].\nObject 33 : woman at [0.128, 0.569, 0.212, 0.913].\nObject 34 : woman at [0.650, 0.593, 0.688, 0.729].\nObject 35 : woman at [0.020, 0.765, 0.168, 1.000].\nObject 36 : writing at [0.838, 0.422, 0.948, 0.482].\n\nRelationships:\nobject 3 : bikes -> are on -> object 19 : road.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 17 : people -> are riding -> object 3 : bikes.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 17 : people -> are on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 8 : bus -> on -> object 19 : road.\nobject 12 : cross walk -> being used by a -> object 13 : cyclist.\nobject 17 : people -> are using -> object 12 : cross walk.\nobject 0 : awning -> above -> object 21 : sidewalk.\nobject 10 : cars -> are on -> object 19 : road.\nobject 26 : street light -> on -> object 6 : building.\nobject 27 : stripes -> on -> object 12 : cross walk.\nobject 7 : building -> has a -> object 31 : window.\nobject 3 : bikes -> have -> object 28 : tires.\nobject 35 : woman -> wearing a -> object 20 : scarf.\nobject 23 : sign -> for -> object 25 : store.\nobject 33 : woman -> wearing a -> object 11 : coat.\nobject 34 : woman -> wearing -> object 5 : blue jeans.\nobject 3 : bikes -> are on -> object 19 : road.\nobject 14 : lines -> are on -> object 19 : road.\nobject 15 : man -> wearing a -> object 4 : black jacket.\nobject 30 : van -> on -> object 19 : road.\nobject 15 : man -> has a -> object 1 : bag.\nobject 8 : bus -> has a -> object 32 : windshield.\nobject 7 : building -> has a -> object 31 : window.\nobject 31 : window -> above -> object 23 : sign.\nobject 14 : lines -> are on -> object 19 : road.\nobject 18 : 
pole -> near -> object 7 : building.\nobject 35 : woman -> wearing a -> object 20 : scarf.\n\nRegion Description:\nRegion Description at [0.822, 0.395, 0.968, 0.500] : red writing above buisness along the street.\nRegion Description at [0.564, 0.771, 0.876, 0.991] : white stripes painted to indicate cross walk.\nRegion Description at [0.184, 0.485, 0.244, 0.605] : large white vehicle with big windshield.\nRegion Description at [0.478, 0.464, 0.492, 0.491] : blue street sign with a white P on it.\nRegion Description at [0.820, 0.220, 0.886, 0.370] : window on the building above red sign.\n\nGlobal Caption:\nA group of people walking across a busy city street.\nA fish eye lens shows the corner of a busy city street with bikes, people and buildings.\na number of people and cars on a city street\nAn oddly taken photo of some buildings and shops.\nA picture of a city intersection with period buildings and store fronts. "}
{"question_id": 3, "image": "000000452122.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : airline at [0.408, 0.420, 0.758, 0.502].\nObject 1 : airplane at [0.112, 0.300, 0.858, 0.640].\nObject 2 : engine at [0.652, 0.529, 0.730, 0.592].\nObject 3 : engine at [0.494, 0.502, 0.574, 0.577].\nObject 4 : fin at [0.208, 0.303, 0.320, 0.492].\nObject 5 : fin at [0.116, 0.480, 0.284, 0.526].\nObject 6 : front door at [0.752, 0.435, 0.772, 0.483].\nObject 7 : gear at [0.450, 0.592, 0.600, 0.643].\nObject 8 : letters at [0.694, 0.489, 0.732, 0.520].\nObject 9 : name at [0.398, 0.426, 0.760, 0.489].\nObject 10 : sky at [0.000, 0.000, 0.998, 1.000].\nObject 11 : window at [0.806, 0.438, 0.844, 0.456].\nObject 12 : windows at [0.326, 0.450, 0.750, 0.532].\nObject 13 : wing at [0.152, 0.426, 0.598, 0.538].\nObject 14 : wing at [0.116, 0.492, 0.282, 0.538].\n\nRelationships:\nobject 6 : front door -> of -> object 1 : airplane.\n\nRegion Description:\n\nGlobal Caption:\nAn airplane flying in the air during the day.\nA large aircraft is shown in the air.\nThe large jumbo jet has it's landing gear lowered.\nA large white airplane flies in the gray sky.\nAn airplane in route with a cloudy sky behind it."}
{"question_id": 4, "image": "000000032334.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : akimbo at [0.232, 0.275, 0.294, 0.328].\nObject 1 : counter at [0.000, 0.552, 0.214, 0.995].\nObject 2 : eyeglasses at [0.080, 0.269, 0.182, 0.339].\nObject 3 : face at [0.358, 0.261, 0.520, 0.504].\nObject 4 : glass at [0.406, 0.517, 0.526, 0.763].\nObject 5 : glasses at [0.350, 0.325, 0.516, 0.373].\nObject 6 : green at [0.538, 0.584, 0.590, 0.648].\nObject 7 : green shirt at [0.228, 0.475, 0.626, 0.997].\nObject 8 : ground at [0.274, 0.421, 0.380, 0.525].\nObject 9 : hair at [0.314, 0.155, 0.560, 0.496].\nObject 10 : man at [0.226, 0.221, 0.292, 0.395].\nObject 11 : man at [0.346, 0.000, 1.000, 1.000].\nObject 12 : man at [0.044, 0.184, 0.282, 0.704].\nObject 13 : menu at [0.002, 0.640, 0.196, 0.728].\nObject 14 : menu at [0.000, 0.771, 0.144, 0.931].\nObject 15 : shirt at [0.604, 0.491, 1.000, 0.992].\nObject 16 : teeth at [0.706, 0.371, 0.808, 0.416].\nObject 17 : wine at [0.270, 0.675, 0.342, 0.715].\nObject 18 : wine at [0.416, 0.677, 0.514, 0.752].\nObject 19 : wine glass at [0.256, 0.512, 0.370, 0.829].\nObject 20 : wine glass at [0.000, 0.573, 0.034, 0.760].\nObject 21 : woman at [0.210, 0.171, 0.618, 0.997].\nObject 22 : woman at [0.132, 0.165, 0.630, 1.000].\n\nRelationships:\nobject 22 : woman -> has -> object 9 : hair.\nobject 16 : teeth -> of -> object 11 : man.\nobject 10 : man -> standing -> object 0 : akimbo.\nobject 14 : menu -> on -> object 1 : counter.\nobject 22 : woman -> drinking -> object 17 : wine.\nobject 3 : face -> of -> object 22 : woman.\nobject 5 : glasses -> on -> object 3 : face.\nobject 22 : woman -> wearing -> object 7 : green shirt.\nobject 20 : wine glass -> next to -> object 13 : menu.\nobject 11 : man -> holding -> object 4 : glass.\nobject 12 : man -> wearing -> object 2 : eyeglasses.\nobject 22 : woman -> holding -> object 19 : wine glass.\n\nRegion Description:\nRegion Description at [0.356, 0.307, 0.512, 0.389] : The woman is wearing corrective lenses.\n\nGlobal Caption:\nTwo people are smiling holding empty wine glasses.\nMan and woman doing a toast with a glass of wine.\nA man and a woman toast their wine glasses.\nSome friends pose for a picture while holding wine glasses.\nTwo people, a man and a woman, are toasting with wine glasses."}
{"question_id": 5, "image": "000000360960.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : coat at [0.405, 0.332, 0.835, 0.746].\nObject 1 : decorative square at [0.000, 0.382, 1.000, 1.000].\nObject 2 : hat at [0.006, 0.162, 0.072, 0.198].\nObject 3 : jacket at [0.078, 0.222, 0.318, 0.430].\nObject 4 : jeans at [0.853, 0.422, 1.000, 0.632].\nObject 5 : leg at [0.853, 0.456, 0.928, 0.610].\nObject 6 : leg at [0.210, 0.458, 0.303, 0.638].\nObject 7 : leg at [0.000, 0.458, 0.060, 0.630].\nObject 8 : man at [0.066, 0.162, 0.318, 0.686].\nObject 9 : man at [0.850, 0.156, 1.000, 0.652].\nObject 10 : man at [0.390, 0.344, 0.838, 0.894].\nObject 11 : pants at [0.523, 0.736, 0.739, 0.858].\nObject 12 : person at [0.000, 0.162, 0.135, 0.668].\nObject 13 : person at [0.853, 0.154, 1.000, 0.650].\nObject 14 : section at [0.000, 0.134, 1.000, 1.000].\nObject 15 : sidewalk at [0.000, 0.388, 1.000, 1.000].\nObject 16 : umbrella at [0.168, 0.106, 0.910, 0.366].\nObject 17 : uniform at [0.000, 0.222, 0.126, 0.646].\nObject 18 : uniform at [0.105, 0.218, 0.318, 0.628].\n\nRelationships:\nobject 10 : man -> wearing -> object 11 : pants.\nobject 10 : man -> wearing -> object 0 : coat.\nobject 9 : man -> wearing -> object 4 : jeans.\nobject 8 : man -> wearing -> object 2 : hat.\nobject 8 : man -> wearing -> object 3 : jacket.\nobject 16 : umbrella -> has -> object 14 : section.\nobject 5 : leg -> of -> object 13 : person.\nobject 7 : leg -> of -> object 12 : person.\nobject 12 : person -> in -> object 17 : uniform.\n\nRegion Description:\nRegion Description at [0.066, 0.164, 0.318, 0.686] : the back of a man in a black uniform.\nRegion Description at [0.393, 0.324, 0.871, 0.766] : THIS MAN IS WEARING A LONG BLACK COAT.\nRegion Description at [0.468, 0.142, 0.634, 0.356] : THIS IS A RED SECTION ON THE UMBRELLA.\nRegion Description at [0.168, 0.140, 0.523, 0.292] : THIS IS A YELLOW SECTION ON THE UMBRELLA.\nRegion Description at [0.568, 0.138, 0.919, 0.232] : THIS IS A GREEN SECTION OF THE UMBRELLA.\n\nGlobal Caption:\nSeveral people walking on a sidewalk, with one man holding an umbrella.\nA person walking while carrying a rainbow umbrella\nA person is holding up a large colorful umbrella\na person walking down the street carrying a rainbow colored umbrella\nA person walking in a square carrying a rainbow colored umbrella."}
{"question_id": 7, "image": "000000376322.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : beer at [0.404, 0.568, 0.505, 0.724].\nObject 1 : cell phone at [0.128, 0.726, 0.332, 0.784].\nObject 2 : decanter at [0.417, 0.416, 0.503, 0.574].\nObject 3 : fork at [0.142, 0.852, 0.553, 0.964].\nObject 4 : fork at [0.174, 0.616, 0.414, 0.660].\nObject 5 : fork at [0.107, 0.882, 0.545, 0.998].\nObject 6 : glass at [0.401, 0.568, 0.508, 0.726].\nObject 7 : glass at [0.773, 0.622, 0.880, 0.796].\nObject 8 : glasses at [0.013, 0.342, 0.139, 0.376].\nObject 9 : green shirt at [0.698, 0.376, 0.909, 0.620].\nObject 10 : hair at [0.607, 0.336, 0.743, 0.422].\nObject 11 : hair at [0.824, 0.244, 1.000, 0.474].\nObject 12 : man at [0.668, 0.252, 0.909, 0.622].\nObject 13 : man at [0.000, 0.304, 0.136, 0.808].\nObject 14 : plate at [0.102, 0.780, 0.404, 0.898].\nObject 15 : silver spoon at [0.698, 0.882, 0.799, 0.998].\nObject 16 : table at [0.000, 0.428, 0.997, 0.998].\nObject 17 : wall at [0.535, 0.194, 0.997, 0.370].\nObject 18 : watch at [0.570, 0.482, 0.596, 0.508].\nObject 19 : watch at [0.888, 0.486, 0.949, 0.514].\nObject 20 : white plate at [0.361, 0.712, 0.805, 0.860].\nObject 21 : woman at [0.813, 0.242, 1.000, 0.582].\nObject 22 : woman at [0.532, 0.338, 0.765, 0.550].\n\nRelationships:\nobject 21 : woman -> with -> object 11 : hair.\nobject 9 : green shirt -> on -> object 12 : man.\nobject 14 : plate -> on -> object 16 : table.\nobject 1 : cell phone -> on -> object 16 : table.\nobject 5 : fork -> on -> object 16 : table.\nobject 5 : fork -> on -> object 16 : table.\nobject 3 : fork -> on -> object 16 : table.\nobject 4 : fork -> on -> object 16 : table.\nobject 2 : decanter -> on -> object 16 : table.\nobject 12 : man -> wearing a -> object 9 : green shirt.\nobject 21 : woman -> wearing a -> object 19 : watch.\nobject 22 : woman -> wearing a -> object 18 : watch.\nobject 13 : man -> wearing -> object 8 : glasses.\nobject 10 : hair -> on -> object 22 : woman.\nobject 22 : woman -> at -> object 16 : table.\n\nRegion Description:\nRegion Description at [0.353, 0.700, 0.802, 0.860] : a round plate with six pieces of bread and two butter pats.\nRegion Description at [0.096, 0.778, 0.404, 0.892] : a plate with one slice of bread and one butter pat.\nRegion Description at [0.890, 0.698, 0.997, 0.992] : glass of red wine closest to the camera.\nRegion Description at [0.366, 0.710, 0.805, 0.856] : the round white plate under the bread and butter.\n\nGlobal Caption:\nA group of people are reading a menu at the table\nA group of people sit at a large table while talking.\nPeople sitting on the long table with plates of food. \nA long table full of people on both sides.\nA long table accommodating many people while eating"}
{"question_id": 8, "image": "000000271402.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : blonde hair at [0.193, 0.100, 0.375, 0.176].\nObject 1 : building at [0.804, 0.200, 0.906, 0.318].\nObject 2 : dress at [0.378, 0.284, 0.804, 0.652].\nObject 3 : fence at [0.607, 0.282, 0.997, 0.378].\nObject 4 : girl at [0.329, 0.148, 0.973, 0.892].\nObject 5 : girl at [0.057, 0.102, 0.456, 0.898].\nObject 6 : ground at [0.000, 0.374, 1.000, 0.916].\nObject 7 : hair at [0.320, 0.148, 0.517, 0.286].\nObject 8 : handle at [0.329, 0.432, 0.508, 0.480].\nObject 9 : handle at [0.091, 0.450, 0.299, 0.502].\nObject 10 : head at [0.335, 0.152, 0.508, 0.314].\nObject 11 : insignia at [0.447, 0.350, 0.502, 0.390].\nObject 12 : orange platform at [0.181, 0.816, 0.489, 0.998].\nObject 13 : orange wheel at [0.193, 0.820, 0.248, 0.876].\nObject 14 : pavement at [0.009, 0.370, 0.994, 0.996].\nObject 15 : racket at [0.462, 0.480, 0.713, 0.840].\nObject 16 : right shoe at [0.465, 0.778, 0.610, 0.886].\nObject 17 : scooter at [0.097, 0.424, 0.592, 0.996].\nObject 18 : shoe at [0.060, 0.794, 0.202, 0.902].\nObject 19 : shoe at [0.302, 0.780, 0.453, 0.874].\nObject 20 : skirt at [0.471, 0.514, 0.804, 0.654].\nObject 21 : sneaker at [0.849, 0.738, 0.970, 0.886].\nObject 22 : sock at [0.317, 0.776, 0.347, 0.798].\nObject 23 : sock at [0.130, 0.790, 0.184, 0.810].\n\nRelationships:\nobject 4 : girl -> on -> object 14 : pavement.\nobject 5 : girl -> wearing -> object 22 : sock.\nobject 5 : girl -> wearing -> object 23 : sock.\nobject 4 : girl -> wearing -> object 20 : skirt.\nobject 4 : girl -> holding -> object 15 : racket.\nobject 5 : girl -> with -> object 0 : blonde hair.\nobject 17 : scooter -> with -> object 8 : handle.\nobject 1 : building -> with -> object 3 : fence.\nobject 4 : girl -> with -> object 11 : insignia.\nobject 13 : orange wheel -> of -> object 17 : scooter.\n\nRegion Description:\nRegion Description at [0.858, 0.760, 0.970, 0.852] : Girl is wearing blue, white, pink, and gray shoes..\nRegion Description at [0.293, 0.136, 0.976, 0.884] : a little girl holding a tennis racket..\nRegion Description at [0.060, 0.086, 0.462, 0.908] : A little girl standing near a scooter..\nRegion Description at [0.308, 0.146, 0.985, 0.892] : young girl wearing velcro strapped tennis shoes.\nRegion Description at [0.082, 0.436, 0.601, 0.996] : orange scooter board with black handles.\nRegion Description at [0.755, 0.184, 0.973, 0.372] : a tall building with fence in foreground.\nRegion Description at [0.021, 0.096, 0.988, 0.928] : two young girls wearing white outfits.\nRegion Description at [0.311, 0.136, 0.991, 0.886] : young girl with insignia on white outfit.\nRegion Description at [0.175, 0.814, 0.266, 0.888] : orange colored back wheel of a scooter board.\nRegion Description at [0.453, 0.478, 0.725, 0.848] : lavender, yellow and pink colored tennis racket.\n\nGlobal Caption:\ntwo little girls in tennis uniforms standing next to a scooter\nTwo young girls with a tennis racket and a scooter.\nTwo little girls posing for a picture, on a tennis court.\nTwo young girls on a tennis court with a racquet and a scooter\nTwo cute girls with a scooter and tennis raquet."}
{"question_id": 9, "image": "000000356424.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bottle at [0.048, 0.712, 0.195, 1.002].\nObject 1 : chair at [0.696, 0.500, 1.003, 0.718].\nObject 2 : cork at [0.053, 0.712, 0.139, 0.776].\nObject 3 : cup at [0.043, 0.736, 0.240, 0.916].\nObject 4 : dish at [0.416, 0.726, 0.856, 0.904].\nObject 5 : fruit at [0.629, 0.834, 0.675, 0.880].\nObject 6 : glass at [0.275, 0.716, 0.501, 0.998].\nObject 7 : glasses at [0.179, 0.242, 0.464, 0.322].\nObject 8 : hair at [0.536, 0.258, 0.656, 0.320].\nObject 9 : man at [0.075, 0.102, 0.704, 0.716].\nObject 10 : rasberries at [0.499, 0.750, 0.544, 0.786].\nObject 11 : raspberries at [0.664, 0.828, 0.741, 0.864].\nObject 12 : sauce at [0.565, 0.752, 0.715, 0.824].\nObject 13 : shirt at [0.600, 0.350, 0.645, 0.494].\nObject 14 : shirt at [0.635, 0.282, 0.997, 0.654].\nObject 15 : sign at [0.419, 0.134, 0.509, 0.184].\nObject 16 : sweater at [0.072, 0.288, 0.704, 0.718].\nObject 17 : table at [0.000, 0.592, 0.997, 1.000].\nObject 18 : window at [0.328, 0.000, 0.600, 0.298].\nObject 19 : woman at [0.531, 0.258, 0.768, 0.688].\n\nRelationships:\nobject 9 : man -> wearing -> object 7 : glasses.\nobject 0 : bottle -> on -> object 17 : table.\nobject 6 : glass -> on -> object 17 : table.\nobject 11 : raspberries -> on -> object 4 : dish.\nobject 9 : man -> wearing -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.640, 0.180, 0.989, 0.530] : Man wearing a black and orange stripe shirt.\nRegion Description at [0.413, 0.136, 0.512, 0.184] : Yellow closed sign with brown letters.\nRegion Description at [0.629, 0.186, 0.995, 0.706] : a man wearing and orange and black striped shirt.\nRegion Description at [0.528, 0.254, 0.717, 0.666] : a woman with a ponytail eating lunch.\nRegion Description at [0.152, 0.238, 0.459, 0.322] : a pair of black wire rimmed eye glasses.\nRegion Description at [0.029, 0.716, 0.243, 0.922] : empty cup that used to contain coffee.\nRegion Description at [0.264, 0.708, 0.867, 0.994] : A plate of food with a glass of water.\n\nGlobal Caption:\nA man sitting in front of a plate of food.\nA man at a wooden table looking at a plate of food.\na man smiling while looking at his plate of food\nA man sitting at a table with a plate filled with food.\nA man looking happily at some dish in front of him."}
{"question_id": 10, "image": "000000131138.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : computer mouse at [0.414, 0.753, 0.470, 0.811].\nObject 1 : cup at [0.350, 0.783, 0.417, 0.906].\nObject 2 : desk at [0.000, 0.488, 0.998, 0.999].\nObject 3 : fork at [0.203, 0.794, 0.270, 0.857].\nObject 4 : glass at [0.277, 0.703, 0.345, 0.816].\nObject 5 : head phones at [0.872, 0.556, 0.993, 0.634].\nObject 6 : keyboard at [0.415, 0.620, 0.650, 0.783].\nObject 7 : lamp at [0.000, 0.302, 0.214, 0.430].\nObject 8 : laptop at [0.491, 0.296, 0.703, 0.540].\nObject 9 : picture at [0.795, 0.204, 0.898, 0.358].\nObject 10 : plant at [0.192, 0.201, 0.391, 0.461].\nObject 11 : plate at [0.183, 0.799, 0.326, 0.896].\nObject 12 : screen at [0.237, 0.249, 0.504, 0.628].\nObject 13 : stand at [0.506, 0.531, 0.663, 0.617].\nObject 14 : window at [0.606, 0.000, 1.000, 0.346].\n\nRelationships:\nobject 0 : computer mouse -> on -> object 2 : desk.\nobject 8 : laptop -> on -> object 13 : stand.\nobject 6 : keyboard -> on -> object 2 : desk.\nobject 9 : picture -> near -> object 14 : window.\nobject 3 : fork -> on -> object 11 : plate.\n\nRegion Description:\n\nGlobal Caption:\na desk with a cup plate laptop monitor and keyboard\nA laptop sitting next to a monitor, keyboard and a mouse.\nA laptop and a desktop monitor are displayed on top of the desk.\nLarge office desk with computers near a window.\nA desk with a laptop, second monitor and keyboard."}
{"question_id": 11, "image": "000000332318.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : background at [0.000, 0.000, 1.002, 0.997].\nObject 1 : bench at [0.604, 0.967, 0.672, 0.997].\nObject 2 : cow at [0.548, 0.860, 0.574, 0.896].\nObject 3 : cow at [0.436, 0.860, 0.454, 0.890].\nObject 4 : fence at [0.698, 0.949, 0.852, 0.997].\nObject 5 : moutain at [0.000, 0.057, 0.992, 0.782].\nObject 6 : pasture at [0.000, 0.815, 0.984, 1.000].\nObject 7 : peak at [0.744, 0.042, 0.898, 0.119].\nObject 8 : sky at [0.000, 0.000, 1.002, 0.257].\nObject 9 : snow at [0.210, 0.036, 0.962, 0.445].\nObject 10 : trailer at [0.796, 0.910, 0.894, 0.997].\nObject 11 : trailer at [0.632, 0.899, 0.742, 0.994].\nObject 12 : tree at [0.740, 0.409, 1.000, 0.982].\nObject 13 : tree at [0.638, 0.284, 0.652, 0.301].\n\nRelationships:\nobject 11 : trailer -> in -> object 6 : pasture.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 6 : pasture -> near -> object 5 : moutain.\nobject 3 : cow -> in -> object 6 : pasture.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 9 : snow -> on -> object 5 : moutain.\nobject 5 : moutain -> covered in -> object 9 : snow.\nobject 5 : moutain -> has -> object 7 : peak.\nobject 2 : cow -> in -> object 6 : pasture.\nobject 5 : moutain -> in -> object 0 : background.\nobject 5 : moutain -> has -> object 9 : snow.\nobject 11 : trailer -> near -> object 12 : tree.\nobject 5 : moutain -> has -> object 13 : tree.\nobject 7 : peak -> covered with -> object 9 : snow.\n\nRegion Description:\nRegion Description at [0.784, 0.901, 0.934, 0.991] : storage container for animal equipment.\nRegion Description at [0.828, 0.060, 0.880, 0.125] : The mountain is partially covered in snow..\nRegion Description at [0.840, 0.899, 0.920, 0.997] : horse trailer or cow trailer is silvertone, rectangular.\nRegion Description at [0.606, 0.919, 0.640, 0.982] : smaller trailer, white w/ brown+orange stripe.\nRegion Description at [0.060, 0.472, 0.540, 0.806] : a bare patch of earth amid lush green growth.\nRegion Description at [0.034, 0.839, 0.812, 0.973] : tiny cattle-containing fenceposts in the distance.\nRegion Description at [0.902, 0.827, 0.990, 0.997] : a split tree trunk in shadow, beneath leaves, shadow on ground.\nRegion Description at [0.734, 0.919, 0.802, 0.994] : an older station wagon/suv-type van thing.\nRegion Description at [0.090, 0.854, 0.124, 0.904] : a black & white animal stands alone, away from brown brethren, in the far distance.\n\nGlobal Caption:\nCows lounge in a field with a mountain backdrop.\nA VERY BIG MOUNTAIN AND ANIMALS SPREAD ACROSS A FARM.\nSeveral herd animals are on the grass by a mountain.\nCattle on a level pasture in a mountainous area.\nA bunch of cattle relax in a pasture located in the mountains"}
{"question_id": 12, "image": "000000513567.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bag at [0.428, 0.435, 0.476, 0.528].\nObject 1 : bag at [0.322, 0.923, 0.498, 0.997].\nObject 2 : building at [0.000, 0.003, 0.158, 0.413].\nObject 3 : face at [0.246, 0.240, 0.374, 0.483].\nObject 4 : flag at [0.044, 0.013, 0.090, 0.149].\nObject 5 : girl at [0.538, 0.019, 0.968, 0.949].\nObject 6 : hand at [0.176, 0.680, 0.304, 0.821].\nObject 7 : hands at [0.660, 0.344, 0.756, 0.517].\nObject 8 : head at [0.560, 0.003, 0.822, 0.339].\nObject 9 : hot dog at [0.676, 0.315, 0.882, 0.408].\nObject 10 : hot dogs at [0.190, 0.587, 0.350, 0.741].\nObject 11 : jeans at [0.586, 0.843, 0.916, 0.995].\nObject 12 : lady at [0.572, 0.045, 0.952, 0.984].\nObject 13 : logo at [0.920, 0.069, 0.996, 0.165].\nObject 14 : man at [0.486, 0.235, 0.564, 0.509].\nObject 15 : man at [0.456, 0.213, 0.520, 0.317].\nObject 16 : maroon shirt at [0.546, 0.333, 0.928, 0.944].\nObject 17 : mouth at [0.288, 0.408, 0.356, 0.440].\nObject 18 : people at [0.552, 0.029, 0.876, 0.995].\nObject 19 : post at [0.104, 0.005, 0.138, 0.533].\nObject 20 : purse at [0.842, 0.661, 0.980, 0.888].\nObject 21 : purse strap at [0.270, 0.893, 0.390, 0.992].\nObject 22 : shadow at [0.934, 0.067, 0.996, 0.141].\nObject 23 : side at [0.922, 0.875, 0.998, 0.997].\nObject 24 : street at [0.042, 0.403, 0.092, 0.520].\nObject 25 : sunglasses at [0.630, 0.005, 0.794, 0.048].\nObject 26 : woman at [0.502, 0.000, 0.982, 0.997].\nObject 27 : woman at [0.102, 0.099, 0.486, 0.984].\nObject 28 : woman's shirt at [0.518, 0.320, 0.944, 0.949].\n\nRelationships:\nobject 0 : bag -> on -> object 15 : man.\nobject 13 : logo -> on -> object 2 : building.\nobject 25 : sunglasses -> on -> object 26 : woman.\nobject 25 : sunglasses -> on -> object 8 : head.\nobject 4 : flag -> on -> object 19 : post.\nobject 6 : hand -> holds -> object 10 : hot dogs.\nobject 27 : woman -> has -> object 17 : mouth.\nobject 12 : lady -> holding -> object 9 : hot dog.\nobject 9 : hot dog -> in -> object 7 : hands.\nobject 18 : people -> crossing -> object 24 : street.\nobject 27 : woman -> wearing -> object 11 : jeans.\nobject 5 : girl -> wears -> object 16 : maroon shirt.\n\nRegion Description:\nRegion Description at [0.038, 0.173, 0.540, 0.995] : Laughing girl in a green shirt holding a hotdog..\nRegion Description at [0.504, 0.000, 0.954, 0.989] : Black haired girl in maroon shirt wearing sunglasses on her head..\nRegion Description at [0.508, 0.000, 0.960, 0.979] : Girl looking at the hot dog she's holding in her hands.\nRegion Description at [0.040, 0.173, 0.536, 0.981] : Girl holding hot dog in her right hand.\nRegion Description at [0.926, 0.253, 0.998, 0.645] : Woman in a brown shirt and jeans crossing the street.\nRegion Description at [0.202, 0.563, 0.334, 0.995] : Blue purse strap around woman's shoulder.\nRegion Description at [0.146, 0.587, 0.370, 0.787] : woman holding hot dog in white napkin.\nRegion Description at [0.682, 0.229, 0.742, 0.315] : woman's mouth open looking at hot dog.\nRegion Description at [0.234, 0.213, 0.396, 0.507] : woman's face smiling with eyes closed.\n\nGlobal Caption:\nTwo Asian women eating chili dogs while standing on a street.\nTwo women preparing to eat a hot dog on a city side.\nThe woman are eating their hot dogs while walking.\nTwo young women are eating hot dogs while walking down the sidewalk.\nTwo women eat chili dogs on a city sidewalk. "}
{"question_id": 13, "image": "000000134722.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : awning at [0.886, 0.000, 1.000, 0.240].\nObject 1 : awning at [0.000, 0.299, 0.132, 0.403].\nObject 2 : bench at [0.000, 0.592, 0.066, 0.683].\nObject 3 : building at [0.000, 0.299, 0.204, 0.659].\nObject 4 : canopy at [0.000, 0.301, 0.136, 0.400].\nObject 5 : car at [0.290, 0.400, 0.998, 0.784].\nObject 6 : clouds at [0.374, 0.067, 0.920, 0.312].\nObject 7 : door opening at [0.658, 0.501, 0.682, 0.680].\nObject 8 : door opening at [0.678, 0.509, 0.710, 0.675].\nObject 9 : exterior at [0.000, 0.400, 0.200, 0.669].\nObject 10 : front at [0.294, 0.400, 0.494, 0.739].\nObject 11 : gravel at [0.090, 0.837, 0.334, 0.997].\nObject 12 : headlights at [0.416, 0.624, 0.446, 0.656].\nObject 13 : headlights at [0.300, 0.624, 0.324, 0.651].\nObject 14 : markings at [0.606, 0.821, 0.770, 0.928].\nObject 15 : panel at [0.304, 0.421, 0.450, 0.677].\nObject 16 : pole at [0.030, 0.419, 0.062, 0.656].\nObject 17 : railway tracks at [0.000, 0.752, 0.520, 0.944].\nObject 18 : side walk at [0.192, 0.712, 1.000, 0.997].\nObject 19 : sky at [0.000, 0.000, 0.998, 0.560].\nObject 20 : train stop at [0.000, 0.000, 1.000, 1.000].\nObject 21 : trees at [0.208, 0.253, 0.322, 0.653].\nObject 22 : trim at [0.000, 0.333, 0.132, 0.403].\nObject 23 : wall at [0.000, 0.392, 0.206, 0.611].\nObject 24 : wheel at [0.844, 0.669, 0.884, 0.728].\nObject 25 : wheel at [0.792, 0.675, 0.840, 0.747].\nObject 26 : wheel at [0.516, 0.691, 0.620, 0.808].\nObject 27 : window at [0.316, 0.451, 0.458, 0.595].\nObject 28 : windows at [0.700, 0.547, 0.848, 0.632].\nObject 29 : windsheild wipers at [0.348, 0.499, 0.410, 0.584].\n\nRelationships:\nobject 6 : clouds -> in -> object 19 : sky.\nobject 2 : bench -> in -> object 4 : canopy.\nobject 22 : trim -> on -> object 1 : awning.\nobject 11 : gravel -> next to -> object 17 : railway tracks.\nobject 14 : markings -> on side of -> object 18 : side walk.\nobject 5 : car -> on -> object 17 : railway tracks.\n\nRegion Description:\nRegion Description at [0.288, 0.392, 0.510, 0.741] : the front of the train is yellow and white.\nRegion Description at [0.320, 0.451, 0.460, 0.592] : the front window of the train has windshield wipers.\nRegion Description at [0.292, 0.592, 0.456, 0.739] : the headlights are on front of the train.\nRegion Description at [0.010, 0.405, 0.220, 0.736] : a red brick wall is near the platform.\nRegion Description at [0.000, 0.288, 0.128, 0.707] : an aluminum canopy is on the platform.\nRegion Description at [0.016, 0.325, 0.100, 0.672] : a red steel pole is holding up the awning.\nRegion Description at [0.306, 0.395, 0.998, 0.733] : the train has windowed passenger cars.\nRegion Description at [0.300, 0.427, 0.492, 0.693] : the yellow and white front of a train.\nRegion Description at [0.510, 0.744, 0.834, 0.891] : white painted line beside a train track.\nRegion Description at [0.298, 0.408, 0.468, 0.661] : a yellow panel on the front of the train.\nRegion Description at [0.002, 0.397, 0.210, 0.675] : a red brick building on the side of the tracks.\nRegion Description at [0.844, 0.000, 0.998, 0.248] : an awning of a structure next to the train tracks.\nRegion Description at [0.294, 0.360, 0.516, 0.787] : front of a train car in yellow, white and blue.\nRegion Description at [0.194, 0.221, 0.286, 0.901] : trees on the side of a train station.\nRegion Description at [0.580, 0.821, 0.764, 0.931] : markings on the side of railway tracks.\nRegion Description at [0.632, 
0.491, 0.726, 0.691] : white, blue and grey doors on the side of a train car.\nRegion Description at [0.500, 0.096, 0.916, 0.531] : skyline on the side of a train station.\n\nGlobal Caption:\nFast commuter train moving past an outdoor platform.\nA train on the track pulling by a train station.\nA train pulling into a station outside during the day.\nA passenger train moving through a rail yard\na long passenger train pulling up to a station"}
{"question_id": 14, "image": "000000341058.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : napkins at [0.541, 0.818, 0.601, 0.858].\nObject 1 : pepper at [0.598, 0.836, 0.623, 0.860].\nObject 2 : post at [0.673, 0.494, 0.712, 0.926].\nObject 3 : restaurant sign at [0.548, 0.180, 0.779, 0.344].\nObject 4 : salt at [0.619, 0.838, 0.633, 0.850].\nObject 5 : shaker at [0.594, 0.822, 0.619, 0.854].\nObject 6 : shaker at [0.612, 0.824, 0.637, 0.854].\nObject 7 : table at [0.448, 0.834, 0.925, 0.998].\n\nRelationships:\nobject 4 : salt -> in -> object 6 : shaker.\nobject 0 : napkins -> on -> object 7 : table.\nobject 3 : restaurant sign -> on -> object 2 : post.\n\nRegion Description:\n\nGlobal Caption:\nThis is an empty table at a restaurant with ships in the background.\nThis table is covered by a blue Sam Adams umbrella\nAdvertising sign above a patio umbrella on sunny day.\nA lamp post stands next to an umbrella and table.\nAn umbrella is opened over an outdoor table."}
{"question_id": 15, "image": "000000277051.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bird at [0.400, 0.408, 0.688, 0.775].\nObject 1 : bird at [0.110, 0.468, 0.576, 0.820].\nObject 2 : bottle at [0.080, 0.003, 0.296, 0.721].\nObject 3 : chair at [0.678, 0.177, 0.882, 0.408].\nObject 4 : crumbs at [0.098, 0.835, 0.434, 1.000].\nObject 5 : feet at [0.514, 0.724, 0.562, 0.769].\nObject 6 : food at [0.000, 0.877, 0.180, 1.000].\nObject 7 : foot at [0.474, 0.706, 0.514, 0.733].\nObject 8 : ground at [0.518, 0.183, 0.620, 0.402].\nObject 9 : handle at [0.488, 0.796, 0.800, 0.940].\nObject 10 : knife at [0.000, 0.793, 0.800, 1.000].\nObject 11 : label at [0.080, 0.000, 0.260, 0.598].\nObject 12 : leg at [0.552, 0.652, 0.578, 0.742].\nObject 13 : leg at [0.508, 0.646, 0.540, 0.685].\nObject 14 : liquid at [0.092, 0.114, 0.294, 0.721].\nObject 15 : paper at [0.000, 0.658, 0.762, 1.003].\nObject 16 : placemat at [0.000, 0.658, 0.766, 1.000].\nObject 17 : plate at [0.000, 0.748, 0.618, 1.000].\nObject 18 : table at [0.742, 0.261, 1.002, 0.883].\nObject 19 : table at [0.000, 0.658, 1.000, 1.003].\nObject 20 : tablecloth at [0.000, 0.664, 1.002, 1.003].\nObject 21 : tablecloth at [0.596, 0.267, 1.000, 0.883].\n\nRelationships:\nobject 6 : food -> on -> object 17 : plate.\nobject 4 : crumbs -> on -> object 17 : plate.\nobject 3 : chair -> next to -> object 18 : table.\nobject 3 : chair -> beside -> object 18 : table.\n\nRegion Description:\nRegion Description at [0.050, 0.769, 0.804, 0.979] : a steak knife resting on the edge of a plate.\nRegion Description at [0.008, 0.724, 0.628, 0.994] : a white plate with food and crumbs on it.\nRegion Description at [0.040, 0.685, 0.380, 0.925] : a blue and white paper placemat underneath a plate.\nRegion Description at [0.636, 0.147, 0.906, 0.492] : a bird on a table with a chair behind it.\nRegion Description at [0.384, 0.372, 0.698, 0.787] : a bird standing on the edge of a table.\n\nGlobal Caption:\ntwo little sparrows standing on a table by a knife\ntwo gray white and brown birds a knife and a red table\nA couple of small birds standing on top of a table.\nTwo sparrows sit n a table with a red tablecloth at an outdoor cafe. \nTwo birds perched on a table near a plate of food."}
{"question_id": 16, "image": "000000376900.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : area at [0.000, 0.002, 0.995, 0.996].\nObject 1 : background at [0.000, 0.132, 0.997, 0.268].\nObject 2 : cap at [0.171, 0.388, 0.253, 0.476].\nObject 3 : green/tennis court at [0.005, 0.720, 0.880, 0.994].\nObject 4 : hand at [0.253, 0.648, 0.299, 0.680].\nObject 5 : head at [0.173, 0.408, 0.256, 0.474].\nObject 6 : line at [0.397, 0.778, 0.501, 0.996].\nObject 7 : man at [0.163, 0.274, 0.491, 0.936].\nObject 8 : photo at [0.005, 0.004, 0.968, 0.976].\nObject 9 : pole at [0.019, 0.162, 0.035, 0.258].\nObject 10 : ses at [0.912, 0.962, 0.992, 0.994].\nObject 11 : shadow at [0.397, 0.898, 0.968, 0.956].\nObject 12 : shorts at [0.216, 0.628, 0.432, 0.782].\nObject 13 : sock at [0.325, 0.840, 0.376, 0.890].\nObject 14 : sport at [0.144, 0.270, 0.515, 0.944].\nObject 15 : tennis racket at [0.235, 0.578, 0.304, 0.664].\nObject 16 : tennis shoe at [0.213, 0.880, 0.280, 0.930].\nObject 17 : tennis shoe at [0.299, 0.886, 0.405, 0.936].\nObject 18 : trees at [0.269, 0.192, 0.995, 0.250].\nObject 19 : wrist at [0.384, 0.318, 0.429, 0.360].\nObject 20 : wristband at [0.384, 0.318, 0.432, 0.360].\n\nRelationships:\nobject 7 : man -> wearing -> object 12 : shorts.\nobject 4 : hand -> holding -> object 15 : tennis racket.\nobject 2 : cap -> on mans -> object 5 : head.\nobject 5 : head -> of a -> object 7 : man.\nobject 7 : man -> wearing a -> object 2 : cap.\nobject 7 : man -> wearing a -> object 13 : sock.\nobject 18 : trees -> in -> object 1 : background.\nobject 14 : sport -> in -> object 0 : area.\nobject 20 : wristband -> on a -> object 19 : wrist.\nobject 2 : cap -> on -> object 5 : head.\nobject 11 : shadow -> of -> object 7 : man.\nobject 12 : shorts -> on -> object 7 : man.\n\nRegion Description:\nRegion Description at [0.163, 0.322, 0.579, 0.926] : The tennis player is wearing all white.\nRegion Description at [0.397, 0.858, 0.936, 0.968] : Tennis player's shadow cast in front of him.\nRegion Description at [0.219, 0.560, 0.309, 0.680] : a black tennis racket in a man's hand.\nRegion Description at [0.341, 0.538, 0.480, 0.728] : a line judge at the side of a tennis court.\n\nGlobal Caption:\nA tennis player prepares to serve a tennis ball.\na tennis player in all white playing on a court \nA tennis player is reaching up with one arm and has a racquet in the other hand. \nThe tennis player throws the ball up to serve\nSpectators watching a man swinging at a tennis ball."}
{"question_id": 17, "image": "000000412240.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : characters at [0.528, 0.251, 0.992, 0.395].\nObject 1 : date at [0.646, 0.869, 0.824, 0.923].\nObject 2 : dog at [0.292, 0.131, 0.820, 0.771].\nObject 3 : eyes at [0.332, 0.219, 0.354, 0.243].\nObject 4 : floor at [0.002, 0.715, 1.000, 0.997].\nObject 5 : head at [0.290, 0.117, 0.500, 0.392].\nObject 6 : heel at [0.218, 0.629, 0.324, 0.741].\nObject 7 : laces at [0.398, 0.464, 0.540, 0.608].\nObject 8 : left eye at [0.420, 0.245, 0.446, 0.283].\nObject 9 : light at [0.320, 0.493, 0.608, 0.720].\nObject 10 : mouth at [0.318, 0.320, 0.392, 0.373].\nObject 11 : nose at [0.348, 0.283, 0.392, 0.328].\nObject 12 : panel at [0.690, 0.544, 1.000, 0.779].\nObject 13 : photo at [0.000, 0.003, 0.996, 0.997].\nObject 14 : shoe at [0.002, 0.437, 0.250, 0.720].\nObject 15 : shoe at [0.212, 0.445, 0.720, 0.787].\nObject 16 : symbol at [0.750, 0.600, 0.828, 0.699].\nObject 17 : tail at [0.734, 0.720, 0.824, 0.768].\nObject 18 : time at [0.852, 0.872, 0.938, 0.923].\nObject 19 : toe at [0.564, 0.643, 0.724, 0.776].\nObject 20 : year at [0.752, 0.877, 0.834, 0.923].\n\nRelationships:\nobject 3 : eyes -> of -> object 2 : dog.\nobject 1 : date -> of -> object 13 : photo.\nobject 6 : heel -> of -> object 15 : shoe.\nobject 2 : dog -> sitting on -> object 4 : floor.\nobject 15 : shoe -> next to -> object 2 : dog.\nobject 15 : shoe -> reflecting -> object 9 : light.\nobject 0 : characters -> playing -> object 0 : characters.\nobject 0 : characters -> playing -> object 0 : characters.\nobject 2 : dog -> has a -> object 8 : left eye.\nobject 5 : head -> of -> object 2 : dog.\nobject 3 : eyes -> of -> object 2 : dog.\nobject 11 : nose -> on a -> object 2 : dog.\nobject 10 : mouth -> on a -> object 2 : dog.\nobject 15 : shoe -> has -> object 7 : laces.\nobject 17 : tail -> of -> object 2 : dog.\nobject 15 : shoe -> has a -> object 6 : heel.\nobject 19 : toe -> of -> object 15 : shoe.\n\nRegion Description:\nRegion Description at [0.838, 0.837, 0.976, 0.968] : the time written in bottom right corner.\n\nGlobal Caption:\nA dog sitting behind a pair of black shoes.\nA dog sits on the floor next to some shoes. \nA puppy is sitting behind a pair of shoes.\na close up of a small dog near a pair of shoes\nA small black dog sits beside a pair of shoes."}
{"question_id": 18, "image": "000000179765.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : back tire at [0.574, 0.496, 0.860, 0.800].\nObject 1 : bike at [0.146, 0.109, 0.938, 0.803].\nObject 2 : bike indicators at [0.238, 0.363, 0.264, 0.389].\nObject 3 : car at [0.000, 0.077, 0.086, 0.157].\nObject 4 : display at [0.240, 0.275, 0.290, 0.328].\nObject 5 : exhaust pipe at [0.460, 0.661, 0.818, 0.773].\nObject 6 : front tire at [0.146, 0.419, 0.366, 0.637].\nObject 7 : front wheel at [0.150, 0.424, 0.366, 0.635].\nObject 8 : garage door at [0.000, 0.000, 0.214, 0.341].\nObject 9 : handle at [0.284, 0.109, 0.390, 0.384].\nObject 10 : honda logo at [0.322, 0.395, 0.378, 0.419].\nObject 11 : house at [0.420, 0.000, 0.736, 0.149].\nObject 12 : leather seat at [0.496, 0.355, 0.792, 0.517].\nObject 13 : light at [0.894, 0.411, 0.944, 0.520].\nObject 14 : orange light at [0.280, 0.419, 0.296, 0.467].\nObject 15 : shock at [0.258, 0.477, 0.296, 0.568].\nObject 16 : shock absorber at [0.626, 0.501, 0.698, 0.680].\nObject 17 : shrubs at [0.628, 0.021, 0.764, 0.200].\nObject 18 : small windshield at [0.210, 0.120, 0.256, 0.291].\nObject 19 : sylencer at [0.462, 0.645, 0.816, 0.779].\nObject 20 : trees at [0.256, 0.003, 0.444, 0.205].\n\nRelationships:\nobject 1 : bike -> has -> object 7 : front wheel.\nobject 1 : bike -> has -> object 0 : back tire.\nobject 1 : bike -> has -> object 19 : sylencer.\nobject 1 : bike -> has -> object 16 : shock absorber.\nobject 1 : bike -> has -> object 13 : light.\nobject 9 : handle -> on -> object 1 : bike.\nobject 4 : display -> on -> object 1 : bike.\n\nRegion Description:\n\nGlobal Caption:\nA black Honda motorcycle parked in front of a garage.\nA Honda motorcycle parked in a grass driveway\nA black Honda motorcycle with a dark burgundy seat.\nMa motorcycle parked on the gravel in front of a garage\nA motorcycle with its brake extended standing outside"}
{"question_id": 19, "image": "000000329219.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bearded face at [0.371, 0.064, 0.393, 0.094].\nObject 1 : blender at [0.015, 0.165, 0.080, 0.307].\nObject 2 : box at [0.176, 0.249, 0.228, 0.329].\nObject 3 : buttons at [0.038, 0.268, 0.048, 0.275].\nObject 4 : counter at [0.567, 0.340, 0.738, 0.395].\nObject 5 : counter at [0.000, 0.329, 0.576, 0.398].\nObject 6 : curtain at [0.429, 0.048, 0.504, 0.318].\nObject 7 : curtain at [0.227, 0.000, 0.309, 0.287].\nObject 8 : dog at [0.462, 0.593, 0.568, 0.842].\nObject 9 : door knob at [0.242, 0.477, 0.253, 0.499].\nObject 10 : drawer at [0.112, 0.370, 0.259, 0.452].\nObject 11 : drawer at [0.284, 0.382, 0.394, 0.439].\nObject 12 : faucet at [0.338, 0.327, 0.388, 0.357].\nObject 13 : floor at [0.000, 0.713, 1.000, 1.000].\nObject 14 : kitchen at [0.000, 0.000, 0.750, 0.849].\nObject 15 : knob at [0.179, 0.398, 0.197, 0.422].\nObject 16 : knob at [0.340, 0.400, 0.352, 0.420].\nObject 17 : man at [0.274, 0.000, 0.517, 0.792].\nObject 18 : mugs at [0.509, 0.123, 0.595, 0.266].\nObject 19 : outlet at [0.107, 0.212, 0.143, 0.256].\nObject 20 : shoes at [0.391, 0.735, 0.476, 0.786].\nObject 21 : spatula at [0.126, 0.003, 0.153, 0.094].\nObject 22 : tile at [0.526, 0.592, 0.557, 0.634].\nObject 23 : wall at [0.003, 0.000, 0.220, 0.294].\nObject 24 : wall at [0.506, 0.019, 0.607, 0.384].\nObject 25 : window at [0.303, 0.016, 0.392, 0.328].\nObject 26 : wire at [0.097, 0.233, 0.129, 0.319].\n\nRelationships:\nobject 17 : man -> standing in -> object 14 : kitchen.\nobject 18 : mugs -> hanging on -> object 24 : wall.\nobject 1 : blender -> with -> object 3 : buttons.\nobject 17 : man -> with -> object 0 : bearded face.\nobject 26 : wire -> hanging from -> object 23 : wall.\nobject 8 : dog -> on -> object 13 : floor.\nobject 1 : blender -> on -> object 5 : counter.\nobject 6 : curtain -> on -> object 25 : window.\nobject 20 : shoes -> on -> object 17 : man.\n\nRegion Description:\nRegion Description at [0.056, 0.214, 0.140, 0.277] : A dark electric cord plugged into the wall.\nRegion Description at [0.000, 0.662, 0.116, 0.940] : A latter with onely one rung visible.\nRegion Description at [0.004, 0.698, 0.999, 0.991] : Durable Tan and brown laminent flooring.\nRegion Description at [0.004, 0.324, 0.739, 0.880] : cheap waferboard constructed cabinets .\nRegion Description at [0.514, 0.126, 0.588, 0.262] : convient and accessable way to store coffee mugs.\nRegion Description at [0.222, 0.001, 0.510, 0.286] : small window curtians with paisley design.\nRegion Description at [0.347, 0.053, 0.490, 0.312] : light weight flanel design mens shirt .\nRegion Description at [0.222, 0.004, 0.315, 0.303] : gold and white curtain on a kitchen window.\nRegion Description at [0.511, 0.126, 0.589, 0.261] : coffee cups hanging on the kitchen wall.\nRegion Description at [0.012, 0.149, 0.091, 0.340] : gold colored blinder sits on the counter.\nRegion Description at [-0.001, 0.000, 0.157, 0.122] : cooking utensils hanging against wall.\n\nGlobal Caption:\nA man standing next to a dog on the ground.\nA man is at a kitchen counter by a dog.\nAn man standing in a kitchen with a small puppy.\nthere is a small puppy on the kitchen floor\nA man in the kitchen standing with his dog."}
{"question_id": 20, "image": "000000184384.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : blueberry at [0.306, 0.312, 0.400, 0.429].\nObject 1 : butter at [0.454, 0.024, 0.638, 0.288].\nObject 2 : cake at [0.238, 0.093, 0.786, 0.787].\nObject 3 : cup at [0.002, 0.000, 0.202, 0.667].\nObject 4 : cup at [0.140, 0.008, 0.336, 0.456].\nObject 5 : egg at [0.636, 0.125, 0.880, 0.267].\nObject 6 : food at [0.632, 0.123, 0.996, 0.336].\nObject 7 : lemon at [0.514, 0.728, 0.798, 0.997].\nObject 8 : melon at [0.308, 0.768, 0.658, 0.997].\nObject 9 : orange at [0.514, 0.733, 0.794, 0.997].\nObject 10 : parsley at [0.372, 0.515, 0.762, 0.965].\nObject 11 : plate at [0.166, 0.453, 1.000, 1.000].\nObject 12 : plate at [0.628, 0.120, 0.998, 0.389].\nObject 13 : sausage at [0.766, 0.248, 0.984, 0.333].\nObject 14 : spot at [0.766, 0.600, 0.790, 0.637].\nObject 15 : table at [0.002, 0.365, 0.998, 0.997].\nObject 16 : water at [0.000, 0.000, 0.202, 0.667].\n\nRelationships:\nobject 7 : lemon -> on -> object 11 : plate.\nobject 10 : parsley -> on -> object 11 : plate.\nobject 6 : food -> on -> object 12 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 11 : plate -> has -> object 14 : spot.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 13 : sausage -> on -> object 12 : plate.\nobject 0 : blueberry -> on -> object 2 : cake.\nobject 5 : egg -> on -> object 12 : plate.\nobject 8 : melon -> on -> object 11 : plate.\nobject 1 : butter -> on -> object 2 : cake.\nobject 9 : orange -> on -> object 11 : plate.\nobject 2 : cake -> on -> object 11 : plate.\nobject 16 : water -> in -> object 3 : cup.\nobject 13 : sausage -> on -> object 12 : plate.\n\nRegion Description:\nRegion Description at [0.678, 0.104, 0.942, 0.424] : There is food on the plate in the back.\nRegion Description at [0.456, 0.013, 0.636, 0.307] : White frosting on top of a piece of cake.\nRegion Description at [0.322, 0.752, 0.650, 0.997] : square of honey dew on a white plate.\n\nGlobal Caption:\nA bluebery cake is on a plate and is topped with butter.\nA piece of cake with butter on it sits next to an orange slice. \nA large piece of blueberry cake on a plate.\nA plate of food attractively arranged on a table.\nA plate of blueberry coffee cake with butter and an orange slice on a table with breakfast foods."}
{"question_id": 21, "image": "000000018519.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : concrete at [0.000, 0.576, 1.002, 0.998].\nObject 1 : elbow at [0.403, 0.538, 0.433, 0.552].\nObject 2 : fence at [0.000, 0.314, 0.998, 0.600].\nObject 3 : graffiti at [0.470, 0.856, 0.794, 0.998].\nObject 4 : grass at [0.000, 0.154, 1.002, 0.448].\nObject 5 : helmet at [0.358, 0.354, 0.448, 0.422].\nObject 6 : knee at [0.525, 0.608, 0.545, 0.622].\nObject 7 : knee pad at [0.450, 0.542, 0.512, 0.598].\nObject 8 : pad at [0.540, 0.362, 0.595, 0.420].\nObject 9 : pad at [0.512, 0.578, 0.592, 0.624].\nObject 10 : pad at [0.376, 0.512, 0.443, 0.554].\nObject 11 : park at [0.007, 0.006, 1.000, 0.578].\nObject 12 : pipe at [0.657, 0.300, 0.687, 0.578].\nObject 13 : pipe at [0.177, 0.324, 0.211, 0.590].\nObject 14 : rail at [0.000, 0.310, 1.000, 0.334].\nObject 15 : ramp at [0.000, 0.592, 1.002, 0.998].\nObject 16 : rock at [0.100, 0.302, 0.154, 0.326].\nObject 17 : shadow at [0.415, 0.642, 0.754, 0.912].\nObject 18 : shirt at [0.438, 0.376, 0.637, 0.514].\nObject 19 : shorts at [0.460, 0.500, 0.664, 0.580].\nObject 20 : skate at [0.647, 0.490, 0.709, 0.584].\nObject 21 : skater at [0.234, 0.352, 0.719, 0.624].\nObject 22 : sticker at [0.408, 0.358, 0.438, 0.368].\nObject 23 : tree at [0.122, 0.008, 0.677, 0.322].\nObject 24 : wheels at [0.689, 0.496, 0.721, 0.526].\nObject 25 : wrist brace at [0.279, 0.524, 0.338, 0.564].\n\nRelationships:\nobject 21 : skater -> has a -> object 17 : shadow.\nobject 20 : skate -> has -> object 24 : wheels.\nobject 23 : tree -> standing in a -> object 11 : park.\nobject 21 : skater -> wearing a -> object 5 : helmet.\nobject 10 : pad -> protecting an -> object 1 : elbow.\nobject 9 : pad -> protecting a -> object 6 : knee.\nobject 17 : shadow -> of a -> object 21 : skater.\nobject 15 : ramp -> has a -> object 3 : graffiti.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 16 : rock -> in -> object 4 : grass.\nobject 5 : helmet -> has a -> object 22 : sticker.\nobject 21 : skater -> wearing -> object 20 : skate.\nobject 21 : skater -> wearing a -> object 10 : pad.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 20 : skate.\nobject 17 : shadow -> on -> object 15 : ramp.\nobject 21 : skater -> has a -> object 5 : helmet.\nobject 21 : skater -> has a -> object 8 : pad.\nobject 21 : skater -> has a -> object 18 : shirt.\nobject 21 : skater -> has -> object 19 : shorts.\nobject 23 : tree -> behind -> object 21 : skater.\nobject 25 : wrist brace -> on -> object 21 : skater.\nobject 21 : skater -> has a -> object 9 : pad.\nobject 7 : knee pad -> for a -> object 21 : skater.\nobject 17 : shadow -> on -> object 0 : concrete.\nobject 3 : graffiti -> on -> object 0 : concrete.\n\nRegion Description:\nRegion Description at [0.391, 0.630, 0.776, 0.962] : Skater's shadow while performing a trick.\nRegion Description at [0.346, 0.342, 0.475, 0.440] : Man is wearing a black safety helmet.\nRegion Description at [0.184, 0.320, 0.741, 0.700] : a man roller skating at a skate park.\nRegion Description at [0.448, 0.636, 0.779, 0.940] : the shadow of the man cast on the cement ramp.\nRegion Description at [0.465, 0.856, 0.803, 0.996] : light blue painted graffiti on the cement ramp.\nRegion Description at [0.279, 0.524, 0.341, 0.570] : a black wrist guard on the man's wrist.\nRegion Description at [0.353, 0.352, 0.460, 0.422] : black helmet with several stickers on it.\nRegion Description at [0.644, 0.488, 0.719, 0.574] : the 
black rollerskate the man is wearing.\nRegion Description at [0.142, 0.314, 0.234, 0.604] : a grey post to the metal fence that is at the top of the ramp.\nRegion Description at [0.363, 0.500, 0.453, 0.566] : a black elbow pad the man is wearing.\nRegion Description at [0.405, 0.642, 0.746, 0.916] : shadow of a roller skater on concrete.\n\nGlobal Caption:\nA young man riding a skateboard down the side of a ramp.\nA man doing a trick on roller-skates in a skate park.\nA skateboarder performing a jump off the side of a ramp.\na man wearing roller skates doing a jump on the side of a wall \nThe man in the helmet is jumping while wearing roller skates. "}
{"question_id": 22, "image": "000000415748.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : building at [0.000, 0.086, 0.697, 0.516].\nObject 1 : elephant at [0.084, 0.438, 0.727, 0.954].\nObject 2 : face at [0.411, 0.442, 0.670, 0.664].\nObject 3 : ground at [0.000, 0.742, 0.165, 0.998].\nObject 4 : man at [0.186, 0.246, 0.631, 0.516].\nObject 5 : shadow at [0.477, 0.812, 1.000, 0.958].\nObject 6 : sky at [0.006, 0.000, 0.228, 0.200].\nObject 7 : toe at [0.372, 0.900, 0.411, 0.924].\nObject 8 : tusk at [0.462, 0.670, 0.489, 0.692].\n\nRelationships:\nobject 4 : man -> on -> object 1 : elephant.\nobject 7 : toe -> of -> object 1 : elephant.\nobject 4 : man -> near -> object 0 : building.\nobject 4 : man -> on -> object 1 : elephant.\nobject 4 : man -> near -> object 1 : elephant.\nobject 8 : tusk -> on -> object 2 : face.\nobject 5 : shadow -> of -> object 1 : elephant.\nobject 5 : shadow -> on -> object 3 : ground.\nobject 4 : man -> close to -> object 0 : building.\nobject 0 : building -> close to -> object 1 : elephant.\n\nRegion Description:\nRegion Description at [0.411, 0.482, 0.634, 0.788] : elephant's face and trunk are painted.\n\nGlobal Caption:\nA man riding on the back of an elephant through a city street.\nMan riding on the back of a painted elephant. \nA man in colorful clothing riding a painted elephant.\na man in a white shirt is riding an elephant and some buildings\nAn old decorated elephant and its colorful rider"}
{"question_id": 23, "image": "000000543300.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : boat at [0.048, 0.552, 0.928, 0.819].\nObject 1 : building at [0.328, 0.493, 0.538, 0.613].\nObject 2 : building at [0.000, 0.467, 0.338, 0.651].\nObject 3 : building at [0.534, 0.096, 0.998, 0.637].\nObject 4 : canopies at [0.452, 0.504, 0.620, 0.600].\nObject 5 : container at [0.858, 0.643, 0.948, 0.712].\nObject 6 : dolphin at [0.282, 0.691, 0.344, 0.773].\nObject 7 : flag at [0.322, 0.563, 0.340, 0.597].\nObject 8 : ground at [0.822, 0.696, 0.880, 0.715].\nObject 9 : leaves at [0.002, 0.483, 0.080, 0.659].\nObject 10 : level at [0.000, 0.709, 1.000, 0.829].\nObject 11 : level at [0.068, 0.616, 0.852, 0.688].\nObject 12 : outdoor seating at [0.502, 0.579, 0.532, 0.624].\nObject 13 : pink writing at [0.414, 0.693, 0.654, 0.725].\nObject 14 : pole at [0.282, 0.416, 0.292, 0.515].\nObject 15 : railing at [0.094, 0.557, 0.728, 0.624].\nObject 16 : railing at [0.238, 0.597, 0.744, 0.627].\nObject 17 : reflection at [0.174, 0.808, 0.922, 0.848].\nObject 18 : roof at [0.000, 0.469, 0.280, 0.523].\nObject 19 : roof at [0.348, 0.509, 0.482, 0.568].\nObject 20 : roof at [0.920, 0.264, 0.980, 0.344].\nObject 21 : row at [0.700, 0.499, 0.878, 0.573].\nObject 22 : sea wall at [0.878, 0.712, 0.998, 0.819].\nObject 23 : shore at [0.000, 0.627, 0.996, 0.816].\nObject 24 : sky at [0.006, 0.000, 1.000, 0.517].\nObject 25 : steeple at [0.918, 0.088, 0.936, 0.237].\nObject 26 : symbol at [0.268, 0.688, 0.350, 0.779].\nObject 27 : symbol at [0.702, 0.693, 0.752, 0.725].\nObject 28 : tree at [0.472, 0.491, 0.592, 0.597].\nObject 29 : trees at [0.948, 0.573, 1.000, 0.691].\nObject 30 : trees at [0.000, 0.488, 0.080, 0.675].\nObject 31 : vehicle at [0.968, 0.653, 0.998, 0.693].\nObject 32 : water at [0.004, 0.813, 0.998, 0.992].\nObject 33 : water at [0.008, 0.717, 0.998, 0.981].\nObject 34 : window at [0.374, 0.733, 0.790, 0.765].\nObject 35 : window at [0.800, 0.491, 0.868, 0.576].\nObject 36 : window at [0.928, 0.512, 0.950, 0.576].\nObject 37 : window at [0.892, 0.395, 0.912, 0.443].\nObject 38 : window at [0.894, 0.517, 0.910, 0.571].\nObject 39 : window at [0.630, 0.493, 0.652, 0.565].\nObject 40 : windows at [0.384, 0.637, 0.724, 0.685].\n\nRelationships:\nobject 40 : windows -> on -> object 0 : boat.\nobject 17 : reflection -> in -> object 33 : water.\nobject 29 : trees -> growing on -> object 23 : shore.\nobject 30 : trees -> growing on -> object 23 : shore.\nobject 28 : tree -> growing on -> object 23 : shore.\nobject 18 : roof -> on -> object 2 : building.\nobject 5 : container -> on -> object 22 : sea wall.\nobject 0 : boat -> in -> object 32 : water.\nobject 0 : boat -> has -> object 15 : railing.\n\nRegion Description:\nRegion Description at [0.414, 0.691, 0.662, 0.725] : the are red letters on the side of the cruise ship.\nRegion Description at [0.370, 0.707, 0.780, 0.763] : there is a long set of black windows on the side of the cruise ship.\nRegion Description at [0.870, 0.243, 0.992, 0.357] : there is a red roof on this building.\nRegion Description at [0.538, 0.400, 0.712, 0.549] : there is red and gray building in the background.\nRegion Description at [0.054, 0.595, 0.312, 0.821] : there is two levels on this cruise ship.\nRegion Description at [0.370, 0.587, 0.664, 0.621] : there is a silver railing on the top level of the cruise ship.\nRegion Description at [0.858, 0.621, 0.952, 0.717] : there is a blue container on the dock.\nRegion Description at [0.876, 0.707, 0.996, 0.787] : 
there is a gray sea wall beside the ship.\nRegion Description at [0.268, 0.723, 0.346, 0.787] : there are blue water symbols on the side of the cruise ship.\nRegion Description at [0.000, 0.619, 0.024, 0.712] : there is a blue and white sign on the dock.\nRegion Description at [0.662, 0.533, 0.904, 0.603] : An outdoor canopy creates shade for customers. .\n\nGlobal Caption:\nA boat sits on the side of the dock.\nA large white boat in the open water.\nA white double decker boat n water next to buildings.\nA large cruise ship is traveling on the ocean. \nA Port River Dolphin Cruise ship sits in the water."}
{"question_id": 24, "image": "000000349184.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arm rest at [0.674, 0.486, 0.722, 0.560].\nObject 1 : bench at [0.000, 0.324, 0.731, 0.994].\nObject 2 : bricks at [0.075, 0.850, 0.180, 0.882].\nObject 3 : building at [0.090, 0.000, 0.686, 0.094].\nObject 4 : children at [0.470, 0.302, 0.539, 0.360].\nObject 5 : coat at [0.473, 0.322, 0.542, 0.364].\nObject 6 : daytime at [0.000, 0.002, 0.997, 1.000].\nObject 7 : fence at [0.719, 0.310, 0.997, 0.372].\nObject 8 : grass at [0.000, 0.364, 0.997, 0.720].\nObject 9 : jacket at [0.012, 0.424, 0.485, 0.690].\nObject 10 : jeans at [0.165, 0.748, 0.293, 0.844].\nObject 11 : leg at [0.168, 0.750, 0.308, 0.844].\nObject 12 : people at [0.386, 0.438, 0.449, 0.504].\nObject 13 : purse at [0.458, 0.488, 0.605, 0.694].\nObject 14 : shoe at [0.192, 0.836, 0.305, 0.890].\nObject 15 : strap at [0.677, 0.470, 0.814, 0.584].\nObject 16 : trees at [0.554, 0.000, 0.997, 0.376].\nObject 17 : woman at [0.009, 0.194, 0.497, 0.888].\n\nRelationships:\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 17 : woman -> on -> object 1 : bench.\nobject 13 : purse -> has a -> object 15 : strap.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 3 : building -> behind -> object 16 : trees.\nobject 2 : bricks -> under -> object 1 : bench.\nobject 9 : jacket -> on -> object 17 : woman.\nobject 12 : people -> near -> object 16 : trees.\nobject 17 : woman -> has a -> object 11 : leg.\nobject 1 : bench -> has an -> object 0 : arm rest.\nobject 15 : strap -> from -> object 13 : purse.\nobject 2 : bricks -> near -> object 1 : bench.\nobject 16 : trees -> in -> object 6 : daytime.\nobject 7 : fence -> under -> object 16 : trees.\nobject 12 : people -> in front of -> object 7 : fence.\nobject 13 : purse -> on -> object 1 : bench.\nobject 14 : shoe -> on -> object 2 : bricks.\n\nRegion Description:\nRegion Description at [0.096, 0.006, 0.662, 0.074] : Building with brown and white facade.\nRegion Description at [0.374, 0.298, 0.542, 0.360] : two people walking in front of woman.\n\nGlobal Caption:\nA woman sitting on top of a wooden bench near a park.\nA person sits on a wooden bench facing blooming trees.\nA woman sitting on a wooden bench viewing some beautiful trees.\nAdult sitting on wooden park bench in large open space.\nA woman sits on a bench watching the park."}
{"question_id": 25, "image": "000000042070.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : 61 at [0.268, 0.480, 0.310, 0.532].\nObject 1 : asphalt at [0.128, 0.632, 0.998, 0.992].\nObject 2 : bike rack at [0.278, 0.583, 0.762, 0.885].\nObject 3 : bottom at [0.166, 0.757, 0.864, 0.895].\nObject 4 : building at [0.000, 0.000, 0.194, 0.635].\nObject 5 : bus at [0.160, 0.035, 0.906, 0.910].\nObject 6 : corner at [0.868, 0.570, 0.998, 0.652].\nObject 7 : display at [0.258, 0.055, 0.770, 0.168].\nObject 8 : driver at [0.614, 0.350, 0.736, 0.520].\nObject 9 : driver's seat at [0.616, 0.367, 0.726, 0.500].\nObject 10 : font at [0.484, 0.695, 0.544, 0.725].\nObject 11 : headlight at [0.202, 0.685, 0.304, 0.728].\nObject 12 : information at [0.292, 0.077, 0.682, 0.160].\nObject 13 : license plate at [0.468, 0.677, 0.562, 0.745].\nObject 14 : light post at [0.948, 0.340, 0.992, 0.650].\nObject 15 : mirror at [0.848, 0.395, 0.900, 0.495].\nObject 16 : name at [0.448, 0.085, 0.670, 0.153].\nObject 17 : number at [0.744, 0.608, 0.818, 0.653].\nObject 18 : number at [0.488, 0.698, 0.492, 0.723].\nObject 19 : number at [0.308, 0.090, 0.334, 0.158].\nObject 20 : number at [0.308, 0.087, 0.356, 0.155].\nObject 21 : number at [0.268, 0.490, 0.308, 0.527].\nObject 22 : number at [0.268, 0.490, 0.288, 0.527].\nObject 23 : paint at [0.180, 0.562, 0.370, 0.780].\nObject 24 : pole at [0.952, 0.335, 0.988, 0.650].\nObject 25 : rack at [0.206, 0.480, 0.840, 0.820].\nObject 26 : red lettering at [0.534, 0.695, 0.546, 0.725].\nObject 27 : red lettering at [0.524, 0.698, 0.534, 0.725].\nObject 28 : red lettering at [0.512, 0.698, 0.524, 0.725].\nObject 29 : red lettering at [0.494, 0.695, 0.506, 0.722].\nObject 30 : red lettering at [0.484, 0.695, 0.494, 0.722].\nObject 31 : reflective light at [0.274, 0.043, 0.328, 0.073].\nObject 32 : reflective light at [0.716, 0.040, 0.748, 0.077].\nObject 33 : reflective light at [0.560, 0.043, 0.602, 0.055].\nObject 34 : reflective light at [0.430, 0.043, 0.468, 0.068].\nObject 35 : reflective light at [0.500, 0.037, 0.538, 0.075].\nObject 36 : road at [0.116, 0.632, 0.996, 0.995].\nObject 37 : sidewalk at [0.056, 0.765, 0.104, 0.818].\nObject 38 : steering wheel at [0.634, 0.445, 0.770, 0.495].\nObject 39 : street light at [0.948, 0.333, 0.992, 0.645].\nObject 40 : stripe at [0.918, 0.863, 0.998, 0.950].\nObject 41 : tree at [0.862, 0.470, 0.994, 0.632].\nObject 42 : window at [0.198, 0.175, 0.838, 0.550].\nObject 43 : windshield at [0.518, 0.170, 0.862, 0.570].\nObject 44 : windshield at [0.180, 0.170, 0.850, 0.560].\nObject 45 : wiper at [0.528, 0.362, 0.722, 0.630].\nObject 46 : wiper at [0.454, 0.370, 0.656, 0.633].\nObject 47 : word at [0.434, 0.080, 0.688, 0.163].\n\nRelationships:\nobject 13 : license plate -> on -> object 5 : bus.\nobject 17 : number -> on -> object 5 : bus.\nobject 2 : bike rack -> on -> object 5 : bus.\nobject 5 : bus -> parked on side of -> object 36 : road.\nobject 39 : street light -> on -> object 6 : corner.\nobject 38 : steering wheel -> on -> object 5 : bus.\nobject 13 : license plate -> with -> object 10 : font.\nobject 7 : display -> showing -> object 16 : name.\nobject 23 : paint -> on -> object 5 : bus.\nobject 20 : number -> on -> object 5 : bus.\nobject 47 : word -> on -> object 5 : bus.\n\nRegion Description:\nRegion Description at [0.040, 0.278, 0.940, 0.913] : The bus is parked on the side of road..\nRegion Description at [0.272, 0.055, 0.752, 0.168] : display showing the current bus route name and number.\nRegion 
Description at [0.168, 0.025, 0.864, 0.902] : White bus with green and white design.\nRegion Description at [0.008, 0.005, 0.266, 0.630] : Brick building with red and white stripes.\n\nGlobal Caption:\nA very big city bus on a big street.\nA large bus on the side of a street.\nBlue, white, and green passenger bus parked at a stop. \na city bus parked on the side of the road\nA white bus driving down a street next to a building."}
{"question_id": 26, "image": "000000241668.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : boutonniere at [0.710, 0.574, 0.799, 0.660].\nObject 1 : cake at [0.630, 0.670, 0.772, 0.750].\nObject 2 : cake crumb at [0.710, 0.348, 0.721, 0.356].\nObject 3 : crown at [0.370, 0.006, 0.549, 0.056].\nObject 4 : dress at [0.000, 0.574, 0.582, 1.000].\nObject 5 : eye at [0.649, 0.244, 0.699, 0.272].\nObject 6 : eye at [0.735, 0.264, 0.769, 0.280].\nObject 7 : eyebrow at [0.655, 0.230, 0.710, 0.250].\nObject 8 : eyebrow at [0.741, 0.252, 0.780, 0.264].\nObject 9 : finger at [0.721, 0.772, 0.816, 0.800].\nObject 10 : finger at [0.535, 0.740, 0.685, 0.826].\nObject 11 : ground at [0.003, 0.888, 0.997, 1.000].\nObject 12 : hair at [0.507, 0.142, 0.791, 0.642].\nObject 13 : hair at [0.189, 0.044, 0.652, 0.374].\nObject 14 : hand at [0.721, 0.720, 0.822, 0.818].\nObject 15 : hand at [0.493, 0.710, 0.685, 0.826].\nObject 16 : head at [0.209, 0.048, 0.652, 0.360].\nObject 17 : mouth at [0.646, 0.310, 0.724, 0.352].\nObject 18 : neck at [0.560, 0.344, 0.663, 0.460].\nObject 19 : necklace at [0.357, 0.334, 0.471, 0.484].\nObject 20 : necktie at [0.571, 0.442, 0.674, 0.936].\nObject 21 : paper at [0.760, 0.792, 0.914, 0.934].\nObject 22 : person at [0.490, 0.136, 0.825, 0.998].\nObject 23 : plate at [0.579, 0.734, 0.816, 0.768].\nObject 24 : purse at [0.774, 0.792, 0.883, 0.840].\nObject 25 : ring at [0.786, 0.780, 0.794, 0.796].\nObject 26 : shirt at [0.554, 0.376, 0.691, 0.950].\nObject 27 : suit jacket at [0.490, 0.422, 0.799, 0.998].\nObject 28 : table at [0.696, 0.816, 0.997, 0.916].\nObject 29 : toilet at [0.000, 0.656, 0.997, 0.936].\nObject 30 : wallpaper at [0.003, 0.000, 0.916, 0.656].\n\nRelationships:\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> by -> object 29 : toilet.\nobject 21 : paper -> sitting by -> object 29 : toilet.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 21 : paper -> on top of -> object 11 : ground.\nobject 21 : paper -> lying by -> object 29 : toilet.\nobject 2 : cake crumb -> on side of -> object 17 : mouth.\nobject 24 : purse -> on top of -> object 28 : table.\nobject 5 : eye -> of a -> object 22 : person.\nobject 6 : eye -> of a -> object 22 : person.\nobject 7 : eyebrow -> of -> object 22 : person.\nobject 8 : eyebrow -> of -> object 22 : person.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 10 : finger -> of -> object 15 : hand.\nobject 3 : crown -> on top of -> object 16 : head.\nobject 20 : necktie -> worn on -> object 22 : person.\nobject 22 : person -> holding -> object 1 : cake.\nobject 14 : hand -> holding -> object 1 : cake.\nobject 22 : person -> wearing -> object 27 : suit jacket.\nobject 22 : person -> wearing -> object 4 : dress.\nobject 20 : necktie -> worn on -> object 18 : neck.\nobject 13 : hair -> on top of -> object 16 : head.\nobject 1 : cake -> on top of -> object 23 : plate.\nobject 25 : ring -> worn on -> object 9 : finger.\n\nRegion Description:\nRegion Description at [0.022, 0.020, 0.203, 0.312] : A green and yellow striped wallpaper.\nRegion Description at [0.000, 0.048, 0.613, 0.996] : woman wearing a strapless white wedding dress .\nRegion Description at [0.487, 0.136, 0.808, 0.986] 
: woman white red hair holding a piece of cake on a plate.\nRegion Description at [0.543, 0.674, 0.813, 0.826] : woman's hands holding a plate of cake.\nRegion Description at [0.579, 0.124, 0.788, 0.524] : red haired woman wearing a tie and suit jacket .\nRegion Description at [0.000, 0.012, 0.819, 0.996] : two people wearing formal wedding attire .\n\nGlobal Caption:\nThere are two people enjoying a wedding reception\nA woman in a wedding dress with another woman in a suit behind\nA woman in a wedding dress with another lady holding a piece of cake.\nA red head girl holding a piece of cake\nA bride is with a long red haired person with cake."}
{"question_id": 27, "image": "000000535578.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bush at [0.480, 0.000, 0.748, 0.084].\nObject 1 : ear at [0.544, 0.544, 0.571, 0.562].\nObject 2 : field at [0.000, 0.002, 0.994, 0.998].\nObject 3 : hill at [0.000, 0.000, 0.997, 0.998].\nObject 4 : plant at [0.000, 0.764, 0.601, 0.998].\nObject 5 : rock at [0.727, 0.410, 0.808, 0.470].\nObject 6 : sheep at [0.532, 0.546, 0.646, 0.662].\nObject 7 : sheep at [0.532, 0.666, 0.817, 0.810].\nObject 8 : tail at [0.565, 0.572, 0.604, 0.610].\nObject 9 : tree at [0.649, 0.000, 0.997, 0.334].\nObject 10 : trees at [0.736, 0.036, 0.835, 0.100].\nObject 11 : wall at [0.000, 0.000, 0.769, 0.180].\nObject 12 : weed at [0.417, 0.346, 0.492, 0.390].\n\nRelationships:\nobject 7 : sheep -> in a -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 11 : wall -> borders -> object 2 : field.\nobject 7 : sheep -> grazing in a -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 7 : sheep -> grazing in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 10 : trees -> in -> object 2 : field.\nobject 6 : sheep -> has an -> object 1 : ear.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 12 : weed -> growing in -> object 2 : field.\nobject 7 : sheep -> on -> object 3 : hill.\nobject 4 : plant -> on -> object 2 : field.\nobject 5 : rock -> on -> object 3 : hill.\nobject 7 : sheep -> are in -> object 2 : field.\nobject 11 : wall -> running across -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 0 : bush -> in -> object 2 : field.\nobject 5 : rock -> in -> object 2 : field.\nobject 6 : sheep -> has a -> object 8 : tail.\nobject 5 : rock -> in -> object 2 : field.\n\nRegion Description:\nRegion Description at [0.000, 0.072, 0.760, 0.160] : A stone wall boarding a field of sheep.\nRegion Description at [0.189, 0.032, 0.703, 0.178] : rocks and grass in the background of the pasture.\nRegion Description at [0.541, 0.662, 0.823, 0.802] : white sheep grazing in green grassy field.\nRegion Description at [0.538, 0.544, 0.646, 0.656] : white sheep grazing in green grassy field.\nRegion Description at [0.228, 0.374, 0.357, 0.436] : white sheep grazing in green grassy field.\nRegion Description at [0.607, 0.380, 0.712, 0.456] : white sheep grazing in green grassy field.\nRegion Description at [0.811, 0.296, 0.937, 0.338] : two white sheep grazing in green grassy field.\nRegion Description at [0.048, 0.200, 0.249, 0.242] : group of white sheep grazing in green grassy field.\nRegion Description at [0.213, 0.164, 0.336, 0.192] : group of white sheep grazing in green grassy field.\nRegion Description at [0.000, 0.006, 0.997, 0.172] : two long gray stone walls across field.\nRegion Description at [0.453, 0.000, 0.730, 0.062] : a stand of trees outside the stone fence.\n\nGlobal Caption:\nA group of sheep grazing in a grassy valley.\nSheep graze in a lushly green mountain meadow\nA flock of sheep walking along a grassy hillside grazing.\nA flock of sheep are grazing on a grassy slope.\nA group of sheep grazing in a grassy field."}
{"question_id": 28, "image": "000000484415.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arm at [0.000, 0.125, 0.609, 0.988].\nObject 1 : bathroom tile at [0.009, 0.008, 0.994, 0.446].\nObject 2 : blue jeans at [0.369, 0.558, 0.722, 0.979].\nObject 3 : brush at [0.681, 0.208, 0.878, 0.500].\nObject 4 : brush holder at [0.716, 0.279, 0.891, 0.554].\nObject 5 : button at [0.519, 0.113, 0.584, 0.171].\nObject 6 : flusher at [0.534, 0.092, 0.628, 0.300].\nObject 7 : hand at [0.281, 0.125, 0.603, 0.562].\nObject 8 : holder at [0.713, 0.283, 0.903, 0.558].\nObject 9 : lid at [0.028, 0.046, 0.694, 0.446].\nObject 10 : man at [0.000, 0.133, 0.600, 0.992].\nObject 11 : seat at [0.138, 0.583, 0.722, 0.992].\nObject 12 : tank at [0.019, 0.021, 0.706, 0.579].\nObject 13 : tile at [0.794, 0.000, 1.000, 0.200].\nObject 14 : tile at [0.000, 0.000, 0.278, 0.129].\nObject 15 : toilet at [0.016, 0.042, 0.719, 0.996].\nObject 16 : toilet scrubber at [0.744, 0.192, 0.844, 0.521].\nObject 17 : toilet seat at [0.103, 0.517, 0.728, 0.996].\nObject 18 : wall at [0.659, 0.000, 0.978, 0.392].\nObject 19 : water at [0.369, 0.738, 0.500, 0.921].\n\nRelationships:\nobject 15 : toilet -> has -> object 11 : seat.\nobject 4 : brush holder -> by -> object 15 : toilet.\nobject 19 : water -> in -> object 15 : toilet.\nobject 6 : flusher -> on -> object 15 : toilet.\nobject 9 : lid -> on -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> by -> object 15 : toilet.\nobject 10 : man -> has -> object 7 : hand.\nobject 0 : arm -> on -> object 15 : toilet.\nobject 14 : tile -> on -> object 18 : wall.\n\nRegion Description:\nRegion Description at [0.000, 0.046, 0.716, 0.987] : the arm reaching for the white toilet bowl.\nRegion Description at [0.716, 0.192, 0.894, 0.550] : the container and the toilet brush cleaner.\nRegion Description at [0.009, 0.042, 0.894, 0.992] : the toilet bowl next to the toilet bowl cleaner.\nRegion Description at [0.534, 0.087, 0.666, 0.329] : The hand is on the flusher in the image .\nRegion Description at [0.053, 0.158, 0.903, 0.875] : Porcelain toilet with flusher on top of the lid .\nRegion Description at [0.094, 0.154, 0.856, 0.942] : Man flushing the toilet in the bathroom .\n\nGlobal Caption:\nA hand is reaching out to the top if a toilet. \nA person flushing a toilet with a motion sensor.\nA person's hand flushing a toilet with a button on top of the tank. \na persons hand reaching for the top of a toilet\nA hand is reaching over a white toilet."}
{"question_id": 29, "image": "000000491090.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arm at [0.313, 0.238, 0.567, 0.512].\nObject 1 : back wheel at [0.107, 0.502, 0.307, 0.720].\nObject 2 : face at [0.430, 0.118, 0.535, 0.218].\nObject 3 : floor at [0.003, 0.380, 0.997, 0.998].\nObject 4 : front light at [0.765, 0.514, 0.890, 0.634].\nObject 5 : front wheel at [0.642, 0.706, 0.997, 0.996].\nObject 6 : garage door at [0.532, 0.002, 0.858, 0.096].\nObject 7 : glasses at [0.422, 0.140, 0.548, 0.168].\nObject 8 : hand at [0.457, 0.450, 0.561, 0.518].\nObject 9 : indicator light at [0.666, 0.578, 0.722, 0.620].\nObject 10 : jeans at [0.241, 0.438, 0.465, 0.712].\nObject 11 : lettering at [0.003, 0.062, 0.302, 0.146].\nObject 12 : license plate at [0.939, 0.594, 1.000, 0.654].\nObject 13 : mirrors at [0.428, 0.320, 0.559, 0.384].\nObject 14 : motorcycle at [0.067, 0.358, 0.989, 1.000].\nObject 15 : person at [0.227, 0.086, 0.765, 0.758].\nObject 16 : sneaker at [0.243, 0.646, 0.342, 0.758].\nObject 17 : sweater at [0.243, 0.192, 0.676, 0.486].\nObject 18 : tail pipe at [0.059, 0.524, 0.257, 0.706].\n\nRelationships:\nobject 15 : person -> has -> object 7 : glasses.\nobject 15 : person -> has -> object 16 : sneaker.\nobject 15 : person -> has -> object 17 : sweater.\nobject 15 : person -> has -> object 17 : sweater.\nobject 15 : person -> has on -> object 10 : jeans.\nobject 14 : motorcycle -> has -> object 5 : front wheel.\nobject 14 : motorcycle -> has -> object 1 : back wheel.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> on -> object 14 : motorcycle.\nobject 14 : motorcycle -> has -> object 18 : tail pipe.\nobject 15 : person -> sitting on -> object 14 : motorcycle.\nobject 15 : person -> wearing -> object 17 : sweater.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 8 : hand.\nobject 15 : person -> has -> object 7 : glasses.\nobject 13 : mirrors -> are on -> object 14 : motorcycle.\nobject 1 : back wheel -> on -> object 14 : motorcycle.\nobject 5 : front wheel -> on -> object 14 : motorcycle.\nobject 4 : front light -> on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 2 : face.\nobject 15 : person -> has -> object 0 : arm.\nobject 15 : person -> sitting on -> object 14 : motorcycle.\nobject 15 : person -> has -> object 7 : glasses.\n\nRegion Description:\nRegion Description at [0.444, 0.138, 0.521, 0.168] : The eyeglasses the person on the motorcycle is wearing..\nRegion Description at [0.230, 0.640, 0.361, 0.760] : The person on the motorcycle's sneaker..\nRegion Description at [0.297, 0.216, 0.449, 0.404] : The left sleeve of the person's sweater..\nRegion Description at [0.545, 0.254, 0.738, 0.404] : The right sleeve of the person's sweater..\nRegion Description at [0.644, 0.706, 0.997, 0.994] : The front wheel of the motorcycle the person is on..\nRegion Description at [0.102, 0.498, 0.329, 0.692] : The back wheel of the motorcycle the person is on..\nRegion Description at [0.775, 0.518, 0.896, 0.626] : The front light of the motorcycle the person is on..\nRegion Description at [0.439, 0.432, 0.751, 0.522] : The handle bars on the motorcycle the person is on..\nRegion Description at [0.059, 0.516, 0.310, 0.708] : The tail pipe of the motorcycle the person is on..\nRegion Description at [0.663, 0.568, 0.733, 0.634] : small circular orange indicator light.\nRegion Description at [0.056, 0.522, 0.257, 0.706] : stainless steel motorcycle tailpipe .\nRegion Description at 
[0.067, 0.318, 0.992, 0.992] : Black motorcycle with silver accessories.\nRegion Description at [0.636, 0.690, 0.989, 0.992] : Black front wheel and fender of motorcycle.\nRegion Description at [0.243, 0.640, 0.353, 0.754] : Black and white shoe of man on motorcycle.\n\nGlobal Caption:\nA man sitting on one of a group of motorcycles.\nA MAN IS SMILING SITTING ON A MOTOR BIKE \nA middle-aged man leans on a sports bike, smiling\nA person sits on top of a motorcycle with others.\nA woman riding on the back of a motorcycle."}
{"question_id": 30, "image": "000000276018.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : animal at [0.717, 0.042, 0.831, 0.152].\nObject 1 : animal at [0.114, 0.582, 0.348, 0.840].\nObject 2 : baby at [0.385, 0.034, 0.643, 0.434].\nObject 3 : baby at [0.911, 0.028, 1.000, 0.250].\nObject 4 : bear at [0.391, 0.506, 0.622, 0.714].\nObject 5 : bear at [0.695, 0.356, 0.868, 0.580].\nObject 6 : bear hand at [0.114, 0.630, 0.175, 0.660].\nObject 7 : black sock at [0.800, 0.796, 0.858, 0.834].\nObject 8 : blonde boy at [0.166, 0.170, 0.351, 0.460].\nObject 9 : boy at [0.102, 0.388, 0.498, 1.000].\nObject 10 : boy at [0.717, 0.188, 1.000, 0.864].\nObject 11 : child at [0.342, 0.390, 0.622, 1.000].\nObject 12 : coat at [0.077, 0.520, 0.495, 0.910].\nObject 13 : coat at [0.775, 0.296, 1.000, 0.616].\nObject 14 : coat at [0.397, 0.090, 0.634, 0.262].\nObject 15 : flip flops at [0.434, 0.756, 0.606, 0.910].\nObject 16 : girl at [0.372, 0.196, 0.603, 0.922].\nObject 17 : glasses at [0.191, 0.236, 0.308, 0.250].\nObject 18 : grass at [0.637, 0.652, 0.754, 0.788].\nObject 19 : hand at [0.714, 0.094, 0.788, 0.160].\nObject 20 : hands at [0.763, 0.380, 0.877, 0.430].\nObject 21 : hat at [0.757, 0.030, 0.889, 0.078].\nObject 22 : jacket at [0.357, 0.500, 0.622, 0.782].\nObject 23 : jacket at [0.422, 0.286, 0.603, 0.550].\nObject 24 : jacket at [0.163, 0.296, 0.320, 0.462].\nObject 25 : jacket at [0.911, 0.106, 1.000, 0.224].\nObject 26 : lady at [0.286, 0.000, 0.683, 0.560].\nObject 27 : man at [0.628, 0.030, 0.951, 0.742].\nObject 28 : shirt at [0.831, 0.306, 0.957, 0.404].\nObject 29 : shirt at [0.197, 0.296, 0.298, 0.370].\nObject 30 : shoe at [0.717, 0.804, 0.871, 0.864].\nObject 31 : sidewalk at [0.628, 0.574, 0.769, 0.632].\nObject 32 : stuffed animal at [0.286, 0.298, 0.517, 0.422].\n\nRelationships:\nobject 10 : boy -> wearing -> object 28 : shirt.\nobject 3 : baby -> wearing -> object 25 : jacket.\nobject 22 : jacket -> carrying -> object 4 : bear.\nobject 8 : blonde boy -> wears -> object 17 : glasses.\nobject 8 : blonde boy -> wears -> object 24 : jacket.\nobject 11 : child -> holding up -> object 32 : stuffed animal.\nobject 10 : boy -> holding up -> object 5 : bear.\nobject 30 : shoe -> with a -> object 7 : black sock.\nobject 10 : boy -> wearing -> object 7 : black sock.\nobject 26 : lady -> holding -> object 2 : baby.\nobject 16 : girl -> wearing -> object 15 : flip flops.\nobject 9 : boy -> wearing -> object 12 : coat.\nobject 10 : boy -> wearing a -> object 13 : coat.\nobject 4 : bear -> on -> object 20 : hands.\nobject 26 : lady -> carrying -> object 2 : baby.\nobject 0 : animal -> in -> object 19 : hand.\n\nRegion Description:\nRegion Description at [0.905, 0.020, 0.997, 0.272] : blonde haired baby wearing yellow jacket.\nRegion Description at [0.357, 0.388, 0.640, 0.730] : girl in blue jacket carrying blue dog.\nRegion Description at [0.071, 0.378, 0.498, 0.842] : boy in black jacket holding stuffed dog.\nRegion Description at [0.055, 0.572, 0.375, 0.846] : brown stuffed dog with red and white collar.\nRegion Description at [0.283, 0.194, 0.603, 0.400] : girl in pink jacket holding white stuffed animal.\nRegion Description at [0.695, 0.356, 0.874, 0.576] : White stuffed animal wearing a red jacket..\nRegion Description at [0.332, 0.394, 0.618, 0.992] : Little girl holding a grey stuffed dog..\nRegion Description at [0.372, 0.476, 0.723, 0.786] : little girl holding blue and white stuffed animal.\nRegion Description at [0.062, 0.556, 0.422, 0.840] : little boy holding 
brown and white stuffed animal.\n\nGlobal Caption:\na bunch of kids walking through some grass\nA group of children are holding various stuffed animals and dolls.\nKids walking while holding their stuffed animals. \nA group of kids holding teddy bears and looking happy.\nA group of children carrying stuffed animals walks across the grass. "}
{"question_id": 31, "image": "000000361551.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : baggage at [0.107, 0.662, 0.179, 0.750].\nObject 1 : baggage at [0.368, 0.706, 0.456, 0.782].\nObject 2 : building at [0.000, 0.000, 0.997, 0.326].\nObject 3 : cap at [0.784, 0.544, 0.824, 0.568].\nObject 4 : duffel bag at [0.584, 0.702, 0.643, 0.768].\nObject 5 : ground at [0.000, 0.282, 1.000, 0.976].\nObject 6 : hair at [0.920, 0.614, 0.973, 0.640].\nObject 7 : headband at [0.923, 0.628, 0.952, 0.646].\nObject 8 : jacket at [0.776, 0.568, 0.840, 0.642].\nObject 9 : line at [0.696, 0.750, 0.989, 0.794].\nObject 10 : lines at [0.000, 0.436, 0.851, 0.486].\nObject 11 : luggage at [0.907, 0.706, 0.973, 0.786].\nObject 12 : luggage at [0.368, 0.702, 0.456, 0.780].\nObject 13 : man at [0.008, 0.554, 0.139, 0.800].\nObject 14 : man at [0.659, 0.572, 0.920, 0.844].\nObject 15 : man at [0.771, 0.538, 0.843, 0.640].\nObject 16 : pavement at [0.003, 0.308, 0.992, 0.566].\nObject 17 : people at [0.005, 0.562, 0.616, 0.824].\nObject 18 : pillars at [0.211, 0.130, 0.235, 0.240].\nObject 19 : ramp at [0.179, 0.158, 0.707, 0.408].\nObject 20 : service area at [0.003, 0.416, 0.995, 0.996].\nObject 21 : stairs at [0.352, 0.676, 1.000, 0.994].\nObject 22 : sweater at [0.667, 0.634, 0.920, 0.824].\nObject 23 : top at [0.960, 0.626, 1.000, 0.668].\nObject 24 : truck at [0.781, 0.278, 0.997, 0.366].\nObject 25 : walls at [0.608, 0.000, 0.989, 0.320].\nObject 26 : wheel at [0.843, 0.338, 0.875, 0.366].\nObject 27 : woman at [0.917, 0.610, 1.000, 0.724].\n\nRelationships:\nobject 17 : people -> in -> object 20 : service area.\nobject 27 : woman -> bends over -> object 11 : luggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 12 : luggage -> on -> object 5 : ground.\nobject 13 : man -> carries -> object 0 : baggage.\nobject 14 : man -> wears -> object 22 : sweater.\nobject 15 : man -> wears -> object 3 : cap.\nobject 24 : truck -> in -> object 20 : service area.\nobject 15 : man -> wears -> object 8 : jacket.\nobject 10 : lines -> on -> object 16 : pavement.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 9 : line -> on -> object 16 : pavement.\nobject 24 : truck -> has -> object 26 : wheel.\nobject 2 : building -> has -> object 25 : walls.\nobject 15 : man -> on -> object 20 : service area.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 14 : man -> walks down -> object 21 : stairs.\nobject 13 : man -> holds -> object 0 : baggage.\nobject 27 : woman -> wears -> object 7 : headband.\nobject 1 : baggage -> on -> object 20 : service area.\n\nRegion Description:\nRegion Description at [0.443, 0.528, 0.992, 0.850] : People standing in service area of airport..\nRegion Description at [0.648, 0.564, 0.960, 0.892] : Man walking down stairs of unloading ramp..\nRegion Description at [0.229, 0.698, 0.381, 0.776] : Black and red luggage sitting on ground..\nRegion Description at [0.957, 0.616, 0.997, 0.670] : Woman dressed in sleeveless black top..\nRegion Description at [0.011, 0.548, 0.211, 0.750] : Man holding his luggage and bending over.\nRegion Description at [0.893, 0.578, 0.995, 0.678] : woman with a black and white head band.\nRegion Description at [0.235, 0.684, 0.973, 0.816] : Rainbow of colors in the form of luggage.\n\nGlobal Caption:\nSome are standing outside a building with suitcases.\nA few people are getting of a plane.\nA group of people and luggage on a airport tarmac.\nSome people who are placing luggage on a runway.\nAn airport and plane unloading 
passengers with luggage."}
{"question_id": 32, "image": "000000562207.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : body at [0.166, 0.539, 0.296, 0.997].\nObject 1 : boot at [0.594, 0.753, 0.620, 0.870].\nObject 2 : boot at [0.620, 0.744, 0.658, 0.858].\nObject 3 : bucket at [0.268, 0.744, 0.322, 0.828].\nObject 4 : clouds at [0.156, 0.000, 0.968, 0.328].\nObject 5 : ear at [0.590, 0.226, 0.638, 0.410].\nObject 6 : ear at [0.368, 0.208, 0.448, 0.434].\nObject 7 : elephant at [0.328, 0.157, 0.638, 0.967].\nObject 8 : eye at [0.476, 0.319, 0.504, 0.346].\nObject 9 : foot at [0.436, 0.901, 0.516, 0.958].\nObject 10 : grass at [0.950, 0.759, 0.996, 0.807].\nObject 11 : leg at [0.498, 0.572, 0.548, 0.898].\nObject 12 : leg at [0.408, 0.512, 0.516, 0.955].\nObject 13 : man at [0.582, 0.476, 0.662, 0.870].\nObject 14 : man at [0.164, 0.455, 0.292, 0.997].\nObject 15 : mountains at [0.000, 0.265, 0.376, 0.470].\nObject 16 : rock at [0.736, 0.895, 0.762, 0.934].\nObject 17 : sand at [0.240, 0.687, 0.998, 1.000].\nObject 18 : shirt at [0.582, 0.521, 0.650, 0.681].\nObject 19 : shorts at [0.174, 0.699, 0.254, 0.864].\nObject 20 : side at [0.236, 0.675, 0.994, 0.997].\nObject 21 : skirt at [0.298, 0.687, 0.360, 0.810].\nObject 22 : sky at [0.004, 0.000, 0.998, 0.355].\nObject 23 : top at [0.302, 0.539, 0.358, 0.696].\nObject 24 : tree at [0.012, 0.407, 0.076, 0.500].\nObject 25 : trunk at [0.506, 0.392, 0.600, 0.964].\nObject 26 : watch at [0.172, 0.711, 0.192, 0.732].\nObject 27 : water at [0.000, 0.488, 0.994, 1.000].\nObject 28 : woman at [0.288, 0.473, 0.420, 0.967].\n\nRelationships:\nobject 7 : elephant -> on -> object 20 : side.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 14 : man -> standing on -> object 20 : side.\nobject 14 : man -> standing beside -> object 7 : elephant.\nobject 10 : grass -> on -> object 20 : side.\nobject 28 : woman -> wearing -> object 23 : top.\nobject 13 : man -> wearing -> object 18 : shirt.\nobject 13 : man -> wearing -> object 1 : boot.\nobject 13 : man -> wearing -> object 2 : boot.\nobject 28 : woman -> touching -> object 7 : elephant.\nobject 7 : elephant -> has -> object 25 : trunk.\nobject 14 : man -> wearing -> object 19 : shorts.\nobject 28 : woman -> petting -> object 7 : elephant.\nobject 14 : man -> with -> object 7 : elephant.\nobject 28 : woman -> with -> object 7 : elephant.\nobject 13 : man -> with -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 9 : foot -> of an -> object 7 : elephant.\nobject 25 : trunk -> of -> object 7 : elephant.\nobject 11 : leg -> of -> object 7 : elephant.\nobject 12 : leg -> of -> object 7 : elephant.\nobject 5 : ear -> of -> object 7 : elephant.\nobject 6 : ear -> of -> object 7 : elephant.\nobject 8 : eye -> of -> object 7 : elephant.\nobject 27 : water -> behind -> object 7 : elephant.\n\nRegion Description:\nRegion Description at [0.338, 0.139, 0.618, 0.967] : the elephant standing on the lake side.\nRegion Description at [0.154, 0.392, 0.300, 0.964] : a man standing on the lake side with shorts.\nRegion Description at [0.574, 0.422, 0.686, 0.910] : the man standing beside the elephant.\nRegion Description at [0.292, 0.485, 0.378, 0.705] : this lady is wearing a blue tank top.\nRegion Description at [0.722, 0.768, 0.988, 0.964] : the sand is brown with green grass growing in it.\nRegion Description at [0.156, 0.669, 0.270, 0.910] : the man is wearing grey black and white shorts.\nRegion Description at [0.504, 0.560, 0.568, 0.898] : 
The front right leg of the elephant..\nRegion Description at [0.310, 0.536, 0.358, 0.690] : The light blue tank top the girl is wearing..\nRegion Description at [0.262, 0.732, 0.326, 0.825] : The black bucket in the girl's hand..\nRegion Description at [0.002, 0.443, 0.992, 0.994] : The water behind the people and the elephant..\n\nGlobal Caption:\nA group of people are standing next to an elephant emerging from the water.\na group of people stand beside of a giant elephant \nThree tourists pose for a picture next to an elephant.\nThree people stand with an elephant in front of a stream.\nThree people standing next to an elephant along a river."}
{"question_id": 33, "image": "000000553990.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : bar at [0.444, 0.622, 0.640, 0.688].\nObject 1 : boots at [0.328, 0.339, 0.416, 0.492].\nObject 2 : bridal at [0.474, 0.246, 0.678, 0.432].\nObject 3 : food at [0.416, 0.646, 0.466, 0.715].\nObject 4 : foot at [0.324, 0.402, 0.380, 0.492].\nObject 5 : girl at [0.320, 0.078, 0.552, 0.502].\nObject 6 : grass at [0.012, 0.694, 0.998, 0.994].\nObject 7 : ground at [0.004, 0.679, 0.996, 0.913].\nObject 8 : helmet at [0.484, 0.096, 0.560, 0.162].\nObject 9 : hoof at [0.120, 0.853, 0.170, 0.925].\nObject 10 : horse at [0.024, 0.210, 0.690, 0.949].\nObject 11 : legs at [0.478, 0.453, 0.598, 0.637].\nObject 12 : legs at [0.130, 0.583, 0.278, 0.925].\nObject 13 : mane at [0.484, 0.186, 0.648, 0.279].\nObject 14 : person at [0.568, 0.568, 0.604, 0.640].\nObject 15 : poles at [0.460, 0.814, 0.538, 0.955].\nObject 16 : shirt at [0.580, 0.586, 0.594, 0.622].\nObject 17 : shirt at [0.388, 0.150, 0.508, 0.279].\nObject 18 : tail at [0.044, 0.357, 0.222, 0.784].\nObject 19 : tree at [0.720, 0.057, 0.874, 0.568].\nObject 20 : tree at [0.220, 0.000, 0.456, 0.586].\nObject 21 : trees at [0.730, 0.003, 0.986, 0.628].\nObject 22 : wall at [0.188, 0.276, 0.254, 0.393].\nObject 23 : water at [0.028, 0.468, 0.134, 0.574].\n\nRelationships:\nobject 5 : girl -> has -> object 1 : boots.\nobject 6 : grass -> under -> object 10 : horse.\nobject 21 : trees -> behind -> object 10 : horse.\nobject 10 : horse -> jumping -> object 15 : poles.\nobject 11 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 12 : legs -> on -> object 10 : horse.\nobject 14 : person -> in -> object 16 : shirt.\nobject 10 : horse -> has -> object 9 : hoof.\n\nRegion Description:\n\nGlobal Caption:\nA young person ridding a horse jumps a gate in a competition.\nA man riding on a horse as it jumps over a pole. \nA woman is riding a horse as it jumps over a bar.\nthere is a woman jockey riding a hose over the hurdle\nA woman riding a horse jumps over an obstacle."}
{"question_id": 34, "image": "000000106048.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : book at [0.218, 0.105, 0.834, 0.754].\nObject 1 : building at [0.050, 0.000, 1.000, 0.713].\nObject 2 : bus at [0.222, 0.144, 0.820, 0.757].\nObject 3 : bushes at [0.810, 0.401, 1.000, 0.680].\nObject 4 : design at [0.228, 0.422, 0.438, 0.560].\nObject 5 : ground at [0.000, 0.629, 1.002, 0.994].\nObject 6 : headlight at [0.738, 0.590, 0.796, 0.632].\nObject 7 : headlight at [0.522, 0.596, 0.610, 0.629].\nObject 8 : light at [0.604, 0.201, 0.706, 0.222].\nObject 9 : pavement at [0.002, 0.629, 0.996, 0.994].\nObject 10 : pipe at [0.172, 0.147, 0.208, 0.617].\nObject 11 : pipe at [0.438, 0.096, 0.458, 0.192].\nObject 12 : roof at [0.118, 0.000, 0.896, 0.174].\nObject 13 : side mirror at [0.488, 0.314, 0.530, 0.428].\nObject 14 : side mirror at [0.790, 0.332, 0.818, 0.455].\nObject 15 : street at [0.002, 0.611, 0.992, 0.991].\nObject 16 : stripe at [0.228, 0.428, 0.516, 0.569].\nObject 17 : trash can at [0.790, 0.569, 0.822, 0.662].\nObject 18 : wall at [0.858, 0.368, 0.920, 0.419].\nObject 19 : wheel at [0.266, 0.545, 0.294, 0.677].\nObject 20 : wheel at [0.248, 0.551, 0.264, 0.668].\nObject 21 : wheel at [0.444, 0.578, 0.472, 0.751].\nObject 22 : windows at [0.510, 0.216, 0.796, 0.548].\nObject 23 : windshield at [0.518, 0.222, 0.782, 0.545].\n\nRelationships:\nobject 10 : pipe -> running from -> object 12 : roof.\nobject 12 : roof -> to -> object 5 : ground.\nobject 17 : trash can -> next to -> object 3 : bushes.\nobject 3 : bushes -> by -> object 15 : street.\n\nRegion Description:\nRegion Description at [0.568, 0.524, 0.770, 0.599] : Divine Transportation written on front of bus.\nRegion Description at [0.162, 0.129, 0.212, 0.623] : black drain pipe running from the roof to the ground.\nRegion Description at [0.712, 0.177, 0.762, 0.240] : bus identification number on top of bus.\nRegion Description at [0.790, 0.557, 0.820, 0.647] : gray trash can next to bushes behind bus.\nRegion Description at [0.810, 0.407, 0.990, 0.692] : large green bushes in front of building.\nRegion Description at [0.670, 0.317, 0.740, 0.527] : black windshield wiper on windshield.\n\nGlobal Caption:\nA white bus driving past a tall building.\na black and white bus some bushes and building\nA white decorated bus is next to a building.\na large white bus that is by a building\nA large bus parked in a parking lot "}
{"question_id": 35, "image": "000000421923.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : block at [0.156, 0.630, 0.357, 0.822].\nObject 1 : book at [0.414, 0.208, 0.538, 0.364].\nObject 2 : book at [0.360, 0.202, 0.417, 0.360].\nObject 3 : book at [0.426, 0.484, 0.691, 0.522].\nObject 4 : book at [0.399, 0.404, 0.520, 0.554].\nObject 5 : bowl at [0.072, 0.030, 0.288, 0.076].\nObject 6 : center at [0.850, 0.732, 0.886, 0.766].\nObject 7 : eye at [0.282, 0.506, 0.327, 0.532].\nObject 8 : eye at [0.189, 0.506, 0.237, 0.534].\nObject 9 : flower at [0.796, 0.462, 0.982, 0.550].\nObject 10 : flower at [0.817, 0.528, 0.976, 0.612].\nObject 11 : flower at [0.760, 0.678, 0.946, 0.824].\nObject 12 : flower at [0.691, 0.608, 0.838, 0.722].\nObject 13 : flower at [0.913, 0.680, 1.000, 0.770].\nObject 14 : object at [0.213, 0.840, 0.583, 0.972].\nObject 15 : picture at [0.778, 0.060, 1.000, 0.352].\nObject 16 : shelf at [0.324, 0.528, 0.997, 0.624].\nObject 17 : shelf at [0.207, 0.334, 0.997, 0.380].\nObject 18 : shelf at [0.000, 0.028, 0.607, 0.202].\nObject 19 : stack at [0.435, 0.480, 0.712, 0.578].\nObject 20 : statue at [0.147, 0.404, 0.372, 0.652].\nObject 21 : table at [0.000, 0.690, 1.003, 0.998].\nObject 22 : vase at [0.838, 0.774, 0.994, 0.974].\nObject 23 : water at [0.847, 0.864, 0.997, 0.984].\n\nRelationships:\nobject 20 : statue -> on -> object 0 : block.\nobject 14 : object -> on -> object 21 : table.\nobject 1 : book -> on -> object 17 : shelf.\nobject 4 : book -> on -> object 16 : shelf.\nobject 5 : bowl -> on -> object 18 : shelf.\nobject 22 : vase -> has -> object 23 : water.\nobject 20 : statue -> has -> object 8 : eye.\nobject 20 : statue -> has -> object 7 : eye.\nobject 20 : statue -> on -> object 0 : block.\nobject 9 : flower -> in -> object 22 : vase.\nobject 10 : flower -> in -> object 22 : vase.\nobject 12 : flower -> in -> object 22 : vase.\nobject 13 : flower -> in -> object 22 : vase.\nobject 3 : book -> in -> object 19 : stack.\nobject 11 : flower -> has -> object 6 : center.\nobject 1 : book -> on -> object 17 : shelf.\nobject 2 : book -> on -> object 17 : shelf.\nobject 11 : flower -> has -> object 6 : center.\nobject 3 : book -> on -> object 19 : stack.\nobject 19 : stack -> on -> object 16 : shelf.\nobject 20 : statue -> on -> object 0 : block.\n\nRegion Description:\n\nGlobal Caption:\na glass vase with some flowers coming out of it \nA room witb a statue, bookshelves, books and a vase with flowers in it.\nA desk with a vase containing flowers, a sculpture of a man's head and shelves behind it.\nA statue next to a vase of flowers on a shelf. \nThe bust of a man's head is next to a vase of flowers."}
{"question_id": 36, "image": "000000273493.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : ball at [0.640, 0.399, 0.648, 0.411].\nObject 1 : border at [0.040, 0.502, 1.000, 0.556].\nObject 2 : boundary lines at [0.030, 0.661, 1.000, 1.000].\nObject 3 : bushes at [0.020, 0.186, 0.104, 0.517].\nObject 4 : fence at [0.008, 0.366, 0.994, 0.565].\nObject 5 : fence at [0.024, 0.502, 0.996, 0.709].\nObject 6 : grass at [0.004, 0.529, 0.994, 0.997].\nObject 7 : man at [0.144, 0.360, 0.246, 0.736].\nObject 8 : man at [0.730, 0.474, 0.780, 0.613].\nObject 9 : pants at [0.732, 0.529, 0.778, 0.604].\nObject 10 : shirt at [0.164, 0.411, 0.222, 0.547].\nObject 11 : shorts at [0.162, 0.535, 0.220, 0.628].\nObject 12 : sign at [0.916, 0.405, 0.934, 0.438].\nObject 13 : sky at [0.006, 0.021, 0.990, 0.279].\nObject 14 : sneakers at [0.180, 0.709, 0.216, 0.739].\nObject 15 : sneakers at [0.762, 0.598, 0.776, 0.613].\nObject 16 : tennis at [0.012, 0.384, 0.984, 0.934].\nObject 17 : tennis court at [0.000, 0.372, 0.988, 0.979].\nObject 18 : tennis racket at [0.768, 0.526, 0.808, 0.556].\nObject 19 : tennis racket at [0.214, 0.574, 0.238, 0.619].\nObject 20 : trees at [0.586, 0.282, 0.692, 0.420].\nObject 21 : white at [0.734, 0.492, 0.778, 0.601].\n\nRelationships:\nobject 7 : man -> in -> object 10 : shirt.\nobject 7 : man -> with -> object 19 : tennis racket.\nobject 7 : man -> plays -> object 16 : tennis.\nobject 7 : man -> wears -> object 14 : sneakers.\nobject 8 : man -> wears -> object 15 : sneakers.\nobject 7 : man -> wears -> object 11 : shorts.\nobject 8 : man -> wears -> object 9 : pants.\nobject 5 : fence -> has -> object 1 : border.\nobject 20 : trees -> behind -> object 3 : bushes.\nobject 2 : boundary lines -> on -> object 17 : tennis court.\nobject 2 : boundary lines -> on -> object 6 : grass.\nobject 3 : bushes -> behind -> object 4 : fence.\nobject 20 : trees -> behind -> object 4 : fence.\nobject 7 : man -> has -> object 19 : tennis racket.\nobject 8 : man -> wears -> object 21 : white.\nobject 4 : fence -> around -> object 17 : tennis court.\nobject 20 : trees -> behind -> object 8 : man.\nobject 6 : grass -> on -> object 17 : tennis court.\nobject 8 : man -> has -> object 18 : tennis racket.\nobject 8 : man -> hitting -> object 0 : ball.\nobject 5 : fence -> on -> object 17 : tennis court.\n\nRegion Description:\nRegion Description at [0.024, 0.489, 0.998, 0.730] : The tennis net separating the sides of the players..\nRegion Description at [0.144, 0.652, 0.234, 0.745] : The black sneakers the player is wearing..\nRegion Description at [0.720, 0.577, 0.784, 0.613] : The white sneakers the player is wearing..\nRegion Description at [0.158, 0.544, 0.230, 0.628] : The gray shorts the player is wearing..\nRegion Description at [0.006, 0.402, 0.998, 0.574] : The trimmed bushes behind the player..\nRegion Description at [0.008, 0.168, 0.998, 0.402] : The trees behind the trimmed bushes behind the player..\nRegion Description at [0.006, 0.604, 0.998, 0.985] : The white boundary lines on the tennis court..\nRegion Description at [0.020, 0.447, 0.994, 0.760] : A black and white net stretches across the field.\nRegion Description at [0.060, 0.526, 0.984, 0.985] : The field has green grass with white lines.\nRegion Description at [0.016, 0.369, 0.978, 0.595] : A tall green shrub is behind the fence.\nRegion Description at [0.034, 0.150, 0.984, 0.393] : Trees are seen behind the fence and shrub.\nRegion Description at [0.588, 0.327, 0.850, 0.703] : The yellow ball is flying towards the 
man.\nRegion Description at [0.902, 0.378, 0.956, 0.529] : A black circular sign with the number five.\nRegion Description at [0.142, 0.354, 0.248, 0.736] : male in white t-shirt playing tennis.\nRegion Description at [0.200, 0.565, 0.244, 0.625] : Head of tennis racket of man playing.\nRegion Description at [0.726, 0.465, 0.786, 0.631] : Man in white preparing to hit tennis ball.\n\nGlobal Caption:\nTwo men playing a game of tennis on a court.\ntwo people playing tennis with rackets on a grass court\nTwo young men playing a game of tennis.\nPeople playing tennis on a court surrounded by green hedges.\ntHERE ARE TWO MEN PLAYING TENNIS ON THE TENNIS COURT"}
{"question_id": 37, "image": "000000475150.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : animal at [0.220, 0.105, 1.006, 0.997].\nObject 1 : branches at [0.000, 0.000, 1.000, 1.000].\nObject 2 : ear at [0.402, 0.288, 0.452, 0.378].\nObject 3 : eye at [0.332, 0.396, 0.378, 0.429].\nObject 4 : foliage at [0.584, 0.093, 0.748, 0.255].\nObject 5 : giraffe`s neck at [0.476, 0.264, 1.000, 1.003].\nObject 6 : head at [0.216, 0.102, 0.476, 0.706].\nObject 7 : mane at [0.576, 0.502, 0.836, 0.811].\nObject 8 : nose at [0.222, 0.640, 0.266, 0.703].\nObject 9 : sky at [0.000, 0.000, 1.000, 0.562].\nObject 10 : spot at [0.562, 0.535, 0.616, 0.625].\nObject 11 : spot at [0.560, 0.447, 0.592, 0.508].\nObject 12 : spot at [0.592, 0.444, 0.670, 0.556].\nObject 13 : spot at [0.622, 0.565, 0.694, 0.664].\nObject 14 : spot at [0.514, 0.483, 0.570, 0.571].\nObject 15 : spots at [0.700, 0.640, 0.806, 0.817].\nObject 16 : spots at [0.706, 0.823, 0.776, 0.943].\nObject 17 : spots at [0.852, 0.829, 0.984, 0.997].\nObject 18 : spots at [0.674, 0.547, 0.758, 0.655].\nObject 19 : spots at [0.774, 0.700, 0.902, 0.913].\nObject 20 : tree at [0.000, 0.000, 1.000, 1.000].\nObject 21 : wrinkles at [0.466, 0.468, 0.554, 0.586].\n\nRelationships:\nobject 20 : tree -> has -> object 4 : foliage.\nobject 21 : wrinkles -> on -> object 5 : giraffe`s neck.\nobject 3 : eye -> on a -> object 0 : animal.\nobject 4 : foliage -> in -> object 20 : tree.\nobject 1 : branches -> behind -> object 0 : animal.\nobject 14 : spot -> on -> object 0 : animal.\nobject 11 : spot -> on -> object 0 : animal.\nobject 10 : spot -> on -> object 0 : animal.\nobject 12 : spot -> on -> object 0 : animal.\nobject 13 : spot -> on -> object 0 : animal.\nobject 5 : giraffe`s neck -> on -> object 0 : animal.\nobject 3 : eye -> of -> object 0 : animal.\nobject 2 : ear -> of -> object 0 : animal.\nobject 6 : head -> of -> object 0 : animal.\n\nRegion Description:\nRegion Description at [0.616, 0.565, 0.956, 0.958] : the giraffe is spotted tan and brown.\nRegion Description at [0.288, 0.324, 0.572, 0.649] : the giraffes face is white and brown.\n\nGlobal Caption:\nA giraffe stands near a tree in the wilderness. \nA giraffe standing in front of a group of trees.\nA giraffe standing next to a leaf free tree.\nHead and neck of a giraffe in natural feeding habitat.\nA giraffe walking near a tree with very few leaves."}
{"question_id": 38, "image": "000000125472.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : axle at [0.447, 0.814, 0.535, 0.856].\nObject 1 : background at [0.003, 0.744, 0.994, 0.988].\nObject 2 : bracelet at [0.820, 0.444, 0.859, 0.470].\nObject 3 : building at [0.012, 0.888, 0.099, 0.994].\nObject 4 : corner at [0.027, 0.890, 0.117, 0.992].\nObject 5 : fence at [0.030, 0.886, 1.000, 1.000].\nObject 6 : hair at [0.486, 0.078, 0.712, 0.216].\nObject 7 : jean pants at [0.246, 0.380, 0.841, 0.632].\nObject 8 : laces at [0.168, 0.562, 0.850, 0.674].\nObject 9 : logo at [0.429, 0.232, 0.583, 0.364].\nObject 10 : man at [0.201, 0.002, 0.940, 0.758].\nObject 11 : name at [0.000, 0.960, 0.321, 1.000].\nObject 12 : picture at [0.003, 0.004, 1.000, 0.998].\nObject 13 : poles at [0.180, 0.886, 0.432, 0.990].\nObject 14 : shirt at [0.324, 0.124, 0.694, 0.392].\nObject 15 : shoes at [0.189, 0.606, 0.946, 0.792].\nObject 16 : skateboard at [0.012, 0.746, 0.664, 0.886].\nObject 17 : sky at [0.012, 0.002, 1.000, 0.918].\nObject 18 : stadium lights at [0.147, 0.860, 0.456, 0.994].\nObject 19 : stitching at [0.312, 0.408, 0.754, 0.638].\nObject 20 : strip at [0.279, 0.770, 0.529, 0.802].\nObject 21 : top at [0.024, 0.830, 0.420, 0.936].\nObject 22 : trees at [0.024, 0.846, 1.000, 1.000].\nObject 23 : wheels at [0.012, 0.808, 0.586, 0.904].\nObject 24 : wrist at [0.802, 0.434, 0.856, 0.484].\n\nRelationships:\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 23 : wheels -> on a -> object 16 : skateboard.\nobject 14 : shirt -> has a -> object 9 : logo.\nobject 10 : man -> doing trick on -> object 16 : skateboard.\nobject 3 : building -> behind a -> object 5 : fence.\nobject 11 : name -> on -> object 12 : picture.\nobject 11 : name -> has a -> object 11 : name.\nobject 10 : man -> performing on a -> object 16 : skateboard.\nobject 4 : corner -> of -> object 3 : building.\nobject 18 : stadium lights -> are on -> object 13 : poles.\nobject 16 : skateboard -> has -> object 23 : wheels.\nobject 2 : bracelet -> on mans -> object 24 : wrist.\nobject 11 : name -> on -> object 12 : picture.\nobject 16 : skateboard -> under -> object 10 : man.\nobject 10 : man -> wearing -> object 15 : shoes.\nobject 3 : building -> behind -> object 5 : fence.\nobject 22 : trees -> in -> object 1 : background.\nobject 15 : shoes -> have -> object 8 : laces.\nobject 18 : stadium lights -> on -> object 13 : poles.\nobject 5 : fence -> behind -> object 10 : man.\nobject 20 : strip -> on -> object 16 : skateboard.\nobject 19 : stitching -> on -> object 7 : jean pants.\nobject 9 : logo -> on -> object 14 : shirt.\nobject 23 : wheels -> on -> object 16 : skateboard.\nobject 0 : axle -> on -> object 16 : skateboard.\nobject 21 : top -> of -> object 22 : trees.\n\nRegion Description:\nRegion Description at [0.030, 0.774, 0.643, 0.912] : a black skateboard with black wheels.\n\nGlobal Caption:\nA man flying through the air while riding a skateboard.\nA man is doing tricks on a skateboard.\nA skateboarder jumps while trying to perform a trick.\na man in the air standing above the skateboard\na person attempting a jump with a skateboard"}
{"question_id": 39, "image": "000000069138.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : arrows at [0.000, 0.616, 0.214, 0.644].\nObject 1 : awning at [0.159, 0.260, 0.293, 0.336].\nObject 2 : building at [0.000, 0.000, 1.000, 0.466].\nObject 3 : bushes at [0.693, 0.342, 1.000, 0.512].\nObject 4 : door at [0.110, 0.370, 0.266, 0.518].\nObject 5 : face at [0.390, 0.256, 0.614, 0.392].\nObject 6 : greenery at [0.824, 0.154, 0.997, 0.384].\nObject 7 : hitch at [0.221, 0.520, 0.259, 0.542].\nObject 8 : ladder at [0.110, 0.342, 0.283, 0.364].\nObject 9 : license plate at [0.141, 0.460, 0.234, 0.500].\nObject 10 : line at [0.017, 0.700, 0.266, 0.756].\nObject 11 : picture at [0.155, 0.378, 0.259, 0.442].\nObject 12 : plant barrier at [0.672, 0.482, 1.000, 0.606].\nObject 13 : planter at [0.676, 0.152, 1.000, 0.510].\nObject 14 : pole at [0.328, 0.068, 0.483, 0.994].\nObject 15 : road at [0.000, 0.490, 1.000, 1.000].\nObject 16 : roof at [0.117, 0.360, 0.283, 0.382].\nObject 17 : sad face at [0.383, 0.244, 0.614, 0.384].\nObject 18 : short term at [0.624, 0.040, 0.769, 0.080].\nObject 19 : sidewalk at [0.666, 0.572, 0.993, 0.618].\nObject 20 : sign at [0.621, 0.082, 0.772, 0.132].\nObject 21 : sign at [0.007, 0.144, 0.069, 0.204].\nObject 22 : signal at [0.266, 0.210, 0.679, 0.848].\nObject 23 : stop light at [0.366, 0.236, 0.638, 0.394].\nObject 24 : tail light at [0.100, 0.446, 0.121, 0.472].\nObject 25 : van at [0.076, 0.326, 0.297, 0.556].\nObject 26 : wall at [0.676, 0.500, 0.997, 0.604].\nObject 27 : window at [0.903, 0.000, 1.000, 0.086].\n\nRelationships:\nobject 23 : stop light -> with -> object 17 : sad face.\nobject 0 : arrows -> on -> object 15 : road.\nobject 12 : plant barrier -> beside -> object 15 : road.\nobject 11 : picture -> on -> object 4 : door.\nobject 10 : line -> painted in -> object 15 : road.\nobject 19 : sidewalk -> next to -> object 15 : road.\nobject 2 : building -> for -> object 18 : short term.\nobject 23 : stop light -> making -> object 5 : face.\nobject 3 : bushes -> just above -> object 26 : wall.\nobject 22 : signal -> on -> object 14 : pole.\nobject 25 : van -> has -> object 16 : roof.\nobject 25 : van -> has -> object 8 : ladder.\nobject 8 : ladder -> on -> object 16 : roof.\nobject 13 : planter -> by -> object 15 : road.\nobject 23 : stop light -> on -> object 22 : signal.\n\nRegion Description:\nRegion Description at [0.331, 0.852, 0.472, 0.996] : Pole holding traffic light on street.\nRegion Description at [0.600, 0.036, 0.793, 0.084] : Building offers short term office space.\nRegion Description at [0.603, 0.074, 0.776, 0.120] : Office space as small as 2,500 sq. ft. available.\nRegion Description at [0.003, 0.008, 0.972, 0.356] : an office building is in the background.\n\nGlobal Caption:\nA red traffic light with a sad face drawn over it.\nA street scene with a close of of a stop light.\nA red stoplight with a street in the background.\nA stop sign gives traffic a frown face.\nThe sign is now at a red light."}
{"question_id": 40, "image": "000000408120.jpg", "category": "refer_reason", "text": "Objects:\nObject 0 : alley at [0.052, 0.261, 0.948, 0.997].\nObject 1 : bars at [0.050, 0.000, 0.400, 0.682].\nObject 2 : black tire at [0.500, 0.219, 0.522, 0.249].\nObject 3 : brick at [0.784, 0.105, 0.818, 0.144].\nObject 4 : bricks at [0.926, 0.165, 0.946, 0.195].\nObject 5 : building at [0.742, 0.000, 0.954, 0.796].\nObject 6 : car at [0.418, 0.168, 0.526, 0.240].\nObject 7 : concrete at [0.394, 0.565, 0.570, 0.718].\nObject 8 : corner at [0.850, 0.934, 0.950, 1.000].\nObject 9 : curb at [0.050, 0.264, 0.396, 0.868].\nObject 10 : fence at [0.686, 0.252, 0.826, 0.565].\nObject 11 : flower at [0.580, 0.078, 0.608, 0.123].\nObject 12 : flowers at [0.598, 0.072, 0.634, 0.105].\nObject 13 : girl at [0.444, 0.249, 0.500, 0.480].\nObject 14 : photo at [0.044, 0.000, 0.956, 0.997].\nObject 15 : plants at [0.040, 0.324, 0.224, 0.685].\nObject 16 : polka dot at [0.430, 0.231, 0.450, 0.261].\nObject 17 : road at [0.048, 0.243, 0.954, 0.994].\nObject 18 : shirt at [0.456, 0.279, 0.496, 0.390].\nObject 19 : shoe at [0.484, 0.441, 0.496, 0.459].\nObject 20 : shoe at [0.452, 0.459, 0.470, 0.489].\nObject 21 : umbrella at [0.404, 0.189, 0.528, 0.297].\nObject 22 : wall at [0.738, 0.003, 0.950, 0.760].\nObject 23 : wall window at [0.524, 0.000, 0.538, 0.060].\nObject 24 : window at [0.570, 0.003, 0.586, 0.051].\nObject 25 : window at [0.524, 0.102, 0.538, 0.150].\n\nRelationships:\nobject 13 : girl -> with -> object 19 : shoe.\nobject 13 : girl -> with -> object 20 : shoe.\nobject 13 : girl -> with -> object 18 : shirt.\nobject 4 : bricks -> on -> object 5 : building.\nobject 15 : plants -> are near -> object 0 : alley.\nobject 6 : car -> on -> object 17 : road.\nobject 8 : corner -> of an -> object 0 : alley.\nobject 15 : plants -> in front of -> object 14 : photo.\nobject 21 : umbrella -> on -> object 13 : girl.\nobject 9 : curb -> built alongside -> object 17 : road.\n\nRegion Description:\nRegion Description at [0.038, 0.426, 0.162, 0.526] : patch of green plants in front of photo.\nRegion Description at [0.586, 0.060, 0.678, 0.138] : purple flowers inside of bush on right.\n\nGlobal Caption:\nA little girl that is standing with an umbrella.\nA little girl walking down a driveway carrying a pink umbrella.\nA LITTLE GIRL DRESSED IN PINK ALSO HAS A PINK UMBRELLA\nA small girl is holding an umbrella over her head\nA young girl carries and open unbrella while walking down an alley."}

40
ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl Normal file
View File

@ -0,0 +1,40 @@
{"question_id": 0, "image": "000000130566.jpg", "category": "refer_reason", "text": "What might be the purpose of the object [0.850, 0.523, 0.898, 0.583] on the train?"}
{"question_id": 1, "image": "000000010764.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.546, 0.625, 0.626, 0.801]?"}
{"question_id": 2, "image": "000000184324.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.478, 0.464, 0.492, 0.491]?"}
{"question_id": 3, "image": "000000452122.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.450, 0.592, 0.600, 0.643]?"}
{"question_id": 4, "image": "000000032334.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.350, 0.325, 0.516, 0.373]?"}
{"question_id": 5, "image": "000000360960.jpg", "category": "refer_reason", "text": "What might the man [0.850, 0.156, 1.000, 0.652] do next based on the current scene?"}
{"question_id": 7, "image": "000000376322.jpg", "category": "refer_reason", "text": "What is the likely occasion for the people [0.000, 0.428, 0.997, 0.998]?"}
{"question_id": 8, "image": "000000271402.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.462, 0.480, 0.713, 0.840]?"}
{"question_id": 9, "image": "000000356424.jpg", "category": "refer_reason", "text": "What might be the purpose of the object [0.419, 0.134, 0.509, 0.184]?"}
{"question_id": 10, "image": "000000131138.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.872, 0.556, 0.993, 0.634] in this setting?"}
{"question_id": 11, "image": "000000332318.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.796, 0.910, 0.894, 0.997] in this rural setting?"}
{"question_id": 12, "image": "000000513567.jpg", "category": "refer_reason", "text": "What could be the potential reason for the girl [0.682, 0.229, 0.742, 0.315] to have her mouth open?"}
{"question_id": 13, "image": "000000134722.jpg", "category": "refer_reason", "text": "What's the purpose of the object [0.348, 0.499, 0.410, 0.584]?"}
{"question_id": 14, "image": "000000341058.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.548, 0.180, 0.779, 0.344]?"}
{"question_id": 15, "image": "000000277051.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.080, 0.003, 0.296, 0.721]?"}
{"question_id": 16, "image": "000000376900.jpg", "category": "refer_reason", "text": "What is the object [0.235, 0.578, 0.304, 0.664] used for?"}
{"question_id": 17, "image": "000000412240.jpg", "category": "refer_reason", "text": "What does the region [0.646, 0.869, 0.824, 0.923] likely represent in this image?"}
{"question_id": 18, "image": "000000179765.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.626, 0.501, 0.698, 0.680] on the bike?"}
{"question_id": 19, "image": "000000329219.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.056, 0.214, 0.140, 0.277]?"}
{"question_id": 20, "image": "000000184384.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.454, 0.024, 0.638, 0.288] on the cake?"}
{"question_id": 21, "image": "000000018519.jpg", "category": "refer_reason", "text": "What is the use of the object [0.279, 0.524, 0.341, 0.570]?"}
{"question_id": 22, "image": "000000415748.jpg", "category": "refer_reason", "text": "Can you tell me what is unusual about the object [0.462, 0.670, 0.489, 0.692]?"}
{"question_id": 23, "image": "000000543300.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.414, 0.691, 0.662, 0.725]?"}
{"question_id": 24, "image": "000000349184.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.458, 0.488, 0.605, 0.694]?"}
{"question_id": 25, "image": "000000042070.jpg", "category": "refer_reason", "text": "Can you tell what might be the purpose of the object [0.258, 0.055, 0.770, 0.168]?"}
{"question_id": 26, "image": "000000241668.jpg", "category": "refer_reason", "text": "What can be inferred from the object [0.786, 0.780, 0.794, 0.796]?"}
{"question_id": 27, "image": "000000535578.jpg", "category": "refer_reason", "text": "What purpose does the object [0.000, 0.072, 0.760, 0.160] serve in relation to the sheep?"}
{"question_id": 28, "image": "000000484415.jpg", "category": "refer_reason", "text": "What is the function of the object [0.681, 0.208, 0.878, 0.500]?"}
{"question_id": 29, "image": "000000491090.jpg", "category": "refer_reason", "text": "What might be the function of the object [0.663, 0.568, 0.733, 0.634]?"}
{"question_id": 30, "image": "000000276018.jpg", "category": "refer_reason", "text": "What is the purpose of the item [0.757, 0.030, 0.889, 0.078]?"}
{"question_id": 31, "image": "000000361551.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.784, 0.544, 0.824, 0.568]?"}
{"question_id": 32, "image": "000000562207.jpg", "category": "refer_reason", "text": "What is the object [0.268, 0.744, 0.322, 0.828] used for?"}
{"question_id": 33, "image": "000000553990.jpg", "category": "refer_reason", "text": "What is the function of the object [0.474, 0.246, 0.678, 0.432]?"}
{"question_id": 34, "image": "000000106048.jpg", "category": "refer_reason", "text": "What could be the purpose of the text found in the region [0.568, 0.524, 0.770, 0.599]?"}
{"question_id": 35, "image": "000000421923.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.838, 0.774, 0.994, 0.974]?"}
{"question_id": 36, "image": "000000273493.jpg", "category": "refer_reason", "text": "What is the function of the object [0.916, 0.405, 0.934, 0.438]?"}
{"question_id": 37, "image": "000000475150.jpg", "category": "refer_reason", "text": "What is the pattern on the object within the region [0.616, 0.565, 0.956, 0.958] and what does it indicate about the object?"}
{"question_id": 38, "image": "000000125472.jpg", "category": "refer_reason", "text": "When was the object [0.012, 0.746, 0.664, 0.886] popularly invented?"}
{"question_id": 39, "image": "000000069138.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.621, 0.082, 0.772, 0.132]?"}
{"question_id": 40, "image": "000000408120.jpg", "category": "refer_reason", "text": "What is the purpose of the object [0.404, 0.189, 0.528, 0.297] in this scene?"}

5
ferret/eval/ferret_gpt4_data/rule.json Normal file
View File

@ -0,0 +1,5 @@
{
"refer_desc": {"role": "Assistant", "prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. The user asks the question about specific region of an image. For your reference, the visual content in the image is represented with five descriptive sentences describing the same image. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y. Also, the relationships between pairs of objects are provided, in the format of object -> relationship -> subject, where the object/subject are indexed by object id from previous object lists as well as the object names. Also, several region description are given, each describing a box region of image, with detailed coordinates. \nPlease rate the spatial correspondence, helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space.\nIn the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."},
"refer_reason": {"role": "Assistant", "prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. The user asks the question about specific region of an image. For your reference, the visual content in the image is represented with five descriptive sentences describing the same image. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y. Also, the relationships between pairs of objects are provided, in the format of object -> relationship -> subject, where the object/subject are indexed by object id from previous object lists as well as the object names. Also, several region description are given, each describing a box region of image, with detailed coordinates. \nPlease rate the spatial correspondence, helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space.\nIn the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."},
"ground_conv": {"role": "Assistant", "prompt": "We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. The user asks the question that requires model to predict the coordinates of relevant object. For your reference, the visual content in the image is represented with five descriptive sentences describing the same image. In addition, specific object locations within the image are given, along with detailed coordinates. These coordinates are in the form of bounding boxes, represented as (x1, y1, x2, y2) with floating numbers ranging from 0 to 1. These values correspond to the top left x, top left y, bottom right x, and bottom right y. Also, the relationships between pairs of objects are provided, in the format of object -> relationship -> subject, where the object/subject are indexed by object id from previous object lists as well as the object names. Also, several region description are given, each describing a box region of image, with detailed coordinates. \nPlease rate the predicted coordinates, helpfulness, relevance, accuracy, level of details of their responses. Specifically, pay your attention to the precision of the coordinates and whether it matches the object. Small deviation (<20% of ground-truth box width or height) of coordinates is allowed and shouldn't be punished. More than that, the degree of deviation should be reflected in scoring too. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.\nPlease first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space.\nIn the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."}
}
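The rule prompts above describe regions as (x1, y1, x2, y2) boxes with floats in [0, 1], whereas the evaluation scripts in this commit rewrite such boxes into an integer 1000x1000 text vocabulary before querying the model (see GPTEval_Data in model_gpt4eval_3newclass.py below). A minimal sketch of that conversion, assuming the same VOCAB_IMAGE_W/H constants; the helper name is illustrative:

# Illustrative sketch; mirrors the box conversion done in GPTEval_Data below.
VOCAB_IMAGE_W, VOCAB_IMAGE_H = 1000, 1000

def normalized_box_to_text_coords(box):
    """Map a normalized (x1, y1, x2, y2) box in [0, 1] to integer text-vocabulary coordinates."""
    x1, y1, x2, y2 = box
    return [int(x1 * VOCAB_IMAGE_W), int(y1 * VOCAB_IMAGE_H),
            int(x2 * VOCAB_IMAGE_W), int(y2 * VOCAB_IMAGE_H)]

# Example: the sign box referenced in question 39 above,
# normalized_box_to_text_coords([0.621, 0.082, 0.772, 0.132]) -> roughly [621, 82, 772, 132]
# (int() truncation can shift a coordinate by one unit).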

View File

@ -0,0 +1,52 @@
#!/bin/bash
CHECKPOINT_FILE='ferret_ft/final-checkpoint'
CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_gpt4eval_3newclass --add_region_feature \
--model-path checkpoints/${CHECKPOINT_FILE} \
--data_path ferret/eval/ferret_gpt4_data/refer_desc/question.jsonl \
--answers-file gpt4_result/${CHECKPOINT_FILE}/refer_desc &
CUDA_VISIBLE_DEVICES=1 python -m ferret.eval.model_gpt4eval_3newclass --add_region_feature \
--model-path checkpoints/${CHECKPOINT_FILE} \
--data_path ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl \
--answers-file gpt4_result/${CHECKPOINT_FILE}/ground_conv &
CUDA_VISIBLE_DEVICES=2 python -m ferret.eval.model_gpt4eval_3newclass --add_region_feature \
--model-path checkpoints/${CHECKPOINT_FILE} \
--data_path ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl \
--answers-file gpt4_result/${CHECKPOINT_FILE}/refer_reason &
wait
echo "Finish Inference."
OPENAI_API_KEY="xxx" python ferret/eval/eval_gpt_review_3newclass.py \
--question ferret/eval/ferret_gpt4_data/refer_desc/question.jsonl \
--context ferret/eval/ferret_gpt4_data/refer_desc/context.jsonl \
--answer-list \
ferret/eval/ferret_gpt4_data/refer_desc/answer.jsonl \
gpt4_result/${CHECKPOINT_FILE}/refer_desc/ferret_answer.jsonl \
--rule ferret/eval/ferret_gpt4_data/rule.json \
--output gpt4_result/${CHECKPOINT_FILE}/review_refer_desc.jsonl &
OPENAI_API_KEY="xxx" python ferret/eval/eval_gpt_review_3newclass.py \
--question ferret/eval/ferret_gpt4_data/ground_conv/question.jsonl \
--context ferret/eval/ferret_gpt4_data/ground_conv/context.jsonl \
--answer-list \
ferret/eval/ferret_gpt4_data/ground_conv/answer.jsonl \
gpt4_result/${CHECKPOINT_FILE}/ground_conv/ferret_answer.jsonl \
--rule ferret/eval/ferret_gpt4_data/rule.json \
--output gpt4_result/${CHECKPOINT_FILE}/review_ground_conv.jsonl &
OPENAI_API_KEY="xxx" python ferret/eval/eval_gpt_review_3newclass.py \
--question ferret/eval/ferret_gpt4_data/refer_reason/question.jsonl \
--context ferret/eval/ferret_gpt4_data/refer_reason/context.jsonl \
--answer-list \
ferret/eval/ferret_gpt4_data/refer_reason/answer.jsonl \
gpt4_result/${CHECKPOINT_FILE}/refer_reason/ferret_answer.jsonl \
--rule ferret/eval/ferret_gpt4_data/rule.json \
--output gpt4_result/${CHECKPOINT_FILE}/review_refer_reason.jsonl &
wait
echo "Finish Review."
echo "Gather final score."
echo $CHECKPOINT_FILE
python ferret/eval/summarize_gpt_review.py \
--dir=gpt4_result/${CHECKPOINT_FILE}
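# Expected outputs (illustrative, based on the paths above): each inference job writes
#   gpt4_result/${CHECKPOINT_FILE}/<task>/ferret_answer.jsonl
# and each review job writes
#   gpt4_result/${CHECKPOINT_FILE}/review_<task>.jsonl
# for <task> in {refer_desc, ground_conv, refer_reason}; summarize_gpt_review.py then
# aggregates these reviews into the final per-task scores.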

260
ferret/eval/model_flickr.py Normal file
View File

@ -0,0 +1,260 @@
"""
Usage:
--data_path: path of flickr30k annotation.
--image_path: path of flickr30k test images.
--answers-file: path of output result.
Example:
CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_flickr \
--model-path checkpoints/ferret_13b/checkpoint-final \
--image_path data/flickr30k/flickr30k_images_split/test \
--data_path data/annotations/final_flickr_mergedGT_test.json \
--answers-file flickr_result/test_answer \
--add_region_feature \
--chunk-idx 0 \
--num-chunks 1
"""
import argparse
from typing import Any, Tuple
import torch
import os
import json
from tqdm import tqdm
# Added
from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from ferret.model.builder import load_pretrained_model
from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from ferret.conversation import conv_templates, SeparatorStyle
from ferret.utils import disable_torch_init
from PIL import Image
import re
import math
import torchvision
import numpy as np
from copy import deepcopy
# Added for visualization
from PIL import Image, ImageDraw, ImageFont
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def split_list(lst, n):
"""Split a list into n (roughly) equal-sized chunks"""
chunk_size = math.ceil(len(lst) / n)  # ceiling division so every element lands in a chunk
return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
def get_chunk(lst, n, k):
chunks = split_list(lst, n)
return chunks[k]
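# Illustrative example (not executed): with 10 samples and n=3, chunk_size = ceil(10/3) = 4,
# so split_list(list(range(10)), 3) yields chunks of sizes 4, 4 and 2, and
# get_chunk(list(range(10)), 3, 2) returns the last chunk [8, 9].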
def plot_flickr(img, boxes, entities, mode='pred'):
if mode == "gt":
color = "green"
elif mode == "pred":
color = "blue"
draw = ImageDraw.Draw(img)
fnt = ImageFont.load_default()
for box, tk in zip(boxes, entities):
draw.rectangle([box[0], box[1], box[2], box[3]], outline=color)
draw.text((box[0], box[1]-5), f'{tk}', font=fnt, fill=color)
return img
def remove_punctuation(text: str) -> str:
punct = [',',]
for p in punct:
text = text.replace(p, '')
return text.strip()
def resize_bbox(box, image_w=None, image_h=None):
ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
int(box[2] * ratio_w), int(box[3] * ratio_h)]
return new_box
def find_bbox_template(text, img_w, img_h):
entities = []
boxes = []
# Regular expression pattern to match entities and boxes
pattern = r'([a-zA-Z\s]+)\s+(\[[\d,\s]+\])'
# Find all matches in the text
matches = re.findall(pattern, text)
for entity, box_str in matches:
# Append the entity to the entities list
entities.append(entity.strip())
# Convert the box string to a list of integers and append to the boxes list
box = list(map(int, box_str.strip('[]').split(',')))
resized_box = resize_bbox(box, img_w, img_h)
boxes.append(resized_box)
return entities, boxes
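# Illustrative example (not executed): for a hypothetical model output such as
#   "a man [120, 56, 310, 400] holding an umbrella [100, 20, 250, 180]"
# on a 500x375 image, the pattern captures two (entity, box) pairs, giving
#   entities = ['a man', 'holding an umbrella']
#   boxes    = [[60, 21, 155, 150], [50, 7, 125, 67]]
# after rescaling from the 1000x1000 text vocabulary to pixel coordinates.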
class FlickrGrounding(torchvision.datasets.CocoDetection):
def __init__(self, img_folder, ann_file, transforms):
super(FlickrGrounding, self).__init__(img_folder, ann_file)
self._transforms = transforms
self.question_prompt = "What are the locations of <objs>?"
def __getitem__(self, idx):
img, target = super(FlickrGrounding, self).__getitem__(idx)
image_id = self.ids[idx]
coco_img = self.coco.loadImgs(image_id)[0]
file_name = coco_img["file_name"]
caption = coco_img["caption"]
positive_item_pos = coco_img['tokens_positive_eval']
dataset_name = coco_img["dataset_name"] if "dataset_name" in coco_img else None
w, h = img.size
# token_positive = []
bboxes = []
entities = []
for anno in target:
bbox_xywh = anno["bbox"]
bbox_xyxy = np.array([bbox_xywh[0], bbox_xywh[1], bbox_xywh[0] + bbox_xywh[2], bbox_xywh[1] + bbox_xywh[3]])
bbox_xyxy[0::2] = bbox_xyxy[0::2].clip(min=0, max=w)  # clip() returns a copy, so write it back
bbox_xyxy[1::2] = bbox_xyxy[1::2].clip(min=0, max=h)
bboxes.append(bbox_xyxy.tolist())
# tokens_positive = anno["tokens_positive"]
# token_positive.append(tokens_positive)
entities = [remove_punctuation(caption[t[0][0]:t[0][1]].lower()) for t in positive_item_pos]
obj_caption = ", ".join(entities)
assert "<objs>" in self.question_prompt
question = self.question_prompt.replace("<objs>", obj_caption)
target = {"image_id": image_id, "file_name": file_name, "annotations": target,
"caption": caption, "img_w": w, "img_h": h, "question": question, "bboxes": bboxes, "entities": entities}
if self._transforms is not None:
img, target = self._transforms(img, target)
target["dataset_name"] = dataset_name
for extra_key in ["sentence_id", "original_img_id", "original_id", "task_id"]:
if extra_key in coco_img:
target[extra_key] = coco_img[extra_key]
return img, target
def eval_model_flickr(args):
# Data
dataset = FlickrGrounding(img_folder=args.image_path,
ann_file=args.data_path,
transforms=None,
)
data_ids = range(len(dataset))
# Model
disable_torch_init()
model_path = os.path.expanduser(args.model_path)
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
answers_file = os.path.expanduser(args.answers_file)
os.makedirs(answers_file, exist_ok=True)
answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
ans_file = open(answers_file, "w")
for i, id in enumerate(tqdm(chunk_data_ids)):
img, ann = dataset[id]
qs = ann["question"]
cur_prompt = qs
# Plot GTs
# img = plot_flickr(img, ann["bboxes"], ann["entities"], mode="gt")
if model.config.mm_use_im_start_end:
qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
else:
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[args.conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
img_w, img_h = ann["img_w"], ann["img_h"]
image_tensor = image_processor.preprocess(img, return_tensors='pt', do_resize=True,
do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
output_ids = model.generate(
input_ids,
images=image_tensor.unsqueeze(0).half().cuda(),
do_sample=True,
temperature=args.temperature,
top_p=args.top_p,
num_beams=args.num_beams,
# no_repeat_ngram_size=3,
max_new_tokens=1024,
use_cache=True,
stopping_criteria=[stopping_criteria],
)
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
if outputs.endswith(stop_str):
outputs = outputs[:-len(stop_str)]
outputs = outputs.strip()
# Plot Preds
# pred_entities, pred_bboxes = find_bbox_template(outputs, img_w=img_w, img_h=img_h)
# img = plot_flickr(img, pred_bboxes, pred_entities, mode="pred")
# img.save('flickr_result/images/{}.png'.format(i))
ans_file.write(json.dumps({"image_id": ann['original_img_id'],
"sentence_id": ann['sentence_id'],
"file_name": ann["file_name"],
"prompt": cur_prompt,
"text": outputs,
"width": ann['img_w'],
"height": ann['img_h'],
}) + "\n")
ans_file.flush()
ans_file.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--image_path", type=str, default="data/flickr30k/flickr30k_images_split/test")
parser.add_argument("--data_path", type=str, default="data/annotations/final_flickr_separateGT_test.json")
parser.add_argument("--answers-file", type=str, default="flickr_result/test_answer.jsonl")
parser.add_argument("--conv-mode", type=str, default="ferret_v1")
parser.add_argument("--num-chunks", type=int, default=1)
parser.add_argument("--chunk-idx", type=int, default=0)
parser.add_argument("--image_w", type=int, default=336) # 224
parser.add_argument("--image_h", type=int, default=336) # 224
parser.add_argument("--add_region_feature", action="store_true")
parser.add_argument("--temperature", type=float, default=0.001)
parser.add_argument("--top_p", type=float, default=None)
parser.add_argument("--num_beams", type=int, default=1)
args = parser.parse_args()
eval_model_flickr(args)

274
ferret/eval/model_gpt4eval_3newclass.py Normal file
View File

@ -0,0 +1,274 @@
"""
Usage:
Example:
CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_gpt4eval_3newclass \
--model-path checkpoints/ferret_13b \
--data_path ferret/eval/ferret_gpt4_data/refer_desc/question.jsonl \
--answers-file gpt4_result/refer_desc/ferret_ft_clipL336_vicunaV1-3-13b_3Ep --add_region_feature
"""
import argparse
import torch
import os
import json
from tqdm import tqdm
# Added
from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from ferret.model.builder import load_pretrained_model
from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from ferret.conversation import conv_templates, SeparatorStyle
from ferret.utils import disable_torch_init
from PIL import Image
import math
import pdb
import numpy as np
from copy import deepcopy
from functools import partial
import re
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def split_list(lst, n):
"""Split a list into n (roughly) equal-sized chunks"""
chunk_size = math.ceil(len(lst) / n)  # ceiling division so every element lands in a chunk
return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
def get_chunk(lst, n, k):
chunks = split_list(lst, n)
return chunks[k]
def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
if mask is not None:
assert mask.shape[0] == raw_w and mask.shape[1] == raw_h
coor_mask = np.zeros((raw_w, raw_h))
# Assume it samples a point.
if len(coor) == 2:
# Define window size
span = 5
# Make sure the window does not exceed array bounds
x_min = max(0, coor[0] - span)
x_max = min(raw_w, coor[0] + span + 1)
y_min = max(0, coor[1] - span)
y_max = min(raw_h, coor[1] + span + 1)
coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
elif len(coor) == 4:
# Box input or Sketch input.
coor_mask[coor[0]:coor[2]+1, coor[1]:coor[3]+1] = 1
if mask is not None:
coor_mask = coor_mask * mask
coor_mask = torch.from_numpy(coor_mask)
try:
assert len(coor_mask.nonzero()) != 0
except:
pdb.set_trace()
return coor_mask
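# Illustrative example (not executed): for a 640x480 image and a box in raw pixel
# coordinates, e.g. coor = [100, 50, 300, 200], this returns a 640x480 tensor
# (indexed [x, y]) that is 1 inside columns 100..300 and rows 50..200 and 0 elsewhere;
# a 2-element point instead marks an 11x11 window centred on the point (span=5 per side,
# clipped to the image bounds).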
class GPTEval_Data():
def __init__(self, data_path, image_path, args) -> None:
datas = [json.loads(q) for q in open(os.path.expanduser(args.data_path), "r")]
for idx, i in enumerate(datas):
i['image_path'] = os.path.join(image_path, i['image'].split('/')[-1])
img_raw_w, img_raw_h = Image.open(i['image_path']).size
pattern = r'\[.*?\]'
matches = re.findall(pattern, i['text'])
question = i['text']
masks = []
for match in matches:
coor_cur = match.replace('[', '')
coor_cur = coor_cur.replace(']', '')
coor_cur = coor_cur.split(',')
coor_cur = [float(i.strip()) for i in coor_cur]
raw_box_coor = [int(coor_cur[0] * img_raw_w), int(coor_cur[1] * img_raw_h), int(coor_cur[2] * img_raw_w), int(coor_cur[3] * img_raw_h)]
converted_box_coor = [int(coor_cur[0] * VOCAB_IMAGE_W), int(coor_cur[1] * VOCAB_IMAGE_H), int(coor_cur[2] * VOCAB_IMAGE_W), int(coor_cur[3] * VOCAB_IMAGE_H)]
if args.add_region_feature:
question = question.replace(match, f'[{converted_box_coor[0]}, {converted_box_coor[1]}, {converted_box_coor[2]}, {converted_box_coor[3]}] {DEFAULT_REGION_FEA_TOKEN}')
generated_mask = generate_mask_for_feature(raw_box_coor, raw_w=img_raw_w, raw_h=img_raw_h, mask=None)
masks.append(generated_mask)
else:
question = question.replace(match, f'[{converted_box_coor[0]}, {converted_box_coor[1]}, {converted_box_coor[2]}, {converted_box_coor[3]}]')
# pdb.set_trace()
if args.add_region_feature:
i['region_masks'] = masks
else:
i['region_masks'] = None
i['question'] = question
# obj_list = [json.loads(line) for line in tqdm(open(data_path))]
# question_prompt = "Is the object in <location> of the image a <obj1> or a <obj2>?"
# for idx, i in enumerate(obj_list):
# i['image_path'] = os.path.join(image_path, i['image_path'].split('/')[-1])
# ratio_w = VOCAB_IMAGE_W * 1.0 / i['width']
# ratio_h = VOCAB_IMAGE_H * 1.0 / i['height']
# point_x_textvocab = int(i['sample_point'][0]*ratio_w)
# point_y_textvocab = int(i['sample_point'][1]*ratio_h)
# box_x1 = int(i['bbox_norm'][0]*i['width'])
# box_y1 = int(i['bbox_norm'][1]*i['height'])
# box_x2 = int(i['bbox_norm'][2]*i['width'])
# box_y2 = int(i['bbox_norm'][3]*i['height'])
# box_x1_textvocab = int(i['bbox_norm'][0]*VOCAB_IMAGE_W)
# box_y1_textvocab = int(i['bbox_norm'][1]*VOCAB_IMAGE_H)
# box_x2_textvocab = int(i['bbox_norm'][2]*VOCAB_IMAGE_W)
# box_y2_textvocab = int(i['bbox_norm'][3]*VOCAB_IMAGE_H)
# if args.region_format == 'point':
# region_coordinate_raw = [i['sample_point'][0], i['sample_point'][1]]
# i['question'] = question_prompt.replace('<location>', '[{}, {}]'.format(point_x_textvocab, point_y_textvocab))
# segment_mask = None
# elif args.region_format == 'box' or args.region_format == 'segment':
# region_coordinate_raw = [box_x1, box_y1, box_x2, box_y2]
# i['question'] = question_prompt.replace('<location>', '[{}, {}, {}, {}]'.format(box_x1_textvocab, box_y1_textvocab, box_x2_textvocab, box_y2_textvocab))
# if args.region_format == 'segment':
# segment_mask = np.array(i['segment_mask'])
# else:
# segment_mask = None
# else:
# raise NotImplementedError(f'{args.region_format} is not supported.')
# if args.add_region_feature:
# i['question'] = i['question'].replace('of the image', f'{DEFAULT_REGION_FEA_TOKEN} of the image')
# generated_mask = generate_mask_for_feature(region_coordinate_raw, raw_w=i['width'], raw_h=i['height'], mask=segment_mask)
# i['region_masks'] = [generated_mask]
# else:
# i['region_masks'] = None
# if idx % 2 == 0:
# i['question'] = i['question'].replace('<obj1>', i['name'])
# i['question'] = i['question'].replace('<obj2>', i['neg_class'])
# else:
# i['question'] = i['question'].replace('<obj2>', i['name'])
# i['question'] = i['question'].replace('<obj1>', i['neg_class'])
self.datas = datas
self._ids = range(len(self.datas))
# pdb.set_trace()
@property
def ids(self):
return deepcopy(self._ids)
def fetch_data(self, id):
ann = self.datas[id]
return ann
def eval_model(args):
# Data
dataset = GPTEval_Data(data_path=args.data_path, image_path=args.image_path, args=args)
data_ids = dataset.ids
# Model
disable_torch_init()
model_path = os.path.expanduser(args.model_path)
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
# chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
answers_file = os.path.expanduser(args.answers_file)
os.makedirs(answers_file, exist_ok=True)
# answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
answers_file = os.path.join(answers_file, f'ferret_answer.jsonl')
ans_file = open(answers_file, "w")
for i, id in enumerate(tqdm(data_ids)):
ann = dataset.fetch_data(id)
image_path = ann['image_path']
qs = ann['question']
cur_prompt = qs
# pdb.set_trace()
if model.config.mm_use_im_start_end:
qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
else:
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[args.conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
image = Image.open(image_path).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt', do_resize=True,
do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
region_masks = ann['region_masks']
if region_masks is not None:
region_masks = [[region_mask_i.cuda().half() for region_mask_i in region_masks]]
else:
region_masks = None
# pdb.set_trace()
with torch.inference_mode():
model.orig_forward = model.forward
model.forward = partial(
model.orig_forward,
region_masks=region_masks
)
output_ids = model.generate(
input_ids,
images=image_tensor.unsqueeze(0).half().cuda(),
do_sample=True,
temperature=args.temperature,
top_p=args.top_p,
num_beams=args.num_beams,
# no_repeat_ngram_size=3,
max_new_tokens=1024,
use_cache=True,
stopping_criteria=[stopping_criteria])
model.forward = model.orig_forward
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
if outputs.endswith(stop_str):
outputs = outputs[:-len(stop_str)]
outputs = outputs.strip()
# pdb.set_trace()
ans_file.write(json.dumps({"question_id":ann['question_id'], # +1 offset
"image_path":image_path,
"prompt": cur_prompt,
"text": outputs
}) + "\n")
ans_file.flush()
ans_file.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--image_path", type=str, default="dataset/cocoval2017")
parser.add_argument("--data_path", type=str, default="dataset/lvis/lvis_v1_minival_inserted_image_name.json")
parser.add_argument("--answers-file", type=str, default="lvis_result/answer.jsonl")
parser.add_argument("--conv-mode", type=str, default="ferret_v1")
parser.add_argument("--num-chunks", type=int, default=1)
parser.add_argument("--chunk-idx", type=int, default=0)
parser.add_argument("--image_w", type=int, default=336) # 224
parser.add_argument("--image_h", type=int, default=336) # 224
parser.add_argument("--add_region_feature", action="store_true")
parser.add_argument("--temperature", type=float, default=0.001)
parser.add_argument("--top_p", type=float, default=None)
parser.add_argument("--num_beams", type=int, default=1)
parser.add_argument("--region_format", type=str, default="box", choices=["point", "box", "segment"])
args = parser.parse_args()
eval_model(args)

277
ferret/eval/model_lvis.py Normal file
View File

@ -0,0 +1,277 @@
"""
Usage:
--data_path: path of LVIS eval annotation.
--image_path: path of coco val 2017 images.
--answers-file: path of output result.
Change --region_format to evaluate different types of referring regions. Choices: ["point", "box", "free_shape"]
If evaluating on free-form shapes:
CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_lvis \
--model-path checkpoints/ferret_13b/final-checkpoint \
--data_path dataset/lvis/lvis_eval.jsonl \
--image_path dataset/cocoval2017 \
--answers-file lvis_result/ferret_13b_freeshape \
--add_region_feature \
--chunk-idx 0 \
--num-chunks 1 \
--region_format free_shape
"""
import argparse
import torch
import os
import json
from tqdm import tqdm
# Added
from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from ferret.model.builder import load_pretrained_model
from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from ferret.conversation import conv_templates, SeparatorStyle
from ferret.utils import disable_torch_init
from PIL import Image
import math
import pdb
import numpy as np
from copy import deepcopy
from functools import partial
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def split_list(lst, n):
"""Split a list into n (roughly) equal-sized chunks"""
chunk_size = math.ceil(len(lst) / n)  # ceiling division so every element lands in a chunk
return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
def get_chunk(lst, n, k):
chunks = split_list(lst, n)
return chunks[k]
def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
if mask is not None:
assert mask.shape[0] == raw_w and mask.shape[1] == raw_h
coor_mask = np.zeros((raw_w, raw_h))
# Assume it samples a point.
if len(coor) == 2:
# Define window size
span = 5
# Make sure the window does not exceed array bounds
x_min = max(0, coor[0] - span)
x_max = min(raw_w, coor[0] + span + 1)
y_min = max(0, coor[1] - span)
y_max = min(raw_h, coor[1] + span + 1)
coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
elif len(coor) == 4:
# Box input or Sketch input.
coor_mask[coor[0]:coor[2]+1, coor[1]:coor[3]+1] = 1
if mask is not None:
coor_mask = coor_mask * mask
coor_mask = torch.from_numpy(coor_mask)
try:
assert len(coor_mask.nonzero()) != 0
except:
pdb.set_trace()
return coor_mask
class LVISData_V1():
def __init__(self, data_path, image_path, args) -> None:
obj_list = [json.loads(line) for line in tqdm(open(data_path))]
# question_prompt = "Is the object in <location> of the image a <obj1> or a <obj2>?"
question_prompt = "Is the object <location> of the image a <obj1> or a <obj2>?"
for idx, i in enumerate(obj_list):
i['image_path'] = os.path.join(image_path, i['image_path'].split('/')[-1])
ratio_w = VOCAB_IMAGE_W * 1.0 / i['width']
ratio_h = VOCAB_IMAGE_H * 1.0 / i['height']
point_x_textvocab = int(i['sample_point'][0]*ratio_w)
point_y_textvocab = int(i['sample_point'][1]*ratio_h)
box_x1 = int(i['bbox_norm'][0]*i['width'])
box_y1 = int(i['bbox_norm'][1]*i['height'])
box_x2 = int(i['bbox_norm'][2]*i['width'])
box_y2 = int(i['bbox_norm'][3]*i['height'])
box_x1_textvocab = int(i['bbox_norm'][0]*VOCAB_IMAGE_W)
box_y1_textvocab = int(i['bbox_norm'][1]*VOCAB_IMAGE_H)
box_x2_textvocab = int(i['bbox_norm'][2]*VOCAB_IMAGE_W)
box_y2_textvocab = int(i['bbox_norm'][3]*VOCAB_IMAGE_H)
if args.region_format == 'point':
region_coordinate_raw = [i['sample_point'][0], i['sample_point'][1]]
if args.no_coor:
assert args.add_region_feature
i['question'] = question_prompt.replace('<location>', '')
else:
i['question'] = question_prompt.replace('<location>', '[{}, {}]'.format(point_x_textvocab, point_y_textvocab))
segment_mask = None
elif args.region_format == 'box' or args.region_format == 'free_shape':
if args.region_format == 'free_shape':
region_coordinate_raw = i['free_shape_bbox_raw']
box_x1_textvocab = int(i['free_shape_bbox_raw'][0]*ratio_w)
box_y1_textvocab = int(i['free_shape_bbox_raw'][1]*ratio_h)
box_x2_textvocab = int(i['free_shape_bbox_raw'][2]*ratio_w)
box_y2_textvocab = int(i['free_shape_bbox_raw'][3]*ratio_h)
else:
region_coordinate_raw = [box_x1, box_y1, box_x2, box_y2]
if args.no_coor:
assert args.add_region_feature
i['question'] = question_prompt.replace('<location>', '')
else:
i['question'] = question_prompt.replace('<location>', '[{}, {}, {}, {}]'.format(box_x1_textvocab, box_y1_textvocab, box_x2_textvocab, box_y2_textvocab))
if args.region_format == 'free_shape':
segment_mask = np.array(i['free_shape_segment_mask'])
else:
segment_mask = None
else:
raise NotImplementedError(f'{args.region_format} is not supported.')
if args.add_region_feature:
i['question'] = i['question'].replace('of the image', f'{DEFAULT_REGION_FEA_TOKEN} of the image')
generated_mask = generate_mask_for_feature(region_coordinate_raw, raw_w=i['width'], raw_h=i['height'], mask=segment_mask)
i['region_masks'] = [generated_mask]
else:
i['region_masks'] = None
if idx % 2 == 0:
i['question'] = i['question'].replace('<obj1>', i['name'])
i['question'] = i['question'].replace('<obj2>', i['neg_class'])
else:
i['question'] = i['question'].replace('<obj2>', i['name'])
i['question'] = i['question'].replace('<obj1>', i['neg_class'])
self.obj_list = obj_list
self._ids = range(len(self.obj_list))
@property
def ids(self):
return deepcopy(self._ids)
def fetch_data(self, id):
ann = self.obj_list[id]
return ann
def eval_model(args):
# Data
dataset = LVISData_V1(data_path=args.data_path, image_path=args.image_path, args=args)
data_ids = dataset.ids
# Model
disable_torch_init()
model_path = os.path.expanduser(args.model_path)
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
answers_file = os.path.expanduser(args.answers_file)
os.makedirs(answers_file, exist_ok=True)
answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
ans_file = open(answers_file, "w")
for i, id in enumerate(tqdm(chunk_data_ids)):
ann = dataset.fetch_data(id)
image_path = ann['image_path']
qs = ann['question']
cur_prompt = qs
# pdb.set_trace()
if model.config.mm_use_im_start_end:
qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
else:
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[args.conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
image = Image.open(image_path).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt', do_resize=True,
do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
region_masks = ann['region_masks']
if region_masks is not None:
region_masks = [[region_mask_i.cuda().half() for region_mask_i in region_masks]]
else:
region_masks = None
# pdb.set_trace()
with torch.inference_mode():
model.orig_forward = model.forward
model.forward = partial(
model.orig_forward,
region_masks=region_masks
)
output_ids = model.generate(
input_ids,
images=image_tensor.unsqueeze(0).half().cuda(),
do_sample=True,
temperature=args.temperature,
top_p=args.top_p,
num_beams=args.num_beams,
# no_repeat_ngram_size=3,
max_new_tokens=1024,
use_cache=True,
stopping_criteria=[stopping_criteria])
model.forward = model.orig_forward
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
if outputs.endswith(stop_str):
outputs = outputs[:-len(stop_str)]
outputs = outputs.strip()
# pdb.set_trace()
ans_file.write(json.dumps({"id":ann['id'], # +1 offset
"image_path":image_path,
"prompt": cur_prompt,
"text": outputs,
"name":ann['name'],
"synonyms":ann['synonyms'],
"bbox":ann['bbox'],
"bbox_norm":ann['bbox_norm'],
"width": ann['width'],
"height": ann['height'],
}) + "\n")
ans_file.flush()
ans_file.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--image_path", type=str, default="dataset/cocoval2017")
parser.add_argument("--data_path", type=str, default="dataset/lvis/lvis_v1_minival_inserted_image_name.json")
parser.add_argument("--answers-file", type=str, default="lvis_result/answer.jsonl")
parser.add_argument("--conv-mode", type=str, default="ferret_v1")
parser.add_argument("--num-chunks", type=int, default=1)
parser.add_argument("--chunk-idx", type=int, default=0)
parser.add_argument("--image_w", type=int, default=336) # 224
parser.add_argument("--image_h", type=int, default=336) # 224
parser.add_argument("--add_region_feature", action="store_true")
parser.add_argument("--no_coor", action="store_true")
parser.add_argument("--temperature", type=float, default=0.001)
parser.add_argument("--top_p", type=float, default=None)
parser.add_argument("--num_beams", type=int, default=1)
parser.add_argument("--region_format", type=str, default="point", choices=["point", "box", "free_shape"])
args = parser.parse_args()
eval_model(args)

181
ferret/eval/model_point_cls_single_image.py Normal file
View File

@ -0,0 +1,181 @@
"""
Usage:
- If evaluating on center points:
CUDA_VISIBLE_DEVICES=1 python -m ferret.eval.model_point_cls_single_image \
--model-path checkpoints/ferret_13b/checkpoint-4500 \
--img_path ferret/serve/examples/extreme_ironing.jpg \
--answers-file lvis_result/single_img/ \
--add_region_feature
"""
import argparse
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import torch
import os
import json
from tqdm import tqdm
from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from ferret.model.builder import load_pretrained_model
from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from ferret.conversation import conv_templates, SeparatorStyle
from ferret.utils import disable_torch_init
from PIL import Image
import random
import math
from copy import deepcopy
import pdb
import numpy as np
from functools import partial
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def generate_mask_for_feature(coor, raw_w, raw_h):
coor_mask = np.zeros((raw_w, raw_h))
# Assume it samples a point.
if len(coor) == 2:
# Define window size
span = 5
# Make sure the window does not exceed array bounds
x_min = max(0, coor[0] - span)
x_max = min(raw_w, coor[0] + span + 1)
y_min = max(0, coor[1] - span)
y_max = min(raw_h, coor[1] + span + 1)
coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
else:
raise NotImplementedError('Coordinates must be 2d.')
coor_mask = torch.from_numpy(coor_mask)
assert len(coor_mask.nonzero()) != 0
return coor_mask
def eval_model(args):
# Model
disable_torch_init()
model_path = os.path.expanduser(args.model_path)
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
# Evaluate the image passed via --img_path (the commented list shows the multi-image
# variant used during development).
image_path_list = [args.img_path]
# image_path_list = ['ferret/serve/examples/2409138.jpg', 'ferret/serve/examples/extreme_ironing.jpg', 'ferret/serve/examples/2332136.jpg']
coor_list = []
grid_w = 10
grid_h = 10
for i in range(grid_w):
for j in range(grid_h):
coor_i = VOCAB_IMAGE_W * (i + 1) / (grid_w+1)
coor_j = VOCAB_IMAGE_H * (j + 1) / (grid_h+1)
coor_list.append([int(coor_i), int(coor_j)])
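# The loop above lays out a 10x10 grid of probe points in the 1000x1000 text-vocabulary
# space: coordinates 1000 * k / 11 for k = 1..10, i.e. from [90, 90] to [909, 909].
# The class question below is then asked once per grid point (100 queries per image).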
if args.add_region_feature:
question = f'What is the class of object <coor> {DEFAULT_REGION_FEA_TOKEN}?'
else:
question = 'What is the class of object <coor>?'
for image_path in image_path_list:
answers_file = os.path.expanduser(args.answers_file)
os.makedirs(answers_file, exist_ok=True)
image_name = image_path.split('.')[0].split('/')[-1]
answers_file = os.path.join(answers_file, f'{image_name}.jsonl')
ans_file = open(answers_file, "w")
for i, coor_i in enumerate(tqdm(coor_list)):
qs = question.replace('<coor>', '[{}, {}]'.format(int(coor_i[0]), int(coor_i[1])))
cur_prompt = qs
if model.config.mm_use_im_start_end:
qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
else:
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[args.conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
# inputs = tokenizer([prompt])
image = Image.open(image_path).convert('RGB')
# image.save(os.path.join(save_image_folder, image_file))
image_tensor = image_processor.preprocess(image, return_tensors='pt', do_resize=True,
do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
# image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
if args.add_region_feature:
generated_mask = generate_mask_for_feature(coor_i, VOCAB_IMAGE_W, VOCAB_IMAGE_H)
region_masks = [generated_mask]
region_masks = [[region_mask_i.cuda().half() for region_mask_i in region_masks]]
else:
region_masks = None
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
model.orig_forward = model.forward
model.forward = partial(
model.orig_forward,
region_masks=region_masks
)
output_ids = model.generate(
input_ids,
images=image_tensor.unsqueeze(0).half().cuda(),
do_sample=True,
temperature=args.temperature,
max_new_tokens=1024,
num_beams=1,
use_cache=True,
stopping_criteria=[stopping_criteria])
model.forward = model.orig_forward
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
if outputs.endswith(stop_str):
outputs = outputs[:-len(stop_str)]
outputs = outputs.strip()
# pdb.set_trace()
img_w, img_h = image.size
ans_file.write(json.dumps({"img_w": img_w,
"img_h": img_h,
"VOCAB_IMAGE_W": VOCAB_IMAGE_W,
"VOCAB_IMAGE_H": VOCAB_IMAGE_H,
"coor": coor_i,
"image_path":image_path,
"prompt": cur_prompt,
"text": outputs,
}) + "\n")
ans_file.flush()
ans_file.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--img_path", type=str, default='llava/serve/examples/extreme_ironing.jpg')
parser.add_argument("--answers-file", type=str, default="lvis_result/answer.jsonl")
parser.add_argument("--conv-mode", type=str, default="ferret_v1")
parser.add_argument("--image_w", type=int, default=336)
parser.add_argument("--image_h", type=int, default=336)
parser.add_argument("--answer_prompter", action="store_true")
parser.add_argument("--add_region_feature", action="store_true")
parser.add_argument("--temperature", type=float, default=0.001)
args = parser.parse_args()
eval_model(args)

212
ferret/eval/model_pope.py Normal file
View File

@ -0,0 +1,212 @@
"""
Usage:
--data_path: path of pope annotation.
--image_path: path of coco2014 val images.
--answers-file: path of output result.
Example:
CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_pope \
--model-path checkpoints/ferret_13b/checkpoint-final \
--image_path data/refcoco/val2014 \
--data_path data/pope/coco_pope_adversarial.json \
--answers-file pope/coco_pope_adversarial \
--add_region_feature \
--chunk-idx 0 \
--num-chunks 8
"""
import argparse
from typing import Any, Tuple
import torch
import os
import json
from tqdm import tqdm
# Added
from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from ferret.model.builder import load_pretrained_model
from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from ferret.conversation import conv_templates, SeparatorStyle
from ferret.utils import disable_torch_init
from PIL import Image
import re
import math
import torchvision
import numpy as np
from copy import deepcopy
# Added for visualization
from PIL import Image, ImageDraw, ImageFont
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def split_list(lst, n):
"""Split a list into n (roughly) equal-sized chunks"""
chunk_size = math.ceil(len(lst) / n)  # ceiling division so every element lands in a chunk
return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
def get_chunk(lst, n, k):
chunks = split_list(lst, n)
return chunks[k]
def plot_pope(img, boxes, text):
draw = ImageDraw.Draw(img)
fnt = ImageFont.load_default()
draw.rectangle([boxes[0], boxes[1], boxes[2], boxes[3]], outline="blue")
draw.text((boxes[0], boxes[1]-5), f'{text}', font=fnt, fill="green")
return img
def resize_bbox(box, image_w=None, image_h=None):
ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
int(box[2] * ratio_w), int(box[3] * ratio_h)]
return new_box
def find_bbox_template_v3(text, img_w, img_h):
pattern = r'\[(\d+), (\d+), (\d+), (\d+)\]'
matches = re.findall(pattern, text)
new_bboxes = []
old_bboxes = []
for match in matches:
x1, y1, x2, y2 = map(int, match)
new_box = resize_bbox([x1, y1, x2, y2], img_w, img_h)
new_bboxes.append(new_box)
old_bboxes.append([x1, y1, x2, y2])
set_old_bboxes = sorted(set(map(tuple, old_bboxes)), key=list(map(tuple, old_bboxes)).index)
list_old_bboxes = list(map(list, set_old_bboxes))
set_bboxes = sorted(set(map(tuple, new_bboxes)), key=list(map(tuple, new_bboxes)).index)
list_bboxes = list(map(list, set_bboxes))
for i in range(len(list_bboxes)):
x1, y1, x2, y2 = list_old_bboxes[i]
obj_string = '[obj{}]'.format(i)
text = text.replace('[{}, {}, {}, {}]'.format(x1, y1, x2, y2), obj_string)
return text, list_bboxes
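# Illustrative example (not executed): on a 500x375 image, a hypothetical answer such as
#   "Yes, there is a cup [50, 100, 200, 300] on the table."
# is returned as "Yes, there is a cup [obj0] on the table." together with
# [[25, 37, 100, 112]], the box rescaled from the 1000x1000 vocabulary to pixels.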
class PopeGrounding():
def __init__(self, img_folder, ann_file):
self.img_folder = img_folder
self.ann_file = ann_file
self.label_list = [json.loads(q) for q in open(self.ann_file, 'r')]
self._ids = range(len(self.label_list))
def __getitem__(self, idx):
label = self.label_list[idx]
filename = label["image"]
image = Image.open(os.path.join(self.img_folder, filename)).convert('RGB')
question = label["text"]
return image, question
@property
def ids(self):
return deepcopy(self._ids)
def eval_model_pope(args):
# Data
dataset = PopeGrounding(img_folder=args.image_path, ann_file=args.data_path)
data_ids = dataset.ids
# Model
disable_torch_init()
model_path = os.path.expanduser(args.model_path)
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
answers_file = os.path.expanduser(args.answers_file)
os.makedirs(answers_file, exist_ok=True)
answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
ans_file = open(answers_file, "w")
for i, id in enumerate(tqdm(chunk_data_ids)):
img, question = dataset[id]
qs = question
img_w, img_h = img.size
if model.config.mm_use_im_start_end:
qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
else:
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[args.conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
image_tensor = image_processor.preprocess(img, return_tensors='pt', do_resize=True,
do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
output_ids = model.generate(
input_ids,
images=image_tensor.unsqueeze(0).half().cuda(),
do_sample=True,
temperature=args.temperature,
top_p=args.top_p,
num_beams=args.num_beams,
# no_repeat_ngram_size=3,
max_new_tokens=1024,
use_cache=True,
stopping_criteria=[stopping_criteria],
)
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
if outputs.endswith(stop_str):
outputs = outputs[:-len(stop_str)]
outputs = outputs.strip()
# Plot Preds
# text, bboxes = find_bbox_template_v3(outputs, img_w=img_w, img_h=img_h)
# # print(text, bboxes)
# img = plot_pope(img, bboxes[0], text)
# img.save('pope/images/{}.png'.format(i))
ans_file.write(json.dumps({"question": question,
"answer": outputs.lower(),
}) + "\n")
ans_file.flush()
ans_file.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--image_path", type=str, default="data/refcoco/val2014")
parser.add_argument("--data_path", type=str, default="data/pope/coco_pope_popular.json")
parser.add_argument("--answers-file", type=str, default="pope/coco_pope_popular")
parser.add_argument("--conv-mode", type=str, default="ferret_v1")
parser.add_argument("--num-chunks", type=int, default=1)
parser.add_argument("--chunk-idx", type=int, default=0)
parser.add_argument("--image_w", type=int, default=336) # 224
parser.add_argument("--image_h", type=int, default=336) # 224
parser.add_argument("--add_region_feature", action="store_true")
parser.add_argument("--temperature", type=float, default=0.001)
parser.add_argument("--top_p", type=float, default=None)
parser.add_argument("--num_beams", type=int, default=1)
args = parser.parse_args()
eval_model_pope(args)

252
ferret/eval/model_refcoco.py Normal file
View File

@ -0,0 +1,252 @@
"""
Usage:
--data_path: path of refcoco annotation.
--image_path: path of refcoco images.
--answers-file: path of output result.
Example:
CUDA_VISIBLE_DEVICES=0 python -m ferret.eval.model_refcoco \
--model-path checkpoints/ferret_13b/checkpoint-final \
--image_path data/refcoco/train2014 \
--data_path data/annotations/finetune_refcocog_test.json \
--answers-file refexp_result/finetune_refcocog_test \
--add_region_feature \
--chunk-idx 0 \
--num-chunks 1
"""
import argparse
from typing import Any, Tuple
import torch
import os
import json
from tqdm import tqdm
# Added
from ferret.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from ferret.model.builder import load_pretrained_model
from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from ferret.conversation import conv_templates, SeparatorStyle
from ferret.utils import disable_torch_init
from PIL import Image
import re
import math
import torchvision
import numpy as np
from copy import deepcopy
# Added for visualization
from PIL import Image, ImageDraw, ImageFont
VOCAB_IMAGE_W = 1000
VOCAB_IMAGE_H = 1000
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def split_list(lst, n):
"""Split a list into n (roughly) equal-sized chunks"""
chunk_size = math.ceil(len(lst) / n)  # ceiling division so every item is covered
return [lst[i:i+chunk_size] for i in range(0, len(lst), chunk_size)]
def get_chunk(lst, n, k):
chunks = split_list(lst, n)
return chunks[k]
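# Illustrative sketch (not part of the original file): the two helpers above shard the
# dataset ids across --num-chunks parallel jobs. With 10 ids and n=3, chunk_size is
# ceil(10 / 3) = 4, so the chunks are [0..3], [4..7] and [8, 9]:
#
#   data_ids = list(range(10))
#   assert get_chunk(data_ids, 3, 0) == [0, 1, 2, 3]
#   assert get_chunk(data_ids, 3, 2) == [8, 9]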
def plot_refexp(img, boxes, entities, mode='pred'):
if mode == "gt":
color = "green"
elif mode == "pred":
color = "blue"
draw = ImageDraw.Draw(img)
fnt = ImageFont.load_default()
draw.rectangle([boxes[0], boxes[1], boxes[2], boxes[3]], outline=color)
draw.text((boxes[0], boxes[1]-5), f'{entities}', font=fnt, fill=color)
return img
def remove_punctuation(text: str) -> str:
punct = [',',]
for p in punct:
text = text.replace(p, '')
return text.strip()
def resize_bbox(box, image_w=None, image_h=None):
ratio_w = image_w * 1.0 / VOCAB_IMAGE_W
ratio_h = image_h * 1.0 / VOCAB_IMAGE_H
new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
int(box[2] * ratio_w), int(box[3] * ratio_h)]
return new_box
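# Illustrative sketch (not part of the original file): generated boxes live in a fixed
# VOCAB_IMAGE_W x VOCAB_IMAGE_H (1000 x 1000) coordinate space, and resize_bbox maps
# them back to pixel coordinates of the actual image. For a 500 x 250 image:
#
#   assert resize_bbox([100, 200, 300, 400], image_w=500, image_h=250) == [50, 50, 150, 100]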
def find_bbox_template(text, img_w, img_h):
pattern = r'\[(\d+), (\d+), (\d+), (\d+)\]'
matches = re.findall(pattern, text)
new_bboxes = []
old_bboxes = []
for match in matches:
x1, y1, x2, y2 = map(int, match)
new_box = resize_bbox([x1, y1, x2, y2], img_w, img_h)
new_bboxes.append(new_box)
old_bboxes.append([x1, y1, x2, y2])
set_old_bboxes = sorted(set(map(tuple, old_bboxes)), key=list(map(tuple, old_bboxes)).index)
list_old_bboxes = list(map(list, set_old_bboxes))
set_bboxes = sorted(set(map(tuple, new_bboxes)), key=list(map(tuple, new_bboxes)).index)
list_bboxes = list(map(list, set_bboxes))
for i in range(len(list_bboxes)):
x1, y1, x2, y2 = list_old_bboxes[i]
text = text.replace('[{}, {}, {}, {}]'.format(x1, y1, x2, y2), '')
if text.endswith(" ."):
text = text[:-2]
split_text = text.split(" . ")
entities = [item.strip() for item in split_text if item.strip() != '']
return entities, list_bboxes
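# Illustrative sketch (not part of the original file): find_bbox_template splits a
# grounded answer into its entity phrases and rescaled boxes. For a 400 x 200 image:
#
#   entities, boxes = find_bbox_template(
#       "a cat [0, 0, 500, 500] . a dog [500, 0, 1000, 500] .", img_w=400, img_h=200)
#   # entities == ['a cat', 'a dog']
#   # boxes == [[0, 0, 200, 100], [200, 0, 400, 100]]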
class RefExpGrounding(torchvision.datasets.CocoDetection):
def __init__(self, img_folder, ann_file, transforms):
super(RefExpGrounding, self).__init__(img_folder, ann_file)
self._transforms = transforms
self.question_prompt = "What is the location of <obj> in the image?"
def __getitem__(self, idx):
img, target = super(RefExpGrounding, self).__getitem__(idx)
image_id = self.ids[idx]
coco_img = self.coco.loadImgs(image_id)[0]
file_name = coco_img["file_name"]
caption = coco_img["caption"]
dataset_name = coco_img["dataset_name"] if "dataset_name" in coco_img else None
assert len(target) == 1
bbox_xywh = target[0]["bbox"]
bbox_xyxy = np.array([bbox_xywh[0], bbox_xywh[1], bbox_xywh[0] + bbox_xywh[2], bbox_xywh[1] + bbox_xywh[3]])
w, h = img.size
bbox_xyxy[0::2] = bbox_xyxy[0::2].clip(min=0, max=w)
bbox_xyxy[1::2] = bbox_xyxy[1::2].clip(min=0, max=h)
assert "<obj>" in self.question_prompt
question = self.question_prompt.replace("<obj>", remove_punctuation(caption))
target = {"image_id": image_id, "file_name": file_name, "annotations": target, "caption": caption,
"img_w": w, "img_h": h, "question": question, "bboxes": bbox_xyxy.tolist(), "entities": [caption]}
if self._transforms is not None:
img, target = self._transforms(img, target)
target["dataset_name"] = dataset_name
for extra_key in ["sentence_id", "original_img_id", "original_id", "task_id"]:
if extra_key in coco_img:
target[extra_key] = coco_img[extra_key]
return img, target
def eval_model_refexp(args):
# Data
dataset = RefExpGrounding(img_folder=args.image_path,
ann_file=args.data_path,
transforms=None,
)
data_ids = range(len(dataset))
# Model
disable_torch_init()
model_path = os.path.expanduser(args.model_path)
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)
chunk_data_ids = get_chunk(data_ids, args.num_chunks, args.chunk_idx)
answers_file = os.path.expanduser(args.answers_file)
os.makedirs(answers_file, exist_ok=True)
answers_file = os.path.join(answers_file, f'{args.chunk_idx}_of_{args.num_chunks}.jsonl')
ans_file = open(answers_file, "w")
for i, id in enumerate(tqdm(chunk_data_ids)):
img, ann = dataset[id]
qs = ann["question"]
cur_prompt = qs
# Plot GTs
# img = plot_refexp(img, ann["bboxes"], ann["entities"], mode="gt")
if model.config.mm_use_im_start_end:
qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
else:
qs = DEFAULT_IMAGE_TOKEN + '\n' + qs
conv = conv_templates[args.conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
img_w, img_h = ann["img_w"], ann["img_h"]
image_tensor = image_processor.preprocess(img, return_tensors='pt', do_resize=True,
do_center_crop=False, size=[args.image_h, args.image_w])['pixel_values'][0]
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
output_ids = model.generate(
input_ids,
images=image_tensor.unsqueeze(0).half().cuda(),
do_sample=True,
temperature=args.temperature,
top_p=args.top_p,
num_beams=args.num_beams,
# no_repeat_ngram_size=3,
max_new_tokens=1024,
use_cache=True,
stopping_criteria=[stopping_criteria],
)
input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
print(f'[Warning] Sample {i}: {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
if outputs.endswith(stop_str):
outputs = outputs[:-len(stop_str)]
outputs = outputs.strip()
# Plot Preds
# pred_entities, pred_bboxes = find_bbox_template(outputs, img_w=img_w, img_h=img_h)
# img = plot_refexp(img, pred_bboxes[0], pred_entities, mode="pred")
# img.save('refexp_result/images/{}.png'.format(i))
ans_file.write(json.dumps({"image_id": ann['image_id'],
"file_name": ann["file_name"],
"prompt": cur_prompt,
"text": outputs,
"width": ann['img_w'],
"height": ann['img_h'],
}) + "\n")
ans_file.flush()
ans_file.close()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--image_path", type=str, default="data/refcoco/train2014")
parser.add_argument("--data_path", type=str, default="data/annotations/finetune_refcoco_testA.json")
parser.add_argument("--answers-file", type=str, default="refexp_result/refcoco_testA")
parser.add_argument("--conv-mode", type=str, default="ferret_v1")
parser.add_argument("--num-chunks", type=int, default=1)
parser.add_argument("--chunk-idx", type=int, default=0)
parser.add_argument("--image_w", type=int, default=336) # 224
parser.add_argument("--image_h", type=int, default=336) # 224
parser.add_argument("--add_region_feature", action="store_true")
parser.add_argument("--temperature", type=float, default=0.001)
parser.add_argument("--top_p", type=float, default=None)
parser.add_argument("--num_beams", type=int, default=1)
args = parser.parse_args()
eval_model_refexp(args)

View File

@ -0,0 +1,63 @@
import json
import os
from collections import defaultdict
import numpy as np
import argparse
def parse_args():
parser = argparse.ArgumentParser(description='ChatGPT-based QA evaluation.')
parser.add_argument('-d', '--dir', default=None)
parser.add_argument('-f', '--files', nargs='*', default=None)
parser.add_argument('-i', '--ignore', nargs='*', default=None)
parser.add_argument('-s', '--save', action='store_true')
return parser.parse_args()
if __name__ == '__main__':
args = parse_args()
if args.ignore is not None:
args.ignore = [int(x) for x in args.ignore]
if args.files is not None and len(args.files) > 0:
review_files = args.files
else:
review_files = [x for x in os.listdir(args.dir) if x.endswith('.jsonl') and (x.startswith('gpt4_text') or x.startswith('reviews_') or x.startswith('review_'))]
metrics = []
for review_file in sorted(review_files):
config = os.path.basename(review_file).replace('gpt4_text_', '').replace('.jsonl', '')
scores = defaultdict(list)
print(config)
with open(os.path.join(args.dir, review_file) if args.dir is not None else review_file) as f:
for review_str in f:
review = json.loads(review_str)
if args.ignore is not None and review['question_id'] in args.ignore:
continue
if 'category' in review:
scores[review['category']].append(review['tuple'])
scores['all'].append(review['tuple'])
else:
if 'tuple' in review:
scores['all'].append(review['tuple'])
else:
scores['all'].append(review['score'])
summ_dict = defaultdict(list)
for k, v in sorted(scores.items()):
stats = np.asarray(v).mean(0).tolist()
stats = [round(x, 3) for x in stats]
# print(k, stats, round(stats[1]/stats[0]*100, 1))
print(k, round(stats[1]/stats[0]*100, 1))
summ_dict[k] = round(stats[1]/stats[0]*100, 1)
print('=================================')
metrics.append(summ_dict)
if args.save:
with open(os.path.join(args.dir, 'metric.json'), 'w') as f:
json.dump(metrics, f, indent=2)

74
ferret/mm_utils.py Normal file
View File

@ -0,0 +1,74 @@
from PIL import Image
from io import BytesIO
import base64
import torch
from transformers import StoppingCriteria
from ferret.constants import IMAGE_TOKEN_INDEX
def load_image_from_base64(image):
return Image.open(BytesIO(base64.b64decode(image)))
def process_images(images, image_processor, model_cfg):
return image_processor(images, return_tensors='pt')['pixel_values']
def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]
def insert_separator(X, sep):
return [ele for sublist in zip(X, [sep]*len(X)) for ele in sublist][:-1]
input_ids = []
offset = 0
if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
offset = 1
input_ids.append(prompt_chunks[0][0])
for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
input_ids.extend(x[offset:])
if return_tensors is not None:
if return_tensors == 'pt':
return torch.tensor(input_ids, dtype=torch.long)
raise ValueError(f'Unsupported tensor type: {return_tensors}')
return input_ids
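# Illustrative sketch (not part of the original file): '<image>' is never tokenized as
# text; the prompt is split around it and the sentinel image_token_index is spliced in
# between the text chunks, so the model can later swap that slot for projected image
# features. Assuming a LLaMA-style tokenizer:
#
#   ids = tokenizer_image_token('<image>\nWhat is in the region?', tokenizer,
#                               IMAGE_TOKEN_INDEX, return_tensors='pt')
#   # ids is a 1-D LongTensor with exactly one IMAGE_TOKEN_INDEX where '<image>' was.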
def get_model_name_from_path(model_path):
model_path = model_path.strip("/")
model_paths = model_path.split("/")
if model_paths[-1].startswith('checkpoint-') or model_paths[-1].endswith('checkpoint'):
return model_paths[-2] + "_" + model_paths[-1]
else:
return model_paths[-1]
class KeywordsStoppingCriteria(StoppingCriteria):
def __init__(self, keywords, tokenizer, input_ids):
self.keywords = keywords
self.keyword_ids = []
for keyword in keywords:
cur_keyword_ids = tokenizer(keyword).input_ids
if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
cur_keyword_ids = cur_keyword_ids[1:]
self.keyword_ids.append(torch.tensor(cur_keyword_ids))
self.tokenizer = tokenizer
self.start_len = input_ids.shape[1]
def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
assert output_ids.shape[0] == 1, "Only support batch size 1 (yet)" # TODO
offset = min(output_ids.shape[1] - self.start_len, 3)
self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
for keyword_id in self.keyword_ids:
if torch.equal(output_ids[0, -keyword_id.shape[0]:], keyword_id):
return True
outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
for keyword in self.keywords:
if keyword in outputs:
return True
return False

1
ferret/model/__init__.py Normal file
View File

@ -0,0 +1 @@
from .language_model.ferret_llama import FERRETLlamaForCausalLM, FERRETConfig

View File

@ -0,0 +1,48 @@
"""
Usage:
python3 -m fastchat.model.apply_delta --base ~/model_weights/llama-7b --target ~/model_weights/vicuna-7b --delta lmsys/vicuna-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from ferret import FERRETLlamaForCausalLM
def apply_delta(base_model_path, target_model_path, delta_path):
print("Loading base model")
base = AutoModelForCausalLM.from_pretrained(
base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Loading delta")
delta = FERRETLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
delta_tokenizer = AutoTokenizer.from_pretrained(delta_path)
print("Applying delta")
for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"):
if name not in base.state_dict():
assert name in ['model.mm_projector.weight', 'model.mm_projector.bias'], f'{name} not in base model'
continue
if param.data.shape == base.state_dict()[name].shape:
param.data += base.state_dict()[name]
else:
assert name in ['model.embed_tokens.weight', 'lm_head.weight'], \
f'{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}'
bparam = base.state_dict()[name]
param.data[:bparam.shape[0], :bparam.shape[1]] += bparam
print("Saving target model")
delta.save_pretrained(target_model_path)
delta_tokenizer.save_pretrained(target_model_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model-path", type=str, required=True)
parser.add_argument("--target-model-path", type=str, required=True)
parser.add_argument("--delta-path", type=str, required=True)
args = parser.parse_args()
apply_delta(args.base_model_path, args.target_model_path, args.delta_path)

139
ferret/model/builder.py Normal file
View File

@ -0,0 +1,139 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import shutil
import pdb
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import torch
from ferret.model import *
from ferret.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto"):
kwargs = {"device_map": device_map}
if load_8bit:
kwargs['load_in_8bit'] = True
elif load_4bit:
kwargs['load_in_4bit'] = True
kwargs['quantization_config'] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4'
)
else:
kwargs['torch_dtype'] = torch.float16
if 'llava' in model_name.lower() or 'ferret' in model_name.lower():
# Load LLaVA/FERRET model
if 'lora' in model_name.lower() and model_base is not None:
lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
print('Loading LLaVA/FERRET from base model...')
model = FERRETLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, **kwargs)
token_num, token_dim = model.lm_head.out_features, model.lm_head.in_features
if model.lm_head.weight.shape[0] != token_num:
model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
print('Loading additional LLaVA/FERRET weights...')
if os.path.exists(os.path.join(model_path, 'non_lora_trainables.bin')):
non_lora_trainables = torch.load(os.path.join(model_path, 'non_lora_trainables.bin'), map_location='cpu')
else:
# this is probably from HF Hub
from huggingface_hub import hf_hub_download
def load_from_hf(repo_id, filename, subfolder=None):
cache_file = hf_hub_download(
repo_id=repo_id,
filename=filename,
subfolder=subfolder)
return torch.load(cache_file, map_location='cpu')
non_lora_trainables = load_from_hf(model_path, 'non_lora_trainables.bin')
non_lora_trainables = {(k[11:] if k.startswith('base_model.') else k): v for k, v in non_lora_trainables.items()}
if any(k.startswith('model.model.') for k in non_lora_trainables):
non_lora_trainables = {(k[6:] if k.startswith('model.') else k): v for k, v in non_lora_trainables.items()}
model.load_state_dict(non_lora_trainables, strict=False)
from peft import PeftModel
print('Loading LoRA weights...')
model = PeftModel.from_pretrained(model, model_path)
print('Merging LoRA weights...')
model = model.merge_and_unload()
print('Model is loaded...')
elif model_base is not None:
# this may be mm projector only
print('Loading LLaVA/FERRET from base model...')
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = FERRETLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, **kwargs)
mm_projector_weights = torch.load(os.path.join(model_path, 'mm_projector.bin'), map_location='cpu')
mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
model.load_state_dict(mm_projector_weights, strict=False)
else:
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = FERRETLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
else:
# Load language model
if model_base is not None:
# PEFT model
from peft import PeftModel
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")
print(f"Loading LoRA weights from {model_path}")
model = PeftModel.from_pretrained(model, model_path)
print(f"Merging weights")
model = model.merge_and_unload()
print('Convert to FP16...')
model.to(torch.float16)
else:
use_fast = False
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
image_processor = None
if 'llava' in model_name.lower() or 'ferret' in model_name.lower():
mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
mm_im_region_fea_token = getattr(model.config, "im_region_fea_token", None)
if mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
if mm_im_region_fea_token is not None:
tokenizer.add_tokens([DEFAULT_REGION_FEA_TOKEN], special_tokens=True)
if mm_use_im_start_end:
tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
vision_tower = model.get_vision_tower()
vision_tower_path = os.path.join(model_path, 'vision_tower')
if not vision_tower.is_loaded or os.path.exists(vision_tower_path):
if os.path.exists(vision_tower_path):
print(f'Start Loading vision tower from {vision_tower_path}')
vision_tower.load_model(vision_tower_path=vision_tower_path)
print(f'Finish Loading vision tower from {vision_tower_path}')
else:
vision_tower.load_model()
vision_tower.to(device='cuda', dtype=torch.float16)
image_processor = vision_tower.image_processor
if hasattr(model.config, "max_sequence_length"):
context_len = model.config.max_sequence_length
else:
context_len = 2048
return tokenizer, model, image_processor, context_len
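# Illustrative sketch (not part of the original file), assuming a local, fully merged
# (non-LoRA) Ferret checkpoint. The model name must contain 'ferret' or 'llava' so the
# multimodal branch above is taken and the vision tower / image processor are loaded:
#
#   from ferret.mm_utils import get_model_name_from_path
#   model_path = "checkpoints/ferret_13b/checkpoint-final"   # hypothetical local path
#   tokenizer, model, image_processor, context_len = load_pretrained_model(
#       model_path, model_base=None, model_name=get_model_name_from_path(model_path))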

View File

@ -0,0 +1,29 @@
"""
Usage:
python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate
"""
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from ferret.model import *
from ferret.model.utils import auto_upgrade
def consolidate_ckpt(src_path, dst_path):
print("Loading model")
auto_upgrade(src_path)
src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False)
src_model.save_pretrained(dst_path)
src_tokenizer.save_pretrained(dst_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--src", type=str, required=True)
parser.add_argument("--dst", type=str, required=True)
args = parser.parse_args()
consolidate_ckpt(args.src, args.dst)

678
ferret/model/ferret_arch.py Normal file
View File

@ -0,0 +1,678 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from .multimodal_encoder.builder import build_vision_tower
import pdb
from ferret.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def rand_sample(x, max_len):
if x.shape[0] <= max_len:
return x
else:
rand_idx = torch.randperm(x.shape[0])[:max_len]
return x[rand_idx, :]
def rand_sample_repeat(x, max_len):
if x.shape[0] < max_len:
indices = torch.randint(0, x.shape[0], (max_len-x.shape[0],))
# pdb.set_trace()
return torch.cat((x, x[indices]), dim=0)
elif x.shape[0] == max_len:
return x
else:
rand_idx = torch.randperm(x.shape[0])[:max_len]
return x[rand_idx, :]
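# Illustrative sketch (not part of the original file): both helpers normalize a
# variable number of region points to a fixed budget. rand_sample only truncates,
# while rand_sample_repeat also pads short inputs by re-drawing existing rows, so
# every region ends up with exactly max_len sampled points:
#
#   pts = torch.arange(6, dtype=torch.float32).reshape(3, 2)   # 3 points of (x, y)
#   assert rand_sample(pts, 8).shape == (3, 2)                 # too few points: unchanged
#   assert rand_sample_repeat(pts, 8).shape == (8, 2)          # padded with repeats
#   assert rand_sample_repeat(pts, 2).shape == (2, 2)          # randomly truncated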
def point_sample(input, point_coords, return_dtype, **kwargs):
"""
A wrapper around :function:`torch.nn.functional.grid_sample` to support 3D point_coords tensors.
Unlike :function:`torch.nn.functional.grid_sample` it assumes `point_coords` to lie inside
[0, 1] x [0, 1] square.
Args:
input (Tensor): A tensor of shape (N, C, H, W) that contains features map on a H x W grid.
point_coords (Tensor): A tensor of shape (N, P, 2) or (N, Hgrid, Wgrid, 2) that contains
[0, 1] x [0, 1] normalized point coordinates.
Returns:
output (Tensor): A tensor of shape (N, C, P) or (N, C, Hgrid, Wgrid) that contains
features for points in `point_coords`. The features are obtained via bilinear
interpolation from `input` the same way as :function:`torch.nn.functional.grid_sample`.
"""
add_dim = False
if point_coords.dim() == 3:
add_dim = True
point_coords = point_coords.unsqueeze(2)
# output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs)
output = F.grid_sample(input.float(), (2.0 * point_coords - 1.0).float(), **kwargs)
output = output.to(return_dtype)
if add_dim:
output = output.squeeze(3)
return output
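# Illustrative sketch (not part of the original file): point_sample reads feature values
# at fractional (x, y) locations in [0, 1]^2 via bilinear interpolation. Sampling the
# center of a constant 2x2 feature map returns that constant:
#
#   feat = torch.full((1, 3, 2, 2), 5.0)                        # (N, C, H, W)
#   coords = torch.tensor([[[0.5, 0.5]]])                       # (N, P, 2) in [0, 1]
#   out = point_sample(feat, coords, return_dtype=torch.float32, align_corners=True)
#   # out.shape == (1, 3, 1) and every value is 5.0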
def farthest_point_sample(xyz, npoint):
"""
Input:
xyz: pointcloud data, [B, N, 2]
npoint: number of samples
Return:
centroids: sampled pointcloud index, [B, npoint]
"""
device = xyz.device
B, N, C = xyz.shape
centroids = torch.zeros(B, npoint, dtype=torch.long).to(device)
distance = torch.ones(B, N).to(device) * 1e10
farthest = torch.randint(0, N, (B,), dtype=torch.long).to(device)
batch_indices = torch.arange(B, dtype=torch.long).to(device)
for i in range(npoint):
centroids[:, i] = farthest
centroid = xyz[batch_indices, farthest, :].view(B, 1, 2)
dist = torch.sum((xyz - centroid) ** 2, -1)
distance = torch.min(distance, dist)
farthest = torch.max(distance, -1)[1]
return centroids
def index_points(points, idx):
"""
Input:
points: input points data, [B, N, C]
idx: sample index data, [B, S]
Return:
new_points:, indexed points data, [B, S, C]
"""
device = points.device
B = points.shape[0]
view_shape = list(idx.shape)
view_shape[1:] = [1] * (len(view_shape) - 1)
repeat_shape = list(idx.shape)
repeat_shape[0] = 1
batch_indices = torch.arange(B, dtype=torch.long).to(device).view(view_shape).repeat(repeat_shape)
new_points = points[batch_indices, idx, :]
return new_points
def square_distance(src, dst):
"""
Calculate Euclid distance between each two points.
src^T * dst = xn * xm + yn * ym + zn * zm
sum(src^2, dim=-1) = xn*xn + yn*yn + zn*zn;
sum(dst^2, dim=-1) = xm*xm + ym*ym + zm*zm;
dist = (xn - xm)^2 + (yn - ym)^2 + (zn - zm)^2
= sum(src**2,dim=-1)+sum(dst**2,dim=-1)-2*src^T*dst
Input:
src: source points, [B, N, C]
dst: target points, [B, M, C]
Output:
dist: per-point square distance, [B, N, M]
"""
B, N, _ = src.shape
_, M, _ = dst.shape
dist = -2 * torch.matmul(src, dst.permute(0, 2, 1))
dist += torch.sum(src ** 2, -1).view(B, N, 1)
dist += torch.sum(dst ** 2, -1).view(B, 1, M)
return dist
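# Illustrative sketch (not part of the original file): the expansion in the docstring is
# just the squared pairwise Euclidean distance, so it can be checked against torch.cdist:
#
#   src, dst = torch.randn(2, 5, 2), torch.randn(2, 7, 2)
#   assert torch.allclose(square_distance(src, dst), torch.cdist(src, dst) ** 2, atol=1e-5)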
def knn_point(nsample, xyz, new_xyz):
"""
Input:
nsample: max sample number in local region
xyz: all points, [B, N, C]
new_xyz: query points, [B, S, C]
Return:
group_idx: grouped points index, [B, S, nsample]
"""
sqrdists = square_distance(new_xyz, xyz)
_, group_idx = torch.topk(sqrdists, nsample, dim=-1, largest=False, sorted=False)
return group_idx
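# Illustrative sketch (not part of the original file) of how the sampler below chains
# these helpers: farthest_point_sample picks well-spread anchor points, index_points
# gathers their coordinates/features, and knn_point groups each anchor's neighbors:
#
#   xyz = torch.rand(2, 512, 2)                        # [B, N, 2] normalized points
#   fps_idx = farthest_point_sample(xyz, 128)          # [B, 128] anchor indices
#   anchors = index_points(xyz, fps_idx)               # [B, 128, 2]
#   group_idx = knn_point(24, xyz, anchors)            # [B, 128, 24] neighbor indices
#   neighbors = index_points(xyz, group_idx)           # [B, 128, 24, 2]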
class ConvReLULN1D(nn.Module):
def __init__(self, in_channels, out_channels, kernel_size=1, bias=True):
super(ConvReLULN1D, self).__init__()
self.act = nn.ReLU(inplace=True)
self.net = nn.Sequential(
nn.Conv1d(in_channels=in_channels, out_channels=out_channels, kernel_size=kernel_size, bias=bias),
self.act
)
self.norm = nn.LayerNorm(out_channels)
def forward(self, x):
# (B, C, N) -> (B, C_1, N)
x = self.net(x)
x = x.permute(0, 2, 1)
x = self.norm(x)
x = x.permute(0, 2, 1)
return x
def normal_init(module, mean=0, std=1, bias=0):
if hasattr(module, 'weight') and module.weight is not None:
nn.init.normal_(module.weight, mean, std)
if hasattr(module, 'bias') and module.bias is not None:
nn.init.constant_(module.bias, bias)
class GeoRegionSampler(nn.Module):
def __init__(self,
input_dim,
output_dim,
num_init_point,
num_sub_point,
num_neighbor,
pooler_mode='mean'):
super(GeoRegionSampler, self).__init__()
self.input_dim = input_dim
self.output_dim = output_dim
self.num_init_point = num_init_point
self.num_sub_point = num_sub_point
self.num_neighbor = num_neighbor
self.diff_projector_list = nn.ModuleList()
self.agg_projector_list = nn.ModuleList()
self.pooler_list = nn.ModuleList()
for ii in range(len(num_sub_point)):
self.diff_projector_list.append(nn.Linear(self.input_dim + 2, self.input_dim + 2))
self.agg_projector_list.append(ConvReLULN1D(in_channels=2*(self.input_dim + 2),
out_channels=self.input_dim,
))
if pooler_mode == 'mean':
self.pooler_list.append(nn.AvgPool1d(kernel_size=num_neighbor[ii]))
elif pooler_mode == 'max':
self.pooler_list.append(nn.AdaptiveMaxPool1d(output_size=1))
else:
raise NotImplementedError(f'{pooler_mode} is not supported.')
self.flatten_projector = nn.Linear(self.input_dim * num_sub_point[-1], self.input_dim)
self.dim_projector = nn.Linear(self.input_dim, self.output_dim)
self.norm_init_weights()
# self.dtype = torch.float32
def norm_init_weights(self):
for m in self.modules():
if isinstance(m, nn.Conv2d):
normal_init(m, 0, 0.01)
def forward(self,
feature_map,
region_masks,
original_dtype,
return_dtype):
assert len(feature_map) == len(region_masks)
all_points = []
all_points_fea = []
all_points_img_ids = []
# Sample points and their features
for img_idx, (region_feature_map_i, region_masks_list_i) in enumerate(zip(feature_map, region_masks)):
if len(region_masks_list_i) != 0:
# (w, h)
ori_image_wh = torch.tensor([region_masks_list_i[0].shape[0], region_masks_list_i[0].shape[1]], device=region_masks_list_i[0].device)[None,]
# list of elements of shape [num_sample_point, 2]
# pdb.set_trace()
cur_non_zero_pos = [rand_sample_repeat((m.nonzero()/ori_image_wh), self.num_init_point) for m in region_masks_list_i]
# list -> [num_mask, num_sample_point, 2]
cur_non_zero_pos = torch.stack(cur_non_zero_pos)
# [HxW, C] -> [H, W, C] -> [C, H, W] -> [N, C, H, W]
h = w = int(math.sqrt(region_feature_map_i.shape[0]))
c = region_feature_map_i.shape[-1]
dup_region_feature_map_i = region_feature_map_i.reshape(h, w, c).permute(2, 0, 1)
dup_region_feature_map_i = dup_region_feature_map_i.unsqueeze(0).repeat(cur_non_zero_pos.shape[0], 1, 1, 1)
# [num_mask, C, H, W] x [num_mask, num_sample_point, 2] -> [num_mask, C, num_sample_point] -> [num_mask, num_sample_point, C]
# F.grid_sample doesn't support BF16. Need to transform into float32 then transform back.
dup_region_feature_map_i_ori_type = dup_region_feature_map_i.to(original_dtype)
region_feature_i = point_sample(dup_region_feature_map_i_ori_type,
cur_non_zero_pos.flip(dims=(2,)).type(original_dtype),
return_dtype,
align_corners=True,
)
# region_feature_i = region_feature_i.to(dup_region_feature_map_i.dtype)
region_feature_i = region_feature_i.transpose(-2, -1)
cur_img_ids = [img_idx] * len(cur_non_zero_pos)
# save to global list
all_points.append(cur_non_zero_pos)
all_points_fea.append(region_feature_i)
all_points_img_ids.extend(cur_img_ids)
# pdb.set_trace()
# No region found, return list of None.
if len(all_points) == 0:
return [None] * len(region_masks)
all_points = torch.cat(all_points, dim=0).to(return_dtype) # [B*num_mask, num_sample_point, 2]
all_points_fea = torch.cat(all_points_fea, dim=0) # [B*num_mask, num_sample_point, C]
all_points_img_ids = torch.tensor(all_points_img_ids, device=all_points_fea.device)
# pdb.set_trace()
assert all_points_fea.shape[:-1] == all_points.shape[:-1]
# Processing.
for stage_i in range(len(self.num_sub_point)):
cur_num_sub_point = self.num_sub_point[stage_i]
cur_num_neighbor = self.num_neighbor[stage_i]
all_points = all_points.contiguous() # xy [batch, points, xy]
fps_idx = farthest_point_sample(all_points, cur_num_sub_point).long()
new_points = index_points(all_points, fps_idx) # [B, npoint, 2]
new_points_fea = index_points(all_points_fea, fps_idx) # [B, npoint, d]
idx = knn_point(cur_num_neighbor, all_points, new_points)
grouped_points = index_points(all_points, idx) # [B, npoint, k, 2]
grouped_points_fea = index_points(all_points_fea, idx) # [B, npoint, k, d]
# pdb.set_trace()
local_points_fea = torch.cat([grouped_points_fea, grouped_points],dim=-1) # [B, npoint, k, d+2]
anchor_points_fea = torch.cat([new_points_fea, new_points],dim=-1).unsqueeze(-2)
diff_points_fea = local_points_fea-anchor_points_fea
diff_points_fea = self.diff_projector_list[stage_i](diff_points_fea)
gather_points_fea = torch.cat([diff_points_fea, anchor_points_fea.repeat(1, 1, cur_num_neighbor, 1)], dim=-1) # [B, npoint, k, 2(d+2)]
# pdb.set_trace()
b, n, s, d = gather_points_fea.size()
gather_points_fea = gather_points_fea.permute(0, 1, 3, 2) # [B, npoint, 2(d+2), k]
gather_points_fea = gather_points_fea.reshape(-1, d, s) # [B*npoint, 2(d+2), k]
gather_points_fea = self.agg_projector_list[stage_i](gather_points_fea) # [B*npoint, d, k]
# pdb.set_trace()
batch_size, new_dim, _ = gather_points_fea.size()
gather_points_fea = self.pooler_list[stage_i](gather_points_fea).view(batch_size, new_dim) # [B*npoint, d]
# gather_points_fea = F.adaptive_max_pool1d(gather_points_fea, 1).view(batch_size, -1) # [B*npoint, d]
# pdb.set_trace()
gather_points_fea = gather_points_fea.reshape(b, n, -1) # [B, npoint, d]
# pdb.set_trace()
all_points = new_points
all_points_fea = gather_points_fea
# pdb.set_trace()
x = all_points_fea.flatten(1, -1) # [B, npoint x d]
x = self.flatten_projector(x)
all_region_fea = self.dim_projector(x) # [B, d]
output_region_fea = []
for img_idx in range(len(region_masks)):
cur_mask = all_points_img_ids == img_idx
# pdb.set_trace()
if not cur_mask.any():
output_region_fea.append(None)
else:
output_region_fea.append(all_region_fea[cur_mask])
# pdb.set_trace()
return output_region_fea
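# Illustrative sketch (not part of the original file): one image whose CLIP feature map
# is a flattened 24x24 grid (e.g. ViT-L/14 at 336px) and two binary region masks at the
# original image resolution; the hidden sizes below are placeholders, not the trained
# config. Each mask is pooled into a single region embedding:
#
#   sampler = GeoRegionSampler(input_dim=1024, output_dim=4096, num_init_point=512,
#                              num_sub_point=[128, 32], num_neighbor=[24, 24])
#   feature_map = [torch.randn(24 * 24, 1024)]                  # per-image [HxW, C]
#   masks = [[torch.ones(336, 336), torch.ones(336, 336)]]      # per-image list of masks
#   region_fea = sampler(feature_map, masks,
#                        original_dtype=torch.float32, return_dtype=torch.float32)
#   # region_fea[0].shape == (2, 4096), one vector per mask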
class FERRETMetaModel:
def __init__(self, config):
super(FERRETMetaModel, self).__init__(config)
self.max_sample_point = 512
if hasattr(config, "mm_vision_tower"):
self.vision_tower = build_vision_tower(config, delay_load=True)
self.mm_projector = nn.Linear(config.mm_hidden_size, config.hidden_size)
if hasattr(config, "region_fea_adapter"):
self.region_fea_adapter = nn.Linear(config.mm_hidden_size, config.hidden_size)
if hasattr(config, "region_geo_sampler"):
# pdb.set_trace()
self.region_geo_sampler = GeoRegionSampler(input_dim=config.mm_hidden_size,
output_dim=config.hidden_size,
num_init_point=self.max_sample_point,
num_sub_point=[128, 32],
num_neighbor=[24, 24],
pooler_mode=config.sampler_pooler_mode
)
def get_vision_tower(self):
vision_tower = getattr(self, 'vision_tower', None)
if type(vision_tower) is list:
vision_tower = vision_tower[0]
return vision_tower
def initialize_vision_modules(self, model_args, fsdp=None, add_region_feature=False, region_geo_sampler=False, sampler_pooler_mode='mean'):
vision_tower = model_args.vision_tower
mm_vision_select_layer = model_args.mm_vision_select_layer
mm_vision_select_feature = model_args.mm_vision_select_feature
pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter
self.config.mm_vision_tower = vision_tower
vision_tower = build_vision_tower(model_args)
if fsdp is not None and len(fsdp) > 0:
self.vision_tower = [vision_tower]
else:
self.vision_tower = vision_tower
self.config.use_mm_proj = True
self.config.mm_hidden_size = vision_tower.hidden_size
self.config.mm_vision_select_layer = mm_vision_select_layer
self.config.mm_vision_select_feature = mm_vision_select_feature
if not hasattr(self, 'mm_projector'):
self.mm_projector = nn.Linear(self.config.mm_hidden_size, self.config.hidden_size)
if add_region_feature:
if region_geo_sampler:
self.config.region_geo_sampler = True
self.config.sampler_pooler_mode = sampler_pooler_mode
# pdb.set_trace()
if not hasattr(self, 'region_geo_sampler'):
self.region_geo_sampler = GeoRegionSampler(input_dim=self.config.mm_hidden_size,
output_dim=self.config.hidden_size,
num_init_point=self.max_sample_point,
num_sub_point=[128, 32],
num_neighbor=[24, 24],
pooler_mode=sampler_pooler_mode
)
else:
self.config.region_fea_adapter = True
if not hasattr(self, 'region_fea_adapter'):
self.region_fea_adapter = nn.Linear(self.config.mm_hidden_size, self.config.hidden_size)
if pretrain_mm_mlp_adapter is not None:
mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location='cpu')
def get_w(weights, keyword):
return {k.split(keyword + '.')[1]: v for k, v in weights.items() if keyword in k}
self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
class FERRETMetaForCausalLM(ABC):
@abstractmethod
def get_model(self):
pass
def get_vision_tower(self):
return self.get_model().get_vision_tower()
def encode_images(self, images, region_flag=False, region_geo_sampler=False):
image_features = self.get_model().get_vision_tower()(images)
projected_image_features = self.get_model().mm_projector(image_features)
if region_flag:
if region_geo_sampler:
new_region_feature_map = image_features
else:
new_region_feature_map = self.get_model().region_fea_adapter(image_features)
else:
new_region_feature_map = None
return image_features, projected_image_features, new_region_feature_map
def extract_region_feature(self, region_feature_map, region_masks, original_dtype, return_dtype):
all_region_features = []
assert len(region_feature_map) == len(region_masks)
for region_feature_map_i, region_masks_list_i in zip(region_feature_map, region_masks):
if len(region_masks_list_i) == 0:
all_region_features.append(None)
else:
# (w, h)
ori_image_wh = torch.tensor([region_masks_list_i[0].shape[0], region_masks_list_i[0].shape[1]], device=region_masks_list_i[0].device)[None,]
# list of elements of shape [num_sample_point, 2]
non_zero_pos = [rand_sample((m.nonzero()/ori_image_wh), self.get_model().max_sample_point) for m in region_masks_list_i]
# [num_mask, num_sample_point(padded), 2]
non_zero_pos = nn.utils.rnn.pad_sequence(non_zero_pos, padding_value=-1, batch_first=True)
non_zero_pos_mask = ~(non_zero_pos.sum(dim=-1) < 0)
# [HxW, C] -> [H, W, C] -> [C, H, W] -> [N, C, H, W]
h = w = int(math.sqrt(region_feature_map_i.shape[0]))
c = region_feature_map_i.shape[-1]
dup_region_feature_map_i = region_feature_map_i.reshape(h, w, c).permute(2, 0, 1)
dup_region_feature_map_i = dup_region_feature_map_i.unsqueeze(0).repeat(non_zero_pos.shape[0], 1, 1, 1)
# [num_mask, C, H, W] x [num_mask, num_sample_point(padded), 2] -> [num_mask, C, num_sample_point(padded)]
# F.grid_sample doesn't support BF16. Need to transform into float32 then transform back.
dup_region_feature_map_i_ori_type = dup_region_feature_map_i.to(original_dtype)
# pdb.set_trace()
region_feature_i = point_sample(dup_region_feature_map_i_ori_type,
non_zero_pos.flip(dims=(2,)).type(original_dtype),
return_dtype,
align_corners=True
)
region_feature_i = region_feature_i.to(dup_region_feature_map_i.dtype)
# [num_mask, C]
region_feature_i = torch.stack([x[m].mean(dim=0) for x, m in zip(region_feature_i.transpose(1,2), non_zero_pos_mask)]).nan_to_num()
all_region_features.append(region_feature_i)
return all_region_features
def prepare_inputs_labels_for_multimodal(
self, input_ids, attention_mask, past_key_values, labels, images, region_masks
):
if region_masks is not None:
region_flag = True
else:
region_flag = False
region_geo_sampler = region_flag and getattr(self.config, 'region_geo_sampler', False)
vision_tower = self.get_vision_tower()
if vision_tower is None or images is None or input_ids.shape[1] == 1:
if past_key_values is not None and vision_tower is not None and images is not None and input_ids.shape[1] == 1:
attention_mask = torch.ones((attention_mask.shape[0], past_key_values[-1][-1].shape[-2] + 1), dtype=attention_mask.dtype, device=attention_mask.device)
return input_ids, attention_mask, past_key_values, None, labels
if type(images) is list or images.ndim == 5:
assert region_flag == False
concat_images = torch.cat([image for image in images], dim=0)
raw_image_features, image_features, region_feature_map = self.encode_images(concat_images, region_flag, region_geo_sampler)
# image_features = self.encode_images(concat_images)
split_sizes = [image.shape[0] for image in images]
image_features = torch.split(image_features, split_sizes, dim=0)
image_features = [x.flatten(0, 1) for x in image_features]
else:
raw_image_features, image_features, region_feature_map = self.encode_images(images, region_flag, region_geo_sampler)
if region_flag:
if region_geo_sampler:
# pdb.set_trace()
region_features = self.get_model().region_geo_sampler(region_feature_map, region_masks,
original_dtype=raw_image_features.dtype,
return_dtype=image_features.dtype)
else:
region_features = self.extract_region_feature(region_feature_map, region_masks,
original_dtype=raw_image_features.dtype,
return_dtype=image_features.dtype)
assert len(region_features) == len(input_ids)
new_input_embeds = []
new_labels = [] if labels is not None else None
cur_image_idx = 0
for batch_idx, cur_input_ids in enumerate(input_ids):
if (cur_input_ids == IMAGE_TOKEN_INDEX).sum() == 0:
# multimodal LLM, but the current sample is not multimodal
cur_input_embeds = self.get_model().embed_tokens(cur_input_ids)
cur_input_embeds = cur_input_embeds + (0. * self.get_model().mm_projector(vision_tower.dummy_feature)).sum()
new_input_embeds.append(cur_input_embeds)
if labels is not None:
new_labels.append(labels[batch_idx])
cur_image_idx += 1
continue
image_token_indices = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0]
cur_new_input_embeds = []
if labels is not None:
cur_labels = labels[batch_idx]
cur_new_labels = []
assert cur_labels.shape == cur_input_ids.shape
while image_token_indices.numel() > 0:
cur_image_features = image_features[cur_image_idx]
image_token_start = image_token_indices[0]
if region_flag:
assert (cur_input_ids[:image_token_start] == self.config.im_region_fea_token).sum() == 0
# If not use start-end token, pt ckpt saved only has mm projector.
if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[:image_token_start-1]).detach())
cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[image_token_start-1:image_token_start]))
cur_new_input_embeds.append(cur_image_features)
cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[image_token_start+1:image_token_start+2]))
if labels is not None:
cur_new_labels.append(cur_labels[:image_token_start])
cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=labels.device, dtype=labels.dtype))
cur_new_labels.append(cur_labels[image_token_start:image_token_start+1])
cur_labels = cur_labels[image_token_start+2:]
else:
cur_new_input_embeds.append(self.get_model().embed_tokens(cur_input_ids[:image_token_start]))
cur_new_input_embeds.append(cur_image_features)
if labels is not None:
cur_new_labels.append(cur_labels[:image_token_start])
cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=labels.device, dtype=labels.dtype))
cur_labels = cur_labels[image_token_start+1:]
cur_image_idx += 1
if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
cur_input_ids = cur_input_ids[image_token_start+2:]
else:
cur_input_ids = cur_input_ids[image_token_start+1:]
image_token_indices = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0]
if cur_input_ids.numel() > 0:
if getattr(self.config, 'tune_mm_mlp_adapter', False) and getattr(self.config, 'mm_use_im_start_end', False):
text_input_embeds = self.get_model().embed_tokens(cur_input_ids).detach()
else:
text_input_embeds = self.get_model().embed_tokens(cur_input_ids)
if labels is not None:
cur_new_labels.append(cur_labels)
# Add region feature into text feature embeddings.
assert batch_idx+1 == cur_image_idx
if region_flag and region_features[batch_idx] is not None:
region_embs = torch.zeros_like(text_input_embeds)
region_replace_mask = (cur_input_ids == self.config.im_region_fea_token)
# pdb.set_trace()
region_embs[region_replace_mask] = region_features[batch_idx].to(text_input_embeds.dtype)
text_input_embeds = text_input_embeds * (~region_replace_mask).to(text_input_embeds.dtype)[:, None] + region_embs
# print('region_embs[..., 0].nonzero()', region_embs[..., 0].nonzero())
# raise NotImplementedError()
# pdb.set_trace()
else:
if hasattr(self.config, 'im_region_fea_token'):
assert (cur_input_ids == self.config.im_region_fea_token).sum() == 0
cur_new_input_embeds.append(text_input_embeds)
cur_new_input_embeds = [x.to(device=self.device) for x in cur_new_input_embeds]
cur_new_input_embeds = torch.cat(cur_new_input_embeds, dim=0)
new_input_embeds.append(cur_new_input_embeds)
if labels is not None:
cur_new_labels = torch.cat(cur_new_labels, dim=0)
new_labels.append(cur_new_labels)
if any(x.shape != new_input_embeds[0].shape for x in new_input_embeds):
max_len = max(x.shape[0] for x in new_input_embeds)
new_input_embeds_align = []
for cur_new_embed in new_input_embeds:
cur_new_embed = torch.cat((cur_new_embed, torch.zeros((max_len - cur_new_embed.shape[0], cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0)
new_input_embeds_align.append(cur_new_embed)
new_input_embeds = torch.stack(new_input_embeds_align, dim=0)
if labels is not None:
new_labels_align = []
_new_labels = new_labels
for cur_new_label in new_labels:
cur_new_label = torch.cat((cur_new_label, torch.full((max_len - cur_new_label.shape[0],), IGNORE_INDEX, dtype=cur_new_label.dtype, device=cur_new_label.device)), dim=0)
new_labels_align.append(cur_new_label)
new_labels = torch.stack(new_labels_align, dim=0)
if attention_mask is not None:
new_attention_mask = []
for cur_attention_mask, cur_new_labels, cur_new_labels_align in zip(attention_mask, _new_labels, new_labels):
new_attn_mask_pad_left = torch.full((cur_new_labels.shape[0] - labels.shape[1],), True, dtype=attention_mask.dtype, device=attention_mask.device)
new_attn_mask_pad_right = torch.full((cur_new_labels_align.shape[0] - cur_new_labels.shape[0],), False, dtype=attention_mask.dtype, device=attention_mask.device)
cur_new_attention_mask = torch.cat((new_attn_mask_pad_left, cur_attention_mask, new_attn_mask_pad_right), dim=0)
new_attention_mask.append(cur_new_attention_mask)
attention_mask = torch.stack(new_attention_mask, dim=0)
assert attention_mask.shape == new_labels.shape
else:
new_input_embeds = torch.stack(new_input_embeds, dim=0)
if labels is not None:
new_labels = torch.stack(new_labels, dim=0)
if attention_mask is not None:
new_attn_mask_pad_left = torch.full((attention_mask.shape[0], new_input_embeds.shape[1] - input_ids.shape[1]), True, dtype=attention_mask.dtype, device=attention_mask.device)
attention_mask = torch.cat((new_attn_mask_pad_left, attention_mask), dim=1)
assert attention_mask.shape == new_input_embeds.shape[:2]
return None, attention_mask, past_key_values, new_input_embeds, new_labels
def initialize_vision_tokenizer(self, model_args, tokenizer, add_region_feature=False):
if model_args.mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if add_region_feature:
num_region_fea_tokens = tokenizer.add_tokens([DEFAULT_REGION_FEA_TOKEN], special_tokens=True)
self.config.im_region_fea_token = tokenizer.convert_tokens_to_ids([DEFAULT_REGION_FEA_TOKEN])[0]
self.resize_token_embeddings(len(tokenizer))
if model_args.mm_use_im_start_end:
num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if add_region_feature:
num_new_tokens = num_new_tokens + num_region_fea_tokens
if num_new_tokens > 0:
input_embeddings = self.get_input_embeddings().weight.data
output_embeddings = self.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = True
for p in self.get_output_embeddings().parameters():
p.requires_grad = False
if model_args.pretrain_mm_mlp_adapter:
mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location='cpu')
embed_tokens_weight = mm_projector_weights['model.embed_tokens.weight']
if add_region_feature:
num_new_tokens = num_new_tokens - num_region_fea_tokens
assert num_new_tokens == 2
if input_embeddings.shape == embed_tokens_weight.shape:
input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
elif embed_tokens_weight.shape[0] == num_new_tokens:
input_embeddings[-num_new_tokens:] = embed_tokens_weight
else:
raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Numer of new tokens: {num_new_tokens}.")
elif model_args.mm_use_im_patch_token:
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = False
for p in self.get_output_embeddings().parameters():
p.requires_grad = False

View File

@ -0,0 +1,139 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from transformers import AutoConfig, AutoModelForCausalLM, \
LlamaConfig, LlamaModel, LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from ..ferret_arch import FERRETMetaModel, FERRETMetaForCausalLM
class FERRETConfig(LlamaConfig):
model_type = "ferret"
class FERRETLlamaModel(FERRETMetaModel, LlamaModel):
config_class = FERRETConfig
def __init__(self, config: LlamaConfig):
super(FERRETLlamaModel, self).__init__(config)
class FERRETLlamaForCausalLM(LlamaForCausalLM, FERRETMetaForCausalLM):
config_class = FERRETConfig
def __init__(self, config):
super(LlamaForCausalLM, self).__init__(config)
self.model = FERRETLlamaModel(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
region_masks: Optional[List[torch.Tensor]] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
input_ids, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, attention_mask, past_key_values, labels, images, region_masks=region_masks)
# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model/pipeline parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
):
if past_key_values:
input_ids = input_ids[:, -1:]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
"images": kwargs.get("images", None),
}
)
return model_inputs
AutoConfig.register("ferret", FERRETConfig)
AutoModelForCausalLM.register(FERRETConfig, FERRETLlamaForCausalLM)

View File

@ -0,0 +1,11 @@
import os
from .clip_encoder import CLIPVisionTower
def build_vision_tower(vision_tower_cfg, **kwargs):
vision_tower = getattr(vision_tower_cfg, 'mm_vision_tower', getattr(vision_tower_cfg, 'vision_tower', None))
is_absolute_path_exists = os.path.exists(vision_tower)
if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion"):
return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
raise ValueError(f'Unknown vision tower: {vision_tower}')

View File

@ -0,0 +1,123 @@
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig
# Added for customized Processor.
import math
import numpy as np
from typing import Dict
from transformers.image_utils import PILImageResampling, ChannelDimension
from transformers.image_processing_utils import get_size_dict
from transformers.image_transforms import (
get_resize_output_image_size,
resize,
)
from typing import List, Optional, Tuple, Union
class CLIPImageProcessor_GIT(CLIPImageProcessor):
def resize(
self,
image: np.ndarray,
size: Dict[str, int],
resample: PILImageResampling = PILImageResampling.BICUBIC,
data_format: Optional[Union[str, ChannelDimension]] = None,
**kwargs,
) -> np.ndarray:
"""
Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
resized to keep the input aspect ratio.
Args:
image (`np.ndarray`):
Image to resize.
size (`Dict[str, int]`):
Size of the output image.
resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
Resampling filter to use when resizing the image.
data_format (`str` or `ChannelDimension`, *optional*):
The channel dimension format of the image. If not provided, it will be the same as the input image.
"""
size = get_size_dict(size, default_to_square=True, height_width_order=True)
# Hack(haoxuan): Bypass the shortest_edge detection. We hope to get a {"height": size[0], "width": size[1]}, where w=h.
# if "shortest_edge" not in size:
# raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}")
# output_size = get_resize_output_image_size(image, size=size["shortest_edge"], default_to_square=True)
output_size = get_resize_output_image_size(image, size=(size["height"], size["width"]), default_to_square=True)
return resize(image, size=output_size, resample=resample, data_format=data_format, **kwargs)
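# Illustrative sketch (not part of the original file): with the hack above, a plain
# [height, width] size list forces a square resize instead of the default shortest-edge
# behaviour, matching how the eval scripts in this commit call preprocess(...):
#
#   from PIL import Image
#   proc = CLIPImageProcessor_GIT.from_pretrained("openai/clip-vit-large-patch14-336")  # assumed checkpoint
#   pixels = proc.preprocess(Image.new("RGB", (640, 480)), return_tensors='pt', do_resize=True,
#                            do_center_crop=False, size=[336, 336])['pixel_values'][0]
#   # pixels.shape == (3, 336, 336)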
class CLIPVisionTower(nn.Module):
def __init__(self, vision_tower, args, delay_load=False):
super().__init__()
self.is_loaded = False
self.vision_tower_name = vision_tower
self.select_layer = args.mm_vision_select_layer
self.select_feature = getattr(args, 'mm_vision_select_feature', 'patch')
if not delay_load:
self.load_model()
else:
self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
def load_model(self, vision_tower_path=None):
self.image_processor = CLIPImageProcessor_GIT.from_pretrained(self.vision_tower_name)
if vision_tower_path is not None:
self.vision_tower, loading_info = CLIPVisionModel.from_pretrained(vision_tower_path, output_loading_info=True)
print('loading_info:', loading_info)
else:
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def feature_select(self, image_forward_outs):
image_features = image_forward_outs.hidden_states[self.select_layer]
if self.select_feature == 'patch':
image_features = image_features[:, 1:]
elif self.select_feature == 'cls_patch':
image_features = image_features
else:
raise ValueError(f'Unexpected select feature: {self.select_feature}')
return image_features
@torch.no_grad()
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
image_feature = self.feature_select(image_forward_out).to(image.dtype)
image_features.append(image_feature)
else:
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = self.feature_select(image_forward_outs).to(images.dtype)
return image_features
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
return self.vision_tower.dtype
@property
def device(self):
return self.vision_tower.device
@property
def config(self):
if self.is_loaded:
return self.vision_tower.config
else:
return self.cfg_only
@property
def hidden_size(self):
return self.config.hidden_size
@property
def num_patches(self):
return (self.config.image_size // self.config.patch_size) ** 2

20
ferret/model/utils.py Normal file
View File

@ -0,0 +1,20 @@
from transformers import AutoConfig
def auto_upgrade(config):
cfg = AutoConfig.from_pretrained(config)
if 'llava' in config and 'llava' not in cfg.model_type:
assert cfg.model_type == 'llama'
print("You are using newer LLaVA code base, while the checkpoint of v0 is from older code base.")
print("You must upgrade the checkpoint to the new code base (this can be done automatically).")
confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]")
if confirm.lower() in ["y", "yes"]:
print("Upgrading checkpoint...")
assert len(cfg.architectures) == 1
setattr(cfg.__class__, "model_type", "llava")
cfg.architectures[0] = 'FERRETLlamaForCausalLM'
cfg.save_pretrained(config)
print("Checkpoint upgraded.")
else:
print("Checkpoint upgrade aborted.")
exit(1)
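A hedged usage sketch of the helper above; the checkpoint directory is a hypothetical local path, and the call only rewrites `config.json` in place after interactive confirmation:

```python
from ferret.model.utils import auto_upgrade

ckpt_dir = "./checkpoints/llava-v0-13b"   # hypothetical v0 checkpoint directory ('llava' in the path)
auto_upgrade(ckpt_dir)                    # prompts [Y/N] before updating model_type/architectures
```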

0
ferret/serve/__init__.py Normal file
View File

298
ferret/serve/controller.py Normal file
View File

@ -0,0 +1,298 @@
"""
A controller manages distributed workers.
It sends worker addresses to clients.
"""
import argparse
import asyncio
import dataclasses
from enum import Enum, auto
import json
import logging
import time
from typing import List, Union
import threading
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import numpy as np
import requests
import uvicorn
from ferret.constants import CONTROLLER_HEART_BEAT_EXPIRATION
from ferret.utils import build_logger, server_error_msg
logger = build_logger("controller", "controller.log")
class DispatchMethod(Enum):
LOTTERY = auto()
SHORTEST_QUEUE = auto()
@classmethod
def from_str(cls, name):
if name == "lottery":
return cls.LOTTERY
elif name == "shortest_queue":
return cls.SHORTEST_QUEUE
else:
raise ValueError(f"Invalid dispatch method")
@dataclasses.dataclass
class WorkerInfo:
model_names: List[str]
speed: int
queue_length: int
check_heart_beat: bool
last_heart_beat: float
def heart_beat_controller(controller):
while True:
time.sleep(CONTROLLER_HEART_BEAT_EXPIRATION)
controller.remove_stable_workers_by_expiration()
class Controller:
def __init__(self, dispatch_method: str):
# Dict[str -> WorkerInfo]
self.worker_info = {}
self.dispatch_method = DispatchMethod.from_str(dispatch_method)
self.heart_beat_thread = threading.Thread(
target=heart_beat_controller, args=(self,))
self.heart_beat_thread.start()
logger.info("Init controller")
def register_worker(self, worker_name: str, check_heart_beat: bool,
worker_status: dict):
if worker_name not in self.worker_info:
logger.info(f"Register a new worker: {worker_name}")
else:
logger.info(f"Register an existing worker: {worker_name}")
if not worker_status:
worker_status = self.get_worker_status(worker_name)
if not worker_status:
return False
self.worker_info[worker_name] = WorkerInfo(
worker_status["model_names"], worker_status["speed"], worker_status["queue_length"],
check_heart_beat, time.time())
logger.info(f"Register done: {worker_name}, {worker_status}")
return True
def get_worker_status(self, worker_name: str):
try:
r = requests.post(worker_name + "/worker_get_status", timeout=5)
except requests.exceptions.RequestException as e:
logger.error(f"Get status fails: {worker_name}, {e}")
return None
if r.status_code != 200:
logger.error(f"Get status fails: {worker_name}, {r}")
return None
return r.json()
def remove_worker(self, worker_name: str):
del self.worker_info[worker_name]
def refresh_all_workers(self):
old_info = dict(self.worker_info)
self.worker_info = {}
for w_name, w_info in old_info.items():
if not self.register_worker(w_name, w_info.check_heart_beat, None):
logger.info(f"Remove stale worker: {w_name}")
def list_models(self):
model_names = set()
for w_name, w_info in self.worker_info.items():
model_names.update(w_info.model_names)
return list(model_names)
def get_worker_address(self, model_name: str):
if self.dispatch_method == DispatchMethod.LOTTERY:
worker_names = []
worker_speeds = []
for w_name, w_info in self.worker_info.items():
if model_name in w_info.model_names:
worker_names.append(w_name)
worker_speeds.append(w_info.speed)
worker_speeds = np.array(worker_speeds, dtype=np.float32)
norm = np.sum(worker_speeds)
if norm < 1e-4:
return ""
worker_speeds = worker_speeds / norm
if True: # Directly return address
pt = np.random.choice(np.arange(len(worker_names)),
p=worker_speeds)
worker_name = worker_names[pt]
return worker_name
# Check status before returning
while True:
pt = np.random.choice(np.arange(len(worker_names)),
p=worker_speeds)
worker_name = worker_names[pt]
if self.get_worker_status(worker_name):
break
else:
self.remove_worker(worker_name)
worker_speeds[pt] = 0
norm = np.sum(worker_speeds)
if norm < 1e-4:
return ""
worker_speeds = worker_speeds / norm
continue
return worker_name
elif self.dispatch_method == DispatchMethod.SHORTEST_QUEUE:
worker_names = []
worker_qlen = []
for w_name, w_info in self.worker_info.items():
if model_name in w_info.model_names:
worker_names.append(w_name)
worker_qlen.append(w_info.queue_length / w_info.speed)
if len(worker_names) == 0:
return ""
min_index = np.argmin(worker_qlen)
w_name = worker_names[min_index]
self.worker_info[w_name].queue_length += 1
logger.info(f"names: {worker_names}, queue_lens: {worker_qlen}, ret: {w_name}")
return w_name
else:
raise ValueError(f"Invalid dispatch method: {self.dispatch_method}")
def receive_heart_beat(self, worker_name: str, queue_length: int):
if worker_name not in self.worker_info:
logger.info(f"Receive unknown heart beat. {worker_name}")
return False
self.worker_info[worker_name].queue_length = queue_length
self.worker_info[worker_name].last_heart_beat = time.time()
logger.info(f"Receive heart beat. {worker_name}")
return True
def remove_stable_workers_by_expiration(self):
expire = time.time() - CONTROLLER_HEART_BEAT_EXPIRATION
to_delete = []
for worker_name, w_info in self.worker_info.items():
if w_info.check_heart_beat and w_info.last_heart_beat < expire:
to_delete.append(worker_name)
for worker_name in to_delete:
self.remove_worker(worker_name)
def worker_api_generate_stream(self, params):
worker_addr = self.get_worker_address(params["model"])
if not worker_addr:
logger.info(f"no worker: {params['model']}")
ret = {
"text": server_error_msg,
"error_code": 2,
}
yield json.dumps(ret).encode() + b"\0"
try:
response = requests.post(worker_addr + "/worker_generate_stream",
json=params, stream=True, timeout=5)
for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
if chunk:
yield chunk + b"\0"
except requests.exceptions.RequestException as e:
logger.info(f"worker timeout: {worker_addr}")
ret = {
"text": server_error_msg,
"error_code": 3,
}
yield json.dumps(ret).encode() + b"\0"
# Let the controller act as a worker to achieve hierarchical
# management. This can be used to connect isolated sub networks.
def worker_api_get_status(self):
model_names = set()
speed = 0
queue_length = 0
for w_name in self.worker_info:
worker_status = self.get_worker_status(w_name)
if worker_status is not None:
model_names.update(worker_status["model_names"])
speed += worker_status["speed"]
queue_length += worker_status["queue_length"]
return {
"model_names": list(model_names),
"speed": speed,
"queue_length": queue_length,
}
app = FastAPI()
@app.post("/register_worker")
async def register_worker(request: Request):
data = await request.json()
controller.register_worker(
data["worker_name"], data["check_heart_beat"],
data.get("worker_status", None))
@app.post("/refresh_all_workers")
async def refresh_all_workers():
models = controller.refresh_all_workers()
@app.post("/list_models")
async def list_models():
models = controller.list_models()
return {"models": models}
@app.post("/get_worker_address")
async def get_worker_address(request: Request):
data = await request.json()
addr = controller.get_worker_address(data["model"])
return {"address": addr}
@app.post("/receive_heart_beat")
async def receive_heart_beat(request: Request):
data = await request.json()
exist = controller.receive_heart_beat(
data["worker_name"], data["queue_length"])
return {"exist": exist}
@app.post("/worker_generate_stream")
async def worker_api_generate_stream(request: Request):
params = await request.json()
generator = controller.worker_api_generate_stream(params)
return StreamingResponse(generator)
@app.post("/worker_get_status")
async def worker_api_get_status(request: Request):
return controller.worker_api_get_status()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=21001)
parser.add_argument("--dispatch-method", type=str, choices=[
"lottery", "shortest_queue"], default="shortest_queue")
args = parser.parse_args()
logger.info(f"args: {args}")
controller = Controller(args.dispatch_method)
uvicorn.run(app, host=args.host, port=args.port, log_level="info")
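The controller above exposes a small HTTP API for worker registration and dispatch. A minimal client-side sketch against the endpoints defined in this file; the URLs, model name, and worker status values are illustrative assumptions and require a running controller:

```python
import requests

controller = "http://localhost:21001"

# Register a worker manually (a worker normally does this itself on startup).
requests.post(controller + "/register_worker", json={
    "worker_name": "http://localhost:21002",
    "check_heart_beat": True,
    "worker_status": {"model_names": ["ferret-13b"], "speed": 1, "queue_length": 0},
})

# List the models known to the controller, then ask which worker should serve one of them.
models = requests.post(controller + "/list_models").json()["models"]
address = requests.post(controller + "/get_worker_address", json={"model": models[0]}).json()["address"]
print(models, address)
```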

View File

@ -0,0 +1 @@
06a92cd7-c698-4e59-b980-58e4bc162946

View File

@ -0,0 +1,73 @@
code_highlight_css = (
"""
#chatbot .hll { background-color: #ffffcc }
#chatbot .c { color: #408080; font-style: italic }
#chatbot .err { border: 1px solid #FF0000 }
#chatbot .k { color: #008000; font-weight: bold }
#chatbot .o { color: #666666 }
#chatbot .ch { color: #408080; font-style: italic }
#chatbot .cm { color: #408080; font-style: italic }
#chatbot .cp { color: #BC7A00 }
#chatbot .cpf { color: #408080; font-style: italic }
#chatbot .c1 { color: #408080; font-style: italic }
#chatbot .cs { color: #408080; font-style: italic }
#chatbot .gd { color: #A00000 }
#chatbot .ge { font-style: italic }
#chatbot .gr { color: #FF0000 }
#chatbot .gh { color: #000080; font-weight: bold }
#chatbot .gi { color: #00A000 }
#chatbot .go { color: #888888 }
#chatbot .gp { color: #000080; font-weight: bold }
#chatbot .gs { font-weight: bold }
#chatbot .gu { color: #800080; font-weight: bold }
#chatbot .gt { color: #0044DD }
#chatbot .kc { color: #008000; font-weight: bold }
#chatbot .kd { color: #008000; font-weight: bold }
#chatbot .kn { color: #008000; font-weight: bold }
#chatbot .kp { color: #008000 }
#chatbot .kr { color: #008000; font-weight: bold }
#chatbot .kt { color: #B00040 }
#chatbot .m { color: #666666 }
#chatbot .s { color: #BA2121 }
#chatbot .na { color: #7D9029 }
#chatbot .nb { color: #008000 }
#chatbot .nc { color: #0000FF; font-weight: bold }
#chatbot .no { color: #880000 }
#chatbot .nd { color: #AA22FF }
#chatbot .ni { color: #999999; font-weight: bold }
#chatbot .ne { color: #D2413A; font-weight: bold }
#chatbot .nf { color: #0000FF }
#chatbot .nl { color: #A0A000 }
#chatbot .nn { color: #0000FF; font-weight: bold }
#chatbot .nt { color: #008000; font-weight: bold }
#chatbot .nv { color: #19177C }
#chatbot .ow { color: #AA22FF; font-weight: bold }
#chatbot .w { color: #bbbbbb }
#chatbot .mb { color: #666666 }
#chatbot .mf { color: #666666 }
#chatbot .mh { color: #666666 }
#chatbot .mi { color: #666666 }
#chatbot .mo { color: #666666 }
#chatbot .sa { color: #BA2121 }
#chatbot .sb { color: #BA2121 }
#chatbot .sc { color: #BA2121 }
#chatbot .dl { color: #BA2121 }
#chatbot .sd { color: #BA2121; font-style: italic }
#chatbot .s2 { color: #BA2121 }
#chatbot .se { color: #BB6622; font-weight: bold }
#chatbot .sh { color: #BA2121 }
#chatbot .si { color: #BB6688; font-weight: bold }
#chatbot .sx { color: #008000 }
#chatbot .sr { color: #BB6688 }
#chatbot .s1 { color: #BA2121 }
#chatbot .ss { color: #19177C }
#chatbot .bp { color: #008000 }
#chatbot .fm { color: #0000FF }
#chatbot .vc { color: #19177C }
#chatbot .vg { color: #19177C }
#chatbot .vi { color: #19177C }
#chatbot .vm { color: #19177C }
#chatbot .il { color: #666666 }
""")
#.highlight { background: #f8f8f8; }

View File

@ -0,0 +1,714 @@
'''
Usage:
python -m ferret.serve.gradio_web_server --controller http://localhost:10000 --add_region_feature
'''
import argparse
import datetime
import json
import os
import time
import gradio as gr
import requests
from ferret.conversation import (default_conversation, conv_templates,
SeparatorStyle)
from ferret.constants import LOGDIR
from ferret.utils import (build_logger, server_error_msg,
violates_moderation, moderation_msg)
import hashlib
# Added
import re
from copy import deepcopy
from PIL import ImageDraw, ImageFont
from gradio import processing_utils
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import binary_dilation, binary_erosion
import pdb
from ferret.serve.gradio_css import code_highlight_css
DEFAULT_REGION_REFER_TOKEN = "[region]"
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
logger = build_logger("gradio_web_server", "gradio_web_server.log")
headers = {"User-Agent": "FERRET Client"}
no_change_btn = gr.Button.update()
enable_btn = gr.Button.update(interactive=True)
disable_btn = gr.Button.update(interactive=False)
priority = {
"vicuna-13b": "aaaaaaa",
"koala-13b": "aaaaaab",
}
VOCAB_IMAGE_W = 1000 # 224
VOCAB_IMAGE_H = 1000 # 224
def generate_mask_for_feature(coor, raw_w, raw_h, mask=None):
if mask is not None:
assert mask.shape[0] == raw_w and mask.shape[1] == raw_h
coor_mask = torch.zeros((raw_w, raw_h))
# Assume it samples a point.
if len(coor) == 2:
# Define window size
span = 5
# Make sure the window does not exceed array bounds
x_min = max(0, coor[0] - span)
x_max = min(raw_w, coor[0] + span + 1)
y_min = max(0, coor[1] - span)
y_max = min(raw_h, coor[1] + span + 1)
coor_mask[int(x_min):int(x_max), int(y_min):int(y_max)] = 1
assert (coor_mask==1).any(), f"coor: {coor}, raw_w: {raw_w}, raw_h: {raw_h}"
elif len(coor) == 4:
# Box input or Sketch input.
coor_mask = torch.zeros((raw_w, raw_h))
coor_mask[coor[0]:coor[2]+1, coor[1]:coor[3]+1] = 1
if mask is not None:
coor_mask = coor_mask * mask
# coor_mask = torch.from_numpy(coor_mask)
# pdb.set_trace()
assert len(coor_mask.nonzero()) != 0
return coor_mask.tolist()
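# Hedged example of the point branch above: a click at (x=10, y=20) on a 100x100 canvas
# yields an 11x11 window of ones centred on the point (span of 5 pixels on each side).
# m = generate_mask_for_feature([10, 20], raw_w=100, raw_h=100)
# sum(sum(row) for row in m)  # -> 121.0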
def draw_box(coor, region_mask, region_ph, img, input_mode):
colors = ["red"]
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("./ferret/serve/dejavu/DejaVuSans.ttf", size=18)
if input_mode == 'Box':
draw.rectangle([coor[0], coor[1], coor[2], coor[3]], outline=colors[0], width=4)
draw.rectangle([coor[0], coor[3] - int(font.size * 1.2), coor[0] + int((len(region_ph) + 0.8) * font.size * 0.6), coor[3]], outline=colors[0], fill=colors[0], width=4)
draw.text([coor[0] + int(font.size * 0.2), coor[3] - int(font.size*1.2)], region_ph, font=font, fill=(255,255,255))
elif input_mode == 'Point':
r = 8
leftUpPoint = (coor[0]-r, coor[1]-r)
rightDownPoint = (coor[0]+r, coor[1]+r)
twoPointList = [leftUpPoint, rightDownPoint]
draw.ellipse(twoPointList, outline=colors[0], width=4)
draw.rectangle([coor[0], coor[1], coor[0] + int((len(region_ph) + 0.8) * font.size * 0.6), coor[1] + int(font.size * 1.2)], outline=colors[0], fill=colors[0], width=4)
draw.text([coor[0] + int(font.size * 0.2), coor[1]], region_ph, font=font, fill=(255,255,255))
elif input_mode == 'Sketch':
draw.rectangle([coor[0], coor[3] - int(font.size * 1.2), coor[0] + int((len(region_ph) + 0.8) * font.size * 0.6), coor[3]], outline=colors[0], fill=colors[0], width=4)
draw.text([coor[0] + int(font.size * 0.2), coor[3] - int(font.size*1.2)], region_ph, font=font, fill=(255,255,255))
# Use morphological operations to find the boundary
mask = np.array(region_mask)
dilated = binary_dilation(mask, structure=np.ones((3,3)))
eroded = binary_erosion(mask, structure=np.ones((3,3)))
boundary = dilated ^ eroded # XOR operation to find the difference between dilated and eroded mask
# Loop over the boundary and paint the corresponding pixels
for i in range(boundary.shape[0]):
for j in range(boundary.shape[1]):
if boundary[i, j]:
# This is a pixel on the boundary, paint it red
draw.point((i, j), fill=colors[0])
else:
raise NotImplementedError(f'Input mode of {input_mode} is not implemented.')
return img
def get_conv_log_filename():
t = datetime.datetime.now()
name = os.path.join(LOGDIR, f"{t.year}-{t.month:02d}-{t.day:02d}-conv.json")
return name
def get_model_list():
ret = requests.post(args.controller_url + "/refresh_all_workers")
assert ret.status_code == 200
ret = requests.post(args.controller_url + "/list_models")
models = ret.json()["models"]
models.sort(key=lambda x: priority.get(x, x))
logger.info(f"Models: {models}")
return models
get_window_url_params = """
function() {
const params = new URLSearchParams(window.location.search);
url_params = Object.fromEntries(params);
console.log(url_params);
return url_params;
}
"""
def load_demo(url_params, request: gr.Request):
logger.info(f"load_demo. ip: {request.client.host}. params: {url_params}")
dropdown_update = gr.Dropdown.update(visible=True)
if "model" in url_params:
model = url_params["model"]
if model in models:
dropdown_update = gr.Dropdown.update(
value=model, visible=True)
state = default_conversation.copy()
return (state,
dropdown_update,
gr.Chatbot.update(visible=True),
gr.Textbox.update(visible=True),
gr.Button.update(visible=True),
gr.Row.update(visible=True),
gr.Accordion.update(visible=True))
def load_demo_refresh_model_list(request: gr.Request):
logger.info(f"load_demo. ip: {request.client.host}")
models = get_model_list()
state = default_conversation.copy()
return (state, gr.Dropdown.update(
choices=models,
value=models[0] if len(models) > 0 else ""),
gr.Chatbot.update(visible=True),
gr.Textbox.update(visible=True),
gr.Button.update(visible=True),
gr.Row.update(visible=True),
gr.Accordion.update(visible=True))
def vote_last_response(state, vote_type, model_selector, request: gr.Request):
with open(get_conv_log_filename(), "a") as fout:
data = {
"tstamp": round(time.time(), 4),
"type": vote_type,
"model": model_selector,
"state": state.dict(),
"ip": request.client.host,
}
fout.write(json.dumps(data) + "\n")
def upvote_last_response(state, model_selector, request: gr.Request):
logger.info(f"upvote. ip: {request.client.host}")
vote_last_response(state, "upvote", model_selector, request)
return ("",) + (disable_btn,) * 3
def downvote_last_response(state, model_selector, request: gr.Request):
logger.info(f"downvote. ip: {request.client.host}")
vote_last_response(state, "downvote", model_selector, request)
return ("",) + (disable_btn,) * 3
def flag_last_response(state, model_selector, request: gr.Request):
logger.info(f"flag. ip: {request.client.host}")
vote_last_response(state, "flag", model_selector, request)
return ("",) + (disable_btn,) * 3
def regenerate(state, image_process_mode, request: gr.Request):
logger.info(f"regenerate. ip: {request.client.host}")
state.messages[-1][-1] = None
prev_human_msg = state.messages[-2]
if type(prev_human_msg[1]) in (tuple, list):
prev_human_msg[1] = (*prev_human_msg[1][:2], image_process_mode)
state.skip_next = False
return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 5
def clear_history(request: gr.Request):
logger.info(f"clear_history. ip: {request.client.host}")
state = default_conversation.copy()
return (state, state.to_gradio_chatbot(), "", None, None) + (disable_btn,) * 5 + \
(None, {'region_placeholder_tokens':[],'region_coordinates':[],'region_masks':[],'masks':[]}, [], None)
def resize_bbox(box, image_w=None, image_h=None, default_wh=VOCAB_IMAGE_W):
ratio_w = image_w * 1.0 / default_wh
ratio_h = image_h * 1.0 / default_wh
new_box = [int(box[0] * ratio_w), int(box[1] * ratio_h), \
int(box[2] * ratio_w), int(box[3] * ratio_h)]
return new_box
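# Hedged example: a box predicted in the 1000x1000 vocabulary space, mapped back to a
# 640x480 image for visualization (values are illustrative).
# resize_bbox([100, 200, 500, 800], image_w=640, image_h=480)  # -> [64, 96, 320, 384]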
def show_location(sketch_pad, chatbot):
image = sketch_pad['image']
img_w, img_h = image.size
new_bboxes = []
old_bboxes = []
# chatbot[0] is image.
text = chatbot[1:]
for round_i in text:
human_input = round_i[0]
model_output = round_i[1]
# TODO: Difference: vocab representation.
# pattern = r'\[x\d*=(\d+(?:\.\d+)?), y\d*=(\d+(?:\.\d+)?), x\d*=(\d+(?:\.\d+)?), y\d*=(\d+(?:\.\d+)?)\]'
pattern = r'\[(\d+(?:\.\d+)?), (\d+(?:\.\d+)?), (\d+(?:\.\d+)?), (\d+(?:\.\d+)?)\]'
matches = re.findall(pattern, model_output)
for match in matches:
x1, y1, x2, y2 = map(int, match)
new_box = resize_bbox([x1, y1, x2, y2], img_w, img_h)
new_bboxes.append(new_box)
old_bboxes.append([x1, y1, x2, y2])
set_old_bboxes = sorted(set(map(tuple, old_bboxes)), key=list(map(tuple, old_bboxes)).index)
list_old_bboxes = list(map(list, set_old_bboxes))
set_bboxes = sorted(set(map(tuple, new_bboxes)), key=list(map(tuple, new_bboxes)).index)
list_bboxes = list(map(list, set_bboxes))
output_image = deepcopy(image)
draw = ImageDraw.Draw(output_image)
font = ImageFont.truetype("./ferret/serve/dejavu/DejaVuSans.ttf", 28)
for i in range(len(list_bboxes)):
x1, y1, x2, y2 = list_old_bboxes[i]
x1_new, y1_new, x2_new, y2_new = list_bboxes[i]
obj_string = '[obj{}]'.format(i)
for round_i in text:
model_output = round_i[1]
model_output = model_output.replace('[{}, {}, {}, {}]'.format(x1, y1, x2, y2), obj_string)
round_i[1] = model_output
draw.rectangle([(x1_new, y1_new), (x2_new, y2_new)], outline="red", width=3)
draw.text((x1_new+2, y1_new+5), obj_string[1:-1], fill="red", font=font)
return (output_image, [chatbot[0]] + text, disable_btn)
def add_text(state, text, image_process_mode, original_image, sketch_pad, request: gr.Request):
image = sketch_pad['image']
logger.info(f"add_text. ip: {request.client.host}. len: {len(text)}")
if len(text) <= 0 and image is None:
state.skip_next = True
return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 5
if args.moderate:
flagged = violates_moderation(text)
if flagged:
state.skip_next = True
return (state, state.to_gradio_chatbot(), moderation_msg, None) + (
no_change_btn,) * 5
text = text[:1536] # Hard cut-off
if original_image is None:
assert image is not None
original_image = image.copy()
print('No location, copy original image in add_text')
if image is not None:
if state.first_round:
text = text[:1200] # Hard cut-off for images
if '<image>' not in text:
# text = '<Image><image></Image>' + text
text = text + '\n<image>'
text = (text, original_image, image_process_mode)
if len(state.get_images(return_pil=True)) > 0:
new_state = default_conversation.copy()
new_state.first_round = False
state=new_state
print('First round add image finished.')
state.append_message(state.roles[0], text)
state.append_message(state.roles[1], None)
state.skip_next = False
return (state, state.to_gradio_chatbot(), "", original_image) + (disable_btn,) * 5
def post_process_code(code):
sep = "\n```"
if sep in code:
blocks = code.split(sep)
if len(blocks) % 2 == 1:
for i in range(1, len(blocks), 2):
blocks[i] = blocks[i].replace("\\_", "_")
code = sep.join(blocks)
return code
def format_region_prompt(prompt, refer_input_state):
for region_ph_index, region_ph_i in enumerate(refer_input_state['region_placeholder_tokens']):
prompt = prompt.replace(region_ph_i, '{} {}'.format(refer_input_state['region_coordinates'][region_ph_index], DEFAULT_REGION_FEA_TOKEN))
return prompt
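# Hedged example of the substitution above (values are illustrative):
# refer_input_state = {'region_placeholder_tokens': ['[region0]'],
#                      'region_coordinates': ['[100, 200, 300, 400]'],
#                      'region_masks': [...], 'masks': [...]}
# format_region_prompt("What is in [region0]?", refer_input_state)
# -> "What is in [100, 200, 300, 400] <region_fea>?"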
def http_bot(state, model_selector, temperature, top_p, max_new_tokens, refer_input_state, request: gr.Request):
# def http_bot(state, model_selector, temperature, top_p, max_new_tokens, request: gr.Request):
logger.info(f"http_bot. ip: {request.client.host}")
start_tstamp = time.time()
model_name = model_selector
if state.skip_next:
# This generate call is skipped due to invalid inputs
yield (state, state.to_gradio_chatbot()) + (no_change_btn,) * 5
return
if len(state.messages) == state.offset + 2:
# First round of conversation
template_name = 'ferret_v1'
new_state = conv_templates[template_name].copy()
new_state.append_message(new_state.roles[0], state.messages[-2][1])
new_state.append_message(new_state.roles[1], None)
state = new_state
state.first_round = False
# Query worker address
controller_url = args.controller_url
ret = requests.post(controller_url + "/get_worker_address",
json={"model": model_name})
worker_addr = ret.json()["address"]
logger.info(f"model_name: {model_name}, worker_addr: {worker_addr}")
# No available worker
if worker_addr == "":
state.messages[-1][-1] = server_error_msg
yield (state, state.to_gradio_chatbot(), disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
return
# Construct prompt
prompt = state.get_prompt()
if args.add_region_feature:
prompt = format_region_prompt(prompt, refer_input_state)
all_images = state.get_images(return_pil=True)
all_image_hash = [hashlib.md5(image.tobytes()).hexdigest() for image in all_images]
for image, hash in zip(all_images, all_image_hash):
t = datetime.datetime.now()
filename = os.path.join(LOGDIR, "serve_images", f"{t.year}-{t.month:02d}-{t.day:02d}", f"{hash}.jpg")
if not os.path.isfile(filename):
os.makedirs(os.path.dirname(filename), exist_ok=True)
image.save(filename)
# Make requests
pload = {
"model": model_name,
"prompt": prompt,
"temperature": float(temperature),
"top_p": float(top_p),
"max_new_tokens": min(int(max_new_tokens), 1536),
"stop": state.sep if state.sep_style in [SeparatorStyle.SINGLE, SeparatorStyle.MPT] else state.sep2,
"images": f'List of {len(state.get_images())} images: {all_image_hash}',
}
logger.info(f"==== request ====\n{pload}")
if args.add_region_feature:
pload['region_masks'] = refer_input_state['region_masks']
logger.info(f"==== add region_masks to request ====\n")
pload['images'] = state.get_images()
print(f'Input Prompt: {prompt}')
state.messages[-1][-1] = ""
yield (state, state.to_gradio_chatbot()) + (disable_btn,) * 5
try:
# Stream output
response = requests.post(worker_addr + "/worker_generate_stream",
headers=headers, json=pload, stream=True, timeout=10)
for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
if chunk:
data = json.loads(chunk.decode())
if data["error_code"] == 0:
output = data["text"][len(prompt):].strip()
output = post_process_code(output)
state.messages[-1][-1] = output + "▌"
yield (state, state.to_gradio_chatbot()) + (disable_btn,) * 5
else:
output = data["text"] + f" (error_code: {data['error_code']})"
state.messages[-1][-1] = output
yield (state, state.to_gradio_chatbot()) + (disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
return
time.sleep(0.03)
except requests.exceptions.RequestException as e:
state.messages[-1][-1] = server_error_msg
yield (state, state.to_gradio_chatbot()) + (disable_btn, disable_btn, disable_btn, enable_btn, enable_btn)
return
state.messages[-1][-1] = state.messages[-1][-1][:-1]
yield (state, state.to_gradio_chatbot()) + (enable_btn,) * 5
finish_tstamp = time.time()
logger.info(f"{output}")
with open(get_conv_log_filename(), "a") as fout:
data = {
"tstamp": round(finish_tstamp, 4),
"type": "chat",
"model": model_name,
"start": round(start_tstamp, 4),
"finish": round(start_tstamp, 4),
"state": state.dict(),
"images": all_image_hash,
"ip": request.client.host,
}
fout.write(json.dumps(data) + "\n")
title_markdown = ("""
# 🦦 Ferret: Refer and Ground Anything Anywhere at Any Granularity
[[Code](https://github.com/apple/ml-ferret)] [[Paper](https://arxiv.org/abs/2310.07704)]
""")
tos_markdown = ("""
### Terms of use
By using this service, users are required to agree to the following terms: The service is a research preview intended for non-commercial use only. It only provides limited safety measures and may generate offensive content. It must not be used for any illegal, harmful, violent, racist, or sexual purposes. The service may collect user dialogue data for future research.
""")
learn_more_markdown = ("""
### License
The service is a research preview intended for non-commercial use only
""")
css = code_highlight_css + """
pre {
white-space: pre-wrap; /* Since CSS 2.1 */
white-space: -moz-pre-wrap; /* Mozilla, since 1999 */
white-space: -pre-wrap; /* Opera 4-6 */
white-space: -o-pre-wrap; /* Opera 7 */
word-wrap: break-word; /* Internet Explorer 5.5+ */
}
"""
Instructions = '''
Instructions:
1. Select a 'Referring Input Type'
2. Draw on the image to refer to a region/point.
3. Copy the region id from 'Referring Input Type' to refer to a region in your chat.
'''
class ImageMask(gr.components.Image):
"""
Sets: source="canvas", tool="sketch"
"""
is_template = True
def __init__(self, **kwargs):
super().__init__(source="upload", tool="sketch", interactive=True, **kwargs)
def preprocess(self, x):
return super().preprocess(x)
def draw(input_mode, input, refer_input_state, refer_text_show, imagebox_refer):
if type(input) == dict:
image = deepcopy(input['image'])
mask = deepcopy(input['mask'])
else:
mask = deepcopy(input)
# W, H -> H, W, 3
image_new = np.asarray(image)
img_height = image_new.shape[0]
img_width = image_new.shape[1]
# W, H, 4 -> H, W
mask_new = np.asarray(mask)[:,:,0].copy()
mask_new = torch.from_numpy(mask_new)
mask_new = (F.interpolate(mask_new.unsqueeze(0).unsqueeze(0), (img_height, img_width), mode='bilinear') > 0)
mask_new = mask_new[0, 0].transpose(1, 0).long()
if len(refer_input_state['masks']) == 0:
last_mask = torch.zeros_like(mask_new)
else:
last_mask = refer_input_state['masks'][-1]
diff_mask = mask_new - last_mask
if torch.all(diff_mask == 0):
print('Init Uploading Images.')
return (refer_input_state, refer_text_show, image)
else:
refer_input_state['masks'].append(mask_new)
if input_mode == 'Point':
nonzero_points = diff_mask.nonzero()
nonzero_points_avg_x = torch.median(nonzero_points[:, 0])
nonzero_points_avg_y = torch.median(nonzero_points[:, 1])
sampled_coor = [nonzero_points_avg_x, nonzero_points_avg_y]
# pdb.set_trace()
cur_region_masks = generate_mask_for_feature(sampled_coor, raw_w=img_width, raw_h=img_height)
elif input_mode == 'Box' or input_mode == 'Sketch':
# pdb.set_trace()
x1x2 = diff_mask.max(1)[0].nonzero()[:, 0]
y1y2 = diff_mask.max(0)[0].nonzero()[:, 0]
y1, y2 = y1y2.min(), y1y2.max()
x1, x2 = x1x2.min(), x1x2.max()
# pdb.set_trace()
sampled_coor = [x1, y1, x2, y2]
if input_mode == 'Box':
cur_region_masks = generate_mask_for_feature(sampled_coor, raw_w=img_width, raw_h=img_height)
else:
cur_region_masks = generate_mask_for_feature(sampled_coor, raw_w=img_width, raw_h=img_height, mask=diff_mask)
else:
raise NotImplementedError(f'Input mode of {input_mode} is not Implemented.')
# TODO(haoxuan): Hack img_size to be 224 here, need to make it a argument.
if len(sampled_coor) == 2:
point_x = int(VOCAB_IMAGE_W * sampled_coor[0] / img_width)
point_y = int(VOCAB_IMAGE_H * sampled_coor[1] / img_height)
cur_region_coordinates = f'[{int(point_x)}, {int(point_y)}]'
elif len(sampled_coor) == 4:
point_x1 = int(VOCAB_IMAGE_W * sampled_coor[0] / img_width)
point_y1 = int(VOCAB_IMAGE_H * sampled_coor[1] / img_height)
point_x2 = int(VOCAB_IMAGE_W * sampled_coor[2] / img_width)
point_y2 = int(VOCAB_IMAGE_H * sampled_coor[3] / img_height)
cur_region_coordinates = f'[{int(point_x1)}, {int(point_y1)}, {int(point_x2)}, {int(point_y2)}]'
cur_region_id = len(refer_input_state['region_placeholder_tokens'])
cur_region_token = DEFAULT_REGION_REFER_TOKEN.split(']')[0] + str(cur_region_id) + ']'
refer_input_state['region_placeholder_tokens'].append(cur_region_token)
refer_input_state['region_coordinates'].append(cur_region_coordinates)
refer_input_state['region_masks'].append(cur_region_masks)
refer_text_show.append((cur_region_token, ''))
# Show Parsed Referring.
imagebox_refer = draw_box(sampled_coor, cur_region_masks, \
cur_region_token, imagebox_refer, input_mode)
return (refer_input_state, refer_text_show, imagebox_refer)
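# Hedged examples of the coordinate normalization above, for a 640x480 image
# (VOCAB_IMAGE_W = VOCAB_IMAGE_H = 1000; values are illustrative):
# Point mode: a click at pixel (320, 240)       -> '[500, 500]'
# Box/Sketch mode: pixel box [64, 96, 320, 384] -> '[100, 200, 500, 800]'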
def build_demo(embed_mode):
textbox = gr.Textbox(show_label=False, placeholder="Enter text and press ENTER", visible=False, container=False)
with gr.Blocks(title="FERRET", theme=gr.themes.Base(), css=css) as demo:
state = gr.State()
if not embed_mode:
gr.Markdown(title_markdown)
gr.Markdown(Instructions)
with gr.Row():
with gr.Column(scale=4):
with gr.Row(elem_id="model_selector_row"):
model_selector = gr.Dropdown(
choices=models,
value=models[0] if len(models) > 0 else "",
interactive=True,
show_label=False,
container=False)
original_image = gr.Image(type="pil", visible=False)
image_process_mode = gr.Radio(
["Raw+Processor", "Crop", "Resize", "Pad"],
value="Raw+Processor",
label="Preprocess for non-square image",
visible=False)
# Added for any-format input.
sketch_pad = ImageMask(label="Image & Sketch", type="pil", elem_id="img2text")
refer_input_mode = gr.Radio(
["Point", "Box", "Sketch"],
value="Point",
label="Referring Input Type")
refer_input_state = gr.State({'region_placeholder_tokens':[],
'region_coordinates':[],
'region_masks':[],
'masks':[],
})
refer_text_show = gr.HighlightedText(value=[], label="Referring Input Cache")
imagebox_refer = gr.Image(type="pil", label="Parsed Referring Input")
imagebox_output = gr.Image(type="pil", label='Output Vis')
cur_dir = os.path.dirname(os.path.abspath(__file__))
gr.Examples(examples=[
# [f"{cur_dir}/examples/harry-potter-hogwarts.jpg", "What is in [region0]? And what do people use it for?"],
# [f"{cur_dir}/examples/ingredients.jpg", "What objects are in [region0] and [region1]?"],
# [f"{cur_dir}/examples/extreme_ironing.jpg", "What is unusual about this image? And tell me the coordinates of mentioned objects."],
[f"{cur_dir}/examples/ferret.jpg", "What's the relationship between object [region0] and object [region1]?"],
[f"{cur_dir}/examples/waterview.jpg", "What are the things I should be cautious about when I visit here? Tell me the coordinates in response."],
[f"{cur_dir}/examples/flickr_9472793441.jpg", "Describe the image in details."],
# [f"{cur_dir}/examples/coco_000000281759.jpg", "What are the locations of the woman wearing a blue dress, the woman in flowery top, the girl in purple dress, the girl wearing green shirt?"],
[f"{cur_dir}/examples/room_planning.jpg", "How to improve the design of the given room?"],
[f"{cur_dir}/examples/make_sandwitch.jpg", "How can I make a sandwich with available ingredients?"],
[f"{cur_dir}/examples/bathroom.jpg", "What is unusual about this image?"],
[f"{cur_dir}/examples/kitchen.png", "Is the object a man or a chicken? Explain the reason."],
], inputs=[sketch_pad, textbox])
with gr.Accordion("Parameters", open=False, visible=False) as parameter_row:
temperature = gr.Slider(minimum=0.0, maximum=1.0, value=0.2, step=0.1, interactive=True, label="Temperature",)
top_p = gr.Slider(minimum=0.0, maximum=1.0, value=0.7, step=0.1, interactive=True, label="Top P",)
max_output_tokens = gr.Slider(minimum=0, maximum=1024, value=512, step=64, interactive=True, label="Max output tokens",)
with gr.Column(scale=5):
chatbot = gr.Chatbot(elem_id="chatbot", label="FERRET", visible=False).style(height=750)
with gr.Row():
with gr.Column(scale=8):
textbox.render()
with gr.Column(scale=1, min_width=60):
submit_btn = gr.Button(value="Submit", visible=False)
with gr.Row(visible=False) as button_row:
upvote_btn = gr.Button(value="👍 Upvote", interactive=False)
downvote_btn = gr.Button(value="👎 Downvote", interactive=False)
# flag_btn = gr.Button(value="⚠️ Flag", interactive=False)
#stop_btn = gr.Button(value="⏹️ Stop Generation", interactive=False)
regenerate_btn = gr.Button(value="🔄 Regenerate", interactive=False)
clear_btn = gr.Button(value="🗑️ Clear history", interactive=False)
location_btn = gr.Button(value="🪄 Show location", interactive=False)
if not embed_mode:
gr.Markdown(tos_markdown)
gr.Markdown(learn_more_markdown)
url_params = gr.JSON(visible=False)
# Register listeners
btn_list = [upvote_btn, downvote_btn, location_btn, regenerate_btn, clear_btn]
upvote_btn.click(upvote_last_response,
[state, model_selector], [textbox, upvote_btn, downvote_btn, location_btn])
downvote_btn.click(downvote_last_response,
[state, model_selector], [textbox, upvote_btn, downvote_btn, location_btn])
# flag_btn.click(flag_last_response,
# [state, model_selector], [textbox, upvote_btn, downvote_btn, flag_btn])
regenerate_btn.click(regenerate, [state, image_process_mode],
[state, chatbot, textbox] + btn_list).then(
http_bot, [state, model_selector, temperature, top_p, max_output_tokens, refer_input_state],
[state, chatbot] + btn_list)
clear_btn.click(clear_history, None, [state, chatbot, textbox, imagebox_output, original_image] + btn_list + \
[sketch_pad, refer_input_state, refer_text_show, imagebox_refer])
location_btn.click(show_location,
[sketch_pad, chatbot], [imagebox_output, chatbot, location_btn])
textbox.submit(add_text, [state, textbox, image_process_mode, original_image, sketch_pad], [state, chatbot, textbox, original_image] + btn_list
).then(http_bot, [state, model_selector, temperature, top_p, max_output_tokens, refer_input_state],
[state, chatbot] + btn_list)
submit_btn.click(add_text, [state, textbox, image_process_mode, original_image, sketch_pad], [state, chatbot, textbox, original_image] + btn_list
).then(http_bot, [state, model_selector, temperature, top_p, max_output_tokens, refer_input_state],
[state, chatbot] + btn_list)
sketch_pad.edit(
draw,
inputs=[refer_input_mode, sketch_pad, refer_input_state, refer_text_show, imagebox_refer],
outputs=[refer_input_state, refer_text_show, imagebox_refer],
queue=True,
)
if args.model_list_mode == "once":
demo.load(load_demo, [url_params], [state, model_selector,
chatbot, textbox, submit_btn, button_row, parameter_row],
_js=get_window_url_params)
elif args.model_list_mode == "reload":
demo.load(load_demo_refresh_model_list, None, [state, model_selector,
chatbot, textbox, submit_btn, button_row, parameter_row])
else:
raise ValueError(f"Unknown model list mode: {args.model_list_mode}")
return demo
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="0.0.0.0")
parser.add_argument("--port", type=int)
parser.add_argument("--controller-url", type=str, default="http://localhost:21001")
parser.add_argument("--concurrency-count", type=int, default=8)
parser.add_argument("--model-list-mode", type=str, default="once",
choices=["once", "reload"])
parser.add_argument("--share", action="store_true")
parser.add_argument("--moderate", action="store_true")
parser.add_argument("--embed", action="store_true")
parser.add_argument("--add_region_feature", action="store_true")
args = parser.parse_args()
logger.info(f"args: {args}")
models = get_model_list()
logger.info(args)
demo = build_demo(args.embed)
demo.queue(concurrency_count=args.concurrency_count, status_update_rate=10,
api_open=False).launch(
server_name=args.host, server_port=args.port, share=args.share)

View File

@ -0,0 +1,367 @@
"""
A model worker executes the model.
Usage:
CUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 \
--worker http://localhost:40000 --model-path checkpoints/xxx \
--multi-modal --add_region_feature
"""
import argparse
import asyncio
import json
import time
import threading
import uuid
from fastapi import FastAPI, Request, BackgroundTasks
from fastapi.responses import StreamingResponse
import requests
import torch
import uvicorn
from functools import partial
from ferret.constants import WORKER_HEART_BEAT_INTERVAL
from ferret.utils import (build_logger, server_error_msg,
pretty_print_semaphore)
from ferret.model.builder import load_pretrained_model
from ferret.mm_utils import process_images, load_image_from_base64, tokenizer_image_token, KeywordsStoppingCriteria
from ferret.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
# from transformers import TextIteratorStreamer
from threading import Thread
GB = 1 << 30
worker_id = str(uuid.uuid4())[:6]
logger = build_logger("model_worker", f"model_worker_{worker_id}.log")
global_counter = 0
model_semaphore = None
DEFAULT_REGION_FEA_TOKEN = "<region_fea>"
def heart_beat_worker(controller):
while True:
time.sleep(WORKER_HEART_BEAT_INTERVAL)
controller.send_heart_beat()
class ModelWorker:
def __init__(self, controller_addr, worker_addr,
worker_id, no_register,
model_path, model_base, model_name,
load_8bit, load_4bit,
keep_aspect_ratio,
num_gpus,
add_region_feature,
image_w,
image_h):
self.image_w = image_w
self.image_h = image_h
self.controller_addr = controller_addr
self.worker_addr = worker_addr
self.worker_id = worker_id
if model_path.endswith("/"):
model_path = model_path[:-1]
if model_name is None:
model_paths = model_path.split("/")
if model_paths[-1].startswith('checkpoint-'):
self.model_name = model_paths[-2] + "_" + model_paths[-1]
else:
self.model_name = model_paths[-1]
else:
self.model_name = model_name
logger.info(f"Loading the model {self.model_name} on worker {worker_id} ...")
self.keep_aspect_ratio = keep_aspect_ratio
self.add_region_feature = add_region_feature
self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
model_path, model_base, self.model_name, load_8bit, load_4bit)
self.is_multimodal = 'llava' in self.model_name.lower() or 'ferret' in self.model_name.lower()
if not no_register:
self.register_to_controller()
self.heart_beat_thread = threading.Thread(
target=heart_beat_worker, args=(self,))
self.heart_beat_thread.start()
def register_to_controller(self):
logger.info("Register to controller")
url = self.controller_addr + "/register_worker"
data = {
"worker_name": self.worker_addr,
"check_heart_beat": True,
"worker_status": self.get_status()
}
r = requests.post(url, json=data)
assert r.status_code == 200
def send_heart_beat(self):
# logger.info(f"Send heart beat. Models: {[self.model_name]}. "
# f"Semaphore: {pretty_print_semaphore(model_semaphore)}. "
# f"global_counter: {global_counter}")
url = self.controller_addr + "/receive_heart_beat"
while True:
try:
ret = requests.post(url, json={
"worker_name": self.worker_addr,
"queue_length": self.get_queue_length()}, timeout=5)
exist = ret.json()["exist"]
break
except requests.exceptions.RequestException as e:
logger.error(f"heart beat error: {e}")
time.sleep(5)
if not exist:
self.register_to_controller()
def get_queue_length(self):
if model_semaphore is None:
return 0
else:
return args.limit_model_concurrency - model_semaphore._value + (len(
model_semaphore._waiters) if model_semaphore._waiters is not None else 0)
def get_status(self):
return {
"model_names": [self.model_name],
"speed": 1,
"queue_length": self.get_queue_length(),
}
@torch.inference_mode()
def generate_stream(self, params):
tokenizer, model, image_processor = self.tokenizer, self.model, self.image_processor
image_w = self.image_w
image_h = self.image_h
prompt = params["prompt"]
ori_prompt = prompt
region_masks = params.get('region_masks', None)
images = params.get("images", None)
num_image_tokens = 0
if images is not None and len(images) > 0 and self.is_multimodal:
if len(images) > 0:
if len(images) != prompt.count(DEFAULT_IMAGE_TOKEN):
raise ValueError("Number of images does not match number of <image> tokens in prompt")
images = [load_image_from_base64(image) for image in images]
if self.keep_aspect_ratio:
images = process_images(images, image_processor, model.config)
else:
images = image_processor(images, return_tensors='pt', do_resize=True, do_center_crop=False, size=[image_h, image_w])['pixel_values']
if type(images) is list:
images = [image.to(self.model.device, dtype=torch.float16) for image in images]
else:
images = images.to(self.model.device, dtype=torch.float16)
replace_token = DEFAULT_IMAGE_TOKEN
if getattr(self.model.config, 'mm_use_im_start_end', False):
replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
prompt = prompt.replace(DEFAULT_IMAGE_TOKEN, replace_token)
num_image_tokens = prompt.count(replace_token) * model.get_vision_tower().num_patches
else:
images = None
image_args = {"images": images}
else:
images = None
image_args = {}
if region_masks is not None:
assert self.add_region_feature
region_masks = [[torch.Tensor(region_mask_i).cuda().half() for region_mask_i in region_masks]]
image_args["region_masks"] = region_masks
logger.info("Add region_masks to image_args.")
else:
logger.info("No region_masks for this sample.")
region_masks = None
l_prompt = len(prompt)
temperature = float(params.get("temperature", 1.0))
top_p = float(params.get("top_p", 1.0))
max_context_length = getattr(model.config, 'max_position_embeddings', 2048)
max_new_tokens = min(int(params.get("max_new_tokens", 256)), 1024)
stop_str = params.get("stop", None)
stop_idx = None
if stop_str is not None:
stop_idx = tokenizer(stop_str).input_ids
if len(stop_idx) == 1:
stop_idx = stop_idx[0]
else:
stop_idx = None
# input_ids = tokenizer(prompt).input_ids
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=None)
output_ids = list(input_ids)
pred_ids = []
max_src_len = self.context_len - max_new_tokens - 8
input_ids = input_ids[-max_src_len:]
past_key_values = None
for i in range(max_new_tokens):
if i == 0:
out = model(
torch.as_tensor([input_ids]).cuda(),
use_cache=True,
**image_args)
logits = out.logits
past_key_values = out.past_key_values
else:
attention_mask = torch.ones(
1, past_key_values[0][0].shape[-2] + 1, device="cuda")
out = model(input_ids=torch.as_tensor([[token]], device="cuda"),
use_cache=True,
attention_mask=attention_mask,
past_key_values=past_key_values,
region_masks=region_masks)
logits = out.logits
past_key_values = out.past_key_values
last_token_logits = logits[0][-1]
if temperature < 1e-4:
token = int(torch.argmax(last_token_logits))
else:
probs = torch.softmax(last_token_logits / temperature, dim=-1)
token = int(torch.multinomial(probs, num_samples=1))
output_ids.append(token)
pred_ids.append(token)
if stop_idx is not None and token == stop_idx:
stopped = True
elif token == tokenizer.eos_token_id:
stopped = True
else:
stopped = False
if i % args.stream_interval == 0 or i == max_new_tokens - 1 or stopped:
cur_out = tokenizer.decode(pred_ids, skip_special_tokens=True)
pos = cur_out.rfind(stop_str)
if pos != -1:
cur_out = cur_out[:pos]
stopped = True
output = ori_prompt + cur_out
ret = {
"text": output,
"error_code": 0,
}
yield json.dumps(ret).encode() + b"\0"
if stopped:
break
if past_key_values is not None:
del past_key_values
def generate_stream_gate(self, params):
try:
for x in self.generate_stream(params):
yield x
except ValueError as e:
print("Caught ValueError:", e)
ret = {
"text": server_error_msg,
"error_code": 1,
}
yield json.dumps(ret).encode() + b"\0"
except torch.cuda.CudaError as e:
print("Caught torch.cuda.CudaError:", e)
ret = {
"text": server_error_msg,
"error_code": 1,
}
yield json.dumps(ret).encode() + b"\0"
except Exception as e:
print("Caught Unknown Error", e)
ret = {
"text": server_error_msg,
"error_code": 1,
}
yield json.dumps(ret).encode() + b"\0"
app = FastAPI()
def release_model_semaphore(fn=None):
model_semaphore.release()
if fn is not None:
fn()
@app.post("/worker_generate_stream")
async def generate_stream(request: Request):
global model_semaphore, global_counter
global_counter += 1
params = await request.json()
if model_semaphore is None:
model_semaphore = asyncio.Semaphore(args.limit_model_concurrency)
await model_semaphore.acquire()
worker.send_heart_beat()
generator = worker.generate_stream_gate(params)
background_tasks = BackgroundTasks()
background_tasks.add_task(partial(release_model_semaphore, fn=worker.send_heart_beat))
return StreamingResponse(generator, background=background_tasks)
@app.post("/worker_get_status")
async def get_status(request: Request):
return worker.get_status()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=21002)
parser.add_argument("--worker-address", type=str,
default="http://localhost:21002")
parser.add_argument("--controller-address", type=str,
default="http://localhost:21001")
parser.add_argument("--model-path", type=str, default="facebook/opt-350m")
parser.add_argument("--model-base", type=str, default=None)
parser.add_argument("--model-name", type=str)
parser.add_argument("--multi-modal", action="store_true", help="Multimodal mode is automatically detected with model name, please make sure `ferret` is included in the model path.")
parser.add_argument("--keep-aspect-ratio", action="store_true")
parser.add_argument("--num-gpus", type=int, default=1)
parser.add_argument("--limit-model-concurrency", type=int, default=5)
parser.add_argument("--stream-interval", type=int, default=1)
parser.add_argument("--no-register", action="store_true")
parser.add_argument("--load-8bit", action="store_true")
parser.add_argument("--load-4bit", action="store_true")
parser.add_argument("--add_region_feature", action="store_true")
parser.add_argument("--image_w", type=int, default=336) # 224
parser.add_argument("--image_h", type=int, default=336) # 224
args = parser.parse_args()
logger.info(f"args: {args}")
if args.multi_modal:
logger.warning("Multimodal mode is automatically detected with model name, please make sure `ferret` is included in the model path.")
worker = ModelWorker(args.controller_address,
args.worker_address,
worker_id,
args.no_register,
args.model_path,
args.model_base,
args.model_name,
args.load_8bit,
args.load_4bit,
args.keep_aspect_ratio,
args.num_gpus,
args.add_region_feature,
args.image_w,
args.image_h)
uvicorn.run(app, host=args.host, port=args.port, log_level="info")
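The worker streams null-byte-delimited JSON chunks, each carrying the full text so far plus an `error_code` (the same framing the controller relays). A minimal client sketch against `/worker_generate_stream`; the worker URL, model name, and prompt are illustrative assumptions, and a real multimodal request would also pass base64 images matching any `<image>` tokens in the prompt:

```python
import json
import requests

worker = "http://localhost:21002"   # assumed local worker address
prompt = "A chat between a curious human and an AI assistant.\nHuman: Hello!\nAssistant:"

payload = {
    "model": "ferret-13b",          # must match a name the worker registered
    "prompt": prompt,
    "temperature": 0.2,
    "top_p": 0.7,
    "max_new_tokens": 64,
    "stop": "\n",
    "images": [],
}

response = requests.post(worker + "/worker_generate_stream", json=payload, stream=True)
for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
    if chunk:
        data = json.loads(chunk.decode())
        if data["error_code"] == 0:
            print(data["text"][len(prompt):].strip())
```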

View File

@ -0,0 +1,26 @@
"""
Manually register workers.
Usage:
python3 -m fastchat.serve.register_worker --controller http://localhost:21001 --worker-name http://localhost:21002
"""
import argparse
import requests
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--controller-address", type=str)
parser.add_argument("--worker-name", type=str)
parser.add_argument("--check-heart-beat", action="store_true")
args = parser.parse_args()
url = args.controller_address + "/register_worker"
data = {
"worker_name": args.worker_name,
"check_heart_beat": args.check_heart_beat,
"worker_status": None,
}
r = requests.post(url, json=data)
assert r.status_code == 200

View File

@ -0,0 +1,66 @@
import os
import torch
from transformers import Trainer
from typing import Optional
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
print(name, 'no ignore status')
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True, name=k).cpu() for k, v in to_return.items()}
return to_return
class FERRETTrainer(Trainer):
def _save_checkpoint(self, model, trial, metrics=None):
if getattr(self.args, 'tune_mm_mlp_adapter', False):
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder)
# Only save Adapter
keys_to_match = ['mm_projector']
if getattr(self.args, "use_im_start_end", False):
keys_to_match.extend(['embed_tokens', 'embed_in'])
weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match)
if self.args.local_rank == 0 or self.args.local_rank == -1:
self.model.config.save_pretrained(output_dir)
torch.save(weight_to_save, os.path.join(output_dir, f'mm_projector.bin'))
else:
super(FERRETTrainer, self)._save_checkpoint(model, trial, metrics)
if getattr(self.args, 'save_vision_tower', False):
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder, 'vision_tower')
if self.args.local_rank == 0 or self.args.local_rank == -1:
vision_tower = self.model.model.get_vision_tower().vision_tower
vision_tower.save_pretrained(output_dir)
print(f'Save vision tower ckpt to {output_dir}/vision_tower')
def _save(self, output_dir: Optional[str] = None, state_dict=None):
if getattr(self.args, 'tune_mm_mlp_adapter', False):
pass
else:
super(FERRETTrainer, self)._save(output_dir, state_dict)
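When `tune_mm_mlp_adapter` is set, only parameters whose names match the adapter keys are written out. A toy sketch of that key filtering (it mirrors `get_mm_adapter_state_maybe_zero_3` minus the DeepSpeed ZeRO-3 gathering; the module and names are illustrative):

```python
import torch
from torch import nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.mm_projector = nn.Linear(8, 8)   # kept: name matches 'mm_projector'
        self.lm_head = nn.Linear(8, 8)        # dropped

keys_to_match = ['mm_projector']
toy = Toy()
kept = {k: v.detach().cpu().clone()
        for k, v in toy.named_parameters()
        if any(key in k for key in keys_to_match)}
print(sorted(kept))                            # ['mm_projector.bias', 'mm_projector.weight']
# torch.save(kept, 'mm_projector.bin')         # what _save_checkpoint writes per checkpoint dir
```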

View File

@ -0,0 +1,102 @@
# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
from typing import List, Optional, Tuple
import torch
from torch import nn
import transformers
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb
from einops import rearrange
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
from flash_attn.bert_padding import unpad_input, pad_input
def forward(
self,
hidden_states: torch.Tensor,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
"""Input shape: Batch x Time x Channel
attention_mask: [bsz, q_len]
"""
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states).view(
bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(
bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(
bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
# [bsz, q_len, nh, hd]
# [bsz, nh, q_len, hd]
kv_seq_len = key_states.shape[-2]
offset = 0
if past_key_value is not None:
offset = past_key_value[0].shape[-2]
kv_seq_len += offset
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states,
key_states,
cos,
sin,
offset=offset)
# [bsz, nh, t, hd]
assert not output_attentions, "output_attentions is not supported"
assert not use_cache, "use_cache is not supported"
assert past_key_value is None, "past_key_value is not supported"
# Flash attention codes from
# https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attention.py
# transform the data into the format required by flash attention
qkv = torch.stack([query_states, key_states, value_states], dim=2) # [bsz, nh, 3, q_len, hd]
qkv = qkv.transpose(1, 3) # [bsz, q_len, 3, nh, hd]
# We have disabled _prepare_decoder_attention_mask in LlamaModel
# the attention_mask should be the same as the key_padding_mask
key_padding_mask = attention_mask
if key_padding_mask is None:
qkv = rearrange(qkv, 'b s ... -> (b s) ...')
max_s = q_len
cu_q_lens = torch.arange(0, (bsz + 1) * q_len, step=q_len, dtype=torch.int32,
device=qkv.device)
output = flash_attn_unpadded_qkvpacked_func(
qkv, cu_q_lens, max_s, 0.0,
softmax_scale=None, causal=True
)
output = rearrange(output, '(b s) ... -> b s ...', b=bsz)
else:
nheads = qkv.shape[-2]
x = rearrange(qkv, 'b s three h d -> b s (three h d)')
x_unpad, indices, cu_q_lens, max_s = unpad_input(x, key_padding_mask)
x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
output_unpad = flash_attn_unpadded_qkvpacked_func(
x_unpad, cu_q_lens, max_s, 0.0,
softmax_scale=None, causal=True
)
output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
indices, bsz, q_len),
'b s (h d) -> b s h d', h=nheads)
return self.o_proj(rearrange(output,
'b s h d -> b s (h d)')), None, None
# Disable the transformation of the attention mask in LlamaModel as the flash attention
# requires the attention mask to be the same as the key_padding_mask
def _prepare_decoder_attention_mask(self, attention_mask, input_shape,
inputs_embeds, past_key_values_length):
# [bsz, seq_len]
return attention_mask
def replace_llama_attn_with_flash_attn():
transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = _prepare_decoder_attention_mask
transformers.models.llama.modeling_llama.LlamaAttention.forward = forward
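A small shape sketch of the q/k/v packing that the patched forward performs before handing tensors to flash attention (dummy sizes only; no flash-attn install required):

```python
import torch
from einops import rearrange

bsz, nh, q_len, hd = 2, 8, 16, 64                      # illustrative sizes
q = k = v = torch.randn(bsz, nh, q_len, hd)            # [bsz, nh, q_len, hd], post-RoPE layout

qkv = torch.stack([q, k, v], dim=2)                    # [bsz, nh, 3, q_len, hd]
qkv = qkv.transpose(1, 3)                              # [bsz, q_len, 3, nh, hd]
qkv = rearrange(qkv, 'b s ... -> (b s) ...')           # [(bsz * q_len), 3, nh, hd]

cu_q_lens = torch.arange(0, (bsz + 1) * q_len, step=q_len, dtype=torch.int32)
print(qkv.shape, cu_q_lens)                            # packed layout + cumulative sequence lengths
```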

1387
ferret/train/train.py Normal file

File diff suppressed because it is too large

13
ferret/train/train_mem.py Normal file
View File

@ -0,0 +1,13 @@
# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
# Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
# Make it more memory efficient by monkey patching the LLaMA model with FlashAttn.
# Need to call this before importing transformers.
from ferret.train.llama_flash_attn_monkey_patch import replace_llama_attn_with_flash_attn
replace_llama_attn_with_flash_attn()
from ferret.train.train import train
if __name__ == "__main__":
train()

126
ferret/utils.py Normal file
View File

@ -0,0 +1,126 @@
import datetime
import json
import logging
import logging.handlers
import os
import sys
import requests
from ferret.constants import LOGDIR
server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
moderation_msg = "YOUR INPUT VIOLATES OUR CONTENT MODERATION GUIDELINES. PLEASE TRY AGAIN."
handler = None
def build_logger(logger_name, logger_filename):
global handler
formatter = logging.Formatter(
fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
# Set the format of root handlers
if not logging.getLogger().handlers:
logging.basicConfig(level=logging.INFO)
logging.getLogger().handlers[0].setFormatter(formatter)
# Redirect stdout and stderr to loggers
stdout_logger = logging.getLogger("stdout")
stdout_logger.setLevel(logging.INFO)
sl = StreamToLogger(stdout_logger, logging.INFO)
sys.stdout = sl
stderr_logger = logging.getLogger("stderr")
stderr_logger.setLevel(logging.ERROR)
sl = StreamToLogger(stderr_logger, logging.ERROR)
sys.stderr = sl
# Get logger
logger = logging.getLogger(logger_name)
logger.setLevel(logging.INFO)
# Add a file handler for all loggers
if handler is None:
os.makedirs(LOGDIR, exist_ok=True)
filename = os.path.join(LOGDIR, logger_filename)
handler = logging.handlers.TimedRotatingFileHandler(
filename, when='D', utc=True)
handler.setFormatter(formatter)
for name, item in logging.root.manager.loggerDict.items():
if isinstance(item, logging.Logger):
item.addHandler(handler)
return logger
class StreamToLogger(object):
"""
Fake file-like stream object that redirects writes to a logger instance.
"""
def __init__(self, logger, log_level=logging.INFO):
self.terminal = sys.stdout
self.logger = logger
self.log_level = log_level
self.linebuf = ''
def __getattr__(self, attr):
return getattr(self.terminal, attr)
def write(self, buf):
temp_linebuf = self.linebuf + buf
self.linebuf = ''
for line in temp_linebuf.splitlines(True):
# From the io.TextIOWrapper docs:
# On output, if newline is None, any '\n' characters written
# are translated to the system default line separator.
# By default sys.stdout.write() expects '\n' newlines and then
# translates them so this is still cross platform.
if line[-1] == '\n':
self.logger.log(self.log_level, line.rstrip())
else:
self.linebuf += line
def flush(self):
if self.linebuf != '':
self.logger.log(self.log_level, self.linebuf.rstrip())
self.linebuf = ''
def disable_torch_init():
"""
Disable the redundant torch default initialization to accelerate model creation.
"""
import torch
setattr(torch.nn.Linear, "reset_parameters", lambda self: None)
setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None)
def violates_moderation(text):
    """
    Check whether the text is flagged by the OpenAI moderation API.
    """
    url = "https://api.openai.com/v1/moderations"
    headers = {"Content-Type": "application/json",
               "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]}
    # Build the payload with json.dumps so quotes and special characters are escaped correctly.
    data = json.dumps({"input": text.replace("\n", "")}).encode("utf-8")
    try:
        ret = requests.post(url, headers=headers, data=data, timeout=5)
        flagged = ret.json()["results"][0]["flagged"]
    except (requests.exceptions.RequestException, KeyError):
        # Treat network failures or unexpected responses as not flagged.
        flagged = False
    return flagged
def pretty_print_semaphore(semaphore):
if semaphore is None:
return "None"
return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})"

BIN
figs/ferret_demo.png Normal file

Binary file not shown. Size: 632 KiB

38
pyproject.toml Normal file
View File

@ -0,0 +1,38 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "ferret"
version = "1.0.1"
description = "Towards GPT-4 like large language and visual assistant."
readme = "README.md"
requires-python = ">=3.8"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
]
dependencies = [
"einops", "fastapi", "gradio==3.26", "markdown2[all]", "numpy",
"requests", "sentencepiece", "tokenizers>=0.12.1",
"torch", "torchvision", "uvicorn", "wandb",
"shortuuid", "httpx==0.24.0",
"deepspeed==0.9.5",
"peft==0.4.0",
"transformers @ git+https://github.com/huggingface/transformers.git@cae78c46",
"accelerate==0.21.0",
"bitsandbytes==0.41.0",
"scikit-learn==1.2.2",
"sentencepiece==0.1.99",
"einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13", "openai",
"gradio_client==0.1.2"
]
[project.urls]
"Homepage" = "https://github.com/apple/ml-ferret"
[tool.setuptools.packages.find]
exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]
[tool.wheel]
exclude = ["assets*", "benchmark*", "docs", "dist*", "playground*", "scripts*", "tests*"]