Data formats

This document describes the data formats used in the system.

Raw annotation data

The client sends raw image annotations as JSON: a list of shapes. Each shape is a dictionary with the following keys:

  • type: The type of the shape which can be:

    • bbox (bounding box, rectangle)

    • polygon

    • polyline

    • point

  • shape: The shape data. Its format depends on the shape type:

    • For bbox it is a dictionary with keys x, y, w, h:

          {"x": x, "y": y, "w": w, "h": h}

    • For polygon and polyline it is a list of points; each point is a dictionary with keys x and y:

          [{"x": x1, "y": y1}, {"x": x2, "y": y2}, ...]

      The only difference between polygon and polyline is that the former is supposed to be closed so the last point is connected to the first one.

    • For point it is a dictionary with keys x and y:

          {"x": x, "y": y}

    • All coordinates are floating-point numbers in the range [0, 1] and relative to the image size, with the origin in the top-left corner.

  • object: The ID of the type of object (label) depicted in the shape. This ID is a human-readable string that must be registered in the system before being used on shapes.

The server sends the same data back to the client so that it can display the existing annotations for an image.
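
The structure described above can be checked mechanically. The following is a minimal Python sketch, not the system's actual validation: the function names and the `known_objects` set (standing in for the labels registered in the system) are hypothetical. It also assumes that a shape carries exactly the listed keys, and it accepts integers such as 0 and 1 as coordinates.

```python
from numbers import Real

SHAPE_TYPES = {"bbox", "polygon", "polyline", "point"}


def _is_unit(value):
    """True if value is a real number (not a bool) within [0, 1]."""
    return (isinstance(value, Real) and not isinstance(value, bool)
            and 0 <= value <= 1)


def _has_unit_keys(data, keys):
    """True if data is a dict with exactly `keys`, all valued in [0, 1]."""
    return (isinstance(data, dict) and set(data) == set(keys)
            and all(_is_unit(data[k]) for k in keys))


def shape_errors(entry, known_objects):
    """Return a list of problems found in one annotation (empty if none)."""
    if not isinstance(entry, dict):
        return ["entry is not a dictionary"]
    errors = []
    kind, data = entry.get("type"), entry.get("shape")
    if kind not in SHAPE_TYPES:
        errors.append(f"unknown type: {kind!r}")
    elif kind == "bbox":
        if not _has_unit_keys(data, ("x", "y", "w", "h")):
            errors.append("bbox needs keys x, y, w, h with values in [0, 1]")
    elif kind == "point":
        if not _has_unit_keys(data, ("x", "y")):
            errors.append("point needs keys x, y with values in [0, 1]")
    else:  # polygon or polyline: a list of points
        if not (isinstance(data, list)
                and all(_has_unit_keys(p, ("x", "y")) for p in data)):
            errors.append(f"{kind} needs a list of points in [0, 1]")
    label = entry.get("object")
    # Assumption: known_objects holds the IDs already registered in the system.
    if not isinstance(label, str) or label not in known_objects:
        errors.append(f"object is not registered: {label!r}")
    return errors
```

Calling `shape_errors(entry, known_objects)` on each element of the list yields an empty list for a well-formed shape. The sketch does not check whether a bounding box extends past the image edge, because the format description does not say whether that is allowed.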

Example

[
   {
       "type": "bbox",
       "shape": {"x": 0.1, "y": 0.1, "w": 0.5, "h": 0.5},
       "object": "Cat (Felis catus)"
   },
   {
       "type": "polygon",
       "shape": [{"x": 0, "y": 0}, {"x": 1, "y": 0}, {"x": 0, "y": 1}],
       "object": "Slice of pizza margherita"
   },
   {
       "type": "point",
       "shape": {"x": 0.5, "y": 0.5},
       "object": "Cat (Felis catus) - left eye"
   }
]
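
Because coordinates are relative to the image size, a client has to scale them before drawing. The sketch below shows one way to do that in Python. The function name `to_pixels` is hypothetical, and the code assumes that x and w scale with the image width and that y and h scale with the image height.

```python
def to_pixels(entry, width, height):
    """Scale one annotation's relative shape data to pixel units."""
    kind, data = entry["type"], entry["shape"]
    if kind == "bbox":
        return {
            "x": data["x"] * width,
            "y": data["y"] * height,
            "w": data["w"] * width,
            "h": data["h"] * height,
        }
    if kind == "point":
        return {"x": data["x"] * width, "y": data["y"] * height}
    if kind in ("polygon", "polyline"):
        return [{"x": p["x"] * width, "y": p["y"] * height} for p in data]
    raise ValueError(f"unknown shape type: {kind!r}")


# The point from the example above, on an 800x600 image.
eye = {"type": "point", "shape": {"x": 0.5, "y": 0.5},
       "object": "Cat (Felis catus) - left eye"}
print(to_pixels(eye, 800, 600))  # {'x': 400.0, 'y': 300.0}
```

The results are left as floats; rounding to whole pixels is a choice for the drawing code.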

Query format

The query format is based on YAML and used to query for pictures in the system.

Structure

The root can have 3 keys:

  • want: A list of rules that the images must satisfy. If not provided, no filtering is done.

  • exclude: A list of rules that the images must not satisfy. If not provided, no filtering is done.

  • include_obsolete: If true, the query may return images that have a designated replacement. If false (the default), such images are not returned.

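want and exclude are lists of rules. Each rule is a dictionary with a single key, which allows several rules of the same kind in one list. An image must satisfy every rule in want; within a rule that takes a list, matching any one value is enough. Accepted rules are: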

  • has_object: [object1, object2, ...]: The image must contain any of the objects in the list.

  • has: [object1, object2, ...]: The image must contain any of the objects in the list, or a descendant of any of them.

  • nature: [nature1, nature2, ...]: The image must have one of the natures in the list. Natures are strings like "photo" that indicate the source of the image.

  • licence: [licence1, licence2, ...]: The image must have one of the licences in the list. Where possible, licence IDs are SPDX identifiers; non-standard ones are prefixed with X-.

  • author: [author1, author2, ...]: The image's author's username must be in the list.

  • title: query: Search for titles (case-insensitive match, ilike).

  • description: query: Search for descriptions (case-insensitive match, ilike).

  • origin_url: query: Search for origin URLs. The query matches the beginning of the URL, excluding the protocol (for example, commons.wikimedia.org).

  • above_width: width: The image must have a width greater than or equal to the given value, in pixels.

  • above_height: height: The image must have a height greater than or equal to the given value, in pixels.

  • below_width: width: The image must have a width less than or equal to the given value, in pixels.

  • below_height: height: The image must have a height less than or equal to the given value, in pixels.

  • before_date: timestamp: The image must have been uploaded before the given Unix timestamp.

  • after_date: timestamp: The image must have been uploaded after the given Unix timestamp.

  • in_gallery: [gallery1, gallery2, ...]: The image must be in any of the galleries (by ID) in the list.

  • above_rating: rating: The image must have a rating greater than or equal to the given value (1-5 stars). Images with no rating are included; use above_rating_count: 1 to exclude them.

  • below_rating: rating: The image must have a rating less than or equal to the given value (1-5 stars).

  • above_rating_count: count: The image must have at least the given rating count.

  • below_rating_count: count: The image must have at most the given rating count.

  • above_region_count: count: The image must have at least the given number of regions.

  • below_region_count: count: The image must have at most the given number of regions.

  • copied_from: [image1, image2, ...]: The image must be a copy of one of the images in the list (by ID).
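
To illustrate how want and exclude combine, here is a minimal Python sketch that evaluates a parsed query against an in-memory image record. It is not the server's implementation: the record layout (`objects`, `nature`, `licence`, `width`, `height`, `uploaded`) is hypothetical, only a subset of the rules is covered, and the code assumes that an image is dropped as soon as it satisfies any one rule in exclude.

```python
# Each rule name maps to a predicate over (image record, rule argument).
RULES = {
    "has_object": lambda img, objs: any(o in img["objects"] for o in objs),
    "nature": lambda img, natures: img["nature"] in natures,
    "licence": lambda img, licences: img["licence"] in licences,
    "above_width": lambda img, w: img["width"] >= w,
    "above_height": lambda img, h: img["height"] >= h,
    "below_width": lambda img, w: img["width"] <= w,
    "below_height": lambda img, h: img["height"] <= h,
    # Assumption: "before" and "after" are strict comparisons.
    "before_date": lambda img, ts: img["uploaded"] < ts,
    "after_date": lambda img, ts: img["uploaded"] > ts,
}


def rule_holds(image, rule):
    """Evaluate a single-key rule dictionary against an image record."""
    if len(rule) != 1:
        raise ValueError("each rule must have exactly one key")
    (name, arg), = rule.items()
    if name not in RULES:
        raise ValueError(f"unsupported rule in this sketch: {name!r}")
    return RULES[name](image, arg)


def matches(image, query):
    """True if the image satisfies all want rules and no exclude rule."""
    wanted = all(rule_holds(image, r) for r in query.get("want", []))
    excluded = any(rule_holds(image, r) for r in query.get("exclude", []))
    return wanted and not excluded
```

A query is passed as the dictionary obtained from parsing the YAML, for example `{"want": [{"has_object": ["Grass", "Flower"]}], "exclude": [{"below_width": 800}]}`. An empty query matches every record, in line with "no filtering is done".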

ordering, offset and limit can be specified as query parameters in the URL. ordering can be one of date-desc, date-asc, title-asc, title-desc, number-regions-desc, number-regions-asc, random. offset and limit are integers that specify the number of images to skip and the maximum number of images to return, respectively.
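
As a sketch of how a client might build these URL parameters in Python, the helper below validates the ordering and encodes all three values. The function name `paging_params` is hypothetical, and rejecting negative numbers is an assumption; the document only states that offset and limit are integers.

```python
from urllib.parse import urlencode

ORDERINGS = {
    "date-desc", "date-asc", "title-asc", "title-desc",
    "number-regions-desc", "number-regions-asc", "random",
}


def paging_params(ordering, offset, limit):
    """Return the URL query string for ordering, offset and limit."""
    if ordering not in ORDERINGS:
        raise ValueError(f"unknown ordering: {ordering!r}")
    # Assumption: negative values are meaningless for skipping and limiting.
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")
    return urlencode({"ordering": ordering, "offset": offset, "limit": limit})


print(paging_params("random", 20, 10))  # ordering=random&offset=20&limit=10
```

The resulting string is appended after a `?` to the query endpoint's URL, which this document does not specify.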

Example

# Restrictions for queried images
want:
   # Both rules must hold, so the image must contain both a cat and a dog
   - has_object: ["Cat (Felis catus)"]
   - has_object: ["Dog (Canis lupus familiaris)"]
   # Or we can put them in a list to mean that the image can contain any of the
   # objects in the list
   - has_object: ["Grass", "Flower"]
   # So the image must contain a cat and a dog, as well as either grass or
   # a flower
   # The following rule restricts the images to those with a certain source,
   # like a camera or a drawing; omitting this rule means that the images can
   # be of any source
   - nature: ["photo", "computer-3d-art"]
   # The following rule restricts the images to those with a certain licence
   - licence: ["CC-BY-1.0", "CC-BY-2.0", "CC-BY-3.0", "CC-BY-4.0", "CC0-1.0",
               "Unlicense", "WTFPL", "MIT", "BSD-2-Clause", "BSD-3-Clause",
               "Apache-2.0", "X-informal-attribution", "X-informal-do-anything",
               "X-public-domain-old", "X-public-domain"]
# Prohibitions for queried images
exclude:
   # This means that the image must not contain any of the objects in the list
   - has_object: ["Human"]
   # This excludes images uploaded before the given date
   - before_date: 1546300800
   # Excluding small images enforces a minimum resolution (over 800x600)
   - below_width: 800
   - below_height: 600
# In summary, we want images that contain both a cat and a dog, either grass or
# a flower, but not a human; that were uploaded on or after 2019-01-01; that are
# a photo or a 3D render; that carry one of certain permissive licences; and
# that have a resolution larger than 800x600 pixels.
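
The before_date and after_date rules take Unix timestamps, which are awkward to write by hand. The Python sketch below converts between calendar dates and timestamps; the helper names are hypothetical, and using midnight UTC as the start of a day is an assumption about how dates are meant to be expressed.

```python
from datetime import datetime, timezone


def to_unix(year, month, day):
    """Unix timestamp of midnight UTC at the start of the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def from_unix(timestamp):
    """ISO date (UTC) on which the given Unix timestamp falls."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()


print(to_unix(2019, 1, 1))    # 1546300800
print(from_unix(1546300800))  # 2019-01-01
```

This confirms that the value 1546300800 used in the example corresponds to 2019-01-01 at 00:00 UTC.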
                
                    