Dataset
The data used in this project originate from the NUS Global Streetscapes dataset, which was published by the National University of Singapore. The dataset contains large-scale street-level imagery with a variety of labels. For this project, these labels were combined into a single large table that includes the complete set of images and metadata.
Notebook
For the code and steps used to build the table, see Initial Table Generation.
Steps for preparing the subsets
Step 1
The following table shows how the number of images for each city in the study changes when a minimum mly_quality_score is applied. This matters because the count drops sharply as the threshold rises.
Note
Filtering by mly_quality_score is useful when computation time needs to be reduced and fewer pairs are expected.
| City | Total | 50 % | 60 % | 70 % | 80 % | 90 % |
|---|---|---|---|---|---|---|
| Berlin | 198184 | 61606 | 59728 | 56517 | 51531 | 41767 |
| Washington | 197080 | 76859 | 70041 | 60128 | 44313 | 24971 |
| Sydney | 69227 | 63944 | 61771 | 57759 | 52210 | 41051 |
| Cape Town | 12639 | 11135 | 10136 | 8764 | 6708 | 4068 |
| Taipei | 198538 | 171232 | 161789 | 146595 | 122037 | 84761 |
| Sao Paulo | 197964 | 129330 | 108546 | 78852 | 46080 | 19913 |
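Counts like those in the table above could be produced with a small helper such as the following. This is a minimal sketch, not the code from the notebook: it assumes each image is a record with a `city` field and an `mly_quality_score` between 0 and 1, that the percentage columns mean "score at or above the threshold", and that images without a score count only toward the total. The function name `count_by_threshold` is hypothetical.

```python
from collections import defaultdict

# Assumed thresholds matching the 50 % to 90 % columns of the table.
THRESHOLDS = (0.5, 0.6, 0.7, 0.8, 0.9)


def count_by_threshold(rows, thresholds=THRESHOLDS):
    """Count images per city in total and at or above each score threshold."""
    counts = defaultdict(lambda: {"total": 0, **{t: 0 for t in thresholds}})
    for row in rows:
        city_counts = counts[row["city"]]
        city_counts["total"] += 1
        score = row["mly_quality_score"]
        if score is None:
            # Assumption: images without a score only count toward the total.
            continue
        for t in thresholds:
            if score >= t:
                city_counts[t] += 1
    return dict(counts)


# Tiny made-up sample, only to show the shape of the input and output.
sample = [
    {"city": "Berlin", "mly_quality_score": 0.95},
    {"city": "Berlin", "mly_quality_score": 0.65},
    {"city": "Berlin", "mly_quality_score": None},
    {"city": "Sydney", "mly_quality_score": 0.55},
]
result = count_by_threshold(sample)
print(result["Berlin"])
```

Counting all thresholds in one pass keeps the cost low for tables with hundreds of thousands of rows per city.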
Because testing showed that the heading variable is not 100 % reliable, Mapillary's computed heading was also used to check whether the two headings differ by more than 10 degrees. Applying this check reduces the counts as follows:
| City | Total | 50 % | 60 % | 70 % | 80 % | 90 % |
|---|---|---|---|---|---|---|
| Berlin | 45650 | 13035 | 12476 | 11629 | 10287 | 8018 |
| Washington | 132952 | 51160 | 47204 | 41233 | 30535 | 16644 |
| Sydney | 15304 | 12849 | 12164 | 11228 | 9924 | 7573 |
| Cape Town | 9511 | 8276 | 7416 | 6303 | 4618 | 2594 |
| Taipei | 26417 | 19805 | 18686 | 16351 | 12673 | 8040 |
| Sao Paulo | 124394 | 79460 | 66792 | 48881 | 27401 | 11107 |
As a result, both the difference between the two headings and the quality score were used to reduce the number of images.
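The combined filter from Step 1 can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: the field names `heading` and `computed_heading` are hypothetical stand-ins for the recorded and Mapillary-computed headings, images whose headings differ by at most 10 degrees are assumed to be the ones kept, and records missing any of the three values are assumed to be dropped. The heading difference is computed on the circle, so that 359 and 2 degrees are 3 degrees apart rather than 357.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two compass headings in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def filter_images(rows, min_score=0.5, max_offset=10.0):
    """Keep images with a sufficient quality score and consistent headings."""
    kept = []
    for row in rows:
        score = row.get("mly_quality_score")
        heading = row.get("heading")            # hypothetical field name
        computed = row.get("computed_heading")  # hypothetical field name
        if score is None or heading is None or computed is None:
            # Assumption: incomplete records are dropped.
            continue
        if score >= min_score and angular_difference(heading, computed) <= max_offset:
            kept.append(row)
    return kept


# Made-up records to illustrate the three outcomes.
images = [
    # 3 degrees apart across the 0/360 boundary: kept
    {"id": "a", "mly_quality_score": 0.9, "heading": 359.0, "computed_heading": 2.0},
    # 30 degrees apart: dropped
    {"id": "b", "mly_quality_score": 0.9, "heading": 90.0, "computed_heading": 120.0},
    # headings agree, but the score is below the threshold: dropped
    {"id": "c", "mly_quality_score": 0.4, "heading": 10.0, "computed_heading": 12.0},
]
kept_ids = [row["id"] for row in filter_images(images)]
print(kept_ids)
```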
Step 2
Since the number of images was still too large, we used the tool by Danish et al. (2024), which is available on GitHub. For our purposes, the tool was slightly modified. Its usage is explained in the Advanced guide.
Final dataset
The result of these steps is a cleaned and filtered subset of the Global Streetscapes dataset. In the Berlin example, this produced a smaller but higher-quality collection of images that can be used for tasks such as pair identification and perception studies.