Home page               Nearest Neighbor Project

Datasets for Nearest Neighbor Retrieval and Classification

So far we have used four datasets in our experiments. Descriptions of the datasets are provided in the experiments chapter of my thesis (chapter 7, starting at page 96 of the thesis, which is page 112 according to Acrobat Reader). The distance matrices and class labels for those datasets can be downloaded from the following links:

UNIPEN with Dynamic Time Warping
(5323 queries, 10630 database objects)
http://vlm1.uta.edu/~athitsos/nearest_neighbors/bm_datasets/unipen/

MNIST database with shape context matching
(10000 queries, 60000 database objects)
http://vlm1.uta.edu/~athitsos/nearest_neighbors/bm_datasets/60sc/

MNIST database with shape context matching
(10000 queries, 20000 database objects)
http://vlm1.uta.edu/~athitsos/nearest_neighbors/bm_datasets/sc/

Time series with constrained Dynamic Time Warping, original dataset
(50 queries, 32768 database objects)
http://vlm1.uta.edu/~athitsos/nearest_neighbors/bm_datasets/ats/

Time series with constrained Dynamic Time Warping, scrambled dataset
(1000 queries, 31818 database objects)
http://vlm1.uta.edu/~athitsos/nearest_neighbors/bm_datasets/2ts/

ASL handshape data set with the chamfer distance
(710 queries, 80640 database objects)
http://vlm1.uta.edu/~athitsos/nearest_neighbors/bm_datasets/hands

IMPORTANT: You do not need to download every file in the above directories. In each directory, the files that you need to download are:
- testtrain_distances.bin (distances from each test object (i.e., query object) to each database object)
- traintrain_distances.bin (distances from each database object to each database object)
- test_labels.bin (class labels for the test objects)
- training_labels.bin (class labels for the training objects)

Each distance file has this format:
- four 32-bit integers (ignore the first and fourth; the second is the number of rows in the distance matrix, and the third is the number of columns).
- rows * cols 32-bit floating point numbers, one for each distance. They are saved row-by-row.
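The layout above can be read with a few lines of Python. This is a minimal sketch, not the code that produced the files. The function names `read_distance_matrix` and `distance` are hypothetical, and the byte order is an assumption (little-endian by default), because the format description does not state it. If the row and column counts come out implausible, try `byteorder=">"`.

```python
import struct
import sys
from array import array


def read_distance_matrix(path, byteorder="<"):
    """Read a distance file: four int32 header values, then rows*cols float32 values.

    byteorder is an assumption ("<" little-endian, ">" big-endian);
    the format description does not specify it.
    """
    with open(path, "rb") as f:
        header = f.read(16)
        if len(header) != 16:
            raise ValueError("file too short for the 16-byte header")
        # The first and fourth header integers are ignored, as described above.
        _, rows, cols, _ = struct.unpack(byteorder + "4i", header)
        values = array("f")
        if values.itemsize != 4:
            raise RuntimeError("this sketch expects 4-byte C floats")
        # fromfile raises EOFError if the file holds fewer than rows*cols values.
        values.fromfile(f, rows * cols)
    # array.fromfile reads in native byte order; swap if the file differs.
    native = "<" if sys.byteorder == "little" else ">"
    if byteorder != native:
        values.byteswap()
    return rows, cols, values


def distance(values, cols, i, j):
    """Return the entry in row i, column j of a matrix saved row-by-row."""
    return values[i * cols + j]
```

For testtrain_distances.bin, row i holds the distances from the i-th query to every database object, so `distance(values, cols, i, j)` is the distance from query i to database object j. The values are kept in a flat array to avoid building one Python list per row, which matters for the larger matrices.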

Each class label file has this format:
- four 32-bit integers (ignore the first and fourth; the second is the number of rows in the matrix, which is 1, and the third is the number of columns in the matrix, i.e., the number of objects).
- rows * cols 32-bit floating point numbers. The i-th number is the class label of the i-th object.
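Label files share the same header, so reading them is a small variation. The sketch below uses a hypothetical function name, `read_labels`, and makes two assumptions that the description does not confirm: the files are little-endian by default, and the labels are whole numbers stored as floats, so rounding them to integers loses nothing.

```python
import struct


def read_labels(path, byteorder="<"):
    """Read a class label file and return the labels as a list of ints.

    Assumptions: byteorder ("<" little-endian by default) and
    integer-valued labels stored as 32-bit floats.
    """
    with open(path, "rb") as f:
        # The first and fourth header integers are ignored.
        _, rows, cols, _ = struct.unpack(byteorder + "4i", f.read(16))
        count = rows * cols  # rows is 1, so count is the number of objects
        data = f.read(4 * count)
    if len(data) != 4 * count:
        raise ValueError("label file is shorter than its header claims")
    labels = struct.unpack("%s%df" % (byteorder, count), data)
    # The i-th value is the class label of the i-th object.
    return [int(round(x)) for x in labels]
```

The length of the returned list should equal the number of rows of the matching distance matrix for test_labels.bin, and the number of columns for training_labels.bin. That makes a quick consistency check after downloading.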

If a file (testtrain_distances.bin or traintrain_distances.bin) was over 2GB, I had to split it into chunks under 2GB. That is why you see filenames like "split_testtrainaa", etc. In that case, you have to manually merge those chunks, in order, to recreate the original file. I split the files using the Linux command "split".
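Because "split" names its chunks with suffixes (aa, ab, ...) that sort alphabetically, concatenating the chunks in sorted order restores the original file. A minimal sketch in Python follows. The function name `merge_chunks` is hypothetical, and the chunk prefix should be checked against the actual filenames in each directory.

```python
import glob
import shutil


def merge_chunks(prefix, output_path):
    """Concatenate all files named prefix + suffix, in sorted order.

    Returns the list of chunk paths that were merged. output_path should
    not already exist under a name that starts with prefix, or it would
    be picked up as a chunk.
    """
    chunks = sorted(glob.glob(glob.escape(prefix) + "*"))
    if not chunks:
        raise FileNotFoundError("no chunks found for prefix %r" % prefix)
    with open(output_path, "wb") as out:
        for chunk in chunks:
            with open(chunk, "rb") as part:
                # Stream the copy so multi-gigabyte chunks never sit in memory.
                shutil.copyfileobj(part, out)
    return chunks
```

For example, `merge_chunks("split_testtrain", "testtrain_distances.bin")` would rebuild the test-to-database matrix from chunks named as in the example above. On Linux, concatenating the same chunks with `cat` in alphabetical order has the same effect.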

Please do not hesitate to send me e-mail ([my last name] AT cs.bu.edu) for any problems or questions, to let me know about results you have obtained on these datasets, or to let me know about additional datasets that may be of interest.

