Authors: Ralph Gasser†, Luca Rossetto‡, Silvan Heller†, Heiko Schuldt†;
Affiliations: †University of Basel, ‡University of Zurich;
Editors: Mathias Lux and Marco Bertini
Introduction
Analysis and retrieval of media collections become increasingly challenging as the collections grow. Keeping everything in main memory becomes less feasible, and more and more time and effort have to be spent on data management. However, traditional relational databases do not support primitives often used in multimedia workloads, such as nearest-neighbor search on vectors. In this column, we introduce Cottontail DB, an open-source database management system for multimedia features. Cottontail DB supports traditional relational database operations, text retrieval based on Lucene and, most importantly, efficient vector-space retrieval operations for large datasets.
Cottontail DB is the new data storage system powering the vitrivr multimedia retrieval stack, which was previously featured in the SIGMM Records [7]. Just like the other components of vitrivr, Cottontail DB is released under the permissive MIT license. It is written in Kotlin, runs on all major operating systems, and comes with a flexible and easy-to-use gRPC API, which makes it usable in many applications, independent of the programming language used. Cottontail DB’s clean and modular architecture makes it easy to extend and also makes it useful in an educational context.
In the following, we give a brief introduction to how Cottontail DB works, what we are using it for, and, most importantly, how it can help you manage your data. To learn more about Cottontail DB, including performance evaluations, we kindly refer readers to our Open Source Software Track contribution at ACM MM 2020 [1], where Cottontail DB was honored with that year’s Best Open Source Award.
What Cottontail DB can do for you
Generally, Cottontail DB can be used for any workload in multimedia retrieval or multimedia analysis. Its strengths come into play when a combination of nearest-neighbor search (NNS) and Boolean retrieval is required. To illustrate Cottontail DB’s versatility, we briefly describe two of our own use cases:
Multimedia Retrieval with vitrivr. vitrivr offers a variety of different query and retrieval modalities, ranging from vector-based features (based on, e.g., color, edges, or content), textual features (e.g., OCR, ASR) to Boolean features for metadata. Cottontail DB offers support for the storage and efficient retrieval of all of these feature representations. For video data, we are using it to perform competitive retrieval on 1000+ hours of video on commodity hardware [5]. For images and their accompanying metadata, we are using Cottontail DB in the Lifelog Search Challenge [3], combining Boolean, textual, and NNS functionality on the same dataset [4]. In [2], we have also shown that vitrivr and Cottontail DB can be used for mixed multimedia collections. Using index structures for multimedia retrieval such as PQ [6] or text indices offered by Lucene is essential for efficient retrieval on large datasets, and Cottontail DB offers such functionality.
Magnetic Resonance Fingerprinting (MRF). MRF is a technique to generate quantitative MRI scans of tissue. It is based on the assumption that a specific type of tissue has a unique fingerprint, which can be seen as a vector of complex numbers for each pixel, where components are generated by scanning with random parameters. For reconstructing images from the signals, MRF matches these vectors against a dictionary of simulated fingerprints using maximum inner product search (MIPS). These dictionaries potentially contain hundreds of thousands to billions of vectors. We have successfully employed Cottontail DB in a collaborative project to speed up that MRF matching step. Cottontail DB’s modular architecture made it possible to add tailored index structures for MIPS in the complex number domain.
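To make the matching step more concrete, the following is a minimal sketch of MIPS over complex vectors as a plain linear scan. It is not Cottontail DB's implementation and uses no index structure. The function names, the toy dictionary, and the choice of the magnitude of the Hermitian inner product as the score are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[complex], b: Sequence[complex]) -> complex:
    """Hermitian inner product <a, b> = sum(a_i * conj(b_i))."""
    return sum(x * y.conjugate() for x, y in zip(a, b))


def mips(query: Sequence[complex],
         dictionary: List[Sequence[complex]]) -> Tuple[int, float]:
    """Linear-scan MIPS: return (index, score) of the dictionary entry
    with the largest |<query, entry>|, or (-1, -inf) if it is empty."""
    best_index, best_score = -1, float("-inf")
    for i, entry in enumerate(dictionary):
        score = abs(inner_product(query, entry))
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Hypothetical toy dictionary of simulated fingerprints (unit length).
dictionary = [
    [1 + 0j, 0 + 0j],
    [0 + 0j, 1 + 0j],
    [0.6 + 0j, 0 + 0.8j],
]

# Hypothetical measured signal for one pixel.
signal = [0.6 + 0j, 0 + 0.8j]

best_index, best_score = mips(signal, dictionary)
print(best_index, best_score)
```

A scan like this touches every dictionary entry for every pixel, which is why tailored index structures become attractive once the dictionary grows to millions or billions of vectors.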
How Cottontail DB works
Cottontail DB’s data model is very similar to the one found in a relational database management system. Data can be organized into different schemata and entities, where each entity consists of one or more columns. Columns are strongly typed. In addition to the well-known scalar types such as integers, floats, or strings, vector types and complex value types are first-class citizens in the Cottontail DB type system.
Cottontail DB’s query model allows for a seamless combination of classical Boolean queries and NNS, of which the latter is often used in multimedia retrieval and similarity search. Not only can both types of queries be executed, but they can also be combined. For example, it is entirely possible to restrict NNS to a subset of the data by first filtering by some predicate or to express a sub-SELECT query based on the outcome of an NNS. Cottontail DB’s query planning and execution engine uses a cost-based model to find the most efficient execution plan for such a query and puts available index structures and parallelism to use.
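The combination of a Boolean predicate with NNS can be illustrated with a short sketch. The code below only shows the semantics of "filter first, then search"; it does not use Cottontail DB's API or query planner. The row layout, the names `filtered_nns` and `euclidean`, and the sample data are hypothetical.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_nns(rows: List[Dict], query: Sequence[float], k: int,
                 predicate: Callable[[Dict], bool]) -> List[Tuple[str, float]]:
    """Restrict NNS to the rows matching a Boolean predicate and
    return the k nearest (id, distance) pairs, closest first."""
    candidates = (row for row in rows if predicate(row))
    scored = ((euclidean(row["feature"], query), row["id"]) for row in candidates)
    return [(row_id, dist) for dist, row_id in heapq.nsmallest(k, scored)]


# Hypothetical entity with an id, a metadata column, and a vector column.
rows = [
    {"id": "a", "type": "video", "feature": [0.1, 0.1]},
    {"id": "b", "type": "image", "feature": [0.0, 0.0]},
    {"id": "c", "type": "video", "feature": [0.9, 0.9]},
    {"id": "d", "type": "video", "feature": [0.2, 0.0]},
]

result = filtered_nns(rows, query=[0.0, 0.0], k=2,
                      predicate=lambda row: row["type"] == "video")
print(result)
```

Row `b` is the closest vector overall, yet it is excluded because it fails the predicate. Whether filtering before or after the distance computation is cheaper depends on the selectivity of the predicate, which is the kind of decision a cost-based planner is meant to take.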
Technically, Cottontail DB is a column store, i.e., instead of storing all the attributes that belong to an entry in a single tuple, the data is stored column-wise, and tuples are constructed during query execution. This allows for optimizations when executing analytical queries such as NNS by, for example, deferring fetching of attributes that are not used for distance calculation until after the NNS has concluded. This, in combination with data partitioning and parallelism, makes Cottontail DB very fast on this type of workload even for linear scan queries on large data collections. Should the linear approach not be fast enough, however, Cottontail DB also supports various types of indexes both for Boolean retrieval (e.g., hash-based indexing) and nearest neighbor search (e.g., Vector Approximation Files or Product Quantization).
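The benefit of deferred fetching in a column-wise layout can be sketched in a few lines. This is a simplified in-memory model and not Cottontail DB's storage engine. The `columns` dictionary, the function `nns_column_scan`, and the sample values are assumptions chosen for illustration.

```python
import heapq
from typing import Dict, List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (preserves the ranking of L2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nns_column_scan(columns: Dict[str, list], query: Sequence[float],
                    k: int, projection: List[str]) -> List[dict]:
    """Scan only the vector column, then build tuples for the k winners."""
    # Step 1: the distance computation reads the 'feature' column only.
    scored = ((squared_l2(vec, query), pos)
              for pos, vec in enumerate(columns["feature"]))
    top = heapq.nsmallest(k, scored)
    # Step 2: the remaining attributes are fetched for k rows, not for all rows.
    results = []
    for dist, pos in top:
        row = {name: columns[name][pos] for name in projection}
        row["distance"] = dist
        results.append(row)
    return results


# Hypothetical entity stored column-wise: one list per column.
columns = {
    "id": ["v1", "v2", "v3", "v4"],
    "feature": [[1.0, 1.0], [0.0, 0.1], [0.5, 0.5], [0.0, 0.0]],
    "path": ["/media/1.mp4", "/media/2.mp4", "/media/3.mp4", "/media/4.mp4"],
}

matches = nns_column_scan(columns, query=[0.0, 0.0], k=2,
                          projection=["id", "path"])
print(matches)
```

In this model, the `id` and `path` columns are read only for the two best matches, whereas a row-oriented scan would have to load every attribute of every tuple.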
Getting Started
Cottontail DB can be built from source [8], but we also offer pre-built releases and a Docker image. Releases are available for download from GitHub [9]. All you need to run Cottontail DB is a Java Virtual Machine (JVM) with Java 8 or higher. Once you have downloaded the latest release package, unpack the TAR file and start Cottontail DB using the following command:
cottontaildb-bin/bin/cottontaildb /path/to/config.json
You must provide the path to a config.json file as a program argument. In this file, you can adjust various settings such as the path to Cottontail DB’s data (root) folder or the gRPC server port. The only required setting is the data folder location. An example is given in the following figure.
Once you have started Cottontail DB, you should see the terminal output shown in the next figure, and the CLI should come up. You can type help to get a list of all available CLI commands.
When using the Docker image from DockerHub [10], it is important to expose the default port 1865 to the host machine, so the minimal run command is:
docker run -it -p 1865:1865 vitrivr/cottontaildb
Furthermore, we recommend mapping the data directory from the host into the container. For more information on setup considerations, we refer to the official Wiki [11].
Once set up, Cottontail DB can be used like any other database, with the exception that communication takes place via gRPC [12] rather than SQL. The definitions for the gRPC stubs can be found online [13] and can be compiled for all platforms supported by gRPC. In addition, there is a simplified client library for Java and Kotlin, which can be included as a dependency and is available from Maven Central. All the necessary functionality such as data definition (i.e., creating and changing database objects), data management (i.e., inserting, updating, and deleting data), and querying is exposed via dedicated gRPC endpoints. There is a dedicated repository with Java and Kotlin examples [14].
The following figure gives a very simple example of an NNS query on a collection called features_averagecolor in a schema called cineast using the Kotlin client library. The id and distance of the ten most similar matches are selected and printed to the terminal.
Conclusion and Outlook
While Cottontail DB has already proven to be very useful in several areas, work on the system is far from done. Currently, we are working on several projects involving clustering and online queries for data analysis over an extended period of time. We also plan to make use of the new Vector API (JEP 338) introduced in Java 16 to speed up distance calculation for NNS using SIMD instructions. We are also working on a faster, more robust storage engine called HARE.
If you have any suggestions on how Cottontail DB could help you in your work, we would love to hear from you, and we are always open to feedback. If you are already using Cottontail DB, please get in touch as well, since we love hearing success stories. And of course, contributions to this project are always welcome!
References
[1] Gasser, R., Rossetto, L., Heller, S., & Schuldt, H. (2020). Cottontail DB: An Open Source Database System for Multimedia Retrieval and Analysis. MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020. https://doi.org/10.1145/3394171.3414538
[2] Gasser, R., Rossetto, L., & Schuldt, H. (2019). Multimodal multimedia retrieval with Vitrivr. Proceedings of the 2019 on International Conference on Multimedia Retrieval, 391–394. https://dl.acm.org/doi/abs/10.1145/3323873.3326921
[3] Gurrin, C., Le, T.-K., Ninh, V.-T., Dang-Nguyen, D.-T., Jónsson, B. Þ., Lokoč, J., Hürst, W., Tran, M.-T., & Schoeffmann, K. (2020). An Introduction to the Third Annual Lifelog Search Challenge, LSC’20. International Conference on Multimedia Retrieval. ACM. https://dl.acm.org/doi/abs/10.1145/3372278.3388043
[4] Heller, S., Amiri Parian, M., Gasser, R., Sauter, L., & Schuldt, H. (2020). Interactive lifelog retrieval with vitrivr. Proceedings of the Third Annual Workshop on Lifelog Search Challenge. https://dl.acm.org/doi/abs/10.1145/3379172.3391715
[5] Heller, S., Gasser, R., Illi, C., Pasquinelli, M., Sauter, L., Spiess, F., & Schuldt, H. (2021). Towards explainable interactive multi-modal video retrieval with vitrivr. International Conference on Multimedia Modeling. https://rdcu.be/chwVs
[6] Jegou, H., Douze, M., & Schmid, C. (2010). Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 117–128. https://doi.org/10.1109/TPAMI.2010.57
[7] Rossetto, L., Giangreco, I., Gasser, R., & Schuldt, H. (2018). Open-source column: content-based multimedia retrieval using vitrivr. ACM SIGMultimedia Records, 9(3). https://dl.acm.org/doi/abs/10.1145/3178422.3178430
[8] https://github.com/vitrivr/cottontaildb, last accessed 2021-03-25
[9] https://github.com/vitrivr/cottontaildb/releases/, last accessed 2021-03-25
[10] https://hub.docker.com/r/vitrivr/cottontaildb, last accessed 2021-03-25
[11] https://github.com/vitrivr/cottontaildb/wiki, last accessed 2021-03-25
[12] https://grpc.io/, last accessed 2021-03-25
[13] https://github.com/vitrivr/cottontaildb-proto, last accessed 2021-03-25
[14] https://github.com/vitrivr/cottontaildb-examples, last accessed 2021-03-25