VQEG Column: VQEG Meeting November 2024

Introduction

The last plenary meeting of the Video Quality Experts Group (VQEG) was held online by the Institute for Telecommunication Sciences (ITS) of the National Telecommunications and Information Administration (NTIA) from November 18th to 22nd, 2024. The meeting was attended by 70 participants from industry and academic institutions in 17 different countries worldwide.

The meeting was dedicated to presenting updates and discussing topics related to the ongoing projects within VQEG. All the related information, minutes, and files from the meeting are available online on the VQEG meeting website, and video recordings of the meeting are available on YouTube.

All the topics mentioned below may be of interest to the SIGMM community working on quality assessment, but special attention can be devoted to the creation of a new group focused on Subjective and objective assessment of GenAI content (SOGAI) and to the recent contribution of the Immersive Media Group (IMG) to the International Telecommunication Union (ITU) towards the Rec. ITU-T P.IXC for the evaluation of Quality of Experience (QoE) of immersive interactive communication systems. Finally, it is worth noting that Ioannis Katsavounidis (Meta, US) joins Kjell Brunnström (RISE, Sweden) as co-chair of VQEG, replacing Margaret Pinson (NTIA/ITS).

Readers of these columns interested in the ongoing projects of VQEG are encouraged to subscribe to the corresponding reflectors to follow the activities and to get involved in them.

Group picture of the online meeting

Overview of VQEG Projects

Audiovisual HD (AVHD)

The AVHD group works on developing and validating subjective and objective methods to analyze commonly available video systems. In this meeting, Lucjan Janowski (AGH University of Krakow, Poland) and Margaret Pinson (NTIA/ITS) presented their proposal to refine the wording related to experiment realism and validity, drawing on experience from the psychology domain, which addresses the important question of how far results from laboratory experiments can be generalized outside the laboratory.

In addition, given that there are no current joint activities in the group, the AVHD project will become dormant, with the possibility of being reactivated when new activities are planned.

Statistical Analysis Methods (SAM)

The SAM group investigates analysis methods both for the results of subjective experiments and for objective quality models and metrics. In addition to a discussion on the future activities of the group, led by its chairs Ioannis Katsavounidis (Meta, US), Zhi Li (Netflix, US), and Lucjan Janowski (AGH University of Krakow, Poland), several presentations were delivered during the meeting.

No Reference Metrics (NORM)

The group NORM addresses a collaborative effort to develop no-reference metrics for monitoring visual service quality. In this sense, Ioannis Katsavounidis (Meta, US) and Margaret Pinson (NTIA/ITS) summarized recent discussions within the group on developing best practices for subjective test methods when analyzing Artificial Intelligence (AI) generated images and videos. This discussion resulted in the creation of a new VQEG project called Subjective and objective assessment of GenAI content (SOGAI) to investigate subjective and objective methods to evaluate the content produced by generative AI approaches.

Emerging Technologies Group (ETG)

The ETG group focuses on various aspects of multimedia that, although not necessarily directly related to “video quality”, can indirectly impact the work carried out within VQEG and are not addressed by any of the existing VQEG groups. In particular, this group aims to provide a common platform for people to come together and discuss new emerging topics, possible collaborations in the form of joint survey papers, funding proposals, etc. During this meeting, Abhijay Ghildyal (Portland State University, US), Saman Zadtootaghaj (Sony Interactive Entertainment, Germany), and Nabajeet Barman (Sony Interactive Entertainment, UK) presented their work on quality assessment of AI-generated content and AI-enhanced content. In addition, Matthias Wien (RWTH Aachen University, Germany) presented the approach, design, and methodology for the evaluation of AI-based Point Cloud Compression in the corresponding Call for Proposals in MPEG. Finally, Abhijay Ghildyal (Portland State University, US) presented his work on how foundation models boost low-level perceptual similarity metrics, investigating the potential of using intermediate features or activations from these models for low-level image quality assessment, and showing that such metrics can outperform existing ones without requiring additional training.

Joint Effort Group (JEG) – Hybrid

The JEG group addresses several areas of Video Quality Assessment (VQA), such as the creation of a large dataset for training such models using full-reference metrics instead of subjective metrics. In addition, the group includes the VQEG project Implementer’s Guide for Video Quality Metrics (IGVQM). The chair of this group, Enrico Masala (Politecnico di Torino, Italy), presented updates on the latest activities, including the plans for experiments within the IGVQM project to get feedback from other VQEG members.

In addition to this, Lohic Fotio Tiotsop (Politecnico di Torino, Italy) delivered two presentations. The first one focused on the prediction of the opinion score distribution via AI-based observers in media quality assessment, while the second one analyzed unexpected scoring behaviors in image quality assessment comparing controlled and crowdsourced subjective tests.

Immersive Media Group (IMG)

The IMG group researches the quality assessment of immersive media technologies. Currently, the main joint activity of the group is the development of a test plan to evaluate the QoE of immersive interactive communication systems, which is carried out in collaboration with ITU-T through the work item P.IXC. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Marta Orduna (Nokia XR Lab, Spain), and Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain) presented the status of the Rec. ITU-T P.IXC, which the group has been drafting based on the joint test plan developed over the last months and which was submitted to ITU for discussion at its meeting in January 2025.

Also, in relation to this test plan, Lucjan Janowski (AGH University of Krakow, Poland) and Margaret Pinson (NTIA/ITS) presented an overview of ITU recommendations for interactive experiments that can be used in the IMG context.

In relation to other topics addressed by IMG, Emin Zerman (Mid Sweden University, Sweden) delivered two presentations. The first one presented the BASICS dataset, which contains a representative range of nearly 1,500 point clouds assessed by thousands of participants to enable robust quality assessments for 3D scenes. The approach involved a careful selection of diverse source scenes and the application of specific “distortions” to simulate real-world compression impacts, including traditional and learning-based methods. The second presentation described a spherical light field database (SLFDB) for immersive telecommunication and telepresence applications, which comprises 60-view omnidirectional captures across 20 scenes, providing a comprehensive basis for telepresence research.

Quality Assessment for Computer Vision Applications (QACoViA)

The QACoViA group addresses the study of the visual quality requirements for computer vision methods, where the final user is an algorithm. In this meeting, Mehr un Nisa (AGH University of Krakow, Poland) presented a comparative performance analysis of deep learning architectures in underwater image classification. In particular, the study assessed the performance of the VGG-16, EfficientNetB0, and SimCLR models in classifying 5,000 underwater images. The results reveal each model’s strengths and weaknesses, providing insights for future improvements in underwater image analysis.

5G Key Performance Indicators (5GKPI)

The 5GKPI group studies the relationship between key performance indicators of new 5G networks and the QoE of video services running on top of them. In this meeting, Pablo Pérez (Nokia XR Lab, Spain), Francois Blouin (Meta, US), and others presented the progress on the 5G-KPI White Paper, sharing some of the ideas on QoS-to-QoE modeling that the group has been working on to get feedback from other VQEG members.

Multimedia Experience and Human Factors (MEHF)

The MEHF group focuses on the human factors influencing audiovisual and multimedia experiences, facilitating a comprehensive understanding of how human factors impact the perceived quality of multimedia content. In this meeting, Dominika Wanat (AGH University of Krakow, Poland) presented MANIANA (Mobile Appliance for Network Interrupting, Analysis & Notorious Annoyance), an IoT device for testing the QoS and QoE of applications under home network conditions. The device is based on a Raspberry Pi 4 minicomputer and open-source solutions and enables safe, robust, and universal testing of applications.

Other updates

Apart from this, it is worth noting that, although no progress was presented in this meeting, the Quality Assessment for Health Applications (QAH) group is still active and focused on the quality assessment of health applications. It addresses subjective evaluation, generation of datasets, development of objective metrics, and task-based approaches.

In addition, the Computer Generated Imagery (CGI) project became dormant, since its recent activities can be covered by other existing groups such as ETG and SOGAI.

Also, in this meeting Margaret Pinson (NTIA/ITS) stepped down as co-chair of VQEG and Ioannis Katsavounidis (Meta, US) is the new co-chair together with Kjell Brunnström (RISE, Sweden).

Finally, as already announced on the VQEG website, the next VQEG plenary meeting will be hosted by Meta at Meta’s Menlo Park campus, California, in the United States from May 5th to 9th, 2025. For more information see: https://vqeg.org/meetings-home/vqeg-meeting-information/

MPEG Column: 150th MPEG Meeting (Virtual/Online)

The 150th MPEG meeting was held online from 31 March to 04 April 2025. The official press release can be found here. This column provides the following highlights:

  • Requirements: MPEG-AI strategy and white paper on MPEG technologies for metaverse
  • JVET: Draft Joint Call for Evidence on video compression with capability beyond Versatile Video Coding (VVC)
  • Video: Gaussian splat coding and video coding for machines
  • Audio: Audio coding for machines
  • 3DGH: 3D Gaussian splat coding

MPEG-AI Strategy

The MPEG-AI strategy envisions a future where AI and neural networks are deeply integrated into multimedia coding and processing, enabling transformative improvements in how digital content is created, compressed, analyzed, and delivered. By positioning AI at the core of multimedia systems, MPEG-AI seeks to enhance both content representation and intelligent analysis. This approach supports applications ranging from adaptive streaming and immersive media to machine-centric use cases like autonomous vehicles and smart cities. AI is employed to optimize coding efficiency, generate intelligent descriptors, and facilitate seamless interaction between content and AI systems. The strategy builds on foundational standards such as ISO/IEC 15938-13 (CDVS), 15938-15 (CDVA), and 15938-17 (Neural Network Coding), which collectively laid the groundwork for integrating AI into multimedia frameworks.

Currently, MPEG is developing a family of standards under the ISO/IEC 23888 series that includes a vision document, machine-oriented video coding, and encoder optimization for AI analysis. Future work focuses on feature coding for machines and AI-based point cloud compression to support high-efficiency 3D and visual data handling. These efforts reflect a paradigm shift from human-centric media consumption to systems that also serve intelligent machine agents. MPEG-AI maintains compatibility with traditional media processing while enabling scalable, secure, and privacy-conscious AI deployments. Through this initiative, MPEG aims to define the future of multimedia as an intelligent, adaptable ecosystem capable of supporting complex, real-time, and immersive digital experiences.

MPEG White Paper on Metaverse Technologies

The MPEG white paper on metaverse technologies (cf. MPEG white papers) outlines the pivotal role of MPEG standards in enabling immersive, interoperable, and high-quality virtual experiences that define the emerging metaverse. It identifies core metaverse parameters – real-time operation, 3D experience, interactivity, persistence, and social engagement – and maps them to MPEG’s longstanding and evolving technical contributions. From early efforts like MPEG-4’s Binary Format for Scenes (BIFS) and Animation Framework eXtension (AFX) to MPEG-V’s sensory integration, and the advanced MPEG-I suite, these standards underpin critical features such as scene representation, dynamic 3D asset compression, immersive audio, avatar animation, and real-time streaming. Key technologies like point cloud compression (V-PCC, G-PCC), immersive video (MIV), and dynamic mesh coding (V-DMC) demonstrate MPEG’s capacity to support realistic, responsive, and adaptive virtual environments. Recent efforts include neural network compression for learned scene representations (e.g., NeRFs), haptic coding formats, and scene description enhancements, all geared toward richer user engagement and broader device interoperability.

The document highlights five major metaverse use cases – virtual environments, immersive entertainment, virtual commerce, remote collaboration, and digital twins – all supported by MPEG innovations. It emphasizes the foundational role of MPEG-I standards (e.g., Parts 12, 14, 29, 39) for synchronizing immersive content, representing avatars, and orchestrating complex 3D scenes across platforms. Future challenges identified include ensuring interoperability across systems, advancing compression methods for AI-assisted scenarios, and embedding security and privacy protections. With decades of multimedia expertise and a future-focused standards roadmap, MPEG positions itself as a key enabler of the metaverse – ensuring that emerging virtual ecosystems are scalable, immersive, and universally accessible.

The MPEG white paper on metaverse technologies highlights several research opportunities, including efficient compression of dynamic 3D content (e.g., point clouds, meshes, neural representations), synchronization of immersive audio and haptics, real-time adaptive streaming, and scene orchestration. It also points to challenges in standardizing interoperable avatar formats, AI-enhanced media representation, and ensuring seamless user experiences across devices. Additional research directions include neural network compression, cross-platform media rendering, and developing perceptual metrics for immersive Quality of Experience (QoE).

Draft Joint Call for Evidence (CfE) on Video Compression beyond Versatile Video Coding (VVC)

The latest JVET AHG report on ECM software development (AHG6), documented as JVET-AL0006, shows promising results. Specifically, in the “Overall” row and “Y” column, there is a 27.06% improvement in coding efficiency compared to VVC, as shown in the figure below.

The Draft Joint Call for Evidence (CfE) on video compression beyond VVC (Versatile Video Coding), identified as document JVET-AL2026 | N 355, is being developed to explore new advancements in video compression. The CfE seeks evidence in three main areas: (a) improved compression efficiency and associated trade-offs, (b) encoding under runtime constraints, and (c) enhanced performance in additional functionalities. This initiative aims to evaluate whether new techniques can significantly outperform the current state-of-the-art VVC standard in both compression and practical deployment aspects.

The visual testing will be carried out across seven categories, including various combinations of resolution, dynamic range, and use cases: SDR Random Access UHD/4K, SDR Random Access HD, SDR Low Bitrate HD, HDR Random Access 4K, HDR Random Access Cropped 8K, Gaming Low Bitrate HD, and UGC (User-Generated Content) Random Access HD. Sequences and rate points for testing have already been defined and agreed upon. For a fair comparison, rate-matched anchors using VTM (VVC Test Model) and ECM (Enhanced Compression Model) will be generated, with new configurations to enable reduced run-time evaluations. A dry-run of the visual tests is planned during the upcoming Daejeon meeting, with ECM and VTM as reference anchors, and the CfE welcomes additional submissions. Following this dry-run, the final Call for Evidence is expected to be issued in July, with responses due in October.

The Draft Joint Call for Evidence (CfE) on video compression beyond VVC invites research into next-generation video coding techniques that offer improved compression efficiency, reduced encoding complexity under runtime constraints, and enhanced functionalities such as scalability or perceptual quality. Key research aspects include optimizing the trade-off between bitrate and visual fidelity, developing fast encoding methods suitable for constrained devices, and advancing performance in emerging use cases like HDR, 8K, gaming, and user-generated content.

3D Gaussian Splat Coding

Gaussian splatting is a real-time radiance field rendering method that represents a scene using 3D Gaussians. Each Gaussian has parameters like position, scale, color, opacity, and orientation, and together they approximate how light interacts with surfaces in a scene. Instead of ray marching (as in NeRF), it renders images by splatting the Gaussians onto a 2D image plane and blending them using a rasterization pipeline, which is GPU-friendly and much faster. Developed by Kerbl et al. (2023), it is capable of real-time rendering (60+ fps) and outperforms previous NeRF-based methods in speed and visual quality. Gaussian splat coding refers to the compression and streaming of 3D Gaussian representations for efficient storage and transmission. It is an active research area and under standardization consideration in MPEG.
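
To make the representation a bit more concrete, the toy sketch below (Python with NumPy) renders a handful of isotropic 3D Gaussians by projecting them with a pinhole camera and compositing them front to back with alpha blending. It is only a minimal illustration of the splatting idea under simplifying assumptions (isotropic Gaussians, fixed camera at the origin, plain per-Gaussian RGB color instead of spherical harmonics); it is not the GPU rasterization pipeline of Kerbl et al., and all names are illustrative.

```python
# Toy CPU "splatting" of isotropic 3D Gaussians onto a 2D image plane.
# Illustration only: a real renderer (Kerbl et al., 2023) uses anisotropic
# covariances, spherical harmonics for color, and a tiled GPU rasterizer.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian:
    mean: np.ndarray      # (3,) world-space position
    scale: float          # isotropic std. dev. in world units (simplification)
    color: np.ndarray     # (3,) RGB in [0, 1]
    opacity: float        # base opacity in [0, 1]

def render(gaussians, width=128, height=128, focal=100.0):
    """Project Gaussians with a pinhole camera at the origin (looking along +z)
    and composite them front to back with alpha blending."""
    image = np.zeros((height, width, 3))
    transmittance = np.ones((height, width))       # how much light still passes
    ys, xs = np.mgrid[0:height, 0:width]

    # Sort by depth so nearer splats are composited first (front to back).
    for g in sorted(gaussians, key=lambda g: g.mean[2]):
        if g.mean[2] <= 0:
            continue                               # behind the camera
        # Pinhole projection of the center and of the (isotropic) footprint.
        cx = focal * g.mean[0] / g.mean[2] + width / 2
        cy = focal * g.mean[1] / g.mean[2] + height / 2
        sigma = focal * g.scale / g.mean[2]        # screen-space std. dev.
        # 2D Gaussian footprint ("splat") weighted by the base opacity.
        alpha = g.opacity * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        image += (transmittance * alpha)[..., None] * g.color
        transmittance *= (1.0 - alpha)             # remaining transparency
    return image

# Example: a red splat in front of a green one.
scene = [
    Gaussian(np.array([0.0, 0.0, 5.0]), 0.2, np.array([1.0, 0.0, 0.0]), 0.8),
    Gaussian(np.array([0.3, 0.0, 8.0]), 0.4, np.array([0.0, 1.0, 0.0]), 0.8),
]
img = render(scene)
```

Gaussian splat coding then concerns how the per-Gaussian parameters in such a representation can be compressed and streamed efficiently.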

The MPEG Technical Requirements working group, together with the MPEG Video working group, started an exploration on Gaussian splat coding, while the MPEG Coding of 3D Graphics and Haptics (3DGH) working group addresses 3D Gaussian splat coding. Draft Gaussian splat coding use cases and requirements are available, and various joint exploration experiments (JEEs) are conducted between meetings.

(3D) Gaussian splat coding is actively researched in academia, also in the context of streaming, e.g., in “LapisGS: Layered Progressive 3D Gaussian Splatting for Adaptive Streaming” or “LTS: A DASH Streaming System for Dynamic Multi-Layer 3D Gaussian Splatting Scenes”. The research aspects of 3D Gaussian splat coding and streaming span a wide range of areas across computer graphics, compression, machine learning, and systems for real-time immersive media, focusing in particular on efficiently representing and transmitting Gaussian-based neural scene representations for real-time rendering. Key areas include compression of Gaussian parameters (position, scale, color, opacity), perceptual and geometry-aware optimizations, and neural compression techniques such as learned latent coding. Streaming challenges involve adaptive, view-dependent delivery, level-of-detail management, and low-latency rendering on edge or mobile devices. Additional research directions include standardizing file formats, integrating with scene graphs, and ensuring interoperability with existing 3D and immersive media frameworks.

MPEG Audio and Video Coding for Machines

The Call for Proposals on Audio Coding for Machines (ACoM), issued by the MPEG audio coding working group, aims to develop a standard for efficiently compressing audio, multi-dimensional signals (e.g., medical data), or extracted features for use in machine-driven applications. The standard targets use cases such as connected vehicles, audio surveillance, diagnostics, health monitoring, and smart cities, where vast data streams must be transmitted, stored, and processed with low latency and high fidelity. The ACoM system is designed in two phases: the first focusing on near-lossless compression of audio and metadata to facilitate training of machine learning models, and the second expanding to lossy compression of features optimized for specific applications. The goal is to support hybrid consumption – by machines and, where needed, humans – while ensuring interoperability, low delay, and efficient use of storage and bandwidth.

The CfP outlines technical requirements, submission guidelines, and evaluation metrics. Participants must provide decoders compatible with Linux/x86 systems, demonstrate performance through objective metrics like compression ratio, encoder/decoder runtime, and memory usage, and undergo a mandatory cross-checking process. Selected proposals will contribute to a reference model and working draft of the standard. Proponents must register by August 1, 2025, with submissions due in September, and evaluation taking place in October. The selection process emphasizes lossless reproduction, metadata fidelity, and significant improvements over a baseline codec, with a path to merge top-performing technologies into a unified solution for standardization.

Research aspects of Audio Coding for Machines (ACoM) include developing efficient compression techniques for audio and multi-dimensional data that preserve key features for machine learning tasks, optimizing encoding for low-latency and resource-constrained environments, and designing hybrid formats suitable for both machine and human consumption. Additional research areas involve creating interoperable feature representations, enhancing metadata handling for context-aware processing, evaluating trade-offs between lossless and lossy compression, and integrating machine-optimized codecs into real-world applications like surveillance, diagnostics, and smart systems.

The MPEG video coding working group approved the committee draft (CD) for ISO/IEC 23888-2 video coding for machines (VCM). VCM aims to encode visual content in a way that maximizes machine task performance, such as computer vision, scene understanding, autonomous driving, smart surveillance, robotics and IoT. Instead of preserving photorealistic quality, VCM seeks to retain features and structures important for machines, possibly at much lower bitrates than traditional video codecs. The CD introduces several new tools and enhancements aimed at improving machine-centric video processing efficiency. These include updates to spatial resampling, such as the signaling of the inner decoded picture size to better support scalable inference. For temporal resampling, the CD enables adaptive resampling ratios and introduces pre- and post-filters within the temporal resampler to maintain task-relevant temporal features. In the filtering domain, it adopts bit depth truncation techniques – integrating bit depth shifting, luma enhancement, and chroma reconstruction – to optimize both signaling efficiency and cross-platform interoperability. Luma enhancement is further refined through an integer-based implementation for luma distribution parameters, while chroma reconstruction is stabilized across different hardware platforms. Additionally, the CD proposes removing the neural network-based in-loop filter (NNLF) to simplify the pipeline. Finally, in terms of bitstream structure, it adopts a flattened structure with new signaling methods to support efficient random access and better coordination with system layers, aligning with the low-latency, high-accuracy needs of machine-driven applications.
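
As a purely illustrative example of the bit-depth truncation idea mentioned above, the sketch below shifts 10-bit luma samples down to 8 bits before the inner codec and shifts them back (with a rounding offset) at the decoder side. This is a minimal sketch under assumed parameters, not the normative VCM pre- and post-filters, which additionally include luma enhancement and chroma reconstruction.

```python
# Illustrative (non-normative) sketch of bit-depth truncation/reconstruction:
# reduce 10-bit samples to 8 bits before the inner codec, restore them after
# decoding. Shift values and names are assumptions for illustration only.
import numpy as np

def truncate_bit_depth(frame_10bit: np.ndarray, shift: int = 2) -> np.ndarray:
    """Pre-filter: right-shift 10-bit samples down to 8 bits."""
    return (frame_10bit >> shift).astype(np.uint8)

def reconstruct_bit_depth(frame_8bit: np.ndarray, shift: int = 2) -> np.ndarray:
    """Post-filter: shift back up and add half a quantization step to reduce bias."""
    return (frame_8bit.astype(np.uint16) << shift) + (1 << (shift - 1))

luma = np.random.randint(0, 1024, size=(4, 4), dtype=np.uint16)  # fake 10-bit block
coded = truncate_bit_depth(luma)         # what the inner codec would operate on
restored = reconstruct_bit_depth(coded)  # decoder-side reconstruction
print(np.abs(restored.astype(int) - luma.astype(int)).max())  # error bounded by 2**shift
```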

Research in VCM focuses on optimizing video representation for downstream machine tasks, exploring task-driven compression techniques that prioritize inference accuracy over perceptual quality. Key areas include joint video and feature coding, adaptive resampling methods tailored to machine perception, learning-based filter design, and bitstream structuring for efficient decoding and random access. Other important directions involve balancing bitrate and task accuracy, enhancing robustness across platforms, and integrating machine-in-the-loop optimization to co-design codecs with AI inference pipelines.

Concluding Remarks

The 150th MPEG meeting marks significant progress across AI-enhanced media, immersive technologies, and machine-oriented coding. With ongoing work on MPEG-AI, metaverse standards, next-gen video compression, Gaussian splat representation, and machine-friendly audio and video coding, MPEG continues to shape the future of interoperable, intelligent, and adaptive multimedia systems. The research opportunities and standardization efforts outlined in this meeting provide a strong foundation for innovations that support real-time, efficient, and cross-platform media experiences for both human and machine consumption.

The 151st MPEG meeting will be held in Daejeon, Korea, from 30 June to 04 July 2025. Click here for more information about MPEG meetings and their developments.

CASTLE 2024: A Collaborative Effort to Create a Large Multimodal Multi-perspective Daily Activity Dataset

This report describes the CASTLE 2024 event, a collaborative effort to create a point-of-view (PoV) 4K video dataset recorded by a dozen people in parallel over several days. The participating content creators wore a GoPro and a Fitbit for approximately 12 hours each day while engaging in typical daily activities. The event took place in Ballyconneely, Ireland, and lasted for four days. The resulting data is publicly available and can be used for papers, studies, and challenges in the multimedia domain in the coming years. A preprint of the paper presenting the resulting dataset is available on arXiv (https://arxiv.org/abs/2503.17116).

Introduction

Motivated by the need for a real-world PoV video dataset, a group of co-organizers of the annual VBS and LSC challenges came together to hold an invitation-only workshop and generate a novel PoV video dataset. In the first week of December 2024, twelve researchers from the multimedia community gathered in a remote house in Ballyconneely, Ireland, with the goal of creating a large multi-view and multimodal lifelogging video dataset. Equipped with a Fitbit on their wrist and a GoPro Hero 13 on their head for about 12 hours a day, and with five fixed cameras capturing the environment, they began a journey of 4K lifelogging. They lived together for four full days and performed typical everyday tasks, such as cooking, eating, washing dishes, talking, discussing, reading, watching TV, as well as playing games (ranging from paper plane folding and darts to quizzes). While this sounds very enjoyable, the whole event required a lot of effort, discipline, and meticulous planning – in terms of food and, more importantly, the data acquisition, data storage, data synchronization, avoiding the use of any copyrighted material (books, movies, songs, etc.), limiting the use of smartphones and laptops for privacy reasons, and making the content as diverse as possible. Figure 1 gives an impression of the event and shows different activities by the participants.

Figure 1: Participants at CASTLE 2024, having a light dinner and playing cards.

Organisational Procedure

Months before the event, we started planning the recording equipment, the participants, the activities, as well as the food.

The first challenge was figuring out a way to make wearing a GoPro camera all day as simple and enjoyable as possible. This was realized by using the camera with an elastic strap for a strong hold, a specifically adapted rubber pad on the back side of the camera, and a USB-C cable to a large 20,000 mAh power bank that every participant carried in their pocket. At the end of the day, the Fitbits, the battery packs, and the SD cards of every participant were collected, approximately 4 TB of data was copied to an on-site NAS system, the SD cards were cleared, and the batteries fully charged, so that they were ready for use again the next morning.

We ended up with six people from Dublin City University and six international researchers, but only 10 of them wore recording equipment. Every participant was asked to prepare at least one breakfast, lunch, or dinner, and all the food and drinks were purchased a few days before the event.

After arrival at the house, every participant had to sign an agreement that all collected data could be publicly released and used for scientific purposes in the future.

CASTLE 2024 Multimodal Dataset

The dataset (https://castle-dataset.github.io/) that emerged from this collaborative effort contains heart rate and step logs of 10 people, 4K@50fps video streams from five fixed mounted cameras, as well as 4K video streams from 10 head-mounted cameras. The recording time per day is 7-12 hours per device, resulting in over 600 hours of video and about 8.5 TB of data after processing and more efficient re-encoding. The videos were split into one-hour-long parts that are aligned to start on the hour. This was achieved in a multi-stage process, using a machine-readable QR-code-based clock for initial rough alignment and a subsequent audio signal correlation analysis for fine alignment.
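
As an illustration of how such audio-based fine alignment can work in principle, the short sketch below estimates the time offset between two recordings by cross-correlating their normalized audio tracks. It is a minimal sketch assuming NumPy/SciPy and already roughly aligned, equally sampled mono tracks; function names are illustrative and this is not necessarily the exact pipeline used for the dataset.

```python
# Sketch: estimate the relative time offset between two recordings of the same
# scene via audio cross-correlation (fine alignment after rough QR-code sync).
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_offset_seconds(audio_a: np.ndarray, audio_b: np.ndarray, sample_rate: int) -> float:
    """Lag (in seconds) maximizing the cross-correlation of a against b.
    If b is a copy of a delayed by d seconds, the result is approximately -d."""
    a = (audio_a - audio_a.mean()) / (audio_a.std() + 1e-12)
    b = (audio_b - audio_b.mean()) / (audio_b.std() + 1e-12)
    corr = correlate(a, b, mode="full")
    lags = correlation_lags(len(a), len(b), mode="full")
    return lags[np.argmax(corr)] / sample_rate

# Synthetic check: b is a copy of a delayed by 0.5 s.
sr = 16000
a = np.random.randn(10 * sr)
b = np.concatenate([np.zeros(sr // 2), a])[: 10 * sr]
print(round(estimate_offset_seconds(a, b, sr), 3))  # ≈ -0.5
```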

The language spoken in the videos is mainly English with a few parts of (Swiss-)German and Vietnamese. The activities by the participants include:

  • preparing food and drinks
  • eating
  • washing dishes
  • cleaning up
  • discussing
  • hiding items
  • presenting and listening
  • drawing and painting
  • playing games (e.g., chess, darts, guitar, various card games, etc.)
  • reading (out loud)
  • watching TV (open-source videos)
  • having a walk
  • having a car-ride

Use Scenarios of the Dataset

The dataset can be used for content retrieval contests, such as the Lifelog Search Challenge (LSC) and the Video Browser Showdown (VBS), but also for automatic content recognition and annotation challenges, such as the CASTLE Challenge that will happen at ACM Multimedia 2025 (https://castle-dataset.github.io/).  

Further application scenarios include complex scene understanding, 3D reconstruction and localization, audio event prediction, source separation, human-human/machine interaction, and many more.

Challenges of Organizing the Event

As this was the first collaborative event to collect such a multi-view multimodal dataset, there were also some challenges that are worth mentioning and that may help other people who want to organize a similar event in the future.

First of all, the event turned out to be much more costly than originally planned. Reasons for this include the increased living/rental costs, the travel costs for international participants, but also expenses for technical equipment such as batteries, which we originally did not intend to use. Originally we wanted to organize the event in a real castle, but that turned out to be far too expensive without providing a significant gain.

For the participants it was also hard to maintain privacy throughout all the days, since not even quickly responding to emails was possible. When going for a walk or a car ride, we needed to make sure that other people or car license plates were not recorded.

In terms of the data, it should be mentioned that the different recording devices needed to be synchronized. This was achieved by regularly capturing dynamic QR codes showing the master time (or wall-clock time) and using these positions in all videos as temporal anchors during post-processing.
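
For illustration, the rough alignment step reduces to simple timestamp arithmetic once a QR code has been decoded from a frame: knowing the master time shown in the code, the frame index at which it appears, and the frame rate, the wall-clock start time of each recording (and hence the offset between any two recordings) follows directly. The sketch below assumes the QR decoding itself is done elsewhere (e.g., with OpenCV's QRCodeDetector or a similar library); all names and numbers are illustrative.

```python
# Sketch of the QR-code-based rough alignment: map a decoded master timestamp
# plus the frame index where it was seen to the recording's start time.
from datetime import datetime, timedelta

def start_time_of_recording(qr_time: datetime, frame_idx: int, fps: float) -> datetime:
    """Wall-clock time at which frame 0 of the recording was captured."""
    return qr_time - timedelta(seconds=frame_idx / fps)

def offset_between(start_a: datetime, start_b: datetime) -> float:
    """Seconds that recording B starts after recording A (negative = before)."""
    return (start_b - start_a).total_seconds()

# Example: camera A shows the master clock reading 10:00:00 at frame 1500 (50 fps),
# camera B shows 10:00:05 at frame 2000 (50 fps).
a_start = start_time_of_recording(datetime(2024, 12, 2, 10, 0, 0), 1500, 50.0)
b_start = start_time_of_recording(datetime(2024, 12, 2, 10, 0, 5), 2000, 50.0)
print(offset_between(a_start, b_start))  # -5.0: camera B started 5 s before camera A
```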

The data volume, together with the available transfer speed, was also an issue, and it required many hours during the nights to copy all the data from all SD cards.
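
To give a rough sense of why this took so long, a simple back-of-the-envelope calculation is shown below; the throughput value is an assumption, not a measurement from the event.

```python
# Hypothetical copy-time estimate: ~4 TB collected per day (see the
# organisational notes above) at an assumed effective throughput of ~110 MB/s
# (roughly a saturated gigabit link). Actual speeds at the event may have differed.
data_bytes = 4e12        # ~4 TB per night
throughput = 110e6       # assumed effective throughput in bytes/s
print(f"{data_bytes / throughput / 3600:.1f} hours")  # ≈ 10.1 hours
```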

Summary

The CASTLE 2024 event brought together twelve multimedia researchers in a remote house in Ireland for an intensive four-day data collection retreat, resulting in a rich multimodal 4K video dataset designed for lifelogging research. Equipped with head-mounted GoPro cameras and Fitbits, ten participants captured synchronized, real-world point-of-view footage while engaging in everyday activities like cooking, playing games, and discussing, with additional environmental video captured from fixed cameras. The team faced significant logistical challenges, including power management, synchronization, privacy concerns, and data storage, but ultimately produced over 600 hours of aligned video content. The dataset – freely available for scientific use – is intended to support future research and competitions focused on content-based video analysis, lifelogging, and human activity understanding.