Jesús Gutiérrez – ACM SIGMM Records

VQEG Column: VQEG Meeting December 2025

By Jesús Gutiérrez | June 9, 2026 - 10:06 |June 9, 2026 0226, Event Report, Feature, Standards

Introduction

The Video Quality Experts Group (VQEG) fall plenary meeting took place online from December 15^th to 19^th 2025. More than 130 participants registered to the meeting, coming from industry and academic institutions worldwide.

The meeting was dedicated to present updates and discuss about topics related to the ongoing projects within VQEG. All the related information, minutes, and files from the meeting are available online in the VQEG meeting website, and video recordings of the meeting are available in Youtube.

Several of the addressed topics can be of interest for the SIGMM community working on quality assessment, such as the shift toward Artificial Intelligence (AI) and neural-based media processing, requiring updated evaluation methodologies; the increasing emphasis on immersive media and XR systems, including the recently published recommendation ITU-T P.1321; the growing interest in user-centric Quality of Experience (QoE) modeling, including individualized quality prediction; and the continued development of datasets, tools, and statistical frameworks for reproducible research.

Readers of these columns interested in the ongoing projects of VQEG are encouraged to subscribe to their corresponding reflectors to follow the activities going on and to get involved in them

Overview of VQEG Projects

Immersive Media Group (IMG)

The IMG group continues its focus on quality assessment for immersive technologies. In this meeting, Marta Orduna (Nokia XR Lab, Spain) and Jesús Gutiérrez (Universidad Politécnica de Madrid, Spain) reported updates on the multi-laboratory test plan and next steps for immersive communication assessment. In this sense, a major achievement of the group was the publication of the recommendation ITU-T P.1321 “Interactive test methods for subjective assessment of extended reality communications”, approved in October 2025. In addition, the following presentations related to IMG topics were delivered:

Felix Immohr (RWTH Aachen University, Germany) introduced the ICS-MR dataset, providing standardized conversational scenarios for evaluating mixed reality communication systems.
Jakob Hartbrich and William Menz (RWTH Aachen University and TU Ilmenau, Germany) analyzed how encoding parameters, such as texture resolution and polygon count influence the perceived quality of volumetric avatars, showing that texture resolution and quality level are dominant factors.
Daniel Zielasko (Technical University of Denmark, Denmark) presented a cross-laboratory initiative to establish standardized evaluation protocols for cybersickness, aiming to improve reproducibility and enable large-scale datasets.
Another contribution focused on datasets and evaluation tools. Abhinav Bhattacharya (RWTH Aachen University, Germany) introduced AMIS, a multimodal dataset for eXtended Reality (XR) research including multiple representation formats, such as talking-head videos, full-body videos, volumetric avatars, and personalized animated avatars.
An additional work explored teleoperation and user experience, with Shirin Rafiei (RISE, Sweden) presenting a study showing that latency and field of view significantly impact user performance, while video quality mainly affects subjective confidence.

Statistical Analysis Methods (SAM)

The group SAM presented several contributions focused on their main topics, which are data analysis, subjective evaluation, and statistical modeling:

Ryan Lei (Meta, USA) provided an update on AV2 common test conditions and coding gains, including both subjective and objective evaluation methodologies.
Lucjan Janowski (AGH University of Krakow, Poland) introduced a simulation framework for evaluating subject screening methods, demonstrating that correlation-based approaches outperform traditional kurtosis-based techniques.
Dietmar Saupe (University of Konstanz, Germany) proposed the use of G-test statistical methods for validating models in subjective quality datasets, improving reliability in quality scale reconstruction.
A discussion led by Robert Grosso (NTIA/ITS, USA) addressed open challenges in crowdsourcing-based subjective quality evaluation, including whether such approaches can replace standardized lab-based methodologies.

Joint Effort Group (JEG) – Hybrid

The group JEG addresses several areas of Video Quality Assessment (VQA), and in this meeting it provided contributions focused on advanced modeling of perceptual quality and subjective evaluation methods. In particular, Lohic Fotio Tiotsop (Politecnico di Torino, Italy) introduced a novel attention-based model for predicting individual perceived image coding quality, moving beyond Mean Opinion Score (MOS) toward personalized QoE modeling. In a second contribution, he also presented ongoing work on the evaluation of high-resolution image quality, highlighting variability in no-reference metrics and the need for subjective validation.

Emerging Technologies Group (ETG)

The ETG group continued to highlight forward-looking topics in multimedia. In this meeting:

Abhijay Ghildyal, Saman Zadtootaghaj, and Nabajeet Barman (Sony Interactive Entertainment, Germany and USA) proposed a non-aligned reference image quality assessment framework for novel view synthesis, addressing the challenge of quality evaluation when pixel-perfect references are unavailable.
Mathias Wien (RWTH Aachen University, Germany) presented updates on JVET activities, including results from a Call for Evidence (CfE) on next-generation video compression beyond Versatile Video Coding (VVC) and the outlook for future standardization.

Subjective and objective assessment of GenAI content (SOGAI)

The SOGAI group researches on subjective testing methodologies and objective metrics for assessing the quality of GenAI-generated content. In this meeting, the following topcis were presented and discussed:

Ryan Lei (Meta, USA) presented an evaluation of AI-based image codecs, comparing neural compression approaches with traditional codecs and discussing challenges in measuring compression efficiency.
Benjamin Herb (TU Ilmenau, Germany) presented a large-scale subjective testing of neural codecs, highlighting that metric reliability is comparable between neural and traditional compression methods.

5G Key Performance Indicators (5GKPI)

The 5GKPI group focuses on networked multimedia systems and QoE modeling. In this meeting, the following contributions were presented:

Henrique Rossi and Karan Mitra (Lulea University of Technology, Sweden) proposed a Bayesian network framework for analyzing interactivity in cloud gaming, showing that latency is the dominant factor influencing QoE.
Pablo Pérez (Nokia XR Lab, Spain) presented proposed updates to the QoE definition in ITU-T SG12 recommendation P.10/G.100, raising awareness within VQEG.
Martín Varela (Metosin, Finland / University of Malaga, Spain) discussed broader concepts of QoE at the system level, including fairness and group-level QoE metrics.
Finally, François Blouin (Meta, USA) and Pablo Pérez (Nokia XR Lab, Spain) reported on the status and future directions of the VQEG White Paper on QoE management.

Multimedia Experience and Human Factors (MEHF)

The MEHF group contributions address human perception and subjective evaluation challenges. In this meeting the following presentations were delivered:

Kamran Javidi and Maria Martini (Kingston University London) presented a subjective study on a commercial light field display of the KULF-TT53 Dataset, showing that AV1 encoding outperforms HEVC in perceived quality for light field content.
Jingwen Zhu investigated resolution cross-over in adaptive streaming of live sports, demonstrating that pairwise comparison (PC) methodologies outperform traditional Absolute Category Rating (ACR) methods for identifying optimal encoding decisions.

Quality Assessment for Computer Vision Applications (QACoViA)

In this meeting, the QACoViA group explored deep learning approaches for image enhancement:

Mehrunnisa (AGH University of Krakow, Poland) presented a MOS-guided deep learning framework for underwater image enhancement, combining perceptual loss functions with subjective score optimization.
Doğukan Öztürk (AGH University of Krakow, Poland) further examined deep learning-based enhancement models, comparing multiple approaches for visual quality improvement.

Other updates

Additional presentations covered diverse topics related to quality assessment of audiovisual technologies:

Kjell Brunnström (RISE, Sweden) presented a study on digital rear-view mirrors highlighting the importance of augmented visual cues for depth perception in automotive systems.
Werner Robitza (AVEQ, Germany) introduced Videoparser-ng, a fast open-source bitstream parser for model development.
Margaret Pinson (NTIA/ITS, USA) presented a new camera dataset (JNR-ITScam) containing information from videographers on this video dataset being filmed on modern cameras.
Syed Uddin (AGH University of Krakow, Poland) presented a segment-Level QoE assessment dataset for adaptive bitrate video streaming.
Henrique Rossi and Karan Mitra (Lulea University of Technology, Sweden) made a demonstration of ALTRUIST, a multi-platform tool to conduct subjective tests efficiently, for conducting QoE subjective tests in immersive systems.
In terms of standardization efforts, MPEG representatives provided an overview of AG5 activities, including updates on datasets and evaluation methodologies, and there was an ITU-T Q19 interim meeting to discuss progresses on the recommendations of no-reference metrics (J.Noref).

Finally, as announced in the VQEG website, the next face-to-face VQEG plenary meeting was planned for May 2026.

VQEG Column: Finalization of Recommendation Series P.1204, a Multi-Model Video Quality Evaluation Standard – The New Standards P.1204.1 and P.1204.2

By Jesús Gutiérrez | December 17, 2025 - 13:08 |December 17, 2025 0425, 0425, Event Report, Feature, Standards

Leave a comment

Abstract

This column introduces the now completed ITU-T P.1204 video quality model standards for assessing sequences up to UHD/4K resolution. Initially developed over two years by ITU-T Study Group 12 (Question Q14/12) and VQEG, the work used a large dataset of 26 subjective tests (13 for training, 13 for validation), each involving at least 24 participants rating sequences on the 5-point ACR scale. The tests covered diverse encoding settings, bitrates, resolutions, and framerates for H.264/AVC, H.265/HEVC, and VP9 codecs. The resulting 5,000-sequence dataset forms the largest lab-based source for model development to date. Initially standardized were P.1204.3, a no-reference bitstream-based model with full bitstream access, P.1204.4, a pixel-based, reduced-/full-reference model, and P.1204.5, a no-reference hybrid model. The current record focuses on the latest additions to the series, namely P.1204.1, a parametric, metadata-based model using only information about which codec was used, plus bitrate, framerate and resolution, and P.1204.2, which in addition uses frame-size and frame-type information to include video-content aspects into the predictions.

Introduction

Video quality under specific encoding settings is central to applications such as VoD, live streaming, and audiovisual communication. In HTTP-based adaptive streaming (HAS) services, bitrate ladders define video representations across resolutions and bitrates, balancing screen resolution and network capacity. Video quality, a key contributor to users’ Quality of Experience (QoE), can vary with bandwidth fluctuations, buffer delays, or playback stalls.

While such quality fluctuations and broader QoE aspects are discussed elsewhere, this record focuses on short-term video quality as modeled by ITU-T P.1204 for HAS-type content. These models assess segments of around 10s under reliable transport (e.g., TCP, QUIC), covering resolution, framerate, and encoding effects, but excluding pixel-level impairments from packet loss under unreliable transport.

Because video quality is perceptual, subjective tests, laboratory or crowdsourced, remain essential, especially at high resolutions such as 4K UHD under controlled viewing conditions (1.5H or 1.6H viewing distance). Yet, studies show limited perceptual gain between HD and 4K, depending on source content, underlining the need for representative test materials. Given the high cost of such tests, objective (instrumental) models are required for scalable, automated assessment supporting applications like bitrate ladder design and service monitoring.

Four main model classes exist: metadata-based, bitstream-based, pixel-based, and hybrid. Metadata-based models use codec parameters (e.g., resolution, bitrate) and are lightweight; bitstream-based models analyze encoded streams without decoding, as in ITU-T P.1203 and P.1204.3 [1][2][3][7]. Pixel-based models compare decoded frames and include Full Reference and Reduced Reference models (e.g., P.1204.4, and also PSNR [9], SSIM [10], VMAF [11][12]), as well as No Reference variants. Finally, hybrid models combine pixel and bitstream or metadata inputs, exemplified by the ITU-T P.1204.5 standard. These three standards, P.1204.3 P.1204.4 and P.1204.5, formed the initial P.1204 Recommendation series finalized in 2020.

ITU-T P.1204 series completed with P.1204.1 and P.1204.2

The respective standardization project under the Work Item name P.NATS Phase 2 (read: Peanuts) was a unique video quality model development competition conducted in collaboration between ITU-T Study Group 12 (SG12) and the Video Quality Experts Group (VQEG). The target use cases were for up to UHD/4K resolution, with presentation on UHD/4K resolution PC/TV or Mobile/Tablet (MO/TA). For the first time, bitstream-, pixel-based, and hybrid models were jointly developed, trained, and validated, using a large common subjective dataset comprising 26 tests, each with at least 24 participants (see, e.g., [1] for details). The P.NATS Phase 2 work built on the earlier “P.NATS Phase 1” project, which resulted in the ITU-T Rec. P.1203 standards series (P.1203, P.1203.1, P.1203.2, P.1203.3). In the P.NATS Phase 2 project, video quality models in five different categories were evaluated, and different candidates were found to be eligible to be recommended as standards. The initially standardized three models out of the five categories were the aforementioned P.1204.3, P.1204.4 and P.1204.5. However, due to the lack of consensus between the winning proponents, no models were recommended as standards for the category “bitstream Mode 0” with access to high-level metadata only, such as the video codec, resolution, framerate and bitrate used, and “bitstream Mode 1”, with further access to frame-size information that can be used for content-complexity estimation.

For the latest model additions of P.1204.1 and P.1204.2, subsets of the databases initially used in the P.NATS Phase 2 project were employed for model training. Two different datasets belonging to the two contexts PC/TV and MO/TA were used for training the models. AVT-PNATS-UHD-1 is the dataset for the PC/TV use case and ERCS-PNATS-UHD-1 the dataset used for the MO/TA use case.

AVT-PNATS-UHD-1 [7] consists of four different subjective tests conducted by TU Ilmenau as part of the P.NATS Phase 2 competition. The target resolution of these datasets was 3840 x 2160 pixels. ERCS-PNATS-UHD-1 [1] is a dataset targeting the MO/TA use case. It consists of one subjective test conducted by Ericsson as part of the P.NATS Phase 2 competition. The target resolution of these datasets was 2560 x 1440 pixels.

For model performance evaluation, beyond AVT-PNATS-UHD-1, further externally available video-quality test databases were used, as outlined in the following.

AVT-VQDB-UHD-1: This is a publicly available dataset and consists of four different subjective tests. All the four tests had a full-factorial design. In total, 17 different SRCs with a duration of 7-10 s were used across all the four tests. All the sources had a resolution of 3840×2160 pixels and a framerate of 60 fps. For HRC design, bitrate was selected in fixed (i.e. non-adaptive) values per PVS between 200kbps and 40000kbps, resolution between 360p and 2160p and framerate between 15fps and 60fps. In all the tests, a 2-pass encoding approach was used to encode the videos, with medium preset for H.264 and H.265, and the speed parameter for VP9 set to the default value “0”. A total of 104 participants in the four tests.

GVS: This dataset consists of 24 SRCs that have been extracted from 12 different games. The SRCs are of 1920×1080 pixel resolution, 30fps framerate and have a duration of 30s . The HRC design included three different resolutions, namely, 480p, 720p and 1080p . 90 PVSs resulting from 15 bitrate-resolution pairs were used for subjective evaluation. A total of 25 participants rated all the 90 PVSs.

KUGVD: Six SRCs out of the 24 SRCs from the GVSwere used to develop KUGVD. The same bitrate-resolution pairs from GVS were included to define the HRCs. In total, 90 PVSs were used in the subjective evaluation and 17 participants took part in the test.

CGVDS: This dataset consists of SRCs captured at 60fps from 15 different games. For designing the HRCs, three resolutions, namely, 480p, 720p and 1080p at three different framerates of 20, 30, and 60fps were considered. To ensure that the SRCs from all the games could be assessed by test subjects, the overall test was split into 5 different subjective tests, with a minimum of 72 PVSs being rated in each of the tests. A total of over 100 participants took part over the five different tests, with a minimum of 20 participants per test.

Twitch: The Twitch Dataset consists of 36 different games, with 6 games each representing one out of 6 pre-defined genres. The dataset consists of streams directly downloaded from Twitch. A total of 351 video sequences of approximately 50s duration across all representations were downloaded. 90 video sequences out of these 351 video sequences were selected for subjective evaluation. Only the first 30s of the chosen 90 PVSs were considered for subjective testing. Six different resolutions between 160p and 1080p at framerates of 30 and 60fps were used. 29 participants rated all the 90 PVSs.

BBQCG: This is the training dataset developed as part of the P.BBQCG work item. This dataset consists of nine subjective test databases. Three out of these nine test databases consisted of processed video sequences (PVSs) up to 1080p/120fps and the remaining had PVSs up to 4K/60fps. Three codecs, namely, H.264, H.265, and AV1 were used to encode the videos. Overall 900 different PVSs were created from 12 sources (SRCs) by encoding the SRCs with different encoding settings.

AVT-VQDB-UHD-1-VD: This dataset consists of 16 source contents encoded using a CRF-based encoding approach. Overall 192 PVSs were generated by encoding all 16 sources in four resolutions, namely, 360p, 720p, 1080p, 2160p with three CRF values (22, 30, 38) each. A total of 40 subjects participate in the study.

ITU-T P.1204.1 and P.1204.2 model prediction performance

The performance figures of the two new models P.1204.1 and P.1204.2 models on the different datasets are indicated in Table 1 (P.1204.1) and Table 2 (P.1204.2) below.

Table 1: Performance of P.1204.1 (Mode 0) on the evaluation datasets, in terms of Root Mean Square Error (RMSE, measure used as winning criterion in the ITU-T/VQEG modelling competition). Pearson Correlation Coefficeint (PCC), Spearman Rank Correlation Coefficient (SRCC) and Kendall’s tau.

Dataset	RMSE	PCC	SRCC	Kendall
AVT-VQDB-UHD-1	0.499	0.890	0.877	0.684
KUGVD	0.840	0.590	0.570	0.410
GVS	0.690	0.670	0.650	0.490
CGVDS	0.470	0.780	0.750	0.560
Twitch	0.430	0.920	0.890	0.710
BBQCG	0.598 (on a 7-point scale)	0.841	0.843	0.647
AVT-VQDB-UHD-1-VD	0.650	0.814	0.813	0.617

Table 2: Performance of P.1204.1 (Mode 1) on the evaluation datasets, in terms of Root Mean Square Error (RMSE, measure used as winning criterion in the ITU-T/VQEG modelling competition). Pearson Correlation Coefficeint (PCC), Spearman Rank Correlation Coefficient (SRCC) and Kendall’s tau.

Dataset	RMSE	PCC	SRCC	Kendall
AVT-VQDB-UHD-1	0.476	0.901	0.900	0.730
KUGVD	0.500	0.870	0.860	0.690
GVS	0.420	0.890	0.870	0.710
CGVDS	0.360	0.900	0.880	0.690
Twitch	0.370	0.940	0.930	0.770
BBQCG	0.737 (on a 7-point scale)	0.745	0.746	0.547
AVT-VQDB-UHD-1-VD	0.598	0.845	0.845	0.654

For all databases except BBQCG and KUGVD, the Mode 0 model P.1204.1 performs in a solid way, as shown in Table 1. With the information about frame types and sizes available to the Mode 1 model P.1204.2, performance improves considerably, as shown in Table 2. For performance results of all three previously standardized models, P.1204.3, P.1204.4 and P.1204.5, the reader is referred to [1] and the individual standards, [4][5][6]. For the P.1204.3 model, complementary performance information is presented in, e.g., [2][7]. For P.1204.4, additional model performance information is available in [8], including results for AV1, AVS2, and VVC.

The following plots provide an illustration of how the new P.1204.1 Mode 0 model may be used. Here, bitrate-ladder-type graphs are presented, with the predicted Mean Opinion Score on a 5-point scale plotted over log bitrate.

Codec: H.264

Codec: H.265

Codec: VP9

Conclusions and Outlook

The P.1204 standard series now comprises the complete initially planned set of models, namely:

ITU-T P.1204.1: Bitstream Mode 0, i.e., metadata-based model with access to information about video codec, resolution, framerate and bitrate used.
ITU-T P.1204.2: Bitstream Mode 1, i.e., metadata-based model with access to information about video codec, resolution, framerate and bitrate used, plus information about video frame types and sizes.
ITU-T P.1204.3: Bitstream Mode 3 [1][2][3][7].
ITU-T P.1204.4: Pixel-based reduced- and full-reference [1][5][8].
ITU-T P.1204.5: Hybrid no-reference Mode 0 [1][6].

Extensions of some of these models beyond the initial scope of codecs (H.264/AVC, H.265/HEVC, VP9) have been included over the last few years. Here, P.1204.4 and P.1204.5 have been extended (P.1204.5) or evaluated (P.1204.4) to also cover the AV1 video codec. Work in ITU-T SG12 (Q14/12) is ongoing so as to also extend P.1204.1, P.1204.2 and P.1204.3 to newer codecs such as AV1, and all five models are planned to be extended so as to also cover VVC. It is noted that for P.1204.3, P.1204.4 and P.1204.5, also long-term quality integration modules that generate per-session scores for up to 5min long streaming sessions have been described in Appendices of the respective recommendations. For P.1204.1 and P.1204.2, this extension still has to be completed. Initial evaluations for similar Mode 0 and Mode 1 models that use the P.1204.3-type long-term integration can be found in [7].

References

[1] Raake, A., Borer, S., Satti, S.M., Gustafsson, J., Rao, R.R.R., Medagli, S., List, P., Göring, S., Lindero, D., Robitza, W. and Heikkilä, G., 2020. Multi-model standard for bitstream-, pixel-based and hybrid video quality assessment of UHD/4K: ITU-T P. 1204. IEEE Access, 8, pp.193020-193049.
[2] Rao, R.R.R., Göring, S., List, P., Robitza, W., Feiten, B., Wüstenhagen, U. and Raake, A., 2020, May. Bitstream-based model standard for 4K/UHD: ITU-T P. 1204.3—Model details, evaluation, analysis and open source implementation. In 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX) (pp. 1-6).
[3] ITU-T Rec. P.1204, 2025. Video quality assessment of streaming services over reliable transport for resolutions up to 4K. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[4] ITU-T Rec. P.1204.3, 2020. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full bitstream information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[5] ITU-T Rec. P.1204.4, 2022. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to full and reduced reference pixel information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[6] ITU-T Rec. P.1204.5, 2023. Video quality assessment of streaming services over reliable transport for resolutions up to 4K with access to transport and received pixel information. International Telecommunication Union (ITU-T), Geneva, Switzerland.
[7] Rao, R.R.R., Göring, S. and Raake, A., 2022. AVQBits – Adaptive video quality model based on bitstream information for various video applications. IEEE Access, 10, pp.80321-80351.
[8] Borer, S., 2022, September. Performance of ITU-T P. 1204.4 on Video Encoded with AV1, AVS2, VVC. In 2022 14th International Conference on Quality of Multimedia Experience (QoMEX) (pp. 1-4).
[9] Winkler, S. and Mohandas, P., 2008. The evolution of video quality measurement: From PSNR to hybrid metrics. IEEE transactions on Broadcasting, 54(3), pp.660-668.
[10] Wang, Z., Lu, L. and Bovik, A.C., 2004. Video quality assessment based on structural distortion measurement. Signal Processing: Image Communication, 19(2), pp.121-132.
[11] Li, Z., Aaron, A., Katsavounidis, I., Moorthy, A., and Manohara, M., 2016. Toward A Practical Perceptual Video Quality Metric, Netflix TechBlog.
[12] Li, Z., Swanson, K., Bampis, C., Krasula, L., and Aaron, A., 2020. Toward a Better Quality Metric for the Video Community, Netflix TechBlog.

VQEG Column: VQEG Meeting May 2025

By Jesús Gutiérrez | September 17, 2025 - 16:43 |September 17, 2025 0325, 0325, Event Report, Feature, Standards

Leave a comment

Introduction

From May 5^th to 9^th, 2025 Meta hosted the plenary meeting of the Video Quality Experts Group (VQEG) in their headquarters in Menlo Park (CA, United Sates). Around 150 participants registered to the meeting, coming from industry and academic institutions from 26 different countries worldwide.

All the topics mentioned bellow can be of interest for the SIGMM community working on quality assessment, but special attention can be devoted to the first activities of the group on Subjective and objective assessment of GenAI content (SOGAI) and to the advances on the contribution of the Immersive Media Group (IMG) group to the International Telecommunication Union (ITU) towards the Rec. ITU-T P.IXC for the evaluation of Quality of Experience (QoE) of immersive interactive communication systems.

Readers of these columns who are interested in VQEG’s ongoing projects are encouraged to subscribe to the corresponding mailing lists to stay informed and get involved.