An Evaluation of Video Quality Assessment Metrics for Passive Gaming Video Streaming

Nabajeet Barman, Kingston University, London, United Kingdom (n.barman@kingston.ac.uk)
Steven Schmidt, Quality and Usability Lab, TU Berlin, Berlin, Germany (steven.schmidt@tu-berlin.de)
Saman Zadtootaghaj, Deutsche Telekom AG, Berlin, Germany (saman.zadtootaghaj@telekom.de)
Maria G. Martini, Kingston University, London, United Kingdom (m.martini@kingston.ac.uk)
Sebastian Möller, Quality and Usability Lab, TU Berlin, Berlin, Germany (sebastian.moeller@tu-berlin.de)

ABSTRACT

Video quality assessment is imperative to estimate, and hence manage, the Quality of Experience (QoE) of video streaming applications for the end user. Recent years have seen tremendous advancement in the field of objective video quality assessment (VQA), with the development of models that can predict the quality of videos streamed over the Internet. However, no work so far has studied the performance of such quality assessment metrics on gaming videos, which are artificial and synthetic in nature and have different streaming requirements than traditionally streamed videos. Towards this end, we present in this paper a study of the performance of objective quality assessment metrics for gaming videos in passive streaming applications. Eight widely used VQA metrics are evaluated on a dataset of 24 reference videos and the 576 compressed sequences obtained by encoding them at 24 different resolution-bitrate pairs, and their performance behavior is analyzed. Our results indicate that VMAF predicts subjective video quality ratings best, while NIQE turns out to be a promising alternative as a no-reference metric in some scenarios.

CCS CONCEPTS
• Information systems → Multimedia streaming;

KEYWORDS
Gaming Video Streaming, Quality Assessment, QoE

ACM Reference format:
Nabajeet Barman, Steven Schmidt, Saman Zadtootaghaj, Maria G. Martini, and Sebastian Möller. 2018. An Evaluation of Video Quality Assessment Metrics for Passive Gaming Video Streaming. In Proceedings of the 23rd Packet Video Workshop (Packet Video'18), Amsterdam, Netherlands, June 12–15, 2018, 6 pages. DOI: 10.1145/3210424.3210434

1 INTRODUCTION

Gaming video streaming applications are becoming increasingly popular. They can be divided into two different, but related, categories: interactive and passive services. Interactive gaming video streaming, commonly known as cloud gaming, runs the actual gameplay on a cloud server; the user receives the rendered gameplay video on a client device and sends the corresponding game commands back. Such applications have received much attention, resulting in the rapid development and acceptance of these services [1]. Passive gaming video streaming, on the other hand, refers to applications such as Twitch.tv (https://www.twitch.tv/), where viewers watch the gameplay of other gamers. These applications have received far less attention from both the gaming and the video community, despite the fact that Twitch.tv, with its nine million subscribers and about 800 thousand concurrent viewers, is alone responsible for the fourth-highest peak Internet traffic in the USA [2]. With the increasing popularity of such services, along with the demand for other over-the-top services such as Netflix and YouTube, the demand on network resources has also increased. Therefore, to provide the end user with a service of reasonable Quality of Experience (QoE) and to satisfy the user expectation of anytime, anyplace, any-content video availability, it is necessary to optimize the video delivery process. For the assessment of video quality, subjective tests are typically carried out; however, such tests are time-consuming and expensive. Thus, numerous efforts are being made to predict video quality through objective video quality assessment (VQA) metrics.
Depending on the availability and the amount of reference information, VQA algorithms can be categorized into full-reference (FR), reduced-reference (RR), and no-reference (NR) metrics. So far, these metrics have been developed and tested on non-gaming videos, usually for video on demand (VoD) streaming applications. Moreover, some of the metrics, such as NIQE and BRISQUE, are based on Natural Scene Statistics (for details see Section 2). Gaming videos, on the other hand, are artificial and synthetic in nature and have different streaming requirements (1-pass, Constant Bitrate (CBR) encoding); the performance of these VQA metrics on such content therefore remains an open question. Our earlier study [3] found some differences in the performance of such metrics when comparing gaming videos to non-gaming videos. Towards this end, we present in this paper an evaluation and analysis of some of the most widely used VQA metrics. Since FR and RR metrics cannot be used in applications such as live video streaming, where reference information is absent, we provide a more detailed discussion of the performance of the NR metrics. We believe that the insights gained from this study will help to improve existing VQA metrics and to design better performing ones.

The remainder of the paper is organized as follows: Section 2 presents a discussion of the eight VQA metrics used in this work. Section 3 describes the dataset and the evaluation methodology. The results and main observations are presented in Section 4, and Section 5 concludes the paper.

2 OVERVIEW OF VQA METRICS

We start with a brief introduction of the eight VQA metrics considered in this work. The primary focus of this work is to evaluate the performance of existing VQA metrics on gaming video content, on which they have not been investigated so far.

2.1 FR Metrics

FR metrics require the availability of the full reference video. We selected Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM) [4], and Video Multi-Method Assessment Fusion (VMAF) [5] as our three FR metrics. Due to its simplicity and ease of computation, PSNR is one of the most widely used metrics for both image and video quality assessment. SSIM, which computes the structural similarity between two images, has been shown to correlate better with subjective judgement and is hence also widely used for both image and video quality assessment [4]. For video quality assessment, frame-level PSNR and SSIM scores are temporally pooled (usually averaged) over the video duration to obtain a single score, as sketched below. VMAF is a fusion-based metric which combines the scores of three elementary metrics into a single score between 0 and 100, with higher scores denoting higher quality. The choice of VMAF alongside PSNR and SSIM is motivated by our previous work, in which it showed a very high correlation with subjective scores [3].
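As an illustration of this temporal pooling, the following minimal Python sketch (our own, not part of the original evaluation pipeline) computes frame-level PSNR between a reference and a distorted sequence of 8-bit luma frames and averages the scores over the video duration:

```python
import numpy as np

def frame_psnr(ref, dist, max_val=255.0):
    """PSNR (dB) between two 8-bit luma frames; higher is better."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def video_psnr(ref_frames, dist_frames):
    """Temporally pooled (averaged) PSNR over a sequence of frame pairs."""
    scores = [frame_psnr(r, d) for r, d in zip(ref_frames, dist_frames)]
    return float(np.mean(scores))
```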
2.2 RR Metrics

Reduced-reference metrics are used when only partial information about the reference video is available. As such, they are generally less accurate than FR metrics, but they are useful in applications with limited source information, such as transmission over bandwidth-limited links. We used spatio-temporal reduced-reference entropic differences (ST-RRED), an RR metric proposed in [6], since it is one of the most widely used RR metrics, with very good performance on various VQA databases [7]. It measures the amount of spatial and temporal information difference, in terms of the wavelet coefficients of frames and of frame differences, between the reference and the received videos. In this work, we use the recently developed optimized version of ST-RRED, known as ST-RREDOpt, which computes only the desired sub-band and thereby achieves almost the same performance as ST-RRED while being almost ten times faster [8]. In addition, we use the recently proposed spatial efficient entropic differencing for quality assessment (SpEED-QA) model, which is almost 70 times faster than the original implementation of ST-RRED and seven times faster than ST-RREDOpt, as it considers only the spatial domain in its computation [9]. For both these metrics, we used the default settings and the implementations provided by the authors.

2.3 NR Metrics

NR metrics try to predict the quality without using any source information. Since for gaming streaming applications a high-quality reference video is typically not available, the development of well-performing no-reference metrics is of very high importance. The blind/referenceless image spatial quality evaluator (BRISQUE) [10] is an NR metric which quantifies the possible loss of naturalness in an image using locally normalized luminance coefficients. The blind image quality index (BIQI) is a modular NR metric based on distorted-image statistics, building on natural scene statistics (NSS) [11]. The Natural Image Quality Evaluator (NIQE) is a learning-based NR quality estimation metric which uses statistical features based on a spatial-domain NSS model [12].

For the FR metrics, we use the results made available in the dataset. For ST-RREDOpt, SpEED-QA, and BIQI, we used the implementations made available by the authors with the default settings. NIQE and BRISQUE scores were computed using the built-in MATLAB functions (version R2017b; https://de.mathworks.com/help/images/ref/niqe.html and https://de.mathworks.com/help/images/ref/brisque.html).
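All three NR metrics build their features on locally normalized luminance, i.e., mean-subtracted contrast-normalized (MSCN) coefficients, which are close to unit-normal for undistorted natural images. The following minimal Python sketch illustrates only this normalization step; the Gaussian window scale and the stabilizing constant follow the original BRISQUE/NIQE papers, not this paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(gray, sigma=7.0 / 6.0, c=1.0):
    """Mean-subtracted contrast-normalized (MSCN) coefficients,
    (I - mu) / (sigma_local + c), of a grayscale image. NSS-based
    metrics such as BRISQUE and NIQE model how distortions change
    the statistics of these coefficients."""
    gray = gray.astype(np.float64)
    mu = gaussian_filter(gray, sigma)                 # local mean
    var = gaussian_filter(gray * gray, sigma) - mu * mu
    sigma_local = np.sqrt(np.abs(var))                # local std. deviation
    return (gray - mu) / (sigma_local + c)            # c stabilizes flat areas
```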
3 EVALUATION DATASET AND METHODOLOGY

3.1 Evaluation Dataset

For this work, we use GamingVideoSET, a public open-source dataset made available by the authors in [13]. We briefly describe the dataset and the data used in this work, and refer the reader to the dataset and its associated publication for further information. GamingVideoSET consists of a total of 24 gaming video sequences of 30 seconds duration, obtained as two recorded sequences from each of the 12 games considered. The dataset also provides subjective test results for 90 gaming video sequences, obtained by encoding six of the gaming videos at 15 different resolution-bitrate pairs (three resolutions: 1080p, 720p, and 480p) using the H.264/AVC compression standard. In addition, a total of 576 encoded videos, obtained by encoding the 24 reference videos at 24 different resolution-bitrate pairs (inclusive of the ones used for the subjective assessment), are provided in MP4 format. The encoding mode used is 1-pass, Constant Bitrate (CBR). In the rest of this paper, we refer to the part of the dataset with subjective results as the subjective dataset and to the whole dataset as the full dataset.

3.2 Evaluation Methodology

The standard practice to evaluate how well a VQA metric performs is to measure the correlation between the objective metric scores and subjective scores. In this work, we measure the performance of the objective metrics in two phases. In the first phase, we compare the VQA metrics against the subjective scores of the subjective dataset. In the second phase, for a comprehensive evaluation on the full dataset, we compare the VQA metrics against a benchmark VQA metric. Since the encoded videos are available as MP4 files, for the FR and RR metric calculations we use the decoded, raw YUV videos obtained from the encoded MP4 videos (the videos at 480p and 720p resolution were rescaled to 1080p YUV format using a bilinear scaling filter, as was done by the authors of GamingVideoSET for the subjective quality evaluation). For the NR metric calculations, we instead use the encoded videos at their original resolution (without scaling the 480p and 720p videos to 1080p), for the reasons discussed later in Section 4.6.
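For reference, this bilinear upscaling step can be reproduced with a one-line call; the sketch below uses OpenCV's bilinear interpolation as a stand-in, since the paper does not state which tool performed the rescaling:

```python
import cv2

def upscale_to_1080p(frame):
    """Bilinear upscale of a decoded 480p/720p frame to 1920x1080,
    mirroring the rescaling applied before the FR/RR metric
    computation. cv2.INTER_LINEAR is OpenCV's bilinear filter."""
    return cv2.resize(frame, (1920, 1080), interpolation=cv2.INTER_LINEAR)
```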
4 RESULTS

4.1 VQA Metrics Variation with Bitrate

Figure 1 shows the quality vs. bitrate behavior of the eight VQA metrics over different bitrates for all 24 videos at the 1080p resolution. Similar results were obtained for the 720p and 480p resolution videos but are not presented here due to lack of space. It can be observed that, for the FR and RR metrics, the quality gap between different contents (caused by their different content complexity) decreases at higher bitrates. Both RR metrics show nearly identical behavior, with both saturating at higher bitrates. For the NR metrics, almost the reverse trend is observed, with a larger quality gap at higher bitrates than at lower ones.

Figure 1: Quality vs. bitrate plots for the eight quality metrics at 1080p resolution.

4.2 Comparison of VQA Metrics with MOS

The performance of a VQA metric with respect to subjective ratings is evaluated in terms of the Pearson Linear Correlation Coefficient (PLCC) and Spearman's Rank Order Correlation Coefficient (SROCC). Negative PLCC and SROCC values indicate that higher values of the respective metric correspond to lower quality, and vice versa (the RR and NR metrics considered here are distortion scores, for which lower values mean better quality). Table 1 shows the correlation values of the eight VQA metrics. (While the authors in [13] make available both raw MOS scores and MOS scores after outlier removal, in this work we consider only the raw MOS scores.) The results are reported separately for each resolution and for all three resolutions combined (all data). It can be observed that VMAF achieves the highest performance in terms of both PLCC and SROCC across all three resolutions and over all data. The two RR metrics perform similarly across all resolution-bitrate pairs and over all data; hence, for applications where an increased speed of computation is of high importance, SpEED-QA can be selected as the RR metric, as it is almost seven times faster than ST-RREDOpt. Among the NR metrics, BIQI performs worst. BRISQUE and NIQE perform almost identically at the 1080p and 720p resolutions, but at 480p and over all data, NIQE performs better than BRISQUE.

Table 1: Comparison of the performance of the VQA metric scores with MOS ratings in terms of PLCC and SROCC values. All Data refers to the combined data of all three resolutions. The best performing metric is shown in bold.

4.3 Impact of Resolution on VQA Metrics

It can be observed that, in general, the performance of the VQA metrics varies across resolutions. For the FR and NR metrics, the performance decreases when moving from higher-resolution to lower-resolution videos. In contrast, both RR metrics achieve their highest PLCC with the MOS scores for the 720p videos, followed by the 1080p and 480p videos. Fisher's Z-test for the significance of the difference between two correlation coefficients (see, e.g., http://psych.unl.edu/psycrs/statpage/biv_corr_comp_eg.pdf) indicates that the difference between 720p and 1080p is not statistically significant, while the difference between 720p and 480p is significant, Z = 2.954, p < 0.01. For all eight VQA metrics, the performance at 480p (cf. Table 1) is considerably lower than that of the same metric at 720p and 1080p.
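The correlation analysis and the significance test above can be reproduced as follows. A minimal sketch: the sample size of n = 30 sequences per resolution (90 rated sequences split over three resolutions) is our reading of the subjective dataset, not a figure stated explicitly in the paper.

```python
import numpy as np
from scipy import stats

def plcc_srocc(metric_scores, mos):
    """PLCC and SROCC between objective metric scores and MOS ratings."""
    plcc, _ = stats.pearsonr(metric_scores, mos)
    srocc, _ = stats.spearmanr(metric_scores, mos)
    return plcc, srocc

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided Fisher's Z-test for the difference between two
    independent correlation coefficients."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)   # Fisher r-to-z transform
    z = (z1 - z2) / np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    p = 2.0 * stats.norm.sf(abs(z))           # two-sided p-value
    return z, p

# e.g., comparing an RR metric's PLCC at 720p vs. 480p:
# z, p = fisher_z_test(r_720p, 30, r_480p, 30)
```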
Moreover, the decrease in performance is larger for some metrics than for others. We illustrate this observation using PSNR as an example, as shown in Figure 2b: PSNR at 480p is not able to capture the variation in MOS (cf. Figure 2a), as its values remain almost constant even at higher bitrates. VMAF, on the other hand, as evident from Figure 2c, captures this variation quite well and hence performs better both overall and within each individual resolution.

Figure 2: MOS (with 95% confidence interval), PSNR, and VMAF values for the CSGO video sequence at different resolution-bitrate pairs. A similar behavior is observed for the other video sequences (relevant results not reported here due to lack of space).

4.4 Comparison of VQA Metrics with VMAF

In the previous section, we evaluated the performance of the eight VQA metrics against the subjective ratings of six reference gaming video sequences at 15 resolution-bitrate pairs, and found that VMAF achieved the highest performance in terms of both PLCC and SROCC across all conditions. In the absence of subjective ratings for the full dataset, and given this superior performance, we take the VMAF values as the reference scores and evaluate the remaining seven VQA metrics on the full dataset (24 reference video sequences and 24 resolution-bitrate pairs, i.e., a total of 576 encoded video sequences). Table 2 shows the PLCC and SROCC values of the seven VQA metrics with respect to the VMAF scores. It can be observed that PSNR achieves the highest correlation, followed by SSIM. As with the correlations with MOS reported in Table 1, the two RR metrics perform similarly. Also, as in Table 1, the correlation values of some metrics vary considerably across resolutions. At 1080p, PSNR achieves the highest PLCC and SpEED-QA the highest SROCC; at 720p and 480p, NIQE achieves the highest PLCC and SpEED-QA the highest SROCC. These results indicate a high potential for RR and NR metrics in quality evaluation for applications that are limited to a single resolution and where full reference information is not available.

Table 2: Comparison of the performance of the VQA metric scores with VMAF scores in terms of PLCC and SROCC values. All Data refers to the combined data of all three resolutions.

Metrics             1080p            720p             480p             All Data
                    PLCC   SROCC     PLCC   SROCC     PLCC   SROCC     PLCC   SROCC
FR: PSNR            0.91    0.92     0.79    0.77     0.62    0.60     0.87    0.87
FR: SSIM            0.80    0.83     0.68    0.70     0.56    0.56     0.70    0.74
RR: ST-RREDOpt     -0.77   -0.91    -0.74   -0.89    -0.66   -0.85    -0.53   -0.61
RR: SpEED-QA       -0.77   -0.93    -0.76   -0.92    -0.68   -0.88    -0.55   -0.63
NR: BRISQUE        -0.77   -0.78    -0.79   -0.79    -0.68   -0.68    -0.14   -0.14
NR: BIQI           -0.67   -0.68    -0.70   -0.71    -0.57   -0.54    -0.05   -0.05
NR: NIQE           -0.78   -0.76    -0.81   -0.81    -0.75   -0.77    -0.42   -0.42

4.5 Comparative Performance Analysis of NR Metrics

While the VQA metrics in general perform quite well, their performance decreases when multiple resolutions are considered. Compared to the FR and RR metrics, the performance degradation of the NR metrics over all data is considerably higher. We investigate the reason behind this degradation across multiple resolution-bitrate pairs using Figure 3, which shows the scatter plots of BRISQUE, BIQI, and NIQE against the VMAF scores for all three resolutions. It can be observed from Figure 3 that, within each individual resolution, the NR metric values are fairly well correlated with the VMAF values and increase roughly linearly, resulting in reasonable PLCC scores. When considering all resolution-bitrate pairs together, however, the spread of the values is no longer linear, hence the lower correlation scores. Among the three NR metrics, NIQE shows a much lower spread, both within each individual resolution and over all data, than BIQI and BRISQUE; hence, NIQE achieves a higher overall prediction quality with both the MOS scores and the VMAF scores as benchmark. BRISQUE performs almost the same as NIQE at 1080p and 720p, but its correlation values decrease at 480p (wider spread of the scores) and over all data. BIQI performs the worst among the three. The differences in the per-resolution values can be attributed to the fact that, while for the FR and RR metric calculations we used the rescaled YUV videos, for the NR metric calculations at 720p and 480p we used the downscaled, compressed MP4 videos.
This, along with the lack of training on videos of different resolutions and the absence of model parameters that could capture the differences caused by resolution changes, results in lower correlation scores when all resolution-bitrate pairs are considered together. We discuss next the results obtained when the NR metrics are evaluated on the upscaled YUV videos, as was done for the FR and RR metric evaluation.

Figure 3: Scatter plots showing the variation of the NR metrics with respect to the VMAF scores for all three resolutions over the whole dataset.

4.6 NR Metric Evaluation with Rescaling

As mentioned before, the three NR metrics were evaluated on videos without rescaling. We briefly present and discuss the results obtained with the rescaled YUV videos and the limitations thereof. Figure 4 shows the variation of the NIQE scores for one of the sample gaming videos (FIFA) over the 24 resolution-bitrate pairs. While for the 1080p videos the NIQE values indicate higher quality with increasing encoding bitrate (as one would expect), for the 720p videos the estimated quality remains approximately constant even at higher bitrates. For 480p, the trend actually reverses, with NIQE estimating poorer quality at higher bitrates. A similar behavior is observed for BRISQUE and BIQI. A possible reason is that these NR metrics, being based on natural scene statistics, are not able to capture the combined effect of quality loss due to compression and quality loss due to rescaling, a common operation for resolution switching in adaptive streaming applications such as Dynamic Adaptive Streaming over HTTP (DASH) and HTTP Live Streaming (HLS). Hence, while the NR results for the compressed, low-resolution versions without upscaling (480p and 720p) are as expected, these metrics are not capable of estimating MOS when the rescaled versions of the sequences are considered. This indicates their unsuitability for applications such as DASH and HLS, where quality adaptation uses multiple resolution-bitrate pairs and the videos are usually rescaled to the native display resolution (1080p in our case). Further investigation into the design of these metrics could help overcome this shortcoming and perhaps also increase their performance; training and evaluating them on rescaled videos spanning multiple resolution-bitrate pairs could possibly lead to improved prediction accuracy.

Figure 4: NIQE score variation for one of the sample gaming video sequences (FIFA) considering the rescaled YUV videos for the 720p and 480p resolutions. Similar patterns are observed for the other videos but are not presented here due to lack of space.
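The rescaling effect described above can be probed by scoring the same rendition twice, once at its native resolution and once after bilinear upscaling to 1080p. The sketch below assumes scikit-video's skvideo.measure.niqe as the NR metric implementation; this library choice and its per-frame output are our assumptions, as the paper itself used MATLAB's built-in niqe():

```python
import cv2
import numpy as np
import skvideo.measure  # assumed NIQE implementation (the paper used MATLAB)

def niqe_native_vs_upscaled(luma_frames):
    """Mean NIQE (lower = better predicted quality) for a sequence of
    8-bit luma frames of shape (T, H, W): once at the native encoded
    resolution and once after bilinear upscaling to 1080p."""
    native = float(np.mean(skvideo.measure.niqe(luma_frames)))
    upscaled = np.stack([
        cv2.resize(f, (1920, 1080), interpolation=cv2.INTER_LINEAR)
        for f in luma_frames
    ])
    rescaled = float(np.mean(skvideo.measure.niqe(upscaled)))
    return native, rescaled
```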
5 CONCLUSION AND FUTURE WORK

In this paper, we presented an objective evaluation and analysis of the performance of eight VQA metrics on gaming videos, considering a passive, live streaming scenario. First, on a subset of GamingVideoSET consisting of 90 video sequences, we evaluated the performance of the VQA metrics against MOS scores. We found that VMAF correlates best with the subjective scores, followed by SSIM and NIQE. It was observed that many metrics failed to capture the MOS variation at the lower resolutions, resulting in lower correlation values. We then evaluated the performance of the remaining VQA metrics against VMAF on the full test dataset. The performance of the NR metrics decreased when different resolution-bitrate pairs were considered together. Also, when considering rescaled videos, the NR metrics produced erroneous predictions. Possible reasons include the lack of appropriate training and the synthetic nature of gaming video content, which we plan to investigate in our future work. We believe that the observations and discussions presented in this work will help to improve the prediction efficiency of the existing metrics, as well as to develop better performing NR VQA metrics with a focus on live gaming video streaming applications. In addition to the passive gaming services discussed in this work, a well-performing NR metric can also be used to predict video quality for interactive cloud gaming services. It should be noted that our current subjective evaluation was limited in the number of videos considered; moreover, the gaming videos used in this work were limited to a frame rate of 30 fps. As future work, we plan to extend our subjective analysis with more videos and to include higher frame rate videos.

ACKNOWLEDGMENT

This work is part of a project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 643072 and was supported by the German Research Foundation (DFG) within project MO 1038/21-1.

REFERENCES

[1] S. Shirmohammadi, M. Abdallah, D. T. Ahmed, Y. Lu, and A. Snyatkov. Introduction to the special section on visual computing in the cloud: Cloud gaming and virtualization. IEEE Transactions on Circuits and Systems for Video Technology, 25(12):1955–1959, 2015.
[2] D. Fitzgerald and D. Wakabayashi. Apple Quietly Builds New Networks. https://www.wsj.com/articles/apple-quietly-builds-new-networks-1391474149, February 2014. [Online; accessed 27-February-2017].
[3] N. Barman, S. Zadtootaghaj, M. G. Martini, S. Möller, and S. Lee. A Comparative Quality Assessment Study for Gaming and Non-Gaming Videos. In Tenth International Conference on Quality of Multimedia Experience (QoMEX), May 2018.
[4] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[5] Netflix. VMAF - Video Multi-Method Assessment Fusion. https://github.com/Netflix/vmaf. [Online; accessed 12-Dec-2018].
[6] R. Soundararajan and A. C. Bovik. Video quality assessment by reduced reference spatio-temporal entropic differencing. IEEE Transactions on Circuits and Systems for Video Technology, 23(4):684–694, April 2013.
[7] A. C. Bovik, R. Soundararajan, and C. Bampis. On the Robust Performance of the ST-RRED Video Quality Predictor. http://live.ece.utexas.edu/research/Quality/ST-RRED/.
[8] C. G. Bampis, P. Gupta, R. Soundararajan, and A. C. Bovik. Source code for the optimized Spatio-Temporal Reduced Reference Entropy Differencing video quality prediction model. http://live.ece.utexas.edu/research/Quality/STRRED_opt_demo.zip, 2017.
[9] C. G. Bampis, P. Gupta, R. Soundararajan, and A. C. Bovik. SpEED-QA: Spatial Efficient Entropic Differencing for Image and Video Quality. IEEE Signal Processing Letters, 24(9):1333–1337, Sept 2017.
[10] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, Dec 2012.
[11] A. K. Moorthy and A. C. Bovik. A two-step framework for constructing blind image quality indices. IEEE Signal Processing Letters, 17(5):513–516, May 2010.
[12] A. Mittal, R. Soundararajan, and A. C. Bovik. Making a "Completely Blind" Image Quality Analyzer. IEEE Signal Processing Letters, 20(3):209–212, March 2013.
[13] N. Barman, S. Zadtootaghaj, S. Schmidt, M. G. Martini, and S. Möller. GamingVideoSET: A Dataset for Gaming Video Streaming Applications. In 16th Annual Workshop on Network and Systems Support for Games (NetGames), Amsterdam, Netherlands, June 2018.