Visual attention and perception models for assessing quality in 2D
and 3D stereoscopic video
Juan Pedro López Velasco - [email protected]
José Manuel Menéndez García - [email protected]
Universidad Politécnica de Madrid
Madrid, 8th February 2016
2
Index
• Introduction
• Objectives and Work Development
• Visual discomfort prediction in 3D stereoscopic video
• Visual Attention Model for Video Quality Assessment
• Conclusions and Future work
• Merits
3
Introduction
• Quality of Experience (QoE) is defined as the degree of delight or annoyance of the user of an application or service; in this case, multimedia services.
• QoE must be estimated at different stages of the video broadcasting dataflow and for a variety of sources: 2D and 3D.
4
Scenarios
CONTENT CREATION PHASE: Visual comfort assessment (3D)
COMPRESSION PHASE: Visual attention and saliency models (2D)
5
The most important thing in video quality assessment is…
…the final user.
6
Objectives and Work Development
7
Objectives (I)
For visual comfort assessment (3D):
• Detecting empirically the main sources of visual discomfort in 3D stereoscopic video through subjective assessment.
• Quantifying the situations in sequences where visual discomfort is most likely to occur.
• Analyzing motion, distribution of parallax and disparity change in pairs of sequences in order to develop tools that correspond to human perception.
• Demonstrating with sequences that the results obtained in subjective assessment can be predicted by measuring objective parameters and characteristics.
8
Work Development (I)
Preliminary subjective assessment
Determination of visual discomfort sources
Characterization of video sequences
Statistical analysis, new subjective assessment, metrics development and drawing conclusions
9
Objectives (II)
For visual attention and saliency models (2D):
• Improving objective quality metrics by applying visual attention models, which weight regions of interest to obtain results closer to the response of the human eye.
• Determining accurate visual attention models, specific to each sequence, which predict the areas most likely to be observed by the user.
• Weighting the analyzed saliency factors by means of subjective assessment. These saliency factors are: motion, level of detail, face detection and pixel position.
• Demonstrating that the developed visual attention model improves objective metrics for measuring quality and artifacts in the sequence (Advanced Blur metric).
10
Work Development (II)
Determining factors: motion, face detection, level of detail and position
Subjective assessment with artificially impaired sequences
Weighting these factors in order of importance
Visual attention model generation
Application of the model in objective metrics (Advanced Blur metric)
11
Visual discomfort prediction in 3D stereoscopic video
12
Introduction to Stereoscopy
• Stereoscopic 3D video perception is based on the fact that two video signals (different but highly correlated) are captured in order to feed each of the viewer’s eyes.
• One signal is received by the left eye and the other by the right eye. The brain fuses the left and right views.
• 3D video imitates human binocular vision (natural view).
• The cyclopean eye is an imaginary eye situated midway between the two eyes.
13
Disparity and Parallax
• Disparities are the differences between the angles subtended between pairs of features.
• Parallax is created by disparities: it is positive, negative or zero, depending on the position of the object with respect to the screen.
14
Example of 3D disparity
15
Accommodation-Vergence conflict
• Viewing an object in stereoscopic displays:
– The eyes accommodate (focus) to the screen.
– But the eyes rotate to fixate the apparent object (vergence).
– An inconsistency between the two occurs (derived from stereopsis).
• This effect is the accommodation-vergence conflict.
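The conflict can be put in numbers with basic viewing geometry. The sketch below is not taken from the slides: it assumes a viewer centered in front of the screen, an eye separation of 65 mm, and the usual similar-triangles relation between on-screen parallax and the distance at which the eyes converge. The function names are illustrative.

```python
def vergence_distance(viewing_distance_m, parallax_m, eye_separation_m=0.065):
    """Distance at which the eyes converge for a given on-screen parallax.

    parallax_m > 0: positive (uncrossed) parallax, object behind the screen.
    parallax_m < 0: negative (crossed) parallax, object in front of the screen.
    """
    if parallax_m >= eye_separation_m:
        raise ValueError("parallax must be smaller than the eye separation")
    return viewing_distance_m * eye_separation_m / (eye_separation_m - parallax_m)


def av_mismatch_diopters(viewing_distance_m, parallax_m, eye_separation_m=0.065):
    """Accommodation-vergence mismatch in diopters (1/m).

    Accommodation stays on the screen; vergence follows the apparent object.
    """
    z = vergence_distance(viewing_distance_m, parallax_m, eye_separation_m)
    return abs(1.0 / viewing_distance_m - 1.0 / z)


# Viewer at 2.5 m from the screen, three assumed parallax values in meters.
for parallax in (0.02, 0.0, -0.065):
    z = vergence_distance(2.5, parallax)
    d = av_mismatch_diopters(2.5, parallax)
    print(f"parallax {parallax:+.3f} m: vergence at {z:.2f} m, mismatch {d:.2f} D")
```

With zero parallax the two distances coincide and the mismatch vanishes; negative parallax pulls the vergence point towards the viewer while accommodation stays on the screen.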
16
Problem description
• Disparity may offer an incredible experience, BUT with large differences in 3D disparity the eyes may have difficulty focusing on objects, causing visual discomfort, annoyance and headache.
• The eyes focus on objects: accommodation needs enough time to adapt to changes for correct vision of 3D video (hence the importance of motion).
• Common sources of visual discomfort:
– Excessive binocular parallax (especially negative)
– Accommodation and vergence mismatches (AVM)
17
Accommodation-Vergence Mismatches (AVM)
• AVM is one of the most frequent sources of visual discomfort in 3DTV.
• When the position of an object changes (parallax), the accommodation stays constant but the vergence changes.
• The crystalline lens must adapt quickly to the change.
Near distance object Far distance object
18
Zone of Comfort
• Zone of Comfort (ZoC) is a term introduced by Percival (1892) to define the relationship between distance of vergence and distance to the screen (accommodation distance).
• Existing studies focus on static images (Shibata, 2011).
19
Work methodology
1. Characterization of individual video sequences: motion, depth map and distribution of parallax of each sequence.
2. Combination of video sequences in pairs (Sequence 1, Sequence 2), covering a wide range of transition cases.
3. Subjective assessment with pairs of sequences for transition analysis.
4. Analysis of when visual discomfort happens.
20
Characterization of video sequences
• Tools for characterization:
– Depth maps: using SAD (Sum of Absolute Differences) techniques.
– Histograms of parallax information (based on depth map information).
– Diagrams of TI (Temporal Information) and SI (Spatial Information) variation.
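As an illustration of the SAD technique named above, the following minimal sketch estimates horizontal disparity for one image row by block matching. It is an assumption-laden toy, not the tool used in this work: the block size, the search range and the convention that the match in the right view lies at x - d are all chosen for the example.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def row_disparity(left_row, right_row, block=3, max_disp=4):
    """For each block of the left row, find the shift d in 0..max_disp that
    minimizes the SAD against the right row (match searched at x - d)."""
    disparities = []
    for x in range(len(left_row) - block + 1):
        reference = left_row[x:x + block]
        best_d, best_cost = 0, None
        for d in range(max_disp + 1):
            if x - d < 0:
                break  # candidate block would fall outside the image
            cost = sad(reference, right_row[x - d:x - d + block])
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


# Toy rows: the textured patch [9, 5, 7] appears two pixels further left
# in the right view, so its disparity should be 2.
left = [0, 0, 0, 0, 9, 5, 7, 0, 0, 0]
right = [0, 0, 9, 5, 7, 0, 0, 0, 0, 0]
print(row_disparity(left, right)[4])
```

Flat regions give ambiguous matches (every shift has the same cost), which is one reason real depth maps need texture or extra regularization.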
21
Case of study: Sequence Palco HD
• Separation of the virtual cameras above the average interpupillary distance. The human eye adapts to the change produced by negative parallax, but… an abrupt transition generates discomfort.
Progressive temporal parallax variation
22
Subjective Assessment
• Analysis of changes / transitions between pairs of video sequences to determine a preliminary ZoC.
• Analysis of transitions between scenes:
– Selection of sequences with different values of SI (Spatial Information) and TI (Temporal Information): two-dimensional information.
– Selection of sequences with different values of spatial and temporal parallax variance (negative, parallax): three-dimensional information.
• Test conditions (following Recommendations BT.500 and P.910):
– 74 observers
– 65-inch television
– Observation distance: 2.5 m
– HD sequences
– 5-point annoyance scale

MOS Scale
Score  Annoyance derived from transition  Quality of Experience
5      Very comfortable                   Excellent experience
4      Comfortable                        Good experience
3      Mildly uncomfortable               No visual discomfort
2      Uncomfortable                      Visual discomfort
1      Extremely uncomfortable            High visual discomfort
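A short sketch of how votes on this scale can be summarized. The votes below are invented for illustration (they are not the test data), and treating scores of 2 or lower as "visual discomfort" follows the scale labels above.

```python
def mean_opinion_score(votes):
    """Arithmetic mean of the observers' scores (MOS)."""
    return sum(votes) / len(votes)


def discomfort_share(votes, threshold=2):
    """Fraction of observers whose score indicates visual discomfort
    (score <= threshold on the 5-point scale)."""
    return sum(1 for v in votes if v <= threshold) / len(votes)


# Hypothetical votes for one transition, one score per observer.
votes = [5, 4, 2, 1, 3, 2, 4, 2, 5, 1]
print(f"MOS = {mean_opinion_score(votes):.2f}")
print(f"share reporting discomfort = {discomfort_share(votes):.0%}")
```

Reporting the share alongside the MOS is useful because a moderate mean can hide a sizeable group of observers who found the transition uncomfortable.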
23
Results of subjective assessment
24
Transition: Angel to Ladder (I)
40% of the people gave a score that manifests visual discomfort
25
Transition: Angel to Ladder (II)
Parallax variation in pixel
26
Transition: “Spaceship” to “Astronaut”
Negative parallax in right side of first video to negative/positive combination
27
Transition: “Station” to “Itaca3d”
This is the worst scored transition in the tests
↑↑ Motion | ↑↑ Motion
Hyperstereoscopy!
28
Transition: “Boxers” to “Dance”
Negative parallax located in different areas, less annoyance for observers.
29
Transition: “Hall” to “Laboratory”
Both videos have negative parallax and window violation → low scores.
Window violation!
30
Conclusions
• The subjective assessment shows that both the static disparity and the dynamic variation of the stereoscopic image, in terms of motion, need to be evaluated.
• ZoC is affected by motion in the scene. The state of the art must be updated with tests on dynamic sequences.
• Avoiding visual discomfort is possible by locating objects in positive parallax, BUT that implies a consequent decrease of QoE.
• Negative parallax must be controlled to generate soft variations:
– Fast variation of negative parallax is usually the main source of visual discomfort, especially when the transition leads to content with a completely different disparity diagram.
– Hyperstereoscopy alone (i.e. pixels with negative parallax and disparities higher than 5) is not enough to predict visual discomfort; it is the transition that provokes the discomfort.
• Positive parallax is recommended for its tolerance to visual discomfort and the consequent.
31
Future work
Following the conclusions on the main sources of visual discomfort:
• Developing recommendations and guidelines for 3D content creators.
• Generating tools for automatic detection of discomfort in 3D videos.
32
Visual Attention Model for Video Quality Assessment
33
Contents
• Introduction: Problem description
• Calibration of the visual attention model
– Artificially impaired video sequences generation: analysis of video characteristics by regions, creation of masks based on ROIs
• Results and examples with test sequences
• Advanced Blur metric
– Application to real video sequences (encoded in H.264 at different bitrates)
• Conclusions
34
Problem description (I)
• Assessing video quality is still a complex task.
• Video Quality Assessment needs to correspond to human perception.
• Visual attention is focused on specific regions (ROIs) of an image, as demonstrated with fixation maps and eye-tracking.
Original image | Fixation map | Image with visual attention weights
35
Problem description (II)
• Most pixel-based metrics do not show enough correlation between objective and subjective results.
• Algorithms need to correspond to human perception when analyzing quality in a video sequence.
• For example, these four frames have the same MSE.
• Video quality metrics should correlate with visual attention and psychovisual models adapted to specific artifacts and their visualization.
High blocking | High blurring (defocus) | Salt and pepper noise | JPEG encoding
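The point about equal MSE can be reproduced on a toy frame. In this sketch (values invented for the example) a faint error spread over every pixel and a strong error concentrated on a single pixel give exactly the same MSE and PSNR, although they would not look alike.

```python
import math


def mse(ref, dist):
    """Mean squared error between two frames given as lists of rows."""
    total, count = 0.0, 0
    for row_r, row_d in zip(ref, dist):
        for r, d in zip(row_r, row_d):
            total += (r - d) ** 2
            count += 1
    return total / count


def psnr(ref, dist, peak=255):
    """Peak signal-to-noise ratio in dB for 8-bit content."""
    err = mse(ref, dist)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)


ref = [[100] * 4 for _ in range(4)]

# Distortion A: an error of 2 on each of the 16 pixels (16 * 2^2 = 64).
uniform = [[102] * 4 for _ in range(4)]

# Distortion B: an error of 8 on one pixel only (1 * 8^2 = 64).
local = [row[:] for row in ref]
local[1][1] = 108

print(mse(ref, uniform), mse(ref, local))
print(psnr(ref, uniform) == psnr(ref, local))
```

A pixel-based metric cannot tell the two cases apart, which is the motivation for weighting errors by where the viewer is likely to look.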
36
Visual Attention Features
• According to the context-aware saliency detection model proposed by Goferman et al. [GOFERMAN-1, 2012], image regions of interest are detected based on four principles of human attention supported by psychological evidence:
– Low-level characteristics affecting each individual pixel, such as color and contrast.
– Global considerations, which suppress frequently occurring features while maintaining features that deviate from the norm.
– Visual organization rules, which state that visual forms may possess one or several centers of gravity about which the form is organized.
– High-level factors, such as recognition of human faces or specific objects. This factor can be content dependent, but human faces generate specific patterns in the human retina that increase the probability of being perceived, related to psychological and cognitive features.
37
Example of artificially impaired sequences
• Impaired area (with blocking artifact) located in human faces ROI.
• The effect is exaggerated in this example, but it is common in real footage.
38
Work methodology
• Objectives:
– Calibration of the influence of features (ROI) for determining the visual attention model.
– Creation of Advanced Blur metrics.
• Methodology for the Visual Attention Model:
– Selection of ROIs: motion, faces, spatial detail and position.
– Creation of masks for artificially impaired sequences (adapted to a specific artifact: blurring).
– Subjective assessment: opinions of users (MOS scaled).
– Search for inconsistencies between the subjective assessment (MOS obtained) and pixel-based objective metrics (PSNR), to weight the influence of each feature.
• Advanced Blur metric: loss of energy (blur) adapted to visual attention.
• Tests: once the visual attention model is generated, it is tested with real sequences (distorted by the effect of H.264 encoding).
39
Scheme of artificially impaired video sequences generation
[Diagram: the original video sequence is distorted to obtain an impaired video sequence; the feature mask and the inverse feature mask select where the distortion is applied, giving the artificially impaired sequence.]
(2 sequences for each distortion: one case and the opposite case, as seen in the next example)
40
Impairment and artifacts insertion process
[Diagram: Original video sequence → Artifact distortion → Impaired video sequence]
• Blocking: simulated with an 8x8 mosaic filter
• Blurring: simulated with a Gaussian lowpass filter
• Ringing: simulated with a JPEG coding filter
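The blocking simulation and the mask-based insertion can be sketched together. This is a simplified reading of the process described on these slides, written for grayscale frames stored as lists of rows; the function names are illustrative and the demo uses 2x2 tiles instead of 8x8 only to keep the frame small.

```python
def mosaic(frame, block=8):
    """Simulated blocking: replace every block x block tile by its mean value."""
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for top in range(0, h, block):
        for left in range(0, w, block):
            rows = range(top, min(top + block, h))
            cols = range(left, min(left + block, w))
            mean = sum(frame[y][x] for y in rows for x in cols) / (len(rows) * len(cols))
            for y in rows:
                for x in cols:
                    out[y][x] = mean
    return out


def impair_in_mask(original, distorted, mask):
    """Take the distorted pixel where the mask is set, the original elsewhere."""
    return [[d if m else o for o, d, m in zip(row_o, row_d, row_m)]
            for row_o, row_d, row_m in zip(original, distorted, mask)]


frame = [[0, 2, 10, 14],
         [4, 6, 12, 16],
         [1, 3, 5, 7],
         [1, 3, 5, 7]]
feature_mask = [[True, True, False, False],
                [True, True, False, False],
                [False, False, False, False],
                [False, False, False, False]]
inverse_mask = [[not m for m in row] for row in feature_mask]

blocky = mosaic(frame, block=2)
direct = impair_in_mask(frame, blocky, feature_mask)    # impairment inside the ROI
inverse = impair_in_mask(frame, blocky, inverse_mask)   # impairment outside the ROI
print(direct[0], inverse[0])
```

Generating both the direct and the inverse sequence from the same distortion is what allows the subjective test to isolate the influence of the region itself.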
41
Creation of masks based on ROIs (I)
• Types of regions of interest for masks: Motion, Spatial Detail, Faces, Position, Color
[Diagram: Original video sequence → Feature detection → Feature mask / Inverse feature mask]
42
Motion mask
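• For motion detection, the temporal information (TI) in consecutive frames is analyzed: a pixel belongs to the motion mask of frame i when its value changes with respect to the previous frame.

$$Pix(x,y) \in Mask_{frame_i} \quad \text{if} \quad F_i(x,y) - F_{i-1}(x,y) \neq 0$$

Original frame | Motion mask based on TI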
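A minimal sketch of a frame-difference motion mask for grayscale frames stored as lists of rows. The `threshold` parameter is an addition for the example: with its default of 0 any change marks the pixel, and a larger value would be a practical way to ignore sensor or coding noise.

```python
def motion_mask(prev_frame, curr_frame, threshold=0):
    """Binary mask: True where the pixel value changed between two frames
    by more than `threshold`."""
    return [[abs(c - p) > threshold for p, c in zip(row_p, row_c)]
            for row_p, row_c in zip(prev_frame, curr_frame)]


previous = [[10, 10, 10],
            [10, 10, 10]]
current = [[10, 12, 10],
           [10, 10, 40]]

print(motion_mask(previous, current))
print(motion_mask(previous, current, threshold=5))
```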
43
Spatial Detail Mask
• Textures, edges and objects in motion can hide or highlight a given impairment, as happens with blocking or blurring artifacts.
• The Canny algorithm is used to create binary masks that separate homogeneous areas from high-frequency areas.
Original frame | Spatial detail mask based on Canny algorithm
44
Pixel Position Masks
• The image is divided into 9 sections (Nojiri, 2009).
• Objective: analyzing the influence of pixel position by areas.
• Three types of masks are created depending on the regions:
Nojiri’s sections distribution
Corner mask | Lateral mask | Central mask
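The three position masks can be sketched by splitting the frame into a 3x3 grid. Two points are assumptions here, since the slides only show them graphically: the nine sections are taken as equal thirds, and the "lateral" mask is taken to cover the four edge sections that are neither corners nor the center.

```python
def position_masks(height, width):
    """Corner, lateral and central masks over a 3x3 division of the frame."""
    masks = {name: [[False] * width for _ in range(height)]
             for name in ("corner", "lateral", "central")}
    for y in range(height):
        for x in range(width):
            row, col = 3 * y // height, 3 * x // width  # section indices 0..2
            if row == 1 and col == 1:
                name = "central"
            elif row != 1 and col != 1:
                name = "corner"
            else:
                name = "lateral"
            masks[name][y][x] = True
    return masks


masks = position_masks(6, 9)
for name, mask in masks.items():
    print(name, sum(map(sum, mask)), "pixels")
```

Every pixel falls in exactly one of the three masks, so the masks can be used directly as alternative impairment regions.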
45
Facial Mask
• The Haar cascade detector included in the OpenCV library, based on a boosted cascade of simple features, is used for face detection.
Face detection Face mask
46
Subjective assessment for calibration
• Results based on subjective tests are analyzed to demonstrate the validity of the test sequences. Spatial detail is analyzed in these 3 sequences.
• MOS scale is used: 5 (Excellent) to 1 (Poor)

Test sequences: “News Report”: Faces | “Barrier”: Motion | “Crowd”: Pixel Position
In the tables, D. is the impairment located in the ROI and Inv. is the impairment located in the inverse mask (rest of the picture).

Sequence: News Report. Impairment located in Faces ROI.
FR Metric   H.264 75 Mbps   H.264 500 Kbps   D.      Inv.
PSNR        47.93           37.58            46.82   34.52
Blur        0.44            3.63             0.38    5.17
MSE         0.67            1.93             0.10    2.30
MOS         4.81            1.54             1.33    3.78

Sequence: Barrier. Impairment located in Motion ROI.
FR Metric   H.264 75 Mbps   H.264 500 Kbps   D.      Inv.
PSNR        49.82           33.19            39.85   34.24
Blur        0.27            8.36             1.97    6.24
MSE         0.51            3.34             0.359   2.98
MOS         4.77            1.33             3.11    3.89

Sequence: Crowd. Impairment located in Position ROIs.
FR Metric   H.264 75 Mbps   H.264 500 Kbps   Center D.   Center Inv.   Lateral D.   Lateral Inv.   Corner D.   Corner Inv.
PSNR        34.33           25.34            30.74       26.82         33.87        26.00          35.95       25.88
Blur        3.44            22.55            6.27        15.33         2.60         19.44          0.95        22.47
MSE         3.55            8.76             2.30        6.21          1.21         7.30           0.64        7.87
MOS         4.68            1.22             1.44        2.44          3.78         1.33           4.11        1.22
47
Calibration of Faces
• Distortion is located in the human faces ROI.
• Subjective MOS values are lower (1.33) than when the distortion is located in the rest of the picture and faces appear sharp (3.78).
• This is inconsistent with the behavior of objective metrics: PSNR (46.82 vs. 34.52) or MSE (0.10 vs. 2.30).
48
Calibration of Motion and Faces
• A similar situation occurs when analyzing motion in the “Barrier” sequence: the MOS is inconsistent with the objective metrics.
• Inconsistencies in corner regions between MOS and objective metrics, such as PSNR, for sequence “Crowd”.
• Inconsistencies in spatial detail areas, less
49
Relative influence of factors
• The subjective assessment led to the following order of influence of the factors:
Faces > Central > Motion > Detail > Lateral > Corner
50
Example of psychovisual model defined (I)
Frame from sequence “News Report”
51
Example of psychovisual model defined (II)
Motion Mask Spatial Details Mask
Pixel Position Mask Faces Mask
52
Advanced Blur metric
• The Blur metric calculates the loss of energy when a video sequence is compressed with transforms such as the DCT. Blur compares the gradient of the reference image with that of the distorted image.
• Advanced Blur includes the effect of the visual attention model.

Blur:
$$Blur = \frac{1}{W \cdot H} \sum_{j=0}^{W-1} \sum_{i=0}^{H-1} \left[ E(G(f_{ref}(i,j))) - E(G(f_{cod}(i,j))) \right]$$

Advanced Blur:
$$Blur_{adv} = \sum_{j=0}^{W-1} \sum_{i=0}^{H-1} psy(i,j) \left[ E(G(f_{ref}(i,j))) - E(G(f_{cod}(i,j))) \right]$$

$$psy(i,j) = \frac{coef_{MOT}(i,j) + coef_{DET}(i,j) + coef_{POS}(i,j) + coef_{FACES}(i,j)}{W \cdot H \cdot \sum_{c=0}^{3} coef_{MAX}(c)}$$
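The idea of weighting the energy loss by a visual attention map can be sketched as follows. Several choices are assumptions made for the example and are not confirmed by the slides: the gradient energy is taken as the sum of squared forward differences, the per-pixel weight is the sum of the four feature coefficients normalized by W·H times the sum of their maxima, and the coefficient values are invented.

```python
def gradient_energy(frame):
    """Per-pixel gradient energy: squared forward differences (assumed form)."""
    h, w = len(frame), len(frame[0])
    energy = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            gx = frame[i][min(j + 1, w - 1)] - frame[i][j]
            gy = frame[min(i + 1, h - 1)][j] - frame[i][j]
            energy[i][j] = float(gx * gx + gy * gy)
    return energy


def energy_loss(ref, cod):
    """Per-pixel loss of gradient energy from reference to coded frame."""
    e_ref, e_cod = gradient_energy(ref), gradient_energy(cod)
    return [[r - c for r, c in zip(row_r, row_c)] for row_r, row_c in zip(e_ref, e_cod)]


def blur(ref, cod):
    """Conventional blur: energy loss averaged over all pixels."""
    loss = energy_loss(ref, cod)
    return sum(map(sum, loss)) / (len(loss) * len(loss[0]))


def psy_weights(coef_maps, coef_max):
    """Attention weight per pixel from the feature coefficient maps."""
    h, w = len(coef_maps[0]), len(coef_maps[0][0])
    norm = h * w * sum(coef_max)
    return [[sum(m[i][j] for m in coef_maps) / norm for j in range(w)] for i in range(h)]


def advanced_blur(ref, cod, psy):
    """Energy loss weighted by the visual attention map."""
    loss = energy_loss(ref, cod)
    return sum(p * l for row_p, row_l in zip(psy, loss) for p, l in zip(row_p, row_l))


ref = [[0, 10], [0, 10]]   # sharp vertical edge
cod = [[5, 5], [5, 5]]     # edge completely blurred away
ones, zeros = [[1, 1], [1, 1]], [[0, 0], [0, 0]]
faces_left = [[1, 0], [1, 0]]

uniform = psy_weights([ones, ones, ones, ones], [1, 1, 1, 1])
on_faces = psy_weights([zeros, zeros, zeros, faces_left], [1, 1, 1, 1])
print(blur(ref, cod), advanced_blur(ref, cod, uniform), advanced_blur(ref, cod, on_faces))
```

With every coefficient at its maximum on every pixel, the weighted metric reduces to the conventional one; otherwise only the loss inside attended regions contributes.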
53
Test with real sequences
• Real sequences encoded at different bitrates:
– H.264: 6 Mbps – 500 Kbps (HD sequences)
Test sequences: Umbrella | Boxers | Tree Branches | Phone Call
54
Results (I)
• Metric results for each sequence compared to MOS (subjective opinion), with the PCC (Pearson Correlation Coefficient) and the improvement from the conventional Blur metric to the Advanced Blur metric.

Sequence        Value      6 Mbps   4 Mbps   1 Mbps   500 Kbps   PCC      Δ(Adv.Blur-Blur)
Boxers          Blur       0.650    0.920    3.040    6.880      -0.953   2.97%
                Adv Blur   1.340    1.480    2.000    2.660      -0.983
                MOS        4.778    4.111    2.444    1.333
Hall            Blur       0.790    3.280    14.180   27.230     -0.982   1.40%
                Adv Blur   2.440    3.490    6.880    9.670      -0.996
                MOS        4.889    4.111    2.667    1.556
Phone Call      Blur       1.950    2.260    3.460    4.490      -0.990   0.94%
                Adv Blur   1.640    1.780    1.990    2.170      -0.999
                MOS        4.889    4.000    2.444    1.333
Tree Branches   Blur       11.920   17.360   22.380   20.120     -0.863   13.24%
                Adv Blur   6.150    8.030    9.790    12.090     -0.996
                MOS        4.889    3.778    2.556    1.550
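The PCC column can be checked directly from the table. The sketch below computes the Pearson correlation for the “Boxers” rows using only the standard library; the values are negative because a higher blur score goes with a lower MOS.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# "Boxers" at 6 Mbps, 4 Mbps, 1 Mbps and 500 Kbps, taken from the table.
mos = [4.778, 4.111, 2.444, 1.333]
blur_scores = [0.650, 0.920, 3.040, 6.880]
adv_blur_scores = [1.340, 1.480, 2.000, 2.660]

print(f"PCC(Blur, MOS)     = {pearson(blur_scores, mos):.3f}")
print(f"PCC(Adv Blur, MOS) = {pearson(adv_blur_scores, mos):.3f}")
```

With only four bitrates per sequence the coefficient is sensitive to each point, so the comparison between metrics is more informative than the absolute values.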
55
Results (II)
56
Conclusions
• Conventional algorithms are not adapted to the subjective response of the human eye.
• Subjective tests revealed the importance of some specific regions.
• Metrics that weight regions of interest (ROI) with a visual attention model, and are adapted to specific artifacts, obtain better correlations.
• The use of visual attention models improves objective metrics (Advanced Blur metric) by up to 13% compared to conventional methods.
57
Conclusions and Future Work
58
Conclusions
• ZoC is affected by motion in the scene. The state of the art must be updated with tests on dynamic sequences. Motion is a key factor in visual discomfort.
• Avoiding visual discomfort is possible by locating objects in positive parallax, BUT that implies a decrease of QoE:
– Negative parallax must be controlled to generate soft variations.
– Positive parallax is recommended for its tolerance to visual discomfort and the consequent.
• Subjective tests revealed the importance of specific ROIs.
• Metrics that weight regions of interest (ROI) with a visual attention model, and are adapted to specific artifacts, obtain better correlations.
• The use of visual attention models improves objective metrics (Advanced Blur metric) by up to 13% compared to conventional methods.
59
Future work
• Development and patenting of a system for automated Quality of Experience assessment in content generation (measuring visual discomfort).
• Developing recommendations and guidelines for 3D content creators.
• Improvement of the visual attention model with more low-, medium- and high-level features, such as color.
• Advanced metrics adapted to other artifacts, such as blocking.
• Development of no-reference metrics including visual attention models.
60
Merits
61
Publications (I)
Peer-reviewed international journal articles (1)
• López, J. P., Rodrigo, J. A., Jiménez, D., & Menéndez, J. M. (2013). Stereoscopic 3D video quality assessment based on depth maps and video motion. EURASIP Journal on Image and Video Processing, 2013(1), 1-14. December 2013. Impact Factor: 0.74. JCR Indexed.
Peer-reviewed international conference papers (9)
• López, J. P., Rodrigo, J. A., Jiménez, D., & Menéndez, J. M. Subjective quality assessment in stereoscopic video based on analyzing parallax and disparity. Consumer Electronics (ICCE), 2015 IEEE International Conference on. Las Vegas (U.S.A.), January 2015.
• López, J. P., Rodrigo, J. A., Jiménez, D., & Menéndez, J. M. Proposal for characterization of 3DTV video sequences describing parallax information. In Consumer Electronics (ICCE), 2015 IEEE International Conference on. Las Vegas (U.S.A.), January 2015.
• López, J. P., Slanina, M., Arnaiz, L., & Menéndez, J. M. Subjective quality assessment in scalable video for measuring impact over device adaptation. In EUROCON, 2013 IEEE (pp. 162-169). Zagreb (Croatia), July 2013.
• López, J. P., Rodrigo, J. A., Jiménez, D., & Menéndez, J. M. Insertion of Impairments in Test Video Sequences for Quality Assessment Based on Psychovisual Characteristics. Artificial Intelligence, Modelling and Simulation, International Conference on. Madrid, November 2014.
• López, J. P., Rodrigo, J. A., Jiménez, D., & Menéndez, J. M. Definition of masks related to psychovisual features for Video Quality Assessment. In Consumer Electronics (ISCE), 2015 IEEE International Symposium on (pp. 1-2). Madrid, June 2015.
62
Publications (II)
• López, J. P., Jiménez, D., Cerezo, A., & Menéndez, J. M. No-reference algorithms for video quality assessment based on artifact evaluation in MPEG-2 and H.264 encoding standards. IFIP/IEEE International Symposium on. IEEE. Ghent (Belgium), May 2013.
• Rodrigo, J. A., López, J. P., Jiménez Bermejo, D., & Menéndez García, J. M. (2013). Automatic 3DTV Quality Assessment Based On Depth Perception Analysis. NEM Summit 2013 Proceedings, 69-74. Nantes (France), October 2013.
• López, J. P., Jiménez, D., Díaz, M., & Menéndez, J. M. Metrics for the objective quality assessment in high definition digital video. IASTED International Conference on Signal Processing, Pattern Recognition and Applications (SPPRA). 2008.
• López, J. P., Díaz, M., Jiménez, D., & Menéndez, J. M. Tiling effect in quality assessment in high definition digital television. 12th IEEE International Symposium on Consumer Electronics - ISCE2008, ISBN: 978-1-4244-2422-1, Vilamoura, April 2008.
Book chapters (1)
• López, J. P. Video Quality Assessment. Video Compression, Ed. InTech, ISBN: 978-953-51-0422-3, March 2012.
Other peer-reviewed international conference papers (5)
Peer-reviewed national journal articles (1)
63
Research projects
• ACTIVA. Ministerio de Industria, Turismo y Comercio (FIT-330300-2007-42).
• BUSCAMEDIA: hacia una adaptación semántica de medios digitales multirred-multiterminal. [2009-2012].
• CIUDAD2020: Hacia un nuevo modelo de ciudad inteligente sostenible. [2011-2014].
• COST Action IC1105: 3D-ConTourNet 3D Content Creation, Coding and Transmission over Future Media Networks.
• EPSIS. Entretenimiento y publicidad segmentada en entornos inmersivos. Ministerio de Economía y Competitividad [2011-2013].
• FURIA 2009. Futura red integrada audiovisual. Ministerio de Industria, Turismo y Comercio (TSI-020301-2009-33) [2009-10].
• HBB4ALL Hybrid Broadcast Broadband TV For All. [2013-2016].
• HORFI-Radar MIMO de banda ultra ancha. TEC2012-38402-C04-01 HORFI.
• ICT 2020. Ministerio de Industria, Turismo y Comercio (TSI-020302-2011-23). [2011-2013].
• IMMERSIVE TV: Una aproximación a los medios inmersivos. Ministerio de Industria, Turismo y Comercio [2010-2012].
• ITACA 3D. Plataforma de creación, producción y distribución de video estereoscópico de entretenimiento para la visualización de televisión en 3D a través de broadcast. Ministerio de Industria, Turismo y Comercio (TSI-020110-2009-396).
• MELISMAS - Generación automática de mensajes en lengua de signos para aplicaciones sanitarias. Ministerio de Economía y Competitividad (RTC-2014-2762-1). [2014-16].
• Palco HD. Convergencia de plataformas digitales hacia la HD y medidas de calidad asociadas. Ministerio de Industria, Turismo y Comercio. [2007-2009].
• PALCO HD2. Ministerio de Industria, Turismo y Comercio. [2009-2011].
• PLEASE Plataforma de alta eficiencia avanzada para distribución de contenidos [2014-15].
• PRO-TVD-CM: Proyecto Integral de Investigación en Televisión Digital (S0505/TIC-0398). [2005-2009].
• S3D: Equipo servidor-editor de vídeo 3D realizado en colaboración con las empresas Overon y Aicox.
• SIRENA: SIstemas y tecnologías 3D Media sobre Internet del Futuro y REdes de difusión de NuevA generación. Ministerio de Economía y Competitividad (IPT-2011-1269-430000). [2011-2013].