95
UNIVERSIDAD DE CHILE FACULTAD DE CIENCIAS F ´ ISICAS Y MATEM ´ ATICAS DEPARTAMENTO DE INGENIER ´ IA INDUSTRIAL MEJORAMIENTO DE UNA METODOLOG ´ IA PARA LA IDENTIFICACI ´ ON DE WEBSITE KEYOBJECT MEDIANTE LA APLICACI ´ ON DE TECNOLOG ´ IAS EYE TRACKING, AN ´ ALISIS DE DILATACI ´ ON PUPILAR Y ALGORITMOS DE WEB MINING MEMORIA PARA OPTAR AL T ´ ITULO DE INGENIERO CIVIL INDUSTRIAL GUSTAVO ADOLFO MART ´ INEZ AZ ´ OCAR PROFESOR GU ´ IA: SR. JUAN DOMINGO VEL ´ ASQUEZ SILVA MIEMBROS DE LA COMISI ´ ON: SR. PEDRO MALDONADO ARBOGAST SR. ALBERTO CABEZAS BULLEMORE SANTIAGO DE CHILE DICIEMBRE 2013 Financiado por el proyecto FONDEF-CONICYT CA12I10061 - AKORI

Mejoramiento de una metodología para la identificación de Website

  • Upload
    ngoque

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Mejoramiento de una metodología para la identificación de Website

UNIVERSIDAD DE CHILEFACULTAD DE CIENCIAS FISICAS Y MATEMATICASDEPARTAMENTO DE INGENIERIA INDUSTRIAL

MEJORAMIENTO DE UNA METODOLOGIA PARA LAIDENTIFICACION DE WEBSITE KEYOBJECT MEDIANTE LA

APLICACION DE TECNOLOGIAS EYE TRACKING, ANALISIS DEDILATACION PUPILAR Y ALGORITMOS DE WEB MINING

MEMORIA PARA OPTAR AL TITULO DE INGENIERO CIVILINDUSTRIAL

GUSTAVO ADOLFO MARTINEZ AZOCAR

PROFESOR GUIA:SR. JUAN DOMINGO VELASQUEZ SILVA

MIEMBROS DE LA COMISION:SR. PEDRO MALDONADO ARBOGASTSR. ALBERTO CABEZAS BULLEMORE

SANTIAGO DE CHILEDICIEMBRE 2013

Financiado por el proyecto FONDEF-CONICYT CA12I10061 - AKORI

Page 2: Mejoramiento de una metodología para la identificación de Website

Resumen Ejecutivo

El crecimiento acelerado de internet ha creado un aumento sostenido de los sitios webpara todo tipo de empresas, organizaciones y particulares, provocando un nivel de ofertainmensamente alto. Estos sitios comienzan cada vez mas a ser un importante canal tantode comunicacion directa con el cliente como de ventas, por lo que se hace necesario tratarde generar estrategias que permitan atraer a mas usuarios al sitio y ademas hacer que losactuales usuarios continuen utilizandolo. Esto lleva a preguntarse que tipo de informacionresulta de utilidad para el usuario final y como poder identificar esa informacion.

Anteriormente se ha tratado de abordar este problema mediante tecnica de web mining alas areas de contenido, estructuras y usabilidad de un sitio web, de modo de poder encontrarpatrones que permitan generar informacion y conocimiento sobre estos datos. Estos a su vezpermitirıan tomar mejores decisiones respecto de la estructura y contenido de los sitios web.

Sin embargo este tipo de tecnicas incluıa la conjuncion de datos objetivos (web logs)con datos subjetivos (encuestas y focus group principalmente), los cuales poseen una altavariabilidad tanto personal como interpersonal. Esto provoca que el analisis posterior de losdatos pueda contener errores, lo que redunda en peores decisiones.

Para resolver en cierta manera eso, este proyecto de memoria desarrollo algoritmos deweb mining que incluyen analisis de exploracion visual y neurodatos. Al ser ambas fuentesde datos objetivas, se elimina en cierta parte la variabilidad de los resultados posteriores, conla consecuente mejora en las decisiones a tomar.

El resultado principal de este proyecto son algoritmos de web mining y modelos de com-portamiento del usuario que incluyen informacion de analisis de exploracion visual y datosobtenidos a traves de tenicas de neurociencia. Se incluyen tambien una lsita de websitekeyobjects encontrados en la pagina de prueba para este proyecto.

Se incluyen ademas una revision general acerca de los principales temas sobre los que elproyecto se basa: la web e internet, el proceso KDD, Web Mining, sistemas de eye trackingy website keyobjects. Por otra parte se especificaron los alcances del proyecto de memoria,tanto tecnicos como de investigacion.

Se concluye que el resultado del trabajo fue exitoso, incluso siendo el resultado de losalgoritmos similares a la metodologıa previa. Sin embargo se abre un nuevo camino encuanto al analisis de sitio dadas las relaciones encontradas entre el comportamiento pupilar yel analisis del sitio. Son incluidas ciertas consideraciones y recomendaciones para continuary mejorar este trabajo.

i

RESUMEN DE LA MEMORIAPARA OPTAR AL TITULO DEINGENIERO CIVIL INDUSTRIAL

POR : GUSTAVO MARTINEZ AZOCARFECHA: 11/12/2013

PROF. GUIA: SR. JUAN VELASQUEZ

Page 3: Mejoramiento de una metodología para la identificación de Website

Abstract

The rapid growth of the Internet has created a sustained increase in web sites for alltypes of businesses, organizations and individuals, causing an immensely high level of supply.These sites begin to be, increasingly, an important communication channel for customers andalso a sales channel, so it is necessary to generate strategies for attracting more users to thesite and also make existing users continue using it. This raises the question of what kind ofinformation is useful to the end user and how to identify this information.

Previously this problem has been addressed by web mining techniques in the areas ofcontent, structure and usability of a website. This techniques allow to find patterns that willgenerate information and knowledge on these data. In the same way this data allows to takebetter decisions about the structure and content of the websites.

However, this type of techniques includes the combination of objective data (web logs)with subjective data (surveys and focus groups mainly), which have a high variability bothpersonal and interpersonal. This causes the subsequent analysis of the data may containerrors, resulting in worse decisions.

For solving this in some way, this project developed web mining algorithms that includedvisual exploration and pupil dilation analysis. Being both objective data sources be elimina-ted to a certain part of the variability of the subsequent results, with consequent improvementin decision making.

The results obtained were web mining algorithms and user behavior models that incor-porate visual exploration and analysis of data obtained through techniques of neuroscience,applied and validated. Also a list of website keyobjects found through this techniques ispresented.

It also includes an overview on the main topics on which the project is based: the weband the internet, the KDD process, web mining, eye tracking systems and website keyobject.In the same way were specified the thesis technical and research scope.

Is concluded that the results obtained in this project thesis were successful, even being theresults similar to the previous methodology. However a new way of exploring website usingthis techniques is opened due to the relations founded between the pupil dilation and theanalysis of the site. Some considerations and recommendations are included for continuingthis work or improving this work.

ii

ABSTRACT OF THE THESISTO OBTAIN THE TITLE OFCIVIL INDUSTRIAL ENGINEER

BY: GUSTAVO MARTINEZ AZOCARDATE: 11/12/2013

GUIDE PROF.: MR. JUAN VELASQUEZ

Page 4: Mejoramiento de una metodología para la identificación de Website

En cualquier lugar que estuvieran, recordaran siempre que el pasado era mentira, que lamemoria no tenıa caminos de regreso, que toda primavera antigua era irrecuperable, y que

el amor mas desatinado y tenaz era de todos modos una verdad efımera.- Gabriel Garcıa Marquez, Cien anos de soledad

Page 5: Mejoramiento de una metodología para la identificación de Website

Acknowledgements

Despues de un largo y costoso proceso finalmente este proyecto de memoria ha terminado.Sin embargo su realizacion no hubiese sido posible sin la ayuda de muchas personas que dealguna forma u otra colaboraron en su desarrollo, a quienes me gustarıa agradecerles.

A todos los integrantes de “la salita”: Yerko Covacevich, Claudio Aracena, Manuel Castro,Cristian Rodriguez, Edison Marrese y Alfonso Abadıa. Gracias a todos por su ayuda en eldesarrollo de las soluciones, los analisis, en los problemas con el informe y su preocupacion ycompania.

A todo el equipo del laboratorio de Neurosistemas que siempre tuvieron la mejor di-sposicion para ayudar y ensenar a manejar los equipos o explicar el funcionamiento de losprogramas. Gracias en especial a Christ Devia, Enzo Brunetti y Caro Astudillo que fueroncon los que trabaje y que estuvieron preocupados constantemente.

A mi amigo personal Felipe Bravo quien me colaboro fuertemente con su amplio cono-cimiento de mining, algoritmos, analisis, computacion, etc. sin lo cual no habrıa podidoavanzar o llegar a buen puerto.

A Kristopher Munoz por dedicar gran parte de su tiempo y trabajo en realizar algoritmosy analisis para el desarrollo de mi trabajo, sin los cuales este proyecto hubiese tomado muchomas tiempo.

A todos los que se dieron el tiempo de venir a realizar el experimento a esta facultad.Javier Ojeda, Camila Paniagua, Pedro Abarca, Belen Castro y todos aquellos que no conocıpersonalmente pero que participaron voluntariamente en el.

A mi profesor guıa Juan Velasquez que me dio la libertad y confianza para desarrollar esteproyecto y apoyarme en todo momento con el. De igual manera al profesor Pedro Maldonadopor confiar en mi y en el trabajo que realice.

Y finalmente a todos los que se preocuparon constantemente de mi trabajo, preguntandomesobre el, dandome animo y apoyandome constantemente, incluso desde los mas remotoslugares.

A todos ustedes, muchas gracias.

Gustavo Adolfo Martınez Azocar

iv

Page 6: Mejoramiento de una metodología para la identificación de Website

IndexResumen Ejecutivo i

Abstract ii

Acknowledgements iv

Index of Tables vii

Index of Figures viii

1 Introduction 11.1 Work context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Project description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.1.2 Project description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.1 General objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2.2 Specific objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.4 Expected results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.5 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.6 Content Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Conceptual framework 82.1 Web and Internet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1.1 Information produced by the web . . . . . . . . . . . . . . . . . . . . 92.2 KDD process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.3 Web Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 Web Content Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.3.2 Web Structure Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 132.3.3 Web Usage Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 Human eye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.4.1 Eye movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.4.2 Visual attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.4.3 Eye tracking techniques . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.5 Pupillary Response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.5.1 Pupil Dilation Measurement . . . . . . . . . . . . . . . . . . . . . . . 222.5.2 Pupil Dilation and behavioral correlation . . . . . . . . . . . . . . . . 24

2.6 Website Keyobject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.6.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.6.2 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.6.3 Objects comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.6.4 Transforming categories in strings and comparison . . . . . . . . . . . 312.6.5 Methodology for finding web site keyobjects . . . . . . . . . . . . . . 322.6.6 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3 Experiment Design 40

v

Page 7: Mejoramiento de una metodología para la identificación de Website

3.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.1.1 Experimental group . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.1.2 The web site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.1.3 Eye-tracking system and devices . . . . . . . . . . . . . . . . . . . . . 42

3.2 Capturing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2.1 Experimental group . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2.2 The web site . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.2.3 Web logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3 Data transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.3.1 Web pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453.3.2 Web logs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.3.3 User interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.3.4 Experimental group . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.4 Changes in the methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 473.5 Results comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4 Implementation 494.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.1.1 Experimental group chosen . . . . . . . . . . . . . . . . . . . . . . . . 494.1.2 Web site chosen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.1.3 Machines and devices . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.1.4 Software and code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.2 Capturing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.2.1 Data about the experimental group . . . . . . . . . . . . . . . . . . . 544.2.2 Data from the web sites . . . . . . . . . . . . . . . . . . . . . . . . . 554.2.3 Web log data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2.4 Objects on the site . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2.5 Objects concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.3 KDD Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.3.1 Data selection, preprocessing and transformation . . . . . . . . . . . 57

5 Results 655.1 Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655.2 Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685.3 K Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695.4 Website Keyobjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6 Conclusions 736.1 Further Development and Recommendations . . . . . . . . . . . . . . . . . . 74

Appendixs 78A Pre Selection Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78B Informed consent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

References 82

vi

Page 8: Mejoramiento de una metodología para la identificación de Website

Index of Tables

2.1 Association between categories and chars . . . . . . . . . . . . . . . . . . . . 31

4.1 Experimental group characteristics . . . . . . . . . . . . . . . . . . . . . . . 504.2 Site’s statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514.3 SR Research Eyelink 1000 specifications . . . . . . . . . . . . . . . . . . . . 52

5.1 Correlation between object variables . . . . . . . . . . . . . . . . . . . . . . 665.2 Rules created by association rules . . . . . . . . . . . . . . . . . . . . . . . . 695.3 High time Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705.4 High delta Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 715.5 List of Website Keyobjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

vii

Page 9: Mejoramiento de una metodología para la identificación de Website

Index of Figures

2.1 HTML code of site http://www.mbauchile.cl . . . . . . . . . . . . . . . . . . 92.2 Steps of KDD process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.3 Parts of the human eye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.4 Subject using an electro-oculography technology . . . . . . . . . . . . . . . . 182.5 Scleral Contact lens with a search coil. . . . . . . . . . . . . . . . . . . . . . 192.6 Example of table-mounted video-based eye tracker. . . . . . . . . . . . . . . 202.7 Example of head-mounted video-based eye tracker. . . . . . . . . . . . . . . 212.8 Diagram of Whittaker Pupillometer . . . . . . . . . . . . . . . . . . . . . . . 232.9 Example of web site objects on the site www.emol.com . . . . . . . . . . . . 282.10 Example of a toroidal network . . . . . . . . . . . . . . . . . . . . . . . . . . 352.11 Steps of k-means algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372.12 Example of Association Rules . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.1 EER Model of the site and it’s objects . . . . . . . . . . . . . . . . . . . . . 453.2 EER Model web log tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.3 EER Model User interest table . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1 Eyelink 1000 layout with a subject . . . . . . . . . . . . . . . . . . . . . . . 544.2 EER model of pages table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.3 Navigation menu opened . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.4 Navigation menu closed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614.5 Table averaged interest objects . . . . . . . . . . . . . . . . . . . . . . . . . 614.6 Pupil behavior for different kind of emotional valences . . . . . . . . . . . . . 634.7 Tables object permanence and conceptual similarity . . . . . . . . . . . . . . 64

5.1 Area vs time spent on object . . . . . . . . . . . . . . . . . . . . . . . . . . . 665.2 area vs delta indicator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675.3 Delta vs time spent on object . . . . . . . . . . . . . . . . . . . . . . . . . . 675.4 Histogram: number of objects per time spent on them (normalized) . . . . . 685.5 K Means clusters for K=3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

viii

Page 10: Mejoramiento de una metodología para la identificación de Website

Chapter 1

Introduction

1.1 Work context

Internet use has increasingly massified over time [1], with a resulting increase in the

number of existing web sites. This increased supply of information available on the Web

creates the challenge of establishing differentiation from competition and for this reason it

is necessary to generate useful and interesting information for users or customers, in order

to keep them visiting, stay longer and also attract other users to the site. This leads to the

need to develop the question of what type or kind of information is useful to the end user of

the site and how to define it.

Previously, to address this, web mining techniques have been used in the areas of content,

structure and on web site users to retrieve such information and knowledge. These techni-

ques are based primarily on data mining analysis of the data from these primary sources of

information.

With these techniques it has been possible to find the website keywords, which were

defined as “a word or possibly a set of words that are used by users in their search process

and characterize the content of a page or website because” [2], whose function is to determine

the most important words for users of a site based on the content and navigation through it.

However getting website keywords neglected one of the most important resources on a site:

the objects. An object or website object was defined by Dujovne as “any structured group of

words or a media resource that is present on a web page that has meta-data describing the

contents” [3]. And together with this, a website keyobject was defined as “a website object

1

Page 11: Mejoramiento de una metodología para la identificación de Website

that captures the attention of users and characterizing the content of a website.” [3]

Dujovne and Velasquez developed a methodology that allowed the identification of website

keyobjects on a site by the amount of time a user spent on a website object. To do this they

used algorithms that allowed the reconstruction of the sequence of pages a user looked at

during a visit to the site and the implementation of a survey on the same control group that

visited the site.

Later Gonzalez in [4] improved this methodology, replacing the survey for an eye-tracking

system, which allowed him to more reliably determine the object permanence on each website.

This is why the natural step in the line of investigation is to quantify the user interest in

each website object using the information obtained from pupillary dilation and the correlation

between the level of dilation and interest in what is observed.

1.1.1 Project description

The project to develop as a thesis project is mainly an improvement to the existing

methodology for identifying website keyobjects (based on eye-tracking systems) through the

introduction of information about pupil dilation produced by the individual and neural data

(or extract data through techniques of neuroscience) as new sources.

First, a state-of-the-art analysis of the three main components of the project will be

developed: eye-tracking systems, information models obtained through the measurement of

pupil dilation, neural data and models of web user behavior.

Then the data generated by the pupil dilation examination along with the information

obtained through neuroscience techniques (neural data) will be properly incorporated into

the development of the web mining algorithms.

After that the algorithms and analysis techniques to detect patterns in neural data will

be developed, and this in conjunction with the data provided by the eye-tracking system will

allow the type of and interest level on a website object to be established, plus the time spent

on it. In the same way a change in the methodology designed by Gonzales will be made to

incorporate these new techniques.

Finally both methodologies (that developed by Gonzalez in [4] and the one developed

in this project) will be applied to a test site, and the results of both compared in order to

quantify the possible existence of improvement between the new techniques.

2

Page 12: Mejoramiento de una metodología para la identificación de Website

1.1.2 Project description

The methodology developed by Velasquez and Dujovne had limitations in the data ana-

lysis, mixing objective data sources (web logs) with subjective data sources (surveys), thus

skewing the information obtained. This is mainly due to the subjective data showing high

intrapersonal variability, mainly a product of seasonality and individual emotions, together

with interpersonal variability, for example, which was added in a focus group.

Gonzalez’ work continued along the line of detecting the level of interest in web objects,

the main difference being the replacement of the surveys in the method proposed by Velasquez

and Dujovne by an eye-tracking system, which would avoid the use of subjective data and

replace it with objective data. While this was useful, improving on the previous method by

a certain percentage, the system lacked information about the type of interest (positive or

negative) that a website object generated.

This project aims to improve the previous methodology, by adding the information obtai-

ned through neural data and eye-tracking systems (which have a high correlation with user

preferences) for quantifying the interest level shown by the user, along with the introduction

of more objective data.

Finally, the hypothesis presented in this thesis project is: “It is possible to improve

the effectiveness of web mining algorithms using web intelligence techniques, adding visual

exploration analysis, pupil dilation information and analysis of patterns obtained from neural

data.”

1.2 Objectives

This work pursues the following objectives:

1.2.1 General objective

The main objective of the thesis project is:

Develop, apply and validate web mining algorithms and user behavior models by adding

visual exploration data and pupil dilation analysis.

1.2.2 Specific objectives

The specific objectives for this thesis project are:

3

Page 13: Mejoramiento de una metodología para la identificación de Website

1. Analyze the state of the art of web user behavior

2. Analyze the state of the art of eye-tracking systems

3. Analyze the state of the art of pupil dilation analysis

4. Characterize the data produced by the eye-tracking system and neuroscientists

5. Develop techniques that allow detection of patterns in the data produced by the eye-

tracking system

6. Develop a functional prototype for applying the techniques on a test web site

7. Analyze and conclude based on the results, and propose future developments

1.3 Methodology

A first step in the methodology to be followed is a literature review of the three main

components of the project: a model of web user behavior, eye-tracking systems and pupil

dilation.

After this step the rest of the project will proceed using the CRISP-DM methodology,

adapting it so that it can be applied to developing web mining techniques.

The CRISP-DM methodology consists of six interrelated and cyclical steps and a constant

feedback, enabling the development of standard data mining analysis [5]. These steps are:

1. Understanding the business: the phase in which it is necessary to understand the

business, by understanding their objectives, goals and current situation. It is also

where the project’s development objectives are defined.

2. Understanding the data: this stage is where the data is analyzed in order to arrive

at a full understanding. It is also where the necessary information requirements are

determined to develop the project.

3. Data Preparation: Once the available data have been identified and understood, the

next step is to proceed to perform the tasks of selecting, processing and cleaning.

4. Modeling: Here is where modeling is done, using the proper algorithms or software for

doing it. As a process that needs feedback, new algorithms can be added according to

the level of knowledge about the data.

4

Page 14: Mejoramiento de una metodología para la identificación de Website

5. Evaluation: This is the stage where the results of the models obtained above are

evaluated, within the business context.

6. Implementation: Finally at this stage is where the hypotheses are tested and where

new knowledge is also discovered.

In the case of this thesis project, the resulting adjustment would be:

First seek to understand the objectives of the project and what the expected results are.

Also define the environment where the thesis will be developed, i.e. technologies to be used,

the test site for developing the algorithms and the necessary control groups.

Then the data capture needed for developing the project will proceed. Among this data

was the chosen site, along with its respective objects, web server log files and data on the

interests of users, gathered through eye-tracking systems and neuroscience techniques.

The following step will be transformation of the captured data needed for the project.

This includes the process of sessionalization and characterization of user interest.

The next step is to develop algorithms for detecting patterns of information in the data

collected. Different strategies will be used according to the level of development achieved.

Finally, results will be evaluated, comparing them with previous methods in order to

verify the possible existence of improvements.

1.4 Expected results

The expected results of the thesis project are:

1. A report documenting the results obtained through the research of the state of the art of

the eye-tracking systems.

2. A report documenting the results obtained through the research of the state of the art of

information about pupil dilation.

3. A data model including the definitions and descriptions of the data vector obtained and

its utility for the project.

4. A methodology proposal, along with the respective algorithms, for identifying website

keyobjects.

5

Page 15: Mejoramiento de una metodología para la identificación de Website

5. A report with the results obtained after implementing the methodology.

6. A report detailing the analysis made and the results obtained.

1.5 Scope

The scope of the project is the same context where it will be applied, i.e. the web area

and its content and structure, and not therefore to develop models and generic algorithms.

However, it would be considered possible to use in other areas or industries adapting the

required components.

The research and development scope will focus on the web intelligence area, taking the

issues related to the eye-tracking technology and neuroscience as input or data sources.

However a state-of-the-art analysis of both fields will be included.

Finally, within the technical scope, it is noteworthy that the data and information related

to eye-tracking technology and neuroscience analysis will be obtained using the equipment

provided by the Universidad de Chile. External equipment needs will depend on the resources

available for the project. For this reason, the memory project aims to develop models and

algorithms using the latest technologies, but also those which are available for student use.

1.6 Content Structure

The rest of this document is structured as follows. In the first place, in Chapter 2

the conceptual framework of this project will be explained. It will give all the necessary

definitions and explanations used during the development of the thesis and which will be

used later. Among these concepts some basic information is included about the Web and

Internet, Web mining and its derivatives, the human eye and its components, the pupillary

response and its relationship with human behavior and the state of the art about eye-tracking

systems.

Later in Chapter 3 the design and requirements of the experiment will be explained,

and all its different components defined. Concepts like the experimental group, the devices

needed, the web sites for testing, how the data will be captured, etc. will be fully detailed

here. Also some of the necessary data transformation requirements are explained thoroughly.

In Chapter 4 the whole experiment’s implementation is explained, detailing which devices

will be used, the characteristics of the experimental group selected, the web sites chosen, etc.

6

Page 16: Mejoramiento de una metodología para la identificación de Website

It will also detail the practical work and how it was developed.

Chapter 5 shows the results obtained by the experiment, detailing every component and

including a comparative analysis with the previous methodologies.

Finally in Chapter 6 the conclusions of the project, based on the results of the previous

chapter and the research done, are presented. Some guidelines and basic baselines are also

given for further development.

7

Page 17: Mejoramiento de una metodología para la identificación de Website

Chapter 2

Conceptual framework

This conceptual framework seeks to give to the reader a general and clear explanation of

the essential and basic concepts from which will be developed the thesis project. This will

involve a brief review of the main concepts, definitions and performance.

2.1 Web and Internet

These two terms are often used indistinctly, but each has its own meaning. Internet,

in technical terms, is a decentralized collection of interconnected networks, which enables

the connection between geographically separated teams through a particular protocol, which

makes it work as a single global network. The Web, on the other hand, is the name given

by Tim Berners-Lee to the system of pages and related objects linked together through

hyperlinks.

For the proper working of the Web the existence of a document transfer protocol is

necessary. In this case the protocol is called HTTP (Hypertext Transfer Protocol) [6], which

is maintained by the World Wide Web Consortium, an international organization that is

responsible for maintaining standards for the World Wide Web (abbreviated WWW).

Web architecture is formed by three main components: Hypertext Transfer Protocol

(HTTP) [6], Hypertext Markup Language (HTML) [7] and Uniform Resource Locator (URL)

[8] which are detailed below:

• Hypertext Transfer Protocol (HTTP) is a protocol for transferring documents

between two devices in a network. The specifications of this protocol, as mentioned

8

Page 18: Mejoramiento de una metodología para la identificación de Website

before, are maintained by the World Wide Web Consortium.

• Hypertext Markup Language (HTML): Hypertext markup language is used most-

ly for creating web pages. It describes the structure and content of the document as

a text, and also allows the generation of links or connections between documents and

the insertion of objects on pages such as images, videos, etc. An image showing some

code from the site http://www.mbauchile.cl can be seen in figure 2.1.

Figure 2.1: HTML code of site http://www.mbauchile.cl

• Uniform Resource Locator (URL): The string that associates each web element

with an address, which allows it to be located on the Web. It generally consists of the

transfer protocol, the domain being accessed and the particular resource. For example

in http://www.docode.cl/index.php, http corresponds to the protocol, www.docode.cl

corresponds to the domain and index.php corresponds to the particular file.

2.1.1 Information produced by the web

Web operation produces three basic data types: structure, content and navigation data

generated by users. These are detailed below:

• Structure: the structure of a web page is defined by the HTML language, which also

defines the content of the page. It also allows the generation of connections to other web

9

Page 19: Mejoramiento de una metodología para la identificación de Website

sites through links (or hyperlinks) that create common information communities [9]. In

general the main structure contains the beginning tags, the header, metadata, styles,

functions and the body of the document. Similarly, the structure of a web site is

established by the existence of hyperlinks linking the pages to one another and which

allow users to navigate between documents.

• Content: The content refers to the different components (or objects) present within a

web page such as text, images, videos, etc. The text on a page can be easily described,

but the media (sounds, videos or pictures) is not so easy, so the existence of metadata

that refers to content in this type of objects is necessary.

• Navigation data: Each time a user navigates through a web site, the web server that

runs the site leaves a record of each request made by the user in a file called a web

log [10]. Web log files store the following fields:

– Host IP: The IP address from where the request was made.

– User and User ID: If the site has user authentication mechanisms, it stores the

username and his numeric identifier.

– Timestamp: Date and time when the request was made.

– Application Method: Form in which the request was made.

– URI: Name and location of the requested file.

– Protocol: HTTP protocol version of the software that makes the request

– Status: Status of the petition outcome. This result is defined by the status code

that is delivered to the browser.

– Bytes: Document size in bytes sent.

– Reference: Refers to the page from which the user logged in to the document

requested. It is empty if he/she has entered directly through the browser.

– Agent: Identification code of the agent that made the request.

The analysis of these records allows reconstruction of the session [11] generated by each

user on a web site, which would enable understanding the process of navigation on a

site.

2.2 KDD process

The large amount of data generated and stored by different industries, such as retail,

mining, finance, etc. makes it almost impossible to do a manual analysis of it. Given this

10

Page 20: Mejoramiento de una metodología para la identificación de Website

scenario, it is necessary to generate an automated or semi-automated technique that allows

the efficient extraction of information from these large volumes of data.

This process of extracting information from large data sources is known as Knowledge

Discovery in Databases (or the KDD process), and was defined by Fayyad et al. [12] as “the

non-trivial process of identifying valid patterns, original, useful and understandable about

data.” Data is defined as a set of facts stored in a data source, and pattern is defined as

an expression in a language that describes a subset of the data or a model relevant to that

subset.

Figure 2.2 shows the components of the KDD process, which consists of five main steps.

It is important to notice that this process is iterative and interactive, and for this reason it

is possible to return to a previous step. Weighting the results with an expert in the relevant

business is also recommended. These steps are detailed here:

1. Data Selection: This is the stage where the data to study is chosen, being generally

a subset of the universe of data available. To make the choice it is necessary to analyze

the requirements and objectives of the project from the viewpoint of the business expert,

since a bad choice can lead to erroneous test results.

2. Preprocessing data: A processing is done by cleaning programs on the chosen data to

effect noise reduction, handle missing data, etc. After this processing a clean data set is

obtained.

3. Transformation: From the data set obtained in the previous step certain fields will be

chosen for analysis, considering in the process only useful data variables. This helps to

further reduce the dimensionality of the data set. The result of this process is a refined

database.

4. Data Mining: This is the core of the KDD process and involves the discovery of patterns

in data. After defining the objectives of the process, different algorithms can be applied to

find information and patterns related to the objectives. The selection of these algorithms

depends directly on the objectives, some being more appropriate than others. These algo-

rithms permit different functions to be applied to the data such as data grouping (cluster

generation), classification, regression, predictions, etc. Tests are generally performed with

more than one algorithm to compare and verify the results.

5. Interpretation and Analysis: This is where the results of the previous process are

obtained, mainly patterns that allow the generation of knowledge. In this stage the

11

Page 21: Mejoramiento de una metodología para la identificación de Website

business expert plays a key role because he is the most suitable to analyze these patterns,

and to verify the usefulness and validity of the knowledge generated.

Figure 2.2: Steps of KDD process

2.3 Web Mining

Given the large amount of data existing on the Web, and their characteristics such as he-

terogeneity, time-variance, high dimensionality, etc., they are good candidates for treatment

with data mining techniques. The use of these techniques on such data has been called web

mining [13] and research in this area is growing given the increasing use of the Web [14].

This field is the product of the union of several research areas such as databases, informa-

tion retrieval, machine learning, language processing [15], etc. The main purpose of these

techniques is to find patterns, trends and user behavior on the sites, on the basis of which it

is possible to make improvements in the structure, content and usability of a web site.

Web mining techniques are separated into three main areas according to the source from

which data is extracted. These areas are:

2.3.1 Web Content Mining

Web content mining (WCM) focuses on finding information from web documents, among

which include content, data and files. This type of mining involves the analysis of text, images

and videos, where the analysis of the last two is a sub area called multimedia data mining.

In order to analyze the text of a document a preprocessing is first required to be done. This

transformation of web pages is called vector space modeling and consists of transforming the

document into a characteristics vector.

12

Page 22: Mejoramiento de una metodología para la identificación de Website

2.3.2 Web Structure Mining

This area of web mining is responsible for analyzing the hyperlinks between pages and

the documents linked with them. In this way a web site can be represented as a directed

graph, where nodes represent pages and links represent the hyperlinks between pages. Each

link goes from the node where the hyperlink is included and ends in the node representing

the page where the link goes. This definition allows the assignment of values according to

the importance of each node, and with this information the pages or sites which are more

important within a network can be seen. These techniques are used by search engines like

Google [16] to order search results.

2.3.3 Web Usage Mining

These techniques are designed to find information patterns in user behavior when they

navigate through the web pages on a site [17]. To perform this analysis it is necessary to

apply mining techniques to the user’s browsing session on the web site. The first step is

rebuilding the entire session on the site.

Those techniques are neither independent nor exclusive. The data originated on the Web

allows analysis to be done through different kinds of web mining and even combining them

to get better results.

In this project the techniques used are mainly Web Usage mining for the analysis of

navigation patterns and user behavior on the site through the logs stored in the web log files.

Web content mining is also used for the analysis of contents (mainly objects) inside the web

pages. With these analyses it will possible to get information about the behavior of the users

on the site and the most important objects on every page.

2.4 Human eye

The human eye is an organ which reacts to and detects light, and as a sense organ,

allows vision in mammals. Different components of the eye allow light perception, color

differentiation and depth perception, and permit us to distinguish about 10 million colors [18].

Specialized cells in the retina receive the light signals which affect the adjustment of the size

of the pupil and the regulation of different hormones. [19]

Some of the components of the eye are:

13

Page 23: Mejoramiento de una metodología para la identificación de Website

• Cornea: Is located in the front part of the eye and covers the iris, pupil and anterior

chamber. This structure is transparent and refracts the light passing through it. This

characteristic is used by some eye-tracker systems.

• Sclera: Is a membrane located in the external part of the eyeball. It is white, thick and

strong and its function is to give shape and protect the internal components. Usually

the visible part of the sclera is called the white part of the eye [20].

• Choroid: Is a membrane located between the retina and sclera and is the vascular

layer of the eye, containing multiple connective tissues. Its functions are to provide

nutrients to the components of the eyeball and keep the temperature constant.

• Pupil: Is a hole located in the center of the iris, which allows light entrance into

the eyeball [21]. It is an expandable and contractile structure, and its function is to

regulate the amount of light received by the retina. It appears black because light rays

are mostly absorbed by the tissue inside the eyeball.

• Retina: Is a light-sensitive tissue located on the rear inner surface of the eye. The

light impinging on the retina creates a sequence of chemical and electric phenomena

which produce nerve impulses. These impulses are sent to the brain through the optic

nerve. This is part of the first stage of visual perception [22].

• Iris: Is the colored and rounded membrane which separates the anterior and posterior

eyeball’s chambers, and its function is to control the size and diameter of the pupil [23].

The iris color is often called eye color.

• Macula: Is a place near the center of the retina shaped as an oval with a yellow

pigmentation. It contains a greater concentration of cone cells, allowing high-resolution

vision. This structure works when something is observed or focused on with interest

[22].

The location of these components are described in figure 2.3.

2.4.1 Eye movement

The eye can perform different kinds of movements for fixating, acquiring and tracking

visual stimuli, and in this way the person can develop a cerebral image of the scenario [24].

These movements can be classified into three main categories:

14

Page 24: Mejoramiento de una metodología para la identificación de Website

Figure 2.3: Parts of the human eye

• Pursuit Movements: Are the movements performed by the eye to keep track of a

moving object.

• Vergence Movements: Are the movements of both eyes to the same object with the

purpose of maintaining binocular vision.

• Saccades: These movements occur when the subject is scanning a visual scene, trying

to find the most interesting parts of a scene in order to focus on them. The speed of

the saccade cannot be consciously controlled so the eyes move as fast as they can [25].

There is another eye mechanism not considered a movement– fixation. Visual fixation

occurs when the subject maintains focus on an object, in order to get a high-resolution image

of the important details of the scene. Usually when a subject gets visual stimulation he/she

starts tracking the scene, fixating on details and performing saccades between the different

fixation points [24].

2.4.2 Visual attention

One of the most important questions is: why it is necessary to track the eye movements of

the subject? Answered in a simple way, humans move the eyes to bring a particular portion

of the visible field of view into high resolution to see it in fine detail. This is often done

to drive the focus of concentration to that particular point (or region) of interest, even for

15

Page 25: Mejoramiento de una metodología para la identificación de Website

short lapses of time. Therefore the main conjecture which can be made is, if it is possible to

track the eye movements of a subject, it is possible to register the path of attention followed

by the participant. This can thus provide a good source of information about what looks

interesting, what draws attention, how the world is perceived, which components of a scene

are more noticeable, etc.

Notwithstanding the previous reasons for tracking eye movement, it is necessary to esta-

blish a definition for what attention is, and whether eye movement gives some information

about the process known as visual attention.

Regardless of having been a topic of study during the last century, there is not a definitive

complete definition of visual attention. One definition was given by the psychologist William

James [26]:

Everyone knows what attention is. It is the taking possession by the mind, in clear and

vivid form, of one out of what seem several simultaneously possible objects or trains of

thought. Focalization, concentration, of consciousness are of its essence. It implies

withdrawal from some things in order to deal effectively with others...

When the things are apprehended by the senses, the number of them that can be attended

to at once is small, ‘Pluribus intentus, minor est ad singula sensus.’

The last part of the phrase, in Latin, can be translated into something like ‘Many filtered

into few for perception’. It is supposed that attention is the filter in this case.

The human mind cannot attend to all the stimuli which it receives at the same time, due

to its finite capacity. Generally attention is the way the mind focuses its capacities onto a

selection of the whole range of stimuli, in order to process it in an adequate way. Due to

the limited capacity of the brain, it is necessary to concentrate on specific components of

the whole “sensory realm” to examine in a detailed way the most interesting smells, sounds,

sights, etc. rather than the peripheral stimuli. This is particularly noticeable in vision,

where a scene is inspected using parts of the whole scene, which are highly detailed due to

the attention process, and are later mixed with other detailed parts, to construct a coherent

representation of the whole scene.

In simple terms, the human vision system has two parts working together; a central zone

with a very high resolution called foveal vision and the rest of the visual range with a low

resolution called peripheral vision. Thus, when the gaze is directed at a particular place, the

16

Page 26: Mejoramiento de una metodología para la identificación de Website

foveal vision gives a high resolution image of that particular range of the scene, and the rest

of the scene is watched in low resolution through peripheral vision.

Visual attention and eye movements

If visual attention is considered in terms of “what” and “where”, i.e. “where” the noti-

ceable things are on a scene (through parafoveal vision) and “what” these noticeable things

are (using foveal vision), it is normal to consider that eye movements work in a way that

support this dual-attentive hypothesis. This means that vision behaves in a cyclical process

composed of these steps:

1. Given a scene as stimulus the whole frame is seen through peripheral vision and thus

mostly at low resolution. In this step the interesting components of the scene may “pop

out” in the field of view and engage the attention to their location for further detailed

inspection.

2. Foveal location disengages attention and the eyes are moved toward the region that

attracted attention.

3. After the eyes are pointed to the area of interest, the fovea is directed at the region of

interest and attention is engaged to that point to perceive it in high resolution.

This type of visual attention represents a bottom-up approach. This means that the

interesting areas of the visual stimulus, or in this case scene, attract enough attention to

prevent looking at the rest of the scene. On the other hand, a top-down approach is driven

by other cognitive factors such as knowledge, expectations, goals, etc. Under this model,

the subjects are more likely to look around using peripheral vision. For example a persons

who drives regularly will notice gas stations more than another person who does not drive

often [27].

2.4.3 Eye tracking techniques

Eye-tracking systems are devices that allow, using different techniques, measurement

of the eye movements of a person. In this way it is possible to know where the subject

is directing their gaze during its use, and with these data it is possible to determine the

sequence of movement of the eyes along with the time they stopped to observe an object.

17

Page 27: Mejoramiento de una metodología para la identificación de Website

In this section the main techniques for tracking eye movement will be presented, belonging

to two main categories, the ones which measure eye position relative to head position and the

ones which measure the orientation of the eyes in space, which is called point of regard [28].

Electro-Oculography (EOG)

This technique was widely used during the 1950s (and is still used in some places today).

It consists of electrodes positioned around the eyes to measure the differences in the skin’s

electrical potential. This technique measures the eye movement relative to head position, so

it is not very useful for measuring point gaze unless it is used with a head tracker to measure

the movements of the head. An example of this is the electrodes on a subject as shown in

figure 2.4

Figure 2.4: Subject using an electro-oculography technology

Scleral Contact Lens / Search Coil

A scleral contact lens is one of the most precise methods for measuring eye movements.

This device consists of a mechanical or optical reference object mounted on a contact lens,

which is worn directly on the eye. The contact lens for this device is necessarily large,

extending over the cornea and sclera, to avoid any sliding movement over the eye.

Different kinds of mechanical and optical devices have been used on the stalk attached

to the contact lens but the most popular is the wire coil. This kind of device can be easily

18

Page 28: Mejoramiento de una metodología para la identificación de Website

measured while moving through an electromagnetic field. A figure showing the size and the

position on the eye is shown in figure 2.5

Despite being one of the most precise techniques, is also the most intrusive method,

causing discomfort to the user during its use. Also it measures the movements relative to

head position, and it is not generally suitable for point-of-regard measurement.

Figure 2.5: Scleral Contact lens with a search coil.

A subject wearing a scleral search coil in conventional manner (A), the wire exiting directly (white arrow).When wearing a bandage lens on top of the coil, the wire is applied to the sclera and exits the eye near the

medial canthus (B, gray arrow). The bandage lens (C, center) is 5 mm. larger in diameter than a scleralsearch coil (C, right). [29]

PhotoOculography (POG) or Video-Oculography (VOG)

This category of eye-tracking techniques groups a wide variety of techniques for measuring

eye movements which involve the measurement of different features of the eye during rota-

tion/translation movements. Some of these features are the shape of the pupil, the position

of the limbus or the reflection of the cornea under light exposure (usually infra-red light).

These techniques work differently, but are grouped together because they often do not

provide point-of-regard measurement. Further, results may be gotten automatically, but in

some cases may involve having to visually inspect the recorded eye movements. This kind of

analysis can be extremely tedious and prone to error, and it is limited to the sample rate of

the device used.

19

Page 29: Mejoramiento de una metodología para la identificación de Website

Video Based Combined Pupil / Corneal Reflection

The main innovation of this technique is the provision of point-of-regard measurement.

Point-of-regard measurement requires having the subject’s head fixed so that the eye’s posi-

tion relative to the head and point of regard coincide. Otherwise it is necessary to measure

other features of the eyes to disambiguate head movements from eye rotation.

Two useful features for measurement are the corneal reflection and the pupil center.

Video-based trackers incorporating relatively inexpensive cameras and image processing har-

dware are used to compute the point of regard in real time. The apparatus and set up can

be table-mounted (as shown in figure 2.6) or worn on the head (as shown in figure 2.7). The

techniques are the same in both systems, and the only difference is the size of the device.

Often the head mount is more precise, because it moves with the head, but it is also more

invasive.

Figure 2.6: Example of table-mounted video-based eye tracker.

20

Page 30: Mejoramiento de una metodología para la identificación de Website

Figure 2.7: Example of head-mounted video-based eye tracker.

The corneal reflection of the light source (usually an infra-red light) is measured relative

to the location of the pupil center, where a clearly identifiable bright disc (the pupil), and

a small one (the corneal reflection) can be detected. Using a calibration system of 9 points

and some basic trigonometric calculations it is possible to estimate the position of the cornea

and the pupil, therefore the point of regard of the subject.

Data generated by eye-tracker devices

Modern eye-tracking systems provide a wide variety of data. Among the most important

are:

• time in milliseconds when the data was collected

• horizontal and vertical position, on the monitor, that observes each eye separately

• horizontal and vertical location of the pupil for each eye separately

• distance in millimeters from the eye tracker to each eye separately

• diameter of the pupil of each eye in millimeters

21

Page 31: Mejoramiento de una metodología para la identificación de Website

2.5 Pupillary Response

Pupillary response is a physiological response that varies the size of the pupil, producing

a dilation or constriction. This is due primarily because of the light reflex and the accom-

modation reflex. The first one is produced by the exposure of the eye to a light source,

generating a dilation when it is exposed to dim light and a constriction in intense light. The

second reflex refers to changes in pupil diameter produced by trying to focus on an object.

Other kinds of smaller changes in pupil diameter are reflections of cognitive processes or

behavioral activities. Among these, researchers have found correlations between pupil dilation

and positive stimuli, pupil constriction and unpleasant stimuli and pupil diameter and the

learning process, information processing, short-term memory and nonverbal communication.

It can also indicate sexual stimulation [30].

The diameter of the pupil is determined by the activity of two iris muscles, the sphincter

pupillae and the dilator pupillae, and the major function is to change the pupillary diameter

according to the light environment in order to adjust the amount of light allowed to enter

the eye. The pupil in humans can dilate to about 8 to 9 mm., constrict to 1.5 mm. and react

to stimuli in 0.2 s with peaks in 0.5 to 1.0 s. [31], [32]

Pupil movements (constriction and dilation) are produced by the iris’s muscles controlled

through the ANS (autonomic nervous system). In particular, pupillary constriction is pro-

duced by the innervation of the circular fiber of the iris, a process caused by the neurons of

the PNS (peripheral nervous system). Pupil dilation is produced by the excitation of radial

fibers of the iris, produced by the SNS (sympathetic nervous system) neurons.

2.5.1 Pupil Dilation Measurement

Pupil measurement has been a topic in eye research throughout the last century but only

during the last 40 years have accurate measurements been possible. This has been facilitated

by the development of more precise and reliable kinds of instrumentation, even under diverse

experimental conditions and under various types of psychological and physiological states.

One of the basic problems of pupil measurement is the constant fluctuation of pupil size

in both eyes during waking hours, which can be about 1 mm. in amplitude. This is due to

the brain mechanism that regulates pupil size according to light stimulation. Under normal

light conditions it is important to consider this behavior when taking measurements.

In the beginning, the measurement of pupil size was done using infrared photography

22

Page 32: Mejoramiento de una metodología para la identificación de Website

which allowed photographing pupil diameter even under bad light conditions. Some pho-

toelectric methods were also used, devices which measured the light reflected on the iris.

Another device, described by Hess [33], used a 16 mm. camera to film the pupil. This device,

consisting of a movie camera, a projector, a screen and reflecting mirrors, allowed filming of

the subject’s pupil regardless of their eye color through an infrared sensor. To measure the

dilation 20 individual frames were averaged and used as pupil size. This kind of technique

was the most affordable but the frame-by-frame measurement involved a lot of errors, both

human and mechanical, resulting in questionable reliability. Hess highlighted the importance

of such variables as stimulus brightness and brightness contrast, and how they have to be

maintained constant during the whole trial.

In modern laboratories and research centers a video-based pupillometer is often used.

This device consists of a closed-circuit TV for recording the eye and a signal processor

for measuring and displaying the pupil diameter. In these kinds of devices infrared light

illuminates the eye and a silicon matrix tube camera is used to record the size of the pupil.

Pupil size and variations are presented as numerical readouts or on a chart, and enable

measurement over a 0 to 10 mm. range. A diagram explaining the system is presented in

figure 2.8.

Figure 2.8: Diagram of Whittaker Pupillometer

Source: Diagram 1 in [34]

23

Page 33: Mejoramiento de una metodología para la identificación de Website

Recent research uses momentary estimates of pupil diameter which enables analysis using

computational techniques. This approach is useful for task-evoked pupillary response (TE-

PR), where it is necessary to measure the pupil size at specific times during a task [35]. With

the availability of this kind of data it is possible to specify the events during the trials in

the experiment, and the background variations can be canceled out. According to Stern and

Dunham in [36], the TEPR is an appropriate index of pupillary response because the TE-

PR is measured in fractions of millimeters and usually the background pupillary response is

around one millimeter or more. Averaging the results enables the response to stand out from

the “background noise”, but this procedure requires establishing a baseline for measuring

changes in pupil size evoked by tasks.

2.5.2 Pupil Dilation and behavioral correlation

Modern pupillometry techniques allow researching in diverse fields, human behavior being

one of the most complex and interesting. The most influential researcher in this field was

Eckhard Hess who suggested the correlation between mental activities, perception, interests,

information processing, attitudes and pupillary response. He was not the first to suggest

this, but his work led to an increase in research in this field. In 1975 he launched a book

called “The Tell-Tale Eye: How Your Eyes Reveal Hidden Thoughts and Emotions.”

Over the years a lot of research has been done in this particular field. Some of the most

important conclusions and results obtained are presented in this section.

Pupil size and mental activity

The correlation between mental activity and pupillary response was first shown by Hess

and Polt [37] through a mental multiplication test. They asked the participants to mentally

calculate the result of a multiplication of two factors, increasing the difficulty from two

factors with just one digit to two factors with two digits each, with the results showing the

relation between difficulty level and pupil size. The range of size increase was between 4%

to 30% during the period between pre-question to pre-answer, decreasing immediately after

the answer was given. Thus, this result shows that pupil activity reflects the information-

processing load produced by cognitive tasks. Later, Polt in [38] used electric shocks as a

threat for incorrect answers on the same test. This resulted in increased amounts of effort to

solve the problems, and as a consequence, in bigger pupil dilation.

Later Beatty, in [39], concluded that the amplitude of the task-evoked pupillary response

(TEPR) is an index of a common factor related to the cognitive demands of memory, language

processing, reasoning and perception. In fact, pupillary responses have proven to be very

24

Page 34: Mejoramiento de una metodología para la identificación de Website

important indicators of mental efforts involved in task resolution.

Attitudes

According to research done by Hess in [33], the pupil could be a better index of attitude

towards persons or things than other traditional systems such as questionnaires or interviews.

In research done by Barlow in [40], he showed images of three political leaders of different

political factions (Lyndon Johnson, George Wallace and Martin Luther King Jr.) and an

unknown person to subjects classified as either “liberal” or “conservative”. While the subjects

classified as liberal showed pupillary dilation when Lyndon Johnson and Martin Luther King

Jr., were shown and pupillary constriction when Wallace was shown, the ones classified as

conservative showed an opposite pattern. This implies that the pupil dilates when they see

someone with whom they agreed and constricts in the ones who disagreed. However, this

result was questioned by Ertas in [41], due to the results obtained in a similar experiment.

They showed pictures of two presidential candidates (Richard Nixon and George McGovern)

and a picture of an unknown person to supporters of either of those candidates. In another

stage of the experiment, the candidates’ last names and the name Smith were also spoken.

The three pictures were associated with pupillary constriction, and the names were associated

with pupillary dilation. These results showed a conflict with the previous results. Thus, the

relation between pupillary response and attitudes requires additional research and study to

be accepted.

Perception

Perception has been related to pupil size through various different approaches. In [42]

Kahneman and Beatty tried to relate the difficulty encountered by subjects in a pitch di-

scrimination test with pupillary dilation. The participants had to judge whether the tone

presented was higher or lower than the standard tone. In the cases where the difficulty

in distinguishing between the two became greater, participants showed higher dilation. In

another experiment, Hakerem and Sutton in [43] measured pupillary response to a barely

perceptible threshold visual stimuli. When the light flashes were not detected or when the

subject was not asked to detect, pupillary dilation did not occur. When they were required

to detect whether a flash was present or absent, or when they correctly discriminated a flash,

pupillary dilation occurred. Similar results were obtained by Beatty in [44] when participants

were asked to detect a weak tone only present in half of the trials. In this case pupil dilation

occurred only when the subjects detected the presented signal. Beatty’s conclusion was that

pupillary dilations reflect changes in nervous system activation due to perceptual processing

and the difficulty of the task.

25

Page 35: Mejoramiento de una metodología para la identificación de Website

Affective value of stimuli and pupil size

One of the results obtained in the original study made by Hess in [30] was the relationship

between pupil size and affect or “feeling tone” generated by different pictures. Hess and

Polt reported in this research that when female subjects viewed pictures of male nudes or

babies they experienced greater pupil dilations than men. On the other hands male subjects

reacted in the same way with nude females. They concluded that gender differences indicated

greater interest in nudes of the opposite sex. Later in [45] Hess, Seltzer and Shlien found

that homosexual men had greater pupil dilation when they were seeing pictures of male nudes

compared to female nudes, whereas results obtained with heterosexual men showed opposite

pupillary response.

Nonverbal communication and pupil size

In 1975 Hess, in [46], concluded that pupil size is an important variable during nonverbal

communication. This conclusion was obtained after an experiment where Hess showed pic-

tures of a woman to male participants, where the pupil size in the images was manipulated

previously. Subjects described the woman with large pupils as soft, feminine or pretty. On

the other hand they described the woman with small pupils as hard, selfish or cold. These

results led Hess to conclude that pupil size seems to act as an unlearned mechanism in the

facilitation of certain social behavior such as sexual interest.

In other studies conducted by Hess, large pupils were associated with happy faces and

small pupils with angry faces. He concluded that larger pupils are associated with attracti-

veness, sexual interest and happiness whereas small ones are related to the opposite charac-

teristics. This conclusion was not completely accepted by other investigators such as Janisse

in [47], where he pointed out that the validity of this data was questionable because the

data was collected with visual stimuli where the light reflex could not be ruled out as a

possible cause of pupillary change. Hess developed another approach in which participants

were allowed to draw pupils on line drawings of faces. The results showed that participants

consistently drew large pupils on happy faces and small pupils on sad faces. These same

results were obtained with both adults and children as subjects.

Hicks in [48] tested Hess’ results with 223 college students. They found that subjects

drew large pupils on happy faces and smaller ones on sad faces, the same conclusion as the

previous research.

26

Page 36: Mejoramiento de una metodología para la identificación de Website

Information Processing, learning and pupil size

Poock in [49] concluded that pupil diameter was related to information-processing speed.

First he determined a maximum processing capacity by making the subjects press buttons

corresponding to displayed numerals as fast as possible. Then he made the participants

process numerals at different percentages (between 50% to 125%) of maximum capacity.

When the subjects were required to process information at 75% and 100% significant increases

in pupil diameter appeared. However when the requirement was raised to over 100% to 125%

of capacity, pupillary constriction occurred.

In another study Peavler in [50] required 14 college students to reproduce a string of digits

after hearing them, varying the size of the string between 5, 9 and 13 characters. A trend

toward increased pupil size with each successive digit in the five and nine digit condition was

noticed. The experiment also showed the dilation leveled off immediately after presentation of

the 10th digit in the 13-digit string. These results suggested to Peavler that the information

processing effort was momentarily suspended at that instant. In this experiment, nine digits

was approximately the short-term memory capacity of the subject, and correspond to the

point of no further dilation. Even when Peavler and Poock got similar results for the first

parts of their experiments (more processing load corresponded to a bigger pupil size) they

disagreed about what happens afterward, when there is an overload.

2.6 Website Keyobject

The content of a web site is composed mainly of text formatted using HTML standard,

but also includes other types of contents, called multimedia content, which include images,

sound, animations, videos, etc. Nowadays these kinds of contents are some of the most

popular on the World Wide Web, and sites like Youtube, Facebook, Instagram, etc. are

among the most popular sites in the world.

One of the problems presented by this kind of content is the difficulty of analyzing it.

Because the format of these data does not allow the direct use of web mining techniques due

to ignorance about their content, the option of applying automatic techniques is disabled.

Web mining techniques implemented directly on a web site will not consider these kinds of

data, because they give inaccurate results.

This section will try to show a complete presentation of website keyobjects. First the

definition of website objects and website keyobjects is given, then an explanation of how to

implement them on a web site. A comparison measure for the objects will then be established

27

Page 37: Mejoramiento de una metodología para la identificación de Website

and finally a methodology for finding them will be detailed.

2.6.1 Definition

Website Keyobjects are based on the definition of an object on the Web, but this definition

can cover a broad range of meanings, because the Web is defined as a set of pages or resources

with links between them. For this project, object will be considered as any group of words

having some kind of structure, and in the same way, multimedia files which are shown on

the web site pages, including all kinds of pictures, images, sound, animations, etc. Objects

based on word structure need to be inside delimitators such as paragraphs, tables or other

kinds of tag separation.

An example of the previous definition can be seen in figure 2.9, which corresponds to a

well-known news web site in Chile. It is possible to watch some of the objects on the page

(not all of them).

Figure 2.9: Example of web site objects on the site www.emol.com

In the figure three objects are defined: a banner, an image and the title of a story, and

each one has different formats: an animation, an image and a text with an image. But

the separation between them is subjective, because a human can understand the difference

between them, but an automated system cannot do that analysis. It is necessary to define

metadata for the objects in order to define their content.

Therefore a web site object was defined by Dujovne and Velasquez as “a structured

28

Page 38: Mejoramiento de una metodología para la identificación de Website

group of words or media, which are present on a web page, having metadata describing

their content” [3]. This definition clearly states the need for metadata, which are necessary

to build the content representation vector of the page, and will allow the comparison of

different objects between them (a video with a picture for example).

The previous definition allows any object on a web site to be described, and by doing

so it is possible to process them using different algorithms and analyzing technologies. This

definition is also the baseline for the definition of a Website Keyobject.

A web site keyobject was defined as “one or a group of website object that attract the

user attention and characterize the content of a page or website” [3]. This definition states

which web objects get more attention and are more interesting to the user and therefore,

identify which object types would help to improve the presentation, usability and content of

the web site.

2.6.2 Representation

HTML standard does not have any kind of structure for defining components of the web

objects which incorporate metadata. For this reason it is necessary to create another way to

do it.

One solution presented in [3] was inspired by the integration of metadata using MPEG-7,

where every multimedia file is associated with an XLM1 file which stores the metadata of

the multimedia file. Using this method it is easy to give a basic definition for every object

on the page without adding other elements to the source code, and in the same way it will

be easier to get the defining concepts of each object.

In an XML file the necessary data for describing the objects is added, and it will keep a

link between the objects and the page where they belong. The data in this file will be:

• Page Id

• Object

– Object Id

– Object format

– Concepts associated to the object

1XML is an acronym for Extensible Markup Language, a markup language that defines rules for encodingdocuments in a format that is readable by humans and also by machines.

29

Page 39: Mejoramiento de una metodología para la identificación de Website

With this data it is possible to get the relationship between a page and its objects. The

format of the object can also be obtained along with the concepts which define it.

A basic definition of object concept could be a set of words which describe the concept

in a partial or total way. Using XML syntax it would be something like:

<concept>Analysis tool</concept>

In the example, the concept Analysis tool defines that the object is a tool for doing some

kind of analysis, but does not say anything about the relations between the concepts.

To handle the previous problem each concept will be classified into various categories

which will allow create relationships between them. Every concept can belong to one or

more categories, depending on the context of the object. An example of this could be:

<concept category=“marketing”>Analysis tool</concept>

<concept category=“computation”>Analysis tool</concept>

In the previous example, the object analysis tool will be a tool related to marketing on

certain pages, but on others could be a tool designed for computation analysis. This definition

will also allow easily getting all the concepts related to certain categories.

For the concepts definition three substantives will be used. According to [51], this will

not give a complete definition of the object, but a sufficient one. However this could be used

for comparing two concepts.

2.6.3 Objects comparison

Web mining algorithms usually need a way to make comparisons between pages. For

example in [2] the similarity measure was made using a relation between the permanence

time on the pages.

Page comparison is not sufficient to compare web site objects, because a page can contain

more than one object, hence the use of the previous methodology is not direct.

A similarity measure unit, similar to the one used to find web site keywords in [2], considers

two objects O1 and O2 where:

30

Page 40: Mejoramiento de una metodología para la identificación de Website

|O1| = N and |O2| = M ∃N,M > 0 ∧N ≤M

The i-esim concept of the object O is also defined as:

Ci(O) i = {1...M}

The main idea is to pair up similar concepts for every object, and based on this result

compare the objects. The pseudo-code for the algorithm would be:

• ∀ Ci(O1) i = {1...N} :

– ∀ Cj(O2) j = {1...M} :

∗ Compare the 3 words which define Ci(O1) with the 3 words which define

Cj(O2) using a thesaurus2

∗ If the words are the same 1 is added. In case of being synonyms 0.5 will be

added.

∗ The pair of concepts with their sum closer to 3 will be paired.

2.6.4 Transforming categories in strings and comparison

It was previously defined that every concept could belong to one or more categories, and

in this way it is possible to define an object using the categories of their concepts. This would

be expressed in this way:

O = Category1, Category2, ..., CategoryN

Category Associated charCategory 1 ACategory 2 BCategory 3 CCategory 4 DCategory 5 E

Table 2.1: Association between categories and chars

And supposing that each category is associated with a character, according to the table

2.1, every object could be represented through the characters of each category. For example:

2A thesaurus is a reference work that lists words grouped together according to similarity of meaning,used for representing concepts.

31

Page 41: Mejoramiento de una metodología para la identificación de Website

O1 = ABB

O1 = BBC

If every category is codified and the objects are represented through those characters, it

is possible to define a string for each object. This string will represent the categories which

define every object and, after being paired, it will be possible to compare those strings to

discover if they are similar.

To compare two strings the Levenshtein distance is used, also known as Edit Distance,

which is a string metric for measuring the difference between two string sequences. This

metric is defined as the minimum number of edits (insertion, deletion or substitution) required

for transforming one string into the other.

The application of this technique to the string results in a conceptual comparison, because

a difference between both represents a difference between the concepts. If two objects have a

Levenshtein distance equal to 0 or near to 0 it will mean that they are conceptually similar.

On the other hand, if the distance is far from 0 it will mean that the objects are different.

Finally the similarity between objects will be calculated through the relation between the

distance equation defined previously and the maximal length between the two strings. The

equation can be seen in the equation 2.1.

do(O1, O2) = 1− L(O1, O2)

max{|O1|, |O2|}(2.1)

The value of the equation will be 1 when both objects are equal and 0 when they are

totally different.

2.6.5 Methodology for finding web site keyobjects

The methodology proposed by Dujovne and Velasquez in [3], and improved by Gonzalez in

[4], for identifying web site keyobjects is defined by two main processes: Data transformation

and clustering.

32

Page 42: Mejoramiento de una metodología para la identificación de Website

Data transformation

The data to transform in this stage comes from different sources so it is necessary to

perform separate transformation processes to each data set.

• Sessionalization: The process of rebuilding the sessions of different web site users. In

this case, a session is the sequence of web pages that were visited while browsing the

site. This information can also give the user’s time spent on each page.

• Adding metadata: In general the objects on a page suffer a lack of metadata, for

which reason it is necessary to incorporate it in order to do further analysis. First it is

necessary to identify the different objects from a web site page, and then, in conjunction

with the web master of the site, define the concepts that characterize each object.

• Time spent on the objects: Dujovne and Velasquez suggested conducting a survey

on a control group to determine the time spent on each object. This was replaced by Go-

nzalez introducing an eye-tracking system which allowed him to incorporate objective

data of the time spent on each object to the sessions data.

• Vector of user behavior: For each session identified above objects were chosen that

caught the attention of the user. With this it was possible to define the vector called

Important Object Vector (IOV) defined by the equation:

v = [(o1, t1), ..., (on, tn)] (2.2)

Where oi represents the object and ti represents the time spent on each page.

Clustering algorithms

In order to apply clustering algorithms to the identified sessions it is necessary to define

a distance measure between vectors. A distance measure between Important Object Vectors

was defined for this.

• Similarity measures for sessions: In [3] the similarity between two IOV was defined

through the equation:

st(α, β) =1

i∗ (

i∑k=1

min(ταkτβk,τβkταk

) ∗ do(oαk , oβk)) (2.3)

33

Page 43: Mejoramiento de una metodología para la identificación de Website

Where α and β correspond to the user sessions identifier to be compared. ταk corre-

sponds to the time spent by the user α on the object ok and do(oαk , oβk) is the similarity

between objects using the Levenshtein distance.

The last equation was defined between 0 and 1, where 0 represents no similarity at all

and 1 represents total similarity.

Later this measure was used as an input parameter for these clustering algorithms:

Association Rules, K-Means and Self Organizing Feature Maps.

2.6.6 Clustering Algorithms

In the previous methodologies for finding web site keyobjects, created by Dujovne and

Velasquez in [3], the main objective was grouping the user behavior vectors using three

different techniques: SOFM (Self Organizing Feature Maps), K-means and Association Rules.

The results of these techniques gave different sets of vectors, where their inner components

were similar among themselves but different comparing them with the other sets. The criteria

for selecting the Website Keyobjects was to select the objects which appeared more times in

the set of vectors after applying the techniques. It is important to notice that the similitude

measure between two IOVs using the equation 2.3 will be used to compare to sessions.

Algorithms

First the algorithms used for this process will be presented, along with how they work

and how they were applied to that project.

The first algorithm presented will be SOFM, because the number of clusters obtained

after its application will be used as an entry for the K-means algorithm. Then the K-means

clustering algorithm will be shown and finally the algorithms to get the Association Rules.

SOFM The Self Organizing Feature Maps (also known as Kohonen Maps) are a type of

artificial neural network (ANN) that uses an unsupervised learning algorithm which allows

mapping high-dimensional elements into a grid of a few (usually two) variables, called a map.

These kinds of artificial neural network are different than the classic ones in the sense that

they use neighborhood function to preserve the topological properties of input space.

Every neuron in the network is associated with a weight vector which has the same

dimensionality as the training vector, a relative position inside the grid and also a set of

neurons called neighborhood. This set of neurons will define the type of topology of the

34

Page 44: Mejoramiento de una metodología para la identificación de Website

network, the rectangular and hexagonal ones being the most usual. There is a special case of

networks with a toroidal topology where the first neuron on the map will have as neighbors

the one to the right, the one below it, the last of its column and the last of its row (see figure

2.10). The neighbors of the rest of the neurons are defined in this same way.

Figure 2.10: Example of a toroidal network

As was explained before, the neighborhoods in the algorithm define a topology and its

influence radius declines through the iterations of the algorithm. This means that in the

beginning of the algorithm process the pattern generates a lot of changes in the winner

neuron and its neighborhoods, but in the following iterations these modifications will apply

only to the nearest neighbors and at the end of the algorithm will affect only the winner

neurons.

The training algorithm has a competitive nature because all the neurons “battle” between

them to define which one is the most similar to the example. When the winner is found,

arrangements to the network are made in such a way that the winner neurons become similar

to the examples.

It is not possible to use the SOFM algorithm in a direct way for the clustering session

process due to the nature of the data to be grouped. To do this every neuron will be defined as

an IOV, and to update them during the training the similarity measure between the vectors

will be used. For every IOV it will be necessary to find the neuron most similar to it and

then update the net weights according to the distances calculated. This process is iterative

and will end when the net weights are lower than ε.

35

Page 45: Mejoramiento de una metodología para la identificación de Website

K-Means The K-means algorithm allows the grouping of data sets into K clusters where

the objects in a cluster are more similar to other objects in the same group (homogeneous)

than when compared with the objects in the other clusters (heterogeneous).

The learning method in this algorithm is supervised because it requires the number of the

K clusters to be created. This algorithm works using a Top-Down approach because it starts

with a specified number of groups and later assigns the patterns to each one of them. In this

algorithm there are no overlaps between groups because each object is assigned only to one

cluster. It is also a Two-Phase algorithm because each iteration starts with an assignation

of objects to the clusters and then the centroids are calculated.

The main steps of the algorithm are:

• Clusters Initialization: Using all the objects as sources, K are randomly selected,

assigned to a cluster and set as centroids. Then the rest of the objects are compared

with these centroids and assigned to the cluster which has the centroid with the minimal

distance to it.

• Calculating the cluster’s centroid: For each cluster the distance between the cen-

troid and the objects in the cluster is compared. The one with the minimal distance is

designed as the new centroid for that cluster.

• Reassigning elements to the clusters: For each object in each cluster a comparison

is made with the centroids of the cluster and the object is moved to the cluster where

the distance between the object and the centroid is minimal.

These steps can be seen in figure 2.11

This process is repeated until there are no changes between clusters or until the error

between iterations stays stable (i.e. does not change).

Association Rules This is a “clustering” method which uses unsupervised learning to

group objects, while attempting to understand the links or associations between the different

variables or attributes of the group. The main rules of this method follow the pattern:

IF variable1=a AND variable2=b THEN variable3=c

36

Page 46: Mejoramiento de una metodología para la identificación de Website

Figure 2.11: Steps of k-means algorithm

A good example of this method can be seen in [52]. Here a set contains different objects

such as are shown in figure 2.12.

Figure 2.12: Example of Association Rules

Source: Figure 6.26. in [52]

In the figure 2.12 the objects have different kinds of shapes, colors and borders. One

possible rule for this set could be:

IF Color = Gray AND Shape = Circle THEN Border = thick

The set has six objects that meet the rule (i.e. there are six objects which are gray, are

circles and their border is thick). But this observation can be made by means of other rules

using the same variables with different values. For example:

IF Border = thick AND Color = gray THEN Shape = circle

37

Page 47: Mejoramiento de una metodología para la identificación de Website

IF Border = thick AND Shape = Circle THEN Color = gray

To measure how good a rule is and to compare it with others there are three indicators:

• Support: Support is a way of measuring the total number of observations that the

rules map. It is usually defined in terms of proportion or percentage. In the previous

example, using the first rule, there are six gray circles with a thick border out of a total

of 26 objects. Therefore the support for the rule is 6/26 or 23%

• Confidence: The rules are divided into two main parts, the If-part or antecedent and

the THEN-part or consequence. The first part refers to the statements linked with the

and in the first part of the rule, while the second part refers to the statement after

the THEN. The confidence indicator is a measure for how predictable a rule is, and its

formula is:

Confidence = groupsupport/IF − partsupport (2.4)

In the example the support value for the group was 0.23 and the support value for the

IF-part of the rule (number of gray circles in the data set) is 7/26 or 0.27. Dividing

these two values yields:

Confidence = 0.23/0.27 = 0.85

This means that rule number 1 has a confidence of 85%.

• Lift: Lift is defined as the importance of the rule because it describes the association

between the IF-part of the rule and the THEN-part of the rule. The calculation of

the indicator is made by dividing the confidence value by the support value across all

observations of the THEN-part

Lift = Confidence/THEN − partsupport (2.5)

For the example, using rule number 1, the confidence is 0.85. The THEN-part support

of the rule is 13 out of 26 (there are 13 objects with thick borders out of a total of 26

objects), i.e. 0.5. Dividing these two values yields:

Lift = 0.85/0.5 = 1.7

38

Page 48: Mejoramiento de una metodología para la identificación de Website

When a lift value is bigger than 1 it indicates a positive association. On the other hand

when it is less than 1 it indicates a negative association.

To find rules inside a data set there is an algorithm called Apriori defined in [53]. This

algorithm tries to find subsets of items which meet a certain level of confidence C defined by

the user. To do this, the algorithm uses a Bottom-Up approach where one element is added

to the subsets for each iteration, to test if they comply with the restriction. To store the

data the algorithm uses a tree structure, where it keeps the candidates for rules. This kind

of data structure allows the application of pruning techniques to increase the efficiency of

the algorithm.

39

Page 49: Mejoramiento de una metodología para la identificación de Website

Chapter 3

Experiment Design

This section will explain how the hypothesis will be tested. To do this it will be necessary

to develop an experiment to check the improvements of the new methodology versus the

previous one. With this information it will be possible to compare the results of both methods

and verify if there is an improvement using the pupillary response as a new data source. This

experiment, briefly explained before, will be thoroughly detailed in this section.

3.1 Requirements

To complete the experiment some requirements needed to be accomplished. These requi-

rements are mainly related directly to the data capturing process because this is one of the

most important steps in the experiment. The main idea is to get a clean environment and

data, without any kind of bias or other kinds of distorting variables. First the characteristics

required for the experimental group will be explained, then the type of web site required and

its characteristics and finally the devices needed to get the eye movement data and for the

measurement of pupillary response.

3.1.1 Experimental group

The experimental group is one of the most important components of an experiment,

because the data obtained with this group will be used for extrapolating the results to the

rest of the universe. For this reason the number of participants and their profiles will be the

two main variables for choosing the group. A bad decision could lead to invalid results or

inconsistency with reality.

40

Page 50: Mejoramiento de una metodología para la identificación de Website

If possible, it is highly recommended to have a sample size of 33 participants as a minimum

because according to the central limit theorem, with this number the mean and the variance

of the sample will be similar to the rest of the population. In this case, the main variable of

the study is the permanence time of the users on the web objects.

If the last point is impossible to reach, it will be necessary to have a representative

sample of the web users and potential customers of the site. To get this information it will

be necessary to have the assistance of the site’s web master or the site’s business expert. Any

other kinds of users will also be considered, because it is completely possible for anyone on

the internet to access the site.

The main profile of the participants will be:

• Age range between 16 to 50 years, according to the range of world Wide Web users.

• Having a high school degree at least.

• Having used or knowing how to use a web browser

• Having passed the selection survey (see appendix A )

• Information about expertise level will be collected, but will not be an exclusion variable.

The sample selected represents the main characteristics of the internet users and, in

particular, the users of the web site chosen. Does not include participants who could access

to site, like children or elder people, or people who does not know how to use a web browser

or computers, because the results gotten through those participants could add noise to the

data for analysis, resulting in less accurate results. This attributes makes this sample a valid

sample of the main internet user.

It is important to notice that each participant will have to sign an informed consent docu-

ment allowing the usage of the data collected through experiments with them (see appendix

B ).

3.1.2 The web site

There are three main characteristics needed for the web site in this experiment. The first

being the number of pages on the site, the second is the number of objects inside every page

on the site, and the third is having a large number of visits.

41

Page 51: Mejoramiento de una metodología para la identificación de Website

The quantity of pages on a site is important because it will determine the complexity

of finding the web site keyobjects. With an excessive number of pages the problem will be

too complex for testing the hypothesis, adding more problems to the project. On the other

hand, a site with too few pages will not benefit from the experiment approach. Nowadays

on the Internet it is possible to find an enormous number of different web sites, containing

different quantities of pages, from just one to hundreds and even thousands of pages. For

this experiment it will be necessary to have a site with a number of pages between 50 and

100.

Secondly the number of objects per web page will determine the complexity of finding

which ones are the important ones and this number is also closely related to the usability

level of the site. In the case of a page having just one object it will not be necessary to check

it, because the amount of time spent on the page will be the amount of time focused on the

object. In the case of having too many objects per page it will be recommendable to redesign

the experiment instead of testing it with that page. For this experiment it will be necessary

to have an average number of 20 objects per page.

The third point is vital for doing the mining analysis. This kind of processing requires

a large amount of data for finding patterns and then discovering knowledge, so it will be

completely necessary to have at least a sufficient number of visits as well as access to the

corresponding web log files.

Finally another requirement could be useful for this experiment, and that is the possibility

of changing the pages on the site. This could help in further development, because if it is

necessary, the improvements could be applied directly on the page to test the results and

check if there are any kinds of changes in user behavior. In any case, the experiment is

completely feasible without this feature.

3.1.3 Eye-tracking system and devices

Nowadays most of the eye-tracking systems available on the market have enough precision

for this kind of experiment, but the main problem is their cost. For this project an eye-

tracking system will be necessary which can allow the identification of the objects watched

by the participant, while retrieving all the necessary data for the experiment. It is also

necessary for the system to allow the researcher to measure the pupillary response of the

participant, retrieving the data related to it.

42

Page 52: Mejoramiento de una metodología para la identificación de Website

3.2 Capturing data

This section describes the process of capturing data from the different sources including

the experimental group, the web site, the users, etc. All these sources have different kinds

of data, with different formats and specifications, and for this reason it is necessary to treat

with them in specific ways for each one.

3.2.1 Experimental group

To capture the user interest in the objects on a page one approach will be used, an

eye-tracking system which allows the measurement of pupillary dilation. A survey, as was

conducted in [3], will not be done because the previous methodology showed that with the

eye-tracking system it is possible to get better results compared to the use of a survey.

Eye-tracking system

To get a quantitative measurement of the time spent on a web object, and the pupillary

response produced by it, an eye-tracking system will be used. This device will obtain the

ocular movements of the user, and with this information it will be possible to determine the

time spent on the objects. In the same way, the data about pupil dilation will be obtained,

and combining this information with the ocular movements will allow the response to the

web objects to be determined.

To get this information two approaches will be used, assigning a task to the user on the

site and presenting the pages as a stimulus.

• Task assignments: It is usual in web usability experiments to assign a task to the

users, and during this assignment to record the ocular movement of the participants.

These assignments or tasks have to be directly related to the normal usage of the site.

For example asking the participant to buy a ticket on an airline site or buy a product

on an online shopping site. To choose a task assignment it is necessary to have a good

level of knowledge of the site process, because the ocular movements are directly related

to the kind of task requested [54]. If a task does not correspond to the normal usage of

the site the experiment will not generate proper results. For this reason it is necessary

to validate the tasks or assignments with the site’s administrator because he is the one

who has the best knowledge about the main process of the site.

• Pages as stimulus: This approach tries to retrieve data in a different way, mainly

because this experiment does not want to get information about usability but about

43

Page 53: Mejoramiento de una metodología para la identificación de Website

web site keyobjects and the time spent on them. Instead of asking for a specific task

or assignment, the pages of the site will be shown to the participants as visual stimuli,

in a sequential order. The participant may change the page that is being visualized

when he considers there is nothing else interesting on it. In any case each page may be

watched a maximum amount of time before it changes automatically to the next one.

This time will be obtained from the web log files.

3.2.2 The web site

All the components of the chosen site will be separated for analysis. To do this a web

crawler will be used which will retrieve all the pages on the site and the links between them.

The pages of the site captured by the web crawler will be transformed to the png format.

With this transformation it will be easier to map the objects on the page and, in the same

way, to match data between the eye-tracker data source and the pages on the site.

Objects

The process of identifying and delimiting objects will be done, if possible, with the web

master or site administrator, as the person who best understands the site components, or if

this is not possible, at least validate the delimitation done. The steps developed by Dujovne

and Velasquez in [3] will be followed to describe the content of the objects on the site. A list

of word descriptors and their formats will be saved after this process.

3.2.3 Web logs

The web log file will be retrieved by the site’s administrator, directly from the server

machine. This file could be located in different places depending on the site’s web server and

the setup.

3.3 Data transformation

To analyze the data retrieved from the sources it is necessary to transform them. This

will permit them to be processed with data mining algorithms, and then the information to

be retrieved, this being step one of the critical steps of the KDD process. The transformation

process is different for every data source, and this section will explain the detailed steps.

44

Page 54: Mejoramiento de una metodología para la identificación de Website

3.3.1 Web pages

The site’s web pages will be transformed to a png file (image format), including data such

as URL and size.

Objects

The objects on the pages will be detailed, and the information such as format, description,

size, etc. will be completed and this information will then be stored in a relational database.

A list of concepts describing the objects and their categories will also be stored, along with

the relation between objects and concepts.

To know which objects are on which pages, a relational table will be created, to easily

map the objects onto the pages. This relation will also store data about the left superior

coordinate, and using the size data, it will be possible to map any pixel on the page onto an

object. This thus enables the researcher to know which objects are being looked at by the

participant during the experiment.

After the mapping step and storing process, the conceptual distance between objects

will be calculated, according to the equation 2.1. This information will also be stored in a

relational table, to be used by the data mining algorithms. This will avoid the calculation

process during the KDD process, and improve the efficiency of the whole process.

The tables, keys, and fields of the database are shown in figure 3.1.

Figure 3.1: EER Model of the site and it’s objects

45

Page 55: Mejoramiento de una metodología para la identificación de Website

3.3.2 Web logs

After retrieving the web log files from the server, they will be processed to extract the data

as detailed in section 2.1.1). These data will be stored in a relational table on the database,

allowing them to be worked with directly. This will be necessary for the sessionalization

process, to obtain the information about the time spent on every page (see section 2.6.5).

The results of this process will also be stored in another table in the database, allowing the

researcher to know the page visiting sequence and the time spent on each one of them. The

figure 3.2 shows the tables, keys and fields corresponding to the data that will be stored.

Figure 3.2: EER Model web log tables

3.3.3 User interest

Next the process of data transformation of the main data source of user interest in this

experiment is explained, the eye-tracking system.

Eye-tracking system

The transformation process of the eye-tracking data is mainly for knowing the time spent

on every object and the pupillary response produced by them. The object mapping of every

page is needed (see section 2.6.1) for correlating that data with the eye movement data

on the page, mixing the coordinates watched and time spent with the objects. With this

information it will be possible to calculate the time spent on every object, as well as the

46

Page 56: Mejoramiento de una metodología para la identificación de Website

pupillary response produced by them. After that this data will be stored in a table in the

database, according to figure 3.3.

Figure 3.3: EER Model User interest table

The time spent on the objects will be transformed to a percentage of the total permanence

on the page, and after that an average will be calculated with the participant’s data. This

will be done in an analogous way with the pupillary response of every object.

Then, using the data stored in the database after the sessionalization process and after the

processes of transforming the eye-tracking data, the time spent on every page will be weighted

with the permanence percentages on the objects obtained in the last two transformations.

3.3.4 Experimental group

Some data will be asked of the participants in the experiment, information such as age,

gender, profession, level of internet knowledge, knowledge about the research and knowledge

about the site. This will be stored in a database table along with the other tables.

The transformations here will be done to try to find some kind of correlation between the

type of user and the interest in the objects, the knowledge of the research and the relation

between the integrity of the answers (the relation between the answers on the survey and the

results of the eye tracker), etc.

3.4 Changes in the methodology

The main change between this methodology and the methodology proposed by Gonzalez

in [4] is the addition of the pupillary response as a data source using the eye-tracking system.

In the same way the main change between the methodology proposed by Dujovne and the

one proposed by Gonzalez is the way of getting the interest data. The latter includes the time

spent on every object using the eye-tracker machine as source, but for this methodology the

47

Page 57: Mejoramiento de una metodología para la identificación de Website

data about pupillary dilation is also going to be used as a significant variable. The experiment

will use this variable on the web mining algorithms for identifying if the pupillary response

can be related to the identification of web site keyobjects.

3.5 Results comparisons

The results of both methodologies (the one presented in this research and the one develo-

ped by Gonzalez in [4]) are a set of web site keyobjects. To compare the results the assistance

of the site business expert will be necessary, who will decide if an object categorized as a

keyobject was matched correctly. After this step both methodologies will be compared using

the precision comparator which is useful for evaluating a classification algorithm precision.

48

Page 58: Mejoramiento de una metodología para la identificación de Website

Chapter 4

Implementation

This chapter explains the entire implementation of the experiment designed in Chapter

3, in order to achieve the proposed result, detailing its most important components.

First the details of the selected components for realizing the test are given. These include

the web site, the participants and the devices.

Next the process of testing and capturing data are detailed, explaining the whole process

of recruiting participants, the trials and the process of getting the data from the devices.

Finally the KDD process developed for the data is explained. Details about selection,

preprocessing and transformation of data are given.

4.1 Requirements

This section describes the main requirements chosen for developing the experiment and

the tests. Specifically it shows the chosen experimental group and their data, the web site

selected for testing, the machines and devices for capturing data and the software and code

used for processing the captured data and for getting the results.

4.1.1 Experimental group chosen

For a better process of choosing the participants in the experiment, first the main focus

of the web site was established and the type of customer they are looking for. In this way

it was possible to choose a representative sample of the main clients of the web site. This

49

Page 59: Mejoramiento de una metodología para la identificación de Website

information was retrieved through the web administrator, who is the business expert.

With this data a number of 15 were chosen for testing the site, of whom seven were male

and eight were female. The average age of the experimental group was 26.1 years, with a

variance of 2.2 years.

The group involved was mainly composed of university students and professionals, of

different knowledge areas. Among them were such occupations as engineers, biologist, kine-

siologists, students, etc.

According to their own experience (no test was developed for measuring this), one con-

sidered him/herself an expert on web navigation, 12 considered their knowledge as average

and the other two think they are basic or amateur users.

In table 4.1 the data and attributes about every participant of the experiment can be

viewed.

Number Sex Age Professionor Activity

User level Knowledgeaboutproject?

1 Male 29 Engineer Medium Yes2 Female 25 Student Low Yes3 Male 25 Student Low Yes4 Female 24 Student Medium Yes5 Female 27 Kinesiologist Medium Yes6 Male 24 Student High Yes7 Female 28 Psychologist Medium No8 Female 28 Student Medium No9 Male 29 Psychologist Medium No10 Male 24 Student Medium No11 Male 26 Student Medium No12 Female 26 Student Medium No13 Female 24 Student Medium No14 Male 23 Student Medium No15 Female 30 Student Medium No

Table 4.1: Experimental group characteristics

It is important to note that, in contrast with the previous work developed by Gonzalez

in [4], in this project it was impossible to differentiate between the people who had knowledge

about the project and the ones who did not. This was mainly because the devices used for

capturing data are different, and work in different ways for the same purposes. In the previous

study it was possible to track the eye movements without the participant knowing about it,

but with the new device this is not possible due to its physical characteristics. In any case,

according to the previous study, this variable did not have any important consequence on

50

Page 60: Mejoramiento de una metodología para la identificación de Website

the results.

4.1.2 Web site chosen

The experiment was tested on the site MBA Ingenierıa Industrial - Universidad de Chile

developed by the Master in Business and Administration program of the Industrial Enginee-

ring Department of Universidad de Chile. This site has been running since January 2011

and provides information about the program offered by the University. Among the pages on

the site it is possible to find information about the courses, the methodology of the plan,

teachers, student profiles, etc., as well as some pages for applying to the program.

At the beginning of the experiment the site was composed of 32 pages and 359 objects,

and they appear 1014 times in total. This means that an object can appear on more than one

page, which behavior is quite usual with such common objects as banners, menus, footers,

headers, etc. The average number of objects per page is 31.9 and their average size is 418.6

pixels wide and 100.1 pixels high.

The site’s statistics, for the first seven months of this year (January to August), show

that the average monthly number of visits is 5320.5, the average number of distinct sessions

are 2960.5 and an average of 26512.5 pages are seen. In the table 4.2 it is possible to view

the statistics for the last seven months.

Month Unique visitors Number of visits Pages Hits Bandwidth (in MB)January 4889 9439 36592 219120 8.23February 2181 5274 26850 107951 4.13

March 2597 6141 31772 145755 5.30April 2779 6257 26119 135582 4.23May 2083 5330 21711 101204 3.87June 2542 6483 19937 93108 3.50July 2505 6930 30611 105958 3.24

August 4108 7744 42673 138970 4.14Total 23684 42564 236265 1047648 36.64

Average 2960.5 5320.5 29533.1 130956 4.58

Table 4.2: Site’s statistics

4.1.3 Machines and devices

To capture eye movements and to measure pupillary dilation a fourth category eye-tracker

system was used. This category, according to section 2.4.3, corresponds to an eye-tracking

system which uses video-based combined pupil and corneal reflection. These kinds of devices

are the most advanced system for measuring this kind of movements according to the degree

of accuracy obtained with them.

51

Page 61: Mejoramiento de una metodología para la identificación de Website

In particular the device corresponds to eye-tracker model Eyelink 1000, developed by SR

Research company. This device is composed of a main screen with a high speed camera and

an infrared emitter, connected to a host computer which does the data processing. There is

one more computer which is also connected to the host. This one allows the researcher to

develop the experiments and get the data in an easy way. It also shows in real time the same

stimulus that the participants see. The whole setup for a trial takes around 2 to 5 minutes.

The eye-tracker specifications are shown in the table 4.3

Sampling Rate 2000 Hz Monocular / 1000 Hz BinocularAccuracy 0.25o - 0.5o average accuracyResolution 0.01o RMS, micro-saccade resolution of 0.05o

Real-time Data Access 1.4 msec (SD < 0.4 msec) @ 2000 Hz

Table 4.3: SR Research Eyelink 1000 specifications

This system belongs to the Medical Faculty of the Universidad de Chile and they provided

all the necessary information and components for working with it.

4.1.4 Software and code

To implement the algorithms and process the data it was necessary to choose an adequate

set of tools. These were chosen for their relative simplicity, for their easy access and also for

the general information about them. It is always necessary to choose tools which allow the

researcher to achieve the goals of the project and fulfill the requirements.

The main tools used in this project were software for processing data and programming

languages for capturing data. They are detailed below.

• Ubuntu Linux Operating System: Ubuntu is a computer operating system based

on a Linux distribution called Debian. It is distributed as free and open source software,

and is developed by Canonical Company. According to online surveys, in 2012 it was

the most popular Linux distribution on desktop/laptop personal computer, which is

also the main market of the company. This OS was chosen because it is a well-known

system, stable and reliable for programming ambiances, as well as for the availability

of fully documented applications and its user friendly interface called Unity.

• Python Programming Language: Python is a high-level programming langua-

ge, used for general purposes. It runs on such different operating systems as Win-

dows, Linux, Mac OS, etc., being also included in such Linux distributions as Ubuntu.

Python supports such multiple programming paradigms as functional programming,

52

Page 62: Mejoramiento de una metodología para la identificación de Website

object-oriented and imperative. This programming language was chosen for these

characteristics:

– Easy learning language allowing the development of complex algorithms in a short

time

– Free for use even in commercial products

– Clear and easy to understand syntax

– Fully documented and extensive standard library

– High number of third-party libraries

– Intuitive object orientation and full modularity

– High level dynamic data types such as lists, arrays and other data structures.

• XML (eXtended Markup Language): XML is a markup language for encoding

documents in a format that is both machine-readable and human-readable, according

to a set of rules which defines it. Their specification is mainly defined by W3C. This

language is based on a textual data format emphasizing simplicity, generality and usa-

bility on the Internet. It is widely used for representing such arbitrary data structures

as, for example, web services.

• MySQL: It is an open-source relational database management system (RDMS) which

runs as server providing access to multiple data bases for multiple users. It is available

for many platforms including Windows, Linux and Mac OS and it is widely used by

different web applications and also for large-scale websites as Wikipedia, Facebook,

Google and Youtube. The source code is available under GNU General Public Licen-

se. This system was chosen mainly for being free for non-commercial systems, the

availability of documentation and the support of different database engines.

• Matlab: It is a language programming language and a numerical computing environ-

ment developed by MathWorks Company. This system allows working with numerical

data in an easy way, enabling plotting of functions and data, implementations of algo-

rithms, matrix manipulations, creation of user interfaces. It also interacts with such

other languages as C, Java and Python.

4.2 Capturing data

The data used in this kind of research comes from a lot of different sources, and the variety

of data types is likewise large. This section will try to describe the process of capturing this

53

Page 63: Mejoramiento de una metodología para la identificación de Website

data and how the storing process was managed. The section is ordered according to the

source of the data.

4.2.1 Data about the experimental group

The data about the experimental group is mainly the data obtained from the eye-tracking

device. To collect this it was necessary to explain the experiment to the participant in every

trial. They were located in front of the display and the eye-tracking device, according to

the figure 4.1, and for every trial the first 20 participants were required to do a task related

to finding information on a web page for making a decision about it. The exact instruction

was: “You are interested in taking an MBA program, but you are still unsure about the

decision. During the information searching process you found the site www.mbauchile.cl.

This site belongs to the MBA program of the Industrial Engineering Department, and has

the important information about that program. Starting at the home page, browse freely on

the site until you have the necessary information for making a decision.” This instruction

tried to emulate the normal behavior of the user during their stay on the web site.

Figure 4.1: Eyelink 1000 layout with a subject

The remaining subjects were required to watch the site without browsing on it, but just

watching random pages of the site. The subjects could change the page they were seeing to

the next one at any moment, but the system only allowed a maximum stay of one minute

on every page. If the user exceeded the maximum stay time the next page was presented

automatically. The number of pages presented to the subjects was 32.

For every trial the eye-tracker device retrieves a log file with the data obtained with it.

This file has multiple fields indicating the values of the variables during the whole trial. The

typical sample line looks like this:

54

Page 64: Mejoramiento de una metodología para la identificación de Website

<time> <xpl> <ypl> <psl> <xpr> <ypr> <psr> <xvl> <yvl> <xvr> <yvr> <xr> <yr>

These variables are:

• <time>: timestamp in milliseconds.

• <xpl>: left-eye x position data.

• <ypl>: left-eye y position data.

• <psl>: left pupil size (area or diameter).

• <xpr>: right-eye x position.

• <ypr>: right-eye y position.

• <psr>: right pupil size (area or diameter).

• <xvl>: left-eye instantaneous velocity (degrees/sec).

• <yvl>: left-eye instantaneous velocity (degrees/sec).

• <xvr>: right-eye instantaneous velocity (degrees/sec).

• <yvr>: right-eye instantaneous velocity (degrees/sec).

• <xr>: x resolution (position units/degree).

• <yr>: y resolution (position units/degree).

4.2.2 Data from the web sites

To retrieve the pages on the site http://www.mbauchile.cl the wget program was used.

Wget is a computer program that retrieves files from web servers, supporting downloads via

HTTP, HTTPS and FTP protocols. This program is included with the Ubuntu distribution

explained before.

Every page on the site was converted to a PNG1 file for mapping the objects later. The

data about the image was stored in the database shown in figure 4.2.

1PNG is the acronym for Portable Network Graphics, a popular image format on the World Wide Web

55

Page 65: Mejoramiento de una metodología para la identificación de Website

Figure 4.2: EER model of pages table

4.2.3 Web log data

The web log files were obtained directly from the web server where the site is running,

with the help of the web site’s administrator. It was not possible to get all the previous data,

because a repository for these kinds of files does not exist. However, it was possible to recover

the files from the month of August, 2013. During this time 4108 different persons visited the

site in 7744 sessions. The number of pages seen was 42,673, the number of requests to the

server was 138,970 and the traffic was 4.14 GB.

4.2.4 Objects on the site

After the site was completely retrieved from the server the objects definition process

began. To identify the objects on every page the main variables considered were the concepts

associated with each object and the spatial distance between them, in the same way as the

previous work in [4]. After the process the result gave 359 objects and most of them appeared

on several pages on the site.

To get the coordinates of each object the image of the file of each page was used. Using the

data about width and height of each object, and knowing the top-left corner, the calculation

of the four corners was easy.

4.2.5 Objects concepts

After the objects were defined, a list of concepts for defining them was made. This was

done in the same way as the two previous methodologies in [3] and [4], through a manual

56

Page 66: Mejoramiento de una metodología para la identificación de Website

analysis of every object. This method was chosen instead of some automatic method, such

as Latent Dirichlet Allocation or Pachinko allocation, because the number of objects was low

and these kinds of techniques would add more variables for the analysis, adding more bias to

the final result and giving more complexity the algorithms for finding patterns in the data.

The data about the concepts was stored in the same database shown in figure 3.1.

4.3 KDD Process

In this section the KDD process performed on the data obtained in the previous section

is described. It is separated into three main topics: data selection, preprocessing and trans-

formation and data mining. This whole process is the core of the experiment and yields the

knowledge. More information about the KDD process is presented in section 2.2.

4.3.1 Data selection, preprocessing and transformation

This section describes the process done after getting all the data in the previous section.

Each component of this process was separated according to the source of the data due to the

differences among them.

Web sites

From the total of pages obtained in the previous step, two were excluded. The first

one was the web syndication page (in http://www.mbauchile.cl/feed/rss/index.php). It was

excluded because this page corresponds to an RSS2 file for making the contents of the page

available to other sites. This page is not made for people but for crawlers or other kinds of

application such as feed readers.

The other page excluded was one that was completely the same as the other page, both

as objects and the concepts representing them. They had different URLs, but their session

data was included as a request to just one of them.

Objects and concepts

359 objects were found on the site and these results validated by the business expert.

However some of the objects were joined with other ones, some were separated and some

were deleted.

2RSS is an acronym for Rich Site Summary, a popular web feed format on the Internet

57

Page 67: Mejoramiento de una metodología para la identificación de Website

The concepts for the objects were established by the researcher, using the same techniques

as the previous research. This process was then validated by the business expert of the site,

who accepted the definition given to each one. The only modification made in this step

was the reassignment of some concepts to the objects that were separated or joined in the

previous step.

Then after defining each object’s concepts, the algorithm used for calculating the concep-

tual similarity between objects, defined in [3], was implemented. The detailed description of

this algorithm can be found in section 2.6. The results of this process was stored in a table

called concept similitude in the previous database (shown in figure 3.1).

Experimental group

The data about the experimental group was collected basically using the eye-tracker

device, mixing it with other sources of data such as web logs and videos produced by the

eye-tracker software. The survey used in the previous methodology for getting the data about

interest was not used in this project.

For every trial of the experiment, the eye tracker generates a EDF file, which is a binary

file containing the whole register made during the trial, highly compressed. To work with

this data it is necessary to translate it to a more readable file. Using the software EDF2ASC,

provided by the eye-tracker company (SR Research), it is possible to convert it to an ASCII

file 3. This file is easily readable for humans and, in the same way, for parsers that can

extract the data contained in them.

The basic data registered by the eye tracker was presented in 2.6, but it also collected

other data, possibly including the following:

• Messages: Message lines contain the data about different kind of events happening

during the trial. These messages can be sent by some applications running on the

display computer, and contain data for analysis. The most important are the ones

related with mouse actions like moving, clicking, scrolling, etc. Every line includes the

timestamp of the event.

• Fixations: The file includes data about fixations made by each eye. This line of data

includes information about which eye data is being presented, the starting and end

3ASCII is an acronym for American Standard Code for Information Interchange, a common character-encoding scheme for writing files.

58

Page 68: Mejoramiento de una metodología para la identificación de Website

time of the fixation, the duration, the average position on the X and the Y axis and

the average pupil size during the fixation. Every line is timestamped.

• Saccades: A timestamped line containing data about the movements made by each eye

(a saccade) during the trial. This line includes data about which eye is being presented,

the starting and end time of the saccade, the duration, the position of the eye at the

beginning and at the end of the saccade on both axes, the visual angle covered by the

saccade and the peak velocity during the saccade.

• Blinks: Blinks are the period of data when pupil information is missing. These events

are registered on a timestamped line showing data about which eye blinked, the starting

and end time of the blink and the duration of the event.

Of all the data collected by the eye tracker only some will be used. The most important

are those related to fixations and events. For the fixations the data used will be the average

position of each eye, the duration of the fixation and the average pupil size. The messages

about mouse clicking, in combination with the log server information, will be used to know

which links were clicked. The data about scrolling events will also be used for recalculating

the relative position of the objects on the screen.

According to [22], the main range of fixation duration is between 150 and 600 milliseconds,

so to understand something it is necessary to look at it for at least 150 milliseconds. In this

case the eye tracker is configured at 500 hz resolution, which means the data is captured

every 2 milliseconds, so there won’t be a problem with distortions of the fixation data. In

any case, only registries over 150 ms. will be considered.

To map this data into the web pages of the site it was necessary to develop an algorithm

for processing the data made by the eye tracker mixed with the web log data. With this

information it was possible to discover which pages the subject was looking at during the

trial and which objects on those particular pages were looked at.

One of the biggest problems during this stage was the scrolling problem. Usually the

pages on a website are larger than the screen resolution, and for reading the entire page it

is necessary to scroll the page. When this happens, the relative positions of the objects on a

page change, some of them appearing and others disappearing. To know the new positions

of the objects during a scroll event, it was necessary to improve the previous algorithm by

adding a checking feature to the video frame of the trial for every scrolling message. In this

particular case it was necessary to add a small object to every page on the website called an

intensity bar which is basically a long bar with different degrees of color, starting in the color

59

Page 69: Mejoramiento de una metodología para la identificación de Website

black and ending in the color white. This object did not affect the usability of the site nor

the user behavior on it because the size was minimal, just enough to be noticed by an image

analyzer, and was located on the left side of the pages, far from the main parts. Then, when

the algorithm detected a scrolling message it checked the video frame corresponding to that

timestamp, and with the help of an image analyzer algorithm, which checked the relative

position of the intensity bar, it was possible to recalculate the new positions of the objects

on the page.

Another problem presented during this stage was the floating menus. A floating menu

is basically a navigation bar which reacts when the mouse hovers over it. When this event

happens, the menu appears on the screen, increasing the size of the menu object, and covering

the objects under it (figure 4.3). On the other hand, when the mouse is not over them, the

page behaves in the normal way (figure 4.4). This is a problem because when this happens

the eyes are looking at the menu, and not at the objects under it, so it was necessary to add

this to the algorithm. To improve the previous algorithm an instruction was added which

checked the mouse movement messages in the eye-tracker file, to check the position of the

cursor. When the cursor position was over one of the options on the menu bar, the algorithm

changed the position and size of the affected objects, making it possible to calculate the

correct amount of interest on the objects. In the same way, when the mouse was not there,

the algorithm corrected the position and size of the objects.

Figure 4.3: Navigation menu opened

After that process, the registries made by the algorithm were grouped by the objects,

60

Page 70: Mejoramiento de una metodología para la identificación de Website

Figure 4.4: Navigation menu closed

and with this set the percentage of time that a user spent on each object and the average

pupil size for them were calculated. When an object was not looked at, the percentage of

time spent on it and the average pupil size were included as 0, for the correct calculations of

average interest.

Finally the time spent on every object and the average pupil size were averaged, and the

result was stored in a table called averaged interest objects (see figure 4.5) . The data stored

here were the ID of the page, the ID of the object, the average time spent on in and the

averaged pupil size.

Figure 4.5: Table averaged interest objects

61

Page 71: Mejoramiento de una metodología para la identificación de Website

Web logs

The sessionalization process was realized following the same process as the previous me-

thodologies [3] [4]. Those methodologies are based on the previous work by Fernandez in [55],

which is actually validated. The detailed steps of this process are not going to be explained

here, being outside the main scope of this thesis project.

This process was performed directly over the web log files provided by the web master

of the www.mbauchile.cl site. This web log file was separated by the web server into five

different files containing the information specified in section 2.1.1.

The maximum session time was set at 30 minutes, and with this restriction a total of

6622 registries was obtained out of a total of 2219 sessions. This gave an average of 4.1 pages

per session.

Then, in the same way as in the previous work, on each page the previously obtained

results were transformed into the objects which composed it and then the permanence time

was weighted with the percentage of permanence on the object. It is important to notice

that some of the objects appear on several pages, and in those cases the subject could see

the same object twice or more during the same session. In those cases the permanence times

on both pages were added.

For the pupil information the contraction produced by the object during the fixation was

used because this is related to the kind of object viewed according to the research. Figure

4.6 shows how the pupil reacts to different kinds of images (neutral, positive and negative).

62

Page 72: Mejoramiento de una metodología para la identificación de Website

Figure 4.6: Pupil behavior for different kind of emotional valences

According to [56], the images with greater emotional valence are seen more frequently

than the neutral ones. This can be seen in the previous figure, where the same results were

obtained. On the other hand, in [57] it is shown that images with important content or

meaning produce a bigger dilation than the ones without content.

The behaviors demonstrated in those studies are analogous in the way that both of them

showed the same behavior for different kinds of stimulus (emotional valence and content

information). For this reasons, a new indicator using the difference between the maximum

dilation value and the lowest contraction point, for the time range between the start of

the fixation until 500 ms. after seeing the object, could be useful for including the data

about object importance (in content) and object emotional valence. This information will

be calculated and stored in the same way as for contraction. This indicator will be called

“delta” because it is the difference between two indicators.

Finally the n most important objects per session were selected, as was described before

in section 2.6. There were three different criteria for selecting them. The first one was the

same as the previous methodology, where the objects were selected based on the permanence

time on them, to get a baseline for comparing results. The second criterion was similar to

63

Page 73: Mejoramiento de una metodología para la identificación de Website

the first one, but here the objects were ordered using two variables, first the permanence

time and second the pupil dilation percentage. The third way was ordering the objects based

on the permanence time, but later reorganizing them using the pupil dilation percentage.

To calculate the n value the procedure was realized by first calculating the average (µ) and

the standard deviation (σ) of the number of objects per session. Later, the n number was

selected in the range [µ− 3σ, µ+ 3σ]. All the data obtained after these processes was stored

in the tables permanence object and conceptual similarity (see figure 4.7).

Figure 4.7: Tables object permanence and conceptual similarity

In the next steps these data will be used as sources for the data mining techniques. In

the table conceptual similarity the data about similitude between concepts are stored and

in the table permanence object the data about permanence time, dilation percentage, the

object and the corresponding session are stored.

64

Page 74: Mejoramiento de una metodología para la identificación de Website

Chapter 5

Results

This chapter shows the algorithms applied to the data and their results, after the proces-

sing explained in the previous chapters and sections. These results are explained in a simple

and ordered manner, including the transformations done to the data to apply the analysis

algorithms.

5.1 Data Exploration

The amount of data retrieved during the experimentation process is quite large and in-

cludes many variables to consider. Among the objects only the ones which were seen during

the tests were considered, which corresponds to 25% of the objects mapped on the site. This

result is expected because many of the objects appear on parts of the pages that the users

usually do not visit or which are not interesting.

The objects which were seen were defined on a matrix using five variables: height (in

pixels), width (in pixels), area (the result of height per weight), the average time spent on

it (in seconds), the delta indicator, the average contraction of the pupil during visualization

(using an arbitrary measuring unit which allows comparison of results) and the average

number of views.

Variable Correlation

First, Pearson tests were calculated to find out if there was any kind of correlation between

the variables of the objects. The results of these tests are shown in table 5.1.

The previous table shows that there are no important correlations between the variables,

65

Page 75: Mejoramiento de una metodología para la identificación de Website

Width Height Area Time spent Delta Contraction N of viewsWidth 1Height 0,2754 1Area 0,5754 0,8993 1Time spent 0,1419 0,2186 0,2078 1Delta -0.0304 0,0902 0,0963 0.1028 1Contraction -0,0710 0,1075 0,0812 -0,0566 0.4557 1N of views 0,04326 0,5129 0,5201 0,2299 0.0183 -0.1221 1

Table 5.1: Correlation between object variables

except for the expected one between height, width and area. The bigger correlation between

variables is the one between area and number of views, which makes sense in the way that

a bigger object, or an object which takes up more space on the page, will probably be seen

more than the others.

Other relations between the variables were plotted to more easily notice whether there

were important relations between them. Figure 5.1 shows the plot of the variables area and

time, figure 5.2 shows the plot of the variables area and contraction and figure 5.3 shows

variables contraction and time spent.

Figure 5.1: Area vs time spent on object

66

Page 76: Mejoramiento de una metodología para la identificación de Website

Figure 5.2: area vs delta indicator

Figure 5.3: Delta vs time spent on object

These plots do not show any kind of relation in a simple way. This was an expected result

based on the correlation table shown in table 5.1.

To know if there was some kind of strange behavior in the numbers of objects observed,

a histogram of the number of objects per time spent on them was created. Figure 5.4 shows

the result.

67

Page 77: Mejoramiento de una metodología para la identificación de Website

Figure 5.4: Histogram: number of objects per time spent on them (normalized)

The previous figure shows an expected result, where there are a large number of unim-

portant objects and a few with great importance.

These results just by themselves do not allow doing any kind of interpretation, other than

analyzing if the tests were done correctly.

5.2 Association Rules

To apply association rules to the data in this project it was necessary to transform them.

This step consisted of defining which objects produced any kind of pupil dilation for each

session obtained after the sessionalization process. The algorithm A Priori was used to find

the rules, using a minimum confidence of 0.9 and a number of 30 rules. The results of this

process can be seen in table 5.2.

From the table 5.2 it is possible to determine that there is a cluster selecting the objects

1, 2, 3, 4, 5, 6, 7, 8, 9 (in that order) as some of the ones that produced dilation for the

sessions. The objects found here will be used in combination with the other techniques.

68

Page 78: Mejoramiento de una metodología para la identificación de Website

Rule number If part Then part Support ConfidenceRule 1 Object 1 –> Object 2 99,6999% 100%Rule 2 Object 2 –> Object 1 99,6999% 100%Rule 3 Object 1 –> Object 3 99,6999% 100%Rule 4 Object 3 –> Object 1 99,6999% 100%Rule 5 Object 1 –> Object 4 99,6999% 100%Rule 6 Object 4 –> Object 1 99,6999% 100%Rule 7 Object 1 –> Object 5 99,6999% 100%Rule 8 Object 5 –> Object 1 99,6999% 100%Rule 9 Object 1 –> Object 6 99,6999% 100%Rule 10 Object 6 –> Object 1 99,6999% 100%Rule 11 Object 1 –> Object 7 99,6999% 100%Rule 12 Object 7 –> Object 1 99,6999% 100%Rule 13 Object 1 –> Object 8 99,6999% 100%Rule 14 Object 8 –> Object 1 99,6999% 100%Rule 15 Object 1 –> Object 9 99,6999% 100%Rule 16 Object 9 –> Object 1 99,6999% 100%Rule 17 Object 10 –> Object 1 86,1345% 100%Rule 18 Object 11 –> Object 1 86,1345% 100%Rule 19 Object 12 –> Object 1 86,1345% 100%Rule 20 Object 13 –> Object 1 86,1345% 100%Rule 21 Object 14 –> Object 1 86,1345% 100%Rule 22 Object 15 –> Object 1 86,1345% 100%Rule 23 Object 16 –> Object 1 86,1345% 100%Rule 24 Object 17 –> Object 1 86,1345% 100%Rule 25 Object 18 –> Object 1 86,1345% 100%Rule 26 Object 19 –> Object 1 86,1345% 100%Rule 27 Object 20 –> Object 1 86,1345% 100%Rule 28 Object 59 –> Object 1 85,8343% 100%Rule 29 Object 21 –> Object 1 62,9652% 100%Rule 30 Object 22 –> Object 1 62,9652% 100%

Table 5.2: Rules created by association rules

5.3 K Means

For K Means clustering the delta indicator variables and the time spent on objects were

used. What could be expected from this clustering is a methodology to parameterize the

idea of relevance of an object. There were multiple variables such as area, height, width,

etc. which could be used for this clustering process, but a qualitative analysis of this kind

of objects (great height, great width or both) shows that there is no relation between this

variable and the importance. For example the footer object is an object with a large area

but is located on the lower part of every page on the site, and does not include any kind

of important information, so it is not important for defining a web page. In this way, the

variables selected much better represent the relationship with importance and relevance.

For the K Means process three clusters were chosen and the plot can be seen in figure

5.5.

69

Page 79: Mejoramiento de una metodología para la identificación de Website

Figure 5.5: K Means clusters for K=3

The results of the clustering process gave three clusters. These were called “High time,”

“High delta” and “Low importance.” The first one includes the objects in the middle-left

part of the graph which are the ones with lower delta value, but with a high time spent on

them. On the other hand, the “High delta” cluster includes the objects of the bottom-right

part of the graph which are the ones with high delta value and low time spent on them.

Finally the “Low importance” cluster are the ones with low delta value and low time spent

on object value. These clusters are the ones with a higher number of objects. In table 5.3

and table 5.4 the objects belonging to the clusters “High time” and “High delta” respectively

can be seen.

High time ClusterObject 5Object 195Object 215Object 243Object 244Object 245Object 265Object 267Object 306Object 356

Table 5.3: High time Cluster

70

Page 80: Mejoramiento de una metodología para la identificación de Website

High time ClusterObject 130Object 277

Table 5.4: High delta Cluster

5.4 Website Keyobjects

To combine the objects obtained through both techniques the number of times they

appeared on them was counted. After that they were ordered using the indicator delta*time

spent on each object, starting with the one with the highest value. The objects with that

value being equal to 0 were not considered. The results of this procedure can be seen in table

5.5

Object Number Website object Name Is keyobject?1 Object 5 header picture Yes2 Object 356 picture detalle profe 1 Yes3 Object 6 navigation menu Yes4 Object 195 paragraph detalle noticia 1 Yes5 Object 4 logo mba No6 Object 267 paragraph quienes 1 Yes7 Object 306 paragraph metodologia 1 Yes8 Object 244 image plataforma 1 Yes9 Object 1 ingenieria industrial logo No10 Object 265 paragraph corporacion 1 Yes11 Object 307 paragraph metodologia 2 Yes12 Object 215 paragraph propuesta 1 Yes13 Object 243 paragraph plataforma 1 Yes14 Object 277 title contenido 1 Yes15 Object 245 paragraph perfil egresados 1 Yes16 Object 130 title elegirnos 10 No17 Object 3 search form No18 Object 59 titulo cuerpo Yes19 Object 23 main post text 1 Yes20 Object 22 main post picture Yes

Table 5.5: List of Website Keyobjects

5.5 Discussion

The results in table 5.5 show that there are 16 of 20 objects which were classified correctly.

This corresponds to 80% precision which is the same amount of precision obtained with the

previous methodology. However the results and the analysis performed on the data shows

that, compared to the previous methodology, this technique includes more objects with text,

as paragraphs or titles. This may be due to the fact that reading a long paragraph and

understanding it requires more cognitive processing, which is directly related to pupil dilation.

71

Page 81: Mejoramiento de una metodología para la identificación de Website

In the same way, the indicator created by the multiplication of the time spent on an

object with the delta, is a good measure of the importance of the objects on the page. This

result could not be obtained before because the previous methodology did not include a way

for measuring pupil dilation.

Finally the results obtained do not allow the objects to be related with any kind of

emotional valence. This was an expected result because the object mapping process did not

show any kind of special objects, the majority being completely neutral. These include the

paragraph or text, because their content cannot be easily classified as positive, negative or

neutral.

72

Page 82: Mejoramiento de una metodología para la identificación de Website

Chapter 6

Conclusions

This work proposes the utilization of pupil dilation as an improvement on the previous

methodology for finding web site keyobjects. This methodology is the continuation of a

series of studies related to finding these objects, previously begun by Dujovne and later by

Gonzalez. The improvement includes the measurement of the pupil and eye movements with

newer machines and devices, and includes new techniques and algorithms for processing eye

movements and pupil dilation. These techniques are based on the assumption that there is

a relation between object importance and pupil movements, in turn based on the previous

work related to the cognitive process and its reflection on the pupil.

Thus, the pupil information adds a new dimension for measuring web user behavior, but

this kind of information can be related to many more variables such as cognitive process,

reading, emotional valences of the images or objects, etc. so the process of disambiguation

of this variable is non-trivial.

The use of this variable in this project and the results obtained with it are considered a

success because the results obtained were the expected, even in this case, where the results

were similar to the previous methodology. In particular the use of this variable and the

transformation done with it give a new tool for classifying objects and for finding which ones

caused more cognitive processing by the subjects. This opens a new way of web site analysis

and exploration, where new kinds of variables can be explored; for example the amount of

mental processing needed by the subject, the emotions produced by it, stress level, etc., and

not just the analysis of objects.

For the previous reason the main hypothesis of this project is validated, because the

73

Page 83: Mejoramiento de una metodología para la identificación de Website

relation between website keyobjects and attention on them and importance as web pages

definers was showed through empiric demonstration. It is possible to conclude that pupil

dilation it is an important factor to consider in any kind of user behavior on web pages

research, and its analysis does not make more difficult these kind of research because the

most of the new eye-tracking devices include pupil dilation analysis system.

In particular the results obtained here can also be considered for new improvements of

design and layout of the site, when considering the kind of users and the kind of information

that are being looked for. These modifications will directly affect the number of people who

visit the site and the length of their sessions, which would result in more possible customer

attention to the program.

Finally it is important to notice that this methodology and results can be an important

improvement to the web analysis done by different companies in the world, which are currently

based mostly on such subjective data as surveys and simple observation. To improve this

it is necessary to automatize the different parts of the process like the object mapping, the

retrieval of the web page, the log analysis, etc.

6.1 Further Development and Recommendations

Based on the research done on this project, there are certain considerations which can

be considered for further development of a methodology for finding web site keyobjects, for

improving the results and decreasing the bias. This can provide a baseline for developing

techniques which include new types of information and variables in order to get better results.

Kind of web site

It is important to notice which kind of web site is used for testing. In this project an

informational site was used that includes all the important things about the studies program

offered. This kind of site is used primarily for getting a first impression about the program

or to get additional information, for which reason users visit it just once on average.

It is necessary to consider this because the objects on the site have to be created with

this kind of behavior in mind. On other kinds of sites, like blogs or news, the objects are

probably created for repeated visits.

User tasks on the web site

The behavior of the user on the site is closely related to the previous point. The explo-

74

Page 84: Mejoramiento de una metodología para la identificación de Website

ration of the site differs based on the type of visits made to it. On a single-visit site all the

objects will probably be explored, but on a daily visit site a lot of objects will probably be

overridden because they were seen in the past. Accordingly, applying the sessionalization

process to this site and supposing that every session is the same can introduce bias in the

result.

In the same way, during the experimentation step, it is necessary to give different tasks

to the participants for getting all the information about which objects were seen according

to the task given. The behavior on the site could be completely different given one task or

another.

Web log time

The web log time data has to be considered because it can introduce bias into the results.

For example, on the site used for this project, the number of visits prior to the application

process into the MBA program is greater than any other month, because people are looking

for information for applying, and for the same reason will visit certain pages more than

others. Thus, the web log can indicate different kinds of behaviors depending on the time of

the year.

It is recommended to have a large amount of web log data, with a large number of

requests, and of different time ranges to include all these variables in the analysis. It is also

recommended to do an analysis of the user behavior of these sessions to detect whether there

are differences according to the season of the year.

Object type

It is important to determine that the objects are of different types. This refers to the fact

that some objects are there for “decoration” (as logos, headers and pictures), others are ads

(for example banners, pictures or videos), others are the content of the site (paragraphs, text,

titles) and others for navigation purposes (such as menus, links, buttons). The behavior on

these objects is completely different based on for their functions. For example the menus are

seen more than any other object because it is necessary to look at them to know where to go

or how the site is arranged. The logos and pictures around the site are usually seen just once

because they do not give much more information and are mostly useful for knowing about

the owner of the site. The banners and ads are avoided by most experienced users who are

“trained” to not look at them. Paragraph and text on the other hand are read to discover

the content, and usually every page has different content, so much of time on a page is spent

on those objects.

75

Page 85: Mejoramiento de una metodología para la identificación de Website

For these reasons it is important to classify them (prior to or after the experiment) and

consider this as valuable information for analysis.

Web site page layout

The layout of the page can directly affect the exploration of the site. For example the logos

and menus take up an important percentage of the page because they are usually located in

the upper part of every page, and on the other hand the contact information is located on

the lower part of the page, and it is necessary to scroll to look at them. It is logical to think

that those objects that are shown first, on the upper part of page, will be looked at more

than others, even if they do not add useful information about the site process.

Creating a classifier or giving a score to the objects, based on such data as length of the

page, length and width of the page, area of the object, etc. can be a way of including a

weight factor for them.

Cognitive process during the exploration

This project did not include information about the cognitive process during the explo-

ration of the web site. For example it could be important to analyze if some of the objects

require more cognitive processing to understand them (for example text too complicated to

understand) or if the web site produces any kind of emotion in the subjects (frustration,

anger, relaxation, etc.) to know which objects to include or which to exchange for others.

This kind of analysis can be performed using technologies such as EEG (electroencepha-

logram) for measuring the brain activity of the subject during the experiment, and then

analyzing the combination of data obtained through the eye tracker and the EEG.

Web user behavior on the page

Usually a web site is scanned from the upper part to the bottom, and from left to right,

but this behavior could be guided according to the objects on it. This is because there is a

kind of relation between the objects (for example, after looking at the title of something the

next step will be to look at the paragraph related to that title.)

It can be useful to consider these kinds of relations between the objects on a page, and

how they are related to the behavior on a page. In other words, how decisions about what to

look at and what to choose after each object are made. This kind of analysis could consider

the time spent on the objects, the movement of the eyes on the page and the order in which

76

Page 86: Mejoramiento de una metodología para la identificación de Website

the objects were seen.

The results of this analysis could help to improve the layout of a site by guiding the

behavior of the user on it.

Object mapping

One of the most important steps in this methodology was the object mapping on the site.

If the site is big, with a large number of pages and objects on it, the process will be tedious

and slow, taking a long time, and being prone to errors.

Therefore, the development of an automated or semi-automated object mapper could

significantly decrease the time needed for doing this task, and could retrieve better results,

with fewer errors. The development of this kind of system is not trivial because the page

code and the image of the page need to be analyzed, and then that information combined to

get the objects and their positions.

77

Page 87: Mejoramiento de una metodología para la identificación de Website

Appendixs

A Pre Selection Survey

This survey includes 8 simple questions for knowing if the subject is suitable for theexperiment. The comments at the side of each question are for the person who is taking thesurvey.

At the beginning of the survey it is necessary to explain the instructions for the survey:I’m going to ask some question about your eyes, because we will use devices which registersyour eyes movement and the dilation of your pupil. Also we will register your behavior onthe web site.

1. Do you use glasses or contact lenses for reading on your computer or tablet?

• [ ] Yes Continue with the next question.

• [ ] No Go to question number 3

2. Are your glasses made for

• [ ] Reading? Continue with the next question.

• [ ] Seeing far objects? Continue with the next question.

• [ ] Both

3. Can you read on a screen computer or a web page without using glasses or contact lenses?

• [ ] Yes Continue with the next question.

• [ ] No Finish survey.

4. Do you have cataracts?

• [ ] Yes Continue with the next question.

• [ ] No Continue with the next question.

5. Do you have eyes implants?

78

Page 88: Mejoramiento de una metodología para la identificación de Website

• [ ] Yes Finish survey.

• [ ] No Continue with the next question.

6. Do you have glaucoma?

• [ ] Yes Finish survey.

• [ ] No Continue with the next question.

7. Do you use a screen reader, screen magnifier or any other kind of technology for usingcomputer or visiting web pages?

• [ ] Yes Finish survey.

• [ ] No Continue with the next question.

8. Are your pupils constantly dilated?

• [ ] Yes Finish survey.

• [ ] No Continue with the next question.

79

Page 89: Mejoramiento de una metodología para la identificación de Website

B Informed consent

Informed Consent

“IMPROVEMENT OF A METHODOLOGY FOR IDENTIFICATION OF WEBSITEKEYOBJECTS THROUGH THE APPLICATION OF EYE TRACKING

TECHNOLOGIES, PUPIL DILATION ANALYSIS AND WEB MINING ALGORITHMS”

Main researcher name: Gustavo Martınez AzocarInstitution: Programa de Fisiologıa y Biofısica, ICBM, Facultad de Medicina, Universidadde Chile.Telephone Number: 2978 6035

Invitation to take part: We are inviting to take part on the research project called “Im-provement of a methodology for identification of website keyobjects through the applicationof eye tracking technologies, pupil dilation analysis and web mining algorithms”, because areneeded participants of general population.

Objectives: This research has as objective make research about pupil response as significantvariable during the process of free navigation on a web site.

Procedure: If you accept the invitation for participant in the research, you will be testedduring a single session for around 20 minutes. During this session you will be asked towatch a series of image sequences on computer’s screen meanwhile a camera register theeyes movements and the changes on your pupils diameter. This techniques are harmless andnoninvasive.

Risks: The registration of eyes movement and pupil diameter does not have other kind ofeffects on you. In any case, if you consider that there are some effects due to the procedures,you have to communicate with Gustavo Martınez on the telephone number 2978 6035.

Costs: The procedures specified here doesnt have any kind of cost for you. All necessarymeans for doing them will be provided by the Neurosystems Lab of Medicine Faculty ofUniversidad de Chile.

Benefits: This research implies a direct benefit to the knowledge progress, being a significantcontribution to understanding of pupil response during cognitive processes.

Compensation: This research does not have any kind of economic compensation for yourparticipation.

Confidentiality: All the information derived of your participation on this research will bestored with full confidentiality, including the access of researchers or research supervisoryagencies. Any kind of publication or scientific communication of the rsults of this researchwill be completely anonymous.

80

Page 90: Mejoramiento de una metodología para la identificación de Website

Additional information: You will be informed if during the development of this researchappears new knowledge of complications that may affect your willingness of continuing as aparticipant of this research.

Voluntarism: Your participation on this research is totally voluntary and you can reject itduring any moment telling it to the researcher. In the same way the researcher could decideyour retire if he considers that decision goes to your benefit.

Participant rights: If you need any other kind of information about your participation inthis research you can call to: Researcher: Gustavo Martınez Azocar, phone number: 29786035 Institution director: Pedro Maldonado Arbogast, phone number: 2978 6035

Conclusion: After you have received and understood the information on this document,and being able to solve any kind of doubts, I give my consent to participate in this project.

Subject name:Occupation:Education level:Age:Sign:Date:

Informant name:Sign:Date:

Researcher name:Sign:Date:

81

Page 91: Mejoramiento de una metodología para la identificación de Website

References

[1] Miniwatts Marketing Group, “Estadısticas de uso de internet y sitios web..” http:

//www.internetworldstats.com/stats.htm, 2011. 30/06/2012.

[2] V. J.D, R. S, B. A, Y. H., and A. T, “Towards the identification of keywords in the web

site text context: A methodological approach,” Journal of web information systems,

pp. 11–15, 2005.

[3] J. D. Velasquez and L. E. Dujovne, “Identifying web site key objects: A methodological

approach,”

[4] L. Gonzalez and J. D. Velasquez, “Mejoramiento de una metodologıa para la

identificacion de website keyobjects mediante la aplicacion de tecnologıas eye tracking,”

[5] R. Wirth and J. Hipp, “Crisp-dm: Towards a standard process model for data mining,”

[6] World Wide Web Consortium, “Http - hypertext transfer protocol.” http://www.w3.

org/Protocols/, 2012. 30/06/2012.

[7] World Wide Web Consortium, “Html 4.01 specification.” http://www.w3.org/TR/

1999/REC-html401-19991224/, 1999. 30/06/2012.

[8] World Wide Web Consortium, “Uniform resource locators.” http://www.w3.org/

Addressing/URL/url-spec.html, 2012. 30/06/2012.

[9] D. Gibson, J. Kleinberg, and P. Raghavan, “Proceedings of the ninth acm conference on

hypertext and hypermedia: links, objects, time and space,” pp. 225–234, ACM, 1998.

[10] World Wide Web Consortium, “Log files - apache http server.” http://httpd.apache.

org/docs/current/logs.html, 2012. 30/06/2012.

[11] M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa, “A framework for the

evaluation of session reconstruction heuristics in web-usage analysis,” INFORMS journal

on computing, pp. 171–190, 2003.

82

Page 92: Mejoramiento de una metodología para la identificación de Website

[12] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge

discovery in databases,” AI magazine, 1996.

[13] G. Chang, M. Healey, J. McHugh, and J. Wang, Mining the World Wide Web. Kluwer,

2001.

[14] R. K. Blockeel and H., “Web mining research: A survey,” Information Processing Society

of Japan SIGNL Note, pp. 1–15, 2000.

[15] O. Etzioni, “The world-wide web: quagmire or gold mine?,” Commun, ACM, pp. 65–68,

1996.

[16] P. L. and S. Brin, “The anatomy of a large-scale hypertextual web search,” Computer

networks and ISDN systems, pp. 107–117, 1998.

[17] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, “Web usage mining: discovery

and applications of usage patterns from web data,” SIGKDD Explor. Newsletter, pp. 12–

23, 2000.

[18] D. B. Judd and G. Wyszecki, Color in business, science, and industry. Wiley, 1975.

[19] C. Zimmer, “The brain: Our strange, important, subconscious light detectors,” Discover

Magazine, 2012.

[20] G. J. Tortora and B. Derrickson, Principles of Anatomy and Physiology. Editorial Medica

Panamericana, 2006.

[21] B. Cassin and S. Solomon, Dictionary of Eye Terminology. Triad Publishing Company,

1990.

[22] A. T. Duchowski, Eye tracking methodology: Theroy and practice. Springer Verlag, 2003.

[23] Harvey Richard Schiffman, Sensation and Perception: An Integrated Approach. Wiley,

2000.

[24] J. Nielsen and K. Pernica, Eyetracking web usability. New Riders Pub, 2009.

[25] Neil R. Carlson and Donald Heth C, Psychology: the science of behaviour. Toronto:

Pearson, 2010.

[26] W. James, F. Burkhardt, F. Bowers, and I. Skrupskelis, The Principles of Psychology.

No. v. 1 in American science series–advanced course, Harvard University Press, 1981.

83

Page 93: Mejoramiento de una metodología para la identificación de Website

[27] S. Frintrop, E. Rome, and H. I. Christensen, “Computational visual attention systems

and their cognitive foundations: A survey,” ACM Trans. Appl. Percept., vol. 7, pp. 6:1–

6:39, Jan. 2010.

[28] L. R. Young and D. Sheena, “Survey of eye movement recording methods,” Behavior

Research Methods & Instrumentation, vol. 7, no. 5, pp. 397–429, 1975.

[29] A. Sprenger, B. Neppert, S. Koster, S. Gais, D. Kompf, C. Helmchen, and H. Kimmig,

“Long-term eye movement recordings with a scleral search coil-eyelid protection device

allows new applications,” Journal of Neuroscience Methods, vol. 170, no. 2, pp. 305 –

309, 2008.

[30] E. H. Hess and J. M. Polt, “Pupil size as related to interest value of visual stimuli,”

Science, vol. 132, no. 3423, pp. 349–350, 1960.

[31] A. C. Guyton, Basic human physiology: normal function and mechanisms of disease.

Saunders Philadelphia London Toronto, 1977.

[32] H. Davson, The Eye. Vol. 3. Muscular Mechanisms. Academic, 1970.

[33] E. H. Hess, “Pupillometrics: A method of studying mental, emotional and sensory

processes,” Handbook of psychophysiology, pp. 491–531, 1972.

[34] L. Stark and A. Troelstra, “Dynamic pupillometers using television camera system,”

October 1970.

[35] M. G. H. Coles, E. Donchin, and S. W. Porges, “The pupillary system,”

Psychophysiology: Systems, Processes, and Applications, pp. 43–50, 1986.

[36] J. T. Cacioppo and L. G. Tassinary, “The ocular system,” Principles of psychophysiology:

Physical, social, and inferential elements., pp. 193–215, 1990.

[37] E. H. Hess and J. M. Polt, “Pupil size in relation to mental activity during simple

problem-solving,” Science, vol. 143, no. 3611, pp. 1190–1192, 1964.

[38] J. M. Polt, “Effect of threat of shock on pupillary response in a problem-solving

situation,” Perceptual and Motor Skills, vol. 31, no. 2, pp. 587–593, 1970.

[39] J. Beatty, “Task-evoked pupillary responses, processing load, and the structure of

processing resources.,” Psychological bulletin, vol. 91, no. 2, p. 276, 1982.

[40] J. D. Barlow, “Pupillary size as an index of preference in political candidates,” Perceptual

and Motor Skills, vol. 28, no. 2, pp. 587–590, 1969.

84

Page 94: Mejoramiento de una metodología para la identificación de Website

[41] W. Clark and M. Ertas, “A comparison of pupillary reactions to visual and auditory

stimuli in a test of preferences for presidential candidates.,” JSAS Catalog of Selected

Documents in Psychology, vol. 1, pp. 20–25, 1975.

[42] D. Hahnemann and J. Beatty, “Pupillary responses in a pitch-discrimination task,”

Perception & Psychophysics, vol. 2, no. 3, pp. 101–105, 1967.

[43] G. Hakerem and S. Sutton, “Pupillary response at visual threshold.,” Nature, 1966.

[44] J. Beatty, “Prediction of detection of weak acoustic signals from patterns of pupillary

activity preceding behavioral response,” Tech. Rep. 140, Los Angeles: University of

California, Dept. of Psychology, 1975.

[45] E. H. Hess, A. L. Seltzer, and J. M. Shlien, “Pupil response of hetero-and homosexual

males to pictures of men and women: A pilot study.,” Journal of Abnormal Psychology,

vol. 70, no. 3, p. 165, 1965.

[46] E. Hess, “The tell-tale eye,” 1975.

[47] M. P. Janisse, Pupillometry: The psychology of the pupillary response. Hemisphere

Publishing Corporation Washington, 1977.

[48] R. A. Hicks, S. L. Williams, and F. Ferrante, “Pupillary attributions of college students

to happy and angry faces,” Perceptual and Motor Skills, vol. 48, no. 2, pp. 401–402,

1979.

[49] G. K. Poock, “Information processing vs pupil diameter,” Perceptual and Motor Skills,

vol. 37, no. 3, pp. 1000–1002, 1973.

[50] W. S. Peavler, “Pupil size, information overload, and performance differences,”

Psychophysiology, vol. 11, no. 5, pp. 559–566, 1974.

[51] M. SECo, “Problemas formales de la definicion lexicografica,” in Estudios ofrecidos a

Emilio Alarcos Llorach, pp. 217–240, Universidad de Oviedo, 1978.

[52] G. Myatt, Making Sense of Data: A Practical Guide to Exploratory Data Analysis and

Data Mining. Wiley, 2007.

[53] R. Agrawal, R. Srikant, et al., “Fast algorithms for mining association rules,” in Proc.

20th Int. Conf. Very Large Data Bases, VLDB, vol. 1215, pp. 487–499, 1994.

[54] A. L. Yarbus, B. Haigh, and L. A. Rigss, Eye movements and vision, vol. 2. Plenum

press New York, 1967.

85

Page 95: Mejoramiento de una metodología para la identificación de Website

[55] J. I. Fernandez, Mejorando el Contenido Textual de un Sitio Web a Traves de la Identi-

ficacion de sus Web Site Keywords. Departamento de Ingenieria Industrial, Universidad

de Chile, 2010.

[56] A. Rosler, C. Ulrich, J. Billino, P. Sterzer, S. Weidauer, T. Bernhardt, H. Steinmetz,

L. Frolich, and A. Kleinschmidt, “Effects of arousing emotional scenes on the distribution

of visuospatial attention: Changes with aging and early subcortical vascular dementia,”

Journal of the Neurological Sciences, vol. 229, pp. 109–116, 2005.

[57] C. M. Privitera, L. W. Renninger, T. Carney, S. Klein, and M. Aguilar, “Pupil dilation

during visual target detection,” Journal of Vision, vol. 10, no. 10, 2010.

86