A Unified Framework for Multi-Target Tracking and Collective Activity Recognition. Wongun Choi and Silvio Savarese, University of Michigan, Ann Arbor

Choi ECCV12 presentation


A Unified Framework

A Unified Framework for Multi-Target Tracking and Collective Activity Recognition. Wongun Choi and Silvio Savarese

University of Michigan, Ann Arbor


Good afternoon. I'm Wongun Choi from the University of Michigan. This is joint work with my advisor, Silvio Savarese.


Consider a video sequence with multiple people.

Our Goal


Our goal is to understand the behavior of all individuals in the scene.


Our Goal

Multiple target tracking

First, we want to estimate the trajectories of all individuals.

Our Goal

Multiple target tracking
Recognize activities at different levels of granularity
Atomic activity & pose (walking, facing-front, facing-back)

We also want to recognize semantic activities of the people at different levels of granularity.

Such activities include: [show1] single-person activities in isolation, which we call atomic activities, [show2] such as walking or standing, [show3] and the pose of individuals, [show4] such as facing front or back.

Our Goal

Walking-Side-by-Side

Moving-to-Opposite
Multiple target tracking
Recognize activities at different levels of granularity
Atomic activity & pose
Pairwise interaction

We also want to recognize the interplay between pairs of people, which we call interactions, for instance [show1] walking side by side or [show2] moving in opposite directions.

Our Goal

Multiple target tracking
Recognize activities at different levels of granularity
Atomic activity & pose
Pairwise interaction
Collective activity

Crossing

And finally, we want to identify the overall behavior of all people in the scene, the collective activity, [show1] for example crossing.

Our Goal

Multiple target tracking
Recognize activities at different levels of granularity
Atomic activity & pose
Pairwise interaction
Collective activity

Solve all problems jointly!!

Crossing

Most importantly, we address all of these problems in a unified framework.

Background
Target Tracking: Wu et al, 2007; Avidan, 2007; Zhang et al, 2008; Breitenstein et al, 2009; Ess et al, 2009; Wojek et al, 2009; Geiger et al, 2011; Brendel et al, 2011; Pirsiavash et al, 2011
Atomic Activity: Bobick & Davis, 2001; Efros et al, 2003; Schuldt et al, 2004; Dollar et al, 2005; Niebles et al, 2006; Laptev et al, 2008; Rodriguez et al, 2008; Wang & Mori, 2009; Gupta et al, 2009; Liu et al, 2009; Marszalek et al, 2009; Liu et al, 2011
Pairwise Interaction: Zhou et al, 2008; Ryoo & Aggarwal, 2009; Yao et al, 2010; Choi et al, 2010; Patron-Perez et al, 2010
Collective Activity: Choi et al, 2009; Li et al, 2009; Lan et al, 2010; Ryoo & Aggarwal, 2010; Choi et al, 2011; Khamis et al, 2011; Lan et al, 2012; Khamis et al, 2012; Amer et al, 2012

Investigated in isolation
Hierarchy of activities: Lan et al, 2010; Amer et al, 2012; Khamis et al, 2012

So far, a large body of literature has proposed methods for [show1] tracking multiple targets and recognizing [show2] atomic activities, [show3] pairwise interactions, [show4] and collective activities, [show5] but most of the time these problems are addressed in isolation.

[show6] Some exceptions are shown here, but they focus only on jointly solving atomic and collective activity recognition.

Background: Atomic Activity, Pairwise Interaction, Collective Activity, Target Tracking

As opposed to previous works, we propose to solve all four problems in a joint fashion.

Let me explain this concept in a bit more detail.

Contributions
Bottom-up activity understanding.

[Figure: bottom-up flow from trajectories to atomic activities (walking), pairwise interactions (approaching), and the collective activity (gathering).]

Our model is able to transfer information in a bottom-up fashion. [show1]

From the estimated trajectories of individuals and their atomic activities, we can obtain robust characterizations of interactions and collective activities.

Contributions

Bottom-up activity understanding.

Meeting and Leaving

Crossing


For instance, [show1] if we observe these trajectories, [show2] we can easily infer that the activity is meeting and leaving, [show3] but if we see these trajectories, [show4] we will say it is a crossing activity.

Contributions
Bottom-up activity understanding.
Contextual information propagates top-down.

[Figure: top-down flow from the collective activity (gathering) through interactions (approaching) and atomic activities (walking) down to trajectories.]

At the same time, information flows from [show1] top to bottom so as to provide critical contextual information to the lower levels of the hierarchy: collective activities help understand interactions; interactions help understand atomic activities and track associations.

Contributions
Bottom-up activity understanding.
Contextual information propagates top-down.

Meeting and Leaving

Let me give you an example of how activity understanding helps trajectory estimation.

[show1] For example, if we are given this set of broken trajectories, and [show2] if we know that the underlying activity is meeting and leaving, we can interpret the trajectories as follows. [show3]

Contributions
Bottom-up activity understanding.
Contextual information propagates top-down.

Crossing

Simple Social Force Model: Pellegrini et al, 2009; Choi et al, 2010; Leal-Taixe et al, 2012; etc.

Repulsion & Attraction

Now, given the same broken trajectories, if we know that the activity is crossing, [show1] we can interpret the trajectories as follows.

[show2] A similar concept was also introduced as a social force model in previous works; however, these works only considered a few hand-designed types of interactions, such as [show3] repulsion and attraction.

We generalize this concept and let high-level activity understanding guide the process of associating tracks.

Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion

(Timing note: 3 minutes up to this point!)

This is an outline of today's talk.

Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion

Let's begin with our joint model.

Hierarchical Activity Model
Input: video with tracklets


Given a video with a set of short trajectory fragments, [show1] called tracklets, the activities of all individuals are encoded as a hierarchical graphical model using a factor graph. The components of our factor graph model are shown on the left.

Hierarchical Activity Model

[Figure: factor graph with per-target atomic activities A1, A2, A3, pairwise interactions I12, I13, I23, collective activity C, and observations O1, O2, O3, OC; on the right, the model unrolled over frames t-1, t, t+1 with variables C, I(t), A(t) and observations OC, OA(t).]

[show1] From each of the tracklets, we obtain an observation cue for each individual. [show2] Grounded on each observation, we model the atomic activity with variable A. [show3] We encode the pairwise interaction between individuals with variable I. [show4] Finally, we assign one collective activity variable, C, to characterize the overall behavior of the individuals. [show5] We also provide a top-down observation cue for variable C.

[show6] By also considering the temporal relationships among variables, our full model can be compactly represented as the graph shown here.

Hierarchical Activity Model
[Figure: the same temporal factor graph over frames t-1, t, t+1, shown next to the energy function.]

The model can be written as an energy function, shown on the right. [show1] The energy factorizes into multiple local potentials, each of which encodes the relationship among a subset of variables.

Let's see the details of what each factor represents.
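As a rough reconstruction of the energy on the slide (the exact notation, weights, and grouping in the paper may differ), the factorization walked through in the next slides looks roughly like:

    \Psi(C, I, A, f \mid O) =
        \sum_t \psi(C^t, O_C^t)                                   % collective-observation
      + \sum_{t,i} \psi(A_i^t, O_{A_i}^t)                         % atomic-observation
      + \sum_{t,\, i<j} \psi(I_{ij}^t, A_i^t, A_j^t, f)           % interaction-atomic
      + \sum_t \psi(C^t, I^t)                                     % collective-interaction
      + \sum_t \Big[ \psi(C^t, C^{t+1}) + \sum_{i<j} \psi(I_{ij}^t, I_{ij}^{t+1}) + \sum_i \psi(A_i^t, A_i^{t+1}) \Big]   % activity transitions
      + \psi(I, f, O)                                             % tracklet association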

Atomic-Observation Potential
Atomic activity models. Action: BoW with STIP; Pose: HoG
Dalal and Triggs, 05; Dollar et al, 06; Niebles et al, 07


The highlighted potential encodes the compatibility between the atomic activity models and the corresponding observations.

[show1] Atomic activities are modeled by a BoW representation built on spatio-temporal interest point (STIP) features, [show2] and pose is described using HOG.
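As a toy illustration of this kind of observation potential (not the paper's exact formulation; all names and shapes below are hypothetical), a linear score over the two descriptors might look like:

    import numpy as np

    def atomic_observation_potential(stip_bow, hog_pose, w_action, w_pose, action, pose):
        """Toy sketch of the atomic-observation potential: a linear score of the
        STIP bag-of-words histogram and the HOG pose descriptor under the
        hypothesized (action, pose) labels. The weight tensors are assumed to be
        learned jointly with the rest of the model; here they are just inputs.
        """
        # w_action: (n_actions, n_codewords), w_pose: (n_poses, hog_dim)
        return float(w_action[action] @ stip_bow + w_pose[pose] @ hog_pose)

    # Example with random data (shapes only for illustration)
    rng = np.random.default_rng(0)
    score = atomic_observation_potential(
        stip_bow=rng.random(100), hog_pose=rng.random(36),
        w_action=rng.random((2, 100)), w_pose=rng.random((8, 36)),
        action=0, pose=3)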

Interaction-Atomic Potential
I: standing-in-a-line
Model: A: standing, facing-left; A: standing, facing-left
Observation: A: standing, facing-left; A: standing, facing-left

The second potential, psi(I, A, f), captures the compatibility between the interaction models and the observations of atomic activities. For instance,

[show1] here we illustrate a visualization of a possible learned model for the standing-in-a-line interaction. This model captures the property that two people standing in a line tend to be located nearby and face the same direction. [show2] Thus, if we are given these observations of atomic activities, standing and facing left, which are compatible with the learned standing-in-a-line interaction, [show3] the potential is high.

Interaction-Atomic Potential
I: standing-in-a-line
Model: A: standing, facing-left; A: standing, facing-left
Observation: A: standing, facing-right; A: standing, facing-left

On the other hand, if we are given these observations of atomic activities, standing, facing left and facing right, the compatibility with the model is weak, and thus [show1] the potential is low. [show2] This relationship can be compactly represented by the equation below.
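The slide's equation is not reproduced in this transcript; as a hedged sketch of the kind of compatibility it expresses (hypothetical names and shapes, with the dependence on the association f omitted):

    import numpy as np

    def interaction_atomic_potential(interaction, atomic_i, atomic_j, rel_pos,
                                     w_labels, w_spatial):
        """Toy sketch of psi(I, A_i, A_j): a learned compatibility between the
        interaction label and the pair of atomic labels, plus a term on the
        relative position of the two people. Names and shapes are hypothetical;
        the paper's potential also depends on the tracklet association f.
        """
        label_term = float(w_labels[interaction, atomic_i, atomic_j])
        spatial_term = float(w_spatial[interaction] @ np.asarray(rel_pos))
        return label_term + spatial_term

    # "standing-in-a-line" would score high when both targets stand, face the
    # same direction, and are close to each other; low otherwise.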

Collective-Interaction Potential

C: Queuing

I: standing-side-by-side

I: one-after-the-other

I: one-after-the-other

I: one-after-the-other

Model: probability of each interaction label (one-after-the-other, standing-side-by-side, facing-each-other, ...)


Similarly, the potential psi(C, I) encodes the compatibility between the collective activity models and a set of observed pairwise interactions.

[show1] For example, here we illustrate a visualization of a possible learned model for the collective activity queuing. This model captures the probability of occurrence of interaction labels such as one-after-the-other and facing-each-other. For the collective activity queuing, the interaction one-after-the-other is highly probable; the interaction facing-each-other is much less so. [show2] Thus, if we observe this set of interactions, [show3] standing-side-by-side, [show4] one-after-the-other, [show5] and so on, together with queuing, [show3] the potential psi(C, I) is high.

Collective-Interaction Potential

I: facing-each-other

I: facing-opposite-side

I: facing-each-other
C: Queuing
Model: probability of each interaction label (one-after-the-other, standing-side-by-side, facing-each-other, ...)


On the other hand, [show1] if we observe these interactions, facing-each-other and facing-opposite-direction, [show2] the potential is low, since they are not compatible with the learned model for queuing. [show3] This relationship is encoded by the equation shown here.
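Again, the equation itself is only in the slide figure; a minimal sketch of the co-occurrence style compatibility it describes (hypothetical names and shapes) could be:

    import numpy as np

    def collective_interaction_potential(collective, interactions, w_co):
        """Toy sketch of psi(C, I): sums a learned co-occurrence weight between
        the collective activity label and every observed pairwise interaction
        label. A learned w_co would be high for (queuing, one-after-the-other)
        and low for (queuing, facing-each-other). Hypothetical names/shapes.
        """
        return float(sum(w_co[collective, i] for i in interactions))

    # Example: w_co is an (n_collective, n_interaction) table of learned weights.
    w_co = np.zeros((5, 8)); w_co[1, 2] = 2.0; w_co[1, 5] = -1.5
    score = collective_interaction_potential(collective=1, interactions=[2, 2, 5], w_co=w_co)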

Collective-Observation Potential
Collective activity: STL of all targets (Choi et al, 09)


Similarly to the bottom-up cues for atomic activities, the highlighted potential encodes the compatibility between the collective activity models and the corresponding observations.

[show1] These observations are obtained using the crowd context descriptor introduced by Choi et al in 2009.
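The exact STL/crowd context descriptor is defined in Choi et al, 2009; the sketch below only illustrates the general idea of such a top-down cue, namely a histogram of neighbors binned by distance and pose around an anchor person (everything here is a simplification, not the published descriptor):

    import numpy as np

    def crowd_context_histogram(anchor_xy, others_xy, others_pose,
                                n_rings=2, ring_size=2.0, n_poses=8):
        """Much-simplified sketch of a crowd-context style descriptor: around an
        anchor person, count neighbors by distance ring and pose. NOT the exact
        STL definition of Choi et al, 2009; it only shows the kind of evidence
        used as a bottom-up observation for the collective activity C.
        """
        hist = np.zeros((n_rings, n_poses))
        anchor = np.asarray(anchor_xy, dtype=float)
        for xy, pose in zip(others_xy, others_pose):
            ring = int(np.linalg.norm(np.asarray(xy, dtype=float) - anchor) // ring_size)
            if ring < n_rings:
                hist[ring, pose] += 1
        return hist.ravel()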

Activity Transition Potential
Smooth activity transition


We also encode temporal smoothness between [show1] collective activities, [show2] interactions, [show3] and atomic activities in adjacent time frames.
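Conceptually, each smoothness term is just a learned transition table between labels in adjacent frames; a minimal sketch (hypothetical names, one such table each for C, I, and A):

    def transition_potential(label_prev, label_next, w_trans):
        """Toy sketch of a temporal smoothness term: a learned table that rewards
        consistent labels in adjacent frames. The table w_trans is assumed to be
        learned jointly with the other model weights.
        """
        return float(w_trans[label_prev][label_next])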

Trajectory Estimation


The last term captures the potential related to trajectory estimation. Before we talk about it, let me briefly discuss how we define the tracking problem.

Tracklet Association Problem

Suppose we have the activity of people crossing; [show1] it is likely that our low-level observations will not be just a pair of clean tracks such as the red and black ones.

Tracklet Association Problem

Input: fragmented trajectories (tracklets)
Detector failures
Occlusion between targets
Scene clutter
etc.

Output: set of trajectories with correct IDs (color)

Rather, we observe a fragmented set of trajectories, which we call tracklets. This is because of [show1] detection failures, occlusions, scene clutter, and so on. Given such a set of initial inputs, our goal is [show2] to obtain trajectories with consistent IDs by associating the tracklets.

Tracklet Association Model

Location affinity, appearance/color, ...

Simple match costs, c



As in traditional track association problems, we introduce [show1] the variable f to capture association hypotheses; a simple solution is to have the association cost vector c encode properties such as [show2] location affinity, color similarity, and so on.

As this example shows, these properties do not always work: [show3] location affinity becomes ambiguous at the point of crossing, and [show4] color or appearance similarity is not always reliable when people wear similar clothing.
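As a baseline illustration of such simple match costs (hypothetical names; not the paper's cost), one could combine end-to-start distance and color dissimilarity and solve an assignment on it:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def simple_match_costs(ends, starts, end_colors, start_colors, alpha=1.0):
        """Toy sketch of the simple association cost c: distance between the end
        of one tracklet and the start of another, plus a color-histogram
        dissimilarity. This is exactly the kind of baseline cue the talk argues
        becomes ambiguous at crossings or with similar clothing.
        """
        ends, starts = np.asarray(ends, dtype=float), np.asarray(starts, dtype=float)
        loc = np.linalg.norm(ends[:, None, :] - starts[None, :, :], axis=-1)
        col = np.linalg.norm(np.asarray(end_colors, dtype=float)[:, None, :] -
                             np.asarray(start_colors, dtype=float)[None, :, :], axis=-1)
        return loc + alpha * col

    # An optimal matching on such costs (without any interaction context):
    cost = simple_match_costs([[0, 0], [5, 5]], [[1, 0], [5, 6]],
                              [[0.2, 0.8], [0.7, 0.3]], [[0.25, 0.75], [0.6, 0.4]])
    rows, cols = linear_sum_assignment(cost)   # Hungarian assignment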

Tracklet Association Model

Crossing


In addition to traditional cues, we advocate that interaction labels provide critical contextual information to guide the process of associating tracklets. [show1] Such information is encoded by the interaction potential that we discussed earlier. [show1] For instance, if we know the interaction is crossing, this type of association [next]


Tracklet Association Model

Crossing


[cont.] will give rise to high energy, [show1] since this association is compatible with the learned model for crossing.

Tracklet Association Model

Crossing


whereas this type of association [show1] will give rise to low energy.

In the interest of time, we skip the mathematical details here.

Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion

Let's see how we solve the inference problem and train the model.

Inference
Non-convex problem!!

The inference problem can be represented as finding the configuration of all variables that maximizes the joint energy function. [show1] This optimization, however, is computationally very demanding.

Inference
Activity recognition given f: iterative belief propagation
Tracklet association given C, I, A: novel branch and bound

Crossing


Thus, we introduce a new iterative method to solve the problem.

[show1] Given an initial tracklet association, we obtain activity labels using [show2] iterative belief propagation, [show3] and given activity labels, [show4] we obtain the tracklet association using our novel branch-and-bound method.

In the interest of time, we skip the details of each inference method.
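At a high level, the alternation can be sketched as below, with the two sub-solvers passed in as callables; their names and signatures are assumptions for illustration, not the paper's interfaces:

    def joint_inference(tracklets, recognize_activities, associate_tracklets, n_iters=5):
        """Sketch of the alternating scheme described in the talk.
        recognize_activities(tracklets, f) -> (C, I, A)   # e.g. iterative belief propagation
        associate_tracklets(tracklets, C, I, A) -> f      # e.g. branch-and-bound association
        """
        f = list(range(len(tracklets)))          # start from the raw tracklets (no merges)
        C = I = A = None
        for _ in range(n_iters):
            C, I, A = recognize_activities(tracklets, f)   # activities given the association
            f = associate_tracklets(tracklets, C, I, A)    # association given the activities
        return C, I, A, f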

Training
Model weights are learned in a max-margin framework using structural SVM (Tsochantaridis et al, 2004).

Finally, we learn the model parameters from a set of training data using a structural SVM framework.
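For reference, the standard margin-rescaled structural SVM objective of Tsochantaridis et al, 2004 has the form below, where phi stacks the model's potentials and Delta is a task-specific loss (which loss the paper uses is not stated in this talk):

    \min_{w,\, \xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{\lambda}{N}\sum_{n=1}^{N} \xi_n
    \quad \text{s.t.} \quad
    w^\top \big[\phi(x_n, y_n) - \phi(x_n, y)\big] \;\ge\; \Delta(y_n, y) - \xi_n
    \qquad \forall n,\; \forall y \ne y_n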

Outline
Joint Model
Inference/Training method
Experimental evaluation
Conclusion

Let's discuss our experimental evaluation.

Experiments
Collective Activity Dataset (Choi et al, 2009): 44 videos with multiple people
Collective activities: Crossing, Waiting, Queuing, Walking, Talking

Target identities

Interaction: approaching, leaving, passing-by, facing-each-other, etc.

Atomic activity: facing-right, facing-left, walking, standing

For the evaluation, we use the Collective Activity Dataset, which we proposed in 2009. In addition to the collective activity labels, we provide annotations for [show1] target identities, [show2] interactions between pairs of people, [show3] and atomic properties of individuals.

Experiments
New dataset: 32 videos with multiple people
Collective activities: Gathering, Talking, Dismissal, Walking-together, Chasing, Queuing

Target identities

Interaction: approaching, walking-in-oppos.., facing-each-other, standing-in-a-row, etc.

Atomic activity: facing-right, facing-left, walking, standing, running

We also collected an additional dataset to test our framework, composed of 32 videos with 6 collective activities. We similarly provide labels for target identities, interactions, and atomic activities.

Classification Results
[Chart: overall classification accuracy of the baseline (Choi et al, 2009) vs. our model: +6.6% on the Collective Activity Dataset (2009) and +4.9% on the new dataset.]

First, we compare collective activity classification accuracy between a baseline method that uses the crowd context descriptor we introduced in 2009 and 2011, and our full hierarchical representation.

By utilizing the hierarchical structure of our model, we obtain a good improvement of about 6% in overall classification on the Collective Activity Dataset.

We observe a similar improvement of about 5% on the new dataset.

Target Association (results on dataset VSWS09)
Tracklet: 1556 errors; improvement over tracklets: 0%

Now we analyze the tracklet association results. The first row shows the number of errors in tracklet matching, and the second row shows the percentage of improvement over the input tracklets.

Target Association (results on dataset VSWS09)
Tracklet: 1556 errors (0% improvement); No interaction: 1109 errors (28.73% improvement over tracklets)

By solving the tracklet association without interaction cues, we obtain about a 30% improvement over the input tracklets, roughly 400 fewer matching errors.

Target Association (results on dataset VSWS09)
Tracklet: 1556 errors (0%); No interaction: 1109 (28.73%); With interaction: 894 (42.54%)

By incorporating our interaction model with the estimated activity labels, we obtain a further 14% improvement, roughly 200 fewer matching errors than the baseline without interactions.

Target Association (results on dataset VSWS09)
Tracklet: 1556 errors (0%); No interaction: 1109 (28.73%); With interaction: 894 (42.54%); With GT activities: 736 (52.76%)

Notice that if ground-truth activity labels are given, we obtain an upper bound of 736 target association errors, corresponding to a 53% improvement over the input. More quantitative evaluation can be found in the paper.

Example Classification Result
Interaction labels: AP: approaching; FE: facing-each-other; SR: standing-in-a-row; ...

Here we show example results obtained on the newly proposed dataset. The estimated collective activity label is overlaid on top, and interaction labels are displayed between pairs of people.

Example Classification Result
Atomic activities. Action: W - walking; S - standing
Pose (8 directions): L - left; LF - left/front; F - front; RF - right/front; etc.

Now let me show results from a more complex sequence.

Here, we show the estimated atomic activity label for each individual, overlaid below each bounding box.

Example Classification Result
Pairwise interactions: AP: approaching; ...; FE: facing-each-other; SS: standing-side-by-side; SQ: standing-in-a-queue

and the interactions between pairs of people. SQ represents the interaction standing-in-a-queue.

Example Classification Result

Finally, we show the estimated collective activity variable on top.

Example Classification Result
Tracklet association. Color/number: ID; solid boxes: tracklets; dashed boxes: match hypotheses

Here, the video shows the tracklet association result obtained with our unified framework.

The color and number on top of each bounding box show the identity of the target, solid boxes represent tracklets, and dashed boxes show the smoothed paths that associate two tracklets.

As you can see, our model keeps target identities consistent even after occlusion by associating the tracklets correctly.

Association Example
[Figure: tracklet association over time with vs. without the interaction model; without interactions the targets get wrong IDs after the occlusion, with interactions the IDs stay correct.]

Finally, we show an example comparison between the tracklet association results obtained with and without the interaction model.

[show1] Due to the severe occlusion caused by the car's motion, [show2] the traditional method without interaction context loses the targets' identities and assigns new IDs to everyone after the occlusion. [show3] On the other hand, when we include interactions in the association model, [show4] we can keep the targets' identities even in such a challenging scenario.

Propose novel model for joint activity recognition and target association.

Conclusion


In this paper, we propose a novel model that seamlessly relates target tracking with atomic activity, interaction, and collective activity understanding.

We also show that by solving everything together, we achieve a better understanding of high-level activities.

Most interestingly, we show that high-level activity understanding helps improve trajectory estimation.

Propose novel model for joint activity recognition and target association.

High-level contextual information helps improve target association accuracy significantly.
Conclusion

Crossing


Propose novel model for joint activity recognition and target association.

High-level contextual information helps improve target association accuracy significantly.

Best classification results on collective activity to date.

Conclusion


Thanks to

Yingze Bao, Byungsoo Kim, Min Sun, Yu Xiang, ONR, and anonymous reviewers


Branch-and-Bound Association
Not PSD => non-convex
General search algorithm; guarantees an exact solution
Requires: branch operation, bound operation

Branch-and-Bound Illustration

[Figure: branch-and-bound illustration; the region Q is branched into Q0 and Q1, each with lower bound L(Q) and upper bound U(Q); Q1 is pruned when U(Q1) < L(Q0).]

Branch-and-Bound Association
Bound: lower bound, computed for each interaction variable (e.g., I12, I34)

Branch-and-Bound Association
Bound (cont.): per interaction variable, only one non-zero activation per line in Hi (e.g., I12)

Branch-and-Bound Association
Bound (cont.)
Lower bound: binary integer programming
Upper bound

Branch-and-Bound Association
Branch: divide the problem into two disjoint sub-problems by fixing the most ambiguous variable, e.g., Q0 => f = [1, x, x, x, x, x, ...], Q1 => f = [0, x, x, x, x, x, ...].
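A generic best-first branch-and-bound skeleton for such binary association vectors is sketched below; it only illustrates the search strategy described on these slides (it branches on the next unfixed variable rather than the most ambiguous one, and the bound routines are placeholders to be supplied, not the paper's specialized bounds):

    import heapq

    def branch_and_bound(n_vars, upper_bound, complete_and_score):
        """Generic best-first branch and bound over binary assignment vectors
        (maximization). `upper_bound(fixed)` must over-estimate the best score of
        any completion of the partial assignment `fixed`; `complete_and_score(fixed)`
        must return some feasible completion and its exact score (a lower bound).
        """
        best_sol, best_score = complete_and_score({})
        heap = [(-upper_bound({}), 0, {})]
        tie = 1
        while heap:
            neg_ub, _, fixed = heapq.heappop(heap)
            if -neg_ub <= best_score or len(fixed) == n_vars:
                continue                         # prune: U(Q) <= best L(Q), or leaf
            var = len(fixed)                     # branch: fix one more variable
            for value in (0, 1):
                child = dict(fixed)
                child[var] = value
                sol, score = complete_and_score(child)
                if score > best_score:
                    best_sol, best_score = sol, score
                ub = upper_bound(child)
                if ub > best_score:
                    heapq.heappush(heap, (-ub, tie, child))
                    tie += 1
        return best_sol, best_score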
