Cross-Modal Effects in Motion Picture Perception: Toward an Interactive Theory of Film
Mark Rollins


 Moderators: Anouk Barberousse, John Zeimbekis, Gloria Origgi, Nicolas Bullot
 

Introduction

It has been argued by a number of philosophers and scientists that there is a type of interpretation that is characteristic of our response to pictures. It is a type that is dependent on perception in ways that the interpretation of other symbol-types is not (Schier 1986; Cutting 1986; Lopes 1996). On this view, any correct interpretation of pictures must derive from the perception of objects and events, which are recognized in the pictures using ordinary perceptual abilities alone. Moreover, it is said, any interpretation that is so derived will, to that extent, be correct. Pictures are distinguished from words in that respect. The referent of a word is not recognizable in it; thus the possibility of basing a correct interpretation of words on ordinary object and event recognition processes does not arise.

These claims depend on showing that ordinary perceptual processes are not themselves entirely continuous with cognition; that is, that they are not highly dependent on background knowledge and beliefs or theory-laden, to use the classic phrase. The danger here is twofold. First, a perception-based theory of pictorial interpretation purports to be a naturalistic account, one that focuses on perceptual psychology rather than on a theory of conventional symbols or signs. But the deep cognitive penetrability of perception can underwrite substantial cultural differences in perception, and those might be viewed as driven by local representational conventions, rather than simply being embedded in them. In that case, the conventions rather than psychological processes may become the central focus in explanations of pictorial interpretation, leaving naturalism in the dust. Second, the door is opened by pervasive cognitive penetrability to the invidious effects of individual differences. Those could take the form of idiosyncratic beliefs and desires, which cannot be accommodated by explanations of a general sort.

An obvious response to this problem would be to appeal to the strong modularity of vision, which blocks cognitive penetrability by imposing limits on the types of knowledge upon which low-level perceptual processes may depend. Unfortunately, the evidence suggests that the strong modularity thesis is false. In any case, even if the thesis were true, the modularity of low-level visual processes would not suffice to ground pictorial interpretation in the way that the perception-based approach requires. The reason is simply that the outputs of such modules will always underdetermine conscious, central interpretive processes, which can take highly variable forms. One might get around this problem by positing centralized modules as well. These would take the form of potentially conscious reasoning processes that draw upon a knowledge base sufficiently broad to constitute a theory, but of a domain-specific sort; that is, a theory that does not depend on other theories or involve access to knowledge and belief of any and every kind (cf. Meltzoff and Gopnik 1997). However, as we will see, while higher-order structures and processes may sometimes be involved in picture recognition, understanding what a picture represents based on them alone does not constitute thinking and reasoning about what a picture means. Moreover, it is not necessary to treat such knowledge representation structures as drawing on encapsulated, domain-specific theories in order to give a perception-based account of pictorial interpretation. Knowledge effects can be constrained in another way. In any case, I will cite evidence indicating that interpretation does not always depend on higher-order structures of the relevant type.

The evidence that I will consider concerns cross-modal effects in perception, specifically interactions of sights and sounds. Such effects can take a variety of forms. My concern here will be with cases of cross-talk in adult perceptual systems, in which one sense modality, operating normally, influences a different modality. For example, the ‘ventriloquist effect’ is experienced when the location of a sound source is influenced by visual stimulation; in other cases, sounds affect either the location or the quality of the visual stimulus. I suggest that the knowledge that cross-modal interactions require is limited in scope and content. What is important about them in that regard is that they comprise heuristic processes, in several respects: They make possible a limited reliance on internal representations, thus economizing on available mental resources. And they provide a means of categorizing events that is reliable, but also fallible and not fully rational. However, as will emerge, heuristic processes play central roles in a variety of theories, in which perception is said to be either more or less knowledge-dependent than on the view that I espouse. Thus it is crucial to be clear about the differences among these accounts.

If a distinctive mode of pictorial interpretation can be grounded on perception in this way, the question then arises of whether it is possible to make even more fine-grained distinctions, on the basis of which movies (construed as multimedia representations) might be the object of a special type of interpretation, not applicable to paintings or photographs. [1] Certainly there are important differences between the perceptual experience of paintings and film. Except for ambient noise, sounds do not usually figure in the perception of pictures in the museum or gallery. Moreover, unlike paintings, motion pictures represent events dynamically, i.e. as unfolding in real time, and they are typically viewed transiently under time constraints. Thus there is no opportunity for viewers to scrutinize scenes. Nonetheless, it has been said that “cross-modal interactions are the rule and not the exception in perception …,” and if that is true, then they should figure in a larger account that includes still photographs and paintings as well (Shimojo and Shams 2001: 505).

Cross-modal Effects: Theory and Evidence

There are three models in terms of which cross-modal effects have been explained: (1) a strictly feedforward account in which the visual and auditory systems operate independently and in parallel to provide information that is used to construct a higher order representation in multi-modal areas of the brain; (2) a model that locates the effects of one sensory system on another within the primary sensory areas themselves, but in a way that is mediated by feedback from multi-modal cells; and (3) an explanation in terms of direct links from one sensory system to another.

The first of these models treats sensory systems as modules and is undercut by evidence in that regard. It is purportedly bottom-up, in the sense that early sensory processes are unaffected by higher-order perception and cognition, and because the early processes are supposed to constrain the higher-order ones by providing particular kinds of input to them. However, the implication is also that there is a final, conscious stage in which all sensory inputs are combined into a coherent whole (whether processing is actually completed in multimodal areas themselves or not), and that is a view that is now widely rejected.

While there is evidence that there are, in fact, direct connections between sensory systems, lending support to the third approach, there is also substantial empirical weight on the side of the idea that interactions across systems are mediated by higher-order, polysensory neurons. For instance, animal and human studies show that activity in such neurons is enhanced with cross-modal interactions; the amount is greater than the sum of activities in the unimodal areas that are involved (Bushara et al 2003, p. 191). In light of that, I will concentrate on the second model, in which the potential for top-down effects and learned variations across individuals and cultures is greatest. There are a number of candidate regions for multimodal mediation, some cortical and some not (e.g. superior colliculus and prefrontal areas). The question then is how much integration is required of the auditory and visual features and what effect, if any, the integration has on the encoding of the elements themselves. To the extent that integration is required, in a way that modifies the use of basic movement features, by what principles might the modifications be constrained?

Consider, then, two recent studies of cross-modal effects. The first involves the ventriloquist effect, which obviously plays an important role in the experience of film. In a series of experiments, Vroomen and de Gelder (2004) used a basic paradigm in which a centrally located sound from a computer screen was heard as alternating between left and right when synchronized with lights flashing in the relevant locations. It might be thought that this phenomenon depends on a higher-order judgment about the source of the sounds, based on assumptions about how visual stimuli and sounds are usually paired. Or perhaps the result was due to a post-perceptual Stroop-like response competition: Visual object recognition is a largely automatic response, we might say, that competes with a process of locating objects by attending to sounds and makes it difficult to separate them from the visual stimulus. Finally, an argument could be made that the subjects are responding strategically in the face of what they believe to be task demands. Noting the odd way the stimuli are presented and speculating about the experimenter’s aims and goals, they respond appropriately.

However, Vroomen and de Gelder found that subjects could not be trained to ignore cross-modal effects, and the effects occurred even when psychophysical measures of responses were used about which subjects could not strategize. The response occurred more or less automatically, thus ruling out explanations in terms of higher-order judgments and strategies. Further, it appears that neither attention nor consciousness was required to produce the interaction. Specifically, focusing attention covertly on the shifting visual stimuli (as opposed to monitoring the center of the screen) did not produce a stronger ventriloquist effect. Moreover, introducing exogenous attention-attractors on one side of the display did not shift the location of the sound in that direction. When a ‘singleton’ feature (one small square in the context of other large squares) appeared on one side of the screen (with only large squares on the other side), such a feature would be expected to pop out and draw the subjects’ attention. Nonetheless, the location of the sound actually shifted away from the presumed focus of attention and in the direction of the larger stimuli. Finally, in this study, cross-modal effects were found in patients with unilateral visual neglect. This syndrome is sometimes said to be an attention deficit; in any case, it results in the patient being unaware of the presence of an object on the relevant side, although he can still process some information about it. In this case, subjects failed to detect the visual stimulus in their left visual field; nonetheless, their pointing to a sound was shifted in the direction of the stimulus. Thus neither attention nor awareness was required for the effect. This conclusion is supported by other evidence showing that one sense modality can influence the response to another, even when they are not presented simultaneously; i.e. in situations in which attentional binding is not involved.
For instance, Braille reading by early blind subjects has been shown to produce activity in visual cortex (Sadato et al 1996, pp. 526-528; cf. Shimojo and Shams 2001, p. 505). Further, brain imaging studies by Bushara et al (2003, p. 190) show that, while activity in multimodal areas increases with sound-induced changes in visual motion perception, the activity is actually lower in unimodal sensory areas. This is not what would be expected if attention were in play.

The second study is one in which sounds affect the perception of a visual stimulus, using a ‘collide-or-pass’ paradigm. In this type of experiment, subjects see two spots move toward each other, intersect on the screen, and then move off in a different direction. However, the movements can be seen in two distinct ways. On one interpretation, both spots move in straight line trajectories, merely passing by each other, producing an X pattern. On the other interpretation, they collide, and both spots change direction, being reflected by their contact. With no sound, collisions are sometimes seen, but the ‘collision’ and ‘pass’ interpretations are equally likely. However, if a sharp click is heard at the moment when the two spots intersect, ‘collision’ becomes the favored response. Note that in this case, not only the perceived direction of movements, but also the description of the event is affected by the cross-modal interaction. This determines the mathematical and physical principles that would be appropriately applied: collision dynamics for one response, but not for the other.

It is important that the effect of sound on the perception of movement does not appear to be due to a prior categorization of the event as a collision. There is no significant difference in brain activity when subjects describe the objects as colliding, as against subjects classifying them as passing, when there is no added sound (Bushara et al 2003, p. 194). If the brain activity corresponding to the perception of a collision were due to an inference based on general knowledge, then we would expect to find it whether sound provided input to the inference or not. Further, the effect can be produced regardless of which sense modality is synchronized with the intersection of the visual stimuli. Tactile stimulation of a finger will work as well (Shimojo and Shams 2001, p. 508). The implication is that the result is not due to a learned association of typical sights and sounds that are produced when objects collide.

These results suggest that whatever integration and modulation of features are required do not depend on extensive post-perceptual processing. Nonetheless, Vroomen and de Gelder argue that this evidence is consistent with feedback from multi-modal levels in the brain (2004; cf. Shimojo and Shams 2001, and Bushara et al 2003 for similar arguments). The implication is that the mediating role performed in multimodal levels does not derive from judgments about the type of event in which the features co-occur, or from beliefs, desires, or goals of the perceiver, upon which the direction of attention might depend.

From what, then, does it derive? One possibility is that there is a reciprocal interaction between the unimodal and multimodal areas through which the effects are produced; an interaction that constitutes a “perceptual interpretation” of the signals (Bushara et al 2003, p. 190). In that case, a model is required of the form that interactions take. In order to sketch such a model, I turn now to a comparison of two possible accounts of cross-modal interactions. These emerge from contrasting ways that events in film might be categorized, especially in cases where human actions are portrayed. On one approach, very little internal representation, if any, is required. ‘Interactions’ among features will consist simply in using body, head, and eye movements to identify and employ cues in the environment according to selection principles in which the cues can come from different sense modalities. Any changes in the relevant features will come largely from changes in the perceiver’s point of view or task. On the face of it, the evidence for multimodal neurons does not seem to sit well with this approach. However, it can be argued that the discovery of subpersonal brain systems responsible for mediating sensorimotor and cross-modal interactions need not be taken to imply internal representations, in terms of which the interactions are explained (cf. Pessoa, Thompson, and Noe, in press). Yet on the other approach, it is precisely an elaboration of an internal representation that is required, although in a limited and nonlinear sense. Arguments for construing cross-modal mediation in this more integrative form derive from the inadequacy of the first approach to account for perceptual behavior and from an analysis of the mechanisms on which mediation depends.

The Categorization of Events in Film

A central issue in regard to event perception has to do with the extent to which the process by which events are segmented or parsed is bottom-up or top-down. A bottom-up account will explain the parsing of events in terms of elemental patterns of movement, taken singly or in small, relatively isolated sets, where the movements are individuated by points of maximal change in space and time. A top-down approach makes the identification of such movements depend on their place in a larger representational structure, which can be more or less elaborate, conscious, and centralized: a story, perhaps, which is informed by substantial background knowledge and theory; or, alternatively, a more skeletal narrative schema or script. However, some attempts have recently been made to show that the intentions in terms of which the characters’ actions are described can themselves be recognized on the basis of a few cues, using models that are explicitly heuristic in form. As we have seen, interactions of sights and sounds can affect the properties that basic movements are perceived to have. Thus, in so far as events are segmented with reference to such movements, we can ask, “is scene analysis itself a cross-modal phenomenon?” (Vroomen and de Gelder, 2004, p. 147). If so, the question of whether event perception in film has a bottom-up grounding or is top-down becomes a question of how cross-modal interactions should be understood.

The bottom-up approach comes in two varieties, which differ in how they describe the status of cues. On the one hand, cues are said to specify the identity of a movement, construed in a way that is derived from James Gibson; i.e. they can be used selectively to get a uniquely correct result. In that sense, the cues are taken to be informationally sufficient for categorization tasks. However, it is possible that there may be more than one such cue, and in that case the door is open to the perceiver to satisfice, i.e. to rely on the first source of information that will do the job (Cutting 1986). That is where heuristics come into play.

On the other hand, cues may be taken to be only diagnostic of action types. They can be elements in motion patterns that are characteristic of more than one action; thus they can never be correlated with properties in such a law-like way as to give a determinate result. However, the cues can be ranked in terms of the frequencies with which they are correlated with various action types (either through experience or natural selection). If the cues are considered individually in an order based on their ranking, and the process stops at some point before all the relevant features are considered, then the cues can be said to be psychologically and behaviorally sufficient for the task, even if they are informationally inexact. Cues might be used selectively in this way because of time pressure or because the viewer is simply not predisposed to, or capable of, taking all of the relevant information into account. Although it can lead to mistakes in particular cases, this kind of selection heuristic might be typically employed because perceivers get the identification of the event right often enough to make further cue elaboration and processing a waste of time.

A central example of an application of this approach to the recognition of intentions in film can be found in Blythe, Todd, and Miller (1999), who appeal to a heuristic called Categorization-by-Elimination (henceforth, CBE). Blythe et al argue that CBE can provide for the recognition of intentions on the basis of a few fundamental movement types. On this model, the order of cue consideration depends on the fact that some cues are more useful than others, in the sense that they serve to eliminate more action categories, thus narrowing the field of choice. For instance, frequent contact between two moving shapes might rule out courtship, pursuit, evasion, and play, whereas high speed may only rule out play. Although Blythe et al do not specifically consider the effects of sound on perceived motion, it is clear that the account could be extended to simple combinations of sounds and visual features, which could be ranked by their value in event recognition and used by perceivers in accordance with CBE.
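The elimination procedure just described can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not Blythe et al's implementation: the two cues and the categories they rule out are taken from the example above, while the fifth category ("fighting"), the function names, the ranking of cues by the number of categories they eliminate, and the rule of skipping a cue that would eliminate every remaining candidate are all hypothetical choices made for the sake of a runnable example.

```python
from typing import Dict, List, Set, Tuple

# Candidate action categories. The first four come from the example in the
# text; "fighting" is a hypothetical addition so that something survives.
CATEGORIES: Set[str] = {"courtship", "pursuit", "evasion", "play", "fighting"}

# For each cue, the categories it rules out when the cue is observed.
# Both entries follow the example in the text.
ELIMINATES: Dict[str, Set[str]] = {
    "frequent_contact": {"courtship", "pursuit", "evasion", "play"},
    "high_speed": {"play"},
}


def cue_order(eliminates: Dict[str, Set[str]]) -> List[str]:
    """Rank cues by how many categories they eliminate, most useful first.

    Ties are broken alphabetically (an arbitrary, assumed convention).
    """
    return sorted(eliminates, key=lambda cue: (-len(eliminates[cue]), cue))


def categorize_by_elimination(
    observed: Set[str],
    categories: Set[str] = CATEGORIES,
    eliminates: Dict[str, Set[str]] = ELIMINATES,
) -> Tuple[Set[str], List[str]]:
    """Consult cues one at a time, in a fixed order, never revisiting a cue.

    Returns the categories still standing and the cues actually consulted.
    The search stops as soon as a single category remains.
    """
    remaining = set(categories)
    consulted: List[str] = []
    for cue in cue_order(eliminates):
        if len(remaining) <= 1:
            break  # stopping rule: one candidate left, ignore further cues
        consulted.append(cue)
        if cue in observed:
            survivors = remaining - eliminates[cue]
            if survivors:  # assumed rule: never eliminate every candidate
                remaining = survivors
    return remaining, consulted


if __name__ == "__main__":
    print(categorize_by_elimination({"frequent_contact"}))
    print(categorize_by_elimination({"high_speed"}))
```

In the first call a single cue settles the matter and the second cue is never consulted; in the second call both cues are consulted and several categories remain, so further cues would be needed. Each cue is treated in isolation, and the order of consultation does not change with context.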

What is crucial about this model is that cues are registered in isolation from one another and consulted in a fixed order. Further, a cue cannot be reconsidered once it has been employed and the system has moved on to other cues. Thus no global integration into an action story is possible on this account. From the perspective of representational parsimony, this is a large part of its appeal; however, in that respect, it can be shown to be inadequate on empirical grounds. The evidence suggests that cues are not employed in the rigid and inflexible manner that CBE implies. Rather, the order in which cues are selected can be modified, not only by the perceiver’s beliefs, but also by modifications in movement stimuli themselves (Zacks 2004). In the first instance, it appears that the identification of movements as actions often depends on relating the movements to goals and subgoals that are hierarchically ordered. When movements are seen in that way, different areas of the brain are involved than when identical movements are not seen as intentional actions in this sense, even if the movements are perceived to be biological (Saxe et al 2004). This suggests that cues are treated as more interrelated and in more variable ways than CBE allows. Because the selection of cues for event segmentation can also be modulated by changing the properties of the stimulus, the implication is that, while we may parse events on the basis of a few movement cues, they, too, must be seen as interrelated in some way; a way that can be affected by variations in the representational properties of the video or film. [2] This suggests the need for a more top-down approach, although as I have argued, one in which knowledge effects are somehow constrained.

In seeking a model of event segmentation that involves the appropriate level of integration, a natural place to turn is to theories of apparent motion. A number of experiments on apparent motion have been designed to eliminate the effects of high-level cognition, by requiring a response to a stimulus that allows no time for such effects to occur. Nonetheless, the central feature of the results is that the properties of basic movements of objects, such as their directions and identities, appear to depend on global effects that might be construed as higher-order or top-down. Although these experiments involve short-range apparent motion, which is different from the more long-range movements that are represented in video and film, there are reasons to think that the two phenomena may involve similar mechanisms (Liu et al 2004, p. 1772). [3] Specifically, the perception of apparent motion is often said to involve a process of perceptual completion that depends on the capacity of some neurons to expand their receptive fields (and a corollary inhibition of cells with overlapping fields; cf. Liu et al 2004). In various types of perceptual completion, this is described as an interactive process involving the detection of features of different kinds, through which global effects are produced (Ramachandran 1987; Livingstone 2002). Although there is evidence that, in so far as apparent motion depends on filling in, it occurs only in the human motion processing complex (MT+), the evidence also suggests that the perceived continuity of objects through apparent movement derives from an interaction between cells in MT+ and lateral occipital complex (LOC) neurons (Liu et al 2004, p. 1772). This occurs by virtue of their areas of overlap.
Moreover, MT+ has been identified as a possible area for multisensory integration; and various multisensory areas of the brain have been said to integrate inputs from different sense modalities by exploiting overlapping receptive fields (Stein et al 2004, p. 253; Schroeder and Foxe 2004, pp. 297 and 301). Thus, it may be that perceptual completion in MT+ depends on an interactive mechanism involving changes in receptive field properties of neurons and that cross-modal effects on apparent motion derive from that sort of mechanism as well. Although in that case, the recognition of basic movement types (and the actions to which they correspond) will not simply derive from early responses to isolated stimuli, it would also not seem to require an account in terms of substantial cognitive processing. It depends instead on mutual constraints imposed by the related features and on the anatomical and physiological properties through which the constraints are applied. [4]

Two points are important in that regard. First, the mechanism makes sensory integration depend on internal representations, but only of a limited sort. This is due to the coarse coding involved in the ‘capture’ of one type of sensory feature by another. When cells respond to regions in their extended receptive fields that contain features that they would not typically detect, the responses need not delineate all of the features located there. (And other cells whose response to different features in the region is inhibited will continue to be active, but only weakly so.) Thus multimodal integration is heuristic in a general sense. Second, the neuronal enhancement that derives from this integration can be viewed as serving to reinforce the diagnostic status of certain cues. Those cues can be used selectively, not simply by way of a predetermined diagnostic ranking, but because of their dynamic relation to other cues; a relation that need not itself be explicitly represented, being vested in an interactive process of the sort I have described. [5]

Representational Genres

These points bear on the nature of pictorial interpretation and the perceptual abilities on which it may depend. In some of the experiments discussed above, the recognition of intentions and the segmenting of events in film have been shown to depend on the selective enhancement of features (Zacks 2004; Vroomen and de Gelder 2004). Cross-modal interactions can be understood in those terms, especially in film. However, the analysis of those interactions points to an account of picture perception that does not emphasize the role of attentional binding of external features, and this distinguishes it from other theories in which the need for internal representation is minimized (cf. e.g. Ballard 1991).

At the same time, the analysis suggests that the mode of interpretation of movies is not fundamentally different from that employed with paintings and photographs (cf. e.g. Livingstone 2002; Grossman et al 2000, 2001). The continuity of cross-modal interactions with interactions among subsystems within a sense modality may have phenomenological consequences; that is, it may be at work in the fact that cross-modal interactions always produce effects in one sense modality or another: ventriloquism concerns the location of sound; tones synchronized with flashing lights produce visual apparent motion. The principles that govern the experience of cross-modal effects in that regard remain unclear. [6] However, it can be argued that, as long as the effects of cross-modal interactions include responses within primary sensory areas and do not simply terminate with activity in multimodal neurons per se, the response to multimedia presentations should not be construed as a distinctive mode of interpretation. Rather, the response should be classified with reference to the pattern of activity in the unimodal areas that are involved. So construed, movies are motion pictures that include sounds.

References

Ballard, Dana (1991) “Animate Vision,” Artificial Intelligence, 49: 57-96

Blythe, P., Todd, P., and Miller, G. (1999) “How Motion Reveals Intention: Categorizing Social Interactions,” in Simple Heuristics That Make Us Smart, eds. Gigerenzer, G., Todd, P., and the ABC Research Group. Oxford University Press.

Bushara, K., Hanakawa, T., Immisch, I., Toma, K., Kansaku, K., and Hallett, M. (2003) “Neural correlates of cross-modal binding,” Nature Neuroscience 6: 190-195.

Cutting, J. (1986) Perception with an eye for motion. Cambridge MA: MIT Press

Gilden, D. (1991) “On the Origins of Dynamical Awareness,” Psychological Review 98: 554-558.

Grossman, E, Donnelly, M., Price, R., Pickens, D., Morgan, G., Neighbor, G., and Blake, R. (2000) “Brain Areas Involved in Perception of Biological Motion,” Journal of Cognitive Neuroscience 12:5, pp. 711-720.

Grossman, E., and Blake, R. (2001) “Brain activity evoked by inverted and imagined biological motion,” Vision Research 41:1475-1482

Hecht, Heiko (1996). "Heuristics and invariants in dynamic event perception: Immunized concepts or nonstatements?" Psychonomic Bulletin & Review 3, pp.61-70.

Livingstone, M. (2002) Art and Vision: The Biology of Seeing New York: Harry Abrams.

Lopes, Dominic (1996) Understanding Pictures Oxford University Press

Liu, T., Slotnick, S., and Yantis, S. (2004) “Human MT+ mediates perceptual filling-in during apparent motion,” NeuroImage 21: 1771-1780.

Meltzoff, A., and Gopnik, A. (1997) Words, Thoughts, and Theories. Cambridge, MA: MIT Press.

Pessoa, L., Thompson, E., and Noe, A. (in press) “Finding Out About Filling In: A Guide to Perceptual Completion for Visual Science and the Philosophy of Perception,” Behavioral and Brain Sciences.

Ramachandran, V.S. (1987) “Interactions between Motion, Depth, Color, and Form: The Utilitarian Theory of Perception,” Coding and Efficiency, ed. Colin Blakemore.

Ramachandran, V.S. and Hirstein, W. (1999) “The Science of Art: A Neurological Theory of Aesthetic Experience,” Journal of Consciousness Studies 6:15-57.

Rollins, M. (2003) “The Mind in Pictures: Perceptual Strategies and the Interpretation of Visual Art,” The Monist 86: 608-630.

Saxe, R., Xiao, D., Kovacs, G., Perrett, D., and Kanwisher, N. (2004) “A region of right posterior temporal sulcus responds to observed intentional actions” Neuropsychologia 42: 1435-1446.

Schier, F. (1986) Deeper Into Pictures. Cambridge UK: Cambridge University Press

Shimojo, S. and Shams, L. (2001) “Sensory modalities are not separate modalities: plasticity and interactions,” Current Opinion in Neurobiology 11:505-509.

Speer, N., Swallow, K., and Zacks, J. (2000) “On the role of human areas MT+ and FEF in event perception,” Cognitive, Affective, and Behavioral Neuroscience 3: 335-345.

Vroomen, J. and de Gelder, B. (2004) “Perceptual Effects of Cross-modal Stimulation: Ventriloquism and the Freezing Phenomenon,” in Handbook of Multisensory Processes, eds. Calvert, G, Spence, C., and Stein, B. Cambridge MA: MIT Press.

Zacks, J., Braver, T., Sheridan, M., Donaldson, D., Snyder, A., Ollinger, J., Buckner, R., and Raichle, M. (2001) “Human brain activity time-locked to perceptual event boundaries,” Nature Neuroscience 4:651-655.

Zacks, J. (2004) “Using Movement and Intentions to Understand Simple Events.” Cognitive Science.

Notes

[1] For reasons I have expressed elsewhere (Rollins 2003), I do not think that pictorial representation can actually be defined in terms of a type of interpretation that applies only to pictures. Being the object of a certain kind of perception-based interpretation is at most a necessary condition for being a picture. In any case, it will be sufficient for my purposes to show that there is a distinctive perception-based mode of interpretation that is characteristic of pictures (i.e. typically employed in understanding them). The question then is whether there is also a characteristic mode of interpretation of movies, which is sufficiently different from our interpretive response to static pictures as to require a distinct type of account. I will argue that the answer is no.

[2] The area that Zacks et al and Saxe et al find to be implicated in the detection of biological motion, the posterior superior temporal sulcus (PSTS), has been described by Saxe et al, as well as by Grossman et al (199x), as a special-purpose mechanism for recognizing biological motion, which may even contain a subregion dedicated to intentional biological motion in particular. Grossman et al argue that PSTS does not provide feedback to earlier levels, in which case it would seem to be a higher-level structure that simply receives input from earlier visual processes. However, two points are important in that regard. First, even if PSTS is a dedicated intention detector, Saxe et al argue that the activity in that area specifically does not involve reasoning about the mental states of the agents (because other brain regions known to be active in such reasoning are not involved in detecting intentional action). Second, the Saxe et al and Grossman et al studies involve only visual stimuli. It is not clear what other areas might be involved when the perception of biological or intentional motion depends on cross-modal effects.

[3] Of course the representation of long-range apparent motion in movies depends on the perception of changes in the positions of objects; the perceived continuity across those changes depends on an apparent motion effect. It should be noted that cross-modal versions of short-range apparent motion can be produced: a single flashing light on the left side of a screen will appear to move behind a piece of tape on the right, if it is coordinated with tones alternating in the relevant locations. The effect lends itself naturally to description in intentional terms: “In ‘hiding’ the missing dot behind the piece of tape stuck to the screen, the visual system is doing more than merely filling in the missing dots between the two points. It ‘assumes’ that the dot ‘hid’ from view[;] …” it attributes “‘hiding’ to the behavior of a dot…” (Anderson and Anderson, 200x, pp. 358-369).

[4] Although it appears that processes in MT+ may be modulated by connections to areas involved in the direction of attention, I take it that the extent to which the function of MT+ should be explained in more bottom-up or top-down terms is an open empirical question. Cf. Speer, Swallow, and Zacks 2000.

[5] More must be said about the nature of heuristics than space allows here. In earlier debates, a reliance on heuristic principles was taken to be opposed to the use of invariant properties. Cutting mitigates that opposition, to some extent, in allowing for perceptual strategies in which invariants can be selected and combined. The distinction between the use of heuristics and invariants can be blurred in other ways as well. For instance, Hecht (1996) argues that accounts that appeal to invariant properties sometimes explain failures in perceptual judgment in terms of the perceiver’s use of incomplete invariants. That makes the response sound heuristic in form. Nonetheless, I suggest that progress can be made by focusing on particular phenomena and the mechanisms that underwrite them. Studies with such a focus support the conclusion that film viewers commonly rely on more feature integration than some models allow, but not so much as to make event perception heavily dependent on inference and background knowledge.

[6] Vroomen and de Gelder 2004 suggest a principle of cross-modal control that is consistent with this point.

Crossmodal interactions and understanding movies
Nicolas Bullot
Nov 18, 2005 23:47 UT

At the beginning of the article, you present a set of problems for any naturalist approach to picture understanding that is meant to explain the relation between perceptual processing and conceptual thinking. According to you, the appeal to visual modules does not work since “(…) the outputs of such modules will always underdetermine conscious, central interpretive processes, which can take highly variable forms.” In contrast with an approach based purely on a visual module, you examine the function of crossmodal interactions in movie perception (crossmodal interactions between vision and audition seem to play an important role in movie perception and/or understanding). You give reasons for maintaining that film understanding (e.g. the recognition of intentions and the segmenting of events in film) may sometimes depend on the crossmodal selective enhancement of perceived features. If true, this suggests that (the epistemic/heuristic strategies of) motion-picture understanding may be routinely based on crossmodal selective enhancement. In this context, I am not sure that I understand the meaning of your conclusion:

“it can be argued that, as long as the effects of cross-modal interactions include responses within primary sensory areas and do not simply terminate with activity in multimodal neurons per se, the response to multimedia presentations should not be construed as a distinctive mode of interpretation. Rather, the response should be classified with reference to the pattern of activity in the unimodal areas that are involved. So construed, movies are motion pictures that include sounds.”

Are you excluding the possibility that crossmodal interaction is associated with a distinctive mode of interpretation (a possibility that may make sense: think, for example, of the distinction between silent and sound movies)? Is this because you conceive of the interpretative process as being amodal or non-perceptual? Is this related to a claim according to which pictures can only be visual? Can one consider movies as evolving sound environments that include pictures, or as multimodal (i.e. audiovisual) pictures?

  1 reply to Crossmodal interactions and understanding movies:
    Pictorial Interpretation and Vision
Mark Rollins, Nov 26, 2005 16:49 UT
 
© 2008 interdisciplines.