Classifying Film Shot Types with SGNet

Imagine a scene where the camera zooms in so closely that every flicker of emotion on an actor’s face becomes visible. Now, picture that same character appearing as a tiny figure within a vast, empty landscape. These contrasting choices in shot size are deliberate tools filmmakers use to evoke specific emotions, shaping how viewers connect with each moment. Each frame serves not only as a window into the story but also as a storyteller, guiding the audience’s emotional journey.

The Art of Shot Selection in Filmmaking

Typically, shot choices are planned well in advance during pre-production and locked in through editing. But what if, in certain instances, a filmmaker needs to explore which shot better captures the desired emotion? This is where advanced frameworks come into play, helping to analyze and classify shot types more precisely.

“A Unified Framework for Shot Type Classification Based on Subject-Centric Lens,” a research paper by Anyi Rao et al., introduces a novel framework called SGNet. This framework automatically classifies shot types by distinguishing between each shot's subject and background. SGNet creates separate guidance maps for scale and movement classification, enabling a nuanced understanding of shot composition. To support this, the authors developed MovieShots, a comprehensive dataset featuring over 46,000 annotated shots from 7,000 movie trailers, categorizing them by scale and movement.

Before diving into the technical aspects of SGNet, let’s look at the various shot types that filmmakers use to build a visual narrative. For easier understanding, we’ll use examples from Baahubali 2.

Exploring Different Shot Types with Examples from Baahubali 2

Establishing Shot:

Sets the location and environment of the scene, helping viewers understand the broader context. It's often a wide or aerial shot, but there’s no strict requirement on how wide it must be.

Example: The Mahishmati Kingdom Shot 

A wide aerial shot of Mahishmati establishes the kingdom's grandeur and scale, setting the story's epic tone. This shot gives viewers a clear view of the city’s architecture and sets the location’s mood and importance.

Extreme Wide Shot:

Shows the subject as a small part of a large landscape, creating a sense of vastness. It can make the character seem small or isolated within their surroundings.

Example: Baahubali lifting the Shiva Linga.

Baahubali is dwarfed by the vast, misty forest as he lifts the Shiva Linga on his shoulders. The extreme wide shot here emphasizes his strength, determination, and connection to his divine calling, highlighting his character against the larger natural backdrop.

Wide Shot:

This balances the subject with the environment, capturing the full subject within the frame and emphasizing the space around it.

Example: Baahubali during the Kalakeya battle.

Here, the wide shot keeps Baahubali fully in frame during the Kalakeya battle while also showing the battlefield around him, balancing his actions against the scale of the conflict.

Full Shot:

Frames the subject head-to-toe, letting the audience focus on the character's stance and body language while still keeping parts of the environment in view.

Example: Baahubali at the waterfall

This full shot captures his stance and determination, making him seem both awe-inspiring and grounded, while the waterfall hints at the challenges he will overcome.

Medium Wide Shot:

Frames the subject from roughly the knees up, giving viewers a sense of the subject's posture and the immediate surroundings without overwhelming detail.

Example: Baahubali and Devasena in the palace courtyard

Framed from the knees up, this shot shows their regal attire and posture while they converse, allowing viewers to sense the chemistry and tension between them. The palace in the background gives additional context to their royal status and environment.

Cowboy Shot:

Traditionally used in Westerns to show gunslingers, it frames the subject from the mid-thigh up. It’s ideal for scenes with action or tension as it reveals weapons or tools while maintaining an emotional connection.

Example: Bhallaladeva with his weapon

Framed from the mid-thigh-up, this shot captures Bhallaladeva’s intimidating posture and weapon, emphasizing his readiness for combat. This framing draws attention to his armor and weapon, underscoring his role as a fierce warrior.

Medium Shot:

Frames the subject from the waist up, emphasizing the character's emotions and allowing the audience to see some background. It’s often used for conversations.

Example: Sivagami with Baahubali

A medium shot, from the waist up, captures their facial expressions and body language as they discuss important matters. This shot keeps the focus on their emotions, while also giving context to their surroundings.

Medium Close-Up:

Frames the subject from the chest up, focusing on facial expressions while keeping some distance, ideal for emotional moments without full intensity.

Example: Kattappa and Young Baahubali

Framed from the chest up, this medium close-up highlights Kattappa’s shocked expression as young Baahubali decides to eat alongside a common man like him.

Close-Up:

A close-up fills the frame with the subject’s face or detail, making it excellent for emotional impact or conveying a character’s inner thoughts through micro-expressions.

Example: Baahubali’s Happiness

This close-up captures Baahubali’s happiness, offering an intimate view of his emotions.

Extreme Close-Up:

Focuses on minute details, often parts of the face, to create an intimate or tense moment.

Example: Devasena’s Eyes

An extreme close-up is used to highlight Devasena's renowned beauty, focusing on the intricate details of her features.

How Does SGNet Work for Film Shot Classification?

SGNet is designed to address the challenges of shot type classification by separating the subject and background of a shot into two distinct streams. This dual approach allows for more accurate identification of scale and movement types. SGNet utilizes two streams (a minimal sketch of how their inputs can be built follows the list):

> Subject Stream: This stream focuses on identifying the subject in the image, such as the main character or object within the frame. It captures details about the subject's position, orientation, and scale.

> Background Stream: The background stream analyzes the environment or setting surrounding the subject. It pays attention to the shot composition, capturing elements like the landscape, architecture, or general background details.
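To make the dual-stream idea concrete, here is a minimal sketch, assuming a subject bounding box is already available (from a detector or annotation), of how a single frame can be split into a subject input and a background input for the two streams. The box format and masking are illustrative assumptions, not the paper’s exact guidance-map construction.

```python
import torch

def make_stream_inputs(frame: torch.Tensor, box: tuple) -> tuple:
    """frame: (3, H, W) float tensor; box: (x1, y1, x2, y2) pixel coordinates of the subject."""
    _, h, w = frame.shape
    x1, y1, x2, y2 = box
    subject_mask = torch.zeros(1, h, w)
    subject_mask[:, y1:y2, x1:x2] = 1.0            # 1 inside the subject region
    background_mask = 1.0 - subject_mask           # 1 everywhere else
    subject_input = frame * subject_mask           # what the subject stream sees
    background_input = frame * background_mask     # what the background stream sees
    return subject_input, background_input

frame = torch.rand(3, 224, 224)                    # dummy RGB frame
subj_in, bg_in = make_stream_inputs(frame, (60, 40, 160, 200))
print(subj_in.shape, bg_in.shape)                  # torch.Size([3, 224, 224]) twice
```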

Flowchart of SGNet Framework for Film Shot Type Classification

Subject-Centric Approach: SGNet’s design emphasizes the subject-centric nature of film shots, where the position, size, and focus of the subject in the frame can drastically change the classification. For example, a close-up emphasizes the subject’s facial details, while an extreme wide shot captures the subject within a vast background. SGNet uses bounding boxes and segmentation techniques to locate and isolate the subject, allowing the model to focus on key features necessary for accurate classification.
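One plausible way to obtain that subject region, sketched below, is to run an off-the-shelf instance segmentation model and keep only confident detections. This uses torchvision’s Mask R-CNN as an illustrative stand-in; it is not SGNet’s own subject-map module.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Off-the-shelf instance segmentation as a stand-in subject locator.
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def subject_mask(frame: torch.Tensor, score_thresh: float = 0.7) -> torch.Tensor:
    """frame: (3, H, W) float tensor in [0, 1]; returns a (1, H, W) soft subject mask."""
    pred = model([frame])[0]                       # single-image batch
    keep = pred["scores"] > score_thresh           # drop low-confidence detections
    if keep.sum() == 0:
        return torch.zeros(1, *frame.shape[1:])    # no confident subject found
    masks = pred["masks"][keep]                    # (N, 1, H, W) per-instance soft masks
    return masks.max(dim=0).values                 # union over detected instances
```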

Feature Extraction and Fusion: After processing each stream (subject and background) separately, SGNet extracts features relevant to shot classification. These features are then fused, combining insights from both streams (a minimal fusion sketch follows the list below). This fusion helps SGNet understand relationships like:

  • The size of the subject relative to the background

  • The distance between the camera and the subject

  • Background elements that can indicate shot types (e.g., landscape indicating an establishing shot)
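A minimal fusion sketch, assuming each stream already yields a pooled feature vector (for example, 2048-d from a ResNet50 branch). Simple concatenation followed by a linear projection is one reasonable choice here, not necessarily the paper’s exact fusion.

```python
import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Fuse pooled subject and background features into one representation."""
    def __init__(self, feat_dim: int = 2048, fused_dim: int = 512):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(2 * feat_dim, fused_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, subj_feat: torch.Tensor, bg_feat: torch.Tensor) -> torch.Tensor:
        # Concatenation lets later layers relate subject size and position
        # to the surrounding background context.
        return self.project(torch.cat([subj_feat, bg_feat], dim=-1))
```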

Classifier Head for Shot Type Prediction: The fused feature representation is passed to a classifier head, which makes the final decision on the shot type (sketched in code after the list below). The classifier is trained on a variety of shot types, such as:

  • Establishing shots: Large background, with the subject often small or distant.

  • Extreme wide shots: Vast background with a tiny subject presence.

  • Medium shots: Balanced view of subject and background.

  • Close-ups: High focus on the subject, minimal background.
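As a rough sketch, the classifier head can be a small linear layer over the fused feature from the previous step. The label set below is illustrative; MovieShots defines its own scale and movement categories.

```python
import torch.nn as nn

SHOT_SCALES = ["extreme_close_up", "close_up", "medium", "full", "long"]  # example label set

classifier_head = nn.Sequential(
    nn.Dropout(p=0.5),                    # light regularization before the final layer
    nn.Linear(512, len(SHOT_SCALES)),     # 512-d fused feature -> per-class logits
)

# logits = classifier_head(fused_feature)   # fused_feature: (batch, 512) from the fusion step
# probs = logits.softmax(dim=-1)            # probabilities over shot scales
```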

Training with Annotated Data: SGNet is typically trained on annotated data that includes labels for different shot types. For effective training, the dataset should have diverse examples of each shot type, capturing various subjects and backgrounds.
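Training then reduces to standard supervised classification. The loop below is a hedged sketch with a placeholder data loader and hyperparameters; it is not the paper’s training configuration.

```python
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cuda"):
    """Each batch is assumed to provide subject inputs, background inputs, and a shot-type label."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for subj, bg, label in loader:
        subj, bg, label = subj.to(device), bg.to(device), label.to(device)
        logits = model(subj, bg)          # dual-stream forward pass
        loss = criterion(logits, label)   # supervised loss against the annotated shot type
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```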

By leveraging this dual-stream approach, SGNet can perform nuanced classifications, distinguishing subtle variations in shot types based on how subjects are framed against backgrounds. This capability makes it highly effective for applications in automated film analysis and cinematographic studies.

The framework employs a ResNet50 backbone: each shot is divided into multiple clips, each clip is processed through the separate subject and background streams, and the resulting clip-level feature vectors are pooled into a single shot representation for the final classification. This makes it possible to distinguish between shot scales such as extreme close-ups and long shots, as well as camera movements like push and pull shots.
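Putting the pieces together, the sketch below follows that description: a shot is split into clips, each clip passes through separate ResNet50 branches for the subject and background inputs, and the per-clip features are average-pooled into one shot-level vector. Truncating the branches before their classification layers and using mean pooling are simplifying assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualStreamShotEncoder(nn.Module):
    """Encode a shot (a batch of clips) with separate subject and background ResNet50 branches."""
    def __init__(self):
        super().__init__()
        # Truncate each ResNet50 before its classification layer, keeping the 2048-d pooled features.
        self.subject_branch = nn.Sequential(*list(resnet50(weights=None).children())[:-1])
        self.background_branch = nn.Sequential(*list(resnet50(weights=None).children())[:-1])

    def forward(self, subj_clips: torch.Tensor, bg_clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, 3, H, W) -> fold clips into the batch dimension
        b, n, c, h, w = subj_clips.shape
        subj = self.subject_branch(subj_clips.view(b * n, c, h, w)).flatten(1)   # (b*n, 2048)
        bg = self.background_branch(bg_clips.view(b * n, c, h, w)).flatten(1)    # (b*n, 2048)
        per_clip = torch.cat([subj, bg], dim=-1).view(b, n, -1)                  # per-clip fused features
        return per_clip.mean(dim=1)                                              # pool clips into one shot feature
```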

To further enhance SGNet, future improvements could focus on integrating additional temporal features to better understand transitions between shots. Moreover, expanding the dataset to include more diverse video sources could improve the model's generalization capabilities. Utilizing transformer-based architectures might also offer better performance by capturing long-range dependencies in shot sequences, making SGNet even more effective for real-world video analysis.

SGNet is a groundbreaking step forward in automated film shot classification, providing filmmakers, critics, and AI researchers with a sophisticated tool to decode the nuanced visual language of cinema. With its dual-stream focus on subjects and backgrounds, SGNet captures the emotional weight and storytelling intent behind each shot type, from intense close-ups to sweeping establishing shots. This framework bridges technology and creative insight, offering a powerful means to explore cinematography.

As SGNet evolves, its potential to serve filmmakers, educators, and enthusiasts grows, deepening our understanding of cinematic techniques. Enhanced with richer data, temporal analysis, and advanced architectures, SGNet is pioneering a future where AI aids in the craft of visual storytelling.
