Scene Detection Guide

Learn how VideoIntel detects scene changes in videos using advanced computer vision techniques, frame difference analysis, and intelligent filtering algorithms.

Overview

Scene detection is the process of identifying where one scene ends and another begins in a video. This is useful for:

  • Video Segmentation: Automatically split videos into meaningful chapters
  • Content Analysis: Understand video structure and pacing
  • Smart Navigation: Create chapter markers for easier video browsing
  • Thumbnail Selection: Generate one thumbnail per scene
  • Video Editing: Identify natural cut points for trimming

How It Works

VideoIntel's scene detection uses a multi-stage pipeline that balances accuracy with performance:

Detection Algorithm

The scene detection process follows these 7 steps:

Step 1: Frame Extraction

Sample frames at regular intervals (default: every 0.5 seconds) throughout the video. This provides sufficient temporal resolution while keeping processing time reasonable.
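Conceptually, this step just plans which timestamps to seek to. A minimal sketch (`sampleTimestamps` is an illustrative helper, not part of the VideoIntel API):

```typescript
// Build the list of timestamps to extract, one every `interval` seconds.
// Illustrative helper - not part of the VideoIntel public API.
function sampleTimestamps(duration: number, interval = 0.5): number[] {
  const timestamps: number[] = [];
  for (let t = 0; t < duration; t += interval) {
    timestamps.push(t);
  }
  return timestamps;
}

// A 60-second video yields 120 sample points (60 / 0.5)
const points = sampleTimestamps(60);
```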

Step 2: Frame Difference Calculation

Compare each frame with the previous frame using pixel-level difference calculation. Frames are downscaled to 25% size and converted to grayscale for 48x faster processing.

Step 3: Boundary Identification

Mark timestamps where frame difference exceeds the threshold (default: 30%) as potential scene boundaries. Higher differences indicate more significant visual changes.
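The thresholding itself is simple. A sketch operating on per-sample difference scores (`findBoundaries` is an illustrative helper, not the library's API):

```typescript
// Return the sample indices whose difference score exceeds the threshold.
// Sketch only - the real pipeline works on FrameDifferenceCalculator output.
function findBoundaries(differences: number[], threshold = 0.3): number[] {
  const boundaries: number[] = [];
  differences.forEach((diff, i) => {
    if (diff > threshold) boundaries.push(i);
  });
  return boundaries;
}

findBoundaries([0.05, 0.8, 0.1, 0.45, 0.2]);  // → [1, 3]
```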

Step 4: False Positive Filtering

Apply smoothing and local maxima detection to remove false positives caused by camera motion, fast object movement, or flashes. This typically reduces false positives by 50-70%.


Step 5: Minimum Scene Length

Remove boundaries that create scenes shorter than the minimum length (default: 3 seconds). This prevents micro-scenes from quick cuts or transitions.
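A sketch of this filter, assuming boundary times are in seconds and sorted ascending (illustrative helper only):

```typescript
// Drop any boundary that would create a scene shorter than `minLength` seconds,
// scanning left to right from the start of the video. Illustrative sketch.
function enforceMinSceneLength(boundaries: number[], minLength = 3): number[] {
  const kept: number[] = [];
  let lastKept = 0;  // the video starts a scene at t = 0
  for (const t of boundaries) {
    if (t - lastKept >= minLength) {
      kept.push(t);
      lastKept = t;
    }
  }
  return kept;
}

enforceMinSceneLength([1.5, 4, 5, 9]);  // → [4, 9]
```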

Step 6: Scene Grouping

Group timestamps into coherent Scene objects with start time, end time, duration, and confidence scores.
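Grouping can be sketched as slicing the timeline at each boundary (simplified; the real Scene objects also carry confidence scores and, optionally, thumbnails):

```typescript
interface Scene {
  start: number;
  end: number;
  duration: number;
}

// Turn a sorted list of boundary timestamps into contiguous Scene objects.
// Simplified sketch - omits the confidence score and thumbnail fields.
function groupScenes(boundaries: number[], videoDuration: number): Scene[] {
  const cuts = [0, ...boundaries, videoDuration];
  const scenes: Scene[] = [];
  for (let i = 0; i < cuts.length - 1; i++) {
    scenes.push({
      start: cuts[i],
      end: cuts[i + 1],
      duration: cuts[i + 1] - cuts[i],
    });
  }
  return scenes;
}

groupScenes([4, 9], 15);  // → scenes covering 0-4, 4-9, 9-15
```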

Step 7: Thumbnail Generation

Extract a representative frame from each scene's midpoint (if enabled). Midpoint frames are typically the most stable and representative of the scene.
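The midpoint timestamp is simply (a one-line sketch; `thumbnailTime` is an illustrative helper, not part of the API):

```typescript
// Midpoint timestamp used for a scene's representative thumbnail. Sketch only.
const thumbnailTime = (scene: { start: number; end: number }): number =>
  scene.start + (scene.end - scene.start) / 2;

thumbnailTime({ start: 10, end: 16 });  // → 13
```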

Frame Sampling Strategy

VideoIntel samples frames at 0.5-second intervals, analyzing ~2 frames per second. This is a good trade-off because:

  • Fast enough to catch quick cuts and transitions
  • Slow enough for efficient processing (analyzing every frame of a 30 fps video would be ~15x slower)
  • Memory efficient - only keeps frames needed for comparison
// Example: 60-second video with 0.5s sampling
// Extracts: 120 frames (60 ÷ 0.5)
// Memory: ~30MB peak (frames processed progressively)
// Time: ~3-5 seconds on modern hardware

const scenes = await videoIntel.detectScenes(video, {
  minSceneLength: 3,    // Filter scenes shorter than 3s
  threshold: 0.3,       // 30% difference required
});

Frame Difference Calculation

VideoIntel calculates frame differences using pixel-level comparison with optimizations:

// Pseudo-code showing the difference calculation
function calculateFrameDifference(frame1, frame2) {
  // 1. Downscale frames to 25% per axis (16x fewer pixels)
  const small1 = downscale(frame1, 0.25);
  const small2 = downscale(frame2, 0.25);
  
  // 2. Convert to grayscale (3x faster than RGB)
  const gray1 = toGrayscale(small1);
  const gray2 = toGrayscale(small2);
  
  // 3. Calculate pixel-by-pixel difference
  let totalDifference = 0;
  for (let i = 0; i < gray1.length; i++) {
    const diff = Math.abs(gray1[i] - gray2[i]);
    totalDifference += diff;
  }
  
  // 4. Normalize to 0-1 range
  const avgDifference = totalDifference / gray1.length;
  return avgDifference / 255;
}

// Result: 48x faster than full-res RGB comparison
// Accuracy: >95% scene detection rate

False Positive Filtering

Raw difference detection produces many false positives. VideoIntel applies two filters:

1. Local Maxima Detection

Only keeps boundaries that are peaks in their neighborhood. This removes spurious detections during gradual transitions or panning shots.

// A boundary is kept only if it's higher than neighbors
// Window size: ±3 frames

Difference:  [0.2, 0.4, 0.8, 0.5, 0.3, 0.9, 0.4]
                          ↑              ↑
                 kept (local max)  kept (local max)

// The 0.4 and 0.5 values are rejected because a higher peak (0.8 or 0.9) is nearby
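The local-maxima filter can be sketched as a small function. This is a simplification with a tunable window parameter; it uses a ±2 window here so both peaks in the example array survive:

```typescript
// Keep only indices that are strictly the maximum within a ±window neighborhood.
// Simplified sketch of a local-maxima filter; the window size is a tunable
// assumption, not necessarily what VideoIntel uses internally.
function localMaxima(differences: number[], window = 2): number[] {
  const peaks: number[] = [];
  for (let i = 0; i < differences.length; i++) {
    const lo = Math.max(0, i - window);
    const hi = Math.min(differences.length - 1, i + window);
    let isPeak = true;
    for (let j = lo; j <= hi; j++) {
      if (j !== i && differences[j] >= differences[i]) {
        isPeak = false;
        break;
      }
    }
    if (isPeak) peaks.push(i);
  }
  return peaks;
}

localMaxima([0.2, 0.4, 0.8, 0.5, 0.3, 0.9, 0.4]);  // → [2, 5] (the 0.8 and 0.9 peaks)
```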

2. Prominence Filtering

Boundaries must be significantly higher (20% threshold) than their neighbors to be considered valid scene changes.

// Prominence = (boundary - avg_neighbors) / avg_neighbors
// Must be ≥ 20% to be kept

Boundary: 0.8, Neighbors: [0.7, 0.65]
Avg neighbors: 0.675
Prominence: (0.8 - 0.675) / 0.675 = 18.5%
Result: REJECTED (below 20% threshold)

Boundary: 0.8, Neighbors: [0.5, 0.45]
Avg neighbors: 0.475
Prominence: (0.8 - 0.475) / 0.475 = 68%
Result: KEPT (above threshold)
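The prominence test on a single candidate boundary can be sketched as (illustrative helper, not the library's API):

```typescript
// Prominence = (boundary - avgNeighbors) / avgNeighbors; keep if >= 20%.
// Sketch of the prominence check on one candidate boundary.
function isProminent(
  boundary: number,
  neighbors: number[],
  minProminence = 0.2
): boolean {
  const avg = neighbors.reduce((sum, n) => sum + n, 0) / neighbors.length;
  return (boundary - avg) / avg >= minProminence;
}

isProminent(0.8, [0.7, 0.65]);  // → false (prominence ≈ 18.5%)
isProminent(0.8, [0.5, 0.45]);  // → true  (prominence ≈ 68%)
```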

Configuration Options

Fine-tune scene detection for your specific use case:

const scenes = await videoIntel.detectScenes(video, {
  // Minimum scene length in seconds
  // Shorter scenes are merged with adjacent scenes
  minSceneLength: 3,    // Default: 3 seconds
  
  // Detection sensitivity (0-1)
  // Lower = more scenes, Higher = fewer scenes
  threshold: 0.3,       // Default: 0.3 (30%)
  
  // Generate thumbnails for each scene
  includeThumbnails: true,  // Default: true
});

Threshold Tuning Guide

Threshold     Sensitivity    Use Case
0.15 - 0.25   Very High      Catch subtle transitions, slow pans, lighting changes
0.25 - 0.35   Balanced ⭐    Most videos - good balance of accuracy and precision
0.35 - 0.50   Conservative   Only obvious cuts, action films with fast motion
0.50+         Very Low       Only dramatic scene changes

Best Practices

1. Choose the Right Threshold

Different video types need different thresholds:

// Talking head videos (static scenes, few cuts)
const talkingHead = await videoIntel.detectScenes(video, {
  threshold: 0.25,      // Lower threshold to catch subtle changes
  minSceneLength: 5,    // Longer minimum (scenes tend to be long)
});

// Action movies (fast cuts, lots of motion)
const actionMovie = await videoIntel.detectScenes(video, {
  threshold: 0.40,      // Higher threshold to avoid motion blur
  minSceneLength: 2,    // Shorter minimum (scenes are quick)
});

// Documentaries (mix of interviews and B-roll)
const documentary = await videoIntel.detectScenes(video, {
  threshold: 0.30,      // Balanced detection
  minSceneLength: 3,    // Standard minimum
});

// Music videos (very fast cuts, artistic transitions)
const musicVideo = await videoIntel.detectScenes(video, {
  threshold: 0.35,      // Higher to avoid detecting every beat
  minSceneLength: 1,    // Allow very short scenes
});

2. Validate Results

Use the statistics API to understand detection quality:

const detector = new SceneDetector(
  new FrameExtractor(),
  new FrameDifferenceCalculator()
);

const scenes = await detector.detect(video, options);

// Get detection statistics
const stats = detector.getLastStats();

console.log(`Detected ${stats.scenesDetected} scenes`);
console.log(`Average scene length: ${stats.averageSceneLength}s`);
console.log(`False positives filtered: ${stats.boundariesRejected}`);
console.log(`Processing time: ${stats.processingTime}ms`);

// If too many scenes detected:
// → Increase threshold
// If too few scenes detected:
// → Decrease threshold
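That feedback loop can be sketched as a pure function. The 10-per-minute and 1-per-minute targets and the 0.05 step are illustrative assumptions, not VideoIntel defaults:

```typescript
// Nudge the detection threshold based on scene density from the last run.
// Heuristic sketch only - the density targets and step size are assumptions;
// tune them for your content.
function adjustThreshold(current: number, scenesPerMinute: number): number {
  if (scenesPerMinute > 10) return Math.min(current + 0.05, 0.6); // too many → less sensitive
  if (scenesPerMinute < 1) return Math.max(current - 0.05, 0.1);  // too few → more sensitive
  return current;                                                 // density looks reasonable
}

adjustThreshold(0.3, 15);   // raises the threshold
adjustThreshold(0.3, 0.5);  // lowers the threshold
```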

3. Handle Edge Cases

// Very short videos (< 5 seconds)
if (video.duration < 5) {
  // Might not find any scenes - that's okay
  const scenes = await videoIntel.detectScenes(video, {
    minSceneLength: 0.5,  // Lower minimum
    threshold: 0.2,       // More sensitive
  });
}

// Very long videos (> 30 minutes)
if (video.duration > 1800) {
  // Consider higher threshold for efficiency
  const scenes = await videoIntel.detectScenes(video, {
    minSceneLength: 5,    // Longer scenes likely
    threshold: 0.35,      // Less sensitive = faster
  });
}

// Videos with fades/transitions
const artisticVideo = await videoIntel.detectScenes(video, {
  threshold: 0.25,      // Lower to catch gradual transitions
  minSceneLength: 2,
});

Performance

Benchmarks

Video Length   Frames Analyzed   Processing Time   Memory Peak
30 seconds     ~60 frames        1-2 seconds       ~50MB
2 minutes      ~240 frames       3-5 seconds       ~100MB
10 minutes     ~1,200 frames     15-20 seconds     ~200MB
30 minutes     ~3,600 frames     45-60 seconds     ~300MB

⚡ Performance Note

Scene detection is CPU-intensive. For very long videos (>1 hour), consider processing in chunks or using a lower sampling rate. The algorithm is already optimized with downscaling and grayscale conversion for maximum speed.
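A chunking plan can be computed up front. This sketch adds a small overlap between chunks so cuts near a chunk edge are not missed (`planChunks` is an illustrative helper, not part of the API):

```typescript
// Split a long video into fixed-length processing chunks with a small
// overlap, so scene boundaries near a chunk edge are not missed.
// Illustrative helper only.
interface Chunk {
  start: number;
  end: number;
}

function planChunks(duration: number, chunkLength = 600, overlap = 5): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  while (start < duration) {
    chunks.push({ start, end: Math.min(start + chunkLength, duration) });
    start += chunkLength - overlap;
  }
  return chunks;
}

// A 90-minute (5400s) video → ten 10-minute chunks with 5s of overlap
planChunks(5400);
```

Each chunk can then be processed independently, with duplicate boundaries in the overlap regions de-duplicated afterwards.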

Common Examples

Create Video Chapters

async function createChapters(video: HTMLVideoElement) {
  const scenes = await videoIntel.detectScenes(video, {
    minSceneLength: 5,    // Chapters should be substantial
    threshold: 0.3,
    includeThumbnails: true,
  });
  
  return scenes.map((scene, i) => ({
    title: `Chapter ${i + 1}`,
    start: scene.start,
    end: scene.end,
    duration: scene.duration,
    thumbnail: scene.thumbnail,  // Use scene thumbnail
  }));
}

// Usage in video player
const chapters = await createChapters(videoElement);
chapters.forEach(chapter => {
  addChapterMarker(chapter);
});

Smart Video Trimming

async function suggestTrimPoints(video: HTMLVideoElement) {
  const scenes = await videoIntel.detectScenes(video, {
    threshold: 0.35,  // Conservative - only clear cuts
  });
  
  // Suggest natural cut points at scene boundaries
  return {
    suggestedTrims: scenes.map(scene => scene.start),
    scenes: scenes.map(scene => ({
      start: scene.start,
      end: scene.end,
      canTrim: scene.duration > 3,  // Only suggest if scene is long enough
    })),
  };
}

Automatic Highlights

async function findHighlightScenes(video: HTMLVideoElement) {
  const scenes = await videoIntel.detectScenes(video, {
    threshold: 0.3,
  });
  
  // Get thumbnails for scene analysis
  const thumbnails = await videoIntel.getThumbnails(video, {
    count: scenes.length,
  });
  
  // Match thumbnails to scenes and score them
  const scoredScenes = scenes.map((scene, i) => ({
    ...scene,
    score: thumbnails[i]?.score || 0,  // Use thumbnail quality as proxy
  }));
  
  // Return top 3 highest-scoring scenes as highlights
  return scoredScenes
    .sort((a, b) => b.score - a.score)
    .slice(0, 3);
}

🚀 Next Steps