Multimodal Video Summarization Using Machine Learning: A Comprehensive Benchmark of Feature Selection and Classifier Performance
The exponential growth of user-generated video content necessitates efficient summarization systems for improved accessibility, retrieval, and analysis. This study presents and benchmarks a multimodal video summarization framework that classifies segments as informative or non-informative using audio, visual, and fused features. Sixty hours of annotated video across ten diverse categories were analyzed. Audio features were extracted with pyAudioAnalysis, while visual features (colour histograms, optical flow, object detection, facial recognition) were derived using OpenCV. Six supervised classifiers (Naive Bayes, K-Nearest Neighbors, Logistic Regression, Decision Tree, Random Forest, and XGBoost) were evaluated, with hyperparameters optimized via grid search. Temporal coherence of the segment-level predictions was enhanced using median filtering. Random Forest achieved the best performance, with an AUC of 0.74 on fused features and a 3% F1-score gain after post-processing. Spectral flux, grayscale histograms, and optical flow emerged as the key discriminative features. The best model was deployed as a web service using TensorFlow and Flask, integrating informative segment detection with subtitle generation via beam search to ensure coherence and coverage. System-level evaluation demonstrated low latency and efficient resource utilization under load. Overall, the results confirm the strength of multimodal fusion and ensemble learning for video summarization and highlight their potential for real-world applications in surveillance, digital archiving, and online education.
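As an illustration of the median-filtering post-processing step mentioned above, the following is a minimal sketch of how a sliding median can smooth a sequence of per-segment binary predictions (1 = informative, 0 = non-informative). The function name `median_smooth`, the kernel size, and the edge-replication boundary handling are assumptions made for this example; the abstract does not specify the window length or implementation used in the study.

```python
from statistics import median


def median_smooth(labels, kernel_size=5):
    """Smooth a sequence of 0/1 segment labels with a sliding median.

    Hypothetical helper: kernel_size and the boundary handling are
    illustrative choices, not values reported in the study.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    n = len(labels)
    smoothed = []
    for i in range(n):
        # Clamp indices at the sequence boundaries (edge replication).
        window = [labels[min(max(j, 0), n - 1)]
                  for j in range(i - half, i + half + 1)]
        # With an odd window of 0/1 values, the median is a majority vote.
        smoothed.append(int(median(window)))
    return smoothed


# Raw classifier output with isolated flips at positions 2 and 7.
raw = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0]
smoothed = median_smooth(raw, kernel_size=3)
print(smoothed)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

In this toy sequence the isolated positive at index 2 is removed and the single-segment gap at index 7 is filled, so the informative region becomes one contiguous run. Larger kernels suppress longer spurious runs but can also erase short segments that are genuinely informative.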