The document presents a system for detecting complex events in unconstrained videos using pre-trained deep CNN models. Frame-level features extracted from several CNNs are pooled into video-level descriptors, which are then classified with SVMs. Evaluation on a large video corpus showed that fusing multiple CNNs outperformed any individual CNN, and that no single CNN worked best across all events, since some events are object-driven while others are scene-centric. The best performance came from learning event-dependent fusion weights for the different CNNs.
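The pipeline described above can be sketched in a few lines. This is a simplified illustration under assumptions not spelled out in the summary: it assumes average pooling is used to turn frame features into a video descriptor, and models late fusion as an event-dependent weighted sum of per-CNN classifier scores. The function names (`video_descriptor`, `fuse_scores`) are hypothetical, not from the paper.

```python
import numpy as np

def video_descriptor(frame_features):
    """Pool frame-level CNN features (frames x dims) into one
    video-level descriptor by averaging over frames.
    (Assumed pooling choice; the paper may use another scheme.)"""
    return np.mean(np.asarray(frame_features, dtype=float), axis=0)

def fuse_scores(scores_per_cnn, event_weights):
    """Late fusion: combine per-CNN classifier scores for one event
    using event-dependent weights (normalized to sum to 1)."""
    w = np.asarray(event_weights, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(scores_per_cnn, dtype=float)))

# Example: two frames with 2-D features, then fusing two CNNs' scores
# for a single event with weights favoring the first (e.g. object) CNN.
desc = video_descriptor([[1.0, 3.0], [3.0, 5.0]])      # -> [2.0, 4.0]
fused = fuse_scores([0.8, 0.2], [3.0, 1.0])            # -> 0.65
```

In practice the event-dependent weights would be learned on a validation set (e.g. by grid search per event), and the per-CNN scores would come from SVMs trained on each CNN's video descriptors.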