AI Video & Voice Analysis Webapp

Project Description

I need a full-stack, browser-based AI application that ingests video either live through the user’s webcam or as an uploaded file. The core of the job is to combine computer-vision and speech-analysis pipelines so the platform can produce an interactive report on a speaker’s performance.

Video requirements
• Accept MP4, AVI or MOV uploads, and stream directly from the webcam in real-time mode.
• Track eye movement and count every time the gaze drifts away from the camera, flagging each occurrence with its timestamp.
• Recognise whether the person appears to be reading (e.g. head-eye patterns that follow text), then highlight the corresponding timestamps.
• Classify gestures and hand movements, tying each event back to the frame sequence.
• Infer an overall confidence score from facial cues.
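To make the gaze requirement concrete, here is a minimal sketch of how "gaze drifts away" events could be counted and timestamped. It assumes an upstream landmark model (for example one built on MediaPipe) already yields one gaze offset per frame; the `offsets` input, the `threshold` and `min_frames` values, and the names `GazeEvent` and `find_gaze_away_events` are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class GazeEvent:
    start_s: float  # time the gaze left the camera
    end_s: float    # time the gaze returned (or the clip ended)


def find_gaze_away_events(
    offsets: Sequence[float],
    fps: float,
    threshold: float = 0.25,  # illustrative value, not a tuned one
    min_frames: int = 3,      # ignore very short flickers (assumption)
) -> List[GazeEvent]:
    """Group consecutive off-camera frames into timestamped events.

    offsets: hypothetical per-frame gaze offset, 0.0 = looking at the camera.
    """
    events: List[GazeEvent] = []
    run_start: Optional[int] = None
    for i, off in enumerate(offsets):
        away = abs(off) > threshold
        if away and run_start is None:
            run_start = i  # a new off-camera run begins
        elif not away and run_start is not None:
            if i - run_start >= min_frames:
                events.append(GazeEvent(run_start / fps, i / fps))
            run_start = None
    # Close a run that is still open when the clip ends.
    if run_start is not None and len(offsets) - run_start >= min_frames:
        events.append(GazeEvent(run_start / fps, len(offsets) / fps))
    return events


if __name__ == "__main__":
    sample = [0, 0, 0.5, 0.5, 0.5, 0, 0.4, 0, 0.6, 0.6, 0.6, 0.6]
    for event in find_gaze_away_events(sample, fps=10):
        print(event)
```

The count for the report is simply `len(events)`, and the same run-grouping idea could be reused for the reading and gesture timestamps if those detectors also emit a per-frame signal.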

Voice requirements
• Auto-extract the audio track, run speech-to-text, and gauge English proficiency (grammar, vocabulary range, fluency).
• Calculate confidence indicators from tone, volume stability and pace.
• Measure pauses and “thinking time” between sentences and insert them into the transcript with millisecond accuracy.
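As an illustration of the pause and pace requirements, the sketch below assumes the speech-to-text step returns word-level timestamps in milliseconds as `(text, start_ms, end_ms)` tuples. That tuple shape, the 300 ms cut-off and the function names are assumptions made for this example; the precision actually achievable would depend on the timestamp resolution of whichever speech-to-text engine is chosen.

```python
from typing import List, Optional, Tuple

# Assumed shape of one recognised word: (text, start_ms, end_ms).
Word = Tuple[str, int, int]


def annotate_pauses(words: List[Word], min_pause_ms: int = 300) -> str:
    """Return the transcript with a pause marker wherever the silence
    between two consecutive words is at least min_pause_ms."""
    parts: List[str] = []
    prev_end: Optional[int] = None
    for text, start_ms, end_ms in words:
        if prev_end is not None:
            gap = start_ms - prev_end
            if gap >= min_pause_ms:
                parts.append(f"[pause {gap} ms]")
        parts.append(text)
        prev_end = end_ms
    return " ".join(parts)


def words_per_minute(words: List[Word]) -> float:
    """Rough pace estimate: word count over the spoken time span."""
    if not words:
        return 0.0
    span_ms = words[-1][2] - words[0][1]
    return len(words) * 60_000 / span_ms if span_ms > 0 else 0.0


if __name__ == "__main__":
    sample = [("Hello", 0, 400), ("everyone.", 450, 900), ("Today", 1750, 2100)]
    print(annotate_pauses(sample))
    print(round(words_per_minute(sample), 1))
```

Keeping the pause markers inline means the same annotated transcript can feed both the on-screen dashboard and the CSV export.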

Both real-time feedback (small overlay suggestions) and post-video analytics (downloadable PDF/CSV plus on-screen dashboard) are needed. I’m happy for you to build with tools such as OpenCV, MediaPipe, TensorFlow, PyTorch, spaCy or similar—use what you are fastest with as long as the models run efficiently in a web environment (GPU acceleration via CUDA or WebGL is a plus).

Deliverables
1. Source-controlled codebase ready to deploy on a standard cloud stack (Docker image or Heroku-style Procfile).
2. Front-end UI (React, Vue or vanilla JS) that lets users toggle between real-time and upload modes.
3. Modular inference services for vision and audio that can be retrained or swapped if I add new metrics later.
4. Clear README and API documentation.
5. Short demo video plus test dataset proving accuracy on the listed metrics.
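One possible way to meet the "modular, swappable" requirement in deliverable 3 is a small metric registry, sketched below. This is only an illustration of the design idea, not a prescribed architecture: the `register_metric` decorator, the `features` dictionary and the two example metrics are hypothetical names invented for this sketch.

```python
from typing import Any, Callable, Dict

# A metric takes the extracted features and returns one number.
MetricFn = Callable[[Dict[str, Any]], float]

_REGISTRY: Dict[str, MetricFn] = {}


def register_metric(name: str) -> Callable[[MetricFn], MetricFn]:
    """Decorator that adds a metric under a unique name."""
    def decorator(fn: MetricFn) -> MetricFn:
        if name in _REGISTRY:
            raise ValueError(f"metric {name!r} is already registered")
        _REGISTRY[name] = fn
        return fn
    return decorator


@register_metric("gaze_away_count")
def gaze_away_count(features: Dict[str, Any]) -> float:
    # Hypothetical feature key produced by the vision service.
    return float(len(features.get("gaze_events", [])))


@register_metric("pause_total_ms")
def pause_total_ms(features: Dict[str, Any]) -> float:
    # Hypothetical feature key produced by the audio service.
    return float(sum(features.get("pauses_ms", [])))


def run_metrics(features: Dict[str, Any]) -> Dict[str, float]:
    """Run every registered metric against one set of features."""
    return {name: fn(features) for name, fn in _REGISTRY.items()}


if __name__ == "__main__":
    print(run_metrics({"gaze_events": [1, 2], "pauses_ms": [850, 400]}))
```

With this shape, a later metric such as filler-word counting would be one more decorated function, with no change to the code that builds the report.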

Please outline your proposed tech stack, any pretrained models you plan to fine-tune, and the estimated timeline for an MVP. Future enhancements (emotion detection, filler-word counting, multilingual support) will be commissioned once the foundation is stable, so design with extensibility in mind.
