I run a regional event directory that tracks more than 400 venue URLs. I currently rely on Manus AI, which captures only about 60-70% of the events those venues publish. I want a purpose-built scraper that raises coverage to close to 100% while keeping duplicates out of the database.
The sources you will have to handle are varied: Google Calendar feeds, PDFs posted on the venues' sites, Facebook event widgets, and a mix of other custom formats. Some pages are JavaScript-heavy, others serve flat files, and a few publish their events only in weekly PDF flyers, so the solution will likely combine headless browsing, HTML parsing, PDF text extraction or OCR, and selective API use.
The tool must plug directly into my existing site workflow (a Laravel backend with a MySQL database). I already have endpoints for create/update actions; the scraper just needs to push normalized JSON to them and include a simple hash or fingerprint system so the same event is never imported twice.
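To illustrate what I mean by a fingerprint, here is a rough sketch rather than a required design. The field names (`starts_at`, `venue_name`, `fingerprint`) and both function names are placeholders I made up for this example; the real payload has to match whatever my Laravel endpoints expect. It assumes the start time arrives as an ISO 8601 string and hashes only the calendar date, so two shows with the same title at the same venue on the same day would collide. That trade-off is open for discussion.

```python
import hashlib
import re
import unicodedata


def _norm(text: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def event_fingerprint(title: str, start_iso: str, venue: str) -> str:
    """Stable SHA-256 hash over normalized title, date, and venue."""
    # Assumption: start_iso is ISO 8601, so its first 10 chars are YYYY-MM-DD.
    key = "|".join([_norm(title), start_iso[:10], _norm(venue)])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def build_payload(title: str, start_iso: str, venue: str, source_url: str) -> dict:
    """Assemble the normalized JSON body (hypothetical field names)."""
    return {
        "title": title.strip(),
        "starts_at": start_iso,
        "venue_name": venue.strip(),
        "source_url": source_url,
        "fingerprint": event_fingerprint(title, start_iso, venue),
    }
```

With this kind of normalization, "Jazz Night!" at "Café Blue" and "JAZZ night" at "Cafe Blue" on the same date produce the same fingerprint, which is the behaviour I am after when the same event shows up in both a calendar feed and a PDF flyer.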
Deliverables
• A modular scraper (Python preferred, but I’m open) with separate handlers for Google Calendar, PDFs (OCR where needed), Facebook events, and a catch-all HTML parser for custom formats.
• A lightweight deduplication module that compares new events against the existing table by title, date, venue, and hash.
• Deployment script or Dockerfile so I can spin the service up on an Ubuntu VPS.
• Setup notes and commented code so I can extend it to new venues later on.
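As a rough picture of the modular layout I have in mind, the sketch below routes each URL to one of four handlers. Everything in it is illustrative: the handler names, the `classify` rules, and the registry are my own placeholders, and the handler bodies are empty stubs. Real classification will probably need per-venue overrides, since a venue page may embed a calendar or link to a PDF rather than expose it directly in the URL.

```python
from typing import Callable, Dict, List
from urllib.parse import urlparse

Handler = Callable[[str], List[dict]]
HANDLERS: Dict[str, Handler] = {}


def register(kind: str) -> Callable[[Handler], Handler]:
    """Decorator that adds a handler to the registry under `kind`."""
    def wrap(fn: Handler) -> Handler:
        HANDLERS[kind] = fn
        return fn
    return wrap


def classify(url: str) -> str:
    """Guess the source type from the URL alone (simplistic heuristic)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    if host == "calendar.google.com" or path.endswith(".ics"):
        return "gcal"
    if path.endswith(".pdf"):
        return "pdf"
    if host == "facebook.com" or host.endswith(".facebook.com"):
        return "facebook"
    return "html"  # catch-all for custom formats


@register("gcal")
def handle_gcal(url: str) -> List[dict]:
    return []  # stub: would parse the calendar feed


@register("pdf")
def handle_pdf(url: str) -> List[dict]:
    return []  # stub: would extract text, falling back to OCR


@register("facebook")
def handle_facebook(url: str) -> List[dict]:
    return []  # stub: would read the event widget


@register("html")
def handle_html(url: str) -> List[dict]:
    return []  # stub: would render and parse the page


def scrape(url: str) -> List[dict]:
    """Dispatch a URL to the handler chosen by classify()."""
    return HANDLERS[classify(url)](url)
```

The point of the registry is that adding a new venue format later should mean writing one new handler and registering it, without touching the dispatch code.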
Acceptance criteria
• Test run across the current 400 URLs shows ≥95% capture of unique events.
• No duplicate entries created during the same run or across consecutive runs.
• All fields (title, date/time, venue name, source URL) populate correctly in my database.
• Runtime per full scan stays under two hours on a 2-vCPU VPS.
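For the no-duplicates criterion, this is roughly the behaviour I expect, shown as a minimal sketch. It assumes every event dict already carries a precomputed `fingerprint` string and that the set of known fingerprints is loaded from my events table before each run; `filter_new_events` is a hypothetical name, and the in-memory set only stands in for the real lookup.

```python
from typing import Iterable, List, Set


def filter_new_events(events: Iterable[dict], seen: Set[str]) -> List[dict]:
    """Return only events whose fingerprint is not yet in `seen`.

    `seen` is updated in place, so duplicates are dropped both within
    one run and across runs that share (or reload) the same set.
    """
    fresh = []
    for event in events:
        fp = event["fingerprint"]
        if fp in seen:
            continue  # already imported, or repeated earlier in this batch
        seen.add(fp)
        fresh.append(event)
    return fresh


# Illustration with made-up fingerprints.
seen: Set[str] = set()
run_1 = filter_new_events(
    [{"fingerprint": "aaa"}, {"fingerprint": "aaa"}, {"fingerprint": "bbb"}],
    seen,
)
run_2 = filter_new_events(
    [{"fingerprint": "aaa"}, {"fingerprint": "ccc"}],
    seen,
)
```

Here `run_1` keeps one copy each of "aaa" and "bbb", and `run_2` keeps only "ccc". A unique index on the fingerprint column in MySQL would be a sensible second line of defence, so a duplicate insert fails at the database even if the scraper-side check is bypassed.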
If you have deep experience with Python scraping frameworks (Scrapy, Selenium, Playwright), PDF parsing libraries, and API integration, I’d love to see an outline of how you’d tackle the mix of formats and keep the data clean.