I’m building a fully automated publishing pipeline that turns raw manuscript files into polished, publication-ready books. The system must ingest HTML, Markdown, and plain TXT, detect any structural metadata already present, then export perfectly styled EPUB by default, with optional DOCX and press-quality PDF versions generated in the same run.
Advanced styling is essential: the converter should apply layout templates that control typography, front-matter placement, page geometry, and embedded media rules. I need the flexibility to swap or extend these templates later without rewriting the core pipeline, so clean separation between content conversion and styling logic is critical.
I’m open to the tooling you prefer—Pandoc, Calibre, PrinceXML, custom Python or Node transformers, containerised micro-services, or a blend—as long as the finished workflow scales easily on a build server and can be triggered via CLI or REST.
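One way to keep styling separate from conversion, sketched here under the assumption that Pandoc is the converter: a template is only a bundle of file paths, and a small function turns it into a Pandoc command line. The Template class, its field names, and pandoc_command are hypothetical; the flags used (--standalone, --css, --reference-doc, --pdf-engine) are standard Pandoc options.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class Template:
    """Hypothetical styling template: file paths only, no conversion logic."""
    name: str
    epub_css: Optional[Path] = None        # stylesheet embedded in the EPUB
    docx_reference: Optional[Path] = None  # styled reference document for DOCX
    pdf_engine: Optional[str] = None       # e.g. "prince" or "weasyprint"
    pdf_css: Optional[Path] = None         # print stylesheet for HTML-based PDF engines


def pandoc_command(source: Path, target: Path, template: Template) -> List[str]:
    """Build the argument list for one conversion; the caller runs it."""
    cmd = ["pandoc", str(source), "-o", str(target), "--standalone"]
    suffix = target.suffix.lower()
    if suffix == ".epub":
        if template.epub_css:
            cmd.append(f"--css={template.epub_css}")
    elif suffix == ".docx":
        if template.docx_reference:
            cmd.append(f"--reference-doc={template.docx_reference}")
    elif suffix == ".pdf":
        if template.pdf_engine:
            cmd.append(f"--pdf-engine={template.pdf_engine}")
        if template.pdf_css:
            cmd.append(f"--css={template.pdf_css}")
    else:
        raise ValueError(f"unsupported output format: {suffix}")
    return cmd
```

The resulting list could be passed to subprocess.run(cmd, check=True). Swapping a template then means pointing at different files, with no change to the pipeline code.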
Deliverables
• Source code and build scripts for the complete conversion pipeline
• At least two example templates demonstrating advanced styling features
• Documentation covering installation, configuration, and how to add new formats or templates
• A short test suite proving that all three input types successfully produce valid EPUB, DOCX, and PDF outputs
If this sounds like your kind of challenge, let’s talk timelines and the best technological path forward.
PROJECT TITLE
AI-Based eBook Creation & Conversion System (OCR + EPUB + AI Processing)
---
1. PROJECT OVERVIEW
We are developing a scalable automated publishing system that converts multiple input formats into publication-ready EPUB books and optionally print-ready formats (DOCX/PDF).
The system will:
Process files in batch
Maintain formatting (including tables, figures, equations)
Use AI for content generation and rewriting
Automatically generate book structure (Title Page, Preface, etc.)
---
2. PROJECT OBJECTIVE
To build a modular, scalable, and configurable system that:
1. Converts:
Scanned files (OCR)
PDF
HTML
Word (DOCX)
→ into EPUB
2. Converts:
EPUB → Word/PDF (print-ready)
3. Automatically generates:
Title Page
Copyright Page
Preface
Acknowledgement
Table of Contents (for print output)
---
3. INPUT TYPES
A. Scanned Files
OCR required
Output must be editable and structured
Formatting must be preserved as much as possible
---
B. PDF Files
Detect:
Scanned vs digital
Maintain:
Headings
Tables
Layout
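The scanned-versus-digital check could be approximated with a simple heuristic: a PDF whose pages yield almost no extractable text is probably scanned and should be routed to OCR. The sketch below assumes the per-page text has already been extracted by some PDF library; the function name and both thresholds are illustrative assumptions, not tuned values.

```python
from typing import Optional, Sequence


def looks_scanned(
    page_texts: Sequence[Optional[str]],
    min_chars_per_page: int = 20,   # assumed threshold, tune on real files
    max_empty_ratio: float = 0.5,   # assumed threshold, tune on real files
) -> bool:
    """Return True when most pages carry little or no extractable text."""
    if not page_texts:
        raise ValueError("PDF has no pages")
    empty_pages = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) > max_empty_ratio
```

With pypdf, for example, the input could come from [page.extract_text() for page in PdfReader(path).pages]. Mixed documents (digital text with scanned inserts) would need a per-page decision rather than this whole-file one.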
---
C. HTML Files
Direct conversion to EPUB
Preserve formatting
---
D. Word Files (DOCX)
Convert to EPUB with formatting intact
---
E. EPUB Files
Convert to Word (print-ready)
Generate TOC and optional Index
---
4. CORE FEATURES (MVP SCOPE)
4.1 Batch Processing
Upload multiple files
Process via queue system
---
4.2 Excel-Based Metadata Input
System must read Excel file
Must support:
Dynamic column mapping (NO hardcoding)
Missing field handling
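A minimal sketch of dynamic column mapping with missing-field handling, assuming each spreadsheet row has already been read into a dictionary keyed by header text. The mapping (internal field name to accepted header aliases) and the defaults would live in a config file; map_row and its parameters are hypothetical names.

```python
from typing import Any, Dict, List, Mapping, Optional, Sequence, Tuple


def _norm(header: Any) -> str:
    """Compare headers ignoring case and surrounding whitespace."""
    return str(header).strip().casefold()


def map_row(
    row: Mapping[str, Any],
    column_map: Mapping[str, Sequence[str]],
    defaults: Optional[Mapping[str, Any]] = None,
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (record, missing_fields) for one spreadsheet row."""
    defaults = defaults or {}
    normalised = {_norm(key): value for key, value in row.items()}
    record: Dict[str, Any] = {}
    missing: List[str] = []
    for field_name, aliases in column_map.items():
        value = None
        for alias in aliases:
            candidate = normalised.get(_norm(alias))
            # Treat blank cells the same as absent columns.
            if candidate is not None and str(candidate).strip() != "":
                value = candidate
                break
        if value is None:
            if field_name in defaults:
                value = defaults[field_name]
            else:
                missing.append(field_name)
                continue
        record[field_name] = value
    return record, missing
```

Rows could be produced with openpyxl, for example by zipping the header row with each row from iter_rows(values_only=True). Renaming or adding an Excel column then only requires a config change.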
---
4.3 AI-Generated Content
System must generate:
Book Title (based on article titles)
Preface
Acknowledgement
---
4.4 AI Rewriting Feature
Expand or reduce content:
By ±10%, ±25%, ±40%, ±60%, ±80%, or ±100% of the original length
Must:
Preserve structure
Avoid plagiarism
Not modify equations or table layout
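The rewrite percentages can be turned into a target word count that is passed to the AI prompt and checked afterwards. This is a sketch assuming the percentage applies to the word count of the original text; the function name and the expand flag are hypothetical.

```python
ALLOWED_PERCENTS = (10, 25, 40, 60, 80, 100)


def target_word_count(current_words: int, percent: int, expand: bool) -> int:
    """Word count to aim for after expanding or reducing by `percent`."""
    if percent not in ALLOWED_PERCENTS:
        raise ValueError(f"percent must be one of {ALLOWED_PERCENTS}")
    if current_words < 0:
        raise ValueError("current_words must not be negative")
    delta = current_words * percent // 100  # integer arithmetic avoids float drift
    # Assumption: a 100% reduction yields 0 words, so that case
    # probably needs its own rule in the final spec.
    return current_words + delta if expand else current_words - delta
```

For example, target_word_count(1000, 25, expand=True) gives 1250 and target_word_count(1000, 40, expand=False) gives 600.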
---
4.5 Table Formatting (MANDATORY)
All tables must:
Have grid borders
Use hairline thickness (~0.25 pt)
Must work in:
EPUB
Word
PDF
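For EPUB and HTML-based PDF output, the table rule could be generated as a stylesheet fragment so that the thickness stays configurable. The sketch below is an assumption about how that might look; table_css and its parameters are hypothetical.

```python
def table_css(border_pt: float = 0.25, color: str = "#000") -> str:
    """Return CSS that draws a full grid with hairline borders on every cell."""
    return (
        "table { border-collapse: collapse; }\n"
        f"th, td {{ border: {border_pt}pt solid {color}; padding: 2pt 4pt; }}\n"
    )
```

Some reading systems may round very thin borders up to one device pixel, so the result should be checked on the target readers. Word output is not driven by CSS; with Pandoc the equivalent borders would likely be defined in the table style of the DOCX reference document.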
---
4.6 Book Structure Generation
Final EPUB must include:
1. Title Page (AI-generated title)
2. Copyright Page (template-based)
3. Preface (AI-generated)
4. Acknowledgement (AI-generated)
5. Table of Contents
6. Chapters (articles)
---
5. IMPORTANT CONTENT RULES
Author Names
Only names allowed
NO:
Designations
Institutions
Affiliations
---
Copyright Page
Template will be provided
System must replace variables:
ISBN
eISBN
Year
Publisher Name
Address
Email
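Variable replacement on the copyright page can be sketched with Python's standard string.Template. The template text below is a placeholder, since the real template will be provided; the variable names and render_copyright are assumptions.

```python
from string import Template
from typing import Mapping

# Placeholder wording only; the real template is supplied by the client.
COPYRIGHT_TEMPLATE = Template(
    "Copyright © $year $publisher_name\n"
    "ISBN: $isbn\n"
    "eISBN: $eisbn\n"
    "$address\n"
    "$email\n"
)


def render_copyright(template: Template, values: Mapping[str, str]) -> str:
    """Fill every variable; fail loudly if one is missing."""
    try:
        return template.substitute(values)
    except KeyError as exc:
        raise ValueError(f"missing copyright field: {exc.args[0]}") from exc
```

Failing on a missing variable, rather than leaving a raw $isbn in the output, keeps an incomplete copyright page from reaching a published book.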
---
Title Page
Title → AI-generated
Editor/Author Name → provided via Excel
---
6. TECHNICAL REQUIREMENTS
Preferred Stack
Backend: Python (FastAPI preferred)
OCR: Tesseract
Conversion:
Pandoc
Calibre
---
Architecture (MANDATORY)
The system MUST be:
1. Modular
Separate components:
OCR
Conversion
AI processing
Output generation
---
2. Config-Driven
No hardcoding of:
Excel columns
Templates
Prompts
---
3. Scalable
Must support:
Batch processing
Future API integration
Multi-user expansion
---
4. Replaceable Components
OCR engine should be replaceable
AI provider should be replaceable
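Replaceable components are commonly implemented as a small interface plus a registry keyed by a config value. The sketch below shows the idea for OCR; the same shape would apply to the AI provider. All names (OcrEngine, register_ocr, create_ocr, FakeOcr) are hypothetical.

```python
from typing import Callable, Dict, Protocol


class OcrEngine(Protocol):
    """Anything with this method can be used as an OCR engine."""

    def extract_text(self, image_bytes: bytes) -> str:
        ...


_OCR_ENGINES: Dict[str, Callable[[], OcrEngine]] = {}


def register_ocr(name: str, factory: Callable[[], OcrEngine]) -> None:
    """Make an engine selectable by the name used in the config file."""
    _OCR_ENGINES[name] = factory


def create_ocr(name: str) -> OcrEngine:
    """Instantiate the engine named in the config."""
    try:
        return _OCR_ENGINES[name]()
    except KeyError:
        raise ValueError(f"unknown OCR engine: {name!r}") from None


class FakeOcr:
    """Stand-in engine for tests: treats the input bytes as UTF-8 text."""

    def extract_text(self, image_bytes: bytes) -> str:
        return image_bytes.decode("utf-8", errors="replace")


register_ocr("fake", FakeOcr)
```

A Tesseract adapter would be one more class with the same method, for instance wrapping pytesseract.image_to_string, and switching engines would then be a one-line config change.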
---
7. UI REQUIREMENTS (BASIC)
Simple interface:
Upload files
Upload Excel
Select options:
Rewrite %
Generate content (yes/no)
Download output
---
8. OUTPUT REQUIREMENTS
EPUB
Clean structure
Compatible with major readers
---
Word (DOCX)
Print-ready
Includes:
TOC
Proper formatting
---
9. ERROR HANDLING
System must:
Skip problematic files (log errors)
Continue batch processing
Provide error report
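The skip, continue, and report behaviour can be sketched as a loop that isolates each file's failure. The convert callable stands in for the real per-file pipeline; BatchReport and run_batch are hypothetical names.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("pipeline")


@dataclass
class BatchReport:
    succeeded: List[str] = field(default_factory=list)
    failed: List[Tuple[str, str]] = field(default_factory=list)  # (path, reason)


def run_batch(paths: Iterable[str], convert: Callable[[str], None]) -> BatchReport:
    """Convert every file; one failure never stops the rest of the batch."""
    report = BatchReport()
    for path in paths:
        try:
            convert(path)
        except Exception as exc:  # deliberately broad: any failure skips the file
            logger.exception("conversion failed for %s", path)
            report.failed.append((path, f"{type(exc).__name__}: {exc}"))
        else:
            report.succeeded.append(path)
    return report
```

The report object can be serialised to JSON or CSV for the downloadable error report. In a queue-based deployment the same try/except would sit inside each worker task.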
---
10. PERFORMANCE REQUIREMENT
Must handle:
Minimum 50–100 files per batch
Should not crash on large files
---
11. DELIVERABLES
Developer must provide:
1. Working application
2. Source code (fully commented)
3. Documentation:
Setup instructions
Config guide
4. Sample outputs
---
12. MANDATORY DEVELOPMENT CONDITIONS (VERY IMPORTANT)
The developer MUST:
NOT hardcode:
Excel structure
Templates
Prompts
Build system so that:
Fields can be changed without code edits
Prompts can be modified easily
Templates can be replaced
---
Code Requirements:
Clean and readable
Modular
Scalable for future growth
---
13. PROJECT PHASES
Phase 1 (MVP)
OCR + Conversion + Basic AI
EPUB output
---
Phase 2 (Later)
Advanced indexing
UI improvements
Multi-language support
---
14. TIMELINE
MVP: 4–6 weeks
---
15. BUDGET
Open to proposals (cost-effective preferred)
Milestone-based payment
---
16. APPLICATION REQUIREMENTS
Please include:
1. Relevant experience (OCR / EPUB / document processing)
2. Tools you will use
3. Timeline
4. Cost breakdown
5. Sample work (MANDATORY)
---
17. SELECTION PROCESS
Shortlisting
Paid test task
Final selection
---
18. IMPORTANT NOTE
We are looking for a long-term developer.
This project will expand significantly.
---