Automated Multi-Format Book Converter

Project Description

I’m building a fully automated publishing pipeline that turns raw manuscript files into polished, publication-ready books. The system must ingest HTML, Markdown, and plain TXT, detect any structural metadata already present, then export perfectly styled EPUB by default, with optional DOCX and press-quality PDF versions generated in the same run.

Advanced styling is essential: the converter should apply layout templates that control typography, front-matter placement, page geometry, and embedded media rules. I need the flexibility to swap or extend these templates later without rewriting the core pipeline, so clean separation between content conversion and styling logic is critical.
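To illustrate the kind of separation meant here, a template can be plain data that the core translates into converter arguments. The sketch below assumes Pandoc is the converter; `LayoutTemplate`, `build_pandoc_command`, and all file paths are hypothetical names used only for illustration. It builds the command line and does not run it.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class LayoutTemplate:
    """Styling assets for one template (hypothetical structure)."""
    name: str
    epub_css: Optional[Path] = None        # stylesheet applied to EPUB output
    docx_reference: Optional[Path] = None  # styled reference document for DOCX output
    pdf_engine: Optional[str] = None       # e.g. "xelatex"; assumed to be installed
    toc: bool = True


def build_pandoc_command(source: Path, target: Path, template: LayoutTemplate) -> List[str]:
    """Translate a template into Pandoc arguments; the core holds no styling logic."""
    cmd = ["pandoc", str(source), "-o", str(target)]
    if template.toc:
        cmd.append("--toc")
    suffix = target.suffix.lower()
    if suffix == ".epub" and template.epub_css:
        cmd += ["--css", str(template.epub_css)]
    if suffix == ".docx" and template.docx_reference:
        cmd += ["--reference-doc", str(template.docx_reference)]
    if suffix == ".pdf" and template.pdf_engine:
        cmd.append(f"--pdf-engine={template.pdf_engine}")
    return cmd


if __name__ == "__main__":
    classic = LayoutTemplate("classic", epub_css=Path("classic.css"))
    print(build_pandoc_command(Path("book.md"), Path("book.epub"), classic))
```

Swapping a template then means passing a different `LayoutTemplate`; the conversion code stays unchanged.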

I’m open to the tooling you prefer—Pandoc, Calibre, PrinceXML, custom Python or Node transformers, containerised micro-services, or a blend—as long as the finished workflow scales easily on a build server and can be triggered via CLI or REST.

Deliverables
• Source code and build scripts for the complete conversion pipeline
• At least two example templates demonstrating advanced styling features
• Documentation covering installation, configuration, and how to add new formats or templates
• A short test suite proving that all three input types successfully produce valid EPUB, DOCX, and PDF outputs

If this sounds like your kind of challenge, let’s talk timelines and the best technological path forward.

PROJECT TITLE

AI-Based eBook Creation & Conversion System (OCR + EPUB + AI Processing)

---

1. PROJECT OVERVIEW

We are developing a scalable automated publishing system that converts multiple input formats into publication-ready EPUB books and optionally print-ready formats (DOCX/PDF).

The system will:

Process files in batch
Maintain formatting (including tables, figures, equations)
Use AI for content generation and rewriting
Automatically generate book structure (Title Page, Preface, etc.)

---

2. PROJECT OBJECTIVE

To build a modular, scalable, and configurable system that:

1. Converts:

Scanned files (OCR)
PDF
HTML
Word (DOCX)
→ into EPUB

2. Converts:

EPUB → Word/PDF (print-ready)

3. Automatically generates:

Title Page
Copyright Page
Preface
Acknowledgement
Table of Contents (for print output)

---

3. INPUT TYPES

A. Scanned Files

OCR required
Output must be editable and structured
Formatting must be preserved as much as possible

---

B. PDF Files

Detect:

Scanned vs digital

Maintain:

Headings
Tables
Layout
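One possible way to tell scanned from digital PDFs is to check how many pages carry extractable text. The sketch below assumes the per-page text has already been extracted by a PDF library (for example pypdf's `page.extract_text()`); the function name and both thresholds are illustrative assumptions that would need tuning on real files.

```python
from typing import Sequence


def classify_pdf(
    page_texts: Sequence[str],
    min_chars_per_page: int = 50,   # assumed threshold, tune on real data
    min_text_ratio: float = 0.5,    # assumed share of pages that must contain text
) -> str:
    """Return "digital" if enough pages carry extractable text, else "scanned"."""
    if not page_texts:
        # No pages or no text layer at all: treat as scanned so OCR is attempted.
        return "scanned"
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    ratio = pages_with_text / len(page_texts)
    return "digital" if ratio >= min_text_ratio else "scanned"
```

Files classified as "scanned" would go through OCR; "digital" files can skip it.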

---

C. HTML Files

Direct conversion to EPUB
Preserve formatting

---

D. Word Files (DOCX)

Convert to EPUB with formatting intact

---

E. EPUB Files

Convert to Word (print-ready)
Generate TOC and optional Index

---

4. CORE FEATURES (MVP SCOPE)

4.1 Batch Processing

Upload multiple files
Process via queue system
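A minimal in-process sketch of queue-based batch processing is shown below, using only the Python standard library. The function name `run_batch` and the `handler` callback are hypothetical; a production setup might instead use a broker-backed task queue, which is not shown here.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def run_batch(paths: Iterable[str], handler: Callable[[str], str], workers: int = 4) -> Dict[str, str]:
    """Process every path through `handler` using a shared queue and worker threads."""
    jobs: "queue.Queue[str]" = queue.Queue()
    for path in paths:
        jobs.put(path)

    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                outcome = handler(path)
            except Exception as exc:  # one bad file must not stop the batch
                outcome = f"error: {exc}"
            with lock:
                results[path] = outcome
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```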

---

4.2 Excel-Based Metadata Input

System must read Excel file
Must support:

Dynamic column mapping (NO hardcoding)
Missing field handling
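Dynamic column mapping with missing-field handling could look like the sketch below. It assumes the header row and data rows have already been read from the Excel file (for example via openpyxl's `iter_rows(values_only=True)`), and that the mapping of internal field names to accepted column headers comes from a config file. `map_rows`, the field names, and the sample values are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping, Optional, Sequence


def map_rows(
    header: Sequence[object],
    rows: Iterable[Sequence[object]],
    column_map: Mapping[str, Sequence[str]],
    defaults: Optional[Mapping[str, object]] = None,
) -> List[Dict[str, object]]:
    """Map spreadsheet rows to internal fields using a config-supplied column map."""
    defaults = defaults or {}
    # Normalise header names so "  AUTHOR " and "Author" match.
    index = {str(name).strip().lower(): i for i, name in enumerate(header) if name is not None}

    positions: Dict[str, Optional[int]] = {}
    for field, aliases in column_map.items():
        positions[field] = next(
            (index[a.strip().lower()] for a in aliases if a.strip().lower() in index),
            None,  # column absent from this workbook
        )

    records: List[Dict[str, object]] = []
    for row in rows:
        record: Dict[str, object] = {}
        for field, pos in positions.items():
            value = row[pos] if pos is not None and pos < len(row) else None
            if value is None or str(value).strip() == "":
                value = defaults.get(field)  # missing field: fall back to default or None
            record[field] = value
        records.append(record)
    return records
```

Changing the spreadsheet layout then only requires editing `column_map` in the config, not the code.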

---

4.3 AI-Generated Content

System must generate:

Book Title (based on article titles)
Preface
Acknowledgement

---

4.4 AI Rewriting Feature

Expand or reduce content:

±10%, 25%, 40%, 60%, 80%, 100%

Must:

Preserve structure
Avoid plagiarism
Not modify the layout of equations or tables
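One way to keep equations and tables untouched during rewriting is to swap them for placeholders before the text goes to the AI provider and restore them afterwards. The sketch below assumes tables appear as HTML `<table>` elements and equations as `$$...$$` blocks; those formats, the placeholder syntax, and all function names are assumptions for illustration.

```python
import re
from typing import Dict, Tuple

# Assumed markup for protected content: HTML tables and $$-delimited equations.
PROTECTED = re.compile(r"<table\b.*?</table>|\$\$.*?\$\$", re.DOTALL | re.IGNORECASE)


def protect_blocks(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace tables and equations with placeholders; return masked text and the stash."""
    stash: Dict[str, str] = {}

    def _swap(match: "re.Match[str]") -> str:
        key = f"[[BLOCK_{len(stash)}]]"
        stash[key] = match.group(0)
        return key

    return PROTECTED.sub(_swap, text), stash


def restore_blocks(text: str, stash: Dict[str, str]) -> str:
    """Put the original tables and equations back in place of their placeholders."""
    for key, original in stash.items():
        text = text.replace(key, original)
    return text


def target_word_count(current_words: int, change_percent: int) -> int:
    """Word budget after expanding (+) or reducing (-) by the given percentage."""
    return max(0, round(current_words * (1 + change_percent / 100)))
```

The rewriting prompt would receive the masked text plus the target word count, and the result would be checked to confirm every placeholder survived before restoring.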

---

4.5 Table Formatting (MANDATORY)

All tables must:

Have grid borders
Use hairline thickness (~0.25 pt)

Must work in:

EPUB
Word
PDF
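For EPUB (and HTML-based PDF rendering), the table rule could be expressed as a small generated stylesheet, sketched below. The function name is hypothetical, and some reading systems may round a 0.25 pt border up to one device pixel. For Word output the equivalent border would need to be set in the DOCX table style, which this sketch does not cover.

```python
def table_css(border_pt: float = 0.25, color: str = "#000000") -> str:
    """Return CSS that gives every table a full grid with hairline borders."""
    return (
        "table { border-collapse: collapse; }\n"
        f"table, th, td {{ border: {border_pt}pt solid {color}; }}\n"
    )


if __name__ == "__main__":
    print(table_css())
```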

---

4.6 Book Structure Generation

Final EPUB must include:

1. Title Page (AI-generated title)
2. Copyright Page (template-based)
3. Preface (AI-generated)
4. Acknowledgement (AI-generated)
5. Table of Contents
6. Chapters (articles)
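The required ordering can be enforced in one place, as in the sketch below. `Section`, `assemble_book`, and the `kind` labels are hypothetical names; the order itself follows the list above.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Order taken from the required EPUB structure; chapters follow the front matter.
FRONT_MATTER_ORDER = ("title_page", "copyright_page", "preface", "acknowledgement", "toc")


@dataclass
class Section:
    kind: str
    title: str
    body: str = ""


def assemble_book(front_matter: Sequence[Section], chapters: Sequence[Section]) -> List[Section]:
    """Return all sections in the required order, failing early if one is missing."""
    by_kind = {section.kind: section for section in front_matter}
    missing = [kind for kind in FRONT_MATTER_ORDER if kind not in by_kind]
    if missing:
        raise ValueError(f"Missing front matter: {', '.join(missing)}")
    return [by_kind[kind] for kind in FRONT_MATTER_ORDER] + list(chapters)
```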

---

5. IMPORTANT CONTENT RULES

Author Names

Only names allowed
NO:

Designations
Institutions
Affiliations

---

Copyright Page

Template will be provided
System must replace variables:

ISBN
eISBN
Year
Publisher Name
Address
Email
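Variable replacement in the copyright page could be sketched with Python's `string.Template`. The `$name` placeholder syntax and the field keys are assumptions; the template that will actually be provided may use a different placeholder style.

```python
from string import Template
from typing import Mapping

# Keys assumed for the six variables listed above.
REQUIRED_FIELDS = ("isbn", "eisbn", "year", "publisher_name", "address", "email")


def render_copyright(template_text: str, values: Mapping[str, str]) -> str:
    """Fill the copyright template, refusing to render if any variable is missing."""
    missing = [key for key in REQUIRED_FIELDS if not values.get(key)]
    if missing:
        raise ValueError(f"Missing copyright fields: {', '.join(missing)}")
    return Template(template_text).substitute(values)
```

Failing on a missing value is a design choice here; the system could instead fall back to configured defaults.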

---

Title Page

Title → AI-generated
Editor/Author Name → provided via Excel

---

6. TECHNICAL REQUIREMENTS

Preferred Stack

Backend: Python (FastAPI preferred)
OCR: Tesseract
Conversion:

Pandoc
Calibre

---

Architecture (MANDATORY)

The system MUST be:

1. Modular

Separate components:

OCR
Conversion
AI processing
Output generation

---

2. Config-Driven

No hardcoding of:

Excel columns
Templates
Prompts

---

3. Scalable

Must support:

Batch processing
Future API integration
Multi-user expansion

---

4. Replaceable Components

OCR engine should be replaceable
AI provider should be replaceable
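Replaceability could be achieved by coding against small interfaces and selecting the implementation by name from config, as sketched below. The interface names, method names, and config keys are hypothetical, and the two stub classes exist only to make the example runnable; a Tesseract-backed engine or a real AI provider would be registered the same way.

```python
from typing import Callable, Dict, Protocol, Tuple


class OcrEngine(Protocol):
    def image_to_text(self, image_path: str) -> str: ...


class AiProvider(Protocol):
    def complete(self, prompt: str) -> str: ...


class StubOcr:
    """Placeholder engine for illustration; performs no real OCR."""
    def image_to_text(self, image_path: str) -> str:
        return f"[text recognised from {image_path}]"


class StubAi:
    """Placeholder provider for illustration; calls no real model."""
    def complete(self, prompt: str) -> str:
        return f"[draft for: {prompt}]"


# Registries map config values to factories; new engines are added here only.
OCR_ENGINES: Dict[str, Callable[[], OcrEngine]] = {"stub": StubOcr}
AI_PROVIDERS: Dict[str, Callable[[], AiProvider]] = {"stub": StubAi}


def build_components(config: Dict[str, str]) -> Tuple[OcrEngine, AiProvider]:
    """Instantiate the OCR engine and AI provider named in the config."""
    try:
        ocr = OCR_ENGINES[config["ocr_engine"]]()
        ai = AI_PROVIDERS[config["ai_provider"]]()
    except KeyError as exc:
        raise ValueError(f"Unknown or missing component: {exc}") from exc
    return ocr, ai
```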

---

7. UI REQUIREMENTS (BASIC)

Simple interface:

Upload files
Upload Excel
Select options:

Rewrite %
Generate content (yes/no)

Download output

---

8. OUTPUT REQUIREMENTS

EPUB

Clean structure
Compatible with major readers

---

Word (DOCX)

Print-ready
Includes:

TOC
Proper formatting

---

9. ERROR HANDLING

System must:

Skip problematic files (log errors)
Continue batch processing
Provide error report
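The skip, continue, and report behaviour could be sketched as follows. `BatchReport`, `process_batch`, and the `convert` callback are hypothetical names; the report format is an assumption.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("batch")


@dataclass
class BatchReport:
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # path -> error message

    def summary(self) -> str:
        lines = [f"{len(self.succeeded)} succeeded, {len(self.failed)} failed"]
        lines += [f"  {path}: {error}" for path, error in self.failed.items()]
        return "\n".join(lines)


def process_batch(paths: Iterable[str], convert: Callable[[str], None]) -> BatchReport:
    """Convert each file; log and skip failures so the rest of the batch continues."""
    report = BatchReport()
    for path in paths:
        try:
            convert(path)
        except Exception as exc:
            logger.error("Skipping %s: %s", path, exc)
            report.failed[path] = f"{type(exc).__name__}: {exc}"
        else:
            report.succeeded.append(path)
    return report
```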

---

10. PERFORMANCE REQUIREMENT

Must handle:

Minimum 50–100 files per batch
Large files without crashing

---

11. DELIVERABLES

Developer must provide:

1. Working application
2. Source code (fully commented)
3. Documentation:

Setup instructions
Config guide

4. Sample outputs

---

12. MANDATORY DEVELOPMENT CONDITIONS (VERY IMPORTANT)

The developer MUST:

NOT hardcode:

Excel structure
Templates
Prompts

Build system so that:

Fields can be changed without code edits
Prompts can be modified easily
Templates can be replaced

---

Code Requirements:

Clean and readable
Modular
Scalable for future growth

---

13. PROJECT PHASES

Phase 1 (MVP)

OCR + Conversion + Basic AI
EPUB output

---

Phase 2 (Later)

Advanced indexing
UI improvements
Multi-language support

---

14. TIMELINE

MVP: 4–6 weeks

---

15. BUDGET

* Open to proposals (cost-effective preferred)
* Milestone-based payment

---

16. APPLICATION REQUIREMENTS

Please include:

1. Relevant experience (OCR / EPUB / document processing)
2. Tools you will use
3. Timeline
4. Cost breakdown
5. Sample work (MANDATORY)

---

17. SELECTION PROCESS

Shortlisting
Paid test task
Final selection

---

18. IMPORTANT NOTE

We are looking for a long-term developer.
This project will expand significantly.

---

Skills

AI Development, Continuous Integration, Documentation, HTML, JavaScript, LaTeX, Node.js, OCR, PDF, Python

Budget

$75,000 - $112,500

Client Info

Demo USER

Project Summary

Status: Open
Category: Software Development
Project Type: Fixed Price
Budget: $75,000 - $112,500
Date Posted: 3 months ago
Project ID: 3400
Project Views: 427
