

PaperCast Implementation Plan

This plan outlines the steps to build PaperCast, an AI agent that converts research papers into podcast-style conversations using MCP, Gradio, and LLMs.

1. Infrastructure & Dependencies

  • Update requirements.txt
    • Add transformers, accelerate, bitsandbytes (for 4-bit LLM loading).
    • Add scipy (for audio processing).
    • Add beautifulsoup4 (for web parsing).
    • Add python-multipart (for API handling).
    • Ensure mcp and gradio versions are pinned.
  • Project Structure Setup
    • Create app.py (entry point).
  • Ensure __init__.py exists in all subdirectories.
    • Create config.py in utils/ for global settings (LLM model names, paths).
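The config.py mentioned above could start as the minimal sketch below; every constant name and default value here is an assumption to be refined, except the two model identifiers, which come from this plan:

```python
# utils/config.py -- central settings (names and defaults are illustrative).
from pathlib import Path

# Model identifiers from the plan.
SCRIPT_LLM = "unsloth/Phi-4-mini-instruct-unsloth-bnb-4bit"
TTS_MODEL = "maya-research/maya1"

# Filesystem paths (hypothetical layout).
TEMP_DIR = Path("/tmp/papercast")
OUTPUT_DIR = Path("outputs")

# Generation knobs.
MAX_SCRIPT_TOKENS = 2048
SPEAKERS = ("Host", "Guest")
```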

2. Core Processing Modules

2.1. PDF Processing (processing/)

  • Implement pdf_reader.py
    • Function extract_text_from_pdf(pdf_path) -> str.
    • Use PyMuPDF (fitz) for fast extraction.
    • Implement basic cleaning (remove headers/footers/references if possible).
  • Implement url_fetcher.py
    • Function fetch_paper_from_url(url) -> str.
    • Handle arXiv URLs (convert /abs/ to /pdf/ or scrape abstract).
    • Download PDF to temporary storage.
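A sketch of the two helpers above. PyMuPDF is imported lazily so the URL logic works without it; the header/footer cleaning step is left as a marked gap rather than implemented:

```python
import os
import re
import tempfile
import urllib.request


def arxiv_pdf_url(url: str) -> str:
    """Rewrite an arXiv /abs/ link to its /pdf/ counterpart (other URLs pass through)."""
    return re.sub(r"arxiv\.org/abs/", "arxiv.org/pdf/", url)


def extract_text_from_pdf(pdf_path: str) -> str:
    """Concatenate per-page text; header/footer/reference stripping would go here."""
    import fitz  # PyMuPDF; deferred so the URL helper works without it installed

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def fetch_paper_from_url(url: str) -> str:
    """Download the PDF to temporary storage and return its extracted text."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    urllib.request.urlretrieve(arxiv_pdf_url(url), path)
    return extract_text_from_pdf(path)
```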

2.2. Generation Logic (generation/)

  • Implement script_generator.py
    • Model: unsloth/Phi-4-mini-instruct-unsloth-bnb-4bit.
    • Define System Prompts for "Host" and "Guest" personas.
    • Function generate_podcast_script(paper_text) -> List[Dict].
    • Output format: [{"speaker": "Host", "text": "...", "emotion": "excited"}, {"speaker": "Guest", "text": "...", "emotion": "neutral"}].
    • Key Logic: Prompt the model to include emotion tags (e.g., [laugh], [sigh]) that Maya1 supports.
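One way to get the structured output above is to ask the model for a JSON array and parse the completion defensively. The prompt wording and the fallback regex below are assumptions, not fixed API:

```python
import json
import re

# Hypothetical system prompt; the Host/Guest persona prompts would be separate.
SYSTEM_PROMPT = (
    "You are writing a two-person podcast about a research paper. "
    "Return a JSON array of turns: "
    '[{"speaker": "Host"|"Guest", "text": "...", "emotion": "..."}]. '
    "Inline emotion tags like [laugh] or [sigh] may appear in the text."
)


def parse_script(raw: str) -> list[dict]:
    """Extract the first JSON array from the model's raw completion."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if not match:
        raise ValueError("no JSON array found in model output")
    turns = json.loads(match.group(0))
    # Keep only well-formed turns.
    return [t for t in turns if {"speaker", "text"} <= t.keys()]
```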

2.3. Audio Synthesis (synthesis/)

  • Implement tts_engine.py
    • Model: maya-research/maya1.
    • Function synthesize_dialogue(script_json) -> audio_path.
    • Parse the script for emotion tags and pass them to Maya1.
    • Combine audio segments into a single file using pydub or scipy.
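Segment combination can be done with pydub, but a dependency-free sketch using the stdlib wave module also works, assuming Maya1 emits WAV segments with matching channel/rate/width parameters:

```python
import wave


def concat_wavs(segment_paths: list[str], out_path: str) -> str:
    """Append same-format WAV segments into one file and return its path."""
    with wave.open(out_path, "wb") as out:
        params_set = False
        for path in segment_paths:
            with wave.open(path, "rb") as seg:
                if not params_set:
                    # Copy channel count, sample width, and rate from the first clip.
                    out.setparams(seg.getparams())
                    params_set = True
                out.writeframes(seg.readframes(seg.getnframes()))
    return out_path
```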

3. MCP Server Integration (mcp_servers/)

To satisfy the "MCP in Action" requirement, we will expose our core tools as MCP resources/tools.

  • Create paper_tools_server.py
    • Implement an MCP server that provides:
      • Tool: read_pdf(path)
      • Tool: fetch_arxiv(url)
      • Tool: synthesize_podcast(script)
    • This allows the "Agent" to call these tools via the MCP protocol.
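The server could be sketched with FastMCP from the official MCP Python SDK. A stand-in class with the same decorator shape is defined when the package is absent so the sketch stays self-contained; two of the tool bodies are placeholders that would delegate to the modules from Step 2:

```python
try:
    from mcp.server.fastmcp import FastMCP
except ImportError:
    # Stand-in so the sketch runs without the mcp package installed.
    class FastMCP:
        def __init__(self, name: str):
            self.name = name

        def tool(self):
            def register(fn):
                return fn
            return register

        def run(self):
            raise RuntimeError("install the mcp package to actually serve")


mcp = FastMCP("papercast-tools")


@mcp.tool()
def read_pdf(path: str) -> str:
    """Return the extracted text of a local PDF (placeholder body)."""
    raise NotImplementedError  # delegate to processing/pdf_reader in the real app


@mcp.tool()
def fetch_arxiv(url: str) -> str:
    """Rewrite an arXiv abstract URL to its PDF URL (simplified placeholder)."""
    return url.replace("arxiv.org/abs/", "arxiv.org/pdf/")


@mcp.tool()
def synthesize_podcast(script: str) -> str:
    """Return the path of the synthesized audio file (placeholder body)."""
    raise NotImplementedError  # delegate to synthesis/tts_engine in the real app


if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```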

4. Agent Orchestration (agents/)

  • Implement podcast_agent.py
    • Create a PodcastAgent class.
    • Planning Loop:
      1. Receive User Input.
      2. Plan: Decide to fetch/read paper.
      3. Analyze: Extract key topics.
      4. Draft: Generate script using Phi-4-mini.
      5. Synthesize: Create audio using Maya1.
    • Use a simulated sequential_thinking pattern to surface the "Agentic" behavior in the logs/UI.
    • Crucial: The Agent should use the MCP Client to call the tools defined in Step 3, demonstrating "Autonomous reasoning using MCP tools".
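The planning loop above can be sketched as an explicit sequence that records its "thoughts" for the UI. The call_tool callable standing in for the MCP client session, and the tool names it receives, are assumptions:

```python
from typing import Callable


class PodcastAgent:
    """Plans and executes the paper-to-podcast pipeline, logging each step."""

    def __init__(self, call_tool: Callable[..., object]):
        # call_tool(name, **kwargs) stands in for an MCP client session.
        self.call_tool = call_tool
        self.thoughts: list[str] = []

    def think(self, message: str) -> None:
        """Record a planning step for display in the logs/UI."""
        self.thoughts.append(message)

    def run(self, source: str) -> object:
        self.think(f"Plan: fetch and read paper from {source}")
        text = self.call_tool("fetch_arxiv", url=source)
        self.think("Analyze: extracting key topics")
        self.think("Draft: generating script with the LLM")
        script = self.call_tool("generate_script", paper_text=text)
        self.think("Synthesize: rendering audio")
        return self.call_tool("synthesize_podcast", script=script)
```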

5. User Interface (app.py)

  • Build Gradio UI
    • Input: Textbox (URL) or File Upload (PDF).
    • Output: Audio Player, Transcript Textbox, Status/Logs Markdown.
    • Agent Visualization: Show the "Thoughts" of the agent as it plans and executes (e.g., "Fetching paper...", "Analyzing structure...", "Generating script...").
  • Deployment Config
    • Create a Dockerfile (if custom dependencies require it) or rely on the HF Spaces default build.
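A UI skeleton matching the inputs/outputs listed above. gradio is imported lazily inside the builder so the module loads without it; the component layout and the run_pipeline signature are assumptions:

```python
def build_ui(run_pipeline):
    """run_pipeline(url, pdf_file) -> (audio_path, transcript, log_markdown)."""
    import gradio as gr  # deferred so the rest of the app imports without gradio

    with gr.Blocks(title="PaperCast") as demo:
        gr.Markdown("# PaperCast\nTurn a research paper into a podcast.")
        with gr.Row():
            url_in = gr.Textbox(label="arXiv URL")
            pdf_in = gr.File(label="PDF upload", file_types=[".pdf"])
        go = gr.Button("Generate podcast")
        audio_out = gr.Audio(label="Podcast")
        transcript_out = gr.Textbox(label="Transcript", lines=12)
        log_out = gr.Markdown("Agent thoughts will appear here.")
        go.click(
            run_pipeline,
            inputs=[url_in, pdf_in],
            outputs=[audio_out, transcript_out, log_out],
        )
    return demo
```

In app.py this would be wired as `build_ui(agent_pipeline).launch()`, where agent_pipeline streams the agent's thoughts into the Markdown output.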

6. Verification & Polish

  • Test Run
    • Run with a real arXiv paper.
    • Verify audio quality and script coherence.
  • Documentation
    • Update README.md with usage instructions and "MCP in Action" details.
    • Record Demo Video.

7. Bonus Features (Time Permitting)

  • RAG Integration: Use a vector store to answer questions about the paper after the podcast.
  • Background Music: Mix in intro/outro music.