# PaperCast Implementation Plan
This plan outlines the steps to build PaperCast, an AI agent that converts research papers into podcast-style conversations using MCP, Gradio, and LLMs.
## 1. Infrastructure & Dependencies
- Update `requirements.txt`
  - Add `transformers`, `accelerate`, `bitsandbytes` (for 4-bit LLM loading).
  - Add `scipy` (for audio processing).
  - Add `beautifulsoup4` (for web parsing).
  - Add `python-multipart` (for API handling).
  - Ensure `mcp` and `gradio` versions are pinned.
- Project structure setup
  - Create `app.py` (entry point).
  - Ensure `__init__.py` exists in all subdirectories.
  - Create `config.py` in `utils/` for global settings (LLM model names, paths).
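The settings module might look like the following minimal sketch. The field names (`SCRIPT_MODEL`, `TTS_MODEL`, `TEMP_DIR`, `LOAD_IN_4BIT`) are assumptions; the model identifiers are the ones named in Sections 2.2 and 2.3.

```python
# utils/config.py -- central settings; field names here are illustrative.
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    # Model used to draft the podcast script (see Section 2.2).
    SCRIPT_MODEL: str = "unsloth/Phi-4-mini-instruct-unsloth-bnb-4bit"
    # Text-to-speech model (see Section 2.3).
    TTS_MODEL: str = "maya-research/maya1"
    # Where downloaded PDFs and intermediate audio land.
    TEMP_DIR: Path = Path("/tmp/papercast")
    # 4-bit loading on by default so the models fit small GPUs.
    LOAD_IN_4BIT: bool = True


settings = Settings()
```

Freezing the dataclass keeps the rest of the codebase from mutating global config at runtime.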
## 2. Core Processing Modules

### 2.1. PDF Processing (`processing/`)
- Implement `pdf_reader.py`
  - Function `extract_text_from_pdf(pdf_path) -> str`.
  - Use PyMuPDF (`fitz`) for fast extraction.
  - Implement basic cleaning (remove headers/footers/references where possible).
- Implement `url_fetcher.py`
  - Function `fetch_paper_from_url(url) -> str`.
  - Handle arXiv URLs (convert `/abs/` to `/pdf/` or scrape the abstract).
  - Download the PDF to temporary storage.
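The arXiv handling in `url_fetcher.py` reduces to a small URL rewrite before downloading. A sketch of that step (the helper name `arxiv_pdf_url` is an assumption, not part of the plan):

```python
# Rewrite an arXiv abstract URL into the direct-PDF form used for download.
from urllib.parse import urlparse, urlunparse


def arxiv_pdf_url(url: str) -> str:
    """Convert an arXiv /abs/ URL to its /pdf/ counterpart; pass others through."""
    parts = urlparse(url)
    if "arxiv.org" not in parts.netloc:
        return url  # not an arXiv link; caller falls back to a generic download
    path = parts.path
    if path.startswith("/abs/"):
        path = "/pdf/" + path[len("/abs/"):]
    return urlunparse(parts._replace(path=path))


print(arxiv_pdf_url("https://arxiv.org/abs/2307.09288"))
# -> https://arxiv.org/pdf/2307.09288
```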
### 2.2. Generation Logic (`generation/`)
- Implement `script_generator.py`
  - Model: `unsloth/Phi-4-mini-instruct-unsloth-bnb-4bit`.
  - Define system prompts for "Host" and "Guest" personas.
  - Function `generate_podcast_script(paper_text) -> List[Dict]`.
  - Output format: `[{"speaker": "Host", "text": "...", "emotion": "excited"}, {"speaker": "Guest", "text": "...", "emotion": "neutral"}]`.
  - Key logic: prompt the model to include emotion tags (e.g. `[laugh]`, `[sigh]`) supported by Maya1.
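Because the LLM returns free text, the script generator needs a parse-and-validate step before handing dialogue to the TTS engine. A hedged sketch of that step (the helper name `parse_script` and the drop-invalid-turns fallback are assumptions):

```python
# Parse the model's JSON output into the turn format defined above,
# dropping malformed turns and defaulting missing emotions to "neutral".
import json
from typing import Dict, List

ALLOWED_SPEAKERS = {"Host", "Guest"}


def parse_script(raw: str) -> List[Dict]:
    turns = json.loads(raw)
    script = []
    for turn in turns:
        if turn.get("speaker") in ALLOWED_SPEAKERS and turn.get("text"):
            script.append({
                "speaker": turn["speaker"],
                "text": turn["text"],
                "emotion": turn.get("emotion", "neutral"),
            })
    return script


raw = '[{"speaker": "Host", "text": "Welcome!", "emotion": "excited"}]'
print(parse_script(raw))
```

Validating here keeps malformed model output from reaching `synthesize_dialogue` in Section 2.3.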
### 2.3. Audio Synthesis (`synthesis/`)
- Implement `tts_engine.py`
  - Model: `maya-research/maya1`.
  - Function `synthesize_dialogue(script_json) -> audio_path`.
  - Parse the script for emotion tags and pass them to Maya1.
  - Combine audio segments into a single file using `pydub` or `scipy`.
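Concatenating the per-turn clips is simple as long as every segment shares the same sample rate and format. The plan names `pydub` or `scipy` for this; the sketch below shows the same idea with only the stdlib `wave` module, assuming all Maya1 outputs use identical audio parameters (otherwise resample first):

```python
# Join several WAV clips (identical params) into one file, stdlib only.
import wave


def combine_wavs(segment_paths, out_path):
    """Append the frames of each input WAV to a single output WAV."""
    with wave.open(segment_paths[0], "rb") as first:
        params = first.getparams()
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for path in segment_paths:
            with wave.open(path, "rb") as seg:
                out.writeframes(seg.readframes(seg.getnframes()))
    return out_path
```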
## 3. MCP Server Integration (`mcp_servers/`)
To satisfy the "MCP in Action" requirement, we will expose our core tools as MCP resources/tools.
- Create `paper_tools_server.py`
  - Implement an MCP server that provides:
    - Tool: `read_pdf(path)`
    - Tool: `fetch_arxiv(url)`
    - Tool: `synthesize_podcast(script)`
  - This allows the agent to call these tools via the MCP protocol.
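On the wire, an MCP tool invocation is a JSON-RPC 2.0 message with method `tools/call`, per the MCP specification. A sketch of the request the agent's MCP client would send for `read_pdf` (the request `id` and argument values are illustrative):

```python
# Build the JSON-RPC 2.0 request an MCP client sends to invoke a tool.
import json


def make_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request (JSON-RPC framing per the MCP spec)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


print(make_tool_call(1, "read_pdf", {"path": "paper.pdf"}))
```

In practice the MCP SDK builds and transports these messages; the sketch only shows the shape the server in `paper_tools_server.py` must answer.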
## 4. Agent Orchestration (`agents/`)
- Implement `podcast_agent.py`
  - Create a `PodcastAgent` class.
  - Planning loop:
    1. Receive user input.
    2. Plan: decide whether to fetch or read the paper.
    3. Analyze: extract key topics.
    4. Draft: generate the script using Phi-4-mini.
    5. Synthesize: create audio using Maya1.
  - Use a sequential-thinking pattern (simulated) to surface agentic behavior in the logs/UI.
  - Crucial: the agent should use the MCP client to call the tools defined in Step 3, demonstrating "autonomous reasoning using MCP tools".
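The planning loop above can be sketched as a thin pipeline over injected tool callables, so the same class works whether the tools are local functions or MCP calls. Everything below (class internals, the `thoughts` log) is an assumption about the eventual implementation:

```python
# Minimal PodcastAgent sketch: each planning step logs a "thought", then acts.
class PodcastAgent:
    def __init__(self, fetch, analyze, draft, synthesize):
        # Tool callables are injected; in the real app they wrap MCP tool calls.
        self.fetch, self.analyze = fetch, analyze
        self.draft, self.synthesize = draft, synthesize
        self.thoughts = []  # surfaced in the Gradio status panel

    def _think(self, msg):
        self.thoughts.append(msg)

    def run(self, url):
        self._think("Fetching paper...")
        text = self.fetch(url)
        self._think("Analyzing structure...")
        topics = self.analyze(text)
        self._think("Generating script...")
        script = self.draft(topics)
        self._think("Synthesizing audio...")
        return self.synthesize(script)


# Stubbed tools stand in for the MCP-backed implementations.
agent = PodcastAgent(
    fetch=lambda url: "paper text",
    analyze=lambda text: ["topic"],
    draft=lambda topics: [{"speaker": "Host", "text": "Hi"}],
    synthesize=lambda script: "out.wav",
)
print(agent.run("https://arxiv.org/abs/2307.09288"), agent.thoughts)
```

Injecting the tools keeps the agent testable without loading either model.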
## 5. User Interface (`app.py`)
- Build the Gradio UI
  - Input: textbox (URL) or file upload (PDF).
  - Output: audio player, transcript textbox, status/logs markdown.
  - Agent visualization: show the agent's "thoughts" as it plans and executes (e.g., "Fetching paper...", "Analyzing structure...", "Generating script...").
- Deployment config
  - Create a `Dockerfile` (if needed for custom dependencies) or rely on the HF Spaces default.
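The "thoughts" panel maps naturally onto a Python generator, since Gradio event handlers can be generator functions that stream successive updates to the UI. A framework-independent sketch of the pattern (function and step names are assumptions):

```python
# Yield (status, payload) pairs so a UI can update progressively per step.
def run_with_status(url, pipeline_steps):
    """pipeline_steps: list of (label, callable); each callable maps state -> state."""
    state = url
    for label, step in pipeline_steps:
        yield label, None   # show progress before the step runs
        state = step(state)
    yield "Done", state     # final yield carries the result


steps = [
    ("Fetching paper...", lambda u: "text"),
    ("Generating script...", lambda t: "script"),
]
updates = list(run_with_status("url", steps))
print(updates[-1])  # ('Done', 'script')
```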
## 6. Verification & Polish
- Test run
  - Run with a real arXiv paper.
  - Verify audio quality and script coherence.
- Documentation
  - Update `README.md` with usage instructions and "MCP in Action" details.
  - Record a demo video.
## 7. Bonus Features (Time Permitting)
- RAG Integration: Use a vector store to answer questions about the paper after the podcast.
- Background Music: Mix in intro/outro music.