Dyno1307 committed on
Commit 279ed8e · verified · 1 Parent(s): 5bdd8f4

Upload 14 files
.gitattributes CHANGED
@@ -1,38 +1,8 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
  data/processed/nepali.en filter=lfs diff=lfs merge=lfs -text
  data/processed/nepali.ne filter=lfs diff=lfs merge=lfs -text
- frontend/public/android-chrome-512x512.png filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
+ *.ico filter=lfs diff=lfs merge=lfs -text
+ *.en filter=lfs diff=lfs merge=lfs -text
+ *.ne filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,2 @@
+ .venv/
+ models/
Dockerfile ADDED
@@ -0,0 +1,20 @@
+ # Use an official Python runtime as a parent image
+ FROM python:3.10-slim
+
+ # Set the working directory in the container
+ WORKDIR /code
+
+ # Copy the requirements file into the container at /code
+ COPY ./requirements.txt /code/requirements.txt
+
+ # Install any needed packages specified in requirements.txt
+ RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
+
+ # Copy the rest of the application's code
+ COPY . /code/
+
+ # Expose the port the app runs on
+ EXPOSE 7860
+
+ # Command to run the application
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,10 +1,255 @@
  ---
- title: Translate V2
- emoji: 🐢
- colorFrom: yellow
- colorTo: purple
+ title: Translate
+ emoji: 🌐
+ colorFrom: blue
+ colorTo: indigo
  sdk: docker
+ app_file: app.py
  pinned: false
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Saksi Translation: Nepali-English Machine Translation
+
+ This project provides a machine translation solution for translating text from Nepali and Sinhala to English. It leverages the NLLB (No Language Left Behind) model from Meta AI, fine-tuned on a custom dataset for improved performance. The project includes a complete workflow from data acquisition to model deployment, featuring a REST API for easy integration.
+
+ ## Table of Contents
+
+ - [Features](#features)
+ - [Workflow](#workflow)
+ - [Tech Stack](#tech-stack)
+ - [Model Details](#model-details)
+ - [API Endpoints](#api-endpoints)
+ - [Getting Started](#getting-started)
+ - [Usage](#usage)
+ - [Project Structure](#project-structure)
+ - [Future Improvements](#future-improvements)
+
+ ## Features
+
+ - **High-Quality Translation:** Utilizes a fine-tuned NLLB model for accurate translations.
+ - **Support for Multiple Languages:** Currently supports Nepali-to-English and Sinhala-to-English translation.
+ - **REST API:** Exposes the translation model through a high-performance FastAPI application.
+ - **Interactive Frontend:** A simple and intuitive web interface for easy translation.
+ - **Batch Translation:** Supports translating multiple texts in a single request.
+ - **PDF Translation:** Supports translating text directly from PDF files.
+ - **Scalable and Reproducible:** Built with a modular structure and uses MLflow for experiment tracking.
+
+ ## Workflow
+
+ The project follows a standard machine learning workflow for building and deploying a translation model:
+
+ 1. **Data Acquisition:** The process begins with collecting parallel text data (Nepali/Sinhala and English). The `scripts/fetch_parallel_data.py` script downloads data from various online sources. The quality and quantity of this data are crucial for the model's performance.
+
+ 2. **Data Cleaning and Preprocessing:** Raw data from the web is often noisy and requires cleaning. The `scripts/clean_text_data.py` script performs several preprocessing steps (a minimal sketch follows this list):
+    * **HTML Tag Removal:** Strips out HTML tags and other web artifacts.
+    * **Unicode Normalization:** Normalizes Unicode characters to ensure consistency.
+    * **Sentence Filtering:** Removes sentences that are too long or too short, which can negatively impact training.
+    * **Corpus Alignment:** Ensures a one-to-one correspondence between source and target sentences.
+
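A minimal sketch of these cleaning steps, for illustration only: `scripts/clean_text_data.py` itself is not part of this commit, and the function name and length thresholds below are hypothetical.

```python
import re
import unicodedata

def clean_pair(src: str, tgt: str, min_len: int = 3, max_len: int = 200):
    """Clean one parallel sentence pair; return None to drop the pair."""
    # HTML tag removal: strip anything that looks like a tag.
    src = re.sub(r"<[^>]+>", " ", src)
    tgt = re.sub(r"<[^>]+>", " ", tgt)
    # Unicode normalization: NFC keeps Devanagari/Sinhala glyphs consistent.
    src = unicodedata.normalize("NFC", src).strip()
    tgt = unicodedata.normalize("NFC", tgt).strip()
    # Sentence filtering: drop pairs that are too short or too long.
    if not (min_len <= len(src.split()) <= max_len):
        return None
    if not (min_len <= len(tgt.split()) <= max_len):
        return None
    # Dropping pairs together preserves the one-to-one corpus alignment.
    return src, tgt
```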
+ 3. **Model Finetuning:** The core of the project is fine-tuning a pre-trained NLLB model on our custom parallel dataset. The `src/train.py` script, which leverages the Hugging Face `Trainer` API, manages the entire training loop (a condensed sketch follows this list), including:
+    * Loading the pre-trained NLLB model and tokenizer.
+    * Creating a PyTorch Dataset from the preprocessed data.
+    * Configuring training arguments such as learning rate, batch size, and number of epochs.
+    * Executing the training loop and saving the fine-tuned model checkpoints.
+
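Since `src/train.py` is not included in this commit, here is a condensed sketch of what such a fine-tuning setup typically looks like with the Hugging Face `Seq2SeqTrainer`; the toy dataset stands in for the real corpus in `data/processed/`, and the hyperparameters mirror the Usage section below.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="npi_Deva", tgt_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy stand-in for the real parallel corpus in data/processed/.
pairs = Dataset.from_dict({
    "ne": ["नेपालको राजधानी काठमाडौं हो।"],
    "en": ["The capital of Nepal is Kathmandu."],
})

def preprocess(batch):
    # text_target tokenizes the English side as labels.
    return tokenizer(batch["ne"], text_target=batch["en"],
                     truncation=True, max_length=128)

train_dataset = pairs.map(preprocess, batched=True, remove_columns=["ne", "en"])

args = Seq2SeqTrainingArguments(
    output_dir="models/nllb-finetuned-nepali-en",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_strategy="epoch",
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```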
+ 4. **Model Evaluation:** After training, the model's performance is evaluated with the `src/evaluation.py` script. It computes the **BLEU (Bilingual Evaluation Understudy)** score, a widely accepted metric for machine translation quality, by comparing the model's translations of a test set against high-quality reference translations (see the example below).
+
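For example, with `sacrebleu` (pinned in `requirements.txt`), corpus-level BLEU is computed like this; the hypothesis/reference pair is a toy example.

```python
import sacrebleu

# System outputs for the test set, one string per sentence.
hypotheses = ["The capital of Nepal is Kathmandu."]
# One reference stream; each inner list holds one reference per sentence.
references = [["Kathmandu is the capital of Nepal."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```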
+ 5. **Inference and Deployment:** Once the model is trained and evaluated, it is ready for use.
+    * `interactive_translate.py`: A command-line script for quick, interactive translation tests.
+    * `fast_api.py`: A production-ready REST API built with FastAPI that serves the translation model, letting other applications consume the translation service.
+
+ ## Tech Stack
+
+ The technologies used in this project were chosen to create a robust, efficient, and maintainable machine translation pipeline:
+
+ - **Python:** The primary language for the project, offering a rich ecosystem of libraries and frameworks for machine learning.
+ - **PyTorch:** A flexible and powerful deep learning framework that provides fine-grained control over the model training process.
+ - **Hugging Face Transformers:** The backbone of the project, providing easy access to pre-trained models like NLLB and a standardized interface for training and inference.
+ - **Hugging Face Datasets:** Simplifies loading and preprocessing large datasets, with efficient data loading and manipulation capabilities.
+ - **FastAPI:** A modern, high-performance web framework for building APIs with Python, used to serve the translation model as a REST API.
+ - **Uvicorn:** A lightning-fast ASGI server, used to run the FastAPI application.
+ - **MLflow:** Used for experiment tracking to ensure reproducibility. It logs training parameters, metrics, and model artifacts, which is crucial for managing machine learning projects.
+
+ ## Model Details
+
+ - **Base Model:** The project uses `facebook/nllb-200-distilled-600M`, a distilled version of the NLLB-200 model designed to be efficient while still providing high-quality translations for a large number of languages.
+ - **Fine-tuning:** The base model is fine-tuned on a custom dataset of Nepali-English and Sinhala-English parallel text to improve its performance on these specific language pairs.
+ - **Tokenizer:** The `NllbTokenizer` is used for tokenizing the text. It is a SentencePiece-based tokenizer designed specifically for the NLLB model (see the inference sketch below).
+
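To make the language-code plumbing concrete, here is a minimal inference sketch against the base checkpoint; `npi_Deva` and `eng_Latn` are the FLORES-200 codes NLLB uses for Nepali and English.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

tokenizer.src_lang = "npi_Deva"  # tag the input as Nepali
inputs = tokenizer("नेपालको राजधानी काठमाडौं हो।", return_tensors="pt")
tokens = model.generate(
    **inputs,
    # Force the decoder to start generating in English.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```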
+ ## API Endpoints
+
+ The FastAPI application provides the following endpoints (a client example follows the list):
+
+ - **`GET /`**: Returns the frontend HTML page.
+ - **`GET /languages`**: Returns a list of supported languages.
+ - **`POST /translate`**: Translates a single text.
+   - **Request Body:**
+     ```json
+     {
+       "text": "string",
+       "source_language": "string"
+     }
+     ```
+   - **Response Body:**
+     ```json
+     {
+       "original_text": "string",
+       "translated_text": "string",
+       "source_language": "string"
+     }
+     ```
+ - **`POST /batch-translate`**: Translates a batch of texts.
+   - **Request Body:**
+     ```json
+     {
+       "texts": ["string"],
+       "source_language": "string"
+     }
+     ```
+   - **Response Body:**
+     ```json
+     {
+       "original_texts": ["string"],
+       "translated_texts": ["string"],
+       "source_language": "string"
+     }
+     ```
+ - **`POST /translate-pdf`**: Translates a PDF file.
+   - **Request:** `source_language: str`, `file: UploadFile`
+   - **Response Body:**
+     ```json
+     {
+       "filename": "string",
+       "translated_text": "string",
+       "source_language": "string"
+     }
+     ```
+
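With the server running locally (see [Usage](#usage)), the `/translate` endpoint can be exercised with a short `requests` snippet (`requests` is pinned in `requirements.txt`); the host and port assume the local `uvicorn` invocation shown below.

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/translate",
    json={"text": "नेपालको राजधानी काठमाडौं हो।", "source_language": "nepali"},
)
resp.raise_for_status()
print(resp.json()["translated_text"])
```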
+ ## Getting Started
+
+ ### Prerequisites
+
+ - **Python 3.10 or higher:** Ensure you have a recent version of Python installed.
+ - **Git and Git LFS:** Git is required to clone the repository, and Git LFS is required to handle large model files.
+ - **(Optional) NVIDIA GPU with CUDA:** A GPU is highly recommended for training the model.
+
+ ### Installation
+
+ 1. **Clone the repository:**
+    ```bash
+    git clone <repository-url>
+    cd saksi_translation
+    ```
+
+ 2. **Create and activate a virtual environment:**
+    ```bash
+    python -m venv .venv
+    # On Windows
+    .venv\Scripts\activate
+    # On macOS/Linux
+    source .venv/bin/activate
+    ```
+
+ 3. **Install dependencies:**
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ ## Usage
+
+ ### Data Preparation
+
+ - **Fetch Parallel Data:**
+   ```bash
+   python scripts/fetch_parallel_data.py --output_dir data/raw
+   ```
+
+ - **Clean Text Data:**
+   ```bash
+   python scripts/clean_text_data.py --input_dir data/raw --output_dir data/processed
+   ```
+
+ ### Training
+
+ - **Start Training:**
+   ```bash
+   python src/train.py \
+       --model_name "facebook/nllb-200-distilled-600M" \
+       --dataset_path "data/processed" \
+       --output_dir "models/nllb-finetuned-nepali-en" \
+       --learning_rate 2e-5 \
+       --per_device_train_batch_size 8 \
+       --num_train_epochs 3
+   ```
+
+ ### Evaluation
+
+ - **Evaluate the Model:**
+   ```bash
+   python src/evaluation.py \
+       --model_path "models/nllb-finetuned-nepali-en" \
+       --test_data_path "data/test_sets/test.ne" \
+       --reference_data_path "data/test_sets/test.en"
+   ```
199
+
200
+ ### Interactive Translation
201
+
202
+ - **Run the interactive script:**
203
+ ```bash
204
+ python interactive_translate.py
205
+ ```
206
+
207
+ ### API
208
+
209
+ - **Run the API:**
210
+ ```bash
211
+ uvicorn fast_api:app --reload
212
+ ```
213
+ Open your browser and navigate to `http://127.0.0.1:8000` to use the web interface.
214
+
215
+ ## Project Structure
216
+
217
+ ```
218
+ saksi_translation/
219
+ ├── .gitignore
220
+ ├── fast_api.py # FastAPI application
221
+ ├── interactive_translate.py # Interactive translation script
222
+ ├── README.md # Project documentation
223
+ ├── requirements.txt # Python dependencies
224
+ ├── test_translation.py # Script for testing the translation model
225
+ ├── frontend/
226
+ │ ├── index.html # Frontend HTML
227
+ │ ├── script.js # Frontend JavaScript
228
+ │ └── styles.css # Frontend CSS
229
+ ├── data/
230
+ │ ├── processed/ # Processed data for training
231
+ │ ├── raw/ # Raw data downloaded from the web
232
+ │ └── test_sets/ # Test sets for evaluation
233
+ ├── mlruns/ # MLflow experiment tracking data
234
+ ├── models/
235
+ │ └── nllb-finetuned-nepali-en/ # Fine-tuned model
236
+ ├── notebooks/ # Jupyter notebooks for experimentation
237
+ ├── scripts/
238
+ │ ├── clean_text_data.py
239
+ │ ├── create_test_set.py
240
+ │ ├── download_model.py
241
+ │ ├── fetch_parallel_data.py
242
+ │ └── scrape_bbc_nepali.py
243
+ └── src/
244
+ ├── __init__.py
245
+ ├── evaluation.py # Script for evaluating the model
246
+ ├── train.py # Script for training the model
247
+ └── translate.py # Script for translating text
248
+ ```
249
+
250
+ ## Future Improvements
251
+
252
+ - **Support for more languages:** The project can be extended to support more languages by adding more parallel data and fine-tuning the model on it.
253
+ - **Improved Model:** The model can be improved by using a larger version of the NLLB model or by fine-tuning it on a larger and cleaner dataset.
254
+ - **Advanced Frontend:** The frontend can be improved by adding features like translation history, user accounts, and more advanced styling.
255
+ - **Containerization:** The application can be containerized using Docker for easier deployment and scaling.
api_log.txt ADDED
@@ -0,0 +1,20 @@
+ Loading models on CPU...
+ Traceback (most recent call last):
+   File "D:\SIH\saksi_translation\api.py", line 14, in <module>
+     "nepali": AutoModelForSeq2SeqLM.from_pretrained("models/nllb-finetuned-nepali-en").to(DEVICE),
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+   File "C:\Users\dynos\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\transformers\models\auto\auto_factory.py", line 549, in from_pretrained
+     config, kwargs = AutoConfig.from_pretrained(
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~^
+         pretrained_model_name_or_path,
+         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+     ...<4 lines>...
+         **kwargs,
+         ^^^^^^^^^
+     )
+     ^
+   File "C:\Users\dynos\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.13_qbz5n2kfra8p0\LocalCache\local-packages\Python313\site-packages\transformers\models\auto\configuration_auto.py", line 1329, in from_pretrained
+     raise ValueError(
+     ...<3 lines>...
+     )
+ ValueError: Unrecognized model in models/nllb-finetuned-nepali-en. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: aimv2, aimv2_vision_model, albert, align, altclip, apertus, arcee, aria, aria_text, audio-spectrogram-transformer, autoformer, aya_vision, bamba, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, bitnet, blenderbot, blenderbot-small, blip, blip-2, blip_2_qformer, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, cohere2, cohere2_vision, colpali, colqwen2, conditional_detr, convbert, convnext, convnextv2, cpmant, csm, ctrl, cvt, d_fine, dab-detr, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deepseek_v2, deepseek_v3, deepseek_vl, deepseek_vl_hybrid, deformable_detr, deit, depth_anything, depth_pro, deta, detr, dia, diffllama, dinat, dinov2, dinov2_with_registers, dinov3_convnext, dinov3_vit, distilbert, doge, donut-swin, dots1, dpr, dpt, efficientformer, efficientloftr, efficientnet, electra, emu3, encodec, encoder-decoder, eomt, ernie, ernie4_5, ernie4_5_moe, ernie_m, esm, evolla, exaone4, falcon, falcon_h1, falcon_mamba, fastspeech2_conformer, fastspeech2_conformer_with_hifigan, flaubert, flava, florence2, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, gemma3, gemma3_text, gemma3n, gemma3n_audio, gemma3n_text, gemma3n_vision, git, glm, glm4, glm4_moe, glm4v, glm4v_moe, glm4v_moe_text, glm4v_text, glpn, got_ocr2, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gpt_oss, gptj, gptsan-japanese, granite, granite_speech, granitemoe, granitemoehybrid, granitemoeshared, granitevision, graphormer, grounding-dino, groupvit, helium, hgnet_v2, hiera, hubert, hunyuan_v1_dense, hunyuan_v1_moe, ibert, idefics, idefics2, idefics3, idefics3_vision, ijepa, imagegpt, informer, instructblip, instructblipvideo, internvl, internvl_vision, jamba, janus, jetmoe, jukebox, kosmos-2, kosmos-2.5, kyutai_speech_to_text, layoutlm, layoutlmv2, layoutlmv3, led, levit, lfm2, lightglue, lilt, llama, llama4, llama4_text, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, metaclip_2, mgp-str, mimi, minimax, mistral, mistral3, mixtral, mlcd, mllama, mm-grounding-dino, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, modernbert, modernbert-decoder, moonshine, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, ovis2, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, perception_encoder, perception_lm, persimmon, phi, phi3, phi4_multimodal, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prompt_depth_anything, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_5_omni, qwen2_5_vl, qwen2_5_vl_text, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, qwen2_vl_text, qwen3, qwen3_moe, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rt_detr_v2, rwkv, sam, sam2, sam2_hiera_det_model, sam2_video, sam2_vision_model, sam_hq, sam_hq_vision_model, sam_vision_model, seamless_m4t, seamless_m4t_v2, seed_oss, segformer, seggpt, sew, sew-d, shieldgemma2, siglip, siglip2, siglip_vision_model, smollm3, smolvlm, smolvlm_vision, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superglue, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, t5gemma, table-transformer, tapas, textnet, time_series_transformer, timesfm, timesformer, timm_backbone, timm_wrapper, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vitpose, vitpose_backbone, vits, vivit, vjepa2, voxtral, voxtral_encoder, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xcodec, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xlstm, xmod, yolos, yoso, zamba, zamba2, zoedepth
app.py ADDED
@@ -0,0 +1,213 @@
+ """
+ A FastAPI application for serving the translation model, inspired by interactive_translate.py.
+ """
+ import torch
+ from transformers import M2M100ForConditionalGeneration, NllbTokenizer
+ from fastapi import FastAPI, HTTPException, UploadFile, File
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.responses import FileResponse
+ from pydantic import BaseModel
+ import logging
+ from typing import List
+ import fitz  # PyMuPDF
+ import shutil
+ import os
+
+ # --- 1. App Configuration ---
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ app = FastAPI(
+     title="Saksi Translation API",
+     description="A simple API for translating text and PDFs to English.",
+     version="2.0",
+ )
+
+ app.mount("/frontend", StaticFiles(directory="frontend"), name="frontend")
+
+
+ # --- 2. Global Variables ---
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ SUPPORTED_LANGUAGES = {
+     "nepali": "npi_Deva",  # FLORES-200 code for Nepali (NLLB has no "nep_Npan" code)
+     "sinhala": "sin_Sinh",
+ }
+ MODEL_PATH = "facebook/nllb-200-distilled-600M"
+ model = None
+ tokenizer = None
+
+ # --- 3. Pydantic Models ---
+ class TranslationRequest(BaseModel):
+     text: str
+     source_language: str
+
+ class TranslationResponse(BaseModel):
+     original_text: str
+     translated_text: str
+     source_language: str
+
+ class BatchTranslationRequest(BaseModel):
+     texts: List[str]
+     source_language: str
+
+ class BatchTranslationResponse(BaseModel):
+     original_texts: List[str]
+     translated_texts: List[str]
+     source_language: str
+
+ class PdfTranslationResponse(BaseModel):
+     filename: str
+     translated_text: str
+     source_language: str
+
+
+ # --- 4. Helper Functions ---
+ def load_model_and_tokenizer(model_path):
+     """Loads the model and tokenizer from the given path."""
+     global model, tokenizer
+     logger.info(f"Loading model on {DEVICE.upper()}...")
+     try:
+         model = M2M100ForConditionalGeneration.from_pretrained(model_path).to(DEVICE)
+         tokenizer = NllbTokenizer.from_pretrained(model_path)
+         logger.info("Model and tokenizer loaded successfully!")
+     except Exception as e:
+         logger.error(f"Error loading model: {e}")
+         # In a real app, you might want to exit or handle this more gracefully
+         raise
+
+ def translate_text(text: str, src_lang: str) -> str:
+     """Translates a single string of text to English."""
+     if src_lang not in SUPPORTED_LANGUAGES:
+         raise ValueError(f"Language '{src_lang}' not supported.")
+
+     tokenizer.src_lang = SUPPORTED_LANGUAGES[src_lang]
+     inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
+
+     generated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
+         max_length=128,
+     )
+
+     return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
+
+ def batch_translate_text(texts: List[str], src_lang: str) -> List[str]:
+     """Translates a batch of texts to English."""
+     if src_lang not in SUPPORTED_LANGUAGES:
+         raise ValueError(f"Language '{src_lang}' not supported.")
+
+     tokenizer.src_lang = SUPPORTED_LANGUAGES[src_lang]
+     # We use padding=True to handle batches of different lengths
+     inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(DEVICE)
+
+     generated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
+         max_length=512,  # Allow for longer generated sequences in batches
+     )
+
+     return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
+
+ # --- 5. API Events ---
+ @app.on_event("startup")
+ async def startup_event():
+     """Load the model at startup."""
+     load_model_and_tokenizer(MODEL_PATH)
+
+ # --- 6. API Endpoints ---
+ @app.get("/")
+ async def root():
+     """Returns the frontend."""
+     return FileResponse('frontend/index.html')
+
+ @app.get("/languages")
+ def get_supported_languages():
+     """Returns a list of supported languages."""
+     return {"supported_languages": list(SUPPORTED_LANGUAGES.keys())}
+
+ @app.post("/translate", response_model=TranslationResponse)
+ async def translate(request: TranslationRequest):
+     """Translates a single text from a source language to English."""
+     try:
+         translated_text = translate_text(request.text, request.source_language)
+         return TranslationResponse(
+             original_text=request.text,
+             translated_text=translated_text,
+             source_language=request.source_language,
+         )
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {e}")
+
+ @app.post("/batch-translate", response_model=BatchTranslationResponse)
+ async def batch_translate(request: BatchTranslationRequest):
+     """Translates a batch of texts from a source language to English."""
+     try:
+         translated_texts = batch_translate_text(request.texts, request.source_language)
+         return BatchTranslationResponse(
+             original_texts=request.texts,
+             translated_texts=translated_texts,
+             source_language=request.source_language,
+         )
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {e}")
+
+ @app.post("/translate-pdf", response_model=PdfTranslationResponse)
+ async def translate_pdf(source_language: str, file: UploadFile = File(...)):
+     """Translates a PDF file from a source language to English."""
+     if file.content_type != "application/pdf":
+         raise HTTPException(status_code=400, detail="Invalid file type. Please upload a PDF.")
+
+     # Save the uploaded file temporarily
+     temp_pdf_path = f"temp_{file.filename}"
+     with open(temp_pdf_path, "wb") as buffer:
+         shutil.copyfileobj(file.file, buffer)
+
+     try:
+         # Extract text from the PDF
+         doc = fitz.open(temp_pdf_path)
+         extracted_text = ""
+         for page in doc:
+             extracted_text += page.get_text()
+         doc.close()
+
+         if not extracted_text.strip():
+             raise HTTPException(status_code=400, detail="Could not extract any text from the PDF.")
+
+         # Split text into chunks (e.g., by paragraph) to handle large texts
+         text_chunks = [p.strip() for p in extracted_text.split('\n') if p.strip()]
+
+         # Translate the chunks in batches
+         translated_chunks = batch_translate_text(text_chunks, source_language)
+
+         # Join the translated chunks back together
+         final_translation = "\n".join(translated_chunks)
+
+         return PdfTranslationResponse(
+             filename=file.filename,
+             translated_text=final_translation,
+             source_language=source_language,
+         )
+     except HTTPException:
+         # Re-raise deliberate HTTP errors (e.g., empty PDF) instead of wrapping them as 500s
+         raise
+     except Exception as e:
+         logger.error(f"Error processing PDF: {e}")
+         raise HTTPException(status_code=500, detail=f"An error occurred while processing the PDF: {e}")
+     finally:
+         # Clean up the temporary file
+         if os.path.exists(temp_pdf_path):
+             os.remove(temp_pdf_path)
+
+
+ # --- 7. Example Usage (for running with uvicorn) ---
+ # To run this API, use the following command in your terminal:
+ # uvicorn app:app --reload
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
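For reference, the `/translate-pdf` endpoint above takes the language as a query parameter and the PDF as multipart form data. A client-side sketch, assuming a local server and a hypothetical `sample.pdf` on disk:

```python
import requests

with open("sample.pdf", "rb") as f:  # hypothetical local file
    resp = requests.post(
        "http://127.0.0.1:8000/translate-pdf",
        params={"source_language": "nepali"},  # query parameter
        files={"file": ("sample.pdf", f, "application/pdf")},  # multipart upload
    )
resp.raise_for_status()
print(resp.json()["translated_text"])
```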
baseline_analysis.py ADDED
@@ -0,0 +1,55 @@
+ # baseline_analysis.py
+
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ import torch
+
+ # Define the model we want to use. We'll use a distilled (smaller, faster)
+ # version of NLLB-200 for this quick test.
+ model_name = "facebook/nllb-200-distilled-600M"
+
+ # Load the pre-trained tokenizer and model from Hugging Face.
+ # This might take a minute to download the first time.
+ print(f"Loading model: {model_name}")
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+ print("Model loaded successfully!")
+
+ # Sentences we want to translate.
+ sinhala_sentences = [
+     "ඩෝසන් මිස් දුරකථනයෙන් ඩෝසන් මිස් කවුද සර්",
+     "කවුද ඩෝසන් නැතුව ඉන්නේ ඔව් සර්",
+     "ඔබ එය උත්සාහ කරන්න සර්",
+     "කොහොමද වැඩේ හරිද ඔව් සර්ට ස්තුතියි",
+     "ඔව්, හරි, ස්තුතියි රත්තරං"
+ ]
+
+ print("\n--- Starting Translation ---")
+
+ # Loop through each sentence and translate it.
+ for sentence in sinhala_sentences:
+
+     # 1. Prepare the input for the model
+     # We need to tell the tokenizer what the source language is.
+     tokenizer.src_lang = "sin_Sinh"
+
+     # Convert the text into a format the model understands (input IDs).
+     inputs = tokenizer(sentence, return_tensors="pt")
+
+     # 2. Generate the translation
+     # We force the model to output English by setting the target language ID.
+     target_lang = "eng_Latn"
+     translated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
+         max_length=50  # Set a max length for the output
+     )
+
+     # 3. Decode the output
+     # Convert the model's output tokens back into readable text.
+     translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
+
+     # 4. Display the results
+     print(f"\nOriginal (si): {sentence}")
+     print(f"Translation (en): {translation}")
+
+ print("\n--- Translation Complete ---")
baseline_translate.py ADDED
@@ -0,0 +1,51 @@
+ # baseline_translate.py
+
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ import torch
+
+ # Define the model we want to use. We'll use a distilled (smaller, faster)
+ # version of NLLB-200 for this quick test.
+ model_name = "facebook/nllb-200-distilled-600M"
+
+ # Load the pre-trained tokenizer and model from Hugging Face.
+ # This might take a minute to download the first time.
+ print(f"Loading model: {model_name}")
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+ print("Model loaded successfully!")
+
+ # Sentences we want to translate, keyed by FLORES-200 language code.
+ sentences_to_translate = {
+     "npi_Deva": "नेपालको राजधानी काठमाडौं हो।",  # Nepali: "The capital of Nepal is Kathmandu."
+     "sin_Sinh": "ශ්‍රී ලංකාවේ අගනුවර කොළඹ වේ."  # Sinhala: "The capital of Sri Lanka is Colombo."
+ }
+
+ print("\n--- Starting Translation ---")
+
+ # Loop through each sentence and translate it.
+ for lang_code, text in sentences_to_translate.items():
+
+     # 1. Prepare the input for the model
+     # We need to tell the tokenizer what the source language is.
+     tokenizer.src_lang = lang_code
+
+     # Convert the text into a format the model understands (input IDs).
+     inputs = tokenizer(text, return_tensors="pt")
+
+     # 2. Generate the translation
+     # We force the model to output English by setting the target language ID.
+     # (convert_tokens_to_ids replaces the lang_code_to_id mapping removed from newer transformers.)
+     translated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
+         max_length=50  # Set a max length for the output
+     )
+
+     # 3. Decode the output
+     # Convert the model's output tokens back into readable text.
+     translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
+
+     # 4. Display the results
+     print(f"\nOriginal ({lang_code}): {text}")
+     print(f"Translation (eng_Latn): {translation}")
+
+ print("\n--- Translation Complete ---")
debug_load.py ADDED
@@ -0,0 +1,26 @@
+ # debug_load.py
+
+ import torch
+ from transformers import AutoTokenizer, M2M100ForConditionalGeneration
+
+ # --- Configuration ---
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ nepali_model_path = r"D:\SIH\saksi_translation\models\nllb-finetuned-nepali-en"
+
+ # --- Tokenizer Loading ---
+ print("Loading Nepali tokenizer...")
+ try:
+     nepali_tokenizer = AutoTokenizer.from_pretrained(nepali_model_path)
+     print("Nepali tokenizer loaded successfully.")
+     print(nepali_tokenizer)
+ except Exception as e:
+     print(f"Error loading Nepali tokenizer: {e}")
+
+ # --- Model Loading ---
+ print("\nLoading Nepali model...")
+ try:
+     nepali_model = M2M100ForConditionalGeneration.from_pretrained(nepali_model_path).to(DEVICE)
+     print("Nepali model loaded successfully.")
+     print(nepali_model)
+ except Exception as e:
+     print(f"Error loading Nepali model: {e}")
fast_api.py ADDED
@@ -0,0 +1,214 @@
+ """
+ A FastAPI application for serving the translation model, inspired by interactive_translate.py.
+ """
+
+ import torch
+ from transformers import M2M100ForConditionalGeneration, NllbTokenizer
+ from fastapi import FastAPI, HTTPException, UploadFile, File
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.responses import FileResponse
+ from pydantic import BaseModel
+ import logging
+ from typing import List
+ import fitz  # PyMuPDF
+ import shutil
+ import os
+
+ # --- 1. App Configuration ---
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ app = FastAPI(
+     title="Saksi Translation API",
+     description="A simple API for translating text and PDFs to English.",
+     version="2.0",
+ )
+
+ app.mount("/frontend", StaticFiles(directory="frontend"), name="frontend")
+
+
+ # --- 2. Global Variables ---
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ SUPPORTED_LANGUAGES = {
+     "nepali": "npi_Deva",  # FLORES-200 code for Nepali (NLLB has no "nep_Npan" code)
+     "sinhala": "sin_Sinh",
+ }
+ MODEL_PATH = "models/nllb-finetuned-nepali-en"
+ model = None
+ tokenizer = None
+
+ # --- 3. Pydantic Models ---
+ class TranslationRequest(BaseModel):
+     text: str
+     source_language: str
+
+ class TranslationResponse(BaseModel):
+     original_text: str
+     translated_text: str
+     source_language: str
+
+ class BatchTranslationRequest(BaseModel):
+     texts: List[str]
+     source_language: str
+
+ class BatchTranslationResponse(BaseModel):
+     original_texts: List[str]
+     translated_texts: List[str]
+     source_language: str
+
+ class PdfTranslationResponse(BaseModel):
+     filename: str
+     translated_text: str
+     source_language: str
+
+
+ # --- 4. Helper Functions ---
+ def load_model_and_tokenizer(model_path):
+     """Loads the model and tokenizer from the given path."""
+     global model, tokenizer
+     logger.info(f"Loading model on {DEVICE.upper()}...")
+     try:
+         model = M2M100ForConditionalGeneration.from_pretrained(model_path).to(DEVICE)
+         tokenizer = NllbTokenizer.from_pretrained(model_path)
+         logger.info("Model and tokenizer loaded successfully!")
+     except Exception as e:
+         logger.error(f"Error loading model: {e}")
+         # In a real app, you might want to exit or handle this more gracefully
+         raise
+
+ def translate_text(text: str, src_lang: str) -> str:
+     """Translates a single string of text to English."""
+     if src_lang not in SUPPORTED_LANGUAGES:
+         raise ValueError(f"Language '{src_lang}' not supported.")
+
+     tokenizer.src_lang = SUPPORTED_LANGUAGES[src_lang]
+     inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
+
+     generated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
+         max_length=128,
+     )
+
+     return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
+
+ def batch_translate_text(texts: List[str], src_lang: str) -> List[str]:
+     """Translates a batch of texts to English."""
+     if src_lang not in SUPPORTED_LANGUAGES:
+         raise ValueError(f"Language '{src_lang}' not supported.")
+
+     tokenizer.src_lang = SUPPORTED_LANGUAGES[src_lang]
+     # We use padding=True to handle batches of different lengths
+     inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(DEVICE)
+
+     generated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
+         max_length=512,  # Allow for longer generated sequences in batches
+     )
+
+     return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
+
+ # --- 5. API Events ---
+ @app.on_event("startup")
+ async def startup_event():
+     """Load the model at startup."""
+     load_model_and_tokenizer(MODEL_PATH)
+
+ # --- 6. API Endpoints ---
+ @app.get("/")
+ async def root():
+     """Returns the frontend."""
+     return FileResponse('frontend/index.html')
+
+ @app.get("/languages")
+ def get_supported_languages():
+     """Returns a list of supported languages."""
+     return {"supported_languages": list(SUPPORTED_LANGUAGES.keys())}
+
+ @app.post("/translate", response_model=TranslationResponse)
+ async def translate(request: TranslationRequest):
+     """Translates a single text from a source language to English."""
+     try:
+         translated_text = translate_text(request.text, request.source_language)
+         return TranslationResponse(
+             original_text=request.text,
+             translated_text=translated_text,
+             source_language=request.source_language,
+         )
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {e}")
+
+ @app.post("/batch-translate", response_model=BatchTranslationResponse)
+ async def batch_translate(request: BatchTranslationRequest):
+     """Translates a batch of texts from a source language to English."""
+     try:
+         translated_texts = batch_translate_text(request.texts, request.source_language)
+         return BatchTranslationResponse(
+             original_texts=request.texts,
+             translated_texts=translated_texts,
+             source_language=request.source_language,
+         )
+     except ValueError as e:
+         raise HTTPException(status_code=400, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {e}")
+
+ @app.post("/translate-pdf", response_model=PdfTranslationResponse)
+ async def translate_pdf(source_language: str, file: UploadFile = File(...)):
+     """Translates a PDF file from a source language to English."""
+     if file.content_type != "application/pdf":
+         raise HTTPException(status_code=400, detail="Invalid file type. Please upload a PDF.")
+
+     # Save the uploaded file temporarily
+     temp_pdf_path = f"temp_{file.filename}"
+     with open(temp_pdf_path, "wb") as buffer:
+         shutil.copyfileobj(file.file, buffer)
+
+     try:
+         # Extract text from the PDF
+         doc = fitz.open(temp_pdf_path)
+         extracted_text = ""
+         for page in doc:
+             extracted_text += page.get_text()
+         doc.close()
+
+         if not extracted_text.strip():
+             raise HTTPException(status_code=400, detail="Could not extract any text from the PDF.")
+
+         # Split text into chunks (e.g., by paragraph) to handle large texts
+         text_chunks = [p.strip() for p in extracted_text.split('\n') if p.strip()]
+
+         # Translate the chunks in batches
+         translated_chunks = batch_translate_text(text_chunks, source_language)
+
+         # Join the translated chunks back together
+         final_translation = "\n".join(translated_chunks)
+
+         return PdfTranslationResponse(
+             filename=file.filename,
+             translated_text=final_translation,
+             source_language=source_language,
+         )
+     except HTTPException:
+         # Re-raise deliberate HTTP errors (e.g., empty PDF) instead of wrapping them as 500s
+         raise
+     except Exception as e:
+         logger.error(f"Error processing PDF: {e}")
+         raise HTTPException(status_code=500, detail=f"An error occurred while processing the PDF: {e}")
+     finally:
+         # Clean up the temporary file
+         if os.path.exists(temp_pdf_path):
+             os.remove(temp_pdf_path)
+
+
+ # --- 7. Example Usage (for running with uvicorn) ---
+ # To run this API, use the following command in your terminal:
+ # uvicorn fast_api:app --reload
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
interactive_translate.py ADDED
@@ -0,0 +1,74 @@
+ """
+ An interactive script to translate text to English using a fine-tuned NLLB model.
+ """
+
+ import torch
+ from transformers import M2M100ForConditionalGeneration, NllbTokenizer
+
+ # --- 1. Configuration ---
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ SUPPORTED_LANGUAGES = {
+     "nepali": "npi_Deva",  # FLORES-200 code for Nepali (NLLB has no "nep_Npan" code)
+     "sinhala": "sin_Sinh",
+ }
+
+ # --- 2. Load Model and Tokenizer ---
+ def load_model_and_tokenizer(model_path):
+     """Loads the model and tokenizer from the given path."""
+     print(f"Loading model on {DEVICE.upper()}...")
+     try:
+         model = M2M100ForConditionalGeneration.from_pretrained(model_path).to(DEVICE)
+         tokenizer = NllbTokenizer.from_pretrained(model_path)
+         print("Model and tokenizer loaded successfully!")
+         return model, tokenizer
+     except Exception as e:
+         print(f"Error loading model: {e}")
+         return None, None
+
+ # --- 3. Translation Function ---
+ def translate_text(model, tokenizer, text: str, src_lang: str) -> str:
+     """Translates a single string of text to English."""
+     if src_lang not in SUPPORTED_LANGUAGES:
+         return f"Language '{src_lang}' not supported."
+
+     tokenizer.src_lang = SUPPORTED_LANGUAGES[src_lang]
+     inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
+
+     generated_tokens = model.generate(
+         **inputs,
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
+         max_length=128,
+     )
+
+     return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
+
+ # --- 4. Interactive Translation Loop ---
+ if __name__ == "__main__":
+     # Select model path based on language
+     lang_choice = input(f"Choose a language ({list(SUPPORTED_LANGUAGES.keys())}): ").lower()
+     if lang_choice not in SUPPORTED_LANGUAGES:
+         print("Invalid language choice.")
+         exit()
+
+     # For now, we assume a single model path. This can be extended.
+     model_path = "models/nllb-finetuned-nepali-en"
+     model, tokenizer = load_model_and_tokenizer(model_path)
+
+     if model and tokenizer:
+         print(f"\n--- Interactive Translation ({lang_choice.capitalize()}) ---")
+         print(f"Enter a {lang_choice} sentence to translate to English.")
+         print("Type 'exit' to quit.\n")
+
+         while True:
+             text_to_translate = input(f"{lang_choice.capitalize()}: ")
+             if text_to_translate.lower() == "exit":
+                 break
+
+             if not text_to_translate.strip():
+                 print("Please enter some text to translate.")
+                 continue
+
+             english_translation = translate_text(model, tokenizer, text_to_translate, lang_choice)
+             print(f"English: {english_translation}\n")
requirements.txt ADDED
@@ -0,0 +1,92 @@
+ accelerate==1.10.1
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.12.15
+ aiosignal==1.4.0
+ annotated-types==0.7.0
+ anyio==4.11.0
+ attrs==25.3.0
+ beautifulsoup4==4.14.2
+ certifi==2025.10.5
+ charset-normalizer==3.4.3
+ click==8.3.0
+ colorama==0.4.6
+ datasets==4.1.1
+ dill==0.4.0
+ dnspython==2.8.0
+ email-validator==2.3.0
+ evaluate==0.4.6
+ fastapi==0.118.0
+ fastapi-cli==0.0.13
+ fastapi-cloud-cli==0.3.0
+ filelock==3.19.1
+ frozenlist==1.7.0
+ fsspec==2025.9.0
+ h11==0.16.0
+ httpcore==1.0.9
+ httptools==0.6.4
+ httpx==0.28.1
+ huggingface-hub==0.35.3
+ idna==3.10
+ itsdangerous==2.2.0
+ Jinja2==3.1.6
+ langdetect==1.0.9
+ lxml==6.0.2
+ markdown-it-py==4.0.0
+ MarkupSafe==3.0.3
+ mdurl==0.1.2
+ mpmath==1.3.0
+ multidict==6.6.4
+ multiprocess==0.70.16
+ networkx==3.5
+ numpy==2.3.3
+ orjson==3.11.3
+ packaging==25.0
+ pandas==2.3.3
+ portalocker==3.2.0
+ propcache==0.4.0
+ protobuf==6.32.1
+ psutil==7.1.0
+ pyarrow==21.0.0
+ pydantic==2.11.10
+ pydantic-extra-types==2.10.5
+ pydantic-settings==2.11.0
+ pydantic_core==2.33.2
+ Pygments==2.19.2
+ PyMuPDF==1.26.4
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.1.1
+ python-multipart==0.0.20
+ pytz==2025.2
+ PyYAML==6.0.3
+ regex==2025.9.18
+ requests==2.32.5
+ rich==14.1.0
+ rich-toolkit==0.15.1
+ rignore==0.7.0
+ sacrebleu==2.5.1
+ safetensors==0.6.2
+ sentencepiece==0.2.1
+ sentry-sdk==2.39.0
+ setuptools==80.9.0
+ shellingham==1.5.4
+ six==1.17.0
+ sniffio==1.3.1
+ soupsieve==2.8
+ starlette==0.48.0
+ sympy==1.14.0
+ tabulate==0.9.0
+ tokenizers==0.22.1
+ torch==2.8.0
+ tqdm==4.67.1
+ transformers==4.57.0
+ typer==0.19.2
+ typing-inspection==0.4.2
+ typing_extensions==4.15.0
+ tzdata==2025.2
+ ujson==5.11.0
+ urllib3==2.5.0
+ uvicorn==0.37.0
+ watchfiles==1.1.0
+ websockets==15.0.1
+ xxhash==3.6.0
+ yarl==1.20.1
test_analysis.py ADDED
@@ -0,0 +1,84 @@
+ import os
+ import sys
+ import codecs
+ import torch
+ from transformers import M2M100ForConditionalGeneration, NllbTokenizerFast
+
+ def translate_text(text, model, tokenizer, src_lang, target_lang="eng_Latn"):
+     """Translates a single text string."""
+     try:
+         tokenizer.src_lang = src_lang
+         # Move inputs to the same device as the model to avoid CPU/GPU mismatches.
+         inputs = tokenizer(text, return_tensors="pt").to(model.device)
+         generated_tokens = model.generate(
+             **inputs,
+             forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
+             max_length=512
+         )
+         translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
+         return translated_text
+     except Exception as e:
+         return f"An error occurred during translation: {e}"
+
+ def main():
+     """Main function to load the model and run a test translation."""
+     # Reconfigure stdout to handle UTF-8 encoding
+     sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
+
+     # --- Configuration ---
+     script_dir = os.path.dirname(os.path.abspath(__file__))
+     nepali_model_path = os.path.join(script_dir, "models", "nllb-finetuned-nepali-en")
+
+     # --- Model Loading ---
+     print("Loading Nepali model and tokenizer...")
+     try:
+         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         nepali_model = M2M100ForConditionalGeneration.from_pretrained(nepali_model_path).to(device)
+         nepali_tokenizer = NllbTokenizerFast.from_pretrained(nepali_model_path)
+         print("Nepali model and tokenizer loaded successfully.")
+     except Exception as e:
+         print(f"Error loading Nepali model or tokenizer: {e}")
+         return
+
+     # --- Nepali Translation ---
+     nepali_sentences = [
+         "जडान बिन्दु थप्नुहोस्",
+         "स्टिकी नोट आयात पूरा भयो",
+         "मोनोस्पेस १२",
+         "पानी जेट पम्पमा दुईवटा भित्रिने र एउटा बाहिरिने पाइप हुन्छन् र एक भित्र अर्को सिद्धान्त अनुरूप दुईवटा पाइप हुन्छन् । पानीको प्रविष्टिमा एउटा पानी जेटले केही ठूलो पाइपमा पूरा चापले टुटीबाट बाहिर फाल्दछ । यस्तो तरिकाले पानी जेटले वायू वा तरललाई दोस्रो प्रविष्टिबाट टाढा पुर्याउदछ । ड्रिफ्टिङ तरलमा ऋणात्मक चापको कारणले यस्तो हुन्छ । त्यसैले यो हाइड्रोडायनमिक विरोधाभाषको एउटा अनुप्रयोग हो । यसले ड्रिफ्टिङ तरल नजिकका वस्तु टाढा फाल्नुको साटोमा सोस्ने कुरा बताउदछ ।",
+         "वस्तुको परिवर्तन बचत गर्नुहोस् ।",
+         "तिमीलाई कस्तो छ",
+         "तिमी को हौ",
+         "कति बज्यो"
+     ]
+
+     print("\n--- Nepali to English Translation Analysis ---")
+     for sentence in nepali_sentences:
+         print(f"\nOriginal (ne): {sentence}")
+         translated_text = translate_text(sentence, nepali_model, nepali_tokenizer, src_lang="npi_Deva")
+         print(f"Translated (en): {translated_text}")
+
+     # --- Sinhala Translation ---
+     # NOTE: No fine-tuned model for Sinhala was found. Reusing the Nepali fine-tuned model for now.
+     print("\n\n--- Sinhala to English Translation Analysis ---")
+
+     sinhala_sentences = [
+         "ඩෝසන්මිස් දුරකථනයෙන් ඩෝසන්මිස් කවුද සර්",
+         "කවුද ඩෝසන් නැතුව ඉන්නේ ඔව් සර්",
+         "ඔබ එය උත්සාහ කරන්න සර්",
+         "කොහොමද වැඩේ හරිද ඔව් සර්ට ස්තුතියි",
+         "ඔව්, හරි, ස්තුතියි රත්තරං",
+     ]
+
+     for sentence in sinhala_sentences:
+         print(f"\nOriginal (si): {sentence}")
+         translated_text = translate_text(sentence, nepali_model, nepali_tokenizer, src_lang="sin_Sinh")
+         print(f"Translated (en): {translated_text}")
+
+
+ if __name__ == "__main__":
+     main()
test_translation.py ADDED
@@ -0,0 +1,71 @@
+
+ import os
+ import sys
+ import codecs
+ import torch
+ from transformers import M2M100ForConditionalGeneration, NllbTokenizerFast
+
+ def translate_text(text, model, tokenizer, src_lang="npi_Deva", target_lang="eng_Latn"):
+     """Translates a single text string."""
+     try:
+         tokenizer.src_lang = src_lang
+         # Move inputs to the same device as the model to avoid CPU/GPU mismatches.
+         inputs = tokenizer(text, return_tensors="pt").to(model.device)
+         generated_tokens = model.generate(
+             **inputs,
+             forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
+             max_length=512
+         )
+         translated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
+         return translated_text
+     except Exception as e:
+         return f"An error occurred during translation: {e}"
+
+ def main():
+     """Main function to load the model and run a test translation."""
+     # Reconfigure stdout to handle UTF-8 encoding
+     sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
+
+     # --- Configuration ---
+     # Construct the absolute path to the model directory to ensure it's found correctly
+     script_dir = os.path.dirname(os.path.abspath(__file__))
+     model_path = os.path.join(script_dir, "models", "nllb-finetuned-nepali-en")
+
+     # --- Model Loading ---
+     print("Loading model and tokenizer...")
+     try:
+         device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         model = M2M100ForConditionalGeneration.from_pretrained(model_path).to(device)
+         tokenizer = NllbTokenizerFast.from_pretrained(model_path)
+         print("Model and tokenizer loaded successfully.")
+     except Exception as e:
+         print(f"Error loading model or tokenizer: {e}")
+         return
+
+     # --- Translation ---
+     sentences_to_translate = [
+         "मेरो नाम जेमिनी हो।",
+         "आज मौसम कस्तो छ?",
+         "मलाई नेपाली खाना मन पर्छ।",
+         "तपाईंलाई कस्तो छ?",
+         "वस्तुको परिवर्तन बचत गर्नुहोस् ।",
+         "तिमीलाई कस्तो छ",
+         "तिमी को हौ",
+         "कति बज्यो",
+         "बाटो कहाँ छ",
+         "फिल्मले सामान्यतया सकारात्मक समीक्षा प्राप्त गर्यो, हिन्दी डब संस्करणमा अत्यन्तै राम्रो प्रदर्शन गर्यो",
+         "इङ्गल्याण्डमा भएको गन्तव्य विवाहको पृष्ठभूमिमा सेट गरिएको, कथाले विवाह योजनाकार जगजिन्दर जोगिन्दर र धर्मपुत्र उत्तराधिकारी आलिया अरोरा बीचको विचित्र प्रेमकथालाई पछ्याउँछ, किनकि उनीहरू विचित्र परिवारहरू, व्यक्तिगत आघातहरू र व्यवस्थित विवाहको बेतुकापनहरू पार गर्छन्।",
+         "साई रा नरसिंह रेड्डीको वास्तविक कथा रायलसीमा क्षेत्रका एक भारतीय स्वतन्त्रता सेनानी उय्यालवाडा नरसिंह रेड्डीमा केन्द्रित छ जसले १८४६ मा ब्रिटिश इस्ट इन्डिया कम्पनी विरुद्ध पहिलो सामूहिक विद्रोहको नेतृत्व गरेका थिए, सिपाही विद्रोहको एक दशक अघि। एक पोलिगर (एक सामन्ती सरदार), रेड्डी र उनका अनुयायीहरूले कृषि प्रणालीमा शोषणकारी परिवर्तनहरू विरुद्ध विद्रोह गरे, जसमा उनीहरूको पुर्खाको जग्गा कब्जा र कम्पनीद्वारा अनुचित कर लगाउने समावेश थियो। प्रारम्भिक विजय पछि, उनलाई पछि १८४७ मा पक्राउ गरियो र फाँसी दिएर मृत्युदण्ड दिइयो, उनको शरीर डर जग्गाउन प्रदर्शन गरियो।"
+     ]
+
+     for sentence in sentences_to_translate:
+         print(f"\nOriginal text (Nepali): '{sentence}'")
+         translated_text = translate_text(sentence, model, tokenizer)
+         print(f"Translated text (English): '{translated_text}'")
+
+
+ if __name__ == "__main__":
+     main()