AI & ML interests

Tools for creating and exploring datasets

prithivMLmods 
posted an update about 21 hours ago
One speech model with seven voices, streamlined with multimodal capabilities for vision tasks. It performs vision (image-text) to audio inference with Qwen2.5-VL + VibeVoice-Realtime-0.5B. Vision to VibeVoice (EN): the demo is live. 🗣️🔥

🤗 Vision-to-VibeVoice-en [Demo]: prithivMLmods/Vision-to-VibeVoice-en
✨ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨ Speech [VibeVoice-Realtime-0.5B]: microsoft/VibeVoice-Realtime-0.5B
✨ Vision [Qwen2.5-VL]: Qwen/Qwen2.5-VL-7B-Instruct

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 5 days ago
Hello everyone,

The strangerzonehf [HF] Community / Organization Page, which is maintained by me, has reached the Top 10 Developer Pages ranking at 6th place, contributing 3.4% in the calendar cycle from August 2024 to August 2025. It is also the only South Asia / Indian page in the list. I could not be more proud to be doing things for the community. ❤️🤗

Source: https://www.dataprovenance.org/economies-of-open-intelligence.pdf

It is a pleasure to be a part of it.
Thank you!
@prithivMLmods
prithivMLmods 
posted an update 9 days ago
Introducing the Super-OCRs Demo, a comparison of state-of-the-art multimodal OCR VLMs, including HunyuanOCR, DeepSeek-OCR, Dots.OCR, and Nanonets-OCR2, in one Space for performing OCR, rendering LaTeX and Markdown, and visual grounding (layout). Find the related Spaces and models below. 🤗🔥

✨Super-OCRs[Demo]: prithivMLmods/Super-OCRs-Demo
✨Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
✨GitHub: https://github.com/PRITHIVSAKTHIUR/Super-OCRs-Demo

⭐ Models Used:
✦ HunyuanOCR: tencent/HunyuanOCR
✦ DeepSeek-OCR: (-) deepseek-ai/DeepSeek-OCR (+) prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
✦ Dots.OCR: (-) rednote-hilab/dots.ocr (+) prithivMLmods/Dots.OCR-Latest-BF16
✦ Nanonets-OCR2-3B: nanonets/Nanonets-OCR2-3B

⭐ Some Other Relevant Apps:
✦ Qwen3-VL-HF-Demo: prithivMLmods/Qwen3-VL-HF-Demo
✦ Qwen3-VL-Outpost: prithivMLmods/Qwen3-VL-Outpost
✦ Multimodal-OCR: prithivMLmods/Multimodal-OCR
✦ Multimodal-OCR2: prithivMLmods/Multimodal-OCR2
✦ Multimodal-OCR3: prithivMLmods/Multimodal-OCR3
✦ DeepSeek-OCR-experimental: prithivMLmods/DeepSeek-OCR-experimental

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 13 days ago
Introducing the advanced sketch-board editor "Nano-Banana-Pro-Sketch-Board" powered by the Gemini 2.5 Flash Image and Gemini 3 Pro Preview Image models through the Gemini API. This version includes more features than the Nano-Banana-AIO app for drawing and prompt-based concept transformation of freestyle sketches. 🔥🍌

✨Nano-Banana-Pro-Sketch-Board: prithivMLmods/Nano-Banana-Pro-Sketch-Board
✨Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
✨Github: https://github.com/PRITHIVSAKTHIUR/Nano-Banana-Pro-Sketch-Board
✨Model-Garden: https://tinyurl.com/4xxs9dvy

Some Other Relevant Apps [OSS]

⭐Qwen-Image-Edit-2509-LoRAs-Fast-Fusion: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast-Fusion
⭐Qwen-Image-Edit-2509-LoRAs-Fast: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
⭐Photo-Mate-i2i: prithivMLmods/Photo-Mate-i2i
⭐Kontext-Photo-Mate-v2: https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2

Note: The Nano-Banana-Pro-Sketch-Board demo requires a Gemini API key for the editing process. Your API key will be removed when the app is reloaded or closed. Your key remains safe and will not be exposed to any medium. Also, the Gemini 3 Pro Preview Image model may require a paid API key from a Google Cloud project with billing enabled.

To know more about it, visit the app info section or the respective Model Garden page!
prithivMLmods 
posted an update 14 days ago
Try the demo of NVIDIA Nemotron Parse v1.1, NVIDIA's latest VLM for understanding document semantics and extracting text and table elements with spatial grounding. It is capable of comprehensive text understanding and document structure analysis in a given document, and can provide bounding boxes with coordinates.
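
To illustrate what working with such bounding boxes might look like, here is a small sketch that scales a box to pixel coordinates. It assumes the coordinates arrive normalized to the 0-1 range as (x_min, y_min, x_max, y_max); that format and the helper name `to_pixel_box` are assumptions for illustration, not the documented output of Nemotron Parse, so check the model page for the actual layout.

```python
def to_pixel_box(box, width, height):
    """Scale a normalized (x_min, y_min, x_max, y_max) box to integer pixels.

    Assumption: inputs are fractions of the page size. Values that fall
    outside the page are clamped to its edges.
    """
    x_min, y_min, x_max, y_max = box

    def clamp(value, upper):
        # Keep the rounded coordinate inside [0, upper].
        return max(0, min(upper, round(value)))

    return (
        clamp(x_min * width, width),
        clamp(y_min * height, height),
        clamp(x_max * width, width),
        clamp(y_max * height, height),
    )


if __name__ == "__main__":
    # Hypothetical box on an 800x600 page image.
    print(to_pixel_box((0.25, 0.1, 0.75, 0.5), 800, 600))
```

The clamping step is a defensive choice: it keeps a slightly out-of-range prediction from producing a crop region outside the image.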

⭐Space[Demo]: prithivMLmods/NVIDIA-Nemotron-Parse-OCR
⭐Model: nvidia/NVIDIA-Nemotron-Parse-v1.1
⭐Multimodal-Spaces: https://huggingface.co/collections/prithivMLmods/multimodal-implementations

Some relevant Spaces

⭐DeepSeek-OCR-experimental [latest transformers]: prithivMLmods/DeepSeek-OCR-experimental
⭐Qwen3-VL-Outpost: prithivMLmods/Qwen3-VL-Outpost
⭐Multimodal-OCR3: prithivMLmods/Multimodal-OCR3

Check out the other spaces in the multimodal implementation collection.

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 17 days ago
Try the all-new trending Qwen-Image-Edit-2509 (Multi-Image-Edits) specialized adapter demos, including Cloth-Design-Fuse, Texture Edit, Guided-Objects-Patching, and more — all in a single Hugging Face Space. The demo link is provided below. 🤗🔥

⮞ Space[Demo]: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast-Fusion
⮞ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
⮞ Base Model: Qwen/Qwen-Image-Edit-2509

Similar applications↗️

⮞ Kontext-Photo-Mate-v2: https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2
⮞ Photo-Mate-i2i: prithivMLmods/Photo-Mate-i2i
⮞ Qwen-Image-Edit-2509-LoRAs-Fast: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 18 days ago
Made a demo Space for multimodal understanding with Qwen3-VL, covering tasks including point annotation, detection, captioning, guided text inference, and more. Find the demo link below. 🤗↗️

⮞ Space[Demo]: prithivMLmods/Qwen3-VL-HF-Demo
⮞ Model Used: Qwen/Qwen3-VL-4B-Instruct
⮞ Collection: https://huggingface.co/collections/prithivMLmods/multimodal-implementations
⮞ GitHub: https://github.com/PRITHIVSAKTHIUR/Qwen-3VL-Multimodal-Understanding

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 21 days ago
Wrote a short write-up and experimental fine-tuning guide for MetaCLIP 2 on downstream image classification tasks. The blog post, titled "Fine Tuning MetaCLIP 2 for Image Classification on Downstream Tasks", walks through fine-tuning step by step using CIFAR10 and can be adapted to other datasets. For more details, check out the linked blog below. 🤗↗️

⮞ Blog Article: https://huggingface.co/blog/prithivMLmods/metaclip2-downstream-finetune
⮞ Demo Space[Zero-Shot Classification]: prithivMLmods/metaclip-2-demo

Some other models
╰› MetaCLIP-2-Cifar10: prithivMLmods/MetaCLIP-2-Cifar10
╰› MetaCLIP-2-Age-Range-Estimator: prithivMLmods/MetaCLIP-2-Age-Range-Estimator
╰› MetaCLIP-2-Gender-Identifier: prithivMLmods/MetaCLIP-2-Gender-Identifier
╰› MetaCLIP-2-Open-Scene: prithivMLmods/MetaCLIP-2-Open-Scene

⮞ Collection: https://huggingface.co/collections/prithivMLmods/metaclip2-image-classification-experiments

To know more about it, visit the app page or the respective model page!
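
For readers curious how a zero-shot classification demo like the one linked above typically scores labels, here is a minimal sketch of the CLIP-style recipe: cosine similarity between an image embedding and each label's text embedding, followed by a softmax. The vectors are toy values, the `scale` of 100 is an assumed logit scale, and the function names are hypothetical; none of this calls MetaCLIP 2 itself.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def zero_shot_scores(image_vec, label_vecs, scale=100.0):
    """Softmax over scaled cosine similarities, one probability per label."""
    logits = {label: scale * cosine(image_vec, vec)
              for label, vec in label_vecs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


if __name__ == "__main__":
    # Toy embeddings standing in for real encoder outputs.
    image = [0.9, 0.1, 0.0]
    labels = {"cat": [1.0, 0.0, 0.0], "truck": [0.0, 1.0, 0.0]}
    scores = zero_shot_scores(image, labels)
    print(max(scores, key=scores.get))
```

Fine-tuning, as in the blog post, replaces the text-embedding comparison with a trained classification head; the sketch only covers the zero-shot path.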
prithivMLmods 
posted an update 24 days ago
Try the all-new trending Qwen-Image-Edit specialized adapter demos, including Photo-to-Anime, Light Restoration, Multi-Angle Edits, Relighting, and more — all in a single Hugging Face Space. Below is the demo link. 🤗🌠

⮞ Demo-Space: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast
⮞ How-to-Use: prithivMLmods/Qwen-Image-Edit-2509-LoRAs-Fast#2
⮞ Collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update 28 days ago
Introducing Photo-Mate-v2, based on FLUX.1-Kontext-dev, for advanced image manipulation tasks. It supports transforming scenes into top-down/bottom-up perspectives, CAM-right/left-view and its reverse, as well as general kontext-specified object removal. Below is the list of demos and adapters.🔥🤗

➤ Spaces [Demo] : https://huggingface.co/spaces/prithivMLmods/Kontext-Photo-Mate-v2

Kontext-Adapters :
✦ Kontext-Bottom-Up-View: prithivMLmods/Kontext-Bottom-Up-View
✦ Kontext-CAM-Right-View: prithivMLmods/Kontext-CAM-Right-View
✦ Kontext-Top-Down-View: prithivMLmods/Kontext-Top-Down-View
✦ Kontext-CAM-Left-View: prithivMLmods/Kontext-CAM-Left-View
✦ Kontext-Unblur-Upscale: prithivMLmods/Kontext-Unblur-Upscale
✦ Kontext-0811-exp: prithivMLmods/Kontext-0811-exp

Photo-Mate Collection:
✦ Kontext CAM Angles: https://huggingface.co/collections/prithivMLmods/kontext-cam-angles
✦ i2i - Kontext (exp): https://huggingface.co/collections/prithivMLmods/i2i-kontext-exp
✦ LZO-1 (Lossless Zoom Operator): https://huggingface.co/collections/prithivMLmods/lzo-1-lossless-zoom-operator

Related-Apps:
✦ Photo-Mate [Version 1.0]: prithivMLmods/Photo-Mate-i2i
✦ Image Generation Apps [Collection]: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

To know more about it, visit the app page or the respective model page!
@prithivMLmods
prithivMLmods 
posted an update about 1 month ago
A week ago, I shared a post about a test implementation of DeepSeek-OCR compatibility with the latest transformers (https://tinyurl.com/ykc4mm66). Now I'm releasing the most compatible version, which supports the model on the latest transformers. 🤗🔥

➠ DeepSeek-OCR-Latest-BF16.I64: prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
➠ DeepSeek OCR [exp] : prithivMLmods/DeepSeek-OCR-experimental

✅Supports the latest transformers v4.57.1
✅torch: 2.6.0+cu124 (or) the latest version (i.e., torch 2.9.0)
✅CUDA version: 12.4
✅Users can also opt out of specific attention implementations if desired.
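
The checklist above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the released model code: `parse_version`, `meets_minimum`, `check_environment`, and the `MINIMUMS` table are hypothetical names, and treating v4.57.1 and 2.6.0 as lower bounds is an assumption drawn from the wording of the list.

```python
import re

# Assumed lower bounds, read off the checklist in this post.
MINIMUMS = {"transformers": "4.57.1", "torch": "2.6.0"}


def parse_version(text):
    """Turn '2.6.0+cu124' or 'v4.57.1' into a tuple of ints such as (2, 6, 0)."""
    core = text.lstrip("v").split("+", 1)[0]  # drop a leading 'v' and local tag
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:  # pad so '2.6' compares equal to '2.6.0'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment(installed_versions):
    """Map each required package to True/False; a missing package is False."""
    return {
        name: name in installed_versions
        and meets_minimum(installed_versions[name], minimum)
        for name, minimum in MINIMUMS.items()
    }


if __name__ == "__main__":
    print(check_environment({"transformers": "4.57.1", "torch": "2.6.0+cu124"}))
```

In a real environment, the installed versions could be collected with `importlib.metadata.version("torch")` from the standard library before calling `check_environment`.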

✨Previous version: strangervisionhf/deepseek-ocr-latest-transformers
↗️Related Blog: https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms
✨Community Page: strangervisionhf
✨Original Model Page: deepseek-ai/DeepSeek-OCR

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update about 1 month ago
A short blog post titled "Hall of Multimodal OCR VLMs and Demonstrations" has been published at ↗️ https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms on behalf of strangervisionhf.

It discusses the latest trends in OCR models, the multilingual support offered by modern OCR systems, their unique capabilities, OCR benchmark model comparisons, transformer-based implementations, and strategies for streamlining transformers compatibility.
prithivMLmods 
posted an update about 1 month ago
Implemented DeepSeek-OCR to support the latest transformers on the strangervisionhf page. The page includes the model weights and corrected configuration, which fix the issues and allow transformers inference to run smoothly.🤗🔥

> Model: strangervisionhf/deepseek-ocr-latest-transformers
> Demo Space: prithivMLmods/DeepSeek-OCR-experimental

✅Supports the latest transformers
✅You can also opt out of the attention implementation if needed.
✅Supports torch version 2.6.0 or higher
✅CUDA version: 12.4

If you are interested in experimenting with new things and streamlining compatibility, the strangervisionhf organization is open for you, and you can join the community.

> Multimodal Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0, https://huggingface.co/collections/strangervisionhf/october-2025-models

> Thank you, @merve , for assigning the blazing-fast Zero GPU support!

> Notebook : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepSeek-OCR-Demo/deepseek_ocr_demo.ipynb

To know more about it, visit the app page or the respective model page!
prithivMLmods 
posted an update about 1 month ago
Introducing Gliese-OCR-7B-Post2.0-final, a document content-structure retrieval VLM designed for content extraction (OCR), summarization, and document visual question answering. This is the fourth and final model in the Camel Doc OCR VLM series, following Gliese-OCR-7B-Post1.0. The model delivers superior accuracy across a wide range of document types, including scanned PDFs, handwritten pages, structured forms, and analytical reports.🚀🤗

> Gliese-OCR-7B-Post2.0-final : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> Gliese-OCR-7B-Post1.0 (previous) : prithivMLmods/Gliese-OCR-7B-Post1.0
> Gliese OCR Post-x.0 (collection) : https://huggingface.co/collections/prithivMLmods/gliese-ocr-post-x0
> Multimodal Implementations (collection) : https://huggingface.co/collections/prithivMLmods/multimodal-implementations
> Qwen VL Captions (other-collection) : https://huggingface.co/collections/prithivMLmods/qwen-vl-captions
> Run Demo Here : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> GitHub (4bit) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Gliese-OCR-7B-Post2.0-final(4bit)/Gliese_OCR_7B_Post2_0_final.ipynb

> To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 1 month ago
Here is the official Florence-2 Transformers-converted demo for the following vision models: florence-community/Florence-2-large, florence-community/Florence-2-large-ft, florence-community/Florence-2-base, and florence-community/Florence-2-base-ft. These models support tasks such as object detection, captioning, detailed captioning, more detailed captioning, dense region captioning, region proposal, OCR, and OCR with region. Try the official demo at the link below:

> Space: prithivMLmods/florence2-vision-models
> Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

> To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 2 months ago
Now you can try all the latest state-of-the-art multimodal vision-language models from the Qwen3-VL series demo on Hugging Face Spaces — including 4B, 8B, and 30B (Instruct, 4B-Thinking) variants. I’ve also uploaded the weights for the Abliterated variants of these models, up to 30B parameters. Check out the Spaces and model links below! 🤗🔥

✨ Qwen3-VL[4B,8B]: prithivMLmods/Qwen3-VL-Outpost
✨ Qwen3-VL-30B-A3B-Demo: prithivMLmods/Qwen3-VL-HF-Demo
✨ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Qwen3-VL Abliterated Model Collection [ Version 1.0 ]

⚡Collection: https://huggingface.co/collections/prithivMLmods/qwen3-vl-abliteration-oct-1625-68f0e3e567ef076594605fac

Note: This is version 1.0 of the Abliteration of the Qwen3-VL series of models. It may perform sub-optimally in some cases. If you encounter any issues, please open a discussion.
prithivMLmods 
posted an update about 2 months ago
Introducing Image-Guard-2.0, an experimental, lightweight vision-language encoder model of roughly 0.1B size (<100M parameters), built on SigLIP2 (siglip2-base-patch16-224). Designed for multi-label image classification, the model functions as an image safety system, serving as an image guard or moderator across a wide range of categories, from anime to realistic imagery.

⚡blog-article: https://huggingface.co/blog/prithivMLmods/image-guard-models

It also performs strict moderation and filtering of artificially synthesized content, demonstrating strong detection and handling of explicit images. Image-Guard-2.0 delivers robust performance in streamlined scenarios, ensuring reliable and effective classification across diverse visual inputs.
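
Multi-label classification, as described above, usually means each category gets its own independent score rather than competing in a single softmax. The sketch below shows that thresholding step in isolation; the label names, the 0.5 threshold, and the helper names are hypothetical, and the logits are made-up numbers rather than outputs of Image-Guard-2.0.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def flag_labels(logits, threshold=0.5):
    """Return, sorted, every label whose sigmoid probability meets the threshold.

    Each label is scored independently, so zero, one, or several labels
    can be flagged for the same image.
    """
    probs = {label: sigmoid(value) for label, value in logits.items()}
    return sorted(label for label, p in probs.items() if p >= threshold)


if __name__ == "__main__":
    # Hypothetical per-label logits for a single image.
    print(flag_labels({"anime": 2.0, "explicit": -3.0, "realistic": 0.1}))
```

Raising the threshold makes a moderator stricter about what it flags; the right value depends on the tolerance for false positives in a given deployment.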
prithivMLmods 
posted an update about 2 months ago
Here is the demo of Qwen3-VL-30B-A3B-Instruct, the next-generation vision-language model in the Qwen series. The model delivers comprehensive upgrades across the board, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. 🤗🔥

⚡ Space / App: prithivMLmods/Qwen3-VL-HF-Demo

The model’s demo supports a wide range of tasks, including:
Image Inference, Video Inference, PDF Inference, Image Captioning (VLA), GIF Inference.

⚡ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Thanks for granting the blazing-fast Zero GPU access, @merve 🙏

⚡ Other Pages

> Github: https://github.com/prithivsakthiur/qwen3-vl-hf-demo
> Multimodal VLMs July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
> VL caption — < Sep 15 ’25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
> Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To know more about it, visit the app page or the respective model page!!
prithivMLmods 
posted an update about 2 months ago
Introducing the next-gen version of DeepCaption-VLA (v2.0) — an advanced, multimodal model based on Qwen2.5-VL, specialized for Image Captioning and Vision Language Attribution (VLA). This enhanced release focuses on generating precise, attribute-rich captions that capture visual properties, object attributes, and scene details across diverse image types and aspect ratios. Version 2.0 introduces significant improvements in multilingual inference, delivering higher captioning quality and attribution accuracy in languages including Chinese (Zh), Thai (Th), and more.

🤗 DeepCaption-VLA (v2.0) : prithivMLmods/DeepCaption-VLA-V2.0-7B
🫱 Collection : prithivMLmods/vlm-20-oct-0825-68e606aa6e3993be8a3b1d51
⭐ GitHub (notebook) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption_VLA_V2_0_7B/DeepCaption_VLA_V2_0_7Bipynb.ipynb

Other Pages⚡

➥ Multimodal VLMs July'25 : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➥ VL caption — < Sep 15 ’25 : prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
➥ Multimodal VLMs - Aug'25 : prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To know more about it, visit the app page or the respective model page!!