Best AI Vision Models Tools
Explore 114 AI vision model tools to find the perfect solution.
Vision Models (114 tools)

Recraft V3
A text-to-image generative AI model capable of rendering long passages of text within generated images, available via Replicate’s API.
Real-ESRGAN
An AI-powered image upscaling tool that enlarges images while enhancing details and reducing artifacts, often used for improving image resolution.
CodeFormer
A robust face restoration algorithm designed to repair old photos or improve AI-generated faces, delivering improved image quality.
DeepBrain AI Studios
An AI tool for generating realistic AI avatars and creating text-to-video content tailored for creative projects.
Submagic
An AI-powered video tool that automatically identifies the best moments in your videos and converts them into viral clips.
NSFWGenerator
An AI tool that generates and browses NSFW images through advanced algorithms.
Janus-1.3B
A unified multimodal AI model that decouples visual encoding to support both understanding and generation tasks.
GFPGAN
A practical AI tool for face restoration, capable of enhancing and restoring old and AI-generated faces, available for self-hosting via Docker.
FLUX.1 [dev]
A 12-billion parameter text-to-image model focused on generating high-fidelity images from text with state-of-the-art quality.
OmniGen
OmniGen is a unified image generation model that can generate a wide range of images from multi-modal prompts, simplifying the image generation process without the need for additional network modules or preprocessing steps. It supports various tasks such as text-to-image generation, identity-preserving generation, image editing, and more.
YOLOv10
YOLOv10 is a real-time end-to-end object detection tool that improves upon previous YOLO versions through NMS-free training and a comprehensive architectural design to enhance efficiency and accuracy. It offers state-of-the-art performance across various model sizes and is implemented in PyTorch.
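To make the "NMS-free" claim concrete: earlier YOLO versions rely on non-maximum suppression (NMS), a post-processing step that discards lower-scoring boxes overlapping a higher-scoring one, and YOLOv10 is trained so that this step is unnecessary. A minimal sketch of classic greedy NMS (box format and threshold are illustrative, not YOLOv10 code):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop any remaining box
    that overlaps it above iou_thresh, and repeat on what is left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Removing this loop from the inference path is what lets an NMS-free detector emit final boxes end-to-end.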
BLIP-2
BLIP-2 is an advanced visual-language model that allows zero-shot image-to-text generation, enabling tasks such as image captioning and visual question answering using a combination of pretrained vision and language models.
DeepSeek-VL2
A series of advanced vision-language models designed for multimodal understanding, available in multiple sizes to suit varying complexity and performance requirements.
YOLOv5
YOLOv5 is a popular open-source AI tool aimed at object detection, image segmentation, and image classification, leveraging PyTorch for model building and deployment. It supports various deployment formats including ONNX, CoreML, and TFLite, and is well-documented for ease of use in research and practical applications.
AI Image Generator – Text to Image Models
A platform that hosts various AI models for generating images from text prompts using advanced techniques such as Stable Diffusion and FLUX.1, showcasing models with capabilities including realistic text generation, SVG creation, and high-quality image outputs.
MagicQuill
MagicQuill is an intelligent interactive image editing system that enables precise image modification through AI-powered suggestions and a user-friendly interface, featuring functionalities like local editing and drag-and-drop support.
Clarity AI Upscaler
Clarity AI Upscaler is an advanced image upscaling tool that utilizes Stable Diffusion processes to enhance and recreate details in images, providing users with the option to balance fidelity and creativity through parameters such as diffusion strength. The tool supports tiled diffusion techniques for handling large images and incorporates ControlNet for maintaining structural integrity while enhancing details.
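Tiled diffusion works by cutting a large image into overlapping tiles, processing each independently, and blending the results back together in the overlap regions. A sketch of the tiling geometry only (tile and overlap sizes are illustrative; 2D tiles are the cross-product of row and column spans):

```python
def tile_spans(length, tile, overlap):
    """Start offsets of tiles of size `tile` covering `length` pixels,
    with each tile sharing at least `overlap` pixels with its neighbor."""
    step = tile - overlap
    starts = []
    pos = 0
    while pos + tile < length:
        starts.append(pos)
        pos += step
    starts.append(max(length - tile, 0))  # final tile flush with the edge
    return starts
```

For a 100-pixel side with 40-pixel tiles and 8 pixels of overlap this yields starts `[0, 32, 60]`, so every pixel is covered and seams fall inside blended overlap zones.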
Adobe Firefly
Adobe Firefly is an AI art generator developed by Adobe, enabling users to create images, audio, vectors, and videos from text prompts. It integrates with Adobe Creative Cloud, enhancing workflows with generative AI capabilities such as Text-to-Image, Generative Fill, and more.
JanusFlow-1.3B
JanusFlow-1.3B is a unified multimodal model by DeepSeek that integrates autoregressive language models with rectified flow, enabling both multimodal understanding and image generation.
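Rectified flow, the generation technique JanusFlow integrates, trains a network to predict a constant velocity along the straight line from noise to data; sampling then just integrates that velocity. A scalar sketch of the idea (scalars stand in for image tensors; a real model re-predicts the velocity at every step):

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return (1 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Regression target for a rectified-flow model: a constant velocity."""
    return x1 - x0

def euler_sample(x0, v_pred, steps=4):
    """Integrate dx/dt = v from t=0 to t=1 with `steps` Euler steps."""
    x, dt = x0, 1.0 / steps
    for _ in range(steps):
        x += v_pred * dt  # a trained model would re-predict v here
    return x
```

With a perfectly predicted velocity the straight path is recovered exactly, which is why rectified flow can sample in very few steps.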
Stable Diffusion 3.5 Medium
A Multimodal Diffusion Transformer text-to-image generative model by Stability AI that offers improved image quality, typography, complex prompt understanding, and resource efficiency. It supports local or programmatic use via diffusers, ComfyUI, and API endpoints.
Stable Diffusion 3 Medium
A multimodal diffusion transformer model that generates images from textual descriptions with improvements in image quality, typography, and resource efficiency for creative applications.
Ultralytics YOLOv8
A state-of-the-art object detection model by Ultralytics that provides robust capabilities for object detection, instance segmentation, and pose estimation. It offers both CLI and Python integrations with extensive documentation and performance metrics.
Ultralytics YOLO11
A suite of computer vision models for object detection, segmentation, pose estimation, and classification, integrated with Ultralytics HUB for visualization and training.
DeepSeek-VL2-small
DeepSeek-VL2-small is a variant of the DeepSeek-VL2 series, advanced mixture-of-experts vision-language models designed for multimodal tasks such as visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.
Ultimate SD Upscale with ControlNet Tile
An advanced image upscaling model leveraging Stable Diffusion 1.5 and ControlNet Tile to enhance image quality. Accessible via an API on Replicate and optimized to run with Nvidia A100 GPUs.
Janus-Series
An open-source repository from deepseek-ai that offers a suite of unified multimodal models (including Janus, Janus-Pro, and JanusFlow) designed for both understanding and generation tasks. The models decouple visual encoding to improve flexibility and incorporate advanced techniques like rectified flow for enhanced text-to-image generation.
Anything V4.0
An AI image generation model known for incorporating components from AbyssOrangeMix2 to deliver versatile image synthesis across styles.
Stable Diffusion
A high-resolution image synthesis model that enables users to generate images from textual descriptions, supporting creative and design applications.
ACE++
ACE++ is an instruction-based image creation and editing toolkit that uses context-aware content filling for tasks such as portrait generation, subject-driven image editing, and local editing. The tool supports diffusion-based models, provides installation instructions, demos, and guides for fine-tuning using LoRA, and is hosted on Hugging Face.
Ideogram-V2
Ideogram-V2 is an advanced image generation model that excels in inpainting, prompt comprehension, and text rendering. It is designed to transform ideas into captivating designs, realistic images, innovative logos, and posters. The model is accessible via an API on Replicate and offers unique features for creative image editing.
Mochi 1
Mochi 1 is an open state-of-the-art video generation model by Genmo, featuring a 10 billion parameter diffusion model built on the novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. It generates high-quality videos with high-fidelity motion and strong prompt adherence and is available via an API on Replicate.
Stable Diffusion 2-1
An iteration of Stability AI’s text-to-image model, delivering high-quality image generation from text prompts.
AI Comic Factory
An AI tool that generates illustrated comic panels from text descriptions, enabling creative storytelling.
Shuttle 3 Diffusion
Shuttle 3 Diffusion is a text-to-image diffusion model that generates detailed and diverse images from textual prompts in just 4 steps. It offers enhanced image quality, improved typography, and resource efficiency, and can be integrated via API, Diffusers, or ComfyUI.
Recraft V3 SVG
A text-to-image model focused on generating high-quality SVG images, including logotypes and icons, with controlled text placement.
Allegro
Allegro is an advanced open-source text-to-video generation model by RhymesAI. It converts simple text prompts into high-quality, 6-second video clips at 15 FPS and 720p resolution using a combination of VideoVAE for video compression and a scalable Diffusion Transformer architecture.
minimax/video-01-director
An advanced AI video generation model that creates high-definition 720p videos (up to 6 seconds) with cinematic camera movements. It allows users to control camera movements through both bracketed commands and natural language descriptions.
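A prompt mixing bracketed camera commands with natural language can be split into its two parts with a small parser. A sketch (the command names shown are illustrative, not the model's documented vocabulary):

```python
import re

def split_prompt(prompt):
    """Separate bracketed camera commands like '[Pan left]' from the
    descriptive text of a video prompt."""
    commands = re.findall(r"\[([^\]]+)\]", prompt)
    text = re.sub(r"\s*\[[^\]]+\]\s*", " ", prompt).strip()
    return text, commands
```

For example, `"A ship at sea [Pan left] during a storm [Zoom in]"` splits into the description and the ordered command list.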
LuminaBrush
A creative ML app hosted on Hugging Face Spaces that lets users explore and generate artistic images using community-built AI models.
Stable Diffusion 3.5 Large
Stable Diffusion 3.5 Large is a Multimodal Diffusion Transformer text-to-image generative model developed by Stability AI. It generates images from text prompts with enhanced image quality, typography, and resource efficiency. The model supports integration with Diffusers, ComfyUI, and other programmatic interfaces, and is available under the Stability Community License.
EasyDeepNude
EasyDeepNude is an AI tool that implements a reimagined version of the controversial DeepNude project. It provides both a command-line interface (CLI) and a graphical user interface (GUI) to process and transform photos using deep learning models. The CLI version can be integrated into automated workflows, while the GUI version offers a user-friendly cropping system for easy use. Note: This is an early alpha release and may have compatibility issues.
Ideogram-v2-turbo
A fast text-to-image generation model ideal for quick ideation and providing rough compositional sketches.
xinsir/controlnet-union-sdxl-1.0
A ControlNet++ model for text-to-image generation and advanced image editing. Built on Stable Diffusion XL, it supports over 10 control conditions and advanced features such as tile deblurring, tile variation, super resolution, inpainting, and outpainting. The model is designed for high-resolution, multi-condition image generation and editing.
Mochi 1 Preview
Mochi 1 Preview is an open, state-of-the-art text-to-video generation model by Genmo that leverages a 10 billion parameter diffusion model with a novel Asymmetric Diffusion Transformer architecture. It generates high-fidelity videos from text prompts and is available under an Apache 2.0 license.
olmOCR-7B-0225-preview
A preview release of AllenAI's olmOCR model, fine-tuned from Qwen2-VL-7B-Instruct using the olmOCR-mix-0225 dataset. It is designed for document OCR and recognition, processing PDF images by extracting text and metadata. The model is intended to be used in conjunction with the olmOCR toolkit for efficient, large-scale document processing.
Playground v2.5 – 1024px Aesthetic Model
A diffusion-based text-to-image generative model that produces highly aesthetic images at a resolution of 1024x1024 across various aspect ratios. It outperforms several state-of-the-art models in aesthetic quality and is accessible via an API on Replicate, with integration support for Hugging Face Diffusers.
YOLOv8
A state-of-the-art computer vision model for object detection, segmentation, pose estimation, and classification tasks, designed for speed, accuracy, and ease of use.
Janus-Pro-1B
Janus-Pro-1B is a unified multimodal model by DeepSeek that decouples visual encoding for multimodal understanding and generation. It supports both image input (via SigLIP-L) for understanding and image generation using a unified transformer architecture.
Stable Virtual Camera
A 1.3B diffusion model for novel view synthesis that generates 3D consistent novel views and videos from multiple input images and freely specified target camera trajectories. It is designed for research and creative non-commercial use.
Easel AI
An AI tool that offers advanced face swap and avatar generation, preserving user likeness and enabling creative image manipulations.
Hunyuan3D 2.0
A diffusion-based model for generating high-resolution textured 3D assets, featuring a two-stage pipeline with a shape generation component (Hunyuan3D-DiT) and a texture synthesis component (Hunyuan3D-Paint). It supports both image-to-3D and text-to-3D workflows, and includes a user-friendly production platform (Hunyuan3D-Studio) for mesh manipulation and animation.
Wan2.1-T2V-14B
Wan2.1-T2V-14B is an advanced text-to-video generation model that offers state-of-the-art performance, supporting both 480P and 720P resolutions. It is part of the Wan2.1 suite and excels in multiple tasks including text-to-video, image-to-video, video editing, and even generating multilingual text (Chinese and English) within videos. The repository provides detailed instructions for single and multi-GPU inference, prompt extension methods, and integration with tools like Diffusers and ComfyUI.
FLUX.1 Redux
An adapter for FLUX.1 base models that generates slight variations of a given image, enabling creative refinements and flexible high-resolution outputs.
Flux.1
The official inference repository for FLUX.1 models, offering AI-powered text-to-image and inpainting services, maintained in collaboration with its authors.
Florence-2-large
An advanced vision foundation model by Microsoft designed for a wide range of vision and vision-language tasks such as captioning, object detection, OCR, and segmentation. It uses a prompt-based, sequence-to-sequence transformer architecture pretrained on the FLD-5B dataset and supports both zero-shot and finetuned settings.
GUI-R1
GUI-R1 is a generalist R1-style vision-language action model designed for GUI agents that leverages reinforcement learning and policy optimization to automatically control and interact with graphical user interfaces across multiple platforms (Windows, Linux, macOS, Android, Web).
YOLOv8
A state-of-the-art object detection, segmentation, and classification model known for its speed, accuracy, and ease of use in computer vision tasks.
ghibli-easycontrol
An open-source model hosted on Replicate that transforms input images with a Ghibli-style aesthetic, offering high-quality, fast, and cost-effective image translation via an API.
LHM
LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds is an open-source implementation for reconstructing and animating 3D human models from a single image. It offers GPU-optimized pipelines, Docker support, and integration with animation frameworks like ComfyUI.
topazlabs/image-upscale
An AI-powered, professional-grade image upscaling tool by Topaz Labs. It offers multiple enhancement models (Standard, Low Resolution, CGI, High Fidelity, Text Refine) to upscale images up to 6x with options for facial enhancement, making it ideal for improving various image types including digital art and text-heavy photos.
Kernel/sd-nsfw
A Stable Diffusion v1-5 NSFW REALISM model variant hosted on Hugging Face. It is a diffusion-based text-to-image generation model fine-tuned for generating photo-realistic images, including NSFW content, and is intended for research purposes. It can be used with the Diffusers library and offers options for both direct inference and fine-tuning.
Anything V5
A text-to-image diffusion model from the Anything series designed for anime-style image generation. The model is available in multiple variants (e.g., V5-Prt) and is optimized for precise prompt-based outputs. It leverages Stable Diffusion pipelines and is hosted on Hugging Face with detailed versioning and usage instructions.
fofr/color-matcher
A model hosted on Replicate that performs color matching and white balance correction for images via an API. It allows users to automatically adjust image colors to achieve better balance.
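The simplest form of color matching shifts each channel of the source image so its mean and standard deviation match the reference (Reinhard-style statistics transfer). A single-channel sketch of that idea, not the hosted model's actual algorithm:

```python
from statistics import mean, pstdev

def match_channel(source, reference):
    """Scale and shift `source` pixel values so their mean and standard
    deviation match those of `reference` (one color channel)."""
    s_mu, s_sd = mean(source), pstdev(source)
    r_mu, r_sd = mean(reference), pstdev(reference)
    scale = r_sd / s_sd if s_sd else 1.0  # guard against flat channels
    return [(v - s_mu) * scale + r_mu for v in source]
```

Applying this to the R, G, and B channels independently gives a basic white-balance/color-cast correction.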
UniRig
UniRig is an AI-based unified framework for automatic 3D model rigging. It leverages a GPT-like transformer to predict skeleton hierarchies and per-vertex skinning weights, automating the traditionally time-consuming rigging process for diverse 3D assets including humans, animals, and objects.
Stable Diffusion v1.5
A latent diffusion-based text-to-image generation model that produces photorealistic images from text prompts. It builds upon the Stable Diffusion v1.2 weights and is fine-tuned for improved classifier-free guidance. It can be used via the Diffusers library, ComfyUI, and other interfaces.
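Classifier-free guidance, which this model was fine-tuned to improve, combines the model's unconditional and text-conditioned noise predictions at each denoising step: `eps = eps_uncond + scale * (eps_cond - eps_uncond)`. A minimal sketch on plain lists (in practice these are latent tensors):

```python
def cfg(uncond, cond, scale=7.5):
    """Classifier-free guidance: push the prediction away from the
    unconditional output toward the text-conditioned one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

A scale of 1.0 returns the conditional prediction unchanged; larger scales follow the prompt more strongly at some cost in diversity.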
VLM-R1
VLM-R1 is a stable and generalizable R1-style large Vision-Language Model designed for visual understanding tasks such as Referring Expression Comprehension (REC) and Out-of-Domain evaluation. The repository provides training scripts, multi-node and multi-image input support, and demonstrates state-of-the-art performance with RL-based fine-tuning approaches.
Hunyuan3D-2.0
An AI application that generates high-resolution 3D models from images or text descriptions, enabling creative 3D content creation through AI.
Kling Lip Sync
Kling Lip Sync is an API that changes the lip movements of a person in a video to match supplied audio or text. It allows users to add lip-sync to any video, integrating video content with new audio inputs. The model sends data from Replicate to Kuaishou and offers pricing based on the seconds of video generated.
Stable Diffusion XL Base 1.0
A diffusion-based text-to-image generative model developed by Stability AI. This model uses a latent diffusion approach with dual fixed text encoders, and can be used standalone or combined with a refinement model for enhanced high-resolution outputs. It supports both direct image generation and img2img workflows leveraging SDEdit.
HiDream-I1
An open-source image generative model with 17B parameters, delivering state-of-the-art image generation quality, accompanied by a dedicated Hugging Face Space for experimentation.
Shakker-Labs/AWPortraitCN2
A text-to-image model focused on generating portraits with Eastern aesthetics. The updated version expands character depiction across various age groups and themes including cuisine, architecture, traditional ethnic costumes, and diverse environments. It is based on the stable-diffusion/flux framework and released under a non-commercial license.
HeyGem
HeyGem is an open-source AI avatar project that enables offline video synthesis on Windows. It precisely clones your appearance and voice to generate ultra-realistic digital avatars, allowing users to create personalized videos without an internet connection.
FLUX.1 Kontext
An AI tool that merges two images into a single cohesive output using creative image blending with text prompts.
FLUX.1 Kontext
FLUX.1 Kontext is a new image editing model from Black Forest Labs that leverages text prompts for precise image modifications, including color swaps, background edits, text replacements, style transfers, and aspect ratio changes. It features multiple variants (Pro, Max, and an upcoming Dev) along with a conversational interface (Kontext Chat) to simplify the editing process.
Flux1.1 Pro – Ultra
Flux1.1 Pro – Ultra is an advanced text-to-image diffusion model by Black Forest Labs available on Replicate. It offers ultra mode for generating high-resolution images (up to 4 megapixels) at impressive speeds (around 10 seconds per sample) and a raw mode that produces images with a more natural, candid aesthetic.
Flux-uncensored
Flux-uncensored is a text-to-image diffusion model hosted on Hugging Face by enhanceaiteam. It leverages the stable-diffusion pipeline, LoRA, and the fluxpipeline to generate images from text prompts. The model is marked as 'Not-For-All-Audiences', indicating that it might produce sensitive content.
FLUX.1 Kontext – Text Removal
A dedicated application built on the FLUX.1 Kontext image editing model from Black Forest Labs that removes all text from an image. The tool is available on Replicate with API access and a playground for experimentation, showcasing its specialized text removal functionality.
FLUX Kontext max - Multi-Image List
An AI tool that combines multiple images using FLUX Kontext Max, a premium image editing model from Black Forest Labs. It accepts a list of images to creatively merge them and produce enhanced, text-guided composite outputs. The tool is available on Replicate and is designed for versatile image editing tasks, including creative compositing and improved typography generation.
nanoVLM
A lightweight, fast repository for training and fine-tuning small vision-language models using pure PyTorch.
FLUX.1 Fill [dev]
FLUX.1 Fill [dev] is a 12-billion parameter rectified flow transformer developed by Black Forest Labs designed for text-guided inpainting. It fills specific areas in an existing image based on a textual description, enabling creative image editing workflows. It comes with a non-commercial license and integrates seamlessly with diffusers.
FLUX.1
FLUX.1 is an open-source state-of-the-art text-to-image generation model developed by Black Forest Labs. It excels in prompt adherence, visual detail, and diverse output quality. Available via Replicate's API, FLUX.1 comes in three variants (pro, dev, schnell) with different pricing models.
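Models hosted on Replicate are invoked by POSTing a JSON prediction request to the HTTP API. A sketch of assembling such a request with only the standard library (the model path follows Replicate's documented endpoint; the prompt and token are placeholders, and the actual network call is left commented out):

```python
import json
import urllib.request

API_URL = "https://api.replicate.com/v1/models/black-forest-labs/flux-schnell/predictions"

def build_request(prompt, token):
    """Assemble (but do not send) a Replicate prediction request."""
    body = json.dumps({"input": {"prompt": prompt}}).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = build_request("a lighthouse at dawn, watercolor", token="r8_...")
# urllib.request.urlopen(req)  # requires a valid Replicate API token
```

The official `replicate` Python client wraps this same endpoint in a single `replicate.run(...)` call.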
Ideogram 3.0
Ideogram 3.0 is a text-to-image generation model available on Replicate that offers three variants—Turbo, Balanced, and Quality—to cater for fast iterations, balanced outputs, and high-fidelity results. It delivers improved realism, enhanced text rendering, precise layout generation, and advanced style transfer capabilities, making it ideal for graphic design, marketing, and creative visual content creation.
ComfyUI-RMBG
A custom node for ComfyUI that provides advanced image background removal and segmentation (including object, face, clothes, and fashion segmentation) by integrating multiple models like RMBG-2.0, INSPYRENET, BEN, BEN2, BiRefNet, SAM, and GroundingDINO.
inswapper
inswapper is an open-source, one-click face swapper and restoration tool powered by insightface. It utilizes ONNX runtime for inference, along with integration of face restoration techniques (e.g., CodeFormer) to enhance image quality and produce realistic face swaps.
Veo 3
Veo 3 is an AI-powered video generation model from Google DeepMind that produces both visuals and native audio, including sound effects, ambient noise, dialogue, and accurate lip-sync. It delivers hyperrealistic motion and strong prompt adherence, and can even generate video game worlds, making it a versatile media generation tool.
Google Veo 3
A text-to-video generation tool from Google DeepMind, featuring native audio generation and improved prompt adherence for hyperreal outputs.
Wan2.1-I2V-14B-720P
An advanced Image-to-Video generation model from the Wan2.1 suite by Wan-AI that produces high-definition 720P videos from input images. It features state-of-the-art performance, supports multiple tasks including text-to-video, video editing, and visual text generation in both Chinese and English, and is optimized for consumer-grade GPUs.
Depth Anything V2
An interactive Hugging Face Space that leverages deep learning to generate depth maps from images. This tool extracts depth information from 2D images, which can be used for creative 3D effects, image editing, or further computer vision tasks.
Recraft V3
A text-to-image generation model specialized in creating images with long text and diverse styles, ensuring precise control over content layout.
test-yash-model-4-new-2
A custom diffusion-based model designed for generating unique fashion designs from text prompts. The API reference page provides detailed parameters for controlling aspects like prompt strength, aspect ratio, model selection, and output format.
IP-Adapter
IP-Adapter is a lightweight image prompt adapter developed by Tencent AI Lab that enables pre-trained text-to-image diffusion models to incorporate image prompts along with text prompts for multimodal image generation. With only 22M parameters, it offers comparable or improved performance compared to fine-tuned models and supports integration with various controllable generation tools.
Realistic Vision V6.0 B1 noVAE
Realistic Vision V6.0 "New Vision" is a beta diffusion-based text-to-image model focused on realism and photorealism. It is released on Hugging Face and provides detailed guidelines on resolutions, generation parameters, and recommended workflows (including using a VAE for quality improvements).
Kimi-VL-A3B-Thinking
Kimi-VL-A3B-Thinking is an efficient open-source Mixture-of-Experts vision-language model specialized in long-context processing and extended chain-of-thought reasoning. With a 128K context window and only 2.8B activated LLM parameters, it excels in multimodal tasks including image and video comprehension, OCR, mathematical reasoning, and multi-turn agent interactions.
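A Mixture-of-Experts model keeps most parameters dormant per token: a router scores all experts, and only the top-k receive nonzero softmax weights, which is how a large model can activate only 2.8B LLM parameters. A toy sketch of top-k gating (expert count, scores, and k are illustrative):

```python
import math

def top_k_gates(scores, k=2):
    """Softmax over the k highest router scores; every other expert gets
    weight 0, so only k experts' parameters run for this token."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exp = {i: math.exp(scores[i]) for i in top}
    total = sum(exp.values())
    return [exp[i] / total if i in exp else 0.0 for i in range(len(scores))]
```

The gate values weight the selected experts' outputs; they sum to one, preserving the scale of the combined result.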
Shap-E
Shap-E is an official GitHub repository by OpenAI for generating 3D implicit functions conditioned on text or images. It provides sample notebooks and usage instructions for converting text prompts or images into 3D models, making it a practical tool for generating 3D objects.
Juggernaut-XL v8
Juggernaut-XL v8 is a fine-tuned text-to-image diffusion model built on Stable Diffusion XL, designed for photo-realistic art generation. It is part of the RunDiffusion suite and is intended for creative visual content generation, though it cannot be used behind API services. Business inquiries and commercial licensing are available via email.
FLUX.1 Kontext [dev]
FLUX.1 Kontext [dev] is a state-of-the-art, open-weight text-based image editing model developed by Black Forest Labs. It enables detailed image edits using text prompts, such as style transfer, object modifications, text replacement, background swapping, and preserving character consistency. The model offers clear instructions on best prompting practices and is available under a non-commercial license with commercial use options via Replicate.
StoodioAI Fashion Model
A custom-trained model for generating unique fashion designs, available via API on the Replicate platform.
Photoshop Fusion Beta
An AI-powered beta extension for Photoshop aimed at enhancing digital creativity through generative image editing features.
UndressAI
UndressAI is an AI-powered undressing tool that processes images to generate realistic undressed versions. The platform emphasizes speed, high-quality outputs, and robust enterprise-grade security, aiming to outperform competitors by addressing issues like outdated technology and poor privacy found in similar tools.
AI-WebTV
AI-WebTV is a live, automated video generation demonstration hosted on Hugging Face Spaces. It streams generated video content using a fine-tuned Modelscope-based model (producing outputs similar to the Zeroscope model), and features an automated prompt database with different themes. The project serves as a public demonstration with research-only guidelines to avoid violent or excessively gory content.
stoodioai/test-yash-model-4-new-2
A custom trained generative image model that produces unique fashion designs. It supports text-to-image and image-to-image (inpainting) modes via an API, with configurable parameters such as prompt, aspect ratio, model type, and output quality.
AI Undresser
An AI-powered tool available via Replicate, designed for specialized image processing and transformation tasks.
Recraft V3 SVG
A text-to-image generative model that produces high-quality SVG (vector) images including logos, icons, and branded designs. It offers precise control over text and image placement, supports a variety of styles, and allows brand style customization by uploading reference images.
SmolVLM
SmolVLM is a 2B parameter vision-language model that is small, fast, and memory-efficient. It builds on the Idefics3 architecture with modifications such as an improved visual compression strategy and optimized patch processing, making it suitable for local deployment, including on laptops. All model checkpoints, training recipes, and tools are released open-source under the Apache 2.0 license.
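Vision-language models like SmolVLM feed images to the transformer as a sequence of fixed-size patches; its "optimized patch processing" builds on this standard ViT step. A sketch of plain non-overlapping patch extraction (patch size illustrative, nested lists standing in for a pixel array):

```python
def patchify(image, patch=2):
    """Split an H x W image (nested lists) into non-overlapping
    patch x patch blocks in row-major order -- the token sequence
    a ViT-style encoder consumes."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    return [
        [row[x:x + patch] for row in image[y:y + patch]]
        for y in range(0, h, patch)
        for x in range(0, w, patch)
    ]
```

Compressing or merging these patches before the language model is the main lever small VLMs use to cut memory and latency.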
Flux Schnell
A fast text-to-image generation model optimized for local development and personal use, developed by Black Forest Labs. It provides an API for rapid text-to-image synthesis, making it ideal for personal projects and local experimentation.
Ideogram v2 Inpainting Model
Ideogram v2 is a high-quality inpainting model available via Replicate’s API. It comes in two variants – the best quality version and a faster 'turbo' variant – and is adept at not only inpainting images but also generating new images (including effective text generation) for various creative applications.
FLUX Family of Models (Black Forest Labs)
A suite of API-accessible image generation and editing models that enable users to generate high-resolution images from text prompts, perform advanced inpainting, outpainting, edge-guided editing, and rapid image variation. The collection includes variants optimized for realism (FLUX1.1 Pro Ultra), speed (FLUX.1 Schnell), and prototyping (FLUX.1 Dev), among others.
FLUX.1 Kontext
FLUX.1 Kontext is an advanced image editing model from Black Forest Labs that enables users to modify images through text prompts. It supports various editing tasks such as style transfer, text editing, and character consistency adjustments. It is available in multiple variants (Pro, Max, and an upcoming Dev version) to balance quality and speed.
FLUX.1
FLUX.1 is an innovative text-to-image generative model that uses a novel flow matching technique instead of traditional diffusion. It produces images with a distinctive, fluid aesthetic, achieves faster generation speed, and offers refined control over light, texture, and composition. An optimized variant (FLUX.1 [schnell]) is available for local execution on Replicate.
FLUX.1 Redux [dev]
An open-weight image variation model by Black Forest Labs that generates new image versions while preserving key elements of the original.
Google Gemini 2.5 Flash Image
A state-of-the-art text-to-image generation and editing model from Google, designed for fast, conversational, multi-turn creative workflows. It offers native image creation, multi-image fusion, consistent character and style maintenance, conversational natural language editing, visual reasoning, and embeds SynthID watermarks. The tool is accessible via the Gemini API, Google AI Studio, and Vertex AI.
HiDream-I1-Full
HiDream-I1-Full is an open-source text-to-image generative foundation model with 17B parameters. Built using a sparse diffusion transformer, it delivers state-of-the-art image quality across multiple styles (photorealistic, cartoon, artistic, etc.) and boasts best-in-class prompt following as demonstrated by benchmark evaluations such as HPSv2.1, GenEval, and DPG-Bench. The model is commercially friendly and includes a Gradio demo and detailed inference scripts for easy deployment.
sparklearningstudiollc/nikujkakdiya-new-model
An AI image generation API model hosted on Replicate that supports text-to-image, image-to-image, and inpainting modes. It offers extensive configuration options including prompt strength, custom dimensions, aspect ratio, LoRA weight integration, and various output settings for generating images according to user prompts.
ByteDance Seedream 4
Seedream 4.0 is ByteDance’s unified text-to-image generation and image editing model. It supports high-resolution (up to 4K) and fast inference along with natural language prompt editing, multi-reference input, batch workflows, and versatile style transfer.
Fooocus
Fooocus is an open-source, offline image generation tool built on the Stable Diffusion XL architecture and Gradio. It streamlines image generation by replacing manual parameter tweaking with prompt-based generation, requiring minimal GPU memory (4 GB) and few user interactions to produce images.