Florence-2-large - AI Vision Models Tool
Overview
Florence-2-large is a Microsoft vision foundation model for vision and vision-language tasks. It is a prompt-based sequence-to-sequence transformer pretrained on the FLD-5B dataset, and it can be used zero-shot or fine-tuned for tasks such as captioning, object detection, OCR, and segmentation.
Key Features
- Prompt-based sequence-to-sequence transformer architecture
- Pretrained on the FLD-5B dataset
- Supports zero-shot inference
- Supports fine-tuning for downstream tasks
- Handles image captioning
- Performs object detection
- Performs OCR (text extraction from images)
- Supports image segmentation
- Designed as a vision foundation model
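In practice, "prompt-based" means that each task is selected by a special task token passed as the text prompt, optionally followed by extra text. The sketch below uses task tokens as listed on the Hugging Face model card; the TASK_PROMPTS dictionary, its keys, and the build_prompt helper are hypothetical names introduced here for illustration, not part of any library.

```python
# Task tokens as listed on the Florence-2 model card. The dictionary keys
# and the helper below are illustrative names, not a library API.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, text_input=""):
    """Return the prompt string for a task, appending optional text input."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    return TASK_PROMPTS[task] + text_input


print(build_prompt("object_detection"))
print(build_prompt("phrase_grounding", "a green car"))
```

Tasks such as phrase grounding take additional text after the token; tasks such as captioning and OCR use the token alone.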
Ideal Use Cases
- Generate descriptive captions for images
- Detect and localize objects in images
- Extract text from scanned documents
- Produce segmentation masks for images
- Fine-tune for custom vision tasks
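For the detection use case, the sketch below shows one way to consume the output. It assumes the parsed result has the shape shown on the model card for the <OD> task: a dictionary with "bboxes" (each as [x1, y1, x2, y2] in pixels) and "labels". The sample values and the detections helper are made up for illustration.

```python
# Hypothetical parsed result, shaped like the "<OD>" example on the model
# card. The numbers here are invented, not real model output.
sample_result = {
    "<OD>": {
        "bboxes": [[454.0, 276.0, 554.0, 370.0], [34.0, 160.0, 597.0, 371.0]],
        "labels": ["wheel", "car"],
    }
}


def detections(result, task="<OD>"):
    """Pair each label with its box area in square pixels, largest first."""
    parsed = result[task]
    pairs = []
    for label, (x1, y1, x2, y2) in zip(parsed["labels"], parsed["bboxes"]):
        pairs.append((label, (x2 - x1) * (y2 - y1)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for label, area in detections(sample_result):
    print(f"{label}: {area:.0f} px^2")
```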
Getting Started
- Open the model page on Hugging Face: https://huggingface.co/microsoft/Florence-2-large
- Read the model card and available documentation
- Load the model and its processor with the Hugging Face Transformers library (PyTorch)
- Run zero-shot task prompts (for example <CAPTION> or <OD>) on sample images
- Fine-tune with a labeled dataset for specific tasks
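The loading and zero-shot steps above can be sketched as follows. This follows the usage pattern shown on the model card and assumes torch, transformers, and Pillow are installed. The run_task function and its parameters are illustrative names, and the exact loading options (such as trust_remote_code) may differ depending on the installed Transformers version.

```python
MODEL_ID = "microsoft/Florence-2-large"


def run_task(image_path, task="<CAPTION>", text_input=""):
    """Run one Florence-2 task on a local image and return the parsed result."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # The model card loads the model with trust_remote_code=True because the
    # repository ships custom modeling code.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=task + text_input, images=image, return_tensors="pt"
    ).to(device, dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        do_sample=False,
    )
    # Keep special tokens: the post-processing step parses them.
    generated_text = processor.batch_decode(
        generated_ids, skip_special_tokens=False
    )[0]
    return processor.post_process_generation(
        generated_text, task=task, image_size=(image.width, image.height)
    )


# Example call (sample.jpg is a placeholder path; this downloads the weights):
# result = run_task("sample.jpg", task="<OD>")
```

The function reloads the model on every call to keep the sketch short; for repeated use, load the model and processor once and reuse them.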
Pricing
No pricing is listed on the model page. The model weights can be downloaded from Hugging Face; check the model card for license terms, and budget separately for any hosting or compute costs.
Key Information
- Category: Vision Models
- Type: AI Vision Models Tool