
Overview

The Vision Agent is MedMitra’s specialized component for analyzing medical images, particularly radiology scans such as X-rays, CT scans, and MRIs. It uses multimodal vision-language models to extract clinical findings directly from images.

Architecture

The Vision Agent is implemented in backend/agents/vision_agent.py and uses Meta's Llama 4 Scout vision model, served via Groq, for image analysis.

Key Functions

backend/agents/vision_agent.py
async def image_extraction(image_url: str):
    """Vision agent for a single image."""
    
async def vision_agent(case_id: str):
    """Vision agent for all radiology images in a case."""

Image Analysis Process

Single Image Extraction

The image_extraction function processes individual radiology images:
backend/agents/vision_agent.py:19-54
async def image_extraction(image_url: str):
    logger.info(f"Starting vision agent for image ------ {image_url}")

    completion = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": RADIOLOGY_ANALYSIS_PROMPT
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_url
                        }
                    }
                ]
            }
        ],
        temperature=1,
        max_completion_tokens=1024,
        top_p=1,
        stream=False,
        stop=None,
    )

    res = extract_json_from_string(completion.choices[0].message.content)
    return res

Message Format

The function uses the OpenAI-compatible multimodal message format with two content blocks:
  1. Text prompt: Clinical instructions for analysis
  2. Image URL: Direct link to the radiology image
{
  "role": "user",
  "content": [
    {"type": "text", "text": "[RADIOLOGY_ANALYSIS_PROMPT]"},
    {"type": "image_url", "image_url": {"url": "https://..."}}
  ]
}
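Beyond HTTP URLs, OpenAI-compatible vision endpoints (including Groq's) also accept images inline as base64 data URLs in the same image_url slot. A minimal sketch of building the message either way (the helper names here are illustrative, not part of the codebase):

```python
import base64
import mimetypes


def build_vision_message(prompt: str, image_url: str) -> dict:
    # Two content blocks: the clinical prompt, then the image reference
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


def image_to_data_url(path: str) -> str:
    # Encode a local file as a data URL usable in place of an HTTP URL
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as fh:
        payload = base64.b64encode(fh.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Data URLs are useful when the image is not publicly reachable, at the cost of a larger request payload.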

Case-Level Processing

The vision_agent function orchestrates analysis for all radiology files in a case:
backend/agents/vision_agent.py:57-89
async def vision_agent(case_id: str):
    logger.info(f"Starting vision agent for case ------ {case_id}")

    results = await supabase.get_case_files(case_id=case_id)
    mapping = {}
    
    for result in results:
        file_id = result.get("file_id")
        file_url = result.get("file_url")
        file_category = result.get("file_category")

        if file_category == "radiology":
            # Analyze the image
            ai_summary = await image_extraction(file_url)
            logger.info(f"AI Summary for file_id {file_id}: {ai_summary}")
            mapping[file_id] = ai_summary
            
            try:
                # Save results to database
                await supabase.update_case_file_metadata(
                    file_id=file_id, 
                    metadata={"ai_summary": ai_summary}
                )
                logger.info(f"Updated ai_summary for file_id: {file_id}")
            except Exception as e:
                logger.error(f"Failed to update ai_summary for file_id {file_id}: {str(e)}")
    
    return True
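The loop above interleaves filtering and analysis. Splitting the radiology filter into a small pure function makes it independently testable; a sketch (field names follow the snippet above, but this helper itself is not in the codebase):

```python
def select_radiology_files(case_files: list[dict]) -> list[tuple[str, str]]:
    # Keep only radiology files, as (file_id, file_url) pairs
    return [
        (f["file_id"], f["file_url"])
        for f in case_files
        if f.get("file_category") == "radiology"
    ]
```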

Processing Flow

For each case, the agent fetches all case files, filters for file_category == "radiology", runs image_extraction on each file URL in sequence, and writes each result back to the file's metadata.

Model Configuration

Vision Model

model = "meta-llama/llama-4-scout-17b-16e-instruct"
Why Llama 4 Scout?
  • Multimodal capabilities: processes images and text in a single request
  • Visual understanding: strong general-purpose visual understanding that transfers to describing radiology images
  • Fast inference: a compact model (17B active parameters) served with low latency on Groq
  • Structured output: reliably follows instructions to emit JSON-formatted findings

Generation Parameters

temperature = 1               # Default sampling temperature
max_completion_tokens = 1024  # Sufficient for comprehensive findings
top_p = 1                     # Full probability distribution (no nucleus truncation)
stream = False                # Wait for the complete response
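For extraction tasks where reproducible JSON matters more than descriptive variety, a lower temperature is a common alternative. A sketch of such settings (these are not the project's current values):

```python
# Hypothetical stricter settings for deterministic JSON extraction
STRICT_JSON_PARAMS = {
    "temperature": 0.2,           # near-deterministic sampling
    "max_completion_tokens": 1024,
    "top_p": 1,
    "stream": False,
}
```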

Radiology Analysis Prompt

The RADIOLOGY_ANALYSIS_PROMPT guides the model to extract relevant clinical information:
backend/utils/medical_prompts.py
RADIOLOGY_ANALYSIS_PROMPT = """
You are an expert radiologist. Analyze the provided medical image and extract:

1. **Modality**: Type of imaging (X-ray, CT, MRI, etc.)
2. **Body Part**: Anatomical region being imaged
3. **Findings**: Observable abnormalities or key features
4. **Impressions**: Clinical interpretation and significance
5. **Summary**: Concise clinical summary

Provide your response in JSON format.
"""

Output Format

The Vision Agent returns structured JSON:
{
  "modality": "Chest X-ray",
  "body_part": "Chest/Thorax",
  "findings": [
    "Increased opacity in right lower lobe",
    "No pleural effusion",
    "Cardiomediastinal silhouette normal"
  ],
  "impressions": "Findings consistent with right lower lobe pneumonia",
  "summary": "Chest X-ray demonstrates right lower lobe consolidation suggestive of pneumonia. No complications noted."
}
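Downstream consumers can cheaply sanity-check this shape before using it. A minimal validator sketch (the key set mirrors the prompt above; the helper itself is illustrative):

```python
REQUIRED_KEYS = {"modality", "body_part", "findings", "impressions", "summary"}


def missing_vision_keys(payload: dict) -> list[str]:
    # Return sorted problems; an empty list means the payload matches the schema
    problems = sorted(REQUIRED_KEYS - payload.keys())
    if "findings" in payload and not isinstance(payload["findings"], list):
        problems.append("findings: expected a list")
    return problems
```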

Integration with Medical Insights Agent

The Vision Agent’s output is consumed by the Medical Insights Agent:
backend/agents/medical_ai_agent.py:94-119
async def _process_radiology_documents(self, state: MedicalAnalysisState):
    processed_docs = []
    for radiology_file in state["case_input"].radiology_files:
        if radiology_file.ai_summary:
            try:
                ai_summary_data = json.loads(radiology_file.ai_summary)
                summary_text = ai_summary_data.get("summary", radiology_file.ai_summary)
            except (json.JSONDecodeError, TypeError):
                summary_text = radiology_file.ai_summary
            
            radiology_doc = RadiologyDocument(
                file_id=radiology_file.file_id,
                file_name=radiology_file.file_name,
                summary=summary_text,
            )
            processed_docs.append(radiology_doc)

Database Storage

Vision analysis results are stored in the file metadata:
await supabase.update_case_file_metadata(
    file_id=file_id, 
    metadata={"ai_summary": ai_summary}
)

Metadata Schema

{
  "file_id": "uuid",
  "file_name": "chest_xray.jpg",
  "file_category": "radiology",
  "file_url": "https://storage.supabase.co/...",
  "ai_summary": "{\"modality\": \"X-ray\", ...}"
}
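Note that ai_summary is stored as a JSON string, so readers must decode it before use. A round-trip sketch mirroring the tolerant parsing the Medical Insights Agent performs (the helper names are illustrative):

```python
import json


def encode_ai_summary(summary: dict) -> str:
    # Serialize the vision output for storage in file metadata
    return json.dumps(summary)


def decode_ai_summary(raw):
    # Return a dict when raw parses as JSON, otherwise pass the raw value through
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return raw
```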

Usage in Orchestration

The Vision Agent is called during the file processing stage:
backend/agentic.py:69-73
if radiology_files:
    logger.info("Processing radiology files...")
    result = await vision_agent(case_id)
    if result:
        logger.info(f"Successfully processed radiology files for case {case_id}")

Processing Timeline

Radiology images are analyzed during the file processing stage, before the Medical Insights Agent runs, so their summaries are already available at diagnosis time.

Error Handling

The Vision Agent includes robust error handling:
try:
    await supabase.update_case_file_metadata(
        file_id=file_id, 
        metadata={"ai_summary": ai_summary}
    )
    logger.info(f"Updated ai_summary for file_id: {file_id}")
except Exception as e:
    logger.error(f"Failed to update ai_summary for file_id {file_id}: {str(e)}")

Common Error Scenarios

  • Invalid image URL: the model call fails if the file URL is inaccessible
  • Model timeout: the Groq API may time out on long-running requests
  • JSON parsing errors: handled by the extract_json_from_string utility
  • Database update failures: logged, but do not block processing of the remaining files
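For transient failures such as API timeouts, a retry wrapper with exponential backoff is a natural extension. A sketch (no retry logic exists in the current code):

```python
import asyncio


async def with_retries(make_call, attempts: int = 3, base_delay: float = 1.0):
    # Retry an async call with exponential backoff; re-raise after the last attempt
    for attempt in range(attempts):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))
```

Usage would look like `await with_retries(lambda: image_extraction(file_url))`.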

Supported Image Formats

The Vision Agent can process:
  • X-rays: Chest, bone, dental
  • CT scans: All body regions
  • MRI scans: Brain, spine, joints
  • Ultrasound: When uploaded as images
  • DICOM: After conversion to web-compatible formats

Performance Considerations

Processing Time

  • Single image: ~2-5 seconds
  • Case with 5 images: ~10-25 seconds
  • Processing is sequential per file

Optimization Opportunities

# Current: sequential processing
for result in results:
    if result.get("file_category") == "radiology":
        ai_summary = await image_extraction(result.get("file_url"))

# Future: Parallel processing with asyncio.gather
tasks = [image_extraction(url) for url in radiology_urls]
ai_summaries = await asyncio.gather(*tasks)
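The gather approach can be hardened with a concurrency cap so a large case does not flood the API, and with return_exceptions=True so one failed image does not cancel the rest. A sketch (analyze stands in for image_extraction):

```python
import asyncio


async def analyze_images(urls, analyze, max_concurrency: int = 4):
    # Run analyze(url) concurrently, at most max_concurrency at a time
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await analyze(url)

    # Results come back in input order; failures appear as exception objects
    return await asyncio.gather(*(bounded(u) for u in urls),
                                return_exceptions=True)
```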

JSON Extraction Utility

The extract_json_from_string function handles response parsing:
backend/utils/extractjson.py
import json
import re

def extract_json_from_string(text: str) -> dict:
    """Extract JSON from an LLM response, handling markdown code blocks."""
    # Remove markdown code fences
    text = re.sub(r'```json\s*', '', text)
    text = re.sub(r'```\s*$', '', text)
    
    # Parse the remaining JSON
    return json.loads(text)
This handles cases where the model wraps JSON in markdown:
```json
{"modality": "X-ray", ...}
```
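When the model adds prose around the JSON, fence-stripping alone is not enough. A more forgiving fallback takes the outermost brace-delimited span (a sketch, not the current utility):

```python
import json
import re


def extract_json_loosely(text: str) -> dict:
    # Strip any code fences, then parse the outermost {...} span
    text = re.sub(r"```(?:json)?", "", text)
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])
```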

Future Enhancements

Planned Features

  • DICOM support: Direct processing of medical imaging format
  • Multi-view analysis: Comparing multiple angles of the same region
  • Temporal comparison: Detecting changes between studies
  • Annotation extraction: Reading radiologist markups and measurements
  • 3D reconstruction: For CT/MRI volumetric data

Model Upgrades

Potential future models:
  • Specialized radiology vision models
  • Larger context windows for whole-study analysis
  • Fine-tuned models for specific imaging modalities

Next Steps

Medical Insights Agent

Learn how vision outputs are used in diagnosis

Complete Workflow

See the full processing pipeline