# Multimodal Cloud Vision Example - Internal Implementation Details

This document describes the implementation of the Multimodal Cloud Vision example, focusing on the data preparation and cloud upload scripts. These scripts are intended for internal use only and are prerequisites for the customer-facing components.

## Data Generation Overview

The example uses synthetic X-ray-like images categorized as:
- COVID-19 positive cases
- Normal cases
- Viral pneumonia cases

These images are combined with synthetic tabular features that are vertically partitioned:
- Organization 2 has features 0-4 (first half) plus the target label
- Organization 3 has features 5-9 (second half) plus image references
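
The split described above can be sketched roughly as follows. This is a minimal illustration, not the code in `0_prepare_data.py`: the column names (`feature_0` … `feature_9`, `label`, `image_path`) and the helper `partition_record` are hypothetical, and both views are assumed to keep `image_id` so that records can be aligned later.

```python
import random

FEATURES_PER_ORG = 5  # 10 features in total, split in half

def partition_record(record):
    """Split one full record into the Organization 2 and Organization 3 views."""
    org2 = {"image_id": record["image_id"]}
    org3 = {"image_id": record["image_id"]}
    for i in range(2 * FEATURES_PER_ORG):
        view = org2 if i < FEATURES_PER_ORG else org3
        view[f"feature_{i}"] = record[f"feature_{i}"]
    org2["label"] = record["label"]            # target label stays with Organization 2
    org3["image_path"] = record["image_path"]  # image reference stays with Organization 3
    return org2, org3

# Tiny synthetic example (column names and values are made up for illustration)
rng = random.Random(0)
records = [
    {
        "image_id": f"img_{n:04d}",
        "label": n % 3,
        "image_path": f"images/img_{n:04d}.png",
        **{f"feature_{i}": rng.gauss(0.0, 1.0) for i in range(10)},
    }
    for n in range(3)
]
org2_rows, org3_rows = zip(*(partition_record(r) for r in records))
```

Because both views are produced from the same record in the same pass, the two row sequences share the same `image_id` order by construction.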

## Internal Scripts

### 0_prepare_data.py

This script handles the following tasks:
- Creates synthetic X-ray-like images with class-specific visual characteristics
- Generates tabular features with clear separation between classes
- Splits data between organizations in a vertically partitioned manner
- Creates training and testing datasets with consistent distributions
- Ensures proper alignment of records across organizations through image_id
- Saves mappings between local and cloud paths for subsequent upload

Implementation details:
- Uses PIL for image generation with specific patterns for each disease class
- Creates ground glass opacity patterns for COVID images
- Adds normal lung field patterns for normal X-rays
- Generates interstitial patterns for pneumonia cases
- Ensures feature consistency between train and test sets
- Saves expected outcomes for later evaluation
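
One possible shape for the local-to-cloud path mapping is sketched below. The helper name `build_path_mapping`, the directory layout, and the `xray` prefix are assumptions made for this example; the actual script may structure its mapping differently.

```python
import json
from pathlib import PurePath, PurePosixPath

def build_path_mapping(local_paths, local_root, cloud_prefix):
    """Map each local image path to the object key it would get in cloud storage."""
    mapping = {}
    for local_path in sorted(local_paths):
        relative = PurePath(local_path).relative_to(local_root)
        # Cloud object keys always use forward slashes, whatever the local OS is.
        mapping[str(local_path)] = PurePosixPath(cloud_prefix, *relative.parts).as_posix()
    return mapping

# Hypothetical file names, used only to show the resulting structure
mapping = build_path_mapping(
    ["data/images/covid/img_0001.png", "data/images/normal/img_0002.png"],
    local_root="data/images",
    cloud_prefix="xray",
)
mapping_json = json.dumps(mapping, indent=2)  # this string could be written to a file for the upload step
```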

### 1_upload_to_cloud.py

This script handles:
- Uploading the synthetic X-ray images to cloud storage (Azure or AWS S3)
- Saving cloud storage configuration for subsequent scripts
- Providing retry logic for failed uploads

Implementation details:
- Supports both Azure Blob Storage and AWS S3
- Creates containers/buckets if they don't exist
- Sets appropriate content types for images
- Saves cloud storage information in a JSON file for later scripts
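
The retry and content-type behavior could look something like the sketch below. It deliberately avoids any real Azure or AWS SDK call: `upload_fn` is a hypothetical callable standing in for whichever client the script uses, and the exponential backoff policy is an assumption rather than the script's documented behavior.

```python
import mimetypes
import time

def guess_content_type(path, default="application/octet-stream"):
    """Return a MIME type based on the file extension, e.g. image/png for .png."""
    content_type, _ = mimetypes.guess_type(path)
    return content_type or default

def upload_with_retry(upload_fn, local_path, remote_key,
                      max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call upload_fn until it succeeds or max_attempts is reached."""
    content_type = guess_content_type(local_path)
    for attempt in range(1, max_attempts + 1):
        try:
            return upload_fn(local_path, remote_key, content_type)
        except Exception:  # real code would catch the SDK's specific error types
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1s, 2s, 4s, ...

# Demo with a stand-in uploader that fails twice before succeeding
attempts = []

def flaky_upload(local_path, remote_key, content_type):
    attempts.append(content_type)
    if len(attempts) < 3:
        raise ConnectionError("simulated transient failure")
    return remote_key

result = upload_with_retry(
    flaky_upload, "img_0001.png", "xray/img_0001.png", sleep=lambda _: None
)
```

Passing the upload function and the `sleep` function in as parameters keeps the retry logic independent of the storage provider and makes it testable without network access.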

## Data Structure Details

### Synthetic Feature Generation

The synthetic features are generated with the following characteristics:
- 10 total features (5 for each organization)
- Explicit class-specific patterns to ensure separability
- Consistent distributions across training and test sets
- Class 0 (COVID): Strong positive values in first 3 features, negative in others
- Class 1 (Normal): Near zero values in first 3 features, positive in middle features
- Class 2 (Pneumonia): Negative in first features, high values in last features
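
A rough sketch of class-conditional feature generation in this spirit is shown below. The mean values, the noise level, and the exact index ranges taken to be the "middle" (3-6) and "last" (7-9) features are all assumptions chosen for illustration; the description above does not pin them down.

```python
import random

# Assumed per-class feature means (10 features each); values are illustrative only.
CLASS_MEANS = {
    0: [2.0] * 3 + [-1.0] * 7,              # COVID: strong positive first 3, negative elsewhere
    1: [0.0] * 3 + [1.5] * 4 + [0.0] * 3,   # Normal: near zero first 3, positive middle
    2: [-1.5] * 3 + [0.0] * 4 + [2.0] * 3,  # Pneumonia: negative first, high last
}

def generate_features(label, rng, noise_std=0.3):
    """Draw one 10-feature vector around the assumed mean for the given class."""
    return [rng.gauss(mean, noise_std) for mean in CLASS_MEANS[label]]

rng = random.Random(42)
samples = [(label, generate_features(label, rng))
           for label in (0, 1, 2) for _ in range(200)]

def class_mean(label, feature_index):
    """Empirical mean of one feature over all samples of one class."""
    values = [features[feature_index] for lbl, features in samples if lbl == label]
    return sum(values) / len(values)
```

Using the same generator and class means for both the training and test splits is one simple way to keep their distributions consistent.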

### Image Generation

The synthetic images are created with:
- 512x512 pixel grayscale images
- Class-specific visual patterns embedded in the images
- Basic lung structure and ribcage outlines
- COVID images: Ground glass opacities and consolidation patterns
- Normal images: Clear lung fields with normal vasculature
- Pneumonia images: Interstitial patterns and patchy consolidation
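
As a loose illustration of how such images might be drawn with PIL, consider the sketch below. All coordinates, intensities, and pattern choices are invented for this example, and normal vasculature is omitted for brevity; it does not reproduce the actual drawing code.

```python
import random
from PIL import Image, ImageChops, ImageDraw, ImageFilter

SIZE = 512

def draw_base_chest(draw):
    """Draw two lung fields and a few rib arcs shared by every class."""
    draw.ellipse((80, 100, 240, 440), fill=60)    # left lung field
    draw.ellipse((272, 100, 432, 440), fill=60)   # right lung field
    for y in range(120, 420, 50):                 # ribcage outline
        draw.arc((60, y, 452, y + 120), start=180, end=360, fill=140, width=3)

def make_image(label, seed=0):
    """Return a 512x512 grayscale image with a class-specific pattern."""
    rng = random.Random(seed)
    img = Image.new("L", (SIZE, SIZE), 20)
    draw_base_chest(ImageDraw.Draw(img))
    if label == "covid":
        # Blurred bright patches as a crude stand-in for ground glass opacities
        patches = Image.new("L", (SIZE, SIZE), 0)
        patch_draw = ImageDraw.Draw(patches)
        for _ in range(12):
            x, y, r = rng.randint(100, 410), rng.randint(150, 400), rng.randint(20, 45)
            patch_draw.ellipse((x - r, y - r, x + r, y + r), fill=150)
        patches = patches.filter(ImageFilter.GaussianBlur(radius=10))
        img = ImageChops.lighter(img, patches)
    elif label == "pneumonia":
        # Short fine lines as a crude stand-in for interstitial patterns
        line_draw = ImageDraw.Draw(img)
        for _ in range(60):
            x, y = rng.randint(100, 410), rng.randint(150, 400)
            end = (x + rng.randint(-25, 25), y + rng.randint(-25, 25))
            line_draw.line((x, y) + end, fill=170, width=1)
    return img  # "normal" keeps the clear lung fields from the base drawing
```

Seeding the random generator per image makes the output reproducible, which helps when regenerating a dataset that must match previously saved expected outcomes.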

## Important Implementation Notes

- `0_prepare_data.py` first attempts to download real X-ray images and falls back to fully synthetic generation if the download fails
- Synthetic image generation includes class-specific visual patterns that are learnable by CNNs
- Feature engineering ensures clear separation between classes
- Record alignment is critical - both organizations' data must have the same image_ids in the same order
- SQL queries omit `image_id` from the selected columns to avoid string conversion issues, but still sort with `ORDER BY image_id` so that records come back in a consistent order
- The TabularPreprocessor will automatically include all columns from the query in the input tensor, including non-target label columns
- Including the label column only in Organization 2's preprocessor prevents "multi-target not supported" errors
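
The query pattern from the notes above can be demonstrated with an in-memory SQLite database. SQLite is only a stand-in here, and the table and column names are hypothetical; the point is that `image_id` drives the ordering without appearing in the selected columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE org2_data (image_id TEXT, feature_0 REAL, feature_1 REAL, label INTEGER)"
)
conn.execute("CREATE TABLE org3_data (image_id TEXT, feature_5 REAL, feature_6 REAL)")

# Rows are inserted in different physical orders on purpose.
conn.executemany(
    "INSERT INTO org2_data VALUES (?, ?, ?, ?)",
    [("img_0002", 0.1, 0.2, 1), ("img_0001", 2.0, 1.9, 0)],
)
conn.executemany(
    "INSERT INTO org3_data VALUES (?, ?, ?)",
    [("img_0001", -1.0, -0.8), ("img_0002", 1.4, 1.6)],
)

# image_id is used for ordering but is not selected, so every returned column is numeric.
ORG2_QUERY = "SELECT feature_0, feature_1, label FROM org2_data ORDER BY image_id"
ORG3_QUERY = "SELECT feature_5, feature_6 FROM org3_data ORDER BY image_id"

org2_rows = conn.execute(ORG2_QUERY).fetchall()
org3_rows = conn.execute(ORG3_QUERY).fetchall()
conn.close()
```

Row `i` of each result now refers to the same `image_id`, even though the two tables were populated in different orders. Only the Organization 2 query selects `label`.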

## Extending or Modifying

When making changes to these scripts:
1. Ensure data consistency between organizations is maintained
2. Verify that image mappings are correctly generated
3. Test both Azure and AWS S3 upload paths
4. Confirm that synthetic data maintains clear class separation
5. Keep the distribution of features consistent between training and test sets
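
For the first item in the checklist, a small helper along the following lines can catch misaligned records early. The function name `first_misalignment` is hypothetical and is not part of the existing scripts.

```python
def first_misalignment(org2_ids, org3_ids):
    """Return the first row index where the two image_id sequences differ, or None."""
    for position, (left, right) in enumerate(zip(org2_ids, org3_ids)):
        if left != right:
            return position
    if len(org2_ids) != len(org3_ids):
        # One sequence is a strict prefix of the other.
        return min(len(org2_ids), len(org3_ids))
    return None

aligned = first_misalignment(["img_0001", "img_0002"], ["img_0001", "img_0002"])
swapped = first_misalignment(["img_0001", "img_0002"], ["img_0002", "img_0001"])
```

A result of `None` means both organizations hold the same `image_id` values in the same order; any integer result points at the first row to inspect.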