# Multimodal Cloud Vision Example with Vertically Partitioned PostgreSQL

This example demonstrates vertical partition training with data distributed across multiple organizations using PostgreSQL databases and cloud-based image storage (Azure Blob Storage or AWS S3). The data is vertically partitioned: Organization 2 holds the first set of features, and Organization 3 holds the second set of features plus image references. Organization 1 is the model owner and coordinates the training process.

> **Note**: This example uses a pre-configured setup where database tables are created and populated by our staff. Clients only need to run the asset positioning, training, and inference scripts.

## Overview

The example trains a model to classify chest X-ray images as COVID-19, viral pneumonia, or normal, using a combination of image data and tabular features that are vertically partitioned across different organizations. This showcases a real-world scenario where:

1. Organization 2 owns the first set of structured data (features 0-4) in their PostgreSQL database
2. Organization 3 owns the second set of structured data (features 5-9) and image references in their PostgreSQL database
3. Images are stored in cloud storage and referenced by Organization 3's database
4. Organization 1 coordinates the training without directly accessing the raw data

This vertically partitioned approach demonstrates privacy-preserving machine learning where multiple parties contribute different feature sets to build a better model than any single party could create alone.

## Components

This example consists of the following scripts:

**Staff-executed setup scripts** (already run for you):
- Scripts to prepare data and upload images to Azure Blob Storage
- Script to create and populate PostgreSQL database tables

**Client-facing scripts** (for you to run):
1. **2_position_database_asset.py**: Creates two PostgreSQL database assets, one per organization, from the existing database tables. For Organization 3, `linked_storage_columns` points to the cloud-stored images
2. **3_model_train.py**: Trains a model using vertical partition training with data from both organizations and linked storage
3. **4_inference.py**: Performs inference using the trained model and data from both organizations

## Prerequisites

- Access to a PostgreSQL database server (local or remote) with pre-populated tables
- The following environment variables should be set:
  - PostgreSQL: `PG_HOST`, `PG_PORT`, `PG_USER`, `PG_PASSWORD`, `PG_DATABASE_ORG2`, `PG_DATABASE_ORG3`

For convenience, you can use the provided `env_setup.sh` script as a template to set these variables. The database tables and cloud storage have already been set up for you by our staff.

## PostgreSQL Configuration

You need to provide PostgreSQL connection details to use this example:

1. **PG_HOST**: The PostgreSQL server hostname (e.g., `localhost` or `your-postgres-server.com`)
2. **PG_PORT**: The PostgreSQL server port (default: `5432`)
3. **PG_USER**: Your PostgreSQL username
4. **PG_PASSWORD**: Your PostgreSQL password
5. **PG_DATABASE_ORG2**: The database name for Organization 2 (default: `covid_multimodal_org2`)
6. **PG_DATABASE_ORG3**: The database name for Organization 3 (default: `covid_multimodal_org3`)

The databases and tables have already been created for you with the necessary data.
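
As a minimal sketch, the variables above could be read in Python as follows. The helper names `pg_settings` and `pg_url` are hypothetical and not part of the example scripts; only the variable names and defaults come from the list above. Treating `PG_HOST`, `PG_USER`, and `PG_PASSWORD` as required is also an assumption.

```python
import os
from urllib.parse import quote


def pg_settings(env=os.environ):
    """Collect the PostgreSQL settings, applying the documented defaults.

    Hypothetical helper for illustration; the example scripts may read
    these variables differently.
    """
    required = ("PG_HOST", "PG_USER", "PG_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["PG_HOST"],
        "port": int(env.get("PG_PORT", "5432")),
        "user": env["PG_USER"],
        "password": env["PG_PASSWORD"],
        "org2_db": env.get("PG_DATABASE_ORG2", "covid_multimodal_org2"),
        "org3_db": env.get("PG_DATABASE_ORG3", "covid_multimodal_org3"),
    }


def pg_url(settings, database):
    """Build a postgresql:// URL with percent-encoded user and password."""
    return "postgresql://{user}:{password}@{host}:{port}/{db}".format(
        user=quote(settings["user"], safe=""),
        password=quote(settings["password"], safe=""),
        host=settings["host"],
        port=settings["port"],
        db=database,
    )
```

Percent-encoding the credentials keeps characters such as `@` or `/` in a password from being misread as URL delimiters.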

## Technical Details

This example demonstrates two key concepts:

1. **Vertical Partitioning**: Data is split by columns across organizations, allowing each organization to contribute different features to the model without revealing their raw data. The training happens using secure multi-party computation techniques.

2. **Linked Storage Columns**: The `linked_storage_columns` feature allows referencing files stored in Azure Blob Storage directly from PostgreSQL database columns. When the model training algorithm accesses these columns, the system automatically fetches the referenced files from cloud storage, applying appropriate preprocessing.

This combined approach enables privacy-preserving collaboration on large multimodal datasets, where organizations can maintain control of their data while still contributing to a shared model.
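
To make the column-wise split concrete, the toy sketch below joins two feature sets on a shared record identifier. It is plain Python with invented values and hypothetical column names (`feature_0`, `feature_5`, `image_path`), and it performs the join in a single process. In the actual example the raw rows stay with each organization, so this illustrates only the data layout, not the training protocol.

```python
def align_partitions(org2_rows, org3_rows, key="image_id"):
    """Join two column partitions on a shared key (inner join)."""
    org3_by_key = {row[key]: row for row in org3_rows}
    aligned = []
    for row in org2_rows:
        other = org3_by_key.get(row[key])
        if other is None:
            continue  # record is present in only one partition
        merged = dict(row)
        merged.update({k: v for k, v in other.items() if k != key})
        aligned.append(merged)
    return aligned


# Toy rows; column names and values are invented for illustration.
org2 = [
    {"image_id": "a1", "feature_0": 0.2, "label": 1},
    {"image_id": "b2", "feature_0": 0.7, "label": 0},
]
org3 = [
    {"image_id": "b2", "feature_5": 1.5, "image_path": "covid_xray/COVID_3.png"},
    {"image_id": "c3", "feature_5": 0.9, "image_path": "covid_xray/COVID_4.png"},
]

for record in align_partitions(org2, org3):
    print(record)
```

Only record `b2` appears in both partitions, so it is the only row in the result. This is why both organizations keep an `image_id` column for record alignment.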

The example uses synthetically generated data with distinct patterns for each class to demonstrate the concepts without requiring actual medical images.

## PostgreSQL Tables

This example uses pre-created tables in two PostgreSQL databases:

**Organization 2 Database (features 0-4):**
- `covid_train_data`: Contains training data with the first set of features
- `covid_test_data`: Contains testing data with the first set of features
- Each table includes the label (target) column and image_id for record alignment

**Organization 3 Database (features 5-9 + images):**
- `covid_train_data`: Contains training data with the second set of features and references to cloud-stored images
- `covid_test_data`: Contains testing data with the second set of features and references to cloud-stored images
- Each table includes the label column (not used in preprocessing) and image_id for record alignment

These tables have already been created and populated for you as part of the setup process.

## Troubleshooting

### PostgreSQL Connection Issues

1. **Missing PostgreSQL Python modules**:
   ```
   pip install psycopg2-binary sqlalchemy
   ```

2. **Connection refused**:
   - Make sure PostgreSQL is running 
   - Check that the hostname and port are correct
   - Verify that PostgreSQL is configured to accept connections from your IP address

3. **Authentication failed**:
   - Verify the username and password are correct
   - Check PostgreSQL's `pg_hba.conf` file to ensure the proper authentication method is configured
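
One way to narrow down the cases above is to test TCP reachability before looking at credentials. The sketch below uses only the standard library; the helper name `port_is_open` is hypothetical, and defaulting the host to `localhost` is an assumption.

```python
import os
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    host = os.environ.get("PG_HOST", "localhost")
    port = int(os.environ.get("PG_PORT", "5432"))
    state = "reachable" if port_is_open(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

If the port is not reachable, check the server, hostname, port, and firewall settings. If it is reachable but the scripts still fail, the credentials and `pg_hba.conf` are the more likely cause.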

### Cloud Storage Issues

1. **Azure authentication errors**:
   - Verify your storage account name and key are correct
   - Make sure the container exists and you have permission to access it

2. **AWS S3 errors**:
   - Verify AWS credentials are correct
   - Ensure you have access to the bucket
   - Check your IAM permissions include S3 read access

### Data Processing Issues

1. **UUID serialization errors**:
   - If you see errors about UUID not being JSON serializable, make sure you're converting UUIDs to strings before serialization

2. **Type conversion errors**:
   - Errors like "could not convert string to float: 'covid_xray/COVID_3.png'" indicate that a string column, here an image path, is being treated as a numeric feature. To handle mixed data types with vertical partitioning:
     1. Ensure your SQL query omits string columns that aren't needed, such as `image_id`
     2. Exclude string columns like `image_path` from the `TabularPreprocessor`'s feature list
     3. Make sure the `ImagePreprocessor` handles the `image_path` column for linked storage
     4. Include the label column ONLY in Organization 2's preprocessor with `target=True`
     5. DO NOT include the label column in Organization 3's preprocessor at all
     6. Adjust your neural network architecture to match the exact number of features each organization provides
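
Both issues in this section can be illustrated with standard-library Python. The sketch below is not code from the example scripts: `json_default` and `split_feature_columns` are hypothetical helpers, and the sample row uses invented column names and values.

```python
import json
import uuid


def json_default(value):
    """Fallback for json.dumps: serialize UUIDs as strings."""
    if isinstance(value, uuid.UUID):
        return str(value)
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def split_feature_columns(row):
    """Split a row's columns into numeric features and everything else.

    Columns whose values cannot be converted to float (for example image
    paths) are the ones to keep out of a tabular feature list.
    """
    numeric, other = [], []
    for name, value in row.items():
        try:
            float(value)
        except (TypeError, ValueError):
            other.append(name)
        else:
            numeric.append(name)
    return numeric, other


# Invented sample row for illustration.
row = {
    "asset_id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "feature_5": 0.42,
    "image_path": "covid_xray/COVID_3.png",
}

print(json.dumps(row, default=json_default))
print(split_feature_columns(row))
```

Passing `default=` to `json.dumps` avoids converting every UUID by hand, and running a check like `split_feature_columns` on one sample row shows which columns would trigger the float conversion error.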