Introduction

What is file sharing? What are some of the permutations of it that matter in the context of a virtual organization (VO)? How do the requirements of file sharing within a single institution differ from interinstitutional file sharing? What are some of the possible solutions?

This document is intended to be just the start of a conversation about file sharing and the possibilities in this space. We will look at some of the more interesting use cases, ones that span single institutions, consortia, and virtual organizations. Further conversation will develop based on reader feedback and areas of interest.

A group of researchers have a growing set of data files that need to be securely shared between members of their VO. The files are anywhere from 100MB to 100GB and need to be in a central repository that allows for proper authorization of access. Members of the group receive automated notices when new files are made available to them, in order to stay on top of current research. Any researcher may upload a file or set of files at any time, from any operating system, via any web browser. – A basic interinstitutional use case

File sharing sounds like a rather simple thing – someone has or creates a file that needs to be seen, used, or edited by someone or something else, and so needs to make it available. But there are different types of file sharing (ad hoc, batched) and different types of files (small data files, large data files, documents, compound data sets), as well as different needs around restricting access, handling storage issues (quotas, quantities), and more. Different types of organizations, from individual institutions to multinational VOs, contribute to the richness of use cases in this area.

With the growth of cloud storage solutions such as Amazon’s S3 or DropBox, as well as work in the open source space (FileSender, MediaMosa), this is a very active space that has the potential to solve a variety of problems for researchers, academic groups, and broader consortia.

Thinking about organizational requirements

Whether an organization is brick-and-mortar or virtual, it needs to establish its requirements around file sharing before deciding to build or buy a solution. An organization can start by considering the following:

  • What type of legal issues will surround the potential storage of data with a third party? Will FERPA, HIPAA, or granting agency rules restrict options to a purely local solution?
  • How will users access this new storage space? Even if not required today, all organizations should consider how individuals from outside an organization's identity management sphere will read and write to the storage space. Federated authentication and authorization, as well as group management, should be planned in from the outset to save difficulty later. The use cases below look at this in more depth.
  • What can your organization afford? Costs around file sharing go beyond just the obvious disk storage. An increase in network bandwidth use may impact an organization's network infrastructure, raising both equipment and staffing costs. If federated single sign-on is not already in place, a temporary surge in costs may be required to get this service established.
  • Who will be responsible for the data and the storage space? Can researchers reach out directly to the service provider to request additional space or new services like replication or restores? If you are a virtual organization, do you have enough infrastructure to encourage researchers to go through a single point of contact to get additional storage?

Thinking about individual requirements

When considering a file sharing solution, the needs of the individuals sharing the files also must be explicitly considered. This gets down into the details of what a file sharing service will need to be able to do.

  • Do the files need to be shared as they are created (“ad hoc”) or in regular batch jobs?
  • What is the size range of files being shared?
  • Will the researchers be sharing a limited number of large files, a vast number of large files, or a large number of small files?
  • Do the files have a structure to them that must be preserved?
  • Is automatic notification regarding new files becoming available required?
  • Do the files need to be restricted? If so, do they need to be restricted to a small group, a large group? Are all the individuals that need to access these files easily identified, or is the information regarding access control limited to knowing users are associated with a particular group or institution?
  • Is there a common, central IT infrastructure available for the people sharing files?
  • If an upload fails, is it acceptable for the upload to be manually restarted, or is there a need for checkpointing (resuming a file transfer midway through)?
  • Are there any restrictions or preferences around doing file transfers via the web, or is a separate client preferred?
  • Do you have firewall or network constraints that need to be addressed? Are there protocol requirements or limitations involved?
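
To make the checkpointing question above concrete, here is a minimal Python sketch of a resumable transfer. It copies a file in fixed-size chunks and, when restarted, continues from the number of bytes already present at the destination. The function name `resume_copy`, the chunk size, and the use of a local file to stand in for the remote destination are all illustrative assumptions; a real service would expose its own resumable-upload mechanism.

```python
import os

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per chunk; an arbitrary illustrative value


def resume_copy(src_path, dst_path, chunk_size=CHUNK_SIZE, max_chunks=None):
    """Copy src to dst, resuming from the bytes already written to dst.

    max_chunks lets a caller simulate a transfer that stops partway.
    Returns the total number of bytes present at the destination.
    """
    # The checkpoint is simply the size of the partial destination file.
    offset = os.path.getsize(dst_path) if os.path.exists(dst_path) else 0
    sent = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)  # skip what was already transferred
        while max_chunks is None or sent < max_chunks:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            offset += len(chunk)
            sent += 1
    return offset
```

Calling `resume_copy` a second time after an interruption picks up where the first call stopped, rather than sending the whole file again. For files in the 100GB range, that difference is what the checkpointing question is really asking about.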

Asking vendors or open source projects the interesting questions

  • What is your service model to support federated access?
  • For encrypted space, who stores the encryption keys?
  • In your contracts, who owns the data?
  • Do you offer integrity checks or other data validation services for the data placed into your service?
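
As one example of what an integrity check can look like, the sketch below computes a SHA-256 digest of a file and compares it with a digest recorded when the file was uploaded. The helper names `file_sha256` and `is_intact` are hypothetical, and the choice of SHA-256 is an assumption; a given vendor or project may use a different algorithm or run the check server-side.

```python
import hashlib


def file_sha256(path, block_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def is_intact(path, expected_hex):
    """Compare a file's current digest with the one recorded at upload time."""
    return file_sha256(path) == expected_hex.lower()
```

A researcher could record the digest before uploading and re-run the comparison after downloading. If the service offers its own validation, the useful follow-up question is whether its checksums are exposed so that they can be compared independently.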

A list of some of the open source file sharing tools, particularly those that can (or will eventually) handle federated authentication and authorization via groups, is being developed on the Domestication Wiki, available online at:
https://wiki.surfnetlabs.nl/display/domestication/Overview

Different use cases based on different types and styles of file sharing

Below are some different broad use cases in the file sharing arena, ones that can be applied to both a VO and an institutional context. The key points listed below each use case are concepts that need to be considered when determining the requirements for storage. Is encryption important? Federated access? Reporting requirements?

Class groups sharing (read/write) structured data

(based on WebFiles at Duke University)

A botany class at University Y is expected to share, edit, and add to a collection of image files. The members of the class change from semester to semester, and within each semester there are 3 different class sections. Members of this class and associated sections are identified within a central system and expressed as groups, and change automatically immediately before, during, and after the semester. Students will retain access to the material if they complete the course successfully and for the length of time they are associated with the institution. Some members of this class may actually be visiting students from University Z and will be using their credentials from that institution. Instructors have the ability to override the registration system of record and allow both local and federated individuals access to the course material. The students will need to access this both via command line and through a web browser. The professor and the TAs will need to be able to add or remove members of the group and have the associated access to the data adjusted in close-to-real-time. There will be a surge of file uploading at the beginning and end of each semester, and ad hoc uploading and downloading during the semester.

Key points: read/write access; federated access; ad hoc and bulk file sharing; CLI and browser; strong tie-in to group management; granular access control
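
The tie between group management and access control in this use case can be sketched as follows. This is only an illustration under stated assumptions: the `CourseGroup` class, the `can_access` function, the permission names, and the scoped identifiers under `.example` domains are hypothetical and do not describe how WebFiles or any particular product is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class CourseGroup:
    """Members synced from the registration system, plus instructor overrides."""
    name: str
    synced: set = field(default_factory=set)   # from the system of record
    added: set = field(default_factory=set)    # access granted by an instructor
    removed: set = field(default_factory=set)  # access revoked by an instructor

    def members(self):
        # Overrides are applied on top of the automatically synced roster.
        return (self.synced | self.added) - self.removed


def can_access(user, permission, acl, groups):
    """acl maps a group name to the set of permissions granted to that group."""
    return any(
        permission in acl.get(group.name, set()) and user in group.members()
        for group in groups
    )
```

Because access is evaluated against current group membership rather than copied onto each file, adding a visiting student from University Z to a section (for example, `section.added.add("bob@university-z.example")`) changes what that student can reach the next time access is checked.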

-------

Researchers sharing (read-only) unstructured data

(based on discussions with LIGO)

A large science project has instruments collecting data at a fast and furious rate. Tools exist that filter out “noise”, but over a terabyte of data is still created daily and stored for access by members of the VO. VO membership is dependent on federation, as members come from nearly 100 different research institutions around the world. Data may be accessed in real time in a "streaming" fashion, with data becoming available at a rate of 1 MB/s, or in large data set chunks of 100GB to 1TB downloaded by the researcher at a time. Older data is not accessed as frequently as newer data. Access is restricted to members of the collaboration. There may be dozens of researchers concurrently accessing the data at any given time. Data is replicated to 10 compute sites and accessed from there. Data is accessed via a variety of applications such as MATLAB or, more likely, with home-grown analysis tools. Network configurations (firewalls, speed) will vary widely among the compute sites.

Key points: read-only access; federated access; bulk file sharing; CLI and application; data replication and hierarchical storage management
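
A back-of-the-envelope calculation shows why the network differences among sites matter for the figures in this use case. The sketch below assumes decimal units (1 GB = 10^9 bytes), ideal throughput with no protocol overhead or contention, and three link speeds chosen purely for illustration; none of these come from the project itself.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Ideal transfer time in hours, ignoring protocol overhead and contention."""
    return size_bytes * 8 / link_bits_per_second / 3600


GB = 10**9
TB = 10**12

# Streaming at 1 MB/s accumulates this many GB over one day.
streamed_gb_per_day = 1 * 10**6 * 86400 / GB

for size, label in ((100 * GB, "100 GB"), (1 * TB, "1 TB")):
    for mbps in (100, 1000, 10000):  # assumed link speeds in Mbit/s
        hours = transfer_hours(size, mbps * 10**6)
        print(f"{label} at {mbps} Mbit/s: {hours:.2f} h")
```

Under these assumptions, a 1 MB/s stream amounts to 86.4 GB per day, and a 1 TB chunk takes roughly 22 hours at 100 Mbit/s but a little over 2 hours at 1 Gbit/s.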

-------

Researchers sharing and doing computation on highly structured data

(based on a discussion with the iPlant Collaborative)

A large science project is working with millions of image files, and each researcher in the project may be uploading and working with data sets in the multi-gigabyte to terabyte range. These files must be access-controlled and run through ongoing normalization processes to refine and categorize the images. This requires significant integration of the storage space with the compute services, resulting in the colocation of services. The researchers require faster computation services and data transfer as the number of images increases. Data replication is desired, but the logic and coordination of a million image files results in a complex structure that must not be lost as the data is replicated.

-------

Interinstitutional file sharing

A consortium of academic institutions has decided to create a cloud storage service that will meet both their legal and financial needs. With issues such as grant restrictions around physical location and control of data, HIPAA requirements around the medical research data sets, and financial concerns around the combination of storage and bandwidth costs to a third-party service provider, creating a service within the consortium minimizes some of the legal and financial challenges that come with a commercial cloud service. Each institution has its own identity management infrastructure (authentication, authorization, and group management). While creating a separate account for users within this new service is possible, many of the institutions have single sign-on requirements written into their policies and guidelines. Managing a separate identity management system for one service is a burden none of the participating institutions wants to bear, and the potential FERPA-related issues around having to share enough student information to create these accounts are a serious concern. In creating this service, each institution hopes to take advantage of underutilized storage systems throughout the consortium and to create a better bargaining position with the larger storage companies when additional storage must be acquired. The expectation is that with storage being physically located at known locations, each individual institution can meet any offsite storage requirements while still maintaining a strong understanding of the location and management of the storage. The storage service will need to be able to handle the full array of possible storage types: extensive reads, writes, large and small files, high speed computation, encryption and access controls, multiple file systems and protocols.

Key points: federated SSO; legal and regulatory requirements; financial considerations

-------

A file sharing service requiring encryption

The School of Medicine of a prestigious research institution has started participating in a large, federally funded grant project that spans over a dozen schools. Given the nature of the data in the research, all files must be encrypted in transit and at rest. Access to the data must allow for granular access control and point-in-time auditing. The research will involve groups from different parts of the research institution and associated hospitals, which will in turn involve several identity management (authentication, authorization, and group management) systems. The authentication and account management requirements at each entity involved are very strict and not always in exact alignment; creating a new account to access just this data is extremely difficult. With each entity having a separate network and firewall, agreeing upon a common access protocol (HTML5? FTP?) early will be required. Access to the storage will be primarily read-only, as high performance compute clusters access the data sets for analysis. Write actions will happen in bursts as new data is discovered and added to the overall data available.

Key points: encryption; replication; legal and regulatory requirements; access control

-------

A file sharing service explicitly for high performance computing

The biology department of a particular research institution has an ever-growing data set around primate genetics. Computational activity against this data set is ongoing, and access to it has been requested by researchers around the world. The dataset is currently 500 TB and is growing by 1-2 TB a month. Because of the requirements of the grants that help fund the research at the heart of creating this data, reports must be run regularly to provide information on how many individuals from how many different institutions have accessed the data in a given time period. Individual research groups will run massive computational analyses against the data, some pulling copies of the data for local manipulation and others attempting to read directly from the original dataset (or an assigned mirror). Some of the data will be restricted access until a formal approval process allows for its addition to the public dataset.

Key points: large data sets; significant I/O requirements; interinstitutional access; reporting requirements
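
The reporting requirement in this use case, counting how many individuals from how many institutions accessed the data in a given period, reduces to counting distinct values in an access log. The sketch below assumes a hypothetical log of `(date, user, institution)` tuples and a hypothetical `access_report` function; a real service would define its own log format, and the institution might instead be derived from a federated identifier.

```python
from datetime import date


def access_report(log, start, end):
    """Count distinct users and institutions with accesses in [start, end]."""
    users = set()
    institutions = set()
    for day, user, institution in log:
        if start <= day <= end:
            users.add(user)
            institutions.add(institution)
    return {"individuals": len(users), "institutions": len(institutions)}
```

The point of the sketch is that the storage service must record who accessed the data and where they came from at the time of access; if that information is not captured, no amount of later analysis can produce the report.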

-------

University as a cloud-broker for individuals

Institution A offers its constituents a fee-based, cloud-based storage service that has the following advantages: due to a collective bargaining effort, costs per gigabyte are much lower than what an individual could find on their own for a service with equivalent features; the storage, based on federated identity, can actually transfer with an individual from institution to institution, allowing them to keep their documents and research in a single location regardless of institutional affiliation.

Key points: federated access; lifetime access

-------

Some examples of different storage and file sharing tools and services

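| Software / Service | Federated access? | Native encryption? | Automatic replication? | Fine-grain access controls? | Large data sets? | High bandwidth? | CLI access? | Reporting/auditing capabilities? |
|--------------------|-------------------|--------------------|------------------------|-----------------------------|------------------|-----------------|-------------|----------------------------------|
| Amazon's S3        | (error)           | (error)            | (tick)                 |                             | (tick)           |                 | (error)     | (error)                          |
| Box.net            | (tick)            | (tick)             | (tick)                 | (tick)                      | (tick)           |                 |             | (tick)                           |
| DropBox            | (error)           | (tick)             | (tick)                 | (error)                     |                  |                 | (error)     | (error)                          |
| FileSender         | (tick)            | (error)            | (error)                |                             | (tick)           |                 | (error)     | (error)                          |
| iRODS              | (warning)         | (error)            | (error)                |                             | (tick)           |                 | (tick)      | (tick)                           |
| Lobber             | (tick)            | (error)            | (error)                | (tick)                      | (question)       |                 | (tick)      | (error)                          |
| MediaMosa          | (warning)         | (error)            | (error)                | (tick)                      |                  | (tick)          | (error)     | (error)                          |
| Nirvanix           | (error)           | (tick)             | (tick)                 |                             |                  |                 | (tick)      | (tick)                           |

Lobber: https://portal.nordu.net/display/LOBBER/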

Key

(warning) in progress
(tick) supported within the app or service
(error) limited or not supported within the app or service

Certain features, such as encryption, may be layered onto the service whether or not they are supported natively by the application or the service.
The information in the table above was derived from a review of the available documentation on these products. Further analysis is required for more detail.

Possible tiers of service

Given the varied needs of individuals and institutions, no one service is likely to offer everything an organization hopes for. All the bells and whistles may be there, but the associated price tag might be too high. Or, the price is right but the functionality is too limited. It is a fairly common problem. One way to deal with this is to break a file sharing service apart into tiers, keeping costs and features tightly constrained to make certain your organization is getting only what it needs.

Basic service ($)

JBOD-style data storage ("Just a Bunch of Disks"), federated access, no encryption or replication

Full service ($$)

Federated access, native encryption, replication, fine-grain access control (user and group), large and small data sets, granular audit and reporting capabilities. Geographic replication is not necessary, and there are no specific physical data center security requirements.

High security ($$$)

Federated access, full end-to-end encryption with data also encrypted at rest, replication, fine-grain access control (user and group), large and small data sets, granular audit and reporting capabilities. Must meet heavier legal requirements around physical as well as data security.

HPC ($$)

Heavier on the burst-type network usage. Large data sets in which pieces will be uploaded and downloaded for analysis.
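
One way to reason about these tiers is to treat each as a set of features with a relative cost and pick the cheapest tier that covers what a group actually needs. The sketch below is only an illustration: the feature labels, the numeric costs standing in for $, $$, and $$$, and the `cheapest_tier` function are assumptions made for the example, loosely transcribed from the tier descriptions above.

```python
# Feature labels and relative costs are illustrative stand-ins for the tiers above.
TIERS = [
    ("Basic", 1, {"federated_access"}),
    ("Full", 2, {"federated_access", "encryption", "replication",
                 "fine_grain_access", "audit_reporting", "large_data_sets"}),
    ("HPC", 2, {"burst_network", "large_data_sets"}),
    ("High security", 3, {"federated_access", "encryption",
                          "end_to_end_encryption", "replication",
                          "fine_grain_access", "audit_reporting",
                          "large_data_sets", "physical_security"}),
]


def cheapest_tier(required):
    """Return the name of the lowest-cost tier covering all required features."""
    needed = set(required)
    candidates = [
        (cost, name) for name, cost, features in TIERS if needed.issubset(features)
    ]
    # No single tier may cover an unusual mix; the caller must handle None.
    return min(candidates)[1] if candidates else None
```

A request that no single tier covers (for example, burst networking together with encryption in this sketch) returns `None`, which reflects the point above that no one service is likely to offer everything.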

What's Next?

This document is designed to just start the conversation around the possibilities for academic and research cloud-based storage. With so many possibilities available to the consumer, what do your institutions need to offer to stay relevant in this space? Is staying relevant in this space even a concern? Do any of these scenarios and use cases describe the solutions that your institution or consortium of institutions need to offer?

What would the value proposition be to make this of interest to users? Perhaps a promise that this storage would be interoperable with the other services offered by the institution (email, web, HPC). Perhaps the promise of third-party security evaluations, rather than just trusting the commercial entity with an individual's data. Or perhaps better help desk support and integration with campus authentication systems (SSO).

Reported trends (NORDUnet conference, 2011)

  • more demands on national networks
  • huge amount of data (bioimaging, genomic research)
  • federated logins to storage
  • silent error corrections (scrubbing data)
  • global access to sensitive data (note this is the use case for a more global understanding around LoA policies)