Workflows for distributed computing
Mathieu Leclaire
Jonathan Passerat-Palmbach
Romain Reuillon
Context
- Complex-system community
- Various scientific fields
- No standard practices, languages, or platforms
- Domain-specialist PhDs (geographers, biologists, sociologists...) with no technical support
They use naturally parallel methods daily on their laptops:
- Data reconstruction
- Parameter estimation
- Sensitivity analysis
- Optimisation
- Replication
- ...
Execution of the same program with different parameters and/or datasets.
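To illustrate why these methods are "naturally parallel", here is a minimal plain-Scala sketch (not OpenMOLE code). The `model` function is a hypothetical stand-in for a researcher's program, and `ParameterSweep` and `sweep` are names chosen for this example only. Each run depends only on its own parameter value, so the runs can be scheduled concurrently.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParameterSweep {
  // Hypothetical stand-in for a user's model: a pure function of its parameter.
  def model(x: Double): Double = x * x + 1.0

  // Run the same model once per parameter value. The runs are independent,
  // so each one is submitted as its own Future.
  def sweep(params: Seq[Double]): Seq[(Double, Double)] = {
    val runs = params.map(p => Future(p -> model(p)))
    // Future.sequence keeps the results in the order of the inputs.
    Await.result(Future.sequence(runs), 1.minute)
  }
}
```

Calling `ParameterSweep.sweep(Seq(0.0, 1.0, 2.0))` pairs each parameter with its result. On a laptop the parallelism is bounded by the number of cores, which is the limit that distributed execution is meant to remove.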
How to bring distributed computing to these researchers?
Prototype Small, Scale for Free
OpenMOLE articulates 3 orthogonal concepts

... and an expressive workflow formalism for distributed computing.
1 - Model?
Anything you can launch that takes inputs and produces outputs
Zero deployment approach
- User code is automatically deployed at runtime
- No prior knowledge of remote environment needed
- No installation required on any machine
Works with almost any language / platform running on Linux
Packaging (non-JVM) applications with CARE
Applications have dependencies:
- Shared libraries
- Packages (Python, R, ...)
- Low level system calls
- Environment variables
- ...
Capture these dependencies and transfer them along with the application, from Linux to Linux
Distributed execution of (almost) any program to (pretty much) any computing environment in 3 simple steps
- Package it with CARE ⇒ execute it on Linux
- Write your OpenMOLE workflow
- Click the run button
2 - Method?

Map/reduce

Grid Search
Random sampling
Latin Hypercube
Parallel data processing
...
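As an illustration of two of the sampling methods listed above, here is a minimal plain-Scala sketch. It is not OpenMOLE's implementation; `Sampling`, `grid`, and `latinHypercube` are names chosen for this example. The Latin hypercube cuts each dimension of the unit cube into n equal strata and places exactly one point in every stratum of every dimension.

```scala
import scala.util.Random

object Sampling {
  // Full-factorial grid: every combination of the values on each axis.
  def grid(xs: Seq[Double], ys: Seq[Double]): Seq[(Double, Double)] =
    for (x <- xs; y <- ys) yield (x, y)

  // Latin hypercube on [0, 1)^dims with n points.
  // For each dimension, shuffle the stratum indices 0 until n and draw one
  // value uniformly inside each stratum; then regroup the columns into points.
  def latinHypercube(n: Int, dims: Int, rng: Random): Vector[Vector[Double]] = {
    val columns = Vector.fill(dims) {
      rng.shuffle(Vector.range(0, n)).map(k => (k + rng.nextDouble()) / n)
    }
    columns.transpose
  }
}
```

Each sampled point is an independent model run (the "map" step), and the results are then aggregated (the "reduce" step). A grid grows exponentially with the number of dimensions, whereas a Latin hypercube covers every stratum of every axis with only n runs.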
Master/slave
Example: genetic algorithm
Genetic Algorithm
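The master/slave pattern can be sketched in plain Scala with a simplified evolutionary loop. This is a mutation-only variant without crossover, so it is a simplification of a genetic algorithm and not OpenMOLE's implementation. The `fitness` function, the target value 3.0, and the mutation scale 0.5 are assumptions made for this example. The master keeps the population and breeds new candidates; the fitness evaluations are the part that would be handed out to workers.

```scala
import scala.util.Random

object TinyGA {
  // Hypothetical fitness to minimise: squared distance to the target 3.0.
  def fitness(x: Double): Double = (x - 3.0) * (x - 3.0)

  // Master loop: breed `lambda` mutated offspring from the current `mu`
  // parents, evaluate them, and keep the best `mu` of parents plus offspring.
  def evolve(mu: Int, lambda: Int, generations: Int, rng: Random): Double = {
    var population = Vector.fill(mu)(rng.nextDouble() * 20.0 - 10.0)
    for (_ <- 1 to generations) {
      val offspring = Vector.fill(lambda) {
        population(rng.nextInt(mu)) + rng.nextGaussian() * 0.5
      }
      // Elitist selection: the best individual found so far is never lost.
      // Evaluating `fitness` is the step a workflow engine would distribute.
      population = (population ++ offspring).sortBy(fitness).take(mu)
    }
    population.minBy(fitness)
  }
}
```

For example, `TinyGA.evolve(5, 20, 200, new Random(1))` returns a value close to 3.0. Unlike map/reduce, each generation depends on the previous one, which is why a master has to coordinate the workers.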
3 - Execution environment?

Today
Multi-thread
Delegation through SSH
PBS (on ssh)
SLURM (on ssh)
Condor (on ssh)
SGE (on ssh)
OAR (on ssh)
EGI Grid (through DIRAC)
Ad hoc desktop grid
Grid Computing
Tomorrow
Commercial cloud providers
Academic cloud
Volunteer computing?
Combination of environments?
Next Docker-based computing platform?
Transparent access
Access resources as the user would
Uses the user's own credentials
No preliminary step
Automatic data transfers (+ Replica management)
Files and folders transfers are handled transparently by OpenMOLE
And now: examples!
The terminology
A workflow
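// Typed variables that carry data between tasks
val i = Val[Double]
val res = Val[Double]

// Run the model once for each value of i: 0.0, 1.0, ..., 10.0
val exploration = ExplorationTask(i in (0.0 to 10.0 by 1.0))

val model =
  ScalaTask("val res = i * 2") set (
    inputs += i,
    outputs += (i, res)
  )

// Execute locally on 5 threads
val env = LocalEnvironment(5)
val ex = exploration -< (model on env) start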
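To make explicit what this workflow computes, here is a plain-Scala equivalent that runs without OpenMOLE. It is only a sketch of the data flow: `DoubleModel`, `model`, and `results` are names chosen for this example, and nothing here is distributed.

```scala
object DoubleModel {
  // Same computation as the ScalaTask body: res = i * 2, returned with i.
  def model(i: Double): (Double, Double) = (i, i * 2)

  // Exploration: i takes the values 0.0, 1.0, ..., 10.0, one model run each.
  val results: Seq[(Double, Double)] = (0 to 10).map(k => model(k.toDouble))
}
```

The workflow produces the same eleven (i, res) pairs, with OpenMOLE deciding where each run executes.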
Same workflow on the grid!
val i = Val[Double]
val res = Val[Double]

val exploration = ExplorationTask(i in (0.0 to 10.0 by 1.0))

val model =
  ScalaTask("val res = i * 2") set (
    inputs += i,
    outputs += (i, res)
  )

// Only the environment changes: the EGI grid, "biomed" virtual organisation
val env = EGIEnvironment("biomed")
val ex = exploration -< (model on env) start
Native code
# Shell: package the Python script and its dependencies into an archive
care -o hello.tgz.bin python hello.py 42 test.txt

// OpenMOLE script: run the packaged application once per value of arg
val arg = Val[Int]
val output = Val[File]

val pythonTask =
  CARETask(
    workDirectory / "hello.tgz.bin",
    "python hello.py ${arg} output.txt") set (
    inputs += arg,
    outputFiles += ("output.txt", output),
    outputs += arg
  )

val exploration = ExplorationTask(arg in (0 to 10))

// Copy each run's output file back to the working directory
val copy = CopyFileHook(output, workDirectory / "hello${arg}.txt")
exploration -< (pythonTask hook copy)
OpenMOLE for all communities, for all languages
OpenMOLE is neither dedicated to a scientific field nor to a language
- Chromosome structuring: Neurosciences, C++
- The SimTRAP project: Social Sciences, NetLogo
- The SimPOP project: Geography, Scala
- The BioEmergence project: Biology, C