.. BioEM documentation master file, created by sphinx-quickstart on Tue Dec 5 14:19:56 2017. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. .. .. role:: math(raw) :format: html latex .. role:: raw-latex(raw) :format: latex ################################# Welcome to BioEM's documentation! ################################# *A software for Bayesian analysis of EM images* :Authors: P. Cossio, D. Rohr, F. Baruffa, L. Stanisic, M. Rampp, V. Lindestruth, G. Hummer :Organization: *Max Planck Institute of Biophysics* *Frankfurt Institute for Advanced Studies* *Max Planck Computing and Data Facility* :Date: December 2015 :Preface & Disclaimer: This manual is a preliminary guide for installation and use of the BioEM software. It is not intended to be complete. As the BioEM code is improved and developed, the manual will be updated. For any comments or questions please contact: *pilar.cossio@biophys.mpg.de*. :Copyright: :math:`<`\ Bayesian inference of Electron Microscopy images\ :math:`>` **BioEM 1.0** software for Copyright (C) 2016 Pilar Cossio, David Rohr, Fabio Baruffa, Markus Rampp, Volker Linderstruth and Gerhard Hummer. **BioEM 2.0** Copyright (C) 2018 Pilar Cossio, Luka Stanisic, Markus Rampp and Gerhard Hummer. **BioEM 2.1** Copyright (C) 2019 Pilar Cossio, Luka Stanisic, Markus Rampp and Gerhard Hummer. The BioEM program is a free software, under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License. This program is distributed in the hope that it will be useful, but **without any warranty**. See the GNU General Public License for more details. :Citation: Please cite refs. :cite:`CossioHummerJSB_2013,BioEM_server`. The BioEM software ================== Introduction ------------ Most biological systems are dynamic, they change conformation with time and inter-convert between several functional metastable states. These flexible biomolecules can be characterized using electron microscopy (EM), a technique that produces frozen images of the sample in a near-native environment. Each individual image contains information of the instantaneous configuration of the biomolecule, and, in principle, each particle can be in a different conformational state. However, analyzing the images individually is challenging because the signal-to-noise level is very low. This has so far limited EM to study a subset of static biomolecules because the reconstruction of high-resolution density maps requires most particles to be in the same conformation. Here, we present a computing tool to harness the single-molecule character of EM for studying dynamic biomolecules. With our method, we can categorize and classify models of flexible biomolecules from individual EM images. Bayesian inference of electron microscopy images, BioEM :cite:`CossioHummerJSB_2013,BioEM_server`, allows us to compute the posterior probability of a model given experimental data. The BioEM posterior is calculated by solving a multidimensional integral over many nuisance parameters that account for the experimental factors in the image formation, such as molecular orientation and interference effects. The BioEM software computes this integral via numerical grid sampling over a portable CPU/GPU computing platform. By comparing the BioEM posterior probabilities it is possible to discriminate and rank structural models, allowing to characterize the dynamics of the biological system. In this chapter, we briefly describe the mathematical background of the BioEM method. Then, we present the necessary tools and procedures to install the BioEM software. We describe the prerequisite programs that should be preinstalled on the compute node. Then, we explain the BioEM download files and directories. Lastly, we describe the steps to install BioEM using the CMake program. The commandline executions are using the bash scripting language. .. _theory: Theoretical background ---------------------- The BioEM method calculates the posterior probability of a model, :math:`m`, given a set of experimental images, :math:`\omega \in \Omega`. Its key idea is to create a calculated image, from the original model, as similar as possible to the experimental image. The calculated image is generated using nuisance parameters, :math:`\boldsymbol \theta`, that describe the molecule orientation, interference effects with the Point Spread Function (PSF), uncertainties in the particle center, intensity normalization, offset and noise. :numref:`Fig. %s` exemplifies how a calculated image from a model, with a given set of nuisance parameters, is created. Technically, the model is first rotated to a given orientation, then projected along the :math:`z`-axis, then it is convoluted with a PSF to cope with imaging artifacts, next it is shifted by a certain number of pixels to account for the uncertainties in the particle center. Normalization, and offset in the intensity, as well as noise, are taken implicitly into account. The calculated image is compared to an experimental particle-image, :math:`\omega`, through a likelihood function, :math:`L(\omega|m,\boldsymbol\theta)`. Eq. 7 of ref. :cite:`CossioHummerJSB_2013` shows its analytical formulation. .. _likeliCons: .. figure:: ./img/Fig1_10Dec.* *Steps in building a realistic image starting from a 3D model:* rotation, projection, point spread function convolution, center displacement, and integrated-out parameters of normalization, offset and noise. The likelihood function establishes the similarity between the calculated image and the observed experimental image. The posterior probability of a model, given an experimental image, is a weighted integral over the product of prior probabilities and likelihood, over all nuisance parameters, .. math:: :label: pmom P_{m\omega} \propto \int L(\omega|m,\boldsymbol\theta)p_M(m)p(\boldsymbol\theta) d\boldsymbol\theta~, where :math:`p_M(m)`, :math:`p(\boldsymbol\theta)` are the prior probabilities of model and parameters, respectively. The BioEM software is used to perform the integrals in Eq. :eq:`pmom` over orientation, PSF parameters, and center displacement using numerical grid sampling. The remaining integrals over the intensity normalization, offset, and noise are performed analytically following ref. :cite:`CossioHummerJSB_2013`. The posterior probability of a single model given a set of images, :math:`\omega \in \Omega`, becomes .. math:: :label: pb2 P(m|\Omega) \propto \prod_{\omega=1}^{\Omega}P_{m\omega}~. The main result of the BioEM software is the computation of Eq. :eq:`pb2`. This can be used for model comparison and discrimination (*e.g.*, to rank the best model) or to calculate the posterior probability of a full set of models, :math:`m \in M`, following Eq. 2 of ref. :cite:`CossioHummerJSB_2013`. In this manual, it is assumed that the user has sufficient comprehension of the BioEM theory. Therefore, it is encouraged to read refs. :cite:`CossioHummerJSB_2013,BioEM_server` thoroughly. Installation ------------ Prerequisite programs and libraries ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Before installation, there are several programs and libraries that should be preinstalled on the compute node. First check that the compiler is a modern C++ compiler which is OpenMP compliant. In the following, we give a brief explanation of the mandatory, and optional prerequisite programs. Mandatory preinstalled libraries ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - *FFTW library (minimal version 3.3.3):* is a subroutine library for computing the discrete Fourier transform. It is specifically used in BioEM, to calculate the convolution of the ideal image with the PSF, and the cross-correlation of the calculated image to the experimental image. FFTW can be downloaded from the webpage https://fftw.org/. Optional preinstalled programs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The optional but *encouraged* to use programs for an easy compilation, and optimal performance, are described below: - *CMake (minimal version 2.6):* is a cross-platform software for managing the build process of software using a compiler-independent method (*i.e.*, creating a Makefile). CMake can be downloaded from https://cmake.org/. - *CUDA (minimal version 5.5):* is a parallel computing platform implemented by the graphics processing units (GPUs) that NVIDIA produce. Thus, NVIDIA graphics cards are necessary for running BioEM with the CUDA implementation. For more information see https://nvidia.com/. - *MPI:* Message Passing Interface is a standardized and portable message-passing system designed to function on a wide variety of parallel computers, with and without shared-memory. Any MPI platform (either openMPI or MPICH) can be used with BioEM. The minimal version of *impi* is 5.0. - *Git:* is a system that is used for project development (see https://git-scm.com/). Git can be used to clone the BioEM software into a local directory. After these programs are successfully installed on your compute node, it will be possible to install BioEM. .. note:: It is recommended that the same compiler that is used to compile the libraries is also used to compile BioEM. .. _download: Download ~~~~~~~~ The BioEM software can be cloned using *git* from https://github.com/bio-phys/BioEM with .. code-block:: bash git clone https://github.com/bio-phys/BioEM A compressed directory of the BioEM software can be also directly downloaded from https://github.com/bio-phys/BioEM. After downloading the *zip* file, uncompress it by executing .. code-block:: bash unzip BioEM.zip In the **BioEM** directory there are: - the source code *.cpp* and *.cu* files. - the **include** directory with corresponding header files. - the copyright license, and *README.md* file. - the *CMakeLists.txt* file that is necessary for installation with CMake (see below). - the **Quaternions** directory that includes files with lists of quaternions that sample uniformly the rotational group *SO3* (section :ref:`intor`). .. _download_tutorials: Download BioEM-tutorials ^^^^^^^^^^^^^^^^^^^^^^^^ The BioEM tutorials can be cloned using *git* from https://github.com/bio-phys/BioEM-tutorials with .. code-block:: bash git clone https://github.com/bio-phys/BioEM-tutorials A compressed directory of the BioEM tutorials can be also directly downloaded from https://github.com/bio-phys/BioEM-tutorials. After downloading the *zip* file, uncompress it by executing .. code-block:: bash unzip BioEM-tutorials.zip In the **BioEM-tutorials** directory there are: - the **Tutorial\_BioEM** directory that includes the example files used in the tutorial (chapter :ref:`tutorial`). Inside this directory, there is also a directory called **MODEL\_COMPARISON**. - the **Cross-Validation\_Tutorial** directory that includes the example files used in the cross-validation tests. Installing BioEM with CMake ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The easiest installation of BioEM is done with the CMake program. CMake contains all the instructions to generate automatically a *Makefile* according to the specific architecture of the computing node, and the desired features of parallelization. CMake uses the *CMakeLists.txt* file. This file is provided in the uncompressed **BioEM** directory. The *CMakeLists.txt* has several modifiable options, that should be enabled/disabled (**ON**/**OFF**, respectively) according to the desired functionalities. The keywords for the modifiable options are shown in :numref:`Table %s `. These options can be enabled or disabled by executing cmake with .. code-block:: bash -D=ON/OFF For example, to turn on the compilation with CUDA run .. code-block:: bash cmake -DUSE_CUDA=ON CMakeLists.txt It is also possible to modify these options directly in the CMakeLists.txt file. At the beginning of this file, the keywords and ON/OFF options are presented. .. _tableCMake: .. table:: CMake keyword options. +-----------------------------+---------------------------------------------------------+ | **** | **Option** | +=============================+=========================================================+ | ``USE_OPENMP`` | Enable/Disable OpenMP | +-----------------------------+---------------------------------------------------------+ | ``USE_MPI`` | Enable/Disable MPI | +-----------------------------+---------------------------------------------------------+ | ``USE_CUDA`` | Enable/Disable CUDA | +-----------------------------+---------------------------------------------------------+ | ``PRINT_CMAKE_VARIABLES`` | Printout CMake variables | +-----------------------------+---------------------------------------------------------+ | ``CUDA_FORCE_GCC`` | | Force of GCC as host compiler for CUDA part | | | | (If standard host compiler is incompatible with CUDA) | +-----------------------------+---------------------------------------------------------+ .. note:: For certain architectures, an *FindFFTW.cmake* may be required to find the FFTW libraries. This file is included in the **BioEM** directory. Steps for basic installation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - Create a build directory in the main **BioEM** directory, and access it by .. code-block:: bash mkdir build && cd build - Run CMake with the desired options and the *CMakeLists.txt* file .. code-block:: bash cmake -D=ON -D=OFF ../CMakeLists.txt - If this process is successful, a *Makefile* and **CMakeFiles** directory should be generated. If this is not the case, enable the variable ``PRINT_CMAKE_VARIABLES``, and re-run CMake with verbosity to debug. - After generating the *Makefile*, execute it .. code-block:: bash make - If this process is successful a ``bioEM`` executable should be generated. For a simple test, run the BioEM executable .. code-block:: bash ./bioEM If the code runs successfully, the output on the terminal screen should be as shown in :numref:`Listing %s`. .. .. _tabletest: .. code-block:: none :caption: BioEM commandline input options :name: cmdline Command line inputs: --Modelfile arg (Mandatory) Name of model file --Particlesfile arg (Mandatory) Name of particle-image file --Inputfile arg (Mandatory) Name of input parameter file --ReadOrientation arg (Optional) Read file name containing orientations --ReadPDB (Optional) If reading model file in PDB format --ReadModelMRC (Optional) If reading model file in MRC format --ReadMRC (Optional) If reading particle file in MRC format --ReadMultipleMRC (Optional) If reading multiple MRCs --DumpMaps (Optional) Dump maps after they were read from particle-image file --LoadMapDump (Optional) Read Maps from dump option --DumpModel (Optional) Dump model after it was read from model file --LoadModelDump (Optional) Read model from dump option --PrintCOORDREAD (Optional) Print model coordinates --OutputFile arg (Optional) For changing the outputfile name --help (Optional) Produce help message BioEM Input =========== In this chapter, we describe the BioEM input commands and keywords. BioEM has two main sources of input: from the commandline and from the input-parameter file. In the first section, we describe each commandline item from :numref:`Listing %s`. In the second section, we describe the keywords that should be specified in the input-parameter file. Lastly, we describe the specific formats of the model, particle-image, and input-parameter files that are used in the BioEM software. Commandline input ----------------- The BioEM software requires a model, a set of experimental images and a input-parameter file. The names of these files are passed to the ``bioEM`` executable via the commandline, as well as their format specifications. We now give a detailed description of the commandline input items shown in :numref:`Listing %s`. .. _modfile: Model file ~~~~~~~~~~ .. option:: --Modelfile The structural model is represented as spheres in 3-dimensional space. The position of the center of the sphere should be specified in the model file, as well as its corresponding radius and number of electrons. These spheres can represent atoms, coarse-grained residues or multi-scale blobs. The radius size approximately determines the resolution of the model. Spheres with radius less than the pixel size are projected on to a single pixel. The name of the file containing the model has to be provided in the commandline when ``bioEM`` is executed: .. code-block:: bash ./bioEM --Modelfile arg where ``arg`` is the model filename. The possible formats for the model (*pdb*, text or starting from **BioEM 2.1** *mrc*) are described in section :ref:`modformat`. Additional features to read the model in BioEM2.1 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If one has to read the same model multiple times, the following options might be useful. The first time the model file is read, include in the commandline the keyword .. option:: --DumpModel This will writeout a file *model.dump* containing the model in binary format, which will be useful for a faster re-reading. To read the dumped model in binary format, use .. option:: --LoadModelDump Note that the *model.dump* file should be in the same directory where the code is executed. Using this last option, it is not necessary to include :option:`--Modelfile` in the commandline. See chapter :ref:`tutorial` for examples. .. _partimag: Particle-image file ~~~~~~~~~~~~~~~~~~~ The name of the experimental particle-image file is passed to the BioEM executable using the commandline: .. option:: --Particlesfile where ``arg`` is the particle-image file name. The possible formats for the particle-images (*mrc* or text) are described in section :ref:`imaformat`. Additional features to read the particle-images ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If one has to read the same particle-image set multiple times, the following options might be useful. The first time the particle-image file is read, include in the commandline the keyword .. option:: --DumpMaps This will writeout a file *maps.dump* containing the particle-images in binary format, which will be useful for a faster re-reading. To read the dumped maps in binary format, use .. option:: --LoadMapDump Note that the *maps.dump* file should be in the same directory where the code is executed. Using this last option, it is not necessary to include :option:`--Particlesfile` in the commandline. See chapter :ref:`tutorial` for examples. .. _infile: Input-parameter file ~~~~~~~~~~~~~~~~~~~~ BioEM has two sets of variables. One set describes the physical problem, like the number of pixels, and the parameter integration ranges. Another set describes the runtime configuration, which involves how to parallelize, whether to use a GPU, and some other algorithmic settings. The latter set does not change the output, but has a large influence on the compute performance. The two sets are treated differently, because the first set is related to the actual problem, while the second set belongs to the compute node where the problem is processed. For a detailed description of the performance variables see chapter :ref:`perfparm`. The physical parameters are passed via an input-parameter file that contains specific keywords for the physical constraints, and integration limits of the algorithm. The name of the input-parameter file is passed via the commandline: .. option:: --Inputfile where ``arg`` is the filename. In section :ref:`inparam`, we describe in detail the keywords used in the input-parameter file. .. _ortfile: Orientations from a file ~~~~~~~~~~~~~~~~~~~~~~~~ In BioEM there is an option to read the orientations of a model directly from a file, instead of calculating them in the code (see also section :ref:`intor`). This option provides more flexibility to perform the integral over the orientations. For this feature use the following commandline keyword .. option:: --ReadOrientation where ``arg`` is the name of the file containing the list of orientations. The format for the orientations (Euler angles or quaternions) is described in section :ref:`orform`. .. _biout: BioEM output ~~~~~~~~~~~~ By default, the main BioEM output file is called .. outpar:: Output_Probabilities .. object:: Output_Probabilities To change the name of the output file use the following commandline keyword .. option:: --OutputFile where ``arg`` is the desired name of the output file. This file contains the logarithm of the posterior probability of the model to each individual experimental image and the parameter set that gives a maximum of the posterior (see section :ref:`anaout` for its format). To check that model coordinates, radius and density are read correctly, inspect the COORDREAD file that can be generated using the following commandline keyword .. option:: --PrintCOORDREAD Note that printing of COORDREAD file has become optional starting from **BioEM2.1**, while in the older BioEM versions, the output file was always generated. .. _inparam: Input of physical parameters ---------------------------- Up to now, we have seen several commandline inputs that can be used in BioEM. We now focus on the input of the physical parameters that are necessary for the BioEM computation and are read from *inside* the input-parameter file. These parameters describe the physical constraints of the algorithm, such as the integration ranges and grid points, and are passed using specific keywords in the this file (see also section :ref:`infile`). Micrograph parameters ~~~~~~~~~~~~~~~~~~~~~ Mandatory inputs for the description of the experimental particle-image are .. inpar:: PIXEL_SIZE .. object:: PIXEL_SIZE (float) Pixel size in :math:`\AA` of the experimental micrograph. .. inpar:: NUMBER_PIXELS .. object:: NUMBER_PIXELS (int) We assume a square particle-image. Here, ``(int)`` is the number of pixels in each dimension, *e.g.*, for a particle-image of 220 x 220 pixels, then ``(int)= 220``. In the BioEM calculation, the integration over the model orientations, PSF parameters, and center displacement are performed numerically. To do so, one needs to define the integration ranges, and grid spacing for each parameter. These quantities depend on the experimental conditions, such as defocus range, and thus should be specified by the user. .. _intor: Integration of orientations ~~~~~~~~~~~~~~~~~~~~~~~~~~~ There are two ways to describe the orientation of the model in 3D space: with the Euler angles or with quaternions. - *Euler Angles*. The Euler angles are :math:`\alpha,\beta,\gamma`, and represent a sequence of three elemental rotations about the axes of a coordinate system. We use the reference rotations :math:`Z_1 X_2 Z_3`, such that the first rotation is around the :math:`z`-axis by an angle :math:`\alpha`, the second rotation is around the :math:`x`-axis by an angle :math:`\beta`, and a last rotation is again around the :math:`z`-axis by an angle :math:`\gamma`. - *Quaternions*. The orientation of a rigid body can also be described with quaternions. A set of quaternions is a four-dimensional vector over the real numbers (:math:`q_1`, :math:`q_2`, :math:`q_3`, :math:`q_4`) each within :math:`[-1,1]` such that :math:`1=q_1^2+q_2^2+q_3^2+q_4^2`. There are several ways to sample the space of Euler angles or quaternions. We *importantly remark* that not all possibilities sample uniformly the group of rotations in 3D space (*SO3*), which is crucial to perform a fast and accurate integration of uniformly distributed model orientations. Uniform sampling of SO3 ^^^^^^^^^^^^^^^^^^^^^^^ To uniformly sample *SO3*, we recommend using a list of quaternions generated with the successive orthonormal images method from ref. :cite:`Yershova2010`. In the directory **Quaternions**, we provide lists of quaternions that have been generated using this method. Here, it is necessary to follow section :ref:`ortfile` because a list of quaternions is read from a separate file. To use quaternions the keyword :inpar:`USE_QUATERNIONS` in the input-parameter file is also required. Non-uniform sampling ^^^^^^^^^^^^^^^^^^^^ It is also possible to have trivial grid-sampling of the Euler angles or quaternions: - *Grid-sampling of the Euler Angles (*\ :math:`\alpha,\beta,\gamma`\ *):* Sampling of the full Euler angle space within an uniform cubic-grid: :math:`\alpha \in [-\pi,\pi]`, :math:`\cos(\beta) \in [-1,1]` and :math:`\gamma \in [-\pi,\pi]`. Here one needs to provide the number of grid points in :math:`\alpha`, and :math:`\cos(\beta)`. By default, the grid spacing of Euler angle :math:`\gamma` will be the same as that of :math:`\alpha`. The keywords in the parameter file are .. inpar:: GRIDPOINTS_ALPHA .. object:: GRIDPOINTS_ALPHA (int) .. inpar:: GRIDPOINTS_BETA .. object:: GRIDPOINTS_BETA (int) where ``(int)`` is the number of grid points. .. note:: For an optimal grid spacing, it is recommended that ``GRIDPOINTS_ALPHA~ 2*GRIDPOINTS_BETA``. - *Grid-sampling of quaternions:* With BioEM it is also possible to generate a grid in quaternion space. One should provide the keywords .. inpar:: USE_QUATERNIONS .. object:: USE_QUATERNIONS .. inpar:: GRIDPOINTS_QUATERNION .. object:: GRIDPOINTS_QUATERNION (int) where ``(int)`` is the grid spacing in each dimension :math:`[-1,1]`. - *Non-uniform sampling of orientations from a file:* We note that with the option of reading the orientations from a file (section :ref:`ortfile`) the user has great freedom to sample also non-uniformly the orientation space (for example around a given orientation, see :ref:`modcom`). Integration of the PSF parameters ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To take into account the interference effects in the experiment, we convolute the ideal image from the model with the PSF. In practice, we use its Fourier-space equivalent, which is the multiplication the contrast transfer function (CTF) and envelope function. An approximate expression for the CTF is .. math:: \mathrm{CTF}(s)=-A\cos(as^2/2)-\sqrt{1-A^2}\sin(as^2/2), where :math:`s` is the radial spatial frequency, and :math:`a=2\pi \lambda \Delta f` with :math:`\lambda` is the electron wavelength, and :math:`\Delta f` is the defocus. Parameter :math:`A \in [0,1]` establishes the contributions of the cosine and sine components. The envelope function is .. math:: \mathrm{Env}(s)=e^{-bs^2/2}, where parameter :math:`b` controls the Gaussian width and modulates the CTF. To calculate the BioEM posterior probability, we integrate numerically the three parameters :math:`\Delta f`, :math:`b` and :math:`A`. To do so, one should include in the input-parameter file the keyword for each parameter, its integration limits, and number of grid points: *Parameter – (start) – (end) – (gridpoints)* .. inpar:: CTF_DEFOCUS .. object:: CTF_DEFOCUS (float) (float) (int) .. inpar:: CTF_B_ENV .. object:: CTF_B_ENV (float) (float) (int) .. inpar:: CTF_AMPLITUDE .. object:: CTF_AMPLITUDE (float) (float) (int) The defocus, :math:`\Delta f`, should be in units of :math:`\mu`\ m, and :math:`b` in Å\ :math:`^2`. The amplitude parameter :math:`A` is adimensional within :math:`[0,1]`. The default value of the electron wavelength is 0.019688\ :math:`\AA`, which corresponds to a :math:`300 kV` microscope. To change this value use the keyword .. inpar:: ELECTRON_WAVELENGTH .. object:: ELECTRON_WAVELENGTH (float) where ``(float)`` should be in :math:`\AA`. Integration of center displacement ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The integration of the particle center is done over a square and uniform grid. The particle, along both directions, is translated from its center up to a maximum distance (*max displ.*). Users should provide this maximum displacement and the grid spacing in units of pixels. The keyword in parameter file is: *Parameter - (max displ.) - (grid-space)* .. inpar:: DISPLACE_CENTER .. object:: DISPLACE_CENTER (int) (int) If ``[DISPLACE_CENTER 10 2]``, the integration will be done along :math:`x` within :math:`[x_c-10,x_c+10]` (where :math:`x_c` is the center), and :math:`[y_c-10,y_c+10]` along :math:`y`, with sampling every 2 pixels. The integration over the *normalization*, *offset* and *noise* are carried out analytically. See Supplementary Information of ref. :cite:`CossioHummerJSB_2013`. .. _priorsec: Priors ~~~~~~ - *Uniform model prior probability:* To include a uniform model prior use the following keyword in the input-parameter file .. inpar:: PRIOR_MODEL .. object:: PRIOR_MODEL (float) where ``(float)`` is the value of the model’s prior. - *Prior for orientations:* It is possible to assign prior probabilities for each orientation. The keyword .. inpar:: PRIOR_ANGLES .. object:: PRIOR_ANGLES allows to read the prior of each orientation from the input file of orientations (see section :ref:`ortfile`). An extra column of format “%12.6f” should be added in the orientations-file, which indicates the value of the prior probability for each orientation. - *Prior for* :math:`b` *envelope parameter:* To avoid full loss of the high-frequency components in Fourier space, the code utilizes a Gaussian prior on the :math:`b` envelope parameter .. math:: p(b)=\frac{1}{2\sqrt{2\pi}\sigma_b}e^{-b^2/2\sigma_b^2}, where :math:`\sigma_b` is the Gaussian width. By default the Gaussian prior is centered at zero, and :math:`\sigma_b=100\AA`, to modify the width include in the input-parameter file the keyword .. inpar:: SIGMA_PRIOR_B_CTF .. object:: SIGMA_PRIOR_B_CTF (float) where ``(float)`` is the desired :math:`\sigma_b`. See also the supporting information of ref. :cite:`BioEM_server`. - *Prior for* :math:`\Delta f` *defocus parameter:* BioEM implements a Gaussian prior on the :math:`\Delta f` defocus parameter .. math:: p(\Delta f)=\frac{1}{\sqrt{2\pi}\sigma_{\Delta f}}e^{-(\Delta f - \Delta f_c)^2/2\sigma_{\Delta f}^2}, where :math:`\sigma_{\Delta f}` is the Gaussian width and :math:`\Delta f_c` is the Gaussian center. By default :math:`\sigma_{\Delta f}=1.0\mu`\ m, and :math:`\Delta f_c=3.0\mu`\ m. To modify these values include in the input-parameter file the keyword .. inpar:: SIGMA_PRIOR_DEFOCUS .. object:: SIGMA_PRIOR_DEFOCUS (float) where ``(float)`` is the desired :math:`\sigma_{\Delta f}`, and .. inpar:: PRIOR_DEFOCUS_CENTER .. object:: PRIOR_DEFOCUS_CENTER (float) to change the Gaussian center :math:`\Delta f_c`. See also the supporting information of ref. :cite:`BioEM_server`. - *Prior for* :math:`A` *amplitude parameter:* BioEM implements a Gaussian prior on the :math:`A` amplitude parameter .. math:: p(A)=\frac{1}{\sqrt{2\pi}\sigma_{A}}e^{-(A - A_c)^2/2\sigma_{A}^2}, where :math:`\sigma_{A}` is the Gaussian width and :math:`A_c` is the Gaussian center. By default :math:`\sigma_{A}=0.3`, and :math:`A_c=0`. To modify these values include in the input-parameter file the keyword .. inpar:: SIGMA_PRIOR_AMP_CTF .. object:: SIGMA_PRIOR_AMP_CTF (float) where ``(float)`` is the desired :math:`\sigma_{A}`, and .. inpar:: PRIOR_AMP_CTF_CENTER .. object:: PRIOR_AMP_CTF_CENTER (float) to change the Gaussian center :math:`A_c`. .. _angprob: Posterior probability as a function of orientations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ One can write out the log-posterior as a function of each orientation. In this case, the integration is performed over the CTF parameters, particle-center, normalization, offset and noise, but not over the orientations. The keyword in parameter file is .. inpar:: WRITE_PROB_ANGLES .. object:: WRITE_PROB_ANGLES (int) With this feature there is an additional output file :outpar:`ANG_PROB` where ``(int)`` orientations with highest posterior are written. The orientations in this file are sorted in decreasing log-posterior order. Overview of keywords for the input-parameter file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In the following, we provide a list of the possible keywords read from the input-parameter. BioEM posterior probability computation: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - :inpar:`PIXEL_SIZE` ``(float)``: Micrograph pixel size in Å. - :inpar:`NUMBER_PIXELS` ``(int)``: Assuming a square particle-image, it is the number of pixels along an axis. This should coincide with the number of pixels read from the micrograph. - :inpar:`CTF_DEFOCUS` ``(float) (float) (int)``: (CTF integration) Grid sampling of CTF defocus, :math:`\Delta f`. Units of micro-meters. ``(float) (float)`` are the starting and ending limits, respectively, and ``(int)`` is the number of grid points. - :inpar:`CTF_B_ENV` ``(float) (float) (int)``: (CTF integration) Grid sampling of envelope parameter :math:`b`. Units of Å\ :math:`^2`. ``(float) (float)`` are the starting and ending limits, respectively, and ``(int)`` is the number of grid points. - :inpar:`CTF_AMPLITUDE` ``(float) (float) (int)``: (CTF integration) Grid sampling of the CTF amplitude, :math:`A` (adimensional :math:`\in [0,1]`). ``(float) (float)`` are the starting and ending limits, respectively, and ``(int)`` is the number of grid points. - :inpar:`DISPLACE_CENTER` ``(int) (int)``: (Integration of particle center displacement) Sampling within a square grid. Units of pixels. ``(int) (int)`` are the maximum displacement from the center in both directions, and the grid spacing, respectively. Optional keywords: ^^^^^^^^^^^^^^^^^^ - :inpar:`GRIDPOINTS_ALPHA` ``(int)``: (Integration of orientations, mandatory if quaterionions or `--ReadOrientation` are not used) Number of grid points used in the integration over Euler angle :math:`\alpha \in [-\pi,\pi]`. Here a cubic grid in Euler angle space is performed. The integral over Euler angle :math:`\gamma` is identical to that of :math:`\alpha`. - :inpar:`GRIDPOINTS_BETA` ``(int)``: (Integration of orientations, mandatory if quaterionions or `--ReadOrientation` are not used) Number of grid points used in the integration over :math:`\cos(\beta) \in [-1,1]`. - :inpar:`USE_QUATERNIONS`: (Integration of Orientations) If using quaternions to the describe the orientations. *Recommended* for uniformly sampling of :math:`SO3` with the quaternions lists available in the **Quaternions** directory. - :inpar:`GRIDPOINTS_QUATERNION` ``(int)``: (Integration of Orientations) For a hypercubic grid quaternion sampling. Each quaternion is within :math:`[-1,1]`. ``(int)`` is the number of grid points per dimension. - :inpar:`ELECTRON_WAVELENGTH` ``(float)``: To change the default value of the electron wavelength ``(float)`` used to calculate the CTF phase with the defocus. Default 0.019688 :math:`\AA`. - :inpar:`PRIOR_MODEL` ``(float)``: Prior probability of model. **Default** 1. - :inpar:`PRIOR_ANGLES`: To read the prior of each orientation in the input file of orientations. - :inpar:`SIGMA_PRIOR_B_CTF` ``(float)``: To change the Gaussian width of the prior probability of the CTF envelope parameter :math:`b` (section :ref:`priorsec`). **Default** 100 Å. - :inpar:`SIGMA_PRIOR_DEFOCUS` ``(float)``: To change the Gaussian width of the prior of the defocus :math:`\sigma_{\Delta f}` (section :ref:`priorsec`). **Default** 1 :math:`\mu` m. - :inpar:`PRIOR_DEFOCUS_CENTER` ``(float)``: To change the Gaussian center of the prior of the defocus :math:`\Delta f_c` (section :ref:`priorsec`). **Default** 3 :math:`\mu` m. - :inpar:`SIGMA_PRIOR_AMP_CTF` ``(float)``: To change the Gaussian width of the prior of the amplitude :math:`\sigma_{A}` (section :ref:`priorsec`). **Default** 0.3. - :inpar:`PRIOR_AMP_CTF_CENTER` ``(float)``: To change the Gaussian center of the prior of the amplitude :math:`A_c` (section :ref:`priorsec`). **Default** 0. - :inpar:`NO_MAP_NORM`: Condition to not normalize to zero mean and unit variance the input maps. - :inpar:`WRITE_PROB_ANGLES` ``(int)``: To write out the posterior as a function of the best ``(int)`` orientation. File formats ------------ .. _modformat: Formats for the model file ~~~~~~~~~~~~~~~~~~~~~~~~~~ There are two types of model file formats that are read by BioEM: - *Text file:* A simple text file with format “%f %f %f %f %f”. The first three columns are the coordinates of the sphere centers in :math:`\AA`, the fourth column is the radius in :math:`\AA`, and the last column is the corresponding number of electrons (which can be non-integer). (Format: ``x — y — z — radius — number electrons``). This format is useful for all atom, mixed or coarse-grained representations of the density maps. - *pdb file:* BioEM reads the C\ :math:`_\alpha` atom positions with their corresponding residue type from standard *pdb* files. A residue is modeled as a sphere, centered at the C\ :math:`_\alpha`, with van-der-Waals radii and number of electrons corresponding to the specific amino acid type (as in ref. :cite:`CossioHummerJSB_2013`). To read pdb files the following commandline keyword is needed (related to section :ref:`modfile`): .. option:: --ReadPDB - *mrc file:* BioEM reads the intensities from standard *mrc* files. The pixel size is read from the Input-parameter file (:ref:`infile`) To read mrc files the following commandline keyword is needed (related to section :ref:`modfile`): .. option:: --ReadModelMRC .. _imaformat: Formats for the particle-images ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Two format options are allowed for the the particle-image file: .. inpar:: PARTICLE - *Text file:* Data are formatted as “%8d%8d%16.8f” where the first two columns are the pixel indexes, and the third column is the intensity at that pixel. Multiple particles are read in the same file with the separator :inpar:`PARTICLE`. Pixel indexes should start at 0, and all pixels should be included. - *.mrc file:* BioEM also reads standard *.mrc* particle-image files. To do so, the additional commandline keyword is needed: .. option:: --ReadMRC - If reading multiple *mrc* files, the name of the file containing the *list* of all the *mrc* files should be provided. The additional command is required: .. option:: --ReadMultipleMRC *Example:* .. code-block:: bash --Particlesfile LIST --ReadMRC --ReadMultipleMRC ``LIST`` is the name of the file containing the list of names of the multiple *mrc* files. .. inpar:: NO_MAP_NORM .. note:: When *mrc* particles are read, by default the intensities are normalized to zero average and unit standard deviation. Use the keyword ``NO_MAP_NORM`` in the input-parameter file to unset this default. .. _orform: Formats for the orientations file ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Related to sections :ref:`ortfile` and :ref:`intor`. The format for the orientations file is described in the following: - The first row of the file should have ``(int)`` equal to the total number of orientations. - The orientations can be described with Euler angles, or with quaternions: - *Euler angles*. These are Euler angles :math:`\alpha,\beta,\gamma` in radians, which representing three rotations about axis :math:`Z_1X_2Z_3`. The format for the file containing the Euler angles is “%12.6f%12.6f%12.6f”, ordered as :math:`\alpha,\beta,\gamma`, respectively. - *Quaternions*. A set of quaternions is a four-dimensional vector over the real numbers (:math:`q_1`, :math:`q_2`, :math:`q_3`, :math:`q_4`) each within :math:`[-1,1]`. The format for this file containing the quaternions should be “%12.6f%12.6f%12.6f%12.6f”, ordered as :math:`q_1`, :math:`q_2`, :math:`q_3`, and :math:`q_4`, respectively. To use quaternions the keyword :inpar:`USE_QUATERNIONS` should be placed in the input-parameter file. - **Prior for orientations.** Its possible to assign prior probabilities to each orientation. To do so, one should add at the end of each line an extra column (of format “%12.6f”) that indicates the value of the prior probability for each orientation. .. _anaout: Output format ~~~~~~~~~~~~~ The main BioEM output file is called :outpar:`Output_Probabilities` by default. Its name can be changed using the commandline :option:`--OutputFile` as described in section :ref:`biout`. This file contains the logarithm of the posterior probability of the model to each individual experimental image. .. code-block:: bash RefMap [number Particle Map] LogProb [ln(P)] It also reports the parameter grid values that give a maximum value of the posterior probability. .. code-block:: bash RefMap [number Particle Map] Maximizing Param: [Orientation] [PSF parameters] [center displacement] [norm] [offset] *Important remark:* The posterior probability is not normalized. Thus, it is always recommended to compare :math:`\ln (P)` of different models or relative to noise as in ref. :cite:`CossioHummerJSB_2013` (see also section :ref:`modcom`). Before executing a production run, it is recommended to check that the values of the log-posterior are finite, and the parameters that give a maximum of the posterior are in a reasonable range (*e.g.*, not at the borders of the integration limits). Optional outputs ^^^^^^^^^^^^^^^^ The optional output files for BioEM are: .. outpar:: COORDREAD .. object:: COORDREAD This file used to be generated always, but in the newer BioEM versions it is generated only when the option :option:`--PrintCOORDREAD` is used. This output file allows to check that the model coordinates, radius and density are read correctly. It is reported in the following format .. code-block:: bash Text --- Number ---- x ---- y ---- z ---- radius ---- number of electron .. outpar:: ANG_PROB .. object:: ANG_PROB Related to section :ref:`angprob`. This file has the posterior probabilities for each orientation, which was specified with the keyword :inpar:`WRITE_PROB_ANGLES` in the parameter inputfile. For the Euler angles, the format of the output file is .. code-block:: bash [Map number -- alpha -- beta -- gamma -- log Probability] For the quaternions, its format is .. code-block:: bash [Map number -- q1 -- q2 -- q3 -- q4 -- log Probability] .. _perfparm: Performance =========== The BioEM performance variables enhance or modify the code’s computational performance without modifying the numerical results. They should be tuned for the specific computing node characteristics where BioEM is executed. They are passed via environment variables using the bash scripting language. In the following chapter, we explain the types of parallelization used within the BioEM software, list all relevant environment variables, and provide some suggestions for runtime configurations in different situations. .. _wayparallel: Ways of parallelization ----------------------- BioEM compares various projections of a model to a set of reference particle-images. As explained in section :ref:`theory` the model is first projected along a given angular orientation, then it is convoluted with the PSF, next it is shifted by a certain number of pixels to account for the center displacement, and finally this modified projection is compared to a reference particle-image. From a computational complexity perspective, the performance depends mostly on the number of angular orientations relative to the number experimental images. If there are many experimental images and many orientations then the comparison of the calculated projection to all the experimental images is by far the most time consuming part. However, if there are few experimental images and many orientations, the comparison part is not the time-limiting step. **BioEM 2.0** has been optimized for both CPU and GPU performance according to two different scenarios: - **Many orientations versus** *many* **experimental images** - **Many orientations versus** *few* **experimental images** Because the optimal parallelization scheme changes depending on the previous conditions, we address each item separately. .. _multiorvsmultiim: Many orientations vs. many experimental images ---------------------------------------------- BioEM facilitates the comparison of many orientations to many experimental images using an all model projections to an all particle-image comparison through a nested loop. For this case, the following external variable modulates the BioEM optimization algorithm: .. code-block:: bash export BIOEM_ALGO=1 As shown in Fig. 2 of ref. :cite:`BioEM_server`, in the :envvar:`BIOEM_ALGO`\ ``=1`` the outermost loop is over the orientations and the inner most loop iterates over all particle-images and center displacements. Parallelization ~~~~~~~~~~~~~~~ There are multiple dimensions for parallelization: - *MPI:* BioEM uses MPI to parallelize over the orientations in the outermost loop. In this case the probabilities for all particle-images / PSF kernels / center displacements are calculated for a certain subset of orientations by each MPI process. Afterward, the probabilities computed by every MPI process are reduced to the final probabilities. If started via ``mpirun``, BioEM will automatically distribute the orientations evenly among all MPI processes. - *OpenMP:* BioEM can use OpenMP to parallelize over the particle images in the innermost loop. As processing of these particle-images is totally independent, there is no synchronization required at all. BioEM will automatically multithread over the particle-images. The number of employed threads can be controlled with the standard .. code-block:: bash export OMP_NUM_THREADS=[x] environment variable for OpenMP, where ``[x]`` is the number of OpenMP threads. - *Graphics Processing Units (GPUs):* BioEM can use GPUs to speed up the processing. In this case, the innermost loop over all particle-images, and with all center displacements, is processed by the GPU. The projections and the PSF convolutions are still processed by the CPU. This process is pipelined such that the CPU prepares the next projections, and PSF convolutions while the GPU calculates the probabilities to all particle-images for the previous calculated projections. Hence, this is a horizontal parallelization layer among the particle images with an additional vertical layer through the pipeline. Usage of GPUs must be enabled with the .. code-block:: bash export GPU=1 environment variable. One BioEM process will always only use one GPU, by default the fastest one. A GPU device can be explicitly configured with the environment variable: .. code-block:: bash export GPUDEVICE=[x] Multiple GPUs can be used through MPI. In this case, every GPU will process all particle-images but calculate the probabilities only for a subset of the orientations (see description of MPI above). Selection of GPU devices for each process must be carried out by .. code-block:: bash export GPUDEVICE=-1 In this case the MPI process with rank N on a system with G GPUs will take the GPU with ID (N % G). This option is mandatory when using MPI. - *GPU / CPU combined processing:* Besides the pipeline approach described in the previous point, which employs the CPU for creating the calculated image, and the GPU for calculating the likelihood to all particle-images, there is also the possibility to split the set of particle-images among the CPU and the GPU. This is facilitated by the environment variable .. code-block:: bash export GPUWORKLOAD=-1 that automatically sets the percentage of particle-images processed by the GPU. It is also possible to not use this autotuning option but to set a static value provided by the user .. code-block:: bash export GPUWORKLOAD=[x] where :math:`0\le x \le100` provides the x% of particles processed by the GPU. However, the autotuning option is set by default. In an optimal situation the CPU will: - Issue a GPU kernel call such that the GPU calculates the probabilities for x% of the particle-images for the current orientation and convolution. - Process its own fraction of (100-x)% of the particle-images in parallel to the GPU. - Afterward, finish the preparation of the next orientation and PSF convolution before the GPU has finished calculating the probabilities for the current orientation and PSF convolution. - *Multiple Projections/Convolutions at once via OpenMP:* BioEM can prepare the projections of multiple orientations and convolutions at once using OpenMP. The benefit compared to the pure OpenMP parallelization over the particle images, however, is tiny, while the memory requirements are drastically increased. This is relevant if MPI is not used, OpenMP is used, GPU is not used, and if the number of reference particle-image is small. The number of projections at once is determined by the environment variable .. code-block:: bash export BIOEM_PROJ_CONV_AT_ONCE=[x] where ``[x]`` is the number of projections that will be calculated simultaneously. - *Fourier-algorithm to process all center displacements in parallel:* BioEM uses as default the Fourier-algorithm to calculate the cross-correlation. The Fourier-algorithm automatically takes all displacements into account without having to loop over them. Hence, its runtime is almost independent from the number of center displacements (see ref. :cite:`BioEM_server`). Parallelization on only CPUs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For parallelization over the CPU cores: - One can use MPI with as many MPI processes as there are CPU cores :math:`\times` nodes, and with :envvar:`OMP_NUM_THREADS`\ ``=1``. In this case, the parallelization is done only over the orientations. - On a single node, one can use OpenMP to parallelize over the particle images, and optionally using the environmental variable :envvar:`BIOEM_PROJ_CONV_AT_ONCE`\ ``=[x]`` to increase number of projections/convolutions processed in parallel. - One can combine both MPI and OpenMP, as shown in ref. :cite:`BioEM_server`. For instance, on a single node, :envvar:`OMP_NUM_THREADS`\ ``=[x]`` can be set to ``x = 1/4 N``, where ``N`` is the number of CPU cores on the system, and BioEM can be called with ``mpirun``, and 4 MPI processes. In this case, four orientations are processed in parallel using MPI, and ``x`` particle-images are processed in parallel using OpenMP. - If multiple nodes are used MPI is mandatory, and should be combined with OpenMP. Optimal work distribution will depend on the number of orientations (parallelization with MPI) compared to the number of particle-images (parallelization with OpenMP). .. note:: To find the optimal performance setup for only CPUs, it is recommended to try both BioEM algorithms :envvar:`BIOEM_ALGO`\ ``=1`` and :envvar:`BIOEM_ALGO`\ ``=2`` with different combinations of the options described. Parallelization on CPUs and GPUs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Naturally, different methods of parallelization can be combined with the GPU: - One can combine MPI with the GPU algorithm to use multiple GPUs at once. The number of MPI processes has to be equal to the number of available GPUs. - One can use GPUs and CPU cores jointly to calculate the probabilities for all particle-images with OpenMP and the :envvar:`GPUWORKLOAD`\ ``=-1`` autotunning variable. For more than one GPU, MPI must be employed. In this case, the number of MPI processes must match the number of GPUs. So it is important to combine MPI, and OpenMP inside one node in order to use all CPU cores. Examples of possible ways of parallelization are shown in Fig. 5 and 6 of ref. :cite:`BioEM_server` for the FRH protein complex system. Many orientations vs. few experimental images --------------------------------------------- **BioEM2.0** has been optimized to treat many orientations and few experimental images using GPUs and CPUs. For this case, the following external variable modulates the BioEM algorithm: .. code-block:: bash export BIOEM_ALGO=2 In this algorithm, the parallelization for GPU is now done on a lower level: the GPU (or OpenMP for the only CPU case) processes the center displacements, whilst the CPU with MPI processes the orientations and with OpenMP the projections and convolutions. Hence, there is more parallelism and better performance for the GPU for this case. Parallelization ~~~~~~~~~~~~~~~ We present the different parallelization options when using the :envvar:`BIOEM_ALGO`\ ``=2``: - *MPI:* Similarly as with :envvar:`BIOEM_ALGO`\ ``=1`` (section :ref:`multiorvsmultiim`) MPI is used to parallelize over the orientations in the outermost loop. - *OpenMP:* With :envvar:`BIOEM_ALGO`\ ``=2`` the :envvar:`BIOEM_PROJ_CONV_AT_ONCE` is by default equal to :envvar:`OMP_NUM_THREADS`. However, :envvar:`BIOEM_PROJ_CONV_AT_ONCE` can also be modified as described above. Importantly, for :envvar:`BIOEM_ALGO`\ ``=2`` the contribution of :envvar:`BIOEM_PROJ_CONV_AT_ONCE` is significant. These OMP threads are used to work in parallel on the projections, the convolutions, and if GPU is disabled on the center displacements and comparisons. - *Graphics Processing Units (GPUs):* For :envvar:`BIOEM_ALGO`\ ``=2`` the loop over center displacements can be processed by the GPU. The projections and convolutions are still processed by the CPU. The GPU environment variables are :envvar:`GPU`\ ``=1`` to use the GPU and :envvar:`GPUDEVICE`\ ``=[x]`` to select the GPU device. With :envvar:`GPUDEVICE`\ ``=-1`` the GPU is automatically selected. Note that :envvar:`GPUWORKLOAD` is always ``100``, meaning that all center displacements are always processed by GPU. - *Fourier-algorithm to process all center displacements in parallel:* For :envvar:`BIOEM_ALGO`\ ``=2``, the Fourier-algorithm is also default and always used. Parallelization on only CPUs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For :envvar:`BIOEM_ALGO`\ ``=2`` and only CPUs: - One can use MPI with as many MPI processes as there are CPU cores :math:`\times` nodes and with :envvar:`OMP_NUM_THREADS`\ ``=1``. In this case, the parallelization is done only over the orientations . - On a single node, one can use OpenMP with :envvar:`OMP_NUM_THREADS`\ ``=[x]`` to parallelize over the projections, convolutions and center displacements (by default using also :envvar:`BIOEM_PROJ_CONV_AT_ONCE`). - One can combine both MPI and OpenMP where MPI runs over the orientations and OpenMP over the projections, convolutions and center displacements. For instance, on a single node, :envvar:`OMP_NUM_THREADS`\ ``=[x]`` can be set to ``x = 1/4 N``, where ``N`` is the number of CPU cores on the system, and BioEM can be called with ``mpirun``, and 4 MPI processes. - If multiple nodes are used MPI is mandatory, and should be combined with OpenMP. Optimal work distribution will depend on the specifications of the nodes, and the number of orientations compared to the number of particle-images. .. note:: To find the optimal performance setup for only CPUs, it is recommended to try both BioEM algorithms :envvar:`BIOEM_ALGO`\ ``=1`` and :envvar:`BIOEM_ALGO`\ ``=2`` with different combinations of the options described. Parallelization on CPUs and GPUs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For :envvar:`BIOEM_ALGO`\ ``=2``, different methods of GPU and CPU parallelization can be combined: - One can combine MPI with the GPU algorithm to use multiple GPUs at once. The number of MPI processes has to be equal to the number of available GPUs. - One can use GPUs and CPU cores jointly. MPI will parallelize over the orientations, OpenMP can parallelize over the projections and the GPU over the convolutions and center displacements. The number of MPI processes must match the number of GPUs. So it is important to combine MPI and OpenMP inside one node in order to use all CPU cores. Note on the numerical results ----------------------------- **BioEM2.0** combines float and double-precision variables. Float precision is used for most variables within the code, which significantly speeds-up the calculations (see :cite:`BioEM_server`). By contrary, the posterior probability is handled in double precision to maintain a high numerical accuracy. Nonetheless, we note that there could be a minimal numerical difference in the computed probabilities, depending whether CPUs, GPUs or a combination of both is used. This is coming from the different results and rounding errors on different hardware and different underlying libraries, thus it is hard to avoid it. However, in all practical cases this minimal discrepancies can be considered negligible; much smaller than the uncertainties of the numerical integrations. List of environment variables ----------------------------- .. envvar:: BIOEM_ALGO (Default: 1) Set to 1 to enable the BioEM algorithm optimized for many orientations versus *many* experiment images computations. Set to 2 to enable the BioEM algorithm optimized for many orientations versus *few* experiment images computations. .. envvar:: GPU (Default: 0) Set to 1 to enable GPU usage, set to 0 to use only the CPU. .. envvar:: GPUDEVICE (Default: fastest) Only relevant if :envvar:`GPU`\ ``=1``. - If this is not set, BioEM will autodetect the fastest GPU. Only possible if MPI is not used. - If ``x >= 0``, BioEM will use GPU number ``x``. Only possible if MPI is not used. - If ``x = -1``, BioEM runs with ``N`` MPI threads, and the system has ``G`` GPUs, then BioEM will use GPU with number (``N % G``). The idea is that one can place multiple MPI processes on one node, and each will use a different GPU. For a multi-node configuration, one must make sure that consecutive MPI ranks are placed on the same node, *i.e.*, four processes on two nodes (node0 and node1) must be placed as (node0, node0, node1, node1) and not as (node0, node1, node0, node1), because in the latter case only 1 GPU per node will be used (by two MPI processes each). .. envvar:: GPUWORKLOAD (Default: -1 for :envvar:`BIOEM_ALGO`\ ``=1`` and fixed to 100 for :envvar:`BIOEM_ALGO`\ ``=2``) Only relevant if :envvar:`GPU`\ ``=1``. This defines the fraction of the workload in percent. To be precise: the fraction of the number of particle-images processed by the GPU. The remaining particle-images will be processed by the CPU. For :envvar:`BIOEM_ALGO`\ ``=1``, if set to -1 the autotuning option will automatically select the ideal % of particles processed by the GPU. For :envvar:`BIOEM_ALGO`\ ``=2`` it is fixed to :envvar:`GPUWORKLOAD`\ ``=100``. .. envvar:: GPUASYNC (Default: 1) Only relevant if :envvar:`GPU`\ ``=1``. This uses a pipeline to overlap the processing on the GPU, the preparation of projections and convolutions on the CPU, and the DMA transfer. There is no reason to disable this except for debugging purposes. .. envvar:: GPUDUALSTREAM (Default: 1) Only relevant if :envvar:`GPU`\ ``=1``. If this is set to 1, the GPU will use two streams in parallel. This can help to improve the GPU utilization. Benchmarks have shown that there is a very little positive effect by this setting, as utilization of GPU is already high. .. envvar:: BIOEM_CUDA_THREAD_COUNT (Default: 256 for :envvar:`BIOEM_ALGO`\ ``=1`` and 512 for :envvar:`BIOEM_ALGO`\ ``=2``) Only relevant if :envvar:`GPU`\ ``=1``. This variable can explicitly select the number of CUDA threads. Different inputs and algorithms might need different number of threads for an optimized performance, but also to respect hardware (memory) limits of a GPU device. .. envvar:: OMP_NUM_THREADS (Default: Number of CPU cores) This is the standard OpenMP environment variable to define the number of OpenMP threads. It can be used for profiling purposes to analyze the scaling. It can be set to ``x=1`` to use MPI exclusively or to other values for a mixed MPI / OpenMP configuration. .. envvar:: BIOEM_PROJ_CONV_AT_ONCE (Default: 1 for :envvar:`BIOEM_ALGO`\ ``=1`` and ``=``\ :envvar:`OMP_NUM_THREADS` for :envvar:`BIOEM_ALGO`\ ``=2``) This defines the number of projections and convolutions prepared at once. OpenMP threads (whose number is defined by :envvar:`OMP_NUM_THREADS` environment variable) are used to prepare these projections and convolutions in parallel. For :envvar:`BIOEM_ALGO`\ ``=1`` :envvar:`BIOEM_PROJ_CONV_AT_ONCE`\ ``=[x]`` is mostly relevant, if OpenMP is used, no GPU is used, and/or the number of reference particle-image is very small. For :envvar:`BIOEM_ALGO`\ ``=2`` its contribution is important. .. envvar:: BIOEM_DEBUG_BREAK (Default: deactivated) This is a debugging option. It will reduce the number of projection and PSF convolutions to a maximum of ``x`` both. It can be used for profiling to analyze scaling, and for fast sanity tests. .. envvar:: BIOEM_DEBUG_NMAPS (Default: deactivated) As :envvar:`BIOEM_DEBUG_BREAK`, with the difference that this limits the number of reference particle-images to a maximum of ``x``. .. envvar:: BIOEM_DEBUG_OUTPUT (Default: 0) Change the verbosity of the output. Higher means more output, lower means less output. - ``x=0``: Stands for no debug output. - ``x=1``: Limited timing output. - ``x=2``: Standard timing output showing durations of projection, convolution, and cross-correlation comparison. This adds successively more extensive output. Default environment variables ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ With **BioEM2.0** the Fourier-algorithm :cite:`BioEM_server` is always used. This implies that the GPU algorithm is by default ``GPUALGO=2`` (defined in BioEM1.0). It has been shown that for realistic cases, where the particle center is an unknown parameter, the Fourier-algorithm outperforms all other algorithms. Because of this, we have selected it to be permanently default. .. _performsugg: Suggestions for runtime configurations -------------------------------------- Default Settings ~~~~~~~~~~~~~~~~ It is recommended that the following settings should be left at theirs defaults: :envvar:`GPUASYNC` (Default 1), :envvar:`GPUDUALSTREAM` (Default 1). Profiling ~~~~~~~~~ For profiling one can limit the number of orientations, projections and particle-images for example using :envvar:`BIOEM_DEBUG_BREAK` and :envvar:`BIOEM_DEBUG_NMAPS`. However, for accurate estimations, it is recommended to keep the proportion of orientations to particle-images the same as in the actual application. Also a good choice is :envvar:`BIOEM_DEBUG_OUTPUT`\ ``=2`` to get the timing of each projection, convolution and comparison. For a larger number of particle-images it might make sense to switch to :envvar:`BIOEM_DEBUG_OUTPUT`\ ``=1``. Production run: *Many orientations vs. many experimental images* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On only CPUs ^^^^^^^^^^^^ - :envvar:`BIOEM_ALGO`\ ``=1`` to select the BioEM algorithm 1 that optimizes the computation of many orientations to many particle images. - :envvar:`BIOEM_DEBUG_OUTPUT`\ ``=0`` can reduce the size of the text output. - :envvar:`BIOEM_PROJ_CONV_AT_ONCE`\ ``=[x]`` may have a positive effect. The memory footprint increases with ``x``, so it should not be too large. For best performance, choose a multiple of the number of OpenMP threads. - On a single node, one should use OpenMP parallelization for many particle-images and few orientations; and MPI parallelization for few particle-images and many orientations. Assume a system with ``N`` CPU cores, the command for the first would be ``BIOEM_PROJ_CONV_AT_ONCE=[4*N] OMP_NUM_THREADS=[N]`` and for the second ``OMP_NUM_THREADS=1 ; mpirun -n [N]`` - For a medium number of particle-images and orientations, a combined MPI / OpenMP configuration can be better. *Example:* Assume 20 CPU cores, possible options would be (among others): - 20 MPI processes with 1 OMP thread each: ``OMP_NUM_THREADS=1 mpirun -n 20`` - 10 MPI processes with 2 OMP threads each: ``OMP_NUM_THREADS=2 mpirun -n 10`` - 4 MPI processes with 5 OMP threads each: ``OMP_NUM_THREADS=5 mpirun -n 4`` - 2 MPI processes with 10 OMP threads each: ``OMP_NUM_THREADS=10 mpirun -n 2`` The best configuration has to be checked by the user. But in any case, one should make sure that the number of MPI processes times the number of OMP threads per process equals the number of (virtual) CPU cores. *Importantly*, one should also compare the timings from :envvar:`BIOEM_ALGO`\ ``=1`` or :envvar:`BIOEM_ALGO`\ ``=2`` with the different configurations. On combined CPUs and GPUs ^^^^^^^^^^^^^^^^^^^^^^^^^ - :envvar:`BIOEM_ALGO`\ ``=1`` to select the BioEM algorithm 1 that optimizes the computation of many orientations to many particle images. - :envvar:`BIOEM_DEBUG_OUTPUT`\ ``=0`` can reduce the size of the text output. - :envvar:`BIOEM_PROJ_CONV_AT_ONCE`\ ``=[x]`` may have a positive effect. However, the memory footprint increases with ``x``, it this can be a limiting factor for GPUs. Therefore, it is usually enough to keep the default :envvar:`BIOEM_PROJ_CONV_AT_ONCE`\ ``=1``, unless the number of particle images is small (in which case one should consider the :envvar:`BIOEM_ALGO`\ ``=2`` algorithm anyway). - :envvar:`GPU`\ ``=1`` should be used if a GPU is available. Performance wise, one Titan GPU corresponds roughly to 20 cores at 3 GHz. - :envvar:`GPUWORKLOAD`\ ``=-1`` for autotuning of the optimal workload balance. - If a system offers multiple GPUs, all GPUs should be used. This must be accomplished via MPI. In this case, the number of MPI processes per node must match the number of GPUs per node. There are different ways to make sure every MPI process uses a different GPU (as discussed in the GPU paragraph of section :ref:`wayparallel`). Assuming the MPI processes are placed such, that consecutive MPI ranks are placed on one node, one can use the :envvar:`GPUDEVICE`\ ``=-1`` setting. This is assumed here. Let us assume an example of ``N`` nodes with ``C`` CPU cores each and ``G`` GPUs each. The following command will use all GPUs, and ignore the CPUs: ``OMP_NUM_THREADS=1 GPU=1 GPUDEVICE=-1 mpirun -n [N*G]`` - One can use all the CPU cores as well as the GPUs. A combined MPI / OpenMP setting as discussed previously must be used, under the constraint that the number of MPI processes matches the number of GPUs: ``OMP_NUM_THREADS=[C/G] GPU=1 GPUDEVICE=-1 mpirun -n [N*G]`` Production run: *Many orientations vs. few experimental images* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On only CPUs ^^^^^^^^^^^^ - :envvar:`BIOEM_ALGO`\ ``=2`` to select the BioEM algorithm 2 that optimizes the computation of many orientations to few particle images. - :envvar:`BIOEM_DEBUG_OUTPUT`\ ``=0`` can reduce the size of the text output. - One should use a combination of OpenMP and MPI. Assume 20 CPU cores, possible options would be (among others): - 20 MPI processes with 1 OMP thread each: ``OMP_NUM_THREADS=1 mpirun -n 20`` - 10 MPI processes with 2 OMP threads each: ``OMP_NUM_THREADS=2 mpirun -n 10`` - 4 MPI processes with 5 OMP threads each: ``OMP_NUM_THREADS=5 mpirun -n 4`` - 2 MPI processes with 10 OMP threads each: ``OMP_NUM_THREADS=10 mpirun -n 2`` The best configuration has to be checked by the user. But in any case, one should make sure that the number of MPI processes times the number of OMP threads per process equals the number of (virtual) CPU cores. *Importantly*, one should also compare the timings from :envvar:`BIOEM_ALGO`\ ``=1`` or :envvar:`BIOEM_ALGO`\ ``=2`` with the different configurations. On combined CPUs and GPUs ^^^^^^^^^^^^^^^^^^^^^^^^^ - :envvar:`BIOEM_ALGO`\ ``=2`` to select the BioEM algorithm 2 that optimizes the computation of many orientations to few particle images. - :envvar:`GPU`\ ``=1`` should be used if a GPU is available. - For multiple GPUs, MPI has to be used, with number of MPI processes equal to the number of GPUs. Additionally, if there are ``x`` CPU cores per MPI process use :envvar:`OMP_NUM_THREADS`\ ``=[x]``. - Consider increasing the value of :envvar:`BIOEM_PROJ_CONV_AT_ONCE` to increase the parallelism, or decreasing the value of :envvar:`BIOEM_PROJ_CONV_AT_ONCE` to decrease GPU memory requirements. - Keep the other environment variables as default. .. _tutorial: Tutorial ======== In this chapter, we provide a short tutorial to perform BioEM calculations. First, we explain the commandline executions, and inputfile options, to calculate the posterior probability of a model given a particle-image set. Then, we show examples of the additional calculations that can be performed with the BioEM code. Lastly, we give full example of how to do model comparison using BioEM. All files mentioned in this chapter are provided in the **Tutorial\_BioEM** directory that comes in BioEM tutorials package (see section :ref:`download_tutorials`). Posterior probability using BioEM --------------------------------- We now show examples of the different commandline options and inputfile formats used to calculate the BioEM posterior probability. Here, we only describe the input setups related to the physical problem. For computing node performance setups see section :ref:`performsugg`. Commandline input and execution ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - *Text Model - Text Image:* To calculate the BioEM posterior probability of a model in text format given particle images also in text format. **Files:** - Model file: *Model\_Text* - Parameter input file: *Param\_Input* - Particle-image file: *Text\_Image\_Form* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model_Text :option:`--Particlesfile` Text_Image_Form **Outputfile:** *Output\_Probabilities* .. note:: 1. The *txt* particle-image file can contain multiple particles that are distinguished by the separator :inpar:`PARTICLE` (see section :ref:`partimag`). 2. The *Param\_Input* file is an example for a debug run. It has very few grid points to perform the integrations numerically. See section :ref:`Prorun`, for suggestions on input-parameter configurations for a production run. - *PDB Model - Text Image:* To perform the BioEM calculation with a model in *pdb* format. **New Command:** :option:`--ReadPDB` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input* - Particle-image file: *Text\_Image\_Form* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` Text_Image_Form **Outputfile:** *Output\_Probabilities* - *PDB Model - One MRC Image:* To perform the BioEM calculation for a single *.mrc* particle-image file. **New Command:** :option:`--ReadMRC` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input* - Particle-image file: *OneImage.mrc* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` OneImage.mrc :option:`--ReadMRC` **Outputfile:** *Output\_Probabilities* - *PDB Model - Multiple MRCs:* To perform the BioEM calculation, when multiple *mrc* files are read. In this case, the file name containing the list of all *mrc* filenames should be provided. **New Command:** :option:`--ReadMultipleMRC` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input* - File with names of MRC files : *ListMRC* .. note:: The file *ListMRC* contains the names of files *OneImage.mrc* and *TwoImages.mrc* that are provided in the **Tutorial\_BioEM** directory. **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` ListMRC :option:`--ReadMRC` :option:`--ReadMultipleMRC` **Example outputfile:** *Output\_Probabilities.* .. note:: Both commands :option:`--ReadMRC` :option:`--ReadMultipleMRC` are required. - *Read Euler angles from file:* Related to section :ref:`intor`. With this feature the Euler angles are read from an input orientations file. **New Command:** :option:`--ReadOrientation` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input* - Particle image file: *Text\_Image\_Form* - EulerAngle File: *Euler\_Angle\_List* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` Text_Image_Form :option:`--ReadOrientation` Euler_Angle_List **Outputfile:** *Output\_Probabilities* .. note:: If the command :option:`--ReadOrientation` is used then the code will disregard the Euler angle grid-sampling stated in the *Param\_Input* file. This means that reading the orientations from a file prevails over the option of calculating cubic-grids directly inside the code. - *Read quaternions from file:* Related to section :ref:`intor`. With this feature the quaternions are read from an input orientations file. **New Command:** :option:`--ReadOrientation` **Important!:** in the input-parameter file one has to add the keyword: :inpar:`USE_QUATERNIONS` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input\_Quat* - Particle image file: *Text\_Image\_Form* - Quaternion File: *Quat\_list\_Small* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input_Quat :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` Text_Image_Form :option:`--ReadOrientation` Quat_list_Small **Outputfile:** *Output\_Probabilities* .. note:: In the directory **Quaternions**, there are several quaternion lists that sample uniformly the rotational group in 3D space, *SO3*. These files are strongly *recommended* to use. - *MRC Model - One MRC Image:* To perform the BioEM calculation for a *.mrc* model and a single *.mrcs* particle-image. **New Command:** :option:`--ReadModelMRC` **Files:** - Model file: *Model_MRC.mrc* - Parameter file: *Param\_ModelMRC* - Particle-image file: *particles3.mrcs* - Quaternion File: *Quat\_list\_Small* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_ModelMRC :option:`--Modelfile` Model_MRC.mrc :option:`--ReadPDB` :option:`--Particlesfile` particle3.mrcs :option:`--ReadMRC` :option:`--ReadOrientation` Quat_list_Small **Outputfile:** *Output\_Probabilities* .. _Prorun: Input-parameter suggestions for a production run ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We strongly recommend to use all the prior information of the system that is available, *e.g.*, if the orientations, defocus, etc. are known, one should use this information to reduce the sampling time in the BioEM algorithm. If few prior information is available, we provide the file *Param\_ProRun* as a tentative setup for a production run that is shown in :numref:`Table %s `. **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_ProRun :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` Text_Image_Form :option:`--ReadOrientation` List_Quat_ProRun **Outputfile:** *Output\_Probabilities* .. _tableParamPro: .. table:: Input-parameter suggestions for a production run, if negligible prior information is available. +---------------------------------------------+ | ``USE_QUATERNIONS`` | +---------------------------------------------+ | ``CTF_B_ENV`` 2.0 300.0 4 | +---------------------------------------------+ | ``CTF_DEFOCUS`` 0.5 4.5 8 | +---------------------------------------------+ | ``CTF_AMPLITUDE`` 0.1 0.1 1 | +---------------------------------------------+ | ``SIGMA_PRIOR_B_CTF`` 50. | +---------------------------------------------+ | ``SIGMA_PRIOR_DEFOCUS`` 0.4 | +---------------------------------------------+ | ``PRIOR_DEFOCUS_CENTER`` 2.8 | +---------------------------------------------+ | ``DISPLACE_CENTER`` 40 1 | +---------------------------------------------+ To note are: - The Gaussian prior on the :math:`b` envelope parameter, has a width of 50\ :math:`\AA`. - The Gaussian prior on the CTF defocus :math:`\Delta f` parameter, has a width of 0.4\ :math:`\mu`\ m, and it is centered at 2.8\ :math:`\mu`\ m. - Quaternions are used to describe the orientations. The quaternions should be read from a file that samples uniformly :math:`SO3`. See for example *List\_Quat\_ProRun*, with :math:`> 4000` orientations. - The grid spacing of the particle-center displacement can be very fine if the FFT algorithm is used (see section :ref:`wayparallel`). Additional commandline options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Several additional features using the commandline are available with BioEM: - *Dump model:* This feature writes out the model in binary format. This allows a faster readout in a further BioEM execution. **New Command:** :option:`--DumpModel` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input* - File with names of MRC files : *ListMRC* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` ListMRC :option:`--ReadMRC` :option:`--ReadMultipleMRC` :option:`--DumpModel` **Outputfiles:** \ *Output\_Probabilities* and *model.dump* - *Load model:* This feature reads in the model in binary format from file *model.dump* (see above). In this case, no model file is necessary, but the *model.dump* file should be in the current directory. **New Command:** :option:`--LoadModelDump` **Files:** - Parameter file: *Param\_Input* - File with names of MRC files : *ListMRC* - Dumped Modelfile: *model.dump* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Particlesfile` ListMRC :option:`--ReadMRC` :option:`--ReadMultipleMRC` :option:`--LoadModelDump` **Outputfile:** *Output\_Probabilities* - *Dump particle-images:* This feature writes out the particle-images in binary format. This allows a faster readout in a further BioEM execution. **New Command:** :option:`--DumpMaps` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input* - File with names of MRC files : *ListMRC* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--Particlesfile` ListMRC :option:`--ReadMRC` :option:`--ReadMultipleMRC` :option:`--DumpMaps` **Outputfiles:** \ *Output\_Probabilities* and *maps.dump* - *Load particle-images:* This feature reads in the particle images in binary format from file *maps.dump* (see above). In this case, no particle-image file is necessary, but the *maps.dump* file should be in the current directory. **New Command:** :option:`--LoadMapDump` **Files:** - Model file: *Model.pdb* - Parameter file: *Param\_Input* - Dumped Mapfile: *maps.dump* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model.pdb :option:`--ReadPDB` :option:`--LoadMapDump` **Outputfile:** *Output\_Probabilities* - *Check model:* This feature writes to the *COORDREAD* output file model coordinates, radius and density. It is sometimes very useful to verify that the model is correct. **New Command:** :option:`--PrintCOORDREAD` **Files:** - Model file: *Model\_Text* - Parameter input file: *Param\_Input* - Particle-image file: *Text\_Image\_Form* **Commandline execution:** ``bioEM`` :option:`--Inputfile` Param_Input :option:`--Modelfile` Model_Text :option:`--Particlesfile` Text_Image_Form :option:`--PrintCOORDREAD` **Outputfiles:** \ *Output\_Probabilities* and *COORDREAD* - *Including prior probabilities*: To include the prior probabilities both for the model and orientations see the *Param\_Input\_Priors* file. The prior probabilities for the orientations should be included in an additional file (*e.g.*, see *Euler\_Angle\_List\_Prior*). An example is: **Files:** - Model file: *Model\_Text* - Parameter file: *Param\_Input\_Priors* - Particle image file: *Text\_Image\_Form* - EulerAngle File: *Euler\_Angle\_List\_Prior* **Commandline execution:** ``bioEM`` :option:`--Modelfile` Model_Text :option:`--Particlesfile` Text_Image_Form :option:`--Inputfile` Param_Input_Priors :option:`--ReadOrientation` Euler_Angle_List_Prior **Outputfile:** *Output\_Probabilities* - *Posterior as a function of orientations:* This option prints out the posterior probabilities of the model as a function of the orientations. In this case, all integrals in Eq. Eq. :eq:`pmom` are performed apart from that over the orientations. The keyword in the parameter file is :inpar:`WRITE_PROB_ANGLES` ``x`` an additional outputfile :outpar:`ANG_PROB` is generated with the best ``x`` orientations. An example of the parameter input is provide in the *Param\_Input\_WritePAng* file. .. _modcom: Example: model comparison using BioEM ------------------------------------- BioEM should be used for model comparison and ranking. Here, we provide a complete example of how to analyze the output files of BioEM to discriminate between structural models with two subsequent rounds of assessment. In the first round, the orientation sampling is done uniformly over :math:`SO3` using the BioEM algorithm 1 (*e.g.*, an all-orientations to all-particles comparison). In the second round, the posterior for each particle is calculated independently for a subset of orientations that are close to the best orientation from the previous round. The relevant files are found in the **MODEL\_COMPARISON** directory that is inside the **Tutorial\_BioEM** directory. There you will find: - *MODEL\_1*: First model in text format. - *MODEL\_2*: Second model in text format. - *Param\_Input\_ModelComparision*: example of parameter input. - *Quaternion\_List*: List of quaternions to sample uniformly *SO3*. - *20\_ParticleImages*: Stack of the particle images in text format. - *Particles*: Folder with the individual 20 particles files in text format. - *create\_gridOr.sh*: Bash script to create a refined grid over the best orientation from the previous round. This script uses python3.3 with the file *multiply\_quat.py* and the quaternion grid file *smallGrid\_125*. - *multiply\_quat.py*: Python3.3 script that multiplies the best quaternion from the previous round with the quaternions from *smallGrid\_125* to generate a new list of quaternions that samples homogeneously near the best quaternion. - *smallGrid\_125*: Grid of quaternions around the north pole. - *subtract\_LogP.sh*: Bash script to calculate the difference in log posterior from the outputfiles. - *bioem\_array\_sge.sh*: Example launch script for BioEM round 2 (see below) for a high-performance computing (HPC) platform with the sge job scheduling system. - *bioem\_array\_slurm.sh*: Example launch script for BioEM round 2 (see below) for a HPC platform with the slurm job scheduling system. Before we begin it is recommended to link the ``bioEM`` executable into the working model-comparison directory. Round 1: Model comparison with uniform sampling for all particle images ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First, we will calculate the BioEM probability over a uniform grid of orientations on :math:`SO3` for all particle images using :envvar:`BIOEM_ALGO`\ ``=1``. To compare the models, one needs to run the BioEM program for each: - Model 1 .. code:: bash BIOEM_DEBUG_OUTPUT=0 BIOEM_ALGO=1 ./bioEM --Modelfile MODEL_1 --Particlesfile 20_ParticleImages --Inputfile Param_Input_ModelComparision --ReadOrientation Quaternion_List --OutputFile Output_MODEL_1 - Model 2: .. code:: bash BIOEM_DEBUG_OUTPUT=0 BIOEM_ALGO=1 ./bioEM --Modelfile MODEL_2 --Particlesfile 20_ParticleImages --Inputfile Param_Input_ModelComparision --ReadOrientation Quaternion_List --OutputFile Output_MODEL_2 Here, two output files containing the posterior probabilities of each model (``Output_MODEL_1`` and ``Output_MODEL_2``) are generated. Since the input-parameter and particle-image files are the same, then the output files should only differ in the specific numerical results. To calculate the difference in log-posterior of *Model 1* with respect to *Model 2*, one can simply run in terminal the bash script *subtract\_LogP.sh*: .. code:: bash ./subtract_LogP.sh Output_MODEL_1 Output_MODEL_2 > Results-Round1 This script prints out the particle number, log-posterior of *Model 1*, log-posterior of *Model 2*, difference in log-posteriors (*Model 1- Model 2*), and cumulative difference. Round 2: Model comparison with different orientations for each particle image ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now, we can refine the BioEM probability by sampling near to the best orientation from **Round 1** using :envvar:`BIOEM_ALGO`\ ``=2``. We can extract the best orientations for each particle from the *Output\_MODEL\_1* and *Output\_MODEL\_2* files, and we can generate a new orientation grid for each particle around the best orientation. To do so, one can use the script *create\_gridOr.sh* that creates a new list of quaternions for each particle image. For Model 1, the commandline instructions are .. code:: bash ./create_gridOr.sh Output_MODEL_1 M1 This script takes as first column the OutputFile from round 1, and as second column a variable that assigns a name to the new lists (for simplicity, we have chosen ``M1``). This scripts generates twenty individual orientation files *Quaternion\_List\_M1\_P$x* with *$x* from 1 to 20 (*i.e.*, a file per particle). All files are stored in */tmp/Quaternion\_Lists\_M1* folder. Note that when working with large number of input/output files, it is very important to keep them in a well structured, hierarchical manner and at the most appropriate location. The optimal configuration depends on the number and size of files, as well as the file system of the machine. As an example, we have stored the generated folders and files in the */tmp/* directory. Similarly, for model 2 .. code:: bash ./create_gridOr.sh Output_MODEL_2 M2 we obtain 20 individual files *Quaternion\_List\_M2\_P$x*, stored inside */tmp/Quaternion\_Lists\_M2* folder. Because each image now has a different list of orientations, one has to launch each BioEM analysis individually within a nested loop. For example: - Model 1: .. code:: bash mkdir -p /tmp/Outputs_M1_Round2 numim=20 for((x=1;x<${numim}+1; x++)) do BIOEM_DEBUG_OUTPUT=0 BIOEM_ALGO=2 ./bioEM --Modelfile MODEL_1 --Particlesfile Particles/Particle_$x --Inputfile Param_Input_ModelComparision --ReadOrientation /tmp/Quaternion_Lists_M1/Quaternion_List_M1_P$x --OutputFile /tmp/Outputs_M1_Round2/Output_MODEL_1_P$x tail -2 /tmp/Outputs_M1_Round2/Output_MODEL_1_P$x >> Output_MODEL_1_Round2 done where ``numim`` is the total number of particle images and ``x`` indexes the particle number. - Model 2: .. code:: bash mkdir -p /tmp/Outputs_M2_Round2 numim=20 for((x=1;x<${numim}+1; x++)) do BIOEM_DEBUG_OUTPUT=0 BIOEM_ALGO=2 ./bioEM --Modelfile MODEL_2 --Particlesfile Particles/Particle_$x --Inputfile Param_Input_ModelComparision --ReadOrientation /tmp/Quaternion_Lists_M2/Quaternion_List_M2_P$x --OutputFile /tmp/Outputs_M2_Round2/Output_MODEL_2_P$x tail -2 /tmp/Outputs_M2_Round2/Output_MODEL_2_P$x >> Output_MODEL_2_Round2 done This loop can be treated in an easier way using job arrays of sge (with ``#$ -t 1:X`` option) or slurm (with ``#SBATCH –array=1:X`` option) on HPC platforms. In the **MODEL\_COMPARISON** directory, the files *bioem\_array\_sge.sh* and *bioem\_array\_slurm.sh* show example launch scripts for the procedure previously described for sge and slurm, respectively. To compare the resulting probabilities from round 2, one can use the same script (*subtract\_LogP.sh*) with the new output files: .. code:: bash ./subtract_LogP.sh Output_MODEL_1_Round2 Output_MODEL_2_Round2 > Results-Round2 In :numref:`Fig. %s`, we compare the results of the BioEM model comparison from round 1 (red) and round 2 (blue). In :numref:`Fig. %s` (top) we plot the the BioEM log-posterior of *Model 1* versus *Model 2* (output columns 4 and 6 from the script execution) for both rounds. These results show that the blue dots (those from round 2) are more shifted to the left and to the top, indicating that by refining the sampling around the best orientation there is an increase of the posterior probability. In :numref:`Fig. %s` (bottom), we show the cumulative difference of (*Model 1 - Model 2*) (column 10) as a function of the image number for both rounds. From these results, one can conclude that *Model 1* is more probable. Importantly, the discriminating power also increases by refining around the best orientation (as shown also in ref. :cite:`BioEM_cring`). Lastly, we note that the analysis of the posterior probabilities can be done in many different manners, as in ref. :cite:`CossioHummerJSB_2013,BioEM_cring`. The final interpretation of the results is left to the individual user. .. _ModelComp: .. figure:: ./img/ModComp.* *Example of model comparison using BioEM.* (**top**) Natural logarithm of the BioEM Posterior probability of *Model 1* versus *Model 2* for 20 particle-images for round 1 (red) of BioEM refinement with uniformly distributed equal orientations for all particles and round 2 (blue) of BioEM refinement around the best orientation from round 1. (**bottom**) Cumulative difference of *Model 1 - Model 2* as a function of the image number for round 1 (red) and 2 (blue). Example files of models, particle-images and input-parameters are in the **MODEL\_COMPARISON** directory. .. Bibliography .. ============ .. only:: html .. rubric:: References .. bibliography:: bib_manual.bib :style: unsrt .. Indices and tables .. ================== .. only:: html .. rubric:: Indices and tables * :ref:`genindex` * :ref:`modindex` * :ref:`search`