Appendix C — Code/Data Best Practices

Sourced (borrowed unabashedly & modified) from the excellent material of the Greene & Krishnan Labs and the rest of the references below, combined with lessons I have learnt from my own experience over the years!

C.0.1 Pride

We expect lab members to sign their code. To quote from The Pragmatic Programmer, “Craftsmen of an earlier age were proud to sign their work. You should be, too… People should see your name on a piece of code and expect it to be solid, well written, tested, and documented.” While some code will be proof-of-concept code, it should be of a form that inspires confidence.

And this, I cannot emphasize enough: Pride also means clearly showcasing what is inspired by, a modified/improved version of, and/or directly borrowed from another software code base. This means that when we write code by building on top of what others have done (via inspiration, modification/improvement, or borrowing), we clearly give credit by:

1. Explicitly saying this in the comments section next to the relevant piece/block of your code, including a link to the source and a short write-up on how the source is adapted here.

2. Including these details in the documentation of the code base that lives as a supporting document outside the code.

3. Including a citation to the source wherever appropriate in our publications.

Of all these, [1] is the easiest place to be extremely generous and complete in attributing everything non-obvious (fast implementations, numerical recipes, methodological/algorithmic ideas) taken from StackOverflow answers, blog posts, lecture notes, existing open software, etc. We will start impeding readability and hitting space constraints with [2] & [3]. Please talk to me or other senior members of the lab about these to decide on proper attributions.
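As a minimal sketch of what [1] can look like in Python, the function below carries its attribution comment right next to the borrowed block. The function itself, the comment fields, and the adaptation note are invented for illustration; the angle-bracket placeholders stand in for the real source details.

```python
from itertools import islice


def split_into_chunks(items, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    # ATTRIBUTION (placeholder values, for illustration only)
    # Source:  <link to the StackOverflow answer / repo / blog post>
    # Author:  <name or handle of the original author>
    # License: <license of the source, if stated>
    # Adapted: wrapped the original loop in a generator and added
    #          validation of chunk_size.
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Keeping the source link, the license, and the "how it was adapted" note together in one comment block makes it easy to carry the same details over to the documentation later.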

C.0.2 Language

We write code for our analyses in Python and R, which allows everyone in the lab to know two languages and understand analytical code. Code for visualization can be Python, R, or JavaScript. Webserver interface code uses R Shiny or JavaScript. Check out this doc for all the material you need to get started with most of these languages and tools.

C.0.3 Licensing

We expect code that we produce to be licensed under a 3-clause BSD license. Unless a funding agency requires something different, we’ll use this. If you have questions or concerns about licensing, feel free to raise them in Slack.

C.0.4 Version control

We have a JRaviLab account on GitHub. We expect that lab members will maintain their code in repositories under this team account. When you commit code, add informative messages. Remember: verbose is better. Even if you’re just fixing formatting or correcting typos, instead of phrases like “small changes” and “some minor stuff”, just say “fixing formatting” or “correcting some typos”. We will only publish code that is held in a public JRaviLab repository that has gone through the code review process.

C.0.5 Data management

For publicly available data, scripts used to download and process these data should be preserved, as should the versions of helper files used in processing (e.g. probe-to-gene mapping, gene-symbol mapping). [If the dataset is large (>100 GB), discuss with me where to download it.] These items (processing scripts, helper files, etc.) should be version controlled. Where possible, intermediate files of reasonable size can be stored to facilitate re-use, but the process to regenerate these files from publicly available data should be preserved. Keep the following things in mind:

Do not tamper with original files that you get from me, a collaborator, or from external resources. In the folder where you’re downloading these original files, create a readme.txt file that contains detailed information on when you downloaded this data and from where. For example:
1. If you got the data through a link to a Google Drive/Dropbox folder that I or a collaborator sent via Slack/email, note the date you downloaded it, the link to the Google Drive/Dropbox folder, and the link to the Slack message or email.
  1. Slack message link: click the ..., then “Show message actions” next to the Slack message and click “Copy link”.
  2. Email message link or ID can be obtained by right-clicking the message or looking under “Details”.
2. If you get a link to the data on our MSU servers, create a symlink to the data in your data directory (do not copy over the data; it’s a waste of space).
Do not make changes to any file by hand. Write a shell/Python/R script that reads in the file and generates the desired new file with all the required modifications. If you get a file in a format that cannot be worked with easily using code (like Excel sheets), export the sheet as a text document and then work with that text file.
Automate everything so that you can exactly reproduce everything at a later time. Create a runlog.sh file in each folder that has the list of all the commands (shell commands: the full commands to run your Python/R script) you executed in that directory.
Many times, we will have to use the scratch space to download large data dumps and process them. Scratch has better I/O but has no backups. So, remember to keep the processing scripts in your project directories, symlink the scripts into the directories on scratch, and use them there. Thus, even if scratch fails, we can rebuild everything seamlessly. Talk to me before moving huge processed datasets from scratch to the backed-up research directories.
Give all files meaningful, interpretable, and computable names.
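Two of the points above (recording where the data came from, and symlinking instead of copying) can be sketched in Python as follows. This is only an illustration: the function names, the layout of the readme.txt record, and the choice to append rather than overwrite are assumptions, not an existing lab tool.

```python
from datetime import date
from pathlib import Path


def record_download(data_dir, source_link, message_link, downloaded_on=None):
    """Append a provenance record to readme.txt inside data_dir."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    downloaded_on = downloaded_on or date.today()
    record = (
        f"Downloaded on: {downloaded_on.isoformat()}\n"
        f"Source link: {source_link}\n"
        f"Slack/email link: {message_link}\n\n"
    )
    readme_path = data_dir / "readme.txt"
    # Append so that records of earlier downloads are never lost.
    with readme_path.open("a", encoding="utf-8") as readme:
        readme.write(record)
    return readme_path


def link_shared_data(shared_path, data_dir):
    """Symlink shared_path into data_dir instead of copying it."""
    shared_path = Path(shared_path).resolve()
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    link_path = data_dir / shared_path.name
    if not link_path.is_symlink():
        # link_path becomes a symlink that points at shared_path.
        link_path.symlink_to(shared_path)
    return link_path
```

Both functions only add to the data directory; neither touches the original files themselves.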

C.0.6 Organizing your project

Adhere to the following organization for your project folder/repository. Each project is slightly different but should not depart too much from this proposed organization.

project_data

data

src

bin

doc

results

Check out the detailed doc here.

This whole project directory, except the big datasets in your data directory, should be synced with the project's GitHub repo.
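A new project folder can be set up with a few lines of Python. The sketch below assumes that data, src, bin, doc, and results sit side by side directly under the project root; that layout is my reading of the list above, and the function name is hypothetical. Check the detailed doc before relying on it.

```python
from pathlib import Path

# Folder names taken from the list above; treating them as siblings
# directly under the project root is an assumption of this sketch.
PROJECT_SUBFOLDERS = ("data", "src", "bin", "doc", "results")


def create_project_skeleton(project_root, subfolders=PROJECT_SUBFOLDERS):
    """Create the project root and its sub-folders if they are missing."""
    project_root = Path(project_root)
    created_folders = []
    for folder_name in subfolders:
        folder = project_root / folder_name
        # exist_ok=True makes the call safe to repeat on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created_folders.append(folder)
    return created_folders
```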

C.0.7 Reproducibility

We expect all lab members to maintain code that performs reproducible analyses. This can be in the form of R/Python/Shell scripts that do everything without manual editing, plus runlog.sh scripts that contain the command-line calls of all the scripts with the inputs/arguments that allow analyses to be performed automatically. We expect that these scripts, including those that generate the figures in papers resulting from such analyses, will be included in source control repositories and made publicly available before or concurrent with manuscript publication.
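One way to keep runlog.sh complete is to let each script record its own invocation. The helper below is a minimal sketch of that idea; the function name and the timestamp comment line are assumptions, and writing the commands into runlog.sh by hand works just as well.

```python
import shlex
from datetime import datetime
from pathlib import Path


def append_to_runlog(command_args, runlog_dir="."):
    """Append one shell-quoted command line to runlog.sh in runlog_dir."""
    runlog_path = Path(runlog_dir) / "runlog.sh"
    timestamp = datetime.now().isoformat(timespec="seconds")
    # shlex.join quotes arguments (e.g. paths with spaces) so that the
    # logged line can be pasted back into a shell unchanged.
    command_line = shlex.join(command_args)
    with runlog_path.open("a", encoding="utf-8") as runlog:
        runlog.write(f"# run on {timestamp}\n{command_line}\n")
    return command_line
```

For example, calling append_to_runlog([sys.executable, *sys.argv]) at the start of a script (after import sys) would log the full command that launched it.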

C.0.8 Good enough practices for writing code

Pride: We expect lab members to sign their code.
Using other code: Code taken from elsewhere is properly acknowledged and should be checked for compatibility with its license.
Style guide: Python code follows PEP 8. R code follows Google’s R Style Guide.
Variable and function names: Variable names are descriptive and interpretable to someone looking at this code for the first time (e.g. not a, b, x, etc.). They should be full words (nodeDegree or pvalues) or clearly recognizable acronyms. Function names should begin with a verb (e.g. parse_expression_dataset, shuffle_list_of_genes, or get_score_distribution). In both cases, be as verbose and expressive as you need.
File commenting: Each file has a comment (a small paragraph) at the top to broadly describe its purpose and how it is expected to be used (e.g. imported, run from command line, both) including details on inputs/outputs and example usage. This is also the place where you make dependencies and requirements explicit.
Function/code commenting: Each function has a docstring that reports the computation that it intends to implement, its arguments, and its return value(s). E.g., #' Takes in the gene_length dictionary & a geneset_gene dictionary and returns a random geneset for each real geneset with the random genes having similar lengths as the real member genes. Docstring documentation is here: Python, R.
In-line commenting: Each block/section of code contains detailed comments on what it is meant to do.
Imports: All trivial imports are at the top of the file right after the top comment paragraph.
Code with constants: Any constants are specified at the beginning of the file following the imports.
Code that uses a random seed [special case of constants]: Code that uses a random seed is reproducible. This means that the seed can be set and a default value is specified. This needs to be done at the top as well.
Column length: Lines are 80 characters or fewer. This applies to all text under revision control with the exception of data files that must adhere to a particular file format that does not allow for line “folding” where necessary. This rule is already covered well in PEP 8 but called out here to clarify that we apply it to more than Python code. One reason for this is to aid in readability of diff output when performing code reviews.
Whitespace: There is no unnecessary whitespace.
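To make the checklist above concrete, here is a minimal sketch of a Python file laid out in that order: top-of-file comment, imports, constants, a default random seed, and a verb-named function with a docstring. It borrows the gene_length/geneset_gene example from the docstring item. The function name, the constant values, and the definition of "similar length" (nearest neighbors in a ranking of genes by length, drawn with replacement) are assumptions for illustration, not an existing lab implementation.

```python
"""Generate length-matched random genesets.

Purpose: for each real geneset, build a random geneset whose genes have
lengths similar to those of the real member genes.
Usage: import and call get_length_matched_random_genesets().
Inputs: two dictionaries (gene -> length, geneset -> list of genes).
Output: dictionary (geneset -> list of random genes).
Dependencies: Python 3 standard library only.
"""

import random

# Constants follow the imports.
DEFAULT_RANDOM_SEED = 42  # default seed so that runs are reproducible
NUM_LENGTH_NEIGHBORS = 5  # assumption: "similar" = within 5 length ranks


def get_length_matched_random_genesets(
    gene_length, geneset_gene, random_seed=DEFAULT_RANDOM_SEED
):
    """Return a random, length-matched geneset for each real geneset.

    Args:
        gene_length: dict mapping gene ID to gene length.
        geneset_gene: dict mapping geneset ID to a list of member gene IDs.
        random_seed: seed for the random number generator.

    Returns:
        dict mapping geneset ID to a list of randomly drawn gene IDs, one
        per real member gene (drawn with replacement).
    """
    random_generator = random.Random(random_seed)

    # Rank all genes by length so that rank neighbors have similar lengths.
    genes_by_length = sorted(
        gene_length, key=lambda gene: (gene_length[gene], gene)
    )
    rank_of_gene = {gene: rank for rank, gene in enumerate(genes_by_length)}

    random_genesets = {}
    for geneset_id, member_genes in geneset_gene.items():
        random_genes = []
        for gene in member_genes:
            # Candidate pool: genes within NUM_LENGTH_NEIGHBORS ranks of
            # the real gene, excluding the real gene itself.
            rank = rank_of_gene[gene]
            low = max(0, rank - NUM_LENGTH_NEIGHBORS)
            high = rank + NUM_LENGTH_NEIGHBORS + 1
            candidates = [
                candidate
                for candidate in genes_by_length[low:high]
                if candidate != gene
            ]
            random_genes.append(random_generator.choice(candidates))
        random_genesets[geneset_id] = random_genes
    return random_genesets
```

Because the seed is a parameter with a default value, two calls with the same inputs return the same random genesets, and a different seed can still be passed in when needed.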

D References

The KrishnanLab and GreeneLab Onboarding Docs
A Guide to Organizing Computational Biology Projects
Good Enough Practices in Scientific Computing
Best Practices for Scientific Computing
Ten Simple Rules for Making Research Software More Robust