Welcome to the lab!

This page is written specifically for newcomers to the lab.

Project resources

For a quick overview of what we are up to, here is a curated list of the lab's project repositories. They were contributed by past and current members in various styles, but all are meant to follow good computational research practices that make our work easy to share and reproduce.

Get onboard

You should have received emails with onboarding instructions from the HR department before your start day. Please follow them to get on board. You will have to stop by a couple of offices on the Columbia Medical campus on your start day, but everything should be done by lunchtime.

Once you are back at your desk, please send Gao your account information (email or username) for these services:

  1. slack
  2. github
  3. dynalist
  4. mendeley

Also please send Gao your Columbia UNI.

You will be added to our home page. If you want your name to link to an external page (e.g. your personal website, github or LinkedIn account), please send Gao the link. Alternatively, you can send Gao your bio so that a page can be created for you on our website (click here for an example).

As explained next, most research communication will happen on slack or github. If you would like to talk to Gao in person, please book a meeting here (I encourage you to book an in-person meeting every week).

Orientation tasks

Task 1: Add yourself to our slack workspace

You should receive an email notification asking you to join the statgenworkspace.slack.com workspace on slack. Please contact Gao if you have not received it.

Here you can communicate with others in the lab via instant messaging and work on projects together in separate project channels. Some channels of interest are:

  • orientation
  • papers
  • tips-n-rants
  • computing-matters
  • campus-events

We ask the lab member who most recently completed the orientation material (the last row of this table) to be your “orientation tour guide”. Please message her/him on slack to introduce yourself and explain that you might need some help from time to time while completing the orientation tasks below. You are expected to try completing the tasks on your own, but please do not be shy about asking your tour guide if you hit any blockers. The best way for a tour guide to answer your question is to help edit the instructions on this wiki, so that the clarification benefits everyone down the road.

Slack tips

  1. It would be great if you could upload a photo of yourself (or your cat if you really resent the idea of using your photo) to your slack profile.
  2. Notifications: Under Preferences -> Notifications you can configure notification behavior for incoming messages. I suggest choosing the option “Direct messages, mentions and keywords”. Also enable “Send email notification” under the “When inactive on desktop” option.
  3. Speak to a person: when you are chatting in a slack channel, please mention the person you want to talk to using @, so s/he gets a notification and can respond to you faster. Otherwise the message might go unnoticed.
  4. The slack desktop app is recommended because it keeps slack running in the background for all the workspaces you join. There is also a phone app you can choose to install.
  5. Slack formats text with a mostly Markdown-like syntax, as you will learn in the next orientation task.

Task 2: git, github.com and markdown

We assume you are comfortable with the command-line interface (on Linux or Mac). This orientation task involves obtaining the source code for this wiki, making changes, and contributing them back. The source code of this wiki is on github, so this will be a git exercise: you will need to install git, clone this repo, make changes, and push them back to github, which automatically publishes your changes here.

You should receive an email notification asking you to join the github repository for the lab wiki. Please contact Gao if you have not received it.

Before you make any changes to the wiki, you should learn to use git if you haven’t used it before. Under the orientation folder of this repo (to which you should have been granted access by this point) there is a Markdown file called 5m-git.md, a 5-minute tutorial on git. If you are not familiar with git, please walk through that document to learn the basics. If you are already familiar with git, please take a look at the more advanced tutorial git-tips.md and help improve it by resolving some of the FIXME tags I left in the document, or by adding whatever useful tips you have learned in the past. Please use your best knowledge of Markdown when editing these files, that is, format things nicely and logically.

Now you should be ready to contribute to the wiki. To do so, please edit the Lab Members page through the lab-members.md file and add a row about yourself.
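
The clone, edit, commit and push cycle described in this task can be rehearsed locally first. The sketch below is a minimal example, assuming git is installed. It uses a throwaway bare repository in a temporary directory as a stand-in for the wiki repository on github, and the member name, email address and table columns are made-up placeholders, not the real layout of lab-members.md.

```shell
# Stand-in for the real wiki repository: a throwaway bare repo in a temp
# directory, so the exercise can be practiced without touching github.
workdir=$(mktemp -d)
git init --quiet --bare "$workdir/wiki.git"

# 1. Clone the repository (git warns that the stand-in repo is empty).
git clone --quiet "$workdir/wiki.git" "$workdir/wiki" 2>/dev/null
cd "$workdir/wiki"

# Identity used for commits in this clone only (placeholder values).
git config user.name "New Member"
git config user.email "new.member@example.com"

# 2. Make a change: write a small table with a row about yourself.
#    The columns here are hypothetical.
printf '| Name | Role |\n|------|------|\n| New Member | Student |\n' > lab-members.md

# 3. Review, stage and commit the change.
git status --short
git add lab-members.md
git commit --quiet -m "Add New Member to lab members table"

# 4. Push the commit back to the remote.
git push --quiet origin HEAD
```

For the real wiki, skip the git init step and pass the repository URL shown on its github page to git clone instead of the temporary path.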

Text editor

Here is a personal suggestion: I used gvim for many years before switching to the VS Code text editor. Yes, it is from Microsoft, but it is cross-platform and it is good! I now use VS Code with Vim bindings (an extension you can find in the VS Code Extension Marketplace) so I can still use the keyboard conventions that I’m familiar with.

To open a particular folder (e.g. a local clone of a github repo) on your computer from the command line:

cd <path to the folder>
code ./

Additional reading

Task 3: Organize your research

This task is about good computational biology research practice. Regardless of your focus (methods development or applied data analysis), all computational procedures in your daily research are required to be documented, well organized and version controlled (using git) so that they can be reviewed at any point.

Choose one of the options below for organizing your research, using either Notebooks or Rmarkdown files. Notebooks (Option 1) are highly recommended unless you strongly prefer RStudio over JupyterLab.

Option 1: Learn and use IPython notebook and JupyterLab

With IPython notebook + JupyterLab you should develop the practice of clearly documenting what you do in research, and of communicating your results together with the code that generated them in a self-contained document. In particular, a notebook lets you put notes in Markdown cells between code cells to explain what you do. This may matter less to computer programmers but is very important to data scientists.

An important reason I prefer Jupyter over RStudio is that I recommend using the SoS suite, which provides a workflow system (pipeline tool) for batch data analysis and a multi-language notebook for interactive analysis, for your daily research computing. You will find out more about it later in this orientation task.

Here are some tasks you should walk through:

  1. Install JupyterLab with the SoS suite, and make sure you know (e.g. by searching on Google) how to launch Bash, R and Python notebooks and write code in each of them.
  2. Learn from these examples how to do interactive data analysis with SoS Notebook, which allows multiple languages inside one notebook (you can find and run them at: http://sosworkflows.com):
  3. Convert notebooks to a research website using the jnbinder script. Please follow the instructions on the jnbinder repo to create a research website from some IPython notebooks you have.
  4. Learn from this example the suggested format for writing and reporting computational analysis. This is a demo of a research website that jnbinder created. The suggested format is as follows:
    1. Title, and in the same notebook cell a brief one-sentence summary of what the notebook is about.
    2. Motivation or Aims: describe the problem under investigation.
    3. Methods overview: a high-level description of the methods used to solve the problem.
    4. Main conclusions (not applicable to a pure workflow notebook): take-home messages from your investigations.
    5. Data input and output (if applicable): describe the data used and generated by the notebook.
    6. The rest of the notebook: multiple sections of detailed steps, with interactive code / workflows and narratives, as well as diagnostic summary statistics, plots and tables at each step.
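
The suggested format above can be kept as a reusable skeleton. The sketch below writes one to a Markdown file whose headings can be pasted into the Markdown cells of a new notebook. The file name analysis-template.md and the heading “Analysis steps” (standing for the rest of the notebook) are illustrative choices, not lab conventions.

```shell
# Write a Markdown skeleton of the suggested report format.
# The file name is a hypothetical choice; adjust it to your project.
cat > analysis-template.md <<'EOF'
# Title of the analysis

One-sentence summary of what this notebook is about.

## Motivation or Aims

## Methods overview

## Main conclusions

## Data input and output

## Analysis steps
EOF

# List the section headings that were written.
grep '^#' analysis-template.md
```

Sections that do not apply, such as “Main conclusions” in a pure workflow notebook, can simply be deleted from the copy.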

In your future daily research you will be expected to use SoS Notebook to analyze data, document your workflows in the suggested analysis report format, and make them available as websites to share with your colleagues. We host a private webserver and provide instructions for configuring your github repository to publish websites to the server automatically as soon as you push to the repository.

Additionally, if your research involves methods development and large-scale genomic analysis, you can optionally complete the following tasks on the Bioinformatics Workflow System:

  1. Learn from these examples the very basic usage of SoS Workflow (you can find and run the first 2 at: http://sosworkflows.com):
  2. Please try to reproduce this example on your computer (source code here). In particular, note how multiple samples are processed in parallel (group_by in SoS) and how intermediate results can be visualized within the workflow notebook. Also note how docker containers are used to execute the workflow, which avoids installing all the software dependencies yourself and helps ensure reproducible results.

Although I recommend SoS Notebook and SoS Workflow as the primary tools for daily computational research, I acknowledge that IPython notebooks in general have limitations for interactive analysis (cf. this presentation). However, most of these issues can be avoided if you recognize them and develop good notebook habits that steer clear of the pitfalls. Additionally, these limitations do not apply when you use notebooks to develop SoS Workflows, and I always prefer writing small workflows over interactive notebooks. As you have hopefully learned from the above tasks, it is almost trivial to turn an interactive SoS notebook into an SoS workflow.

Option 2: Learn and use RStudio and workflowr

(to be updated)

Additional reading

Task 4: Explore lab wiki

You are encouraged to explore the lab wiki and check out the material on other pages. In particular,