URL: https://towardsdatascience.com/why-do-bioinformaticians-avoid-using-windows-c5acb034f63c/

Why do Bioinformaticians Avoid Using Windows? | Towards Data Science



Why do Bioinformaticians Avoid Using Windows?

List of reasons why almost every bioinformatician uses Linux instead of Windows

6 min read

Bioinformatics

Image by Open Clip Art from Pixabay

Introduction

Before we touch on the main topic, let me first introduce you to Bioinformatics. Bioinformatics is a discipline that bridges computational studies (computer science, statistics, data engineering) and biology. Bioinformaticians help biologists store very large biological datasets, perform computational analyses, and transform biological queries into understandable results.

If you are a bioinformatician or have worked with one before, you probably realize one thing. For most of their work, bioinformaticians do not use Windows.

The reason is quite simple really. It is because most Bioinformatics work can’t be done in Windows. And even if it is possible, there are a plethora of other problems. The list below covers why bioinformaticians prefer Linux over Windows.

Image by Open Clip Art from Pixabay

Why use Linux for Bioinformatics?

1. Most Bioinformatics tools are only available in Linux

Unfortunately, Bioinformatics tool options are very limited in Windows. Most standard Bioinformatics software, such as tools for genome assembly, structural annotation, variant calling, and phylogenetic tree construction, is written exclusively for Linux.

You can argue that there is still a way to access these Bioinformatics tools without using Linux. We can try using Galaxy, which is an online platform that contains commonly used Bioinformatics tools such as FastQC, BWA-MEM, SAMtools, FreeBayes, BCFtools, and many more. However, as good a platform as Galaxy is, it has some limitations.

  1. Time. Because you are running the procedures on the Galaxy server, the process can take a while to complete, as you are sharing the server with many other users.

  2. Storage. Galaxy is quite generous when it comes to storage allocation. You are given 250 GB of free quota for your Galaxy account. But if you exceed this limit, you need to download and delete your files to free up some space.

  3. Limited Tool Options. This is perhaps the main reason why bioinformaticians still prefer doing Bioinformatics in Linux. Even though there are many options to pick from in Galaxy, the Bioinformatics field is so vast that Galaxy can't integrate every single Bioinformatics tool. For example, I may have to use a very niche function such as computeGCBias for GC bias correction. A niche function like this is not available in Galaxy even though it has existed since 2012.

If you are interested in checking out Galaxy Software for computational biology, check it out here.

2. Windows is very slow at processing biological data

Have you ever tried opening a large FASTA file or a gene expression file with Notepad in Windows? I bet it doesn't work because Notepad has a size limitation of 58 MB. Large biological files like these are not meant to be opened directly with a text editor or spreadsheet app.

Image by Hassan from Pixabay

What we often do is peek at the file, mostly to check the header (or column names if it is tabular) and get some understanding of its contents. Peeking at the data can be easily done in Linux with head or tail.
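For readers who prefer to stay inside a script, the same idea can be sketched in Python: read only the first few lines and never load the rest of the file. This is a minimal illustration, not a replacement for head; the function name `peek` and the tiny demo FASTA file are made up for the example.

```python
from itertools import islice
import os
import tempfile


def peek(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8") as handle:
        # islice stops after n lines, so a multi-gigabyte file is never
        # pulled into memory in full.
        return [line.rstrip("\n") for line in islice(handle, n)]


# Build a tiny hypothetical FASTA file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(
        ">seq1 hypothetical record\nACGTACGT\n"
        ">seq2 hypothetical record\nGGCCTTAA\n"
    )

first_lines = peek(demo_path, n=2)
print(first_lines)
```

Because the file is read lazily, the cost depends on how many lines you ask for, not on the size of the file.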

Simple data wrangling is also easy in Linux. Are we only interested in chromosome 21 to check for Down syndrome? Use grep and redirect the output to a different file. Do we want to combine multiple FASTA files for phylogenetic tree construction? Simple, use cat and redirect the output to a new FASTA file.
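The two operations just described (keep only matching lines, then join several files into one) can also be sketched in Python as a rough analogue of grep and cat with output redirection. The helper names `filter_lines` and `concatenate`, the file names, and the demo data are all hypothetical; the matching is a plain substring test, not a regular expression.

```python
import os
import shutil
import tempfile


def filter_lines(src, dst, keyword):
    """Write lines of src that contain keyword to dst; return how many were kept."""
    kept = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:  # streams line by line, so file size does not matter
            if keyword in line:
                fout.write(line)
                kept += 1
    return kept


def concatenate(paths, dst):
    """Write the given files one after another into dst."""
    # Assumes every input ends with a newline; otherwise the last sequence
    # line of one file would run into the header of the next.
    with open(dst, "wb") as fout:
        for path in paths:
            with open(path, "rb") as fin:
                shutil.copyfileobj(fin, fout)


workdir = tempfile.mkdtemp()

# Hypothetical tab-separated variant table with a chromosome column.
variants = os.path.join(workdir, "variants.tsv")
with open(variants, "w", encoding="utf-8") as handle:
    handle.write("chr1\t100\tA\tG\nchr21\t200\tC\tT\nchr2\t300\tG\tA\nchr21\t400\tT\tC\n")

# The trailing tab keeps "chr21" from matching longer names that start with it.
chr21_only = os.path.join(workdir, "chr21.tsv")
kept = filter_lines(variants, chr21_only, "chr21\t")

fasta_a = os.path.join(workdir, "a.fasta")
fasta_b = os.path.join(workdir, "b.fasta")
for path, text in [(fasta_a, ">a\nAC\n"), (fasta_b, ">b\nGT\n")]:
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

combined = os.path.join(workdir, "combined.fasta")
concatenate([fasta_a, fasta_b], combined)
with open(combined, encoding="utf-8") as handle:
    combined_text = handle.read()

print(kept, combined_text.count(">"))
```

The shell one-liners remain shorter; the sketch only shows that both steps are simple streaming operations.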

3. Linux has built-in programming languages

Another good thing about Linux is that it has several built-in programming languages that come with the installation. After file transformation, the next step in Bioinformatics is data analysis. In Ubuntu (one of the most commonly used Linux distros), Perl and Python are already installed and can be used straight away.

What if we want to use other languages for data analysis? For example, I am an R programmer, but R is not preinstalled in Ubuntu. Fortunately, it is very easy to install R and RStudio in Ubuntu, because most (if not all) programming languages are available in every Linux distro.

4. Creating a biological analysis pipeline can be done easily in Linux

A good practice among bioinformaticians is to automate their pipelines. For example, it would be a good idea to automate a genome assembly pipeline from start to finish. Without automation, we need to run each line, wait for it to finish (this can take a while because biological data is usually quite large), and then manually run the next line.

A common way for bioinformaticians to encapsulate all these processes is by listing them in a Bash script. Automating pipelines with Bash is not the only way we can do it in Linux. My personal favorite method for pipeline automation is to use Snakemake, which is a tool to create reproducible and scalable data analyses. Read more about Snakemake here.
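The core idea of such automation, running each step only after the previous one has finished successfully, can be sketched in a few lines of Python. This is neither Bash nor Snakemake, just an illustration of the control flow; the step names are invented, and each command is a placeholder that prints a message where a real pipeline would call the actual tool.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run (name, command) steps in order and stop at the first failure."""
    completed = []
    for name, command in steps:
        # check=True raises CalledProcessError when a step exits non-zero,
        # so later steps never run on incomplete output.
        subprocess.run(command, check=True)
        completed.append(name)
    return completed


# Placeholder steps: each one only prints a message via the Python interpreter.
steps = [
    ("quality_control", [sys.executable, "-c", "print('QC done')"]),
    ("alignment", [sys.executable, "-c", "print('alignment done')"]),
    ("variant_calling", [sys.executable, "-c", "print('variants called')"]),
]

finished = run_pipeline(steps)
print(finished)
```

A workflow manager such as Snakemake adds what this sketch lacks, for example tracking which outputs already exist so finished steps are not rerun.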

5. Linux is free

Linux is a free, open-source operating system that anyone can install, study, modify, and even redistribute. Because all Linux distributions are open source, the development and advancement of Linux distributions come from the community and for the community. And if you have a designated platform (laptop or desktop PC) exclusively used for Bioinformatics work, why not just take the free option?


Where you can use Windows for Bioinformatics

Now that I am done bashing Windows and praising Linux, there is actually an instance where it is fine for bioinformaticians to use Windows.

Image by 200degrees from Pixabay

Data Analysis and Result Interpretation

In Bioinformatics, the entire computational pipeline can be divided into two parts: data transformation and data analysis. For data transformation, the options on Windows are very limited. For example, transforming a biological file from FASTA to CSV has fewer limitations on Linux than on Windows. However, it is an entirely different story for data analysis.
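To make the FASTA-to-CSV example concrete, here is a minimal sketch of such a transformation in Python. The column layout (id, length, sequence) and the function names are assumptions chosen for illustration, and the parser handles only plain FASTA with `>` header lines.

```python
import csv
import io


def fasta_records(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_csv(fasta_handle, csv_handle):
    """Write one CSV row per FASTA record and return the record count."""
    writer = csv.writer(csv_handle, lineterminator="\n")
    writer.writerow(["id", "length", "sequence"])
    count = 0
    for header, sequence in fasta_records(fasta_handle):
        # Use the first word of the header as the record id.
        record_id = header.split()[0] if header else ""
        writer.writerow([record_id, len(sequence), sequence])
        count += 1
    return count


# Hypothetical two-record input, held in memory so the sketch is self-contained.
fasta_text = ">seq1 hypothetical sample\nACGT\nAC\n>seq2\nGGG\n"
out = io.StringIO()
n_records = fasta_to_csv(io.StringIO(fasta_text), out)
csv_text = out.getvalue()
print(csv_text)
```

Records are processed one at a time, so memory use is bounded by the longest single sequence rather than by the whole file.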

The three most common programming languages for biological data analysis or results interpretation are Python, R, and Perl. These three languages are all fully supported in Windows. From creating a machine learning model for breast cancer prediction to variant density visualization with a Manhattan plot, all these analyses can be done with the three languages in Windows. Keep in mind that you can also do these analyses in Linux, but feel free to pick your OS once you have reached this part of the Bioinformatics pipeline.


Conclusion

Bioinformaticians often use Linux because our options are very limited in Windows. Common Bioinformatics tools such as FastQC, BWA-MEM, and SAMtools are not available in Windows. There are indeed online platforms that allow us to use these tools, but they are still not free of limitations. However, if you are working on the data analysis step, the choice of Linux or Windows is up to the user.


Thank you for reading this post! Do you agree with my thoughts, or have you found a way to perform Bioinformatics work even with Windows? Feel free to put your thoughts in the comments below.


Written By

Zainul Arifin
