Next Generation Sequencing (NGS)/Bioinformatics from the outside

Bioinformatics from the outside

For an in-depth introduction to UNIX, see the Guide to Unix or A Quick Introduction to Unix.

Unix command line: History


The first version of Unix was developed at Bell Labs (part of AT&T) in 1969, making it more than forty years old. Its roots go back to a time when computers were large and rare, and time on them was very expensive and shared between many users; Unix was designed to let multiple users work simultaneously. Unix actually grew out of a desire to play a game called Space Travel, and the features that made it an operating system were incidental. Initially it supported only one user, and the name Unix, originally UNICS, is a pun on MULTICS, a multi-user system available at the time.

While this might seem strange and unnecessary in a world where everyone has their own laptop, computing is again moving back to remote central services with many users. The compute power required for mapping next-generation sequencing data or de novo assembly is beyond what is available or desirable to have sitting on your lap. In many ways, the “cloud” (or whatever has replaced it by the time you read this) requires ways of working that have more in common with traditional Unix machines than the personal computing emphasised by Windows and Apple Macintosh.

US federal anti-trust law prevented AT&T from commercialising Unix, but interest in using it grew outside Bell Labs and eventually AT&T decided to give it away freely, including the source code, which allowed other institutions to modify it. Perhaps the most important of these institutions was the University of California, Berkeley. (A significant proportion of Mac OS X has its roots in the Berkeley Software Distribution (BSD), which distributed a set of tools to make Unix more useful and made changes that significantly increased performance.) The involvement of several universities in its development meant Unix was ideally placed when the internet was created, and many of the fundamental technologies were developed and tested using Unix machines. Again, these improvements were given away freely. Some of the code was repurposed to provide networking for early versions of Windows, and even today several utilities in Windows Vista incorporate Berkeley code.

As well as playing a key part in the development of the early internet, a Unix machine, a NeXT cube, was also the first web server. NeXT was an early attempt to make a Unix machine for desktop use. Extremely advanced for its time but also very expensive, it never really caught on outside of the finance industry. Apple eventually bought NeXT and its operating system became OS X; this heritage can still be seen in its programming interfaces. Apple is now the largest manufacturer of Unix machines; every Apple computer, the iPhone, and most recent iPods have a Unix base underneath their facade.

By the early 90s Unix had become increasingly commercially important. This inevitably led to legal trouble: with so many people giving away improvements freely and having them integrated into the system, who actually owned it? The legal trouble cast uncertainty over the freely available Unix versions, creating an opening for another free operating system.

The vacuum was filled by Linux, a freely available computer operating system similar to Unix started by Linus Torvalds in 1991 as a hobby. More correctly, Linux is just the kernel, the central program from which all others are run. Many more tools in addition to this are required to make an operating system. These tools are provided by the GNU project.[1]

Importantly, Linux was written from scratch and did not contain any of the original Unix code and so was free of legal doubt. Coinciding with the penetration of the internet onto university campuses, and the availability of cheap but sufficiently powerful personal computers, Linux rapidly matured with over one hundred developers collaborating over the internet within two years. The real advances driving Linux were social rather than technological, disparate volunteers donating time on the understanding that, in return for giving their work away freely, anything based on their work is also given away freely and so they in turn benefit from improvements.

The idea that underpins this sharing and ensures that nobody can profit from anyone else's work without sharing is “copyleft”, described in a simple legal document called the GNU General Public Licence,[2] which turns the notion of copyright on its head. (It should be noted that the GNU project, and the philosophy behind it, predate Linux by almost a decade.) Today, Linux has become the dominant free Unix-like operating system with millions of users and support from many large companies.

Getting and installing Ubuntu


Here we describe the Ubuntu distribution (packaging) of Linux, which is one of the most widely used, but all the examples are fairly generic and should work with most Linux, Unix, and Mac OS X computers. There are many different guides on the web about how to install Ubuntu but we recommend installing it as a virtual machine on your current computer.

The Ubuntu Linux distribution is generally easy to use and is updated (for free) every six months. The examples here use Ubuntu version 11.10, named after its release date in October 2011 and also known as “Oneiric Ocelot”. The next (most current) version, 12.04 or “Precise Pangolin”, was released in April 2012 and is designated a Long Term Support (LTS) edition, meaning that it will receive fixes and maintenance upgrades for five years before being retired; it is the best option if you don't want to be regularly upgrading your system.

Acclimatisation


A significant effort has been undertaken to make Ubuntu easy to use, so even novice computer users should have little trouble using it. There are quite a few tutorials available for users new to Ubuntu. The official material is available[3] but a quick search on the web will locate much more. In addition, there is a lot of documentation installed on the machine itself. You can access this by moving the mouse towards Ubuntu Desktop at the top left of the screen and clicking on the help menu that appears. In general, the name of the program you are currently using is displayed at the top-left of the screen, and moving the mouse to the top of the screen will reveal the program's menus in a similar fashion to how they are displayed on the Mac (although, confusingly, some programs display their menus within their own window, rather like a Windows computer).

An alternative way to get help is to click on the circular symbol (a stylised picture of three people holding hands) at the top left of the screen and type help in the search box that appears. For want of a better name, we will refer to the people-holding-hands button as the Ubuntu button although the help text that appears describes it as “Dash home”.

Ubuntu comes free with many tools, including web browsers, file managers, word processors, etc. There are free equivalents available for most of the everyday software people use, and you can browse what is available by clicking on the Ubuntu Software Centre, whose icon at the left of the screen looks like a paper shopping bag full of goodies. The Ubuntu Software Centre is just a starting point and there are many other sources available, both of prepackaged software specifically for Ubuntu, and source code that will require compiling. Search the web for “Ubuntu software repositories” for more information on obtaining additional software.

While there are explicit key combinations for copying and pasting text, just like on Windows or Mac (control-c and control-v in Ubuntu), this convention is not respected by all programs. Unix has traditionally been more mouse-centred, with the left mouse button used to highlight text and the middle button used to paste it. You may find yourself accidentally doing this occasionally if you are not used to using the middle mouse button. Starting applications from icons, opening folders, etc. only requires a single click, rather than the double click required on Windows, making the action of pressing buttons and selecting things from menus more consistent with each other. Accidentally double clicking will generally result in an action being done twice, not normally a bad thing, but it does mean that impatient users can quickly find their desktop covered in windows.

Perhaps the most important difference you are likely to encounter on a daily basis is that the names of files and directories are case sensitive: README.txt, readme.txt and readme.TXT all refer to different files. This is different from both Windows and Mac OS X, where upper and lower-case characters are preserved in the name but the file can be referred to using any case. (Despite the Unix heritage of OS X, Apple chose this behaviour to maintain compatibility with earlier versions of the Mac operating system.)

Fetching the examples


There are many examples in this tutorial to be tried, enclosed in boxes like the one below, which also explains the format of the examples. The example below shows how to automatically download and unpack the example files ready for use.

Figure 1: Example of how to automatically download and unpack a file
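In case the example box is not reproduced here, below is a minimal sketch of the same idea; the URL and archive name are placeholders rather than the actual tutorial files, so substitute the address given in the original example.

# Download the archive of example files (the URL here is a placeholder)
wget http://example.org/examples.tar.gz
# Unpack it: 'z' uncompresses with gzip, 'x' extracts, 'f' names the archive
tar -zxf examples.tar.gz
# A directory of examples should now be present
ls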

Basics

The command line

While Ubuntu has all the graphical tools you might expect in a modern operating system, so new users rarely need to deal with its Unix foundations, we will be working with the command-line. An obvious question is, why is the command-line still the main way of interacting with Unix or, more relevantly, why are we making you use it? Part of the answer to the first question is that the origins of Unix predate the development of graphical interfaces, and this is what all the tools and programs have evolved from. The reason the command-line remains popular is that it is an extremely efficient way to interact with the computer: once you want to do something complex enough that there isn't a handy button for it, graphical interfaces force you to go through many menus and manually perform a task that could have been automated. Alternatively, you must resort to some form of programming (Mac OS X Automator, Microsoft Office macros, etc.), which is the functional equivalent of using the command line.

Unix is built around many little tools designed to work together. Each program does one task and returns its output in a form easily understood by other programs. These properties allow simple programs to be combined to produce complex results, rather like building something out of Lego bricks. The foreword to the 1978 report in the Bell System Technical Journal[4] describes the Unix philosophy as:

"(i) Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.

(ii) Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.

(iii) Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.

(iv) Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them."

The rest of this tutorial will be based around using the command-line through a “terminal”. This terminology dates back to the early days of Unix, when there would be many “terminals”, basically a simple screen and keyboard, connected to a central computer. The terminal program can be found by clicking on the Ubuntu button and typing terminal in the search box, as shown in Figure 2. You can also easily access the terminal using just the keyboard by pressing control-alt-T. Once open, the text size can be changed using the View/Zoom menu options, or the font changed entirely using the Edit/Profile Preferences menu option.

While we are using Linux during the workshop, you may not have access to a machine later, or may not wish to use Linux exclusively on your computer. You could install Linux as a 'dual-boot' on your computer, or run it in a virtual machine (a Virtual Machine (VM) is a program on your computer that acts like another computer and can run other operating systems; several VMs are available, and VirtualBox http://www.virtualbox.org/ is free and regularly updated), but knowledge of the command-line is fairly transferable between platforms. Mac OS X also has a command-line hidden away (/Applications/Utilities/Terminal) and, with a small number of eccentricities, everything that works on the Linux command-line should work on OS X. Windows has its own incompatible version of a command-line, but Cygwin http://www.cygwin.com/ can be installed and provides an entire Unix-like environment within Windows.

Figure 2: Opening a terminal in Ubuntu. A partially obscured terminal is shown at the bottom right of the desktop

At the beginning of the command-line is the command prompt, showing that the computer is ready to accept commands. The prompt is text of the form user@computer:directory$. Figure 2 shows a user called tim in the directory ~ on a computer called coffee-grinder. Having all this information is handy when you are working with multiple remote computers at the same time. The prompt is configurable and may vary between computers; you may notice later that other prompts are slightly different. Some basic commands are shown in Figure 3; try typing them at the command-line, pressing return after each command to tell the computer to run it.

Figure 3: Some basic commands to answer the important questions of life: "who am I?", "where am I?" and "what operating system am I running?"
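If the figure is not visible, the commands it refers to are most likely the following; they are safe to try and simply print information (the output will differ on your machine).

whoami      # "who am I?" - prints your user name
pwd         # "where am I?" - prints the current working directory
uname -a    # "what operating system am I running?" - prints the kernel name and version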
Files and directories

All files in Unix are arranged in a tree-like structure: directories are represented as branches leading from a single trunk (the “root”) and may, in turn, have other branches leading from them (directories inside directories) and individual files are the leaves of the tree. The tree structure is similar to that of every other common operating system and most file browsers can display the filesystem in a tree-like fashion, for example: part of the filesystem for an Ubuntu Linux computer is displayed in Figure 4.

Figure 4: Tree-like structure of the Ubuntu filesystem. The home/tim directories have been opened to show their contents (for illustrative purposes)

Where Unix differs from other operating systems is that the filesystem is used much more for organising different types of files. The essential system programs are all in /bin and their shared code (libraries) is in /lib; similarly, user programs are in /usr/bin, with libraries in /usr/lib and manual pages in /usr/share/man.

There are two different ways of specifying the location of a file or directory in the tree: the absolute path, and the relative path from where we currently are in the filesystem (the current working directory). An absolute path is one that starts at the root and does not depend on the location of the current working directory. Starting with a / to signify the root, the absolute path describes all the directories (branches) we must follow to get to the file in question; each directory name is separated by a /.

For example, /home/user/Music/TheKinks/SunnyAfternoon.mp3 refers to the file SunnyAfternoon.mp3 inside the directory TheKinks, which is inside the directory Music, which is inside the user's directory, which is inside the directory home, which is connected to the root. If you are familiar with Microsoft Windows, you might notice that the path separator is different. Unix-based systems use a forward-slash (/) rather than the backward-slash (\) used on Windows. You may have noticed that the paths of web pages are also separated by forward-slashes, revealing their Unix origins as a path to a file on a remote machine.

For convenience, a few directories have special symbols that are synonyms for them and the most common of these are listed in Figure 5. Most of these have a special meaning when at the beginning of a path otherwise they are just a symbol. For example dir/~/ is the directory ~ inside the directory dir in the current directory, whereas ~/dir/ is the directory dir inside the home directory (usually /home/user on Linux, /Users/user on Mac OS X). In both cases the '/' symbols are separators rather than the root directory.

Figure 5: Special directory names

The current location, the working directory, can be displayed at the command-line using the pwd command. Rather than referring to a file by its absolute path, we can refer to it using a path relative to where we are: a file in the current directory can be referred to by its name, a file in a directory inside our working directory can be referred to by directory/filename (and so on for files inside of directories inside of directories inside of our working directory, etc.). Note that these paths are very similar to how we describe absolute paths except that they do not start with /; absolute paths are just paths relative to the root (alternatively we could read the initial / as “go to root” and consider them to be relative paths). As shown in Figure 5, the directory above the current directory can be referred to as .. so, if the working directory is /home/user, then the root directory can be referred to as ../.. (go up one directory, then go up another directory). The symbol .. can be freely mixed into paths: the directory examples below the current directory could have the path examples/../examples/../examples (needless to say, simply using examples is recommended).
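As a quick illustration of the difference between absolute and relative paths (the directory names follow the music example above):

pwd                      # e.g. /home/user
cd Music/TheKinks        # relative path: go down two directories
pwd                      # now /home/user/Music/TheKinks
cd ../..                 # go back up two levels, returning to /home/user
cd /home/user/Music      # the same move down, this time with an absolute path
cd ~                     # and back to the home directory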

Commands

Commands are just programs elsewhere on the computer, and entering their name on the command-line runs them. Commands have a predictable format:

command -flags target

The command is the name of the program to run, the (optional) flags modify its behaviour and the target is what the command is to operate on, often the name of a file. Many commands require neither flags nor target but Unix tools are generally extremely configurable and even simple commands like date (some utilities also have parodies, see ddate or sl for example) have many optional flags to change the format of their output.

As mentioned in Files and directories, there are special directories that contain executable programs, and programs within them can be run by typing their name at the command-line. The reason you can run the programs in these directories simply by typing their names is that the operating system knows to look in those directories for programs. In general you will not have permission to place files in these directories, and experienced Unix users create their own, normally ~/bin/, to place programs they use frequently. Creating this directory does not make it special; you still have to tell the operating system to go look for programs there as well. The operating system has a variable, $PATH, which is a list of directories in which the computer looks for programs. To add a directory to that list, use the command "export PATH=~/bin:$PATH", where "~/bin" is the directory you want to add. This command is often added to the file ~/.bashrc, which is a list of commands to be run automatically every time a new terminal is opened.

If a program is not in one of these special directories, you cannot run it just by typing its name because the computer doesn't know where to find it. This is true even if the program is in the current directory. Programs which are not in special directories can still be run, but you have to include the path to where they can be found. If the program is in your current working directory, this can be as simple as typing ./program. If the program is elsewhere, just type the absolute or relative path to where it is. You can always use the command-line's auto-completion features (see “tab-completion” below) to reduce the amount of typing needed. In order to alleviate the need to type paths to commonly-used programs, it is a good idea to add their paths to the PATH variable in ~/.bashrc.
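As a concrete sketch of the above, assuming a program called myscript that you have written or downloaded (the name is hypothetical):

./myscript                                     # run a program in the current directory by giving its path
mkdir -p ~/bin                                 # create a personal directory for programs
mv myscript ~/bin/                             # move the program there
export PATH=~/bin:$PATH                        # tell the shell to search ~/bin as well
myscript                                       # now it can be run by name alone
echo 'export PATH=~/bin:$PATH' >> ~/.bashrc    # make the change apply to every new terminal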

One thing you'll quickly discover is that the mouse does not move the cursor in the terminal. The terminal interface predates the popularity of mice by decades and alternative methods of efficiently moving around and editing have been developed. There are keyboard short-cuts defined for most common operations, and a few of these are listed in Figure 6. Probably the most useful shortcut is the tab key. It can be used to complete command names and paths in the filesystem (called 'tab-completion'). Pressing tab once will complete a path up to the first ambiguity encountered and pressing again gives a list of possible completions (you can type the next letter or so of the one you want and press tab again to attempt further auto-completion).

Figure 6: Common key bindings for moving around the command-line
Figure 7: Commands for manipulating files

A record is kept of the commands you have entered, and the history command can be used to list them so you can refer back to what you did earlier. The history can also be searched: Control-r starts a search and the computer will match against your history as you type; pressing enter accepts the current line, pressing Control-r again goes to the next match, and Control-g cancels the search. History can also be referred to by entry number, listed using the history command: entering !n on the command-line will repeat history entry n, and entering !! will repeat the last command.
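A brief sketch of using the history; the entry number shown is illustrative and will differ on your system.

history           # list previous commands, each with an entry number
!42               # repeat history entry number 42
!!                # repeat the last command
# Pressing Control-r and typing 'tar' searches backwards for the last command containing "tar"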

There are many commands, often quite terse, for manipulating files, and a few of the more useful of these are shown in Figure 7. Many of the commands for Unix have short names, often only two or three letters, so typing errors can easily have unintended and severe consequences! Be careful what you enter, because Unix rarely gives you a second chance to correct mistakes. Some Unix machines have the sl command to encourage accurate typing.

Figure 7 shows a few commands for manipulating files, with brief explanations.

On the Unix command line, some symbols can have special meanings. A slash, '/', indicates the end of a directory name, an asterisk, '*' is a wildcard, etc. However, there are many circumstances when it is preferable for symbols not to have a special meaning, the most common example being when the file name contains a space (a space is a special character in the sense that it is interpreted as a break between command-line options). The character in question can be “escaped” by prefixing it with a '\' to remove its special meaning so, for example: / is the root directory but \/ is a file called '/'.
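For example, a file name containing a space must be escaped (or quoted), otherwise the shell treats it as two separate arguments; the file name below is hypothetical.

ls -l my\ results.txt       # the backslash removes the special meaning of the space
ls -l "my results.txt"      # quoting the whole name achieves the same thing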

Files beginning with a . character are hidden by default and will not appear in the output of ls or equivalent (or in the file browser when you're using a graphical user interface). Generally, hidden files are those used directly by the computer or programs, containing configuration information not intended for the average user to understand or use.

Reading and writing permission


All files and directories have a set of permissions associated with them, describing who is allowed to read or write that file. There are three basic permissions: read r, write w, and execute x. The meanings of read and write are fairly obvious, but execute has two meanings depending on context. For normal files, execute permission is used on files with executable code (i.e. programs) to give users permission to run that program. For a directory, x permission allows a user to open that directory and see the files it contains. There are three categories of user: owner u (generally the user who created the file), group g (the group of users that the owner belongs to), and other o (everyone else). The permissions for each file are described as a string of nine characters, three for each user category. The three positions assigned to each user category correspond to the three types of permissions (r, w, and x, in that order). If that user category has a given permission, the appropriate letter will appear; if not, the letter is replaced with a dash '-'. For example, if a user category has permission to read and execute a file, but not write it, their triplet will look like r-x. The permission string rwxr-x--- means that the owner has permission to read, write or execute, users in the same group have read and execute permission, and other users have no permissions.
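A short sketch of viewing and changing permissions with ls -l and chmod; the file name is hypothetical and the listings shown are illustrative.

ls -l script.sh
-rw-r--r-- 1 tim staff 512 Apr 28 10:00 script.sh
# Give the owner (u) execute permission so the script can be run
chmod u+x script.sh
# Remove read permission from group (g) and others (o)
chmod go-r script.sh
ls -l script.sh
-rwx------ 1 tim staff 512 Apr 28 10:00 script.sh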

The owner of a file can change its permissions. Some programs will do this automatically if they are being run by the file's owner, giving the impression that the permissions have been ignored. Running rm -f is the most common place a user will run into this behaviour: by default rm will prompt before removing write-protected files (i.e. files you don't have permission to write), but the -f (force) flag tells it not to bother asking and just remove the file.

Dealing with multiple files


Often, especially when running scripts or organising files, you will need to deal with multiple files at once. Rather than typing each file name out explicitly, we can give the computer a pattern instead of a filename. All filenames are checked against the pattern and the computer automatically generates a list of all the matching files to use when running the command. Patterns are created using symbols that have a special meaning. For example, * means match anything (or nothing), so a*b is a pattern that matches any filename beginning with a and ending with b, including the file ab. Figure 8 contains a list of special symbols useful for constructing patterns.

Figure 8: Special symbols for filenames. As with the \* example in the table, any of these symbols can be prevented from having a special meaning by “escaping” them with a '\'.

As mentioned above, pattern matching occurs before a command is run and the pattern is replaced by the list of matches. The command never sees the pattern, just the results of the match.
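You can see the expansion happen by using echo, which simply prints whatever arguments it receives; the file names below are hypothetical.

ls
a1b  ab  notes.txt
echo a*b          # the shell expands the pattern before echo ever runs
a1b ab
echo *.txt
notes.txt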

Running multiple programs


From early on in its development, Unix was designed to run multiple programs simultaneously on remote machines, and support for this is integrated into the command-line. Jobs (scripts, programs, or other fairly self-contained things running from the command line) can be divided into two types, foreground jobs and background jobs, based on how they affect the terminal. A foreground job temporarily replaces the command-line and you cannot enter new commands until it has finished, whereas a background job runs independently and allows you to continue with other tasks. Only foreground jobs receive input from the keyboard, so interactive programs like PAUP* should be run in the foreground (although you could set up a compute-intensive analysis, background it and continue with other tasks while it is running; later, when the calculations have finished, the program can be made foreground again so interaction can continue). Although background jobs leave your command-line mostly free to do other things, they do send their output to the terminal you launched them from, so you might see it popping up in the middle of another task, which can be confusing. If you are running multiple background jobs, their output will be interleaved based on when it was produced, with no indication of which program produced the output.
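A minimal sketch of fore- and background jobs, using sleep (which simply waits for the given number of seconds) as a stand-in for a long-running analysis:

sleep 600 &       # '&' starts the job in the background; the prompt returns immediately
jobs              # list the jobs started from this terminal
fg %1             # bring job number 1 back to the foreground
# Control-z suspends the current foreground job; typing 'bg' then resumes it in the background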


Figure 9: A few commands and key combinations for job control

As hinted in Figure 9, there is a difference between a job and a process. A process is a single program running on the machine, and each process is uniquely numbered with a pid (process ID). You can list all the processes you are running, including the command-line itself (generally called bash, although on some systems it may be zsh or tcsh), using ps (or ps -a if you want to see what all the other users of the machine are doing). The command-line itself is just a process (program) running on the computer, albeit one specially designed for starting, stopping and manipulating other processes. Processes are the fundamental method of keeping track of what is running on the computer. Jobs, on the other hand, are things entered on the command-line and may include several programs logically connected together by pipes (see In, out and pipes for details) to achieve a task.

Figure 10: The command-line splits the jobs into several processes

The command-line splits jobs into several processes and runs them, possibly simultaneously; see the illustrative example in Figure 10.

In, out and pipes


Where possible, Unix commands behave rather like filters, or the mathematical concept of a function: they read from input, manipulate that input, and write the output. This might sound trivial, tautologous even, but it enables simple commands to be combined to produce complex results. Every command reads from stdin (short for standard in) and writes to stdout (short for standard out). By default stdin is whatever gets typed into the terminal from the keyboard, and stdout is connected to the current command-line, so results are displayed on the screen. stdout can easily be redirected to a file instead by using the greater-than operator, >; > filename redirects stdout to the specified file for later perusal. By chaining many simple commands together, complex transformations of the input can be achieved. The following is an advanced example, showing how a complex output can be achieved using a series of smaller steps. You may not yet have sufficient understanding of the shell to follow everything in this example, but try to work through it and see what each step is doing. The man pages for each command (see Getting help) might be useful.
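The original example box is not reproduced here, but the following sketch is in the same spirit: it counts how often each nucleotide occurs in a FASTA file (the file name and counts are hypothetical), one small step at a time. grep -v '^>' removes the header lines, fold -w1 splits the sequence into one character per line, sort | uniq -c counts each distinct character, and the final > redirects the result to a file.

grep -v '^>' human.fasta | fold -w1 | sort | uniq -c > base_counts.txt
cat base_counts.txt
  12034 A
   8211 C
   8377 G
  11978 T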

Compression


The aim of compression is to make files smaller, which is useful for both saving disk space and making it quicker to send files over the internet. Some types of programs that send data over the internet have the ability to transparently compress files before sending and uncompress at the other end. Some web servers implement this, but the most important examples for us are scp and sftp (two command-line programs used to transfer files over networks), which can each be given the -C option to request compression.

Simply put, compression programs look for frequently repeated patterns in the file and remove this redundancy in a manner that can be undone later. Text files tend to compress very well, with 100MB worth of Wikipedia being compressible into less than 16MB (see the Hutter prize http://prize.hutter1.net/); biological sequences in particular tend to be very compressible, since the alphabet of nucleotides or amino acids is small.

The two most common tools for compressing files are gzip and bzip2, with their respective tools for uncompressing being gunzip and bunzip2. gzip is the de-facto standard; bzip2 tends to produce smaller files but takes longer to compress them. On the Windows platform, the Zip (often known as WinZip http://www.winzip.com/) compression method is favoured, and many Unix platforms provide zip and unzip tools to deal with these files. Non-Linux Unix platforms, like Mac OS X for example, have older tools called compress and uncompress that are rarely used any more. Support for compress'd files on Linux can be patchy and unreliable: for example, a machine one author has access to has a compress manual page but no actual tool installed.
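Basic usage of both tools, on a hypothetical file:

gzip sequences.fasta         # replaces the file with the compressed sequences.fasta.gz
gunzip sequences.fasta.gz    # restores the original file
bzip2 sequences.fasta        # produces sequences.fasta.bz2, usually smaller but slower to make
bunzip2 sequences.fasta.bz2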

A final method to be aware of, that is becoming more popular, is 7-zip (7za). 7-zip can produce smaller files than all the above methods, again at the expense of taking longer to compress. A list of file suffixes that can be used to identify what files are compressed using what method is provided in Figure 11.

Figure 11: List of suffixes useful to identify how files are compressed

Compression works better if files are combined and then compressed together, rather than compressed individually, since this allows the compression program to spot repeated patterns between the files. On Unix, the process of packing/unpacking several files into/from a single file has historically been separate from compression, in keeping with the philosophy of having little tools that do one thing well. The Unix tool for packing and unpacking files is tar (“Tape Archiver”), the odd name arising because its heritage goes back to 1979, when writing files to magnetic tape was a common method of storage.

Below is an example of using tar to pack files into an archive and then extract them:

ls
chimp.fasta human.fasta macaque.fasta orangutan.fasta
# Pack into a single file. The suffix is your responsibility. 'c' means create, and
# 'f' means that the next argument is the filename to write to.
tar -cf sequences.tar *.fasta
# Note that the original files are untouched
ls
chimp.fasta human.fasta macaque.fasta orangutan.fasta sequences.tar
# Delete all sequences
rm *.fasta
# 'x' means extract
tar -xf sequences.tar
ls
chimp.fasta human.fasta macaque.fasta orangutan.fasta sequences.tar

Over time, the features of tar have increased to make it more convenient, and modern versions are now capable of packing and compressing files in a single step, as in the example below.

ls
chimp.fasta human.fasta macaque.fasta orangutan.fasta sequences.tar
# Pack and gzip sequences simultaneously ('z' tells tar to use gzip)
tar -zcf sequences.tgz *.fasta
# List the contents without extracting
tar -ztf sequences.tgz
chimp.fasta
human.fasta
macaque.fasta
orangutan.fasta
# More recent versions of tar can also bzip2 files
tar -jcf sequences.tbz2 *.fasta
tar -jtf sequences.tbz2
chimp.fasta
human.fasta
macaque.fasta
orangutan.fasta


Figure 12: File suffixes for common compression programs. When combined with tar to compress multiple files, the full suffix .tar.suffix is often shortened to that given above. zip and 7za (“7-zip”) have a Windows heritage and have built-in methods to combine multiple files together, so they are rarely used in conjunction with tar. The file tool can also be used to determine file type, e.g. file file.unknown.suffix; see man file for details.

Compression and decompression are actually done by the same program. Decompression program names like 'gunzip' are actually just convenient aliases that tell the computer to call the gzip program with the unzipping flags.

Working on remote computers


Why use a remote computer? There are many reasons: First, central computing resources tend to be much larger, more reliable and more powerful than your laptop or PC – if you need to do a lot of work or use a lot of data then you may have no option but to use a bigger computer.

There is also a world of difference between server-quality hardware and stuff on your desk. Uninterruptible power supplies, (i.e. backup batteries for when the power goes out) are one example. Servers also tend to have redundant components and memory that can detect and correct errors. At the top end, servers can detect and isolate faulty parts, report the problem, and continue running. Often the first time users of a central server know that a fault occurred is when an engineer turns up with a replacement part.

If you have a job that will take a long time to run, for instance Bayesian phylogenetic methods, you may not want to commit to leaving your personal computer untouched for long enough to complete the analysis (and do you really trust your colleagues not to turn it off?), whereas central facilities are permanently on and have batteries to prevent small glitches in the power supply from affecting the computers. Lastly, and most importantly, central computers tend to have much more rigorous and tested policies for backing up data – do you do regular backups? Are they kept in a separate physical location from the original? When was the last time you checked that the backup actually worked?

SSH (short for Secure SHell) is a method of connecting to other computers and giving access to a command-line on them; once we have a command-line we can interact with the remote computer just like we interact with the local one. SSH replaces an older method of connecting to remote computers called telnet, which sends everything – including your password – as normal undisguised text so anyone can read it. It is not a good idea to use telnet unless you know what you are doing and you have no other option. Similarly, avoid FTP (File Transfer Protocol) for transferring files if you have sftp or scp available.

As well as keeping communications between your computer and a remote computer secure, SSH also allows you to verify that the remote computer is the computer it claims to be – no point keeping traffic secure if you send it to the wrong place – and prevents someone sitting in the middle of the connection listening to each message then passing it on, pretending to each side to be the other. (This is known as a Man-in-the-Middle attack http://en.wikipedia.org/wiki/Man-in-the-middle_attack. Both sides think they are communicating with the other but are actually communicating with an intermediary who copies all messages then forwards them on.) The method used to verify identity, without the possibility of forgery, even if someone else can copy and manipulate all messages, is very interesting and has many other uses; see http://en.wikipedia.org/wiki/Public-key_cryptography and http://en.wikipedia.org/wiki/Digital_signature for details. If verification fails, you will be warned with a message like the one in Figure 13 and the computer will refuse to connect.

Figure 13: Warning message shown when SSH cannot verify the remote computer's identity

By far the majority of these warnings are caused by inept computer administration rather than malice (for instance, if someone has upgraded the other machine incorrectly so it appears to be a different computer, you will get this kind of error). If you are sure it is safe, the warning can be dealt with by deleting the appropriate line for the computer from the ~/.ssh/known_hosts file.

Graphical programs can also be run on remote machines, but expect pauses unless you have a very, very fast internet connection. The system that enables this is called the X Window System (or just X, or X11); X is the successor to the W Window System, if you are wondering where the X came from. You can use the -X flag when you run ssh to allow the remote computer to open programs in new windows on your local display, provided you have software on your local computer that understands the instructions being sent. Linux computers use such software by default for display, Mac OS X comes with software that can be used (and it is started automatically by ssh), and on Windows the Cygwin software provides the required functionality. Below is an example of using ssh with the -X flag.
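The example referred to above is not shown here, but it would look something like the following; the user name and host are placeholders, and xeyes is a simple X demonstration program used purely to test that forwarding works.

# Log in, forwarding graphical windows back to the local display
ssh -X user@server.example.org
# Once logged in, starting a graphical program on the remote machine...
xeyes &
# ...opens its window on your local screen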

Transferring files


It is possible to transfer files between computers using SSH alone, but this is not recommended since more friendly interfaces exist. There are many graphical file transfer programs available; without recommending particular programs, Cyberduck http://cyberduck.ch/ for Mac OS X and WinSCP http://winscp.net/ for Windows appear to be useful options, but there are many more. Alternatively, under Mac and Unix, it is possible to mount directories on remote computers so they appear to be local; search for sshfs for details. When transferring files, silent errors are extremely rare but can happen, and so we'd like to be able to verify that the file received is identical to the one sent. Short files could be checked by eye, but this can't be automated without transferring the file again (which might also get an error). A common technique to verify correct transfer is to calculate the md5 (Message Digest algorithm 5) sum of both files and compare these values. The md5 is a short string of characters that identifies a file, and two different files are extremely unlikely to share the same string – if a file changes, its md5 will (very probably) change, and so we know that a change occurred. It is extremely difficult to deliberately create two files that have the same sum, and the chance of two non-identical random files having the same md5 is about 1 in 3.4e38. When checking large numbers of files, the chance that there are two files in the set with the same md5 increases rapidly, but will still be small enough for realistic uses. More rarely, you may come across SHA sums (shasum on both Unix and Mac computers), which are very similar to md5s but have an even smaller chance that two files share the same string.
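A sketch of checking a transfer with md5 sums; the command is md5sum on Linux (md5 on Mac OS X), the file name and host are hypothetical, and the hash shown is illustrative.

# On the sending machine
md5sum sequences.tar.gz
9f86d081884c7d659a2feaa0c55ad015  sequences.tar.gz
# Copy the file to the remote machine with scp, then compute the sum of the remote copy
scp sequences.tar.gz user@server.example.org:
ssh user@server.example.org md5sum sequences.tar.gz
9f86d081884c7d659a2feaa0c55ad015  sequences.tar.gz    # identical strings, so the transfer was clean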

Getting help


General help with Ubuntu has already been covered in “Acclimatisation”; alternatively, just find someone to ask. As with everything else, the web is a rich source of good, bad, and downright weird tutorials.

If you have little or no programming experience, Python (http://python.org/) is a good choice for learning how to do useful bioinformatics scripting, especially in conjunction with the Biopython module (http://biopython.org/). Unix is generally very well documented, although the documentation is often aimed at more experienced users. The manual pages all tend to follow the same format, and it's a good idea to become familiar with it. The page will start with a description of what the command does and a summary of all its flags; optional flags will be enclosed in square brackets. Next comes a full description of the command and detailed descriptions of what each flag does. Sometimes there is also a section containing examples of usage. Mac OS X is generally very consistent about man pages, but Linux derivatives can be a mixed bag.

Figure 14: Looking at the manual page for man
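For example, to read the manual for ls, or to search the manual database when you cannot remember a command's name:

man ls            # full manual page for ls; space pages down, q quits
man -k compress   # search the one-line descriptions for a keyword (apropos does the same)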

Variables and programming


So far, we have only used the command-line to run other programs and to chain them together to achieve more complex results. The command-line can also be used as a programming language in its own right, and we can write little programs to automate common tasks; often this is referred to as scripting rather than programming, although the distinction is not really relevant.

Obviously learning to program is not something that can be taught in an hour or two, and even experienced programmers take several days to become productive in a new language, so this section can give little more than a taste of what is possible; however, it should be possible to show how you could save a lot of time with a little investment up front. If you are doing similar things to a large number of files, many sequences for example, typing the same command over and over on the command line is time-consuming, tedious, and prone to error, especially as you get bored. Scripting can save you a lot of time and allow you to get on with something else while the computer takes on that task for you. Think about the last time you needed to rename 100 files, or change the format of thousands of gene alignments so they are compatible with your phylogeny program. Learning a little bit of scripting can speed up these tasks tremendously. As with everything, there are many tutorials available on the web, and a search for bash scripting tutorial or bash scripting introduction will yield many examples of varying completeness and comprehensibility.

In order to provide you with a little bit of programming background, we've prepared a small general tutorial below:

The first thing to introduce is variables. A variable is just a name for another piece of data; a useful analogy is that of a labelled box: every time we see the label, we conceptually replace it with the contents of the box. The ability to manipulate variables, changing the state of the computer, is fundamental to programming. Here we'll introduce two useful cases: shortening common directory paths and performing the same operations on many files. In bash scripting, variables are referred to with a dollar sign followed by the name of the variable: $NAME. There are some restrictions on the characters that can be part of a variable name, and variable names cannot start with a number; as a rule of thumb it's a good idea to only use upper- or lower-case letters in your variable names.
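A short sketch of defining and using a variable; note that there must be no spaces around the = sign, and the path used here is hypothetical.

DATADIR=~/projects/run1        # define a variable holding a long directory path
ls $DATADIR                    # the shell substitutes the value wherever $DATADIR appears
cp $DATADIR/sample1.fastq .    # saves typing the full path repeatedly
echo "Working in $DATADIR"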

A variable can refer to the name of a file, and we can write things at the command-line using the variable instead of the name explicitly – change the variable and we run exactly the same commands on a different file. One way to take advantage of this would be to set the variable to one of several files and use the history to repeat a set of commands. Of course, if the commands write their output to a file then that file would have to be renamed each time, otherwise the output for each file would be written over that for the previous one. Shell scripting provides an alternative: the computer can be told to set the variable to each of many file names in turn, and the value of the variable can be edited automatically to provide the name of a unique output file.

A common Unix practice is to place frequently used sets of commands into a file, called a script, for reuse, thereby preventing errors from retyping them. Writing a script also means that complex operations with many steps can be tested before you commit to running them over many files, something that could potentially take days if we are dealing with large numbers of genes. Scripts can be written and modified in any common text editor but must be saved in text format; nano is a good basic editor that is fairly intuitive to use, but there are many others more specifically designed with programmers in mind. Alternatively you could use gedit, a program more like Notepad on Windows (to access gedit, click the Ubuntu button and search for gedit; entering gedit & at the command-line will also work).
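Putting these ideas together, a small script might loop over every FASTA file in a directory and write one output file per input. This is only a sketch: fancy_analysis stands in for whatever program you actually want to run, and ${FILE%.fasta} strips the suffix so each output file gets a unique name.

#!/bin/bash
# process_all.sh - run the same command on every fasta file in the directory
for FILE in *.fasta
do
    fancy_analysis $FILE > ${FILE%.fasta}.out
done

After saving the file (for example with nano) and making it executable with chmod +x process_all.sh, it can be run as ./process_all.sh.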

Line endings – compatibility problems


Even after the standard alphabet for computers was established (ASCII – American Standard Code for Information Interchange), there was no agreement about how to indicate the end of a line. ASCII provides two possibilities, line-feed '\n' and carriage-return '\r', based on how old type-writers and tele-type terminals used to work: a carriage-return moves the carriage, the position at which the next character will be printed, back to the beginning of the line, and a line-feed moves the paper one line down but doesn't change where the carriage is. On Unix a '\n' character is taken to mean “line-feed and carriage return” and this is used to separate lines of text. On Windows, lines are separated by the pair of characters '\r\n' (in that order), and old versions of Apple operating systems (prior to OS X) use '\r' to separate lines. The situation on Mac OS X is more complex since it must deal with both its Mac and Unix heritage; officially '\n' now separates lines in files, but programs have to be able to deal with both conventions.
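If a file does arrive with foreign line endings, it is straightforward to fix from the command line; tr below deletes every carriage-return character (the file names are hypothetical), and many systems also provide dos2unix/unix2dos tools for the same job.

tr -d '\r' < windows_file.txt > unix_file.txt    # strip the '\r' half of Windows line endings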

To further complicate things, some methods of transferring files between machines try to automatically convert the line endings for you. This is generally a mistake. Specifically, the old file transfer method FTP (File Transfer Protocol) has two modes, text and binary; text mode will attempt to translate line endings. Unix platforms default to binary and are generally safe. The only case where you need to be careful is transferring files from Windows using the command-line FTP application. If you transfer a binary file over FTP in text mode, the received file will be corrupted irretrievably. If in doubt, see Transferring files for how to verify that your file has transferred correctly.

If you've managed to read through to here, you're probably thinking: a) that's complicated, and b) why haven't I noticed this? The answer is that it used to cause problems in the past, but programmers are aware of the issues nowadays and programs tend to do the right thing. Some programming languages, like Perl, even deal with these problems transparently, so even programmers don't need to be aware of them any more.

Galaxy


Galaxy is an open source GUI toolset for a variety of NGS applications. It can be accessed through the Penn State University server[5] or an institution may set up their own server. Users can save and publish data histories for future use or for others to use. Its tutorials and graphic interface make it simple to learn and easy to use.

Users may also want to run their own galaxy instance, for testing, tool development or production use. A preconfigured Docker container makes this straightforward.

The following demonstrates an example setup.

The user:

  • pulls the Galaxy image
docker pull bgruening/galaxy-stable
  • runs the Docker container
docker run -itp 8080:80 -v /local/dir/:/export bgruening/galaxy-stable

GeneTalk


5 steps to analyze VCF files in GeneTalk at www.gene-talk.de

  1. Upload VCF file
  2. Edit pedigree and phenotype information for segregation filtering
  3. Filter your VCF file by editing the following filtering options
    • Functional filter (filter out variants that have effects on protein level)
    • Linkage filter (filter out variants that are on specified chromosomes)
    • Gene panel filter (filter variants by genes or gene panels, subscribe to publicly available gene panels or create your own)
    • Frequency filter (show only variants with a genotype frequency lower than specified)
    • Inheritance filter (filter out variants by presumed mode of inheritance)
    • Annotation filter (show only variants that are listed in databases)
  4. View results and annotations
  5. Write your own annotations

BaseSpace


References

  1. "GNU Operating System". Free Software Foundation, Inc. 11 April 2016. Retrieved 28 April 2016.
  2. "What is Copyleft?". Free Software Foundation, Inc. 3 October 2015. Retrieved 28 April 2016.
  3. Ubuntu Documentation Team. "Official Ubuntu Documentation". Canonical Ltd. Retrieved 28 April 2016.
  4. McIlroy, M.D.; Pinson, E.N.; Tague, B.A. (1978). "Unix Time-Sharing System: Foreword". The Bell System Technical Journal. 57 (6): 1899–1904.
  5. "Galaxy". usegalaxy.org. Galaxy Project. Retrieved 28 April 2016.