CONSED 29.0 DOCUMENTATION CONTENTS: 1 WHAT IS NEW IN CONSED 29.0 2 UPGRADING FROM CONSED 27.0 to 29.0 3 UPGRADING FROM CONSED 28.0 to 29.0 4 REQUIRED VERSIONS OF OTHER PROGRAMS 5 INSTALLING CONSED 6 NOTE TO LINUX USERS (64 BIT) (INSTALLATION) 7 NOTE TO LINUX USERS (32 bit) (INSTALLATION) 8 NOTE TO MACOSX USERS (INSTALLATION) 9 NOTE TO SOLARIS USERS (INSTALLATION) 10 QUICK TOUR 11 QUICK TOUR OF BAMSCAPE 12 QUICK TOUR OF CONSED (THE GRAPHICAL EDITOR) 13 BAM2ACE: MAKING A CONSED-READY DATASET OUT OF A BAM FILE 14 SANGER READS 15 454 READS 16 USING AUTOPCRAMPLIFY 17 USING AUTOPRIMERS 18 USING AUTOREPORT 19 FEATURES FOR SNP ANALYSIS 20 BIONANO DIGEST GENOME MAPS 21 LESS USED CONSED FEATURES 22 CONSED CUSTOMIZATION 23 CREATING CUSTOM TAG TYPES 24 EXPANDING CONSED'S CAPABILITIES WITH A LITTLE PROGRAMMING 25 MONITORS AND MICE FOR CONSED 26 ACE FILE FORMAT 27 SAMPLE PHD BALL FORMAT 28 TIMESTAMP MISMATCH 29 CONSED REFERENCES 30 RUNNING PHRED and PHRAP 31 WHAT IS AUTOFINISH? 32 USING AUTOFINISH 33 CONTRIBUTED SOFTWARE 34 CONSED CUSTOMIZABLE CONSEDRC RESOURCES 35 ACKNOWLEDGEMENTS BIG TABLE OF CONTENTS: 1. WHAT IS NEW IN CONSED 29.0 2. UPGRADING FROM CONSED 27.0 to 29.0 3. UPGRADING FROM CONSED 28.0 to 29.0 4. REQUIRED VERSIONS OF OTHER PROGRAMS 5. INSTALLING CONSED 5.12) PRELIMINARY TESTING OF CONSED BEFORE COMPLETING THE REST OF THE INSTALLATION 5.21) ENOUGH MEMORY FOR CONSED 5.23) TESTING THE INSTALLATION 5.24) TESTING ADDING ILLUMINA READS 5.25) TESTING ADDING 454 READS 5.26) TESTING 454 READS (NEWBLER ASSEMBLY) 5.27) TESTING ADD NEW READS 5.29) TESTING RUNNING CROSS_MATCH FROM ASSEMBLY VIEW 5.30) TEST RUNNING PHREDPHRAP 5.31) TESTING MINIASSEMBLIES 5.34) FAKE READS 5.35) APPENDING EXPID TO THE PHD FILES 6. NOTE TO LINUX USERS (64 BIT) (INSTALLATION) 7. NOTE TO LINUX USERS (32 bit) (INSTALLATION) 8. NOTE TO MACOSX USERS (INSTALLATION) 8.8) MICE FOR MAC 8.9) C COMPILER FOR MAC 9. NOTE TO SOLARIS USERS (INSTALLATION) 10. QUICK TOUR 11. QUICK TOUR OF BAMSCAPE 11.1) USING BAMSCAPE 11.36) MODIFYING THE REFERENCE SEQUENCE 11.42) FINDING PROBLEMS IN BATCH 12. QUICK TOUR OF CONSED (THE GRAPHICAL EDITOR) 12.1) GETTING YOUR OWN COPY OF A SAMPLE DATASET 12.8) SCROLLING 12.12) VERTICAL SCROLLING 12.13) GOTO POSITION 12.14) COLORS 12.16) HIGHLIGHTING READ NAMES 12.20) MOVING ALONG A READ 12.21) DIMMING AND UNDIMMING ENDS OF READS 12.22) FIXING A READ AT TOP OF ALIGNED READS WINDOW 12.28) EDITING THE CONSENSUS 12.29) SAVING THE ASSEMBLY 12.30) EXPORTING THE CONSENSUS 12.33) COMPLEMENTING THE CONTIG 12.34) FIND MAIN WINDOW 12.35) MULTIPLE UNDO EDIT 12.36) EXITING CONSED 12.37) CONSED -ACE 12.38) SORTING OF READS 12.40) SORTING BY BASE 12.41) SORTING BY MISMATCHES ON TOP 12.42) SORTING BY QUALITY 12.43) SORTING BY UNALIGNED + MISMATCHES ON TOP 12.44) ALPHABETICAL SORTING OF READS 12.47) SEARCH FOR STRING 12.48) COPY AND PASTE 12.49) FINDING VARIANTS/MISASSEMBLED READS/HIGHLY DISCREPANT LOCATIONS 12.56) EXTENDING THE CONSENSUS 12.57) HIGH AND LOW DEPTH OF COVERAGE REGIONS 12.63) ASSEMBLY VIEW 12.64) READ DEPTH 12.65) FORWARD/REVERSE PAIR DEPTH 12.67) INCONSISTENT FORWARD/REVERSE PAIRS 12.75) SEQUENCE MATCHES 12.81) RUNNING CROSS_MATCH FOR SEQUENCE MATCHES 12.82) PULLING OUT READS AND RE-ASSEMBLYING THEM (MINIASSEMBLIES) 12.86) MINIASSEMBLIES 12.90) HIGHLIGHTING READS TO REMOVE THEM FROM A CONTIG 12.101) CONTIG ARRANGEMENT--REORDER CONTIGS 12.104) CONTIG ORIENTATION 12.105) NAVIGATING 12.109) CUSTOM NAVIGATION 12.111) TEAR CONTIG 12.113) JOIN CONTIGS 12.114) COMPARE CONTIGS WINDOW AND INVERTED REPEATS 12.116) REMOVING READS 12.118) TAGS 12.122) CREATING LONG TAGS 12.125) CONSENSUS TAGS 12.127) WHAT THE COLORS MEAN 12.128) SEARCH FOR READ NAME 12.132) ONLINE DOCUMENTATION 12.133) THE .WRK LOG FILE 12.134) FINDING, DISPLAYING, AND MAKING POTENTIAL JOINS 12.145) USING CONSED -MAKEJOINS TO MAKE JOINS IN BATCH 12.147) PROTEIN TRANSLATION 12.148) OPEN READING FRAMES 12.154) DISPLAYING TRACKS WITH SCORES (BED FILES) 12.155) FIXING CONTIG-ENDS 12.164) FIXING THE CONSENSUS IN BATCH 12.170) HANDLING DUPLICATE READ NAMES 13. BAM2ACE: MAKING A CONSED-READY DATASET OUT OF A BAM FILE 13.8) MAKING AN ACE FILE OUT OF AN ENTIRE BAM FILE 14. SANGER READS 14.2) TRACES AND EDITING READS 14.7) SCROLLING TRACES AND ALIGNED READS TOGETHER 14.8) SHOW ALL TRACES 14.9) PRIMER-PICKING 14.14) CHECKING WHETHER A PARTICULAR OLIGO WOULD MAKE AN ACCEPTABLE PRIMER 14.15) PICKING PCR PRIMER PAIRS 14.16) ORDERING OF PRIMERS 14.19) ADD NEW READS (SANGER--NOT ILLUMINA OR 454) 14.20) ADDING NEW READS IN BATCH (SANGER) 14.25) ADDING NEW SANGER READS IN BATCH TO TARGETED REGIONS 14.32) CONSED-POLYPHRED INTERACTION TO REVIEW POLYMORPHIC SITES 15. 454 READS 15.1) USING 454 READS (NEWBLER ASSEMBLY) 15.12) USING 454'S NEWBLER ON YOUR OWN DATA 16. USING AUTOPCRAMPLIFY 17. USING AUTOPRIMERS 18. USING AUTOREPORT 18.1) VARIANTS REPORT 19. FEATURES FOR SNP ANALYSIS 19.3) FINDING SNPS IN BATCH 19.4) TAGGING A REFERENCE SEQUENCE 20. BIONANO DIGEST GENOME MAPS 21. LESS USED CONSED FEATURES 21.1) CHANGING THE CONSENSUS IN BATCH ACCORDING TO A SCRIPT 21.2) EXPORTING SCAFFOLDS [ EXPORT SCAFFOLDS ] 21.3) ADDING PAIRED ILLUMINA READS USING CROSS_MATCH 21.12) ADDING UNPAIRED ILLUMINA READS USING CROSS_MATCH 21.24) ALIGNING ILLUMINA READS USING CROSS_MATCH AGAINST A LARGE 21.30) ALIGNING YOUR OWN ILLUMINA DATA USING CROSS_MATCH TO A 21.37) USING 454 READS (ALIGNING WITH CROSS_MATCH TO REFERENCE SEQUENCE ) 21.42) ADDING ADDITIONAL 454 OR ILLUMINA READS USING CROSS_MATCH 21.43) MULTIPLE HIGH QUALITY DISCREPANCIES VS SEARCH FOR HIGHLY 21.44) BACKING OUT EDITS AFTER YOU HAVE SAVED THE ASSEMBLY 21.45) SELECTIVELY BACKING OUT EDITS AND REMOVING READS 21.46) REMOVING READS FROM A PHRAP ASSEMBLY 21.47) ADDING READS WITHOUT CHROMATOGRAM FILES 21.48) ALIGNING READS TO A BACKBONE 21.49) COMPARING READS TO A REFERENCE SEQUENCE 21.50) TAGGING ALL READS AT ONCE 21.51) EDITING ALL READS AT ONCE 21.52) FASTER CONSED STARTUP FOR SANGER READS 21.53) VIEWING THE CHROMATOGRAM OF SINGLETS OR NON-ASSEMBLED READS 21.54) HIDING SOME TYPES OF TAGS 21.55) CUSTOM CONTIG NAMES 21.56) ERROR RATE 21.57) RESTRICTION DIGEST 21.58) RESTRICTION DIGEST AND ASSEMBLY VIEW 21.59) MULTIPLE TRACE POPUP 21.60) MAXIMUM NUMBER OF TRACES DISPLAYED 21.61) SCALING THE TRACES 21.62) HOTKEYS FOR EDITING 21.63) SCROLLING TRACES INDEPENDENTLY 21.64) MEASURING ERROR RATE AND SINGLE SUBCLONE BASES FOR A REGION 21.65) PREVENTING 2 USERS FROM MAKING CONFLICTING EDITS 21.66) PRINTING CONSED WINDOWS 21.67) COLOR MEANS EDITED AND TAGS 21.68) COLOR MEANS MATCH 21.69) AUTOEDIT 22. CONSED CUSTOMIZATION 22.1) CUSTOMIZING NAVIGATE BY SINGLE STRANDED REGIONS AND NAVIGATE BY SINGLE 22.3) COLOR BLINDNESS 23. CREATING CUSTOM TAG TYPES 24. EXPANDING CONSED'S CAPABILITIES WITH A LITTLE PROGRAMMING 24.1) BRINGING UP CONSED FROM A SCRIPT 24.2) CONTROL OF CONSED FROM SOME OTHER PROGRAM 24.5) REMOVING READS IN BATCH 24.6) COMPLEMENTING CONTIGS IN BATCH 24.7) HOW TO WRITE A CUSTOM NAVIGATION FILE 24.12) COMPRESSING CHROMATOGRAMS 24.13) READING CHROMATOGRAMS OUT OF AN EXTERNAL DATABASE 24.14) COMPRESSING ACE FILES AND PHD BALLS 24.18) NO PHD FILES 24.19) ADDING TAGS FROM OTHER PROGRAMS 24.20) CHROMOSOME POSITIONS/USER-DEFINED CONSENSUS POSITIONS 24.21) DEFINING KEYS (HOTKEYS) TO CALL EXTERNAL PROGRAMS AND/OR APPLY TAGS AND/OR 24.22) READ PREFIXES 24.23) USING FILES CREATED ON WINDOWS OR WINDOWS NT. 24.24) CREATING YOUR OWN ACE FILES (INSTEAD OF ACE FILES CREATED BY 24.25) CONSED OPTIONS 25. MONITORS AND MICE FOR CONSED 26. ACE FILE FORMAT 27. SAMPLE PHD BALL FORMAT 28. TIMESTAMP MISMATCH 29. CONSED REFERENCES 30. RUNNING PHRED and PHRAP 30.7) COMMON PROBLEMS RUNNING PHREDPHRAP 30.9) WHY ARE ALL THE READS NOT IN THE ASSEMBLY? 30.10) ARE THERE READS THAT ARE TOTALLY UNALIGNED? 30.11) CORRECTING FALSE JOINS MADE BY PHRAP 30.12) USING PHRAP ON NEXT-GEN READS 31. WHAT IS AUTOFINISH? 32. USING AUTOFINISH 32.3) AUTOFINISH: MINIMUM NUMBER OF ERRORS FIXED PER READ 32.4) EDIT PARAMETERS: HOW TO CHANGE CONSED/AUTOFINISH PARAMETERS 32.6) DIVERSION: UNIX LESSON 32.7) AUTOFINISH: CHANGING MELTING TEMPERATURES 32.8) AUTOFINISH: JUST CLOSING GAPS 32.9) AUTOFINISH: JUST CLOSING GAPS JUST USING WALKS 32.10) AUTOFINISH: NOT REPEATING FAILED EXPERIMENTS 32.12) AUTOFINISH: NOT USING PARTICULAR SUBCLONE TEMPLATES 32.13) AUTOFINISH: NOT USING ENTIRE LIBRARIES FOR FINISHING 32.14) MULTIPLE LIBRARIES WITH DIFFERENT INSERT SIZES 32.15) AUTOFINISH CLOSING GAPS WITH MINILIBRARIES 32.17) CLOSING GAPS USING PCR 32.19) AUTOFINISH: TOO MANY UNIVERSAL PRIMER READS 32.20) AUTOFINISH FOR CDNA ASSEMBLIES 32.21) AUTOFINISH FOR LISTING GAP-SPANNING TEMPLATES 32.22) FINISHING A SPECIFIC CONTIG 32.23) MARKING THE END OF THE CLONE 33. CONTRIBUTED SOFTWARE 34. CONSED CUSTOMIZABLE CONSEDRC RESOURCES 35. ACKNOWLEDGEMENTS END BIG TABLE OF CONTENTS ---------------------------------------------------------------------------- 1. WHAT IS NEW IN CONSED 29.0 This is a maintenance release. It fixes several bugs including: * Bamscape: if you zoom in too far * Bam2Ace: if you had zillions of regions * Consed's batch addNewReads gave a segmentation fault ---------------------------------------------------------------------------- 2. UPGRADING FROM CONSED 27.0 to 29.0 Since there is an installation script (see INSTALLING CONSED below), it is easiest to delete (or hide) your current distribution and follow the consed installation instructions (below). If you have customized some of consed's perl scripts, you might want to save some of them. Be aware, however, that the following perl scripts have changed: addReads2Consed.perl bam2Ace.perl convertBedToBamScape.perl fasta2Ace.perl picard2Regions.perl autoPrimers.perl In addition, the executables and some of the sample datasets have changed. You will get the new ones by using the installation script. ---------------------------------------------------------------------------- 3. UPGRADING FROM CONSED 28.0 to 29.0 The following script is new: autoPrimers.perl The executables, of course, have changed. ---------------------------------------------------------------------------- 4. REQUIRED VERSIONS OF OTHER PROGRAMS Several functions of consed will not work without phrap and cross_match (which is part of the phrap package). These functions include: miniassemblies, finding sequence similarity in Assembly View, and Adding New Reads. These are pretty important features, but most consed features will work without phrap and crossmatch. To use all of consed's features, you MUST have the following versions of programs in order to use this version of Consed. If you are using previous versions of these programs, please upgrade to the following versions. (Note that the versions below are dates. For example, 1.080714 means year 2008, month 07 (July), and 14th day of the month--ignore the leading "1.". Thus 1.080714 is later than 1.080630. Similarly, 0.990622.e means June 22, 1999 and ignore the leading "0".) REQUIRED VERSION OF PHRAP TO WORK WITH CONSED 1.080721 or later for phrap and cross_match (This is a more recent version than the one you normally get. See below for how to get them. cross_match comes with phrap.) You will need to specially request the most recent version of phrap--not the one that you get with a normal request. To request the most recent version of phrap and cross_match (cross_match comes with phrap), send an email to phg (at) u.washington.edu, with a Subject line that says "phrap new version request", and an email body that consists of the following two lines (it should be in exactly this format, to be computer readable): Request: phrap ver 1.080721 or later Registered phrap email address: [[insert address here]] The address should be the one you supplied previously when obtaining phrap; the new version will be sent to it. If you have not previously registered for phrap, or your registered address is no longer valid, you will need to include a license agreement (with all questions answered) in the email. This system is not completely automated so you may need to wait several days. If you are using Sanger reads, you will need phred. 0.000925.c or later for phred. Type phred -V to check. Contact bge (at) u.washington.edu if your phred is earlier than this. If you are using polyphred, you must have polyphred 3.5 or more recent. USING AN OLDER VERSION OF POLYPHRED WILL CAUSE SEVERE PROBLEMS WITH CONSED WHICH WILL APPEAR AS PROBLEMS WITH FILETYPES. ---------------------------------------------------------------------------- 5. INSTALLING CONSED To install Consed, you must have some basic Unix system administration skills. For example, you must be able to run X applications such as xterm, you must know what PATH is for and how to add something to it, you must be able to edit a file using a Unix editor (such as emacs, vi, or pico), you must be able to move around in the filesystem from the command line, and you must know how to build/compile a program. If you do not know how to do these (such as if you are a mac user that has minimal experience on the command line), find someone who does to help you and make sure they finish the job, including completing the tests below. 5.1) Using Firefox, Safari, Chrome, Internet Explorer, or some other browser on the computer of which you used for step 4, open url: http://bozeman.mbt.washington.edu/consed/consed.html#howToGet Click on the appropriate type of computer. Your browser (e.g., firefox) will ask you what you want to name the file. Just use the default. If you are denied access, *carefully* follow the instructions on the "Don't have a cow, man--You are not authorized to get this document" page, including the try-to-get-consed part. Please do not email David Gordon until after you have followed these instructions. 5.2) If you have downloaded from a Windows computer, transfer the downloaded file to a Linux/Macosx/Unix computer before doing anything further. 5.3) Unpack the downloaded in its own subdirectory: mkdir consed cd consed gunzip -c (whatever) | tar -xvf - The "whatever" depends on, in your browser when you saved the file, what name you gave the file. It also depends on where you are and where the file is. But it will usually be one of the following: gunzip -c ../consed_linux.tar.gz | tar -xvf - gunzip -c ../consed_solaris.tar.gz | tar -xvf - gunzip -c ../consed_mac.tar.gz | tar -xvf - 5.4) Figure out the correct Consed executable file to use. IF YOU ARE USING LINUX: Type the following executables in order (below). Use the first one that does not give an error but simply says "Version 29.0". ./consed_rhel6linux64bit -v ./consed_rhel4linux64bit -v ./consed_rhel4linux64bit_static -v ./consed_linux32bit_dyn -v ./consed_linux32bit -v If it says something like "Exec format error. Wrong Architecture.", try another executable. If it says something like "error while loading shared libraries: libXp.so.6: cannot open shared object file: No such file or directory", try another executable. IF YOU ARE USING MACOSX: Try to use: ./consed_mac_intel -v If you get any error or you have any problem with the subsequent instructions (below), read NOTE TO MACOSX users (below). In any event read MICE FOR MAC (below). 5.5) Decide where to put the Consed distribution. I suggest you put Consed, phred, cross_match (which comes with phrap), phrap, the perl scripts, and other executables into /usr/local/genome/bin. So create /usr/local/genome (without the bin). If you can't actually use /usr/local/genome, then you could make /usr/local/genome be a link to the real location--that will work just as well. As a third choice, if you want to have another location xxx, then put: export CONSED_HOME=xxx into .bash_profile or .bashrc if you are using bash (which is the case for macosx users) or if you are using csh or tcsh, put into .login setenv CONSED_HOME xxx In this case phred, cross_match, phrap, the perl scripts and other executables must go into $CONSED_HOME/bin 5.6) At this point you should have decided: -which consed executable to use -where to put all of the consed programs and related files From the directory where you ran "gunzip -c ... | tar -xvf -", type: pwd just so you are sure where you are, and then type: ./installConsed.perl (consed executable) (where consed programs) For example: ./installConsed.perl consed_mac_intel /usr/local/genome If this script gives an error and you can't figure it out, look at the script itself and see what is giving the error. Then follow the scripts instructions about what to delete and then run the script again. 5.7) Make sure that the location of the consed executable is in every Consed users' PATH. This location might be /usr/local/genome/bin (or $CONSED_HOME/bin). For bash users (which includes macosx), you do this by putting something like this in .bash_profile export PATH=/usr/local/genome/bin:$PATH For csh or tcsh users, you do this by putting something like this in .cshrc set path=(/usr/local/genome/bin $path) and typing "rehash" 5.8) Check this by logging on as a user and typing: rehash (don't worry if the rehash command says "not found") consed -V You should see 'Version 29.0'. If you see an error message like this: consed: Command not found. you have some debugging to do. 5.9) Put phrap and cross_match into the same bin directory where you put consed by cd'ing to where you built phrap and typing: cp phrap /usr/local/genome/bin cp cross_match /usr/local/genome/bin (or if you didn't use /usr/local/genome, replace it with what you did use). Check that the correct version of cross_match is installed by typing: cross_match You should see: > cross_match cross_match cross_match cross_match version 1.080721 cross_match version 1.080721 Reading parameters ... 1.008 Mbytes allocated -- total 1.008 Mbytes Run date:time 081205:135315 Run date:time 081205:135315 FATAL ERROR: Sequence files must be specified on command line. See documentation. FATAL ERROR: Sequence files must be specified on command line. See documentation. where 1.080721 is a date in the form YYMMDD. It must be this date or more recent. Otherwise, follow the instructions above for getting cross_match (which is part of the phrap package). 5.10) Put phred into the same bin directory where you put consed by cd'ing to where you built phred and typing: cp phred /usr/local/genome/bin 5.11) Put phredpar.dat (which comes with phred) into /usr/local/genome/lib (or $CONSED_HOME/lib): cp phredpar.dat /usr/local/genome/lib (or if you didn't use /usr/local/genome, replace it with what you did use). 5.12) PRELIMINARY TESTING OF CONSED BEFORE COMPLETING THE REST OF THE INSTALLATION From the location where you put the example directories, type the following: cd standard/edit_dir 5.13) start Consed by typing consed If you get some error such as: Error: Can't open display: then the problem probably has nothing to do with Consed, but rather with X. To test this, run some other X application (such as xclock, xterm, xeyes, or xcalc) and see if you get the same error. (On some versions of MACOSX, you must start X11 and then consed in an xterm--see NOTE TO MACOSX USERS below.) The problem may be due to your X emulator. See 'MONITORS AND MICE FOR CONSED' below. Don't worry about a message like: Warning: Cannot convert string "helvetica" to type FontStruct Two windows will appear. One of these will have the list of .ace files and say 'select assembly file to open' and 'standard.fasta.screen.ace.1'. Double click on "standard.fasta.screen.ace.1". The first window will go away. You will now see a list of one contig and a list of reads. This is the 'Consed Main Window'. Double click on 'Contig1'. The 'Aligned Reads Window' will appear. Then follow the "COPY AND PASTE" instructions (elsewhere in this document) and check that that works. (This will not work on some versions of macosx. It should work on linux--if it doesn't, read the NOTE TO LINUX USERS below.) If this all works, consider this preliminary test successful. 5.14) Here is a summary of the files in usr/local/genome/lib/screenLibs (or $CONSED_HOME/lib/screenLibs): filter454Reads.fa is the puc19 vector used to produce 454 reads. 454 reads containing puc19 vector are eliminated. primerCloneScreen.seq is used to screen candidate primers when you use Consed's function "Pick Primer from Clone Template" (on the Aligned Reads Window). primerSubcloneScreen.seq is used to screen candidate primers when you use Consed's function "Pick Primer from Subclone Template" (on the Aligned Reads Window). repeats.fasta is used to tag repeats (to put a blue line under the bases) vector.seq is used to mask the parts of reads that are from vector rather than insert sffLinkers.fa contains the linkers for 454 reads that separate the 2 reads of a read pair. 5.15) Take a look at files primerCloneScreen.seq, primerSubcloneScreen.seq, repeats.fasta, and vector.seq: They are dummy files indicating the fasta format of the sequences that should be put in them. You can modify them to suit your needs. 5.16) You should put into primerCloneScreen.seq the vector sequence of the cloning vectors you are using (BAC or fosmid) and into primerSubcloneScreen.seq the sequencing vectors you are using (plasmid). Don't be too generous in putting lots of vectors into the files! The larger they are, the slower primer picking will be. Our files are only this big: -rw-r--r-- 1 root root 29938 Nov 7 1997 primerCloneScreen.seq -rw-r--r-- 1 root root 7381 Aug 13 1997 primerSubcloneScreen.seq and primer picking is quite fast enough. TESTING PRIMER PICKING 5.17) Follow the steps above under PRELIMINARY TESTING OF CONSED BEFORE COMPLETING THE REST OF THE INSTALLATION to bring up the Aligned Reads Window on Contig1. Go to some location near the right end of the contig, say base 2470. Click with the right mouse button on the consensus and click on either one of the top strand primer choices (either from subclone template or from clone template). Consed will pause a moment, and then there will appear a selection of primers that pass all of Consed's requirements. (If you get an error message, Consed might not have been correctly installed. See INSTALLING CONSED above.) Templates are also chosen for each primer. You may have to scroll the primer list to the right to see the templates. Consed lists these templates in order of quality--all of them will cover the read you want to make. 5.18) You should put into the file /usr/local/genome/lib/screenLibs/vector.seq (or $CONSED_HOME/lib/screenLibs/vector.seq if you are not using /usr/local/genome for the root of the Phred/Phrap/Consed files.) the vector sequences (in FASTA format) that you want to mask out before running phrap. In general, it is the combination of primerCloneScreen.seq and primerSubcloneScreen.seq. I've given you a dummy file, but you should replace it with your real vector. 5.19) You should put into the file /usr/local/genome/lib/screenLibs/repeats.fasta (or $CONSED_HOME/lib/screenLibs/repeats.fasta if you are not using /usr/local/genome for the root of the Phred/Phrap/Consed files.) any sequences (in FASTA format) that you want to have automatically tagged (visibly marked by a blue line in Consed). These typically are ALU sequences. If you don't want to tag anything, then comment out (put '#' as the first character of the line) the following lines in phredPhrap: To not tag anything, change: !system( "$tagRepeats $szAceFileToBeProduced" ) || die "some problem running $tagRepeats"; to: #!system( "$tagRepeats $szAceFileToBeProduced" ) # || die "some problem running $tagRepeats"; 5.20) If you are going to do any restriction digests, you should create a file /usr/local/genome/lib/screenLibs/singleVectorForRestrictionDigest.fasta containing the cloning vector sequence. This is used for doing in-silico restriction digests. Thus this cloning vector must start at precisely the site where you cut the (circular) vector to ligate the insert. It is not sufficient to just download the vector sequence from Genbank because they may start the sequence at a different site. To get you started for doing the demonstration, I've provided such a file that will work for the example datasets, but will not work for your own data. 5.21) ENOUGH MEMORY FOR CONSED Enough memory is vital with large datasets. Even if you have enough physical memory, the operating system may not allow a single process to use it all. In csh or tcsh type: limit You should see something like this: cputime unlimited filesize unlimited datasize 2097148 kbytes stacksize 8192 kbytes coredumpsize 0 kbytes vmemoryuse unlimited descriptors 64 Type: limit datasize unlimited Then type: limit just to see that the number has changed. On bash, type: ulimit -d unlimited 5.22) Make sure you have enough swap space to support the amount of RAM on the computer. 5.23) TESTING THE INSTALLATION After installing Consed, you should run all the following tests to make sure you have installed everything correctly: If one of the tests (below) fails with a message like: "couldn't execute ..." then you can troubleshoot the problem by going to the directory where this error occurred and type the command that failed. If the command includes any output redirection (e.g, 2>/dev/null or >>temp or >temp), remove everything that occurs on the line after the 2> or > so that all output comes to your screen. 5.24) TESTING ADDING ILLUMINA READS Follow the 8 steps under "ADDING PAIRED ILLUMINA READS" (below) Troubleshooting: If you get an error like this: couldn't execute time /home/genome/BioSw/consed18/bin/cross_match reads081205_130653.fa.0 bacref.fa -discrep_lists -tags -masklevel 0 -minscore 25 -gap1_only -repeat_screen 2 >>alignmentFile.081205_130653.cross.0 2>/dev/null then run it on the command line without "time" and without the ">>" and "2>" so you can see any errors: /home/genome/BioSw/consed18/bin/cross_match reads081205_130653.fa.0 bacref.fa -discrep_lists -tags -masklevel 0 -minscore 25 -gap1_only -repeat_screen 2 If this says: FATAL ERROR: Command line option -gap1_only not recognized that indicates that you are not running the correct version of cross_match (see above under REQUIRED VERSIONS OF OTHER PROGRAMS). Another possible cause of problems is that cross_match is not in the right place (see above under INSTALLING CONSED) or that you have not set CONSED_HOME (if you need to do this--see above under INSTALLING CONSED). 5.25) TESTING ADDING 454 READS Follow the 4 steps under "USING 454 READS (ALIGNING TO REFERENCE SEQUENCE )" (below) 5.26) TESTING 454 READS (NEWBLER ASSEMBLY) Follow the first 6 steps under "USING 454 READS (NEWBLER ASSEMBLY)" and especially be sure that the traces pop up. 5.27) TESTING ADD NEW READS 5.28) Next you should test the ADD NEW READS step in the Quick Tour (below). This step requires that everything be set up correctly and in the correct location. Hopefully the error messages are clear enough to help you if you have set up anything incorrectly. 5.29) TESTING RUNNING CROSS_MATCH FROM ASSEMBLY VIEW See RUNNING CROSS_MATCH FOR SEQUENCE MATCHES (below) and make sure that step works. 5.30) TEST RUNNING PHREDPHRAP See the section RUNNING PHRED and PHRAP (below) 5.31) TESTING MINIASSEMBLIES See PULLING OUT READS AND RE-ASSEMBLYING THEM (MINIASSEMBLIES) and MINIASSEMBLIES (below) and make sure those steps work. The newer version of phredPhrap is required for this. If you have invested a lot of work customizing some ancient version of phredPhrap (e.g., 10 years old), and don't want to upgrade, you do have the option of keeping your customized version of phredPhrap for regular assemblies, and using the new version of phredPhrap for miniassemblies. To do this, you must specify the alternate name/location of phredPhrap by the consedrc parameter: consed.fullPathnameOfMiniassemblyScript: /usr/local/genome/bin/phredPhrap (See CONSED CUSTOMIZATION below.) If you can't even use the new version just for miniassemblies, then there is a consedrc parameter: consed.okToUseObsoleteMiniassemblyScript: true This parameter will allow you to use obsolete versions of phredPhrap which have bugs such as duplicating consensus tags. If you like bugs, this is how you can keep them. ------ NOTE: You might be done installing consed -------- The following 4 installation steps are only necessary if you are using autofinish or consed's primer picker *and* if you are using Sanger reads. Otherwise, you can skip: MODIFYING determineReadTypes.perl TROUBLESHOOTING YOUR CHANGES TO determineReadTypes.perl FAKE READS APPENDING EXPID TO THE PHD FILES 5.32) MODIFYING determineReadTypes.perl Read the comments in determineReadTypes.perl Phrap, Consed's primer picking, and Consed/Autofinish all need the following information for each read: is it a univeral primer forward, a universal primer reverse, or a walking read? what is its template name? If you are using different libraries that have different insert sizes, then Consed/Autofinish also need the library name for each read. Generally this information can be determined from the read name, using *your* naming convention. Modify the perl script determineReadTypes.perl to put this information at the end of the phd file using WR info items. If you don't want to do much perl programming and all your libraries have the same insert size, you have the option of using the St Louis naming convention. In this case, you needn't do anything with determineReadTypes.perl You must also uncomment (remove the "#"s in column 1) the lines in the phredPhrap script that say roughly: #print "\n\n--------------------------------------------------------\n"; #print "Now running determineReadTypes.perl...\n"; #print "--------------------------------------------------------\n\n\n"; #!system( "$determineReadTypes" ) || die "some problem running determineReadTypes.perl $!\n"; But what is the St Louis naming convention? Most of it (but not all) is explaned in the file phrap.doc that comes with phrap. In addition, you must never use an underscore in the name if the read is a universal primer forward or universal primer reverse read. If the read is a walk, then you must have an underscore (_) follow the template name and then have a number (the oligo number). Examples of reads in the St Louis naming convention: read eeq03a01.g1 is univ rev template: eeq03a01 library: eeq03 read eeq03a02.b1 is univ fwd template: eeq03a02 library: eeq03 read eeq03a02.g1 is univ rev template: eeq03a02 library: eeq03 read eeq03a03.b1 is univ fwd template: eeq03a03 library: eeq03 read eej45h07_2.i1 is walk template: eej45h07 library: eej45 read eej46c12_1.i1 is walk template: eej46c12 library: eej46 Once you have correctly customized determineReadTypes.perl, then uncomment the line in phredPhrap which calls determineReadTypes.perl It is fine to assume the St Louis naming convention for the purpose of the sample dataset directories that come with Consed ("standard", "assembly_view", "autofinish", and "polyphred"). 5.33) TROUBLESHOOTING YOUR CHANGES TO determineReadTypes.perl Consed allows you to check that you have correctly modified determineReadTypes.perl: On the Consed Main Window, point to 'Info', hold down the left mouse button, and release on 'Show Info for Each Read'. Study all the information and check that the information presented is correct. If, for example, Consed thinks that there are templates that have 9 or more reads, it is likely that you have not correctly customized determineReadTypes.perl You will see a section that looks like this: template djs736a2_fp04q286 with 2 reads djs736a2_fp04q286.x2 term universal forward (from phd file) djs736a2_fp04q286.y2 term universal reverse (from phd file) You want to see the "from phd file" part. If, instead of "from phd file", it says "inferred from name", that means that determineReadTypes.perl couldn't figure out what kind of read it was. If you think you have made a mistake in customizing determineReadTypes.perl, it is best to delete the PHD files (and phd.ball if you are using that) and run phredPhrap again since the otherwise incorrect WR items will be left in the PHD files. There is more specific documentation within the script determineReadTypes.perl for more information about how to customize it. CUSTOMIZING determineReadTypes.perl: SPECIAL CASES 5.34) FAKE READS By "fake reads" I mean reads such as those created from a Genbank reference sequence or a consensus from some other assembly... or others for which there is no chromatogram (and there never was any chromatogram). If you don't use any such reads, you can skip this step. In the past, any read that ended with a .a2 or .c3 (where 2 and 3 could be any numbers), was considered a fake read. Now you can make Autofinish not assume this using the consedrc parameter (see CONSED CUSTOMIZATION): consed.fakeReadsSpecifiedByFilenameExtension: false Instead, you must have determineReadTypes.perl put "fake" into the "type:" field of a "template" WR item. See determineReadTypes.perl for more information. 5.35) APPENDING EXPID TO THE PHD FILES If you are not using Autofinish, you can skip this step. If you are using Autofinish, and would like Autofinish to tell you how well your reads are succeeding, then the phd files must be appended with the experiment id's. In the 3 Autofinish summary files (*.univReverse, *.univForwards, and *.customPrimers), you will see information like this: univ rev,,,->,-329,-249,71,Contig1,3,djs228_1034 or this: tgaagaaatggctgactcc,56,1,->,3258,3338,3658,Contig1,4,djs228_2813,5,djs228_168,6,djs228_1248 The '3' just before the djs228_1034 on the line starting with "univ rev" is an experiment id. There is also an expid '4' just before djs228_2813, an expid '5' before djs228_168, and an expid '6' just before djs228_1248. Autofinish doesn't know what you will end up calling these reads it is telling you to make. Autofinish only knows those reads by the numbers 3, 4, 5, and 6. So when you make the reads, Autofinish needs to be informed that this is 'experiment 3' or whatever. You do this by appending in the phd file the following structure: WR{ expid addExpid 990811:140818 5 } where WR stands for 'whole read item', expid for 'expid' addExpid is the name of the program that you will write that will append this information 990811:140818 is the date and time in format YYMMDD:HHMISS 5 is the expid This program must be run *after* phred runs to create the phd files. Thus your program must have some method of determining what the expid of each read is. What the University of Washington Genome Center does is to have the finishers put the expid as part of the filename. This makes it easy for a program to look at the phd file and figure out what the expid is and then write the WR item into that phd file. Alternatively, you could keep a database and, after the phd file is created, look into the database to see what the expid is. When you have successfully added expid's to the phd files, the next time you run Autofinish on this project, see the 'EVALUATE' section of the Autofinish output file--you will see lots of interesting information about how well the reads succeeded. ---------------------------------------------------------------------------- 6. NOTE TO LINUX USERS (64 BIT) (INSTALLATION) Do you know for a fact that your computer is a 64 bit computer? If it is, use one of the 64 bit binaries because the 64 bit version will allow you to use consed on larger assemblies. You can determine what kind of computer you have by typing: uname -a If it says something like this: Linux lake.interim.stanford.edu 2.6.9-78.0.1.ELsmp #1 SMP Tue Jul 22 18:11:48 EDT 2008 i686 i686 i386 GNU/Linux where there is an "i686" or "i386", then you have 32 bit linux. In this case skip down to "NOTE TO LINUX USERS (32 bit)" (below). If it says something like this: Linux lake.interim.stanford.edu 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux where there is an "x86_64" present, then you have 64 bit linux and you are reading in the right place. If it says something like this: Linux lake.interim.stanford.edu 2.4.21-sgi240rp04041413_10065 #1 SMP Wed Apr 14 13:09:51 PDT 2004 ia64 unknown where there is an "ia64" present, then you have itanium linux, which we do not support. I've supplied three executables: consed_rhel6linux64bit consed_rhel4linux64bit consed_rhel4linux64bit_static The ones with rhel4 were built on a RedHat release 4 system and the one with rhel6 was built on a RedHat release 6 system. The one with "static" in the name doesn't use shared libraries; the two others are dynamically linked. Try the first one first. If Consed doesn't come up at all, then try the second one. The kind of problems you might have would cause consed to immediately terminate, so if consed comes up at all (you can see the Consed Main Window), that particular executable is fine for you. (See QUICK TOUR OF CONSED for how to start Consed--you must be in the correct directory.) (August 2013) One user on Fedora 17 had the following font problem: the base letters in the Aligned Reads Window were different widths so didn't line up vertically. He solved this problem by downloading the following font kit: xorg-x11-fonts-misc which includes the font that consed wants to use which is called: -misc-fixed-medium-r-normal--15-140-75-75-c-90-iso8859-1 which is a common fixed-point font (every letter is the same width). 6.1) (April 2012) One user of Ubuntu 11.04 had the following problem: ./consed_linux64bit: error while loading shared libraries: libstdc++.so.5: cannot open shared object file: No such file or directory The problem was fixed by issuing the following command: sudo apt-get install libstdc++5 -or- as root: apt-get install libstdc++5 6.2) If you can't copy and paste (see COPY AND PASTE elsewhere in this document: if you highlight a segment of the consensus sequence, you should be able to paste it into the search window), try the dynamically linked executable consed_linux64bit. 6.3) (Reported July 2012) If you get warnings like this: Warning: String to TranslationTable conversion encountered errors Warning: translation table syntax error: Unknown keysym name: osfActivate Warning: ... found while parsing ':osfActivate: PrimitiveParentActivate()' it can be fixed by: export XKEYSYMDB=/usr/share/X11/XKeysymDB or export XKEYSYMDB=/usr/lib/X11/XKeysymDB (try each to see which works) If you are using csh or tcsh instead of bash, use setenv XKEYSYMDB /usr/share/X11/XKeysymDB instead of export XKEYSYMDB=/usr/share/X11/XKeysymDB 6.4) Another user reported the following problem (Aug 2008): "now we are unable to copy/paste into Consed from text editors such as emacs or vim. However, copying/pasting within Consed works just fine." He then found the following fixed the problem: "Initially, as I was following the installation instructions and couldn't verify the version number with a 'consed -v' command with the 'consed_linux64bit' executable (it complained about a missing library, libstdc++.so.5) I switched to the 'consed_linux64bit_static' executable and it returned the version number properly. After finishing the installation and attempting to work with our assembly data we hit some strange errors. On a hunch and following Joel Martin's advice not to use the _static executable we installed the compat-libstdc++-296 and compat-libstdc++-33 libraries on our fedora 8 64-bit system and reverted to the non-static executable. (These were the only two in the Legacy Software directory of our Fedora 8 repository.)" 6.5) For users running on Ubuntu Linux (Aug 2010): When Ubuntu upgraded from release 9 to release 10, they introduced a serious bug into the X-Window server software. This affects all motif-based graphical applications, including Consed. When you right mouse click in consed's Aligned Reads Window, the mouse cursor is captured within part of the Aligned Reads Window. You cannot move the pointer to the "quit" button to terminate consed, and you cannot give input focus to any other window or application on the computer. To fix your Ubuntu system so that consed will run, do the following: Open a terminal window by pressing Alt+F2, write gnome-terminal and press run Type the following in the terminal window: sudo add-apt-repository ppa:crcarlin/ppa sudo apt-get update sudo apt-get update (same as the previous line) sudo apt-get upgrade When it asks you if you want to upgrade your xserver-xorg-core package and (possibly) your xserver-xorg-dev package and other packages that start with a x, allow it to do so. Then reboot your computer. Go again to the terminal (Alt + F2 and write gnome-terminal and press run) and type the following in a terminal window: sudo rm /etc/apt/sources.list.d/crcarlin-ppa-lucid.list sudo apt-get update sudo apt-get update (same as the previous line) sudo apt-get upgrade According to one Consed user who tried this, this will fix the problem. 6.6) One user (July 2010) reported: On Ubuntu 10.04, (Lucid) Static Consed Version 19.0 (090206) gave the following message: consed: relocation error: /lib/libnss_files.so.2: symbol __rawmemchr, version GLIBC_2.2.5 not defined in file libc.so.6 with link time reference Shared/Dynamic linking Consed gave: /pkg/consed/consed_linux64bit: error while loading shared libraries: libstdc++.so.5: cannot open shared object file: No such file or directory I solved the problem by downloading and directly installing an old library using dpkg : http://packages.ubuntu.com/jaunty/amd64/libstdc++5/download 6.7) Another user reported that consed_linux64bit could not find libXp.so.6 He solved this problem by downloading xorg-x11-deprecated-libs-6.8.2-1.EL.52.x86_64.rpm from http://rpm.pbone.net/index.php3/stat/4/idpl/8965447/com/xorg-x11-deprecated-libs-6.8.2-1.EL.52.x86_64.rpm.html or from http://rpm.pbone.net/index.php3/stat/3/srodzaj/1/search/libXp.so.6()(64bit) and installed the rpm package, then added "/usr/X11R6/lib64/" to "/etc/ld.so.conf" then ran the command "ldconfig" 6.8) Several users reported that consed_linux64bit gave: error while loading shared libraries: libXp.so.6: cannot open shared object file: No such file or directory and he solved the problem with the command: yum install libXp -------------------------------------------------------------------------- 7. NOTE TO LINUX USERS (32 bit) (INSTALLATION) Do you know for a fact that your computer is not a 64 bit computer? If it is, use one of the 64 bit binaries because the 64 bit version will allow you to use consed on larger assemblies. You can determine what kind of computer you have by typing: uname -a If it says something like this: Linux lake.interim.stanford.edu 2.6.9-78.0.1.ELsmp #1 SMP Tue Jul 22 18:11:48 EDT 2008 i686 i686 i386 GNU/Linux where there is an "i686" or "i386", then you have 32 bit linux and you are reading in the right place. If it says something like this: Linux lake.interim.stanford.edu 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 x86_64 x86_64 GNU/Linux where there is an "x86_64" present, then you have 64 bit linux and you should skip up to "NOTE TO LINUX USERS (64 BIT)" (above). If it says something like this: Linux lake.interim.stanford.edu 2.4.21-sgi240rp04041413_10065 #1 SMP Wed Apr 14 13:09:51 PDT 2004 ia64 unknown where there is an "ia64" present, then you have itanium linux, which we do not support. We have found that there is a large variation among different linux systems (even those with the same kernel) so I have provided 2 different executables (consed_linux32bit and consed_linux32bit_dyn) in the hope that one will work for you. With one of them, consed may not come up at all but rather terminate with an error such as the following: > ./consed ./consed: error while loading shared libraries: libstdc++-libc6.2-2.so.3: cannot open shared object file: No such file or directory or > ./consed: symbol regexec, version GLIBC_2.3.4 not defined in file libc.so.6 with link time reference (See below for suggestions from Consed/linux users with similar experiences.) 7.1) If Consed does come up, do the following test: Bring up Consed with the standard dataset as shown in QUICK TOUR OF CONSED (above) and open standard.fasta.screen.ace.1 as shown in the QUICK TOUR. After Consed is up, on the Consed Main Window there is a menu "Help" on the top right. Push the left mouse button down on Help menu. There will be a list of choices that will appear. While still holding down the left mouse button, drag the cursor to "Test Exception Handling" and release the left mouse button. If a popup window appears with a "Dismiss" button, you are fine (but you should still read the rest of this note). If Consed terminates, then this Consed executable does not work with the exception handling shared libraries you have installed. Try a different consed executable or find different shared libraries, as discussed below. 7.2) For users running on Ubuntu Linux (Aug 2010): When Ubuntu upgraded from release 9 to release 10, they introduced a serious bug into the X-Window server software. This affects all motif-based graphical applications, including Consed. When you right mouse click in consed's Aligned Reads Window, the mouse cursor is captured within part of the Aligned Reads Window. You cannot move the pointer to the "quit" button to terminate consed, and you cannot give input focus to any other window or application on the computer. To fix your Ubuntu system so that consed will run, do the following: Open a terminal window by pressing Alt+F2, write gnome-terminal and press run Type the following in the terminal window: sudo add-apt-repository ppa:crcarlin/ppa sudo apt-get update sudo apt-get update (same as the previous line) sudo apt-get upgrade When it asks you if you want to upgrade your xserver-xorg-core package and (possibly) your xserver-xorg-dev package and other packages that start with a x, allow it to do so. Then reboot your computer. Go again to the terminal (Alt + F2 and write gnome-terminal and press run) and type the following in a terminal window: sudo rm /etc/apt/sources.list.d/crcarlin-ppa-lucid.list sudo apt-get update sudo apt-get update (same as the previous line) sudo apt-get upgrade According to one Consed user who tried this, this will fix the problem. 7.3) If you try to run consed and get an error message like this: > ./consed ./consed: error while loading shared libraries: libstdc++-libc6.2-2.so.3: cannot open shared object file: No such file or directory This is because there must be a file in /usr/lib libstdc++.so.6 I have provided this file in case you don't have it. Just put it in /usr/lib and see if that fixes the problem. One consed user reports: did a little poking around and found that i needed: compat-libstdc++-7.3-2.96.118 RPM for i386 since i'm running fedora core 1 at the moment. ... Anyway, if anyone gets this error tell them they're missing the Standard C++ libraries for Red Hat 7.3 backwards compatibility compiler and it can be downloaded here: http://www2.linuxforum.net/RPM/fedora/core/1/Fedora/RPMS/compat-libstdc++-7.3-2.96.118.i386.html Another system administrator said: "I got the error: ../../consed_linux: error while loading shared libraries: libstdc++-libc6.2-2.so.3: cannot open shared object file: No such file or directory In order to resolve the issue on linux boxes you need to install the compatibility libraries. "To be on the safe side I installed the following compat-libstdc++-33.i386 3.2.3-61 gcc-c++.i386 4.1.2-14.el5 gcc.i386 4.1.2-14.el5 cpp.i386 4.1.2-14.el5 libstdc++-devel.i386 4.1.2-14.el5 libgomp.i386 4.1.2-14.el5 libstdc++.i386 4.1.2-14.el5 libgcc.i386 4.1.2-14.el5 "I believe you may get away with just the first compat-libstdc package." 7.4) If you get warnings like this: Warning: String to TranslationTable conversion encountered errors Warning: translation table syntax error: Unknown keysym name: osfActivate Warning: ... found while parsing ':osfActivate: PrimitiveParentActivate()' it can be fixed by: export XKEYSYMDB=/usr/share/X11/XKeysymDB or export XKEYSYMDB=/usr/lib/X11/XKeysymDB (try each to see which works) If you are using csh or tcsh instead of bash, use setenv XKEYSYMDB /usr/share/X11/XKeysymDB instead of export XKEYSYMDB=/usr/share/X11/XKeysymDB 7.5) If you can't cut and paste (e.g., if you highlight a segment of the consensus sequence, you should be able to paste it into the search window. It gets highlighted, but nothing gets pasted), fix it by: using the dynamic executable: consed_linux32bit ---------------------------------------------------------------------------- 8. NOTE TO MACOSX USERS (INSTALLATION) Be aware that only macosx 10.6 and better are fully supported. Older versions will work, but will not have access to bamScape and bam2Ace. 8.1) Downloading Consed. One user (in 2013) reported a problem using Safari to download consed: "Be aware that Safari automatically decompresses the original file causing some problems when transferring the file to another computer. You can disable the automatic decompressing option by unchecking "Open Safe Files After Downloading" in Safari Preferences/General." 8.2) To create /usr/local/genome, create a terminal window and type: cd /usr/local sudo mkdir genome sudo chmod 777 genome (The last command says that anyone can read and write to the genome directory. If you don't want to allow this much access, read about the chmod command and adjust it according to your wishes.) (You can also use Finder to do it but it is tricky since Finder normally will refuse to even show you /usr which is a hidden file. To get it to show you hidden files, google "showing hidden folders macosx". Then create a folder "genome" within /usr/local and return to a terminal window for the rest of the installation.) 8.3) Determine which of the 2 consed executables to use: consed_mac_intel consed_mac_ppc If consed_mac_intel works, use it. If you are running macosx 10.10, you probably will get the following error: consed dyld: Library not loaded: /usr/X11/lib/libX11.6.dylib You can solve this problem by opening a terminal window and typing: cd /usr sudo ln -s /opt/X11 X11 (Note that the X's above are capitalized.) That should do it--consed should then start without error. If it gives an error such as: consed_mac_intel: Bad CPU type in executable. or dyld: Library not loaded: /usr/X11/lib/libX11.6.dylib then you can try consed_mac_ppc. However consed_mac_ppc doesn't have bamScape or bam2Ace. (If there is overwhelming demand, I will change this.) You probably need to have macosx 10.6 or better to run consed_mac_intel (10.5.8 won't work). 8.4) Now follow the normal installation instructions ./installConsed.perl (consed executable) (where consed programs) (see above under INSTALLING CONSED). 8.5) You must put /usr/local/genome/bin (or wherever you put consed and the scripts) into your path. On MacOSX, this is by a file in your home directory .bash_profile. You would add this line: export PATH=/usr/local/genome/bin:$PATH When you log off and log back on, your new path will include consed. 8.6) X-WINDOWS on MacOSX can have problems. To test this, type (in a terminal window): xterm A xterm terminal window should appear. If not, here are some suggestions from various people, some of which may work and some may not: If you are using MacOSX 10.6 or better, things should just work. If you are using MacOSX 10.5, there seems to be at least 2 problems with X11: one is that cut/paste does not function within X11. I've also heard that the $DISPLAY variable is not set automatically. Here is what one user suggests: The best workaround is to remove X11 altogether, and replace it with the previous version that was part of Mac OS 10.4. Here is how: Remove the X11 installation of Mac Os 10.5 sudo rm -rf /usr/X11 /usr/X11R6 sudo rm /System/Library/LaunchAgents/org.x.X11.plist (or rm /System/Library/LaunchAgents/org.x.startx.plist) sudo rm /Library/Receipts/X11User.pkg sudo pkgutil --forget com.apple.pkg.X11DocumentationLeo sudo pkgutil --forget com.apple.pkg.X11User sudo pkgutil --forget com.apple.pkg.X11SDKLeo sudo pkgutil --forget org.x.X11.pkg Install the X11 installation of Mac Os 10.4 (found on 10.4 installation CD: System/Installation/Packages/X11User.pkg) Install the newest xquartz (X11-2.3.2.1.dmg) found at http://xquartz.macosforge.org/trac/wiki You can start X11 from the dock and run consed just as usual. For other versions of MacOSX One person suggests: http://sage.ucsc.edu/~wgscott/xtal/wiki/index.php/X11 http://sage.ucsc.edu/~wgscott/xtal/wiki/index.php/X11_more_details Another says that for 10.5, an X-environment comes installed by default (XQuartz). Information about XQuartz (and the newest versions) can be found at: http://xquartz.macosforge.org Another says that for older versions of macosx (10.4 and earlier): You must have an X environment on your MAC and you might need to turn it on. If you don't know how to do this, find someone locally who can help you. If you don't have an X environment already on your MAC, download from Apple at www.apple.com/software I suggest you use XDarwin in full screen mode. Use option-apple-A to move back and forth between the MAC desktop and the X environment. Another counters that XDarwin is not so friendly and instead suggests running the X11 version found at: http://www.apple.com/downloads/macosx/apple/x11formacosx.html or else OroborOSX (http://oroborosx.sourceforge.net/), and a new (non-beta) version is available (http://oroborosx.sourceforge.net/download.html). Some people say that XDarwin is no longer supported. 8.7) Please edit the phredPhrap script to reflect the correct location of nice (there is a note in the phredPhrap script about this). 8.8) MICE FOR MAC If you have a 1-button mouse, I've found that: apple-click = right button click option-click = middle button click (With X11 up, you may need to go into X11 Preferences, Input and enable 3-button mouse emulation.) 8.9) C COMPILER FOR MAC You will need to have a c compiler to compile some programs. If you can't find one on your computer, one is part of the Xcode package which is both part of a CD that came with your mac and it is also available for download from Apple. You will need to get a free membership to Apple's ADC program to download it. -------------------------------------------------------------------------- 9. NOTE TO SOLARIS USERS (INSTALLATION) 9.1) Do not use /usr/ucb/cc !!! How can you tell if you are using it? Type: which cc If it says /usr/ucb/cc, you must get gcc or else buy the commercial cc from Sun (which is /opt/SUNWspro/bin/cc). If you use /usr/ucb/cc, strange things will happen, including phd2fasta not working correctly by cutting off the first 2 characters of read names. ---------------------------------------------------------------------------- 10. QUICK TOUR Consed is a program for viewing and editing assemblies. Below are: QUICK TOUR OF BAMSCAPE and QUICK TOUR OF CONSED If you are already an advanced Consed user, you should read through this and do any of the exercises on features that you are unfamiliar with. I frequently run across people who are doing something in Consed a hard way month after month, and request a new feature to make things easier, when that new feature is already in Consed. If you have never used Consed before, to follow this Quick Tour will take you less than 6 hours. I've heard of many people who do not have 6 hours to spare so they skip the Quick Tour and then they struggle for 2 days instead. When you do the quick tour, I encourage you to be free about changing the data set. If you really mess things up (such as changing all a read's bases to N's), no problem--just delete the data set and start again with a fresh copy. ---------------------------------------------------------------------------- 11. QUICK TOUR OF BAMSCAPE Note: Currently this feature is available for linux (32 and 64 bit) and macosx-intel (but not ppc). It is not available for solaris. 11.1) USING BAMSCAPE Consed -bamscape can view (similar to IGV) BAM files. It can be brought up like this (don't do this yet): consed -bamscape -bamFile myBamFile.bam -referenceFOF bamScapeReference.fof where myBamFile.bam is a BAM file that must be sorted and must have an associated .bai index file. bamScapeReference.fof looks like this: /codon/gordon/genomes/human_genome_hg18/chr1.fa /codon/gordon/genomes/human_genome_hg18/chr2.fa where these are the pathnames of the fasta files of the reference sequences. (There can be multiple reference sequences in the same file.) -bamFile myBamFile.bam 11.2) Type the following: cd bamScape (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 11.3) Make sure there is no file called rewriteReference.fof by typing: ls -l rewriteReference.fof If there is such a file, delete it: rm rewriteReference.fof 11.4) start bamscape like this: consed -bamscape -bamFile reads.sorted.bam -referenceFOF bamScapeReference.fof You should see a window popup labeled "Bam View" with the single reference sequence "23". 11.5) Double click on the "23". Up should pop a window entitled "Reads vs Reference". This will show the range of 1-200,001 bases of reference sequence 23. You will notice there are 2 panels. The top panel shows read depth--total read depth in blue. You will notice that there are only reads in the region from about 50kb to 175kb. Look at the bottom panel. It shows read discrepancies. If, at a position, all of the reads disagree with the reference base, the graph will show 100%. (Since there are generally more than 1 reference position at the same pixel location on the screen, the graph will be the maximum % discrepancy rate of all bases represented by the pixel.) There will usually be some fraction of reads that are inconsistently paired, but if there is a high fraction, then something is wrong (perhaps the reads are mismapped). You will notice that the brown looks like it might be as high as the blue near position 100,000. [SEARCH FOR PROBLEMS/VARIANTS] (needed by Help button) 11.6) Finding Problem Regions Point to "Navigate", hold down the left mouse button and release on "Search for next problem". The "Search for Problems/Variants" window will pop up. You will notice that "problems" can be defined in 3 ways: too high (or too low) depth of coverage too many reads with inconsistent mates (an "improper pair" in SAM terminology) too many reads discrepant with the reference You will also notice that "too many" is defined both in % and in absolute number. There is the ability to exclude junk reads by the filter "ignore reads whose average base quality is less than _______". 11.7) There is a box to the left of "Read depth < 5 reads". Click this box so it is checked. 11.8) Click the button labeled "Find first problem in reference sequence". A window labeled "Problems" will pop up. There will be one line saying: 23 1-45,741 read depth 0 too low 11.9) Move this window aside so you can see the Reads vs Reference window again. You will see a similar message at the bottom of the Reads vs Reference window. A turquoise cursor will be blinking at position 45,741 which is the right end of the region with a flat red line--no reads. 11.10) Now click the button "Find first problem after cursor". You will see another region with too low read depth. 11.11) Click "Find first problem after cursor" several more times. You can see that most of the problems are too low depth of coverage. 11.12) So that you can see other problems, uncheck the box to the left of "Read depth < 5 reads". 11.13) Click "Find first problem in reference sequence". At the bottom of the Reads vs Reference Window and added to the Problems Window will be a line: 7 reads with inconsistent mates out of 10 or 70 % (at left end of region: 53,484-53,489) 11.14) Click on "Find first problem after cursor". The problem found will be: 100 % of reads are discrepant (5 discrepant out of 5 all above or at quality 30) There are 2 discrepant positions in a window of size 25 (at left end of region: 95,785-95,785) What does this "window size" ... mean? In real datasets, many of the discrepant locations will be SNPs, rather than misaligned reads. To distinguish between polymorphic sites and mismapped reads, we have added an additional filter: isolated discrepant locations are ignored. Rather, the discrepant locations must cluster. In the "Search for Problems/Variants" window is a line saying ">= [2] discrepant sites in a window of size [25] bp" If you wanted to report every discrepant position, including SNPs, you would change the "2" to "1" and then every discrepant location would be reported. 11.15) Continue to click on "Find first problem after cursor" and watch the messages at the bottom of Reads vs Reference Window. When you have clicked about 24 times, you will finally get a message saying "no further problems". You can play around with the options on the Search for Problems Window. Let's look more closely at a particular problem area. 11.16) Point at the 100,000 and hold down the right mouse button, and then release on "zoom in". 11.17) Zoom in a few more times until you see about 89,000 on the left of the window and 109,000 on the right of the window. If you zoom in too far, then zoom out again. Let's look at this region in detail with consed: 11.18) Point at about 95,000 and hold down the *middle* mouse button. While continuing to hold down the middle mouse button, move the pointer to around 105,000 and release the mouse button. While you were moving, the region should have been highlighted in grey. A window should pop up labeled "Swiped Region". In this window is a list of clusters of mates of reads in the region you just swiped. These mate pairs are inconsistent as indicated in the bam file, so it is really up to whatever created the bam file (bwa?) to decide what "inconsistent" means, but it typically means too far away and/or in the wrong orientation. You will notice that there are about 22 mate reads on reference sequence GL000224.1, 11 on GL000214.1, and about 10 on reference sequence 4, and smaller numbers on many other reference sequences. 11.19) Scroll down this list until you find a line showing reference sequence 23 from 93,125-103,161 (the number 103,161 might be slightly different). Double click on that line. Another Reads vs Reference Window will pop up showing just the region starting at 93,125 and ending somewhere around 103,000----the region where the inconsistent mates are clustered. 11.20) Dismiss this latest Reads vs Reference Window. 11.21) Go back to the "Swiped Region" Window and click on "Add New Region to List". Under "List of region for consed:", you should now see "23 95,039 to 104,928" (approximately). 11.22) Dismiss this "Swiped Region" window. 11.23) In the first Reads vs Reference Window (the one that shows about from 83,893-113,892), push the ">>>" button several times until you see 150,000-170,000 (approximately). 11.24) Swipe the region from 155,000 to 165,000 again by pointing to 155,000, holding down the middle mouse button, moving the pointer to 165,000 while holding the middle mouse button down, and then releasing the middle mouse button at 165,000. Another box labeled "Swiped Region" should pop up and you should still see, under "List of regions for consed:", the line "23 95,039 to 104,928" (approximately). 11.25) Click the button "Add New Region to List and Start Consed". In about 2 seconds a window will appear labelled "Consed Main Window". In its "Contig List" are 2 contigs. These 2 contigs are the 2 regions you swiped. 11.26) Double click on the first contig. 11.27) In the goto"Pos:" box in the upper right-hand corner, type 100,000 and type "Enter". There should now be a blinking red cursor on the consensus base at position 100,000. These bases should be: ttctctccagcc 11.28) Point to the T at position 100,000 and click with the left mouse button. 11.29) Type A. 11.30) A box labelled "Are you sure?" will popup saying "There is no read that has base a. Are you sure? (y/n)". Check the box that says "do not ask this question again" and click "yes". The base at position 100,000 should now be changed to an A. Now left-click on the base at position 100,001 and similarly change it to an A. Do this all positions from 100,000 to 100,010. 11.31) Exit consed by pointing to the "File" menu and releasing on "Quit consed". 11.32) When a menu comes up titled "consed", click on "Save before Quitting." 11.33) A box labeled "Save assembly to file" will appear. Click "OK". 11.34) Another box will popup saying: "Do you want to append this ace file to the list of ace files that will be applied with consed -rewriteReference?" Click "Yes". All of the consed windows will disappear leaving bamScape still running. 11.35) Exit bamScape by clicking "Quit" in the Reads vs Reference Window. 11.36) MODIFYING THE REFERENCE SEQUENCE Now there should be a file rewriteReference.fof See this by typing: ls -l rewriteReference.fof Take a look at it: > more rewriteReference.fof EDITED /wd1/gordon/sunny/bamScape/consed1/edit_dir/130222.155105.ace.1 where it will be something different--the absolute pathname of the ace file you just saved. 11.37) Run the following command: consed -rewriteReference -referenceFOF bamScapeReference.fof -newReferenceFOF newReference.fof -aceFileFOF rewriteReference.fof You will see output similar to this: -rewriteReference will be run. no consedrc file so no project-specific resources--that's ok couldn't open readOrder.txt--that's ok 23 changed closing new fasta, 23.fa.new done 11.38) You will see a new file in your directory, 23.fa.new Type "ls" to see it. 11.39) Compare 23.fa (the original reference sequence) with 23.fa.new (the modified reference sequence) by typing: 11.40) cross_match 23.fa.new 23.fa -minmatch 100 -alignments >cross.out (This assumes that you have installed cross_match which is part of the phrap package available from phg@u.washington.edu using the same license you used to get consed.) and then examine cross.out: 11.41) emacs cross.out Scroll down to line 100,000 and you will see the sequences aligned at this position with a difference between an A and a T with a "v" between them (v for "transversion"). 23 100000 ATCTCTCCAGCCTTCCCCGGATTTCTGCCACAGTCAGCCCCAGGCACCCA 100049 v 23 100000 TTCTCTCCAGCCTTCCCCGGATTTCTGCCACAGTCAGCCCCAGGCACCCA 100049 11.42) FINDING PROBLEMS IN BATCH 11.43) Type: consed -findBamProblems -bamFile reads.sorted.bam -referenceFOF bamScapeReference.fof -nav myProblems.nav This will find problems and put them into the myProblems.nav file in Picard IntervalList format. (See http://picard-tools.sourcearchive.com/documentation/1.25-1/IntervalList_8java-source.html ) Picard IntervalList format: * A SAM style header must be present in the file which lists the sequence records * against which the intervals are described. After the header the file then contains * records one per line in text format with the following values tab-separated: * - Sequence name * - Start position (1-based) * - End position (1-based, end inclusive) * - Strand (either + or -) * - Interval name (an, ideally unique, name for the interval) * - Anything else Here is a little example: @HD VN:1.0 SO:coordinate @SQ SN:chr1 LN:247249719 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:9ebc6df9496613f373e73396d5b3b6b6 SP:Homo sapiens @SQ SN:chr2 LN:242951149 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:b12c7373e3882120332983be99aeb18d SP:Homo sapiens . . . @SQ SN:chrX_random LN:1719168 UR:/seq/references/Homo_sapiens_assembly18/v0/Homo_sapiens_assembly18.fasta M5:f4d71e0758986c15e5455bf3e14e5d6f SP:Homo sapiens chr1 1104841 1104940 + target_1 . . . chr1 1110198 1110401 + target_9 Examine this file by typing: less myProblems.nav Then bring up bamScape with all of these locations already loaded into an interactive list: consed -bamScape -bamFile reads.sorted.bam -referenceFOF bamScapeReference.fof -nav myProblems.nav Double-click on any item in the list to go to that location in the BVAligned Reads Window. Alternatively, you can click "next" repeatedly to examine these locations. We found these problems using the default parameters. The following consedrc parameters allow you to set different parameters, just as you did in the Search for Problems Window (above): consed.BVFindProblemsTooHighDepthOfCoverage: true consed.BVFindProblemsTooLowDepthOfCoverage: false consed.BVFindProblemsDepthOfCoverageAboveThisNumber: 100 consed.BVFindProblemsDepthOfCoverageBelowThisNumber: 5 consed.BVFindProblemsInconsistentReads: true consed.BVFindProblemsPerCentInconsistentReadsAboveThisNumber: 20.0 consed.BVFindProblemsNumberOfInconsistentReadsAboveThisNumber: 6 consed.BVFindProblemsDiscrepancyRate: true consed.BVFindProblemsDiscrepancyRateIsAboveThisNumber: 30 consed.BVFindProblemsDiscrepancyNumberOfReadsIsAboveThisNumber: 4 consed.BVFindProblemsDiscrepancyIgnoreSoftTrimmed: false consed.BVFindProblemsNumberOfDiscrepantSitesInAWindow: 2 consed.BVFindProblemsWindowSizeForDiscrepancies: 25 (see CONSED CUSTOMIZATION below). Note: at present -findBamProblems requires a single BAM file. It cannot handle multiple BAM files. 11.44) Users can also write their own custom navigation files and load them into bamScape. On the main BamScape window, point to the "navigation" menu, hold down the left mouse button, and release on "Custom Navigation." You must supply a file in Picard IntervalList format, as described above. If you prefer to write a file in BED format, you can convert to Picard IntervalList format with: convertBedToBamScape.perl CHM1_1.1_assemblyerrors.bed CHM1_1.1_assemblyerrors2.nav all_sequences.fa conversion.txt where CHM1_1.1_assemblyerrors.bed is the input file in BED format, CHM1_1.1_assemblyerrors2.nav is the output file in Picard IntervalList format, all_sequences.fa is the fasta file of the reference sequences and conversion.txt looks like this: > more conversion.txt 1 gi|512322365|gb|CM001609.2| 2 gi|512322360|gb|CM001610.2| 3 gi|512322358|gb|CM001611.2| 4 gi|512322356|gb|CM001612.2| 5 gi|512322354|gb|CM001613.2| 6 gi|512322352|gb|CM001614.2| 7 gi|512322347|gb|CM001615.2| 8 gi|512322345|gb|CM001616.2| 9 gi|512322343|gb|CM001617.2| 10 gi|512322340|gb|CM001618.2| 11 gi|512322338|gb|CM001619.2| 12 gi|512322331|gb|CM001620.2| 13 gi|512322329|gb|CM001621.2| 14 gi|512322327|gb|CM001622.2| 15 gi|512322324|gb|CM001623.2| 16 gi|512322321|gb|CM001624.2| 17 gi|512322319|gb|CM001625.2| 18 gi|512322314|gb|CM001626.2| 19 gi|512322311|gb|CM001627.2| 20 gi|512322309|gb|CM001628.2| 21 gi|512322307|gb|CM001629.2| 22 gi|512322299|gb|CM001630.2| X gi|512322297|gb|CM001631.2| M gi|512322367|gb|CM001971.1| where the first column is the name in the Bed file for the reference sequence and the 2nd column is the name of the reference sequence according to the bam file. 11.45) If you want to examine these locations in consed, you can swipe to bring up consed, as described above. If you have many locations and want to examine them all quickly, you can make a single ace file with all of these locations by running: picard2Regions.perl This will create a regions file. You then must run bam2Ace.perl (see below) ---------------------------------------------------------------------------- 12. QUICK TOUR OF CONSED (THE GRAPHICAL EDITOR) 12.1) GETTING YOUR OWN COPY OF A SAMPLE DATASET 12.2) First get a copy of a sample dataset into your home directory. Thus you can make as many edits as you like since you will always be able delete it and get a fresh copy. Make a copy as follows: 12.3) cd ${CONSED_HOME}/examples 12.4) ls You should see something like this: 454_newbler autoPCRAmplify_answer polyphred align454reads bamScape selectRegions align454reads_answer gene_track selectRegionsAnswer assembly_view gene_track_answer solexa_example autofinish illumina_paired solexa_example_answer autoPCRAmplify illumina_paired_answer standard We want "illumina_paired_answer" right now (later we will want the others). 12.5) Copy it to your home directory: cp -r illumina_paired_answer ~ and go home: cd ~ 12.6) Type the following: cd illumina_paired_answer/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 12.7) start Consed by typing: consed (If it says "consed: Command not found.", go back to INSTALLING CONSED and start over.) (Don't worry about a message like: Warning: Cannot convert string "helvetica" to type FontStruct ) Two windows will appear. One of these will have a list of .ace files and say 'Click on an ace file and then click Open'. Double click on "ref.ace.1". The first window will go away. (If it asks the question "There is an edit history file...Do you want to apply those edits?" click "No" ) You will now see a list of one contig and a list of reads. This is the 'Consed Main Window'. Double click on 'c_elegans_piece' in the Contig List. The 'Aligned Reads Window' will appear. 12.8) SCROLLING Try scrolling back and forth. You might prefer to enlarge the Aligned Reads Window so you can see more reads at once. Do this pointing to the lower right corner, pushing down mouse button 1, and dragging it to enlarge the window, then releasing mouse button 1. Try scrolling by dragging the thumb of the scrollbar. Also try scrolling by clicking on the 4 buttons at the bottom of the window: << < > >> for scrolling by small amounts. For scrolling by tiny amounts, click on the arrows at either end of the scrollbar. For scrolling by huge amounts, use the middle mouse button and just click on some location on the black area of the scrollbar. For scrolling to the beginning or end of the contig, use the <<< or >>> buttons. Try clicking ">>>". In typical de novo assemblies, there are reads that protrude far beyond the beginning of the contig and reads that protrude beyond the end of the contig. Moving the scrollbar to the extreme right will scroll the contig to the end of the rightmost read--typically far to the right of the end of the contig/consensus. The <<< and >>> buttons will not go this far--they will just go to the leftmost and rightmost positions of the consensus. 12.9) Familiarize yourself with the features on the Aligned Reads Window. You should be able to figure out which are the read names, which are the read bases, and what the yellow arrows are, etc. (Answer: the yellow arrows are strand.) 12.10) There is a column of C's (and infrequently I's and U's) just to the right of the yellow arrows. Point the mouse at one of the C's and you will see the word "consistent..." in the big status box on the bottom of the Aligned Reads Window (which is between the "cursor" button and the "dismiss" button). "Consistent" means that the read is part of a mate pair that has pretty normal spacing compared to other mate pairs in the same library. Consistent also means that the reads, if they are in the same contig, are in one of these orientations: -> <- or <- -> but not -> -> or <- <- The entire message in the status might not be visible. You can enlarge the window to make it visible. A message like this: consistent, opp strands, ins size:276,lib default max 331 means that the reads of a pair are on opposite strands (as above), they are 276 bp apart, they come from a library called "default" (the one that all reads are put into that don't explictedly say which library they are from), and that the library has a maximum insert size of 331 bp. 12.11) Point with the mouse cursor to one of the reads with a "C" (for consistent), hold down the right mouse button, and release on "jump to mate." Notice that it jumps from the /2 to /1 and vice versa. 12.12) VERTICAL SCROLLING Scroll to about base 400 (see the yellow scale at the top of the Aligned Reads black window). Scroll the vertical scrollbar (on the right) up and down. (If you have made the window so large that you can't scroll, make the window smaller and then try scrolling.) The scrollbar turns a green tint to remind you that you are scrolled down and there are reads above the top. You can scroll faster by clicking on the black space above or below the thumb of the vertical scrollbar. You can scroll by tiny amounts by clicking on the little grey arrows at the top and bottom of the vertical scrollbar. Try each of these. 12.13) GOTO POSITION In the Aligned Reads Window, click in the 'Pos:' box in the upper right-hand corner. Type in a number, such as 750, and push the 'Return' or 'Enter' key. The Aligned Reads Window will scroll to position 750. We find this feature is particularly useful when one person wants another person to look at something in the sequence. (Little used feature: if you type in a number preceded by a "*" such as "*750", the cursor will be moved to *padded* position 750 (counting pads) which is unpadded position 749. These numbers are different because there is a pad after position 371--see for yourself.) 12.14) COLORS Notice the colors. Scroll to position 377 and notice the read 'A' about 8 reads down from the top read (all of the others have a C in that column, as does the reference). The red bases are the ones that disagree with the consensus. Now scroll to position 580. Notice the different shades of grey background (around the bases). They refer to the quality (error probability) of the base. Quality values mean the following: A quality value of 10 means 1 error in ten to the 1.0 power A quality value of 20 means 1 error in ten to the 2.0 power A quality value of 30 means 1 error in ten to the 3.0 power A quality value of 40 means 1 error in ten to the 4.0 power and for quality values in between: A quality value of 25 means 1 error in ten to the 2.5 power Get the idea? (These have actually been empirically verified for Sanger reads--if you are interested in the gory details, read the phred papers: Ewing B, Hillier L, Wendl M, Green P: Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8, 175-185 (1998). Ewing B, Green P: Basecalling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8, 186-194 (1998). In that same copy of the journal is a paper about Consed, as well.) Also notice the upper and lowercase. This is just a cruder indication of the quality of the bases: uppercase is higher quality and lowercase is lower quality. 12.15) To see the quality value of a particular base, point at it and click with the left mouse button. You will see the quality displayed in the Info Box at the bottom of the Aligned Reads Window. What is the highest quality base you can find? What is the lowest quality base you can find? (Answers are below.) These quality values are shown in grey scales: Quality 0 through 4 is given by dark grey Quality 5 through 9 is given by a shade lighter Quality 10 through 14 is given by a shade still lighter . . . Quality of 40 through 97 is given by white (the brightest shade) A quality value of 99 is reserved for bases that have been edited and the user is absolutely sure of the base ('high quality edited'). A quality value of 98 is reserved for bases that have been edited and the user is not sure of the base ('low quality edit'). Go to position 380. Notice that the ends of some of the reads have a completely black background and the letters themselves are grey (rather than black or red). These are the unaligned ends of reads, as determined by the assembler/aligner (phrap, bwa, newbler, etc.) See, for example, the dim GCGCGCGCGCGCGCGCGCC on the right end of C02D1ACXX:7:2202:15482:17608#GAACTATA/1 (about the 3rd read from the top). 12.16) HIGHLIGHTING READ NAMES In the Aligned Reads Window, click on a read name with the left mouse button. The name will turn magenta. Click again and it will turn yellow again. Try turning it magenta and then scrolling. This feature is helpful in keeping track of a particular read as you scroll. If you have an emacs window open (or any editor window), you can paste the read name in by first clicking on it in consed to turn it magenta and then clicking with the middle mouse button in the editor window. To highlight a bunch of reads at once, use the same shift-click method as in Windows: point and left click to highlight a read. Then point to an unhighlighted read several more reads down, hold down the shift key, and left mouse click. All of the intervening reads should be highlighted. 12.17) You can also make a file of the all of the reads you have highlighted by doing the following: Push the left mouse button down on the the 'Highlight' menu and release on 'Save Highlighted Read Names to File'. Try this. (If you can't find the "Highlight" menu: all menus are near the top of the window. Just under the title "Aligned Reads" are the menus labeled "File", "Navigate", "Info", "Color", "Dim", "Highlight", "Tracks", "Misc", and "Sort".) 12.18) Look at the file you created ('highlighted_reads.txt' unless you changed it)--in UNIX you do this by creating an xterm, cd'ing to the same directory (illumina_paired_answer/edit_dir), and typing: less highlighted_reads.txt (type 'q' to get out of less). 12.19) Turn off highlighting of all reads by pointing to the 'Highlight' menu, holding the left mouse down, and release it on 'Unhighlight All Reads in All Contigs'. 12.20) MOVING ALONG A READ First highlight a read name. Then left-mouse-click on a base on that read. You can move left and right within the read by using the left and right arrow keys on your keyboard. You can also scroll by 10 bp at a time by using the "<" (less than) and ">" (greater than) on the keyboard (not using the mouse). (If this doesn't work, first click on a base and then do it.) Try this: hold down the control key and type 'a'. You will move to the left end of the read. Hold down the control key and type 'e'. You will move to the right end of the read. (Emacs users will recognize these commands.) When you are done playing with this, unhighlight the read name. 12.21) DIMMING AND UNDIMMING ENDS OF READS Scroll so that location 380 is about in the middle of the aligned reads window. Notice that is a dimmed stretch of GCGCGCGCGCGCGCGCGCC that you looked at above. Push the left mouse button down on the 'Dim' menu. There will be a list of choices that will appear. Drag the cursor down to 'Dim Nothing' and release. Now look what happened to the color of the bases. Many now appear red with a grey background. You are seeing the clipped-off bases with all the same information as any other base. In some assemblies (especially those with some contamination, chimeras, vector, etc.) there is a huge amount of red (discrepant) bases, the screen becomes distracting and busy. Thus by default the clipped-off/unaligned bases are made with a black background and a grey foreground so they don't distract you. Look at the choices under the "Dim" menu. Notice there is a distinction here between 'low quality ends of reads' and 'unaligned ends of reads'. Unaligned ends of reads can be low quality as well, or they can be high quality, as in the case of chimeric reads. Change back to "Dim" menu, item "Dim Unaligned" and find the read that has the stretch of unaligned bases mentioned above. Point with the mouse to that read's name (the read names are yellow on the left side of the window and this one is called "C02D1ACXX:7:2202:15482:17608#GAACTATA/1" ) and hold down the right mouse button. You will notice there is a line that says "high quality from 293 to 392; aligned region from 293 to 373; chem: solexa". This is giving the same information in number form. Highlight the read name (see HIGHLIGHTING READ NAMES above) so you don't lose the read as you scroll. Then scroll check that the numbers agree with the dimming. You can play with the dimming options a bit. Then return it to 'Dim Unaligned' for the rest of this tour. 12.22) FIXING A READ AT TOP OF ALIGNED READS WINDOW By now, you will have observed that it is difficult to follow a particular read as you scroll. This is because some reads end and others begin so a particular read needs to move up or down to accommodate these changes. A particular read jumps up as you scroll right and jumps down as you scroll left. But sometimes you want to focus on a particular read as you scroll--for example, you might want to compare other reads to it along its length. Highlighting the read helps, but there is a better way: 12.23) In the Aligned Reads Window point to a single read, hold down the right mouse button and release on 'Fix read ... at the top of this window'. Suddenly a copy of the read will be shown below the 'Search for String' button and above the numbers and scale. Try scrolling left and right. 12.24) Now point to the read fixed at the top, hold down the right mouse button and release on 'unfix read ... at the top of this window'. The read should be removed from the top. 12.25) Try fixing several reads at the top of the window. Then unfix them all. 12.26) Highlight 2 or 3 reads. Point with the mouse pointer to the "Highlight" menu at the top of the window, hold down the left mouse button, and release on "Fix Highlighted Reads to Top of Window." 12.27) You can then point to the "Highlight" menu, hold the left mouse button down, and release on "Unfix Highlighted Reads At Top of Window". 12.28) EDITING THE CONSENSUS You can edit the consensus in the Aligned Reads Window. Click on the 't' in the consensus at position 382. Looking down the column you will see there is a 'C' that is red. Type 'C' on the consensus. Now look down the column. You will see that all of the T's have turned red (since they disagree with the consensus) and the red 'C' is now black (since it agrees with the consensus). You can also edit individual reads, if you like. Look below under TRACES AND EDITING READS. 12.29) SAVING THE ASSEMBLY To save the assembly, pull down the 'File' menu on the Aligned Reads Window, and release on 'Save assembly'. A box will pop up with a suggested name. I suggest you always use the one it suggests. The idea is that the ace files: (project).fasta.screen.ace.1 (project).fasta.screen.ace.2 (project).fasta.screen.ace.3 (project).fasta.screen.ace.4 (project).fasta.screen.ace.5 are in order of how old they are. If you feel you are taking up too much disk space, then start deleting the ace files starting at the oldest. I do not recommend that you overwrite existing ace files. The version numbers just keep growing, and that is not a problem. 12.30) EXPORTING THE CONSENSUS Bring the Aligned Reads Window into view again. Hold down the left mouse button on the 'File' menu and release the button on 'Export consensus sequence'. Notice that the consensus will be stored (in this case) in a file called 'Contig1.fasta'. Click 'OK'. There is now a file in your edit_dir directory called 'Contig1.fasta' that has the consensus sequence in it. If you want to see the file, bring up another Xterm (if you are UNIX literate), and type: cd illumina_paired_answer/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) less c_elegans_piece.fasta (You get out of less by typing "q".) 12.31) Fancier exporting the consensus. Bring the Aligned Reads Window into view again. Hold down the left mouse button on the 'File' menu but this time release on 'Export consensus sequence (with options)...'. Just export a little snip of the consensus, from 370 to 380. (You will notice this contains a pad * character.) Under "Write Both Bases File and Qual File or Just Bases File?" click "Both Files" Click 'OK'. Consed will want to call this file 'c_elegans_piece.fasta' again. You can overwrite the existing file by answering "Yes" to "Save anyway?" Look in your other Xterm at these files: more c_elegans_piece.fasta more c_elegans_piece.fasta.qual The one file contains the bases (but no * pads) and the other contains the corresponding qualities of those bases (in this case the qualities are all 20.) 12.32) Exporting the consensus of all contigs at once: Go to the Main Consed Window (not the Aligned Reads Window). Point to 'File', hold down the left mouse button, and release on 'Write all contigs to fasta file'. You then can choose a filename for all contigs to be written to. (In this project there is only 1 contig, so there is no difference between this option and just exporting a contig at a time.) (Note that there is a way of exporting the contigs as they are oriented in the scaffold. See EXPORTING SCAFFOLDS below.) 12.33) COMPLEMENTING THE CONTIG Push the button 'Compl Cont' in the Aligned Reads Window to complement the contig. This displays the opposite strand of the contig including the consensus and all reads. Push this button again to uncomplement it. 12.34) FIND MAIN WINDOW On the Aligned Reads window, click on 'Find Main Win'. This will cause the Consed Main Window to pop up in the event you have buried it under other windows or iconified it. (This may not work with some settings of your X emulator. In that case you will have to find and click on the Main Window to bring it up.) 12.35) MULTIPLE UNDO EDIT Now that the Consed Main Window is visible, click the 'Undo Edit...' button. There will be a popup indicating the most recent edit. (If it says "no edits so far", click dismiss and then make some edits to the consensus.) Then click on 'Undo Edit...' again.) Click 'undo'. Then you will see the edit that was done before that. Click 'undo'. You can continue undoing if you like. You now know how to undo more than one edit. You cannot choose which edits to undo and which to not undo--edits can only be undone in precisely reverse order from the order you made them. Once you save the assembly, you cannot undo prior edits. 12.36) EXITING CONSED On the Aligned Reads Window, point to 'File' menu, hold down the left button and release on 'Quit Consed'. If it asks you some questions, answer 'Quit Without Saving and Discard .wrk File'. 12.37) CONSED -ACE Try bringing up Consed like this: consed -ace ref.ace.1 (where "consed" is replaced by whatever command brings up consed on your system). This is an alternative to just typing "consed" and then selecting the ace file from within consed. Many users prefer this method instead. 12.38) SORTING OF READS 12.39) Scroll to position 382 and click on the t in the consensus at that location. A green vertical line will appear. Scroll up and down using the scrollbar on the right side of the window. 12.40) SORTING BY BASE Point to the "Sort" menu at the top of the window, hold down the left mouse button, and release on "Sort Options and Help." Click on "by base". Move this window to the side out of the way (or dismiss it, but you will need it again soon). Notice now the read with the red C at position 382 is on top. The reads with T's are sorted by quality, the darker ones below. Look at position 397--all of the reads are A's except for a red G further down. Click on the consensus at position 397. The red G will jump further down as the reads resort by base, all the A's before the G. The A's are sorted by quality, the darkest ones near the bottom. 12.41) SORTING BY MISMATCHES ON TOP In the "How to Sort Reads" window, click on "mismatches on top". The read with a G will jump to the top, above all of the A's. 12.42) SORTING BY QUALITY In the "How to Sort Reads" window, click on "by quality". The read with the G will jump to somewhere in the midst of the reads with A's since now the reads are sorted by quality. (Actually they are sorted by quality in a 9-base window--you can click to see the actual numerical quality values.) 12.43) SORTING BY UNALIGNED + MISMATCHES ON TOP Click on the consensus base at position 382. In the "How to Sort Reads" window, click on the button labeled "unaligned + mis on top". The unaligned read with these bases GCGCGCGCGCGCGCGCGCC should rise to the top, followed by the read with a red "C" at this position, followed by all of the reads with black T's. 12.44) ALPHABETICAL SORTING OF READS If you have multiple reads from each of multiple patients, you might want to sort the reads by patient, so that all reads from the same patient are together. 12.45) Click on "alpha" and also "by method specified above". The reads should be sorted in alphabetical order. Scroll down to the bottom and you will see that the reads are in alphanumeric order. 12.46) When you are done experimenting, turn off sorting by quality: In the How To Sort Reads Window, click on "by method specified above". The reads will now be sorted by top strand first and then bottom strand (look at the arrows in the Aligned Reads Window). It is also possible to sort the reads by a user-provided file, but to do this you must learn CONSED CUSTOMIZATION (below) with resources: consed.showReadsInAlignedReadsWindowOrderedByFile: true consed.showReadsInAlignedReadsWindowOrderedByThisFile: readOrder.txt 12.47) SEARCH FOR STRING Try the 'Search for String' button (left side of the Aligned Reads Window). Type in the string "aaaaa" and click 'ok'. There should be a list of 'hits'. Double click on one of the hits (or single click on it and click on 'go'.) Notice that the Aligned Reads Window scrolls to that position and has the cursor on the found string. (Some of the hits are complemented, so they are ttttt.) Try also clicking on the "Next" and "Prev" buttons at the bottom of the Searching Contigs Window. Try clicking on the "Next" and "Prev" buttons at the bottom of the Aligned Reads Window. Notice they have the same effect as those button the Searching Contig Window but are more convenient. Click "Dismiss" on the "Searching Contigs" Window. Click the "Search for String" button in the Aligned Reads Window again. This time in the Search For String Window select 'Search Just Reads'. Erase the singlets file. Then click 'OK'. You will notice there are many more hits. This is because this shows hits in each read, even if they are at the same consensus position. You can also try the approximate match search for string by clicking on 'Approximate' instead of 'Exact'. The 'Per Cent Mismatch' only applies to the Approximate match search. 12.48) COPY AND PASTE In the Aligned Reads Window, swipe some bases by holding down the left mouse button. You should see the bases turn yellow, at least temporarily. Then click the 'Search for String' button. Click the "Clear" button to clear the "Query string" box. Use the middle mouse button to paste the bases you have just swiped into the 'Query string:' box. Notice that you can swipe bases either from the consensus or from a read. The search for string is case-insensitive so don't worry about the pasting being upper or lowercase. 12.49) FINDING VARIANTS/MISASSEMBLED READS/HIGHLY DISCREPANT LOCATIONS For this exercise, use the dataset called "solexa_example_answer". Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). cd solexa_example_answer/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly. You might need to type cd ../.. first to get out of the dataset you are currently in.) 12.50) Type: ls You should see a file ref.ace.1 Start consed by typing: 12.51) consed -ace ref.ace.1 The Consed main window should appear. 12.52) Point to the 'Navigate' menu, hold down the left mouse button, and release on 'Search for highly discrepant positions'. 12.53) Do not change any of the defaults and just click the 'Search' button. Up will pop a window labelled 'Highly Discrepant Positions' with an empty window. Well, that's no fun--apparently there aren't any real variants in this dataset. Dismiss this window. So that you can see what they look like, repeat the steps above to bring up the Navigate by Highly Discrepant Regions Window again, but this time change "Ignore Bases Below This Quality" from 20 to 12. Click 'Search'. Up will pop the Highly Discrepant Positions Window with a list of the 9 locations below: min # of discrepant reads: 2 min quality: 12, "r": base of reference seq max depth of coverage: 100000 and ignoring reference seq A C G T * pos contig 2 8.0% 23 92.0%r 0 0.0% 0 0.0% 0 0.0% 56 ref 3 9.1% 30 90.9%r 0 0.0% 0 0.0% 0 0.0% 252 ref 2 6.9% 27 93.1%r 0 0.0% 0 0.0% 0 0.0% 256 ref 0 0.0% 0 0.0% 20 90.9%r 2 9.1% 0 0.0% 682 ref 0 0.0% 0 0.0% 31 93.9%r 2 6.1% 0 0.0% 715 ref 2 4.8% 40 95.2%r 0 0.0% 0 0.0% 0 0.0% 742 ref 2 8.7% 21 91.3%r 0 0.0% 0 0.0% 0 0.0% 936 ref 0 0.0% 1 2.4% 1 2.4% 39 95.1%r 0 0.0% 982 ref This means, for example, that at position 56 of contig "ref", there are 2 A's, 23 C's, 0 G's, 0 T's, and 0 *'s (deletions). There are 8.0% A's, 92.0% C's, 0% G's, 0% T's, and the reference sequence contains a C at this position If you see a line like this: 0 0.0% 0 0.0% 0 0.0% 9 75.0%r 3 25.0% 542-543* ref the "542-543*" means that two indel variants are right next to each other at positions 542 and 543 and are probably a single event. 12.54) Click the 'Next' button on this window and watch the Aligned Reads Window. To see the discrepant reads, change the sort. As shown above (see "SORTING OF READS"), point to the "Sort" menu, hold down the left mouse button and release on "Sort Options and Help". In the How to Sort Reads Window, click on "mismatches on top." You can then click "dismiss" in the How to Sort Reads Window. This sort will allow you to see the discrepant bases and then the agreeing bases with the highest quality agreeing bases first. You can continue clicking the 'Next' button either in the Highly Discrepant Positions Window or else at the bottom of the Aligned Reads Window. Do this until you have reached the end of the list. This provides a rapid method of reviewing variants. 12.55) Go back to the Consed Main Window, point to the 'Navigate' menu, hold down the left mouse button, and release on 'Search for highly discrepant positions'. When the window pops up entitled 'Navigate by Highly Discrepant Regions', look at the different options. This time try 'Just list indels'. Are there any indel variants in this data set? Try it and see. Well, actually there is one, but you will need to change the "minimum # of discrepant reads" to 1 to find it. Play around with these parameters a little. There is also the ability to ignore locations in which the consensus is an x or an n (or any bases you wish to ignore). You turn on this option by clicking "True" on the line "Ignore location if consensus base is one of:". Here are 2 more obscure options: 'maximum depth of coverage' Typically you won't use this (it is set to a ridiculously high number). It is there in case you want to avoid regions that you believe are collapsed repeats and thus what appear to be variants are really just differences between different copies of repeats. 'Count only first of multiple reads starting at same location' Some people believe that Illumina reads that start at exactly the same location are really the same read (the same cluster) and the image software made a mistake by making multiple reads out of it. If such a group of reads has a discrepancy in it, they want to count the group as one read with the variant rather than multiple reads with the variant. This feature is also available as a report which can be generated automatically without using consed's graphical interface. You will learn how to use consed's report feature later. 12.56) EXTENDING THE CONSENSUS You can edit or tag a Illumina read in the same way you do with Sanger reads (below). Scroll to the right end of the contig (around position 1000). Push the left mouse button down on the menu item 'Dim' and release on "Dim Nothing". You will see that there are a number of reads that protrude beyond the right end of the consensus and are red, indicating discrepant with the consensus (you may need to scroll down to see them). Suppose you want to extend the consensus based on those reads. Find read HWI-EAS94_4_1_59_547_158 and click on the name to highlight it so you don't have to find it again. (Later you will learn how to quickly find a read by name, but for now just use your eyes.) Middle mouse click on the c base at the right end of this read. You will see a Trace Window pop up. The dashed lines for the traces are to remind you that this is a Illumina read and the traces are completely fictional--this window just gives you the ability to edit and tag the read. Point at the first base of the gccatgtcataac sequence which is all red, and hold down the middle mouse button and swipe to the last base which is a c (all should turn yellow) and then release. A "What to Do with Selection" window should pop up. In this window, click on the "Change Consensus" button. In the Aligned Reads Window, you will notice that the consensus has now been extended to include the additional bases from the read HWI-EAS94_4_1_59_547_158. But there is a problem: click on consensus base "t" at position 1009. Scroll up and down and notice that the highest quality bases are all G (and red--discrepant) while the consensus is a t since the read you used to extend the consensus had a t. Overstrike the t in the consensus with a G. Now scroll up and down and notice the consensus agrees with the highest quality reads. 12.57) HIGH AND LOW DEPTH OF COVERAGE REGIONS Go back to the Consed Main Window, point to the 'Navigate' menu, hold down the left mouse button and release on 'Search for High (or Low) Depth of Coverage Regions'. A Window entitled 'Navigate by High (or Low) Depth of Coverage' should pop up. 12.58) Leave "show high depth (not low depth)" checked. Change the 'min (for high depth regions) or max (for low depth regions) depth of coverage' box from 10 to 50 and click the 'Search' button. A navigate window entitled 'High Depth of Coverage Regions' will pop up with a number of regions with depth of coverage 50 and over. Navigate to a few of them. To find low depth regions, just uncheck the "show high depth (not low depth)" box. If you wanted to find regions with absolutely no read coverage, you could do that by changing 'min (for high depth regions) or max (for low depth regions) depth of coverage' box to 0 and also the 'ignore read bases below this quality' box to 0. Then click 'search.' How many such regions did you find? Now change the "max (for low depth regions) depth of coverage" box to 1. Click "search." How many such regions did you find now? 12.59) You can find the depth of coverage at a specific position as follows: in the Aligned Reads Window, set the cursor on the consensus base at the position you are interested in. Point to the Misc Menu, hold down the left mouse button and release on "depth of coverage at cursor". The depth will be displayed at the bottom of the Aligned Reads Window. 12.60) You can also see an overview of the depth of coverage. On the Consed Main Window, click on Assembly View. The Assembly View Window will pop up. An error box will come up saying "Sequence matches will not be shown in Assembly View...". Dismiss it for now. You will learn much more about the Assembly View Window later, but for now just notice a few features: 12.61) Put the pointer on the grey bar with the numbers inside it. Move the pointer left and right and notice the information displayed near the bottom of the Assembly View Window as you do this. In particular, you will see the depth of coverage and the base position change as you move the pointer. The green graph also indicates depth of coverage. Notice that the green graph has horizontal black lines labeled "0, 50, 100, ... 250". Let's change that... 12.62) Find the button labeled "What to Show" near the bottom of the Assembly View Window. Click on it. A popup menu will appear. Click on "Read Depth/Multiple Discrepancies." The bottom box is labeled "max read depth" and currently has the value 300 in it. Change that to 50 and click "Apply". What happened to the green depth of coverage graph? 12.63) ASSEMBLY VIEW Consed can show you a bird's eye view of the Assembly using forward/reverse pair information, sequence match information, read depth, etc. We have a example database which shows its features. Get your own copy of the dataset "assembly_view" (see above under GETTING YOUR OWN COPY OF A SAMPLE DATASET). cd assembly_view/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) ls Restart consed Double click on "assembly_view.fasta.screen.ace.1" In the Consed Main Window, click on the button "Assembly View" which is near the upper left corner of the window. You should see 3 grey bars with pink labels "2", "3", and "1". The bars are the contigs: Pink "1" means Contig1, pink "2" means Contig2, etc. Notice the scale on the contigs. This gives the contig position. 12.64) READ DEPTH We covered this briefly under Illumina Reads. You should see a dark-green graph above the contig bars. This dark green graph indicates read depth--the depth of the quality 20 (by default) region of reads. Click on the button labelled "What to Show". A menu will popup at that location. Click on the "Read Depth/Multiple Discrepancies" menu item. A window will appear labelled "Show Read Depth/Multiple Discrepancies". There is a field labeled "max read depth" and the current value should be 300. Change it to 30 and click the "Apply" button. The read depth graph should now be much bigger and easier to see. But depending on the dataset, a max of 30 might be much too small. Thus you will need to adjust it. You might not like the horizontal lines in the read graph. Turn them off by clicking on the square with the check mark (a "toggle button") labeled "make read depth horizontal lines across window" so that the check mark disappears. Then click "apply" and the horizontal lines should disappear. Note: the read depth is *not* the # of reads that have quality 20 bases or above, although this number is a good approximation. For example, suppose there is a stretch of 300 Q50 bases, and in the middle of that stretch are 5 Q10 bases. Those Q10 bases will be counted toward the Q20 read depth. (In computer science terms, these bases are part of the maximal Q20 read segment.) 12.65) FORWARD/REVERSE PAIR DEPTH A "forward/reverse pair" is a pair of reads from the same subclone template, each of which is primed within the subclone vector, but one is primed on one side of the insert and the other is primed on the other end of the insert. A forward/reverse pair may both be assembled into the same contig, in which case they should point towards each other and be approximately the insert size apart. A forward reverse pair also might be in different contigs on different sides of a gap. 12.66) To see the graph of forward/reverse pair depth: Click on the button labelled "What to Show". A menu will popup at that location. Click on "Fwd/Rev Pairs". A box will appear labelled "Which Fwd/Rev Pairs to Show in Assembly View". There is a little square (a toggle button) next to "show consistent fwd/rev pair depth". Click on this toggle button to change it from appearing sticking out to appearing pushed in. Then click on "Apply". A bright green graph should appear--this is fwd/rev pair depth. It is highest around 7000 to 10000 of Contig2 and around 14000 of Contig3. The bright green graph indicates, for each base, the depth of subclone templates that have a consistent forward/reverse pair. A forward/reverse pair is "consistent" if the forward and reverse are pointing towards each other (this may be a problem for some Illumina datasets--if this is a problem for anyone, let me know) and are not too far away from each other. ("Too far" is defined as 3 or more standard deviations from the mean of the insert size of templates from a particular library.) In other words, the green graph tells for each base, how many consistent forward/reverse pairs have that base between the forward read and the reverse read. This forward/reverse pair depth is not the same as read depth, which is typically much less. Forward/reverse pair depth is important in that it gives a measure of the confidence of the assembly at a base. If the forward/reverse pair depth is close to zero, as it is in Contig1 position about 9300, there is a likelihood that the assembly program has made an incorrect join. When the forward/reverse pair depth is zero, the green line turns red, as it does on the right end of Contig3. 12.67) INCONSISTENT FORWARD/REVERSE PAIRS The red lines connect the right end of Contig3 with the middle of Contig1. These are filtered inconsisent forward/reverse pairs--they are "inconsistent" because they are not consistent (see above) and they are "filtered" in that they have another inconsistent read close by (at both ends) that is inconsistent for the same reason. This is a good example of a misassembly. There are many many reads at the right end of Contig3 that are paired with reads in the middle of Contig1. Notice that the forward/reverse pair depth of Contig1 is close to zero around base 9300. (You can use the "Zoom In" button to see this in more detail, but when you are done experimenting with the Zoom buttons and the scroll bar, click on "Zoom Orig" for the rest of this exercise.) This is where the assembly program made a bad join. If you tear the contig apart there, complement the left part of Contig1, and then join it to the right end of Contig3, the forward/reverse pairs will change from inconsistent to consistent. You will learn later how to do that. 12.68) Point to one of the red lines. You will notice that it turns yellow. The text near the bottom of the Assembly View Window tells you a little more about what you have "highlighted" (turned yellow). If you want more information, click with the left mouse button. A window "Clicked Forward/Reverse Pairs" will appear giving information about each highlighted read. Try this. In the "Clicked Forward/Reverse Pairs" Window double click on one of the reads. The Aligned Reads Window should appear with the cursor on that read. This shows how to go from the Assembly View Window to the Aligned Reads Window. 12.69) You can also go from the Aligned Reads Window to the Assembly View Window. First you must make sure the Assembly View Window is already open (or else open it by clicking on Assembly View in the Consed Main Window). In the Aligned Reads Window, point to a read name, hold down the right mouse button, and release on "Find Read in Assembly View" (one of the last items in the menu the appears when you push down with the right mouse button). If the read is from a subclone that has a forward/reverse pair in the assembly, then the same "Clicked Forward/Reverse Pairs" Window will appear. It will contain not only the read that you pointed to, but all of the other reads from the same subclone as the one you pointed to. In the Assembly View Window, all of these reads will blink yellow. You can use this procedure to go within the Aligned Reads Window from forward read to reverse read or visa versa. (This may take some patience to find one with this dataset because there are many reads that are not paired.) 12.70) Notice the aqua lines that connect the right end of Contig2 to the left end of Contig3. These are consistent gap-spanning forward/reverse pairs. These are the reads that tell you (and Consed and many other programs) that the right end of Contig2 is connected to the left end of Contig3. As above, point to one to highlight it and click on it to see more information. 12.71) You can see much more information by clicking on the "What to Show" button, and then when the menu pops up, click on the "Fwd/Rev Pairs" menu item. Up will pop the "Which Fwd/Rev Pairs to Show in Assembly View" Window. Click on "All" next to "Show Inconsistent Forward/Reverse Pairs". Then click "Apply" at the bottom of this window. In this particular example, you just see a few more stray red lines. In a real example, you would probably see so many red lines that it would be a mess. In most cases those inconsistent such as chimerism and not to any misassembly. Thus I suggest that you only generally leave "Show Inconsistent Forward/Reverse Pairs" to "Filtered". (I suggest not changing the "# of pairs to confirm or the "expected # of confirmed inconsistent pairs. We have used statistics so that the predicted mean # of clusters of inconsistent fwd-rev pairs is 1.0.) 12.72) Still in the "Which Fwd/Rev Pairs to Show in Assembly View" Window, click on "Show each consistent fwd/rev pair within contigs" (so the button looks as though it is pushed in) and click "Apply". This will show a blue square for each consistent forward/reverse pair within a contig. The horizontal position of the square is the center of the subclone (midway between the forward and reverse read) and the vertical position of the square indicates the size of the subclone (higher means a larger subclone). If you really want to see the position of the forward and reverse reads, you can do that too: Click on "Show legs on squares for consistent fwd/rev pairs" ("Show each consistent fwd/rev pair within contigs" must be still on) and click "Apply". What a mess! I believe most of this information is much more easily understood by just showing the "consistent fwd/rev pair depth" (the bright green graph described above). But it is your choice. When you want to highlight a consistent fwd/rev pair, you must point to the square--not the legs. Try it so you understand. 12.73) Suppose you have an assembly and there are some forward/reverse pairs that you specifically do not want to see in the Assembly View Window. For example, perhaps they are from a plate that was misnamed or from a library that is somehow less reliable. By hiding these forward/reverse pairs, the more reliable/important ones can more easily be seen. This is how you can do that: In the "Which Fwd/Rev Pairs to Show in Assembly View" Window, notice the line that says: Do not show templates in file doNotShowInAssemblyView.fof Underneath this are 3 buttons and probably the one that is selected is "show all templates". Try clicking "do not show specified templates" and click 'Apply'. See if you notice that anything changed in which forward/reverse pairs are displayed. If not, switch back and forth between "show all templates" and "do not show specified templates", each time clicking 'Apply'. When you see a line that appears and disappears, click on it to find what template it is. For example, djs736a2_fp04q146 is one such template. Then from an xterm in the assembly_view/edit_dir directory, type: more doNotShowInAssemblyView.fof You will see the names of the templates that are displayed/hidden. In order to hide particular forward/reverse pairs, put them into this file. This file can also contain the character '*' which means "match any characters". For example, djs736a1_fp* would match the template djs736a1_fp04q206 but not djs736a2_fp01q127 12.74) Try turning on/off each of the Fwd/Rev Pair options so you understand them. (In this example, there are no "consistent fwd/rev pairs between different scaffolds.") 12.75) SEQUENCE MATCHES Notice the curvy orange lines connecting Contig1 with Contig2 and Contig3. These show sequence matches. Point at the one connecting Contig1 and Contig2 and click on it. A "Sequence Matches" box will popup saying that this match has 120 bases and has a similarity of 90.8%. Click on that line so its background turns black. Then click on the button "Show Alignment". Up will pop the Compare Contigs Window with the alignment shown in the lower half of this box. You will learn more about this later (see "JOIN CONTIGS"). For now, dismiss this window. 12.76) In the Assembly View Window, click on "What to Show" and then when the menu pops up, click on "Sequence Matches". In the "Which Sequence Matches to Show in Assembly View" Window, try clicking off "ok to show sequence matches between contigs". Then click the "Apply" button. You should see the orange lines disappear. (Any highlighted lines will not disappear.) Click "ok to show sequence matches between contigs" back on, and click "Apply" and the lines should be back. 12.77) Also in the "Which Sequence Matches to Show in Assembly View" Window, change the minimum similarity from 90 to 85. Click "Apply". You should see a lot more orange curvy lines, and now you should also see black curvy lines. If you look carefully, you will see that 2 lines within each pair of orange curvy lines do not cross each other but the 2 lines within each pair of black curvy lines do. This is because orange is used to show direct repeats and black is used to show inverted repeats (relative to the orientation of the contigs in the Assembly View Window). 12.78) Also in the "Which Sequence Matches to Show in Assembly View" Window, click on "filter seq matches by size" and set the min size to 400 and the max size to some huge number (e.g., 1000000), leave minimum similarity at 85, and click "Apply". You will see just one direct repeat (orange curvy lines) of size 746. 12.79) Try some of the other ways of filtering the sequence matches on "Which Sequence Matches to Show in Assembly View". 12.80) You must learn this step if you are going to ever see sequence matches with your own data, so don't skip this step. If you have problems, it is likely that the phrap or consed packages have not been installed correctly and you will need help from your system administrator. Exit Consed and look at the files in assembly_view/edit_dir. Notice there is a file: assembly_view.fasta.screen.ace.1.aview This is what Consed uses to show sequence matches in the Assembly View Window. When you use your own data, you will not have this file so you will need to learn how to create it. Hide it from Consed by (in practice you will never do this step--this is just to simulate the .aview file not being there): mv assembly_view.fasta.screen.ace.1.aview assembly_view.fasta.screen.ace.1.aview_hide Now restart consed and select ace file assembly_view.fasta.screen.ace.1 If you are asked if you want to apply edits, click the "No" button. Click on "Assembly View" in the Consed Main Window. You will get the error message: "Sequence matches will not be shown in Assembly View because there is no file assembly_view.fasta.screen.ace.1.aview If you want sequence matches to be shown, click on "What to show: Sequence Matches" and then "run cross_match" 12.81) RUNNING CROSS_MATCH FOR SEQUENCE MATCHES Just as the instructions (above) say, click on "What to show" and then when the popup menu appears, click on "Sequence Matches" and then when the "Which Sequence Matches to Show In Assembly View" Window comes up, click on the "Run Cross_Match" button. Watch the action in the xterm. There should be several pages worth of output from cross_match that scrolls by in the xterm. If you get an error, it is likely that the phrap or consed packages are not correctly installed. You (or your system administrator) should track down the problems and correct them. If you are successful, then 3 orange pairs of curvy lines will appear in the Assembly View Window--the same as you saw in the steps above. Note to advanced users: It is also possible to have cross_match run automatically using the consedrc resource: consed.assemblyViewAutomaticallyRunCrossmatchIfNecessary: true See CONSED CUSTOMIZATION below. 12.82) PULLING OUT READS AND RE-ASSEMBLYING THEM (MINIASSEMBLIES) When the Assembly View Window indicates (using forward-reverse pair information) that there is a misassembly, Consed provides the tools to correct that misassembly: you can first pull out the the misassembled reads from their current contigs into individual contigs, with a single read per contig. Then you can reassemble those new contigs that each contain a single read. Let's do this: 12.83) In the Assembly View Window move your cursor so that the red forward/reverse pair lines turn yellow. You will be unable to get them all yellow, but get as many as you can. Then click with the left mouse button. A window labelled "Clicked Fwd/Rev Pairs" should appear with a very long list of reads in it (around 53 reads). 12.84) In the "Clicked Fwd/Rev Pairs" Window, click on the button labelled "Pull out reads". A window labelled "Put Reads into Their Own Contigs" should appear. 12.85) In the "Put Reads into Their Own Contigs" Window, select all of the reads. You can do that by clicking with the left mouse button on the first read and then scrolling down to the bottom of the list of reads, holding down the shift key and clicking with the left mouse button on the last read. (When a read is selected, its background should be black.) Click on the button "Remove Highlighted Reads". The Assembly View Window will close and reopen after a few seconds and will complain about not being able to show sequence matches. Save the assembly (see "SAVING THE ASSEMBLY" above) and follow the instructions in "RUNNING CROSS_MATCH FOR SEQUENCE MATCHES" (above). The assembly will now probably contain 4 contigs: 2-3-1c in one scaffold and 4 in the other. That is because when the misassembled reads were pulled out of Contig1, it fell into two new contigs: the new contig 1 and contig 4. All of the reads you pulled out have created Contig5, Contig6, ... and approximately Contig58, each of which contain only a single read. 12.86) MINIASSEMBLIES On the Consed Main Window, click the button "Miniassembly". A box will popup labelled "Reassemble Some Contigs". On the left part of the box will be all contigs, from Contig1 to about Contig58. Notice that starting with Contig5 will be contigs that contain only a single read. On the right will be Contig5 through approximately Contig58. You add or delete from the list on the right. For example, to delete Contig5 from the list on the right, click on it, and then click "Clear Highlighted". The right list should now only contain Contig6 through the last contig. Add Contig5 back to the right list by clicking on Contig5 in the left list and then clicking on the button labelled "Move Highlighted to Right". Contig5 will now appear at the bottom of the list on the right. 12.87) Leave all of these boxes blank: "-minscore", "-minmatch", "-forcelevel", and "other phrap options:". Keep "Put into separate contigs" selected rather than "Discard from assembly". Click the "Reassemble" button. If you haven't saved the assembly, a box will popup saying "Error You must first save the assembly before making a miniassembly". Follow the instructions you learned above ("SAVING THE ASSEMBLY") to save the assembly. Then click the "Reassemble" button again and watch the action in the xterm. Lots of output from determineReadTypes.perl, phrap, cross_match will scroll by in the xterm as those programs run. (If they don't, you haven't correctly installed all of the phred, phrap, or Consed package.) 12.88) When the miniassembly is complete, a box will popup asking "Are you finished miniassemblying these contigs?" Click the "Yes" button. 12.89) On the Consed Main Window, click the "Assembly View" button. Consed will complain about not being able to show Sequence Matches so save the assembly and follow the instructions in "RUNNING CROSS_MATCH FOR SEQUENCE MATCHES" (above). In the Assembly View Window in addition to Contig1, Contig2, Contig3, and Contig4, you should see a few more contigs. These are the result of the miniassembly of all those individual reads. Note to advanced users: you can have consed automatically save the assembly before miniassembly by using the consedrc resource: consed.autoSaveBeforeMiniassembly: true See CONSED CUSTOMIZATION below. 12.90) HIGHLIGHTING READS TO REMOVE THEM FROM A CONTIG Instead of using Assembly View to pull out reads, you can do so using highlighting. Restart consed as follows: Exit Consed and then restart Consed. Double click on "assembly_view.fasta.screen.ace.1" (If a window pops up saying "There is an edit history file ( a .wrk file )...", click the "No" button.) 12.91) Double click on 'Contig2'. 12.92) Scroll to position 1670. 12.93) Click on the consensus T at this location. 12.94) Point to the 'Highlight' menu, hold down the left mouse button and release on 'Highlight reads with string at cursor'. 12.95) A box will pop up labeled 'Highlight reads with string at cursor'. There will be an input field labeled 'enter string (*'s will be ignored)'. Type a 'GG' into this field and click OK. It will say 'string gg matched 3 reads at position 1670'. 12.96) Let's see if there are any mates of these reads and highlight them also. Point to the 'Highlight' menu, hold down the left mouse button and release on 'Highlight mates of highlighted reads'. This will highlight 3 more reads. 12.97) To see that now 6 reads are highlighted, save the highlighted reads in a file and examine the file. (See HIGHLIGHTING READ NAMES above.) 12.98) Pull these reads out of Contig2 as follows: Point to the 'Highlight' menu, hold down the left mouse button and release on 'Remove highlighted reads'. Up will pop a box labeled 'Remove Reads'. You will also notice that there are 6 reads in the 'Reads to be removed' column--the 3 that you highlighted and their mates. There are many options for what to do with the reads and what to do with the remaining contig. See below under "REMOVING READS". For now, do not change any of the options. 12.99) Click 'do it' to remove these reads. 12.100) On the Contig Main Window, click the 'Miniassemble' button. You will notice that the reads you just removed are in the 'Contigs to Reassemble' column--all ready to miniassemble. You've now seen two different ways that users select reads for miniassembly: the first was in Assembly View by selecting mate pairs that are inconsistent. The second was in the Aligned Reads Window by selecting reads that are discrepant with the other reads (and the consensus) at a location. 12.101) CONTIG ARRANGEMENT--REORDER CONTIGS Contigs are arranged by Consed into "scaffolds" using forward/reverse pair information. However, you might have some external information (such as digest information) that tells you a different arrangement. You can use Consed to rearrange the contigs. This new arrangement will be preserved even if you reassemble. 12.102) Exit Consed and then restart Consed. Double click on "assembly_view.fasta.screen.ace.1" (If a window pops up saying "There is an edit history file ( a .wrk file )...", click the "No" button.) Click on the "Assembly View" button. You will see two scaffolds: one on the top row with Contig2 and Contig3, and one on the bottom row with just Contig1. Now suppose that you believe that Contig2 and Contig1 are connected together instead of Contig2 and Contig3. To do this: 12.103) Within the Assembly View Window, click on the "Contig Arrangement" button. Up will pop a menu. Click on "Reorder Contigs". A "Reorder Contigs" Window will pop up. Enter the following information: Contig: 2 [Right End] connected to Contig: 1 [Left End] That is, you must enter "2" and "1" in the contig boxes, and you must click on the first "right end" button. Then click on the "Add and Restart Assembly View" button. A warning box will pop up telling you that you are crazy, because there are 13 forward/reverse pairs as evidence that the scaffold as displayed in the Assembly View Window is already correct. Click on "yes"--that you are sure. Well, that isn't quite what you wanted. Contig 2 and Contig3 are still together. So connected the other end of Contig1: Contig: 1 [Right End] connected to Contig: 3 [Left End] Then click on the "Add and Restart Assembly View" button. A warning box will pop up again. Click on "yes"--that you are sure. The Assembly View Window will disappear for a second and reappear, with Consed2 and Contig1 connected together, just as you wanted. 12.104) CONTIG ORIENTATION Some users want a scaffold oriented a particular way. For example, one user might be working on a particular gene so wants to always view the top strand of that gene. Another user might be finishing a BAC and wants a particular end of the BAC on the left of the scaffold. The assembly program, however, may not respect their wishes and might have contigs complemented from the way the users want to view them. Consed provides a way for the user to indicate his/her desired orientation, and thereafter if phrap complements a contig from that desired orientation, Consed will complement the contig back when Consed starts up. To demonstrate this, exit Consed and then restart Consed. Double click on "assembly_view.fasta.screen.ace.1" In the Consed Main Window, double click on Contig1. You will see read djs736a2_fp02q494.y1 pointing left. But let's suppose that you would rather the Contig be in the other orientation, with read djs736a2_fp02q494.y1 pointing right. In the Consed Main Window, click on Assembly View. Then click on the button labelled "contig arrangement". When a popup menu comes up, click on "Reorient Contigs". The "Reorient Contigs Window" should come up. Highlight the scaffold labelled "1" under "Select a scaffold". Click on "flip scaffold". Then push the button labelled "Apply and Restart Assembly View". There will be an error box complaining about not being able to show sequence matches. To fix that, save the assembly and follow the instructions in "RUNNING CROSS_MATCH FOR SEQUENCE MATCHES" (above). In the Consed Main Window, double click on Contig1 so the Aligned Reads Window comes up. Scroll to the right end. You will notice that djs736a2_fp02q494.y1 is now on the right end pointing right. What is the difference between doing this and just complementing the contig, which just requires the click of a button? There isn't any, unless you are going to reassemble the project with phrap. In that case the difference is that complementing the contig will be undone the next time phrap runs (you reassemble), but using this procedure will be permanent, even if phrap complements the contig. 12.105) NAVIGATING Earlier you learned "Search for String", "Highly Discrepant Locations", and "High (or Low) Depth of Coverage Regions". Consed uses this same navigation system to go to many other types of locations. If consed is running, exit it. In this case we need a private copy of the dataset called "standard" (see GETTING YOUR OWN COPY OF A SAMPLE DATASET above). Type: cd standard/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) Restart Consed. Double click on standard.fasta.screen.ace.1 Double click on Contig1 In the Aligned Reads window, pull down the Navigate menu and release on 'Low consensus quality'. You will see a list of locations. Move the 'Low consensus quality' window down so you can see the Aligned Reads window. Repeatedly click on 'Next' until you reach the end of the list. (Low consensus quality means an area in which the consensus bases each have too high probability of being wrong.) This saves you from having to look through large amounts of high quality data trying to find problem areas. There are 2 'Next' buttons--one on the Aligned Reads Window and one on the Low Consensus Quality Window. You can click on either, but it is probably more convenient to use the 'Next' button on the Aligned Reads Window. Thus you can keep the Aligned Reads Window in front with input focus and keep the Low consensus quality window pushed out of the way. You may want to click on the 'Save' button in the Low consensus quality Window to save to a file a copy of this list of problem areas as you work through them. In our experience, this will be the most important navigate list you will use if you are finishing to high accuracy. In fact, finishing partly consists mainly of adding reads and rephrapping until this list is reduced to nothing. 12.106) Dismiss the Low consensus quality window. Pull down the 'Navigate' menu again and release on 'High quality discrepancies as above, but omitting tagged compressions and G_dropouts'. You will probably notice there are no entries (unless you created some yourself by editing). That is because there are no high quality discrepancies with this dataset. So let's force there to be some by lowering the quality threshold. First, dismiss the High quality discrepancies window. Click on 'Find Main Win'. In the Consed Main Window, pulldown the 'Options' menu and release on 'General Preferences'. Notice that the default for 'Threshold for High Quality Discrepancy' is 40. Change it to 15 and click 'Apply & Dismiss'. Then follow the steps above to bring up the High quality discrepancies menu. Now you will see several entries. Click 'next' repeatedly to go successively to the next high quality discrepancy in the Aligned Reads Window. You can also double click on a particular line in the High quality discrepancies window to go to that location. Alternatively, you can single click on a line and then click the 'Go' button. Dismiss the High quality discrepancies window. 12.107) There is also a way of getting such a list in *ALL* contigs: Click the "Find Main Win" button just above the black area containing the reads. On the "Consed Main Window", point to the "Navigate" menu (at the top) and release on "High Quality Discrepancies in All Reads." This will give the same list as before. In some assemblies there are hundreds, sometimes thousands, of contigs. It is much more convenient to search through all contigs at once than to search them one at a time. 12.108) Try navigate by tags by selecting 'tags' under navigate: when the Select Tag Type Window appears, double click on 'compression'. (Note that you can't do anything else until you deal with this window.) This gives a list of a particular tag type in all contigs. You can also search for many tag types at once. Do this by following the instructions above but this time click on several tag types and then click 'ok'. (You will only find compression tags because there are only compression tags in this assembly. When you learn how to create tags below, you will be able to find multiple tag types.) To speed-up selecting multiple tag types, do this: Point to the Navigate menu and release on "Tags in All Contigs". Type 'heter' in the box and click 'select'. You will notice that all of the heterozygote tags are selected. Then click 'ok' to find them all. (You will find none because there aren't any in this dataset.) 12.109) CUSTOM NAVIGATION In the Consed Main Window, Point to the Navigate menu and release on Custom Navigation. A box will pop up saying 'Select custom navigation file:' There will be a file: custom_navigation.nav Double click on it. You will see the now-familiar custom navigation box. Click 'Next' repeatedly until you get to the end of the list. This list of locations is chosen by some program other than consed. Many labs write such programs themselves. This allows a human to quickly review the sites the program has chosen. If your lab is interested in writing such a program, see below under HOW TO WRITE A CUSTOM NAVIGATION FILE. 12.110) Other navigation lists. Feel free to experiment. Some of these are only applicable to Sanger sequencing to high accuracy, or are obsolete, or are only used for consed development. Some terms: "Unaligned high quality regions" are regions in which the read is high quality so there is no question of the bases, but the region differs so much from other reads that the assembly or alignment program has given up trying to align the region with the consensus. (This could be due to a chimeric read, or perhaps the read belongs somewhere else.) "Unaligned reads" are reads that are totally unaligned to the consensus and don't belong there at all. "Edits" refers to human-made edits. "Search for Questionable Consensus Bases" is the favorite list of one finisher for finding misassemblies, but I don't recommend it. 12.111) TEAR CONTIG Just so you get the same results as I do, exit Consed and bring it up again using the original ace file standard.fasta.screen.ace.1 If it asks if you want to apply edits, just say 'no'. 12.112) When the assembly program really screws up, you may want to just tear the contig apart in several places and then join the pieces back together in a different way. Let's try it: Double click on "Contig1" so that the Aligned Reads Window comes up. Go to location 1500. Point the mouse at the consensus base at 1500 and push the right mouse button down. Release the button on 'Tear Contig at This Consensus Position'. You will notice that in the Aligned Reads Window, 4 read names are now colored purple: djs74-996.s2, djs74-2689.s1, djs74-564.s1, and djs74-2931.s1. The purple reads are consed's suggestions of which visible reads will go into the new left contig. The read's that are not colored purple will go into the new right contig. If you click on a read name in the Aligned Reads Window, it will switch back and forth between purple and not purple. Leave everything as it is and just click 'Do Tear'in the "Tear Contig" Window. (If you want to play around with which reads go into which contig, do that another time.) Now you should have 2 Aligned Reads Windows on top of each other. One should contain 'Contig2' and the other 'Contig3'. Dismiss the little window that says 'Tear Complete'. Don't do the following now--this is just for your information: You can also tear contigs in batch, possibly tearing at multiple sites. This is done by typing: consed -ace (ace file) -tearContigs (file of locations) where the "file of locations" has the following format: (contig name) (unpadded position to tear) (contig name) (unpadded position to tear) (contig name) (unpadded position to tear) (contig name) (unpadded position to tear) . . . 12.113) JOIN CONTIGS Now let's join these 2 contigs back together: Click on 'Search for String' and type in the following bases: agctgccatc Click 'OK'. Search for string should find 2 locations, one in Contig2 and one in Contig3: Contig2 (consensus) 1447-1456 (uncomplemented) Contig3 (consensus) 829-838 (uncomplemented) Double click on the first one. The Aligned Reads Window for Contig2 will scroll to location 1447 and the window will raise up. In that Aligned Reads Window, click on 'Compare Cont'. Now double click on the 'Contig3' line in the above Search for String results. The Aligned Reads Window for Contig3 will scroll to location 829 and lift up. In that Aligned Reads Window, click on 'Compare Cont'. Now the Compare Contigs Window should be visible. In the Compare Contigs Window, try scrolling back and forth. You can change the cursors (blinking red), but if you do, please return them to the locations 1447 and 829 for the next step. The cursors 'pin' these bases together when doing an alignment. (The algorithm is a pinned and banded Smith-Waterman alignment.) Click on Align. Try scrolling the alignment by dragging the thumb in the lower half of the Compare Contigs. An 'X' means there is a discrepancy between the 2 contigs. There is also a 'P' (see if you can find it!) The P indicates the bases that you pinned together. You will also notice that some bases are lighter and some are darker. This indicates quality just as in the Aligned Reads Window. You will notice that wherever there an is a discrepancy (an 'X') one of the bases is low quality. This is your cue that the discrepancy is just a base calling error rather than indicating that the two contigs really are different but similar locations. Click a few times on "Next Discrepancy." Then click on "Prev Discrepancy." Notice that the red cursors in the Compare Contigs Window moves to the next/previous X (discrepancy). Then look what happens in the 2 Aligned Reads Windows: as you move from X to X in this manner, the Aligned Reads Windows scroll as well. In the Compare Contigs Window click with the left mouse button on either contig in the bottom alignment. You will notice that both contigs will have the red blinking cursor in the same position. Click on 'Scroll Both Aligned Reads Windows' and look at the Aligned Reads Windows to see that they scroll to the corresponding positions. The number of discrepancies and discrepancy rate is also displayed--find this. Finally click the 'Join Contigs' button. The 2 previous Aligned Reads Windows will disappear and there will be a new one which has a new contig 'Contig4'. You have made a join! Scroll left and right. You will notice that many of the reads are highlighted. These are the reads that came from the previous "right" contig. To unhighlight all of these reads at once, point to the "Highlight" menu, hold down the left mouse button and release on "Unhighlight All Reads in All Contigs". It is possible to have more than one Compare Contigs Windows up at a time. This allows you to investigate a repeat that has more than 2 copies. There are several other ways of making joins that you will learn later: one uses the Assembly View window, one uses autoreport to make a list of potential joins and then allow the user to review each of them before making them, and one is completely automated with no user review. 12.114) COMPARE CONTIGS WINDOW AND INVERTED REPEATS In the above example, we used the Compare Contigs Window to examine a sequence match between two different contigs. It is also possible to use the Compare Contigs Window to examine a sequence match between two copies of a repeat within the same contig, either direct or inverted. 12.115) To see this, restart Consed: consed Double click on standard.fasta.screen.ace.1 When it says "There is an edit history file (a .wrk file)...Do you want to apply those edits?", click on "no". Double click on Contig1 to bring up the Aligned Reads Window. Go to position 69 (use the "Pos:" box described above). Click the "Compare Cont" button on the Aligned Reads Window. The Compare Contigs Window will popup, but move it aside. Go to position 2035 in the Aligned Reads Window. Click the "Compare Contig" button again on the Aligned Reads Window. In the Compare Contigs Window there are two copies of Contig1--one on top and one on the bottom. Each has a "complement just in this window" button. Click on the bottom one (the one that has position 2035 blinking red). After clicking on it, you should notice that the numbers on the bottom contig are reversed--they decrease to the right--a copy of Contig1 has been reversed and complemented. Now click the "Align" button. Suddenly, you should see the alignment appear in the bottom half of the Compare Contigs Window. You should see bases between 69-78 aligned against the reversed complement of bases from 2026-2035. This has shown how you explore an inverted repeat. If you wanted to examine a direct repeat, you would use the same method except you wouldn't click on the "complement just in this window" button. Compare Contigs is one method of exploring joins of contigs that were not made by your assembly program. Another method is to use the Assembly View Window (above). They are designed to work together: the Assembly View Window gives a high level view of all sequence matches and takes you to the Compare Contigs Window which shows the alignment of a single sequence match and, if the user so desires, makes a join. Dismiss the Compare Contigs Window. 12.116) REMOVING READS Above you saw how reads can be removed by highlighting them "HIGHLIGHTING READS TO REMOVE THEM FROM A CONTIG" or by using Assembly View ("PULLING OUT READS AND RE-ASSEMBLYING THEM (MINIASSEMBLIES)"). Here you will learn other ways to remove reads: You can remove individual reads and put them into their own contigs. For example, in the Aligned Reads Window, go to location 2000. Point to the read name of read djs74_2664.s1 and hold down the right mouse button. Release on 'Remove read djs74_2664.s1 from this contig.' A window will pop up labeled "Remove Reads" with various options: Ignore the top part of the window for now--it is for a different method of specifying the reads to be deleted. Pay attention to these options, which specify what you want done with the reads removed from this contig: For this exercise, click on "Just Put Each Read Into Its Own Contig", "If no reads in a contig location, break contig? (No)" and "Recalculate bases/qualities in old contig where reads were removed?" (Yes) Then click "Do It" on the bottom. Presto-chango! The read is put into its own contig and the old contig is redrawn without the read in it. At this point you should save the assembly--you should always save the assembly after removing reads. 12.117) You can also remove many reads at once. Look at the Consed Main Window. Click on the "Remove Reads" near the top. Type into the "File of read names:" box "reads_to_remove.fof" and either push the "Enter" key or click on "Read File". You should see a list of 2 reads: djs74-2231.s1 djs74-3174.s1 You can specify any of the options discussed above. Delete Reads from Assembly means that the read will no longer appear in Consed. When you are using your own data and you really want to remove reads from the assembly, you must also use the UNIX "rm" command to remove the corresponding phd files from phd_dir and the chromatograms from chromat_dir, if applicable. Otherwise, the next time you run assemble (possibly by running phredPhrap), the reads, like Phoenix, will rise again to become part of the next assembly. Notice that you can also remove all reads in a particular contig. There is also a method of removing reads from a script in batch without using Consed's graphical interface. See "consed -removeReads" below. 12.118) TAGS Restart Consed so the dataset is in its original condition: consed Double click on standard.fasta.screen.ace.1 When it says "There is an edit history file (a .wrk file)...Do you want to apply those edits?", click on "no". Double click on "Contig1" to bring up the Aligned Reads Window. 12.119) Middle mouse click on a read base (as you did above). You will see a Trace Window pop up. (These are actual traces of a Sanger read.) 12.120) Point at a base on the "edt" line, hold down the middle mouse button, move the pointer so several bases turn yellow, and then release the middle mouse button. 12.121) A list of choices will pop up. Select 'Add Tag'. Type in a comment in the box at the bottom, and select 'comment' from the list of tag types. You will now see a blue box both in the Aligned Reads Window and in the Traces Window on that read. To see the comment, you can just point to it in the Aligned Reads Window (without any clicking) and you will see the comment in the lower right hand corner of the Aligned Reads Window. Alternatively, you can click on that blue tag in the Aligned Reads Window with the right mouse button and release on 'Tag: comment Show more info?'. Alternatively, you can click on the blue tag in the Traces Window with the right mouse button. Try creating some other kinds of tags: again swipe some bases in the Trace Window by selecting a different tag type. You will notice that different tags are in different colors. You can always use the methods above to see the kind of tag even if you forget what a particular color means. Create a tag and enter for the comment 'lazy fox'. Then in the Main Consed Window, push down the left mouse button on 'Navigate' and release on 'Search for tags/find string in comment'. In the box, enter 'fox' and click 'Search'. The tag should appear in a navigation window. In this manner, you can find (and go to) all tags with 'fox' in the comment. You can also define your own tag types. See below CREATING CUSTOM TAG TYPES for how to do that. 12.122) CREATING LONG TAGS You can create really, really long tags as follows: Just create a short version of the tag as above for where you want the tag to start. Then figure out the consensus position of where you want the tag to end. In the Aligned Reads Window, click on the short tag with the right mouse button and release on 'tag: show more info?' (as above). A Tag Window will appear for that tag. In the Tag Window, simply change the End Unpadded Consensus Position to the place you want it to end. Then click 'OK'. You will now notice that the tag will be as long as you wanted. Users were unsatisfied with this method of making long tags (perhaps because it isn't intuitive) so I implemented the following method: 12.123) In the Aligned Reads Window middle mouse click on a read base (as you did above). You will see a Trace Window pop up. 12.124) Point at a base on the "edt" line, hold down the middle mouse button, move the pointer so several bases turn yellow, but DO NOT RELEASE THE MIDDLE MOUSE BUTTON. Instead move the pointer right until it is outside the window entirely. The read will start to scroll. When it is at your desired location, release the middle mouse button and create a tag as before. 12.125) CONSENSUS TAGS You can create tags on the consensus in the same way. In the Aligned Reads Window, use the middle mouse button to swipe some bases on the consensus in the Aligned Reads Window. Up will pop a list of tag types. Click on one of them. Try it again somewhere else. Try it with the tag type being 'comment'. In this case, you must enter a comment. Notice the pretty colors! If you forget which tag type a particular color represents, just point at the colored tag with the mouse and the tag type will be displayed at the bottom of the Aligned Reads Window. 12.126) Try creating some tags that overlap each other. You will notice that the overlapping region will be purple. If you want to know which tags overlap, you can use any of the methods already discussed. 12.127) WHAT THE COLORS MEAN At this point, you should know which each of the following colors means (the answer is further below--no peeking!): Dark grey background of a base vs very light background of a base Grey base with black background Red base Black base Color area covering lower half of a base Purple area covering lower half of a base 12.128) SEARCH FOR READ NAME Restart Consed using the original ace file standard.fasta.screen.ace.1 If it asks if you want to apply edits, just say 'no'. Instead of clicking on a read or contig name, you can type a read name (or just part of a read name). For example, if you want to look at the location containing read djs74-2689.s1, do the following: 12.129) Type "2689" into the "Find reads containing (*'s allowed):" box and then push the "Enter" key. Consed will immediately bring up the Aligned Reads Window with the cursor on read djs74-2689.s1. Suppose that there were more than one read that matched? Try it. 12.130) Type: "26" and then push the "Enter" key. This matches 3 reads: djs74-2689.s1 djs74-2679.s1 djs74-2664.s1 What happened? Try entering "26*9" and see what happens. What does the "*" mean? Try using the box below labeled "Find 1st read starting with:". 12.131) Type djs74-2 into this box. You will notice that as you type each letter, the first item in the list that matches the letters typed will be highlighted. Experiment with deleting a few letters and typing others. This is a powerful method of quickly getting to the read name you are interested in. When you get to the name in the list, you do not have to type the rest of the name--just type carriage return or else click on 'OK'. 12.132) ONLINE DOCUMENTATION On the Aligned Reads Window or on the Consed Main Window, click on the 'Help' menu and release on 'Show Complete Documentation'. You will see this document. You can search for keywords in it. It is also on the web. Go to http://bozeman.mbt.washington.edu/consed/consed.html, and find "complete documentation" near the bottom of the page. 12.133) THE .WRK LOG FILE Consed keeps a log of all changes you make to an assembly: adding new reads, putting reads into their own contigs, making joins and tears, adding and removing tags, and changing bases. This log is kept in a file ending with ".wrk". You can use this file to help you remember exactly what you did to an assembly. 12.134) FINDING, DISPLAYING, AND MAKING POTENTIAL JOINS I don't have a dataset for you that has some joins that need to be made. So I'm going to have you make one: 12.135) cd to the assembly_view/edit_dir directory (we used this for Assembly View above). (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) consed -ace assembly_view.fasta.screen.ace.1 12.136) Double click on Contig2. 12.137) In the Aligned Reads Window go to location 8300. (If you followed the instructions on the Quick Tour, you know a fast way to get there rather than scrolling.) 12.138) Point the mouse at the consensus base at 8300 and push the right mouse button down. Release the button on 'Tear Contig at This Consensus Position'. You will notice that in the Aligned Reads Window, some read names are now colored purple. Leave everything as it is and just click 'Do Tear'. 12.139) Save the assembly: pull down the 'File' menu on the Aligned Reads Window, and release on 'Save assembly'. Remember what you name the ace file (for this exercise, I'll call it assembly_view.fasta.screen.ace.7). 12.140) Exit consed. Congratulations: you now have an ace file with 2 contigs that need to be joined! Now I can show you how to make joins either semiautomated or fully automated. Put the following into your consedrc file (for information on how to change the consedrc file, see EDIT PARAMETERS: HOW TO CHANGE CONSED/AUTOFINISH PARAMETERS elsewhere in this document.): consed.autoReportPrintPotentialJoins: true 12.141) Then type the following: consed -ace assembly_view.fasta.screen.ace.7 -autoreport (where "consed" is replaced by whatever command brings up consed on your system). There will be a flurry of output ending with: Total # pairs: 10000, size: 0.720 Mbytes; edges12: 0; # score_hists: 0, size: 0.000 Mbytes; # query_domains: 10000, size: 0.880 Mbytes; # query_datas: 10000, size: 0.240 Mbytes Total # segment blocks: 0, size: 0.000 Mbytes Total # diffs: 41, in 1 lists, size: 0.000 Mbytes see assembly_view.101229.104216.out where the 101229.104216 will be replaced by your current date and time. (Note to programmers: this filename will in the file auto.fof ) 12.142) Look at this file by typing: less assembly_view.101229.104216.out (where assembly_view.101229.104216.out is replaced by whatever your .out file is called) and type "G" to look at the bottom of the file. The end of the file assembly_view.101229.104216.out will look like this: printPotentialJoins { ALIGNMENT do 697 97 Contig4 right 7933 8704 Contig5 left 1 775 U ALIGNMENT matchNotToGap_discrepancy 51 77 Contig3 left 499 670 Contig5 right 3776 3951 U } printPotentialJoins Type "q" to get out of "less." In a larger assembly, there will be many ALIGNMENT lines. 12.143) Bring up consed again like this: consed -ace assembly_view.fasta.screen.ace.7 -displayMatches assembly_view.101229.104216.out (where "consed" is replaced by whatever command brings up consed on your system and assembly_view.101229.104216.out is replaced by the file displayed in the previous step). The Consed Main Window will popup and then a window titled "Sequence Matches" will also pop up. In this case there will be the same two alignments you just saw in the file. Only one says "do". "Do" means that it is recommended. Otherwise it will tell you why it isn't recommended. In this case it is for 2 reasons: "matchNotToGap"--the match doesn't extend all the way to the gap and "discrepancy" meaning there are high quality discrepancies between the contigs in the overlapping region. 12.144) Double click on the "do" alignment line and the Compare Contigs Window will pop up with the alignment displayed. You can click "Join Contigs" to make the join. 12.145) USING CONSED -MAKEJOINS TO MAKE JOINS IN BATCH Alternatively, you can make all of the recommended joins in batch: 12.146) consed -ace assembly_view.fasta.screen.ace.7 -makeJoins assembly_view.101229.104216.out (where assembly_view.fasta.screen.ace.7 is the ace file you created after the tear above and assembly_view.101229.104216.out is the output of autoreport above). This will create a new ace file. Bring it up in consed and you will see that it will look just like the 3 contigs in the original ace file assembly_view.fasta.screen.ace.1 12.147) PROTEIN TRANSLATION If you would like, you can see the amino acid translation of the consensus in all reading frames. In the Aligned Reads Window, push down the left mouse button on the 'Misc' menu and release on 'Show Top Strand Protein Translation'. Try again but this time release on 'Show Bottom Strand Protein Translation'. Notice that there are 2 characters that are in magenta color. What are those characters? Why are they made in a different color? To not show the protein translation, push down the left mouse button on the 'Misc' menu and release on 'Don't show protein translation'. 12.148) OPEN READING FRAMES You can search for open reading frames (a methionine followed by some amino acids and then a stop codon all within the same reading frame) within a contig. In the Aligned Reads Window, push the left mouse button on 'Navigate' and release on 'Search for Open Reading Frames'. Notice that the open reading frames are shown for all 6 reading frames and are sorted by length. 12.149) DISPLAYING TRACKS (WIG and BED files) This assumes you are still displaying standard.fasta.screen.ace.1 12.150) In the Aligned Reads Window, point to the "Track" menu, hold down the left mouse button and release on "Add Track". A box will popup asking if consensus position 1 corresponds to position 1 of the track file. Click the "Yes" button, and the "Add Track Window" will appear. 12.151) Click on the "Browse" button (upper left-hand corner of the Add Track Window). Another window will pop up. Scroll the list of files down to "wig_fixed.txt". Double click on it. This window will disappear. 12.152) In the the "Add Track" window, click "OK". Now in the Aligned Reads Window, you will see a yellow graph labeled "Coverage". Try scrolling left and right to see it. 12.153) DISPLAYING GENE TRACKS (BED files) You can also use Consed to display BED files for purposes such as displaying genes. For this exercise, you will need a different sample dataset than the "standard" one we have been using so quit out of consed. Follow the instructions (above) GETTING YOUR OWN COPY OF A SAMPLE DATASET in order to make a private copy of the dataset gene_track_answer. Type: cd gene_track_answer/chrF_500/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) Restart Consed. Double click on chrF_500.ace.1 Double click on chrF_500_45000 Point to the "Tracks" menu, hold down the left mouse button, and release on "Add Track". In the box that says "File containing track information:" , type: ../../hgTablesFake.txt and push the "enter" key. In the box that says "Examples of sequences in this file:", you should see "chrF" appear. Do not modify any of the other fields. Click "OK" at the bottom of this window. You should now see a large black area above a yellow horizontal line above the consensus numbers. Point to that area, hold down the right mouse button, and you will see a menu. The 2nd item from the bottom should say "Go to next feature of track "tb_knownGene". While still holding down the right mouse button, move the pointer to point to this item, and release the mouse button. Now you will see a gene labeled "uc010ufi.2", a right arrow (indicating this is a top strand gene), and a green horizontal line (which is the 5' untranslated region). There also is a message more bed lines...to show, right mouse click and "change size of track" Follow those instructions: point to this track area, hold down the right mouse button, wait for the menu to popup, and release on the menu item "change size of track". A window will popup labeled "Change Track Height". Right below the instruction "Move vertical slider to change track height:" is a vertical slider. Grab the slider and move it up and down and notice how the #s in the box below change. Move it so the number in the box is around "120". Then click "Apply" and then click "Dismiss". In the Aligned Reads Window, you should now see 5 genes: uc010ufi.2, uc010ufj.2, uc010ufk.2, ... Point to this track area, hold down the right mouse button, and release on "Go to next feature of track "tb_knownGene". The window should scroll to position 976 and two of the genes (the top and bottom ones) should show amino acid translations within a thicker green line. This indicates the start of translation. The thinner green lines are untranslated regions. The yellow arrows indicate these are top strand genes. Again point to this track area, hold down the right mouse button, and release on "Go to next feature of track "tb_knownGene". The window should scroll to position 1125. Here you will see very thin lines to the right, thiner than the untranslated regions to the left. These very thin lines are introns. 12.154) DISPLAYING TRACKS WITH SCORES (BED FILES) (This assumes you have consed still up and displaying the BED file hgTablesFake.txt from the preceding exercise.) Point to the "Tracks" menu, hold down the left mouse button and release on "Add Track". In the box that says "File containing track information:" , type: ../../with_scores.bed and push the "enter" key. In the box that says "Examples of sequences in this file:", you should see "chrF" appear. Do not modify any of the other fields. Click "OK" at the bottom of this window. Now you should see a second track appear. Point to that second track, hold down the right mouse button, and release on "Go to next feature of track conservation". You should see a grey bar. As in the UCSC Genome Browser, the grey scale corresponds to the score, with darker meaning a higher score. 12.155) FIXING CONTIG-ENDS When you've added reads, consed does not automatically extend the consensus using the new data so you end up with good quality reads sticking out of the contigs. In addition, the existing consensus might be wrong and other reads near the ends of contigs may be misaligned. In the past, users have fixed this by pulling out reads and rejoining them, and/or bring up traces and "change consensus". This is a tedious process to fix hundreds of contigs ends. We now have a feature that fixes contig ends in batch. For this exercise, use the dataset called "illumina_paired_answer". Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). cd illumina_paired_answer/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 12.156) First examine the dataset: consed -ace ref.ace.1 12.157) Scroll between positions -70 and 0 and notice that the left end of read "c_elegans_piece" is the left end of the consensus and there are some reads sticking out the left end of the consensus. 12.158) Scroll to position 1110 and notice that the right end of read "c_elegans_piece" is the right end of the consensus and there are some reads sticking out the right end of the consensus. 12.159) Scroll to position 501 and notice that all of the reads disagree with the reference sequence. This error in the consensus will be fixed in the next exercise FIXING THE CONSENSUS IN BATCH. 12.160) Exit consed and then type: consed -ace ref.ace.1 -fixContigEnds There will be lots of output ending with: moving consensus tags... deleting old contigs... writing ref.ace.2 Wrote new ace file: ref.ace.2 See output in ref.ace.1.140203.160835.out (you will have different numbers here) 12.161) Examine this new ace file: consed -ace ref.ace.2 12.162) Scroll to about position 71 and you will see that the consensus extends far to the left of the left end of the read "c_elegans_piece". 12.163) Click on a base of the read "c_elegans_piece", hold down the control key, and type "e". You should move to the right end of the read c_elegans_piece, position 1170. (If you don't understand these instructions, just scroll to position 1170.) You will see that the consensus extends considerably to the right of the end of the read c_elegans_piece. This is what has happened: phrap has reassembled the reads at each end of each contigs, extending the consensus based on the consensus of each little assembly. When you are using your own data, if you don't want all ends of all contigs reassembled, you can restrict it in 2 ways: consed -ace (ace file) -fixContigEnds -contigEndsFOF desired_contig_ends.fof where desired_contig_ends.fof is a file that looks like this: Contig466 left Contig466 right You can also restrict fixing to contigs that have more contigs by putting into your consedrc file the following (for information on how to change the consedrc file, see EDIT PARAMETERS: HOW TO CHANGE CONSED/AUTOFINISH PARAMETERS elsewhere in this document.): consed.fixContigEndsMinNumberOfReadsInContig: 5 If you have a -contigEndsFOF, a contig end will only be done if it also meets the minimum number of reads filter (above). Note to old-timers: do not use the following any longer: consed.addNewReadsExtendConsensusUsingProtrudingNewReads: true "consed -fixContigEnds" supercedes the above parameter. 12.164) FIXING THE CONSENSUS IN BATCH This is useful if either your assembler makes lots of errors in the consensus or if the consensus is really a reference sequence from a different genome and you want to use the reads to modify the reference. This exercise assumes that you have completed the exercise above "FIXING CONTIG-ENDS". Continue where you left off. 12.165) Continue examining this ace file: consed -ace ref.ace.2 12.166) Scroll to position 571 in contig "c_elegans_piece" and you will see the same location you looked at before in which the consensus has a c (incorrect) and all of the reads have a G. 12.167) Exit consed and, on the command line, type: consed -ace ref.ace.2 -fixConsensus and a new ace file will be created. 12.168) Examine the new ace file: consed -ace ref.ace.3 12.169) Scroll to position 571 in contig "c_elegans_piece" and you will see that the consensus is now a G with a blue tag on it. Point at the blue tag and it will say "automatedEdit" on the bottom line of the window. AutomatedEdit tags allow the user to bring up consed after running consed -fixConsensus and rapidly review each changed location by navigating to each automatedEdit tag. (See NAVIGATING (above).) Warning: if you run both -fixContigEnds and -fixConsensus, you must run -fixConsensus *after* running -fixContigEnds. 12.170) HANDLING DUPLICATE READ NAMES If you have an assembly that gives the following error message: there is at least 1 (maybe lots) of reads with the same name. For example, m130722_222134_00116_c100533042550000001823085711101382_s1_p0/103207 then run: consed -ace (ace file) -renameDuplicates This will cause reads to be suffixed with _d1, _d2, etc. so the names are unique. Probably more than you want to know: Base segments are eliminated. RT tags are only retained if the read can be determined unambiguously. 12.171) Answer to What the Colors Mean (above) Greyscale of background indicates quality Grey base with black background--clipped off part of read (either due to low quality or due to alignment) Red base--discrepant with consensus Black base--agrees with consensus Colored area covering half of a base--tag (see Quick Tour) Purple tag--more than 1 tag covering a base ---------------------------------------------------------------------------- 13. BAM2ACE: MAKING A CONSED-READY DATASET OUT OF A BAM FILE Note: Currently this feature is available for linux (32 and 64 bit) and macosx-intel (but not ppc). It is not available for solaris. You've already seen how bamscape can view a bam file and then start consed on a particular region. If you know exactly the region of the bam file that you would like to view with consed, you can also convert a BAM file (or part of a BAM file) into an ace file that can be edited with consed. To do this, use: for multiple bam files: bam2Ace.perl -bamFiles (bam file fof) -regionsFile (regions file) for a single bam file: bam2Ace.perl -bamFile (bam file) -regionsFile (regions file) where: (bam file fof) looks like this: 1869.merged.sorted.nodups.realigned.bam 1871.merged.sorted.nodups.realigned.bam and (regions file) looks like this: BEGIN_SEQ_FASTA chr1 /net/gs/vol2/shared/greenlab/genomes/hg19/chr1.fa END_SEQ_FASTA chr1 1653034 1653150 chr1 1654146 1654257 chr1 1634345 1634438 chr1 1634518 1634708 where the lines between BEGIN_SEQ_FASTA and END_SEQ_FASTA list the sequences and which files they can be found in. (E.g., chr1 is found in /net/gs/vol2/shared/greenlab/genomes/hg19/chr1.fa) The lines after END_SEQ_FASTA give the sequence name, start position, and end position of the regions to put into consed. The start and end positions are 1-based respect to the sequences and are typically chromosome positions. If the sequence names have spaces in them, such as: >gi|57116681|ref|NC_000962.2| Mycobacterium tuberculosis just use the first word (gi|57116681|ref|NC_000962.2| in this case) since spaces are not allowed. bam2Ace will not (by default) take all of your reads. Typically the depth of Illumina reads is in the thousands, which is unwieldy for examination and finishing. So bam2Ace runs shallowerDepth that takes just the highest quality reads of each allele. If you want all reads, use this consedrc resource: consed.bam2AceShallowerDepth: false (See CONSED CUSTOMIZATION for more info about consedrc files.) It is also possible to add consensus tags at defined positions in the reference sequence. You can do that like this: 23 10,000 20,000 tag: polymorphism 15,000 15,000 where you have added a polymorphism consensus tag at position 15,000 of reference sequence 23. For this exercise, use the dataset called "bamScape". 13.1) Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). 13.2) Type: cd bamScape (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 13.3) Type: bam2Ace.perl -bamFile reads.sorted.bam -regionsFile bam2AceRegions.txt There will be a lot of xterm output ending with this: read phd files in ../phdball_dir/phd.ball.1 found: 4 totals: used: 4 need: 759 read phd files in ../phdball_dir/phd.ball.1 found: 5 totals: used: 5 need: 759 read phd files in ../phdball_dir/phd.ball.1 found: 6 totals: used: 6 need: 759 read phd files in ../phdball_dir/phd.ball.1 found: 7 totals: used: 7 need: 759 read phd files in ../phdball_dir/phd.ball.1 found: 8 totals: used: 8 need: 759 read phd files in ../phdball_dir/phd.ball.1 found: 9 totals: used: 9 need: 759 Number of phd blocks used from ../phdball_dir/phd.ball.1: 759 Number of individual phd files read: 0 Total reads in assembly: 759 Finished setting quality values in 0 seconds writing new ace file bam2Ace.ace.1 writing bam2Ace.ace.1 cd /wd1/gordon/sunny/bamScape/consed1/edit_dir/ (the consed1 above might be consed2 or consed3 ....) 13.4) In your output, look at the last line--the one starting with "cd " and ending with "/edit_dir/" Type that line. You should now be in an edit_dir subdirectory. 13.5) In your output, look at the 2nd to last line--the one starting with "writing" consed -ace (ace file) where (ace file) is replaced by whatever ace file is on the 2nd to last line of your output. The "Consed Main Window" should popup and the "Contig List" should have 2 contigs: 23_10000_20000 (0 reads, 10,001 bps) 23_105000_115000 (759 reads, 10,001 bps) What? A contig with 0 reads? 13.6) Double-click on that contig. The "Aligned Reads" Window will popup and you will see that the consensus line consists just of N's. Scroll from one end to the other and you will just see N's. So now you know why no reads are aligned to this region. This can happen with real data, too. 13.7) Dismiss this "Aligned Reads" Window. Double click on the 2nd chromosome (the one with 759 reads). Another "Aligned Reads" window will pop up, but this time there are plenty of reads. 13.8) MAKING AN ACE FILE OUT OF AN ENTIRE BAM FILE Making an ace file out of the entire reference sequence (rather than just a targeted region), can be a little dangerous since the ace file/phd ball might be enormous--too big for consed. But if you know this won't happen, there is a script to do it for you: makeRegionsFile.perl myReference.fa which takes a file (such as myReference.fa) with one or more reference sequences in it, and produces a regions file suitable for bam2Ace.perl. The regions file will make the regions the entire lengths of the reference sequences. The sample bam file in the bamScape dataset (above) provides a good example for practicing doing this. 13.9) Type this: cd .. pwd Do this several times until you are in the bamScape directory (the path ends with "/bamScape") rather than the directory you were in for running consed. 13.10) Type: makeRegionsFile.perl 23.fa 13.11) Type: ls -l and notice there is a file 23Regions.txt 13.12) Type: bam2Ace.perl -bamFile reads.sorted.bam -regionsFile 23Regions.txt There will be a flurry of output as before ending with roughly: writing new ace file bam2Ace.ace.1 writing bam2Ace.ace.1 cd /wd1/gordon/sunny/bamScape/consed3/edit_dir/ 13.13) cd to the directory indicated on the last line of the output. 13.14) Bring up consed with the ace file as indicated by the 2nd to last line of the output: consed -ace bam2Ace.ace.1 In this case there are 10,064 reads. Try scrolling across the contig. ---------------------------------------------------------------------- 14. SANGER READS For these exercises with Sanger reads, you must have some basic knowledge of using consed, as learned in the first dozen or so steps of QUICK TOUR OF CONSED (above). After you've gotten that, continue here. 14.1) In this case we need a private copy of the dataset called "standard" (see GETTING YOUR OWN COPY OF A SAMPLE DATASET above). Type: cd standard/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) Restart Consed. Double click on standard.fasta.screen.ace.1 Double click on Contig1 14.2) TRACES AND EDITING READS Many of the Traces and Editing features apply to both Sanger reads and Next Gen reads. However, in the case of Next Gen reads, a fake trace is created and displayed so all of the same editing features are available. Point with the mouse at a base of one of the reads and click with the middle mouse button. (If you do not have a 3 button mouse, see MONITORS AND MICE FOR CONSED below.) The Trace Window showing the traces for that stretch of read should popup. There are 2 rows of numbers: 'con' are the consensus positions 'rd' are the read positions There are 3 rows of bases in the trace window: 'con' is the consensus 'edt' is where you can edit the base calls of the read 'phd' is the original phred base calls Notice that a red rectangle blinks (the 'cursor') in the corresponding positions of the Aligned Reads Window and the Trace Window. 14.3) Try editing in the Trace Window. You can click the left mouse button on a base in the 'edt' line to set the cursor (a blinking red rectangle). You can directly overstrike a base by typing a letter. Try this. Try undoing it by clicking on 'undo' which is near the lower right corner of the Trace Window. If you want to undo more than one edit, you will have to go to the main Consed window and click on the button labeled 'Undo Edit...'--see MULTIPLE UNDO EDIT elsewhere in this document. You can overstrike with the following characters: acgt (bases), * (a pad, in effect deleting the base), and mrwsykvhdb (IUB ambiguity codes). You can move left and right with the arrow keys. (If this doesn't work, your window may not have "input focus"--an X Windows issue. To give a window input focus, you can click within it.) We believe that the user should change a Sanger base call only while viewing the traces. That is why editing is done here--not in the Aligned Reads Window. 14.4) You can insert a column of pads by pushing the space bar. Try this. (You may need to click on a base on the 'edt' line first to give the window input focus.) (For those of you new to editing assemblies, a 'pad', which in Consed and phrap is represented by the '*' character, is used to align two or more sequences such as these: gttgacagtaatcta gttgacataatcta in which one sequence has an inserted or deleted base with respect to the other. By inserting the pad character, it is possible to get a good alignment: gttgacagtaatcta gttgaca*taatcta This is the purpose of pad character--it is just a placeholder.) You can then overstrike a pad with a base. In this way you can insert a base, and still preserve the alignment. 14.5) Try highlighting part of a read on the edt line by holding down the middle mouse button and dragging the cursor over some bases. They will turn yellow as you drag. Then release the mouse button. A window will pop up giving you some choices of what to do with those (yellow) bases.: Change Consensus--make the highlighted bases edited high quality and change the consensus to agree with that stretch of the read. This also is a directive to phrap (upon reassembly) to use that stretch of that read to be the consensus. Change to n's--Change the highlighted bases to n's which means they are unknown bases. This also tells phrap (when it reassembles) to not make any join based on these bases. It is useful when you believe the bases may be in the chimeric portion of a read. Change to n's to left--same as above but to left end. Change to n's to right--same as above but to right end. Change to x's to left--Change the highlighted bases to x's which means they are vector. This also tells phrap to ignore these bases for the purpose of determining overlap. Change to x's to right--same as above but to right end. Add Tag--allows user to add any tag to a stretch of read bases. Dismiss--you decided you don't really want to do anything with this stretch of bases. The following options are only relevant if you are using phrap to reassemble (including miniassembly): Make High Quality--makes the highlighted bases edited high quality (99). This tells phrap (when it reassembles) that you are sure of the sequence here. Make low quality--makes the highlighted bases edited low quality. This tells phrap (when it reassembles) that you are not sure of the bases here and phrap can go ahead and make a join even if the bases in this region don't match perfectly. Make Low Quality to Left End--same as above, but all the way to the left end of the read. Make Low Quality to Right End--same as above, but all the way to the right end of the read. This popup is made so that nothing else works until you choose something. Try each of these choices, except for tags, which you'll try below. When you are done, dismiss this window. If you don't, you'll be sorry! (Consed will freeze.) 'Change Consensus' has an additional function--if a read extends out on the right beyond the end of the consensus, you can extend the consensus by using this function. You might want to do this, for example, if your assembly program did not correctly find the cloning site and thus clipped too much. You can add these bases to the consensus by using 'Change Consensus'. Typically, the quality of these bases in the read and in the consensus is 99. That is so that next time phrap runs, it will correctly extend the consensus. However, if you aren't going to reassemble, you might want to just leave the quality values the way the base-caller originally called them. You can do this by using a Consed parameter (consed.extendConsensusWithHighQuality), which you will learn more about later (see CONSED CUSTOMIZATION). 14.6) To delete a base, overstrike it with a '*' character. Even if the consensus has many *'s in it, this is OK since when you export the consensus (try the exercise on EXPORTING THE CONSENSUS), the *'s are not exported. While you are editing in Consed, we believe there should be a visual indication that a base was deleted. 14.7) SCROLLING TRACES AND ALIGNED READS TOGETHER In the Aligned Reads Window, notice the yellow arrows between the column of read names and the bases. Some of these arrows are magenta, which indicates its trace is up. In the Aligned Reads window, scroll along the contig to a different point but keep the read in view whose trace is already up (its arrow is magenta). In the Aligned Reads Window click the left mouse button on a base of that read while watching the Trace Window. Notice that the corresponding trace window instantly scrolls to the corresponding location. Now go to the Trace Window and scroll the traces to a new location. In the Trace Window click on the edt line with the left mouse button while watching the Aligned Reads Window. You will notice that the Aligned Reads window will instantly scroll to the corresponding location. Thus you can keep the Aligned Reads window and the traces scrolled to the same location. 14.8) SHOW ALL TRACES Go to base 2000. Point to the consensus base, push down the right mouse button, and release on 'Display traces for all reads'. You will see all traces displayed in a scrolling window. You can drag the scrollbar on the right down and up to see all the traces. This feature is particularly useful for polymorphism/mutation detection work. This feature was added to work in cooperation with polyphred. (See CONSED-POLYPHRED intereaction below.) In this Traces Window, point at one of the bases of one of the reads and click with the left mouse button. The base should start blinking in red. Now push the down arrow key on your keyboard. The cursor should move to the next read. Repeatedly type the down arrow key. Eventually the display should scroll so you can continue to see the read the cursor is on. Try the up arrow key as well. If there are more than 100 traces at a position, you will see those traces in batches of 100 traces. You can use the bottons at the bottom of the Traces Window labelled "prev 100 traces" and "next 100 traces" to move to the previous and next batches of 100 traces. There is also a button at the top left of the Traces Window that changes between "Show All Traces" and "Show Just Good Traces". A "good trace" means a trace that is all of the following: * it has a base at the cursor location * there is no dataNeeded tag on the read (this is customizable using the resource consed.showAllTracesDoNotShowTraceIfTheseTagsPresent: ) 14.9) PRIMER-PICKING Go to position 2470 which is near the right end of the contig. Click with the right mouse button on the consensus and click on "top strand primer from subclone template". Consed will pause a moment, and then there will appear a selection of primers that pass all of Consed's requirements. (If you get an error message, Consed might not have been correctly installed. See INSTALLING CONSED above.) Templates are also chosen for each primer. You may have to scroll the primer list to the right to see the templates. Consed lists these templates in order of quality--all of them will cover the read you want to make. 14.10) Double click on one of the primers in the Primers Window. That will cause the Aligned Reads Window to scroll to show that oligo in context. Click on 'Accept Primer'. A comment box will pop up. Enter some comment and click 'OK'. Notice that a yellow oligo tag, with a little red end, is created on the consensus for that primer. The red end points in the direction of the oligo. 14.11) Point to the yellow and press down the right mouse button and then release on 'Tag: oligo ... show more info?' A box will popup with much information about the oligo--all you need to order that oligo and do the reaction. Notice the field: 'Oligo name'. The name should be something like 'standard.1'. 14.12) If you can't find the oligo, you can find it again using its name. Scroll the Aligned Reads Window so you can't see the oligo anymore. In the Consed Main Window, point to the "Navigation" menu, push down the left mouse button and release on "Search for oligo tags by name". A box will pop up saying "Search for Oligo Tags". Enter "standard.1" (or whatever your oligo name is). Click "search". The Aligned Reads Window will scroll to the location of that tag. 14.13) To check whether the primer matches some other location in the assembly, do the following: As before, in the Aligned Reads Window, point to the yellow tag and press down the right mouse button and then release on 'Tag: oligo ... show more info?' A box will popup. Click on the button labelled 'search for oligo bases'. Note that Consed's primer picking will generally (there are some exceptions) not pick primers that match to more than one location. However, if you have added more information and/or reassembled since that primer was picked, there now could be another location that the primer matches to. I would suggest you just accept the first primer in the list. However, if you want to understand the differences, here is the explanation (if you want more information, see Gordon 1998 listed in the consed references). -----matches----- min self false vector qua 4 22 13 50 "4" is a measure of the primer's match to itself or another copy of itself forming a loop or primer-dimer making it less available for priming. Bigger is worse. "22" is a measure of a match to some other location (not the location you want) on the template. Bigger is worse. "13" is a measure of the match to the vector sequence(s) that are in the vector files. Bigger is worse. Typically the vector files are: /usr/local/genome/lib/screenLibs/primerSubcloneScreen.seq or /usr/local/genome/lib/screenLibs/primerCloneScreen.seq but there are consedrc parameters that allow these files to be some place else. "50" is the minimum consensus quality of the primer. Bigger is better because it gives you greater confidence that the primer sequence is correct at the location you want to prime. When picking primers (above), what is the difference between 'Pick Primer from Subclone Template' and 'Pick Primer from Clone Template'? There are 3 differences: A. which vector file the primers are screened against. In the former case, the primer is screened against the file primerSubcloneScreen.seq and in the latter case against the file primerCloneScreen.seq B. In checking for false matches elsewhere in the assembly, if the template is the whole clone, then Consed must check for false matches in the *entire* assembly, including all other contigs. But if the template is just going to be a subclone, Consed only needs to check elsewhere in that subclone. Actually, to be conservative, Consed checks for false matches +/- the maximum insert size of a subclone. C. If you are picking primers for subclone template, then the primer picker can also pick the subclone templates. If it doesn't find any suitable subclone template, it will reject the primer. (By default, picking of subclone templates is turned on. If you prefer to pick your own templates, and want Consed's primer picker to be much faster, you can turn it off temporarily or permanently. To turn it off temporarily, go to the Consed Main Window, point to the Options menu, hold down the left mouse button and release on 'Primer Picking Preferences'. Scroll down to 'Pick Subclone Templates for Primers' and click 'False'. Click on 'Apply and Dismiss'. To change this permanently, see CONSED CUSTOMIZATION below. Beware: you must correctly customize determineReadTypes.perl for template picking to work. See INSTALLING CONSED above.) If you are interested in the details of primer-picking, type: consed -printDefaultResources which will tell you the primer-picking parameters and what they do. 14.14) CHECKING WHETHER A PARTICULAR OLIGO WOULD MAKE AN ACCEPTABLE PRIMER You can check this as follows: In the Aligned Reads Window, point to the 'Misc' menu, hold down the left mouse button and release on 'Check Primer'. Enter the left and right consensus positions of the primer, check which strand, and whether the primer is to use subclone templates or the whole clone as a template. For example, type 20 for left and 40 for right, select "<-" (bottom strand) and subclone. Then click "Check Primer". A box "What is Wrong With This Primer" will pop up telling you what is and is not acceptable about this primer. 14.15) PICKING PCR PRIMER PAIRS In the Aligned Reads Window, go to the location where you want to pick the first PCR primer, base 500. Point to the consensus, hold down the right mouse button and release on 'Top Strand PCR Primer'. Then scroll to the location where you want to pick the second PCR primer, base 2200. Point to the consensus, hold down the right mouse button and release on "Bottom Strand PCR Primer". There will be a pause and then there will be a list of PCR primer pairs. Click on the first (top) pair and click "Accept Pair". You will now see pcr primers (in yellow with a red tip) at 404 and 2250. You can modify the parameters for choosing PCR primer pairs by going to the Consed Main Window, pointing to "Options", holding down the left mouse button, and releasing on "Primer Picking Preferences." For example, by default Consed does not display all PCR primer pairs--this would take too long and give you too many. However, you can ask it to show you all such pairs. In the Primer Picking Preferences, scroll down to "Check All PCR Pairs (huge) or Just Sample?" and click on "All". Then click on "Apply and Dismiss". Then pick PCR primers again, as above. Don't be surprised if you get 10,000 or more pairs of primers! (PCR Primers are screened for: melting temperature and length, the melting temperature of the 2 primers must be sufficiently close to each other, each primers must not stick to itself or to the other primer, no mononucleotide repeats, only ACGT's (no n's or ambiguity codes), and primer pair must not amplify any other location. There are many more details...) 14.16) ORDERING OF PRIMERS I heard of a finisher who manually ordered 72 primers. She had to cut/paste the bases of each primer. That is not only painful, but also error prone. I've supplied you a script that you can use to save to a file all primers that you have selected. 14.17) The primers and are saved in the ace file when you exit consed, so exit consed by clicking "quit" (not by clicking the X in the corner of the window). When it pops up a "warning--there are unsaved edits", click on "Save Before Quitting" and click "OK" to whatever name it offers, but remember that name. 14.18) The script is ace2Oligos.perl. Run it like this: ace2Oligos.perl standard.fasta.screen.ace.3 oligo.txt where standard.fasta.screen.ace.3 is replaced by whatever name it offered (above) when you exited consed and oligo.txt is the name of the file you want it to put the oligo in. It looks like: name=standard.1 sequence=ttattggcaattgggtga template=clone date=140204:155602 temp=56 name=standard.2 sequence=cactttggctttgattctgta template=clone date=140204:155602 temp=56 ace2Oligos.perl finds all oligo tags in the ace file and makes sure that all of them are in this primer file. ace2Oligos.perl does not record the comments that the finisher entered when creating the oligo. If you want to record that as well, you could use the script ace2OligosWithComments.perl which was written by a Consed user and thus is found in the 'contributions' directory. 14.19) ADD NEW READS (SANGER--NOT ILLUMINA OR 454) For this to work, your system administrator must have set up everything correctly. (See below in INSTALLING CONSED.) Assuming you have set everything up correctly, you can now experiment with adding reads. If you have consed up, terminate it. You should be in the standard/edit_dir directory (see the beginning of SANGER READS above). Copy the new chromatograms into the chromat_dir directory by typing (on the command line): cp ../chromats_to_add/* ../chromat_dir Bring it up consed again using the original ace file standard.fasta.screen.ace.1 If it asks if you want to apply edits, just say 'no'. On the Main Window, click on the Add New Reads button. There will appear a list of files ending with .fof. These are files that contain lists of chromatograms. Double click on 'reads_to_add.fof' (Accept the defaults for the other options in this window.) If you get an error message, look carefully at the full error message in the xterm to diagnose the problem. Probably there is some mistake in how you installed Consed. See INSTALLING CONSED (above). There should be lots of progress output in the xterm from which you started Consed. When it completes, there will be a Reads Added Window popup with a report of which reads were added. In this case, it should say that 9 reads were successfully added and list them. When you are using your own data, you will need to create the file reads_to_add.fof. This file should contain a list of just the file names of the reads/chromatograms you are going to add (not including directory names). The chromatagram files to be added should be placed in the ../chromat_dir and reads_to_add.fof should be put in edit_dir. If your assembly doesn't already contain any chromatograms, and you are adding them for the first time, a simple way to create the file reads_to_add.fof is to first copy the chromatagrams to be added into the (empty) ../chromat_dir and then type the following command (assuming you are inside of edit_dir): ls ../chromat_dir > reads_to_add.fof The reads_to_add.fof can now be used following the instructions above. 14.20) ADDING NEW READS IN BATCH (SANGER) You've seen (above) how to add Sanger reads using consed's graphical interface. You can also add Sanger reads in batch. For example, if you are sequencing the same region over and over and you have a reference sequence, phrap may not be a good choice for creating an assembly: phrap will take a long time to run (since many reads match each other), phrap may make several contigs when you know there should be only one, and phrap may not put all the reads into the assembly. Consed provides an alternative to phrap. See how to add Sanger reads to a reference sequence with the "standard" dataset. 14.21) cd to standard/edit_dir directory (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly. See above under "SANGER READS".) 14.22) Follow the instructions under "ADD NEW READS" above including cp ../chromats_to_add/* ../chromat_dir 14.23) In edit_dir, look at the file "reads_to_add.fof" which contains a list of the reads to be added. 14.24) Run: consed -ace standard.fasta.screen.ace.1 -addNewReads reads_to_add.fof -newAceFilename standard.fasta.screen.ace.20 When this completes, there will be a new ace file standard.fasta.screen.ace.20 with all the reads added. There will also be a custom navigation file that is named something like: standard.070913.141632.nav where 070913.141632 is the date and time so will be different for you. (See CUSTOM NAVIGATION below.) This will allow you to visually find each added read in the assembly, if you so choose. What Consed does is take each reads and try to align it against the reference sequence. It will thus attempt to make one contig with all of the reads in it. Some reads may not align very well against the reference sequence. In that case, you can tell consed what you want to do by the following parameter in the consedrc file: consed.addNewReadsPutReadIntoItsOwnContig: ifUnaligned means that if a read does not match the reference sequence very well, it will be put into its own contig. (For information on how to change the consedrc file, see EDIT PARAMETERS: HOW TO CHANGE CONSED/AUTOFINISH PARAMETERS elsewhere in this document.) consed.addNewReadsPutReadIntoItsOwnContig: never means that if a read does not match the reference sequence very well, it will not be put into the assembly at all. consed.addNewReadsPutReadIntoItsOwnContig: always means that each read is not even compared to the reference sequence, but just put into its own contig. Consensus quality values are not recalculated unless you put the following into your consedrc file: consed.addNewReadsRecalculateConsensusQuality: true USING YOUR OWN DATA: Create an edit_dir, phdball_dir, and chromat_dir as usual. Put the reference sequence, in fasta format, into edit_dir. Type: fasta2Ace.perl reference.fa -noread This will create an ace file for an assembly that just contains the single reference sequence as the consensus and has no reads. Run consed to view it and make sure you have followed each of these steps successfully so far. (If you have multiple reference sequences, put them all in reference.fa and run fasta2Ace.perl just as shown above. Each reference sequence will be a separate contig.) Then run "consed -addNewReads" as shown above with the standard dataset. (There is a little used feature to add a single polymorphism tag at a defined position. Examine fasta2Ace.perl for more information.) 14.25) ADDING NEW SANGER READS IN BATCH TO TARGETED REGIONS Suppose that you are trying to close a gap by sequencing on a PCR product. The read matches several other locations besides the edges of the gap, but you know it goes at the edge of the gap. You can direct consed to put the read at the edge of the gap. To do so, you construct a list of reads to be added (which are assumed to be in ../chromat_dir). On the line with each read is a region that you want the read to go into, in this format: (read name) (contig name) (start of region) (end of region) I suggest that you make the (start of region) and (end of region) bigger than necessary so there is little chance that the read will protrude out of the region (which would be bad since the protruding part of the read would be unaligned, even if it should be aligned). Let's try doing this. 14.26) If you are not already in standard/edit_dir (check with 'pwd'), cd to standard/edit_dir directory. (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 14.27) Follow the instructions under "ADD NEW READS" above including cp ../chromats_to_add/* ../chromat_dir 14.28) In edit_dir, look at the file "reads_to_add_to_regions.fof" which contains a list of reads and regions: djs74-1455.s1 Contig1 1100 2500 djs74-1465.s1 Contig1 1000 2500 djs74-2282.s1 Contig1 500 1700 djs74-2712.s1 Contig1 1 2000 djs74-2861.s1 Contig1 1 2000 djs74-536.s2 Contig1 500 2000 djs74-568.s1 Contig1 500 2000 djs74-649.x1 djs74-867.s1 Notice that the last 2 do not have contig positions--this indicates that consed should look for the best alignment of these reads anywhere in the assembly. 14.29) Type: addSangerReads.perl standard.fasta.screen.ace.1 reads_to_add_to_regions.fof There will be a huge amount of output, which is mainly cross_match trying to align the reads. The output will end with something like this: writing standard.fasta.screen.ace.6 See log file: standard.120718.170851.out (where the numbers will be different) 14.30) Look at the log file by typing: less standard.120718.170851.out (where the numbers will be different). It should tell you that all the reads went into each region. 14.31) Bring up consed on the new assembly: consed -ace standard.fasta.screen.ace.6 (where the number might be different). You will be able to see the newly added reads. By default, reads that do not align are put into their own contigs. If you prefer these reads not go into the assembly, set: consed.addNewReadsPutReadIntoItsOwnContig: never 14.32) CONSED-POLYPHRED INTERACTION TO REVIEW POLYMORPHIC SITES This example applies not just to polyphred, but to any program that would find for you particular positions on the consensus. Polyphred is a program for finding polymorphic sites; it was developed by Debbie Nickerson's group (contact them at http://droog.mbt.washington.edu). We have a example database, 'polyphred', which has had polyphred run on it already. Polyphred has put a polymorphism tag on each polymorphic site. If Consed is running, exit it. For this exercise, use the dataset called "polyphred". Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). Type: cd polyphred/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) ls Restart Consed. Double click on example2.fasta.screen.ace.1 When Consed comes up, you should see 2 contigs. Double click on Contig2 In the Aligned Reads Window, push the left mouse button while pointing to the 'Navigate' menu and release on: 'Toggle feature: when navigating to consensus location, pop up all traces (currently off)' That will turn this feature on. Now point to the 'Navigate' menu, hold down the left mouse button, and release on 'Tags'. Up should pop a list of tag types. Double click on 'polymorphism'. Polyphred has already been run so the consensus is tagged with polymorphism tags at each polymorphic site. Up will pop a window labelled 'Polymorphism Tags' with a list of sites. Click on 'Next'. If you correctly followed the instructions above, all the traces should pop up at the first polymorphic site. You may want to reposition the traces window to see it better. Now ignore the original 'Polymorphism Tags' window and instead click on 'Next' in the *traces* window. This will take you to the next polymorphic site. Pretty nice, huh? Many labs write programs that apply tags to the consensus, and then their staff uses consed to review those sites using the procedure above. ---------------------------------------------------------------------------- 15. 454 READS 15.1) USING 454 READS (NEWBLER ASSEMBLY) The Newbler Assembler and Consed work hand-in-glove together. For this exercise, use the dataset called "454_newbler". Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). 15.2) Type: cd 454_newbler/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 15.3) Restart Consed 15.4) Double click on "454Contigs.ace.1". You will see 2 contigs in the list: contig00001 contig00002 15.5) Double click on contig00001 to bring up the Aligned Reads Window 15.6) Using the thumb at the bottom, scroll from the far left of the contig all the way to the far right to get an idea of the assembly. (It is a very small one.) 15.7) In the Aligned Reads Window, scroll to position 246 and middle mouse click on the t in read ERQJC7K01CLG7G (which is probably the second to bottom read on the screen). A "trace" (curves made of dotted lines) should pop up. This trace is fake (but it allows you to edit the bases). Now terminate consed. In the current directory (edit_dir), create a file called "consedrc" with one line in it: consed.storeTracePeakPositions: always (If you don't know how to create files in unix, learn. See CONSED CUSTOMIZATION below.) Restart consed and follow the steps above up to and including clicking on the t in read ERQJC7K01CLG7G at position 246. Now a different looking trace should pop up. Rather than having dotted curves, it should have vertical rectangles of different colors. If this kind of trace does not pop up, there is an installation problem. See INSTALLING CONSED (above). Look closely at the error message that pops up and the error message in the xterm where Consed was started. They will indicate where Consed is expecting to find sff2scf and what the problem is. Unlike chromatograms from fluorescent sequencers, the spacing and width of the peaks is meaningless, but the height of the peaks is the actual intensity of the light emitted during each of the 454 cycles. When the light intensity indicates that there is more than one base in a row, instead of having a very tall peak, we break the peak up into n peaks where n is the number of repeated bases. 15.8) Look at the 3 "T" peaks at positions 244 through 246 ("con" line) or 195 through 197 ("rd" line) in the traces window. Notice that the rightmost peak is higher than the others. The reason for this is that the intensity of the light emitted was not exactly three times that of a single normal base, so we made the trace show the left peaks as high as a standard peak and the height of the rightmost peak is whatever amount of intensity is left over. Look back in the Aligned Reads Window and you will see that all other reads have 4 t's instead of 3. So the reason that the 3rd peak is higher than the others is that there are probably 4 t's here instead of 3. 15.9) In the Aligned Reads Window look at the t at unlabelled position between 228 and 229 of read ERQJC7K01A7AUR (probably the 2nd read from the bottom). Notice that none of the other reads have a t at this position, so this read may have a base-calling error. Middle mouse click on this t. You will see in the Trace Window that the t peak is shorter than a normal peak, confirming this suspicion. 15.10) Let's take a look behind the scenes: Terminate consed and examine the contents of ../chromat_dir (it should be empty), ../phd_dir (it should be empty), ../phdball_dir (it should contain phd.ball.1), and ../sff_dir (it should contain reads.sff). Since there are no files in chromat_dir, there are no traces initially. When you click to see a trace, Consed runs the program sff2scf which creates the trace for the read you are interested in by reading reads.sff (which has all the intensity information of each base in each read). Jim Knight of 454 corporation did a great job developing it. Each time a 454 trace pops up, tip your hat to Jim! Consed runs sff2scf for example like this: sff2scf sff:-f:pairedreads.sff:ERQJC7K01C3R2X where pairedreads.sff is the name of the sff file and EBE03TV02D2D4F is the name of the read (without the _left or _right extension). It will write the scf file into /tmp where Consed will read it. 15.11) On the Consed Main Window is a button on the left labelled "Assembly View". Click it. You will see 2 grey bars labelled "contig00002c" (the "c" is for "complemented") and contig00001. This indicates that the left end of contig00002 is connected to the left end of contig00001 (the right end of contig00002c is the left end of contig00002). Newbler has given Consed forward-reverse pair information which Consed has used to determine this orientation of the contigs--another great Jim Knight job. You will learn more about the other graphics here in the Assembly View section (below). 15.12) USING 454'S NEWBLER ON YOUR OWN DATA First you should run through the tutorial above so that you know that everything works with my example dataset. 15.13) Run Newbler according to the 454 documentation using the -consed option. 15.14) Delete the consedrc file that Newbler creates in edit_dir--it is intended for obsolete versions of consed and may cause problems with the current version. 15.15) Delete the phd.ball link in edit_dir--it is also intended for obsolete versions of consed and may cause problems with the current version. 15.16) Check that the current version of sff2scf is the one to be used. Type "sff2scf -v" It should say "080721" (or later). If instead it says "Error: Unable to open SCF file: ../chromat_dir/-v", your version is old and should be discarded. Use the new version that comes with consed. ---------------------------------------------------------------------------- 16. USING AUTOPCRAMPLIFY If you have a fasta sequence, and you want to amplify part of that sequence using pcr, and you want to select a pair of PCR primers, you can do that using Consed's autoPCRAmplify function. It can handle very high throughput: on a slow computer it takes about 5 minutes to find PCR primers for a hundred different regions. For this exercise, use the dataset called "autoPCRAmplify". 16.1) Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). 16.2) Then type: cd autoPCRAmplify (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 16.3) Type: ls You will see there is one file, brian.fa 16.4) Look at what is in this file: more brian.fa You will see it looks like this: >AP000527.C22.6.mRNA.primerRegion 1 70 81 150 smallest ACAGGGCCCCTCGCGGGCCCTGACGCAGGATGGAGTTGAGGTGGGGGCAG CGCTGGACCCCAGGGCCCCTNNNNNNNNNNTGCCGCAGTCTTGGATGATG GGTTCCTAGAAGCTCTCAACATCTCTTCTTAATTGGAGAAAGTGTTAAGC >AC004019.C22.4.mRNA.primerRegion 1 70 81 150 smallest AGCTGTGAGCTGTGCAATCATGTAACTAACTTTGTTTAAGTATTGTTTAG TCTTTCTGGTCTCCAGATGANNNNNNNNNNTCAGACATTCCACAGCTACC TAGAGGACATCATCAACTACCGCTGGGAGCTCGAAGAAGGGAAGCCCAAC The numbers 1 70 81 150 means that the left primer should be selected from the region from 1 to 70 of the sequence (starting at 1), so the primer should be chosen from the sequence: ACAGGGCCCCTCGCGGGCCCTGACGCAGGATGGAGTTGAGGTGGGGGCAG CGCTGGACCCCAGGGCCCCT The 81 150 means that the right primer should be selected from the region from 81 to 150 of the sequence, i.e. from within: TGCCGCAGTCTTGGATGATG GGTTCCTAGAAGCTCTCAACATCTCTTCTTAATTGGAGAAAGTGTTAAGC "smallest": -------------------- ------------------------- ---> <--- "biggest": -------------------- ------------------------- ---> <--- The word "smallest" means that the primers should be chosen so that the product is as small as possible. That means that the left primer should be chosen as far as possible to the right within the 1-70 region and the right primer should be chosen as far as possible to the left within the 81-150 region. If we had instead put "biggest", the primers would instead have been chosen to make the PCR product as large as possible. Notice that in the diagram above, I didn't make it look like this: "smallest": -------------------- ------------------------- ---> <--- (the primers are at the very edge of the regions). The reason is that in general, due to other checks on the primers, the primers that would make the absolute smallest product are not acceptable, and the primers must be backed up. Similarly for "biggest". 16.5) Then run the following: amplifyTranscripts.perl brian.fa (The name comes from the fact that this perl program was originally developed to amplify cDNA transcripts.) You should see a page or two of output flash by the screen, ending with: --------------------------------------------------------- working on AP000527.C22.6.mRNA.primerRegion --------------------------------------------------------- working on transcript AC004019.C22.4.mRNA.primerRegion (2 out of 2... --------------------------------------------------------- working on AC004019.C22.4.mRNA.primerRegion --------------------------------------------------------- see files primers_unsorted.txt for primers and failures.txt for failures (if any) 16.6) Look at the files just created in your directory. You should see "failures.txt" which should be empty. And you should see "primers_unsorted.txt" which should contain: PRIMER_PAIR { Region: AP000527.C22.6.mRNA.primerRegion Product size: 67 AP000527.C22.6.mRNA.primerRegionf: AGTTGAGGTGGGGGCAGC temp: 64 AP000527.C22.6.mRNA.primerRegionr: CATCATCCAAGACTGCGGC temp: 63 } PRIMER_PAIR { Region: AC004019.C22.4.mRNA.primerRegion Product size: 59 AC004019.C22.4.mRNA.primerRegionf: TTTAGTCTTTCTGGTCTCCAGATGA temp: 61 AC004019.C22.4.mRNA.primerRegionr: TCTAGGTAGCTGTGGAATGTCTGA temp: 60 } This gives the primers. The top strand primer is the one ending in 'f' and the bottom strand primer is the one ending in 'r'. Both are in 5' to 3' orientation, so the 'r' primer is reverse complemented from the sequence in the original fasta file. 16.7) To put these primers into 96 well format for ordering, type orderPrimerPairs.perl no You will see output like this: > orderPrimerPairs.perl no finished sorting attachments: /me1/gordon/sunny/autoPCRAmplify_answer/brian.fa "brian.fa", /me1/gordon/sunny/autoPCRAmplify_answer/primers_sorted.txt "primers_sorted.txt", /me1/gordon/sunny/autoPCRAmplify_answer/to_order.txt "to_order.txt" Type ls and see that there will be the following 4 files created: to_order.txt, which is the primers in 96 well format, tab-separated so this can be easily imported into Excel primers_unsorted_shorter.txt, which in this case is identical to primers_unsorted.txt (Don't ask.) primers_sorted.txt, which has the same primers once again, but this time they are sorted by product size. If you have thousands of primers, you may want to run all the big ones together on the thermocycler, then all the next longest ones, etc. primers081205.fasta (in which 081205 is the current date in YYMMDD) the same primers, YET AGAIN, but this time in fasta format, in case you want to use them in some other program that wants fasta format ---------------------------------------------------------------------------- 17. USING AUTOPRIMERS This feature is used when you have a known template and you want to choose custom sequencing primers to cover the template (both strands) with roughly equally spaced reads. It is assumed that the template is an insert in a vector and that there are 2 universal primer sites in the vector: 1 top strand and 1 bottom strand: ----------------iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii------------------ ---> <--- a1: top strand univeral primer b1: bottom stand universal primer <------><-------------------------------><---------> A I B -->UP where A is the distance from the end of universal primer a1 to the insert and B is the distance from the end of the universal primer b1 to the insert and I is the insert length and UP is another (optional) top strand universl primer that is always used.. Top strand reads cover both A and I. Bottom strand reads cover both I and B. To use this feature, put your insert sequences into a fasta file and run: autoPrimers.perl (fasta file name) You will need to edit autoPrimers.perl for your values of A, B, and your target read length. The program will attempt to make reads all roughly the same size and as small as possible and not less than: consed.autoPrimersMinReadLength: 500 A (above) is: consed.autoPrimersLeftUniversalPrimerDistanceToInsert: 50 B (above) is: consed.autoPrimersRightUniversalPrimerDistanceToInsert: 50 If you have an UP primer, then the distance from it to the end of the insert is: consed.autoPrimersDoNotTryToSequenceTopStrandThisManyBasesOnRightEndOfInsert: 198 I is calculated by consed and is the unpadded length of the consensus sequence of the contig. All consed.primers... resources are relevant, but the most commonly modified ones are: consed.primersMinMeltingTemp: 55 consed.primersMaxMeltingTemp: 60 consed.primersMinimumLengthOfAPrimer: 22 consed.primersMaximumLengthOfAPrimer: 40 The complete vector sequence must be put into a fasta file cloning_vector.fa (for example) and this resource must specify its path: consed.primersSubcloneFullPathnameOfFileOfSequencesForScreening: /net/grc/vol3/home/dgordon/consed_demo/moderna/cloning_vector.fa If the vector is circular, this file must be the vector sequence starting at the right end of the insert and continuing to the left end of the insert. People tend to want to write the vector sequence by just removing the insert sequence and concatenating the sequence before and after the insert together. This will not work. There are 4 resources that must always be left alone in autoPrimers.perl: print filConsedrc "consed.autoFinishMinNumberOfErrorsFixedByAnExp: 0.0\n"; print filConsedrc "consed.primersMinQuality: 0\n"; print filConsedrc "consed.autoFinishAllowCustomPrimerSubcloneReads: false\n"; print filConsedrc "consed.autoFinishAllowWholeCloneReads: true\n"; ---------------------------------------------------------------------------- 18. USING AUTOREPORT Autoreport is a command-line (non-graphical) method of running consed to report information about the assembly. 18.1) VARIANTS REPORT Let's try the consed.autoReportPrintHighlyDiscrepantRegions feature. For this exercise, use the dataset called "solexa_example_answer". 18.2) Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). 18.3) Then type: cd solexa_example_answer/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 18.4) Type: ls You should see a file ref.ace.1 We will need the consedrc file in this directory with the following in it: consed.autoReportPrintHighlyDiscrepantRegions: true consed.navigateByHighlyDiscrepantPositionsIgnoreBasesBelowThisQuality: 12 These options were explained above under "Search for highly discrepant positions". To create this file, do the following: 18.5) EDIT PARAMETERS: HOW TO CHANGE consedrc PARAMETERS This section applies not only to autoreport, but also to autofinish, autoedit, and customizing consed. You can edit consedrc using an editor, such as pico, or you can do it with Consed, which is far easier. To do it with Consed, bring up consed as follows: 18.6) type: consed -editConsedrc A window should come up with many consedrc parameters. 18.7) Find consed.autoReportPrintHighlyDiscrepantRegions. You can easily do this by typing in the "Find Parameter" box at the bottom "printhighly" and click on "Find First". 18.8) Click "True". This item should turn red, indicating that it is now different than the default value. 18.9) Find "consed.navigateByHighlyDiscrepantPositionsIgnoreBasesBelowThisQuality" As in the step above, in the "Find Parameter" box at the bottom, type "ignorebases" which will be enough to find it. 18.10) Change the default of 20 to 12. (The default is actually better for finding real variants, but there aren't any real variants with this dataset so if you leave it at 20 you won't get any output.) 18.11) At the bottom of the Edit consedrc Window, click "Just project". Then click "save". A box titled "Name of parameter file to write" should pop up. Click "OK". That box will disappear and a box saying "Note that these new parameters will take effect only after restarting Consed/Autofinish" will popup. Click "Dismiss" on that box and click "Dismiss" on the "Edit consedrc Window". All windows should disappear. 18.12) Back on the command line, type: ls -al and you should see consedrc 18.13) Type: more consedrc and see that it should contain just this: consed.autoReportPrintHighlyDiscrepantRegions: true consed.navigateByHighlyDiscrepantPositionsIgnoreBasesBelowThisQuality: 12 (Get in the habit of checking consedrc after using Consed's Edit consedrc Window.) Why doesn't consedrc contain these others (below) as well? See if you can figure that out. consed.navigateByHighlyDiscrepantPositionsMinDiscrepantReads: 2 consed.navigateByHighlyDiscrepantPositionsMaxDepthOfCoverage: 100000 consed.navigateByHighlyDiscrepantPositionsJustListIndels: false consed.navigateByHighlyDiscrepantPositionsIgnoreOtherReadsStartingAtSameLocation: false 18.14) Type: consed -ace ref.ace.1 -autoreport There will be a lot of output ending with something like: see ref.ace.1.081211.160556.out where 081211.160556 will be replaced by your current date and time. 18.15) Type: more ref.ace.1.081211.160556.out (replace this by the name of your file) This file will contain a huge amount of output (listing the parameters used in the run)--the important part is at the end: printHighlyDiscrepantRegions { Highly Discrepant Positions min # of discrepant reads: 2 min quality: 12 "r": base of reference seq max depth of coverage: 100000 and ignoring reference seq A C G T * pos contig 2 8.0% 23 92.0%r 0 0.0% 0 0.0% 0 0.0% 56 ref 3 9.1% 30 90.9%r 0 0.0% 0 0.0% 0 0.0% 252 ref 2 6.9% 27 93.1%r 0 0.0% 0 0.0% 0 0.0% 256 ref 0 0.0% 0 0.0% 20 90.9%r 2 9.1% 0 0.0% 682 ref 0 0.0% 0 0.0% 31 93.9%r 2 6.1% 0 0.0% 715 ref 2 4.8% 40 95.2%r 0 0.0% 0 0.0% 0 0.0% 742 ref 2 8.7% 21 91.3%r 0 0.0% 0 0.0% 0 0.0% 936 ref 0 0.0% 1 2.4% 1 2.4% 39 95.1%r 0 0.0% 982 ref } printHighlyDiscrepantRegions This output is explained above under "Search for highly discrepant positions". Programmers: if you want to run this report automatically and have the results parsed, there is also a file auto.fof which will contain the name of this output file. 18.16) In this case we need a private copy of the dataset called "standard" (see GETTING YOUR OWN COPY OF A SAMPLE DATASET above). 18.17) Then Type: cd standard/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 18.18) Type: consed -editconsedrc Follow the example above and this time make consedrc have only the following two lines: consed.autoReportPrintLowConsensusQualityRegions: true consed.autoReportPrintSingleSubcloneRegions: true 18.19) Run autoreport as follows: consed -ace standard.fasta.screen.ace.1 -autoreport (where "consed" must be replaced by whatever command your system administer says to use). You will see something like this: > consed -ace standard.fasta.screen.ace.1 -autoreport couldn't open readOrder.txt--that's ok opened file standard.070918.162756.out for output Now setting quality values Number of individual phd files read: 24 Total reads in assembly: 24 Finished setting quality values in 0 seconds see standard.070918.162756.out 18.20) Look at standard.070918.162756.out (where the 070918.162756 will be replaced by the current date and time). Scroll down to the bottom (where the important information is) and you will see: lowConsensusQualityRegions { Contig1 (consensus) 1-83 base quality below threshold Contig1 (consensus) 85-110 base quality below threshold Contig1 (consensus) 113-117 base quality below threshold Contig1 (consensus) 120-156 base quality below threshold Contig1 (consensus) 159-166 base quality below threshold Contig1 (consensus) 168-171 base quality below threshold Contig1 (consensus) 185-187 base quality below threshold Contig1 (consensus) 189-190 base quality below threshold Contig1 (consensus) 192 base quality below threshold Contig1 (consensus) 194-199 base quality below threshold Contig1 (consensus) 269 base quality below threshold Contig1 (consensus) 271-275 base quality below threshold Contig1 (consensus) 2584-2591 base quality below threshold } lowConsensusQualityRegions singleSubcloneRegions { Contig1 (consensus) 1-199 199 bp single subclone Contig1 (consensus) 2588-2591 4 bp single subclone } singleSubcloneRegions This gives the low consensus quality regions and the single subclone regions. 18.21) If you want to specify an output file name (something other than standard.070918.162756.out, you can do so by running autoreport like this: consed -ace standard.fasta.screen.ace.1 -autoreport -outputFile myVeryOwnFileName.out where myVeryOwnFileName.out can be anything you want. ---------------------------------------------------------------------------- 19. FEATURES FOR SNP ANALYSIS 19.1) FINDING SNPS IN CONSED USING THE METHOD OF Li, Ruan, and Durbin (2008), 19.2) On the Consed Main Window, point to the 'Navigate' menu, hold down the left mouse button, and release on 'Search for SNPs'. Up will pop the SNPs window. Here is a typical line: sub het 37 27 30 A G 1_53001_270001 83,976 30976 sub = "substitution polymorphism". Other choices are del and ins for deletion or insertion. 37 is the genotype quality Ignore the 27. 30 is the read depth A is the reference base G is the alternate allele 1_53001_270001 is the contig name 83,976 is the reference position 30976 is the contig position 19.3) FINDING SNPS IN BATCH Put the following into consedrc: consed.autoReportCalculateGenotypes: true (see CONSED CUSTOMIZATION below). Then run: consed -ace (ace file) -autoreport The SNP calls will be produced in VCF format. 19.4) TAGGING A REFERENCE SEQUENCE If you want to download known snp sites with, say 200 base pairs on each side of the snp site, then fasta2Ace.perl can tag the reference sequence(s) if you know the location of the snps. You run it as follows: fasta2Ace.perl (fasta file) -polymorphism 201 where 201 is replaced by whichever base position the snp is in each sequence in (fasta file), which is a file of just the reference sequences. Then you run addNewReads (see README.txt that comes with consed). After doing the alignments with addNewReads, you can find each snp site by on the Consed Main Window/ Navigate / Tags in All Contigs / polymorphism. A box will popup with a list of all of the polymorphism tags. You can click "next" repeatedly to view each snp site. ---------------------------------------------------------------------------- 20. BIONANO DIGEST GENOME MAPS BamScape can be used with Bionano Genomics digest genome maps to help find misassemblies by showing regions that are confirmed by the Bionano digest, regions that disagree with the Bionano digest, and regions that are unsupported. 20.1) Review running bamScape by doing the complete exercise (above) labelled "USING BAMSCAPE". 20.2) Terminate bamscape. 20.3) Start bamScape again by typing: consed -bamScape -bamFile reads.sorted.bam -referenceFOF bamScapeReference.fof -bionanoXmapFile test.xmap -bionanoKeyFile test.key As in the exercise USING BAMSCAPE (above) bring up the Reads vs Reference Window for reference sequence 23. You will now see an additional panel labeled "Bionano Digest". This graph has a vertical scale on the left labeled with numbers 0, 1, 2, and 3. You will also see thick vertical blue bars, and thin horizontal bars of various colors. Some background: The user sequences and assembles the DNA, giving a consensus sequence. Bionano does an insilico restriction digest of this consensus, giving a map with restriction sites and distances between the restriction sites. Independently of this, the Bionano technology digests the physical DNA and finds the location and distance between the restriction sites. So this gives 2 restriction site maps, including distances between the restriction sites. Bionano aligns these 2 maps. Ideally, there would be a perfect match with neither map having regions that cannot be found in the other. In the real world, not only are there regions not found in the other, but there are also regions found in one that are found multiple times in the other. bamScape attempts to alert the user to these types of problems. Back to the bamScape display: green (height 1) means that there is a single location in the digest map matching this location in the sequence. Any other # is bad: yellow means there is NO location in the digest map matching this location in the sequence. 3 means there are 3 (or more) locations in the digest map matching this location in the sequence map. Typically this indicates that there really are 3 (or more) locations in the DNA that have similar sequences...so similar that the assembly program thought all of the reads were from a single location and put them all together, commonly refered to as a "collapsed repeat." The blue bars themselves generally indicate a problem--they indicate that matching location between the 2 maps has terminated. Why? 20.4) Point to the leftmost blue vertical bar. The status lines at bottom will say: no match segments in digest space to left of this one This indicates that in digest space there is no matches further to the left of this match. Point to the next blue vertical bar and the status lines will say: NEXT sequence region is 23 61,234-71,234 1000 bp away to right in digest space Be aware that this sequence region includes the 3rd, 4th, 5th, and 6th blue vertical lines (counting from 1 at the left). The region between the 4th and 5th blue lines is orange, indicating that this location matches 2 different regions in digest space. ---------------------------------------------------------------------------- 21. LESS USED CONSED FEATURES 21.1) CHANGING THE CONSENSUS IN BATCH ACCORDING TO A SCRIPT [CHANGE CONSENSUS] consed -ace (ace file) -changeConsensus (change file) where "change file" is a file with lines like this: Contig21 28-30 x where Contig21 is the contig, 28-30 are the unpadded positions and x is the new base. You can also specify the positions in padded positions like this: Contig21 *35-*40 c where 35 and 40 in *padded* positions. You might prefer using padded to unpadded positions if, for example, you are have some pads in the consensus that you want to change to other bases. unpadded positions would not be useful because they only refer to non-pad bases. 21.2) EXPORTING SCAFFOLDS [ EXPORT SCAFFOLDS ] Go to assembly_view/edit_dir and type: consed -ace assembly_view.fasta.screen.ace.1 -exportScaffolds scaffolds.fa (where "consed" is replaced by whatever command brings up consed on your system). Look at scaffolds.fa Contigs are separated by 50 n's. You can change the 50 to some other number by modifying the consedrc parameter: consed.exportScaffoldsNsBetweenContigs: 50 If you prefer to export in fastq format, add the following to your consedrc file: consed.exportScaffoldsFastaOrFastq: fastq This will give Sanger (+33) encoding of the fastq file. If you prefer Illumina (+64) encoding, you can add the following to your consedrc file: consed.solexa64FastqOrSanger33FastqForOutput: 64 However, this will cause a problem since assemblies often have bases that are very high quality which will give non-printing characters (e.g., 90 which will give 154 when Illumina-encoded). By default, the contigs are not trimmed at the low-quality ends. If you want the low quality bases at the ends of contigs to be trimmed, you can do that: consed.exportScaffoldsTrimEnds: true Note that the trimming looks for the maximal high quality segment of bases quality 13 and above. That means that if the left end of the contig has bases that are quality: 17 21 9 9 9 9 9 9 13 11 11 11 11 20 20 20 ... all of these bases to the left of the three 20's will be clipped off. The error probably of the 9's and the 11's is great enough to overcome the 3 higher quality bases with 17, 21, and 13. The quality threshold of bases to be kept is set at 13. I suggest leaving it there. If you are determined to change it, modify the consedrc parameter: consed.exportScaffoldsTrimEndsQuality: 13 21.3) ADDING PAIRED ILLUMINA READS USING CROSS_MATCH If you have a reference sequence or an existing assembly, you can use this procedure to align, using cross_match, additional paired Illumina reads to that reference sequence or consensus of the assembly and make the reads part of a new consed-ready dataset. 21.4) Exit Consed and type: cd illumina_paired/edit_dir (This is not the same directory you were in for the examples above, which was solexa_example_answer/edit_dir You may need to first type cd ../.. depending on which directory you are currently in.) 21.5) Type: ls You should see ref.fa and fastq.fa 21.6) Look at them: more ref.fa This is just the reference sequence in fasta format more fastq.fof This contains a pair of Illumina fastq files, the "1" reads in paired_illumina1.fq and the "2" reads in paired_illumina2.fq 21.7) Type: ls ../solexa_dir You will see the 2 Illumina fastq files: paired_illumina1.fq paired_illumina2.fq 21.8) First make sure you are still in illumina_paired/edit_dir by typing: pwd which should say something that ends with: illumina_paired/edit_dir Convert the reference sequence into an assembly by typing: fasta2Ace.perl ref.fa -noread There should now be a file ref.ace in this directory. To check that everything is fine, bring up Consed and double click on "ref.ace" You should see an assembly with 1 contig--the reference sequence called 'c_elegans_piece'. If you look at that contig in the Aligned Reads Window (by double-clicking on it), you will notice no reads. Terminate Consed. 21.9) Type: addSolexaReads.perl -ace ref.ace -fastqfof fastq.fof -fasta ref.fa There will be a flurry of output from various programs ending with something like this: Inserting pads in contigs to accommodate insertions in new reads...Done ending insertPadsInContigs 0 Inserting pads in reads and setting read bases... writing new ace file ref.ace.1 writing ref.ace.1 See new ace file ref.ace.1 0.0 minutes cross_match and -phdBall2Fasta time 0.0 minutes consed time 0.0 minutes total time If you instead get error messages, the Consed/cross_match package is not installed correctly. See INSTALLING CONSED (above). 21.10) Type: ls You should now see the file ref.ace.1 21.11) Start Consed and double click on ref.ace.1 Double click on the contig 'c_elegans_piece' in the Contig List. The Aligned Reads Window will popup. Scroll around a little. If you have trouble with any of these steps, you can compare your results to the correct result in the directory illumina_paired_answer/edit_dir (and illumina_paired_answer/phdball_dir). There is an option to fasta2Ace.perl that allows you to tag each snp if you know its position. See TAGGING A REFERENCE SEQUENCE (below). There is also an option to addSolexaReads.perl to not all of the reads but rather a subset of the reads. 21.12) ADDING UNPAIRED ILLUMINA READS USING CROSS_MATCH If you have a reference sequence or an existing assembly, you can use this procedure to align, using cross_match, additional unpaired Illumina reads to that reference sequence or consensus of the assembly and make the reads part of a new consed-ready dataset. 21.13) Exit Consed. For this exercise, use the dataset called "solexa_example". 21.14) Make a copy (so you can modify it all you want) following the instructions GETTING YOUR OWN COPY OF A SAMPLE DATASET (above). 21.15) Then type: cd solexa_example/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 21.16) Type: ls You should see ref.fa and solexa_files.fof 21.17) Type: more ref.fa This just contains the reference sequence in fasta format 21.18) Type more solexa_files.fof It contains just one line for each fastq file. We have just one such file so there is just one line: solexa_reads.fastq where solexa_reads.fastq is a solexa fastq file (note that "solexa fastq" is different than normal fastq--don't mix them up). 21.19) Type: ls ../solexa_dir You will see this file: solexa_reads.fastq 21.20) First make sure you are still in solexa_example/edit_dir by typing: pwd which should say something that ends with: solexa_example/edit_dir Convert the reference sequence into an assembly by typing: fasta2Ace.perl ref.fa (This is the old method that doesn't use -noread The newer method using -noread creates an ace file without any reads. In both cases the consensus matches ref.fa) There should now be a file ref.ace in this directory and a file phd.ball.1 in ../phdball_dir To check that everything is fine, bring up Consed and double click on "ref.ace" You should see an assembly with exactly 1 read--the reference sequence called 'ref'. Terminate Consed. 21.21) Type: addSolexaReads.perl -ace ref.ace -fastqFOF solexa_files.fof -fasta ref.fa There will be a flurry of output from various programs ending with something like this: Inserting pads in contigs to accommodate insertions in new reads...Done ending insertPadsInContigs 0 Inserting pads in reads and setting read bases... now saving assembly... 0 writing ./ref.ace.1 See new ace file ref.ace.1 done 0 See log file: ref.080627.111305.out 0.0 minutes to make fasta files 0.0 minutes cross_match time 0.0 minutes consed time 0.0 minutes total time If you instead get error messages, the Consed/cross_match package is not installed correctly. See INSTALLING_CONSED (above). 21.22) Type: ls You should now see the file ref.ace.1 21.23) Start Consed and double click on ref.ace.1 Double click on the contig 'ref' in the Contig List. The Aligned Reads Window will popup. Scroll around a little to convince yourself that you have created exactly the same assembly as you used in the exercises above under "USING ILLUMINA READS". There is an option to fasta2Ace.perl that allows you to tag each snp if you know its position. See TAGGING A REFERENCE SEQUENCE (below). 21.24) ALIGNING ILLUMINA READS USING CROSS_MATCH AGAINST A LARGE GENOME AND SELECTING A SMALL REGION FOR VIEWING WITH CONSED In many applications, you will want to align your Illumina reads against a large genome (such as the human genome) even though you are only interested in some part of that genome. For example, you might only be interested in reads that do map *best* to the region of interest, and thus you must map them against the entire genome to be sure they don't match better to some other location. Consed handles this by allowing you to run cross_match against a large genome and then allows you to specify certain regions of interest to view with consed. This exercise shows you how to do this. 21.25) Type: cd selectRegions/edit_dir (You may need to first type cd ../.. depending on which directory you are currently in.) ls You will see: solexa_files.fof which is a list of the Illumina Gerald fastq files. refs.fof which is a file of filenames of the reference sequences. Typically refs.fof will be a list of the fasta files of the genome such as the human genome, but in this case it contains just one filename, ref.fa which is a small fasta file. (I made it small so it doesn't take you too long to download consed.) regions.txt is a file specifying the regions that you are interested in. 21.26) Run: alignSolexaReads2Refs.perl solexa_files.fof refs.fof my_alignments.fof (my_alignments.fof will be created by this program--it could be any name) The last line should say "see my_alignments.fof" At this point you have aligned all of the Illumina reads against the reference sequence ref.fa (If you want to see those alignments, look in my_alignments.fof and then look in the file in my_alignments.fof, and then look through the pages of output for the ALIGNMENT lines.) Now suppose that we are interested in the following two regions: Bases from 1 to 100, and from 901 to 1000. 21.27) type: more regions.txt This indicates to consed that you are interested in these 2 regions. It also shows the path of the fasta file (ref.fa) which contains the sequence ref. In this case there are only 2 regions specified, but there is no reason, with your own data, that you couldn't specify thousands of regions. 21.28) type: selectRegions.perl regions.txt my_alignments.fof my_new_ace.ace The last line of the output should say something like: writing my_new_ace.ace.2 21.29) type: consed -ace my_new_ace.ace.2 (where "consed" is replaced by whatever command brings up consed on your system). You should see 2 contigs: ref_1 and ref_901 which are the 2 regions specified. There should be a total of 213 reads--211 Illumina reads and 2 fake fasta file reads. Bring up the Aligned Reads Window and scroll around a bit. In the contig ref_901 you will notice that the left end of the contig is numbered 901--the consensus numbers refer to the positions in the original reference sequence. Thus if your reference sequence were, for example, a chromosome, the numbers would be chromosome positions. If you would like, you can see the consensus numbers starting at position 1. To do this, point to a "Misc" menu, hold down the left mouse button, and release on "Turn On/Off User-Defined Consensus Scale Numbers". For those of you interested in what is happening behind the scenes, you might want to look at the files in phdball_dir: phd.ball.1 contains all 867 of the Illumina reads, phd.ball.2 contains the 2 fake reads representing the 2 regions, and phd.ball.3 contains just the 211 Illumina reads that align to the 2 regions. my_new_ace.ace.2 tells consed that it only needs to read phd.ball.2 and phd.ball.3 Also note that this uses cross_match as the alignment engine. 21.30) ALIGNING YOUR OWN ILLUMINA DATA USING CROSS_MATCH TO A REFERENCE SEQUENCE OR CONSENSUS OF AN ASSEMBLY You first must complete the exercises above using the example Illumina data so you are confident you are doing the process correctly. After you have done the exercises above, create a project directory with subdirectories: phdball_dir edit_dir solexa_dir phd_dir 21.31) Put the Gerald fastq or Bustard files (pairs of *_seq.txt and *_prb.txt) into solexa_dir (or you could use links or you could even make solexa_dir itself be a link). 21.32) In edit_dir, make a file myFiles.fof just like solexa_files.fof in the solexa_example dataset described above. If you have paired reads, there should be 2 filenames on each line like this: file1 file2 . . . . one file for the 1st reads of each pair and one file for the 2nd reads of each pair. There must be perfect correspondence--the nth line of file1 must correspond to the nth line of file2. 21.33) Make a fasta file myFasta.fa in edit_dir containing the reference sequences. (Note that addSolexaReads.perl is written such that lowercase bases in the reference sequences are assumed to be repeats and matches of Illumina reads to such regions are pretty much ignored. If your reference sequence were totally lowercase, you would get no matches--bad. If you instead want to not ignore matches to lowercase regions, you must modify addSolexaReads.perl by removing the words "-repeat_screen 2". See phrap.doc which came with the phrap/cross_match. It is generally good to use repeat_screen with a repeat-screened reference sequence since it greatly speeds up cross_match and a match to within a repeat doesn't mean much.) 21.34) Convert the fasta file to an assembly by typing fasta2Ace.perl myFasta This should create myFasta.ace 21.35) Run addSolexaReads.perl like this: addSolexaReads.perl -ace myFasta.ace -fastqFOF myFiles.fof -fasta myFasta.fa This should create myFasta.ace.1 which should contain all of the aligning Illumina reads. Consensus quality values are not recalculated unless you put the following into your consedrc file: consed.addNewReadsRecalculateConsensusQuality: true (For information on how to change the consedrc file, see EDIT PARAMETERS: HOW TO CHANGE CONSED/AUTOFINISH PARAMETERS elsewhere in this document.) 21.36) If you only want to add some of the reads in the fastq files, put the names of those reads into a file readsList.txt Then run: addSolexaReads.perl -ace myFasta.ace -fastqFOF myFiles.fof -fasta myFasta.fa -readsList readsList.txt readsList.txt can also include the approximate contig locations of each read. Only that region of the contig will be searched to find the alignment of the read. 21.37) USING 454 READS (ALIGNING WITH CROSS_MATCH TO REFERENCE SEQUENCE ) Consed/cross_match can quickly align 454 reads to an existing reference sequence. You start with a fasta file of the reference sequence and the 454 sff files. You end up with an assembly with all of the 454 reads aligned against the reference sequence. To do this: Exit Consed and type: 21.38) cd align454reads/edit_dir (You might need to type "cd ../.." first depending on which directory you are currently in.) ls You will see 2 files: reference.fa which contains the reference sequence sff.fof which is an fof file referring to the 454 sff files (Note that add454Reads.perl is written such that lowercase bases in the reference sequence are assumed to be repeats and matches of 454 reads to such regions are pretty much ignored. If you instead want to not ignore matches to lowercase regions, you must modify add454Reads.perl by removing the words "-repeat_screen 2". See phrap.doc which came with the phrap/cross_match.) You might notice that there also is a align454reads_answer directory parallel to the align454reads directory. This contains the files that you should get if you correctly follow this exercise and have Consed correctly installed. You can refer to it for troubleshooting. Also see INSTALLING CONSED (above). 21.39) Convert the reference.fa file into an assembly by typing: fasta2Ace.perl reference.fa There should now be a file reference.ace in this directory and a file phd.ball.1 in ../phdball_dir To check that everything is fine, bring up Consed and double click on "reference.ace" You should see that this is a assembly with exactly 1 read--the reference sequence. Terminate Consed. 21.40) Type: add454Reads.perl reference.ace sff.fof reference.fa There should be lots of output to the screen and no error messages and it should complete in a second or two. Type: ls Now you should see reference.ace.1 21.41) Bring up Consed and double click on "reference.ace.1" Then double click on contig "myreference" to bring up the Aligned Reads Window. Scroll around a little and middle mouse click on a read or two to see the trace. 21.42) ADDING ADDITIONAL 454 OR ILLUMINA READS USING CROSS_MATCH (YOUR OWN DATA) You can add additional 454 or Illumina reads to an existing assembly. It doesn't matter whether the existing assembly is 454, Illumina, or sanger. To add 454 reads, use: add454Reads.perl (existing ace file) (fof of sff files) (fasta) (This will add all of the reads in the sff files.) To add Illumina reads, use: addSolexaReads.perl (existing ace file) (fof with prefixes) (fasta) In both cases the fasta file must precisely match the consensus of the existing ace file. If you want to add just a few 454 reads (not all of them in the sff file), and you know the names of the 454 reads you want to add, you can use a different method: In edit_dir, create a file (e.g., reads.fof) that contains the names of all the 454 reads you want to add. To find all reads in an sff file, type: sffinfo -a ../sff_dir/my454.sff >reads.fof where my454.sff is the sff file. Create a ../phd_dir directory and a ../chromat_dir directory. Run: sff2scfAndPhd ../sff_dir/my454.sff reads.fof This will create both scf files in ../chromat_dir and phd files in ../phd_dir. Then run either "add new reads" from within consed (see ADD NEW READS below) or automated add new reads (see ADDING NEW READS IN BATCH (SANGER) (below)). ILLUMINA AND 454 DATA--WHAT IS HAPPENING BEHIND THE SCENES 454 data comes in sff files (in sff_dir) 1. sff files --> phdballs (in phdball_dir) via the program consed -sff2PhdBall consed -sff2PhdBall calls a perl script filter454Reads.perl filter454Reads.perl runs cross_match to find, within each read, sffLinkers.fa which is the 454 linker sequence (if any) that separates the _left and _right 454 read. It also looks for puc19 contamination (filter454Reads.fa). 2. phdballs --> *.fa fasta files (in edit_dir) via the program consed -phdball2fasta 3. fasta files --> *.cross alignments (in edit_dir) via the program cross_match 4. alignments and phdballs --> ace file (in edit_dir) via the program consed -addReads All of the above steps are run by add454Reads.perl For Illumina data, all of the steps are the same, except for the 1st step: Illumina data comes in *.fastq files or *_seq.txt and *_prb.txt "Bustard" files (in solexa_dir) 1. Illumina files --> phdball (in phdball_dir) via the program consed -solexa2PhdBall (fastq fof) -phdBallFOF (phd ball fof) 21.43) MULTIPLE HIGH QUALITY DISCREPANCIES VS SEARCH FOR HIGHLY DISCREPANT REGIONS You have already used (above) "Search for highly discrepant positions". "Multiple high quality discrepanices" (MHQD) is similar but much less flexible. It requires that there be one read at a position that differs with the consensus and is at least consed.qualityThresholdForFindingHighQualityDiscrepancies (40 by default) at that base and within a 9-base window about the base (4 bases on each side, not including pads). It then requires there be a second read of any quality of a different subclone that has the same base. 21.44) BACKING OUT EDITS AFTER YOU HAVE SAVED THE ASSEMBLY If you decide that all your edits are terrible and you want to start over (perhaps you have been training a new finisher), the cleanest solution is to delete everything in phd_dir and edit_dir , but leave everything in chromat_dir and just reassemble (run phredPhrap) or realign the reads again. 21.45) SELECTIVELY BACKING OUT EDITS AND REMOVING READS If you want to back out all edits in just particular reads, I have provided a perl script to do this: revertToUneditedRead (read name) What it does it copy the .phd.1 to 1 greater than the highest version. Then you must reassemble using the phredPhrap script to create an ace file that has no edits for that particular read. It will have all edits for all other reads. Why doesn't it just delete all phd files except for the .phd.1? In that case, Consed could not read any previous ace file since all previous versions of ace files would refer to phd files that have been deleted. 21.46) REMOVING READS FROM A PHRAP ASSEMBLY (This is obsolete and has been replaced by consed -removeReads. See elsewhere in this document.) Create a file containing the filename of all the reads you want to remove, one filename per line. Then use the perl script removeReads Then reassemble using the phredPhrap script. 21.47) ADDING READS WITHOUT CHROMATOGRAM FILES This may happen if you, for example, download sequence from Genbank and want to assemble it along with your reads. There are 2 ways to do this, depending on whether you want to edit the read or not. a) If you want to edit the read, run mktrace to produce a fake trace. It will have all perfect peaks. Run: mktrace (name of file with fasta sequence) It will create both a chromatogram and a phd file. Move the chromatogram into ../chromat_dir. Move the phd file into ../phd_dir. Then run the phredPhrap script normally. You will be able to bring up the traces in Consed and edit the read. b) If it is not important to edit the reads, there is a method that is a little faster. Create just a fake phd file using: fasta2Phd.perl (name of file with fasta sequence) It will create a file whose name is taken from the fasta file name: for example, if the fasta filename is Contig1.c.fasta, then the phd file will be called Contig1.c.phd.1 The fasta name in the file is ignored. You can then put this in the phd_dir, and reassemble using the phredPhrap script. If the reads are really fake (you don't want the templates to be chosen by Consed/Autofinish as a template for a primer), then the read should end with an extension .c or .a or .c1 or .c2 ... or .a1 or .a2 or ... This indicates to Consed/Autofinish that the read is a fake read. Note: when you are creating phd files such as this, you must start with (read name).phd.1 Do not start with (read name).phd.2 or any higher version number. This is because Consed looks for the .1 version in order to find the original phred calls so it expects there to be a .1 version. There is also a publicly contributed script "lib2Phd.perl" that takes a fasta file that contains more than one sequence and makes phd files for each of them. 21.48) ALIGNING READS TO A BACKBONE If you sequence the same region (in different people or in different species), then you may want them all aligned together, even if phrap doesn't want to put them all together. To align them all together, first use a reference sequence and make an assembly out of it by using mktrace or fasta2Phd.perl (see above) followed by phd2Ace.perl (see above). Then add all of the other reads using Consed's Add New Reads feature (either automated or manual--see above). 21.49) COMPARING READS TO A REFERENCE SEQUENCE The reference sequence, as in the step above, will just be another read in the assembly. Let's call it "ref". To compare the other reads to it, in the Aligned Reads Window, point at the Navigate Menu, hold down the left mouse button and release on "Compare Reads To Reference Sequence". A Window labelled "Enter Name of Reference Read" will pop up. Enter the name of the read and click "OK". A list of high quality read positions that disagree with the reference read will be displayed. 21.50) TAGGING ALL READS AT ONCE Follow the instructions for tagging the consensus, but when the list of tag types pops up, click the "tag all reads" box at the top of this list. Then continue as with tagging the consensus. 21.51) EDITING ALL READS AT ONCE Please don't do this. Not unless you REALLY know what you are doing and have a good reason for doing so. You should really only change a base call if you are looking at the chromatogram and thus have a basis in that read for making the change. If you are determined to do this in spite of my pleas and protests, do the following. Suppose that at a particular consensus position some reads have "a" and some "c" and you want them all to be "a". In the Aligned Reads Window, point to an "a" at that position, hold down the left mouse button and release on "make all reads a". The reason you shouldn't do this is that perhaps the reads that were "c" were actually correct and were a different copy of a repeat. Hence the reads with "a" and the reads with "c" did not really overlap. But you just destroyed the evidence. 21.52) FASTER CONSED STARTUP FOR SANGER READS Warning: This only applies to assemblies with large numbers of Sanger reads. This will have no effect on assemblies with 454 or Illumina reads. You can greatly speed up Consed startup if you are willing to use more disk space. The disk space used will be about equal to the total space used by the PHD files. Try this will a large dataset (you won't notice any difference with the example datasets that come with Consed.) To use this method of startup: 1) cd to directory where ace file is kept 2) type: makePhdBall.perl (This will create a file called phd.ball which is big.) 3) start consed normally In many situations, this will greatly speed up Consed startup. The amount of speedup depends on which operating system is used: on Linux, the time to read phd files dropped from 75 seconds to 8 seconds, and thus the total time to start up consed dropped from 86 seconds to 17 seconds. I saw similar speedups on Solaris where the phd files are on an nfs mounted disk. However, there was another situation in which the startup time was the same. Warning: If you create phd.ball as above, Consed will be reading most phd files from phd.ball instead of from ../phd_dir. If you delete phd files in phd_dir, you must also delete phd.ball. Otherwise Consed will give lots of error messages "TIME STAMP MISMATCH" and many things will not work correctly. 21.53) VIEWING THE CHROMATOGRAM OF SINGLETS OR NON-ASSEMBLED READS If you have a chromatogram, you can use Consed to view it, even if it hasn't been assembled into the ace file. This is common with cDNA assemblies in which the reads don't overlap and thus phrap doesn't put them together into a contig. To do this, make the same edit_dir, phd_dir, and chromat_dir as above, put the chromatogram into chromat_dir, run phred on it to generate the phd file which goes into phd_dir. Then go to edit_dir and run: phd2Ace.perl (name of phd file) For example, if your phd file is myRead.phd.1 from edit_dir, type: phd2Ace.perl myRead.phd.1 This will produce myRead.ace Then just start Consed normally: consed -ace myRead.ace and you can view the chromatogram. 21.54) HIDING SOME TYPES OF TAGS If you have many tags that overlap and thus are purple, you can hide some less relevant tag types so there is less purple and there is less distraction. Make sure you have a few tags visible. Then click on 'Find Main Win'. In the Main Window, open the Options menu, and release on 'Hide Some Tag Types'. A list of tag types will pop up. Select the type that you have visible (above). Then click 'OK'. Go back to the Aligned Reads Window. That tag should still be visible. Click on the button 'Some Tags' in the upper right part of the Aligned Reads Window. Your tag should disappear. The 'Some Tags' button should have changed to 'Sh All Tags'. Click on it again. Your tags should have reappeared. 21.55) CUSTOM CONTIG NAMES Normally, when you re-assemble, phrap will name the contigs differently--what was Contig31 before may become Contig32. To help you know which contig is which, Consed allows you to give a name (e.g., "A") to a contig which will persist after re-assembling. To do this, swipe some consensus bases with the middle mouse button (as above). When the "Select Tag Type" box pops up, click on "contigName" and also type a name into the "Contig Name:" field and then click "OK". The next time you re-assemble, the name "A" will appear in the list of contigs on the Consed Main Window. 21.56) ERROR RATE In the Aligned Reads Window is a box (upper right) labelled 'Err/10kb'. This is the estimated error rate for this contig, and it is a good indicator of when you are done (or not done) finishing. In addition, you can find the error rate for a particular region of contig as follows: Point at 'Misc' menu, hold down the left mouse button, pull down and release on 'Show Error Info For Region'. Fill in the boxes for left and right consensus position, click on 'Calculate' and you will be given the error and single subclone data for that region. 21.57) RESTRICTION DIGEST Restart Consed. Double click on "standard.fasta.screen.ace.1" In the Consed Main Window, click the "Digest" button. For the purpose of this exercise, the full pathname of file of vector sequence can refer to any file of sequence in fasta format. However, when you are using it with your own data it should refer to a file that contains the sequence of your cloning vector. For example, if you are sequencing a BAC, it should contain BAC vector. The sequence must start at the vector/insert junction that you used when you ligated the insert. Click "OK". You will see a comparison of in-silico fragments (those calculated from the sequence) and real fragments (those in fragSizes.txt which supposedly came from a real gel). * If a band is red, that means that it doesn't match. * If a band has a "v" on it, that means it is a vector fragment. * If a band has a "g" on it, that means it is a gap-spanning fragment. Move the pointer over the fragments, and you will see the fragment sizes appear. Move the pointer to the in-silico fragment with size 2299. Click on it. You will see the fragment on the left size of the window become highlighted. Click on the button labeled "right end" (2nd row from the bottom of the window) and the Aligned Reads Window will pop up, with the cursor on the right end of the fragment. Click on "show problems" and navigate through the list of problems by clicking on "next". You will notice that the Gel Window is zoomed in. To return to the original zoom, click on "Zoom Original". Where it says "Select Enzyme:", point to "EcoRV", hold down the left mouse button and release on "HindIII". This is how you change enzymes. Click on the button labeled "Text Output". This can be saved to a file and printed out. Dismiss the restriction digest window. On the Consed Main Window, click the "Digest" button again. Notice the file "fragSizes.txt". This is a file of actual gel fragment sizes. If you don't have an actual gel, but rather you want to just make predictions of fragment sizes from the sequence, you can leave this box blank (erase the "fragSizes.txt"). Try that. fragSizes.txt has the following format: >EcoRV 448 710 1102 1197 -1 >HindIII 448 508 586 735 801 -1 where EcoRV and HindIII are enzymes and the numbers below them are the actual fragment sizes. Each enzyme list is terminated by -1. Consed does its best to try to figure out which end of the clone insert is connected to which end of the vector. However, it sometimes is wrong. If you believe it is wrong, you can click "compl vector" to try connecting the insert to the vector in the opposite orientation and see if that produces better agreement with the actual digest. 21.58) RESTRICTION DIGEST AND ASSEMBLY VIEW Get your own copy of the dataset "assembly_view" (see above under GETTING YOUR OWN COPY OF A SAMPLE DATASET). cd assembly_view/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) Restart consed Double click on "assembly_view.fasta.screen.ace.1" In the Consed Main Window, click on the button "Assembly View" which is near the upper left corner of the window. Also on the Consed Main Window, click on Digest. The "Select Enzyme and Contigs" Window should appear with EcoRV and HindIII selected. Click OK. The "Display Digest" Window should appear. Now look at the Assembly View Window. You will notice blue, green, and red rectangles under the grey contig bars. These rectangles are the in-silico restriction fragments. Point to one of them-- it will turn yellow and information will be displayed in the information box below. Point to one of the EcoRV fragments, hold down the right mouse button, and release on "Goto fragment in digest window". Notice that in the Display Digest Window, the selected fragment is highlighted both on the left side (the text) and in the Gel (right) side. 21.59) MULTIPLE TRACE POPUP Bring up dataset standard. In the Aligned Reads window, scroll to a region that has many reads and that has some discrepancies--try position 1162. Hold down the shift key, and click with the middle mouse button on the consensus. At this location 3 traces will pop up--these are the 2 highest quality traces that agree with the consensus (on each strand) and the highest quality trace that disagrees with the consensus. This feature is useful in areas of high coverage when you want to rapidly examine just the most significant traces rather than looking at all of them. 21.60) MAXIMUM NUMBER OF TRACES DISPLAYED Bring up dataset standard. Scroll to position 1162. Bring up 4 reads and then try bringing up additional reads.You will notice that new reads are put at the top of the stack of traces and, once there are 4 traces displayed, traces are automatically removed from the bottom of the stack. If you want to change this maximum number of traces to something besides 4, you can do that: In the Consed Main Window (click on 'Find Main Win' on the Aligned Reads window), pull down the 'Options' menu, and release on 'General Preferences'. Try changing the 'Max Number of Traces Shown' to 3. Then click 'Apply and Dismiss'. Now dismiss the Trace Window and again start adding additional traces to the Trace Window. You will notice that now the number of traces shown will not exceed 3. If you want to view a large number of traces at once, you should use the SHOW ALL TRACES (described above). 21.61) SCALING THE TRACES In the Trace Window, grab the thumb of the line that is labelled "V" (for Vertical magnification) and move it back and forth, noticing the effect on the traces. This is useful if the traces are too small or too large. There are several other methods of scaling the traces you will learn later. 21.62) HOTKEYS FOR EDITING If you do a lot of editing, you will want to have a faster method of doing these edits than having the popup and selecting an option. Thus the following hot keys exist: < and > (less than and greater than) to make n's to the left and the right (respectively) of the cursor control-l and control-r to make low quality to the left and the right (respectively) of the cursor overstriking with a capital letter (e.g., C instead of c) causes the base to become high quality rather than low quality overstriking with a lower case letter causes the base to become low quality Give these a try. 21.63) SCROLLING TRACES INDEPENDENTLY Dismiss all of your Trace Windows. Then pop up traces for 2 different reads in approximately the same location. Scroll one of them. You may want to scroll by clicking the arrows or clicking to the left or right of the thumb. You will notice that both will scroll. Consed will do its best to have corresponding peak lined up. (Consed can't line all of them up because the peak spacing is not uniform and differs from read to read.) Try removing a trace by clicking on one of the 'Remove' buttons in the Trace Window. Try adding other traces. Then click on 'No' for scrolling the traces together and try scrolling. You will now observe that they scroll separately. 21.64) MEASURING ERROR RATE AND SINGLE SUBCLONE BASES FOR A REGION Some contigs have long tails of low quality bases and you would like to find out the error rate for the contig without that long tail. On the Align Reads Window, pull down the Misc menu, and release on 'Show Errors for a Region'. This will tell you both the error rate for the region and the number of single subclone bases for that region. 21.65) PREVENTING 2 USERS FROM MAKING CONFLICTING EDITS If there are 2 users that are both editing in the same directory, there is the possibility they will both make edits to the same read. Whoever saves their assembly last will wipe out the edits of the other person, even if they were using different ace files. To help prevent this, consed can warn you if someone else is making edits in the same directory. Set the consed parameter: consed.onlyAllowOneReadWriteConsedAtATime: true The default is "false" so you have to turn this to true to make it work (see CONSED CUSTOMIZATION). This will usually work even if the 2 users are on different computers (and the directory is nfs-mounted between them) and even if the different computers have different operating systems. I've tested the following combinations: user 1 on Solaris; user 2 on Solaris user 1 on Linux; user 2 on Linux user 1 on Linux; user 2 on Solaris <--- does not work Only the last combination doesn't work. 21.66) PRINTING CONSED WINDOWS Consed windows are really designed for being viewed on a computer screen, rather than on paper. If you want to print out a consed window, the default colors are not so good. Since Consed windows are mostly black (Aligned Reads Window and Traces Window), a lot of toner is used up and the window is difficult to read. I've solved this: Go to the Consed Main Window, pulldown the 'Options' menu and release on 'General Preferences'. Scroll down to "Make light background in Aligned Reads Window..." and click on "Do it now". You will notice the light background. A few other things (traces colors and thickness) are also customized for making color prints. You can also make consed start with like this by putting the following into your consedrc consed.makeLightBackgroundInAlignedReadsWindowAndTracesWindow: true (see CONSED CUSTOMIZATION) If you are running on a Linux box, there is a free (or nearly free) program called "xv". One web site is http://www.trilon.com/xv It is written by one of those dying breed of UNIX programmers who just *loved* UNIX and programming and sharing it. His web site is enjoyable because some of his passion comes through. With xv, you can make a postscript file from a Consed window. Then you can print the postscript file on a color printer. If you are running on a Windows computer (with an X emulator) or on macosx, you can make a screen snapshot. Or you can use the Microsoft Snipping tool: http://windows.microsoft.com/en-us/windows7/use-snipping-tool-to-capture-screen-shots 21.67) COLOR MEANS EDITED AND TAGS (For this step, first click on the 'Dim' menu and release on 'Dim Nothing'.) Point to the 'Color' menu, hold down the left mouse button and release on 'Color Means Edited and Tags'. Notice that the bases that you have edited (make sure you have edited some bases) will stand out in either white or grey (depending on whether the base was made high quality or low quality). Observe this both in the Trace Window and the Aligned Reads window. This colormode is useful if you are interested in easily spotting which bases are edited. 21.68) COLOR MEANS MATCH In the Aligned Reads Window, go to the menu labelled 'color', and pulldown and release on 'color means match'. Now you notice different colors: The colors have the following meaning: Blue: agrees with consensus Orange: disagrees with consensus (Yellow: this stretch of this read was used by phrap to form the consensus--no longer supported) Grey: Low quality or unaligned ends of reads Return to the 'Color Means Quality and Tags' colormode by the following: point to the 'Color' menu, hold down the left mouse button and release on 'Color Means Quality and Tags'. This is the colormode most commonly used. 21.69) AUTOEDIT Autoedit is a program that will read an ace file, make edits according to which options you specify, and then write out a new ace file, all without any interaction from the user. Thus Autoedit can be run automatically at night, the same way you can run phredPhrap. Autoedit has various options that are controlled from the consedrc file the same as the consedrc file controls Autofinish. Run AutoEdit as follows: consed -ace (name of exising ace file) -autoEdit This will create another ace file with a version number one higher than the one you just ran. If you want to specify a particular new ace file name, you can do it this way: consed -ace (old ace file) -autoEdit -newAceFileName (new ace file) Autoedit has the following options (if you do not specify any of these, autoedit will do nothing): consed.autoEditConvertCloneEndBasesToXs: true bool ! If true, will convert to X's bases of all reads that protrude beyond a ! cloneEnd tag. ! (YES) consed.autoEditTellPhrapNotToOverlapMultiplyDiscrepantReads: true bool ! This will find all locations where there are multiple identical ! discrepancies with the consensus (and some other conditions) and try ! to make most of the reads quality 99 at that location so that phrap, ! next time it is run, will not overlap those reads. This will fix ! many misassemblies. ! (YES) consed.autoEditTagEditableLowConsensusQualityRegions: true bool ! This will find regions that are low quality, but that a human ! finisher could easily determine the correct base and thus ! money could be saved by not having Autofinish suggest additional ! reads overlapping the region ! (YES) consed.autoEditRecalculateHighQualitySegmentsOfReads: false bool ! If true, will recalculate the high quality segments of the reads ! (YES) consed.autoEditFixRunsInConsensus: false bool ! fixes this: ! ccc (cons) ! cc* (read1) ! *cc (read2) ! (YES) 21.70) (not currently recommended way of tagging SNPs and not currently supported) You must have a set of fake reads (see fasta2Phd.perl) with polymorphism tags on the SNPs. There must also be a read in the assembly that has a genomeRegion WR item. Then run consed -ace (ace file) -tagSNPs (file of fake reads) ---------------------------------------------------------------------------- 22. CONSED CUSTOMIZATION If you want to customize Consed, it would help to be able to edit in UNIX. There is no Microsoft Word in UNIX, but there is emacs, vi, pico, nano and other editors. I suggest pico or nano for their simplicity. (You can get more information by googling, for example, "pico unix editor".) Point at the 'Info' menu on the Consed Main Window, hold down the left mouse button and release on menu item 'Show Current Consed Parameters'. This shows you what is available to be changed by putting in your ~/consedrc file. You can also see what is available by typing on the command line: consed -printDefaultResources (Warning: if you have changed any resource, it will not show the new value--just the default value.) Type: consed -editConsedrc This includes most of the parameters found under 'Info/Show Current Consed Parameters' (above). It provides an easy graphical way for you to edit these parameters, if you are not familiar with editing under UNIX. You just change the parameter you want and click "Save". (See HOW TO CHANGE CONSED/AUTOFINISH PARAMETERS (far above)). For the new parameter to take effect, you must restart Consed/Autofinish. Changes in ~/consedrc only affect one user. If you want to make a change to affect all Consed users on the system, put a file in some central location (e.g., /usr/local/genome/lib/consedrc ) and then have every user set the environment variable CONSED_PARAMETERS to that the full pathname of the file. For example, if using csh or tcsh, type: setenv CONSED_PARAMETERS /usr/local/genome/lib/consedrc If using bash, type: CONSED_PARAMETERS=/usr/local/genome/lib/consedrc export CONSED_PARAMETERS Anything the user puts in ~/consedrc will override whatever is in the CONSED_PARAMETERS file. You can also have different parameters for different projects. Put a consedrc file in the edit_dir of a particular project. When you are working on that project, whatever is in that consedrc will override whatever is in your ~/consedrc file or the CONSED_PARAMETERS file. 22.1) CUSTOMIZING NAVIGATE BY SINGLE STRANDED REGIONS AND NAVIGATE BY SINGLE SUBCLONE REGIONS You can set the parameters: consed.searchFunctionsUseUnalignedEndsOfReads: false consed.searchFunctionsUseLowQualityEndsOfReads: true If you set consed.searchFunctionsUseUnalignedEndsOfReads to be false, then the unaligned ends of a read are not considered to cover the consensus. If you set consed.searchFunctionsUseLowQualityEndsOfReads to false, then the low quality ends of a read are not considered to cover the consensus. For example, if the settings are: consed.searchFunctionsUseUnalignedEndsOfReads: false consed.searchFunctionsUseLowQualityEndsOfReads: false then a base in a read is only considered to cover the consensus if it is both in the aligned portion of the read and the high quality portion of the read. 22.2) consedrc vs .Xdefaults There are a few consed resources that are changed in ~/.Xdefaults rather than consedrc. They are mainly colors, fonts, and scale resources. Point at the 'Info' menu on the Consed Main Window, hold down the left mouse button and release on menu item 'Show Default X Resources'. This shows you what is available to be changed by putting in your ~/.Xdefaults file. Although most Consed parameters now go into consedrc, there are still a very few that need to stay in .Xdefaults. Here is the rule: if the parameter starts with consed. such as consed.gunzipFullPath: /bin/uncompress then it goes into consedrc If the resource (here it is called a "resource" rather than a "parameter") starts with consed* such as consed*contigwin.background: Black then it goes in .Xdefaults For example, if you want to change the background color of Assembly View, put the following in your ~/.Xdefaults file: consed*Assembly_View*background: red Type xrdb -remove and restart consed 22.3) COLOR BLINDNESS One person with Red/Green colorblindness (Deutan), found the following colors helpful: consed.colorTracesG: Yellow consed.colorTracesA: forest green consed.colorTracesC: medium blue consed.colorTracesT: light coral Put these in a consedrc in your home directory. ---------------------------------------------------------------------------- 23. CREATING CUSTOM TAG TYPES You can add your own tag types by creating a file of your custom tag types. The file looks like this: mytag1 red consensus yes mytag2 purple both yes mytag3 green read no field 1 ("mytag1") is the tag name field 2 ("red") is the color field 3 is "consensus", "read", or "both" depending on which kind of tag it is field 4 is "yes" or "no" depending on whether the user can add this tag in Consed (by swiping) or whether it is a tag that can only be viewed in Consed (presumably it would be added by some software of yours before the user sees it in Consed). If the file is called "/usr/local/genome/lib/tagTypes.txt", then in consedrc put the following line: consed.fileOfTagTypes: /usr/local/genome/lib/tagTypes.txt so that Consed knows where the file is. Once you have done this, the user of Consed can add tags of these types in the method described in TAGS of the Quick Tour (above). The list of available colors is found in the file rgb.txt found in /usr/X11R6/shar/X11/rgb.txt on macosx, /usr/lib/X11/rgb.txt or /usr/share/X11/rgb.txt on Linux or /usr/openwin/lib/rgb.txt on Solaris. For more information, consult any X-Windows reference, since this has nothing to do specifically with Consed. For your convenience, here are a few of the color names. One way to find out what they look like is to try them: mint cream DeepSkyBlue1 DeepPink4 azure DeepSkyBlue2 HotPink1 alice blue DeepSkyBlue3 HotPink2 lavender DeepSkyBlue4 HotPink3 lavender blush SkyBlue1 HotPink4 misty rose SkyBlue2 pink1 white SkyBlue3 pink2 black SkyBlue4 pink3 dark slate gray LightSkyBlue1 pink4 dim gray LightSkyBlue2 LightPink1 slate gray LightSkyBlue3 LightPink2 light slate gray LightSkyBlue4 LightPink3 gray SlateGray1 LightPink4 You can also associate data with tags. For example, you can have a tag type SNPprobability which gives, at a particular consensus position, the probability that a base is a SNP. Thus there needs to be a floating point number with the tag. This can be defined in the same file /usr/local/genome/lib/tagTypes.txt (as above), but instead of having one line for the tag type (as shown above), it has a more complicated structure to allow for tag fields: TAG_TYPE NAME: tag_type CONS_OR_READ: both USER_CAN_ADD: yes COLOR: color1 FIELD: name type POINTER_FIELD: name (type of tag pointed to) (optional ?*+) END_TAG_TYPE CONS_OR_READ: can be followed by "consensus", "read", or "both". In FIELD: name type, "type" is one of integer, floating, or string In POINTER_FIELD: name (type of tag pointed to) (optional ?*+) ? means 0 or 1 occurrences * means 0 or more occurrences + means 1 or more occurrences absence of any of these means exactly 1 occurrence Note that FIELD and POINTER_FIELD can both be present 0 or more times. Here is an example for a SNP with a probability: tagTypes.txt contains: TAG_TYPE NAME: SNP CONS_OR_READ: consensus USER_CAN_ADD: yes COLOR: yellow FIELD: probability floating END_TAG_TYPE Note that tagTypes.txt is not read by default. You must set the consedrc parameter: consed.fileOfTagTypes: tagTypes.txt The ace file will then contain SNP tags that look like this: CT{ Contig1 SNP consed 1863 1870 091030:091242 probability 75.2 } Here is an example of a user-defined tag type that points to another copy: tagTypes.txt: TAG_TYPE NAME: tear2 CONS_OR_READ: consensus USER_CAN_ADD: yes COLOR: yellow POINTER_FIELD: other_tear2_tag tear2 END_TAG_TYPE and these tags look like this in the ace file: CT{ Contig2 tear2 consed 6470 6470 090714:103935 ID: 5 other_tear2_tag 6 } CT{ Contig2 tear2 consed 6487 6487 090714:103941 ID: 6 other_tear2_tag 5 } This means that the tear2 tag with ID 5 refers to the other tear2 with ID 6, and visa versa. ---------------------------------------------------------------------------- 24. EXPANDING CONSED'S CAPABILITIES WITH A LITTLE PROGRAMMING Lab managers: Please do not get put off by the title of this section. You should read through this section so you are aware of what consed is capable of. If you think one of these features would be very helpful to your lab, then get a programmer to spend a day or two and write you some scripts that could really help you out. But first you need to be aware of what is possible. So read through this. 24.1) BRINGING UP CONSED FROM A SCRIPT Suppose that you want to write a script that brings up consed on one ace file to a particular position, and then brings up consed on another ace file at a particular position, and then brings up consed on another ace file at a particular position, ... you can do this by: consed -ace (name of ace file) -mainContigPos (unpadded pos) This will bring up consed with the main contig (the contig with the most number of reads) with the Aligned Reads Window already up and scrolled to position (unpadded pos). Thus you could write a script like this: cd directory1 consed -ace file1.ace -mainContigPos 1050 cd directory2 consed -ace file2.ace -mainContigPos 2057 cd directory3 consed -ace file3.ace -mainContigPos 1487 . . . 24.2) CONTROL OF CONSED FROM SOME OTHER PROGRAM Consed can be controlled by some other program. For example, you might have a program that displays mapping data and you would like the user to be able to click on a location and have Consed come up showing the bases in that region. This feature allows a programmer to do this. Here is an example of how to do this: 24.3) In this case we need a private copy of the dataset called "standard" (see GETTING YOUR OWN COPY OF A SAMPLE DATASET above). 24.4) Then Type: cd standard/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) Start consed as follows: consed -socket 5432 -ace standard.fasta.screen.ace.1 If a window pops up asking if you want to apply edits, answer "no". Open another xterm and cd to standard/edit_dir From this directory, run the script testSocket.perl (thanks to Bill Gilliland) which is supplied with consed in the scripts directory of the consed distribution. This script will say: issuing command Scroll Contig1 100 waiting for you to type a command (such as Scroll Contig1 150)... You should immediately see consed's Aligned Reads Window open and scroll automatically to position 100. If you were to type (in the window in which testSocket.perl is running): Scroll Contig1 150 then you would see Consed immediately scroll to position 150. Here are the details of how you could use it: The external program can start up Consed as follows: consed -socket (local port number) -ace (ace filename) For example, consed -socket 5432 -ace standard.fasta.screen.ace.1 After Consed completes coming up (including you clicking whether you want to apply edits), you will see the message in the xterm: success bind to local port number: 5432 And then you will see a file created by Consed in the default directory (which is usually the directory the ace file is in) called consedSocketLocalPortNumber This gives the port number of the Berkeley socket that Consed has opened and is listening on. Thus your program can read this file and create a connection to the Berkeley socket created by Consed. Once the connection is established, your program can send commands to Consed at that socket indicating to Consed which contig to display and what consensus position to scroll to. Currently, the only acceptable commands are: Scroll (contigname) (consensus position) PopupTraces (read name) (unpadded read position in the direction of sequencing) 'Unpadded read position in the direction of sequencing' is the position from the right end, if the read is a bottom strand read. Just send such a command to the Berkeley socket, and Consed will respond appropriately. (Currently, Consed doesn't like it if another process establishes a connection and then terminates without first terminating the connection.) 24.5) REMOVING READS IN BATCH Consed can remove reads from the command line (without the graphical interface) as follows: consed -ace (ace file) -removeReads (file with reads to remove) consed -ace (ace file) -removeContigs (file with contigs to remove) consed -ace (ace file) -selectContigs (file with contigs to keep) (Optionally, -newacefile (new ace file) can be used to specify the name of the new ace file.) consed -removeReads, consed -selectContigs, and consed -removeContigs do not bring up the graphical interface. What is done with the reads is governed by the consedrc parameter consed.removeReadsWhatToDoWithReads: which can have values removeTogether, delete, or eachIntoOwnContig There are a number of other consedrc parameters: consed.removeReadsMakeCustomNavigationFileWhereConsensusRecalculated: false consed.removeReadsWhatToDoIfZeroDepthRegions: which can have values break or nobreak ! When removing reads, what should happen if removing a read causes a ! contig to have a location that is zero depth of coverage. Options ! are: a) break (to break the contig into several new contigs that ! have nonzero depth of coverage), b) nobreak (to leave the contig in ! one piece with a new 0 depth of coverage region) consed.removeReadsRecalculateConsensus: true ! when using consed -removeReads, use this to determine whether ! to recalculate the consensus bases. When using gui, ask user. ! If you will be allowing contigs to break, ! the the consensus will be recalculated regardless of the setting of ! consed.removeReadsRecalculateConsensus consed.removeReadsWhatToDoWithUnalignedReads: allIntoOneContig ! options are allIntoOneContig or eachIntoOwnContig ! allIntoOneContig only applies when specifying that contigs ! will be broken apart where there are no reads, i.e., ! consed.removeReadsWhatToDoIfZeroDepthRegions: break 24.6) COMPLEMENTING CONTIGS IN BATCH Consed can complement contigs from the command line (without the graphical interface) as follows: consed -ace (ace file) -complementContigs (file with list of contigs to complement) (Optionally, -newacefile (new ace file) can be used to specify the name of the new ace file.) 24.7) HOW TO WRITE A CUSTOM NAVIGATION FILE In the Main Window, there is also a Navigate menu. Pull it down and release on the Custom Navigation menu item. A box will pop up saying 'Select custom navigation file:' There will be a file: custom_navigation.nav Double click on it. You will see the now-familiar custom navigation box. Click 'Next' repeatedly until you get to the end of the list. Consed doesn't write such a file--it just reads it. This feature allows you the ability to write your own programs that select locations that you want your finishers to examine. Your program writes a file, the user reads that file into Consed in this manner, and you can go to each of the locations. The format of the file is as follows: There is a title (optional) line that looks like this: TITLE: low quality base in discrepant region and then there are blocks that look like this: BEGIN_REGION TYPE: READ READ: B11_hs1-60153193_GGor_050426.f UNPADDED_READ_POS: 34 34 COMMENT: a comment END_REGION The block above refers to read position 34 of read B11_hs1-60153193_GGor_050426.f Even if this read is complemented in the assembly (it is right to left), this position refers to the base position in the direction of sequencing--same as the position within the PHD file. There is another kind of block: BEGIN_REGION TYPE: CONSENSUS CONTIG: hs21-15002178_HSap-Contig UNPADDED_CONS_POS: 1774 1784 COMMENT: another comment END_REGION which refers to a position on the consensus. Notice that it is missing the "READ:" line, the TYPE: line is different, and instead of "UNPADDED_READ_POS" it has "UNPADDED_CONS_POS". When someone is navigating, the blinking cursor will be put onto the consensus (with the second kind of block) rather than the blinking cursor on the read (with the first kind of block). You might want to specify the consensus positions in terms of some user-defined positions (the first position of the consensus is not 1 but rather is some other number). For example, you might want to use chromosome positions, rather than the position within the contig. You can let Consed know that the UNPADDED_CONS_POS numbers are user-defined positions by putting the words "user-defined positions" somewhere in the TITLE line like this: TITLE: low quality base in discrepant region (user-defined positions) So that Consed knows what number to start numbering the consensus at, you must have a startNumberingConsensus tag on the consensus or a read indicating the user-defined position of the left-end of the contig. See USER-DEFINED CONSENSUS POSITIONS in this document. There is a 3rd type of block that you probably won't use much. It is used when you know the consensus position within a read, but not the read position. Then you can use: BEGIN_REGION TYPE: READ CONTIG: hs2-105068850_HSap-Contig READ: E02_hs2-105068850_PTro_040520.f UNPADDED_CONS_POS: 295 299 COMMENT: left 2 END_REGION The block above refers to a position on read E02_hs2-105068850_PTro_040520.f in contig hs2-105068850_HSap-Contig at consensus positions 295-299. There is a 4th type of block that you will probably not use much, either. It is when you know the *padded* consensus position (the position that includes *'s). This is especially useful when you want the user to navigate to a particular pad in the consensus and there are several pads in a row. The padded position is the only way to unambiguously say which pad you are interested in. BEGIN_REGION TYPE: CONSENSUS CONTIG: Contig138464 PADDED_CONS_POS: 512330 512330 COMMENT: padded 512330 512330 END_REGION (Note the comment can be anything.) 24.8) Consed can startup with your custom navigation file already loaded and displayed. To illustrate this, do this exercise. 24.9) In this case we need a private copy of the dataset called "standard" (see GETTING YOUR OWN COPY OF A SAMPLE DATASET above). 24.10) Then Type: cd standard/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 24.11) Start consed: consed -ace standard.fasta.screen.ace.1 -nav custom_navigation.nav You will see consed come up with the custom navigation window visible and loaded. Suppose that you want to review, say, 100 different ace files, each with a computer-generated list of locations. Using this -nav feature, this is easy: you write a script that has: cd (next ace file location) consed -ace (ace file) -nav (custom navigation file) cd (next ace file location) consed -ace (ace file) -nav (custom navigation file) cd (next ace file location) . . . The only thing the user need do is click "next", "next", "next", ... and finally "quit" and automatically the next ace file is brought up with the next custom navigation file already loaded visible. If you want the user to see the traces at each position (this assumes you are using Sanger reads), then you can set the following in your consedrc file (see CONSED CUSTOMIZATION): consed.navigateAutomaticTracePopup: true Warning: if the user inserts and deletes bases from the consensus sequence, all downstream positions will be changed. 24.12) COMPRESSING CHROMATOGRAMS If you are interested in compressing your chromatogram files, go into chromat_dir and gzip one of the chromatogram files. Make sure that gunzip is in /usr/local/bin (You can change this location via the Consed parameter consed.gunzipFullPath: /usr/local/bin/gunzip --see CONSED CUSTOMIZATION (above), but it will be easiest for you and your users if you just put gunzip (or a link to it) in /usr/local/bin and not have to bother with Consed parameters.) Restart Consed and bring up the corresponding trace. You will notice no appreciable delay. 24.13) READING CHROMATOGRAMS OUT OF AN EXTERNAL DATABASE Normally, chromatograms are kept in ../chromat_dir. If you want to keep them somewhere else (such as in an external database), you can do that. When the chromatogram is needed (when the user asks to view a trace), Consed will call an external program, passing it the name of the read required, and then look for the chromatogram in /tmp (by default). It will read the chromatogram and then delete it. Use the parameters: consed.alwaysRunProgramToGetChromats: true consed.programToRunToGetChromats: /usr/local/bin/programToGetChromat In this case, "programToGetChromat" is the name of the program that gets the chromatogram and puts it into /tmp. If you keep *some* chromats in an external database but *some* chromats are in ../chromat_dir, then set consed.alwaysRunProgramToGetChromats: last which means it will first look in ../chromat_dir and, if it doesn't find it, it will then run the program to get the chromats. 24.14) COMPRESSING ACE FILES AND PHD BALLS Consed can read and write compressed ace files and phd balls (by default, using gzip and gunzip). To see this, just compress an ace file: 24.15) Type: cd standard/edit_dir (You should get no error from this. If you do, type "pwd" to find out where you are and cd to the correct directory accordingly.) 24.16) Type: gzip -c standard.fasta.screen.ace.1 >somethingelse.ace.gz 24.17) And then you can bring up consed like this: consed -ace somethingelse.ace.gz Consed can also *write* compressed ace files. When you are saving the ace file and consed asks you for the filename, just append a ".gz" and consed will know to compress the file. (See "SAVING THE ASSEMBLY" above.) In tests that I did, this did not appear to have any performance improvement. I believe the reason is that consed is not I/O bound--when starting up, reading the ace file is a very small portion of the time. In fact, you will notice that when consed is starting up, it uses 100% of the cpu. You can also compress phd balls--just: cd ../phdball_dir and gzip the phd balls. Even though the ace file refers to the uncompressed file name, consed is smart enough to try the .gz version if it can't find the original. (Try it and see.) Phd balls are usually even larger than ace files so compressing them will give you even more improvement in disk space usage. 24.18) NO PHD FILES Try bring up Consed like this: consed -nophd This mode allows you to view an assembly when you don't have phd files or chromatograms but you only have the ace file. I do not recommend nor support this option! There are so many things that do not work with this option that I haven't bothered to keep track of them, but here are a few items: can't make joins, can't recalculate consensus quality, can't view traces, can't edit, autofinish will not give good results, can't view quality of the read bases, ... 24.19) ADDING TAGS FROM OTHER PROGRAMS You can also write external programs that add tags to the ace file and/or the phd files. Both RT (read) and CT (consensus) tags can be appended to the end of the ace file. BEGIN_TAG tags can be appended to the end of the phd files. Do not rewrite the ace file or the phd file--there is no need to do so and it will cause problems. See SAMPLE PHD BALL FORMAT in this document for the format of BEGIN_TAG read tags. 24.20) CHROMOSOME POSITIONS/USER-DEFINED CONSENSUS POSITIONS Suppose instead of labeling the consensus 1, 2, 3, 4, ..., you want, for example, to number it: 100,000,001, 100,000,002, 100,000,003, 100,000,004, etc. (e.g., in chromosome positions). You can do this. Note that all bases in the consensus (except pads) will be numbered--you cannot, for example, only number exon bases and not number intron bases (pity). These user-defined consensus positions will apply not only to the consensus scale in the Aligned Reads Window, but also to all of the Navigate lists and Search for String. To start numbering the consensus at a number different from 1, add a "startNumberingConsensus" tag to either the consensus or a read in that contig. The tag will look like this (this is a consensus tag in the ace file): CT{ hs18-25105605_HSap-Contig startNumberingConsensus consed 1 1 041123:152840 25105605 } This says that the consensus will be numbered starting at 25,105,605 You cannot add such a tag by using Consed--you must have a program add it to the ace file (or a phd file of one of the reads in the contig). You can switch between labeling positions 1, 2, 3, ... and labeling positions by chromosome or user-defined positions as follows: Point to a "Misc" menu, hold down the left mouse button, and release on "Turn On/Off User-Defined Consensus Scale Numbers". 24.21) DEFINING KEYS (HOTKEYS) TO CALL EXTERNAL PROGRAMS AND/OR APPLY TAGS AND/OR INTEGRATE CONSED WITH EXTERNAL DATABASES [CUSTOM KEYS, USER-DEFINED KEYS] You can define keys (such as Control-N) to apply a particular tag to a single base, saving you the several steps in applying tags: swiping and selecting a tag type (as shown under "TAGS" above). However, it is even more powerful than that. You can also define an external program to run when you type this key. That external program can be your own, and it could be, for example, a program that puts information into an external database. The first thing you need to set up a custom hotkey is a consedrc file which goes in edit_dir of the project you're working on (see above CONSED CUSTOMIZATION for other possible locations). Put the following in that file: consed.userDefinedKeys: 14 15 ! make a space-separated list of the decimal ASCII values of the keys ! 14 means control-N, 15 means control-O consed.programsForUserDefinedKeys: /bin/echo /bin/echo ! a space-separated list of the full pathnames of the commands to run consed.argumentsToPassToUserDefinedPrograms: argument_for_first_key argument_for_se cond_key ! a space-separated list of the arguments to pass to each user-defined programs consed.tagsToApplyWithUserDefinedKeys: none polymorphismConfirmed ! a space-separate list of the tag types to apply when the user ! presses a user-defined key. If a key is to have no associated tag, ! then enter "none" for that key. This makes control-N and control-O ("oh"--not zero) call "/bin/echo" by default. In either the aligned reads window or the trace window, click the cursor on a base and try these keys (e.g., holding down the control key and typing 'o'). Watch in the xterm where you started Consed for output like this: argument_for_first_key djs74-561.s1 97 Contig1 2534 2581 a 51 /kw3/gordon/consed_demo/standard/edit_dir/standard.fasta.screen.ace.1 tr.window argument_for_second_key djs74-2679.s1 78 Contig1 2527 2574 c 39 /kw3/gordon/consed_demo/standard/edit_dir/standard.fasta.screen.ace.1 a.r.window djs74_561.s1 the read the user was viewing (or "consensus" if the cursor is on the consensus) 97 the base position in the direction of sequencing (or -1 if the cursor is on the consensus) Contig1 the contig 2534 the unpadded consensus position 2581 is the padded (counts *'s) consensus position 'a' is the base 51 is the quality of the base /kw3/gordon/consed_demo/standard/edit_dir/standard.fasta.screen.ace.1 is the ace file tr.window means it was called by the user pushing the key in the trace window--not the aligned reads window. It's the same as if you had run the program from the shell, with command-line arguments, like this: bash%: /bin/echo argument_for_first_key djs74-561.s1 97 Contig1 2534 2581 a 51 /kw3/gordon/consed_demo/standard/edit_dir/standard.fasta.screen.ace.1 tr.window You will also see that control-O will automatically add a polymorphismConfirmed tag, but control-N will not add any tag. That is because of consed.tagsToApplyWithUserDefinedKeys (see above). Several groups that are doing polymorphism detection have expressed interest in this feature because it enables them to have Consed directly write into an external database (e.g., Oracle or Sybase) by calling a program that then writes to the database. You can use these hotkeys from within the trace window or the aligned reads window. You don't have to use only ctrl-N/ctrl-O... for instance 1 is control-A, 2 is control-B, 3 is control-C, 4 is control-D, etc. If you want to pass this information to a database, you will need to know how to talk to your database, and either choose your hotkey to do it directly for you, or call another program that takes the parameters above and massages them into the format your database needs. control-A, control-E, and control-T already mean something in the aligned reads window, so those keys cannot be defined to be anything else. Typically control-C, control-S, and control-Q already mean something to the operating system so you can't use those, either. 24.22) READ PREFIXES You can create a file called readPrefixes.txt in edit_dir. This file contains a list of reads and prefixes for those reads. In the Aligned Reads Window, the Consed user will see those read prefixes in a column before the read names. This can be a very helpful feature for finishers. For example, these read prefixes can indicate to the finishers which templates are available to use for making finishing reads. The format of the file is: (readname) (read prefix) (color for read prefix) The read prefix and color for read prefix are optional. If you leave them out, you get '*' for the read prefix in blue. The consed parameters involving this feature are: consed.defaultReadPrefix: * consed.readPrefixesFile: readPrefixes.txt consed.maxCharsDisplayedForReadPrefix: 1 but you probably won't need to change them. 24.23) USING FILES CREATED ON WINDOWS OR WINDOWS NT. Don't. (E.g., phd files generated by a Beckman CEQ-2000.) These files initially had at end of line instead of . CONSED chokes every time it tries to read something from these phd files. If you must use these files, you must first convert them to UNIX format, which means stripping out the CR's and just having \n (decimal 10) separate lines. 24.24) CREATING YOUR OWN ACE FILES (INSTEAD OF ACE FILES CREATED BY PHRAP) Some people have tried creating their own ace files, try Consed on it, and when Consed starts up ok, they don't understand when later some feature in Consed doesn't work. This is because Consed does not check everything about an ace file when it starts up. If you are going to write software to create ace files, here is a partial list of Consed features you should check before you think your ace files are fine for Consed: assembly view restriction digest read all traces complement contig and then read all traces add new reads If all of these work properly, then your ace files are probably ok. 24.25) CONSED OPTIONS You've seen quite a few consed options, such as -removeReads, -socket, -ace, -nophd, -removeContigs, etc. To see them all, type consed -help -------------------------------------------------------------------------- 25. MONITORS AND MICE FOR CONSED If your monitor is part of a Unix computer (a Linux box, a Mac or a Sun) or is an Xterminal, then you will probably have no problem. If your monitor is a PC running Windows (any flavor), then you must have an X emulator installed and running. X emulators include: Exceed, XWin32, Reflection X, and OpenNT. Any of these will work if configured correctly (and the 'correctly' is the key). I encourage you to use single window mode (where there is one huge unix window with xterms inside it) and then use a Unix window manager such as CDE, fvwm, or mwm. If your monitor is a MAC with macosx running, see NOTE TO MACOSX USERS (above). Whatever you monitor, you must have 3 button mouse or 3 button emulation. 3 Button emulation is tricky since Consed uses all 3 buttons of the mouse and it also uses Control-Middle-Mouse-button, Shift-Middle-Mouse-Button and Control-Right-Mouse-Button. So if you are going to try to just use a 2 button mouse (or, God-forbid, a 1 button mouse), you should make sure that you can emulate each of those. Often, if you push the left and right mouse buttons at the same time, your X server will interpret that to be the middle mouse button. But you must consult your X emulator or X server to know what it will do--that is out of Consed's control. -------------------------------------------------------------------------- 26. ACE FILE FORMAT Note that consed really requires both an ace file and a phd ball to fully function. If you are trying to write files that consed can read, I strongly urge you to write both files. Read the next section about phd balls. Refer to the accompanying sample_ace_file.txt (below) AS CO <# of bases> <# of reads in contig> <# of base segments in contig> This defines the contig. The U or C indicates whether the contig has been complemented from the way phrap originally created it. Thus this is always U for an ace file created by phrap. The contig sequence follows. It includes pads--"*" characters which are inserted by phrap in order to make room for some read that has an extra base at that position. (Note: any position which counts the *'s is referred to as a "padded position". A position that does not count *'s is referred to as "unpadded position".) The contig sequence must be followed by a blank line. BQ This starts the list of base qualities for the unpadded consensus bases. (NB: annoyingly, no qualities are given for *'s in the consensus.) The contig is the one from the previous CO, hence no name is needed here. The list of base qualities must be followed by a blank line. AF This defines the location of the read within the contig. C or U means complemented or uncomplemented. means the position of the beginning of the read, in terms of consensus bases which start at 1 and do count *'s. BS The BS line (base segment) indicates which read phrap has chosen to be the consensus at a particular position. BS lines are now optional since they don't make much sense for assemblers other than phrap. (It is also possible to have contigs with no reads in them.) If you don't have any BS lines, there must be a blank line between the most recent AF line and the next QA line. If you choose to to write BS lines, I suggest you choose any read which matches the consensus perfectly over the stretch of bases. There must not be any two BS lines that intersect. Each unpadded base must be included in some BS line. RD <# of padded bases> <# of whole read info items> <# of read tags> Below RD is the sequence of bases for the read. The sequence includes *'s and is in the orientation that phrap needed to align it against the consensus (thus it might be complemented from the direction it was sequenced). QA This line indicates which part of the read is the high quality segment (if there is any) and which part of the read is aligned against the consensus. These positions are offsets (and count *'s) from the left end of the read (left, as shown in Consed). Hence for bottom strand reads, the offsets are from the end of the read. The offsets are 1-based. That is, if the left-most base is in the aligned, high-quality region, = 1 and = 1 (not zero). If the entire read is low quality, then and will both be -1. phrap will sometimes make a QA line in which both align clipping positions are -1. This means that the read is completely unaligned and shouldn't be there at all. (Sorry! I know the read shouldn't even be in the ace file.) DS CHROMAT_FILE: PHD_FILE: TIME: CHEM: DYE: TEMPLATE: