cgatools version 1.1.0 build 8
usage: cgatools COMMAND [ options ] [ positionalArgs ]

For help on a particular command CMD, try "cgatools help CMD".
Available commands:
    help          Prints help information.
    man           Prints the cgatools reference manual.
    fasta2crr     Converts fasta reference files to the crr format.
    crr2fasta     Converts a crr reference file to the fasta format.
    listcrr       Lists chromosomes, contigs, or ambiguous sequences of a crr
                  file.
    decodecrr     Prints the reference sequence for a given reference range.
    snpdiff       Compares snp calls to a Complete Genomics variant file.
    calldiff      Compares two Complete Genomics variant files.
    listvariants  Lists the variants present in a variant file.
    testvariants  Tests variant files for presence of variants.
    map2sam       Converts CGI initial reference mappings into SAM format.
    evidence2sam  Converts CGI variant evidence data into SAM format.
    join          Joins two tab-delimited files based on equal fields or
                  overlapping regions.

-------------------------------------------------------------------------------

COMMAND NAME
    help - Prints help information.

OPTIONS
    -h [ --help ]
        Print this help message.
    --command arg
        The command to describe.
    --format arg (=text)
        The format of the output stream (text or html).
    --output arg (=STDOUT)
        The output file (may be omitted for stdout).

-------------------------------------------------------------------------------

COMMAND NAME
    man - Prints the cgatools reference manual.

OPTIONS
    -h [ --help ]
        Print this help message.
    --output arg (=STDOUT)
        The output file (may be omitted for stdout).
    --format arg (=text)
        The format of the output stream (text or html).

-------------------------------------------------------------------------------

COMMAND NAME
    fasta2crr - Converts fasta reference files to the crr format.

OPTIONS
    -h [ --help ]
        Print this help message.
    --input arg
        The input fasta files (may be positional args, or omitted for stdin).
        Take care to specify the fasta files in chromosome order; ordering is
        important.
        To work with human Complete Genomics data, the chromosome order
        should be chr1...chr22, chrX, chrY, chrM.
    --output arg
        The output crr file.
    --circular arg
        A comma-separated list of circular chromosome names. If omitted,
        defaults to chrM.

-------------------------------------------------------------------------------

COMMAND NAME
    crr2fasta - Converts a crr reference file to the fasta format.

OPTIONS
    -h [ --help ]
        Print this help message.
    --input arg
        The input crr file (may be positional arg).
    --output arg (=STDOUT)
        The output fasta file (may be omitted for stdout).
    --line-width arg (=50)
        The maximum width of a line of sequence.

-------------------------------------------------------------------------------

COMMAND NAME
    listcrr - Lists chromosomes, contigs, or ambiguous sequences of a crr
    file.

DESCRIPTION
    For mode=chromosome, prints a space-separated table describing each
    chromosome within the reference. The columns are defined as follows:

        ChromosomeId  A numeric identifier for the chromosome.
        Chromosome    The name of the chromosome.
        Length        The length in bases of the chromosome.
        Circular      Boolean indicating if the chromosome is circular.
        Md5           Md5 of the string containing the upper case IUPAC code
                      for each base in the chromosome (spaces and dashes are
                      omitted).

    For mode=contig, prints a space-separated table describing each gap and
    each contig within the reference. Here, a gap between contigs is defined
    as any stretch of min-contig-gap-length or more no-called reference bases
    (N character). The columns are defined as follows:

        ChromosomeId  A numeric identifier for the chromosome.
        Chromosome    The name of the chromosome.
        Type          Either CONTIG or GAP.
        Offset        The 0-based offset of the start of the contig or gap
                      within the chromosome.
        Length        The length in bases of the contig or gap.

    For mode=ambiguity, prints a space-separated table describing each run of
    ambiguity codes within the reference. The columns are defined as follows:

        ChromosomeId  A numeric identifier for the chromosome.
        Chromosome    The name of the chromosome.
        Code          The IUPAC code for the region.
        Offset        The 0-based offset of the run of ambiguity codes in the
                      chromosome.
        Length        The length in bases of the run of ambiguity codes.

OPTIONS
    -h [ --help ]
        Print this help message.
    --reference arg
        The reference crr file (may be positional arg).
    --output arg (=STDOUT)
        The output file (may be omitted for stdout).
    --mode arg (=chromosome)
        One of chromosome, contig, or ambiguity.
    --min-contig-gap-length arg (=50)
        Minimum length of gap between reference contigs, for mode=contig.

-------------------------------------------------------------------------------

COMMAND NAME
    decodecrr - Prints the reference sequence for a given reference range.

OPTIONS
    -h [ --help ]
        Print this help message.
    --reference arg
        The reference crr file (may be positional arg).
    --output arg (=STDOUT)
        The output file (may be omitted for stdout).
    --range arg
        The range of bases to print (chr,begin,end or chr:begin-end).

-------------------------------------------------------------------------------

COMMAND NAME
    snpdiff - Compares snp calls to a Complete Genomics variant file.

DESCRIPTION
    Compares the snp calls in the "genotypes" file to the calls in a Complete
    Genomics variant file. The genotypes file is a tab-delimited file with at
    least the following columns (additional columns may be given):

        Chromosome       (Required) The name of the chromosome.
        Offset0Based     (Required) The 0-based offset in the chromosome.
        GenotypesStrand  (Optional) The strand of the calls in the Genotypes
                         column (+ or -, defaults to +).
        Genotypes        (Optional) The calls, one per allele. The following
                         calls are recognized:
                         A,C,G,T  A called base.
                         N        A no-call.
                         -        A deleted base.
                         .        A non-snp variation.

    The output is a tab-delimited file consisting of the columns of the
    original genotypes file, plus the following additional columns:

        Reference          The reference base at the given position.
        VariantFile        The calls made by the variant file, one per allele.
                           The character codes are the same as described for
                           the Genotypes column.
        DiscordantAlleles  (Only if Genotypes is present) The number of
                           Genotypes alleles that are discordant with calls in
                           the VariantFile. If the VariantFile is described as
                           haploid at the given position but the Genotypes is
                           diploid, then each genotype allele is compared
                           against the haploid call of the VariantFile.
        NoCallAlleles      (Only if Genotypes is present) The number of
                           Genotypes alleles that were no-called by the
                           VariantFile. If the VariantFile is described as
                           haploid at the given position but the Genotypes is
                           diploid, then a VariantFile no-call is counted
                           twice.

    The verbose output is a tab-delimited file consisting of the columns of
    the original genotypes file, plus the following additional columns:

        Reference    The reference base at the given position.
        VariantFile  The call made by the variant file for one allele (there
                     is a line in this file for each allele). The character
                     codes are the same as described for the Genotypes column.
        [CALLS]      The rest of the columns are pasted in from the
                     VariantFile, describing the variant file line used to
                     make the call.

    The stats output is a comma-separated file with several tables describing
    the results of the snp comparison, for each diploid genotype. The tables
    all describe the comparison result (column headers) versus the genotype
    classification (row labels) in different ways. The "Locus classification"
    tables have the most detailed match classifications, while the "Locus
    concordance" tables roll these match classifications up into
    "discordance" and "no-call". A locus is considered discordant if it is
    discordant for either allele. A locus is considered no-call if it is
    concordant for both alleles but has a no-call on either allele. The
    "Allele concordance" table describes the comparison result on a
    per-allele basis.

OPTIONS
    -h [ --help ]
        Print this help message.
    --reference arg
        The input crr file.
    --variants arg
        The input variant file.
    --genotypes arg
        The input genotypes file.
    --output-prefix arg
        The path prefix for all output reports.
    --reports arg (=Output,Verbose,Stats)
        Comma-separated list of reports to generate. A report is one of:
            Output   The output genotypes file.
            Verbose  The verbose output file.
            Stats    The stats output file.

SUPPORTED FORMAT_VERSION
    0.3 or later

-------------------------------------------------------------------------------

COMMAND NAME
    calldiff - Compares two Complete Genomics variant files.

DESCRIPTION
    Compares two Complete Genomics variant files. Divides the genome up into
    superloci of nearby variants, then compares the superloci. Also refines
    the comparison to determine per-call or per-locus comparison results.

    Comparison results are usually described by a semi-colon separated
    string, one per allele. Each allele's comparison result is one of the
    following classifications:

        ref-identical    The alleles of the two variant files are identical,
                         and they are consistent with the reference.
        alt-identical    The alleles of the two variant files are identical,
                         and they are inconsistent with the reference.
        ref-consistent   The alleles of the two variant files are consistent,
                         and they are consistent with the reference.
        alt-consistent   The alleles of the two variant files are consistent,
                         and they are inconsistent with the reference.
        onlyA            The alleles of the two variant files are
                         inconsistent, and only file A is inconsistent with
                         the reference.
        onlyB            The alleles of the two variant files are
                         inconsistent, and only file B is inconsistent with
                         the reference.
        mismatch         The alleles of the two variant files are
                         inconsistent, and they are both inconsistent with
                         the reference.
        phase-mismatch   The two variant files would be consistent if the
                         hapLink field had been empty, but they are
                         inconsistent.
        ploidy-mismatch  The superlocus did not have uniform ploidy.

    In some contexts, this classification is rolled up into a simplified
    classification, which is one of "identical", "consistent", "onlyA",
    "onlyB", or "mismatch".

    A good place to start looking at the results is the superlocus-output
    file. It has columns defined as follows:

        SuperlocusId    An identifier given to the superlocus.
        Chromosome      The name of the chromosome.
        Begin           The 0-based offset of the start of the superlocus.
        End             The 0-based offset of the base one past the end of
                        the superlocus.
        Classification  The match classification of the superlocus.
        Reference       The reference sequence.
        AllelesA        A semicolon-separated list of the alleles (one per
                        haplotype) for variant file A, for the phasing with
                        the best comparison result.
        AllelesB        A semicolon-separated list of the alleles (one per
                        haplotype) for variant file B, for the phasing with
                        the best comparison result.

    The locus-output file contains, for each locus in file A and file B that
    is not consistent with the reference, an annotated set of calls for the
    locus. The calls are annotated with the following columns:

        SuperlocusId             The id of the superlocus containing the
                                 locus.
        File                     The variant file (A or B).
        LocusClassification      The locus classification is determined by
                                 the varType column of the call that is
                                 inconsistent with the reference,
                                 concatenated with a modifier that describes
                                 whether the locus is heterozygous,
                                 homozygous, or contains no-calls. If there
                                 is no one variant in the locus (i.e., it is
                                 heterozygous alt-alt), the locus
                                 classification begins with "other".
        LocusDiffClassification  The match classification for the locus. This
                                 is defined to be the best of the comparison
                                 of the locus to the same region in the other
                                 file, or the comparison of the superlocus.

    The somatic output file contains a list of putative somatic variations of
    genome A. The output includes only those loci that can be classified as
    heterozygous or homozygous snp, del, ins or sub in file A, and are called
    reference in file B.
    Every locus is annotated with the following columns:

        VarScoreA     The alternative score of the variation in genome A;
                      this is the reference score for SNPs, and the evidence
                      score for other variations.
        RefScoreB     Minimum of the reference scores of the locus in genome
                      B.
        SomaticScore  The score that indicates the likelihood that the listed
                      variation is indeed somatic, as opposed to a false
                      negative in genome B or a false positive in genome A.

    Superlocus comparison statistics can be found in the superlocus-stats
    file. Locus comparison statistics can be found in the locus-stats file.

    Beware any output files whose parameter name begins with "debug".

OPTIONS
    -h [ --help ]
        Print this help message.
    --reference arg
        The input crr file.
    --variantsA arg
        The "A" input variant file.
    --variantsB arg
        The "B" input variant file.
    --output-prefix arg
        The path prefix for all output reports.
    --reports arg (=SuperlocusOutput,SuperlocusStats,LocusOutput,LocusStats)
        Comma-separated list of reports to generate. A report is one of:
            SuperlocusOutput       Report for superlocus classification.
            SuperlocusStats        Report for superlocus classification
                                   stats.
            LocusOutput            Report for locus classification.
            LocusStats             Report for locus stats.
            SomaticOutput          Report for the list of simple variations
                                   that are present only in file "A",
                                   annotated with the score that indicates
                                   the probability of the variation being
                                   truly somatic. Requires beta,
                                   export-rootA, and export-rootB options to
                                   be provided as well. Note: generating this
                                   report slows calldiff by 10x-20x.
            DebugCallOutput        Report for call classification.
            DebugSuperlocusOutput  Report for debug superlocus information.
            DebugSomaticOutput     Report for distribution estimates used for
                                   somatic rescoring. Only produced if
                                   SomaticOutput is also turned on.
    --locus-stats-column-count arg (=15)
        The number of columns for locus compare classification in the locus
        stats file.
    --max-hypothesis-count arg (=32)
        The maximum number of possible phasings to consider for a superlocus.
    --no-reference-cover-validation
        Turns off validation that all bases of a chromosome are covered by
        calls of the variant file.
    --export-rootA arg
        The "A" export package root, for example /data/GS00118-DNA_A01; this
        directory is expected to contain ASM/REF and ASM/EVIDENCE
        subdirectories.
    --export-rootB arg
        The "B" export package root.
    --beta
        This flag enables the SomaticOutput report, which is beta
        functionality.

SUPPORTED FORMAT_VERSION
    0.3 or later

-------------------------------------------------------------------------------

COMMAND NAME
    listvariants - Lists the variants present in a variant file.

DESCRIPTION
    Lists all called variants present in the specified variant files, in a
    format suitable for processing by the testvariants command. The output is
    a tab-delimited file consisting of the following columns:

        variantId   Sequential id assigned to each variant.
        chromosome  The chromosome of the variant.
        begin       0-based reference offset of the beginning of the variant.
        end         0-based reference offset of the end of the variant.
        varType     The varType as extracted from the variant file.
        reference   The reference sequence.
        alleleSeq   The variant allele sequence as extracted from the variant
                    file.
        xRef        The xRef as extracted from the variant file.

OPTIONS
    -h [ --help ]
        Print this help message.
    --beta
        This is a beta command. To run this command, you must pass the --beta
        flag.
    --reference arg
        The reference crr file.
    --output arg (=STDOUT)
        The output file (may be omitted for stdout).
    --variants arg
        The input variant files (may be positional args).

SUPPORTED FORMAT_VERSION
    0.3 or later

-------------------------------------------------------------------------------

COMMAND NAME
    testvariants - Tests variant files for presence of variants.

DESCRIPTION
    Tests variant files for presence of variants. The output is a
    tab-delimited file consisting of the columns of the input variants file,
    plus a column for each assembly results file that contains a character
    code for each allele.
    The character codes have meaning as follows:

        0  This allele of this genome is consistent with the reference at
           this locus but inconsistent with the variant.
        1  This allele of this genome has the input variant at this locus.
        N  This allele of this genome has no-calls but is consistent with the
           input variant.

OPTIONS
    -h [ --help ]
        Print this help message.
    --beta
        This is a beta command. To run this command, you must pass the --beta
        flag.
    --reference arg
        The reference crr file.
    --input arg (=STDIN)
        The input variants to test for.
    --output arg (=STDOUT)
        The output file (may be omitted for stdout).
    --variants arg
        The input variant files (may be positional args).

SUPPORTED FORMAT_VERSION
    0.3 or later

-------------------------------------------------------------------------------

COMMAND NAME
    map2sam - Converts CGI initial reference mappings into SAM format.

DESCRIPTION
    The Map2Sam converter takes as input Reads and Mappings files, a library
    structure file and a crr reference file and generates one SAM file as an
    output. The output is sent into stdout by default. All the mapping
    records from the input are converted into corresponding SAM records one
    to one. In addition, the unmapped DNB records are reported as SAM records
    having appropriate indication. Map2Sam converter tries to identify
    primary mappings and highlight them using the appropriate flag. The
    negative gaps in CGI mappings are represented using GS/GQ/GC tags.

OPTIONS
    -h [ --help ]
        Print this help message.
    -r [ --reads ] arg
        Input reads file.
    -m [ --mappings ] arg
        Input mappings file.
    -l [ --library ] arg
        Input library file.
    -s [ --reference ] arg
        Reference file.
    -o [ --output ] arg (=STDOUT)
        The output SAM file (may be omitted for stdout).
    -f [ --from ] arg (=0)
        Defines start read record of the export range.
    -t [ --to ] arg (=18446744073709551615)
        Defines end read record of the export range (the end record is not
        exported).
    -e [ --export-region ] arg
        Defines an export region as a half-open interval 'chr,from,to'.
    --skip-not-mapped
        Skip not mapped records.
    --add-mate-sequence
        Generate mate sequence and score tags.
    --mate-sv-candidates
        Inconsistent mappings are normally converted as single arm mappings
        with no mate information provided. If the option is used map2sam will
        mate unique single arm mappings in SAM including those on different
        strands and chromosomes. To distinguish these "artificially" mated
        records a tag "XS:i:1" is used. The MAPQ provided for these records
        is a single arm mapping weight.

SUPPORTED FORMAT_VERSION
    0.3 or later

-------------------------------------------------------------------------------

COMMAND NAME
    evidence2sam - Converts CGI variant evidence data into SAM format.

DESCRIPTION
    The evidence2sam converter takes as input evidence mapping files
    (evidenceDnbs-*) and generates one SAM file as an output. The output is
    sent into stdout by default. All the evidence mapping records from the
    input are converted into a pair of corresponding SAM records - one record
    for each HalfDNB. Evidence2Sam converter reports all mappings as not
    primary. The negative gaps in CGI mappings are represented using GS/GQ/GC
    tags.

OPTIONS
    -h [ --help ]
        Print this help message.
    --beta
        This is a beta command. To run this command, you must pass the --beta
        flag.
    -e [ --evidence-dnbs ] arg
        Input evidence dnbs file.
    -s [ --reference ] arg
        Reference file.
    -o [ --output ] arg (=STDOUT)
        The output SAM file (may be omitted for stdout).
    -r [ --export-region ] arg
        Defines an export region as a half-open interval 'chr,from,to'.
    --keep-duplicates
        Keep local duplicates of DNB mappings. All the output SAM records
        will be marked as not primary if this option is used.
    --add-mate-sequence
        Generate mate sequence and score tags.
    --add-allele-id
        Generate interval id and allele id tags.
    -v [ --debug-output ]
        Generate verbose debug output. Please don't rely on this option in
        production.

SUPPORTED FORMAT_VERSION
    0.3 or later

-------------------------------------------------------------------------------

COMMAND NAME
    join - Joins two tab-delimited files based on equal fields or overlapping
    regions.

DESCRIPTION
    Joins two tab-delimited files based on equal fields or overlapping
    regions. By default, an output record is produced for each match found
    between file A and file B, but output format can be controlled by the
    --output-mode parameter.

OPTIONS
    -h [ --help ]
        Print this help message.
    --beta
        This is a beta command. To run this command, you must pass the --beta
        flag.
    --input arg
        File name to use as input (may be positional args, or omitted for
        stdin). There must be exactly two input files to join. If only one
        file is specified by name, file A is taken to be stdin and file B is
        the named file. File B is read fully into memory, and file A is
        streamed. File A's columns appear first in the output.
    --output arg (=STDOUT)
        The output file name (may be omitted for stdout).
    --match arg
        A match specification, which is a column from A and a column from B
        separated by a colon.
    --overlap arg
        Overlap specification. An overlap specification consists of a range
        definition for files A and B, separated by a colon. A range
        definition may be two columns, in which case they are interpreted as
        the beginning and end of the range. Or it may be one column, in which
        case the range is defined as the 1-base range starting at the given
        value. The records from the two files must overlap in order to be
        considered for output. Two ranges are considered to overlap if the
        overlap is at least one base long, or if one of the ranges is length
        0 and the ranges overlap or abut. For example, "begin,end:offset"
        will match wherever end-begin > 0, begin<offset+1, and end>offset, or
        wherever end-begin = 0, begin<=offset+1, and end>=offset.
    -m [ --output-mode ] arg (=full)
        Output mode, one of the following:
            full         Print an output record for each match found between
                         file A and file B.
            compact      Print at most one record for each record of file A,
                         joining the file B values by a semicolon and
                         suppressing repeated B values and empty B values.
            compact-pct  Same as compact, but for each distinct B value,
                         annotate with the percentage of the A record that is
                         overlapped by B records with that B value.
                         Percentage is rounded up to nearest integer.
    --overlap-mode arg (=strict)
        Overlap mode, one of the following:
            strict                 Range A and B overlap if A.begin < B.end
                                   and B.begin < A.end.
            allow-abutting-points  Range A and B overlap if they meet the
                                   strict requirements, or if
                                   A.begin <= B.end and B.begin <= A.end and
                                   either A or B has zero length.
    --select arg (=A.*,B.*)
        Set of fields to select for output.
    -a [ --always-dump ]
        Dump every record of A, even if there are no matches with file B.
    --overlap-fraction-A arg (=0)
        Minimum fraction of A region overlap for filtering output.
    --boundary-uncertainty-A arg (=0)
        Boundary uncertainty for overlap filtering. Specifically, records
        failing the following predicate are filtered away:
            overlap >= overlap-fraction-A *
                       ( A-range-length - boundary-uncertainty-A )
    --overlap-fraction-B arg (=0)
        Minimum fraction of B region overlap for filtering output.
    --boundary-uncertainty-B arg (=0)
        Boundary uncertainty for overlap filtering. Specifically, records
        failing the following predicate are filtered away:
            overlap >= overlap-fraction-B *
                       ( B-range-length - boundary-uncertainty-B )

SUPPORTED FORMAT_VERSION
    Any
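
EXAMPLE
    The two --overlap-mode predicates and the overlap-fraction filter of the
    join command can be sketched in Python as follows. This is an
    illustrative sketch only, not the cgatools implementation: the names
    Range, overlaps, and passes_fraction_filter are hypothetical, and the
    sketch assumes ranges are half-open (begin inclusive, end exclusive), so
    that a range with begin == end has zero length.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """Hypothetical half-open range: begin inclusive, end exclusive."""
    begin: int
    end: int


def overlaps(a: Range, b: Range, mode: str = "strict") -> bool:
    """Apply the overlap predicate stated for the given --overlap-mode."""
    # strict: A.begin < B.end and B.begin < A.end
    strict = a.begin < b.end and b.begin < a.end
    if mode == "strict":
        return strict
    if mode == "allow-abutting-points":
        # Relaxed comparison, accepted only when A or B has zero length.
        zero_length = a.begin == a.end or b.begin == b.end
        touching = a.begin <= b.end and b.begin <= a.end
        return strict or (touching and zero_length)
    raise ValueError(f"unknown overlap mode: {mode}")


def passes_fraction_filter(overlap: int, range_length: int,
                           fraction: float = 0.0,
                           uncertainty: int = 0) -> bool:
    """overlap >= fraction * (range_length - uncertainty), as documented."""
    return overlap >= fraction * (range_length - uncertainty)


if __name__ == "__main__":
    point = Range(20, 20)      # zero-length range
    region = Range(20, 30)
    print(overlaps(point, region, "strict"))
    print(overlaps(point, region, "allow-abutting-points"))
    print(passes_fraction_filter(4, 10, fraction=0.5, uncertainty=2))
```

    Under these assumptions, a zero-length range that abuts a region is
    rejected by the strict predicate and accepted by allow-abutting-points,
    while two abutting ranges of non-zero length are rejected in both modes.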