找到你要的答案

Q:extract FASTA sequence using sequence ID [closed]

Q:提取使用序列号[关闭] FASTA序列

I have two files: File 1:

84C2_Locus_14_Transcript_1/3_Confidence_0.571_Length_1244
AAACTAGTCAATAGAGAAAATCCAAAGTGGATGAAATTGAAGTGATTGTATGGCACAAGT...so on

>84C2_Locus_14_Transcript_2/3_Confidence_0.857_Length_1961
AAACTAGTCAATAGAGAAAATCCAAAGTGGATGAAATTGAAGTGATTGTATGGCACAAGT...so on

>84C2|Locus_15_Transcript_1/9_Confidence_0.190_Length_757
ATTTGCTCGGAAAAACACTTCTCGTGGAACTTGTTAGCGCTGAGCTTGATCCCAAGACGA.....so on

>84C2_Locus_15_Transcript_5/9_Confidence_0.333_Length_1841
ATTTGCTCGGAAAAACACTTCTCGTGGAACTTGTTAGCGCTGAGCTTGATCCCAAGACGA....so on

File2: only the sequence IDS

84C2_Locus_14_Transcript_1/3_Confidence_0.571_Length_1244
84C2_Locus_14_Transcript_2/3_Confidence_0.857_Length_1961
84C2_Locus_14_Transcript_3/3_Confidence_0.571_Length_1248
84C2_Locus_15_Transcript_1/9_Confidence_0.190_Length_757

...........so many.

my output file should be contain the sequence associated with header. i.e. matching the sequence id file header portion with original fasta sequence file and those sequences header match the fasta sequence header store in another output file containing header with sequence.Just like this:

Original output file should like this:

>84C2_Locus_15_Transcript_5/9_Confidence_0.333_Length_1841
ATTTGCTCGGAAAAACACTTCTCGTGGAACTTGTTAGCGCTGAGCTTGATCCCAAGACGA......so on

Please suggest me the way in perl which is applicable for my problem.

I have two files: File 1:

84C2_Locus_14_Transcript_1/3_Confidence_0.571_Length_1244
AAACTAGTCAATAGAGAAAATCCAAAGTGGATGAAATTGAAGTGATTGTATGGCACAAGT...so on

>84C2_Locus_14_Transcript_2/3_Confidence_0.857_Length_1961
AAACTAGTCAATAGAGAAAATCCAAAGTGGATGAAATTGAAGTGATTGTATGGCACAAGT...so on

>84C2|Locus_15_Transcript_1/9_Confidence_0.190_Length_757
ATTTGCTCGGAAAAACACTTCTCGTGGAACTTGTTAGCGCTGAGCTTGATCCCAAGACGA.....so on

>84C2_Locus_15_Transcript_5/9_Confidence_0.333_Length_1841
ATTTGCTCGGAAAAACACTTCTCGTGGAACTTGTTAGCGCTGAGCTTGATCCCAAGACGA....so on

源文件:只有序列ID

84C2_Locus_14_Transcript_1/3_Confidence_0.571_Length_1244
84C2_Locus_14_Transcript_2/3_Confidence_0.857_Length_1961
84C2_Locus_14_Transcript_3/3_Confidence_0.571_Length_1248
84C2_Locus_15_Transcript_1/9_Confidence_0.190_Length_757

...........这么多。

我的输出文件应该包含与头相关联的序列。即匹配序列ID文件头部分与原序列的FASTA序列文件和标题匹配的FASTA序列头存储在另一个输出文件包含报头序列。就这样:

原始输出文件应该这样:

>84C2_Locus_15_Transcript_5/9_Confidence_0.333_Length_1841
ATTTGCTCGGAAAAACACTTCTCGTGGAACTTGTTAGCGCTGAGCTTGATCCCAAGACGA......so on

请建议我在Perl适用于我的问题的方法。

answer1: 回答1:

Edit: New code, I didn't understand your question the first time. This script reads both files, stores all IDs in a hash and then goes through all sequences in the first file. Only those sequences whose IDs are in the second file get written into the output file. Note that the output is written with | as separator like in the first file, not _ like the IDs in the second file. The missing > in the beginning of the first file is also reproduced correctly.

#!/usr/bin/perl -w

use strict;

# Check command line arguments
unless ($#ARGV == 1 && -e $ARGV[0] && -e $ARGV[1]) {
        print STDERR "Usage: split-fasta.pl DATA IDS\n";
        exit 1;
}

my (%ids, $id);

# Read sequence IDs
open(IDS, "<$ARGV[1]") or die "Can't open IDs file: $1";

while (<IDS>) {
        $ids{$_} = 1;
}

close(IDS);


# Read sequence data and write results to output.fasta
open(DATA,    "<$ARGV[0]") or die "Can't open sequences file: $1";
open(OUTFILE, ">>output.fasta") or die "Can't open output file: $1";

while (<DATA>) {
        my $line = $_;


        if ($line =~ /^>?(\w+\|.+\n)/) {
                $id = $1;
                $id =~ tr/|/_/;
        }

        print OUTFILE $line if defined $ids{$id};
}

close(DATA);
close(OUTFILE);

So, you basically want to split file 1 into multiple files that each only contain one sequence? Maybe this ugly (maybe close each filehandle and not only the last one ...) piece of Perl will help you out.

#!/usr/bin/perl -w

use strict;

while (<>) {
        my $line = $_;

        if ($line =~ />?([0-9]+|\w+)\//) {
                my $file_name = $1;
                open(OUTFILE, ">>$file_name");
        }

        print OUTFILE $line;
}

close(OUTFILE);

Edit: In case you want to feed in IDs from the second file, just add another if that matches the ID with the headers from the first file. This should work as long as both files are in the same order since the input is immediately discarded upon writing and can't be searched anymore.

编辑:新代码,我第一次不明白你的问题。此脚本读取两个文件,将所有的ID存储在一个散列中,然后在第一个文件中进行所有序列。只有那些ID在第二个文件中的序列才会写入输出文件。请注意,输出是用|在第一个文件分隔符一样,不_像第二文件系统。失踪的>;在第一个文件的开始也是正确复制。

#!/usr/bin/perl -w

use strict;

# Check command line arguments
unless ($#ARGV == 1 && -e $ARGV[0] && -e $ARGV[1]) {
        print STDERR "Usage: split-fasta.pl DATA IDS\n";
        exit 1;
}

my (%ids, $id);

# Read sequence IDs
open(IDS, "<$ARGV[1]") or die "Can't open IDs file: $1";

while (<IDS>) {
        $ids{$_} = 1;
}

close(IDS);


# Read sequence data and write results to output.fasta
open(DATA,    "<$ARGV[0]") or die "Can't open sequences file: $1";
open(OUTFILE, ">>output.fasta") or die "Can't open output file: $1";

while (<DATA>) {
        my $line = $_;


        if ($line =~ /^>?(\w+\|.+\n)/) {
                $id = $1;
                $id =~ tr/|/_/;
        }

        print OUTFILE $line if defined $ids{$id};
}

close(DATA);
close(OUTFILE);

所以,你基本上想把文件1分成多个文件,每个文件只包含一个序列?也许这丑(也许关闭每个文件句柄而不是只有最后一个…)块Perl会帮你。

#!/usr/bin/perl -w

use strict;

while (<>) {
        my $line = $_;

        if ($line =~ />?([0-9]+|\w+)\//) {
                my $file_name = $1;
                open(OUTFILE, ">>$file_name");
        }

        print OUTFILE $line;
}

close(OUTFILE);

编辑:如果你想从第二个文件中输入IDS,只需添加另一个,如果匹配的ID与头文件从第一个文件。这应该工作,只要两个文件是在相同的顺序,因为输入立即丢弃时,写,不能再搜索。

perl  design-patterns  matching