How to extract the lines that fall within a range, where the start and end of the range can be described by regular expressions
juzi1114
Given a file like this:
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- [color=Red]SUBCELLULAR LOCATION[/color]: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- [color=Red]SUBCELLULAR LOCATION[/color]: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
I want to extract the short blocks that begin with SUBCELLULAR LOCATION, like this:
SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
similarity). Cell membrane; Peripheral membrane protein (By
similarity).
SUBCELLULAR LOCATION: Isoform 2: Cell membrane; Lipid-anchor, GPI-
anchor; Extracellular side (By similarity).
Here is what I have written so far:
[color=RoyalBlue]if ($line =~ /SUBCELLULAR LOCATION:/)
{
print $line,"\n";
if ($array[$line_num] =~ /\-\!\-/)
{next;}
}
if ($array[$line_num] =~ /^\/\//)
{
$a++;
last;[/color]
Only the first line of each block gets printed, and I don't know how to fix it. Thanks, everyone!
[[i] Last edited by flw on 2008-6-20 13:16 [/i]]
flw
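Here is a general solution for this class of problem.
It is a lightweight, line-oriented approach,
and far more elegant than approaches that run a pattern match over the whole file.
$start is the pattern for the start marker and $end is the pattern for the end marker.
if ( (/$start/ .. /$end/) and !/$end/ ){
means: take the lines from the start marker up to the end marker, but not the end-marker line itself.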
[font=fixedsys][code]#! /usr/bin/env perl
my $start = qr/^CC\s+-!- SUBCELLULAR LOCATION/;
my $end = qr/^CC\s+-!- (?!SUBCELLULAR LOCATION)/;
while(<DATA>){
if ( (/$start/ .. /$end/) and !/$end/ ){
print "*** $_";
}
else{
print "--- $_";
}
}
__END__
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;[/code][/font]
Output:
[font=fixedsys][quote]flw@debian:~$ ./ttt.pl
--- CC -!- FUNCTION: Rapidly .
--- CC -!- CATALYTIC ACTIVITY: Acetylcholine.
--- CC -!- SUBUNIT: Homotetramer; composed .
--- CC Interacts with PRIMA1.
--- CC anchor it to the basal
--- CC (By similarity).
[color=blue]*** CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
*** CC similarity). Cell membrane; Peripheral membrane protein (By
*** CC similarity).
*** CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
*** CC anchor; Extracellular side (By similarity).[/color]
--- CC -!- ALTERNATIVE PRODUCTS:
--- CC Event=Alternative splicing; Named isoforms=2;
flw@debian:~$[/quote][/font]
cobrawgl
#!/usr/bin/perl
use strict;
use warnings;
my @data = <DATA>;
$_ = join '', @data;
my @t = /(SUBCELLULAR.*?)CC\s+-!-/msg;
print map {s/CC\s+//g; $_} @t;
__DATA__
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
[[i] Last edited by cobrawgl on 2008-6-20 12:36 [/i]]
juzi1114
Reply to cobrawgl's post #2
Thanks! Heh... looks like I still have a lot to learn about regular expressions.
cobrawgl
Moderator, if you have time, could you explain it for everyone? :mrgreen:
cobrawgl
What does that ?! mean, moderator? :em14:
flw
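[quote]Originally posted by [i]cobrawgl[/i] on 2008-6-20 12:49 [url=http://bbs.chinaunix.net/redirect.php?goto=findpost&pid=8626782&ptid=1165378][img]http://bbs.chinaunix.net/images/common/back.gif[/img][/url]
What does that ?! mean, moderator? :em14: [/quote]
It is a negative lookahead assertion: (?!X) succeeds only if X does not match at the current position, and it consumes no characters.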
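To make the lookahead concrete, here is a minimal sketch that applies the $end pattern from the script above to three lines taken from the sample data. The @lines and @report arrays are only illustrative names introduced for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# End-marker pattern from the script above: a "CC -!-" header line
# whose topic is NOT "SUBCELLULAR LOCATION".
my $end = qr/^CC\s+-!- (?!SUBCELLULAR LOCATION)/;

# Three lines from the sample data (illustrative selection).
my @lines = (
    'CC -!- SUBUNIT: Homotetramer; composed .',
    'CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By',
    'CC similarity).',
);

my @report;
for my $line (@lines) {
    my $verdict = $line =~ $end ? 'matches' : 'does not match';
    push @report, "$verdict: $line\n";
}
print @report;
```

Only the SUBUNIT line matches. The SUBCELLULAR LOCATION header is rejected by the lookahead, and the continuation line has no "-!-" at all, so neither of them can end the range.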
cobrawgl
Thanks for the pointer, moderator. Off to hit the books :mrgreen:
ly5066113
A clumsy approach:
[code]#! /bin/perl
use warnings;
use strict;
my $key;
while(<DATA>){
if (/-!-/) {
$key = 0;
}
if (/SUBCELLULAR LOCATION/) {
print;
$key = 1;
next;
}
if ($key) {
print;
}
}
__END__
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;[/code]
flw
[quote]Originally posted by [i]ly5066113[/i] on 2008-6-20 13:08 [url=http://bbs.chinaunix.net/redirect.php?goto=findpost&pid=8626974&ptid=1165378][img]http://bbs.chinaunix.net/images/common/back.gif[/img][/url]
A clumsy approach:[/quote]
The range operator has built-in state that plays the same role as your $key, except that the operator maintains it itself, so the code is more readable.
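That built-in state can be observed directly, because in scalar context the range operator returns a value rather than a plain boolean. The sketch below prints that value next to each line of a shortened copy of the sample data; @lines and @report are illustrative names, and the shortened data is an assumption made to keep the example small.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $start = qr/^CC\s+-!- SUBCELLULAR LOCATION/;
my $end   = qr/^CC\s+-!- (?!SUBCELLULAR LOCATION)/;

# Shortened copy of the sample data, one record line per element.
my @lines = split /\n/, <<'EOT';
CC -!- SUBUNIT: Homotetramer; composed .
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
EOT

my @report;
for my $line (@lines) {
    # Outside the range the operator returns a false value. Inside it
    # returns a sequence number (1, 2, 3, ...), and the number on the
    # line that closes the range has "E0" appended.
    my $state = ($line =~ $start) .. ($line =~ $end);
    push @report, sprintf("%-4s%s\n", $state || '-', $line);
}
print @report;
```

The lines outside the range are shown with "-", the two SUBCELLULAR LOCATION blocks are numbered 1 to 4, and the ALTERNATIVE PRODUCTS line gets "5E0". Testing the returned value for a trailing "E0" is another way to skip the end-marker line, in place of the extra !/$end/ check.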
juzi1114
Reply to flw's post #2
Thanks! I really need to study harder... heh.
ly5066113
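[quote]Originally posted by [i]flw[/i] on 2008-6-20 13:13 [url=http://bbs.chinaunix.net/redirect.php?goto=findpost&pid=8627016&ptid=1165378][img]http://bbs.chinaunix.net/images/common/back.gif[/img][/url]
The range operator has built-in state that plays the same role as your $key, except that the operator maintains it itself, so the code is more readable. [/quote]
Yes, the boss's method really is very elegant. But something like ?! is beyond someone like me who has only read Learning Perl (the "little camel" book) once, heh.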
ulmer
Reply to juzi1114's post #1
Another way is to use $/ ($INPUT_RECORD_SEPARATOR).
Because every record in your data begins with "CC -!-",
you can set $/ = "CC -!-" to split the input into records and store each one as an element of an array.
Then process the array elements to get what you want.
Sample code:
[quote]
$/ = "CC -!-";
my @data = ();
while (<DATA>) {
chomp;
push @data, $_ if $_ ne '';
}
__DATA__
CC -!- FUNCTION: Rapidly .
CC -!- CATALYTIC ACTIVITY: Acetylcholine.
CC -!- SUBUNIT: Homotetramer; composed .
CC Interacts with PRIMA1.
CC anchor it to the basal
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity). Cell membrane; Peripheral membrane protein (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
[/quote]
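The sample above only reads the records into @data. One possible way to finish the job is sketched below: keep the records whose text starts with SUBCELLULAR LOCATION and strip the "CC" prefix from their continuation lines. The in-memory filehandle, the shortened data, and the names $text, $fh and @wanted are assumptions made so the example is self-contained. Note that $/ is a literal string, not a regular expression, so this only works if the spacing in "CC -!-" is identical on every header line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Shortened copy of the sample data.
my $text = <<'EOT';
CC -!- SUBUNIT: Homotetramer; composed .
CC (By similarity).
CC -!- SUBCELLULAR LOCATION: Cell junction, synapse. Secreted (By
CC similarity).
CC -!- SUBCELLULAR LOCATION: Isoform 2: Cell membrane;
CC anchor; Extracellular side (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
EOT

# Read from an in-memory filehandle so the sketch needs no input file.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my @wanted;
{
    local $/ = "CC -!-";          # each read stops after the next "CC -!-"
    while (my $record = <$fh>) {
        chomp $record;            # chomp removes the trailing $/ string
        next unless $record =~ /^\s*SUBCELLULAR LOCATION/;
        $record =~ s/^\s+//;      # drop the space left after "CC -!-"
        $record =~ s/^CC\s+//mg;  # strip "CC" from continuation lines
        push @wanted, $record;
    }
}
close $fh;
print @wanted;
```

With this data @wanted ends up with two elements, one per SUBCELLULAR LOCATION block, each keeping its internal line breaks.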
juzi1114
Reply to ulmer's post #13
Thanks very much! :mrgreen:
flw
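[quote]Originally posted by [i]ly5066113[/i] on 2008-6-20 13:27 [url=http://bbs.chinaunix.net/redirect.php?goto=findpost&pid=8627155&ptid=1165378][img]http://bbs.chinaunix.net/images/common/back.gif[/img][/url]
Yes, the boss's method really is very elegant. But something like ?! is beyond someone like me who has only read Learning Perl (the "little camel" book) once, heh. [/quote]
The ?! is specific to this particular problem and has nothing to do with the range operator.
So you will not always need ?!, whereas the range operator really is convenient for this whole class of problem.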
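As an illustration of a case where no ?! is needed, here is a minimal sketch with hypothetical BEGIN NOTES / END NOTES marker lines. The markers and the data are invented for this example; only the if condition is the idiom from earlier in the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical markers: the end marker is a distinct line of its own,
# so a plain pattern is enough and no lookahead is required.
my $start = qr/^BEGIN NOTES$/;
my $end   = qr/^END NOTES$/;

# Invented input, one line per element.
my @lines = split /\n/, <<'EOT';
header
BEGIN NOTES
first note
second note
END NOTES
footer
EOT

my @kept;
for (@lines) {
    # Inside the range, but not the end-marker line itself.
    if ( (/$start/ .. /$end/) and !/$end/ ) {
        push @kept, $_;
    }
}
print "$_\n" for @kept;
```

This keeps the BEGIN NOTES line and the two note lines. As in the original script, the start-marker line is part of the result and only the end-marker line is dropped.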
不死草
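[quote]Originally posted by [i]flw[/i] on 2008-6-20 12:13 [url=http://bbs.chinaunix.net/redirect.php?goto=findpost&pid=8626732&ptid=1165378][img]http://bbs.chinaunix.net/images/common/back.gif[/img][/url]
$start is the pattern for the start marker and $end is the pattern for the end marker.
if ( (/$start/ .. /$end/) and !/$end/ ){
means: take the lines from the start marker up to the end marker, but not the end-marker line itself.
[/quote]
The moderator's method really is good. Learning from it!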