How to use sed/grep to extract text between two words?
I am trying to output a string that contains everything between two words of a string.
Input:
"Here is a String"
Output:
"is a"
Using:
sed -n '/Here/,/String/p'
includes the endpoints, but I don't want to include them.
sed -e 's/Here\(.*\)String/\1/'
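A note on why the range form behaves differently: the address pair /Here/,/String/ selects whole lines, from a line matching Here through a later line matching String, so it cannot trim inside a single line. For input where the markers sit on separate lines, a minimal sketch (the five-line sample input is made up for illustration) can drop the endpoint lines inside the range:

```shell
# Hypothetical input: only the lines strictly between the markers are wanted.
between=$(printf '%s\n' before Here 'is a' String after |
  sed -n '/Here/,/String/{
    /Here/d
    /String/d
    p
  }')
printf '%s\n' "$between"
```

For the single-line input in the question, the substitution with a capture group is the better fit.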
GNU grep can also support positive and negative look-ahead and look-behind. For your case, the command would be:
echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'
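If there are multiple occurrences of Here and string, you can choose whether to match from the first Here to the last string, or to match each pair individually. In regex terms these are called a greedy match (first case) and a non-greedy match (second case).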
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*(?=string)' # Greedy match
is a string, and Here is another
$ echo 'Here is a string, and Here is another string.' | grep -oP '(?<=Here).*?(?=string)' # Non-greedy match (Notice the '?' after '*' in .*?)
is a
is another
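The two behaviours can be compared side by side. This is a small sketch that assumes a GNU grep built with -P support; the variable names are only for illustration. Because look-arounds consume nothing, the matches keep the spaces next to the markers, so the output is wrapped in brackets to make them visible:

```shell
line='Here is a string, and Here is another string.'

# Greedy: .* runs to the last "string" on the line, so there is one match.
greedy=$(printf '%s\n' "$line" | grep -oP '(?<=Here).*(?=string)')

# Non-greedy: .*? stops at the nearest "string", so each pair matches separately.
lazy=$(printf '%s\n' "$line" | grep -oP '(?<=Here).*?(?=string)')

# Brackets make the surrounding spaces visible.
printf '%s\n' "$greedy" | sed 's/.*/[&]/'
printf '%s\n' "$lazy" | sed 's/.*/[&]/'
```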
The accepted answer does not remove text that may come before Here or after String. This does:
sed -e 's/.*Here\(.*\)String.*/\1/'
The main difference is the addition of .* immediately before Here and after String.
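To see the difference, here is a short sketch on a line with extra text on both sides (the sample line and variable names are made up). It also shows sed's -n option combined with the p flag, which prints only lines where the substitution actually happened:

```shell
line='prefix Here is a String suffix'

# Without the outer .*: only "Here...String" is replaced, the rest of the line stays.
partial=$(printf '%s\n' "$line" | sed -e 's/Here\(.*\)String/\1/')

# With .* on both sides: the whole line is replaced by the captured group.
full=$(printf '%s\n' "$line" | sed -e 's/.*Here\(.*\)String.*/\1/')

# -n plus the p flag prints only lines where the substitution happened,
# so lines without both markers are dropped instead of echoed unchanged.
only=$(printf '%s\n' 'no markers here' "$line" | sed -n 's/.*Here\(.*\)String.*/\1/p')

printf '[%s]\n' "$partial" "$full" "$only"
```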
You can strip the strings in Bash alone:
$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$
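When the markers occur more than once, the choice between # and ## decides which occurrence you get. A minimal sketch, with a made-up sample string:

```shell
s='Here is a String, and Here is another String.'

# "#" removes the shortest matching prefix: everything through the first "Here ".
first=${s#*Here }
# "%%" removes the longest matching suffix: everything from the first " String" on.
first=${first%% String*}

# "##" removes the longest matching prefix: everything through the last "Here ".
last=${s##*Here }
last=${last%% String*}

printf '%s\n' "$first" "$last"
```

So ## followed by %% (as in the answer above) yields the text after the last Here, while # followed by %% yields the first pair.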
If you have a GNU grep built with PCRE support, you can use zero-width assertions:
$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a
Using GNU awk:
$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
is a
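With both words acting as field separators, $2 is only the first piece. As a rough sketch, assuming the two markers strictly alternate on the line, every even-numbered field is a piece between Here and string:

```shell
fields=$(printf '%s\n' 'Here is a string, and Here is another string.' |
  awk -F'Here|string' '{ for (i = 2; i <= NF; i += 2) print "[" $i "]" }')
printf '%s\n' "$fields"
```

The brackets are only there to show that the surrounding spaces are part of each field.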
grep with the -P (perl-regexp) option supports \K, which discards the characters matched so far. In our case, the previously matched string Here is dropped from the final output.
$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
is a
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
is a
If you want the output to be is a without the surrounding spaces, you could try the following:
$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a
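The two \K patterns are not interchangeable when the end marker is missing. A small sketch, assuming GNU grep with -P (the second input line is made up for the comparison):

```shell
# End marker present: the tempered pattern returns the text between the markers.
with_end=$(echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*')

# End marker missing: the look-ahead version matches nothing (grep exits non-zero),
# while the tempered version still returns the rest of the line.
strict=$(echo "Here is a text" | grep -oP 'Here\K.*(?=string)' || true)
loose=$(echo "Here is a text" | grep -oP 'Here\K(?:(?!string).)*')

printf '[%s] [%s] [%s]\n' "$with_end" "$strict" "$loose"
```

If a missing end marker should count as "no match", the look-ahead form is the safer choice.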
If you have a long file with many multi-line occurrences, it is useful to print the line numbers first:
cat -n file | sed -n '/Here/,/String/p'
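As an alternative sketch (awk is swapped in here, it is not part of the answer above), an awk range pattern can select the same lines and print the line number itself. The temporary file and its contents are made up for the example:

```shell
tmp=$(mktemp)
printf '%s\n' intro 'Here starts' middle 'String ends' outro > "$tmp"

# An awk range pattern works like the sed range; NR is the current line number.
numbered=$(awk '/Here/,/String/ { print NR ": " $0 }' "$tmp")
printf '%s\n' "$numbered"
rm -f "$tmp"
```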
This might work for you (GNU sed):
sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file
This prints each occurrence of text between the two markers (in this instance Here and String) on its own line and preserves newlines within the text.
All the above solutions have deficiencies where the last search string is repeated elsewhere in the string. I found it best to write a bash function.
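function str_str {
  local str
  # Strip everything up to and including the first occurrence of $2.
  # Quoting the marker keeps glob characters in it literal.
  str=${1#*"$2"}
  # Strip everything from the first occurrence of $3 onward.
  str=${str%%"$3"*}
  printf '%s' "$str"
}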
# test it ...
mystr="this is a string"
str_str "$mystr" "this " " string"
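To illustrate the case this is meant to handle, here is a sketch with the end marker repeated later in the string. The helper name between is hypothetical; it follows the same idea as str_str but is written to be self-contained:

```shell
# Hypothetical helper, same idea as str_str: first start marker to first end marker.
between() {
  rest=${1#*"$2"}      # drop everything through the first start marker
  rest=${rest%%"$3"*}  # drop everything from the first end marker on
  printf '%s' "$rest"
}

mystr='this is a string and another string'
fn_out=$(between "$mystr" 'this ' ' string')

# For comparison: a greedy sed capture runs to the last " string".
sed_out=$(printf '%s\n' "$mystr" | sed 's/.*this \(.*\) string.*/\1/')

printf '[%s]\n[%s]\n' "$fn_out" "$sed_out"
```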
You can use \1 (refer to http://www.grymoire.com/Unix/Sed.html#uh-4):
echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'
The content inside the brackets is stored as \1.
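A short sketch of how the groups are written and numbered. It assumes a sed that accepts -E for extended regular expressions (GNU sed does); the variable names are only for illustration:

```shell
# Basic regex (default): groups are written \( ... \).
bre=$(echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/')

# Extended regex (-E): groups are written ( ... ); \1 means the same thing.
ere=$(echo "Hello is a String" | sed -E 's/Hello(.*)String/\1/')

# Groups are numbered by their opening bracket, left to right.
swapped=$(echo "Hello is a String" | sed -E 's/(Hello)(.*)(String)/\3\2\1/')

printf '[%s]\n' "$bre" "$ere" "$swapped"
```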
To understand the sed command, we have to build it step by step.
Here is your original text
user@linux:~$ echo "Here is a String"
Here is a String
user@linux:~$
Let's try to remove Here with sed's substitution (s) command:
user@linux:~$ echo "Here is a String" | sed 's/Here //'
is a String
user@linux:~$
At this point, I believe you would be able to remove String as well:
user@linux:~$ echo "Here is a String" | sed 's/String//'
Here is a
user@linux:~$
But this is not your desired output.
To combine the two sed commands, use the -e option:
user@linux:~$ echo "Here is a String" | sed -e 's/Here //' -e 's/String//'
is a
user@linux:~$
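One detail worth noticing: removing String alone leaves the space in front of it, so the result ends with a trailing space. A minimal sketch that shows this and writes the same two commands as one script separated by a semicolon:

```shell
# Two -e expressions, as above; removing "String" alone leaves a trailing space.
two_e=$(echo "Here is a String" | sed -e 's/Here //' -e 's/String//')

# The same two commands in one script, separated by ";",
# with the leading space included in the second pattern.
one_script=$(echo "Here is a String" | sed 's/Here //; s/ String//')

printf '[%s]\n' "$two_e" "$one_script"
```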
Hope this helps
Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:
Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
link in major cell growth pathway: Findings point to new potential
therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
identified [Lysosomal amino acid transporter SLC38A9 signals arginine
sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>
Per A2 in this thread, How to use sed/grep to extract text between two words? the first expression, below, "works" as long as the matched text does not contain a newline:
grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01
[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key
However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:
grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.
Solution 1.
Per Extract text between two strings on different lines
sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01
which gives
[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
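For readers who find the one-liner dense, here is the same script spread over several lines with comments. This is a sketch assuming GNU sed (\s and \| are GNU extensions); the three-line sample file is made up and stands in for corpus/01:

```shell
tmp=$(mktemp)
printf '%s\n' 'Subject: first part' ' second part' 'Message-ID: <id@example.invalid>' > "$tmp"

subject=$(sed -n '
  /Subject: /{
    :a
    # N appends the next input line to the pattern space
    N
    # loop back to :a until the Message-ID: line has been read
    /Message-ID:/!ba
    # join the collected lines with spaces
    s/\n/ /g
    # squeeze runs of whitespace into one space
    s/\s\s*/ /g
    # strip both markers and anything outside them
    s/.*Subject: \|Message-ID:.*//g
    p
  }' "$tmp")
printf '[%s]\n' "$subject"
rm -f "$tmp"
```

Note that the result keeps the single space that preceded Message-ID:, so it ends with a trailing space.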
Solution 2.
Per How can I replace a newline (\n) using sed?
sed ':a;N;$!ba;s/\n/ /g' corpus/01
will replace newlines with a space.
Chaining that with A2 in How to use sed/grep to extract text between two words?, we get:
sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'
which gives
[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
This variant removes double spaces:
sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'
giving
[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
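The grep attempts in the problem statement fail because grep applies the pattern to one line at a time, so a match can never span a line break. Both solutions above work around this by joining lines first. As a further sketch (awk is swapped in here and is not from the linked answers), the header can be collected line by line without reading the whole file into memory. The sample file is made up and stands in for corpus/01:

```shell
tmp=$(mktemp)
printf '%s\n' 'Subject: first part' ' second part' 'Message-ID: <id@example.invalid>' > "$tmp"

subject=$(awk '
  /^Subject: /    { sub(/^Subject: /, ""); buf = $0; collecting = 1; next }
  /^Message-ID:/  { if (collecting) print buf; collecting = 0; next }
  collecting      { sub(/^[ \t]+/, ""); buf = buf " " $0 }
' "$tmp")
printf '[%s]\n' "$subject"
rm -f "$tmp"
```

This assumes that Message-ID: directly follows the wrapped Subject lines, as in the example message.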
Reference URL: https://stackoverflow.com/questions/13242469/how-to-use-sed-grep-to-extract-text-between-two-words