Split a string by spaces in Python (preserving quoted substrings)

lottogame 2020. 4. 10. 08:06

I have a string like this:

this is "a test"

I'm trying to write something in Python that splits it by spaces while ignoring spaces inside quotes. The result I'm looking for is:

['this','is','a test']

P.S. I know you are going to ask "what happens if there are quotes within the quotes?" In my application, that will never happen.

You want split, from the shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
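The following sketch illustrates a few behaviours of shlex.split that the answers above rely on: quote stripping in the default POSIX mode, quote preservation with posix=False, a quote character nested inside the other kind of quotes, and the ValueError raised for an unbalanced quote. The helper name safe_split is made up for this illustration and is not part of shlex.

```python
import shlex

text = 'this is "a test"'

# Default (POSIX) mode removes the surrounding quotes.
posix_tokens = shlex.split(text)

# Non-POSIX mode keeps the quotes as part of the token.
raw_tokens = shlex.split(text, posix=False)

# A single quote inside a double-quoted part is kept literally.
nested = shlex.split('say "it\'s here"')

def safe_split(s):
    """Hypothetical helper: return None instead of raising on unbalanced quotes."""
    try:
        return shlex.split(s)
    except ValueError:  # shlex raises ValueError("No closing quotation")
        return None

print(posix_tokens, raw_tokens, nested, safe_split('abc"def'))
```

Whether the quotes should survive in the output decides between the two modes; the unbalanced-quote case matters as soon as the input is not fully under your control.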

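I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?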

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
# pieces == ['this', 'is', '"a test"']  (the quotes are kept)

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
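To see why the .strip() filter is needed, it may help to look at what re.split returns before filtering. Because the pattern is wrapped in a capturing group, the separators (spaces and quoted parts) are returned alongside the text between them. The sketch below uses the same pattern written as a raw string; the names PATTERN and regex_split are chosen for this illustration only.

```python
import re

# Same pattern as above, written as a raw string and precompiled.
PATTERN = re.compile(r'''( |".*?"|'.*?')''')

def regex_split(text):
    # Drop the pure-space separators and the empty strings between adjacent matches.
    return [p for p in PATTERN.split(text) if p.strip()]

# The unfiltered result contains separators and empty strings.
raw_pieces = PATTERN.split('this is "a test"')
print(raw_pieces)
print(regex_split('this is "a test"'))
print(regex_split("this is 'a test'"))
```

Note that the quoted piece survives only because it is itself a captured separator, which is also why the quotes remain in the output.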

shlex probably provides more features, though.

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
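Two details of the csv approach are worth checking before relying on it: it only recognises one quote character at a time (the double quote by default), and every single space is a delimiter, so runs of spaces produce empty fields. The sketch below wraps csv.reader in a small helper; the name csv_split and its skip_runs parameter are assumptions for this example, while quotechar and skipinitialspace are regular csv formatting parameters.

```python
import csv

def csv_split(line, quotechar='"', skip_runs=False):
    """Hypothetical helper: split one line on spaces using the csv module."""
    reader = csv.reader([line], delimiter=' ', quotechar=quotechar,
                        skipinitialspace=skip_runs)
    return next(reader)

print(csv_split('this is "a string"'))                 # double quotes (default)
print(csv_split("this is 'a string'", quotechar="'"))  # single quotes instead
print(csv_split('a  b'))                               # two spaces: empty field
print(csv_split('a  b', skip_runs=True))               # extra spaces skipped
```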

I use shlex.split to process 70,000,000 lines of squid log, and it is too slow. So I switched to re.

Try this if you have performance problems with shlex.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)

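Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.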

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
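The three steps of this approach can be made visible by returning the intermediate values. This is only a tracing sketch of the same idea; the function name trace_splitter is hypothetical. It also shows that this approach leaves the quotes in the result.

```python
import re

def trace_splitter(s):
    # Step 1: protect spaces inside double-quoted parts with NUL characters.
    protected = re.sub(r'".+?"', lambda m: m.group(0).replace(' ', '\x00'), s)
    # Step 2: split on whitespace; quoted parts no longer contain spaces.
    parts = protected.split()
    # Step 3: turn the NUL characters back into spaces.
    restored = [p.replace('\x00', ' ') for p in parts]
    return protected, parts, restored

protected, parts, restored = trace_splitter('this is "a test"')
print(repr(protected))
print(parts)
print(restored)
```

The approach assumes the input never contains a literal \x00 character, since that would be turned into a space in step 3.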

It seems that re is faster for performance reasons. Here is my solution, using a non-greedy operator that preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

It leaves constructs like aaa"bla blub"bbb together, since these tokens are not separated by spaces. If the string contains escaped characters, you can match them like this:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Note that this also matches the empty string "" by means of the \S part of the pattern.

To preserve the quotes, use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
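getArgs handles only double quotes, and consecutive spaces (or an empty input) produce empty-string tokens. A possible generalisation of the same character-by-character idea is sketched below: it tracks which quote character is currently open, so single and double quotes both work, and it skips runs of whitespace. The function name split_keep_quotes is an assumption for this example, not something from the answer above.

```python
def split_keep_quotes(text):
    """Sketch: split on whitespace outside quotes, keeping the quote characters."""
    tokens = []
    current = []
    quote = None  # the quote character we are currently inside, or None
    for ch in text.strip():
        if quote is None and ch.isspace():
            if current:  # ignore runs of whitespace
                tokens.append(''.join(current))
                current = []
        else:
            if quote is None and ch in '"\'':
                quote = ch        # opening quote
            elif ch == quote:
                quote = None      # matching closing quote
            current.append(ch)
    if current:
        tokens.append(''.join(current))
    return tokens

print(split_keep_quotes('this is "a test"'))
print(split_keep_quotes("a  'b c'  d"))
print(split_keep_quotes('say "it\'s here"'))
```

Like getArgs, this sketch does not handle backslash escapes, and an unclosed quote simply runs to the end of the string.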

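The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and gives slightly unexpected results in some corner cases.

I have the following use case: I need a split function that splits input strings such that single-quoted or double-quoted substrings are preserved. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output: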

input string        | expected output
===============================================
'abc def'           | ['abc', 'def']
"abc \\s def"       | ['abc', '\\s', 'def']
'"abc def" ghi'     | ['abc def', 'ghi']
"'abc def' ghi"     | ['abc def', 'ghi']
'"abc \\" def" ghi' | ['abc " def', 'ghi']
"'abc \\' def' ghi" | ["abc ' def", 'ghi']
"'abc \\s def' ghi" | ['abc \\s def', 'ghi']
'"abc \\s def" ghi' | ['abc \\s def', 'ghi']
'"" test'           | ['', 'test']
"'' test"           | ['', 'test']
"abc'def"           | ["abc'def"]
"abc'def'"          | ["abc'def'"]
"abc'def' ghi"      | ["abc'def'", 'ghi']
"abc'def'ghi"       | ["abc'def'ghi"]
'abc"def'           | ['abc"def']
'abc"def"'          | ['abc"def"']
'abc"def" ghi'      | ['abc"def"', 'ghi']
'abc"def"ghi'       | ['abc"def"ghi']
"r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

I ended up with the following function, which splits a string so that all of the input strings above produce the expected output:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

So performance is much better than shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.
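A precompiled variant might look like the sketch below. It uses the same pattern and the same post-processing as the function above; only the compilation is moved out of the call. The names _TOKEN_RE, _strip_quotes and quoted_split_precompiled are chosen for this illustration, and no timing claim is made for it here.

```python
import re

# Same pattern as in quoted_split, compiled once at import time.
_TOKEN_RE = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+')

def _strip_quotes(token):
    # Remove one pair of matching surrounding quotes, if present.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_precompiled(s):
    # Unescape \" and \' after the surrounding quotes have been removed.
    return [_strip_quotes(t).replace('\\"', '"').replace("\\'", "'")
            for t in _TOKEN_RE.findall(s)]

print(quoted_split_precompiled('this is "a test"'))
print(quoted_split_precompiled('"abc \\" def" ghi'))
print(quoted_split_precompiled("abc'def' ghi"))
```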

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

Speed test of the different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
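The %timeit lines above are IPython magics. Outside IPython, a rough equivalent with the standard timeit module could look like the sketch below. The dictionary layout and the iteration count are arbitrary choices for this example, and absolute numbers depend on the machine and Python version, so none are shown.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    're.findall': lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    'csv.reader': lambda: next(csv.reader([line], delimiter=' ')),
    'shlex.split': lambda: shlex.split(line),
}

runs = 2000
# Average time per call in microseconds.
timings = {name: timeit.timeit(fn, number=runs) / runs * 1e6
           for name, fn in candidates.items()}

for name, usec in sorted(timings.items(), key=lambda item: item[1]):
    print('%-12s %.2f usec per call' % (name, usec))
```

Keep in mind that the three candidates do not return identical results (re.findall keeps the quotes here), so the comparison is about speed only.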

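Hmm, I can't seem to find the "Reply" button... Anyway, this answer is based on Kate's approach, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings: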

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

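This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings of the returned list are not wanted, you can use this slightly altered version of the function: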

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
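One thing to be aware of with the chained .strip('"').strip("'") calls: str.strip removes every leading and trailing occurrence of the character, so a token that nests one quote type directly inside the other loses both layers. The sketch below contrasts this with removing a single matching pair. The names _SPLIT_RE, strip_one_pair, split_strip_chain and split_strip_pair are hypothetical and exist only for this comparison.

```python
import re

# Same pattern as in the expressions above, compiled once.
_SPLIT_RE = re.compile(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')')

def strip_one_pair(token):
    # Remove exactly one pair of matching surrounding quotes.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in '"\'':
        return token[1:-1]
    return token

def split_strip_chain(s):
    # Behaviour of the expression above: chained strip calls.
    return [p.strip('"').strip("'") for p in _SPLIT_RE.split(s) if p.strip()]

def split_strip_pair(s):
    # Variant: only the outermost pair of quotes is removed.
    return [strip_one_pair(p) for p in _SPLIT_RE.split(s) if p.strip()]

print(split_strip_chain('This is "a test"'))
print(split_strip_chain('say "\'x\'"'))
print(split_strip_pair('say "\'x\'"'))
```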

The unicode issues with shlex discussed above (top answer) seem to be resolved (indirectly) in 2.7.2+, as per http://bugs.python.org/issue6988#msg146200

(separate answer because I can't comment)

I suggest:

Test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

To also capture "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

To ignore empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']

If you don't care about quoted substrings, then a simple split will do:

>>> 'a short sized string with spaces '.split()

Performance:

>>> import timeit
>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or the string module (Python 2 only; string.split no longer exists in Python 3):

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

Performance: the string module seems to perform better than the string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use the RE engine:

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance:

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings, you should not load the entire string into memory; instead, either split it line by line or use an iterative loop.

Try this:
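One way to read that advice is sketched below: iterate over the input line by line and split each line lazily with a generator, so only one line is held in memory at a time. The function name iter_split_lines and the sample log lines are made up for this example; io.StringIO merely stands in for a real file object, and shlex.split can be swapped for any of the faster splitters discussed in this thread.

```python
import io
import shlex

def iter_split_lines(lines):
    """Sketch: yield the token list for each line of an iterable of lines."""
    for line in lines:
        yield shlex.split(line)

# io.StringIO stands in for an open file; file objects iterate line by line too.
fake_file = io.StringIO('GET "/a b" 200\nPOST "/c" 404\n')

rows = list(iter_split_lines(fake_file))
for row in rows:
    print(row)
```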

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

Some test strings:

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

Reference URL: https://stackoverflow.com/questions/79968/split-a-string-by-spaces-preserving-quoted-substrings-in-python
