Best way to strip punctuation from a string


lottogame 2020. 10. 4. 10:16


It seems like there should be a simpler way than this:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?

From an efficiency perspective, you're not going to beat this (Python 2):

s.translate(None, string.punctuation)

For newer versions of Python (Python 3), use the following code:

s.translate(str.maketrans('', '', string.punctuation))

It performs raw string operations in C with a lookup table. Not much will beat that, short of writing your own C code.
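To make the lookup-table idea concrete, here is a minimal Python 3 sketch. The names `PUNCT_TABLE` and `strip_punctuation` are illustrative assumptions, not part of the original answer; the point is only that the table can be built once and reused for every call.

```python
import string

# Build the translation table once. It maps each ASCII punctuation
# code point to None, which tells str.translate to delete that character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with ASCII punctuation removed (hypothetical helper)."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("string. With. Punctuation?"))
```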

If speed isn't a worry, another option is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

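This is faster than calling s.replace for each character, but it won't perform as well as approaches that are not pure Python, such as regexes or string.translate, as you can see from the timings below. For this type of problem, doing the work at as low a level as possible pays off.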

Timing code (Python 2):

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

For convenience, I've summarized how to strip punctuation from a string in both Python 2 and Python 3. See the other answers for the detailed explanation.

Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans({key: None for key in string.punctuation})
new_s = s.translate(table)                          # Output: string without punctuation

myString.translate(None, string.punctuation)

I usually use something like this:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s
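The snippet above is Python 2. A Python 3 rendering of the same idea might look like the sketch below; the function name `strip_unicode_punctuation` is an assumption made for illustration. It keeps any character whose Unicode general category does not start with `P`.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

sample = 'String — with -  «punctation »...'
print('stripped', strip_unicode_punctuation(sample))
```

Symbols such as `$` belong to the `S` categories, so this filter leaves them in place.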

Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary. Code points (integers) are looked up in that mapping, and anything mapped to None is removed.

To remove (some?) punctuation then, use:

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping: it sets every value to None, with the keys taken from the given sequence.
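As a small runnable illustration of that mapping (the sample string and the `replace_map` variable are assumptions added here), the first table deletes punctuation, while passing a second argument to `dict.fromkeys()` substitutes a replacement value instead:

```python
import string

# Keys are code points (ints); every value defaults to None, meaning "delete".
remove_punct_map = dict.fromkeys(map(ord, string.punctuation))

s = "string. With. Punctuation?"
cleaned = s.translate(remove_punct_map)
print(cleaned)

# A value other than None replaces the character instead of deleting it.
replace_map = dict.fromkeys(map(ord, string.punctuation), " ")
spaced = s.translate(replace_map)
print(spaced)
```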

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger. See J.F. Sebastian's answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))
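The table above is built but never applied. A hedged sketch of how it could be used follows; the function name `strip_all_punctuation` is an assumption, and the range is extended to `sys.maxunicode + 1` so that the last code point is also scanned.

```python
import sys
import unicodedata

# Scan every code point once and keep those in a punctuation category.
# sys.maxunicode + 1 ensures the final code point is included.
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def strip_all_punctuation(text):
    """Delete every Unicode punctuation character via str.translate."""
    return text.translate(remove_punct_map)

print(strip_all_punctuation('String — with «punctuation»...'))
```

Building the table takes a noticeable moment because it visits over a million code points, so it is worth constructing once at import time rather than per call.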

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex
s = u"string. With. Some・Really Weird、Non？ASCII。 「（Punctuation）」?"
remove = regex.compile(r'[\p{C}\p{M}\p{P}\p{S}\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

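Personally, I believe this is the best way to remove punctuation from a string in Python because:

It removes all Unicode punctuation.
It's easily modifiable; for example, you can remove \p{S} if you want to remove punctuation but keep symbols like $.
You can get really specific about what you want to keep and what you want to remove; for example, \p{Pd} will only remove dashes.
This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.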
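The category-by-category control described above depends on the third-party regex module. As a rough standard-library analogue (not the author's method; `remove_categories` is a hypothetical helper), the same general categories can be tested with `unicodedata.category()`. Unlike the regex version, this sketch deletes characters outright and does not normalize whitespace.

```python
import unicodedata

def remove_categories(text, prefixes):
    """Remove characters whose general category starts with any given prefix."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith(prefixes))

s = "well-known — costs $5 (approx.)"
print(remove_categories(s, ('Pd',)))      # dashes only
print(remove_categories(s, ('P',)))       # all punctuation, symbols kept
print(remove_categories(s, ('P', 'S')))   # punctuation and symbols
```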

I haven't seen this answer yet. Just use a regex; it removes all characters besides word characters (\w), number characters (\d), and whitespace characters (\s):

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^\w\d\s]+', '', s)
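One detail worth knowing about this pattern: \w also matches the underscore, so underscores survive the substitution. A short sketch (the sample string is an assumption) shows the difference when the underscore is added to the pattern explicitly:

```python
import re

s = "snake_case, isn't it?"

kept_underscore = re.sub(r'[^\w\s]', '', s)
print(kept_underscore)     # the underscore is still present

no_underscore = re.sub(r'[^\w\s]|_', '', s)
print(no_underscore)       # the underscore is removed as well
```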

Here's a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

This might not be the best solution, but this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

Here is a function I wrote. It's not very efficient, but it is simple, and you can add or remove any punctuation that you desire:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # Rebuild the list once per punctuation mark.
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList

Here's a solution without regex (Python 2).

import string

input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))    
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()

Output>> where and or then

Replaces the punctuation with spaces.
Replaces multiple spaces between words with a single space.
Removes any trailing spaces with strip().
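The steps above can be sketched in Python 3 as follows. This is an adaptation rather than the answer's own code: `string.maketrans` is replaced by `str.maketrans`, and the `result` variable is added for illustration.

```python
import string

# Map every ASCII punctuation character to a single space.
punctuation_replacer = str.maketrans(string.punctuation,
                                     ' ' * len(string.punctuation))

input_text = "!where??and!!or$$then:)"

# split() with no arguments collapses runs of whitespace and drops
# leading and trailing spaces, so no separate strip() call is needed.
result = ' '.join(input_text.translate(punctuation_replacer).split())
print(result)
```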

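Just as an update, I rewrote the @Brian example in Python 3 and changed it to move the regex compile step inside the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing and can't share a regex object between your workers, so you need to run the re.compile step on each worker. I was also curious to time two different implementations of maketrans for Python 3: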

table = str.maketrans({key: None for key in string.punctuation})

table = str.maketrans('', '', string.punctuation)

I also added another method that uses a set, where I take advantage of the intersection function to reduce the number of iterations.

This is the complete code:

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565

>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

If you don't need to be very strict, a one-liner could help:

''.join([c for c in s if c.isalnum() or c.isspace()])

#FIRST METHOD
#Storing all punctuations in a variable    
punctuation='!?,.:;"\')(_-'
newstring='' #Creating empty string
word=raw_input("Enter string: ")
for i in word:
    if i not in punctuation:
        newstring += i
print "The string without punctuation is",newstring

#SECOND METHOD
word=raw_input("Enter string: ")
punctuation='!?,.:;"\')(_-'
newstring=word.translate(None,punctuation)
print "The string without punctuation is",newstring


#Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is hello welcome topythonprogramminglanguage

# Characters to replace with a space before splitting into words.
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]

with open('one.txt', 'r') as myFile:
    str1 = myFile.read()

print(str1)

for i in punctuation:
    str1 = str1.replace(i, " ")

print(str1)

# split() without arguments also drops the empty strings left by repeated spaces.
myList = str1.split()
for i in myList:
    print(i)
    print("____________")

Why don't you use this?

 ''.join(filter(str.isalnum, s))

Too slow?
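Speed aside, one behavioral difference is worth seeing: `str.isalnum` is false for whitespace, so this filter also removes the spaces between words. The comparison below is a small sketch with an assumed sample string.

```python
s = "string. With. Punctuation?"

# str.isalnum rejects spaces too, so the words run together.
alnum_only = ''.join(filter(str.isalnum, s))
print(alnum_only)

# Keeping whitespace as well preserves the word boundaries.
alnum_and_space = ''.join(ch for ch in s if ch.isalnum() or ch.isspace())
print(alnum_and_space)
```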

Removing stop words from a text file using Python

print('====THIS IS HOW TO REMOVE STOP WORDS====')

with open('one.txt','r')as myFile:

    str1=myFile.read()

    stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"

    myList=[]

    myList.extend(str1.split(" "))

    for i in myList:

        if i not in stop_words:

            print ("____________")

            print(i,end='\n')

This is how to change our documents to uppercase or lower case.

print('@@@@This is lower case@@@@')

with open('students.txt', 'r') as myFile:
    str1 = myFile.read()

# str.lower() returns a new string; the original str1 is left unchanged.
print(str1.lower())

print('*****This is upper case****')

with open('students.txt', 'r') as myFile:
    str1 = myFile.read()

print(str1.upper())

I like to use a function like this:

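import string

def scrub(abc):
    # Trim punctuation from the end, then from the start. The emptiness
    # check prevents an IndexError when the whole string is punctuation.
    while abc and abc[-1] in string.punctuation:
        abc = abc[:-1]
    while abc and abc[0] in string.punctuation:
        abc = abc[1:]
    return abc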
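Since this function only trims punctuation from the two ends of the string, the built-in `str.strip` with a character set appears to cover the same case. The sketch below uses the hypothetical name `scrub_with_strip` to keep it distinct from the function above.

```python
import string

def scrub_with_strip(text):
    """Trim punctuation from both ends only; interior punctuation stays."""
    return text.strip(string.punctuation)

print(scrub_with_strip("...hello, world!?"))
```

Note that this is a different operation from the other answers on this page: the comma inside the string is kept.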

Reference URL: https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string
