Sort a text file by line length, including spaces

lottogame 2020. 7. 11. 10:01

I have a CSV file that looks like this:

AS2345, ASDF1232, Mr. Plain Example, 110 Binary ave., Atlantis, RI, 12345, (999) 123-5555,1.56
AS2345, ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave .., Atlantis, RI, 12345, (999) 123-5555,1.56
AS2345, ASDF1232, Mr. Plain Example, 110 Binary ave., Liberty City, RI, 12345, (999) 123-5555,1.56
AS2345, ASDF1232, Mr. Plain Example, RI, Some City, 110 Ternary ave., 12345, (999) 123-5555,1.56

I need to sort it by line length, including spaces. The following command doesn't include the spaces. Is there a way to modify it so that it does?

cat $@ | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'

Answer

cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-

Or, to reproduce your original (perhaps unintentional) sub-sorting of any equal-length lines:

cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-

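In both cases, we have solved the stated problem by moving away from awk for the final cut.

Lines of matching length: what to do in the case of a tie

The question did not specify whether further sorting was wanted for lines of matching length. I have assumed that it is not, and suggested -s (--stable) to prevent such lines from being sorted against each other and to keep them in the relative order in which they occur in the input.

(Those who want more control over how these ties are sorted can look at sort's --key option.)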
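As a rough sketch of that kind of control, the length prefix can be sorted numerically as the first key and the rest of the line as a second, plain-text key. The function name `sort_by_length_then_text` is made up for illustration, and `LC_ALL=C` is an assumption chosen to make the tie-break a simple byte-order comparison.

```shell
# Hypothetical helper: sort stdin by line length, breaking ties by byte order.
sort_by_length_then_text() {
  awk '{ print length, $0 }' |
    LC_ALL=C sort -k1,1n -k2 |   # key 1: numeric length; key 2: the rest of the line
    cut -d" " -f2-               # drop the length prefix without touching the line
}

printf '%s\n' 'ccz' 'g' 'cca' 'ff' 'ccb' | sort_by_length_then_text
```

With this input the three-character lines come out as cca, ccb, ccz, after g and ff.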

Why the question's attempted solution fails (awk line rebuilding)

It is interesting to note the difference between:

echo "hello   awk   world" | awk '{print}'
echo "hello   awk   world" | awk '{$1="hello"; print}'

They yield, respectively:

hello   awk   world
hello awk world

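The relevant section of the (gawk) manual mentions only as an aside that awk rebuilds the whole of $0 (based on the separator, etc.) when you change one field. I guess that is not crazy behavior. It has this:

"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current values of the fields and OFS. To do this, use the seemingly innocuous assignment:"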

 $1 = $1   # force record to be reconstituted
 print $0  # or whatever else with $0

"This forces awk to rebuild the record."

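The rebuild can also be avoided while staying in awk: if the final step removes the length prefix with a string substitution rather than a field assignment, no field is assigned and the spacing is left alone. A minimal sketch, assuming the prefix is the digits-plus-one-space that `print length, $0` produces (the function name is illustrative only):

```shell
# Hypothetical helper: strip a leading "<digits><space>" prefix from each line.
strip_prefix_keep_spaces() {
  # sub() edits $0 as a string; since no field is assigned, awk does not
  # rejoin the fields with OFS, so runs of spaces are preserved.
  awk '{ sub(/^[0-9]+ /, ""); print }'
}

printf '%s\n' 'hello   awk   world' 'hi' |
  awk '{ print length, $0 }' | sort -n -s | strip_prefix_keep_spaces
```

This prints hi first, then the longer line with its three-space gaps intact.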
Test input including some lines of equal length:

aa A line   with     MORE    spaces
bb The very longest line in the file
ccb
9   dd equal len.  Orig pos = 1
500 dd equal len.  Orig pos = 2
ccz
cca
ee A line with  some       spaces
1   dd equal len.  Orig pos = 3
ff
5   dd equal len.  Orig pos = 4
g

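The AWK solution from neillb is great if you really do want to use awk, and it explains why it is a hassle there. But if you just want the job done quickly and don't care which tool does it, one solution is Perl's sort() function with a custom comparison routine that iterates over the input lines. Here is a one-liner: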

perl -e 'print sort { length($a) <=> length($b) } <>'

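You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect), or you can give the filename to perl as another argument and let it open the file.

In my case I needed the longest lines first, so I swapped $a and $b in the comparison.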
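For the awk/sort pipeline, a comparable longest-first ordering can be had by adding -r to sort. A small sketch, assuming a sort that supports -s (GNU and BSD sort do); the function name is hypothetical:

```shell
# Hypothetical helper: longest lines first; reads the named files, or stdin if none.
sort_by_length_desc() {
  awk '{ print length, $0 }' "$@" |
    sort -rn -s |      # -r reverses the numeric comparison on the length prefix
    cut -d" " -f2-
}

printf '%s\n' 'ff' 'g' 'a longer line' | sort_by_length_desc
```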

Try this command instead:

awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-

Benchmark results

Below are the results of a benchmark across the solutions from the other answers to this question.

Test method

10 sequential runs on a fast machine, averaged
Perl 5.24
awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)

Results

Caleb's perl solution took 11.2 seconds
my perl solution took 11.6 seconds
neillb's awk solution #1 took 20 seconds
neillb's awk solution #2 took 23 seconds
anubhava's awk solution took 24 seconds
Jonathan's awk solution took 25 seconds
Fretz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.

Extra `perl` option

Also, I've added another Perl solution:

perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file

Pure Bash:

declare -a sorted

while IFS= read -r line; do                       # keep leading spaces and backslashes
  if [ -z "${sorted[${#line}]+set}" ] ; then      # does line length already exist?
    sorted[${#line}]="$line"                      # element for new length
  else
    sorted[${#line}]+=$'\n'"$line"                # append to lines with equal length
  fi
done < data.csv

for key in "${!sorted[@]}"; do                    # iterate over existing indices
  printf '%s\n' "${sorted[$key]}"                 # print lines with equal length
done

The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).

awk '{ printf "%d:%s\n", length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]*://'

The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:

awk '{ print length($0), $0;}' "$@" | sort -n | sed 's/^[0-9]* //'

I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):

awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-

With POSIX Awk:

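{
  c = length
  m[c] = (c in m) ? m[c] RS $0 : $0   # group lines by length, keeping input order
  if (c > max) max = c
} END {
  # "for (c in m)" visits keys in an unspecified order, so walk the lengths explicitly
  for (c = 0; c <= max; c++) if (c in m) print m[c]
}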

1) Pure awk solution. Suppose that no line is longer than 1024 characters. Note that this command, like the next one, prints only the shortest line rather than sorting the whole file:

awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}' filename

2) One-liner bash solution, assuming every line has just one word; it can be reworked for any case where all lines have the same number of words:

for k in $(cat filename); do printf '%s ' "$k"; printf '%s\n' "$k" | wc -L; done | sort -n -k2 | head -n 1 | cut -d " " -f1

Here is a multibyte-compatible method of sorting lines by length. It requires:

wc -m is available to you (macOS has it).
Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
testfile has a character encoding matching your locale (e.g., UTF-8).

Here's the full command:

cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-

Explaining part-by-part:

l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes a copy of each line in awk variable l and escapes every ' so the line can safely be echoed as a shell command (\047 is a single quote in octal notation).
cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
sub(/ */, "", c); ← trims white space from the character count value returned by wc.
{ print c, $0 } ← prints the line's character count value, a space, and the original line.
| sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintaining stable sort order (-s).
| cut -d" " -f2- ← removes the prepended character count values.

It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.

Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).
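Another route that avoids both gawk and a sub-command per line is to let bash count the characters itself. The sketch below works under assumptions and was not part of the benchmark above. It relies on bash's `${#line}` counting characters rather than bytes when the locale is a multibyte one such as UTF-8, and the function name is made up.

```shell
# Hypothetical helper: prefix each line with its character count using bash only.
# Assumption: the locale is multibyte (e.g. UTF-8), so ${#line} counts characters.
sort_by_char_length() {
  while IFS= read -r line || [ -n "$line" ]; do   # also handle a last line with no newline
    printf '%d %s\n' "${#line}" "$line"
  done | sort -n -s | cut -d" " -f2-
}

printf '%s\n' 'ccc' 'a' 'bb' | sort_by_char_length
```

It would be called as `sort_by_char_length < testfile`. A bash read loop is still far slower than awk on large files, so this is only a fallback for when gawk is not installed.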

Reference URL: https://stackoverflow.com/questions/5917576/sort-a-text-file-by-line-length-including-spaces
