What is the idea behind ^= 32, which converts lowercase letters to uppercase and vice versa?
I was solving some problems on Codeforces. Normally, I first check whether the character is an uppercase or lowercase English letter, then subtract or add 32 to convert it to the corresponding letter. But I found that someone used ^= 32 to do the same thing. Here it is:
#include <iostream>

int main() {
    char foo = 'a';
    foo ^= 32;
    char bar = 'A';
    bar ^= 32;
    std::cout << foo << ' ' << bar << '\n'; // foo is now 'A', and bar is 'a'
}
I searched for an explanation but couldn't find one. Why does this work?
Let's have a look at the ASCII code table in binary:
A 1000001 a 1100001
B 1000010 b 1100010
C 1000011 c 1100011
...
Z 1011010 z 1111010
32 is 0100000, and that is the only difference between the lowercase and the uppercase letters. So toggling that bit toggles the case of a letter.
This exploits the fact that the ASCII values were chosen by really smart people.
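As an illustration (a minimal sketch I am adding here, assuming an ASCII platform), printing the bit patterns makes the single-bit difference visible:

#include <bitset>
#include <iostream>

int main() {
    char c = 'a';
    std::cout << std::bitset<8>(c) << '\n'; // 01100001 ('a')
    c ^= 32;                                // toggle bit 5 (value 0100000)
    std::cout << std::bitset<8>(c) << '\n'; // 01000001 ('A')
}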
foo ^= 32;

This flips the 6th-lowest bit 1) of foo (the uppercase flag of ASCII, sort of), turning an ASCII uppercase letter into the corresponding lowercase letter and vice versa.
+---+------------+------------+
| | Upper case | Lower case | 32 is 00100000
+---+------------+------------+
| A | 01000001 | 01100001 |
| B | 01000010 | 01100010 |
| ... |
| Z | 01011010 | 01111010 |
+---+------------+------------+
Example:
'A' ^ 32
01000001 'A'
XOR 00100000 32
------------
01100001 'a'
And by the properties of XOR, 'a' ^ 32 == 'A'.
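And since XOR is its own inverse, applying ^= 32 twice gets you back to the original character. A minimal sketch of that property, assuming ASCII:

#include <cassert>

int main() {
    char c = 'q';
    c ^= 32;        // 'q' -> 'Q'
    c ^= 32;        // 'Q' -> back to 'q': XOR with the same mask is self-inverse
    assert(c == 'q');
}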
Watch out.
C++ is not required to use ASCII to represent characters. Another variant is EBCDIC. This trick only works on ASCII platforms. A more portable solution would be to use std::tolower and std::toupper, with the offered bonus of being locale-aware (but it does not automagically solve all your problems, see the comments):
#include <cassert>
#include <locale>

bool case_insensitive_equal(char lhs, char rhs)
{
    // std::locale{} is optional; passing one enables locale awareness
    return std::tolower(lhs, std::locale{}) == std::tolower(rhs, std::locale{});
}

int main() { assert(case_insensitive_equal('A', 'a')); }
1) As 32 is 1 << 5 (2 to the power 5), it flips the 6th bit (counting from 1).
Allow me to say that this, although it looks smart, is a really, really stupid hack. If someone recommends this to you in 2019, hit him. Hit him as hard as you can.
You can, of course, use it in your own software that you and nobody else uses, if you know that you will never use any language other than English anyway. Otherwise, it's a no-go.
The hack was arguably "OK" some 30-35 years ago, when computers didn't really do much other than English in ASCII, and maybe one or two major European languages. But... no longer.
The hack works because US-Latin uppercase and lowercase letters are exactly 0x20 apart from each other and appear in the same order, which amounts to a single bit of difference. Which, in fact, this bit hack toggles.
Now, the people creating code pages for Western Europe, and later the Unicode consortium, were smart enough to keep this scheme for e.g. German umlauts and French accented vowels. Not so for ß, which (until someone convinced the Unicode consortium in 2017, and a large fake-news print magazine wrote about it, actually convincing the Duden -- no comment on that) didn't even exist as a versal (it transforms to SS). Now it does exist as a versal, but the two are 0x1DBF positions apart, not 0x20.
The implementors were, however, not considerate enough to keep this going. For example, if you apply your hack in some East European languages or the like (I wouldn't know about Cyrillic), you will get a nasty surprise. All those "hatchet" characters are examples of that: lowercase and uppercase are one apart. The hack thus does not work properly there.
There's much more to consider, for example, some characters do not simply transform from lower- to uppercase at all (they're replaced with different sequences), or they may change form (requiring different code points).
Do not even think about what this hack will do to stuff like Thai or Chinese (it'll just give you complete nonsense).
Saving a couple of hundred CPU cycles may have been very worthwhile 30 years ago, but nowadays, there is really no excuse for not converting a string properly. There are library functions for performing this non-trivial task.
The time taken to convert several dozens kilobytes of text properly is negligible nowadays.
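For reference, a sketch of doing it "properly" with the standard library for plain byte strings; this is my illustration, not part of the answer, and real multilingual case mapping needs a full Unicode library such as ICU. Note the unsigned char cast that std::toupper requires:

#include <algorithm>
#include <cctype>
#include <string>

// Upper-cases a byte string; correct for ASCII/single-byte encodings only.
std::string to_upper(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}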
It works because, as it happens, the difference between 'a' and 'A' in ASCII and derived encodings is 32, and 32 is also the value of the sixth bit. Flipping the 6th bit with an exclusive OR thus converts between uppercase and lowercase.
Most likely your implementation of the character set will be ASCII. If we look at the table:
We see that there's a difference of exactly 32 between the values of a lowercase and an uppercase letter. Therefore, if we do ^= 32 (which equates to toggling the 6th least significant bit), it switches between a lowercase and uppercase character.
Note that it works with all the symbols, not just the letters. It toggles a character with the respective character where the 6th bit is different, resulting in a pair of characters that is toggled back and forth between. For the letters, the respective upper/lowercase characters form such a pair. A NUL will change into Space and the other way around, and the @ toggles with the backtick. Basically any character in the first column of this chart toggles with the character one column over, and the same applies to the third and fourth columns.
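A quick sketch of those non-letter pairs, assuming ASCII (my illustration):

#include <cassert>

int main() {
    assert(('\0' ^ 32) == ' '); // NUL (0x00) toggles with space (0x20)
    assert(('@' ^ 32) == '`');  // @ (0x40) toggles with the backtick (0x60)
    assert(('[' ^ 32) == '{');  // third/fourth column pairs work the same way
}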
I wouldn't use this hack though, as there's no guarantee that it will work on every system. Just use toupper and tolower instead, along with queries such as isupper.
Plenty of good answers here describe how this works, but the reason it works this way is performance. Bitwise operations are faster than most other operations within a processor. You can quickly do a case-insensitive comparison by simply not looking at the bit that determines case, or change case to upper/lower simply by flipping that bit (the people who designed the ASCII table were pretty smart).
Obviously, this isn't nearly as big of a deal today as it was back in 1960 (when work first began on ASCII), due to faster processors and Unicode, but there are still some low-cost processors where this could make a significant difference, as long as you can guarantee ASCII-only characters.
https://en.wikipedia.org/wiki/Bitwise_operation
On simple low-cost processors, typically, bitwise operations are substantially faster than division, several times faster than multiplication, and sometimes significantly faster than addition.
NOTE: I would recommend using standard libraries for working with strings for a number of reasons (readability, correctness, portability, etc). Only use bit flipping if you have measured performance and this is your bottleneck.
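As an illustration of the "don't look at the case bit" comparison mentioned above, a minimal ASCII-only sketch (my naming; only valid when both inputs are known to be letters):

// Compare two ASCII letters ignoring case by forcing the case bit on in both.
// Only valid when both characters are known to be alphabetic.
bool equal_ignore_case(char a, char b) {
    return (a | 0x20) == (b | 0x20);
}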
It's how ASCII works, that's all.
But in exploiting this, you are giving up portability, as C++ doesn't insist on ASCII as the encoding.
This is why the functions std::toupper and std::tolower are implemented in the C++ standard library - you should use those instead.
See the second table at http://www.catb.org/esr/faqs/things-every-hacker-once-knew/#_ascii, and following notes, reproduced below:
The Control modifier on your keyboard basically clears the top three bits of whatever character you type, leaving the bottom five and mapping it to the 0..31 range. So, for example, Ctrl-SPACE, Ctrl-@, and Ctrl-` all mean the same thing: NUL.
Very old keyboards used to do Shift just by toggling the 32 or 16 bit, depending on the key; this is why the relationship between small and capital letters in ASCII is so regular, and the relationship between numbers and symbols, and some pairs of symbols, is sort of regular if you squint at it. The ASR-33, which was an all-uppercase terminal, even let you generate some punctuation characters it didn’t have keys for by shifting the 16 bit; thus, for example, Shift-K (0x4B) became a [ (0x5B)
ASCII was designed such that the shift and ctrl keyboard keys could be implemented without much (or perhaps any for ctrl) logic - shift probably required only a few gates. It probably made at least as much sense to store the wire protocol as any other character encoding (no software conversion required).
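Those keyboard mappings are easy to check in code; a small sketch, assuming ASCII:

#include <cassert>

int main() {
    assert(('@' & 0x1F) == 0);   // Ctrl clears the top three bits: Ctrl-@ is NUL
    assert(('`' & 0x1F) == 0);   // ... and so is Ctrl-`
    assert((' ' & 0x1F) == 0);   // ... and Ctrl-SPACE
    assert(('K' ^ 0x10) == '['); // ASR-33: shifting the 16 bit turned K (0x4B) into [ (0x5B)
}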
The linked article also explains many strange hacker conventions, such as "And control H does a single character and is an old^H^H^H^H^H classic joke." (found here).
XORing with 32 (00100000 in binary) sets or resets the sixth bit (from the right). For letters this is strictly equivalent to adding or subtracting 32: the bit is clear in uppercase (so XOR adds 32) and set in lowercase (so XOR subtracts 32).
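A small check of that equivalence (my sketch, assuming ASCII):

#include <cassert>

int main() {
    assert(('A' ^ 32) == 'A' + 32); // bit was 0, so XOR acts like +32
    assert(('a' ^ 32) == 'a' - 32); // bit was 1, so XOR acts like -32
}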
The lower-case and upper-case alphabetic ranges don't cross a %32 "alignment" boundary in the ASCII coding system. This is why bit 0x20 is the only difference between the upper/lower case versions of the same letter.
If this wasn't the case, you'd need to add or subtract 0x20, not just toggle, and for some letters there would be carry-out to flip other higher bits. (And there wouldn't be a single operation that could toggle, and checking for alphabetic characters in the first place would be harder, because you couldn't |= 0x20 to force lowercase.)
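That "force lowercase" operation works because setting the bit either lowers an uppercase letter or leaves a lowercase one alone (my sketch, assuming ASCII):

#include <cassert>

int main() {
    assert(('A' | 0x20) == 'a'); // sets the case bit: uppercase becomes lowercase
    assert(('a' | 0x20) == 'a'); // bit already set: lowercase is unchanged
}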
Related ASCII-only tricks: you can check for an alphabetic ASCII character by forcing lowercase with c |= 0x20 and then checking if (unsigned) c - 'a' <= ('z'-'a'). So just 3 operations: OR + SUB + CMP against a constant 25. Of course, compilers know how to optimize (c>='a' && c<='z') into asm like this for you, so at most you should do the c |= 0x20 part yourself. It's rather inconvenient to do all the necessary casting yourself, especially to work around default integer promotions to signed int.
unsigned char lcase = c | 0x20;             // force lowercase; only meaningful for letters
if (lcase - 'a' <= (unsigned)('z' - 'a')) { // lcase - 'a' wraps for characters below 'a'
    // c is alphabetic ASCII
}
// else it's not
See also Convert a String In C++ To Upper Case (SIMD string toupper for ASCII only, masking the operand for XOR using that check). And also How to access a char array and change lower case letters to upper case, and vice versa (C with SIMD intrinsics, and scalar x86 asm case-flip for alphabetic ASCII characters, leaving others unmodified).
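A scalar sketch of the case-flip those answers describe: compute the XOR mask from the alphabetic check, so non-letters pass through unmodified (my naming, ASCII assumed):

// Flip the case of alphabetic ASCII bytes; leave every other byte unchanged.
char flipcase(char ch) {
    unsigned char c = ch;
    unsigned char lcase = c | 0x20;                     // force lowercase for the range check
    bool alpha = (unsigned)(lcase - 'a') <= 'z' - 'a';  // alphabetic ASCII?
    return (char)(c ^ (alpha ? 0x20 : 0));              // mask is 0x20 only for letters
}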
These tricks are mostly only useful if hand-optimizing some text processing with SIMD (e.g. SSE2 or NEON), after checking that none of the chars in a vector have their high bit set. (And thus none of the bytes are part of a multi-byte UTF-8 encoding for a single character, which might have different upper/lower-case inverses.) If you find any, you can fall back to scalar for this chunk of 16 bytes, or for the rest of the string.
There are even some locales where toupper() or tolower() on some characters in the ASCII range produce characters outside that range, notably Turkish where I ↔ ı and İ ↔ i. In those locales, you'd need a more sophisticated check, or should probably not try to use this optimization at all. But in some cases, you're allowed to assume ASCII instead of UTF-8, e.g. Unix utilities with LANG=C (the POSIX locale), not en_CA.UTF-8 or whatever.
But if you can verify it's safe, you can toupper medium-length strings much faster than calling toupper() in a loop (like 5x), and last I tested with Boost 1.58, much much faster than boost::to_upper_copy<char*, std::string>(), which does a stupid dynamic_cast for every character.