Reading UTF-8 - BOM marker


lottogame 2020. 11. 27. 07:39


I'm reading a file through a FileReader. The file is UTF-8 encoded (with BOM). My problem is this: I read the file and output a string, but sadly the BOM marker is output too. Why does this happen?

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

After the first line is read, the output shows:

?<style>

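In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now, because a fix would break existing tools such as JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream that handles this situation.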
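To make "consume the BOM manually" concrete, here is a minimal sketch that uses only the JDK. The class name BomSkipDemo and the helper skipUtf8Bom are made up for illustration; the sketch handles only the UTF-8 BOM (EF BB BF) and assumes Java 9 or later for readNBytes and readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

class BomSkipDemo {

    /**
     * Hypothetical helper: returns a stream positioned after a leading
     * UTF-8 BOM (EF BB BF) if there is one, otherwise at the very start.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        // Room to push back the (at most) three bytes we peek at.
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.readNBytes(head, 0, 3); // Java 9+
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            // Not a BOM: put the peeked bytes back so the caller sees them.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file that starts with a UTF-8 BOM.
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 's', '>'};
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

The returned stream can then be wrapped in an InputStreamReader with an explicit UTF-8 charset, in place of the FileReader from the question.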

Take a look at this solution: Handle UTF8 file with BOM

The easiest fix is probably to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");
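The one-liner above removes every U+FEFF in the line, wherever it occurs. A slightly more conservative variant strips the character only at the very start of the input. The sketch below assumes the file is decoded as UTF-8 through an explicit charset; the class name FirstLineBomDemo and the method readAll are hypothetical names used only for this illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class FirstLineBomDemo {

    /**
     * Hypothetical helper: reads all lines and drops a leading U+FEFF
     * from the first line only; later occurrences are left untouched.
     */
    static String readAll(BufferedReader br) throws IOException {
        StringBuilder content = new StringBuilder();
        boolean first = true;
        String line;
        while ((line = br.readLine()) != null) {
            if (first && line.startsWith("\uFEFF")) {
                line = line.substring(1); // the decoded BOM is a single char
            }
            first = false;
            content.append(line).append(System.lineSeparator());
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a UTF-8 file that starts with a BOM.
        byte[] data = "\uFEFF<style>\nbody {}\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            System.out.print(readAll(br));
        }
    }
}
```

Appending to a StringBuilder also avoids the repeated string concatenation in the question's loop.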

See also this Guava bug report.

Use the Apache Commons library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}

Here is how I use the Apache BOMInputStream, with a try-with-resources block. The "false" argument tells the object to exclude the listed BOMs from the stream it returns (we use "BOM-less" text files for safety reasons, haha).

try( BufferedReader br = new BufferedReader( 
    new InputStreamReader( new BOMInputStream( new FileInputStream(
       file), false, ByteOrderMark.UTF_8,
        ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
        ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE ) ) ) )
{
    // use br here

} catch( Exception e)
{
    // handle or log the exception
}

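It is mentioned here that this is usually a problem with files on Windows.

One possible solution is to run the file through a tool like dos2unix first.

Use Apache Commons IO.

For example, take a look at my code below (used to read a text file containing both Latin and Cyrillic characters):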

String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

// Use the charset named by the BOM, or fall back to the default if there is none
ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);

// Collect every character of the file as a one-character string
List<String> ari = new ArrayList<>();
int data = reader.read();
while (data != -1) {
    char theChar = (char) data;
    data = reader.read();
    ari.add(Character.toString(theChar));
}
reader.close();

As a result, we have an ArrayList named "ari" containing all the characters of the file "1.txt" except the BOM.

Consider UnicodeReader from Google which does all this work for you.

Charset utf8 = Charset.forName("UTF-8"); // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8)) {
    ....
}

Maven Dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>

The easiest way I found to bypass BOM

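BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // Remove the UTF-8 BOM. This literal only matches when the bytes EF BB BF
    // were decoded with a single-byte charset such as windows-1252.
    currentLine = currentLine.replace("ï»¿", "");
}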

If somebody wants to do it using only the standard library, this would be a way:

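// Requires: import java.math.BigInteger; import java.nio.charset.StandardCharsets;
public static String cutBOM(String value) {
    if (value == null || value.isEmpty())
        return value;
    // A BOM that was decoded correctly is the single character U+FEFF.
    if (value.charAt(0) == '\uFEFF')
        return value.substring(1);
    // Otherwise assume a single-byte decoding (ISO-8859-1), where each BOM byte became one char.
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    String head = value.substring(0, Math.min(3, value.length()));
    String bom = String.format("%x", new BigInteger(1, head.getBytes(StandardCharsets.ISO_8859_1)));
    if (bom.equals("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.startsWith("feff") || bom.startsWith("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}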

Reference URL: https://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker
