How to programmatically download a web page in Java

lottogame 2020. 7. 19. 09:52

I would like to fetch a web page's HTML and save it to a String so that I can process it. Also, how can I handle the various types of compression?

How would I go about doing that in Java?

Here is some tested code that uses Java's URL class. I would recommend doing a better job than I do here of handling the exceptions or passing them up the call stack, though.

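import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

public class DownloadPage {
    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;

        try {
            url = new URL("http://stackoverflow.com/");
            is = url.openStream();  // throws an IOException
            // no charset is given, so the platform default is used for decoding
            br = new BufferedReader(new InputStreamReader(is));

            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
    }
}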
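The code above prints each line, while the question asks for the page in a String. A minimal sketch of that step follows. The helper name `readAll` and the choice of UTF-8 are assumptions for illustration, and an in-memory stream stands in for `url.openStream()` so the sketch runs without a network.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageToString {

    // Hypothetical helper: reads the whole stream into one String.
    // The charset is passed in explicitly; UTF-8 below is an assumption.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the wrapped stream
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(), so no network access is needed.
        InputStream fake = new ByteArrayInputStream(
                "<html>\n<body>hi</body>\n</html>".getBytes(StandardCharsets.UTF_8));
        String html = readAll(fake, StandardCharsets.UTF_8);
        System.out.println(html);
    }
}
```

With a real page, the call would presumably look like `readAll(url.openStream(), StandardCharsets.UTF_8)`, provided the page really is UTF-8.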

I would use a decent HTML parser like Jsoup. It is then as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

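It handles GZIP and chunked responses and character encoding fully transparently. It offers further advantages as well, such as HTML traversal and manipulation with CSS selectors, much like jQuery. You only need to grab the result as a Document, not as a String.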

Document document = Jsoup.connect("http://google.com").get();

You really don't want to run basic String methods or regular expressions on HTML to process it.

See also:

What are the pros and cons of the leading HTML parsers in Java?

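Bill's answer is very good, but you may want to do some things with the request, such as handling compression or setting a user agent. The following code shows how to handle various types of compression in your requests.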

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}
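The stream selection above can be tried without a server. The sketch below moves the same if/else chain into a helper and feeds it gzip data built in memory. The names `decode`, `readFully` and `gzip` are hypothetical, and the in-memory bytes only imitate a compressed response body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

class ContentDecoding {

    // Hypothetical helper mirroring the if/else chain above: picks a
    // stream wrapper from the Content-Encoding value (which may be null).
    static InputStream decode(InputStream raw, String encoding) throws IOException {
        if ("gzip".equalsIgnoreCase(encoding)) {
            return new GZIPInputStream(raw);
        } else if ("deflate".equalsIgnoreCase(encoding)) {
            // nowrap = true expects raw deflate data, as in the snippet above
            return new InflaterInputStream(raw, new Inflater(true));
        }
        return raw; // null or unknown encoding: pass the bytes through
    }

    // Drains a stream into a byte array.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Compresses bytes with gzip; used here only to fake a server response.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("<html>compressed</html>".getBytes(StandardCharsets.UTF_8));
        InputStream decoded = decode(new ByteArrayInputStream(body), "gzip");
        System.out.println(new String(readFully(decoded), StandardCharsets.UTF_8));
    }
}
```

Against a live connection the call would presumably be `decode(conn.getInputStream(), conn.getContentEncoding())`.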

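To also set the user agent, add the following line before the getContentEncoding() call, because request properties cannot be changed once the connection has been opened: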

conn.setRequestProperty ( "User-agent", "my agent name");

Well, you could go with the built-in libraries such as URL and URLConnection, but they don't give you very much control.

~~Personally I'd go with the Apache HTTPClient library.~~
Edit: HTTPClient has been set to end of life by Apache. The replacement is: HTTP Components

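None of the approaches mentioned above downloads the web page text as it appears in the browser. These days a lot of data is loaded into the browser by scripts in the HTML page. The techniques above do not run scripts; they download only the HTML text. HTMLUNIT supports JavaScript. So if you want the web page text as it looks in the browser, you should use HTMLUNIT.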

You will most likely need to extract the source from a secure web page (https protocol). In the following example, the HTML is saved to c:\temp\filename.html. Enjoy!

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url </b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html";
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty ( "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0" );
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);
        String inputLine;

        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine(); // readLine() strips the line terminator, so restore it
        }
        in.close(); 
        bw.close();
    }
}
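The example above hard-codes Windows-1252 as the page encoding. One possible alternative is to read the charset from the response's Content-Type header and fall back to a default when none is given. The sketch below is only an illustration: the helper name `charsetOf` and the fallback value are assumptions, and it does not look at `<meta>` tags inside the HTML.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetSniffer {

    // Hypothetical helper: extracts the charset parameter from a
    // Content-Type value such as "text/html; charset=UTF-8".
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback; // header present but without a charset parameter
    }

    public static void main(String[] args) {
        // Sample header values; no connection is opened here.
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

In the HTTPS example this would presumably be used as `new InputStreamReader(ins, charsetOf(con.getContentType(), StandardCharsets.UTF_8))`.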

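On a Unix/Linux box you could just run 'wget', but that is not really an option if you are writing a cross-platform client. Of course, this assumes that you don't want to do much with the downloaded data between the point of downloading it and it hitting the disk.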
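If the data really does go straight to disk, something similar can be done portably in Java by copying the response stream to a file without decoding it. This is a rough sketch, not a wget replacement: the helper name `save` is hypothetical, and an in-memory stream plus a temporary file stand in for a real URL stream and target path.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamToDisk {

    // Hypothetical helper: copies the raw bytes to a file and returns
    // the number of bytes written. The stream is closed afterwards.
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(), so no network access is needed.
        InputStream fake = new ByteArrayInputStream(
                "<html></html>".getBytes(StandardCharsets.UTF_8));
        Path target = Files.createTempFile("page", ".html");
        long written = save(fake, target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

Because the bytes are copied unchanged, no character encoding has to be guessed at this stage.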

Jetty has an HTTP client that can be used to download a web page.

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();

            String url = "http://www.something.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

The example prints the contents of a simple web page.

In my tutorial, Reading a web page in Java, I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.

This class may help: it downloads the page source and filters out some information.

public class MainActivity extends AppCompatActivity {

    EditText url;
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate( savedInstanceState );
        setContentView( R.layout.activity_main );

        url = ((EditText)findViewById( R.id.editText));
        DownloadCode obj = new DownloadCode();

        try {
            String des=" ";

            String tag1= "<div class=\"description\">";
            String l = obj.execute( "http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty" ).get();

            url.setText( l );
            url.setText( " " );

            String[] t1 = l.split(tag1);
            // t1[1] is the text that follows the first description tag
            String[] t2 = t1[1].split( "</div>" );
            url.setText( t2[0] );

        }
        catch (Exception e)
        {
            Toast.makeText( this,e.toString(),Toast.LENGTH_SHORT ).show();
        }

    }
                                        // input, extrafunctionrunparallel, output
    class DownloadCode extends AsyncTask<String,Void,String>
    {
        @Override
        protected String doInBackground(String... WebAddress) // string of webAddress separate by ','
        {
            String htmlcontent = " ";
            try {
                URL url = new URL( WebAddress[0] );
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                int data;
                InputStreamReader reader = new InputStreamReader( input );

                data = reader.read();

                while (data != -1)
                {
                    char content = (char) data;
                    htmlcontent+=content;
                    data = reader.read();
                }
            }
            catch (Exception e)
            {
                Log.i("Status : ",e.toString());
            }
            return htmlcontent;
        }
    }
}
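The filtering step in the class above can be tried outside Android. The sketch below extracts the text between an opening marker and the next closing marker with plain indexOf calls. The helper name `between` and the sample HTML are made up for illustration, and nested tags would defeat this approach, which is one reason a parser such as Jsoup is usually the safer choice.

```java
class TagSlice {

    // Hypothetical helper: returns the text between the first occurrence
    // of `open` and the next occurrence of `close`, or null if either
    // marker is missing.
    static String between(String html, String open, String close) {
        int start = html.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = html.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up sample markup; a real page would come from the download step.
        String html = "<body><div class=\"description\">Faculty list</div></body>";
        System.out.println(between(html, "<div class=\"description\">", "</div>"));
    }
}
```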

I used the accepted answer to this post (the URL class) and wrote the output to a file.

package test;

import java.net.*;
import java.io.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        URL oracle = new URL("http://www.fetagracollege.org");
        String fileName = "D:\\a_01\\output.txt";

        // try-with-resources closes and flushes both streams, even on failure
        try (BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
             PrintWriter writer = new PrintWriter(fileName, "UTF-8")) {
            String inputLine;

            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
        }
    }
}

Reference URL: https://stackoverflow.com/questions/238547/how-do-you-programmatically-download-a-webpage-in-java
