Posts in the 'Web Server/Crawler' category

Web Server/Crawler

Using re 2018.05.16
bs4 functions 2018.05.16
bs4 2018.05.16
urllib module 2018.05.15
requests module 2018.05.15
requests VS urllib 2018.05.15
Python inheritance 2018.05.15

Using re

상감자 2018. 5. 16. 12:49

Example of using re

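import re
from bs4 import BeautifulSoup
html = """<html><head><title>title name</title></head><body>test</body></html>"""
soup = BeautifulSoup(html, 'lxml')
html_content = soup.find_all('html')
# A compiled pattern as the first argument is matched against tag names
print(soup.find_all(re.compile('your regex')))
# A compiled pattern passed as class_ is matched against class attribute values
print(soup.find_all(class_=re.compile('your regex')))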
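The placeholders above can be made concrete with a small sketch. The sample HTML, the two patterns, and the variable names below are assumptions made for illustration, and html.parser is used instead of lxml only so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from the original post.
html = """<html><head><title>title name</title></head>
<body><p class="note-a">first</p><p class="note-b">second</p><b class="other">bold</b></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Tag names that start with "b": <body> and <b> match, <html> and <head> do not.
tags_starting_with_b = [tag.name for tag in soup.find_all(re.compile(r"^b"))]

# Tags whose class value is "note-" followed by one lowercase letter.
note_texts = [tag.get_text() for tag in soup.find_all(class_=re.compile(r"^note-[a-z]$"))]

print(tags_starting_with_b)
print(note_texts)
```

Results come back in document order, so the two lists can be used directly for further processing.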


bs4 functions

상감자 2018. 5. 16. 12:45

find_all(): returns the matching tags as a list

from bs4 import BeautifulSoup
html = """<html><head><title>title name</title></head><body>test</body></html>"""
soup = BeautifulSoup(html, 'lxml')
html_content = soup.find_all('html')
find_p = html_content[0].find_all('body')  # find_all can also be called on a found tag
print(soup.find_all('tag name'))  # by tag name
print(soup.find_all(id='id name'))  # by id
print(soup.find_all('tag name', class_='class name'))  # by tag name and class
print(soup.find_all('tag name', 'class name'))  # a string as the second argument is treated as a class
print(soup.find_all('tag name', string='text to match'))  # by text content (the argument is called text in older versions)
print(soup.find_all())  # every tag
print(soup.find_all(['tag1', 'tag2']))  # several tag names at once

select(): finds tags with a CSS selector and returns them as a list

from bs4 import BeautifulSoup
html = """<html><head><title>title name</title></head><body>test</body></html>"""
soup = BeautifulSoup(html, 'lxml')
html_content = soup.find_all('html')
print(soup.select('tag'))  # by tag name
print(soup.select('.classname'))  # by class
print(soup.select('#idname'))  # by id
print(soup.select('tag tag.classname'))  # descendant tag with a given class


bs4

상감자 2018. 5. 16. 12:27

Extracting tag data

from bs4 import BeautifulSoup
html = """<html><head><title>title name</title></head><body>test</body></html>"""
soup = BeautifulSoup(html, 'lxml')
tag_title = soup.title
print(tag_title.text)  # title name
print(tag_title.string)  # title name
print(tag_title.name)  # title

Attribute data

from bs4 import BeautifulSoup
html = """<html><head><title class = "t" id = "title">title</title></head><body>test</body></html>"""
soup = BeautifulSoup(html, 'lxml')
tag_title = soup.title
print(tag_title.attrs)  # {'class': ['t'], 'id': 'title'}
print(tag_title['class'])  # ['t'] (class is multi-valued, so it is a list)
print(tag_title['id'])  # title

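Accessing an attribute with get() avoids an error: if the attribute is missing, get() returns None, whereas tag['name'] raises a KeyError.

ex) tag_title.get('class')

The text and string attributes of a tag

tag_title.text returns the text of the tag together with the text of all its descendant tags. tag_title.string returns only the content directly inside the tag itself, and is None when the tag has more than one child.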
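The difference can be seen on a tag with several children. This is a minimal sketch on a made-up HTML fragment; the fragment and the variable names are assumptions, and html.parser is used here so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a <div> with two children and a <span> with one.
html = "<div><p>first</p><p>second</p></div><span>only</span>"
soup = BeautifulSoup(html, "html.parser")

div_text = soup.div.text        # text of all descendants, concatenated
div_string = soup.div.string    # None, because <div> has more than one child
span_string = soup.span.string  # the single string inside <span>

print(repr(div_text))
print(div_string)
print(span_string)
```

In practice this means .text (or get_text()) is the safer choice when a tag may contain nested markup.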

Parent, child, and sibling tags

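from bs4 import BeautifulSoup
# The sample needs a <p> tag; without one, soup.p is None and every line below fails
html = """<html><head><title class = "t" id = "title">title</title></head><body><p>test</p></body></html>"""
soup = BeautifulSoup(html, 'lxml')
tag_content1 = soup.p.contents  # list of direct children
tag_content2 = soup.p.children  # iterator over direct children
tag_content3 = soup.p.parent  # direct parent tag
tag_content4 = soup.p.parents  # generator over all ancestors
# tag_content5 = soup.p.previous_sibling  # the preceding sibling
# tag_content6 = soup.p.next_sibling  # the following sibling
print(tag_content1)  # ['test']
print(tag_content2)  # an iterator object, so loop over it to print the items
print(tag_content3)  # <body><p>test</p></body>
print(tag_content4)  # a generator object, so loop over it to print the items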


urllib module

상감자 2018. 5. 15. 23:46

Inspecting response information

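from urllib.request import urlopen, Request
url = "https://sang-gamja.tistory.com/"
req = Request(url)  # the class name is Request, with a capital R
page = urlopen(req)
print(page)  # the response object
print(page.status)  # HTTP status code
print(page.headers)  # response headers
print(page.url)  # URL that was actually retrieved
print(page.info().get_content_charset())  # charset declared in Content-Type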

Sending data with a request

from urllib.request import urlopen, Request
import urllib.parse  # import the submodule explicitly rather than relying on "import urllib"
url = "https://sang-gamja.tistory.com/"
data = {'key1' : 'value1', 'key2' : 'value2'}
data = urllib.parse.urlencode(data)  # 'key1=value1&key2=value2'
data = data.encode('utf-8')  # Request expects bytes
print(data)
req_post = Request(url, data=data, headers={})
page = urlopen(req_post)
print(page)

When a request object is created with urllib's Request(), the second argument is the data and the third is the headers. If the second argument is given, a POST request is sent; if it is omitted, a GET request is sent.
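This rule can be checked without sending anything over the network, because a Request object reports the method it would use through get_method(). The sketch below reuses the URL and keys from the example above; the variable names are illustrative only.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://sang-gamja.tistory.com/"

# No data argument: the request defaults to GET.
get_req = Request(url)

# With a bytes payload as the second argument: the request defaults to POST.
payload = urlencode({"key1": "value1", "key2": "value2"}).encode("utf-8")
post_req = Request(url, data=payload)

# The method keyword overrides the default chosen from the data argument.
put_req = Request(url, data=payload, method="PUT")

print(get_req.get_method())
print(post_req.get_method())
print(put_req.get_method())
```

Nothing is sent until urlopen() is called, so this is a cheap way to confirm how a request will go out.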


requests module

상감자 2018. 5. 15. 23:13

Building a query string

import requests as rq  # the package name is requests, not request
url = "https://sang-gamja.tistory.com/"
res = rq.get(url, params = { "key1" : "value1", "key2" : "value2"})
print(res.url)  # the URL with ?key1=value1&key2=value2 appended

Difference between the json module and str

import json
dict1 = { 'key1' : 'value1', 'key2' : 'value2'}
print(json.dumps(dict1))  # {"key1": "value1", "key2": "value2"}  (double quotes, valid JSON)
print(str(dict1))  # {'key1': 'value1', 'key2': 'value2'}  (single quotes, Python repr)
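The difference matters once the string has to be parsed again. A minimal sketch, with a hypothetical dictionary that also holds None and True to make the gap more visible:

```python
import json

# Hypothetical dictionary; None and True are included on purpose.
dict1 = {"key1": "value1", "key2": None, "key3": True}

as_json = json.dumps(dict1)  # JSON text: double quotes, null, true
as_str = str(dict1)          # Python repr: single quotes, None, True

# JSON text round-trips back to an equal dictionary.
restored = json.loads(as_json)

# The Python repr is not valid JSON, so parsing it fails.
try:
    json.loads(as_str)
    str_is_valid_json = True
except json.JSONDecodeError:
    str_is_valid_json = False

print(as_json)
print(as_str)
print(restored == dict1, str_is_valid_json)
```

So json.dumps() is the one to use when the result is sent to a server as a request body.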

Setting headers

import requests as rq
url = "https://sang-gamja.tistory.com/"
res = rq.get(url, headers = {"User-Agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"})
print(res.url)

Handling errors with requests

import requests as rq
url = "https://sang-gamja.tistory.com"
try:
    res = rq.get(url)
    res.raise_for_status()  # HTTPError is only raised for 4xx/5xx when this is called
    print(res.url)
except rq.exceptions.HTTPError:
    print("HTTP error occurred")


requests VS urllib

상감자 2018. 5. 15. 22:56


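The requests module and the urllib module play very similar roles.

In Python, however, many people use the requests module.

So let's look at the differences.

1. When sending data, requests takes a dictionary, while urllib needs the data URL-encoded and converted to bytes.

2. requests names the request method explicitly (get, post), while urllib chooses between GET and POST depending on whether data is present.

3. When a page that does not exist is requested, requests does not raise an error (it returns the response with its status code unless raise_for_status() is called), while urllib raises an error.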
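Points 1 and 2 can be compared side by side without any network traffic by building the request objects and inspecting them. This is only a sketch: it relies on requests.Request(...).prepare(), which is not used elsewhere in these posts, and the variable names are illustrative.

```python
from urllib.parse import urlencode
from urllib.request import Request as UrllibRequest

import requests

url = "https://sang-gamja.tistory.com/"
form = {"key1": "value1", "key2": "value2"}

# requests: the dictionary is passed as-is and the method is named explicitly.
prepared = requests.Request("POST", url, data=form).prepare()

# urllib: the dictionary is URL-encoded and converted to bytes by hand,
# and the presence of data is what turns the request into a POST.
urllib_req = UrllibRequest(url, data=urlencode(form).encode("utf-8"))

print(prepared.method, prepared.body)
print(urllib_req.get_method(), urllib_req.data)
```

Both end up with the same form-encoded payload; requests simply does the encoding step itself.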


Python inheritance

상감자 2018. 5. 15. 22:49

Inheritance lets a class reuse the functionality of a class that already exists.

class people:
    people_count = 0  # class attribute shared by every instance

    def __init__(self):
        print("person created")
        people.people_count += 1

    def move(self):
        print("person moving")

class car(people):
    def __init__(self):
        print("car created")
        super(car, self).__init__()  # run the parent constructor

class airplane(people):
    def __init__(self):
        print("airplane created")
        super(airplane, self).__init__()

car1 = car()
car2 = car()
car1.move()  # move() is inherited from people
print(people.people_count)  # 2

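Putting super(ClassName, self).__init__() in the child constructor makes the parent constructor run every time the child constructor is called. In Python 3 the shorter form super().__init__() does the same thing.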
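What the super call changes is easiest to see next to a child class that leaves it out. The class names Base, WithSuper, and WithoutSuper below are hypothetical and chosen only for this sketch.

```python
class Base:
    count = 0  # incremented by the parent constructor

    def __init__(self):
        Base.count += 1


class WithSuper(Base):
    def __init__(self):
        super().__init__()  # parent constructor runs, so count goes up


class WithoutSuper(Base):
    def __init__(self):
        pass  # parent constructor is never called, so count is unchanged


WithSuper()
WithSuper()
WithoutSuper()

print(Base.count)  # 2: only the two WithSuper instances were counted
```

A child class that defines its own __init__ replaces the parent's, so the parent's setup runs only when it is called explicitly.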


Gamja's Farm