bs4 :: Gamja's Farm

bs4

상감자 2018. 5. 16. 12:27


Extracting tag data

from bs4 import BeautifulSoup

html = """<html><head><title>title name</title></head><body>test</body></html>"""

soup = BeautifulSoup(html, 'lxml')

tag_title = soup.title

print(tag_title.text) #title name

print(tag_title.string) #title name

print(tag_title.name) #title

Attribute data

from bs4 import BeautifulSoup

html = """<html><head><title class = "t" id = "title">title</title></head><body>test</body></html>"""

soup = BeautifulSoup(html, 'lxml')

tag_title = soup.title

print(tag_title.attrs) # {'class': ['t'], 'id': 'title'}

print(tag_title['class']) # ['t']

print(tag_title['id']) #title

Accessing an attribute with get() avoids an error: a missing attribute returns None instead of raising a KeyError.

e.g. tag_title.get('class')
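To make the difference concrete, here is a minimal sketch that compares get() with bracket access on an attribute the tag does not have. It uses the standard-library 'html.parser' backend instead of 'lxml' only so that it runs without an extra install; the 'href' attribute and the 'n/a' default are illustrative choices, not part of the original post.

```python
from bs4 import BeautifulSoup

html = '<html><head><title class="t" id="title">title</title></head><body>test</body></html>'
soup = BeautifulSoup(html, "html.parser")
tag_title = soup.title

# get() on an existing attribute returns its value (class is multi-valued, so a list).
class_value = tag_title.get("class")

# get() on a missing attribute returns None, or the default you pass in.
missing = tag_title.get("href")
fallback = tag_title.get("href", "n/a")

# Bracket access on a missing attribute raises KeyError.
try:
    tag_title["href"]
    raised = False
except KeyError:
    raised = True

print(class_value, missing, fallback, raised)
```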

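The text and string attributes of a tag

tag_title.text returns the text of the tag together with the text of all its descendant tags, while tag_title.string returns only the content directly inside the tag itself (and is None when the tag has more than one child).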
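The difference only shows up when a tag has several children, so the following sketch uses a small made-up document with a nested div. The markup and the variable names are assumptions for illustration, and 'html.parser' is used in place of 'lxml' to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html = "<html><body><div><p>first</p><p>second</p></div><p id='single'>only</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div                         # has two <p> children
single = soup.find("p", id="single")   # has exactly one text child

# .text concatenates the text of every descendant.
div_text = div.text

# .string is None because the div has more than one child.
div_string = div.string

# With a single text child, .text and .string agree.
single_text = single.text
single_string = single.string

print(repr(div_text), div_string, repr(single_text), repr(single_string))
```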

Parent, child, and sibling tags

from bs4 import BeautifulSoup

html = """<html><head><title class = "t" id = "title">title</title></head><body><p>test</p></body></html>"""

soup = BeautifulSoup(html, 'lxml')

tag_content1 = soup.p.contents # list of direct children

tag_content2 = soup.p.children # iterator over direct children

tag_content3 = soup.p.parent # direct parent tag

tag_content4 = soup.p.parents # iterator over all ancestors

#tag_content5 = soup.p.previous_sibling  previous sibling tag

#tag_content6 = soup.p.next_sibling  next sibling tag

print(tag_content1) #['test']

print(tag_content2) #returned as an iterator, so loop over it to print the items

print(tag_content3) #<body><p>test</p></body>

print(tag_content4) #returned as an iterator, so loop over it to print the items
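Since children and parents have to be iterated to be useful, here is a small sketch that consumes them and also reads the two sibling attributes. The h1 and span elements are added purely for illustration so that the p tag has siblings, and 'html.parser' stands in for 'lxml' so that no extra tags are inserted by the parser.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>head</h1><p>test</p><span>tail</span></body></html>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Materialize the children iterator into a list.
child_list = list(p.children)

# Walk up the tree; the last ancestor is the BeautifulSoup object itself.
parent_names = [parent.name for parent in p.parents]

# Siblings are the nodes directly before and after p under the same parent.
prev_name = p.previous_sibling.name
next_name = p.next_sibling.name

print(child_list, parent_names, prev_name, next_name)
```

Note that if the markup contains whitespace or newlines between tags, previous_sibling and next_sibling can return those text nodes rather than the neighbouring tag.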