[Document] beautifulsoup4 - mildsalmon

Introduction

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.
This document covers Beautiful Soup version 4.8.1. The examples in this documentation should work the same way in Python 2.7 and Python 3.2. You might be looking for the documentation for Beautiful Soup 3. If so, you should know that Beautiful Soup 3 is no longer being developed and that support for it will be dropped on or after December 31, 2020. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.

For anyone still looking at Beautiful Soup 3: it is no longer being developed, so Beautiful Soup 4 is strongly recommended for any new project.

In short, Beautiful Soup repairs malformed HTML and converts it into an XML-like tree of Python objects that is easy to navigate.

Installation

Installing Beautiful Soup 4

Debian or Ubuntu Linux (recent versions)

Installing with the system package manager

$ apt-get install python-bs4

If you cannot install it with the system package manager, you can install it with easy_install or pip.

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

If neither easy_install nor pip is installed, you can download the beautifulsoup4 source and install it with setup.py.

$ python setup.py install

parser

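BeautifulSoup(markup)  # markup: a variable holding the page to convert

~~Written this way, Python's html.parser is picked automatically by default.~~ As of 2020-01-23, calling it as above raised an error for me (Beautiful Soup normally only warns when no parser is named), so the parser has to be given explicitly, as below.

BeautifulSoup(markup, "html.parser")

Python's html.parser

BeautifulSoup(markup, "html.parser")

Advantages: batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)
Disadvantages: not very lenient (before Python 2.7.3 or 3.2.2)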

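lxml's HTML parser

BeautifulSoup(markup, "lxml")

Advantages: very fast; lenient
Disadvantages: external C dependency

lxml's XML parser

BeautifulSoup(markup, ["lxml", "xml"])
BeautifulSoup(markup, "xml")

Advantages: very fast; the only currently supported XML parser
Disadvantages: external C dependency

html5lib

BeautifulSoup(markup, "html5lib")

Advantages: extremely lenient; parses pages the same way a web browser does; creates valid HTML5
Disadvantages: very slow; external Python dependency (older documentation lists it as Python 2 only, but current html5lib releases also run on Python 3)

lxml has to be installed separately.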

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

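Specifying the parser to use

If you just need to parse some HTML, you can pass the markup into the BeautifulSoup constructor and it will probably be fine. Beautiful Soup picks a parser for you and parses the data. There are, however, a few additional arguments you can pass to the constructor to change which parser is used.

The first argument to the BeautifulSoup constructor is a string or an open filehandle: the markup you want parsed. The second argument is how you would like the markup parsed.

If you do not specify anything, you get the best HTML parser that is installed. Beautiful Soup ranks lxml's parser as the best, then html5lib's, then Python's built-in parser. You can override this by specifying one of the following:

The type of markup you want to parse. Currently supported are "html", "xml", and "html5".
The name of the parser library you want to use. Currently supported options are "lxml", "html5lib", and "html.parser" (Python's built-in HTML parser). The parser sections above compare the supported parsers.

If you do not have an appropriate parser installed, Beautiful Soup will ignore your request and pick a different parser. Right now, the only supported XML parser is lxml. If you do not have lxml installed, asking for an XML parser will not give you one, and asking for "lxml" will not work either.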
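A minimal sketch of how a script might cope with a missing parser. The paragraph above describes a silent fallback; in the Beautiful Soup 4 releases I am aware of, asking by name for a parser that is not installed raises bs4.FeatureNotFound instead, so the sketch handles that case. Treat this as an assumption to check against your installed version. The names make_soup and preferred are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not part of Beautiful Soup.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Assumed behaviour: the requested parser is not installed,
        # so fall back to the parser bundled with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Note that a fallback like this trades reproducibility for convenience: two machines may still build slightly different trees from invalid markup.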

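Differences between parsers

Different parsers create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parser. Here is a short document, parsed as HTML: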

BeautifulSoup("<a><b /></a>")

# <html><head></head><body><a><b></b></a></body></html>

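Since an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.

Here is the same document parsed as XML (running this requires lxml to be installed). Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being put into an <html> tag: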

BeautifulSoup("<a><b /></a>", "xml")

# <?xml version="1.0" encoding="utf-8"?>
# <a><b /></a>

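There are also differences between the HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences do not matter. One parser may be faster than another, but they all give you a data structure that looks exactly like the original HTML document.

If the document is not perfectly formed, however, different parsers give different results. Here is a short, invalid document parsed with lxml's HTML parser. The dangling </p> tag is simply ignored: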

BeautifulSoup("<a></p>", "lxml")

# <html><body><a></a></body></html>

Here is the same document parsed with html5lib:

BeautifulSoup("<a></p>", "html5lib")

# <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. This parser also adds an empty <head> tag to the document.

Here is the same document parsed with Python's built-in HTML parser:

BeautifulSoup("<a></p>", "html.parser")
# <a></a>

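Like lxml, this parser ignores the closing </p> tag. Unlike html5lib, it makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it does not even bother to add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is the "correct" way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim to being the "correct" way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you plan to distribute your script to other people, or to run it on several machines, you should specify a parser in the BeautifulSoup constructor. That reduces the chance that your users parse a document differently from the way you parsed it.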
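As a small illustration of that advice, the sketch below names the parser in one place and reuses it. The constant name PARSER is only an illustrative choice; "html.parser" is used because it ships with Python and needs no extra install.

```python
from bs4 import BeautifulSoup

# Name the parser explicitly so every machine builds the same tree.
PARSER = "html.parser"  # illustrative constant; bundled with Python

soup = BeautifulSoup("<a></p>", PARSER)
result = str(soup)
print(result)  # <a></a>
```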

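Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle.

from bs4 import BeautifulSoup

# Parse from an open file handle (closed automatically by the with block)
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# Parse from a string
soup = BeautifulSoup("<html>data</html>","html.parser")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It uses an HTML parser unless you specifically tell it to use an XML parser.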

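Objects

BeautifulSoup object

The BeautifulSoup object itself represents the document as a whole. For most purposes you can treat it as a Tag object, which means it supports the methods described under navigating the tree and searching the tree.

Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. It is sometimes useful to look at its .name anyway, so it has been given the special .name "[document]":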

print(soup.name)

# [document]

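Tag object

soup = BeautifulSoup('<div class="section" id="tag"></div>',"html.parser")
tag = soup.div
print(type(tag))

# <class 'bs4.element.Tag'>

A Tag object corresponds to an XML or HTML tag in the original document. Tags have many attributes and methods.

Name

Every tag has a name, accessible as .name. If you change a tag's name, the change is reflected in any HTML markup Beautiful Soup generates: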

soup = BeautifulSoup('<div class="section" id="tag"></div>',"html.parser")
tag = soup.div
print(tag.name)
print(tag)
tag.name = "hi"
print(tag)
print(soup)

# div
# <div class="section" id="tag"></div>
# <hi class="section" id="tag"></hi>
# <hi class="section" id="tag"></hi>

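Attributes

A tag may have any number of attributes. For example, the tag <div class="section" id="tag"> has an attribute "class" whose value is "section". You can access a tag's attributes by treating the tag like a dictionary: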

print(tag['class'])

# ['section']

You can access that dictionary directly as .attrs:

print(tag.attrs)

# {'class': ['section'], 'id': 'tag'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary.

Adding

tag['class'] = 'verybold'
tag['my'] = 1
print(tag)

# <div class="verybold" id="tag" my="1"></div>

Changing

tag['class'] = 'change'
print(tag.get('class'))

# change

Removing

del tag['class']
del tag['id']
print(tag)

# <div my="1"></div>

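Multi-valued attributes (did not seem to work the way I expected)

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list: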

css_soup = BeautifulSoup('<div class="section tv" id="tag"></div>',"html.parser")
print(css_soup.div['class'])
css_soup = BeautifulSoup('<div class="section" id="tag"></div>',"html.parser")
print(css_soup.div['class'])

# ['section', 'tv']
# ['section']

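If an attribute looks like it has more than one value but is not a multi-valued attribute as defined by the HTML standard, Beautiful Soup leaves the attribute alone.

In the "section tv" example above, tv is not a standard class name, yet it still ended up in the list, which puzzled me at first. The explanation seems to be that the standard defines which attributes (such as class) are multi-valued, not which values are allowed, so any whitespace-separated class value is split. An attribute such as id is not multi-valued, so its value stays a single string:

id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_soup.p['id']
# 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes (this example requires lxml):

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'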
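The difference between a multi-valued attribute (class) and an ordinary one (id) can be seen side by side in the short sketch below. It uses Tag.get_attribute_list(), which, as far as I know, is available in the 4.8.x releases this document covers and always returns a list; the markup is made up for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="section tv" id="my id"></div>', "html.parser")
div = soup.div

class_value = div["class"]                  # class is multi-valued: split on whitespace
id_value = div["id"]                        # id is not multi-valued: kept as one string
id_as_list = div.get_attribute_list("id")   # always a list, whatever the attribute

print(class_value)  # ['section', 'tv']
print(id_value)     # my id
print(id_as_list)   # ['my id']
```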

NavigableString object

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

soup = BeautifulSoup('<div class="my id" id="tag">hi</div>',"html.parser")
tag = soup.div
print(tag.string)
print(type(tag.string))

# hi
# <class 'bs4.element.NavigableString'>

Comment object

Used to find HTML comments (<!-- -->) inside tags.

soup = BeautifulSoup('<div class="my id" id="tag">hi</div>',"html.parser")
tag = soup.div
comment = soup.div.string
print(type(comment))
markup = "<b><!--Hey, i am comment--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
print(type(comment))

# <class 'bs4.element.NavigableString'>
# <class 'bs4.element.Comment'>

The Comment object is just a special type of NavigableString:

print(comment)

# Hey, i am comment
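Because Comment is a subclass of NavigableString, comments can be collected by filtering strings on their type. The following is a small sketch with made-up markup; it uses the string argument of find_all(), which I believe is the newer name (4.4.0 and later) for the text argument used elsewhere in these notes.

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, i am comment--></b><p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# Keep only the strings whose type is Comment.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]
print(comment_texts)  # ['Hey, i am comment', ' hidden note ']
```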

Navigating the tree

The HTML document

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

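Going down

Tags can contain other tags. These elements are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Note that Beautiful Soup strings do not support any of these attributes, because a string cannot have children.

Navigating using tag names

The simplest way to navigate the parse tree is to name the tag you want.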

print(soup.head)
print(soup.title)

# <head><title>The Dormouse's story</title></head>
# <title>The Dormouse's story</title>

You can repeat this to zoom in on a particular part of the parse tree.

print(soup.body.b)

# <b>The Dormouse's story</b>

Using a tag name as an attribute gives you only the first tag by that name.

print(soup.a)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

A tag's children are available in a list called .contents.

head_tag = soup.head
print(head_tag)
print(head_tag.contents)

title_tag = head_tag.contents[0]
print(title_tag)

# <head><title>The Dormouse's story</title></head>
# [<title>The Dormouse's story</title>]

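Calling .contents on a string raises AttributeError, because a NavigableString cannot contain anything. With the pretty-printed document above, head_tag.contents[0] may actually be the newline string '\n' rather than the <title> tag, in which case print(title_tag.contents) fails with: AttributeError: 'NavigableString' object has no attribute 'contents'

The BeautifulSoup object itself has children. The <html> tag is the child of the BeautifulSoup object.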

print(len(soup.contents))

# 1

If you get 3 instead of 1, it is because two '\n' newline strings are also included as elements.

.descendants

The .contents and .children attributes only consider a tag's direct children. The <head> tag has a single direct child, the <title> tag.

print(soup.head.contents)

# [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". That string is also a descendant of the <head> tag. The .descendants attribute lets you iterate over all of a tag's descendants recursively.

head_tag = soup.head
for child in head_tag.descendants:
    print(child)

# <title>The Dormouse's story</title>
# The Dormouse's story

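The <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child. Likewise, the BeautifulSoup object has one direct child (the <html> tag) but many descendants: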

print(len(list(soup.children)))
print(len(list(soup.descendants)))

# 3
# 34

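The result in the official example is 1 and 25. The numbers are higher here because the '\n' newline strings in the pretty-printed document are counted as well.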
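If the whitespace strings get in the way, one option is to keep only the children that are Tag objects. The sketch below uses a shortened, made-up version of the document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "\n<html>\n <head>\n  <title>The Dormouse's story</title>\n </head>\n</html>\n"
soup = BeautifulSoup(html, "html.parser")

# All direct children, including the newline strings around <html>.
all_children = list(soup.children)

# Only the direct children that are real tags.
tag_children = [child for child in soup.children if isinstance(child, Tag)]

print(len(all_children))
print([tag.name for tag in tag_children])
```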

.string

If a tag has only one child, and that child is a NavigableString, the child is available as .string.

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.

head_tag = soup.head
title_tag = head_tag.contents[0]

print(title_tag.string)
print(head_tag.contents)
print(head_tag.string)

# The Dormouse's story
# [<title>The Dormouse's story</title>]
# The Dormouse's story

If a tag contains more than one thing, it is not clear what .string should refer to, so .string is defined to be None.

print(soup.html.string)

# None

.strings AND .stripped_strings

Even if a tag contains more than one thing, you can still look at the strings inside it.

for string in soup.strings:
    print(repr(string))

# '\n'
# '\n'
# "The Dormouse's story"
# '\n'
# '\n'
# '\n'
# "\n    The Dormouse's story\n   "
# '\n'
# '\n'
# '\n   Once upon a time there were three little sisters; and their names were\n   '
# '\n    Elsie\n   '
# '\n   ,\n   '
# '\n    Lacie\n   '
# '\n   and\n   '
# '\n    Tillie\n   '
# '\n   ; and they lived at the bottom of a well.\n  '
# '\n'
# '\n   ...\n  '
# '\n'
# '\n'
# '\n'

These strings tend to carry a lot of extra newlines and whitespace, which .stripped_strings removes. Leading and trailing whitespace is stripped, and strings consisting entirely of whitespace are dropped.

for string in soup.stripped_strings:
    print(repr(string))

# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# '; and they lived at the bottom of a well.'
# '...'
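A short sketch of how the stripped strings might be turned into a single line of text. The markup is made up for the illustration; get_text() with strip=True is shown alongside as what I understand to be an equivalent shortcut.

```python
from bs4 import BeautifulSoup

html = "<p>\n  Once upon a time\n  <a>\n   Elsie\n  </a>\n  ,\n</p>"
soup = BeautifulSoup(html, "html.parser")

pieces = list(soup.stripped_strings)        # whitespace-only strings are dropped
joined = " ".join(pieces)                   # join the cleaned pieces manually
same_text = soup.get_text(" ", strip=True)  # same idea in a single call

print(pieces)  # ['Once upon a time', 'Elsie', ',']
print(joined)  # Once upon a time Elsie ,
```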

Going up

Every tag and every string has a parent. You access an element's parent with the .parent attribute. The title string itself also has a parent.

title_tag = soup.title
print(title_tag.string.parent)
print(title_tag)
print(title_tag.parent)

# <title>The Dormouse's story</title>
# <title>The Dormouse's story</title>
# <head><title>The Dormouse's story</title></head>

The parent of a top-level tag such as <html> is the BeautifulSoup object itself. The .parent of the BeautifulSoup object is defined as None.

html_tag = soup.html
print(type(html_tag.parent))
print(soup.parent)

# <class 'bs4.BeautifulSoup'>
# None

You can iterate over all of an element's parents with .parents.

link = soup.a

for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

# p
# body
# html
# [document]

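The official example also prints None at the end, but that does not happen here.

print(link.parent.parent.parent.parent.parent)

# None

Chaining .parent like this does give None. The .parents iterator appears to stop at the BeautifulSoup object without yielding None, so None only shows up through .parent.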
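The observation above can be checked with a compact sketch (the markup is a made-up minimal document): .parents yields every ancestor up to the BeautifulSoup object, and only .parent on that top object gives None.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>Elsie</a></p></body></html>", "html.parser")
link = soup.a

# Every ancestor of <a>, ending with the BeautifulSoup object itself.
parent_names = [parent.name for parent in link.parents]
print(parent_names)  # ['p', 'body', 'html', '[document]']

top = link.parent.parent.parent.parent  # the BeautifulSoup object
print(top.parent)  # None
```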

Going sideways

sibling_soup = BeautifulSoup("<html><body><a><b>text1</b><c>text2</c></b></a></body></html>",'html.parser')
print(sibling_soup.prettify())

# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

The <b> tag and the <c> tag are at the same level: both are direct children of the same tag. They are siblings.

.next_sibling AND .previous_sibling

You can use .next_sibling and .previous_sibling to move between page elements that are on the same level of the parse tree.

print(sibling_soup.b.next_sibling)
print(sibling_soup.c.previous_sibling)

# <c>text2</c>
# <b>text1</b>

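The <b> tag has a .next_sibling but no .previous_sibling, because there is nothing before the <b> tag on the same level of the tree.

The strings 'text1' and 'text2' are not siblings, because they do not have the same parent.

In real documents, the .next_sibling or .previous_sibling of a tag is usually a string containing whitespace.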

link = soup.a
print(link)
print(link.next_sibling)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
#    ,

The second <a> tag is actually the .next_sibling of the comma:

print(link.next_sibling.next_sibling)

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

.next_siblings AND .previous_siblings

You can iterate over a tag's siblings with .next_siblings or .previous_siblings.

for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

# '\n   ,\n   '
# <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
# '\n   and\n   '
# <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
# '\n   ; and they lived at the bottom of a well.\n  '
#
# '\n   and\n   '
# <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
# '\n   ,\n   '
# <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
# '\n   Once upon a time there were three little sisters; and their names were\n   '

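Going back and forth

.next_element AND .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might look the same as .next_sibling, but it is completely different.

Here is the last <a> tag in the document. Its .next_sibling is a string: the rest of the sentence that was interrupted by the start of the <a> tag.

But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it is the word "Tillie". The parse order is: <a> tag -> Tillie -> </a> tag.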

last_a_tag = soup.find("a", id="link3")

print(last_a_tag)
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
#    ; and they lived at the bottom of a well.
#     Tillie

The .previous_element attribute is the exact opposite of .next_element.

.next_elements AND .previous_elements

for element in last_a_tag.next_elements:
    print(repr(element))

# '\n    Tillie\n   '
# '\n   ; and they lived at the bottom of a well.\n  '
# '\n'
# <p class="story">
#    ...
#   </p>
# '\n   ...\n  '
# '\n'
# '\n'
# '\n'

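Searching the tree

Beautiful Soup defines many methods for searching the parse tree, but they are all very similar. The two most commonly used are find() and find_all().

Kinds of filters

Before describing find_all() and similar methods, here are examples of the different filters you can pass into them. You can filter on a tag's name, its attributes, the text of a string, or some combination of these.

A string

The simplest filter is a string. The following code finds every "div" tag in the document.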

soup = BeautifulSoup('<div class="my id" id="tag">hi</div>',"html.parser")
string_1 =soup.find_all('div')

print(string_1)

# [<div class="my id" id="tag">hi</div>]

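If you pass in a byte string, Beautiful Soup assumes the string is encoded as UTF-8. To avoid this, pass in a Unicode string.

A regular expression

A regular expression decides whether or not a string matches a given rule.

For example:

Write the letter a at least once.
Follow it with the letter b exactly five times.
Follow that with the letter c an even number of times.
Optionally, end with the letter d.

A regular expression that satisfies these rules is aa*bbbbb(cc)*(d | ).

aa* The a* means any number of a's, including zero, so aa* guarantees that a appears at least once.
bbbbb Exactly five b's in a row.
(cc)* Two c's are grouped in parentheses and followed by *, meaning any number of pairs of c's (zero pairs also satisfies the rule).
(d | ) The pipe in the middle means "this or that". Here it means "a d followed by a space, or just a space without the d".

You can test regular expressions directly on websites such as http://regexpal.com.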
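The rule above can also be tried directly in Python with the standard re module. In this sketch the expression is written without the spaces around the pipe, so the optional d is not followed by a space; that simplification and the test strings are my own choices, not part of the original example.

```python
import re

# One or more a, exactly five b, an even number of c, then an optional d.
pattern = re.compile(r"aa*bbbbb(cc)*(d|)")

candidates = ["aaabbbbbccccd", "abbbbb", "abbbbbccc", "bbbbbcc"]

# fullmatch() requires the whole string to satisfy the rule.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}
for s, ok in results.items():
    print(s, ok)
```

The first two strings satisfy every rule; the third has an odd number of c's and the fourth has no leading a, so both are rejected.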

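Regular expression symbols

Symbol	Meaning	Example	Example matches
*	Matches the preceding character, subexpression, or bracketed character 0 or more times.	a*b*	aaaaaaaa, aabbbbbb, bbbbbb
+	Matches the preceding character, subexpression, or bracketed character 1 or more times.	a+b+	aaaaab, aaabbbb, abbbb
[]	Matches any one of the characters inside the brackets.	[A-Z]*	APPLE, CAPITALS, QWERTY
()	A grouped subexpression. Subexpressions are evaluated first when a regular expression is evaluated.	(a*b)*	aaabaab, abaaab, ababaaaaab
{m, n}	Matches the preceding character, subexpression, or bracketed character between m and n times, inclusive.	a{2,3}b{2,3}	aabbb, aaabbb, aabb
[^]	Matches any single character that is not inside the brackets.	[^A-Z]*	apple, lowercase, qwerty
|	Matches any one of the characters, strings, or subexpressions separated by |. The | is the vertical bar called a "pipe", not a capital I.	b(a|i|e)d	bad, bid, bed
.	Matches any single character (letter, digit, symbol, space, and so on).	b.d	bad, bzd, b$d, b d
^	The following character or subexpression occurs at the beginning of the string.	^a	apple, asdf, a
\	An escape character that lets special characters be used with their literal meaning.	\. \| \\	. | \
$	Often used at the end of a regular expression; it means the preceding character or subexpression is at the end of the string. An expression without it effectively ends in .*, so it matches no matter what follows. You can think of it as the counterpart of the ^ symbol.	[A-Z]*[a-z]*$	ABCabc, zzzyx, Bob
?!	Means "does not contain". The character (or subexpression) immediately following this pair of symbols must not appear at that position. This can be confusing, because the excluded character may still appear elsewhere in the string. To exclude a character entirely, anchor the expression with ^ and $.	^((?![A-Z]).)*$	no-caps-here, $ymb0ls a4e f!ne

These regular expressions are written for Python. Regular expression syntax can differ between languages, so check the manual.

If you pass in a regular expression object, Beautiful Soup filters against that regular expression using its search() method.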

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

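The following code finds all the tags whose names start with the letter "b". In this case, that is the <body> tag and the <b> tag.

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "html.parser")  # html: the document above, as a string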

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# body
# b
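To make the difference between an anchored and an unanchored pattern concrete, here is a small self-contained sketch on a made-up miniature document. It relies on the regular expression being applied with search(), as I understand the current behaviour to be, so an unanchored pattern matches anywhere in the tag name.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: the tag name must start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: the tag name only has to contain "t" somewhere.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```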

A list

If you pass in a list, Beautiful Soup does a string match against each item in that list. The following code finds all the <a> tags and all the <b> tags.

list_ = soup.find_all(["a","b"])
print(list_)

# [<b>
#   The Dormouse's story
#  </b>, <a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]

True

The value True matches everything it can. The following code finds all the tags in the document, but none of the text strings.

for tag in soup.find_all(True):
  print(tag.name)

# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

find_all()

find_all(name, attrs, recursive, text, limit, **kwargs)

The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters.

print(soup.find_all("title"))
print(soup.find_all("p", "title"))
print(soup.find_all("a"))
print(soup.find_all(id="link2"))
print(soup.find(text=re.compile("sisters")))

# [<title>The Dormouse's story</title>]
# [<p class="title"><b>    The Dormouse's story   </b></p>]
# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]
# [<a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>]
#    Once upon a time there were three little sisters; and their names were

Arguments

The name argument

Pass a string that is a tag name, or a Python list of tag names. Text strings are ignored.

print(soup.find_all("title"))

# [<title>The Dormouse's story</title>]

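The attrs argument

The attrs argument takes a Python dictionary of attributes and finds tags that match every entry in it. Passing the same conditions as keyword arguments gives the same result.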

print(soup.find_all(href=re.compile("elsie"), id='link1'))
print(soup.find_all(attrs={'href':re.compile("elsie"), 'id':'link1'}))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>]
# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>]
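A small check, on made-up markup, of how the entries of an attrs dictionary combine: as far as I can tell a tag has to satisfy all of them, so one mismatching entry is enough to exclude it.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Both conditions hold for the first link.
both_match = soup.find_all(attrs={"href": re.compile("elsie"), "id": "link1"})

# No single tag has an "elsie" href together with id="link2".
one_mismatch = soup.find_all(attrs={"href": re.compile("elsie"), "id": "link2"})

print([a["id"] for a in both_match])  # ['link1']
print(one_mismatch)                   # []
```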

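Searching by CSS class

Searching for tags with a particular CSS class is very useful, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument raises a SyntaxError, so use the class_ keyword argument instead.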

print(soup.find_all("a", class_="sister"))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]

As with any keyword argument, you can pass class_ a string, a regular expression, a function, or True.

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

print(soup.find_all(class_=re.compile("itl")))

print(soup.find_all(class_=has_six_characters))

# [<p class="title">
# <b>
#   The Dormouse's story
#  </b>
# </p>]
#
# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]

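A single tag can have multiple values for its "class" attribute. When you search for a tag that matches a certain CSS class, you are matching against any of its CSS classes. You can also search for the exact string value of the class attribute, but variants of the string value (for example, the same classes in a different order) do not work.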

css_soup = BeautifulSoup('<p class="body strikeout"></p>','html.parser')
print(css_soup.find_all("p", class_="strikeout"))
print(css_soup.find_all("p",class_="body"))
print(css_soup.find_all("p", class_="body strikeout"))
print(css_soup.find_all("p", class_="strikeout body"))

# [<p class="body strikeout"></p>]
# [<p class="body strikeout"></p>]
# [<p class="body strikeout"></p>]
# []
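When the order of the classes should not matter, a CSS selector is one way around the exact-string limitation. This sketch assumes a Beautiful Soup version in which .select() is available (in 4.7.0 and later it relies on the Soup Sieve package, which pip normally installs alongside).

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact string match: fails because the classes are in a different order.
reordered = css_soup.find_all("p", class_="strikeout body")

# CSS selector: requires both classes, in any order.
both_classes = css_soup.select("p.strikeout.body")

print(reordered)     # []
print(both_classes)  # [<p class="body strikeout"></p>]
```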

A shortcut for class_ exists in all versions of Beautiful Soup. The second argument to any find()-type method is attrs, and passing a string for attrs searches for that string as a CSS class.

print(soup.find_all("a", "sister"))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]

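You can also pass a regular expression, a function, a dictionary, or True.

The recursive argument

The recursive argument is a Boolean. It controls how deep into the document the search goes. If it is True (the default), find_all() looks through children, grandchildren, and all further descendants for tags that match. If it is False, it only looks at the tag's direct children.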

print(soup.html.find_all("title"))
print(soup.html.find_all("title",recursive=False))

# [<title>The Dormouse's story</title>]
# []

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>

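The <title> tag is beneath the <html> tag, but not directly beneath it: the <head> tag is in the way. Beautiful Soup finds the <title> tag when it is allowed to look at all descendants of the <html> tag, but when recursive=False restricts the search to the <html> tag's direct children, it finds nothing.

Beautiful Soup offers many tree-searching methods, and most take the same arguments as find_all(): name, attrs, text, limit, and keyword arguments. The recursive argument is different: only find_all() and find() support it.

The text argument

The text argument differs in that it matches against text content rather than tag attributes. (In Beautiful Soup 4.4.0 and later the same argument is also available under the name string.)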

print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(text=re.compile("Dormouse")))

# []
# []
# ["The Dormouse's story", "\n    The Dormouse's story\n   "]

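Why doesn't this work?

# ['Elsie']
# ['Elsie', 'Lacie', 'Tillie']
# ["The Dormouse's story", "The Dormouse's story"]

This is what should come out.

It seems to be because the strings in this pretty-printed document actually look like '\n    Elsie\n   '. A plain string passed to text has to match the whole string exactly, surrounding whitespace included, while a regular expression only has to match part of it.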
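Two possible workarounds for the whitespace problem, sketched on a shortened, made-up version of the document: match with a regular expression, or pass a function that strips each string before comparing. The sketch uses the string argument, which I believe is the newer name (4.4.0 and later) for text; the set name wanted is illustrative.

```python
import re
from bs4 import BeautifulSoup

html = '<p>\n <a id="link1">\n  Elsie\n </a>\n <a id="link2">\n  Lacie\n </a>\n</p>'
soup = BeautifulSoup(html, "html.parser")

# Exact match fails: the real string is '\n  Elsie\n ', not 'Elsie'.
exact = soup.find_all(string="Elsie")

# A regular expression only needs to match part of the string.
by_regex = soup.find_all(string=re.compile("Elsie"))

# A function can strip the whitespace before comparing.
wanted = {"Elsie", "Lacie"}
by_function = soup.find_all(string=lambda s: s.strip() in wanted)
names = [s.strip() for s in by_function]

print(exact)  # []
print(names)  # ['Elsie', 'Lacie']
```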

The limit argument

Only find_all() takes limit; find() behaves as if limit were fixed at 1. Use it when you only want the first few matches on a page.

print(soup.find_all("a", limit=2))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>]

Keyword arguments

Keyword arguments select tags that have a particular attribute.

print(soup.find_all(id='link2'))

# [<a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>]

find()

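find(name, attrs, recursive, text, **kwargs)

The find_all() method scans the entire document looking for results, but sometimes you only want one result. find_all() with limit=1 returns a list containing that single result, while find() just returns the result itself. When nothing is found, find_all() returns an empty list and find() returns None.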

print(soup.find_all('title', limit=1))
print(soup.find('title'))

# [<title>The Dormouse's story</title>]
# <title>The Dormouse's story</title>

Navigating by tag name, as in soup.head.title, works by calling find() repeatedly:

print(soup.head.title)
print(soup.find("head").find("title"))

# <title>The Dormouse's story</title>
# <title>The Dormouse's story</title>
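Because find() returns None when nothing matches, chaining attribute access onto its result can fail. A minimal sketch of the two "nothing found" cases and a simple guard, on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # empty list when nothing matches

# Check for None before using the result of find().
title = soup.find("title")
title_text = title.string if title is not None else None

print(missing_one, missing_all, title_text)
```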

find_parents() AND find_parent()

find_parents(name, attrs, text, limit, **kwargs)
find_parent(name, attrs, text, **kwargs)

Where find_all() and find() work their way down the tree looking for tags, these methods work their way up the tree, looking at a tag's (or a string's) parents.

a_string = soup.find(text=re.compile("Lacie"))

print(a_string)
print(a_string.find_parents("a"))
print(a_string.find_parent("p"))
print(a_string.find_parents("p", class_="title"))

#   Lacie
#
# [<a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>]
#
# <p class="story">
#  Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>
#  ,
#  <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>
#  and
#  <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>
#  ; and they lived at the bottom of a well.
# </p>
#
# []

find_next_siblings() AND find_next_sibling()

find_next_siblings(name, attrs, text, limit, **kwargs)
find_next_sibling(name, attrs, text, **kwargs)

These methods use .next_siblings to iterate over the rest of an element's siblings in the tree. find_next_siblings() returns all the siblings that match, and find_next_sibling() returns only the first one.

first_link = soup.a

print(first_link)
print(first_link.find_next_siblings("a"))

first_story_paragraph = soup.find("p", "story")

print(first_story_paragraph.find_next_sibling("p"))

# <a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>
#
# [<a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]
#
# <p class="story">
#  ...
# </p>

find_previous_siblings() AND find_previous_sibling()

find_previous_siblings(name, attrs, text, limit, **kwargs)
find_previous_sibling(name, attrs, text, **kwargs)

These methods use .previous_siblings to iterate over the siblings that come before an element in the tree. find_previous_siblings() returns all the siblings that match, and find_previous_sibling() returns only the first one.

last_link = soup.find("a", id="link3")

print(last_link)
print(last_link.find_previous_siblings("a"))

first_story_paragraph = soup.find("p", "story")

print(first_story_paragraph.find_previous_sibling("p"))

# <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>
#
# [<a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>]
#
# <p class="title">
# <b>
#   The Dormouse's story
#  </b>
# </p>

find_all_next() AND find_next()

find_all_next(name, attrs, text, limit, **kwargs)
find_next(name, attrs, text, **kwargs)

These methods use .next_elements to iterate over whatever tags and strings come after an element in the document. find_all_next() returns all matches, and find_next() returns only the first one.

first_link = soup.a
print(first_link)
print(first_link.find_all_next(text=True))
print(first_link.find_next("p"))

# <a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>
#
# ['\n    Elsie\n   ', '\n   ,\n   ', '\n    Lacie\n   ', '\n   and\n   ', '\n    Tillie\n   ', '\n   ; and they lived at the bottom of a well.\n  ', '\n', '\n   ...\n  ', '\n', '\n', '\n']
#
# <p class="story">
#  ...
# </p>

find_all_previous() AND find_previous()

find_all_previous(name, attrs, text, limit, **kwargs)
find_previous(name, attrs, text, **kwargs)

These methods use .previous_elements to iterate over the tags and strings that come before an element in the document. find_all_previous() returns all matches, and find_previous() returns only the first one.

first_link = soup.a

print(first_link)
print(first_link.find_all_previous("p"))
print(first_link.find_previous("title"))

# <a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>
#
# [<p class="story">
#  Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>
#  ,
#  <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>
#  and
#  <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>
#  ; and they lived at the bottom of a well.
# </p>, <p class="title">
# <b>
#   The Dormouse's story
#  </b>
# </p>]
#
# <title>The Dormouse's story</title>

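CSS selectors

Beautiful Soup supports a subset of the CSS selector standard (since version 4.7.0 this is handled by the Soup Sieve package). Just build the selector as a string and pass it to the .select() method of a Tag or of the BeautifulSoup object itself.

You can find tags: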

print(soup.select("title"))

# [<title>The Dormouse's story</title>]

Find tags beneath other tags:

print(soup.select("body a"))
print(soup.select("html head title"))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]
#
# [<title>The Dormouse's story</title>]

Find tags directly beneath other tags:

print(soup.select("head > title"))
print(soup.select("p > a"))
print(soup.select("body > a"))

# [<title>The Dormouse's story</title>]
#
# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]
#
# []

Find tags by CSS class:

print(soup.select(".sister"))
print(soup.select("[class~=sister]"))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]
#
# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]

Find tags by ID:

print(soup.select("#link1"))
print(soup.select("a#link2"))
print(soup.select("b#link3"))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>]
#
# [<a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>]
#
# []

Test for the existence of an attribute:

print(soup.select('a[href]'))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]

Find tags by attribute value:

print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('a[href^="http://example.com/"]'))
print(soup.select('a[href$="tillie"]'))
print(soup.select('a[href*=".com/el"]'))

# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>]
#
# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>, <a class="sister" href="http://example.com/lacie" id="link2">
#   Lacie
#  </a>, <a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]
#
# [<a class="sister" href="http://example.com/tillie" id="link3">
#   Tillie
#  </a>]
#
# [<a class="sister" href="http://example.com/elsie" id="link1">
#   Elsie
#  </a>]

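Inside select(), the operators used with href above have the following meanings:

Operator	Meaning
^=	the attribute value begins with the given string
$=	the attribute value ends with the given string
*=	the attribute value contains the given string

These are standard CSS attribute selectors.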
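To make the three operators concrete, here is a minimal sketch. The markup and the helper function ids() are made up for this illustration; only select() and the selector syntax come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two http links and one https link.
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a href="https://example.org/tillie" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Return the id of every tag matched by a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]


starts = ids('a[href^="http://"]')    # value begins with "http://"
ends = ids('a[href$="tillie"]')       # value ends with "tillie"
contains = ids('a[href*="example"]')  # value contains "example"

print(starts, ends, contains)
```

Only the first two links start with "http://", only the last one ends with "tillie", and all three contain "example".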

Match language codes (lang|=en matches "en" itself and any value that starts with "en-"):

multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""

multilingual_soup = BeautifulSoup(multilingual_markup,'html.parser')

print(multilingual_soup.select('p[lang|=en]'))

# [<p lang="en">Hello</p>, <p lang="en-us">Howdy, y'all</p>, <p lang="en-gb">Pip-pip, old fruit</p>]

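This is all a convenience for people who already know CSS selector syntax. You can do everything shown here with the Beautiful Soup API as well. If CSS selectors are all you need, you might as well use lxml directly, because it is faster. What select() gives you is the ability to combine simple CSS selectors with the Beautiful Soup API.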
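A small sketch of what combining the two styles can look like. The markup is invented for this example; select_one(), find_all(), find() and get_text() are regular Beautiful Soup methods.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div class="story">
  <p class="title"><b>The Dormouse's story</b></p>
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) instead of a list.
story = soup.select_one("div.story")

# Continue from the selected tag with ordinary Beautiful Soup calls.
links = {a.get_text(): a["href"] for a in story.find_all("a", class_="sister")}
title = story.find("p", class_="title").get_text()

print(title)
print(links)
```

The selector narrows the search to one part of the page, and the API calls then work only inside that part.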

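Modifying the tree

Beautiful Soup's main strength is searching the parse tree, but you can also modify the tree and write your changes out as a new HTML or XML document.

Changing tag names and attributes

You can rename a tag, change the values of its attributes, add new attributes, and delete attributes.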

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','html.parser')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
print(tag)

del tag['class']
del tag['id']
print(tag)

# <blockquote class="verybold" id="1">Extremely bold</blockquote>
# <blockquote>Extremely bold</blockquote>
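A tag's attributes behave much like a dictionary, which the following sketch tries to show. The attribute values here are arbitrary examples.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["class"] = ["very", "bold"]   # a list is written out space-separated
tag["id"] = "intro"
with_attrs = str(tag)

del tag["id"]
missing = tag.get("id")           # get() returns None instead of raising KeyError
remaining = dict(tag.attrs)       # .attrs holds all attributes as a dict

print(with_attrs)
print(missing, remaining)
```

Using get() is a safe way to read an attribute that may have been deleted or may never have existed.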

Modifying .string

If you set a tag's .string attribute, the tag's contents are replaced with the string you give. Be careful: if the tag contained other tags, they and all their contents are destroyed.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'html.parser')

tag = soup.a
tag.string = "New link text."
print(tag)

# <a href="http://example.com/">New link text.</a>

append()

You can add to a tag's contents with Tag.append(). It works just like calling .append() on a Python list.

soup = BeautifulSoup("<a>Foo</a>",'html.parser')
soup.a.append("Bar")

print(soup)
print(soup.a.contents)

# <a>FooBar</a>
# ['Foo', 'Bar']

BeautifulSoup.new_string() and .new_tag()

If you want to add a string to a document, just pass a Python string to append(), or call the BeautifulSoup.new_string() factory method.

soup = BeautifulSoup("<b></b>",'html.parser')
tag = soup.b
tag.append("Hello")
new_string = soup.new_string(" there")
tag.append(new_string)

print(tag)
print(tag.contents)

# <b>Hello there</b>
# ['Hello', ' there']

If you need to create a whole new tag, call the BeautifulSoup.new_tag() factory method. Only the first argument, the tag name, is required.

soup = BeautifulSoup("<b></b>",'html.parser')
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
print(original_tag)
print(new_tag)

new_tag.string = "Link text."
print(original_tag)
print(new_tag)

# <b><a href="http://www.example.com"></a></b>
# <a href="http://www.example.com"></a>
# <b><a href="http://www.example.com">Link text.</a></b>
# <a href="http://www.example.com">Link text.</a>

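append() does not copy the value. It attaches the object itself to the tree, so new_tag and the tag inside original_tag are the same object (similar to a pointer in C). That is why setting new_tag.string also changes what original_tag prints.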
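The reference behaviour can be checked directly with Python's `is` operator. This is only a small sketch and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")
new_tag = soup.new_tag("a", href="http://www.example.com")
soup.b.append(new_tag)

same_object = soup.b.a is new_tag   # the tree holds the object itself, not a copy
new_tag.string = "Link text."       # so this change is visible through the tree
rendered = str(soup.b)

print(same_object)
print(rendered)
```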

insert()

Tag.insert() is just like Tag.append(), except the new element doesn't necessarily go at the end of its parent's .contents. It is inserted at whatever numeric position you say, just like .insert() on a Python list.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'html.parser')
tag = soup.a

tag.insert(1, "but did not endorse ")
print(tag)
print(tag.contents)

# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
# ['I linked to ', 'but did not endorse ', <i>example.com</i>]

insert_before() and insert_after()

The insert_before() method inserts a tag or string immediately before something else in the parse tree.

soup = BeautifulSoup("<b>stop</b>",'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)

print(soup.b)

# <b><i>Don't</i>stop</b>

The insert_after() method inserts a tag or string so that it immediately follows something else in the parse tree.

soup = BeautifulSoup("<b>stop</b>",'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"

soup.b.string.insert_before(tag)
soup.b.i.insert_after(soup.new_string(" ever "))

print(soup.b)
print(soup.b.contents)

# <b><i>Don't</i> ever stop</b>
# [<i>Don't</i>, ' ever ', 'stop']

Here the new string is inserted immediately after the <i> tag.

clear()

Tag.clear() removes the contents of a tag.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'html.parser')
tag = soup.a

tag.clear()
print(tag)

# <a href="http://example.com/"></a>

clear() removes everything inside the tag. The tag itself and its attributes remain.

extract()

PageElement.extract() removes a tag or string from the parse tree. It returns the tag or string that was extracted.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'html.parser')
a_tag = soup.a
i_tag = soup.i.extract()

print(a_tag)
print(i_tag)

print(a_tag.parent)
print(i_tag.parent)

my_string = i_tag.string.extract()
print(my_string)
print(my_string.parent)

print(i_tag)

# <a href="http://example.com/">I linked to </a>
# <i>example.com</i>
#
# <a href="http://example.com/">I linked to </a>
# None
#
# example.com
# None
#
# <i></i>

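extract() detaches the element from the tree and returns it. The rest of the tree stays behind, and the extracted element becomes the root of its own small tree, which is why its .parent is None.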
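Because extract() hands back the element it removed, that element can be attached somewhere else. The sketch below assumes a made-up document with an empty <div> as the new home.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <div> is where the extracted tag will be moved.
markup = '<p><a href="http://example.com/">I linked to <i>example.com</i></a></p><div></div>'
soup = BeautifulSoup(markup, "html.parser")

i_tag = soup.i.extract()       # detach <i> and keep a reference to it
after_extract = str(soup.a)
detached = i_tag.parent is None

soup.div.append(i_tag)         # reattach the same element elsewhere
after_move = str(soup)

print(after_extract)
print(after_move)
```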

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'html.parser')
a_tag = soup.a
result = soup.i.decompose()  # decompose() returns None

print(a_tag)
print(result)
print(a_tag.parent)

# <a href="http://example.com/">I linked to </a>
# None
# <a href="http://example.com/">I linked to </a>

Unlike extract(), decompose() returns nothing. If you store its return value in a variable such as i_tag and then call i_tag.parent, you get "AttributeError: 'NoneType' object has no attribute 'parent'", because i_tag is None.
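A short sketch of what is left after decompose(). The variable names are illustrative; the point is that decompose() has no return value and the tag can no longer be found in the tree.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

result = soup.i.decompose()    # destroys <i>; there is nothing to return
i_after = soup.i               # looking it up again finds nothing
a_after = str(soup.a)

print(result, i_after)
print(a_after)
```

If you still need the removed element afterwards, extract() is the method to use instead.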

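replace_with()

PageElement.replace_with() removes a tag or string from the tree and replaces it with the tag or string of your choice. replace_with() returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.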

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'html.parser')
a_tag = soup.a

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)

print(a_tag)

# <a href="http://example.com/">I linked to <b>example.net</b></a>
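The return value of replace_with() is easy to overlook. The following sketch keeps the replaced element and appends it again; putting it back at the end of the same <a> tag is an arbitrary choice for the example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "example.net"
old_tag = soup.a.i.replace_with(new_tag)   # returns the <i> that was removed

removed = str(old_tag)
soup.a.append(old_tag)                     # put the old element back at the end
final = str(soup.a)

print(removed)
print(final)
```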

wrap()

PageElement.wrap() wraps an element in the tag you specify. It returns the new wrapper.

soup = BeautifulSoup("<p>I wish I was bold.</p>",'html.parser')

print(soup.p.string.wrap(soup.new_tag("b")))
print(soup.b.wrap(soup.new_tag("div")))
print(soup.p.wrap(soup.new_tag("fo")))

# <b>I wish I was bold.</b>
# <div><b>I wish I was bold.</b></div>
# <fo><p><div><b>I wish I was bold.</b></div></p></fo>

unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever is inside that tag. Like replace_with(), unwrap() returns the tag that was replaced.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'html.parser')
a_tag = soup.a

print(a_tag.i.unwrap())
print(a_tag)

# <i></i>
# <a href="http://example.com/">I linked to example.com</a>

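Note that a_tag.a.unwrap() does not unwrap the <a> tag itself. a_tag.a looks for an <a> nested inside a_tag, finds none, and returns None, so the call raises AttributeError. To unwrap the tag itself, call a_tag.unwrap().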
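A minimal sketch of unwrapping a tag itself. The surrounding <p> is added here only so the result is easy to see.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> sits inside a <p>.
markup = '<p><a href="http://example.com/">I linked to <i>example.com</i></a></p>'
soup = BeautifulSoup(markup, "html.parser")
a_tag = soup.a

nested = a_tag.a               # None: no <a> is nested inside this <a>
emptied = a_tag.unwrap()       # unwrap the <a> itself; its children move up into <p>
result = str(soup)

print(nested)
print(emptied)
print(result)
```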

Output

Pretty-printing

The prettify() method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each tag and string on its own line.

markup = '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...<a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
soup = BeautifulSoup(markup,'html.parser')

print(soup.prettify())
print(soup.a.prettify())

# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    ...
#    <a href="http://example.com/">
#     I linked to
#     <i>
#      example.com
#     </i>
#    </a>
#   </a>
#  </body>
# </html>
#
# <a href="http://example.com/">
#  ...
#  <a href="http://example.com/">
#   I linked to
#   <i>
#    example.com
#   </i>
#  </a>
# </a>

Non-pretty printing

If you just want a string with no fancy formatting, call str() (unicode() in Python 2) on a BeautifulSoup object or on a Tag inside it. You can also call encode() to get a bytestring and decode() to get Unicode.

markup = '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...<a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
soup = BeautifulSoup(markup,'html.parser')

print(str(soup))

# <html>
# <head>
# </head>
# <body>
# <a href="http://example.com/">
# ...<a href="http://example.com/">I linked to <i>example.com</i></a></a></body></html>
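The relationship between str(), decode() and encode() can be sketched as follows, using a one-line document chosen for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacr\u00e9 bleu!</p>", "html.parser")

as_text = str(soup)                # str: '<p>Sacré bleu!</p>'
as_unicode = soup.decode()         # the same text via decode()
as_bytes = soup.encode("utf-8")    # bytes: b'<p>Sacr\xc3\xa9 bleu!</p>'

print(as_text)
print(as_bytes)
```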

Output formatters

You can change how strings are written out by passing a value for the formatter argument to prettify(), encode(), or decode().

french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french, 'html.parser')

minimal

Strings are processed only enough to ensure that Beautiful Soup generates valid HTML/XML. This is the default.

print(soup.prettify(formatter="minimal"))

# <p>
#  Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
# </p>

html

Beautiful Soup converts Unicode characters to HTML entities whenever possible.

print(soup.prettify(formatter="html"))

# <p>
#  Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
# </p>

None

Beautiful Soup does not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in the examples below.

print(soup.prettify(formatter=None))

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>','html.parser')
print(link_soup.a.encode(formatter=None))

# <p>
#  Il a dit <<Sacré bleu!>>
# </p>
#
# b'<a href="http://example.com/?foo=val1&bar=val2">A link</a>'

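A function

If you pass in a function for formatter, Beautiful Soup calls that function once for every string and attribute value in the document. You can do whatever you want in this function. Here is a formatter that converts strings to uppercase and does nothing else.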

def uppercase(text):
    return text.upper()

print(soup.prettify(formatter=uppercase))

# <p>
#  IL A DIT <<SACRÉ BLEU!>>
# </p>
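A function formatter replaces the built-in escaping entirely, which is why the output above contains raw << and >>. One way to get both behaviours is to call the entity substitution yourself inside the function. The sketch below uses EntitySubstitution from bs4.dammit; the function name is made up.

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution


def uppercase_with_entities(text):
    """Uppercase a string, then escape it with HTML entities."""
    return EntitySubstitution.substitute_html(text.upper())


soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>", "html.parser")
output = soup.decode(formatter=uppercase_with_entities)

print(output)
```

The angle brackets and the accented letter should come out as entities again, this time in uppercase text.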

get_text()

If you only want the text part of a document or tag, use the get_text() method. It returns all the text in a document or beneath a tag as a single Unicode string.

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup,'html.parser')

print(soup.get_text())
print(soup.i.get_text())

# I linked to example.com
# example.com

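You can specify a string to be used to join the pieces of text together. You can also tell Beautiful Soup to strip whitespace from the beginning and end of each piece of text. Alternatively, use the .stripped_strings generator and process the text yourself.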

print(soup.get_text("|"))
print(soup.get_text("|", strip=True))
print([text for text in soup.stripped_strings])

# I linked to |example.com|
# I linked to|example.com
# ['I linked to', 'example.com']

Encodings

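Tip: install an XML parser (lxml) and try these examples with it as well.

Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you will discover it has been converted to Unicode: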

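markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
soup = BeautifulSoup(markup, 'html.parser')

print(soup.h1)
print(soup.h1.string)

# <h1>Sacré bleu!</h1>
# Sacré bleu!

It's not magic. (That sure would be nice.) Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object:

print(soup.original_encoding)

# utf-8

Note that in Python 3 the markup must be a bytes object (b"...") for this to work. If you pass a str, the text is already Unicode, so nothing is detected: .original_encoding is None and the escaped bytes show up as "SacrÃ© bleu!".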

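Unicode, Dammit guesses correctly most of the time, but sometimes it makes mistakes. Sometimes it guesses correctly, but only after a byte-by-byte search of the document that takes a very long time. If you happen to know a document's encoding ahead of time, you can avoid mistakes and delays by passing it to the BeautifulSoup constructor as from_encoding.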

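Here is a document written in ISO-8859-8. The document is so short that Unicode, Dammit cannot get a good look at it, and misidentifies it as ISO-8859-7:

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser')

print(soup.h1)
print(soup.original_encoding)

# <h1>νεμω</h1>
# ISO-8859-7

This is the result reported in the official documentation. The exact guess depends on whether a character detection library such as chardet is installed. The markup has to be a bytes object: with a str, the output is <h1>íåìù</h1> and None.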

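We can fix this by passing in the correct from_encoding:

markup = b"<h1>\xed\xe5\xec\xf9</h1>"
soup = BeautifulSoup(markup, 'html.parser', from_encoding="iso-8859-8")

print(soup.h1)
print(soup.original_encoding)

# <h1>םולש</h1>
# iso8859-8 (the exact spelling may vary between versions)

If the markup is passed as a str instead of bytes, from_encoding has no effect, and the output is <h1>íåìù</h1> and None.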
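The difference between bytes and str input can be checked with a round trip. This is a sketch: the Hebrew text is encoded to ISO-8859-8 by hand so that the expected result is known in advance.

```python
from bs4 import BeautifulSoup

text = "\u05e9\u05dc\u05d5\u05dd"                      # four Hebrew letters
raw = ("<h1>" + text + "</h1>").encode("iso-8859-8")   # bytes, as read from a file or socket

soup_bytes = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-8")
soup_str = BeautifulSoup("<h1>" + text + "</h1>", "html.parser")

decoded_ok = soup_bytes.h1.string == text
bytes_encoding = soup_bytes.original_encoding   # the encoding that was used
str_encoding = soup_str.original_encoding       # None: nothing had to be decoded

print(decoded_ok, bytes_encoding, str_encoding)
```

So an .original_encoding of None usually means the markup reached Beautiful Soup as a str rather than as bytes.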

Output encoding

markup = b'''
 <html>
  <head>
   <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
  </head>
  <body>
   <p>Sacr\xe9 bleu!</p>
  </body>
 </html>
'''

soup = BeautifulSoup(markup, 'html.parser')
print(soup.prettify())

# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>

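Note that the <meta> tag has been rewritten to reflect the fact that the document is now in UTF-8.

If you don't want UTF-8, you can pass an encoding to prettify(). You can also call encode() on the BeautifulSoup object or on any element in the soup.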

print(soup.prettify("latin-1"))
print(soup.p.encode("utf-8"))
print(soup.p.encode("latin-1"))

# b'<html>\n <head>\n  <meta content="text/html; charset=latin-1" http-equiv="Content-type"/>\n </head>\n <body>\n  <p>\n   Sacr\xe9 bleu!\n  </p>\n </body>\n</html>\n'
# b'<p>Sacr\xc3\xa9 bleu!</p>'
# b'<p>Sacr\xe9 bleu!</p>'

Any characters that cannot be represented in your chosen encoding are converted into numeric XML entity references.

markup = u"<b>\N{SNOWMAN}</b>"
snowman_soup = BeautifulSoup(markup, 'html.parser')
tag = snowman_soup.b

print(tag.encode("utf-8"))
print(tag.encode("latin-1"))
print(tag.encode("ascii"))

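# b'<b>\xe2\x98\x83</b>'
# b'<b>&#9731;</b>'
# b'<b>&#9731;</b>'

In Python 3, encode() returns a bytes object, so print() shows its b'...' representation. The Python 2 output shown in the official documentation (<b>☃</b>, <b>&#9731;</b>, <b>&#9731;</b>) is the same data written to the terminal.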
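To print the encoded result as text rather than as a bytes representation, decode it again with the same encoding. A small sketch:

```python
from bs4 import BeautifulSoup

markup = "<b>\N{SNOWMAN}</b>"
tag = BeautifulSoup(markup, "html.parser").b

utf8_bytes = tag.encode("utf-8")          # the snowman as three UTF-8 bytes
ascii_bytes = tag.encode("ascii")         # the snowman as a numeric reference
ascii_text = ascii_bytes.decode("ascii")  # back to str for display

print(utf8_bytes)
print(ascii_text)
```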

UnicodeDammit

You can use Unicode, Dammit without using Beautiful Soup. It is useful whenever you have data in an unknown encoding and you just want it to become Unicode:

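from bs4 import UnicodeDammit
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")

print(dammit.unicode_markup)
print(dammit.original_encoding)

# Sacré bleu!
# utf-8

The input must be bytes. Given a str, UnicodeDammit has nothing to decode: it returns the text unchanged ("SacrÃ© bleu!" for the str literal "Sacr\xc3\xa9 bleu!") and original_encoding is None.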
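The following sketch puts the bytes case and the str case side by side. Passing ["utf-8"] as a suggested encoding is a choice made here so that the result does not depend on which detection libraries are installed.

```python
from bs4 import UnicodeDammit

raw = "Sacr\u00e9 bleu!".encode("utf-8")    # bytes: b'Sacr\xc3\xa9 bleu!'

from_bytes = UnicodeDammit(raw, ["utf-8"])  # suggest UTF-8 as a candidate
from_str = UnicodeDammit("Sacr\u00e9 bleu!")

print(from_bytes.unicode_markup, from_bytes.original_encoding)
print(from_str.unicode_markup, from_str.original_encoding)
```

Only the bytes input reports an encoding. The str input is already Unicode, so there is nothing to detect.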

The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])

print(dammit.unicode_markup)
print(dammit.original_encoding)

# Sacré bleu!
# latin-1

Unicode, Dammit has two special features that Beautiful Soup doesn't use.

Smart quotes

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

from bs4 import UnicodeDammit

markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup)
print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup)

# <p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>
# <p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>

You can also convert Microsoft smart quotes to ASCII quotes:

print(UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup)

# <p>I just "love" Microsoft Word's smart quotes</p>

Hopefully you will find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:

print(UnicodeDammit(markup, ["windows-1252"]).unicode_markup)

# <p>I just “love” Microsoft Word’s smart quotes</p>

Inconsistent encodings

Sometimes a document is mostly in UTF-8 but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources. You can use UnicodeDammit.detwingle() to turn such a document into pure UTF-8. Here is a simple example:

snowmen = (u"\N{SNOWMAN}" * 3)
quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:

print(doc)
print(doc.decode("windows-1252"))

# b'\xe2\x98\x83\xe2\x98\x83\xe2\x98\x83\x93I like snowmen!\x94'
# â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

In Python 3, print(doc) shows the raw bytes. (In Python 2 the first line displays as ☃☃☃�I like snowmen!� on a UTF-8 terminal.)

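Decoding the document as UTF-8 raises a UnicodeDecodeError, and decoding it as Windows-1252 gives you gibberish. Fortunately, UnicodeDammit.detwingle() converts the string to pure UTF-8, allowing you to decode it to Unicode and display the snowmen and quotation marks at the same time: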

new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))

# ☃☃☃“I like snowmen!”

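This is the first time the snowmen actually show up.

UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, presumably), but this is the most common case.

Note that you must call UnicodeDammit.detwingle() on your data before passing it to BeautifulSoup or to the UnicodeDammit constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. If you pass it a document that contains both UTF-8 and Windows-1252, it is likely to think the whole document is Windows-1252, and the document will come out looking like â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.

UnicodeDammit.detwingle() is new in Beautiful Soup 4.1.0.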
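The order of operations described above can be sketched as follows. Wrapping the mixed bytes in a <p> tag and passing from_encoding="utf-8" are choices made for this illustration.

```python
from bs4 import BeautifulSoup, UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# A document that mixes UTF-8 (snowmen) with Windows-1252 (smart quotes).
doc = b"<p>" + snowmen.encode("utf8") + quote.encode("windows_1252") + b"</p>"

clean = UnicodeDammit.detwingle(doc)    # step 1: make the bytes pure UTF-8
soup = BeautifulSoup(clean, "html.parser", from_encoding="utf-8")  # step 2: parse
text = soup.p.get_text()

print(text)
```

Calling detwingle() first means the parser sees a single, consistent encoding.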

Troubleshooting

Error	Explanation
SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = u'[document]')	Caused by running the Python 2 version of Beautiful Soup under Python 3 without converting the code.
ImportError: No module named HTMLParser	Caused by running the Python 2 version of Beautiful Soup under Python 3.
ImportError: No module named html.parser	Caused by running the Python 3 version of Beautiful Soup under Python 2.
ImportError: No module named BeautifulSoup	Caused by running Beautiful Soup 3 code on a system that doesn't have BS3 installed, or by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.
ImportError: No module named bs4	Caused by running Beautiful Soup 4 code on a system that doesn't have BS4 installed.

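Other parser problems

If your script works on one computer but not on another, it is probably because the two computers have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See the differences between parsers for why this matters, and fix the problem by naming a specific parser library in the BeautifulSoup constructor.
HTMLParser.HTMLParseError: malformed start tag or HTMLParser.HTMLParseError: bad end tag - Caused by giving Python's built-in HTML parser a document it cannot handle. Any other HTMLParseError is probably the same problem. Solution: install lxml or html5lib.
If you cannot find a tag that you know is in the document (that is, find_all() returns [] or find() returns None), you are probably using Python's built-in HTML parser, which sometimes skips tags it does not understand. Solution: install lxml or html5lib.
Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you need to parse the document as XML.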
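The lowercasing is easy to confirm with the built-in parser. The sketch below uses invented markup. Preserving case would need the "xml" parser, which requires lxml, so that part is only mentioned in a comment and not run here.

```python
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids differences between computers.
soup = BeautifulSoup('<TAG Attr="Value"></TAG>', "html.parser")
lowered = str(soup)    # names are lowercased, attribute values are kept

print(lowered)

# With lxml installed, BeautifulSoup(markup, "xml") parses the document
# as XML instead, which keeps the original case of names.
```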

Third ...

For more details, please see the references below.

References

Beautiful Soup 4.4.0 documentation, "https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html"
Ryan Mitchell, 『파이썬으로 웹 크롤러 만들기』 (Web Scraping with Python, Korean edition), Hanbit Media (2017)