Building a Slack Bot with Python (2) - Crawling with BeautifulSoup4


@~@ 2024. 3. 4. 02:24

1. Installing the beautifulsoup4 library

For crawling with Python, I use the beautifulsoup4 library to parse the HTML. Enter the following in a terminal to install it.

>> pip install beautifulsoup4

BeautifulSoup official documentation >> https://www.crummy.com/software/BeautifulSoup/bs4/doc/


2. Installing the requests library

I'm building a bot that posts the notices from my department's homepage, so I need to crawl the department website. That means making HTTP requests, and in Python the requests library is the usual choice for that.

>> pip install requests

Requests official documentation >> https://requests.readthedocs.io/en/latest/
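To make the HTTP step concrete, here is a minimal sketch of a fetch helper. The name `fetch_html`, the optional `session` argument, and the 10-second timeout are my own assumptions for illustration, not something from the bot itself.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text (hypothetical helper).

    `session` can be a requests.Session (or anything with a compatible
    .get method); if omitted, the module-level requests.get is used.
    """
    client = session or requests
    response = client.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of checking status_code by hand.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html("https://sw.ssu.ac.kr/bbs/board.php?bo_table=notice")
```

Passing a timeout matters because requests does not time out by default, so a stalled server could hang the bot.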


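3. Parsing and crawling the website

First, I parse the notice page to get each post's number, title, and URL from the notice list.

Pressing F12 opens the browser's developer tools, which makes it easy to find the tags to parse.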
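Before pointing a selector at the live site, it can help to try it on a small HTML snippet. The sketch below uses a made-up `SAMPLE_HTML` that only imitates the structure the selector expects; the real page's markup may differ, and `extract_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the notice list structure (an assumption).
SAMPLE_HTML = """
<div id="bo_list">
  <div class="notice_list">
    <table>
      <tbody>
        <tr>
          <td class="td_subject">
            <div><a href="https://sw.ssu.ac.kr/bbs/board.php?bo_table=notice&amp;wr_id=1431"> Sample notice </a></div>
          </td>
        </tr>
      </tbody>
    </table>
  </div>
</div>
"""


def extract_links(html):
    """Return (href, title) pairs for every anchor matched by the selector."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select(".notice_list > table > tbody > tr > td > div > a")
    # get_text(strip=True) trims the whitespace around the link text.
    return [(a["href"], a.get_text(strip=True)) for a in anchors]


links = extract_links(SAMPLE_HTML)
print(links)
```

One thing to watch for: browsers add a `tbody` element to tables that lack one, while `html.parser` does not, so a selector copied from the developer tools can match nothing if the raw HTML has no `tbody`.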

import requests
from bs4 import BeautifulSoup

sw = "https://sw.ssu.ac.kr/bbs/board.php?bo_table=notice"  # notice list page

def ssusw():

    response = requests.get(sw)

    if response.status_code == 200:
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        listBoard = soup.select('.notice_list > table > tbody > tr > td > div > a')
        # Full selector copied from the browser (F12):
        # bo_list > div.notice_list > table > tbody > tr:nth-child(1) > td.td_subject > div > a

        for listA in listBoard:
            article = {}

            article['id'] = listA['href'].split("&wr_id=")[1]
            # The URL looks like https://sw.ssu.ac.kr/bbs/board.php?bo_table=notice&wr_id=1431
            # split() is used to take the part after wr_id= as the id.

            article['url'] = listA['href']
            article['title'] = listA.text.strip()  # strip() removes the surrounding whitespace.

            articleHTML = requests.get(listA['href']).text  # Fetch the post's body page.
            articleSoup = BeautifulSoup(articleHTML, 'html.parser')

            article['date'] = articleSoup.select_one('#bo_v_info > div.profile_info > div.profile_info_ct > strong.if_date').text
            # "작성일" is the page's "date written" label; keep the date part and prefix the century.
            article['date'] = "20"+article['date'].split("작성일 ")[1].split(" ")[0].strip()

            print(article)

    else:
        print(response.status_code)  # print the HTTP status code of the failed request
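The string splitting above works, but it breaks if the query parameters change order or the date label moves. Here is a slightly more defensive sketch using only the standard library. The helper names are hypothetical, and the date text shape ("작성일 24-03-04 02:24") is an assumption inferred from the splitting logic, not something I checked against the site.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def parse_wr_id(url):
    """Return the wr_id query parameter as a string, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("wr_id")
    return values[0] if values else None


def parse_written_date(text):
    """Turn text like '작성일 24-03-04 02:24' (assumed format) into '2024-03-04'."""
    # Take whatever follows the "작성일" label, then the first whitespace-separated token.
    raw = text.split("작성일")[-1].split()[0]
    # %y reads a two-digit year, so no manual "20" prefix is needed.
    return datetime.strptime(raw, "%y-%m-%d").date().isoformat()


print(parse_wr_id("https://sw.ssu.ac.kr/bbs/board.php?bo_table=notice&wr_id=1431"))
print(parse_written_date("작성일 24-03-04 02:24"))
```

If the date text has a different shape, `strptime` raises a `ValueError`, which at least makes the mismatch visible instead of silently producing a wrong date.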

Run it, and the posts get fetched just like this!!!

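4. Things to improve

Fetching (crawling?) the list in the last screenshot seems to take a long time. There aren't that many posts, yet they are printed slowly one at a time instead of all at once. A likely cause is that the loop makes one extra HTTP request per post, one after another, just to read the date. How do I improve this????? Future me, I'm counting on you.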
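One direction future me could try: since each post page is fetched independently, the requests can run in parallel threads instead of one after another. This is only a sketch of that idea, not a measured fix. `fetch_all` and its `max_workers=8` default are hypothetical, and the `fetch` argument is whatever function downloads a single URL.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL using a thread pool.

    Results come back in the same order as `urls`, because
    Executor.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Example wiring with requests (needs network access):
# import requests
# pages = fetch_all(post_urls, lambda url: requests.get(url, timeout=10).text)
```

Threads suit this case because the time is spent waiting on the network rather than on computation. Keeping `max_workers` small also avoids hammering the department server.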
