Wooguy's Study: [Studying] Python Web Crawling

Tuesday, July 4, 2017

[Studying] Python Web Crawling

Fetching a web page with Python's standard library

from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
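
Note that html.read() returns bytes, and urlopen raises an exception when the request fails. Below is a minimal sketch of a wrapper that decodes the body and returns None on failure. The name fetch_html, the 10-second timeout, and the UTF-8 fallback are my own assumptions, not something from the original notes.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, timeout=10):
    """Return the response body as text, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout and the UTF-8
    fallback are assumptions chosen for illustration.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()  # read() returns bytes, not str
            # Use the charset declared in the headers when there is one.
            charset = response.headers.get_content_charset() or "utf-8"
    except (HTTPError, URLError) as err:
        # HTTPError: the server answered with an error status (e.g. 404).
        # URLError: the server could not be reached at all.
        print("Request failed:", err)
        return None
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://pythonscraping.com/pages/page1.html") should then give the page as a str instead of bytes, as long as the site is reachable.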

Installing BeautifulSoup

python setup.py install

Getting a tag with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
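
If the page has no h1 tag, bsObj.h1 is None, and calling a method on it raises AttributeError. Here is a small sketch that checks for this case. It parses an inline HTML string so it runs without a network connection; SAMPLE_HTML and get_tag_text are made-up names for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used so the example needs no network access.
SAMPLE_HTML = (
    "<html><head><title>A Page</title></head>"
    "<body><h1>An Interesting Title</h1></body></html>"
)


def get_tag_text(html, tag_name):
    """Return the text of the first tag_name element, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.find("h1") returns the same element as soup.h1.
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return tag.get_text()


print(get_tag_text(SAMPLE_HTML, "h1"))
print(get_tag_text(SAMPLE_HTML, "h2"))
```

The second call asks for a tag that does not exist in the sample, so the helper returns None instead of raising an exception.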

Getting only the tags that match a condition with BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
htmlString = html.read()
bsObj = BeautifulSoup(htmlString, "html.parser")
nameList = bsObj.findAll("span", {"class":"green"})
for name in nameList:
    print(name.get_text())
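
The same kind of filtering can be tried offline on a small HTML string. This is only a sketch: SAMPLE_HTML is a made-up snippet (not the contents of the real page), texts_by_class is a hypothetical helper, and it uses find_all, the current bs4 spelling of findAll.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not the real warandpeace.html content.
SAMPLE_HTML = """
<p><span class="red">Hello,</span> said
<span class="green">Alice</span> to
<span class="green">Bob</span>.</p>
"""


def texts_by_class(html, tag_name, class_name):
    """Return the text of every tag_name element that has class_name."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all(tag_name, {"class": class_name})
    return [tag.get_text() for tag in matches]


for text in texts_by_class(SAMPLE_HTML, "span", "green"):
    print(text)
```

Only the spans with class "green" are collected; the "red" span is skipped because its class does not match.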

Tesseract OCR
https://github.com/tesseract-ocr/tesseract/wiki
