from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
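The call above raises an exception if the page is missing or the server cannot be reached. As a minimal sketch, the request can be wrapped in a small helper; the name `fetch` and the 10-second timeout are assumptions made for illustration, not part of the original snippet.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch` and its `timeout` default are hypothetical, chosen for this sketch.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", err.code)
    except URLError as err:
        # The resource could not be reached (bad host name, no network, ...).
        print("URL error:", err.reason)
    return None
```

Calling `fetch("http://pythonscraping.com/pages/page1.html")` would then return the page bytes on success and `None` on failure, so the caller can check the result before parsing it.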
Installing BeautifulSoup
python setup.py install
Fetching a tag with BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
Fetching elements that match specific conditions with BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
htmlString = html.read()
bsObj = BeautifulSoup(htmlString, "html.parser")

# Collect every <span> whose class attribute is "green".
nameList = bsObj.find_all("span", {"class": "green"})
for name in nameList:
    print(name.get_text())
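To try the same kind of filter without a network connection, here is a small sketch that parses an inline HTML string instead of a downloaded page. The markup in `SAMPLE_HTML` is invented for illustration and is not the content of the real page; `find_all` is the current name of the method older code calls `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Sample</h1>
  <p><span class="red">a quoted line</span> said
     <span class="green">Anna</span> to
     <span class="green">Prince Vasili</span>.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Tag name plus attribute filter: only <span> tags whose class is "green".
names = [span.get_text() for span in soup.find_all("span", {"class": "green"})]
print(names)

# Attribute access returns the first matching tag, or None if there is none.
print(soup.h1.get_text())
print(soup.h2)
```

Because a missing tag comes back as `None`, it is worth checking the result before calling `get_text()` on it.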
Tesseract OCR
https://github.com/tesseract-ocr/tesseract/wiki