How to get a list of links from a Wikipedia page

Python

Last updated at 2016-08-21 / Posted at 2016-08-21

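I am currently studying web scraping with Python. As part of that, this script takes a Wikipedia page and extracts the links contained in the article. The sample in the book I am following appeared to be written for English pages, so I tweaked it slightly to work with Japanese Wikipedia.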

Execution environment

OS: OS X El Capitan (10.11.5)
Python: 3.5.1

# coding: utf-8

import re
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import unquote

# Japanese Wikipedia article to scan (the title is percent-encoded in the URL)
url = "https://ja.wikipedia.org/wiki/%E3%83%86%E3%82%A4%E3%83%AB%E3%82%BA_%E3%82%AA%E3%83%96_%E3%82%A4%E3%83%8E%E3%82%BB%E3%83%B3%E3%82%B9"

html = urlopen(url)
bsObj = BeautifulSoup(html, 'html.parser')

# Match internal article links: paths that start with /wiki/ and contain no colon
pattern = re.compile("^(/wiki/)((?!:).)*$")

# Search only the article body and print each link path in decoded, readable form
for link in bsObj.find('div', {'id': 'bodyContent'}).find_all('a', href=pattern):
    if 'href' in link.attrs:
        print(unquote(link.attrs['href']))
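
To see what the pattern and unquote are doing without hitting the network, here is a minimal sketch that applies the same regular expression to a handful of href strings. The sample hrefs and the helper name article_links are made up for illustration and are not taken from the article or from any real page.

```python
import re
from urllib.parse import unquote

# Same pattern as in the script above: starts with /wiki/ and contains no colon.
pattern = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical href values, invented for this example.
sample_hrefs = [
    "/wiki/%E6%97%A5%E6%9C%AC",              # article link, percent-encoded
    "/wiki/Category:Python",                 # namespace link, rejected because of ':'
    "/w/index.php?title=Python",             # not under /wiki/
    "https://en.wikipedia.org/wiki/Python",  # absolute URL, does not start with /wiki/
]


def article_links(hrefs):
    """Keep only the hrefs that match the pattern, decoded for readability."""
    return [unquote(h) for h in hrefs if pattern.match(h)]


print(article_links(sample_hrefs))
# → ['/wiki/日本']
```

The negative lookahead (?!:) is what filters out namespace pages, and unquote is what makes the script suitable for Japanese Wikipedia, since article titles there are percent-encoded in the href.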


URL: https://qiita.com/tadaken3/items/e09ba2ede988bbacb303
