Musings of a forgetful functor: Web scraping with Python - the dark side of data

Tuesday, December 27, 2011

Web scraping with Python - the dark side of data

While searching for information on web scrapers, I found a great presentation given at PyCon 2010 by Asheesh Laroia. I thought it might be a valuable resource for R users who are looking for ways to gather data from user-unfriendly websites. The presentation can be found here:

http://python.mirocommunity.org/video/1616/pycon-2010-scrape-the-web-stra

Highlights (at least from my perspective)

Screen scraping is not about regular expressions. Pattern matching is too brittle for these tasks: tags can change regularly, so regex-based scrapers carry a significant maintenance burden.
BeautifulSoup is the go-to HTML parser for poor-quality source. I have used it in the past and am pleased to hear that I was not too far off the mark!
Configuration of User-Agent settings is discussed in detail, as are other mechanisms that websites use to stop you from scraping content.
Good description of how to use the Live HTTP Headers add-on for Firefox.
A thought-provoking discussion about APIs, with comments suggesting that their maintenance and support are woefully inadequate. I was interested to hear his views, as they imply that scraping may be the only alternative when you really need data that is otherwise inaccessible.
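
To make the first and third highlights concrete, here is a minimal sketch. It is not taken from the talk. To stay dependency-free it uses the standard library's html.parser in place of BeautifulSoup, and the markup, URL, and User-Agent string are all made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request

# Hypothetical messy markup: mixed quoting, upper-case tags, unclosed elements.
HTML = """
<p>Results
<A HREF='/page/1' class=link>One</A>
<a class="link" href="/page/2">Two</a>
<a href=page3.html>Three
"""

# A naive pattern that assumes lower-case tags, href as the first
# attribute, and double quotes. It matches none of the links above.
NAIVE = re.compile(r'<a href="([^"]+)"')


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found by the parser, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def build_request(url, user_agent):
    """Build (but do not send) a request with an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})
```

The parser copes with attribute order, quoting style, and tag case, which is exactly where the hand-written pattern falls over. A request built with build_request can be passed to urllib.request.urlopen when you are ready to fetch the page.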

Other notes

The mechanize package features heavily in the examples in this presentation. The following link provides some good examples of how to use mechanize to automate forms:
http://wwwsearch.sourceforge.net/mechanize/forms.html
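
As a rough sketch of what form automation with mechanize can look like (not an example from the talk or from the page linked above), the helper below opens a page, selects the first form, fills in some fields, and submits it. The URL and the field name "q" are hypothetical. The browser is passed in as an argument so the helper can be exercised without network access.

```python
def submit_form(browser, url, fields, form_index=0):
    """Open url, fill the chosen form with fields, submit, and return the body.

    browser is expected to behave like mechanize.Browser: open(),
    select_form(nr=...), item assignment for form controls, and submit().
    """
    browser.open(url)
    browser.select_form(nr=form_index)
    for name, value in fields.items():
        browser[name] = value
    response = browser.submit()
    return response.read()


def main():
    # Imported here so the helper above can be loaded without mechanize installed.
    import mechanize

    browser = mechanize.Browser()
    # Hypothetical target page and field name, for illustration only.
    body = submit_form(browser, "http://example.com/search", {"q": "web scraping"})
    print(body[:200])
```

Calling main() runs the sketch against a real mechanize.Browser; swap in the address and control names of the form you actually need to drive.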

There was also some mention of how JavaScript causes problems for web scrapers, although this problem can be overcome by using web drivers such as Selenium (see http://pypi.python.org/pypi/selenium) and Watir. I have used safari-watir before, and in my experience it can perform many complex data-gathering tasks with relative ease.
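
For pages that only produce their content after JavaScript has run, a web driver can load the page in a real browser and hand back the rendered HTML. The following is a minimal sketch using Selenium's Python bindings, assuming Firefox and its driver are available locally. The function name is my own, and a real scraper would usually also wait for specific elements to appear before reading the page.

```python
def fetch_rendered_html(url, driver=None):
    """Load url in a browser and return the HTML after scripts have run.

    If no driver is supplied, a Firefox WebDriver is started and shut
    down again once the page source has been read.
    """
    owns_driver = driver is None
    if owns_driver:
        # Imported lazily so this module loads without Selenium installed.
        from selenium import webdriver

        driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Only close a browser that this function started itself.
        if owns_driver:
            driver.quit()
```

The returned string can then be handed to an HTML parser such as BeautifulSoup in the usual way.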

Please feel free to post your comments about your experiences with screen scraping, and other tools that you use to collect web data for R.

Comments:

Anonymous, December 30, 2011 at 4:35 AM
Very interesting information, will have to find time to go through it all and watch that video. Thanks for posting this.
