Data

60 mins

A Gentle Introduction to Scraping Data

School of Data
Abstract:
What happens when the data you need for your advocacy or social development project is a jumbled mess? No download button, no CSV file, no structured dataset. You can see the information you want, locked in PDF reports, in social networks (like Twitter and Instagram), or even in webpages, but you can't really do anything with it. No more! We will show you how to quickly and easily extract information from these unstructured sources into useful datasets. You will get data you thought was out of reach, giving your projects a new level of refinement and relevance.
About this course:
This course is designed for human rights activists and journalists who would like to use data to support their advocacy work or to tell stories. You will learn the fundamental concepts of scraping and discover how to use free and easy-to-use tools to scrape data from web pages (using Google Sheets and a web browser extension called Web Scraper), social networks like Twitter and Instagram (using a web service called IFTTT) and PDF files (using both a web service called ABBYY FineReader Online and Tabula, a free application made by journalists that you can download to your computer).
What do I learn:
By the end of the course you will have a basic understanding of what scraping is and be able to perform basic scraping routines on web pages, social networks and PDF files. You will be able to get data from places not traditionally available to people without programming skills. This will broaden your data collection efforts and give more juice to your advocacy, journalistic or social development projects.
What do I need to know:
This course is suitable for anyone who has completed School of Data's Data Analysis & Data Gathering courses. It requires some familiarity with basic data concepts, such as types of data and how a dataset is organised. You will need an internet connection and a computer, and you will be asked to create accounts with a few web services, such as Google Sheets, Twitter, Instagram and IFTTT. You don't need any coding skills, special technical skills or advanced knowledge of spreadsheets.

Trainers

Marco Túlio Pires

Marco Túlio Pires is Google News Lab’s Lead for Brazil and Latin America. He was previously the School of Data’s Programme Manager and has worked at the intersection of computer science, journalism and education. Marco has helped newsrooms and students in multiple countries become more data literate.

1.1 Introduction to the course
1.2 What is scraping?
1.3 My first scraper: Me!
1.4 Quiz
2.1 Introduction
2.2 Using ABBYY FineReader Online to extract data from PDFs
2.3 Using Tabula to extract tabular data from PDFs
2.4 My second scraper: unlocking PDF files!
2.5 Quiz
3.1 Introduction
3.2 Scraping Twitter data using IFTTT
3.3 My third scraper: Twitter & Instagram
4.1 Introduction
4.2 A brief introduction to HTML
4.3 Using the web inspector
4.4 Quiz
5.1 Introduction
5.2 Using Google Sheets formulas to scrape data
5.3 Scraping data from webpages using Google Sheets
5.4 Scraping Wikipedia
6.1 Introduction
6.2 Scraping data from webpages using Web Scraper
6.3 Web scraping
7.1 Wrap up video

Related courses

  • 90 mins

    Data

    Cleaning and Analysing Data

    School of Data

  • 60 mins

    Data

    Data Gathering for Beginners

    School of Data

Suggested reading

  • Blog

    Defending Online Freedom: Three Organisations in Action

    Internet freedom has been threatened globally due to the rise of censorship, internet shutdowns, and surveillance. Such actions have significantly impacted individuals' ability to access information, express themselves freely, and communicate with others online. In this blog, we will discuss three such organisations. Each of these organisations shared with us insights learned from their experiences as part of our new shutdown academy courses.

  • Blog

    How The Economist uses IODA to report on Internet shutdowns

    This is a case study from our course ‘Detecting Internet Shutdowns with IODA’, in our Internet Shutdown Academy, which features 10 courses in seven languages taught by experts from leading organisations. It is designed to educate activists, journalists, and anyone impacted by internet disruptions and online censorship.

  • Blog

    Relaunching Stronger: Discover Exciting Updates on Advocacy Assembly!

    We're excited to announce the launch of our new website!

  • Blog

    Internet Shutdown Mentored Training Program

    Advocacy Assembly presents the Shutdown mentored training program, a six-week online initiative that features international experts and provides participants with the knowledge, skills, and resources necessary to prepare better for shutdowns and build an Internet shutdown advocacy campaign.

  • Blog

    The human cost of internet shutdowns

    In the age of technology, the internet has become a crucial aspect of daily life for millions of people around the world. From online shopping to social media and communication, the internet has changed the way we interact with one another and access information. However, internet shutdowns are increasingly becoming a common occurrence in many countries, with potentially serious consequences for citizens and their rights.

  • Blog

    Case study: Experiencing a shutdown in Cuba during protests

    In July 2021, Cuba saw the largest protests in more than 100 years taking place throughout the country. Cubans flooded the streets to demand better access to food, water, medicine, and COVID-19 vaccines, calling for government reforms. The first protest took place in a small town outside Havana called San Antonio de los Baños. The unrest was live-streamed on Facebook and had a domino effect throughout the country.
