This is the internet!¶
Note
Code for this session in: https://gitlab.fabcloud.org/barcelonaworkshops/code-club/tree/2020/01_python_basics and https://gitlab.fabcloud.org/barcelonaworkshops/code-club/tree/2020/02_python_internet
Basics¶
Warning
If you know this, you know it all!
Some basic Python structures:
- functions
- classes
- lists
- dicts
- tuples
Functions¶
A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.
# This is how you define a function
def function(arg1, arg2):
    '''
    Some nice documentation about what your function does
    '''
    if arg1 == 1:
        print(arg2)
    return arg1 + arg2

# This is how you call it, with named arguments
function(arg1=1, arg2=2)
# Alternatively, you can call it positionally
function(1, 2)
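And since the function returns a value, we can keep the result around:
result = function(1, 2)  # prints 2, because arg1 == 1
print(result)            # prints 3, the value returned by the function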
Class¶
Python is an object-oriented programming (OOP) language. Almost everything in Python is an object, with its properties and methods. A class is like an object constructor, or a blueprint for creating objects.
Easy book example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def sayname(self):
        print("Hello my name is " + self.name)

person_1 = Person("John", 36)
person_1.sayname()
Another example:
class Furniture:
    def __init__(self, type, legs):
        self.type = type
        self.legs = legs

    def define(self):
        # legs is an int, so it needs str() before concatenating
        print("Hello I am a " + self.type + " with " + str(self.legs) + " legs")

piece_1 = Furniture("table", 4)
piece_1.define()
piece_2 = Furniture("chair", 3)
piece_2.define()
piece_3 = Furniture("stool", 4)
piece_3.define()
Organising information¶
Three main structures to organise information: list, dict and tuple:
List¶
List is a collection which is ordered and changeable. Allows duplicate members.
students = ["Andrew", "Antonio", "Manolito"]
print(students)
print(students[0])
Dictionary¶
A dictionary is a collection which is changeable and indexed by keys (since Python 3.7, dictionaries also preserve insertion order). In Python dictionaries are written with curly brackets, and they have keys and values.
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(car)
print(car["model"])
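Because dictionaries are changeable, we can overwrite a value or add a new entry just by assigning to a key:
car["year"] = 2020    # overwrites the existing value
car["color"] = "red"  # a new key is created on assignment
print(car)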
Tuples¶
A tuple is a collection which is ordered and unchangeable. In Python tuples are written with round brackets.
fruits = ("apple", "banana", "cherry")
print(fruits)
print(fruits[0])
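To see "changeable" versus "unchangeable" in practice: assigning to a list item works, while assigning to a tuple item raises an error.
students = ["Andrew", "Antonio", "Manolito"]
students[0] = "Andreu"  # fine, lists are mutable

fruits = ("apple", "banana", "cherry")
fruits[0] = "pear"      # TypeError: 'tuple' object does not support item assignment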
Getting around¶
Some tricks for getting help:
help()¶
help(list)
type()¶
Returns the type of an object.
a = list()
type(a)
dir()¶
If called without an argument, return the names in the current scope. Else, return an alphabetized list of names comprising (some of) the attributes of the given object, and of attributes reachable from it.
a = list()
dir(a)
Interacting with the internet¶
With Python we can connect to the internet and do many things: we can post and download files, make requests…
What is all this?
Check this out: How web works
- HTTP Requests: requests allows sending HTTP/1.1 requests, without the need for manual labor
- BeautifulSoup: Beautiful Soup is a package for parsing HTML and XML documents. It is perfect for analysing the HTML content returned by an HTTP request (a small sketch of both together follows the parsing note below).
What is parsing?
Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.
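As a minimal sketch of the two libraries together (example.com is just a stand-in page, not part of the session code):
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse the HTML string into a navigable tree
resp = requests.get('https://example.com')
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title.text)  # the content of the page's <title> tag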
Before we start
Let’s take a selfie!
And also, install some requirements: pip install requests bs4 imageio
Learning with examples¶
We have 3 (three!!!) examples for you today. We do not expect to cover all of them, but they are all super-trendy:
- BBC News web scraping
- Making a gif based on movie titles
- Coronavirus twitter map
BBC News web scraping¶
One way to get content from the internet is directly scraping the code from a web page via HTTP requests. This is called WEB SCRAPING.
Definition
Web scraping is a technique used to retrieve data from websites with a direct HTTP request. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Read before you start -> BeautifulSoup quickstart.
Planning the application¶
Code an application to get BBC News Headlines and show them in an interface:
- How do we request the data?
- How do we parse it?
- Where do we store the data?
- How often do we request the information?
- How do we show it?
Requesting the data¶
For this, we will use the requests library:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return the
    raw content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    This function just prints errors, but you can make it do anything.
    """
    print(e)

simple_get('http://www.bbc.com/news')
Parsing the content¶
For this, we use the result from the previous point:
from bs4 import BeautifulSoup
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
Storing it¶
We will use a list. Lists in Python are used to store data in a sequential way. They can be defined as:
>>> a = list()
>>> b = []
>>> print(type(a), type(b))
<class 'list'> <class 'list'>
>>> c = [1, 2, 3, 4, 'hello', 'goodbye']
>>> print(c)
[1, 2, 3, 4, 'hello', 'goodbye']
In the example we iterate over the items in the html and put the text field (p.text) in the bbcnews list:
bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')

for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

# Drop <h3> items that are not actual headlines
for item in ["BBC World News TV", "News daily newsletter", "Mobile app",
             "Get in touch", "BBC World Service Radio"]:
    if item in bbcnews:
        bbcnews.remove(item)
Refreshing the information¶
In the example, the function Refresher is called every 2 seconds:
import tkinter

def Refresher(frame=None):
    print('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame)  # refresh in 2 seconds

Refresher()
And displaying it¶
Using tkinter, we make a window with a yellow background and show one random item from the list on each refresh:
def Draw(oldframe=None):
    frame = tkinter.Frame(window, width=1000, height=600, relief='solid', bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews) - 1)])
    label.pack()  # pack() makes the widget visible in the window
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()  # or .destroy() for cleanup
    return frame

window = tkinter.Tk()
w = '1200'
h = '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')  # to tell the root window apart from the Frame
window.resizable(False, False)
# rename the title of the window
window.title("BBC Live News")
Refresher()
window.mainloop()
Putting it all together¶
Get the file here:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import tkinter
from random import randint

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return the
    raw content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    This function just prints errors, but you can
    make it do anything.
    """
    print(e)

def Draw(oldframe=None):
    frame = tkinter.Frame(window, width=1000, height=600, relief='solid', bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews) - 1)])
    label.pack()  # pack() makes the widget visible in the window
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()  # or .destroy() for cleanup
    return frame

def Refresher(frame=None):
    print('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame)  # refresh in 2 seconds

bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')

for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

# Drop <h3> items that are not actual headlines
for item in ["BBC World News TV", "News daily newsletter", "Mobile app",
             "Get in touch", "BBC World Service Radio"]:
    if item in bbcnews:
        bbcnews.remove(item)

window = tkinter.Tk()
w = '1200'
h = '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')  # to tell the root window apart from the Frame
window.resizable(False, False)
# rename the title of the window
window.title("BBC Live News")
Refresher()
window.mainloop()
API requests¶
Sometimes websites are not very happy when they are scraped. For instance, IMDB says in their terms and conditions:
- Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
For this, other means of interacting with online content are provided in the form of an API:
- A Web API is an application programming interface for either a web server or a web browser. It is a web development concept, usually limited to a web application’s client-side (including any web frameworks being used), and thus usually does not include web server or browser implementation details such as SAPIs or APIs unless publicly accessible by a remote web application.
We can connect to an API directly by its endpoints:
- Endpoints are important aspects of interacting with server-side web APIs, as they specify where resources lie that can be accessed by third party software. Usually the access is via a URI to which HTTP requests are posted, and from which the response is thus expected.
An example of an open API is the SmartCitizen API:
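For instance, a minimal sketch of hitting one of its endpoints; the path below is an assumption based on the SmartCitizen docs (https://developer.smartcitizen.me), so verify it before relying on it:
import requests

# Assumed endpoint: GET /v0/devices should list devices on the platform
resp = requests.get('https://api.smartcitizen.me/v0/devices')
print(resp.status_code)  # 200 if the request succeeded
print(resp.json()[0])    # first item of the JSON response, already parsed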
Data Format
The data is generally available in JSON format. JSON packs data between curly brackets {}:
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
With Python, we can make requests to APIs via the requests library and store the data in the structures we saw before, such as a dict.
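A small sketch of that last step, using a shortened version of the JSON above:
import json

# json.loads turns the JSON text into nested Python dicts
raw = '{"glossary": {"title": "example glossary"}}'
data = json.loads(raw)
print(data['glossary']['title'])  # -> example glossary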
Planning the app¶
We’ll make an application that gets a word, looks for all the movies in OMDB that contain that word in the title, and makes a gif animation with the posters of those movies. For example, with the word Laboratory we want this:
To plan for this:
- We need to find the right API (in this case OMDB) and understand how the data is stored
- Request the data
- Explore the received data and store it
- Make use of the data
- Download images and make the gif
Exploring the API data¶
First, in some cases, we will need an API key to access the data. Normally, we would like to store the key in a secret file (in this case, a .env file):
import os
import re
from os import getcwd
from os.path import join

with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']
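For this to work, the .env file holds one KEY=value pair per line; apikey is the name the script looks up, and the value below is just the placeholder key used later in this page:
apikey=2a31115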
In this example, we’ll have a look at the API’s data from OMDB.
Basic API Request Structure
The way we request data from an API has the following format:
- Base URL: http://www.omdbapi.com/
- Query: starts with ?, followed by parameter=value pairs. The available parameters can be found in the API documentation. Several pairs are separated by &. An example: http://www.omdbapi.com/?s=jose&plot=full&apikey=2a31115 (see also the sketch after this list, which lets requests build the query for us).
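Instead of concatenating strings by hand, the requests library can build the query from a dict; the values here are the same placeholders as in the example above:
import requests

# requests encodes the params dict into the query string for us
resp = requests.get('http://www.omdbapi.com/',
                    params={'s': 'jose', 'plot': 'full', 'apikey': '2a31115'})
print(resp.url)  # http://www.omdbapi.com/?s=jose&plot=full&apikey=2a31115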
Requesting data¶
Using the same library as before, we make the GET request to the API:
import requests, json

title = 'peter'
baseurl = "http://omdbapi.com/?s="  # only submitting the title parameter
API_KEY = 'XXXXXXX'  # your own OMDB API key

def make_request(search):
    response = requests.get(baseurl + search + "&apikey=" + API_KEY)
    movies = {}
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")
    return movies

movies = make_request(title)
Exploring the data¶
The data from the API is returned in a dict:
print(movies.keys())
dict_keys(['Search', 'totalResults', 'Response'])
And the values by:
print(movies['Search'])
[{'Title': 'Peter Pan', 'Year': '1953', 'imdbID': 'tt0046183', 'Type': 'movie',...]
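Since each result is itself a dict, we can loop over them and pick out fields:
# Print title and year for every movie in the search results
for movie in movies['Search']:
    print(movie['Title'], '(' + movie['Year'] + ')')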
Making use of it¶
import urllib.request

def get_poster(_title, _link):
    try:
        print('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:
        # e.g. the movie has no poster ('N/A') or the download failed
        return ''

# `movie` here is one entry of movies['Search']
file_name = get_poster(movie['Title'], movie['Poster'])
The poster link (movie['Poster']) looks something like:
https://m.media-amazon.com/images/M/MV5BMzIwMzUyYTUtMjQ3My00NDc3LWIyZjQtOGUzNDJmNTFlNWUxXkEyXkFqcGdeQXVyMjA0MDQ0Mjc@._V1_SX300.jpg
Make the gif¶
We will use imageio:
import imageio
for filename in list_movies:
images.append(imageio.imread(filename))
imageio.mimsave(join(getcwd(), args.title + '.gif'), images)
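If the gif plays too fast, mimsave also accepts an optional per-frame duration; this tweak is not in the original script, and in imageio v2 the value is given in seconds:
# Optional: hold each poster for half a second
imageio.mimsave(join(getcwd(), args.title + '.gif'), images, duration=0.5)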
Putting it all together! file here:
import os
from os import getcwd
from os.path import join
import requests, json
import urllib.request
import argparse
import re
import imageio

baseurl = "http://omdbapi.com/?s="  # only submitting the title parameter

with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']

def make_request(search):
    # Build the search URL, e.g. http://omdbapi.com/?s=peter&apikey=123456
    url_search = baseurl + search + "&apikey=" + API_KEY
    response = requests.get(url_search)
    movies = dict()
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")
    return movies

def get_poster(_title, _link):
    try:
        print('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:
        return ''

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--title", "-t", help="Movie title query")
    args = parser.parse_args()

    movies = make_request(args.title)
    list_movies = list()
    images = []

    if movies:
        for movie in movies['Search']:
            print(movie['Title'])
            print(movie['Poster'])
            file_name = get_poster(movie['Title'], movie['Poster'])
            if file_name != '':
                list_movies.append(file_name)

    for filename in list_movies:
        images.append(imageio.imread(filename))

    imageio.mimsave(join(getcwd(), args.title + '.gif'), images)
The result!
$ python api_request.py -t "Peter"
...