This is the internet!¶
Note
Code for this session in: https://gitlab.fabcloud.org/barcelonaworkshops/code-club/tree/2020/01_python_basics and https://gitlab.fabcloud.org/barcelonaworkshops/code-club/tree/2020/02_python_internet
Basics¶
Warning
If you know this, you know it all!
Some basic Python structures:
- functions
- classes
- lists
- dicts
- tuples
Functions¶
A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A function can return data as a result.
# This is how you define a function
def function(arg1, arg2):
    '''
    Some nice documentation about what your function does
    '''
    if arg1 == 1:
        print(arg2)
    return arg1 + arg2

# This is how you call it, with named arguments
function(arg1=1, arg2=2)
# Alternatively, you can call it positionally
function(1, 2)
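And since the function returns a value, we can keep the result around:
result = function(1, 2)  # prints 2, because arg1 == 1
print(result)            # prints 3, the value returned by the function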
Class¶
Python is an object-oriented programming (OOP) language. Almost everything in Python is an object, with its properties and methods. A class is like an object constructor, or a blueprint for creating objects.
Easy book example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def sayname(self):
        print("Hello my name is " + self.name)

person_1 = Person("John", 36)
person_1.sayname()
Another example:
class Furniture:
    def __init__(self, type, legs):
        self.type = type
        self.legs = legs

    def define(self):
        # legs is an int, so it needs str() before concatenating
        print("Hello I am a " + self.type + " with " + str(self.legs) + " legs")

piece_1 = Furniture("table", 4)
piece_1.define()
piece_2 = Furniture("chair", 3)
piece_2.define()
piece_3 = Furniture("stool", 4)
piece_3.define()
Organising information¶
Three main structures to organise information: list, dict and tuple:
List¶
List is a collection which is ordered and changeable. Allows duplicate members.
students = ["Andrew", "Antonio", "Manolito"]
print(students)
print(students[0])
Dictionary¶
A dictionary is a collection which is changeable and indexed by keys (since Python 3.7, dictionaries also preserve insertion order). In Python dictionaries are written with curly brackets, and they have keys and values.
car = {
"brand": "Ford",
"model": "Mustang",
"year": 1964
}
print(car)
print(car["model"])
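Because dictionaries are changeable, we can overwrite a value or add a new entry just by assigning to a key:
car["year"] = 2020    # overwrites the existing value
car["color"] = "red"  # a new key is created on assignment
print(car)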
Tuples¶
A tuple is a collection which is ordered and unchangeable. In Python tuples are written with round brackets.
fruits = ("apple", "banana", "cherry")
print(fruits)
print(fruits[0])
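To see "changeable" versus "unchangeable" in practice: assigning to a list item works, while assigning to a tuple item raises an error.
students = ["Andrew", "Antonio", "Manolito"]
students[0] = "Andreu"  # fine, lists are mutable

fruits = ("apple", "banana", "cherry")
fruits[0] = "pear"      # TypeError: 'tuple' object does not support item assignment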
Getting around¶
Some tricks for getting help:
help()¶
help(list)
type()¶
Returns the type of an object.
a = list()
type(a)
dir()¶
If called without an argument, return the names in the current scope. Else, return an alphabetized list of names comprising (some of) the attributes of the given object, and of attributes reachable from it.
a = list()
dir(a)
Interacting with the internet¶
With Python we can connect to the internet and do many things: we can post and download files, make requests…
What is all this?
Check this out: How web works
- HTTP Requests: requests allows sending HTTP/1.1 requests, without the need for manual labor
- BeautifulSoup: Beautiful Soup is a package for parsing HTML and XML documents. It is perfect for analysing the HTML content returned by an HTTP request (a small sketch of both together follows the parsing note below).
What is parsing?
Parsing, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.
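As a minimal sketch of the two libraries together (example.com is just a stand-in page, not part of the session code):
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse the HTML string into a navigable tree
resp = requests.get('https://example.com')
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title.text)  # the content of the page's <title> tag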
Before we start
Let’s take a selfie!
And also, install some requirements: pip install requests bs4 imageio
Learning with examples¶
We have 3 (three!!!) examples for you today. We do not expect to cover all of them, but they are all super-trendy:
- BBC News web scraping
- Making a gif based on movie titles
- Coronavirus twitter map
BBC News web scraping¶
One way to get content from the internet is directly scraping the code from a web page via HTTP requests. This is called WEB SCRAPING.
Definition
Web scraping is a technique used to retrieve data from websites with a direct HTTP request. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Read before you start -> BeautifulSoup quickstart.
Planning the application¶
Code an application to get BBC News Headlines and show them in an interface:
- How do we request the data?
- How do we parse it?
- Where do we store the data?
- How often do we request the information?
- How do we show it?
Requesting the data¶
For this, we will use the requests library:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return the
    raw content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    This function just prints errors, but you can make it do anything.
    """
    print(e)

simple_get('http://www.bbc.com/news')
Parsing the content¶
For this, we use the result from the previous point:
from bs4 import BeautifulSoup
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')
Storing it¶
We will use a list. Lists in Python are used to store data in a sequential way. They can be defined as:
>>> a = list()
>>> b = []
>>> print(type(a), type(b))
<class 'list'> <class 'list'>
>>> c = [1, 2, 3, 4, 'hello', 'goodbye']
>>> print(c)
[1, 2, 3, 4, 'hello', 'goodbye']
In the example we iterate over the items in the html and put the text field (p.text) in the bbcnews list:
bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')

for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

# Drop <h3> items that are not actual headlines
for item in ["BBC World News TV", "News daily newsletter", "Mobile app",
             "Get in touch", "BBC World Service Radio"]:
    if item in bbcnews:
        bbcnews.remove(item)
Refreshing the information¶
In the example, the function Refresher is called every 2 seconds:
import tkinter

def Refresher(frame=None):
    print('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame)  # refresh in 2 seconds

Refresher()
And displaying it¶
Using tkinter, we make a window with a yellow background and show one random item from the list on each refresh:
def Draw(oldframe=None):
    frame = tkinter.Frame(window, width=1000, height=600, relief='solid', bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews) - 1)])
    label.pack()  # pack() makes the widget visible in the window
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()  # or .destroy() for cleanup
    return frame

window = tkinter.Tk()
w = '1200'
h = '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')  # to tell the root window apart from the Frame
window.resizable(False, False)
# rename the title of the window
window.title("BBC Live News")
Refresher()
window.mainloop()
Putting it all together¶
Get the file here:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import tkinter
from random import randint

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of the response is some kind of HTML/XML, return the
    raw content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    """
    This function just prints errors, but you can
    make it do anything.
    """
    print(e)

def Draw(oldframe=None):
    frame = tkinter.Frame(window, width=1000, height=600, relief='solid', bd=0)
    label = tkinter.Label(frame, bg="yellow", fg="black", pady=350,
                          font=("Times New Roman", 20),
                          text=bbcnews[randint(0, len(bbcnews) - 1)])
    label.pack()  # pack() makes the widget visible in the window
    frame.pack()
    if oldframe is not None:
        oldframe.pack_forget()  # or .destroy() for cleanup
    return frame

def Refresher(frame=None):
    print('refreshing')
    frame = Draw(frame)
    frame.after(2000, Refresher, frame)  # refresh in 2 seconds

bbcnews = []
raw_html = simple_get('https://www.bbc.com/news')
html = BeautifulSoup(raw_html, 'html.parser')

for p in html.select('h3'):
    if p.text not in bbcnews:
        bbcnews.append(p.text)

# Drop <h3> items that are not actual headlines
for item in ["BBC World News TV", "News daily newsletter", "Mobile app",
             "Get in touch", "BBC World Service Radio"]:
    if item in bbcnews:
        bbcnews.remove(item)

window = tkinter.Tk()
w = '1200'
h = '800'
window.geometry('{}x{}'.format(w, h))
window.configure(bg='yellow')  # to tell the root window apart from the Frame
window.resizable(False, False)
# rename the title of the window
window.title("BBC Live News")
Refresher()
window.mainloop()
API requests¶
Sometimes websites are not very happy when they are scraped. For instance, IMDB says in their terms and conditions:
- Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
For this, other means of interacting with online content are provided in the form of an API:
- A Web API is an application programming interface for either a web server or a web browser. It is a web development concept, usually limited to a web application’s client-side (including any web frameworks being used), and thus usually does not include web server or browser implementation details such as SAPIs or APIs unless publicly accessible by a remote web application.
We can connect to an API directly by its endpoints:
- Endpoints are important aspects of interacting with server-side web APIs, as they specify where resources lie that can be accessed by third party software. Usually the access is via a URI to which HTTP requests are posted, and from which the response is thus expected.
An example of an open API is the SmartCitizen API:
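For instance, a minimal sketch of hitting one of its endpoints; the path below is an assumption based on the SmartCitizen docs (https://developer.smartcitizen.me), so verify it before relying on it:
import requests

# Assumed endpoint: GET /v0/devices should list devices on the platform
resp = requests.get('https://api.smartcitizen.me/v0/devices')
print(resp.status_code)  # 200 if the request succeeded
print(resp.json()[0])    # first item of the JSON response, already parsed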
Data Format
The data is generally available in JSON format. JSON packs data between curly brackets {}:
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
With Python, we can make requests to APIs via the requests library and store the data in the structures we saw before, such as a dict.
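A small sketch of that last step, using a shortened version of the JSON above:
import json

# json.loads turns the JSON text into nested Python dicts
raw = '{"glossary": {"title": "example glossary"}}'
data = json.loads(raw)
print(data['glossary']['title'])  # -> example glossary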
Planning the app¶
We’ll make an application that gets a word, looks for all the movies in OMDB that contain that word in the title, and makes a gif animation with the posters of those movies. For example, with the word Laboratory we want this:
To plan for this:
- We need to find the right API (in this case OMDB) and understand how the data is stored
- Request the data
- Explore the received data and store it
- Make use of the data
- Download images and make the gif
Exploring the API data¶
First, in some cases, we will need an API key to access the data. Normally, we would like to store the key in a secret file (in this case, a .env file):
import os
import re
from os import getcwd
from os.path import join

with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']
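For this to work, the .env file holds one KEY=value pair per line; apikey is the name the script looks up, and the value below is just the placeholder key used later in this page:
apikey=2a31115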
In this example, we’ll have a look at the API’s data from OMDB.
Basic API Request Structure
The way we request data from an API has the following format:
- Base URL: http://www.omdbapi.com/
- Query: starts with ?, followed by parameter=value pairs. The available parameters can be found in the API documentation. Several pairs are separated by &. An example: http://www.omdbapi.com/?s=jose&plot=full&apikey=2a31115 (see also the sketch after this list, which lets requests build the query for us).
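Instead of concatenating strings by hand, the requests library can build the query from a dict; the values here are the same placeholders as in the example above:
import requests

# requests encodes the params dict into the query string for us
resp = requests.get('http://www.omdbapi.com/',
                    params={'s': 'jose', 'plot': 'full', 'apikey': '2a31115'})
print(resp.url)  # http://www.omdbapi.com/?s=jose&plot=full&apikey=2a31115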
Requesting data¶
Using the same library as before, we make the GET request to the API:
import requests, json

title = 'peter'
baseurl = "http://omdbapi.com/?s="  # only submitting the title parameter
API_KEY = 'XXXXXXX'  # your own OMDB API key

def make_request(search):
    response = requests.get(baseurl + search + "&apikey=" + API_KEY)
    movies = {}
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")
    return movies

movies = make_request(title)
Exploring the data¶
The data from the API is returned in a dict:
print(movies.keys())
dict_keys(['Search', 'totalResults', 'Response'])
And the values by:
print(movies['Search'])
[{'Title': 'Peter Pan', 'Year': '1953', 'imdbID': 'tt0046183', 'Type': 'movie',...]
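Since each result is itself a dict, we can loop over them and pick out fields:
# Print title and year for every movie in the search results
for movie in movies['Search']:
    print(movie['Title'], '(' + movie['Year'] + ')')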
Making use of it¶
import urllib.request

def get_poster(_title, _link):
    try:
        print('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:
        # e.g. the movie has no poster ('N/A') or the download failed
        return ''

# `movie` here is one entry of movies['Search']
file_name = get_poster(movie['Title'], movie['Poster'])
The poster link (movie['Poster']) looks something like:
https://m.media-amazon.com/images/M/MV5BMzIwMzUyYTUtMjQ3My00NDc3LWIyZjQtOGUzNDJmNTFlNWUxXkEyXkFqcGdeQXVyMjA0MDQ0Mjc@._V1_SX300.jpg
Make the gif¶
We will use imageio:
import imageio
for filename in list_movies:
images.append(imageio.imread(filename))
imageio.mimsave(join(getcwd(), args.title + '.gif'), images)
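If the gif plays too fast, mimsave also accepts an optional per-frame duration; this tweak is not in the original script, and in imageio v2 the value is given in seconds:
# Optional: hold each poster for half a second
imageio.mimsave(join(getcwd(), args.title + '.gif'), images, duration=0.5)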
Putting it all together! file here:
import os
from os import getcwd
from os.path import join
import requests, json
import urllib.request
import argparse
import re
import imageio

baseurl = "http://omdbapi.com/?s="  # only submitting the title parameter

with open(join(getcwd(), '.env')) as environment:
    for var in environment:
        key = var.split('=')
        os.environ[key[0]] = re.sub('\n', '', key[1])

API_KEY = os.environ['apikey']

def make_request(search):
    # Build the search URL, e.g. http://omdbapi.com/?s=peter&apikey=123456
    url_search = baseurl + search + "&apikey=" + API_KEY
    response = requests.get(url_search)
    movies = dict()
    if response.status_code == 200:
        movies = json.loads(response.text)
    else:
        raise ValueError("Bad request")
    return movies

def get_poster(_title, _link):
    try:
        print('Downloading poster for', _title, '...')
        _file_name = _title + ".jpg"
        urllib.request.urlretrieve(_link, _file_name)
        return _file_name
    except Exception:
        return ''

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--title", "-t", help="Movie title query")
    args = parser.parse_args()

    movies = make_request(args.title)
    list_movies = list()
    images = []

    if movies:
        for movie in movies['Search']:
            print(movie['Title'])
            print(movie['Poster'])
            file_name = get_poster(movie['Title'], movie['Poster'])
            if file_name != '':
                list_movies.append(file_name)

    for filename in list_movies:
        images.append(imageio.imread(filename))

    imageio.mimsave(join(getcwd(), args.title + '.gif'), images)
The result!
$ python api_request.py -t "Peter"
...