Priyanka Dobhal walks us through Web Scraping
For our first Her Data Learns, we have Priyanka Dobhal teaching us how to web scrape. Watch our Zoom call with her as she talks us through an example, and check out the resources and references she prepared. All content from the call is listed below for reference.
To get started, Priyanka walked us through the HTML basics needed for web scraping.
What is HTML?
HTML is the language in which most websites are written. HTML is used to create pages and make them functional.
There are two important aspects to cover - tags and attributes.
HTML tags are the hidden keywords within a web page that define how your web browser must format and display the content.
Most tags must have two parts, an opening and a closing part. For example, <html> is the opening tag and </html> is the closing tag. Note that the closing tag has the same text as the opening tag, but has an additional forward-slash ( / ) character.
There are some tags that are an exception to this rule, and where a closing tag is not required. The <img> tag for showing images is one example of this.
Example - <b> content </b>
An attribute is used to define the characteristics of an HTML element and is placed inside the element's opening tag. All attributes are made up of two parts - a name and a value.
Example - <p align = "left">This is left aligned</p>
Next, she walked us through the steps of web scraping, using Google Colab for the process.
What are the steps in Web Scraping?
Find the URL that you want to scrape
Inspecting the Page
Write the code
Run the code and extract the data
Store the data in the required format
Before writing the code Priyanka explained we need to install required libraries for this process.
Python Libraries -
urllib.request
The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world — basic and digest authentication, redirections, cookies and more.
In particular, the urllib.request module contains a function called urlopen() that can be used to open a URL within a program.
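For instance, a minimal sketch of opening a page with urlopen() might look like this (the URL is just a placeholder, not the page used in the call):

from urllib.request import urlopen

# Placeholder URL, just for illustration - replace it with the page you want to scrape
url = "https://example.com"
client = urlopen(url)        # open the URL
page_html = client.read()    # read the raw HTML (as bytes)
client.close()               # close the connection
print(page_html[:200])       # peek at the first 200 bytes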
BeautifulSoup
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
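As a quick illustration (a small made-up HTML snippet, not the page from the call), Beautiful Soup can parse a string of HTML and let you search it by tag name and read attributes:

from bs4 import BeautifulSoup

# A tiny made-up snippet of HTML, just for illustration
html = '<p align="left">This is left aligned</p><a href="https://example.com">A link</a>'
soup_obj = BeautifulSoup(html, "html.parser")
print(soup_obj.p.text)          # text inside the first <p> tag
print(soup_obj.a.get("href"))   # value of the href attribute on the first <a> tag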
Pandas
Pandas is the most popular python library that is used for data analysis. Using this library, you can export the data into a file that can be used further.
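As a tiny illustration of that export step (the values here are made up purely for the example):

import pandas as pd

# Made-up values, purely to show the export step
df = pd.DataFrame({"Header": ["Genre", "No. of seasons"],
                   "Value": ["Drama", "3"]})
df.to_csv("sample_output.csv", index=False)   # export the data to a CSV file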
Now we write the code in Google Colab.
Most Common commands used:
findAll()
Pass the tag to find all the mentions.
Example, soup.findAll("a")
This would return all the <a> tags.
To get a specific attribute, call get() on each result, for example tag.get("href") for every tag returned by soup.findAll("a") (findAll() returns a list, so get() can't be called on it directly).
This would return the links from all the <a> tags.
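Put together, a short sketch (assuming soup is the BeautifulSoup object you created for the page) might look like this:

# Collect the href attribute from every <a> tag on the page
links = []
for a_tag in soup.findAll("a"):
    href = a_tag.get("href")   # returns None if the tag has no href attribute
    if href:
        links.append(href)
print(links)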
find()
Pass the tag to find the first mention only.
Example, soup.find("a")
This would return the first <a> tag.
To find specific attributes within the tag - soup.find("a").get("href")
This would return the link of the first tag
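To see the difference between find() and findAll() side by side, here is a small self-contained example (the HTML string is made up for illustration):

from bs4 import BeautifulSoup

# A small made-up HTML string to contrast find() and findAll()
html = '<a href="https://example.com/one">One</a> <a href="https://example.com/two">Two</a>'
demo_soup = BeautifulSoup(html, "html.parser")

print(demo_soup.find("a").get("href"))                   # first link only
print([a.get("href") for a in demo_soup.findAll("a")])   # every link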
Google Colab Code:
For web scraping, we'll need these libraries -
urllib.request - To open URLs
BeautifulSoup - To extract data from html files
Pandas - To perform any manipulation
xlsxwriter - To save the result in Excel
Let us check if the above-mentioned libraries are pre-installed.

!pip list

To install any library, use the syntax below -

!pip install beautifulsoup4
!pip install pandas
!pip install urllib3

Import Libraries

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
import pandas as pd
from google.colab import files

Get the html of the page and parse it
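The cells below assume a variable page_html that already holds the raw HTML of the page. A minimal sketch of how it can be fetched with urlopen (the URL here is only a placeholder, not necessarily the page Priyanka scraped):

# Placeholder URL - substitute the page you actually want to scrape
url = "https://example.com/page-to-scrape"
client = urlopen(url)        # open a connection to the page
page_html = client.read()    # read the raw HTML into page_html
client.close()               # close the connection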
Create a Beautiful Soup object from the html. This is done by passing the html to the BeautifulSoup() function. The Beautiful Soup package is used to parse the html, that is, take the raw html text and break it into Python objects.

page_soup = soup(page_html, "html.parser")
print(page_soup)

Find the outer section (tag) to capture
Note: I'm using find() since I need to extract only the first table with class = "infobox vevent".

# If there are multiple tables with the same name and you need to capture all of them, then use findAll()
table_outer = page_soup.find("table", {"class": "infobox vevent"})
print(table_outer)

To capture the individual information, identify the one common tag among them. In this case, it is the "tr" tag. Extract all the tr tags within table_outer.

tr_tags = table_outer.findAll('tr')
print(tr_tags)

Iterate over the tr tags

Trial Section - 1
Each of the tr tags is saved in tr_tags and can now be accessed individually. For example, let us try to access the Genre.

tr_tags[2].find('td').text

How do I get the header? Just replace the td with th.
tr_tags[2].find('th').text

Now that the logic is ready, put this in a loop to iterate over all the headers.

print("Headers:")
print("---------------------------------------")
i = 0
for tr in tr_tags:   # one tr record is passed into the loop at a time
    i = i + 1
    try:
        header_tags = tr.find('th')   # this is the same as tr_tags[0].find('th')
        header_tags_text = tr.find('th').text
        print(str(i) + ". " + header_tags_text)
    except:
        continue

What happens if you don't use a try and except block? Well, an error for sure. A try and except block is used for error handling. In case you have tags that don't all share the same structure, your code will return an error, so add a try and except block to handle it for you. The version below, without try and except, stops with an error as soon as it hits a row that doesn't match the expected structure.

print("Headers:")
print("---------------------------------------")
i = 0
for tr in tr_tags:
    i = i + 1
    header_tags = tr.find('th')
    header_tags_text = tr.find('th').text
    print(str(i) + ". " + header_tags_text)

Trial Section - 2
Seems like we are getting the results as needed. Let us club both sections into one iterator.

print("Headers: text")
print("---------------------------------------")
print("---------------------------------------")
i = 0
for tr in tr_tags:
    i = i + 1
    # print('--------')
    try:
        header_tags = tr.find('th')
        header_tags_text = tr.find('th').text
        row_tags_text = tr.find('td').text
        print(str(i) + ". " + header_tags_text)
        print(row_tags_text)
    except:
        continue
print("---------------------------------------\n")

Arrange the output
Now we need to hold this result so that we can save it to a file. Create empty lists to store the values -
header
value
order
header_list = []
value_list = []
order_list = []
print(header_list) #It is empty at present
We see that some of the tr tags have multiple values inside the td, which would be hard to store.
So we found an alternative: use the ul tag to get all of the li text information.
How can we get the text from a list item? The same way again :)
tr_tags[2].find('li').text
Awesome! So we have our logic ready. Let's club it all together -

print("Headers: text")
print("---------------------------------------")
print("---------------------------------------")
i = 0
for tr in tr_tags:
    i = i + 1   # using i to store the order of the tr tags
    try:
        row_tags_text = tr.find('td').text
        header_tags = tr.find('th')
        header_tags_text = tr.find('th').text
        # print(header_tags_text)
        if tr.ul:   # This checks if the tr tag has a <ul> tag
            row_li_tags = tr.findAll('li')
            for li in row_li_tags:   # Check each li item within the row
                row_tags_text = li.text
                print(str(i) + ". " + header_tags_text + " | " + row_tags_text)   # Just formatted the print statement to display it in an organised way
                order_list.append(i)                    # Append the order value to the list
                header_list.append(header_tags_text)   # Append the header value to the list
                value_list.append(row_tags_text)       # Append the tag value to the list
        else:   # This runs if the tr tag doesn't have a <ul> tag
            row_tags_text = tr.find('td').text
            print(str(i) + ". " + header_tags_text + " | " + row_tags_text)
            header_list.append(header_tags_text)
            value_list.append(row_tags_text)
            order_list.append(i)
    except:   # We ignore rows that don't have a td value, for example, Production from the website
        continue   # If it doesn't have a td then ignore the tr tag record and move to the next one
    print("\n")

Store the values in a dataframe
You now have your data stored in three lists. For your final output they need to be combined, and here is where pandas comes into play. The line below combines the lists and gives each column a header.

df = pd.DataFrame(zip(order_list, header_list, value_list), columns=['S.no', 'Header', 'Value'])
print(df[df.columns[1:3]])   # to print specific columns

The final step is here!! To download this to a csv file, run the two lines below with your path. Don't forget to use \\ in your path.

df.to_csv('C:\\your path\\Sample_csv.csv')

files.download("C:\\your path\\Sample_csv.csv")
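The library list above mentions xlsxwriter, while the walkthrough saves the result as a CSV. If you would rather save to Excel, a minimal sketch (the file name is just an example) could look like this:

# Save the same DataFrame to an Excel file using the xlsxwriter engine
with pd.ExcelWriter("Sample_excel.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Scraped data", index=False)

files.download("Sample_excel.xlsx")   # download the file from Colab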
Priyanka made the examples she walked through during the call available for everyone, including her Google Colab code. She also provided additional resources for getting started with Google Colab.
References:
Google Colab Resources: