Web scraping tutorial for beginners

Shenghong Zhong
17 min read · Apr 26, 2021

Since I’m a foreigner living in the UK, I can tell you that speaking English with a good accent isn’t enough. A language carries lots of sub-cultures under the hood. I’d say I never truly understood Britain until I came here to study, eat and hang out with people.

When in Rome, Do as the Romans Do

Ways of learning cultures

I drink English breakfast tea, eat English breakfasts, listen to Grime and Drill, and went on pub crawls before lockdown. Interestingly, I reckon small talk is a big part of British culture.

“Watching movies with you” is the phrase my mate often uses to answer my question about what to do in the evening. Another good source of entertaining content is local friends, so I asked my British friends on Instagram.

Asking British friends what British TV shows to watch

I waited happily for my friends’ recommendations, picturing myself binge-watching through Friday nights. Unfortunately, I didn’t get many answers or much attention from my friends. Well, they were probably just too busy. Yes, that must be it.

One of the things I learned from my previous job is to think about what YOU can OFFER, rather than keep ASKING people for what you WANT. So why not scrape a website with tons of information about movies and TV programmes? Then I can be VERY popular in small talk.

People won’t go silent when I say “I’m a data scientist” at upcoming house parties in June (trust me, many think I’m a nerd). And I hope they’ll agree with the saying that “data scientist is the sexiest job of the 21st century” once I convince them what to watch.

Data science is sexy!

A disclaimer before beginning: many websites restrict or outright bar scraping of data from their pages. Users may be subject to legal ramifications depending on where and how they attempt to scrape information. Many sites have a devoted page noting their restrictions on data scraping at **www.[site].com/robots.txt**. Be extremely careful with sites that house user data: places like Facebook, LinkedIn, even Craigslist do not take kindly to data being scraped from their pages. When in doubt, please contact the site’s team.

Popular TV shows listed on TMDB (themoviedb.org)

data source: TMDB website

Web scraping at themoviedb.org

The Movie Database (TMDB) is a community-built database of movies and television shows. Every piece of data has been added by the community since 2008. Users can search for topics they’re interested in and discover new favourites by browsing a large amount of data. They can also contribute to the TMDB community by reviewing and scoring shows for the benefit of everyone. In short, TMDB is an excellent website for someone like me who wants to practise web scraping.

For this project, we will retrieve information from the ‘Popular TV Shows’ page using _web scraping_: the process of extracting information from a website programmatically. Web scraping isn’t magic; in fact, you may already grab information by hand on a daily basis. For example, a recent graduate might copy and paste information about companies they applied to into a spreadsheet to manage their job applications.

The project goal is to build a web scraper that extracts all the desired information and assembles it into a single CSV file. The format of the output CSV file is shown below:

Web Scraping

We’re going to use the Requests, Beautiful Soup and Pandas libraries.

Before we dive into the web scraping itself, I’d love to talk about where it all began: the World Wide Web.

World Wide Web

Before we explain what the Requests library is, we have to ask why it exists. That leads to the story of the World Wide Web. In 1989, Tim Berners-Lee proposed the concept of the World Wide Web as an open platform where users could share information easily and locate it from anywhere in the world. At the time, this let scientists continue their research together without travelling back to their home countries.

Thank you, father of the Web

Three key components

3 key components of the World Wide Web

HTML

We can break the web down into three key things. The first is HyperText Markup Language, HTML for short. It’s the standard markup language for documents designed to be displayed in web browsers. HTML presents content, much like a Word document describes paragraphs of text, images and tables of data.

URL

The second is the URL. It stands for Uniform Resource Locator, which is what you type into Chrome’s address bar every day. A URL takes you to the same page every single time, roughly like your phone number: if someone dials your number, they always reach you.

HTTP

Last but not least, HTTP is part of the web. It’s an invisible layer underneath the surface that handles the communication between your browser and a server. For example, when you log into Twitter, you type in your username and password. When you hit the “Submit” button, those details are sent to Twitter’s servers in an HTTP request. The servers then check whether the username and password are correct and send back an HTTP response.

In a nutshell, HTTP is the fundamental way that browsers communicate with web servers, which are just giant computers.

What is the Requests library?

Requests is a Python HTTP library that allows us to send HTTP requests to websites’ servers, instead of using a browser to communicate with the web.

We use pip, a package-management system, to install and manage software. Since the platform we selected is Google Colab, we have to type a line of code starting with !pip to install Requests. You will see lots of !pip lines when installing other packages too.

When we want to use prewritten functions from a library, we use the import statement. For example, after typing import requests once the installation completes, we can use any function from the Requests library.

!pip install requests --upgrade --quiet
import requests

Since we’re focused on popular TV shows, the URL we land on is https://themoviedb.org/tv. Having analyzed the URL structure, I found a trick that lets you access any specific page.

URL structure analysis

In order to download a web page, we use requests.get() to send the HTTP request to the TMDB server, and the function returns a response object, which is the HTTP response. Since a URL always leads to the same page, and given how TMDB structures its website, I assigned the variable base_url the value https://themoviedb.org and tmdb_url the value https://themoviedb.org/tv?page=5. To be explicit, I assigned the HTTP response, which contains the page contents and other information, to a variable named response.

Later on, I intend to design a function that asks the user for any page number they want and passes that input in place of the number 5. With this design, people can fetch either one page of data or X pages of data from TMDB.

Another thing is that we have to deliberately check whether the HTTP request succeeded and an HTTP response came back, since we’re not using a browser, where the feedback is visible straight away.

In general, the way to check whether the server sent an HTTP response back is the status code. In the Requests library, requests.get returns a response object, which contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

If the request was successful, response.status_code is set to a value between 200 and 299.

So we can write a helper function that downloads the page for whatever page number we pass in.
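Here is a minimal sketch of that helper, combining the pieces above (the BeautifulSoup line is explained in the next section):

import requests
from bs4 import BeautifulSoup

base_url = "https://themoviedb.org"

def get_page(page_number):
    # Send the HTTP request for one page of popular TV shows
    response = requests.get(base_url + "/tv?page={}".format(page_number))
    # A status code outside 200-299 means the request failed
    if not 200 <= response.status_code <= 299:
        raise Exception("Failed to load page {}: status code {}".format(page_number, response.status_code))
    # Parse the HTML source code into a tree of components
    return BeautifulSoup(response.text, "html.parser")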

Parse the HTML source code using the Beautiful Soup library

What is Beautiful Soup?

You might wonder what BeautifulSoup(response.text) is as you look at each line of code in my last helper function, get_page(). It was a hint for this section.

Beautiful Soup is a Python package for parsing HTML and XML documents. Simply put, it enables us to get structured data out of a long sequence of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML, making it a handy tool for web scraping. You can read more in the official documentation.

To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let’s install the library and import the BeautifulSoup class from the bs4 module.
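In Colab, that mirrors the earlier !pip pattern:

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup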

Inspecting the HTML source code of a web page

Inspecting the HTML source code

HTML basics

Before we dive into how to inspect HTML, we should cover some basic knowledge of HTML.

In the late 1980s, the British scientist Tim Berners-Lee invented HTML, which stands for HyperText Markup Language, while working at the CERN laboratory in Switzerland. He didn’t want web pages to be just regular text files. To increase the efficiency of communication and liberate people’s creativity, he gave authors the ability to define the role of each part of their text. Hence, the content displayed on web pages is written in HTML.

Tim wanted authors to be able to tell whoever was viewing the site that a particular part of the text was a heading and another part was a paragraph. The solution he designed was to surround the text with things called ‘tags’: a heading starts at one place and finishes at another.

In the Beautiful Soup library, we can specify html.parser to ask Python to read the components of the page, instead of reading it as one long string.

We can use the <title> tag as an example to demonstrate what a ‘tag’ is in HTML.

an example of a tag in HTML
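To make this concrete, here is a tiny sketch; the HTML string is a made-up example rather than TMDB’s real source code:

from bs4 import BeautifulSoup

# A miniature HTML document as one long string
html = "<html><head><title>Popular TV Shows</title></head></html>"

# html.parser turns the string into a tree of components
doc = BeautifulSoup(html, "html.parser")
print(doc.title)       # <title>Popular TV Shows</title>
print(doc.title.text)  # Popular TV Shows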

Common tags and attributes in HTML

common HTML tags

There are around 100 types of HTML tags, but on a day-to-day basis around 15 to 20 of them cover most use, such as the <div>, <p>, <section>, <img> and <a> tags.

An HTML tag comprises three parts:

  1. Name: (html, head, body, div, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
  2. Attributes: (href, target, class, id, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
  3. Children: A tag can contain some text or other tags or both between the opening and closing segments, e.g., <div>Some content</div>.

Of the many tags, I want to highlight the <a> tag, which can carry attributes such as href (hyperlink reference) that let users click through to another page. That’s why the <a> tag is named ‘anchor’.

Each tag supports several attributes. The following are some common attributes used to modify the behaviour of tags; a short sketch after this list shows them in action:

  • id
  • style
  • class
  • href (used with <a>)
  • src (used with <img>)
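Here is a quick sketch tying the three parts and the attributes together; the tag below is hypothetical, loosely modelled on the listing page:

from bs4 import BeautifulSoup

snippet = '<a href="/tv/1396" title="Breaking Bad">Breaking Bad</a>'
a_tag = BeautifulSoup(snippet, "html.parser").find("a")

print(a_tag.name)      # name: a
print(a_tag["href"])   # attribute: /tv/1396
print(a_tag["title"])  # attribute: Breaking Bad
print(a_tag.text)      # children: Breaking Bad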

Building the scraper components

In this section, we start building the components our scraper needs to extract movie titles, release dates and detail-page URLs. As mentioned earlier, the outcome we want is a CSV file that looks as follows:

Inspecting HTML in the Browser

To view the source code of any webpage right within your browser, right-click anywhere on the page and select the “Inspect” option. This opens the “Developer Tools” pane, where you see the source code as a tree. You can expand and collapse the nodes and find the source code for any specific portion of the page.

Inspect HTML structure by the developer tool

As shown in the photo above, I hovered over one of the TV programs to see how its content is presented. I found that the data on each page is held within a <div> tag with the attribute class="page_wrapper", whose child is another <div> tag with class="content". That’s a good sign. We will not need to know every attribute of every tag to extract our information, but it is helpful to analyze the structure of the HTML source code.

As I write this, I’m reminded of an episode of my favourite podcast, Darknet Diaries, about a people hacker named Jenny, who is paid to break into buildings and test their security, sometimes even paid to walk out with an important notebook. The first thing she does is analyze the building’s structure and plan her strategy around it. I made a similar plan to “break” into this “house” to get the data. Admittedly, her job is harder.

Now that I’ve pulled a single page and turned it into a BeautifulSoup object, we can start using functions from the Beautiful Soup library to extract the pieces of information we want.

Looking at the page, there should be 20 TV programs per page. Therefore, each function I write to extract a piece of information should yield 20 items. If an output gives fewer than 20, something went wrong, and I should go back to the page itself to debug the code.

1. Movie titles

As noted above, each TV program is nested under a <div class="content"> tag. Looking into the details, I could see that movie titles are listed under <a> tags with the attribute title="[THE MOVIE TITLE]", and the text of each <a> tag is the same as the value of its title attribute.

https://jovian.ml/shenghongzhong/210424-project001-web-scraping-tmbd/v/29&cellId=54
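Here is a sketch of what that cell might look like; the selectors are assumptions based on the structure described above:

def get_titles(doc):
    # Each title lives in an <a> tag whose title attribute
    # matches its text, inside a <div class="content"> block
    titles = []
    for div in doc.find_all("div", class_="content"):
        a_tag = div.find("a")
        if a_tag is not None and a_tag.get("title"):
            titles.append(a_tag["title"])
    return titles

doc = get_page(5)
print(len(get_titles(doc)))  # should print 20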

2. Release date

The release date data actually lives in the same block, inside a <p> tag.

There is a small issue with the date that you might not notice. Our output is a CSV file, and CSV stands for comma-separated values: a delimited text file that uses a comma to separate values, with each line of the file being one data record.

A displayed date like “Apr 26, 2021” contains a comma, so we would mess up the file if we didn’t think of ways to clean the data. Fortunately, a module called datetime can help us. I also wrote an if/else statement to handle TV programs with NO release date, i.e. no value to be found.
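Here is a minimal sketch of convert_date(), assuming TMDB displays dates in the “Apr 26, 2021” format:

from datetime import datetime

def convert_date(raw_date):
    # Some TV programs have no release date at all
    if not raw_date:
        return ""
    # "Apr 26, 2021" contains a comma that would break our CSV,
    # so convert it to the comma-free format 2021-04-26
    return datetime.strptime(raw_date, "%b %d, %Y").strftime("%Y-%m-%d")

print(convert_date("Apr 26, 2021"))  # 2021-04-26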

3. User score

The user score is located in a <span> tag. Initially, I tried getting the <div> tag with the attribute data-percent="79.0"; you can find the value of a tag’s attribute with tag["attribute"]. But I decided to go with the <span> tag simply because it makes my job easier.
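For example (the <div> below is a hypothetical stand-in for the real score element):

from bs4 import BeautifulSoup

snippet = '<div class="user_score_chart" data-percent="79.0"></div>'
div = BeautifulSoup(snippet, "html.parser").find("div")
print(div["data-percent"])  # 79.0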

4. Image link

Images were a bit trickier. Since they are housed in a different type of tag under the <div class="page_wrapper">, we have to handle them separately. So we create a variable a_tags_for_imgs to capture all the <a> tags containing images.

As said earlier, our intention is to get the URL of each image. Having examined the tag structure, we find that the value of the src attribute gives us the image, but we have to concatenate it with our base_url, which is https://themoviedb.org.

If we get 20 as the total number of image tags, it means we’ve successfully captured the information.

On the other hand, we have to handle the situation where the site has no image for a TV program. I wrote if/else statements to cover it. Now, let’s wrap this up into a function.
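Here is a sketch of that function, built on the assumptions above:

def parse_image(a_tag):
    # Parse one <a> tag from a_tags_for_imgs into a dictionary
    img_tag = a_tag.find("img")
    if img_tag is not None and img_tag.get("src"):
        # Concatenate the relative src value with base_url
        image_url = base_url + img_tag["src"]
    else:
        # The site has no image for this TV program
        image_url = ""
    return {"image_url": image_url}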

5. Detail page

For the detail page, I tried many methods before figuring out how to get the URL. It turns out each TV program has a unique identifier, and each identifier lives within an <a> tag. That makes sense once you see how the page functions: visitors usually click on an image to find out more.

The href attribute stands for ‘hyperlink reference’ and usually comes with <a> tags.
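So building the detail-page URL takes one line; the href value below is a hypothetical example:

from bs4 import BeautifulSoup

snippet = '<a href="/tv/1396-breaking-bad">Breaking Bad</a>'
a_tag = BeautifulSoup(snippet, "html.parser").find("a")
print(base_url + a_tag["href"])  # https://themoviedb.org/tv/1396-breaking-bad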

Summary

In this section, we learned how to extract information from an HTML document using the Beautiful Soup library.

  • We have learned HTML basics
  • We have analyzed the HTML structure and contents
  • We have successfully extracted information about movie titles, release dates, image URLs, detail-page URLs and user scores
  • We have written some helper functions such as parse_image() and convert_date()

Let’s wrap it all up into a function.
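Here is a sketch of parse_content(); each find() call is an assumption about the structure we analyzed above:

def parse_content(content_tag):
    # Parse one <div class="content"> block into a dictionary
    a_tag = content_tag.find("a")
    p_tag = content_tag.find("p")
    span_tag = content_tag.find("span")
    return {
        "title": a_tag["title"] if a_tag else "",
        "release_date": convert_date(p_tag.text.strip()) if p_tag else "",
        "user_score": span_tag.text.strip() if span_tag else "",
        "detail_url": base_url + a_tag["href"] if a_tag else "",
    }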

Compile extracted information into Python list and dictionaries

Dictionary Concatenation

So far, we have successfully captured the information from each tag in a ‘dictionary’ data structure.

Let’s wrap this up into a helper function. Whenever we pass in content_tags and img_tags, feeding each pair through parse_content() and parse_image(), we get back a list of dictionaries.
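A sketch of that helper, assuming the two tag lists line up one-to-one:

def all_contents(content_tags, img_tags):
    # zip() pairs the i-th content block with the i-th image tag;
    # both lists should hold 20 items per page
    shows = []
    for content_tag, img_tag in zip(content_tags, img_tags):
        show = parse_content(content_tag)
        show.update(parse_image(img_tag))  # add the image_url key
        shows.append(show)
    return shows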

Write information to CSV files

Our aim is a CSV file, and we can write a simple function to produce one.
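A minimal sketch using the standard csv module; pandas.DataFrame(items).to_csv(path) would do the same job:

import csv

def write_csv(items, path):
    # items is the list of dictionaries returned by all_contents()
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=items[0].keys())
        writer.writeheader()
        writer.writerows(items)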

Summary

In this section, we have written the following functions:

  • get_page() sends the HTTP request and returns a BeautifulSoup object
  • parse_content() parses the contents on the page, such as the movie title, release date, detail-page URL and user score
  • convert_date() cleans the date data into a nice format
  • parse_image() parses one image tag into a dictionary
  • all_contents() concatenates the outputs of parse_content() and parse_image() into a single list whose elements are dictionaries
  • get_one_page() combines all_contents() and get_page(), returning a list of dictionaries
  • write_csv() and read_csv() write and read the list of dictionaries that all_contents() produces

Now every piece of the scraper has been assembled. We need to write a function that fetches a single page of data and outputs a CSV file.

One page web scraper

So far so good: we can combine all of the scraper components into a single function that gets one page of data for a specified page number.
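Assembled, get_one_page() might look like this sketch; the class names passed to find_all() are assumptions:

def get_one_page(page_number):
    # Download and parse one page, then save it as a CSV file
    doc = get_page(page_number)
    content_tags = doc.find_all("div", class_="content")
    a_tags_for_imgs = doc.find_all("a", class_="image")  # hypothetical selector
    shows = all_contents(content_tags, a_tags_for_imgs)
    write_csv(shows, "tmdb_page_{}.csv".format(page_number))
    return shows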

Extract and combine data from multiple pages

Since we can get one page of data, we can simply write another function on top of that helper. I named it ultra_scraper().
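A sketch of ultra_scraper(), stacked on the one-page helper:

def ultra_scraper(num_pages):
    # Scrape pages 1..num_pages and combine everything into one CSV
    all_shows = []
    for page_number in range(1, num_pages + 1):
        all_shows.extend(get_one_page(page_number))
    write_csv(all_shows, "tmdb_popular_tv_shows.csv")
    return all_shows

shows = ultra_scraper(20)  # 20 pages x 20 shows = 400 records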

Now we can get 20 pages of data, which is very exciting!

Future work

It was an interesting project. It reminded me of my time in my previous job as a research associate, where I was involved in an innovation project building a data pipeline from Weibo, a Twitter-like Chinese social media platform. Collecting all kinds of data from the Internet was also one of my regular activities there. I remember my boss and the CTO calculating rate limits and how quickly we could get all the data ready. The CTO often mentioned “farming”; I love that word for describing how we collect data from the digital world.

Web scraping is the first step to getting real data from the real world, and it’s absolutely exciting. Due to time constraints, I wasn’t able to do any analysis. However, good questions are better than meaningless action. I remember the day I showed my data visualization work to my boss. He messaged me back: “That’s cool, but what is the insight for our clients?”

As for future work, if any of you is interested, you can build on my code to collect more data, such as cast, crew and comments, and answer questions like these:

  • Which year has the most TV shows?
  • Which TV shows do TMDB users comment on the most? If there aren’t enough comments, we could collect them from another site like Rotten Tomatoes (https://www.rottentomatoes.com/), or from Reddit or Twitter via their APIs if necessary
  • Which factors (the number of Black actors, female actors, genres) determine whether users comment?

Some ideas about new projects

Looking for aspiring artists on foundation.app

As for my future projects, I’m interested in doing something with Bitcoin, as I’m a big fan of cryptocurrency. As I write this, NFTs (non-fungible tokens) are a hot topic. An artist called Beeple sold a digital artwork for 69 million dollars, and people started to realize this is going to be a big thing. In summary, NFTs could be the Internet of intellectual property. However, some have started to question whether NFTs actually solve the problem. What if a person simply screenshots someone’s digital artwork? Besides, it’s interesting to think about the value of an original work compared with a fake. What’s the real difference between the Mona Lisa and a fake Mona Lisa?

My opinion is quite simple and comes down to a few key questions.

  1. Who created it?

This matters because the original Mona Lisa was created by Leonardo da Vinci.

2. How long have they been active in the market?

Do you know how long Leonardo da Vinci spent painting the Mona Lisa? It took him 16 years to finish.

NFTs provide us with a new way to support artists. But wait: how on earth can I know which upcoming artists have potential?

So my idea is to scrape data from the platforms where artists hang out and sell their artwork, and use it to build profiles of those aspiring artists.

Correlation between inflation & corruption rates and trading volumes on localbitcoin.com

I went to a small gathering of bitcoiners in London two weeks ago. The weather in Hyde Park was lovely. One of the people I met was interested in my data science skills, and we’re thinking of doing a project together. It could be my next project.

It’d be interesting to know which factors drive people to trade bitcoin.

However, the first step would always be to collect data.

Frontline reports for e-commerce business

You might know Kickstarter or Indiegogo. What if we provided a service for e-commerce business owners where, whenever we find something similar to their products, they get a notification? It could help them plan their next product.

Yet, the first step is to get data!

All right, thank you for reading this far.

The complete notebook is available on Jovian.

You can follow me on Twitter @ShenghongZhong, on LinkedIn as David Zhong, or on Instagram @Davidzhongg. I also post weekly vlogs on my YouTube channel.

David Zhong business card

