Crawling 

Crawling is a technique in which a program regularly travels around websites to extract information.
Programs that crawl are called "crawlers" or "spiders."
For example, a crawler used to implement a search engine follows the links on a website, goes around its pages, collects data from them, and saves that data.

 

Scraping

Scraping refers to the technique of extracting specific information from a website. Scraping makes it easy to gather information from websites. Most information published on the web is in HTML format and has to be processed before it can be stored in a database.
To remove unnecessary content such as advertisements and keep only the information you need, you first have to analyze the website's structure, and this is where scraping comes in. In a nutshell, scraping deals not only with the data on a website but also with the structure of the website.
Many sites now also require you to log in before you can access useful information.
In this case, knowing the URL alone is not enough. To scrape such a site properly, your program must log in before it can reach the required web page and its data.

 

To start crawling, import urllib.request, which provides the functions described here.

urlretrieve(): downloads the resource at a URL directly to a local file. If you pass a relative file name, the file is saved in the current directory.
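As a minimal sketch of urlretrieve(), the example below copies a page into a file. To keep it runnable offline, it assumes a small local HTML file exposed as a file:// URL in place of a real website; the file names and the temporary directory are hypothetical, and an http(s) URL would be passed the same way.

```python
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical stand-in for a web page: a local HTML file served via file://
# so the sketch runs without a network connection.
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "source.html"
source.write_text("<html><body><h1>Hello</h1></body></html>", encoding="utf-8")
url = source.as_uri()

# urlretrieve(url, filename) copies the resource at url into filename
# and returns the file name together with the response headers.
target = work_dir / "downloaded.html"
saved_path, headers = urllib.request.urlretrieve(url, str(target))

print(saved_path)
print(target.read_text(encoding="utf-8"))
```

Passing a bare name such as "downloaded.html" instead of a full path would place the file in the current directory.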

urlopen(): opens a URL and lets you read its contents into memory instead of saving them to a file.
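A minimal sketch of urlopen() follows. As an assumption made so it can run offline, it reads a hypothetical local HTML file through a file:// URL; with a real site you would pass its http(s) URL instead.

```python
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical stand-in for a web page so the sketch needs no network access.
page = Path(tempfile.mkdtemp()) / "page.html"
page.write_text("<html><body><p>Hello</p></body></html>", encoding="utf-8")

# urlopen() returns a file-like response object; read() gives raw bytes.
with urllib.request.urlopen(page.as_uri()) as response:
    raw = response.read()

# Decode the bytes to text before parsing or printing them.
html = raw.decode("utf-8")
print(type(raw).__name__)
print(html)
```

Nothing is written to disk by urlopen() itself; the content lives only in the raw and html variables.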

Scraping with BeautifulSoup Module

Run the following command in the command prompt to check whether the BeautifulSoup module is already installed.

pip list

If you don't have the module, install it.

pip install bs4

BeautifulSoup Module functions

find(): finds an HTML tag. It returns the first matching tag in the document.

For example, find('ul') returns the first <ul> tag.
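The behavior of find() can be sketched as below. The HTML string is made up for illustration, and the built-in "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two <ul> tags, used only for illustration.
html = """
<html><body>
  <h1>Food</h1>
  <ul><li>apple</li><li>banana</li></ul>
  <ul><li>carrot</li></ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag.
first_ul = soup.find("ul")
print(first_ul.li.get_text())  # apple

# When nothing matches, find() returns None.
print(soup.find("table"))  # None
```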

findAll(): extracts all matching tags and returns them as a list. In current versions of BeautifulSoup the same method is spelled find_all().
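A small sketch of collecting every matching tag, using the find_all() spelling and a made-up HTML string:

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = """
<ul><li>apple</li><li>banana</li></ul>
<ul><li>carrot</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag in document order.
items = soup.find_all("li")
names = [li.get_text() for li in items]
print(names)  # ['apple', 'banana', 'carrot']
```

Because the result behaves like a list, you can loop over it, index it, or take its length.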

Using the class attribute: you can also extract only the tags that have a certain class.
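Filtering by class might look like the sketch below. The class names are hypothetical; the keyword is written class_ because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the class names "fruit" and "vegetable" are hypothetical.
html = """
<ul>
  <li class="fruit">apple</li>
  <li class="vegetable">carrot</li>
  <li class="fruit">banana</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only the <li> tags whose class is "fruit".
fruits = soup.find_all("li", class_="fruit")
fruit_names = [li.get_text() for li in fruits]
print(fruit_names)  # ['apple', 'banana']
```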

Using the id attribute: you can also extract the element that has a certain id.
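Looking up an element by id can be sketched the same way. The id values here are hypothetical and exist only in this made-up HTML.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the ids "header" and "content" are hypothetical.
html = """
<div id="header">Site title</div>
<div id="content"><p>First paragraph</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by id alone...
content = soup.find(id="content")
print(content.p.get_text())  # First paragraph

# ...or combine a tag name with an id.
header = soup.find("div", id="header")
print(header.get_text())  # Site title
```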
