Crawling is a technique in which a program regularly travels around websites to extract information. Programs that crawl are called "crawlers" or "spiders." For example, the crawler used to implement a search engine follows the links on a website, visits each page, collects its data, and saves it.
Scraping
Scraping refers to the technique of extracting specific information from a website. With scraping, it becomes easy to gather information from websites. Most information published on the web is in HTML format and requires processing before it can be stored in a database. You first need to analyze the website's structure to remove unnecessary information, such as advertisements, and to get only the information you need; this is where scraping comes in. In a nutshell, scraping deals not only with the data on a website but also with the website's structure. Recently, many sites require you to log in before you can access useful information. In that case, you cannot reach the information simply by knowing the URL. So, to scrape properly, you must understand that logging in may be necessary to access the required web page and its data.
To start crawling, you must import urllib.request to use its functions.
urlretrieve() : downloads the resource (the web page) directly to a file in the current directory
urlopen() : reads the resource into memory
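A minimal sketch of both functions. To keep the example runnable offline, it builds a file:// URL to a small local file; in practice the URL would be an http(s):// address.

```python
import os
import pathlib
import tempfile
import urllib.request

# Create a small local file and build a file:// URL for it,
# so the example works without a network connection.
src = os.path.join(tempfile.gettempdir(), 'source.html')
with open(src, 'w') as f:
    f.write('<html><body>hello</body></html>')
url = pathlib.Path(src).as_uri()

# urlretrieve(): download straight to a file on disk
dest = os.path.join(tempfile.gettempdir(), 'copy.html')
urllib.request.urlretrieve(url, dest)
print('downloaded:', os.path.exists(dest))

# urlopen(): read the same resource into memory instead of saving it
with urllib.request.urlopen(url) as resp:
    data = resp.read().decode('utf-8')
print(data)
```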
Scraping with BeautifulSoup Module
Run the following on the command prompt to check whether the BeautifulSoup module is already installed.
pip list
If you don't have the module, install it.
pip install bs4
BeautifulSoup Module functions
find() : finds an HTML tag. It returns the first matching tag in the document.
For example, to bring the first <ul> tag.
findAll() : extracts all matching tags as a list
Using the class attribute: you can also extract specific data with a certain class.
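A quick, self-contained illustration of these three lookups (the HTML snippet here is made up for the demo):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<ul class="menu">
  <li>home</li>
  <li>about</li>
</ul>
<ul><li>other</li></ul>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

ul = soup.find('ul')                    # first <ul> tag only
print(ul.find('li').text)               # home

items = soup.find_all('li')             # every <li> tag, as a list
print(len(items))                       # 3

menu = soup.find('ul', class_='menu')   # filter by class attribute
print([li.text for li in menu.find_all('li')])
```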
import math
# from math import factorial
# pi
print(math.pi)
# 2 x 2 x 2
print('2 x 2 x 2 =', math.pow(2, 3))
# Factorial
print('3!=',math.factorial(3))
print(math.factorial(984))
# ceil()
print(math.ceil(3.1))
# floor()
print(math.floor(3.9))
# sqrt()
print(math.sqrt(5))
import random
# random() : 0.0 ~ 1.0 random number
r1 = random.random()
print('r1=', r1)
# randint(a, b) : a ~ b random integer
r2 = random.randint(1, 10)
print('r2=', r2)
# 1 ~ 45 random number
r3 = random.randint(1, 45)
print('r3=', r3)
# choice() : choose randomly in the list
colors = ['red','orange','yellow','green','blue','navy','purple']  # avoid shadowing the built-in list
r4 = random.choice(colors)
print('r4=', r4)
Lottery program Example
import random
lot = [] # list
# lot.append(random.randint(1,45))
# lot.append(random.randint(1,45))
# print(lot)
while True:
    r = random.randint(1, 45)
    if r not in lot:
        lot.append(r)
    if len(lot) == 6:
        break
print(sorted(lot))
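As an aside, the same six unique numbers can be drawn in one call with random.sample, which removes the need for the membership check:

```python
import random

# random.sample() draws k distinct items from the population,
# so no duplicate check is required.
lot = sorted(random.sample(range(1, 46), 6))
print(lot)
```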
time
The time() function returns the number of seconds passed since the epoch. For Unix systems, the epoch (the point where time begins) is January 1, 1970, 00:00:00 at UTC.
import time
print(time.time())
#localtime()
print(time.localtime(time.time()))
print(time.asctime(time.localtime(time.time())))
print(time.ctime())
print(time.strftime('%x', time.localtime(time.time())))
print(time.strftime('%c', time.localtime(time.time())))
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
#sleep()
for i in range(10):
    print(i)
    time.sleep(2)
I will demonstrate reading this file and reversing its lines.
with open('abc.txt', 'r') as f:
    lines = f.readlines()
print(lines)  # ['AAA\n', 'BBB\n', 'CCC\n', 'DDD\n', 'EEE']
lines.reverse()
print(lines)  # ['EEE', 'DDD\n', 'CCC\n', 'BBB\n', 'AAA\n']
with open('result.txt', 'w') as f:
    for line in lines:
        line = line.strip()
        f.write(line + '\n')
To work with CSV files, you must install pandas (for example, through the Settings dialog in PyCharm).
data framing
CSV files
To make the data frame, you need to read the CSV files with pandas module.
import pandas as pd
data = [[1,2,3,4],[5,6,7,8]]
#Create dataframe
df = pd.DataFrame(data)
print(df)
# 0 1 2 3 column number
# 0 1 2 3 4 index number : 0
# 1 5 6 7 8 index number : 1
# dataframe -> csv file (Save)
df.to_csv('../data/df.csv', header=False, index=False)
print('saved successfully')
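To check that the saved CSV really contains the frame, here is a hedged round-trip sketch using an in-memory buffer in place of the file path above, so it runs anywhere:

```python
import io
import pandas as pd

# Write the same frame to an in-memory CSV buffer, then read it back.
# header=None tells read_csv the data has no header row, matching
# the header=False/index=False options used when saving.
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])
buf = io.StringIO()
df.to_csv(buf, header=False, index=False)
buf.seek(0)
df2 = pd.read_csv(buf, header=None)
print(df2)
print(df.equals(df2))  # True
```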
Excel files
To deal with Excel files, you also need pandas.
This is an excel file that includes statistics.
Install openpyxl in the settings and you will see the content in the console.
import pandas as pd
# open excel file
book = pd.read_excel('../data/stats_104102.xlsx',
sheet_name='stats_104102',
header=1)
print(book)
book = book.sort_values(by=2015, ascending=False) #descending
print(book)
XML files
To read XML files, you need the bs4 library.
First, you need to read from the URL and save the file with the .xml extension.
from bs4 import BeautifulSoup # module for analyzing html, xml files
import urllib.request as req # download
import os.path
url='http://www.kma.go.kr/weather/forecast/mid-term-rss3.jsp?stnId=108'
savename = 'forecast.xml'
# if not os.path.exists(savename):
req.urlretrieve(url, savename) # forecast.xml file download
# Analyze with BeautifulSoup module
xml = open(savename, 'r', encoding='utf-8').read()
soup = BeautifulSoup(xml, 'html.parser')
# print(soup)
# Store the nationwide weather information in the info dictionary
info = {} # info = { name : weather }
for location in soup.find_all('location'):
    name = location.find('city').text
    wf = location.find('wf').text
    tmx = location.find('tmx').text
    tmn = location.find('tmn').text
    weather = wf + ':' + tmn + '~' + tmx
    if name not in info:
        info[name] = []
    info[name].append(weather)
print(info)
#To print out
for name in info.keys():
    print('+', name)
    for weather in info[name]:
        print('|', weather)
To open a text file -> get sum and avg -> save into a new text file
with open('sample.txt', 'r') as f:
    lines = f.readlines()
print(lines)  # ['70\n', '60\n', '55\n', '75\n', '95\n', '90\n', '80\n', '80\n', '85\n', '100']
total = 0
for line in lines:
    total += int(line)
avg = total / len(lines)
print('total:', total)  # total: 790
print('avg:', avg)  # avg: 79.0
with open('result.txt', 'w') as f:
    f.write(str(avg))
To read repository from github
import urllib.request as req
import os.path
import json
# To download json file
url = 'https://api.github.com/repositories'
savename = 'repo.json'
if not os.path.exists(savename):
    req.urlretrieve(url, savename)
# To read repo.json
items = json.load(open(savename, 'r', encoding='utf-8'))
print(type(items)) # <class 'list'>
print(items)
# To print out
for item in items:
    print(item['name'] + '-' + item['owner']['login'])
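As a self-contained stand-in for the structure repo.json has (a list of repository objects, each with a 'name' and a nested 'owner' object), here is the same loop run on a tiny invented JSON string:

```python
import json

# A minimal JSON document with the same shape as the GitHub response.
text = '[{"name": "demo", "owner": {"login": "alice"}}]'
items = json.loads(text)
for item in items:
    print(item['name'] + '-' + item['owner']['login'])  # demo-alice
```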
This is the data served at that URL.
wordcount.py
To count the words in the file and organize them in descending order:
def getTextFreq(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    tmp = text.split()
    fa = {}
    for c in tmp:
        if c in fa:
            fa[c] += 1
        else:
            fa[c] = 1
    return fa
result = getTextFreq('../data/data.txt')
# result = getTextFreq('../data/alice.txt')
# result = getTextFreq('../data/hong.txt')
print(type(result)) # <class 'dict'>
print(result)
# Ascending
print(sorted(result.items()))
print(sorted(result.items(), key=lambda x : x[0]))
# Descending
print(sorted(result.items(), key=lambda x : x[0], reverse=True))
# Descending 10..9..8..
result = sorted(result.items(), key=lambda x : x[1], reverse=True)
print(result)
for c, freq in result:
    print('[%s] - [%d] time(s)' % (c, freq))
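The same frequency table can also be built with collections.Counter from the standard library, whose most_common() already returns the pairs in descending order of count (the sample sentence is invented for the demo):

```python
from collections import Counter

# Counter does the counting loop from getTextFreq() in one call.
text = 'the quick brown fox jumps over the lazy dog and the cat'
fa = Counter(text.split())
print(fa['the'])           # 3
print(fa.most_common(1))   # [('the', 3)]
```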
To count words that the user inserts:
def countWord(filename, word):
    with open(filename, 'r') as f:
        text = f.read()
    text = text.lower()  # to lowercase
    words = text.split()  # avoid shadowing the built-in list
    count = words.count(word)
    return count
word = input('Which word do you want to search?')
word = word.lower()
# result = countWord('../data/data.txt', word)
result = countWord('../data/alice.txt', word)
print('[%s]: %d time(s)'%(word, result))
A regular expression is a formal language used to represent a set of strings with specific rules. It is used to search and replace strings in programming languages, text editors, etc. Expressing certain conditions on an input string with ordinary conditional statements can be somewhat complicated; with regular expressions, it becomes very simple. The code is short, but it is hard to understand unless you are familiar with the notation, because it is not very readable. Regular expression support in Python is provided by the re module.
To understand how the code gets shorter and simpler, here is a version that does not use regular expressions.
This code changes the last seven digits of the inserted ID numbers to *.
data = """
park 800905-1049118
kim 700905-1059119
"""
result=[]
for line in data.split('\n'):  # line = "park 800905-1049118"
    word_result = []
    for word in line.split(' '):  # word = "park", then "800905-1049118"
        if len(word) == 14 and word[:6].isdigit() and word[7:].isdigit():
            word = word[:6] + '-' + '*******'  # word = "800905-*******"
        word_result.append(word)  # word_result = ["park", "800905-*******"]
    result.append(' '.join(word_result))
print('\n'.join(result))
With regular expressions, it gets way simpler.
With sub(a, b), you can easily change a to b.
import re
data = """
park 800905-1049118
kim 700905-1059119
"""
# Regular Expression
pat = re.compile(r'(\d{6})[-](\d{7})')  # raw string avoids invalid-escape warnings
print(pat.sub(r'\g<1>-*******', data))
print(pat.sub(r'******-\g<2>', data))
Meta characters
[] : a set of characters. Example: "[a-z]"
\ : a special sequence. Example: "\r"
. : any single character in that position. Example: "Ja.v."
^ : the pattern must appear at the beginning of the string. Example: "^Python"
$ : the pattern must appear at the end of the string. Example: "Meadow$"
* : zero or more occurrences of the preceding pattern. Example: "hello*"
+ : one or more occurrences of the preceding pattern. Example: "hello+"
{} : a specified number of occurrences of the preceding pattern. Example: "python{2}"
| : either one pattern or the other is present. Example: "hello|world"
() : capture and group. Example: "(agilemeadow)"
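A few of the metacharacters above in action (the patterns and sample strings here are invented for illustration):

```python
import re

print(bool(re.match(r'^Python', 'Python is fun')))   # ^ anchors at the start
print(bool(re.search(r'world$', 'hello world')))     # $ anchors at the end
print(re.findall(r'py{2}', 'pyy py pyyy'))           # {2} means exactly two y's
```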
[]
import re
# [abc] : Match if there are any matching letters.
# RE
p1 = re.compile('[abc]')
print(p1.match('a')) # match='a'
print(p1.match('before')) # match='b'
print(p1.match('dude')) # None
# [a-z]+ : one or more consecutive lowercase letters
p = re.compile('[a-z]+')
result1 = p.findall('life is too short')
print(type(result1)) # 'list'
print(result1) # ['life', 'is', 'too', 'short']
result2 = p.findall('Life is tOo shorT')
print(result2) # ['ife', 'is', 't', 'o', 'shor']
finditer()
result3 = p.finditer('life is too short')
print(type(result3)) # 'callable_iterator'
print(result3) # <callable_iterator object at 0x0000020A8F53DC48>
for r in result3:
    print(r)
result4 = p.finditer('Life is tOo shorT')
for r in result4:
    print(r)
sub()
sub() substitutes every match of the compiled pattern in a string with a replacement string.
pattern.sub(replacement, string)
import re
p = re.compile('blue|white|red')
# substitute from blue, white, red to 'gold'
print(p.sub('gold', 'blue socks and red shoes'))
print(p.sub('silver', 'blue socks and red shoes', count=1))
Example
To change the last four digits to "####"
import re
s = """
park 010-9999-9988
kim 010-9909-7789
lee 010-8789-7768
"""
pat = re.compile(r"(\d{3}[-]\d{4})[-]\d{4}")
result = pat.sub(r"\g<1>-####", s)
print(result)