Last time, we discussed the difference between crawling and scraping and looked at some examples of each.

2022.12.06 - [Python] - Python) Crawling and Scraping1

 


In this post, we will run searches with the Naver API and the Selenium library.

First, register your application on the Naver Developers site:

https://developers.naver.com/products/service-api/search/search.md

 


First, import urllib.

You need the key that was issued to you on the Naver API website.

Save the content in the result variable.

Then, save it to a text file.
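The steps above could look roughly like the sketch below. The endpoint URL, the `X-Naver-Client-Id` / `X-Naver-Client-Secret` header names, and the `display` parameter are assumptions based on the public Naver search API documentation, and the helper names `build_search_request` and `search_and_save` are hypothetical. Check the official documentation before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Naver blog search API (JSON response).
API_URL = "https://openapi.naver.com/v1/search/blog"


def build_search_request(query, client_id, client_secret, display=10):
    """Build (but do not send) a search request for the given keyword."""
    params = urllib.parse.urlencode({"query": query, "display": display})
    request = urllib.request.Request(f"{API_URL}?{params}")
    # Assumed header names that carry the issued key pair.
    request.add_header("X-Naver-Client-Id", client_id)
    request.add_header("X-Naver-Client-Secret", client_secret)
    return request


def search_and_save(request, path="result.txt"):
    """Send the request, save the raw response to a text file, return parsed JSON."""
    with urllib.request.urlopen(request) as response:
        result = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as f:
        f.write(result)
    return json.loads(result)


# Placeholder credentials: replace them with the values issued for your application.
request = build_search_request("python", "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(request.full_url)
```

Calling `search_and_save(request)` would then perform the actual network request, which only succeeds with valid credentials.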

 

You can also search for news articles and set how many articles to retrieve for a keyword.

First, import the libraries.

Crawling

Save the result to an xlsx file.

Convert the xlsx file to a txt file.

 

Now, let's use this file to make a word cloud.

Please refer to my previous post if you need some references. 

2022.12.10 - [Python] - Python) WordCloud

 

Move the saved files into the data folder and make a word cloud!

You can also make a graph reflecting the frequency of the words.

Word Cloud

Word cloud with mask


Crawling 

Crawling is a technique in which a program regularly traverses websites to extract information.
Programs that crawl are called "crawlers" or "spiders."
For example, a crawler used to implement a search engine follows the links on a website, moves from page to page, collects the data it finds, and saves it.

 

Scraping

Scraping is the technique of extracting specific information from a website. It makes gathering information from websites easy. Most information published on the web is in HTML format, so it has to be processed before it can be stored in a database.
To remove unnecessary content such as advertisements and keep only the information you need, you first have to analyze the website's structure, and this is where scraping comes in. In a nutshell, scraping deals not only with a website's data but also with its structure.
Many sites now require you to log in before you can access useful information.
In that case, knowing the URL alone is not enough. To scrape properly, you must handle the login that is needed to reach the required page and its data.

 

To start crawling, import urllib.request, which provides the following functions.

urlretrieve(): downloads a URL directly to a file (in the current directory when you pass a bare file name)

 

It downloads the web page.

urlopen(): opens a URL so that its contents can be read into memory
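A minimal sketch of the two functions follows. To keep it runnable offline, a local file addressed with a `file://` URL stands in for a web page; that substitution and the file names are assumptions made for illustration, and in practice the URL would be an http(s) address.

```python
import tempfile
import urllib.request
from pathlib import Path

# A local file stands in for a web page so the sketch runs without a network.
workdir = Path(tempfile.mkdtemp())
source = workdir / "page.html"
source.write_text("<html><body>hello</body></html>", encoding="utf-8")
url = source.as_uri()

# urlretrieve(): copy the resource straight to a file on disk.
saved_path, _headers = urllib.request.urlretrieve(url, str(workdir / "copy.html"))

# urlopen(): read the resource into memory as bytes, then decode it.
with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8")

print(saved_path)
print(html)
```

Use urlretrieve() when you only want the file on disk, and urlopen() when you want to process the content in the program right away.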

Scraping with BeautifulSoup Module

Run the following at the command prompt to check whether the BeautifulSoup module is already installed.

pip list

If you don't have the module, install it.

pip install bs4

BeautifulSoup Module functions

find(): finds the first matching HTML tag in the document.

To get the <ul> tag:

findAll(): extracts all matching tags as a list (find_all() is the current name of the same method)

Using class attribute: You can also extract specific data with certain classes.

Using id attribute:
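The four lookups above can be sketched on a small inline document. The HTML, the class names, and the id are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
html = """
<html><body>
  <h1 id="title">Fruits</h1>
  <ul>
    <li class="red">apple</li>
    <li class="yellow">banana</li>
    <li class="red">cherry</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")                      # first matching tag only
all_li = soup.find_all("li")                    # every matching tag, as a list
red_items = soup.find_all("li", class_="red")   # filter by class attribute
title = soup.find(id="title")                   # filter by id attribute

print(first_li.text)
print([li.text for li in all_li])
print([li.text for li in red_items])
print(title.text)
```

Note that the keyword is `class_` with a trailing underscore, because `class` is a reserved word in Python.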

 

 


A word cloud is a text-visualization technique used in natural language processing.

As you can see in my earlier post, a word cloud helps you understand a subject better.

2022.10.07 - [Jobs] - Junior developer job word cloud from indeed & LinkedIn

 


Configuration

1. Download JDK.

http://abit.ly/easypy_101

 


2. Install the dependency package for KoNLPy (Kkma, Okt, Komoran, Hannanum, Mecab).

pip install jpype1

3. Install the KoNLPy module.

pip install konlpy

4. Install the word cloud module. Microsoft Visual C++ (version 14 or higher) must be installed beforehand.

pip install wordcloud

Now, it is all set, and we will use various functions to mine text.

 

open(): to open files

read(): to read files

sub(): to remove characters that are not needed

As you can see, the non-essential words are filtered out.

nouns(): to extract nouns only

DataFrame(): to convert the result to a data frame

len(): to get the length of each word

To store the result of len(), I created a count variable.

To filter the words and keep only valid results

groupby(): to group data

head(): to print out top n words by frequencies

barplot(): to create a bar graph
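The counting steps above (without the plot) can be sketched with the standard library alone. This is a swapped-in approach, not the post's own code: plain word splitting replaces KoNLPy's nouns(), which needs Java, and collections.Counter replaces the pandas groupby(). The sample text and the two-letter minimum are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical sample text; in practice it would come from open() and read().
text = "Python is fun. Python is simple, and data analysis with Python is fun!"

# sub(): replace everything except letters and spaces with a space.
cleaned = re.sub(r"[^A-Za-z ]", " ", text).lower()

# Split into words (a stand-in for nouns(), which requires KoNLPy and Java).
words = cleaned.split()

# Keep words of at least two letters, mirroring the len() filter (assumed threshold).
words = [w for w in words if len(w) >= 2]

# Count occurrences and list the most frequent words first.
top_words = Counter(words).most_common(3)
print(top_words)
```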

 

To create a word cloud, first set the font.

dict(): to convert the data frame to a dictionary

Import wordcloud.

Create the word cloud!

To create masks, import PIL and numpy first.

With them, you can customize the shape and color of the cloud.


To draw scatter plot

To change colors of the dots

Graph settings

To initialize settings

Time series plot

To read SPSS Data

After installing, import pandas, numpy and seaborn.

To bring data

To see the shape and the properties of the data

To change column names

To preprocess data

To print variable type

To describe

To create histogram 

Salary difference by age group

To check age variables

Age variables and frequencies

To create graph

To classify sex groups and age groups
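A minimal sketch of classifying rows into age groups and comparing income by age group and sex follows. The column names, the sample rows, and the age thresholds (under 30, under 60) are all assumptions for illustration and do not come from the original data set.

```python
import numpy as np
import pandas as pd

# Hypothetical survey rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "age": [25, 28, 41, 45, 67, 70],
    "income": [200, 180, 320, 300, 150, 120],
})

# Derive an age-group column with nested np.where() calls (assumed thresholds).
df["age_group"] = np.where(df["age"] < 30, "young",
                           np.where(df["age"] < 60, "middle", "old"))

# Frequency of each age group.
counts = df["age_group"].value_counts()

# Mean income for every combination of age group and sex.
mean_income = (df.groupby(["age_group", "sex"], as_index=False)
                 .agg(mean_income=("income", "mean")))

print(counts)
print(mean_income)
```

The resulting `mean_income` table is in a shape that a bar or line plot function can take directly, with one row per group.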

To create graph

To create line graph with lineplot()

 


If you don't have pandas on your device, install it first.

Then import pandas with an alias; this works in Jupyter Notebook or Google Colab.

# pip install pandas
import pandas as pd

With pandas, we can make a data frame by using dictionaries. 

To extract a certain variable

To get sum and average
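As a small sketch of these three steps, the block below builds a data frame from a dictionary, extracts one variable, and computes its sum and mean. The names and scores are made up for illustration.

```python
import pandas as pd

# Hypothetical exam scores; each dictionary key becomes a column.
df = pd.DataFrame({
    "name": ["Kim", "Lee", "Park", "Choi"],
    "english": [90, 80, 60, 70],
    "math": [50, 60, 100, 20],
})

english = df["english"]          # extract one variable (a Series)
total = df["english"].sum()      # sum of the column
average = df["english"].mean()   # mean of the column

print(df)
print(total, average)
```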

To open excel files

len() : returns the number of items in an object

To open csv files

To create data frame


head() : to print out the first five rows

You can specify the number of rows to print out

To print out the last rows

To check the shape of the table

To see the properties

To get summary statistics
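The inspection calls above can be tried on a small made-up data frame. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with ten rows.
df = pd.DataFrame({
    "id": list(range(1, 11)),
    "score": list(range(10, 110, 10)),
})

print(df.head())       # first five rows by default
print(df.head(3))      # first three rows
print(df.tail(2))      # last two rows
print(df.shape)        # (rows, columns)
df.info()              # column names, non-null counts, dtypes
print(df.describe())   # count, mean, std, min, quartiles, max
```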

built-in functions

package functions

methods

query() in data analysis

with 2 conditions

To print out multiple variables

Sorting

Ascending and descending order

Derived variables

df.assign() with np.where()
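The sketch below walks through query(), column selection, sorting, and derived variables on a made-up data frame. The column names, the scores, and the pass mark of 60 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical exam scores.
df = pd.DataFrame({
    "name": ["Kim", "Lee", "Park", "Choi"],
    "class_no": [1, 1, 2, 2],
    "math": [50, 60, 100, 20],
    "english": [90, 80, 60, 70],
})

one_cond = df.query("class_no == 1")                      # one condition
two_cond = df.query("class_no == 2 and math >= 50")       # two conditions
subset = df[["name", "math"]]                             # multiple variables
by_math_asc = df.sort_values("math")                      # ascending (default)
by_math_desc = df.sort_values("math", ascending=False)    # descending

# Derived variables: a total score and a pass/fail label from np.where().
df = df.assign(
    total=df["math"] + df["english"],
    result=np.where(df["math"] >= 60, "pass", "fail"),
)
print(df)
```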

 


A module is a file that contains functions, variables, and classes. Python modules play the role that code libraries play in other languages.

Built-in functions are part of Python itself and are always available without an import. See the link below for the list of built-in functions.

2. Built-in Functions — Python 3.6.15 documentation

 


Standard library modules, on the other hand, must be imported before you use them.

Here are some examples of importing the modules. 

math

import math
# from math import factorial

# pi
print(math.pi)


# 2 x 2 x 2
print('2 x 2 x 2 =', math.pow(2, 3))


# Factorial
print('3!=',math.factorial(3))
print(math.factorial(984))


# ceil()
print(math.ceil(3.1))


# floor()
print(math.floor(3.9))


# sqrt() 
print(math.sqrt(5))

calendar

import calendar

# calendar(): return a whole year's calendar as a string
cal = calendar.calendar(2019)
print(cal)

# prcal(): print a whole year's calendar
calendar.prcal(2022)

# prmonth(): print one month's calendar
calendar.prmonth(2022, 11)

# weekday(): day of the week as a number
# Mon(0), Tue(1), Wed(2), Thu(3), Fri(4), Sat(5), Sun(6)
weekday = calendar.weekday(2022, 11, 28)
print('weekday:', weekday)

random

import random

# random(): random float in the range 0.0 <= x < 1.0
r1 = random.random()
print('r1=', r1)

# randint(a, b): random integer between a and b, both included
r2 = random.randint(1, 10)
print('r2=', r2)

# random integer from 1 to 45
r3 = random.randint(1, 45)
print('r3=', r3)

# choice(): pick one item at random from a list
# (named colors so that the built-in list type is not shadowed)
colors = ['red','orange','yellow','green','blue','navy','purple']
r4 = random.choice(colors)
print('r4=', r4)

Lottery program Example 

import random

lot = []                           # list

# lot.append(random.randint(1,45))
# lot.append(random.randint(1,45))
# print(lot)

while True:
    r = random.randint(1,45)     
    if r not in lot:              
        lot.append(r)
        if len(lot) == 6:         
            break               

print(sorted(lot))

time

The time() function returns the number of seconds that have passed since the epoch. On Unix systems, the epoch (the point where time begins) is January 1, 1970, 00:00:00 UTC.

import time

print(time.time())

#localtime()
print(time.localtime(time.time()))

print(time.asctime(time.localtime(time.time())))

print(time.ctime())

print(time.strftime('%x', time.localtime(time.time())))
print(time.strftime('%c', time.localtime(time.time())))
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))

#sleep()
for i in range(10):         
    print(i)
    time.sleep(2)

webbrowser

import webbrowser

webbrowser.open('http://www.google.com')

webbrowser.open('http://www.naver.com')

webbrowser.open_new('member.html')

Custom module example 1): mymath.py

mypi = 3.14

def area(r):
    return mypi * r * r

Importing mymath (custom module)

import mymath

print(mymath.mypi)          # 3.14

print(mymath.area(5))       # 78.5

Custom module example 2): calculator.py

def plus(a,b):
    return a+b

def minus(a,b):
    return a-b

def multiply(a,b):
    return a*b

def divide(a,b):
    return a/b

Importing calculator (custom module) 

import calculator

print(calculator.plus(10, 5))
print(calculator.minus(10, 5))
print(calculator.multiply(10, 5))
print(calculator.divide(10, 5))

When you import only some parts of a module, you can use only the imported functions or variables.

# from module import variable/function
from calculator import plus
from calculator import minus

print(plus(10, 5))
print(minus(10, 5))

# Each line below raises NameError if uncommented:
# print(calculator.plus(10, 5))   # the name 'calculator' itself was not imported
# print(multiply(10, 5))          # multiply was not imported
# print(divide(10, 5))            # divide was not imported

To import all the variables and functions

from calculator import *

print(plus(10, 5))
print(minus(10, 5))
print(multiply(10, 5))
print(divide(10, 5))

To use an alias

This is the most commonly used style in Python.

import calculator as c

print(c.plus(10,5))
print(c.minus(10,5))
print(c.multiply(10,5))
print(c.divide(10,5))

You can also install external modules and import them.

c:\> pip install numpy
c:\> pip install pandas
c:\> pip install tensorflow


import numpy as np
import pandas as pd
import tensorflow as tf

To check whether the modules are installed, go to Settings > Python Interpreter in PyCharm.

Packages in Python are directories that contain modules, much like folders that hold files.

To create a graph with matplotlib:
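To make the directory idea concrete, the sketch below builds a throwaway package on disk and imports a module from it. The package name `demo_shapes_pkg`, the module name `circle`, and the temporary location are hypothetical; a real project would simply keep the directory next to the script.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package: a directory that holds module files.
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_shapes_pkg"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("", encoding="utf-8")
(package_dir / "circle.py").write_text(
    "PI = 3.14\n\ndef area(r):\n    return PI * r * r\n", encoding="utf-8"
)

# Make the parent directory importable, then import package.module.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
circle = importlib.import_module("demo_shapes_pkg.circle")

print(circle.area(5))
```

In ordinary code the last step would be written as `from demo_shapes_pkg import circle`; importlib is used here only because the package is created at run time.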

import matplotlib.pyplot as plt

plt.plot([1,2,3,4,5,6,7,8,9,8,7,6,5,4,3,2,1])
plt.ylabel('some numbers')
plt.show()

 


To analyze data, we will use Jupyter Notebook. Please check my last post to learn about downloading Jupyter Notebook.

2022.10.27 - [Python] - Python) Configuration and Basics

 

You can also use Google Colaboratory.

 

Variables: Int

Variables: string

functions

Now we will use seaborn.

You must first install the seaborn module:

pip install seaborn

If you have already installed the module, import it before you call it.

countplot() draws a count plot like the one above.

 

To use alias

Vertical graph

Horizontal graph

If you want to know more about a function, add ? to the name of the function.

 


Here is abc.txt. 

I will demonstrate reading this file and reversing the order of its lines.

with open('abc.txt', 'r') as f:
    lines = f.readlines()
    print(lines)                # ['AAA\n', 'BBB\n', 'CCC\n', 'DDD\n', 'EEE']

lines.reverse()
print(lines)                    # ['EEE', 'DDD\n', 'CCC\n', 'BBB\n', 'AAA\n']

with open('result.txt', 'w') as f:
    for line in lines:
        line = line.strip()    
        f.write(line+'\n')

To work with CSV files, you must install pandas from the settings in PyCharm.

 

Data frames

CSV files

To create a data frame and save it as a CSV file, use the pandas module.

import pandas as pd

data = [[1,2,3,4],[5,6,7,8]]        

#Create dataframe
df = pd.DataFrame(data)
print(df)
#    0  1  2  3            column number
# 0  1  2  3  4            index number : 0
# 1  5  6  7  8            index number : 1

# dataframe -> csv file (Save)
df.to_csv('../data/df.csv', header=False, index=False)
print('saved successfully')

Excel files

To deal with Excel files, you also need pandas. 

This is an excel file that includes statistics.

Install openpyxl in the settings and you will see the content in the console. 

import pandas as pd

# open excel file
book = pd.read_excel('../data/stats_104102.xlsx',
                     sheet_name='stats_104102',
                     header=1)     
print(book)

book = book.sort_values(by=2015, ascending=False) #descending 
print(book)

XML files

To read XML files, you need the bs4 library.

First, download the content from the URL and save it as a file with the .xml extension.


from bs4 import BeautifulSoup           # module for analyzing html, xml files
import urllib.request as req            # download
import os.path

url='http://www.kma.go.kr/weather/forecast/mid-term-rss3.jsp?stnId=108'

savename = 'forecast.xml'

# if not os.path.exists(savename):      
req.urlretrieve(url, savename)          # forecast.xml file download

# Analyze with the BeautifulSoup module
with open(savename, 'r', encoding='utf-8') as f:
    xml = f.read()
soup = BeautifulSoup(xml, 'html.parser')
# print(soup)

# Store the nationwide weather information in the info dictionary
info = {}               # info = { name : weather }
for location in soup.find_all('location'):
    name = location.find('city').text         
    wf = location.find('wf').text               
    tmx = location.find('tmx').text             
    tmn = location.find('tmn').text             

    weather = wf + ':' + tmn + '~' + tmx

    if name not in info:
        info[name] = []
    info[name].append(weather)

print(info)

#To print out
for name in info.keys():
    print('+', name)              
    for weather in info[name]:
        print('|', weather)

To open a text file -> get sum and avg -> save into a new text file

with open('sample.txt', 'r') as f:
    lines = f.readlines()
    print(lines)                 # ['70\n', '60\n', '55\n', '75\n', '95\n', '90\n', '80\n', '80\n', '85\n', '100']

total = 0
for line in lines:
    total += int(line)
avg = total / len(lines)
print('total:', total)           # total: 790
print('avg:', avg)               # avg: 79.0

with open('result.txt', 'w') as f:
    f.write(str(avg))

To read the list of public repositories from GitHub


import urllib.request as req
import os.path
import json

# To download json file
url = 'https://api.github.com/repositories'
savename = 'repo.json'

if not os.path.exists(savename):       
    req.urlretrieve(url, savename)      

# To read repo.json (the with block closes the file afterwards)
with open(savename, 'r', encoding='utf-8') as f:
    items = json.load(f)
print(type(items))                      # <class 'list'>
print(items)

# To print out
for item in items:
    print(item['name']+'-'+item['owner']['login'])

This is the data that the URL has. 

wordcount.py

To count the words in the file and organize them in descending order:

def getTextFreq(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()          
        tmp = text.split()     

        fa = {}                 
        for c in tmp:
            if c in fa:          
                fa[c] += 1       
            else:              
                fa[c] = 1       

    return fa                   

result = getTextFreq('../data/data.txt')
# result = getTextFreq('../data/alice.txt')
# result = getTextFreq('../data/hong.txt')
print(type(result))              # <class 'dict'>
print(result)

# Ascending by word (both lines give the same order)
print(sorted(result.items()))
print(sorted(result.items(), key=lambda x : x[0]))

# Descending by word
print(sorted(result.items(), key=lambda x : x[0], reverse=True))

# Descending by frequency: 10..9..8..
result = sorted(result.items(), key=lambda x : x[1], reverse=True)
print(result)

for c, freq in result:
    print('[%s] - [%d]time(s)' %(c, freq))

To count words that the user inserts:


def countWord(filename, word):
    with open(filename, 'r') as f:
        text = f.read()
        text = text.lower()         # to lowercase

        # named words so that the built-in list type is not shadowed
        words = text.split()
        count = words.count(word)

    return count

word = input('Which word do you want to search?')
word = word.lower()

# result = countWord('../data/data.txt', word)
result = countWord('../data/alice.txt', word)
print('[%s]: %d time(s)'%(word, result))


A regular expression is a formal language for describing a set of strings that follow specific rules. It is used to search for and replace strings in programming languages, text editors, and so on.
Checking an input string against certain conditions with ordinary conditional statements can get complicated, but with regular expressions it is very simple. The code is short, but it is hard to read unless you are familiar with the notation. In Python, regular expressions are provided by the re module.

 

To see how the code gets shorter and simpler, here is the code without regular expressions.

This code replaces the last seven digits of each ID number in the input with *.

data = """
        park 800905-1049118
        kim  700905-1059119
       """
result = []
for line in data.split('\n'):               # line = "park 800905-1049118"
    word_result = []
    for word in line.split(' '):            # word = "park", then "800905-1049118"
        if len(word) == 14 and word[:6].isdigit() and word[7:].isdigit():
            word = word[:6] + '-' + '*******'   # word = "800905-*******"
        word_result.append(word)            # word_result = ["park", "800905-*******"]
    result.append(" ".join(word_result))
print('\n'.join(result))

With regular expressions, it gets much simpler.

With pattern.sub(replacement, string), every part of the string that matches the pattern is replaced with the replacement.

import re

data = """
        park 800905-1049118
        kim  700905-1059119
       """

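# Regular expression: six digits, a hyphen, then seven digits
# (raw strings keep the backslashes from being read as string escapes)
pat = re.compile(r'(\d{6})[-](\d{7})')

# Mask the last seven digits
print(pat.sub(r"\g<1>-*******", data))

# Mask the first six digits
print(pat.sub(r"******-\g<2>", data))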

Meta characters

Meta character  Description                                                Example
[]              Matches any one character from the set.                    "[a-z]"
\               Starts a special sequence or escapes a metacharacter.      "\d"
.               Matches any single character except a newline.             "Ja.v."
^               Matches at the beginning of the string.                    "^Python"
$               Matches at the end of the string.                          "Meadow$"
*               Zero or more occurrences of the preceding element.         "hello*"
+               One or more occurrences of the preceding element.          "hello+"
{}              An exact number of occurrences of the preceding element.   "python{2}"
|               Matches either the expression before or after it.          "hello|world"
()              Captures and groups a sub-pattern.                         "(agilemeadow)"
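A few of these metacharacters can be tried out with re.search(), which scans the whole string for a match. The patterns and sample strings below are made up for illustration.

```python
import re

# Each entry pairs a label with the match object for one metacharacter.
checks = {
    "start": re.search(r"^Python", "Python is fun"),        # ^ anchors at the start
    "end": re.search(r"Meadow$", "Agile Meadow"),           # $ anchors at the end
    "dot": re.search(r"J.va", "Java"),                      # . is any one character
    "either": re.search(r"hello|world", "small world"),     # | is either alternative
    "count": re.search(r"o{2}", "too short"),               # {2} is exactly two
    "group": re.search(r"(agile)(meadow)", "agilemeadow"),  # () captures groups
}

for name, m in checks.items():
    print(name, m.group())

# Captured groups are numbered from 1.
print(checks["group"].group(1), checks["group"].group(2))
```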

[] 

import re

# [abc]: matches any one of the characters a, b, or c
# match() only checks the beginning of the string
p1 = re.compile('[abc]')

print(p1.match('a'))                # match='a'
print(p1.match('before'))           # match='b'
print(p1.match('dude'))             # None

p = re.match('[abc]', 'a')          # match='a'
print(p)

*

p3 = re.compile('ca*t')

print(p3.match('ct'))           # match='ct'
print(p3.match('cat'))          # match='cat'
print(p3.match('caaat'))        # match='caaat'

+

p4 = re.compile('ca+t')

print(p4.match('ca'))           # None
print(p4.match('cat'))          # match='cat'
print(p4.match('caaat'))        # match='caaat'

{m}

p5 = re.compile('ca{2}t')

print(p5.match('cat'))          # None
print(p5.match('caat'))         # match='caat'

{m,n}

Matches if the preceding character repeats between m and n times.

p6 = re.compile('ca{2,5}t')

print(p6.match('cat'))          # None
print(p6.match('caat'))         # match='caat'
print(p6.match('caaaaat'))      # match='caaaaat'

?

p7 = re.compile('ab?c')

print(p7.match('ac'))           # match='ac'
print(p7.match('abc'))          # match='abc'

Example

The results differ depending on whether you add + or not.

\s matches a whitespace character, whereas \S matches any non-whitespace character.

import re

m1 = re.match('[0-9]', '1234')
print(m1)                           # match='1'
print(m1.group())                   

m2 = re.match('[0-9]', 'abc')
print(m2)                          

m3 = re.match('[0-9]+', '1234')
print(m3)                           # match='1234'
print(m3.group())                   # 1234

m4 = re.match('[0-9]+', ' 1234')
print(m4)                           # None

m5 = re.match(r'\s[0-9]+', ' 1234')
print(m5)                           # match=' 1234'
print(m5.group())                   # ' 1234' (including the leading space)

m6 = re.search('[0-9]+', ' 1234')
print(m6)                           # match='1234'
print(m6.group())                   # 1234

To search strings

match()


import re

p = re.compile('[a-z]+')

m1 = p.match('python')
print(m1)                       # match='python'
m2 = p.match('Python')
print(m2)                       # None
m3 = p.match('pythoN')
print(m3)                       # match='pytho'
m4 = p.match('pyThon')
print(m4)                       # match='py'
m5 = p.match('3 python')
print(m5)                       # None

search()

print('search() function')
s1 = p.search('python')
s2 = p.search('Python')
s3 = p.search('pythoN')
s4 = p.search('pyThon')
s5 = p.search('3 python')
print(s1)                   # match='python'
print(s2)                   # match='ython'
print(s3)                   # match='pytho'
print(s4)                   # match='py'
print(s5)                   # match='python'

findall()

result1 = p.findall('life is too short')
print(type(result1))        # 'list'
print(result1)              # ['life', 'is', 'too', 'short']

result2 = p.findall('Life is tOo shorT')
print(result2)              # ['ife', 'is', 't', 'o', 'shor']

finditer()

result3 = p.finditer('life is too short')
print(type(result3))        # 'callable_iterator'
print(result3)              # <callable_iterator object at 0x0000020A8F53DC48>

for r in result3:
    print(r)

result4 = p.finditer('Life is tOo shorT')
for r in result4:
    print(r)

sub()

sub() replaces every part of a string that matches the pattern with another string.

pattern.sub(replacement, string)

import re

p = re.compile('blue|white|red')

# Replace blue, white, and red with 'gold'
print(p.sub('gold', 'blue socks and red shoes'))

# Replace only the first match
print(p.sub('silver', 'blue socks and red shoes', count=1))

Example

To change the last four digits to "####"

import re

s = """
    park 010-9999-9988
    kim 010-9909-7789
    lee 010-8789-7768
"""

pat = re.compile(r"(\d{3}[-]\d{4})[-]\d{4}")
result = pat.sub(r"\g<1>-####", s)

print(result)

 
