Lesson 8: Pandas Basics - Python for Geospatial Data

Pandas is the main Python library for manipulating and analyzing data. It is essential for working with tabular geospatial data.

8.1. Learning objectives

Create and use DataFrames
Read and write CSV and Excel files
Filter, sort, and summarize data
Use more advanced DataFrame methods
Apply Pandas to geospatial datasets

import pandas as pd

8.2. Introduction to Pandas

Pandas provides the DataFrame, a labeled two-dimensional table, as its core structure for data analysis.

8.2.1. Creating a DataFrame from a dictionary

# Create a DataFrame from a dictionary
data = {
    'City': ['Hà Nội', 'TP.HCM', 'Đà Nẵng'],
    'Population': [8053663, 9420000, 1134000],
    'Latitude': [21.0285, 10.8231, 16.0544],
    'Longitude': [105.8542, 106.6297, 108.2022]
}
df = pd.DataFrame(data)
df.head()

	City	Population	Latitude	Longitude
0	Hà Nội	8053663	21.0285	105.8542
1	TP.HCM	9420000	10.8231	106.6297
2	Đà Nẵng	1134000	16.0544	108.2022
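Before going further, it can help to confirm what pandas inferred from the dictionary. The sketch below rebuilds the same DataFrame and reads three standard attributes: shape, columns, and dtypes. The variable names n_rows, n_cols, and column_names are chosen only for this illustration.

```python
import pandas as pd

# Same city data as in the example above
data = {
    'City': ['Hà Nội', 'TP.HCM', 'Đà Nẵng'],
    'Population': [8053663, 9420000, 1134000],
    'Latitude': [21.0285, 10.8231, 16.0544],
    'Longitude': [105.8542, 106.6297, 108.2022]
}
df = pd.DataFrame(data)

n_rows, n_cols = df.shape          # (number of rows, number of columns)
column_names = list(df.columns)    # column labels in insertion order

print(n_rows, n_cols)
print(column_names)
print(df.dtypes)                   # one inferred dtype per column
```

Each dictionary key becomes a column, and the lists must all have the same length.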

8.2.2. Creating a DataFrame from a list of dictionaries or a list of lists

# From a list of dictionaries
cities = [
    {'City': 'Hải Phòng', 'Population': 2028220, 'Latitude': 20.8449, 'Longitude': 106.6881},
    {'City': 'Cần Thơ', 'Population': 1235000, 'Latitude': 10.0452, 'Longitude': 105.7469}
]
df = pd.DataFrame(cities)
df.head()

	City	Population	Latitude	Longitude
0	Hải Phòng	2028220	20.8449	106.6881
1	Cần Thơ	1235000	10.0452	105.7469

# Create a DataFrame from a list of lists with custom column names
data = [
    ['Hà Nội', 8053663, 21.0285, 105.8542],
    ['TP.HCM', 9420000, 10.8231, 106.6297],
    ['Đà Nẵng', 1134000, 16.0544, 108.2022]
]
df = pd.DataFrame(data, columns=['City', 'Population', 'Latitude', 'Longitude'])  # Name the columns in order
df.head()

	City	Population	Latitude	Longitude
0	Hà Nội	8053663	21.0285	105.8542
1	TP.HCM	9420000	10.8231	106.6297
2	Đà Nẵng	1134000	16.0544	108.2022
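The call above sets only the column names, so the rows keep the default 0, 1, 2 index. Row labels can also be supplied through the index parameter of pd.DataFrame. In the sketch below, the labels 'HN', 'HCM', and 'DN' and the name df_labeled are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = [
    ['Hà Nội', 8053663, 21.0285, 105.8542],
    ['TP.HCM', 9420000, 10.8231, 106.6297],
    ['Đà Nẵng', 1134000, 16.0544, 108.2022]
]

# 'HN', 'HCM', 'DN' are hypothetical row labels used only in this sketch
df_labeled = pd.DataFrame(
    data,
    columns=['City', 'Population', 'Latitude', 'Longitude'],
    index=['HN', 'HCM', 'DN']
)

# Look up a single value by row label and column name
danang_population = df_labeled.loc['DN', 'Population']
print(danang_population)
```

With labeled rows, .loc selects by label, while .iloc still selects by position.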

8.3. Reading and writing CSV and Excel files

CSV and Excel are among the most common formats for tabular data.

8.3.1. Reading and writing CSV files

# Write the DataFrame to a CSV file (index=False leaves out the row index)
df.to_csv(r'G:\My Drive\python\geocourse\data\outputs\cities.csv', index=False)
# Read the DataFrame back from the CSV file
df_read = pd.read_csv(r'G:\My Drive\python\geocourse\data\outputs\cities.csv')
df_read.head()

	City	Population	Latitude	Longitude
0	Hà Nội	8053663	21.0285	105.8542
1	TP.HCM	9420000	10.8231	106.6297
2	Đà Nẵng	1134000	16.0544	108.2022
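The paths above are tied to one particular Windows drive. As a portable alternative, the sketch below performs the same write-then-read round trip with pathlib and a temporary directory. The temporary directory stands in for the course's output folder and is an assumption of this example, as is the explicit utf-8 encoding.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    'City': ['Hà Nội', 'TP.HCM', 'Đà Nẵng'],
    'Population': [8053663, 9420000, 1134000],
    'Latitude': [21.0285, 10.8231, 16.0544],
    'Longitude': [105.8542, 106.6297, 108.2022]
})

# A temporary directory stands in for the course's output folder
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'cities.csv'
    # utf-8 keeps the Vietnamese diacritics intact in the file
    df.to_csv(csv_path, index=False, encoding='utf-8')
    df_read = pd.read_csv(csv_path, encoding='utf-8')

print(df_read.shape)
```

Building paths with Path and the / operator avoids hard-coded separators, so the same code runs on Windows, macOS, and Linux.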

8.3.2. Reading and writing Excel files

# Read a DataFrame from an Excel file (.xlsx files need an Excel engine such as openpyxl)
df_excel = pd.read_excel(r'G:\My Drive\python\geocourse\data\outputs\cities.xlsx')  # Assumes cities.xlsx already exists
# Write the DataFrame to an Excel file
df.to_excel(r'G:\My Drive\python\geocourse\data\outputs\cities_output.xlsx', index=False)

8.3.3. Reading data from a URL

# Read data directly from a URL
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(url)
# Save the DataFrame to a CSV file
df.to_csv(r'G:\My Drive\python\geocourse\data\outputs\iris_data.csv', index=False)
# Save the DataFrame to an Excel file
df.to_excel(r'G:\My Drive\python\geocourse\data\outputs\iris_data.xlsx', index=False)
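pd.read_csv accepts local paths, URLs, and file-like objects, and the same parsing options apply to all of them. The sketch below uses an in-memory string instead of a network request so that it runs offline. The inline CSV text reuses the city values from section 8.2, and the name df_subset is chosen only for this example.

```python
import io

import pandas as pd

# Inline CSV text standing in for a downloaded file (city values from section 8.2)
csv_text = """City,Population,Latitude,Longitude
Hà Nội,8053663,21.0285,105.8542
TP.HCM,9420000,10.8231,106.6297
Đà Nẵng,1134000,16.0544,108.2022
"""

# usecols keeps only the listed columns; nrows limits how many data rows are parsed
df_subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['City', 'Latitude', 'Longitude'],
    nrows=2
)
print(df_subset)
```

Options such as usecols and nrows are useful for previewing a large remote file before loading all of it.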

8.4. Filtering and sorting

Rows can be filtered with boolean conditions and sorted by one or more columns.

8.4.1. Filtering and grouping

# 2018 math olympiad score table, loaded from the GitHub URL below
df = pd.read_csv('https://raw.githubusercontent.com/leomtz/apmowebsite/refs/heads/master/data/data_clean/scoretable-2018-clean.csv')
df.head()

	code	country	rank	last	first	sex	p1	p2	p3	p4	total
0	ARG	Argentina	1	MASLIAH	JULIÁN	M	7	6	1	5	19
1	ARG	Argentina	2	FLESCHLER	IAN	M	7	7	0	4	18
2	ARG	Argentina	3	SOTO	CARLOS MIGUEL	M	7	0	4	5	16
3	ARG	Argentina	4	DI SANZO	BRUNO	M	7	3	0	5	15
4	ARG	Argentina	5	CASSIA	NICOLÁS	M	7	3	1	2	13

# Keep contestants whose rank is below 3 (ranks 1 and 2)
ranked_df = df[df['rank'] < 3]
# Sort by total score in descending order
sorted_df = df.sort_values(by='total', ascending=False)
sorted_df.head()

	code	country	rank	last	first	sex	p1	p2	p3	p4	p5	total
342	USA	United States of America	1	Gu	Andrew	M	7	7	7	7	7	35
343	USA	United States of America	2	Wan	Edward	M	7	7	7	7	7	35
146	KOR	Republic of Korea	2	Kim	Ji Min	M	7	7	7	7	7	35
147	KOR	Republic of Korea	3	Kim	Dain	F	7	7	7	7	7	35
145	KOR	Republic of Korea	1	Kwon	Sunghyun	M	7	7	7	7	7	35
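Filters can combine several conditions, and sorting can use more than one key. The sketch below illustrates both on a handful of rows copied from the outputs shown above (a subset of columns only). The names sample, top_two_selected, and ordered are chosen for this illustration.

```python
import pandas as pd

# A few rows copied from the outputs shown above (subset of columns)
sample = pd.DataFrame({
    'code': ['ARG', 'ARG', 'ARG', 'KOR', 'KOR', 'USA', 'USA'],
    'rank': [1, 2, 3, 1, 2, 1, 2],
    'last': ['MASLIAH', 'FLESCHLER', 'SOTO', 'Kwon', 'Kim', 'Gu', 'Wan'],
    'total': [19, 18, 16, 35, 35, 35, 35]
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
top_two_selected = sample[(sample['rank'] <= 2) & (sample['code'].isin(['ARG', 'KOR']))]

# Sort by total (descending), breaking ties by rank (ascending)
ordered = sample.sort_values(by=['total', 'rank'], ascending=[False, True])

print(top_two_selected)
print(ordered.head())
```

The parentheses matter because & binds more tightly than comparison operators in Python.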

# Mean total score per country, highest first
mean_scores = df.groupby('country')['total'].mean().reset_index().sort_values(by='total', ascending=False)
mean_scores.head()

	country	total
27	Republic of Korea	32.0
38	United States of America	30.6
15	Japan	25.4
30	Singapore	23.1
6	Canada	22.0
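An equivalent way to get the grouping key back as a regular column is to pass as_index=False to groupby, which makes the reset_index() call unnecessary. The sketch below uses only a few rows copied from the outputs above, so its means differ from the full-table results.

```python
import pandas as pd

# A few rows copied from the outputs above; not the full score table
sample = pd.DataFrame({
    'country': ['Argentina', 'Argentina', 'Argentina',
                'Republic of Korea', 'Republic of Korea'],
    'total': [19, 18, 16, 35, 35]
})

# as_index=False returns 'country' as a regular column, so no reset_index() is needed
mean_scores = (
    sample.groupby('country', as_index=False)['total']
    .mean()
    .sort_values(by='total', ascending=False)
)
print(mean_scores)
```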

8.4.2. Creating new columns and merging DataFrames

# Create a new column 'fullname' by joining the 'first' and 'last' columns
df['fullname'] = df['first'] + ' ' + df['last']
df.head()

	code	country	rank	last	first	sex	p1	p2	p3	p4	total	fullname
0	ARG	Argentina	1	MASLIAH	JULIÁN	M	7	6	1	5	19	JULIÁN MASLIAH
1	ARG	Argentina	2	FLESCHLER	IAN	M	7	7	0	4	18	IAN FLESCHLER
2	ARG	Argentina	3	SOTO	CARLOS MIGUEL	M	7	0	4	5	16	CARLOS MIGUEL SOTO
3	ARG	Argentina	4	DI SANZO	BRUNO	M	7	3	0	5	15	BRUNO DI SANZO
4	ARG	Argentina	5	CASSIA	NICOLÁS	M	7	3	1	2	13	NICOLÁS CASSIA
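New columns can also be derived with the .str accessor, which applies string methods element-wise. As a small sketch, the names in this table are uppercase, so title() could be used to normalize them. The rows below are copied from the output above, and the name sample is chosen for this example.

```python
import pandas as pd

# Three name pairs copied from the output above
sample = pd.DataFrame({
    'first': ['JULIÁN', 'IAN', 'BRUNO'],
    'last': ['MASLIAH', 'FLESCHLER', 'DI SANZO']
})

# .str applies a string method to every element; title() capitalizes each word
sample['fullname'] = (sample['first'] + ' ' + sample['last']).str.title()
print(sample['fullname'].tolist())
```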

# Merge two DataFrames on the shared 'id' column
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'value1': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'value2': ['D', 'E', 'F']
})
merged_df = pd.merge(df1, df2, on='id', how='inner')  # Replace 'inner' with 'outer', 'left', or 'right' to change the join type
merged_df.head()

	id	value1	value2
0	2	B	D
1	3	C	E
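The inner join above keeps only the ids present in both tables. To see what the other join types retain, the sketch below runs an outer join on the same two DataFrames and adds indicator=True, which records the origin of each row in a _merge column. The name outer_df is chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'value2': ['D', 'E', 'F']})

# how='outer' keeps every id from both tables; unmatched cells become NaN
# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
outer_df = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(outer_df)
```

Checking the _merge column is a quick way to find keys that failed to match before relying on the merged table.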

8.5. Summarizing data

Pandas makes it easy to compute descriptive statistics and summaries.

# Summarize df (numeric columns only by default)
summary = df.describe()
# Summarize numeric columns explicitly
numeric_summary = df.describe(include=['number'])
# Summarize only object (text) columns
categorical_summary = df.describe(include=['object'])

C:\Users\tuyen\AppData\Local\Temp\ipykernel_12788\2741901675.py:6: Pandas4Warning: For backward compatibility, 'str' dtypes are included by select_dtypes when 'object' dtype is specified. This behavior is deprecated and will be removed in a future version. Explicitly pass 'str' to `include` to select them, or to `exclude` to remove them and silence this warning.
See https://pandas.pydata.org/docs/user_guide/migration-3-strings.html#string-migration-select-dtypes for details on how to write code that works with pandas 2 and 3.
  categorical_summary = df.describe(include=['object'])
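The result of describe() is itself a DataFrame, so individual statistics can be read from it with .loc. The sketch below uses the five Argentina rows shown earlier in the lesson. The names sample, mean_total, and max_total are chosen for this example.

```python
import pandas as pd

# The five Argentina rows shown earlier in the lesson (subset of columns)
sample = pd.DataFrame({
    'country': ['Argentina'] * 5,
    'rank': [1, 2, 3, 4, 5],
    'total': [19, 18, 16, 15, 13]
})

summary = sample.describe()   # numeric columns only by default

# Rows are statistics ('count', 'mean', 'std', 'min', ...); columns are the numeric columns
mean_total = summary.loc['mean', 'total']
max_total = summary.loc['max', 'total']
print(mean_total, max_total)
```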

# Compute several statistics per country in one call
stats = df.groupby('country').agg({
    'total': ['mean', 'max', 'min'],
    'rank': ['mean', 'max', 'min']
})
stats = stats.reset_index()
stats.head()

	total			rank
	mean	max	min	mean	max	min
country
Argentina	12.2	19	7	5.5	10	1
Australia	16.5	21	12	5.5	10	1
Bangladesh	13.5	21	9	5.5	10	1
Bolivia	13.2	21	3	3.0	5	1
Brazil	14.8	20	12	5.5	10	1
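Passing a dictionary to agg produces two-level column headers, as in the output above. Where flat column names are easier to work with, named aggregation is one alternative. In the sketch below, the output names total_mean, total_max, and best_rank are hypothetical, and the sample contains only a few rows copied from earlier outputs.

```python
import pandas as pd

# A few rows copied from earlier outputs; not the full score table
sample = pd.DataFrame({
    'country': ['Argentina', 'Argentina', 'Argentina',
                'Republic of Korea', 'Republic of Korea'],
    'rank': [1, 2, 3, 1, 2],
    'total': [19, 18, 16, 35, 35]
})

# Named aggregation: new_column=(source_column, function)
stats_flat = (
    sample.groupby('country')
    .agg(
        total_mean=('total', 'mean'),
        total_max=('total', 'max'),
        best_rank=('rank', 'min')
    )
    .reset_index()
)
print(stats_flat)
```

Flat names such as total_mean can be used directly in later filters and sorts without tuple-style column access.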

Summary

You have completed Lesson 8 and learned the basics of Pandas, the core library for data analysis in Python.

Key concepts covered:

✅ DataFrames: A powerful two-dimensional data structure for tabular data
✅ Creating DataFrames: From dictionaries, lists, and other data sources
✅ File I/O: Reading and writing CSV and Excel files with read_csv(), to_csv(), read_excel(), to_excel()
✅ Data filtering: Selecting rows with boolean indexing and logical conditions
✅ Sorting: Ordering data with sort_values() by one or more columns
✅ Statistical analysis: Computing statistics with mean(), describe(), and groupby().agg()
✅ Data exploration: Inspecting and understanding the structure of datasets
✅ Geospatial application: Handling city data with coordinates and attributes

Skills you can apply:

Process and analyze tabular geospatial datasets efficiently
Import and export data from different sources (CSV, Excel, URLs)
Perform exploratory data analysis (EDA) on geographic datasets
Filter and query data based on spatial and non-spatial attributes
Prepare clean data for use with GeoPandas and other GIS libraries
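As one illustration of querying by spatial attributes, the sketch below filters the city table from section 8.2 by latitude and computes great-circle distances from Hà Nội with the haversine formula. The helper haversine_km, the 15-degree threshold, and the 6371 km Earth radius are assumptions of this example and not part of the lesson's code. A dedicated library such as GeoPandas would normally handle this kind of work.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'City': ['Hà Nội', 'TP.HCM', 'Đà Nẵng'],
    'Latitude': [21.0285, 10.8231, 16.0544],
    'Longitude': [105.8542, 106.6297, 108.2022]
})

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or pandas Series in degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Spatial attribute filter: cities north of 15 degrees latitude (hypothetical threshold)
northern = cities[cities['Latitude'] > 15]

# Distance from Hà Nội (first row) to every city, computed for the whole column at once
hanoi = cities.iloc[0]
cities['dist_from_hanoi_km'] = haversine_km(
    hanoi['Latitude'], hanoi['Longitude'],
    cities['Latitude'], cities['Longitude']
)

print(northern['City'].tolist())
print(cities[['City', 'dist_from_hanoi_km']])
```

Because the function operates on whole columns, no explicit loop over rows is needed.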