你的位置:首页 > 数据库

[数据库][数据科学] 从csv, xls文件中提取数据


在python语言中,用丰富的函数库来从文件中提取数据,这篇博客讲解怎么从csv, xls文件中得到想要的数据。

点击下载数据文件http://seanlahman.com/files/database/lahman-csv_2015-01-24.zip 

 

这个一个美国棒球比赛的统计数据
解压文件夹,我们选取AwardsManagers.csv来练习

#-*- coding:utf-8 -*-import csvDIR = 'data/'fname = 'AwardsManagers.csv'fpath = DIR+fname## 用 with open() as filename 的结构非常优美, 而且不需要写代码来关文件## 省去了fileobj.close(), 省去写try-finally的麻烦来出来exceptionwith open(fpath, 'rb') as csvfile:   ## delimiter是csv文件每行中数据间隔开的符号,常用是comma逗号,  ## quotechar之间包括特殊字符  mreader = csv.reader(csvfile, delimiter=',', quotechar='|')   ## 读出每一行都是一个list  first_row = mreader.next()  print first_row  print type(first_row)  ## 目前的行数  print mreader.line_num  for row in mreader:     print ', '.join(row) ## 另外一个读取数据的方法是用DictReadernames = ['playerID','awardID','yearID','lgID','tie','notes']with open(fpath) as csvfile:   ## fieldnames指明了csv文件的列名称    reader = csv.DictReader(csvfile, fieldnames=names,     delimiter=',', quotechar='|')     for row in reader:      ## 每一行都是一个dict对象    print(row[names[0]], row[names[1], row[names[2])

从专业机构中获取的数据也常常是XLS文件,用python提取XLS文件中的函数是xlrd

在xlrd中最重要的函数是:
xlrd.open_workbook
workbook.sheet_by_name
workbook.sheet_by_index
sheet.cell(row_index, col_index)
cell.value
sheet.col_values(col_index, start_row_index, end_row_index)
sheet.row_values(row_index, start_col_index, end_col_index)
sheet.col_slice(col_index, start_row_index, end_row_index)
sheet.row_slice(row_index, start_col_index, end_col_index)
点击下载数据源文件http://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&33010do001_2009.xls&3301.0&Data%20Cubes&861A1F351DF2D978CA2577CF000DF18E&0&2009&03.11.2010&Latest
文件是关于澳大利亚人口出生情况的统计数据

#-*- coding:utf-8 -*-import xlrdDIR = 'C:/Users/Lucas/Downloads/'fname = '33010do001_2009.xls'# 首先建立workbookmworkbook = xlrd.open_workbook(DIR+fname)# 打印出所有sheetnamessheet_names = mworkbook.sheet_names()print('Sheet Names', sheet_names)# 选取第二个sheetmsheet = mworkbook.sheet_by_name(sheet_names[1])# 或者通过index得到sheetnsheet = mworkbook.sheet_by_index(1)print ('Sheet name: %s' % nsheet.name)# Pull the first row by indexrow = msheet.row(0) # Pull the first row by indexrow = msheet.row(4) # Print 1st row values and typesfor cell in row:  print cell.value# Print all values, iterating through rows and columns#num_cols = msheet.ncols  # Number of columnsnum_rows = msheet.nrows  # Number of rowsfor row_idx in range(0, num_rows):  # Iterate through rows  row_values = []  for col_idx in range(0, num_cols): # Iterate through columns    row_values.append([msheet.cell(row_idx, col_idx).value])  ## 输出每行数据  print row_values## 用col_slice得到某一列的数据col_cells = msheet.col_slice(2, 4, num_rows)for cell in col_cells:  print("-"*6)  print cell.value    ## 用col_valeus得到某一列的数据col_values = msheet.col_values(2, 4, num_rows)print col_values