I have a word file (.docx) with table of data, I am trying to create a pandas data frame using that table, I have used docx and pandas module. But I could not create a data frame.
from docx import Document
document = Document('req.docx')
for table in document.tables:
for row in table.rows:
for cell in row.cells:
print (cell.text)
and also tried to read table as df pd.read_table("path of the file")
I can read the data cell by cell but I want to read the entire table or any particular column. Thanks in advance
docx
always reads data from Word tables as text (strings).
If we want to parse data with correct dtypes we can do one of the following:
dtype
for all columns (not flexible)pd.read_csv()
guess/infer correct dtypes (I've chosen this way) Many thanks to @Anton vBR for improving the function!
import pandas as pd
import io
import csv
from docx import Document
def read_docx_tables(filename, tab_id=None, **kwargs):
"""
parse table(s) from a Word Document (.docx) into Pandas DataFrame(s)
Parameters:
filename: file name of a Word Document
tab_id: parse a single table with the index: [tab_id] (counting from 0).
When [None] - return a list of DataFrames (parse all tables)
kwargs: arguments to pass to `pd.read_csv()` function
Return: a single DataFrame if tab_id != None or a list of DataFrames otherwise
"""
def read_docx_tab(tab, **kwargs):
vf = io.StringIO()
writer = csv.writer(vf)
for row in tab.rows:
writer.writerow(cell.text for cell in row.cells)
vf.seek(0)
return pd.read_csv(vf, **kwargs)
doc = Document(filename)
if tab_id is None:
return [read_docx_tab(tab, **kwargs) for tab in doc.tables]
else:
try:
return read_docx_tab(doc.tables[tab_id], **kwargs)
except IndexError:
print('Error: specified [tab_id]: {} does not exist.'.format(tab_id))
raise
NOTE: you may want to add more checks and exception catching...
Examples:
In [209]: dfs = read_docx_tables(fn)
In [210]: dfs[0]
Out[210]:
A B C,X
0 1 B1 C1
1 2 B2 C2
2 3 B3 val1, val2, val3
In [211]: dfs[0].dtypes
Out[211]:
A int64
B object
C,X object
dtype: object
In [212]: dfs[0].columns
Out[212]: Index(['A', 'B', 'C,X'], dtype='object')
In [213]: dfs[1]
Out[213]:
C1 C2 C3 Text column
0 11 21 NaN Test "quotes"
1 12 23 2017-12-31 NaN
In [214]: dfs[1].dtypes
Out[214]:
C1 int64
C2 int64
C3 object
Text column object
dtype: object
In [215]: dfs[1].columns
Out[215]: Index(['C1', 'C2', 'C3', 'Text column'], dtype='object')
parsing dates:
In [216]: df = read_docx_tables(fn, tab_id=1, parse_dates=['C3'])
In [217]: df
Out[217]:
C1 C2 C3 Text column
0 11 21 NaT Test "quotes"
1 12 23 2017-12-31 NaN
In [218]: df.dtypes
Out[218]:
C1 int64
C2 int64
C3 datetime64[ns]
Text column object
dtype: object