Python 3 Reading Relative Lines in a text document and convert to Pandas DF







I'm working in Python 3.6, reading a text file and extracting lines relative to a search match so that I can convert them into a pandas DataFrame.



What works: Searching for a phrase in a text document and converting the line into a pandas df.


import pandas as pd

df = pd.DataFrame()
list1 = []
list2 = []

with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            # Drop the trailing newline before storing the line
            line = line.strip('\n')
            list1.append(repr(line))

# Convert list1 into a df column
df = pd.DataFrame({'Project_Name': list1})



What doesn't work: Returning a relative line based on the search result. In my case I need to store the "relative" line -6 to -2 (earlier in the text) as Pandas columns.


with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project:' in line:
            list2.append(repr(line)-6)  # <--- can't use math here



Returns: TypeError: unsupported operand type(s) for -: 'str' and 'int'



Also tried using a range with partial success:


with open('myfile.txt') as f:
    for lineno, line in enumerate(f, 1):
        if 'Project' in line:
            all_lines = f.readlines()
            required_lines = [all_lines[i] for i in range(lineno-6, lineno-2)]
            print(required_lines)
            list2.append(required_lines)  # <-- does not work



Python will print the first 4 target lines but it does not seem to be able to save it as a list or loop through each finding of "Project" in the text doc. Is there a better way to save the results of the relative line above (or below) the search term? Thanks much.



Text data looks like:


0 Exhibit 3
1 Date: February 2018
2 Description
3 Description
4 Description
5 2015
6 2016
7 2017
8 2018
9 $100.50 <---- Add these as different dataframe columns
10 $120.33 <----
11 $135.88 <----
12 $140.22 <----
13 Project A
14
15 Exhibit 4
16 Date: February 2018
17 Description
18 Description
19 2015
20 2016
21 2017
22 2018
23 $899.25 <----
24 $901.00 <----
25 $923.43 <----
26 $1002.02 <----
27 Project B





If you could post what your input data looks like and what you expect the output to look like it would help.
– Alex
Aug 6 at 16:42





Added an example of what the text looks like, thanks Alex
– Arthur D. Howland
Aug 6 at 16:58




2 Answers



This might do the trick. It assumes that there are always four values immediately before the 'Project' line.


>>> import pandas as pd
>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             # Keep the four values plus the 'Project' line itself
...             a.append(prev_lines[-5:])
...             del prev_lines[:]
...
>>> df = pd.DataFrame(a, columns=list('ABCDi'))
>>> df
         A        B        C         D          i
0  $100.50  $120.33  $135.88   $140.22  Project A
1  $899.25  $901.00  $923.43  $1002.02  Project B



Or without the project included:


>>> a = []
>>> with open('test.txt') as f:
...     prev_lines = []
...     for line in f:
...         prev_lines.append(line.strip('\n'))
...         if 'Project' in line:
...             # Keep only the four values, dropping the 'Project' line
...             a.append(prev_lines[-5:-1])
...             del prev_lines[:]
...
>>> df = pd.DataFrame(a, columns=list('ABCD'))
>>> df
         A        B        C         D
0  $100.50  $120.33  $135.88   $140.22
1  $899.25  $901.00  $923.43  $1002.02





Works a lot better than my latest, this is the first time I've seen "prev_lines = [ ]" list construction twice in the same block. Never thought of that.
– Arthur D. Howland
Aug 7 at 13:15





I've updated the code to use a slightly better method of clearing the previous-lines list. When I tested this on a file with 2000 records in it, it ran pretty quickly.
– Alex
Aug 7 at 17:13





Is there a way to go forward 4 columns? I tried a.append(next_lines[:4]) but it skips the first instance.
– Arthur D. Howland
Aug 15 at 18:23





I'm not sure I fully understand, do you mean access the rows from project onwards? This method works because the input is repetitive, but it could be achieved. Let me know what specific lines it should include.
– Alex
Aug 16 at 8:55







Alex - yes, we got the previous 4 rows to work, but now I'm trying to get the NEXT 4 rows to go into a pandas df. So in the above example, when searching for 'Project' it would return: (blank), Exhibit 4, Date: February 2018, Description.
– Arthur D. Howland
Aug 20 at 14:43
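
One possible way to collect the NEXT four lines after each match is to read all lines into a list first and slice forward from the match index. This is only a rough sketch: rows_after is a hypothetical helper name, and padding short rows with None (for a match near the end of the file) is an assumption made so that every row has the same number of columns.

```python
def rows_after(lines, keyword="Project", count=4):
    """Return the `count` lines that follow each line containing `keyword`."""
    rows = []
    for i, line in enumerate(lines):
        if keyword in line:
            following = lines[i + 1:i + 1 + count]
            # Assumption: pad with None when the file ends before `count` lines
            following += [None] * (count - len(following))
            rows.append(following)
    return rows

# Shortened stand-in for the lines of the text file in the question
lines = ["$140.22", "Project A", "", "Exhibit 4",
         "Date: February 2018", "Description", "$1002.02", "Project B"]
rows = rows_after(lines)
print(rows[0])  # ['', 'Exhibit 4', 'Date: February 2018', 'Description']
```

The resulting list of rows can then be passed to pd.DataFrame(rows, columns=list('ABCD')) in the same way as in the answer above.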



The reason your second solution is not working is that you are reading the file through a generator-like object (f in your case), which stops once it has finished iterating through the file.





Your loop for lineno, line in enumerate(f, 1): is meant to iterate over the file line by line in a memory-efficient manner, reading only one line at a time. When you find a matching line you call all_lines = f.readlines(), which consumes the rest of the file. When the for loop then asks for the next line, the iterator is exhausted and raises StopIteration, which ends the loop after the first match. Note also that all_lines contains only the lines after the match, so indexes computed from lineno do not point at the lines you expect.





You can make your second solution work if you read the entire contents of the file first and then iterate through that list instead.
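
As a minimal sketch of that approach: read every line into a list, then index backwards from each match. The helper name rows_before and the choice of four lines immediately above the match are assumptions based on the sample data in the question; the sample text is embedded here only so the snippet runs without a file.

```python
def rows_before(lines, keyword="Project", count=4):
    """Return the `count` lines directly above each line containing `keyword`."""
    rows = []
    for i, line in enumerate(lines):
        # Skip a match that has fewer than `count` lines above it
        if keyword in line and i >= count:
            rows.append(lines[i - count:i])
    return rows

# Stand-in for:  with open('myfile.txt') as f: lines = f.read().splitlines()
sample_text = """Exhibit 3
Date: February 2018
Description
2015
2016
2017
2018
$100.50
$120.33
$135.88
$140.22
Project A

Exhibit 4
Date: February 2018
Description
2015
2016
2017
2018
$899.25
$901.00
$923.43
$1002.02
Project B
"""
lines = sample_text.splitlines()
rows = rows_before(lines)
print(rows)
```

Each inner list is one row, so pd.DataFrame(rows, columns=list('ABCD')) should give one column per value.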



If you want to be memory efficient, you can maintain a FIFO queue of the required number of lines.
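
A rough sketch of the queue idea, using collections.deque with maxlen so that only the most recent lines are kept in memory. The function name rows_before_streaming and the window size of four are assumptions for illustration; io.StringIO stands in for the open file.

```python
from collections import deque
import io

def rows_before_streaming(file_obj, keyword="Project", count=4):
    """Stream a file and keep the `count` lines seen just before each match."""
    window = deque(maxlen=count)  # old lines fall off the left automatically
    rows = []
    for raw in file_obj:
        line = raw.rstrip("\n")
        if keyword in line:
            # Only record a row when a full window of lines is available
            if len(window) == count:
                rows.append(list(window))
            window.clear()
        else:
            window.append(line)
    return rows

# Stand-in for an open file; with real data use: with open('myfile.txt') as f
fake_file = io.StringIO(
    "2018\n$100.50\n$120.33\n$135.88\n$140.22\nProject A\n"
    "\nExhibit 4\n2018\n$899.25\n$901.00\n$923.43\n$1002.02\nProject B\n"
)
rows = rows_before_streaming(fake_file)
print(rows)
```

Because the deque never holds more than count lines, memory use stays constant regardless of file size, unlike reading the whole file first.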





Tried using relative_line = f.readlines(), Line6 = [relative_line[lineno - 6]]. Not working either. I'm not using f.readlines() correctly.
– Arthur D. Howland
Aug 6 at 17:32





