Reading text files

Updated on 28 Dec 2022

Nearly all programming languages follow the same pattern for reading from and writing to files, also known as file I/O.

  • Open file
  • Read file
  • Close file

Functions

The operation for reading files can be done in many ways, and in this chapter we’ll look at the more sensible ways.

  • read() -> read entire file into string
  • readlines() -> read all lines into list
  • a for loop with in -> read one line at a time (similar to what we do with lists)

Let’s see how each approach works with the following file.

data/dummyData1.txt

this is a line
another line
third line is a winner

read

f = open('data/dummyData1.txt')
bigString = f.read()
print(bigString)

f.close()
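To make it clearer what read() hands back, here is a small sketch that prints the type and the repr() of the result. The temporary folder is scaffolding added so the example runs anywhere, and it assumes the sample file ends with a newline, which the listing above does not confirm.

```python
import os
import tempfile

# Scaffolding (assumption): recreate the sample file in a temporary folder
# so the example runs without data/dummyData1.txt being present.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'dummyData1.txt')
with open(path, 'w') as out:
    out.write('this is a line\nanother line\nthird line is a winner\n')

f = open(path)
bigString = f.read()   # the whole file as one single string
f.close()

print(type(bigString))   # <class 'str'>
print(repr(bigString))   # repr() makes the \n characters visible
```

The line breaks are part of the string, which is why print(bigString) shows the text on three separate lines.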

Reading one line at a time

The purpose of reading one line at a time is usually to perform an operation on that line of data. It might contain mms_ids that you want to feed into an Alma API. But before we get into that realm, let’s look at how to read one line of the file at a time.

The with keyword is used in Python for dealing with unmanaged resources, such as open files.

with open('data/dummyData1.txt') as f:
    for line in f:
        line = line.strip('\n')
        print(line)

When we write for line in f:, Python reads one entire line at a time, including the newline character at the end. Perhaps we just want the data without the newline, so I have stripped it out.
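There are a few ways to strip a line, and they differ in how much they remove. The sketch below compares them on a made-up line (the value of line is hypothetical, chosen to have spaces around the data).

```python
# Hypothetical line, shaped like what the for loop would deliver
line = '  some data \n'

only_newline = line.strip('\n')   # removes newline characters only
right_side = line.rstrip()        # removes all trailing whitespace
both_sides = line.strip()         # removes leading and trailing whitespace

print(repr(only_newline))   # '  some data '
print(repr(right_side))     # '  some data'
print(repr(both_sides))     # 'some data'
```

Which one fits depends on the data: strip('\n') keeps any spaces that are part of the line, while strip() tidies up both ends.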

Also notice that there is no f.close(). The reason is that the with statement uses a context manager, which has enter and exit steps. That is, with open opens the file and gives you the file handle; when the block of code is finished, the exit step is invoked, which closes the file.
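One way to convince yourself that the file really is closed is to check the closed attribute of the file handle inside and after the block. This is a minimal sketch; the temporary file is scaffolding so the example runs anywhere.

```python
import os
import tempfile

# Scaffolding (assumption): a throwaway copy of the sample file
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'dummyData1.txt')
with open(path, 'w') as out:
    out.write('this is a line\nanother line\nthird line is a winner\n')

with open(path) as f:
    inside = f.closed   # still open while we are inside the block
    first_line = f.readline()

after = f.closed        # the with statement has closed the file for us

print(inside)   # False
print(after)    # True
```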

It is possible to create our own context managers, but that is outside the scope of this course.

readlines

Sometimes we will want to read the contents of a file into a list. Instead of doing a lot of I/O, we can hold the whole dataset in memory in a single variable, and then use list operations to work on it.

Consider this example, which uses the with keyword and readlines to read the data file into a list. The print occurs outside the with block, which means the file has already been closed before we print the list or do anything else with the data.

with open('data/dummyData1.txt') as f:
    mylist = f.readlines()

print(mylist)

Notice in this output that each element also includes the newline character.
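If the newline characters are not wanted, one option is to loop over the list and strip each element. The sketch below shows the list before and after; the temporary file is scaffolding, and it assumes the sample file ends with a newline.

```python
import os
import tempfile

# Scaffolding (assumption): a throwaway copy of the sample file
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'dummyData1.txt')
with open(path, 'w') as out:
    out.write('this is a line\nanother line\nthird line is a winner\n')

with open(path) as f:
    mylist = f.readlines()

print(mylist)   # each element still ends with \n

# Strip the trailing newline from every element
cleaned = []
for line in mylist:
    cleaned.append(line.rstrip('\n'))

print(cleaned)
```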

Files - guided exercise

I have a list of naughty words / phrases. These could just as easily be other types of keywords, but we’ll stick with naughty words for now.

naughty = ['bad word', 'very bad word', 'another line']

Notice that the 3rd element in this list (index 2), another line, also happens to be in the dummyData1.txt file. Write a script that reads the data file and filters out all the bad entries. For the dummyData1.txt file, the final list should look like this…

Solution

Let’s start off by writing some code, and putting in some code stubs that we can fill out later.

naughty = ['bad word', 'very bad word', 'another line']

with open('data/dummyData1.txt') as f:
    lines = f.read()
    mylist = lines.splitlines()

# check the result so far, but this is where I need to do the rest of the code...
print(mylist)

I’ve taken a slightly different approach to what you may be thinking. You might have been thinking of a solution that uses readlines, because this automatically puts the result into a list. The problem with that approach is that it also adds in the line breaks, so I would have to loop through each element and remove the line break.

What I did was find a function in the string documentation called splitlines, which splits a string into a list at each line break. I already knew that read() will read the entire contents of the file into a string variable. So I put the two together and checked the result before moving on.
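To see why splitlines is a good fit here, this sketch compares it with split('\n') on a string shaped like the output of read(). The variable text stands in for the file contents and assumes the file ends with a newline.

```python
# Stand-in for what f.read() returns (assumes a trailing newline)
text = 'this is a line\nanother line\nthird line is a winner\n'

with_splitlines = text.splitlines()   # no newlines, no empty entry at the end
with_split = text.split('\n')         # leaves an empty string after the last \n

print(with_splitlines)
print(with_split)
```

With split('\n') the trailing newline produces an extra empty element, which splitlines avoids.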

In a previous example we encountered the in keyword.

...
if 'andrea' in students:
    print('The name you are looking for is in the list!')
...

Let’s see if we can use this type of code where we replace ‘andrea’ with mylist and students with the naughty list (i.e. is there anything in our list that is also in the naughty list?). Now, we can’t substitute a list for a string, but we can call upon our knowledge of loops to loop through each element.

Complete Solution

naughty = ['bad word', 'very bad word', 'another line']

with open('data/dummyData1.txt') as f:
    lines = f.read()
    mylist = lines.splitlines()

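# Build a new list instead of deleting while looping, which can skip elements
filtered = []
for line in mylist:
    if line not in naughty:
        filtered.append(line)
mylist = filtered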

print(mylist)
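When filtering a list in a loop, it is worth knowing that deleting elements from the list being looped over can skip entries, because everything after the deleted element shifts down by one position. The sketch below uses made-up data (the lines list is hypothetical) with two naughty entries next to each other to show the difference.

```python
naughty = ['bad word', 'very bad word', 'another line']

# Hypothetical data with two naughty entries side by side
lines = ['this is a line', 'bad word', 'another line', 'third line is a winner']

# Deleting inside the loop: the entry after each deleted one is never checked
risky = list(lines)
for i, v in enumerate(risky):
    if v in naughty:
        del risky[i]

# Building a new list: every entry is checked exactly once
safe = []
for v in lines:
    if v not in naughty:
        safe.append(v)

print(risky)   # 'another line' slips through
print(safe)
```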

Files - extended exercise

The dummyData1.txt file looks like this.

this is a line
another line
third line is a winner

Write a script whose output looks like the screenshot below, where the number of words on each line is calculated in Python.
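As a starting point, here is a rough sketch of the counting step only. It assumes that words are separated by whitespace, and the print layout is a guess rather than the layout from the screenshot. The variable text stands in for what f.read() would return.

```python
# Stand-in for the file contents (assumes a trailing newline)
text = 'this is a line\nanother line\nthird line is a winner\n'

counts = []
for line in text.splitlines():
    words = line.split()        # split on whitespace into a list of words
    counts.append(len(words))
    print(len(words), line)     # layout is an assumption

print(counts)   # [4, 2, 5]
```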