CS8, Fall 2010

lab07
working with files


It may be helpful to read pp. 155-160 and do Homework H10 first

Homework H10 covers pages 155–160 from Section 5.2 in your textbook, which describes reading data files.

If you finish with lab06 during your Friday discussion section and you want to start on lab07 before reading pp. 155–160, then fine—you can probably get some things done even if you haven't already read pp. 155–160 and done Homework H10.

But if you can do those things first, it will probably help.

Goals for this lab

By the time you have completed this lab, you should understand the concepts of opening files, reading from files, and closing files.

In particular, you'll be able to:

This will also provide additional practice with the concepts of "index" vs. "value".

Step by Step Instructions

Step 0: Preliminaries

The preliminaries for this lab are similar to those for lab06:

There's no starting point file this week; we are going to build up the file a little bit at a time.

Step 1: Understanding the role of data files

In working with Python so far, we have mostly been working with either one of two situations:

However, in many real world applications, what we ultimately want to do is examine some data, and answer some questions about the data. Here are some examples:

A data file with students and majors

Here is a data file listing some students' names and their majors.

SHARON, ROBINSON, PHYSICS 
BRIAN, CLARK, MUSIC
MICHELLE, RODRIGUEZ, UNDEC
RONALD, LEWIS, MATH
LAURA, LEE, ENGLISH
ANTHONY, WALKER, UNDEC
SARAH, HALL, CS
KEVIN, ALLEN, CHEM
KIMBERLY, YOUNG, UNDEC
JACOB, HERNANDEZ, STATS

Suppose we want to know:

Each line in this file is in the following format:

FIRSTNAME, LASTNAME, MAJOR

As we will see, we can use various Python functions and methods to turn this into three lists: lastNames, firstNames, and majors, and then process the data.

A data file with information on earthquakes

Here is some data on earthquakes that occurred during a three-hour period on Monday, September 7, 2010.

Src,Eqid,Version,Datetime,Lat,Lon,Magnitude,Depth,NST,Region
nc,71272610,0,"Monday, September 7, 2010 16:49:28 UTC",38.8148,-122.8097,1.0,2.00,12,"Northern California"
nn,00291988,1,"Monday, September 7, 2010 16:48:29 UTC",37.1950,-114.9640,1.6,0.00,24,"Nevada"
nn,00291985,1,"Monday, September 7, 2010 16:37:29 UTC",39.3630,-118.1420,1.9,2.00,17,"Nevada"
ak,10008422,1,"Monday, September 7, 2010 16:25:46 UTC",62.5900,-148.8269,2.0,5.50,07,"Central Alaska"
us,2010lgbc,6,"Monday, September 7, 2010 16:24:27 UTC",42.1945,142.7938,5.0,62.10,34,"Hokkaido, Japan region"
pr,p0925006,1,"Monday, September 7, 2010 16:23:45 UTC",19.5597,-68.9285,3.7,68.00,27,"Dominican Republic region"
us,2010lgba,6,"Monday, September 7, 2010 16:12:21 UTC",-10.2017,110.6088,6.1,15.90,61,"south of Java, Indonesia"
ci,14507596,1,"Monday, September 7, 2010 15:24:51 UTC",33.9095,-118.3215,1.7,12.40,30,"Greater Los Angeles area, California"
ak,10008412,1,"Monday, September 7, 2010 14:57:39 UTC",61.1263,-151.9616,1.8,100.00,14,"Southern Alaska"
us,2010lga8,Q,"Monday, September 7, 2010 14:42:22 UTC",23.6065,126.3178,4.8,10.00,33,"southeast of the Ryukyu Islands, Japan"
ci,14507588,1,"Monday, September 7, 2010 14:13:01 UTC",33.9783,-116.8166,1.5,12.10,24,"Southern California"

Similar data can be found at the link below for earthquakes from the past hour, day, or week.

This data is a bit more complex—it is in a format called "CSV", or "comma-separated values".
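As a rough illustration of why CSV needs more care than the student file: some fields, such as the date and the region, contain commas inside double quotes, so a plain split(',') would cut them apart. Python's standard csv module understands the quoting. The sketch below parses the header and two of the rows above from a string rather than from a file; sampleData, magIndex, and regionIndex are names made up for this example.

```python
import csv
import io

# Header plus two rows copied from the earthquake data above.
# sampleData is a made-up name for this illustration.
sampleData = '''Src,Eqid,Version,Datetime,Lat,Lon,Magnitude,Depth,NST,Region
nc,71272610,0,"Monday, September 7, 2010 16:49:28 UTC",38.8148,-122.8097,1.0,2.00,12,"Northern California"
us,2010lgba,6,"Monday, September 7, 2010 16:12:21 UTC",-10.2017,110.6088,6.1,15.90,61,"south of Java, Indonesia"
'''

# io.StringIO lets csv.reader treat the string as if it were an open file
reader = csv.reader(io.StringIO(sampleData))
header = next(reader)      # the first row holds the column names
rows = list(reader)        # each remaining row becomes a list of strings

# Look up column positions by name instead of counting by hand
magIndex = header.index('Magnitude')
regionIndex = header.index('Region')

for row in rows:
    print(row[regionIndex], float(row[magIndex]))

# For comparison: a naive split(',') also cuts inside the quoted fields
naive = sampleData.splitlines()[1].split(',')
print(len(rows[0]), len(naive))
```

The last line prints the number of fields found by each approach, which shows that the naive split produces more pieces than there are columns.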

With this data we may want to ask questions such as:

Why learning about data files is important

In the real world, this data often comes from a file on a hard drive. So learning to read data from a file into a program is a very important concept, and a very practical and useful tool.

The simplest kind of file is a basic "text file", the kind of file you can edit in one of several ways:

There are other real world data sources—such as web sites, databases, and spreadsheets, just to name a few. In practice, working with each of these data sources involves variations on basic techniques that we can first learn by working with plain text files. So, that is a great place to start.

Of course, there are many programs that you can use to work with data files—spreadsheet programs are the most common. Most of the questions we'll be exploring in this lab are questions you could answer with a simple spreadsheet like Excel. However, there are three things to keep in mind:

Read more about working with data files (and review strings and lists) in your textbook.

Your textbook discusses working with data files in Chapter 5, section 5.2.

It will help to read that section (or review it) before working on this lab.

You may also find it helpful to review these sections:

Step 2: Creating data files

Open up IDLE

Then open up a window where you can create a new file. This time, however, we are not going to create a Python program (at least not at first.) We are going to create a data file.

In the new window, type in the following (or copy and paste it from this page into the window.)

SHARON, ROBINSON, PHYSICS 
BRIAN, CLARK, MUSIC
MICHELLE, RODRIGUEZ, UNDEC
RONALD, LEWIS, MATH
LAURA, LEE, ENGLISH
ANTHONY, WALKER, UNDEC
SARAH, HALL, CS
KEVIN, ALLEN, CHEM
KIMBERLY, YOUNG, UNDEC
JACOB, HERNANDEZ, STATS

Then, save the file with the name students.txt

Now that you've saved this file, it should be possible to open this file at the Python prompt. Try typing this at the Python prompt (note that you don't type the >>>—that's the prompt!)

>>> infile = open('students.txt','r')
>>>

If you get back another >>> prompt as shown above, you are in good shape. If so, congratulations—move on to Step 3.

If not, see the "What if it doesn't work" section below.

What if it doesn't work?

If, instead, you see something like the following, then something is wrong:

>>> infile = open('students.txt','r')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    infile = open('students.txt','r')
  File "/Library/Frameworks/Python.framework/Versions/3.0/lib/python3.0/io.py", line 278, in __new
    return open(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.0/lib/python3.0/io.py", line 222, in open
    closefd)
  File "/Library/Frameworks/Python.framework/Versions/3.0/lib/python3.0/io.py", line 619, in __init__
    _fileio._FileIO.__init__(self, name, mode, closefd)
IOError: [Errno 2] No such file or directory: 'students.txt'
>>>

The most important part of this is the last line: "No such file or directory: 'students.txt'"

What Python is telling us is that it can't find this file. This may be because you didn't name it properly, or it may be because Python is not looking in the right place for it.
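One way to investigate, offered as a sketch rather than a required step, is to ask Python which directory it is currently looking in and which files it can see there. The functions used (os.getcwd, os.listdir, os.path.exists) are part of Python's standard os module; the variable names currentDir, txtFiles, and found are made up for this example.

```python
import os

# The directory Python is currently "in"; a plain name such as
# 'students.txt' is looked up in this directory.
currentDir = os.getcwd()
print("Python is looking in:", currentDir)

# List the .txt files that Python can see in that directory
txtFiles = [name for name in os.listdir(currentDir) if name.endswith('.txt')]
print("Text files here:", txtFiles)

# Check whether the file is there before trying to open it
found = os.path.exists('students.txt')
print("students.txt found?", found)
```

If students.txt does not appear in the list, either the file was saved under a different name or it was saved in a different directory from the one printed on the first line.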

A note about the home directory symbol ~

As you may know ~ is a symbol for your home directory that you can use when you type commands in at the -bash-4.1$ prompt (i.e. the shell.)

However, you usually can't use ~ as an abbreviation for your home directory when accessing a file inside a program (whether that program is in Python, or C, or Java, or whatever).

That's because ~ is a symbol for home directory only in the shell, e.g. at the -bash-4.1$ prompt

Instead, determine what the full path to your home directory is by opening a new terminal window, doing a cd to return to your home directory, and then typing pwd. What comes up is the full path to your home directory, e.g. /fs/home1/student/j/jsmith or /cs/student/jsmith

So, that's what you need to use in your where variable:
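A minimal sketch of one way this might look. The name where comes from this lab, but the cs8/lab07 directory under the home directory is only an assumption for illustration; substitute whatever pwd printed for you. Python's os.path.expanduser function can also work out the home directory path for you, and it is used here as an alternative to typing the path by hand.

```python
import os

# Full path to the home directory, worked out by Python instead of typed
# by hand (normally the same directory that cd followed by pwd shows)
home = os.path.expanduser('~')

# Assumed location of the lab files; adjust to match your own directories
where = os.path.join(home, 'cs8', 'lab07')

# Full path to the data file
fullName = os.path.join(where, 'students.txt')
print(fullName)

# Only try to open the file if it is really there
if os.path.exists(fullName):
    infile = open(fullName, 'r')
    infile.close()
```

Using os.path.join avoids having to remember whether where ends with a slash.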

Talk it over with your pair partner to make sure you both understand---and if it still isn't clear, ask your TA or instructor for help.

Step 3: Opening an input file, reading some data, closing the file

If you have reached this point, you've already created a file named 'students.txt' with lines such as

SHARON, ROBINSON, PHYSICS
BRIAN, CLARK, MUSIC
etc...
and you've opened it with a line like this (possibly with an extra where variable if needed to specify the location of the file)

>>> infile = open('students.txt','r')
>>>

Now, the question is, what can we do with an open file?

Since we opened this file with the 'r' flag, as you may know from the reading in Section 5.2 of your textbook, this is a file we have opened for reading. That means we can read data from this file into Python variables, as shown below. Try this at the IDLE prompt:

>>> for line in infile:
      print(line)

You should get output like this. Notice the blank line after each name: each line read from the file already ends with a newline, and print() adds a second one.

>>> for line in infile:
      print(line)

SHARON, ROBINSON, PHYSICS 

BRIAN, CLARK, MUSIC

MICHELLE, RODRIGUEZ, UNDEC

RONALD, LEWIS, MATH

LAURA, LEE, ENGLISH

ANTHONY, WALKER, UNDEC

SARAH, HALL, CS

KEVIN, ALLEN, CHEM

KIMBERLY, YOUNG, UNDEC

JACOB, HERNANDEZ, STATS

>>>

We see that we can read data from the file into a variable and use a for loop to print each line. But it would be more useful to store the data in a list, or in multiple lists, because that gives us more options as we work with the data. We'll do that in the next step.

First, though, we need to close the file. That frees up resources on the computer, and allows us to reopen the file and start reading it from the beginning. So, type this next:

>>> infile.close()
>>>
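Python also offers a with statement that closes the file automatically when the indented block ends, even if an error occurs inside it. This is not the style used in the rest of this lab; it is shown only as an alternative. To keep the sketch self-contained, it first writes a small two-line file named demo_students.txt (a made-up name) into a temporary directory.

```python
import os
import tempfile

# Create a small sample file so this example can run on its own
demoDir = tempfile.mkdtemp()
demoName = os.path.join(demoDir, 'demo_students.txt')
outfile = open(demoName, 'w')
outfile.write('SHARON, ROBINSON, PHYSICS\nBRIAN, CLARK, MUSIC\n')
outfile.close()

# The file is open only inside the indented block
with open(demoName, 'r') as infile:
    for line in infile:
        print(line)

# No infile.close() is needed; the with statement already closed the file
print(infile.closed)
```

The closed attribute of a file variable reports whether the file has been closed.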

Something to notice...

Notice the difference between the syntax of:

infile = open('students.txt','r')
infile.close()

In the first case, we are using the open function, and writing an assignment statement that creates a new variable called infile. We can choose any name we want for this variable (e.g. studentFile, theFile, etc.) It is not required that this variable end with "file"—for example, we could call it inputData, or even fred if we want. But it is good practice to name it something that reminds us it is a variable that stands for an open file.

In the second case, we are using a method of the file variable. That is why we write the variable name infile first, then a dot (.), and finally the name of the method (close).

To review...

In this step, we did three things:

We'll do these three things again in Step 4, but instead of just printing out the data, we'll read it into a list of strings so we can work with it further, even after the file is closed.

Step 4: Reading data into a list of strings

In this step, we'll read the data into a list of strings. Here's how: type the following in at the Python prompt:

>>> infile = open('students.txt','r')
>>> inputList=[]
>>> for item in infile:
      inputList = inputList + [item]

>>> infile.close()
>>>

Once you've typed in those lines, you can type the name of the variable inputList at the Python prompt to see what inputList contains. You should see something like this:

>>> inputList
['SHARON, ROBINSON, PHYSICS \n', 'BRIAN, CLARK, MUSIC\n', 'MICHELLE, RODRIGUEZ, UNDEC\n', 'RONALD, LEWIS, MATH\n', 'LAURA, LEE, ENGLISH\n', 'ANTHONY, WALKER, UNDEC\n', 'SARAH, HALL, CS\n', 'KEVIN, ALLEN, CHEM\n', 'KIMBERLY, YOUNG, UNDEC\n', 'JACOB, HERNANDEZ, STATS\n']
>>>

It should be clear from the [ ] characters and the commas, but in case it isn't, what we have is a list of strings:

>>> type(inputList)
<class 'list'>
>>> for item in inputList:
      print(type(item))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
>>>
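Each string in inputList still ends with the newline character '\n' that marked the end of the line in the file, and the first one also has a space before it. The following sketch shows how the strip() string method removes that trailing whitespace. It uses a two-item list copied from the output above; sampleList and cleanList are made-up names.

```python
# Two items copied from the inputList output above
sampleList = ['SHARON, ROBINSON, PHYSICS \n', 'BRIAN, CLARK, MUSIC\n']

# Accumulator pattern: build a new list of cleaned-up strings
cleanList = []
for item in sampleList:
    cleanList = cleanList + [item.strip()]

print(cleanList)

# The newline counts as one character, so the stripped string is shorter
print(len(sampleList[1]), len(cleanList[1]))
```

Note that strip() returns a new string; the strings in sampleList are left unchanged.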

  

So, what can we do with this list? Well, one of the things we can do is instead of building one list, we can build three lists:

But this is going to involve more code than we'll want to type in at once. So, we'll put this in a Python file.

In the file that you started for this lab, i.e. lab07.py, put in the following lines of code:

 

infile = open('students.txt', 'r')

# set up three empty lists
fnames = []
lnames = []
majors = []

for item in infile:
    itemStripped = item.strip()           # remove the newlines
    itemSplit = itemStripped.split(',')   # split into a list at the comma

    # now, itemSplit[0] is the first name,
    #      itemSplit[1] is the last name
    #      itemSplit[2] is the major

    # use the accumulator pattern to add these items to the lists
    # strip each one, to get rid of extra spaces (the ones after the commas)
    fnames = fnames + [itemSplit[0].strip()]
    lnames = lnames + [itemSplit[1].strip()]
    majors = majors + [itemSplit[2].strip()]

# we are done reading, so close the file
infile.close()

This code contains the strip() function and the split() function. These functions are both string methods—that is, they are applied to a string variable using the dot operator, e.g. item.strip() or itemStripped.split(',')
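To see what each method does, here is a short sketch that applies them to a single line, written out as a string instead of being read from the file. The variable names fname, lname, and major are made up for this example.

```python
# One line exactly as it is read from the file
item = 'SHARON, ROBINSON, PHYSICS \n'

# strip() removes whitespace (spaces, newlines) from both ends
itemStripped = item.strip()

# split(',') cuts the string at each comma and returns a list of pieces
itemSplit = itemStripped.split(',')
print(itemSplit)

# Every piece after the first still starts with a space,
# so each piece is stripped as well
fname = itemSplit[0].strip()
lname = itemSplit[1].strip()
major = itemSplit[2].strip()
print(fname, lname, major)
```

Splitting at ',' leaves the space that followed each comma at the front of the next piece, which is why the second round of strip() calls is needed.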

 

Put this code in your file, and use the Run/Run Module [F5] command to run the code. Assuming you have no errors, you should now be able to see the values of the fnames, lnames and majors lists at the Python prompt:

>>> ================================ RESTART ================================
>>> 
>>> fnames
['SHARON', 'BRIAN', 'MICHELLE', 'RONALD', 'LAURA', 'ANTHONY', 'SARAH', 'KEVIN',
 'KIMBERLY', 'JACOB']
>>> lnames
['ROBINSON', 'CLARK', 'RODRIGUEZ', 'LEWIS', 'LEE', 'WALKER', 'HALL', 
'ALLEN', 'YOUNG', 'HERNANDEZ']
>>> majors
['PHYSICS', 'MUSIC', 'UNDEC', 'MATH', 'ENGLISH', 'UNDEC', 'CS', 
'CHEM', 'UNDEC', 'STATS']
>>> 

Now that we have this data, we can do lots of different kinds of computations on this data.

For example, here is a function that, given the first name of a student, returns a string indicating their major. (Note that if more than one student matches, we'll only return the first match, because any given function call can only return once!)

Add this function into your file, and then run it again:

def whatMajor(fnames, majors, thisFirstName):
    """
    return the major of the student with thisFirstName

    whatMajor: listOfStrings, listOfStrings, str -> str

    consumes:
       fnames: a list of strings, containing first names
       majors: a list of strings, containing majors
       thisFirstName: the first name of a student you want to find the major for
    produces:
       that student's major, or the empty string if that student was not
          found in the list
       if more than one student has that first name, we return the first match.
    """

    for i in range(len(fnames)):

        # step through every item in the list of fnames
        # when you find a match, return that student's major

        if fnames[i] == thisFirstName:
            return majors[i]

    # if you got all the way through the loop and didn't find
    #  the name, return an empty string

    return ""

Once this is in your file, you should be able to inquire about the majors of various students, as shown here. Try it yourself:

>>> ================================ RESTART ================================
>>> 
>>> fnames
['SHARON', 'BRIAN', 'MICHELLE', 'RONALD', 'LAURA', 'ANTHONY', 'SARAH', 
'KEVIN', 'KIMBERLY', 'JACOB']
>>> lnames
['ROBINSON', 'CLARK', 'RODRIGUEZ', 'LEWIS', 'LEE', 'WALKER', 'HALL',
 'ALLEN', 'YOUNG', 'HERNANDEZ']
>>> majors
['PHYSICS', 'MUSIC', 'UNDEC', 'MATH', 'ENGLISH', 'UNDEC', 'CS', 'CHEM', 
'UNDEC', 'STATS']
>>> whatMajor(fnames,majors,"SHARON")
'PHYSICS'
>>> whatMajor(fnames,majors,"LAURA")
'ENGLISH'
>>> whatMajor(fnames,majors,"FRED")
''
>>> 
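To see the "first match" behavior concretely, here is a small sketch with a made-up list in which two students share a first name. The whatMajor function is repeated in shortened form so the example runs on its own; testNames and testMajors are hypothetical data, not part of students.txt.

```python
def whatMajor(fnames, majors, thisFirstName):
    """return the major of the first student named thisFirstName"""
    for i in range(len(fnames)):
        if fnames[i] == thisFirstName:
            return majors[i]     # the index i ties a name to its major
    return ""                    # no match was found

# Hypothetical data: two students are named SARAH
testNames = ['SARAH', 'KEVIN', 'SARAH']
testMajors = ['CS', 'CHEM', 'MATH']

# Only the SARAH at index 0 is ever reached
print(whatMajor(testNames, testMajors, 'SARAH'))

# The comparison is case-sensitive, so lowercase input finds no match
print(whatMajor(testNames, testMajors, 'sarah'))
```

This also illustrates "index" vs. "value": the loop variable i is an index, while fnames[i] and majors[i] are the values stored at that index in each list.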

Now, just for fun, add your own name into the list of students, along with your major, and that of a few of your friends. Run the program again, and search for your own name.

Now it is your turn to do some coding!

Step 5: Writing functions that work with fnames, lnames, majors

Now you need to write four functions on your own.

In each case, be sure to include a "docstring comment" (similar to the one in the whatMajor() function above) that starts with a one line description of the function, then indicates what the function consumes (takes as parameters) and what it produces (i.e. what it returns).

Step 5a: whatLName()

Using the function whatMajor() as a model, write a function that will return the last name of a student, given the student's first name. For now, ignore the possibility that there might be more than one student with a given first name—just return the first match.

The parameters to your function should be the list of first names, the list of last names, and the first name to search for.

Step 5b: countUndec()

Now write a function that counts the number of students in the list that have "UNDEC" as their major. You'll need to pass in only the list of majors, and return the answer as an int.

One of the functions you wrote for lab06 is a good model for this function—but which one? That's up to you to decide.

Step 5c: lNamesOfUndec()

Now write a function that returns the last names of all the students that have "UNDEC" as their major. You'll need to pass in the list of last names, and the list of majors, and return the answer as a list of strings. If there are no UNDEC majors in the list, return an empty list as the result.

Again, one of the functions you wrote for lab06 is a good model for this function—but which one? That's up to you to decide.

Step 5d: majorToLNames()

Now generalize the lNamesOfUndec() function—write a function that works exactly the same way, except that it takes another parameter: thisMajor. Instead of looking for "UNDEC", look for thisMajor. Return a list of strings containing all the last names of the students with the major specified by thisMajor. Return an empty list if there are no such students.

When you've written these four functions, you are ready to submit!

Final submission

Do your final inspection (see lab06 for guidelines on what to look for).

Be sure that your file contains both pair partners' names, and:

Then submit your lab07 directory on CSIL using this command: turnin lab07@cs8 lab07

 


Evaluation and Grading Rubric (150 pts)

Note: a change in the grading rubric from previous labs

In previous labs, you earned points for following certain standard professional software practices, and following instructions. For example:

Starting with this lab, complying with these requirements does not earn any points—these are simply "normal expectations" we have of you. But, there may be deductions from the points you did earn if you fail to follow these.

Grading rubric for lab07

Due Date: Friday November 5, 5pm (same as lab06)