CSC 110 Assignment 7: File I/O (Input/Output)
When you have completed this assignment, you should understand:
That well-tested functions that operate on lists work the same no matter the size of the list
How to read data from a text file using Python
How to hand it in:
Submit your assignment7.py file through the Assignment 7 link on the CSC110 conneX page.
Late submissions will be given a zero grade.
You must use the py file provided to write your solution. Changing the filename or any of the code given in the file will result in a zero grade.
Your function names must match exactly as specified in this document or you will be given a zero grade.
Function arguments must be exactly as specified in this document. Specifically, do not change the number of and/or order of the arguments or you will be given a zero grade for the function.
We will do spot-check grading in this course. That is, all assignments are graded BUT only a subset of your code might be graded. You will not know which portions of the code will be graded, so all of your code must be complete and adhere to specifications to receive
Your code must run without errors on the ECS 258 Lab machines or a zero grade will be given.
It is recommended that you use a plain text editor such as Notepad++ as used in the labs or Atom for Mac computers. We also recommend you run your programs through terminal / command prompt, as shown in the
It is the responsibility of the student to submit any and all correct files. Only submitted files will be marked. Submitting an incorrect file is not grounds for a
If the assignment requires the submission of multiple files, then all files must be submitted.
Marks will be given for…
your code producing the correct output
the tests for your functions providing sufficient coverage (at least 2 tests and enough to cover all possible paths through the code)
your code following good coding conventions (see lab and lecture code for examples)
Documentation of signature and purpose using the format we have followed in lectures and previous
Names of variables should have meaning relevant to what they are storing
Use of whitespace to improve readability
Proper use of variables to store intermediate computation results
Download py from the conneX Files tab and save it to your working directory.
Write your name and student V# at the top of the file
For each of the 9 function specifications provided below you must:
Uncomment the calls to each function's test
You are free to comment out these test calls in your main function as you progress through the assignment, but you MUST leave all of your tests in place for us to grade
Add any tests to the test function that you feel are necessary. The functions have tests provided for you, but you may want to add tests to adequately test each function.
Complete the function definition according to the specification
PART 1: Setting the stage (lists and loops):
An important task in bioinformatics is the identification of DNA and RNA sequences. In this assignment we will be looking at nucleic acid sequences. These sequences contain up to four different bases denoted by letters: A for adenine, C for cytosine, G for guanine, and T for thymine.
Sequence strings are compared in order to determine whether nucleic acid sequences match each other, or are related through mutations. Real sequence data as used by biochemists and in bioinformatics research consist of very long strings of A, C, G and T.
The sequences in this assignment will all contain between 2 and 4 of the possible bases (A, C, G, and T). Your task is to search through a collection of sequence data and count how many times a specific sequence occurs. For example, if the collection contains the sequences [ACTG, GATC, ACT, GTC, AC, GATC, GA] and we search for the specific sequence GATC, we would report that it was found 2 times (the second and sixth elements).
One of the difficulties in this assignment will be dealing with mutated sequences. A mutation can occur due to insertions of additional bases within a sequence. For the purpose of this assignment, a mutated sequence contains at least two of the same bases occurring in a row (so in the sequence GAAATC the A has mutated, and in the sequence CCGGAT both the C and G have mutated). Another task in this assignment is to detect how many of the sequences in the collection are mutated.
The final task will be to search through the collection of sequence data for a specific sequence, treating original and mutated sequences the same. For example, if the collection contains [TGC, AC, TTGC, TACG, TGGCC, AGTC] and we search for the specific sequence TGC, we would report that it was found 3 times (because TTGC and TGGCC are mutated forms of TGC).
Exercise 1 – Find the longest string in a given list of strings
Complete the function design for the find_longest() function, which takes a list of strings as a parameter, and returns the longest string found in the list. If there is a tie (two or more strings are tied for the longest in the list), the string found first in the list is returned.
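The tie-breaking rule above can be sketched as follows. This is one possible approach, not the official solution. Returning an empty string for an empty list is an assumption, since the specification does not say what should happen in that case.

```python
def find_longest(strings):
    """Return the longest string in the list; the earliest one wins a tie."""
    # Assumption: an empty list yields "" (the handout does not specify this).
    longest = ""
    for text in strings:
        # A strict > keeps the first string of a given length, so a later
        # string of equal length never replaces it.
        if len(text) > len(longest):
            longest = text
    return longest


print(find_longest(["ACT", "ACTG", "GATC", "GA"]))  # prints ACTG
```

Using `>` rather than `>=` in the comparison is what makes the first of the tied strings win.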
Exercise 2 – Find the number of times a string occurs in a list
Complete the function design for the get_frequency() function, which takes a list of strings and a string as parameters, and returns a count of the number of elements in the list equal to the given string.
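A minimal sketch of the counting loop, using the example collection from the "Setting the stage" section. The parameter names are illustrative; use the ones given in the provided file.

```python
def get_frequency(sequences, target):
    """Count how many elements of the list are equal to the target string."""
    count = 0
    for sequence in sequences:
        if sequence == target:
            count += 1
    return count


collection = ["ACTG", "GATC", "ACT", "GTC", "AC", "GATC", "GA"]
print(get_frequency(collection, "GATC"))  # prints 2
```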
HINT for Remaining exercises:
The following exercises all involve mutations (read the “Setting the Stage” section above). Exercises involving mutations are a little more difficult.
In this assignment, a mutation occurs when a character in a string is repeated two or more times in a row. Think about how you might be able to detect a mutation in a string. It is very similar to how we compared two adjacent list elements in a few exercises over the past few weeks. In fact, we can assign a prev variable and use slicing on strings the same way we can on lists.
Exercise 3 – Determining if a sequence is mutated
Complete the function design for the is_mutation() function that takes a string and determines if the string is mutated. For this assignment, a mutation means there is at least one occurrence where characters in the string occur two or more times in a row. Look at the hint at the top of the page.
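The prev-variable idea from the hint might look like the sketch below. It is one way to do it under the assumption that an empty string counts as not mutated, which the handout does not state.

```python
def is_mutation(sequence):
    """Return True if any character appears two or more times in a row."""
    # Assumption: an empty string is treated as not mutated.
    if sequence == "":
        return False
    prev = sequence[0]
    # Slice off the first character so each base is compared with the one
    # immediately before it.
    for base in sequence[1:]:
        if base == prev:
            return True
        prev = base
    return False


print(is_mutation("GAAATC"))  # prints True
print(is_mutation("GATC"))    # prints False
```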
Exercise 4 – Removing mutations from a sequence
Complete the function design for the break_mutation() function that removes all mutations. For this assignment, that means returning a string that has all duplicate letters removed from the given string. Remember, duplicate letters will only occur in a row (For example, “AACTTTG” may occur, as the duplicate A’s and T’s are all in a row, but “ATACTGA” would never occur, because the A’s and T’s are not in a row). Look at the hint at the top of the page.
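As a hedged illustration, the same prev-variable pattern can build a new string that keeps a base only when it differs from the one before it. This sketch relies on the guarantee above that duplicates only ever occur in a row.

```python
def break_mutation(sequence):
    """Return the sequence with each run of repeated bases collapsed to one."""
    result = ""
    prev = ""  # no previous base yet, so the first base is always kept
    for base in sequence:
        if base != prev:
            result += base
        prev = base
    return result


print(break_mutation("AACTTTG"))  # prints ACTG
print(break_mutation("TGGGGAA"))  # prints TGA
```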
Exercise 5 – Counting the number of mutated sequences
Complete the function design for the count_total_mutations() function that takes a list of strings as a parameter. The function should return a count of the number of strings in the list that are mutated. You should call one of the functions you designed above in your solution!
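A sketch of how the count can reuse the mutation check. The compact is_mutation() shown here is only a stand-in so the example runs on its own; in your file you would call your own Exercise 3 function.

```python
def is_mutation(sequence):
    """Stand-in for the Exercise 3 function: True if two adjacent bases match."""
    return any(left == right for left, right in zip(sequence, sequence[1:]))


def count_total_mutations(sequences):
    """Count how many strings in the list are mutated."""
    count = 0
    for sequence in sequences:
        if is_mutation(sequence):
            count += 1
    return count


collection = ["TGC", "AC", "TTGC", "TACG", "TGGCC", "AGTC"]
print(count_total_mutations(collection))  # prints 2
```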
Exercise 6 – Counting the number of sequences, and mutations of that sequence
Complete the function design for the frequency_incl_mutations() function that takes a list of strings, and a string. The function should return the count of the number of elements in the list that are equal to that string, OR are equal to a mutation from that string! Remember that in this assignment, a mutation occurs when any base is repeated twice in a row. Removing the mutations from TGGGGAA would result in TGA. So, for this function, the result of searching for “TGA” in a list containing the elements: [“ACTG”, “TGA”, “TTGA”, “TGGGGAA”, “TGAC”] would result in 3, as TGA, TTGA, and TGGGGAA are all TGA or mutations of it. You should call one of the functions you designed above in your solution!
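The example above can be reproduced with the sketch below. It assumes the string being searched for is itself unmutated, and it uses a compact stand-in for break_mutation() (built on itertools.groupby) only so the example is self-contained; your own Exercise 4 function would take its place.

```python
from itertools import groupby


def break_mutation(sequence):
    """Stand-in for the Exercise 4 function: collapse runs of repeated bases."""
    return "".join(base for base, _run in groupby(sequence))


def frequency_incl_mutations(sequences, target):
    """Count elements equal to target once their mutations are removed."""
    # Assumption: target contains no repeated adjacent bases.
    count = 0
    for sequence in sequences:
        if break_mutation(sequence) == target:
            count += 1
    return count


collection = ["ACTG", "TGA", "TTGA", "TGGGGAA", "TGAC"]
print(frequency_incl_mutations(collection, "TGA"))  # prints 3
```

An unmutated element is unchanged by break_mutation(), so a single comparison covers both the "equal" case and the "mutation of" case.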
PART 2: Running your code with large sequences (File IO)!
On the conneX course page in the Files section for Assignment 7, there are a number of text files containing sequence data. Each file contains sequences (strings) separated by spaces. Your task is to read the data from a file and put it into a list. From there you can run each of the functions you designed in Part 1 to obtain some statistics about the input files. Download the 8 data files.
The input files get progressively larger:
txt – 5 sequences, without mutations
txt – 25 sequences, without mutations
txt – 20 sequences, few mutations
txt – 50 sequences, few mutations
txt – 20 sequences, many mutations
txt – 100 sequences, many mutations
txt – 1000 sequences, many mutations
txt – 10000 sequences, many mutations
Exercise 7 – Reading data from a file
Complete the function design for the get_file() function that prompts the user to enter a file name until the user enters the name of a file that can be successfully read from. Try it with data1.txt.
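One way the prompt-until-readable loop might be written is sketched below. Returning the open file object is an assumption (it fits Exercise 8, which takes a file object), and the prompt and error message wording are made up for illustration.

```python
def get_file():
    """Prompt for a file name until one can be opened for reading."""
    while True:
        file_name = input("Enter a file name: ")
        try:
            # Assumption: the caller wants the open file object back and is
            # responsible for closing it when finished.
            return open(file_name, "r")
        except OSError:
            # OSError covers a missing file, a directory name, and
            # permission problems.
            print("Could not read", file_name)
```

Calling get_file() and entering a misspelled name followed by data1.txt should print the error message once and then return the opened file.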
Exercise 8 – Creating a list of sequences from a file
Complete the function design for the make_list() function that takes a file object as a parameter, and creates and returns a list of strings containing all of the strings found in the file.
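A minimal sketch of reading space-separated sequences into a list. It uses io.StringIO as a stand-in for an opened data file so the example runs without any of the course files, and it assumes sequences may also be spread across several lines.

```python
import io


def make_list(file_handle):
    """Return a list of all whitespace-separated strings in the file."""
    sequences = []
    for line in file_handle:
        # split() with no argument splits on any run of spaces and drops the
        # trailing newline, so no empty strings end up in the list.
        sequences.extend(line.split())
    return sequences


# io.StringIO behaves like an open text file for the purposes of this demo.
sample_file = io.StringIO("ACTG GATC ACT\nGTC AC\n")
print(make_list(sample_file))  # prints ['ACTG', 'GATC', 'ACT', 'GTC', 'AC']
```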
Exercise 9 – Analysis of the given test files
Download a7_tester.py and run it. It should call all of the functions you have created so far. The program asks you to enter a file name to read (like data1.txt or data7.txt), and then a sequence to search for. Here is some sample output based on our solution (with the things I entered underlined in red):
On the conneX Discussion Board, post your results when you search for other sequences, and in other text files (like data7.txt). That way you can all compare your results with each other.