CSIS 485 Assignments


Generally, work will be submitted electronically.

For textbook and other non-programming assignments, name your submission according to the following pattern:

<username>_<assignment>.<ext>
For example, consider a student named Jane Smith with a GFU email address of jsmith15@georgefox.edu; her username is jsmith15. When submitting an assignment called "Assignment 7", she would name her file jsmith15_Assignment7.pdf.

Unless otherwise specified, programs will be submitted as plain-text source code (typically via version control) and non-programming assignments will be submitted as a single PDF file. For non-programming assignments consisting of multiple files, solutions will be submitted as a ZIP archive. No handwritten work will be accepted unless explicitly requested as part of a specific assignment.

For non-programming assignments, ensure that each file you submit identifies you in an appropriate header or title page format:

For programming assignments, include your GFU email address in a header comment in each file, using the appropriate tags (e.g., a Javadoc or Doxygen @author tag).

New assignments are added as the semester progresses. Check back often.

Weekly

Engineering Your Soul (EYS)
Read the assigned reading and participate as directed on the course syllabus page.

Due 1/25

Assignment 0
Fill out this brief survey.

Due 1/27

Lab 1: File I/O, part 1 (10 participation points)
Fork the starter repository on GitLab, taking care to make it private in your own CSIS 485 namespace (not in your "Personal" namespace). Note: some IDEs—such as PyCharm, with its Get from Version Control option—let you start a new project by cloning an existing repository; this is a convenient option to use here. You will add and commit all your changes in the master branch in your private repo; you can (and should!) push your changes to GitLab at least nightly for safekeeping.

Using the provided landings.py file as a starting point, complete the following:

  • Download the meteorite landings data in CSV format from NASA and move the data file to your project directory
  • Write Python code to open the CSV file and read the name, mass, and year for each meteorite in the file, where each line in the file represents one meteorite
  • As you read each line of the file, keep track of the number (count) of meteorites, the minimum, maximum, and total masses
  • After you've read each line, close the file, then calculate and print the following information:
    • The total number (i.e., count) of meteorites in the file
    • The average mass of meteorites
    • The name, mass, and year of the smallest (i.e., lowest mass) meteorite
    • The name, mass, and year of the largest meteorite

The intention of this lab is to use only Python's built-in csv module, without relying on NumPy, Pandas, or similar packages; we'll use those packages later in the course.
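A minimal sketch of the reading loop follows; it is not the required solution, and the file name and column headers (e.g., "mass (g)") are assumptions about the NASA export that you should adjust to match your download:

    # Hedged sketch only; adjust the file name and column names to match your data.
    import csv

    count = 0
    total_mass = 0.0
    min_row = max_row = None

    with open("Meteorite_Landings.csv", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            mass_str = row.get("mass (g)", "").strip()
            if not mass_str:
                continue                     # skip meteorites with no recorded mass
            mass = float(mass_str)
            count += 1
            total_mass += mass
            if min_row is None or mass < float(min_row["mass (g)"]):
                min_row = row
            if max_row is None or mass > float(max_row["mass (g)"]):
                max_row = row

    print("count:", count)
    print("average mass (g):", total_mass / count)
    # The year is printed in its raw form here; extract just the four-digit year as needed.
    print("smallest:", min_row["name"], min_row["mass (g)"], min_row["year"])
    print("largest:", max_row["name"], max_row["mass (g)"], max_row["year"])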

Submit your code by pushing your master branch to GitLab. Your most recent commit in this branch before the deadline will be cloned and used for grading. Note that all of this should be in your private fork, and only visible to you and to me.

Due 2/3

Lab 2: File I/O, part 2 (10 participation points)
Create a new branch named numpy from your master branch, and check it out. In the numpy branch, update your code to use NumPy's genfromtxt function instead of Python's csv module to load the meteorite landings data.

Note that the genfromtxt function will load the entire dataset into memory and store it in a multidimensional array. When loading the dataset, use a custom dtype value of None to instruct NumPy to automatically determine and use a structured type to describe each field in a row of meteorite landing data, rather than manually parsing values from strings as you did in the previous version. You might also implement a custom converter to convert the year column from its longer date/time format to a simple four-digit year value during loading.

Furthermore, use NumPy's statistics functions rather than your own manual calculations. You might also find NumPy's argmin and argmax functions useful here, especially on a slice or view of the mass field.
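As a rough illustration (not a complete solution), a genfromtxt call along these lines infers a structured dtype from the header row and converts the year column while loading. The file name, column names, and MM/DD/YYYY date format are assumptions about the NASA landings CSV; note also that genfromtxt does not understand quoted fields, so a column whose quoted values contain commas (such as GeoLocation) may need to be removed or handled separately before this will load cleanly.

    # Hedged sketch only; adjust file name, field names, and date handling to your data.
    import numpy as np

    def to_year(value):
        # Keep only the four-digit year from the longer date/time string (format assumed).
        text = value.decode() if isinstance(value, bytes) else str(value)
        return int(text[6:10]) if len(text) >= 10 else -1

    data = np.genfromtxt(
        "Meteorite_Landings.csv",
        delimiter=",",
        names=True,                    # take field names from the header row
        dtype=None,                    # let NumPy infer a structured dtype
        encoding="utf-8",
        converters={"year": to_year},  # convert date/time strings to integer years
    )

    mass = data["mass_g"]              # a header like "mass (g)" is sanitized to "mass_g"
    print("count:", len(data))
    print("average mass (g):", np.nanmean(mass))
    i_small, i_large = np.nanargmin(mass), np.nanargmax(mass)
    print("smallest:", data["name"][i_small], mass[i_small], data["year"][i_small])
    print("largest:", data["name"][i_large], mass[i_large], data["year"][i_large])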

The goal of this lab is to eliminate as much manual file processing and calculation code as possible, using functionality provided by NumPy instead, while still ensuring that the results are correct per the specifications outlined in the previous lab. It is very much expected that you will delete—or otherwise significantly rewrite—many lines of code as you update your program. Beyond that, you are welcome to improve other aspects of your program as you see fit.

Submit your code by pushing your numpy branch to GitLab.

Due 2/10

Lab 3: File I/O, part 3 (10 participation points)
Create a new branch named matplotlib from your numpy branch, and check it out. In the matplotlib branch, update your code to use Matplotlib's pyplot API to produce the following figures:

  • An x–y scatter plot of year (x-axis) versus mass (y-axis, using a log scale)
  • A bar plot of the total recorded meteorite mass by decade
  • A histogram (or bar plot) of the count of each unique class (as recorded in the recclass column) of lunar meteorite across all years

To plot the lunar meteorites, you can identify meteorites where the meteorite class starts with the string 'Lunar', and use the resulting indices to slice out just the desired meteorites of lunar origin.
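For example, with a structured array loaded by genfromtxt, a boolean mask like the following can select the lunar rows; the recclass field name is an assumption about your loaded data:

    # Hedged sketch: mask for meteorite classes beginning with 'Lunar'.
    import numpy as np

    is_lunar = np.char.startswith(data["recclass"].astype(str), "Lunar")
    lunar = data[is_lunar]             # structured subarray of only the lunar meteorites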

Note that you will likely need to update your genfromtxt call to include (and possibly convert) additional columns from the data file, especially if you previously only used the name, mass, and year columns.

For each plot, take care to add x- and y-axis labels, and include units of measure where relevant. Each axis should also have sensible tick marks and tick labels.
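As a minimal sketch of the expected labeling (assuming year and mass arrays have already been pulled from the loaded data), the first figure might be set up along these lines:

    # Minimal labeling sketch; year and mass are assumed 1-D arrays from your data.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(year, mass, s=4)
    ax.set_yscale("log")               # mass spans several orders of magnitude
    ax.set_xlabel("Year")
    ax.set_ylabel("Mass (g)")
    ax.set_title("Meteorite mass by year")
    plt.show()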

Submit your code by pushing your matplotlib branch to GitLab.

Due 2/8

Assignment 1: Data Exploration, part 1 (10 points)
Fork the starter repository to your namespace.

For the first part of this two-part assignment, your task is to decide on a topic that interests you, and then find one or more sources of data relevant to that topic. Once you find a good source of data (ideally in some plain-text, human-readable format, e.g., CSV or TSV), write up a brief description of the data that you've found.

The provided README.md file in the starter repository uses Markdown to provide some structure to the document, which the GitLab web interface will render in a visually pleasing way when viewed in a web browser. The file already includes some placeholder content from our meteorite data to give you an example; delete this placeholder content and replace it with information about your chosen topic under each section heading. (Note: you may find it helpful to edit the file in your web browser using the "Web IDE" option; this will let you see a live preview of your rendered Markdown document as you edit.)

For the next part of this two-part assignment, you'll use the techniques you've learned so far in the course to load your data from file, handle any missing data, convert fields as needed, and then do some exploratory data analysis to get a better feel for the data. You will then present some visualizations of your data that you find interesting to the rest of the class.

Once you've updated the README.md file with your topic, links to one or more data sources, and your brief description, add and commit your changes. You do not need to add your actual data file(s) to your repository for grading purposes, but you are welcome to do so if you like.

Submit your work by pushing your master branch to GitLab.

Due 2/17

Assignment 2: Data Exploration, part 2 (40 points)
For the second part of this two-part assignment, complete the following:

  • Create a new Python source code file named explore.py in your PyCharm project, and add the new unversioned file to your repository. You will implement code to achieve all of the following in this one file.
  • Apply the techniques you've learned so far in the course to load your data from file, taking care to handle missing or invalid data appropriately. You are welcome to use Python's csv module, or NumPy's genfromtxt function.
  • Compute some descriptive statistics on relevant columns of your data. For example, using our meteorite data, one might report the total number of meteorites in the file, the min/max/mean meteorite mass, the average number of meteorites per year, etc.
  • Generate three figures to visually describe three different aspects of your data, similar to what you did for an earlier lab activity. Ideally, your three figures will each be a different type of plot (e.g., bar plot, scatter plot, line plot). Take care to clearly label your axes, specify units of measure, use color as appropriate, etc. You should also add a relevant title to each figure.
  • Save each figure to a separate PDF file. Name these files plot1.pdf, plot2.pdf, and plot3.pdf and add them to your repository. (A brief savefig sketch follows this list.) You are welcome to include additional plots beyond the required three if you like; use the same naming convention (up to plot9.pdf) if you do so.
  • Using a tool of your choice, create a small set of presentation slides with the following content per slide, in order:
    1. A title slide listing your topic, your name, your major(s), and a brief note about why you chose the topic or type of data that you did
    2. A brief description of the contents of your data file, including interesting descriptive statistics, and any anomalies you had to deal with when loading the file
    3. The one plot (of the required three plots) that you like best, are most proud of, or feel tells a compelling story about your data
  • Export your three slides to a single PDF file named slides.pdf and add the file to your repository.
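For the PDF files mentioned above, each Matplotlib figure can be written straight to PDF; a minimal sketch (plotting details omitted) follows:

    # Hedged sketch: saving a Matplotlib figure as a PDF file.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # ...plot your data on ax, then label the axes and add a title...
    ax.set_title("Example figure")
    fig.savefig("plot1.pdf")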

You will talk us through your slides in class, and will receive some small number of points on this assignment based on peer evaluations from others. Each student will have three minutes to present; you will remain seated, and I will project your slides from the front.

Take care to actually add and commit each file listed above to your repository, and submit your work by pushing your master branch to GitLab.

Due 3/3

Lab 4: File I/O, part 4 (10 participation points)
Create a new branch named pandas from your matplotlib branch, and check it out. In the pandas branch, update your code to use Pandas' read_csv function instead of NumPy's genfromtxt function to load the meteorite data as a Pandas DataFrame rather than a NumPy array.

Once you have loaded the data, use Pandas functionality to complete the following tasks:

  • Print the total number of meteorites that have no mass (i.e., blank or zero mass)
  • Print the average mass of the meteorites that have a mass greater than zero
  • Print the name, mass, and year of the largest meteorite

Finally, use the plot.scatter function to produce a scatter plot of year (x-axis) versus mass (y-axis, using a log scale) of all meteorites that fell during or after the year you were born. For your plot, take care to add x- and y-axis labels, and include units of measure where relevant. Each axis should also have sensible tick marks and tick labels.
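One hedged way the pieces might fit together is sketched below; the file name, column names (e.g., "mass (g)"), the date format, and the year 1990 (standing in for your birth year) are all assumptions to adjust:

    # Hedged Pandas sketch; column names and the date format are assumptions.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("Meteorite_Landings.csv")

    # The raw "year" column holds a longer date/time string; reduce it to a numeric year.
    df["year"] = pd.to_datetime(df["year"], errors="coerce").dt.year

    no_mass = df[df["mass (g)"].isna() | (df["mass (g)"] == 0)]
    print("meteorites with no mass:", len(no_mass))

    with_mass = df[df["mass (g)"] > 0]
    print("average mass (g):", with_mass["mass (g)"].mean())

    largest = with_mass.loc[with_mass["mass (g)"].idxmax()]
    print("largest:", largest["name"], largest["mass (g)"], largest["year"])

    # Scatter plot for meteorites that fell during or after 1990 (substitute your birth year).
    recent = with_mass[with_mass["year"] >= 1990]
    ax = recent.plot.scatter(x="year", y="mass (g)", logy=True, s=4)
    ax.set_xlabel("Year")
    ax.set_ylabel("Mass (g)")
    plt.show()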

Submit your code by pushing your pandas branch to GitLab.

Due 3/8

Assignment 3: Hypothesis Testing, part 1 (40 points)
Fork the starter repository to your namespace.

For the first part of this two-part assignment, your task is to develop a testable hypothesis for your dataset. You may continue to use the same dataset you used for the data exploration assignment, or switch to a different dataset from an entirely different domain if you like.

Update the provided README.md file to include the following, replacing the TODO entries with your own values:

  • Your testable (alternate) hypothesis statement
  • The dependent variable for your hypothesis
  • A list of the independent variables for your hypothesis
  • A list of the controlled variables for your hypothesis
  • A list of the uncontrolled (or "epsilon") variables for your hypothesis
  • The specifics of your planned t-test

Take care to include units of measure, where applicable. Once you've updated the README.md file with the above information, add and commit your changes.

Submit your work by pushing your master branch to GitLab.

Due 3/15

Assignment 4: Hypothesis Testing, part 2 (40 points)
For the second part of this two-part assignment, complete the following:

  • Create a new Python source code file named test.py in your PyCharm project, and add the new unversioned file to your repository. You will implement code to achieve all of the following in this one file. You are also welcome to add your data file to your repository if you like; however, you do not need to do so for grading purposes.
  • Apply the techniques you've learned so far in the course to load your data from file, taking care to handle missing or invalid data appropriately. You are welcome to use Python's csv module, NumPy's genfromtxt function, or Pandas' read_csv function—or any other functionality from those packages.
  • Split your data into the two groups you are going to use for your planned t-test. For example, this may be one group consisting of only lunar meteorites, and another group consisting of only martian meteorites. Here, you might choose to only keep the relevant dependent variable for each group, e.g., the meteorite mass.
  • Verify the normality of your two groups of data by doing the following:
    • Plot a histogram of the data for each group separately to visually inspect the distributions. Add a vertical line at the mean value for the samples depicted in the histogram. Save the figures to separate PDF files named hist1.pdf and hist2.pdf. Take care to include a title, axis labels and units of measure, etc.
    • Run both D'Agostino's test and the Shapiro-Wilk test to determine the probability that the data were drawn from a normal distribution. Here, you should inspect the resulting p-values for each test and determine whether the probability is so low (p ≤ 0.05) that the data is likely not normally distributed. (A hedged scipy.stats sketch of these checks and the planned test follows this list.)
  • Run your planned test using the appropriate type of test and test parameters, based on your normality test results:
    • If your data is normally distributed, use a t-test for either independent (unpaired) or related (paired) samples; else, use a Mann–Whitney U test.
    • In either case, take care to configure (or interpret the results of) your test correctly based on whether you are looking for simply any difference (two-tailed), or a difference in one specific direction (one-tailed).
  • Update your README.md to add a ## Results section to the end of the file. In this new section, add a brief writeup of the above, taking care to include mention of the following:
    • The results (p-values) of your normality tests, and your interpretation of those results. Are your data normally distributed, according to these tests?
    • The result (p-value) of your planned test, and your interpretation of that result. Can you reject the null hypothesis, and accept your alternate hypothesis?
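A hedged scipy.stats sketch of the flow described above is shown below; group1 and group2 are assumed to be 1-D arrays of your dependent variable for the two groups, and the way the p-values are combined here is deliberately simplistic:

    # Hedged sketch; group1 and group2 are assumed 1-D arrays of the dependent variable.
    from scipy import stats

    alpha = 0.05

    # Normality checks (p <= alpha suggests the data is likely not normally distributed).
    _, p_dagostino_1 = stats.normaltest(group1)   # D'Agostino's test
    _, p_shapiro_1 = stats.shapiro(group1)        # Shapiro-Wilk test
    _, p_dagostino_2 = stats.normaltest(group2)
    _, p_shapiro_2 = stats.shapiro(group2)

    normal = min(p_dagostino_1, p_shapiro_1, p_dagostino_2, p_shapiro_2) > alpha

    if normal:
        # Independent (unpaired) two-sample t-test; use stats.ttest_rel for paired samples.
        _, p_value = stats.ttest_ind(group1, group2)
    else:
        # Non-parametric alternative when normality is doubtful.
        _, p_value = stats.mannwhitneyu(group1, group2, alternative="two-sided")

    print("planned test p-value:", p_value)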

Take care to add and commit each file listed above to your repository, and submit your work by pushing your master branch to GitLab.

Due 3/11 @ 11:55 PM

Midterm Exam (100 exam points)
Fork the starter repository to your namespace.

The midterm exam is open-notes, open-Internet, open-API reference, open-everything, but individual effort only. For the midterm, you will write data ingest, data exploration, and data visualization code using any of the Python packages introduced so far in the course (e.g., NumPy, Matplotlib, Pandas) based on the following fictitious, but realistic, scenario.

You are a data science intern working at the National Aeronautics and Space Administration (NASA). You have been tasked with doing a bit of research on patent applications filed by various groups within NASA, to help your manager prepare for an upcoming briefing to senior administration.

Your manager has requested several pieces of information for the briefing:

  1. The total number of patent applications filed by all centers at NASA
  2. The total number of patents issued by all centers
  3. The patent number, title, and expiration date of the oldest (i.e., least recently expired) patent
  4. A list of the names of the various centers at NASA that have ever filed a patent application
  5. A list with the number of patent applications filed by each center, sorted in descending order
  6. For each center, the percentage of patent applications actually awarded (i.e., the number issued divided by the sum of the number submitted but not issued plus the number issued)
  7. A bar plot depicting the number of patents expiring during each year from 2021 to 2035
  8. A bar plot depicting the number of active (i.e., not yet expired) patents per center, as of the first day of next month

Implement your code in the provided midterm.py file. Unless otherwise specified, simply printing out the requested information is sufficient; no special formatting or cleanup is required, so long as the requested information is easily identifiable in the printed output. Ensure that plots are adequately labeled.

Submit your work by pushing your master branch to GitLab. You will be evaluated based on how successful you are at providing the requested information according to this grading rubric.

Due 3/31

Lab 5: Machine Learning, part 1 (20 participation points)
Fork the starter repository to your namespace.

Using the provided baseball.py file as a starting point, complete the following clustering tasks, after loading the batting data:

  • Create an x–y scatter plot of any two features (e.g., at bats and on-base percentage, for the batting data), and color the individual scatter points according to the player's hall of fame status (where a hof value greater than zero indicates that the player was inducted into the hall of fame).
  • Use the k-means clustering algorithm to assign each player to one of two groups, given the same two features. Create a second scatter plot, and color the individual scatter points according to the predicted class membership.
  • Calculate the prediction accuracy of the k-means algorithm by determining how well the predicted groups match the actual groups based on hall of fame status (i.e., group 0 is "no, not in the hall of fame" and group 1 is "yes, in the hall of fame").
  • Once you have the above working well, try repeating the steps with different pairs of features from the batting data. Add comments to your code to list interesting pairs of features and the corresponding accuracy.
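As a hedged starting point for the clustering tasks above (at_bats, obp, and hof are assumed 1-D arrays from the batting data; any two features will do):

    # Hedged k-means sketch; feature arrays and the hof column are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.column_stack((at_bats, obp))     # any two chosen features
    y_true = (hof > 0).astype(int)          # 1 = inducted into the hall of fame

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    y_pred = km.labels_

    # Cluster labels 0/1 are arbitrary, so score both possible label assignments.
    accuracy = max(np.mean(y_pred == y_true), np.mean((1 - y_pred) == y_true))
    print("k-means accuracy:", accuracy)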

Next, complete the following regression tasks:

  • Select a floating-point feature to predict (e.g., batting average); this will be your y value. Then, select another feature that you think might correlate well with (or be predictive of) that feature (e.g., number of at-bats); this will be your x value.
  • Create a scatter plot of your x and y values, similar to how you did for the clustering task above.
  • Perform an ordinary least squares linear regression to fit a basic linear model that approximates the value of y given the value of x.
  • Use the fitted linear model to predict new values of y given the original x values, and plot the resulting line on top of your scatter plot.
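One hedged way to fit and overlay the line for the regression tasks above, using scikit-learn (x and y are assumed 1-D arrays of the chosen features):

    # Hedged OLS sketch; x (e.g., at-bats) and y (e.g., batting average) are assumed arrays.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LinearRegression

    model = LinearRegression().fit(x.reshape(-1, 1), y)
    y_hat = model.predict(x.reshape(-1, 1))

    fig, ax = plt.subplots()
    ax.scatter(x, y, s=8, label="observed")
    order = np.argsort(x)                    # sort so the fitted line draws cleanly
    ax.plot(x[order], y_hat[order], color="red", label="fitted line")
    ax.set_xlabel("At-bats")
    ax.set_ylabel("Batting average")
    ax.legend()
    plt.show()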

Finally, perform the following classification tasks:

  • Use a support vector machine (SVM)-based classifier to predict the hall of fame status, similar to how you did for the clustering tasks.
  • As you did for the clustering tasks, create a scatter plot and color the points according to the predicted group membership, and calculate the prediction accuracy. As before, try repeating the steps with different pairs of features, and record interesting results as comments in your code.
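A hedged sketch of the classification tasks above, reusing X and y_true from the clustering sketch (note that this scores the classifier on its own training data, as in the clustering tasks):

    # Hedged SVM sketch; X and y_true follow the clustering sketch above.
    import numpy as np
    from sklearn.svm import SVC

    clf = SVC(kernel="rbf")
    clf.fit(X, y_true)
    y_pred = clf.predict(X)                  # predictions on the training data itself
    print("SVM accuracy:", np.mean(y_pred == y_true))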

All data is sourced from Baseball Reference; see Hank Aaron's and Nolan Ryan's pages for examples of various batting and pitching statistics, respectively.

Submit your work by pushing your master branch to GitLab.

Due 4/29 @ 11:55 PM

Final Project (100 exam points)
Fork the starter repository to your namespace.

The final project is a comprehensive project encompassing several of the topics and techniques we've covered in the course. You are welcome to use any of the Python packages introduced in the course (i.e., NumPy, Matplotlib, Pandas, scikit-learn) as you see fit.

First, identify a problem domain you are interested in, and find a dataset from your chosen domain that is suitable for use in a classification experiment. The goal is to find a dataset that has multiple numeric or categorical values (i.e., X) that you can feed to a classification model to predict some other value (i.e., y).

Your dataset should minimally have:

  • Many rows/samples of data (ideally, hundreds or thousands).
  • One or more measures per sample to use as the X to train a model on. These measures may be continuous floating-point numbers, or they may be categorical; if the latter, you will need to encode the features into a format suitable for classification.
  • A categorical label per sample to use as the y for the model to predict. Alternatively, you may discretize a continuous measure and use the resulting bins as categorical labels.

For example, as someone interested in space science and exploration, I might find a dataset containing meteorite data. This dataset might contain 40,000 rows, with one row per meteorite that has been discovered on Earth. Each row might contain measures such as the meteorite mass in grams and the class or type of the meteorite (e.g., lunar, martian), among other measures. I might train a classification model using the meteorite mass, and attempt to predict the meteorite class given the meteorite mass for new, unseen meteorites.

Using the provided final.py as a starting point, complete the following:

  • Add and commit your data file(s) to your repository.
  • Load your data from file, handling any missing data or extreme outliers appropriately.
  • Separate your data into parallel X and y arrays, with various measures encoded or discretized, as relevant.
  • Create at least three figures to visually describe three different aspects of your data. At least one of these figures must plot features from X and visually indicate class membership according to y. (See the "iris" scatter plot examples from class for an idea of what I'm looking for here.) Save each figure to a separate PDF file. Name these files fig1.pdf, fig2.pdf, and fig3.pdf and add them to your repository. You are welcome to include additional plots beyond the required three if you like; use the same naming convention (up to fig9.pdf) if you do so.
  • Use an appropriate cross-validation technique to train and test an SVM-based classification model. Note here that test sets typically comprise 10–20% of the dataset; in any given fold, no member of the fold's test set should also appear in that fold's training set.
  • Compute the raw and balanced accuracy for each cross-validation fold, as well as the overall raw and balanced accuracy.
  • Compute and plot a confusion matrix for all samples to visually depict the confusability between classes. Save this figure in a separate PDF file named cm.pdf and add it to your repository.

It may be helpful to use a basic train/test split to determine how well various features perform before fully implementing the entire cross-validation experiment. You may find that you need to try multiple combinations of features to determine a somewhat optimal X that is predictive of y.
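A hedged sketch of the cross-validation experiment described above; X (2-D features) and y (labels) are assumed to be NumPy arrays already loaded, cleaned, and encoded, and five stratified folds with an RBF kernel are simply illustrative choices:

    # Hedged cross-validation sketch; X and y are assumed pre-processed NumPy arrays.
    import matplotlib.pyplot as plt
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 confusion_matrix, ConfusionMatrixDisplay)

    all_true, all_pred = [], []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        print(f"fold {fold}: "
              f"raw={accuracy_score(y[test_idx], y_pred):.3f}, "
              f"balanced={balanced_accuracy_score(y[test_idx], y_pred):.3f}")
        all_true.extend(y[test_idx])
        all_pred.extend(y_pred)

    print("overall raw accuracy:", accuracy_score(all_true, all_pred))
    print("overall balanced accuracy:", balanced_accuracy_score(all_true, all_pred))

    # Confusion matrix over all folds, saved to the required PDF file.
    cm = confusion_matrix(all_true, all_pred)
    ConfusionMatrixDisplay(cm).plot()
    plt.savefig("cm.pdf")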

Beyond your Python implementation, prepare a small set of presentation slides with the following content per slide, in order:

  1. A title slide listing your topic, your name, your major(s), and a brief note about why you chose the topic or type of data that you did
    2. A brief description of the contents of your data file, including interesting descriptive statistics, and any anomalies you had to deal with when loading the file
  3. The one figure (of the required three figures) that you like best, are most proud of, or feel tells a compelling story about your data
  4. A table listing the raw and balanced accuracy for each fold, along with the overall raw and balanced accuracy. When presenting, take care to describe your cross-validation approach (k-fold, leave-one-out, or some other type; stratified or not stratified; train/test split percentages; etc.), as well as your SVM classifier parameters (primarily, which type of kernel you used).
  5. The confusion matrix, with true and predicted class labels clearly indicated. When presenting, take care to share your observations on which classes were most confusable, and your thoughts on why that was the case.
  6. A conclusions slide outlining what you conclude following your experiment, and any thoughts on what you would try next.

Export your six slides to a single PDF file named slides.pdf and add the file to your repository. You will talk us through your slides during the final exam period, in a similar fashion to what we did for Assignment 2. As before, you will receive some small number of points towards your final project grade based on peer evaluations from others. Each student will have 5 minutes to present; you will remain seated, and I will project your slides from the front.

Take care to add and commit each file listed above to your repository, and submit your work by pushing your master branch to GitLab.

