Course Announcements

Due this week:

  • CL7 due Fri

  • Take E2

Notes:

  • No more video quizzes after VQ12

    • I had mixed up due dates, so VQ12 is now officially due Wed (but I’m glad if you did it ahead of today’s class, as that was my intent)

  • Will discuss final exam/project in class on Thurs

Scientific Computing#

  • file paths

  • imports

  • Scientific Computing

File Systems#

Computers use a hierarchical file system that organizes files & folders.

When you click through the folders (directories) on your computer, you’re interacting with this hierarchical system.

Absolute vs. Relative Paths#

Absolute Paths#

Absolute paths specify the full path for a given file system (starting from the root directory).

root specifies the ‘highest’ directory in the file structure (the start).

An absolute file path starts with a slash / specifying the root directory.

Relative Paths#

Relative paths specify the path to a file from your current working directory (where your computer is working right now).
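To make the distinction concrete, here is a minimal sketch using pathlib from the standard library. The folder and file names (/Users/student, data/msleep.csv) are made up for illustration, and PurePosixPath is used so the example behaves the same on any operating system.

```python
from pathlib import Path, PurePosixPath

# Hypothetical paths, used only for illustration
absolute = PurePosixPath("/Users/student/data/msleep.csv")
relative = PurePosixPath("data/msleep.csv")

print(absolute.is_absolute())  # True: starts at the root directory /
print(relative.is_absolute())  # False: meaning depends on the working directory

# A relative path is interpreted by joining it onto the working directory
working_dir = PurePosixPath("/Users/student")  # assumed working directory
joined = working_dir / relative
print(joined == absolute)  # True

# Path.cwd() reports the real current working directory on your machine
print(Path.cwd())
```

The same relative path points somewhere else if the working directory changes, which is why relative paths sometimes "break" when a notebook is moved.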

Modules & Packages#

A module is a set of Python code with functions, classes, etc. available in it. A Python package is a directory of modules.

Modules are stored in Python files (.py). We can import these files into our namespace, to gain access to the module within Python.

In A3 you had to import the package nltk:

  • a package is a whole bunch of modules

  • a module is stored in a Python (.py) file

  • when imported, we have access to its functionality
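One way to see the module/package difference for yourself is sketched below, using two standard library names. This assumes a standard CPython installation, where json is a package (a directory of modules) and random is a single module.

```python
import json    # a package: a directory of modules
import random  # a module: a single .py file

# Packages carry a __path__ attribute pointing at their directory; plain modules do not
json_is_package = hasattr(json, "__path__")
random_is_package = hasattr(random, "__path__")
print(json_is_package)    # True
print(random_is_package)  # False

# A module remembers which file it was loaded from
print(random.__file__)
```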

import#

import is a keyword to import external code into the local namespace.

Why do it this way (importing modules)?

  1. Minimize startup costs

  2. Functions in different packages could have the same name; if everything were loaded at once, one could silently replace the other and break programs
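As a small illustration of the second point, math and cmath in the standard library both define a function named sqrt. The sketch below shows that keeping each function behind its module name lets both coexist.

```python
import math
import cmath

# Same function name, different behavior, no conflict
real_root = math.sqrt(4)        # 2.0
complex_root = cmath.sqrt(-4)   # 2j
print(real_root, complex_root)

# math.sqrt only handles non-negative real numbers
try:
    math.sqrt(-4)
    math_failed = False
except ValueError:
    math_failed = True
print(math_failed)  # True
```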

import example: random module#

import random
# random.sample() documentation
random.sample?
Signature: random.sample(population, k, *, counts=None)
Docstring:
Chooses k unique random elements from a population sequence.

Returns a new list containing elements from the population while
leaving the original population unchanged.  The resulting list is
in selection order so that all sub-slices will also be valid random
samples.  This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).

Members of the population need not be hashable or unique.  If the
population contains repeats, then each occurrence is a possible
selection in the sample.

Repeated elements can be specified one at a time or with the optional
counts parameter.  For example:

    sample(['red', 'blue'], counts=[4, 2], k=5)

is equivalent to:

    sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

To choose a sample from a range of integers, use range() for the
population argument.  This is especially fast and space efficient
for sampling from a large population:

    sample(range(10000000), 60)
File:      /opt/anaconda3/envs/cogs18/lib/python3.11/random.py
Type:      method
# Random example
to_choose_from = ['1', '2', '3', '4', '5']
number_to_choose = 2

chosen = random.sample(to_choose_from, number_to_choose)

print(chosen)
['3', '5']
random.sample(['a', 'b', 'c'], 1)
['a']

Imports: from & as#

from and as allow us to decide exactly what objects to import into our namespace, and what we call them (in our namespace).
# Import a specific object from a module
from random import choice
## do NOT have to type module name 
## using this approach
## to call this
choice(to_choose_from)
'1'
# Import a module with a specific name in our namespace
# used when module names are long
import collections as cols
## collections is not defined
## this code will fail
collections.
# this is how you would do it
cols.
# putting it all together
# Import a specific thing and give it a specific name
from string import punctuation as punc
punctuation
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 punctuation

NameError: name 'punctuation' is not defined
punc
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
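The three import styles can be compared side by side. This is a minimal sketch using the statistics module; the alias names (st, middle) and the scores list are made up for illustration.

```python
import statistics as st                  # whole module, short alias
from statistics import mean              # one object, original name
from statistics import median as middle  # one object, new name

scores = [2, 4, 9]  # made-up values

print(st.mean(scores))  # 5
print(mean(scores))     # 5
print(middle(scores))   # 4

# Only the alias exists in our namespace, not the original name
try:
    median(scores)
    median_is_defined = True
except NameError:
    median_is_defined = False
print(median_is_defined)  # False
```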

Activity: imports#

Please complete the Google Form: https://forms.gle/zqt9ofiXgczQBKLK6

# If you want to check the imports
# import collections as col
# from statistics import mean as average
# from os import path
# from random import choice, choices
import ascii_letters from string
  Cell In[19], line 6
    import ascii_letters from string
                         ^
SyntaxError: invalid syntax
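The SyntaxError above comes from the word order. A version that Python accepts puts from and the module name first, then import and the object:

```python
# from <module> import <object>
from string import ascii_letters

print(len(ascii_letters))  # 52: 26 lowercase + 26 uppercase letters
print(ascii_letters[:3])   # abc
```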

Scientific Computing#

Scientific Computing is the application of computer programming to scientific applications: data analysis, simulation & modelling, plotting, etc.

Scientific Python: Scipy Stack#

Scipy = Scientific Python

  • scipy

  • numpy

  • pandas

  • Data Analysis in Python

numpy#

numpy - stands for numerical python

Note:

  • numpy introduces a new class of object: the numpy array

  • this array has associated attributes

  • …and methods

External packages must be imported#

import numpy as np

arrays#

arrays - enable work/operations with matrices

They allow you to operate efficiently on whole collections of numbers at once (linear algebra, matrix operations, etc.)

# Create some arrays of data
arr1 = np.array([[1, 2], [3, 4]])
arr1
array([[1, 2],
       [3, 4]])
# lists of lists don't store dimensionality well
[[1, 2], [3, 4]] 
[[1, 2], [3, 4]]
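A short sketch of why the array is more convenient than the list of lists for numerical work. The variable names here (nested_list, arr, product) are chosen just for this example.

```python
import numpy as np

nested_list = [[1, 2], [3, 4]]
arr = np.array(nested_list)

# Multiplying a list repeats it; multiplying an array scales every element
print(nested_list * 2)  # [[1, 2], [3, 4], [1, 2], [3, 4]]
print(arr * 2)

# The list only knows its outer length; the array knows both dimensions
print(len(nested_list))  # 2
print(arr.shape)         # (2, 2)

# Matrix multiplication uses the @ operator
product = arr @ arr
print(product)
```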

Arrays: attributes, methods, & indexing#

# Check out an array of data
arr1
array([[1, 2],
       [3, 4]])

attributes#

numpy arrays are an object type…so they have associated attributes (below) and methods (we’ll get to these in a second)!

# Check the shape of the array
arr1.shape
(2, 2)
# Index into a numpy array
print(arr1[0, 0])
1
np.int64(1)
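A few more attributes and indexing patterns, sketched on the same 2x2 array. The names first_row, first_col, and last are just for this example.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])

print(arr1.ndim)   # 2: number of dimensions
print(arr1.size)   # 4: total number of elements
print(arr1.dtype)  # an integer type; the exact name depends on your platform

first_row = arr1[0]     # row 0
first_col = arr1[:, 0]  # every row, column 0
last = arr1[-1, -1]     # last row, last column
print(first_row, first_col, last)
```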

methods#

If you’re looping over an array, there’s probably a method for that…

# sum method
# by default sums all values in array
arr1.sum()
10
# sum method
# has an axis parameter
# axis=0 sums down the rows, giving one total per column
arr1.sum(axis=0)
array([4, 6])
# typecasting to a different variable type
out_list = arr1.sum(axis=0).tolist()
print(out_list)
type(out_list)
[4, 6]
list
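To back up the "there's probably a method for that" claim, the sketch below computes the total with a hand-written loop and compares it with the sum method, then tries the axis parameter on a couple of other methods.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])

# The loop version
total = 0
for row in arr1:
    for value in row:
        total += value

# The method version gives the same answer in one line
print(total == arr1.sum())  # True

row_sums = arr1.sum(axis=1)  # one total per row
overall_mean = arr1.mean()   # mean of all four values
col_max = arr1.max(axis=0)   # largest value in each column
print(row_sums, overall_mean, col_max)
```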

Activity: numpy#

Write code to:

  1. Create a 3x3 array (my_array) that stores 9 different numbers

  2. Use a method to identify the largest value in that array (store it in array_max)

Include your code in this Google Form: https://forms.gle/27GU7pK2rJGrbLo46

# YOUR CODE HERE
my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array_max = my_array.max()
array_max
9

pandas#

Pandas is a Python library for managing heterogeneous data.

At its core, Pandas is built around the DataFrame object, which is:

  • a data structure for labeled rows and columns of data

  • associated methods and utilities for working with data.

  • each column is a pandas Series

import pandas as pd
# Create a dataframe 
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})
# Check out the dataframe
df
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
# You can index in pandas
# columns store information in series
df['Age']
0    25
1    30
2    35
Name: Age, dtype: int64
# You can index in pandas
# loc selects by row label, column label
df.loc[0,:]
Name    Alice
Age        25
Name: 0, dtype: object
# attribute of df object
# row, columns
df.shape
(3, 2)
# how many rows there are in a series/df
df.shape[0] # len(df) would also work
3
len(df)
3
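A few more ways to pull values out of the same small dataframe, as a sketch. loc works with labels, iloc works with integer positions, and a boolean condition inside the brackets keeps only the matching rows.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

ages = df['Age']
print(isinstance(ages, pd.Series))  # True: a single column is a Series

bob_age = df.loc[1, 'Age']    # row label 1, column label 'Age'
last_name = df.iloc[-1, 0]    # last row, first column, by position
over_28 = df[df['Age'] > 28]  # only the rows where Age is above 28
print(bob_age, last_name)
print(over_28)
```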

Working with DataFrames#

There are a lot of functions and methods within pandas. The general syntax is df.method() where the method() operates directly on the dataframe df.

# calculate summary statistics
df.describe()
        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
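The same df.method() pattern works for single columns and for reordering rows. A minimal sketch on the Name/Age dataframe; note that these methods return a new object rather than changing df in place.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Methods on one column (a Series)
mean_age = df['Age'].mean()
print(mean_age)  # 30.0

# Methods on the whole dataframe
oldest_first = df.sort_values('Age', ascending=False)
print(oldest_first)

# The original dataframe keeps its order
print(df['Name'].tolist())  # ['Alice', 'Bob', 'Charlie']
```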

Data in pandas#

  • pd.read_*(), where * is replaced with the file type (e.g., read_csv())

  • input to function is path to file or URL
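To try read_csv() without needing a file or an internet connection, the sketch below reads CSV text held in memory. The csv_text string and the df_small name are made up for this example; io.StringIO simply makes a string behave like an open file.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk or at a URL
csv_text = """name,sleep_total
Cheetah,12.1
Cow,4.0
"""

df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small.shape)             # (2, 2)
print(df_small.columns.tolist())  # ['name', 'sleep_total']
```

With a real file you would pass the path (absolute or relative) as a string instead of the StringIO object.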

For example…there is a very famous dataset about mammalian sleep. One copy of it is at the URL https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv

df = pd.read_csv('https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv')
# look at the data
df
     name                        genus       vore   order         conservation  sleep_total  sleep_rem  sleep_cycle  awake  brainwt  bodywt
0    Cheetah                     Acinonyx    carni  Carnivora     lc            12.1         NaN        NaN          11.9   NaN      50.000
1    Owl monkey                  Aotus       omni   Primates      NaN           17.0         1.8        NaN          7.0    0.01550  0.480
2    Mountain beaver             Aplodontia  herbi  Rodentia      nt            14.4         2.4        NaN          9.6    NaN      1.350
3    Greater short-tailed shrew  Blarina     omni   Soricomorpha  lc            14.9         2.3        0.133333     9.1    0.00029  0.019
4    Cow                         Bos         herbi  Artiodactyla  domesticated  4.0          0.7        0.666667     20.0   0.42300  600.000
...  ...                         ...         ...    ...           ...           ...          ...        ...          ...    ...      ...
78   Tree shrew                  Tupaia      omni   Scandentia    NaN           8.9          2.6        0.233333     15.1   0.00250  0.104
79   Bottle-nosed dolphin        Tursiops    carni  Cetacea       NaN           5.2          NaN        NaN          18.8   NaN      173.330
80   Genet                       Genetta     carni  Carnivora     NaN           6.3          1.3        NaN          17.7   0.01750  2.000
81   Arctic fox                  Vulpes      carni  Carnivora     NaN           12.5         NaN        NaN          11.5   0.04450  3.380
82   Red fox                     Vulpes      carni  Carnivora     NaN           9.8          2.4        0.350000     14.2   0.05040  4.230

83 rows × 11 columns

…we can access the attributes and execute the methods described above on this dataset:

# rows, columns
df.shape
(83, 11)
df['sleep_total'].median()
10.1
df.describe()
       sleep_total  sleep_rem  sleep_cycle  awake      brainwt    bodywt
count  83.000000    61.000000  32.000000    83.000000  56.000000  83.000000
mean   10.433735    1.875410   0.439583     13.567470  0.281581   166.136349
std    4.450357     1.298288   0.358680     4.452085   0.976414   786.839732
min    1.900000     0.100000   0.116667     4.100000   0.000140   0.005000
25%    7.850000     0.900000   0.183333     10.250000  0.002900   0.174000
50%    10.100000    1.500000   0.333333     13.900000  0.012400   1.670000
75%    13.750000    2.400000   0.579167     16.150000  0.125500   41.750000
max    19.900000    6.600000   1.500000     22.100000  5.712000   6654.000000
df['order'].value_counts()
order
Rodentia           22
Carnivora          12
Primates           12
Artiodactyla        6
Soricomorpha        5
Perissodactyla      3
Cetacea             3
Hyracoidea          3
Diprotodontia       2
Erinaceomorpha      2
Proboscidea         2
Chiroptera          2
Didelphimorphia     2
Cingulata           2
Lagomorpha          1
Pilosa              1
Monotremata         1
Afrosoricida        1
Scandentia          1
Name: count, dtype: int64
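Notice that the count row in describe() is smaller than 83 for some columns. That is because pandas skips missing values (NaN) when it summarizes a column. A small sketch of this, rebuilding just the name and sleep_rem values from the first five rows shown above (the variable name mini is made up):

```python
import numpy as np
import pandas as pd

# name and sleep_rem for rows 0-4 of the msleep table shown above
mini = pd.DataFrame({
    'name': ['Cheetah', 'Owl monkey', 'Mountain beaver',
             'Greater short-tailed shrew', 'Cow'],
    'sleep_rem': [np.nan, 1.8, 2.4, 2.3, 0.7],
})

n_rows = len(mini)                          # all rows, missing or not
n_present = mini['sleep_rem'].count()       # non-missing values only
n_missing = mini['sleep_rem'].isna().sum()  # missing values only
mean_rem = mini['sleep_rem'].mean()         # NaN is skipped in the mean

print(n_rows, n_present, n_missing)  # 5 4 1
print(mean_rem)
```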

Activity: pandas#

The penguins dataset is a famous dataset provided within the seaborn package. To use it, run the two lines of code below.

Explore the dataset to answer the following questions:

  1. How many different species of penguin are included in this dataset?

  2. What is the average mass of penguins in this dataset?

  3. (Bonus - time permitting) What is the average mass of female penguins in this dataset?

Include your answer on the Google form here: https://forms.gle/H1nPwkhR1QoGDn4i7

import seaborn as sns
df = sns.load_dataset('penguins')
# How many different species of penguin are included in this dataset?
# What is the average mass of penguins in this dataset?
# What is the average mass of female penguins in this dataset?
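If you are not sure where to start, here is a sketch of the kinds of methods that help, run on a tiny made-up dataframe rather than the real data. It assumes the penguins columns are named species, sex, and body_mass_g (check df.columns to confirm); every value below is invented, so the numbers are not the answers.

```python
import pandas as pd

# Made-up rows; only the column names are meant to mirror the penguins dataset
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'body_mass_g': [3000.0, 4000.0, 5000.0, 3600.0],
})

# Number of distinct values in a column
n_species = toy['species'].nunique()

# Average of a numeric column
mean_mass = toy['body_mass_g'].mean()

# Average for a subset: filter the rows first, then take the mean
female_mean = toy[toy['sex'] == 'Female']['body_mass_g'].mean()

print(n_species, mean_mass, female_mean)  # 3 3900.0 4000.0
```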

Everything below this is just FYI…not on assignment/lab/exam

Plotting#

%matplotlib inline

import matplotlib.pyplot as plt
df = pd.read_csv('https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv')
# Plot the data
plt.scatter(df['sleep_rem'], df['sleep_cycle']);
[Figure: scatter plot of sleep_rem (x-axis) against sleep_cycle (y-axis)]
  • can change plot type

  • lots of customizations possible
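A minimal sketch of a few common customizations (color, transparency, axis labels, title). The four data points are the sleep_rem and sleep_cycle values of the shrew, cow, tree shrew, and red fox rows shown earlier; the color and title are arbitrary choices.

```python
import matplotlib.pyplot as plt

# A handful of (sleep_rem, sleep_cycle) pairs from the msleep rows shown earlier
sleep_rem = [2.3, 0.7, 2.6, 2.4]
sleep_cycle = [0.133333, 0.666667, 0.233333, 0.350000]

# Creating the axes explicitly makes them easy to customize
fig, ax = plt.subplots()
ax.scatter(sleep_rem, sleep_cycle, color='purple', alpha=0.7)
ax.set_xlabel('sleep_rem')
ax.set_ylabel('sleep_cycle')
ax.set_title('REM sleep vs. sleep cycle')
```

Swapping ax.scatter for another plotting method (for example ax.plot or ax.hist) changes the plot type while the labeling code stays the same.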

Analysis#

  • scipy - statistical analysis

  • sklearn - machine learning

import scipy as sp
from scipy import stats
# Simulate some data
d1 = stats.norm.rvs(loc=0, size=1000)
d2 = stats.norm.rvs(loc=0.5, size=1000)

Analysis - Plotting the Data#

# Plot the data
plt.hist(d1, 25, alpha=0.6);
plt.hist(d2, 25, alpha=0.6);
[Figure: overlaid histograms of d1 and d2, 25 bins each]

Analysis - Statistical Comparisons#

# Statistically compare the two distributions
stats.ttest_ind(d1, d2)
TtestResult(statistic=-10.317692476019195, pvalue=2.366035526414714e-24, df=1998.0)
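Because the data are simulated randomly, the exact numbers above change on every run. The sketch below fixes the random seeds (the seed values 0 and 1 are arbitrary choices) so the result is repeatable, and pulls the statistic and p-value out of the result object. The 0.05 cutoff is a common convention, not something specific to this dataset.

```python
from scipy import stats

# Same simulation as above, with fixed seeds for repeatability
d1 = stats.norm.rvs(loc=0, size=1000, random_state=0)
d2 = stats.norm.rvs(loc=0.5, size=1000, random_state=1)

result = stats.ttest_ind(d1, d2)

# The result object exposes its pieces as attributes
print(result.statistic)
print(result.pvalue)

# Compare the p-value against a conventional threshold
significant = result.pvalue < 0.05
print(significant)
```

The statistic is negative because the first sample (d1) has the smaller mean.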

COGS 108: Data Science in Practice#

If you are interested in data science and scientific computing in Python, consider taking COGS 108: https://github.com/COGS108/.