Scientific Computing

Course Announcements

Due this week:

CL7 due Fri
Take E2

Notes:

No more video quizzes after VQ12
- I had mixed up due dates so VQ12 officially now due Wed (but I’m glad if you did it ahead of today’s class as that was my intent)
Will discuss final exam/project in class on Thurs

Scientific Computing#

file paths
imports
Scientific Computing

File Systems#

Computers use a hierarchical file systems that organizes files & folders.

When you click through the folders (directories) on your computer, you’re interacting with this hierarchical system.

Absolute vs. Relative Paths#

Absolute Paths#

Absolute paths specify the full path for a given file system (starting from the root directory).

root specifies the ‘highest’ directory in the file structure (the start).

An absolute file path starts with a slash / specifying the root directory.

Relative Paths#

Relative paths specify the path to a file from your current working directory (where your computer is working right now).

Modules & Packages#

A module is a set of Python code with functions, classes, etc. available in it. A Python package is a directory of modules.

Modules are stored in Python files (.py). We can import these files into our namespace, to gain access to the module within Python.

In A3 you had to import the package: nltk:

package is a whole bunch of modules
a module stores python (.py) files
when imported, we have access to its functionality

`import`#

import is a keyword to import external code into the local namespace.

Why do it this way (importing modules)?

Minimize startup costs
Functions in different packages could have the same name - break programs

`import` example: random module#

import random

# random.sample() documentation
random.sample?

Signature: random.sample(population, k, *, counts=None)
Docstring:
Chooses k unique random elements from a population sequence.

Returns a new list containing elements from the population while
leaving the original population unchanged.  The resulting list is
in selection order so that all sub-slices will also be valid random
samples.  This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).

Members of the population need not be hashable or unique.  If the
population contains repeats, then each occurrence is a possible
selection in the sample.

Repeated elements can be specified one at a time or with the optional
counts parameter.  For example:

    sample(['red', 'blue'], counts=[4, 2], k=5)

is equivalent to:

    sample(['red', 'red', 'red', 'red', 'blue', 'blue'], k=5)

To choose a sample from a range of integers, use range() for the
population argument.  This is especially fast and space efficient
for sampling from a large population:

    sample(range(10000000), 60)
File:      /opt/anaconda3/envs/cogs18/lib/python3.11/random.py
Type:      method

# Random example
to_choose_from = ['1', '2', '3', '4', '5']
number_to_choose = 2

chosen = random.sample(to_choose_from, number_to_choose)

print(chosen)

['3', '5']

random.sample(['a', 'b', 'c'], 1)

['a']

Imports: `from` & `as`#

from and as allows us to decide exactly what objects to import into our namespace, and what we call them (in our namespace).

# Import a specific object from a module
from random import choice

## do NOT have to type module name 
## using this approach
## to call this
choice(to_choose_from)

'1'

# Import a module with a specific name in our namespace
# used when module names are long
import collections as cols

## collections is not defined
## this code will fail
collections.

# this is how you would do it
cols.

# putting it all together
# Import a specific thing and give it a specific name
from string import punctuation as punc

punctuation

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 punctuation

NameError: name 'punctuation' is not defined

punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Activity: imports#

Please complete the Google Form: https://forms.gle/zqt9ofiXgczQBKLK6

# If you want to check the imports
# import collections as col
# from statistics import mean as average
# from os import path
# from random import choice, choices
import ascii_letters from string

  Cell In[19], line 6
    import ascii_letters from string
                         ^
SyntaxError: invalid syntax

Scientific Computing#

Scientific Computing is the application of computer programming to scientific applications: data analysis, simulation & modelling, plotting, etc.

Scientific Python: Scipy Stack#

Scipy = Scientific Python

scipy
numpy
pandas
Data Analysis in Python

`numpy`#

numpy - stands for numerical python

Note:

numpy includes a new class of an object: the numpy array
this array has associated attributes
…and methods

External packages must be imported#

import numpy as np

arrays#

arrays - enable work/operations with matrices

Allow you to efficiently operate on arrays (linear algebra, matrix operations, etc.)

# Create some arrays of data
arr1 = np.array([[1, 2], [3, 4]])

arr1

array([[1, 2],
       [3, 4]])

# lists of lists don't store dimensionality well
[[1, 2], [3, 4]] 

[[1, 2], [3, 4]]

Arrays: attributes, methods, & indexing#

# Check out an array of data
arr1

array([[1, 2],
       [3, 4]])

attributes#

numpy arrays are an object type…so they have associated attributes (below) and methods (we’ll get to these in a second)!

# Check the shape of the array
arr1.shape

(2, 2)

# Index into a numpy array
print(arr1[0, 0])

np.int64(1)

methods#

If you’re looping over an array, there’s probably a method for that…

# sum method
# by default sums all values in array
arr1.sum()

# sum method
# has an axis parameter
# axis=0 sums across columns
arr1.sum(axis=0)

array([4, 6])

# typecasting to a different variable type
out_list = arr1.sum(axis=0).tolist()
print(out_list)
type(out_list)

[4, 6]

list

Activity: `numpy`#

Write code to:

Create a 3x3 array (my_array) that stores 9 different numbers
Use a method to identify the largest value in that array (store it in array_max)

Include your code in this Google Form: https://forms.gle/27GU7pK2rJGrbLo46

# YOUR CODE HERE
my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array_max = my_array.max()
array_max

`pandas`#

Pandas is Python library for managing heterogenous data.

At it’s core, Pandas is built around the DataFrame object, which is:

a data structure for labeled rows and columns of data
associated methods and utilities for working with data.
each column contains a pandas Series

import pandas as pd

# Create a dataframe 
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Check out the dataframe
df

	Name	Age
0	Alice	25
1	Bob	30
2	Charlie	35

# You can index in pandas
# columns store information in series
df['Age']

  25
  30
  35
Name: Age, dtype: int64

# You can index in pandas
# loc specifies row, column position
df.loc[0,:]

Name    Alice
Age        25
Name: 0, dtype: object

# attribute of df object
# row, columns
df.shape

(3, 2)

# how many rows there are in a series/df
df.shape[0] # len(df) would also work

len(df)

Working with DataFrames#

There are a lot of functions and methods within pandas. The general syntax is df.method() where the method() operates directly on the dataframe df.

# calculate summary statistics
df.describe()

	Age
count	3.0
mean	30.0
std	5.0
min	25.0
25%	27.5
50%	30.0
75%	32.5
max	35.0

Data in `pandas`#

pd.read_*()| * is replaced with file type (i.e. read_csv())
input to function is path to file or URL

For example…there is a very famous dataset about mammalian sleep. One copy of it is at the URL ‘https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv’

df = pd.read_csv('https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv')

# look at the data
df

	name	genus	vore	order	conservation	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
0	Cheetah	Acinonyx	carni	Carnivora	lc	12.1	NaN	NaN	11.9	NaN	50.000
1	Owl monkey	Aotus	omni	Primates	NaN	17.0	1.8	NaN	7.0	0.01550	0.480
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt	14.4	2.4	NaN	9.6	NaN	1.350
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc	14.9	2.3	0.133333	9.1	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	domesticated	4.0	0.7	0.666667	20.0	0.42300	600.000
...	...	...	...	...	...	...	...	...	...	...	...
78	Tree shrew	Tupaia	omni	Scandentia	NaN	8.9	2.6	0.233333	15.1	0.00250	0.104
79	Bottle-nosed dolphin	Tursiops	carni	Cetacea	NaN	5.2	NaN	NaN	18.8	NaN	173.330
80	Genet	Genetta	carni	Carnivora	NaN	6.3	1.3	NaN	17.7	0.01750	2.000
81	Arctic fox	Vulpes	carni	Carnivora	NaN	12.5	NaN	NaN	11.5	0.04450	3.380
82	Red fox	Vulpes	carni	Carnivora	NaN	9.8	2.4	0.350000	14.2	0.05040	4.230

83 rows × 11 columns

…we can access the attributes and execute the methods described above on this dataset:

# rows, columns
df.shape

(83, 11)

df['sleep_total'].median()

10.1

df.describe()

	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
count	83.000000	61.000000	32.000000	83.000000	56.000000	83.000000
mean	10.433735	1.875410	0.439583	13.567470	0.281581	166.136349
std	4.450357	1.298288	0.358680	4.452085	0.976414	786.839732
min	1.900000	0.100000	0.116667	4.100000	0.000140	0.005000
25%	7.850000	0.900000	0.183333	10.250000	0.002900	0.174000
50%	10.100000	1.500000	0.333333	13.900000	0.012400	1.670000
75%	13.750000	2.400000	0.579167	16.150000	0.125500	41.750000
max	19.900000	6.600000	1.500000	22.100000	5.712000	6654.000000

df['order'].value_counts()

order
Rodentia           22
Carnivora          12
Primates           12
Artiodactyla        6
Soricomorpha        5
Perissodactyla      3
Cetacea             3
Hyracoidea          3
Diprotodontia       2
Erinaceomorpha      2
Proboscidea         2
Chiroptera          2
Didelphimorphia     2
Cingulata           2
Lagomorpha          1
Pilosa              1
Monotremata         1
Afrosoricida        1
Scandentia          1
Name: count, dtype: int64

Activity: `pandas`#

The penguins dataset is a famous dataset provided within the seaborn package. To use it, run the two lines of code below.

Explore the dataset to answer the following questions:

How many different species of penguin are included in this dataset?
What is the average mass of penguins in this dataset?
(Bonus - time permitting) What is the average mass of female penguins in this dataset?

Include your answer on the Google form here: https://forms.gle/H1nPwkhR1QoGDn4i7

import seaborn as sns
df = sns.load_dataset('penguins')

# How many different species of penguin are included in this dataset?

# What is the average mass of penguins in this dataset?

# What is the average mass of female penguins in this dataset

Everything below this is just FYI…not on assignment/lab/exam

Plotting#

%matplotlib inline

import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv')

# Plot the data
plt.scatter(df['sleep_rem'], df['sleep_cycle']);

../_images/53de2128c253c3ace94568e79ad0ef6dba80745561cb8032d8be5706364b4aa9.png

can change plot type
lots of customizations possible

Analysis#

scipy - statistical analysis
sklearn - machine learning

import scipy as sp
from scipy import stats

# Simulate some data
d1 = stats.norm.rvs(loc=0, size=1000)
d2 = stats.norm.rvs(loc=0.5, size=1000)

Analysis - Plotting the Data#

# Plot the data
plt.hist(d1, 25, alpha=0.6);
plt.hist(d2, 25, alpha=0.6);

../_images/7ca4611f00fc833b18da9b5efdf69a669ec7b14e309ba6e894e36bfa4c061b58.png

Analysis - Statistical Comparisons#

# Statistically compare the two distributions
stats.ttest_ind(d1, d2)

TtestResult(statistic=-10.317692476019195, pvalue=2.366035526414714e-24, df=1998.0)

COGS 108: Data Science in Practice#

If you are interested in data science and scientific computing in Python, consider taking COGS 108 : https://github.com/COGS108/.

Scientific Computing

Contents

Scientific Computing#

File Systems#

Absolute vs. Relative Paths#

Absolute Paths#

Relative Paths#

Modules & Packages#

import#

import example: random module#

Imports: from & as#

Activity: imports#

Scientific Computing#

Scientific Python: Scipy Stack#

numpy#

External packages must be imported#

arrays#

Arrays: attributes, methods, & indexing#

attributes#

methods#

Activity: numpy#

pandas#

Working with DataFrames#

Data in pandas#

Activity: pandas#

Plotting#

Analysis#

Analysis - Plotting the Data#

Analysis - Statistical Comparisons#

COGS 108: Data Science in Practice#

`import`#

`import` example: random module#

Imports: `from` & `as`#

`numpy`#

Activity: `numpy`#

`pandas`#

Data in `pandas`#

Activity: `pandas`#