Scientific Computing

Scientific Computing#

[ad] Food Insecurity Survey

Q&A

Q: Are we getting command lines in the final exam?
A: You will need to know the shell commands and absolute/relative paths for both E2 and the final.

Q: How do you know when to use a script or a module?
A: If you are writing the code, you consider the task. Do you want to store code to import elsewhere? Module. Or do you want to write some code that you want to run from top to bottom? Script!

Q: How heavily will we be tested on our knowledge of a script and module? What type of questions can we expect?
A: For E2, just know the similarities and differences. For the final, we’ll be creating them…but we’ll have more practice before then.

Q: I was wondering what the difference is going to be between E2 and the final? Is it just going to be similar to the first two midterms and cover more information or will it be more challenging, us attempting to create code like we do in assignments, etc.
A: E2 focuses on material from loops through classes. The final exam focuses on material after classes, culminating in a mini-project. We’ll discuss details soon in class!

Q: I’m still a little unsure about when to use self and how it works inside methods.
A: We’ll discuss this more today - ask questions if still not clear, if you’re comfortable!

Q: Also, how do I know when to use a class vs. just writing functions?
A: Well, typically, I’ll tell you. But, when you have to decide on your own…if you want to keep attributes and methods organized together and use attributes across different methods, class. If not, just a function.
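
To make that distinction concrete, here is a minimal sketch. The `Student` class and the `average` function are hypothetical names made up for illustration: the class keeps attributes and the methods that use them together, while the standalone function handles a one-off calculation with no shared state.

```python
# Hypothetical example: attributes (name, scores) are shared across methods
class Student:
    def __init__(self, name):
        self.name = name
        self.scores = []

    def add_score(self, score):
        # uses the scores attribute created in __init__
        self.scores.append(score)

    def average(self):
        # uses the same attribute again, in a different method
        return sum(self.scores) / len(self.scores)


# A one-off calculation with nothing to remember works fine as a function
def average(scores):
    return sum(scores) / len(scores)


student = Student("Alice")
student.add_score(90)
student.add_score(80)
print(student.average())   # 85.0
print(average([90, 80]))   # 85.0
```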

Q: How many points could we have so far? How many remain?
A: See below

Current possible points earned: 61

pre-course: 2
VQ1-12 : 12
CL1-6: 12
A1-4: 20
Oral exam: 2.5
E1: 12.5

Remaining points to earn: 39

post-course: 2
CL7-8: 4
A5: 5
Oral exam 2: 2.5
E2: 12.5
Final exam: 13

Course Announcements

Due this week:

CL7 due Fri
Take E2: 5/23-5/30

Notes:

Reminder to sign up for Oral exam 2 slot (link also on Canvas homepage)
If you have a few minutes, please complete the Food Insecurity Survey before 5/26 <- there’s an optional quiz on Canvas with the link too
Re-take of E1 or E2
- sign-ups will be available Fri 5/30 (so students have the info they need to decide if they want to retake)
- replacement grade: 75% of highest + 25% of lowest

Q&A

Q: for the ‘remote.py’ i kept on getting an error
A: Two things to check: 1) restart your kernel in your notebook and try again; 2) if the error persists, ensure that remote.py is in the same directory/folder as the notebook you’re importing from

Q: Can we go over F strings again?
A: Yup an f string allows you to include a variable within a string using {}. So if you had var_6 = 6…..you could include that value in an f string as follows: f"I like the number {var_6}." Note that you do not have to convert the int to a string…and spacing is indicated within the string directly.
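
As a quick runnable version of that example (the `same_message` comparison is an addition for illustration, showing what you would have to write without an f-string):

```python
var_6 = 6

# The value of var_6 is inserted where the {} appears
message = f"I like the number {var_6}."
print(message)  # I like the number 6.

# Without an f-string, the int must be converted to a string first
same_message = "I like the number " + str(var_6) + "."
print(message == same_message)  # True
```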

Q: How do you know whether to include a ‘()’ after a class definition (i.e. class Exam2() or class Exam2)?
A: When creating a class, the parentheses are optional. When creating an instance of a class, you always need parentheses.
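
A small sketch of both cases. `Exam2NoParens` is a made-up second class name, used only so the two definitions can sit side by side:

```python
# Both definitions are valid: the parentheses are optional here
class Exam2():
    pass

class Exam2NoParens:
    pass

# Creating an instance always requires parentheses
exam_a = Exam2()
exam_b = Exam2NoParens()

# Without parentheses, you get the class itself, not an instance
not_an_instance = Exam2

print(isinstance(exam_a, Exam2))           # True
print(isinstance(not_an_instance, Exam2))  # False
```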

Q: how ethical do you think chatgpt is
A: Love this question. I think ChatGPT (and other LLMs) have serious ethical (and climate) concerns. Specifically, ChatGPT has been built, in many cases, on copyrighted material…although it’s not entirely clear the extent to which this is true. So the model has learned from material that wasn’t its to learn from. Beyond that, humans have made decisions as to how ChatGPT is to respond. Same for other LLMs (you’ll notice that Claude is more empathetic, but also more resistant to making decisions for prompters). This is not without its ethical implications. Further, the carbon footprint for training these models is quite large. So, as a society, we have to do a cost-benefit analysis…and also regulate in meaningful ways, neither of which we’ve done well yet. Those are just the thoughts off the top of my head! Happy to discuss this further!

Q: How to read a question and understand what it wants me to execute
A: Look back to the methods notes and the E2 review, where we discussed this in a bit more detail! But, briefly here, you want to break the question down into smaller steps. To do this, you have to 1) figure out what the question is asking and then 2) figure out how you could use code to do this. Often students struggle to bridge those two steps. In that case, I encourage students to take a step back and figure out how their brains would do the task…did you need to look at each value in a list of numbers and add them up? If yes, then you need a loop! Did you have to use conditional logic? If yes, then you need an if statement.
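
Here is a minimal sketch of that process for a made-up question: "add up every number in a list that is greater than 2." Looking at each value means a loop; deciding whether to include a value means an if. The list values are invented for illustration.

```python
numbers = [3, 8, 1, 10]  # made-up values

total = 0
for value in numbers:   # "look at each value" -> loop
    if value > 2:       # "only the ones greater than 2" -> conditional
        total += value  # "add them up" -> running total

print(total)  # 21
```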

Q: i know when to USE self, but when it comes to the def method (self) im confused on why we JUST use self
A: If you’re only using self, that indicates you don’t want/need any additional inputs from the user. Within a class, this typically means you’re going to only operate on existing attributes.
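
A short sketch of the difference, using a hypothetical `Counter` class: `increment` takes only `self` because it works entirely from an existing attribute, while `add` takes an extra parameter because it needs input from the user.

```python
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        # only self: operates on the existing attribute, no extra input needed
        self.count += 1

    def add(self, amount):
        # self plus one input supplied by the user
        self.count += amount


c = Counter()
c.increment()   # nothing goes in the parentheses; self is passed automatically
c.add(5)
print(c.count)  # 6
```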

Q: If you were not to use the += or -= what would you code instead?
A: If the code were my_variable += 6, this is shorthand for my_variable = my_variable + 6

Q: Will we have to be able to use read_* for any of our exams or assignments?
A: Yes, for A5 and the final

Course Announcements

Due this week:

CL8 due Fri
A5 due Sun
Complete E2

Reminder to complete the Food Insecurity Survey (by Friday)

The Zen of Python#

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Scientific Computing#

Scientific Computing is the application of computer programming to scientific applications: data analysis, simulation & modelling, plotting, etc.

Scientific Python: Scipy Stack#

Scipy = Scientific Python

scipy
numpy
pandas
Data Analysis in Python

`numpy`#

numpy - stands for numerical python

Note:

numpy includes a new class of object: the numpy array
this array has associated attributes
…and methods

External packages must be imported#

import numpy as np

arrays#

arrays - enable work/operations with matrices

Allow you to efficiently operate on arrays (linear algebra, matrix operations, etc.)

# Create some arrays of data
arr1 = np.array([[1, 2], [3, 4]])

arr1

array([[1, 2],
       [3, 4]])

# lists of lists don't store dimensionality well
[[1, 2], [3, 4]] 

[[1, 2], [3, 4]]
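
One way to see the difference, as a minimal sketch (the variable names `nested` and `arr` are chosen here for illustration): the array knows its shape and applies arithmetic to every element, while the list of lists only knows its outer length and treats `*` as repetition.

```python
import numpy as np

nested = [[1, 2], [3, 4]]
arr = np.array(nested)

print(arr.shape)    # (2, 2) -> rows, columns
print(len(nested))  # 2 -> only the outer list's length

# Multiplication is elementwise for arrays...
print((arr * 2).tolist())  # [[2, 4], [6, 8]]

# ...but repeats the contents for lists
print(nested * 2)          # [[1, 2], [3, 4], [1, 2], [3, 4]]
```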

Arrays: attributes, methods, & indexing#

# Check out an array of data
arr1

array([[1, 2],
       [3, 4]])

attributes#

numpy arrays are an object type…so they have associated attributes (below) and methods (we’ll get to these in a second)!

# Check the shape of the array
arr1.shape

(2, 2)

# Index into a numpy array
arr1[0, 0]
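
A few more indexing patterns on the same array, as a sketch. The variable names on the left are made up for illustration; the pattern is always `[row, column]`, and `:` means "everything along this dimension."

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])

first = arr1[0, 0]       # row 0, column 0
first_row = arr1[0, :]   # all columns of row 0
second_col = arr1[:, 1]  # all rows of column 1
last = arr1[-1, -1]      # negative indices count from the end

print(first)                # 1
print(first_row.tolist())   # [1, 2]
print(second_col.tolist())  # [2, 4]
print(last)                 # 4
```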

methods#

If you’re looping over an array, there’s probably a method for that…

# sum method
# by default sums all values in array
arr1.sum()

# sum method
# has an axis parameter
# axis=0 sums down the rows, giving one total per column
arr1.sum(axis=0)

array([4, 6])
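
To compare the options side by side, here is a small sketch on the same array. `axis=1` is not shown above; it is included here as the counterpart to `axis=0`.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])

total = arr1.sum()           # every value in the array
col_sums = arr1.sum(axis=0)  # one total per column: 1+3 and 2+4
row_sums = arr1.sum(axis=1)  # one total per row: 1+2 and 3+4

print(total)              # 10
print(col_sums.tolist())  # [4, 6]
print(row_sums.tolist())  # [3, 7]
```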

# typecasting to a different variable type
out_list = arr1.sum(axis=0).tolist()
print(out_list)
type(out_list)

[4, 6]

list

Activity: `numpy`#

Write code to:

Create a 3x3 array (my_array) that stores 9 different numbers
Use a method to identify the largest value in that array (store it in array_max)

Include your code in this Google Form: https://forms.gle/r87f4dpoNpu9wB5o7

# YOUR CODE HERE
my_array = np.array([[1,2,3], [4,523983249782398,6], [7,8,9]])
array_max = my_array.max()

`pandas`#

Pandas is a Python library for managing heterogeneous data.

At its core, Pandas is built around the DataFrame object, which is:

a data structure for labeled rows and columns of data
associated methods and utilities for working with data.
each column contains a pandas Series

import pandas as pd

# Create a dataframe 
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Check out the dataframe
df

	Name	Age
0	Alice	25
1	Bob	30
2	Charlie	35

# You can index in pandas
# columns store information in series
df['Age']

0    25
1    30
2    35
Name: Age, dtype: int64

# You can index in pandas
# loc selects by row label, column label
df.loc[0,:]

Name    Alice
Age        25
Name: 0, dtype: object
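
Two more selection patterns on the same dataframe, as a sketch. The variable names `bob_name` and `older` and the cutoff of 28 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# loc with a single row label and a single column label returns one value
bob_name = df.loc[1, 'Name']
print(bob_name)  # Bob

# A True/False condition inside [] keeps only the rows where it is True
older = df[df['Age'] > 28]
print(older['Name'].tolist())  # ['Bob', 'Charlie']
```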

# attribute of df object
# row, columns
df.shape

(3, 2)

# how many rows there are in a series/df
df.shape[0] # len(df) would also work

len(df)

Working with DataFrames#

There are a lot of functions and methods within pandas. The general syntax is df.method() where the method() operates directly on the dataframe df.

# calculate summary statistics
df.describe()

	Age
count	3.0
mean	30.0
std	5.0
min	25.0
25%	27.5
50%	30.0
75%	32.5
max	35.0
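
The same df.method() pattern applies to other methods, and to a single column. A brief sketch using `mean()`, `max()`, and `sort_values()` (the variable names are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Methods can operate on one column (a Series)
mean_age = df['Age'].mean()
max_age = df['Age'].max()
print(mean_age)  # 30.0
print(max_age)   # 35

# ...or on the whole dataframe; this returns a new, sorted dataframe
oldest_first = df.sort_values('Age', ascending=False)
print(oldest_first['Name'].tolist())  # ['Charlie', 'Bob', 'Alice']
```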

Data in `pandas`#

pd.read_*() | * is replaced with the file type (e.g. read_csv())
input to function is path to file or URL
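
As a minimal sketch that runs without a file or internet connection, the example below reads CSV text from an in-memory string. `io.StringIO` stands in for a file here, and the two rows are a hand-typed excerpt for illustration; normally you would pass a path or URL instead.

```python
import io
import pandas as pd

# Stand-in for the contents of a small CSV file
csv_text = "name,sleep_total\nCheetah,12.1\nCow,4.0\n"

# read_csv accepts a path, a URL, or a file-like object
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)             # (2, 2)
print(small_df.columns.tolist())  # ['name', 'sleep_total']
```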

For example…there is a very famous dataset about mammalian sleep. One copy of it is at the URL ‘https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv’

df = pd.read_csv('https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv')

# look at the data
df

	name	genus	vore	order	conservation	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
0	Cheetah	Acinonyx	carni	Carnivora	lc	12.1	NaN	NaN	11.9	NaN	50.000
1	Owl monkey	Aotus	omni	Primates	NaN	17.0	1.8	NaN	7.0	0.01550	0.480
2	Mountain beaver	Aplodontia	herbi	Rodentia	nt	14.4	2.4	NaN	9.6	NaN	1.350
3	Greater short-tailed shrew	Blarina	omni	Soricomorpha	lc	14.9	2.3	0.133333	9.1	0.00029	0.019
4	Cow	Bos	herbi	Artiodactyla	domesticated	4.0	0.7	0.666667	20.0	0.42300	600.000
...	...	...	...	...	...	...	...	...	...	...	...
78	Tree shrew	Tupaia	omni	Scandentia	NaN	8.9	2.6	0.233333	15.1	0.00250	0.104
79	Bottle-nosed dolphin	Tursiops	carni	Cetacea	NaN	5.2	NaN	NaN	18.8	NaN	173.330
80	Genet	Genetta	carni	Carnivora	NaN	6.3	1.3	NaN	17.7	0.01750	2.000
81	Arctic fox	Vulpes	carni	Carnivora	NaN	12.5	NaN	NaN	11.5	0.04450	3.380
82	Red fox	Vulpes	carni	Carnivora	NaN	9.8	2.4	0.350000	14.2	0.05040	4.230

83 rows × 11 columns

…we can access the attributes and execute the methods described above on this dataset:

# rows, columns
df.shape

(83, 11)

df.describe()

	sleep_total	sleep_rem	sleep_cycle	awake	brainwt	bodywt
count	83.000000	61.000000	32.000000	83.000000	56.000000	83.000000
mean	10.433735	1.875410	0.439583	13.567470	0.281581	166.136349
std	4.450357	1.298288	0.358680	4.452085	0.976414	786.839732
min	1.900000	0.100000	0.116667	4.100000	0.000140	0.005000
25%	7.850000	0.900000	0.183333	10.250000	0.002900	0.174000
50%	10.100000	1.500000	0.333333	13.900000	0.012400	1.670000
75%	13.750000	2.400000	0.579167	16.150000	0.125500	41.750000
max	19.900000	6.600000	1.500000	22.100000	5.712000	6654.000000

df['order'].value_counts()

order
Rodentia           22
Carnivora          12
Primates           12
Artiodactyla        6
Soricomorpha        5
Perissodactyla      3
Cetacea             3
Hyracoidea          3
Diprotodontia       2
Erinaceomorpha      2
Proboscidea         2
Chiroptera          2
Didelphimorphia     2
Cingulata           2
Lagomorpha          1
Pilosa              1
Monotremata         1
Afrosoricida        1
Scandentia          1
Name: count, dtype: int64

Activity: `pandas`#

The penguins dataset is a famous dataset provided within the seaborn package. To use it, run the two lines of code below.

Explore the dataset to answer the following questions:

How many different species of penguin are included in this dataset?
What is the average mass of penguins in this dataset?
(Bonus - time permitting) What is the average mass of female penguins in this dataset?

Include your answer on the Google form here: https://forms.gle/PrdnFP45bwX2JkHj9

import seaborn as sns
df = sns.load_dataset('penguins')

df

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
...	...	...	...	...	...	...	...
339	Gentoo	Biscoe	NaN	NaN	NaN	NaN	NaN
340	Gentoo	Biscoe	46.8	14.3	215.0	4850.0	Female
341	Gentoo	Biscoe	50.4	15.7	222.0	5750.0	Male
342	Gentoo	Biscoe	45.2	14.8	212.0	5200.0	Female
343	Gentoo	Biscoe	49.9	16.1	213.0	5400.0	Male

344 rows × 7 columns

# How many different species of penguin are included in this dataset?
df['species'].value_counts()
df['species'].nunique()
df['species'].unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

# What is the average mass of penguins in this dataset?
df.describe()
df['body_mass_g'].mean()

4201.754385964912

# What is the average mass of female penguins in this dataset?

# filter to only include the female penguins -> calculate the mean
df[df['sex'] == 'Female']['body_mass_g'].mean()

# group by sex -> calculate the mean for each group
df.groupby('sex')['body_mass_g'].mean()

sex
Female    3862.272727
Male      4545.684524
Name: body_mass_g, dtype: float64
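
To check that the two approaches agree, here is a sketch on a tiny hand-made dataframe. The four rows in `toy` are invented so the means are easy to verify by hand; they are not from the penguins dataset.

```python
import pandas as pd

# Made-up data, small enough to check by hand
toy = pd.DataFrame({
    'sex': ['Female', 'Male', 'Female', 'Male'],
    'body_mass_g': [3000.0, 4000.0, 3500.0, 5000.0],
})

# Approach 1: filter to one group, then take the mean
female_mean = toy[toy['sex'] == 'Female']['body_mass_g'].mean()
print(female_mean)  # 3250.0

# Approach 2: group by sex, then take the mean of every group at once
by_sex = toy.groupby('sex')['body_mass_g'].mean()
print(by_sex['Female'])  # 3250.0
print(by_sex['Male'])    # 4500.0
```

Filtering answers the question for one group; groupby answers it for every group in a single step.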

Everything below this is just FYI…not on assignment/lab/exam

Plotting#

%matplotlib inline

import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/ShanEllis/datasets/master/msleep.csv')

# Plot the data
plt.scatter(df['sleep_rem'], df['sleep_cycle']);

[Figure: scatter plot of sleep_rem (x-axis) against sleep_cycle (y-axis)]

can change plot type
lots of customizations possible
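
A small sketch of a few common customizations, using made-up x and y values so it runs without downloading data. The color, labels, and title are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# Made-up values, just to have something to plot
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

fig, ax = plt.subplots()

# color and alpha (transparency) are two of many styling options
ax.scatter(x, y, color='purple', alpha=0.7)

# Axis labels and a title make the plot readable on its own
ax.set_xlabel('x (made-up values)')
ax.set_ylabel('y (made-up values)')
ax.set_title('Customized scatter plot')
```

Swapping `ax.scatter` for `ax.plot` or `ax.bar` changes the plot type while keeping the same labels.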

Analysis#

scipy - statistical analysis
sklearn - machine learning

import scipy as sp
from scipy import stats

# Simulate some data
d1 = stats.norm.rvs(loc=0, size=1000)
d2 = stats.norm.rvs(loc=0.5, size=1000)

Analysis - Plotting the Data#

# Plot the data
plt.hist(d1, 25, alpha=0.6);
plt.hist(d2, 25, alpha=0.6);

[Figure: overlaid histograms of d1 and d2, 25 bins each]

Analysis - Statistical Comparisons#

# Statistically compare the two distributions
stats.ttest_ind(d1, d2)

TtestResult(statistic=-9.249173637189667, pvalue=5.627311200756971e-20, df=1998.0)
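
Because the data above are simulated randomly, the exact numbers change on every run. The sketch below adds the `random_state` argument (seeds 0 and 1 are arbitrary choices) so the simulation is repeatable, and pulls the statistic and p-value out of the result. The 0.05 cutoff is a common convention, assumed here for illustration.

```python
from scipy import stats

# Same simulation as above, with seeds so the draws are repeatable
d1 = stats.norm.rvs(loc=0, size=1000, random_state=0)
d2 = stats.norm.rvs(loc=0.5, size=1000, random_state=1)

result = stats.ttest_ind(d1, d2)

# The result stores the test statistic and p-value as attributes
print(result.statistic)
print(result.pvalue)

# A small p-value suggests the two group means differ
if result.pvalue < 0.05:
    print("The difference in means is statistically significant.")
```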

COGS 108: Data Science in Practice#

If you are interested in data science and scientific computing in Python, consider taking COGS 108: https://github.com/COGS108/.
