Educational example

In this assignment, we will analyze the “educ_data.csv” dataset. In this dataset, each row corresponds to an individual student \(s_1,...,s_{100}\) and each column to an individual question \(q_1,...,q_{40}\) on a test. The content of each cell is one of:

  • 0: the student got this question wrong,
  • 1: the student got this question right, or
  • NA: the student was not presented this question

Here are the data:

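# Set the working directory to the folder that contains educ_data.csv
setwd('U:\\uf\\courses\\bayesian course\\group activities\\6 example educ')
# Read the data; as.is=TRUE keeps the student IDs in column X as character strings
dat=read.csv('educ_data.csv',as.is=TRUE)
# Show the first rows of the ID column (X) and of questions q1 to q14
head(dat[,1:15])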
##    X q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14
## 1 s1 NA NA  0  0 NA  0  0 NA NA  NA   1   1   0  NA
## 2 s2  1 NA  1  0  0  0 NA NA NA   0  NA  NA  NA   0
## 3 s3 NA  0  0  0 NA  0  1 NA  0   0   0   0   0   1
## 4 s4 NA NA NA NA NA  0 NA NA NA   0   1   1  NA  NA
## 5 s5 NA NA  1 NA NA  1 NA NA NA  NA   0   1  NA  NA
## 6 s6  0  0 NA NA NA  0  0 NA NA  NA   0   0   0  NA

This type of data often arises from assessments in which the questions presented to each student are randomly drawn from a pool of questions.

More broadly, this type of data (and the models used to analyze it) also arises in several other areas. For example, the performance of basketball teams might be summarized in a similar table where each row and each column is a different team, and cell ij is equal to 1 if team i beat team j, 0 if team i lost to team j, and NA if these teams have not faced each other. Similarly, this type of data arises in social networks depicting who is connected with whom.
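
As a small illustration of the basketball table described above, here is a minimal sketch that builds such a matrix from a list of game results. The team names and results are hypothetical and serve only to show the layout.

```r
# Hypothetical teams and game results (invented for illustration)
teams <- c("A", "B", "C", "D")
games <- data.frame(winner = c("A", "B", "A"),
                    loser  = c("B", "C", "D"),
                    stringsAsFactors = FALSE)

# Start with NA everywhere: no pair of teams has played yet
res <- matrix(NA_integer_, nrow = length(teams), ncol = length(teams),
              dimnames = list(teams, teams))

# Fill in one game at a time: 1 for the winner's cell, 0 for the loser's cell
for (g in seq_len(nrow(games))) {
  res[games$winner[g], games$loser[g]] <- 1L
  res[games$loser[g], games$winner[g]] <- 0L
}
res
```

Pairs that never played (for example C and D) stay NA, just like the questions that were not presented to a student.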

Modeling these data

We are interested in determining the ability of each student. This would be easy if every student had been given the same set of questions, because we could just calculate the proportion of questions that each student got right. In our setting, this is more complicated because some students might have been given harder questions than others. Therefore, even if two students got 80% of their questions right, that doesn’t mean that they have the same ability/skill. To correctly model these data, we will have to take into account that the difficulty of each question also varies.
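
To make this concrete, here is a minimal sketch that computes the naive proportion correct for the six students shown in the head() output above. It uses only questions q1 to q14, so these numbers are for illustration and are not full-test scores.

```r
# First six students, questions q1 to q14, copied from the head() output above
resp <- rbind(
  s1 = c(NA, NA,  0,  0, NA,  0,  0, NA, NA, NA,  1,  1,  0, NA),
  s2 = c( 1, NA,  1,  0,  0,  0, NA, NA, NA,  0, NA, NA, NA,  0),
  s3 = c(NA,  0,  0,  0, NA,  0,  1, NA,  0,  0,  0,  0,  0,  1),
  s4 = c(NA, NA, NA, NA, NA,  0, NA, NA, NA,  0,  1,  1, NA, NA),
  s5 = c(NA, NA,  1, NA, NA,  1, NA, NA, NA, NA,  0,  1, NA, NA),
  s6 = c( 0,  0, NA, NA, NA,  0,  0, NA, NA, NA,  0,  0,  0, NA)
)
colnames(resp) <- paste0("q", 1:14)

# Number of questions each student was presented with
n_presented <- rowSums(!is.na(resp))

# Naive score: proportion correct among the presented questions only
prop_correct <- rowMeans(resp, na.rm = TRUE)
round(prop_correct, 2)

# Which questions did s1 and s2 actually see?
colnames(resp)[!is.na(resp["s1", ])]
colnames(resp)[!is.na(resp["s2", ])]
```

In this excerpt, s1 and s2 have the same naive score (2 out of 7), but they share only three questions (q3, q4 and q6), so the two scores are not directly comparable.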

Determining the difficulty of each question would also be easy if we had complete data. In that case, we would simply calculate the proportion of students that got each question right, and that would give us a sense of how difficult each question was. Unfortunately, harder questions might have been predominantly given to students with high ability. As a result, even if two questions were both answered correctly by 80% of the students who saw them, that does not mean that these questions are equally easy. Determining the difficulty of each question will require us to determine the skill level of each student.
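
The same issue can be seen from the question side. The sketch below uses an invented toy table (not the real data) in which two questions were seen by two different groups of students.

```r
# Invented toy data: q_a was shown only to s1-s5, q_b only to s6-s10
toy <- cbind(
  q_a = c( 1,  1,  1,  1,  0, NA, NA, NA, NA, NA),
  q_b = c(NA, NA, NA, NA, NA,  1,  1,  1,  1,  0)
)
rownames(toy) <- paste0("s", 1:10)

# Naive difficulty: proportion correct among the students who saw each question
prop_by_question <- colMeans(toy, na.rm = TRUE)
prop_by_question

# Number of students who saw each question
n_seen <- colSums(!is.na(toy))
n_seen
```

Both questions have a naive proportion of 0.8, but if s6 to s10 happened to be the stronger students, q_b could still be the harder question. The proportions alone cannot tell these two situations apart.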

To keep things simple, in this exercise we will assume that the data are missing at random (i.e., each student gets a random subset of questions to answer). In this assignment, I want you to:

  1. Come up with a generative model for these data
  2. Develop JAGS code to fit the model you developed in task 1
  3. Simulate data and fit your JAGS model to these data. Does the model work?
  4. Fit your JAGS model to the real data
  5. Say that we believe that students who score higher on the SAT are likely to have greater ability/skill. How can we change our generative model to test this hypothesis?
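
For tasks 2 and 4, it can be convenient to pass only the observed cells to the model. The sketch below shows one possible way (an assumption, not a requirement of the assignment) to reshape a wide table with NAs into a long table with one row per presented question. The tiny matrix y is invented for illustration.

```r
# Invented 2 x 3 example in the same layout as the real data
y <- matrix(c(NA, 1, 0,
               0, NA, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("s1", "s2"), c("q1", "q2", "q3")))

# Row and column positions of all non-missing cells
idx <- which(!is.na(y), arr.ind = TRUE)

# Long format: one line per (student, question) pair that was actually presented
long <- data.frame(student  = unname(idx[, "row"]),
                   question = unname(idx[, "col"]),
                   correct  = y[idx])
long
```

The student and question columns are integer indices, which can then be used to index student-level and question-level parameters inside a model.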

Obs. 1: for this assignment, I want the Research & Evaluation Methodology (REM) folks to abstain from helping in the development of the generative model.

Obs. 2: for task 5, just focus on writing the equations of your generative model; don’t worry about simulating data or creating the actual JAGS code for this.
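
For task 3, the random-subset assumption stated earlier can be mimicked by hiding cells of a complete simulated table. The sketch below shows only this masking step. The complete table is a placeholder filled with coin flips, which is not a proposed generative model, and the number of questions per student (n_presented) is an assumption, since it is not given in the assignment and appears to vary across students in the real data.

```r
set.seed(1)
n_students  <- 100
n_questions <- 40
n_presented <- 15   # assumption: questions shown to each student

# Placeholder complete responses (coin flips); replace with draws from your own model
full <- matrix(rbinom(n_students * n_questions, size = 1, prob = 0.5),
               nrow = n_students, ncol = n_questions,
               dimnames = list(paste0("s", 1:n_students),
                               paste0("q", 1:n_questions)))

# For each student, hide a random set of questions so that only n_presented remain
obs <- full
for (s in seq_len(n_students)) {
  hidden <- sample(n_questions, n_questions - n_presented)
  obs[s, hidden] <- NA
}

# Check: every student ends up with exactly n_presented observed answers
table(rowSums(!is.na(obs)))
```

Because the hidden questions are chosen without looking at the responses, the missingness here does not depend on student ability or question difficulty.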
