
## 1. (Worth 100 marks)

### Introduction

To do something useful with big data, models are devised from the large numbers of observations in order to predict what will occur for some other observation(s). A simple linear model^{1} is of the form:

$$y_i = \sum_{j=1}^{N} a_j x_{ij} \tag{1}$$

where $y_i$ is the dependent variable, $i$ is the observation number (there are a total of $M$ observations), $x_{ij}$ is the set of independent variables, $N$ is the number of independent variables (for big data, $M \gg N$), and $a_j$ are the set of model coefficients. Equation (1) lends itself to a matrix formulation:

$$\mathbf{Y} = \mathbf{XA} \tag{2}$$
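Written out in full, the matrix formulation stacks the $M$ observations as rows and the $N$ independent variables as columns. The layout below is a sketch of that arrangement. The least-squares expression on the right is the standard result for an overdetermined system; it is not stated in the original text and assumes $M \ge N$ with $\mathbf{X}$ of full column rank (for $M = N$ it reduces to $\mathbf{A} = \mathbf{X}^{-1}\mathbf{Y}$).

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1N} \\
x_{21} & x_{22} & \cdots & x_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
x_{M1} & x_{M2} & \cdots & x_{MN}
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix},
\qquad
\mathbf{A} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Y}
$$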

The model coefficients $a_j$ are determined by measuring $y_i$ and $x_{ij}$. One of the dangers of developing such a model is “over-fitting” the data. This is where $a_j$ are tuned for the $M$ observations so that $a_j$ is an excellent model for $y_i$, $i \le M$, but is a poor model for $y_i$, $i > M$. Good practice is therefore to split the $M$ observations into a “training dataset” (with $M_1$ observations, $M_1 \ge N$ and typically $M_1 \gg N$) and a “test dataset” (with $M_2$ observations, $M_1 + M_2 = M$, and typically $M_2 < M_1$). The values of $a_j$ are determined from Eq. (1) using the training dataset (with $M_1$ observations). The values can then be validated using the test dataset by calculating $y_i$ using Eq. (1) and calculating the error from the actual values $\hat{y}_i$.

In this question, you are going to apply this methodology to determine if it is possible to estimate the mean pressure for the year based on temperature readings from each month. The ideal gas law is:

$$p = \rho R T \tag{3}$$

where $p$ is the pressure, $\rho$ is the density, $R$ the ideal gas constant and $T$ the temperature.
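The training/test procedure described above can be sketched in MATLAB as follows. This is a minimal illustration on synthetic data, not the assignment solution: the sizes, the columns of `X`, the coefficients `aTrue` and the choice of taking the first $M_1$ rows as the training dataset are all assumptions made for the example. The backslash operator returns the least-squares solution when there are more rows than columns.

```matlab
% Synthetic illustration only: sizes, columns and coefficients are assumptions.
M  = 28;  M2 = 8;  M1 = M - M2;            % total, test and training observation counts
k  = (1:M)';                               % observation index
X  = [ones(M,1), k, k.^2];                 % hypothetical independent variables (N = 3 columns)
aTrue = [2; 0.5; -0.01];                   % coefficients used to generate the data
Y  = X*aTrue + 0.01*sin(k);                % dependent variable with a small perturbation

trainIdx = 1:M1;                           % first M1 rows: training dataset
testIdx  = (M1+1):M;                       % last M2 rows: test dataset

A      = X(trainIdx,:) \ Y(trainIdx);      % least-squares solution of Y = XA (training rows only)
Ymodel = X(testIdx,:) * A;                 % modelled values for the unseen test rows
relErr = abs(Ymodel - Y(testIdx)) ./ abs(Y(testIdx));   % relative error per test row

fprintf('a = [%s]\n', num2str(A', '%.4f '));
fprintf('Largest relative error on the test dataset: %.2e\n', max(relErr));
```

A small error on the test rows suggests the coefficients generalise beyond the data they were fitted to; a large one is a symptom of over-fitting.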

Your computational task is to use Eq. (1),

$$\bar{p}_i = \sum_{j=1}^{N} a_j T_{ij} \tag{4}$$

in the form of Eq. (2),

$$\mathbf{P} = \mathbf{TA} \tag{5}$$

for the 9:00 readings. Here $\bar{p}_i$ is the average pressure across all months for day $i$, $T_{ij}$ is the temperature on day $i$ and month $j$ and $a_j$ is the average coefficient for month $j$. You will use the entire 12 months’ worth of data ($N = 12$) to calculate the average pressure for each calendar date ($M = 28$ since there are only 28 days in February), i.e. $\bar{p}_1$ is the average pressure calculated using the 1st of July, 1st of August, 1st of September, etc. For your assignment, the following value is to be used:

where $M_2$ is to be rounded to the nearest integer. Because $M > N$ (we don’t have $M \gg N$), we will work with $M_2 \le N$, which is not ideal, but is pragmatic, since it guarantees that $M_1 > N$ to produce statistically-good estimates of $a_j$.

### Requirements

For this assessment item, you must perform hand calculations using Eq. (5):

1. Calculate $a_1$ using only the 1st of July (i.e. $M = 1$, $N = 1$).
2. Calculate $a_1$ and $a_2$ using only the 1st and 2nd of both July and August (i.e. $M = 2$, $N = 2$).
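For these two hand calculations, Eq. (5) reduces to a single division and to a 2×2 linear system respectively. The expressions below are a sketch in symbols only (no data values), using $\bar{p}_i$ for the mean pressure on day $i$ and $T_{ij}$ for the temperature on day $i$ of month $j$; the 2×2 solution is written with Cramer's rule and assumes the determinant in the denominator is non-zero.

$$
M = N = 1: \quad a_1 = \frac{\bar{p}_1}{T_{11}}
$$

$$
M = N = 2: \quad
\begin{bmatrix} \bar{p}_1 \\ \bar{p}_2 \end{bmatrix}
=
\begin{bmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \end{bmatrix}
\;\Rightarrow\;
a_1 = \frac{\bar{p}_1 T_{22} - \bar{p}_2 T_{12}}{T_{11} T_{22} - T_{12} T_{21}},
\quad
a_2 = \frac{\bar{p}_2 T_{11} - \bar{p}_1 T_{21}}{T_{11} T_{22} - T_{12} T_{21}}
$$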

You must also produce MATLAB code which uses Eq. (5):

3. Repeats Requirements 1 and 2. Reports and verifies the
4. \*Successfully loads all the relevant
5. \*Repeats Requirement 2 using the loaded data. Reports and verifies the
6. \*\*Reports the value of $M_2$ before it is rounded, to confirm the values of $M_1$ and $M_2$ you are to use. Calculates all the $a_j$ using the training dataset of $M_1$ values and reports $a_j$.
7. \*\*Uses the test dataset of $M_2$ values to assess the quality of the modelled values of $\bar{p}_j$.
8. \*\*\*The accuracy of the results is limited because the variability in the temperature and pressure data is in the 3rd or 4th significant figure, and also because we do not have big data. To remedy the problem of significant figures, the data should be normalized. The first normalization technique to use in this circumstance is to “centre” the data in the matrix $\mathbf{T}$ (subtract a constant value, sometimes the mean, from all the data), which will make the variability in the 1st or 2nd significant figure. Use 15 °C to centre the temperature data, produce new $a_j$ $\mathbf{T}$ (non-dimensionalising, normally by dividing by the standard deviation) so that all the quantities are of the same order of magnitude^{2}.
9. Discusses the
10. Have appropriate comments

The projected difficulty of a Requirement is indicated by the number of * at the start. All students are expected to be able to complete Requirements which do not have an *.
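The centring and non-dimensionalising steps mentioned in the three-star requirement can be sketched as below. The temperature values are hypothetical, and dividing by the standard deviation of the whole centred matrix (rather than column by column) is an assumption, since the text does not specify which is intended.

```matlab
% Hypothetical 9:00 temperatures in deg C (rows: days, columns: months).
T = [14.2 16.8 19.5;
     13.9 17.1 20.2;
     14.6 16.5 19.9;
     14.0 17.4 20.6];

Tref    = 15;                            % centring constant in deg C, as given in the text
Tcentre = T - Tref;                      % centred data: variability moves to the leading digits
Tscaled = Tcentre ./ std(Tcentre(:));    % non-dimensional: divide by the overall standard deviation

fprintf('Raw range:     %.1f to %.1f\n', min(T(:)), max(T(:)));
fprintf('Centred range: %.1f to %.1f\n', min(Tcentre(:)), max(Tcentre(:)));
fprintf('Std of scaled data: %.4f\n', std(Tscaled(:)));
```

The coefficients obtained from the centred or scaled matrix are not directly comparable with those from the raw matrix, so any comparison should be made on the modelled pressures rather than on the $a_j$ themselves.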

### Assessment Criteria

Your code will be assessed using the following scheme. Note that you are marked based on how well you perform for each category, so the correct answer determined in a basic way will receive half marks and the correct answer determined using an excellent method/code will receive full marks.

## 2. (Worth 100 marks)

### Introduction

When data is being measured, it is common for some of it to be missing, which could be due to a fault in the measuring equipment, or the variable being unmeasurable at that moment. In the weather data for Dalby, the maximum temperature was not recorded on 29th October 2017, presumably because not all the temperature readings were recorded for that day, so it is impossible to know whether the highest recorded temperature was actually the maximum. Leaving unknown/unreliable readings blank is the best option, since an inserted value (such as zero) could be mistaken for a valid reading and therefore pollutes the data (this is why my preferred option is to fill an empty slot in an array with NaN, since it is unlikely to have occurred from a calculation).

If you need to be able to use a value where there is one missing, then you need to use some method of including an intelligent guess. In this question, you will use a global curve-fit to provide the guess. All of you will use $T_3$ (the temperature measured at 3:00 pm) as the independent variable to model the maximum daily temperature, $T_{\max}$. You will also compare the outcome for this modelling to using another variable, $V$, as the independent variable. For your assignment, the following value is to be used:

$$Q_2 = 4.1501$$

The independent variable (besides $T_3$) you are to use is based on your value of $Q_2$:

$$V \equiv \begin{cases} T_{\min}, & Q_2 \le 5 \\ T_9, & Q_2 > 5 \end{cases}$$

where $T$ is temperature and the subscript refers to either the daily minimum or the particular time of day.

Your task is to estimate the value of $T_{\max}$ on 29th October 2017 using both $T_3$ and $V$ as the independent variable.

### Requirements

For this assessment item, you must perform hand calculations using $T_{\max}$ and $T_3$:

1. Take the values from 28th and 30th October 2017 and estimate the coefficients of the three standard curve-fitting functions. These data points will provide a qualitative representation of the overall
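With only two data points $(x_1, y_1)$ and $(x_2, y_2)$, each two-parameter function passes through both points exactly, so its coefficients follow in closed form. The sketch below assumes that the "three standard curve-fitting functions" are the linear, power and exponential forms; the text does not name them, so treat that choice as an assumption. Here $x$ stands for $T_3$ and $y$ for $T_{\max}$, and the power and exponential forms require positive values inside the logarithms.

$$
\begin{aligned}
\text{Linear: } & y = a x + b, & a &= \frac{y_2 - y_1}{x_2 - x_1}, & b &= y_1 - a x_1 \\
\text{Power: } & y = b x^{a}, & a &= \frac{\ln(y_2 / y_1)}{\ln(x_2 / x_1)}, & b &= \frac{y_1}{x_1^{a}} \\
\text{Exponential: } & y = b e^{a x}, & a &= \frac{\ln(y_2 / y_1)}{x_2 - x_1}, & b &= y_1 e^{-a x_1}
\end{aligned}
$$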

You must also produce MATLAB code which:

2. Repeats Requirement 1 and verifies the
3. \*Performs curve-fits of all the data for $T_{\max}$ and $T_3$. Use the MATLAB function isfinite to filter the dataset so that only those dates with recordings of both $T_{\max}$ and $T_3$ are included.
4. Validates the three standard curve-fitting functions obtained in Requirement 3 by comparing with the parameters obtained in Requirement 1. Given the limited data used in Requirement 1 and the overall scatter in data, don’t expect the values to be very
5. Determines which curve-fit is the
6. Demonstrates that the chosen curve-fit is the best both graphically and numerically, showing both the data and the relevant curve-fit.
7. Displays a message in the Command Window stating which type of curve-fit was chosen, stating the parameters of the curve-fit and the result of the numerical test of the curve-fit.
8. Plots the best curve-fit along with the data in a separate figure with normal-scale
9. Uses the best curve-fit to estimate $T_{\max}$ for 29th October
10. \*Reports the value of $Q_2$, leading to the selection of $V$. Repeats Requirements 3, 8 and 9 using only a linear curve-fit with $V$ the independent variable instead of $T_3$. Plots the curve-fit along with the data. Compares and discusses the two estimates for $T_{\max}$.
11. Have appropriate comments

The projected difficulty of a Requirement is indicated by the number of * at the start. All students are expected to be able to complete Requirements which do not have an *.
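The filtering step with isfinite, followed by a curve-fit and a numerical check, might look like the sketch below. The readings are made up for illustration, only the linear fit is shown, and using the coefficient of determination $R^2$ as the "numerical test" is an assumption; the text does not say which measure is expected.

```matlab
% Hypothetical readings in deg C; NaN marks a missing value.
T3   = [24.1 26.3 NaN  22.8 27.5 25.0 23.4];
Tmax = [25.0 27.4 28.1 23.5 NaN  26.1 24.2];

ok = isfinite(T3) & isfinite(Tmax);   % keep only dates with both readings present
x  = T3(ok);
y  = Tmax(ok);

c    = polyfit(x, y, 1);              % linear fit: y = c(1)*x + c(2)
yFit = polyval(c, x);                 % fitted values at the retained points
R2   = 1 - sum((y - yFit).^2) / sum((y - mean(y)).^2);   % coefficient of determination

fprintf('Tmax = %.3f*T3 + %.3f, R^2 = %.4f (%d points)\n', c(1), c(2), R2, numel(x));
```

Because NaN propagates through arithmetic, skipping the filter would make every coefficient returned by polyfit NaN, which is the behaviour that makes NaN a safe placeholder for missing readings.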

### Assessment Criteria

Your code will be assessed using the following scheme. Note that you are marked based on how well you perform for each category, so the correct answer determined in a basic way will receive half marks and the correct answer determined using an excellent method/code will receive full marks.

## 3. (Worth 100 marks)

### Introduction

This question provides an alternative methodology to the problem in Question 2. In this question, you will use interpolation to provide the guess. Time is an obvious independent variable to use, but Question 2 suggests that there are other options. For your assignment, the following value is to be used:

$$Q_3 = 2.4937$$

The independent variable (besides time), $V$, you are to use is based on your value of $Q_3$:

$$V \equiv \begin{cases} T_{\min}, & Q_3 \le 5 \\ T_3, & Q_3 > 5 \end{cases}$$

where $T$ is temperature and the subscript refers to either the daily minimum or the particular time of day.

Your task is to estimate the value of $T_{\max}$ on 29th October 2017 using these two independent variables. For this problem you are to only use data from 19th to 31st October inclusive (a total of 12 days not including 29th October). These dates have been chosen to ensure that you do not have repeated values of the independent variable (which is numerically problematic) and also that you have sufficient data either side of 29th October for the interpolation methods to be numerically reliable.
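For reference, linear interpolation between two bracketing points $(x_1, y_1)$ and $(x_2, y_2)$ with $x_1 < x < x_2$ takes the standard form below, where $x$ is the independent variable (time or $V$) and $y$ is the maximum temperature. The second line is a sketch of the special case where time is measured in whole days and the bracketing days are the 28th and 30th, so the estimate is simply the mean of the two neighbouring values.

$$
y(x) = y_1 + \frac{y_2 - y_1}{x_2 - x_1}\,(x - x_1)
$$

$$
x_1 = 28,\; x_2 = 30,\; x = 29: \quad y(29) = \frac{y_1 + y_2}{2}
$$

The formula also shows why repeated values of the independent variable are problematic: $x_1 = x_2$ makes the denominator zero.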

### Requirements

For this assessment item, you must perform hand calculations:

1. Estimate the maximum temperature on 29th October 2017 using linear interpolation with time the independent
2. \*\*Repeat Requirement 1 using $V$ as the independent variable.

You must also produce MATLAB code which:

3. Repeats Requirement 1, reporting the result to the Command
4. Verifies Requirement
5. Repeats Requirement 1 using cubic splines. Additionally validates the result with Requirement
6. \*\*\*Reports the value of $Q_3$, leading to the selection of $V$. Repeats the calculations of Requirements 3–5 using $V$ as the independent variable. The results using the cubic spline method will not be very good for one reason or another. The function sortrows
7. Compares and discusses the 2 answers for $T_{\max}$ on 29th October 2017 from Question 2 and the 4 answers from this
8. Has an appropriate comment
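The interpolation steps in the list above can be sketched with interp1, which supports both linear and cubic-spline methods. All values below are hypothetical, including the value of $V$ assumed for the missing day. The (V, Tmax) pairs are sorted with sortrows so that $V$ is in ascending order before interpolating.

```matlab
% Hypothetical data: day of month and maximum temperature (deg C); day 29 is missing.
day  = [25 26 27 28 30 31];
Tmax = [27.0 28.4 29.1 30.2 28.8 27.5];

TmaxLinear = interp1(day, Tmax, 29, 'linear');   % straight line between days 28 and 30
TmaxSpline = interp1(day, Tmax, 29, 'spline');   % cubic spline through all the points

% With another variable V as the independent variable the pairs are generally
% not in ascending order of V, so sort them by V first.
V      = [12.1 14.6 13.2 15.8 11.4 12.9];        % hypothetical values with no repeats
sorted = sortrows([V' Tmax'], 1);                % sort rows by column 1 (V)
V29    = 13.9;                                   % made-up value of V on day 29
TmaxFromV = interp1(sorted(:,1), sorted(:,2), V29, 'linear');

fprintf('Linear in time: %.2f, spline in time: %.2f, linear in V: %.2f\n', ...
        TmaxLinear, TmaxSpline, TmaxFromV);
```

Sorting by $V$ scrambles the time order, so neighbouring points in $V$ may come from days far apart; a spline through such scattered data can oscillate, which is one possible reason for poor spline results.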

### Assessment Criteria

Your code will be assessed using the following scheme. Note that you are marked based on how well you perform for each category, so the correct answer determined in a basic way will receive half marks and the correct answer determined using an excellent method/code will receive full marks.

## 4. (Worth 100 marks)

### Introduction

Predicting what will occur is the essence of a model. Many people rely on weather forecasting for their livelihood, and many more people for their ordinary lives. Your task is to devise and construct a simple model for two of the variables in the Bureau of Meteorology dataset: the temperature at 3:00 pm ($T_3$) and the relative humidity at 9:00 am ($\varphi_9$).

To obtain consistent results from your random number generation, you should initialise the seed to a fixed value using rng(seed). For your assignment, the value is:

**seed = 1 .1259**

### Requirements

For this assessment item, you must perform hand calculations on the data for July:

1. Calculate the sample mean and standard deviation for $T_3$ and $\varphi_9$ for the first 10 data values of each
2. Use the first 20 data values of $T_3$ and $\varphi_9$ to estimate the sample pdf (i.e. the scaled frequency). Plot the sample pdfs. You can produce the plots in MATLAB, but you must perform the calculations of the value of the pdf for each value of $T_3$ or $\varphi_9$ by
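The quantities in these hand calculations can be written as below for a sample $x_1, \dots, x_n$. This is a sketch under two assumptions not stated in the text: the standard deviation uses the $n - 1$ denominator (which matches the default of MATLAB's std), and the sample pdf is formed from counts $n_b$ in bins of equal width $\Delta x$.

$$
\bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}\left(x_k - \bar{x}\right)^2},
\qquad
f_b = \frac{n_b}{n\,\Delta x}
$$

Scaling the counts by $n\,\Delta x$ makes the bar areas sum to one, $\sum_b f_b\,\Delta x = 1$, so the sample pdf can be compared directly with a population pdf.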

You must also produce MATLAB code which:

3. Repeats the hand calculations and verifies the MATLAB
4. Loads the entire 12 months’ worth of data. The rest of the analysis is to be on the full dataset in this file^{3}.
5. \*One of $T_3$ and $\varphi_9$ can be represented by a standard distribution. Determine which variable and the corresponding distribution which best describes the variable, including proof that this distribution is an appropriate model and proof that the other variable cannot be modelled by the same distribution. Part of this proof may be demonstrated by completing Requirements 6–8.
6. \*Calculates the parameters of the population distribution for both variables, including the associated error in the estimation of the parameters. Reports the values to the Command Window.
7. \*Calculates the sample mean and sample standard deviation for both variables, along with an assessment of the accuracy of these
8. \*Graphically compares the sample pdf and population pdf for both
9. \*Reports to the Command Window a discussion of the applicability of the population distribution for both
10. \*\*Performs modelling for only the chosen variable from Requirement 5. Produces a prediction of the values for the 12 months by randomly generating samples from the distribution. Plots the time series of values which is thus produced along with the history of the recorded values. Discusses the validity of the predicted values in predicting the actual history (some analysis will assist you in drawing conclusions).
11. Has an appropriate comment
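The statistical steps in the list above can be sketched as follows. The sample values, the bin edges and the integer seed are hypothetical, and the normal distribution is used purely as an example because the text does not state which standard distribution applies. The standard error of $s$ is the usual large-sample approximation under normality.

```matlab
% Hypothetical sample in deg C; substitute the recorded values.
x = [21.4 23.0 19.8 22.5 24.1 20.7 22.9 21.8 23.6 20.2];
n = numel(x);

xbar   = mean(x);                  % sample mean
s      = std(x);                   % sample standard deviation (n-1 denominator)
seMean = s / sqrt(n);              % standard error of the mean
seStd  = s / sqrt(2*(n-1));        % approximate standard error of s (normality assumed)

% Sample pdf: counts per bin scaled by n and the bin width so the area is 1.
binWidth = 1;
edges    = 19:binWidth:25;         % hypothetical bin edges covering the sample
counts   = zeros(1, numel(edges)-1);
for b = 1:numel(counts)
    counts(b) = sum(x >= edges(b) & x < edges(b+1));
end
pdfSample = counts / (n*binWidth);
centres   = edges(1:end-1) + binWidth/2;

% Normal pdf with the sample parameters, evaluated at the bin centres.
pdfNormal = exp(-(centres - xbar).^2 / (2*s^2)) / (s*sqrt(2*pi));

% Synthetic 12-month series drawn from the fitted normal distribution.
rng(1);                            % hypothetical integer seed, not the assignment value
xSim = xbar + s*randn(1, 365);

fprintf('mean = %.2f +/- %.2f, std = %.2f +/- %.2f\n', xbar, seMean, s, seStd);
disp([centres; pdfSample; pdfNormal]');
```

Samples drawn this way are independent from day to day, so the simulated series reproduces the distribution of values but not any seasonal trend in the recorded history.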

### Assessment Criteria

### Submission

Submit your code, with the *.csv files that are provided to you, by the due date to the StudyDesk. Submit your hand calculations as a pdf file. Note that:

- You do not need to rename your files when uploading: the system automatically segregates different students’
- If you can see that the files have uploaded, then you have successfully submitted your assignment. There is no need to click a “send for marking” button, but you will have to click a button confirming that the submission is your own work.
- You **MUST** upload all of your code along with input/output files in a *.zip file. The following are the only file types that can be submitted:
  - *.zip
  - *.doc
  - *.docx

The system will block any attempt by you to upload a file which doesn’t match any of those file extensions.

- If you forgot to submit a file, do not upload it after the due date: the submission time is based on when the last file was uploaded. You should email the examiner in this circumstance (with any file attached). If you remember close to midnight on the day you made your submission, you only need to upload the file (don’t bother emailing), since the submission day will effectively be the same.