The datasets provided by the Medical Analytics Group included eighty-five features for facilities 1 through 3 and fifty-seven features for facility 4. Facility 4 is a walk-in facility and does not provide the same exams as the other three facilities, so it has fewer features.
Features are also sometimes referred to as “variables” or “attributes” and represent a column in the tabular data or csv files. Each feature, or column, represents a measurable piece of data that can be used for analysis: DayOfYear, ArrivalTime, ScheduleTime, and so on.
There is no right or wrong answer when selecting features. Part of the Data Challenge was to reduce the number of features used to build the model. Selecting features is a process of trial and error to find the set that yields an accurate prediction.
I eliminated features that were derived from averages or sums of other features because I believe they dilute the magnitude of the measured real-time variables. The csv files I used to build the models for facilities 1 through 3 contained forty-four features to predict wait time. For facility 4, a walk-in facility where fewer exam types are performed, I used twenty-five features to predict wait time.
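The elimination step can be sketched in plain Swift. The column names and the "Avg"/"Sum" prefixes below are hypothetical and chosen only for illustration; the derived features in the actual Data Challenge files may be named differently and may need to be listed by hand.

```swift
// Hypothetical column names; the real csv headers may differ.
let allColumns = ["DayOfYear", "ArrivalTime", "ScheduleTime",
                  "AvgWaitByTaskType", "SumWaits", "Wait"]

// Assumption: derived features can be recognized by a name prefix.
let derivedPrefixes = ["Avg", "Sum"]

// Keep only the columns that are not derived from other columns.
let selectedColumns = allColumns.filter { column in
    !derivedPrefixes.contains { prefix in column.hasPrefix(prefix) }
}

print(selectedColumns)
// prints ["DayOfYear", "ArrivalTime", "ScheduleTime", "Wait"]
```

A list like this, with the target column removed, could be supplied as the featureColumns argument when creating the classifier instead of editing the csv file.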
The models were built in a Swift playground on a Mac using the MLBoostedTreeClassifier from Apple’s Create ML framework. This classifier is Apple’s counterpart to XGBoost in Python or LSBoost in MATLAB. In addition to gradient boosted decision trees, I also tried logistic regression, linear regression and random forest. I researched typical parameters for the MLBoostedTreeClassifier and tweaked them (mainly max iterations) to improve the model’s validation accuracy and to minimize training log loss and validation log loss.
//Define the model parameters
let boostedTreeModelParameters = MLBoostedTreeClassifier.ModelParameters.init(validation: MLBoostedTreeClassifier.ModelParameters.ValidationData.split(strategy: .automatic), maxDepth: 6, maxIterations: 300, minLossReduction: 0.0, minChildWeight: 0.1, randomSeed: 42, stepSize: 0.3, earlyStoppingRounds: 200, rowSubsample: 1.0, columnSubsample: 1.0)
The models were built using a randomized 80/20 split of the data. Each model was in the 96-97% training accuracy range after ~300 iterations using the boosted trees algorithm.
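To make the idea of a seeded random split concrete, here is a minimal sketch in plain Swift. It is not how MLDataTable implements randomSplit (whose resulting proportions may only be approximate); the SeededGenerator type and the randomSplit function are hypothetical helpers written for this illustration.

```swift
// A tiny seedable generator (SplitMix64) so the split is repeatable.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

// Shuffle the row indices, then cut the list at the requested fraction.
func randomSplit(rowCount: Int, trainingFraction: Double, seed: UInt64)
    -> (training: [Int], test: [Int]) {
    var generator = SeededGenerator(seed: seed)
    let shuffled = Array(0..<rowCount).shuffled(using: &generator)
    let cut = Int(Double(rowCount) * trainingFraction)
    return (Array(shuffled[..<cut]), Array(shuffled[cut...]))
}

let split = randomSplit(rowCount: 1000, trainingFraction: 0.8, seed: 0)
print(split.training.count, split.test.count) // prints "800 200"
```

Using a fixed seed means the same rows land in the training and test sets on every run, which makes results comparable when tweaking parameters.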
Create a Swift Playground on your Mac in Xcode. We will use the code below to build the machine-learning model for each of the four facilities. Copy the code and paste it into your Swift Playground.
//
// PatientWaitTime.playground
//
import Cocoa
import CoreML
import CreateML
//Define Paths
let dataPath = "/Users/jburke/Developer/Machine Learning/Patient Wait Time/Data/"
let modelPath = "/Users/jburke/Developer/Machine Learning/Patient Wait Time/Model/"
//Define filenames - change the filename prefix to F1, F2, F3 or F4 for the facility whose model you are building
let filename = "F1WaitTime"
let csvFilename = filename + ".csv"
let logisticRegressionModelFilename = "logisticRegression" + filename + ".mlmodel"
let linearRegressionModelFilename = "linearRegression" + filename + ".mlmodel"
let boostedTreeModelFilename = "boostedTree" + filename + ".mlmodel"
let trainingCSV = URL(fileURLWithPath: dataPath + csvFilename)
let waitData = try MLDataTable(contentsOf: trainingCSV)
//Randomly split the training data to use some of the data to test the model
let (trainingData, testData) = waitData.randomSplit(by: 0.8, seed: 0)
//Define the model parameters
let boostedTreeModelParameters = MLBoostedTreeClassifier.ModelParameters.init(validation: MLBoostedTreeClassifier.ModelParameters.ValidationData.split(strategy: .automatic), maxDepth: 6, maxIterations: 300, minLossReduction: 0.0, minChildWeight: 0.1, randomSeed: 42, stepSize: 0.3, earlyStoppingRounds: 200, rowSubsample: 1.0, columnSubsample: 1.0)
//Create the Model - The targetColumn "Wait"
//Boosted Tree Classifier
let modelFilename = boostedTreeModelFilename
let patientWaitTimeModel = try MLBoostedTreeClassifier(trainingData: trainingData,
                                                       targetColumn: "Wait",
                                                       featureColumns: nil,
                                                       parameters: boostedTreeModelParameters)
//Evaluate Model
let evaluationMetrics = patientWaitTimeModel.evaluation(on: testData)
let trainingMetrics = patientWaitTimeModel.trainingMetrics
let validationMetrics = patientWaitTimeModel.validationMetrics
//Uncomment to print the metrics
//print("Training Metrics\n", trainingMetrics)
//print("Validation Metrics\n", validationMetrics)
//print("Evaluation Metrics\n", evaluationMetrics)
//Save the model in the ML Model directory
let outputURL = URL(fileURLWithPath: modelPath + modelFilename)
let modelMetadata = MLModelMetadata(author: "John Burke",
                                    shortDescription: modelFilename + " From " + dataPath + csvFilename,
                                    license: nil,
                                    version: "2.0",
                                    additional: nil)
try patientWaitTimeModel.write(to: outputURL, metadata: modelMetadata)
You should create a csv file for each of the four facilities and run this code to create a machine-learning model for each one.
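The per-facility file names follow directly from the naming convention in the playground. A small sketch of how they could be generated for all four facilities; the filenames(forFacility:) helper is hypothetical and not part of the playground above.

```swift
// Build the csv and model file names for one facility prefix,
// following the naming convention used in the playground.
func filenames(forFacility facility: String) -> (csv: String, model: String) {
    let filename = facility + "WaitTime"
    return (filename + ".csv", "boostedTree" + filename + ".mlmodel")
}

// One csv file in, one boosted tree model out, per facility.
for facility in ["F1", "F2", "F3", "F4"] {
    let names = filenames(forFacility: facility)
    print(names.csv, "->", names.model)
}
```

The first line printed is "F1WaitTime.csv -> boostedTreeF1WaitTime.mlmodel", which matches the file names in the facility 1 run.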
Here is the output from building a boosted tree machine-learning model for facility 1.
column_type_hints = {}
Finished parsing file /Users/jburke/Developer/Machine Learning/Patient Wait Time/Data/F1WaitTime.csv
Parsing completed. Parsed 100 lines in 0.080386 secs.
Finished parsing file /Users/jburke/Developer/Machine Learning/Patient Wait Time/Data/F1WaitTime.csv
Parsing completed. Parsed 42766 lines in 0.138116 secs.
Using 44 features to train a model to predict Wait.
Boosted trees classifier:
--------------------------------------------------------
Number of examples : 32611
Number of classes : 194
Number of feature columns : 44
Number of unpacked features : 44
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| Iteration | Elapsed Time | Training Accuracy | Validation Accuracy | Training Log Loss | Validation Log Loss |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
| 1 | 1.664466 | 0.187084 | 0.161212 | 4.070683 | 4.280230 |
| 2 | 3.306101 | 0.307749 | 0.256970 | 3.384371 | 3.700822 |
| 3 | 4.973208 | 0.434087 | 0.383636 | 2.781713 | 3.117448 |
| 4 | 6.658532 | 0.535893 | 0.481212 | 2.328966 | 2.713683 |
| 5 | 8.369142 | 0.607065 | 0.545455 | 2.015430 | 2.437950 |
| 10 | 17.636758 | 0.728680 | 0.615152 | 1.394644 | 2.018742 |
| 15 | 27.001653 | 0.785011 | 0.643636 | 1.109419 | 1.841601 |
| 20 | 36.443930 | 0.827175 | 0.650303 | 0.915941 | 1.734129 |
| 25 | 45.966680 | 0.855356 | 0.669697 | 0.784234 | 1.657891 |
| 30 | 55.564983 | 0.878569 | 0.670909 | 0.681455 | 1.600850 |
| 35 | 65.404385 | 0.899972 | 0.679394 | 0.592665 | 1.555525 |
| 40 | 75.038794 | 0.915121 | 0.684242 | 0.530229 | 1.527174 |
| 45 | 84.700265 | 0.927601 | 0.689697 | 0.474866 | 1.499194 |
| 50 | 94.221726 | 0.937352 | 0.692727 | 0.427109 | 1.480991 |
| 55 | 103.546124 | 0.947564 | 0.693939 | 0.383834 | 1.464761 |
| 60 | 112.784828 | 0.956364 | 0.695152 | 0.347132 | 1.454755 |
| 65 | 122.029092 | 0.963387 | 0.693939 | 0.315175 | 1.443222 |
| 70 | 131.156561 | 0.969151 | 0.696970 | 0.287578 | 1.434286 |
| 75 | 140.241034 | 0.974456 | 0.696970 | 0.262062 | 1.424328 |
| 80 | 149.292595 | 0.978167 | 0.698788 | 0.240512 | 1.418887 |
| 85 | 158.360742 | 0.982368 | 0.701818 | 0.222228 | 1.414621 |
| 90 | 167.361038 | 0.985465 | 0.701818 | 0.205247 | 1.410939 |
| 95 | 176.327705 | 0.987888 | 0.703030 | 0.190215 | 1.409123 |
| 100 | 185.258737 | 0.989758 | 0.702424 | 0.177541 | 1.406791 |
| 105 | 194.097648 | 0.991322 | 0.703030 | 0.163898 | 1.406120 |
| 110 | 202.883489 | 0.993039 | 0.701818 | 0.152564 | 1.406783 |
| 115 | 211.703826 | 0.994542 | 0.698788 | 0.141125 | 1.408289 |
| 120 | 220.389271 | 0.995922 | 0.701212 | 0.130378 | 1.407490 |
| 125 | 228.949571 | 0.996688 | 0.702424 | 0.121648 | 1.407445 |
| 130 | 237.370641 | 0.997455 | 0.700000 | 0.113246 | 1.405630 |
| 135 | 245.704059 | 0.998252 | 0.703030 | 0.106133 | 1.402067 |
| 140 | 254.229622 | 0.998620 | 0.703636 | 0.099265 | 1.404117 |
| 145 | 262.714381 | 0.999049 | 0.701212 | 0.092882 | 1.403700 |
| 150 | 271.714882 | 0.999203 | 0.701212 | 0.087101 | 1.404575 |
| 155 | 280.004304 | 0.999295 | 0.698788 | 0.081968 | 1.401182 |
| 160 | 288.267176 | 0.999448 | 0.702424 | 0.076835 | 1.400074 |
| 165 | 296.415693 | 0.999693 | 0.703636 | 0.072396 | 1.400473 |
| 170 | 304.686847 | 0.999755 | 0.703030 | 0.068199 | 1.401074 |
| 175 | 313.036721 | 0.999816 | 0.701212 | 0.064022 | 1.399951 |
| 180 | 321.499802 | 0.999847 | 0.700606 | 0.060901 | 1.400793 |
| 185 | 329.809332 | 0.999877 | 0.700606 | 0.057425 | 1.401074 |
| 190 | 338.038631 | 0.999908 | 0.700000 | 0.054170 | 1.401520 |
| 195 | 346.220015 | 0.999939 | 0.700000 | 0.051183 | 1.402575 |
| 200 | 354.542706 | 1.000000 | 0.701212 | 0.047863 | 1.400612 |
| 205 | 362.823121 | 1.000000 | 0.700606 | 0.045256 | 1.402357 |
| 210 | 371.026889 | 1.000000 | 0.701818 | 0.042713 | 1.403160 |
| 215 | 379.373559 | 1.000000 | 0.699394 | 0.040605 | 1.404624 |
| 220 | 387.451039 | 1.000000 | 0.700606 | 0.038528 | 1.405852 |
| 225 | 395.700176 | 1.000000 | 0.701212 | 0.036674 | 1.405921 |
| 230 | 403.760667 | 1.000000 | 0.703030 | 0.034802 | 1.406686 |
| 235 | 412.441899 | 1.000000 | 0.703030 | 0.033215 | 1.407763 |
| 240 | 420.752285 | 1.000000 | 0.701818 | 0.031777 | 1.407523 |
| 245 | 428.916207 | 1.000000 | 0.703636 | 0.030330 | 1.408970 |
| 250 | 441.697279 | 1.000000 | 0.701818 | 0.029042 | 1.410096 |
| 255 | 452.706851 | 1.000000 | 0.701818 | 0.027842 | 1.411463 |
| 260 | 461.962483 | 1.000000 | 0.703030 | 0.026564 | 1.412802 |
| 265 | 471.221901 | 1.000000 | 0.704849 | 0.025318 | 1.414368 |
| 270 | 480.446608 | 1.000000 | 0.702424 | 0.024140 | 1.414573 |
| 275 | 489.562351 | 1.000000 | 0.703636 | 0.023198 | 1.415203 |
| 280 | 498.820086 | 1.000000 | 0.703636 | 0.022110 | 1.415488 |
| 285 | 507.785985 | 1.000000 | 0.705455 | 0.021186 | 1.415716 |
| 290 | 517.113786 | 1.000000 | 0.706667 | 0.020401 | 1.415907 |
| 295 | 526.482134 | 1.000000 | 0.705455 | 0.019690 | 1.416381 |
| 300 | 534.598507 | 1.000000 | 0.706667 | 0.018988 | 1.416875 |
+-----------+--------------+-------------------+---------------------+-------------------+---------------------+
Trained model successfully saved at /Users/jburke/Developer/Machine Learning/Patient Wait Time/Model/boostedTreeF1WaitTime.mlmodel.
By increasing the maxIterations model parameter to 300, we improved the model's training accuracy to 100% after 200 iterations. Remember that the model is tested with the 20% of the data held out by the random split, so achieving 100% training accuracy does not mean there will be no fluctuation in the confidence level of the model's predictions. In the log above, validation accuracy levels off at about 70% while training accuracy climbs to 100%. The real test is to test the model with data it has not seen before.
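One way to read the training log is to compare training and validation metrics side by side. The sketch below uses a handful of rows copied from the facility 1 output; the TrainingRow type is a hypothetical container for this illustration.

```swift
// A few rows copied from the facility 1 training log.
struct TrainingRow {
    let iteration: Int
    let trainingAccuracy: Double
    let validationAccuracy: Double
    let validationLogLoss: Double
}

let rows = [
    TrainingRow(iteration: 50,  trainingAccuracy: 0.937352, validationAccuracy: 0.692727, validationLogLoss: 1.480991),
    TrainingRow(iteration: 100, trainingAccuracy: 0.989758, validationAccuracy: 0.702424, validationLogLoss: 1.406791),
    TrainingRow(iteration: 175, trainingAccuracy: 0.999816, validationAccuracy: 0.701212, validationLogLoss: 1.399951),
    TrainingRow(iteration: 200, trainingAccuracy: 1.000000, validationAccuracy: 0.701212, validationLogLoss: 1.400612),
    TrainingRow(iteration: 300, trainingAccuracy: 1.000000, validationAccuracy: 0.706667, validationLogLoss: 1.416875),
]

// The iteration with the lowest validation log loss among these rows.
let best = rows.min(by: { $0.validationLogLoss < $1.validationLogLoss })!
print(best.iteration) // prints "175"

// Gap between training and validation accuracy at the final iteration.
let last = rows.last!
let gap = last.trainingAccuracy - last.validationAccuracy
print(gap) // roughly 0.29
```

Among these rows the validation log loss is lowest at iteration 175 and rises slightly afterwards, even though training accuracy keeps improving.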
The machine-learning models are now ready to use in an app.
I built a single-view iOS app using iOS Charts, created by Daniel Gindi, to graph each facility model's predictions against the test data.
I selected data from the same day-of-the-year as the current date and included all the radiology exams performed prior to the current time as test data.
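The selection of test data can be sketched as a simple filter. The ExamRecord type and the minutes-since-midnight encoding of the arrival time are assumptions made for this illustration; the actual csv columns may be encoded differently.

```swift
import Foundation

// Hypothetical record type; the fields mirror the DayOfYear and ArrivalTime
// features, and the minutes-since-midnight encoding is an assumption.
struct ExamRecord {
    let dayOfYear: Int
    let arrivalMinutes: Int
    let wait: Int
}

// Keep the exams from the given day of the year that arrived before the cutoff.
func testRecords(from records: [ExamRecord], dayOfYear: Int, beforeMinutes cutoff: Int) -> [ExamRecord] {
    return records.filter { $0.dayOfYear == dayOfYear && $0.arrivalMinutes < cutoff }
}

let sample = [
    ExamRecord(dayOfYear: 120, arrivalMinutes: 9 * 60, wait: 5),
    ExamRecord(dayOfYear: 120, arrivalMinutes: 14 * 60, wait: 12),
    ExamRecord(dayOfYear: 121, arrivalMinutes: 9 * 60, wait: 0),
]

// Derive today's day of the year and the current time from the system clock.
let now = Date()
let calendar = Calendar.current
let today = calendar.ordinality(of: .day, in: .year, for: now) ?? 1
let minutesNow = calendar.component(.hour, from: now) * 60 + calendar.component(.minute, from: now)
let todaysRecords = testRecords(from: sample, dayOfYear: today, beforeMinutes: minutesNow)

// With fixed inputs (day 120, 11:15 AM) the result is deterministic.
let fixed = testRecords(from: sample, dayOfYear: 120, beforeMinutes: 11 * 60 + 15)
print(fixed.count) // prints "1"
```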
It is best to test a model with data it has not seen, so I used data from the other facilities to test each model:
• Facility 3 data was used to test facility 1's model.
• Facility 3 data was used to test facility 2's model.
• Facility 2 data was used to test facility 3's model.
• Facility 1 data was used to test facility 4's model.
Facility 4 is a walk-in radiology center and does not perform neuro, abdominal, vascular or cardiac exams. Therefore only thoracic, pediatric and musculoskeletal exams were extracted from the test data to test facility 4's model.
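Extracting only the exam categories that facility 4 performs amounts to a set membership test. The ExamCategory enum and TestExam type below are hypothetical stand-ins for however the exam type is actually recorded in the csv files.

```swift
// Exam categories named in the text; the enum itself is hypothetical.
enum ExamCategory {
    case neuro, abdominal, vascular, cardiac, thoracic, pediatric, musculoskeletal
}

// Hypothetical test row: an exam category plus the recorded wait in minutes.
struct TestExam {
    let category: ExamCategory
    let wait: Int
}

// Categories that facility 4 performs, per the description in the text.
let facility4Categories: Set<ExamCategory> = [.thoracic, .pediatric, .musculoskeletal]

// Keep only the exams that facility 4's model was trained to handle.
func examsForFacility4(_ exams: [TestExam]) -> [TestExam] {
    return exams.filter { facility4Categories.contains($0.category) }
}

let facility1Exams = [
    TestExam(category: .neuro, wait: 10),
    TestExam(category: .thoracic, wait: 2),
    TestExam(category: .cardiac, wait: 7),
    TestExam(category: .pediatric, wait: -3),
]
let facility4TestData = examsForFacility4(facility1Exams)
print(facility4TestData.count) // prints "2"
```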
The blue line is the actual wait time while the gray line is the predicted wait time in minutes. Anything below zero represents no-wait.
The yellow line represents the model's confidence as a percentage and varies with each prediction.
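For a classifier, a confidence value of this kind is commonly taken to be the probability assigned to the predicted class. The text does not say exactly how the app computes it, so the sketch below is an assumption, and the probabilities are made up for illustration.

```swift
// Hypothetical class probabilities for one prediction, keyed by the
// predicted wait in minutes. A real model would supply these values.
let waitProbabilities: [Int: Double] = [-5: 0.05, 0: 0.10, 2: 0.78, 4: 0.07]

// The prediction is the class with the highest probability, and that
// probability is reported as the confidence.
func topPrediction(_ probabilities: [Int: Double]) -> (wait: Int, confidence: Double)? {
    guard let best = probabilities.max(by: { $0.value < $1.value }) else { return nil }
    return (best.key, best.value)
}

if let prediction = topPrediction(waitProbabilities) {
    let percent = Int((prediction.confidence * 100).rounded())
    print("Predicted wait: \(prediction.wait) min, confidence: \(percent)%")
    // prints "Predicted wait: 2 min, confidence: 78%"
}
```

With 194 possible classes, as in the facility 1 model, the top probability can be quite low even when it is the best available guess, which is one reason the confidence varies so much between predictions.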
Facility 1's model predicted there is no wait at 11:15 AM. The actual wait time is less than zero, which indicates the patient showed up early for the appointment and was taken early. The model confirms this with 99.7% confidence.
Facility 2's model predicted there is a 4-minute wait at 11:15 AM. The actual wait time is less than zero, which indicates the patient showed up early for the appointment and was taken early. The model does not have a high degree of confidence in the prediction at ~24%.
Facility 3's model predicted there is a 6-minute wait at 10:30 AM. The actual wait time is about 20 minutes. The model does not have a high degree of confidence in the prediction at ~14%.
Facility 4's model predicted there is a 2-minute wait at 10:30 AM. The actual wait time is 2 minutes. The model is confident in this prediction at ~78%.