Big Data Papers

In this class, we will focus on the analysis of two kinds of big data: 1) provenance data; and 2) OpenXC data. We will cover four categories of algorithms:

  1. Category 1: Given a deadline, minimize the economic cost
  2. Category 2: Given a budget, minimize the makespan
  3. Category 3: Given no constraints, minimize the makespan, the economic cost, or both
  4. Category 4: Given both deadline and budget constraints, optimize the success rate or other metrics
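
The four categories differ only in which quantities are constrained and which are optimized. A minimal Python sketch can make the distinction concrete. It assumes a hypothetical `Schedule` record that holds only a makespan and a cost; the function names and the sample numbers are illustrative and are not taken from any of the papers below.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Schedule:
    """Hypothetical summary of one candidate schedule."""
    name: str
    makespan: float  # total execution time, e.g. in hours
    cost: float      # economic cost, e.g. in dollars


def min_cost_under_deadline(cands: List[Schedule], deadline: float) -> Optional[Schedule]:
    """Category 1: cheapest schedule whose makespan meets the deadline."""
    feasible = [s for s in cands if s.makespan <= deadline]
    return min(feasible, key=lambda s: s.cost, default=None)


def min_makespan_under_budget(cands: List[Schedule], budget: float) -> Optional[Schedule]:
    """Category 2: fastest schedule whose cost stays within the budget."""
    feasible = [s for s in cands if s.cost <= budget]
    return min(feasible, key=lambda s: s.makespan, default=None)


def pareto_front(cands: List[Schedule]) -> List[Schedule]:
    """Category 3: schedules that no other schedule beats on both objectives."""
    def dominated(s: Schedule) -> bool:
        return any(
            o.makespan <= s.makespan and o.cost <= s.cost
            and (o.makespan < s.makespan or o.cost < s.cost)
            for o in cands
        )
    return [s for s in cands if not dominated(s)]


def success_rate(cands: List[Schedule], deadline: float, budget: float) -> float:
    """Category 4 (one possible metric): fraction of schedules meeting both constraints."""
    ok = [s for s in cands if s.makespan <= deadline and s.cost <= budget]
    return len(ok) / len(cands) if cands else 0.0


# Illustrative candidates only.
candidates = [
    Schedule("fast", 2.0, 10.0),
    Schedule("balanced", 4.0, 6.0),
    Schedule("cheap", 8.0, 3.0),
    Schedule("wasteful", 8.0, 12.0),
]

cat1 = min_cost_under_deadline(candidates, deadline=5.0)
cat2 = min_makespan_under_budget(candidates, budget=5.0)
cat3 = pareto_front(candidates)
cat4 = success_rate(candidates, deadline=5.0, budget=7.0)
```

Real algorithms in each category search a very large space of task-to-VM assignments rather than picking from a short list; the sketch only shows what each category treats as a constraint and what it treats as an objective.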

    1.     (Category 1) Paper 1: Maria Alejandra Rodriguez, Rajkumar Buyya: A Responsive Knapsack-Based Algorithm for Resource Provisioning and Scheduling of Scientific Workflows in Clouds. ICPP 2015:839-848. Download. (The WRPS algorithm) Youtube video

    2.     (Category 4) Paper 2: Maciej Malawski, Gideon Juve, Ewa Deelman, Jarek Nabrzyski: Cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds. SC 2012:22. Download. Youtube video. (The DPDS and SPSS algorithms. CloudSim and a workflow generator are used for the experiments. Uses level-based deadline distribution. Objective function: maximize the number of completed workflows from an ensemble under both budget and deadline constraints. Limitations: all VMs are the same (a homogeneous resource model) and data transfer time is fixed, so task placement decisions do not affect the runtime of the tasks, including data transfer time. A very interesting but strong assumption: the priorities are absolute, in the sense that completing a workflow with a given priority is more valuable than completing all other workflows in the ensemble with lower priorities combined.)

    3.     (Category 2) Paper 3: Ming Mao and Marty Humphrey. 2013. Scaling and scheduling to maximize application performance within budget constraints in cloud workflows. In Proceedings of the International Parallel & Distributed Processing Symposium (IPDPS'13). IEEE, 67-78. Download. Youtube video (The Scheduling-first algorithm)

    4.     (Category 3) Paper 4: Lin, Cui, and Shiyong Lu. "Scheduling scientific workflows elastically for cloud computing." In 2011 IEEE 4th International Conference on Cloud Computing, pp. 746-747. IEEE, 2011. Download. Youtube video (The SHEFT algorithm)

    5.     (Category 3) Paper 5: Haluk Topcuoglu, Salim Hariri, Min-You Wu: Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Trans. Parallel Distrib. Syst. (TPDS) 13(3):260-274 (2002). Download. (The HEFT algorithm and the CPOP algorithm) CPOP Youtube video

    6.     (Category 1) Paper 6: Seyed Ziae Mousavi Mojab, Mahdi Ebrahimi, Robert G. Reynolds, and Shiyong Lu, "iCATS: Big Data Workflow Scheduling in the Cloud Using Cultural Algorithms", In Proc. of the Fifth IEEE International Conference On Big Data Service And Applications (IEEE BIGDATASERVICE 2019), pp.99-106, April 4 - 9, 2019, San Francisco East Bay, California, USA. Download. Youtube video (The iCATS algorithm)

    7.                        (Category 2) Paper 7: Wu, Chase Qishi, Xiangyu Lin, Dantong Yu, Wei Xu, and Li Li. "End-to-end delay minimization for scientific workflows in clouds under budget constraint." IEEE Transactions on Cloud Computing 3, no. 2 (2014): 169-181. Download.

    8.     (Category 4) Paper 8: Hamid Arabnejad, Jorge G. Barbosa, Radu Prodan: Low-time complexity budget-deadline constrained workflow scheduling on heterogeneous resources. Future Generation Comp. Syst. (FGCS) 55:29-40 (2016). Download. Youtube video. (The DBCS algorithm. It performs no optimization; it aims to quickly find a feasible solution that satisfies both budget and deadline constraints on a bounded number of heterogeneous resources. Advantage: low planning-time complexity, O(n^2*p).)

    9.     (Category 1) Paper 9: Verma, A. and Kaushal, S., 2014. Deadline constraint heuristic-based genetic algorithm for workflow scheduling in cloud. International Journal of Grid and Utility Computing, 5(2), pp.96-106. Download.

    10.                        (Category 1) Paper 10: Saeid Abrishami, Mahmoud Naghibzadeh, Dick H. J. Epema: Deadline-constrained workflow scheduling algorithms for Infrastructure as a Service Clouds. Future Generation Comp. Syst. (FGCS) 29(1):158-169 (2013). Download. Youtube video (the IC-PCP algorithm).

    11.                        (System) Paper 11: Andrey Kashlev, Shiyong Lu, and Aravind Mohan, "Big Data Workflows: A Reference Architecture and The DATAVIEW System", Services Transactions on Big Data (STBD), 4(1), pp.1-19, 2017. Download.

    12.     (Category 1) Paper 12: Changxin Bai, Shiyong Lu, Ishtiaq Ahmed, Dunren Che, Aravind Mohan, "LPOD: A Local Path Based Optimized Scheduling Algorithm for Deadline-Constrained Big Data Workflows in the Cloud", in Proc. of the IEEE Congress on Big Data (IEEE BigData Congress 2019), Milan, Italy, 2019. Download. Youtube video (The LPOD algorithm)

    13.                        (Survey) Paper 13: Smanchat, Sucha, and Kanchana Viriyapant. "Taxonomies of workflow scheduling problem and techniques in the cloud." Future Generation Computer Systems 52 (2015): 1-12. Download.

    14.     (System) Paper 14: Nyström, P., Falck-Ytter, T. and Gredebäck, G., 2016. The TimeStudio Project: An open source scientific workflow system for the behavioral and brain sciences. Behavior Research Methods, 48(2), pp.542-552. Download. (The TimeStudio system)

    15.     (System) Paper 15: Berthold, M.R., Cebron, N., Dill, F., Gabriel, T.R., Kötter, T., Meinl, T., Ohl, P., Thiel, K. and Wiswedel, B., 2009. KNIME - the Konstanz Information Miner: version 2.0 and beyond. ACM SIGKDD Explorations Newsletter, 11(1), pp.26-31. Download. (The KNIME system)

    16.                        (System) Paper 16: Goecks, J., Nekrutenko, A., Taylor, J. and Galaxy Team, 2010. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome biology, 11(8), p.R86. Download.

    17.     (Category 1) Paper 17: Liu, Li, Miao Zhang, Rajkumar Buyya, and Qi Fan. "Deadline-constrained coevolutionary genetic algorithm for scientific workflow scheduling in cloud computing." Concurrency and Computation: Practice and Experience 29, no. 5 (2017). Download.

    18.                        (Category 4) Paper 18: Vahid Arabnejad, Kris Bubendorfer, Bryan Ng: Budget and Deadline Aware e-Science Workflow Scheduling in Clouds. IEEE Trans. Parallel Distrib. Syst. 30(1): 29-44 (2019) (The BDAS algorithm, budget and deadline constrained, optimize admission rate, minimize both makespan and economic cost) Download.

    19.     (Category 2) Paper 19: Maria Alejandra Rodriguez, Rajkumar Buyya: Budget-Driven Scheduling of Scientific Workflows in IaaS Clouds with Fine-Grained Billing Periods. ACM Transactions on Autonomous and Adaptive Systems (TAAS) 12(2): 5:1-5:22 (2017). Download. Youtube video. (The BAGS algorithm)

    20.     (Category 1) Paper 20: Mahdi Ebrahimi, Aravind Mohan and Shiyong Lu, "Scheduling Big Data Workflows in the Cloud under Deadline Constraints", In Proc. of the Fourth IEEE International Conference on Big Data Service and Applications, pp. 33-40, March 26-29, Bamberg, Germany, 2018. Download. (The BORRIS algorithm)

    21.                        (Category 1) Paper 21: Meena, J., Kumar, M. and Vardhan, M., 2016. Cost effective genetic algorithm for workflow scheduling in cloud under deadline constraint. IEEE Access, 4, pp.5065-5082. Download. (The CEGA algorithm)

    22.                        (Survey) Paper 22: Tsai, Chun-Wei, and Joel JPC Rodrigues. "Metaheuristic scheduling for cloud: A survey." Systems Journal, IEEE 8, no. 1 (2014): 279-291. Download.

    23.                        (Category 2) Paper 23: Aravind Mohan, Mahdi Ebrahimi, Shiyong Lu, Alexander Kotov: Scheduling big data workflows in the cloud under budget constraints. BigData 2016: 2775-2784 (The BARENTS algorithm, budget constrained, minimize makespan) Download.

    24.                        (Category 4) Paper 24: Mozhgan Ghasemzadeh, Hamid Arabnejad, Jorge G. Barbosa: Deadline-Budget constrained Scheduling Algorithm for Scientific Workflows in a Cloud Environment. OPODIS 2016: 19:1-19:16 Download (The DBWS algorithm).

    25.     (System) Paper 25: Warr, W.A., 2012. Scientific workflow systems: Pipeline Pilot and KNIME. Journal of Computer-Aided Molecular Design, 26(7), pp.801-804. Download. (Evolutionary approaches) (The Pipeline Pilot system)

    26.                        (Category 3) Paper 26: Goshgar Ismayilov, Haluk Rahmi Topcuoglu: Neural network based multi-objective evolutionary algorithm for dynamic workflow scheduling in cloud computing. Future Generation Comp. Syst. 102: 307-322 (2020) Download.

    27.                        (Category 3) Paper 27: Zhaomeng Zhu, Gongxuan Zhang, Miqing Li, Xiaohui Liu: Evolutionary Multi-Objective Workflow Scheduling in Cloud. IEEE Trans. Parallel Distrib. Syst. 27(5): 1344-1357 (2016) Download. (Evolutionary approaches, the EMS-C algorithm)

    28.                        (Category 3) Paper 28: Durillo, Juan J., and Radu Prodan. "Multi-objective workflow scheduling in Amazon EC2." Cluster computing 17, no. 2 (2014): 169-189. Download. Youtube video. (The MOHEFT Algorithm)

    29. (Category 4) Paper 29: Verma, A. and Kaushal, S., 2017. A hybrid multi-objective particle swarm optimization for scientific workflow scheduling. Parallel Computing, 62, pp.1-19. Download (The HPSO Algorithm)

    30.                        (Category 3) Paper 30: Wu, Q., Zhou, M., Zhu, Q., Xia, Y. and Wen, J., 2019. MOELS: Multiobjective Evolutionary List Scheduling for Cloud Workflows. IEEE Transactions on Automation Science and Engineering, 17(1), pp.166-176. Download. (The MOELS algorithm)

    31.                        (System) Paper 31: Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J. and Zhao, Y., 2006. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10), pp.1039-1065. Download.

    32.                        (System) Paper 32: Deelman, E., Vahi, K., Juve, G., Rynge, M., Callaghan, S., Maechling, P.J., Mayani, R., Chen, W., Da Silva, R.F., Livny, M. and Wenger, K., 2015. Pegasus, a workflow management system for science automation. Future Generation Computer Systems, 46, pp.17-35. Download. Youtube video

    33.                        (System) Paper 33: Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., Carver, T., Glover, K., Pocock, M.R., Wipat, A. and Li, P., 2004. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17), pp.3045-3054. Download.

    34.                        (System) Paper 34: Kranjc, Janez, Roman Orač, Vid Podpečan, Nada Lavrač, and Marko Robnik-Šikonja. "ClowdFlows: Online workflows for distributed big data mining." Future Generation Computer Systems (2016). Download. Youtube Video

    35.                        (System) Paper 35: Perovšek, Matic, Janez Kranjc, Tomaž Erjavec, Bojan Cestnik, and Nada Lavrač. "TextFlows: A visual programming platform for text mining and natural language processing." Science of Computer Programming 121 (2016): 128-152. Download. Youtube Video  

    36.                        (System) Paper 36: Lin, C., Lu, S., Fei, X., Chebotko, A., Pai, D., Lai, Z., Fotouhi, F. and Hua, J., 2009. A reference architecture for scientific workflow management systems and the VIEW SOA solution. IEEE Transactions on Services Computing, 2(1), pp.79-92. Download. (The VIEW system)

    37.                        (System) Paper 37: Wilde, M., Hategan, M., Wozniak, J.M., Clifford, B., Katz, D.S. and Foster, I., 2011. Swift: A language for distributed parallel scripting. Parallel Computing, 37(9), pp.633-652. Download. (The Swift system)

    38.                        (System) Paper 38: Kano, Yoshinobu, Makoto Miwa, K. Bretonnel Cohen, Lawrence E. Hunter, Sophia Ananiadou, and Jun’ichi Tsujii. "U-Compare: A modular NLP workflow construction and evaluation system." IBM Journal of Research and Development 55, no. 3 (2011): 11-1. Download. Youtube Video 

    39.     (System) Paper 39: Freire, J., Koop, D., Chirigati, F. and Silva, C.T., 2014. Reproducibility using VisTrails. Implementing Reproducible Research, 33. Download. (The VisTrails system)

    40.                        (System) Paper 40: Saeid Mofrad, Ishtiaq Ahmed, Shiyong Lu, Ping Yang, Heming Cui, and Fengwei Zhang, "SecDATAVIEW: A Secure Big Data Workflow Management System for Heterogeneous Computing Environments", in Proc. of the Annual Computer Security Applications Conference (ACSAC 2019), December 9-13, San Juan, 2019. Download. (The SecDATAVIEW system)
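
The note on Paper 2 describes priorities as absolute: completing one workflow is worth more than completing every lower-priority workflow combined. One scoring scheme with that property gives a workflow of priority p a score of 2^-p, where 0 is the highest priority. It is shown here as an illustrative assumption and not as the paper's exact definition. The function name and the sample priorities are hypothetical.

```python
from fractions import Fraction
from typing import Iterable


def ensemble_score(completed_priorities: Iterable[int]) -> Fraction:
    """Sum 2**-p over completed workflows; priority 0 is the most important."""
    return sum((Fraction(1, 2 ** p) for p in completed_priorities), Fraction(0))


# Completing only the priority-1 workflow...
only_high = ensemble_score([1])

# ...versus completing every workflow with priorities 2 through 9.
all_lower = ensemble_score(range(2, 10))

# The single higher-priority workflow still scores more.
high_priority_wins = only_high > all_lower
```

Exact fractions are used instead of floats so the comparison stays exact even when the ensemble is large and the scores are very close.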

The top 10 data science algorithms

  1. ID3 (8 lectures, 50 mins) | CART
  2. K-means
  3. SVM
  4. Apriori
  5. EM (1) | EM (2)
  6. AdaBoost
  7. kNN
  8. Naive Bayes
  9. CNN
  10. RNN