I want to thank everyone who attended yesterday's webinar on AI. We had just under 800 people attend this session, so thank you! Hopefully it resonated with you in some way and you were able to take away some nuggets. I did receive some questions, and I will take some time to answer them over the next few days. Here is the first one.
Q. AI generally requires large datasets in order to 'learn' from. Some organizations, particularly owner organizations, may not have such large datasets available. Do you anticipate that project datasets will be commoditized to be made available to a wider range of organizations?
A. First, let's bring this question into context. My answers reflect AI and machine learning for quantitative datasets, not GenAI, LLMs, etc.; that is a separate issue entirely! So, regarding project datasets: this has been an issue since the day we started collecting project information. Organizations do not want to share their data for three reasons: 1) most organizations are not collecting meaningful data, if they collect any at all; 2) it exposes them, warts and all; and 3) if they do collect it, their data is their competitive advantage. There are probably more reasons, but these are the three that come to mind immediately.
In terms of large datasets, we need to emphasize that quality matters more than quantity. Specificity also comes into play: do you really need data that is not representative of your industry, geography, etc.? You don't need the extra noise; that's why we've moved away from the big-data concept toward a more focused, micro-data approach. So more (large) data doesn't always mean better data. In my opinion, I cannot see project data being commoditized, and many issues support that view. Some are legal (if the data is used, can the entity providing it be sued?); others come down to governance, accountability, etc. Another issue is the context and meaning of any commoditized dataset, i.e., is it relevant and fit for purpose? You don't want to use the RS Means plumbing rate-of-placement for a 2" dia. copper pipe to determine the labour for installing a 2" stainless process pipe…
So, how do we increase the volume of data in order to introduce AI and machine learning into an organization?
- Look at the data that is currently available and scrub it, clean it, and make sense of it. Add as much metadata as possible, and focus on ensuring data integrity. This is one step in setting the tone for what you want to collect in the future (a minimal cleaning pass is sketched after this list).
- Create taxonomies, ontologies, and knowledge graphs to help you understand the relationships between predictors and targets. Standardize your feature/attribute labels and apply them to your dataset. This is a form of labeling that allows you to run regression and clustering analyses, and identifying features lets you do feature engineering. This is the power of ML: it allows you to combine features to see whether a combination has a high correlation between the predictor(s) and the target (see the correlation sketch after this list). Knowledge graphs and ontologies are what assist in this…
- Generate synthetic datasets based on the patterns observed in your real data (basically, fill in the blanks). AI techniques such as Generative Adversarial Networks (GANs) can also simulate realistic scenarios, augmenting your existing datasets (a simple stand-in is sketched after this list). Remember, though, that this only gets you partially there in terms of model design. You will need to adjust or change your algorithms once you harness your own data, then retest and retrain your model(s) and adjust them accordingly.
- Even before turning to Python, ML, algorithms, etc., try completing linear and multivariate regression analysis in Excel; it's a great way to start understanding your data (the equivalent exercise in Python is sketched after this list)… You may also want to leverage pre-trained models built on large datasets from related domains. These models can be fine-tuned with your smaller datasets, reducing the dependency on large-scale proprietary data.
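On the first bullet, here is a minimal sketch of that scrubbing/cleaning pass in Python with pandas. The file name and column names (projects.csv, actual_cost) are hypothetical placeholders; substitute your own.

```python
import pandas as pd

# Hypothetical export of historical project records.
df = pd.read_csv("projects.csv")

# Standardize column labels so the same attribute is always spelled the same way.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Coerce the target to numeric; anything that fails to parse becomes NaN.
df["actual_cost"] = pd.to_numeric(df["actual_cost"], errors="coerce")

# Drop exact duplicates and records with no usable target value.
df = df.drop_duplicates().dropna(subset=["actual_cost"])

# Attach simple provenance metadata so future users know where the data came from.
df.attrs["source"] = "legacy cost system export"   # illustrative

df.to_parquet("projects_clean.parquet")
```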
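On the second bullet, a small sketch of feature engineering and correlation checking. The predictor names (installed_quantity, duration_months, scope_changes) are made up for illustration; the point is combining standardized features and checking how each relates to the target.

```python
import pandas as pd

df = pd.read_parquet("projects_clean.parquet")

# Combine standardized features into new, engineered ones.
df["cost_per_unit"] = df["actual_cost"] / df["installed_quantity"]
df["scope_density"] = df["scope_changes"] / df["duration_months"]

# See which predictors, alone or combined, correlate most strongly with the target.
predictors = ["installed_quantity", "duration_months",
              "scope_changes", "cost_per_unit", "scope_density"]
print(df[predictors + ["actual_cost"]].corr()["actual_cost"].sort_values())
```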
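On synthetic data, a full GAN is beyond a forum post, so here is the simplest stand-in I can offer: drawing synthetic rows from a multivariate normal fitted to the numeric columns of your real data. Treat it purely as an illustration of the idea, not a recommendation of that particular distribution.

```python
import numpy as np
import pandas as pd

df = pd.read_parquet("projects_clean.parquet")
numeric = df.select_dtypes("number").dropna()

# Fit a simple joint distribution to the real (numeric) data...
mean = numeric.mean().to_numpy()
cov = numeric.cov().to_numpy()

# ...and draw synthetic rows from it. A GAN would learn richer, non-linear patterns.
rng = np.random.default_rng(seed=42)
synthetic = pd.DataFrame(rng.multivariate_normal(mean, cov, size=500),
                         columns=numeric.columns)

# Label synthetic rows so they can be removed once enough real data exists
# and the model is retrained.
synthetic["is_synthetic"] = True
```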
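And on the last bullet, the Excel regression exercise translated to Python with scikit-learn, again with illustrative column names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_parquet("projects_clean.parquet").dropna()
X = df[["installed_quantity", "duration_months", "scope_changes"]]
y = df["actual_cost"]

# Hold out some projects so the fit can be checked on data the model hasn't seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("coefficients:", dict(zip(X.columns, model.coef_)))
print("R^2 on held-out projects:", model.score(X_test, y_test))
```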
Once this is done, put your models into operation and see whether predictability has improved (a minimal check is sketched below). If not, go back to your features to find better correlations/relationships and enhance predictability. Once my paper has gone through peer review and headquarters cleansing, I will publish it in this forum; it provides a roadmap for introducing AI/ML into the organization. I have attached last year's paper, which is a roadmap for introducing data analytics into the organization (you can't do one without the other).
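As a quick way to judge "improved predictability" once the model is in service, compare its error on completed projects against a naive baseline; the numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Outcomes of completed projects vs. the model's earlier estimates (illustrative values).
y_actual = np.array([1.20e6, 0.95e6, 2.40e6])
y_predicted = np.array([1.10e6, 1.00e6, 2.10e6])

# Naive baseline: always predict the historical mean.
baseline = np.full_like(y_actual, y_actual.mean())

print("model MAE:   ", mean_absolute_error(y_actual, y_predicted))
print("baseline MAE:", mean_absolute_error(y_actual, baseline))
# If the model isn't clearly beating the baseline, revisit the features.
```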
Cheers,
Lance
------------------------------
Lance Stephenson
Director of Operations
AECOM
Edmonton
h.lance.stephenson@gmail.com
------------------------------