Transforming [Petroleum] Engineers into Data Science Wizards. Update 2020

AlfonsoRReyes
Dec 30, 2020 · 12 min read

Note. This is an update of the original article I published in 2017 and 2019. So many things have changed, and so many new things I have learned, that the article needs some refreshing to reflect those new experiences. This year, 2020, I set a new frontier: learning about Artificial Intelligence. But from books, not the internet. Building artificial intelligence agents requires an understanding of what AI is.

First of all, a word on the title of the article. I am enclosing “Petroleum” in brackets because it was in the original title, but I came to realize that the advice goes beyond one narrow engineering field; it may be useful to other engineers as well.

These are not absolute truths written in stone. I am sharing experiences and recommendations based on what worked for me and others. Some of them come from notes or posts that I have shared with colleagues when they asked me:

  • “Alfonso, how do I start with Data Science?”,
  • “Is there any value in learning Data Science?”,
  • “What impact could Data Science bring to my engineering job to improve processes, increase revenues, or reduce costs?”
  • “Should I pursue a Data Science degree at an established university?”

It is difficult to find out if you don’t apply Data Science in the real world. I was a skeptic until I decided to seriously learn it. I don’t mean one hour a week or a day; I mean living it, and making it part of my engineering process at work. It is funny, because my team was at first a bit reluctant to change the classic way of solving well models to optimize well production. Soon after, we wouldn’t dare to analyze an oil field without first applying statistics, creating datasets from input files, simulations, or solver outputs, and looking at the resulting data with a new perspective. Later on, our group, Wells and Network Modeling, was made one of the pillars of the overall company strategy for bringing in extra revenue in challenging times.

In many ways, what I will be describing may resemble a scientific approach to things. Mind you, the word “science” is not in “data science” gratuitously. Data science grew out of the world of Statistics, when Computer Science made it a practical tool for making discoveries from data. We owe Data Science, first and foremost, to statisticians, to scientists. This new industrial revolution we are living in, based on data, requires a new set of lenses to understand and discover things that are not immediately evident using the classical methods. Of course it will be difficult, and no doubt you will meet resistance, but what worthwhile human endeavor doesn’t?

I don’t particularly buy into this new dichotomy of “physics-driven” vs. “data-driven” methods. I’ll tell you why. First, you will never hear a scientist say “I will apply a data-driven methodology to explain this phenomenon”, or “I will try this time with a huge amount of data to discover things”. You would be laughed out of the department. Second, data is a manifestation of natural phenomena; you cannot separate the data from the physics. They are two faces of the same coin. Third, go look at the classic books on physics, fluid mechanics, chemistry, geology, [petroleum] engineering, simulation, biology, etc. You will not find an instance in which the authors say “with more data I would get a better explanation of the phenomenon.” More data doesn’t necessarily mean new data, groundbreaking data, or unique data. It just means that you have more. It is up to you to filter out the noise and hear what the physical phenomenon is trying to tell you.

You may take these as some recipes to start your transformation into a Data Science wizard:

  1. Complete any of the Python or R online courses on Data Science. My favorites are the ones from Johns Hopkins and the University of Michigan on Coursera (the Data Science Specialization in R, or Introduction to Data Science in Python). Make no mistake: the data science specialization in R is a high quality course, and at times it will make you feel like you are going through a PhD program. You will need a firm commitment, and to set aside time for lectures, quizzes and project assignments. You could complement it with DataCamp short workshops. For instance, I started, a few years ago, with the two-hour “Introduction to R” quick course. And then never stopped. There are other online institutions such as edX, Udacity, Udemy, etc. You will also be able to find online courses from reputable universities such as Stanford, MIT, or Harvard. If you don’t have previous programming experience, start with Python; if you feel confident about your programming skills, and would like to break the barrier between engineering and science, go full throttle with R. If you are an MSc or a PhD, you know I will immediately recommend R.
  2. Start using Git as much as possible in all your projects. It is useful for sharing and maintaining code, text, and papers, working in teams, synchronizing projects, working on different computers, bringing reproducibility to your data science projects, etc. To host Git repositories in the cloud you may use GitHub, Bitbucket, or GitLab. Don’t be frustrated if you don’t get it or understand it at first; everybody struggles with Git. Even PhDs, who, by the way, have written the best tutorials on Git. So, you are not alone. Think of Git as a time machine that will let you go backward and forward in the history of your documents (see the Git sketch after this list).
  3. In parallel with Git, you will need to learn the basics of the Unix terminal. It is useful for many things that Windows doesn’t do, and might never do; it is just as useful on Linux and macOS, which are Unix based. With the Unix terminal you can do automated scripting that serves many data oriented activities: operations on huge datasets, deployment, backups, file transfer, managing remote computers, secure transfer, low level settings of your work environment, version control with Git, reproducing tasks, finding keywords in text and PDF documents, etc. If you are a Windows guy, there are ways to get familiar with the Unix terminal: a few hybrid Windows applications, such as Git-Bash, MSYS2, or Cygwin, let you bring a Unix terminal to Windows. There is no question that you have to know the Unix terminal. It makes your data science much more powerful and reproducible, and also gives you avenues for easier deployment. Additionally, it may be your secret weapon when dealing with huge datasets. I am finding, more and more frequently, articles where engineers have managed to read and transform terabyte-size datasets on laptops using combinations of Unix command line tools such as grep, awk, and sed, along with data.frame and data.table structures (see the command line sketch after this list). No need for big-data computer clusters with Hadoop or Spark, which are much more difficult to handle. I know there is also the Windows Subsystem for Linux (WSL), but it may require a little bit of hackery. I have tested it a couple of times, and it is kind of hard to benchmark it against other Linux distributions when you run a data science or machine learning project.
  4. One of the best tools for reproducibility is Rmarkdown. As soon as you have installed R, Rtools, and RStudio on your computer, start using Markdown. In R it is called Rmarkdown, and it is widely used in science for generating documentation, papers, citations, booklets, manuals, tutorials, schematics, diagrams, web pages, blogs, slides, etc. Make a habit of using Markdown. If possible, during engineering work, avoid Word, which generates mostly binary files. Working with Markdown makes revision control easier and is reproducible, both key to reliable, testable, traceable, repeatable data science. With Markdown you can also mix LaTeX equations with text, code and calculations (see the Rmarkdown sketch after this list). Besides, you gain an additional ecosystem for running code and tools from the LaTeX universe, which is enormous. My favorite application for Markdown is Typora. It works under Windows, Linux and macOS. I use it every day.
  5. Strive to publish your engineering results using Markdown. It will complement your efforts in batch automation, data science and machine learning. Combine calculations with code and text (this is called literate programming) using Rmarkdown notebooks in R. Essentially, any document can be written mixing code, text, graphics and calculations with R or Python. Even though I am originally a Python guy (15+ years), I am not strongly recommending Python notebooks, or Jupyter, because they are not 100% human readable text (they use JSON), so you may find it difficult to apply version control and reproducible practices, or to use them with Git. I have probably built more than a thousand Jupyter notebooks, but when I learned Rmarkdown, it was like stepping into another dimension. In case you haven’t heard, R and RStudio have packages, such as reticulate, that allow you to write Python code in Rmarkdown (see the reticulate sketch after this list). I have published several examples of data science books completely coded in Python using Rmarkdown: one for Python’s plotnine, an introduction to Matplotlib written in Rmarkdown, and a tutorial by Keh-Soo Yong that I modified, also written entirely in Python using Rmarkdown notebooks.
  6. Start bringing your data into datasets with the assistance of R or Python. Build your favorite collections of datasets. Share them with colleagues in the office and discuss the challenges of converting raw data into tidy data. Generate tables, plots and statistical reports to come up with discoveries. Use Markdown to document the variables or features (columns). If you want to share the data while keeping confidentiality, learn how to anonymize it with R or Python cryptographic or scrambling packages (see the anonymization sketch after this list). There is another way to anonymize confidential data, and that is Generative Adversarial Networks (GANs). I am still working on a package that will use PyTorch to transform a data structure of 2D tensors into a new dataset with similar statistical properties, but practically impossible to trace back to the original source.
  7. Data Science requires continuous, deliberate practice. Start solving daily engineering problems with R or Python, incorporating them in your daily workflow. Avoid Excel or Excel-VBA if possible. VBA’s original purpose was never version control, or reproducibility, or data science; much less machine learning. Sticking to Office tools may keep you stuck in outdated practices, or unable to perform much richer and more productive data science. There is one more thing you may have possibly noticed: Excel plots are very simplistic; they go back to 30-year-old techniques. You run the risk of dumbing down your analysis, of being prevented from making discoveries in your data, or of failing to show a compelling story, which is the purpose of data science anyway (see the plotting sketch after this list).
  8. Learn and apply statistics everywhere; every time you can, in all the engineering activities you perform. Find what no other person can by using math, physics and statistics. Data Science is about making discoveries and answering questions from the data (see the statistics sketch after this list). Data Science was invented by statisticians, who at that time called it “data analysis”. An article I never get tired of reading and re-reading is “50 Years of Data Science” by David Donoho. Please, read it. It will explain statistics and its tempestuous, albeit tight, relationship with data science.
  9. Read about what other disciplines outside yours are doing in data science and machine learning. Look at bioscience, biostatistics, genetics, robotics, medicine, medical imaging, cancer research, pharmacokinetics, psychology, biology, ecology, hydrology, high performance computing, spatial data, automotive, finance, etc. You may want to take a brief look by browsing the R packages catalog at CRAN.
  10. Read articles on the net about data science, and try to reproduce those that resemble a problem you might find at work. It doesn’t matter if it is Python or R. You just have to learn what data science is about, and how it could bring value to your everyday engineering workflow. They may give you ideas for applications involving data in your area of expertise. They may not even be data science per se now, but they could very likely be your next stepping stone. Furthermore, most of the articles are free, and so are hundreds of books, booklets, tutorials and papers. We never had the chance to learn so much for so little. Somebody has called this the era of the democratization of knowledge and information. What you have to invest is time.
  11. The next stepping stone after learning Data Science is Machine Learning. Start inquiring about what machine learning is. There are many online resources to learn it, and lots of open source libraries that you could use. In Python you have NumPy, scikit-learn, TensorFlow, PyTorch, MXNet, NLTK, and others. In R you also have plenty of options: caret, TensorFlow, rTorch, MXNet, e1071, gbm, glmnet, h2o, kernlab, rpart, and more. Although several machine learning algorithms are used in data science, machine learning is a discipline of its own. Most of the successes that are attributed to artificial intelligence today are really machine learning applications. What I recommend here is to build collections of ML algorithms to develop a rapid intuition for when an algorithm should be used (see the decision tree sketch after this list). Not all problems in the world will be solved with Neural Networks or Deep Learning. You would be amazed how often classic algorithms still outperform them. Keep the hype at arm’s length.
  12. Review C++ and Fortran scientific code. I don’t mean to say that you need to learn another programming language, but knowing what they can do will add power to your toolbox. Sooner or later you will need Fortran, C, or C++ for reasons of efficiency and speed, and most likely for deployment (see the Rcpp sketch after this list). It is not for nothing that the best-in-class simulators and optimizers of today have plenty of Fortran routines under the hood.
  13. Learn how to read from different file formats. The variety of file formats in which you may find raw data is amazing. There is a lot of value you could bring to your daily activities by automating your data analysis workflow with R or Python (see the file formats sketch after this list). Also, ask what different data formats are used in your company for storing data, and get familiar with them. If you are in [petroleum] engineering, try reading some chunks of that data: try logs, seismic, well tests, buildups, drilling reports, deviation surveys, geological data, process data, simulation output, etc. Create tidy datasets out of them. Explore the data. Embark on finding and discovering things.
  14. Something more challenging is learning how to read and transform unstructured data, meaning data that is not in row-column (rectangular) format. The typical cases close to us are the text outputs from simulators, optimizers, stimulation or well design software, etc. This is some of the most difficult data to operate with, and it is where learning “regex” really pays off (see the regex sketch after this list). There is a more complex side of unstructured data, and that is video, sound, and images. Today there are plenty of algorithms available that deal with that kind of data, whether in Matlab, Python or R. Images are one of the areas that have been widely explored with Neural Networks, Deep Learning, Convolutional Neural Networks, and GANs. They are quite effective at showing visible and compelling results.
  15. Virtualization. Learn something about virtual machines with VirtualBox (free) or VMware. It is very useful to have several operating systems working at the same time on your PC: Windows, Linux, macOS. There is a lot of good data science and machine learning material for Linux packaged as VMs, which can be run under Windows very easily. These are applications ready to run without the need to install anything on the physical machine. Some time ago, I was able to download a couple of Linux VMs with a whole bunch of machine learning and artificial intelligence applications, and test them with minimum effort. I have other VMs from Cloudera and Hortonworks where I was able to run big-data applications such as Hadoop, Spark, etc. Another virtualization tool you may want to learn is Docker containers. The concept is similar to that of virtual machines, but lighter and less resource intensive. Yet another tool you may want to explore in virtualization is Vagrant, which is an advanced combination of virtual machines and containers. These tools will make your data science even more reproducible and able to stand the test of time.
  16. Dare to explore the world of artificial intelligence. You may not need it today, but it will be the indispensable tool of the 5th Industrial Revolution. There is nothing better than knowing, at least, the fundamentals of what others are trying to sell you. There is so much noise and snake-oil marketing nowadays surrounding the words “machine learning” and “artificial intelligence”. To start, three books I would recommend, off the top of my head, on artificial intelligence are: “Artificial Intelligence: A New Synthesis” by Nils Nilsson; “Computational Intelligence: A Logical Approach” by David Poole, Alan Mackworth and Randy Goebel; and “Artificial Intelligence: A Modern Approach” by Russell and Norvig. You will find that AI is not what you read in newspapers, magazines or articles on the web. An elucidating article about Machine Learning vs Artificial Intelligence is “Artificial Intelligence: The Revolution Hasn’t Happened Yet” by Prof. Michael I. Jordan, which in one sentence tells you that all the hype around AI is not artificial intelligence but machine learning. Indeed, almost all of the successful applications in use today are ML. A couple of days ago, I woke up with the idea of separating the field of Artificial Intelligence into two branches: Internet AI vs Scientific AI. Internet AI would be the “astrology” side of Scientific AI: the hype, “AI did this”, “AI did that”, all the pseudo-science articles. The robot Sophia would inhabit this fantasy world. Scientific AI would be the AI we inherited from Alan Turing, John McCarthy, Allen Newell, Herbert Simon, Marvin Minsky, Noam Chomsky, John Hopfield, et al. In this world you cite your sources and are not afraid of being peer reviewed or criticized.
  17. For those who have asked me whether I recommend a formal data science degree at a university: try online courses first and see if it is for you.
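
What follows is a handful of minimal sketches illustrating some of the items above; take them as illustrative starting points, not as definitive implementations. First, the Git sketch: the “time machine” of item 2, driven from R with the gert package (one of several R interfaces to Git; the plain git command on the terminal does exactly the same). The file name and commit message are made up.

```r
# a minimal Git round trip from R with the gert package
# (install.packages("gert"); plain `git` on the terminal is equivalent)
library(gert)

git_init(path = ".")                     # turn the project folder into a repository
writeLines("pwf <- 1500  # psia", "well_model.R")
git_add("well_model.R")                  # stage the new file
git_commit("Add first draft of the well model")   # snapshot it in history

git_log(max = 5)   # the time machine: every commit is a point you can return to
```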
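
The command line sketch, for item 3: data.table’s fread() can read directly from a shell command, so a tool like awk filters a huge file before it ever touches R’s memory. The file and column names (production.csv, WOLFCAMP, OIL_RATE, WELL) are hypothetical.

```r
# filter a huge CSV with a Unix tool before it reaches R's memory;
# data.table::fread() accepts a shell command through its `cmd` argument
library(data.table)

# keep the header row (NR == 1) plus only the rows for one field
dt <- fread(cmd = "awk 'NR == 1 || /WOLFCAMP/' production.csv")

# from here on it is an ordinary data.table
dt[, .(avg_oil = mean(OIL_RATE)), by = WELL]
```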
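
The Rmarkdown sketch, for item 4: this is what a small Rmarkdown source file looks like. Markdown text, an inline LaTeX equation, and an R code chunk that re-runs every time the document is knitted. The title and decline numbers are invented.

````markdown
---
title: "Decline curve notes"
output: html_document
---

Plain **Markdown** text with an inline LaTeX equation: $q(t) = q_i e^{-Dt}$.

```{r decline-curve}
# this chunk runs at knit time, so numbers and plots
# can never drift out of sync with the text around them
t <- 0:36                     # months
q <- 1000 * exp(-0.05 * t)    # exponential decline, qi = 1000 STB/d, D = 0.05
plot(t, q, type = "l", xlab = "months", ylab = "rate, STB/d")
```
````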
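
The reticulate sketch, for item 5: a fragment of an Rmarkdown notebook mixing R and Python chunks; objects created in Python are visible from R through the `py` object. The pipe calculation is just a toy example.

````markdown
```{r setup}
library(reticulate)   # lets knitr run {python} chunks in this document
```

```{python}
# a Python chunk inside an Rmarkdown notebook
import math
area = math.pi * 0.25 ** 2   # cross-section of a 6-inch pipe, ft^2
```

```{r}
py$area   # the Python variable, read back from an R chunk
```
````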
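
The anonymization sketch, for item 6, using the digest package: identifiers are replaced by salted one-way hashes, so rows belonging to the same well stay linked, but the names cannot be traced back. The well names and rates are made up.

```r
# anonymize a confidential identifier column with salted one-way hashes
library(digest)

wells <- data.frame(
  well = c("PERMIAN-001", "PERMIAN-002", "PERMIAN-001"),
  oil  = c(1250, 980, 1190)
)

salt <- "replace-with-a-long-secret"   # keep the salt out of the shared dataset
wells$well <- vapply(paste0(salt, wells$well),
                     digest, character(1), algo = "sha256")

head(wells)   # same statistics, untraceable identifiers
```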
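
The plotting sketch, for item 7: one example of what quickly outgrows Excel charts, small multiples with one panel per well and a smoothed trend in each. The production numbers are simulated.

```r
# a per-well decline overview in a few lines of ggplot2
library(ggplot2)

set.seed(123)
prod <- data.frame(
  well  = rep(c("A-1", "A-2", "B-1", "B-2"), each = 24),
  month = rep(1:24, times = 4)
)
prod$oil <- 1000 * exp(-0.06 * prod$month) * runif(96, 0.8, 1.2)

ggplot(prod, aes(month, oil)) +
  geom_line() +
  geom_smooth(method = "loess") +   # trend with a confidence band
  facet_wrap(~ well) +              # one panel per well
  labs(y = "oil rate, STB/d")
```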
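
The statistics sketch, for item 8: the kind of question you can start asking of everyday engineering data. Did a choke change actually move the average rate, or is the difference just noise? The rates below are simulated.

```r
# a two-sample hypothesis test on (simulated) daily rates
set.seed(42)
before <- rnorm(30, mean = 950, sd = 80)    # 30 days before the choke change
after  <- rnorm(30, mean = 1010, sd = 80)   # 30 days after

t.test(after, before)   # Welch t-test: p-value and CI of the difference
```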
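
The decision tree sketch, for item 11: a classic algorithm that predates the deep learning hype, fitted with rpart. The features, thresholds, and the intervention rule are hypothetical; the point is how readable the fitted tree is.

```r
# a decision tree on simulated well data
library(rpart)

set.seed(1)
wells <- data.frame(
  water_cut = runif(200, 0, 1),
  gor       = runif(200, 200, 2000),
  decline   = runif(200, 0, 0.2)
)
wells$intervene <- factor(
  ifelse(wells$water_cut > 0.7 | wells$decline > 0.15, "yes", "no")
)

fit <- rpart(intervene ~ water_cut + gor + decline, data = wells)
print(fit)   # the tree recovers the thresholds hidden in the data
```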
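
The Rcpp sketch, for item 12: you rarely rewrite a whole program in C++, but a hot loop can be moved there. The C++ function below is compiled on the fly and becomes callable from R.

```r
# compile a small C++ function and call it from R
library(Rcpp)

cppFunction('
double sumSquares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i] * x[i];   // accumulate the squares in compiled code
  }
  return total;
}')

sumSquares(c(1, 2, 3))   # 14
```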
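
The file formats sketch, for item 13: the same analysis workflow regardless of where the raw data lives. The file names here are placeholders for whatever your company uses.

```r
# read three common raw-data formats into R structures
library(readxl)     # Excel workbooks
library(jsonlite)   # JSON, e.g. exports from web services

tests   <- read.csv("well_tests.csv")             # plain CSV, base R
surveys <- read_excel("deviation_surveys.xlsx")   # a spreadsheet sheet
config  <- fromJSON("simulation_config.json")     # nested JSON into lists

str(tests); str(surveys); str(config)   # inspect what came in
```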
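
Finally, the regex sketch, for item 14: pulling numbers out of a simulator’s free-text report with regular expressions. The report lines are invented, but the pattern-extraction idea carries over to most solver and simulator outputs.

```r
# extract time and rate from free-text report lines with regex
report <- c(
  "TIME =    10.00 DAYS   OIL RATE =  2514.3 STB/D",
  "TIME =    20.00 DAYS   OIL RATE =  2367.9 STB/D",
  "CONVERGENCE ACHIEVED IN 14 ITERATIONS",
  "TIME =    30.00 DAYS   OIL RATE =  2251.4 STB/D"
)

rates <- grep("OIL RATE", report, value = TRUE)   # keep only matching lines
data.frame(
  days = as.numeric(sub(".*TIME =\\s*([0-9.]+).*", "\\1", rates)),
  oil  = as.numeric(sub(".*OIL RATE =\\s*([0-9.]+).*", "\\1", rates))
)
```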

Alfonso R. Reyes

Houston, Texas. 2020

Copyright Alfonso R. Reyes 2020. Oil Gains Analytics LLC.

References

  • My GitHub repository with open source code: link
  • My Rmarkdown blog: link
  • Original article 2019 in LinkedIn: link
  • Original article 2017 in LinkedIn: link

