Welcome to my data science coding projects portfolio. This portfolio is very much a work in progress. When it is complete, my goal is for it to cover the full spectrum of the data science process using Python, Excel-VBA, SQL, AWS, and Google Cloud.
My main area of interest is in applying data science tools to bring value to the financial securities, risk management, and economic policy industries. I am also interested in visual deep learning as it is applied to brain segmentation imaging (e.g., Janelia.org) and to geospatial intelligence analysis (e.g., NGA.org).
Some of the specific techniques and tools I will be using include: k-means clustering, principal component analysis (PCA), sentiment analysis, natural language processing (NLP), linear regression, logistic regression, time-series analysis, econometrics, PySpark, big data, cloud computing, and deep learning.
Sometimes it is really helpful to host a short, executable piece of Python code on a public web server. You may have a client with whom you wish to share an idea or a methodology, and a static HTML page may not be enough. AWS Lambda makes that easy to do, and you can customize Python library access with "layers." I will post a demo using an Ubuntu Linux base machine and a Jupyter notebook running Python 3 that imports NumPy, pandas, Matplotlib, and a few other popular data science libraries.
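As a rough sketch of what such a hosted snippet could look like, the handler below follows the usual shape of a Python Lambda entry point (a function that takes `event` and `context`). The `values` input field and the choice of summary statistics are illustrative assumptions, not the demo I will post, and it sticks to the standard library so that it runs without any layers.

```python
import json
import statistics


def lambda_handler(event, context):
    """Return summary statistics for a list of numbers sent by the caller.

    The event is assumed (for illustration only) to carry a JSON body such
    as {"values": [1, 2, 3]}; the "values" field name is hypothetical.
    """
    body = event.get("body") or "{}"
    payload = json.loads(body) if isinstance(body, str) else body
    values = payload.get("values", [])

    if not values:
        # Nothing to summarize: report a client error.
        return {"statusCode": 400,
                "body": json.dumps({"error": "no values supplied"})}

    result = {
        "count": len(values),
        "mean": statistics.fmean(values),
        # The sample standard deviation needs at least two points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }
    return {"statusCode": 200, "body": json.dumps(result)}
```

The handler can be exercised locally by passing it a hand-built event, for example `lambda_handler({"body": json.dumps({"values": [1, 2, 3, 4]})}, None)`, before it is ever deployed.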
Dash allows you to build a web app with your own customized sliders, radio buttons, text input, and user-selected data sorting and filtering. Plotly, meanwhile, has built-in default chart types with zoom, pan, expand/collapse, and data filtering already included.
I will be posting demos using energy futures data from EIA.gov.
I will post exercises using NumPy and pandas to manipulate input data. Then, I will use Matplotlib and Seaborn to visually explore those data. File read/write for JSON, CSV, and Excel formats will be covered, as well as web scraping HTML table information and converting PDF files.
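To give a flavor of the file read/write step, here is a minimal sketch of a CSV-to-JSON round trip. It deliberately uses only the standard library `csv` and `json` modules as a stand-in; with pandas, `pandas.read_csv` and `DataFrame.to_json` would cover the same ground. The file names and the two sample rows are made up for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def csv_to_json(csv_path, json_path):
    """Read a CSV file into a list of dicts and write it back out as JSON."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)
    return rows


def demo():
    """Write made-up rows to CSV, convert to JSON, and read the JSON back."""
    sample = [{"label": "a", "value": "1.5"},
              {"label": "b", "value": "2.5"}]
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "sample.csv"    # hypothetical file name
        json_path = Path(tmp) / "sample.json"  # hypothetical file name
        with open(csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["label", "value"])
            writer.writeheader()
            writer.writerows(sample)
        csv_to_json(csv_path, json_path)
        with open(json_path) as f:
            return json.load(f)
```

Note that `csv.DictReader` returns every field as a string, so numeric columns need an explicit conversion; this is one of the chores that pandas handles automatically.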
Algorithm efficiency is studied using Big-O notation, which describes how execution time grows with the number of inputs, n. Lower orders of growth are preferred: O(log(n)) is better than O(n), which in turn is better than O(n*log(n)), O(n**3), or O(n!). An O(n) algorithm is one whose execution time grows linearly as the number of inputs grows. For sorting, O(n) is the ideal but is rarely achieved; general-purpose comparison sorts need on the order of n*log(n) comparisons. In my sorting algorithm, I use a binary tree with a central pivot point and recursive function calls. I use this algorithm to study Big-O behavior. Example to post.
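Until the example is posted, here is a minimal sketch of the general idea, assuming a quicksort-style partition around the middle element; the algorithm I post may differ in its details. The recursive calls form a binary tree of sub-lists, and a simple counter records how many elements are checked against a pivot, so the growth can be compared with n*log2(n). The helper name `comparison_growth` and the random test data are illustrative assumptions.

```python
import math
import random


def quicksort(items, counter=None):
    """Return a sorted copy of `items`, pivoting on the middle element.

    `counter` is an optional one-element list used to tally how many
    elements were checked against a pivot across all recursive calls.
    """
    if counter is None:
        counter = [0]
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    smaller, equal, larger = [], [], []
    for x in items:
        counter[0] += 1  # one pivot check per element at this level
        if x < pivot:
            smaller.append(x)
        elif x > pivot:
            larger.append(x)
        else:
            equal.append(x)
    # Each call branches into two sub-problems: a binary tree of partitions.
    return quicksort(smaller, counter) + equal + quicksort(larger, counter)


def comparison_growth(sizes, seed=0):
    """For each n, return (n, pivot checks, checks / (n * log2(n)))."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        data = [rng.random() for _ in range(n)]
        counter = [0]
        quicksort(data, counter)
        rows.append((n, counter[0], counter[0] / (n * math.log2(n))))
    return rows
```

Calling `comparison_growth([1000, 2000, 4000, 8000])` and watching the last column is a quick empirical check: if it stays roughly constant as n doubles, the work is growing on the order of n*log(n) for random input.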
I went to an NT Concepts talk this week, and an audience member asked, "When you are using a streaming data source, when do you know you have enough data?" I offered an answer from the audience. I think the questioner was asking when we can have a reasonably high level of confidence that the sample we are getting from the streaming data source is a good representation of the true population. My answer was, "From a statistics point of view, it depends on the stability of the sampled mean and sampled standard deviation over time. And I don't know the formula off the top of my head, but due to the Central Limit Theorem, there IS a formula that will allow one to calculate the probability that the streaming sampled data is wrong." (That is, the streaming data you have collected so far is NOT representative of the true population, and you need to collect more samples.) I forgot to mention that the number of observations and the number of features you are using also matter, as these influence the degrees of freedom in the statistical calculation. My emphasis on the stability of the mean and standard deviation over time was an indirect reference to stationarity and independence of samples.
In summary, this question seems to come straight out of the Central Limit Theorem. Given the sample mean, sample standard deviation, independence, and stability of volatility, the probability that the sampled mean is close to the true population mean can be estimated. (The theorem is usually stated as the distribution of the sampled mean approaching a "Gaussian" probability density function, whether the population itself follows a Gaussian or some other named distribution, such as binomial, log-normal, or exponential.) I tried to follow up with the questioner, but the email address she gave me had an error. Anyway, I thought it might make a good post, so I will explore it further and write about it with real-world streaming data.
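As a first pass at the idea, here is a minimal sketch of one way to decide "enough data" for estimating a mean. It assumes independent, stationary samples with finite variance, and uses the normal approximation from the Central Limit Theorem: the margin of error of the sample mean is roughly z * s / sqrt(n). The class and function names, the tolerance, and the `min_n` warm-up are illustrative assumptions, and the simulated exponential stream stands in for real streaming data.

```python
import math
import random
from statistics import NormalDist


class RunningStats:
    """Welford's online algorithm for the mean and variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stdev(self):
        # Sample standard deviation; undefined for fewer than two points.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else float("nan")

    def margin_of_error(self, confidence=0.95):
        # Half-width of the CLT-based confidence interval for the mean.
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        return z * self.stdev / math.sqrt(self.n)


def samples_needed(stream, tolerance, confidence=0.95, min_n=30):
    """Consume `stream` until the margin of error drops below `tolerance`."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
        if stats.n >= min_n and stats.margin_of_error(confidence) <= tolerance:
            break
    return stats


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated, non-Gaussian stream whose true mean is 1.0.
    stream = (rng.expovariate(1.0) for _ in range(100_000))
    stats = samples_needed(stream, tolerance=0.05)
    print(stats.n, round(stats.mean, 3), round(stats.margin_of_error(), 3))
```

Two caveats apply to this sketch. Checking the stopping rule after every sample makes the stated confidence level somewhat optimistic, and autocorrelated or drifting streams violate the independence and stationarity assumptions, so real-world data would need those checked first.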
A conceptual picture of the bias-variance tradeoff in machine learning. Source: Pierian Data, Udemy class PyMLDSBC, Section 16.
A test result with a bias problem refers to a case where the machine learning model's predictions systematically miss the true mean. A test result with a variance problem refers to a case where the predictions are too widely distributed to provide a meaningful indicator to the decision maker. (Article to post later.)
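The two problems can be put into numbers with a small sketch. It assumes the simplest possible setting: repeated predictions of a single known target value. The function name and the example predictions are made up purely for illustration.

```python
from statistics import fmean, pvariance


def bias_and_variance(predictions, true_value):
    """Return (bias, variance) for repeated predictions of one known target.

    bias     = average prediction minus the true value
    variance = spread of the predictions around their own average
    """
    bias = fmean(predictions) - true_value
    variance = pvariance(predictions)
    return bias, variance


def mean_squared_error(predictions, true_value):
    """Average squared distance of the predictions from the true value."""
    return fmean([(p - true_value) ** 2 for p in predictions])


# Made-up predictions of a target whose true value is 10.
tight_but_off = [7.9, 8.0, 8.1]    # consistent, but misses the target: bias
centered_but_wide = [6.0, 10.0, 14.0]  # right on average, but scattered: variance
```

In this setting the mean squared error equals the squared bias plus the variance, so `bias_and_variance` splits the total error into the two problems described above.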
Passenger information from the Titanic is a common data set used in machine learning (ML). Here I use Python and data science libraries to find patterns in the data and build a prediction model. Then I use various visualization libraries to create pretty figures.