Big Data Landscape (MMI226831) Case Study

Table of Contents

Introduction - Understanding Google Colab and BigQuery ML Applications
Task 1
Task 2
Task 3

Introduction - Understanding Google Colab and BigQuery ML Applications

Artificial Intelligence has shaped the way the world operates over the past few decades, and algorithms are now commonplace in every field. Machine learning is a subfield of AI that uses algorithms to learn from the provided input data and predict outcomes. Two prominent kinds of task for such algorithms are regression and classification. Google Colab is an online coding environment well suited to data analysis and machine learning; it offers GPU access and comes with several ML libraries. BigQuery ML, a feature of the BigQuery data warehouse, lets users create and execute ML models inside BigQuery using SQL queries. Because the data does not have to be moved out of the warehouse, development is faster.

Task 1

The following images were obtained after performing the required operations in “Google Colab”.

Figure 1: Overview of the imported dataset

The image above shows the initial step of importing the necessary libraries: pandas, numpy, random, seaborn, and matplotlib (Aho and Duffield, 2020). NumPy provides memory-efficient arrays with fast runtimes, along with a wide variety of array operations. Pandas provides high-performance data structures and data analysis tools, and it is built on top of NumPy. The “random” module is a built-in Python module for generating random numbers. The “matplotlib.pyplot” module is a collection of functions that make it convenient to work with “matplotlib”.

The “matplotlib” library is used to construct graphs and plots from Python scripts, and the functions in “pyplot” create and modify figures. The “seaborn” library uses “matplotlib” underneath to plot graphs (Dremel et al. 2020), and it also works with pandas and numpy. The data frame is named “data”, and it holds the data loaded from the dataset, which is in “.csv” format.
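The loading step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names and the three sample rows are hypothetical stand-ins for the real “.csv” file, which is not reproduced in this report.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real ".csv" file; in Colab the file path
# would be passed to pd.read_csv instead of this in-memory text.
SAMPLE_CSV = io.StringIO(
    "category,descript,daysofweek,latitude,longitude\n"
    "FRAUD,FORGERY,Friday,37.77,-122.42\n"
    "BURGLARY,FORCIBLE ENTRY,Monday,37.76,-122.41\n"
    "FRAUD,CREDIT CARD THEFT,Friday,37.75,-122.40\n"
)

# Read the CSV text into a data frame named "data", as in the report.
data = pd.read_csv(SAMPLE_CSV)

print(data.head())   # first rows of the data frame
print(data.shape)    # (number of rows, number of columns)
```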

Figure 2: Overview of the d1 data frame

The data frame named “d1” is displayed in this picture, along with its total number of rows and columns.

Figure 3: Overview of the null values of the dataset

The picture above shows the use of the “isnull()” method. This method returns a data frame object in which every value is replaced with the boolean value “True” for null values and “False” otherwise (Singh, 2019). The dataset has “2170785” rows and “12” columns.
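The behaviour of “isnull()” can be seen on a small example. The data frame below is a hypothetical toy frame, not the dataset from the figure; it only shows how the boolean mask and the per-column null counts are obtained.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
frame = pd.DataFrame({
    "category": ["FRAUD", "BURGLARY", None],
    "latitude": [37.77, np.nan, 37.75],
})

null_mask = frame.isnull()     # same shape as frame, True where a value is missing
null_counts = null_mask.sum()  # number of missing values in each column
n_rows, n_cols = frame.shape   # total rows and columns

print(null_mask)
print(null_counts)
print(n_rows, n_cols)
```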

Figure 4: Overview of the “latitude vs longitude” plot

This picture shows the plot of latitude against longitude. The “scatter()” method is applied to the data to display the data points on the graph (Tantalaki et al. 2020). The “latitude” values are taken along the x-axis and the “longitude” values along the y-axis.
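A plot of this kind might be produced roughly as follows. The coordinate values are made up for illustration, and the headless “Agg” backend is an assumption added so that the sketch also runs outside a notebook; inside Colab the backend line is unnecessary.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical coordinates standing in for the dataset's columns.
latitude = [37.77, 37.76, 37.75, 37.78]
longitude = [-122.42, -122.41, -122.40, -122.43]

fig, ax = plt.subplots()
points = ax.scatter(latitude, longitude)  # latitude on x, longitude on y
ax.set_xlabel("latitude")
ax.set_ylabel("longitude")
ax.set_title("latitude vs longitude")
# In a notebook, plt.show() would render the figure at this point.
```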

Figure 5: Histogram of the “Category” column

The plot above displays the data in the “category” column of the data frame “df2”. The “hist()” method is used to plot the histogram.

Figure 6: Plot presentation between “descript” and “category” columns

The plot in this section is a line chart, created with the “lineplot()” method. The plot is titled “Title using Matplotlib Function”. The data in the data frame “df2” is plotted, with the “category” column on the x-axis and the “descript” column on the y-axis (Zheng et al. 2021). The “show()” method is used to display the graph.

Figure 7: Linear Regression model of the dataset

The “linear_model” module belongs to the “sklearn” (scikit-learn) package. It includes classes and functions for performing machine learning with linear models. The term “linear model” means that the model is specified as a linear combination of the features (Feng, 2021). The “print” function is used to display the “slope”, “intercept”, “coefficient of determination”, and “predicted response”.
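The four printed quantities can be obtained from scikit-learn as sketched below. The five data points are invented for illustration and are not taken from the dataset in the figure; only the calls to “LinearRegression” reflect the library itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample: one feature (as a column) and one continuous response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

model = LinearRegression().fit(x, y)

slope = model.coef_[0]         # weight of the single feature
intercept = model.intercept_   # value of the fitted line at x = 0
r_squared = model.score(x, y)  # coefficient of determination
predicted = model.predict(x)   # predicted response for each input

print("intercept:", intercept)
print("slope:", slope)
print("coefficient of determination:", r_squared)
print("predicted response:", predicted)
```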

Figure 8: Pip installation of MySQL overview

The “mysql” module is installed into “Python” using the package manager “pip”, which handles the installation of the packages and modules shown in the image above. First, a terminal is opened in order to run the “pip” command (Ghosh and Dey, 2020). This command then installs the corresponding module so that it is available across the system.

Figure 9: pyodbc and os installation

The above picture shows the installation of the “pyodbc” module along with the “os” module, which ships with Python and only needs to be imported. “pyodbc” is an open-source Python module that simplifies access to ODBC databases, which in this case is a SQL database. The “os” module provides an interface for interacting with the operating system, including useful functions for performing OS-related tasks and for obtaining information about the operating system (Hesse et al. 2019). It is one of the standard utility modules in Python.

Figure 10: pyodbc and os installation

This picture shows the “select * from” query, which retrieves the data from the table named “dbo”.

Figure 11: Retrieving data from the table

The “where” clause is used here to extract the “fraud” records that satisfy the specified condition.

Figure 12: Specification of the criteria

Here the “where” clause is used again, this time to extract the data for the “burglary” category.

Figure 13: Criteria Specification

The records for “daysofweek” are displayed in the image above by applying the “where” clause, with the day “friday” selected.
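The queries in Figures 10 to 13 follow one pattern: select everything, then narrow the result with a “where” clause. The sketch below reproduces that pattern under stated assumptions: an in-memory SQLite database from Python's standard library stands in for the database reached through “pyodbc”, and the table name “incidents”, its columns, and its rows are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite replaces the real database connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incidents (category TEXT, descript TEXT, daysofweek TEXT)"
)
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [
        ("FRAUD", "FORGERY", "Friday"),
        ("BURGLARY", "FORCIBLE ENTRY", "Monday"),
        ("FRAUD", "CREDIT CARD THEFT", "Tuesday"),
        ("BURGLARY", "UNLAWFUL ENTRY", "Friday"),
    ],
)

# Retrieve every row of the table.
all_rows = conn.execute("SELECT * FROM incidents").fetchall()

# Filter by category with a WHERE clause (parameterised to avoid injection).
category_query = "SELECT * FROM incidents WHERE category = ?"
fraud_rows = conn.execute(category_query, ("FRAUD",)).fetchall()
burglary_rows = conn.execute(category_query, ("BURGLARY",)).fetchall()

# Filter by day of the week.
friday_rows = conn.execute(
    "SELECT * FROM incidents WHERE daysofweek = ?", ("Friday",)
).fetchall()

conn.close()
print(len(all_rows), len(fraud_rows), len(burglary_rows), len(friday_rows))
```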

Working of Machine learning

A machine learning system learns from historical data, builds prediction models, and predicts the outcome whenever it receives new data. The accuracy of the predicted output depends in part on the quantity of data: a larger quantity of data generally helps to build a model that predicts outputs more accurately.

Features

Machine learning uses data to recognize patterns within the provided dataset.
It can learn from previous data and improve automatically.
It is a data-driven technology.

Classification

At the broad level, ML gets classified into the following three types:

Supervised learning
Reinforcement learning
Unsupervised learning

Supervised Learning

In this type of ML method, sample “labeled” data is provided to the machine learning system to train it, and the system then predicts the output on that basis (Huang, 2021). The system builds a model from the “labeled” data in order to understand the dataset and learn about every piece of data. Once the training and processing are done, the model is tested with a data sample to check whether it predicts the correct output. The objective of “supervised learning” is to map the input data to the output data, and the method is based on the concept of supervision.

Reinforcement Learning

This method is based on feedback: the learning agent receives a “reward” for every correct action and a penalty for every wrong action. The agent learns from this feedback automatically and improves its performance. In this kind of learning, the agent interacts with and explores the environment around it.

Unsupervised Learning

In this method, the machine learns without any supervision. It is trained on a dataset that has not been classified, labeled, or categorized (Lies, 2019), and the algorithm has to act on that data unaided. The goal of this method is to restructure the “input” data into new features or into groups of “objects” with similar patterns.
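Grouping unlabeled objects with similar patterns can be illustrated with a very small clustering routine. The sketch below uses k-means on one-dimensional data as one possible example of unsupervised learning; the function name, the sample values, and the starting centroids are all hypothetical choices made for illustration.

```python
import numpy as np


def k_means_1d(values, centroids, iterations=10):
    """Cluster 1-D values around the given starting centroids."""
    values = np.asarray(values, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        distances = np.abs(values[:, None] - centroids[None, :])
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = values[labels == k]
            if members.size:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean()
    return labels, centroids


# Unlabeled sample with two visible groups, near 1 and near 5.
sample = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels, centroids = k_means_1d(sample, centroids=[0.0, 6.0])
print(labels, centroids)
```

No labels are supplied at any point; the two groups emerge only from the distances between the values.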

Linear regression and logistic regression are two of the best-known ML algorithms in supervised learning. Both use labelled datasets to make predictions (Ma et al. 2021). The difference lies in how they are used: linear regression solves regression problems, and logistic regression solves classification problems.

Linear Regression

This algorithm is used to predict a “continuous” dependent variable from the independent variables.
The goal is to find the “best fit” line, which predicts the outcome for the “continuous” dependent variable as accurately as possible.
The “best fit” line captures the relationship between the independent variable and the dependent variable, which is assumed to be linear.
When a single “independent” variable is used for prediction, the method is called simple linear regression; when more than one “independent” variable is used, it is called multiple linear regression.
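The points above can be made concrete with ordinary least squares, which is one common way to define the “best fit” line. The sketch below is an illustration on invented data: the simple case uses the closed-form slope and intercept, and the multiple case uses NumPy's least-squares solver. The function name and all values are hypothetical.

```python
import numpy as np


def simple_linear_fit(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Simple linear regression: one independent variable, y = 2x + 1 exactly.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = simple_linear_fit(x, y)

# Multiple linear regression: two independent variables, y = 1 + 2*x1 + 3*x2.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([1.0, 0.0, 1.0, 0.0])
target = 1.0 + 2.0 * x1 + 3.0 * x2
design = np.column_stack([np.ones_like(x1), x1, x2])  # leading column of ones = intercept
coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)

print(slope, intercept)
print(coefficients)  # intercept, weight of x1, weight of x2
```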

Logistic Regression

This algorithm is used to predict a “categorical” dependent variable from the “independent” variables.
The output of this algorithm always lies between “0” and “1”.
Logistic regression is used where the probability of an observation belonging to one of two classes needs to be determined.
The algorithm is based on the concept of “maximum likelihood estimation”: the parameters are chosen so that the observed data is the most probable.
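These properties can be seen in a small hand-written version of the algorithm. The sketch below fits a one-feature logistic regression by gradient ascent on the log-likelihood, which is one simple way to carry out maximum likelihood estimation. The data (hours studied and pass/fail labels), the learning rate, and the number of steps are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(x, y, learning_rate=0.1, steps=2000):
    """Estimate [intercept, slope] by gradient ascent on the mean log-likelihood."""
    design = np.column_stack([np.ones(len(x)), x])
    weights = np.zeros(design.shape[1])
    for _ in range(steps):
        probabilities = sigmoid(design @ weights)
        # Gradient of the mean log-likelihood with respect to the weights.
        gradient = design.T @ (y - probabilities) / len(y)
        weights += learning_rate * gradient
    return weights


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
passed = np.array([0, 0, 0, 1, 0, 1, 1, 1])

weights = fit_logistic(hours, passed)
probabilities = sigmoid(weights[0] + weights[1] * hours)
predicted_class = (probabilities >= 0.5).astype(int)  # threshold at 0.5

print(weights)
print(probabilities)
print(predicted_class)
```

The 0.5 threshold used to turn a probability into a class is a convention, not part of the fitted model, and can be moved when one kind of error is more costly than the other.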

Task 2

Predictive modelling is a statistical technique that uses data mining and machine learning to predict and forecast future outcomes. It works by analysing existing and historical data, fitting a model to it, and using that model to project the likely outcomes.

Machine learning enables computers to learn from previous data automatically (Rehman et al. 2022). It uses different algorithms to build mathematical models and make predictions from historical information. At present, it is used in many real-life tasks, such as speech recognition, image recognition, email filtering, and many more. Broadly, there are three types of technique: reinforcement learning, supervised learning, and unsupervised learning. Google Colab is a cloud-based “Jupyter notebook” environment that lets users train machine learning and deep learning models on CPUs, GPUs, and TPUs. The platform also helps users test basic ML models to gain experience and develop insight into aspects of deep learning (Resnyansky, 2019). These include data processing, hyperparameter tuning, overfitting, model complexity, and many more...

Task 3

BigQuery ML is a feature of the BigQuery data warehouse that supports decision-making through predictive analytics, using machine learning tools to do so. The model is created and trained without exporting the data out of BigQuery. In general, BigQuery ML is a set of “SQL” extensions that support machine learning.
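The SQL extensions mentioned above centre on a “CREATE MODEL” statement and an “ML.PREDICT” function. The sketch below only builds the query text in Python and does not contact BigQuery; running it for real would require a BigQuery client and credentials. The helper functions and the dataset, table, model, and column names are hypothetical.

```python
def build_create_model_sql(model_name, source_table, label_column,
                           model_type="linear_reg"):
    """Return a BigQuery ML statement that trains a model inside BigQuery."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def build_predict_sql(model_name, source_table):
    """Return a query that scores rows of a table with a trained model."""
    return (
        f"SELECT * FROM ML.PREDICT(MODEL `{model_name}`, "
        f"(SELECT * FROM `{source_table}`))"
    )


# Hypothetical names; plain string formatting is used for illustration only.
create_sql = build_create_model_sql(
    model_name="my_dataset.my_model",
    source_table="my_dataset.training_data",
    label_column="label",
)
predict_sql = build_predict_sql("my_dataset.my_model", "my_dataset.new_data")

print(create_sql)
print(predict_sql)
```

Both statements run where the data already lives, which is why no export step appears anywhere in the sketch.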

The terms “algorithm” and “model” are often used interchangeably in machine learning, but they do not mean the same thing (Seo and Jung, 2021). An “ML” algorithm is the set of step-by-step instructions that is executed on the data to create a machine learning model; the model is the result of running those instructions on the data.

BigQuery ML supports “supervised” learning algorithms such as logistic and linear regression. It also supports “unsupervised” learning algorithms, for example “k-means” for clustering data based on similarity. Depending on where the models are trained, they can be classified into the following categories:

External Models: These models are trained outside of BigQuery. They include DNN models, boosted tree models, etc.

Built-in Models: These models are built and trained inside BigQuery. They include time-series models, logistic regression, linear regression, matrix factorization, etc.

Features

Elimination of Data Transfer: In the past, users moved data from BigQuery to isolated environments or other platforms to train ML models, which consumed an unnecessary amount of time for large datasets (Sheng et al. 2021). With BigQuery ML, users can train and execute models directly in BigQuery.

Automatic generation of ML Models: Users have the opportunity to select the right ML model for their dataset. They can use “AutoML”, which provides a user-friendly “graphical” interface and lets them select the best ML model for their requirements.

Encrypted Models: BigQuery ML allows users to encrypt machine learning models with “customer-managed” encryption keys.

Conclusion

All three tasks have been performed with the respective programming languages and software platforms. The machine learning algorithm linear regression has been used for prediction, implemented in “Python” inside “Google Colab”. A machine learning model has also been created with BigQuery ML using “SQL”, based on a “BigQuery” public dataset.

References

Aho, B. and Duffield, R., 2020. Beyond surveillance capitalism: Privacy, regulation and big data in Europe and China. Economy and Society, 49(2), pp.187-212.

Dremel, C., Herterich, M.M., Wulf, J. and Vom Brocke, J., 2020. Actualizing big data analytics affordances: A revelatory case study. Information & Management, 57(1), p.103121.

Feng, F., 2021. Research on aging landscape design of old residential quarters based on ecological values under the background of big data. In 5th International Conference on Education, Management and Social Science (Vol. 69).

Ghosh, S. and Dey, S., 2020, October. A big data compatible naïve measure for estimating the landscape dynamics using open geospatial datasets. In 2020 International Conference on Smart Innovations in Design, Environment, Management, Planning and Computing (ICSIDEMPC) (pp. 347-352). IEEE.

Hesse, A., Glenna, L., Hinrichs, C., Chiles, R. and Sachs, C., 2019. Qualitative research ethics in the big data era. American Behavioral Scientist, 63(5), pp.560-583.

Huang, L., 2021, April. Application of Big Data in Improving Landscape Plant Landscaping Method. In Journal of Physics: Conference Series (Vol. 1852, No. 3, p. 032024). IOP Publishing.

Lies, J., 2019. Marketing intelligence and big data: Digital marketing techniques on their way to becoming social engineering techniques in marketing.

Ma, L., Liu, F. and Wu, L., 2021. Big Data Analysis Guides Landscape Architecture Method Research. In E3S Web of Conferences (Vol. 248, p. 03053). EDP Sciences.

Rehman, A., Naz, S. and Razzak, I., 2022. Leveraging big data analytics in healthcare enhancement: trends, challenges and opportunities. Multimedia Systems, 28(4), pp.1339-1371.

Resnyansky, L., 2019. Conceptual frameworks for social and cultural Big Data analytics: Answering the epistemological challenge. Big Data & Society, 6(1), p.2053951718823815.

Seo, J.Y. and Jung, H.J., 2021. A Study on the Internationally Accepted Terminology of Traditional Landscape Architecture-Based on Big Data Analysis on International Documents and Research Papers of Gardens, Parks and Landscape. Journal of the Korean Institute of Traditional Landscape Architecture, 39(4), pp.1-9.

Sheng, J., Amankwah‐Amoah, J., Khan, Z. and Wang, X., 2021. COVID‐19 pandemic in the new era of big data analytics: Methodological innovations and future research directions. British Journal of Management, 32(4), pp.1164-1183.

Singh, N., 2019. Big data technology: developments in current research and emerging landscape. Enterprise Information Systems, 13(6), pp.801-831.

Tantalaki, N., Souravlas, S. and Roumeliotis, M., 2020. A review on big data real-time stream processing and its scheduling techniques. International Journal of Parallel, Emergent and Distributed Systems, 35(5), pp.571-601.

Zheng, J., Chen, G., Zhang, T., Ding, M., Liu, B. and Wang, H., 2021. Exploring Spatial Variations in the Relationships between Landscape Functions and Human Activities in Suburban Rural Communities: A Case Study in Jiangning District, China. International Journal of Environmental Research and Public Health, 18(18), p.9782.