Visual Analytics in Data Analysis Case Study
Visual analytics is the fusion of interactive visualization and data analytics. The approach integrates the underlying analytical process with interactive visual interfaces in order to support complex activities such as reasoning and data-driven decision making, and it sits within the broader categories of visual business analytics and business intelligence. Techniques such as statistical analysis and data mining are employed to present information in a format that is convenient for users to comprehend. The analysis in this case study concentrates on a London census dataset, which provides all of the information that forms the bedrock of the work, and the analysis is carried out in the Python programming language. In this dataset, the records are grouped by several attributes of people living in London, including age group, individual id, economic condition, and gender. Each borough is also described by supporting data such as its total area and its resident population. Part-time students are likewise covered in the dataset with respect to their financial condition and gender. Visual analytics combines human factors, visualization, and data analysis to extract knowledge from data, and it renders complex problems considerably easier for users to comprehend. The task accomplished in this assignment is the visual analysis of this dataset so that the required results are obtained. Several libraries have been used throughout, including numpy, geopandas, and scikit-learn, each with a distinct purpose and role in the analysis.
Visual analytics combines automatic and visual analysis methods, tightly coupled through human interaction, with the sole objective of extracting knowledge from data. The process is characterized by the interplay between data, models, visualizations, and users in order to discover key insights. In most application scenarios, heterogeneous data sources are integrated before the automatic or visual analysis methods are applied [1]. The initial step is to pre-process and transform the data to derive representations that can be explored further. The binary and categorical features are analyzed thoroughly. A categorical feature takes on a fixed number of values, and each value assigns an observation to a corresponding group, known as a category; the category reflects some qualitative property of the data in question. Binary variables are an important special case of categorical variables, arising when the number of possible values is exactly two. When the values of a categorical variable are ordered, the variable is called ordinal. Multivariate visualization allows users to view the relationships among more than two variables at once, whereas univariate plots depend on the type of the single variable being analyzed. At the very beginning, a frequency table is obtained that shows how often each value of the categorical variable appears.
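As a minimal sketch of that first step, a frequency table for a categorical column can be produced with pandas. The file name follows the dataset named later in this report, while the column label "Economic activity" is an illustrative assumption rather than the dataset's exact header.

```python
import pandas as pd

# Load the census data; the CSV name follows the dataset cited in this report.
census_data = pd.read_csv("population+perc+crimes.csv")

# value_counts() builds the frequency table: how often each category appears.
# The column label "Economic activity" is a hypothetical placeholder.
print(census_data["Economic activity"].value_counts())

# normalize=True converts the counts into proportions of the whole.
print(census_data["Economic activity"].value_counts(normalize=True))
```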
Visual analytics is a type of analytics that uses interactive visual interfaces to help people explore, interact with, and analyse data. It combines the power of data mining, data analysis, and visualization tools into an integrated platform for data exploration and decision-making.
Visual analytics is an emerging field that is rapidly gaining traction across many industries. By combining the strengths of visualization and analytics, it provides an efficient and effective way to explore data and uncover insights [2]. It can be used to identify patterns, trends, and relationships in data that are not readily apparent, and it can support decision making, recommendations, and outcome prediction. The field is evolving quickly, with new techniques in data mining, data analysis, and visualization appearing all the time; recent innovations include machine learning algorithms, natural language processing, and other technologies that help make sense of large datasets. Visual analytics is already applied in a range of industries, from healthcare to finance, and as the field continues to advance it is likely to become increasingly important for decision making and insight discovery.
The dataset considered in this case is "population+perc+crimes". Its dimensions are 649 rows by 610 columns. The data are segregated into columns according to several parameters, including the names of the participants, the corresponding individual ids, and the names of the boroughs. The number of residents in each borough, the total land area, the population density, the age groupings, and the economic activity of these people across different financial years are also recorded.
Figure 1: Showcasing the data within the column Name
The column holding the attribute "name" is counted and described with the corresponding functions. The image above shows the different attributes that form part of this column. The total length in this regard is 627, and the data type is integer; not a single irrelevant or erroneous value is present in the dataset. The dataset also houses data on full-time students with respect to their economic activity and gender [3]. The analysis is performed by loading this dataset into the "Jupyter Notebook" platform. Several characteristics have been examined and reported in this context, such as temporal coverage, biases, and noisiness, and the resolution has been specified for both the temporal and the spatial data. All of the necessary checks have been incorporated into the analysis with respect to data quality, detected issues, and gaps in temporal and spatial coverage. The columns covering economic activity contain several classifications of the students' financial condition, including those who are economically inactive, those who are unemployed, those who are looking after their families, and those who have been unemployed for long periods of time, among others.
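A hedged sketch of how such a column summary can be obtained is shown below; the lowercase column label "name" follows the figure caption but is otherwise an assumption.

```python
# Assumes census_data was loaded with pd.read_csv() as sketched earlier.
# count() reports the number of non-null entries in the column.
print(census_data["name"].count())

# describe() summarizes the column; for object data it reports
# count, unique, top, and freq.
print(census_data["name"].describe())
```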
Figure 2: Showing the data inside the column Borough
The columns also incorporate the different ethnic groupings and the religious beliefs of the respective people. The data in this context were collected through the primary method of data collection, in which raw data are obtained directly from first-hand sources, including surveys, experiments, and observations. This methodology is further divided into two branches [4]: quantitative methods of data collection and qualitative methods of data collection. For this dataset the former has been taken into consideration. In this method, mathematical calculations are applied using formats such as regression and correlation, and measures like the mean, mode, and median are also used. The advantage of this method is that it can be implemented in a very short span of time and is cheaper than the alternative.
Data analysis covers the techniques of gathering, transforming, and organizing data in order to produce future predictions, and it assists in making informed decisions grounded in the data in question. Several steps lead to the end results in this case: specifying the data requirements, preparing the data, processing and cleaning the data, analysing it, and finally sharing and reporting the findings.
Exploratory Data Analysis
Exploratory data analysis is used to investigate data sets and summarize their primary characteristics [5], often incorporating data-visualization methods. It helps determine how the data source can be manipulated to obtain the required insights.
Figure 3: Displaying the first five rows
This image shows the very first five rows of the dataset. As it reveals, some columns contained invalid values, which were later replaced with valid ones for the sake of the analysis. EDA helps users recognize underlying patterns, test hypotheses, spot anomalies, and check assumptions [6]. It is fundamentally used to see what the data can reveal beyond formal modelling or hypothesis testing, and it provides a better understanding of the included variables and of the relationships between them.
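A minimal sketch of this step follows; the placeholder string "-" for invalid entries and the zero-fill imputation are assumptions about how the cleaning was done.

```python
import numpy as np

# Display the first five rows, as in Figure 3.
print(census_data.head())

# Replace a placeholder marker for invalid entries with NaN, then impute.
# Both the "-" marker and the zero fill are illustrative assumptions.
census_data = census_data.replace("-", np.nan)
census_data = census_data.fillna(0)
```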
Spatialisation
Spatialisation refers to the study of entities through the examination, assessment, execution, and modelling of the spatial features of data. This incorporates the locations, relationships, and attributes that reveal the geographic and geometric properties of the data. It generally employs an array of algorithmic approaches, analytic techniques, and computational models to assimilate geographic information [7], and it helps establish suitability for the target system. The process proceeds through a number of successive steps: the collection of the data, the analysis of the data, and the presentation of the data.
Supporting Cluster for Analysis
One of the most basic uses of cluster analysis is classification. Subjects are separated into groups so that every subject is more similar to the other subjects within its group than to subjects outside it. In the first place, the focus stays on clustering procedures that assign each subject to exactly one class, and the subjects within a class are assumed to be indistinguishable from one another [8]. The underlying framework of the data incorporates an unordered set of discrete classes. On some occasions these classes are viewed as hierarchical in essence, with some classes segregated into subclasses.
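As a sketch of such a procedure, scikit-learn (which the report imports later) can assign each record to one discrete class with k-means; the two numeric column names used here are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative feature columns; the real dataset's headers may differ.
features = census_data[["Population density", "Median age"]].dropna()

# Standardize so both variables contribute comparably to the distances.
scaled = StandardScaler().fit_transform(features)

# Partition the subjects into three discrete, non-overlapping classes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
census_data.loc[features.index, "cluster"] = kmeans.fit_predict(scaled)
```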
Supporting Model Building
In this process, datasets are developed for the purposes of training, testing, and production. These datasets allow analysts to develop the analytical models and train them, with some of the data set aside for testing the model. The building and execution of these models generally proceed on the basis of the work performed within the model-planning phase.
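A hedged sketch of setting data aside for testing is shown below; the "target" label column is a hypothetical placeholder, since the report does not name the modelling target.

```python
from sklearn.model_selection import train_test_split

# "target" is a hypothetical label column; substitute the real target here.
X = census_data.drop(columns=["target"])
y = census_data["target"]

# Hold out 20% of the rows for testing the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```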
Figure 4: Importing the Libraries
The picture in the above section shows each of the libraries imported into the software for performing the analysis. The first library, "numpy", is imported as "np"; it is used for performing a wide range of mathematical operations on arrays, attaches very powerful data structures to the language, and guarantees efficient calculations with matrices and arrays. Another library, "pandas", provides ready-to-use, high-performance data structures and data-analysis tools, and it executes on top of numpy [9]. "matplotlib.pyplot" is a collection of functions that make matplotlib work in the style of MATLAB; pyplot is a sub-module of the matplotlib library. The library "altair" has also been imported; it is a declarative statistical-visualization library for Python that offers features for performing data analysis and constructing attractive visualizations. The next library, "seaborn", uses matplotlib underneath to plot its graphs, and the module "scipy.stats" has helped to obtain the probability distributions.
"geopandas" is an extension of the pandas library whose purpose is to support working with geospatial data. The module "statsmodels" provides functions and classes for estimating various statistical models, while "pylab" provides a MATLAB-like namespace by importing functions from matplotlib and numpy. The "folium" library assists in constructing different types of Leaflet maps [10]. The "Polygon" class is also imported to handle polygonal shapes in 2D. The math function asin() returns the arcsine of x, in radians, and the "math" module is imported to support all of the mathematical operations performed in the software. "Counter" is a subclass of dict specially designed for counting hashable objects in Python.
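Taken together, the imports described around Figure 4 might look like the sketch below; mapping the "polygon" package to shapely's Polygon class is an assumption on our part.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
import geopandas as gpd
import statsmodels.api as sm
import pylab
import folium
from scipy import stats
from shapely.geometry import Polygon  # assumed source of the Polygon class
from math import radians, asin, sqrt, sin, cos, log, log10
from collections import Counter
```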
Figure 5: Reading the dataset
This picture displays the loading of the census data into the data frame, here named "census_data". The function read_csv() returns a new data frame holding the labels and the data from the file shown above, which is in .csv format.
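A minimal sketch of this loading step, with a shape check against the dimensions reported earlier:

```python
import pandas as pd

# read_csv() parses the CSV file and returns a new DataFrame.
census_data = pd.read_csv("population+perc+crimes.csv")

# Sanity check: the report states 649 rows and 610 columns.
print(census_data.shape)
```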
Figure 6: Returning of the data description
Here the method describe() has been used; it returns a description of the data inside the data frame, in this case for the columns holding data about the boroughs [11]. The function notnull(), also shown in the same image, is a pandas function that examines the values to validate that they are not null.
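A sketch of those two calls follows; the column label "Borough" matches the figure caption above, but the exact header is an assumption.

```python
# describe() summarizes the borough column (count, unique, top, freq).
print(census_data["Borough"].describe())

# notnull() flags each entry that is not null; summing counts them.
print(census_data["Borough"].notnull().sum())
```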
Figure 7: Heat map
The picture in this section presents the heat map obtained for the dataset. This plot renders rectangular data as a color-coded matrix [12]; the parameters take a 2D dataset into account for the purpose of showcasing this matrix.
Figure 8: Plotting of Heat map
The function corr() has been used here to return the correlation coefficients between the numbers [13]. corr() finds the pairwise correlation of all the columns housed inside the pandas data frame.
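A hedged sketch of producing the heat map of Figures 7 and 8 from that correlation matrix:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlation of the numeric columns only.
corr_matrix = census_data.corr(numeric_only=True)

# Render the matrix as a color-coded heat map, centered on zero.
sns.heatmap(corr_matrix, cmap="coolwarm", center=0)
plt.show()
```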
Figure 9: Sorting of the rows of correlation
In the above picture, a number of operations have been performed with Python: plotting a histogram, sorting the correlation rows, and adding an extra column containing the absolute correlation [14]. The print() function has been used in this context to display several messages.
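A sketch of those operations, under the assumption that the rows of interest come from the correlation matrix computed above and that "Population density" is the reference column:

```python
import matplotlib.pyplot as plt

# Take one variable's correlations with every other column.
target_corr = corr_matrix["Population density"].to_frame(name="corr")

# Extra column with the absolute correlation, then sort by it.
target_corr["abs_corr"] = target_corr["corr"].abs()
print(target_corr.sort_values("abs_corr", ascending=False))

# Histogram of the reference column's distribution.
census_data["Population density"].hist(bins=30)
plt.show()
```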
Figure 10: Checking of Standard Deviation
This picture shows the determination of the standard deviation, via the std() function, together with a box plot. The box plot has been produced to show the distribution of the numeric data values, particularly when comparisons are made between different groups [15]. This plot conveys high-level information from which users can draw better insights. It is also known as a whisker plot and displays a summary of a set of data values through the minimum, first quartile, median, third quartile, and maximum.
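A minimal sketch of Figure 10's two steps; both column names are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Standard deviation of every numeric column.
print(census_data.std(numeric_only=True))

# Box (whisker) plot comparing one numeric column across groups.
census_data.boxplot(column="Median age", by="Borough", rot=90)
plt.show()
```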
Figure 11: Plotting the data of very bad health condition
The plot in this section applies geospatial analysis so that the output is shown across several regions, namely the different boroughs whose data are provided in the dataset [16]. The width and height of the plot have been set to 500 and 300 respectively. The graph displays the data for people whose health condition is "very bad".
Figure 12: Plotting the data of very good health condition
The plot in this section likewise applies geospatial analysis to distribute the outcome across the same boroughs [17]. The width and height of this plot are again set to 500 and 300 respectively, and it showcases the information for people whose health condition is "very good".
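One way Figures 11 and 12 could have been produced is sketched below with Altair's mark_geoshape over borough geometries (Altair 5 accepts a GeoDataFrame directly); the GeoJSON file name, the join key, and the health-condition column are all assumptions.

```python
import altair as alt
import geopandas as gpd

# Hypothetical geometry file for the London boroughs.
boroughs = gpd.read_file("boroughs.geojson")
merged = boroughs.merge(census_data, on="Borough")

chart = (
    alt.Chart(merged)
    .mark_geoshape()
    .encode(color="Very good health:Q", tooltip=["Borough:N"])
    .properties(width=500, height=300)  # dimensions stated in the text
)
chart.save("health_map.html")
```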
The analysis of the dataset has been carried out in the Python programming language within the tool called Google Colab [18]. The dataset was loaded onto the platform to analyse the data collected through a census of the people living in London. The libraries used for the analysis are pandas, numpy, matplotlib, altair, seaborn, geopandas, statsmodels.api, and pylab. The math library has also been imported, with the modules radians, asin, sqrt, sin, cos, log, and log10, and the library scikit-learn has been imported to perform machine-learning procedures. numpy can be used for large calculations on data structures such as arrays and lists, while pandas is useful for data arranged in rows and columns [19]. Seaborn implements graphics for the statistical analysis performed in Python. geopandas enables geospatial study of the dataset, since the data describe figures for different locations; it is an extension of pandas. One of its key data structures is geopandas.GeoDataFrame, which is employed for storing geometrical data and carrying out geospatial operations; another component is geopandas.GeoSeries. The GeoDataFrame can store the conventional data types, while the GeoSeries component contains geometrical data such as polygons and points. Together these libraries support the analysis of this census of the residents of London.
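The particular math imports listed above (radians, asin, sqrt, sin, cos) are the ingredients of the haversine great-circle distance formula, so a plausible, though assumed, use is computing distances between locations:

```python
from math import radians, asin, sqrt, sin, cos

EARTH_RADIUS_KM = 6371  # mean Earth radius

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Example: rough distance between two points in London.
print(haversine(51.5074, -0.1278, 51.5155, -0.0922))
```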
Figure 13: Wards in London
The above figure shows the data on the different wards recorded during the London census; the head() function is used to display the upper rows of the dataset.
Figure 14: Correlation matrix
The above figure shows the correlation matrix formed from the dataset, revealing the correlation between its different parameters. A correlation coefficient indicates whether an increase in one variable is accompanied by an increase in the other, and the coefficients between the different parameter pairs are shown in different colors.
I have conducted the analysis using the dataset called "population+perc+crimes" in the form of a CSV file. The dataset documents a census performed on the people of London [20]. It contains the ids and names of the people surveyed; the number of all usual residents; the area covered in the census, measured in hectares; the density (persons per hectare); population counts by age band (all ages, 0-4, 5-7, 8-9, 10-14, 15, 16-17, 18-19, 20-24, 25-29, 30-44, 45-59, 60-64, 65-74, 75-84, and 90 and above); the average and median age; sex counts (all, male, female); the ethnic group of all usual residents, including White residents of England, Wales, Scotland, Northern Ireland, and Britain, the Irish ethnic group, the Gypsy/Traveller group, other White groups, and mixed groups; religion across all categories, covering people who follow a religion and, specifically, Christian, Buddhist, Hindu, Jewish, Muslim, and Sikh residents, people from other religious backgrounds, and people who have not stated a religious preference; and the health conditions of the people, including those in fair health, those in bad health, and similar data. I have performed exploratory data analysis in which various visualizations were examined and conclusions were drawn from the graphical representation of the dataset. I have performed data pre-processing, removing the errors from the data and replacing the null values, after which the analysis was carried out through various Python modules. The whole analysis was conducted in the tool called Google Colab. Faults in the analysis were dealt with carefully, and the Python code was modified so that the required results were obtained.
6. References
[1] Ahn, J., Campos, F., Hays, M. and DiGiacomo, D., 2019. Designing in Context: Reaching beyond Usability in Learning Analytics Dashboard Design. Journal of Learning Analytics, 6(2), pp.70-85.
[2] Akpan, I.J., Soopramanien, D. and Kwak, D.H., 2021. Cutting-edge technologies for small business and innovation in the era of COVID-19 global health pandemic. Journal of Small Business & Entrepreneurship, 33(6), pp.607-617.
[3] Akter, S., Michael, K., Uddin, M.R., McCarthy, G. and Rahman, M., 2020. Transforming business using digital innovations: The application of AI, blockchain, cloud and data analytics. Annals of Operations Research, pp.1-33.
[4] Alam, F., Ofli, F. and Imran, M., 2020. Descriptive and visual summaries of disaster events using artificial intelligence techniques: case studies of Hurricanes Harvey, Irma, and Maria. Behaviour & Information Technology, 39(3), pp.288-318.
[5] Arnold, T. and Tilton, L., 2019. Distant viewing: analysing large visual corpora. Digital Scholarship in the Humanities, 34(Supplement_1), pp.i3-i16.
[6] Battle, L. and Heer, J., 2019, June. Characterizing exploratory visual analysis: A literature review and evaluation of analytic provenance in tableau. In Computer graphics forum (Vol. 38, No. 3, pp. 145-159).
[7] Bengfort, B. and Bilbro, R., 2019. Yellowbrick: Visualizing the scikit-learn model selection process. Journal of Open Source Software, 4(35), p.1075.
[8] Cantabella, M., Martínez-España, R., Ayuso, B., Yáñez, J.A. and Muñoz, A., 2019. Analysis of student behavior in learning management systems through a Big Data framework. Future Generation Computer Systems, 90, pp.262-272.
[9] Filvà, D.A., Forment, M.A., García-Peñalvo, F.J., Escudero, D.F. and Casañ, M.J., 2019. Clickstream for learning analytics to assess students’ behavior with Scratch. Future Generation Computer Systems, 93, pp.673-686.
[10] Galetsi, P., Katsaliaki, K. and Kumar, S., 2020. Big data analytics in health sector: Theoretical framework, techniques and prospects. International Journal of Information Management, 50, pp.206-216.
[11] Krak, I., Barmak, O. and Manziuk, E., 2022. Using visual analytics to develop human and machine-centric models: A review of approaches and proposed information technology. Computational Intelligence, 38(3), pp.921-946.
[12] Kumar, A., Srinivasan, K., Cheng, W.H. and Zomaya, A.Y., 2020. Hybrid context enriched deep learning model for fine-grained sentiment analysis in textual and visual semiotic modality social data. Information Processing & Management, 57(1), p.102141.
[13] Leccese, F., Salvadori, G., Rocca, M., Buratti, C. and Belloni, E., 2020. A method to assess lighting quality in educational rooms using analytic hierarchy process. Building and Environment, 168, p.106501.
[14] Preim, B., Alemzadeh, S., Ittermann, T., Klemm, P., Niemann, U. and Spiliopoulou, M., 2019. Visual Analytics for Epidemiological Cohort Studies. Eurographics Medical Prize.
[15] Serafini, F. and Reid, S.F., 2019. Multimodal content analysis: expanding analytical approaches to content analysis. Visual Communication, p.1470357219864133.
[16] Setlur, V., Tory, M. and Djalali, A., 2019, March. Inferencing underspecified natural language utterances in visual analysis. In Proceedings of the 24th International Conference on Intelligent User Interfaces (pp. 40-51).
[17] Silva, N., Blascheck, T., Jianu, R., Rodrigues, N., Weiskopf, D., Raubal, M. and Schreck, T., 2019, June. Eye tracking support for visual analytics systems: foundations, current applications, and research challenges. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications (pp. 1-10).
[18] Yaling, Y. and Yi, H., 2019. A sensitive and selective method for visual chronometric detection of copper (II) ions using clock reaction. Analytical Sciences, 35(2), pp.159-163.
[19] Ye, Y., Zeng, W., Shen, Q., Zhang, X. and Lu, Y., 2019. The visual quality of streets: A human-centred continuous measurement based on machine learning algorithms and street view images. Environment and Planning B: Urban Analytics and City Science, 46(8), pp.1439-1457.
[20] Zhang, L., 2022. Visual analytics and visual audit (Doctoral dissertation, Rutgers University-Graduate School-Newark).