Report of the project
Informatics
Valentina Baccelliere Cornejo
June 2023

Analysis of a catalog of galaxies

1 Description of the project

The project consists of the analysis of a catalog of galaxies.

1.1 Description of the dataset

The identifier of the galaxy: specobjid.
The redshift: z.
The apparent magnitudes of the galaxies in different bands and their errors: "petroMag_u", "petroMagErr_u", etc. They measure the light emitted by the galaxy as observed from the Earth.
The flux of a given set of emission lines (Hα, Hβ, [OIII], [NII]): "h_alpha_flux", "h_alpha_flux_err", etc. They measure the intensity of the flux of a particular element visible in emission in a galaxy spectrum.
The stellar mass of the galaxy: "lgm_tot_p50", "lgm_tot_p16", "lgm_tot_p84". The logarithm of the stellar mass (in units of solar masses) is provided with its 16th and 84th percentiles (to measure errors). It measures how massive a galaxy is (i.e. how many stars it contains).
The star formation rate: "sfr_tot_p50", "sfr_tot_p16", "sfr_tot_p84". The logarithm of the star formation rate (in units of solar masses per year) is provided with its 16th and 84th percentiles (to measure errors). It measures the number of stars produced by a galaxy per unit time, and so quantifies how active a galaxy is.
The absolute magnitudes: "absMagU", etc. They measure the light emitted by the galaxy as it would be observed at a fixed distance (absolute values).

1.2 Main steps of the project

Figure 1: Project description, from the slides of the course. Author: Michele Moresco.

2 Execution of the project

This project was developed in Python using its scientific libraries, such as numpy, astropy, matplotlib and scipy, and their modules.

2.1 STEP 1

This step asks to read the provided catalog and to extract a subsample from it. The subsample is identified by an ID that is personal and unique for each student; in my case, ID = 37.
The catalog is provided in FITS format, and reading it requires the fits module from the astropy.io package. The code first creates the directories that will hold the data and the plots from the following steps: it checks whether each directory exists and creates it if it does not. It then opens the FITS file, accesses its content and prints its columns, which are used throughout the project. The columns are named, as are the columns used in the following steps. A mask then selects the galaxies that belong to my subsample and counts them, and the number of galaxies in my subsample is printed to screen.

2.2 STEP 2

This step asks to plot histograms of some columns and to overplot on each histogram several statistical quantities (mean, median and their errors) and a Gaussian curve, for both the subsample and the parent sample. The step relies on functions written ad hoc for it. The code first separates the columns into groups that will be plotted together, then plots each group with the STEP2_make_hist function. This function plots the data after cleaning it with the clean_array function, which uses a mask to remove NaN and inf values and sigma clipping to remove outliers. Each histogram is then saved in a folder called dir_plots_step_2_s.

After the histograms have been saved, the code initializes four empty lists to be filled with the statistical values: mean, standard deviation, median and median error. As requested, the values for one column are calculated "by hand" with a function called statistics_by_hand. A for cycle then skips the first column, whose values were calculated before, and computes the statistical values of the remaining columns using numpy.
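The cleaning and the "by hand" statistics described above can be sketched as follows. This is a minimal illustration, not the project code: the function bodies, the 3-sigma threshold, the three clipping rounds and the 1.253 factor for the median error (valid for roughly Gaussian data) are all assumptions, since the report does not state them.

```python
import math

import numpy as np


def clean_array(values, n_sigma=3.0, n_rounds=3):
    """Drop NaN/inf entries, then clip outliers in a few rounds.

    n_sigma and n_rounds are assumed values, not taken from the report.
    """
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr)]  # keep only finite values
    for _ in range(n_rounds):
        if arr.size == 0:
            break
        mean, std = arr.mean(), arr.std()
        arr = arr[np.abs(arr - mean) <= n_sigma * std]
    return arr


def statistics_by_hand(data):
    """Mean, error on the mean, median and error on the median, without numpy."""
    values = sorted(float(v) for v in data)
    n = len(values)
    mean = sum(values) / n
    # sample variance with the n - 1 correction
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    mean_err = math.sqrt(variance) / math.sqrt(n)
    mid = n // 2
    median = values[mid] if n % 2 else 0.5 * (values[mid - 1] + values[mid])
    # assumed estimator of the median error for near-Gaussian data
    median_err = 1.253 * mean_err
    return mean, mean_err, median, median_err


# toy data: twenty values near 10, one outlier, one NaN and one inf
raw = [9.9, 10.1] * 10 + [100.0, float("nan"), float("inf")]
cleaned = clean_array(raw)
mean, mean_err, median, median_err = statistics_by_hand(cleaned)
```

In this toy case the NaN and inf entries are removed by the finite mask and the value 100.0 by the first clipping round, so the statistics are computed on the twenty remaining values.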
The lists are filled using the making_lists function and then saved in the dir_data_step_2_s folder as a FITS file. The same procedure is then repeated for the parent sample. The plots and the saved data for the different quantities are shown below:

Figure 2: Redshift. (a) Redshift Subsample, (b) Redshift Parent sample.
Figure 3: Star formation rate. (a) SFR Subsample, (b) SFR Parent sample.
Figure 4: Fluxes. (a) Fluxes Subsample, (b) Fluxes Parent sample.
Figure 5: Apparent Magnitudes. (a) Apparent Magnitude Subsample, (b) Apparent Magnitude Parent sample.
Figure 6: Apparent Magnitudes. (a) Apparent Magnitude Subsample, (b) Apparent Magnitude Parent sample.
Figure 7: Masses. (a) Mass Subsample, (b) Mass Parent sample.
Figure 8: Absolute Magnitudes. (a) Absolute Magnitude Subsample, (b) Absolute Magnitude Parent sample.
Figure 9: Absolute Magnitudes. (a) Absolute Magnitude Subsample, (b) Absolute Magnitude Parent sample.
Figure 10: Data of the subsample saved in FITS file.

2.3 STEP 3

This step asks to plot five quantities as a function of redshift, to check whether any redshift trend is found, and to save the plots to file. If a trend is found, the redshift is divided into bins, Step 2 is repeated for each bin, and the figure and the outputs (mean and median values) are saved to file. The quantities to be plotted, as scatter plots against redshift, are: mass, apparent magnitude, absolute magnitude, Hα and SFR.

The first task is to create and clean the arrays. Each plot uses three arrays, so the cleaning is applied three times to leave them with the same length; this is done in a for cycle over each trio of values. The make_scatters_generator function was used to draw each plot and its polynomial fit.
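The idea of cleaning a trio of arrays so that they keep the same length, followed by a polynomial fit, can be sketched as follows. The helper name common_finite_mask, the toy values and the first-degree fit are assumptions made for illustration; the report does not state the degree of the polynomial, and the plotting itself is left out.

```python
import numpy as np


def common_finite_mask(*arrays):
    """Return a mask that is True where every input array is finite."""
    mask = np.ones(len(arrays[0]), dtype=bool)
    for arr in arrays:
        mask &= np.isfinite(np.asarray(arr, dtype=float))
    return mask


# toy data standing in for redshift, stellar mass and the color-bar quantity
z = np.array([0.02, 0.05, np.nan, 0.10, 0.15, 0.20])
mass = np.array([9.8, 10.1, 10.3, np.inf, 10.9, 11.2])
sfr = np.array([-0.5, 0.1, 0.3, 0.4, 0.6, np.nan])

# one shared mask keeps the three arrays aligned element by element
mask = common_finite_mask(z, mass, sfr)
z_c, mass_c, sfr_c = z[mask], mass[mask], sfr[mask]

# polynomial fit of mass against redshift (degree 1 is an assumption)
coeffs = np.polyfit(z_c, mass_c, deg=1)
trend = np.polyval(coeffs, z_c)
```

Applying one combined mask to all three arrays guarantees that a row removed from one array is removed from the others too, which is what a scatter plot with a color bar requires.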
A trend was detected in the Redshift-Mass plot. The redshift values were divided using three cuts, giving four intervals, and a mask selected the masses in each interval, whose histograms were drawn with the make_hist function. The figure is saved to file in dir_plots_step_3 together with all the other plots. Four lists are then initialized and filled with the statistical values by calling the making_lists function four times; the values were saved in dir_out_step3 as a FITS file. The plots and statistical values are shown below:

(a) Redshift-Mass, (b) Redshift-AbsMag u
(a) Redshift-AppMag z, (b) Redshift-Hα
(a) Redshift-SFR
Figure 14: Histograms of the relation found for Redshift-Mass
Figure 15: Data step 3

2.4 STEP 4

This step asks to create and analyze three different plots: the BPT, color-mass and SFR-mass diagrams.

SFR-Mass plot: the code begins by defining the variables needed for the first plot. A for cycle then builds a list called lista_4 containing the three variables, and a mask is created that removes NaN, inf and outlier values. The cycle applies the same mask to each variable, so that at the end the arrays have the same length. The plotting then starts: the code opens the directory dir_plots_step_4, creates the figure environment and uses the make_scatters_generator function to draw the scatter plot. A function representing a theoretical line is defined and the line is plotted. A mask divides the y values into two groups, above and below the theoretical line, and the data in these two groups are plotted as a histogram with make_hist.

Color-Mass diagram: this diagram is analyzed and plotted in the same way as the SFR-Mass diagram.

BPT plot: this plot needs some extra care. The code starts by defining the five values that will be needed, then a for cycle applies the masks that clean the data and create same-length arrays.
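The split of the points above and below a theoretical line, as used for the SFR-Mass and Color-Mass diagrams, can be sketched as follows. The slope and intercept of theoretical_line and the toy values are hypothetical placeholders, since the report does not give the coefficients of the line.

```python
import numpy as np


def theoretical_line(x, slope=1.0, intercept=-10.0):
    """Straight line y = slope * x + intercept (hypothetical coefficients)."""
    return slope * x + intercept


# toy data standing in for log stellar mass and log star formation rate
log_mass = np.array([9.0, 9.5, 10.0, 10.5, 11.0])
log_sfr = np.array([-0.5, -1.5, 0.4, -0.2, 1.6])

# True where a point lies above the line, False where it lies on or below it
above = log_sfr > theoretical_line(log_mass)

# the same mask splits both coordinates, so the groups stay aligned
sfr_above, sfr_below = log_sfr[above], log_sfr[~above]
mass_above, mass_below = log_mass[above], log_mass[~above]
```

Each pair of groups can then be passed to a histogram routine to compare the two populations.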
The difference between this analysis and the others is that the theoretical relation here is log([OIII]/Hβ) > 0.61/(log([NII]/Hα) - 0.05) + 1.3, so some care is needed before plotting: division by zero must be avoided, and the arguments of the logarithms must be > 0. Masks applied to the data handle both conditions. The plotting then produces a scatter plot of the data with the curve of the theoretical relation and a vertical line at x = 0.05 marking the asymptote of the function. Histograms are then plotted for the x and y values above and below the theoretical relation. Finally, the data of all the plots are saved to a FITS file as in the previous steps, using the making_lists function. The plots and the data images are shown below:

Figure 16: Mass-SFR and SFR(y) above and below theoretical line. (a) Mass-SFR, (b) SFR-Theory.
Figure 17: Mass(x) above and below theoretical line (Mass-SFR), Mass-Color. (a) Mass-Theory, (b) Mass-Color.
Figure 18: Color(y) above and below theoretical line (u-r), Mass(x) above and below theoretical line (u-r). (a) Color-Theory, (b) Mass-Theory.
Figure 19: BPT, Flux(y) < y_th. (a) BPT, (b) Flux1-Theory.
Figure 20: Flux(y) > y_th, Flux(x) < y_th. (a) Flux1-Theory, (b) Flux2-Theory.
Figure 21: Flux(x) > y_th. (a) Flux2-Theory.
Figure 22: Data step 4

3 Appendix: Functions description

gauss: this function returns a Gaussian curve, given the number of bins, the mean and the standard deviation as input. The output needs a variable holding the central position of each bin, so the function first creates an array of zeros with np.zeros; a for cycle then fills it with the midpoint between the i-th and the (i+1)-th points.
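The bin-center computation described for gauss can be sketched as follows. The signature (bin edges rather than a number of bins) and the unit-area normalization of the curve are assumptions, since the report does not show the function body.

```python
import numpy as np


def gauss(bin_edges, mean, std):
    """Return bin centers x and a Gaussian evaluated at them.

    The signature and the unit-area normalization are assumed here.
    """
    n_bins = len(bin_edges) - 1
    x = np.zeros(n_bins)
    for i in range(n_bins):
        # midpoint between the i-th and the (i+1)-th edge
        x[i] = 0.5 * (bin_edges[i] + bin_edges[i + 1])
    y = np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))
    return x, y


edges = np.linspace(-3.0, 3.0, 7)  # six bins of width 1
x, y = gauss(edges, mean=0.0, std=1.0)
```

To overplot such a curve on a histogram of counts, y would still have to be scaled by the number of entries times the bin width.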
At this point the variable x contains the central position of each bin. The y variable is then calculated, and the function returns the x and y values.

clean_array: this function takes an array as input and cleans it of outliers and of inf and NaN values. It begins by creating a mask that selects only the finite values of the array, and applies this mask to the input array. A series of three sigma-clipping passes is then applied, and the function returns the cleaned array.

STEP2_2_make_hist: this function takes three inputs: n_cols, a number; lista, a list of names used as titles; and data, the input data. It then creates a histogram. A for cycle is initiated and a current_data variable is created, containing the subsample or parent sample values to put into the histogram. The array is cleaned with the clean_array function. The figure layout is built with gridspec.GridSpec, with two rows and a number of columns equal to the n_cols input, and each histogram is given a title from the strings in lista. The mean and the standard deviation are calculated and saved as mean and std. Two vertical lines representing the mean and the standard deviation are plotted, together with a shaded span representing their errors. The gauss function is called, its two outputs are saved as xm and mod, and the curve is plotted. The residuals for each bin are plotted as a scatter plot.

median: this function takes a list L as input and computes its median.

statistics_by_hand: this function takes the variable data as input and calculates the mean, the standard deviation and their errors without using numpy, as requested in Step 2.
It begins by computing the mean and initializing an accumulator, stedev, to zero. A for cycle then accumulates the terms of the standard deviation, after which the median and its error are calculated. The function returns the mean, the error of the mean, the median and the error of the median.

statistics_numpy: this function calculates the mean, its error, the median and its error using numpy, returning mean, standard deviation, median and its error.

making_lists: this function takes five inputs: an array and four lists. It first computes the mean, the standard deviation, the median and its error with the statistics_numpy function, which returns those quantities. It then appends each calculated quantity to the corresponding list. The function is usually called inside a for cycle that fills each list with the requested quantities.

sigma_clip_mask_generator: this function takes an array as input. It starts by creating a copy of the array, so that the original array is not modified. I then create a mask that locates the positions of values that are too high (or too low) by performing a sigma clipping with np.logical_and, marking as True the values that lie between the mean minus and the mean plus 4 standard deviations. The function then creates an empty list and appends to it the indexes of the False values. In a for cycle, instead of deleting the False values just found, the function replaces them with the mean value. It then builds a mask named mask1 by performing sigma clipping again on the array that was just modified. A second for cycle uses an if condition to find the elements of mask1 that are False and sets the corresponding elements of mask to False. The same operations are then repeated for the third and fourth rounds of sigma clipping.
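The iterative mask described for the sigma-clipping function can be sketched as follows. This is a vectorized variant written for illustration, not the project code: the function name sigma_clip_mask and the choice of replacing outliers with the mean of the current round are assumptions, and the input is assumed to contain only finite values.

```python
import numpy as np


def sigma_clip_mask(values, n_sigma=4.0, n_rounds=4):
    """Return a boolean mask that is False at the clipped positions.

    Outliers are replaced by the current mean in a working copy instead of
    being deleted, so the mask keeps the length of the input array.
    """
    work = np.array(values, dtype=float)  # copy: the input is left untouched
    mask = np.ones(work.size, dtype=bool)
    for _ in range(n_rounds):
        mean, std = work.mean(), work.std()
        inside = np.logical_and(work >= mean - n_sigma * std,
                                work <= mean + n_sigma * std)
        mask &= inside          # once False, a position stays False
        work[~inside] = mean    # replace outliers rather than delete them
    return mask


# toy data: thirty values near 5 and one strong outlier at the end
values = np.array([4.9, 5.1] * 15 + [1000.0])
mask = sigma_clip_mask(values)
clipped = values[mask]
```

Because the mask has the same length as the input, it can be applied to other arrays of the same catalog rows to keep them aligned.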
The function returns mask at the end.

make_scatters_generator: this function takes six variables as input: the first is a condition, the second and third are arrays, the fourth is the quantity shown as a color bar, and the fifth and sixth are the x and y labels. The function starts by creating a scatter plot of the three variables, with the third shown as a color bar. If the condition is True, the function adds a fitting polynomial to the plot; the axes are then labeled.

make_hist: this function takes three variables as input. It starts by creating the plot environment and a for cycle, followed by the commands that build the plot. Using the input DATA, the histograms are created in a way very similar to STEP2_make_hist.