
I'm learning how to analyze data using Python. I'm working on a project that analyzes data from the latest Madrid elections in Spain. After scraping the data I need from a website with a web crawler, I have the following data structure:

    [{'municipio': 'Ajalvir',
  'link': 'https://resultados.elpais.com/elecciones/2019/autonomicas/12/28/02.html',
  'escrutinio': [{'escrutado': 100.0},
   {'votos_totales': 2178.0, 'votos_totales_porcentaje': 6634.0},
   {'abstencion': 1105.0, 'abstencion_porcentaje': 3366.0},
   {'votos_nulos': 12.0, 'votos_nulos_porcentaje': 55.0},
   {'votos_blancos': 27.0, 'votos_blancos_porcentaje': 125.0}],
  'partidos': [{'pp': 15.0, 'pp_porcentaje': 4054.0},
   {'podemos_iu': 7.0, 'podemos_iu_porcentaje': 1892.0},
   {'psoe': 6.0, 'psoe_porcentaje': 1622.0},
   {'cs': 3.0, 'cs_porcentaje': 811.0},
   {'mas_madrid': 3.0, 'mas_madrid_porcentaje': 811.0},
   {'vox': 2.0, 'vox_porcentaje': 541.0},
   {'pacma': 1.0, 'pacma_porcentaje': 27.0}]},
 {'municipio': 'Alameda del Valle',
  'link': 'https://resultados.elpais.com/elecciones/2019/autonomicas/12/28/03.html',
  'escrutinio': [{'escrutado': 100.0},
   {'votos_totales': 140.0, 'votos_totales_porcentaje': 8284.0},
   {'abstencion': 29.0, 'abstencion_porcentaje': 1716.0},
   {'votos_nulos': 0.0, 'votos_nulos_porcentaje': 0.0},
   {'votos_blancos': 0.0, 'votos_blancos_porcentaje': 0.0}],
  'partidos': [{'pp': 15.0, 'pp_porcentaje': 4054.0},
   {'podemos_iu': 7.0, 'podemos_iu_porcentaje': 1892.0},
   {'psoe': 6.0, 'psoe_porcentaje': 1622.0},
   {'cs': 3.0, 'cs_porcentaje': 811.0},
   {'mas_madrid': 3.0, 'mas_madrid_porcentaje': 811.0},
   {'vox': 2.0, 'vox_porcentaje': 541.0},
   {'pacma': 1.0, 'pacma_porcentaje': 27.0}]},
   ...... ]

I would like to take the info from 'partidos' and build a table that also includes 'municipio' and 'link'. I tried the following to create my DataFrame:

import pandas as pd
df = pd.json_normalize(results_pruebas_formatted, record_path='partidos', meta=['municipio', 'link'])

The result is as follows:

    pp  pp_porcentaje   podemos_iu  podemos_iu_porcentaje   psoe    psoe_porcentaje cs  cs_porcentaje   mas_madrid  mas_madrid_porcentaje   vox vox_porcentaje  pacma   pacma_porcentaje    municipio   link
0   15.0    4054.0  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Ajalvir https://resultados.elpais.com/elecciones/2019/...
1   NaN NaN 7.0 1892.0  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Ajalvir https://resultados.elpais.com/elecciones/2019/...
2   NaN NaN NaN NaN 6.0 1622.0  NaN NaN NaN NaN NaN NaN NaN NaN Ajalvir https://resultados.elpais.com/elecciones/2019/...
3   NaN NaN NaN NaN NaN NaN 3.0 811.0   NaN NaN NaN NaN NaN NaN Ajalvir https://resultados.elpais.com/elecciones/2019/...
4   NaN NaN NaN NaN NaN NaN NaN NaN 3.0 811.0   NaN NaN NaN NaN Ajalvir https://resultados.elpais.com/elecciones/2019/...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ..

I would like to group by the 'municipio' column so that each municipality occupies a single row, instead of several rows that are all NaN except for one party (in the example above, 'Ajalvir' would be the municipality that joins the info).

I tried different options after searching on Stack Overflow, but didn't succeed. For example:

df0 = df.groupby('municipio', axis=0, as_index=True).sum()

The structure returned is what I'm looking for (my info grouped by the 'municipio' column), but I don't know why the data is identical in every row.

pp  pp_porcentaje   podemos_iu  podemos_iu_porcentaje   psoe    psoe_porcentaje cs  cs_porcentaje   mas_madrid  mas_madrid_porcentaje   vox vox_porcentaje  pacma   pacma_porcentaje
municipio                                                       
Ajalvir 15.0    4054.0  7.0 1892.0  6.0 1622.0  3.0 811.0   3.0 811.0   2.0 541.0   1.0 27.0
Alameda del Valle   15.0    4054.0  7.0 1892.0  6.0 1622.0  3.0 811.0   3.0 811.0   2.0 541.0   1.0 27.0
Alcalá de Henares   15.0    4054.0  7.0 1892.0  6.0 1622.0  3.0 811.0   3.0 811.0   2.0 541.0   1.0 27.0
Alcobendas  15.0    4054.0  7.0 1892.0  6.0 1622.0  3.0 811.0   3.0 811.0   2.0 541.0   1.0 27.0

Another option I tried is:

df1 = df.astype(str).groupby('municipio').agg(','.join).reset_index()

It returns the info like this:

municipio   pp  pp_porcentaje   podemos_iu  podemos_iu_porcentaje   psoe    psoe_porcentaje cs  cs_porcentaje   mas_madrid  mas_madrid_porcentaje   vox vox_porcentaje  pacma   pacma_porcentaje    link
0   Ajalvir 15.0,nan,nan,nan,nan,nan,nan    4054.0,nan,nan,nan,nan,nan,nan  nan,7.0,nan,nan,nan,nan,nan nan,1892.0,nan,nan,nan,nan,nan  nan,nan,6.0,nan,nan,nan,nan nan,nan,1622.0,nan,nan,nan,nan  nan,nan,nan,3.0,nan,nan,nan nan,nan,nan,811.0,nan,nan,nan   nan,nan,nan,nan,3.0,nan,nan nan,nan,nan,nan,811.0,nan,nan   nan,nan,nan,nan,nan,2.0,nan nan,nan,nan,nan,nan,541.0,nan   nan,nan,nan,nan,nan,nan,1.0 nan,nan,nan,nan,nan,nan,27.0    https://resultados.elpais.com/elecciones/2019/...
1   Alameda del Valle   15.0,nan,nan,nan,nan,nan,nan    4054.0,nan,nan,nan,nan,nan,nan  nan,7.0,nan,nan,nan,nan,nan nan,1892.0,nan,nan,nan,nan,nan  nan,nan,6.0,nan,nan,nan,nan nan,nan,1622.0,nan,nan,nan,nan  nan,nan,nan,3.0,nan,nan,nan nan,nan,nan,811.0,nan,nan,nan   nan,nan,nan,nan,3.0,nan,nan nan,nan,nan,nan,811.0,nan,nan   nan,nan,nan,nan,nan,2.0,nan nan,nan,nan,nan,nan,541.0,nan   nan,nan,nan,nan,nan,nan,1.0 nan,nan,nan,nan,nan,nan,27.0    https://resultados.elpais.com/elecciones/2019/...
2   Alcalá de Henares   15.0,nan,nan,nan,nan,nan,nan    4054.0,nan,nan,nan,nan,nan,nan  nan,7.0,nan,nan,nan,nan,nan nan,1892.0,nan,nan,nan,nan,nan  nan,nan,6.0,nan,nan,nan,nan nan,nan,1622.0,nan,nan,nan,nan  nan,nan,nan,3.0,nan,nan,nan nan,nan,nan,811.0,nan,nan,nan   nan,nan,nan,nan,3.0,nan,nan nan,nan,nan,nan,811.0,nan,nan   nan,nan,nan,nan,nan,2.0,nan nan,nan,nan,nan,nan,541.0,nan   nan,nan,nan,nan,nan,nan,1.0 nan,nan,nan,nan,nan,nan,27.0    https://resultados.elpais.com/elecciones/2019/...

What I'm asking is how to group my data into a DataFrame while preserving the info of each row. What am I doing wrong?

Thank you in advance.

1 Answer

You can group by 'municipio', back-fill the missing values within each group with bfill, and then keep only the first row of each group:

>>> df.groupby('municipio') \
      .apply(lambda x: x.bfill().head(1)) \
      .reset_index(drop=True)

     pp  pp_porcentaje  podemos_iu  ...  pacma_porcentaje          municipio                                               link
0  15.0         4054.0         7.0  ...              27.0            Ajalvir  https://resultados.elpais.com/elecciones/2019/...
1  15.0         4054.0         7.0  ...              27.0  Alameda del Valle  https://resultados.elpais.com/elecciones/2019/...
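As a possible alternative to the bfill-and-head approach, GroupBy.first() should give the same one-row-per-municipality result, because it returns the first non-null value of each column within a group. The snippet below is only a sketch: the `results` list is a trimmed copy of the two records shown in the question (two parties each, for brevity), and the name `collapsed` is made up for illustration.

```python
import pandas as pd

# Trimmed from the question's data: two parties per municipality, for brevity.
results = [
    {
        "municipio": "Ajalvir",
        "link": "https://resultados.elpais.com/elecciones/2019/autonomicas/12/28/02.html",
        "partidos": [
            {"pp": 15.0, "pp_porcentaje": 4054.0},
            {"psoe": 6.0, "psoe_porcentaje": 1622.0},
        ],
    },
    {
        "municipio": "Alameda del Valle",
        "link": "https://resultados.elpais.com/elecciones/2019/autonomicas/12/28/03.html",
        "partidos": [
            {"pp": 15.0, "pp_porcentaje": 4054.0},
            {"psoe": 6.0, "psoe_porcentaje": 1622.0},
        ],
    },
]

# One row per party dict, as in the question: NaN outside that party's columns.
df = pd.json_normalize(results, record_path="partidos", meta=["municipio", "link"])

# first() takes the first non-null value of every column within each group,
# so the sparse rows of a municipality collapse into a single row.
collapsed = df.groupby("municipio", as_index=False).first()
print(collapsed)
```

With as_index=False, 'municipio' stays a regular column, so no reset_index call is needed afterwards.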
  • Hi! Thank you for your answer; it works fine, but I would prefer not to keep the first row. Commented Aug 24, 2021 at 6:26
  • Thank you! Now I see that all my 'partidos' data is the same. I must have made a mistake preparing the data. I'm going to fix it and see whether your solution helps. Thank you again! Commented Aug 24, 2021 at 6:34
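The repeated 'partidos' values mentioned in the last comment can be checked on the raw list before building any DataFrame. This is a minimal sketch, assuming the records have the shape shown in the question; `has_repeated_partidos` is a hypothetical helper, and the sample data is trimmed to two parties per municipality.

```python
# Trimmed copy of the two records shown in the question.
results = [
    {
        "municipio": "Ajalvir",
        "partidos": [
            {"pp": 15.0, "pp_porcentaje": 4054.0},
            {"psoe": 6.0, "psoe_porcentaje": 1622.0},
        ],
    },
    {
        "municipio": "Alameda del Valle",
        "partidos": [
            {"pp": 15.0, "pp_porcentaje": 4054.0},
            {"psoe": 6.0, "psoe_porcentaje": 1622.0},
        ],
    },
]


def has_repeated_partidos(records):
    """Return True if every record carries an identical 'partidos' list."""
    if len(records) < 2:
        return False  # nothing to compare against
    first = records[0]["partidos"]
    return all(rec["partidos"] == first for rec in records[1:])


# True here, which would point at the scraping step rather than at groupby.
print(has_repeated_partidos(results))
```

If the check returns True for the full dataset, the grouping step is behaving correctly and the crawler output is the thing to fix.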
